Determine which sentences contain the set of phrases

I am trying to solve this HackerRank challenge, which asks me to write a function that prints the indexes of all sentences that contain all of the query words. For example, if the inputs to the function are:



textQueries(sentences, queries)
textQueries(['jim likes mary','kate likes tom', 'tom does not like jim'],['jim tom', 'likes'])


Then the function will print



2
0 1


If no sentence contains the query words, -1 will be printed.



The constraints for the inputs are:




  • The length of the sentences and queries lists will not exceed 10000.

  • The number of words in any sentence or query is from 1 to 10 words.

  • Each word has at most 11 characters.

  • No word appears in more than 10 sentences.

  • Each word consists of uppercase and lowercase English letters only.


My attempt so far is to create a hashmap that maps each query word to a set containing the indexes of the sentences that contain that word, then intersect those sets with each other to get the final set. My code runs fine, but it is slow for large test cases. Is there any way to optimize it?



def textQueries(sentences, queries):
    word_map = {}
    for query in queries:
        query_words = query.split()
        final_set = set()
        set_modified = False
        for query_word in query_words:
            if query_word in word_map:
                if len(final_set) == 0 and not set_modified:
                    final_set = word_map[query_word]
                    set_modified = True
                    continue
                elif len(final_set) == 0 and set_modified:
                    break
                else:
                    final_set = final_set.intersection(word_map[query_word])
                    continue
            else:
                sentences_contain_word = []
                for index, sentence in enumerate(sentences):
                    words = sentence.split()
                    if query_word in words:
                        sentences_contain_word.append(index)
                        if len(sentences_contain_word) >= 10:
                            break
                word_map[query_word] = set(sentences_contain_word)
                if len(final_set) == 0 and not set_modified:
                    final_set = word_map[query_word]
                    set_modified = True
                elif len(final_set) == 0 and set_modified:
                    break
                else:
                    final_set = final_set.intersection(word_map[query_word])
        if len(final_set) == 0:
            print(-1)
        else:
            for index in sorted(final_set):
                print(str(index), end=' ')
            print()

python performance programming-challenge

• Welcome to Code Review! Your example prints 2 followed by 0 1 if I execute the code. Could you verify that your code and the example output are correct? – Alex, 3 hours ago

• Yeah, it should be 2 and 0 1, sorry for the confusion. – Tien Dinh, 3 hours ago

2 Answers

Your code is fairly readable, but I find it quite dense. Here are some general things you could do to improve it:




  1. Use vertical white space to separate different areas of the code.


  2. Provide a docblock at the top outlining the algorithm you are using.


  3. Use more than one function to break your code into smaller chunks.



With that advice out of the way, let's look at your performance with an eye to the worst possible case:



def textQueries(sentences, queries):
    for query in queries:                        # Worst case: 10,000 times
        ...
        for query_word in query_words:           # Worst case: 10 times
            if query_word in word_map:
                final_set.intersection           # Worst case: O(10,000) in C
            else:
                for ... in enumerate(sentences): # Worst case: 10,000 times
                    if ... else:
                        final_set.intersection   # Worst case: O(10,000) in C


According to me, your code runs worst case in $O(Q \times W \times S)$ where $Q$ is the number of queries, $W$ is the number of words in the query, and $S$ is the number of sentences. (The number of words in the sentences is hidden behind a split, but it's in C so we'll let it go by.)



There's no question you need to split the sentences and split the queries into separate words, somehow. But you are nesting your loops when the "best" way to loop is to do them one after the other. If you can convert from this:



for q in Q:
    for s in S:
        pass


to something like this:



for s in S:
    pass

for q in Q:
    pass


You will have converted from $O(S \times Q)$ to $O(S + Q)$, which is considerably faster.



Your current approach is to define a set of sentences that contain a particular word. You cache these sets, in case the same words appear again. You intersect the sets for each word in the query, producing a final set of all the sentences that contain all the query words.



This is pretty much guaranteed to fail based on the instructions you quoted:




No word appears in more than 10 sentences.




In other words, the probability of a "cache hit" is 10/10,000, or 0.1%. Any time you spend maintaining the cache is wasted time.



Suppose you just go through the sentences one time, and build the big dictionary of sets:



import collections

sentence_words = collections.defaultdict(set)
for i, s in enumerate(sentences):
    for w in s.split():
        sentence_words[w].add(i)  # key by word (not by sentence) so the lookups below work


Then you can go through the query words and do what you were doing before, without any caching:



import functools, operator

for q in queries:
    matches = functools.reduce(operator.and_, (sentence_words[w] for w in q.split()))
    matches = [-1] if not matches else sorted(matches)


This will seem very similar to what you were doing, but the difference is that these two loops are run one after the other, providing something like $O(100,000 + 100,000)$ performance instead of $O(100,000 \times 10,000)$. It's that last operator that makes all the difference.
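
To tie the two snippets together, here is a minimal, self-contained sketch of the indexed approach described above. The function name text_queries_indexed and the demonstration call at the end are mine, not part of the original answer; the output convention (space-separated indices, or -1 when nothing matches) is taken from the question.

import collections
import functools
import operator

def text_queries_indexed(sentences, queries):
    # Pass 1: index every sentence once, mapping each word to the set of
    # indices of the sentences that contain it.
    word_to_sentences = collections.defaultdict(set)
    for i, sentence in enumerate(sentences):
        for word in sentence.split():
            word_to_sentences[word].add(i)

    # Pass 2: for each query, intersect the index sets of its words.
    for query in queries:
        matches = functools.reduce(
            operator.and_,
            (word_to_sentences.get(word, set()) for word in query.split()))
        print(' '.join(map(str, sorted(matches))) if matches else -1)

text_queries_indexed(
    ['jim likes mary', 'kate likes tom', 'tom does not like jim'],
    ['jim tom', 'likes'])
# Prints:
# 2
# 0 1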






– Austin Hastings

    I'm not sure how much there is to be gained performance-wise, since the problem has an inherent theoretical complexity limit (quite nicely elaborated in this answer) that not even the most well-crafted code can overcome. Additionally, Python usually is not terribly fast when you have a lot of (nested) loops.



    Python IDEs like Spyder come with a built-in profiler tool (see the docs), which allows you to identify performance bottlenecks. Having a quick run with your small example code showed me that on my machine most of the time is actually spent in printing the values.

    Feel free to give it a go with more realistic inputs on your machine.
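
    If you are not using an IDE, the standard library's cProfile module gives a similar breakdown. A quick sketch, assuming textQueries is defined in the same module and simply reusing the question's example inputs:

    import cProfile

    sentences = ['jim likes mary', 'kate likes tom', 'tom does not like jim']
    queries = ['jim tom', 'likes']

    # Sort the per-function statistics by cumulative time to see where the time goes.
    cProfile.run('textQueries(sentences, queries)', sort='cumulative')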





    Nevertheless, I think your code could easily gain quite a bit on the style and clarity side.



    What do I mean by this? Well, let me elaborate:



    What (I think) your code wants to do:




    1. Build a dictionary/hash map where each query word is a key that maps to a collection of indices of all sentences where the query word occurs.

    2. Use this dictionary to determine the indices of all sentences containing all the query's words

    3. Print those indices to the command line


    Printing



    Let's start with an easy one, #3. Your code was



    if len(final_set) == 0:
        print(-1)
    else:
        for index in sorted(final_set):
            print(str(index), end=' ')
        print()


    which does what you want to do. But can we do better?
    I would like to propose a slight improvement.



    if final_set:
        print(' '.join(str(i) for i in sorted(final_set)))
    else:
        print(-1)


    So what is going on here?

    First, if final_set: quite obviously replaces if len(final_set) == 0:.

    How? In Python, empty "iterables" like lists, tuples, strings or, in your case, sets evaluate to False, so -1 is printed in the console, just as before.

    Next, ' '.join(str(i) for i in sorted(final_set)) uses a generator expression together with the built-in join function to generate the final indices output.
    This does basically the same as your original code and is just a little shorter; as a bonus, you get the newline for free.



    Building the hash map



    To get to point #1, most of the heavy lifting is done in the else branch of the major if statement in your code.

    So far so straightforward, not much to gain here. In my opinion, words = sentence.split() should not be necessary. Python's in is capable of finding the word in the full string, too.

    One could argue that it would increase clarity and maintainability to move this functionality to the beginning of the function (or into a separate function altogether) to better separate subtasks #1 and #2.



    Using the hash map



    Last but not least, #2.

    Your code in the if branch gets a little bit involved here.

    If you build your word_map before processing the queries, and if I'm not mistaken, this can be simplified quite a bit to



    final_set = set()
    for word in query.split():
        if not final_set:
            final_set = word_map.get(word, set())
        else:
            final_set = final_set.intersection(word_map.get(word, set()))
        if not final_set:
            break


    I have talked about using the "truthiness/falsiness" of iterables before, so the use here should not surprise you. word_map.get(word, set()) is just a way of saying "get me the value of key word, and if there's nothing, return set()".
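
    A tiny illustration of that default behaviour (the values here are made up just for the example):

    word_map = {'jim': {0, 2}}
    print(word_map.get('jim', set()))   # {0, 2} -- existing key returns its stored value
    print(word_map.get('kate', set()))  # set()  -- missing key falls back to the default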



    Final code



    My final code was



    def textQueriesRevised(sentences, queries):
        # 1. build the hash map
        word_map = {}
        for query in queries:
            for query_word in query.split():
                for i, sentence in enumerate(sentences):
                    if query_word in sentence:
                        word_map.setdefault(query_word, []).append(i)
        for query in queries:
            # 2. look up in the hash map
            final_set = set()
            for word in query.split():
                if not final_set:
                    final_set = set(word_map.get(word, []))
                else:
                    final_set = final_set.intersection(word_map.get(word, []))
                if not final_set:
                    break
            # 3. print the indices
            if final_set:
                print(' '.join(str(i) for i in sorted(final_set)))
                # or
                # print(*sorted(final_set), sep=' ')
            else:
                print(-1)


    You will find subtle differences to some of the snippets posted above, but these mainly stem from me not wanting to work with sets while populating word_map (no particular reason for this, apart from that it's getting late here).
    Performance-wise this should be comparable to your original solution, at least in the cases where I tested it.
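
    For reference, calling the revised function with the question's sample inputs should reproduce the expected output (this is just the example from the question, not an additional test case):

    textQueriesRevised(
        ['jim likes mary', 'kate likes tom', 'tom does not like jim'],
        ['jim tom', 'likes'])
    # 2
    # 0 1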






    – Alex
