Determine which sentences contain the set of phrases
I am trying to solve this HackerRank challenge, which asks me to write a function that prints the indices of all sentences containing all of the query words. For example, if the inputs to the function are:
textQueries(sentences, queries)
textQueries(['jim likes mary','kate likes tom', 'tom does not like jim'],['jim tom', 'likes'])
Then the function will print
2
0 1
If no sentence contains all of the query words, -1 will be printed.
The constraints for the inputs are:
- The length of the sentences and queries lists will not exceed 10,000.
- The number of words in any sentence or query is from 1 to 10.
- Each word has at most 11 characters.
- No word appears in more than 10 sentences.
- Each word consists of uppercase and lowercase English letters only.
My attempt so far is to build a hash map that maps each query word to a set of the indices of the sentences containing that word, and then to intersect those sets to get the final set. My code runs fine, but it is slow for large test cases. Is there any way to optimize it?
def textQueries(sentences, queries):
    word_map = {}
    for query in queries:
        query_words = query.split()
        final_set = set()
        set_modified = False
        for query_word in query_words:
            if query_word in word_map:
                if len(final_set) == 0 and not set_modified:
                    final_set = word_map[query_word]
                    set_modified = True
                    continue
                elif len(final_set) == 0 and set_modified:
                    break
                else:
                    final_set = final_set.intersection(word_map[query_word])
                    continue
            else:
                sentences_contain_word = []
                for index, sentence in enumerate(sentences):
                    words = sentence.split()
                    if query_word in words:
                        sentences_contain_word.append(index)
                    if len(sentences_contain_word) >= 10:
                        break
                word_map[query_word] = set(sentences_contain_word)
                if len(final_set) == 0 and not set_modified:
                    final_set = word_map[query_word]
                    set_modified = True
                elif len(final_set) == 0 and set_modified:
                    break
                else:
                    final_set = final_set.intersection(word_map[query_word])
        if len(final_set) == 0:
            print(-1)
        else:
            for index in sorted(final_set):
                print(str(index), end=' ')
            print()
Tags: python, performance, programming-challenge
asked 6 hours ago by Tien Dinh, edited 3 hours ago
Comments:

Welcome to Code Review! Your example prints 2 followed by 0 1 if I execute the code. Could you verify that your code and the example output are correct? – Alex, 3 hours ago

yea, it should be 2 and 0 1; sorry for the confusion – Tien Dinh, 3 hours ago
2 Answers
Your code is fairly readable, but I find it quite dense. Here are some general things you could do to improve it:
- Use vertical white space to separate different areas of the code.
- Provide a docblock at the top outlining the algorithm you are using.
- Use more than one function to break your code into smaller chunks.
With that advice out of the way, let's look at your performance with an eye to the worst possible case:
def textQueries(sentences, queries):
    for query in queries:                          # Worst case: 10,000 times
        ...
        for query_word in query_words:             # Worst case: 10 times
            if query_word in word_map:
                final_set.intersection             # Worst case: O(10,000) in C
            else:
                for ... in enumerate(sentences):   # Worst case: 10,000 times
                    if ... else:
                        final_set.intersection     # Worst case: O(10,000) in C
By my reckoning, your code's worst case runs in $O(Q \times W \times S)$, where $Q$ is the number of queries, $W$ is the number of words in a query, and $S$ is the number of sentences. (The number of words in the sentences is hidden behind a split, but that happens in C, so we'll let it go.)
There's no question you need to split the sentences and split the queries into separate words, somehow. But you are nesting your loops when the "best" way to loop is to do them one after the other. If you can convert from this:
for q in Q:
    for s in S:
        pass
to something like this:
for s in S:
    pass

for q in Q:
    pass
You will have converted from $O(S \times Q)$ to $O(S + Q)$, which is considerably faster.
Your current approach is to define a set of sentences that contain a particular word. You cache these sets, in case the same words appear again. You intersect the sets for each word in the query, producing a final set of all the sentences that contain all the query words.
This is pretty much guaranteed to fail based on the instructions you quoted:
No word appears in more than 10 sentences.
In other words, the probability of a "cache hit" is 10/10,000, or 0.1%. Any time you spend maintaining the cache is wasted time.
Suppose you just go through the sentences one time, and build the big dictionary of sets:
import collections

sentence_words = collections.defaultdict(set)
for i, s in enumerate(sentences):
    for w in s.split():
        sentence_words[w].add(i)
Then you can go through the query words and do what you were doing before, without any caching:
import functools
import operator

for q in queries:
    matches = functools.reduce(operator.and_, (sentence_words[w] for w in q.split()))
    matches = [-1] if not matches else sorted(matches)
This will seem very similar to what you were doing, but the difference is that these two loops are run one after the other, providing something like $O(100,000 + 100,000)$ performance instead of $O(100,000 times 10,000)$. It's that last operator that makes all the difference.
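For reference, here is how those two pieces fit together into something runnable. This is only a sketch of the approach just described, not verbatim code from above; the printing at the end is added to mirror the question's output format, and the name textQueriesFast is mine:

import collections
import functools
import operator

def textQueriesFast(sentences, queries):
    # Pass 1: index the sentences once (word -> set of sentence indices).
    sentence_words = collections.defaultdict(set)
    for i, s in enumerate(sentences):
        for w in s.split():
            sentence_words[w].add(i)

    # Pass 2: answer each query by intersecting the per-word index sets.
    for q in queries:
        matches = functools.reduce(operator.and_,
                                   (sentence_words[w] for w in q.split()))
        if matches:
            print(*sorted(matches))
        else:
            print(-1)

textQueriesFast(['jim likes mary', 'kate likes tom', 'tom does not like jim'],
                ['jim tom', 'likes'])
# prints:
# 2
# 0 1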
– Austin Hastings, answered 1 hour ago
I'm not sure how much can be gained performance-wise, since the problem has an inherent theoretical complexity limit (quite nicely elaborated in this answer) that not even the most well-crafted code can overcome. Additionally, Python is usually not terribly fast when you have a lot of (nested) loops.
Python IDEs like Spyder come with a built-in profiler tool (see the docs) that allows you to identify performance bottlenecks. A quick run with your small example showed me that, on my machine, most of the time is actually spent printing the values.
Feel free to give it a go with more realistic inputs on your machine.
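If you prefer to stay outside an IDE, the standard library's cProfile module gives you the same kind of breakdown. A minimal sketch, reusing the example call from the question and assuming textQueries is defined in the same module:

import cProfile

cProfile.run(
    "textQueries(['jim likes mary', 'kate likes tom', 'tom does not like jim'],"
    " ['jim tom', 'likes'])",
    sort='cumulative',  # show the most expensive call paths first
)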
Nevertheless, I think your code could easily gain quite a bit on the style and clarity side.
What do I mean by this? Well, let me elaborate:
What (I think) your code wants to do:
1. Build a dictionary/hash map where each query word is a key that maps to a collection of the indices of all sentences in which the query word occurs.
2. Use this dictionary to determine the indices of all sentences containing all of the query's words.
3. Print those indices to the command line.
Printing
Let's start with an easy one, #3. Your code was
if len(final_set) == 0:
    print(-1)
else:
    for index in sorted(final_set):
        print(str(index), end=' ')
    print()
which does what you want. But can we do better? I would like to propose a slight improvement:
if final_set:
    print(' '.join(str(i) for i in sorted(final_set)))
else:
    print(-1)
So what is going on here?
First, if final_set: quite obviously replaces if len(final_set) == 0: (with the two branches swapped). How? In Python, empty "iterables" like lists, tuples, strings, or, in your case, sets evaluate to False, so an empty final_set still prints -1 in the console, just as before.
Next, " ".join(str(i) for i in sorted(final_set))
uses list comprehension together with the built-in join-function to generate the final indices output.
This does basically the same as your original code and is just a little shorter, as a bonus you get the newline for free.
Building the hash map
To get to point #1: most of the heavy lifting is done in the else branch of the major if statement in your code. So far, so straightforward; there is not much to gain here. In my opinion, words = sentence.split() should not be necessary: Python's in can find the word in the full string, too (note that this is a substring match, so it will also hit sentences where the query word only appears inside a longer word — see the illustration below). One could argue that it would increase clarity and maintainability to move this functionality to the beginning of the function (or into a separate function altogether) to better separate subtasks #1 and #2.
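To illustrate the substring point (the strings here are invented for demonstration):

s = 'tomato soup'
'tom' in s          # True: substring match; 'tom' is inside 'tomato'
'tom' in s.split()  # False: exact word match against ['tomato', 'soup']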
Using the hash map
Last but not least, #2. Your code in the if branch gets a little bit involved here. If you build your word_map up front, before processing the queries, then (if I'm not mistaken) this can be simplified quite a bit to:
final_set = set()
for word in query.split():
    if not final_set:
        final_set = word_map.get(word, set())
    else:
        final_set = final_set.intersection(word_map.get(word, set()))
    if not final_set:
        break
I have talked about using the "truthiness/falsiness" of iterables before, so the use here should not surprise you. word_map.get(word, set()) is just a way of saying "get me the value for the key word, and if there's nothing, return set()".
Final code
My final code was
def textQueriesRevised(sentences, queries):
    # 1. build the hash map
    word_map = {}
    for query in queries:
        for query_word in query.split():
            for i, sentence in enumerate(sentences):
                if query_word in sentence:
                    word_map.setdefault(query_word, []).append(i)

    for query in queries:
        # 2. look up in the hash map
        final_set = set()
        for word in query.split():
            if not final_set:
                final_set = set(word_map.get(word, []))
            else:
                final_set = final_set.intersection(word_map.get(word, []))
            if not final_set:
                break

        # 3. print the indices
        if final_set:
            print(' '.join(str(i) for i in sorted(final_set)))
            # or
            # print(*sorted(final_set), sep=' ')
        else:
            print(-1)
You will find subtle differences from some of the snippets posted above, but these mainly stem from me not wanting to work with sets while populating word_map (no particular reason for this, apart from the fact that it's getting late here). Performance-wise, this should be comparable to your original solution, at least in the cases where I tested it.
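If you want to compare the two versions yourself, the standard library's timeit module is a quick way to do it. A sketch, assuming both functions are defined and using invented inputs; the output is redirected because both functions print as they go:

import io
import timeit
from contextlib import redirect_stdout

sentences = ['jim likes mary', 'kate likes tom', 'tom does not like jim']
queries = ['jim tom', 'likes'] * 10000  # many queries over a small corpus

with redirect_stdout(io.StringIO()):  # suppress the functions' own printing
    original = timeit.timeit(lambda: textQueries(sentences, queries), number=1)
    revised = timeit.timeit(lambda: textQueriesRevised(sentences, queries), number=1)

print(f'original: {original:.2f}s  revised: {revised:.2f}s')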
– Alex, answered 1 hour ago