Please help me clarify this; maybe it’s obvious, but I’m not sure. Are we expected to return 20 database matches for every query in the test set? Or does the number of database matches evaluated for each test set query vary, with 20 as the maximum, so that we have to handle this ourselves?
For each query, should we return
a) 20 matches or
b) up to 20 matches (<= 20), depending on how many matches each query has under some criterion?
Great question. I suspect others might wonder about the same thing.
The number of matches for each test set query can vary. It will not be 20 in all cases.
However, we recommend returning the maximum allowed 20 matches for each query because the additional matches (even if they are false positives) will not hurt your score. To see why this is the case, consider an example where there are 5 positives, and you have already found the first 4 with your first 4 guesses, resulting in a precision of 100% and recall of 80% up to this operating threshold. Additional guesses A, B, C in the diagram below will result in lower precision, but your recall will still be 80% for all guesses. Because AP is weighted by the increase in recall from the previous threshold, and the increase in recall is 0, these additional false positives do not penalize your score. But eventually returning the correct positive with guess D will improve your score.
In other words, if you were to restrict your submission to the first 4 guesses, your score for that query could not be greater than 80%. Additional guesses can only leave your score unchanged or increase it.
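To make the arithmetic concrete, here is a minimal sketch of the example above in Python. It assumes AP is computed as the sum of precision at each rank weighted by the increase in recall, as described; the function name `average_precision` and the variable names are only illustrative, and the official scoring code may differ in details such as how scores are averaged over queries.

```python
def average_precision(is_positive, num_positives):
    """Sum of precision at each rank, weighted by the increase in recall.

    is_positive: list of booleans, one per returned match, in rank order.
    num_positives: total number of true positives for this query.
    """
    ap = 0.0
    hits = 0
    prev_recall = 0.0
    for rank, hit in enumerate(is_positive, start=1):
        if hit:
            hits += 1
        precision = hits / rank
        recall = hits / num_positives
        # A false positive leaves recall unchanged, so it contributes 0.
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap


# 5 positives exist; the first 4 guesses are all correct.
first_four = [True] * 4
# Same 4 guesses followed by false positives A, B, C.
with_false_positives = first_four + [False] * 3
# ... followed by guess D, which is the remaining positive.
with_guess_d = with_false_positives + [True]

ap_four = average_precision(first_four, 5)
ap_fp = average_precision(with_false_positives, 5)
ap_d = average_precision(with_guess_d, 5)

print(f"{ap_four:.3f}")  # 0.800
print(f"{ap_fp:.3f}")    # 0.800
print(f"{ap_d:.3f}")     # 0.925
```

The three false positives leave the score at 0.8, and the late correct guess D adds 0.2 × 5/8 = 0.125 on top of it.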
I hope this helps.
Thank you very much for the response. It seems I hadn’t properly taken the recall weighting into account.
In local testing, I also noticed that the scores stayed the same even when I varied the number of database matches returned.
Much clearer with your illustration.