Understanding test outputs

Hi again,

Finally, after some trouble with the submission formats, I have some initial results.
However, these are much worse than expected (leaderboard gives a mAP of zero).

I am now trying to figure out the test results in the .csv files, and I haven’t found any indication in the documentation as to what the “score” column actually means.
How is it possible that I am getting values as high as 40474350?

query_id ref_id score
Q204548 R203486 40474350
Q200502 R203486 38109532

I would also appreciate some help understanding the meaning of these log lines:

2023-02-10 20:56:24 INFO Starting Descriptor level eval
2023-02-10 20:56:25 INFO Loaded 8295 query features
2023-02-10 20:56:26 INFO Loaded 40318 ref features
2023-02-10 20:56:26 INFO Performing search for 9954000 nearest vectors
2023-02-10 20:56:29 INFO too many results 39590496 > 19908000, scaling back radius

2023-02-10 21:15:23 INFO search done in 1130.237 s + 5.118 s, total 15728845 results, end threshold 1.97795e+07
2023-02-10 21:16:36 INFO Got 460 unique video pairs.

  • Why is it searching for video pairs, if I am trying to submit to the descriptor track?
  • Where do the 15728845 results come from?
  • What does the 1.97795e+07 threshold refer to?

I apologize in advance if these are very basic questions, but I haven’t been able to wrap my head around it so far.

Thanks a lot!

Great questions @ndujar!

As a general response to your questions, the evaluation code we are using comes from Meta’s vsc2022 repo, and specifically from their descriptor evaluation script.

How is it possible that I am getting values as high as 40474350?

When evaluating your descriptors, this competition uses inner-product similarity (essentially a vector dot product), which differs from metrics like cosine similarity in that it is not bounded. Cosine similarity normalizes the dot product of two vectors by dividing it by the product of their magnitudes; the inner product does not, so the magnitudes of your vectors feed directly into the similarity score. That is likely what explains values as high as 40474350: your descriptors have large magnitudes.
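To make that concrete, here is a toy NumPy sketch (made-up 2-D vectors, nothing to do with real descriptors):

```python
import numpy as np

# Two toy vectors pointing in the same direction but with very different
# magnitudes (purely illustrative, not real descriptors).
q = np.array([3.0, 4.0])        # magnitude 5
r = np.array([300.0, 400.0])    # magnitude 500

inner_product = q @ r           # 3*300 + 4*400 = 2500; grows with the magnitudes
cosine = inner_product / (np.linalg.norm(q) * np.linalg.norm(r))  # 2500 / (5 * 500) = 1.0

print(inner_product)  # 2500.0 -- unbounded
print(cosine)         # 1.0    -- always in [-1, 1]
```

A side note that can help with debugging: if you L2-normalize each descriptor, the inner product of two normalized vectors equals their cosine similarity, so the scores stay in [-1, 1] and are much easier to sanity-check.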

  • Why is it searching for video pairs, if I am trying to submit to the descriptor track?

For the descriptor track, your “prediction” for a particular query video is a ranked list of reference videos that you believe are likely to contain derived content. To generate that ranked list, each descriptor you produce for a query video is compared against all the descriptors in the reference dataset, yielding query-reference pairs ranked by their inner-product similarity score. For more info / context, you can look at the Wikipedia page on information retrieval, which I have referenced many times myself in preparation for this competition.
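As a rough sketch of the mechanics (made-up shapes and plain brute-force NumPy; the actual evaluation script uses an efficient faiss-based search rather than a loop like this):

```python
import numpy as np

# Made-up sizes, just to show the mechanics.
n_query, n_ref, dim = 100, 500, 512
query_feats = np.random.randn(n_query, dim).astype(np.float32)
ref_feats = np.random.randn(n_ref, dim).astype(np.float32)

# Inner-product similarity between every query descriptor and every
# reference descriptor.
scores = query_feats @ ref_feats.T            # shape (n_query, n_ref)

# For each query descriptor, rank the reference descriptors by score and
# keep the top k as candidate query-reference pairs.
k = 20
top_idx = np.argsort(-scores, axis=1)[:, :k]
top_scores = np.take_along_axis(scores, top_idx, axis=1)
# Row i of (top_idx, top_scores) is the ranked candidate list for query i.
```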

  • Where do the 15728845 results come from?
  • What does the 1.97795e+07 threshold refer to?

The descriptor evaluation script only considers a limited number of nearest-neighbor vectors when conducting its evaluation. To do that, it sets a cutoff threshold on the inner-product similarity score that is roughly equivalent to considering the ~20 most similar videos. In your log, the 1.97795e+07 “end threshold” is that cutoff, and the 15728845 results are the query-reference descriptor pairs whose similarity cleared it.
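Conceptually, the count budget gets turned into a score cutoff. Here is a toy sketch of that idea (not the actual implementation, which adjusts a search radius iteratively; the numbers in the first block are copied from your log, the rest are stand-ins):

```python
import numpy as np

# Numbers taken from the log above.
n_query_features = 8295
requested = 9_954_000                       # "search for 9954000 nearest vectors"
per_query = requested // n_query_features   # = 1200 candidates per query descriptor
cap = 2 * requested                         # = 19_908_000, the "too many results" limit

# The search keeps only query/reference pairs whose inner-product score clears
# a cutoff ("radius"), raising the cutoff until the result count fits the cap.
# Toy version of turning a count budget into a score threshold:
all_scores = np.random.rand(1_000_000) * 4e7   # stand-in for all pairwise scores
budget = 10_000
threshold = np.partition(all_scores, -budget)[-budget]  # score of the budget-th best pair
kept = all_scores[all_scores >= threshold]
print(len(kept), f"{threshold:.5e}")
# In your run, 15728845 pairs survived and the final cutoff was 1.97795e+07,
# which is what the "end threshold" line reports.
```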

I hope that helps - dig into the code and please don’t hesitate to ask more questions!

-Chris

Thanks for the prompt answer @chrisk-dd!
This is very useful to debug my descriptors.
As a matter of fact, I now recall the mention of the inner product on the Problem description page, but it wasn’t totally clear to me whether it is applied at the video level or at the timestamp level.
Now, if I understand correctly, the similarity is evaluated between the vectors generated at the timestamp level.

So, if a reference video is 60s long and a query video is 10s, for that pair there will be 60 reference vectors compared against 10 query vectors, totaling 600 similarity “scores”. From these 600 scores, a number of candidates is selected.
This seems to match the 9954000 nearest vectors logged, given that RETRIEVAL_CANDIDATES_PER_QUERY = 20 * 60 = 1200 and 8295 * 1200 = 9954000.
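In code, I picture it roughly like this (assuming one descriptor per second and, say, a max over the per-timestamp scores to get a video-pair score; both are just my guesses):

```python
import numpy as np

# My assumptions: one descriptor per second, 512-dimensional.
dim = 512
query_vecs = np.random.randn(10, dim).astype(np.float32)   # 10 s query -> 10 vectors
ref_vecs = np.random.randn(60, dim).astype(np.float32)     # 60 s reference -> 60 vectors

pair_scores = query_vecs @ ref_vecs.T    # 10 x 60 = 600 inner-product scores

# The 600 timestamp-level scores presumably get reduced to a single
# video-pair candidate score (taking the max here is just my guess).
video_pair_score = pair_scores.max()
print(pair_scores.shape, video_pair_score)
```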

This still leaves me with some doubts about the role of the timestamps and how precisely they need to be defined, but I guess some more digging in the vsc2022 code should answer that.
Thank you very much!
