Clarification on Descriptor Track Submission Format

Hey there! Wanted to clarify: can we have more than one descriptor embedding per query/reference video in our submission, or is there supposed to be only one descriptor per video?

It’s my understanding that if we submit multiple descriptors per video, the maximum similarity among the embeddings for that video_id will be used as the confidence score - is that right?

To be clear, I’m referring to the descriptor track :smile:

Thank you!

Hi @nateraw,

It’s my understanding that if we submit multiple descriptors per video, the maximum similarity among the embeddings for that video_id will be used as the confidence score - is that right?

Yes, you’ve got it right. You can submit up to one descriptor per second of video and the max similarity will be the confidence score. More info here.
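
For intuition, here’s a rough sketch of that aggregation (illustrative only, not the official scoring code), assuming descriptors are L2-normalized so a dot product gives cosine similarity:

```python
import numpy as np

def confidence_score(query_descriptors: np.ndarray, ref_descriptors: np.ndarray) -> float:
    """Confidence for a (query, reference) pair: the maximum pairwise
    similarity between any query descriptor and any reference descriptor."""
    # With unit-norm rows, the inner product is cosine similarity.
    similarities = query_descriptors @ ref_descriptors.T
    return float(similarities.max())

# Example: 3 query descriptors vs. 5 reference descriptors, dimension 512.
rng = np.random.default_rng(0)
q = rng.normal(size=(3, 512))
r = rng.normal(size=(5, 512))
q /= np.linalg.norm(q, axis=1, keepdims=True)
r /= np.linalg.norm(r, axis=1, keepdims=True)
print(confidence_score(q, r))
```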

Mike

Awesome, thank you so much! One quick follow-up question:

You can submit up to one descriptor per second of video and the max similarity will be the confidence score.

Is the “per second of video” limit applied on a video-by-video basis, or globally across the summed duration of all videos in the set?

Let’s say I have 2 videos:

  • one is 10 seconds long,
  • the other is 20 seconds long.

Can I have 15 descriptors for each? Or would the maximum allowed be 10 for the first and 20 for the second?

@nateraw Good question. It’s on a video-by-video basis, so 10 for the first and 20 for the second in your example.
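
As a quick illustrative check (assuming the cap is simply the duration rounded down to whole seconds):

```python
import math

# Durations in seconds, from the example above.
durations = {"video_1": 10.0, "video_2": 20.0}

# One descriptor per second of video, counted per video.
per_video_budget = {vid: math.floor(d) for vid, d in durations.items()}
print(per_video_budget)  # {'video_1': 10, 'video_2': 20}

# 15 descriptors for the 10-second video would exceed its own budget,
# even though 15 + 15 fits within the combined 30 seconds.
```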

Thanks so much for clarifying!! Very helpful

Hey @nateraw-

Quick follow-up with some nuance. After chatting with our partners at Meta, we’ve decided that participants may, if they would like, distribute their descriptors among the videos in a set in a way that exceeds the per-video “one descriptor per second of video” limit, provided that the total number of descriptors stays below the global threshold (i.e., the total duration in seconds of all videos in the set). However, you may only do so in a way that keeps the descriptors you generate for a particular video independent of all other videos. This means that you cannot decide, when presented with a set of videos, which videos receive more descriptors and which receive fewer using information about other videos - the descriptors you generate for a particular video must be a function of that video alone.
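
To make the independence requirement concrete, here is a toy sketch (the per-shot allocation rule is invented purely for illustration, not a recommendation):

```python
import numpy as np

def allocate_descriptors_for_video(frames: np.ndarray) -> int:
    """Allowed: the descriptor count depends only on this video's own frames.

    Toy rule (invented for illustration): start a new "shot" whenever a frame's
    mean absolute difference from the previous frame exceeds a fixed threshold,
    and allocate one descriptor per shot.
    """
    num_shots = 1
    for prev, curr in zip(frames, frames[1:]):
        if np.abs(curr.astype(float) - prev.astype(float)).mean() > 25.0:
            num_shots += 1
    return num_shots

# Each video is processed in isolation -- no information about other videos is used.
rng = np.random.default_rng(0)
videos = {vid: rng.integers(0, 256, size=(n_frames, 64, 64))
          for vid, n_frames in [("q1", 10), ("q2", 20), ("q3", 30)]}
allocation = {vid: allocate_descriptors_for_video(f) for vid, f in videos.items()}

# The total must still fit the global budget: one descriptor per second,
# summed over all videos in the set (here 10 + 20 + 30 = 60 seconds).
print(allocation, "total:", sum(allocation.values()))

# Not allowed: deciding that one video gets more descriptors *because*
# another video in the set happens to be short or long.
```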

We will add language to the descriptor track submission page that clarifies this point. Please follow up with any questions if the above is unclear!

Thanks,
Chris

Hey there @chrisk-dd - that’s great! I was using a constant number of clips per sample for a baseline I was working on, which is why I asked.


To be super clear, when you say this…

This means that you cannot decide, when presented with a set of videos, which videos receive more descriptors and which receive fewer using information about other videos - the descriptors you generate for a particular video must be a function of that video alone.

…that means I can’t, at inference time, sum up all the durations across the set and use that to determine the constant number of clips I sample per video?

For example:

```
query_id,duration
0001,10.0
0002,20.0
0003,30.0
```
You wouldn’t be able to sum the durations and use that to determine the number of clips per query, right? I wasn’t planning on doing this, but just wanted to check that that’s what you meant.

Hey @nateraw - great question! So great that it’s prompted some internal discussion. Please stand by while we come to consensus & clarity on what is and isn’t permissible.

Hey @nateraw-

I appreciate your patience while we discussed your question and came to an agreement on how to proceed.

We have updated the Rules on Data Use with the following to clarify what is and is not permissible:

Number of descriptors

The limitation of one descriptor per second of video is a global limitation across all videos in the test set (~8,000 videos) and in the code execution test subset (~800 videos). You may, if you wish, choose to allocate more descriptors to some videos and fewer descriptors to others, as long as the number of descriptors you allocate to a particular video is a function of that video alone. Specifically, you may not dynamically allocate the number of descriptors calculated for a particular video based on other videos in the set.

Participants may distribute descriptors unevenly across videos. For instance, a participant could assign the descriptors for certain frames a “priority” and calibrate a priority threshold to match the total descriptor budget. Participants may use the total dataset length (as computed from the metadata file) to compute a descriptor budget and dynamically select a threshold, but may not look at the length distribution or conduct other analysis that would violate the independence criterion.

For conducting inference on a subset of videos in the code execution runtime (and for Phase 2), the same restriction applies: you may calculate the total length of the videos in the subset and use that to dynamically select a priority threshold that ensures the number of descriptors you submit falls within this budget, but you may not use information about the distribution of video lengths to inform your priority scores.
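
To illustrate the “priority plus threshold” pattern described above, here’s a rough sketch (a toy example with a made-up priority rule, not required code): priorities are computed from each video alone, and only the threshold uses the total-duration budget from the metadata.

```python
import numpy as np

def frame_priorities(frames: np.ndarray) -> np.ndarray:
    """Toy per-frame priority: how much a frame changes relative to the
    previous frame. Computed from this video alone."""
    return np.abs(np.diff(frames.astype(float), axis=0)).mean(axis=(1, 2))

rng = np.random.default_rng(0)

# Toy set sampled at 2 frames per second, so there are more candidate frames
# than the one-descriptor-per-second budget allows.
durations = {"0001": 10.0, "0002": 20.0, "0003": 30.0}   # from the metadata file
videos = {vid: rng.integers(0, 256, size=(int(2 * d), 32, 32)) for vid, d in durations.items()}

# Step 1 (per video, independent): compute priorities for each video in isolation.
priorities = {vid: frame_priorities(f) for vid, f in videos.items()}

# Step 2 (global, allowed): total budget = total duration in seconds across the set.
budget = int(sum(durations.values()))  # 60 descriptors for this toy set

# Step 3 (global, allowed): pick one threshold so the total kept fits the budget.
all_scores = np.sort(np.concatenate(list(priorities.values())))[::-1]
threshold = all_scores[budget - 1] if all_scores.size > budget else -np.inf

kept = {vid: int((p >= threshold).sum()) for vid, p in priorities.items()}
print(kept, "total:", sum(kept.values()), "budget:", budget)
```

Note that the per-video priority scores never look at other videos; only the single scalar threshold is calibrated against the global budget.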

I hope this resolves your question, but if anything is unclear or you have any additional follow-up questions, please don’t hesitate to ask!

-Chris
