Can the ground-truth pairs be used for training?

I am confused by the rules described in section 6.5 of this paper: https://arxiv.org/pdf/2106.09672.pdf. It says the scoring function for a pair should be independent of other query images and reference images. But if I learn a metric function using the ground-truth pairs, the learned function will inevitably depend on the query and reference sets. Is that legal? If not, how do I use the ground-truth pairs?

Welcome to the competition, @zouxiaochuan! Thanks for the question.

I believe this is the paragraph you are referring to from the paper:

The scoring of a query image w.r.t. a reference image should be independent of other query images and reference images. This means that even if the dataset had a single query image and a single reference image, the score of the image pair would be the same. The intent of this rule is to avoid (1) that algorithms overfit to the reference set, for example by building a gigantic classifier with 1M outputs that predicts the matches, (2) that algorithms use irrelevant dataset statistics like the fact that there is at most one query image per reference image. This rules out methods based on query expansion or neighborhood graphs.

You are correct that submissions must treat each query-reference pair independently, so submissions in this phase should not use information gathered from training on the provided ground truth. Ultimately, the rules on scoring a given query-reference pair apply to eligible submissions for final rankings, which will be determined by performance on the unseen query set in Phase 2. Training from the provided ground truth may seem like a natural starting point, but you’ll want to keep a few things in mind:

  • 25k labeled pairs are fine for evaluation but probably too few for training a model
  • Scores in Phase 1 will be less useful to the extent that they overfit to the provided data (the gap between local evaluation on the ground truth and leaderboard evaluation on the 50k query set indicates the amount of overfitting)
  • You may not augment the reference set for training, only use it for inference (see the rules on data use in the Problem Description section)

In general, we would advise you to use the provided training set for training, and use the ground truth to evaluate your solutions. As mentioned in the paper, the training set is provided as a statistical twin of the reference set, and it can be used to do all kinds of training tasks without risk of overfitting to the reference set.
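For concreteness, here is a minimal sketch of that evaluation-only workflow using pandas and scikit-learn. The file names, column names (`query_id`, `reference_id`, `score`), and the use of plain average precision are assumptions for illustration; the official metric is a micro-average precision over the full query set, so treat this only as a local proxy:

```python
import pandas as pd
from sklearn.metrics import average_precision_score

# Assumed format (check against the actual files): public_ground_truth.csv has
# query_id and reference_id columns, with reference_id empty for queries that
# have no match; my_predictions.csv has query_id, reference_id, and score
# columns produced by your own model.
gt = pd.read_csv("public_ground_truth.csv")
preds = pd.read_csv("my_predictions.csv")

# Build the set of true (query, reference) pairs from the matched queries.
matched = gt.dropna(subset=["reference_id"])
gt_pairs = set(zip(matched["query_id"], matched["reference_id"]))

# Label each predicted pair as correct or not; average_precision_score needs
# at least one positive pair among the predictions.
labels = [
    (q, r) in gt_pairs
    for q, r in zip(preds["query_id"], preds["reference_id"])
]
ap = average_precision_score(labels, preds["score"])
print(f"Local average precision: {ap:.4f}")
```

The key point is that the ground truth enters only after scoring, never during training.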

Thanks for the quick reply. That’s clear now.

Hi, I am confused about how to use the training images because there are no labels for them.
By the way, what does ‘a disjoint set’ mean in the rules:

  • A disjoint set of training images is also provided, and augmentation and annotation of this separate set is permitted for training purposes.

Hi @Octopus - A disjoint set means a totally separate dataset that doesn’t share any images with the reference set. It will be up to you to create labels for the images as you work with them. Check out the Getting Started blog post for some additional guidance on generating your own training data.
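As a concrete (hypothetical) illustration of creating your own labels, a common approach is self-supervised pair generation: apply random edits to a training image and treat the (original, edited) pair as a positive match. The specific augmentations below are illustrative choices using torchvision, not the actual edits used to build the query set:

```python
import torch
from torchvision import transforms
from PIL import Image

# Illustrative augmentations; the Getting Started post and the competition
# paper describe the kinds of edits actually applied to derive query images.
augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.3, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(0.4, 0.4, 0.4, 0.1),
    transforms.RandomGrayscale(p=0.2),
    transforms.ToTensor(),
])
to_tensor = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

def make_positive_pair(path: str) -> tuple[torch.Tensor, torch.Tensor]:
    """Return (original, edited copy) as a self-labeled positive pair."""
    img = Image.open(path).convert("RGB")
    return to_tensor(img), augment(img)
```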

Hope that helps!

Just confirming a few points:

  • It is fully legal to train on the Phase 1 query images, along with images derived in part from them
  • It’s legal to use YFCC100M and DFDC in any way during training, even though they have some overlap with the reference set
  • Query-reference pair ground truth is a valid validation and early-stopping metric (without use in any gradient flow or training steps)

I am also curious about whether YFCC100M is legal for training. Waiting for a reply!

Hi @_NQ @wenhaowang - Thanks for asking about the YFCC100M data.

As the source of the challenge data, the YFCC100M dataset is not external data and training using data from this source is prohibited (with the exception of the training data provided through the challenge). The 1 million training images have been safely drawn from this source to ensure no overlap, so these are a great resource to use for training.

This has also been clarified in the Problem Description.

Hi,

I saw in the downloaded public_ground_truth.csv file that only 4991 queries have a reference. What is the meaning of the other 20009 queries that do not have a reference? Are they negative queries with no reference at all, or are their references just not provided?

Hi coin. Thanks for the question.

From the problem description page:

A subset of the query images have been derived in some way from the reference images, and the rest of the query images have not.

For more info on these, check out the competition paper and search for “distractor”.
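If you want to split the two groups locally, here is a short sketch, assuming the file has `query_id` and `reference_id` columns with `reference_id` empty for distractor queries:

```python
import pandas as pd

gt = pd.read_csv("public_ground_truth.csv")

# Queries with a reference were derived from it; the rest are distractors.
matched = gt[gt["reference_id"].notna()]     # ~4991 rows
distractors = gt[gt["reference_id"].isna()]  # ~20009 rows
print(len(matched), len(distractors))
```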

Hope this helps!