Can the ground-truth pairs be used for training?

zouxiaochuan · June 28, 2021, 7:50am

I am confused by the rules described in section 6.5 of this paper: https://arxiv.org/pdf/2106.09672.pdf. It says the scoring function for a pair should be independent of other query images and reference images. If I learn a metric function using the ground-truth pairs, the learned function will inevitably depend on the query and reference sets. Is that legal? If not, how should I use the ground-truth pairs?

glipstein · June 28, 2021, 10:58pm

Welcome to the competition, @zouxiaochuan! Thanks for the question.

I believe this is the paragraph you are referring to from the paper:

The scoring of a query image w.r.t. a reference image should be independent of other query images and reference images. This means that even if the dataset had a single query image and a single reference image, the score of the image pair would be the same. The intent of this rule is to avoid (1) that algorithms overfit to the reference set, for example by building a gigantic classifier with 1M outputs that predicts the matches, (2) that algorithms use irrelevant dataset statistics like the fact that there is at most one query image per reference image. This rules out methods based on query expansion or neighborhood graphs
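To make the distinction in that paragraph concrete, here is a minimal Python sketch. It is an illustration only, not code from the paper or the rules: the function names `cosine_score` and `reference_normalized_score` are hypothetical, and descriptors are assumed to be plain lists of floats. The first function depends only on the pair itself; the second also looks at the rest of the reference set, which is the kind of dependence the rule appears to exclude.

```python
import math


def cosine_score(query_vec, ref_vec):
    """Pair-independent: uses only the two descriptors of this pair."""
    dot = sum(q * r for q, r in zip(query_vec, ref_vec))
    norm = math.sqrt(sum(q * q for q in query_vec)) * math.sqrt(
        sum(r * r for r in ref_vec)
    )
    return dot / norm if norm else 0.0


def reference_normalized_score(query_vec, ref_vec, all_ref_vecs):
    """NOT pair-independent: the result changes with the other references."""
    raw = cosine_score(query_vec, ref_vec)
    # The best score over the whole reference set is a dataset statistic.
    best = max(cosine_score(query_vec, other) for other in all_ref_vecs)
    return raw / best if best else 0.0
```

Adding or removing reference images never changes `cosine_score` for a given pair, while it can change the second score.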

You are correct that submissions must treat each query-reference pair independently, and so submissions in this phase should not use information gathered from training on the provided ground truth. Ultimately the rules on scoring a given query-reference pair apply to eligible submissions for final rankings, which will be determined by performance on the unseen query set in Phase 2. Training from the provided ground truth may be a starting point; however, you’ll want to keep a few things in mind:

25k labeled pairs is fine for evaluation but probably too little for training a model
Scores in Phase 1 will be less useful to the extent that they overfit on the provided data (the gap between local evaluation on the ground truth and leaderboard evaluation on the 50k query set is indicative of the amount of overfitting)
You may not augment the reference set in training, only for inference (see rules on data use in the Problem Description section)

In general, we would advise you to use the provided training set for training, and use the ground truth to evaluate your solutions. As mentioned in the paper, the training set is provided as a statistical twin of the reference set, and it can be used to do all kinds of training tasks without risk of overfitting to the reference set.
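As a rough sketch of using the ground truth for evaluation only, the snippet below computes a simplified micro-average precision over predicted pairs. The function name, the input layout, and the choice of metric are assumptions made for illustration; the Problem Description defines the official metric, and tied scores are not handled specially here.

```python
def micro_average_precision(predictions, ground_truth):
    """Simplified micro-AP.

    predictions: list of (query_id, reference_id, score) tuples.
    ground_truth: set of (query_id, reference_id) pairs that truly match.
    """
    # Rank every predicted pair, from all queries together, by score.
    ranked = sorted(predictions, key=lambda p: p[2], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (query_id, reference_id, _score) in enumerate(ranked, start=1):
        if (query_id, reference_id) in ground_truth:
            hits += 1
            precision_sum += hits / rank  # precision at this recall point
    return precision_sum / len(ground_truth) if ground_truth else 0.0


# Toy example with made-up IDs.
truth = {("Q1", "R1"), ("Q2", "R2")}
preds = [("Q1", "R1", 0.9), ("Q3", "R5", 0.8), ("Q2", "R2", 0.7)]
score = micro_average_precision(preds, truth)
```

The labels are only read when scoring the predictions; they never feed into the model that produced them.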

zouxiaochuan · June 29, 2021, 2:06am

Thanks for the quick reply. That’s clear now.

Octopus · July 8, 2021, 4:11pm

Hi, I am confused about how to use the training images because they have no labels.
By the way, what does ‘A disjoint set’ mean in the rules:

A disjoint set of training images is also provided, and augmentation and annotation of this separate set is permitted for training purposes.

glipstein · July 8, 2021, 6:23pm

Hi @Octopus - A disjoint set means a totally separate dataset that doesn’t share any images with the reference set. It will be up to you to create labels for the images as you work with them. Check out the Getting Started blog post for some additional guidance on generating your own training data.
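One possible way to create labels, sketched under assumptions: treat an augmented copy of a training image as a positive match for its original, and pair it with a different training image as a negative. The transform names, the `image_id:transform` naming, and the function `make_labeled_pairs` are all hypothetical, and no actual image processing happens here; the sketch only builds the label list.

```python
import random

# Hypothetical transform names; real augmentations would come from an image library.
TRANSFORMS = ["crop", "blur", "rotate", "overlay_text"]


def make_labeled_pairs(image_ids, negatives_per_image=1, seed=0):
    """Return (augmented_id, target_id, label) triples; label 1 = match, 0 = no match."""
    rng = random.Random(seed)
    pairs = []
    for image_id in image_ids:
        augmented_id = f"{image_id}:{rng.choice(TRANSFORMS)}"
        # Positive: the augmented copy against the image it was derived from.
        pairs.append((augmented_id, image_id, 1))
        # Negatives: the same augmented copy against other training images.
        others = [other for other in image_ids if other != image_id]
        for other in rng.sample(others, min(negatives_per_image, len(others))):
            pairs.append((augmented_id, other, 0))
    return pairs
```

Building the `others` list for every image is quadratic, so this form only suits small experiments; a larger run would sample negatives by index instead.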

Hope that helps!

_NQ · July 21, 2021, 1:39pm

Just confirming a few points:

It is fully legal to train on the Phase I query images, along with images derived in part from them
It’s legal to use YFCC100M and DFDC in any way during training, even though they have some overlap with the reference set.
Query-reference pair ground truth is a valid validation and early stopping metric (without use in any gradient flow or training steps)

wenhaowang · July 22, 2021, 2:09pm

I am also curious about whether YFCC100M is legal for training. Waiting for reply!

glipstein · July 29, 2021, 7:40pm

Hi @_NQ @wenhaowang - Thanks for asking about the YFCC100M data.

As the source of the challenge data, the YFCC100M dataset is not external data and training using data from this source is prohibited (with the exception of the training data provided through the challenge). The 1 million training images have been safely drawn from this source to ensure no overlap, so these are a great resource to use for training.

This has also been clarified in the Problem Description.

coin · September 8, 2021, 12:30am

Hi,

I saw in the downloaded public_ground_truth.csv file that only 4991 queries have a reference. What do the other 20009 queries without a reference mean? Are they negative queries that have no reference at all, or are their references simply not provided?

mike-dd · September 10, 2021, 1:27pm

Hi coin. Thanks for the question.

From the problem description page:

A subset of the query images have been derived in some way from the reference images, and the rest of the query images have not.

For more info on these, check out the competition paper and search for “distractor”.
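If you want to check the 4991 / 20009 split locally, a small sketch like the one below counts the two kinds of rows. It assumes the file has `query_id` and `reference_id` columns and that rows without a match leave `reference_id` empty; verify this against your downloaded copy. The sample rows are made up.

```python
import csv
import io

# Made-up sample in the assumed layout of public_ground_truth.csv.
SAMPLE = """query_id,reference_id
Q00000,R000001
Q00001,
Q00002,
Q00003,R000042
"""


def count_matches(csv_file):
    """Return (queries with a reference, queries without one)."""
    matched = 0
    distractors = 0
    for row in csv.DictReader(csv_file):
        # An empty or missing reference_id is treated as "no reference".
        if (row["reference_id"] or "").strip():
            matched += 1
        else:
            distractors += 1
    return matched, distractors


matched, distractors = count_matches(io.StringIO(SAMPLE))
```

For the real file, pass `open("public_ground_truth.csv", newline="")` in place of the in-memory sample.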

Hope this helps!
