Pseudo labeling

ngcferreira · October 27, 2020, 10:34pm

Is pseudo labeling allowed, or does it violate the competition rules?
I would expect it to be allowed, but I just want to be sure.

Thank you,
Nuno

ngcferreira · October 28, 2020, 7:21am

@glipstein As the deadline is approaching, can you let me know if the competition rules allow the use of pseudo labeling? Thanks

glipstein · October 28, 2020, 8:07pm

Hi @ngcferreira - Thanks for reaching out. It’s not clear exactly what you’re intending, but we’d encourage you to check out this thread and the Dataset License Agreement which addresses annotating the dataset.

ngcferreira · October 28, 2020, 9:46pm

Hi @glipstein, thank you for your reply, but that thread doesn’t really cover this particular case.
I wanted to know if it is allowed to use my model’s predictions on the test set as training data.
So in fact the test data gets labeled automatically by my model (I use my best submission as training data).
No manual process is involved. This is a semi-supervised machine learning technique, which can improve the performance of a model.
From the rules it isn’t clear if this is allowed or not, so I would appreciate some guidance.
Thank you,
Nuno
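
For readers unfamiliar with the technique described above, one round of pseudo labeling can be sketched roughly as follows. Everything here is an assumption for illustration: the toy 1-D nearest-centroid classifier, the function names, and the 0.5 confidence threshold are hypothetical and are not taken from any competition code. The "unlabeled pool" is generic; which data may be used for it is a rules question the code does not answer.

```python
from statistics import mean


def fit_centroids(xs, ys):
    """Fit a toy 1-D nearest-centroid classifier: one mean per class."""
    return {label: mean(x for x, y in zip(xs, ys) if y == label)
            for label in set(ys)}


def predict_with_confidence(centroids, x):
    """Return (label, confidence); confidence is a distance margin in [0, 1].

    Assumes at least two classes.
    """
    dists = sorted((abs(x - c), label) for label, c in centroids.items())
    (d0, best), (d1, _) = dists[0], dists[1]
    conf = 1.0 if d0 + d1 == 0 else (d1 - d0) / (d1 + d0)
    return best, conf


def pseudo_label(train_x, train_y, unlabeled_x, threshold=0.5):
    """One round of pseudo labeling: fit, label confident points, refit."""
    centroids = fit_centroids(train_x, train_y)
    new_x, new_y = [], []
    for x in unlabeled_x:
        label, conf = predict_with_confidence(centroids, x)
        if conf >= threshold:  # keep only confident predictions
            new_x.append(x)
            new_y.append(label)
    # Refit on the labeled data plus the model's own confident predictions.
    refit = fit_centroids(train_x + new_x, train_y + new_y)
    return refit, list(zip(new_x, new_y))


model, pseudo = pseudo_label([0.0, 1.0, 9.0, 10.0], [0, 0, 1, 1],
                             [0.2, 5.1, 9.8])
```

In this toy run the ambiguous point 5.1 falls below the threshold and is left out, while 0.2 and 9.8 are added to the training data with the model's own labels.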

burebista · October 28, 2020, 10:25pm

Uploading the predictions and submitting to the leaderboard in the first place involves a manual process. Plus, I think leaderboard-supervised sounds way cooler than semi-supervised.

ngcferreira · October 28, 2020, 10:37pm

It’s a standard ML technique, used when you don’t have enough data, which is the case here.
I just want to know if that’s allowed or not, to decide what to do for my 3rd submission. I don’t want to waste it on something that might not be allowed. I still have a few other ideas on how to improve.

bahushruth · October 28, 2020, 11:14pm

Achieving a state-of-the-art result by training the model on the test set labelled by outsourced annotators sounds cooler, in my opinion.

ngcferreira · October 29, 2020, 8:31am

@bahushruth Here is an article where Google achieves state-of-the-art results on a different problem with a mix of pseudo labeling and other techniques: https://arxiv.org/pdf/2001.07685.pdf
And note that, regardless of knowing the private score of the training set, this technique is still useful.
Anyway, as there isn’t an answer from the organizers (@glipstein), I will stick to my current approach, which does not use pseudo labeling.

ironbar · October 29, 2020, 5:30pm

I think a clear response is needed here @glipstein instead of pointing to other threads.

Pseudolabelling is a standard ML technique as @ngcferreira said. Taking the predictions of a model on unlabelled data and using them for training is not a modification of the dataset from my point of view.

For example, if I use data augmentation and flip the images during training, am I also modifying the dataset?
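
To make the augmentation point concrete, here is a minimal sketch of on-the-fly flipping, assuming images are stored as nested lists of pixel values. The helper names `hflip` and `augmented_pairs` and the `p_flip` parameter are hypothetical, chosen only for this illustration. The flip produces a copy at training time, so the stored images are never changed.

```python
import random


def hflip(image):
    """Return a horizontally flipped copy; the input is left untouched."""
    return [row[::-1] for row in image]


def augmented_pairs(images, labels, p_flip=0.5, seed=0):
    """Yield (image, label) pairs, flipping each image with probability p_flip."""
    rng = random.Random(seed)
    for image, label in zip(images, labels):
        yield (hflip(image) if rng.random() < p_flip else image), label


stored = [[[1, 2, 3], [4, 5, 6]]]
flipped = list(augmented_pairs(stored, ["cat"], p_flip=1.0))
```

With `p_flip=1.0` every yielded image is flipped, yet `stored` still holds the original pixel order afterwards.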

glipstein · October 30, 2020, 12:04am

@ngcferreira Thanks for your patience. Labeling the test data through semi-supervised learning is still not in compliance with the Dataset License Agreement. You may perform semi-supervised learning on other data (including train and dev), but not the test set, which should really be treated as “unseen” new data.

ironbar · October 30, 2020, 7:00am

That’s sad, because we have been training the whole month using pseudolabels on test_seen. I encourage you to make the rules clearer next time.

On the other hand, it’s better to know now, so thanks @ngcferreira for asking.

ngcferreira · October 30, 2020, 9:11am

@glipstein Thanks for clarifying this, which was not clear in the rules. Please make the rules clearer next time.

@ironbar I’m sorry for the time you have put into pseudo-labeling. By the way, was it working for you? I gave it a quick try, and it only slightly improved my results on the validation set. Of course, I do not know how much it would improve the test set results, but in my experience so far the difference between validation and test scores is not that big.

james005 · October 30, 2020, 5:17pm

@glipstein And how about data augmentation on the train and dev sets? Is it allowed or not?

glipstein · October 31, 2020, 3:42pm

@james005 Please do not modify the test set, but you may use the training and dev sets.

Good luck!

bahushruth · November 1, 2020, 12:02am

Hope this goes for test seen as well.
