
Pseudo labeling

Is pseudo labeling allowed, or does it violate the competition rules?
I would expect it to be allowed, but just want to be sure of it.

Thank you,
Nuno

@glipstein As time is running out before the deadline, can you let me know whether the competition rules allow the use of pseudo labeling? Thanks

Hi @ngcferreira - Thanks for reaching out. It’s not clear exactly what you’re intending, but we’d encourage you to check out this thread and the Dataset License Agreement which addresses annotating the dataset.

Hi @glipstein, thank you for your reply, but that thread doesn’t really cover this particular case.
I wanted to know if it is allowed to use my model’s predictions on the test set as training data.
So in fact the test data gets labeled automatically by my model (I use my best submission as training data).
No manual process is involved. This is a semi-supervised machine learning technique, which can improve the performance of a model.
From the rules it isn’t clear if this is allowed or not, so I would appreciate some guidance.
Thank you,
Nuno
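For readers unfamiliar with the technique described above, here is a minimal sketch of one pseudo-labeling round. Everything in it is an assumption made for illustration: the nearest-centroid classifier, the helper names, the 0.9 confidence threshold, and the toy data are hypothetical and are not the poster's model or any competition split. `X_unlab` simply stands for whatever unlabeled pool is being labeled automatically.

```python
import numpy as np

def fit_centroids(X, y):
    """Fit a toy nearest-centroid classifier: one mean vector per class."""
    classes = np.unique(y)
    centroids = np.stack([X[y == c].mean(axis=0) for c in classes])
    return classes, centroids

def predict_with_confidence(X, classes, centroids):
    """Return predicted labels and a softmax-style confidence per sample."""
    # Squared Euclidean distance from every sample to every centroid.
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    logits = -d2
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    idx = probs.argmax(axis=1)
    return classes[idx], probs.max(axis=1)

def pseudo_label_round(X_lab, y_lab, X_unlab, threshold=0.9):
    """Train on labeled data, label the unlabeled pool, retrain on both."""
    classes, centroids = fit_centroids(X_lab, y_lab)
    pseudo_y, conf = predict_with_confidence(X_unlab, classes, centroids)
    keep = conf >= threshold  # keep only confident pseudo-labels
    X_aug = np.concatenate([X_lab, X_unlab[keep]])
    y_aug = np.concatenate([y_lab, pseudo_y[keep]])
    return fit_centroids(X_aug, y_aug), int(keep.sum())

# Toy data: two labeled points per class and three unlabeled points.
X_lab = np.array([[-2.0, 0.0], [-2.0, 1.0], [2.0, 0.0], [2.0, 1.0]])
y_lab = np.array([0, 0, 1, 1])
X_unlab = np.array([[-3.0, 0.5], [3.0, 0.5], [0.0, 0.5]])

(classes, centroids), n_used = pseudo_label_round(X_lab, y_lab, X_unlab)
print(n_used)  # 2: the point midway between the classes is not confident enough
```

The threshold is the main design choice in a sketch like this: a low value adds more training data but also more wrong labels, which the retrained model then reinforces.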

Uploading the predictions and submitting to the leaderboard in the first place involves a manual process. Plus I think leaderboard-supervised sounds way cooler than semi-supervised :smiley:

It’s a standard ML technique, used when you don’t have enough data, which is the case here.
I just want to know whether that’s allowed, so I can decide what to do for my 3rd submission. I don’t want to waste it on something that might not be allowed :wink: I still have a few other ideas on how to improve.

Achieving the state of the art result by training the model on the test set labelled by outsourced annotators sounds cooler in my opinion :thinking:


@bahushruth Here is an article where Google achieves state-of-the-art results on a different problem with a mix of pseudo labeling and other techniques: https://arxiv.org/pdf/2001.07685.pdf
And note that this technique is still useful regardless of whether you know the private score of the training set.
Anyway, as there isn’t an answer from the organizers @glipstein, I will stick to my current approach, which does not use pseudo labeling. :slight_smile:


I think a clear response is needed here @glipstein instead of pointing to other threads.

Pseudolabelling is a standard ML technique as @ngcferreira said. Taking the predictions of a model on unlabelled data and using them for training is not a modification of the dataset from my point of view.

For example, if I use data augmentation and flip the images during training, am I also modifying the dataset?
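To make the comparison concrete, here is a minimal sketch of flip augmentation, assuming images are NumPy arrays of shape (height, width) or (height, width, channels). The function name and the probability parameter are hypothetical. The flip is applied in memory at training time, so the files on disk stay untouched; whether that counts as modifying the dataset under the license is for the organizers to say.

```python
import numpy as np

def random_horizontal_flip(image, rng, p=0.5):
    """Return the image mirrored left-to-right with probability p."""
    if rng.random() < p:
        # Reverse the width axis; this is a view, the source array is unchanged.
        return image[:, ::-1]
    return image

rng = np.random.default_rng(0)
image = np.arange(6).reshape(2, 3)
flipped = random_horizontal_flip(image, rng, p=1.0)  # p=1.0 always flips
```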


@ngcferreira Thanks for your patience. Labeling the test data through semi-supervised learning is still not in compliance with the Dataset License Agreement. You may perform semi-supervised learning on other data (including train and dev), but not the test set, which should really be treated as “unseen” new data.


That’s sad, because we have been training the whole month using pseudo-labels on test_seen. I encourage you to make the rules clearer next time.

On the other hand, it’s better to know now, so thanks @ngcferreira for asking.


@glipstein Thanks for clarifying this, which was not clear in the rules. Please make the rules clearer next time.

@ironbar I’m sorry for the time you have put into pseudo-labeling. By the way, was it working for you? I gave it a quick try, and it only slightly improved my results on the validation set. Of course, I do not know how much it would improve the test set results, but in my experience so far, the difference between validation and test scores is not that big.

@glipstein And how about data augmentation on the train and dev sets? Is it allowed or not?

@james005 Please do not modify the test set, but you may use the training and dev set.

Good luck!


Hope this goes for test seen as well.
