Using test data for modelling

Nilabhra · September 24, 2020, 11:43am

The description of the innovation track mentions that the model should be trained only on the training dataset provided for this competition. Would it be possible to use the test data (without labels of course) to incorporate more information into the model since the usage of external data is prohibited?

Thanks!

KieranLitschel · September 26, 2020, 9:53am

It’s generally a bad idea to use your test set (or your validation set, for that matter) at all when constructing the model. If you use the test set for anything other than evaluation, it will no longer give you a fair estimate of your model’s performance on unseen data, because the data has been seen. Consequently, you are likely to overfit the test set, and when it comes to evaluation on the truly unseen test set you will see a big drop in performance. Using the test set in an unsupervised way could potentially boost your performance on the unseen test set, but really you would be rolling the dice on your model’s performance on unseen data, which defeats the purpose of the test set.

Nilabhra · September 26, 2020, 1:09pm

Thanks for the detailed response! My question, though, was about the legality of using the test set for modelling.

sorrge · September 26, 2020, 8:29pm

I’m also interested in this. Although using the test inputs is not without risks, as discussed above, it’s usually allowed in such competitions. There are several arguments for it:

We see this data anyway, so it’s impossible to completely prevent information from it leaking into the model.
The hosts said that they will test submissions on completely unseen data anyway, which eliminates the overfitting concerns.
In real-world usage scenarios we also have the unlabeled test data, so it’s possible to re-train the model using the new samples, although it may not be practical.

dexarsal · September 27, 2020, 4:05pm

In terms of legality, I don’t believe it’s against the rules, as test data isn’t really “external”. It does matter how you use the test data, though, so that the model doesn’t overfit to the “seen” test set. There are, however, approaches like pseudo labelling and unsupervised learning which might be beneficial.
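For anyone unfamiliar with pseudo labelling, here is a minimal sketch of a single round of it. This is not something the hosts or anyone in this thread prescribed: it assumes a toy nearest-centroid classifier and synthetic 2-D data, and the function names, the 0.9 confidence threshold, and the data are all hypothetical.

```python
import numpy as np


def fit_centroids(X, y, n_classes):
    """Toy classifier: one mean vector per class."""
    return np.stack([X[y == c].mean(axis=0) for c in range(n_classes)])


def predict_proba(centroids, X):
    """Softmax over negative squared distances to each class centroid."""
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    logits = -d2
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)


def pseudo_label_round(X_train, y_train, X_test, n_classes, threshold=0.9):
    """Fit on labelled data, label confident test rows, then refit on both."""
    centroids = fit_centroids(X_train, y_train, n_classes)
    proba = predict_proba(centroids, X_test)
    confident = proba.max(axis=1) >= threshold  # assumed cut-off
    X_aug = np.vstack([X_train, X_test[confident]])
    y_aug = np.concatenate([y_train, proba.argmax(axis=1)[confident]])
    return fit_centroids(X_aug, y_aug, n_classes), int(confident.sum())


# Synthetic stand-in data: two well-separated clusters.
rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(-3, 1, (20, 2)), rng.normal(3, 1, (20, 2))])
y_train = np.array([0] * 20 + [1] * 20)
X_test = np.vstack([rng.normal(-3, 1, (50, 2)), rng.normal(3, 1, (50, 2))])

centroids, n_added = pseudo_label_round(X_train, y_train, X_test, n_classes=2)
print(f"pseudo-labelled {n_added} of {len(X_test)} test rows")
```

Only the test inputs are used here, never test labels. The pseudo labels are the model’s own guesses, so any mistakes it makes get reinforced, which is one way the overfitting risk discussed above can show up.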

Nilabhra · September 27, 2020, 4:20pm

Thanks for your reply. The innovation track specifically asks us to train the model only on the training dataset and hence I decided to ask. But I guess what you say makes sense.

glipstein · September 30, 2020, 4:01pm

Hi all - Thanks for these questions. We’ve updated the language on the Innovation Track page to clarify.

What data can I use for the model I submit to the Innovation Track?

The model you submit to the Innovation Track should be the same as the corresponding submission to the Prediction Track. As such, the model should be trained only on the dataset provided for this competition. External data may also be used in the Innovation Track for the purposes of showcasing the capabilities of the model post-training. As with the Prediction Track, finalists for the Innovation Track will have their model performance validated against an out-of-sample verification set; teams judged to have violated rules regarding data usage will be disqualified.

In other words, the rules for training are the same as for the Prediction Track.

Phaedrus · October 2, 2020, 5:27pm

Since the host’s reply isn’t very clear on using test data to create models, I wanted to confirm whether it is okay to use test data to train models (assuming there is a way to do so) for the prediction task.

Nilabhra · October 5, 2020, 8:19pm

I guess it’s fine to use the test data; they have updated the description.

authman · October 11, 2020, 5:45am

Seems pretty clear to me. Train your model on the dataset provided for this competition. The dataset provided to us is whatever you see available for download on the competition site (login required).

In other words, feel free to use the train + test set. Just be aware that we’re being evaluated on a hold-out set.

hengcherkeng · October 18, 2020, 9:49am

I have to apologise: this is my first time taking part in a DrivenData competition, so my understanding may not be correct. This is what I gather:
[image: Selection_094]

It is unclear whether the private dataset is part of test.csv or is another out-of-sample set. Either way, we are preparing for all possible cases until there is further clarification.

Also, there is no need to select a submission. All submissions will be evaluated on the private dataset, and the best one will be the one shown on the private leaderboard.

sorrge · October 18, 2020, 5:46pm

I agree, this is unclear. If there are more labs in the holdout set, and there will be no re-training, then it’s not possible to predict these additional labs exactly. At most, we can hope that the model could identify that there are samples from unknown labs, and not assign them to any of the existing labs. This is already a very tough proposition for this task, though. And the output format doesn’t really allow for that: we could predict all zeros in this case, but this will not work with the top-10 metric.
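To illustrate the last point, here is a rough sketch of a top-10 accuracy computation. The exact definition the hosts use is an assumption on my part (true lab counted as a hit if it is among the ten highest-scored columns), and the function name and toy inputs are hypothetical.

```python
import numpy as np


def top_k_accuracy(y_true, scores, k=10):
    """Fraction of rows whose true class index is among the k highest scores."""
    # Stable sort so that tied scores keep their original column order.
    top_k = np.argsort(-scores, axis=1, kind="stable")[:, :k]
    return float(np.mean((top_k == y_true[:, None]).any(axis=1)))


# Two samples, 20 known labs, and an all-zero prediction for both rows.
y_true = np.array([15, 3])
all_zero_scores = np.zeros((2, 20))
print(top_k_accuracy(y_true, all_zero_scores))
```

Under this definition an all-zero row does not mean “none of the known labs”. It is just a tie across every column, so ten labs still end up in the top 10 by whatever tie-breaking the scorer applies, and a lab that has no column can never be counted as a hit.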
