Using test data for modelling

The description of the innovation track mentions that the model should be trained only on the training dataset provided for this competition. Would it be possible to use the test data (without labels of course) to incorporate more information into the model since the usage of external data is prohibited?

Thanks!

It’s generally a bad idea to use your test set (or your validation set, for that matter) to construct the model. If you use the test set for anything other than evaluation, it no longer gives you a fair estimate of your model’s performance on unseen data, because the data has been seen. You will likely overfit the test set, and when it comes to evaluation on the truly unseen test set you will see a big drop in performance. Using the test set in an unsupervised way could potentially boost your performance on the unseen test set, but you would really be rolling the dice on your model’s performance on unseen data, which defeats the purpose of the test set.

Thanks for the detailed response! My question, though, was about the legality of using the test set for modelling.

I’m also interested in this. Although using the test inputs is not without risks, as discussed above, it’s usually allowed in such competitions. There are several arguments for it:

  • We see this data anyway, so it’s impossible to completely avoid information from it leaking into the model
  • The hosts said that they will test submissions on completely unseen data anyway, which eliminates the overfitting concerns
  • In real-world usage scenarios, we also have the unlabeled test data, so it’s possible to re-train the model using the new samples, although it may not be practical

In terms of legality, I don’t believe it’s against the rules, as test data isn’t really “external”. It does matter how you use the test data, though, so that the model doesn’t overfit the “seen” test set. There are, however, approaches like pseudo-labelling and unsupervised learning which might be beneficial.
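
For anyone unfamiliar with pseudo-labelling, here is a minimal sketch of the idea, assuming a toy nearest-centroid classifier and synthetic 2-D data. None of the names here (`fit_centroids`, `predict_proba`, `pseudo_label_fit`, the 0.9 confidence threshold) come from the competition; they are made up for illustration, and a real model would take the place of the centroid classifier.

```python
import numpy as np

def fit_centroids(X, y, n_classes):
    """Toy 'model': one mean vector per class (assumes every class has samples)."""
    return np.stack([X[y == c].mean(axis=0) for c in range(n_classes)])

def predict_proba(centroids, X):
    """Softmax over negative squared distances to each class centroid."""
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    logits = -d2
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

def pseudo_label_fit(X_train, y_train, X_test, n_classes, threshold=0.9):
    """Fit on labelled data, label confident test rows, then refit on both."""
    centroids = fit_centroids(X_train, y_train, n_classes)
    proba = predict_proba(centroids, X_test)
    confident = proba.max(axis=1) >= threshold  # keep only confident guesses
    X_aug = np.vstack([X_train, X_test[confident]])
    y_aug = np.concatenate([y_train, proba[confident].argmax(axis=1)])
    return fit_centroids(X_aug, y_aug, n_classes), int(confident.sum())

# Synthetic stand-in data: two well-separated classes (not competition data).
rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (20, 2))])
y_train = np.repeat([0, 1], 20)
X_test = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

centroids, n_used = pseudo_label_fit(X_train, y_train, X_test, n_classes=2)
print(f"pseudo-labelled {n_used} of {len(X_test)} unlabelled test rows")
```

The threshold is the main knob: set it too low and wrong pseudo-labels reinforce the model's own mistakes, which is the overfitting risk mentioned earlier in the thread.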

Thanks for your reply. The innovation track specifically asks us to train the model only on the training dataset and hence I decided to ask. But I guess what you say makes sense.

Hi all - Thanks for these questions. We’ve updated the language on the Innovation Track page to clarify.

What data can I use for the model I submit to the Innovation Track? The model you submit to the Innovation Track should be the same as the corresponding submission to the Prediction Track. As such, the model should be trained only on the dataset provided for this competition. External data may also be used in the Innovation Track for the purposes of showcasing the capabilities of the model post-training. As with the Prediction Track, finalists for the Innovation Track will have their model performance validated against an out-of-sample verification set; teams judged to have violated rules regarding data usage will be disqualified.

In other words, the rules for training are the same as for the Prediction Track.

Since the host’s reply isn’t very clear on using test data to create models, I wanted to confirm whether it is okay to use test data to train models (assuming there is a way to do so) for the prediction task.

I guess it’s fine to use the test data; they have updated the description.

Seems pretty clear to me. Train your model on the dataset provided for this competition. The dataset provided to us is whatever you see available for download here: https://www.drivendata.org/competitions/63/genetic-engineering-attribution/data/

In other words, feel free to use train + test set. Just be aware we’re being evaluated on a hold-out set 👍.

I have to apologise: this is my first time taking part in a DrivenData competition, so my understanding may not be correct. This is what I gather:
[Image: Selection_094]

It is unclear whether the private dataset is part of test.csv or a separate out-of-sample set. Either way, we are preparing for all possible cases until there is further clarification.

Also, there is no need to select a submission. All submissions will be evaluated on the private dataset, and the best one will appear on the private leaderboard.

I agree, this is unclear. If there are more labs in the holdout set, and there will be no re-training, then it’s not possible to predict these additional labs exactly. At most, we can hope that the model could identify that there are samples from unknown labs, and not assign them to any of the existing labs. This is already a very tough proposition for this task, though. And the output format doesn’t really allow for that: we could predict all zeros in this case, but this will not work with the top-10 metric.
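
To make the last point concrete, here is a rough sketch of a top-10 accuracy scorer. It is an assumption about how such a metric might be computed, not the hosts' actual scoring code; `top_k_accuracy` and the 12-lab toy setup are hypothetical. It shows that an all-zero row carries no ranking, so whether it scores as a hit depends only on how ties happen to be broken.

```python
import numpy as np

def top_k_accuracy(y_true, proba, k=10):
    """Fraction of rows whose true class is among the k highest-scored classes."""
    # A stable sort makes tie-breaking deterministic (by column index).
    top_k = np.argsort(proba, axis=1, kind="stable")[:, -k:]
    return float((top_k == y_true[:, None]).any(axis=1).mean())

n_labs = 12  # toy number of labs, not the competition's
y_true = np.array([3, 0])

proba = np.zeros((2, n_labs))
proba[0, 3] = 1.0  # row 0: a confident, correct prediction
# row 1 is left as all zeros, i.e. the "unknown lab" idea from the post above

print(top_k_accuracy(y_true, proba))  # 0.5 under this tie-breaking rule

# Any constant row ranks identically, so "all zeros" is not a distinct signal.
constant = np.full((2, n_labs), 0.5)
zeros = np.zeros((2, n_labs))
print(top_k_accuracy(y_true, constant) == top_k_accuracy(y_true, zeros))  # True
```

In this sketch the all-zero row is scored purely by column order, so it cannot express "none of the known labs".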