Submission scores way higher

I hope this question is not too basic. I am dividing the training dataset into a train and validation set, and my predictions achieve MAE scores around 16-17 on the validation set, but every time I submit predictions for the test features dataset the MAE comes back between 24-27. Is this just a result of my model overfitting the training dataset and thus not generalizing well to new data, or is there something else I am missing here?

I appreciate any help or guidance anyone can offer. I just want to make sure that I am not overlooking something simple in my naivete.

If you are measuring your errors on a local validation set that you do not use to tune parameters or make any other modeling decisions, then this is not an overfitting problem (otherwise that validation error would be high as well).

It is more likely that the distributions of your local validation set and the DrivenData test set differ significantly.

One cool trick to combat this is to solve an auxiliary classification problem as a preprocessing step: label all training samples with class 0 and all test samples with class 1, then train a classifier to separate them. If the training and test sets are similar, you expect that classifier to achieve only random performance; if they differ, it will be able to distinguish test points from training points.
Finally, you can use that classifier, in cross-validation, to generate predictions for your training points and take the training samples with the highest predicted probability of belonging to the test set as your local validation set.

An example can be found here: https://github.com/jimfleming/numerai/blob/master/prep_data.py
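For illustration, here is a minimal sketch of that "adversarial validation" idea using scikit-learn. The DataFrame names (`train_df`, `test_df`), the classifier choice, and the 20% validation fraction are assumptions for the example, not taken from the linked repo or the competition:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

# Assumed inputs: train_df and test_df are DataFrames with the same
# (numeric) feature columns. Label train rows 0 and test rows 1.
X = pd.concat([train_df, test_df], axis=0, ignore_index=True)
y = np.concatenate([np.zeros(len(train_df)), np.ones(len(test_df))])

# Out-of-fold probability that each row "looks like" a test row.
clf = RandomForestClassifier(n_estimators=200, random_state=0)
proba = cross_val_predict(clf, X, y, cv=5, method="predict_proba")[:, 1]

# If these probabilities hover around chance, train and test are similar.
# Otherwise, use the most test-like training rows as the local validation set.
train_proba = proba[: len(train_df)]
n_valid = int(0.2 * len(train_df))          # assumed validation fraction
order = np.argsort(train_proba)
local_train = train_df.iloc[order[:-n_valid]]
local_valid = train_df.iloc[order[-n_valid:]]  # most test-like rows
```

With a validation set built this way, your local MAE should track the leaderboard score more closely, since it is measured on the training rows that most resemble the test distribution.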


@Gillesvdw This is so cool! Thanks for sharing
