Submission scores way higher

I hope this question is not too basic. I am dividing the training dataset into a train and validation set, and my predictions achieve MAE scores around 16-17 on the validation set, but every time I submit predictions for the test features dataset the MAE comes back between 24-27. Is this just a result of my model overfitting the training dataset and thus not generalizing well to new data, or is there something else I am missing here?

I appreciate any help or guidance anyone can offer. I just want to make sure that I am not overlooking something simple in my naivete.

If you are measuring your errors on a local validation set that you do not use to tune parameters or make any other modeling decisions, then this is not an overfitting problem (otherwise that validation error would be high as well).

It is more likely that the distributions of your local validation set and the DrivenData test set differ significantly.

One cool trick to combat this is to solve an auxiliary classification problem as a preprocessing step: label all training samples with class 0 and all test samples with class 1, then train a classifier to separate them. If the training and test sets are similar, you expect that classifier to achieve only random performance; if they differ, it will be able to distinguish test points from training points.
Finally, you can use that classifier, in cross-validation, to generate predictions for your training points and take the training samples with the highest predicted probability of belonging to the test set as your local validation set.

An example can be found here: https://github.com/jimfleming/numerai/blob/master/prep_data.py
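For illustration, here is a minimal sketch of that "adversarial validation" idea using scikit-learn. The DataFrame names (`train_df`, `test_df`), the classifier choice, and the 20% validation fraction are assumptions for the example, not taken from the linked repo or the competition:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

# Assumed inputs: train_df and test_df are DataFrames with the same
# (numeric) feature columns. Label train rows 0 and test rows 1.
X = pd.concat([train_df, test_df], axis=0, ignore_index=True)
y = np.concatenate([np.zeros(len(train_df)), np.ones(len(test_df))])

# Out-of-fold probability that each row "looks like" a test row.
clf = RandomForestClassifier(n_estimators=200, random_state=0)
proba = cross_val_predict(clf, X, y, cv=5, method="predict_proba")[:, 1]

# If these probabilities hover around chance, train and test are similar.
# Otherwise, use the most test-like training rows as the local validation set.
train_proba = proba[: len(train_df)]
n_valid = int(0.2 * len(train_df))          # assumed validation fraction
order = np.argsort(train_proba)
local_train = train_df.iloc[order[:-n_valid]]
local_valid = train_df.iloc[order[-n_valid:]]  # most test-like rows
```

With a validation set built this way, your local MAE should track the leaderboard score more closely, since it is measured on the training rows that most resemble the test distribution.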


@Gillesvdw This is so cool! Thanks for sharing
