Train / Test split

Hello,

Can we have some information on how the train/test split was done?
It does not look random: when I do a train/validation split on the training data, the RMSE on my validation set is completely different from the one on the leaderboard.
Perhaps there is some structure in the split that a random split does not capture?

Do you need more detail than this passage from the problem description?

“For the test set, windows of 4 days have been removed from the training
data at regular intervals, and competitors will attempt to predict most
accurately the yield during these intervals.”

If it’s any help, I’m doing my CV based on days of the month: for the first fold I hold out data from the 1st to the 4th of any month, for the second fold the 5th to the 8th, and so on.
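For what it’s worth, here’s a rough sketch of that fold assignment, assuming a pandas DataFrame with a datetime column (the column name "timestamp" is just a placeholder, not the competition’s schema):

```python
import pandas as pd

def day_of_month_folds(timestamps, window=4):
    """Assign a fold index to each timestamp from its day of the month.

    Fold 0 holds out days 1-4, fold 1 days 5-8, and so on, so the held-out
    windows mirror the 4-day blocks removed from the training data.
    """
    days = pd.DatetimeIndex(timestamps).day   # 1..31
    return (days - 1) // window               # 0..7 when window=4

# Hypothetical usage with a DataFrame `df` that has a "timestamp" column:
# df["fold"] = day_of_month_folds(df["timestamp"])
# for k in sorted(df["fold"].unique()):
#     train, valid = df[df["fold"] != k], df[df["fold"] == k]
#     # fit on `train`, compute RMSE on `valid`
```

With a window of 4, the last fold only covers days 29–31 of each month, which is one reason the folds come out slightly different sizes.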

My RMSE is higher in CV than the leaderboard, but they seem quite proportional. Models that work well locally get better leaderboard scores.

My folds are slightly different sizes and the RMSE varies quite a bit between folds, but I think that’s just how the data actually looks.


Hi @oliviers,

Just to build on @timcdlucas’s response, the benchmark blog post has some discussion:

This isn’t your grandpa’s random train/test split

Here’s another fun insight: this problem has a time component and in the real world we are trying to predict the future. That is, we’re trying to figure out the upcoming yield based on current weather. For those of us concerned about overfitting (hint: all of us), we will need to think hard about our modeling assumptions.

So, things that we could do but probably shouldn’t:

  • Imputing missing values using all of the data.
  • Treating every data point as if it stands alone and is independent from other points in time.
  • Drawing on weather that hasn’t happened yet to inform our current predictions.

There’s also a plot in the blog post that gives a visual overview.
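To make the imputation point concrete, here is a minimal, hypothetical sketch (made-up column names, not the competition’s schema) of filling gaps in a weather variable using only past observations, rather than statistics computed over the whole dataset:

```python
import numpy as np
import pandas as pd

# Toy frame for illustration; "timestamp" and "temperature" are placeholders.
df = pd.DataFrame({
    "timestamp": pd.date_range("2014-01-01", periods=6, freq="D"),
    "temperature": [3.0, np.nan, 4.5, np.nan, np.nan, 6.0],
})

# Leakage-safe imputation: sort by time and carry the last observed value
# forward, so every filled value depends only on the past.
df = df.sort_values("timestamp")
df["temperature_filled"] = df["temperature"].ffill()

# By contrast, filling with the overall mean (commented out below) uses
# future observations to fill past gaps, which is the kind of leakage the
# list above warns against:
# df["temperature_filled"] = df["temperature"].fillna(df["temperature"].mean())
```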


Thank you for this info. I missed the “windows of 4 days” passage in the problem description.
I am probably doing something wrong…
Thanks again