Train / Test split

Hello,

Can we have some information on how the train/test split was done?
It does not look random: when I do a train/validation split on the training data, the RMSE on my validation set is completely different from the one on the leaderboard.
Perhaps there is some structure in the split that a random split does not capture?

Do you need more detail than this passage from the problem description?

“For the test set, windows of 4 days have been removed from the training
data at regular intervals, and competitors will attempt to predict most
accurately the yield during these intervals.”

If it’s any help, I’m doing my CV based on days of the month: for the first fold I hold out data from the 1st to the 4th of any month, for the second fold the 5th to the 8th, and so on.
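For what it’s worth, here’s a rough sketch of that fold assignment, assuming a pandas DataFrame with a datetime column (the column name "timestamp" is just a placeholder, not the competition’s schema):

```python
import pandas as pd

def day_of_month_folds(timestamps, window=4):
    """Assign a fold index to each timestamp from its day of the month.

    Fold 0 holds out days 1-4, fold 1 days 5-8, and so on, so the held-out
    windows mirror the 4-day blocks removed from the training data.
    """
    days = pd.DatetimeIndex(timestamps).day   # 1..31
    return (days - 1) // window               # 0..7 when window=4

# Hypothetical usage with a DataFrame `df` that has a "timestamp" column:
# df["fold"] = day_of_month_folds(df["timestamp"])
# for k in sorted(df["fold"].unique()):
#     train, valid = df[df["fold"] != k], df[df["fold"] == k]
#     # fit on `train`, compute RMSE on `valid`
```

With a window of 4, the last fold only covers days 29–31 of each month, which is one reason the folds come out slightly different sizes.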

My RMSE is higher in CV than the leaderboard, but they seem quite proportional. Models that work well locally get better leaderboard scores.

My folds are slightly different sizes and the RMSE varies quite a bit between folds, but I think that’s just how the data actually looks.


Hi @oliviers,

Just to build on @timcdlucas’s response, the benchmark blog post has some discussion:

This isn’t your grandpa’s random train/test split

Here’s another fun insight: this problem has a time component and in the real world we are trying to predict the future. That is, we’re trying to figure out the upcoming yield based on current weather. For those of us concerned about overfitting (hint: all of us), we will need to think hard about our modeling assumptions.

So, things that we could do but probably shouldn’t:

  • Imputing missing values using all of the data.
  • Treating every data point as if it stands alone and is independent from other points in time.
  • Drawing on weather that hasn’t happened yet to inform our current predictions.

There’s also a plot in the blog post that gives a visual overview.
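To make the imputation point concrete, here is a minimal, hypothetical sketch (made-up column names, not the competition’s schema) of filling gaps in a weather variable using only past observations, rather than statistics computed over the whole dataset:

```python
import numpy as np
import pandas as pd

# Toy frame for illustration; "timestamp" and "temperature" are placeholders.
df = pd.DataFrame({
    "timestamp": pd.date_range("2014-01-01", periods=6, freq="D"),
    "temperature": [3.0, np.nan, 4.5, np.nan, np.nan, 6.0],
})

# Leakage-safe imputation: sort by time and carry the last observed value
# forward, so every filled value depends only on the past.
df = df.sort_values("timestamp")
df["temperature_filled"] = df["temperature"].ffill()

# By contrast, filling with the overall mean (commented out below) uses
# future observations to fill past gaps, which is the kind of leakage the
# list above warns against:
# df["temperature_filled"] = df["temperature"].fillna(df["temperature"].mean())
```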


Thank you for this info. I missed the “windows of 4 days” passage in the problem description.
I am probably doing something wrong…
Thanks again