Cross-validation and public leaderboard

iorana · January 18, 2021, 1:30pm

Hi everyone,

I use the same cross-validation scheme as suggested in the benchmark model: 6000 observations for validation and 3000 for test. Each time I get nice results on validation, but my score on the public leaderboard does not change much.

How is it for you? Have you managed to achieve good correlation between local scores and those on the leaderboard?

volcanix280 · January 18, 2021, 11:10pm

My leaderboard score has only been 2-3 points higher than my local tests. I think this might be due to gaps in the data or other issues suggested in the demo code.

iorana · January 19, 2021, 8:46am

Thank you!
It is the same for me.
My local score is about 3 points less than the one I have on the public leaderboard.

adalseno · January 20, 2021, 7:18pm

And what is even worse is that the score seems erratic: I got a significantly better score on the public leaderboard with a model that was performing slightly worse in my private test.

isms · January 20, 2021, 7:57pm

You should expect these numbers to be different. Your model is being evaluated on similar but not identical data. The score will certainly vary between your local data and the public leaderboard, and between public and private. What is important is, as always, getting to the heart of the predictive problem while avoiding overfitting.

gowrishankarin · January 25, 2021, 3:43am

I totally agree with the observations. There is no correlation between what we see in the benchmark and what we observe from the grader.

@isms "getting to the heart of the predictive problem while avoiding overfitting" - words of wisdom.

PredictorX · February 5, 2021, 10:44pm

There are two issues here, probably both of which are at play:

First, the leaderboard test data may 1. be statistically different from the public development data, and 2. be small enough to deliver noisy performance measurements. There are good reasons (such as the same process observed over different time periods, which happens in real-world situations) and bad reasons (perhaps the competition administrators sampled poorly) for the leaderboard data to differ from the development data.
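To make the "small enough to be noisy" point concrete, here is a minimal sketch. It assumes a hypothetical model with a fixed true accuracy of 0.7 and a simple accuracy metric, neither of which comes from this competition. It measures how much the observed score wobbles across repeated holdout samples of different sizes.

```python
import random
import statistics

def holdout_score_spread(n_holdout, true_accuracy=0.7, n_repeats=200, seed=0):
    """Standard deviation of the measured accuracy over repeated holdout draws.

    Assumption for illustration: each prediction is independently correct
    with probability `true_accuracy`.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(n_repeats):
        correct = sum(rng.random() < true_accuracy for _ in range(n_holdout))
        scores.append(correct / n_holdout)
    return statistics.stdev(scores)

# Same model quality every time; only the holdout size changes.
spreads = {n: holdout_score_spread(n) for n in (100, 1000, 10000)}
for n, spread in spreads.items():
    print(f"holdout size {n:>5}: score standard deviation {spread:.4f}")
```

The model never changes here, so any difference between two measured scores is pure sampling noise, and that noise shrinks as the holdout grows.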

Second, test performance on the development data may suffer from several problems: 1. the holdout data contains too few observations, therefore performance measurement is noisy, 2. the holdout data may have been sampled improperly, hence is statistically different from the training data, 3. the analyst may have repeated their analysis so often as to statistically invalidate the leaderboard testing.

That last point about abusing the leaderboard test by repeating it too many times is subtle, and, in my experience, not well understood by most people who participate in data analysis competitions. Think about it this way: if your model were a random number generator, and you repeated the leaderboard test 50 times with different random number seeds, some of your “model” results would score better than others. It should be obvious, though, that no random number seed is actually any better than any other.
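The thought experiment can be sketched in a few lines. This is only an illustration under stated assumptions: a made-up binary-label leaderboard of 200 rows scored by accuracy, with the "models" being coin flips that differ only in their seed.

```python
import random

def leaderboard_score(seed, truth):
    """Accuracy of a 'model' that just flips coins, seeded by `seed`."""
    rng = random.Random(seed)
    preds = [rng.randint(0, 1) for _ in truth]
    return sum(p == t for p, t in zip(preds, truth)) / len(truth)

def simulate(n_submissions=50, n_test=200, data_seed=0):
    """Score `n_submissions` random 'models' against one fixed hidden test set."""
    data_rng = random.Random(data_seed)
    truth = [data_rng.randint(0, 1) for _ in range(n_test)]
    return [leaderboard_score(seed, truth) for seed in range(1, n_submissions + 1)]

scores = simulate()
best, worst = max(scores), min(scores)
mean = sum(scores) / len(scores)
print(f"best {best:.3f}, worst {worst:.3f}, mean {mean:.3f}")
```

The best seed looks better than the worst one on this test set, yet every one of these "models" has the same expected accuracy of 0.5 on fresh data. Picking the top scorer after many submissions selects for luck in exactly the same way.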

adalseno · February 8, 2021, 2:43pm

It’s true that fitting the test set too closely is a type of overfitting, though you should still expect consistent results from your training data. How can you select the best model otherwise? In real life you don’t have a test set to “score” your model.
