I use the same cross-validation scheme as suggested in the benchmark model: 6000 observations for validation and 3000 for test. Each time I get good results on validation, but my score on the public leaderboard barely changes.
How is it for you? Have you managed to achieve good correlation between local scores and those on the leaderboard?
And what is even worse is that the score seems erratic: I got a significantly better score on the public leaderboard with a model that performed slightly worse on my private test.
You should expect these numbers to be different. Your model is being evaluated on similar but not identical data. The score will certainly vary between your local data and the public leaderboard, and between public and private. What is important is, as always, getting to the heart of the predictive problem while avoiding overfitting.
There are two issues here, and probably both are at play:
First, the leaderboard test data may (1) be statistically different from the public development data and (2) be small enough to deliver noisy performance measurements. There are good reasons (for example, same process but different time periods, which happens in real-world situations) and bad reasons (for example, the competition administrators may have sampled poorly) for the leaderboard data to differ from the development data.
Second, test performance on the development data may suffer from several problems: (1) the holdout data contains too few observations, so the performance measurement is noisy; (2) the holdout data may have been sampled improperly and hence is statistically different from the training data; (3) the analyst may have repeated their analysis so often as to statistically invalidate the leaderboard testing.
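To put a rough number on the "too few observations" problem, here is a minimal sketch. It assumes the metric is plain classification accuracy (the thread does not say which metric the competition uses) and that observations are independent. The function name and the 0.80 accuracy are made up for illustration; 3000 and 6000 are the set sizes mentioned in the question.

```python
import math

def accuracy_standard_error(accuracy, n_observations):
    """Binomial standard error of an accuracy measured on n independent observations."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_observations)

# Hypothetical accuracy of 0.80; set sizes taken from the question above.
for n in (3000, 6000):
    se = accuracy_standard_error(0.80, n)
    # A rough 95% interval is about +/- 2 standard errors.
    print(f"n={n}: standard error ~ {se:.4f}, 95% interval ~ +/- {2 * se:.4f}")
```

Under these assumptions, an accuracy measured on 3000 observations is only good to roughly 1.5 percentage points either way, so two models whose scores differ by less than that cannot really be told apart on a set of that size. The same reasoning applies to a small leaderboard set.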
That last point about abusing the leaderboard test by repeating it too many times is subtle and, in my experience, not well understood by most people who participate in data analysis competitions. Think about it this way: if your model were a random number generator, and you repeated the leaderboard test 50 times with different random number seeds, some of your “model” results would score better than others. It should be obvious, though, that no random number seed is actually any better than any other.
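The thought experiment is easy to simulate. The sketch below is only an illustration under made-up assumptions: binary labels, accuracy as the score, and public and private sets of 1000 observations each. None of these numbers or names come from the competition.

```python
import random

def random_model_accuracy(labels, seed):
    """Accuracy of a 'model' that guesses 0 or 1 at random, driven by `seed`."""
    rng = random.Random(seed)
    hits = sum(1 for y in labels if rng.randint(0, 1) == y)
    return hits / len(labels)

# Hypothetical public and private leaderboard labels (sizes are assumptions).
data_rng = random.Random(0)
public_labels = [data_rng.randint(0, 1) for _ in range(1000)]
private_labels = [data_rng.randint(0, 1) for _ in range(1000)]

# "Submit" the same random model 50 times, changing only the seed.
public_scores = {seed: random_model_accuracy(public_labels, seed)
                 for seed in range(1, 51)}

best_seed = max(public_scores, key=public_scores.get)
mean_public = sum(public_scores.values()) / len(public_scores)
private_score_of_best = random_model_accuracy(private_labels, best_seed)

print(f"mean public score:          {mean_public:.3f}")
print(f"best public score:          {public_scores[best_seed]:.3f} (seed {best_seed})")
print(f"same seed on private data:  {private_score_of_best:.3f}")
```

The best of the 50 public scores sits above the average purely by chance, and there is no reason for that seed to keep its advantage on the private labels. Picking a real model by repeated leaderboard submissions has the same selection effect, just less visibly.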
It’s true that fitting the test set too closely is a type of overfitting, but you should still expect consistent results from your training data. How can you select the best model otherwise? In real life you don’t have a test set to “score” your model.