Sharing CV scores and LB scores

Hi everyone,

I'm starting this thread so that competitors can share their CV vs LB scores.
I have seen a very poor correlation between my CV and LB scores, so I wonder whether the same is happening to you.

Here are a few examples of my CV vs LB scores (note that my CV is computed from the out-of-fold predictions of a 5-fold cross-validation):
CV 0.390 → LB 0.4364
CV 0.376 → LB 0.4361
CV 0.360 → LB 0.4462
CV 0.377 → LB 0.4627
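
For anyone who wants to put a number on it, here is a minimal sketch that computes the Pearson correlation of the four pairs listed above. It uses only those scores; with four points the estimate is very noisy, so treat it as a rough indication rather than a measurement.

```python
import numpy as np

# CV and LB scores copied from the four submissions listed above
cv = np.array([0.390, 0.376, 0.360, 0.377])
lb = np.array([0.4364, 0.4361, 0.4462, 0.4627])

# Pearson correlation coefficient between CV and LB
r = np.corrcoef(cv, lb)[0, 1]
print(f"Pearson r = {r:.2f}")
```

On these four points the coefficient comes out weakly negative, i.e. a better CV did not go with a better LB.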

The correlation is very poor, as you can see. What about yours?


So far, mine have been fairly correlated, with LB only 0.02-0.03 higher than CV. I am using metadata and a very simple 2-layer conv with attention on the patches, plus an MLP head to classify the attention scores + embedded metadata.


Mine is similar to yours, with almost no correlation. Log loss is not well suited to this case. Suppose we have a single false negative and the public or private dataset contains 260 samples. If we predict 0.01 for a positive instance, that one prediction alone adds -ln(0.01)/260 ≈ 0.018 to the average loss.
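
The arithmetic behind that figure, as a small sketch. The 260 samples and the 0.01 prediction are the assumed values from the example above, not the real test set size.

```python
import math

n_samples = 260      # assumed size of the public or private set (from the example above)
p_positive = 0.01    # probability predicted for an instance that is actually positive

# Log loss of this single sample is -ln(p); averaging over the set divides it by n_samples
single_loss = -math.log(p_positive)
contribution = single_loss / n_samples

print(f"loss of the one sample: {single_loss:.3f}")
print(f"contribution to the mean log loss: {contribution:.4f}")  # about 0.018
```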


I ran a simulation from my out-of-fold predictions, assuming that 1/3 of the test data is in the public leaderboard and 2/3 in the private one. For each run I randomly draw a test set of the same size as the real one.

Looks like it’s hard to infer private score from public score.
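
A minimal sketch of this kind of simulation, for anyone who wants to try it on their own out-of-fold predictions. The labels and predictions below are synthetic stand-ins, and the test size of 260, the number of trials and the helper names are assumptions for illustration, not the actual setup used for the results above.

```python
import numpy as np

def log_loss(y_true, y_pred, eps=1e-15):
    """Binary log loss, with predictions clipped away from 0 and 1."""
    p = np.clip(y_pred, eps, 1 - eps)
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))

def simulate_public_private(y_true, y_pred, test_size, n_trials=1000,
                            public_frac=1 / 3, seed=0):
    """Draw random pseudo test sets from OOF predictions and score a public/private split."""
    rng = np.random.default_rng(seed)
    n_public = int(round(test_size * public_frac))
    public, private = [], []
    for _ in range(n_trials):
        # Random subset of the OOF rows, in random order
        idx = rng.choice(len(y_true), size=test_size, replace=False)
        pub, priv = idx[:n_public], idx[n_public:]
        public.append(log_loss(y_true[pub], y_pred[pub]))
        private.append(log_loss(y_true[priv], y_pred[priv]))
    return np.array(public), np.array(private)

# Synthetic stand-in for real OOF labels and predictions (assumption)
rng = np.random.default_rng(42)
y_true = rng.integers(0, 2, size=1000)
y_pred = np.clip(0.5 + 0.3 * (2 * y_true - 1) + rng.normal(0, 0.2, size=1000), 0.01, 0.99)

public, private = simulate_public_private(y_true, y_pred, test_size=260)
print("public  std:", public.std())
print("private std:", private.std())
print("public/private correlation:", np.corrcoef(public, private)[0, 1])
```

The spread of the public scores across trials gives a rough idea of how much a single leaderboard score can move by chance alone.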

I also compared two of my models with similar CVs (0.376 and 0.377) by scoring them on different random public sets.

There could be a large shakeup…


I am suffering from the same problem as Optimo and was wondering the same thing. It could be that some pre-processing or attention needs to be applied.

We achieved fairly high ROC AUC and accuracy scores locally, but we ran into exactly the same problem: we can confirm that there is no strong correlation between the local CV fold scores and the public leaderboard scores under the log loss metric. The cause is that the distribution of the public test set is unknown. This matters mainly because of the log loss metric, which scores the predicted probabilities themselves; a ranking metric such as ROC AUC would be much less affected. It is highly probable that fitting the models to that distribution achieves much better log loss scores than letting the model learn meaningful signals.
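
To illustrate why the unknown distribution matters for log loss, here is a small sketch under the assumption that "distribution" means the share of positive samples. It scores a model that outputs the same probability for every sample. The positive rates used are hypothetical, not estimates for this competition.

```python
import math

def constant_log_loss(p, pos_rate):
    """Mean log loss when every sample gets probability p and a fraction pos_rate is positive."""
    return -(pos_rate * math.log(p) + (1 - pos_rate) * math.log(1 - p))

# Hypothetical positive rates for a test set (assumptions, for illustration only)
for pos_rate in (0.3, 0.5):
    for p in (0.3, 0.5):
        print(f"pos_rate={pos_rate:.1f}, constant p={p:.1f}: "
              f"log loss={constant_log_loss(p, pos_rate):.3f}")
```

The constant prediction that matches the positive rate always scores best, so simply shifting predictions toward the test set's class balance can improve log loss without the model learning anything new.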

My best models have around 0.34 log loss and 0.82 AUC. What about yours, @zsolt.bedohazi?