Different results on personal test set and competition test set

I’ve split the labelled data into train/validation/test sets. I can produce models that generalise well to the test set I’ve created, but they perform worse on the (unlabelled) competition data. The increase in loss from my test results to the competition results is ~40% (i.e. from about 0.4 to about 0.56).
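For reference, here’s roughly how I’m doing the split — a minimal sketch, assuming a `train.csv` file and a `label` column (both names are placeholders, not the actual competition files):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

labelled = pd.read_csv("train.csv")  # hypothetical file name

# Hold out 20% for test, then 20% of the remainder for validation,
# stratifying on the label so class proportions match across splits.
train_val, test = train_test_split(
    labelled, test_size=0.2, stratify=labelled["label"], random_state=42
)
train, val = train_test_split(
    train_val, test_size=0.2, stratify=train_val["label"], random_state=42
)
```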

Is anyone else having this same issue? Is there a fundamental difference in the labelled and unlabelled data (e.g. taken from different geographical locations) that I’m missing?

Are you using the unverified examples? Use those only with extreme care; whatever automated process labelled them is not reliable.

Turns out there was a bug in my code - I thought I was using only verified examples, but the unverified ones had slipped in somehow! I’ve re-trained my model now and the losses are far more similar, plus the predicted class distribution on the competition data is much closer to the distribution in the original dataset.
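In case anyone hits the same thing, this is roughly what the fix looked like — a sketch only, assuming a `verified` flag column and a `label` column (both hypothetical names):

```python
import pandas as pd

labelled = pd.read_csv("train.csv")  # hypothetical file name

# The bug: unverified rows were slipping back in during preprocessing.
# The fix: filter on the verification flag after all other steps.
verified_only = labelled[labelled["verified"] == 1]

# Sanity check: the label distribution of the verified training data
# should roughly match the predicted class distribution on the
# competition data after re-training.
print(verified_only["label"].value_counts(normalize=True))
```

Printing the normalised class distribution like this is what showed me the two datasets had drifted apart in the first place.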

Thanks for pointing this out!

It seems to me that I am still facing this issue. After reaching a loss of about 0.5, my cross-validation and LB scores diverge (e.g. CV = 0.43, LB = 0.68).