I found that my performance on the testing data, as reported by the leaderboard, is much lower than my performance on the training data.
Is it because the testing data is much more challenging than the training data?
I just want to make sure whether the difference is due to the data itself or due to my code.
By the way, I was doing 10-fold cross-validation on the training data, and the weighted Brier score looked good, but once I submitted to the leaderboard, the weighted Brier score became much worse (it increased by 0.08~0.09).
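For reference, my cross-validation setup is roughly like the sketch below (the model, data, and plain Brier score here are placeholders, not my actual pipeline or the competition's weighting scheme): I compute out-of-fold probability predictions over 10 folds and score them, which is the number I was comparing against the leaderboard.

```python
# Minimal sketch of 10-fold CV scored with the Brier score.
# Model, data, and unweighted Brier are placeholders for illustration.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.metrics import brier_score_loss

X, y = make_classification(n_samples=500, random_state=0)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# Out-of-fold predicted probabilities for the positive class
proba = cross_val_predict(
    LogisticRegression(max_iter=1000), X, y, cv=cv, method="predict_proba"
)[:, 1]

cv_brier = brier_score_loss(y, proba)
print(f"10-fold CV Brier score: {cv_brier:.4f}")
```

If the leaderboard score is far worse than `cv_brier` computed this way, the gap could come from a train/test distribution shift or from a leak/bug in how the folds are built, which is exactly what I'm trying to distinguish.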
Has anyone else run into the same problem?
Or is your performance on the training and testing data quite close?