Help with setting up cross validation


I’m trying to set up my cross validation. I’m using stratified K-Fold and I wrote a mean log loss func.

I’m getting really different values in my CV and the leaderboard.

I’d appreciate any suggestions!

I’m not sure if my mean log loss is correct, but this is what I have written:

import numpy as np
from sklearn.metrics import log_loss

def a_mean_log_loss(y_true_a, y_pred_a, y_true_b, y_pred_b, y_true_c, y_pred_c):
    # Log loss for each country's predictions
    a_logloss = log_loss(y_true_a, y_pred_a)
    b_logloss = log_loss(y_true_b, y_pred_b)
    c_logloss = log_loss(y_true_c, y_pred_c)
    # Unweighted average of the three countries' log losses
    return np.mean([a_logloss, b_logloss, c_logloss])

I am doing the same thing. Also got an interesting difference.

However, the difference between CV and LB might or might not make sense. Since it’s log loss, you can’t really say unless you know which points you are being evaluated on.

I asked about this here, but no response so far. So, no idea yet.


You have to consider that the datasets have different numbers of rows, so you should use a weighted mean.


Thanks for the replies.

I tried weighted mean but that didn’t help. I still see very different numbers between CV and LB.

I also tried Adversarial validation but my classifiers could not distinguish between train and test samples, so according to what I’ve read on this, I think they are from similar distributions. Correct me if I’m wrong please.
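
For anyone who wants to repeat the adversarial validation check, here is a minimal sketch of the idea: label train rows 0 and test rows 1, then see whether a classifier can tell them apart. The function name `adversarial_auc`, the choice of logistic regression, and the synthetic data are all assumptions for illustration, not the code used above. An AUC near 0.5 suggests the two sets look alike; an AUC near 1.0 suggests a shift.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score


def adversarial_auc(X_train, X_test, n_splits=5, seed=0):
    """Cross-validated AUC of a classifier that tries to separate train from test rows."""
    X = np.vstack([X_train, X_test])
    # Target is the origin of each row: 0 = train, 1 = test
    is_test = np.concatenate([np.zeros(len(X_train)), np.ones(len(X_test))])
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    clf = LogisticRegression(max_iter=1000)
    scores = cross_val_score(clf, X, is_test, cv=cv, scoring="roc_auc")
    return float(scores.mean())


# Synthetic demo (made-up data): one "test" set from the same distribution,
# one from a shifted distribution.
rng = np.random.default_rng(0)
train = rng.normal(size=(300, 5))
test_same = rng.normal(size=(300, 5))
test_shifted = rng.normal(loc=1.5, size=(300, 5))

auc_same = adversarial_auc(train, test_same)
auc_shifted = adversarial_auc(train, test_shifted)
```

Note that this only compares feature distributions. It says nothing about whether the class balance of the test labels matches the training labels.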

Any other suggestions?

I suppose that the public LB is imbalanced and the private LB will be closer to CV.

I take my words back. I’ve got very close CV results. I think that if the score is much less than CV, then most likely this is a result of overfitting due to imbalanced data for countries B and C.

I tried both sklearn cross_val_score and my own CV function using stratified K fold. With both, I get scores that are really different from LB, without a trend I can notice…
For example: I tried a logistic regression with no feature engineering. On CV, I get ~ 0.3 but then I get ~3. on the LB. -------- here, CV is much lower than LB

However, I tried a lightgbm model and I get a CV score that is lower than the LB by ~0.1. --------CV is slightly lower than LB

Then, still using the same lightgbm parameters, I removed some features, transform, etc. and I saw little change in CV, but ~2-fold improvement on the leaderboard over the same model with all the features. It seems to be all over the place. -------- CV is much higher than LB.

If you could point out any problems with my approach, please let me know.
My approach:

  1. load data
  2. Stratified KFold to split data
  3. Preprocess/normalize etc. on the current fold (I also tried doing this before splitting the data, but wasn’t sure if that would cause a data leak)
  4. Train, Predict, and store those predictions
  5. calculate LogLoss on those preds
  6. take mean of all three country’s logloss to get final mean log loss (tried weighted average as well, but it isn’t a big difference)
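
The six steps above can be sketched roughly as follows. This is only one way to do it, under assumptions: the scaler and logistic regression stand in for whatever preprocessing and model are actually used, the three "countries" are synthetic arrays, and `oof_log_loss` is a hypothetical helper name. Putting the scaler inside a pipeline means it is fit on the training part of each fold only, which avoids the leak mentioned in step 3.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def oof_log_loss(X, y, n_splits=5, seed=0):
    """Log loss of out-of-fold predictions for one country."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    oof = np.zeros(len(y))
    for train_idx, valid_idx in skf.split(X, y):
        # The scaler is refit on each training fold, never on validation rows
        model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
        model.fit(X[train_idx], y[train_idx])
        oof[valid_idx] = model.predict_proba(X[valid_idx])[:, 1]
    return log_loss(y, oof, labels=[0, 1])


# Synthetic stand-ins for the three country datasets (made-up data)
rng = np.random.default_rng(0)
countries = {}
for name, n_rows in [("A", 460), ("B", 180), ("C", 360)]:
    X = rng.normal(size=(n_rows, 4))
    y = (X[:, 0] + rng.normal(scale=0.5, size=n_rows) > 0).astype(int)
    countries[name] = (X, y)

losses = {name: oof_log_loss(X, y) for name, (X, y) in countries.items()}
mean_loss = float(np.mean(list(losses.values())))
```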

1. First of all, you should pay attention to the imbalance of the classes and choose your own strategy. (Validation data should be chosen before any transformation.)
2. Instead of np.mean, you should use np.average and calculate the weight of each prediction by the number of values in the resulting file (10% difference).
3. Before any feature engineering, try to create your own baseline prediction with very close CV and LB :wink:

I hope this helps you.


Thanks for the tips. I will try these out.


Did you manage to solve this?

Sagol, which one of the solutions proposed on elite data science would you recommend for this case?

Yes, I think I got it working correctly now. I submitted several submissions to see how the LB scores match up to my CV scores, and it’s pretty close.

I think the key is to put the upsampling or downsampling inside the CV loop. Once I did that, I was getting closer scores between CV and LB.
In addition to what sagol recommended, I looked at this site: and this site:
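
A rough sketch of what "resampling inside the CV loop" can look like, assuming binary 0/1 labels and plain random upsampling of the minority class. `upsample_minority` and `cv_with_upsampling` are hypothetical helper names, and the data is synthetic. The point is that only the training part of each fold is resampled, while the validation part keeps its original class balance.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import StratifiedKFold
from sklearn.utils import resample


def upsample_minority(X, y, seed=0):
    """Duplicate minority-class rows (with replacement) until both classes are equal in size."""
    classes, counts = np.unique(y, return_counts=True)
    n_extra = int(counts.max() - counts.min())
    if n_extra == 0:
        return X, y
    minority = classes[np.argmin(counts)]
    X_extra = resample(X[y == minority], replace=True, n_samples=n_extra, random_state=seed)
    return np.vstack([X, X_extra]), np.concatenate([y, np.full(n_extra, minority)])


def cv_with_upsampling(X, y, n_splits=5, seed=0):
    """Mean validation log loss, upsampling the training fold only."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    fold_losses = []
    for train_idx, valid_idx in skf.split(X, y):
        # Validation rows are left untouched, so no duplicated row is scored
        X_tr, y_tr = upsample_minority(X[train_idx], y[train_idx], seed=seed)
        model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        proba = model.predict_proba(X[valid_idx])[:, 1]
        fold_losses.append(log_loss(y[valid_idx], proba, labels=[0, 1]))
    return float(np.mean(fold_losses))


# Synthetic imbalanced data, roughly 90/10 (made-up)
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = (X[:, 0] + rng.normal(scale=0.5, size=500) > 1.4).astype(int)

X_bal, y_bal = upsample_minority(X, y)
score = cv_with_upsampling(X, y)
```

If the resampling is done before splitting, copies of the same minority row can end up in both the training and validation parts, which tends to make CV look better than it should.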


This link was helpful to me to get an approximate LogLoss:

As per the link:


Extract LogLoss from, for example, the country A model:


Then take a weighted average of the countries’ LogLoss values by their relative number of rows: 46% in A, 18% in B, and 36% in C.
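
As a small sketch of that weighted average using np.average: the weights are the row shares quoted above, while the per-country log loss values and the name `weighted_mean_log_loss` are made up for illustration.

```python
import numpy as np

# Row shares quoted above: 46% in A, 18% in B, 36% in C
COUNTRY_WEIGHTS = {"A": 0.46, "B": 0.18, "C": 0.36}


def weighted_mean_log_loss(losses, weights=COUNTRY_WEIGHTS):
    """Average per-country log losses, weighting each country by its share of rows."""
    names = sorted(losses)
    return float(np.average([losses[n] for n in names],
                            weights=[weights[n] for n in names]))


# Made-up per-country log losses, for illustration only
example_losses = {"A": 0.30, "B": 0.20, "C": 0.10}
weighted = weighted_mean_log_loss(example_losses)
plain = float(np.mean(list(example_losses.values())))
```

With these made-up numbers the weighted mean is pulled toward country A's score, because A has the largest share of rows.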

  1. Before any feature engineering try to create your own baseline prediction with very close CV and LB

I suspect that the LB distribution is closer to 50/50, whereas the training examples are 90/10 for countries B and C, so how do we set up our cross-validation to mimic this?
Moreover, I tried a simple RandomOverSampler on the training set and I got better results than the benchmark solution (everything the same except the sampling method), whereas when I applied a more sophisticated combined oversampling and undersampling method called SMOTEENN, the results were far worse (I saw that this method reduces the number of training examples).
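
One possible way to approach the 50/50 question, offered as an assumption rather than something confirmed in this thread: keep the validation folds as they are, but reweight the rows when scoring so that each class carries the share it is suspected to have on the LB. sklearn's log_loss accepts a sample_weight argument for this. `balanced_log_loss` and `target_pos_rate` are hypothetical names, and the 50/50 figure is only the suspicion stated above.

```python
import numpy as np
from sklearn.metrics import log_loss


def balanced_log_loss(y_true, y_prob, target_pos_rate=0.5):
    """Log loss with rows reweighted so positives carry target_pos_rate of the total weight."""
    y_true = np.asarray(y_true)
    pos_rate = y_true.mean()
    weights = np.where(
        y_true == 1,
        target_pos_rate / pos_rate,              # up-weight the rare class
        (1 - target_pos_rate) / (1 - pos_rate),  # down-weight the common class
    )
    return log_loss(y_true, y_prob, sample_weight=weights, labels=[0, 1])


# Toy 90/10 example: a model that always predicts the base rate of 0.1
y_true = np.array([0] * 9 + [1])
y_prob = np.full(10, 0.1)

plain = log_loss(y_true, y_prob, labels=[0, 1])
balanced = balanced_log_loss(y_true, y_prob)
```

In this toy case the plain log loss is about 0.33, while the reweighted one is about 1.20, which shows how a score that looks fine on 90/10 validation data could be much worse if the evaluation set were balanced.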

I think you are referring to SMOTE. With SMOTE you create synthetic data for the minority class based on K neighbors. I tried it and I didn’t see improvements either; I used other techniques to improve the LB score.