I’m trying to set up my cross-validation. I’m using stratified K-fold and I wrote a mean log loss function.
I’m getting very different values from my CV and from the leaderboard.
I’d appreciate any suggestions!
I’m not sure if my mean log loss is correct, but this is what I have written:
from sklearn.metrics import log_loss
import numpy as np

def a_mean_log_loss(y_true_a, y_pred_a, y_true_b, y_pred_b, y_true_c, y_pred_c):
    # per-country log loss
    a_logloss = log_loss(y_true_a, y_pred_a)
    b_logloss = log_loss(y_true_b, y_pred_b)
    c_logloss = log_loss(y_true_c, y_pred_c)
    # unweighted average of each country's log loss
    return np.mean([a_logloss, b_logloss, c_logloss])
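To sanity-check it, I call it with tiny made-up arrays (placeholder values, not real competition data):

y_a, p_a = [0, 1, 1], [0.1, 0.8, 0.7]
y_b, p_b = [0, 0, 1], [0.2, 0.3, 0.9]
y_c, p_c = [1, 0], [0.6, 0.4]
print(a_mean_log_loss(y_a, p_a, y_b, p_b, y_c, p_c))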
I am doing the same thing and also got an interesting difference.
However, the difference between CV and LB might or might not make sense. Since the metric is log loss, you can’t really say unless you know which points you are being evaluated on: a handful of confidently wrong predictions is enough to blow the score up.
I asked about this here, but no response so far. So, no idea yet.
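As a quick illustration of how sensitive log loss is (toy numbers, nothing to do with the actual test set):

import numpy as np
from sklearn.metrics import log_loss

y = np.array([1, 1, 1, 1, 0, 0, 0, 0])
good = np.array([0.9, 0.8, 0.9, 0.7, 0.1, 0.2, 0.1, 0.3])
bad = good.copy()
bad[0] = 0.01  # one confident mistake on a positive example

print(log_loss(y, good))  # ~0.20
print(log_loss(y, bad))   # ~0.76, dominated by the single -log(0.01) term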
I tried a weighted mean, but that didn’t help; I still see very different numbers between CV and LB.
I also tried adversarial validation, but my classifiers could not distinguish between train and test samples, so from what I’ve read, that suggests they come from similar distributions. Correct me if I’m wrong, please.
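For context, this is roughly what I mean by adversarial validation (a minimal sketch, assuming X_train and X_test are numeric feature matrices with the same columns):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# label train rows 0 and test rows 1, then try to tell them apart
X = np.vstack([X_train, X_test])
y = np.concatenate([np.zeros(len(X_train)), np.ones(len(X_test))])

clf = RandomForestClassifier(n_estimators=100, random_state=0)
auc = cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()
print(auc)  # an AUC near 0.5 means train and test look alike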
I take my words back. I’ve got very close CV results now. I think that if the LB score is much lower than the CV score, then most likely this is a result of overfitting due to the imbalanced data for countries B and C.
I tried both sklearn’s cross_val_score and my own CV function using stratified K-fold. With both, I get scores that are really different from the LB, with no trend I can notice:
For example, I tried a logistic regression with no feature engineering. On CV I get ~0.3, but then I get ~3.0 on the LB. -------- here, CV is much lower than LB
However, with a LightGBM model I get a CV score that is lower than the LB by ~0.1. -------- CV is slightly lower than LB
Then, still using the same LightGBM parameters, I removed some features, applied transforms, etc., and saw little change in CV but a ~2-fold improvement on the leaderboard over the same model with all the features. It seems to be all over the place. -------- CV is much higher than LB
If you could point out any problems with my approach, please let me know.
My approach (see the sketch after this list):
Load data
Stratified K-fold to split the data
Preprocess/normalize etc. on the current fold only (I also tried doing this before splitting the data, but wasn’t sure if that would cause a data leak; fitting the preprocessing on all the data before splitting does leak validation-fold statistics into training)
Train, predict, and store those predictions
Calculate log loss on those predictions
Take the mean of all three countries’ log losses to get the final mean log loss (tried a weighted average as well, but it doesn’t make a big difference)
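A minimal sketch of what I mean, for one country (assuming X and y are numpy arrays of that country’s features and labels; the scaler and model here are just placeholders):

import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_losses = []
for train_idx, val_idx in skf.split(X, y):
    X_tr, X_val = X[train_idx], X[val_idx]
    y_tr, y_val = y[train_idx], y[val_idx]

    # fit preprocessing on the training fold only, then apply it to the
    # validation fold, so no validation statistics leak into training
    scaler = StandardScaler().fit(X_tr)
    model = LogisticRegression(max_iter=1000)
    model.fit(scaler.transform(X_tr), y_tr)

    preds = model.predict_proba(scaler.transform(X_val))[:, 1]
    fold_losses.append(log_loss(y_val, preds))

print(np.mean(fold_losses))  # this country's CV log loss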
1. First of all, you should pay attention to the class imbalance and choose a strategy for it (https://elitedatascience.com/imbalanced-classes). Validation data should be split off before any transformation.
2. Instead of np.mean, use np.average and weight each country’s log loss by the number of rows it contributes to the submission file (about a 10% difference); see the sketch below.
3. Before any feature engineering, try to create a baseline prediction with very close CV and LB scores.
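A sketch of the weighting from point 2 (the losses and row counts below are placeholders; use the real counts from the submission file):

import numpy as np

losses = np.array([0.30, 0.25, 0.20])   # per-country CV log losses (placeholders)
n_rows = np.array([8000, 3000, 1000])   # rows per country in the submission (placeholders)

print(np.mean(losses))                     # unweighted: 0.25
print(np.average(losses, weights=n_rows))  # weighted: ~0.28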
I suspect that the LB distribution is closer to 50/50, whereas the training examples are 90/10 for countries B and C. So how do we set up our cross-validation to mimic this?
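One way I can think of to mimic it (just a sketch, assuming binary 0/1 labels): reweight the validation samples so that each class contributes half of the total weight, which scores the fold as if the distribution were 50/50.

import numpy as np
from sklearn.metrics import log_loss

def balanced_log_loss(y_true, y_pred):
    y_true = np.asarray(y_true)
    p1 = y_true.mean()  # fraction of positives in the validation fold
    # give each class 50% of the total weight
    weights = np.where(y_true == 1, 0.5 / p1, 0.5 / (1 - p1))
    return log_loss(y_true, y_pred, sample_weight=weights)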
Moreover, I tried a simple RandomOverSampler on the training set and got better results than the benchmark solution (everything else the same except the sampling method), whereas when I applied a more sophisticated combined over-/under-sampling method called SMOTEENN, the results were far worse (I saw that this method reduces the number of training examples).
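In case anyone wants to reproduce the comparison, this is roughly how I plugged the samplers in (a sketch using imbalanced-learn; X_train and y_train are assumed to be one country’s training data):

from imblearn.over_sampling import RandomOverSampler
from imblearn.combine import SMOTEENN

# duplicate minority-class rows until the classes are balanced
ros = RandomOverSampler(random_state=0)
X_ros, y_ros = ros.fit_resample(X_train, y_train)

# SMOTE oversampling followed by Edited Nearest Neighbours cleaning;
# the ENN step can drop samples, which is why the training set shrinks
sme = SMOTEENN(random_state=0)
X_sme, y_sme = sme.fit_resample(X_train, y_train)

# then fit the usual model on the resampled data, e.g. model.fit(X_ros, y_ros)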
I think you are referring to SMOTE; with SMOTE you create synthetic data for the minority class based on its k nearest neighbors. I tried it and didn’t see improvements either, and used other techniques to improve the LB score.