My stratified 10-fold CV scores from a single model, which gives me my current best LB score of 0.1498, are as follows:
Country A: 0.2679 (with 344 features)
Country B: 0.1990 (with 1906 features)
Country C: 0.0189 (with 164 features)
This corresponds to an overall weighted mean logloss score of 0.1656
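For anyone wanting to sanity-check their own blend, the overall score is just a weighted mean of the per-country log losses. A minimal sketch (the weights below are hypothetical placeholders; the actual competition weights aren't stated in this thread):

```python
# Weighted mean of per-country log losses.
# NOTE: the weights here are hypothetical -- substitute the real
# country weights (e.g. proportional to test-set sizes).
def weighted_mean_logloss(losses, weights):
    total = sum(weights)
    return sum(l * w for l, w in zip(losses, weights)) / total

losses = [0.2679, 0.1990, 0.0189]   # Country A, B, C CV scores from above
weights = [0.4, 0.4, 0.2]           # hypothetical weights
print(round(weighted_mean_logloss(losses, weights), 4))
```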
Knowing that the feature dimensions are pretty large, I tried feature selection using Boruta, RFE, and removing columns containing too many missing values. This showed some improvement in my CV scores, but the scores worsened upon submitting.
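For reference, a minimal sketch of RFE driven by a random forest (sklearn, synthetic data; the feature counts are illustrative, not the ones used in this competition):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Synthetic stand-in for one country's training data.
X, y = make_classification(n_samples=200, n_features=30,
                           n_informative=5, random_state=0)

# Recursively drop the least important features per the forest's
# feature_importances_, a few at a time, until 10 remain.
selector = RFE(
    estimator=RandomForestClassifier(n_estimators=50, random_state=0),
    n_features_to_select=10,   # illustrative target dimensionality
    step=5,                    # drop 5 features per iteration
)
X_reduced = selector.fit_transform(X, y)
print(X_reduced.shape)  # (200, 10)
```

As noted later in the thread, this can be very slow, since the forest is refit at every elimination step.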
I would like to know the best logloss scores you have achieved when validating country-wise, and also whether you have had any luck with feature reduction, stacking, or neural nets so far.
All the best to everyone for the last 3 days!
Thanks for sharing! My stratified 10-fold CV scores from a single model, which gives me my current best LB score of 0.1526, are as follows:
Country A: 0.2619
Country B: 0.1962
Country C: 0.0159
This corresponds to an overall weighted mean logloss score of 0.1612. Compared to your results, it seems my local validation scores don't correspond well to the public LB.
I haven't had any success with feature reduction, and I haven't tried stacking yet, but neural nets help me a lot. Good luck for the last 2 days! Cheers!
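In case it helps anyone reading along, here is a generic neural-net baseline in sklearn (this is my own minimal sketch, not necessarily the architecture used in this post; the layer sizes are arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for one country's data.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# Scaling matters a lot for MLPs; a pipeline keeps the scaler
# fit inside each CV fold, avoiding leakage.
clf = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=0),
)
scores = cross_val_score(clf, X, y, cv=5, scoring="neg_log_loss")
print(-scores.mean())  # mean CV logloss
```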
@LastRocky: I appreciate your response. It's good to see neural nets are working for you; maybe you could do a weighted average with another model to improve your scores even further. What is the country-wise breakdown of feature dimensions you're using?
Hi, I haven't submitted my latest results yet; I'll do so in the coming days.
With RFE based on a random forest, my results are:
Country A: 0.2943909, Country B: 0.2014885, Country C: 0.018
I'll try submitting again with all the features, but when you say 1906 features, do you mean with dummy encoding?
For feature selection I tried Boruta and RFE (which took a very long time to compute) and, like you, removing features with too many NAs.
I have lost a lot of time trying to balance Country B. I have tried:
oversampling the minority class
oversampling the whole dataset
autoencoders (many attempts)
All of these approaches failed to improve my score.
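For anyone wanting to try the minority-class oversampling mentioned above, a minimal sketch with sklearn's `resample` (toy data; the class ratio is made up):

```python
import numpy as np
from sklearn.utils import resample

# Toy imbalanced labels standing in for Country B.
rng = np.random.RandomState(0)
X = rng.randn(100, 5)
y = np.array([1] * 10 + [0] * 90)

# Upsample the minority class (label 1) with replacement
# until it matches the majority count.
X_min, y_min = X[y == 1], y[y == 1]
X_up, y_up = resample(X_min, y_min, replace=True,
                      n_samples=int((y == 0).sum()), random_state=0)
X_bal = np.vstack([X[y == 0], X_up])
y_bal = np.concatenate([y[y == 0], y_up])
print(np.bincount(y_bal))  # balanced: [90 90]
```

One caveat worth checking if this fails on CV: oversampling must happen inside each fold, after the train/validation split, or the duplicated rows leak into validation.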
For the final model I'm using a stacked ensemble of a neural network, GBM, lasso, elastic net, ridge, and random forest. I would also like to use SVMs, but they aren't implemented in h2o.
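A rough sklearn analogue of that kind of stacked ensemble, for readers who don't use h2o (model list abbreviated; this is a sketch, not the actual setup above):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# Base learners loosely mirroring the GBM / random forest mix;
# the meta-learner is trained on their out-of-fold predictions.
stack = StackingClassifier(
    estimators=[
        ("gbm", GradientBoostingClassifier(random_state=0)),
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
    ],
    final_estimator=LogisticRegression(),
    cv=5,
)
stack.fit(X, y)
print(stack.predict_proba(X[:3]).shape)  # (3, 2)
```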
@payback: Wow, that's a lot of models. And yes, those 1906 features are one-hot encoded values of categorical variables. I believe if you focus on improving predictions for Country A, you can significantly increase your scores. All the best!
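To illustrate the one-hot expansion for anyone following along: each categorical level becomes its own 0/1 indicator column, which is how a modest number of raw columns can blow up into ~1900 features. The column names below are hypothetical:

```python
import pandas as pd

# Hypothetical categorical columns from a household survey.
df = pd.DataFrame({
    "material": ["brick", "wood", "brick"],
    "region":   ["north", "south", "north"],
})

# One indicator column per (column, level) pair.
encoded = pd.get_dummies(df, columns=["material", "region"])
print(list(encoded.columns))
```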
@nickil21 Yes, for Countries A and B I've built a lot of models! I know Country A is the best way to improve the score, but I have lost too much time on Country B and on feature selection.
@nickil21 Have you done any feature engineering/extraction? I have done one transformation that gave me significant improvements on A and B.
Yes. Like I mentioned in the other thread, grouping on the ID column, aggregating features for both categorical and numerical independent variables, and finally merging back helped improve scores a bit. Apart from that, I haven't been able to generate any.
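A sketch of that group-aggregate-merge idea in pandas (table and column names are hypothetical stand-ins for the household/individual files):

```python
import pandas as pd

# Hypothetical household-level and individual-level tables.
hh = pd.DataFrame({"id": [1, 2], "target": [0, 1]})
ind = pd.DataFrame({
    "id":  [1, 1, 2, 2, 2],
    "age": [34, 8, 51, 49, 17],
    "job": ["farmer", "none", "clerk", "farmer", "none"],
})

# Aggregate numeric columns with summary stats, categoricals with
# a cardinality count, then merge back onto the household table.
num_agg = (ind.groupby("id")["age"]
              .agg(["mean", "max", "count"])
              .add_prefix("age_")
              .reset_index())
cat_agg = (ind.groupby("id")["job"]
              .nunique()
              .rename("job_nunique")
              .reset_index())
features = hh.merge(num_agg, on="id").merge(cat_agg, on="id")
print(features.shape)  # (2, 6): id, target + 4 engineered columns
```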
@nickil21 What do you mean by grouping on the ID column?
Maybe you mean the iid column in the individual train set?
iid is just an indicator to capture the household size, isn't it?
The iid values are the other family members.
Cool, I did mean the ID column.
I think if you follow this thread closely, you should be able to comprehend fairly easily.
@nickil21 Cool. I use R, not Python; that Python groupby function is very cool, it does the feature engineering implicitly. I was confused by its name, groupby. Thanks!