Just made my final submission for the day. Wasn’t able to get past 74th place on the LB… we’ll see how the private LB shakes out. Congrats to all the participants, and especially those who were able to break through 0.14x LB. If possible, once the competition wraps up in 45min, I would love to hear about and learn from your processes and how you all were able to achieve such greatness =] !
CatBoost + XGBoost + LGBM. New features and feature selection :) It should be a great shake-up on the private LB.
Hello @sagol
Could you elaborate on how and why you chose the steps you did? This was my first competition and I am looking to learn more about approaching data challenges like this. It would be great if you could give a brief overview of your process, if you don’t mind.
Thanks!
And congrats to all the winners!
My solution uses 100 Bayesian-optimized versions of 5 different models (20 versions each). Models: XGBoost, LGBM, LogisticRegression (L1-regularized), Random Forest, and 3-layer neural nets.
I combined everything with an optimized weighted mean of the 100 base-model predictions.
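Not my exact code, but a minimal sketch of the idea: optimize the blend weights by minimizing log loss on out-of-fold predictions (the array names and the Nelder-Mead choice are just placeholders):

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.metrics import log_loss

def fit_ensemble_weights(oof_preds, y_true):
    # oof_preds: (n_samples, n_models) out-of-fold probabilities; y_true: binary target
    n_models = oof_preds.shape[1]

    def loss(w):
        w = np.abs(w) / np.abs(w).sum()      # keep weights positive and summing to 1
        blended = oof_preds @ w              # weighted mean of the model probabilities
        return log_loss(y_true, blended)

    w0 = np.full(n_models, 1.0 / n_models)   # start from the plain average
    res = minimize(loss, w0, method="Nelder-Mead")
    return np.abs(res.x) / np.abs(res.x).sum()

# test_blend = test_preds @ fit_ensemble_weights(oof_preds, y_true)
```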
I jumped 20 spots up on the private LB! For my solution, I used LGB primarily. I label encoded all categorical variables and rank encoded all numeric variables, ran 10-fold CV, and sorted all features by importance. I performed standard feature interactions (multiplication and addition) against the top 100 features over all folds, then successively removed the lowest-importance features until 25% of the features were left OR the loss kept growing (patience of 5). All of this was performed under CV.
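For anyone curious what that kind of rank encoding plus pairwise interactions looks like in practice, here is a rough pandas sketch (column names are made up, not the competition’s):

```python
import pandas as pd
from itertools import combinations

def rank_encode(df, numeric_cols):
    # replace each numeric column by its rank (average rank for ties)
    out = df.copy()
    for c in numeric_cols:
        out[c] = out[c].rank(method="average")
    return out

def label_encode(df, cat_cols):
    # map each category to an integer code
    out = df.copy()
    for c in cat_cols:
        out[c] = out[c].astype("category").cat.codes
    return out

def add_interactions(df, top_cols):
    # multiplicative and additive interactions between the top-importance features
    out = df.copy()
    for a, b in combinations(top_cols, 2):
        out[f"{a}_x_{b}"] = out[a] * out[b]
        out[f"{a}_plus_{b}"] = out[a] + out[b]
    return out
```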
On the last day, I added in some other models (Bayesian ridge, a neural network, LDA, regular ridge, and linear regression) and stacked the results. The results were OK, but I really liked my main LGB model, so for my final submission I took the predictions from all the other models and used them as additional features in my LGB model instead of stacking. I had totally forgotten about CatBoost!!
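Using side-model predictions as extra columns (rather than a separate meta-model) might look roughly like this; the function and variable names are mine, not @authman’s:

```python
import numpy as np
import lightgbm as lgb

def fit_lgb_with_prediction_features(X, side_model_oof, y):
    # X: original feature matrix; side_model_oof: (n_samples, n_side_models)
    # out-of-fold predictions from the extra models (ridge, NN, LDA, ...),
    # appended as additional features instead of being stacked.
    X_aug = np.hstack([X, side_model_oof])
    model = lgb.LGBMClassifier(n_estimators=1000, learning_rate=0.02)
    model.fit(X_aug, y)
    return model
```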
@sagol, can you share your process for feature elimination and feature engineering? I unfortunately did not have time to do the real EDA or univariate feature analysis that I wanted to. Thank you, and congrats on holding onto your private LB position!
This was an interesting competition, especially for a neophyte like me… What type of feature engineering did you guys do? Congrats to all the winners!!! Now that the competition is over, I’m wondering whether the people at the top of the leaderboard will share their code.
Thanks!
What a competition! Thanks DrivenData for holding such a great competition! What a surprise for me to jump from 18th to 3rd place! Because I only joined this competition 5 days ago, I didn’t have time to do much work on feature selection and stacking. My model is a simple combination of a LightGBM model and a 2-layer neural network with 10-fold CV. I kept all features in the *_hhold tables, and the one additional feature from the individual data table is the number of family members in each household. The neural network gives a huge improvement, but only works for countries A and B. I don’t know why the NN fails in country C; I haven’t had time to look into the country C table yet, because I spent almost all my time focusing on country A and then migrating the ideas to B and C.
Hi, thanks for sharing! What library did you use for neural networks?
Hi, I used Keras to build the NN.
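Not @LastRocky’s actual architecture, but a minimal 2-hidden-layer Keras MLP of the kind described (layer sizes and dropout rates are guesses):

```python
from keras.models import Sequential
from keras.layers import Dense, Dropout

def build_mlp(n_features):
    # two hidden layers with dropout, sigmoid output for the binary "poor" probability
    model = Sequential([
        Dense(256, activation="relu", input_dim=n_features),
        Dropout(0.3),
        Dense(64, activation="relu"),
        Dropout(0.3),
        Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy")
    return model

# model = build_mlp(X_train.shape[1])
# model.fit(X_train, y_train, epochs=30, batch_size=64, validation_split=0.1)
```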
Thanks for detailing your process @authman !
Would you mind sharing your code? I have never worked on such complex models and would like to understand how you coded it.
Congrats on your jump in the private LB!
Thanks to everyone who’s shared so far.
It’s my first competition and I didn’t know about the private LB…and my model was really overfitting the public LB data… lesson learned about doing better CV.
I used an XGB model and didn’t do much feature engineering (something I have to work on). I did aggregate the features from the individual level and created a count of family members as well. I wasn’t able to get lower than 0.155x, but it was a great experience.
@sagol, I am very interested to hear about your features as well.
I applied LGBM + CAT + XGB as well in the end, although I think my best score came from a stacking model (my laptop crashed during the competition, so everything from the first half was lost). I didn’t get a NN to work really well (with Keras), so I’m very interested in hearing about the architecture you used @LastRocky
I did some feature engineering on the individual data: group by ID and take the mean for numerical variables, the mode for categorical variables, and a count. At one point, I also one-hot encoded the categorical individual variables and took the mean after grouping (this gives the fraction of family members specifying a certain answer). I also added NaN counts and zero counts for both the individual and hhold data. I tried a denoising autoencoder, but with no luck.
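A rough pandas version of that individual-level aggregation (the household id column name is an assumption):

```python
import pandas as pd

def aggregate_individuals(indiv_df, id_col="id"):
    # numeric columns: household mean; categorical columns: household mode; plus a member count
    num_cols = indiv_df.select_dtypes(include="number").columns.drop(id_col, errors="ignore")
    cat_cols = indiv_df.select_dtypes(exclude="number").columns.drop(id_col, errors="ignore")

    grouped = indiv_df.groupby(id_col)
    num_agg = grouped[list(num_cols)].mean().add_suffix("_mean")
    cat_agg = grouped[list(cat_cols)].agg(lambda s: s.mode().iloc[0]).add_suffix("_mode")
    counts = grouped.size().rename("member_count")

    # fraction of household members giving each categorical answer
    onehot = pd.get_dummies(indiv_df[list(cat_cols)])
    onehot[id_col] = indiv_df[id_col].values
    frac = onehot.groupby(id_col).mean().add_suffix("_frac")

    return pd.concat([num_agg, cat_agg, counts, frac], axis=1)
```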
I did feature selection with a genetic algorithm.
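A compact genetic-algorithm feature selector along those lines could look like this (population size, mutation rate, and CV folds are arbitrary; this is a sketch, not the author’s actual code):

```python
import numpy as np
from sklearn.model_selection import cross_val_score

def ga_feature_select(make_model, X, y, pop_size=20, n_gen=15, mut_rate=0.05, seed=0):
    # each individual is a boolean mask over the columns of X; fitness = CV log loss
    rng = np.random.default_rng(seed)
    n_feat = X.shape[1]
    pop = rng.random((pop_size, n_feat)) < 0.5

    def fitness(mask):
        if not mask.any():
            return np.inf
        scores = cross_val_score(make_model(), X[:, mask], y, scoring="neg_log_loss", cv=3)
        return -scores.mean()

    for _ in range(n_gen):
        fit = np.array([fitness(ind) for ind in pop])
        parents = pop[np.argsort(fit)[: pop_size // 2]]        # keep the better half
        children = []
        for _ in range(pop_size - len(parents)):
            a, b = parents[rng.integers(len(parents), size=2)]
            cut = rng.integers(1, n_feat)                      # one-point crossover
            child = np.concatenate([a[:cut], b[cut:]])
            flip = rng.random(n_feat) < mut_rate               # bit-flip mutation
            children.append(np.where(flip, ~child, child))
        pop = np.vstack([parents, children])
    fit = np.array([fitness(ind) for ind in pop])
    return pop[np.argmin(fit)]                                 # best mask found
```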
I guess my main takeaway is that next time I should spend less time just getting a good stacking model up and running, and more on the feature engineering/selection part.
I used t-SNE to create a 3D representation of countries A and B and added it as a feature, which gave a good boost. I also added the sum, difference, product, quotient, and mean for the first numerical features, but not for all combinations.
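For reference, a t-SNE embedding feature of that sort could be built roughly like this (using sklearn’s TSNE; since t-SNE has no out-of-sample transform, train and test are embedded together):

```python
import numpy as np
from sklearn.manifold import TSNE

def tsne_features(X_train, X_test, n_components=3, seed=0):
    # embed train and test together, then split the embedding back apart
    X_all = np.vstack([X_train, X_test])
    emb = TSNE(n_components=n_components, random_state=seed, init="random").fit_transform(X_all)
    return emb[: len(X_train)], emb[len(X_train):]
```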
For feature selection I used recursive feature elimination based on random forest, but I didn’t see much improvement; it was mainly useful because afterwards I had the features ranked by importance over 10 CV folds.
My model was a stacked ensemble of 10 GBMs, elastic net (from ridge to lasso), a NN (which worked well), and a random forest (which didn’t work very well).
On the last day I tried GLRM, PCA, and k-means clustering and added them as features, but without success.
I’m very interested in hearing about the feature engineering of the winning models. Congratulations to all!
I will release my work on GitHub.
As has always been the case with challenges involving small datasets, I was skeptical about the CV-LB score correspondence, and during the later stages I relied only on improving CV scores locally.
Here’s how I achieved the 6th place finish. My models consisted of LightGBM, an ANN (2-hidden-layer MLP with softmax activation in the output layer), and RGF to add some diversity to the mix. The predictions were simply a weighted average of all of these in the ratio 0.45, 0.45, 0.1 respectively for A and B. Like @LastRocky mentioned before, NNs weren’t really performing for C; maybe a different kind of pre-processing was needed beyond mere standardization. Hence, a simple average of LightGBM and RGF was taken for C.
For feature generation, like I mentioned in the other thread, frequency counts of the categorical variables and the mean of the numerical variables gave some improvement when merging the individual data with the household data. All other categorical variables in the household data were label encoded. To reduce the feature count, I used RFECV with LightGBM as the base estimator, simply because it was very fast. In the end there were 240, 212, and 6 features left for A, B, and C respectively.
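RFECV with a LightGBM base estimator, roughly as described (step size and model parameters are illustrative):

```python
import lightgbm as lgb
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold

def select_features(X, y):
    # recursively drop the weakest features, keeping the subset with the best CV log loss
    selector = RFECV(
        estimator=lgb.LGBMClassifier(n_estimators=200, learning_rate=0.05),
        step=10,                                    # drop 10 features per iteration
        cv=StratifiedKFold(5, shuffle=True, random_state=0),
        scoring="neg_log_loss",
    )
    selector.fit(X, y)
    return selector.support_                        # boolean mask of the kept features
```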
I also found that imputing missing values as zeros helped improve scores a little. For a robust CV scheme, I ran stratified 5-fold CV 5 times (changing the seed value) and bagged the predictions for each of my models.
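And a sketch of that repeated stratified 5-fold bagging, assuming numpy inputs, a binary target, and any sklearn-style classifier factory:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def bagged_cv_predict(make_model, X, y, X_test, n_repeats=5, n_splits=5):
    # average test predictions over 5 different stratified 5-fold splits (different seeds)
    test_pred = np.zeros(len(X_test))
    for seed in range(n_repeats):
        skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
        for tr_idx, _ in skf.split(X, y):
            model = make_model()
            model.fit(X[tr_idx], y[tr_idx])
            test_pred += model.predict_proba(X_test)[:, 1]
    return test_pred / (n_repeats * n_splits)
```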
Hi, thanks for sharing your solutions.
Here is my solution :-
Country A (Final log loss: 0.268)
- A simple XGB model without any feature engineering gave a log loss of ~0.28
- Tuning hyperparameters using grid search CV improved the score to ~0.272
- Adding distances of combinations of numerical features from the origin improved the score to 0.27001
- Importing the number of family members from the individual dataset slightly worsened the score to 0.27003
- Importing all the data from the individual dataset significantly worsened the score to ~0.272 (numerical features were averaged after grouping by id; certain categorical features were different for each family member, so to capture this information the categorical features were first one-hot encoded and then summed)
- An ensemble of models improved the score to 0.268. The base layer consisted of the models shown in the graph below, which plots the error rate of each model on different clusters of the data (clustered using agglomerative clustering)

All the predicted probabilities from the above base models, plus a LocalOutlierFactor feature, were given to an XGB model to get the final prediction.
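Not the exact pipeline, but the shape of it: stack the base-model out-of-fold probabilities together with a LocalOutlierFactor score and feed them to an XGBoost meta-model (names and hyperparameters are placeholders):

```python
import numpy as np
import xgboost as xgb
from sklearn.neighbors import LocalOutlierFactor

def lof_scores(X):
    # outlier score per row (larger = more outlying), used as one extra meta-feature
    lof = LocalOutlierFactor(n_neighbors=20)
    lof.fit(X)
    return -lof.negative_outlier_factor_

def fit_meta_model(oof_probs, lof_col, y):
    # oof_probs: (n_samples, n_base_models) out-of-fold probabilities from the base models
    meta_X = np.hstack([oof_probs, lof_col.reshape(-1, 1)])
    meta = xgb.XGBClassifier(n_estimators=300, learning_rate=0.05, max_depth=3)
    meta.fit(meta_X, y)
    return meta
```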
Country B (Final log loss: 0.192)
- A simple XGB model without any feature engineering gave a score of ~0.22
- Hyperparameter tuning improved the score to ~0.21
- One of the problems with the country B dataset is the small training size. If we one-hot encode the categorical variables, the feature size is ~1000 while the training sample size is ~3000. So instead of one-hot encoding the categorical variables, I converted the discrete categorical variables into continuous probabilities, which reduced the feature size to ~380 and significantly improved the score to 0.192 (see the sketch after these notes for one way to do this)
- Importing data from the individual dataset improved the CV score to ~0.17, but this time when I submitted the predictions, the score calculated by DrivenData was significantly worse. I think this has something to do with dataset imbalance
- Applying SMOTE, upsampling, and adjusting the scale_pos_weight parameter of XGB to correct the dataset imbalance didn’t help; it pushed the score back to ~0.21
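The categorical-to-probability trick above isn’t spelled out, but one common reading is target (likelihood) encoding: replace each category with the mean of the target for that category, computed out-of-fold to limit leakage. A rough sketch under that assumption (column names are placeholders):

```python
import pandas as pd
from sklearn.model_selection import KFold

def target_encode(train, test, col, target, n_splits=5, seed=0):
    # replace each category with the out-of-fold mean of the binary target
    prior = train[target].mean()
    enc_train = pd.Series(prior, index=train.index, dtype=float)
    for tr_idx, val_idx in KFold(n_splits, shuffle=True, random_state=seed).split(train):
        means = train.iloc[tr_idx].groupby(col)[target].mean()
        enc_train.iloc[val_idx] = train.iloc[val_idx][col].map(means).fillna(prior).values
    enc_test = test[col].map(train.groupby(col)[target].mean()).fillna(prior)
    return enc_train, enc_test
```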
Country C (Final log loss: 0.018)
- A simple XGB model without any feature engineering gave a score of ~0.1
- Hyperparameter tuning improved the score to 0.018
All log loss scores mentioned above are averages over 8-fold stratified CV.
This wouldn’t happen to be Authman Apatira from Coding Dojo?
New features from the individual datasets (B and C):
- the number of iid
- the number of positive and negative values
- sum
- number of unique values
LabelEncoding for hhold (A, B, C).
Categorical features were those with no more than 5 unique values.
For feature selection I used recursive feature elimination based on feature_importances and 5-fold CV for each of the algorithms (CatBoost, XGBoost, LightGBM).
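A minimal sketch of that kind of importance-based recursive elimination loop with 5-fold CV (step size and stopping rule are arbitrary; it works with any of the three libraries since they all expose feature_importances_):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score

def rfe_by_importance(make_model, X, y, step=10, min_features=20):
    # repeatedly drop the least important features, remembering the best CV log loss seen
    cols = np.arange(X.shape[1])
    best_loss, best_cols = np.inf, cols
    while len(cols) > min_features:
        model = make_model().fit(X[:, cols], y)
        cv = StratifiedKFold(5, shuffle=True, random_state=0)
        loss = -cross_val_score(make_model(), X[:, cols], y,
                                scoring="neg_log_loss", cv=cv).mean()
        if loss < best_loss:
            best_loss, best_cols = loss, cols
        order = np.argsort(model.feature_importances_)   # weakest features first
        cols = cols[order[step:]]                        # drop the `step` weakest
    return best_cols
```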
That’s all
Very nice, thanks for sharing your process insights!
For better or worse, yes it is =) !
without stacking?
I see some are using CatBoost (never heard of it). What are the advantages over XGBoost? I just checked an R tutorial about it and the syntax seems difficult. Is it worth learning? Is the performance gain over XGBoost really that noticeable?