
Share the knowledge

Hi, I used keras to build the nn

Thanks for detailing your process @authman !
Would you mind sharing your code? I have never worked on such complex models and would like to understand how you coded it.

Congrats on your jump in the private LB!

Thanks to everyone who has shared so far.
It’s my first competition and I didn’t know about the private LB… and my model was badly overfitting the public LB data… lesson learned about doing better CV.

I used an XGB model and didn’t do much feature engineering (something I have to work on). I did aggregate the features from the individual level and created a count of family members as well. I wasn’t able to get lower than 0.155x, but it was a great experience.

@sagol, I am very interested to hear about your features as well.

I applied LGBM+CAT+XGB as well in the end. Although I think my best score comes from a stacking model (my laptop crashed during the competition so everything from the first half was lost). I didn’t get a NN to work really well (with Keras), so I’m very interested in hearing about the architecture you used @LastRocky

I did some feature engineering on the individual data: group by ID and take mean for numerical variables, mode for categorical variables and a count. At one time, I also one-hot-encoded the categorical individual variables and took a mean after grouping (this calculates the fraction of family members specifying a certain answer). I also added NaN counts and zero counts for both individual and hhold data. I tried a denoising autoencoder but with no luck.
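For anyone asking for code, the aggregation described above could look roughly like the sketch below. It is not the poster's actual code: the column names (`id`, `age`, `job`) and the helper `aggregate_individuals` are made up for illustration, and it assumes every household has at least one non-missing value per categorical column.

```python
import numpy as np
import pandas as pd

# Toy individual-level table; the column names are hypothetical.
indiv = pd.DataFrame({
    "id": [1, 1, 1, 2, 2],
    "age": [34.0, 31.0, np.nan, 58.0, 60.0],
    "job": ["a", "b", "a", "c", "c"],
})

def aggregate_individuals(df, key="id", num_cols=("age",), cat_cols=("job",)):
    """Collapse individual rows into one row of features per household."""
    grouped = df.groupby(key)
    out = pd.DataFrame(index=grouped.size().index)
    out["n_members"] = grouped.size()                      # family size
    for col in num_cols:
        out[col + "_mean"] = grouped[col].mean()           # mean of numeric columns
    for col in cat_cols:
        # Most frequent category within the household.
        out[col + "_mode"] = grouped[col].agg(lambda s: s.mode().iloc[0])
        # One-hot encode, then average: fraction of members giving each answer.
        dummies = pd.get_dummies(df[col], prefix=col + "_frac").astype(float)
        out = out.join(dummies.groupby(df[key]).mean())
    # Number of missing values across all individual rows of the household.
    out["n_nan"] = df.drop(columns=[key]).isna().sum(axis=1).groupby(df[key]).sum()
    return out

hh_features = aggregate_individuals(indiv)
print(hh_features)
```

The resulting frame is indexed by household id, so it can be joined onto the hhold table.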

I did feature selection with a genetic algorithm.

I guess my main take-away is that I should not be spending so much time again on just getting a good stacking model up and running etc, but more on the feature engineering/selection part.


I used t-SNE to create a 3D representation of countries A and B and added it as features; this gave a good boost. I also added the sum, difference, product, ratio and mean of the first numerical features, but not all combinations.
For feature selection I used recursive feature elimination based on random forest, but I didn’t see much improvement. It was useful only because afterwards I had the features ranked by importance over 10 CV folds.
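A minimal sketch of the t-SNE idea with scikit-learn, on random stand-in data. The perplexity and the scaling step are assumptions, not the settings used in the competition. Note that scikit-learn's `TSNE` has no `transform` for unseen rows, so train and test rows would have to be embedded together.

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 8))  # stand-in for the numeric hhold features

# Scale first so no single column dominates the pairwise distances.
X_scaled = StandardScaler().fit_transform(X)

# 3D embedding; perplexity must be smaller than the number of rows.
embedding = TSNE(n_components=3, perplexity=10, random_state=0).fit_transform(X_scaled)

# Append the three embedding coordinates as extra features.
X_augmented = np.hstack([X, embedding])
print(X_augmented.shape)
```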

My model was a stacked ensemble of 10 GBM, elastic net (from ridge to lasso), NN (worked well) and random forest (didn’t work very well).

On the last day I tried GLRM, PCA and k-means clustering and added the results as features, but without success.
I’m very interested in hearing about the feature engineering of the winning models. Congratulations to all.
I will release my work on GitHub.


As has always been the case with challenges involving small datasets, I was skeptical about the CV-LB score correspondence and during the later stages, I only relied on improving CV scores locally.

Here’s how I achieved the 6th place finish: my models consisted of LightGBM, an ANN (an MLP with 2 hidden layers and softmax activation in the output layer), and RGF to add some diversity to the mix. The predictions were simply a weighted average of these in the ratio 0.45, 0.45, 0.1 respectively for A and B. As @LastRocky mentioned before, NNs weren’t really performing for C. Maybe a different kind of pre-processing had to be applied beyond mere standardization. Hence, a simple average of LightGBM and RGF was taken for C.
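The blending step is simple enough to sketch in a few lines of NumPy. The prediction arrays below are invented numbers, and `blend` is a hypothetical helper; only the weights come from the post.

```python
import numpy as np

def blend(preds, weights):
    """Weighted average of per-model probability vectors."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()          # normalise so weights sum to 1
    stacked = np.vstack([np.asarray(p, dtype=float) for p in preds])
    return weights @ stacked

# Made-up predicted probabilities for three rows.
lgbm = np.array([0.10, 0.80, 0.40])
ann = np.array([0.20, 0.70, 0.50])
rgf = np.array([0.30, 0.90, 0.10])

blend_ab = blend([lgbm, ann, rgf], [0.45, 0.45, 0.10])  # countries A and B
blend_c = blend([lgbm, rgf], [0.5, 0.5])                # country C: plain average
print(blend_ab, blend_c)
```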

For feature generation, like I mentioned in the other thread, the frequency counts of categorical variables and the mean of numerical variables proved to give some improvements in the case of merging Individual with household data. All other categorical variables in the household data were label encoded. To reduce the feature size, I used RFECV with LightGBM as the base estimator simply because it was very fast. Finally, there were 240, 212, 6 features left for A, B and C respectively.
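A rough sketch of cross-validated recursive feature elimination with scikit-learn's `RFECV`. To keep it dependency-free it uses a small random forest as a stand-in base estimator instead of LightGBM, and synthetic data; the step size and scoring choice are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

# Synthetic stand-in for one country's feature matrix.
X, y = make_classification(n_samples=200, n_features=12, n_informative=4,
                           random_state=0)

# Any estimator exposing feature_importances_ works as the base model.
estimator = RandomForestClassifier(n_estimators=30, random_state=0)

# Drop 2 features per round and keep the subset with the best CV log loss.
selector = RFECV(estimator, step=2, cv=5, scoring="neg_log_loss",
                 min_features_to_select=2)
selector.fit(X, y)

X_selected = selector.transform(X)
print(selector.n_features_, selector.support_)
```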

I also found that imputing missing values as zeroes helped improve scores a little bit. In terms of building a robust CV scheme, I did Stratified 5-fold 5 times (by changing seed value) and bagged the predictions for each of my models.
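The repeated stratified CV with bagging might be sketched as follows. Logistic regression and the synthetic data are stand-ins for the real models and features, and `X_test` is just a slice used for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_test = X[:20]  # stand-in for the real test set

seeds = [0, 1, 2, 3, 4]
n_splits = 5
oof = np.zeros(len(y))             # out-of-fold predictions, averaged over seeds
test_pred = np.zeros(len(X_test))  # test predictions, bagged over all 25 fits

for seed in seeds:
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, valid_idx in skf.split(X, y):
        model = LogisticRegression(max_iter=1000)
        model.fit(X[train_idx], y[train_idx])
        # Each row is in exactly one validation fold per seed.
        oof[valid_idx] += model.predict_proba(X[valid_idx])[:, 1] / len(seeds)
        test_pred += model.predict_proba(X_test)[:, 1] / (len(seeds) * n_splits)
```

Scoring `oof` against `y` gives a CV estimate that depends less on any single fold split than one 5-fold run would.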


Hi, thanks for sharing your solutions.

Here is my solution :-

Country A ( Final log loss: 0.268 )

  1. Simple XGB model without any feature engineering gave log loss score of ~0.28

  2. Tuning hyperparameters using grid search CV improved score to ~0.272

  3. Adding distances of combinations of numerical features from origin improved the score to 0.27001

  4. Importing number of family members from individual dataset reduced the score slightly to 0.27003

  5. Importing all the data from individual dataset significantly reduced the score to ~0.272 (Numerical features were averaged after grouping by id. Certain categorical features were different for each family member. To capture this information, categorical features were first one hot encoded and summed)

  6. The ensemble of models improved the score to 0.268. The base layer consisted of the models shown in the graph below. The graph shows the error rate of each model on different clusters of the data (clustered using agglomerative clustering)

    All the predicted probabilities from the above base models + LocalOutlierFactor were given to an XGB model to get the final prediction.
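One possible reading of step 3 (distances of combinations of numerical features from the origin) is the Euclidean norm of each pair of numeric columns. The sketch below assumes that interpretation; the helper `add_origin_distances` and the column names are hypothetical.

```python
from itertools import combinations

import numpy as np
import pandas as pd

def add_origin_distances(df, num_cols, r=2):
    """Add sqrt(a^2 + b^2 + ...) for every r-sized combination of num_cols."""
    out = df.copy()
    for combo in combinations(num_cols, r):
        name = "dist_" + "_".join(combo)
        out[name] = np.sqrt((df[list(combo)] ** 2).sum(axis=1))
    return out

toy = pd.DataFrame({"n1": [3.0, 0.0], "n2": [4.0, 1.0], "n3": [0.0, 2.0]})
with_dist = add_origin_distances(toy, ["n1", "n2", "n3"])
print(with_dist)
```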

Country B ( Final score: 0.192)

  1. Simple XGB model without any feature engineering gave score of ~0.22
  2. Hyper parameter tuning improved score to ~0.21
  3. One of the problems with country B dataset is low training size. If we one hot encode the categorical variables, the feature size is ~1000 and training sample size is ~3000. So instead of one hot encoding categorical variables, converting the discrete categorical variables into continuous probabilities reduced the feature size to ~380. This significantly improved the score to 0.192
  4. Importing data from the individual dataset improved the score to ~0.17 but this time when I submitted the predictions, the scores calculated by DrivenData were significantly worse. I think this has something to do with dataset imbalance
  5. Applying SMOTE, upsampling and adjusting the scale_pos_weight parameter of XGB for correcting dataset imbalance didn’t help. It reduced the score back to ~0.21
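Step 3 for country B (turning categories into continuous probabilities) sounds like target encoding, so here is a small sketch under that assumption. The helper `probability_encode`, the smoothing parameter `prior_weight` and the column names are all hypothetical. Computing the encoding on the same rows it is applied to leaks the target, so in practice it is usually done out-of-fold.

```python
import pandas as pd

def probability_encode(train, test, col, target, prior_weight=10.0):
    """Replace each category with a smoothed rate of the positive class."""
    prior = train[target].mean()                       # overall positive rate
    stats = train.groupby(col)[target].agg(["sum", "count"])
    # Shrink rare categories toward the overall rate.
    smoothed = (stats["sum"] + prior * prior_weight) / (stats["count"] + prior_weight)
    train_enc = train[col].map(smoothed)
    test_enc = test[col].map(smoothed).fillna(prior)   # unseen categories get the prior
    return train_enc, test_enc

train = pd.DataFrame({"cat": ["a", "a", "b", "b", "b", "c"],
                      "poor": [1, 0, 1, 1, 1, 0]})
test = pd.DataFrame({"cat": ["a", "d"]})

train_enc, test_enc = probability_encode(train, test, "cat", "poor")
print(train_enc.tolist(), test_enc.tolist())
```

Each categorical column becomes a single numeric column this way, which is how the feature count can drop so sharply compared with one-hot encoding.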

Country C ( Final score : 0.018)

  1. Simple XGB model without any feature engineering gave score of ~0.1
  2. Hyperparameter tuning improved score to 0.018

All log loss scores mentioned above are averages of scores from 8-fold stratified CV


This wouldn’t happen to be Authman Apatira from Coding Dojo?

New features from the individual datasets (B and C):

  1. the number of iid
  2. the number of positive and negative values
  3. sum
  4. number of unique values

LabelEncoding for hhold (A, B, C).
Categorical features were those whose number of values was not more than 5.

For feature selection I used recursive feature elimination based on feature_importances and 5-fold CV for each of the algorithms (catboost, xgboost, lightgbm).

That’s all :slight_smile:


Very nice, thanks for sharing your process insights!

For better or worse, yes it is =) !

without stacking?
I see some are using catboost (never heard of it). What are its advantages over xgboost? I just checked an R tutorial about it and the syntax seems difficult. Is it worth learning? Is the performance gain over xgboost that noticeable?

catboost has been recommended by a bunch of high-performing Russian Kaggle grandmasters. also, it’s renowned for being able to deal with categorical variables (and all that they entail) out-of-the-box without really doing much / any preprocessing.

catboost is very slow (

result = xgboost * 0.4 + catboost * 0.4 + lightgbm * 0.2
Sometimes catboost can be useful for categorical features.

Just to share some knowledge (I ended in rank 200 so I am not sure it will be useful for someone), I got a really good boost at some point by eliminating low entropy columns. I went from 0.24 to 0.18 logloss. My rationale was that all those columns had little information and thus would not contribute to the classifier (XGBoost). I think that the same results could be achieved with what Sagol said (recursive feature elimination based on feature_importances). I added some features, but they did not improve the model significantly: counting the members in the household, adding mean, std and median of numeric columns and also normalizing them.
I think I stopped submitting around a month ago, so the model could be really improved, especially because I did not do any kind of stacking. Thank you all for sharing!
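Dropping low entropy columns could be sketched like this. The threshold of 0.1 bits is an arbitrary assumption (the post does not state one), the helpers are hypothetical, and the value-count entropy used here only makes sense for discrete columns.

```python
import numpy as np
import pandas as pd

def column_entropy(series):
    """Shannon entropy (in bits) of a column's value distribution."""
    probs = series.value_counts(normalize=True, dropna=False).to_numpy()
    return float(-(probs * np.log2(probs)).sum())

def drop_low_entropy(df, threshold=0.1):
    """Keep only columns whose entropy is at least `threshold` bits."""
    entropies = {col: column_entropy(df[col]) for col in df.columns}
    keep = [col for col, h in entropies.items() if h >= threshold]
    return df[keep], entropies

toy = pd.DataFrame({
    "constant": [1] * 8,        # no information at all
    "balanced": [0, 1] * 4,     # 1 bit
    "rare": [0] * 7 + [1],      # roughly 0.54 bits
})
filtered, entropies = drop_low_entropy(toy)
print(list(filtered.columns), entropies)
```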


Thank you so much for your replies. I did not get below 0.2, so there is definitely lots for me to learn. I would love to see some code snippets if possible. Thanks again, I will try some of these measures in my code and see if I get some improvement.

Amazing work everyone, thanks for participating! Really great to see the collaboration and knowledge sharing that happened on the forums.

Once we’ve reviewed the winning submissions, we’ll make the code available on our GitHub repository for competition winners as well:

Thanks again to all!


Reading the comments made me realize how hard some people have worked on the dataset. Amazing work people! My final rank is 122 but my solution is really simple - XGBoost tuned using hyperopt. I only spent a couple of days on this problem. If anybody is interested, here is my solution on Github.


Catboost can automatically deal with categorical features and has really good default hyper-parameters. My baseline, which was catboost with the default parameters on the hhold data, scored 0.1746