catboost has been recommended by a bunch of high-performing Russian kaggle grandmasters. also, it's renowned for being able to deal with categorical variables (and all that they entail) out of the box without much, if any, preprocessing.
catboost is very slow, though.
result = xgboost * 0.4 + catboost * 0.4 + lightgbm * 0.2
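The blend above is just a weighted average of each model's predicted probabilities. A minimal sketch (the prediction arrays are hypothetical placeholders for outputs of three already-trained models):

```python
import numpy as np

# Hypothetical predicted probabilities from three already-trained models.
xgb_pred = np.array([0.10, 0.80, 0.35])
cat_pred = np.array([0.12, 0.75, 0.40])
lgb_pred = np.array([0.08, 0.90, 0.30])

# Weighted blend with the weights quoted above (0.4 / 0.4 / 0.2).
result = 0.4 * xgb_pred + 0.4 * cat_pred + 0.2 * lgb_pred
```

The weights sum to 1, so the blend stays a valid probability whenever the inputs are.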
Sometimes catboost can be useful for categorical features.
Just to share some knowledge (I finished at rank 200, so I am not sure it will be useful for anyone): I got a really good boost at some point by eliminating low-entropy columns, going from 0.24 to 0.18 logloss. My rationale was that those columns carried little information and thus would not contribute to the classifier (XGBoost). I think the same result could be achieved with what Sagol said (recursive feature elimination based on feature_importances). I added some features, but they did not improve the model significantly: counting the members in the household, and adding the mean, std, and median of the numeric columns, as well as normalizing them.
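A sketch of the low-entropy filter described above, assuming a pandas DataFrame of categorical columns; the entropy threshold here is a hypothetical cut-off, since the post does not say which value was used:

```python
import pandas as pd
from scipy.stats import entropy

def drop_low_entropy_columns(df, threshold=0.1):
    """Keep only columns whose empirical Shannon entropy meets the threshold.

    `threshold` is a hypothetical cut-off; nearly-constant columns have
    entropy close to 0 and get dropped.
    """
    keep = []
    for col in df.columns:
        probs = df[col].value_counts(normalize=True)
        if entropy(probs, base=2) >= threshold:
            keep.append(col)
    return df[keep]

# Toy example: 'b' is constant, so it carries no information and is dropped.
df = pd.DataFrame({
    "a": ["x", "y", "x", "y", "z", "x"],
    "b": ["k", "k", "k", "k", "k", "k"],
})
filtered = drop_low_entropy_columns(df)
```

Recursive feature elimination on `feature_importances_`, as mentioned, should converge on a similar set of columns, just more slowly.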
I think I stopped submitting around a month ago, so the model could be really improved, especially because I did not do any kind of stacking. Thank you all for sharing!
Thank you so much for your replies. I did not get below 0.2, so there is definitely a lot for me to learn. Would love to see some code snippets if possible. Thanks again; I will try some of these measures in my code and see if I get some improvement.
Amazing work everyone, thanks for participating! Really great to see the collaboration and knowledge sharing that happened on the forums.
Once we’ve reviewed the winning submissions, we’ll make the code available on our GitHub repository for competition winners as well.
Thanks again to all!
Reading the comments made me realize how hard some people have worked on the dataset. Amazing work, people! My final rank is 122, but my solution is really simple: XGBoost tuned using hyperopt. I only spent a couple of days on this problem. If anybody is interested, here is my solution on GitHub.
Catboost can automatically deal with categorical features and has really good default hyper-parameters. My baseline, which was Catboost with the default parameters on the hhold data, scored 0.1746.
I want to share my solution; feel free to drop me a line!
Our solution is an ensemble of models built using gradient boosting (lightgbm) and neural networks (keras).
We tried to take into account the only interpretable feature, hhold_size, when normalizing the features created from the individual household members' data.
The most challenging part was feature selection. We did this using a couple of techniques. The most successful one was to fit a model to the core group of features plus the group of features we wanted to add/test, then evaluate the effect that randomly permuting each individual feature had on that model's predictions. After going through every feature, we removed the ones for which permutation improved the score.
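The permutation check above can be sketched as follows, using scikit-learn stand-ins (a `GradientBoostingClassifier` and synthetic data) rather than the team's actual lightgbm/keras models:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the competition data.
X, y = make_classification(n_samples=400, n_features=8, n_informative=4,
                           random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, random_state=0)

# Fit once on all candidate features.
model = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
base = log_loss(y_va, model.predict_proba(X_va)[:, 1])

# Shuffle one feature at a time; if the score *improves* when the
# feature/target link is broken, the feature is a candidate for removal.
harmful = []
rng = np.random.default_rng(0)
for j in range(X_va.shape[1]):
    X_perm = X_va.copy()
    rng.shuffle(X_perm[:, j])  # permute column j in place
    permuted = log_loss(y_va, model.predict_proba(X_perm)[:, 1])
    if permuted < base:
        harmful.append(j)
```

Note this only requires one model fit per group of candidate features, which makes it much cheaper than refitting for every feature subset.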
The cross-validation scores of our best submission were:
A: 0.2517962 (20-fold cv)
B: 0.1726869 (20-fold cv)
C: 0.0154211 (5-fold cv)
We will give more details on our final write-up.
RGama and hugoguh
(the Ag100 team)