Share the knowledge

nickil21 · March 1, 2018, 10:40am

As is usually the case with challenges involving small datasets, I was skeptical about how well CV scores would correspond to the LB, so during the later stages I relied only on improving my local CV scores.

Here’s how I achieved the 6th place finish. My models consisted of LightGBM, an ANN (an MLP with 2 hidden layers and a softmax activation in the output layer), and RGF to add some diversity to the mix. For A and B, the predictions were simply a weighted average of these three in the ratio 0.45, 0.45, 0.1 respectively. As @LastRocky mentioned before, NNs weren’t really performing for C. Maybe a different kind of pre-processing was needed beyond mere standardization. Hence, a simple average of LightGBM and RGF was taken for C.
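To make the blending step concrete, here is a minimal sketch of a weighted average over per-model probabilities. The function name `blend` and the probability values are made up for illustration; only the weights (0.45, 0.45, 0.1 for A and B, equal weights for C) come from the description above.

```python
import numpy as np

def blend(preds, weights):
    """Weighted average of per-model probability arrays (hypothetical helper)."""
    preds = np.asarray(preds, dtype=float)      # shape: (n_models, n_samples)
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()           # normalise so the weights sum to 1
    return weights @ preds                      # shape: (n_samples,)

# Made-up predicted probabilities for three samples (illustration only)
p_lgbm = np.array([0.20, 0.80, 0.50])
p_mlp = np.array([0.30, 0.70, 0.40])
p_rgf = np.array([0.10, 0.90, 0.60])

# A and B: LightGBM, MLP and RGF weighted 0.45 / 0.45 / 0.10
blend_ab = blend([p_lgbm, p_mlp, p_rgf], [0.45, 0.45, 0.10])

# C: simple average of LightGBM and RGF only
blend_c = blend([p_lgbm, p_rgf], [0.5, 0.5])
```

Normalising the weights inside the helper means the same function covers both the weighted case and the plain average.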

For feature generation, as I mentioned in the other thread, the frequency counts of categorical variables and the mean of numerical variables gave some improvement when merging the individual data with the household data. All other categorical variables in the household data were label encoded. To reduce the number of features, I used RFECV with LightGBM as the base estimator, simply because it was very fast. In the end, 240, 212 and 6 features were left for A, B and C respectively.
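The aggregation of individual-level data up to the household level could look roughly like the sketch below. It is only one reading of the description: "frequency counts" is interpreted here as per-household counts of each category value, and the column names (`id`, `job`, `age`, `roof`) and the toy data are hypothetical, not taken from the competition files.

```python
import pandas as pd

# Toy individual-level data: several rows per household id (hypothetical columns)
indiv = pd.DataFrame({
    "id": [1, 1, 1, 2, 2],
    "job": ["a", "b", "a", "c", "c"],
    "age": [30.0, 10.0, 50.0, 40.0, 20.0],
})

# Toy household-level data: one row per household id
hhold = pd.DataFrame({"id": [1, 2], "roof": ["tin", "tile"]})

# Mean of each numerical individual-level column per household
num_means = indiv.groupby("id")[["age"]].mean().add_suffix("_mean").reset_index()

# Count of each category value per household
cat_counts = (
    pd.crosstab(indiv["id"], indiv["job"])
    .add_prefix("job_count_")
    .reset_index()
)

# Merge the aggregated individual features into the household table
features = hhold.merge(num_means, on="id", how="left").merge(
    cat_counts, on="id", how="left"
)

# Label encode the household-level categorical column as integer codes
features["roof"] = features["roof"].astype("category").cat.codes
```

The resulting `features` table has one row per household, which is the shape a feature selector such as RFECV would then work on.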

I also found that imputing missing values with zeroes improved scores slightly. To build a robust CV scheme, I ran stratified 5-fold CV 5 times (changing the seed each time) and bagged the predictions for each of my models.
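A rough sketch of that CV scheme, with zero imputation applied first, is shown below. Several things here are assumptions: the data is synthetic, `LogisticRegression` stands in for the actual models, log loss is used as the CV score, and the helper name `repeated_cv_bag` is hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import StratifiedKFold

# Synthetic data with some missing values injected (illustration only)
rng = np.random.default_rng(0)
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X[rng.random(X.shape) < 0.05] = np.nan

# Impute missing values with zeroes
X = np.where(np.isnan(X), 0.0, X)
X_train, y_train, X_test = X[:250], y[:250], X[250:]

def repeated_cv_bag(make_model, X, y, X_test, n_splits=5, seeds=(0, 1, 2, 3, 4)):
    """Stratified k-fold repeated once per seed; predictions are averaged."""
    oof = np.zeros(len(X))              # out-of-fold predictions, averaged over seeds
    test_pred = np.zeros(len(X_test))   # test predictions, averaged over all fits
    n_fits = n_splits * len(seeds)
    for seed in seeds:
        skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
        for train_idx, valid_idx in skf.split(X, y):
            model = make_model()
            model.fit(X[train_idx], y[train_idx])
            oof[valid_idx] += model.predict_proba(X[valid_idx])[:, 1] / len(seeds)
            test_pred += model.predict_proba(X_test)[:, 1] / n_fits
    return oof, test_pred

oof, test_pred = repeated_cv_bag(
    lambda: LogisticRegression(max_iter=1000), X_train, y_train, X_test
)
cv_score = log_loss(y_train, oof)       # local CV estimate from the bagged OOF predictions
```

Each training row gets one out-of-fold prediction per seed, so the averaged `oof` vector gives a CV score that is less sensitive to any single split, while `test_pred` is the bag of all 25 fitted models.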
