As is always the case with challenges involving small datasets, I was skeptical about the CV-LB score correspondence, so during the later stages I relied only on improving my local CV scores.
Here’s how I achieved the 6th place finish: my models consisted of LightGBM, an ANN (a 2-hidden-layer MLP with a softmax output layer), and RGF to add some diversity to the mix. The final predictions for A and B were simply a weighted average of these three, in the ratio 0.45, 0.45, 0.1 respectively. As @LastRocky had mentioned before, NNs weren’t really performing for C; perhaps a different kind of pre-processing was needed beyond mere standardization. Hence, a simple average of LightGBM and RGF was used for C.
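For concreteness, here is a minimal sketch of that blending step. It assumes each model has already produced probability predictions as NumPy arrays (`pred_lgb`, `pred_nn`, `pred_rgf` are hypothetical names, not from the original code):

```python
import numpy as np

def blend_country_a_b(pred_lgb, pred_nn, pred_rgf, weights=(0.45, 0.45, 0.10)):
    """Weighted average of LightGBM, MLP and RGF predictions for countries A and B."""
    w_lgb, w_nn, w_rgf = weights
    return w_lgb * pred_lgb + w_nn * pred_nn + w_rgf * pred_rgf

def blend_country_c(pred_lgb, pred_rgf):
    """Simple average of LightGBM and RGF for country C (the NN is dropped)."""
    return (pred_lgb + pred_rgf) / 2.0
```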
For feature generation, as I mentioned in the other thread, taking frequency counts of the categorical variables and the mean of the numerical variables when merging the individual-level data into the household data gave a noticeable improvement. All other categorical variables in the household data were label encoded. To reduce the feature count, I used RFECV with LightGBM as the base estimator, simply because it was very fast; a sketch of this pipeline is shown below. In the end, 240, 212, and 6 features were left for A, B and C respectively.
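The sketch below shows one plausible reading of that pipeline: per-household means of numeric individual-level columns, per-household counts of each categorical value, and RFECV on top. The frame and column names (`hhold`, `indiv`, `id`) are hypothetical, and the RFECV parameters are illustrative rather than the exact ones used:

```python
import pandas as pd
import lightgbm as lgb
from sklearn.feature_selection import RFECV
from sklearn.preprocessing import LabelEncoder

def merge_individual_features(hhold, indiv, id_col="id"):
    """Aggregate individual-level data to household level and join it on."""
    num_cols = [c for c in indiv.columns
                if c != id_col and pd.api.types.is_numeric_dtype(indiv[c])]
    cat_cols = [c for c in indiv.columns
                if c != id_col and not pd.api.types.is_numeric_dtype(indiv[c])]

    # Mean of numeric individual-level variables per household.
    num_agg = indiv.groupby(id_col)[num_cols].mean().add_suffix("_mean")

    # Frequency counts of each categorical value within a household.
    cat_agg = pd.concat(
        [indiv.groupby(id_col)[c].value_counts().unstack(fill_value=0)
              .add_prefix(f"{c}_cnt_") for c in cat_cols],
        axis=1,
    )
    return hhold.join(num_agg).join(cat_agg)

def label_encode(df):
    """Label-encode the remaining categorical household columns."""
    out = df.copy()
    for c in out.select_dtypes(exclude="number").columns:
        out[c] = LabelEncoder().fit_transform(out[c].astype(str))
    return out

def select_features(X, y):
    """RFECV with LightGBM as the base estimator (chosen mainly for speed)."""
    selector = RFECV(lgb.LGBMClassifier(n_estimators=200),
                     step=10, cv=5, scoring="neg_log_loss")
    selector.fit(X, y)
    return X.columns[selector.support_]
```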
I also found that imputing missing values as zeroes slightly improved scores. For a robust CV scheme, I ran stratified 5-fold CV five times (changing the seed each time) and bagged the predictions across all folds for each of my models.
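A minimal sketch of that seed-bagged CV scheme is below. It assumes a `make_model` factory (hypothetical) that returns a fresh classifier such as `LGBMClassifier`, and that test predictions are simply averaged over all folds and repeats:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def bagged_cv_predict(make_model, X, y, X_test, n_repeats=5, n_splits=5):
    """Stratified 5-fold CV repeated with 5 seeds; test predictions are bagged."""
    X = np.nan_to_num(np.asarray(X, dtype=float), nan=0.0)      # impute missing as zero
    X_test = np.nan_to_num(np.asarray(X_test, dtype=float), nan=0.0)
    y = np.asarray(y)
    test_pred = np.zeros(len(X_test))
    for seed in range(n_repeats):
        skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
        for tr_idx, _ in skf.split(X, y):
            model = make_model()
            model.fit(X[tr_idx], y[tr_idx])
            test_pred += model.predict_proba(X_test)[:, 1]
    return test_pred / (n_repeats * n_splits)
```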