Calling on the LB leaders: Did you use the indiv data at all?

Hi pover-t predictors,

This is Andrew; I’m a student based in the Middle East. I’ve mostly been playing around with the household data, which gives relatively nice results. I did one submission with the individual files, averaging the predicted probability for each household ID, but it did rather poorly. I wonder what strategies there are in general for dealing with such a situation? Any pointers would be much appreciated.

Happy predicting!


@yipcma: Using the individual data does bring some improvement, although the bulk of the signal comes from the household data. Instead of averaging predictions on a common ID, it’s better to keep the household data static and focus on building new features derived only from the individual data, finally merging them on the same ID. Done right, this approach should improve your scores.
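A minimal sketch of that workflow with pandas, on toy frames — the column names (`id`, `hh_feat`, `age`) are made up for illustration, not the real competition columns:

```python
import pandas as pd

# Toy stand-ins for the real files; column names here are assumptions.
hhold = pd.DataFrame({"id": [1, 2], "hh_feat": [0.5, 1.2]})
indiv = pd.DataFrame({
    "id":  [1, 1, 1, 2, 2],
    "age": [34, 8, 61, 25, 27],
})

# Collapse the individual rows down to one row per household id.
indiv_feats = indiv.groupby("id").agg(
    n_members=("age", "size"),
    mean_age=("age", "mean"),
).reset_index()

# Keep the household table static and left-join the new features on id.
train = hhold.merge(indiv_feats, on="id", how="left")
print(train)
```

A left join keeps every household row even when a household has no individual records (those new features would just come out as NaN).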


For example, you can count the number of positive and negative values per household.
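A sketch of that count on made-up individual-level data (`id` and `val` are illustrative names):

```python
import pandas as pd

# Assumed toy individual-level data with a signed numeric column.
indiv = pd.DataFrame({
    "id":  [1, 1, 2, 2, 2],
    "val": [3, -1, -2, 5, 7],
})

# Per household: how many positive and how many negative values.
counts = indiv.groupby("id")["val"].agg(
    n_pos=lambda s: (s > 0).sum(),
    n_neg=lambda s: (s < 0).sum(),
).reset_index()
print(counts)
```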


Thanks @nickil21 @sagol, I’ll give these a try tonight. Do you have any good pointers on feature engineering for this particular case (collapsing multiple observations into one)? Any favorite blog posts or books would be much appreciated.

Also, I’ve noticed that quite a few columns have only one value (category). Some serious deep cleaning could be done there as well, even just on the hhld data. Pointers and resources on this (feature selection?) would be appreciated too.
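One simple cleaning step along those lines: drop columns with a single distinct value, since they carry no signal. A sketch on a made-up frame:

```python
import pandas as pd

# Hypothetical household frame with a single-valued column.
hhold = pd.DataFrame({
    "id":    [1, 2, 3],
    "const": ["a", "a", "a"],   # only one category -> no signal
    "feat":  [0.1, 0.7, 0.3],
})

# Keep only columns with more than one distinct value
# (nunique ignores NaN by default).
keep = [c for c in hhold.columns if hhold[c].nunique() > 1]
hhold = hhold[keep]
print(hhold.columns.tolist())
```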

Happy coding!

@yipcma: Well, a groupby aggregate seems to be an ideal starting point for data reduction: see which functions go well with numerical and categorical columns, and use them accordingly.
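A sketch of dtype-aware aggregation on toy data (column names are made up): mean/max/min for a numeric column, and number of distinct values plus the most frequent category for a categorical one:

```python
import pandas as pd

indiv = pd.DataFrame({
    "id":  [1, 1, 2, 2],
    "num": [10.0, 20.0, 5.0, 7.0],
    "cat": ["x", "x", "y", "z"],
})

# Numeric column: standard summary statistics per household.
num_agg = indiv.groupby("id")["num"].agg(["mean", "max", "min"])

# Categorical column: cardinality and most frequent value per household.
cat_agg = indiv.groupby("id")["cat"].agg(
    nunique="nunique",
    top=lambda s: s.mode().iloc[0],
)

reduced = num_agg.join(cat_agg).reset_index()
print(reduced)
```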


Does the competition end today or tomorrow?

Feb. 28, 2018, 11:59 p.m. UTC


What type of join did you all use to merge the two sets of data?

@wbickelmann I’m following the advice above and trying out counts etc. on the indiv columns. I’m hoping to learn more about the different ways of summarizing the columns and, more importantly, about feature selection. I feel there are too many columns…

Hey @nickil21 @sagol, I’ve just submitted using all the non-NA columns from indiv, aggregating the mean for numeric columns and the mode for categorical ones. The score increased a tiny bit, and my 3-fold CV didn’t really improve. I’d love to know whether I’m doing this correctly, or what I can do to improve the score. What did you do for feature selection?

Thanks a ton :slight_smile:

@yipcma: You can, for instance, change the seed value when creating the fold indices to see whether the 3-fold CV results are comparable and your model is stable. Maybe try a 5-fold stratified CV approach? I haven’t done anything on the feature selection side; I’m just dumping in the entire feature list for now.
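A quick way to sanity-check fold stability along those lines, using scikit-learn’s `StratifiedKFold` on toy data (the labels here just stand in for the real binary target):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Placeholder features and a balanced binary target.
y = np.array([0, 1] * 25)
X = np.arange(100).reshape(50, 2)

# Re-running CV with different seeds checks that scores are stable.
for seed in (0, 1, 2):
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    fold_sizes = [len(test_idx) for _, test_idx in skf.split(X, y)]
    assert sum(fold_sizes) == len(y)
    # ...fit the model on each fold here and compare mean/std of scores
```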

@nickil21 Hmm… have you done any parameter tuning, grid search for example? I was just surprised that incorporating that much more information gives such a minimal gain… Also, how did you deal with NAs?

No, I’m just using some sensible default parameters without hyperparameter tuning. I’m not imputing missing values either.

@nickil21 Hmm… did you use early stopping or anything else to prevent over-training? I’m curious which family of algorithms you’re trying. I’ve been looking at decision trees.

So far, LightGBM is working well for this problem. And yes, I use early stopping to monitor whether the model is overfitting.


@nickil21 thanks, will give that a try and see if I get some improvement.