Calling on the LB leaders: Did you use the indiv data at all?

Hi pover-t predictors,

This is Andrew; I'm a student based in the Middle East. I've been playing around mostly with the household data, which gives relatively nice results. I did one submission with the individual files, averaging the predicted probabilities over each household id, but it did rather poorly. I wonder what strategies there are in general for dealing with such a situation? Any pointers would be much appreciated.

Happy predicting!

Andrew

@yipcma: Using the individual data does bring some improvement, although the bulk of it comes from the household data. Instead of averaging predictions over a common ID, it's better to keep the household data as-is and focus on building new features from the individual data alone, finally merging them onto the household table on the same ID. Done right, this approach should improve your scores.
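A minimal sketch of that approach, assuming the household and individual files share an `id` column and the individual file has a numeric `age` column (the file and column names here are placeholders, not the actual competition files):

```python
import pandas as pd

# Placeholder file names -- substitute the actual competition files.
hhold = pd.read_csv("household_train.csv")
indiv = pd.read_csv("individual_train.csv")

# Collapse the individual rows into household-level features,
# e.g. household size and the mean of a numeric column.
indiv_feats = indiv.groupby("id").agg(
    hh_size=("id", "size"),
    age_mean=("age", "mean"),  # "age" is a hypothetical numeric column
).reset_index()

# Keep the household table as-is and left-join the new features on id.
train = hhold.merge(indiv_feats, on="id", how="left")
```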

For example, you can extract the number of positive and negative values for each household.
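In code, that suggestion could look roughly like this (a minimal sketch; the `id` and file names are placeholders, and it assumes the individual file's numeric columns are encoded so that their sign is meaningful):

```python
import pandas as pd

indiv = pd.read_csv("individual_train.csv")  # placeholder file name

# Per-household counts of positive and negative numeric entries.
num = indiv.select_dtypes("number").drop(columns=["id"], errors="ignore")
sign_counts = pd.DataFrame({
    "id": indiv["id"],
    "n_pos": (num > 0).sum(axis=1),
    "n_neg": (num < 0).sum(axis=1),
}).groupby("id").sum().reset_index()
```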

Thanks @nickil21 @sagol, I'll give them a try tonight. Do you have any good pointers on feature engineering in this particular case (collapsing multiple observations into one)? Any of your favorite blog posts or books would be much appreciated.

Also, I've seen that there are quite a few columns that have only one value (category). There could be some serious deep cleaning to be done there as well, just from the household data. Pointers and resources on this (feature selection?) would be appreciated too.
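For reference, this is the kind of clean-up I mean, a minimal sketch of dropping those single-value columns (the file name is a placeholder):

```python
import pandas as pd

hhold = pd.read_csv("household_train.csv")  # placeholder file name

# Columns with a single unique value carry no information for the model.
constant_cols = [c for c in hhold.columns if hhold[c].nunique(dropna=False) <= 1]
hhold = hhold.drop(columns=constant_cols)
```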

Happy coding!

@yipcma: Well, a groupby-aggregate seems to be an ideal starting point for data reduction - see which aggregation functions suit numerical and categorical columns and apply them accordingly.
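A rough sketch of what that could look like, with different aggregation functions for numeric and categorical columns (the `id` and file names are placeholders):

```python
import pandas as pd

indiv = pd.read_csv("individual_train.csv")  # placeholder file name

num_cols = indiv.select_dtypes("number").columns.drop("id", errors="ignore")
cat_cols = indiv.select_dtypes("object").columns

# Numeric columns: summary statistics per household.
num_agg = indiv.groupby("id")[list(num_cols)].agg(["mean", "max"])
num_agg.columns = ["_".join(c) for c in num_agg.columns]

# Categorical columns: number of distinct categories per household.
cat_agg = indiv.groupby("id")[list(cat_cols)].nunique().add_suffix("_nunique")

indiv_feats = num_agg.join(cat_agg).reset_index()
```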

Does the competition end today or tomorrow?

Feb. 28, 2018, 11:59 p.m. UTC

What type of join did you all use to merge the two sets of data?

@wbickelmann I'm following the advice above and trying out counts etc. on the individual columns. I'm hoping to learn more about the different ways of summarizing the columns and, more importantly, feature selection. I feel that there are too many columns…

Hey @nickil21 @sagol, I've just submitted using all the non-NA columns from the individual data, aggregating with mean for numeric columns and mode for categorical ones. The score increased a tiny bit, and my 3-fold CV didn't really improve. I'd love to know whether I'm doing this correctly, and what I can do to improve the score. What did you do for feature selection?

Thanks a ton :slight_smile:

@yipcma: You can, for instance, change the seed value when creating the fold indices to see whether the 3-fold CV results are comparable and your model is stable – maybe also try a 5-fold stratified CV approach? I haven't done anything on the feature selection side; I'm just dumping in the entire feature list for now.
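Something like this is what I have in mind for the seed check, a minimal sketch with a 5-fold stratified split (the model and data here are stand-ins, not the actual pipeline):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier  # stand-in model

X = np.random.rand(200, 10)        # dummy features
y = np.random.randint(0, 2, 200)   # dummy binary target
model = RandomForestClassifier(n_estimators=100, random_state=0)

# Re-run the 5-fold stratified CV with different seeds; a stable model
# should give a similar mean log loss each time.
for seed in (0, 1, 2):
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    scores = cross_val_score(model, X, y, cv=cv, scoring="neg_log_loss")
    print(seed, -scores.mean())
```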

@nickil21 hmm… have you done any parameter tuning, grid search for example? I was just surprised that incorporating that much more information gives only a minimal gain… Also, how did you deal with NAs?

No, I'm just using a set of reasonable parameter values without any hyperparameter tuning. I'm not imputing missing values either.

@nickil21 hmm… did you use early stopping or anything else to prevent over-training? I'm curious which family of algorithms you are trying. I've been looking at decision trees.

So far, LightGBM is working well for this problem. And yes, I use early stopping to check whether the model is overfitting.
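A minimal sketch of that setup with LightGBM and early stopping on a held-out validation split (assumes a recent LightGBM version with the callback API; `X` and `y` are dummy stand-ins for the real features and target):

```python
import numpy as np
import lightgbm as lgb
from sklearn.model_selection import train_test_split

X = np.random.rand(500, 20)        # dummy features
y = np.random.randint(0, 2, 500)   # dummy binary target
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

train_set = lgb.Dataset(X_tr, label=y_tr)
valid_set = lgb.Dataset(X_val, label=y_val, reference=train_set)

params = {"objective": "binary", "metric": "binary_logloss", "learning_rate": 0.05}

# Stop once the validation log loss has not improved for 50 rounds.
model = lgb.train(
    params,
    train_set,
    num_boost_round=1000,
    valid_sets=[valid_set],
    callbacks=[lgb.early_stopping(stopping_rounds=50)],
)
```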

@nickil21 thanks, I'll give that a try and see if I get some improvement.