Share your approach!

bull · February 10, 2015, 3:36pm

Like with Blood Donations and Millennium Development Goals, this competition is just for fun, so we want to treat it as a learning opportunity.

What approaches are you using to tackle this data? Sharing your process and tools helps our community members who are launching their data science careers learn and improve.

What score and rank have you achieved?
Do you use Python or R? Julia or Java? Stata or SAS?
Are you preprocessing any of the features?
Are you using an ensemble of methods or leaning on something standard?
What features of the data help or hurt your solutions?
If you’ve got your code on GitHub or elsewhere, share a link!

washier · February 11, 2015, 12:13pm

Score = 0.8218 (current rank = 2)
Using R to clean data/preprocess features, C# to model (using ALGLIB)
Dropped some features and reduced the number of levels for some of the factors (categorical features). Created 2 new features.
No ensembles, nothing special.
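
The post does not show how the factor levels were reduced, so the following is only a sketch of one common way to do it: keep the most frequent levels and lump the rest together. The helper name `collapse_levels`, the `keep` cutoff, and the toy values are all hypothetical, not taken from the shared code.

```r
# Hypothetical helper: keep the `keep` most frequent levels of a factor
# and lump every other non-missing value into a single "other" level.
collapse_levels <- function(x, keep = 10, other = "other") {
  x <- as.character(x)
  counts <- sort(table(x), decreasing = TRUE)  # table() ignores NA by default
  top <- names(counts)[seq_len(min(keep, length(counts)))]
  x[!(x %in% top) & !is.na(x)] <- other
  factor(x)
}

# Toy values standing in for a high-cardinality column such as installer
installer <- c("DWE", "DWE", "DWE", "Gov", "Gov", "RWE", "Commu", "KKKT")
reduced <- collapse_levels(installer, keep = 2)
table(reduced)
```

A frequency cutoff is the simplest choice; grouping levels by domain knowledge is another option and may behave differently.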

BKR · February 25, 2015, 4:46am

Hi @washier, would you mind sharing your R code and providing a bit more detail?
Do you have an email where I can contact you?

washier · February 25, 2015, 7:04am

@BKR,

Sure. Apparently I can’t attach anything other than images to this message, so I’ll share some Dropbox links. Hope that’s OK.

2 pieces of R code. The first piece of code cleans the data and produces 2 new CSVs (one for the training data, the other for the test data). The comments in the code should explain everything.

The second piece of code transforms the data produced by the first piece of code by changing all the factors to dummy variables. This is required by ALGLIB.

I feed the data produced by the second piece of code to ALGLIB’s Random Decision Forest algorithm.
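
The linked scripts are not reproduced in this thread. As a rough illustration of the factor-to-dummy step only, base R's `model.matrix` can expand factors into 0/1 indicator columns. The data frame and its values below are made up, and this is not necessarily how the original script does it.

```r
# Toy data frame with one numeric and two factor columns (made-up values)
pumps <- data.frame(
  population = c(109, 0, 250),
  funder     = factor(c("Gov", "Roman", "Gov")),
  installer  = factor(c("DWE", "DWE", "RWE"))
)

# "~ . - 1" expands the factors into 0/1 indicator columns and drops the intercept.
# With this formula only the first factor keeps all of its levels;
# each later factor drops one reference level.
dummies <- model.matrix(~ . - 1, data = pumps)
dummies
```

The train and test sets need the same factor levels before this step, otherwise the two matrices can end up with different columns.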

BKR · February 25, 2015, 9:25am

Score = 0.8106, Rank = 7
SQL to clean/prep data. Also dropped some variables and created a few new ones. I am not yet happy with my data and still experimenting with ideas. One area where I am unsure about best practice is reducing levels of categorical variables.
Modeling in R (caret); best score with an ensemble of about 6 models.
Tried H2O, but failed so far mostly due to overfitting.

sushiyan · March 18, 2015, 7:07am

Many wells have population = 0: about 37% of the pumps. Does anyone know whether the data is accurate, or whether it reflects a failure to capture the actual population?

tjox · March 23, 2015, 5:35am

Pretty sure 0 is equivalent to missing. There seems to be quite a bit of missing data across all the variables, though it’s not consistently coded. Might be worth experimenting with multiple imputation.
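
If zeros are treated as missing, a first step might look like the sketch below. It uses single median imputation, which is a much simpler stand-in for the multiple imputation suggested above, and the population values are invented for illustration.

```r
# Toy data: zeros stand in for unrecorded population counts (made-up values)
population <- c(0, 120, 0, 45, 300, 0, 80)

# Recode 0 as NA so the gap is explicit rather than a real measurement
population[population == 0] <- NA
mean(is.na(population))  # share of missing values

# Keep a flag so a model can still see which values were filled in
was_missing <- is.na(population)

# Single median imputation (not multiple imputation)
imputed <- ifelse(was_missing, median(population, na.rm = TRUE), population)
```

Whether zero really means missing is an assumption here; it should be checked per column before recoding.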

I managed to get to .798 with a random forest. I tried collapsing the funder/installer variables into something akin to international/government/local/unknown with the assumption that there might be a quality difference, but it was only a marginal improvement .76. The model basically fails to predict the ‘in need of repairs’ category completely, unfortunately.
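
One way to see a failure like this on a minority class is to look at per-class recall instead of overall accuracy. The sketch below uses invented labels and predictions; the class names are illustrative and may not match the exact strings in the competition data.

```r
# Made-up labels and predictions for three status classes
lv <- c("functional", "needs repair", "non functional")
actual <- factor(c("functional", "functional", "functional",
                   "non functional", "non functional",
                   "needs repair", "needs repair"), levels = lv)
predicted <- factor(c("functional", "functional", "non functional",
                      "non functional", "non functional",
                      "functional", "non functional"), levels = lv)

# Rows are true classes, columns are predicted classes
confusion <- table(actual = actual, predicted = predicted)

# Recall per class: correct predictions divided by the true count of that class
recall <- diag(confusion) / rowSums(confusion)
accuracy <- sum(diag(confusion)) / sum(confusion)
recall
```

In this toy example the overall accuracy looks acceptable while the recall for the middle class is zero, which is the pattern described above.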

Jellis · June 2, 2015, 4:17pm

Thank you for the cleaned data. I have been working on the $installer portion for days, basically taking the long way around trying to code each variable. I am a noob at the data engineering experience, but I feel silly for not thinking about the summary() option for these values.

KeynesYouDigIt · July 15, 2015, 4:25am

Dude, this is very well done. Bravo. Are you taking questions on your method still, this late in the game? I think I get what you are doing but I might still have a thing or two I want to run by you.

washier · July 15, 2015, 2:53pm

Thanks. It’s quite a while back but, fire away, I’ll try to answer as best I can

KeynesYouDigIt · July 29, 2015, 4:16am

Thanks! Sorry it took so dang long to get back to you. I’m a pretty big data science n00b, so bear with me.

Why not start your sorting and mining by binding the “output” (functional, non-functional) to the training data?

# Join the labels to the training features (merge() joins on the column names both data frames share)
dat <- merge(train, Output)
# Parse the date once, then derive the day offset and the month of recording
recorded <- as.Date(dat$date_recorded)
dat$date_recorded_offset_days <- as.numeric(as.Date("2014-01-01") - recorded)
dat$date_recorded_month <- factor(format(recorded, "%b"))
# Drop the raw date column now that the derived features exist
dat$date_recorded <- NULL

dipetkov · November 14, 2015, 12:18pm

I used H2O’s random forest to get a score of .821. I spent more time transforming the features, deciding which features to keep and which features to transform. Otherwise I used the randomForest method with most of its default values, except for the number of trees to build. The code is on GitHub:

bull · November 14, 2015, 10:28pm

Thanks for sharing, @dipetkov. Great that you included .md documentation of your process!

Abdul · November 27, 2015, 11:10pm

Hey pals, glad to come across this competition. Is anyone using MATLAB? I’d like some pointers on how to start working on this with MATLAB. I’m a first-year MSc student. Cheers.

bhagyeshvikani · June 25, 2016, 6:57pm

Hello everyone,

My approach is very simple and uses RandomForest Classifier with 200 estimators. I ignored “wpt_name”, “subvillage”, “funder”, “installer” from the dataset.

I got score = 0.8205 and rank = 47.

holly · July 6, 2016, 12:55pm

Hello! Did you transform string values to integers? And how?

bhagyeshvikani · July 7, 2016, 9:14am

Hi, yes, I transformed string values to integers. For every feature that contains string values, I assigned a unique number to each unique label across both the training and testing data sets.
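
The post does not say which language or library was used for this step. A minimal sketch of the same idea in base R, with invented values, is to build one shared level set from both data sets and convert through `factor`:

```r
# Toy train/test columns (made-up values); the test set has a label unseen in train
train_installer <- c("DWE", "Gov", "DWE")
test_installer  <- c("Gov", "RWE")

# Build the level set from both data sets so each label gets one code everywhere
all_levels <- sort(unique(c(train_installer, test_installer)))

train_code <- as.integer(factor(train_installer, levels = all_levels))
test_code  <- as.integer(factor(test_installer,  levels = all_levels))
```

The integer codes carry an arbitrary order. Tree-based models usually tolerate that, but it is an assumption worth checking before feeding such codes to a linear model.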

If you have any questions, feel free to ask.

holly · July 8, 2016, 4:57pm

Thank you for your answer!

arnabitsme · August 1, 2016, 2:27pm

Hi,
Looking at the problem, I was wondering whether we can use logistic regression, since we only need to predict the status “Functional/Non functional” from the calculated probability.
Regards
Arnab

bhagyeshvikani · August 3, 2016, 11:42pm

Absolutely, we can use logistic regression for binary classification. But I found it less accurate than random forest.

What is your score with logistic regression?
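
For anyone who wants to try the logistic regression idea discussed above, here is a minimal sketch with base R's `glm`. The data are simulated, and the predictor name `age_years` and its relationship to failure are invented purely for illustration.

```r
# Made-up toy data: a binary status and one numeric predictor
set.seed(1)
n <- 200
age_years <- runif(n, 0, 40)
# Simulate failures becoming more likely with age (an invented relationship)
broken <- rbinom(n, 1, plogis(-2 + 0.1 * age_years))
toy <- data.frame(age_years, broken)

# Logistic regression: binomial family with the default logit link
fit <- glm(broken ~ age_years, data = toy, family = binomial)

# Predicted probabilities, thresholded at 0.5 to get class labels
prob <- predict(fit, type = "response")
pred <- as.integer(prob > 0.5)
mean(pred == toy$broken)  # training accuracy on the toy data
```

This handles only two classes; a status with more than two categories would need a multinomial model or a one-vs-rest setup instead.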
