Share your approach!

As with Blood Donations and Millennium Development Goals, this competition is just for fun, so we want to treat it as a learning opportunity.

What approaches are you using to tackle this data? Sharing your process and tools helps community members who are launching their data science careers learn and improve.

  • What score and rank have you achieved?
  • Do you use Python or R? Julia or Java? Stata or SAS?
  • Are you preprocessing any of the features?
  • Are you using an ensemble of methods or leaning on something standard?
  • What features of the data help or hurt your solutions?
  • If you’ve got your code on GitHub or elsewhere, share a link!

Score = 0.8218 (current rank = 2)
Using R to clean data/preprocess features, C# to model (using ALGLIB)
Dropped some features and reduced the number of levels for some of the factors (categorical features). Created 2 new features.
No ensembles, nothing special.
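
For readers wondering what reducing the number of levels of a factor can look like, here is a minimal sketch in Python. It is not the poster's R code: the function name, the `min_count` cutoff, and the toy installer values are all assumptions made for illustration. It simply lumps every level that appears fewer than `min_count` times into a single "other" level.

```python
from collections import Counter

def collapse_rare_levels(values, min_count, other_label="other"):
    """Replace levels seen fewer than min_count times with other_label."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other_label for v in values]

# Toy installer column; these names are illustrative, not taken from the data set.
installer = ["DWE", "DWE", "DWE", "Gov", "Gov", "Acme", "Zed"]
collapsed = collapse_rare_levels(installer, min_count=2)
print(collapsed)
```

The cutoff is a tuning choice: a higher `min_count` gives fewer levels but discards more detail.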


Hi @washier, would you mind sharing your R code and providing a bit more detail?
Do you have an email where I can contact you?


Sure. Apparently I can’t attach anything other than images to this message, so I’ll share some Dropbox links. Hope that’s OK.

Two pieces of R code. The first cleans the data and produces 2 new CSVs (one for the training data, the other for the test data). The comments in the code should explain everything.

The second piece of code transforms the data produced by the first piece of code by changing all the factors to dummy variables. This is required by ALGLIB.
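
As a rough illustration of the factor-to-dummy-variable step, the sketch below replaces one categorical column with a 0/1 column per level. It is written in plain Python rather than the poster's R, and the `one_hot` helper, the row layout, and the `source` values are hypothetical.

```python
def one_hot(rows, column):
    """Replace a categorical column with one 0/1 column per level."""
    # Collect the distinct levels once so every row gets the same columns.
    levels = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != column}
        for level in levels:
            new_row[f"{column}_{level}"] = 1 if row[column] == level else 0
        encoded.append(new_row)
    return encoded

# Toy rows; the column name and values are illustrative only.
rows = [
    {"id": 1, "source": "spring"},
    {"id": 2, "source": "river"},
    {"id": 3, "source": "spring"},
]
dummies = one_hot(rows, "source")
print(dummies[0])
```

In practice the levels would need to be collected from the training and test data together, so that both files end up with identical columns.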

I feed the data produced by the second piece of code to ALGLIB’s Random Decision Forest algorithm.


Score = 0.8106, Rank = 7
SQL to clean/prep data. Also dropped some variables and created a few new ones. I am not yet happy with my data and still experimenting with ideas. One area where I am unsure about best practice is reducing levels of categorical variables.
Modeling in R (caret); best score came from an ensemble of about 6 models.
Tried H2O, but failed so far mostly due to overfitting.


Many wells have population = 0: about 37% of the pumps. Does anyone know whether the data is accurate, or whether it reflects a failure to capture the actual population?


Pretty sure 0 is equivalent to missing. There seems to be quite a bit of missing data across all the variables, though it’s not consistently coded. Might be worth experimenting with multiple imputation.
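
A minimal sketch of treating 0 as missing, in Python. Note that it uses simple median imputation as a stand-in, which is a much cruder technique than the multiple imputation suggested above. The helper names and the toy `population` values are assumptions for illustration.

```python
from statistics import median

def zeros_to_missing(values):
    """Recode 0 as None, on the assumption that 0 means 'not recorded'."""
    return [None if v == 0 else v for v in values]

def impute_median(values):
    """Fill missing entries with the median of the observed ones."""
    observed = [v for v in values if v is not None]
    fill = median(observed)  # raises StatisticsError if nothing was observed
    return [fill if v is None else v for v in values]

# Toy population column; the numbers are made up.
population = [0, 120, 0, 300, 45, 0]
with_missing = zeros_to_missing(population)
missing_share = with_missing.count(None) / len(with_missing)
imputed = impute_median(with_missing)
print(missing_share, imputed)
```

Keeping a separate indicator of which rows were originally 0 may also be useful, since the fact that a value is missing can itself carry signal.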

I managed to get to .798 with a random forest. I tried collapsing the funder/installer variables into something akin to international/government/local/unknown with the assumption that there might be a quality difference, but it was only a marginal improvement .76. The model basically fails to predict the ‘in need of repairs’ category completely, unfortunately.
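
The kind of collapsing described here could be sketched as a keyword lookup, shown below in Python. The four group names come from the post; everything else is an assumption: the keyword lists are hypothetical placeholders, and treating an empty string or "0" as unknown is a guess about how missing funders are coded.

```python
# Hypothetical keyword lists; real funder names would need manual review.
GROUP_KEYWORDS = {
    "government": ("government", "ministry", "council"),
    "international": ("world bank", "unicef"),
}

def funder_group(name):
    """Map a raw funder/installer name to a coarse group."""
    # Assumption: None, empty strings, and "0" all mean the funder is unknown.
    if name is None or name.strip() in ("", "0"):
        return "unknown"
    lowered = name.lower()
    for group, keywords in GROUP_KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            return group
    # Anything that matches no keyword falls back to "local".
    return "local"

print([funder_group(n) for n in ["World Bank", "Ministry of Water", "Village Committee", "0"]])
```

A fallback to "local" is a strong assumption; names that match no keyword could just as reasonably be left in "unknown".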


Thank you for the cleaned data. I have been working on the $installer portion for days, basically taking the long way around trying to code each variable. I am a noob at the data engineering experience, but I feel silly for not thinking about the summary() option for these values.

Dude, this is very well done. Bravo. Are you still taking questions on your method, this late in the game? I think I get what you are doing, but I might still have a thing or two I want to run by you.

Thanks. It’s quite a while back but, fire away, I’ll try to answer as best I can :smiley:

Thanks! Sorry it took so dang long to get back to you. I’m a pretty big Data Science n00b so bear with me.

Why not start your sorting and mining by binding the “output” (functional, non-functional) to the training data, like this?

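# Join the training features with the outcome labels (merges on the shared columns)
dat <- merge(train, Output)
# Days between the recording date and a fixed reference date (2014-01-01)
date_recorded_offset_days <- as.numeric(as.Date("2014-01-01") - as.Date(dat$date_recorded))
# Month of recording as a factor (abbreviated month name)
date_recorded_month <- factor(format(as.Date(dat$date_recorded), "%b"))
# Replace the raw date column with the two derived features
dat <- dat[, -which(names(dat) == "date_recorded")]
dat <- cbind(dat, date_recorded_offset_days)
dat <- cbind(dat, date_recorded_month)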

I used H2O’s random forest to get a score of .821. I spent more time transforming the features, deciding which features to keep and which features to transform. Otherwise I used the randomForest method with most of its default values, except for the number of trees to build. The code is on GitHub:


Thanks for sharing, @dipetkov. :heart: that you included .md documentation of your process!

Hey pals, glad to come across this competition. Is anyone using MATLAB? I would like some clues on how to start working on this with MATLAB. I’m a first-year MSc student. Cheers


Hello everyone,

My approach is very simple and uses RandomForest Classifier with 200 estimators. I ignored “wpt_name”, “subvillage”, “funder”, “installer” from the dataset.

I got score = 0.8205 and rank = 47.

Hello! Did you transform string values to integers? And how?

Hi, yes, I transformed string values to integers. For the transformation, I assigned a unique number to each unique label, across both the training and testing data sets, for every feature that contains string values.

If you have any questions, feel free to ask.
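
The string-to-integer transformation described above can be sketched as follows in Python. This is not the poster's code: `fit_label_map` is a hypothetical helper and the basin values are only illustrative. The key point is that the mapping is built from the training and testing columns together, so a label that appears only in the test set still gets a code.

```python
def fit_label_map(*columns):
    """Assign an integer to every distinct label across all given columns."""
    labels = sorted({value for column in columns for value in column})
    return {label: index for index, label in enumerate(labels)}

# Toy columns; the values are illustrative only.
train_basin = ["Pangani", "Rufiji", "Pangani"]
test_basin = ["Rufiji", "Lake Victoria"]

mapping = fit_label_map(train_basin, test_basin)
train_encoded = [mapping[v] for v in train_basin]
test_encoded = [mapping[v] for v in test_basin]
print(mapping, train_encoded, test_encoded)
```

Integer codes impose an arbitrary order on the labels. Tree-based models such as random forests usually tolerate that, while linear models generally do better with dummy variables.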


Thank you for your answer!

Looking at the problem, I was wondering whether we could use logistic regression, since we only need to predict the status (“Functional”/“Non functional”) from the calculated probability.


Absolutely, we can use logistic regression for binary classification. But I found it less accurate than random forest.
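
To make the "predict from the calculated probability" idea concrete, here is a small Python sketch of how an already fitted logistic regression turns a linear score into a probability and then a label. The weights, bias, feature values, and the 0.5 threshold are all made-up assumptions; no model is trained here.

```python
import math

def predict_proba(features, weights, bias):
    """Probability of the positive class from a linear score via the sigmoid."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-score))

def predict_label(features, weights, bias, threshold=0.5):
    """Turn the probability into one of two status labels."""
    p = predict_proba(features, weights, bias)
    return "functional" if p >= threshold else "non functional"

# Hypothetical coefficients for two numeric features.
weights = [0.8, -1.2]
bias = 0.1

p = predict_proba([1.0, 0.5], weights, bias)
print(round(p, 4), predict_label([1.0, 0.5], weights, bias))
```

The thread also mentions an "in need of repairs" category, which a two-class model like this would not cover; a multinomial variant would be needed for that.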

What is your score with logistic regression?