What approaches are you using to tackle this data? Sharing your process and tools helps our community members who are launching their data science careers learn and improve.
What score and rank have you achieved?
Do you use Python or R? Julia or Java? Stata or SAS?
Are you preprocessing any of the features?
Are you using an ensemble of methods or leaning on something standard?
What features of the data help or hurt your solutions?
If you've got your code on GitHub or elsewhere, share a link!
Score = 0.8218 (current rank = 2)
Using R to clean the data and preprocess features, and C# to model (using ALGLIB).
Dropped some features and reduced the number of levels for some of the factors (categorical features). Created two new features.
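Roughly the kind of thing I mean, sketched in R (the column names, cut-offs, and derived features below are placeholders, not my actual choices):

```r
train <- read.csv("train.csv", stringsAsFactors = TRUE)

# Drop a couple of high-cardinality text columns (placeholder choices).
train$wpt_name   <- NULL
train$subvillage <- NULL

# Collapse rare factor levels into a catch-all "other" level.
collapse_levels <- function(x, top_n = 20) {
  keep <- head(names(sort(table(x), decreasing = TRUE)), top_n)
  factor(ifelse(x %in% keep, as.character(x), "other"))
}
train$funder <- collapse_levels(train$funder)

# Two example derived features: pump age and recording month.
# (Zeros in construction_year would need handling first.)
train$pump_age  <- as.integer(format(as.Date(as.character(train$date_recorded)), "%Y")) -
                   train$construction_year
train$rec_month <- factor(format(as.Date(as.character(train$date_recorded)), "%m"))
```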
No ensembles, nothing special.
Sure. Apparently I can't attach anything other than images to this message, so I'll share some Dropbox links. Hope that's OK.
Two pieces of R code. The first cleans the data and produces two new CSVs (one for the training data, the other for the test data). The comments in the code should explain everything.
The second piece of code transforms the data produced by the first piece of code by changing all the factors to dummy variables. This is required by ALGLIB.
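For anyone who doesn't want to dig through the Dropbox files, the dummy-variable step can be done along these lines (a sketch using caret's dummyVars, not my actual code; `clean` stands in for the cleaned training frame):

```r
library(caret)

# Expand every factor into 0/1 dummy columns, since ALGLIB needs numeric input.
dv      <- dummyVars(~ ., data = clean, fullRank = FALSE)
dummies <- as.data.frame(predict(dv, newdata = clean))
write.csv(dummies, "train_dummies.csv", row.names = FALSE)
```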
I feed the data produced by the second piece of code to ALGLIB's Random Decision Forest algorithm.
Score = 0.8106, Rank = 7
SQL to clean/prep data. Also dropped some variables and created a few new ones. I am not yet happy with my data and still experimenting with ideas. One area where I am unsure about best practice is reducing levels of categorical variables.
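One common recipe I've seen for the level reduction is frequency-based lumping, e.g. with the forcats package (the column name and the cut-off of 15 are just examples):

```r
library(forcats)

# Keep the 15 most frequent installer levels and lump the rest into "Other".
dat$installer <- fct_lump(factor(dat$installer), n = 15)
```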
Modeling in R (caret); my best score came from an ensemble of about six models.
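In outline it was something like the following; the three models here are stand-ins rather than my actual six, and a majority vote is just one simple way to combine them:

```r
library(caret)

# Train a few heterogeneous models with a shared resampling scheme.
ctrl    <- trainControl(method = "cv", number = 5)
fit_rf  <- train(status_group ~ ., data = train, method = "rf",  trControl = ctrl)
fit_gbm <- train(status_group ~ ., data = train, method = "gbm", trControl = ctrl, verbose = FALSE)
fit_knn <- train(status_group ~ ., data = train, method = "knn", trControl = ctrl)

# Majority vote over the predicted classes.
preds <- data.frame(rf  = predict(fit_rf,  test),
                    gbm = predict(fit_gbm, test),
                    knn = predict(fit_knn, test))
vote <- apply(preds, 1, function(r) names(which.max(table(r))))
```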
Tried H2O, but failed so far mostly due to overfitting.
There are many wells with population = 0; that's 37% of the pumps. Does anyone know whether the data is accurate, or is this a failure to capture the actual population?
Pretty sure 0 is equivalent to missing. There seems to be quite a bit of missing data across all the variables, though it's not consistently coded. Might be worth experimenting with multiple imputation.
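A sketch of that with the mice package (the column choices are illustrative):

```r
library(mice)

# Recode implausible zeros as NA first, then build five
# multiply-imputed data sets via chained equations.
dat$population[dat$population == 0]               <- NA
dat$construction_year[dat$construction_year == 0] <- NA

imp       <- mice(dat, m = 5, seed = 1)
completed <- complete(imp, 1)  # first completed data set
```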
I managed to get to .798 with a random forest. I tried collapsing the funder/installer variables into something akin to international/government/local/unknown, on the assumption that there might be a quality difference, but it was only a marginal improvement over .76. Unfortunately, the model fails almost entirely to predict the "in need of repairs" category.
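The collapsing went roughly like this (the keyword lists below are invented for illustration and aren't my actual rules):

```r
# Crude keyword-based grouping of installer into broad categories.
x   <- tolower(as.character(dat$installer))
grp <- rep("unknown", length(x))
grp[grepl("gov|council|district", x)] <- "government"
grp[grepl("unicef|world|danid",   x)] <- "international"
grp[grepl("village|communit",     x)] <- "local"
dat$installer_grp <- factor(grp)
```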
Thank you for the cleaned data. I have been working on the $installer portion for days, basically taking the long way around by trying to recode each value by hand. I'm new to data engineering, and I feel silly for not thinking of the summary() function for these values.
Dude, this is very well done. Bravo. Are you still taking questions on your method this late in the game? I think I get what you're doing, but I might still have a thing or two I'd like to run by you.
I used H2O's random forest to get a score of .821. I spent more of my time on the features, transforming them and deciding which to keep, than on the model. Otherwise I used the randomForest method with most of its default values, except for the number of trees to build. The code is on GitHub:
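In outline it looks like this (the file name and tree count are placeholders; as I said, I only changed the number of trees from its default):

```r
library(h2o)
h2o.init()

train_h2o <- h2o.importFile("train_clean.csv")
y <- "status_group"
x <- setdiff(colnames(train_h2o), y)

# Make sure the target is treated as categorical.
train_h2o[, y] <- as.factor(train_h2o[, y])

rf <- h2o.randomForest(x = x, y = y, training_frame = train_h2o, ntrees = 500)
```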
Hey, glad to come across this competition. Is anyone using MATLAB? I'd appreciate some clues on how to start working on this with MATLAB. First-year MSc student. Cheers!
My approach is very simple: a RandomForest classifier with 200 estimators. I ignored "wpt_name", "subvillage", "funder", and "installer" from the dataset.
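That's scikit-learn terminology; for the R users in this thread, a rough equivalent sketch would be (frame and target names are assumed):

```r
library(randomForest)

# Drop the four high-cardinality text columns, then fit a 200-tree forest.
# Note: base randomForest caps factors at 53 levels, so the remaining
# categorical columns may need lumping first.
drop_cols <- c("wpt_name", "subvillage", "funder", "installer")
dat <- dat[, !(names(dat) %in% drop_cols)]
fit <- randomForest(status_group ~ ., data = dat, ntree = 200)
```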
Hi, yes, I transformed string values to integers. For the transformation, I assigned a unique number to each unique label, across both the training and test sets, for every feature containing string values.
Hi,
Looking at the problem, I was wondering whether we could use logistic regression, since we need to predict the status "functional/non functional" based on the calculated probability.
Regards
Arnab
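One caveat on the logistic regression idea: the target here has three classes, including the "in need of repairs" category mentioned above, so plain two-class logistic regression doesn't apply directly. A hedged sketch of the multinomial version in R (all names are placeholders):

```r
library(nnet)

# Multinomial logistic regression over the three status classes.
fit   <- multinom(status_group ~ ., data = train, maxit = 200)
probs <- predict(fit, newdata = test, type = "probs")  # class probabilities
pred  <- predict(fit, newdata = test)                  # most probable class
```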