Share your approach!

Hey, can you help me out with how you transformed string values into integers? I would appreciate it if you could share your code with me. Thanks!

Link to code : https://github.com/BhagyeshVikani/Pump-it-Up-Data-Mining-the-Water-Table

I have not cleaned up my code yet, so there is still some debugging code in there.
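For a general illustration (not necessarily how the linked repo does it), here is a minimal Python sketch of two common ways to map string categories to integers; the column name is made up:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical categorical column
df = pd.DataFrame({"water_quality": ["soft", "salty", "soft", "unknown"]})

# Option 1: pandas factorize assigns an integer code to each unique string
df["water_quality_code"], uniques = pd.factorize(df["water_quality"])

# Option 2: scikit-learn's LabelEncoder does the same and can be reused later
le = LabelEncoder()
df["water_quality_le"] = le.fit_transform(df["water_quality"])

print(df)
```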

2 Likes

Hi ,
Hope you are doing well. Since the dataset is not that large, can we do away with dividing the training data further into training and test sets in a 70/30 ratio, and instead apply logistic regression directly on the whole training set?
Regards
Arnab

Hi

Yes, logistic regression is also worth trying for small datasets.
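If you still want a quick local sanity check before submitting, the 70/30 split is cheap to set up; here is a minimal scikit-learn sketch on toy data (standing in for the encoded features and status_group labels):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Toy 3-class data standing in for the encoded features and status_group labels
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_classes=3, random_state=0)

# 70/30 split for a quick hold-out estimate
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print("hold-out accuracy:", accuracy_score(y_val, model.predict(X_val)))
```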

Hello,

Can anyone explain the relationship between the date and whether a pump is functional or non functional?

Sharing my best solution…

The solution is currently in 8th place.
Score: 0.8247
Software Tools: XGBoost package in R
Brief Model Description: Ensemble of 11 XGBoost models, with equal weight given to each model

Feature Selection
The original data set contained 40 variables. I reduced this to 26 variables by removing those that were
similar to or duplicates of other variables.

Feature Engineering
For construction_year and gps_height, I replaced the 0 values with the median.
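The actual code is in R; purely as an illustration of the idea, a pandas version of that median replacement could look like this (toy values, with the median taken over the non-zero entries):

```python
import pandas as pd

# Toy stand-in for the competition training set
train = pd.DataFrame({
    "construction_year": [1998, 0, 2005, 0, 2010],
    "gps_height": [1390, 0, 686, 263, 0],
})

for col in ["construction_year", "gps_height"]:
    median = train.loc[train[col] != 0, col].median()  # median of the non-zero values
    train[col] = train[col].replace(0, median)

print(train)
```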

I first built a model using all of the available variables. I then removed variables that were duplicates and tested the model to understand how it changed the performance. I also used the xgb.importance function to understand each variable's influence on the model.
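For readers working in Python, a rough equivalent of R's xgb.importance is the gain-based importance from the booster; a small sketch on toy data:

```python
import xgboost as xgb
from sklearn.datasets import make_classification

# Toy data standing in for the encoded pump features
X, y = make_classification(n_samples=500, n_features=10, n_informative=5,
                           random_state=0)

model = xgb.XGBClassifier(n_estimators=50, max_depth=4)
model.fit(X, y)

# Gain-based importance per feature, roughly what xgb.importance reports in R
importance = model.get_booster().get_score(importance_type="gain")
for feat, gain in sorted(importance.items(), key=lambda kv: -kv[1]):
    print(feat, round(gain, 3))
```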

I used an ensemble of 11 XGBoost models, only updating the random seed for each iteration. This turned out to be more accurate than a single XGBoost model with a large number of iterations and a low eta (learning rate).
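A Python sketch of that seed-averaging idea on toy data (the real model is in R, and these parameters are placeholders, not the ones behind the 0.8247 score):

```python
import numpy as np
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Toy 3-class data standing in for functional / needs repair / non functional
X, y = make_classification(n_samples=2000, n_features=20, n_informative=10,
                           n_classes=3, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3,
                                                  random_state=1)

# Train 11 models that differ only in their random seed and average the
# predicted class probabilities with equal weight.
probas = []
for seed in range(11):
    model = xgb.XGBClassifier(n_estimators=200, max_depth=6, learning_rate=0.1,
                              subsample=0.8, colsample_bytree=0.8,
                              random_state=seed)
    model.fit(X_train, y_train)
    probas.append(model.predict_proba(X_val))

avg_proba = np.mean(probas, axis=0)
ensemble_pred = avg_proba.argmax(axis=1)
print("ensemble accuracy:", (ensemble_pred == y_val).mean())
```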

I think that there is still some room for improvement with this model. I removed the installer variable, but it may have predictive power if you could reduce the number of factor levels by grouping some of them together. Please share any suggestions you have on this, as I wasn't sure about the best way to do it.
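One simple way to group them would be to keep the most frequent installer levels and lump everything else into an "other" bucket; a minimal pandas sketch of that idea (toy values, arbitrary cutoff):

```python
import pandas as pd

# Toy stand-in for the installer column
installer = pd.Series(["DWE", "Government", "DWE", "RWE", "Commu", "DWE",
                       "Government", "Danida", "Hesawa", "DWE"])

# Keep the top-N most frequent installers, group everything else as "other"
top_n = 3
keep = installer.value_counts().nlargest(top_n).index
installer_grouped = installer.where(installer.isin(keep), other="other")

print(installer_grouped.value_counts())
```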

Here is a link to the R code: https://github.com/MattBrown88/Pump-it-Up-XGBoost-Ensemble

Please let me know if you have any questions on my methodology.

5 Likes

Here’s a GitHub repository showing how I tackled this competition. Any thoughts or suggestions
would be much appreciated.

Hi,

I worked in R with a random forest model. I cleaned the data and used either a decision tree or mean values for missing-value imputation.
Accuracy was 0.8036.
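A rough Python illustration of those two imputation options (the original work was in R, and the columns here are made up): filling with the column mean versus predicting the missing values with a small decision tree.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Toy data: gps_height has some missing values we want to fill
df = pd.DataFrame({
    "longitude":  [35.2, 36.7, 34.9, 37.1, 35.8, 36.3],
    "latitude":   [-4.1, -3.2, -5.0, -3.8, -4.4, -3.9],
    "gps_height": [1390, np.nan, 686, 263, np.nan, 1120],
})

# Option 1: simple mean imputation
mean_filled = df["gps_height"].fillna(df["gps_height"].mean())

# Option 2: predict the missing values from other columns with a decision tree
known = df["gps_height"].notna()
tree = DecisionTreeRegressor(max_depth=3, random_state=0)
tree.fit(df.loc[known, ["longitude", "latitude"]], df.loc[known, "gps_height"])
tree_filled = df["gps_height"].copy()
tree_filled[~known] = tree.predict(df.loc[~known, ["longitude", "latitude"]])

print(pd.DataFrame({"mean": mean_filled, "tree": tree_filled}))
```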

1 Like

Here is our approach:

We worked with R and SAS. The best model was an H2O Random Forest.

I am wondering why nobody used a feed-forward neural network (nnet) in R. Is there any special reason that I am missing?

For my first solution, I tried using SQL Server Data Tools and built a decision tree based on a few variables. I see that this competition has been extended, so I am going to see if I can come up with some other solutions. I do not know R yet, and I am limited to Microsoft tools at the moment.

Hi Matthew, good idea with how you tuned the ensemble.

One question: it seems that your data cleaning step used the “test data” (the set we submit predictions against). Typically this is not possible in the real world, as we don't get to see the actual test data when we make predictions. Thoughts?

Hi Nicolas - the reason I did it that way is that the actual test data was what we submitted against for grading. I needed to make sure the training and test data I used to build my model were in the same format; otherwise the algorithm wouldn't work. Does that clear it up?
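For what it's worth, the same consistent format can also be achieved by fitting each transformation on the training data only and then applying it to the test data; a minimal scikit-learn sketch of that pattern (toy column, hypothetical values):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy train/test frames with a categorical column whose levels differ
train = pd.DataFrame({"basin": ["Lake Victoria", "Pangani", "Rufiji"]})
test = pd.DataFrame({"basin": ["Pangani", "Internal", "Rufiji"]})

# Fit on the training data only; unseen test levels are encoded as all zeros
enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(train[["basin"]])

train_enc = enc.transform(train[["basin"]]).toarray()
test_enc = enc.transform(test[["basin"]]).toarray()  # same column layout as train_enc
print(train_enc.shape, test_enc.shape)
```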

Very clean code! “Beautifully crafted code.”

I think I am influenced by the Python PEP 8 suggestion; I feel like having one space after the hash symbol makes comments more readable. Thank you.

Hi all - I used a random forest model (Python) with pretty straightforward feature engineering and got a score of 0.8181. You can read about my logic in my write-up: https://zlatankr.github.io/posts/2017/01/23/pump-it-up

Original code can be found on my GitHub: https://github.com/zlatankr/Projects/tree/master/Tanzania

Feedback gladly welcome!
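For anyone who wants a quick starting point, a bare-bones random forest baseline in scikit-learn might look like this (toy data and arbitrary parameters, not the setup behind the 0.8181 score):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Toy 3-class data standing in for the encoded pump features and status_group
X, y = make_classification(n_samples=2000, n_features=25, n_informative=12,
                           n_classes=3, random_state=0)

rf = RandomForestClassifier(n_estimators=300, min_samples_split=6,
                            random_state=0, n_jobs=-1)
scores = cross_val_score(rf, X, y, cv=5, scoring="accuracy")
print("CV accuracy: %.4f (+/- %.4f)" % (scores.mean(), scores.std()))
```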

2 Likes

Hi @zlatankr, it’s a really impressive post. You have good writing skills, and the code is clear and very sequential. Thank you. I learned something new (LDA, using np.nan to replace 0’s and 1’s and then using the mean) - nice :slight_smile:

Can you also share some details about:

  • Why RF?
  • Any overfitting issues?

Thanks @maddula! I briefly tried a couple other algorithms (Logistic Regression, SVM, AdaBoost, XGBoost), but none of them returned good results for me. I know that others have used XGBoost to get good results, but the model was taking way too long to run on my machine, so I couldn’t commit enough energy to it.

As far as over-fitting goes, I was getting some of it initially; however, I just found a bug in my code yesterday, and once I fixed it, not only did the overall score go up, but the over-fitting disappeared (in fact, my test scores were higher than my cross-validation score).

1 Like

@zlatankr Thank you for the reply. Great to see your code improved further. (Also, you can update or create a “.gitignore” file to ignore “.pyc” files in your project.) :slight_smile:

Sure thing @maddula, done! :slight_smile:

Hi Matthew,
Thanks for sharing the idea of the ensemble.
But when I run your R code, I just can't reproduce your result.
Instead, I got 0.7401 on the leaderboard, which is not even close to 0.8.
I guess the difference may be caused by the XGBoost parameters?
Could you give me any hint?