Share your approach!

Hey man, can you help me out with how you transformed string values into integers? I would appreciate it if you could share your code with me. Thanks!

Link to code :

I have not cleaned up my code yet, so there is still some debugging code in there.
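For readers who cannot follow the link, here is a minimal sketch of one common way to turn string values into integers (often called label encoding). This is not the poster's code: the function name `encode_labels` and the sample values are hypothetical, and the codes are simply assigned in order of first appearance.

```python
def encode_labels(values):
    """Map each distinct string to an integer code, in order of first appearance."""
    mapping = {}
    codes = []
    for value in values:
        if value not in mapping:
            # Next unused integer becomes the code for a newly seen string
            mapping[value] = len(mapping)
        codes.append(mapping[value])
    return codes, mapping


# Hypothetical column values, for illustration only
codes, mapping = encode_labels(["gravity", "handpump", "gravity", "other"])
```

Keeping the `mapping` dictionary around matters: the same mapping has to be applied to the test data so that a given string always gets the same integer.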


Hi,
Hope you are doing well. As the dataset is not that large, can we do away with dividing the training data further into training and test sets in a 70-30% ratio, and apply logistic regression directly to the whole training set?


Yes, logistic regression is also worth trying for small datasets.
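For reference, the 70-30 split mentioned in the question can be sketched as below. This is only an illustration using the Python standard library; the function name `train_test_split`, the seed, and the toy rows are assumptions, not anything from the competition code.

```python
import random


def train_test_split(rows, test_fraction=0.3, seed=0):
    """Shuffle the rows reproducibly and hold out a fraction of them for testing."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    # First n_test shuffled rows are held out, the rest are used for training
    return shuffled[n_test:], shuffled[:n_test]


# Toy data: ten row indices standing in for ten training examples
train, test = train_test_split(range(10))
```

Skipping the split means there is no held-out estimate of accuracy before submitting, so the leaderboard score becomes the only feedback.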


Can anyone explain the relationship between the date and the functional / not functional status?

Sharing my best solution…

Solution is currently in 8th place
Score: .8247
Software Tools: XGBoost package in R
Brief Model Description: Ensemble of 11 XGBoost models with equal weight to each solution

Feature Selection
The original data set contained 40 variables. I reduced it down to 26 variables by removing variables that were similar to or duplicates of other variables.

Feature Engineering
For construction_year and gps_height, I replaced the 0 values with the median of each variable.
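A minimal sketch of this kind of imputation, written in Python rather than the R used in the post. It assumes the median is taken over the non-zero values only, which the post does not specify; the function name and the sample years are hypothetical.

```python
from statistics import median


def replace_zeros_with_median(values):
    """Replace 0 entries with the median of the non-zero entries (an assumption)."""
    nonzero = [v for v in values if v != 0]
    if not nonzero:
        # Nothing to compute a median from, so return the values unchanged
        return list(values)
    fill = median(nonzero)
    return [fill if v == 0 else v for v in values]


# Hypothetical construction_year values, where 0 stands for "unknown"
construction_year = [1999, 0, 2005, 2010, 0]
cleaned = replace_zeros_with_median(construction_year)
```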

I first built a model using all of the available variables. I then removed variables that were duplicates and tested the model to understand how its performance changed. I also used the xgb.importance function to understand each variable’s influence on the model.

I used an ensemble of 11 XGBoost models, changing only the random seed for each one. This turned out to be more accurate than a single XGBoost model with a large number of iterations and a low eta (learning rate).
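The equal-weight combination step can be sketched as follows. This is an illustration in plain Python, not the R code from the post: it assumes each model outputs class probabilities and that the ensemble averages them before picking the most likely class (the post does not say whether averaging or voting was used). The three small probability tables are made up.

```python
def average_predictions(per_model_probs):
    """Average class probabilities across models, giving each model equal weight."""
    n_models = len(per_model_probs)
    n_rows = len(per_model_probs[0])
    n_classes = len(per_model_probs[0][0])
    averaged = []
    for i in range(n_rows):
        averaged.append([
            sum(model[i][c] for model in per_model_probs) / n_models
            for c in range(n_classes)
        ])
    return averaged


def predict_classes(averaged):
    """Pick the index of the highest averaged probability for each row."""
    return [row.index(max(row)) for row in averaged]


# Made-up outputs of three models (e.g. same settings, different seeds)
model_a = [[0.6, 0.4], [0.2, 0.8]]
model_b = [[0.4, 0.6], [0.3, 0.7]]
model_c = [[0.7, 0.3], [0.1, 0.9]]
avg = average_predictions([model_a, model_b, model_c])
labels = predict_classes(avg)
```

In the first row, one model disagrees with the other two, and the average still settles on class 0; smoothing out that kind of seed-to-seed variation is the point of the ensemble.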

I think that there is still some room for improvement with this model. I removed the installer variable but it may have predictive power if you could reduce the number of factors by grouping some of them together. Please share any suggestions you have on this as I wasn’t sure about the best way to do it.
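One possible way to reduce the number of levels, sketched in Python rather than R, is to keep the frequent values and lump everything rare into a single bucket. The threshold `min_count`, the label "other", and the sample installer names are all hypothetical choices for illustration.

```python
from collections import Counter


def group_rare_levels(values, min_count=2, other_label="other"):
    """Keep levels seen at least min_count times; replace the rest with other_label."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other_label for v in values]


# Hypothetical installer values
installer = ["DWE", "DWE", "Gov", "Gov", "acme", "xyz"]
grouped = group_rare_levels(installer)
```

The counts should come from the training data, and the same set of kept levels should then be applied to the test data.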

Here’s a link to the R code

Please let me know if you have any questions on my methodology.


Here’s a GitHub repository showing how I tackled this competition. Any thoughts or suggestions would be much appreciated.


I worked in R with a random forest model. I cleaned the data and used a decision tree or mean values for missing value imputation.
Accuracy was 0.8036.


Here is our approach:

We worked with R and SAS. Our best model was an H2O Random Forest.

I am wondering why nobody used a feed-forward neural network (nnet) in R. Is there any special reason that I am missing?

For my first solution, I tried using SQL Server Data Tools and built a decision tree based on a few variables. I see that this competition has been extended, so I am going to see if I can come up with some other solutions. I do not know R yet and I am limited to Microsoft tools at the moment.

Hi Matthew, good idea with how you tune the ensemble.

One question: it seems that your data cleaning step used the “test data” (the data we submit predictions against). Typically this is not possible in the real world, as we don’t get to see the actual test data when we make predictions. Thoughts?

Hi Nicolas - I did it that way because the actual test data was what we submitted predictions against for grading. I needed to make sure the training and test data I used to build my model were in the same format; otherwise the algorithm wouldn’t work. Does that clear it up?

Very clean code! “Beautifully Crafted Code”.

I think I am influenced by the Python PEP suggestions: I feel that having one space after the hash symbol makes comments more readable. Thank you.

Hi all - I used a random forest model (Python) with pretty straightforward feature engineering and got a score of .8181. You can read my logic in my write-up:

Original code can be found on my GitHub:

Feedback gladly welcome!


Hi @zlatankr, it’s a really impressive post. Thank you. Your writing is good, and your code is clear and very sequential. I learned something new (lda, using np.nan to replace 0’s and 1’s and then using the mean) - nice 🙂

Can you also share some details about

  • Why RF?
  • Any overfitting issue?

Thanks @maddula! I briefly tried a couple other algorithms (Logistic Regression, SVM, AdaBoost, XGBoost), but none of them returned good results for me. I know that others have used XGBoost to get good results, but the model was taking way too long to run on my machine, so I couldn’t commit enough energy to it.

As far as over-fitting, I was getting some of it initially; however, I just found a bug in my code yesterday, and once I fixed it, not only did the overall score go up, but the over-fitting disappeared (in fact, my test scores were higher than my cross-validation score).


@zlatankr Thank you for the reply. Great to see your code improved further. (Also, you can update or create a “.gitignore” file to ignore “.pyc” files in your project.) 🙂

Sure thing @maddula, done! 🙂

Hi Matthew.
Thanks for sharing the ensemble idea.
But when I run your R code, I can’t reproduce your result.
Instead, I got 0.7401 on the LB, which is not even close to 0.8.
I guess the difference may be caused by the XGBoost parameters?
Could you give me any hints?