
Simple data cleaning and processing with a random forest classifier, score 0.8162

Link to the notebook: Github link

Hi guys,

This is a simple visualisation of the data in Python. It tries to explain what the different features mean and how they relate to one another through plots and charts, some of which show a very clear relationship between a feature and the target variable.

While cleaning the data, I removed some features that were similar to each other and created one new feature.
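One possible way to spot "similar" features, for anyone who wants to try it: the sketch below drops one column from each pair of highly correlated numeric columns. The post does not say how the similar features were identified, so the correlation criterion, the 0.95 threshold, the helper name `drop_correlated` and the toy columns are all assumptions for illustration, not the notebook's method.

```python
import numpy as np
import pandas as pd


def drop_correlated(df: pd.DataFrame, threshold: float = 0.95) -> pd.DataFrame:
    """Drop one column from every pair of numeric columns whose absolute
    Pearson correlation exceeds `threshold` (hypothetical helper)."""
    corr = df.select_dtypes(include="number").corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)


# Toy data (made up): `b` is an exact multiple of `a`, `c` is only weakly related.
demo = pd.DataFrame(
    {"a": [1, 2, 3, 4, 5], "b": [2, 4, 6, 8, 10], "c": [5, 1, 4, 2, 3]}
)
reduced = drop_correlated(demo)
print(list(reduced.columns))
```

This only looks at numeric columns. Categorical features that duplicate each other would need a different check, for example comparing their value counts or cross-tabulating them.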

The model was evaluated mainly with two classifiers:
1) Random Forest Classifier
2) XGBoost

Of these, RFC proved to be the better one after trying various configurations of both classifiers.
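A comparison like this can be sketched with cross-validation. The example below is only an illustration under stated assumptions: it uses synthetic data from `make_classification` instead of the competition data, scikit-learn's `GradientBoostingClassifier` as a stand-in for XGBoost so it runs without extra packages, and accuracy as the metric. None of the hyperparameters come from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the real training set (assumption).
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=0
)

models = {
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
    # Stand-in for XGBoost; swap in xgboost.XGBClassifier if it is installed.
    "gradient_boosting": GradientBoostingClassifier(n_estimators=50, random_state=0),
}

# Mean 5-fold cross-validated accuracy for each model.
cv_scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in models.items()
}

for name, score in cv_scores.items():
    print(f"{name}: {score:.4f}")
```

Scoring both models on the same folds makes the comparison fairer than a single train/test split, although the ranking on synthetic data says nothing about the ranking on the competition data.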

Your suggestions on improving my model are most humbly welcome.


My best attempt is 0.8248. I used virtually every feature + some very minor feature engineering. The model was a parameter-tuned version of xgboost.

In the last couple of days before the competition ends, I have been playing around with stacking different models, which has gotten me 0.82xx, but nothing better than 0.8248… yet.

There is not a lot of time left, but here are a few things you can try nonetheless:

  • ensembles (bagging, boosting or stacking)
  • mean-encoding of categorical features using the target variable
  • plot feature importance if using tree-based models, and drop unimportant features
  • create new features from the ones you have (feature engineering)
  • try to remove features that are very similar to each other
  • play around with both dummy variables and label-encoding
  • if you use only dummy variables and end up with A LOT of features, perform PCA or LDA on them to reduce the feature space
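Of the suggestions above, mean-encoding is the easiest to get wrong, because encoding a row with a mean that includes its own target leaks the label. A minimal sketch of an out-of-fold version follows. The helper `oof_mean_encode`, the column names and the toy data are hypothetical, and the fold count is an arbitrary choice.

```python
import pandas as pd
from sklearn.model_selection import KFold


def oof_mean_encode(df, col, target, n_splits=5, seed=0):
    """Out-of-fold mean encoding of `col` with respect to `target`.

    Each row is encoded with the target mean of its category computed on
    the other folds only, so a row never sees its own label.
    """
    global_mean = df[target].mean()
    encoded = pd.Series(global_mean, index=df.index, dtype=float)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in kf.split(df):
        # Category means from the rows outside the current fold.
        means = df.iloc[fit_idx].groupby(col)[target].mean()
        mapped = df.iloc[enc_idx][col].map(means).astype(float)
        # Categories unseen in the fitting rows fall back to the global mean.
        encoded.iloc[enc_idx] = mapped.fillna(global_mean).to_numpy()
    return encoded


# Toy data (made up): one categorical feature and a binary target.
demo = pd.DataFrame(
    {
        "category": list("aabbccaabb"),
        "target": [1, 0, 1, 1, 0, 0, 1, 1, 0, 1],
    }
)
demo["category_mean_enc"] = oof_mean_encode(demo, "category", "target")
print(demo)
```

For the test set, the usual approach is to map each category to its mean over the whole training set, again falling back to the global mean for unseen categories.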

Hi @Nilzone!
Congrats on your score. I would like to understand the 0.82xx score you have achieved…
Is it the score after submitting predictions on the test data, or is it your model's score on the training/validation data?