
Help required for feature engineering


#1

Can anyone help me engineer features and select the best ones, instead of applying a model to so many features?


#2

I think one really good way to do feature selection is to train a Random Forest or Extreme Gradient Boosting model on the data and then examine the feature importances from those models to pick out the most important features.

Here’s a link to scikit-learn’s implementation with Random Forest: http://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html
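A minimal sketch of the idea, using synthetic data as a stand-in for the competition dataset (the number of features to keep, 5, is just an illustrative choice):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the real data: 20 features, only 5 informative.
X, y = make_classification(n_samples=500, n_features=20,
                           n_informative=5, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

# Rank features by importance (highest first) and keep, say, the top 5.
ranking = np.argsort(forest.feature_importances_)[::-1]
top_features = ranking[:5]
X_selected = X[:, top_features]
print(X_selected.shape)  # (500, 5)
```

You could then refit your actual model on `X_selected` only. scikit-learn also ships `SelectFromModel`, which automates this thresholding if you prefer not to pick the cutoff by hand.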


#3

Thanks RonL. That’s a great suggestion. What do you think about using the Two-Step cluster analysis available in SPSS?


#4

I’m not really familiar with that method, but another quite popular way to reduce dimensionality would be through Principal Component Analysis.

PCA has a pretty handy parameter where you can set how much of the variance you preserve when you project the data to the principal components. This helps you to strike a balance between the number of dimensions and the amount of information they retain.

Here’s a pretty good write-up on that method: https://towardsdatascience.com/pca-using-python-scikit-learn-e653f8989e60
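As a rough sketch of that variance parameter (using the digits dataset purely as example data, and 0.95 as an arbitrary variance threshold):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_digits(return_X_y=True)  # 64 features
X_scaled = StandardScaler().fit_transform(X)

# Passing a float in (0, 1) to n_components keeps the smallest number of
# principal components whose cumulative explained variance reaches that fraction.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)
print(X_reduced.shape[1])  # fewer than the original 64 dimensions
```

Note that PCA is sensitive to feature scales, hence the `StandardScaler` step before projecting.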


#5

https://machinelearningmastery.com/an-introduction-to-feature-selection/ - could be a useful starting point


#6

This thread has a possibly useful SPSS screenshot for feature selection: https://stats.stackexchange.com/questions/66478/correlation-and-categorical-variables