Can anyone help me engineer the features and select the best ones, instead of applying the model to so many features?
I think one really good way to do feature selection is to train a Random Forest or Extreme Gradient Boosting model on the data and then examine the feature importances from those models to pick out the most important features.
Here’s a link to the scikit learn’s implementation with Random Forest: http://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html
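As a quick illustration, here's a minimal sketch of that idea on a synthetic dataset (the dataset and feature count are placeholders, not from your actual data):

```python
# Rank features by Random Forest importance on a toy classification task.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 10 features, only 3 of which are actually informative.
X, y = make_classification(n_samples=500, n_features=10,
                           n_informative=3, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

# Sort features from most to least important.
ranked = sorted(enumerate(forest.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for idx, importance in ranked:
    print(f"feature {idx}: {importance:.3f}")
```

You could then keep only the top-k features, or drop everything below some importance threshold, before fitting your final model.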
Thanks RonL. That's a great suggestion. What do you think about using the Two-Step cluster analysis available in SPSS?
I’m not really familiar with that method, but another quite popular way to reduce dimensionality would be through Principal Component Analysis.
PCA has a pretty handy parameter where you can set how much of the variance you preserve when you project the data to the principal components. This helps you to strike a balance between the number of dimensions and the amount of information they retain.
Here’s a pretty good write-up on that method: https://towardsdatascience.com/pca-using-python-scikit-learn-e653f8989e60
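To make the variance parameter concrete, here's a small sketch using scikit-learn's digits dataset as a stand-in for your data (passing a float between 0 and 1 as `n_components` keeps just enough components to explain that fraction of the variance):

```python
# Dimensionality reduction with PCA, keeping 95% of the variance.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_digits(return_X_y=True)           # 64 pixel features
X_scaled = StandardScaler().fit_transform(X)  # PCA is scale-sensitive

pca = PCA(n_components=0.95)                  # retain 95% of variance
X_reduced = pca.fit_transform(X_scaled)

print(X.shape[1], "->", X_reduced.shape[1], "dimensions")
print("variance retained:", pca.explained_variance_ratio_.sum())
```

One caveat: the resulting components are linear combinations of the original features, so unlike feature selection with importances, you lose direct interpretability of individual features.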
https://machinelearningmastery.com/an-introduction-to-feature-selection/ - could be a useful starting point.
This thread may also be useful for feature selection in SPSS (it includes a relevant screenshot): https://stats.stackexchange.com/questions/66478/correlation-and-categorical-variables