Handling Missing Values

Hello guys

Has anyone used a technique for handling missing categorical/ordinal values without distorting the data? This would be for variables like construction year, funder, installer, etc.

It depends on the algorithm you are going to use. For logistic regression you really need a more sophisticated strategy for the missing values, and you should use one-hot encoding for the categorical variables.
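
One possible way to combine the two for logistic regression, as a rough sketch only: treat "missing" as its own category, so the one-hot encoding gets a dedicated indicator column for it. The function name `one_hot`, the `MISSING` label, and the installer values below are made up for illustration, and this is just one option, not necessarily the best strategy for your data.

```python
def one_hot(values, missing_label="MISSING"):
    """One-hot encode a categorical column, with missing as its own level."""
    # Replace None with an explicit label so missingness gets its own column.
    filled = [missing_label if v is None else v for v in values]
    # Sort the distinct levels so the column order is reproducible.
    levels = sorted(set(filled))
    # Build one 0/1 indicator per level for every row.
    rows = [[1 if v == level else 0 for level in levels] for v in filled]
    return levels, rows


# Hypothetical installer column with one missing entry.
installer = ["DWE", None, "Gov", "DWE"]
levels, rows = one_hot(installer)

print(levels)
for row in rows:
    print(row)
```

Each row has exactly one indicator set, and the rows that were missing are the ones with a 1 in the `MISSING` column, so the model can learn a separate coefficient for them.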

For gradient boosting (XGBoost) I usually do the following: convert all the categorical values to numerical ones.
In R: as.numeric(…) (on a factor this returns the integer level codes; on a plain character vector it returns NA)

After that, give the missing values a clearly separate value, for example -999.
For a tree algorithm this is enough: it can then decide whether to place the missing values in a separate branch of the generated trees.
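
The two steps above (integer codes, then a sentinel for the missing values) could look roughly like this in plain Python. This is only a sketch: the helper name `encode_with_sentinel` and the funder values are made up, and the codes start at 1 simply to mirror what R's factor codes look like.

```python
def encode_with_sentinel(values, sentinel=-999):
    """Map categories to integer codes and missing values to a sentinel."""
    # Collect the observed levels, ignoring missing entries.
    levels = sorted({v for v in values if v is not None})
    # Assign codes 1..k so they can never collide with the negative sentinel.
    code_of = {level: i + 1 for i, level in enumerate(levels)}
    # Missing entries get the sentinel, everything else its integer code.
    codes = [sentinel if v is None else code_of[v] for v in values]
    return codes, code_of


# Hypothetical funder column with one missing entry.
funder = ["Unicef", None, "Danida", "Unicef"]
codes, mapping = encode_with_sentinel(funder)

print(mapping)
print(codes)
```

Because -999 lies far below every real code, a single split such as "value < 0" is enough for a tree to isolate the missing rows if that helps the fit.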
