New Data in Test Set: date_recorded: year 2001

Hello, I created a feature “year_recorded” from the date_recorded. But in the training set there is no data for the year 2001 whereas in the test set one line for 2001 appears.
Is that a mistake? Or is it normal, and we should deal with it? If so, how? My model gives me an error in R: Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels)
Thanks!

It is not a mistake, just a consequence of how the data was divided. When you create factors, the test set will often contain values that don’t appear in the training set.

Sometimes, a good approach is to combine the test and training set, create the factors, and then re-separate test and train. That way, 2001 will be recognized as a valid value when you see it in the test set.

Thank you for your answer :slight_smile:

Actually, I need a bit of clarification. @bull: By this, do you mean that a good approach is to train the model on the training data + test data?
If yes, the labels are of course only provided for the training data. How do we deal with this problem then?

He means that you should combine the features of the two sets (not the labels) and encode your variables on the combined data. Once the categories are encoded consistently across all the data, you split it back into the train and test sets and train on the training rows only, so the missing test labels are not a problem. This ensures that values that appear only in the test set and not the training set (or vice versa) are taken into account as part of the encoding.
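To make the combine-then-encode idea concrete, here is a minimal sketch in plain Python. The year values are made up for illustration, and the names `train_years`, `test_years`, `levels` and `encode` are hypothetical. In R, the analogous step would presumably be to pass the combined set of values as the `levels` argument of `factor()` for both data frames.

```python
# Hypothetical toy data: year_recorded values from each split.
# 2001 appears only in the test set, as in the original question.
train_years = [2011, 2013, 2004, 2012, 2011]
test_years = [2013, 2001, 2011]

# Build one shared set of levels from BOTH splits.
# Only the feature values are combined; the training labels are not involved.
levels = sorted(set(train_years) | set(test_years))
level_index = {year: code for code, year in enumerate(levels)}

def encode(years):
    """Map each year to its integer code under the shared levels."""
    return [level_index[year] for year in years]

# Encode each split separately, but with the same mapping.
train_codes = encode(train_years)
test_codes = encode(test_years)
```

Because the mapping is built from the union of both splits, 2001 gets a valid code even though no training row uses it. The model is still fitted on the training rows only.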

You can also create categorical variables using cut points, so for example any year before 2000 is “ancient”, 2000 through 2010 is “old”, and anything after 2010 is “modern”.
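The cut-point idea could be sketched like this in plain Python. The bin names come from the post above; treating both 2000 and 2010 as “old” is my assumption about the boundaries, and `year_bucket` is a hypothetical helper name. In R, `cut()` is the usual tool for this kind of binning.

```python
from bisect import bisect_right

# Lower bounds of the second and third bins (boundary handling is an assumption):
#   year < 2000          -> "ancient"
#   2000 <= year <= 2010 -> "old"
#   year >= 2011         -> "modern"
lower_bounds = [2000, 2011]
bin_labels = ["ancient", "old", "modern"]

def year_bucket(year):
    """Return the coarse category for a recording year."""
    # bisect_right counts how many lower bounds are <= year,
    # which is exactly the index of the bin the year falls into.
    return bin_labels[bisect_right(lower_bounds, year)]

# A year that is absent from the training set still maps to a known bin.
bucket_2001 = year_bucket(2001)
```

With bins like these, a rare year such as 2001 no longer forms its own factor level, so the model only needs to have seen the bin during training, not the exact year.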