Dearest,
I use caret + vtreat + a couple of models.
The funny thing is that the MAE calculated on the training data set (and I am using cross-validation) is in my case way better (around 15-18) than on the test data set (where it shoots up to 30).
Is anybody else experiencing this?
Cheers
Hey Larry, I was also experiencing the same problem, except my MAE is really bad. Do you mind sharing how you built your model? Thanks a lot!
What method are you using for cross-validation?
I just used sklearn’s TimeSeriesSplit and have a similar problem (e.g. XGBRegressor with basic features gives ~25 error on CV, ~28 on LB, and XGBRegressor with huber loss gives ~20 on CV and ~28 on LB).
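For reference, this is roughly the baseline setup I mean (just a sketch; `X`, `y` and the hyperparameters are placeholders, not my exact model):

```python
# Rough sketch of the baseline CV setup: X = feature matrix, y = total_cases,
# rows sorted by date; model parameters are placeholders.
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from xgboost import XGBRegressor

tscv = TimeSeriesSplit(n_splits=5)
model = XGBRegressor(n_estimators=300, max_depth=4, learning_rate=0.05)
# (For the huber-loss variant, recent xgboost versions accept
# objective="reg:pseudohubererror".)

# sklearn reports MAE as a negative score, hence the sign flip.
scores = cross_val_score(model, X, y, cv=tscv, scoring="neg_mean_absolute_error")
print("CV MAE per fold:", -scores)
print("CV MAE mean:", (-scores).mean())
```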
But I found the following:
- we have data from two cities (and these cities have different total_cases distributions)
- for each city we have data for a different time range
So the “baseline” TimeSeriesSplit splits the data the following way (just a sample):
[sj, time1], [sj, time2], [sj, iq, time3], [sj, iq, time4], [iq, time5]
As you can see, on some splits we build models only (or at least mostly) on data from a single city (sj or iq). So on those splits we approximate the distribution of only one city, but on the test set we need to approximate both cities’ distributions.
So I’ll try the following “algorithm” for cross-validation:
- split the dataset by city
- on each part, run TimeSeriesSplit
- for each pair of splits:
– concatenate the train/validation indices so the model is trained/validated on both cities
– train/validate the model
And my algorithm splits the data the following way:
[[sj, sjTime1], [iq, iqTime1]], [[sj, sjTime2], [iq, iqTime2]], [[sj, sjTime3], [iq, iqTime3]]
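Roughly like this in sklearn terms (a sketch only; it assumes a dataframe `df` with a `city` column, rows ordered by date within each city, and placeholder `X`, `y` and hyperparameters):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import TimeSeriesSplit
from xgboost import XGBRegressor

def per_city_time_splits(df, n_splits=5):
    """Yield (train_idx, val_idx) where every fold contains both cities."""
    tscv = TimeSeriesSplit(n_splits=n_splits)
    # 1. Split the dataset by city and run TimeSeriesSplit inside each city.
    city_folds = []
    for city in df["city"].unique():
        idx = np.flatnonzero(df["city"].values == city)  # positional indices of this city
        city_folds.append([(idx[tr], idx[va]) for tr, va in tscv.split(idx)])
    # 2. Pair up the i-th fold of each city and concatenate the indices,
    #    so train/validation always cover both cities.
    for folds in zip(*city_folds):
        yield (np.concatenate([tr for tr, _ in folds]),
               np.concatenate([va for _, va in folds]))

# Usage: evaluate MAE fold by fold (X, y, hyperparameters are placeholders).
maes = []
for train_idx, val_idx in per_city_time_splits(df):
    model = XGBRegressor(n_estimators=300, max_depth=4, learning_rate=0.05)
    model.fit(X.iloc[train_idx], y.iloc[train_idx])
    maes.append(mean_absolute_error(y.iloc[val_idx], model.predict(X.iloc[val_idx])))
print("CV MAE per fold:", maes)
```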
At least it doesn’t give such an overly optimistic result with huber loss on CV.
UPD: it was an error in my cross-validation.