MAE on Train and Test Data Set

Dearest,
I use caret + vtreat + a couple of models.
The funny thing is that the MAE calculated on the train data set (and I am using cross-validation) is in my case much better (around 15-18) than on the test data set (where it shoots up to around 30).
Is anybody else experiencing this?
Cheers

Hey Larry, I was also experiencing the same problem. However, my MAE is really bad. Do you mind sharing how you built your model? Thanks a lot!

What method do you use for cross-validation?
I just used sklearn’s TimeSeriesSplit and have a similar problem (e.g. XGBRegressor with basic features gives ~25 error on CV, ~28 on LB, and XGBRegressor with huber loss gives ~20 on CV and ~28 on LB).
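For reference, a minimal sketch of what I mean by the “baseline” setup (not exactly my code; the hyperparameters and the assumption that X/y are already-prepared pandas objects sorted by city and date are just illustrative):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import TimeSeriesSplit
from xgboost import XGBRegressor

def baseline_cv_mae(X, y, n_splits=5):
    """Average MAE over TimeSeriesSplit folds on the full (both-cities) frame."""
    tscv = TimeSeriesSplit(n_splits=n_splits)
    fold_maes = []
    for train_idx, val_idx in tscv.split(X):
        # illustrative hyperparameters, not tuned
        model = XGBRegressor(n_estimators=300, learning_rate=0.05)
        model.fit(X.iloc[train_idx], y.iloc[train_idx])
        preds = model.predict(X.iloc[val_idx])
        fold_maes.append(mean_absolute_error(y.iloc[val_idx], preds))
    return float(np.mean(fold_maes))
```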

But I found the following things:

  • we have data from two cities (and these cities have different total_cases distributions)
  • for each city we have data for a different time range

So the “baseline” TimeSeriesSplit splits the data in the following way (just a sample):
[sj, time1], [sj, time2], [sj, iq, time3], [sj, iq, time4], [iq, time5]

As you can see, on some splits we build models only (or at least mostly) for an individual city (sj/iq). So on some splits we approximate the distribution of only one city, but on the test set we need to approximate both cities’ distributions.
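You can check this yourself with something like the snippet below (just a sketch, assuming the training DataFrame has a 'city' column with 'sj'/'iq' values and rows ordered city-by-city as in the raw data):

```python
from sklearn.model_selection import TimeSeriesSplit

def fold_city_counts(df, n_splits=5):
    """Print how many rows of each city land in every fold's train/validation parts."""
    tscv = TimeSeriesSplit(n_splits=n_splits)
    for i, (train_idx, val_idx) in enumerate(tscv.split(df)):
        train_counts = df.iloc[train_idx]["city"].value_counts().to_dict()
        val_counts = df.iloc[val_idx]["city"].value_counts().to_dict()
        print(f"fold {i}: train={train_counts}  val={val_counts}")
```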

So I’ll try the following “algorithm” for cross-validation (see the sketch after the example split below):

  • split dataset by city
  • on each part run TimeSeriesSplit.
  • for each split pair:
    – concatenate the per-city train/validation indices so the model is trained/validated on both cities
    – train/validate the model

And my algorithm splits the data the following way:

[[sj, sjTime1], [iq, iqTime1]], [[sj, sjTime2], [iq, iqTime2]], [[sj, sjTime3], [iq, iqTime3]]
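A rough sketch of that per-city splitting (not my exact code; it assumes a DataFrame with a 'city' column, rows within each city already in time order, and a positional integer index):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

def per_city_splits(df, n_splits=5):
    """Yield (train_idx, val_idx) pairs where each city is split by its own
    TimeSeriesSplit and the i-th folds are concatenated across cities."""
    tscv = TimeSeriesSplit(n_splits=n_splits)
    # positional row indices belonging to each city
    city_rows = {c: np.flatnonzero(df["city"].values == c) for c in df["city"].unique()}
    # run TimeSeriesSplit independently inside each city
    city_folds = {c: list(tscv.split(rows)) for c, rows in city_rows.items()}
    for i in range(n_splits):
        train_idx = np.concatenate([city_rows[c][city_folds[c][i][0]] for c in city_folds])
        val_idx = np.concatenate([city_rows[c][city_folds[c][i][1]] for c in city_folds])
        yield train_idx, val_idx
```

The yielded index arrays can then be used with df.iloc / y.iloc exactly like in the baseline sketch above.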

At least it no longer gives such an overly optimistic CV result with huber loss.

UPD: it was an error in my cross-validation.