Model not generalizing well to test set

Hi everyone and @bull,

I have been working on this challenge for a couple of weeks now. I've implemented a number of models (RF, XGBoost, KNN, etc.) on both the original features (with very minimal feature engineering) and on engineered features. Using the same SJ/IQ split as the benchmark (https://www.drivendata.co/blog/dengue-benchmark/), I achieve a much lower validation MAE on SJ than the benchmark (13 vs. 22) and do slightly better on IQ (6.2 vs. 6.5). However, when I fit the same model on the entire training set and predict on the test set, my test MAE is higher than the benchmark's (27 vs. 25.8).

I don't believe this is a case of overfitting, since I am accounting for that with the validation set, and I don't think data leakage should be an issue either, as this model was fit on the minimally feature-engineered set. I've also tried time series cross validation to reduce bias (see the sketch below); it appeared to yield better results on the validation folds, but performed worse on the test set.

Does anyone have advice, or has anyone run into a similar problem? Thanks!
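For reference, here's roughly the time series cross validation setup I mentioned above. This is just a minimal sketch, assuming scikit-learn's TimeSeriesSplit and an XGBoost regressor; the feature columns and target name below are placeholders, not my actual pipeline.

```python
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import mean_absolute_error
from xgboost import XGBRegressor

def cv_mae(df: pd.DataFrame, feature_cols: list[str], n_splits: int = 5) -> float:
    """Average MAE over expanding-window time series folds for one city (SJ or IQ)."""
    X = df[feature_cols].values
    y = df["total_cases"].values  # placeholder target column name

    # TimeSeriesSplit keeps folds in chronological order: each fold trains on
    # earlier weeks and validates on the weeks that immediately follow.
    tscv = TimeSeriesSplit(n_splits=n_splits)
    maes = []
    for train_idx, val_idx in tscv.split(X):
        model = XGBRegressor(n_estimators=300, learning_rate=0.05)  # illustrative params
        model.fit(X[train_idx], y[train_idx])
        preds = model.predict(X[val_idx])
        maes.append(mean_absolute_error(y[val_idx], preds))
    return sum(maes) / len(maes)
```

I run this separately per city and then refit on the full training set before predicting on the test set, which is where the gap shows up.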

Elliot