Hi @progin,
The key idea here is that water years should be treated as independent observations without any temporal relationship between them.
Here’s an analogy to a simple standard regression setup.
Consider a generic supervised regression problem. You have a set of observations with ID values A, B, C, D, E, F, G, H. Let’s say that A, B, C, D, E, G are in your training set and F and H are in your test set.
You can train a model that depends on variables for all of the observations in your training set (A, B, C, D, E, G). The model parameters themselves (e.g., if doing linear regression, the weight for a variable) could be fit to training variables, or feature parameters could be fit to training variables (e.g., maybe you want a feature to be scaled by the max value of some variable in the training set). That’s all normal supervised regression.
Once the model is trained, all of its parameters are fixed. Suppose you then want to do inference for observation F. Your prediction for F should depend only on the trained model and the variable values for observation F.
Your model should treat observation F independently.
- If there is some known relationship between observation E from your training set and observation F, then explicitly incorporating information about that relationship into the prediction would mean you are not predicting for F independently.
- However, if you don’t explicitly model any relationship between E and F, and E is just a generic independent observation in your training set that is incorporated into your trained model’s parameters, then everything is fine.
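To make the analogy concrete, here is a minimal sketch in Python. The feature and target values are made up purely for illustration, and the names `features`, `targets`, `train_max`, and `predict` are hypothetical; a simple linear fit stands in for whatever model you actually use.

```python
import numpy as np

# Hypothetical toy data: one feature value per observation ID.
features = {"A": 2.0, "B": 4.0, "C": 6.0, "D": 8.0,
            "E": 10.0, "F": 5.0, "G": 12.0, "H": 9.0}
# Targets are only known for the training observations.
targets = {"A": 1.1, "B": 2.0, "C": 2.9, "D": 4.2, "E": 5.0, "G": 6.1}

train_ids = ["A", "B", "C", "D", "E", "G"]  # F and H are held out as test

# Feature parameter fit on the training set only:
# scale the feature by the max value seen in training.
train_max = max(features[i] for i in train_ids)

x_train = np.array([features[i] / train_max for i in train_ids])
y_train = np.array([targets[i] for i in train_ids])

# Model parameters fit on the training set only.
slope, intercept = np.polyfit(x_train, y_train, deg=1)


def predict(obs_id):
    """Predict using only the fixed parameters and this observation's own feature."""
    x = features[obs_id] / train_max
    return slope * x + intercept


pred_f = predict("F")
```

Here observation E influences the prediction for F only through `train_max`, `slope`, and `intercept`, which is the acceptable case. What you should not do is add a term inside `predict` that looks up E (or H) specifically because of its relationship to F.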
Note that I’ve purposefully used letters instead of years in my example above. In the formulation of the problem for this competition, we are not treating this as a longitudinal time series forecasting problem across years. You should consider years to be simply identifiers, and treat them all as generic independent observations. This means, for example, that it’s fine for a model to be trained on years in the training set that come after years in the test set. The prohibition on future data applies within water years (e.g., you can’t use data from May if you’re issuing a forecast in April).
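As a rough illustration of the within-water-year rule, here is a small sketch of filtering input data by the forecast issue date. The record layout, the values, and the helper `available_records` are all hypothetical. Whether data dated exactly on the issue date is allowed is an assumption here (this sketch excludes it), so check the competition rules for the exact cutoff.

```python
from datetime import date

# Hypothetical records: (water_year, observation_date, value).
records = [
    (2015, date(2015, 3, 15), 10.0),
    (2015, date(2015, 4, 1), 12.0),
    (2015, date(2015, 5, 1), 15.0),   # after an April issue date: off limits
    (2020, date(2020, 3, 15), 11.0),  # a different water year
]


def available_records(records, water_year, issue_date):
    """Keep only this water year's records dated before the issue date.

    Assumption: the cutoff is strict (records on the issue date are excluded).
    """
    return [
        r for r in records
        if r[0] == water_year and r[1] < issue_date
    ]


# Forecast for water year 2015 issued on April 15: May data is excluded.
usable = available_records(records, 2015, date(2015, 4, 15))
usable_dates = [r[1] for r in usable]
```

The cutoff compares dates only within the water year being forecast. It says nothing about which years may appear in the training set, so training on 2020 and testing on 2015 is still fine.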
Let me know if this helps clarify things.