Training, inference, and use of future data

Are we allowed to use models which are trained on all of the years in train.csv to predict all of the test years, just making sure at the inference stage that the time stamps of predictor data are earlier than each of the issue dates? Or is this considered using future data?

For example, when predicting year 2005, can I use a model trained with years 1890–2004, 2006, 2008, 2010, 2012, … ?

Hi @kamarain,

Yes, that is correct. Hydrologists generally consider the water supply across years to be independent enough to train models in this way. This approach is necessary in order to have sufficient data for training and evaluation.

just making sure at the inference stage that the time stamps of predictor data are earlier than each of the issue dates

Just to confirm, if I was predicting the year 2005 on issue date March 8th, I could use a model trained using all data from the years 1890, 2006, 2008, and so on - not just the data from before March 8th?

And of course input to my model would be data from October 2004 - March 8th 2005.

Yes, that is correct.

And of course input to my model would be data from October 2004 - March 8th 2005.

It should be data through March 7, 2005. This clarification was made in an announcement on November 2.
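To make the cutoff concrete, here is a minimal sketch of how the allowed predictor window for one issue date could be computed. The function names (`feature_window`, `select_observations`) are hypothetical, and the October 1 start date is an assumption based on the "October 2004" mentioned above; the only rule taken from this thread is that data must end the day before the issue date.

```python
from datetime import date, timedelta


def feature_window(issue_date):
    """Return the (start, end) inclusive date range of usable predictor data."""
    # Assumption: the forecast year's data begins on October 1 of the
    # previous calendar year (or the same year for Oct-Dec issue dates).
    start_year = issue_date.year if issue_date.month >= 10 else issue_date.year - 1
    start = date(start_year, 10, 1)
    # Rule from this thread: use data only through the day before the issue date.
    end = issue_date - timedelta(days=1)
    return start, end


def select_observations(observations, issue_date):
    """Keep only (date, value) pairs that fall inside the allowed window."""
    start, end = feature_window(issue_date)
    return [(d, v) for d, v in observations if start <= d <= end]
```

For an issue date of March 8, 2005, `feature_window(date(2005, 3, 8))` gives October 1, 2004 through March 7, 2005, so an observation stamped March 8 would be dropped.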

Hi,
With the latest announcements about the use of future data I’d like to just confirm that we may still use all the training data including 2006, 2008, 2010…2022 for training our models - or did this change?
Thanks

Hi @jimking100,

Yes, that is correct.

In this challenge, forecast years are treated as independent observations. We are not imposing time series restrictions across forecast years (e.g., we do not prohibit training on future years). This means that it’s fine, for example, for a model trained on data that includes the 2006 forecast year to be used to issue predictions for the 2005 forecast year. Because forecast years are treated independently, a prediction for one forecast year should generally not depend on which specific year it is, and its inputs should not overlap with or include data from other forecast years.

The time series restriction applies within a forecast year: your model should only use past data relative to that issue date within that forecast year.

Assuming the years are independent, would it then be possible to use historic average streamflows of a given catchment (excluding the forecast year in question) as a predictor?

Hi @jitters,

You may calculate “historic” average streamflows using data from training years. The reason I put quotes around “historic” is that you can include any training year: both 2004 and earlier, and the even-numbered years 2006–2022. Such an average may therefore include years that are in the future relative to years in the test set. It can be used as a feature or as part of the derivation of some other feature.

However, such an average should not include any years from the test set.
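As a rough illustration of that rule, the sketch below averages a site's flow over every year that is not in the test set, regardless of whether the year comes before or after the one being predicted. The function name `historic_mean_flow`, the dictionary layout, and the toy numbers are all hypothetical; the actual list of test years should come from the challenge data.

```python
def historic_mean_flow(flows_by_year, test_years):
    """Average flow over all years except those in the test set.

    flows_by_year: dict mapping forecast year -> flow value for one site
    test_years: set of forecast years that belong to the test set
    """
    # Later training years are deliberately kept: only test years are excluded.
    train_values = [v for year, v in flows_by_year.items() if year not in test_years]
    if not train_values:
        raise ValueError("no training years available for this site")
    return sum(train_values) / len(train_values)


# Toy values for illustration only (not real streamflow data).
example_flows = {2003: 100.0, 2004: 200.0, 2005: 999.0, 2006: 300.0}
example_mean = historic_mean_flow(example_flows, test_years={2005})
```

Here the 2005 value is ignored because 2005 is treated as a test year, while 2006 contributes even though it lies after 2005.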

Hi everyone,

We’ve updated the problem description to more clearly explain the concepts discussed in this thread. We’ve also added an FAQ section to address specific questions in more detail. See this announcement.

Please let us know if you continue to have questions about the modeling setup.