Training, inference, and use of future data

Are we allowed to use models trained on all of the years in train.csv to predict all of the test years, just making sure at the inference stage that the time stamps of predictor data are earlier than each of the issue dates? Or is this considered to be using future data?

For example, when predicting year 2005, can I use a model trained with years 1890–2004, 2006, 2008, 2010, 2012, … ?

Hi @kamarain,

Yes, that is correct. Hydrologists generally consider the water supply across years to be independent enough to train models in this way. This approach is necessary in order to have sufficient data for training and evaluation.

just making sure at the inference stage that the time stamps of predictor data are earlier than each of the issue dates

Just to confirm: if I were predicting the year 2005 on the issue date March 8th, I could use a model trained on all data from the years 1890, 2006, 2008, and so on, not just the data from before March 8th?

And of course input to my model would be data from October 2004 - March 8th 2005.

Yes, that is correct.

And of course input to my model would be data from October 2004 - March 8th 2005.

It should be data through March 7, 2005. This clarification was made in an announcement on November 2.
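As a rough illustration of that cutoff, here is a minimal sketch of how predictor data might be filtered for a single issue date. The function names and the sample observations are hypothetical, and the October 1 window start simply follows the example given above rather than any stated rule. It also assumes the issue date falls in the same calendar year as the forecast year.

```python
from datetime import date, timedelta


def feature_window(issue_date):
    """Return the inclusive (start, end) dates of usable predictor data."""
    # Assumption: the window opens on October 1 of the previous calendar year,
    # as in the example from this thread.
    start = date(issue_date.year - 1, 10, 1)
    # Data is usable only through the day before the issue date.
    end = issue_date - timedelta(days=1)
    return start, end


def filter_observations(observations, issue_date):
    """Keep only (date, value) pairs that fall inside the usable window."""
    start, end = feature_window(issue_date)
    return [(d, v) for d, v in observations if start <= d <= end]


# Hypothetical observations around the window boundaries.
obs = [
    (date(2004, 9, 30), 1.0),  # before the window opens
    (date(2004, 10, 1), 2.0),  # first usable day
    (date(2005, 3, 7), 3.0),   # last usable day
    (date(2005, 3, 8), 4.0),   # the issue date itself: excluded
]
usable = filter_observations(obs, date(2005, 3, 8))
```

With an issue date of March 8, 2005, only the October 1, 2004 and March 7, 2005 observations are kept.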

Hi,
With the latest announcements about the use of future data, I’d like to confirm that we may still use all the training data, including 2006, 2008, 2010…2022, for training our models. Or did this change?
Thanks

Hi @jimking100,

Yes, that is correct.

In this challenge, forecast years are treated as independent observations. We are not imposing time series restrictions across forecast years (e.g., we do not prohibit training on years later than the one being predicted). This means that it’s fine, for example, for a model trained on data that includes the 2006 forecast year to issue predictions for the 2005 forecast year. Because forecast years are treated independently, a prediction for one forecast year should generally not depend on which specific year it is, nor overlap with or include data from other forecast years.

The time series restriction applies within a forecast year: your model should only use past data relative to that issue date within that forecast year.
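One way to picture the cross-year independence described above is a leave-one-year-out split, sketched below. This is only an illustration of one possible validation scheme, not something prescribed in this thread; the function name and the list of years are hypothetical.

```python
def leave_one_year_out(years):
    """Yield (train_years, held_out_year) pairs, one per forecast year."""
    for held_out in years:
        # Every other forecast year, earlier or later, may be used for training.
        train = [y for y in years if y != held_out]
        yield train, held_out


# Hypothetical set of forecast years with training labels.
years = [2003, 2004, 2005, 2006, 2008]
splits = {held_out: train for train, held_out in leave_one_year_out(years)}
```

Here `splits[2005]` contains 2006 and 2008 as well as the earlier years. The within-year restriction still applies separately to the predictor data fed to the model for each issue date.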

Assuming the years are independent, would it then be possible to use historic average streamflows of a given catchment (excluding the forecast year in question) as a predictor?

This may not be the best place to ask, but I don’t want to create a new question: can we use the NRCS and RFC monthly naturalized flow data from sites other than the 26 used in the challenge for training?