Training, inference, and use of future data

kamarain · November 12, 2023, 10:12am

Are we allowed to use models which are trained on all of the years in train.csv to predict all of the test years, just making sure at the inference stage that the time stamps of predictor data are earlier than each of the issue dates? Or is this considered using future data?

For example, when predicting year 2005, can I use a model trained with years 1890–2004, 2006, 2008, 2010, 2012, … ?

jayqi · November 12, 2023, 1:46pm

Hi @kamarain,

Yes, that is correct. Hydrologists generally consider the water supply across years to be independent enough to train models in this way. This approach is necessary in order to have sufficient data for training and evaluation.
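
One way to picture the setup described above is a leave-one-year-out split, where each held-out year is predicted by a model fit on every other available year, earlier or later. This is only an illustrative sketch: the function name `leave_one_year_out` and the list of years are hypothetical, and the challenge does not prescribe this particular evaluation scheme.

```python
def leave_one_year_out(years):
    """Yield (held_out_year, fit_years) pairs, one pair per distinct year.

    No temporal ordering is enforced across years: years later than the
    held-out year remain in the fit set.
    """
    years = sorted(set(years))
    for held_out in years:
        yield held_out, [y for y in years if y != held_out]


# Hypothetical set of training years, for illustration only.
train_years = [2001, 2002, 2003, 2004, 2006, 2008]

for held_out, fit_years in leave_one_year_out(train_years):
    # For held_out == 2004, fit_years still contains 2006 and 2008.
    print(held_out, fit_years)
```

The within-year restriction on predictor timestamps still applies to every prediction made this way.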

tomwetherell · November 12, 2023, 10:24pm

"just making sure at the inference stage that the time stamps of predictor data are earlier than each of the issue dates"

Just to confirm, if I were predicting the year 2005 on the issue date March 8th, I could use a model trained using all data from the years 1890, 2006, 2008, and so on, not just the data from before March 8th?

And of course input to my model would be data from October 2004 - March 8th 2005.

jayqi · November 13, 2023, 2:32pm

Yes, that is correct.

"And of course input to my model would be data from October 2004 - March 8th 2005."

It should be data through March 7, 2005. This clarification was made in an announcement on November 2.

jimking100 · November 21, 2023, 6:55pm

Hi,
With the latest announcements about the use of future data I’d like to just confirm that we may still use all the training data including 2006, 2008, 2010…2022 for training our models - or did this change?
Thanks

jayqi · November 22, 2023, 4:46pm

Hi @jimking100,

Yes, that is correct.

In this challenge, forecast years are treated as independent observations. We are not imposing time series restrictions across forecast years (e.g., a rule against training on years that come after the year being predicted). This means that it’s fine, for example, for a model trained on data that includes the 2006 forecast year to issue predictions for the 2005 forecast year. Because forecast years are treated independently, a prediction for one forecast year should generally neither depend on which specific year it is nor overlap with or include data from other forecast years.

The time series restriction applies within a forecast year: your model should only use past data relative to that issue date within that forecast year.
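
The within-year restriction can be sketched as a date window on predictor data. The example below is a minimal illustration, not an official check: `feature_window` and `filter_observations` are hypothetical helpers, and the October 1 start date is an assumption based on the "October 2004" example earlier in this thread. The end of the window is the day before the issue date, matching the "through March 7, 2005" clarification.

```python
from datetime import date, timedelta


def feature_window(issue_date):
    """Return the inclusive (start, end) dates of usable predictor data.

    Assumptions: the forecast year's data begins on October 1 of the
    previous calendar year, and issue dates fall before October.
    """
    start = date(issue_date.year - 1, 10, 1)
    end = issue_date - timedelta(days=1)  # strictly before the issue date
    return start, end


def filter_observations(observations, issue_date):
    """Keep only (date, value) pairs that fall inside the usable window."""
    start, end = feature_window(issue_date)
    return [(d, v) for d, v in observations if start <= d <= end]


# Made-up observations, for illustration only.
obs = [
    (date(2004, 9, 30), 1.0),  # before the assumed window start: dropped
    (date(2004, 10, 1), 2.0),  # first day of the window: kept
    (date(2005, 3, 7), 3.0),   # day before the issue date: kept
    (date(2005, 3, 8), 4.0),   # the issue date itself: dropped
]
usable = filter_observations(obs, date(2005, 3, 8))
```

The same filter would be applied to the inputs of every prediction, regardless of which years the model was trained on.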

jitters · November 23, 2023, 1:19pm

Assuming the years are independent, would it then be possible to use historic average streamflows of a given catchment (excluding the forecast year in question) as a predictor?

jayqi · November 29, 2023, 10:14pm

A post was split to a new topic: Naturalized flow data for sites other than the 26

jayqi · November 29, 2023, 11:24pm

Hi @jitters,

You may calculate “historic” average streamflows using data from training years. I put “historic” in quotes because you can include any training year: 2004 and earlier, as well as the even-numbered years 2006–2022. Such an average may therefore include years that are in the future relative to years in the test set. It can be used as a feature directly or as part of the derivation of some other feature.

However, such an average should not include any years from the test set.
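
A minimal sketch of such an average, assuming seasonal flow totals are held in a dictionary keyed by year: the helper `historic_average`, the flow values, and the single test year shown are all made up for illustration. The explicit exclusion mainly matters when flow data comes from a source that also covers test years.

```python
from statistics import mean


def historic_average(flow_by_year, test_years):
    """Average flow over every available year that is not a test year."""
    allowed = [
        flow for year, flow in flow_by_year.items() if year not in test_years
    ]
    if not allowed:
        raise ValueError("no non-test years available to average")
    return mean(allowed)


# Made-up seasonal flow values, for illustration only.
flows = {2003: 100.0, 2004: 120.0, 2005: 500.0, 2006: 140.0}
test_years = {2005}

# 2006 contributes even though it is later than the 2005 test year;
# 2005 itself is excluded because it belongs to the test set.
avg = historic_average(flows, test_years)
```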

jayqi · December 6, 2023, 3:22am

Hi everyone,

We’ve updated the problem description to more clearly explain the concepts discussed in this thread. We’ve also added an FAQ section to address specific questions in more detail. See this announcement.

Please let us know if you continue to have questions about the modeling setup.
