Clarification on "you must only use feature data from the same water year"

kurisu · November 20, 2023, 11:08am

The announcement from Nov 17 says that we should only use feature data from the water year.

Does this also mean that, for example, the following feature would be prohibited:

Maximum yearly volume observed at a site (so far).

Let’s say I make a prediction for 2017. For the mentioned feature, I take the volumes from 2016 and earlier (except the test years) and calculate the maximum per site.

These types of features do not use data from test_monthly_naturalized_flow.csv but only from train.csv.

Are such features allowed or not?

jayqi · November 29, 2023, 11:50pm

Hi @kurisu,

Based on my understanding of the feature you’re describing, that feature is permitted. Aggregating a parameter over all training years (it doesn’t have to be just 2016 and before; it can also include 2018, 2020, and 2022) is a standard part of “training” a model.

The clarification that you should only use feature data from the water year is based on the idea of treating water years as independent observations. So it’s fine to fit parameters across all of the training years.

However, having a feature at inference time depend on data from 2016 specifically with the knowledge that it’s the year before 2017 would not be treating years independently.
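To make the permitted version concrete, here is a minimal sketch of a per-site maximum fitted once over all training years. The site names, years, and volumes are made up for illustration, and `fit_site_max` and `site_max_feature` are hypothetical helpers, not part of any competition code.

```python
# Hypothetical toy data: (site_id, water_year) -> seasonal volume.
# Only training years appear here; test years are excluded up front.
TRAIN_VOLUMES = {
    ("site_a", 2014): 310.0,
    ("site_a", 2016): 275.0,
    ("site_a", 2018): 402.0,
    ("site_b", 2014): 120.0,
    ("site_b", 2016): 150.0,
    ("site_b", 2018): 95.0,
}


def fit_site_max(train_volumes):
    """Fit one max-volume parameter per site over every training year."""
    site_max = {}
    for (site_id, _year), volume in train_volumes.items():
        site_max[site_id] = max(volume, site_max.get(site_id, float("-inf")))
    return site_max


# Fitted once at training time, then frozen.
SITE_MAX = fit_site_max(TRAIN_VOLUMES)


def site_max_feature(site_id, forecast_year):
    """Return the fitted parameter; forecast_year is deliberately ignored."""
    return SITE_MAX[site_id]
```

The feature for `site_a` is the same whether the forecast year is 2015 or 2017, and it includes the 2018 value even though 2018 comes after both.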

Let me know if that helps, and if I can clarify further.

progin · November 30, 2023, 10:12am

Hi @jayqi,
I’m confused now. It seemed to me that it’s clearly stated in Competition: Water Supply Forecast Rodeo: Hindcast Evaluation that we could use only features based on data from the given water year (besides teleconnections). However, if I understand correctly what @kurisu suggested, when predicting on forecast year 2017, we want to use a feature that takes the maximum volume for a site_id from all previous years (excluding test years). Say a value from 1995 is the highest, and we use this value to predict on 2017. Then I’m no longer sure what using only features from the same water year refers to, and when using data from different years is allowed.

jayqi · November 30, 2023, 7:10pm

Hi @progin,

The key idea here is that water years should be treated as independent observations without any temporal relationship between them.

Here’s an analogy to a simple standard regression setup.

Consider a generic supervised regression problem. You have a set of observations with ID values A, B, C, D, E, F, G, H. Let’s say that A, B, C, D, E, G are in your training set and F and H are in your test set.

You can train a model that depends on variables for all of the observations in your training set (A, B, C, D, E, G). The model parameters themselves (e.g., if doing linear regression, the weight for a variable) could be fit to training variables, or feature parameters could be fit to training variables (e.g., maybe you want a feature to be scaled by the max value of some variable in the training set). That’s all normal supervised regression.

Now you have a trained model, and all of your parameters are fixed. Now you want to do inference for observation F. When you predict for observation F, that prediction should just depend on the trained model and the variable values for observation F.

Your model should treat observation F independently.

If for some reason, there is some known relationship between observation E from your training set and observation F, explicitly incorporating information about the relationship between observation E and observation F would not be predicting for F independently.
However, if you don’t explicitly model any relationship between E and F, and E is just a generic independent observation in your training set that is incorporated into your trained model’s parameters, then everything is fine.
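The analogy above can be sketched in a few lines of plain Python. This is only an illustration under assumed toy numbers: a single variable `x`, a feature scaled by the training-set max, and a one-weight least-squares fit through the origin. The names `fit` and `predict` are hypothetical.

```python
# Toy observations: id -> (x, y). The values are invented so that y = 2 * x.
TRAIN = {
    "A": (1.0, 2.0),
    "B": (2.0, 4.0),
    "C": (3.0, 6.0),
    "D": (4.0, 8.0),
    "E": (6.0, 12.0),
    "G": (8.0, 16.0),
}
TEST_X = {"F": 5.0, "H": 7.0}  # test observations: only x is known


def fit(train):
    """Fit a feature parameter (x_max) and a model parameter (weight) on training data only."""
    x_max = max(x for x, _ in train.values())
    # Least squares through the origin on the scaled feature x / x_max.
    num = sum((x / x_max) * y for x, y in train.values())
    den = sum((x / x_max) ** 2 for x, _ in train.values())
    return {"x_max": x_max, "weight": num / den}


def predict(params, x):
    """Prediction depends only on the frozen parameters and this observation's own x."""
    return params["weight"] * (x / params["x_max"])


PARAMS = fit(TRAIN)
prediction_f = predict(PARAMS, TEST_X["F"])
```

Note that `predict` has no way to look up observation E while predicting for F. E only influences the result through the fitted parameters.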

Note that I’ve purposefully used letters instead of years in my example above. In the formulation of the problem for this competition, we are not treating this as a longitudinal time series forecasting problem across years. You should consider years to be simply identifiers for generic independent observations. This means, for example, that it’s fine for a model to be trained on years in the training set that are in the future of years in the test set. The prohibition on future data applies within water years (e.g., you can’t use data from May if you’re issuing a forecast in April).
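As a rough sketch of the within-water-year rule, the filter below keeps only observations from the forecast's own water year that are dated strictly before the issue date. It assumes the common convention that a water year runs from October 1 through September 30; the function names and the sample observations are hypothetical.

```python
from datetime import date


def water_year_start(issue_date):
    """Assumed convention: the water year containing issue_date began on the most recent October 1."""
    start_year = issue_date.year if issue_date.month >= 10 else issue_date.year - 1
    return date(start_year, 10, 1)


def usable_observations(observations, issue_date):
    """Keep (obs_date, value) pairs from the same water year, strictly before the issue date."""
    start = water_year_start(issue_date)
    return [(d, v) for d, v in observations if start <= d < issue_date]


# Invented observations for one site.
observations = [
    (date(2016, 9, 15), 30.0),   # previous water year: dropped
    (date(2016, 10, 15), 41.0),  # same water year, before issue date: kept
    (date(2017, 3, 31), 58.0),   # same water year, before issue date: kept
    (date(2017, 4, 1), 60.0),    # on the issue date: dropped
    (date(2017, 5, 1), 75.0),    # after the issue date: dropped
]
april_inputs = usable_observations(observations, date(2017, 4, 1))
```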

Let me know if this helps clarify things.

mmiron · December 1, 2023, 2:41pm

@jayqi,

Pardon me, but after looking at the water-supply-forecast-rodeo example submission, I’m a little surprised that it wouldn’t be disqualified. I’m sensitive to the fact that any statistical model will, on some level, be using data outside of the water year of the forecast issue date; but I would have thought that using antecedent flow data from outside of it so directly violated the rules.

To be clear: if the example submission were using the mean of values over only the site a prediction is being issued for, it would be disqualified; but since it’s using the mean of the entire training set, it doesn’t violate the rules. Is that correct?

What I’m stumbling over is that the 26 sites relevant for the contest are only a subset of possible sites that a solution would, in practice, be asked to predict for – meaning using the mean of historical data for those 26 sites to train a model represents a subset of values that was (presumably) not randomly selected from the universe of all historical flow data (i.e. all sites everywhere). And using only the historical values for a single site represents a subset of that subset, if you see what I mean. I want to be very sure I understand where the rules draw the line on this, and it’s still fuzzy for me.

jayqi · December 1, 2023, 4:09pm

Hi @mmiron,

Have you read my response from yesterday that is directly above your post? I believe that explanation should provide information about the distinction that you are asking about.

The key idea isn’t about the specific site (sites shouldn’t be a relevant concept here—we’re talking about years), but it’s about whether you are treating years independently.

Here are two cases that may help illustrate the distinction:

Not independent: For a 2009 forecast for hungry_horse_reservoir_inflow, I will use a feature that depends on the 2008 water supply for hungry_horse_reservoir_inflow because that is the most recent year at the same location.
Is independent: For a 2009 forecast for hungry_horse_reservoir_inflow, I will use a feature that depends on the average water supply for hungry_horse_reservoir_inflow across all of my training data.

mmiron · December 1, 2023, 9:31pm

I’m sorry – and rest assured I consider this my own shortcoming and not yours – but I still don’t get it. Take the second example, and only use the 2008 year for “all of my training data” (clearly a silly thing to do, but ignoring that); then it would be disqualified, based on the first example. But I don’t see why.

I’m sure I’ll get it eventually. Thank you for your patience, @jayqi.

progin · December 2, 2023, 9:05am

@jayqi @mmiron,
If I understand correctly, we could use data from all years but without giving information on when a particular thing happened.

1. For example, we could use average value of volume for all sites but we can’t use a feature like average value of volume from the previous 3 years, as it will operate only on a few last years, so it will be using information that something happened in the recent years and it will depend on specifically selected years.
2. We can use average value of monthly USGS streamflow per site for February for all years if we predict on issue year of March (we can’t use USGS streamflow from test years for training here but using all train years is allowed, even the year that we predict on?)
3. If issue month is February, can we use the same variable as in point 2, but excluding information from this year’s February?
4. Then, we could also use variables that base on this water year and on what was generally in the past, for example for February issue date, the feature could be a difference between USGS streamflow from this January and USGS streamflow on January from all years?
5. The key is to not use information about what happened when but we could still use information on just general numbers from many years on what was in the past?
6. It could be also helpful to update Only data within the same water year section (Competition: Water Supply Forecast Rodeo: Hindcast Evaluation) as it seems to be only specified there that we can’t use data from different water years under any circumstances and when it was added, I removed all my features that weren’t using information from the same water year, though there were mostly variables using information on general tendencies instead of previous years and if I understand correctly, they were fine, so it would be great to have it clearly explained.
7. The clarification should also cover such dilemmas like specifying where is the boundary – if using information from previous year is not allowed, when does it get allowed, when using information from last 5/10/30/all available years? If it is clearly stated what’s permitted and what’s not, we could work solely on improving our models instead of pondering if our variables are all right and it’s safe using them or will we get disqualified.

jayqi · December 4, 2023, 3:26pm

The key difference is whether or not you depend specifically on the relationship between your current inference year and the year 2008.

Is independent: It would be permissible if you calculate this feature the same way for performing inference for any year. For example, if you calculate a feature that depends on only the 2008 value (and none of the other training years), then this feature for predicting on 2011 must also use the 2008 value in the same way, and this feature for predicting on 2019 must also use the 2008 value in the same way, etc.
Not independent: If you use the 2008 value only for predicting on 2009, and then you use the 2018 value only for predicting on 2019, etc.

The key idea here again is to treat years as statistically independent observations.

One way to think about it: Imagine that we replaced all years with random hashes and randomly shuffled them. So now, instead of years, we have f9e0da, 04b390, 65cf98 and you don’t know which year is which. Your model should be able to work if the data were presented in this way.
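Here is one possible way to turn that thought experiment into a quick check, as a sketch only. The volumes are invented, `anonymize` uses a truncated SHA-256 digest purely as a stand-in for "random hashes", and both feature functions are hypothetical.

```python
import hashlib

# Invented training volumes for a single site, keyed by water year.
VOLUMES = {2008: 210.0, 2010: 180.0, 2012: 260.0}


def mean_feature(train_volumes, inference_id):
    """Independent: the same value for every inference year, whatever its label."""
    return sum(train_volumes.values()) / len(train_volumes)


def previous_year_feature(train_volumes, inference_id):
    """Not independent: needs to know which training year comes right before the inference year."""
    return train_volumes[inference_id - 1]


def anonymize(year):
    """Replace a year with an opaque 6-character label (stand-in for a random hash)."""
    return hashlib.sha256(str(year).encode()).hexdigest()[:6]


# Same data, but the keys no longer carry any ordering information.
anon_volumes = {anonymize(year): volume for year, volume in VOLUMES.items()}

# Works identically with real years and with opaque labels.
same = mean_feature(VOLUMES, 2009) == mean_feature(anon_volumes, anonymize(2009))

# Breaks once years are opaque: "label minus one" is meaningless.
try:
    previous_year_feature(anon_volumes, anonymize(2009))
    survives_shuffle = True
except TypeError:
    survives_shuffle = False
```

A feature that still computes, and gives the same value, after the relabeling treats years as identifiers. One that needs arithmetic on the year label does not.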

> For example, we could use average value of volume for all sites but we can’t use a feature like average value of volume from the previous 3 years, as it will operate only on a few last years, so it will be using information that something happened in the recent years and it will depend on specifically selected years.

You can use the average value of volume across all training years, or from a fixed subset of training years. You shouldn’t pick different values that depend on the temporal relationship with the inference year.

> We can use average value of monthly USGS streamflow per site for February for all years if we predict on issue year of March (we can’t use USGS streamflow from test years for training here but using all train years is allowed, even the year that we predict on?)

Correct (for all training years, not all years). You can fit a parameter based on all of the train years, and then you can use the value from the current year you are performing inference on. You should not use any values from other test years.

> If issue month is February, can we use the same variable as in point 2, but excluding information from this year’s February?

Correct. You can fit a parameter on the February data from all of the train years—this is just a trained model parameter. You can use this parameter however you want. If the February variable you want to use from the current inference year is not before the issue date, then you won’t be able to use it, as you are suggesting.

> Then, we could also use variables that base on this water year and on what was generally in the past, for example for February issue date, the feature could be a difference between USGS streamflow from this January and USGS streamflow on January from all years?

This would be fine if the reference USGS streamflow value were fitted on training years only (not all years; test years should be excluded).
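A small sketch of that anomaly feature, assuming invented January streamflow values for one site and a hypothetical set of test years. `fit_reference` and `january_anomaly` are illustrative names, not competition code.

```python
# Invented January streamflow for one site, keyed by water year.
JAN_FLOW = {2014: 50.0, 2015: 62.0, 2016: 44.0, 2017: 71.0, 2018: 56.0}
TEST_YEARS = {2015, 2017}  # hypothetical test years, excluded from fitting


def fit_reference(flow_by_year, test_years):
    """Fit the January reference as the mean over training years only."""
    train_values = [v for year, v in flow_by_year.items() if year not in test_years]
    return sum(train_values) / len(train_values)


# Fitted once at training time: (50 + 44 + 56) / 3
JAN_REFERENCE = fit_reference(JAN_FLOW, TEST_YEARS)


def january_anomaly(current_year_jan_flow, reference=JAN_REFERENCE):
    """Difference between this water year's January flow and the fitted reference."""
    return current_year_jan_flow - reference


# For a February issue date in 2017, January 2017 is before the issue date,
# so the current year's own January value may be used.
anomaly_2017 = january_anomaly(JAN_FLOW[2017])
```

The only test-year value that enters the feature is the one from the year being predicted, and only because it is dated before the issue date.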

Thank you for all of your questions and feedback. We are working on improving the challenge documentation to clarify these points further.

jayqi · December 6, 2023, 3:20am

Hi everyone,

We’ve updated the problem description to more clearly explain the concepts discussed in this thread. We’ve also added an FAQ section to address specific questions in more detail. See this announcement.

Please let us know if you continue to have questions about the modeling setup.
