Training Data Question

Hi,
For the cross-validation training data, I assume we can use pre-2004 data for training. For example, for the first fold (2004) we could train on, say, 1994-2023 data (excluding 2004) and test on 2004 data. This also assumes the lookback window stays within the 2004 water year. Is this correct?

The instructions were not very explicit about this. I hope we can use LOOCV for 2004–2023 and use all years prior to 2004 as additional fitting samples? Because this is seasonal forecasting, many useful signals come from low-frequency predictors. These cannot be properly identified, or later exploited, by models fitted to short time periods (20 years in this case).

I believe the upshot of the LOOCV process is that yes, we get to train with all water years except for whichever year we’re using as the holdout set for a specific iteration (see Competition: Water Supply Forecast Rodeo: Final Prize Stage).

My reading of the process basically reduces it to 20 different test sets, each one composed of a single year between 2004 and 2023; maybe that interpretation clears it up a little.
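A rough sketch of that reading, in case it helps. The function name `loocv_folds` and the 1994 start year are only illustrative assumptions, not anything from the official rules; the prior years could start wherever your data does.

```python
def loocv_folds(cv_years, prior_years=()):
    """Yield (test_year, train_years) pairs for leave-one-year-out CV.

    cv_years:    years in the cross-validation period (each is held out once).
    prior_years: earlier years that are only ever used for training.
    """
    all_years = set(cv_years) | set(prior_years)
    for test_year in cv_years:
        # Train on every available year except the single held-out year.
        train_years = sorted(y for y in all_years if y != test_year)
        yield test_year, train_years


cv_years = range(2004, 2024)     # 2004..2023 inclusive, 20 folds
prior_years = range(1994, 2004)  # assumed example range of pre-2004 data
folds = list(loocv_folds(cv_years, prior_years))
```

Each entry in `folds` pairs one held-out year with the remaining years, so the pre-2004 years appear in every training set and never in a test set.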

Note also that the file prior_historical_monthly_flow.csv on the downloads page provides data going back to 1898 in some cases, so I think it’s pretty safe to assume that we can use it.

@jimking100 @kamarain @mmiron

Yes, you may train on the data from 2003 and earlier, so long as it does not overlap with the test year of any given cross-validation iteration. That data is provided in a separate CSV file, prior_historical_labels.csv, to make clear that those years are not part of the cross-validation period.

If you have lookback windows that extend beyond the start of the water year, you may still train on the 2003-and-earlier data but will need to adjust to make sure there is no overlap. (See here.)
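One way to check for that kind of overlap is sketched below. It is only an illustration under two assumptions that are not stated in this thread: that a water year runs from October 1 of the previous calendar year through September 30, and that a training sample's features cover the window from `issue_date - lookback_days` up to `issue_date`. The helper names are hypothetical.

```python
from datetime import date, timedelta


def water_year_bounds(water_year):
    # Assumption: water year N runs from Oct 1 of year N-1 through Sep 30 of year N.
    return date(water_year - 1, 10, 1), date(water_year, 9, 30)


def window_overlaps_test_year(issue_date, lookback_days, test_year):
    """Return True if the feature window of a training sample touches the test water year."""
    window_start = issue_date - timedelta(days=lookback_days)
    wy_start, wy_end = water_year_bounds(test_year)
    # Two closed date intervals overlap when each one starts before the other ends.
    return window_start <= wy_end and issue_date >= wy_start


# A water-year-2005 training sample issued on 2004-11-01, with 2004 held out:
long_lookback = window_overlaps_test_year(date(2004, 11, 1), 90, 2004)
short_lookback = window_overlaps_test_year(date(2004, 11, 1), 30, 2004)
```

Under these assumptions, the 90-day window reaches back into water year 2004 and would need to be shortened or the sample dropped for that fold, while the 30-day window stays clear of it.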