Training Data Question

jimking100 · February 17, 2024, 6:48am

Hi,
For cross-validation, I assume we can train on pre-2004 data. For example, for the first fold (test year 2004) we could train on, say, 1994-2023 data excluding 2004, and test on 2004. This also assumes the lookback window stays within the 2004 water year. Is this correct?

kamarain · February 17, 2024, 4:14pm

The instructions were not very explicit about this. I hope we can use LOOCV for 2004–2023 and use all years prior to 2004 as additional fitting samples? Because this is seasonal forecasting, many useful signals come from low-frequency predictors. These cannot be properly identified, and later exploited, by models fitted to short time periods (20 years in this case).

mmiron · February 18, 2024, 10:22am

I believe the upshot of the LOOCV process is that yes, we get to train with all water years except for whichever year we’re using as the holdout set for a specific iteration (see Competition: Water Supply Forecast Rodeo: Final Prize Stage).

My reading of the process basically reduces it to 20 different test sets, each one composed of a single year between 2004 and 2023; maybe that interpretation clears it up a little.

Note also that we’re being provided with data going back to 1898 in some cases in the file prior_historical_monthly_flow.csv on the downloads page, so I think it’s pretty safe to assume that we can use it.

jayqi · February 20, 2024, 4:14pm

@jimking100 @kamarain @mmiron

Yes, you may train on the data from 2003 and earlier so long as it does not overlap with the test year of any given cross-validation iteration. It is provided in a separate CSV file prior_historical_labels.csv to be clear that those years are not part of the cross-validation period.
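For anyone who wants to see the fold structure spelled out, here is a minimal sketch of the leave-one-year-out splits as described above. The 1994 start of the prior period is only an illustrative assumption (taken from the example in the first post), and `loocv_folds` is a hypothetical helper, not part of any competition code.

```python
# Cross-validation period from the thread: water years 2004-2023 (20 folds).
CV_YEARS = list(range(2004, 2024))
# Assumed prior period for illustration only; the actual prior data may start earlier.
PRIOR_YEARS = list(range(1994, 2004))


def loocv_folds(cv_years, prior_years):
    """Yield (train_years, test_year) pairs, one per cross-validation year.

    Each fold holds out a single CV year and trains on every other CV year
    plus all prior years, which are never used as test years.
    """
    for test_year in cv_years:
        train_years = sorted(prior_years) + [y for y in cv_years if y != test_year]
        yield train_years, test_year


folds = list(loocv_folds(CV_YEARS, PRIOR_YEARS))
first_train, first_test = folds[0]
print(first_test, len(first_train))
```

The prior years appear in the training set of every fold, while each year in 2004-2023 is held out exactly once.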

If you have lookback windows that extend beyond the start of the water year, you may still train on the 2003-and-earlier data but will need to adjust to make sure there is no overlap. (See here.)
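One way to check for this kind of overlap is sketched below. It assumes the usual convention that water year N runs from 1 October of year N-1 through 30 September of year N, and it assumes a fixed lookback measured in days before the start of the water year. The helper names and the conservative rule (flag any training year whose feature window intersects the test year's feature window) are illustrative assumptions, not the official competition logic.

```python
from datetime import date, timedelta


def water_year_span(water_year):
    """Water year N: 1 October of N-1 through 30 September of N."""
    return date(water_year - 1, 10, 1), date(water_year, 9, 30)


def feature_span(water_year, lookback_days):
    """Date range touched by features: the water year plus the lookback before it."""
    start, end = water_year_span(water_year)
    return start - timedelta(days=lookback_days), end


def windows_overlap(train_year, test_year, lookback_days):
    """True if the two years' feature windows share at least one day."""
    train_start, train_end = feature_span(train_year, lookback_days)
    test_start, test_end = feature_span(test_year, lookback_days)
    return train_start <= test_end and test_start <= train_end


# Hypothetical example: a 30-day lookback with 2004 as the test year.
test_year, lookback_days = 2004, 30
usable = [
    y for y in range(1994, 2024)
    if y != test_year and not windows_overlap(y, test_year, lookback_days)
]
print(usable)
```

With no lookback beyond the water year, only the test year itself is excluded; with a lookback that crosses the 1 October boundary, the adjacent years are flagged too and would need to be dropped or have their windows shortened.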

