Training Data - Monthly vs. Ground Truth

jimking100 · November 4, 2023, 7:54pm

Hi,
You have provided train.csv with the annual ground truth and train_monthly_naturalized_flow.csv with the past monthly naturalized flow. My expectation was that the sum of the monthly flows would equal the annual flow, but this does not seem to be the case. For example, Hungry Horse in forecast year 2022 has an annual flow of 2,297 (train.csv), but the sum of the monthly flows in forecast year 2022 is 2,627 (train_monthly_naturalized_flow.csv). Can you clarify why there is a difference?

jayqi · November 5, 2023, 3:10am

Hi @jimking100,

The numbers you are comparing represent different things.

The 2,627 number is the sum of the volume values in the provided rows for forecast year 2022, which cover the monthly flows from October 2021 through June 2022. (This sum has no special meaning. If you were calculating the total water supply for the 2022 water year, you’d still be missing July 2022, August 2022, and September 2022.)

The 2,297 number is the seasonal water supply for the April through July season, so it’s the sum of April 2022, May 2022, June 2022, and July 2022. It is annual in the sense that there is one value per year, but it is not the total annual flow.
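To make the distinction concrete, here is a minimal sketch of the two sums. The monthly values are made-up placeholders, not real Hungry Horse data, and the names `monthly`, `provided_keys`, and `season_keys` are hypothetical; only the month ranges (October through June for the provided rows, April through July for the season) come from the explanation above.

```python
# Toy monthly volumes keyed by (year, month). The values are invented for
# illustration only and are NOT taken from the competition files.
monthly = {
    (2021, 10): 10.0, (2021, 11): 10.0, (2021, 12): 10.0,
    (2022, 1): 10.0, (2022, 2): 10.0, (2022, 3): 20.0,
    (2022, 4): 100.0, (2022, 5): 300.0, (2022, 6): 500.0,
    (2022, 7): 150.0,  # July is not in the provided monthly rows
}


def sum_months(volumes, keys):
    """Sum the volumes for the given (year, month) keys."""
    return sum(volumes[k] for k in keys)


# Rows provided for forecast year 2022: October 2021 through June 2022.
provided_keys = [(2021, m) for m in (10, 11, 12)] + [(2022, m) for m in range(1, 7)]

# Forecast season for this example: April 2022 through July 2022.
season_keys = [(2022, m) for m in (4, 5, 6, 7)]

provided_sum = sum_months(monthly, provided_keys)
seasonal_supply = sum_months(monthly, season_keys)

print(provided_sum)     # sum over October-June
print(seasonal_supply)  # sum over April-July, the quantity in train.csv
```

The two sums cover different sets of months, so there is no reason to expect them to match.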

Please review the “Forecasting task” section for additional discussion. For each site, you can find the forecast season defined in the metadata.csv file.

jimking100 · November 5, 2023, 3:56am

Got it, thanks for the excellent explanation!

HiddenThunder · November 10, 2023, 6:53pm

The problem description says that the last month of each location’s forecast season is explicitly excluded from the monthly data, but doesn’t explain why. Obviously it won’t be available at inference time since it can’t be empirically calculated until the forecast season ends, but is there a restriction on using it to train models with past data? And if not, can we assume that it is equal to the difference between the target volume and the monthly data for the available forecast season months*, or are they meaningfully different data sources?

*i.e., for Hungry Horse, the July 2022 monthly volume is
(2297.956 - (222.303 + 674.407 + 1088.706)) = 312.54
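The same arithmetic as a short Python sketch. It assumes the three values in the footnote are the April, May, and June 2022 monthly volumes (the month labels are my reading, not stated next to the numbers) and that the target and the monthly data are consistent with each other, which is exactly the open question above.

```python
# Seasonal target for Hungry Horse, forecast year 2022 (value from the post).
target_volume = 2297.956

# Monthly volumes for the available forecast-season months (values from the
# post; the April/May/June labels are an assumption).
available_season_months = {
    "April": 222.303,
    "May": 674.407,
    "June": 1088.706,
}

# Implied volume for the excluded last month of the season, valid only if
# the target and the monthly data come from consistent sources.
implied_july = target_volume - sum(available_season_months.values())

print(round(implied_july, 3))
```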
