You have provided train.csv with the annual ground truth and train_monthly_naturalized_flow.csv with the past monthly naturalized flow. My expectation was that the sum of the monthly flows would equal the annual flow, but this does not seem to be the case. For example, Hungry Horse in forecast year 2022 has an annual flow of 2,297 (train.csv), but the sum of the monthly flows in forecast year 2022 is 2,627 (train_monthly_naturalized_flow.csv). Can you clarify why there is a difference?
The numbers you are comparing represent different things.
The 2,627 number is the sum of the volume values in the rows provided for forecast year 2022, which cover October 2021 through June 2022. (This isn’t any special number in particular. If you were calculating the total water supply for the 2022 water year, you’d be missing July 2022, August 2022, and September 2022.)
The 2,297 number is the seasonal water supply for the April through July season, so it’s the sum of April 2022, May 2022, June 2022, and July 2022. It is annual in the sense that there is one value per year, but it is not the total annual flow.
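To make the mismatch concrete, here is a minimal sketch that compares only the month ranges described above, without touching the data files. The variable names are illustrative, and the water year is taken to run October through September, as implied by the months listed as missing.

```python
# Months are (year, month) pairs.

# Rows provided for forecast year 2022: October 2021 through June 2022.
provided = [(2021, m) for m in (10, 11, 12)] + [(2022, m) for m in range(1, 7)]

# Seasonal target for Hungry Horse: April through July 2022.
season = [(2022, m) for m in range(4, 8)]

# Full 2022 water year (assumed here to be October 2021 through September 2022).
water_year = [(2021, m) for m in (10, 11, 12)] + [(2022, m) for m in range(1, 10)]

# Months that are in the monthly sum but not part of the seasonal target.
outside_season = [ym for ym in provided if ym not in season]

# Months that are in the seasonal target but absent from the monthly rows.
missing_from_season = [ym for ym in season if ym not in provided]

# Months needed for a full water-year total but absent from the monthly rows.
missing_from_water_year = [ym for ym in water_year if ym not in provided]

print("In monthly sum, outside season:", outside_season)
print("In season, missing from monthly rows:", missing_from_season)
print("In water year, missing from monthly rows:", missing_from_water_year)
```

The monthly sum includes six months (October through March) that are not in the seasonal target and omits July, which is, so the two totals are not expected to agree.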
Please review the “Forecasting task” section for additional discussion. For each site, you can find the forecast season defined in the
Got it, thanks for the excellent explanation!
The problem description says that the last month of each location’s forecast season is explicitly excluded from the monthly data, but doesn’t explain why. Obviously it won’t be available at inference time since it can’t be empirically calculated until the forecast season ends, but is there a restriction on using it to train models with past data? And if not, can we assume that it is equal to the difference between the target volume and the monthly data for the available forecast season months*, or are they meaningfully different data sources?
*i.e., for Hungry Horse, the July 2022 monthly volume is
(2297.956 - (222.303 + 674.407 + 1088.706)) = 312.54
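For reference, the subtraction above can be reproduced with the short sketch below. It assumes the three monthly values are the April, May, and June 2022 volumes for Hungry Horse, and that the monthly and seasonal figures come from comparable sources, which is exactly the open question here.

```python
# Seasonal (April-July 2022) target volume for Hungry Horse, as quoted above.
seasonal_target = 2297.956

# Monthly volumes quoted above, assumed to be April, May, and June 2022.
available_months = [222.303, 674.407, 1088.706]

# Implied July 2022 volume, valid only if both files are consistent with
# each other (an assumption, not something confirmed in this thread).
implied_july = seasonal_target - sum(available_months)

print(round(implied_july, 3))
```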