Target variable - final clarification

Thank you for organising an interesting challenge.
Please clarify exactly and definitively which period we should predict volume for. I have read the problem description several times, along with all the discussions raised on this topic so far, but they still do not form a coherent picture for me.
Do I understand correctly that for site_id: hungry_horse_reservoir_inflow (April-July), the prediction periods are as follows:

  • issue_date: 01/01/2005 - volume: 01/04/2005 to 31/07/2005
  • issue_date: 08/01/2005 - volume: 01/04/2005 to 31/07/2005
  • issue_date: 22/03/2005 - volume: 01/04/2005 to 31/07/2005
  • issue_date: 01/04/2005 - volume: 01/04/2005 to 31/07/2005
  • issue_date: 08/04/2005 - volume: 08/04/2005 to 31/07/2005
  • issue_date: 15/06/2005 - volume: 15/06/2005 to 31/07/2005
  • issue_date: 22/06/2005 - volume: 22/06/2005 to 31/07/2005
  • issue_date: 01/07/2005 - volume: 01/07/2005 to 31/07/2005
  • issue_date: 08/07/2005 - volume: 08/07/2005 to 31/07/2005
  • issue_date: 15/07/2005 - volume: 15/07/2005 to 31/07/2005
  • issue_date: 22/07/2005 - volume: 22/07/2005 to 31/07/2005

Is that correct?

Hi @sumatorikki,

This is not correct. In all cases listed in your example, you should be predicting the April 1, 2005 to July 31, 2005 cumulative naturalized flow volume. The ground truth value for all of these issue dates is the same number.

To write this out explicitly, revising your example:

  • issue_date: 01/01/2005 - volume: 01/04/2005 to 31/07/2005
  • issue_date: 08/01/2005 - volume: 01/04/2005 to 31/07/2005
  • issue_date: 22/03/2005 - volume: 01/04/2005 to 31/07/2005
  • issue_date: 01/04/2005 - volume: 01/04/2005 to 31/07/2005
  • issue_date: 08/04/2005 - volume: 01/04/2005 to 31/07/2005
  • issue_date: 15/06/2005 - volume: 01/04/2005 to 31/07/2005
  • issue_date: 22/06/2005 - volume: 01/04/2005 to 31/07/2005
  • issue_date: 01/07/2005 - volume: 01/04/2005 to 31/07/2005
  • issue_date: 08/07/2005 - volume: 01/04/2005 to 31/07/2005
  • issue_date: 15/07/2005 - volume: 01/04/2005 to 31/07/2005
  • issue_date: 22/07/2005 - volume: 01/04/2005 to 31/07/2005
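
To make the corrected list concrete, here is a minimal sketch of the mapping from issue date to target period. It assumes an April-July season, as for the site in this example (other sites may use a different season), and `target_window` is a hypothetical helper name, not part of any competition code.

```python
from datetime import date


def target_window(issue_date: date) -> tuple[date, date]:
    """Return (start, end) of the volume period for an April-July site.

    The window depends only on the year of the issue date, never on the day.
    """
    # Assumption: season runs April 1 through July 31 of the issue year.
    return date(issue_date.year, 4, 1), date(issue_date.year, 7, 31)


issue_dates = [date(2005, 1, 1), date(2005, 4, 8), date(2005, 7, 22)]
windows = {d: target_window(d) for d in issue_dates}
```

Every entry in `windows` is the same pair of dates, which is why the ground truth value is identical across issue dates.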

You have hit on an important point regarding issue dates after April 1: they overlap with the forecast season, which means some of the naturalized flow has already occurred. A few things to note here:

  • Naturalized flow is not available in real time. The available naturalized flow time series has a monthly frequency (see this section). So, for example, on April 15 we don’t yet have a ground truth measurement for the April 1 through April 15 naturalized flow. You can think of a forecast issued on April 15 as partly a nowcast for that portion of the water supply.
  • For issue dates later in the season, you do indeed have naturalized flow data available for months that have fully passed. For example, a June 1 forecast may have April and May naturalized flow values available. You may incorporate the April and May values as inputs to your forecast, and in effect you are mainly trying to predict the residual June and July portions of the water supply.
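
The two points above can be sketched as "observed months plus predicted remainder". This is only an illustration under stated assumptions: an April-July season, monthly values treated as usable once the month has fully passed (actual data availability may lag), and `predict_residual` as a hypothetical stand-in for whatever model you use.

```python
from datetime import date

SEASON_MONTHS = (4, 5, 6, 7)  # assumption: April-July season


def observed_season_months(issue_date: date) -> list[int]:
    """Season months that have fully passed before the issue date."""
    return [m for m in SEASON_MONTHS if m < issue_date.month]


def seasonal_forecast(issue_date, monthly_flow, predict_residual):
    """Forecast the full-season volume as observed part plus modelled part.

    monthly_flow: dict mapping month number -> naturalized flow volume.
    predict_residual: hypothetical callable (issue_date, remaining_months)
        returning the predicted volume for months not yet observed.
    """
    known = observed_season_months(issue_date)
    remaining = [m for m in SEASON_MONTHS if m not in known]
    observed_volume = sum(monthly_flow[m] for m in known)
    return observed_volume + predict_residual(issue_date, remaining)
```

For a June 1 issue date, `observed_season_months` returns April and May, so the model only has to cover June and July. For April 15 it returns nothing, and the whole season, including the first half of April, still has to be estimated.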

The competition was set up in this way in order to simplify the ground truth data so that all forecasts for a site and year are targeting the same single value.

Thank you for your quick response. Now slowly all the pieces are starting to come together into a coherent whole :slight_smile:

…The ground truth value for all of these issue dates is the same number.

…the ground truth data so that all forecasts for a site and year are targeting the same single value.

Hi @jayqi. Thanks for the clarification. Do I understand correctly that we are supposed to forecast a single value, e.g. 3109, for the whole season for a given site?

  • issue_date: 01/01/2005 - volume: 3109
  • issue_date: 08/01/2005 - volume: 3109
  • issue_date: 22/03/2005 - volume: 3109
  • issue_date: 01/04/2005 - volume: 3109

If so, what is the reason for spreading issue dates throughout the season instead of making a single forecast for a given site on 01/01/200*?

Hi @na-sa,

This is because the seasonal water supply is affected by things such as precipitation and snowmelt between January 1 and the end of the forecast season. At different issue dates during the year, you will have different feature/predictor data available to you, e.g., on March 15 you will have data available for conditions and events between January 1 and March 14.

Within a given year for a given site, this is a time series forecasting problem (but with a fixed absolute forecast target rather than a relative forecast target).
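
As a rough sketch of that setup: each issue date gets its own row, the inputs are limited to data from before the issue date, and the target is the same number in every row. The function names and the row layout here are hypothetical, chosen only for illustration.

```python
from datetime import date


def features_available(observations, issue_date):
    """Keep only (date, value) observations dated strictly before issue_date."""
    return [(d, v) for d, v in observations if d < issue_date]


def training_rows(observations, issue_dates, seasonal_volume):
    """Build one row per issue date for a single site and year.

    Inputs grow as the issue date moves later; the target never changes.
    """
    return [
        {
            "issue_date": issue_date,
            "inputs": features_available(observations, issue_date),
            "target": seasonal_volume,  # fixed absolute target
        }
        for issue_date in issue_dates
    ]
```

With a March 15 issue date, an observation dated March 14 is kept and one dated March 15 is dropped, matching the example above.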

@jayqi I got the idea: the ground truth is a fixed cumulative volume for a given site and a given water season. Therefore, we should issue forecasts up to 6 months in advance for that site and season.