Negative influence of the Hindcast stage. Possible fixes

Hi, I wanted to participate in this competition and create value for society. However, I feel that the artificial data limitations introduced by the organizers to be able to run the Hindcast stage will influence the Forecast stage (where the help of DS folks is REALLY needed) in a negative way:

  1. The spirit of the competition requires that at forecast issue dates ALL available information should be accounted for (for features only, not for targets). However, your current data withholding (for the Hindcast stage) creates artificial gaps in the features for no reason. For example, on the day of forecast issue I want to know the running water inflow for each of the 12 preceding months. Or 24. That is very natural for TS feature creation (see the sketch after this list). But you simply don’t allow this currently. Contestants have to fight to overcome such gaps. Why should we spend extra effort developing worse models?

  2. I understand the desire to keep hindcast data private so that the true predictive power of the models can be estimated early. However, the ground truth data is available via public open APIs and is easily downloadable. And there are ways to overfit on ground truth data indirectly. Do you want to make the Hindcast stage a “data leakage” stage? Then at least do that without artificially limiting features/data, so that people don’t spend time and carbon footprint on solutions that go nowhere.
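For concreteness, here is a minimal sketch of the trailing-window features described in point 1 (the file and column names are assumptions, not competition data); both windows reach back across the October 1 boundary that the current withholding forbids:

```python
# Minimal sketch, assuming a hypothetical "monthly_inflow.csv" with columns
# "month" and "inflow": trailing 12- and 24-month inflow features at an issue date.
import pandas as pd

inflow = pd.read_csv("monthly_inflow.csv", parse_dates=["month"], index_col="month")["inflow"]

def trailing_monthly_inflow(issue_date: pd.Timestamp, months: int) -> pd.Series:
    """Monthly inflow values for the `months` months strictly before issue_date."""
    start = issue_date - pd.DateOffset(months=months)
    return inflow.loc[(inflow.index >= start) & (inflow.index < issue_date)]

issue_date = pd.Timestamp("2021-03-15")
last_12 = trailing_monthly_inflow(issue_date, 12)   # reaches back before 2020-10-01
last_24 = trailing_monthly_inflow(issue_date, 24)
features = {"inflow_12m_sum": last_12.sum(), "inflow_24m_sum": last_24.sum()}
```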

My suggestions:

  1. allowing all data up to the forecast issue dates, without gaps or length limitations, so that contestants’ time is not wasted on tilting at windmills
  2. using a TimeSeriesSplit cross-validation scheme that is known in advance, with a large enough number of folds, to decrease the impact of data leakage and to judge submissions more fairly (see the sketch after this list). Do not allow hardcoded hyperparameters. Do the training inside the submitted scripts, not only the inference. Or, even easier, let platform users submit predictions as before, and only require the strict rules (training & parameter tuning inside the prediction script) at a later stage for those who want to contend for prizes.
  3. lowering the prizes for the Hindcast stage in favor of the Forecast stage
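A minimal sketch of what suggestion 2 could look like (the year range and fold count below are purely illustrative):

```python
# Pre-announced expanding-window CV over years, training inside the submitted script.
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

years = np.arange(1984, 2024)          # hypothetical training years, in chronological order
tscv = TimeSeriesSplit(n_splits=8)     # "big enough" fold count, announced in advance

for fold, (train_idx, val_idx) in enumerate(tscv.split(years)):
    train_years, val_years = years[train_idx], years[val_idx]
    # train on train_years only, tune hyperparameters on val_years, never hardcode them
    print(f"fold {fold}: train up to {train_years.max()}, validate on {val_years.min()}-{val_years.max()}")
```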

Hi @fingoldo,

We appreciate hearing your thoughts about the competition.

Overall, the competition’s structure and design are set up in a way that balances a variety of tradeoffs and technical limitations in service of its goals.

Regarding the window of past feature data going back to October 1 of the previous year, this has been chosen as a starting point for practical data management purposes. Based on hydrology domain knowledge, there is a general baseline expectation that most data sources will not have strong correlations between data before October 1 and the target forecast periods. However, challenge organizers would be interested in any cases that prove otherwise. Participants are encouraged to experiment with this, and to request extensions further back if they find that this improves forecasts. We have received a request for additional months of the monthly naturalized flow time series data, which is being reviewed.

The prizes and evaluation have been designed to balance considerations over the course of the competition. We plan to require a timewise cross-validation in a later stage of the competition as part of the final model report for determining the overall prizes, which constitute the majority of the prize pool. More details on the final model report and cross-validation will be forthcoming later during the competition.


Thank you, I hope additional months will be permitted and I’ll be able to apply my ML skills & pipelines to contribute to this socially important project.

Hi @fingoldo,

Additional clarifications about the time range of data that models may use were added today to the problem description page.

Unfortunately, you may in general not use feature data from before the start of the water year of the forecast (i.e., before October 1). This restriction includes the monthly naturalized flow observations. Exceptions are made for features for climate teleconnection indices, which represent long timescale climate patterns—teleconnection features are allowed to span across water years.
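Roughly speaking, the allowed feature window for a given issue date looks like the following sketch (the function below is illustrative only and not part of the official challenge runtime):

```python
# Allowed lookback for an issue date: on or after October 1 of the forecast's
# water year, and strictly before the issue date.
# (Climate teleconnection indices are exempt and may span across water years.)
import pandas as pd

def allowed_window(issue_date: pd.Timestamp) -> tuple:
    wy_start_year = issue_date.year if issue_date.month >= 10 else issue_date.year - 1
    water_year_start = pd.Timestamp(year=wy_start_year, month=10, day=1)
    return water_year_start, issue_date

start, end = allowed_window(pd.Timestamp("2021-03-15"))
# feature timestamps must satisfy: start <= timestamp < end
```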

Challenge organizers recognize that this setup is different from typical time series forecasting problems. However, this is not a generic time series forecasting problem. Expert hydrologists generally don’t believe there is a substantial signal from features before the start of a water year. This challenge focuses more on the incorporation of varied data sources into successful forecast models than on the possibility that streamflow data from before the water year may be predictive of seasonal water supply.

While there are validation approaches or other versions of restrictions that could allow for longer lookback ranges and still prevent leakage, this is the version that has been determined to be the best tradeoff for the design of this challenge.

If there are specific data sources that you believe should be an exception (like climate teleconnection indices), you may submit a request that justifies making such an exception.

Let me know if you have any further questions.

Thank you Jay for getting back to me and clarifying the rules.

Contrary to what hydrologists say, I have found evidence that features from past water years do indeed matter for the current prediction (they are ranked high among the others). I can share the feature importance chart if necessary. However, I understand that the hindcast is a special type of competition where everyone has to wear the same shoes, so if you consider allowing usage of the whole time span, as one pleases, in the “real” Forecast stage, I’m good with it )

Another point I wanted to draw your attention to is that the training conditions are not clearly formulated, as opposed to the inference conditions:

“When performing inference to issue a forecast, your model must not use any future data as features. For this challenge, this means that a forecast may only use feature data from before the issue date. For example, if you are issuing a forecast for 2021-03-15, you may only use feature data from 2021-03-14 or earlier.”

Are you sure you don’t want to include a paragraph stating that models issuing predictions must not be trained on data after the issue date? I think that’s what you really meant by imposing all these restrictions.

Current rules still allow me to overfit: I can, while using only current water year features, train on all years up to 2023 inclusive, and then with that model issue forecasts for the hindcast years 2001, 2003, etc. Scores resulting from such a submission will never reproduce in real life, and it’s easy to see why: let’s say there was some gradual heating trend from 2000 to 2023, and we are predicting for the year 2003. A model trained on 2000, 2002, …, 2022 would of course know of such a trend and issue optimistically skewed predictions compared to the “fair” model that was only trained on data strictly prior to each issue date.
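To make the contrast concrete, here is a hedged sketch (fit_model and predict_year are placeholders, and the year lists are illustrative) of the protocol the current rules permit versus the one I believe you actually intend:

```python
# Contrast between the leaky protocol permitted today and a "fair" one.
ALL_YEARS = list(range(2000, 2024))
HINDCAST_YEARS = [2001, 2003, 2005]     # illustrative subset of the every-other-year test split

def leaky_predictions(fit_model, predict_year):
    # permitted today: one model trained on everything up to 2023, then used to
    # "hindcast" 2001, 2003, ... -- it has already seen any long-term trend
    model = fit_model(ALL_YEARS)
    return {y: predict_year(model, y) for y in HINDCAST_YEARS}

def fair_predictions(fit_model, predict_year):
    # proposed: for each hindcast year, train strictly on earlier years only
    out = {}
    for y in HINDCAST_YEARS:
        model = fit_model([t for t in ALL_YEARS if t < y])
        out[y] = predict_year(model, y)
    return out
```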

So maybe, to reduce the data leakage effect, you’ll want to reformulate the rules to include a requirement that, when issuing predictions, models must NOT have been trained on data from later dates? The current rules allow that, unfortunately. Otherwise, I’m afraid, there will be a big gap between hindcast and forecast accuracies.

Honestly, I still think that you have introduced a restriction that hinders predictive accuracy, while not introducing a limitation that would actually prevent data leakage in the hindcast stage (

Hi Jay @jayqi,

I hope this message finds you well. I am writing to address some concerns I have regarding the current rules of the competition. Allow me to introduce myself as a machine learning specialist with a background in hydrology.

In my capacity as a hydrologist, it is evident to me that the long-term distribution of river flow exhibits a pronounced seasonality, characterized by high-flow and low-flow periods. This phenomenon is well-documented in hydrology articles and can be effectively described using methods such as a moving average with a multi-year window. Notably, the choice of data, whether target or USGS data, is inconsequential to this approach. However, I have observed that the current rules prohibit the utilization of such features.
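As an illustration, here is a minimal sketch (the file and column names are assumptions, not competition data) of the kind of multi-year seasonality features I have in mind, built only from flow data prior to the forecast year:

```python
# Monthly climatology and a multi-year moving average from a hypothetical daily flow file.
import pandas as pd

flow = pd.read_csv("usgs_daily_flow.csv", parse_dates=["date"], index_col="date")["flow"]

forecast_year = 2021
history = flow[flow.index.year < forecast_year]   # only data before the forecast year

# mean flow for each calendar month over many past years -- exactly the kind of
# long-term information that cannot be used under the October 1 cutoff
monthly_climatology = history.groupby(history.index.month).mean()

# smoothed variant: trailing multi-year moving average of monthly totals
monthly_totals = history.resample("MS").sum()
five_year_ma = monthly_totals.rolling(window=60, min_periods=12).mean()  # 60 months ≈ 5 years
```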

I wish to bring to your attention that this restriction was imposed a month after the competition commenced. This raises two important points: firstly, the rules appear to evolve as the competition progresses, and secondly, participants who may have already implemented this approach in their models are now compelled to alter their model architectures.

In fact, this ban rules out entire classes of models. Does it mean, for example, that autoregressive or ESP-like models that leverage historical meteorological data cannot be used?

Additionally, I comprehend the organizers’ rationale behind not permitting the use of the target to generate features. However, what perplexes me is the restriction on utilizing other approved data sources to compute long-term features or anomalies. For instance, why cannot approved data sources be employed for this purpose?

Furthermore, in the field of machine learning, it is customary to refrain from making explicit hypotheses about the impact of specific features on the studied process. Instead, assessments are typically based on an analysis of feature importance. My findings align with those of @fingoldo, and I am willing to share the feature importance chart if deemed necessary. While I acknowledge the organizers’ prerogative to impose restrictions on the source data, I find it unusual that these restrictions extend to entire classes of models or the feature engineering process. Given that we are participating in a machine learning and data analytics competition, rather than a hydrology contest, I propose that it may be more relevant to focus on the methodologies employed within the ML field.

I sincerely hope you will consider these arguments, and I kindly request that the current restrictions be reconsidered. Allowing competitors the freedom to experiment with various models and approaches to feature engineering would enhance the overall quality and innovation of the submissions.


Hi @fingoldo, @Vervan,

Thank you for your feedback and discussion about the challenge setup.

As with any machine learning challenge, specific parameters and design choices must be made in order to evaluate solutions in a standard and comparable way. These choices aim to balance various tradeoffs in service of the challenge organizers’ goals. These goals include the use of a diversity of data products—including those with limited periods of record—and evaluating solutions on conditions throughout the past 20 years.

That is why the task is framed as treating years as independent observations and the test split is every other year. This has been the framing of the task since the challenge launched, though we recognize it is not a standard longitudinal time series framing. We welcome additional feedback or questions to help best clarify and communicate the task framing to participants.

For this challenge, participants must at least submit a solution that can run on the current water year during inference. While there are physics and statistical signals that have longer time scales than one water year, this challenge generally asks participants to model those dynamics based on how they are encoded in near-term data that starts in the same water year as the forecast season.

If you have additional discussion or results regarding how your models may be improved by using features that incorporate data from earlier than the water year, you are also welcome to include that in your model report.

Current rules still allow me to overfit: I can, while using only current water year features, train on all years up to 2023 inclusive, and then with that model issue forecasts for the hindcast years 2001, 2003, etc. Scores resulting from such a submission will never reproduce in real life, and it’s easy to see why: let’s say there was some gradual heating trend from 2000 to 2023, and we are predicting for the year 2003. A model trained on 2000, 2002, …, 2022 would of course know of such a trend and issue optimistically skewed predictions compared to the “fair” model that was only trained on data strictly prior to each issue date.

This is indeed a limitation of not using rolling validation splits. However, a model that exploits such a cross-year trend would conceptually not be treating water years independently, and would be judged negatively for technical rigor in the qualitative evaluation of model reports. A generalizable model should make use of training data in a way that does not depend on the observations being specific years.
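For illustration only (X, y, and the water-year labels below are random placeholders, not challenge data), one validation scheme consistent with treating water years as independent observations is leave-one-water-year-out cross-validation:

```python
# Hold out each water year in turn and fit on the remaining years.
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                    # illustrative feature matrix
y = rng.normal(size=100)                         # illustrative targets
water_year = rng.integers(2000, 2024, size=100)  # illustrative water-year labels

logo = LeaveOneGroupOut()
for train_idx, test_idx in logo.split(X, y, groups=water_year):
    held_out_year = np.unique(water_year[test_idx])[0]
    # fit on all other water years, evaluate on the held-out one
    print(f"held-out water year: {held_out_year}, train size: {len(train_idx)}")
```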

However, what perplexes me is the restriction on utilizing other approved data sources to compute long-term features or anomalies. For instance, why cannot approved data sources be employed for this purpose?

As documented, an exception exists for climate teleconnection indices. If there are additional exceptions which you would like to use for modeling and can justify as not leaking data about test observations, you may request that they be made.

@fingoldo @Vervan In case you haven’t been following, the problem description has been updated since the discussion in this thread with additional explanation about the independent water year framing. Additionally, approval has been granted today for solutions to use cross-water-year lookback windows for Weather and Climate data sources, subject to justification in your model report per these guidelines.

I don’t know how to react to this news now that the essential part of the challenge period is already over.