Negative influence of the Hindcast stage. Possible fixes

Hi Jay @jayqi,

I hope this message finds you well. I am writing to address some concerns I have regarding the current rules of the competition. Allow me to introduce myself as a machine learning specialist with a background in hydrology.

As a hydrologist, I can say that the long-term distribution of river flow exhibits pronounced seasonality, with distinct high-flow and low-flow periods. This is well documented in the hydrology literature and can be described effectively with methods such as a moving average over a multi-year window. Notably, it makes no difference to this approach whether the data come from the target or from USGS. However, the current rules prohibit the use of such features.
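To make the idea concrete, here is a minimal sketch of a multi-year seasonal average. It is only an illustration: the `(year, month, flow)` record layout, the function name, and the window length are my own assumptions, not anything prescribed by the competition. It uses only years strictly before the forecast year, so nothing from the forecast year itself enters the feature.

```python
from collections import defaultdict
from statistics import mean


def multi_year_seasonal_mean(records, target_year, window_years=10):
    """Mean flow per calendar month over the years preceding target_year.

    records: iterable of (year, month, flow) tuples (hypothetical layout).
    Only years in [target_year - window_years, target_year) are used.
    """
    by_month = defaultdict(list)
    for year, month, flow in records:
        if target_year - window_years <= year < target_year:
            by_month[month].append(flow)
    return {month: mean(values) for month, values in sorted(by_month.items())}


# Toy data: a high-flow month (May) and a low-flow month (September), 2000-2004.
toy = [(y, 5, 100.0 + (y - 2000)) for y in range(2000, 2005)]
toy += [(y, 9, 10.0) for y in range(2000, 2005)]

climatology = multi_year_seasonal_mean(toy, target_year=2005, window_years=3)
```

Sliding `target_year` forward one year at a time gives the moving multi-year window, and the same function works on any monthly series, whether it is the target or a USGS record.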

I wish to bring to your attention that this restriction was imposed a month after the competition commenced. This raises two important points: firstly, the rules appear to evolve as the competition progresses, and secondly, participants who may have already implemented this approach in their models are now compelled to alter their model architectures.

In fact, this ban makes it impossible to use certain classes of models. Does it mean that autoregressive or ESP-like models that leverage historical meteorological data cannot be used?
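For readers less familiar with ESP-like approaches, the sketch below shows why they depend on historical meteorology: the same current state is run forward once per historical year, and the spread of results forms the forecast ensemble. The `toy_runoff_model` function, its `runoff_fraction` parameter, and the numbers are hypothetical stand-ins for a real hydrological model and real data.

```python
def toy_runoff_model(initial_storage, precip_trace, runoff_fraction=0.5):
    """Hypothetical stand-in for a hydrological model.

    Returns a seasonal volume from the current storage plus a fixed
    fraction of the precipitation in the trace.
    """
    return initial_storage + runoff_fraction * sum(precip_trace)


def esp_ensemble(initial_storage, historical_traces):
    """Run the model once per historical year, always from the same current state."""
    return {
        year: toy_runoff_model(initial_storage, trace)
        for year, trace in historical_traces.items()
    }


# Made-up seasonal precipitation traces from three past years.
historical_traces = {
    2001: [10.0, 20.0, 30.0],
    2002: [5.0, 5.0, 10.0],
    2003: [20.0, 20.0, 40.0],
}

ensemble = esp_ensemble(initial_storage=50.0, historical_traces=historical_traces)
sorted_volumes = sorted(ensemble.values())  # basis for forecast quantiles
```

Without access to the historical traces, there is nothing to build the ensemble from, which is why a ban on historical features would rule out this whole class of models.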

Additionally, I understand the organizers’ rationale for not permitting the target to be used to generate features. What perplexes me, however, is the restriction on using other approved data sources to compute long-term features or anomalies. Why can approved data sources not be used for this purpose?
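By an anomaly I mean something like the sketch below: the current value of an approved variable compared with its own long-term mean and spread. The function name and the sample values are purely illustrative, and the history is assumed to contain only observations from before the forecast date.

```python
from statistics import mean, stdev


def standardized_anomaly(history, current):
    """Return (current - long-term mean) / long-term standard deviation.

    history: past values of one variable for the same time of year
    (assumed to come from an approved data source).
    """
    if len(history) < 2:
        raise ValueError("need at least two historical values")
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return 0.0  # no variability in the record, so no meaningful anomaly
    return (current - mu) / sigma


# Made-up values: three past seasons and the current one.
past_values = [10.0, 12.0, 14.0]
anomaly = standardized_anomaly(past_values, 16.0)
```

The target never appears in this calculation; only the approved series and its own history are used.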

Furthermore, in machine learning it is customary to refrain from making explicit hypotheses about the impact of specific features on the process being studied. Instead, assessments are typically based on an analysis of feature importance. My findings align with those of @fingoldo, and I am willing to share the feature importance chart if needed. While I acknowledge the organizers’ prerogative to impose restrictions on the source data, I find it unusual that these restrictions extend to entire classes of models or to the feature engineering process. Given that we are participating in a machine learning and data analytics competition rather than a hydrology contest, I propose that it would be more relevant to focus on the methodologies employed within the ML field.

I sincerely hope you will consider these arguments, and I kindly request that the current restrictions be reconsidered. Allowing competitors the freedom to experiment with various models and approaches to feature engineering would enhance the overall quality and innovation of the submissions.
