Hi, I wanted to participate in this competition and create value for society. However, I feel that the artificial data limitations the organizers introduced in order to run the Hindcast stage will negatively affect the Forecast stage (where the help of DS folks is REALLY needed):
- The spirit of the competition requires that, at forecast issue dates, ALL available information be accounted for (for features only, not for targets). However, the current data withholding (for the Hindcast stage) creates artificial gaps in features for no reason. For example, on the day of forecast issue I want to know the running water inflow for each of the 12 preceding months. Or 24. It’s very natural for TS feature creation. But this is simply not allowed currently. Contestants have to fight to overcome such gaps. Why should we spend extra effort developing worse models?
- I understand the desire to keep hindcast data private in order to estimate the true predictive power of the models early. However, the ground truth data is available through public open APIs and is easily downloadable. And there are ways to overfit on ground truth data indirectly. Do you want to make the Hindcast stage a “data leakage” stage? Then at least do that without artificially limiting features/data, so that people don’t spend time and carbon footprint on solutions that go nowhere.
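
To illustrate the first point, here is a minimal sketch of the kind of lag features I mean, assuming monthly inflow is held in a pandas Series with a monthly PeriodIndex. The function name `lagged_inflow`, the `inflow_lag_k` column names and the toy data are my own placeholders, not anything from the competition's data package. Months that have been withheld simply come back as NaN, which is exactly the gap contestants then have to work around.

```python
import numpy as np
import pandas as pd


def lagged_inflow(monthly: pd.Series, issue_date, n_lags: int = 12) -> pd.Series:
    """Return the n_lags complete months before the issue month.

    `monthly` is assumed to have a monthly PeriodIndex (hypothetical layout).
    inflow_lag_1 is the most recent complete month; withheld months are NaN.
    """
    issue_month = pd.Timestamp(issue_date).to_period("M")
    # Only months strictly before the issue month, so no target leakage.
    months = pd.period_range(end=issue_month - 1, periods=n_lags, freq="M")
    window = monthly.reindex(months).iloc[::-1]  # newest month first
    window.index = [f"inflow_lag_{k}" for k in range(1, n_lags + 1)]
    return window


# Toy data: 36 months of made-up inflow values, then the same series
# with three months withheld to mimic an artificial gap.
idx = pd.period_range("2020-01", "2022-12", freq="M")
inflow = pd.Series(np.arange(len(idx), dtype=float), index=idx)
withheld = pd.period_range("2022-07", "2022-09", freq="M")
gappy_inflow = inflow[~inflow.index.isin(withheld)]

full_features = lagged_inflow(inflow, "2023-01-15")
gappy_features = lagged_inflow(gappy_inflow, "2023-01-15")
```

With the full history every lag is populated; with the withheld block, `inflow_lag_4` to `inflow_lag_6` are missing and the model has to impute or drop them.
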
My suggestions:
- allowing all data up to forecast issue dates, without gaps or length limitations, so as not to waste contestants' time fighting windmills
- using a TimeSeriesSplit cross-validation scheme, known in advance, with a large enough number of folds to reduce the impact of data leakage and judge submissions more fairly. Do not allow hardcoded hyperparameters. Do the training inside the submitted scripts, not only the inference. Or, even easier, let platform users submit predictions as before, and only require the strict rules (training & parameter tuning inside the prediction script) at a later stage, for those who want to contend for prizes.
- lowering prizes for the Hindcast stage in favor of the Forecast stage
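
To make the TimeSeriesSplit suggestion concrete, here is a rough sketch of what a submitted script could look like, using scikit-learn's `TimeSeriesSplit`. Everything else is my assumption for illustration only: the function name `walk_forward_scores`, Ridge as a stand-in model, the alpha grid, mean absolute error as the score, and the synthetic data. The point is just that both training and hyperparameter tuning happen inside each fold, on past data only.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit


def walk_forward_scores(X, y, n_splits=10):
    """Per-fold MAE where training and tuning only ever see earlier rows.

    Rows of X and y are assumed to be in chronological order.
    """
    scores = []
    for train_idx, test_idx in TimeSeriesSplit(n_splits=n_splits).split(X):
        # Tune alpha inside the fold with an inner time-ordered split,
        # so no hyperparameter is fixed ahead of time by the contestant.
        search = GridSearchCV(
            Ridge(),
            param_grid={"alpha": [0.1, 1.0, 10.0]},  # placeholder grid
            cv=TimeSeriesSplit(n_splits=3),
            scoring="neg_mean_absolute_error",
        )
        search.fit(X[train_idx], y[train_idx])
        pred = search.predict(X[test_idx])
        scores.append(float(np.mean(np.abs(pred - y[test_idx]))))
    return scores


# Synthetic stand-in data: 120 time-ordered rows, 3 features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=120)
fold_scores = walk_forward_scores(X_demo, y_demo, n_splits=10)
```

Organizers could run such a script on the withheld years themselves, which would make it much harder to benefit from ground truth that leaked in through hand-tuned parameters.
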