We have a question regarding the average quantile losses on the test set. We are benchmarking our models internally by calculating their average quantile loss on an internal test set. When submitting in the development arena, we consistently get an average quantile loss that is roughly 50% higher (from ca. 120 locally to ca. 180 on the leaderboard). We have experimented with changing the years we put into our internal test set (which of course are not the same years as those in the official test set, but are taken from the training set instead), yet the gap persists. We do include the factor 2 in our calculation of the average quantile loss, and as far as we can tell from extensive double checking, our quantile loss computes its values exactly the way the submission arena says it does.
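For concreteness, this is roughly how we compute the metric locally. It is only a sketch of our reading of the described metric; the quantile levels (0.10/0.50/0.90) and the column layout of the predictions are our own assumptions, not taken from the arena's code:

```python
import numpy as np

def mean_quantile_loss(y_true, y_pred, quantiles=(0.10, 0.50, 0.90)):
    """Average pinball (quantile) loss across quantiles, including the factor 2.

    Assumptions: y_pred has shape (n_samples, n_quantiles) with columns ordered
    like `quantiles`, and the final score is the mean over all samples and
    quantiles, scaled by 2.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    per_quantile = []
    for j, q in enumerate(quantiles):
        diff = y_true - y_pred[:, j]
        # pinball loss: q * (y - yhat) for under-prediction, (q - 1) * (y - yhat) otherwise
        per_quantile.append(np.maximum(q * diff, (q - 1.0) * diff))
    return 2.0 * float(np.mean(per_quantile))
```

If anything here differs from how the arena computes the score, that could of course explain part of the gap.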
So we are wondering: is there anything special about the years in the test set that makes for a much higher loss compared to our internal test sets? Or could there possibly be something in your calculation of the average quantile loss that is incorrect or differs from what is described on the website?
We find the latter scenario very unlikely, of course, but we wanted to ask here just in case, as we have been banging our heads against this question for quite a while and can't come up with any other hypotheses at the moment.
The real world is a mess. To me it’s almost more surprising that different subsets of observations (i.e. random samples) have enough in common with each other for a statistical model to work at all, let alone “predictably.”
Same principle as flipping a coin more and more times: our models can’t predict which it will be (perfectly), but they can tell you that 50% will be heads and 50% will be tails if you flip it enough times. That doesn’t mean every (sub)set of flips within that infinite series will be predictable though: you’ll still see bizarre things if you look at isolated parts of the whole.
Thank you for your reply! That could indeed be the case, but since our loss on the leaderboard is consistently about 1.5x higher than our internal test loss (even when trying different constellations of test sets), the only way for that to hold consistently (assuming correct loss calculations and benchmarking everywhere) is for the test set to be special in some sense, i.e. to contain anomalies or years that are more difficult to predict than other subsets.
Since I’ve also seen the scoring differences that you’re referring to, that’s a difficult thing to argue against. Until the competition is over, I’m going to refrain from commenting further; I will say that I don’t suspect any bugs in their calculations though (with the exception of their submission validation code being a little too strict about the order of rows in .csv files).
I had the same problem. Did you also use the even years between 2004 and 2022 as local test data?
The performance varies strongly between years. The even years (2004–2022) seem easier to predict than the odd years in the test data. However, I was able to resolve the discrepancy in evaluation by cross-validating over all years in the training data, which is also in line with the challenge's instruction to treat hydrological years as independent rather than as a time series.
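In case it helps, here is a minimal sketch of what I mean by cross-validating over all years. The names (`model_factory`, `score_fn`, `years`) are placeholders rather than competition code; the point is simply to hold out one hydrological year at a time and score on it:

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

def cross_validate_by_year(model_factory, X, y, years, score_fn):
    """Leave-one-year-out cross-validation.

    X, y: training features/targets as numpy arrays; years: the hydrological
    year of each row; score_fn: whatever metric you benchmark with.
    """
    scores = {}
    for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=years):
        model = model_factory()  # fresh model per fold
        model.fit(X[train_idx], y[train_idx])
        held_out_year = int(np.unique(years[test_idx])[0])
        scores[held_out_year] = score_fn(y[test_idx], model.predict(X[test_idx]))
    return scores  # per-year scores; average them for an overall estimate
```

Looking at the spread of the per-year scores (rather than one fixed split) made it much clearer to me how optimistic or pessimistic any single local test set can be.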
Mmiron, you achieved a very good score on the development leaderboard, but it appears to be unrealistic. I noticed that you have a different score on the actual leaderboard. I would highly appreciate it if you refrained from utilizing any form of data leakage or from attempting to train with the test set.
I am not sure how the host could actually verify other teams’ solutions.
I'm sure that for prize eligibility the organizers will require the full code that generates the same (or very similar) models to the uploaded ones, and that if the results don't roughly match the submitted results, or the processing isn't compliant with the rules, the teams will get disqualified, so I wouldn't worry about that.
That was a careless mistake @motoki, not an attempt at cheating – and to be frank, I’m pretty humiliated about it, since it does look awfully suspicious (though I’m not the only one who got a better score in the development arena than in the evaluation arena). I did bring it to the organizer’s attention, for what it’s worth.
Believe me, it was as confusing for me as it was for anybody.