Forecasting energy consumption: NWRMSE

Is there a reference implementation of the NWRMSE metric? Or does someone have one that they know works?


Would love to have more clarity here. I note the problem statement talks about a weight applied to entries to compute a weighted root mean squared error (WRMSE). However, the formula in the next paragraph then shows that weight being applied to compute a root sum squared error (i.e. no mean is applied). Whichever way I try it, my numbers don’t come out right. Would love to see some code.


How you implement NWRMSE depends heavily on your train-validation splitting strategy. For example, if you split the training dataset randomly, evaluating NWRMSE is not meaningful: the metric only makes sense when the data you evaluate is mostly contiguous in time, because you need to assign a weight to each record beforehand. I tried splitting the training dataset into 5 subparts by datetime, from the earliest timestamp to the latest, and using one subpart as the validation set to track NWRMSE on the training and validation data during training. But this splitting strategy didn’t help much compared to random splitting. Hope it helps.
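
For anyone wanting to try the date-ordered split described above, here is a minimal sketch in plain Python. It assumes the rows are already sorted by timestamp; `contiguous_folds` and its parameters are names made up for illustration, not part of any competition code.

```python
def contiguous_folds(n_rows, n_folds=5):
    """Yield (train_idx, val_idx) pairs for rows already sorted by timestamp.

    Each validation block is one contiguous slice, so the time order
    inside it is preserved and per-step weights can be assigned.
    """
    # Spread any remainder over the first folds so sizes differ by at most 1.
    base, extra = divmod(n_rows, n_folds)
    start = 0
    for k in range(n_folds):
        stop = start + base + (1 if k < extra else 0)
        val_idx = list(range(start, stop))
        train_idx = list(range(0, start)) + list(range(stop, n_rows))
        yield train_idx, val_idx
        start = stop


# Usage: 12 time-ordered rows split into 5 contiguous validation blocks.
for train_idx, val_idx in contiguous_folds(12, n_folds=5):
    print(val_idx)
```

Note that the training rows for a middle block come from both before and after the validation period, which may or may not be acceptable for a forecasting problem.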

@mlean @LastRocky There is now an example implementation in pure numpy (no pandas) here:

Thanks - really helpful. That’s clarified things and corrected my code. The weights are interesting. You assume a maximum length of 200 and then truncate the weights?

In the problem description page I’d list the RMSE weights as \frac{600-2t+1}{2 \cdot 200}. The full WRMSE expression then has the weight term \frac{600-2t+1}{2 \cdot 200 \cdot T_n}. It’s also important that the mean is taken over all sites and forecast periods (my original reading was that one took a mean over the forecasts for a site and then a mean over sites).
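
To make the weights concrete, here is a small sketch that tabulates them for steps t = 1..200. It assumes T_n = 200 and reads the denominator of the RMSE weight as 2 × 200; `rmse_weight` is an illustrative name, not taken from the reference implementation.

```python
T_N = 200  # assumed maximum forecast length


def rmse_weight(t, t_n=T_N):
    """Weight for forecast step t (1-based): (3*t_n - 2*t + 1) / (2*t_n).

    With t_n = 200 this is (600 - 2t + 1) / (2 * 200).
    """
    return (3 * t_n - 2 * t + 1) / (2 * t_n)


weights = [rmse_weight(t) for t in range(1, T_N + 1)]

print(weights[0])           # first step: 599 / 400
print(weights[-1])          # last step: 201 / 400
print(sum(weights) / T_N)   # average weight over the full horizon
```

Under this reading the weights fall linearly from about 1.5 to about 0.5 and average to 1, so dividing each by T_n gives weights that sum to 1. If shorter forecasts are handled by truncation, as the question above suggests, they would simply use the first few entries of `weights`.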

Sadly I’m still not getting close to the leaderboard score (the leaderboard score is two orders of magnitude better) but I’m hoping that’s just my issue.

You are not alone :-/

Oh - it makes me both happy and sad to hear I’m not alone. Can we get some confirmation of the python scoring matching the leaderboard scoring? I’d hope the organisers can run the script offline against a few submissions.

Any progress on this? Can this be a fair competition if the evaluation metric is in doubt? It seems unlikely that my submission is really achieving what is, crudely, a relative error of 0.2%. We need to know what is being measured.

Also, as asked in another post about one of the related competitions: what is the public/private leaderboard split?


We’ve made clarifications to the metric as announced here. This is just confirmation that the available implementation mathematically matches the scores on the platform. To calculate this consistently, competitors can treat T_n as a constant of 200.
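
As a rough illustration of what treating T_n as a constant of 200 means in practice, here is a sketch of the weighted error for a single forecast. It assumes the per-step weight is (600 - 2t + 1) / (2 × 200 × T_n) with T_n = 200, and that a forecast shorter than 200 steps uses only the first weights. `weighted_rmse` is a hypothetical helper, not the organisers' code, and the normalisation and the averaging across sites and forecasts are left out because the thread does not spell them out.

```python
import math

T_N = 200  # constant horizon, per the clarification above


def weighted_rmse(actual, predicted, t_n=T_N):
    """Weighted root squared error for one forecast of at most t_n steps.

    Assumption: step t (1-based) gets weight (3*t_n - 2*t + 1) / (2*t_n**2),
    regardless of how long this particular forecast is.
    """
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if len(actual) > t_n:
        raise ValueError("forecast is longer than t_n")
    total = 0.0
    for t, (y, y_hat) in enumerate(zip(actual, predicted), start=1):
        w = (3 * t_n - 2 * t + 1) / (2 * t_n ** 2)
        total += w * (y - y_hat) ** 2
    return math.sqrt(total)


# Usage: a constant error of 1 at every one of the 200 steps.
print(weighted_rmse([1.0] * 200, [0.0] * 200))
```

With these weights summing to 1 over 200 steps, the weighted sum inside the square root already behaves like a weighted mean, which may be why the formula shows no explicit division.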

I’m still feeling mystified about the evaluation metric. I had been wondering whether the public leaderboard score was based on a non-typical set, but the private leaderboard scores are coming out similar. Will you be posting the test data so we can understand the scoring better?

