Forecasting energy consumption: NWRMSE

mlearn · March 11, 2018, 10:00pm

Is there a reference implementation of the NWRMSE metric? Or does someone have one that they know works?

Thanks.

mlearn · March 12, 2018, 11:03pm

Would love to have more clarity here. I note the problem statement talks about a weight applied to entries to compute a weighted root mean squared error (WRMSE). However, the formula in the next paragraph then shows that weight being applied to compute a root sum squared error (i.e. there is no mean applied). Whichever way I try it, my numbers don’t come out right. Would love to see some code.
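The two readings in the post above can be compared directly. The following is a minimal sketch, not the competition's scoring code; the function names are hypothetical. It shows that a weighted root *sum* of squared errors and a weighted root *mean* of squared errors give the same number whenever the weights sum to 1, and differ otherwise.

```python
import numpy as np

def weighted_root_sum_sq(err, w):
    """sqrt(sum_t w_t * err_t^2): weights applied, no division by the weight total."""
    return float(np.sqrt(np.sum(w * err ** 2)))

def weighted_root_mean_sq(err, w):
    """sqrt(sum_t w_t * err_t^2 / sum_t w_t): an explicit weighted mean."""
    return float(np.sqrt(np.sum(w * err ** 2) / np.sum(w)))

err = np.array([1.0, 2.0, 3.0])
w = np.array([0.5, 0.3, 0.2])  # illustrative weights that sum to 1

# With normalised weights the two definitions coincide.
print(weighted_root_sum_sq(err, w), weighted_root_mean_sq(err, w))

# Rescaling the weights changes the "sum" form but not the "mean" form.
print(weighted_root_sum_sq(err, 2 * w), weighted_root_mean_sq(err, 2 * w))
```

So whether "there is no mean applied" matters depends on whether the weights in the formula are already normalised.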

Thanks.

LastRocky · March 12, 2018, 11:58pm

How you implement NWRMSE depends heavily on your train-validation splitting strategy. For example, if you split the training dataset randomly, evaluating NWRMSE is not meaningful: the metric only makes sense when the data you evaluate is mostly contiguous in time, because each record has to be assigned its weight beforehand. I tried splitting the training dataset into 5 subparts by datetime, from the first timestamp to the last, and used one subpart as the validation set so that I could evaluate NWRMSE on both the training and validation data during training. But this splitting strategy didn’t help much compared with random splitting. Hope it helps.
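The datetime-based split described above could look roughly like this. It is a sketch under assumptions: the helper names are hypothetical, the timestamps are taken to be a single sortable NumPy array, and per-site grouping is ignored for brevity.

```python
import numpy as np

def contiguous_time_folds(timestamps, n_folds=5):
    """Return n_folds index arrays, each covering one contiguous time block."""
    order = np.argsort(timestamps, kind="stable")  # row indices, oldest first
    return np.array_split(order, n_folds)

def train_val_indices(folds, val_fold):
    """Hold out one time block for validation and train on the others."""
    val_idx = folds[val_fold]
    train_idx = np.concatenate(
        [fold for i, fold in enumerate(folds) if i != val_fold]
    )
    return train_idx, val_idx

# Toy example: ten unsorted timestamps (plain integers stand in for datetimes).
ts = np.array([5, 1, 4, 2, 3, 0, 9, 8, 7, 6])
folds = contiguous_time_folds(ts, n_folds=5)
train_idx, val_idx = train_val_indices(folds, val_fold=2)
print(ts[val_idx])  # the held-out rows are adjacent in time
```

Because each validation block stays contiguous in time, position-dependent weights can be assigned to its rows in order, which a random split does not allow.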

bull · March 13, 2018, 5:50pm

@mlearn @LastRocky There is now an example implementation in pure numpy (no pandas) here:

https://github.com/drivendataorg/metrics/blob/master/metrics.py#L230-L284


def power_laws_nwrmse(actual, predicted):
    """ Calculates NWRMSE for the Power Laws Forecasting competition.

        Data comes in the form:
        col 0: site id
        col 1: timestamp
        col 2: forecast id
        col 3: consumption value

        Computes the weighted, normalized RMSE per site and then
        averages across forecasts for a final score.
    """
    def _per_forecast_wrmse(actual, predicted, weights=None):
        """ Calculates WRMSE for a single forecast period.
        """
        # limit weights to just the ones we need
        weights = weights[:actual.shape[0]]

        # NaNs in the actual should be weighted zero
        nan_mask = np.isnan(actual)

(The excerpt is truncated here; see the linked file for the rest of the function.)

mlearn · March 13, 2018, 9:31pm

Thanks - really helpful. That’s clarified things and corrected my code. The weights are interesting. You assume a maximum length of 200 and then truncate the weights?

On the problem description page I’d list the RMSE weights as \frac{600-2t+1}{2 \cdot 200}. The full WRMSE expression then has the weight term \frac{600-2t+1}{2 \cdot 200 \cdot T_n}. It’s also important that the mean is over all sites and forecast periods (my original reading was that one took a mean over forecasts for a site and then a mean over sites).
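The weight term discussed above can be tabulated directly. This is a sketch under assumptions: it reads the denominator as 2 · 200 · T_n, takes T_n to be the maximum length of 200 mentioned in this post, and uses a hypothetical helper name; it is not copied from the reference implementation.

```python
import numpy as np

T_N = 200  # assumed forecast length (the "maximum length of 200" above)

def wrmse_weights(n_steps, t_n=T_N):
    """Weights (3*T_n - 2*t + 1) / (2*T_n^2) for t = 1..T_n, cut to n_steps."""
    t = np.arange(1, t_n + 1)
    full = (3 * t_n - 2 * t + 1) / (2 * t_n * t_n)  # (600 - 2t + 1) / 80000 for T_n = 200
    return full[:n_steps]  # shorter forecasts keep only the leading weights

w = wrmse_weights(200)
print(w[0], w[-1])  # earliest steps carry the most weight
print(w.sum())      # a full 200-step forecast has weights summing to 1

w_short = wrmse_weights(50)
print(w_short.sum())  # truncated weights sum to less than 1
```

Under these assumptions the weights of a full-length forecast sum to 1, so the weighted sum of squared errors is already a weighted mean. For a truncated forecast the leading weights sum to less than 1 (0.34375 for 50 steps), so the two readings no longer agree.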

Sadly I’m still not getting close to the leaderboard score (the leaderboard score is two orders of magnitude better) but I’m hoping that’s just my issue.

amogil · March 13, 2018, 10:19pm

You are not alone :-/

mlearn · March 14, 2018, 9:39am

Oh - it makes me both happy and sad to hear I’m not alone. Can we get some confirmation that the Python scoring code matches the leaderboard scoring? I’d hope the organisers can run the script offline against a few submissions.

mlearn · March 16, 2018, 7:20pm

Any progress on this? Can this be a fair competition if the evaluation metric is in doubt? It seems unlikely that my submission is really achieving what is, crudely, a relative error of 0.2%. We need to know what is being measured.

Also similar to another post about one of the related competitions - what is the public/private leaderboard split?

bull · March 19, 2018, 9:35pm

We’ve made clarifications to the metric as announced here. This is just confirmation that the available implementation mathematically matches the scores on the platform. To calculate this consistently, competitors can treat T_n as a constant of 200.

https://www.drivendata.org/competitions/51/electricity-prediction-machine-learning/announcements/
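For readers who want something to run locally, here is a hedged end-to-end sketch that pulls together the details stated in this thread: T_n fixed at 200, weights truncated to the forecast length, NaN actuals weighted zero, and a flat mean over all forecast periods. It is not the linked reference implementation. The function names are hypothetical, the choice of the mean actual value as the normaliser is an assumption, and rows are assumed to be in timestamp order within each forecast with at most 200 steps.

```python
import numpy as np

T_N = 200  # treated as a constant, per the clarification above

def per_forecast_wrmse(actual, predicted, t_n=T_N):
    """Weighted root sum of squared errors for one forecast period."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    t = np.arange(1, t_n + 1)
    # Keep only as many weights as there are time steps (assumes <= t_n steps).
    weights = ((3 * t_n - 2 * t + 1) / (2 * t_n ** 2))[:actual.shape[0]]
    # NaNs in the actual values get zero weight and contribute no error.
    nan_mask = np.isnan(actual)
    weights = np.where(nan_mask, 0.0, weights)
    sq_err = np.where(nan_mask, 0.0, (actual - predicted) ** 2)
    return float(np.sqrt(np.sum(weights * sq_err)))

def nwrmse_sketch(forecast_ids, actual, predicted):
    """Flat mean of normalised per-forecast WRMSE over all forecast ids."""
    scores = []
    for fid in np.unique(forecast_ids):
        sel = forecast_ids == fid
        a, p = actual[sel], predicted[sel]
        mu = np.nanmean(a)  # ASSUMPTION: normalise by the mean actual value
        scores.append(per_forecast_wrmse(a, p) / mu)
    return float(np.mean(scores))

# Two full-length forecasts, actual value 2.0 everywhere, predictions off by 1.0.
ids = np.repeat([1, 2], 200)
actual = np.full(400, 2.0)
predicted = actual + 1.0
print(nwrmse_sketch(ids, actual, predicted))
```

With these toy inputs each forecast has a WRMSE of 1 and a mean actual of 2, so the sketch returns 0.5. Averaging per site first and then across sites would generally give a different number when sites have different numbers of forecasts.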

mlearn · April 1, 2018, 8:42am

I’m still feeling mystified about the evaluation metric. I had been wondering whether the public leaderboard score was based on a non-typical set, but the private leaderboard scores are coming out similar. Will you be posting the test data so we can understand the scoring better?

Thanks.
