Why was RMSE chosen, given that it is very unstable on data with many outliers? If you want to detect emissions more stably, the metric should use a log scale or some other kind of scaling operation.
For example, if you predict 49999 steps with perfect accuracy but miss a single outlier by 100, your RMSE is (100**2 / 50000)**0.5 = 0.447. That is a huge influence from one single value.
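To make that arithmetic easy to check, here is a minimal Python sketch. The error list is made up for illustration (49999 perfect predictions and one miss of 100), and the `rmse` helper is my own, not the competition's scoring code.

```python
import math

def rmse(errors):
    """Root mean squared error over a list of prediction errors."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))

# Hypothetical errors: 49999 perfect steps plus a single miss of 100.
errors = [0.0] * 49999 + [100.0]
score = rmse(errors)
print(round(score, 3))  # 0.447
```

Removing that single miss drops the score to exactly 0, so the whole 0.447 comes from one point.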
+1, I agree with you. This score seems weird, given that in the benchmark post they explicitly say that there are outliers.
A better score would be median absolute error, or at least mean absolute error, or a median squared error if we are really interested in the mean. All of these would be more robust. I think the median is even better than a log scale (at least mathematically speaking).
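A rough way to see the difference between these metrics is to score the same errors with and without one large miss. This is only a sketch on invented numbers (1000 errors of 0.1, then one of them replaced by 100), and the helper names `rmse`, `mae` and `median_ae` are assumptions for illustration, not anything from the competition.

```python
import math
from statistics import mean, median

def rmse(errors):
    """Root mean squared error."""
    return math.sqrt(mean([e * e for e in errors]))

def mae(errors):
    """Mean absolute error."""
    return mean([abs(e) for e in errors])

def median_ae(errors):
    """Median absolute error."""
    return median([abs(e) for e in errors])

# Hypothetical errors: a clean run, and the same run with one miss of 100.
clean = [0.1] * 1000
spiked = [0.1] * 999 + [100.0]

for name, metric in [("RMSE", rmse), ("MAE", mae), ("MedAE", median_ae)]:
    print(name, round(metric(clean), 3), round(metric(spiked), 3))
```

On this toy data the RMSE jumps from about 0.1 to above 3, the MAE roughly doubles, and the median absolute error does not move at all.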
One of the reasons for this score, I think, is that it is very often used in the context of regression. Apart from that, I would also vote for a more robust score, considering the nature of the problem (there are outliers in the test set).
You raised a great point. Most of the time, Dst fluctuates in the range +10 to -30 nT. The energetic solar events that produce large (< -100 nT) negative deflections in Dst are rare, but it is important to model them accurately, because they adversely and severely affect magnetic referencing. These large, negative deflections are not "outliers", and they usually last several hours. However, the input solar-wind data (RTSW) does have some outliers, owing to sensor malfunctions and outages. We use RMSE for two reasons. First, we want to make sure that the model is also sensitive to the rare, large events. Second, RMSE is widely used in the geophysics scientific literature, so it is useful for comparison with models published in the past. Thanks for your question and good luck.