Retro-scoring, or a big single-week impact?

@Vervan @FBykov what is the intuition behind overfitting to Ground Measurement stations, if you had to guess (or if you found something in the data)? They actually appear to be more geographically dispersed than the other observations in the train labels.

  • Are they strategically located in high SWE areas, which might lead a model to overpredict SWE when generalizing to non-Ground Measurement stations?
  • Is it something to do with them being point measurements as opposed to measurements averaged over an entire 1 km^2 grid cell?
  • Is it some subtle measurement difference attributable to the way ASO captures data compared to the SNOTEL technology?

I think the precision of ASO data strongly depends on the type of the Earth's surface: ASO errors are small for smooth surfaces, such as croplands and pastures, but ASO does not work well for kurums (rock fields), shrubs, and forests (especially evergreens). The airplane cannot see the snow in the crevices between rocks or among the roots, so the ASO data for non-smooth surfaces have a negative bias.

Hello,
First of all congratulations to the winners!

Second, I would like to ask the host of the challenge for insight into the big leaderboard change on the 7th of March. I have already read that the ground truth is going to be published, but I would like to understand where I went wrong so that I can improve.

Thanks
ironbar

I believed we deserved an explanation after devoting half a year to the challenge. Three weeks without an answer proved me wrong. :frowning:

We take your concerns about the accuracy of evaluation seriously, especially since this competition was a big commitment for all who participated.

I have looked into the scoring, in particular the large changes in scores that occurred on 3/7. I’m happy to say that scoring worked exactly as intended. The reason for the large shift in scores on 3/7 is essentially what Emily already mentioned:

I’ll add: scores from 2/28 were determined using a total of ~400 ground truth measurements; scores from 3/7 were determined using over 4,000 ground truth measurements. At the end of the competition, we scored using over 42,000 measurements. Looking back now that the competition is closed, we know that 3/7 added more new sites than any other single week in the competition.

In addition, prior to 3/7 all of the ground truth was from ground-based measurements. The updates for that week consisted mostly of flight-based measurements. As FBykov suggests, those datasets could have systematic differences. The goal of the competition was to make a model that performs well regardless of the measurement source.
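For intuition, here is a small synthetic sketch (all numbers below are made up, not taken from the competition data) of how expanding the evaluation set with measurements from a differently-biased source can shift RMSE noticeably even though the model itself has not changed:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical model with ~0.5 RMSE against ground-based sites (assumed numbers).
n_ground, n_flight = 400, 3600
truth_ground = rng.uniform(0, 20, n_ground)              # SWE values, made up
pred_ground = truth_ground + rng.normal(0, 0.5, n_ground)

# Flight-based sites added later, assumed here to carry a systematic offset
# relative to the ground-based labels the model was tuned against.
truth_flight = rng.uniform(0, 20, n_flight)
pred_flight = truth_flight + rng.normal(0, 0.5, n_flight) + 2.0  # +2 bias (assumed)

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

print("RMSE, ground-based sites only:", rmse(truth_ground, pred_ground))
print("RMSE, after adding flight-based sites:",
      rmse(np.concatenate([truth_ground, truth_flight]),
           np.concatenate([pred_ground, pred_flight])))
```

With these assumed numbers the score jumps from roughly 0.5 to roughly 2, purely because the new measurements dominate the evaluation set and differ systematically from the old ones.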

The change on 3/7 was not a result of missing HRRR data. As Emily mentioned, the data update from 3/7 primarily added ground truth observations to the 2/17 column.

As an additional check, we manually calculated RMSE for two submissions for 2/28 and 3/7: one submission that saw a large decrease in score on 3/7 and one that saw a large increase. We were able to perfectly replicate the scores observed on the leaderboards for the respective weeks.
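For anyone who wants to repeat that kind of check against their own submission, here is a minimal sketch. The file names, the `cell_id` index column, and the date-named prediction columns are assumptions for illustration, not the exact files released for the competition:

```python
import numpy as np
import pandas as pd

def weekly_rmse(submission_csv: str, labels_csv: str, week: str) -> float:
    """RMSE for one prediction date, scored only on cells that have ground truth."""
    sub = pd.read_csv(submission_csv, index_col="cell_id")
    labels = pd.read_csv(labels_csv, index_col="cell_id")
    # Align predictions and labels on cell_id; keep only cells with a label.
    merged = sub[[week]].join(labels[[week]], lsuffix="_pred", rsuffix="_true").dropna()
    return float(np.sqrt(np.mean((merged[f"{week}_pred"] - merged[f"{week}_true"]) ** 2)))

# Example usage (hypothetical file and column names):
# print(weekly_rmse("submission.csv", "labels_snapshot_2022_03_07.csv", "2022-02-17"))
```

Scoring each weekly snapshot this way, with only the ground truth available at that time, reproduces the kind of week-to-week shifts discussed above.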

We’re still verifying the winners, validating solutions, and making final decisions. We will post more information about data releases as soon as possible after the validation process is complete.


Hi all – I’m excited to share that the winning solutions, write-ups, and model reports are all available on GitHub. Check out the winners repo!

You can also read about the winning solutions in the “Meet the Winners” blog post.

Finally, we have released the real-time evaluation dataset used for final scoring. Please keep in mind that this represents the ground truth data that was available at the end of the challenge.

All of these links are available on the competition results page.

Thanks again to everyone who participated in this challenge and made it a success!