Retro-scoring? Or big single-week impact?

I noticed the LB had some very big shifts; even people who missed the submission (check the dates) got a big shift.

I just want to make sure no retro-scoring is happening (e.g. collecting more data for older days / fixing labels and re-running). The reason I ask is that the impact of one single week seems a bit too much.

It could of course be that the data is not evenly distributed and we suddenly had much more volume for that week, but… it sounds fishy :smiley:

My estimate is that this week there were at least twice as many measurements as the sum of all the previous weeks.

Otherwise I cannot explain the big improvements of some teams such as andrey1362010, oshbocker, ua-ck and TeamUArizona.

at least twice as many measurements as the sum of all the previous weeks

Yes, this could explain the shift, but… it seems there is an underlying bias in the data collection then… it would be interesting to know the “why”.

Anyway, it is what it is… but I just want to make sure there are no bugs in the scoring code.

Yes, maybe @tglazer can give us some insights.

It looks like the LB score is only in one column.

Certain weeks will have significantly more labels than others based on the availability of ground measures and ASO readings. This is similar to the Development Stage, where the availability of labels varied across weeks. Models that can best generalize across these data sources will experience smaller score fluctuations.

The number of labels per date in train_labels.csv is no more than 231, except for 3 dates:
2016-03-29: 3077 labels
2016-04-26: 917 labels
2019-03-26: 780 labels
But in labels_2020_2021.csv the number of labels is often more than 231:
2019-06-04: 657
2019-06-11: 514
2020-02-11: 330
2020-02-18: 286
2020-04-14: 1238
2020-05-05: 1206
2020-05-19: 472
2020-05-26: 1037
2020-06-02: 293
2020-06-09: 653
2021-02-23: 838
2021-03-23: 311
2021-03-30: 1035
2021-04-20: 526
2021-04-27: 491
2021-05-04: 1164
2021-05-18: 286
2021-05-25: 321

I think 231 is the number of regular SNOTEL/CDEC stations, and on several dates (e.g. 10/02/2022) the submissions are compared against irregular measurements too.
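
For anyone who wants to reproduce those counts, here is a minimal sketch. It assumes the label files use a wide layout (one row per site, one column per weekly date); adjust the parsing if your copy is in long format.

```python
import pandas as pd

def labels_per_date(path: str) -> pd.Series:
    """Count how many non-null label values each date column has."""
    df = pd.read_csv(path, index_col=0)
    return df.notna().sum(axis=0).sort_index()

for path in ["train_labels.csv", "labels_2020_2021.csv"]:
    counts = labels_per_date(path)
    print(path)
    print(counts[counts > 231])  # dates with more than the 231 regular stations
```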

So, could you please clarify whether the current LB score is based on the last week’s predictions only, or on all predictions since the start of Phase 2b?

@dmitry_v The leaderboard will represent accumulating scores for the Real-time Evaluation Stage of the competition (beginning February 15).

Hi @tglazer, today I saw dramatic changes in the leaderboard. I would like to ask if there could be some error in the evaluation (maybe an outlier in the data).

The ground station data has not changed much between weeks, so I don’t understand such big changes in the LB scores, given that I have been consistently in the top positions since the start of Phase 2.

Thanks
ironbar

I’ve been keeping track of the public leaderboard using the WayBack Machine if anyone is curious: Wayback Machine

It should also be noted that the HRRR data from the approved data source was down last week; that may have impacted some model scores.

This is very weird indeed. In my internal tests I have never obtained a negative R² score, not even in the most pessimistic scenarios, yet that is what I find on the leaderboard. I hope @tglazer can clarify whether there is a problem with the evaluation. I agree that the missing HRRR data must have had an impact, but not such an abysmal one.

Hahaha… omg… this is completely out of control. It is not possible that HRRR has this massive an impact, unless some stations don’t truly report measurements and rely on HRRR?

Either way, the delta in predictions between week n-1 and week n is usually small for most algorithms; to have this sort of impact we would need a completely disjoint set for each week, never scored before.
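
For example, this is roughly how one could measure that week-over-week delta locally (a sketch only; the file names and the site_id / swe_pred columns are hypothetical stand-ins for whatever your submission format is):

```python
import pandas as pd

# Hypothetical file names and columns, purely to illustrate the check.
prev = pd.read_csv("submission_week_prev.csv", index_col="site_id")["swe_pred"]
curr = pd.read_csv("submission_week_curr.csv", index_col="site_id")["swe_pred"]

delta = (curr - prev).abs()
print(delta.describe())                              # distribution of per-site shifts
print((delta > 5.0).mean(), "fraction of sites moved by more than 5 inches of SWE")
```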

Could you provide:

  1. The number of samples evaluated for each week
  2. The performance for each week

There is nothing in this info that could really be used to “cheat”, and it could provide some answers. To move this massively I would like to see my R² for last week; doing worse than the average guess is actually a hard thing to do (unintentionally).
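
As a side note on that “worse than average guess” point (made-up numbers, not competition data): R² is 1 - SS_res / SS_tot, so a model that always predicts the mean of the targets scores exactly 0, and a negative R² means doing worse than that constant baseline.

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([10.0, 20.0, 30.0, 40.0])          # made-up SWE targets
mean_baseline = np.full_like(y_true, y_true.mean())

print(r2_score(y_true, mean_baseline))  # 0.0  -> the "average guess"
print(r2_score(y_true, y_true * 0.9))   # 0.94 -> better than the mean
print(r2_score(y_true, y_true[::-1]))   # -3.0 -> worse than the mean
```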

Could you also take a look at the deadline for the reports? Can it be moved? Because if this is actually true, it’s pointless to write a report for a model scoring a negative R².

It looks like the teams’ predictions got messed up. Some people have very big drops in score and the bottom teams improved.

For example, Johnny.Research improved from 14 to 8 without making a submission.

That is very scary, because if a bug as big as that exists… how do we know there are no other bugs? It makes it very hard to trust the scoring from now on.

In my case I don’t use HRRR. :slight_smile:

In my case I don’t use HRRR.

Me neither, but… what if some stations do? Still, this doesn’t make sense anyway.

PS: Are we looking at the same leaderboard? I see Jonny Research at position #28

Please keep in mind that labels are derived from a combination of ground-based SNOTEL and CDEC sites as well as Airborne Snow Observatory (ASO) LiDAR measurements. Some weeks may have significantly more ground truth points that are being evaluated against, particularly when recent ASO flights have been flown. More evaluation points means there are opportunities for larger changes in scores and shifts on the leaderboard, depending on how well models generalize. You may have noticed that the training data contained similar patterns in terms of the varying number of observations between weeks. Rest assured that any weeks in which you have seen larger scoring shuffles have been the same weeks that have had more ground truth data points.

Additionally, while all submissions are queued for rescoring at the same time, there may be slight delays between when each submission’s score is populated to the leaderboard. You are correct that there was a small delay earlier today, which explains the inconsistencies noted above. All scores for this week have now been populated to the leaderboard.
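
A rough illustration of why a label-heavy week can dominate an accumulating score (made-up numbers; it simply assumes the leaderboard R² is computed over all evaluated points pooled together, as described above):

```python
import numpy as np
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)

# Hypothetical week sizes: ~230 points for earlier weeks, then one ASO-heavy week.
week_sizes = [230] * 5 + [3000]
y_true, y_pred = [], []
for i, n in enumerate(week_sizes):
    y = rng.normal(20, 8, n)
    noise = 4 if i < 5 else 12          # pretend the model generalizes worse on the big week
    y_true.append(y)
    y_pred.append(y + rng.normal(0, noise, n))

per_week = [round(r2_score(t, p), 2) for t, p in zip(y_true, y_pred)]
pooled = round(r2_score(np.concatenate(y_true), np.concatenate(y_pred)), 2)
print(per_week, pooled)   # the pooled score is pulled heavily toward the big week
```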

The development phase was hard to generalize in, because it covered a much longer period of time, i.e. you were actually tested on multiple conditions (unlike here, where so far it has only been a few months). On top of that, that phase required using older data because it was about predicting two years in advance.

I can totally see some models degrading and not holding up, but this looks as if most models are broken. The (hand-wavy) average R² in the development phase was around 0.5, yet in the evaluation stage it is around -0.5 (an entire 1.0 difference in R² is huge).

I would argue this might go beyond generalization; e.g. I would be OK if the R² of the overall LB hovered around 0.0, but being on the negative side (R² < 0) opens a lot of questions: the predictions are actually moving in the opposite direction.