I want to ask about the qualitative evaluation of the winners for this competition. Are the winners determined solely by the scores on the private leaderboard, or will there be a post-hoc evaluation of the results by judges? If the latter, could you please explain how that works?
That would be interesting to see! But I suspect any qualitative difference between the leading models would be impossible to tell… Perhaps just removing the 5-10% of bad/corrupted chip labels and recalculating IoU on the non-corrupted labels (actual clouds) would be the most robust way to determine the final rankings.
This would for sure be enough to change your IoU by the 0.0004 you need, haha (though no guarantee it would move in the right direction!), and would reshuffle everyone by quite a number of spots.
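Something along these lines would do it, as a minimal sketch. It assumes per-chip binary masks and a per-chip mean IoU; the names (predictions, labels, bad_chips) are just placeholders, and the official metric may well aggregate pixels across chips rather than averaging per chip:

```python
import numpy as np

def chip_iou(pred, label):
    """Pixel-wise IoU (Jaccard) for one chip's binary masks."""
    inter = np.logical_and(pred, label).sum()
    union = np.logical_or(pred, label).sum()
    return inter / union if union > 0 else 1.0

def mean_iou_excluding(predictions, labels, bad_chips=frozenset()):
    """Mean per-chip IoU, skipping chips whose labels are flagged as corrupted."""
    scores = [chip_iou(predictions[cid], labels[cid])
              for cid in labels if cid not in bad_chips]
    return float(np.mean(scores))
```

The hard part, of course, is agreeing on which chips belong in bad_chips in the first place.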
Assuming a 50/50 public/private split of the 10,000 test chips, the ~250-500 corrupted chip labels in the private set will on average decrease the true IoU by roughly 0.04-0.09, but that average reduction will have a lot of scatter depending on how much each individual model's predictions happen to overlap with those bad chips' labels.
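Rough arithmetic behind those numbers (my assumptions, not official figures): if a model scores about 0.9 IoU on clean chips and close to 0 on chips whose labels are essentially wrong, then 5-10% corrupted chips pull a per-chip mean down by roughly 0.045-0.09:

```python
# Back-of-envelope only; 5000 private chips, ~0.9 IoU on clean chips, and
# ~0 IoU on corrupted chips are assumptions, not official numbers.
n_private = 5000
iou_clean = 0.90
iou_bad = 0.0

for frac_bad in (0.05, 0.10):                     # ~250-500 corrupted chips
    n_bad = int(frac_bad * n_private)
    observed = (iou_clean * (n_private - n_bad) + iou_bad * n_bad) / n_private
    print(f"{n_bad:>3} bad chips -> observed IoU ~{observed:.3f}, "
          f"a drop of ~{iou_clean - observed:.3f}")
# -> drops of roughly 0.045 and 0.090, i.e. the 0.04-0.09 range above.
```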
For example, the bad chips do not appear to be a random sample; they are clearly concentrated in a few locations/land-cover types, e.g. over water, where the bad labels (and many of the others, too) vastly overestimate the cloud cover, quite often labelling water as cloud. So a model that is better at detecting clouds over water will actually get a much lower IoU on these corrupted labels. Given the large fraction of corrupted labels, I suspect this effect could easily result in a “better” model getting a lower total IoU than a model that is actually worse at cloud detection in these regions.
Given that the top 29 entries differ by an IoU an order of magnitude smaller than the bad-chip contribution (a 0.005 spread from 1st to 29th, versus a 0.04-0.09 IoU effect from the corrupted chips), it would be interesting to see how this shakes things up…
Oh well, such is the nature of competitions on noisy labels!