Can we use external data?

The data from the ACE and DSCOVR satellites are publicly available on the NOAA website.
I assume that the private test set used in the leaderboard is not published on the NOAA website. If that’s the case, can we use data downloaded from the web?

I have looked more carefully into the data available on the web, and to be fair to all competitors I think this information should be shared.
The hourly Dst is available here:
https://spdf.gsfc.nasa.gov/pub/data/omni/low_res_omni/
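
If anyone wants to check for themselves, here is a minimal sketch of how one could load hourly Dst from a yearly OMNI2 file at that URL. The Dst column index is my assumption and should be verified against the format description in the same directory; the file name and year are just examples.

```python
# Minimal sketch: read hourly Dst from one yearly OMNI2 low-res file.
# DST_COL is an assumption -- verify it against the omni2 format notes.
import pandas as pd

URL = "https://spdf.gsfc.nasa.gov/pub/data/omni/low_res_omni/omni2_2015.dat"
DST_COL = 40  # assumed zero-based index of the Dst field

df = pd.read_csv(URL, delim_whitespace=True, header=None)
# In the OMNI2 layout the first three columns are year, day-of-year, hour.
timestamps = (pd.to_datetime((df[0] * 1000 + df[1]).astype(str), format="%Y%j")
              + pd.to_timedelta(df[2], unit="h"))
dst = pd.Series(df[DST_COL].to_numpy(), index=timestamps, name="dst")
```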

From a quick analysis one can see that the training data provided spans the following times (a sketch of how these dates can be recovered follows the list):

train_a = [16-Feb-1998 00:00, 31-May-2001 23:00]
train_b = [01-Jun-2013 00:00, 31-May-2019 23:00]
train_c = [01-May-2004 00:00, 31-Dec-2010 23:00]
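
This is how I recovered those dates: slide each provided training series over the public hourly Dst and look for the offset where the two (almost exactly) agree. A rough sketch, assuming `train_dst` is one provided period as a NumPy array and `omni_dst` is the public series built as above (the names are mine):

```python
import numpy as np
import pandas as pd

def locate_period(train_dst: np.ndarray, omni_dst: pd.Series) -> pd.Timestamp:
    """Return the timestamp in the public series where the training
    window matches best (mean absolute difference near zero)."""
    values = omni_dst.to_numpy(dtype=float)
    # A short probe is enough to pin the offset and keeps this fast.
    probe = train_dst[:1000]
    n = len(probe)
    best_i, best_err = 0, np.inf
    for i in range(len(values) - n + 1):
        err = np.mean(np.abs(values[i:i + n] - probe))
        if err < best_err:
            best_i, best_err = i, err
    return omni_dst.index[best_i]
```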

The ACE data is available here:
https://sohoftp.nascom.nasa.gov/sdb/goes/ace/daily/
from August 2001 to yesterday.
The organizers have said that the test set is made up of ~60,000 hours, so I do not understand how they can claim to use a private test set that is not publicly available to calculate the leaderboard scores.

Finally, I have checked that the rules of the competition do not allow using the data available on the web, but how do the organizers plan to enforce this rule? It seems far too easy to leak data from the “private” test set into a model.

The rules are clear on this, so I’m not sure I understand the question. We’re openly using scientific data adapted from a public source. The “private” set means we aren’t revealing which pieces of data will be used for the private leaderboard, and the answer to “can we use data downloaded from the web?” is no.

In order to be eligible for a prize, the competitor’s results (including model training) must be independently reproducible using only the data made available for the competition. Furthermore, it must generalize to an out-of-sample verification set.

Relying on tricks to get better apparent leaderboard scores won’t get anybody prizes:

  • If a competitor’s prediction function relies on data that wouldn’t be available at call time, they won’t be eligible to win a prize.
  • If a competitor’s model relies on “seeing” data past t=0, or using stored trends or data blobs for which we can’t reproduce an allowable provenance, they won’t be eligible to win a prize.
  • If a competitor’s code is too sloppy or complicated to verify that it abides by the rules, then they risk not being confirmed as eligible to win a prize.

Does that answer the question?

@isms Please do not pretend you don’t understand the problem. You are not revealing which pieces of data are used for the private leaderboard, but as I explained, that can be figured out pretty easily. I figured it out. Anybody else can. I posted the times used in the training set; everything else goes in the test.

And using the “private” test set as your validation set will certainly give you an advantage while still abiding by the rules. Nobody can find out, and the model is still reproducible.
This competition is simply flawed.

At the end of the competition, you have to provide the training code. And if your result is not reproducible or you are using external data you will be excluded from the competition prizes. I guess it will be clear if someone used this data.

Yes, this is correct.

@Ammarali32 Yes, and I suppose you just run a random search to optimize your hyper-parameters?
And I guess you’ve been lucky enough to find the ones that work just fine for the “private” test set…

I agree with @juliaquinn: having access to the private data allows you to get a well-tuned model for that data. And there’s no way to detect that.

Thanks @mchahhou!
I was starting to think I was the only one speaking ML on this forum :wink:

I also agree with @juliaquinn. Having access to the private test dataset gives a significant advantage in model selection and hyperparameter optimization. The final model training can then be done without leaving any trace of the private test dataset having been used.
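
To make the concern concrete, here is a schematic of that leakage, with synthetic data standing in for the real features (all names are placeholders, not anyone’s actual pipeline): the hyper-parameters are selected against the leaked test period, but the submitted training code only ever touches the official training data.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(500, 8)), rng.normal(size=500)
X_leak, y_leak = rng.normal(size=(200, 8)), rng.normal(size=200)  # "private" test

candidates = [{"n_estimators": n, "max_depth": d} for n in (100, 300) for d in (2, 3)]

best_params, best_rmse = None, np.inf
for params in candidates:
    model = GradientBoostingRegressor(**params).fit(X_train, y_train)
    rmse = np.sqrt(mean_squared_error(y_leak, model.predict(X_leak)))
    if rmse < best_rmse:                       # selection driven by leaked data
        best_params, best_rmse = params, rmse

# The "reproducible" artifact: trains only on competition data, with the
# winning hyper-parameters hard-coded as if they were an educated guess.
final_model = GradientBoostingRegressor(**best_params).fit(X_train, y_train)
```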

Hi @juliaquinn, just a quick note here to make sure it’s clear to everyone:

And using the “private” test set as your validation set will certainly give you an advantage while still abiding by the rules.

I think what you meant here is it’s possible, not that it’s abiding by the rules. This is clearly prohibited in the rules on External Data. Here’s a link to the rules for reference.

Thanks for checking in about the external data. As @isms mentioned, the challenge openly makes use of public scientific data. Please keep a couple things in mind:

  • We will be validating and auditing winners’ submissions, including their training and inference code, to ensure they adhere to the competition rules. We have several ways of doing this. Your best chance at winning is to adhere to the rules.

  • Ultimately this is a competition to help advance science and benefit the public, and we appreciate participants acting in good faith and playing by the rules.

Thanks for everyone’s hard work so far, and good luck!

@glipstein: Yes, you’re right, I meant to say that it is possible, although explicitly prohibited. However, you have no way of enforcing the rule about not using external data. And we know that in any competition, a rule is not a rule if it cannot be enforced.

You are extremely naive if you think that you “have several ways” to make sure that a submission adheres to the rules. The sad reality is that you are going to give away $30,000 in prizes with no real certainty that the winners aren’t cheaters.

I agree with the remarks of @juliaquinn

Since the test set is known, cheating is possible, and if it is done cleverly it will be difficult or impossible to prove. Given the relatively large number of participants, it is likely that at least some of them will attempt to cheat cleverly. If they use approaches that are not significantly worse than those of fair participants, they will end up at the top of the leaderboard.

So what to do about this? It’s a bit late, but perhaps some of the rules could be changed:

  • The number of hyper-parameters could be limited (would still give an advantage to cheaters though)
  • The original prizes could be scrapped
  • All models contending for a prize could be made public. If someone could then reasonably prove some level of cheating, the prize money could instead go to that person
  • The models could be trained on all presently available data, and then evaluated on segments of data collected in the coming 2 years (this information alone would actually help to optimize for the upcoming stage of the solar cycle)

So the problem seems to be difficult to solve. Perhaps having an independent jury rank models not only based on their performance but also on their simplicity could work.

Another question is what exactly constitutes external data. For example, would using an existing model for Dst prediction with a few free parameters constitute external data? (I would argue so, since such a model was developed on more than the provided training data.)
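
To illustrate the gray area: a Burton-style empirical model integrates dDst/dt = Q(t) - Dst/tau, with Q driven by the solar wind, and has only a couple of free parameters. A schematic sketch (the coupling term and starting values here are simplified placeholders, not a vetted model):

```python
import numpy as np
from scipy.optimize import minimize

def simulate_dst(a: float, tau: float, vbs: np.ndarray, dt: float = 1.0) -> np.ndarray:
    """Euler-integrate dDst/dt = -a*VBs - Dst/tau over hourly solar-wind input."""
    dst = np.empty_like(vbs, dtype=float)
    d = 0.0
    for i, q in enumerate(vbs):
        d += dt * (-a * q - d / tau)
        dst[i] = d
    return dst

def fit_free_parameters(vbs: np.ndarray, dst_obs: np.ndarray) -> np.ndarray:
    """Fit (a, tau) using only the provided training data."""
    loss = lambda p: np.mean((simulate_dst(p[0], p[1], vbs) - dst_obs) ** 2)
    return minimize(loss, x0=np.array([1.0, 10.0]), method="Nelder-Mead").x
```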

Guys… any updates on this? What’s the p-value of @juliaquinn’s hypothesis?

[This message was removed]

@juliaquinn Please don’t do this here. Intellectually curious and good-faith discussion is welcome, but this is not the proper venue to accuse other users of cheating.

You are free to email us with any concerns of users violating the terms of service.

@isms Fair enough. But I think it is also fair to raise the issue of reproducibility and to ask whether all of the winning models are entirely reproducible. I believe that at least one of them is blatantly not reproducible.

I also wanted to address the point of whether a new state of the art has been achieved, as claimed by the organizers: Meet the winners of MagNet: Model the Geomagnetic Field - DrivenData Labs
The fact that the models have been overfitted to the test set makes them unusable in real life, and this is clear from their scores during extreme events. State-of-the-art models achieve an RMSE lower by a factor of 3-4 during storms.

I think both points are worth discussing…

Hi @juliaquinn – Glad to respond to those points. The winning code is open source and available here. You are welcome to do the work to reproduce the models and check the results, and to run an apples-to-apples comparison of the performance during extreme storms (the RMSE measure you’re referring to in the post is based only on the extreme data points below -80 nT). We’re also excited to see what happens going forward as more data from unseen storms is used to assess different approaches.
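
For anyone who wants to run that comparison, a minimal sketch of the storm-only metric being discussed (the -80 nT threshold is from the post above; the array names are placeholders):

```python
import numpy as np

def storm_rmse(y_true: np.ndarray, y_pred: np.ndarray, threshold: float = -80.0) -> float:
    """RMSE restricted to hours where the observed Dst is below the threshold."""
    mask = y_true < threshold
    return float(np.sqrt(np.mean((y_true[mask] - y_pred[mask]) ** 2)))
```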