Thoughts on Overall efficiency scoring


First off, I’d like to thank DrivenData for organizing the contest in such a way that making a mistake in the hindcast stage doesn’t preclude prizes in following stages – provided there was a successful code execution run, of course.

I wanted to share a problem that I’ve come up against: the Microsoft Planetary Computer data. My score is a little better with some of that data (judging from my local scoring), but it’s unwieldy and very slow to collect for all 26 sites. So slow, in fact, that the bare minimum data I need to increase my score can take slightly over 30 minutes to prepare just by itself. I think any benefit to my score will probably be offset by losses in the efficiency category when it comes to the overall prize metrics, and randomly failing runs throughout the Forecast stage (prediction runtime issues) isn’t an option. Since that seems to be contrary to the spirit of the challenge, I wanted to raise the issue and make sure that I’m not misunderstanding something myself.

To be clear, it’s such a pain to work with and such a meager increase in prediction accuracy that I just cut it out of my current submissions. But again, that seems contrary to the spirit of what we’re doing.

So to sum up: is the efficiency portion of the Overall metrics judged with respect to the Forecast stage runtime? Or will it be judged on subsequent efficiency, meaning that only the 10% from the Forecast stage predictions is what’s included in the Overall prize metrics?

Hi @mmiron,

A few notes regarding what you’ve mentioned:

  • The trade-off between efficiency and accuracy when selecting features is something we expect you to report and discuss in the Final Model Report. While the format for the Final Report is not yet available, you can see similar questions listed in the format for the Hindcast Report. Judges will consider all of this holistically when reviewing your Final Report.
  • If this is data that you are downloading from the Planetary Computer and it proves too slow or unreliable for the Forecast Stage, we are accepting requests to potentially have some direct-access data sources predownloaded to the mounted data drive.
  • The 30 minute time limit is not a hard constraint and we may consider increasing it.
  • I recall that you had this previous thread about working with netCDF data. You should make sure that you’re processing the data in an efficient manner. Converting gridded data to points in a GeoDataFrame is likely not the best approach; the suggestion in that thread to use xarray will likely be a lot faster.
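
To illustrate the last point, here is a minimal sketch of the xarray pattern: instead of exploding a grid into per-point GeoDataFrame rows, you can pull values at your site coordinates directly with vectorized nearest-neighbor selection. The grid values and site coordinates below are synthetic placeholders (in practice the DataArray would come from `xr.open_dataset` on your downloaded netCDF file, and the variable name `swe` is hypothetical):

```python
import numpy as np
import xarray as xr

# Synthetic gridded field standing in for a variable loaded from a netCDF
# file (e.g. via xr.open_dataset); "swe" is just a placeholder name.
lats = np.arange(35.0, 45.0, 0.25)
lons = np.arange(-110.0, -100.0, 0.25)
da = xr.DataArray(
    np.random.default_rng(0).random((lats.size, lons.size)),
    coords={"lat": lats, "lon": lons},
    dims=("lat", "lon"),
    name="swe",
)

# Hypothetical site coordinates. Wrapping them in DataArrays that share a
# "site" dimension makes .sel perform pointwise (vectorized) selection:
# one grid cell per site, rather than the full cross product.
site_lats = xr.DataArray([39.1, 40.2, 41.7], dims="site")
site_lons = xr.DataArray([-106.5, -105.8, -103.3], dims="site")

# Nearest-neighbor lookup for all sites in one call -- no per-point
# geometry objects, no GeoDataFrame construction.
site_values = da.sel(lat=site_lats, lon=site_lons, method="nearest")
print(site_values.values)  # one value per site
```

The key detail is that the indexers are `DataArray`s with a common dimension, which triggers xarray's pointwise indexing; passing plain lists would instead select the full lat × lon cross product.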