Preparing your final submission? We can help!

For everyone in the prescreened arena, we want to make sure you’re able to focus on the subject matter and not get tangled up in other details. While we can’t help with the subject matter or provide hints, we do want to make sure your code runs!

If you’re having any issues, please post here so we can help debug. We’re also more than happy to have a member of our team spend some time on a video call with you debugging any Docker/dependency/environment setup issues. Feel free to contact us any time at robert@drivendata.org.

As you know, after the close of the prescreened arena, we’ll be collecting the final evaluation dataset over the summer. We’ll run your final prescreened submission on that set to determine the final model performance and prizes. Here are some tips to consider when putting together your final submission:

  1. Dropping an airport: We might drop an airport from the final evaluation dataset if there are issues with data quality. Note that we might say “you don’t need to predict for airport ABC since the data quality is poor” (i.e., ABC is not in the submission format), but we will still include airport ABC features (past configurations, past weather, etc.).
  2. Distribution of configurations: While we will enforce that the set of airport configurations in the evaluation set matches the set from the training period, we can’t ensure that the distribution of how often those configurations are active is the same. Some of these changes might be predictable (seasonality of configurations), but some may not be. Your code should be able to handle configuration counts that were not directly observed in the training period.
  3. Missing data: Let’s say you compute the number of arriving flights per hour as a feature. In the final evaluation set, it could happen that no flights are recorded for several hours (due to chance or a temporary reporting outage). Your code should be able to handle these kinds of missing data (see the sketch after this list).
  4. Computational requirements: Note that the prescreened evaluation set is one week, while the final evaluation set is one month. Some shortcuts that work in prescreened arena submissions might cause your algorithm to run out of compute resources on the final evaluation set. For example, your model might start by loading all of the available past features. That might work fine at the beginning of the month, when there are only a few days of past data, but it might run out of memory towards the end of the month. Be strategic about how much data and what data you load, and make sure that doesn’t scale poorly as time goes on.
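To make points 2 and 3 concrete, here is a minimal sketch of defensive feature-building. The column names, 24-hour window, and configuration labels are placeholders for illustration, not the competition’s actual schema: the idea is that configurations absent from the recent window and hours with no recorded flights come out as explicit zeros rather than crashing your pipeline or changing the length of your feature vector.

```python
import pandas as pd

# Placeholder names for illustration only; not the competition's actual schema.
TRAINING_CONFIGS = ["config_a", "config_b", "config_c"]  # configurations seen in training


def recent_config_counts(config_history: pd.DataFrame, prediction_time: pd.Timestamp) -> pd.Series:
    """Point 2: how often each training-time configuration was active in the last 24h.

    Reindexing onto TRAINING_CONFIGS keeps the feature vector a fixed length and
    turns configurations that happen not to appear in the window into zeros.
    """
    window = config_history[
        (config_history["timestamp"] >= prediction_time - pd.Timedelta(hours=24))
        & (config_history["timestamp"] < prediction_time)
    ]
    return window["config"].value_counts().reindex(TRAINING_CONFIGS, fill_value=0)


def arrivals_per_hour(arrivals: pd.DataFrame, prediction_time: pd.Timestamp) -> pd.Series:
    """Point 3: hours with no recorded flights become explicit zeros, not gaps."""
    window_start = prediction_time - pd.Timedelta(hours=24)
    recent = arrivals[
        (arrivals["timestamp"] >= window_start) & (arrivals["timestamp"] < prediction_time)
    ]
    counts = recent.set_index("timestamp").resample("1h").size()
    full_index = pd.date_range(window_start, prediction_time, freq="1h", inclusive="left")
    return counts.reindex(full_index, fill_value=0)
```

A fixed trailing window like this also helps with point 4, since the feature computation never grows with the length of the evaluation period (though you still need to be careful about how much raw data you read in the first place).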

Best of luck!

Thanks for the thorough detail. Regarding point (4), will we have 8h of compute time per week or 8h for the entire blind evaluation month period? Thanks!

will we have 8h of compute time per week or 8h for the entire blind evaluation month period

Good Q, it will be 8h per week of inference data.

Hello,

We just submitted a pull request to the runtime repository. We get a message saying that it needs approval from a maintainer since we are first-time contributors, so I don’t think the new dependencies will be accepted unless you approve the PR manually.

Could you please take a look and approve our request? Thank you.

Hi @rbgb,

A quick question. If we make a submission before 11:59 PM UTC today, does our submission count renew before the deadline next week? Just want to confirm how the three-submissions-per-week rule works.

Hi @rbgb,

When running our submission we get a warning in the logs: < … WARNING: logs capped at 5,000 lines; dropping … more >… This happens while the logs are showing the “inflating: …” lines, because we have quite a few files in our submission. Because of this we don’t get any value from the logs - it would be much more useful for us to see the last 5,000 lines, or to be able to download the complete logs. Is there a chance you could change this so that we get better insight into the logs?

Thank you!

Submissions are binned by day (in UTC). When you load the submission page, it checks how many submissions you have made since 00:00 UTC 7 days ago. So right before the competition closes on April 25 at 11:59 PM UTC, we’ll count up how many submissions you have made since Mon, April 18 00:00 UTC. Hope that clarifies it!
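In other words, the quota is a rolling window that starts at midnight UTC seven days back. Here is a minimal illustrative sketch (not the platform’s actual code; the example dates, including the year, are assumed for illustration):

```python
from datetime import datetime, timedelta, timezone

def submissions_used(submission_times: list[datetime], now: datetime) -> int:
    """Count submissions made since 00:00 UTC on the day seven days before `now`."""
    window_start = (now - timedelta(days=7)).replace(hour=0, minute=0, second=0, microsecond=0)
    return sum(1 for t in submission_times if t >= window_start)

# Example matching the dates above (year assumed): checking right before the
# April 25, 11:59 PM UTC close counts everything since Mon, April 18 00:00 UTC.
now = datetime(2022, 4, 25, 23, 59, tzinfo=timezone.utc)
past = [
    datetime(2022, 4, 17, 12, 0, tzinfo=timezone.utc),  # before the window: not counted
    datetime(2022, 4, 18, 1, 30, tzinfo=timezone.utc),  # counted
    datetime(2022, 4, 24, 9, 0, tzinfo=timezone.utc),   # counted
]
print(submissions_used(past, now))  # -> 2
```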

I upped the log limit to 10,000 lines – let me know if it’s still not enough and we can try to decrease the verbosity at each step! You might find it helpful to debug your submission locally using the instructions in the runtime repo.

Unfortunately 10,000 is not enough… we need about 40,000. Yes, debugging locally has been helpful, but we would like to see our loss breakdown on the prescreened data, which we cannot do unless we can see the end of the logs.

The vast majority of those lines come from the unzipping step: “inflating: …”

Ok, I upped it to 50k! I can also email you the logs of past runs if that’d be helpful. Send me an email at robert@drivendata.org if that’s something you’d be interested in.

I just wanted to double-check that our last submission will be used for the final evaluation (not our “best” one based on the one-week test data shown on the leaderboard). Thanks so much!

Correct, it’s the last successful submission you made that will be used in the final evaluation.

Hi @rbgb,

We had a quick question. We understand that our model needs to predict 168 hours (a full week, 24 hours per day) within 8 hours of compute. However, in submitting our prescreened submissions, we noticed that the supervisor script’s runtime scales significantly as more data accumulates per prediction (see the log excerpts below).

[Log excerpt: initial supervisor call at the beginning of the week]

[Log excerpt: final supervisor call at the end of the week]

That was an increase from 168 seconds to 709 seconds. We imagine that if this scaling continues, it could become significant over the final evaluation month. We have optimized our code so that each prediction will not exceed a fixed time; however, we are afraid that the supervisor script’s runtime will push our execution time over the 8-hour-per-week limit.

Can we confirm that the time the supervisor spends censoring the data will not count towards the 8-hour limit?

Thank you!

I second this comment. Most of the compute time is consumed reading the raw data, and since the first prediction day has only 2 trailing days of data, the second has 3, and so on, the reading process slows down as we move towards the end of the week. In our case, the time profile for the prescreened test is the following:

Created extracts for 24 prediction times in 157.79 seconds
Created extracts for 24 prediction times in 232.39 seconds
Created extracts for 24 prediction times in 277.96 seconds
Created extracts for 24 prediction times in 493.99 seconds
Created extracts for 24 prediction times in 490.00 seconds
Created extracts for 24 prediction times in 477.60 seconds
Created extracts for 24 prediction times in 920.01 seconds
Created extracts for 24 prediction times in 651.17 seconds

My understanding is that the blind test period will consist of 4 weeks and that the data will be refreshed each week, i.e. the first week will mimic the prescreened arena in that we only have one week of data, when the second week is evaluated we will not have access to data from the first week, and so on. It would be great to get confirmation that this understanding is correct. Thanks!

Good question! We will not penalize your submissions for the time the supervisor is processing the data. That is, we will allow extra time for the supervisor when we are processing your final submission.

My understanding is that the blind test period will consist of 4 weeks and that the data will be refreshed each week, i.e. the first week will mimic the prescreened arena in that we only have one week of data, when the second week is evaluated we will not have access to data from the first week, and so on.

No, you will have access to all data prior to the prediction point (although you will probably want to limit how much of it you process to stay under the time limit).
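To make “limit how much of it you process” concrete, one option is to memoize features for hours that are already fully in the past, so each prediction call only does new work for the hours added since the previous call; the same idea applies to whatever expensive step dominates your runtime. A minimal sketch, with hypothetical column names (not the official runtime API):

```python
import pandas as pd

# Cache of per-hour feature values computed on earlier calls.
_hourly_cache: dict[pd.Timestamp, int] = {}


def hourly_arrival_counts(arrivals: pd.DataFrame, prediction_time: pd.Timestamp) -> pd.Series:
    """Count arrivals per hour up to (but excluding) prediction_time.

    `arrivals` is assumed to have a datetime "timestamp" column already censored
    to times before prediction_time.
    """
    if arrivals.empty:
        return pd.Series(dtype=int)

    start = arrivals["timestamp"].min().floor("h")
    hours = pd.date_range(start, prediction_time, freq="1h", inclusive="left")

    counts = {}
    for hour in hours:
        if hour in _hourly_cache:
            counts[hour] = _hourly_cache[hour]  # reuse work from an earlier call
            continue
        hour_end = hour + pd.Timedelta(hours=1)
        n = int(((arrivals["timestamp"] >= hour) & (arrivals["timestamp"] < hour_end)).sum())
        counts[hour] = n
        if hour_end <= prediction_time:
            # Only cache fully closed hours; a partially elapsed hour may still
            # gain flights by the next prediction time.
            _hourly_cache[hour] = n
    return pd.Series(counts).sort_index()
```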

Hello! I am preparing my submission for the prescreened arena. My code, when run by the runtime setup, gets through the entire dataset but fails in the wrap-up part of the scoring. The logs are blank when this happens. Can you provide additional information, e.g. were the logs saved somewhere on the machine? Thank you.

Glad to help! Shoot me an email at robert@drivendata.org and we can debug.