Data download and test set question

Hi @jayqi!

This is regarding the 2-hour runtime limit and also processing the test data on your side.

Do I understand correctly that the data query and download are included in the runtime?
As mentioned in another thread, ERA5-Land is a massive amount of data and querying it usually takes a while. As a user, however, one has no control over how long a request takes to be processed; that depends entirely on the CDS server. The same applies to other spatial gridded datasets from the approved data sources.
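For reference, a minimal ERA5-Land request via the `cdsapi` package looks roughly like the sketch below; the variable list, dates, and bounding box are illustrative only, not our actual feature set. The `retrieve()` call blocks while the request waits in the CDS queue, which is exactly the part we cannot control.

```python
# Sketch of a minimal ERA5-Land request through the CDS API (cdsapi package).
# Variables, dates, and area are illustrative placeholders.
import cdsapi

client = cdsapi.Client()  # reads credentials from ~/.cdsapirc

# retrieve() blocks while the request sits in the CDS queue; this queueing
# time is server-dependent and can dominate the total wall-clock time.
client.retrieve(
    "reanalysis-era5-land",
    {
        "variable": ["2m_temperature", "snow_depth_water_equivalent"],
        "year": "2015",
        "month": "03",
        "day": [f"{d:02d}" for d in range(1, 32)],
        "time": "12:00",
        "area": [49, -125, 31, -102],  # N, W, S, E (roughly the western US)
        "format": "netcdf",
    },
    "era5_land_2015_03.nc",
)
```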

Second question: does that also apply to ALL the calculations done for the test set? This sounds reasonable for the weekly inference, but it does not seem feasible for the 10 years of test data. There are currently around 30 approved data sources, which altogether form a massive data cube that needs to be pulled together using various APIs and processed (e.g. zonal statistics and watershed aggregation).
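To give a sense of the processing involved, the per-basin aggregation step looks roughly like this (a sketch using the `rasterstats` package; both file paths are hypothetical):

```python
# Sketch of zonal statistics over watershed polygons (rasterstats package).
# Input paths are hypothetical placeholders.
from rasterstats import zonal_stats

# Mean and max of a gridded variable (e.g. SWE) within each basin polygon.
stats = zonal_stats(
    "watershed_boundaries.geojson",  # basin polygons
    "swe_grid.tif",                  # gridded raster for one time step
    stats=["mean", "max"],
    all_touched=True,  # count cells touched by the basin boundary
)
for basin in stats:
    print(basin["mean"], basin["max"])
```

Repeating this across all the sources and time steps is what makes the test-set computation so heavy.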

Would it be possible to soften the requirements and allow us to calculate the features for the test set on our side at this stage of the competition? At the next stage, everything would be computed on the fly.

I would also like to follow up on the similar earlier thread and ask about daily data (so far, the hourly and monthly data have been approved). Are we allowed to download and use daily data?

Thanks in advance for your response!


We will also, of course, include all the code for feature calculation and data processing in our submission.

Hi @varyabazilova,

> Do I understand correctly that the data query and download are included in the runtime?

For certain data sources designated as “Direct API access approved”, you are permitted to download data during your code execution run. This is indeed considered part of the 2-hour time limit for test set inference.

> There are currently around 30 approved data sources, which altogether form a massive data cube that needs to be pulled together using various APIs and processed (e.g. zonal statistics and watershed aggregation).

For many of the data sources, in particular ones that were pre-approved at the start of the competition and not added by request, we are rehosting data for the test set years on a mounted data drive. See the [documentation in the runtime repository](https://github.com/drivendataorg/water-supply-forecast-rodeo-runtime). For the rehosted data sources, you will generally not have network access and will be prevented by the firewall from downloading additional data anyway. Processing the raw data into your features still counts toward the 2-hour limit.
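In practice, reading the rehosted data looks something like the sketch below. The mount point and file name are assumptions for illustration; the authoritative layout is in the runtime repository documentation.

```python
# Sketch of reading a rehosted file from the mounted data drive.
# The mount point and file name are assumptions, not the official layout;
# consult the runtime repository documentation for the actual paths.
from pathlib import Path

import pandas as pd

DATA_DIR = Path("/code_execution/data")  # assumed mount point

df = pd.read_csv(DATA_DIR / "example_source" / "observations.csv")
print(df.head())
```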

> Would it be possible to soften the requirements and allow us to calculate the features for the test set on our side at this stage of the competition? At the next stage, everything would be computed on the fly.

Given the short time before the deadline, we will not be making major changes to the Hindcast evaluation requirements.

Can you provide more information about how long you need to download the data your solution requires? If your solution needs additional time, an increase to the 2-hour time limit is potentially an option.

Otherwise, leaving ERA5-Land features out of your model for the Hindcast Stage submission does not preclude you from including them in your Forecast Stage or Overall evaluation cross-validation model submissions later in the challenge.

> I would also like to follow up on the similar earlier thread and ask about daily data (so far, the hourly and monthly data have been approved). Are we allowed to download and use daily data?

Please link to the specific daily ERA5-Land dataset that you are asking to use. We will take a look, but given the short time before the deadline, we may or may not be able to approve its use.

Hi!

@jayqi, thanks for your reply!
We need to think all of this over.
But, just so we get this right: would a total code runtime of 4 hours be possible?

Hi @Vervan, @varyabazilova,

We have increased the time limit for normal submissions to 4 hours. Please see the latest announcement.


I was wondering if I am correct in thinking that the runtime limit (4 hours) means that the data query, download, and predictions for all years and sites must be completed within this limit, i.e. that it is not 4 hours per year or per site.
If so, some datasets (such as CDS data) can easily take an hour or more to download for a single year, even for just a few variables, depending on how busy the CDS is. Is it permissible to pre-download the required data from the CDS and submit it with the solution, or is this not allowed?

@robby I have merged your thread into this recent one, as it is largely asking a duplicate question.

The runtime limit (4 hours) is inclusive of data downloads, data processing, and predictions for all years and all sites in the test set (10 years, 26 sites).
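If download time is the main risk, one defensive pattern (purely a sketch, not an official requirement) is to track elapsed wall-clock time and fall back to already-processed features before the budget runs out:

```python
# Sketch: guard optional downloads against the 4-hour wall-clock budget.
# The year range and helper function are hypothetical placeholders.
import time

BUDGET_SECONDS = 4 * 60 * 60
SAFETY_MARGIN = 20 * 60  # reserve ~20 minutes for inference and writing output
START = time.monotonic()


def time_remaining() -> float:
    """Seconds left in the overall runtime budget."""
    return BUDGET_SECONDS - (time.monotonic() - START)


def download_and_process(year: int) -> None:
    """Hypothetical stand-in for a slow per-year download + feature step."""


for year in range(2014, 2024):  # illustrative; substitute the 10 test years
    if time_remaining() < SAFETY_MARGIN:
        break  # skip remaining downloads rather than exceed the limit
    download_and_process(year)
```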

Note that many data sources are pre-downloaded by DrivenData and included in a mounted data drive. You do not need to (and, because of the firewall, generally will not be able to) download those datasets. However, some data sources, such as the ones from Copernicus CDS, indeed do not have pre-downloaded data.

As discussed in this thread, it is not permissible to pre-download data associated with test years and bundle it with your solution. (See this FAQ entry for additional discussion about distinguishing between train and test data.)