Track B: caching preprocessed data

kzliu · January 25, 2023, 3:58am

Are solutions allowed to cache preprocessed data (features) to disk? If so, is there any limit on the disk space?

Specifically, from "Track B: quick clarification regarding train/test folders in the runtime repo" (#2 by jayqi), we know that the train/test data CSV files (person, household, …) are the same for both the centralized and federated cases (for each client). So in the centralized case, the data preprocessing done for fit could be shared with predict, and in the federated case, the preprocessed data could be shared across the many FL rounds as well as between fit and evaluate.

Thanks!

jayqi · January 25, 2023, 5:43am

Hi @kzliu,

Yes, within one evaluation ~~job~~ scenario, you can use the client_dir and server_dir directories to store processed data in the same way you could store model state.

Regarding disk space:

The client and server state directories and the captured communication will be written to a mounted cloud storage container, which does not currently have explicit limits set.
The VM running the runtime Docker container has 340 GiB of disk space in total.
If you write large volumes of data, there may be unexpected errors or crashes in the tech stack due to limitations that we have not explicitly set. Such errors may not produce useful logs, and the DrivenData team will be limited in how much help we can provide.
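Given the caveats above, one cautious option is to check free space before writing a large cache. The following is only a minimal sketch: the helper name `has_room_for` and the `safety_factor` parameter are hypothetical, and `shutil.disk_usage` may not report meaningful numbers for a mounted cloud storage container, so treat the result as a hint rather than a guarantee.

```python
import shutil
from pathlib import Path


def has_room_for(path: Path, needed_bytes: int, safety_factor: float = 2.0) -> bool:
    """Return True if the filesystem holding `path` reports enough free space.

    `safety_factor` is an illustrative margin (an assumption, not a
    competition rule) so the cache does not fill the disk completely.
    """
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= needed_bytes * safety_factor
```

If the check fails, a solution could fall back to recomputing the features in memory instead of caching them.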

jayqi · January 25, 2023, 3:47pm

Actually, to correct my previous response slightly: it should be within one evaluation scenario, rather than the job. The three federated evaluation scenarios do not share client_dir and server_dir across scenarios. You can reuse cached data within a scenario but you will need to do things independently across scenarios.
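As a rough illustration of caching within a scenario, here is a minimal sketch. It assumes the state directory (client_dir or server_dir) is available as a `pathlib.Path`; the helper `load_or_build_features`, the cache filename, and the `build` callback are hypothetical names, not part of the runtime API.

```python
import pickle
from pathlib import Path
from typing import Any, Callable

# Hypothetical cache filename; any name inside the state directory would do.
CACHE_FILENAME = "preprocessed_features.pkl"


def load_or_build_features(state_dir: Path, build: Callable[[], Any]) -> Any:
    """Load cached features from `state_dir`, or build and cache them.

    `state_dir` stands in for client_dir or server_dir. Because those
    directories are not shared across scenarios, each scenario builds
    its own cache the first time this is called.
    """
    cache_path = Path(state_dir) / CACHE_FILENAME
    if cache_path.exists():
        with cache_path.open("rb") as f:
            return pickle.load(f)

    features = build()

    # Write to a temporary file first, then rename, so that an interrupted
    # write does not leave a truncated cache file under the final name.
    tmp_path = cache_path.with_name(cache_path.name + ".tmp")
    with tmp_path.open("wb") as f:
        pickle.dump(features, f)
    tmp_path.replace(cache_path)
    return features
```

Calling this at the start of fit and again in later rounds or in predict would run the preprocessing once per state directory and read the cached result afterwards.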

@kzliu

kzliu · January 25, 2023, 4:48pm

Thanks for the clarification @jayqi! A few follow-up questions:

Regarding job vs scenario: Can the client_dir string serve as a unique identifier? (e.g. scenario01-client01 will have a different client_dir than scenario02-client01.) Or, if not (e.g. they may have the same client_dir), is the disk space cleared between scenarios?
Will the amount of disk reads/writes for this caching be an evaluation criterion for efficiency or scalability?

Thanks!

jayqi · January 25, 2023, 5:54pm

Hi @kzliu,

Yes, the client_dir path will be different across scenarios.
Judges may look at that information, but it is not one of the primary ways that solutions will be evaluated on efficiency or scalability. You can see details about the evaluation criteria on the Problem Description page, and in particular look at the primary metrics that will be reported in the “Computational Metrics” section.
