Track B: caching preprocessed data

Hi @jayqi,

Are solutions allowed to cache preprocessed data (features) to disk? If so, is there any limit on the disk space?

Specifically, from "Track B: quick clarification regarding train/test folders in the runtime repo" (#2 by jayqi), we know that the train/test data CSV files (person, household, …) are the same for both the centralized and federated settings (for each client). So in the centralized case, the preprocessing done for fit can be shared with predict, and in the federated case, the preprocessed data can be shared across FL rounds as well as between fit and evaluate.

Thanks!

Hi @kzliu,

Yes, within one evaluation job scenario, you can use the client_dir and server_dir directories to store processed data in the same way you could store model state.

Regarding disk space:

  • The client and server state directories and the captured communication will be written to a mounted cloud storage container, which does not currently have explicit limits set.
  • The VM running the runtime Docker container has 340 GiB of disk space in total.
  • There may be unexpected errors or crashes in the tech stack due to limitations that are not explicitly set by us if writing large volumes of data. Such errors may not produce useful logs, and the DrivenData team will be limited in how much help we may be able to provide.
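Since those limits are not explicit, one cautious option is to check free space before writing a large cache. The following is a minimal sketch, not part of the challenge runtime: `has_free_space` and its parameters are hypothetical names, and `shutil.disk_usage` may not report meaningful numbers for a mounted cloud storage container.

```python
import shutil

GIB = 1024 ** 3


def has_free_space(path, required_bytes, safety_margin_bytes=GIB):
    """Return True if the filesystem holding `path` reports enough free space.

    `required_bytes` is the caller's estimate of the cache size; the safety
    margin is an arbitrary illustrative default, not a documented limit.
    """
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return usage.free >= required_bytes + safety_margin_bytes
```

A solution could call this with the state directory and an estimated cache size, and fall back to recomputing features when it returns False.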

Actually, to correct my previous response slightly: it should be within one evaluation scenario, rather than the job. The three federated evaluation scenarios do not share client_dir and server_dir across scenarios. You can reuse cached data within a scenario but you will need to do things independently across scenarios.
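As an illustration of reusing cached data within a scenario, here is a minimal sketch. It assumes `client_dir` is a path to a writable directory; `load_or_compute_features`, `compute_fn`, and the `features.pkl` file name are hypothetical and not part of the runtime API.

```python
import pickle
from pathlib import Path


def load_or_compute_features(client_dir, compute_fn, cache_name="features.pkl"):
    """Return features cached under client_dir, computing and saving them on a miss.

    compute_fn is a hypothetical zero-argument callable that runs the
    preprocessing and returns a picklable object.
    """
    cache_path = Path(client_dir) / cache_name
    if cache_path.exists():
        # Cache hit: a previous fit/evaluate call or FL round already did the work.
        with cache_path.open("rb") as f:
            return pickle.load(f)

    features = compute_fn()
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temporary file and then rename it, so an interrupted write
    # is less likely to leave a truncated cache file behind.
    tmp_path = cache_path.with_name(cache_path.name + ".tmp")
    with tmp_path.open("wb") as f:
        pickle.dump(features, f)
    tmp_path.replace(cache_path)
    return features
```

Because the cache lives under `client_dir`, and that directory is not shared across scenarios, each scenario would compute its features once and reuse them afterwards.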

@kzliu

Thanks for the clarification @jayqi! A few follow-up questions:

  • Regarding job vs scenario: Can the client_dir string serve as a unique identifier? (e.g. scenario01-client01 will have a different client_dir than scenario02-client01.) Or, if not (e.g. they may have the same client_dir), is the disk space cleared between scenarios?

  • Will the amount of disk reads/writes for this caching be an evaluation criterion for efficiency or scalability?

Thanks!

Hi @kzliu,

  1. Yes, the client_dir path will be different across scenarios.
  2. Judges may look at that information, but it is not one of the primary ways that solutions will be evaluated on efficiency or scalability. You can see details about the evaluation criteria on the Problem Description page, and in particular look at the primary metrics that will be reported in the “Computational Metrics” section.