Hi, it does not appear that we can use multiprocessing in our final submissions; perhaps you could shed some light on this. For example, the Pandemic Centralized Code Submission page states:

Your submission should be a zip archive named with the extension .zip (e.g., …). The root level of the archive must contain a module that contains the following named functions:

  • fit: A function that fits your model on the training data and writes your model to disk
  • predict: A function that loads your model and performs inference on the test data

When you make a submission, this will kick off a containerized evaluation job. This job will run a Python script which will import your fit function and call it with the appropriate training data access and filesystem access. Then, the job will run a Python main_centralized_test.py script which will import your predict function and call it with the appropriate test data access and filesystem access.

Since our script is imported, I don't believe we can initiate multiprocessing, as the parent process needs to be called from the command line (initiated from an `if __name__ == "__main__":` block). In your baseline solution you call a shell script (e.g., …) which executes several scripts from the command line and also makes use of storage to store processed training and test data. It appears we will not have access to executing scripts directly from a command line, or to storage (other than for the model)? Am I interpreting this correctly? The use of multiprocessing and storage can greatly speed up both the fit and predict functions.

Hi @jimking100,

You are correct that there are some differences between the submission specifications (you must provide a fit function in Python) and the entrypoint command for the UVA-BII-provided centralized baseline (a shell script). However, the submission specifications should not prevent you from using multiprocessing, or even calling shell scripts, in general.

You should be able to use either the multiprocessing or concurrent.futures standard-library modules for multiprocessing or multithreading within Python. Additionally, whatever multiprocessing support is built into third-party libraries should also work.
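For instance, a fit function that is merely imported (never run at import time) can still fan work out to a process pool; a minimal sketch, with `_square` and `fit_parallel` as illustrative placeholders:

```python
import concurrent.futures

def _square(x):
    # Worker function; defined at module top level so child processes
    # can import it.
    return x * x

def fit_parallel(values):
    # Because the harness calls this function after importing the
    # module (rather than executing work at import time), spawning a
    # pool here does not require the caller to be a __main__ script.
    with concurrent.futures.ProcessPoolExecutor(max_workers=2) as pool:
        return list(pool.map(_square, values))
```

The key constraint is only that no process-spawning code runs at module import time, which an imported fit/predict function satisfies naturally.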

If needed, you can also run shell commands using the subprocess standard library.
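A minimal sketch of invoking a command via subprocess (the command here is just `echo`, for illustration):

```python
import subprocess

# Run a command from within fit/predict. check=True raises on a
# non-zero exit code; capture_output/text collect stdout as a string.
result = subprocess.run(
    ["echo", "preprocessing done"],
    check=True, capture_output=True, text=True,
)
print(result.stdout.strip())  # -> preprocessing done
```

The same pattern would apply to running a shell script shipped inside your submission archive.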

With respect to storage, you are provided a directory within which you can stash whatever you want. At a minimum, you will need to write out your trained model so that the predict function can read it in later for test-time inference.

I hope that helps.

Yes, this is very helpful. So we are free to use the model directory to store data other than our model? Are there any limits on the amount of disk space?

There are currently no disk-space limits that we expect teams to reach under reasonable usage.