Developing Own Submission

Hi,
I’m trying to test the development of my own submission using the steps outlined in the runtime repo. I’m struggling with a few items and perhaps you could clarify them for me:

  1. It seems there needs to be a directory named submission in order for it to create the submission.zip file. ‘SUBMISSION_TRACK=fincrime make pack-submission’ only worked once I manually added the submission directory.

  2. I’ve edited my .yml file, but
    ‘SUBMISSION_TRACK=fincrime SUBMISSION_TYPE=centralized make test-submission’
    does not seem to use my local .yml file, so none of my packages are being loaded into the environment. Do I need to wait until the Pull Request is merged? Can I test this locally?

  3. I’m unclear about the pred_format.csv file. I don’t see it in the repo. Are we supposed to add it, and if so, where?

Thanks!

Hi @jimking100,

  1. That is correct. Thanks for reporting—we will fix the make recipes to create that directory if it doesn’t already exist.
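
    For now, the workaround is to create it yourself before packing, e.g. from the repository root:

    ❯ mkdir -p submission
    ❯ SUBMISSION_TRACK=fincrime make pack-submission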

  2. The dependencies are installed from environment-*.yml at image build time, and not at container run time. If you would like to test your submission locally with local dependency changes, you can build the image yourself and then use the local image.

    • The make build command will build an image from your local copies of environment-*.yml. You can optionally specify CPU_OR_GPU=cpu or CPU_OR_GPU=gpu explicitly.
    • By default, make test-submission will use a local image if one exists. You can explicitly override which image to use (if you also have the official image pulled) with the SUBMISSION_IMAGE variable; see the example commands below.
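
  For example, a typical local test with modified dependencies could look like the following (the image tag in the last command is a placeholder for whatever your local build is called):

  ❯ CPU_OR_GPU=cpu make build
  ❯ SUBMISSION_TRACK=fincrime SUBMISSION_TYPE=centralized make test-submission

  or, pinning a specific image explicitly:

  ❯ SUBMISSION_IMAGE=<your-local-image> SUBMISSION_TRACK=fincrime SUBMISSION_TYPE=centralized make test-submission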
  3. We don’t currently have the predictions format files available for download, but we will add them. I will update here when those are available. In the meantime, you can see the documented specifications in the challenge documentation and create your own.


Hi,

Thanks! Can you tell me where the predictions format files will be located in the repo? Also, how will the current Centralized Submission Testing work with the .yml files, since the Pull Requests have yet to be merged?

Jim


The predictions format files are always named predictions_format.csv and will be located in the test/ directory within the centralized case, or within each test/{partition_name}/ directory for the federated case.

Here is a file tree that should be instructive for how to set things up locally in the runtime repository:

❯ tree data/ -F
data/
├── fincrime/
│   ├── centralized/
│   │   ├── test/
│   │   │   ├── bank_dataset.csv
│   │   │   ├── data.json
│   │   │   ├── predictions_format.csv
│   │   │   └── swift_transaction_test_dataset.csv
│   │   └── train/
│   │       ├── bank_dataset.csv
│   │       ├── data.json
│   │       └── swift_transaction_train_dataset.csv
│   ├── scenario01/
│   │   ├── test/
│   │   │   ├── bank01/
│   │   │   │   └── bank_dataset.csv
│   │   │   ├── bank02/
│   │   │   │   └── bank_dataset.csv
│   │   │   ├── partitions.json
│   │   │   └── swift/
│   │   │       ├── predictions_format.csv
│   │   │       └── swift_transaction_test_dataset.csv
│   │   └── train/
│   │       ├── bank01/
│   │       │   └── bank_dataset.csv
│   │       ├── bank02/
│   │       │   └── bank_dataset.csv
│   │       ├── partitions.json
│   │       └── swift/
│   │           └── swift_transaction_train_dataset.csv
│   └── scenarios.txt
└── pandemic/
    ├── centralized/
    │   ├── test/
    │   │   ├── data.json
    │   │   ├── predictions_format.csv
    │   │   ├── va_activity_location_assignment.csv.gz
    │   │   ├── va_activity_locations.csv.gz
    │   │   ├── va_disease_outcome_training.csv.gz
    │   │   ├── va_household.csv.gz
    │   │   ├── va_person.csv.gz
    │   │   ├── va_population_network.csv.gz
    │   │   └── va_residence_locations.csv.gz
    │   └── train/
    │       ├── data.json
    │       ├── predictions_format.csv
    │       ├── va_activity_location_assignment.csv.gz
    │       ├── va_activity_locations.csv.gz
    │       ├── va_disease_outcome_training.csv.gz
    │       ├── va_household.csv.gz
    │       ├── va_person.csv.gz
    │       ├── va_population_network.csv.gz
    │       └── va_residence_locations.csv.gz
    ├── scenario01/
    │   ├── test/
    │   │   ├── client01/
    │   │   │   ├── predictions_format.csv
    │   │   │   ├── va_activity_location_assignment.csv.gz
    │   │   │   ├── va_activity_locations.csv.gz
    │   │   │   ├── va_disease_outcome_training.csv.gz
    │   │   │   ├── va_household.csv.gz
    │   │   │   ├── va_person.csv.gz
    │   │   │   ├── va_population_network.csv.gz
    │   │   │   └── va_residence_locations.csv.gz
    │   │   ├── client02/
    │   │   │   ├── predictions_format.csv
    │   │   │   ├── va_activity_location_assignment.csv.gz
    │   │   │   ├── va_activity_locations.csv.gz
    │   │   │   ├── va_disease_outcome_training.csv.gz
    │   │   │   ├── va_household.csv.gz
    │   │   │   ├── va_person.csv.gz
    │   │   │   ├── va_population_network.csv.gz
    │   │   │   └── va_residence_locations.csv.gz
    │   │   ├── client03/
    │   │   │   ├── predictions_format.csv
    │   │   │   ├── va_activity_location_assignment.csv.gz
    │   │   │   ├── va_activity_locations.csv.gz
    │   │   │   ├── va_disease_outcome_training.csv.gz
    │   │   │   ├── va_household.csv.gz
    │   │   │   ├── va_person.csv.gz
    │   │   │   ├── va_population_network.csv.gz
    │   │   │   └── va_residence_locations.csv.gz
    │   │   └── partitions.json
    │   └── train/
    │       ├── client01/
    │       │   ├── predictions_format.csv
    │       │   ├── va_activity_location_assignment.csv.gz
    │       │   ├── va_activity_locations.csv.gz
    │       │   ├── va_disease_outcome_training.csv.gz
    │       │   ├── va_household.csv.gz
    │       │   ├── va_person.csv.gz
    │       │   ├── va_population_network.csv.gz
    │       │   └── va_residence_locations.csv.gz
    │       ├── client02/
    │       │   ├── predictions_format.csv
    │       │   ├── va_activity_location_assignment.csv.gz
    │       │   ├── va_activity_locations.csv.gz
    │       │   ├── va_disease_outcome_training.csv.gz
    │       │   ├── va_household.csv.gz
    │       │   ├── va_person.csv.gz
    │       │   ├── va_population_network.csv.gz
    │       │   └── va_residence_locations.csv.gz
    │       ├── client03/
    │       │   ├── predictions_format.csv
    │       │   ├── va_activity_location_assignment.csv.gz
    │       │   ├── va_activity_locations.csv.gz
    │       │   ├── va_disease_outcome_training.csv.gz
    │       │   ├── va_household.csv.gz
    │       │   ├── va_person.csv.gz
    │       │   ├── va_population_network.csv.gz
    │       │   └── va_residence_locations.csv.gz
    │       └── partitions.json
    └── scenarios.txt

Submissions made through the provided infrastructure will always use the latest official image, so you won’t be able to run those with any dependencies that are still in open pull requests. Thanks for your patience—we’ll try to get your pull request reviewed soon.

Hi,
I have created my local image; it has my local dependencies and it begins to run my code. However, after reading in the swift and bank data (centralized version), it fails with a cryptic ‘/tmp/tmpt3ppoeij: line 3: 99 Killed python main_centralized_train.py’. Basically, any attempt to manipulate the dataframes after the data is loaded causes a failure. The only explanation I can come up with is that it’s a RAM-related issue, since the code runs fine outside the image. Does the image significantly reduce the available RAM, and does this error message provide any insight from your end? I’m running a Mac, M2, 24GB, no GPU.

Hi @jimking100,

It looks like the docker run command was written to provide 8 GB of memory:
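
For reference, the relevant fragment of the docker run recipe in the Makefile looks roughly like this (a sketch; see the Makefile for the exact recipe):

docker run \
    ...
    --shm-size 8g \
    ...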

You can change that to a different number as appropriate for your local test.

EDIT: Fixed repo link.


Thanks for this file tree. Is there a reason the file data/fincrime/centralized/test/data.json doesn’t contain a line for the file predictions_format.csv, since that is a required entry?


Hi @markblunk,

If you take a look at how the evaluation runner is implemented, the predictions format file is separately specified as an input argument when calling provided client factories.

I was asking about the centralized solution, not the federated one.

I got to step 6 in GitHub - drivendataorg/pets-prize-challenge-runtime: Evaluation runtime for Phase 2 of the PETs Prize Challenge and was attempting to test my centralized solution by running SUBMISSION_TRACK=fincrime SUBMISSION_TYPE=centralized make test-submission. This command fails if the file data/fincrime/centralized/test/predictions_format.csv is missing (as you mention above in Developing Own Submission - #4 by jayqi). I didn’t realize this file was required from the instructions in the repository README, because those instructions refer you to the various data.json files, which do not mention predictions_format.csv. Hence my original question:
" Is there a reason the file data/fincrime/centralized/test/data.json doesn’t contain a line for the file predictions_format.csv, since that is a required entry?"

Thanks!

Hi,
The link does not seem to work.

Hi @markblunk,

Apologies for linking to the wrong thing. As you can see, the centralized evaluation code is implemented in exactly the same way:

Thank you for pointing out that the README is wrong in omitting the predictions_format.csv files. We will get that fixed.

@jimking100 Fixed in the original post. Sorry about that!

Hi,
I’ve changed the shm-size to 24g in two places in the Makefile, deleted the local image using Docker Desktop, rebuilt the image, and re-ran the code, but I get the same (or a similar) error. When I rebuild the image, it doesn’t seem to build from scratch but from cache; still, my logs show 24g for the shm-size instead of 8g. Are there settings I need for my local Docker setup? I just installed Docker with the default installation on a Mac.

Aha! I answered my own question: Docker Desktop defaults to 2 GB of memory at installation on a Mac. I upped it to 24 GB and it works now. So you need to both change the Makefile and check the memory settings in Docker Desktop.
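
For anyone else hitting this: you can check how much memory Docker actually has available with

❯ docker info --format '{{.MemTotal}}'

which prints the total memory (in bytes) that Docker can provide to containers.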

Thanks @jayqi
