Phase 2 data conversion doesn't seem to be working
mmf_convert_hm --zip_file=XjiOc5ycDBRRNwbhRlgH.zip --password=XXXX --bypass_checksum 1
Data folder is /home/vvm/.cache/torch/mmf/data
Zip path is XjiOc5ycDBRRNwbhRlgH.zip
Copying XjiOc5ycDBRRNwbhRlgH.zip
Unzipping XjiOc5ycDBRRNwbhRlgH.zip
Extracting the zip can take time. Sit back and relax.
replace /home/vvm/.cache/torch/mmf/data/datasets/hateful_memes/defaults/images/data/README.md? [y]es, [n]o, [A]ll, [N]one, [r]ename: A
Traceback (most recent call last):
…
mmf/mmf_cli/hm_convert.py", line 34, in assert_files
), f"{file} doesn’t exist in {folder}"
AssertionError: dev.jsonl doesn’t exist in /XXX/YYY/.cache/torch/mmf/data/datasets/hateful_memes/defaults/images
It looks to me like mmf is expecting a file naming scheme that no longer exists (we updated the file names for phase 2). For example, dev.jsonl doesn't exist anymore; it's now dev_seen.jsonl and dev_unseen.jsonl. If that's the case, it probably requires a change to the mmf codebase (cc @douwekiela)
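For reference, a minimal sketch of what a phase-2-aware check could look like. The file list and the standalone-function structure here are my assumptions, not the actual implementation of assert_files in mmf_cli/hm_convert.py:

```python
import os

# Assumed phase 2 file list -- the real expected names in
# mmf_cli/hm_convert.py may differ.
PHASE_TWO_FILES = [
    "img",
    "train.jsonl",
    "dev_seen.jsonl",
    "dev_unseen.jsonl",
    "test_seen.jsonl",
    "test_unseen.jsonl",
]


def assert_files(folder, expected_files=PHASE_TWO_FILES):
    """Fail early with a clear message if any expected split is missing."""
    for file in expected_files:
        full_path = os.path.join(folder, file)
        assert os.path.exists(full_path), f"{file} doesn't exist in {folder}"
```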
Thanks! In the meantime, is there a way to point to a custom location for the data directory, other than ~/.cache/torch/mmf/data/datasets/hateful_memes, when running the scripts? Is there an example of that?
du -sh ~/.cache/torch/mmf/data/datasets/hateful_memes
37G /home/xxx/.cache/torch/mmf/data/datasets/hateful_memes
It's pretty big.
Yes, while extracting, pass the --mmf_data_folder=<your_dir> option to the mmf_convert_hm command. Then, when running the commands, set the MMF_DATA_DIR=<your_dir> environment variable to tell MMF where your data directory is.
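For example, something along these lines; the --mmf_data_folder flag and MMF_DATA_DIR variable are as described above, while the zip name, directory, config path, model, and run_type are placeholders you should replace with whatever you are actually running:

```bash
# Extract into a custom data folder instead of ~/.cache/torch/mmf/data
mmf_convert_hm --zip_file=<downloaded_zip>.zip --password=XXXX \
    --mmf_data_folder=/path/to/custom_dir

# Point MMF at that location when running a model
MMF_DATA_DIR=/path/to/custom_dir mmf_run \
    config=projects/hateful_memes/configs/mmbt/defaults.yaml \
    model=mmbt dataset=hateful_memes run_type=train_val
```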
For MMF-specific questions, please open an issue on MMF for a faster response in the future.
There is one other issue.
The runner now always loads the dataset:
mmf.trainers.mmf_trainer: Loading datasets
whereas before it was looking it up from the cache correctly. So it now downloads and extracts the 10GB file every time.