When the competition first started, a larger dataset was available for download. I believe it was 1 terabyte. I notice that it is no longer on the download page. Will that dataset be returning, and if so, is it possible for that dataset to be split into small parts?
Hi @Kingseso! Since the full train set is so large (1.4 terabytes), we do not provide a link to download it from the data download page. You should instead get the videos from the public s3 bucket (drivendata-competition-clog-loss). All train videos are in the train folder.
You can also find the URLs for the train videos in train_metadata.csv, so you can use these to download a particular subset of videos if you wish.
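One possible way to pull a handful of videos using train_metadata.csv is sketched below. It is only a rough sketch: the column layout of the CSV is not described in this thread, so the helper matches on the shape of the URL rather than on a column name, and it assumes the URLs end in .mp4. The function names list_video_urls and download_first_n are made up for illustration, and the download step assumes the URLs are plain http(s) links that wget can fetch.

```shell
# Print every video URL found in a metadata CSV, one per line.
# Assumption: URLs start with http://, https://, or s3:// and end in .mp4;
# matching on the URL shape avoids needing to know the column name.
list_video_urls() {
  grep -oE '(https?|s3)://[^,"[:space:]]+\.mp4' "$1"
}

# Hypothetical helper: download the first N videos listed in the CSV.
# Assumes http(s) URLs; s3:// URLs would need "aws s3 cp" instead of wget.
download_first_n() {
  csv="$1"
  n="$2"
  dest="${3:-train_videos}"
  mkdir -p "$dest"
  list_video_urls "$csv" | head -n "$n" | while read -r url; do
    wget -q -P "$dest" "$url"
  done
}
```

For example, download_first_n train_metadata.csv 20 would try to fetch the first 20 listed videos into train_videos/. Swapping head for a grep on specific filenames would let you pick particular videos instead.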
Finally, the nano and micro subsets are readily available on the data download page. These are a great way to get started prototyping while your full download completes. You can find more details on the problem description page.
I am trying to download selected files from the bucket and running into some problems. I don't use AWS often.
The bucket (unlike buckets I have previously used) requires AWS credentials (is it not completely public?).
I have set up a profile, but I still get “fatal error: An error occurred (403) when calling the HeadObject operation: Forbidden” when I run “aws s3 cp s3://drivendata-competition-clog-loss/train/101309.mp4 t.mp4”.
Hi @Moshel, glad you got it working with the AWS CLI. Your wget command failed because it contains a typo: if you change losso to loss, it should run fine.
@yangsenwxy check out the AWS docs for how to interact with s3 buckets from the command line. Here’s a simple way of downloading the full train set:
# install the aws command line tools
pip install awscli
# create a directory for your videos to live in
mkdir train_videos
# copy everything in the train folder to your new directory
aws s3 cp --recursive s3://drivendata-competition-clog-loss/train/ train_videos/
Note: this downloads videos one at a time and so will take a while to run.
For those looking to split the dataset into smaller downloads, simply use the --include and --exclude flags as shown in the documentation (which Emily linked above). For example:
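A minimal sketch of that filtering, assuming the train filenames start with a digit (as in 101309.mp4, mentioned earlier in the thread), so that a leading-character prefix splits the set into parts. The helper name download_part and the DRY_RUN switch are made up for illustration. The AWS CLI applies the filters in order, so excluding everything and then including one pattern leaves only the matching files.

```shell
# Hypothetical helper: copy only the train videos whose filename starts with
# the given prefix. Set DRY_RUN=1 to print the command instead of running it
# (the printed form drops the quotes around the patterns).
download_part() {
  prefix="$1"
  dest="${2:-train_videos}"
  # Exclude everything first, then re-include files matching the prefix.
  set -- aws s3 cp --recursive \
    --exclude "*" --include "${prefix}*" \
    s3://drivendata-competition-clog-loss/train/ "$dest/"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$@"
  else
    mkdir -p "$dest"
    "$@"
  fi
}
```

Calling download_part 1 train_part_1 and then download_part 2 train_part_2 (and so on) would fetch one slice at a time, so each slice can be run in a separate session or on a separate machine.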