Larger Dataset no longer on Data Download


When the competition first started, a larger dataset was available for download. I believe it was 1 terabyte. I notice that it is no longer on the download page. Will that dataset be returning, and if so, is it possible for it to be split into smaller parts?



Hi @Kingseso! Since the full train set is so large (1.4 terabytes), we do not provide a link to it on the data download page. Instead, you should get the videos from the public S3 bucket (drivendata-competition-clog-loss). All train videos are in the train folder.

You can also find the URLs for the train videos in train_metadata.csv, so you can use these to download a particular set of videos if you wish.
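To grab just a particular subset, the URLs in train_metadata.csv can drive the copy commands directly. Here's a minimal sketch, assuming a two-column filename,url layout (the real header may differ, so check it first); the tiny stand-in file keeps the example self-contained:

```shell
# stand-in for train_metadata.csv so the sketch is self-contained;
# the filename,url column layout here is an assumption -- check the real header
cat > sample_metadata.csv <<'EOF'
filename,url
100000.mp4,s3://drivendata-competition-clog-loss/train/100000.mp4
100001.mp4,s3://drivendata-competition-clog-loss/train/100001.mp4
EOF

# print one aws s3 cp command per row (skip the header with tail -n +2);
# drop the echo to actually run the downloads
tail -n +2 sample_metadata.csv | cut -d, -f2 | while read -r url; do
  echo aws s3 cp --no-sign-request "$url" train_videos/
done
```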

Finally, the nano and micro subsets are readily available on the data download page. These are a great way to get started prototyping while your full download completes. You can find more details on the problem description page.

Happy modeling!

Hi @emily,

I am trying to download selected files from the bucket and am running into a few problems. I don't use AWS often.
The bucket (unlike buckets I previously used) requires AWS credentials (is it not completely public?).
I have set up a profile, but I still get “fatal error: An error occurred (403) when calling the HeadObject operation: Forbidden” when I run “aws s3 cp s3://drivendata-competition-clog-loss/train/101309.mp4 t.mp4”.

Any help will be greatly appreciated.

Also, trying to download using wget does not work.

Ah, sorted it out… I needed to add --no-sign-request to the aws cli command.


Hi @emily,
I still cannot download all the data. Can you share a tutorial on how to download all of it?

Can you share how to download it?

Hi @Moshel – glad you got it working with the AWS CLI. The reason your wget command failed is that you have a typo: if you make it loss and not losso, your command should run fine.

@yangsenwxy check out the AWS docs for how to interact with s3 buckets from the command line. Here’s a simple way of downloading the full train set:

# install the aws command line tools
pip install awscli

# create a directory for your videos to live in
mkdir train_videos

# copy everything in the train folder to your new directory
aws s3 cp --no-sign-request --recursive s3://drivendata-competition-clog-loss/train/ train_videos/

Note: this downloads videos one at a time, so it will take a while to run.
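In case the download gets interrupted partway through, `aws s3 sync` (another standard AWS CLI command) only copies files that are missing locally, so re-running the same command resumes roughly where it left off. The sketch below prints the command rather than executing it, since the full train set is 1.4 TB:

```shell
# resumable alternative: s3 sync skips files already present in train_videos/,
# so re-running the same command after an interruption picks up where it stopped.
# printed as a dry run here; remove the echo to actually start the transfer
echo 'aws s3 sync --no-sign-request s3://drivendata-competition-clog-loss/train/ train_videos/'
```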


Is it somehow possible to download a split of the larger dataset rather than the full set (1.4 TB)?

For those looking to split the dataset into smaller downloads, simply use the --include and --exclude flags as shown in the documentation (emily linked above). For example:

aws s3 cp --no-sign-request --recursive --exclude "*" --include "1*" s3://drivendata-competition-clog-loss/train/ train_set_1/

will download videos 100,000 to 199,999 (every filename starting with 1), and

aws s3 cp --no-sign-request --recursive --exclude "*" --include "2*" s3://drivendata-competition-clog-loss/train/ train_set_2/

will download videos 200,000 to 299,999

Of course, run these after installing the command line tools client:

pip install awscli
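To fetch the whole train set in chunks like the ones above, the per-prefix commands can be generated in a loop. This is a sketch, assuming video filenames all start with a digit 1–9 (verify against train_metadata.csv first); it prints the commands rather than running them so you can inspect them before kicking off any downloads:

```shell
# sketch: one chunked download command per leading digit (assumes all video
# IDs start with a digit 1-9 -- verify against train_metadata.csv first).
# commands are printed, not executed, so they can be reviewed before running
for d in 1 2 3 4 5 6 7 8 9; do
  echo aws s3 cp --no-sign-request --recursive \
    --exclude '*' --include "${d}*" \
    s3://drivendata-competition-clog-loss/train/ "train_set_${d}/"
done
```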


Hi, I’m searching for train_metadata.csv but couldn’t find it anywhere. Where is it?