Larger Dataset no longer on Data Download

Kingseso · June 1, 2020, 2:11pm

Hello,

When the competition first started, a larger dataset was available for download. I believe it was 1 terabyte. I notice that it is no longer on the download page. Will that dataset be returning, and if so, is it possible for that dataset to be split into small parts?

Regards,

Cecil

emily · June 1, 2020, 5:10pm

Hi @Kingseso! Since the full train set is so large (1.4 terabytes), we do not provide a link to download it from the data download page. You should instead get the videos from the public s3 bucket (drivendata-competition-clog-loss). All train videos are in the train folder.

You can also find the URLs for the train videos in train_metadata.csv, so you can use these to download a particular set of videos if you wish.
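As a rough sketch of that approach, the helper below pulls one named column out of a CSV so the values can be fed to a downloader. The thread does not show the header of train_metadata.csv, so the column name `url` in the usage comment is an assumption, and `csv_column` is a made-up helper name. It also assumes a simple CSV with no quoted fields containing commas.

```shell
# Print the values of one named column from a CSV file.
# Assumes a header row and no quoted fields that contain commas.
csv_column() {
  awk -F',' -v col="$2" '
    { sub(/\r$/, "") }   # tolerate Windows line endings
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == col) idx = i; next }
    idx { print $idx }
  ' "$1"
}

# Hypothetical usage: "url" is an assumed column name, so check the header first.
# head -n 1 train_metadata.csv
# csv_column train_metadata.csv url | head -n 100 > first_100_urls.txt
```

The resulting list can then be passed to whichever downloader matches the URL scheme used in the file.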

Finally, the nano and micro subsets are readily available on the data download page. These are a great way to get started prototyping while your full download completes. You can find more details on the problem description page.

Happy modeling!

Moshel · June 2, 2020, 11:34pm

hi @emily,

I am trying to download selected files from the bucket and am running into some problems. I don't use aws often.
The bucket (unlike buckets I have previously used) requires aws credentials (it's not completely public?).
I have set a profile, but I still get “fatal error: An error occurred (403) when calling the HeadObject operation: Forbidden” when I run “aws s3 cp s3://drivendata-competition-clog-loss/train/101309.mp4 t.mp4”

any help will be greatly appreciated.

Moshel · June 2, 2020, 11:38pm

also, trying to download using wget does not work.
[Screenshot of the wget attempt: Screenshot_2020-06-03_11-38-05]

Moshel · June 3, 2020, 2:36am

ah sorted it out… needed to add --no-sign-request to the aws cli
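For anyone hitting the same 403, the fix amounts to something like the sketch below. The `--no-sign-request` flag tells the AWS CLI not to sign the request, so no credentials are loaded. The bucket and file names come from this thread; `fetch_train_video` is only an illustrative helper name.

```shell
# Download one train video without signing the request.
# fetch_train_video is a hypothetical helper name, not part of the AWS CLI.
# The second argument (local filename) defaults to the remote filename.
fetch_train_video() {
  aws s3 cp --no-sign-request \
    "s3://drivendata-competition-clog-loss/train/$1" "${2:-$1}"
}

# Usage (needs network access and the AWS CLI installed):
# fetch_train_video 101309.mp4 t.mp4
```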

yangsenwxy · June 3, 2020, 2:25pm

Hi @emily,
I still can't download all the data. Can you share a tutorial on how to download all of it?

yangsenwxy · June 3, 2020, 3:33pm

@Moshel
Can you share how you downloaded it?

emily · June 3, 2020, 4:10pm

Hi @Moshel, glad you got it working with the AWS CLI. Your wget command failed because of a typo in the URL: it has losso where it should have loss. With that fixed, the command should run fine.

@yangsenwxy check out the AWS docs for how to interact with s3 buckets from the command line. Here’s a simple way of downloading the full train set:

# install the AWS command line tools
pip install awscli

# create a directory for your videos to live in
mkdir train_videos

# copy everything in the train folder to your new directory
# (--no-sign-request skips credential signing; see the 403 error earlier in the thread)
aws s3 cp --no-sign-request --recursive s3://drivendata-competition-clog-loss/train/ train_videos/

Note: this downloads videos one at a time and so will take a while to run.
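Since a download of this size may well get interrupted, one alternative not mentioned in the thread is `aws s3 sync`, which only copies files that are new or updated relative to the local directory, so re-running it skips videos that already arrived. This is a sketch under that assumption about your workflow; `sync_train_videos` is a hypothetical helper name.

```shell
# Resume-friendly copy of the train folder: sync only copies files that are
# new or updated relative to the local directory.
# sync_train_videos is a hypothetical helper name.
sync_train_videos() {
  dest="${1:-train_videos}"
  mkdir -p "$dest"
  aws s3 sync --no-sign-request \
    s3://drivendata-competition-clog-loss/train/ "$dest/"
}

# Usage (needs network access and the AWS CLI installed):
# sync_train_videos train_videos
```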

aia39 · June 9, 2020, 7:25am

Is it somehow possible to download a split of the larger dataset rather than the full set (1.4 TB)?

leeschmalz · July 18, 2020, 8:14pm

For those looking to split up the dataset into smaller downloads, simply use the --include and --exclude flags as shown in the documentation (emily linked above). For example:

aws s3 cp --no-sign-request --recursive --exclude "*" --include "1*" s3://drivendata-competition-clog-loss/train/ train_set_1/

will download every video whose filename starts with 1 (for six-digit IDs such as 101309.mp4, that is videos 100,000 to 199,999)

aws s3 cp --no-sign-request --recursive --exclude "*" --include "2*" s3://drivendata-competition-clog-loss/train/ train_set_2/

will download every video whose filename starts with 2 (for six-digit IDs, videos 200,000 to 299,999)

Of course, run this after installing the AWS command line tools:

pip install awscli
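To avoid retyping the command for each leading digit, the pattern can be wrapped in a small function, as sketched here. `download_train_split` is a made-up name; the flags and bucket path are the ones used in this thread.

```shell
# Download every train video whose filename starts with the given prefix
# into its own directory (train_set_<prefix>/).
# download_train_split is a hypothetical helper name.
download_train_split() {
  prefix="$1"
  aws s3 cp --no-sign-request --recursive \
    --exclude "*" --include "${prefix}*" \
    s3://drivendata-competition-clog-loss/train/ "train_set_${prefix}/"
}

# Usage (needs network access): fetch the splits one at a time.
# for p in 1 2 3; do download_train_split "$p"; done
```

The order of the filters matters: later filters take precedence, so `--exclude "*"` first drops everything and `--include` then adds back only the matching files.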

Bihy · February 16, 2021, 10:19am

Hi, I’m looking for train_metadata.csv but can’t find it anywhere.
