Hi,
I hope everything is going well with you. When I try to download the data from the command line, the download gets interrupted. I am passing the following link:
s3://drivendata-competition-biomassters-public-us/train_features/
But it asks for an “access key ID” and a “secret access key.”
I did not find these parameters in the .txt file. I would appreciate your help with this.
Hi Hazhir - check out the “AWS CLI” section of the download instructions:
AWS CLI
The easiest way to download data from AWS is using the AWS CLI:
https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-welcome.html
To download an individual data file to your local machine, the general structure is
aws s3 cp <S3 URI> <local path> --no-sign-request
For example:
aws s3 cp s3://drivendata-competition-biomassters-public-us/train_features/001b0634_S1_00.tif ./ --no-sign-request
The above downloads the file 001b0634_S1_00.tif from the public bucket in the US region. Adding “--no-sign-request” allows data to be downloaded without configuring an AWS profile.
To download a directory rather than a file, use the --recursive flag. For example, to download all of the training data:
aws s3 cp s3://drivendata-competition-biomassters-public-us/train_features/ train_features/ --no-sign-request --recursive
Hi,
Thank you so much for your reply. I did it the same way. But the download was interrupted after about 30 minutes. When I try to download the data again using the following command:
aws s3 cp s3://drivendata-competition-biomassters-public-us/train_features/ train_features/ --no-sign-request --recursive
it starts again from the first image. I am not able to resume downloading from a specific image ID to the end. Moreover, there are a lot of images, so it is not practical to download them one by one.
Ahhh, I understand! That worried me too, that it might error out during the download.
Could you perhaps write a script (like the Python one below) to loop over the files and download them individually? I just checked and this works nicely. If you ran a few scripts in parallel, it wouldn’t be too slow (there’s a rough parallel sketch after the snippet). Sure, it’ll take a while, but that’s usually one of the pain points of working with good-sized data.
import os
import pandas as pd

# List of expected files from the competition's features metadata
metadata = pd.read_csv("features_metadata.csv")

for i, row in metadata.iterrows():
    # Skip files that already exist, so the script can resume after an interruption
    if not os.path.exists(row.filename):
        cmd = f"aws s3 cp s3://drivendata-competition-biomassters-public-us/train_features/{row.filename} ./ --no-sign-request"
        os.system(cmd)
    # Only here to test the loop on a few files; remove it to download everything
    if i > 5:
        break
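And if you would rather parallelize inside one script instead of launching several copies, something along these lines might work (just a sketch, assuming the same features_metadata.csv / filename column as above, shelling out to the same aws command):

import os
import subprocess
from concurrent.futures import ThreadPoolExecutor

import pandas as pd

BUCKET = "s3://drivendata-competition-biomassters-public-us/train_features"

def download(filename):
    # Skip files that already exist so a re-run resumes where it left off
    if not os.path.exists(filename):
        subprocess.run(
            ["aws", "s3", "cp", f"{BUCKET}/{filename}", "./", "--no-sign-request"],
            check=True,
        )

metadata = pd.read_csv("features_metadata.csv")

# The work is network-bound, so a handful of threads is usually plenty
with ThreadPoolExecutor(max_workers=8) as pool:
    # list() forces the iterator so any download errors surface here
    list(pool.map(download, metadata.filename))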
Thank you very much. I will do it the way you recommended.
Hi @hazhir_bahrami,
I’m sorry you’re having trouble downloading the data! If you are located outside of the US, you might find download speeds improve if you use another bucket. You might want to see if using the EU (s3://drivendata-competition-biomassters-public-eu) or Asia (s3://drivendata-competition-biomassters-public-as) bucket makes the process faster.
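Assuming those buckets mirror the same layout as the US one, the recursive command from above would just swap in the new bucket name, e.g.:
aws s3 cp s3://drivendata-competition-biomassters-public-eu/train_features/ train_features/ --no-sign-request --recursive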
Good luck, and let us know if you have any other questions.
Hi,
Thank you so much for the help.
I started downloading the data using Python, and the problem is solved.
Best wishes.