How to access the data

Hi,
I hope everything is going well with you. When I try to download the data from the command line, the download gets interrupted. When I try to download the data using the following link:
s3://drivendata-competition-biomassters-public-us/train_features/
it asks for an “access key ID” and a “secret access key.”
I could not find these parameters in the .txt file. I would appreciate your help with this.


Hi Hazhir - check out the “AWS CLI” section of the download instructions:

AWS CLI

The easiest way to download data from AWS is using the AWS CLI:

https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-welcome.html

To download an individual data file to your local machine, the general structure is

aws s3 cp <S3 URI> <local path> --no-sign-request

For example:

aws s3 cp s3://drivendata-competition-biomassters-public-us/train_features/001b0634_S1_00.tif ./ --no-sign-request

The above downloads the file 001b0634_S1_00.tif from the public bucket in the US region. Adding --no-sign-request allows the data to be downloaded without configuring an AWS profile.

To download a directory rather than a file, use the --recursive flag. For example, to download all of the training data:

aws s3 cp s3://drivendata-competition-biomassters-public-us/train_features/ train_features/ --no-sign-request --recursive
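
A possible alternative (not mentioned in the official instructions) is aws s3 sync, which skips files that already exist locally, so re-running it after an interruption effectively resumes the download:

aws s3 sync s3://drivendata-competition-biomassters-public-us/train_features/ train_features/ --no-sign-request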

Hi,
Thank you so much for your reply. I did it the same way, but the download was interrupted after about 30 minutes. When I try to download the data again using the following command:

aws s3 cp s3://drivendata-competition-biomassters-public-us/train_features/ train_features/ --no-sign-request --recursive

it starts over from the first image. I am not able to resume downloading from a specific image ID onward. Moreover, there are a lot of images, so it is not practical to download them one by one.


Ahhh, I understand! I was worried about that too, that it might error out during the download.

Could you perhaps write a script (like the Python one below) to loop over the files and download them individually? I just checked and this works nicely. If you ran a few scripts in parallel (a rough sketch follows the script below), it wouldn’t be too slow. Sure, it’ll take a while :frowning: But that’s usually one of the pain points of working with good-sized data.

import os
import pandas as pd

# List of all feature files, taken from the competition's features_metadata.csv.
metadata = pd.read_csv("features_metadata.csv")

for i, row in metadata.iterrows():
    # Skip files that are already on disk, so re-running the script resumes
    # where the previous run stopped.
    if not os.path.exists(row.filename):
        cmd = f"aws s3 cp s3://drivendata-competition-biomassters-public-us/train_features/{row.filename} ./ --no-sign-request"
        os.system(cmd)

    # Only grab the first few files as a quick test; remove this to download everything.
    if i > 5:
        break
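
For the “few scripts in parallel” idea, here is a minimal sketch using Python’s concurrent.futures. It assumes the same features_metadata.csv layout as above (a filename column) and is only meant as a starting point; tune max_workers to your connection.

import os
import subprocess
from concurrent.futures import ThreadPoolExecutor

import pandas as pd

BUCKET = "s3://drivendata-competition-biomassters-public-us/train_features"

def download(filename):
    # Skip files that already exist locally so re-runs resume where they left off.
    if not os.path.exists(filename):
        subprocess.run(
            ["aws", "s3", "cp", f"{BUCKET}/{filename}", "./", "--no-sign-request"],
            check=False,
        )

metadata = pd.read_csv("features_metadata.csv")

# Download a few files at a time in parallel threads.
with ThreadPoolExecutor(max_workers=4) as pool:
    list(pool.map(download, metadata.filename))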

Thank you very much. I will do it the way you recommended.


Hi @hazhir_bahrami,

I’m sorry you’re having trouble downloading the data! If you are located outside of the US, you might find download speeds improve if you use another bucket. You might want to see if using the EU (s3://drivendata-competition-biomassters-public-eu) or Asia (s3://drivendata-competition-biomassters-public-as) bucket makes the process faster.
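
For example, the recursive download above pointed at the EU bucket (assuming the same key layout as the US bucket) would be:

aws s3 cp s3://drivendata-competition-biomassters-public-eu/train_features/ train_features/ --no-sign-request --recursive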

Good luck, and let us know if you have any other questions.

Hi,
Thank you so much for the help.
I started downloading the data using Python, and the problem is solved.
Best wishes.
