Hi,
I hope everything is going well with you. When I try to download the data from the command line, the download gets interrupted. I am passing the following link:
s3://drivendata-competition-biomassters-public-us/train_features/
But it asks for an “access key ID” and a “secret access key.”
I did not find these parameters in the .txt file. I would appreciate your help with this.
Hi Hazhir - check out the “AWS CLI” section of the download instructions:
AWS CLI
The easiest way to download data from AWS is using the AWS CLI:
https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-welcome.html
To download an individual data file to your local machine, the general structure is
aws s3 cp <S3 URI> <local path> --no-sign-request
For example:
aws s3 cp s3://drivendata-competition-biomassters-public-us/train_features/001b0634_S1_00.tif ./ --no-sign-request
The above downloads the file 001b0634_S1_00.tif from the public bucket in the US region. Adding “--no-sign-request” allows data to be downloaded without configuring an AWS profile.
To download a directory rather than a file, use the --recursive flag. For example, to download all of the training data:
aws s3 cp s3://drivendata-competition-biomassters-public-us/train_features/ train_features/ --no-sign-request --recursive
Hi,
Thank you so much for your reply. I did it the same way. But the download was interrupted after about 30 minutes. When I try to download the data again using the following command:
aws s3 cp s3://drivendata-competition-biomassters-public-us/train_features/ train_features/ --no-sign-request --recursive
it starts again from the first image. I am not able to resume downloading from a specific image ID to the end. Moreover, there are a lot of images, so it is not practical to download them one by one.
Ahhh, I understand! That worried me too, that it might error out during the download.
Could you perhaps write a script (like the Python one below) to loop over the files and download them individually? I just checked and this works nicely. If you ran a few scripts in parallel, it wouldn’t be too slow (there’s a rough parallel sketch after the snippet). Sure, it’ll take a while, but that’s usually one of the pain points of working with good-sized data.
import os
import pandas as pd

# List of expected files from the competition's features metadata
metadata = pd.read_csv("features_metadata.csv")

for i, row in metadata.iterrows():
    # Skip files that already exist, so the script can resume after an interruption
    if not os.path.exists(row.filename):
        cmd = f"aws s3 cp s3://drivendata-competition-biomassters-public-us/train_features/{row.filename} ./ --no-sign-request"
        os.system(cmd)
    # Only here to test the loop on a few files; remove it to download everything
    if i > 5:
        break
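And if you would rather parallelize inside one script instead of launching several copies, something along these lines might work (just a sketch, assuming the same features_metadata.csv / filename column as above, shelling out to the same aws command):

import os
import subprocess
from concurrent.futures import ThreadPoolExecutor

import pandas as pd

BUCKET = "s3://drivendata-competition-biomassters-public-us/train_features"

def download(filename):
    # Skip files that already exist so a re-run resumes where it left off
    if not os.path.exists(filename):
        subprocess.run(
            ["aws", "s3", "cp", f"{BUCKET}/{filename}", "./", "--no-sign-request"],
            check=True,
        )

metadata = pd.read_csv("features_metadata.csv")

# The work is network-bound, so a handful of threads is usually plenty
with ThreadPoolExecutor(max_workers=8) as pool:
    # list() forces the iterator so any download errors surface here
    list(pool.map(download, metadata.filename))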
Thank you very much. I will do it the way you recommended.
Hi @hazhir_bahrami,
I’m sorry you’re having trouble downloading the data! If you are located outside of the US, you might find download speeds improve if you use another bucket. You might want to see if using the EU (s3://drivendata-competition-biomassters-public-eu) or Asia (s3://drivendata-competition-biomassters-public-as) bucket makes the process faster.
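Assuming those buckets mirror the same layout as the US one, the recursive command from above would just swap in the new bucket name, e.g.:
aws s3 cp s3://drivendata-competition-biomassters-public-eu/train_features/ train_features/ --no-sign-request --recursive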
Good luck, and let us know if you have any other questions.
Hi,
Thank you so much for the help.
I started downloading the data using Python, and the problem is solved.
Best wishes.