Data Download Option


Is there a way to download only specific pages of the TIF files (instead of downloading everything and then selecting only the page we are interested in)?
The total size of the training dataset is about 1.5TB. If we have to download the full TIF files to only retain a downsampled image, that would be a large waste of bandwidth.
Could you provide a way to download only specific pages?


I also have a problem downloading single files. When I run the example:
aws s3 cp s3://drivendata-competition-visiomel-public-us/images/1u4lhlqb.tif ./ --no-sign-request
I get:
[Errno 2] No such file or directory
I do not know exactly what is wrong.

@h.marko Handling large images is indeed part of the challenge of this competition. That said, you might look into partial reads with boto3, though I’m not sure offhand if that will support pyramidal tifs.
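
For anyone who wants to try the partial-read idea, here is a minimal sketch, assuming boto3 is installed and the bucket allows anonymous access (as the `--no-sign-request` example in this thread suggests). It uses an HTTP `Range` request through `get_object` to fetch only the first 16 bytes of a file, then parses the TIFF header to find where the first image directory (IFD) starts. The helper names (`byte_range_header`, `read_range`, `parse_tiff_header`) are made up for illustration. Whether whole pages can then be fetched efficiently depends on how each file lays out its pages, which nobody in this thread has confirmed.

```python
import struct

BUCKET = "drivendata-competition-visiomel-public-us"
KEY = "images/1u4lhlqb.tif"  # the example file from this thread


def byte_range_header(start, length):
    """Build an HTTP Range header value; the end offset is inclusive."""
    return f"bytes={start}-{start + length - 1}"


def read_range(bucket, key, start, length):
    """Fetch `length` bytes starting at `start` without downloading the whole object."""
    # Imported here so the pure helpers work even without boto3 installed.
    import boto3
    from botocore import UNSIGNED
    from botocore.config import Config

    # Unsigned requests are the boto3 equivalent of --no-sign-request.
    s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
    resp = s3.get_object(Bucket=bucket, Key=key, Range=byte_range_header(start, length))
    return resp["Body"].read()


def parse_tiff_header(header):
    """Return (flavor, first_ifd_offset) from the first 16 bytes of a TIFF file."""
    if header[:2] == b"II":
        endian = "<"  # little-endian
    elif header[:2] == b"MM":
        endian = ">"  # big-endian
    else:
        raise ValueError("not a TIFF file")
    (magic,) = struct.unpack(endian + "H", header[2:4])
    if magic == 42:  # classic TIFF: 32-bit offset at bytes 4..8
        (offset,) = struct.unpack(endian + "I", header[4:8])
        return "classic", offset
    if magic == 43:  # BigTIFF: 64-bit offset at bytes 8..16
        (offset,) = struct.unpack(endian + "Q", header[8:16])
        return "bigtiff", offset
    raise ValueError("unknown TIFF magic number")


if __name__ == "__main__":
    header = read_range(BUCKET, KEY, 0, 16)
    print(parse_tiff_header(header))
```

From the first IFD offset you would have to walk the directory chain with further ranged reads to locate the strips or tiles of the page you want, so this is only a starting point, not a full reader.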

@majabedi I copied and pasted that command exactly and it works for me. Can you double check there is no typo in what you’ve run locally?

Yes, it worked now. I had to open a new tab. There was a local problem in my other tab. Thank you for your support.

Did you manage to solve this? If so, please share your approach.

Nope. I had to download the full image, then select the interesting pages. That’s just a waste of bandwidth.

@h.marko, thanks for the response. I don’t even have the space to download such data. I would appreciate it if anyone could share a reduced version as well.

If you are constrained by space, you could just download one image at a time, save a single page (i.e. downsampled image) in a new file and remove the original file.
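
A rough sketch of that loop, assuming boto3 and tifffile are installed. The function names and the `page` index are hypothetical; which page holds the downsampled image you want is something to check against the data itself. Reading compressed pages with tifffile may also need the optional imagecodecs package.

```python
import os
import tempfile

BUCKET = "drivendata-competition-visiomel-public-us"


def fetch_from_s3(key, dest):
    """Download one full object anonymously (like --no-sign-request)."""
    import boto3
    from botocore import UNSIGNED
    from botocore.config import Config

    s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
    s3.download_file(BUCKET, key, dest)


def extract_page(src, dest, page):
    """Read a single page from the TIFF at `src` and write it to `dest`."""
    import tifffile

    image = tifffile.imread(src, key=page)  # key selects the page index
    tifffile.imwrite(dest, image)


def reduce_image(key, page, out_dir, fetch=fetch_from_s3, extract=extract_page):
    """Download one image, keep only one page, and delete the full file."""
    os.makedirs(out_dir, exist_ok=True)
    stem = os.path.splitext(os.path.basename(key))[0]
    dest = os.path.join(out_dir, f"{stem}_page{page}.tif")

    # The full-size file only ever lives in this temporary path.
    fd, tmp = tempfile.mkstemp(suffix=".tif", dir=out_dir)
    os.close(fd)
    try:
        fetch(key, tmp)
        extract(tmp, dest, page)
    finally:
        os.remove(tmp)  # free the space even if extraction fails
    return dest


if __name__ == "__main__":
    # Hypothetical usage: keep page 3 of the example file from this thread.
    print(reduce_image("images/1u4lhlqb.tif", page=3, out_dir="reduced"))
```

Only one full-size image is on disk at any time, so peak disk usage is roughly the largest single file plus the reduced copies. It does not save any bandwidth.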

I’d also suggest looking into s5cmd for downloading images. This is typically much faster than the AWS CLI.