Data Download Option

h.marko · March 23, 2023, 8:08pm

Hi,

Is there a way to download only specific pages of the TIF files (instead of downloading everything and then selecting only the page we are interested in)?
The total size of the training dataset is about 1.5TB. If we have to download the full TIF files to only retain a downsampled image, that would be a large waste of bandwidth.
Could you provide a way to download only specific pages?

Thanks

majabedi · March 24, 2023, 6:00pm

I also have a problem downloading single files. When I run the example:
aws s3 cp s3://drivendata-competition-visiomel-public-us/images/1u4lhlqb.tif ./ --no-sign-request
I get:
[Errno 2] No such file or directory
I do not know what exactly is wrong.

emily · March 24, 2023, 6:40pm

@h.marko Handling large images is indeed part of the challenge of this competition. That said, you might look into partial reads with boto3, though I’m not sure offhand if that will support pyramidal tifs.
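As a rough sketch of what a partial read with boto3 could look like: the snippet below requests only a byte range of one object via the `Range` parameter of `get_object`, using an unsigned client to mirror `--no-sign-request`. The helper names (`range_header`, `read_byte_range`) are made up for illustration, and the bucket and key are taken from the example command in this thread. It only shows how to fetch raw bytes, such as the first 8 bytes where a classic TIFF keeps its header. Locating a specific page in a pyramidal TIF would additionally require following the file's internal offsets, which is not attempted here and may or may not be practical.

```python
def range_header(start, length):
    """Build an HTTP Range header value for `length` bytes starting at `start`."""
    if start < 0 or length <= 0:
        raise ValueError("start must be >= 0 and length must be > 0")
    # HTTP byte ranges are inclusive on both ends.
    return f"bytes={start}-{start + length - 1}"


def read_byte_range(bucket, key, start, length, client=None):
    """Fetch only part of an S3 object instead of the whole file.

    `client` is optional so that a pre-configured (or fake) client can be passed in.
    """
    if client is None:
        # Third-party dependency, imported lazily; assumed to be installed.
        import boto3
        from botocore import UNSIGNED
        from botocore.config import Config

        # Unsigned requests, analogous to `--no-sign-request` in the AWS CLI.
        # Depending on your setup you may also need to pass region_name.
        client = boto3.client("s3", config=Config(signature_version=UNSIGNED))

    response = client.get_object(
        Bucket=bucket, Key=key, Range=range_header(start, length)
    )
    return response["Body"].read()


if __name__ == "__main__":
    # Bucket and key come from the example command posted in this thread.
    header = read_byte_range(
        "drivendata-competition-visiomel-public-us",
        "images/1u4lhlqb.tif",
        start=0,
        length=8,
    )
    print(header)
```

Whether this saves bandwidth in practice depends on how the pages are laid out inside each file, so treat it as a starting point for experimenting rather than a ready solution.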

@majabedi I copied and pasted that command exactly and it works for me. Can you double check there is no typo in what you’ve run locally?

majabedi · March 24, 2023, 10:07pm

Yes, it worked now. I had to open a new tab. There was a local problem in my other tab. Thank you for your support.

bojesomo · March 25, 2023, 2:03pm

@h.marko
Did you manage to solve this? If so, please share your approach.

h.marko · March 28, 2023, 2:31pm

Nope. I had to download each full image and then keep only the pages I needed, which is a waste of bandwidth.

bojesomo · March 28, 2023, 5:57pm

@h.marko, thanks for the response. I don’t even have the space to download such data. I would appreciate it if anyone could share a reduced version as well.

h.marko · March 28, 2023, 8:16pm

If you are constrained by space, you could just download one image at a time, save a single page (i.e. downsampled image) in a new file and remove the original file.
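The one-image-at-a-time approach could be sketched roughly as follows. This is only an illustration under some assumptions: it shells out to the same `aws s3 cp` command shown earlier in the thread, it assumes the third-party `tifffile` package is installed, and the function names, the output naming scheme, and the page index you pass in are all hypothetical choices rather than anything prescribed by the competition.

```python
import subprocess
from pathlib import Path

# Prefix taken from the example command earlier in this thread.
S3_PREFIX = "s3://drivendata-competition-visiomel-public-us/images"


def download(filename, dest_dir):
    """Download one full TIF with the AWS CLI and return its local path."""
    dest = Path(dest_dir) / filename
    subprocess.run(
        ["aws", "s3", "cp", f"{S3_PREFIX}/{filename}", str(dest), "--no-sign-request"],
        check=True,
    )
    return dest


def keep_single_page(src, dst, page_index):
    """Write one page of a multi-page TIF to a new, smaller file."""
    import tifffile  # third-party; imported lazily, assumed to be installed

    with tifffile.TiffFile(src) as tif:
        page = tif.pages[page_index].asarray()
    tifffile.imwrite(dst, page)


def process_one_at_a_time(filenames, dest_dir, page_index,
                          fetch=download, extract=keep_single_page):
    """Download, reduce, and delete each image in turn to limit disk usage."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    outputs = []
    for name in filenames:
        full = Path(fetch(name, dest_dir))
        small = dest_dir / f"{Path(name).stem}_page{page_index}.tif"
        try:
            extract(full, small, page_index)
        finally:
            # Remove the original even if extraction fails, so disk use stays bounded.
            full.unlink(missing_ok=True)
        outputs.append(small)
    return outputs
```

With this layout, peak disk usage is roughly one full image plus the reduced files kept so far. It does not reduce the bandwidth used, since every full file is still downloaded.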

emily · March 29, 2023, 7:15pm

I’d also suggest looking into s5cmd for downloading images. This is typically much faster than the AWS CLI.
