Sharing a Colab notebook to download only selected pages

Hello everyone! If, like me, you are having trouble downloading such a large amount of train data due to a poor internet connection, here is a solution for you.

Please find the notebook at <link redacted>, which:

  1. Downloads the train dataset image by image via s5cmd
  2. Extracts the selected page(s)
  3. Saves each page as .png in an images_page_{page} folder, one folder per selected page
  4. Zips each images_page_{page} folder to images_page_{page}.zip
  5. Optionally copies the archives to a connected Google Drive
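
For anyone who wants to reproduce steps 4 and 5 outside the notebook, here is a minimal sketch using only the Python standard library. It is not the notebook's actual code: the function name `archive_page_folders` and the `drive_dir` parameter are hypothetical, and it assumes the images_page_{page} folders already exist under one root directory.

```python
import shutil
from pathlib import Path


def archive_page_folders(root, pages, drive_dir=None):
    """Zip each images_page_{page} folder under root; optionally copy the zips.

    root, pages and drive_dir are illustrative names, not taken from the notebook.
    """
    root = Path(root)
    archives = []
    for page in pages:
        folder = root / f"images_page_{page}"
        if not folder.is_dir():
            # Nothing was extracted for this page, so skip it.
            continue
        # make_archive appends ".zip" to the base name it is given,
        # so the archive lands next to the folder, not inside it.
        zip_path = Path(shutil.make_archive(str(folder), "zip", root_dir=folder))
        if drive_dir is not None:
            target = Path(drive_dir)
            target.mkdir(parents=True, exist_ok=True)
            shutil.copy2(zip_path, target / zip_path.name)
        archives.append(zip_path)
    return archives
```

On Colab, `drive_dir` would typically point somewhere under the mounted Drive path (for example a folder below `/content/drive/MyDrive`, assuming Drive is mounted there); leaving it as `None` skips the copy.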

To run the conversion, please run the Install deps, Imports, and Download & convert all sections. The whole process should take a couple of hours, depending on which pages you select (I use pages 4 and 7). Downloading on Colab takes much less time than downloading locally, so this saves some time.

I will also share a link to a Google Drive folder with the page 4 and 7 archives later, once my conversion run finishes.

@ishashah I have read the Data use and code sharing part of the competition rules, and my understanding is that I could upload part of the dataset to Google Drive and share the folder by link in a discussion thread, where only participants could access it. Is that correct?

Also, the Colab notebook I share here contains a link to the AWS bucket with the data, and I only share it here by link, so it also does not break the rules.

If my understanding is incorrect, please notify me or just remove the thread yourself, thank you!

Hi @mkotyushev – thanks for creating this resource and being mindful of the data use and sharing rules. This forum is public (i.e. not limited to participants), so we need you to do the following:

  • remove the bucket name from the colab notebook
  • share only a non-executed version of the notebook

To pull out the bucket name, you can just set it as a variable so it is easy for others to use:

  • BUCKET = {competition bucket here}
  • !s5cmd --no-sign-request cp '{BUCKET}/{image_name}' .
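
If you prefer a plain Python cell over the `!` shell magic, the same pattern can be sketched as below. This is only an illustration: the helper names `s5cmd_cp_command` and `download_image` and the `dest_dir` parameter are made up here, and `BUCKET` is deliberately left as a placeholder that each participant fills in from the data download page.

```python
import subprocess

# Placeholder only: fill in the bucket from the competition's data download page.
BUCKET = "{competition bucket here}"


def s5cmd_cp_command(bucket, image_name, dest_dir="."):
    """Build the s5cmd argument list that copies one image into dest_dir."""
    source = f"{bucket.rstrip('/')}/{image_name}"
    # The trailing slash makes s5cmd treat the destination as a directory.
    destination = f"{dest_dir.rstrip('/')}/"
    return ["s5cmd", "--no-sign-request", "cp", source, destination]


def download_image(bucket, image_name, dest_dir="."):
    """Run the copy; raises CalledProcessError if s5cmd exits with an error."""
    subprocess.run(s5cmd_cp_command(bucket, image_name, dest_dir), check=True)
```

Keeping the bucket in a single variable means a shared, non-executed notebook never has to contain the real bucket name.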

It is not allowed to share a google drive link to the data here, but if there is sufficient interest, we may be able to put a link to the google drive on the data download page which is only accessible to people who have agreed to the rules of the competition.

I’ve removed the link from the above post. Once you have made the changes above, please feel free to add the link back in.

Hi everyone, I have updated the notebook as requested, here is the link

I will not post any Google Drive links here, but processing takes only ~5 hours for the conversion plus some time to download from Drive, so it is easy to do yourself.
