
Resized dataset is now available

I have started uploading a resized dataset to Kaggle because:
a) it’s almost impossible for anybody to deal with 5TB of data
b) it’s absolutely unnecessary to have 2048x1536 images for such a problem

The resized images are 512x384 (EXIF is preserved).

The list will be updated (upvotes are welcome):

Thank you so much for sharing this! Anything that lowers the barrier to entry is epic :)

S10 - public test

last piece

Hi Pavel,
It looks like you have done a wonderful service to everyone here. Can I ask if you have documented the process, i.e., the algorithm(s) that you used to downsize the images? I assume that you reduced the resolution with some loss of information involved, and I guess if we train a model with the smaller images then we would probably want to duplicate that downsizing process with the test set as well.

Sure thing: https://gitlab.com/ppleskov/snapshot-serengeti/blob/master/resize.ipynb
The key line is: img = img.resize((img.size[0]//4, img.size[1]//4), Image.ANTIALIAS)
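
For anyone who wants to apply the same downsizing to other images (for example a test set), here is a minimal sketch built around that key line. It is not the linked notebook: the function names, the paths, and the way EXIF is carried over are assumptions for illustration. It also uses Image.LANCZOS, because Image.ANTIALIAS was an alias for the same filter and was removed in Pillow 10.

```python
from pathlib import Path


def downsized_size(size, factor=4):
    """Return (width, height) after integer division by `factor`."""
    width, height = size
    return (width // factor, height // factor)


def resize_image(src, dst, factor=4):
    """Downsize one JPEG by `factor` per side, carrying over EXIF if present.

    Hypothetical helper: how the original notebook keeps EXIF is not shown
    in the thread, so passing the raw bytes to save() is an assumption.
    """
    # Pillow is imported here so the size helper above works without it.
    from PIL import Image

    with Image.open(src) as img:
        exif = img.info.get("exif")  # raw EXIF bytes, or None if absent
        # Image.LANCZOS is the filter formerly also exposed as Image.ANTIALIAS.
        small = img.resize(downsized_size(img.size, factor), Image.LANCZOS)
        save_kwargs = {"exif": exif} if exif else {}
        Path(dst).parent.mkdir(parents=True, exist_ok=True)
        small.save(dst, **save_kwargs)


print(downsized_size((2048, 1536)))  # (512, 384)
```

A factor of 4 per side turns 2048x1536 into 512x384, which is 16 times fewer pixels per image.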

As someone with a 3Mb/s download limit, I can now participate. Thank you.

Thanks very much for this.

BTW, I don’t see season 7 or 9 when I use the kaggle API (kaggle datasets list), but I do see them in your links above. Not sure why – maybe they have to be registered or something to appear?

Thank you so much! Much appreciated :)

It's better to ask Kaggle support.
All the datasets were produced in the same manner.

Hi Pavel,

First of all thanks for sharing!!

BTW, I think some files are missing. For example, S8_Q09_R2_IMAG1456.jpeg is not present in season 8 part 5 here

May I know the reason?

I’m missing 400k images in the downsized dataset. I have yet to compare against the original, but I’m wondering if others are seeing the same :)

A few images may be missing.

It should be around 100 in total.
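
One way to check how many images are actually missing is to compare the expected image names against the files on disk. The sketch below assumes the expected names are available as a plain list (for example, taken from the competition metadata) and that the downsized parts are unpacked under one local folder. Both the helper names and that layout are hypothetical.

```python
from pathlib import Path


def image_stems(root, suffixes=(".jpg", ".jpeg")):
    """Collect file names without extension for every image under `root`."""
    return {
        p.stem
        for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() in suffixes
    }


def missing_images(expected_stems, downsized_root):
    """Return the expected image names that have no file under `downsized_root`."""
    return sorted(set(expected_stems) - image_stems(downsized_root))
```

Calling len(missing_images(expected, "downsized/")) gives the count to compare against the figures quoted in this thread. Matching is by file name only, so it does not detect truncated or corrupt files.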