Not enough space for the data

Hello, I am new to this community and I would like to start participating in competitions of this style. My problem is that most competitions require a lot of space to store the data. How do you handle that? In this competition, for example, the dataset has these sizes. How do you download all of this locally?
| | # files | size |
|---|---|---|
| train_features | 189078 | 215.9GB |
| test_features | 63348 | 73.0GB |
| train_agbm | 8689 | 2.1GB |
Any comments would be helpful,

Thank you very much


Hi Robert,

I have had similar issues with the dataset size. The way I see it, you can either take subsets of the training data and train multiple times, or you can stream the data using a pipeline such as torchdata's datapipes.
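The subset approach can be sketched in a few lines: shuffle the file list once, split it into chunks, and train on one chunk at a time so only a fraction of the dataset has to be on disk at any moment. This is a minimal sketch with hypothetical file names, not the competition's actual loader.

```python
import random

def make_subsets(file_names, n_subsets, seed=0):
    """Shuffle the file list and split it into n roughly equal subsets.

    Train on one subset at a time, downloading and deleting each
    subset in turn, so only ~1/n of the data is local at once.
    """
    rng = random.Random(seed)
    files = list(file_names)
    rng.shuffle(files)
    # Striding over the shuffled list gives evenly sized subsets.
    return [files[i::n_subsets] for i in range(n_subsets)]

# Hypothetical file names standing in for the competition's imagery files.
all_files = [f"train_features/chip_{i:05d}.tif" for i in range(100)]
subsets = make_subsets(all_files, n_subsets=4)
# Together the 4 subsets cover every file exactly once.
```

The trade-off is that each training run only sees part of the data, so you would typically cycle through the subsets over several epochs or average models trained on different subsets.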

I have been experimenting with data pipes and it seems to work.
https://pytorch.org/data/main/torchdata.datapipes.iter.html
You'll want to check out `S3FileLoader`, since these files are stored in an AWS bucket.
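The core idea behind streaming is that you read each file in fixed-size chunks instead of materializing the whole dataset locally. Here is a minimal stand-alone sketch of that pattern using an in-memory buffer in place of a remote object; in a real pipeline, torchdata's `S3FileLoader` would yield the file streams from the AWS bucket.

```python
import io

def stream_in_chunks(fileobj, chunk_size=1 << 20):
    """Yield a file-like object's contents in fixed-size chunks,
    so memory use stays bounded no matter how large the file is."""
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            return
        yield chunk

# Simulate a ~2 MB remote object with an in-memory buffer.
blob = io.BytesIO(b"x" * (2 * (1 << 20) + 512))
chunks = list(stream_in_chunks(blob))
# Three chunks: 1 MB, 1 MB, and the 512-byte remainder.
```

With datapipes you would chain this kind of loading with decoding and batching steps, and the DataLoader pulls data on demand rather than from disk.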


Thanks so much for your help! I will try everything you recommended. Thanks a lot!