Train/Test split ratio


With such a (relatively) small dataset, what ratio do you use for training/testing? Also, which validation method is recommended in this particular scenario: bootstrapping or k-fold?

I would love to hear your thoughts.


There are no hard and fast rules for sample splitting. A 70/30 (training/test) split worked for me. I use the sample.split function in R’s caTools package.
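
For anyone not working in R, here is a rough sketch of the same idea in plain Python. This is not the caTools code. It is a hand-rolled stand-in that splits each class separately so the label proportions stay roughly the same in both parts, which, as far as I know, is what sample.split also does. The function name `stratified_split` and the toy labels are made up for illustration.

```python
import random
from collections import defaultdict

def stratified_split(labels, train_ratio=0.7, seed=0):
    """Return (train_idx, test_idx), splitting each class separately."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible

    # Group row indices by their class label.
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)

    train_idx, test_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        cut = round(train_ratio * len(idxs))  # rows of this class sent to training
        train_idx.extend(idxs[:cut])
        test_idx.extend(idxs[cut:])
    return sorted(train_idx), sorted(test_idx)

# Hypothetical toy labels: 60 rows of class "a", 40 rows of class "b".
labels = ["a"] * 60 + ["b"] * 40
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

You would then fit on the rows in `train_idx` and score on the rows in `test_idx`. Changing `seed` gives a different 70/30 split, which is a quick way to see how much the score moves around on a small dataset.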

Are you talking about taking the dataset where the prediction variable is known and splitting it further into training/test datasets? I believe that helps you test the algorithms you have created.

The fact that the dataset is so small is kind of annoying. From what I’ve read, the 70/30 split that jhpincus mentioned seems standard. That said, I was frustrated to see that the model that performed best when tested with the 70/30 strategy ended up in the middle of the pack when I decided to just go ahead and upload all the sane models I tried out.
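
One thing that might help with a ranking that changes like that is k-fold, which came up in the original question: every row gets used for testing exactly once, so the score does not hinge on a single lucky or unlucky 70/30 split. Below is a minimal sketch in plain Python that only generates the fold indices. The name `k_fold_indices` and the dataset size of 23 are invented for the example, and this is not taken from any particular library.

```python
import random

def k_fold_indices(n, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation on n rows."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)

    # Deal the shuffled indices into k folds of near-equal size.
    folds = [idx[i::k] for i in range(k)]

    for i in range(k):
        test = folds[i]
        # Training set is every fold except the i-th one.
        train = [j for f_i, fold in enumerate(folds) if f_i != i for j in fold]
        yield train, test

# Hypothetical small dataset of 23 rows.
splits = list(k_fold_indices(23, k=5))
for train, test in splits:
    print(len(train), len(test))
```

You would fit each model on `train`, score it on `test`, and average the k scores per model before comparing. With a small dataset the averaged score should be less noisy than a single hold-out score, though it is still only an estimate.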

I’d love to hear more thoughts by experienced users on how to handle small datasets.