Ensuring Sites are in Test or Train


There’s a suggestion in the problem description to ensure sites appear either in the train or the eval set, but not both. Does anyone have slick ways of doing this while maintaining stratified and balanced datasets? I take the following naive approach but have poor accuracy:

  1. get unique list of sites
  2. split these 25/75 for eval/train
  3. add all images at each site into respective set (so train/eval may not be balanced)
  4. check if the split of images in eval/train is ~25/75. If it's not close, go back to step 2
  5. check if the balance of species labels is close to the original balance (if not, go back to step 2)
  6. train per usual.
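
For reference, the steps above can be sketched roughly as follows. This is only an illustration of the retry loop, not the exact code used: the function name `naive_site_split`, the tolerance `tol`, and the retry cap `max_tries` are all hypothetical choices, and `sites` / `labels` are assumed to be per-image arrays of site IDs and species labels.

```python
import numpy as np

def naive_site_split(sites, labels, eval_frac=0.25, tol=0.05, max_tries=1000, seed=0):
    """Hypothetical helper: resample a site-level split until it looks balanced."""
    sites = np.asarray(sites)
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)

    # Step 1: unique list of sites.
    unique_sites = np.unique(sites)
    n_eval_sites = max(1, int(round(eval_frac * len(unique_sites))))

    # Overall label proportions, used as the target in step 5.
    classes, counts = np.unique(labels, return_counts=True)
    overall = counts / counts.sum()

    for _ in range(max_tries):
        # Step 2: split the sites roughly 25/75.
        eval_sites = rng.choice(unique_sites, size=n_eval_sites, replace=False)
        # Step 3: every image follows its site.
        eval_mask = np.isin(sites, eval_sites)

        # Step 4: is the image-level split close to eval_frac?
        if abs(eval_mask.mean() - eval_frac) > tol:
            continue

        # Step 5: are the label proportions in both sets close to the original?
        balanced = True
        for mask in (eval_mask, ~eval_mask):
            dist = np.array([(labels[mask] == c).mean() for c in classes])
            if np.abs(dist - overall).max() > tol:
                balanced = False
                break
        if balanced:
            # Return (train indices, eval indices).
            return np.flatnonzero(~eval_mask), np.flatnonzero(eval_mask)

    raise RuntimeError("no acceptable split found; try a larger tol or max_tries")
```

With few sites or rare species, a loop like this can fail to find any acceptable split, which is one reason to look for a purpose-built splitter.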


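I think what you need is sklearn.model_selection.StratifiedGroupKFold (scikit-learn 1.3.0 documentation)! Pass the species labels as y and the site IDs as groups: it keeps each site in a single fold while trying to preserve the label proportions in every fold.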