Ensuring Sites are in Test or Train

rothm · January 18, 2023, 12:35am

Hello!

There’s a suggestion in the problem description to ensure sites appear either in the train or the eval set, but not both. Does anyone have slick ways of doing this while maintaining stratified and balanced datasets? I take the following naive approach but have poor accuracy:

1. Get the unique list of sites.
2. Split the sites 25/75 into eval/train.
3. Add all images at each site to the respective set (so train/eval may not be balanced).
4. Check whether the split of images in eval/train is ~25/75. If it's not close, go back to step 2.
5. Check whether the balance of species labels is close to the original balance. If not, go back to step 2.
6. Train as usual.
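
For reference, the retry loop described in the steps above could be sketched roughly like this. It is only a minimal illustration, not the poster's actual code: the function name `naive_site_split`, the tolerance `tol`, and the toy data are all assumptions made up for the example.

```python
import numpy as np

def naive_site_split(sites, labels, eval_frac=0.25, tol=0.05, max_tries=1000, seed=0):
    """Hypothetical helper: resample a site-level split until it is balanced.

    Returns a boolean mask that is True for images assigned to eval.
    """
    sites = np.asarray(sites)
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)

    unique_sites = np.unique(sites)                      # unique sites
    classes, counts = np.unique(labels, return_counts=True)
    overall = counts / len(labels)                       # original label balance
    n_eval_sites = max(1, round(eval_frac * len(unique_sites)))

    for _ in range(max_tries):
        # Split sites 25/75, then pull in every image at each eval site.
        eval_sites = rng.choice(unique_sites, size=n_eval_sites, replace=False)
        eval_mask = np.isin(sites, eval_sites)
        # Is the image-level split close to 25/75? If not, resample.
        if abs(eval_mask.mean() - eval_frac) > tol:
            continue
        # Is the eval label balance close to the original? If not, resample.
        eval_dist = np.array([(labels[eval_mask] == c).mean() for c in classes])
        if np.max(np.abs(eval_dist - overall)) > tol:
            continue
        return eval_mask
    raise RuntimeError("no acceptable split found; try loosening tol")

# Toy data (assumed): 40 sites with 10 images each and identical label mixes.
sites = np.repeat(np.arange(40), 10)
labels = np.tile(np.array([0, 1, 2, 3, 4] * 2), 40)
eval_mask = naive_site_split(sites, labels)
```

With real data the loop may need many tries or a looser `tol`, since sites rarely have similar sizes and label mixes.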

Gassoupaaalou · July 3, 2023, 8:37am

Hello,

I think what you need is sklearn.model_selection.StratifiedGroupKFold (see the scikit-learn 1.3.0 documentation).
