Bad training samples

ashwani.iitb · January 14, 2025, 3:36pm

Hi,

I found that some training samples are bad, e.g.:
In train metadata: ggtbxx.wav,deletion,eat,2
corresponding label: ggtbxx.wav,1.0

If you listen to the audio file, it contains only non-word audio, so the label should be 0. I think there are many such bad samples. Could you please confirm this and fix those samples?
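For anyone who wants to reproduce this kind of check, here is a minimal sketch of pairing a metadata row with its label so a suspect file can be inspected side by side. The column names (`filename`, `task`, `expected_text`, `grade`, `score`) are my own guesses for illustration, since only the raw rows are shown above, and the two rows are inlined as strings rather than read from the actual competition files.

```python
import csv
import io

# Hypothetical column names: the rows above come without headers,
# so these are assumptions, not the competition's real schema.
METADATA_COLUMNS = ["filename", "task", "expected_text", "grade"]
LABEL_COLUMNS = ["filename", "score"]


def load_rows(text, columns):
    """Parse header-less CSV text into a dict keyed by filename."""
    reader = csv.DictReader(io.StringIO(text), fieldnames=columns)
    return {row["filename"]: row for row in reader}


def join_metadata_and_labels(metadata_text, labels_text):
    """Attach each label score to its metadata row, matched on filename."""
    metadata = load_rows(metadata_text, METADATA_COLUMNS)
    labels = load_rows(labels_text, LABEL_COLUMNS)
    joined = {}
    for filename, meta in metadata.items():
        if filename in labels:
            joined[filename] = {**meta, "score": float(labels[filename]["score"])}
    return joined


# The two rows quoted in this post, inlined for the example.
metadata_text = "ggtbxx.wav,deletion,eat,2\n"
labels_text = "ggtbxx.wav,1.0\n"

joined = join_metadata_and_labels(metadata_text, labels_text)

# File IDs flagged after listening to the audio by hand.
suspect_ids = ["ggtbxx.wav"]
for file_id in suspect_ids:
    print(file_id, joined[file_id])
```

With the real files, the inlined strings would be replaced by the contents of the metadata and label CSVs, and `suspect_ids` would hold whatever files sound wrong on manual review.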

hannahmoro · January 14, 2025, 8:41pm

Thank you for sharing your observations about the challenge data. These data were annotated manually, so it is possible that there are misclassified samples. Since this is an ongoing competition, we will not change the training data. It is up to you and other solvers to figure out the best way to work with the training data while developing your solutions.

That said, if you find other training labels that seem wrong, please share the file IDs and we will pass them on to the data provider for future improvements.

meralh · January 14, 2025, 10:37pm

Hi Ashwani - thanks for pointing this out and sharing your observations.

The audio files in this competition have been anonymized using voice cloning technology to protect the privacy of the students while keeping the structure of the original speech intact. As a result, some audio clips may sound different from what you might expect, including background noise or non-word sounds. However, the labels reflect the intended speech content before anonymization.

A key goal of this competition is to build models that perform well on anonymized children’s audio, even when the data includes these variations. This reflects real-world challenges when working with privacy-preserving datasets.

We appreciate your attention to detail and encourage you to keep sharing your feedback—it helps us make the competition better for everyone. For more details about how the dataset was created, check out the “Problem Description” page.

Good luck with the competition!
