Bad training samples

Hi,

I found that some training samples appear to be mislabeled, e.g.
In train metadata: ggtbxx.wav,deletion,eat,2
corresponding label: ggtbxx.wav,1.0

If you listen to the audio file, it contains only non-word audio, so the label should be 0. I think there are many such bad samples. Could you please confirm this and fix those samples?

Thank you for sharing your observations about the challenge data. These data were annotated manually, so it is possible that there are misclassified samples. Since this is an ongoing competition, we will not change the training data. It is up to you and other solvers to figure out the best way to work with the training data while developing your solutions.

That said, if you find other training labels that seem wrong, please share the file IDs and we will pass them on to the data provider for future improvements.
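For anyone who wants to keep track of suspect labels while still training on the rest, here is a minimal sketch of one way to do it. It is not an official recommendation. The column names (`filename`, `task`, `expected_text`, `grade`, `score`), the `placeholder.wav` row, and the hand-maintained `FLAGGED` set are all assumptions for illustration; only the `ggtbxx.wav` row comes from this thread.

```python
import csv
import io

# Assumed column names; the real files may use different headers.
# "placeholder.wav" is a made-up row so the example has something to keep.
METADATA_CSV = """filename,task,expected_text,grade
ggtbxx.wav,deletion,eat,2
placeholder.wav,deletion,eat,2
"""
LABELS_CSV = """filename,score
ggtbxx.wav,1.0
placeholder.wav,0.0
"""

# File IDs judged mislabeled after listening; maintained by hand.
FLAGGED = {"ggtbxx.wav"}


def load_rows(text):
    """Parse CSV text into a dict keyed by filename."""
    return {row["filename"]: row for row in csv.DictReader(io.StringIO(text))}


def merge_and_filter(metadata_text, labels_text, flagged):
    """Join metadata with labels and split rows into kept and flagged."""
    metadata = load_rows(metadata_text)
    labels = load_rows(labels_text)
    kept, dropped = [], []
    for filename, meta in metadata.items():
        if filename not in labels:
            continue  # skip files that have no label
        row = {**meta, "score": float(labels[filename]["score"])}
        (dropped if filename in flagged else kept).append(row)
    return kept, dropped


kept, dropped = merge_and_filter(METADATA_CSV, LABELS_CSV, FLAGGED)
print("flagged IDs to report:", [r["filename"] for r in dropped])
print("rows kept for training:", len(kept))
```

Dropping flagged rows is only one option. The same list could instead be used to down-weight those samples, and it doubles as the list of file IDs to report.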

Hi Ashwani - thanks for pointing this out and sharing your observations.

The audio files in this competition have been anonymized using voice cloning technology to protect the privacy of the students while keeping the structure of the original speech intact. As a result, some audio clips may sound different from what you might expect, including background noise or non-word sounds. However, the labels reflect the intended speech content before anonymization.

A key goal of this competition is to build models that perform well on anonymized children’s audio, even when the data includes these variations. This reflects real-world challenges when working with privacy-preserving datasets.

We appreciate your attention to detail and encourage you to keep sharing your feedback—it helps us make the competition better for everyone. For more details about how the dataset was created, check out the “Problem Description” page.

Good luck with the competition!
