Noisy Classroom Data

Hi,

The problem description notes that the Noisy WER subset is drawn from audio recorded in natural classroom environments. Are any of these real classroom recordings included in the training set, or are they exclusive to the test set?

As it stands, we are using the synthetic RealClass data to try to simulate the Noisy WER subset, but without knowing any characteristics of the data (SNR, amount of crosstalk, etc.), it’s difficult to make informed augmentation decisions. Would it be possible to share some high-level details, or include some noisy samples in the training set so we have something to calibrate against?

Also, I wanted to confirm: is the Noisy WER subset included in the overall leaderboard WER calculation, or is it excluded?

Thank you!

I’m finding it extremely frustrating to get a sense of the test set distribution. Almost every time, my public LB score is worse than my validation scores.

Hi @nchuzhoy - thanks for your thoughtful questions!

As noted in the problem description, the training and test splits are drawn from multiple data sources, with some sources appearing exclusively in either train or test. The data used for the Noisy WER metric is not included in the training set, and we’re not able to release those recordings publicly. It was collected in real-world, diverse classroom environments using different recording devices. We understand it would be helpful to have more signal, but unfortunately, we are unable to share further characteristics or provide sample clips.

The overall leaderboard score reflects performance on the evaluation data used for ranking, which spans multiple recording conditions and environments. The Noisy WER is reported separately to provide insight into performance in real-world classroom conditions.

Thanks for your understanding!