Noisy Classroom Data

nchuzhoy · February 28, 2026, 1:05am

Hi,

The problem description notes that the Noisy WER subset is drawn from audio recorded in natural classroom environments. Are any of these real classroom recordings included in the training set, or are they exclusive to the test set?

As it stands, we are using the synthetic RealClass data to try to simulate the Noisy WER subset, but without knowing any characteristics of the data (SNR, amount of crosstalk, etc.) it’s difficult to make informed augmentation decisions. Would it be possible to share some high-level details, or include some noisy samples in the training set so we have something to calibrate against?
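For context, here is a minimal sketch of the kind of SNR-controlled mixing in question. The `mix_at_snr` helper, the random placeholder arrays, and the 0 to 20 dB range are all assumptions for illustration, not anything taken from the challenge data or the organizers:

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech so the mixture has the requested SNR in dB."""
    # Loop or trim the noise so it matches the speech length.
    if len(noise) < len(speech):
        reps = int(np.ceil(len(speech) / len(noise)))
        noise = np.tile(noise, reps)
    noise = noise[: len(speech)]

    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    if speech_power == 0 or noise_power == 0:
        # Nothing meaningful to scale against; return the speech unchanged.
        return speech.copy()

    # Choose scale so 10*log10(speech_power / (scale**2 * noise_power)) == snr_db.
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)  # placeholder for a clean utterance
noise = rng.standard_normal(8000)    # placeholder for a classroom noise clip
snr_db = rng.uniform(0, 20)          # assumed range; the real SNRs are unknown
noisy = mix_at_snr(speech, noise, snr_db)
```

Right now a range like the one above is a guess, which is why even rough figures would help.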

Also, I wanted to confirm: is the Noisy WER subset included in the overall leaderboard WER calculation, or is it excluded?

Thank you!

oknaitik · March 1, 2026, 8:25pm

I’m finding it extremely frustrating to get a sense of the test set distribution. Almost every time, my public LB score is worse than my validation scores.

cszc · March 2, 2026, 5:02pm

Hi @nchuzhoy - thanks for your thoughtful questions!

As noted in the problem description, the training and test splits are drawn from multiple data sources, with some sources appearing exclusively in either train or test. The data used for the Noisy WER metric is not included in the training set, and we’re not able to release those recordings publicly. It was collected in real-world, diverse classroom environments using different recording devices. We understand it would be helpful to have more signal, but unfortunately, we are unable to share further characteristics or provide sample clips.

The overall leaderboard score reflects performance on the evaluation data used for ranking, which spans multiple recording conditions and environments. The Noisy WER is reported separately to provide insight into performance in real-world classroom conditions.

Thanks for your understanding!
