Smoke test utterances

Hi,

I was wondering whether you could share the exact utterances used in the smoke dataset, along with the corresponding scoring script.

I ran my model locally on the utterances in smoke_test_submission_format and evaluated the results using metric/score.py from the provided GitHub repository, but the WER I obtained locally was dramatically different from the WER reported by the cloud smoke test.

Having access to the exact smoke test utterances and scoring setup would be very helpful for debugging whether the discrepancy is due to an environment mismatch or an issue with my own model.
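For reference, this is the kind of sanity check I am running on my side: a plain word-level Levenshtein WER (this is my own sketch, not necessarily identical to what metric/score.py computes):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate via word-level Levenshtein distance."""
    ref = reference.split()
    hyp = hypothesis.split()
    # DP table: d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,        # deletion
                d[i][j - 1] + 1,        # insertion
                d[i - 1][j - 1] + sub,  # substitution or match
            )
    return d[len(ref)][len(hyp)] / max(len(ref), 1)


print(wer("the cat sat", "the cat sat"))  # 0.0
print(wer("the cat sat", "the bat sat"))  # one substitution over three words
```

If the official scorer applies extra text normalization (casing, punctuation stripping, etc.) before computing WER, that alone could explain part of the gap, which is why I would like to see the exact setup.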

Thank you very much for your help.

Hi @jialuli - The exact utterance IDs are shared in the “Smoke test submission format” file on the data download pages. That and metric/score.py should give you everything you need to replicate the score locally. Good luck!

Pardon me, but is the smoke test score representative of the public LB test set to any degree? I believed it isn't, since it's mentioned that it's fake data whose only purpose is to test the submission run end to end, right?

No, the smoke test does not impact the leaderboard score at all. It is drawn from the training data.