Are we allowed to do this?
Did we even need noise augmentation on the training data for the phonetic task? I suspect the TalkBank + DrivenData dataset distributions were similar to the private set, given how close my LB scores were to my clean-val CER.
What about the word track? For me CV LB never matched.
Never bothered because training was expensive.
I did the phonetic track and had an eval set that was 50-50 TalkBank and DD. It ended up being accurate to the blind set within 0.01 CER.
0.01 as in 1% CER, or 0.01%?
Used any noise augmentation? I couldn’t decide if augmentation and to what degree was even needed for this track.
0.01 CER, my bad. I tried all kinds of augmentation and nothing really worked.
For the phonetic track I used 1:1 weights when evaluating dd (DrivenData) and ext (TalkBank). Fold 0 (1/5) and the LB matched very well: my final submission scored 0.26284 on fold 0, 0.2538 on LB, and 0.2559 on PB, with LB consistently lower by about 0.008.
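A minimal sketch of what a 1:1 dd/ext weighted eval could look like (the helper names and corpus-level averaging are my own choices; the post only says the two subsets were weighted equally):

```python
def edit_distance(ref: str, hyp: str) -> int:
    # Classic Levenshtein distance with a single rolling DP row.
    m, n = len(ref), len(hyp)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                          # deletion
                        dp[j - 1] + 1,                      # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))  # substitution
            prev = cur
    return dp[n]

def cer(pairs):
    # Corpus-level CER: total edit operations / total reference characters.
    edits = sum(edit_distance(r, h) for r, h in pairs)
    chars = sum(len(r) for r, _ in pairs)
    return edits / chars

def weighted_cer(dd_pairs, ext_pairs):
    # Equal 1:1 weight on the two subsets, regardless of their sizes.
    return 0.5 * cer(dd_pairs) + 0.5 * cer(ext_pairs)
```

The point of the 1:1 weight is that a size-proportional average would let the larger subset dominate the metric.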
What worked for the phonetic track:
- Backbones: NeMo TDT 0.6B (TDT and CTC heads). TDT is much better for long audios and long labels and converges quickly (around 5 epochs); CTC needs more epochs (around 10) but trains faster and can quickly produce CTC beam scores for all candidates. wavlm-large + CTC performs extremely well on short audios and ext audios, though it is expensive to train: I used one Pro 6000 GPU for the wavlm models (5 hours per epoch, and luckily only 3-5 epochs were needed); for the NeMo models I used a 4090 or 5090.
- Augmentations: concat mix helped most. I mixed at most 8 utterances, selecting clips either from dd/ext with equal probability or uniformly at random from dd+ext; note that the former is dd-friendly. I also used classroom noise, but it might only have helped a little.
- Using the word-track data (much larger than the phonetic dataset, which helps the encoder) with a CTC loss, so the model has two heads: one head with phonetic labels (CTC or TDT) and another with word labels (CTC, loss weight 0.3).
- Post-processing to rerank all candidates is important. My best local single model scored about 0.289 on fold 0 (I did not submit it to test the LB). Since different models perform differently on ext/dd and on short/long audios, ensembling helps here. gpt54 and claude46 helped design a tree model (CatBoost ranker), and the tree model boosted the score a lot (though even a simple n-best rescore helps a lot).
- I did not write any code in this competition; gpt54 and claude46 did all the work. For tasks of this kind (ASR), LLMs work very well.
- I used a similar strategy for the word track but did not get a good LB score. Since the word dataset is large I used 20 folds, so fold 0 alone might not be enough, and the word track is more expensive to train. The word track seems to need a different strategy, which I missed.
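The concat-mix augmentation described above could look roughly like this (the sample structure and function names are hypothetical; only the "mix up to 8 clips, drawn either 50/50 from dd/ext or uniformly from the combined pool" idea comes from the post):

```python
import random

def concat_mix(dd_samples, ext_samples, max_mix=8, dd_friendly=True):
    """Concatenate several utterances (audio + label) into one longer
    training example. Each sample is assumed to be (waveform, label),
    with waveform as a list of floats -- a simplification of a real pipeline."""
    k = random.randint(2, max_mix)  # how many clips to stitch together
    pool_all = dd_samples + ext_samples
    clips = []
    for _ in range(k):
        if dd_friendly:
            # Pick the source corpus 50/50, then a clip from it.
            pool = dd_samples if random.random() < 0.5 else ext_samples
        else:
            # Uniform over the combined pool (favors the larger corpus).
            pool = pool_all
        clips.append(random.choice(pool))
    audio = [s for wav, _ in clips for s in wav]
    label = " ".join(lab for _, lab in clips)
    return audio, label
```

The 50/50 corpus draw is what makes the strategy "dd friendly": dd clips are over-represented relative to their share of the combined pool.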
For our approach we just used a single fold split (grouped on child_id), which also matched quite well with LB scores (r² of 0.95). Our best LB score of 0.2607 corresponded to a local score of 0.2259.
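A single grouped split like the one described can be sketched as follows (the hash-bucket trick and the `val_frac` value are my own choices, not necessarily what this team did; the grouping-on-child_id idea is from the post):

```python
import hashlib

def group_split(items, group_key, val_frac=0.2):
    """Single-fold split where every item sharing a group (e.g. child_id)
    lands on the same side, so no speaker leaks between train and val.
    Hashing the group id makes the split stable across runs."""
    def bucket(group):
        digest = hashlib.md5(str(group).encode()).hexdigest()
        return (int(digest, 16) % 10_000) / 10_000  # stable value in [0, 1)
    train = [x for x in items if bucket(group_key(x)) >= val_frac]
    val = [x for x in items if bucket(group_key(x)) < val_frac]
    return train, val
```

Grouping matters here because the same child appears in many clips; a random per-clip split would leak speaker identity into validation and inflate the local score.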
What worked for the phonetic track for us:
- We pretty much exclusively trained CTC models (wavlm-large, hubert-large-ll60k) and also took the encoder part of whisper (large-v3 + medium) and trained it with CTC. Training time was fine for these models, with the longest runs taking 12 hours on a Quadro RTX 6000 (15 epochs). Whisper converged a lot earlier than wavlm, so training there only took 5-8 hours on an RTX A6000.
- The logits of these models were then decoded with a Minimum Bayes Risk (MBR) beam search with beam width 50, which we found worked consistently better than greedy decoding while adding only a small amount of compute.
- The string outputs of all our final models (13 in total) were then ensembled using a character-level ROVER ensemble. This all just barely fit within the 2-hour inference time limit.
- Diversity worked well for the ensemble, with the final collection consisting of: 6 wavlm-large (2 trained on all data), 2 hubert-large-ll60k, 3 whisper-large-v3, and 2 whisper-medium.
- What really seemed to work for us was EMA, fp16 instead of bf16, focal CTC loss, time-stretch augmentation, background-noise augmentation, and a tri-stage LR scheduler.
- We also added various things to our pipeline without any conclusive proof of whether they worked. These include: an MTL head based on the word data (similar to gezi's approach), SSL pretraining on the word-track data, an age head, more external datasets, and other forms of augmentation (band-stop filter, masking, pitch shift, white noise).
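A toy version of a character-level ROVER-style vote (a real ROVER builds a word transition network with confidence scores; this pure-Python sketch only captures the align-then-majority-vote idea, using the first hypothesis as the alignment backbone):

```python
from collections import Counter
from difflib import SequenceMatcher

def rover_char(hypotheses):
    """Align every hypothesis to the first one at character level,
    then majority-vote each aligned slot; a '-' null vote means
    'emit nothing' at that position."""
    backbone = hypotheses[0]
    # slots[i] collects candidate characters aligned to backbone position i.
    slots = [[] for _ in range(len(backbone))]
    for hyp in hypotheses:
        sm = SequenceMatcher(a=backbone, b=hyp, autojunk=False)
        for tag, i1, i2, j1, j2 in sm.get_opcodes():
            if tag in ("equal", "replace"):
                # Pair characters position-by-position, padding with nulls.
                for k in range(i2 - i1):
                    ch = hyp[j1 + k] if j1 + k < j2 else "-"
                    slots[i1 + k].append(ch)
            elif tag == "delete":
                for bi in range(i1, i2):
                    slots[bi].append("-")
            # 'insert' spans are dropped in this sketch for simplicity.
    out = []
    for votes in slots:
        ch, _ = Counter(votes).most_common(1)[0]
        if ch != "-":
            out.append(ch)
    return "".join(out)
```

For example, `rover_char(["hello", "hallo", "hello"])` recovers "hello" because the majority outvotes the single "a".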
Full code and write up will soon be available!
Awesome, thanks for the write-ups! I used a 1:1 TalkBank-to-DD val set. We had some success with different ensembles of wavlm-large + RNN-T and CTC, and used similar ensembling approaches like ROVER. We just barely cracked 0.28, and the highest number of models we used was 4, which I now understand was much, much too low.
I only tried one model (w2v-bert) and got around 0.286 after many side quests
wrote up my retro over here On Top of Pasketti Retro | Pine Desk Software