Are we allowed to do this?
Did we even need noise augmentation on the training data for the phonetic task? I suspect the TalkBank + DrivenData dataset distributions were similar to the private set, given how close my LB scores were to my clean-val CER.
What about the word track? For me CV LB never matched.
Never bothered because training was expensive.
I did the phonetic track and had an eval set that was 50-50 TalkBank and DD. It ended up being accurate to the blind set within 0.01 CER.
0.01 as in 1% CER, or 0.01%?
Used any noise augmentation? I couldn’t decide if augmentation and to what degree was even needed for this track.
0.01 CER, my bad. I tried all kinds of augmentation and nothing really worked.
For the phonetic track I used 1:1 weights when evaluating dd (DrivenData) and ext (TalkBank). Fold 0 (1/5) and the LB matched very well: my final submission scored 0.26284 on fold 0, 0.2538 on LB, and 0.2559 on PB, with LB consistently lower by about 0.008.
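A minimal sketch of what a 1:1 dd/ext weighted eval could look like (the helper names and corpus-level averaging are my own choices; the post only says the two subsets were weighted equally):

```python
def edit_distance(ref: str, hyp: str) -> int:
    # Classic Levenshtein distance with a single rolling DP row.
    m, n = len(ref), len(hyp)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                          # deletion
                        dp[j - 1] + 1,                      # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))  # substitution
            prev = cur
    return dp[n]

def cer(pairs):
    # Corpus-level CER: total edit operations / total reference characters.
    edits = sum(edit_distance(r, h) for r, h in pairs)
    chars = sum(len(r) for r, _ in pairs)
    return edits / chars

def weighted_cer(dd_pairs, ext_pairs):
    # Equal 1:1 weight on the two subsets, regardless of their sizes.
    return 0.5 * cer(dd_pairs) + 0.5 * cer(ext_pairs)
```

The point of the 1:1 weight is that a size-proportional average would let the larger subset dominate the metric.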
What worked for the phonetic track:
- Backbones: NeMo TDT 0.6B (TDT and CTC heads). TDT is much better for long audios and long labels and converges quickly (around 5 epochs); CTC needs more epochs (around 10) but trains faster and can quickly produce CTC beam scores for all candidates. wavlm-large + CTC performs extremely well on short audios and ext audios, though it is expensive to train: I used one Pro 6000 GPU for the wavlm models (5 hours per epoch, and luckily only 3-5 epochs were needed); for the NeMo models I used a 4090 or 5090.
- Augmentations: concat mix helped most. I mixed at most 8 utterances, selecting clips either from dd/ext with equal probability or uniformly at random from dd+ext; note that the former is dd-friendly. I also used classroom noise, but it might only have helped a little.
- Using the word-track data (much larger than the phonetic dataset, which helps the encoder) with a CTC loss, so the model has two heads: one head with phonetic labels (CTC or TDT) and another with word labels (CTC, loss weight 0.3).
- Post-processing to rerank all candidates is important. My best local single model scored about 0.289 on fold 0 (I did not submit it to test the LB). Since different models perform differently on ext/dd and on short/long audios, ensembling helps here. gpt54 and claude46 helped design a tree model (CatBoost ranker), and the tree model boosted the score a lot (though even a simple n-best rescore helps a lot).
- I did not write any code in this competition; gpt54 and claude46 did all the work. For tasks of this kind (ASR), LLMs work very well.
- I used a similar strategy for the word track but did not get a good LB score. Since the word dataset is large I used 20 folds, so fold 0 alone might not be enough, and the word track is more expensive to train. The word track seems to need a different strategy, which I missed.
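The concat-mix augmentation described above could look roughly like this (the sample structure and function names are hypothetical; only the "mix up to 8 clips, drawn either 50/50 from dd/ext or uniformly from the combined pool" idea comes from the post):

```python
import random

def concat_mix(dd_samples, ext_samples, max_mix=8, dd_friendly=True):
    """Concatenate several utterances (audio + label) into one longer
    training example. Each sample is assumed to be (waveform, label),
    with waveform as a list of floats -- a simplification of a real pipeline."""
    k = random.randint(2, max_mix)  # how many clips to stitch together
    pool_all = dd_samples + ext_samples
    clips = []
    for _ in range(k):
        if dd_friendly:
            # Pick the source corpus 50/50, then a clip from it.
            pool = dd_samples if random.random() < 0.5 else ext_samples
        else:
            # Uniform over the combined pool (favors the larger corpus).
            pool = pool_all
        clips.append(random.choice(pool))
    audio = [s for wav, _ in clips for s in wav]
    label = " ".join(lab for _, lab in clips)
    return audio, label
```

The 50/50 corpus draw is what makes the strategy "dd friendly": dd clips are over-represented relative to their share of the combined pool.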
For our approach we just used a single fold split (grouped on child_id), which also matched quite well with LB scores (r² of 0.95). Our best LB score of 0.2607 corresponded to a local score of 0.2259.
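A single grouped split like the one described can be sketched as follows (the hash-bucket trick and the `val_frac` value are my own choices, not necessarily what this team did; the grouping-on-child_id idea is from the post):

```python
import hashlib

def group_split(items, group_key, val_frac=0.2):
    """Single-fold split where every item sharing a group (e.g. child_id)
    lands on the same side, so no speaker leaks between train and val.
    Hashing the group id makes the split stable across runs."""
    def bucket(group):
        digest = hashlib.md5(str(group).encode()).hexdigest()
        return (int(digest, 16) % 10_000) / 10_000  # stable value in [0, 1)
    train = [x for x in items if bucket(group_key(x)) >= val_frac]
    val = [x for x in items if bucket(group_key(x)) < val_frac]
    return train, val
```

Grouping matters here because the same child appears in many clips; a random per-clip split would leak speaker identity into validation and inflate the local score.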
What worked for the phonetic track for us:
- We pretty much exclusively trained CTC models (wavlm-large, hubert-large-ll60k) and also took the encoder part of whisper (large-v3 + medium) and trained it with CTC. Training time was fine for these models, with the longest runs taking 12 hours on a Quadro RTX 6000 (15 epochs). Whisper converged a lot earlier than wavlm, so training there only took 5-8 hours on an RTX A6000.
- The logits of these models were then decoded with a Minimum Bayes Risk (MBR) beam search with beam width 50, which we found worked consistently better than greedy decoding while adding only a small amount of compute.
- The string outputs of all our final models (13 in total) were then ensembled using a character-level ROVER ensemble. This all just barely fit within the 2-hour inference time limit.
- Diversity worked well for the ensemble, with the final collection consisting of: 6 wavlm-large (2 trained on all data), 2 hubert-large-ll60k, 3 whisper-large-v3, and 2 whisper-medium.
- What really seemed to work for us was EMA, fp16 instead of bf16, focal CTC loss, time-stretch augmentation, background-noise augmentation, and a tri-stage LR scheduler.
- We also added various things to our pipeline without any conclusive proof of whether they worked. These include: an MTL head based on the word data (similar to gezi's approach), SSL pretraining on the word-track data, an age head, more external datasets, and other forms of augmentation (band-stop filter, masking, pitch shift, white noise).
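A toy version of a character-level ROVER-style vote (a real ROVER builds a word transition network with confidence scores; this pure-Python sketch only captures the align-then-majority-vote idea, using the first hypothesis as the alignment backbone):

```python
from collections import Counter
from difflib import SequenceMatcher

def rover_char(hypotheses):
    """Align every hypothesis to the first one at character level,
    then majority-vote each aligned slot; a '-' null vote means
    'emit nothing' at that position."""
    backbone = hypotheses[0]
    # slots[i] collects candidate characters aligned to backbone position i.
    slots = [[] for _ in range(len(backbone))]
    for hyp in hypotheses:
        sm = SequenceMatcher(a=backbone, b=hyp, autojunk=False)
        for tag, i1, i2, j1, j2 in sm.get_opcodes():
            if tag in ("equal", "replace"):
                # Pair characters position-by-position, padding with nulls.
                for k in range(i2 - i1):
                    ch = hyp[j1 + k] if j1 + k < j2 else "-"
                    slots[i1 + k].append(ch)
            elif tag == "delete":
                for bi in range(i1, i2):
                    slots[bi].append("-")
            # 'insert' spans are dropped in this sketch for simplicity.
    out = []
    for votes in slots:
        ch, _ = Counter(votes).most_common(1)[0]
        if ch != "-":
            out.append(ch)
    return "".join(out)
```

For example, `rover_char(["hello", "hallo", "hello"])` recovers "hello" because the majority outvotes the single "a".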
Full code and write up will soon be available!
Awesome, thanks for the write-ups! I used a 1:1 TalkBank-to-DD val set. We had some success with different ensembles of wavlm-large + RNN-T and CTC, and used similar ensembling approaches like ROVER. We just barely cracked 0.28, and the highest number of models we used was 4, which I now understand was much, much too low.
I only tried one model (w2v-bert) and got around 0.286 after many side quests
wrote up my retro over here On Top of Pasketti Retro | Pine Desk Software