I collected the phonetic characters from the main and extra datasets and ran the scoring script to check that they are valid (they are).
Here is what I found:
| phonetic_chars | count |
|---|---|
| | 297923 |
| i | 111898 |
| n | 91048 |
| ɑ | 77628 |
| ə | 74327 |
| t | 71591 |
| d | 70436 |
| ɪ | 66453 |
| s | 59844 |
| ɹ | 53693 |
| ɛ | 49516 |
| w | 49466 |
| k | 47502 |
| æ | 46985 |
| ʌ | 45928 |
| ʊ | 40687 |
| b | 38208 |
| ɚ | 36146 |
| l | 35914 |
| m | 34477 |
| u | 33463 |
| o | 33289 |
| ð | 31564 |
| e | 31039 |
| h | 30903 |
| f | 29044 |
| p | 27322 |
| g | 25821 |
| ː | 24677 |
| z | 24175 |
| j | 18394 |
| ɔ | 15661 |
| ŋ | 12252 |
| v | 10185 |
| θ | 9506 |
| ʃ | 9022 |
| ʧ | 8903 |
| ɫ | 8134 |
| ɾ | 6571 |
| ʤ | 6301 |
| ʔ | 5670 |
| ɐ | 2174 |
| ʝ | 436 |
| ʁ | 179 |
| c | 176 |
| ʒ | 161 |
| x | 154 |
| ɬ | 117 |
| ç | 104 |
| ɟ | 100 |
| χ | 17 |
| r | 2 |
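For reference, here is a minimal sketch of how such counts can be collected. The `transcriptions` list is a hypothetical stand-in for the IPA strings loaded from the main and extra datasets; the actual loading code is omitted.

```python
from collections import Counter

# Hypothetical stand-in for the IPA transcriptions loaded from the datasets.
transcriptions = ["ɹæbɪt", "wæbɪt"]

# Count every Unicode character across all transcriptions
# (one character = one phone in this track's format).
counts = Counter()
for t in transcriptions:
    counts.update(t)

# Print characters from most to least frequent, like the table above.
for char, n in counts.most_common():
    print(char, n)
```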
Can I expect only these characters, or should I expect more? If another character can appear, what kind of character would it be? Or should some of these be removed or replaced?
I understand that the distribution may vary, but I need to know which character inventory to expect. I think this matters for everyone, since it defines the format we should plan for going forward.
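To make the question concrete, this is the kind of check I would like to be able to rely on. It is only a sketch: `EXPECTED` is built from the characters in my table above (it may be incomplete, which is exactly what I am asking about), and `unexpected_chars` is a hypothetical helper name.

```python
# Inventory of characters observed in my counts above (assumption: this
# list may be incomplete; that is the open question).
EXPECTED = set("inɑətdɪsɹɛwkæʌʊbɚlmuoðehfpgːzjɔŋvθʃʧɫɾʤʔɐʝʁcʒxɬçɟχr")

def unexpected_chars(transcription: str) -> set:
    """Return the characters of a transcription not in the observed inventory."""
    return set(transcription) - EXPECTED

# A transcription made only of observed characters passes the check.
print(unexpected_chars("ɹæbɪt"))  # → set()
```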
In the Phonetic Track, are we supposed to predict the exact phonemes produced? For example, if a child with a speech impediment says "Wabbit" (/wæbɪt/) instead of "Rabbit" (/ɹæbɪt/), do we predict "Rabbit" (/ɹæbɪt/) here?
The ground truth labels for the Phonetic Track are normalized phonetic transcriptions of individual utterances using the International Phonetic Alphabet (IPA), with a one-to-one mapping between Unicode characters and phones. Each transcription captures the full sequence of speech sounds in the corresponding audio clip and may include substitutions, omissions, or non-standard productions that are typically ignored in word-level ASR.