I collected the phonetic characters from the main and extra datasets and ran the scoring script to check that they are valid (they are). Here is what I found:
| phonetic_chars | count |
|---|---:|
| *(blank)* | 297923 |
| i | 111898 |
| n | 91048 |
| ɑ | 77628 |
| ə | 74327 |
| t | 71591 |
| d | 70436 |
| ɪ | 66453 |
| s | 59844 |
| ɹ | 53693 |
| ɛ | 49516 |
| w | 49466 |
| k | 47502 |
| æ | 46985 |
| ʌ | 45928 |
| ʊ | 40687 |
| b | 38208 |
| ɚ | 36146 |
| l | 35914 |
| m | 34477 |
| u | 33463 |
| o | 33289 |
| ð | 31564 |
| e | 31039 |
| h | 30903 |
| f | 29044 |
| p | 27322 |
| g | 25821 |
| ː | 24677 |
| z | 24175 |
| j | 18394 |
| ɔ | 15661 |
| ŋ | 12252 |
| v | 10185 |
| θ | 9506 |
| ʃ | 9022 |
| ʧ | 8903 |
| ɫ | 8134 |
| ɾ | 6571 |
| ʤ | 6301 |
| ʔ | 5670 |
| ɐ | 2174 |
| ʝ | 436 |
| ʁ | 179 |
| c | 176 |
| ʒ | 161 |
| x | 154 |
| ɬ | 117 |
| ç | 104 |
| ɟ | 100 |
| χ | 17 |
| r | 2 |

(The character in the first row does not render in the dump; it may be a space or other whitespace character.)
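For reference, here is a minimal sketch of how such counts can be produced. The helper name and the sample transcriptions are my own; I'm assuming one IPA transcription per string:

```python
from collections import Counter

def count_phonetic_chars(transcriptions):
    """Count every Unicode character across an iterable of IPA transcriptions."""
    counts = Counter()
    for line in transcriptions:
        counts.update(line)  # Counter.update over a string counts its characters
    return counts

# Hypothetical example data: two short transcriptions.
counts = count_phonetic_chars(["ɹæbɪt", "wæbɪt"])
for char, n in counts.most_common():
    print(char, n)
```

Note that `Counter` counts Unicode code points, not phones: a length mark like ː or a tie bar would show up as its own row, which is consistent with the table above.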
Can I expect only these characters, or should I expect others? If another character can appear, what kind would it be? Or should some of these be removed or replaced?

I understand that the distribution may vary, but I need to know which kinds of characters to expect. I think this matters for everyone: it is about the output format we should plan for going forward.
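In case it is useful to others, this is the kind of sanity check I run. The allowed set below is just the inventory from my counts (plus a space for the blank row, which I assume is whitespace), not an official list:

```python
# Characters observed in my dump of the main and extra datasets.
# This is NOT an official inventory; the space stands in for the blank row,
# which I assume is a whitespace separator.
ALLOWED = set("inɑətdɪsɹɛwkæʌʊbɚlmuoðehfpgːzjɔŋvθʃʧɫɾʤʔɐʝʁcʒxɬçɟχr ")

def unexpected_chars(transcription):
    """Return the set of characters not in the observed inventory."""
    return set(transcription) - ALLOWED

print(unexpected_chars("ɹæbɪt"))  # → set()
print(unexpected_chars("ʕæbɪt"))  # ʕ is outside the observed inventory
```

If the organizers confirm a different inventory, only the `ALLOWED` set needs to change.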
In the phonetic track, are we supposed to predict the exact phonemes? For example, a child with a speech impediment says "wabbit" (/wæbɪt/) instead of "rabbit" (/ɹæbɪt/). We predict "rabbit" (/ɹæbɪt/) here, right?
The ground truth labels for the Phonetic Track are normalized phonetic transcriptions of individual utterances using the International Phonetic Alphabet (IPA), with a one-to-one mapping between Unicode characters and phones. Each transcription captures the full sequence of speech sounds in the corresponding audio clip and may include substitutions, omissions, or non-standard productions that are typically ignored in word-level ASR.
All available information about the phonetic track test set distribution is provided in the problem description. We encourage participants to focus on building robust, generalizable models rather than overfitting to specific conditions.