I collected the phonetic characters from the main and extra datasets and ran the scoring script to check that they are valid (they are).
Here is what I found:
| phonetic_chars | count |
|---|---|
| | 297923 |
| i | 111898 |
| n | 91048 |
| ɑ | 77628 |
| ə | 74327 |
| t | 71591 |
| d | 70436 |
| ɪ | 66453 |
| s | 59844 |
| ɹ | 53693 |
| ɛ | 49516 |
| w | 49466 |
| k | 47502 |
| æ | 46985 |
| ʌ | 45928 |
| ʊ | 40687 |
| b | 38208 |
| ɚ | 36146 |
| l | 35914 |
| m | 34477 |
| u | 33463 |
| o | 33289 |
| ð | 31564 |
| e | 31039 |
| h | 30903 |
| f | 29044 |
| p | 27322 |
| g | 25821 |
| ː | 24677 |
| z | 24175 |
| j | 18394 |
| ɔ | 15661 |
| ŋ | 12252 |
| v | 10185 |
| θ | 9506 |
| ʃ | 9022 |
| ʧ | 8903 |
| ɫ | 8134 |
| ɾ | 6571 |
| ʤ | 6301 |
| ʔ | 5670 |
| ɐ | 2174 |
| ʝ | 436 |
| ʁ | 179 |
| c | 176 |
| ʒ | 161 |
| x | 154 |
| ɬ | 117 |
| ç | 104 |
| ɟ | 100 |
| χ | 17 |
| r | 2 |
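For reference, here is a minimal sketch of how such counts can be collected. The `transcriptions` list is a hypothetical stand-in for the IPA strings loaded from the main and extra datasets; the actual loading code is omitted.

```python
from collections import Counter

# Hypothetical stand-in for the IPA transcriptions loaded from the datasets.
transcriptions = ["ɹæbɪt", "wæbɪt"]

# Count every Unicode character across all transcriptions
# (one character = one phone in this track's format).
counts = Counter()
for t in transcriptions:
    counts.update(t)

# Print characters from most to least frequent, like the table above.
for char, n in counts.most_common():
    print(char, n)
```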
Can I expect only these characters, or should I expect more? If another character can appear, what kind of character would it be? Or should some of these be removed or replaced?
I understand that the distribution may vary, but I need to know which character inventory to expect. I think this matters for everyone, since it defines the format we should plan for going forward.
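To make the question concrete, this is the kind of check I would like to be able to rely on. It is only a sketch: `EXPECTED` is built from the characters in my table above (it may be incomplete, which is exactly what I am asking about), and `unexpected_chars` is a hypothetical helper name.

```python
# Inventory of characters observed in my counts above (assumption: this
# list may be incomplete; that is the open question).
EXPECTED = set("inɑətdɪsɹɛwkæʌʊbɚlmuoðehfpgːzjɔŋvθʃʧɫɾʤʔɐʝʁcʒxɬçɟχr")

def unexpected_chars(transcription: str) -> set:
    """Return the characters of a transcription not in the observed inventory."""
    return set(transcription) - EXPECTED

# A transcription made only of observed characters passes the check.
print(unexpected_chars("ɹæbɪt"))  # → set()
```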
In the Phonetic Track, are we supposed to predict the exact phonemes produced? For example, if a child with a speech impediment says "Wabbit" (/wæbɪt/) instead of "Rabbit" (/ɹæbɪt/), do we predict "Rabbit" (/ɹæbɪt/) here?
The ground truth labels for the Phonetic Track are normalized phonetic transcriptions of individual utterances using the International Phonetic Alphabet (IPA), with a one-to-one mapping between Unicode characters and phones. Each transcription captures the full sequence of speech sounds in the corresponding audio clip and may include substitutions, omissions, or non-standard productions that are typically ignored in word-level ASR.