Question about Smoke Test Dataset and WER Calculation

Hi,

Is the smoke test evaluated on the audio files that are available for download from the competition website? I was only able to find a little over 2,000 audio files.

I used those files together with the provided score.py script to calculate the WER locally, but the result is quite different from the WER shown on the website. Is this expected, or could it be that I downloaded the wrong dataset?

Hi @huix.c,

The smoke test is evaluated on audio files from the training data. The training data consist of two corpora: one hosted by DrivenData and the other hosted on TalkBank. Instructions for accessing the TalkBank data are available on the data download page. You’ll need both datasets to reconstruct the smoke test data. If you scored only one of them, your local WER was computed on a subset of the smoke test files, so it would not be expected to match the WER shown on the website.

Good luck!
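
For anyone comparing a local number against the website, here is a minimal sketch of corpus-level WER (total word errors divided by total reference words). It is not the competition's score.py: the function names `edit_distance` and `corpus_wer` are hypothetical, and it assumes plain whitespace tokenization with no text normalization, which the official script may handle differently. It only illustrates why scoring a subset of the utterances can give a different WER than scoring the full set.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def corpus_wer(references, hypotheses):
    """Total word errors over total reference words (assumed definition)."""
    errors = 0
    words = 0
    for ref, hyp in zip(references, hypotheses):
        ref_words = ref.split()
        errors += edit_distance(ref_words, hyp.split())
        words += len(ref_words)
    return errors / words


# Made-up transcripts, for illustration only.
refs = ["the cat sat", "a dog barked loudly"]
hyps = ["the cat sat", "a dog bark"]

full_wer = corpus_wer(refs, hyps)            # scored on both utterances
subset_wer = corpus_wer(refs[:1], hyps[:1])  # scored on the first utterance only
print(full_wer, subset_wer)
```

The two numbers differ even though the transcripts themselves are unchanged, which is the same effect as scoring only part of the smoke test data.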

Could you review the pull request, please?

@cszc Did your team get a chance to review the PRs?

@oknaitik Sorry for the delay. I was able to review and deploy today.
