Problems with the training and test datasets

maryamzolnoori · December 4, 2024, 2:02pm

Dear organizer,
We thoroughly evaluated the quality of the training and test datasets. Based on our assessment, approximately 20 audio files in the training dataset and about 5 in the test dataset consist solely of healthcare providers’ voices instructing patients. It appears that patient speech was either not captured during recording or was removed due to technical issues during preprocessing (prior to data sharing). Since predicting cognitive impairment in healthcare providers is neither reasonable nor aligned with the goals of this challenge, would you consider removing these specific audio files from the training and test datasets to minimize potential bias in the reported results?

cszc · December 4, 2024, 10:23pm

@maryamzolnoori Thanks for your careful review of the dataset and for flagging this important data quality issue! After careful consideration, we’ve decided to keep the current dataset as is to avoid disrupting the competition. This issue affects 1% of the test data and we believe the noise is minimal and affects all competitors equally. As a reminder, final rankings will be based on report submissions from the top 15 participants, not just the automated leaderboard.

That said, could you please share the file IDs where you found the healthcare provider-only audio? We will pass on this information to the data provider for future improvements. Please don’t hesitate to reach out if you have any other questions or concerns.
