Problems with the training and test datasets

Dear organizer,
We thoroughly evaluated the quality of the training and test datasets. Based on our assessment, approximately 20 audio files in the training dataset and about 5 in the test dataset consist solely of healthcare providers’ voices instructing patients. It appears that patient speech was either not recorded during the audio recording process or was removed due to technical issues during the preprocessing stage (prior to data sharing). Since predicting cognitive impairment in healthcare providers is neither reasonable nor aligned with the goals of this challenge, would you consider removing these specific audio files from the training and test datasets to minimize potential biases in reporting the results?


@maryamzolnoori Thanks for your careful review of the dataset and for flagging this important data quality issue! After careful consideration, we’ve decided to keep the current dataset as is to avoid disrupting the competition. This issue affects 1% of the test data, and we believe the noise is minimal and affects all competitors equally. As a reminder, final rankings will be based on report submissions from the top 15 participants, not just the automated leaderboard.

That said, could you please share the file IDs where you found the healthcare provider-only audio? We will pass on this information to the data provider for future improvements. Please don’t hesitate to reach out if you have any other questions or concerns.
