Train and test data consistency

qahn · September 28, 2024, 4:15pm

Hello, @kwetstone. Could you please confirm whether the test data used for the final evaluation has undergone the same level of cleaning and quality control as the training data? Thanks.

ishashah · September 28, 2024, 5:01pm

Hi @qahn,

Yes, the final test data has been processed in the exact same way as the training data. Let us know if you have any more questions!

MPWARE · September 30, 2024, 8:27pm

You’ve a CV/LB gap, right? Same here.

mananjhaveri · October 1, 2024, 6:59am

Same here. I am seeing ~10% delta which is a bit concerning to me.

cyong · October 2, 2024, 3:06pm

I did find a noticeable number of rows in the training dataset that are seemingly misclassified, especially in the WeaponType1 column (e.g., drowning misclassified as fall, or unknown classified as sharp instrument). Should we assume the same misclassifications in the public/private test data, or has it undergone more scrutiny?

ishashah · October 3, 2024, 7:15pm

Hi @cyong, could you point us to a few of those rows? You can assume that the train and test sets have undergone the same level of data processing.

MPWARE · October 3, 2024, 7:51pm

One question: is the test set split into public and private sets, with the current leaderboard based on the public one? I can’t find the public/private ratio anywhere.

ishashah · October 3, 2024, 8:38pm

Hi @MPWARE, that is correct: the leaderboard is based on the public portion of the test set, and there is a separate private portion that is withheld. We don’t release any additional information about the split; our advice is to create the best solution possible without overfitting the public leaderboard.

cyong · October 3, 2024, 8:55pm

Hi @ishashah,

btkc and cgfg are cases of drowning based on what the notes say, but they are classified as Fall and Unknown respectively.

dsni is classified as sharp instrument but from the notes, it should be unknown as well.

I have more examples if needed.

ishashah · October 10, 2024, 2:10pm

Hi @cyong, thanks for flagging these cases for us. Because this dataset is human-coded, a small amount of human error is present. This is real data from the CDC that has undergone a few rounds of review, so we won’t be updating it unless there is a systematic error throughout the data.

Thanks for the note, and please don’t hesitate to bring up any other concerns you have throughout the competition.

cyong · October 10, 2024, 5:17pm

Thanks @ishashah for the clarification. I do want to raise that, in my opinion, an entire class is coded inconsistently, and I would appreciate further clarification. In WeaponType1, the Blunt Instrument class has 4 note examples. I would classify these 4 notes as Fall, Fall, Vehicle and Vehicle. Since Blunt Instrument is also a minority class in the test set, the effect is not huge, but I would appreciate any clarification.

ishashah · October 14, 2024, 8:54pm

Hi @cyong, thank you for pointing out these specific cases. We will certainly flag these instances of miscoding to the CDC.

Given that these cases are a small proportion of all records, we will be keeping them as-is for the duration of the competition. Since the performance metric takes a micro-averaged F1 score for categorical variables, these rare cases will have a very small impact on performance. Please let us know if you have any other comments!
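To illustrate why a micro-averaged F1 is not very sensitive to a rare class, here is a minimal sketch with made-up numbers. The class counts, the label names, and the `micro_f1` helper are all hypothetical and are not taken from the competition data or its scoring code; the actual metric may combine several variables in ways this sketch does not cover.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP/FP/FN over all classes, then compute F1."""
    tp = fp = fn = 0
    for label in set(y_true) | set(y_pred):
        for t, p in zip(y_true, y_pred):
            if p == label and t == label:
                tp += 1  # predicted this label, and it was right
            elif p == label:
                fp += 1  # predicted this label, but truth was another class
            elif t == label:
                fn += 1  # truth was this label, but prediction missed it
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical data: 1000 records, only 4 of them in a rare class.
y_true = ["Firearm"] * 996 + ["Blunt instrument"] * 4

# A model that gets every common record right and every rare record wrong.
y_pred = ["Firearm"] * 1000

score = micro_f1(y_true, y_pred)
print(f"micro F1 with all rare-class rows wrong: {score:.3f}")
```

In this toy setup, missing all 4 rare rows moves the score from 1.000 to 0.996, because micro-averaging weights each record equally rather than each class equally.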
