Dataset split ratio

Dear Organizers,

Is there any information about the public/private split ratio?

Best Regards,
Mikhail

Hi @MikhailK-

You can find information about the public/private split here:

The annotations have been divided into a training dataset with annotations for approximately 200 clinical notes and a test dataset with annotations for approximately 70 clinical notes which your submissions will be evaluated against. Note that there are concepts which appear in the test dataset which do not appear in the training dataset. Your models should appropriately leverage the SNOMED CT clinical terminology and the relationships contained therein to generalize to concepts not seen in the training dataset.

Hi @chrisk-dd,

Thanks for the answer!

However, how are these 70 notes split between the public and private leaderboards?

We don’t release any additional information about the split; our advice is to create the best solution possible that doesn’t over-fit the public leaderboard.

@chrisk-dd apologies if we missed this earlier, but is the public leaderboard different from the final leaderboard? We understood that the winners on the public leaderboard are the final winners (apart from possible disqualifications).

I’m asking because you mention public and private splits above, so I feel we are missing something.

Thanks!

@kevinr The criteria for the selection of winners are outlined in the rules for the competition, specifically:

For that portion of the Competition evaluated quantitatively, the results will be determined solely by leaderboard ranking on the private leaderboard… Scores displayed on the public leaderboard while the competition is running may or may not be the same as the final scores on the private leaderboard, depending on how samples from the Data are used for evaluation.

@chrisk-dd precisely. You say:

may not be the same as the final scores

Do we have any information on whether they will or will not be the same? That would help us understand how much to rely on the scores on the public leaderboard. Without such information, there is a risk of under- or overfitting to that set of data.

The rules also say:

This Competition is a challenge of skill and the final results are determined by evaluating a combination of quantitative and qualitative factors, as more fully described on the Competition Website

but we could not find any information about the qualitative factors on the competition website.

As I said above, we don’t release any additional information about the split; our advice is to create the best solution possible that doesn’t over-fit the public leaderboard.

There are no qualitative factors for this competition. The quantitative metric for the competition is described on the Problem description page.