Dataset split ratio

MikhailK · February 25, 2024, 9:17pm

Dear Organizers,

Is there any information about the public/private split ratio?

Best Regards,
Mikhail

chrisk-dd · February 26, 2024, 6:11pm

You can find information about the public/private split here:

The annotations have been divided into a training dataset with annotations for approximately 200 clinical notes and a test dataset with annotations for approximately 70 clinical notes which your submissions will be evaluated against. Note that there are concepts which appear in the test dataset which do not appear in the training dataset. Your models should appropriately leverage the SNOMED CT clinical terminology and the relationships contained therein to generalize to concepts not seen in the training dataset.

MikhailK · February 26, 2024, 7:49pm

Hi @chrisk-dd,

Thanks for the answer!

However, how are these 70 notes split between the public and private leaderboards?

chrisk-dd · February 26, 2024, 8:09pm

We don’t release any additional information about the split; our advice is to create the best solution possible that doesn’t over-fit the public leaderboard.

kevinr · February 27, 2024, 1:05pm

@chrisk-dd apologies if we missed this earlier, but is the public leaderboard different from the final leaderboard? We understood that the winners on the public leaderboard would be the final winners (apart from possible disqualifications).

I’m asking because you mention public and private splits above, so I feel we are missing something.

Thanks!

chrisk-dd · February 27, 2024, 2:19pm

@kevinr The criteria for the selection of winners are outlined in the rules for the competition, specifically:

For that portion of the Competition evaluated quantitatively, the results will be determined solely by leaderboard ranking on the private leaderboard… Scores displayed on the public leaderboard while the competition is running may or may not be the same as the final scores on the private leaderboard, depending on how samples from the Data are used for evaluation.

kevinr · February 27, 2024, 3:20pm

@chrisk-dd thank you, precisely. You say

may not be the same as the final scores

Do we have any information on whether they will or will not be the same? That would help us understand how much to rely on the scores of the public leaderboard. Without such information, there is a risk of under- or overfitting to that set of data.

The rules also say

This Competition is a challenge of skill and the final results are determined by evaluating a combination of quantitative and qualitative factors, as more fully described on the Competition Website

but we could not find any information about the qualitative factors on the competition website.

chrisk-dd · February 27, 2024, 4:10pm

As I said above, we don’t release any additional information about the split; our advice is to create the best solution possible that doesn’t over-fit the public leaderboard.

There are no qualitative factors for this competition. The quantitative metric for the competition is described on the Problem description page.
