Inappropriate evaluation metric

Dear Organizers,

I am delighted to have the opportunity to participate in this competition, and I extend my sincere appreciation to you for organizing such a significant challenge.

However, I would like to bring to your attention a concern regarding the current evaluation metric used for the leaderboard. For heavily imbalanced binary classification tasks, log loss may not accurately reflect the performance of the models being evaluated.

Given the computational and medical complexity of this task, highly accurate models may not be feasible. Confidently wrong predictions, which are very likely to occur, are penalized heavily by log loss and can make the score unreliable. Moreover, the output probabilities of the models, calibrated or not, are likely to hover around the ratio of the two classes, which turns the task into a distribution-fitting exercise and undermines the scientific value of the entire investigation. A rough sketch of both effects is given below.
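As an illustration, here is a minimal sketch of how log loss behaves on a heavily imbalanced binary task, assuming scikit-learn is available; the class ratio, sample size, and predictions are invented for illustration and are not taken from the challenge data. A constant base-rate prediction already scores well, while a handful of confident mistakes noticeably inflates the score.

```python
import numpy as np
from sklearn.metrics import log_loss

rng = np.random.default_rng(0)
n, pos_rate = 1_000, 0.02                 # hypothetical ~2% positive class
y = (rng.random(n) < pos_rate).astype(int)

# A "model" that ignores the inputs entirely and always predicts the base rate.
p_base = np.full(n, y.mean())
print("constant base-rate prediction:", log_loss(y, p_base))

# The same prediction, except ten confident mistakes on negative cases.
p_overconfident = p_base.copy()
p_overconfident[np.where(y == 0)[0][:10]] = 0.999
print("with 10 confident false positives:", log_loss(y, p_overconfident))
```

The constant predictor lands near the entropy of the class prior without distinguishing any cases, and the ten confident errors alone add a large fraction of that baseline.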

Furthermore, since the class distributions of both leaderboard sets (public and private) are unknown, fitting the distribution is even more pointless. The private leaderboard will simply favor a submission that happened to guess the distribution correctly, which is not a desirable outcome in a competition of this significance. The sketch below illustrates this.
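A minimal sketch of the prior-guessing issue, again with invented prevalence values and assuming scikit-learn: three equally uninformative constant submissions receive different log losses depending only on how close their constant is to the hidden test-set prevalence.

```python
import numpy as np
from sklearn.metrics import log_loss

rng = np.random.default_rng(1)
n, true_prevalence = 2_000, 0.03          # hypothetical hidden prevalence of the private set
y_private = (rng.random(n) < true_prevalence).astype(int)

# None of these "submissions" looked at a single image; the one whose guess
# happens to match the hidden prevalence gets the best score.
for guess in (0.01, 0.03, 0.10):
    score = log_loss(y_private, np.full(n, guess))
    print(f"constant prediction {guess:.2f}: log loss = {score:.4f}")
```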

Therefore, I kindly request that you consider changing the evaluation metric to one that does not depend on the class distribution, such as ROC AUC. Additionally, in a medical prediction task like this, it is essential to choose a metric that penalizes false negatives; see the sketch below.
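To make the suggestion concrete, here is a minimal sketch of the two properties I have in mind, using scikit-learn; the labels, scores, and decision threshold are purely illustrative. ROC AUC depends only on the ranking of the scores, so rescaling predictions toward a guessed prevalence does not change it, and an F-beta score with beta greater than 1 weights recall more heavily, i.e. it penalizes false negatives.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, fbeta_score

y_true = np.array([0, 0, 0, 0, 0, 1, 1, 1])
scores = np.array([0.10, 0.20, 0.15, 0.50, 0.25, 0.70, 0.60, 0.40])

# ROC AUC is unchanged by any monotone rescaling of the scores.
print("ROC AUC:", roc_auc_score(y_true, scores))
print("ROC AUC, rescaled scores:", roc_auc_score(y_true, scores * 0.05))

# F2 weights recall twice as much as precision, so missed positives hurt more.
y_pred = (scores >= 0.35).astype(int)     # hypothetical decision threshold
print("F2 score:", fbeta_score(y_true, y_pred, beta=2))
```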

Thank you for your attention to this matter, and I look forward to your response.

Reference: Gaurav Dembla, "Intuition behind Log-loss score. In Machine Learning, classification…", Towards Data Science.


Hi @zsolt.bedohazi, thanks for your message! From a fairness perspective, we don’t change metrics in the middle of a competition. That said, a lot of thought and consideration went into the choice of the metric. There are a number of tradeoffs involved and we determined log loss to be the best suited for the challenge.


Hi, thank you for confirming this and for presenting your arguments; it is fully understandable.

In addition, our team is interested in the following statement on the challenge webpage:

“The challenge organizers intend to make the collection of WSI data available online after the competition for ongoing improvement.”

Can you confirm whether this is still your plan, and if so, from which date (right after the competition ends?) would it be feasible to publish additional research on this subject?

“The challenge organizers intend to make the collection of WSI data available online after the competition for ongoing improvement.” Can you confirm whether this is still your objective?

Yes, VisioMel is planning to do this, but it will not be instantaneous. Once that is done, the data can be used for publication. See similar question here: Data use for proof of concept - #3 by emily