I am delighted to have the opportunity to participate in this competition, and I extend my sincere appreciation to you for organizing such a significant challenge.
However, I would like to bring to your attention a concern I have regarding the current evaluation metric used for the leaderboard. In cases of heavily imbalanced binary classification tasks, the log loss metric may not accurately reflect the performance of the models being evaluated.
Given the computational and medical complexity of this task, highly accurate models may not be achievable. Confidently wrong predictions, which are very likely to occur, are penalized so heavily by log loss that a handful of them can dominate the score and make it unreliable. Moreover, the output probabilities of the models, raw or calibrated, are likely to hover around the class prior (the proportion of positive cases), which undermines the scientific value of the entire investigation by turning the task into a distribution-fitting exercise.
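To make the point about confident errors concrete, here is a minimal sketch in plain Python. The data are made up for illustration (100 cases at 5% prevalence) and are not taken from the competition; `log_loss` is a hand-written helper, not the organizers' scoring code.

```python
import math

def log_loss(y_true, y_prob, eps=1e-15):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

# Hypothetical toy data: 100 cases, 5 positives (5% prevalence).
y_true = [1] * 5 + [0] * 95

# Baseline that ignores the features and predicts the class prior everywhere.
baseline = [0.05] * 100

# A model that separates the classes well on every case.
good_model = [0.9] * 5 + [0.02] * 95

# The same model, except that it confidently misses one positive case.
one_miss = [0.9] * 4 + [0.001] + [0.02] * 95

baseline_loss = log_loss(y_true, baseline)
good_loss = log_loss(y_true, good_model)
miss_loss = log_loss(y_true, one_miss)

# Fraction of the total loss that comes from the single missed case.
missed_case_share = (-math.log(0.001) / len(y_true)) / miss_loss

print(f"constant prior : {baseline_loss:.4f}")
print(f"good model     : {good_loss:.4f}")
print(f"one bad miss   : {miss_loss:.4f} (share from that case: {missed_case_share:.0%})")
```

In this toy setup, a single confident miss out of 100 cases multiplies the model's loss several times over and accounts for most of the total, even though the other 99 predictions are unchanged.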
Furthermore, since the class distribution is unknown for both leaderboard sets (public and private), fitting the distribution is even more pointless. The private leaderboard will simply favor a submission that happened to guess the distribution, which is not a desirable outcome in a significant competition like this.
Therefore, I kindly request that you consider changing the evaluation metric to one that does not depend on matching the class distribution, such as ROC AUC or a similar ranking-based metric. Additionally, in a medical prediction task like this, it is essential to choose a metric that penalizes false negatives.
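As a rough illustration of why a ranking-based metric behaves differently, the sketch below computes ROC AUC, log loss, and recall by hand on made-up scores. All names and numbers are hypothetical. Cubing the scores keeps their order, so ROC AUC is unchanged, while log loss moves noticeably; recall at a chosen threshold counts missed positives directly.

```python
import math

def roc_auc(y_true, scores):
    """Chance that a random positive outscores a random negative (ties count half)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def log_loss(y_true, y_prob, eps=1e-15):
    """Mean binary cross-entropy with clipping to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

def recall(y_true, scores, threshold):
    """Share of true positives that are flagged at the given threshold."""
    tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(y_true, scores) if y == 1 and s < threshold)
    return tp / (tp + fn)

# Hypothetical toy data: 3 positives among 10 cases.
y_true = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
scores = [0.8, 0.6, 0.55, 0.3, 0.25, 0.2, 0.15, 0.1, 0.1, 0.05]

# A monotonic rescaling: same ranking, very different probability values.
squashed = [s ** 3 for s in scores]

print("ROC AUC :", roc_auc(y_true, scores), roc_auc(y_true, squashed))
print("log loss:", log_loss(y_true, scores), log_loss(y_true, squashed))
print("recall at 0.5:", recall(y_true, scores, 0.5))
```

This is not an argument that ROC AUC is the right choice on its own; it only shows that the score no longer depends on guessing the right probability scale, and that a threshold-based measure such as recall makes false negatives visible.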
Thank you for your attention to this matter, and I look forward to your response.
Reference: Gaurav Dembla, "Intuition behind Log-loss score", Towards Data Science.