Hi @glipstein and DrivenData community,
I think I may be speaking for many people who have put a lot of time and effort into this challenge. We feel that several teams at the top of the leaderboard may be cheating, for the reasons below, and we wanted to kindly ask how DrivenData is going to proceed.
First of all, if any team has achieved the scores shown on the leaderboard, my very best respect and congratulations.
Let’s analyze the results of phase 2:
Facebook AI benchmark:
- The best Facebook AI benchmark is 0.71 AUC and 0.64 accuracy.
- Human performance is 0.82 AUC and 0.84 accuracy.
Some people are achieving around 0.90 AUC.
That is roughly 20 points more AUC than the best Facebook AI benchmark and 8 points more than human performance.
This would mean superhuman accuracy and a new state of the art.
We think that some teams are getting these SOTA results by overfitting the test set…
When my team was first on the leaderboard months ago, another team asked us for our submissions.csv so they could merge it with theirs and train on that data; of course we reported this behavior to DrivenData and they were disqualified.
We think this kind of behavior may be happening again.
Until all of this is cleared up I do not want to name specific models, but none of the current models (not only those from Facebook AI) that hold SOTA results on multimodal benchmarks can achieve those scores on this dataset, not even close.
With that said, we think some of the following practices, which are prohibited by the official rules and the Data License Agreement, may be occurring. We should be aware that these practices could be hidden in the code in multiple ways.
- Using pseudo-labels on test_seen or test_unseen.
- Building a big ensemble of deliberately unstable (but not random) models and keeping the ones that perform best on the pseudo-labels.
- Exploiting confounders, i.e. making the model explicitly search for repeated pairs of text or images (see the sketch after this list).
- …
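To make concrete what we mean by the confounder exploit, here is a minimal sketch, purely for illustration of the prohibited pattern. It assumes the usual jsonl layout of the challenge data (fields like `id`, `text`, `label`); the file names and the placeholder model score are assumptions, not anyone's actual code.

```python
import pandas as pd

# Hypothetical illustration of the confounder exploit: if a test meme's text
# also appears in the labelled training set, copy that label instead of
# letting the model decide. Field and file names are assumptions.
train = pd.read_json("train.jsonl", lines=True)
test = pd.read_json("test_unseen.jsonl", lines=True)

# Map each training text to its majority label.
text_to_label = train.groupby("text")["label"].agg(lambda s: s.mode()[0])

preds = test.copy()
preds["proba"] = 0.5                      # placeholder for real model scores

# Wherever the test text was already seen in training, override the model's
# prediction with the leaked label.
leaked = preds["text"].map(text_to_label)
preds.loc[leaked.notna(), "proba"] = leaked[leaked.notna()]
```

The same pattern works with image hashes instead of text, which is why we say it could be hidden in the code in multiple ways.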
These pseudo-labels could be created by making lots of submissions during phase 1.
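As a rough sketch of what we mean, a team could average several phase 1 submission files that scored well on the public leaderboard and reuse those probabilities as soft labels for the test set. The file names and the (id, proba) submission format below are assumptions for illustration only.

```python
import glob
import pandas as pd

# Hypothetical sketch of the pseudo-labelling pattern described above:
# average several well-scoring phase 1 submissions into soft test labels.
subs = [pd.read_csv(f).set_index("id") for f in glob.glob("good_phase1_subs/*.csv")]
pseudo = pd.concat(subs, axis=1).mean(axis=1).rename("label").reset_index()

# These soft labels could then be concatenated to the real training data,
# which is exactly the behaviour the rules prohibit.
train = pd.read_json("train.jsonl", lines=True)
augmented = pd.concat([train[["id", "label"]], pseudo], ignore_index=True)
```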
With this in mind we want to ask DrivenData and @glipstein whether they are going to check for this behavior, and how they are going to proceed.
I want to say again that if any team has achieved those results by building a new SOTA multimodal model, they have all my respect and congratulations.
Thanks DrivenData and community.