Phase 2 Submissions Cheating?

Hi @glipstein and DrivenData community,

I think I may be speaking for a lot of people who have put a lot of time and effort into this challenge. We feel that many teams at the top of the leaderboard may be cheating, for the reasons below, and we wanted to kindly ask how DrivenData is going to proceed.

First of all, if any team has genuinely achieved the scores shown on the leaderboard, they have my very best respect and congratulations.

Let’s analyze the results of phase 2.
Facebook AI benchmark:

  • The best Facebook benchmark is 0.71 AUC and 0.64 accuracy.
  • Human performance is 0.82 AUC and 0.84 accuracy.

Some teams are achieving around 0.90 AUC.

That is roughly 20 AUC points above the best Facebook AI benchmark and 8 AUC points above human performance.

That would mean superhuman performance and a new state of the art.

We think that some teams are getting SOTA results by overfitting the test set…

When my team was first on the leaderboard months ago, another team asked us for our submissions.csv so they could merge it with theirs and train on that data. Of course we reported this behavior to DrivenData, and they were disqualified.

We think this behavior may be happening again.

Until all of this is cleared up I do not want to name any specific models, but none of the current models (not only from Facebook AI) that have SOTA results on multimodal benchmarks can achieve those scores on this dataset, not even close.

That said, we think some of the following practices, which are prohibited by the official rules and the Data License Agreement, may be occurring. We should be aware that these practices could be hidden in the code in multiple ways.

  • Using pseudo-labels on test_seen or test_unseen.
  • Building a large ensemble of deliberately unstable (though not random) models, and keeping the ones that perform best on the pseudo-labels.
  • Exploiting confounders by making the model explicitly search for repeated pairs of text or images.

These pseudo-labels could be created by making many submissions in phase 1.
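To make the alleged practice concrete, here is a purely illustrative sketch of what pseudo-labeling from test-set predictions looks like. The column names (`id`, `proba`) and the confidence threshold are my own assumptions, not anyone’s actual code; this is the kind of thing the rules ended up banning, not a recommendation:

```python
import pandas as pd

def build_pseudo_labels(test_preds: pd.DataFrame, threshold: float = 0.95) -> pd.DataFrame:
    """Turn confident test-set predictions into pseudo 'labels'.

    Rows where the model is very confident (proba >= threshold, or
    proba <= 1 - threshold) are kept, and the rounded prediction is
    treated as if it were a ground-truth label to train on.
    """
    confident = test_preds[
        (test_preds["proba"] >= threshold) | (test_preds["proba"] <= 1 - threshold)
    ].copy()
    confident["label"] = (confident["proba"] >= 0.5).astype(int)
    return confident[["id", "label"]]
```

Repeated leaderboard submissions in phase 1 would let a team calibrate which of these guesses are reliable, which is why this amounts to fitting the test set rather than building a better model.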

With this in mind, we want to ask DrivenData (@glipstein) whether they are going to check for this behavior, and how they are going to proceed.

I want to say again that if any team has achieved those results by building a new SOTA multimodal model, they have all my respect and congratulations.

Thanks DrivenData and community.


Thanks @VictorCallejas. The top solutions will be checked for adherence to the challenge rules as the verification process is carried out. As you know from the previous phase, any detected cheating will result in disqualification.


I’d like to offer an explanation of the performance here; this is not about private sharing.

A simple model like the ones shown in the paper cannot work well, because the dataset confuses your model with knowledge that lies beyond any single sample.

The data follows a clear rule: if two samples share the same image (or text) but have completely different text (or image), and your model predicts one is hateful, then with high probability the other is not. We could do many things with this, such as designing a pair-wise model or reconstructing the submission file, to improve the ROC-AUC.
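As an illustration of the rule described above, a submission file could be post-processed pair-wise. This is only a hypothetical sketch — the column names (`text_hash`, `proba`) and the margin are assumptions of mine, not any team’s actual method:

```python
import pandas as pd

def push_apart_pairs(sub: pd.DataFrame, margin: float = 0.4) -> pd.DataFrame:
    """Post-process a submission by pushing apart scores of samples
    that share the same text hash.

    If exactly two rows share a text, the dataset's construction suggests
    they form a hateful/benign confounder pair, so the higher-scored one
    is pushed up and the lower-scored one pushed down around their mean.
    """
    sub = sub.copy()
    for _, group in sub.groupby("text_hash"):
        if len(group) == 2:
            hi, lo = group["proba"].idxmax(), group["proba"].idxmin()
            mid = group["proba"].mean()
            sub.loc[hi, "proba"] = min(1.0, mid + margin / 2)
            sub.loc[lo, "proba"] = max(0.0, mid - margin / 2)
    return sub
```

Because ROC-AUC only depends on the ranking of scores, separating the members of true confounder pairs like this can raise the metric even when the underlying model has not improved at all.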

Most competitions allow post-processing, pre-processing, and pseudo-labels. So do you think it is cheating or not?

I suggest DrivenData check each top team’s solution, including where their pseudo-label data came from. I believe the ways we each used the rule above are totally different.


Pseudo-labeling was clearly banned days before the competition closed, and we did not use it in our final solution. So please, @VictorCallejas and @bahushruth, do not say or suggest that we did.

As I said in the other post, and as @qqerret says, it is possible to exploit repeated-image and repeated-text information to improve model predictions. This is not against the rules of the challenge; if you think I’m wrong, please point me to the rule.

I would love to see a phase 3 of the competition with:

  • A really unseen test set, without repeated sentences or images
  • A simple rule that forces all predictions to be independent. This is typically done in speaker recognition challenges, for example: “Participants agree to process each trial independently. That is, each decision for a trial is to be based only upon the specified test segment and target speaker enrollment data. The use of information about other test segments and/or other target speaker data is not allowed.”

I am not naming any names; I am just describing some techniques and practices.

@ironbar I am not suggesting that you are using it, I just do not know, but maybe other teams do.

Regarding the explicit search for confounders… I would love to see someone from Facebook AI in this thread and hear what they think.

I think this is deeply disrespectful to the 3,173 participants of the competition, and especially to the 39 teams that managed to improve on the benchmarks.

As FacebookAI said:

The Hateful Memes Challenge and Data Set is a competition and open source data set designed to measure progress in multimodal vision-and-language classification.

As @bahushruth said in another thread:

This is just my opinion and I don’t expect anyone else to share my opinion but I do not think that this was the objective of the challenge. The goal of the challenge was to improve multimodal models to identify multimodal hate, not to exploit the data to achieve superhuman accuracy. The dataset was obviously made specifically for this challenge and based on the way it was made, it can be exploited easily.

In the Facebook AI paper, in sections 2.3 to 2.5:

…we collected confounders (a.k.a.,“contrastive” [22] or “counterfactual” [38] examples). This allows us to address some of the biases that machine learning systems would easily pick up on.

Address some of the biases… and yet some teams are explicitly searching for the pairs…

These benign confounders make the dataset more challenging.

That’s the whole point: to improve current multimodal models, even if only a little, not to achieve SOTA and superhuman scores by exploiting a data leakage. In my opinion it is not even a data leakage; it is the whole purpose of this competition and dataset.

In addition to the “seen” test set described above, we will organize a NeurIPS competition whose winners will be determined according to performance on a to-be-released “unseen” test set.

So at NeurIPS 2020 we will see teams achieving SOTA and superhuman accuracy on multimodal problems by exploiting how the dataset was explicitly constructed?

Like spam and other adversarial challenges, the problem of hateful content will continue to evolve. By providing a data set expressly made to help researchers tackle this problem, along with a common benchmark and a community competition, we are confident that the Hateful Memes Challenge will spur faster progress across the industry in dealing with hateful content and help advance multimodal machine learning more broadly.

Do you still think that what Facebook wants is for the dataset to be exploited in the very way it was explicitly constructed?

These are just citations and my opinion.

I would appreciate if @glipstein or someone from FacebookAI could clarify this.

I share your frustration, Victor, having spent a lot of time squeezing what I could from the training set and getting nowhere near the top scores. And I agree that exploiting a data leakage in the test set goes against the spirit of the contest. However, I also recognise that this was structured as a competition with substantial rewards, and you can’t blame people for doing what they can to win, as long as it doesn’t break any explicitly stated rules.
Ultimately it’s on the organisers to set up the incentives correctly.


@slawekbiel thanks, that’s my point.

I just think that this is not a data leakage; it is the whole competition.

expressly made to help researchers tackle this problem

We were all aware of how the dataset was constructed, and even of its distribution, which is in section 2 of the paper.

But I think that, out of the 3,173 participants, some teams decided to try their luck and see if it would slip through.

It is very unfair and sad for participants who spent six months improving multimodal models to read in the forum that a winning solution was something you tried months ago, which gave you very bad results and was no better than mmf. So I guess you then look for pairs and get superhuman accuracy.

Of course this is my opinion, and I do not pretend to know what the competition’s purpose or rules are. That is up to DrivenData and Facebook AI.

I find it particularly sad that some of the top participants are trying to justify what sounds like deliberate overfitting, through what they loosely call pseudo-labeling using explicit leaderboard feedback, while claiming their methods are close to respected academic papers that do a completely different thing!

These kinds of practices have no legs to stand on in the real world, except in Kaggle-like competitions. In my opinion, this is equivalent to manually labeling the test dataset and then saying you trained thousands of models and did a month-long grid search to end up with those specific model parameters…

That being said, I’d like to congratulate the top teams and cannot wait to hear them explain their methods at the NeurIPS workshop. I’d also like to strongly encourage them to dispel any suspicions or misinterpretations regarding their techniques by publishing their code and giving instructions to fully reproduce their impressive results, including of course their seeds.

Superhuman performance is not something easily overlooked by top researchers :slight_smile:


In my opinion, the way this competition was set up wasn’t the best for the goal of improving multimodal algorithms.
They should have kept a final test set hidden, against which the models would be evaluated after the end of the submission phase. That was the approach they followed in the fake-video-detection Kaggle competition.
That not being the case, anything that doesn’t break the rules is fair game, because everyone is free to use it, and after all this is a competition, not a university research project.
In the end, the ones that got the best results without breaking any of the rules are the winners, and that’s the way it is :slight_smile:
I just wish I had thought of some of the approaches mentioned (i.e. the ones not breaking the rules :wink: ) so that I could be one of them :slight_smile:
Congratulations to all the top 5!

Not breaking the rules?? This goes against the very definition of the competition.

Competition Entities reserve the right at any time to disqualify a Submission from a Competition where, acting in good faith, it believes there are reasonable grounds to warrant disqualification. For example, Competition Entities’ determination that the Submission does not provide the functionality described or required, or the Submission appears to be purposely designed to circumvent these Competition Rules or the spirit of the Competition would be grounds for disqualification.

And therefore, yes, also against the rules.

Although this is my opinion, and the Competition Sponsor (Facebook AI, @douwekiela) and Competition Organizer (DrivenData, @glipstein) will determine that.


@VictorCallejas You are clearly not happy with the outcome of this competition, and you hoped you would win.
Me too, I also hoped (“wished”) I would win, but I didn’t, and there’s nothing I can do about it. I’m just looking forward to seeing what the best rules-compliant solution to this problem was, so that I can learn something from it.
Winning is nice, but learning is invaluable.

Let’s wait for the organizers’ analysis of the top scores and see whether or not they broke the rules. If they broke the rules, they will be disqualified; if they didn’t, then they won. I have my opinion about what breaks the rules, you have yours, and everyone has their own, but in the end the only opinion that matters is the organizers’ :wink:

The best ML models and methods find the patterns of the dataset.
The methodology works in the real world too, perhaps for different patterns. The pattern lies not only in a single sample, but also across the dataset.
Finding the best way to fit the dataset is the spirit of engineering.

Just my opinion.

Exactly! Facebook disqualified the top team in the deepfakes challenge just for using external data that did not comply with their rules. I don’t see any reason why Facebook would be OK with people clearly exploiting the way the data was created.

I would genuinely stand by anyone who can show that their approach achieves the same performance in the real world. I doubt anyone in the real world uploads memes in pairs containing hateful and non-hateful content.


Of course, I think exactly as you do. As I have said:

But I am not mad that I lost; I am happy and glad about everything I have learned and the work I have done. I improved the Facebook benchmarks, as did @burebista.

I am mad that a few teams achieved superhuman accuracy by doing the opposite of what the competition was about, showing zero respect for the other 3,173 participants of the competition.

And it seems to me it is exactly like this:

Competition Entities’ determination that the Submission does not provide the functionality described or required, or the Submission appears to be purposely designed to circumvent these Competition Rules or the spirit of the Competition would be grounds for disqualification.


Not really sure what you meant by

after all this is a competition not a university research project.

Following the rules and maintaining proper ethical practices while trying to improve on the given objective of the task should not be limited to the university level :wink: (Not implying anyone is doing this; just justifying my point regarding what you said.) :hugs: :hugs: :hugs:

I understand that this is a competition and people will try to find any way they can to win. That is OK. It is important to speak up when you find that something is wrong, not because one is “sad” about not winning the competition, but because that’s the right thing to do. Otherwise you just normalize such practices. (Again, not implying anyone is cheating.)

Anyone who followed the rules of the competition and came up with new ideas can also try seeing where their model lands on the VQA benchmark. Not exactly a loss in my book. I would just like to see proper results and an accurate representation of everyone’s work on the leaderboard. Wouldn’t you like that?

I totally agree that finding the best solution is the spirit of engineering, but that does not necessarily mean finding loopholes in the given objective.

Let me give you a fun example. Do you watch Formula 1?
If you do, you may be aware of the “Fuel Flow Gate” incident. If not: the FIA regulates the fuel flow of every race car (to regulate speed and give every team a fair chance), and to monitor this fuel flow they mounted non-intrusive sensors on every car.

One sure way to increase performance is to increase the fuel flow into the engine, and Ferrari’s engineers did exactly that. They tricked the sensors by delivering extra fuel outside the sensors’ sampling rate and gained a lot of performance.

Was this amazing engineering?

Was it anywhere mentioned that you couldn’t do that?

Was it still unethical?

Was it the spirit of engineering?
I don’t think so

I know it’s really funny and doesn’t even make sense to compare F1 to this challenge, but look at Ferrari’s pace now compared to a team like Renault (which is third in the constructors’ championship).

I feel the spirit of engineering should not be mistaken for unethical practices (BY FERRARI, AND NOT ANYONE FROM THIS CHALLENGE).

In case you want to go through it, here is the link to the article: Fuel Flow Gate

Another similar example would be how Volkswagen engineers designed systems to cheat during emission tests.

Absolutely agree with you. :raised_hands:
But I don’t think the participants in this competition should be blamed. It seems like there wasn’t any sensor on the cars.


I would love to see an unseen test set as well. Disclaimer: our model was trained on the training data only! But we joined in late September and didn’t have time to fully experiment with the ideas we had. There are some clauses about derived datasets that need to be clarified. Cheers,

Absolutely! Anyway, it ultimately comes down to the people who organized the competition to decide what’s allowed and what’s not. No amount of justification for exploiting the data will change the fact that it’s still exploitation. Let’s just see what the organizers have to say.