Phase 2 Submissions Cheating?

I find it particularly sad that some of the top participants are trying to justify what sounds like deliberate overfitting, through what they loosely call pseudo-labeling using explicit leaderboard feedback, while claiming their methods are close to respected academic papers that do a completely different thing!

These kinds of practices have no legs to stand on in the real world, only in Kaggle-like competitions. In my opinion this is equivalent to manually labeling the test dataset, then saying you trained thousands of models and ran a month-long grid search to end up with those specific model parameters…

That being said, I’d like to congratulate the top teams and cannot wait to hear them explain their methods at the NeurIPS workshop. I’d also like to strongly encourage them to dispel any suspicions or misinterpretations regarding their techniques by publishing their code and giving instructions to fully reproduce their impressive results, including of course their seeds.

Superhuman performance is not something easily overlooked by top researchers :slight_smile:


In my opinion, the way this competition was set up wasn’t the best for the goal of improving multi-modal algorithms.
They should have kept a final test set hidden, against which the models would be evaluated after the end of the submission phase. That was the approach followed in the deepfake video detection Kaggle competition.
Since that’s not the case, anything that doesn’t break the rules is fair game, because everyone is free to use it; after all, this is a competition, not a university research project.
In the end the ones that got the best results without breaking any of the rules are the winners, and that’s the way it is :slight_smile:
I just wish I had thought of some of the approaches mentioned (i.e. the ones not breaking the rules :wink: ) so that I could be one of them :slight_smile:
Congratulations to all the top 5!

Not breaking the rules?? This goes against the very definition of the competition.

Competition Entities reserve the right at any time to disqualify a Submission from a Competition where, acting in good faith, it believes there are reasonable grounds to warrant disqualification. For example, Competition Entities’ determination that the Submission does not provide the functionality described or required, or the Submission appears to be purposely designed to circumvent these Competition Rules or the spirit of the Competition would be grounds for disqualification.

And therefore, yes, also against the rules.

Although this is just my opinion; the Competition Sponsor (Facebook AI, @douwekiela) and the Competition organizer (DrivenData, @glipstein) will make that determination.


@VictorCallejas You are definitely not happy about the outcome of this competition, and you hoped you would win.
Me too, I also hoped (“wished”) I would win, but I didn’t, and there’s nothing I can do about it. I’m just looking forward to seeing the best rules-compliant solution to this problem, so that I can learn something from it.
Winning is nice, but learning is invaluable.

Let’s wait for the organizers’ analysis of the top scores and see whether or not they broke the rules. If they broke the rules they will be disqualified; if they didn’t, then they won. I have my opinion about what breaks the rules, you have yours, and everyone has their own, but in the end the only opinion that matters is the organizers’ :wink:

The best ML model and method can find the patterns of the dataset.
The methodology works in reality too, perhaps for different patterns. The patterns lie not only in the single sample, but also across the dataset.
Finding the best way to fit the dataset is the spirit of engineering.

Just my opinion.

Exactly! Facebook disqualified the top team during the deep fakes challenge just for using external data that did not comply with their rules. I don’t see any reason why Facebook would be ok with people clearly exploiting the way the data was created.

I would literally stand by anyone who can justify that their approach would work with the same performance in the real world. I doubt anyone in the real world uploads memes as pairs containing hateful and non-hateful content.


Of course, I think exactly the same as you. As I have said:

But I’m not mad that I lost; I’m happy and glad about everything I have learned and the work I have done. I improved on the FB benchmarks, as did @burebista.

I am mad that a few teams achieved superhuman accuracy by doing the contrary of what the competition was about, showing zero respect for the other 3,173 participants.

And it seems to me it is exactly like this:

Competition Entities’ determination that the Submission does not provide the functionality described or required, or the Submission appears to be purposely designed to circumvent these Competition Rules or the spirit of the Competition would be grounds for disqualification.


Not really sure what you meant by

after all this is a competition not a university research project.

Following the rules and having proper ethical practices while trying to improve on the given objective of the task within those rules should not be limited to the university level :wink: (Not implying anyone is doing this. Just justifying my point in response to what you said) :hugs: :hugs: :hugs:

I understand that this is a competition and people will try to find any ways they can to win. That is ok. It is important to speak up when you find that something is wrong not because one is “Sad” about not winning the competition but because that’s the right thing to do. Else you just normalize such practices. (Again not implying anyone is cheating.)

Anyone who followed the rules of the competition and came up with new ideas can also try seeing where their model lies on the VQA benchmark. Not exactly a loss in my book. I’d just like to see proper results and an accurate representation of everyone’s work on the leaderboard. Wouldn’t you like that?

I totally agree with finding the best solution is the spirit of engineering but that does not necessarily mean finding loopholes in the given objective.

Let me give you a fun example, Do you watch Formula 1?
If you do, you may be aware of the “Fuel Flow Gate” incident. If not, basically the FIA regulates the fuel flow of every race car (to regulate speed and give every team a fair chance), and in order to monitor this fuel flow, they mounted non-intrusive sensors on every car.

One sure way to increase performance is to increase fuel flow into the engine, and Ferrari’s engineers did exactly that. They tricked the sensors by allowing more fuel to flow outside the sensors’ sampling rate and gained a lot of performance.

Was this amazing engineering?

Was it anywhere mentioned that you couldn’t do that?

Was it still unethical?

Was it the spirit of engineering?
I don’t think so

I know it’s really funny and doesn’t even make sense to compare F1 to this challenge, but look at Ferrari’s pace now compared to a team like Renault (which is 3rd in the constructors’ championship).

I feel like the spirit of engineering should not be mistaken for unethical practices (BY FERRARI AND NOT ANYONE FROM THIS CHALLENGE).

In case you want to go through it here is the link to this article Fuel Flow Gate

Another similar example would be how Volkswagen engineers designed systems to cheat during emission tests.

Absolutely agree with you. :raised_hands:
But I don’t think the participants in this competition should be blamed. It seems like there wasn’t any sensor on the cars.


I would love to see an unseen test set as well. Disclaimer: our model was trained on the train data only! But we didn’t have time (we joined in late September) to fully experiment with the ideas that we had. There are some clauses that need to be clarified about derived datasets. Cheers,

Absolutely! Anyway, it finally comes down to the people who organized the competition to decide what’s allowed and what’s not. No amount of justification for exploiting the data will change the fact that it’s still exploitation. Let’s just see what the organizers have to say.

In my opinion, the purpose of creating an AI model is to let the model do the work (automation). If you take a model and a test set, label that test set (whether by human or machine), and then feed it back to the model, you clearly defeat the purpose of creating an AI model. A model fed the labeled test set will of course yield a high accuracy score, because it already has an idea of the test’s answers.

My point is that an AI model should not have any knowledge of the test set (it should learn only from the dev and training sets), because that is the purpose of a test. In the real world, a model will have no idea or clue about the test (the text and image of the social media post a user will make). If a model first needs to be fed labeled test data before it can predict accurately, then in my opinion it’s a failed model.

This is my first competition and I really learned a lot, thanks Driven Data and Facebook AI for this opportunity. Congratulation to everyone!


@ipr999 In this competition pseudo-labeling wasn’t allowed, which was a choice by the organizers.
As competitions are good for learning, here is something new for you to learn:
Pseudo-labeling is not cheating! It is a valid machine learning technique used both in academia and in real-life scenarios. In pseudo-labeling you use the model that you’ve trained on the training data (with a good validation score) to predict your test set, then add the predicted test set as training data, and keep iterating. This has several advantages; one is that it is a form of regularization, which can reduce overfitting.
A search of arXiv will return hundreds of papers describing the advantages of pseudo-labeling.
If you search on Google, you’ll find good explanations of what it is and its advantages.
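To make the iterative scheme concrete, here is a minimal toy sketch of pseudo-labeling with a confidence cutoff. Everything in it is an illustrative assumption, not anyone’s actual competition code: the “model” is a trivial 1-D threshold classifier, the logistic confidence score and the 0.9 cutoff are arbitrary choices; the point is the loop of train → predict the unlabeled pool → keep only high-confidence predictions as new labels → retrain.

```python
import math

def train_threshold(xs, ys):
    """Fit a toy 1-D classifier: predict 1 if x >= t, choosing the best t."""
    best_t, best_acc = 0.0, -1.0
    for t in sorted(set(xs)):
        acc = sum((x >= t) == bool(y) for x, y in zip(xs, ys)) / len(xs)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

def predict_proba(t, x, scale=2.0):
    """Confidence score: logistic of the signed distance to the threshold."""
    return 1.0 / (1.0 + math.exp(-(x - t) * scale))

def pseudo_label(train_x, train_y, pool_x, conf=0.9, rounds=3):
    """Iteratively absorb confident pool predictions as extra training data."""
    x, y = list(train_x), list(train_y)
    pool = list(pool_x)
    for _ in range(rounds):
        t = train_threshold(x, y)
        keep = []
        for p in pool:
            prob = predict_proba(t, p)
            if prob >= conf:            # confident positive -> pseudo-label 1
                x.append(p); y.append(1)
            elif prob <= 1 - conf:      # confident negative -> pseudo-label 0
                x.append(p); y.append(0)
            else:                       # not confident enough: leave in pool
                keep.append(p)
        pool = keep
        if not pool:
            break
    return train_threshold(x, y)        # final model, retrained on the union
```

For example, `pseudo_label([-2, -1, 1, 2], [0, 0, 1, 1], [-3, 3])` absorbs both pool points (both are far from the learned threshold, hence high-confidence) and returns the same separating threshold, `1`. The confidence cutoff is what distinguishes this from simply training on your own test predictions wholesale.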

Thanks for the info. But I don’t get the logic of adding the test set to training when we have already been provided with 9,540 samples (dev_seen + dev_unseen + training). Are these samples not enough to create a good model? Do we still need to resort to pseudo-labeling given that we have a huge number of samples from the dev and training sets?

@ipr999 9,540 samples is definitely not enough to train a good model, and in this case it is effectively even fewer, because dev_seen and dev_unseen are almost completely the same. That is actually one of the main problems with this competition: there isn’t enough training data.
The models have so many parameters that they can easily memorize (overfit) the dataset instead of learning how to generalize. You can see that after some epochs the network starts overfitting to the training set (the training loss is almost 0).
In machine learning, more distinct data is always better.
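The overfitting signal described above (training loss near zero while validation stops improving) is commonly handled with early stopping. As a minimal sketch, assuming you record one validation loss per epoch, one common criterion looks like this; the `patience=3` value is an arbitrary illustrative choice:

```python
def should_stop(val_losses, patience=3):
    """Early-stopping check: stop once the validation loss has failed to
    improve on the previous best for `patience` consecutive epochs.
    `val_losses` is the per-epoch validation loss history so far."""
    if len(val_losses) <= patience:
        return False                       # not enough history yet
    best = min(val_losses[:-patience])     # best loss before the window
    # Stop if every epoch in the last `patience` epochs is no better than best.
    return all(v >= best for v in val_losses[-patience:])
```

With a history like `[1.0, 0.8, 0.7, 0.71, 0.72, 0.73]` the check fires (three epochs without beating 0.7), which is exactly the point where the training loss keeps falling but generalization has stalled.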

Just to give you an example of the sizes used to train the standard models, whose weights are used when you choose to fine-tune a pre-trained model:

  • COCO dataset has 200K+ images
  • ImageNet dataset has 14M images

Of course those problems have many more classes, so you would always need a bigger dataset, but if you calculate images per class you’ll see that they have plenty more images than this competition.
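The per-class arithmetic can be made concrete. The class counts here are rough assumptions (about 1,000 classes for the standard ImageNet classification benchmark, 2 classes for this binary hateful/not-hateful task), so treat the numbers as order-of-magnitude only:

```python
# Approximate images available per class in each dataset.
imagenet_per_class = 14_000_000 / 1000   # ~14,000 images per class
memes_per_class = 9_540 / 2              # ~4,770 images per class

# Even per class, ImageNet has roughly 3x the data of this competition.
ratio = imagenet_per_class / memes_per_class
```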


If I can still remember my basic statistics, 9,540 samples would be considered an appropriate sample size, but of course you are right that the larger the sample size, the better. I respect your opinion about the sample size, but I think we can still create a good model given 9,540 samples, as long as we apply new ways to dissect the image and text data and convert it into new data; in my case I created 9,396 data points for text and 29,778 for images.

I think you are very correct. In fact, our team members had the exact same questions in mind. We noted that one of the competitors entered and made a “super-human” entry for both phase 1 and phase 2 on the same day (close to the end of the competition) and stayed at the top ever since. This looked highly suspicious to us, but to maintain the decorum of the competition we never complained!

I think it should definitely be scrutinized. In fact, leading experts from industry and academia should be invited to review the entries. It seemed a little odd to me when FB declared the winners based only on the leaderboard numbers!

Hi all – The leaderboard has been updated based on a review of the top eligible scores. Thanks for your patience during this process. Don’t forget, you can hear from the prize finalists at the competition session at NeurIPS this coming Friday, and all winning solutions will be shared openly with a final announcement of the results in the coming weeks.

Thanks to everyone for all your hard work on this challenge.
