Nice job everyone!

Hi everyone,

Congratulations on the submissions!

I am really impressed by the top-10, so if you have any plans to present the methodology (during NeurIPS?), please let me know :slight_smile:



I am very interested as well! In fact I would be interested in anyone who details their methodology, even if you weren’t inside the top 10. If anyone happens upon papers/presentations doing this maybe we should share them in this thread?


Pair-wise model matters.

@qqerret do you mean pairing benign images and using something like a triplet loss?

I did try Siamese networks, triplet loss, and quadruplet networks. At least for me, they didn’t work at all :upside_down_face:

I am also really curious how the top 5 teams achieved superhuman performance on the test set. @ironbar, maybe you can give us a quick overview of your solution? :hugs:


Sure, we created an ensemble using some of the mmf models and a new architecture developed for the challenge that we named Albit. Albit is a combination of the ALBERT transformer and the ResNets from the Big Transfer (BiT) paper. We used BiT models trained on ImageNet-21k, and the labels were fed directly to the ALBERT model along with the text of the meme.
The volatility of the model scores in training was very high, so we trained a lot of models and took the best ones. I estimate that I trained around 2000 models for this challenge.
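
To make the two ideas above concrete, here is a minimal sketch, not the actual Albit code. It assumes the image labels arrive as plain strings that are appended to the meme text after a separator, and that "taking the best ones" means averaging the predictions of the runs with the highest validation score. The function names, the `[SEP]` separator, and the `val_score`/`preds` keys are all hypothetical.

```python
from statistics import mean


def build_text_input(meme_text, image_labels, sep=" [SEP] "):
    """Join the meme text and the predicted image labels into one string.

    Assumption: labels are fed to the text model as extra tokens after
    the meme text. The separator is illustrative only.
    """
    return meme_text.strip() + sep + ", ".join(image_labels)


def ensemble_top_k(runs, k):
    """Average the predictions of the k runs with the best validation score.

    runs: list of dicts with hypothetical keys
          "val_score" (float) and "preds" (list of floats, same length).
    """
    best = sorted(runs, key=lambda r: r["val_score"], reverse=True)[:k]
    # zip(*...) walks the runs in parallel, one sample at a time.
    return [mean(sample) for sample in zip(*(r["preds"] for r in best))]


if __name__ == "__main__":
    print(build_text_input("hello world", ["dog", "ball"]))
    runs = [
        {"val_score": 0.70, "preds": [0.2, 0.8]},
        {"val_score": 0.90, "preds": [0.4, 0.6]},
        {"val_score": 0.50, "preds": [1.0, 0.0]},
    ]
    print(ensemble_top_k(runs, k=2))
```

With very noisy scores, picking runs by validation score can itself overfit the validation set, so the choice of `k` is a judgment call.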

However, this is not enough to achieve superhuman performance. There are problems in the creation and split of the dataset that could be exploited to improve the ROC AUC metric. Information about repeated text and repeated images could be used to improve the scores dramatically.
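
As an illustration of what "repeated text" means in practice, here is a minimal sketch for auditing a dataset for captions that appear on more than one meme. It is not the method used by any team. The normalization rule and the `(meme_id, text)` input format are assumptions made for the example.

```python
import re
from collections import defaultdict


def normalize_text(text):
    """Lowercase and collapse punctuation and whitespace.

    Assumption: captions that differ only in case or punctuation
    should count as the same text.
    """
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def group_by_repeated_text(memes):
    """Return {normalized text: [meme ids]} for text used more than once.

    memes: iterable of (meme_id, text) pairs (hypothetical format).
    """
    groups = defaultdict(list)
    for meme_id, text in memes:
        groups[normalize_text(text)].append(meme_id)
    return {text: ids for text, ids in groups.items() if len(ids) > 1}


if __name__ == "__main__":
    sample = [
        (1, "Look at this!"),
        (2, "look at THIS"),
        (3, "something else"),
    ]
    print(group_by_repeated_text(sample))
```

A check like this is also useful to organizers: groups that span the train and test splits point to possible leakage.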


Actually, I have matched the memes into triplets (hateful, benign text, benign image) with no improvement using the triplet loss. I feel your pain.
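
For readers unfamiliar with the triplet loss mentioned here, this is a minimal sketch of the standard formulation with Euclidean distance and a margin. Which meme in each (hateful, benign text, benign image) triplet plays the anchor, positive, or negative role is not stated above, so the roles in the example are placeholders, and the embeddings are made-up vectors.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet loss: max(0, d(a, p) - d(a, n) + margin).

    The loss is zero once the negative is at least `margin` farther
    from the anchor than the positive is.
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


if __name__ == "__main__":
    anchor = [0.0, 0.0]
    near = [0.0, 1.0]   # distance 1 from the anchor
    far = [3.0, 4.0]    # distance 5 from the anchor
    print(triplet_loss(anchor, near, far))  # triplet already satisfied
    print(triplet_loss(anchor, far, near))  # triplet violated
```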

Thank you for sharing your approach! I tried BiT models with a few transformer combinations (including ALBERT) but didn’t get good results, so I ended up using something else to process the images and extracted region encodings and positional encodings. Positional encodings combined with tags used as anchor points helped increase the performance of our model. My implementation was probably not exactly the same as yours, which may be why you got better results with BiT + ALBERT. Perhaps pseudo-labeling, which you mentioned before, helped too.

I was, however, more concerned about the exploit you mentioned to increase the AUC: “Information about repeated text, repeated images, could be used to improve the scores dramatically.” Maybe I am mistaken, but I assume you are finding pairs of confounders in the data and using the data from one confounder to influence the prediction for the other.

This is just my opinion, and I don’t expect anyone else to share it, but I do not think that this was the objective of the challenge. The goal of the challenge was to improve multimodal models to identify multimodal hate, not to exploit the data to achieve superhuman accuracy. The dataset was obviously made specifically for this challenge, and because of the way it was made, it can be exploited easily.

I feel like many teams, including mine, approached this challenge with a research mindset and didn’t resort to exploiting the data or pseudo-labeling. I personally find this approach unethical and, most importantly, unfair to all the other teams. I repeat that this is just my own opinion, and I do not expect everyone else to agree. We will have to see what @glipstein thinks about it. I hope you consider my points when you are evaluating.

I believe that Facebook themselves are aware of the exploit, given the way the data was created. If they were OK with this, they probably would have reported a higher score on their benchmark too.

Thank you for reading, and have a good day.