Anomaly detection - quite a big difference between #1 and #2!

The score for #1 (0.75x) is almost double that of #2 (0.38x)! That is a huge gap for any data science competition. I wonder if #1 (viana) is using some radically new technique…

Theoretically, according to the scoring, if we declared just one point an anomaly and it turned out to be correct, our score would be 0.8! I wonder how the judges would view such a model. By the competition's scoring it would be a great model, but it is clearly not practical.
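To make the arithmetic concrete: the thread doesn't quote the official metric, but a weighted combination like 0.8 · precision + 0.2 · recall (purely my assumption here) would behave exactly this way for a one-point submission. A minimal sketch:

```python
# Hypothetical illustration only: the actual competition metric is not
# quoted in this thread. Assume a score of the form
#   score = 0.8 * precision + 0.2 * recall
# which reproduces the ~0.8 figure mentioned above.

def weighted_score(true_positives, predicted_positives, actual_positives,
                   w_precision=0.8, w_recall=0.2):
    """Weighted precision/recall score (assumed form, not the official metric)."""
    precision = true_positives / predicted_positives if predicted_positives else 0.0
    recall = true_positives / actual_positives if actual_positives else 0.0
    return w_precision * precision + w_recall * recall

# Flag exactly one point as an anomaly, and suppose it is correct.
# With (say) 100 true anomalies: precision = 1.0, recall = 0.01.
print(weighted_score(true_positives=1, predicted_positives=1,
                     actual_positives=100))
# -> 0.802
```

Under any metric with this shape, precision dominates, so one lucky guess scores nearly as well as a genuinely good detector.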

My guess is that the top model is severely overfitting, and the predictions might even be hand-labeled. Again, this is just a hypothesis, but I suspect second place has a more robust model.

How can we know which records are anomalies when there is no column indicating whether a particular record is an anomaly or not?

The top two scorers have almost the same nickname; the difference is a single letter: @viana and @lviana. Maybe it is the same person.

I guess that’s how to get more than two submissions per day.

@vikas79 This is an unsupervised learning problem! We don't have the answers; we have to find them ourselves.
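For anyone new to this kind of setup, a typical unsupervised baseline is something like scikit-learn's IsolationForest, which flags anomalies without any ground-truth labels. The file name and feature selection below are placeholders, not the actual competition data:

```python
# A common unsupervised baseline: no labels needed.
# File name and feature columns are placeholders for illustration.
import pandas as pd
from sklearn.ensemble import IsolationForest

df = pd.read_csv("train.csv")            # hypothetical file name
X = df.select_dtypes("number").values    # use numeric columns as features

model = IsolationForest(contamination=0.01, random_state=0)
model.fit(X)

# predict() returns -1 for anomalies and 1 for normal points
df["is_anomaly"] = (model.predict(X) == -1).astype(int)
print(df["is_anomaly"].sum(), "points flagged as anomalies")
```

The `contamination` parameter encodes your guess at the anomaly rate, which is exactly the part you have to estimate yourself in an unsupervised problem.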

My guess, again, is that the top scorer(s) are hand-labeling individual data points as anomalies. Given the way the scoring is set up, an "optimal" approach would be to find a single anomalous data point and declare all the others normal, as in the sketch below. I would like to hear from the organizers how they plan to handle a situation where the top entry is hand-labeling data points.
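Concretely, the degenerate strategy is just this (the submission format, column name, and file names are assumed for illustration only, not the real spec):

```python
# Sketch of the degenerate one-point strategy described above.
# Submission format (columns, index, file names) is assumed.
import pandas as pd

sub = pd.read_csv("submission_format.csv", index_col=0)  # hypothetical file
sub["anomaly"] = 0                        # declare everything normal...
sub.loc[sub.index[0], "anomaly"] = 1      # ...except one hand-picked point
sub.to_csv("one_point_submission.csv")
```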

If you ever suspect that someone is using more than one account for submissions, please email info@drivendata.org directly.

Hello! You can be 100% sure that I'm not @viana! It turns out that Viana is a pretty common surname in Brazil and Portugal.

@bull, if you ever need proof that I'm not @viana, don't hesitate to contact me.
