I have been thinking about the low scores. The evaluation metric contains a 0.8 * TP / (TP + FP) term. If we identify one real anomaly and predict only that, precision is 1 and that term alone already gives a score of 0.8.
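To make the arithmetic concrete, here is a minimal sketch of that single term (the function name is mine, and I am assuming this is only one component of the full metric, which has other terms as well):

```python
# Sketch of the 0.8 * TP / (TP + FP) precision term discussed above.
# Assumption: this is just one component of the full competition metric.
def precision_term(tp: int, fp: int, weight: float = 0.8) -> float:
    """Return weight * TP / (TP + FP), or 0.0 if nothing was predicted."""
    return weight * tp / (tp + fp) if (tp + fp) > 0 else 0.0

# One correct prediction and nothing else: precision = 1/1, so the term = 0.8.
print(precision_term(tp=1, fp=0))  # 0.8
```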
Ok, we would need to do this for all 3 sites, but that is not much extra effort.
There is a public/private split, so it could happen that all of the predicted TPs land in the private part, but if we can identify 20-50 anomalies then some of them should fall into the public part.
What do you think? Is there an anomaly here, or is it simply that hard to identify 20-50 anomalies without having many FPs among them?
As with all DrivenData competitions, competitors will be best served by focusing on generalizable methods since there are both public and private leaderboards. It’s unlikely that trying to work around the metric will result in a good outcome on both, and that’s especially true if you only focus on either precision or recall. Those have been intentionally weighted in terms of how important they are in context, so the metric is a good fit for this task.