I've been thinking about the low scores. The evaluation contains a 0.8*TP/(TP+FP) term, i.e. 0.8 times the precision.
If we identify 1 real anomaly and predict only that one, this term alone gives us a score of 0.8.
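A quick sketch of the arithmetic. It models only the 0.8*TP/(TP+FP) term; any other terms of the evaluation are ignored. The function name `precision_term` is made up, and returning 0 when nothing is predicted is my assumption, not something the metric description states:

```python
def precision_term(tp: int, fp: int, weight: float = 0.8) -> float:
    """Weighted precision term weight*TP/(TP+FP).

    Only this single term of the evaluation is modeled here.
    """
    if tp + fp == 0:
        # No predictions at all: treat the term as 0 (assumption).
        return 0.0
    return weight * tp / (tp + fp)


# One correct prediction and nothing else: the term alone gives 0.8.
print(precision_term(tp=1, fp=0))  # 0.8
# The same single hit diluted by 9 false positives drops to about 0.08.
print(precision_term(tp=1, fp=9))
```

This shows why a handful of confident predictions can beat a long list: every FP added to the list shrinks the term.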
OK, we would need to do this for all 3 sites, but that is not much extra effort.
There is a public/private split, so it could happen that all of the predicted TPs fall into the private part, but if we can identify 20-50 anomalies, some of them should fall into the public part.
What do you think? Is there an anomaly here, or is it simply that hard to identify 20-50 anomalies without having many FPs among them?
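To put a rough number on the "some of them should fall into the public part" argument: if each true positive independently lands in the public part with probability p, the chance that at least one of k TPs is public is 1 - (1 - p)^k. The independence assumption and the value p = 0.3 below are hypothetical; I don't know how the actual split was made:

```python
def prob_some_in_public(k: int, public_fraction: float) -> float:
    """Probability that at least one of k true positives is in the public part.

    Assumes each anomaly lands in the public split independently with
    probability `public_fraction` (hypothetical; the real split is unknown).
    """
    return 1.0 - (1.0 - public_fraction) ** k


# Hypothetical public fraction of 0.3.
for k in (1, 5, 20, 50):
    print(k, round(prob_some_in_public(k, public_fraction=0.3), 4))
```

With a single TP the public score is close to a coin flip, but with 20 or more it becomes very likely that at least one shows up publicly.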