My solution and code

The steps I took from the baseline to reach my solution.

  1. The task is multilabel classification, so softmax + categorical_crossentropy is not the best choice for it. Replacing softmax + categorical_crossentropy with sigmoid + binary_crossentropy gives a huge improvement.
    Baseline changes:
    - NASMobile -> MobileNetV2
    - softmax + categorical_crossentropy -> sigmoid + binary_crossentropy
    - Training on season 1 -> Training on seasons 1-8
    - Training the last layers -> Training all layers
    Together these give 0.0060 on the leaderboard.
  2. MobileNetV2 is not the best choice when a Tesla V100 and 8 hours are available for inference. I tried different architectures from Keras Applications, including pretrained ones, and found that InceptionResNetV2 gave me the best result.
    MobileNetV2 -> InceptionResNetV2
    gives ~0.0045 on the leaderboard.
  3. The limit of 1 submission per 5 days pushed me to build a local environment that reproduces the server calculation. I found that my local inference ran ~10 times faster than on the server, and the only difference was the input image size. I had trained the model on the resized images from Pavel and also tested on those images. To optimize speed I decided to download the full-size images and use them for local inference. As expected, this slowed down inference, but it also gave me a worse score (~0.0045 vs ~0.0038). I understood that the issue was in the resize flow. So the final flow is: load the full-size image, resize it as Pavel did, then save it to disk, load it from disk and resize it to the model input size. This brought back my ~0.0038 on full-size images.
  4. The metric is the mean logloss over all classes, and I was interested in which classes are most worth optimizing. ‘empty’ is the most frequent class and has the biggest logloss, ~0.045; ‘zebra’ is also frequent, with logloss ~0.01; ‘zorilla’ has logloss 0.00001, so it does not significantly affect the final score. To optimize the ‘empty’ class I decided to find the background of the images,
    like this: np.sum([np.abs(img - np.mean(np.asarray(imgs), axis=0)) for img in imgs], axis=0). It produces almost black images for empty sequences and bright images for non-empty ones. I trained another InceptionResNetV2 on those backgrounds (on 2 classes, empty/non-empty) and ensembled the ‘empty’ class. It reduced the ‘empty’ logloss on season 10 from 0.045 to 0.035, so the final score improves by (0.045 - 0.035)/54 ≈ 0.0002, which is not much.
  5. The next step was to train the background InceptionResNetV2 not on 2 classes but on all 54 classes. But simple mean/median ensembling gave worse results than the original classifier, so I decided to use boosting to ensemble the original model and the background model. Fortunately I had trained the NNs on seasons 1-8, so I was able to train the boosting on season 9 and validate it on season 10. This moved me to 0.0028 on the leaderboard.
  6. The ‘otherbird’ class gave me an idea: birds fly fast and appear on only 1 image in a sequence. What if we give the NN a clue? I built a monster out of 2 InceptionResNetV2 networks: one takes the mean of the images as input and the other takes the background. The idea is that the background should work as attention. I also tested another architecture, concat(mean, background) -> conv2D(6 layers -> 3 layers) -> InceptionResNetV2, but it worked worse than (mean -> InceptionResNetV2, background -> InceptionResNetV2) -> conv2D -> dense. Adding this model to the boosting slightly improved the result, to 0.0027.
  7. Until this point I had not used season 10 for training, so that I could see a real score on the leaderboard. In the last days I trained another boosting on season 10 with validation on season 9 and ensembled the two. It gave me 0.0020 (overfitting, as expected).
  8. I also had not used TTA, so in the final submission I added horizontal-flip TTA. It slightly reduced the overfitting and gave me a score of 0.0022 on the leaderboard, which is 0.0054 on the private part.
    Trained on a single 2080 Ti.
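
The background trick from step 4 can be sketched roughly as follows. This is only an illustration, not the code from the solution: the function name and the toy frames are hypothetical, and it assumes the mean is taken per pixel across the sequence (axis=0), which is what makes a static scene come out black. The frames are cast to float first, because subtracting uint8 images directly would wrap around.

```python
import numpy as np

def sequence_background(imgs):
    """Sum of absolute per-pixel deviations from the mean image of a sequence."""
    # Cast to float: uint8 arithmetic would wrap around on negative differences.
    stack = np.asarray(imgs, dtype=np.float32)   # shape (n_frames, H, W, C)
    mean_img = stack.mean(axis=0)                # per-pixel mean over the sequence
    return np.abs(stack - mean_img).sum(axis=0)  # shape (H, W, C)

# Hypothetical toy data: three identical frames, i.e. an "empty" sequence.
static = [np.full((4, 4, 3), 100, dtype=np.uint8) for _ in range(3)]

# Same scene, but something bright appears in the middle frame only.
moving = [frame.copy() for frame in static]
moving[1][1:3, 1:3, :] = 200

empty_bg = sequence_background(static)   # all zeros: nothing changed
animal_bg = sequence_background(moving)  # non-zero only where the object was
```

In practice the result would still need to be scaled to the input range the network expects before it is used as a training image.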

Cleaned code here. I hope I moved over all the interesting things, because the original code is very messy.
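
To make the arithmetic in step 4 concrete, here is a minimal sketch of a mean per-class logloss. It is my reading of "mean logloss of all classes", not the official scoring code: the function name and the clipping constant eps are assumptions. Because every class carries a weight of 1/54, a gain on a single class is divided by 54 in the final score.

```python
import numpy as np

def mean_class_logloss(y_true, y_pred, eps=1e-15):
    """Binary logloss per class (column), then averaged over classes."""
    y_true = np.asarray(y_true, dtype=np.float64)
    # eps is an assumed clipping value that keeps log() finite.
    y_pred = np.clip(np.asarray(y_pred, dtype=np.float64), eps, 1 - eps)
    per_element = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    per_class = per_element.mean(axis=0)   # one logloss value per class
    return per_class.mean(), per_class

# Toy check: predicting 0.5 everywhere costs ln(2) for every class.
score, per_class = mean_class_logloss([[1, 0], [0, 1]],
                                      [[0.5, 0.5], [0.5, 0.5]])

# Effect of improving only the 'empty' class from 0.045 to 0.035 with 54 classes.
n_classes = 54
gain = (0.045 - 0.035) / n_classes   # about 0.000185
```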


There was one good idea which I did not use.
On the boosting step we can use the average scores from all other cameras within several seconds. If a zebra walks near one camera, we can expect it to be detected again soon. Of course we do not know which scores to average, but it does not matter; it improves the result anyway.
I did not use it because the leaderboard has sparse data, so it should not work properly there, and it is risky to build such an algorithm without a test. But I tested it on the full season 10 and it works great: on the season 10 holdout it improves the score from 0.00265 to 0.00242.
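
A rough sketch of how such neighbour features could be built is below. It is not the code I tested: the function name, the window length and the zero fallback for sequences without neighbours are all assumptions made for the example. For every sequence it averages the class scores of all sequences from other cameras whose timestamp falls within the window; the result would be appended to the boosting features.

```python
import numpy as np

def neighbour_mean_scores(cameras, times, scores, window=10.0):
    """Mean class scores from other cameras within `window` seconds."""
    cameras = np.asarray(cameras)
    times = np.asarray(times, dtype=np.float64)     # seconds
    scores = np.asarray(scores, dtype=np.float64)   # shape (n_sequences, n_classes)
    out = np.zeros_like(scores)                     # assumed fallback: zeros
    # Simple O(n^2) loop; fine for a sketch, too slow for a full season.
    for i in range(len(times)):
        near = (np.abs(times - times[i]) <= window) & (cameras != cameras[i])
        if near.any():
            out[i] = scores[near].mean(axis=0)
    return out

# Hypothetical toy data: 4 sequences, 2 classes.
cams = ["A", "B", "C", "A"]
secs = [0, 3, 100, 4]
preds = [[0.9, 0.1], [0.7, 0.3], [0.2, 0.8], [0.5, 0.5]]

features = neighbour_mean_scores(cams, secs, preds, window=5.0)
```

Sequence 2 has no other camera within 5 seconds, so its feature row stays at the fallback value.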

Thanks for sharing!

On boosting step we can use average scores from all other cameras around several seconds

That’s a pretty cool idea that we should definitely look at implementing.