What are some basic approaches?

After understanding the setting of this competition, I still don’t have any idea how to tackle the problem. From my viewpoint, it feels impossible to detect the specific animal listed in the CSV because we don’t know WHEN and WHERE the animal appears in the video.
What is the basic approach for a problem like this? In other words, how can I frame the problem in a tractable way?

I joined the N+1 N+2 fish competition last month. It was my first time using an object detection network, and it was hard to implement. But that was OK, because I could see what the loss should be and how I could use models to minimize it.
Any hints, comments or references would be appreciated. Thanks.

I had not noticed that a very nice introduction is now available as the benchmark.

(This was not available when the competition began.)

Learning that you can convolve over not only the image axes but also the time axis at the same time was eye-opening.
Thank you for giving us a nice guideline.
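
For anyone else who was stuck at the same point, here is a minimal sketch of what convolving over the time axis together with the image axes looks like. It is not the benchmark's model, only a toy NumPy illustration: the function name `conv3d_valid`, the clip size, and the random kernel are all made up for the example. The last step takes a global max over time and space, which is one possible way (my assumption, not something taken from the benchmark) to get a single video-level score without knowing when or where the animal appears.

```python
import numpy as np

def conv3d_valid(video, kernel):
    """Naive 'valid' 3D cross-correlation over (time, height, width).

    Hypothetical helper for illustration; real models would use a
    deep learning library's 3D convolution layer instead.
    """
    T, H, W = video.shape
    kt, kh, kw = kernel.shape
    out = np.zeros((T - kt + 1, H - kh + 1, W - kw + 1))
    for t in range(out.shape[0]):
        for y in range(out.shape[1]):
            for x in range(out.shape[2]):
                # The window spans several frames AND a spatial patch.
                window = video[t:t + kt, y:y + kh, x:x + kw]
                out[t, y, x] = np.sum(window * kernel)
    return out

# Toy grayscale clip: 8 frames of 16x16 pixels (made-up sizes).
rng = np.random.default_rng(0)
video = rng.random((8, 16, 16))
kernel = rng.standard_normal((3, 3, 3))  # 3 frames x 3x3 pixels

features = conv3d_valid(video, kernel)

# Global max pooling over time and space collapses the feature map
# to one number per video, so only a video-level label is needed.
score = features.max()
```

In a real network the kernel would be learned, there would be many kernels and layers, and the pooled features would feed a classifier trained against the per-video labels in the CSV.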