My 0.5871 LB Solution (and some lessons learnt)

Hi guys,

With the competition almost over and no further submission attempts left, here's an overview of my solution. The source is open source and available at https://github.com/antorsae/fish; the competition page is https://www.drivendata.org/competitions/48/identify-fish-challenge/

STEP 1. Take each video and create a new video cropped to 384x384, centered on the zone of interest (where the fish action is happening) and rotated so the fish/ruler is horizontal. The frames end up like this:

[example images: INPUT frame → OUTPUT cropped and rotated frame]

This part is implemented here: https://github.com/antorsae/fish/blob/master/boat.py
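For illustration, here is a minimal sketch of what that crop-and-rotate step can look like with OpenCV. The 384x384 size is from the post; the centroid and angle come from the boat model's predictions, and the function name and signature are hypothetical, not the repo's exact code:

```python
import cv2

def crop_and_rotate(frame, cx, cy, angle_deg, size=384):
    """Rotate the frame around (cx, cy) so the ruler becomes horizontal,
    then crop a size x size window centered on the ROI.

    cx, cy and angle_deg come from the boat model's predictions.
    """
    # Rotation (no scaling) around the ROI centroid
    M = cv2.getRotationMatrix2D((cx, cy), angle_deg, 1.0)
    # Shift so the centroid lands at the center of the output crop
    M[0, 2] += size / 2 - cx
    M[1, 2] += size / 2 - cy
    # One warp performs rotation + translation; output size is the crop
    return cv2.warpAffine(frame, M, (size, size))
```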

STEP 2. Take all videos created in step 1 and build frame-by-frame predictions of fish length and species, including a new species_no_fish class for frames with no fish.

Here’s an example of the generated predictions for the training data:

```
row_id,video_id,frame,length,species_fourspot,species_grey sole,species_other,species_plaice,species_summer,species_windowpane,species_winter,species_no_fish
0,00WK7DR6FyPZ5u3A,0,157.63180541992188,1.7653231099146183e-09,1.0,3.6601498720756354e-08,3.1204638162307674e-07,2.3616204103404925e-08,5.9193606460894443e-08,1.3712766033791013e-08,6.261254969358587e-12
1,00WK7DR6FyPZ5u3A,1,162.67236328125,1.2406671523468304e-10,1.0,2.6489321847122937e-09,1.1546971734333056e-07,9.599543382421416e-10,4.391315311380595e-09,5.5750559724288e-10,1.2707434930703254e-13
2,00WK7DR6FyPZ5u3A,2,155.6374969482422,8.897231285054374e-10,0.9999971389770508,1.4144226270218496e-06,1.3051650284978678e-06,1.1586666737173346e-08,8.389072547743126e-08,3.659388880805636e-08,9.07268617872381e-12
```

This is, again, implemented using ResNet-152 with transfer learning, classifying the 7 species plus a new one (no fish) and regressing the length (only propagating gradients when an actual fish is present). Oversampling is used to balance the classes. Implementation @ https://github.com/antorsae/fish/blob/master/fish.py
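The "only propagate gradients when it's an actual fish" part can be expressed as a masked regression loss. A hypothetical Keras sketch (not the repo's exact code; it assumes no-fish frames carry a sentinel length of -1):

```python
import keras.backend as K

def masked_length_loss(y_true, y_pred):
    """MSE on fish length, computed only over frames that contain a fish.

    Assumes (hypothetically) that no-fish frames are labeled with a
    sentinel length of -1; the mask zeroes out their gradients.
    """
    mask = K.cast(K.greater_equal(y_true, 0.0), K.floatx())
    sq_err = K.square(y_pred - y_true) * mask
    # Average over fish frames only; epsilon avoids division by zero
    return K.sum(sq_err) / (K.sum(mask) + K.epsilon())
```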

STEP 3. Initially I wanted to take the output of step 2 and build an RNN to predict the fish sequence. I made several attempts to make it work: using a CTC loss to predict the frames where the fish changes, or generating proxy signals (a square wave from the fish number mod 2 with a binary cross-entropy loss, or a sawtooth wave with an MSE loss), but none worked. I believe the CTC approach would have worked with more effort.
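To make the proxy-signal idea concrete, here is one plausible construction of those targets from a per-frame fish index (hypothetical; the actual experiments may have encoded them differently):

```python
import numpy as np

# fish_index[t] = which fish (0, 1, 2, ...) is on the ruler at frame t
fish_index = np.array([0, 0, 0, 1, 1, 2, 2, 2])

# Square wave: fish number mod 2, trained with binary cross-entropy
square_target = fish_index % 2            # -> [0 0 0 1 1 0 0 0]

# Sawtooth wave: ramp from 0 to 1 over each fish's run, trained with MSE
saw_target = np.zeros(len(fish_index))
for i in np.unique(fish_index):
    run = np.where(fish_index == i)[0]
    saw_target[run] = np.linspace(0.0, 1.0, len(run))
```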

So with just one day left, I explored the frame-by-frame predictions in https://github.com/antorsae/fish/blob/master/playground-sequence-prediction-thresholding.ipynb and settled on some basic thresholding to generate the fish sequences: https://github.com/antorsae/fish/blob/master/generate_submission.py
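A hedged sketch of what such thresholding can look like (not the exact logic in generate_submission.py): treat a frame as containing a fish when the species_no_fish probability is below a threshold, and count each sufficiently long run of fish frames as one fish. The threshold and minimum run length here are made-up values:

```python
import numpy as np

def count_fish_segments(p_no_fish, thresh=0.5, min_len=3):
    """Split frame-by-frame predictions into per-fish frame ranges by
    thresholding the species_no_fish probability.

    thresh and min_len are illustrative values, not tuned ones.
    """
    is_fish = np.asarray(p_no_fish) < thresh
    segments, start = [], None
    for t, fish in enumerate(is_fish):
        if fish and start is None:
            start = t                          # a new fish appears
        elif not fish and start is not None:
            if t - start >= min_len:           # drop spurious blips
                segments.append((start, t))
            start = None
    if start is not None and len(is_fish) - start >= min_len:
        segments.append((start, len(is_fish)))
    return segments

# e.g. count_fish_segments([0.9, 0.1, 0.05, 0.1, 0.95, 0.2, 0.1, 0.1])
# -> [(1, 4), (5, 8)]
```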

Not bad, but I think much worse than what a decent RNN-based approach would have provided.

LESSONS LEARNT

Since the most important metric was sequence accuracy, I should have focused on an end-to-end RNN (or 3D CNN) sequence prediction taking the frame-by-frame CNN features, not just the frame-by-frame probabilities for the 8 classes.

Any comments or suggestions would be greatly appreciated. I really enjoyed this competition! 🙂


Cropping the region of interest is a really nice approach.
As far as I understand the code in boat.py, you trained the model against the xy coordinates and the boat id.
How did you create these labels?
I imagine the xy coordinates can be computed somehow from x1, y1, x2, y2 in training.csv. Is that how you did it?
Did you also train a model to get the boat id?

I first built training_transform.csv with https://github.com/antorsae/fish/blob/master/augment_train.py, just looking at the centroid of the head/tail annotations for each video. I looked at all frames; in retrospect I should have experimented with looking at just the first two. Anyway, that generates a file with the x,y centroid for each video as well as a boat_id, which is a clustering id (see https://github.com/antorsae/fish/blob/master/playground-boat-scenarios-clustering.ipynb).
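A minimal sketch of that clustering step, assuming plain k-means on the per-video centroids (the notebook may use a different algorithm; the data here is a stand-in):

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for the per-video head/tail (x, y) centroids from
# training_transform.csv; one row per video
centroids = np.random.rand(1000, 2) * 1280

# 5 clusters, since only 5 boat scenarios show up in the data
kmeans = KMeans(n_clusters=5, random_state=0).fit(centroids)
boat_id = kmeans.labels_                  # clustering id per video
boat_centroid = kmeans.cluster_centers_   # reference (x, y) per boat
```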

boat.py classifies the boat into a finite set of boat_ids (a classification task) and also regresses the x,y of the centroid for that particular video. Instead of regressing the raw x,y, I regress the difference (dx,dy) between the video's centroid and the centroid for its boat_id. I weighted the loss so that the classification matters more. If you look at the variance of dx,dy within each cluster, it is smaller than what you'd get regressing x,y directly; in other words, if you classify boat_id correctly (very easy), even a bad job on dx,dy doesn't hurt much.
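In code, the residual targets can be derived like this (a sketch with stand-in data and hypothetical names, not the repo's exact variables):

```python
import numpy as np

# Stand-in data; in practice xy, boat_id and boat_centroid come from
# training_transform.csv and the clustering step above
xy = np.random.rand(100, 2) * 1280              # raw per-video centroids
boat_id = np.random.randint(0, 5, size=100)     # cluster label per video
boat_centroid = np.random.rand(5, 2) * 1280     # per-cluster centers

# Regression target: offset from the video's boat cluster center; its
# within-cluster variance is much smaller than that of the raw (x, y)
dxdy_target = xy - boat_centroid[boat_id]

# At inference, reconstruct the centroid from both heads:
# xy_pred = boat_centroid[predicted_boat_id] + predicted_dxdy
```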

I tried to regress x,y and the angle directly, but when I saw there were just 5 boat scenarios I tried the above, and it worked on the first attempt (almost 🙂).


Thank you for answering. So you used the xy centroids to cluster boat types. Very interesting!
But does the model also work for test data? When you crop the test images, do they capture the fish like the training images do?
I wonder if it can work well on test images that contain new types of boat.

Yes, xy centroids and the ruler angle (measured by proxy as the arctan2 of the head/tail vector). And yes, the model works for test data: I visually verified all 667 test videos, and they capture the fish ROI properly.
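For reference, the arctan2 proxy amounts to something like this (stand-in values for the x1, y1, x2, y2 head/tail columns in training.csv):

```python
import numpy as np

# Head/tail keypoints for one annotated frame (stand-in values)
x1, y1, x2, y2 = 100.0, 200.0, 400.0, 180.0
ruler_angle = np.degrees(np.arctan2(y2 - y1, x2 - x1))
```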

Re: new types of boat… not as it is. You would have to train boat.py with ground truth for the new types of boat, or regress the x,y centroids and the ruler angle directly (the code is almost ready to do it; I implemented an orientation-invariant angle loss in boat.py which is not used, but you can see what it does).
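For the curious, a hypothetical sketch of what an orientation-invariant angle loss can look like (the actual one in boat.py may differ): penalize the angular difference modulo 180°, since a ruler flipped head-to-tail is the same ruler.

```python
import keras.backend as K

def orientation_invariant_angle_loss(y_true, y_pred):
    """Angle loss (in radians) that treats theta and theta + pi as the
    same angle, so a head/tail flip of the ruler costs nothing."""
    diff = y_pred - y_true
    # sin^2 is pi-periodic: it is zero whenever diff is a multiple of pi
    return K.mean(K.square(K.sin(diff)))
```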