Below is my attempt at implementing the performance metric in Python 3 (for Python 2, adding from __future__ import division would probably be enough). It uses the editdistance library for the Levenshtein distance calculation. I'm not 100% sure it is correct; this is just my interpretation of the rules, so I'd be glad if anyone points out errors in the implementation.

With some inspiration from @lopuhin I created this implementation, which includes the AUC normalization and the fallback to MAE (not sure I got this right, though). It calculates all metrics per video, weights them, and then averages over all videos. The print statements also give you a per-video breakdown of the individual metrics.

import pandas as pd
import numpy as np
import editdistance
from sklearn.metrics import roc_auc_score, r2_score, mean_absolute_error

species = [
    'species_fourspot',
    'species_grey sole',
    'species_other',
    'species_plaice',
    'species_summer',
    'species_windowpane',
    'species_winter',
]


def get_metrics(true_df, pred_df):
    vid_scores = []
    for vid, indexes in true_df.groupby('video_id').groups.items():
        true = true_df.loc[indexes]
        pred = pred_df.loc[indexes]
        sequence_score = get_sequence_score(true, pred)
        auc_score = get_auc_score(true, pred)
        length_score = get_length_score(true, pred)
        print(vid)
        print("Edit score: %.3f" % sequence_score)
        print("AUC score: %.3f" % auc_score)
        print("Length score: %.3f" % length_score)
        print("")
        weighted_score = (0.6 * sequence_score
                          + 0.3 * auc_score
                          + 0.1 * length_score)
        vid_scores.append(weighted_score)
    return np.mean(vid_scores)


def get_sequence_score(true, pred):
    # Most likely species per frame (positional argmax over the species columns).
    true_seq = true[species].values.argmax(axis=1).tolist()
    pred_seq = pred[species].values.argmax(axis=1).tolist()
    edist = editdistance.eval(true_seq, pred_seq)
    # Normalize by the true sequence length and clip to [0, 1].
    edist = min(edist / len(true_seq), 1)
    return 1 - edist


def get_auc_score(true, pred):
    try:
        score = np.mean([roc_auc_score(true[spc], pred[spc])
                         for spc in species])
        # Rescale the mean AUC from [0.5, 1] to [0, 1].
        score = 2 * score - 1
    except ValueError:
        # roc_auc_score raises when a video has only one class in the
        # true labels; fall back to MAE in that case.
        score = 1 - np.sum([mean_absolute_error(true[spc], pred[spc])
                            for spc in species])
    return score


def get_length_score(true, pred):
    # R^2 over the frames where a fish is actually present.
    fish = true['length'] > 0
    score = r2_score(true['length'][fish],
                     pred['length'][fish])
    return max(score, 0)


# true_df / pred_df are the ground-truth and submission dataframes.
print("Overall score: %.3f" % get_metrics(true_df, pred_df))
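To see what the edit-distance normalization in get_sequence_score actually does, here is a toy illustration. It uses a small pure-Python Levenshtein implementation as a stand-in for editdistance.eval, so it runs without the third-party library; the sequences are made-up species indices, not competition data.

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance between two sequences.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]


def sequence_score(true_seq, pred_seq):
    # Same normalization as get_sequence_score above: distance divided by
    # the true-sequence length, clipped to [0, 1], then inverted.
    edist = min(levenshtein(true_seq, pred_seq) / len(true_seq), 1)
    return 1 - edist


print(sequence_score([0, 1, 2, 1], [0, 1, 2, 1]))  # perfect match -> 1.0
print(sequence_score([0, 1, 2, 1], [0, 1, 1, 1]))  # one substitution -> 0.75
```

Note that the score can hit 0 but never goes negative, because the normalized distance is clipped at 1 before inverting.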

@bull, could you please clarify whether the length score should be calculated per video and then averaged, or calculated globally? Or even better, provide a reference metric implementation? This detail is not specified in the metric description, and as you can see, @chris_k and I interpreted it in different ways; there are quite a few other details that can be implemented differently.

Hi all, thanks for your patience on the implementation! You can find a reference implementation here. It's in pure numpy for performance reasons. We're pretty sure it is mathematically identical to the backend, but let us know if you spot anything…fishy…

Hi,
I tried different metric implementations, including the reference one. All of them show much better scores in local validation than I get on the LB (0.95 local vs 0.64 LB). Even though it could be due to overfitting, it seems a bit fishy…
Also, improvements beyond 0.6 are much harder to get, and it seems like the limit is 0.7.
Most likely:

overfitting: we are on our own here, as always

wrong metrics: since all implementations fail the same way, this is not a likely reason

a wrong (or maybe slightly different from the description) LB backend calculation

@bull, it would be great if you could confirm that the LB backend is spot on and in accordance with the description.

For the first task it doesn't matter, as we group the frames and basically take the best ones to generate a sequence.

For the second task I'm not sure; the description is not so clear. What is the point of predicting a class for a fish which is 90% occluded by gloves? If it matters, then we can group and take the best frame prediction for each specific fish number.

For the third task, again, the length should be calculated from the clearest frames.

At the moment only the evaluation of the first task is quite clear.

You're right, the second task is the most unclear. I tried to ask @bull about it in another thread, "The sequence of fish in the submission file".
The main point is to understand what to do with frames where a fish is only partly visible: should we put zeros or predict classes? Because if we make a mistake, MAE / mean AUC will punish us.

Also, you can easily check whether the second task is the weakest one: simply crop your probabilities to 0.000001, so the 1st task is performed as usual but the 2nd task fails. Then check the score.

@wolhow123 I don't think it's that easy. First, cropping the probabilities is not good, because the first task needs the same fish to have the max probability. Multiplying by some small number will achieve that, but the ROC-AUC will not change. In some cases MAE is used in the second task, and it will change, but we don't know the proportion (and likely ROC-AUC is used more often).
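The point about ROC-AUC being unchanged is easy to verify: AUC depends only on the ranking of the scores, so multiplying all predictions by a small constant leaves it intact, while MAE gets much worse. A quick check with toy labels and scores (not competition data):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, mean_absolute_error

# Made-up binary labels and probability-like scores for illustration.
y_true = np.array([0, 0, 1, 1, 0, 1])
y_pred = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9])

# Same ranking, tiny magnitudes.
scaled = y_pred * 1e-6

# ROC-AUC is rank-based, so both calls give the identical value.
print(roc_auc_score(y_true, y_pred), roc_auc_score(y_true, scaled))

# MAE depends on the magnitudes, so it degrades badly after scaling.
print(mean_absolute_error(y_true, y_pred), mean_absolute_error(y_true, scaled))
```

So the "crop everything to a tiny value" probe only tells you something for videos where the MAE fallback kicks in, not for the ROC-AUC ones.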