Performance metric implemented in Python

Below is my attempt at implementing the performance metric in Python 3 (adding from __future__ import division would probably be enough for Python 2). It uses the editdistance library for the Levenshtein distance calculation. I'm not 100% sure it is correct; this is just my interpretation of the rules, and I would be glad if anyone points out errors in the implementation.

import editdistance
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score, r2_score

CLASSES = [
    'species_fourspot',
    'species_grey sole',
    'species_other',
    'species_plaice',
    'species_summer',
    'species_windowpane',
    'species_winter',
]


def get_score(true_df, pred_df):
    edit_score = get_edit_score(true_df, pred_df)
    auc_score = np.mean([roc_auc_score(true_df[cls] == 1, pred_df[cls])
                         for cls in CLASSES])
    with_fish = ~np.isnan(true_df['length'])
    length_score = max(0, r2_score(true_df['length'][with_fish],
                                   pred_df['length'][with_fish]))
    return 0.6 * edit_score + 0.3 * auc_score + 0.1 * length_score


def get_edit_score(true_df: pd.DataFrame, pred_df: pd.DataFrame):
    video_ids = true_df['video_id'].unique()
    # collapse frames to one row per (video_id, fish_number) by taking the
    # per-class maximum, then map each fish to its most likely species
    true_df, pred_df = [df.groupby(['video_id', 'fish_number']).max()
                          .reset_index().set_index('video_id')
                        [CLASSES].apply(np.argmax, axis=1)
                        for df in [true_df, pred_df]]
    # average the normalized edit distance over videos
    return 1 - np.mean([
        normalized_edit_distance(
            *[list(df.loc[video_id]) for df in [true_df, pred_df]])
        for video_id in video_ids])


def normalized_edit_distance(true_seq, pred_seq) -> float:
    return min(1., editdistance.eval(true_seq, pred_seq) / len(true_seq))
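
As a quick sanity check of the edit-distance part, here is a toy example (my own illustration, reusing the functions and imports above; the class labels are arbitrary):

# One fish is missing from the prediction, so the Levenshtein distance is 1
# and the normalized distance is 1 / 3.
true_seq = ['species_plaice', 'species_other', 'species_winter']
pred_seq = ['species_plaice', 'species_winter']
assert editdistance.eval(true_seq, pred_seq) == 1
print(normalized_edit_distance(true_seq, pred_seq))  # 0.333...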

FWIW, there are several issues:

  • dataframes must be aligned, but this is not checked here
  • AUC calculation misses normalization to [0, 1]
  • AUC calculation is done across all videos, but should be done inside one video and then averaged instead

With some inspiration from @lopuhin I created this implementation, which includes the AUC normalization and the fallback to MAE (not sure I got this right, though). It calculates all metrics per video, weights them, and then averages over all videos. The print statements also give you a breakdown by video and by individual metric.

import pandas as pd
import numpy as np
import editdistance
from sklearn.metrics import roc_auc_score, r2_score, mean_absolute_error

species = [
    'species_fourspot',
    'species_grey sole',
    'species_other',
    'species_plaice',
    'species_summer',
    'species_windowpane',
    'species_winter']

def get_metrics(true_df, pred_df):
    vid_scores = []
    for vid, indexes in true_df.groupby('video_id').groups.items():
        true = true_df.loc[indexes]
        pred = pred_df.loc[indexes]

        sequence_score = get_sequence_score(true, pred)
        auc_score = get_auc_score(true, pred)
        length_score = get_length_score(true, pred)

        print(vid)
        print("Edit score: %.3f" % sequence_score)
        print("AUC score: %.3f" % auc_score)
        print("Length score: %.3f" % length_score)
        print("")
        weighted_score = 0.6 * sequence_score + 0.3 * auc_score + 0.1 * length_score
        vid_scores.append(weighted_score)
    return np.mean(vid_scores)

def get_sequence_score(true, pred):
    true_seq = true[species].apply(np.argmax, axis=1).tolist()
    pred_seq = pred[species].apply(np.argmax, axis=1).tolist()
    edist = editdistance.eval(true_seq, pred_seq)
    edist = min(edist / len(true_seq), 1)
    score = 1 - edist
    return score
    
def get_auc_score(true, pred):
    try:
        # roc_auc_score raises ValueError when only one class is present
        # in the video's true labels for a species
        score = np.mean([roc_auc_score(true[spc],
                                       pred[spc])
                         for spc in species])
        # normalize: 2 * AUC - 1 maps chance level (0.5) to 0 and a perfect AUC to 1
        score = 2 * score - 1
    except ValueError:
        # fall back to MAE when ROC-AUC is undefined
        score = 1 - np.sum([mean_absolute_error(true[spc],
                                                pred[spc])
                            for spc in species])
    return score

def get_length_score(true, pred):
    fish = true['length'] > 0
    score = r2_score(true['length'][fish],
                     pred['length'][fish])
    return max(score, 0)

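# NOTE: the file names below are placeholders; substitute your own
# ground-truth and prediction CSVs (they must have the columns used above).
true_df = pd.read_csv('training_labels.csv')
pred_df = pd.read_csv('my_submission.csv')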
print("Overall score: %.3f" % get_metrics(true_df, pred_df))

@bull could you please clarify whether the length score should be calculated per video and then averaged, or calculated globally? Or, even better, provide a reference metric implementation? :slight_smile: I think this detail is not specified in the metric description, and as you can see, @chris_k and I interpreted it in different ways; there are quite a few other details that can be implemented in different ways too.
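
To make the ambiguity concrete, here is a small sketch of the two interpretations (my own illustration, not a reference implementation), assuming dataframes with the video_id and length columns used above:

import numpy as np
from sklearn.metrics import r2_score

def length_score_global(true_df, pred_df):
    # one R^2 over all annotated lengths, clipped to [0, 1]
    mask = true_df['length'].notnull()
    return max(0, r2_score(true_df['length'][mask], pred_df['length'][mask]))

def length_score_per_video(true_df, pred_df):
    # R^2 inside each video, clipped, then averaged over videos
    scores = []
    for idx in true_df.groupby('video_id').groups.values():
        mask = true_df.loc[idx, 'length'].notnull()
        if mask.sum() < 2:
            continue  # R^2 is undefined for fewer than two points
        scores.append(max(0, r2_score(true_df.loc[idx, 'length'][mask],
                                      pred_df.loc[idx, 'length'][mask])))
    return np.mean(scores)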


Hi, guys! Did you manage to implement the correct metric?

Hi all, thanks for your patience on the implementation! You can find a reference implementation here. It’s in pure numpy for performance reasons. We’re pretty sure it is mathematically identical to the backend, but let us know if you spot anything…fishy…


Hi,
I tried different metric implementations, including the reference one. All of them show much better scores on local validation than I get on the LB (0.95 local vs 0.64 LB). Even though it could be due to overfitting, it seems a bit fishy.
Also, improvements after 0.6 are much harder to get, and it seems like the limit is 0.7.
Most likely:

  1. overfitting - we are on our own here, as always
  2. wrong metrics - not likely the reason, since all of the implementations fail in the same way
  3. wrong (or maybe a bit different from the description) LB backend calculation

@bull it would be great if you could confirm that the LB backend is spot on and is in accordance with the description :slight_smile:

You can't calculate the score on your local validation splits because you don't have labels for each frame.

@wolhow123
By task:

  1. for the first task it doesn’t matter, as we group and basically take the best frames to generate a sequence
  2. for the second task I'm not sure; the description is not so clear. What is the point of predicting a class for a fish that is 90% occluded by gloves? If it matters, we can group and take the best frame prediction for each specific fish number (see the sketch below)
  3. for the third task, again, the length should be calculated from the clearest frames

At the moment, only the evaluation of the first task is quite clear.
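
For reference, a minimal sketch of the grouping step mentioned above (my own illustration; it assumes a frame-level dataframe with the video_id, fish_number and species_* columns used earlier in this thread):

import pandas as pd

def best_frame_predictions(pred_df):
    # collapse frame-level predictions to one row per (video_id, fish_number)
    # by taking the per-class maximum over frames
    species_cols = [c for c in pred_df.columns if c.startswith('species_')]
    return (pred_df
            .groupby(['video_id', 'fish_number'])[species_cols]
            .max()
            .reset_index())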


You’re right, the second task is the most unclear. I tried to ask @bull about it in another thread: The sequence of fish in the submission file.
The main point is to understand what to do with frames where the fish is only partly visible: do we put zeros or predict classes? If we make a mistake, the MAE / mean AUC will punish us.

Also, you can easily check whether the second task is the weak one: simply crop your probabilities to 0.000001 so that the first task is performed as usual but the second task fails, then check the score.

@wolhow123 I don’t think it’s that easy. First, cropping the probabilities is not good because the first task needs the same fish to have the max probability. Multiplying by some small number will achieve that, but ROC-AUC will not change. In some cases MAE is used in the second task and it will change, but we don’t know the proportion (and likely ROC-AUC is used more).

If you crop the values, you will not change the max probability.
PS: not cropping, but dividing by 1,000,000.
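
A minimal sketch of that probe (my own illustration, assuming pred_df is your submission dataframe with the species columns listed earlier; the output file name is a placeholder):

# Divide all class probabilities by a constant. The per-row argmax (task 1)
# is unchanged, and ROC-AUC is unchanged too, since it is rank-based; only
# absolute-error style metrics such as MAE are affected.
probe_df = pred_df.copy()
probe_df[species] = probe_df[species] / 1000000
probe_df.to_csv('probe_submission.csv', index=False)  # hypothetical file name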