Performance metric implemented in Python

Below is my attempt at implementing the performance metric in Python 3 (adding from __future__ import division would probably be enough for Python 2). It uses the editdistance library for the Levenshtein distance calculation. I'm not 100% sure it is correct; this is just my interpretation of the rules, and I would be glad if anyone points out errors in the implementation.

import editdistance
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score, r2_score

CLASSES = [
    'species_fourspot',
    'species_grey sole',
    'species_other',
    'species_plaice',
    'species_summer',
    'species_windowpane',
    'species_winter',
]


def get_score(true_df, pred_df):
    edit_score = get_edit_score(true_df, pred_df)
    auc_score = np.mean([roc_auc_score(true_df[cls] == 1, pred_df[cls])
                         for cls in CLASSES])
    with_fish = ~np.isnan(true_df['length'])
    length_score = max(0, r2_score(true_df['length'][with_fish],
                                   pred_df['length'][with_fish]))
    return 0.6 * edit_score + 0.3 * auc_score + 0.1 * length_score


def get_edit_score(true_df: pd.DataFrame, pred_df: pd.DataFrame):
    video_ids = true_df['video_id'].unique()
    # collapse frames to one row per (video_id, fish_number) by taking the
    # per-class maximum, then map each fish to its most likely species
    true_df, pred_df = [df.groupby(['video_id', 'fish_number']).max()
                          .reset_index().set_index('video_id')
                        [CLASSES].apply(np.argmax, axis=1)
                        for df in [true_df, pred_df]]
    # average the normalized edit distance over videos
    return 1 - np.mean([
        normalized_edit_distance(
            *[list(df.loc[video_id]) for df in [true_df, pred_df]])
        for video_id in video_ids])


def normalized_edit_distance(true_seq, pred_seq) -> float:
    return min(1., editdistance.eval(true_seq, pred_seq) / len(true_seq))
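
As a quick sanity check of the edit-distance part, here is a toy example (my own illustration, reusing the functions and imports above; the class labels are arbitrary):

# One fish is missing from the prediction, so the Levenshtein distance is 1
# and the normalized distance is 1 / 3.
true_seq = ['species_plaice', 'species_other', 'species_winter']
pred_seq = ['species_plaice', 'species_winter']
assert editdistance.eval(true_seq, pred_seq) == 1
print(normalized_edit_distance(true_seq, pred_seq))  # 0.333...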

FWIW, there are several issues:

  • dataframes must be aligned, but this is not checked here
  • AUC calculation misses normalization to [0, 1]
  • AUC calculation is done across all videos, but should be done inside one video and then averaged instead

With some inspiration from @lopuhin I created this implementation, which includes the AUC normalization and the fallback to MAE (not sure I got this right, though). It calculates all metrics per video, weights them, and then averages over all videos. The print statements also give you a breakdown by video and by individual metric.

import pandas as pd
import numpy as np
import editdistance
from sklearn.metrics import roc_auc_score, r2_score, mean_absolute_error

species = [
    'species_fourspot',
    'species_grey sole',
    'species_other',
    'species_plaice',
    'species_summer',
    'species_windowpane',
    'species_winter']

def get_metrics(true_df, pred_df):
    vid_scores = []
    for vid, indexes in true_df.groupby('video_id').groups.items():
        true = true_df.loc[indexes]
        pred = pred_df.loc[indexes]

        sequence_score = get_sequence_score(true, pred)
        auc_score = get_auc_score(true, pred)
        length_score = get_length_score(true, pred)

        print(vid)
        print("Edit score: %.3f" % sequence_score)
        print("AUC score: %.3f" % auc_score)
        print("Length score: %.3f" % length_score)
        print("")
        weighted_score = 0.6 * sequence_score + 0.3 * auc_score + 0.1 * length_score
        vid_scores.append(weighted_score)
    return np.mean(vid_scores)

def get_sequence_score(true, pred):
    true_seq = true[species].apply(np.argmax, axis=1).tolist()
    pred_seq = pred[species].apply(np.argmax, axis=1).tolist()
    edist = editdistance.eval(true_seq, pred_seq)
    edist = min(edist / len(true_seq), 1)
    score = 1 - edist
    return score
    
def get_auc_score(true, pred):
    try:
        # roc_auc_score raises ValueError when only one class is present
        # in the video's true labels for a species
        score = np.mean([roc_auc_score(true[spc],
                                       pred[spc])
                         for spc in species])
        # normalize: 2 * AUC - 1 maps chance level (0.5) to 0 and a perfect AUC to 1
        score = 2 * score - 1
    except ValueError:
        # fall back to MAE when ROC-AUC is undefined
        score = 1 - np.sum([mean_absolute_error(true[spc],
                                                pred[spc])
                            for spc in species])
    return score

def get_length_score(true, pred):
    fish = true['length'] > 0
    score = r2_score(true['length'][fish],
                     pred['length'][fish])
    return max(score, 0)

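# NOTE: the file names below are placeholders; substitute your own
# ground-truth and prediction CSVs (they must have the columns used above).
true_df = pd.read_csv('training_labels.csv')
pred_df = pd.read_csv('my_submission.csv')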
print("Overall score: %.3f" % get_metrics(true_df, pred_df))

@bull could you please clarify whether the length score should be calculated per video and then averaged, or calculated globally? Or, even better, provide a reference metric implementation? :slight_smile: I think this detail is not specified in the metric description, and as you can see, @chris_k and I interpreted it in different ways; there are quite a few other details that can be implemented in different ways too.
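
To make the ambiguity concrete, here is a small sketch of the two interpretations (my own illustration, not a reference implementation), assuming dataframes with the video_id and length columns used above:

import numpy as np
from sklearn.metrics import r2_score

def length_score_global(true_df, pred_df):
    # one R^2 over all annotated lengths, clipped to [0, 1]
    mask = true_df['length'].notnull()
    return max(0, r2_score(true_df['length'][mask], pred_df['length'][mask]))

def length_score_per_video(true_df, pred_df):
    # R^2 inside each video, clipped, then averaged over videos
    scores = []
    for idx in true_df.groupby('video_id').groups.values():
        mask = true_df.loc[idx, 'length'].notnull()
        if mask.sum() < 2:
            continue  # R^2 is undefined for fewer than two points
        scores.append(max(0, r2_score(true_df.loc[idx, 'length'][mask],
                                      pred_df.loc[idx, 'length'][mask])))
    return np.mean(scores)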


Hi, guys! Did you manage to implement the correct metric?

Hi all, thanks for your patience on the implementation! You can find a reference implementation here. It’s in pure numpy for performance reasons. We’re pretty sure it is mathematically identical to the backend, but let us know if you spot anything…fishy…


Hi,
I tried different metric implementations, including the reference one. All of them show much better scores on local validation than I get on the LB (0.95 local vs 0.64 LB). Even though it could be due to overfitting, it seems a bit fishy.
Also, improvements after 0.6 are much harder to get, and it seems like the limit is 0.7.
Most likely:

  1. overfitting - we are on our own here, as always
  2. wrong metrics - not likely the reason, since all of the implementations fail in the same way
  3. wrong (or maybe a bit different from the description) LB backend calculation

@bull it would be great if you could confirm that the LB backend is spot on and is in accordance with the description :slight_smile:

You can't calculate the score on your local validation splits because you don't have labels for each frame.

@wolhow123
By task:

  1. for the first task it doesn’t matter, as we group and basically take the best frames to generate a sequence
  2. for the second task I'm not sure; the description is not so clear. What is the point of predicting a class for a fish that is 90% occluded by gloves? If it matters, we can group and take the best frame prediction for each specific fish number (see the sketch below)
  3. for the third task, again, the length should be calculated from the clearest frames

At the moment, only the evaluation of the first task is quite clear.
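
For reference, a minimal sketch of the grouping step mentioned above (my own illustration; it assumes a frame-level dataframe with the video_id, fish_number and species_* columns used earlier in this thread):

import pandas as pd

def best_frame_predictions(pred_df):
    # collapse frame-level predictions to one row per (video_id, fish_number)
    # by taking the per-class maximum over frames
    species_cols = [c for c in pred_df.columns if c.startswith('species_')]
    return (pred_df
            .groupby(['video_id', 'fish_number'])[species_cols]
            .max()
            .reset_index())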


You’re right, the second task is the most unclear. I tried to ask @bull about it in another thread: The sequence of fish in the submission file.
The main point is to understand what to do with frames where the fish is only partly visible: do we put zeros or predict classes? If we make a mistake, the MAE / mean AUC will punish us.

Also, you can easily check whether the second task is the weak one: simply crop your probabilities to 0.000001 so that the first task is performed as usual but the second task fails, then check the score.

@wolhow123 I don’t think it’s that easy. First, cropping the probabilities is not good because the first task needs the same fish to have the max probability. Multiplying by some small number will achieve that, but ROC-AUC will not change. In some cases MAE is used in the second task and it will change, but we don’t know the proportion (and likely ROC-AUC is used more).

If you crop the values, you will not change the max probability.
PS: not cropping, but dividing by 1,000,000.
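
A minimal sketch of that probe (my own illustration, assuming pred_df is your submission dataframe with the species columns listed earlier; the output file name is a placeholder):

# Divide all class probabilities by a constant. The per-row argmax (task 1)
# is unchanged, and ROC-AUC is unchanged too, since it is rank-based; only
# absolute-error style metrics such as MAE are affected.
probe_df = pred_df.copy()
probe_df[species] = probe_df[species] / 1000000
probe_df.to_csv('probe_submission.csv', index=False)  # hypothetical file name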