Question about scoring metric

This line of code is taken from the score_per_query method in the scoring script, score_submission.py:

predicted_n_pos = merged["actual"].groupby(QUERY_ID_COL).sum().astype("int64").rename()

From my understanding of adjusted mean average precision, the factor is meant to be (number predicted correct) / (number actual correct). Hence, I think this is what the script intended for this line:

predicted_n_pos = merged.groupby(QUERY_ID_COL)["actual"].sum().astype("int64").rename()

Please clarify.

Thanks.

Hi @flamethrower.

Yes, the adjustment that turns scikit-learn’s classification mean average precision into information retrieval mean average precision is indeed a factor of (number predicted correct) / (number actual correct).
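To make that factor concrete, here is a minimal sketch in plain Python. It is not the code from score_submission.py; the helper average_precision, the toy labels, and the counts are all hypothetical, and it assumes the classification-style score normalizes by the number of correct rows among those returned while the retrieval-style score normalizes by all actually correct items.

```python
def average_precision(ranked_labels, n_relevant):
    """Sum the precision at each correct hit, then divide by n_relevant.

    Hypothetical helper for illustration only.
    """
    hits = 0
    total = 0.0
    for rank, label in enumerate(ranked_labels, start=1):
        if label:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / n_relevant


# Hypothetical query: 4 correct images exist in the database, but the
# submission returns only 3 rows (sorted by score), 2 of them correct.
ranked_labels = [1, 0, 1]
predicted_n_pos = sum(ranked_labels)  # correct rows among the returned ones
actual_n_pos = 4                      # all correct rows that exist

# Classification-style AP only sees the returned rows.
classification_ap = average_precision(ranked_labels, predicted_n_pos)

# Retrieval-style AP also penalizes the correct rows that were never returned.
retrieval_ap = average_precision(ranked_labels, actual_n_pos)

# Rescaling the classification-style score by the factor recovers it.
adjusted_ap = classification_ap * predicted_n_pos / actual_n_pos
```

Under these assumptions, adjusted_ap and retrieval_ap agree, which is why the count of correct rows among the predictions is needed per query.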

The original line of code and the line of code you are proposing should be equivalent for the scoring script. I’m not totally sure what you think is incorrect. Is it related to whether QUERY_ID_COL is available on the merged["actual"] series to group by? In our script, we load all of the dataframes such that QUERY_ID_COL is set as the index and not a regular column. (See here.) That means it’s still available for the groupby operation on the series. If you don’t have it set as an index, then the groupby on the series will raise an error. The example below shows both forms giving the same result.

import pandas as pd
df = pd.DataFrame(
    {
        "query_id": ["A", "A", "A", "B", "B", "B", "B"],
        "database_image_id": ["01", "02", "03", "01", "02", "03", "04"],
        "score": [1.0, 0.9, 0.8, 1.0, 0.9, 0.8, 0.7],
        "actual": [1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0],
    }
)
df = df.set_index("query_id")
df
#>          database_image_id  score  actual
#> query_id
#> A                       01    1.0     1.0
#> A                       02    0.9     0.0
#> A                       03    0.8     1.0
#> B                       01    1.0     0.0
#> B                       02    0.9     1.0
#> B                       03    0.8     1.0
#> B                       04    0.7     1.0

df["actual"]
#> query_id
#> A    1.0
#> A    0.0
#> A    1.0
#> B    0.0
#> B    1.0
#> B    1.0
#> B    1.0
#> Name: actual, dtype: float64

df["actual"].groupby("query_id").sum().astype("int64").rename()
#> query_id
#> A    2
#> B    3
#> dtype: int64

df.groupby("query_id")["actual"].sum().astype("int64").rename()
#> query_id
#> A    2
#> B    3
#> dtype: int64

Created at 2022-05-10 10:29:55 EDT by reprexlite v0.4.3
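For contrast, here is a minimal sketch of the failure case described above, where query_id is left as a regular column instead of the index. The names df_col, series_groupby_failed, and counts are only for illustration and do not come from the scoring script.

```python
import pandas as pd

# Same toy data as above, but without set_index("query_id"),
# so query_id stays a regular column.
df_col = pd.DataFrame(
    {
        "query_id": ["A", "A", "A", "B", "B", "B", "B"],
        "actual": [1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0],
    }
)

# Selecting the column first leaves a Series with a plain integer index,
# so there is nothing named "query_id" to group by and pandas raises KeyError.
try:
    df_col["actual"].groupby("query_id").sum()
    series_groupby_failed = False
except KeyError:
    series_groupby_failed = True

# Grouping the DataFrame first still works, because query_id is a column.
counts = df_col.groupby("query_id")["actual"].sum().astype("int64")
```

So the two forms are interchangeable only when query_id is the index, which is how the script loads its dataframes.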

Thank you so much for the detailed response.

My bad, I can see they are actually equivalent.