The competition uses some form of multi-class weighted Brier score.
My implementation seems to be a bit off (or it could be my cross-validation which is wack). Can someone see if there is something wrong:
import numpy as np

# get w_c for the weighted Brier formula
weights = np.genfromtxt('../data/class_weights.json', delimiter=',',
                        skip_header=1, skip_footer=1, usecols=[0])
# y is the activities annotations vector, e.g. [4,4,7,7] which we need to convert into a probability
# matrix to compare with our predictions, so I one-hot-encode it
# yp is our probabilistic predictions
def brier_score(y, yp):
    from sklearn.preprocessing import OneHotEncoder
    yy = OneHotEncoder([20], sparse=False).fit_transform(y[:, np.newaxis])
    return (1./len(yy)) * np.sum(weights * ((yy-yp)**2))
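
For cross-checking, here is a minimal NumPy-only sketch of the same quantity, i.e. the mean over samples of sum_c w_c * (yy_c - yp_c)^2. It is not the competition's reference implementation. It assumes the labels in y are zero-based integers below n_classes and that weights holds one value per class; the names weighted_brier_score and n_classes are made up for this example. It avoids scikit-learn's OneHotEncoder, whose constructor arguments have changed across versions, and it checks shapes explicitly so that a mis-shaped weights array cannot broadcast silently.

```python
import numpy as np

def weighted_brier_score(y, yp, weights, n_classes=20):
    """Mean over samples of sum_c weights[c] * (onehot(y)[c] - yp[c])**2."""
    y = np.asarray(y, dtype=int)
    yp = np.asarray(yp, dtype=float)
    weights = np.asarray(weights, dtype=float)

    # Explicit shape checks: NumPy would otherwise broadcast a wrongly
    # shaped array without complaint and return a misleading score.
    if yp.shape != (len(y), n_classes):
        raise ValueError("yp must have shape (n_samples, n_classes)")
    if weights.shape != (n_classes,):
        raise ValueError("weights must have shape (n_classes,)")

    # One-hot encode by picking rows of the identity matrix
    # (assumes labels are integers in [0, n_classes)).
    yy = np.eye(n_classes)[y]

    return np.sum(weights * (yy - yp) ** 2) / len(y)
```

If this function and brier_score disagree on the same inputs, the shape of weights and the label encoding are the first things to inspect; if they agree, the discrepancy is more likely in the cross-validation setup.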