Hindcast Evaluation code implementation question


For the Hindcast evaluation we need to implement two functions: predict and preprocess.
Our approach shares a lot of calculations across different issue dates for the same site_id, and recalculating the information for past months on every call is wasteful.

So the question is: is it fine to precalculate all features and make all predictions in the preprocess function, and then simply fetch those predictions in the predict function?

The predictions in preprocess are made in a way that ensures future data does not leak into the features.


@jayqi is the final word on these things (I’m just a contestant), but here’s my understanding of it:

def preprocess(src_dir, data_dir, preprocessed_dir) -> dict:
    ret_val = dict()
    # ...
    ret_val['my_dataset_object'] = [ 1.0, 2.0, 3.0, ]
    return ret_val

def predict(site_id, issue_date, assets, src_dir, data_dir, preprocessed_dir):
    ds = assets['my_dataset_object']
    quantile_10, quantile_50, quantile_90 = ds[0], ds[1], ds[2]
    # ...
    return quantile_10, quantile_50, quantile_90

As long as the keys of the dictionary you return from preprocess() are hashable, you shouldn’t have a problem; Python itself places no such restriction on the values, so a list like the one above is fine. Note that preprocess() is called a single time before any predictions are issued, and every call to predict() is passed the assets object that you returned.
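Applied to the original question, a minimal sketch of "predict everything in preprocess, then look it up in predict" might look like the following. The function signatures are the ones above; SITE_IDS, ISSUE_DATES and _forecast_for are hypothetical placeholders I made up for illustration, and whether this pattern is acceptable is for the organizers to confirm.

```python
from pathlib import Path

# Hypothetical placeholders: real site IDs and issue dates come from the challenge data.
SITE_IDS = ["site_a", "site_b"]
ISSUE_DATES = ["2020-01-01", "2020-02-01"]


def _forecast_for(site_id: str, issue_date: str) -> tuple[float, float, float]:
    """Hypothetical stand-in for a real model.

    A real implementation must only use data available on or before issue_date.
    """
    base = float(len(site_id)) + float(issue_date[5:7])
    return base - 1.0, base, base + 1.0


def preprocess(src_dir: Path, data_dir: Path, preprocessed_dir: Path) -> dict:
    # Compute every forecast once, keyed by the (site_id, issue_date) tuple.
    predictions = {
        (site_id, issue_date): _forecast_for(site_id, issue_date)
        for site_id in SITE_IDS
        for issue_date in ISSUE_DATES
    }
    return {"predictions": predictions}


def predict(site_id, issue_date, assets, src_dir, data_dir, preprocessed_dir):
    # Fetch the forecast that preprocess already computed.
    return assets["predictions"][(site_id, issue_date)]
```

The tuple key keeps each forecast tied to its own issue date, which makes it easier to check that no later data was used for it.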

Hope that helps.

Hi @RomanChernenko and @mmiron,

The preprocess function is provided as an option in case you want to do feature calculations once, more efficiently, outside of the predict loop. You are encouraged to use it for data processing to reduce redundancy. You might also find other ways to reduce redundant calculations, such as caching calculation results in your code, in the assets dictionary that is returned by preprocess and then passed between predict calls, or on disk in preprocessed_dir.
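As a rough illustration of caching in the assets dictionary, here is a minimal sketch. It assumes the same assets object is passed to every predict call, so that changes made in one call are visible in the next; _compute_site_features and the placeholder forecast are hypothetical and stand in for real calculations.

```python
from pathlib import Path


def _compute_site_features(site_id: str) -> dict:
    """Hypothetical expensive per-site calculation that does not depend on the issue date."""
    return {"site_id": site_id, "n_chars": len(site_id)}


def preprocess(src_dir: Path, data_dir: Path, preprocessed_dir: Path) -> dict:
    # Start with an empty cache; predict fills it lazily, one site at a time.
    return {"site_features": {}}


def predict(site_id, issue_date, assets, src_dir, data_dir, preprocessed_dir):
    cache = assets["site_features"]
    if site_id not in cache:
        # First call for this site: compute the features and remember them.
        cache[site_id] = _compute_site_features(site_id)
    features = cache[site_id]
    # Hypothetical placeholder forecast derived from the cached features.
    median = float(features["n_chars"])
    return median - 1.0, median, median + 1.0
```

Only values that are valid for every issue date belong in a cache like this; anything that depends on the issue date should be keyed by that date as well.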

Overall, we encourage you to structure your code so that it is as clear and readable as possible, so that challenge organizers can understand how it works and verify that you are correctly following the requirements about the use of data and time.
