Cross Validation Feature Parameters/Aggregate Statistics

If part of our feature transformation requires an aggregated variable across training years, do we need to hold out data from that training year when computing the aggregated statistics? Or can we compute the aggregate statistics once and use them for every iteration of the LOOCV?

Hi @oshbocker,

Including test data in the aggregate statistics computed for feature processing is a form of leakage. You should recompute the aggregate statistics in each cross-validation iteration, with that iteration’s test year excluded.
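For concreteness, a minimal sketch of that recomputation under LOOCV might look like the following. It assumes a pandas DataFrame with hypothetical columns water_year, site_id, and temp_dec31 (the column names are illustrative, not from the competition data); the point is only that the group means are fit on the training years and then applied to both splits.

```python
import pandas as pd

def loocv_folds(df: pd.DataFrame):
    """Yield (train_features, test_features) pairs with the aggregate
    statistic refit on the training years of each fold."""
    for test_year in sorted(df["water_year"].unique()):
        train = df[df["water_year"] != test_year]
        test = df[df["water_year"] == test_year]

        # Fit the aggregate statistic on training years only.
        site_means = train.groupby("site_id")["temp_dec31"].mean()

        # Apply the same train-derived statistic to both splits.
        yield (
            train.assign(mean_temp_dec31=train["site_id"].map(site_means)),
            test.assign(mean_temp_dec31=test["site_id"].map(site_means)),
        )
```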

To clarify, wouldn’t the aggregate statistic only be considered data leakage if it contained data from the forecast year that falls after the issue date? For example, if we computed the average December 31st temperature for a location across all training years, including the test year, and then used that aggregate statistic for an inference on January 1st, would that be considered data leakage?

Per the problem formulation of this competition, water years are considered units of observation. The full water year for the forecast year should be held out and considered test data. This is documented in the “Time and data use” section of the competition documentation.

Hi @jayqi and @oshbocker,

My apologies, but I’ve gotten a little confused since reading this thread. Let’s take the example of a simple baseline method that always predicts the mean annual flow volume for a given site. Let’s arbitrarily say that it’s Libby Reservoir Inflow and that the mean is 100 KAF of streamflow (I’m making that number up for simplicity; it’s quite unrealistic). Are we saying that the mean flow volume (100 in this example) has to be recalculated without the holdout year’s values? Or is it okay to include the holdout year’s values in the calculation of the overall mean and not recalculate the mean every time you change the test year?

@mmiron Including the annual flow from the holdout year in your mean is using test data to fit a parameter of your model, and that is a form of leakage. Avoiding leakage correctly means that your fitted model should not use any information from the holdout year. So the correct way to do cross-validation is to recalculate your mean in each iteration, with the holdout year excluded.
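To make that concrete for the mean-flow baseline, here is a toy sketch with made-up volumes (as in the example above): the prediction for each holdout year uses the mean of the other years only.

```python
# Made-up annual flow volumes (KAF) for a single site.
annual_volume = {2018: 95.0, 2019: 110.0, 2020: 102.0, 2021: 97.0}

for holdout_year, actual in annual_volume.items():
    # Fit the "model" (the mean) without the holdout year.
    train_values = [v for year, v in annual_volume.items() if year != holdout_year]
    prediction = sum(train_values) / len(train_values)
    print(f"{holdout_year}: predicted {prediction:.1f} KAF, actual {actual:.1f} KAF")
```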

Including any data from your test set when fitting the parameters of a feature transformation is leakage. This is the standard understanding of what constitutes leakage in machine learning.

For example, from the scikit-learn docs on cross-validation:

Just as it is important to test a predictor on data held-out from training, preprocessing (such as standardization, feature selection, etc.) and similar data transformations similarly should be learnt from a training set and applied to held-out data for prediction:

From the scikit-learn docs on “Data leakage”:

A common cause is not keeping the test and train data subsets separate. Test data should never be used to make choices about the model. The general rule is to never call fit on the test data. While this may sound obvious, this is easy to miss in some cases, for example when applying certain pre-processing steps.

Although both train and test data subsets should receive the same preprocessing transformation (as described in the previous section), it is important that these transformations are only learnt from the training data. For example, if you have a normalization step where you divide by the average value, the average should be the average of the train subset, not the average of all the data. If the test subset is included in the average calculation, information from the test subset is influencing the model.

From the Wikipedia article on leakage:

Premature featurization; leaking from premature featurization before Cross-validation/Train/Test split (must fit MinMax/ngrams/etc on only the train split, then transform the test set)

A StackExchange answer to “Featurization before or after dataset splitting” (additional discussion about why in link):

we have to do “feature extraction” from our training data only.

Another StackExchange answer to “Data normalization before or after train-test split?”:

Normalization across instances should be done after splitting the data between training and test set, using only the data from the training set.

This is because the test set plays the role of fresh unseen data, so it’s not supposed to be accessible at the training stage. Using any information coming from the test set before or during training is a potential bias in the evaluation of the performance.
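In scikit-learn terms, the usual way to get this behavior automatically is to wrap the preprocessing and the estimator in a single Pipeline and hand that to the cross-validation helper, so each fold fits the transformer on its training split only. A minimal sketch with synthetic stand-in data (not the competition data), grouping rows by water year:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

# Synthetic stand-in data: 10 "water years" with 4 observations each.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
y = rng.normal(size=40)
groups = np.repeat(np.arange(10), 4)

# Inside each fold, the scaler is fit on the training years only and then
# applied, unchanged, to the held-out year.
model = make_pipeline(StandardScaler(), Ridge())
scores = cross_val_score(model, X, y, groups=groups, cv=LeaveOneGroupOut())
```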

I probably wouldn’t have asked if I had given myself another hour to consider it, but it was a genuine question – though as it turns out (after double checking), it’s irrelevant in my case. Thank you for your thorough and concrete response, as always.
