Restrictions for using test data for training

The problem description contains the following note:

Note: you may not use future data in making your predictions.

Could you be more specific about what kind of restrictions you have in mind? In particular, could you give an example of usage that violates this rule and an example of usage that is acceptable?

That information is contained in the following text:

Note: you may not use future data in making your predictions. The train and test sets are split in time (i.e. all the observations in the test set occur after the train set) so you may use all of the training set in making your predictions. However, you must be careful not to use any of the time series information provided in the test set that is future to the process being predicted.

All training set data may be used. Algorithms that use test set data either for training purposes or for feature inputs at inference time should only use observations from before the period you are trying to predict.

Thank you very much for the answer.

Just to have a more formal definition, does it boil down to this:

When processing (either for training or predicting) series with process_id=X, its features could be computed from:

  • data from the series itself (process_id=X)
  • whole train set
  • test_data[test_data.timestamp <= starting_timestamp_of_X]

I agree with twalen that this is rather confusing… Is this even ever applicable? Aren’t we predicting stuff for the final_rinse phase, which is not included in the test set? As such, it is impossible to use future data?

Every observation provided has a timestamp, so for any observation it is possible to tell which other observations occur in the future and which occur in the past.

Where this is relevant from a rules perspective (rather than from a best-practices-for-building-your-models perspective) is that you cannot use data in the test set whose timestamp comes after the start of the series you are predicting. Since some processes occur after other processes, it is theoretically possible to use future data.

The bullets @twalen provided are a good summary, but the simplest explanation (which encompasses all of his bullets) is that for process X, you may only use data where observation.timestamp <= starting_timestamp_of_X.
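
For anyone who wants this as code, here is a minimal sketch of the filter for the test set, following the bullets above. It is only an illustration: the `Observation` class, the `value` field, the `allowed_test_rows` helper and the toy rows are all made up for this example and are not the competition's actual schema. The training set is not shown because all of it may be used.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Observation:
    """Hypothetical row layout; the real data has many more columns."""
    process_id: int
    timestamp: datetime
    value: float  # stand-in for any sensor reading


def allowed_test_rows(test_rows, process_id):
    """Return the test-set rows usable when predicting `process_id`."""
    own = [r for r in test_rows if r.process_id == process_id]
    if not own:
        raise ValueError(f"process_id {process_id} not found in test set")
    # The series starts at its earliest timestamp.
    start = min(r.timestamp for r in own)
    # Keep the series itself, plus any other row stamped at or before its start.
    return [
        r for r in test_rows
        if r.process_id == process_id or r.timestamp <= start
    ]


# Toy data, invented for illustration only.
test_rows = [
    Observation(1, datetime(2018, 5, 1, 8, 0), 0.2),
    Observation(1, datetime(2018, 5, 1, 8, 5), 0.3),
    Observation(2, datetime(2018, 5, 1, 9, 0), 0.5),
    Observation(2, datetime(2018, 5, 1, 9, 5), 0.6),
    Observation(3, datetime(2018, 5, 1, 10, 0), 0.9),
]

usable = allowed_test_rows(test_rows, process_id=2)
usable_ids = sorted({r.process_id for r in usable})
```

With these toy rows, predicting process 2 may draw on processes 1 and 2 but not on process 3, which starts later.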

Ok cheers @bull. I think this clarifies the issue :slight_smile: