Restrictions for using test data for training

twalen · January 12, 2019, 8:40pm

The problem description contains the following note:

Note: you may not use future data in making your predictions.

Could you be more specific about what kind of restrictions you have in mind? In particular, could you give an example of usage that violates this rule and an example of usage that is acceptable?

bull · January 14, 2019, 5:13pm

That information is contained in the following text:

Note: you may not use future data in making your predictions. The train and test sets are split in time (i.e. all the observations in the test set occur after the train set) so you may use all of the training set in making your predictions. However, you must be careful not to use any of the time series information provided in the test set that is future to the process being predicted.

All training set data may be used. Algorithms that use test set data either for training purposes or for feature inputs at inference time should only use observations from before the period you are trying to predict.

twalen · January 14, 2019, 8:55pm

Thank you very much for the answer.

Just to have a more formal definition, does it boil down to this:

When processing (either for training or predicting) series with process_id=X, its features could be computed from:

- data from the series itself (process_id=X)
- whole train set
- test_data[test_data.timestamp <= starting_timestamp_of_X]
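
For concreteness, here is a minimal sketch of how those three sources could be combined for one process. It uses only the standard library, and everything in it is an assumption made for illustration: the rows are invented `(process_id, timestamp)` pairs, `allowed_rows` is a hypothetical helper, and the start of X is taken to be the earliest timestamp among X's own rows.

```python
from datetime import datetime

# Hypothetical rows: (process_id, timestamp) pairs standing in for real observations.
train = [
    (1, datetime(2018, 3, 1, 8, 0)),
    (2, datetime(2018, 3, 2, 9, 0)),
]
test = [
    (10, datetime(2018, 6, 1, 8, 0)),
    (10, datetime(2018, 6, 1, 8, 5)),
    (11, datetime(2018, 6, 2, 8, 0)),
    (11, datetime(2018, 6, 2, 8, 5)),
    (12, datetime(2018, 6, 3, 8, 0)),
]


def allowed_rows(process_id, train, test):
    """Collect the rows listed above for one test process (hypothetical helper)."""
    # Source 1: the series itself.
    own = [row for row in test if row[0] == process_id]
    # Assumption: the start of X is the earliest timestamp in its own series.
    start = min(ts for _, ts in own)
    # Source 3: other test rows that are not later than the start of X.
    earlier_test = [row for row in test if row[0] != process_id and row[1] <= start]
    # Source 2: the whole train set.
    return train + own + earlier_test


rows = allowed_rows(11, train, test)
```

With this toy data, process 11 gets the whole train set, its own rows, and the rows of process 10. Process 12 starts later, so its rows are left out.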

Gillesvdw · January 15, 2019, 1:27pm

I agree with twalen that this is rather confusing… Is this even ever applicable? Aren’t we predicting stuff for the final_rinse phase, which is not included in the test set? As such, it is impossible to use future data?

bull · January 16, 2019, 11:06pm

Every observation provided has a timestamp, so for any given observation it is possible to tell which observations occur in the future and which occur in the past.

Where this is relevant from a rules perspective (rather than as a best practice for building your models) is that you cannot use data in the test set whose timestamp comes after the start of the series you are predicting. Since some processes in the test set occur after others, it is theoretically possible to use future data.

The bullets @twalen provided are a good summary, but the simplest explanation (which encompasses all of his bullets) is that for process X, you may only use data where observation.timestamp <= starting_timestamp_of_X.
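
One way to sanity-check a feature pipeline against this rule is to look for rows that break it. The sketch below is only an illustration under stated assumptions: the row layout (dicts with `process_id` and `timestamp` keys), the sample values, and the helper name `find_future_rows` are all invented here, and the check is meant for rows borrowed from other processes.

```python
from datetime import datetime


def find_future_rows(feature_rows, start_of_x):
    """Return the rows whose timestamp is later than the start of process X."""
    return [row for row in feature_rows if row["timestamp"] > start_of_x]


# Hypothetical start of process X and rows borrowed from other test processes.
start_of_x = datetime(2018, 6, 2, 8, 0)
feature_rows = [
    {"process_id": 10, "timestamp": datetime(2018, 6, 1, 8, 0)},  # before X: fine
    {"process_id": 12, "timestamp": datetime(2018, 6, 3, 8, 0)},  # after X: future data
]

leaks = find_future_rows(feature_rows, start_of_x)
```

An empty `leaks` list would mean that none of the borrowed rows come from after the start of X. Here the row from process 12 is flagged.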

Gillesvdw · January 17, 2019, 6:55am

Ok cheers @bull. I think this clarifies the issue
