Can I use statistics computed over the whole training set in my solution?

Brasnold · December 17, 2020, 8:39pm

Suppose I want to standardize the input by subtracting the per-pixel mean and dividing by the per-pixel standard deviation. Can I use the whole training set to compute them? (I guess yes.) And can I use the whole dataset, including the test set? (I guess no.)

Using the whole training set technically violates the rule against using information from future images, but it does not involve the test set, so I think it is legit. In other words, I’m asking whether this rule applies only to images for which a prediction is requested. Thank you.

glipstein · December 21, 2020, 7:19pm

Hi @Brasnold - Your guesses are correct. You can imagine that you are running the prediction on a real storm in real time. Your model could use all information gathered from the training set, but may only use images up to the point of prediction for the storm it is making an inference on in the test set.
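
For anyone who wants to see what that looks like in practice, here is a minimal sketch of per-pixel standardization that fits the statistics on the training set only and then reuses them on test images. It assumes the images are already loaded as NumPy arrays of shape (N, H, W); the function names, the `eps` guard, and the random stand-in data are all hypothetical and not part of the competition code.

```python
import numpy as np

def fit_pixel_stats(train_images, eps=1e-8):
    """Compute per-pixel mean and std over the training set only.

    train_images: array of shape (N, H, W) (assumed layout).
    eps is a hypothetical guard against division by zero for constant pixels.
    """
    mean = train_images.mean(axis=0)
    std = train_images.std(axis=0)
    return mean, np.maximum(std, eps)

def standardize(images, mean, std):
    """Apply previously fitted statistics to any batch of images."""
    return (images - mean) / std

# Random stand-in data, used here only so the sketch runs end to end.
rng = np.random.default_rng(0)
train_images = rng.normal(loc=5.0, scale=2.0, size=(100, 4, 4))
test_images = rng.normal(loc=5.0, scale=2.0, size=(10, 4, 4))

# Statistics come from the training set; the test set is never used to fit them.
mean, std = fit_pixel_stats(train_images)
train_scaled = standardize(train_images, mean, std)
test_scaled = standardize(test_images, mean, std)
```

The point of splitting fitting from applying is that the test images only ever pass through `standardize`, so no test-set information leaks into the statistics.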
