You have asked us to avoid polluting our algorithms with data from the future.
Does this include the present? To be specific, when predicting yield for 2013-12-16 02:00:00 can we use climate data from the row 2013-12-16 02:00:00 or should we use 2013-12-16 00:00:00 and earlier?
Good question! You can use the “present” data to predict that yield, since it is a roll-up of the previous 2 hours (and data actually arrives at 5-minute intervals).
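For anyone building features from the raw feed, here is a minimal sketch of that roll-up convention in pandas. The variable names and values are invented for illustration; only the windowing convention (a 2-hour window labeled by its right edge) comes from the answer above:

```python
import pandas as pd
import numpy as np

# Hypothetical 5-minute raw climate readings covering 00:05 through 02:00.
idx = pd.date_range("2013-12-16 00:05", periods=24, freq="5min")
raw = pd.Series(np.arange(24, dtype=float), index=idx, name="temperature")

# Label each 2-hour window by its right edge, matching the convention
# that the 02:00 row is a roll-up of the *previous* two hours.
rollup = raw.resample("2h", label="right", closed="right").mean()
```

With this labeling, the row stamped 02:00 only summarizes data observed at or before 02:00, so using it to predict the 02:00 yield does not peek into the future.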
OK great. That makes getting up and running with a basic model much easier. Sure, I’ll be looking more than 2 hours back at some point… but not today.
Hi, I have a question somewhat related to the present-vs-future issue discussed here. As shown in the blog post, the train/test split was performed by slicing the data over time, so that a training slice is followed by a testing slice, then another training slice, and so on.
This temporal aspect of the train/test split puts a clear constraint on how we should (a) impute missing values and (b) train our models: we cannot use the entire training set in one shot to perform imputation and training.
I am sticking to this constraint at the general level, but I am also wondering whether we are allowed to ignore it within a given slice of data. For example, with regard to imputation, are we allowed to use all the data from a given slice to impute all the missing values for that specific slice? Thanks.
Hi @Elio, that will depend on your strategy for imputation. For example, filling in the average for all historical measurements is probably generalizable. Using the average of the point before the gap and the point after the gap is probably not (in that it cannot be performed in the same way for new data). The preference is for methods that can be applied easily to new data.
Does that answer your question?
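To make the distinction in the answer above concrete, here is a small sketch in pandas contrasting the two imputation strategies. The series values are made up:

```python
import pandas as pd
import numpy as np

# A toy series with a gap in the middle (values are invented).
s = pd.Series([10.0, 12.0, np.nan, np.nan, 20.0, 22.0])

# (a) Generalizable: fill each gap with the mean of *past* observations
# only. An expanding mean uses no future values, so the same rule can
# be applied identically to new, streaming data.
filled_past = s.fillna(s.expanding().mean())

# (b) Not generalizable: linear interpolation averages the points before
# *and after* the gap, which cannot be done the same way for new data
# (the point after the gap has not arrived yet).
filled_interp = s.interpolate()
```

The expanding-mean fill (11.0 for both gap rows) depends only on history, while the interpolated fill (14.67 and 17.33) quietly encodes the future value 20.0.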
I just realized that my code is indeed using future data. This probably explains why it performs better than others.
I am computing new estimates with the bug fixed.
@bull, Is there any mechanism for members to retract submissions that they know to be flawed as @oliviers indicated? I feel like artificially good scores may discourage newer members if their first efforts are way off what appears to be a “good” score. It would also be nice for everyone to know where they really stand.
@bull I would like to clarify your statement: you could use interpolation for everything but the ‘y’ label.
Imagine that you receive new information every two hours, but your model was trained on one-hour gaps: you simply fill the missing hour (once again, only the X features) with the average of the two surrounding points.
With that implementation there will be no future leaks, believe me.
I’m waiting for your answer, thanks in advance.
PS: I’m talking only about macro features. With micro features, yeah you couldn’t do that.
In the interest of fairness, I want to be clear that using future values in your model is not an issue that will disqualify a submission. The intention in creating the test set was to have a long enough time gap to effectively measure algorithms that will generalize well. The best scores will still have very good ways to predict future values, but may include backwards imputing since it is a challenge to identify and exclude these methods. After the competition, the useful parts of these winning models will be the parts using past and present data.
@Littus - You’re right that imputing the macro features to a finer-grained time scale (essentially, resampling the time series) would not leak future information.
@bull Thank you for the answer.
Last question: do you split the leaderboard into public/private chunks?
If so, in what proportion?
@Littus - Yep, there is a public/private split, and we don’t make public the proportion!
Resampling the macro features obviously leaks future information.
Let’s say that the temperature at sidi at 10:00 is the average of the temperature between 9:00 and 12:00…
Clearly a future value is injected into the model.
Thank you for your clarifications.
That’s totally right.
I was thinking that if I was predicting at time 11:00, then imputing the value at 10:00 (by averaging 9:00 and 11:00) doesn’t leak future information into my prediction at 11:00.
Of course, averaging values at 9:00 and 11:00 to create a measurement at 10:00 leaks future information to 10:00. Predicting at 10:00 will use that information about what happened in the future (at 11:00).
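A tiny illustration of that distinction in pandas, with invented temperatures:

```python
import pandas as pd
import numpy as np

# Three hourly readings with the 10:00 measurement missing (made-up values).
temps = pd.Series(
    [15.0, np.nan, 19.0],
    index=pd.to_datetime(["09:00", "10:00", "11:00"]),
)

# Averaging the neighbors fills 10:00 with information from 11:00,
# so a prediction made *at* 10:00 using this value peeks at the future.
noncausal = temps.interpolate()   # 10:00 becomes (15 + 19) / 2 = 17

# A prediction made at 11:00 may safely use an imputed 10:00 value,
# because 11:00 has already been observed by then. A purely causal
# fill (forward-fill) uses past data only.
causal_ffill = temps.ffill()      # 10:00 becomes 15.0 (past only)
```

Whether the interpolated value leaks thus depends on *when* it is consumed: it is fine as an input to the 11:00 prediction, but not to the 10:00 prediction.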
@oliviers In terms of:
“The preference is for methods that can be applied easily to new data.”
If we consider that all macro features are always accessible for all observations, it obviously doesn’t.
“Predicting at 10:00 will use that information about what happened in the future (at 11:00).”
There’s no need to predict ‘y’ for the unobserved points (e.g. 10:00).
So I was talking about the first case.
“all macro features always accessible for all observations”
What do you mean? Macro features are not a weather forecast or a statistical average. They are measurements.
I want to make sure I understand this: you are saying that we can use data from the future to predict the present? For example, this would mean that you could use the temperature from 12:00 to predict the yield at 10:00? If so, isn’t that bad data science? How could you put a model into production like that when the future data isn’t available?
The water collection data is a time series, and so is the weather data. The method for separating this into train, test, and eval sets is to drop out chunks of time (including a chunk at the end, for which there is no future data if the competition rules are being followed). This differs from problems that use a random split, because a random split would fundamentally alter the character of time-series data.
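A minimal sketch of that kind of chunked temporal split in pandas; the chunk boundaries and data here are invented for illustration:

```python
import pandas as pd
import numpy as np

# A year of made-up daily observations.
idx = pd.date_range("2013-01-01", periods=365, freq="D")
df = pd.DataFrame({"y": np.random.default_rng(0).normal(size=365)}, index=idx)

# Hold out contiguous chunks of time rather than sampling rows at random,
# so the test rows keep their time-series character.
test_chunks = [
    ("2013-04-01", "2013-04-30"),
    ("2013-09-01", "2013-09-30"),
]
test_mask = np.zeros(len(df), dtype=bool)
for start, end in test_chunks:
    test_mask |= (df.index >= start) & (df.index <= end)

train, test = df[~test_mask], df[test_mask]
```

A random row-level split would intersperse train and test points within the same hour or day, letting trivially interpolated neighbors masquerade as skillful predictions; chunked splits avoid that.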
Again, the public leaderboard is not the be-all end-all, and any unrealistic parts of models will not be put into production - so not to worry. The best advice has been the same since the beginning - as always, everybody should obey best practices and do good data science.
Sorry but I still don’t get it. It seemed to me that using future train-set observations to predict past test-set values for the response was a big no-no in this competition. But now you say that this type of behavior is not going to disqualify a submission. How can you put these two things together? How are you going to distinguish between predictions generated by “valid” models and those generated by “invalid” models? Thanks.
Hi @isms, sorry for jumping in. If I understood correctly, unrealistic models will not be put into production, but they will still be considered good enough to win the competition? Thanks
Hi @bull, I somehow missed this thread until now. I am also getting confused by your assertion. Do you mean using future values for imputation and training will not disqualify a submission? It somehow feels wrong. Thanks for clarifying.