What is past data?

Can you clarify what data I’m allowed to use, especially with respect to time?

  1. If I’m making a forecast for time t for building i, am I allowed to use all data up to but not including time t? I’m asking particularly about data from buildings other than i, since for building i the data is masked from the start of the forecast period.

  2. Am I allowed to use weather data into the future?

Thanks.

  1. I took it to mean we can use any data from any building prior to the start of the forecast period to train our model.
  2. My interpretation of the instructions is we are allowed to use weather data during the forecast period:

Note: Weather data is available for test periods under the assumption that reasonably accurate forecasts will be available to algorithms at the time that we are attempting to make predictions about the future.

I may be wrong about either of these points because the directions are not entirely clear!

Great question, thanks for asking for clarification.

  1. You may use all of the values we provide for training your models (i.e., for tuning the weights/parameters). At prediction time, the model must only take historical values as input features. That is, across all of the datasets except weather, the model can only take features derived from data before t when making a prediction.

  2. It is correct that weather data into the future can be an input at prediction time since reasonably accurate forecasts are available.

Do you really mean to say we can train on all the data? That suggests the right strategy is to massively overfit a model based on all the data. Then, when given query features, the model will be able to take advantage of both past and future data if carefully constructed.

Can I check that you are using my definition of t, i.e. the time we are being asked to predict about? So willkoehrsen’s statement is overly strict in that he is not allowing data between the start of the forecast period and t to be used?

Here’s a table of what’s allowed at prediction time for building i given what I think you’ve said:

                 Before forecast period   Between forecast start and prediction time   Prediction time onwards
Building i       yes                      no (not in data)                             no
Other building   yes                      yes                                          no
Weather          yes                      yes                                          yes

BTW how do these forum statements get codified into official rules? Time rules are important constraints in permitting or disallowing entries and clarity should be key in a fair competition.

Based on the structure of the data, competitors need to create their own train/test splits on segments of the data in order to train their models. In creating these segments for training, competitors can use all of the data that is available to them. To create a reliable forecasting model, competitors will want to use data from different seasons of the year, so there is no way to build a model that forecasts successfully across all of the test periods/seasons without allowing use of all of the training data that covers those time periods. It’s an artifact of running a competition on historical data that some of these training segments may be temporally before other training segments.

However, as explicitly stated in the rules, the model must be making a forecast (it cannot be an interpolation between past and future data). So even if it trained on “future” data to parameterize the forecasting algorithm, that future data cannot be explicitly encoded into the model as the future in a way that is used at prediction time.

One easy way to think of this distinction is as follows: Say I am asked to make predictions for March 2016. My model can use the following features: any data before March 2016, any weather data we provide, and information that we want to make predictions in March. However, the model shouldn’t need to know that I am predicting in the year 2016. The historical time series and the weather time series should be enough information without telling your model what year we are looking at. If your model requires the year to make a prediction, then you’ve probably created an interpolation.
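As a rough illustration of the "no year" idea above, here is a minimal sketch of calendar features that tell a model when in the year, week and day a prediction falls without revealing the absolute year. The function name `calendar_features` and the feature names are hypothetical, not something the competition prescribes.

```python
from datetime import datetime

def calendar_features(ts):
    """Return calendar features for a timestamp, deliberately omitting the year.

    Hypothetical helper: the feature names are illustrative only.
    """
    return {
        "month": ts.month,            # 1..12, captures seasonality
        "day_of_week": ts.weekday(),  # 0 = Monday .. 6 = Sunday
        "hour": ts.hour,
        "minute": ts.minute,
    }

# Two Tuesdays in March, six years apart, produce identical features,
# so the model cannot tell 2010 from 2016 from these inputs alone.
a = calendar_features(datetime(2016, 3, 1, 13, 15))
b = calendar_features(datetime(2010, 3, 2, 13, 15))
```

Whether a given feature set satisfies the rules is still the judges' call; this only shows one way to keep the year out of the inputs.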

The simple restatement is that you need to be intellectually honest in building a model that generalizes well to forecasting the future. The ultimate decision of whether your model qualifies is up to the judges, so this honesty should be your guiding principle.

Your table has one error. You cannot use other building consumption data during the period for which you are forecasting. We’ll codify this post and the table into an official announcement so it is fair for everyone.
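One way to read the corrected rule, as a minimal sketch: weather may be used at any time, while consumption from any building may only be used if it comes from before the forecast period starts. The function `is_allowed_input` and the source labels `"weather"` and `"consumption"` are hypothetical, and this is one interpretation of the posts above, not the official wording.

```python
from datetime import datetime

def is_allowed_input(source, timestamp, forecast_start):
    """Return True if a value may be a model input at prediction time.

    source: "weather" or "consumption" (hypothetical labels).
    Assumption: anything that is not weather is treated like consumption.
    """
    if source == "weather":
        # Weather is allowed before, during and after the forecast period.
        return True
    # Consumption from building i or any other building:
    # only values strictly before the forecast period begins.
    return timestamp < forecast_start

forecast_start = datetime(2016, 3, 1)
during = datetime(2016, 3, 5)
before = datetime(2016, 2, 28)
```

With these values, `is_allowed_input("consumption", during, forecast_start)` is `False` for every building, which is the cell the organizers corrected in the table.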

Thanks for bringing up these clarifications so that we can communicate them to all of the participants.

See updated model eligibility section:

Thanks for the added description. From a data-leakage point of view, I still find it surprising that you are allowing future data at training time. At the beginning of this competition it felt like a challenge in dealing with missing data (missing weather, missing holidays, missing individual observations and missing history). I’m sure it must be a standard industry challenge to predict future loads with little past history. It feels consistent, simple and intellectually honest to completely ban the use of future data.

I completely agree with this. Allowing access to future data does not make sense in the context of predicting future consumption. In any real-world application, we would not have access to data in the future of the prediction period. It also makes it more difficult for competitors to understand what is acceptable training.

When testing, are we allowed to use information such as days from the start of the training period? If there is a trend over time, then using a feature like days elapsed since the start of the data will allow our model to learn that trend.

In addition to daily, weekly, monthly, and seasonal patterns, some buildings have increasing or decreasing trends on a large time-scale. However, I’m concerned that using days since the start of the training period might not be allowed.
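For concreteness, a feature like this could be computed as below. This is only a sketch of the calculation; `days_elapsed` and `training_start` are hypothetical names, and it says nothing about whether the feature is permitted.

```python
from datetime import datetime

def days_elapsed(ts, training_start):
    """Fractional days between the start of the training data and ts."""
    return (ts - training_start).total_seconds() / 86400.0

training_start = datetime(2010, 1, 9)
elapsed = days_elapsed(datetime(2010, 1, 10, 12, 0), training_start)
```

Because the value is measured relative to each series' own start rather than to a calendar date, it captures a long-run trend without passing the year to the model directly.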

I did a quick check of the max timestamp from the training data for each forecast ID, then compared those timestamps to the min timestamp for each forecast ID from the submission format file, and didn’t find any observations where future data was included in the training data for a given forecast ID. However, that is not true when comparing forecast IDs across the same site ID.

For example, site 302 has training data spanning from 2009-12-31 to 2017-08-23 and forecast ID 6730’s submission period spans between 2010-01-10 13:15:00 and 2010-01-12 13:00:00…

I had originally assumed that there was a simple single timestamp split but that is not the case. Unfortunately this introduces leakage into the training data and this information could be easily used to inform final submissions without directly using future data in a final model.
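The site-level check described above can be sketched roughly as follows. The data layout (plain dictionaries keyed by site ID and forecast ID) and the function name `find_future_training_data` are assumptions for illustration; the real files would need to be loaded into this shape first.

```python
from datetime import datetime

def find_future_training_data(train_ranges, forecast_windows):
    """Return forecast IDs whose site has training data after the window ends.

    train_ranges:     {site_id: (first_train_ts, last_train_ts)}
    forecast_windows: {forecast_id: (site_id, window_start, window_end)}
    """
    leaky = []
    for forecast_id, (site_id, _start, end) in forecast_windows.items():
        _first_train, last_train = train_ranges[site_id]
        if last_train > end:
            leaky.append(forecast_id)
    return sorted(leaky)

# The example from the post: site 302 and forecast ID 6730.
train_ranges = {302: (datetime(2009, 12, 31), datetime(2017, 8, 23))}
forecast_windows = {
    6730: (302, datetime(2010, 1, 10, 13, 15), datetime(2010, 1, 12, 13, 0)),
}
print(find_future_training_data(train_ranges, forecast_windows))  # [6730]
```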

You are right: for the buildings in which the training data comes after the testing period, there will always be leakage unless we exclude the post-test training data from the training set. However, this will result in a less accurate model because we are limiting the amount of training data. I have trained models both ways (with and without future data), and the models with future data perform slightly but significantly better. In an ideal problem, all the test data would come after the training data.
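The "without future data" variant can be sketched as a simple filter that drops every training row at or after the start of the test period. The name `strict_training_rows` and the `(timestamp, value)` row layout are hypothetical; this is not necessarily how the models mentioned above were trained.

```python
from datetime import datetime

def strict_training_rows(rows, forecast_start):
    """Keep only (timestamp, value) rows strictly before the forecast period."""
    return [(ts, value) for ts, value in rows if ts < forecast_start]

rows = [
    (datetime(2010, 1, 9, 12, 0), 10.0),   # before the test period: kept
    (datetime(2010, 1, 10, 13, 15), 11.0), # at the start of the test period: dropped
    (datetime(2017, 8, 23, 0, 0), 12.0),   # long after the test period: dropped
]
kept = strict_training_rows(rows, datetime(2010, 1, 10, 13, 15))
```

Comparing a model trained on `kept` with one trained on all of `rows` gives a direct estimate of how much the post-test data contributes.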