The input data and the test data span a very wide range of granularities (from 15 min to one day).
Would it be possible to know from which row to which row we have, e.g., data every 15 min, then every hour, etc., both in the submission and in the train data?
Hi @larry77, we’ve added a new file to the Data Download page with the frequency of predictions (in nanoseconds) called “Submission Format Period”. Hope that helps! For the training data, you’ll have to segment that using the differences in the Timestamps for different sites.
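If it helps, segmenting the training data by sampling frequency along those lines could look roughly like the untested pandas sketch below. The column names Timestamp and SiteId in train.csv are assumptions based on this thread, and taking the mode of the gaps only captures the dominant frequency per site.

```python
# Untested sketch: infer each site's sampling frequency from the gaps
# between consecutive Timestamps in train.csv.
# Column names (Timestamp, SiteId) are assumptions.
import pandas as pd

train = pd.read_csv("train.csv", parse_dates=["Timestamp"])

freq_per_site = (
    train.sort_values(["SiteId", "Timestamp"])
         .groupby("SiteId")["Timestamp"]
         .apply(lambda ts: ts.diff().dropna().mode().iloc[0])  # most common gap
)
print(freq_per_site)  # e.g. 0 days 00:15:00 for a 15-minute site
```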
Thanks! I believe there may be some anomalies in the train data set in this regard, but that may be the topic for another post after I have double-checked.
It would make much more sense to split the training data in advance into 3 files, by the temporal scale involved…
I cannot agree more. I think the problem formulation is very confusing. Essentially, the aim is to predict 3 different outcomes at 3 different time scales, yet everything is bundled together in the final evaluation. This is messy, also because the consumption scales are different. We are really talking about 3 prediction problems, which would have been better kept separate. I wonder if there is still time for the organizers to give this competition a new spin.
@larry77 @ddofer We’ve added a ForecastId to the training data that will align with the submission format for timestep size and siteid. Should be much easier to align those individual time series now. You’ll have to re-download the train.csv data if you already downloaded it.
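A quick sanity check on the new column might look like the following untested sketch (column names are assumptions based on this thread):

```python
# Untested sketch: check that ForecastId lines up between the re-downloaded
# train.csv and submission_format.csv. Column names are assumptions.
import pandas as pd

train = pd.read_csv("train.csv", parse_dates=["Timestamp"])
sub = pd.read_csv("submission_format.csv", parse_dates=["Timestamp"])

# Each ForecastId should map to exactly one SiteId in each file.
print(train.groupby("ForecastId")["SiteId"].nunique().max())  # expect 1
print(sub.groupby("ForecastId")["SiteId"].nunique().max())    # expect 1

# ForecastId values present in both files refer to the same series.
shared = set(train["ForecastId"]) & set(sub["ForecastId"])
print(len(shared), "ForecastId values appear in both files")
```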
Enjoy the competition!
Thanks for being reactive on this.
Nevertheless, having 3 input files and 3 output files would have been (and would still be), IMHO, the best way to present the data in this competition. It would be dead obvious that there are 3 time scales and 3 predictions (and possibly 3 models) involved, leaving aside the practical aspects (which burn up time better spent on modelling). I know that data cleaning is part of the modelling work, but this is not asking the admins to do our job for us, just to separate the data (the same way a restaurant does not serve the starter, the main dish and the coffee all at once on one plate).
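For what it's worth, splitting train.csv by temporal scale oneself could look something like the untested sketch below. The file name submission_frequency.csv and the column ForecastPeriodNS are placeholders for whatever the “Submission Format Period” download actually contains.

```python
# Untested sketch: write one training file per temporal scale, using the
# period (in nanoseconds) from the "Submission Format Period" file.
# "submission_frequency.csv" and "ForecastPeriodNS" are placeholder names.
import pandas as pd

train = pd.read_csv("train.csv")
periods = pd.read_csv("submission_frequency.csv")  # ForecastId, ForecastPeriodNS

train = train.merge(periods, on="ForecastId", how="left")
for ns, chunk in train.groupby("ForecastPeriodNS"):
    step = pd.to_timedelta(int(ns), unit="ns")  # e.g. 15 min, 1 hour, 1 day
    chunk.to_csv(f"train_{int(step.total_seconds())}s.csv", index=False)
```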
Hi! Back with some more technical questions now.
- The new “Submission Format Period” file contains the ForecastId and the frequency of the observations, so it gives me the observation frequency for the submission file.
- Now, given that we also have a new ForecastId column in the train data set, can I use the file from the previous point to get the frequency of the observations in the train data set as well? In other words, is what you wrote in a previous post
For the training data, you’ll have to segment that using the differences in the Timestamps for different sites
still correct, but no longer necessary because I already have this information? It would appear so, but I prefer to be sure (see the quick check sketched below).
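Concretely, the check I have in mind is something like this untested sketch (the file name submission_frequency.csv and the column ForecastPeriodNS are guesses for the “Submission Format Period” download):

```python
# Untested sketch: compare the typical gap between training Timestamps per
# ForecastId with the period listed in the "Submission Format Period" file.
# File and column names are guesses.
import pandas as pd

train = pd.read_csv("train.csv", parse_dates=["Timestamp"])
periods = pd.read_csv("submission_frequency.csv")  # ForecastId, ForecastPeriodNS

train_step = (
    train.sort_values(["ForecastId", "Timestamp"])
         .groupby("ForecastId")["Timestamp"]
         .apply(lambda ts: ts.diff().dropna().mode().iloc[0])
         .rename("TrainStep")
)
check = periods.set_index("ForecastId").join(train_step)
check["match"] = check["TrainStep"] == pd.to_timedelta(check["ForecastPeriodNS"], unit="ns")
print(check["match"].value_counts())
```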
Apologies for the many questions, but I am now delving into this rich data set.
Some ForecastId values appear only in the submission file. E.g., ForecastId 700 corresponds to SiteId 22 and a 15-minute observation frequency.
However, in the train data set, under ForecastId=652, I again have building 22 and the timestamps move in steps of 15 min, so I am confused: are we talking about the same buildings in both the train and the test (submission) data sets? Why do I have different ForecastId values for the same building with the same frequency in the two data sets?
The predictions for a particular site span multiple years, so ForecastId buckets observations with similar timestamps for that building. If you do some initial exploration of train.csv and submission_format.csv, you'll notice that the timestamps in submission_format.csv for a particular ForecastId come directly after all the timestamps with the same ForecastId in the training set.
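That alignment is easy to verify with something along these lines (an untested sketch; column names are assumptions based on this thread):

```python
# Untested sketch: for each ForecastId, the first submission timestamp
# should come after the last training timestamp.
import pandas as pd

train = pd.read_csv("train.csv", parse_dates=["Timestamp"])
sub = pd.read_csv("submission_format.csv", parse_dates=["Timestamp"])

last_train = train.groupby("ForecastId")["Timestamp"].max().rename("last_train")
first_sub = sub.groupby("ForecastId")["Timestamp"].min().rename("first_sub")
aligned = pd.concat([last_train, first_sub], axis=1, join="inner")
print((aligned["first_sub"] > aligned["last_train"]).all())  # expect True
```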
Thanks for the reply, but it is still unclear to me why, in the case described in my post, different forecast IDs are used for data on the same building at the same frequency. OK, we are talking about different years, but… should we not have the same forecast ID, precisely in light of what you wrote?