Clarification on features' dates used for prediction

It is unclear to me from the problem description whether we are allowed to use satellite data from the same day as the label value, or only from previous dates. For example, for a label datetime of 31st Jan 2019, are we allowed to use only satellite data taken before that date, or also data from the same date? Thanks.

I don’t have an answer, unfortunately, but rather a related question. In the problem description it states:

“You may use historical ground truth training data as feature input to your model. Note that for verification, you may only use historical data up until the point of inference in order to make your prediction.”

Does that mean that we would have access to the ground-truth data from up to the day before we are trying to predict the concentrations for, which can be used as input data into the forecast algorithm? As a more concrete example, if I am trying to predict the concentration at a location on January 2, 2021, would I be able to know the concentration on January 1, 2021 and use that to inform my forecast? Or does this only apply to data from the training period provided for the competition, but not for the testing or for any validation periods?

In short, you are allowed to use data up through (i.e. including) the date of prediction. This means that yes, you can use satellite data from the same local date.

The label’s datetime represents the start of a 24-hour period over which the air quality is averaged. For example, a label with datetime 2019-01-31T08:00:00Z represents an average taken from 2019-01-31T08:00:00Z to 2019-02-01T07:59:00Z (inclusive).

Therefore, you can use satellite data with an endtime on or before 2019-02-01T07:59:00Z for a label with datetime 2019-01-31T08:00:00Z. Note that this cutoff is 11:59 PM on January 31 in local time (Pacific time).
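
As a rough sketch of the rule above, the cutoff can be computed from the label datetime and used to filter satellite files. The function names `latest_allowed_endtime` and `is_usable` are hypothetical and not part of any competition tooling; the sketch assumes all datetimes are timezone-aware and in UTC.

```python
from datetime import datetime, timedelta, timezone


def latest_allowed_endtime(label_dt):
    """Return the latest satellite endtime usable for a label (hypothetical helper).

    The label datetime marks the start of a 24-hour averaging window, so the
    last included minute is 23 hours 59 minutes after the start.
    """
    return label_dt + timedelta(hours=23, minutes=59)


def is_usable(granule_endtime, label_dt):
    """True if a satellite granule ends on or before the cutoff for this label."""
    return granule_endtime <= latest_allowed_endtime(label_dt)


# Example from the post: label datetime 2019-01-31T08:00:00Z.
label = datetime(2019, 1, 31, 8, 0, tzinfo=timezone.utc)
cutoff = latest_allowed_endtime(label)
print(cutoff.isoformat())
```

A granule ending at 2019-02-01T07:59:00Z passes this check, while one ending at 2019-02-01T08:00:00Z does not.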

Yes, this is correct. You will have access to historical ground truth data to make your predictions, e.g. you can use the concentrations from January 1, 2021 to make a prediction for January 2, 2021. Edit: This is no longer true. The use of ground truth as input is disallowed.

Thank you! What is the best source for ground truth data from the testing period, and how should we structure the code to accept these data for any validation cases?

Sorry @Carl_Malings! I misspoke in my previous reply. You can only use historical data for the training period. It will not be available for the test or validation periods. Apologies for the confusion - I’ve edited my previous post. The use of ground truth as input data is disallowed.

OK, that is what I initially suspected. Thank you!

@Carl_Malings Sorry to keep changing my answer on this, but upon further review, we’ve decided not to allow any ground truth data as input to the model, for any of the train, test, or validation periods. The problem description is now updated to reflect this. We hope this simplifies things. Again, apologies for the confusion!

OK. So, just to clarify: The ground-truth data is available for training, but once the forecasting method has been trained/calibrated, no ground-truth data can be used as an additional input to inform the forecasting.

Related to this: would it be allowable to “save” some of this ground data, or information derived from it (e.g., “what was the average concentration at this location across all Mondays during the training period?”) into some kind of “lookup table” which is used by the forecasting method?

The ground truth from the training period is available for developing your model. You may use that model when running inference. The example you provided sounds like it reflects information on weekly averages from that period in your model, which is fine to do.
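
To illustrate the kind of lookup table described in the question, here is a minimal sketch. It assumes the training data can be read as (location, datetime, value) rows; the row layout and the names `build_weekday_lookup` and `weekday_feature` are hypothetical. The table is built once from training-period ground truth, and inference only reads from it.

```python
from collections import defaultdict
from datetime import datetime


def build_weekday_lookup(train_rows):
    """Average training-period values per (location, weekday); Monday is 0."""
    sums = defaultdict(lambda: [0.0, 0])  # key -> [running total, count]
    for location, dt, value in train_rows:
        acc = sums[(location, dt.weekday())]
        acc[0] += value
        acc[1] += 1
    return {key: total / count for key, (total, count) in sums.items()}


def weekday_feature(lookup, location, target_dt, default=None):
    """Look up the stored training average for the target date's weekday."""
    return lookup.get((location, target_dt.weekday()), default)


# Made-up training rows, for illustration only.
train_rows = [
    ("A", datetime(2019, 1, 7), 10.0),   # Monday
    ("A", datetime(2019, 1, 14), 20.0),  # Monday
    ("A", datetime(2019, 1, 8), 5.0),    # Tuesday
]
lookup = build_weekday_lookup(train_rows)

# At inference, only the saved table is consulted, not test-period ground truth.
feature = weekday_feature(lookup, "A", datetime(2021, 1, 4))  # a Monday
```

Because the table is fixed after training, it behaves like any other learned model parameter.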
