Provisions for R-users

Hi @jayqi, I believe what @tabumis means is the data used for training (raw or processed), not the mean value or any statistical aggregation.

It is related to a prior discussion.

Hi @rasyidstat,

Exactly. I assume some pretrained models will not work at all if one of the predictor features is absent. Such uncertainty in data availability by issue date is exactly what we have previously been recommended to take into account.

I believe it’s forbidden to retrain or update the model weights in the runtime environment, or to attempt any approach like active learning or online training, even if it’s possible to run it in the environment within the 30-minute time limit.

@jayqi could you please clarify?

@rasyidstat , thanks for bringing this up.

Yes, technically speaking, our solution involves training models during execution. However, it’s necessary to note that we maintain the same model structure and the same flow of predictor variables. This setup is driven solely by the uncertainty of data latency at each issue date. As I understand it, this is not active learning, which involves retraining/updating models using new observations of both the target and predictor variables.
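To make the pattern concrete, here is a minimal sketch of refitting a fixed model structure on whichever predictors are available at a given issue date. The feature names, the training frame, and the choice of `Ridge` are all hypothetical illustrations, not the actual solution:

```python
# Sketch: same model structure, refit at runtime on only the
# predictors that are available for this issue date.
# Feature names and data are hypothetical.
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
train = pd.DataFrame({
    "snowpack": rng.normal(size=100),
    "precip": rng.normal(size=100),
    "temp": rng.normal(size=100),
})
train["target"] = train.sum(axis=1) + rng.normal(scale=0.1, size=100)

def fit_for_available(available_cols):
    """Refit the same model structure using only available predictors."""
    model = Ridge(alpha=1.0)  # structure/hyperparameters never change
    model.fit(train[available_cols], train["target"])
    return model

# At inference, suppose "temp" has not been published for this issue date:
available = ["snowpack", "precip"]
model = fit_for_available(available)
pred = model.predict(train[available].head(1))
```

The key point in the discussion is that only the fitted coefficients change in response to missing columns; the model class, hyperparameters, and pipeline stay fixed.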

@jayqi, could you please clarify the eligibility of this approach?

Hi @tabumis,

Based on my understanding of what you’ve described, it should be fine. This seems like a particular case of “feature parameters” computed on the training set and the “retraining” you do could be considered just part of the inference process.

My understanding is that you would produce equivalent prediction values if you pretrained models for all permutations of missing variables and then dispatched to a specific model at inference time based on the test feature data. I agree this does not seem to meet the definition of active learning.
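The equivalence described above can be sketched as pretraining one model per non-empty subset of predictors and dispatching at inference based on which features are present. Everything here (feature names, data, `Ridge`) is a hypothetical illustration of the dispatch idea:

```python
# Sketch: pretrain a model for every subset of predictors, then
# dispatch to the matching model based on which test features are
# present. Names and data are hypothetical.
from itertools import combinations
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge

rng = np.random.default_rng(42)
features = ["snowpack", "precip", "temp"]
train = pd.DataFrame(rng.normal(size=(100, 3)), columns=features)
train["target"] = train[features].sum(axis=1)

# One model per non-empty subset of predictors (2^3 - 1 = 7 models).
models = {}
for k in range(1, len(features) + 1):
    for subset in combinations(features, k):
        cols = sorted(subset)  # fixed column order for sklearn
        models[frozenset(subset)] = Ridge(alpha=1.0).fit(
            train[cols], train["target"]
        )

def predict(row):
    """Dispatch to the model trained on exactly the non-missing features."""
    available = frozenset(c for c in features if pd.notna(row[c]))
    cols = sorted(available)
    return models[available].predict(row[cols].to_frame().T)[0]

# Example: "temp" is missing for this test row.
test_row = train.loc[0, features].copy()
test_row["temp"] = np.nan
p = predict(test_row)
```

Dispatching this way is deterministic given the training data, which is why it produces the same predictions as refitting at inference time.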


Hi @tabumis, indeed, it’s not active learning. However, it does require a retraining process. Previously, I thought that was not permitted, since it could change the model weights, even if only slightly.

Hi @jayqi, thanks for the clarification.
On your second point, whether it produces an equivalent prediction depends on how many predictors are absent. If it’s just one or two, the model weights and predictions will not change much.

@rasyidstat @tabumis If you are doing something like this, please make sure you use random seeds and/or take other steps to make your “retraining” process deterministic given the same inputs. Your solution being reproducible is an important part of verifying valid solutions at the end of the challenge.
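A minimal sketch of pinning down the randomness in a runtime retraining step, so the same inputs always yield the same fitted model. The estimator and seed value are arbitrary examples:

```python
# Sketch: fix every source of randomness so runtime "retraining"
# is deterministic. The model choice is hypothetical.
import random
import numpy as np
from sklearn.ensemble import RandomForestRegressor

SEED = 2024
random.seed(SEED)     # Python stdlib RNG
np.random.seed(SEED)  # global NumPy RNG (legacy API)

X = np.random.rand(50, 3)
y = X.sum(axis=1)

def retrain():
    # random_state pins the estimator's own internal randomness
    model = RandomForestRegressor(n_estimators=20, random_state=SEED)
    model.fit(X, y)
    return model

a = retrain().predict(X[:5])
b = retrain().predict(X[:5])
assert np.array_equal(a, b)  # two retrains give identical predictions
```

Note that setting the global seeds alone is not enough for scikit-learn estimators with their own randomness; passing `random_state` explicitly is what makes each refit repeatable.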

Thanks

Does that mean we also need to cache the model and any additional downloaded data in the preprocessed directory?
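For illustration, caching could look something like the sketch below: persist the fitted model under the preprocessed directory and reuse it on later runs. The path layout, `joblib`, and `Ridge` are assumptions for the example, not the challenge’s required structure:

```python
# Sketch: cache a fitted model in a "preprocessed" directory so later
# runs load it instead of refitting. Paths and joblib are assumptions.
from pathlib import Path
import joblib
import numpy as np
from sklearn.linear_model import Ridge

CACHE = Path("preprocessed")
CACHE.mkdir(exist_ok=True)
model_path = CACHE / "model.joblib"

def get_model(X, y):
    if model_path.exists():
        return joblib.load(model_path)   # cache hit: reuse fitted model
    model = Ridge().fit(X, y)
    joblib.dump(model, model_path)       # cache miss: fit and persist
    return model

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
y = X.sum(axis=1)

model = get_model(X, y)   # first call fits and writes the cache
cached = get_model(X, y)  # second call loads from disk
```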

Reproducibility in data science is quite tricky; both the model and the data can change. I can ensure that the model is static, but there’s a possibility that the data will be updated at a later date by the providers.