Provisions for R-users

Hi @jayqi, I believe what @tabumis means is the data used for training (raw or processed), not the mean value or any statistical aggregation.

It is related to a prior discussion.

Hi @rasyidstat,

Exactly. I assume some pretrained models will not work at all if one of the predictor features is absent. Such uncertainty in data availability by issue date is exactly what we have previously been recommended to take into account.

I believe it’s forbidden to retrain or update the model weights in the runtime environment, or to attempt any approach like active learning or online training, even if it’s possible to run it in the environment within the 30-minute time limit.

@jayqi could you please clarify?

@rasyidstat , thanks for bringing this up.

Yes, technically speaking, our solution involves training models during execution. However, it’s necessary to note that we maintain the same model structure and the same flow of predictor variables. This setup is driven solely by the uncertainty of data latency at each issue date. As I understand it, this is not active learning, which involves retraining/updating models using new observations of both the target and predictor variables.
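To make the pattern concrete, here is a minimal sketch of refitting a fixed model structure on whichever predictors are available at a given issue date. The feature names, the training frame, and the choice of `Ridge` are all hypothetical illustrations, not the actual solution:

```python
# Sketch: same model structure, refit at runtime on only the
# predictors that are available for this issue date.
# Feature names and data are hypothetical.
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
train = pd.DataFrame({
    "snowpack": rng.normal(size=100),
    "precip": rng.normal(size=100),
    "temp": rng.normal(size=100),
})
train["target"] = train.sum(axis=1) + rng.normal(scale=0.1, size=100)

def fit_for_available(available_cols):
    """Refit the same model structure using only available predictors."""
    model = Ridge(alpha=1.0)  # structure/hyperparameters never change
    model.fit(train[available_cols], train["target"])
    return model

# At inference, suppose "temp" has not been published for this issue date:
available = ["snowpack", "precip"]
model = fit_for_available(available)
pred = model.predict(train[available].head(1))
```

The key point in the discussion is that only the fitted coefficients change in response to missing columns; the model class, hyperparameters, and pipeline stay fixed.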

@jayqi, could you please clarify the eligibility of this approach?

Hi @tabumis,

Based on my understanding of what you’ve described, it should be fine. This seems like a particular case of “feature parameters” computed on the training set and the “retraining” you do could be considered just part of the inference process.

My understanding is that you would produce equivalent prediction values if you pretrained models for all permutations of missing variables and then dispatched to a specific model at inference time based on the test feature data. I agree this does not seem to meet the definition of active learning.
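The equivalence described above can be sketched as pretraining one model per non-empty subset of predictors and dispatching at inference based on which features are present. Everything here (feature names, data, `Ridge`) is a hypothetical illustration of the dispatch idea:

```python
# Sketch: pretrain a model for every subset of predictors, then
# dispatch to the matching model based on which test features are
# present. Names and data are hypothetical.
from itertools import combinations
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge

rng = np.random.default_rng(42)
features = ["snowpack", "precip", "temp"]
train = pd.DataFrame(rng.normal(size=(100, 3)), columns=features)
train["target"] = train[features].sum(axis=1)

# One model per non-empty subset of predictors (2^3 - 1 = 7 models).
models = {}
for k in range(1, len(features) + 1):
    for subset in combinations(features, k):
        cols = sorted(subset)  # fixed column order for sklearn
        models[frozenset(subset)] = Ridge(alpha=1.0).fit(
            train[cols], train["target"]
        )

def predict(row):
    """Dispatch to the model trained on exactly the non-missing features."""
    available = frozenset(c for c in features if pd.notna(row[c]))
    cols = sorted(available)
    return models[available].predict(row[cols].to_frame().T)[0]

# Example: "temp" is missing for this test row.
test_row = train.loc[0, features].copy()
test_row["temp"] = np.nan
p = predict(test_row)
```

Dispatching this way is deterministic given the training data, which is why it produces the same predictions as refitting at inference time.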


Hi @tabumis, indeed, it’s not active learning. However, it does require a retraining process. Previously, I thought that was not permitted, since it could change the model weights, even if only slightly.

Hi @jayqi, thanks for the clarification.
On your second point, whether it produces an equivalent prediction depends on how many predictors are absent. If it’s just one or two, the model weights and predictions will not change much.

@rasyidstat @tabumis If you are doing something like this, please make sure you use random seeds and/or take other steps to make your “retraining” process deterministic given the same inputs. Your solution being reproducible is an important part of verifying valid solutions at the end of the challenge.
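A minimal sketch of pinning down the randomness in a runtime retraining step, so the same inputs always yield the same fitted model. The estimator and seed value are arbitrary examples:

```python
# Sketch: fix every source of randomness so runtime "retraining"
# is deterministic. The model choice is hypothetical.
import random
import numpy as np
from sklearn.ensemble import RandomForestRegressor

SEED = 2024
random.seed(SEED)     # Python stdlib RNG
np.random.seed(SEED)  # global NumPy RNG (legacy API)

X = np.random.rand(50, 3)
y = X.sum(axis=1)

def retrain():
    # random_state pins the estimator's own internal randomness
    model = RandomForestRegressor(n_estimators=20, random_state=SEED)
    model.fit(X, y)
    return model

a = retrain().predict(X[:5])
b = retrain().predict(X[:5])
assert np.array_equal(a, b)  # two retrains give identical predictions
```

Note that setting the global seeds alone is not enough for scikit-learn estimators with their own randomness; passing `random_state` explicitly is what makes each refit repeatable.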

Thanks

Does that mean we also need to cache the model and any additional downloaded data in the preprocessed directory?
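For illustration, caching could look something like the sketch below: persist the fitted model under the preprocessed directory and reuse it on later runs. The path layout, `joblib`, and `Ridge` are assumptions for the example, not the challenge’s required structure:

```python
# Sketch: cache a fitted model in a "preprocessed" directory so later
# runs load it instead of refitting. Paths and joblib are assumptions.
from pathlib import Path
import joblib
import numpy as np
from sklearn.linear_model import Ridge

CACHE = Path("preprocessed")
CACHE.mkdir(exist_ok=True)
model_path = CACHE / "model.joblib"

def get_model(X, y):
    if model_path.exists():
        return joblib.load(model_path)   # cache hit: reuse fitted model
    model = Ridge().fit(X, y)
    joblib.dump(model, model_path)       # cache miss: fit and persist
    return model

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
y = X.sum(axis=1)

model = get_model(X, y)   # first call fits and writes the cache
cached = get_model(X, y)  # second call loads from disk
```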

Reproducibility in data science is quite tricky; both the model and the data can change. I can ensure that the model is static, but there’s a possibility that the data will be updated at a later date by the providers.