# Asked to predict into the future

Hi

I wish someone can enlighten me the below points:

• For 10% of the test instances, t corresponds to the end of the first (pre-rinse) phase.
• For 30% of the test instances, t corresponds to the end of the second (caustic) phase.
• For 30% of the test instances, t corresponds to the end of the third (intermediate rinse) phase.
• For 30% of the test instances, t corresponds to the end of the fourth (acid) phase.

Thanks

I understand this paragraph as a fraction of test processes that ends on particular phase.
But simple calculations on test_values.zip give me different percentages.

``````# test_data is the data from the test_values.zip
n = test_data['process_id'].nunique()
test_data.sort_values(by=['process_id', 'timestamp'])\
.groupby('process_id', as_index=False)['phase'].last()\
.groupby('phase').size() * 100 / n
``````

gives me following results

``````phase
pre_rinse              9.841591
caustic               40.613414
acid                  26.895854
intermediate_rinse    22.649141
``````

which does not match exactly (10/30/30/30) split.

Thanks twalen for the feedback. My understanding for 10/30/30/30 split for each process_id not the test_data. Do you have sample code to check this?

In the train set the last phase is always final_rinse, in my understanding this paragraphs in the problem description is only for the testing set.

However, in the test set you are only given data from select previous phases (up to a given time, t) and then asked to predict into the future.

• For 10% of the test instances, t corresponds to the end of the first (pre-rinse) phase.
• For 30% of the test instances, t corresponds to the end of the second (caustic) phase.
• For 30% of the test instances, t corresponds to the end of the third (intermediate rinse) phase.
• For 30% of the test instances, t corresponds to the end of the fourth (acid) phase.

Those percentages are approximate, with the true percentages being subject to a number of other constraints. We provide this estimate in case participants want to undertake a similar approach with the training data.