Asked to predict into the future

Hi

Could someone clarify the points below?

  • For 10% of the test instances, t corresponds to the end of the first (pre-rinse) phase.
  • For 30% of the test instances, t corresponds to the end of the second (caustic) phase.
  • For 30% of the test instances, t corresponds to the end of the third (intermediate rinse) phase.
  • For 30% of the test instances, t corresponds to the end of the fourth (acid) phase.

Thanks

I understand this paragraph as giving the fraction of test processes that end on each particular phase.
But a simple calculation on test_values.zip gives me different percentages.

# test_data is the DataFrame loaded from test_values.zip
n = test_data['process_id'].nunique()  # number of test processes
# Take the last recorded phase of each process, then the percentage per phase
(test_data.sort_values(by=['process_id', 'timestamp'])
    .groupby('process_id', as_index=False)['phase'].last()
    .groupby('phase').size() * 100 / n)

gives me the following results:

phase
pre_rinse              9.841591
caustic               40.613414
acid                  26.895854
intermediate_rinse    22.649141

which does not exactly match the 10/30/30/30 split.

Thanks, twalen, for the feedback. My understanding is that the 10/30/30/30 split applies per process_id, not to test_data as a whole. Do you have sample code to check this?

In the train set the last phase is always final_rinse. As I understand it, this paragraph in the problem description applies only to the test set:

However, in the test set you are only given data from select previous phases (up to a given time, t) and then asked to predict into the future.

  • For 10% of the test instances, t corresponds to the end of the first (pre-rinse) phase.
  • For 30% of the test instances, t corresponds to the end of the second (caustic) phase.
  • For 30% of the test instances, t corresponds to the end of the third (intermediate rinse) phase.
  • For 30% of the test instances, t corresponds to the end of the fourth (acid) phase.

Those percentages are approximate, with the true percentages being subject to a number of other constraints. We provide this estimate in case participants want to undertake a similar approach with the training data.
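
For anyone who wants to try that with the training data, here is a minimal sketch of one way to do it. It is not the organizers' actual procedure. It assumes a DataFrame with the process_id and phase columns used in the snippet above, with rows already sorted by timestamp within each process. The function name truncate_like_test and the constants are hypothetical. Each process is cut at the end of a randomly drawn phase with weights 10/30/30/30, restricted to the phases that process actually contains (an assumption about how missing phases might be handled).

```python
import random

import pandas as pd

# Phase order as listed in the problem description; final_rinse is never kept.
PHASE_ORDER = ["pre_rinse", "caustic", "intermediate_rinse", "acid"]
# Approximate share of test instances cut at the end of each phase.
CUT_WEIGHTS = [0.1, 0.3, 0.3, 0.3]


def truncate_like_test(train: pd.DataFrame, seed: int = 0) -> pd.DataFrame:
    """Cut each training process at the end of a randomly drawn phase.

    Hypothetical helper: assumes `train` has `process_id` and `phase`
    columns and is sorted by timestamp within each process.
    """
    rng = random.Random(seed)
    rank = {phase: i for i, phase in enumerate(PHASE_ORDER)}
    pieces = []
    for _, proc in train.groupby("process_id", sort=False):
        # Only phases this process actually contains are valid cut points.
        seen = set(proc["phase"])
        present = [p for p in PHASE_ORDER if p in seen]
        if not present:
            continue  # nothing to keep, e.g. only final_rinse rows
        weights = [CUT_WEIGHTS[rank[p]] for p in present]
        cut = rng.choices(present, weights=weights, k=1)[0]
        # Keep rows from phases up to and including the cut phase;
        # final_rinse maps to NaN, so the comparison drops it.
        keep = proc["phase"].map(rank) <= rank[cut]
        pieces.append(proc[keep])
    return pd.concat(pieces, ignore_index=True)
```

Running the percentage calculation from earlier in the thread on the result shows how far the realized split drifts from 10/30/30/30 once some processes lack certain phases.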