Asked to predict into the future

dcart · January 16, 2019, 7:18am

Hi

I hope someone can enlighten me on the points below:

For 10% of the test instances, t corresponds to the end of the first (pre-rinse) phase.
For 30% of the test instances, t corresponds to the end of the second (caustic) phase.
For 30% of the test instances, t corresponds to the end of the third (intermediate rinse) phase.
For 30% of the test instances, t corresponds to the end of the fourth (acid) phase.

Thanks

twalen · January 16, 2019, 8:07am

I read this paragraph as giving the fraction of test processes that end on each particular phase.
But a simple calculation on test_values.zip gives me different percentages.

# test_data is the data from test_values.zip
n = test_data['process_id'].nunique()
# take the last observed phase of each process, then the share of each phase in percent
(test_data.sort_values(by=['process_id', 'timestamp'])
    .groupby('process_id', as_index=False)['phase'].last()
    .groupby('phase').size() * 100 / n)

gives me the following results:

phase
pre_rinse              9.841591
caustic               40.613414
acid                  26.895854
intermediate_rinse    22.649141

which does not exactly match the 10/30/30/30 split.

dcart · January 16, 2019, 1:37pm

Thanks twalen for the feedback. My understanding is that the 10/30/30/30 split applies to each process_id, not to the test_data as a whole. Do you have sample code to check this?

twalen · January 16, 2019, 1:53pm

In the train set the last phase is always final_rinse. As I understand it, this paragraph in the problem description applies only to the test set:

However, in the test set you are only given data from select previous phases (up to a given time, t) and then asked to predict into the future.

For 10% of the test instances, t corresponds to the end of the first (pre-rinse) phase.

For 30% of the test instances, t corresponds to the end of the second (caustic) phase.

For 30% of the test instances, t corresponds to the end of the third (intermediate rinse) phase.

For 30% of the test instances, t corresponds to the end of the fourth (acid) phase.

bull · January 16, 2019, 11:11pm

Those percentages are approximate, with the true percentages being subject to a number of other constraints. We provide this estimate in case participants want to undertake a similar approach with the training data.
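
For anyone who wants to try the "similar approach with the training data", here is a minimal sketch of one way it could be done. It is not the organisers' actual procedure. It assumes the train set has the same process_id and phase columns as the test set, and the helper name truncate_like_test, the seed argument, and the use of exactly 10/30/30/30 as sampling weights are all my own assumptions.

```python
import numpy as np
import pandas as pd

# Phase order as listed in the problem description quoted in this thread.
PHASE_ORDER = ["pre_rinse", "caustic", "intermediate_rinse", "acid"]
# Approximate shares from the problem description (assumed here as exact weights).
CUT_PROBS = [0.1, 0.3, 0.3, 0.3]


def truncate_like_test(train, seed=0):
    """Hypothetical helper: keep each process only up to a randomly drawn phase."""
    rng = np.random.default_rng(seed)
    # Position of each row's phase in PHASE_ORDER; other phases
    # (e.g. final_rinse) become NaN and are dropped by the comparison below.
    rank = train["phase"].map({p: i for i, p in enumerate(PHASE_ORDER)})
    ids = train["process_id"].unique()
    # Draw one cut-off phase per process with the assumed weights.
    cut = pd.Series(
        rng.choice(len(PHASE_ORDER), size=len(ids), p=CUT_PROBS), index=ids
    )
    keep = rank <= train["process_id"].map(cut)
    return train[keep]
```

Running twalen's calculation on the output of truncate_like_test(train_data) should give shares close to the assumed weights, but only for processes that actually contain the drawn phase. If a process is missing that phase, its last observed phase will be an earlier one, which is one reason the observed split can drift away from 10/30/30/30.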
