Are there missing phases in the test data?

This is one more question about the cleaning phases in the training vs the test data. (I’ve read the other questions and answers on the topic but this is still not clear to me.) Here is what I know so far:

There are five cleaning phases and they can appear only in the following order:

pre_rinse -> caustic -> intermediate_rinse -> acid -> final_rinse

One or more of the phases can be skipped, so that different cleaning “recipes” are possible. For simplicity, let’s use a 5-digit long binary string to describe the recipes, where 0 means that the phase was skipped and 1 means that the phase took place. For example:

  • 11111 corresponds to the normal long recipe with all five phases on.
  • 11001 corresponds to the normal short recipe pre_rinse -> caustic -> final_rinse.
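
To make the encoding concrete, here is a minimal sketch of how a recipe string could be derived from the phases observed for one process. The helper name `encode_recipe` and the idea of passing in a plain list of phase names are my own assumptions for illustration, not something provided by the competition.

```python
# Phase names in the fixed order described above.
PHASES = ["pre_rinse", "caustic", "intermediate_rinse", "acid", "final_rinse"]


def encode_recipe(observed_phases):
    """Return a 5-character string: '1' if the phase occurred, '0' if skipped.

    `observed_phases` is any iterable of phase names seen for one process
    (hypothetical input format, chosen for illustration).
    """
    present = set(observed_phases)
    unknown = present - set(PHASES)
    if unknown:
        raise ValueError(f"unknown phase names: {sorted(unknown)}")
    return "".join("1" if phase in present else "0" for phase in PHASES)


if __name__ == "__main__":
    print(encode_recipe(PHASES))                                   # long recipe
    print(encode_recipe(["pre_rinse", "caustic", "final_rinse"]))  # short recipe
```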

We are given all available data for each cleaning process. I will assume that no observations are missing, so that “all available data” means “all observations taken every 2 sec throughout the cleaning process, from start to finish”.

The training data has 5,021 processes with the following breakdown by recipe.

11111    3726
11001    1017
00011    199
01001    38  
01111    22  
00001    16  
10001    3   
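
For anyone who wants to reproduce a breakdown like this, the sketch below shows one possible way to tally recipes. It assumes the observations can be reduced to (process_id, phase) pairs; the function name `recipe_counts` and the toy rows are hypothetical and only illustrate the grouping step.

```python
from collections import Counter, defaultdict

PHASES = ["pre_rinse", "caustic", "intermediate_rinse", "acid", "final_rinse"]


def recipe_counts(rows):
    """Count processes per recipe string.

    `rows` is an iterable of (process_id, phase) pairs, one per observation
    (assumed layout; the real data has many more columns).
    """
    phases_by_process = defaultdict(set)
    for process_id, phase in rows:
        phases_by_process[process_id].add(phase)
    return Counter(
        "".join("1" if phase in seen else "0" for phase in PHASES)
        for seen in phases_by_process.values()
    )


if __name__ == "__main__":
    # Toy data: two processes, not taken from the competition files.
    toy_rows = [
        (1, "pre_rinse"), (1, "caustic"), (1, "final_rinse"),
        (2, "acid"), (2, "acid"), (2, "final_rinse"),
    ]
    for recipe, n in recipe_counts(toy_rows).most_common():
        print(recipe, n)
```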

As explained by @ThomasF, the two main recipes are 11111 and 11001. And there is a small subset of less common recipes.

The test data has 2,967 processes with the following breakdown by recipe. We know there is final_rinse phase at the end but those data points have been removed, so I’ve put in “?” instead.

1100?    1182
1111?    671
1110?    670
1000?    292
0001?    122
0100?    23  
0111?    5   
0110?    2   

In the reply to another answer, @bull says that “for the test data, only the final_rinse phase has been removed. Any other phases present in the original dataset are still there.”

This means that the test data contains 670 recipes 1110? when the training data doesn’t have even one example of a 11101 recipe. Isn’t it more likely that these 670 recipes are in fact long 5-phase recipes but we are given only data from the first three phases, just as the competition overview describes?

It is even more complicated for the 1,182 test recipes for which we have only a pre_rinse and a caustic phase: these can be either 11001 or 11111.

That is, because we are only given data from select previous phases (up to a given time t), for some recipes we don’t know for sure whether a phase was skipped in practice or whether it did occur but we just don’t get to observe any data from it. (But maybe we can guess if we look at the actual measurements, not just whether a phase occurs or not.)

11???    1182    -> 11111 or 11001
1111?    671     -> 11111
111??    670     -> 11111
1????    292     -> 11111 or 11001 or 10001
0001?    122     -> 00011
01???    23      -> 01111 or 01001
0111?    5       -> 01111
011??    2       -> 01111
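
The mapping above can be sketched in code. This is only an illustration under two assumptions of mine: that the test data is cut off at some point after the last phase we can observe, and that the true full recipe is one of the seven recipes seen in the training data. The function name `candidate_recipes` is hypothetical.

```python
# Recipes that occur in the training data (from the breakdown earlier in this post).
TRAIN_RECIPES = ["11111", "11001", "00011", "01001", "01111", "00001", "10001"]


def candidate_recipes(observed, train_recipes=TRAIN_RECIPES):
    """List training recipes consistent with a truncated test pattern.

    `observed` is a 4-character 0/1 string for the first four phases as they
    appear in the test data. Everything after the last observed phase is
    treated as unknown; final_rinse is assumed to have taken place.
    """
    if len(observed) != 4 or set(observed) - {"0", "1"}:
        raise ValueError("expected a 4-character string of 0s and 1s")
    last_seen = observed.rfind("1")        # index of the last observed phase
    known_prefix = observed[: last_seen + 1]
    return sorted(
        recipe for recipe in train_recipes
        if recipe.startswith(known_prefix) and recipe.endswith("1")
    )


if __name__ == "__main__":
    for pattern in ["1100", "1111", "1110", "1000", "0001", "0100", "0111", "0110"]:
        print(pattern + "?", "->", " or ".join(candidate_recipes(pattern)))
```

With these assumptions, about half of the test processes (1,182 + 292 + 23 = 1,497 of 2,967) have more than one candidate recipe.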

Sorry for the confusion here. The test data are limited as described in the problem description, where they simulate having to make predictions at a certain point in the process:

Can you confirm that the full recipe is not known in the test dataset?


I think that we won’t have any cases where we are asked to predict at phase 5 with no information about the intermediate phases. That is, if we’re making a prediction in phase 5 (the common case) then we have all the data leading up to that point for that trial. I haven’t found a case yet to contradict this, as every single trial has data up to the prediction point for the cases I’ve examined so far.

Hi all, we just made an announcement releasing “recipes” that specify the phases you can expect (for the most part) for each process. Find the announcement here (you must be logged in to see this link):
https://www.drivendata.org/competitions/56/predict-cleaning-time-series/announcements/


About predicting the values: if we look at the training data, the target values (return_flow / return_turbidity) change drastically during the later stages. Because the test data doesn’t contain all the phases, predicting without those phases won’t improve the performance of the model.

Alternatively, does the model need to predict the data for the missing phases in order to predict the final result?

With respect to the recipe:

The difference in phases in the training set is 340 out of 5021.

But in the case of the test set the difference is huge: 2936 out of 2967, ignoring the final_rinse phase. I’m not sure recipe_metadata is of real use.

I think you need to split your training and testing data in such a way that your model represents the 3 recipes, i.e. 1111, 1100 and 1001.