This is one more question about the cleaning phases in the training vs the test data. (I’ve read the other questions and answers on the topic but this is still not clear to me.) Here is what I know so far:
There are five cleaning phases and they can appear only in the following order:
pre_rinse -> caustic -> intermediate_rinse -> acid -> final_rinse
One or more of the phases can be skipped, so that different cleaning “recipes” are possible. For simplicity, let’s use a 5-digit long binary string to describe the recipes, where 0 means that the phase was skipped and 1 means that the phase took place. For example:
11111 corresponds to the normal long recipe with all five phases on.
11001 corresponds to the normal short recipe pre_rinse -> caustic -> final_rinse.
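To make the encoding concrete, here is a minimal Python sketch. The names PHASES, encode_recipe and decode_recipe are my own and do not come from the competition materials; the only thing taken from the problem description is the phase order.

```python
# Phase order as given in the competition description.
PHASES = ("pre_rinse", "caustic", "intermediate_rinse", "acid", "final_rinse")


def encode_recipe(phases_present):
    """Return the 5-character binary string for the phases that took place."""
    present = set(phases_present)
    unknown = present - set(PHASES)
    if unknown:
        raise ValueError(f"unknown phase names: {sorted(unknown)}")
    # One character per phase, in the fixed order: 1 = took place, 0 = skipped.
    return "".join("1" if phase in present else "0" for phase in PHASES)


def decode_recipe(code):
    """Return the list of phase names that a binary recipe string switches on."""
    if len(code) != len(PHASES):
        raise ValueError("a recipe string must have exactly five characters")
    return [phase for phase, bit in zip(PHASES, code) if bit == "1"]


print(encode_recipe(["pre_rinse", "caustic", "final_rinse"]))  # short recipe
print(decode_recipe("00011"))
```

Because the phases can only appear in one order, the set of phases present is enough to identify a recipe, which is why a plain string works here.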
We are given all available data for each cleaning process. I will assume that no observations are missing, so that “all available data” means “all observations taken every 2 sec throughout the cleaning process, from start to finish”.
The training data has 5,021 processes with the following breakdown by recipe.
11111  3726
11001  1017
00011   199
01001    38
01111    22
00001    16
10001     3
As explained by @ThomasF, the two main recipes are 11111 and 11001, and there is a small subset of less common recipes.
The test data has 2,967 processes with the following breakdown by recipe. We know there is a final_rinse phase at the end, but those data points have been removed, so I’ve put in “?” instead.
1100?  1182
1111?   671
1110?   670
1000?   292
0001?   122
0100?    23
0111?     5
0110?     2
In the reply to another answer, @bull says that “for the test data, only the final_rinse phase has been removed. Any other phases present in the original dataset are still there.”
This means that the test data contains 670 recipes 1110?, when the training data doesn’t have even one example of a 11101 recipe. Isn’t it more likely that these 670 recipes are in fact long 5-phase recipes but we are given only data from the first three phases, just as the competition overview describes?
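The mismatch can be checked mechanically. The sketch below takes the literal reading (only final_rinse was removed, so "?" is simply 1) and lists the test patterns that then have no counterpart in the training data. The counts are the ones from this post; the function names are my own.

```python
# Recipe counts copied from the breakdowns in this post.
TRAIN_COUNTS = {
    "11111": 3726, "11001": 1017, "00011": 199, "01001": 38,
    "01111": 22, "00001": 16, "10001": 3,
}
TEST_COUNTS = {
    "1100?": 1182, "1111?": 671, "1110?": 670, "1000?": 292,
    "0001?": 122, "0100?": 23, "0111?": 5, "0110?": 2,
}


def literal_recipe(test_pattern):
    """Literal reading: the hidden final_rinse is the only missing phase."""
    return test_pattern.replace("?", "1")


def unmatched_patterns(test_counts, train_counts):
    """Test patterns whose literal recipe never occurs in the training data."""
    return {
        pattern: count
        for pattern, count in test_counts.items()
        if literal_recipe(pattern) not in train_counts
    }


unmatched = unmatched_patterns(TEST_COUNTS, TRAIN_COUNTS)
for pattern, count in unmatched.items():
    print(pattern, "->", literal_recipe(pattern), "test processes:", count)
```

With these counts it flags 1110? (670 processes) and 0110? (2 processes); every other test pattern maps to a recipe that does occur in training.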
It is even more complicated for the 1,182 test recipes for which we have only a pre_rinse and a caustic phase: these can be either 11111 or 11001.
That is, because we are only given data from select previous phases (up to a given time t), for some recipes we don’t know for sure whether a phase was skipped in practice or whether it did occur but we just don’t get to observe any data from it. (But maybe we can guess if we look at the actual measurements, not just whether a phase occurs or not.)
11???  1182  ->  11111 or 11001
1111?   671  ->  11111
111??   670  ->  11111
1????   292  ->  11111 or 11001 or 10001
0001?   122  ->  00011
01???    23  ->  01111 or 01001
0111?     5  ->  01111
011??     2  ->  01111
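The mapping above can be reproduced with a short Python sketch. It rests on my assumption, not on anything confirmed by the organizers, that everything after the last observed phase is unknown and that the true recipe is one of those seen in training. The function names are hypothetical; the counts are the ones from this post.

```python
# Training recipe counts copied from the breakdown in this post.
TRAIN_COUNTS = {
    "11111": 3726, "11001": 1017, "00011": 199, "01001": 38,
    "01111": 22, "00001": 16, "10001": 3,
}


def known_prefix(test_pattern):
    """Keep the pattern only up to the last phase we actually observed.

    Trailing zeros are treated as unknown (assumption): the phase may have
    been skipped, or it may have happened after the cut-off time t.
    """
    observed = test_pattern.rstrip("?")   # drop the removed final_rinse
    last_on = observed.rfind("1")         # -1 if no phase was observed
    return observed[: last_on + 1]


def candidate_recipes(test_pattern, train_counts):
    """Training recipes consistent with the observed prefix, most common first."""
    prefix = known_prefix(test_pattern)
    matches = [recipe for recipe in train_counts if recipe.startswith(prefix)]
    return sorted(matches, key=lambda recipe: -train_counts[recipe])


for pattern in ("1100?", "1110?", "1000?", "0100?"):
    print(pattern, "->", candidate_recipes(pattern, TRAIN_COUNTS))
```

This only narrows the candidates by which phases occur. Choosing among them would need the actual measurements, as noted above.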