This is one more question about the cleaning phases in the training vs the test data. (I’ve read the other questions and answers on the topic but this is still not clear to me.) Here is what I know so far:
There are five cleaning phases and they can appear only in the following order:
pre_rinse -> caustic -> intermediate_rinse -> acid -> final_rinse
One or more of the phases can be skipped, so that different cleaning “recipes” are possible. For simplicity, let’s use a 5-digit binary string to describe each recipe, where 0 means that the phase was skipped and 1 means that the phase took place. For example:
- 11111 corresponds to the normal long recipe with all five phases on.
- 11001 corresponds to the normal short recipe pre_rinse -> caustic -> final_rinse.
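To make the encoding concrete, here is a minimal Python sketch. The phase names and their order come from the list above; the function name encode_recipe and the idea of passing in the phases as a list of names are my own illustrative assumptions, not part of the competition code.

```python
# Phase order as stated above.
PHASES = ["pre_rinse", "caustic", "intermediate_rinse", "acid", "final_rinse"]


def encode_recipe(observed_phases):
    """Return a 5-character string: '1' if the phase occurred, '0' if it was skipped.

    `observed_phases` is any iterable of phase names (hypothetical input format).
    """
    observed = set(observed_phases)
    unknown = observed - set(PHASES)
    if unknown:
        raise ValueError(f"unknown phase(s): {sorted(unknown)}")
    return "".join("1" if phase in observed else "0" for phase in PHASES)


print(encode_recipe(PHASES))                                   # 11111
print(encode_recipe(["pre_rinse", "caustic", "final_rinse"]))  # 11001
```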
We are given all available data for each cleaning process. I will assume that no observations are missing, so that “all available data” means “all observations taken every 2 sec throughout the cleaning process, from start to finish”.
The training data has 5,021 processes with the following breakdown by recipe.
11111 3726
11001 1017
00011 199
01001 38
01111 22
00001 16
10001 3
As explained by @ThomasF, the two main recipes are 11111 and 11001. And there is a small subset of less common recipes.
The test data has 2,967 processes with the following breakdown by recipe. We know there is a final_rinse phase at the end, but those data points have been removed, so I’ve put in “?” instead.
1100? 1182
1111? 671
1110? 670
1000? 292
0001? 122
0100? 23
0111? 5
0110? 2
In the reply to another answer, @bull says that “for the test data, only the final_rinse phase has been removed. Any other phases present in the original dataset are still there.”
This means that the test data contains 670 recipes of type 1110? when the training data doesn’t have even one example of a 11101 recipe. Isn’t it more likely that these 670 recipes are in fact long 5-phase recipes but we are given only data from the first three phases, just as the competition overview describes?
It is even more complicated for the 1,182 test recipes for which we have only a pre_rinse and a caustic phase: these can be either 11001 or 11111.
That is, because we are only given data from select previous phases (up to a given time t), for some recipes we don’t know for sure whether a phase was skipped in practice or whether it did occur but we just don’t get to observe any data from it. (But maybe we can guess if we look at the actual measurements, not just whether a phase occurs or not.)
11??? 1182 -> 11111 or 11001
1111? 671 -> 11111
111?? 670 -> 11111
1???? 292 -> 11111 or 11001 or 10001
0001? 122 -> 00011
01??? 23 -> 01111 or 01001
0111? 5 -> 01111
011?? 2 -> 01111
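The mapping above can be reproduced mechanically. Below is a rough Python sketch under my own assumptions: everything after the last observed phase in a test process is treated as unknown, and the only possible full recipes are the seven that appear in the training data. The counts are copied from the breakdowns above; the helper names mask_unobserved and candidates are hypothetical.

```python
# Recipe counts copied from the training and test breakdowns above.
TRAIN_COUNTS = {
    "11111": 3726, "11001": 1017, "00011": 199, "01001": 38,
    "01111": 22, "00001": 16, "10001": 3,
}
TEST_COUNTS = {
    "1100?": 1182, "1111?": 671, "1110?": 670, "1000?": 292,
    "0001?": 122, "0100?": 23, "0111?": 5, "0110?": 2,
}


def mask_unobserved(test_pattern):
    """Replace every position after the last observed phase with '?'.

    Assumption: a '0' after the last '1' may mean "truncated", not "skipped".
    """
    last_seen = test_pattern.rfind("1")
    return test_pattern[: last_seen + 1] + "?" * (len(test_pattern) - last_seen - 1)


def candidates(masked, train_recipes):
    """List the training recipes that agree with every known position of `masked`."""
    return [
        recipe
        for recipe in train_recipes
        if all(m in ("?", r) for m, r in zip(masked, recipe))
    ]


for pattern, count in TEST_COUNTS.items():
    masked = mask_unobserved(pattern)
    print(masked, count, "->", " or ".join(candidates(masked, TRAIN_COUNTS)))
```

This only narrows each test process down to the recipes that are consistent with the observed phases. Choosing between, say, 11111 and 11001 for the 11??? group would still require looking at the actual measurements.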