After removing the rows with target_time_period==true, the dataframe still keeps, for each process_id, at least one row with phase==‘final_rinse’, which by the description should not be used.
Shouldn’t target_time_period==true hold for all rows with phase==‘final_rinse’?
Or are those remaining rows to be used?
The test_values set does not contain any rows with phase==‘final_rinse’.
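For anyone who wants to reproduce the observation, here is a minimal sketch on a toy table. The column names process_id, phase and target_time_period are taken from this thread; the rows themselves are made up for illustration and are not real competition data.

```python
import pandas as pd

# Toy stand-in for the training table (made-up rows, illustration only).
train = pd.DataFrame({
    "process_id": [1, 1, 1, 1, 2, 2, 2],
    "phase": ["pre_rinse", "final_rinse", "final_rinse", "final_rinse",
              "caustic", "final_rinse", "final_rinse"],
    "target_time_period": [False, False, True, True, False, False, True],
})

# Drop the target rows, as described in the question.
non_target = train[~train["target_time_period"]]

# Count the final_rinse rows that survive the filter, per process.
leftover = (
    non_target[non_target["phase"] == "final_rinse"]
    .groupby("process_id")
    .size()
)
print(leftover)
```

A non-zero count for a process means that process has final_rinse rows outside the target period.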
From what I understand up to now, the target period is part of final_rinse after the last closing of the caustic and acid valves (so it is the final portion of final_rinse time series).
In the problem description there is the following note:
The target time period is the portion of the final rinse phase when the return caustic and return acid valves have been closed for the last time
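One possible reading of that note, sketched for a single process: take the last final_rinse row where either valve is still open and treat every final_rinse row after it as the target period. The boolean column names return_caustic and return_acid are assumptions made for this example, and the data is made up. The provided target_time_period flag remains the authoritative marker; this only illustrates the definition.

```python
import pandas as pd

# Toy data for ONE process (made-up rows; valve column names are assumed).
df = pd.DataFrame({
    "phase": ["pre_rinse"] + ["final_rinse"] * 5,
    "return_caustic": [False, True, False, False, False, False],
    "return_acid": [False, False, True, False, False, False],
})

is_final = df["phase"] == "final_rinse"
valve_open = df["return_caustic"] | df["return_acid"]

# Index labels of final_rinse rows where at least one valve is still open.
open_in_final = df.index[is_final & valve_open]

# If no valve is ever open during final_rinse, the whole phase qualifies.
last_open = open_in_final.max() if len(open_in_final) else -1

# Rows of final_rinse strictly after the last "open" row.
df["derived_target"] = is_final & (df.index > last_open)
print(df)
```

This relies on the default integer index being in time order; real data would need sorting by timestamp first.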
Yes, I understand. However, it seems suspicious that we’re supposed to use some final_rinse rows for training but not for prediction (the test_values set does not have a single row with final_rinse).
I agree that this is strange, but like you said, the lack of such non_target rows (from final_rinse) in the test set makes the usage of such rows almost impossible.
Good questions—this is by design. We want to be able to predict the turbidity before the start of the final rinse phase (so it can be adjusted if needed). This means you are not provided any final rinse data for the test set. However, the turbidity that matters (which we want to predict) is only during the times marked target_time_period, which is not all of the final rinse.
For the training set, you are provided all of the observations, which you can split with whatever strategy works best.
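One way to act on this, as a rough sketch rather than a recommended pipeline: build training features only from rows that would also exist in the test set (everything before final_rinse) and build labels only from the target_time_period rows. The return_turbidity column name and the mean aggregate are placeholders assumed for the example; check the problem description for the actual target definition.

```python
import pandas as pd

# Toy training table (made-up rows; return_turbidity is an assumed column name).
train = pd.DataFrame({
    "process_id": [1, 1, 1, 1, 2, 2, 2],
    "phase": ["pre_rinse", "caustic", "final_rinse", "final_rinse",
              "pre_rinse", "final_rinse", "final_rinse"],
    "target_time_period": [False, False, False, True, False, False, True],
    "return_turbidity": [0.5, 0.7, 0.4, 0.2, 0.9, 0.6, 0.3],
})

# Features: only what is visible at prediction time (no final_rinse rows),
# which mirrors the structure of the test set.
features = train[train["phase"] != "final_rinse"]

# Labels: a placeholder aggregate over the target rows only, one per process.
labels = (
    train[train["target_time_period"]]
    .groupby("process_id")["return_turbidity"]
    .mean()
)

print(features)
print(labels)
```

The final_rinse rows outside the target period end up in neither features nor labels here; whether to use them some other way (for example as an auxiliary target) is a modeling choice.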