One hot encoding / test data unique values

In the code_execution_development data, the submission_format / test_labels rows all fall on the same date. As a result, many of the weather features take only a few of their possible values; for example, lightning_prob has 4 categories in the training data but only 1 in the test set. When both are passed through pd.get_dummies, the training and test example data no longer have the same number of columns.

This is easily fixed by concatenating the test and training data, encoding them together, and then separating them again. For efficiency, though, I am wondering whether the actual test data will contain all the unique values that appear in the training data provided. I would rather not load all the training data in solution.py if it's not necessary.
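For reference, here is a minimal sketch of the mismatch and of the concatenate-then-split workaround. The toy frames and the category values ("none", "low", "medium", "high") are invented for illustration; only the lightning_prob column name comes from the real data.

```python
import pandas as pd

# Toy stand-ins for the real data; the category values are made up.
train = pd.DataFrame({"lightning_prob": ["none", "low", "medium", "high"]})
test = pd.DataFrame({"lightning_prob": ["low", "low"]})

# Encoding each frame separately gives different column counts.
separate_train = pd.get_dummies(train)
separate_test = pd.get_dummies(test)

# Workaround: stack the frames under keys, encode once, then split by key.
combined = pd.concat([train, test], keys=["train", "test"])
encoded = pd.get_dummies(combined)
train_enc = encoded.loc["train"]
test_enc = encoded.loc["test"]

print(separate_train.shape[1], separate_test.shape[1])
print(list(test_enc.columns))
```

This works, but it requires the training data to be available at prediction time, which is the cost I would like to avoid.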

Hi @axj65,
Good to point out that the development dataset contains some of the complexity needed to debug your code, but not all of it. Making sure your solution runs on the test data without error may involve some combination of hard-coding the categorical values, filling missing values, and other data checks. I’d start by assuming that the test set values are a subset of the training set values, but that is hard to guarantee for every way of processing the data, so you are welcome to ask specific questions here or by email.
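As a rough illustration of the hard-coding option, the sketch below fixes the category list up front so that pd.get_dummies always emits the same columns, without loading the training data. The CATEGORIES dictionary, its values, and the encode helper are hypothetical; in practice the lists would be recorded once from the training data and pasted into solution.py.

```python
import pandas as pd

# Hypothetical category lists, recorded once from the training data.
CATEGORIES = {"lightning_prob": ["none", "low", "medium", "high"]}


def encode(df, categories=CATEGORIES):
    """One-hot encode the listed columns with a fixed set of output columns."""
    df = df.copy()
    for col, cats in categories.items():
        # Values outside `cats` become NaN, so they encode as all zeros.
        df[col] = pd.Categorical(df[col], categories=cats)
    return pd.get_dummies(df, columns=list(categories))


# A test frame with only one known category plus one unseen value.
test = pd.DataFrame({"lightning_prob": ["low", "low", "unseen"]})
encoded = encode(test)
print(list(encoded.columns))
```

Because the column is converted to a categorical dtype with explicit categories, the output has one column per listed category even when the test data contains only one of them, and an unexpected value yields a row of zeros rather than a new column.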