One hot encoding / test data unique values

axj65 · April 8, 2023, 4:28pm

In the code_execution_development data, the submission_format / test_labels are all on the same date. As a result, many of the weather features take only a few of their possible values: for example, lightning_prob has 4 categories in the training data but only 1 in the test set. When passed through pd.get_dummies, the training and test example data then no longer have the same number of columns.

This is easily fixable by concatenating the test and training data, encoding them together, and then separating them again. But for efficiency, I am wondering whether the actual test data will contain all the unique values that are in the provided training data. I just did not want to have to load all the training data in solution.py if it's not necessary.
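
For reference, the concatenate-and-split workaround could look roughly like the sketch below. The frames and category values are made up for illustration (only the lightning_prob column name comes from the competition data), and it assumes both frames fit in memory at once.

```python
import pandas as pd

# Hypothetical toy frames; the category values are invented for illustration.
train = pd.DataFrame({"lightning_prob": ["none", "low", "medium", "high"]})
test = pd.DataFrame({"lightning_prob": ["low", "low"]})

# Encoding each frame separately produces a different number of columns.
separate_train_cols = pd.get_dummies(train).shape[1]
separate_test_cols = pd.get_dummies(test).shape[1]

# Concatenate with keys so the two parts can be split apart again afterwards.
combined = pd.concat([train, test], keys=["train", "test"])
encoded = pd.get_dummies(combined)

# Selecting on the outer index level recovers each part with identical columns.
train_enc = encoded.loc["train"]
test_enc = encoded.loc["test"]
```

The drawback is the one raised above: the training data has to be loaded at inference time just to recover the column layout.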

rbgb · April 10, 2023, 1:52pm

Hi @axj65,
Good to point out that the development dataset contains some of the complexity needed to debug your code, but not all of it. Making sure your solution runs on the test data without error might involve some combination of hard-coding the categorical values, filling missing values, and other data checks. I’d start by assuming that the test set values are a subset of the training set values, but that is hard to guarantee for all ways of processing the data, so you are welcome to ask specific questions here or by email.
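
As a rough sketch of the hard-coding idea (not an official recipe), the category list can be written out once and attached to the column before encoding, so pd.get_dummies emits a column for every listed value even when the test data contains only one of them. The values in LIGHTNING_LEVELS and the encode helper are hypothetical; the real list would be copied from the training data.

```python
import pandas as pd

# Hypothetical category list, written out by hand from the training data.
LIGHTNING_LEVELS = ["none", "low", "medium", "high"]

def encode(df):
    """One-hot encode lightning_prob with a fixed, hard-coded column layout."""
    out = df.copy()
    col = out["lightning_prob"]
    # Treat values outside the hard-coded list as missing instead of
    # letting them create new columns.
    known = col.where(col.isin(LIGHTNING_LEVELS))
    out["lightning_prob"] = pd.Categorical(known, categories=LIGHTNING_LEVELS)
    # get_dummies creates one column per declared category, observed or not.
    return pd.get_dummies(out, columns=["lightning_prob"])

# The test data has a single known category plus one unseen value.
test = pd.DataFrame({"lightning_prob": ["low", "low", "extreme"]})
test_enc = encode(test)
```

With this approach solution.py does not need the training data at all, and a row with an unseen or missing value simply gets zeros in every dummy column.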
