Predicting on the test set reduces the size of our data

The submission format requires 416 rows: 260 entries for San Juan, and 156 entries for Iquitos.

The two sets start with iq_X = 520 rows, and sj_X = 936 rows.

We perform a train_test_split(test_size=0.2) for both San Juan and Iquitos. This leaves iq_train_X with 416 rows and sj_train_X with 748 rows.

After the train_test_split mentioned above, but right before running model.predict(), our iq_train_X set has 416 rows. After predicting on the test subset, we are left with only 104 predictions.

Similarly, after train_test_split, sj_train_X has 748 rows, and predicting on the test subset leaves only 188 predictions.

This leaves us with a total of 292 entries, when we need 416 for the submission format.
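For reference, the counts can be reproduced with dummy arrays (shapes only, no real data; the variable names follow our code):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Dummy arrays with the same row counts as the two city feature sets
iq_X = np.zeros((520, 5))
sj_X = np.zeros((936, 5))

# test_size=0.2 keeps 80% for training: 520 -> 416 + 104, 936 -> 748 + 188
iq_train_X, iq_test_X = train_test_split(iq_X, test_size=0.2, random_state=0)
sj_train_X, sj_test_X = train_test_split(sj_X, test_size=0.2, random_state=0)

print(len(iq_train_X), len(iq_test_X))  # 416 104
print(len(sj_train_X), len(sj_test_X))  # 748 188
print(len(iq_test_X) + len(sj_test_X))  # 292 -- not the 416 the submission needs
```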

This 292 != 416 mismatch causes an index error when we attempt to merge the sets for a submission.

Does anyone know why this may be happening?

Because both training sets have nothing to do with the submission.

You train with 520 and 936 observations, whatever split you choose.

Then the submission dataset is totally independent of your training. You just need to pass the submission dataset into your model and submit the predictions in the right format. See the DrivenData benchmark to understand this better.

Thanks for the response!

I’ve looked at the benchmark quite a bit, and it still looks like they are using the pre-processed train / test data to do their submission, through the subtrain / subtest sets.

Are you saying you should use the discovered model parameters from the preprocessed training / test, and run the full 416-row provided “dengue_features_test” set through it for the submission?

I’m still a little confused.

Hi @MatthewCSC,

If you look carefully at the benchmark model, you only need to process and then pass the submission sets into the predict method of sj_best_model or iq_best_model, and you’re done.

Earlier, in the get_best_model function, you can see that during “step 3” the model with the best parameters is fitted again on the entire dataset (training + testing subsets). The model has now been built with the best parameters, so you can reuse it to make predictions on new datasets regardless of their length, provided the new datasets have the same features as the ones the model was trained on.
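As a quick illustration with synthetic data (the names, shapes, and hyperparameters here are just for the example, not from the benchmark): a fitted model only cares about the column count of new data, never the row count.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Stand-in for ALL 520 Iquitos training rows (train + test subsets combined)
full_X = rng.normal(size=(520, 4))
full_y = rng.poisson(10, size=520)

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(full_X, full_y)

# New data may have any number of rows, as long as it has the same 4 features:
# here, 156 rows like the Iquitos portion of the submission file
new_X = rng.normal(size=(156, 4))
preds = model.predict(new_X)
print(preds.shape)  # (156,)
```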

I really appreciate the help! I apologize for not understanding quite yet.

We have used train_test_split and then a RandomForestRegressor, calling and model.predict to get our results, which produce the two city datasets that don’t match the size needed for submission.

Can you explain what you mean by “submission sets”? Is that the submission_format.csv? Or are those the results we get from the model.predict? Or something else entirely?

What do we pass into which functions to generate correctly sized predictions, after figuring out what generates the best model? I should also mention we aren’t using a get_best_model function like the benchmark. We are using a grid search to find the optimal parameters, and then we fit the model with those.

So we already have the best model parameters, we just don’t know how to get the sizes to line up after predicting.

Thank you for the clarification!

Hi @MatthewCSC,

Sorry if I have confused you. Indeed, the term `submission set` is quite confusing.

  1. You train with the dengue_features_train and dengue_labels_train datasets, merging them as in the benchmark. Since you use train_test_split for your grid search, I assume you have split the data into a training subset and a testing subset for each city.

  2. Then you have to pass the dengue_features_test dataset into your final model and use the predictions to make your submission, overwriting the submission_format file.

It’s hard to say what is happening without knowing your code. But re-reading your previous posts, it looks like you are trying to make your submission using the testing subset of the training dataset (the one that you split into train/test subsets) instead of using the testing dataset (which is the dengue_features_test file)!

No problem about the confusion!

  1. We have trained with both the dengue_features_train and dengue_labels_train, and merged them. Yes, for the grid search we have split training and test sets for each city.

  2. When we pass the dengue_features_test into model.predict for the final predictions, is that passed twice, once per city, or just once overall?
    The way we have our two models (one iq, one sj) set up, we do two model.fits and model.predicts, one per city.

For example, we have two RandomForestRegressor instances in total:
one with, iq_train_y) and one with, sj_train_y).
Similarly, our predicts look like: iq_predicted = model.predict(iq_test_X)
and sj_predicted = model.predict(sj_test_X).

I guess a question that would help narrow down the issue is: when exactly do you stop using the testing subset of the training dataset?
We understand it to be right after you do, for instance,, iq_train_y).

And then we assume you make a prediction on the dengue_features_test like you’ve mentioned.

We tried doing that, and both city dataset shapes line up perfectly (260 and 156), but every case number comes out around 31-32 for each of the 416 rows.

This is why we were curious about when dengue_features_test starts being used, because we thought maybe we were applying StandardScaler() to the wrong test set. (We are currently scaling iq_train_X and iq_test_X, which are the training subset and testing subset respectively.)

Yes, your assumption is what we think was happening: we were predicting on the testing subset. For clarity, that was iq_test_X, the testing subset generated by train_test_split.

sj_train_X, sj_train_y : you train with it
sj_test_X : you test your trained model with it and calculate the MAE score of your predictions

You train the model until you cannot improve it more (for instance, relying on the MAE score of sj_test_X).

When you are satisfied with your model, just do for each city (for instance San Juan):
sj_full_X = pd.concat([sj_train_X, sj_test_X])
sj_full_y = pd.concat([sj_train_y, sj_test_y])
sj_best_model = RandomForestRegressor(**your_optimized_model_parameters).fit(sj_full_X, sj_full_y)
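Written out as a runnable sketch (the frames are synthetic stand-ins and the best_params values are hypothetical — note that the grid-search winners go to the constructor, not to fit()):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Stand-ins for the San Juan subsets from train_test_split (748 + 188 = 936 rows)
sj_train_X = pd.DataFrame(rng.normal(size=(748, 4)))
sj_test_X = pd.DataFrame(rng.normal(size=(188, 4)))
sj_train_y = pd.Series(rng.poisson(10, size=748))
sj_test_y = pd.Series(rng.poisson(10, size=188))

# Recombine every training row before the final fit
sj_full_X = pd.concat([sj_train_X, sj_test_X])
sj_full_y = pd.concat([sj_train_y, sj_test_y])

# Hyperparameters go to the constructor; fit() returns the fitted estimator
best_params = {"n_estimators": 100, "max_depth": 5}  # hypothetical values
sj_best_model = RandomForestRegressor(**best_params).fit(sj_full_X, sj_full_y)
print(len(sj_full_X))  # 936
```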

Then comes the submission step. Do just as in the benchmark:

[screenshot of the benchmark’s submission code]

Of course it’s up to you to adapt the ‘preprocess_data’ function of the benchmark so that it applies to dengue_features_test exactly the same transformations as those you applied to dengue_features_train.
For example, if the only preprocessing you applied to dengue_features_train is StandardScaler(), just do the same with dengue_features_test. Do not forget that Iquitos and San Juan have to be processed separately!
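One way to apply “exactly the same transformations” with StandardScaler is to fit the scaler on the training features only, then reuse that fitted scaler on dengue_features_test (a sketch with synthetic arrays; shapes are illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Stand-ins: all 520 Iquitos training rows, and the 156 Iquitos rows
# of dengue_features_test
iq_train_full = rng.normal(loc=5.0, scale=2.0, size=(520, 4))
iq_submission = rng.normal(loc=5.0, scale=2.0, size=(156, 4))

# Fit the scaler on the TRAINING data only, then reuse it unchanged
scaler = StandardScaler().fit(iq_train_full)
iq_submission_scaled = scaler.transform(iq_submission)

print(iq_submission_scaled.shape)  # (156, 4)
# San Juan needs its own scaler, fitted on the 936 San Juan training rows
```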

You should not obtain such similar results for both Iquitos and San Juan. It looks like you have applied the whole dengue_features_test to your San Juan model or to your Iquitos model, instead of applying dengue_features_test.loc['sj'] to your San Juan model and dengue_features_test.loc['iq'] to your Iquitos model.

Or maybe you chose to create a single model for both Iquitos and San Juan? Why not! If this is the case, just send your predictions as shown in the screenshot and you’re done.
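Either way, filling the submission frame per city might look like this (a simplified sketch: the real submission_format.csv also carries year/weekofyear columns, and the prediction arrays here are stand-ins for each model’s output):

```python
import numpy as np
import pandas as pd

# Stand-in for submission_format.csv: San Juan rows followed by Iquitos rows
submission = pd.DataFrame({
    "city": ["sj"] * 260 + ["iq"] * 156,
    "total_cases": 0,
})

# Stand-ins for the per-city predictions (these would come from each model)
sj_preds = np.full(260, 20)
iq_preds = np.full(156, 5)

# Write each city's predictions into its own rows of the submission frame
submission.loc[submission.city == "sj", "total_cases"] = sj_preds
submission.loc[submission.city == "iq", "total_cases"] = iq_preds

print(len(submission))  # 416 rows, matching the required format
```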

Thank you so much! We were able to get everything working correctly!
