2nd Place Solution - Stacking of Multi-Step CNNs

Hi everyone

Let me start by thanking the organizers for such a unique problem (it seems to be a theme for Radiant Earth competitions :slight_smile: ) and congratulate the top solutions for their achievement.

Regarding my approach, it has three stages. In the first stage, ImageNet-pretrained CNNs are fine-tuned on the dataset with different numbers of time steps, in multiples of 3. Every 3 time-step images are concatenated channel-wise and passed through the CNN for feature extraction. The features of all triplets are then concatenated and passed through a fully connected layer for the final output. Some models (mostly with a ResNet-50 backbone) are trained with 1, 3, 6, 9, and 12 consecutive time steps, and some with 6 time steps and a spacing greater than 1 (for example, with spacing 2, images 1, 3, 5, 7, 9, 11 are used instead of images 1, 2, 3, 4, 5, 6). Aside from rotation and flipping, time augmentation is applied by dropping or repeating one of the input images (except the main image). All models are trained with a 224x224 image size and 5 group folds, and test-time augmentation is applied to their predictions.
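A minimal sketch of the frame selection and channel-wise grouping described above (function names and shapes are my own assumptions, not the author's code):

```python
import numpy as np

def select_time_steps(last_idx, n_steps=6, spacing=1):
    """Pick n_steps frame indices ending at the main image (last_idx),
    walking backwards with the given spacing."""
    idx = [last_idx - i * spacing for i in range(n_steps)]
    return idx[::-1]

def group_triplets(frames):
    """Concatenate every 3 consecutive single-channel frames channel-wise
    so each triplet forms one 3-channel CNN input.
    frames: (T, H, W) with T a multiple of 3 -> (T//3, 3, H, W)."""
    t, h, w = frames.shape
    assert t % 3 == 0, "number of time steps must be a multiple of 3"
    return frames.reshape(t // 3, 3, h, w)

# Example: 6 time steps with spacing 2 ending at frame 11
# (the post's "images 1, 3, 5, 7, 9, 11").
print(select_time_steps(11, n_steps=6, spacing=2))  # -> [1, 3, 5, 7, 9, 11]
```

Each `(3, H, W)` triplet can then be fed to the backbone independently, with the per-triplet features concatenated before the fully connected head.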

In the second stage, around 200 models are trained on the output predictions of the first-stage models, taking into account a history of 25-30 time steps. The models are a combination of:

  • Xgboost
  • 1D CNN
  • LSTM
  • GRU
  • MLP
  • Transformer
  • Decision Tree
  • Linear Regression

Each model in the list above is trained on the output of a different CNN.
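To illustrate the second stage, here is a hedged sketch of how a fixed-length history of stage-1 predictions might be turned into feature rows for models like XGBoost (the zero-padding scheme and function name are assumptions, not the author's code):

```python
import numpy as np

def build_history_features(preds, history=30):
    """Stack each sample's last `history` stage-1 predictions into a
    feature row for a stage-2 model (earlier steps are zero-padded).
    preds: (T,) array of one storm's chronological CNN predictions.
    Returns a (T, history) matrix: row t holds preds[t-history+1 .. t]."""
    t = len(preds)
    feats = np.zeros((t, history))
    for i in range(t):
        window = preds[max(0, i - history + 1): i + 1]
        feats[i, -len(window):] = window
    return feats
```

Rows of this matrix would then serve as inputs to the tabular models (XGBoost, MLP, ...) or, kept as sequences, to the recurrent ones (LSTM, GRU, Transformer).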

In the final stage, ensemble selection is applied to combine the best subset of second-stage models.
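"Ensemble selection" here most likely refers to greedy forward selection with replacement in the style of Caruana et al.; a minimal numpy sketch under that assumption (not the author's actual implementation):

```python
import numpy as np

def ensemble_selection(preds, y, n_iters=10):
    """Greedily add (with replacement) the model whose inclusion most
    lowers the ensemble's RMSE on a held-out set.
    preds: (n_models, n_samples) stage-2 predictions; y: (n_samples,) targets.
    Returns the list of chosen model indices and the blended prediction."""
    chosen = []
    running = np.zeros_like(y, dtype=float)
    for _ in range(n_iters):
        k = len(chosen)
        best_m, best_rmse, best_run = None, np.inf, None
        for m in range(preds.shape[0]):
            # Candidate blend if model m were added to the current ensemble.
            cand = (running * k + preds[m]) / (k + 1)
            rmse = np.sqrt(np.mean((cand - y) ** 2))
            if rmse < best_rmse:
                best_m, best_rmse, best_run = m, rmse, cand
        chosen.append(best_m)
        running = best_run
    return chosen, running
```

Because selection is with replacement, strong models can be picked repeatedly, which implicitly weights them higher in the final average.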

My best combination of a single CNN with 6 time steps and a single XGBoost model on 30-time-step predictions achieves scores of 7.1994 and 6.9343 on the public and private leaderboards, respectively.

I didn’t spend much time tuning CNN+LSTM since it takes a long time to train and I had only one GPU, but it seems I should have been more patient with it. Maybe combining the multi-time-step CNN with an LSTM could give better results.


Nice approach @KarimAmer! May I ask how the dataset split was done?

Thanks @cayala. It was a 5-fold group k-fold split by storm id.


Thank you @KarimAmer for sharing your approach. As I am kind of new to these competitions, I have some questions about the details, and especially about why you approached the problem the way you did.

  1. For your 5-fold validation split, did you stratify (by wind speed, for example)? If so, why, and if not, why not?

  2. Why did you choose 5-fold validation over a single validation split, especially with regard to run time and limited computational resources? (To me, a stratified single validation split already seemed to give an extremely good indication of performance on the public and private test sets.) How do you decide the balance between computation time and reducing overfitting by increasing the number of validation folds?

  3. Did you address data imbalance in any way (i.e. only very few images with higher wind speeds are available), for example by generating more augmented images with high wind speeds or by using a customized weighted loss function? Again, if so, why, and if not, why not?

  4. What optimizer did you use? In general, how did you determine your optimal learning rate, and did you use techniques like reducing the learning rate on plateau?

  5. How did your concatenated-channel approach work for the first image in a sequence? That is, for your prediction at time=0, how did you fill the remaining channels?

  6. Why did you use image flipping? This seems counterintuitive to me. If I am not mistaken, the tropical storms in all images were rotating in the same direction regardless of the ocean variable. I would expect the CNNs to pick up on these rotational patterns, and image flipping might make them harder to learn. Did you notice an increase in performance after using flipping?

  7. Why did you not use other augmentations like blurring or cutouts? More generally, how did you come up with the augmentations you chose?

  8. Why did you use 224x224? Did you use a center crop or a resize to obtain this format?

  9. In general, it seems to me that the winning approaches, not only in this competition, always average over many different models in order to reduce variance and overfitting. But why is it that for ImageNet and other datasets a single model performs best? (ImageNet Benchmark (Image Classification) | Papers With Code)

Sorry for asking so many, possibly stupid, questions, but I am very curious about these things and about the thought process behind your solution.

Many thanks in advance.


It will take me some time to answer, but your questions could be a good starting point for me to write a blog post. Until I write it, you can watch a webinar about my winning solution in another competition; I believe it will answer some of your questions. You can find it here:


Thank you, I'm looking forward to it.