Clouds in dataset

Hi everyone!

I am relatively new to ML/AI tasks and I’m using this as a bit of a training exercise. I’m wondering how you would go about tackling cloudy or snowy images creating noise during segmentation training.

I know I could download all the data and clean it up offline, but I am trying to do everything on the fly / streaming the data.

Are there techniques, such as transformers, which could work around this issue? I’d appreciate any input or suggested reading!



In my opinion, figuring out how to deal with bad images is going to be what makes or breaks a great solution for this challenge.

See @hengcherkeng’s post here.
My understanding is that he basically uses a transformer to choose the best time step, and then takes that as the prediction.

The way I see the problem, there are 12 possible time steps, and the differences between these time steps can be very informative: the way vegetation behaves in winter vs. summer can tell you a lot about how much biomass there is at a location (also, see the comment in the post above about snow).

The hard part is that you often don’t have all 12 images, or the images might be cloudy or otherwise degraded. So the question is: how do we deal with a sequence with missing data? This reminds me of how transformers like BERT are trained.

So I’m thinking some kind of model that sees the entire series and then predicts an output, even with missing steps.
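To make the "missing steps" idea concrete, here is a minimal numpy sketch of BERT-style masking applied to temporal attention: score every month, but softmax only over the months that actually exist, so missing or fully cloudy months get exactly zero weight. The function name, shapes, and scores are my own assumptions for illustration, not anything from this thread.

```python
import numpy as np

def masked_softmax(scores, present):
    """scores: (months,) attention logits; present: (months,) bool mask.

    Missing months (present == False) receive zero attention weight;
    the remaining weights still sum to 1. (Hypothetical helper.)
    """
    scores = np.where(present, scores, -np.inf)  # missing months -> -inf logit
    scores = scores - scores.max()               # numerical stability
    w = np.exp(scores) * present                 # exp(-inf) = 0 for missing
    return w / w.sum()

scores = np.array([0.2, 1.5, -0.3, 0.9])         # 4 months for brevity, not 12
present = np.array([True, False, True, True])    # month 1 is missing/cloudy

w = masked_softmax(scores, present)
```

In a real transformer you would get the same effect by passing a padding mask (e.g. `key_padding_mask` in PyTorch's attention modules) rather than masking by hand.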

I wonder if some kind of SSL pre-training would help?


feature          = ...  # (batch_size, month, x, y, channel)
attention_weight = ...  # (batch_size, month, x, y, 1), where attention_weight.sum(1) == 1

selected_feature = (feature * attention_weight).sum(1)  # (batch_size, x, y, channel)
predicted_agbm   = decode(selected_feature)             # (batch_size, x, y, 1)
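The pseudocode above can be run end to end with random data to check the shapes. This is a numpy sketch of the same idea; the softmax attention and the stub decoder (a 1×1 channel projection standing in for `decode`) are my assumptions, not the author's actual model.

```python
import numpy as np

rng = np.random.default_rng(0)
B, M, X, Y, C = 2, 12, 4, 4, 8                    # batch, month, x, y, channel

feature = rng.normal(size=(B, M, X, Y, C))
logits  = rng.normal(size=(B, M, X, Y, 1))

# Softmax over the month axis, so attention_weight.sum(1) == 1 per pixel.
e = np.exp(logits - logits.max(axis=1, keepdims=True))
attention_weight = e / e.sum(axis=1, keepdims=True)

# Collapse the temporal dimension with the attention weights.
selected_feature = (feature * attention_weight).sum(axis=1)   # (B, X, Y, C)

# Stub decoder: a 1x1 projection from C channels to a single AGBM value.
W = rng.normal(size=(C, 1))
predicted_agbm = selected_feature @ W                         # (B, X, Y, 1)
```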


I’m fairly new to transformers and attention. But are you suggesting adding another dimension to the dataset corresponding to the month, which would be fed into the transformer to output attention weights? Or concatenating the data along a new dimension corresponding to the month?

Any videos or readings you could point me towards to wrap my head around how to implement this?


I really like your idea. Could some sort of LSTM framework for each time period be used?

I would have no clue how to set up my data/dataloader for this, but I’m keen to see what people come up with!

could some sort of LSTM framework for each time period be used?

Yes, that could work as well.
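One common way to feed such data to an LSTM (a sketch under my own assumptions, not a confirmed approach from this thread) is to fold the spatial dimensions into the batch, so each pixel becomes its own length-12 time series in the `(batch, seq, features)` layout that e.g. `torch.nn.LSTM(batch_first=True)` expects:

```python
import numpy as np

B, T, H, W, C = 2, 12, 4, 4, 8   # batch, months, height, width, channels
x = np.arange(B * T * H * W * C, dtype=float).reshape(B, T, H, W, C)

# (B, T, H, W, C) -> (B*H*W, T, C): every pixel is now an independent
# sequence that a standard LSTM can consume.
seq = x.transpose(0, 2, 3, 1, 4).reshape(B * H * W, T, C)

# ... run your sequence model over `seq` here ...

# Inverse: (B*H*W, T, C) -> (B, T, H, W, C), recovering the original layout.
back = seq.reshape(B, H, W, T, C).transpose(0, 3, 1, 2, 4)
```

The transpose/reshape pair is lossless, so predictions can be mapped back onto the image grid afterwards.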

I haven’t had a lot of time to look into the details yet, but I have written a dataloader. I was hoping to play around with it somewhere this week. I can share the dataloader if I get time to clean things up a bit.

Basically, think about it this way: for a single image you’ll have an input of shape:

(batch_dim, width, height, input_channels) (assuming channels last).

Now as you (in the best case) have a sequence of 12 images for every location (one for each month), you could create an input of:
(batch_dim, seq_len, width, height, input_channels)
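Since in practice not all 12 months will be available, one way to build that fixed-length input (an assumption on my part, not the author's actual dataloader) is to zero-pad the missing months and keep a boolean validity mask alongside:

```python
import numpy as np

SEQ_LEN, H, W, C = 12, 4, 4, 8

# Pretend the loader only found images for these months (0-indexed);
# the contents here are dummy arrays for illustration.
available = {
    0:  np.ones((H, W, C)),
    3:  np.ones((H, W, C)) * 2,
    11: np.ones((H, W, C)) * 3,
}

x = np.zeros((SEQ_LEN, H, W, C))      # zero-padded (seq_len, H, W, C) sample
mask = np.zeros(SEQ_LEN, dtype=bool)  # True where a real image exists
for month, img in available.items():
    x[month] = img
    mask[month] = True

# `mask` can later be handed to the sequence model (e.g. as a key padding
# mask) so the zero-padded months are ignored rather than treated as data.
```

Batching then stacks these per-location samples into `(batch_dim, seq_len, width, height, input_channels)` plus a `(batch_dim, seq_len)` mask.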

You can now use something like a UNet to get a prediction for each month, and then figure out some kind of sequence model to choose how to best combine the predictions of each time step.

Or, as this reply suggests: embed your sequence with some kind of (presumably) CNN, so you are left with a sequence of shape:

(batch_dim, seq_len, width_emb, height_emb, channels_emb)

You run a transformer over this and sum together the embeddings using the (normalized) attention weights from the transformer, and then use some kind of decoder to get your final output.

I was thinking about something in this direction:


Google for papers using the keywords:

spatial temporal
remote sensing

e.g. GitHub - MarcCoru/MTLCC: Multi-temporal land cover classification. Source code and evaluation of IJGI 2018 journal publication


Figure 2: Spatio-temporal Encoding. A sequence of images is processed in parallel by a shared convolutional encoder. At the lowest resolution, an attention-based temporal encoder produces a set of temporal attention masks for each pixel, which are then spatially interpolated at all resolutions. These masks are used to collapse the temporal dimension of the feature map sequences into a single map per resolution.