> Could some sort of LSTM framework for each time period be used?
Yes, that could work as well.
I haven’t had a lot of time to look into the details yet, but I have written a dataloader. I was hoping to play around with it somewhere this week. I can share the dataloader if I get time to clean things up a bit.
Basically, think about it this way: for a single image you’ll have an input of shape:
(batch_dim, width, height, input_channels)
(assuming channels last).
Now, as you (in the best case) have a sequence of 12 images for every location (one for each month), you could create an input of shape:
(batch_dim, seq_len, width, height, input_channels)
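For concreteness, a rough sketch of what building that tensor could look like (the sizes and random data here are just placeholders):

```python
import numpy as np

# Hypothetical sizes: 12 monthly images of 64x64 pixels with 3 channels each.
months = [np.random.rand(64, 64, 3).astype(np.float32) for _ in range(12)]

# Stack along a new leading time axis -> (seq_len, width, height, input_channels).
sequence = np.stack(months, axis=0)       # (12, 64, 64, 3)

# A batch of 8 locations -> (batch_dim, seq_len, width, height, input_channels).
batch = np.stack([sequence] * 8, axis=0)
print(batch.shape)                        # (8, 12, 64, 64, 3)
```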
You can now use something like a UNet to get a prediction for each month, and then apply some kind of sequence model to decide how best to combine the predictions from each time step (see the sketch below).
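A minimal PyTorch sketch of that first idea, assuming channels-first tensors inside the model; the `unet` argument and the GRU-based fusion are placeholders for whatever segmentation model and sequence model you end up using, not a specific recommendation:

```python
import torch
import torch.nn as nn

class PerFrameUNetWithGRU(nn.Module):
    """Run a UNet on every month independently, then fuse the
    per-month predictions with a GRU over the time axis."""

    def __init__(self, unet: nn.Module, n_classes: int, hidden: int = 32):
        super().__init__()
        self.unet = unet  # any (B, C, H, W) -> (B, n_classes, H, W) model
        self.gru = nn.GRU(n_classes, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, x):
        # x: (batch, seq_len, channels, height, width)
        b, t, c, h, w = x.shape
        # Fold the time axis into the batch so the UNet sees ordinary images.
        frame_logits = self.unet(x.reshape(b * t, c, h, w))  # (B*T, K, H, W)
        k = frame_logits.shape[1]
        # Treat every pixel as its own length-T sequence of class logits.
        seq = frame_logits.reshape(b, t, k, h, w).permute(0, 3, 4, 1, 2)
        seq = seq.reshape(b * h * w, t, k)                   # (B*H*W, T, K)
        out, _ = self.gru(seq)                               # (B*H*W, T, hidden)
        # Use the last hidden state to combine the monthly predictions.
        final = self.head(out[:, -1])                        # (B*H*W, K)
        return final.reshape(b, h, w, k).permute(0, 3, 1, 2) # (B, K, H, W)
```

Running the GRU over every pixel separately like this is memory-hungry; a ConvLSTM or simple learned weighting over the time axis would be cheaper, but the shape bookkeeping is the same.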
Or, as this reply suggests: embed your sequence with some kind of (presumably) CNN, so you are left with a sequence of shape:
(batch_dim, seq_len, width_emb, height_emb, channels_emb)
You run a transformer over this, sum together the per-timestep embeddings using the (normalized) attention weights from the transformer, and then use some kind of decoder to get your final output (sketched below).
I was thinking about something in this direction: https://openaccess.thecvf.com/content/CVPR2021/papers/Wang_End-to-End_Video_Instance_Segmentation_With_Transformers_CVPR_2021_paper.pdf
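A minimal sketch of that transformer direction. For simplicity the CNN here collapses the spatial dimensions into a single embedding per month, and instead of reusing the transformer’s internal attention maps it learns a pooling query to produce the normalized weights for the sum; all names and sizes are placeholders:

```python
import torch
import torch.nn as nn

class TemporalTransformerFusion(nn.Module):
    """Embed each month with a shared CNN, run a transformer over the
    resulting token sequence, and attention-pool it into one embedding."""

    def __init__(self, cnn: nn.Module, emb_dim: int, n_heads: int = 4, n_layers: int = 2):
        super().__init__()
        self.cnn = cnn  # any (B, C, H, W) -> (B, emb_dim) encoder; a placeholder
        layer = nn.TransformerEncoderLayer(emb_dim, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        # Learned query that scores each timestep for the weighted sum.
        self.pool_query = nn.Parameter(torch.randn(emb_dim))

    def forward(self, x):
        # x: (batch, seq_len, channels, height, width)
        b, t, c, h, w = x.shape
        tokens = self.cnn(x.reshape(b * t, c, h, w)).reshape(b, t, -1)  # (B, T, D)
        tokens = self.encoder(tokens)                                   # (B, T, D)
        # Normalized attention weights over the time axis...
        scores = tokens @ self.pool_query                               # (B, T)
        weights = scores.softmax(dim=1)                                 # sums to 1 over T
        # ...used to sum the per-month embeddings into a single vector.
        fused = (weights.unsqueeze(-1) * tokens).sum(dim=1)             # (B, D)
        return fused  # feed this to whatever decoder produces the final output

# Example usage with a throwaway CNN encoder (also just a placeholder):
cnn = nn.Sequential(
    nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 64),
)
model = TemporalTransformerFusion(cnn, emb_dim=64)
x = torch.randn(2, 12, 3, 64, 64)  # (batch, months, channels, H, W)
print(model(x).shape)              # torch.Size([2, 64])
```

If you want to keep the spatial layout from the (batch_dim, seq_len, width_emb, height_emb, channels_emb) shape above, you’d flatten the spatial positions into extra tokens instead of pooling them away, which is closer to what the VisTR paper does.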