At first, I thought this was a simple image segmentation problem. But on a closer look, not all satellite images contain useful information.
Hence you need to consider spatio-temporal input. My model uses a segmentation model (e.g. U-Net) for feature extraction, plus a transformer that fuses the features from the 12 months and predicts a single yearly AGBM. From my experiments, I have some interesting observations:
You should think of the transformer as a switch that selects what information to pass through.
All of the cases below are possible, but they perform differently:
predict(x,y) = some_function ( select best ... for input (month,c,x,y) for each ... )
- you can select the best month or best (month,c)
- over each image or over each patch(x,y) or over each patch(c,x,y)
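
To make the granularity choices above concrete, here is a rough NumPy sketch. It is not my actual model: a plain softmax over the month axis stands in for the transformer's attention, `fuse_months` and `scores` are hypothetical names, I read `c` as the feature channel, and a patch is shrunk down to a single pixel to keep it short.

```python
import numpy as np

def softmax(z, axis):
    # Numerically stable softmax along one axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def fuse_months(features, scores, granularity="pixel"):
    """Soft-select over months at a chosen granularity.

    features: (M, C, H, W) per-month feature maps (e.g. from a U-Net encoder).
    scores:   (M, C, H, W) hypothetical relevance logits; in a real model
              these would come from the attention layers.
    Returns a fused (C, H, W) feature map.
    """
    if granularity == "image":
        # One weight per month for the whole image.
        logits = scores.mean(axis=(1, 2, 3), keepdims=True)  # (M, 1, 1, 1)
    elif granularity == "pixel":
        # One weight per month at each location (x, y).
        logits = scores.mean(axis=1, keepdims=True)          # (M, 1, H, W)
    elif granularity == "channel_pixel":
        # One weight per (month, c) at each location (x, y).
        logits = scores                                      # (M, C, H, W)
    else:
        raise ValueError(f"unknown granularity: {granularity}")
    weights = softmax(logits, axis=0)        # weights sum to 1 over months
    return (weights * features).sum(axis=0)  # (C, H, W)

# Usage with random stand-in data: 12 months, 4 channels, 8x8 feature maps.
rng = np.random.default_rng(0)
feats = rng.normal(size=(12, 4, 8, 8))
logits = rng.normal(size=(12, 4, 8, 8))
fused = fuse_months(feats, logits, granularity="pixel")
```

With very peaked logits the softmax behaves like a hard switch that picks one month; with flat logits it falls back to a plain average over the 12 months.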