I am trying to wrap my head around the available data for this competition, and I have decided to do a bit of exploration. One thing that I’ve done is visualize the values in train_label.csv on a map - this helps me understand a bit more where PM25 is high/low visually.
I am now playing around with the MAIAC files, and I am not sure I completely understand their structure. There are a few things I am unable to work out:
What are the “layers” that are mentioned at the top of the metadata?
When I browse the 1x1 km data: where is this kilometer tile located? I’m unable to find this in the file’s metadata. All I could find were the horizontal and vertical tile IDs (it took me some time to read those from the MODIS website).
I’ve tried following the instructions on the problem description page, but I was unable to infer the location of the satellite imagery.
I’m not sure which layers you’re referring to. It could refer to the subdatasets - each file contains multiple datasets - or it might refer to the number of orbit overpasses, the third dimension of the dataset. Could you provide more information?
There is an attribute called StructMetadata.0. Within it are fields called UpperLeftPointMtrs and LowerRightMtrs which give you the coordinates of the upper left and lower right corners in meters on the sinusoidal grid. Using these, you can interpolate between the coordinates to get the coordinate of each individual grid cell.
Here is a snippet using pyhdf to turn the metadata into a dictionary:
from ast import literal_eval
from pyhdf.SD import SD, SDC
maiac_fp = "example.hdf"  # path to file here
hdf = SD(maiac_fp, SDC.READ)
# construct grid metadata from text blob
gridmeta = hdf.attributes()["StructMetadata.0"]
gridmeta = dict([x.split("=", 1) for x in gridmeta.split() if "=" in x])
for key, val in gridmeta.items():
    # parse numbers and tuples; leave everything else as a string
    try:
        gridmeta[key] = literal_eval(val)
    except (ValueError, SyntaxError):
        pass
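To make the interpolation step concrete, here is a rough sketch of how the two corner points could be turned into per-cell coordinates. The function names, the 1200x1200 grid size and the corner values in the usage lines are my own illustrative assumptions, not values read from your file. The sphere radius is the one used by the MODIS sinusoidal grid. I am also assuming you want cell centers rather than cell corners.

```python
import numpy as np

# Radius (in meters) of the sphere used by the MODIS sinusoidal grid.
SPHERE_RADIUS_M = 6371007.181


def cell_center_coords(upper_left, lower_right, n_rows, n_cols):
    """Interpolate between the corner points to get the (x, y) center of
    every grid cell, in meters on the sinusoidal grid."""
    x_min, y_max = upper_left
    x_max, y_min = lower_right
    cell_w = (x_max - x_min) / n_cols
    cell_h = (y_max - y_min) / n_rows
    # Offset by half a cell so each coordinate is a cell center.
    xs = x_min + (np.arange(n_cols) + 0.5) * cell_w
    ys = y_max - (np.arange(n_rows) + 0.5) * cell_h
    # Both returned arrays have shape (n_rows, n_cols).
    return np.meshgrid(xs, ys)


def sinusoidal_to_lonlat(x, y, radius=SPHERE_RADIUS_M):
    """Inverse sinusoidal projection on a sphere; returns degrees."""
    lat = y / radius
    lon = x / (radius * np.cos(lat))
    return np.degrees(lon), np.degrees(lat)


# Illustrative corner values only (hypothetical, not from the poster's file).
# In practice these would come from gridmeta["UpperLeftPointMtrs"] and
# gridmeta["LowerRightMtrs"].
upper_left = (-1111950.519667, 5559752.598333)
lower_right = (0.0, 4447802.078667)

xv, yv = cell_center_coords(upper_left, lower_right, n_rows=1200, n_cols=1200)
lon, lat = sinusoidal_to_lonlat(xv, yv)
```

With these example corners, neighbouring cell centers are roughly 926.6 m apart, so it is worth checking the actual cell size of your own grid rather than assuming exactly 1000 m.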
Wow, thanks @cszc. I had originally used gdal to access the HDF files, and then started exploring each individual function. gdal.Open.GetMetadata() did not return “StructMetadata.0” or any of its contents.
The header of the file mentions 3 additional layers (ADDITIONALLAYERS=3). Here is the output of gdalinfo:
Driver: HDF4/Hierarchical Data Format Release 4
Files: train/maiac/2018/20180201T060000_maiac_dl_0.hdf
Size is 512, 512
Coordinate System is `'
Metadata:
ADDITIONALLAYERS=3
ALGORITHMPACKAGEACCEPTANCEDATE=TBD
… (truncated) …
That link from the HDF EOS zoo you’ve shared is amazing. I now understand the file structure a bit better. Where can I learn more about the meaning of the “channels” for each data field?
In other words, this line:
data = data3D[0,:,:].astype(np.double)
Which data lives at indices 0, 1 and 2 of axis 0? For example, what is in data3D[1,:,:]?
Brilliant, thank you @cszc - the information you’ve just shared is extremely helpful.
I have one more question, if you don’t mind, just to validate my understanding: when I look at a 1x1 km dataset (which is represented as a 1200x1200 array), does each pixel cover 1x1 km?