Python and NetCDF4 files

I have yet to find a satisfactory solution for making programmatic use of netCDF4 files (.nc files). I don’t think you’re supposed to be extracting data from them the way that I am, but I couldn’t find a method of going straight from netCDF4 to GeoDataFrame, and I’m beginning to suspect that my method isn’t as solid as I was hoping it would be (I’m completely ignoring coordinate reference systems and projection of coordinates, for example – always a safe bet when you’re drawing conclusions based on poor model performance, as we all are very much aware of, I’m sure!)

Would anybody care to save me the frustrating hours of Googling and point me to the proper libraries for an end-to-end .nc file → GeoDataFrame method in Python?

I would recommend using xarray to open netcdfs as a dataset and do any geo-referencing (e.g., selecting data at specific lat/lons) and then converting to a data frame (ds.to_dataframe()) only when you’re done with your geo-referencing. xarray has great selection features for pulling data from specific dates/locations, so I would choose that over a geopandas data frame entirely. Hope this helps!

Hi @mmiron

If it still relevant. You can avoid getting data as dataframe but get numpy arrays. One of the fastest ways that I know is to use the rasterio library, which allows you to get the values for the desired polygon (geometry) in the form of a numpy array

from pathlib import Path

import numpy as np
import rasterio
from rasterio.mask import mask
from shapely.geometry import mapping

def extract_data_from_netcdf_file_as_mean_value(geometry, netcdf_file: Path):
    with rasterio.open(netcdf_file) as src:
        # Do not change into crop=True because the library somehow distort statistics
        out_image, _ = mask(src, geometry, crop=False, nodata=32767.0)
        # Lets assume that NetCDF file has only one matrix (it is not always the case!)
        clipped_matrix = out_image[0, :, :]
        # Here will be values from matrix in the poly
        filtered_values = np.ravel(clipped_matrix)[np.ravel(clipped_matrix) < 32766]

    return np.mean(filtered_values)

Geometry here will something from the file:

spatial_objects = geopandas.read_file(Path('geospatial.gpkg'))
site_geom = spatial_objects[spatial_objects['site_id'] == 'hungry_horse_reservoir_inflow']
site_geometry = site_geom.geometry.values[0]
geometry = [mapping(site_geometry)]

As another option, you can also look at the rioxarray package, which lets you use rasterio with xarray instead of numpy arrays. xarray datasets are generally a more direct representation of the data contained in netcdf files, compared to numpy arrays.

Example from documentation of using rioxarray to clip to a geopandas geometry: Example - Clip — rioxarray 0.15.0 documentation

Hi @dreamlone, @shartke, and @jayqi,

The support is greatly appreciated, thank you. Without going into too much detail, I managed to solve my coordinate projection issues; I apparently was using satellite data for Namibia (Africa is clearly known for its heavy snowfall). I’m using xarray, and am now properly producing NetCDF4 files from the Microsoft Planetary Computer with the appropriate geographical regions – unless I’ve accidentally discovered a foreign country with snowpack that’s highly correlated to the Western US.

There’s still a bug in my implementation effecting data relevant for the current water year though, and there are only so many hours in a day. For the time being I just retrained my models without that data included. I just hope I score well enough to need to write a final report about it :crossed_fingers: