Care to share general methodologies?

Hi Everyone,

I’m curious, would people mind sharing what approaches they took, what worked and what didn’t? I’m writing a chapter of my dissertation on the competition so I’m very curious to know what ended up being successful, especially for teams in the top 100 or so.

Roughly speaking, we built a shallow 4-layer CNN on MODIS imagery using the Azure storage, built a 6-layer CNN on the Sentinel 2 “VV” band using Google Earth Engine, and aggregated their predictions with a linear model along with a couple of day-of-year dummies. We got very little lift from the Sentinel CNN, and the majority of our performance was from MODIS (although we might have had data quality issues with Sentinel from GEE). We tried more complicated things like random forest aggregators, and were able to get a performance of roughly 10.0 just using fuzzed coordinates in the random forest.

We had a hard time getting any extra performance from using weather variables or the digital elevation model, which was a surprise to me, but perhaps we could have done that better. We did a significant amount of hand-tuning that probably explains most of our drop from 10.0 to ~8.0, with the Sentinel CNN getting us the rest of the way there. Lastly, we did not control for cloud cover or aggregate over time to improve image quality in MODIS, as we ran out of time.

Generally we were frustrated by how much time went into preprocessing, and how building the pipeline absorbed much of the effort we would have put into optimizing additional datasets, but I suppose that’s just part of being an applied ML practitioner.

I’ve made my team’s repository public so anybody can look if they would like at this link: GitHub - M-Harrington/SnowComp: Competition for Bureau of Reclamation's SWE competition. Please pardon the mess there, we started the competition a month late so there was a fair bit of rush to finish!

Thanks!
Matt

PS: if you’d like to keep secret some of your methods, that’s totally understandable, but anything you can share would be super helpful!

Interesting that you’re writing a chapter of your dissertation on this! (will you be able to update this thread with a link to read it when you’re done?)

Sure, I’ll share (currently 5th with RMSE = 4.25):

Data Sources:

I tried to use the satellite images and weather data, but ran out of time.

In the end I only used the provided SNOTEL and CDEC data, as well as the elevation data from Copernicus DEM.

General Idea:

I used an ensemble of several neural networks, where each neural network was a variation on the same idea (outlined below) with different data params like network size and number of SNOTEL/CDEC sites, and was trained with different hyperparams.

Data Aggregation:

To get the SNOTEL and CDEC data (along with elevation) in a form that would be useful: for every point in the test set, I sorted the SNOTEL and CDEC sites by elevation and distance to the point in the test set using the formula:

0.001 * abs(x["ele"] - loc["ele"]) +
1000 * abs(x["lat"] - loc["lat"]) +
1000 * abs(x["lng"] - loc["lng"])

which was an attempt to have both elevation and lat/lng influence which sites were used for prediction for each test point.
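
A minimal sketch of the ranking described above, using the same weights as the formula. The station records, the target point, and the function names are hypothetical and only there for illustration; elevation is assumed to be in metres.

```python
def site_score(site, loc):
    """Weighted absolute differences from the post: elevation plus lat/lng."""
    return (
        0.001 * abs(site["ele"] - loc["ele"])
        + 1000 * abs(site["lat"] - loc["lat"])
        + 1000 * abs(site["lng"] - loc["lng"])
    )


def nearest_sites(sites, loc, k):
    """Return the k sites with the smallest score, closest first."""
    return sorted(sites, key=lambda s: site_score(s, loc))[:k]


# Hypothetical stations and target point, for illustration only.
sites = [
    {"id": "A", "ele": 2500.0, "lat": 39.10, "lng": -120.20},
    {"id": "B", "ele": 1800.0, "lat": 39.00, "lng": -120.00},
    {"id": "C", "ele": 2100.0, "lat": 40.50, "lng": -121.50},
]
loc = {"ele": 2000.0, "lat": 39.01, "lng": -120.01}

ranked = nearest_sites(sites, loc, k=2)
print([s["id"] for s in ranked])
```

With these weights a one-degree offset in latitude or longitude contributes 1000 to the score, while a one-metre elevation difference contributes 0.001, so horizontal distance dominates and elevation mostly breaks ties between nearby sites.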

Data Used:

For each site I collected the current SWE total as well as the elevation in the center and at each of the 4 corners of the site (the idea there is that a “western” facing region would have different properties from an “eastern” facing region on a mountain, and a neural network should be able to figure that out using the four corner elevations).

I also calculated the lat/lng distance from the data site to the test site.

Embeddings:

I also added embeddings for the region and the “week number” (week of the year from 0-52)

Models:

All that went into a pytorch model - the average was a 5 layer model that took in the data from the 10 - 50 nearest stations, and outputted just a single value - the SWE prediction.

Aggregation:

I used 5 such neural networks, and then took the: “mean of the mean and the median” as the final SWE prediction (in practice, the mean or the median alone probably would have been fine, but CV showed that the average of the two was slightly more accurate).
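
As a small illustration of the “mean of the mean and the median” step, here is a sketch with made-up predictions from five models for a single test point. The numbers and the function name are hypothetical.

```python
from statistics import mean, median


def mean_of_mean_and_median(preds):
    """Average the ensemble mean and the ensemble median."""
    return 0.5 * (mean(preds) + median(preds))


# Hypothetical SWE predictions from five networks for one grid cell.
preds = [10.0, 12.0, 11.0, 30.0, 12.0]
final = mean_of_mean_and_median(preds)
print(final)
```

In this example one model predicts far higher than the rest: the mean is 15.0, the median is 12.0, and the blended value sits between them at 13.5, so a single outlier pulls the result less than it would with a plain mean.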

What I’d do differently:

I really wanted to get the climate, weather, or satellite image data included, but just ran out of time.

Also, I used neural networks, but something like a gradient boosted tree would probably have been easier, and probably more accurate :slight_smile:

I hope that helps some! Let me know if you have any questions that would be helpful for your dissertation. Good luck!

Wow Chris, thanks for sharing, it’s fascinating how different your approach was from ours, and with great success! We thought that using the fuzzed coordinates in a random forest model would be sufficient to capture some spatial dependencies, but modeling them more explicitly via nearest neighbors seemed to benefit you a lot. Regarding the NN ensemble, did you change the data that each network saw at all? Otherwise, like you, I’m pretty curious how a gradient boosted machine would have performed!

I’ll definitely share what I can regarding the chapter, hopefully at the end of this thread, although it might be another month before I’m done! Most of the information is regarding the difficulties we faced, lessons learned regarding pipelines, and hyperparameter experiments.

I’m really glad you shared though because my guess of what other people were doing was definitely off, at least in your case, so thank you I really appreciate it!

Between each NN in the ensemble I just changed how many of the nearest neighbors each saw (from 10 - 50). I also experimented with more (up to all of the given data), but there was limited benefit beyond about 50.

I took the approach of trying to incorporate as many useful features as possible in a gradient boosting model. Because of this, my approach was more heavily reliant on feature engineering. I used the Modis NDSI data, several values from the HRRR climate data, the DEM, and the ground measures.

Since gradient boosting naturally works with tabular data I used the mean and in some cases the variance of pixel values from the Modis data and the DEM over an entire grid cell. If you eliminate Modis data for a grid cell on days with high cloud cover (recommended) the Modis data becomes sparse, so the Modis features I created used a rolling average of the mean pixel values, one 5 day rolling average and one 15 day. Modis derived features were most important according to my feature importance analysis.
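
A rough sketch of the rolling-average idea, assuming cloudy days are marked as missing and that the window is trailing (it only looks backwards, which is what a real-time submission would need). The post does not say how the windows were aligned, so that part and all names and values below are assumptions.

```python
def rolling_mean_ignore_missing(values, window):
    """Trailing mean over the last `window` days, skipping None (cloudy) days."""
    out = []
    for i in range(len(values)):
        recent = [
            v for v in values[max(0, i - window + 1): i + 1] if v is not None
        ]
        # If every day in the window was cloudy, the feature stays missing.
        out.append(sum(recent) / len(recent) if recent else None)
    return out


# Hypothetical daily mean NDSI for one grid cell; None marks a cloudy day.
ndsi = [40.0, None, None, 50.0, None, 60.0, None]

roll5 = rolling_mean_ignore_missing(ndsi, 5)
roll15 = rolling_mean_ignore_missing(ndsi, 15)
print(roll5)
print(roll15)
```

The shorter window reacts faster to new clear-sky observations, while the longer one is less likely to be empty after a long cloudy stretch.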

I found the DEM very helpful. Just using the mean and variance of elevation for a grid cell was a useful feature. I also created a feature that I called southern gradient that took the difference between north-south pixels and represents southern exposure for a grid cell, with the idea that snow on South facing slopes melts faster in the Northern hemisphere.
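
One way such a “southern gradient” could be computed is sketched below. The post does not give the exact definition, so the row ordering (row 0 is the northernmost), the sign convention, and the tiny example grids are all assumptions.

```python
def southern_gradient(dem):
    """Mean elevation drop from each pixel to the pixel directly south of it.

    Assumes `dem` is a list of rows with row 0 the northernmost. A positive
    value means the terrain descends toward the south (a south-facing slope).
    """
    diffs = [
        dem[r][c] - dem[r + 1][c]
        for r in range(len(dem) - 1)
        for c in range(len(dem[0]))
    ]
    return sum(diffs) / len(diffs)


# Hypothetical 3x2 elevation grids in metres.
south_facing = [
    [2300.0, 2310.0],
    [2200.0, 2210.0],
    [2100.0, 2110.0],
]
north_facing = south_facing[::-1]  # same terrain flipped north to south

print(southern_gradient(south_facing))
print(southern_gradient(north_facing))
```

Collapsing the grid to one signed number keeps the feature low dimensional, which suits a tabular gradient boosting model.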

Geo-spatial and time-based features were important. I created a feature I called “snow season day” that just counted the days from the beginning of the snow season, November 1, until the last day, June 30. I also just fed in the raw lat and lon as features. I tried fuzzing them a little, and it may have helped with generalization, but only minimally in my experiments.
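
A “snow season day” counter can be sketched as follows. Whether the count started at 0 or 1 on November 1 is not stated in the post, so starting at 0 is an assumption, as is the function name.

```python
from datetime import date


def snow_season_day(d):
    """Days elapsed since the most recent November 1 (0 on November 1)."""
    start_year = d.year if d.month >= 11 else d.year - 1
    return (d - date(start_year, 11, 1)).days


print(snow_season_day(date(2021, 11, 1)))
print(snow_season_day(date(2022, 1, 1)))
print(snow_season_day(date(2022, 6, 30)))
```

Unlike day of year, this counter does not reset at the calendar year boundary in the middle of the season, so early January follows late December without a jump.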

The way I incorporated the ground measures was to use the relative SWE, compared to historical average for a ground measure station. Then for each grid cell I took the 15 nearest neighbors relative SWE. That feature reduced the RMSE a bit more.

The usefulness of the HRRR climate data was a little more perplexing to me. I used a three day rolling average for different values (temperature, WEASD - snow water accumulation, snow depth, pressure, precipitation, etc). In the absence of some of the other features the HRRR data provided value, but with all the other features included the model remained flat (RMSE didn’t improve). I included it for robustness: there was a period last month where the Aqua Modis dataset was unavailable for over half a month.

I used an ensemble of three different gradient boosting implementations: LightGBM, XGBoost, and Catboost. LightGBM performed the best on its own, and it was the fastest to train. You always hear about XGBoost being good for data science competitions, but I came away very impressed with LightGBM.

I’m currently in 6th, around 4.5 RMSE. I expect that will fall a bit as we finish the melt season. On the test data I was getting RMSE of ~3.2. I may have overfit a bit but I’m curious to know if I over predicted or under predicted on average. My guess is overpredict. I’m also curious to know my splits between the sierras, the central rockies, and other. On the test set I had better performance on other, followed by central rockies, and then sierras.

It was a fun and interesting competition. I actually like that there was a data engineering component, unlike most Kaggle competitions. I think that gave me a chance to compete because I feel I am pretty strong with data engineering. Hope that helps, feel free to ask any follow up questions.

Thanks for sharing Oshbocker! It’s fascinating how far you got with turning the images and elevation data into tabular data. I think that’s a strong testament to the power of feature engineering.

Also it sounds like a good strategy, based on your and Chris’s results, was to reduce the DEM to north-facing or south-facing information, or otherwise keep it very low dimensional. I didn’t mention it above, but we tried adding the DEM information in a pixel grid as an extra band to the CNN, and test accuracy tanked. I suspect that when too much information was given to the network, the model was able to memorize the dataset, and so generalization error went much higher.

Regarding the HRRR data, your difficulty getting non-redundant information is perhaps not too surprising. These climate models are well known in the literature to have large errors in places where there are a) too few ground stations and b) high elevations, both of which are definitely the case in our situation. I am pretty curious to see if anyone managed to use an LSTM or temporal CNN on the weather data with any success. Had we had more time I was thinking about doing something like that, because I have a hard time believing there is very little information in that dataset, even if there are substantial errors.

Lastly, it’s interesting to point out that both you and Chris took approaches using nearest neighbor models, with Chris taking a more minimalistic approach. Because his performance was very close to yours, I wonder if there’s something to be said here about the strength of his ensemble method across various numbers of nearest neighbors. Either that, or perhaps having fewer inputs made it less likely for him to overfit.

Anyways thanks again for sharing, I really appreciate the extra input on what worked and what didn’t! It helps me get a much better idea of what sort of signal there was in the dataset, and how I maybe could have spotted it.

Because Chris’s model and my model have such similar performance so far and we approached the problem very differently it may make for an interesting study. One thing that might be interesting to you is how the leaderboard scores have changed week-to-week. I’ve been keeping track using the internet archive: Wayback Machine

There has been some interesting movement on the board in the past few weeks, which suggests that some of the models that performed really well during the peak snow season are losing a bit of ground to other models during the melt season. Some of that might be regression to the mean, but maybe some of the models just perform better in peak snow season and some perform better in melt season?

I used regions, station types, DEM, GlobCover, soil map, and month-averaged MODIS. I also took into account the last 5 SNOTEL/CDEC measurements. I found that the effect of the HRRR data was not stable, so I did not use HRRR data in the final solution. From the real-time data I use only the SNOTEL/CDEC data, so I can submit my solution a few minutes after the SNOTEL/CDEC data become available.

I think that ASO and SNOTEL/CDEC have different statistical properties, so I use an additional predictor, “regular measurements”, which is 1 if and only if the training dataset contains more than 10 measurements at that point, and 0 otherwise.

My solution is optimal interpolation based on a neural network Gaussian process.
In this approach the parameters of the Gaussian process are calculated by the neural network. The correlation function is homogeneous, (1+r/R)exp(-r/R), with R=44…94km, in a multi-dimensional space (5-7d = 2 real dimensions + 3-5 virtual dimensions). The mapping onto this virtual space is a 2-layer perceptron with ReLU activations that takes into account all predictors.
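
To make the shape of that correlation function concrete, here is a small sketch that evaluates (1+r/R)exp(-r/R) for a scalar distance r. In the post r is a distance in the 5-7 dimensional space, including the learned virtual dimensions; this example does not model that mapping, and the function name and the choice of R = 44 km are only illustrative.

```python
import math


def correlation(r_km, R_km):
    """(1 + r/R) * exp(-r/R): equals 1 at r = 0 and decays with distance."""
    x = r_km / R_km
    return (1.0 + x) * math.exp(-x)


# Correlation at a few distances for the smallest length scale quoted above.
for r in (0.0, 44.0, 88.0, 176.0):
    print(r, round(correlation(r, 44.0), 4))
```

This has the same functional form as a Matérn kernel with smoothness 3/2, up to a rescaling of the length scale R.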

I used k-fold cross-validation, where one fold is a snow season. For training, the loss function (x-y)^2/(x+y+1) was used. I blend the folds with weights inversely proportional to the MSE of leave-one-out validation at the SNOTEL/CDEC points on the current day. This blending works very well for the snow melt period.
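
A minimal sketch of the two ingredients mentioned above: the loss (x-y)^2/(x+y+1) and blending with weights inversely proportional to each fold's MSE. The fold predictions and MSE values are made up, the function names are hypothetical, and x and y are assumed to be non-negative SWE values so the denominator stays positive.

```python
def relative_squared_loss(x, y):
    """Loss from the post: (x - y)^2 / (x + y + 1)."""
    return (x - y) ** 2 / (x + y + 1.0)


def inverse_mse_weights(mses):
    """Normalised weights proportional to 1 / MSE."""
    inv = [1.0 / m for m in mses]
    total = sum(inv)
    return [v / total for v in inv]


def blend_folds(preds, mses):
    """Weighted average of per-fold predictions for one point."""
    weights = inverse_mse_weights(mses)
    return sum(w * p for w, p in zip(weights, preds))


# The same absolute error of 2 costs less when SWE is large.
print(relative_squared_loss(1.0, 3.0))
print(relative_squared_loss(101.0, 103.0))

# Hypothetical predictions from two folds and their validation MSEs today.
blended = blend_folds([10.0, 20.0], [1.0, 4.0])
print(blended)
```

Dividing by x+y+1 down-weights errors at deep snowpack relative to a plain squared error, and the +1 keeps the loss finite when both values are zero.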

After submitting the final solution I found that assuming the values follow a Pareto-II distribution leads to a lower MSE than assuming a Gaussian distribution.

Thank you for sharing FBykov! It’s really interesting to hear that again, the interpolation method had a lot more value than I would have guessed.

Regarding your GP network, am I right in assuming you used something like what’s described in: [1711.00165] Deep Neural Networks as Gaussian Processes or [1912.02803] Neural Tangents: Fast and Easy Infinite Neural Networks in Python and their corresponding implementations? I have heard of this work previously, but I didn’t realize it was useful in implementation! Am I correct in understanding you when you say that your dimensionality was 2 +3-5 being lat, lon of last 5 SNOTEL/CDEC + MODIS, soil, GlobCover, station type, region? This sounds like a fascinating approach, I’m glad you shared because it helps put the newer technology on my radar.

I also appreciate your unique loss function and validation approaches. These were aspects we spent a long time trying to understand what the best method might be, so it’s very useful to hear how you avoided overfitting. Thank you a lot for taking the time to write this up!

My model is custom and close to Deep Compositional Spatial Models.
I think that Neural Tangents is a good package for constructing NNGP models.

Correct, the interpolation is done in the space of lat/lon + several virtual dimensions.

@FBykov the Modis data that I am accessing has a 2 day lag minimum, and yet you are publishing your submission on the same day. Since you are averaging the Modis data over a month are you content with having slightly stale data? Do you know if your model would improve with using more current Modis data?

I used a similar approach to oshbocker, with less success. I fed a Gradient Boosting regression model with HRRR SWE (3km), 1km and 3km elevation, easting/northing and MODIS. To reduce the data gaps due to cloud cover, I made a temporal composite of MODIS NDSI data. The compositing algorithm takes the most recent non-cloud observation over the last 5 days. My model can be understood as a downscaling of HRRR SWE from 3 km to 1 km. The elevation is a key ingredient, because it drives precipitation and temperature lapse rates… Then, I trained the model separately by region (“other”, “sierras”, “central rockies”). My idea was to make a simple regression model based on a limited set of predictors, so it is easy to operate in near real time and less sensitive to local effects in comparison with a model that would be calibrated using in situ measurements. However, my results are not so great … (RMSE ~ 8 and I’m ranked ~ 30), which has upset me a bit because I’m supposed to be a snow scientist :slight_smile: Yet the big question mark in this competition is the validation data. How was the “truth” SWE generated? If it was done by interpolating ground measurements then my method cannot work well. It seems that the SWE values came from heterogeneous datasets, which makes it difficult to draw a robust interpretation of the model parameters. Also I wonder why the Bureau needs a model to predict SWE since they provided us SWE in near real time over a large domain… The airborne snow observations that could be used as validation data do not cover such a large domain.
I was thinking of writing about this competition too when it’s over, so I would be interested to continue this discussion.
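
The MODIS compositing rule described in the post above (take the most recent non-cloud observation over the last 5 days) could look roughly like this for a single pixel. Treating “the last 5 days” as including the current day is an assumption, and the function name and sample values are hypothetical.

```python
def latest_clear_composite(series, window=5):
    """For each day, take the most recent non-cloud value in the last `window` days.

    `series` holds one value per day; None marks a cloudy observation.
    The result is None when every day in the window was cloudy.
    """
    out = []
    for i in range(len(series)):
        value = None
        # Walk backwards from today over at most `window` days.
        for j in range(i, max(-1, i - window), -1):
            if series[j] is not None:
                value = series[j]
                break
        out.append(value)
    return out


# Hypothetical daily NDSI for one pixel; None marks cloud cover.
ndsi = [30.0, None, None, None, None, None, 70.0, None]
composite = latest_clear_composite(ndsi)
print(composite)
```

Compared with a rolling mean, this keeps the freshest clear-sky value unchanged instead of smoothing it with older observations.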

Hi sgascoin, what did your MODIS data look like? Was it somehow aggregated over the four pixels or so that made up the 1km measured plots? We used a 21x21x2 grid (one band Aqua, one Terra) to build our CNN on the 500m resolution, so actually quite a bit larger than the sites. I’m not sure if that helped us or not, but it might have improved our accuracy on cloudy days given that we did not composite, even if it ended up being a lot of extra data. Oh, and we tried using sub-models trained on just single regions with decent success when we were using the random forest, although our final model didn’t use that strategy.

Also I can clear up one of your confusions, the dataset was a mix of ground based SNOTEL and CDEC remote or physically measured sites and airborne estimations with very high accuracy with LIDAR and spectrometry. I couldn’t figure out exactly where these came from, but it looks like FBykov figured out it was the Airborne Snow Observations (ASO).

And I feel your pain as someone who specializes in remote sensing and hydrology. In the end I felt that my pipeline creation skills were my biggest bottleneck to actually getting to the kinds of things I thought I would be able to uniquely bring to the table. In actuality, it was a combination of that and having too much faith in remote sensing and not trying an interpolation method!

Otherwise do you have any guess why your RMSE is so high in the real-time portion? Were your submissions weird in some way?

Lastly, our model was a huge pain to run in the real-time section because of downloading images from Google Earth Engine, which is why we stopped after the first four submissions. It took 30 mins-1 hour to run through our whole pipeline, with a number of user inputs, so not the best.

Also @sgascoin, @oshbocker, @FBykov, @chris62, I’m currently writing a brief section in my dissertation chapter about each of your methodologies and my takeaways from the differences in approaches. Please message me here, DM me, or email me at m.harrington (at) columbia (dot) edu if you’d like me to include your real name to give credit, or if you’d like me not to directly reference you in some way.

Also I’ll be able to share an early draft of the chapter next week, I’m still debating posting it here directly, but at the very least I’m happy to share it with people directly.

The experiments show that the latest MODIS is not significant for my model. I think this is because in my model the predictors are used to describe how the snow properties differ between points, not to calculate the SWE directly.
The latest MODIS may be helpful for postprocessing of the NNGP output.

@themrharrington your “email me” link just goes to columbia .edu (maybe that was blocked by the forum?). If you meant to include your email address @ there, what is the first part? (I couldn’t figure out how to DM you here :laughing:)

Good luck on your dissertation!

m.harrington (at) columbia (dot) edu

Sorry about that!