
Link train_labels to ground_measures_train_features

How can I link the features in ground_measures_train_features to the train_labels (cell_id)?


@neurocomputing The file ground_measures_train_features.csv contains SNOTEL and CDEC stations located in grid cells entirely distinct from those you’ll find in train_labels.csv. You can use the latitude and longitude fields to overlay and compare the two.

Thanks a lot, this helps!

Hi, could you explain in more detail how to match the train data to the train labels?

Hi everyone, I have been trying to figure out how to do the matching, but unfortunately I haven’t been successful. @tglazer and anyone else who managed to get it right, could you explain or give more hints about how you did it? I’m sorry to ask, but I’m a bit stuck at this point. Thanks in advance for your help.

I think the target labels and train features are linked via latitude and longitude. The CDEC and SNOTEL stations are included in ground_measures_train_features. For each CDEC or SNOTEL station, the latitude and longitude can be looked up in ground_measures_metadata. Now you have a latitude and longitude for each station in the training features.
For the train labels, latitude and longitude can be determined using the cell_id. In grid_cells.geojson, latitude and longitude are included for each cell_id.
If I have correctly understood the solution from tglazer, I can now use the distance, calculated from the latitude and longitude of the train features and train labels, to make the assignment.
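In case it helps, here is a minimal sketch of that distance step. It is only an illustration under assumptions: the coordinates are made-up toy values, `haversine_km` and `nearest_station` are hypothetical helper names, and in practice the station coordinates would come from ground_measures_metadata and the cell coordinates from grid_cells.geojson. Great-circle (haversine) distance is one reasonable choice of distance, not necessarily what anyone else in this thread used.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km; inputs are in degrees and may broadcast."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


def nearest_station(cell_lat, cell_lon, station_lat, station_lon):
    """For each cell, return the index of the closest station and its distance."""
    # Distance matrix of shape (n_cells, n_stations) via broadcasting.
    d = haversine_km(
        cell_lat[:, None], cell_lon[:, None],
        station_lat[None, :], station_lon[None, :],
    )
    idx = d.argmin(axis=1)
    return idx, d[np.arange(len(idx)), idx]


# Toy data (assumed values, not taken from the competition files).
cell_lat = np.array([39.0, 40.0])
cell_lon = np.array([-120.0, -106.0])
station_lat = np.array([39.1, 40.2, 45.0])
station_lon = np.array([-120.1, -105.9, -110.0])

idx, dist_km = nearest_station(cell_lat, cell_lon, station_lat, station_lon)
for cell, (i, d) in enumerate(zip(idx, dist_km)):
    print(f"cell {cell} -> station {i} ({d:.1f} km)")
```

The full distance matrix is fine for a few thousand cells and a few hundred stations; for much larger inputs a spatial index would be a better fit.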


Thank you so much for these details.

So you get a polygon (if I’m understanding this correctly) from the geojson file, and each polygon is supposed to have a SNOTEL/CDEC station in it? Please elaborate if you can. I appreciate your time.

I’ve just been through this process and found that there is not a station within each polygon (cell_id). This makes sense (right?), as each polygon represents a relatively small area, so in the end I decided to find the closest station to each polygon using the lats and lons. This works out, but you end up needing only around 300 of the stations, as some stations are close to multiple cells and some aren’t the closest to any cell, at least in the training set. I’m currently debugging code, so I could be wrong on those numbers. This project could have done with a benchmark blog notebook. It’s great working stuff out for yourself, but it’s been mega painful getting this project off the ground, and that initial leg up can get you going in the right direction faster. Good luck!
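A rough sketch of that process, for anyone who wants a starting point. Everything here is an assumption for illustration: the inline GeoJSON is a tiny made-up stand-in for grid_cells.geojson (assumed to hold one Polygon per feature with a `cell_id` property), the station IDs and coordinates are invented, and the helper names are hypothetical. The centroid is taken as the mean of the polygon vertices and the distance is a flat-Earth approximation, which should be adequate for ranking nearby stations but is not an exact geodesic.

```python
import json
import math
from collections import Counter


def square(lon, lat, size=0.01):
    """Build a closed square ring of [lon, lat] pairs (toy grid cell)."""
    return [[lon, lat], [lon + size, lat], [lon + size, lat + size],
            [lon, lat + size], [lon, lat]]


# Made-up stand-in for grid_cells.geojson.
GEOJSON_TEXT = json.dumps({
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "properties": {"cell_id": cid},
         "geometry": {"type": "Polygon", "coordinates": [square(lon, lat)]}}
        for cid, lon, lat in [("cell_a", -120.00, 39.00),
                              ("cell_b", -119.98, 39.00),
                              ("cell_c", -106.00, 40.00)]
    ],
})

# Made-up stations: station_id -> (latitude, longitude).
stations = {"S1": (39.02, -119.95), "S2": (40.05, -106.02), "S3": (45.0, -110.0)}


def cell_centroids(geojson):
    """Map cell_id -> (lat, lon), averaging the outer ring's vertices."""
    out = {}
    for feature in geojson["features"]:
        ring = feature["geometry"]["coordinates"][0]
        pts = ring[:-1] if ring[0] == ring[-1] else ring  # drop closing vertex
        lon = sum(p[0] for p in pts) / len(pts)
        lat = sum(p[1] for p in pts) / len(pts)
        out[feature["properties"]["cell_id"]] = (lat, lon)
    return out


def approx_km(lat1, lon1, lat2, lon2):
    """Equirectangular approximation of distance in km."""
    x = math.radians(lon2 - lon1) * math.cos(math.radians((lat1 + lat2) / 2))
    y = math.radians(lat2 - lat1)
    return 6371.0 * math.hypot(x, y)


centroids = cell_centroids(json.loads(GEOJSON_TEXT))
assignment = {
    cid: min(stations, key=lambda sid: approx_km(lat, lon, *stations[sid]))
    for cid, (lat, lon) in centroids.items()
}
usage = Counter(assignment.values())           # cells served by each station
unused = sorted(set(stations) - set(usage))    # stations closest to no cell

print(assignment)
print(usage)
print(unused)
```

Tallying the assignment like this is a quick way to check how many stations actually end up being used and which ones never are.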

Hi @JM1000 and @AvivTahar, please see @tglazer’s post above: Link train_labels to ground_measures_train_features - #2 by tglazer

The ground measures features contain SWE measures for grid cells entirely distinct from those in the train and test labels. Your goal in this competition is to predict SWE for the grid cells in test_labels, and one of the inputs you can use is these ground measures from “nearby” stations (you can use the lat/lons to find the nearest ground measure). The main features for this competition, however, are the remote sensing and climate data sources described in the problem description: Competition: Snowcast Showdown: Evaluation Stage