Many NAs on train_labels.csv

igorkf · December 8, 2021, 1:27am

Hi. The file train_labels.csv has 267 columns (plus one additional column for cell_id), but ~98% of the values in those columns are NA. Is this right?

Can we consider NA as 0? Plotting the distribution we see many zeros.

tglazer · December 8, 2021, 5:56pm

@igorkf train_labels.csv does not contain labels for every grid cell every week. Missing values are not equivalent to 0.0. While you should make predictions for all grid cells and dates in the submission format, you will only be evaluated on non-null values.
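To make the "only evaluated on non-null values" point concrete, here is a rough sketch of scoring predictions against a wide label table while skipping the missing entries. RMSE is an assumption made for illustration, since this thread does not name the competition metric, and `masked_rmse` and the toy frames are hypothetical names, not part of the competition code.

```python
import numpy as np
import pandas as pd


def masked_rmse(y_true: pd.DataFrame, y_pred: pd.DataFrame) -> float:
    """RMSE over the entries where a label exists (metric choice is an assumption)."""
    # Align predictions to the label table so rows and columns line up.
    y_pred = y_pred.reindex(index=y_true.index, columns=y_true.columns)
    # True where a label is present; NaN labels are excluded from the score.
    mask = y_true.notna().to_numpy()
    diff = (y_true.to_numpy(dtype=float) - y_pred.to_numpy(dtype=float))[mask]
    return float(np.sqrt(np.mean(diff ** 2)))


# Toy data: two grid cells, two weekly columns, half of the labels missing.
labels = pd.DataFrame(
    {"2013-01-01": [1.0, np.nan], "2013-01-08": [np.nan, 3.0]},
    index=["cell_a", "cell_b"],
)
preds = pd.DataFrame(
    {"2013-01-01": [2.0, 5.0], "2013-01-08": [7.0, 3.0]},
    index=["cell_a", "cell_b"],
)

score = masked_rmse(labels, preds)                 # uses only the two labeled entries
zero_filled = masked_rmse(labels.fillna(0.0), preds)  # treats NaN as 0.0 instead
```

In this toy case the two scores differ, which is why filling NA with 0 before training or validating would change what the model is rewarded for.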

rasim321 · December 8, 2021, 10:45pm

Apologies if I’m not understanding this right, but doesn’t this mean we cannot use most of the training data? Like @igorkf said 98 percent of the labels from most columns are missing. Or is this just the nature of the readings across locations and time?

tglazer · December 9, 2021, 3:58pm

@rasim321 All labels that can be used for model training are contained in train_labels.csv. The features for this competition include remote sensing data, snow measurements captured by volunteer networks, and climate information, in addition to a narrow set of ground measures. Please refer to the Development Stage page of the competition site for a complete list of approved features for model training.

emily · December 9, 2021, 5:59pm

@rasim321 you may be getting thrown by the “wide” format of the train labels. The columns are weekly dates between 2013 and 2019 (excluding summer months), and the rows are 1km by 1km grid cells across the Western US. Most cells do not have a reading every week, so you will see NaNs. If you transform the data to “long” format, you’ll see that you have over 91,000 SWE readings to work with.

import pandas as pd
df = pd.read_csv("train_labels.csv")
# Reshape wide to long (one row per cell and date), then drop missing readings
df.melt(id_vars=["cell_id"]).dropna()
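For anyone who wants to try the reshape without the competition file, here is a small sketch on a made-up wide table. The column names `date` and `swe` are my own choice, and it assumes the date column headers are strings that `pd.to_datetime` can parse, so adjust for the real file.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for train_labels.csv: one row per grid cell, one column per week.
wide = pd.DataFrame(
    {
        "cell_id": ["cell_a", "cell_b", "cell_c"],
        "2013-01-01": [1.2, np.nan, np.nan],
        "2013-01-08": [np.nan, np.nan, 0.0],
    }
)

# Share of label entries that are missing in the wide table.
missing_share = wide.drop(columns="cell_id").isna().to_numpy().mean()

# Long format: one row per (cell, date) pair that actually has a reading.
long = (
    wide.melt(id_vars=["cell_id"], var_name="date", value_name="swe")
    .dropna(subset=["swe"])
    .assign(date=lambda d: pd.to_datetime(d["date"]))
    .reset_index(drop=True)
)

print(f"{missing_share:.0%} of entries missing, {len(long)} readings kept")
```

Note that the measured 0.0 for `cell_c` survives the `dropna`, while the missing entries do not, so real zeros and NaNs stay distinct after the reshape.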
