
Many NAs on train_labels.csv

Hi. The file train_labels.csv has 267 columns (plus one additional column for cell_id), but ~98% of the values in those columns are NA. Is this right?

Can we consider NA as 0? Plotting the distribution we see many zeros.

@igorkf train_labels.csv does not contain labels for every grid cell every week. Missing values are not equivalent to 0.0. While you should make predictions for all grid cells and dates in the submission format, you will only be evaluated on non-null values.
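To illustrate what "only evaluated on non-null values" means in practice, here is a minimal sketch that scores predictions only where a label exists. RMSE is used purely as an assumed example metric (the thread does not name one), and the function name `masked_rmse`, the cell IDs, and the date column names are made up for illustration.

```python
import numpy as np
import pandas as pd


def masked_rmse(y_true: pd.DataFrame, y_pred: pd.DataFrame) -> float:
    """RMSE over positions where y_true is not null (metric is an assumption)."""
    # Boolean mask of positions that actually have a label
    mask = y_true.notna().to_numpy()
    # Keep only the errors at labelled positions; NaN labels are ignored
    diff = (y_pred.to_numpy(dtype=float) - y_true.to_numpy(dtype=float))[mask]
    return float(np.sqrt(np.mean(diff ** 2)))


# Tiny synthetic example: two grid cells, two weekly dates (hypothetical values)
labels = pd.DataFrame(
    {"2013-01-01": [1.0, np.nan], "2013-01-08": [np.nan, 3.0]},
    index=["cell_a", "cell_b"],
)
preds = pd.DataFrame(
    {"2013-01-01": [2.0, 9.9], "2013-01-08": [9.9, 3.0]},
    index=labels.index,
)

# Only the two labelled positions contribute; the 9.9 predictions are never scored
score = masked_rmse(labels, preds)
```

Filling the missing labels with 0 instead would score the 9.9 predictions against a made-up target, which is why NaN and 0.0 should not be treated as the same thing.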

Apologies if I’m not understanding this right, but doesn’t this mean we cannot use most of the training data? Like @igorkf said 98 percent of the labels from most columns are missing. Or is this just the nature of the readings across locations and time?

@rasim321 All labels that can be used for model training are contained in train_labels.csv. The features for this competition include remote sensing data, snow measurements captured by volunteer networks, and climate information, in addition to a narrow set of ground measures. Please refer to the Development Stage page of the competition site for a complete list of approved features for model training.

@rasim321 you may be getting thrown by the “wide” format of the train labels. The columns are weekly dates between 2013 and 2019 (excluding summer months), and the rows are 1km by 1km grid cells across the Western US. Most cells do not have a reading every week, so you will see NaNs. If you transform the data to “long” format, you’ll see that you have over 91,000 SWE readings to work with.

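import pandas as pd
df = pd.read_csv("train_labels.csv")
# Reshape wide (one column per date) to long, then drop rows with no reading
long_df = df.melt(id_vars=["cell_id"]).dropna()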
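If you want to see what the wide-to-long reshape does without loading the real file, here is a small sketch on a synthetic frame. The cell IDs, the date column names, and the `date` / `swe` column labels are assumptions for illustration only; the real file's column format may differ.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for train_labels.csv: one row per grid cell,
# one column per weekly date, NaN where no reading exists
wide = pd.DataFrame(
    {
        "cell_id": ["cell_a", "cell_b", "cell_c"],
        "2013-01-01": [1.2, np.nan, np.nan],
        "2013-01-08": [np.nan, np.nan, 0.0],
    }
)

# Melt to one row per (cell, date) pair and keep only real readings.
# Note the genuine 0.0 reading survives dropna, while the NaNs do not.
long = (
    wide.melt(id_vars=["cell_id"], var_name="date", value_name="swe")
    .dropna(subset=["swe"])
    .assign(date=lambda d: pd.to_datetime(d["date"]))
    .reset_index(drop=True)
)

n_readings = len(long)
```

Here `cell_b` drops out entirely because it has no readings, and `n_readings` counts only the labelled (cell, date) pairs, which is the number that matters for training.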