Null values correlation

Plotted the correlation between the amount of missing values in each of the consumedxxxx features for surveys 1-6.

There are some clear patterns within and between the surveys, though I'm not sure what insights can be gleaned from this, if any. Thought I’d share in case anyone found it interesting or wanted to look further into this.

Excuse the lack of axis labels, I couldn’t figure out an easy way to put readable ones in, but each row/col is just consumedxxxx, starting from consumed100 and ending at consumed5000.

Each cell represents the correlation between the null indicators of the two features. For example, we can see in survey 3 that if consumed100 is null for a response, then all of consumed100-900 will be null (in this survey there is only 1 row where these features are null, so this isn’t all that interesting in and of itself). I don’t think this exercise is all that useful for this problem, though in theory it could let us infer some things about the structure of the questionnaire.
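For anyone who wants to reproduce something like this, here is a minimal sketch of how such a matrix could be computed with pandas. It is not necessarily how the plot above was made: the function name `null_correlation`, the `prefix` argument and the toy data are all made up for illustration, and in practice it would be applied to each survey's rows separately.

```python
import numpy as np
import pandas as pd


def null_correlation(df, prefix="consumed"):
    """Correlate the is-null indicators of all <prefix><number> columns.

    Hypothetical helper for illustration, not part of any library.
    """
    # Keep only columns such as consumed100, consumed200, ... and sort
    # them by their numeric suffix so the matrix rows/cols are ordered.
    cols = [
        c for c in df.columns
        if c.startswith(prefix) and c[len(prefix):].isdigit()
    ]
    cols.sort(key=lambda c: int(c[len(prefix):]))

    # 1 where the value is missing, 0 otherwise.
    indicators = df[cols].isna().astype(int)

    # Pearson correlation between indicator columns. A column that is
    # never null (or always null) has zero variance and yields NaN.
    return indicators.corr()


# Toy data (invented): consumed100 and consumed200 are missing together.
toy = pd.DataFrame({
    "consumed300": [np.nan, 1.0, 1.0, 0.0],
    "consumed100": [1.0, np.nan, 0.0, 1.0],
    "consumed200": [0.0, np.nan, 1.0, 1.0],
})

corr = null_correlation(toy)
print(corr)
```

The result is a DataFrame whose index and columns are the feature names, so a heatmap drawn from it can take its tick labels straight from there.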


This is quite interesting! Thanks.

Could you help me understand what utl_exp_ppp17 is? Is it the household (not per-capita) expenditure, or the household expenditure over the last 7 days? I ask because 95% of the samples have it greater than cons_ppp17. How is utl_exp_ppp17 different from cons_ppp17?


My assumption would be that utl_exp_ppp17 is the amount the household spends on utilities (electricity, water, etc.). It doesn’t specify a timeframe, so it may be weekly or monthly. cons_ppp17, on the other hand, is the total daily expenditure per person in the household. I may be wrong though.
