
Household feature hashing in two countries


#1

Hello. I'm just starting to look at the data here, and I’m unclear about the hashing of the question names. Column “CtFxPQPT” appears in both household country A (training) and household country C (training), but they are clearly different questions. In country A, column CtFxPQPT has two answers that are both hashed text (8185 answers are “vSqQC” and 18 answers are “atYJj”), yet in country C, column CtFxPQPT looks like integers ranging from -1 to -1611. So may we assume that column labels are specific to each country, even when the hashed header is the same? And if so, should we assume that NO questions repeat across countries A, B, and C? Thank you!


#2

Hi @sgenzer,

Thanks for pointing this out. It looks like a small bug in our obfuscation process led to 6 hashing collisions in the household training data.

All occur between countries A and C:

SlDKnCuu enTUTSQi znHDEHZP CtFxPQPT CNkSTLvx hJrMTBVd

We have confirmed that none of these correspond to the same question, e.g., question SlDKnCuu asks something different for country A than it does for country C.
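
For anyone who wants to check this on their own copy of the data, here is a minimal sketch of how such name collisions could be spotted with pandas. It assumes the two household training files have already been loaded into DataFrames; the helper name `shared_columns`, its `ignore` parameter, and the toy frames below are hypothetical, and only the column name and the hashed answers come from this thread.

```python
import pandas as pd

def shared_columns(df_a, df_c, ignore=()):
    """List column names present in both frames, with each frame's dtype.

    `ignore` is a hypothetical parameter for columns that are expected to be
    shared (for example an identifier or the target column).
    """
    common = [c for c in df_a.columns if c in df_c.columns and c not in ignore]
    return pd.DataFrame(
        {
            "dtype_a": [str(df_a[c].dtype) for c in common],
            "dtype_c": [str(df_c[c].dtype) for c in common],
        },
        index=common,
    )

# Toy frames standing in for the real training files (assumption: the real
# data would be read from the competition CSVs instead).
a = pd.DataFrame({"CtFxPQPT": ["vSqQC", "atYJj"], "onlyInA": [1, 2]})
c = pd.DataFrame({"CtFxPQPT": [-1, -1611], "onlyInC": [0.5, 1.5]})

report = shared_columns(a, c)
print(report)
```

A shared name whose dtypes differ between the two frames, as with CtFxPQPT here, is a strong hint that the two columns are not the same question.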

As for your other question:

A small number of questions do overlap across countries, but not many. They were hashed differently because the original surveys coded the questions differently. So it’s best to assume no overlap.
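
If you do combine countries into one table, one way to act on "assume no overlap" is to make every feature name country-specific first. This is only a sketch under that assumption, not a description of how the data was prepared; `prefix_features`, its `keep` parameter, and the toy frames are hypothetical.

```python
import pandas as pd

def prefix_features(df, country, keep=()):
    """Rename every column not listed in `keep` to '<country>_<name>'."""
    mapping = {c: f"{country}_{c}" for c in df.columns if c not in keep}
    return df.rename(columns=mapping)

# Toy frames standing in for the country A and country C training data.
a = pd.DataFrame({"CtFxPQPT": ["vSqQC", "atYJj"]})
c = pd.DataFrame({"CtFxPQPT": [-1, -1611]})

# After prefixing, the colliding name becomes two separate columns, so
# stacking the frames no longer mixes hashed text with integers.
combined = pd.concat(
    [prefix_features(a, "A"), prefix_features(c, "C")],
    ignore_index=True,
    sort=False,
)
print(combined.columns.tolist())
```

Rows from one country are left missing in the other country's columns, so fitting a separate model per country is an equally reasonable alternative.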

Good luck!


#3

thank you @caseyalan!