Consistency, Bias, Time Sequences: Things to Think About

Everyone’s had a few weeks to implement their initial algorithms, start testing them out, and make improvements. But how closely have you looked at the feature spec and the data? Here are a few things that might be worth checking out if you’d like to further improve your results:

Consistency: How closely have you looked at the definitions of the features we’re using in this sprint’s data set? We know there are correlations between feature values (education and income, for instance). But sometimes the relationships between features are more rigid than that: a particular value of one feature can impose hard constraints on the possible valid values of another. Have you ever ARRIVED at work before you DEPARTED from home, for instance? (Or were you ever 10 years old the year after you were 20 years old?) Making sure the feature values in a given record (or across an individual’s set of records) are internally consistent with each other is an important step in cleaning survey data. We won’t be explicitly scoring your privatized data on consistency, but keeping these rules in mind may be a good way to improve your 3-marginal score with no added sensitivity cost.
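Because a consistency repair runs entirely on already-privatized records, it touches no raw data and adds no sensitivity cost. Here is a minimal sketch of the idea, assuming hypothetical column names DEPARTS and ARRIVES for the home-departure and work-arrival times; your actual feature names and repair rules will differ:

```python
# Post-processing consistency repair on synthetic records.
# DEPARTS and ARRIVES are assumed column names for illustration.

def enforce_consistency(record):
    """Repair a synthetic record so its features are mutually consistent."""
    fixed = dict(record)
    # A worker cannot arrive at work before departing home; clamp arrival.
    if fixed["ARRIVES"] < fixed["DEPARTS"]:
        fixed["ARRIVES"] = fixed["DEPARTS"]
    return fixed

rec = {"DEPARTS": 830, "ARRIVES": 745}
print(enforce_consistency(rec))  # {'DEPARTS': 830, 'ARRIVES': 830}
```

Clamping is only one possible repair strategy; resampling the inconsistent feature from its conditional marginal is another, and either way the repair happens after privatization.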

Bias, Fairness & Geography: In many cases the relationships between features depend on geography, and those relationships can change over time as neighborhoods change. Different PUMAs will have very different data distributions. And with significant rural, urban, and suburban population shifts over the past decade, some PUMAs will have one distribution in 2012 and a very different one in 2017. When you aggregate the data naively across PUMA or YEAR to reduce the impact of added noise, you may be accidentally erasing these differences, and effectively erasing these populations from your synthetic data. In the real world, this can interfere with data analysis, impact public policy and corporate strategy, and cause problems for the people living in these areas.
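To see concretely how naive pooling erases geographic differences, here is a toy illustration with made-up commute-mode counts for two hypothetical PUMAs:

```python
# Toy illustration (made-up numbers): pooling two PUMAs with very
# different distributions produces a marginal that matches neither.
urban = {"transit": 70, "car": 30}   # hypothetical commute-mode counts
rural = {"transit": 5, "car": 95}

pooled = {k: urban[k] + rural[k] for k in urban}

# Synthetic records sampled from the pooled marginal would give BOTH
# PUMAs a ~38% transit share, erasing the real urban/rural contrast.
transit_share = pooled["transit"] / sum(pooled.values())
print(pooled, transit_share)  # {'transit': 75, 'car': 125} 0.375
```

The noise reduction from pooling is real, but so is the cost: the more the pooled groups differ, the more signal the aggregation destroys.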

You can use the visualizer in the competitor’s pack to check your algorithm for especially poorly scoring PUMA/YEAR pairs, or you can check out the detailed scoring report (see the readme in the repo). If you find you’re sometimes aggregating too broadly, you may be able to better tailor your aggregation with a few low-sensitivity queries on PUMA demographics or finances. (Remember that the final scoring will be done on a different geography/PUMA set, so you can’t take lessons about specific PUMAs directly from the public data.)
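One cheap way to tailor aggregation is to spend a small slice of privacy budget on noisy per-PUMA counts and only pool PUMAs that look similar. The sketch below assumes each individual contributes at most `max_records` rows (one per year, capped at 7 here as an illustrative assumption); the record layout and the `"PUMA"` key are hypothetical stand-ins for your own data structures:

```python
import numpy as np

def noisy_group_counts(records, key, epsilon, max_records=7, seed=None):
    """Return Laplace-noised counts of `records` grouped by `key`.

    Adding or removing one individual changes any count by at most
    `max_records`, so the query's sensitivity is max_records and the
    Laplace scale is max_records / epsilon.
    """
    rng = np.random.default_rng(seed)
    counts = {}
    for rec in records:
        counts[rec[key]] = counts.get(rec[key], 0) + 1
    scale = max_records / epsilon
    return {k: v + rng.laplace(0.0, scale) for k, v in counts.items()}
```

With these noisy counts in hand you might, for example, pool only PUMAs whose noisy demographics look alike, rather than pooling everything.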

Time Sequences: This is just a quick note: remember that some individual features are fixed across all of that individual’s records (e.g., sex, race, Hispanic origin), while others change following specific rules (e.g., age, education, citizenship). If your algorithm considers individuals in the data, you might find this a handy way to reduce sensitivity on some queries.
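As a sketch of how fixed features can lower sensitivity: a per-row count of a fixed feature like sex has sensitivity equal to the maximum number of records per person, but a per-individual count ("how many people of each sex?") has sensitivity 1, since each person contributes exactly one value no matter how many records they have. The column names `sim_individual_id` and `SEX` below are assumptions about your data layout:

```python
# Per-individual counting of a feature that is fixed across an
# individual's records. Column names are illustrative assumptions.

def people_per_sex(records):
    seen = {}
    for rec in records:
        # Each person is counted once, however many rows they have,
        # because SEX is identical on every one of their records.
        seen[rec["sim_individual_id"]] = rec["SEX"]
    counts = {}
    for sex in seen.values():
        counts[sex] = counts.get(sex, 0) + 1
    return counts

records = [
    {"sim_individual_id": 1, "SEX": 1},
    {"sim_individual_id": 1, "SEX": 1},
    {"sim_individual_id": 2, "SEX": 2},
]
print(people_per_sex(records))  # {1: 1, 2: 1}
```

The same trick applies to features that change by fixed rules: if age advances by exactly one per year, a person's whole age sequence is determined by a single value, so querying it costs no more than querying one record.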