Consistency, Bias, Time Sequences: Things to Think About

Everyone’s had a few weeks to implement their initial algorithms, start testing them out, and make improvements. But how closely have you looked at the feature spec and the data? Here are a few things that might be worth checking out if you’d like to further improve your results:

Consistency: How closely have you looked at the definitions of the features we’re using in this sprint’s data set? We know there are correlations between feature values (education and income, for instance). But sometimes the relationships between features are more rigid than that: a particular value of one feature can impose hard constraints on the possible valid values of another. Have you ever ARRIVED at work before you DEPARTED from home, for instance? (Or were you ever 10 years old the year after you were 20 years old?) Making sure the feature values in a given record (or across an individual’s set of records) are internally consistent with each other is an important step in cleaning survey data. We won’t be explicitly scoring your privatized data on consistency, but keeping these rules in mind may be a good way to improve your 3-marginal score with no added sensitivity cost.
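Because a consistency repair runs entirely on already-privatized records, it touches no raw data and adds no sensitivity cost. Here is a minimal sketch of the idea, assuming hypothetical column names DEPARTS and ARRIVES for the home-departure and work-arrival times; your actual feature names and repair rules will differ:

```python
# Post-processing consistency repair on synthetic records.
# DEPARTS and ARRIVES are assumed column names for illustration.

def enforce_consistency(record):
    """Repair a synthetic record so its features are mutually consistent."""
    fixed = dict(record)
    # A worker cannot arrive at work before departing home; clamp arrival.
    if fixed["ARRIVES"] < fixed["DEPARTS"]:
        fixed["ARRIVES"] = fixed["DEPARTS"]
    return fixed

rec = {"DEPARTS": 830, "ARRIVES": 745}
print(enforce_consistency(rec))  # {'DEPARTS': 830, 'ARRIVES': 830}
```

Clamping is only one possible repair strategy; resampling the inconsistent feature from its conditional marginal is another, and either way the repair happens after privatization.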

Bias, Fairness & Geography: In many cases the relationships between features depend on geography, and those relationships can change over time as neighborhoods change. Different PUMAs will have very different data distributions. And with significant rural, urban, and suburban population shifts over the past decade, some PUMAs will have one distribution in 2012 and a very different one in 2017. When you aggregate the data naively across PUMA or YEAR to reduce the impact of added noise, you may be accidentally erasing these differences, and effectively erasing these populations from your synthetic data. In the real world, this can interfere with data analysis, impact public policy and corporate strategy, and cause problems for the people living in these areas.
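To see concretely how naive pooling erases geographic differences, here is a toy illustration with made-up commute-mode counts for two hypothetical PUMAs:

```python
# Toy illustration (made-up numbers): pooling two PUMAs with very
# different distributions produces a marginal that matches neither.
urban = {"transit": 70, "car": 30}   # hypothetical commute-mode counts
rural = {"transit": 5, "car": 95}

pooled = {k: urban[k] + rural[k] for k in urban}

# Synthetic records sampled from the pooled marginal would give BOTH
# PUMAs a ~38% transit share, erasing the real urban/rural contrast.
transit_share = pooled["transit"] / sum(pooled.values())
print(pooled, transit_share)  # {'transit': 75, 'car': 125} 0.375
```

The noise reduction from pooling is real, but so is the cost: the more the pooled groups differ, the more signal the aggregation destroys.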

You can use the visualizer in the competitor’s pack to check your algorithm for especially poorly scoring PUMA/YEAR pairs, or you can check out the detailed scoring report (see the readme in the repo). If you find you’re sometimes aggregating too broadly, you may be able to better tailor your aggregation with a few low-sensitivity queries on PUMA demographics or finances. (Remember that the final scoring will be done on a different geography/PUMA set, so you can’t take lessons about specific PUMAs directly from the public data.)
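One cheap way to tailor aggregation is to spend a small slice of privacy budget on noisy per-PUMA counts and only pool PUMAs that look similar. The sketch below assumes each individual contributes at most `max_records` rows (one per year, capped at 7 here as an illustrative assumption); the record layout and the `"PUMA"` key are hypothetical stand-ins for your own data structures:

```python
import numpy as np

def noisy_group_counts(records, key, epsilon, max_records=7, seed=None):
    """Return Laplace-noised counts of `records` grouped by `key`.

    Adding or removing one individual changes any count by at most
    `max_records`, so the query's sensitivity is max_records and the
    Laplace scale is max_records / epsilon.
    """
    rng = np.random.default_rng(seed)
    counts = {}
    for rec in records:
        counts[rec[key]] = counts.get(rec[key], 0) + 1
    scale = max_records / epsilon
    return {k: v + rng.laplace(0.0, scale) for k, v in counts.items()}
```

With these noisy counts in hand you might, for example, pool only PUMAs whose noisy demographics look alike, rather than pooling everything.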

Time Sequences: This is just a quick note: remember that some individual features are fixed across all of that individual’s records (e.g., sex, race, Hispanic origin), while others change following specific rules (e.g., age, education, citizenship). If your algorithm considers individuals in the data, you might find this a handy way to reduce sensitivity on some queries.
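As a sketch of how fixed features can lower sensitivity: a per-row count of a fixed feature like sex has sensitivity equal to the maximum number of records per person, but a per-individual count ("how many people of each sex?") has sensitivity 1, since each person contributes exactly one value no matter how many records they have. The column names `sim_individual_id` and `SEX` below are assumptions about your data layout:

```python
# Per-individual counting of a feature that is fixed across an
# individual's records. Column names are illustrative assumptions.

def people_per_sex(records):
    seen = {}
    for rec in records:
        # Each person is counted once, however many rows they have,
        # because SEX is identical on every one of their records.
        seen[rec["sim_individual_id"]] = rec["SEX"]
    counts = {}
    for sex in seen.values():
        counts[sex] = counts.get(sex, 0) + 1
    return counts

records = [
    {"sim_individual_id": 1, "SEX": 1},
    {"sim_individual_id": 1, "SEX": 1},
    {"sim_individual_id": 2, "SEX": 2},
]
print(people_per_sex(records))  # {1: 1, 2: 1}
```

The same trick applies to features that change by fixed rules: if age advances by exactly one per year, a person's whole age sequence is determined by a single value, so querying it costs no more than querying one record.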