Question about sensitivity

Hello, we had two questions regarding the sensitivity.

Question 1: Our first question is about one slide in the webinar video for the challenge (timestamp in the YouTube link below).

[webinar slide diagram]

For the diagram attached, the presenter said that on the right-hand side, the sensitivity is 20. I assume this is a sum release query for the attributes neighborhood, month, and incident type. However, the presentation goes on to say that because we are given the user_ids for each row, “counts of nodes on the left have sensitivity 1.” We are unsure how the sensitivity changes. Could someone clarify what the query is now and how that changes the sensitivity? To our knowledge, sensitivity is defined in terms of all possible datasets, not just the one we are trying to privatize, so regardless of how the dataset given to us looks, we can always find two theoretical datasets where the sensitivity for a query is 20. I am using the sensitivity definition from Dwork and Roth, where I assume X is all datasets in which any user can make at most 20 calls.
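For concreteness, the definition we have in mind is the global L1 sensitivity over neighboring datasets in X (the notation here is ours, adapted from Dwork and Roth, not from the slides):

```latex
\Delta f = \max_{\substack{x,\, y \in X \\ x,\, y \text{ differ in one user's records}}} \lVert f(x) - f(y) \rVert_1
```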

Question 2: Our second question is in reference to tip 3

Tip 3: Histograms (built on externally-provided schema) can be quite useful. The sensitivity of creating a histogram is equal to the number of histogram bins that can be changed in the worst case when one individual’s records are changed. In this sprint that is equal to the max_records_per_individual parameter. Check out the provided baseline solution and beginner’s guide tutorial above for more information on differentially private algorithms using histograms.

It says that the sensitivity is determined by the number of bins that are changed. Here, are we referring to a histogram of 174 bins (1 for each incident type) for each neighborhood-month pair? In that case, wouldn’t the sensitivity be determined not by the number of bins changed but by how much the counts change in total across all bins? The sensitivity would still be 20, but that could also happen if only a single bin changes by 20, instead of 20 bins each changing by 1. We wanted to start thinking about how to use this tip, but since this statement was a bit confusing to us, we wanted to double check our understanding of what is meant by a histogram.

I’ll answer your questions out of order, if you don’t mind:

First off, you’re exactly correct on your interpretation of Question 2/Tip 3. That’s our mistake, and I’ll go make sure the text gets updated. The sensitivity of the histogram is the L1 norm of change across all bins of the histogram, resulting from adding/removing one individual from the data. If a single bin can change by 20, the sensitivity of the histogram is 20, just as much as if it’s 20 different bins changing by 1, or one bin changing by 15 and one by 5.
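To make that concrete, here’s a minimal sketch of a Laplace-mechanism histogram release where the noise scale is driven by that L1 sensitivity. The column names, the max_records_per_individual default, and the schema_index argument are illustrative placeholders, not the actual challenge API:

```python
import numpy as np
import pandas as pd

def dp_histogram(df, bin_columns, epsilon, max_records_per_individual=20,
                 schema_index=None, rng=None):
    """Release a noisy histogram over bin_columns with the Laplace mechanism.

    The L1 sensitivity is max_records_per_individual: adding or removing one
    person changes the bin counts by at most that much in total, whether it's
    one bin changing by 20 or 20 bins each changing by 1.
    """
    rng = np.random.default_rng() if rng is None else rng

    # True counts. For a proper DP release the set of bins must come from the
    # externally provided schema (including empty bins), not from the data.
    counts = df.groupby(bin_columns).size()
    if schema_index is not None:
        counts = counts.reindex(schema_index, fill_value=0)

    # Laplace noise with scale = sensitivity / epsilon.
    scale = max_records_per_individual / epsilon
    return counts + rng.laplace(loc=0.0, scale=scale, size=len(counts))

# Hypothetical usage with illustrative column names:
# noisy = dp_histogram(events, ["neighborhood", "month", "incident_type"], epsilon=1.0)
```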

To your first question, for the diagram you’re looking at, it’s a matter of what you’re counting. On the right, you’re doing a simple count of records (and that has sensitivity 20, as one person can contribute up to 20 records).

On the left-hand side, the implication is that you’re doing a simple count of people. If, rather than using event records as your fundamental data point, you model types of individuals (either as graph nodes, as shown in this slide, or using any other sort of approach to capture an entity that may produce multiple records), then you can count the number of instances of each type of person with sensitivity 1.
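Here’s a toy sketch of the two ways of counting, just to illustrate the distinction (the table and column names are made up, not the challenge schema, and I’ve dropped the month attribute for brevity):

```python
import pandas as pd

# Toy events table; user_id / neighborhood / incident are illustrative stand-ins.
events = pd.DataFrame({
    "user_id":      [1, 1, 2, 2, 2, 3],
    "neighborhood": ["A", "A", "A", "B", "B", "A"],
    "incident":     ["fire", "noise", "fire", "fire", "noise", "noise"],
})

# Counting *records* (right-hand side of the slide): one person can contribute
# up to max_records_per_individual rows, so these counts have sensitivity 20.
record_counts = events.groupby(["neighborhood", "incident"]).size()

# Counting *people by type* (left-hand side): collapse each user to a single
# "type" -- here, the sorted multiset of their (neighborhood, incident) records.
# Each user lands in exactly one type bin, so adding or removing one person
# changes the type histogram by at most 1 in a single bin: sensitivity 1.
person_types = pd.Series({
    user: tuple(sorted(zip(g["neighborhood"], g["incident"])))
    for user, g in events.groupby("user_id")
})
type_counts = person_types.value_counts()
```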

In Sprint #1 the majority of individuals in the data set provide relatively few records, so this may not be an especially appealing approach. But each sprint’s data set will present an increasingly challenging real-world application context. We’ll have an announcement about the Sprint #2 data in the next few weeks.

Thanks for getting back to me!

So here, is a “type” of person defined by the histogram of calls they make (by neighborhood, month, and incident)? In that case there would be something like (278 x 12 x 174 + 1)^20 types of people?

Yep, naively. I used to have a slide to that effect in the deck, but pulled it for time constraints. In the previous challenge (NIST Differential Privacy Synthetic Data) we were working in data spaces that had greater (sometimes much greater) than 20^10 possible record types… and far fewer individual records. In many real-world use cases it becomes impossible to work with the naive histogram counting all possible record types, and you have to start finding clever ways to compress your representation of the data space.
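For a rough sense of scale, here’s a back-of-the-envelope calculation using the numbers from this thread (not anything official):

```python
# Naive size of the "person type" space discussed above: each person has up
# to 20 call slots, and each slot is one of the possible records or empty.
n_record_types = 278 * 12 * 174               # neighborhood x month x incident type
n_person_types = (n_record_types + 1) ** 20   # naive upper bound, ignoring ordering

print(f"{n_record_types:,} possible record types")                  # 580,464
print(f"~10^{len(str(n_person_types)) - 1} possible person types")  # ~10^115
```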

You might find it interesting to look through some of the winning solutions to the last challenge (scroll down to the challenge section here):
https://www.nist.gov/itl/applied-cybersecurity/privacy-engineering/collaboration-space/focus-areas/de-id/tools

Ok I see, thank you!