Sensitivity of a Query over taxi_id

What is the sensitivity of a query that asks for the shift histogram of a single taxi_id? According to tip 4 it seems like it would be 1:

“However, if the histogram is counting some properties of taxis, the sensitivity will be 1, because in this problem the taxis (as specified by taxi_id) are defined to be our individuals.”

Now, if I were to ask for the shift histogram of each unique taxi_id, would the sensitivity of this query be 200? The reason being: if I add or remove a single taxi_id, the L1 distance of one of the histograms would change by at most 200 while all the other histograms would stay the same.

We’ll go back and clarify that language a bit, but the intended meaning of that tip is: “Any query that counts taxis will have a sensitivity of 1”. For example:

  • “How many taxis dropped off in area 17?”
  • “How many taxis did at least 15 trips on Friday night?”
  • “How many taxis did fewer than 5 trips every morning and afternoon?”
  • “How many taxis picked up between 5 and 10 trips in community area 23?”
  • “How many taxis picked up in areas 10, 12 or 16 and dropped off in area 50?”
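
To make the difference concrete, here is a minimal sketch of a taxi-counting query. The record layout (dicts with a "dropoff_area" field), the helper names, and the use of Laplace noise are all assumptions made for illustration; they are not part of the challenge data or tooling. The point is only that adding or removing one taxi_id, no matter how many trips it has, moves the count by at most 1.

```python
import random

def count_taxis(trips, predicate):
    """Count distinct taxis with at least one trip matching the predicate."""
    return len({t["taxi_id"] for t in trips if predicate(t)})

def laplace_noise(scale, rng=random):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def noisy_count_taxis(trips, predicate, epsilon, rng=random):
    """Taxi count plus Laplace noise calibrated to sensitivity 1."""
    sensitivity = 1.0  # one taxi changes the count by at most 1
    return count_taxis(trips, predicate) + laplace_noise(sensitivity / epsilon, rng)

# Hypothetical toy data: taxi 1000001 has two matching trips.
trips = [
    {"taxi_id": 1000001, "dropoff_area": 17},
    {"taxi_id": 1000001, "dropoff_area": 17},
    {"taxi_id": 1000002, "dropoff_area": 17},
    {"taxi_id": 1000003, "dropoff_area": 8},
]

def dropped_off_in_17(trip):
    return trip["dropoff_area"] == 17

true_count = count_taxis(trips, dropped_off_in_17)

# Neighbouring data set: remove every trip belonging to one taxi.
neighbour = [t for t in trips if t["taxi_id"] != 1000001]
neighbour_count = count_taxis(neighbour, dropped_off_in_17)

print(true_count, neighbour_count)  # the two counts differ by at most 1
print(noisy_count_taxis(trips, dropped_off_in_17, epsilon=1.0))
```

Even though the removed taxi contributed two trips, the taxi count only drops by one, which is why queries of this shape have sensitivity 1.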

By contrast, the query you’re proposing still counts trips rather than taxis, but let’s look at it a bit more closely too. The first difficulty here is that the list of all taxi_ids is itself private (remember that these are unique identifiers for the taxi drivers, effectively like individuals’ names in survey data). The domain of taxi_ids is just the full domain of 7-digit serial numbers, so there are slightly fewer than 10^7 x 21 possible combinations of taxi_id and shift. This would be an extraordinarily sparse representation of the input data, but if you wanted to use it, then yes, querying the count of trips per shift associated with one possible serial number would have a sensitivity of 200 (i.e., the maximum possible number of trips that could be associated with one taxi). As I’m sure you’ve noticed, that’s likely not an effective direction to head in.
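
As a rough illustration of where the 200 comes from, the sketch below builds the per-shift trip histogram for one serial number and compares it against the neighbouring data set with that taxi removed. The tuple layout (taxi_id, shift), the 0 to 20 shift indexing, and the helper names are assumptions for the example only; the 21 shifts and the 200-trip maximum are the figures quoted above.

```python
from collections import Counter

NUM_SHIFTS = 21            # shifts per taxi, as quoted above
MAX_TRIPS_PER_TAXI = 200   # maximum trips associated with one taxi

def shift_histogram(trips, taxi_id):
    """Trips per shift for a single taxi_id (shifts indexed 0..20 here)."""
    counts = Counter(shift for tid, shift in trips if tid == taxi_id)
    return [counts.get(s, 0) for s in range(NUM_SHIFTS)]

def l1_distance(a, b):
    """Sum of absolute bin-by-bin differences between two histograms."""
    return sum(abs(x - y) for x, y in zip(a, b))

# Hypothetical worst case: one taxi with the maximum number of trips,
# plus a second taxi with a handful of trips.
trips = [(1234567, i % NUM_SHIFTS) for i in range(MAX_TRIPS_PER_TAXI)]
trips += [(7654321, 3)] * 5

# Neighbouring data set: every trip of taxi 1234567 removed.
neighbour = [t for t in trips if t[0] != 1234567]

worst_case_change = l1_distance(
    shift_histogram(trips, 1234567),
    shift_histogram(neighbour, 1234567),
)
other_taxi_change = l1_distance(
    shift_histogram(trips, 7654321),
    shift_histogram(neighbour, 7654321),
)

# Upper bound on the number of (taxi_id, shift) cells in the sparse domain.
domain_upper_bound = 10**7 * NUM_SHIFTS

print(worst_case_change, other_taxi_change, domain_upper_bound)
```

Removing the one taxi wipes out all of its trips at once, so its histogram can move by up to 200 in L1 distance while every other taxi’s histogram is unchanged. Noise calibrated to that sensitivity would have to be added across the whole sparse domain, which is the practical problem with this approach.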

I know this problem definition is a change from how people are used to thinking about individuals in differential privacy. We’re not accustomed to working in a context where many records are associated with one individual, so this problem will take some creativity. Still, this is a very real-world situation: data sets of this kind (such as GPS tracking, transaction, and browsing data) are currently being bought and sold without any effective privacy protection at all. So let’s see if we can successfully add some.

Feel free to ask more questions on the forum, and to submit early and often to DP pre-screening.