As you guys prepare for your final submission on Monday, here are some notes about what should go into your final write-up. We’re going to ask you to be a bit more formal than we have in the past.
Levels of formality in write-ups have historically varied remarkably widely, but as we’re dealing with both trip-level queries and taxi-level queries in this sprint, that rapidly becomes problematic. It can be very easy to get things tangled up and have a differential privacy violation, and it can be very difficult for us to validate what you’re doing without formal specifications. Writing everything out as clearly and formally as possible will both significantly improve our review process and help ensure you catch any oversights before we do.
Note that we do mean it this time. Write-ups which aren’t clearly and fully specified according to these instructions will be bounced back for clarification.
Write-up Guidelines
Please split your write-up into three subsections, with subheading titles, and cover them in order as described below:
(1) Preprocessing:
This section documents steps that act separately on each distinct taxi_ID and have a sensitivity cost of 0.
- Please clearly define all preprocessing steps using pseudocode or mathematical notation, not just English.
- All steps that operate only on the public data or external publicly available data should also be included here.
- Indicate where in your source code each step occurs.
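As a rough illustration of the level of detail we mean, here is a minimal pandas sketch of one possible preprocessing step: capping the number of trips kept per taxi. The cap `MAX_TRIPS_PER_TAXI`, the function name, and the `pickup_area` column are hypothetical, made up for this example; only `taxi_ID` comes from the data description above. Your own steps will differ.

```python
import pandas as pd

# Hypothetical cap k, assumed to be chosen without looking at the private data.
MAX_TRIPS_PER_TAXI = 2

def cap_trips_per_taxi(trips: pd.DataFrame, k: int = MAX_TRIPS_PER_TAXI) -> pd.DataFrame:
    """Keep at most k trips for each taxi_ID.

    The rows kept for one taxi depend only on that taxi's own rows,
    so the step acts separately on each distinct taxi_ID.
    """
    return trips.groupby("taxi_ID", sort=False).head(k).reset_index(drop=True)

# Toy stand-in for the private data (pickup_area is an illustrative column).
trips = pd.DataFrame({
    "taxi_ID": [1, 1, 1, 2, 3, 3],
    "pickup_area": [8, 8, 32, 8, 76, 32],
})
capped = cap_trips_per_taxi(trips)
```

A write-up entry for this step would state the cap k, note that it was fixed in advance, and point to the file and function where the cap is applied.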
(2) Privatization:
This section documents steps that run queries on the private data, add privatization noise as needed, and produce differentially private output.
For every query (or operation that processes the private data, whatever makes the most sense for your algorithm) you perform during privatization, please include the following information:
- The portion of the epsilon budget allotted to the query/operation
- The formal definition of the query/operation (using pseudocode or mathematical notation, in addition to the English description). In particular, every histogram must include a formal definition of the set of bin labels in pseudocode or mathematical notation, so there is no ambiguity about what operation is being performed.
- The query/operation’s sensitivity (and the justification for that sensitivity)
- How that query/operation’s input/output relates to the other queries in your algorithm. In particular, note explicitly whether you’re assuming two queries have parallel composition (i.e., one distinct taxi_ID can impact at most one of the queries, the queries are “disjoint”), or sequential composition (i.e., one distinct taxi_ID can impact both queries, the queries “overlap”).
- Indicate where in your source code the query/operation occurs.
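To make the checklist above concrete, here is a hedged sketch of how a single trip-level histogram query might be written down. Everything specific in it is an assumption for illustration: the `pickup_area` column, the bin labels, the per-taxi cap `k`, the epsilon share, and the add/remove-one-taxi notion of neighboring datasets. It uses the standard Laplace mechanism, which may or may not be what your algorithm uses.

```python
import numpy as np
import pandas as pd

# Formal bin labels: a hypothetical public domain for the illustrative column.
BIN_LABELS = [8, 32, 76]
MAX_TRIPS_PER_TAXI = 2  # assumed cap k, enforced during preprocessing
EPSILON_QUERY = 0.5     # assumed portion of the budget allotted to this query

def noisy_trip_histogram(trips, bin_labels, k, epsilon, rng):
    """Trip-level histogram over bin_labels, released with Laplace noise.

    Sensitivity (assuming add/remove-one-taxi neighbors): one taxi_ID
    contributes at most k trips and each trip lands in exactly one bin,
    so the L1 sensitivity is at most k.
    """
    counts = (
        trips["pickup_area"]
        .value_counts()
        .reindex(bin_labels, fill_value=0)  # fixed label set, zeros included
        .astype(float)
    )
    noise = rng.laplace(loc=0.0, scale=k / epsilon, size=len(bin_labels))
    return counts + noise

# Toy stand-in for the capped private data.
trips = pd.DataFrame({
    "taxi_ID": [1, 1, 2, 3, 3],
    "pickup_area": [8, 8, 32, 8, 76],
})
rng = np.random.default_rng(0)
released = noisy_trip_histogram(trips, BIN_LABELS, MAX_TRIPS_PER_TAXI, EPSILON_QUERY, rng)
```

Note that in this sketch one taxi can land in several bins, so the bins are not disjoint at the taxi level; that is why the noise is scaled by k rather than by 1. This is exactly the kind of reasoning we want spelled out next to each query.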
Finally, and this one is important: at the very end of your privatization section, clearly write out the arithmetic showing that the privacy loss budget you allotted across your queries sums to exactly 1 full epsilon and not, for example, 10/9 of an epsilon. Remember that arithmetic on fractions is tricky, and if you only do the math in your head it’s easy to end up off by a fraction of an epsilon when you’re distributing your budget. Writing it out in one location will make it easier for both us and you to verify it’s correct.
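One low-effort way to check this arithmetic is to do it with exact fractions rather than floats or mental math. The sketch below is only an illustration: the query names and the budget split are hypothetical, and the grouping assumes you have already justified which queries are disjoint (parallel composition) and which overlap (sequential composition).

```python
from fractions import Fraction

# Hypothetical allocation. Each list holds queries assumed to be disjoint
# (each taxi_ID falls in exactly one of them), so the group costs its max.
# The groups themselves are assumed to overlap, so their costs add up.
allocation = {
    "per_partition_histograms": [Fraction(1, 2), Fraction(1, 2), Fraction(1, 2)],
    "taxi_level_counts": [Fraction(1, 3)],
    "fare_marginal": [Fraction(1, 6)],
}

def total_epsilon(groups):
    """Max within each disjoint group, then sum across overlapping groups."""
    return sum((max(eps_list) for eps_list in groups.values()), Fraction(0))

total = total_epsilon(allocation)

# Print the arithmetic in one place so it can be copied into the write-up.
terms = " + ".join(str(max(eps_list)) for eps_list in allocation.values())
print(f"{terms} = {total}")

assert total == Fraction(1), f"budget sums to {total}, not 1"
```

Because `Fraction` is exact, an allocation such as ten queries at 1/9 each shows up as 10/9 instead of being hidden by rounding.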
(3) Post-processing:
Steps which act only on the differentially private results to produce the final results.
- Please go ahead and write up the remaining steps that you use to get from your privatized query results to your final differentially private synthetic data that has the same schema as the input data.
- For instance, how do you assign taxi_ID’s to trips?
- Indicate which sections of your source code perform post-processing (i.e., which components/files only touch the privatized data).
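For a sense of what we’re looking for here, below is a minimal sketch of one way a noisy histogram could be turned into synthetic rows with taxi_ID’s attached. It is not a recommended method, just an example of an unambiguous description: the `pickup_area` column, the function name, and the fixed `trips_per_taxi` block size are all hypothetical.

```python
import numpy as np
import pandas as pd

def synthesize_trips(noisy_counts: pd.Series, trips_per_taxi: int) -> pd.DataFrame:
    """Turn a noisy histogram into synthetic rows with synthetic taxi_IDs.

    Only the privatized counts are read; the private data is never touched.
    """
    # Round and clamp so the counts can be used as row multiplicities.
    counts = noisy_counts.round().clip(lower=0).astype(int)
    pickup_area = np.repeat(counts.index.to_numpy(), counts.to_numpy())
    # Assign fresh IDs in blocks: the first trips_per_taxi rows get ID 0,
    # the next trips_per_taxi rows get ID 1, and so on.
    taxi_id = np.arange(len(pickup_area)) // trips_per_taxi
    return pd.DataFrame({"taxi_ID": taxi_id, "pickup_area": pickup_area})

# Toy privatized histogram (index = bin labels, values = noisy counts).
noisy_counts = pd.Series([2.7, -0.4, 1.2], index=[8, 32, 76])
synthetic = synthesize_trips(noisy_counts, trips_per_taxi=2)
```

The point of the sketch is that every input to the function is already privatized, which is what makes the step post-processing.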
Again, to improve review efficiency this time around, if your write-up doesn’t meet these guidelines, you’ll be getting an email Tuesday morning and we’ll need you to update your write-up before we begin final scoring on it.
If you have any questions, just reach out; we’re happy to help clarify. If you’re not familiar with mathematical notation or pseudocode, remember your basic goal is just to get us something unambiguous to accompany your English description of your algorithm. A sequence of pandas commands with simple, clear variable names (and variable definitions) will work fine. If you have gotten through an undergrad discrete math class, though, please break out the formal notation and clearly define what you’re doing (for example, provide a formal definition for the set of bin labels for your histograms). Be careful to use variable names consistently.
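If you go the formal-notation route, something along these lines is what we have in mind. This is only a sketch under assumptions: the attribute sets A and S, the per-taxi cap k, the budget share, and the add/remove-one-taxi notion of neighboring datasets are hypothetical stand-ins for whatever your algorithm actually uses.

```latex
% A = public set of pickup-area codes, S = public set of shift codes (hypothetical).
% D = multiset of trips after preprocessing, with at most k trips per taxi_ID.
\[
  B = A \times S
  \qquad \text{(set of bin labels)}
\]
\[
  q(D)_b = \bigl|\{\, t \in D : (\mathrm{area}(t), \mathrm{shift}(t)) = b \,\}\bigr|,
  \qquad b \in B
\]
\[
  \Delta_1 q = \max_{D \sim D'} \lVert q(D) - q(D') \rVert_1 \le k
\]
\[
  \tilde{q}(D)_b = q(D)_b + \mathrm{Lap}\!\left(\frac{k}{\varepsilon_q}\right),
  \qquad b \in B
\]
```

Here D ~ D' means the two datasets differ in the trips of one taxi_ID, and ε_q is the portion of the budget allotted to this query. Whatever symbols you pick, define each one once and reuse it with the same meaning throughout.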