
Does knowledge extracted from public data violate privacy?

When we processed the 2019 (true, non-private) public dataset, we found some interesting general patterns that might also hold for the future (2020) data, for example certain heuristic rules or sparsity patterns. We would like to use these observations in our DP algorithm design.
Do you agree that this strategy still satisfies the differential privacy definition, even though we touch the ground-truth data to learn these patterns?

We are aware that if we build many hard rules into our algorithm design, our model might be subject to overfitting and is therefore likely to perform poorly on future data. However, we just want to make sure that we are allowed to extract general (not specific) knowledge from the 2019 data and incorporate it into our algorithm design.


Yep, you can use absolutely anything you want from the 2019 data (and only the 2019 data) to inform your algorithm design without violating differential privacy. The 2019 data we gave you during the development phase is considered "Previously Publicly Released Anonymous Data", which means that using it does not cause any (new) privacy loss. This is a common real-life scenario: organizations often release simply anonymized data for many years before considering a switch to formally private data. In general, using previously released publicly available data to inform the fit/behavior of your algorithm will improve its performance, and understanding your target data context is a good idea. Overfitting is an issue for accuracy, not privacy 🙂
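To make this concrete, here is a minimal sketch of the pattern being discussed: hyperparameters (here, a clipping bound) are tuned on the public 2019 data at no privacy cost, and the privacy budget is spent only when querying the private data via the standard Laplace mechanism. The dataset values, the 95th-percentile heuristic, and the function names are all hypothetical illustrations, not part of the actual challenge setup.

```python
import numpy as np

def laplace_sum(private_values, clip_bound, epsilon, rng=None):
    """Release a differentially private sum via the Laplace mechanism.

    Each record's contribution is capped at clip_bound, so the L1
    sensitivity of the sum is clip_bound and the noise scale is
    clip_bound / epsilon.
    """
    rng = rng if rng is not None else np.random.default_rng(0)
    clipped = np.clip(np.asarray(private_values, dtype=float), 0.0, clip_bound)
    noise = rng.laplace(loc=0.0, scale=clip_bound / epsilon)
    return float(clipped.sum() + noise)

# Hypothetical workflow: choose the clipping bound from the PUBLIC 2019
# data (no privacy cost, per the answer above), then spend epsilon only
# on the private 2020 data.
public_2019 = np.array([3.0, 5.0, 4.0, 100.0, 2.0])   # made-up public records
clip_bound = np.percentile(public_2019, 95)           # heuristic learned publicly
private_2020 = np.array([4.0, 6.0, 3.0, 50.0])        # made-up private records
release = laplace_sum(private_2020, clip_bound, epsilon=1.0)
```

Only the call on the private data consumes privacy budget; inspecting `public_2019` as much as you like is free, which is exactly the point made above.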