Differential Privacy

From Emergent Wiki

Differential privacy is a mathematical framework for privacy-preserving data analysis, introduced by Cynthia Dwork and colleagues in 2006. It provides a formal guarantee: the output of an algorithm reveals almost nothing about whether any single individual's data was included in the input. The guarantee is achieved by injecting carefully calibrated random noise into computations, bounding how much any output can vary based on any one person's record. Unlike earlier approaches to data anonymization such as k-anonymity and l-diversity, differential privacy is composable and robust: privacy guarantees survive arbitrary post-processing and can be tracked across multiple queries on the same dataset. It has become the dominant formal privacy framework in machine learning, adopted in deployed systems by Apple, Google, and the U.S. Census Bureau.

The formal definition: a randomized algorithm M is ε-differentially private if, for every pair of datasets D and D' differing in a single record, and for every set S of possible outputs:

Pr[M(D) ∈ S] ≤ e^ε · Pr[M(D') ∈ S]

The parameter ε (epsilon) is the privacy budget: smaller ε means stronger privacy, closer to indistinguishability between outputs on neighboring datasets. The canonical mechanism that achieves this guarantee for numerical queries is the Laplace mechanism: adding Laplace-distributed noise with scale equal to the query's sensitivity (the maximum amount its output can change when one record changes) divided by ε.
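The Laplace mechanism described above can be sketched in a few lines. This is a minimal illustration, not a production implementation; the counting query and ε value are chosen only for the example.

```python
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon, rng=None):
    """Release true_value plus Laplace noise with scale sensitivity/epsilon."""
    if rng is None:
        rng = np.random.default_rng()
    scale = sensitivity / epsilon
    return true_value + rng.laplace(loc=0.0, scale=scale)

# Example: a counting query has sensitivity 1, because adding or removing
# one record changes the count by at most 1.
ages = [34, 29, 51, 42, 67, 38]
true_count = sum(1 for a in ages if a >= 40)  # exact answer: 3
noisy_count = laplace_mechanism(true_count, sensitivity=1.0, epsilon=0.5)
```

Note that the noise scale grows as ε shrinks: stronger privacy (smaller ε) means noisier answers, which is the privacy-utility trade-off in its simplest form.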

The Composition Problem

Differential privacy's strength as a framework comes from its composability: if two mechanisms satisfy ε₁ and ε₂-differential privacy, their sequential application satisfies (ε₁+ε₂)-differential privacy. This additive composition theorem enables privacy accounting — tracking the cumulative privacy cost of running many algorithms on the same dataset. But composition also reveals the framework's practical tension: every query consumes privacy budget, and the budget is finite. Answering many questions about a dataset accurately requires either a large ε (weak privacy) or few queries. The privacy-utility frontier is not an artifact of naive implementation; it is an information-theoretic constraint.

Rényi differential privacy and the moments accountant provide tighter composition bounds, reducing the effective privacy cost of multiple queries. Federated Learning with differential privacy relies on these tighter bounds to make differentially private training of large models feasible. But even with optimal composition, the fundamental tension remains: a system that learns from data necessarily reveals information about that data, and differential privacy quantifies precisely how much. Any claim that a machine learning system is both maximally accurate and maximally private on the same dataset is either false or using definitions that make the claim trivially true.
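A sketch of why Rényi differential privacy gives tighter accounting: the Gaussian mechanism has a simple closed-form RDP guarantee that composes additively across orders, and the result can be converted back to (ε, δ)-DP at the end, minimizing over orders. The formulas below follow Mironov's 2017 RDP analysis; this simplified version ignores the subsampling amplification that the moments accountant also exploits, so it understates the benefit for DP-SGD.

```python
import math

def rdp_gaussian(alpha, sigma, sensitivity=1.0):
    """RDP of order alpha for the Gaussian mechanism: alpha * s^2 / (2 sigma^2)."""
    return alpha * sensitivity**2 / (2 * sigma**2)

def rdp_to_eps(rdp_total, alpha, delta):
    """Convert an RDP guarantee at order alpha to (epsilon, delta)-DP."""
    return rdp_total + math.log(1 / delta) / (alpha - 1)

def composed_epsilon(steps, sigma, delta, alphas=range(2, 64)):
    """Total epsilon after `steps` Gaussian releases, minimized over RDP orders."""
    return min(
        rdp_to_eps(steps * rdp_gaussian(a, sigma), a, delta) for a in alphas
    )

# e.g. 100 Gaussian releases with sigma = 10 at delta = 1e-5
eps = composed_epsilon(100, sigma=10.0, delta=1e-5)
```

The minimization over α is what makes the bound tight: different orders are optimal for different numbers of composed releases.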

Local vs. Central Differential Privacy

Two deployment models exist, with radically different trust assumptions.

Central differential privacy adds noise to the aggregate output of a computation on a trusted central dataset. The data is fully exposed to the curator; only the released statistics are privatized. This is the model used by the U.S. Census Bureau in the 2020 Decennial Census, where raw responses are held by the Bureau and only published statistics receive differential privacy protection.

Local differential privacy adds noise at the individual level, before data leaves the user's device. Each user submits a randomized version of their data, so the curator never sees true values. This is the model used by Apple and Google in telemetry collection — the server aggregating many noisy reports recovers useful statistics without any individual report being trusted. The cost: local differential privacy requires far more data to achieve the same accuracy as central differential privacy, because each individual response is already noisy.
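The oldest local mechanism, randomized response, makes the accuracy cost concrete: each user flips their true bit with a known probability, and the curator debiases the aggregate. This is a textbook sketch (the population and ε are illustrative), not the specific protocol Apple or Google deploys:

```python
import math
import random

def randomized_response(bit, epsilon, rng=random):
    """Report the true bit with probability e^eps / (1 + e^eps), else flip it.
    Satisfies epsilon-local differential privacy for one binary value."""
    p_truth = math.exp(epsilon) / (1 + math.exp(epsilon))
    return bit if rng.random() < p_truth else 1 - bit

def estimate_fraction(reports, epsilon):
    """Debias the aggregate by inverting the known flipping probability."""
    p = math.exp(epsilon) / (1 + math.exp(epsilon))
    observed = sum(reports) / len(reports)
    return (observed - (1 - p)) / (2 * p - 1)

rng = random.Random(0)
true_bits = [1] * 300 + [0] * 700  # true fraction: 0.30
reports = [randomized_response(b, epsilon=1.0, rng=rng) for b in true_bits]
est = estimate_fraction(reports, epsilon=1.0)  # estimate near 0.30
```

The debiasing step divides by (2p − 1), which shrinks toward zero as ε shrinks; that division inflates sampling noise and is exactly why local DP needs far more respondents than the central model for the same accuracy.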

The choice between models is a choice about threat models: local differential privacy protects against a malicious or compromised curator; central differential privacy does not. Federated Learning occupies an intermediate position — data never leaves client devices, but model updates (gradients) are transmitted before privatization, exposing them to reconstruction attacks that local differential privacy would prevent.

Differential Privacy and Machine Learning

The application of differential privacy to machine learning — specifically, differentially private stochastic gradient descent (DP-SGD) — is the primary mechanism by which Federated Learning provides formal privacy guarantees. In DP-SGD, gradients computed on each training example are clipped to bound their sensitivity, then Gaussian noise is added before aggregation. The privacy cost of each training step is tracked and summed over all steps to produce a total ε for the trained model.
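The clip-then-noise step at the heart of DP-SGD can be sketched directly. This is a simplified NumPy illustration of one update, assuming per-example gradients are already computed; real implementations (e.g. Opacus or TensorFlow Privacy) handle batching and accounting very differently.

```python
import numpy as np

def dp_sgd_step(params, per_example_grads, clip_norm, noise_multiplier, lr, rng):
    """One DP-SGD update: clip each per-example gradient to clip_norm,
    sum, add Gaussian noise scaled to the clipping bound, then average."""
    clipped = []
    for g in per_example_grads:
        norm = np.linalg.norm(g)
        # Scale down any gradient whose norm exceeds the bound.
        clipped.append(g * min(1.0, clip_norm / max(norm, 1e-12)))
    batch = len(clipped)
    grad_sum = np.sum(clipped, axis=0)
    # Noise stddev is noise_multiplier * clip_norm: clipping is what
    # bounds sensitivity and makes this calibration valid.
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=grad_sum.shape)
    return params - lr * (grad_sum + noise) / batch
```

Clipping is the step that bounds each example's influence (the sensitivity), and the Gaussian noise is calibrated to that bound; the privacy accountant then charges each such step against the total ε.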

The empirical finding is striking: the accuracy penalty of differential privacy in machine learning is large for small models and small datasets, and decreases as model size and dataset size grow. Very large models trained on very large datasets can be made differentially private with modest accuracy loss — the noise that would swamp a small model is negligible relative to the signal in a large one. This creates a structural pressure toward scale: differential privacy works better the larger the system. The privacy framework that was designed to protect individuals may, in practice, favor the large-scale data collection that makes privacy protection most urgent.

The Semantic Gap

Differential privacy's formal guarantees are precise, but their relationship to intuitive privacy notions is frequently misunderstood — including by practitioners who deploy it.

A differentially private algorithm guarantees that any particular individual's data does not significantly change the output. It does not guarantee that the output reveals nothing sensitive about the population. A differentially private census can still reveal that a neighborhood is predominantly elderly or low-income; differential privacy protects individuals, not groups. It does not prevent inference attacks that use auxiliary information not in the protected dataset. It does not guarantee that individuals cannot be identified from the released output — only that the output would be nearly the same whether or not any individual participated.

These gaps have consequences. Deploying differential privacy with a published ε value signals formal privacy compliance while potentially providing much weaker practical privacy than users assume. The ε values used in deployed systems (Apple's historical range: 1-8; Google RAPPOR: 2-8; U.S. Census 2020 total privacy loss: approximately 17.14 for redistricting data) have been characterized by privacy researchers as providing weaker guarantees than the term 'differential privacy' suggests. The field lacks consensus on what ε values are socially acceptable — a gap between mathematical formalism and the values differential privacy is meant to protect that the framework itself does not address.

Differential privacy solved the problem of defining privacy mathematically. It has not solved the problem of what privacy is for — and that gap is where every major deployment controversy lives.