By: Zoe Wood
In 2020, the U.S. Census Bureau applied differential privacy to U.S. Census data for the first time. This was in response to a growing understanding throughout the early 2000s that large data sets which have been reasonably de-identified can be cross-referenced with other large data sets to the point of identifying individual data points.
But as the privacy risks associated with collection and publication of personal data on a huge scale increase, the Bureau’s need to collect accurate and comprehensive data every ten years remains constant. And these two requirements have an inverse relationship. The more accurate and useful a data set, the easier it is to re-identify the people in it; the more obscured, the harder it is to use in the service of understanding the nuances of the U.S. population. Moreover, adequate privacy measures are essential to convincing potential census respondents–“especially among people of color, immigrants and historically undercounted groups who may be unsure of how their responses could be used against them”–that responding to the census is safe. Differential privacy provides a sophisticated computational method for balancing privacy with accuracy, calibrated to the nature of a given data set.
Legal census data protection
The tension between usable data and individual privacy is not new to the Census Bureau. The Census is fascinating in the context of privacy law because it represents a well-documented microcosm of government thought on large data sets and privacy, and on which types of data have been considered worthy of protection over the years. As early as 1840, officials responsible for the census became aware that poor response rates among certain professional groups were due to privacy concerns, and the following year, officials were required to treat information “relative to the business of the people” as confidential. By the turn of the century, officials were threatened with fines and eventually imprisonment for revealing protected census information–privacy by threat of punishment.
When legislation around census data privacy began to develop at the beginning of the 20th century, it continued to protect only business data. 1940 marked the first year in which census privacy laws protected census data about people, although the protection was quickly overturned during the Second World War, when the American government used its individual census data (microdata) to intern Japanese Americans and Americans of Japanese ancestry.
13 U.S.C. § 9 passed in 1954. It prohibits use of census data “for any purpose other than the statistical purpose for which it is supplied,” “any publication whereby the data furnished by any particular establishment or individual under this title can be identified,” and “anyone other than the sworn officers and employees of the Department or bureau or agency thereof to examine the individual reports.” Section 9 also makes census data “immune from legal process” and prohibits its use as evidence or for any other purpose in “any action, suit, or other judicial or administrative proceeding.” Notably, Section 9’s consolidation and strengthening of census data privacy law did not prevent the Census Bureau from providing “specially tabulated population statistics on Arab-Americans to the Department of Homeland Security, including detailed information on how many people of Arab backgrounds live in certain ZIP codes” in 2002 and 2003.
Extra-legal census data protection
In 1920, data analysts at the Census Bureau first used (crude) statistical measures to prevent disclosure of, once again, business information. They did so by manually poring over the data sets and hiding potentially compromising information. The first automatically tabulated census arrived in 1950; in 1960, the Bureau’s “disclosure avoidance techniques” for public-use microdata samples involved only the removal of direct identifiers and a geographic population threshold of 250,000. In 1980, the Bureau introduced the “top coding” technique, which eliminates the upper outliers in a data file, on the theory that outliers are more readily identifiable. In 1990, the Bureau top coded more categories of data and introduced “recoding,” which combines categories of data with fewer than 100,000 people or households. The 1990 census also used an “imputation” method: the original data were modeled, the model flagged unique data points, and those points were “blanked” and replaced with a value generated by the model.
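To make top coding and recoding concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the Bureau's procedure: the function names, thresholds, and sample values are all hypothetical, and top coding is shown in one common form (capping values at a threshold) rather than dropping the outlying records.

```python
from collections import Counter


def top_code(values, cap):
    """Top coding (assumed form): report any value above `cap` as `cap`."""
    return [min(v, cap) for v in values]


def recode_small_categories(labels, min_count, other_label="Other"):
    """Recoding: merge categories with fewer than `min_count` members."""
    counts = Counter(labels)
    return [lab if counts[lab] >= min_count else other_label for lab in labels]


# Made-up sample data, for illustration only.
incomes = [32_000, 48_000, 51_000, 75_000, 2_400_000]
occupations = ["teacher", "teacher", "nurse", "nurse", "astronaut"]

print(top_code(incomes, cap=100_000))
print(recode_small_categories(occupations, min_count=2))
```

In both cases the record that stood out (the very high income, the lone astronaut) becomes indistinguishable from at least some other records, at the cost of detail in the published data.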
Despite this increase in sophistication, Dr. Latanya Sweeney, who is the Daniel Paul Professor of the Practice of Government and Technology at the Harvard Kennedy School, analyzed 1990 census data and demonstrated that 87% of the population could be uniquely identified by a combination of ZIP code, date of birth, and gender. Dr. Sweeney’s revelation came in 2000, which was the first year that census results were published online. Even so, the Bureau’s disclosure avoidance techniques progressed slowly. Through the 2010 census, it continued to apply direct de-identification, recoding, top coding, and bottom coding in addition to implementing newer techniques such as swapping (characteristics of unique records are swapped with characteristics of other records), use of partially synthetic (model-generated) data, and noise infusion.
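The kind of uniqueness check behind a finding like Dr. Sweeney's can be sketched in a few lines of Python. This is a rough illustration, not her methodology: the function name, field names, and the four records are hypothetical, made-up values.

```python
from collections import Counter


def unique_fraction(records, quasi_identifiers):
    """Fraction of records whose combination of quasi-identifiers is unique."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    counts = Counter(keys)
    unique = sum(1 for k in keys if counts[k] == 1)
    return unique / len(keys)


# Made-up records with no direct identifiers (no names, no addresses).
records = [
    {"zip": "02138", "dob": "1960-07-31", "gender": "F"},
    {"zip": "02138", "dob": "1960-07-31", "gender": "M"},
    {"zip": "02139", "dob": "1985-01-15", "gender": "F"},
    {"zip": "02139", "dob": "1985-01-15", "gender": "F"},
]

print(unique_fraction(records, ["zip"]))                   # 0.0
print(unique_fraction(records, ["zip", "dob", "gender"]))  # 0.5
```

No single field singles anyone out here, but the combination does for half of the records. Anyone holding a second data set that pairs those same three fields with names could then attach names to the unique rows.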
So what is differential privacy?
Differential privacy begins from the principle that privacy is a property of a computation rather than a property of the output itself. Wikipedia puts it very well: differential privacy is “a system for publicly sharing information about a dataset by describing the patterns of groups within the dataset while withholding information about individuals in the dataset.” This means that a data analyst working with the output of a differentially private algorithm should know no more about any individual in the data set after the analysis is completed than she knew before the analysis began.
The implementation of differential privacy has two major components: (1) a randomized algorithm and (2) ε, which denotes some positive real number. ε is likened to a “tuning knob” which balances the privacy and accuracy of the output. A larger ε makes the output of the differentially private algorithm more accurate and less private; a smaller ε makes it more private and less accurate.
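As a rough illustration of the “tuning knob,” the sketch below adds random noise to a simple count using the Laplace mechanism, a standard textbook choice for differential privacy. It is not the Bureau's production algorithm. The function names, the true count of 1,000, and the ε values are assumptions chosen for the example, and the sensitivity of 1 assumes that adding or removing one person changes the count by at most one.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) noise as the difference of two
    independent exponential draws, each with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(true_count, epsilon, rng, sensitivity=1.0):
    """Return a noisy count; the noise scale is sensitivity / epsilon."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return true_count + laplace_noise(sensitivity / epsilon, rng)


def mean_abs_error(true_count, epsilon, rng, trials=10_000):
    """Average distance between the noisy answer and the true answer."""
    total = sum(
        abs(private_count(true_count, epsilon, rng) - true_count)
        for _ in range(trials)
    )
    return total / trials


rng = random.Random(0)  # seeded only so the illustration is repeatable
for epsilon in (0.1, 1.0, 10.0):
    print(f"epsilon={epsilon}: mean abs error ~ {mean_abs_error(1000, epsilon, rng):.2f}")
```

The average error shrinks roughly in proportion to 1/ε: a small ε buries any one person's contribution under a lot of noise, while a large ε returns a nearly exact count and correspondingly weaker protection.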
While differential privacy will better protect publicly released microdata samples from analysts, it does assume the existence of a trustworthy curator who collects and holds the pre-processed data of individuals. In the case of the U.S. Census, this trusted curator is the Census Bureau and its agents. Adherence to § 9 is still essential to prevent further privacy violations and breaches of public trust. And arguably, as demonstrated in 2002 and 2003, § 9 isn’t strong enough to ensure the ethical use of U.S. census data. Even the most sophisticated iteration of computational privacy is only as strong as the policy of its implementers.