Can an algorithm be illegal? Or could it be wrong to deploy it? The frontiers of mathematics and algorithm design are usually explored in the lab or the classroom, but one technique for protecting privacy—differential privacy—is being challenged in the courtroom.
Attorneys general from 16 states recently joined in a lawsuit against the US Census Bureau. They're contesting the bureau's plan to use a new mathematical approach that adjusts published census results to protect respondents' privacy.
The court is making only a narrow decision about how the privacy algorithm will be applied to the 2020 census, but the debate highlights the complex tradeoffs that every organization faces when balancing privacy with practicality. Here's what you need to know about the lawsuit, and why your company needs to pay attention.
Why differential privacy is important
Differential privacy, a technique for sharing data safely, has emerged over the last decade. Instead of giving someone access to a completely accurate database of personal records, the algorithm adds carefully calibrated noise that makes it hard, or even impossible, to recognize specific people.
The strength of the approach depends on adding just enough noise to obscure individual identities, and adding it in such a way that aggregate analysis is barely affected.
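To make that concrete, here is a minimal sketch of the Laplace mechanism, the textbook building block behind most differential privacy systems. The function and parameter names are illustrative, not drawn from any of the products discussed here:

```python
import numpy as np

def private_count(true_count, epsilon, sensitivity=1.0):
    """Return a differentially private answer to a count query.

    Adding or removing one person changes a count by at most 1 (the
    query's sensitivity), so Laplace noise with scale sensitivity/epsilon
    satisfies epsilon-differential privacy for that query.
    """
    noise = np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_count + noise

# The same query asked twice gives different answers, so no single
# answer pins down whether any one person is in the data.
print(private_count(42, epsilon=0.5))  # e.g. 44.3
print(private_count(42, epsilon=0.5))  # e.g. 39.8
```

Smaller values of epsilon mean more noise and stronger privacy; larger values leave the count more accurate but easier to reverse engineer.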
While the algorithms are relatively new and best practices far from established, enterprises and governments are already exploring the technique's ability to support safe exchanges of data with researchers and policy makers.
This ability to gather aggregate information is what might be attractive to companies. Apple, for instance, is tracking which emojis are used by iPhone owners. Its system uses differential privacy to find which are most popular without gathering all of the text messages sent by the users. The individual phones report results with known errors, but the errors are carefully chosen so they'll cancel out.
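Apple's deployed protocol is considerably more elaborate, but the underlying idea resembles classic randomized response: each phone sometimes reports a random answer instead of the true one, and because the rate of random answers is known, the bias cancels out in aggregate. A hedged sketch, with made-up parameters:

```python
import random

def phone_report(used_emoji, p_random=0.25):
    """Each phone answers randomly with probability p_random,
    so no single report reveals what its owner actually did."""
    if random.random() < p_random:
        return random.random() < 0.5  # coin-flip answer
    return used_emoji

def estimate_usage(reports, p_random=0.25):
    """The known noise rate lets us invert the bias:
    observed = (1 - p_random) * true + p_random * 0.5."""
    observed = sum(reports) / len(reports)
    return (observed - p_random * 0.5) / (1 - p_random)

# 10,000 phones, 30% of which truly used the emoji
reports = [phone_report(random.random() < 0.3) for _ in range(10_000)]
print(estimate_usage(reports))  # close to 0.30
```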
Both Google and Microsoft have active programs supporting the algorithms. They've released open-source tool kits for data-collecting mechanisms that they claim still respect user privacy. The tools tackle a wide variety of use cases, ranging from simple surveys to support for complex machine-learning algorithms.
The Census Bureau's use of differential privacy
The US Census Bureau began exploring the use of differential privacy well before 2020. Traditionally, the answers to the census are protected from disclosure for 72 years by law. Soon after the count is over, though, the bureau shares statistical summaries such as the total count of children living in a particular census tract.
The data is widely distributed and used by researchers and businesses. And the census itself is enshrined in the Constitution as the foundation for drawing the boundaries of congressional districts. That's where the lawsuit comes into the story.
The bureau wants the differential privacy algorithms to block attackers who try to reverse engineer the statistical tables. For example, if there's only one 82-year-old male Inuit living in Wyoming, it would be easy to find the one census block that is home to a person of that description.
The differential privacy algorithm blocks these reconstructions and hides identities by "noising" the number of people, effectively adding and subtracting people in the various categories. Some blocks may gain a few older males, while others may lose them. The algorithm strives to keep overall counts consistent, and many statistical summaries such as the mean age should be close to the original value or even unchanged.
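The bureau's production system (its TopDown algorithm) is far more sophisticated, but a toy version shows the flavor: each block's count gets small integer noise, added and subtracted, while larger aggregates stay close because the errors tend to cancel. The distribution and parameters below are our own illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical counts of older males across 1,000 census blocks
true_blocks = rng.poisson(3, size=1_000)

# The difference of two geometric draws yields two-sided integer noise
# (a discrete analogue of the Laplace mechanism).
epsilon = 0.5
p = 1 - np.exp(-epsilon)
noise = rng.geometric(p, size=1_000) - rng.geometric(p, size=1_000)

# Clamping to zero is post-processing, which costs no extra privacy,
# though it can bias very small counts upward.
noisy_blocks = np.maximum(true_blocks + noise, 0)

print(true_blocks[:5], noisy_blocks[:5])      # individual blocks shift
print(true_blocks.sum(), noisy_blocks.sum())  # totals land close together
```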
What's behind the lawsuit
The problem with differential privacy is that the noise can distort some results, and that possibility is what's behind the lawsuit. The complaint, filed in March by Alabama Attorney General Steve Marshall and joined by Alabama Rep. Robert Aderholt, uses words such as "manipulated," "skewed," and "flawed" throughout.
"Indeed, the only counts that will be unaltered by differential privacy will be the total population of each state, the total housing units at the census block level, and the number of group quarters facilities by type at the census block level," said the filing. "All other tabulations—such as how many people live in a census block, or how many of those people identify as a certain race or ethnicity— will be intentionally skewed."
The states behind the lawsuit are mainly concerned about using the data for redistricting. That arcane ritual is typically handled by committees and is often highly politicized as parties try to anticipate how the potential districts might vote.
Political parties often try to spread out their presumed supporters so they have a slim majority in as many districts as possible while concentrating their opponents in as few districts as possible—a process known as gerrymandering.
Some fear that the added noise will hamper their ability to draw boundaries with enough precision to ensure that the districts are close to the same size while also balancing the other requirements, such as the Voting Rights Act. In the past, battles over redistricting have turned on whether the population counts in each district were accurate within a handful of people.
The Census Bureau's response
The Census Bureau's first response to the lawsuit, a statement from chief scientist John Abowd, emphasizes the need for confidentiality to encourage people to respond. In the past, the bureau made small changes, sometimes swapping people between blocks, but those measures weren't sufficient to stop attackers from reconstructing the real values.
The differential privacy algorithms that the bureau wants to use are more rigorous, the statement explains, and often provide very accurate results.
"Total populations for counties have an average error of +/- 5 persons," according to the statement.
The redistricting process, though, must often draw election districts that are much smaller than counties, and smaller blocks are often distorted more. Some voting rights advocates worry that adding the noise will blur the locations of some minorities and make it harder to draw boundaries that can satisfy the Voting Rights Act.
Examples of skewed data
The brief from Alabama highlights some of the ways that differential privacy can skew the results and lead to confusion. The examples came from applying a test version of the algorithm, distributed by the Census Bureau several years ago, to the 2010 data. Among the findings:
- "The demonstration data for Alabama also reported the (thankfully false) phenomenon of significant numbers of unaccompanied children living in census blocks with no adults." In this case, it was 13,000 census blocks.
- "An extreme example is the census block that contains Washington's Correction Center for Women. In the original 2010 census, the census block was, understandably, approximately 99% female. After the application of differential privacy, the same census block was reported to be only 25% female."
- "Under the differential privacy demonstration data, 60 Alabama 'census places' lost all of their black voting age population, 68 'census places' lost all of their black minor population, approximately 100 'census places' lost all of their Hispanic minor population, and approximately 100 'census places' lost all of their Hispanic voting age population."
The initial response from the Census Bureau acknowledges these issues but points out that it still hasn't chosen a value for epsilon, the key parameter in differential privacy algorithms that controls just how much error is added. (Update: the Census Bureau has since announced that it chose ε=17.14 for the persons file and ε=2.47 for the housing-unit data.)
"We purposefully set the budget lower than ones most likely to be finally chosen [set to favor privacy over accuracy]," Abowd said in his response. "So that we could isolate the distortions and demonstrate the effectiveness of various methodological approaches."
Adding more noise buys more privacy, but at the cost of distortions that are most readily apparent in small populations. Abowd said the Census Bureau plans to release a new version of its demonstration software that uses a value of epsilon closer to what will be used in the final data.
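That tradeoff is easy to see numerically: with the Laplace mechanism, the expected error of a count is proportional to 1/epsilon, so raising epsilon shrinks the distortion while weakening the guarantee. A quick illustration (the simulation is ours; only the last two epsilon values echo the bureau's announced budgets):

```python
import numpy as np

rng = np.random.default_rng(42)

for epsilon in (0.25, 2.47, 17.14):
    # Mean absolute error of Laplace noise with scale 1/epsilon
    errors = np.abs(rng.laplace(scale=1 / epsilon, size=100_000))
    print(f"epsilon={epsilon:>5}: mean absolute error ~= {errors.mean():.2f}")

# epsilon= 0.25: mean absolute error ~= 4.00   (very private, very noisy)
# epsilon=17.14: mean absolute error ~= 0.06   (very accurate, less private)
```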
What it means for businesses
Businesses face similar challenges. Microsoft, for instance, is working with four cancer research institutes to simplify the process of sharing their research data without revealing too much information about patients. They want to explore using the latest artificial intelligence algorithms for analyzing the data.
Will the added noise affect how the machine-learning algorithms converge on a model? Will the noise cancel out? Or will it create anomalies that thwart the algorithm? These are important questions for researchers everywhere.
Google has been experimenting with integrating differential privacy algorithms into its machine-learning powerhouse TensorFlow. It supports an open-source collection of Python routines, TensorFlow Privacy, that brings differential privacy guarantees to models trained with TensorFlow.
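Under the hood, such routines implement DP-SGD: clip each training example's gradient so no single record can dominate, then add Gaussian noise before averaging. Here is a framework-free sketch of that one step, with the clipping norm and noise multiplier chosen only for illustration:

```python
import numpy as np

def dp_sgd_aggregate(per_example_grads, l2_norm_clip=1.0,
                     noise_multiplier=1.1, rng=None):
    """Clip each example's gradient to bound its influence, then add
    Gaussian noise scaled to that bound before averaging the batch."""
    rng = rng or np.random.default_rng()
    clipped = [g * min(1.0, l2_norm_clip / max(np.linalg.norm(g), 1e-12))
               for g in per_example_grads]
    noise = rng.normal(0.0, noise_multiplier * l2_norm_clip,
                       size=per_example_grads[0].shape)
    return (np.sum(clipped, axis=0) + noise) / len(per_example_grads)

batch = [np.random.randn(10) for _ in range(32)]  # 32 per-example gradients
print(dp_sgd_aggregate(batch))
```

Whether a model trained on such noisy gradients still converges to something useful is exactly the question raised above, and in practice it depends on the noise multiplier, the batch size, and the overall privacy budget.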
Banks, for instance, might look for fraud, theft, or attempts to elude money-laundering regulations. Pharma researchers could turn to the algorithm to find outliers in tests that might indicate either better solutions or more dangerous interactions.
For enterprises ready to experiment today, the software routines are already available. For those that are naturally more cautious or hesitant, the Census Bureau's experience with the algorithm will help them decide when adding noise is acceptable and when the added complexity may hurt more than it helps.
Moon Duchin, a professor of mathematics at Tufts University who has studied the area in depth, says it will take decades more study to fully understand the Census Bureau's approach. Still, experimentation shouldn't be curtailed.
"What we learned from experiments is that the algorithmic approach used by the Census Bureau gives significantly better accuracy on moderate-sized geographies than more naive strategies that add comparable levels of noise," she explained. "The bureau's math works out in our favor."