De-Identification Definition Contentious in Do Not Track Talks
The definition of data de-identification has dominated Do Not Track working group discussions over the past few weeks as the World Wide Web Consortium (W3C)-facilitated group marches forward on a compliance document. Since mid-July, when group co-Chairman Justin Brookman, also director of the Center for Democracy & Technology’s Project on Consumer Privacy, released an updated list of de-identification definitions, the group’s public email listserv and Wednesday meetings have centered on the issue. The proposed definitions range from one that dovetails with the EU’s strict standard to more general formulations offered by representatives at Apple and Adobe (http://bit.ly/1oHoeaV).
After moving a tracking preference expression (TPE) document to last call (CD April 25 p9), the DNT group has shifted focus to a tracking compliance document -- best practices intended to satisfy a user’s expectations after expressing a DNT preference (http://bit.ly/1mGE85C). A major point of contention over the past month has been the definition of de-identification, according to emails exchanged over the group’s public listserv.
De-identification has sparked debate in various forums this year. It was a divisive issue during the NTIA-backed facial recognition talks (CD June 24 p6), the subject of personal back-and-forth between researchers, part of the big data conversation at the White House (CD May 2 p5) and an issue many have pushed the FTC and the National Institute of Standards and Technology to take leadership on (CD Aug 8 p9). In question is not only the concept’s definition, but also its effectiveness and who -- if anyone -- is best-positioned to set de-identification standards.
Vincent Toubiana, a research engineer at Alcatel-Lucent, offered a definition “closer to the concept of anonymization” as defined by the EU’s Article 29 data protection working party (http://bit.ly/1qJ5StX). Data is de-identified when three conditions are met: One can’t isolate records corresponding to a device or user; one can’t link two records related to the same device or user; and one can’t “deduce, with significant probability, information about a user or device.” These conditions hold companies accountable to the principle “you should not be able to link two transactions from the same device and you should not be able to link a transaction to another dataset,” he said in a public email (http://bit.ly/VjfxMI).
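The linkability condition at the heart of the dispute can be illustrated with a minimal sketch (the record fields and identifiers below are illustrative assumptions, not taken from any proposal in the working group): records keyed to a persistent device identifier can be joined across sessions, while fresh per-session random identifiers cannot.

```python
import secrets

# Hypothetical log records keyed by a persistent device ID:
# any two records sharing that ID are trivially linkable.
persistent_id = "device-1234"
session_a = {"id": persistent_id, "page": "/news"}
session_b = {"id": persistent_id, "page": "/sports"}
linkable = session_a["id"] == session_b["id"]  # True

# If each session instead carries a fresh random identifier,
# the same join no longer connects the two sessions.
session_c = {"id": secrets.token_hex(8), "page": "/news"}
session_d = {"id": secrets.token_hex(8), "page": "/sports"}
unlinkable = session_c["id"] != session_d["id"]  # True (with overwhelming probability)
```

This is the crux of the Fielding-Toubiana exchange: session-based services rely on joins like the first, while Toubiana’s definition would treat such joinability itself as a failure of de-identification.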
“That is simply wrong,” said Adobe Principal Scientist Roy Fielding, editor of the compliance document, most recently updated Saturday (http://bit.ly/1mGE85C). “All session-based interactions with users depend on the linking of multiple interactions over time,” he said in a public email (http://bit.ly/VjfxMI). “Linking data records doesn’t have anything to do with privacy or EU data protection.” Toubiana didn’t agree: “I still don’t see why a third party needs to be able to link between sessions if it’s not for a permitted use.” It doesn’t matter if third parties link between sessions as long as they can’t link back to the user, Fielding responded. That’s de-identification, he said. Fielding and Toubiana could not be reached for comment.
Fielding proposed a definition: “A data set is considered de-identified when there exists a reasonable level of justified confidence that the data within it cannot be used to infer information about, or otherwise be linked to, a particular user.” Others within the group, such as Apple Embedded Media Director David Singer, wanted to tack on a bullet point that would make companies accept responsibility for downstream re-identification if they choose to pass de-identified data along to others. Brookman said Singer’s definition verged on the concept of a safe harbor. “It sounds like you're trying to force companies who release deidentified data to bind recipients not to identify the data, or they take responsibility in the event the data is subsequently de-identified,” Brookman said in a public email (http://bit.ly/1sSI1dv). “So essentially, there is a safe harbor for entities that bind recipients.” Singer mostly agreed with Brookman’s assessment in follow-up emails. Neither could be reached for comment.
Another definition more explicitly calls for a safe harbor or expert review. National Advertising Initiative (NAI) Senior Director-Technology Jack Hobaugh proposed a delineated review process in which either “a qualified statistical or scientific expert” determines re-identification risks are “very small,” or the company adheres to a safe harbor by removing specified classes of sensitive information -- names, IDs, geographic information, date-specific information, full IP addresses, etc. The company must also implement “technical safeguards that prohibit re-identification” and “administrative controls” limiting access to de-identified data.
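The safe-harbor path of Hobaugh’s proposal, stripping enumerated identifier classes from records, might be sketched as follows. The field names and function here are illustrative assumptions, not NAI’s actual specification:

```python
# Illustrative safe-harbor-style field removal; the field list
# below is an assumption, not NAI's actual enumeration.
SAFE_HARBOR_FIELDS = {"name", "user_id", "zip_code", "birth_date", "ip_address"}

def strip_identifiers(record: dict) -> dict:
    """Return a copy of record with enumerated identifier fields removed."""
    return {k: v for k, v in record.items() if k not in SAFE_HARBOR_FIELDS}

record = {
    "name": "Alice",
    "ip_address": "203.0.113.7",
    "zip_code": "20001",
    "page_viewed": "/news",
}
print(strip_identifiers(record))  # {'page_viewed': '/news'}
```

Field removal alone is only one prong of the proposal; the expert-review alternative exists precisely because stripped datasets can sometimes still be re-identified by joining the remaining fields against outside data.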
The main difference, Singer said, between NAI’s definition and Fielding’s and Toubiana’s definition is that “NAI envisions that the secret key is maintained (but not used); Roy’s and Vincent’s (I think) envision that you couldn’t reidentify even if you wanted to” (http://bit.ly/1ppPorX). Hobaugh didn’t respond to a request for comment.