Monday, May 23, 2011

Our data, ourselves

What if privacy is keeping us from reaping the real benefits of the infosphere?

By Leon Neyfakh, The Boston Globe  May 22, 2011

If you’re obsessive about your health, and you have $100 to spare, the Fitbit is a portable tracking device you can wear on your wrist that logs, in real time, how many calories you’ve burned, how far you’ve walked, how many steps you’ve taken, and how many hours you’ve slept. It generates colorful graphs that chart your lifestyle and lets you measure yourself against other users. Essentially, the Fitbit is a machine that turns your physical life into a precise, analyzable stream of data.

If this sounds appealing — if you’re the kind of person who finds something seductive about the idea of leaving a thick plume of data in your wake as you go about your daily business — you’ll be glad to know that it’s happening to you regardless of whether you own a fancy pedometer. Even if this thought terrifies you, there’s not much you can do: As most of us know by now, we’re all leaving a trail of data behind us, generating 0s and 1s in someone’s ledger every time we look something up online, make a phone call, go to the doctor, pay our taxes, or buy groceries.

Taken together, the information that millions of us are generating about ourselves amounts to a data set of unimaginable size and growing complexity: a vast, swirling cloud of information about all of us and none of us at once, covering everything from the kind of car we drive to the movies we’ve rented on Netflix to the prescription drugs we take.

Who owns the data in that cloud has been the subject of ferocious debate. It’s not all stored in one place, of course — our lives are tracked and documented by a diffuse assortment of entities that includes private companies like Google and Visa, as well as governmental agencies like the IRS, the Department of Education, and the Census Bureau. Up to now, the public conversation on this kind of data has taken the form of an argument about privacy rights, with legal scholars, computer scientists, and others arguing for tighter restrictions on how our data is used by companies and the government, and consumer advocates instructing us on how to prevent our information from being collected and misused.

But a small group of thinkers is suggesting an entirely new way of understanding our relationship with the data we generate. Instead of arguing about ownership and the right to privacy, they say, we should be imagining data as a public resource: a bountiful trove of information about our society which, if properly managed and cared for, can help us set better policy, more effectively run our institutions, promote public health, and generally give us a more accurate understanding of who we are. This growing pool of data should be public and anonymous, they say — and each of us should feel a civic responsibility to contribute to it.

In a paper forthcoming in the Harvard Journal of Law & Technology, Brooklyn Law School professor Jane Yakowitz introduces the concept of a “data commons” — a sort of public garden where everyone brings their data to be anonymized and made available to researchers working in the public interest. In the paper, she argues that the societal benefits of a thriving data commons outweigh the potential risks from the crooks and hackers who might use it for harm.

Yakowitz’s paper has found support among a wider movement of thinkers who believe that, while protecting people’s privacy is certainly important, it should not be our only priority when it comes to managing information. This position might be a hard sell at a time when consumers are increasingly worried about mass data leaks and identity theft, but Yakowitz and others argue that we shouldn’t let fear of such inevitable accidents cloud our ability to see just how necessary data collection is to our progress as a society.

“There are patterns and trends that none of us can discern by looking at our own individual experiences,” Yakowitz said. “But if we pooled our information, then these patterns can emerge very quickly and irrefutably. So, we should want that sort of knowledge to be made publicly available.”

The idea of sharing one’s personal information with researchers and policy makers for the good of society has a long history in the United States, dating back to the early years of the national census in the 1790s. Back then, a failure to comply with the census was considered a serious abdication of one’s duty to the state. According to Douglas Sylvester, a law professor at Arizona State University, that attitude was grounded in a fundamental belief that in order to run a fair democracy, the country’s leaders needed a detailed knowledge of the people they were governing. Anyone who stood in the way of that was publicly shamed.

“During the early years of the census, your name and your economic information were posted — literally posted, on a sheet of paper — in the public square, for anyone to come and see,” said Sylvester, who has written extensively on the history of data-collection and privacy in America. “The idea was that if your name did not appear, your peers would know that you had not cooperated. Providing this information was a civic obligation.”

Of course, census workers still speak of responsible citizenship and good government when they knock on your door and implore you to fill out their forms, and technically, not doing so is still illegal. But the idea that we owe it to our fellow men to share our information with the public is long gone — and the fact that we think of it as “our” information provides a hint as to what has changed. At some point, privacy experts say, Americans started thinking of their personal data as a form of property, something that could be stolen from them if they didn’t vigilantly protect it.

It’s hard to pinpoint exactly when this transformation began, but its roots lie in the dramatic expansion of administrative data-collection that began around the turn of the last century. A more urban and industrialized nation with more public programs meant that more information was being submitted to government agencies, and eventually, people started getting possessive. Then, during the 1960s and early 1970s, according to Sylvester, advances in computing power and the Watergate scandal made people even more nervous about government monitoring, and the notion that one’s personal information required protection from hostile outside forces became deeply ingrained in the nation’s psyche.

“Property rules are where people end up going when something is new and uncertain,” said Yakowitz. “When we aren’t sure what to do with something new, there are always a lot of stakeholders who claim a property interest in it. And I think that’s sort of what happened with data.”

Yakowitz came face-to-face with this attitude, and realized how severely it might impede scholarship, as a researcher at UCLA four years ago, when she was working on a study on affirmative action and student performance. Trying to obtain the data sets she needed for her work proved to be an immensely frustrating experience, the 31-year-old said: Some of the schools that kept the records she was after were uncooperative, and in one case, individual graduates who had heard about her research objected to having their information included in her analysis despite the fact that it had been scrubbed of anything that personally identified them.

Yakowitz was disturbed by the fact that her research could be thwarted just because a few people didn’t want “their” data being used in ways they hadn’t anticipated or agreed to. The experience had a galvanizing effect on Yakowitz, causing her to think more pointedly about how Americans understood their relationship to data, and how their attitudes might be at odds with the public interest. Her concept of a “data commons” came out of that thought process. The underlying goal is to revive the idea that sharing our information — this time, without our names attached — should be seen as a civic duty, like a tax that each of us pays in exchange for enjoying the benefits of what researchers learn from it.

Yakowitz began giving presentations on the data commons in February — she visited Google earlier this month to discuss the idea — and although it won’t officially be published until the fall, her paper has already begun attracting attention among people who care about data and privacy law. In it, she reviews the literature on so-called re-identification techniques — the ways that hackers and criminals might cross-reference big, anonymous data sets to figure out information about specific individuals. Yakowitz concludes that these risks have been overblown, and don’t outweigh the social benefits of having lots of anonymized data publicly available. The importance currently placed on privacy in our culture, she says, is the result of a “moral panic” that ultimately hurts us.

She joins a small chorus of voices from law, technology, and government — united under the banner of a movement known as open data — who are already arguing that the benefits of opening up government records and generally disseminating as much data as possible outweigh the costs.

“If you look at the kinds of concerns that we have as a society, they involve questions about health and our economy, and these are all issues which, if they’re to be addressed from an empirical point of view, require actual data on individuals and organizations,” said George T. Duncan, a professor emeritus at Carnegie Mellon University’s Heinz College, who has written about the tension between privacy and the social benefits of data. “Privacy advocates are so locked into their own ideological viewpoint...that they fail to appreciate the value of the data.”

The potential value of data has arguably never been greater, for the simple reason that there’s never been as much of it collected as there is today. According to a report published this month by the consulting firm McKinsey & Co., 15 out of 17 sectors of the American economy have more data stored per company than the entire Library of Congress. One example of data being leveraged for the public good in a way that would have been unthinkable a short time ago is Google Flu Trends, a tool that helps users track the spread of flu by telling them where, and how often, people are typing in flu-related search terms. The Global Viral Forecasting Initiative, based in San Francisco, uses large data sets provided by cellphone and credit card companies to detect and predict epidemics around the world. In Boston recently, a group of researchers commissioned by the governor to study local greenhouse gas emissions obtained data from the Registry of Motor Vehicles — which keeps inspection records on every car in the city — to find out how much Bostonians were driving.

But advocates of the open data movement see these applications as just a hint of its potential: The more access researchers have to the vast amount of data that is being generated every day, the more accurate and wide-ranging the insights they’ll be able to produce about how to organize our cities, educate our children, fight crime, and stay healthy.

Marc Rodwin, a professor at Suffolk University Law School, has argued for a system in which patient records collected by hospitals and insurance companies — which are currently considered private property, and are routinely purchased in aggregate by pharmaceutical companies — are managed by a central authority and made available, in anonymized form, to researchers. “You can find out about dangerous drugs, you can find out about trends, you can compare effectiveness of different therapies, and the like,” he said. “But if you don’t have that database, you can’t do it.”

Even as such ideas ripen in some corners of the academy and government, proponents of open data are the first to admit that the culture as a whole seems to be heading in the opposite direction. More and more, people are bristling as they realize that everything they do online — including the e-mails they send their friends and the words they search for on Google — is being tracked and turned into data for the benefit of advertisers. And they are made understandably nervous by large-scale data breaches like the one reported last week in Massachusetts, which resulted in as many as 210,000 residents having their financial information exposed to computer hackers. In light of such perceived threats, it’s no wonder the words of privacy advocates are resonating.

Yakowitz and the open data advocates acknowledge that these are reasonable fears, but point out that they won’t be solved by locking down data further. The most damaging breaches, they argue, happen when thieves hack into private sources like credit card processors that are supposedly secure. When we respond by imposing tighter controls on the dissemination of anonymized data, we’re just ensuring that it can’t be used where it might do the most public good.

“The same groups that get really concerned about privacy issues are also the groups that call for more efficiently targeted government resources,” said Holly St. Clair, the director of data services at the Metropolitan Area Planning Council in Boston, where she works on procuring governmental data sets for research purposes. “The only way to do that is with more information — with better information, with more timely information.”

The problem with this vision of the future, according to some privacy experts, is not that large amounts of data don’t come with obvious public benefits. It’s that Yakowitz’s argument presumes a level of anonymization that not only doesn’t exist, but never will. Given enough outside information to draw on, they say, bad actors will always be able to cross-reference data sets with each other, figure out who’s who, and harm individuals who never explicitly agreed to be included in the first place.

In one famous case back in 1997, Carnegie Mellon professor of computer science Latanya Sweeney was able to match publicly available voter rolls to a set of supposedly anonymized medical data, and successfully identify former Massachusetts Governor William F. Weld.

According to Sweeney, currently a visiting professor at Harvard and an affiliate of the Berkman Center for Internet & Society, 87 percent of the US population can be identified by name in this way, based only on date of birth, ZIP code, and gender. Sweeney called Yakowitz’s paper on the data commons “irresponsible” for dismissing the risk of re-identification.
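The mechanics of the attack Sweeney describes are simple enough to sketch in a few lines of code. The following toy example (with invented names and records, not real data) joins a public voter roll against an “anonymized” medical table on the quasi-identifiers both still contain — date of birth, ZIP code, and gender:

```python
# Toy illustration of a linkage (re-identification) attack: an "anonymized"
# record can be tied back to a name by joining on the quasi-identifiers
# it still contains. All data below is invented for illustration.

voter_roll = [  # public record: names attached
    {"name": "Alice Smith", "birthdate": "1945-07-31", "zip": "02138", "gender": "F"},
    {"name": "Bob Jones",   "birthdate": "1960-01-15", "zip": "02139", "gender": "M"},
]

medical_records = [  # "anonymized": names removed, quasi-identifiers kept
    {"birthdate": "1945-07-31", "zip": "02138", "gender": "F", "diagnosis": "hypertension"},
]

def reidentify(anon_rows, public_rows, keys=("birthdate", "zip", "gender")):
    """Match anonymized rows back to named public rows on shared quasi-identifiers."""
    matches = []
    for anon in anon_rows:
        candidates = [p for p in public_rows
                      if all(p[k] == anon[k] for k in keys)]
        if len(candidates) == 1:  # a unique match re-identifies the record
            matches.append((candidates[0]["name"], anon["diagnosis"]))
    return matches

print(reidentify(medical_records, voter_roll))
```

Because only one voter shares that (birthdate, ZIP, gender) combination, the diagnosis is linked to a name — the same logic, at scale, behind Sweeney’s 87 percent figure.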

There are other practical obstacles as well: Data, in today’s economy, is extremely valuable. Even if data sets could be made truly anonymous, Sweeney asks, why should we expect the huge private collectors of data — companies like Google and Facebook, whose business depends on maintaining an exclusive trove of data on their customers — to share what they have for the public good? As data-gathering becomes bigger and bigger business, the data may grow more valuable to society — but it also becomes an asset that companies will fight harder to protect.

As far as Yakowitz is concerned, that’s all the more reason to try to bring about a shift in the way our culture views data. To that end, she proposes granting legal immunity to any entity that releases data into the commons, protecting them from privacy litigation under the condition that they follow a set of strictly enforced standards for anonymization. She also hopes that framing data as a public resource — something that belongs, collectively, to all of us who generate it — will give the public some leverage over big private companies to make their information public.
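The article doesn’t spell out which anonymization standards Yakowitz has in mind, but one widely used formal criterion — k-anonymity, a concept Sweeney herself developed — requires that every combination of quasi-identifiers in a released table be shared by at least k records. A minimal sketch, with invented records, of how such a rule might be checked before release:

```python
# Sketch of a k-anonymity check (one possible anonymization standard; the
# article does not name a specific one). Quasi-identifiers are coarsened,
# then every combination must appear at least k times. Data is invented.

from collections import Counter

def generalize(record):
    """Coarsen quasi-identifiers: keep birth year only, truncate ZIP to 3 digits."""
    return {
        "birth_year": record["birthdate"][:4],
        "zip3": record["zip"][:3],
        "gender": record["gender"],
        "diagnosis": record["diagnosis"],
    }

def is_k_anonymous(rows, k, keys=("birth_year", "zip3", "gender")):
    """True if every quasi-identifier combination occurs in at least k rows."""
    counts = Counter(tuple(r[key] for key in keys) for r in rows)
    return all(c >= k for c in counts.values())

records = [
    {"birthdate": "1945-07-31", "zip": "02138", "gender": "F", "diagnosis": "hypertension"},
    {"birthdate": "1945-02-14", "zip": "02139", "gender": "F", "diagnosis": "asthma"},
]
released = [generalize(r) for r in records]
print(is_k_anonymous(released, k=2))  # both rows now share ("1945", "021", "F")
```

The tradeoff is exactly the one the debate turns on: the coarser the generalization, the harder re-identification becomes, but the less useful the data is to researchers.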

“Right now I feel like the public gets the rawest deal, because a lot of data is collected, and it’s shared with any company that the private data-collector cares to share it with. But there’s no guarantee that they’ll share it with researchers who are working in the public interest,” Yakowitz said. “Maybe I don’t go far enough — maybe we should force these companies to share with researchers. But that’s for another day, I guess.”

Leon Neyfakh is the staff writer for Ideas. E-mail lneyfakh@globe.com.
