*Based on Remarks at the Big Data East Big Data Innovation Conference, September 9, 2015
I believe in the enormous potential of big data. Erik Brynjolfsson and Andrew McAfee, authors of The Second Machine Age and leading scholars of the digital economy, have compared the power and granularity of computational science to the transformation in our understanding of nature that occurred when Antonie van Leeuwenhoek first peered at samples through his newly invented microscope. We are seeing new advances in medicine and social science, and new ways of teasing out causation from correlation.
I believe that questions about how we can harness the benefits of this transformation for our entire society are among the most important issues we face, in the United States and around the globe. If we are to harness the benefits, we also need to understand and address the risks and challenges. We should not be Pollyannas, but neither should we be dystopians. We cannot let the risks blind us to the potential, nor should the potential override addressing the risks.
The most prominent challenges and risks related to big data are grounded in security and privacy. If we do not get these right, we will not get big data right.
Security and privacy go hand in hand. Data security is an element of privacy, one of the fair information practice principles. People have a right to expect that data they entrust to others will not fall into the wrong hands, and that those they entrust with their data will take reasonable steps to protect their own information assets as well as those entrusted to them. Security and privacy are two sides of the same coin: trust, and this coin is essential currency in the digital economy.
Think back a moment to 1995: the greatest obstacle to uptake of e-commerce was getting people to trust giving out credit card information online. That problem was solved by the introduction of SSL encryption. What happens now to the digital economy if we go back to the distrust of 20 years ago?
Companies understand they have a problem, but still not enough is being done. Survey research shows that 40% of corporate boards get briefed on cybersecurity; that means some 60% still do not. SSL and PCI standards protect financial transactions, but SSL encryption needs to extend its umbrella more widely, and more data needs to be encrypted at rest. The uptake and security of encryption is not helped when some in government — including in U.S. law enforcement — promote windows into encryption.
The National Institute of Standards & Technology (NIST) Cybersecurity Framework provides a set of processes for cybersecurity preparedness that is adaptable to organizations of all kinds, regardless of their scale or sophistication. Because NIST is a Commerce agency, I am proud that its framework has been adopted by a wide range of companies as well as insurers and governments. But more organizations need to look at themselves through the lens of the NIST framework.
These security issues are not endemic to big data. But the more data you collect, the greater your attack surface. And aggregating data cuts against security measures such as distributing data and limiting permissions. All this puts a premium on looking closely at the data lifecycle to evaluate how you are going to protect your data, how you are going to detect vulnerabilities and attacks, how you are going to respond, and how you are going to ensure your resiliency in the event of a loss.
Privacy is an inherent issue for big data. Big data changes some important aspects of the way we have dealt with privacy in the past. The ability to aggregate and correlate data from a wide variety of sources means a greater risk of re-identifying individuals from data that supposedly has had identifiers removed. Privacy law and practice have depended heavily on management of what usually gets referred to as “personally identifiable information,” or PII: recognizable identifiers such as names and numbers and, in more sophisticated approaches, other attributes that make someone unique within a dataset. Research on re-identification has shown that it is possible to identify individuals in datasets that have had identifiers removed by correlating them with other available data.
The more data available, the greater the probability of re-identification. This has serious implications for sharing and release of data. Open data programs need to consider this risk carefully. But the right response is not to dismiss de-identification altogether as too risky, or to strip datasets of information, perturb them, or generalize them so much that they become almost useless. It is instead to perform intelligent de-identification and to control the risk of re-identification with security and administrative controls that ensure data is shared only in ways that maintain that protection.
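To make the re-identification risk concrete, here is a minimal sketch of the kind of analysis intelligent de-identification involves. It measures the k-anonymity of a toy dataset — the smallest number of records sharing any combination of quasi-identifiers — and shows how generalizing those fields raises it. The records, field names, and generalization rules are hypothetical, chosen only for illustration:

```python
from collections import Counter

# Hypothetical de-identified records: names are gone, but
# (zip_code, birth_year, gender) remain as quasi-identifiers.
records = [
    ("20008", 1961, "F"),
    ("20001", 1964, "F"),
    ("20007", 1975, "M"),
    ("20002", 1978, "M"),
]

def k_anonymity(rows):
    """Size of the smallest group of records sharing the same
    quasi-identifier values. k = 1 means some record is unique in
    the dataset and could be re-identified by correlating with
    outside data (e.g., voter rolls)."""
    return min(Counter(rows).values())

def generalize(row):
    """Coarsen quasi-identifiers: truncate ZIP to 3 digits and
    bucket birth year by decade."""
    zip_code, year, gender = row
    return (zip_code[:3] + "**", (year // 10) * 10, gender)

print(k_anonymity(records))                            # every record unique: k = 1
print(k_anonymity([generalize(r) for r in records]))   # records now blend: k = 2
```

The point of the sketch is the trade-off discussed above: each round of generalization raises k and lowers re-identification risk, but also discards detail, which is why administrative and security controls on sharing matter alongside the technical transformation.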
The second big privacy impact of massive datasets — the big one that causes fear — is the increased ability to infer other attributes about individuals besides identity. It can be cool … or creepy. It can be harmless … or harmful. It depends on what the data is, who collects it, how it is collected, and how it is used. The burden is on those who collect and analyze data to consider these issues carefully and to be prepared to justify — to customers and consumers, and potentially to regulators or to the world — what data they collect and what they do with it. The emergence of big data, with exploration of unstructured data, unforeseen relationships, and evolving uses, alters the fair information practice principles of data minimization and data retention limits but certainly does not do away with them. Big data does not mean indiscriminate collection of data. Here, my emphasis is on the word “indiscriminate.” Big data means unprecedented scale but still warrants discerning choices about what data to collect and keep.
The indiscriminate collection of data can be a form of hoarding, stashing stuff away because you vaguely think it might come in handy someday. Most of us have some of that in our basements, and now it’s accumulating in digital form in disc drives and data centers around the world. You end up swimming in data, greatly increasing the ratio of noise to signal.
Indiscriminate collection of data increases cybersecurity risk. All that data means more to keep safe, and more to lose in the event of a data breach.
Indiscriminate collection risks creating data pollution, bits of information floating around like particulate matter in the air or piled up like mine tailings. Indiscriminate collection of data also increases the probability that uses of data move farther and farther away from the context in which the data was collected, and in the process diminish the autonomy of the people the data is about. That is what risks turning information technology into an extractive industry, strip mining data for its value without stewardship of the commons from which the data comes.
Law and public policy have an important but limited role to play here. America has a strong body of laws that protect personal privacy, starting with a common law right to privacy that has evolved over 125 years of jurisprudence, a mosaic of sectoral privacy laws for the most sensitive information such as health records, financial information, student records, communications data, children’s data, data breach notification laws in nearly every state, and strong enforcement from the Federal Trade Commission, other federal agencies, and state attorneys general.
As the digital economy and the uses of data grow, though, an increasing amount of activity is outside any of these sectoral laws, even as new ways of measuring creditworthiness, new devices to generate health information, and other new measurements of human activity erode existing sectoral boundaries. Trying to establish practices or rules for rapidly changing technology in a piecemeal fashion is an exercise in running harder to get further and further behind.
What we need instead is a system of baseline privacy rules across the board. That is the idea behind the Consumer Privacy Bill of Rights, which I spearheaded for the Obama Administration and which the Administration put out as a draft bill early this year. It’s a simple, principles-based approach adapting the Fair Information Practice Principles to today’s digital, distributed world. Now, somehow the process of finalizing the proposal for release eroded the broad support that existed when I left the government. But the bill provides a starting point for a broad and flexible model that protects privacy and enables innovation.
In default of a clear model for privacy that maps onto other legal systems, the world is turning to the only model out there, that of the European Union. The EU is in the process of adopting a new data protection regulation that will be binding across all 28 member states, with European institutions in the final stages of negotiating language. The good news is that the final regulation is likely to provide a consistent set of rules across Europe, and is likely to do away with a system that has required prior notification to regulators before processing data. I believe, however, that the proposed regulation is unnecessarily restrictive in ways that can retard the development of big data and the hopes of Europe’s digital agenda.
The regulation could restrict all use of algorithms as “profiling” without discriminating between those that are beneficial, such as financial fraud analytics or drug studies, and those that have unfair and discriminatory effects. The regulation also increases dependence on a system of notice-and-consent when it is widely recognized that the volume and velocity of boxes we are asked to check has made explicit consent a fiction.
In the end, I believe government is only a small part of the solution. Law and public policy cannot provide all the answers and will always lag behind. The greater part is up to you as stakeholders in the global Internet community and as stewards of data.
The lodestar for ethical research has been what’s known as the Belmont Report, which applies to human subject research. If you think about it, data science about human behavior — consumer A/B tests, phone data, ad profiles — is a form of human subject research. The Belmont Report defined three principles of ethical treatment. The first is respect: respect for people as autonomous beings. The second is beneficence: acting to protect well-being. And the third is justice: a fair distribution of benefits and burdens or risks.
In some sense, this brings us back to first principles on privacy. The font of the right to privacy in the United States is a famous 1890 law review article by Louis Brandeis, who later went on to become a Supreme Court justice, and his law partner Samuel Warren. They identified the right of privacy by looking at common law cases involving an implied trust based on the receipt of confidential information. Trust law imposes duties on a trustee to look out for the interests of the beneficiary and to put those ahead of the trustee’s own interests — in the cases Warren and Brandeis looked at, to protect the confidentiality of the information in their custody. That’s a model that is relevant in today’s data-driven world. A data holder may have custody of information about a person, but that custody comes with duties to look out for the interests of that person and not to take advantage. Some people speak of being “stewards” of data. Stewardship is like a trusteeship; it comes with duties.
Security, privacy, and data collection have a common root in information governance: making careful choices from the ground up about what information is strategically important, how it will be collected, used, and stored, who will have access to it for what purposes, how it will be protected, how long it is useful, and what happens when its usefulness ends. How well information is governed and how wisely the power of data is used matter to your organizations. They also matter around the world. As good stewards and wise governors of data, you can build that trust and set norms for beneficial use of data that can be examples to the governments of the world.