Mining Facebook data for science
Organization set to release massive trove of information, strictly for research purposes
It seems Christmas is coming early this year for social scientists.
That’s because a system that would allow researchers to access the massive data troves held by Facebook and other private companies is set to become a reality, just months after Harvard’s Gary King wrote an academic paper describing it.
Along with collaborator Nathaniel Persily at Stanford University, King, the Albert J. Weatherhead III University Professor, created an organization called Social Science One that will lead the effort to identify data inside Facebook, prepare it for researchers, and fund numerous scholars to analyze the data.
Today the organization is making available for research the first of what King says will be many data sets. It contains more than half a trillion numbers, including every link clicked by Facebook users in the last year, information on the types of people who clicked, and indicators of whether links were judged to be intentionally false news stories.
“As social scientists, our goal is to understand and solve the greatest challenges that affect human society,” King said. “Twenty years ago, almost all the data in the world to address these challenges was created by those of us in the academy, by governments and given to us, or by private companies and sold to us,” he said. “But the problem is that even though we have more data than ever before, we have a smaller fraction of the data that the world is creating. Most of the data that would be useful for social science is now locked up inside private companies. Social Science One is an important mechanism for unlocking that data for social scientists.”
The amount of data to which they’ll have access is “extraordinary,” he said.
“In quantity, it may rival the total amount of data that currently exists in the social sciences.”
Outlined by King and Persily in a working paper in April, the framework that underpins Social Science One has two parts.
The first, King said, is a commission of distinguished academics from across the globe who will work with Facebook officials to identify potential data sets. Those data sets will be made available to researchers through a process in which study proposals are submitted and peer-reviewed. Once study ideas are approved, researchers will get access to the data as well as grants to support their work, provided by seven charitable foundations. The foundations span the ideological gamut, but their money will be pooled, and all decisions will be made by academics so no one viewpoint can dominate. The outside researchers will have complete academic freedom without having to give Facebook prepublication approval rights.
“The key part of the process is that the commission, as a trusted third party, can look at the proposals and decide that some not be funded — even if scientifically appropriate — for reasons not publicly known, such as if they would touch on litigation that has not been made public,” King continued. “And if Facebook reneges on this agreement and does not make data available that Social Science One requests, we are obligated to report that to the public. So this system is incentive compatible for the public, for the company, and for the social scientific community. We think of this as essentially a work of political science, where we came up with a constitution that works for all parties.”
Matthew Baum, Marvin Kalb Professor of Global Communications at the Harvard Kennedy School and member of the Social Science One commission, said, “This commission has the potential to open a new chapter in social science research, and in the overall acquisition of knowledge, in which the organizations that possess critically important information about people and institutions, like social media platforms, and professional researchers will be able to more effectively collaborate to address some of the most difficult problems facing our society.”
Social Science One is being incubated at Harvard’s Institute for Quantitative Social Science, which King directs. Over the years, the institute has taken on this type of activity many times. It has regularly incubated and spun off nonprofit research groups and for-profit companies, as well as centers, programs, and research projects now housed at the institute, elsewhere at Harvard, and at other institutions.
Though access to Facebook’s data store is an exciting prospect for researchers, the use, and misuse, of the company’s data has made headlines in recent months. King and colleagues have built safeguards into their procedures to guard against such misuse. The first is simple: To ensure access to the data is limited, academics won’t actually be given the data, but instead will be allowed access to the servers that hold it.
“No academic will be handed data, like before,” King said. “Instead, we’ll make data access available to academics so that individual privacy is always preserved.”
In addition, the organization plans to make use of a mathematical concept known as “differential privacy” to ensure that the data that is made available can’t be traced back to individual users.
“We have some of the leading experts in the world studying this concept here at Harvard, including Cynthia Dwork, the Gordon McKay Professor of Computer Science in the Harvard John A. Paulson School of Engineering & Applied Sciences, and Salil Vadhan, the Vicky Joseph Professor of Computer Science and Applied Mathematics, both of whom are members of the commission,” King said. “The idea is that you can take a data set and add special types of random noise to make it impossible to identify any single person, but when you aggregate it, it doesn’t alter the overall patterns you want to examine.”
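To make the idea of adding random noise concrete, here is a minimal sketch in Python. It is not a description of the system Social Science One or Facebook uses; the article does not say which method they chose. The sketch assumes the Laplace mechanism, one standard way of achieving differential privacy. The link names, click counts, and the privacy parameter `epsilon` are hypothetical, and the sketch assumes each person adds at most one click to at most one of the counts.

```python
import random


def laplace_noise(scale, rng):
    """Draw one sample from a Laplace distribution centered at 0."""
    # The difference of two independent exponential draws with mean `scale`
    # follows a Laplace(0, scale) distribution.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_counts(counts, epsilon, rng):
    """Return a copy of `counts` with Laplace noise added to every value.

    Assumption: one person changes at most one count by at most 1, so the
    noise scale is 1 / epsilon. A smaller epsilon means more noise.
    """
    scale = 1.0 / epsilon
    return {key: value + laplace_noise(scale, rng) for key, value in counts.items()}


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical aggregate click counts, not real data.
true_counts = {"link_a": 120_000, "link_b": 45_000, "link_c": 300}
released = noisy_counts(true_counts, epsilon=0.5, rng=rng)

for link, value in released.items():
    print(f"{link}: true={true_counts[link]} released={value:.1f}")
```

With these made-up numbers, the noise typically shifts each count by only a few clicks. That barely changes a total of 120,000, but it is enough to hide whether any one person’s click is included, which is the trade-off King describes.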
But by far the strongest security measure, King said, is related to the system that will allow academics to use the data. “When academics access the data, every character they type will be logged and audited,” he said. “So if they type the letter K, we will know they typed that letter. So there is no possibility of them copying or misusing the data. This means that we are switching from a model of individual responsibility, that has the researcher violating the rules as a single point of failure, to one of collective responsibility, where no one person can violate privacy without everyone knowing and being able to stop it.”
Ultimately, King said, the goal of Social Science One is to develop ways for Facebook — and eventually other companies — to make their vast data stores available to researchers in the hope of finding solutions to social problems that continue to plague humanity.
“Facebook has highly informative data on 2 billion people,” King said. “That’s an incredible privilege, and with the privilege comes considerable responsibility. It only makes sense that Facebook also use some of that information and power to help the public and contribute to social good.”
It’s an idea that’s not without precedent, King said.
Over the decades, several large companies have built substantial research divisions, perhaps most notably AT&T’s Bell Labs and Microsoft Research, that allowed scientists the freedom to explore topics from information theory to the development of lasers and transistors.
With the release of the first data set today, King and colleagues hope to continue that tradition, but in a manner designed for social science-related businesses.
“This is just our first data set. We have quite a lot of others that will be coming after this, and we have funding from seven generous foundations, and so we hope to begin getting researchers up and running fast,” King said. “We also hope to extend this collaboration beyond Facebook and to partner with other companies as well.
“The discoveries we make using these data sets are not going to interrupt these companies’ businesses, but they could help solve some of the challenges that affect human society,” King said. “And if there’s a way to do that, who wouldn’t want to contribute to that mission?”