Big data, massive potential
Across Harvard, programs and researchers are mining vast quantities of computerized information, sometimes revolutionizing their fields in the process
What if you could predict domestic abuse two years in advance? Or understand the brain’s intricate physical structure, connection by microscopic connection?
What if we needed new ways to think about privacy and how personal information is used today? And what if existing data — the vast mountains generated by modern science, by government and business, and by our every digital move — offered a promise of discovery as exciting as any new scientific frontier, but only if we knew how to understand it?
Across Harvard, faculty members, students, and researchers are examining those questions, engaging the world’s latest information revolution, the one in “big data.” Big data is an offspring of the computer revolution, which has blessed scientists with ever-more powerful computers and analytic tools, has transformed the way we communicate and interact, and has turned each cellphone-equipped one of us into a walking treasure trove of information.
In some ways, “big data” is just what it sounds like. It’s the massive amounts of information generated and gathered by modern technology. It is what we’d traditionally consider scientific data, only in vast quantities just recently collectible through innovative sensors, sampling techniques, microchip arrays, DNA analysis, satellite instruments, chemical screening, and the like. It’s data pouring in from the genome, the proteome, the microbiome. It’s climate data, screening data for promising drug candidates. It’s data on health, both of a person and of the population.
In addition to traditional scientific sources, though, big data is almost everything else now too. New analytical techniques are transforming our ideas of what data is, enabling scientists to analyze less-traditional forms of information for the first time. Our digitally augmented lives, for example, generate an avalanche of it each day, in tweets and posts, in web browser histories and credit card purchases, in GPS-marked cellphone calls, in fitness-tracker readings and ATM transactions.
Nathan Eagle, adjunct assistant professor at the Harvard T.H. Chan School of Public Health, calls it our “data exhaust” and says that with proper analysis it can be used to improve health. Karim Lakhani, Lumry Family Associate Professor of Business Administration at Harvard Business School (HBS), says it’s potential gold to businesses, so he teaches two classes on what it is and how to use it. Jonathan Zittrain, the George Bemis Professor of International Law at Harvard Law School and Harvard Kennedy School, believes that big data — and the algorithms developed to make sense of it — is at once exciting and potentially worrisome, and that thought should be given to who uses it and how. Isaac Kohane, Lawrence J. Henderson Professor of Pediatrics and chair of Harvard Medical School’s (HMS) new Department of Biomedical Informatics, says big data is not just potentially disruptive to the hidebound medical establishment; it’s also potentially lifesaving to its patients.
That’s just a snapshot of the many big-data projects across the University, as researchers increasingly seek out or create from scratch large new data sets and plumb their depths for fresh insights that illuminate the world. Faculty, fellows, students, and staff are figuring out how to manage and understand such data, building places to store it, and finding new, sometimes startling ways to apply it. Students are taking an array of courses across the University, in computational science, computer science, statistics, and bioinformatics, among others, in order to hone their big-data skills.
The big-data tsunami has been building for years, and Harvard faculty and administrators have been working to stay out in front of it. Across the University, faculty members and researchers have recognized the unique challenges that the new data environment presents and have responded. Centers, projects, and departments have been created with big data at least partially in mind. Among them are Harvard’s Institute for Quantitative Social Science (IQSS), opened in 2005, the Institute for Applied Computational Science (IACS), founded in 2010, and — as of July — the Department of Biomedical Informatics at HMS, already noted.
The algorithm is king
Big data, said IQSS Director Gary King, is “a massively important development,” but he added that the data itself isn’t what’s most important. It’s the algorithm.
Algorithms are sets of instructions that tell computers what to do first, second, and third to solve a problem or perform a function. Algorithms underlie search engine operation, for example, telling computers how to rank pages to produce results. Big data, said King, the Albert J. Weatherhead III University Professor, would be useless without the algorithms that manipulate it.
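To make the idea concrete, a toy ranking routine might look like the short Python sketch below. It is purely illustrative (the page texts and the single-keyword score are invented) and is not how any real search engine works.

```python
# Toy illustration of a ranking algorithm: score each page by how many
# times the query term appears, then sort pages from best to worst match.
# (Purely illustrative; real search engines weigh hundreds of signals.)

def rank_pages(pages, query):
    """Return page titles ordered by a simple relevance score."""
    def score(page):
        return page["text"].lower().count(query.lower())
    return [p["title"] for p in sorted(pages, key=score, reverse=True)]

pages = [
    {"title": "Brain mapping", "text": "neurons, synapses, and more neurons"},
    {"title": "Census methods", "text": "sampling a population"},
    {"title": "Neuroscience 101", "text": "neurons connect to neurons via synapses; neurons fire"},
]

print(rank_pages(pages, "neurons"))
# ['Neuroscience 101', 'Brain mapping', 'Census methods']
```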
“There’s 600 million social media posts every day. You can download them all, but what do you do with it?” King said. “More data isn’t helpful; what’s helpful is the methods of making use of it all.”
King is among a group of Harvard faculty members working to build better algorithms. His work to wring meaning out of the global blizzard of social media posts has drawn the interest of people seeking to understand what’s going on in nations around the world, including China, where he has analyzed posts both before and after government censorship to understand not just what people are posting, but also what the government thinks of those posts. One analysis found that the Chinese government didn’t censor criticism of the government or of the nation’s leaders, but did censor posts that encouraged people to organize and mobilize.
King and colleagues have used statistics and machine learning — a form of artificial intelligence that allows computers to incorporate new data in order to act without being specifically programmed to do so — to find out what they want to know. A key point, he said, is that the answers don’t come by digging through the massive data sets until they are found, like so many needles in a haystack. Instead, researchers use the data to find out what isn’t there.
“The goal is inference, using facts you have to learn about facts you don’t have,” King said. “We do this all the time. Someone walks into your office and you infer this person is not going to shoot you. There’s always facts we want to know that we don’t know.”
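For readers who want a concrete picture of the machine-learning side of such work, the sketch below shows a generic text-classification pipeline in Python, built with the scikit-learn library on invented example posts and labels; it is not King’s actual method.

```python
# A minimal sketch of machine-learning text classification, in the spirit
# of the social-media work described above. NOT King's actual method; just
# a generic scikit-learn pipeline trained on made-up example posts.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical labeled posts: 1 = later censored, 0 = left alone.
posts = [
    "everyone meet at the square tomorrow to protest",
    "join the march, bring your friends, organize now",
    "the local official is corrupt and incompetent",
    "traffic was terrible downtown again today",
]
labels = [1, 1, 0, 0]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(posts, labels)

# Likely flagged as censor-prone, given its overlap with the organizing posts.
print(model.predict(["let's gather and organize a protest at noon"]))
```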
Big data, King said, wouldn’t change how one might find the average age of people in the United States. Though today’s massive data sets might allow you to exhaustively add up the ages of every American and then mathematically calculate the average, there’s no point in that, King said. Averaging a random sample of 1,000 people — tried and true — will give you a pretty good answer with a lot less effort.
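King’s sampling point is easy to check numerically. The sketch below uses synthetic ages rather than census data, comparing the average of a random sample of 1,000 with the exhaustive average over a stand-in population.

```python
# Quick numerical check of King's sampling point, using synthetic ages
# rather than real census data: a random sample of 1,000 gets very close
# to the "true" average of the whole population at a fraction of the work.
import numpy as np

rng = np.random.default_rng(0)
population = rng.integers(0, 91, size=10_000_000)      # stand-in population of ages
sample = rng.choice(population, size=1_000, replace=False)

print(population.mean())   # exhaustive average over everyone
print(sample.mean())       # estimate from just 1,000 people; very close
```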
Key to big data, King said, are its diversity and the speed at which it accumulates. Methods have been developed to “scrape” information from websites, for example, opening up vast new sources of information on a dizzying array of topics. A generation ago, students researching voter opinion would likely study poll results collected by someone else. Today, the students can create data scrapers that roam the Internet, gathering specifically targeted information, allowing them to compile their own data sets.
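A bare-bones scraper of the kind students might build could look like the following Python sketch, which uses the requests and BeautifulSoup libraries; the URL and the page selector are placeholders, not a real data source.

```python
# A bare-bones web scraper of the kind students might build: fetch a page,
# pull out the headlines, and append them to a growing data set.
# The URL and the "h2.headline" selector are placeholders, not a real site.
import csv
import requests
from bs4 import BeautifulSoup

url = "https://example.com/election-coverage"   # placeholder address
html = requests.get(url, timeout=10).text
soup = BeautifulSoup(html, "html.parser")

rows = [[h.get_text(strip=True)] for h in soup.select("h2.headline")]

with open("headlines.csv", "a", newline="") as f:
    csv.writer(f).writerows(rows)
```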
Today’s data is heterogeneous, King said. It used to be that data consisted largely of columns of numbers. Today, it can be text, spoken words, video images, and many other unconventional forms. It is generated rapidly, sometimes in real time, like the thousands of Twitter posts sent each second. That velocity leaves researchers managing data sets that are not just large, but also current and constantly changing.
Hanspeter Pfister, An Wang Professor of Computer Science and director of the Institute for Applied Computational Science, said big data is marked by “three V’s”: volume, velocity, and variety.
“Twenty years ago, you were lucky to collect data on a phenomenon. Now, we’re flooded with so much that analysis is a challenge,” Pfister said.
Though analysis and inference are important in big data, sometimes the greatest challenge remains the “big” part. Jeff Lichtman, the Jeremy Knowles Professor of Molecular and Cellular Biology, is engaged in an ambitious, multiyear project to develop a wiring diagram of the brain, called the connectome, that is producing staggering amounts of data.
“The idea is if you want to understand the way brains are connected, you have to take pictures of every piece of brain at enough resolution to see every single wire, every single synapse in there,” Lichtman said.
The project consists of imaging extraordinarily thin slices of a mouse brain — one thousandth the thickness of a human hair — and tracing the neurons, which each have an axon and many dendrites, through which they connect to other neurons. The project, under way since Lichtman arrived at Harvard in 2004, has sped up considerably over the years as technology has improved and as Lichtman and colleagues devised new ways to automate the process.
“If you’re doing it by hand, you’ll never finish,” Lichtman said.
Even with the current state of automation, the finish line is distant. So far researchers have documented a piece about the size of a grain of salt and generated 100 terabytes of data. Next up is a cubic millimeter of brain, to be scanned with a new 61-beam scanning electron microscope, which will generate 2 petabytes of data, or 2 million gigabytes.
The new microscope will reduce the time for a cubic millimeter from 2,000 days to about a month. Even with that advance, however, Lichtman said, an entire brain is beyond current capabilities.
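Back-of-the-envelope arithmetic on the figures above puts that in concrete terms: 2 petabytes in roughly a month implies a sustained data rate approaching a gigabyte per second.

```python
# Back-of-the-envelope arithmetic from the figures above: producing
# 2 petabytes for one cubic millimeter in about a month implies a
# sustained imaging data rate approaching a gigabyte per second.
petabytes = 2
gigabytes = petabytes * 1_000_000        # 2 PB = 2,000,000 GB
seconds_per_month = 30 * 24 * 60 * 60    # about 2.6 million seconds

print(gigabytes / seconds_per_month)     # ~0.77 GB per second, sustained
```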
But researchers may not need to do the whole thing. Their early work may eventually reveal structural patterns through which the entire brain can be understood, Lichtman said. If that’s the case, it might be more fruitful to shift attention from healthy brain structure to discovering what it is that changes in brains affected by disease or aging.
Once gathered, the images have to be analyzed. Lichtman is working with Pfister to devise algorithms to track neurons from one slice of brain to another.
“The connectome is one of my biggest projects. I’ve worked with Jeff for eight years on it,” Pfister said. “Back then, I told him this is crazy and that’s why I wanted to work with him on this project.”
Pfister said the connectome highlights another challenge of big data: bridging the data-human interface.
Once an algorithm extracts information, there remains the problem of presenting it in a format that allows humans to explore and understand it. One way to do that is through visualization: creating representations such as 3-D images that let researchers trace a neuron through many different slices, rotate the image, zoom in or out, enrich it with more variables, or add filters to simplify it.
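As a rough illustration of that kind of visualization (not the connectome team’s actual tools), the Python sketch below plots a synthetic, random-walk neuron path in 3-D with matplotlib; the resulting figure can be rotated and zoomed interactively.

```python
# A minimal sketch of 3-D visualization, using matplotlib and a synthetic
# "neuron" path (random-walk coordinates) rather than real connectome data.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
steps = rng.normal(size=(200, 3))    # fake per-slice displacements
path = np.cumsum(steps, axis=0)      # a wandering 3-D trace, one point per slice

fig = plt.figure()
ax = fig.add_subplot(projection="3d")   # interactive: rotate and zoom with the mouse
ax.plot(path[:, 0], path[:, 1], path[:, 2])
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_zlabel("slice depth")
plt.show()
```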
Lichtman believes that big data may represent a sea change in fields such as biology, traditionally driven by the “big idea,” where scientists hypothesize about how things might be and then test their concepts through experimentation.
“In biology, I think we’re entering the age of big data, which will replace big ideas with what’s actually the case, where the complexity is so great we have to think about it differently, we have to mine it to understand it,” Lichtman said. “Nature, almost always, is much more subtle and convoluted than the simple … thoughts that come out of our minds.”
Building a big-data community
As big data’s potential has become apparent, researchers have come together to work with it and share what they know. The Institute for Quantitative Social Science holds weekly seminars that bring together people from disparate fields, whose projects — which may seem vastly different — are quite similar from a data standpoint.
King offered an example, saying that at one seminar a researcher from the Government Department presented work designed to predict presidential vetoes. The next week, an astronomer from the Chandra X-Ray Observatory explained a process to count photons of light and plot them over time. After stripping away the very different fields and purposes, King said, the veto data looked an awful lot like the photon data.
“The political scientist had developed methods that the astronomer didn’t know about, and the astronomer had developed methods the political scientist didn’t know about,” King said.
The Institute for Applied Computational Science at the Harvard John A. Paulson School of Engineering and Applied Sciences is a place where faculty can come for advice and a place where students can learn to manage the data deluge. The program has grown rapidly and now offers an annual conference, a “computefest” each January, and regular public seminars.
The institute offers a one-year master of science degree, a two-year master of engineering degree, and a graduate secondary concentration in computational science and engineering. Applications to the master’s degree programs have nearly tripled since 2013, from 148 to nearly 400 for fall 2016.
“These are really exciting times,” Pfister said. “I’m extremely energized because there’s a lot of interest in the things Harvard has to offer.”
Software’s hardware backbone
Harvard has an abundance of raw computing power. When Lichtman and his Connectome Project arrived in 2004, technicians wondered what he would do with the seven terabytes of data storage he had requested, at a cost of $70,000.
Today, not only does data storage cost a fraction of that — an eight-terabyte drive runs about $200 — but Lichtman has another advantage. He can depend on Harvard’s Odyssey computer cluster to get the work done.
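Put per terabyte, the two storage figures above work out to roughly a 400-fold drop in price, as the simple arithmetic below shows.

```python
# Simple arithmetic on the figures quoted above: cost per terabyte of
# storage in 2004 versus today.
cost_2004_per_tb = 70_000 / 7   # $10,000 per terabyte in 2004
cost_now_per_tb = 200 / 8       # $25 per terabyte today

print(cost_2004_per_tb / cost_now_per_tb)   # roughly a 400-fold drop
```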
Odyssey is Harvard’s supercomputer, with a brain of 60,000 CPUs. It has 190 terabytes of RAM and more than 20 petabytes of data storage. It covers almost 9,700 square feet. But don’t bother looking for it on campus; most of it is not even close. Odyssey is spread over three locations, connected by high-speed fiber optic cable, and the biggest chunk is almost 100 miles away, in Holyoke, Mass.
“We could see this big-data tsunami on the horizon,” said James Cuff, assistant dean for research computing in Harvard’s Faculty of Arts and Sciences (FAS).
When Cuff arrived at Harvard in 2006, faculty members and lab groups commonly ran their own computer systems, and the biggest machine on campus had just 200 processors. Officials realized early on, he said, that adequate computing power would be as important for a research university in the near future as a deep research library.
“It’s not a question of ‘Is this optional?’ anymore,” Cuff said. “The University has to take this seriously.”
In 2008, Harvard began talking with four other universities: the Massachusetts Institute of Technology, the University of Massachusetts, Northeastern University, and Boston University. Together they pioneered the Massachusetts Green High Performance Computing Center.
Linked to campus via high-speed networks, the center was constructed with both performance and sustainability in mind. It was built on a rehabbed urban brownfield site in Holyoke, incorporates energy-saving features, and gained LEED platinum certification.
So far, three Harvard Schools have signed on: FAS, the Harvard Paulson School, and the Harvard Chan School. There’s also a pilot project with HBS. Altogether, Cuff said, Odyssey has 1,500 active scientists who do more than 2 million “pieces of science” monthly.
Though Odyssey is Harvard’s biggest, there are other large computers on campus, including one at HMS with 6,000 CPUs. The Institute for Quantitative Social Science also runs a large computing cluster, though King said it is a research project marked more by the “tuning” of its capabilities for social science research than by specific hardware hallmarks. There are also ample cloud computing services and off-site storage provided by vendors.
“Now we’re getting to implement some amazing science,” Cuff said.
Waking up to lifesaving data
Kohane would like to see more science emerge from America’s medical records.
The U.S. medical system generates mountains of data, Kohane said, most of which is used for compliance and billing, and little of which is used to improve patient care. Private businesses that service the industry, by contrast, are way ahead in their use of big data. They already mine prescription databases to find out which doctors aren’t prescribing their drugs, to figure out who to sell health insurance to, and to determine how to price health insurance products to employers, Kohane said. He drew a contrast between a data-savvy company such as Amazon.com and the typical physician’s practice.
“Not only does Amazon have your whole history, but it has the history of consumers with similar or intersecting interests, and can make predictions, based on your past and that of consumers like you,” Kohane said. “In medicine, most physicians in the standard, harried, multidoctor practice know very little about you. They don’t have anything in hand other than what they learned in medical school, what they learned in the literature, or what they’ve seen in the patient population.”
In 2009, Kohane and colleagues demonstrated the potential of mining existing medical records, showing that they could identify a patient suffering domestic abuse a full two years before the health care system would otherwise detect it, simply by using discharge data from emergency room patients across Massachusetts.
“There would be a pattern, a gross signal, but any single clinician wouldn’t pick it up,” Kohane said.
The ability to tap existing health care data to improve patient care would speed up a process that today takes years: designing a study, funding it, recruiting participants, and analyzing results. Instead, he said, we need to recognize that in many cases the answers already exist, buried in thousands upon thousands of patient records, which, with care, can be used while safeguarding privacy.
“I would argue that there’s a strong sense now that in many ways medicine is not keeping up with this data-driven approach,” Kohane said. “We don’t know how many people improve on a treatment without an expensive study, but it’s in the data. … It’s hard not to make an improvement at this point.”
Panning for gold in the “data exhaust”
Those interested in improving public health around the world are seizing on increasingly sophisticated mobile phone technology as a possible tool, particularly since it is already in the hands of millions. Eagle, the Harvard Chan School adjunct assistant professor, said several projects in his Engineering Social Systems lab rely on large volumes of mobile-phone data, in one case from 15 million phones, to investigate links between population movement and the spread of malaria.
Researchers can plumb passive data such as a phone’s GPS coordinates, or devise active projects in which participants answer questions, such as whether and how bed nets are used, in exchange for small payments.
“We’re leaving tremendous amounts of data in the wake of our normal daily behavior,” Eagle said. “Tremendous amounts of data [are] sitting there, much of which is not leveraged in any way to help you as the original data creator.”
Big data is even penetrating fields such as history, where traditional practice is more qualitative, built on targeted, close reading of primary documents, according to Gabriel Pizzorno, lecturer on history. Historians, he said, are beginning to understand the “massive potential” of large data sets and of what’s called “digital history” to provide insights that can’t be gleaned through other methods.
Pizzorno described the History of the News project, which uses software tools to scrape online databases to examine how the news about World War I shaped American public opinion. The project is doing topic modeling and sentiment analysis, examining word sets and associations in 20,000 articles.
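As a hedged illustration of what topic modeling involves (not the project’s actual pipeline), the Python sketch below fits a small latent Dirichlet allocation model over a few invented article snippets using scikit-learn and prints the top words in each discovered topic.

```python
# A generic topic-modeling sketch, not the History of the News project's
# actual pipeline: fit a small LDA model over a few invented article
# snippets and print the top words in each discovered topic.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

articles = [
    "troops advance along the western front as shelling continues",
    "submarine attacks on shipping raise fears at home",
    "war bonds drive urges citizens to support the troops",
    "local markets report rising prices for grain and coal",
]

vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(articles)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(counts)

words = vectorizer.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top = [words[j] for j in topic.argsort()[-5:][::-1]]
    print(f"Topic {i}: {', '.join(top)}")
```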
History’s embrace of big-data approaches has been slow, primarily because large digital data sets are scarce, which inhibits the development and refinement of new methodologies, Pizzorno said. Historians usually rely on unpublished, handwritten records that are not easily transformed into machine-readable text. Sources that have been digitized are often held in private databases accessible only through subscription. In addition, he said, the for-profit organizations that run these databases have little incentive to provide the programming interfaces — APIs — needed for computational analysis of their data sets.
Despite these challenges, Pizzorno said, big-data approaches will eventually be viewed as routine in the field.
“History is history,” Pizzorno said. “This is another tool in the toolbox. Eventually, it’ll become so common we’ll drop the ‘digital’ and what you have left is ‘history.’ ”
Education, like medicine, is another field awash in data that can be repurposed to improve performance. The Harvard Graduate School of Education’s Strategic Data Project is using data already gathered for compliance purposes to look for ways to improve school district performance. The project seeks to provide statistical tools so that districts can better understand how well they’re getting kids into and through college. It also looks at teacher recruitment, and how to help teachers become more effective.
“We are trying to move education agencies from simply collecting data for legal compliance purposes towards using that data to inform strategy and manage improvement,” said Jon Fullerton, executive director of the Center for Education Policy Research.
Project Executive Director Nicholas Morgan said the center’s current analyses are potentially the tip of the iceberg. The increasing use of Internet-based teaching tools — math exercises performed on the web for homework, for example — has the potential to capture fine-grained student data, enabling even more personalized analysis based on which paths students take through the exercises, where they pause, and where they move forward.
“There are so many ways to think about analyzing this data,” Morgan said. “That’s a new future.”
Though big data’s potential is still being digested in some academic disciplines, business has always known data’s importance, according to Lakhani, associate professor of business administration and co-chair of the executive education course “Competing on Business Analytics and Big Data.” What has changed in recent years, Lakhani said, isn’t business’s use of data, but the volume and availability. Executives are coming to Harvard to understand what that means to them, their competitors, and their industry.
“Data is nothing new to business. Business has always run on data,” Lakhani said.
Lakhani’s course attracts executives from for-profit enterprises, nonprofits, and the government who want to understand what others are doing with big data, to get a better grasp of the analytic tools available, and to gain a sense of what their own outfits should mine.
Harvard, Lakhani said, is an ideal place to study and work on big data because of its evident strengths in related fields: computer science, statistics, data visualization, business modeling, and econometrics. Potential employers are taking notice.
“Harvard had deep roots in all of these places. It has leading-edge research scholars,” Lakhani said. “Smart Harvard students are in demand on Wall Street, in Silicon Valley. McKinsey is forecasting a shortage of 1 million data scientists by 2020.”
What’s a society to do?
As some analysts explore what we can do with big data, others are thinking about what we should do with it.
Zittrain, who is also director and faculty chair of the Berkman Center for Internet & Society and is a professor of computer science at the Harvard Paulson School, said the intersection of big data, artificial intelligence, and our always-on world has created interesting moral, ethical, and potentially legal issues. And he believes that the relatively neutral ground of universities is a good place to sort that out.
Questions exist about the unprecedented ability to use massive amounts of data — about society and individuals alike — to influence people’s behavior, about who wields that power, and about whether there should be checks on it. In a recent lecture, Zittrain described how, during the 2010 congressional election, Facebook salted some news feeds with get-out-the-vote messages, such as: today is election day; here’s your polling place; these friends of yours have voted. After the ballots were tallied, those who received the messages had voted more often than the general population.
While increasing voter turnout is a social good, Zittrain pointed out that the margin would have been enough to tip the dead-even 2000 presidential election one way or the other. What if in a future election, Zittrain asked, Facebook founder Mark Zuckerberg decided he wanted a particular candidate to win and sent get-out-the-vote messages only to that person’s supporters?
In another example, Zittrain described businesses that collect all sorts of data on their customers, mainly to sell them things more efficiently. It’s one thing if the data is helping you to find things you want or need, but what if information about your buying decisions on everyday items like books reveals markers of impulsiveness? What if that data was sold to your local car dealer when you walked in the showroom?
“There’s surely some obligation to shoot straight,” Zittrain said. “We’re just at the beginning stages of this, which is what makes it so interesting. … There’s something that almost every discipline has to look at. … From an academic standpoint, it’s just a wonderful set of problems, open to questioning from multiple disciplines, with real issues at stake, making it fertile and edgy.”