Harvard and MIT Researchers Reflect on the Perils and Promise of Open Data in MOOCs

August 18, 2014

A follow-up study led by a joint team of Harvard and MIT researchers explores the promise and perils of de-identifying learner data from MOOCs (massive open online courses) and offers recommendations for balancing privacy with open data.

The dataset (made available in May) contains the original learning data from the 16 HarvardX and MITx courses offered in 2012-13 that formed the basis of the first HarvardX and MITx working papers (released in January) and underpin a suite of powerful open-source interactive visualization tools (released in February).

Led by John P. Daries, Senior Research Analyst at MIT (Institutional Research/Office of the Provost), the new report takes a deep dive into the team's motivations for releasing learner data, the contemporary regulatory framework governing student privacy, the team's efforts to comply with those regulations in creating an open data set from MOOCs, and some analytical consequences of de-identification.

Published in acmqueue, the full report, "Quality social science research and the privacy of human subjects require trust," is available online.

Within hours of the release, original analysis of the data began appearing on Twitter, with figures and source code. Two weeks after the release, the data journalism team at The Chronicle of Higher Education published "8 Things You Should Know about MOOCs," an article that explored new dimensions of the data set, including the gender balance of the courses. Within the first month of the release, the data had been downloaded more than 650 times. With surprising speed, the data set began fulfilling its purpose: to allow the research community to use open data from online learning platforms to advance scientific progress.

The researchers conclude that "de-identification procedures necessitate changes to data sets that threaten replication and extension of baseline analyses."
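The intuition is that de-identification techniques such as k-anonymity protect rare, easily re-identified learners by generalizing or suppressing their records, and those rare learners are often exactly the unusual cases that shape baseline statistics. The Python sketch below is a minimal, hypothetical illustration (the records, field names, and anonymity threshold are invented, not taken from the HarvardX/MITx pipeline) of how suppressing rare quasi-identifier combinations can shift a simple summary statistic such as a certification rate.

    # Toy illustration (not the authors' pipeline) of how k-anonymity-style
    # suppression can shift a baseline statistic. All values are hypothetical.
    from collections import Counter

    K = 3  # hypothetical anonymity threshold

    # Synthetic learner records: (country, gender) are quasi-identifiers,
    # `certified` is the outcome a researcher might want to replicate.
    records = [
        {"country": "US", "gender": "f", "certified": 1},
        {"country": "US", "gender": "f", "certified": 0},
        {"country": "US", "gender": "f", "certified": 1},
        {"country": "US", "gender": "m", "certified": 0},
        {"country": "US", "gender": "m", "certified": 0},
        {"country": "US", "gender": "m", "certified": 1},
        {"country": "IN", "gender": "m", "certified": 1},
        {"country": "IN", "gender": "m", "certified": 1},
        {"country": "IN", "gender": "m", "certified": 1},
        # Rare combinations below are the ones suppression removes.
        {"country": "IS", "gender": "f", "certified": 1},
        {"country": "MN", "gender": "f", "certified": 1},
    ]

    def certification_rate(rows):
        return sum(r["certified"] for r in rows) / len(rows)

    # Count how often each quasi-identifier combination occurs.
    counts = Counter((r["country"], r["gender"]) for r in records)

    # Suppress any record whose combination appears fewer than K times,
    # so every remaining row is indistinguishable from at least K-1 others.
    anonymized = [r for r in records if counts[(r["country"], r["gender"])] >= K]

    print(f"original rate:   {certification_rate(records):.3f}")     # 0.727
    print(f"anonymized rate: {certification_rate(anonymized):.3f}")  # 0.667

In this toy example, the two suppressed records belong to rare demographic groups and both earned certificates, so the certification rate drops after anonymization; the report's broader point is that the learners most likely to be altered or removed by de-identification tend to be the most distinctive ones.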

"To balance student privacy and the benefits of open data, we suggest focusing on protecting privacy without anonymizing data by instead expanding policies that compel researchers to uphold the privacy of the subjects in open data sets," wrote Daries. "If we want to have high-quality social science research and also protect the privacy of human subjects, we must eventually have trust in researchers. Otherwise, we'll always have the strict tradeoff between anonymity and science."

Beyond MOOCs and online learning, the team expects their work to inform broader conversations about the use of open data in the social sciences, motivating either technological solutions or new policies that allow open access to potentially re-identifiable data while regulating how that data is used.

Daries' co-authors are Justin Reich (Harvard), Jim Waldo (Harvard), Elise M. Young (Harvard), Jonathan Whittinghill (Harvard), Daniel Thomas Seaton (MIT), Andrew Dean Ho (Harvard), and Isaac Chuang (MIT).

To learn more: