Big Data, De-identification, and the Avoidance of Statistical Bias

September 21, 2017
2:50pm - 4:00pm
Halligan 102
Speaker: Jim Waldo, Harvard University
Host: Norman Ramsey / Diane Souvaine


The use of “big data” in research promises new insights in many fields. But when large datasets concern human subjects, the individuals' right to privacy raises ethical concerns at the least, and often legal requirements as well. If meeting those requirements introduces statistical bias into the dataset, we may find ourselves having to choose between scientific accuracy and the protection of privacy.

This talk will begin with an example drawn from attempts to share the data generated by the HarvardX and MITx Massive Open Online Courses (MOOCs). We will look at the interpretation of the Family Educational Rights and Privacy Act (FERPA), and then talk about how we approached the de-identification of the dataset to fulfill the requirements of that law. We will then discuss what this did to the data, and why.

Finally, we will talk about some different mechanisms that can be used to de-identify such datasets. While the work is still in progress, we will sketch a set of mechanisms that differ substantially from those generally used and that promise much better fidelity to the original dataset while giving the required level of privacy protection.


Jim Waldo is Gordon McKay Professor of the Practice of Computer Science in the School of Engineering and Applied Sciences at Harvard, where he teaches courses in distributed systems and privacy; the Chief Technology Officer for the School of Engineering and Applied Sciences; and a Professor of Policy teaching topics of technology and policy at the Harvard Kennedy School.

Jim spent most of his life in the technology industry. He was a Distinguished Engineer with Sun Microsystems Laboratories, where he investigated next-generation large-scale distributed systems. He got his start in distributed systems at Apollo Computer, where he designed what became the first incarnation of the Common Object Request Broker Architecture (CORBA). While at Sun, he was the technical lead of Project Darkstar, a multi-threaded, distributed infrastructure for massive multi-player on-line games and virtual worlds; the lead architect for Jini, a distributed programming system based on Java; and an early member of the Java software organization.

Jim is the author of "Java: The Good Parts" (O'Reilly) and co-authored "The Jini Specifications" (Addison-Wesley). He edited "The Evolution of C++: Language Design in the Marketplace of Ideas" (MIT Press). He co-chaired a National Academies study on privacy, and co-edited the report "Engaging Privacy and Information Technology in a Digital Age." He is the author of numerous journal and conference proceedings articles, and holds over 50 patents.

Jim received his Ph.D. in philosophy from the University of Massachusetts (Amherst). He also holds M.A. degrees in linguistics and philosophy from the University of Utah. He is a member of the IEEE and ACM.