Big Data, De-identification, and the Avoidance of Statistical Bias
The use of “big data” in research promises to offer new insights in
many fields. But when large datasets concern human subjects, the
individuals' right to privacy raises ethical concerns at the least,
and often legal requirements as well. If meeting those requirements
introduces statistical bias into the dataset, we may find ourselves
having to choose between scientific accuracy and the protection of
privacy.
This talk will begin with an example drawn from attempts to share the data generated by the HarvardX and MITx Massive Open On-line Courses (MOOCs). We will look at the interpretation of the Family Educational Rights and Privacy Act (FERPA), and then talk about how we approached the de-identification of the dataset to fulfill the requirements of that law. We will then discuss what this did to the data, and why.
Finally, we will talk about some different mechanisms that can be used to de-identify such datasets. While the work is still in progress, we will sketch a set of mechanisms that differ substantially from those generally used and that promise much better fidelity to the original dataset while giving the required level of privacy protection.
Jim Waldo is Gordon McKay Professor of the Practice of Computer
Science in the School of Engineering and Applied Sciences at Harvard,
where he teaches courses in distributed systems and privacy; the Chief
Technology Officer for the School of Engineering and Applied Sciences;
and a Professor of Policy teaching topics of technology and policy at
the Harvard Kennedy School.
Jim spent most of his life in the technology industry. He was a Distinguished Engineer with Sun Microsystems Laboratories, where he investigated next-generation large-scale distributed systems. He got his start in distributed systems at Apollo Computer, where he designed what became the first incarnation of the Common Object Request Broker Architecture (CORBA). While at Sun, he was the technical lead of Project Darkstar, a multi-threaded, distributed infrastructure for massive multi-player on-line games and virtual worlds; the lead architect for Jini, a distributed programming system based on Java; and an early member of the Java software organization.
Jim is the author of "Java: The Good Parts" (O'Reilly) and co-authored "The Jini Specifications" (Addison-Wesley). He edited "The Evolution of C++: Language Design in the Marketplace of Ideas" (MIT Press). He co-chaired a National Academies study on privacy, and co-edited the report "Engaging Privacy and Information Technology in a Digital Age." He is the author of numerous journal and conference proceedings articles, and holds over 50 patents.
Jim received his Ph.D. in philosophy from the University of Massachusetts (Amherst). He also holds M.A. degrees in linguistics and philosophy from the University of Utah. He is a member of the IEEE and ACM.