Big Data, De-identification, and the Avoidance of Statistical Bias
Abstract
The use of “big data” in research promises new insights in many
fields. But when large datasets concern human subjects, the
individuals' right to privacy raises ethical concerns at the least,
and often imposes legal requirements as well. If meeting those
requirements introduces statistical bias into the dataset, we may
find ourselves having to choose between scientific accuracy and the
protection of privacy.
This talk will begin with an example drawn from attempts to share the
data generated by the HarvardX and MITx Massive Open On-line Courses
(MOOCs). We will look at the interpretation of the Family Educational
Rights and Privacy Act (FERPA), and then describe how we approached
the de-identification of the data set to fulfill the requirements of
that law. We will then discuss what this did to the data, and why.
Finally, we will talk about some different mechanisms that can be used
to de-identify such data sets. While the work is still in progress, we
will sketch a set of mechanisms that differ substantially from those
generally used and that promise much better fidelity to the original
data set while giving the required level of privacy protection.
Bio
Jim Waldo is Gordon McKay Professor of the Practice of Computer
Science in the School of Engineering and Applied Sciences at Harvard,
where he teaches courses in distributed systems and privacy; the Chief
Technology Officer for the School of Engineering and Applied Sciences;
and a Professor of Policy teaching topics in technology and policy at
the Harvard Kennedy School.
Jim spent most of his life in the technology industry. He was a
Distinguished Engineer with Sun Microsystems Laboratories, where he
investigated next-generation large-scale distributed systems. He got
his start in distributed systems at Apollo Computer, where he designed
what became the first incarnation of the Common Object Request Broker
Architecture (CORBA). While at Sun, he was the technical lead of
Project Darkstar, a multi-threaded, distributed infrastructure for
massive multi-player on-line games and virtual worlds; the lead
architect for Jini, a distributed programming system based on Java;
and an early member of the Java software organization.
Jim is the author of "Java: The Good Parts" (O'Reilly) and a co-author
of "The Jini Specifications" (Addison-Wesley). He edited "The Evolution
of C++: Language Design in the Marketplace of Ideas" (MIT Press). He
co-chaired a National Academies study on privacy, and co-edited the
report "Engaging Privacy and Information Technology in a Digital Age."
He is the author of numerous journal and conference proceedings
articles, and holds over 50 patents.
Jim received his Ph.D. in philosophy from the University of
Massachusetts (Amherst). He also holds M.A. degrees in linguistics and
philosophy from the University of Utah. He is a member of the IEEE and
ACM.