COMP 150BD - Fall 2014 - Links

Students are encouraged to contribute links for this page

2013 IEEE International Conference on Big Data

Panel Discussion on "Big Data for Good" (also at

Historical timeline of big data

United Nations Global Pulse

Computing Research Association big data white paper

Al Jazeera America series on Living With Data

Al Jazeera America article on targeted ads

Interview about radio astronomy from

Article on big data from NASA

Hadoop tutorial

Twitter/MIT Laboratory for Social Machines

Rachel Hogue's presentation on Big Data in Education

IEEE Spectrum interview with M. Jordan

Noah Daniels' slides

Georgetown University Master's in Technology Management


Handbook of Data Compression, by David Salomon and Giovanni Motta, 5th edition, Springer, 2009

Introduction to Data Compression, by Khalid Sayood, 4th edition, Morgan Kaufmann, 2012

An Introduction to Kolmogorov Complexity and Its Applications, Ming Li and Paul Vitanyi, Third Edition, Springer Verlag, 2008

Introduction to Data Mining, by Tan, Steinbach, and Kumar, Addison-Wesley, 2006.

Applied Data Mining, by Paolo Giudici and Silvia Figini, Wiley, 2009.

Data Mining: Practical Machine Learning Tools and Techniques, by Ian Witten and Eibe Frank, Morgan Kaufmann, 2005.

The Fourth Paradigm: Data-Intensive Scientific Discovery, edited by Tony Hey, Stewart Tansley, and Kristin Tolle

Dataclysm: Who We Are (When We Think No One's Looking), by Christian Rudder, Crown, 2014.

Big data in Education

Video with Mona Vernon and Una-May O'Reilly
Coursera course by Ryan Baker
Wall Street Journal Article
PRWeb article about RANDA project
Stanford Social Innovation Review article
MindShift blog article
Brief history of education big data debate
Campus Technology article
Big Data Landscape article about apps for higher education
Brookings article
Collegestats graphic on big data in education
Elements of Effective E-learning, by Dusti Howell
The Design of Adaptive E-Learning System Based on Students' Learning Styles, by Herman Dwi Surjono
Article on Learning Styles and elearning by Larry McNutt and Marie Brennan

Netflix and Big Data

The Netflix prediction contest
Why Netflix didn't use the winner's code

Real-time processing of Big Data

Thread pools
Thread pools and the worker-queue model

The history of Storm

Documentation for Storm
Blog article by Storm's author
Storm at Wayfair
Celery documentation
Apache Kafka documentation
Comparison of Storm and S4 (from an S4 point of view)

NoSQL Databases

NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence
   by Pramod J. Sadalage and Martin Fowler, Addison-Wesley, 2012
Introduction to NoSQL
Wikipedia on NoSQL

Articles on Genetic Algorithms

Bellingham, Richard. "Using Big Data Analytics and Genetic Algorithms to Predict Street Crime and Optimise Crime Reduction Measures." Economics and Social Research Council. ESRC, 1 June 2013.

Formoso, Carl. "Genetic Search Algorithms for Large Problems." (n.d.): 1-5. Washington State Department of Social and Health Services.

Marczyk, Adam. "Genetic Algorithms and Evolutionary Computation." Genetic Algorithms and Evolutionary Computation. The TalkOrigins Archive, 2004.

Solon, Olivia. "How Big Data Analysts Reappropriate Algorithms from Evolution and Warfare (Wired UK)." Future Science, Culture & Technology News & Reviews. Wired UK, 6 Jan. 2012.

Stuart, Keith Douglas, and Maciej Majewski. "Artificial Creativity in Linguistics Using Evolvable Fuzzy Neural Networks." Springer Link, 2008.

Verma, Gunjan, and Vineeta Verma. "Role and Applications of Genetic Algorithm in Data Mining." International Journal of Computer Applications 48.17 (2012): 5-8. IJCA, June 2012.

Xu Yang, Mingming Zeng, Quanhui Liu, and Xiaofeng Wang. "A Genetic Algorithm Based Multilevel Association Rules Mining for Big Datasets." Mathematical Problems in Engineering. Hindawi Publishing Corporation, 1 July 2014.

The Lambda Architecture

Lambda architecture web site
Big Data, by Nathan Marz and James Warren

Of Algebirds, Monoids, Monads, and Other Bestiary for Large-Scale Data Analytics, by Michael G. Noll

Dealing With Dirty Data

Best Practices in Data Cleaning, by Jason W. Osborne, SAGE Publications, 2013.
Stanford lecture on data cleaning
Wikipedia article on imputing missing values
KDD lecture on Scaling Out Big Data Missing Value Imputations, by Christos Anagnostopoulos

Testing Big Data Applications

Online Controlled Experiments at Large Scale
Effective Testing Strategies for MapReduce Applications
Testing Hadoop Applications
Big Data: Testing Approach to Overcome Quality Challenges
Webinar - Performance Testing Approach for Big Data Applications (Sponsored by Impetus)
What are best methods for testing big data applications?
How many web pages do people visit per day? (blog)
MRUnit Tutorial
Storm Testing API Demo

Big Data in Healthcare

Peter Groves et al., The ‘big data’ revolution in healthcare, McKinsey Quarterly (2013).
Cottle, Hoover, et al., Transforming HealthCare Through Big Data, Institute for Health Technology Transformation (2012).
Jeremy Ginsberg et al., Detecting influenza epidemics using search engine query data, Nature.
Lazer, Kennedy, King, et al. The Parable of Google Flu: Traps in Big Data Analysis, Science.
HIPAA privacy rules
World Bank statistics on health expenditures
Article about big data and heart disease
Healthcare Hashtag Project

Big Data and Robotics

A Survey of Research on Cloud Robotics and Automation. Ben Kehoe, Sachin Patil, Pieter Abbeel, Ken Goldberg
Robots with Their Heads in the Clouds, Aspen Ideas Festival.
NY Times article on surgical robots
NY Times interview with Ken Goldberg
RoboBrain Project at Cornell
Atlantic article on cloud robotics
DARPA robotics challenge
IEEE Transactions on Automation Science and Engineering special issue on cloud robotics
SXSW Panel on Cloud Robotics and Automation

Recommender Systems

Recommender Systems Handbook, by Ricci, Rokach, Shapira, and Kantor, Springer, 2011.
A Survey of Collaborative Filtering Techniques, by Su and Khoshgoftaar, Advances in Artificial Intelligence, 2009.
Evaluation of Item-Based Top-N Recommendation Algorithms, by George Karypis, 10th Conference of Information and Knowledge Management

Data Preprocessing and Cleaning

Lecture on smoothing by Rafael A. Irizarry and Hector Corrada Bravo
Paper on random forests by Leo Breiman
EM tutorial by Jeff Bilmes
Powerpoint presentation on data preprocessing from NYU
Data preprocessing lecture by Han, Kamber, and Pei
Data mining lecture by Tan, Steinbach, and Kumar

Supervised Learning and Data Mining

Data mining in metric space, R. Caruana and A. Niculescu-Mizil, KDD ’04
ROC curves

The geometry of ROC space, Peter Flach, ICML-03

Personalized Medicine

Social networks, web-based tools and diseases, Fabricio F. Costa, Drug Discovery Today, 2013
Individual genomes and personalized medicine, Christos Katsios and Dimitrios H. Roukos, Personalized Medicine, 2010.
The path to personalized medicine, Margaret A. Hamburg, NEJM, 2010.
The Impact of Online Networks and Big Data in Life Sciences, Ruchita Gujarathi and Fabricio F. Costa, Social Networking, 2014.
Big data in biomedicine, Fabricio F. Costa, Drug Discovery Today, 2014.
DrugBank 4.0, Vivian Law et al., Nucleic Acids Research, 2014.
The Gene Ontology (GO) database and informatics resource, The Gene Ontology Consortium, Nucleic Acids Research, 2004.
KEGG: Kyoto Encyclopedia of Genes and Genomes, Minoru Kanehisa and Susumu Goto, Nucleic Acids Research, 2000.
Gene Expression Omnibus: NCBI gene expression and hybridization array data repository, Ron Edgar, Michael Domrachev, and Alex E. Lash, Nucleic Acids Research, 2002.
Lean Big Data integration in systems biology and systems pharmacology, Avi Ma’ayan et al., Trends in Pharmacological Sciences, 2014.
Whole genome sequencing as a diagnostic test: challenges and opportunities, Caitlin C. Chrystoja and Eleftherios P. Diamandis, Clinical Chemistry, 2014.
Biomedical cloud computing with Amazon Web Services, Vincent A. Fusaro et al., PLoS Computational Biology, 2011.
Dynamic Clinical Data Mining: Search Engine-Based Decision Support, Leo Anthony Celi et al., JMIR Medical Informatics, 2014.

Big Data and Psychology

The smartphone psychology manifesto, G. Miller, Perspectives on Psychological Science, 2012.
Crowdsourcing for Cognitive Science - The Utility of Smartphones, Brown HR et al., PLoS ONE, 2014.
Wijnand IJsselsteijn, Human-Technology Interaction Group, Eindhoven University of Technology


Google paper on MapReduce by Dean and Ghemawat
MapReduce on MongoDB
Hadoop main page
Hadoop and MongoDB Use Cases
Survey paper on Parallel Data Processing with MapReduce, by Lee, Lee, Choi, Chung, and Moon
Google Dataflow, possible successor to MapReduce
Apache Spark main page
Sort Benchmark Home Page

Running Hadoop

Cloudera Hadoop tutorial
Michael Noll Python MapReduce Tutorial
VirtualBox VM download/documentation
Cloudera Quickstart VM Download
GitHub repository for Jake Schmitz' presentation