Data Science is Computer Science in a Funhouse Mirror
Abstract
The advent of Big Data has caused a paradigm shift in Computer Science. This has required new types of algorithms, new applications for existing algorithms, and reconsideration of previous assumptions. This talk will survey four ways in which old ideas have been revised in the age of Big Data. 1) In traditional applications of hash functions, collisions are to be avoided. Used properly, however, hash functions with the right sorts of collisions can be used to construct fast indices for large data sets. 2) In traditional analysis of algorithms there is always a linear lower bound, since an algorithm must at least examine all the data points. Examining every point may no longer be possible, so "sketching algorithms" have been devised to summarize large amounts of data. 3) Error-correcting codes can be used to enable communication in the presence of noise, but these codes introduce an overhead that may put too much strain on an already overloaded system. Error-estimating codes with much lower overhead can be used to determine when error-correcting codes are worthwhile and when they are not needed. 4) Clustering is a popular technique for analyzing data, but can have problems in high dimensions. Techniques that identify points that don't cluster well can provide valuable signposts that the current theory needs to be revised or replaced.
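As a rough illustration of the sketching idea in point 2, the following is a minimal Python sketch of a Count-Min sketch, one well-known summary structure for frequency counts over a stream. The abstract does not say which sketching algorithms the talk covers, so the choice of Count-Min here is an assumption, as are the class name `CountMinSketch`, the parameters `width` and `depth`, and the sample stream.

```python
import hashlib


class CountMinSketch:
    """Fixed-size summary of item frequencies in a stream (illustrative only)."""

    def __init__(self, width=256, depth=4):
        # width and depth are hypothetical defaults chosen for the example.
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # One salted hash per row; the salt makes the rows behave like
        # different hash functions.
        digest = hashlib.blake2b(
            item.encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, item, count=1):
        # Increment one counter in every row.
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # Collisions can only inflate a counter, so the minimum over rows
        # is an upper bound on the true count and never an underestimate.
        return min(
            self.table[row][self._index(item, row)]
            for row in range(self.depth)
        )


# Hypothetical stream of items; memory use depends on width * depth,
# not on the length of the stream.
stream = ["a", "b", "a", "c", "a", "b"]
sketch = CountMinSketch()
for item in stream:
    sketch.add(item)

estimate_a = sketch.estimate("a")
```

The table size is fixed in advance, so the summary stays the same size however long the stream grows. The price is that estimates may overshoot the true count when unrelated items collide in every row.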
Bio:
Anselm Blumer is an emeritus professor in the Department of Computer Science at Tufts University. His primary research interest is in machine learning, particularly in applications to biology and medicine. He has also done work on data compression and algorithms for indexing and search.