Mathematics of Big Data: Unifying Spreadsheets, Databases, Matrices, and Graphs

December 3, 2015
2:50 pm - 4:00 pm
Halligan 102
Speaker: Jeremy Kepner, MIT
Host: Soha Hassoun

Abstract

Data processing systems impose multiple views on data as it is processed by the system. These views include spreadsheets, databases, matrices, and graphs. The common theme amongst these views is the need to store and operate on data as whole sets instead of as individual data elements. This work describes a common mathematical representation of these data sets (associative arrays) that applies across a wide range of applications and technologies.

Associative arrays unify and simplify these different approaches for representing and manipulating data into a common two-dimensional view of data. Specifically, associative arrays (1) reduce the effort required to pass data between steps in a data processing system, (2) allow steps to be interchanged with full confidence that the results will be unchanged, and (3) make it possible to recognize when steps can be simplified or eliminated. Most database systems naturally support associative arrays via their tabular interfaces. The D4M implementation of associative arrays uses this feature to provide a common interface across SQL, NoSQL, and NewSQL databases.
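
As a rough illustration of the idea (not the D4M API), an associative array can be sketched as a sparse two-dimensional mapping from (row, column) string keys to values. The sketch below assumes a plain Python dict as the representation; the helper names assoc_add and assoc_transpose and the sample data are hypothetical. It shows the same object read as a table of records or as a weighted edge list, and an operation (element-wise addition) whose inputs can be swapped without changing the result.

```python
def assoc_add(a, b):
    """Element-wise sum of two associative arrays; absent entries count as 0."""
    out = dict(a)
    for key, value in b.items():
        out[key] = out.get(key, 0) + value
    return out


def assoc_transpose(a):
    """Swap row and column keys (reverses edge direction in the graph view)."""
    return {(col, row): value for (row, col), value in a.items()}


# Hypothetical data: each entry is (row key, column key) -> value.
# Read as a graph, an entry is a weighted edge from the row to the column.
calls = {("alice", "bob"): 2, ("bob", "carol"): 1}
emails = {("alice", "bob"): 1, ("carol", "alice"): 4}

total = assoc_add(calls, emails)

# Addition is commutative, so the two inputs can be interchanged.
assert total == assoc_add(emails, calls)

# Transposing twice returns the original array.
assert assoc_transpose(assoc_transpose(total)) == total

for (row, col), value in sorted(total.items()):
    print(row, col, value)
```

Using string keys for both dimensions is what lets one structure stand in for a spreadsheet, a database table, a sparse matrix, or a graph adjacency list in this sketch.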

Bio: Dr. Jeremy Kepner is an MIT Lincoln Laboratory Fellow and also has appointments with the MIT Computer Science and AI Laboratory and the MIT Mathematics Department. Prior to joining MIT, he was a Department of Energy Computational Science Fellow at Princeton University, where he received his Ph.D. in astrophysics. Dr. Kepner leads the supercomputing and big data research efforts at MIT Lincoln Laboratory, which has 3,500 employees and is the largest lab at MIT. His team conducts research in a wide range of computing areas and oversees the operation of a variety of supercomputers that service hundreds of users at MIT.

Throughout his career, the focus of Dr. Kepner's research has been creating and delivering supercomputing systems that require minimal training for users to achieve high performance, thus allowing scientists to be scientists and engineers to be engineers. Dr. Kepner has authored two best-selling SIAM books: "Graph Algorithms in the Language of Linear Algebra" (2011), which is the basis of the new GraphBLAS.org mathematical library standard, and "Parallel Matlab for Multicore and Multinode Computers" (2009), which describes the parallel Matlab techniques used by tens of thousands of scientists and engineers worldwide. Dr. Kepner's next book, "Mathematics of Big Data", will be published by MIT Press in 2016.