Managing Arrays for Science Applications at Scale
Science applications are becoming increasingly data-driven. Researchers are collecting new data at an unprecedented scale, and much of it is stored in multidimensional arrays. The workloads over these arrays consist of complex transformations, many of which query the data spatially. The established relational model of data management cannot support this new class of applications. At the same time, scientists are increasingly conducting their experiments on large, shared-nothing clusters in lieu of purpose-built platforms. As a result, processor time is becoming more plentiful and network bandwidth is the scarcer resource.
In this talk, I will describe my research on efficiently distributing arrays for scientific workloads. This work is done in the context of SciDB, an open-source array database system built for applications with complex analytics. I will first present our optimization of data-intensive queries to minimize their use of network resources. Our approach uses integer programming to assign segments of a distributed query to individual database nodes. The second part of my talk will present research on data placement for elastic array databases. Our partitioning approach minimizes the time needed to reorganize the database for a change in the hardware configuration, while optimizing the layout of multidimensional data structures for spatial queries.
Jennie Duggan is a postdoctoral associate at the Massachusetts Institute of Technology where she works with Michael Stonebraker. She received her Ph.D. from Brown University in December 2012 under the guidance of Ugur Cetintemel. Her research interests include scientific data management, database workload modeling, and cloud computing. She is especially focused on making data-driven science applications fast and scalable.