Data Enrichment for Data Science
Abstract
Data science is built on the power of data processing and data preparation. In this talk, I discuss the challenges of data preparation for end-to-end data science. In particular, I focus on data enrichment via discovery: the problem of discovering and integrating the right data from data lakes to solve a given data science problem. I introduce two paradigms of data discovery. In the first paradigm, the query is a dataset, and a data scientist is interested in interactively finding datasets in data lakes that can be integrated with the query. I introduce a probabilistic framework for searching for the top-k unionable tables and discuss the need for distribution-aware techniques for data discovery. In the second paradigm, search does not start with a query; instead, it is data-driven. I discuss the data lake organization problem: building a directory structure that lets users navigate data lakes as efficiently as possible. I present a navigation model of how users interact with a directory structure and introduce a scalable local search algorithm for optimizing data lake organizations.
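To make the first paradigm concrete, the following is a minimal sketch of top-k unionable table search. It is not the probabilistic framework presented in the talk: as a stated assumption, it substitutes plain Jaccard overlap between column value sets as the unionability signal, and every name in it (`Table`, `jaccard`, `unionability`, `top_k_unionable`, and the toy tables) is hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple

# Assumption: a table is modeled as a mapping from column name to its values.
Table = Dict[str, List[str]]


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two value sets (0.0 when both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def unionability(query: Table, candidate: Table) -> float:
    """Average, over query columns, of the best Jaccard match among
    the candidate's columns. A stand-in score, not the talk's measure."""
    if not query or not candidate:
        return 0.0
    candidate_sets = [set(values) for values in candidate.values()]
    total = 0.0
    for values in query.values():
        query_set = set(values)
        total += max(jaccard(query_set, c) for c in candidate_sets)
    return total / len(query)


def top_k_unionable(query: Table, lake: Dict[str, Table], k: int) -> List[Tuple[str, float]]:
    """Score every table in the lake and return the k best (name, score) pairs."""
    scored = [(name, unionability(query, table)) for name, table in lake.items()]
    # Sort by descending score; break ties by table name for determinism.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:k]


# Toy query table and data lake (made-up data for illustration only).
query = {
    "city": ["Toronto", "Boston", "Paris"],
    "country": ["Canada", "USA", "France"],
}
lake = {
    "cities_eu": {"town": ["Paris", "Berlin"], "nation": ["France", "Germany"]},
    "cities_na": {"name": ["Toronto", "Boston", "Chicago"], "state": ["Canada", "USA", "USA"]},
    "products": {"sku": ["A1", "B2"], "price": ["10", "20"]},
}
result = top_k_unionable(query, lake, k=2)
```

Here `result` ranks `cities_na` ahead of `cities_eu`, and `products` is excluded because it shares no values with the query. An exhaustive scan like this costs time linear in the size of the lake per query, and raw value overlap ignores how values are distributed within a column; both limitations are consistent with the abstract's call for scalable, distribution-aware discovery techniques.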