Data Enrichment for Data Science

March 4, 2019
3:00 PM
Halligan 209
Speaker: Fatemeh Nargesian, University of Toronto
Host: Jeff Foster

Abstract

Data science is built on the power of data processing and data preparation. In this talk, I discuss the challenges of data preparation for end-to-end data science. In particular, I talk about data enrichment via discovery -- the problem of discovering and integrating the right data from data lakes to solve a given data science problem. I introduce two paradigms of data discovery. In the first paradigm, the query is a dataset, and a data scientist is interested in interactively finding datasets in data lakes that can be integrated with the query. I introduce a probabilistic framework for searching for the top-k unionable tables and discuss the need for distribution-aware techniques for data discovery. In the second paradigm, search does not start with a query; instead, it is data-driven. I talk about the data lake organization problem: building a directory structure that enables users to navigate data lakes as efficiently as possible. I present a navigation model of how users interact with a directory structure and introduce a scalable local search algorithm for optimizing data lake organizations.

Bio

Fatemeh Nargesian is a PhD candidate in the Data Curation Group of the Department of Computer Science at the University of Toronto. Her primary research interests are in the data management challenges of end-to-end data science. A paper she co-authored on data discovery received the Best Demonstration Award at VLDB 2017. While at the University of Toronto, Fatemeh was a joint research intern at IBM Research-NY. Prior to the University of Toronto, she worked on clinical data management in the Clinical Informatics Research Group at McGill University, and received M.Sc. degrees in Computer Science from the University of Ottawa and in Artificial Intelligence from Sharif University of Technology.