Errors and Explanations in Data Transformations: When Provenance is not Enough

April 19, 2011
4:30 pm - 6:00 pm
Halligan 111


Data transformations from a source to a target dataset are ubiquitous today and can be found in data integration, data exchange, and ETL tools. Users often detect errors or surprising observations in the target data. For example, a user may detect that an item in a target data instance is incorrect: the tuple should not be there, or some of its attribute values are erroneous. She would then like to find out which of the many input tuples that contributed to the incorrect output is faulty. It is critical that the error be traced and corrected in the source data, because once an error is identified at its source, one can prevent it from propagating to multiple items in the target data. This can be viewed as a form of “post-factum” data cleaning: while in standard data cleaning one corrects errors before the data is transformed and integrated, in our setting the errors are detected only after the data has been transformed. A first approach would be to exploit provenance information to solve these problems; however, such an approach quickly becomes impractical, or even ineffective. In this talk, I will present examples that showcase this fact, and make an argument for introducing causal reasoning to databases. Causality is related to provenance, yet it is a more refined notion: causality can explain results and trace errors by returning the causes of query results and transformations, ranked by their degree of responsibility. I will present our adaptation of the notions of causality and responsibility to a database setting, discuss complexity results, including a complete dichotomy for computing responsibility for conjunctive queries, and demonstrate results for error tracing in a practical context-aware recommendation system.

Bio: Alexandra is currently a postdoctoral research associate with Dan Suciu in the database group of the University of Washington. She completed her PhD at the University of California, Berkeley, advised by Joe Hellerstein and Carlos Guestrin. Her research interests are in the area of database systems, focusing on issues of provenance, cleaning, and result explanations.