The Theory and Practice of Data Description

March 6, 2006
12:00 pm - 1:00 pm
Halligan 111

Abstract

Massive amounts of useful data are stored and processed in non-standard or ad hoc formats for which critical tools like parsers and formatters do not exist. Ad hoc data formats are often poorly documented, and the data itself can be very large and contain a significant number of errors, such as missing or malformed data and out-of-range values. Traditional databases and XML systems provide rich infrastructure for processing well-behaved data but are of little help when dealing with data in ad hoc formats. I will discuss my attempts to address the challenges of ad hoc data through my work on the PADS project. I will present an introduction to PADS/ML, a declarative data description language that permits analysts to describe both the physical layout of their data and its semantic properties. From a description, the PADS compiler can automatically generate a collection of useful data-processing tools for the data source described, including parsing routines, statistical profiling tools, and translators to standard formats like XML. I will discuss the formal semantics of the PADS language, two of its essential properties, and a number of its applications. Finally, I will describe support for querying ad hoc data with the PADS tool PADX. I will discuss PADX from the user’s perspective and review the main challenges encountered in implementing it, along with their solutions.

Bio

Yitzhak Mandelbaum is a Ph.D. candidate in computer science at Princeton University, where he will receive his Ph.D. this summer. In 1999, he received a bachelor’s degree in computer science from Princeton University. His research interests include programming languages, language-based security and reliability, and applications of programming language technology to data processing and analysis.