AT&T Labs Research, USA
4pm Thursday, 14th January, 2009
Room 4.31/33, Informatics Forum
XML. HTML. CSV. JPEG. MPEG. These data formats represent vast quantities of scientific, governmental, industrial, and private data. Because the formats have been standardized and are widely used, many reliable, efficient, and convenient tools exist for processing such data. In an ideal world, all data would be in such formats. In reality, however, we are not so fortunate. Instead, vast amounts of data exist in ad hoc formats, which forces domain experts to waste valuable time on low-level parsing tasks.
In this talk, I will describe the PADS data description language, which addresses this problem. PADS allows users to describe both the physical layout of ad hoc data sources and the semantic properties of that data. The descriptions are concise enough to serve as "living" documentation, yet flexible enough to describe most of the formats that we have seen in practice. In addition, we have developed a multi-phase machine-learning algorithm that can automatically infer a PADS description from sample data. Given a PADS description, the PADS compiler generates libraries and tools for manipulating the associated data, including parsing routines, statistical profiling tools, translation programs that produce well-behaved formats such as XML, and tools for running queries over raw PADS data sources. As I describe PADS and its associated tools, I will highlight how various ideas from the programming language research community have informed the design and implementation of the PADS system.
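To give a rough feel for the idea of a declarative data description, the sketch below pairs a physical layout (separator and per-field conversion) with semantic constraints, and derives a parsing routine from it. It is written in Python rather than in PADS syntax, and every name in it (Field, Record, make_parser, the log-entry format) is hypothetical and chosen only for illustration; it does not reproduce the PADS language or its generated libraries.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional, Tuple


@dataclass
class Field:
    """One field of a record (hypothetical, not PADS syntax)."""
    name: str
    convert: Callable[[str], Any]                   # physical layout: raw text -> value
    check: Callable[[Any], bool] = lambda v: True   # semantic property of the value


@dataclass
class Record:
    """A separator-delimited record made of fields."""
    fields: List[Field]
    sep: str = "|"


def make_parser(desc: Record) -> Callable[[str], Tuple[Optional[Dict[str, Any]], List[str]]]:
    """Derive a parsing routine from a description.

    The returned function yields (values, errors) so that malformed
    input is reported instead of aborting the parse.
    """
    def parse(line: str) -> Tuple[Optional[Dict[str, Any]], List[str]]:
        parts = line.rstrip("\n").split(desc.sep)
        if len(parts) != len(desc.fields):
            return None, [f"expected {len(desc.fields)} fields, got {len(parts)}"]
        values: Dict[str, Any] = {}
        errors: List[str] = []
        for field, raw in zip(desc.fields, parts):
            try:
                value = field.convert(raw)
            except ValueError:
                errors.append(f"{field.name}: cannot convert {raw!r}")
                continue
            if not field.check(value):
                errors.append(f"{field.name}: constraint failed for {value!r}")
            values[field.name] = value
        return values, errors
    return parse


# An assumed ad hoc log format: host|status|bytes
entry = Record([
    Field("host", str),
    Field("status", int, lambda s: 100 <= s <= 599),
    Field("bytes", int, lambda n: n >= 0),
])
parse_entry = make_parser(entry)

good, good_errors = parse_entry("example.org|200|5120")
bad, bad_errors = parse_entry("example.org|999|5120")
```

Here the description doubles as documentation of the format, and the same description could in principle drive other tools (profilers, converters), which is the design point the abstract makes about PADS.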
Information about PADS and a list of all the people who have contributed to the project are available from the project web site: www.padsproj.org.