XML is rapidly being adopted as a standard syntax for describing the structure of data and for encoding structured or semistructured data in a manner that facilitates interchange. XML can be viewed simply as an interchange format, but our interest centers on XML as an encoding form for persistent data.
Several projects address XML data management issues from multiple perspectives. Most are driven primarily by document-centric applications, as opposed to conventional business (i.e., relational) data wrapped in XML. This research is strongly motivated and informed by our experience with computerizing the 20-volume Oxford English Dictionary.
A long-term objective of our research is to support document storage and management by applying sound database principles to structured text management and by designing and prototyping suitable XML database systems. The challenge is to discover how the complexity of text, with its intricate structure and diversity of expression, can be efficiently managed.
Early in the project to computerize the Oxford English Dictionary, we addressed the problem of converting text from its data capture form to some more convenient stored form (database loading). Our experience indicated that specifying transductions is extremely difficult in the presence of high variability in the input form (as is common when data capture is not tightly constrained). Thus, one ongoing goal is to find effective mechanisms for specifying text transformations, together with efficient methods to carry them out across very large text databases and on large user-specified subsets of those databases. We have developed theories, techniques, and tools to support the recognition of implicit text structure, whereby explicit structural tags can be introduced into a text, and to support the generation of a grammar that appropriately represents that text structure. At a higher level, we explored text transduction from one structured form to another, and we designed a specification language for describing structured text transformations by relating the components of two "syntax graphs," one representing the space of possible input data (derived from a grammar for that data) and the other representing the space of the transformed data.
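As a much-simplified illustration of recognizing implicit structure and introducing explicit tags, the sketch below tags a hypothetical flat data-capture format in which headwords appear in capitals and senses are numbered. The mini-format and tag names are invented for illustration; they bear no relation to the actual OED capture encoding or to our tools.

```python
import re

# Hypothetical flat "data capture" text: headwords in caps, numbered senses.
# This mini-format is invented for illustration only.
raw = """APPLE
1. The fruit of the apple tree.
2. The tree itself.
BANANA
1. A tropical fruit."""

def tag_entries(text):
    """Recognize implicit structure and emit explicit XML tags."""
    out, open_entry = [], False
    for line in text.splitlines():
        if re.fullmatch(r"[A-Z]+", line):             # headword line
            if open_entry:
                out.append("</entry>")
            out.append(f"<entry><hw>{line}</hw>")
            open_entry = True
        elif m := re.match(r"(\d+)\.\s*(.*)", line):  # numbered sense
            out.append(f'<sense n="{m.group(1)}">{m.group(2)}</sense>')
    if open_entry:
        out.append("</entry>")
    return "\n".join(out)

print(tag_entries(raw))
```

Real capture formats are far less regular than this, which is precisely why hand-written rules of this kind do not scale and a principled specification mechanism is needed.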
We continue to be interested in how to extend our transformation specification language to cover more complex transformations. We are also investigating how to translate specifications into efficient XSLT and XQuery programs, since these two XML languages and their processors are gaining wide acceptance. In addition to supporting data capture, other applications for such massive transductions are to migrate data to match an evolved schema or to be integrated with other databases, to transform data for storage in a materialized view, and to capture the relationship between how data is stored and how it is presented to an application (perhaps through a filter of unmaterialized views). Thus, as well as applying our system to carry out conventional text transductions, we expect to apply our results to the problems of updating documents through unmaterialized views and updating materialized views defined over structured text databases.
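To make the notion of a structure-to-structure transduction concrete, here is a minimal sketch that rewrites one element shape into another by a fixed tag mapping. The element names and the mapping are invented; our specification language instead relates the components of two syntax graphs derived from grammars, which is far more expressive than a tag-renaming table.

```python
import xml.etree.ElementTree as ET

# Invented source/target shapes: <entry><hw>/<sense> -> <article><headword>/<def>.
MAPPING = {"entry": "article", "hw": "headword", "sense": "def"}

def transduce(elem):
    """Recursively rebuild the tree, renaming tags per the mapping."""
    new = ET.Element(MAPPING.get(elem.tag, elem.tag), elem.attrib)
    new.text = elem.text
    for child in elem:
        new.append(transduce(child))
    return new

src = ET.fromstring('<entry><hw>apple</hw><sense n="1">a fruit</sense></entry>')
print(ET.tostring(transduce(src), encoding="unicode"))
```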
Two other projects focus on optimizing the execution of XQuery and XPath, which have emerged as the de facto standard languages for querying and manipulating XML documents. There are two main approaches to this problem: mapping XML data to a relational database and XQuery queries to SQL, or developing native XML DBMSs.
The first project pursues the relational approach to XQuery implementation and investigates techniques that can be layered on top of standard relational systems, in an attempt to leverage the considerable efforts and investments in existing relational technology by major database companies such as IBM, Oracle, and Microsoft.
Our preliminary results show that a large fragment of XQuery, including queries that use arbitrarily nested FLWR expressions, element constructors, many of XQuery's built-in functions, and structural comparisons, can be handled efficiently using a relational-style query execution engine on top of a dynamic interval encoding, a novel relational encoding of XML documents that facilitates the execution of these queries. The technique enables (suitably enhanced) relational engines to produce predictably good query plans that do not restrict the use of algorithmically preferable query operators. The benefits are realized despite the challenges presented by intermediate results that create arbitrary documents and the need to preserve document order as prescribed by the semantics of XQuery. Experimental results demonstrate that XML query systems can use this technique to avoid a quadratic-or-worse scale-up penalty that effectively prevents the evaluation of nested FLWR expressions over large documents. This benefit translates into performance improvements measured in orders of magnitude for large XML documents. In related work, we show how extended relational algebra operators can be used to manipulate XML data, how to translate from XQuery into the extended relational algebra, and how to rewrite the resulting algebra to produce efficient query plans that can be executed on relational database engines.
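The intuition behind interval encodings can be sketched with a plain (static) numbering: each element receives a (start, end) pair from a pre/post-order traversal, so ancestor/descendant tests become interval-containment predicates that a relational join can evaluate. The table layout below is illustrative only; the dynamic interval encoding extends such numbering so that it remains valid for intermediate results constructed during query evaluation.

```python
import xml.etree.ElementTree as ET

def interval_encode(xml_text):
    """Return rows (tag, start, end) from a pre/post-order numbering."""
    rows, clock = [], [0]
    def visit(elem):
        clock[0] += 1
        start = clock[0]          # preorder visit number
        for child in elem:
            visit(child)
        clock[0] += 1             # postorder visit number
        rows.append((elem.tag, start, clock[0]))
    visit(ET.fromstring(xml_text))
    return sorted(rows, key=lambda r: r[1])

doc = "<book><title>XML</title><chapter><title>Intro</title></chapter></book>"
rows = interval_encode(doc)

# Descendant test as a relational join predicate:
# d is a descendant of a  iff  a.start < d.start AND d.end < a.end.
def descendants_of(rows, tag):
    ancestors = [a for a in rows if a[0] == tag]
    return [d for d in rows for a in ancestors if a[1] < d[1] and d[2] < a[2]]

print(rows)
print(descendants_of(rows, "chapter"))
```

Because the numbering follows document order, sorting on `start` also recovers the order that XQuery semantics requires of results.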
Ongoing research is structured along two directions:
The research promises to open a new approach to enhancing existing relational technology so that it is suitable for processing XML and XQuery. The first direction will lead to the definition of an orthogonal set of relational operators suitable for processing inherently ordered data, in particular the dynamic interval encoding of XML.
The second line of research promises to develop clean XML query and manipulation languages that are more powerful than XQuery's FLWR expressions, while still maintaining favorable computational properties, e.g., guaranteed query termination (which has been sacrificed in the full XQuery language on the altar of expressive power).
A second project addresses the issues in evaluating XPath and XQuery within the context of a native XML DBMS. The goal is to be able to store and query terabytes of XML documents in a native XML database, no matter how complex the document structures are (in terms of depth, width, and recursion). The focus of the project, at this stage, is on the following issues:
Our current work in this project is along two lines. The first is the development of indexing techniques for XML and their incorporation into an XML query processor/optimizer. The second is the exploitation of materialized views in efficiently executing XPath queries. As has been demonstrated in relational systems, storing the results of frequently issued queries in a semantic cache can expedite query processing significantly; similar results have been demonstrated in preliminary work on XML databases. We focus on another aspect of the problem: rather than efficient processing of queries using views, we concentrate on determining the views that should be kept by the system (under limited storage) so that maximum benefit is obtained. The challenge is in deciding which full or partial query results to cache and how to use the cached results in future query processing. We address the problem in two cases:
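In either case, the underlying trade-off is a selection problem: candidate views compete for limited storage, and each offers some saving per query weighted by how often the workload uses it. The toy sketch below illustrates this with a greedy heuristic that ranks candidates by benefit density; the paths, sizes, savings, frequencies, and the heuristic itself are all invented for illustration and are not the project's actual method.

```python
# Invented candidate XPath views: estimated storage size, per-query cost
# saving, and workload frequency (all made-up numbers).
candidates = {
    "/dict/entry":           {"size": 40, "saving": 5,  "freq": 100},
    "/dict/entry/sense":     {"size": 25, "saving": 8,  "freq": 60},
    "/dict/entry/etymology": {"size": 10, "saving": 12, "freq": 20},
}

def select_views(candidates, budget):
    """Greedy selection by benefit density (freq * saving / size)."""
    chosen, used = [], 0
    ranked = sorted(
        candidates.items(),
        key=lambda kv: kv[1]["freq"] * kv[1]["saving"] / kv[1]["size"],
        reverse=True,
    )
    for path, stats in ranked:
        if used + stats["size"] <= budget:   # keep the view if it still fits
            chosen.append(path)
            used += stats["size"]
    return chosen

print(select_views(candidates, budget=50))
```

Greedy-by-density is only a baseline; view overlap, partial answers, and changing workloads make the real problem considerably harder.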
Orthogonal to these three projects, which focus on the efficient management of XML data, we also work on benchmarks for evaluating the performance of XML DBMSs. We have developed XBench, a family of benchmarks that capture different XML application characteristics. These applications are categorized as data-centric or text-centric, and the corresponding databases can consist of single documents or multiple documents. In data-centric (DC) applications, the database stores data that are captured in XML even though the original data may not be in XML; examples include e-commerce catalogue data or transactional data that is captured as XML. Text-centric (TC) applications manage actual text documents and use a database of native XML documents; examples include book collections in a digital library or news article archives. The single-document (SD) case covers databases, such as an e-commerce catalogue, that consist of a single document with complex structure (deeply nested elements), while the multiple-document (MD) case covers databases that contain a set of XML documents, such as an archive of news documents or transactional data. The result is a requirement for a database generator that can handle four cases: DC/SD, DC/MD, TC/SD, and TC/MD. The XBench database generator can generate databases in any of these classes, ranging from 10MB to 10GB in size. The workload specification covers the functionality of XQuery as captured in the XQuery Use Cases, with each query slightly varied to fit the specifics of the application domain. XBench is available for download.
Ashraf Aboulnaga's research
Ihab Ilyas's research
M. Tamer Özsu's research
David Toman's research
Frank Tompa's research