XML Data Management

XML is rapidly being adopted as a standard syntax for describing the structure of data and for encoding structured or semistructured data in a manner that facilitates interchange. XML can be viewed simply as an interchange format, but our interest centres on XML as an encoding for persistent data.

Several projects address XML data management issues from multiple perspectives. Most are driven primarily by document-centric applications, as opposed to conventional business (i.e., relational) data wrapped in XML. This research is strongly motivated and informed by our experience with computerizing the 20-volume Oxford English Dictionary [3].

A long-term objective of our research is to support document storage and management by applying sound database principles to structured text management and by designing and prototyping suitable XML database systems. The challenge is to discover how the complexity of text, with its intricate structure and diversity of expression, can be efficiently managed [5].

Early on in the project to computerize the Oxford English Dictionary, we addressed the problem of converting text from its data capture form to some more convenient stored form (database loading) [2]. Our experience indicated that specifying transductions is extremely difficult in the presence of high variability in the input form (as is common when data capture is not tightly constrained). Thus, one ongoing goal is to find effective mechanisms for specifying text transformations, together with efficient methods to carry them out across very large text databases and on large user-specified subsets of those databases. We have developed theories, techniques, and tools to support the recognition of implicit text structure [8], whereby explicit structural tags can be introduced into a text, and to support the generation of a grammar that appropriately represents that text structure [7]. At a higher level, we explored text transduction from one structured form to another, and we designed a specification language for describing structured text transformations by relating the components of two "syntax graphs," one representing the space of possible input data (derived from a grammar for that data) and the other representing the data space of the transformed data [5].
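
As a toy illustration of introducing explicit structural tags into text whose structure is only implicit (the actual recognition techniques in [8] are far richer, e.g., stochastic grammatical inference), consider hypothetical dictionary-style lines in which a headword is signalled by capitalization:

```python
import re

def tag_implicit_structure(line: str) -> str:
    """Insert explicit structural tags into a line whose structure is
    only implicit: an all-caps headword followed by its definition.
    Purely illustrative; the entry format and tag names are invented."""
    m = re.match(r"^([A-Z][A-Z'-]+)\s+(.*)$", line)
    if m is None:
        return line  # no recognizable structure; leave the line untouched
    headword, body = m.groups()
    return f"<entry><hw>{headword}</hw><def>{body}</def></entry>"

print(tag_implicit_structure("ABACUS a frame with beads used for counting"))
# <entry><hw>ABACUS</hw><def>a frame with beads used for counting</def></entry>
```

Real data-capture text exhibits far more variability than a single regular expression can absorb, which is precisely why grammar inference and specification languages are needed.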

We continue to be interested in how to extend our transformation specification language to cover more complex transformations. We are also investigating how to translate specifications into efficient XSLT and XQuery programs, since these two XML languages and their processors are gaining wide acceptance. In addition to supporting data capture, other applications for such massive transductions are to migrate data to match an evolved schema or to be integrated with other databases, to transform data for storage in a materialized view, and to capture the relationship between how data is stored and how it is presented to an application (perhaps through a filter of unmaterialized views). Thus, as well as applying our system to carry out conventional text transductions, we expect to apply our results to the problems of updating documents through unmaterialized views and updating of materialized views defined over structured text databases.

Two other projects focus on optimizing the execution of XQuery and XPath, which have emerged as the de facto standard languages for querying and manipulating XML documents. There are two main approaches to this problem: mapping XML data to a relational database and XQuery queries to SQL, or developing a native XML DBMS.

The first project pursues the relational approach to XQuery implementation and investigates techniques that can be layered on top of standard relational systems, in an attempt to leverage the considerable efforts and investments in existing relational technology by major database companies such as IBM, Oracle, and Microsoft.

Our preliminary results [1] show that a large fragment of XQuery (including queries that use arbitrarily nested FLWR expressions, element constructors, many of XQuery's built-in functions, and structural comparisons) can be handled efficiently by a relational-style query execution engine on top of dynamic interval encoding, a novel relational encoding of XML documents that facilitates execution of these queries. The technique enables (suitably enhanced) relational engines to produce predictably good query plans that do not restrict the use of algorithmically preferable query operators. The benefits are realized despite the challenges presented by intermediate results that create arbitrary documents and by the need to preserve document order as prescribed by the semantics of XQuery. Experimental results demonstrate that XML query systems can use this technique to avoid a quadratic-or-worse scale-up penalty that effectively prevents the evaluation of nested FLWR expressions over large documents; the benefit translates to performance improvements measured in orders of magnitude for large XML documents. In related work ([9], [10]), we show how extended relational algebra operators can be used to manipulate XML data, how to translate XQuery into the extended relational algebra, and how to rewrite the resulting algebra to produce efficient query plans that can be executed on relational database engines.
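
The classic (static) interval encoding behind this line of work can be sketched as follows; the dynamic interval encoding of [1] extends the idea to intermediate query results, so this is a simplified illustration, not the encoding from the paper:

```python
import xml.etree.ElementTree as ET

def interval_encode(xml_text: str):
    """Encode an XML document as (tag, start, end, level) rows, one per
    element.  With this encoding, element a is an ancestor of element d
    iff a.start < d.start and d.end < a.end, so XPath ancestor-descendant
    steps become interval-containment joins that a relational engine can
    evaluate.  Static sketch only; [1] handles dynamic results too."""
    rows = []
    counter = 0

    def visit(elem, level):
        nonlocal counter
        counter += 1
        start = counter          # pre-order position opens the interval
        for child in elem:
            visit(child, level + 1)
        counter += 1             # post-order position closes the interval
        rows.append((elem.tag, start, counter, level))

    visit(ET.fromstring(xml_text), 0)
    return sorted(rows, key=lambda r: r[1])

rows = interval_encode("<a><b><c/></b><d/></a>")
# rows: [('a', 1, 8, 0), ('b', 2, 5, 1), ('c', 3, 4, 2), ('d', 6, 7, 1)]
# 'a' is an ancestor of 'c' because interval [1, 8] contains [3, 4].
```

Stored as a relation, these rows let ordinary SQL range predicates answer structural queries, which is what allows the approach to be layered on an existing relational engine.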

Ongoing research is structured along two directions:

  1. On the practical side, we propose to implement and empirically validate the various enhancements of relational systems needed for efficient execution of relational reformulations of XQuery in a commercial-grade relational system. This should also validate our preliminary results, which were obtained using our homegrown experimental relational engine.
  2. On the more theoretical side, we propose to investigate the limits of using relational technology for processing XQuery expressions and, more generally, expressions given by various tree-transformation languages. We also propose to study several provably terminating fragments of XQuery (which, in its full generality, is Turing-complete), e.g., the very powerful languages based on the second-order λ-calculus (System F).

The research promises to open a new approach to enhancing existing relational technology to make it suitable for processing XML and XQuery. The first direction will lead to the definition of an orthogonal set of relational operators suited to processing inherently ordered data, in particular the dynamic interval encoding of XML.

The second line of research promises to develop clean XML query and manipulation languages that are more powerful than XQuery's FLWR expressions, while still maintaining favorable computational properties, e.g., guaranteed query termination (which has been sacrificed in the full XQuery language on the altar of expressive power).

A second project addresses the issues in evaluating XPath and XQuery within the context of a native XML DBMS. The goal is to be able to store and query terabytes of XML documents in a native XML database, no matter how complex the structures are (in terms of depth, width, and recursion). The focus of the project, at this stage, is on the following issues:

  1. Storage structures for XML documents. We are designing storage structures [14] for XML documents with the following main requirements:
    1. Scalability. The storage structures must be able to store terabytes of XML data. We store the two types of data, XML tree structures (tags and their nesting relationships) and the PCDATA of each XML element, separately. The rationale for this separation is that different types of constraints in a path query use different types of data: tree-pattern-based structural constraints apply to the tree structures, and value-based constraints apply to the PCDATA. We store the tree structures by linearizing them into a string. Each XML element tag is mapped to a small unit of data (as small as one byte) in the string, and the units are stored in document order. If the XML document is very large, the mapped string can be divided into substrings, each of which is stored in a disk page. The PCDATA of XML elements are stored contiguously in document order, also spanning disk pages if necessary. This mechanism allows XML documents of virtually unlimited size to be stored in the database.
    2. Supporting efficient query evaluation. The storage structures should be optimized for XML path queries. Since the PCDATA are usually subject to value-based queries, we can build value-based indexes, such as B+-trees, on them. For evaluating structural constraints, an efficient single-pass navigational operator can be developed on top of the linearized tree storage, since elements are stored in document order. The navigational operator sequentially scans the string and maintains internal state for pattern matching. Lightweight statistics can also be collected for each disk page and stored in the page header; these statistics are used to skip unnecessary page I/Os during the sequential scan.
    3. Ease of update. When the underlying XML document is updated, the storage structure must be easy to update accordingly. Updating the XML tree (deleting or inserting an element) is significantly easier in the string format than with previous mappings (e.g., interval encoding). The advantage of our storage structure is that updating the tree structure affects the string only locally, i.e., within a page or a few pages, whereas interval encoding usually requires a global update.
  2. Efficient physical operators for answering path queries (e.g., XPath) on this storage. Path queries are ubiquitous and differ significantly from traditional relational database queries. The literature offers two main approaches to evaluating them: join-based and navigational, each with its own advantages and disadvantages. The join-based approach exploits tag-name indexes and is usually I/O-optimal when queries involve only ancestor-descendant relationships ("global axes"). The navigational approach does not depend on existing indexes and is usually more efficient when a query contains only "local axes" (parent-child or sibling relationships). We define a special type of path expression, the Next-of-Kin (NoK) expression, that contains only "local axes" [14], and a physical NoK operator that answers NoK path expressions navigationally. A general path expression containing "global axes" can be decomposed into interconnected NoK expressions, with a NoK operator applied to each; the intermediate results are then joined on the "global axes" using the join-based approach. In this way, we exploit the best of both worlds. Preliminary results on the decomposition of path expressions and the composition of intermediate results appear in [13].
  3. XPath cardinality estimation and cost modeling. Cardinality estimation is crucial to cost-based optimization. Cardinality estimation for path expressions is usually based on a graph-based synopsis structure that summarizes the XML tree and maintains sufficient statistics. To be useful in practice, a synopsis should have the following properties: estimations based on it should be reasonably accurate; the estimation algorithm should work for most types of queries and data sets; the synopsis should adapt to different memory budgets; it should be easy to create and update; and the estimation time should be a very small fraction of the actual querying time. To meet all these criteria, we developed a synopsis structure called XSeed [11] that is accurate and fast for cardinality estimation, using a novel estimation algorithm. In many cases, XSeed improves the accuracy of state-of-the-art synopses by an order of magnitude, even with a smaller memory budget. XSeed is also easy to construct and update, which makes it suitable for implementation in a commercial system. In addition to cardinality estimation, we have also worked on cost estimation for a navigational operator [12].
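
The separation of tree structure and PCDATA described in item 1 can be sketched as follows. The single-character tag codes and the event-stream input are hypothetical simplifications of the succinct byte-per-tag scheme in [14]:

```python
def linearize(events):
    """Split an XML document (given as a stream of parse events) into a
    compact structure string and a separate PCDATA store, in the spirit
    of the separation described above.  Each open tag becomes one unit
    (here a single letter; a real scheme packs codes into bytes and
    pages), ')' marks a close, and text goes to the PCDATA store in
    document order.  Illustrative sketch, not the scheme of [14]."""
    code, next_code = {}, iter("ABCDEFGHIJKLMNOPQRSTUVWXYZ")  # <=26 tags
    structure, pcdata = [], []
    for kind, value in events:
        if kind == "start":
            if value not in code:
                code[value] = next(next_code)
            structure.append(code[value])   # one unit per open tag
        elif kind == "end":
            structure.append(")")           # close marker
        elif kind == "text":
            pcdata.append(value)            # stored separately, in document order
    return "".join(structure), pcdata, code

events = [("start", "book"), ("start", "title"), ("text", "XML"),
          ("end", "title"), ("start", "author"), ("text", "Smith"),
          ("end", "author"), ("end", "book")]
structure, pcdata, code = linearize(events)
print(structure)  # 'AB)C))'
print(pcdata)     # ['XML', 'Smith']
```

Because the structure string preserves document order, a navigational operator can match tree patterns in a single sequential scan of it, while value predicates consult only the PCDATA store and its indexes.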

Our current work in this project proceeds along two lines. The first is the development of indexing techniques for XML and the incorporation of all of this into an XML query processor/optimizer. The second is the exploitation of materialized views for efficiently executing XPath queries. As has been demonstrated in relational systems, storing the results of frequently issued queries in a semantic cache can expedite query processing significantly; these results have also been demonstrated in preliminary work on XML databases. We focus on a different aspect of the problem: rather than efficient processing of queries using views, we concentrate on determining which views the system should keep (under limited storage) so that maximum benefit can be obtained. The challenge is in deciding which full or partial query results to cache and how to use the cached results in future query processing. We address the problem in two cases:

  1. When a workload is provided, using it to select XML views to materialize in the semantic cache.
  2. When a predefined workload is not available, incrementally building the semantic cache and managing it efficiently as new queries arrive.
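
One simple way to frame the workload-driven case is as a budgeted selection problem. The greedy benefit-per-byte heuristic below is a hypothetical illustration of view selection under a storage budget, not the algorithm used in the project:

```python
def select_views(candidates, budget):
    """Pick materialized views to cache under a storage budget.
    candidates: list of (name, size, benefit) tuples, where benefit
    estimates the query cost saved for the workload (e.g., frequency
    times evaluation cost).  Greedy by benefit density (benefit/size);
    a hypothetical sketch, ignoring view containment and overlap."""
    chosen, used = [], 0
    for name, size, benefit in sorted(
            candidates, key=lambda c: c[2] / c[1], reverse=True):
        if used + size <= budget:   # take the view only if it still fits
            chosen.append(name)
            used += size
    return chosen

views = [("//book/title", 40, 200),     # (XPath view, size in MB, est. benefit)
         ("//author", 100, 300),
         ("//book[price>20]", 30, 60)]
print(select_views(views, budget=150))  # ['//book/title', '//author']
```

A real cache manager must also account for views answering parts of other queries and for incremental maintenance cost, which is what makes the selection problem interesting.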

Orthogonal to these three projects focusing on the efficient management of XML data, we also work on benchmarks for evaluating the performance of XML DBMSs. We have developed XBench, a family of benchmarks that capture different XML application characteristics [6]. These applications are categorized as data-centric or text-centric, and the corresponding databases can consist of single documents or multiple documents. In data-centric (DC) applications, the database stores data that are captured in XML even though the original data may not be in XML; examples include e-commerce catalogue data or transactional data captured as XML. Text-centric (TC) applications manage actual text documents and use a database of native XML documents; examples include book collections in a digital library or news article archives. The single-document (SD) case covers databases, such as an e-commerce catalogue, that consist of a single document with complex structure (deeply nested elements), while the multiple-document (MD) case covers databases that contain a set of XML documents, such as an archive of news documents or transactional data. The result is a requirement for a database generator that can handle four cases: DC/SD, DC/MD, TC/SD, and TC/MD. The XBench database generator can generate databases in any of these classes, ranging from 10MB to 10GB in size. The workload specification covers the functionality of XQuery as captured in the XQuery Use Cases; each query is slightly varied to fit the specifics of the application domain. XBench is publicly available for download.


  1. D. DeHaan, D. Toman, M. P. Consens, and M. T. Özsu. "A Comprehensive XQuery to SQL Translation using Dynamic Interval Encoding," In Proceedings of the ACM SIGMOD Conference on Management of Data, 2003, pages 623-634.
  2. R. Kazman, Structuring the Text of the Oxford English Dictionary through Finite State Transduction. University of Waterloo Technical Report CS-86-20, 1986.
  3. F. W. Tompa, "Not Just Another Database Project: Developments at UW," In Reflections on the Future of Text, Proceedings of the 10th Conference of the University of Waterloo Centre for the New OED, 1994, pages 82-89.
  4. A. Salminen and F. W. Tompa, "Requirements for XML Document Database Systems," In Proceedings of the ACM Symposium on Document Engineering, 2001, pages 85-94.
  5. X. Tang and F. W. Tompa, "Specifying Transformations for Structured Documents," In Proceedings of the 4th International Workshop on the Web and Databases (WebDB'2001), 2001, pages 67-72.
  6. B. B. Yao, M. T. Özsu, and N. Khandelwal. "XBench Benchmark and Performance Testing of XML DBMSs," In Proceedings of the 20th International Conference on Data Engineering, March 2004, pages 621-632.
  7. M. D. Young-Lai and F. W. Tompa, "Stochastic Grammatical Inference of Text Database Structure," Machine Learning, 40: 111-137, 2000.
  8. M. D. Young-Lai and F. W. Tompa, "One-Pass Evaluation of Region Algebra Expressions," Information Systems, 28: 159-168, 2003.
  9. H. Zhang and F. W. Tompa, "Querying XML Documents by Dynamic Shredding," In Proceedings of the ACM Symposium on Document Engineering, 2004, pages 21-30.
  10. H. Zhang and F. W. Tompa, "XQuery Rewriting at the Algebraic Level," In "Trends in XML Technology for the Global Information Infrastructure," a special issue of Journal of Computer Systems, Science, and Engineering, 18(5): 241-262, September 2003.
  11. N. Zhang, M. T. Özsu, A. Aboulnaga, and I. F. Ilyas. "XSeed: Accurate and Fast Cardinality Estimation for XPath Queries," In Proceedings of the 22nd International Conference on Data Engineering, April 2006.
  12. N. Zhang, P. J. Haas, V. Josifovski, G. M. Lohman, and C. Zhang. "Statistical Learning Techniques for Costing XML Queries," In Proceedings of the 31st International Conference on Very Large Data Bases, September 2005, pages 289-300.
  13. N. Zhang, S. Agarawal, and M. T. Özsu. "BlossomTree: Evaluating XPaths in FLWOR Expressions," In Proceedings of the 21st International Conference on Data Engineering, April 2005, pages 388-389.
  14. N. Zhang, V. Kacholia, and M. T. Özsu. "A Succinct Physical Storage Scheme for Efficient Evaluation of Path Queries in XML," In Proceedings of the 20th International Conference on Data Engineering, March 2004, pages 54-65.

Related Links

Ashraf Aboulnaga's research

Ihab Ilyas's research

M. Tamer Özsu's research

David Toman's research

Frank Tompa's research


Data Systems Group
David R. Cheriton School of Computer Science
University of Waterloo
Waterloo, Ontario, Canada N2L 3G1

Last modified: Friday, 01-Jun-2012 11:01:03 EDT