Database Research Group Events

Spring 2007

Note: Events of interest to the Database Research Group are posted to the uw.cs.database newsgroup and are mailed to the db-group@lists.uwaterloo.ca mailing list. There are actually three mailing lists aggregated into the db-group list: db-faculty (for DB group faculty), db-grads (for DB group graduate students), and db-friends (for DB group alumni, visitors, and friends). If you wish to subscribe to one of these three lists (or to unsubscribe), please visit https://lists.uwaterloo.ca/mailman/listinfo/<listname>, where <listname> is the list you wish to subscribe to.
DB group meetings
The DB group meets most Friday afternoons at 2pm, usually in DC1331. See the list of current events for times and locations of upcoming meetings. Each meeting lasts for an hour and features an informal presentation by one of the members of the group. Everyone is welcome to attend. These talks are intended to raise questions and to stimulate discussion rather than being polished presentations of research results. Speakers are determined using a rotating speaker list, which can be found on the DB group meeting page
DB seminar series
The DB seminar series features visiting speakers. These seminars are more-or-less monthly, and are usually scheduled on Monday mornings at 11am. See the list of current events for times and locations of upcoming seminars. The full schedule can be found on the DB seminar series page.

Recent and Upcoming Events


DB Meeting: Friday May 4, 2:00pm, DC 1331
Speaker: Ani Nica, Sybase iAnywhere
Title: Issues and Solutions Related to Transaction Support for Maintenance of Materialized Views
Abstract: In this talk I will present a survey of the research related to transaction support for the maintenance of the materialized views in relational database management systems. The survey will include proposed solutions to increase the level of concurrency, decrease the rate of deadlocks, and mechanisms for recovery when materialized views are incrementally maintained. I will provide in this talk an overview of the mechanisms implemented in some of the commercially available relational database management systems for dealing with maintenance of the materialized views.

MMath Essay Presentation: Thursday May 10, 1:00pm, DC 1331
Speaker: Ye Qin
Title: The Impact of I/O Cost Models on Database Query Optimizers
Abstract: In this study, we try to characterize quantitatively the impact of I/O cost model errors on the optimizer of a Database Management System (DBMS). The cost of a query execution plan depends not only on the number of I/O operations required by the plan, but also on the cost of those operations. An I/O cost model consists of two parts: I/O costing functions (IOCF) and I/O count estimation. The I/O count estimation is concerned with estimating the number of I/O operations required by a relational operator; the IOCF describes the cost of those operations, e.g., the cost of reading sequentially or randomly one page from disk. The errors in estimating the IOCFs and the number of I/O requests can cause the optimizer to choose suboptimal execution plans. We use the global relative cost (GRC) to describe how far a suboptimal plan is away from the optimal plan. We analyze the I/O cost model of an open source DBMS (Postgres) and prove that the maximum GRC is proportional to the product of the square of the the IOCF estimation error and the I/O count estimation error. We also study empirically the IOCF problem with another DBMS that models I/O cost differently and confirm the results obtained from analyzing Postgres. In addition, we study experimentally the I/O count estimation problem for two relational operators: nested loop join and index scan. The experimental results demonstrate that in case of nested loop join, Postgres can overestimate the number of I/O operations issued by the DBMS by an order of magnitude or more, with larger Postgres buffer sizes resulting in larger estimation errors. The estimates become worse when data are skewed. In case of index scan, the Postgres optimizer can estimate poorly the number of I/O requests when the underlying data is skewed and the index is not completely unclustered. Skew or correlation by itself can cause estimation errors; altogether they can cause larger errors.

DB Meeting: Friday May 11, 2:00pm, DC 1331
Speaker: Ashraf Aboulnaga
Title: μBE: User Guided Source Selection and Schema Mediation for Internet Scale Data Integration
Abstract: Data integration research has always assumed that users know the data sources they want to integrate data from, and the mediated schema to which the data will be mapped. This may not be true for data integration on the Internet, where there may be thousands of sources providing data of relevance to a particular domain. Users who want to build data integration systems for such a domain need to make decisions about (1) which sources to include in the data integration system, and (2) what mediated schema to use and how to map the data from the different sources to this schema. These decisions are best made by an iterative process in which the user explores the space of possible data integration systems, and refines his or her decision based on this exploration. The current state of the art in data integration does not support such an exploratory process.

In this talk, I will present μBE, a tool that helps in this exploratory process by automatically selecting the sources to include in a data integration system and defining a mediated schema on these sources. The input to μBE is a large number of data sources and their characteristics. Users set weights to indicate which source characteristics are important to them, and can also specify constraints on which data sources and which attributes to include in the solution. Based on this input, μBE searches the space of possible data integration solutions and produces a recommended solution. The user can then provide feedback on this solution and iteratively explore the search space using μBE until he or she is satisfied with the generated solution. Our experiments verify that μBE is efficient in exploring the space of possible solutions, and that it produces high-quality solutions.

This is joint work with Kareem El Gebaly.


DB Seminar: Monday May 14, 10:30am, DC 1304
Speaker: Jun Yang, Duke University
Title: Data-Driven Processing in Sensor Networks

DB Meeting: Friday May 25, 2:00pm, DC 1331
Speaker: Jérémy Barbay
Title: Lowest Common Ancestor Queries
Abstract: Schema-free queries permit to search XML documents without knowing their schema. Among them, lowest common ancestor (LCA) queries were introduced in several variants on labeled trees. I will review the approaches from Schmidt, Kersten and Windhouwer [ICDE 2001]; Li, Yu and Jagadish [VLDB 2004] and Xu and Papakonstantinou [SIGMOD 2005], and two related projects developped at Waterloo, one in collaboration with Ehsan Chiniforooshan, Alexander Golynski and Jui-Yi Kao, the other one in collaboration with Aleh Veraskouski. And of course I will bring some home-made cake ;)

As obvious from the abstract, this is corresponds to joint works with Ehsan Chiniforooshan, Alexander Golynski, Jui-Yi Kao and Aleh Veraskouski.


DB Seminar: Monday May 28, 11:00am, DC 1331
Speaker: Jiri Schindler, Network Appliance Corp.
Title: Data ONTAP - Storage system for reliably storing data on real-world (=unreliable) disk drives
Abstract: One of the many functions of an enterprise storage system is to reliably store data using unreliable components and magnetic media i.e., disk drives. Over the last twenty years, the disk drive technology has advanced in many ways. However, despite these improvements, the reliability of data stored on disk drives has not markedly changed.; as complexity increases, disk drives exhibit many more types of partial or transient failures with system-wide effects.

This talk will describe briefly the high level architecture of the Network Appliance Storage Systems, called Filers, with an emphasis on the mechanisms of its specialized operating system, called Data ONTAP, dealing with disk errors and failures. The talk will also present results of a recent wide-scale study on latent sector errors, show how such errors have system-level effects, and describe how a storage system copes with these types of errors. Finally, the talk will describe some related work and argue for more detailed failure models accounting for current disk drive failure modes instead of, now almost twenty years old, RAID levels reliability equations of Patterson et al.[1988].

Speaker Bio: Jiri Schindler is a member of research staff at the Network Appliance Advanced Development Group, located in Waltham, MA near Boston. His research interests include storage device-efficient data layouts, performance modeling and load balancing. He received his PhD in 2003 from Carnegie Mellon University. Although not Canadian, he enjoys sports on water in its solid state when not building prototypes or analytical models of storage systems.

DB Meeting: Friday June 1, 2:00pm, DC 1331
Speaker: Xiaoyan Yang, National University of Singapore
Title: In-Network Execution of Monitoring Queries in Sensor Networks
Abstract: Sensor networks are widely used in many applications for collecting information from the physical environment. In these applications, it is usually necessary to track the relationships between sensor data readings within a time window to detect events of interest. However, it is difficult to detect such events by using the common aggregate or selection queries. We address the problem of processing window self-join in order to detect events of interest. Self-joins are useful in tracking correlations between different sensor readings, which can indicate an event of interest. We propose the Two-Phase Self-Join (TPSJ) scheme to efficiently evaluate self-join queries for event detection in sensor networks. Our TPSJ scheme takes advantage of the properties of the events and carries out data filtering during in-network processing. We discuss TPSJ execution with one window and we extend it for continuous event monitoring. Our experimental evaluation results indicate that the TPSJ scheme is effective in reducing the amount of radio transmissions during event detection.

DB Meeting: Friday June 8, 2:00pm, DC 1331
Speaker: Shai Ben-David
Title: Principles of clustering - a theory-biased survey
Abstract: The talk is a survey of clustering from a theoretical perspective: Major clustering approaches: in what ways do they differ? What kind of guarantees are available for the performance of clustering algorithms? Considerations in choosing clustering parameters. etc.

I will try to make the talk as useful as possible to a database audience but I am a theoretician and I can't change my skin. So, I'll talk about clustering paradigms rather then clustering software packages, and I'll talk about theorems and open questions rather than practical experience and experimental results.


DB Meeting: Friday June 22, 2:00pm, DC 1331
Speaker: Qiang Wang
Title: Performance-failure Tolerant Range Query Processing in P2P Networks
Abstract: Tree-based overlay networks have been developed to support range query processing over large-scale P2P networks, and they employ multi-paths among peers to provide fault-tolerance against network failures. However, due to lack of "fault-tolerance configurability", they are unable to support fault-tolerance specifications with real-time demands, leading to performance failures. To correct this problem, we developed a scalable B-tree-based overlay network, which supports fault-tolerance specifications through persistent performance guarantees independent of varying network failure rates; secondly, I will discuss certain replication strategies deployed at query processing level to enhance existing tree overlays to support real-time demands.

DB Seminar: Monday June 25, 10:30am, DC 1304
Speaker: Ugur Cetinemel, Brown University
Title: Processing and Routing Streams in a Networked World

DB Meeting: Friday July 6, 2:00pm, DC 1331
Speaker: Gulay Unel
Title: An Incremental Technique for Automata-based Decision Procedures
Abstract: Automata-based decision procedures commonly achieve optimal complexity bounds. However, in practice, they are often outperformed by sub-optimal (but more local-search based) techniques, such as tableaux, on many practical reasoning problems. This discrepancy is often the result of automata-style techniques global approach to the problem and the consequent need for constructing an extremely large automaton. This is in particular the case when reasoning in theories consisting of large number of relatively simple formulas, such as descriptions of database schemes, is required. In this paper, we propose techniques that allow us to approach a mu-calculus satisfiability problem in an incremental fashion and without the need for re-computation. In addition, we also propose heuristics that guide the problem partitioning in a way that is likely to reduce the size of the problems that need to be solved.

DB Meeting: Friday July 13, 2:00pm, DC 1331
Speaker: Robert Warren
Title: Some observations on scalable client/server databases
Abstract:

DB Meeting: Friday July 20, 2:00pm, DC 1331
Speaker: Susanne Busse, Technische Universität Berlin
Title: Feature-based Classification of Information Integration Systems
Abstract: An Information Integration System (IIS) provides a single access point to read data from a set of autonomous heterogeneous data sources. There exist different types of IIS ranging from structured mediators and Peer Data Management Systems to Information Retrieval systems like search engines. Similarly, a variety of techniques can be used for particular aspects of IIS, such as query optimization or specifications of integrated schemas.

In this talk we present a model-driven approach dealing with variants to support an IIS engineer to select and reuse existing architectures and techniques for information integration. The core concept is a classification specifying both functional and quality features of IIS and dependencies between them. This feature-based classification allows to match specifications to find appropriate elements for reuse. The talk shall open a discussion on crucial points of IIS design from a database, Information Retrieval and software engineering point of view.


DB Meeting: Friday July 27, 2:00pm, DC 1331
Speaker: Nan Tang, Chinese University of Hong Kong
Title: Multiple Materialized View Selection for XPath Query Rewriting
Abstract: We study the problem of answering XPATH queries using multiple materialized views. Despite the efforts on answering queries using single materialized view, answering queries using multiple views remains relatively new. We address two important aspects of this problem: multiple-view selection and equivalent multiple-view rewriting. With regards to the first problem, we describer an NFA-based approach (called VFILTER) to filter views that cannot be used to answer a given query. Then we present the criterion for multiple view/query answerability. Based on the output of VFILTER, we further discuss a heuristic method to identify a minimal view set that can answer a given query. For the problem of multiple-view rewriting, we first refine the materialized fragments of each selected view (like pushing selection), then we join the refined fragments utilizing an encoding scheme. Finally, we extract the result of the query from the materialized fragments of a single view.

MMath Thesis Presentation: Tuesday July 31, 2:00pm, DC 3314
Speaker: Kareem El Gebaly
Title: Robustness in Automatic Physical Database Design
Abstract: Automatic physical database design tools rely on "what-if" interfaces to the query optimizer to estimate the execution time of the training query workload under different candidate physical designs. The tools use these what-if interfaces to recommend physical designs that minimize the estimated execution time of the input training workload. In this thesis, we argue that minimizing estimated execution time alone can lead to designs with inherent problems. In particular, if the optimizer makes an error in estimating the execution time of some workload queries, the recommended physical design may actually harm the workload instead of benefiting it. In this sense, the physical design is risky. Moreover, if the production queries are slightly different from the training queries, the recommended physical design may not benefit them at all. In this sense, the physical design is not general. We define Risk and Generality as two new measures to evaluate the quality of a proposed physical database design, and we show how to extend the objective function being optimized by a generic physical design tool to take these measures into account. We have implemented a simple physical design advisor in PostgreSQL, and we use it to experimentally demonstrate the usefulness of our approach. We show that our two new metrics result in physical designs that are more robust, which means that the user can implement them with a higher degree of confidence. This is particularly important as we move towards truly zero-administration database systems in which there is not the possibility for a DBA to vet the recommendations of the physical design tool before applying them.

DB Seminar: Friday August 17, 2:00pm, DC 1304 (Please note special day and time)
Speaker: Shivnath Babu, Duke University
Title: Experiment-Driven Management of Web Services and Scientific Applications

MMath Thesis Presentation: Friday August 24, 10:30am, DC 1316
Speaker: Mohammed Abouzour
Title: Automatically Tuning Database Server Multiprogramming Level
Abstract: Optimizing database systems to achieve the maximum attainable throughput of the underlying hardware is one of the many difficult tasks that face Database Administrators. With the increased use of database systems in many environments, this task has even become more difficult. One of the parameters that needs to be configured is the number of worker tasks that the database server uses (the multiprogramming level). The underlying intuition is that every workload has an optimal multiprogramming level that can achieve the best throughput given the workload characteristic. This talk will focus on how to automatically adjust the number of database server worker tasks to achieve maximum throughput under varying workload characteristics. I will present three algorithms that we have developed and compare the performance of these algorithm.

This page is maintained by Ashraf Aboulnaga.

Campaign Waterloo

Data Systems Group
David R. Cheriton School of Computer Science
University of Waterloo
Waterloo, Ontario, Canada N2L 3G1
Tel: 519-888-4567
Fax: 519-885-1208

Contact | Feedback: db-webmaster@cs.uwaterloo.ca | Data Systems Group


Valid HTML 4.01!Valid CSS! Last modified: Friday, 01-Jun-2012 11:01:03 EDT


Menu:ShowHide