[Please remove <h1>]
Spring 2007
Note: Events of interest to the
Database Research Group are posted to the uw.cs.database
newsgroup and are mailed to the
db-group@lists.uwaterloo.ca
mailing list. There are actually three mailing lists aggregated into the
db-group list: db-faculty
(for DB group faculty), db-grads (for DB group graduate students),
and db-friends (for DB group alumni, visitors, and friends). If
you wish to subscribe to one of these three lists (or to unsubscribe), please
visit
https://lists.uwaterloo.ca/mailman/listinfo/<listname>, where
<listname> is the list you wish to subscribe to.
- DB group meetings
- The DB group meets most Friday afternoons at 2pm, usually in DC1331.
See the list of current events for
times and locations of upcoming meetings. Each meeting lasts
for an hour and features an informal presentation by one of the
members of the group. Everyone is welcome to attend. These talks are
intended to raise questions and to stimulate discussion rather than
being polished presentations of research results. Speakers are determined
using a rotating speaker list, which can be found on the
DB group meeting page
- DB seminar series
- The DB seminar series features visiting speakers. These seminars are
more-or-less monthly, and are usually scheduled on Monday
mornings at 11am. See the list of current
events for times and locations of upcoming seminars. The
full schedule can be found on the DB seminar series page.
Recent and Upcoming Events
DB Meeting:
|
Friday May 4, 2:00pm, DC 1331
|
Speaker:
|
Ani Nica, Sybase iAnywhere
|
Title:
|
Issues and Solutions Related to Transaction Support for Maintenance of Materialized Views
|
Abstract:
|
In this talk I will present a survey of the research related to
transaction support for the maintenance of the materialized views in
relational database management systems. The survey will include proposed
solutions to increase the level of concurrency, decrease the rate of
deadlocks, and mechanisms for recovery when materialized views are
incrementally maintained. I will provide in this talk an overview of the
mechanisms implemented in some of the commercially available relational
database management systems for dealing with maintenance of the
materialized views.
|
MMath Essay Presentation:
|
Thursday May 10, 1:00pm, DC 1331
|
Speaker:
|
Ye Qin
|
Title:
|
The Impact of I/O Cost Models on Database Query Optimizers
|
Abstract:
|
In this study, we try to characterize quantitatively the impact of
I/O cost model errors on the optimizer of a Database Management
System (DBMS). The cost of a query execution plan depends not
only on the number of I/O operations required by the plan, but also on the
cost of those operations. An I/O cost model consists of two parts:
I/O costing functions (IOCF) and I/O count estimation. The
I/O count estimation is concerned with estimating the number of I/O
operations required by a relational operator; the IOCF describes the
cost of those operations, e.g., the cost of reading
sequentially or randomly one page from disk. The errors in estimating
the IOCFs and the number of I/O requests can cause the optimizer
to choose suboptimal execution plans. We use the global relative
cost (GRC) to describe how far a suboptimal plan is away from the
optimal plan. We analyze the I/O cost model of an open source DBMS
(Postgres) and prove that the maximum GRC is proportional to the
product of the square of the the IOCF estimation error and the I/O
count estimation error. We also study empirically the IOCF problem
with another DBMS that models I/O cost differently and confirm the
results obtained from analyzing Postgres. In addition, we study
experimentally the I/O count estimation problem for two relational
operators: nested loop join and index scan. The experimental
results demonstrate that in case of nested loop join, Postgres can overestimate
the number of I/O operations issued by the DBMS by an order of
magnitude or more, with larger Postgres buffer sizes resulting in
larger estimation errors. The estimates become worse when data are
skewed. In case of index scan, the Postgres optimizer can estimate
poorly the number of I/O requests when the underlying data is skewed
and the index is not completely unclustered. Skew or correlation by
itself can cause estimation errors; altogether they can cause larger
errors.
|
DB Meeting:
|
Friday May 11, 2:00pm, DC 1331
|
Speaker:
|
Ashraf Aboulnaga
|
Title:
|
μBE: User Guided Source Selection and Schema Mediation for Internet Scale Data Integration
|
Abstract:
|
Data integration research has always assumed that users know the data sources
they want to integrate data from, and the mediated schema to which the data
will be mapped. This may not be true for data integration on the Internet,
where there may be thousands of sources providing data of relevance to a
particular domain. Users who want to build data integration systems for such a
domain need to make decisions about (1) which sources to include in the data
integration system, and (2) what mediated schema to use and how to map the
data from the different sources to this schema. These decisions are best made
by an iterative process in which the user explores the space of possible
data integration systems, and refines his or her decision based on this
exploration. The current state of the art in data integration does not
support such an exploratory process.
In this talk, I will present μBE, a tool that helps in this
exploratory process by automatically selecting the sources to include in a
data integration system and defining a mediated schema on these sources. The
input to μBE is a large number of data sources and their
characteristics. Users set weights to indicate which source characteristics
are important to them, and can also specify constraints on which data sources
and which attributes to include in the solution. Based on this input,
μBE searches the space of possible data integration solutions
and produces a recommended solution. The user can then provide feedback on
this solution and iteratively explore the search space using μBE
until he or she is satisfied with the generated solution. Our
experiments verify that μBE is efficient in exploring the
space of possible solutions, and that it produces high-quality solutions.
This is joint work with Kareem El Gebaly.
|
DB Meeting:
|
Friday May 25, 2:00pm, DC 1331
|
Speaker:
|
Jérémy Barbay |
Title:
|
Lowest Common Ancestor Queries
|
Abstract:
|
Schema-free queries permit to search XML documents without knowing
their schema. Among them, lowest common ancestor (LCA) queries were
introduced in several variants on labeled trees. I will review the
approaches from Schmidt, Kersten and Windhouwer [ICDE 2001]; Li, Yu
and Jagadish [VLDB 2004] and Xu and Papakonstantinou [SIGMOD 2005],
and two related projects developped at Waterloo, one in collaboration
with Ehsan Chiniforooshan, Alexander Golynski and Jui-Yi Kao, the
other one in collaboration with Aleh Veraskouski. And of course I will
bring some home-made cake ;)
As obvious from the abstract, this is corresponds to joint works with
Ehsan Chiniforooshan, Alexander Golynski, Jui-Yi Kao and Aleh
Veraskouski.
|
DB Seminar:
|
Monday May 28, 11:00am, DC 1331
|
Speaker:
|
Jiri Schindler, Network Appliance Corp. |
Title:
|
Data ONTAP - Storage system for reliably storing data on real-world
(=unreliable) disk drives
|
Abstract:
|
One of the many functions of an enterprise storage system is to
reliably store data using unreliable components and magnetic media
i.e., disk drives. Over the last twenty years, the disk drive
technology has advanced in many ways. However, despite these
improvements, the reliability of data stored on disk drives has not
markedly changed.; as complexity increases, disk drives exhibit many
more types of partial or transient failures with system-wide effects.
This talk will describe briefly the high level architecture of the
Network Appliance Storage Systems, called Filers, with an emphasis on
the mechanisms of its specialized operating system, called Data ONTAP,
dealing with disk errors and failures. The talk will also present
results of a recent wide-scale study on latent sector errors, show how
such errors have system-level effects, and describe how a storage
system copes with these types of errors. Finally, the talk will
describe some related work and argue for more detailed failure models
accounting for current disk drive failure modes instead of, now almost
twenty years old, RAID levels reliability equations of Patterson et
al.[1988].
|
Speaker Bio:
|
Jiri Schindler is a member of research staff at the Network Appliance
Advanced Development Group, located in Waltham, MA near Boston. His
research interests include storage device-efficient data layouts,
performance modeling and load balancing. He received his PhD in 2003
from Carnegie Mellon University. Although not Canadian, he enjoys
sports on water in its solid state when not building prototypes or
analytical models of storage systems.
|
DB Meeting:
|
Friday June 1, 2:00pm, DC 1331
|
Speaker:
|
Xiaoyan Yang, National University of Singapore
|
Title:
|
In-Network Execution of Monitoring Queries in Sensor Networks
|
Abstract:
|
Sensor networks are widely used in many applications for collecting
information from the physical environment. In these applications, it is
usually necessary to track the relationships between sensor data readings
within a time window to detect events of interest. However, it is
difficult to detect such events by using the common aggregate or selection
queries. We address the problem of processing window
self-join in order to detect events of interest. Self-joins are useful in
tracking correlations between different sensor readings, which can
indicate an event of interest. We propose the Two-Phase Self-Join (TPSJ)
scheme to efficiently evaluate self-join queries for event detection in
sensor networks. Our TPSJ scheme takes advantage of the properties of the
events and carries out data filtering during in-network processing. We
discuss TPSJ execution with one window and we extend
it for continuous event monitoring. Our experimental evaluation results
indicate that the TPSJ scheme is effective in reducing the amount of radio
transmissions during event detection.
|
DB Meeting:
|
Friday June 8, 2:00pm, DC 1331
|
Speaker:
|
Shai Ben-David |
Title:
|
Principles of clustering - a theory-biased survey
|
Abstract:
|
The talk is a survey of clustering from a theoretical perspective:
Major clustering approaches: in what ways do they differ?
What kind of guarantees are available for the performance of clustering algorithms?
Considerations in choosing clustering parameters. etc.
I will try to make the talk as useful as possible to a database audience
but I am a theoretician and I can't change my skin. So, I'll talk about clustering
paradigms rather then clustering software packages, and I'll talk
about theorems and open questions rather than practical experience and
experimental results.
|
DB Meeting:
|
Friday June 22, 2:00pm, DC 1331
|
Speaker:
|
Qiang Wang
|
Title:
|
Performance-failure Tolerant Range Query Processing in P2P Networks
|
Abstract:
|
Tree-based overlay networks have been developed to support range
query processing over large-scale P2P networks, and they employ
multi-paths among peers to provide fault-tolerance against network
failures. However, due to lack of "fault-tolerance configurability",
they are unable to support fault-tolerance specifications with
real-time demands, leading to performance failures.
To correct this problem, we developed a scalable B-tree-based
overlay network, which supports fault-tolerance
specifications through persistent performance guarantees independent
of varying network failure rates; secondly, I will discuss certain
replication strategies deployed at query processing level to enhance
existing tree overlays
to support real-time demands.
|
DB Meeting:
|
Friday July 6, 2:00pm, DC 1331
|
Speaker:
|
Gulay Unel
|
Title:
|
An Incremental Technique for Automata-based Decision Procedures
|
Abstract:
|
Automata-based decision procedures commonly achieve optimal complexity
bounds. However, in practice, they are often outperformed by
sub-optimal (but more local-search based) techniques, such as
tableaux, on many practical reasoning problems. This discrepancy is
often the result of automata-style techniques global approach to the
problem and the consequent need for constructing an extremely large
automaton. This is in particular the case when reasoning in theories
consisting of large number of relatively simple formulas, such as
descriptions of database schemes, is required. In this paper, we
propose techniques that allow us to approach a mu-calculus
satisfiability problem in an incremental fashion and without the need
for re-computation. In addition, we also propose heuristics that guide
the problem partitioning in a way that is likely to reduce the size of
the problems that need to be solved.
|
DB Meeting:
|
Friday July 13, 2:00pm, DC 1331
|
Speaker:
|
Robert Warren |
Title:
|
Some observations on scalable client/server databases
|
Abstract:
|
|
DB Meeting:
|
Friday July 20, 2:00pm, DC 1331
|
Speaker:
|
Susanne Busse, Technische Universität Berlin
|
Title:
|
Feature-based Classification of Information Integration Systems
|
Abstract:
|
An Information Integration System (IIS) provides a single access point
to read data from a set of autonomous heterogeneous data sources.
There exist different types of IIS ranging from structured mediators
and Peer Data Management Systems to Information Retrieval systems like
search engines. Similarly, a variety of techniques can be used for
particular aspects of IIS, such as query optimization or
specifications of integrated schemas.
In this talk we present a model-driven approach dealing with variants
to support an IIS engineer to select and reuse existing architectures
and techniques for information integration. The core concept is a
classification specifying both functional and quality features of IIS
and dependencies between them. This feature-based classification
allows to match specifications to find appropriate elements for reuse.
The talk shall open a discussion on crucial points of IIS design from
a database, Information Retrieval and software engineering point of
view.
|
DB Meeting:
|
Friday July 27, 2:00pm, DC 1331
|
Speaker:
|
Nan Tang, Chinese University of Hong Kong
|
Title:
|
Multiple Materialized View Selection for XPath Query Rewriting
|
Abstract:
|
We study the problem of answering XPATH queries using
multiple materialized views. Despite the efforts on answering queries
using single materialized view, answering queries using multiple views
remains relatively new. We address two important aspects of this
problem: multiple-view selection and equivalent multiple-view rewriting.
With regards to the first problem, we describer an NFA-based approach
(called VFILTER) to filter views that cannot be used to answer a given
query. Then we present the criterion for multiple view/query
answerability. Based on the output of VFILTER, we further discuss a
heuristic method to identify a minimal view set that can answer a given
query. For the problem of multiple-view rewriting, we first refine the
materialized fragments of each selected view (like pushing selection),
then we join the refined fragments utilizing an encoding scheme.
Finally, we extract the result of the query from the materialized
fragments of a single view.
|
MMath Thesis Presentation:
|
Tuesday July 31, 2:00pm, DC 3314
|
Speaker:
|
Kareem El Gebaly
|
Title:
|
Robustness in Automatic Physical Database Design
|
Abstract:
|
Automatic physical database design tools rely on "what-if" interfaces to the query optimizer
to estimate the execution time of the training query workload under different candidate
physical designs. The tools use these what-if interfaces to recommend physical designs that
minimize the estimated execution time of the input training workload. In this thesis, we
argue that minimizing estimated execution time alone can lead to designs with inherent
problems. In particular, if the optimizer makes an error in estimating the execution time
of some workload queries, the recommended physical design may actually harm the
workload instead of benefiting it. In this sense, the physical design is risky. Moreover, if
the production queries are slightly different from the training queries, the recommended
physical design may not benefit them at all. In this sense, the physical design is not
general. We define Risk and Generality as two new measures to evaluate the quality of
a proposed physical database design, and we show how to extend the objective function
being optimized by a generic physical design tool to take these measures into account.
We have implemented a simple physical design advisor in PostgreSQL, and we use it to
experimentally demonstrate the usefulness of our approach. We show that our two new
metrics result in physical designs that are more robust,
which means that the user can implement them with a higher degree of confidence. This
is particularly important as we move towards truly zero-administration database systems
in which there is not the possibility for a DBA to vet the recommendations of the physical
design tool before applying them.
|
MMath Thesis Presentation:
|
Friday August 24, 10:30am, DC 1316
|
Speaker:
|
Mohammed Abouzour
|
Title:
|
Automatically Tuning Database Server Multiprogramming Level
|
Abstract:
|
Optimizing database systems to achieve the maximum attainable throughput
of the underlying hardware is one of the many difficult tasks that face
Database Administrators. With the increased use of database systems in
many environments, this task has even become more difficult. One of the
parameters that needs to be configured is the number of worker tasks
that the database server uses (the multiprogramming level). The
underlying intuition is that every workload has an optimal
multiprogramming level that can achieve the best throughput given the
workload characteristic. This talk will focus on how to automatically
adjust the number of database server worker tasks to achieve maximum
throughput under varying workload characteristics. I will present three
algorithms that we have developed and compare the performance of these
algorithm.
|
This page is maintained by
Ashraf Aboulnaga.