|DB Meeting:||Friday May 4, 2:00pm, DC 1331|
|Speaker:||Ani Nica, Sybase iAnywhere|
|Title:||Issues and Solutions Related to Transaction Support for Maintenance of Materialized Views|
|Abstract:||In this talk I will present a survey of research related to transaction support for the maintenance of materialized views in relational database management systems. The survey will cover proposed solutions for increasing the level of concurrency and decreasing the rate of deadlocks, as well as mechanisms for recovery when materialized views are incrementally maintained. I will also provide an overview of the mechanisms implemented in some commercially available relational database management systems for maintaining materialized views.|
|MMath Essay Presentation:||Thursday May 10, 1:00pm, DC 1331|
|Title:||The Impact of I/O Cost Models on Database Query Optimizers|
|Abstract:||In this study, we try to quantitatively characterize the impact of I/O cost model errors on the optimizer of a Database Management System (DBMS). The cost of a query execution plan depends not only on the number of I/O operations required by the plan, but also on the cost of those operations. An I/O cost model consists of two parts: I/O costing functions (IOCF) and I/O count estimation. I/O count estimation is concerned with estimating the number of I/O operations required by a relational operator; the IOCF describes the cost of those operations, e.g., the cost of reading one page from disk sequentially or randomly. Errors in estimating the IOCFs and the number of I/O requests can cause the optimizer to choose suboptimal execution plans. We use the global relative cost (GRC) to describe how far a suboptimal plan is from the optimal plan. We analyze the I/O cost model of an open source DBMS (Postgres) and prove that the maximum GRC is proportional to the product of the square of the IOCF estimation error and the I/O count estimation error. We also study the IOCF problem empirically with another DBMS that models I/O cost differently and confirm the results obtained from analyzing Postgres. In addition, we study experimentally the I/O count estimation problem for two relational operators: nested loop join and index scan. The experimental results demonstrate that in the case of nested loop join, Postgres can overestimate the number of I/O operations issued by the DBMS by an order of magnitude or more, with larger Postgres buffer sizes resulting in larger estimation errors. The estimates become worse when data are skewed. In the case of index scan, the Postgres optimizer can estimate the number of I/O requests poorly when the underlying data is skewed and the index is not completely unclustered. Skew or correlation alone can cause estimation errors; together they can cause larger errors.|
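The abstract's decomposition of an I/O cost model into per-operation costing functions (IOCF) and I/O count estimates, and its GRC measure, can be illustrated with a toy sketch. All constants, plans, and counts below are hypothetical and are not taken from the thesis:

```python
# Toy sketch of the abstract's decomposition: an I/O cost model splits
# into I/O costing functions (IOCF) and I/O count estimates.
# All numbers are invented for illustration.

def plan_cost(io_counts, iocf):
    """Cost of a plan = sum over I/O kinds of (#I/Os) * (cost per I/O)."""
    return sum(io_counts[kind] * iocf[kind] for kind in io_counts)

# Hypothetical IOCF: sequential reads cheaper than random reads.
iocf = {"seq_read": 1.0, "rand_read": 4.0}

# Two candidate plans with the optimizer's *estimated* I/O counts.
plan_a = {"seq_read": 10_000, "rand_read": 100}
plan_b = {"seq_read": 500, "rand_read": 3_000}

cost_a = plan_cost(plan_a, iocf)              # 10400.0
cost_b = plan_cost(plan_b, iocf)              # 12500.0
chosen = min((cost_a, "A"), (cost_b, "B"))[1]  # optimizer picks plan A

# Global relative cost (GRC): how far the chosen plan's *actual* cost is
# from the actual optimum. Suppose plan A's random-read count was badly
# underestimated (e.g., a nested loop join, as in the abstract).
true_a = plan_cost({"seq_read": 10_000, "rand_read": 2_000}, iocf)  # 18000.0
true_b = plan_cost({"seq_read": 500, "rand_read": 3_000}, iocf)     # 12500.0
grc = true_a / min(true_a, true_b)  # 1.44: chosen plan is 44% worse than optimal
```

The point of the sketch is that a count estimation error alone (the random-read count of plan A) is enough to flip the optimizer's choice and produce a GRC above 1.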
|DB Meeting:||Friday May 11, 2:00pm, DC 1331|
|Title:||μBE: User Guided Source Selection and Schema Mediation for Internet Scale Data Integration|
Data integration research has always assumed that users know the data sources
they want to integrate data from, and the mediated schema to which the data
will be mapped. This may not be true for data integration on the Internet,
where there may be thousands of sources providing data of relevance to a
particular domain. Users who want to build data integration systems for such a
domain need to make decisions about (1) which sources to include in the data
integration system, and (2) what mediated schema to use and how to map the
data from the different sources to this schema. These decisions are best made
by an iterative process in which the user explores the space of possible
data integration systems, and refines his or her decision based on this
exploration. The current state of the art in data integration does not
support such an exploratory process.
In this talk, I will present μBE, a tool that helps in this exploratory process by automatically selecting the sources to include in a data integration system and defining a mediated schema on these sources. The input to μBE is a large number of data sources and their characteristics. Users set weights to indicate which source characteristics are important to them, and can also specify constraints on which data sources and which attributes to include in the solution. Based on this input, μBE searches the space of possible data integration solutions and produces a recommended solution. The user can then provide feedback on this solution and iteratively explore the search space using μBE until he or she is satisfied with the generated solution. Our experiments verify that μBE is efficient in exploring the space of possible solutions, and that it produces high-quality solutions.
This is joint work with Kareem El Gebaly.
|DB Seminar:||Monday May 14, 10:30am, DC 1304|
|Speaker:||Jun Yang, Duke University|
|Title:||Data-Driven Processing in Sensor Networks|
|DB Meeting:||Friday May 25, 2:00pm, DC 1331|
|Title:||Lowest Common Ancestor Queries|
Schema-free queries make it possible to search XML documents without
knowing their schema. Among them, lowest common ancestor (LCA) queries were
introduced in several variants on labeled trees. I will review the
approaches from Schmidt, Kersten and Windhouwer [ICDE 2001]; Li, Yu
and Jagadish [VLDB 2004] and Xu and Papakonstantinou [SIGMOD 2005],
and two related projects developed at Waterloo, one in collaboration
with Ehsan Chiniforooshan, Alexander Golynski and Jui-Yi Kao, the
other one in collaboration with Aleh Veraskouski. And of course I will
bring some home-made cake ;)
As is obvious from the abstract, this is joint work with Ehsan Chiniforooshan, Alexander Golynski, Jui-Yi Kao and Aleh Veraskouski.
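As an illustration of the basic primitive behind the surveyed variants, here is a naive LCA computation on a small labeled tree by walking parent pointers. The surveyed papers use far more efficient indexed algorithms; the tree and labels below are invented:

```python
# Naive LCA on a labeled tree via parent pointers -- purely illustrative
# of the primitive; the surveyed work uses indexed, query-oriented variants.

class Node:
    def __init__(self, label, parent=None):
        self.label, self.parent = label, parent
        self.depth = 0 if parent is None else parent.depth + 1

def lca(u, v):
    # Lift the deeper node until both are at the same depth...
    while u.depth > v.depth:
        u = u.parent
    while v.depth > u.depth:
        v = v.parent
    # ...then walk both up in lockstep until they meet.
    while u is not v:
        u, v = u.parent, v.parent
    return u

# Tiny XML-like tree: book -> (title, chapter -> section)
book = Node("book")
title = Node("title", book)
chapter = Node("chapter", book)
section = Node("section", chapter)

print(lca(title, section).label)    # book
print(lca(section, chapter).label)  # chapter
```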
|DB Seminar:||Monday May 28, 11:00am, DC 1331|
|Speaker:||Jiri Schindler, Network Appliance Corp.|
|Title:||Data ONTAP - Storage system for reliably storing data on real-world (=unreliable) disk drives|
One of the many functions of an enterprise storage system is to
reliably store data using unreliable components and magnetic media,
i.e., disk drives. Over the last twenty years, disk drive
technology has advanced in many ways. However, despite these
improvements, the reliability of data stored on disk drives has not
markedly changed; as complexity increases, disk drives exhibit many
more types of partial or transient failures with system-wide effects.
This talk will briefly describe the high-level architecture of the Network Appliance storage systems, called Filers, with an emphasis on the mechanisms by which their specialized operating system, Data ONTAP, deals with disk errors and failures. The talk will also present results of a recent wide-scale study on latent sector errors, show how such errors have system-level effects, and describe how a storage system copes with these types of errors. Finally, the talk will describe some related work and argue for more detailed failure models that account for current disk drive failure modes, rather than the now almost twenty-year-old RAID-level reliability equations of Patterson et al.
|Speaker Bio:||Jiri Schindler is a member of research staff at the Network Appliance Advanced Development Group, located in Waltham, MA near Boston. His research interests include storage device-efficient data layouts, performance modeling and load balancing. He received his PhD in 2003 from Carnegie Mellon University. Although not Canadian, he enjoys sports on water in its solid state when not building prototypes or analytical models of storage systems.|
|DB Meeting:||Friday June 1, 2:00pm, DC 1331|
|Speaker:||Xiaoyan Yang, National University of Singapore|
|Title:||In-Network Execution of Monitoring Queries in Sensor Networks|
|Abstract:||Sensor networks are widely used in many applications for collecting information from the physical environment. In these applications, it is usually necessary to track the relationships between sensor data readings within a time window to detect events of interest. However, it is difficult to detect such events by using the common aggregate or selection queries. We address the problem of processing window self-join in order to detect events of interest. Self-joins are useful in tracking correlations between different sensor readings, which can indicate an event of interest. We propose the Two-Phase Self-Join (TPSJ) scheme to efficiently evaluate self-join queries for event detection in sensor networks. Our TPSJ scheme takes advantage of the properties of the events and carries out data filtering during in-network processing. We discuss TPSJ execution with one window and we extend it for continuous event monitoring. Our experimental evaluation results indicate that the TPSJ scheme is effective in reducing the amount of radio transmissions during event detection.|
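The window self-join at the heart of this abstract can be sketched generically. This is not the TPSJ scheme itself, which adds event-aware in-network filtering; the readings and join predicate below are invented:

```python
# Generic sliding-window self-join over a stream of sensor readings
# (NOT the TPSJ scheme -- TPSJ adds in-network filtering on top of this).
# Pairs readings from *different* sensors that fall in the same time
# window and satisfy a join predicate, signalling an event of interest.

from collections import deque

def window_self_join(stream, window, predicate):
    buf = deque()  # readings still inside the time window
    for sid, t, val in stream:
        while buf and t - buf[0][1] > window:    # evict expired readings
            buf.popleft()
        for sid2, t2, val2 in buf:
            if sid != sid2 and predicate(val, val2):
                yield (sid, sid2, t)             # correlated pair detected
        buf.append((sid, t, val))

# Hypothetical event: two different sensors report temperatures within
# 1 degree of each other inside a 10-second window.
readings = [("a", 0, 20.0), ("b", 3, 20.5), ("c", 8, 35.0), ("b", 15, 19.8)]
close = lambda x, y: abs(x - y) <= 1.0
print(list(window_self_join(readings, 10, close)))  # [('b', 'a', 3)]
```

In a sensor network, shipping every reading to a base station to run this join is what TPSJ avoids: its two phases push filtering into the network so that only readings that can contribute to a join result are transmitted.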
|DB Meeting:||Friday June 8, 2:00pm, DC 1331|
|Title:||Principles of clustering - a theory-biased survey|
The talk is a survey of clustering from a theoretical perspective:
Major clustering approaches: in what ways do they differ?
What kind of guarantees are available for the performance of clustering algorithms?
Considerations in choosing clustering parameters, etc.
I will try to make the talk as useful as possible to a database audience, but I am a theoretician and I can't change my skin. So, I'll talk about clustering paradigms rather than clustering software packages, and I'll talk about theorems and open questions rather than practical experience and experimental results.
|DB Meeting:||Friday June 22, 2:00pm, DC 1331|
|Title:||Performance-failure Tolerant Range Query Processing in P2P Networks|
|Abstract:||Tree-based overlay networks have been developed to support range query processing over large-scale P2P networks, and they employ multiple paths among peers to provide fault-tolerance against network failures. However, due to their lack of "fault-tolerance configurability", they are unable to support fault-tolerance specifications with real-time demands, leading to performance failures. To correct this problem, we developed a scalable B-tree-based overlay network that supports fault-tolerance specifications through persistent performance guarantees independent of varying network failure rates. I will also discuss replication strategies, deployed at the query processing level, that enhance existing tree overlays to support real-time demands.|
|DB Seminar:||Monday June 25, 10:30am, DC 1304|
|Speaker:||Ugur Cetintemel, Brown University|
|Title:||Processing and Routing Streams in a Networked World|
|DB Meeting:||Friday July 6, 2:00pm, DC 1331|
|Title:||An Incremental Technique for Automata-based Decision Procedures|
|Abstract:||Automata-based decision procedures commonly achieve optimal complexity bounds. However, in practice, they are often outperformed on many practical reasoning problems by sub-optimal (but more local-search based) techniques, such as tableaux. This discrepancy is often the result of automata-style techniques' global approach to the problem and the consequent need to construct an extremely large automaton. This is in particular the case when reasoning in theories consisting of a large number of relatively simple formulas, such as descriptions of database schemes, is required. In this paper, we propose techniques that allow us to approach a mu-calculus satisfiability problem in an incremental fashion and without the need for re-computation. In addition, we propose heuristics that guide the problem partitioning in a way that is likely to reduce the size of the problems that need to be solved.|
|DB Meeting:||Friday July 13, 2:00pm, DC 1331|
|Title:||Some observations on scalable client/server databases|
|DB Meeting:||Friday July 20, 2:00pm, DC 1331|
|Speaker:||Susanne Busse, Technische Universität Berlin|
|Title:||Feature-based Classification of Information Integration Systems|
An Information Integration System (IIS) provides a single access point
to read data from a set of autonomous heterogeneous data sources.
There exist different types of IIS ranging from structured mediators
and Peer Data Management Systems to Information Retrieval systems like
search engines. Similarly, a variety of techniques can be used for
particular aspects of IIS, such as query optimization or
specifications of integrated schemas.
In this talk we present a model-driven approach that deals with these variants to support an IIS engineer in selecting and reusing existing architectures and techniques for information integration. The core concept is a classification specifying both functional and quality features of IIS and the dependencies between them. This feature-based classification makes it possible to match specifications in order to find appropriate elements for reuse. The talk is intended to open a discussion on crucial points of IIS design from database, Information Retrieval and software engineering points of view.
|DB Meeting:||Friday July 27, 2:00pm, DC 1331|
|Speaker:||Nan Tang, Chinese University of Hong Kong|
|Title:||Multiple Materialized View Selection for XPath Query Rewriting|
|Abstract:||We study the problem of answering XPATH queries using multiple materialized views. Despite the efforts on answering queries using a single materialized view, answering queries using multiple views remains relatively new. We address two important aspects of this problem: multiple-view selection and equivalent multiple-view rewriting. With regard to the first problem, we describe an NFA-based approach (called VFILTER) to filter views that cannot be used to answer a given query. Then we present the criterion for multiple view/query answerability. Based on the output of VFILTER, we further discuss a heuristic method to identify a minimal view set that can answer a given query. For the problem of multiple-view rewriting, we first refine the materialized fragments of each selected view (like pushing selection), then we join the refined fragments utilizing an encoding scheme. Finally, we extract the result of the query from the materialized fragments of a single view.|
|MMath Thesis Presentation:||Tuesday July 31, 2:00pm, DC 3314|
|Speaker:||Kareem El Gebaly|
|Title:||Robustness in Automatic Physical Database Design|
|Abstract:||Automatic physical database design tools rely on "what-if" interfaces to the query optimizer to estimate the execution time of the training query workload under different candidate physical designs. The tools use these what-if interfaces to recommend physical designs that minimize the estimated execution time of the input training workload. In this thesis, we argue that minimizing estimated execution time alone can lead to designs with inherent problems. In particular, if the optimizer makes an error in estimating the execution time of some workload queries, the recommended physical design may actually harm the workload instead of benefiting it. In this sense, the physical design is risky. Moreover, if the production queries are slightly different from the training queries, the recommended physical design may not benefit them at all. In this sense, the physical design is not general. We define Risk and Generality as two new measures to evaluate the quality of a proposed physical database design, and we show how to extend the objective function being optimized by a generic physical design tool to take these measures into account. We have implemented a simple physical design advisor in PostgreSQL, and we use it to experimentally demonstrate the usefulness of our approach. We show that our two new metrics result in physical designs that are more robust, which means that the user can implement them with a higher degree of confidence. This is particularly important as we move towards truly zero-administration database systems in which a DBA has no opportunity to vet the recommendations of the physical design tool before they are applied.|
|DB Seminar:||Friday August 17, 2:00pm, DC 1304 (Please note special day and time)|
|Speaker:||Shivnath Babu, Duke University|
|Title:||Experiment-Driven Management of Web Services and Scientific Applications|
|MMath Thesis Presentation:||Friday August 24, 10:30am, DC 1316|
|Title:||Automatically Tuning Database Server Multiprogramming Level|
|Abstract:||Optimizing database systems to achieve the maximum attainable throughput of the underlying hardware is one of the many difficult tasks that face database administrators. With the increased use of database systems in many environments, this task has become even more difficult. One of the parameters that needs to be configured is the number of worker tasks that the database server uses (the multiprogramming level). The underlying intuition is that every workload has an optimal multiprogramming level that can achieve the best throughput given the workload characteristics. This talk will focus on how to automatically adjust the number of database server worker tasks to achieve maximum throughput under varying workload characteristics. I will present three algorithms that we have developed and compare their performance.|
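The abstract does not describe the three algorithms, but the underlying idea of feedback-driven MPL tuning can be sketched with a generic hill climber. This is purely illustrative (not one of the thesis's algorithms), and the simulated throughput curve is invented:

```python
# Generic hill-climbing sketch of feedback-driven MPL tuning (NOT one of
# the thesis's three algorithms): repeatedly nudge the multiprogramming
# level (MPL) and keep moves that improve measured throughput.

def tune_mpl(measure_throughput, mpl=16, step=4, rounds=20, min_mpl=1):
    """measure_throughput(mpl) -> observed throughput at that MPL."""
    best = measure_throughput(mpl)
    direction = +1
    for _ in range(rounds):
        candidate = max(min_mpl, mpl + direction * step)
        t = measure_throughput(candidate)
        if t > best:
            mpl, best = candidate, t   # keep the improving move
        else:
            direction = -direction     # reverse the search direction
    return mpl

# Hypothetical workload: throughput peaks at MPL = 32, after which adding
# workers causes contention and throughput falls off.
def simulated(mpl):
    return -(mpl - 32) ** 2

print(tune_mpl(simulated))  # 32
```

A real controller has to cope with noisy throughput measurements and workload shifts, which is precisely where the three algorithms in the talk would differ from this naive sketch.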