Database Research Group Events

Winter 2005

Note: Events of interest to the Database Research Group are posted to the uw.cs.database newsgroup and are mailed to the dbgroup mailing lists: db-faculty (for DB group faculty), db-grads (for DB group graduate students), and db-friends (for DB group alumni, visitors, and friends). If you wish to subscribe to one of these lists, send mail to majordomo@db with "subscribe <list>" in the message body, where <list> is the list you wish to subscribe to.  For example, use "subscribe db-friends" to subscribe to the db-friends list. To unsubscribe, send "unsubscribe <list>" to the same address.
DB group meetings
The DB group meets most Friday afternoons at 2pm, usually in DC1331. See the list of current events for times and locations of upcoming meetings. Each meeting lasts for an hour and features an informal presentation by one of the members of the group. Everyone is welcome to attend. These talks are intended to raise questions and to stimulate discussion rather than being polished presentations of research results. Speakers are determined using a rotating speaker list, which can be found on the DB group meeting page
DB seminar series
The DB seminar series features visiting speakers. These seminars are more-or-less monthly, and are usually scheduled on Monday mornings at 11am. See the list of current events for times and locations of upcoming seminars. The full schedule can be found on the DB seminar series page.

Recent and Upcoming Events

DB meeting: Friday, January 7, 2:00pm, DC1331
Speaker: Maneesha Sahasrabudhe
Title: SQL-based CTL model-checking for Feature Interactions

The primary goal of this thesis is to harness the high productivity of SQL in developing an efficient model checker and applying it to detect feature interactions in telephony software. Model checking has been successfully applied in the industry for verification of hardware and software systems. CTL model checking is a special subtype of model checking that deals with properties specified in a branch of Temporal Logic called the Computational Tree Logic. Feature interactions occur in telephony software with addition of new features, since the new feature might behave differently from the expected system behavior and there are exponential number of execution points where such interaction might occur. These interactions can be detected by verifying the telephony software using a model-checker. We investigated in particular, if SQL offered any performance gain in detecting these interactions due to its ability to efficiently handle huge data sets. To conduct the performance evaluation of our approach, we have model-checked a telephony system of four feature specifications taken from the Bellcore standard for feature interactions. We also compare the performance of our approach with commercially available model checker - SMV.

DB meeting: Friday, January 14, 2:00pm, DC1331
Speaker: Lei Chen
Title: Similarity-based Search Over Time Series and Trajectory Data

Time series data have been used in many applications, such as financial data analysis and weather forecasting. Similarly, trajectories of moving objects are often used to perform movement pattern analysis in surveillance video and sensor monitoring systemsThese applications require finding, from among a large set of time series or trajectory data, those that are similar to a query data (the similarity-based retrieval problem). Most of this data are unclean, containing local time shifting and noise, while most of the work in this area has developed techniques that work on clean data. In this talk, I address similarity-based retrieval of time series and trajectory data in the presence of local time shifting and noise. In particular, I will present two novel distance functions: metric distance function, called Edit distance with Real Penalty (ERP), that can support local time shiftin, and, Edit Distance on Real sequence (EDR) that can handle noise as well as local time shifting, but is not metric. Since the proposed distance functions are computationally expensive, I propose several indexing and pruning methods to improve the retrieval efficiency. For ERP, a framework is developed to index time series or trajectory data under a metric distance function, which exploits the pruning power of lowering bounding and triangle inequality. For EDR, three pruning techniques mean value Q-grams, near triangle inequality, and trajectory histograms are developed to improve the retrieval efficiency.

DB Seminar: Monday, January 17, 11:00 am, DC1304
Speaker: Dennis Shasha, New York University
Title: Privacy-Preserving Piracy Prevention

DB meeting: Friday, January 21, 2:00pm, MC 5158 (note the unusual location)
Speaker: Troy Gonsalves
Title: Relational Repository-based Development: Leveraging Database Technology to Implement a Peer-to-Peer VoIP Application for Handheld Devices

Application developers typically have access to an informal specification in the form of documentation and a formal specification in the form of code. Unfortunately, neither of these is particularly convenient when it comes to analyzing or understanding control flow data. We present an approach whereby an application's control flow data is expressed using an augmented version of statecharts. This expression is then stored in a relational database allowing developers to easily manipulate the control flow data. We claim that our approach has the ability to expedite the ongoing development of feature rich systems. To demonstrate this we implement, PDAPhone, an IP-based, peer-to-peer telephony application for handheld devices. The existence of this application allows us to show how our approach can reduce the complexity involved in expressing features and pave the way for tool development. In addition, we demonstrate the efficacy of leveraging database technology to aid in development by implementing a robust application upgrade distribution mechanism using database replication and synchronization and by presenting SQL as a novel means of performing system verification.

DB meeting: Friday, January 28, 2:00pm, DC1331
Speaker: Lukasz Golab
Title: Identifying Frequent Items in Sliding Windows over Data Streams

In this talk, I will present an algorithm for answering frequent item queries in the limited-memory sliding window model. This model maintains the last N items that have arrived on the input stream at any given time and does not allow the entire window to be stored in memory. Our algorithm is designed for real-time Internet traffic analysis and is shown to work well when tested on TCP traffic logs.

This talk is based on: Golab, DeHaan, Demaine, Lopez-Ortiz, and Munro, Identifying Frequent Items in Sliding Windows over On-Line Packet Streams, ACM SIGCOMM Internet Measurement Conference, 2003.

DB meeting: Friday, February 4, 2:00pm, DC1331
Speaker: Ihab Ilyas
Title: Correlation Detection in Query Optimization

When query optimizers erroneously assume that database columns are statistically independent, they can underestimate the selectivities of con- junctive predicates by orders of magnitude. Such underestimation often leads to drastically suboptimal query execution plans. We demonstrate cords, an efficient and scalable tool for automatic discovery of correlations and soft functional dependencies between column pairs. Cords discovers correlations in an efficient and scalable manner. The output of cords can be visualized graphically, making cords a useful mining and analysis tool for database administrators. cords ranks the discovered correlated column pairs and recommends to the optimizer a set of statistics to collect for the "most important" of the pairs. Use of these statistics speeds up processing times by orders of magnitude for a wide range of queries.

DB meeting: Friday, February 18, 2:00pm, DC1331
Speaker: Huaxin Zhang

Numerous access control models have been proposed in recent years. These models are mostly defined at an abstract level without tackling the practical problems of efficient computing and cost effective maintenance of the access controls. Some existing work addresses these low level concerns by assuming that the access control subjects are organized as a tree. This may not be true for role-based multi-user systems, where the subject domains may instead form a DAG. Having a DAG hierarchy results in a sharp rise in both computational complexity and maintenance overhead. In this talk, I am going to illustrate some aspects of these problems and propose some algorithms for efficiently computing the access controls.

DB Seminar: Monday, February 21, 11:00am, DC1302
Speaker: Elisa Bertino, University of Milano and Purdue University
Title: Purpose-Based Database Access Control for Privacy Protection

DB meeting: Friday, February 25, 2:00pm, DC1331
Speaker: Charles Clarke
Title: Approximating the top-m passages in a parallel question answering system

In this talk, I'll discuss the problem of retrieving the top-m ranked items from a large collection, randomly distributed across an n-node system. In order to retrieve the top-m overall, we must retrieve the top m from the subcollection stored on each node and merge the results. However, if we are willing to accept a small probability that one or more of the top-m items may be missed, it is possible to reduce computation time by retrieving only the top k < m from each node. I'll demonstrate that this simple observation can be exploited in a realistic application to produce a substantial efficiency improvement without compromising the quality of the retrieved results. To support the claim, I'll present a statistical model that predicts the impact of the optimization. The results are structured around a specific application -- passage retrieval for question answering -- but are more broadly applicable.

DB meeting: Friday, March 4, 2:00pm, DC1331
Speaker: Gulay Unel
Title: Deciding Weak Monadic Second-order Logics using Complex-value Datalog

We propose the use of techniques developed for query evaluation of Complex-value Datalog queries for determining satisfiability of WS1S and WS2S formulas. We show that the use of database query evaluation techniques in particular the Magic Set rewriting of Datalog queries and the top-down resolution-based evaluation with memoing can, in many common cases, considerably improve the performance of decision procedures based on the connection between logics and automata, such as the MONA system.

DB meeting: Friday, March 11, 2:00pm, DC1331
Speaker: Anil Goel

I will provide a technical overview of the SQL Anywhere cost model while focussing on some the unique challenges faced, and solutions necessary, because of the design philosophy and the business requirements of the product.

DB meeting: Friday, March 18, 2:00pm, DC1331
Speaker: Qiang Wang

In this talk, I will give an overview on the XML query shipping problem over P2P networks. Briefly, I will explain the decentralized catalog methodologies employed under different types of P2P networks (i.e., structured and unstructured), and go through related works under structured P2P networks, then focus on an existing bloom-filter based approach for unstructured P2P networks, from which I raise some problems expected to be solved.

DB meeting: Friday, April 1, 2:00pm, DC1331
Speaker: Shai Ben-David
Title: Some Core Statistical Learning Techniques and Their Applications to DB Tasks

I shall give a crash tutorial on the basics of statistical learning theory. I'll explain what the VC-dimension is and what is its statistical significance. I'll introduce the notions of epsilon-nets and epsilon-approximations and show how can these be applied to efficient approximate query processing. Finally, I shall briefly describe some recent application of these ideas to change detection in streaming data analysis.

DB Seminar: Monday, April 11, 11:30 am, DC1302
Speaker: Minos N. Garofalakis, Bell Labs
Title: Sketching Massive Distributed Data Streams

DB meeting: Friday, April 15, 2:00pm, DC1331
Speaker #1: David DeHaan
Title: Stacked Indexed Views in Microsoft SQL Server

Appropriately selected materialized views (also called indexed views) can speed up query execution by orders of magnitude. Most database systems limit support for materialized views to select-project-join expressions, possibly with a group-by, over base tables because this class of views can be efficiently maintained incrementally and thus kept up to date with the underlying source tables. However, limiting views to reference only base tables restricts the class of queries that can be supported by materialized views. View stacking (also called views on views) relaxes one restriction by allowing a materialized view to reference both base tables and other materialized views. This extends materialized view support to additional types of queries. This paper describes a prototype implementation of stacked views within Microsoft SQL Server and explains which classes of queries can be supported. To support view matching for stacked views, a signature mechanism was added to the optimizer. This mechanism turned out to be beneficial also for regular views by significantly speeding up view matching.

This talk is based upon work I performed last summer at Microsoft Research working with Paul Larson and Jingren Zhou.

Speaker #2: Lukasz Golab
Title: Update-Pattern-Aware Modeling and Processing of Continuous Queries

A defining characteristic of continuous queries over on-line data streams, possibly bounded by sliding windows, is the potentially infinite and time-evolving nature of their inputs and outputs. New items continually arrive on the input streams and new results are continually produced. Additionally, inputs expire by falling out of range of their sliding windows and results expire when they cease to satisfy the query. This impacts continuous query processing in two ways. First, data stream systems allow tables to be queried alongside data streams, but in terms of query semantics, it is not clear how updates of tables are different from insertions and deletions caused by the movement of the sliding windows. Second, many interesting queries need to store state, which must be kept up-to-date as time goes on. Therefore, query processing efficiency depends highly on the amount of overhead involved in state maintenance.

In this paper, we show that the above issues can be solved by understanding the update patterns of continuous queries and exploiting them during query processing. We propose a classification that defines four types of update characteristics. Using our classification, we present a definition of continuous query semantics that clearly states the role of relations. We then propose the notion of update-pattern-aware query processing, where physical implementations of query operators, including the data structures used for storing intermediate state, vary depending on the update patterns of their inputs and outputs. When tested on IP traffic logs, our update-pattern-aware query plans routinely outperform the existing techniques by an order of magnitude.

DB meeting: Friday, April 22, 2:00pm, DC1331
Speaker: Robert Warren
Title: Multi-column matching for database schema translation

I will describe a generalized method for discovering complex schema matches involving multiple database columns. The method does not require linked data and is capable of dealing with both fixed and variable-length text columns.

This is done through an iterative algorithm, where the correct sequence of concatenations of column substrings is learned to translate the information from one database to another. A solution is also proposed for the discovery of constant data separators within the aggregated column.

I will introduce the algorithm and discuss its performance on several data sets reflecting the contents of common databases.

DB meeting: Friday, April 29, 2:00pm, DC1331
Speaker: Glenn Paulley

There is no dispute that industry-standard benchmarks have forced database vendors to dramatically improve their products in order to remain competitive. While there may be specific features or optimizations added for specific benchmarks, the vast percentage of these performance improvements have yielded substantial improvements for most classes of database applications.

However, it remains very difficult for database users to determine how they can use industry-standard benchmark results to determine how changes to their server environment will affect performance. In this talk, I will outline some of the issues in doing so, and outline some of the pitfalls in constructing a custom database application benchmark.

Spring 2005 meeting schedule

This page is maintained by  Ken Salem.