[Please remove <h1>]
Winter 2005
Note: Events of interest to the
Database Research Group are posted to the uw.cs.database
newsgroup and are mailed to the dbgroup mailing lists: db-faculty
(for DB group faculty), db-grads (for DB group graduate students),
and db-friends (for DB group alumni, visitors, and friends). If
you wish to subscribe to one of these lists, send mail to
majordomo@db
with "subscribe <list>" in the message body, where
<list> is the list you wish to subscribe to. For example,
use "subscribe db-friends" to subscribe to the db-friends list. To
unsubscribe, send "unsubscribe <list>" to the same address.
- DB group
meetings
- The DB group meets most Friday afternoons at 2pm, usually in DC1331.
See the list of current events for
times and locations of upcoming meetings. Each meeting lasts
for an hour and features an informal presentation by one of the
members of the group. Everyone is welcome to attend. These talks are
intended to raise questions and to stimulate discussion rather than
being polished presentations of research results. Speakers are determined
using a rotating speaker list, which can be found on the DB group meeting page
- DB seminar
series
- The DB seminar series features visiting speakers. These seminars are
more-or-less monthly, and are usually scheduled on Monday
mornings at 11am. See the list of current
events for times and locations of upcoming seminars. The
full schedule can be found on the DB seminar
series page.
Recent and Upcoming Events
DB meeting: |
Friday, January 7, 2:00pm, DC1331 |
Speaker: |
Maneesha Sahasrabudhe
|
Title: |
SQL-based CTL model-checking for Feature Interactions
|
Abstract: |
The primary goal of this thesis is to harness the high productivity of
SQL in developing an efficient model checker and applying it to detect
feature interactions in telephony software. Model checking has been
successfully applied in the industry for verification of hardware and
software systems. CTL model checking is a special subtype of model
checking that deals with properties specified in a branch of Temporal
Logic called the Computational Tree Logic. Feature interactions occur
in telephony software with addition of new features, since the new
feature might behave differently from the expected system behavior and
there are exponential number of execution points where such
interaction might occur. These interactions can be detected by
verifying the telephony software using a model-checker. We
investigated in particular, if SQL offered any performance gain in
detecting these interactions due to its ability to efficiently handle
huge data sets. To conduct the performance evaluation of our
approach, we have model-checked a telephony system of four feature
specifications taken from the Bellcore standard for feature
interactions. We also compare the performance of our approach with
commercially available model checker - SMV.
|
DB meeting: |
Friday, January 14, 2:00pm, DC1331 |
Speaker: |
Lei Chen
|
Title: |
Similarity-based Search Over Time Series and Trajectory Data
|
Abstract: |
Time series data have been used in many applications, such as financial data analysis and weather forecasting. Similarly, trajectories of moving objects are often used to perform movement pattern analysis in surveillance video and sensor monitoring systemsThese applications require finding, from among a large set of time series or trajectory data, those that are similar to a query data (the similarity-based retrieval problem). Most of this data are unclean, containing local time shifting and noise, while most of the work in this area has developed techniques that work on clean data. In this talk, I address similarity-based retrieval of time series and trajectory data in the presence of local time shifting and noise. In particular, I will present two novel distance functions: metric distance function, called Edit distance with Real Penalty (ERP), that can support local time shiftin, and, Edit Distance on Real sequence (EDR) that can handle noise as well as local time shifting, but is not metric. Since the proposed distance functions are computationally expensive, I propose several indexing and pruning methods to improve the retrieval efficiency. For ERP, a framework is developed to index time series or trajectory data under a metric distance function, which exploits the pruning power of lowering bounding and triangle inequality. For EDR, three pruning techniques mean value Q-grams, near triangle inequality, and trajectory histograms are developed to improve the retrieval efficiency.
|
DB meeting: |
Friday, January 21, 2:00pm, MC 5158 (note the unusual location) |
Speaker: |
Troy Gonsalves
|
Title: |
Relational Repository-based Development: Leveraging Database
Technology to Implement a Peer-to-Peer
VoIP Application for Handheld Devices
|
Abstract: |
Application developers typically have access to an informal specification in
the form of documentation and a formal specification in the form of code.
Unfortunately, neither of these is particularly convenient when it comes to
analyzing or understanding control flow data. We present an approach whereby
an
application's control flow data is expressed using an augmented version of
statecharts. This expression is then stored in a relational database
allowing
developers to easily manipulate the control flow data. We claim that our
approach has the ability to expedite the ongoing development of feature rich
systems. To demonstrate this we implement, PDAPhone, an IP-based,
peer-to-peer
telephony application for handheld devices. The existence of this
application
allows us to show how our approach can reduce the complexity involved in
expressing features and pave the way for tool development. In addition, we
demonstrate the efficacy of leveraging database technology to aid in
development by implementing a robust application upgrade distribution
mechanism
using database replication and synchronization and by presenting SQL as a
novel
means of performing system verification.
|
DB meeting: |
Friday, January 28, 2:00pm, DC1331 |
Speaker: |
Lukasz Golab
|
Title: |
Identifying Frequent Items in Sliding Windows over Data Streams
|
Abstract: |
In this talk, I will present an algorithm for answering frequent item
queries in the limited-memory sliding window model. This model maintains
the last N items that have arrived on the input stream
at any given time and does not allow the entire window to be stored in
memory. Our algorithm is designed for real-time Internet traffic analysis
and is shown to work well when tested on TCP traffic logs.
This talk is based on: Golab, DeHaan, Demaine, Lopez-Ortiz, and Munro,
Identifying Frequent Items in Sliding Windows over On-Line Packet Streams,
ACM SIGCOMM Internet Measurement Conference, 2003.
|
DB meeting: |
Friday, February 4, 2:00pm, DC1331 |
Speaker: |
Ihab Ilyas
|
Title: |
Correlation Detection in Query Optimization
|
Abstract: |
When query optimizers erroneously assume that
database columns are statistically independent,
they can underestimate the selectivities of con-
junctive predicates by orders of magnitude. Such
underestimation often leads to drastically suboptimal
query execution plans. We demonstrate
cords, an efficient and scalable tool for automatic
discovery of correlations and soft functional
dependencies between column pairs. Cords discovers
correlations in an efficient and scalable manner. The output
of cords can be visualized graphically, making
cords a useful mining and analysis tool for
database administrators. cords ranks the discovered
correlated column pairs and recommends
to the optimizer a set of statistics to collect for
the "most important" of the pairs. Use of these
statistics speeds up processing times by orders of
magnitude for a wide range of queries.
|
DB meeting: |
Friday, February 18, 2:00pm, DC1331 |
Speaker: |
Huaxin Zhang
|
Topic: |
Numerous access control models have been proposed in recent
years. These
models are mostly defined at an abstract level without tackling the
practical problems of efficient computing and cost effective
maintenance of the access controls. Some existing work addresses
these low level concerns by assuming that the access control
subjects are organized as a tree. This may not be true for role-based
multi-user systems, where the subject domains may instead form a DAG.
Having a DAG hierarchy results in a
sharp rise in both computational complexity and maintenance
overhead. In this talk, I am going to illustrate some aspects of these
problems and propose some algorithms for efficiently computing the
access controls.
|
DB meeting: |
Friday, February 25, 2:00pm, DC1331 |
Speaker: |
Charles Clarke
|
Title: |
Approximating the top-m passages in a parallel question answering system
|
Abstract: |
In this talk, I'll discuss the problem of retrieving the top-m ranked
items from a large collection, randomly distributed across an n-node
system. In order to retrieve the top-m overall, we must retrieve the
top m from the subcollection stored on each node and merge the
results. However, if we are willing to accept a small probability that
one or more of the top-m items may be missed, it is possible to reduce
computation time by retrieving only the top k < m from each node. I'll
demonstrate that this simple observation can be exploited in a
realistic application to produce a substantial efficiency improvement
without compromising the quality of the retrieved results. To support
the claim, I'll present a statistical model that predicts the impact of
the optimization. The results are structured around a specific
application -- passage retrieval for question answering -- but are more
broadly applicable.
|
DB meeting: |
Friday, March 4, 2:00pm, DC1331 |
Speaker: |
Gulay Unel
|
Title: |
Deciding Weak Monadic Second-order Logics using Complex-value Datalog
|
Abstract: |
We propose the use of techniques developed for query evaluation
of Complex-value Datalog queries for determining satisfiability
of WS1S and WS2S formulas. We show that the use of database query
evaluation techniques in particular the Magic Set rewriting of Datalog
queries and the top-down resolution-based evaluation with memoing
can, in many common cases, considerably improve the performance of
decision procedures based on the connection between logics and automata,
such as the MONA system.
|
DB meeting: |
Friday, March 11, 2:00pm, DC1331 |
Speaker: |
Anil Goel
|
Topic: |
I will provide a technical overview of the SQL Anywhere cost model while
focussing on some the unique challenges faced, and solutions necessary,
because of the design philosophy and the business requirements of the
product.
|
DB meeting: |
Friday, March 18, 2:00pm, DC1331 |
Speaker: |
Qiang Wang
|
Topic: |
In this talk, I will give an overview on the XML query shipping problem
over P2P networks. Briefly, I will explain the decentralized catalog
methodologies employed under different types of P2P networks (i.e.,
structured and unstructured), and go through related works under
structured P2P networks, then focus on an existing bloom-filter
based approach for unstructured P2P networks, from which I raise
some problems expected to be solved.
|
DB meeting: |
Friday, April 1, 2:00pm, DC1331 |
Speaker: |
Shai Ben-David
|
Title: |
Some Core Statistical Learning Techniques and Their Applications to DB Tasks
|
Abstract: |
I shall give a crash tutorial on the basics of statistical learning theory.
I'll explain what the VC-dimension is and what is its statistical
significance. I'll introduce the notions of epsilon-nets and
epsilon-approximations and show how can these be applied to efficient
approximate query processing.
Finally, I shall briefly describe some recent application of these
ideas to change detection in streaming data analysis.
|
DB meeting: |
Friday, April 15, 2:00pm, DC1331 |
Speaker #1: |
David DeHaan
|
Title: |
Stacked Indexed Views in Microsoft SQL Server
|
Abstract: |
Appropriately selected materialized views (also called indexed views)
can speed up query execution by orders of magnitude. Most database
systems limit support for materialized views to select-project-join
expressions, possibly with a group-by, over base tables because this
class of views can be efficiently maintained incrementally and thus
kept up to date with the underlying source tables. However, limiting
views to reference only base tables restricts the class of queries
that can be supported by materialized views. View stacking (also
called views on views) relaxes one restriction by allowing a
materialized view to reference both base tables and other materialized
views. This extends materialized view support to additional types of
queries. This paper describes a prototype implementation of stacked
views within Microsoft SQL Server and explains which classes of
queries can be supported. To support view matching for stacked views,
a signature mechanism was added to the optimizer. This mechanism
turned out to be beneficial also for regular views by significantly
speeding up view matching.
This talk is based upon work I performed last summer at Microsoft Research working with Paul Larson and Jingren Zhou.
|
Speaker #2: |
Lukasz Golab
|
Title: |
Update-Pattern-Aware Modeling and Processing of Continuous Queries
|
Abstract: |
A defining characteristic of continuous queries over on-line data
streams, possibly bounded by sliding windows, is the potentially
infinite and time-evolving nature of their inputs and outputs. New
items continually arrive on the input streams and new results are
continually produced. Additionally, inputs expire by falling out of
range of their sliding windows and results expire when they cease to
satisfy the query. This impacts continuous query processing in two
ways. First, data stream systems allow tables to be queried alongside
data streams, but in terms of query semantics, it is not clear how
updates of tables are different from insertions and deletions caused
by the movement of the sliding windows. Second, many interesting
queries need to store state, which must be kept up-to-date as time
goes on. Therefore, query processing efficiency depends highly on the
amount of overhead involved in state maintenance.
In this paper, we show that the above issues can be solved by
understanding the update patterns of continuous queries and exploiting
them during query processing. We propose a classification that
defines four types of update characteristics. Using our
classification, we present a definition of continuous query semantics
that clearly states the role of relations. We then propose the notion
of update-pattern-aware query processing, where physical
implementations of query operators, including the data structures used
for storing intermediate state, vary depending on the update patterns
of their inputs and outputs. When tested on IP traffic logs, our
update-pattern-aware query plans routinely outperform the existing
techniques by an order of magnitude.
|
DB meeting: |
Friday, April 22, 2:00pm, DC1331 |
Speaker: |
Robert Warren
|
Title: |
Multi-column matching for database schema translation
|
Abstract: |
I will describe a generalized method for discovering complex schema matches
involving multiple database columns. The method does not require linked
data and is capable of dealing with both fixed and variable-length text
columns.
This is done through an iterative algorithm, where the correct sequence
of concatenations of column substrings is learned to translate the
information from one database to another. A solution is also proposed for
the discovery of constant data separators within the aggregated column.
I will introduce the algorithm and discuss its performance on several
data sets reflecting the contents of common databases.
|
DB meeting: |
Friday, April 29, 2:00pm, DC1331 |
Speaker: |
Glenn Paulley
|
Topic: |
There is no dispute that industry-standard benchmarks have forced database
vendors to dramatically improve their products in order to remain
competitive. While there may be specific features or optimizations added
for specific benchmarks, the vast percentage of these performance
improvements have yielded substantial improvements for
most classes of database applications.
However, it remains very difficult for database users to determine how they
can use industry-standard benchmark results to determine how changes to
their server environment will affect performance. In this talk, I will
outline some of the issues in doing so, and outline some of the pitfalls in
constructing a custom database application benchmark.
|
Spring 2005 meeting schedule
This page is maintained by
Ken
Salem.