[Please remove <h1>]
The Database Seminar Series provides a forum for presentation and discussion
of interesting and current database issues. It complements our internal database
meetings by bringing in external colleagues. The talks that are scheduled for
2003-2004 are below, and more will be listed as we get confirmations. Please
send your suggestions to M. Tamer Özsu.
Unless otherwise noted, all talks will be in room DC 1304. Coffee will be
served 30 minutes before the talk.
We will try to post the presentation notes, whenever that is possible. Please
click on the presentation title to access these notes (usually in pdf format).
Database Seminar Series is supported by iAnywhere Solutions, A Sybase Company.
22 September 2003, 11:00 AM
Title: |
Phrase Matching in XML |
Speaker: |
Divesh Srivastava,
AT&T Labs-Research |
Abstract: |
Phrase matching is a common IR technique to search text and identify
relevant documents in a document collection. Phrase matching in XML presents
new challenges as text may be interleaved with arbitrary markup, thwarting
search techniques that require strict contiguity or close proximity of
keywords. We present a technique for phrase matching in XML that permits
dynamic specification of both the phrase to be matched and the markup that
should be ignored. We develop an efficient algorithm for our technique
that utilizes inverted indices on phrase words and XML tags. We describe
experimental results comparing our algorithm to an indexed-nested loop
algorithm that illustrate our algorithm's effectiveness. Finally, we show
a demo of the PIX system prototype that we built for phrase matching in
XML.
This is joint work with Sihem Amer-Yahia, Mary Fernandez, and Yu
Xu. |
Bio: |
Divesh Srivastava is the head of the Database Research Department at
AT&T Labs-Research. He received his B.Tech. in Computer Science & Engineering
from the Indian Institute of Technology, Bombay, India, and his Ph.D. in
Computer Sciences from the University of Wisconsin, Madison, USA. His current
research interests include XML databases and IP network data management. |
20 October 2003, 11:00 AM
Title: |
Efficient Exploration of Links and Paths in Life Sciences Data Sources |
Speaker: |
Louiqa Raschid, University
of Maryland |
Abstract: |
Life science data sources contain data on classes of scientific entities
such as genes and sequences. Each source may have data on one or more classes.
There is significant diversity in the contents of sources, links between
sources, overlap among sources, and the computational capabilities of these
sources. In this talk, I will summarize two approaches to address the efficient
exploration of sources.
The first involves algorithms to explore the search space of links and
paths between biological data sources, and to efficiently identify paths
that satisfy a navigational query (regular expression). The second is to
develop a framework to determine the properties that characterize (multiple
alternate) links and paths between two sources. I will then define two
problems related to optimization. The task is to
determine a path that satisfies a query, has a high ranking (with respect
to some desired semantic criteria) and can be evaluated by a plan with
a low evaluation cost. The second task is to choose a subset of paths so
as to maximize the benefit (with respect to the semantic criteria subject
to some upper bound on the evaluation cost.
Joint work with Zoe Lacroix, Felix Naumann and Maria Esther Vidal.
If there is time, I will also talk about the following:
Latency Profiles: Performance Monitoring for Wide Area Applications
Recent technological advances have enabled the deployment of wide area
applications against Internet accessible sources. A performance challenge
to applications in such a setting is the unpredictable end-to-end latency
of accessing these sources. We use passive information gathering mechanisms
to learn end-to-end latency distributions and construct Latency Profiles
(LPs). We hypothesize that a group of clients, within an autonomous system
(AS), that are accessing a content server, in another AS, may be represented
by (one or more) LPs. Related networking research on IDMaps, points of
congestion, and BGP routes support such hypothesis. We develop aggregate
LPs to provide coverage of groups (clusters) of client-server pairs.
Using data gathered from a (limited) experiment we demonstrate the feasibility
of constructing LPs and aggregate LPs.
Joint work with Avigdor Gal, Vladimir Zadorozhny and Hui-Fang Wen |
Bio: |
Louiqa Raschid received a Bachelor of Technology in electrical engineering
from the Indian Institute of Technology, Madras, in 1980, and a Ph.D. in
electrical engineering from the University of Florida, Gainesville, in
1987. She is a Professor in the Smith School of Business at the University
of Maryland and holds a joint appointment with the Institute for Advanced
Computer Studies and the Department of Computer Science. Dr. Raschid's
research interests include the following: Scalable architectures for wide
area computation with heterogeneous information servers, e.g., scientific
discovery from biomolecular data sources; Cost models and query planning
and optimization techniques for dynamic wide-area environments; Profile
(utility) based tuning of data delivery to clients in fixed and mobile
networks; Publishing and locating sources based on quality and content
metadata using the WWW and XML. Her research has been published in high
quality conferences (ACM SIGMOD, VLDB, IEEE ICDE and ICDCS, EDBT) and journals
(ACM TODS, IEEE TKDE, IEEE ToC, the Journal of Logic Programming, the VLDB
Journal and the Journal of Parallel and Distributed Computing). Her research
is supported by grants from the National Science Foundation and the Defense
Advanced Research Projects Agency. She was recognized by the Vice President
of Research at the University of Maryland to be among the Top 100 Faculty
in obtaining research funding. Dr. Raschid serves on the editorial board
of the IEEE Transactions on Knowledge and Data Engineering, the Journal
of Very Large Data Bases, and the INFORMS Journal of Computing. She has
served on the Program Committee of all the major database conferences (ACM
Sigmod, VLDB and ICDE), the WWW Conference, IEEE Advances in Digital Libraries,
and the International Conference on Information Systems. |
26 November 2003, 10:30 AM
Title: |
Raindrop: The WPI Stream Query Engine |
Speaker: |
Elke Rundensteiner, Worcester
Polytechnic Institute |
Abstract: |
Many modern applications process queries over unbounded streams of
data, including intrusion detection in networks, monitoring remote sensors,
and monitoring patients vital signs. These data streams arrive in real
time, are unbounded in length and have unpredictable arrival patterns
due to external uncontrollable factors such as network congestion or
weather in the case of remote sensors. In this talk, I'll describe the
recent research and system development activities we have been undergoing
at Worcester Polytechnic Institute in this emerging area of stream query
engines.
One, I'll describe the novel approach we are taking towards processing
XQueries over XML streams. XML stream applications bring the challenge
of efficiently processing queries on sequentially accessible token-based
data. While the automata model is naturally suited for pattern matching
on tokenized XML streams, the algebraic model in contrast is a well-established
technique for set-oriented processing of self-contained tuples. However,
neither automata nor algebraic models are well-equipped to handle both
computation paradigms. Our Raindrop engine thus accommodates these two
paradigms within one algebraic framework to take advantage of both. In
our query model, both tokenized data and self-contained tuples are supported
in a uniform manner. Query plans can be flexibly rewritten using equivalence
rules to change what computation is done using tokenized data versus
tuples. The necessity of such a uniform plan model are been illustrated
using an experimental study.
Two, we have designed a novel technique for adapting the execution
of stream queries that, to my knowledge, is not present in any other
continuous query system to date. Namely, we note that utilizing a single
scheduling algorithm to execute a continuous query, as is employed in
most current continuous query systems, is not adequate because no existing
scheduling algorithms can optimally meet an arbitrary set of Quality
of Service (QoS) requirements. Therefore, an adaptive strategy selector
guidable by QoS requirements was developed to meet arbitrary Quality
of Service (QoS) requirements. The adaptive strategy selector monitors
the effects of its behavior on its environment through a feedback mechanism,
with the aim of exploiting previously beneficial behavior and exploring
alternative behavior. The feedback mechanism is guided by qualitatively
comparing how well each algorithm has met the QoS requirements. Then
the next scheduling algorithm is chosen by spinning a roulette wheel
where each candidate is chosen with a probability equal to its performance
score. The adaptive algorithm was shown to be effective in outperforming
single scheduling algorithms for many QoS combinations and stream arrival
patterns.
Three, we also have been studying methods for online plan reoptimization
and plan migration so to online modify the structure of the query plan.
Dynamic query plan migration is concerned with the on-the-fly transition
from one continuous query plan to a semantically equivalent yet more
efficient query plan. In particular, we have designed two alternate online
plan migration strategies for continuous queries over data streams for
plans with window-based join operators. The first strategy is called "moving
state strategy" and the second is called "parallel track strategy". In
the moving state strategy, the query execution is paused and all tuples
in intermediate queues are drained before switching to a new plan. In
the parallel track strategy, the new plan is plugged and the system executes
both plans in parallel until eventually disconnecting the old plan. Pros
and Cons of these techniques are discussed, and an experimental study
exposing the trade-offs between these techniques will be analyzed. |
Bio: |
Elke A. Rundensteiner is an Associate professor of the Department of
Computer Science at the Worcester Polytechnic Institute, after having been
a faculty member at the University of Michigan, Ann Arbor. She has received
a BS degree (Vordiplom) from the J. W. Goethe University, Frankfurt, West
Germany, an M.S. degree from Florida State University, and a Ph.D. degree
from the University of California, Irvine, all in Computer Science. Dr.
Rundensteiner has been active in the database and information systems research
community for over 15 years. Her current research interests include data
integration and migration, XML and web data management, data warehousing
for distributed systems, continuous query processing, and large-scale information
visualization. She has more than 100 publications in these and related
areas. Her research has been funded by government agencies including NSF
and industry like Verizon Labs, GTE, and others. Dr. Rundensteiner has
received numerous honors and awards, including Fulbright, NSF Young Investigator,
and IBM Partnership Award. Elke Rundensteiner has served on program committees
of numereous conferences in the database field, including ACM SIGMOD, VLDB,
and IEEE Data Engineering. She has been an editor of several journals,
including Associate Editor of the IEEE Transactions on Data and Knowledge
Engineering Journal. |
15 December 2004, 11:00 AM
Title: |
Incentives and Sanctions in Peer-to-Peer Information Systems |
Speaker: |
Klemens Böhm,
Otto-von-Guericke-Universität Magdeburg |
Abstract: |
Modern information systems require that users provide resources.
These resources may be on the physical level, or they may be intellectual
input. But only in rare cases, incentives to provide these resources
exist. Our research objective is the development of decentralized
mechanisms to stabilize cooperative behavior in Peer-to-Peer systems
and to reduce destructive behavior and free riding. This talk will
focus on these issues in the context of Content-Addressable Networks
(CAN). CAN administer huge sets of (key, value)-pairs and cope with
very high workloads. They consist of many autonomous nodes. However,
this also means that nodes may be uncooperative, i.e., not carrying
out their share of the work while trying to benefit from the network.
This talk deals with this kind of adverse behavior in CAN. It is
challenging for large networks of more than 100,000 nodes to design
a forwarding protocol for CAN that bypasses and excludes uncooperative
nodes. We have designed and evaluated such a CAN protocol that is
based on reputation, with the following characteristics: It establishes
logical trust networks within the CAN. Nodes give positive feedback
on peers that have performed useful work. Feedback is distributed
in a swarm-like fashion and expires after a certain amount of time.
In extreme situations, the CAN asks nodes to perform a proof of work,
in order to deter uncooperative behavior. Results of experiments
with 100,000 peers are positive. Finally, we will address other scenarios
where our main concern (decentralized mechanisms to ensure cooperative
behavior) is important as well. |
Bio: |
Klemens Böhm is a professor (Associate Professor) at Otto-von-Guericke-Universität
Magdeburg, Germany, since 2002. Prior to that, he has served as a
research assistant in the database research group of Swiss Federal
Institute of Technology (ETH), Zurich. Klemens received his Ph.D.
degree and his diploma in computer science from Technical University
of Darmstadt in 1997 and 1993, respectively. His current research
interests are Peer-to-Peer information systems and cooperative work. |
12 January 2004, 11:00 AM
Title: |
Effective and Efficient
Region-based Image Retrieval |
Speaker: |
Mario Nascimento, University
of Alberta |
Abstract: |
Content Based Image Retrieval (CBIR) is a challenging task. Current
research works attempt to obtain and use the semantics of image to
perform better retrieval. Towards this goal, segmentation of an image
into regions has been advocated, since local properties of regions
can help matching objects between images and thereby contribute towards
a more effective CBIR. In this talk I will introduce three variations
of a CBIR technique we proposed recently, called SNL. In SNL each
image is segmented and features including the color, shape, size
and spatial position of the obtained regions are extracted. Regions
are then compared using the Integrated Region Matching (IRM) distance
measure, which, not being a metric distance, prevents the use of
metric access structures. We overcome this issue, by using MiCRoM,
a true metric distance to compare segmented images. The resulting
approach, which we call SNL*, can be used in conjunction with a filtering
technique to reduce substantially the number of images compared,
hence speeding up query processing. Albeit metric-based, SNL* is
computationally expensive. We address this drawback in the SNL+ approach,
where we replace the expensive metric distance in SNL* by the inexpensive
(but non-metric) IRM distance. We found that one can still make use
of the same filtering technique, at the expense of little loss in
retrieval effectiveness. Thus, the main contribution we offer is
SNL+, an effective and efficient region-based image retrieval technique.
The talk is based on the paper: Mario A. Nascimento, Veena Sridhar,
Xiaobo Li: Effective and efficient region-based image retrieval.
J. Vis. Lang. Comput. 14(2): 151-179 (2003). |
Bio: |
A. Nascimento obtained his Ph.D. degree in Computer Science at
Southern Methodist University's School of Engineering in 1996. Between
1989 and 1999 he was a researcher with the Brazilian Agency for Agricultural
Research (Information Technology Center) and, between 1997 and 1999,
he was also associated with the Institute of Computing of the State
University of Campinas (Brazil). Since then he has been with the
Department of Computing Science of the University of Alberta. He
has published over fifty papers in international conferences, journals
and workshops, has served or is serving as Program Committee member
of several conferences and as Program Chair or Co-chair for ACM MM
2001 MIR, IDEAS'02 SPIRE'2003, and as a member of the ACM SIGMOD
DiSC Editorial Board (2001 and 2002). He is currently serving as
a member of ACM SIGMOD's Executive Board (as its Information Director).
His main research interests lie in the areas of Content-based Image
Retrieval and Access Structures for (Spatio/Temporal/Image) Databases
He is a member of ACM, SIGMOD, SIGIR, and IEEE Computer Society.
Further and current information can be found at http://www.cs.ualberta.ca/~mn. |
16
February 2004, 11:30 AM
Title: |
Distance Based Indexing and Similarity
Search for Sequences |
Speaker: |
Z. Meral Özsoyoglu, Case
Western Reserve University |
Abstract: |
Finding sequences similar to a given query sequence in a large
collection of sequences is a fundamental problem in many database
applications including, computational genomics, computational finance,
image and text processing. The similarity between sequences is defined
in terms of a distance function determined by the application domain.
In this work, we consider sequence proximity search in computational
genomics, where sequence similarity is usually an indication of an
evolutionary relationship between DNA and protein sequences, and
usually indicates functional similarity. The most popular distance
measures are based on (a weighted) count of character edit or block
edit operations to transform one string to another. The main goal
is to develop efficient near neighbor search tools that work for
both character edit and block edit distances. Our premise is that
the Distance Based Indexing techniques, which are originally developed
for metric distances can be modified for sequence distance measures
provided that they are almost metrics. We first show that sequence
distance functions of interest (compression distance and weighted
character edit distance) are almost metric. We then show how to modify
distance based index structures vantage point trees to accommodate
almost metric distances. We test our theoretical results on protein
sequences and synthetic data sets. |
Bio: |
Z. Meral Özsoyoglu is a professor of Computer Science, and
Associate Chair for Division of Computer Science at EECS department,
Case Western Reserve University, Cleveland, Ohio. Prof. Ozsoyoglu's
primary work and research interests are in the areas of query languages
and query processing, data models, and index structures in object
oriented databases, multimedia and temporal databases. More recently,
she has been working in Bioinformatics, more specifically on managing,
visualizing and querying genomic pathways, and efficient access methods
for genomic sequences. She is a founding member of CWRU Center for
Computational Genomics. Dr. Ozsoyoglu has published over 60 papers
in computer science journals including ACM TODS, IEEE TKDE, IEEE
TSE, JCSS, and conferences including ACM SIGMOD, ACM PODS, VLDB,
and IEEE ICDE. She is currently an associate editor of ACM TODS,
IEEE Data Engineering Bulletin, and WWW Journal. |
15 March 2004, 11:00 AM
Title: |
FATES: Decoupling Main Memory Data Layout from Storage Organization |
Speaker: |
Anastassia Ailamaki,
Carnegie Mellon University |
Abstract: |
Current database systems use data layouts that can exploit unique
features of only one level of the memory hierarchy (cache/main memory
or on-line storage). Such layouts optimize for the predominant access
pattern of one workload (e.g., DSS), while trading off performance
of another workload type (e.g., OLTP). Achieving efficient execution
of different workloads without such trade-off or the need to manually
re-tune the system for each workload type is still an unsolved problem.
In addition, current database workloads are often compound (i.e.,
consisting of both DSS and OLTP requests). Such workloads are difficult
to optimize and execute, because optimization assumptions rarely
hold at execution time. In this talk, I will present the design of
a multi-layer storage and main memory management system for relational
databases called Fates, that achieves efficient execution of compound
database workloads at all levels of the memory hierarchy. By leveraging
unique characteristics of devices at each level, Fates can automatically
match query access patterns to the respective device characteristics,
which eliminates the difficult and error-prone task of manual performance
tuning. Borrowing from the Greek mythology of The Three Fates-Clotho,
Lachesis, and Atropos-who spin, measure, and cut the thread of life,
the three components of our database system (bearing the Fates' respective
names) establish proper abstractions in the database query execution
engine. The talk will describe all three components and will show
experimental results. |
Bio: |
Anastassia Ailamaki received a B.Sc. degree in Computer Engineering
(1990) from the Polytechnic School of the University of Patra, Greece,
M.Sc. degrees from the Technical University of Crete, Greece and
from the University of Rochester, NY, and a Ph.D. degree in Computer
Science from the University of Wisconsin-Madison. She is currently
an Assistant Professor at the School of Computer Science at Carnegie
Mellon University. She has received two best-paper awards (VLDB 2001
and Performance 2002) and an NSF CAREER award (2002). Her current
research interests include database system design and performance,
architecture-conscious databases, internet querying and caching,
and workload characterization. |
19 April 2004, 11:30 AM
Title: |
Flexible String Matching in Databases |
Speaker: |
Nick Koudas, University of Toronto |
Abstract: |
In this talk I will provide an overview of our work on flexible string
matching in databases for data cleaning applications. I will show how it
is possible to embedded popular metrics that quantify string closeness
into an SQL framework and express them in a declarative form. The end result
is a set of techniques that can assist in identifying pairs of strings
that are 'similar' (with respect to those metrics) directly, using SQL
on any relational database. |
Bio: |
Nick Koudas received his PhD from the university of toronto, his M.Sc
from the University of Maryland at College park and his B.Sc from the university
of Patras. He is an associate editor for Information Systems and IEEE TKDE
and served in the past as a SIGMOD online associate information editor.
His interests are in database systems, metadata management, network data
management. |
7 June 2004, 11:00 AM
Title: |
Architecture Sensitive Design of Database Engines |
Speaker: |
Ken Ross, Columbia
University |
Abstract: |
With modern machines having large main memories, database system
performance has become less dependent on I/O characteristics. Current
performance bottlenecks include memory latency (for both data and
instructions) and branch misprediction delays. At the same time,
commodity processors provide features (such as multimedia instruction
sets like MMX and SSE) that can potentially enhance the performance
of data intensive operations.
In this talk, I will give an overview of how our group has addressed
these issues in designing algorithms and data structures for
database operations. I will then explore in more depth how buffering
can be
helpful, in several contexts, for avoiding cache misses. Buffering
can improve throughput while still providing certain response-time
guarantees.
This talk represents joint work with Jingren Zhou. |
Bio: |
Professor Ross is an associate professor in the Computer Science
Department at Columbia University in New York City. His research
interests touch on various aspects of database systems, including
query processing, query language design, data warehousing, and architecture-sensitive
database system design. Professor Ross received his PhD from Stanford
University. He has received several awards, including a Packard Foundation
Fellowship, a Sloan Foundation Fellowship, and an NSF Young Investigator
award. |