Fall 2006
Note: Events of interest to the
Database Research Group are posted to the uw.cs.database
newsgroup and are mailed to the
db-group@lists.uwaterloo.ca
mailing list. There are actually three mailing lists aggregated into the
db-group list: db-faculty
(for DB group faculty), db-grads (for DB group graduate students),
and db-friends (for DB group alumni, visitors, and friends). If
you wish to subscribe to (or unsubscribe from) one of these three lists, please
visit
https://lists.uwaterloo.ca/mailman/listinfo/<listname>, where
<listname> is the name of the list.
- DB group meetings
- The DB group meets most Friday afternoons at 2pm, usually in DC 1331.
See the list of current events for
times and locations of upcoming meetings. Each meeting lasts
for an hour and features an informal presentation by one of the
members of the group. Everyone is welcome to attend. These talks are
intended to raise questions and to stimulate discussion rather than
being polished presentations of research results. Speakers are determined
using a rotating speaker list, which can be found on the
DB group meeting page.
- DB seminar series
- The DB seminar series features visiting speakers. These seminars are
more-or-less monthly, and are usually scheduled on Monday
mornings at 11am. See the list of current
events for times and locations of upcoming seminars. The
full schedule can be found on the DB seminar series page.
Recent and Upcoming Events
DB Meeting: Friday, September 22, 2:00pm, DC 1304 (Please note room change)
Title: Kickoff Meeting
Seminar (CANCELLED): Wednesday, September 27, 1:30pm, DC 1304
Speaker: Chen Li, UC Irvine
Title: Answering Approximate Queries Efficiently
DB Meeting: Friday, September 29, 2:00pm, DC 1331
Speaker: Matthew Young-Lai
Title: Logging and Crash Recovery
Abstract:
The choice between logical and physical logging is a trade-off. Physical
logging writes much more data to the log. Logical logging requires more
expensive management of the relationship between the persistent database on
disk and the cache in memory. The following papers develop a theory that
improves our understanding of this trade-off and suggests the possibility of
better logging strategies. I will give an overview of some of the ideas.
Lomet, D. and Tuttle, M. "A Theory of Redo Recovery." SIGMOD Conference, San Diego, CA (June 2003), 397-406.
Lomet, D. and Tuttle, M. "Logical Logging to Extend Recovery to New Domains." SIGMOD Conference, Philadelphia, PA (June 1999).
Lomet, D. and Tuttle, M. "Redo Recovery after System Crashes." VLDB Conference, Zurich, Switzerland (Sept. 1995), 457-468.
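To make the logical/physical distinction concrete, here is a toy sketch of the two record shapes and a physical redo step. The field names and record layout are illustrative assumptions, not Lomet and Tuttle's formalism:

```python
# Toy record shapes contrasting physical and logical logging.
physical_record = {
    "lsn": 101, "page": 7, "offset": 96,
    "after": b"\x00\x2b",  # after-image of the modified bytes
}
logical_record = {
    "lsn": 101, "op": "insert", "table": "emp",
    "row": {"id": 42, "name": "Bo"},  # the operation, not the bytes it touched
}

def redo_physical(page_bytes: bytes, rec: dict) -> bytes:
    """Redo by splatting the after-image at the recorded offset; no
    interpretation of the page contents is needed, unlike logical redo."""
    off = rec["offset"]
    return page_bytes[:off] + rec["after"] + page_bytes[off + len(rec["after"]):]

page = redo_physical(bytes(128), physical_record)
print(page[96:98])  # b'\x00+'
```

The physical record is larger (it carries byte images) but can be replayed without understanding page layout; the logical record is compact but its redo must re-execute the operation against a consistent cache state, which is the trade-off the papers analyze.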
DB Meeting: Friday, October 13, 2:00pm, DC 1331
Speaker: Dan Farrar
Title: Automatically summarizing a database workload
Abstract:
Analyzing the performance of a complex database application, and in particular
answering questions about its scalability, can be very difficult. To answer
such questions, we would like to convert the application into a small but
representative set of statements. This set can then be used, among other
things, as input to a capacity planning simulation or to develop a custom
benchmark, making the scalability analysis tractable.
Converting a trace log of SQL statements into a manageably small set without
losing its key features can be done with data mining techniques. I will
explore different methods of characterizing a database workload, with
emphasis on a method described by Wasserman et al.
Wasserman, T. J., et al. "Developing a Characterization of Business Intelligence Workloads for Sizing New Database Systems." DOLAP 2004, Washington, DC (November 2004).
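As a concrete flavor of workload summarization, one simple normalization step is to collapse statements that differ only in their literals into a single template and count occurrences. This sketch is illustrative and far simpler than the characterization of Wasserman et al.:

```python
import re
from collections import Counter

def template(sql: str) -> str:
    """Normalize literals so statements with the same shape group together."""
    sql = re.sub(r"'[^']*'", "?", sql)  # string literals -> ?
    sql = re.sub(r"\b\d+\b", "?", sql)  # numeric literals -> ?
    return sql.lower().strip()

trace = [
    "SELECT * FROM orders WHERE id = 1",
    "SELECT * FROM orders WHERE id = 42",
    "UPDATE stock SET qty = 9 WHERE part = 'bolt'",
]
summary = Counter(template(s) for s in trace)
print(summary.most_common(1)[0])
# ('select * from orders where id = ?', 2)
```

The resulting template counts give a manageably small statement set whose frequencies approximate the original trace.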
DB Meeting: Friday, October 20, 2:00pm, DC 1331
Speaker: Ashraf Aboulnaga
Title: What is a good database benchmark for DBMS research?
Seminar: Wednesday, October 25, 3:00pm, DC 1304 (Please note unusual day)
Speaker: Chen Li, UC Irvine
Title: Answering Approximate Queries Efficiently
Abstract:
Many database applications have an emerging need to answer
approximate queries efficiently. Such a query can ask for strings
that are similar to a given string, such as "names similar to
Schwarzenegger" and "telephone numbers similar to 412-0964," where
"similar to" uses a predefined, domain-specific function to specify
the similarity between strings, such as edit distance. There are many
reasons to support such queries. To name a few: (1) The user might not
remember exactly the name or the telephone number when issuing the
query. (2) There could be typos in the query. (3) There could be
errors or inconsistencies even in the database, especially in
applications such as data cleaning.
In this talk we will present some of our recent results on answering
approximate queries efficiently. One problem related to optimizing
such queries is to estimate the selectivity of a fuzzy string
predicate, i.e., estimating how many strings in the database satisfy
the predicate. We develop a novel technique, called SEPIA, to solve
the problem. We will present the details of this technique using the
edit distance function. We study challenges in adopting this
technique, including how to construct its histogram structures, how to
use them to do selectivity estimation, and how to alleviate the effect
of non-uniform errors in the estimation. We discuss how to extend the
techniques to other similarity functions. We show the results of our
extensive experiments.
Time permitting, we will also briefly report our other related
results. One is on supporting fuzzy queries with both predicates on
numeric attributes (e.g., salary > 50K) and predicates on string
attributes (e.g., telephone numbers similar to 412-0964). Another one
is on how to relax conditions in an SQL query that returns an empty
answer. These results are based on three recent papers in VLDB'2005
and VLDB'2006.
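For readers unfamiliar with the similarity function used above, edit (Levenshtein) distance can be computed with the standard dynamic program. This sketch only illustrates the predicate itself, not the SEPIA estimation technique:

```python
def edit_distance(s: str, t: str) -> int:
    """Standard dynamic-programming (Levenshtein) edit distance."""
    m, n = len(s), len(t)
    prev = list(range(n + 1))  # distances between s[:0] and each prefix of t
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # delete s[i-1]
                          curr[j - 1] + 1,     # insert t[j-1]
                          prev[j - 1] + cost)  # substitute (or match)
        prev = curr
    return prev[n]

# A fuzzy predicate "similar to 'Schwarzenegger'" with threshold 2 would
# match any string within edit distance 2 of it:
print(edit_distance("Schwarzenegger", "Schwarzeneger"))  # 1 (one deletion)
```

Estimating how many database strings fall within such a threshold, without scanning them all, is exactly the selectivity estimation problem the talk addresses.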
Bio:
Chen Li is an assistant professor in the Department of Computer
Science at the University of California, Irvine. He received his
Ph.D. degree in Computer Science from Stanford University in 2001, and
his M.S. and B.S. in Computer Science from Tsinghua University, China,
in 1996 and 1994, respectively. He received a National Science
Foundation CAREER Award in 2003. He is currently a part-time Visiting
Research Scientist at Google, Santa Monica. His research interests
are in the fields of database and information systems, including data
integration and sharing, data cleansing, data warehousing, and data
privacy. More information is available at: http://www.ics.uci.edu/~chenli/
Seminar: Thursday, October 26, 2:00pm, DC 1304 (Please note unusual day)
Speaker: Raghu Krishnapuram, IBM India Research Laboratory
Title: Search Result Summarization and Disambiguation via Contextual Dimensions, and Efficient Named Entity Recognition using Inverse Index Operations
Abstract:
First Talk
Dynamically generated topic hierarchies are a popular method of summarizing
the results obtained in response to a query in various search
applications. However, topic hierarchies generated by statistical techniques
tend to be somewhat unintuitive, rigid and inefficient for browsing and
disambiguation. In this talk, we propose an alternative approach to query
disambiguation and result summarization. The approach uses a fixed set of
orthogonal contextual dimensions to summarize and disambiguate search
results. A contextual dimension assigns a specific type to a context, which
makes it incomparable to contexts of other types. For the generic search
scenario, we propose to use three types of contextual dimensions, namely,
concepts, features, and specializations. We use NLP techniques to extract the
three types of contexts, and a data mining algorithm to select a subset of
contexts that are as distinct (i.e., mutually exclusive) as possible. Our
experimental results show that we can achieve a considerable reduction in the
effort required for retrieving relevant documents via the proposed interface.
Second Talk
In this talk, we will present an efficient process for named entity
annotation, an important information extraction (IE) problem, using inverse
index operations. Named entity annotation involves attaching a label such as
"name" or "organization"
to a sequence of tokens in a document. The current rule-based and machine
learning-based approaches for this task operate at the document level. We
present a new and generic approach to entity annotation that uses the inverse
index typically created for rapid keyword-based searching of a document
collection. We define a set of operations on the inverse index that allows us
to create annotations defined by cascading regular expressions. The entity
annotations for an entire document corpus can be created purely from the index
with no need to access the original documents. Experiments on two publicly
available data sets show very significant performance improvements over the
document-based annotators.
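To illustrate the idea of annotating entities purely from an inverted index, here is a toy sketch. The index layout and the `follows` operation are illustrative assumptions, not the operations defined in the talk:

```python
from collections import defaultdict

# Toy positional inverted index: token -> list of (doc_id, position).
docs = {0: "dr alice smith visited acme corp", 1: "bob works at acme corp"}
index = defaultdict(list)
for d, text in docs.items():
    for pos, tok in enumerate(text.split()):
        index[tok].append((d, pos))

def follows(first, second):
    """Positions where a token from `second` immediately follows one from
    `first` -- a concatenation operation over posting lists."""
    starts = {(d, p) for d, p in first}
    return [(d, p) for (d, p) in second if (d, p - 1) in starts]

# Annotate the two-token entity "acme corp" using only the index,
# never touching the original documents:
print(follows(index["acme"], index["corp"]))  # [(0, 5), (1, 4)]
```

Cascading such operations lets regular-expression-style annotators run over posting lists for the whole corpus at once, which is where the reported speedups over document-at-a-time annotation come from.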
Bio:
Raghu Krishnapuram received his Ph.D. degree in electrical and computer
engineering from Carnegie Mellon University, Pittsburgh, in 1987. From 1987 to
1997, he was on the faculty of the Department of Computer Engineering and
Computer Science at the University of Missouri, Columbia. From 1997 to 2000,
Dr. Krishnapuram was a Full Professor at the Department of Mathematical and
Computer Sciences, Colorado School of Mines, Golden, Colorado. Since then, he
has been at IBM India Research Lab, New Delhi. Dr. Krishnapuram's research
encompasses many aspects of Web mining, information retrieval, e-commerce,
fuzzy set theory, neural networks, pattern recognition, computer vision, and
image processing. He has published over 160 papers in journals and conferences
in these areas. Dr. Krishnapuram is an IEEE Fellow, and a co-author (with
J. Bezdek, J. Keller and N. Pal) of the book "Fuzzy Models and Algorithms for
Pattern Recognition and Image Processing".
DB Meeting: Friday, November 3, 2:00pm, DC 1331
Speaker: Grant Weddell
Title: Databases in Performance Critical Embedded Software Systems
Abstract:
Commercial relational engines are a multi-billion dollar industry,
and are testimony to the reductions in the cost of developing and
maintaining information systems that derive from using the programming
language SQL. But what about using SQL as a systems programming language
for implementing performance critical embedded software systems, on the
grounds that similar reductions in software development and maintenance
costs would ensue? In this talk, I'll review some of the issues, trends
and research areas relating to "SQL the systems programming language."
DB Meeting: Friday, November 10, 2:00pm, DC 1331
Speaker: Amir H. Chinaei
Title: Secure Anonymization for Incremental Datasets
Abstract:
I will be talking about recent work by Ji-Won Byun et al., presented at the VLDB Workshop
on Secure Data Management, September 2006. The talk addresses data privacy, k-anonymity,
l-diversity, some types of inference channels, and data quality. The paper's abstract is as
follows:
"Data anonymization techniques based on the k-anonymity model have been the focus of intense
research in the last few years. Although the k-anonymity model and the related techniques provide
valuable solutions to data privacy, current solutions are limited only to static data release
(i.e., the entire dataset is assumed to be available at the time of release). While this may be
acceptable in some applications, today we see databases continuously growing everyday and even
every hour. In such dynamic environments, the current techniques may suffer from poor data
quality and/or vulnerability to inference. In this paper, we analyze various inference channels
that may exist in multiple anonymized datasets and discuss how to avoid such inferences. We
then present an approach to securely anonymizing a continuously growing dataset in an efficient
manner while assuring high data quality."
To access the full paper, visit http://www.springerlink.com/content/p80r126487u63w31/fulltext.pdf
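As background, the k-anonymity property mentioned above is mechanical to check: every combination of quasi-identifier values must occur at least k times in the released data. A toy sketch (the generalized values below are made up for illustration):

```python
from collections import Counter

def is_k_anonymous(records, quasi_ids, k):
    """True iff every quasi-identifier value combination occurs >= k times."""
    groups = Counter(tuple(r[a] for a in quasi_ids) for r in records)
    return all(count >= k for count in groups.values())

# Generalized release: ages and zip codes coarsened so rows group together.
rows = [
    {"age": "2*", "zip": "479**", "disease": "flu"},
    {"age": "2*", "zip": "479**", "disease": "cold"},
    {"age": "3*", "zip": "476**", "disease": "flu"},
    {"age": "3*", "zip": "476**", "disease": "ulcer"},
]
print(is_k_anonymous(rows, ["age", "zip"], 2))  # True: each group has 2 rows
```

The hard part the paper tackles is that a sequence of such releases over a growing dataset can be individually k-anonymous yet jointly leak information through inference across releases.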
DB Meeting: Friday, December 1, 2:00pm, DC 1331
Speaker: Mumtaz Ahmad
Title: Priority Mechanisms for OLTP Workloads
Abstract:
OLTP workloads and their characteristics are an important area of study because of their use
in electronic commerce and Internet applications. To provide differentiated performance
classes in such workloads, we need to prioritize transactions and provide scheduling
mechanisms that ensure high-priority transactions see improved performance. At the same
time, depending on the application context, it may be desirable that low-priority
transactions do not suffer heavily in the process. I will discuss the priority mechanisms
and scheduling policies proposed by McWherter et al. The key idea is to identify the
bottleneck resource for a workload and then design scheduling policies that implement the
proposed prioritization in a DBMS.
David McWherter, Bianca Schroeder, Natassa Ailamaki, and Mor Harchol-Balter. "Priority Mechanisms for OLTP and Transactional Web Applications." ICDE 2004.
David McWherter, Bianca Schroeder, Natassa Ailamaki, and Mor Harchol-Balter. "Improving Preemptive Prioritization via Statistical Characterization of OLTP Locking." ICDE 2005.
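The flavor of lock-queue prioritization these papers study can be sketched with a toy wait queue that grants the lock to the highest-priority waiter first. This is an illustration only, not McWherter et al.'s implementation:

```python
import heapq

class LockQueue:
    """Toy lock wait queue: grants to the highest-priority waiter first
    (lower number = higher priority), FIFO within a priority level."""
    def __init__(self):
        self._heap = []
        self._seq = 0  # tie-breaker to preserve arrival order

    def enqueue(self, txn_id, priority):
        heapq.heappush(self._heap, (priority, self._seq, txn_id))
        self._seq += 1

    def grant_next(self):
        return heapq.heappop(self._heap)[2] if self._heap else None

q = LockQueue()
q.enqueue("T1", priority=2)  # low-priority transaction arrives first
q.enqueue("T2", priority=1)  # high-priority transaction arrives later
print(q.grant_next())  # T2: high priority jumps the queue
print(q.grant_next())  # T1
```

Their finding, roughly, is that when locks are the bottleneck, reordering (and selectively preempting) lock waiters like this helps high-priority transactions dramatically while hurting low-priority ones only mildly.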
DB Seminar: Monday, December 11, 11:00am, DC 3314 (two half-hour talks)
Speaker: Yi Lin, McGill University
Title: One-Copy-Snapshot-Isolation for Database Replication in WANs
Abstract:
Database replication copies data to different databases in order to improve
the performance, scalability, and fault tolerance of database systems.
One-copy-serializability has long been considered the standard correctness
criterion for database replication, corresponding to serializability in
centralized databases. However, Snapshot Isolation (SI) is used extensively
in DBMSs such as Oracle, PostgreSQL, and SQL Server 2005. In my research, I
propose a new isolation level, 1-copy-SI, for transactions in replicated
databases, along with a replication protocol that provides 1-copy-SI. The
protocol works well in WANs, overcoming the limitations of existing protocols
that only work well in LANs.
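The certification idea underlying SI-based replication (first-committer-wins on intersecting writesets) can be sketched as follows; this is an illustrative toy, not the protocol proposed in the talk:

```python
def conflicts(writeset_a, writeset_b):
    """Two concurrent transactions conflict iff their writesets intersect."""
    return bool(set(writeset_a) & set(writeset_b))

# First-committer-wins certification: when a remote transaction's writeset
# arrives and commits, concurrent local transactions that wrote an
# overlapping item must abort.
local_txns = {"L1": ["x", "y"], "L2": ["z"]}
remote_writeset = ["y"]
aborted = [t for t, ws in local_txns.items() if conflicts(ws, remote_writeset)]
print(aborted)  # ['L1']
```

Because only writesets (not readsets) need to be exchanged and compared, this style of certification keeps message sizes small, which matters for the WAN setting both talks consider.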
Speaker: José Enrique Armendáriz-Iñigo, Instituto Tecnológico de Informática
Title: Managing Transaction Conflicts in Middleware-Based Database Replication Architectures
Abstract:
Database replication protocols need to detect, and block or abort, conflicting
transactions. A possible solution is to check transactions' writesets (and
also their readsets when a serializable isolation level is requested), which,
however, consumes significant CPU time. This gets even worse when replication
support is provided by a middleware layer, since there is no direct DBMS
support at that level. We propose and discuss using the concurrency control
support of the local DBMS to detect conflicts between local transactions and
the writesets of remote transactions. This makes it possible to simplify many
database replication protocols and to enhance their performance.
This page is maintained by Ashraf Aboulnaga.