Fall 2006 Events Schedule | Database Research Group | UW

Fall 2006

Note: Events of interest to the Database Research Group are posted to the uw.cs.database newsgroup and are mailed to the db-group@lists.uwaterloo.ca mailing list. Three mailing lists are aggregated into the db-group list: db-faculty (for DB group faculty), db-grads (for DB group graduate students), and db-friends (for DB group alumni, visitors, and friends). To subscribe to or unsubscribe from one of these three lists, please visit https://lists.uwaterloo.ca/mailman/listinfo/<listname>, where <listname> is the name of the desired list.
DB group meetings
The DB group meets most Friday afternoons at 2pm, usually in DC 1331. See the list of current events for times and locations of upcoming meetings. Each meeting lasts for an hour and features an informal presentation by one of the members of the group. Everyone is welcome to attend. These talks are intended to raise questions and to stimulate discussion rather than to be polished presentations of research results. Speakers are determined using a rotating speaker list, which can be found on the DB group meeting page.
DB seminar series
The DB seminar series features visiting speakers. These seminars are more-or-less monthly, and are usually scheduled on Monday mornings at 11am. See the list of current events for times and locations of upcoming seminars. The full schedule can be found on the DB seminar series page.

Recent and Upcoming Events


DB Meeting: Friday September 22, 2:00pm, DC 1304 (Please note room change)
Title: Kickoff Meeting

DB Seminar: Monday, September 25, 11:00am, DC 1304
Speaker: Amol Deshpande, University of Maryland
Title: MauveDB: Managing Uncertain Data using Statistical Models

Seminar: - CANCELLED - Wednesday, September 27, 1:30pm, DC 1304
Speaker: - CANCELLED - Chen Li, UC Irvine
Title: - CANCELLED - Answering Approximate Queries Efficiently
(This seminar was later given on Wednesday, October 25; see that entry below for the abstract and speaker bio.)

DB Meeting: Friday September 29, 2:00pm, DC 1331
Speaker: Matthew Young-Lai
Title: Logging and Crash Recovery
Abstract: The choice between logical and physical logging is a trade-off. Physical logging writes much more data to the log. Logical logging requires more expensive management of the relationship between the persistent database on disk and the cache in memory. The following papers develop a theory that improves the understanding of this issue and implies the possibility of better logging strategies. I'll give an overview of some of the ideas; a small sketch contrasting the two logging styles follows the references below.

Lomet, D. and Tuttle, M. A Theory of Redo Recovery. SIGMOD Conference, San Diego, CA, June 2003, 397-406.
Lomet, D. and Tuttle, M. Logical Logging to Extend Recovery to New Domains. SIGMOD Conference, Philadelphia, PA, June 1999.
Lomet, D. and Tuttle, M. Redo Recovery after System Crashes. VLDB Conference, Zurich, Switzerland, September 1995, 457-468.
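
Not part of the talk announcement, but as a minimal sketch of the trade-off described above (the page and record structures here are invented for illustration, not taken from the papers): physical redo records carry page after-images that can be reinstalled blindly, while logical records carry an operation and its arguments and must be re-executed against a suitable starting state.

```python
# Toy illustration (not from the papers): physical vs. logical redo records.
# A "page" here is just a dict entry; real systems log byte ranges and LSNs.

db = {"page1": []}          # persistent state: page1 holds a list of rows

physical_log = []           # each record: (page_id, after_image)
logical_log = []            # each record: (operation_name, page_id, argument)

def insert_row(page_id, row):
    db[page_id].append(row)
    physical_log.append((page_id, list(db[page_id])))   # full after-image
    logical_log.append(("insert", page_id, row))         # operation + args

insert_row("page1", "alice")
insert_row("page1", "bob")

# Redo from the physical log: blindly reinstall after-images (idempotent).
recovered = {"page1": []}
for page_id, image in physical_log:
    recovered[page_id] = list(image)

# Redo from the logical log: re-execute operations, which only works if the
# starting state is consistent with the point where logging began.
recovered2 = {"page1": []}
for op, page_id, row in logical_log:
    if op == "insert":
        recovered2[page_id].append(row)

assert recovered == recovered2 == db
```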


DB Meeting: Friday October 13, 2:00pm, DC 1331
Speaker: Dan Farrar
Title: Automatically summarizing a database workload
Abstract: Analyzing the performance of a complex database application, and in particular answering questions about its scalability, can be very difficult. To answer such questions, we would like to convert the application into a small but representative set of statements. This set of statements can then be used, among other things, as input to a capacity planning simulation or to develop a custom benchmark, making the scalability analysis tractable. Converting a trace log of SQL statements into a manageably small set without losing its key features can be done with data mining techniques. I will explore different methods of characterizing a database workload, with emphasis on a method described by Wasserman et al.; a toy illustration of trace compression follows the reference below.

Wasserman, T. J., et al. Developing a Characterization of Business Intelligence Workloads for Sizing New Database Systems. DOLAP 2004, Washington, DC, November 2004.
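
As a rough, hypothetical illustration of the kind of trace compression described in the abstract (this is not the method of Wasserman et al.; the statements and regular expressions are made up), the sketch below collapses a SQL trace into statement templates by stripping literals and reports each template's frequency:

```python
import re
from collections import Counter

# Hypothetical trace; a real one would come from a DBMS request log.
trace = [
    "SELECT * FROM orders WHERE customer_id = 42",
    "SELECT * FROM orders WHERE customer_id = 97",
    "UPDATE items SET qty = 3 WHERE item_id = 7",
    "SELECT * FROM orders WHERE customer_id = 13",
]

def template(sql):
    """Replace quoted and numeric literals with '?' to get a statement template."""
    sql = re.sub(r"'[^']*'", "?", sql)      # string literals
    sql = re.sub(r"\b\d+\b", "?", sql)      # numeric literals
    return sql

counts = Counter(template(s) for s in trace)
for tmpl, freq in counts.most_common():
    print(freq, tmpl)
# 3 SELECT * FROM orders WHERE customer_id = ?
# 1 UPDATE items SET qty = ? WHERE item_id = ?
```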


DB Seminar: Monday, October 16, 10:30am, DC 1304 (Please note time change)
Speaker: Alex Borgida, Rutgers University
Title: Visions of Data Semantics: Another (and another) Look

DB Meeting: Friday October 20, 2:00pm, DC 1331
Speaker: Ashraf Aboulnaga
Title: What is a good database benchmark for DBMS research?

Seminar: Wednesday, October 25, 3:00pm, DC 1304 (Please note unusual day)
Speaker: Chen Li, UC Irvine
Title: Answering Approximate Queries Efficiently
Abstract: Many database applications have an emerging need to answer approximate queries efficiently. Such a query can ask for strings that are similar to a given string, such as "names similar to Schwarzenegger" and "telephone numbers similar to 412-0964," where "similar to" uses a predefined, domain-specific function, such as edit distance, to specify the similarity between strings. There are many reasons to support such queries. To name a few: (1) The user might not remember exactly the name or the telephone number when issuing the query. (2) There could be typos in the query. (3) There could be errors or inconsistencies even in the database, especially in applications such as data cleaning.

In this talk we will present some of our recent results on answering approximate queries efficiently. One problem related to optimizing such queries is to estimate the selectivity of a fuzzy string predicate, i.e., estimating how many strings in the database satisfy the predicate. We develop a novel technique, called SEPIA, to solve the problem. We will present the details of this technique using the edit distance function. We study challenges in adopting this technique, including how to construct its histogram structures, how to use them to do selectivity estimation, and how to alleviate the effect of non-uniform errors in the estimation. We discuss how to extend the techniques to other similarity functions. We show the results of our extensive experiments.

Time permitting, we will also briefly report our other related results. One is on supporting fuzzy queries with both predicates on numeric attributes (e.g., salary > 50K) and predicates on string attributes (e.g., telephone numbers similar to 412-0964). Another one is on how to relax conditions in an SQL query that returns an empty answer. These results are based on three recent papers in VLDB'2005 and VLDB'2006.
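
Purely as an illustration of the query class in the abstract (this is the standard Levenshtein edit distance, not the SEPIA selectivity estimator from the talk; the name list is invented), a small sketch that answers "names similar to Schwarzenegger" by filtering on edit distance:

```python
def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

names = ["Schwartzenegger", "Schwarzeneger", "Smith", "Schwarzenegger"]
query = "Schwarzenegger"
# Approximate selection: names within edit distance 2 of the query string.
print([n for n in names if edit_distance(n, query) <= 2])
# ['Schwartzenegger', 'Schwarzeneger', 'Schwarzenegger']
```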

Bio: Chen Li is an assistant professor in the Department of Computer Science at the University of California, Irvine. He received his Ph.D. degree in Computer Science from Stanford University in 2001, and his M.S. and B.S. in Computer Science from Tsinghua University, China, in 1996 and 1994, respectively. He received a National Science Foundation CAREER Award in 2003. He is currently a part-time Visiting Research Scientist at Google, Santa Monica. His research interests are in the fields of database and information systems, including data integration and sharing, data cleansing, data warehousing, and data privacy. More information is available at: http://www.ics.uci.edu/~chenli/

Seminar: Thursday October 26, 2:00pm, DC 1304 (Please note unusual day)
Speaker: Raghu Krishnapuram, IBM India Research Laboratory
Title: Search Result Summarization and Disambiguation via Contextual Dimensions
and
Efficient Named Entity Recognition using Inverse Index Operations
Abstract: First Talk
Dynamically generated topic hierarchies are a popular method of summarizing the results obtained in response to a query in various search applications. However, topic hierarchies generated by statistical techniques tend to be somewhat unintuitive, rigid, and inefficient for browsing and disambiguation. In this talk, we propose an alternative approach to query disambiguation and result summarization. The approach uses a fixed set of orthogonal contextual dimensions to summarize and disambiguate search results. A contextual dimension assigns a specific type to a context, which makes it incomparable to contexts of other types. For the generic search scenario, we propose to use three types of contextual dimensions, namely concepts, features, and specializations. We use NLP techniques to extract the three types of contexts, and a data mining algorithm to select a subset of contexts that are as distinct (i.e., mutually exclusive) as possible. Our experimental results show that we can achieve a considerable reduction in the effort required for retrieving relevant documents via the proposed interface.

Second Talk
In this talk, we will present an efficient process for named entity annotation, an important information extraction (IE) problem, using inverse index operations. Named entity annotation involves attaching a label such as `name' or `organization' to a sequence of tokens in a document. The current rule-based and machine learning-based approaches for this task operate at the document level. We present a new and generic approach to entity annotation which uses the inverse index typically created for rapid keyword-based searching of a document collection. We define a set of operations on the inverse index that allows us to create annotations defined by cascading regular expressions. The entity annotations for an entire document corpus can be created purely from the index, with no need to access the original documents. Experiments on two publicly available data sets show very significant performance improvements over document-based annotators.
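
A toy sketch of the general flavor of index-only processing (this is not the actual system or operator set from the talk; the documents and the LOCATION label are invented): a positional inverted index is built once, and occurrences of a two-token pattern are then found by merging posting lists on adjacent positions, without revisiting the documents.

```python
from collections import defaultdict

docs = {
    1: "new york is in new york state".split(),
    2: "york is a city in england".split(),
}

# Positional inverted index: token -> {doc_id -> [positions]}.
index = defaultdict(lambda: defaultdict(list))
for doc_id, tokens in docs.items():
    for pos, tok in enumerate(tokens):
        index[tok][doc_id].append(pos)

def phrase_occurrences(first, second):
    """Find (doc_id, start_pos) of 'first second' using only the index."""
    hits = []
    for doc_id, positions in index[first].items():
        following = set(index[second].get(doc_id, []))
        hits.extend((doc_id, p) for p in positions if p + 1 in following)
    return hits

# A trivial "annotator": mark every 'new york' occurrence as a LOCATION.
print([("LOCATION", d, p) for d, p in phrase_occurrences("new", "york")])
# [('LOCATION', 1, 0), ('LOCATION', 1, 4)]
```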

Bio: Raghu Krishnapuram received his Ph.D. degree in electrical and computer engineering from Carnegie Mellon University, Pittsburgh, in 1987. From 1987 to 1997, he was on the faculty of the Department of Computer Engineering and Computer Science at the University of Missouri, Columbia. From 1997 to 2000, Dr. Krishnapuram was a Full Professor at the Department of Mathematical and Computer Sciences, Colorado School of Mines, Golden, Colorado. Since then, he has been at IBM India Research Lab, New Delhi. Dr. Krishnapuram's research encompasses many aspects of Web mining, information retrieval, e-commerce, fuzzy set theory, neural networks, pattern recognition, computer vision, and image processing. He has published over 160 papers in journals and conferences in these areas. Dr. Krishnapuram is an IEEE Fellow, and a co-author (with J. Bezdek, J. Keller and N. Pal) of the book "Fuzzy Models and Algorithms for Pattern Recognition and Image Processing".

DB Meeting: Friday November 3, 2:00pm, DC 1331
Speaker: Grant Weddell
Title: Databases in Performance Critical Embedded Software Systems
Abstract: Commercial relational engines are a multi-billion dollar industry, and are testimony to the reductions in the cost of developing and maintaining information systems that derive from using the programming language SQL. But what about using SQL as a systems programming language for implementing performance-critical embedded software systems, on the grounds that similar reductions in software development and maintenance costs would ensue? In this talk, I'll review some of the issues, trends and research areas relating to "SQL the systems programming language."

DB Meeting: Friday November 10, 2:00pm, DC 1331
Speaker: Amir H. Chinaei
Title: Secure Anonymization for Incremental Datasets
Abstract: I will be talking about recent work by Ji-Won Byun et al., presented at the VLDB workshop on Secure Data Management, September 2006. This talk addresses data privacy, k-anonymity, l-diversity, some types of inference channels, and data quality. The paper abstract is as follows (a toy illustration of the k-anonymity condition appears after it):

"Data anonymization techniques based on the k-anonymity model have been the focus of intense research in the last few years. Although the k-anonymity model and the related techniques provide valuable solutions to data privacy, current solutions are limited only to static data release (i.e., the entire dataset is assumed to be available at the time of release). While this may be acceptable in some applications, today we see databases continuously growing everyday and even every hour. In such dynamic environments, the current techniques may suffer from poor data quality and/or vulnerability to inference. In this paper, we analyze various inference channels that may exist in multiple anonymized datasets and discuss how to avoid such inferences. We then present an approach to securely anonymizing a continuously growing dataset in an efficient manner while assuring high data quality."

To access the full paper, visit http://www.springerlink.com/content/p80r126487u63w31/fulltext.pdf
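
As a toy illustration of the k-anonymity condition mentioned above (this is not the incremental anonymization technique from the paper; the records and attribute names are invented), the sketch below checks whether a released table is k-anonymous with respect to a set of quasi-identifier attributes:

```python
from collections import Counter

# Hypothetical released records: quasi-identifiers are (zip_prefix, age_range).
records = [
    {"zip_prefix": "537**", "age_range": "20-29", "diagnosis": "flu"},
    {"zip_prefix": "537**", "age_range": "20-29", "diagnosis": "cold"},
    {"zip_prefix": "537**", "age_range": "30-39", "diagnosis": "flu"},
]

def is_k_anonymous(rows, quasi_ids, k):
    """Every combination of quasi-identifier values must occur at least k times."""
    groups = Counter(tuple(r[a] for a in quasi_ids) for r in rows)
    return all(count >= k for count in groups.values())

print(is_k_anonymous(records, ["zip_prefix", "age_range"], k=2))  # False
# The ('537**', '30-39') group has only one record, so an attacker who knows
# a person's ZIP prefix and age range could re-identify that record.
```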


DB Seminar: Monday, November 27, 10:30am, DC 1304
Speaker: Jiawei Han, University of Illinois at Urbana-Champaign
Title: Warehousing and Mining Massive RFID Data Sets

DB Meeting: Friday December 1, 2:00pm, DC 1331
Speaker: Mumtaz Ahmad
Title: Priority Mechanisms for OLTP Workloads
Abstract: OLTP workloads and their characteristics are an important area of study because of their application in electronic commerce and Internet applications. To provide differentiated performance classes in such workloads, we need to prioritize transactions and provide scheduling mechanisms that ensure that high-priority transactions see improved performance. At the same time, depending on the application context, it may be desirable that low-priority transactions do not suffer heavily in the process. I will discuss priority mechanisms and scheduling policies as proposed by McWherter et al. The key idea is to identify the bottleneck resource for a workload and then to devise scheduling policies that work well with the proposed prioritization in the DBMS; a small sketch of priority-based lock scheduling follows the references below.

McWherter, D., Schroeder, B., Ailamaki, A., and Harchol-Balter, M. Priority Mechanisms for OLTP and Transactional Web Applications. ICDE 2004.

McWherter, D., Schroeder, B., Ailamaki, A., and Harchol-Balter, M. Improving Preemptive Prioritization via Statistical Characterization of OLTP Locking. ICDE 2005.
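
Not taken from the papers; a minimal sketch, assuming the bottleneck resource is lock waiting, of the non-preemptive flavor of priority scheduling: waiting lock requests are granted in priority order instead of FIFO, so high-priority transactions spend less time queued.

```python
import heapq
import itertools

class PriorityLockQueue:
    """Grant a single exclusive lock to high-priority transactions first.

    Toy model: lower number = higher priority; FIFO order is preserved
    among equal priorities via a sequence counter.
    """
    def __init__(self):
        self.waiters = []                 # heap of (priority, seq, txn_id)
        self.seq = itertools.count()
        self.holder = None

    def request(self, txn_id, priority):
        if self.holder is None:
            self.holder = txn_id          # lock is free: grant immediately
        else:
            heapq.heappush(self.waiters, (priority, next(self.seq), txn_id))

    def release(self, txn_id):
        assert self.holder == txn_id
        if self.waiters:
            _, _, nxt = heapq.heappop(self.waiters)
            self.holder = nxt             # highest-priority waiter goes next
        else:
            self.holder = None

q = PriorityLockQueue()
q.request("T1", priority=2)   # T1 gets the lock
q.request("T2", priority=2)   # low priority, waits
q.request("T3", priority=0)   # high priority, waits
q.release("T1")
print(q.holder)               # T3: granted ahead of the earlier-arriving T2
```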


DB Seminar: Tuesday, December 11, 11:00am, DC 3314 (two half-hour talks)
Speaker: Yi Lin, McGill University
Title: One-Copy-Snapshot-Isolation for Database Replication in WANs
Abstract: Database replication copies data to different databases in order to improve the performance, scalability, and fault tolerance of database systems. One-copy serializability has long been considered the standard correctness criterion for database replication, corresponding to serializability in centralized databases. However, Snapshot Isolation (SI) is used extensively in DBMSs such as Oracle, PostgreSQL, and SQL Server 2005. In my research, I propose a new isolation level, 1-copy-SI, for transactions in replicated databases, together with a replication protocol that provides 1-copy-SI. The protocol works well in WANs, overcoming the limitations of existing protocols that only work well in LANs.
Speaker: José Enrique Armendáriz-Iñigo, Instituto Tecnológico de Informática
Title: Managing Transaction Conflicts in Middleware-Based Database Replication Architectures
Abstract: Database replication protocols need to detect conflicts and block or abort some of the conflicting transactions. A possible solution is to check their writesets (and also their readsets when a serializable isolation level is requested), which, however, consumes substantial CPU time. This gets even worse when replication support is provided by a middleware layer, since there is no direct DBMS support at that level. We propose and discuss the use of the concurrency control support of the local DBMS for detecting conflicts between local transactions and the writesets of remote transactions. This simplifies many database replication protocols and enhances their performance.

This page is maintained by Ashraf Aboulnaga.