Fall 2006
Note: Events of interest to the Database Research Group are posted to the uw.cs.database newsgroup and are mailed to the db-group@lists.uwaterloo.ca mailing list. There are actually three mailing lists aggregated into the db-group list: db-faculty (for DB group faculty), db-grads (for DB group graduate students), and db-friends (for DB group alumni, visitors, and friends). If you wish to subscribe to one of these three lists (or to unsubscribe), please visit https://lists.uwaterloo.ca/mailman/listinfo/<listname>, where <listname> is the name of the list you wish to subscribe to.
- DB group meetings
- The DB group meets most Friday afternoons at 2pm, usually in DC 1331. See the list of current events for times and locations of upcoming meetings. Each meeting lasts for an hour and features an informal presentation by one of the members of the group. Everyone is welcome to attend. These talks are intended to raise questions and to stimulate discussion rather than being polished presentations of research results. Speakers are determined using a rotating speaker list, which can be found on the DB group meeting page.
- DB seminar series
- The DB seminar series features visiting speakers. These seminars are more-or-less monthly, and are usually scheduled on Monday mornings at 11am. See the list of current events for times and locations of upcoming seminars. The full schedule can be found on the DB seminar series page.
Recent and Upcoming Events
DB Meeting: | Friday September 22, 2:00pm, DC 1304 (Please note room change) |
Title: | Kickoff Meeting |
DB Seminar: | Monday, September 25, 11:00am, DC 1304 |
Speaker: | Amol Deshpande, University of Maryland |
Title: | MauveDB: Managing Uncertain Data using Statistical Models |
Seminar: | - CANCELLED - Wednesday, September 27, 1:30pm, DC 1304 |
Speaker: | - CANCELLED - Chen Li, UC Irvine |
Title: | - CANCELLED - Answering Approximate Queries Efficiently |
DB Meeting: | Friday September 29, 2:00pm, DC 1331 |
Speaker: | Matthew Young-Lai |
Title: | Logging and Crash Recovery |
Abstract: |
The choice between logical and physical logging is a trade-off. Physical
logging writes much more data to the log. Logical logging requires more
expensive management of the relationship between the persistent database on
disk and the cache in memory. The following papers develop a theory that
improves the understanding of this issue and implies the possibility of
better logging strategies. I'll give an overview of some of the ideas.
Lomet, D. and Tuttle, M. A Theory of Redo Recovery. SIGMOD Conference, San
Diego, CA (June 2003) 397-406 |
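The logging trade-off described in the abstract can be illustrated with a toy sketch (not the paper's formal model; all names here are invented for the example): a physical log record carries page images and can be redone by simply overwriting the page, while a logical record carries an operation and its arguments and must be re-executed against cached state that is consistent with what the operation expects.

```python
# Toy illustration of physical vs. logical log records and their redo
# actions. This is a simplification for discussion, not a real recovery
# manager; record layouts and function names are invented.
from dataclasses import dataclass

@dataclass
class PhysicalRecord:
    page_id: int
    before: bytes   # full before-image of the page
    after: bytes    # full after-image of the page (more data in the log)

@dataclass
class LogicalRecord:
    operation: str  # e.g. "insert"
    args: tuple     # e.g. ("k", 42)

def redo_physical(pages: dict, rec: PhysicalRecord) -> None:
    # Redo for a physical record is trivial: overwrite the page image.
    pages[rec.page_id] = rec.after

def redo_logical(table: dict, rec: LogicalRecord) -> None:
    # Redo for a logical record re-runs the operation, which requires the
    # relationship between disk state and cache to be carefully managed.
    if rec.operation == "insert":
        key, value = rec.args
        table[key] = value
```

The sketch makes the trade-off concrete: the physical record is larger (two page images) but its redo is state-independent, while the logical record is compact but its redo depends on the state it is applied to.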
DB Meeting: | Friday October 13, 2:00pm, DC 1331 |
Speaker: | Dan Farrar |
Title: | Automatically summarizing a database workload |
Abstract: |
Analyzing the performance of a complex database application, and in particular
answering questions about its scalability, can be very difficult. To answer
such questions, we would like to be able to convert the application
into a small but representative set of statements. This set of statements can
then be used, among other things, as inputs to a capacity planning simulation
or to develop a custom benchmark, making the scalability analysis tractable.
Converting a trace log of SQL statements into a manageably small set without
losing its key features can be done with data mining techniques. I will
explore different methods of characterizing a database workload, with
emphasis on a method described by Wasserman et al.
Wasserman, Ted J., et al. "Developing a Characterization of Business Intelligence Workloads for Sizing New Database Systems", DOLAP 2004, Washington, DC (November 2004) |
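One simple step toward the summarization described above can be sketched as follows (a simplification for illustration; the Wasserman et al. characterization uses much richer features than this): collapse a trace of SQL statements into templates by replacing literals with placeholders, then count how often each template occurs.

```python
# Sketch: summarize a SQL trace by normalizing statements into templates
# and counting template frequencies. Regexes here are deliberately crude.
import re
from collections import Counter

def to_template(sql: str) -> str:
    """Replace string and numeric literals with '?' placeholders."""
    sql = re.sub(r"'[^']*'", "?", sql)           # string literals
    sql = re.sub(r"\b\d+(\.\d+)?\b", "?", sql)   # numeric literals
    return re.sub(r"\s+", " ", sql).strip().lower()

def summarize(trace):
    """Return (template, count) pairs, most frequent first."""
    return Counter(to_template(s) for s in trace).most_common()
```

For example, `summarize(["SELECT * FROM t WHERE id = 1", "select * from t where id = 42"])` collapses both statements into the single template `select * from t where id = ?`, giving a much smaller representative set than the raw trace.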
DB Seminar: | Monday, October 16, 10:30am, DC 1304 (Please note time change) |
Speaker: | Alex Borgida, Rutgers University |
Title: | Visions of Data Semantics: Another (and another) Look |
DB Meeting: | Friday October 20, 2:00pm, DC 1331 |
Speaker: | Ashraf Aboulnaga |
Title: | What is a good database benchmark for DBMS research? |
Seminar: | Wednesday, October 25, 3:00pm, DC 1304 (Please note unusual day) |
Speaker: | Chen Li, UC Irvine |
Title: | Answering Approximate Queries Efficiently |
Abstract: |
Many database applications have the emerging need to answer
approximate queries efficiently. Such a query can ask for strings
that are similar to a given string, such as "names similar to
Schwarzenegger" and "telephone numbers similar to 412-0964," where
"similar to" uses a predefined, domain-specific function to specify
the similarity between strings, such as edit distance. There are many
reasons to support such queries. To name a few: (1) The user might not
remember exactly the name or the telephone number when issuing the
query. (2) There could be typos in the query. (3) There could be
errors or inconsistencies even in the database, especially in
applications such as data cleaning.
In this talk we will present some of our recent results on answering approximate queries efficiently. One problem related to optimizing such queries is to estimate the selectivity of a fuzzy string predicate, i.e., estimating how many strings in the database satisfy the predicate. We develop a novel technique, called SEPIA, to solve the problem. We will present the details of this technique using the edit distance function. We study challenges in adopting this technique, including how to construct its histogram structures, how to use them to do selectivity estimation, and how to alleviate the effect of non-uniform errors in the estimation. We discuss how to extend the techniques to other similarity functions. We show the results of our extensive experiments. Time permitting, we will also briefly report our other related results. One is on supporting fuzzy queries with both predicates on numeric attributes (e.g., salary > 50K) and predicates on string attributes (e.g., telephone numbers similar to 412-0964). Another one is on how to relax conditions in an SQL query that returns an empty answer. These results are based on three recent papers in VLDB'2005 and VLDB'2006. |
Bio: | Chen Li is an assistant professor in the Department of Computer Science at the University of California, Irvine. He received his Ph.D. degree in Computer Science from Stanford University in 2001, and his M.S. and B.S. in Computer Science from Tsinghua University, China, in 1996 and 1994, respectively. He received a National Science Foundation CAREER Award in 2003. He is currently a part-time Visiting Research Scientist at Google, Santa Monica. His research interests are in the fields of database and information systems, including data integration and sharing, data cleansing, data warehousing, and data privacy. More information is available at: http://www.ics.uci.edu/~chenli/ |
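The similarity predicate in the abstract above can be made concrete with a small sketch of its most common instance, edit distance: the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. (This only illustrates the distance function itself; SEPIA, the talk's contribution, estimates the selectivity of such predicates rather than evaluating them.)

```python
# Classic dynamic-programming (Levenshtein) edit distance, plus a naive
# "similar to" predicate that scans all candidates. Illustrative only.
def edit_distance(a: str, b: str) -> int:
    """Minimum number of edits (insert/delete/substitute) from a to b."""
    prev = list(range(len(b) + 1))       # row for a[:0] vs b[:j]
    for i, ca in enumerate(a, start=1):
        curr = [i] + [0] * len(b)
        for j, cb in enumerate(b, start=1):
            curr[j] = min(
                prev[j] + 1,                 # delete ca
                curr[j - 1] + 1,             # insert cb
                prev[j - 1] + (ca != cb),    # substitute (free if equal)
            )
        prev = curr
    return prev[len(b)]

def similar_to(query: str, candidates, max_dist: int):
    """Return the candidates within max_dist edits of the query string."""
    return [s for s in candidates if edit_distance(query, s) <= max_dist]
```

For example, `similar_to("Schwarzenegger", names, 2)` finds misspellings such as "Schwarzeneger" (distance 1); estimating how many database strings such a predicate matches, without scanning them all, is exactly the selectivity problem the talk addresses.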
Seminar: | Thursday October 26, 2:00pm, DC 1304 (Please note unusual day) |
Speaker: | Raghu Krishnapuram, IBM India Research Laboratory |
Title: |
Search Result Summarization and Disambiguation via Contextual Dimensions
and Efficient Named Entity Recognition using Inverse Index Operations |
Abstract: |
First Talk
Dynamically generated topic hierarchies are a popular method of summarizing the results obtained in response to a query in various search applications. However, topic hierarchies generated by statistical techniques tend to be somewhat unintuitive, rigid, and inefficient for browsing and disambiguation. In this talk, we propose an alternative approach to query disambiguation and result summarization. The approach uses a fixed set of orthogonal contextual dimensions to summarize and disambiguate search results. A contextual dimension assigns a specific type to a context, which makes it incomparable to contexts of other types. For the generic search scenario, we propose to use three types of contextual dimensions, namely, concepts, features, and specializations. We use NLP techniques to extract the three types of contexts, and a data mining algorithm to select a subset of contexts that are as distinct (i.e., mutually exclusive) as possible. Our experimental results show that we can achieve a considerable reduction in the effort required for retrieving relevant documents via the proposed interface.
Second Talk
|
Bio: | Raghu Krishnapuram received his Ph.D. degree in electrical and computer engineering from Carnegie Mellon University, Pittsburgh, in 1987. From 1987 to 1997, he was on the faculty of the Department of Computer Engineering and Computer Science at the University of Missouri, Columbia. From 1997 to 2000, Dr. Krishnapuram was a Full Professor at the Department of Mathematical and Computer Sciences, Colorado School of Mines, Golden, Colorado. Since then, he has been at IBM India Research Lab, New Delhi. Dr. Krishnapuram's research encompasses many aspects of Web mining, information retrieval, e-commerce, fuzzy set theory, neural networks, pattern recognition, computer vision, and image processing. He has published over 160 papers in journals and conferences in these areas. Dr. Krishnapuram is an IEEE Fellow, and a co-author (with J. Bezdek, J. Keller and N. Pal) of the book "Fuzzy Models and Algorithms for Pattern Recognition and Image Processing". |
DB Meeting: | Friday November 3, 2:00pm, DC 1331 |
Speaker: | Grant Weddell |
Title: | Databases in Performance Critical Embedded Software Systems |
Abstract: | Commercial relational engines are a multi-billion dollar industry, and are testimony to the reductions in the cost of developing and maintaining information systems that derive from using the programming language SQL. But what about using SQL as a systems programming language for implementing performance critical embedded software systems, on the grounds that similar reductions in software development and maintenance costs would ensue? In this talk, I'll review some of the issues, trends and research areas relating to "SQL the systems programming language." |
DB Meeting: | Friday November 10, 2:00pm, DC 1331 |
Speaker: | Amir H. Chinaei |
Title: | Secure Anonymization for Incremental Datasets |
Abstract: |
I will be talking about a recent work by Ji-Won Byun et al. presented at the VLDB workshop
on Secure Data Management, September 2006. This talk addresses data privacy, k-anonymity,
l-diversity, some types of inference channels, and data quality. The paper abstract is as
follows:
"Data anonymization techniques based on the k-anonymity model have been the focus of intense research in the last few years. Although the k-anonymity model and the related techniques provide valuable solutions to data privacy, current solutions are limited only to static data release (i.e., the entire dataset is assumed to be available at the time of release). While this may be acceptable in some applications, today we see databases continuously growing everyday and even every hour. In such dynamic environments, the current techniques may suffer from poor data quality and/or vulnerability to inference. In this paper, we analyze various inference channels that may exist in multiple anonymized datasets and discuss how to avoid such inferences. We then present an approach to securely anonymizing a continuously growing dataset in an efficient manner while assuring high data quality." To access the full paper, visit http://www.springerlink.com/content/p80r126487u63w31/fulltext.pdf |
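The k-anonymity property at the heart of the paper above can be stated in a few lines of code (a minimal sketch of the static property only; the paper's contribution is maintaining it securely as the dataset grows): a released table is k-anonymous with respect to a set of quasi-identifier columns if every combination of quasi-identifier values appears in at least k rows.

```python
# Minimal check of the k-anonymity property over a list of row dicts.
# Column names and the example data below are invented for illustration.
from collections import Counter

def is_k_anonymous(rows, quasi_ids, k):
    """True iff every quasi-identifier combination occurs in >= k rows."""
    groups = Counter(tuple(row[c] for c in quasi_ids) for row in rows)
    return all(count >= k for count in groups.values())
```

For instance, with generalized values like `{"age": "20-29", "zip": "537**"}`, a release is 2-anonymous only if no (age, zip) combination is unique; the inference channels the talk discusses arise when several such releases of a growing dataset are combined.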
DB Seminar: | Monday, November 27, 10:30am, DC 1304 |
Speaker: | Jiawei Han, University of Illinois at Urbana-Champaign |
Title: | Warehousing and Mining Massive RFID Data Sets |
DB Meeting: | Friday December 1, 2:00pm, DC 1331 |
Speaker: | Mumtaz Ahmad |
Title: | Priority Mechanisms for OLTP Workloads |
Abstract: |
OLTP workloads and their characteristics are an important area of study because of their prevalence in electronic commerce
and Internet applications. To provide differentiated performance classes in such workloads, we need to prioritize
transactions and provide scheduling mechanisms that ensure that high-priority transactions receive improved performance. At the
same time, depending on the application context, it may be desirable that low-priority transactions do not suffer heavily
in the process. I will discuss the priority mechanisms and scheduling policies proposed by McWherter et al. The key idea
is to identify the bottleneck resource for a workload and then to devise scheduling policies that work well with the
proposed prioritization mechanisms in a DBMS.
David McWherter, Bianca Schroeder, Natassa Ailamaki, and Mor Harchol-Balter. "Priority Mechanisms for OLTP and Transactional Web Applications." ICDE 2004.
David McWherter, Bianca Schroeder, Natassa Ailamaki, and Mor Harchol-Balter. "Improving Preemptive Prioritization via Statistical Characterization of OLTP Locking." ICDE 2005. |
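The core scheduling idea above can be sketched with a toy lock-wait queue (an illustration of priority-based granting in general, not the specific policies of the ICDE papers; the class and method names are invented): when several transactions wait for the same bottleneck resource, grant it to the highest-priority waiter first, breaking ties in FIFO order.

```python
# Toy priority queue for lock waiters: highest priority first, FIFO
# within a priority level. Illustrative only.
import heapq
import itertools

class PriorityLockQueue:
    def __init__(self):
        self._heap = []
        self._arrival = itertools.count()  # tie-breaker: arrival order

    def wait(self, txn_id, priority):
        # heapq is a min-heap, so negate priority to pop highest first.
        heapq.heappush(self._heap, (-priority, next(self._arrival), txn_id))

    def grant_next(self):
        """Pop the waiter that should receive the resource next."""
        return heapq.heappop(self._heap)[2]
```

A strict policy like this is what can starve low-priority transactions; the papers' statistical characterization of lock behavior is aimed at choosing policies that help high-priority work without that side effect.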
DB Seminar: | Tuesday, December 11, 11:00am, DC 3314 (two half-hour talks) |
Speaker: | Yi Lin, McGill University |
Title: | One-Copy-Snapshot-Isolation for Database Replication in WANs |
Abstract: | Database replication copies data to multiple databases in order to improve the performance, scalability, and fault tolerance of database systems. One-copy-serializability has long been considered the standard correctness criterion for database replication, corresponding to serializability in centralized databases. However, Snapshot Isolation (SI) is used extensively in DBMSs such as Oracle, PostgreSQL, and SQL Server 2005. In my research, I propose a new isolation level, 1-copy-SI, for transactions in replicated databases, together with a replication protocol that provides 1-copy-SI. The protocol works well in WANs, overcoming the limitations of existing protocols that only work well in LANs. |
Speaker: | José Enrique Armendáriz-Iñigo, Instituto Tecnológico de Informática |
Title: | Managing Transaction Conflicts in Middleware-Based Database Replication Architectures |
Abstract: | Database replication protocols need to detect, block, or abort conflicting transactions. A possible solution is to check their writesets (and also their readsets when a serializable isolation level is requested), which, however, consumes considerable CPU time. This gets even worse when the replication support is provided by a middleware, since there is no direct DBMS support in that layer. We propose and discuss using the concurrency control support of the local DBMS to detect conflicts between local transactions and the writesets of remote transactions. This makes it possible to simplify many database replication protocols and to enhance their performance. |
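The conflict check underlying both talks above can be stated as a simple set-intersection test (a minimal sketch of the general idea, not either speaker's protocol; the function and parameter names are invented): a remote transaction's writeset conflicts with a local transaction if the two writesets overlap, and under a serializable level the remote writeset must also be checked against the local readset.

```python
# Sketch of writeset/readset conflict detection between a local
# transaction and an incoming remote writeset. Illustrative only.
def conflicts(local_reads: set, local_writes: set,
              remote_writes: set, serializable: bool = False) -> bool:
    if local_writes & remote_writes:                    # write-write conflict
        return True
    if serializable and (local_reads & remote_writes):  # read-write conflict
        return True
    return False
```

The middleware-level cost discussed in the abstract comes from performing checks like this for every remote writeset; delegating the detection to the local DBMS's own concurrency control avoids redoing that work above the engine.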
This page is maintained by Ashraf Aboulnaga.