Winter 2010 Events Schedule | Database Research Group | UW

Winter 2010

Events of interest to the Database Research Group are posted here, and are also mailed to the uw.cs.database newsgroup and the db-faculty, db-grads, and db-friends mailing lists. Subscribe to one of these mailing lists to receive e-mail notification of upcoming events.

The DB group meets Wednesday afternoons at 2:30pm. The list below gives the times and locations of upcoming meetings. Each meeting lasts for an hour and features either a local speaker or, on Seminar days, an invited outside speaker. Everyone is welcome to attend.

Recent and Upcoming Events


DB Meeting: Wednesday January 13, 2:30pm, DC 1331
Speaker: Yingying Tao
Title: On-line estimating column cardinalities in large tables
Abstract: Estimating column cardinalities in large tables is an important task for RDBMSs. Because the statistics in an RDBMS are updated only periodically, a large number of insertions or deletions between two updates can change the statistics of certain tables dramatically without the change being recorded, leading to poor query optimizer performance. Hence, an on-line algorithm that can maintain important statistics such as column cardinalities is desirable. In this talk, a technique for estimating column cardinalities in real time will be presented. The tables of interest are large tables that are split into partitions. The technique updates column cardinalities efficiently and with high accuracy when a partition is added to or dropped from the table.

DB Meeting: Wednesday January 20, 2:30pm, DC 1331
Speaker: Amr El-Helw
Title: Structured Querying of Text Databases
Abstract: Unstructured text documents often embed data that is structured in nature, and we can expose this structured data using information extraction technology. A collection of such unstructured documents can be known as a text database. By processing a text database with information extraction systems, we can materialize a variety of structured "relations," over which we can then issue regular SQL queries. Unlike the traditional relational world, a SQL query execution over a text database might produce answers that are not fully accurate or complete, for a number of reasons. A query optimizer over a text database has to take into consideration the efficiency of query execution as well as the quality of the results. In this talk, I will present recent work that addresses the problem of processing and cost-based optimization of SQL queries issued over a text database, incorporating a trade-off between efficiency and result quality.

DB Seminar: Friday January 29th, 2:30pm, MC 2018B
Speaker: Raymond Ng, University of British Columbia
Title: Towards multi-modal extraction and summarization of conversations

DB Meeting: Wednesday February 3, 2:30pm, DC 1331
Speaker: Grant Weddell
Title: On RDF and SPARQL

DB Meeting: Wednesday February 10, 2:30pm, DC 1331
Speaker: Iman Elghandour
Title: Integrating MapReduce ideas into Distributed DBMS
Abstract: MapReduce has emerged as a framework for processing large-scale data and has become popular for its scalability and fault tolerance. MapReduce and parallel databases share some similarities, but MapReduce is designed for unstructured data and lacks the efficiency of a DBMS. Recent research has therefore focused on combining MapReduce with independent DBMS instances running on cluster nodes.

In this talk, I will discuss two different approaches: HadoopDB [1] and Osprey [2]. In HadoopDB, the Hadoop MapReduce implementation is used as a communication layer on top of single-node DBMS instances. In contrast, Osprey borrows MapReduce-style fault tolerance and adds it to a distributed shared-nothing database.

[1] A. Abouzeid, K. Bajda-Pawlikowski, D. Abadi, A. Silberschatz, and A. Rasin. HadoopDB: An architectural hybrid of mapreduce and DBMS technologies for analytical workloads. Proc. VLDB Endow., 2(1):922-933, 2009.

[2] C. Yang, C. Yen, C. Tan, and S. Madden. Osprey: Implementing mapreduce-style fault tolerance in a shared-nothing distributed database. In ICDE '10, 2010.


CS Seminar: Wednesday February 17, 10:30pm, DC 1304
Speaker: Rafae Bhatti, Database Security Group, Oracle Inc.
Title: Security and Privacy For Healthcare Applications: Does Policy mean Protection?
Abstract: With the adoption of Electronic Medical Records (EMRs), an increasing number of health-related Web applications are now available to consumers, providers, and partners. While this transformation offers huge benefits, there are security and privacy concerns integral to the process of electronic healthcare delivery. In this talk, we first survey the body of evidence to emphasize the design of appropriate security solutions for electronic healthcare applications. Successful solutions will always comply with the prime directive of healthcare: "nothing should interfere with delivery of care." We then formally present the problem of reconciling security and privacy policies with the actual healthcare workflow, which we refer to as the policy coverage problem. We outline a technical solution to the problem based on the concept of policy refinement, and develop a privacy protection architecture called PRIMA. We also offer guidelines for electronic healthcare applications to ensure adequate policy coverage. The ultimate goal is that electronic healthcare applications should be made secure without compromising usability.

DB Meeting: Wednesday February 24, 2:30pm, DC 1331
Speaker: Tim Brecht
Title: Q-Cop: Avoiding Bad Query Mixes to Minimize Client Timeouts Under Heavy Load
Abstract: In three-tiered web applications, some form of admission control is required to ensure that throughput and response times are not significantly harmed during periods of heavy load. We propose Q-Cop, a prototype system for improving admission control decisions that considers a combination of the load on the system, the number of simultaneous queries being executed, the actual mix of queries being executed, and the expected time a user may wait for a reply before they or their browser give up (i.e., time out). Using TPC-W queries, we show that the response times of different types of queries can vary significantly depending not just on the number of queries being processed but also on the mix of other queries that are running simultaneously. We develop a model of expected query execution times that accounts for the mix of queries being executed and integrate this model into a three-tiered system to make admission control decisions. Our results show that this approach makes more informed decisions about which queries to reject and, as a result, significantly reduces the number of requests that time out. Across the range of workloads examined, an average of 47% fewer requests are unsuccessful than with the next best approach.

This is joint work with Sean Tozer and Ashraf Aboulnaga. This is a practice talk for ICDE 2010.


DB Meeting: Wednesday March 10, 2:30pm, DC 1331
Speaker: Mohamed Soliman
Title: Wrapper Induction from Noisy Example
Abstract: Today's Web is a massive source of information that is mainly formatted for human consumption. However, many websites use scripts to generate structured HTML, which allows extraction rules, called wrappers, to effectively extract information of interest. Although many supervised and unsupervised information extraction approaches exist, building a noise-tolerant extraction system has received limited attention.

In the first part of this talk, I will present new methods for wrapper induction from noisy examples. The setting assumes an automatic (noisy) annotator (e.g., dictionary lookup) that produces a few training examples of a target type from a given set of input pages. The objective is to learn a wrapper, based on a black-box wrapper induction algorithm, that correctly extracts instances of the target type in similar pages. By removing the need to manually annotate the pages, we are able to perform extraction at Web scale. This is joint work with Nilesh Dalvi and Ravi Kumar from Yahoo! Research.

In the second part of the talk, I will give a demo of MashRank [1], a mashup system that integrates concepts from the areas of rank-aware query processing, probabilistic databases, and information extraction to enable building ranked mashups of sources with (possibly) uncertain ranking attributes. MashRank integrates information extraction techniques into query processing by asynchronously pushing extracted data into pipelined rank-aware query plans, producing mashup results in an early-out fashion. This is joint work with my supervisor Ihab Ilyas and my colleague Mina Saleeb.

[1] "MashRank: Towards Uncertainty-Aware and Rank-Aware Mashups", Mohamed A. Soliman, Mina Saleeb, and Ihab F. Ilyas. In ICDE 2010.


DB Seminar: Wednesday March 17th, 2:30pm, DC 1302
Speaker: Sunil Prabhakar, Purdue University
Title: The Orion Uncertain Data Management System

DB Meeting: Wednesday March 24, 2:30pm, DC 1331
Speaker: Ivan Bowman, Sybase iAnywhere
Title: The Perils of Upgrading
Abstract: In this talk I will discuss how customers can encounter unintended consequences when migrating to a new version of database server software. I will discuss customer workloads that we investigated as part of problem reports related to moving to a new version of the SQL Anywhere database server software. I will characterize the workloads and problems that the customers encountered and explore ways that these problems can be avoided or mitigated.

DB Meeting: Wednesday March 31, 2:30pm, DC 1331
Speaker: Greg Drzadzewski
Title: Creating a facility for Browsing / Searching on Clustered data
Abstract: In this talk I will describe an application under development that enables the user to browse and search a collection of clustered documents and provides facilities that help them answer Business Intelligence questions. A range of operations on collections of data is being examined to determine what functionality the application should provide and how it should be designed to optimize the performance of these operations.

DB Meeting: Wednesday April 7, 2:30pm, DC 1331
Speaker: Umar Farooq Minhas
Title: High Availability for Database Systems through Remus
Abstract: Remus [1] is a research prototype that provides high availability (HA) through asynchronous virtual machine replication on the open source Xen hypervisor, running on commodity hardware in an application- and operating-system-agnostic manner. One problem with using Remus for HA is that database workloads, such as Online Transaction Processing (OLTP) workloads, dirty memory at a very high rate, which results in a high replication overhead for Remus. We propose the design and implementation of a Remus-aware database system to reduce the performance overhead without compromising the HA guarantees of Remus. In this talk, I will first present experimental results that quantify the overhead incurred by TPC-H and TPC-C benchmarks running over PostgreSQL with Remus. I will then describe our proposed optimization, called memory deprotection. Finally, I will present results from ongoing experiments showing that selective memory deprotection has the potential to improve performance for database workloads.

[1] Brendan Cully, Geoffrey Lefebvre, Dutch Meyer, Mike Feeley, Norm Hutchinson, and Andrew Warfield, Remus: High Availability via Asynchronous Virtual Machine Replication, in NSDI 2008.


DB Meeting: Wednesday April 14, 2:30pm, DC 1331 CANCELLED
Speaker: David Toman
Title: TBD
Abstract: TBD

DB Seminar: Wednesday April 21st, 2:30pm, DC 1302
Speaker: Christoph Koch, Cornell University
Title: DBToaster: Aggressive compilation techniques for online aggregation

DB Meeting: Wednesday April 28, 2:30pm, DC 1331
Speaker: Glenn Paulley
Title: Query Mesh
Abstract: The subject of my talk is Query Mesh, a query processing technique that creates multiple execution strategies for distinct subsets of data. Details of Query Mesh have appeared in several recent conference proceedings, authored by Nehme, Rundensteiner, and Bertino.

This page is maintained by Ken Salem.