The DB group meets Wednesday afternoons at 2:30pm. The list below gives the times and locations of upcoming meetings. Each meeting lasts for an hour and features either a local speaker or, on Seminar days, an invited outside speaker. Everyone is welcome to attend.
|DB Meeting:||Wednesday January 13, 2:30pm, DC 1331|
|Title:||On-line estimation of column cardinalities in large tables|
|Abstract:||Estimating column cardinalities in large tables is an important task for RDBMSs. Because an RDBMS typically refreshes its statistics only at fixed intervals, a large number of insertions or deletions between two refreshes can change a table's statistics dramatically while the optimizer still sees stale values, leading to poor query plans. Hence, an on-line algorithm that maintains important statistics such as column cardinalities is desirable. In this talk, a technique for estimating column cardinalities in real time will be presented. The tables of interest are large tables that are split into partitions. The technique updates the column cardinalities efficiently and with high accuracy whenever a partition is added to or dropped from the table.|
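The talk's actual algorithm is not given here; purely as an illustration of the idea, the following minimal Python sketch (all names hypothetical) keeps one distinct-value summary per partition, so the table-level column cardinality can be re-estimated cheaply when a partition is added or dropped, without rescanning the whole table.

    import hashlib

    class PartitionedCardinalityEstimator:
        """Per-partition distinct-value sets for one column (illustrative only).

        A real system would use a mergeable sketch such as HyperLogLog instead
        of exact hash sets, but the add/drop logic is the same.
        """

        def __init__(self):
            self.partitions = {}  # partition id -> set of hashed column values

        def add_partition(self, pid, values):
            # Summarize the new partition once, when it is loaded.
            self.partitions[pid] = {
                hashlib.md5(str(v).encode()).digest() for v in values
            }

        def drop_partition(self, pid):
            # Dropping a partition only discards its summary.
            self.partitions.pop(pid, None)

        def column_cardinality(self):
            # Union the per-partition summaries; no base-table scan needed.
            if not self.partitions:
                return 0
            return len(set().union(*self.partitions.values()))

    est = PartitionedCardinalityEstimator()
    est.add_partition("2010-01", [1, 2, 2, 3])
    est.add_partition("2010-02", [3, 4])
    print(est.column_cardinality())  # 4
    est.drop_partition("2010-01")
    print(est.column_cardinality())  # 2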
|DB Meeting:||Wednesday January 20, 2:30pm, DC 1331|
|Title:||Structured Querying of Text Databases|
|Abstract:||Unstructured text documents often embed data that is structured in nature, and we can expose this structured data using information extraction technology. A collection of such unstructured documents is known as a text database. By processing a text database with information extraction systems, we can materialize a variety of structured "relations," over which we can then issue regular SQL queries. Unlike in the traditional relational world, a SQL query execution over a text database might produce answers that are not fully accurate or complete, for a number of reasons. A query optimizer over a text database therefore has to take into consideration the efficiency of query execution as well as the quality of the results. In this talk, I will present recent work that addresses the problem of processing and cost-based optimization of SQL queries issued over a text database, incorporating a trade-off between efficiency and result quality.|
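As a toy illustration of the pipeline the abstract describes (not the speaker's system), this Python sketch uses a regular expression as a stand-in information extractor to materialize a relation from raw text, then queries it with ordinary SQL via sqlite3; the pattern, documents, and schema are invented for the example.

    import re
    import sqlite3

    # Stand-in "information extraction system": a regex that pulls
    # (disease, outbreak year) pairs out of free text. Real extractors
    # are far more sophisticated and make errors, which is why result
    # quality enters the optimization problem.
    DOCS = [
        "An outbreak of cholera was reported in 1854 in London.",
        "An outbreak of influenza was reported in 1918 worldwide.",
    ]
    PATTERN = re.compile(r"outbreak of (\w+) was reported in (\d{4})")

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE outbreaks (disease TEXT, year INTEGER)")
    for doc in DOCS:
        for disease, year in PATTERN.findall(doc):
            conn.execute("INSERT INTO outbreaks VALUES (?, ?)", (disease, int(year)))

    # Ordinary SQL over the materialized "relation".
    for row in conn.execute("SELECT disease FROM outbreaks WHERE year < 1900"):
        print(row[0])  # cholera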
|DB Seminar:||Friday January 29th, 2:30pm, MC 2018B|
|Speaker:||Raymond Ng, University of British Columbia|
|Title:||Towards multi-modal extraction and summarization of conversations|
|DB Meeting:||Wednesday February 3, 2:30pm, DC 1331|
|Title:||On RDF and SPARQL|
|DB Meeting:||Wednesday February 10, 2:30pm, DC 1331|
|Title:||Integrating MapReduce ideas into Distributed DBMS|
MapReduce has emerged as a framework for processing large-scale data, and it has become popular for its scalability and fault tolerance. MapReduce and parallel databases share some similarities, but MapReduce is designed for unstructured data and lacks the efficiency of a DBMS. Recent research has therefore focused on combining MapReduce with independent DBMS instances running on cluster nodes.
In this talk, I will discuss two different approaches: HadoopDB and Osprey. In HadoopDB, the Hadoop MapReduce implementation is used as a communication layer on top of single-node DBMS instances. In contrast, Osprey borrows MapReduce-style fault tolerance and adds it to a distributed shared-nothing database. (A toy simulation of the HadoopDB-style split follows the references below.)
 A. Abouzeid, K. Bajda-Pawlikowski, D. Abadi, A. Silberschatz, and A. Rasin. HadoopDB: An architectural hybrid of MapReduce and DBMS technologies for analytical workloads. Proc. VLDB Endow., 2(1):922-933, 2009.
 C. Yang, C. Yen, C. Tan, and S. Madden. Osprey: Implementing MapReduce-style fault tolerance in a shared-nothing distributed database. In ICDE '10, 2010.
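To make the HadoopDB-style split concrete, here is a minimal Python simulation (my own toy, not HadoopDB's code): each "node" holds a partition of a table in its own single-node database, the map step pushes a SQL fragment down to every node, and the reduce step combines the partial results, as a MapReduce layer over local DBMS instances would.

    import sqlite3

    # Each "cluster node" is a single-node DBMS holding one partition.
    def make_node(rows):
        db = sqlite3.connect(":memory:")
        db.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
        db.executemany("INSERT INTO sales VALUES (?, ?)", rows)
        return db

    nodes = [
        make_node([("east", 10), ("west", 5)]),
        make_node([("east", 7), ("west", 3)]),
    ]

    # Map: push the SQL fragment down to every node's local DBMS,
    # so each node does its own aggregation efficiently.
    def map_phase(sql):
        return [node.execute(sql).fetchall() for node in nodes]

    # Reduce: merge the per-node partial aggregates.
    def reduce_phase(partials):
        totals = {}
        for rows in partials:
            for region, subtotal in rows:
                totals[region] = totals.get(region, 0) + subtotal
        return totals

    partials = map_phase("SELECT region, SUM(amount) FROM sales GROUP BY region")
    print(reduce_phase(partials))  # {'east': 17, 'west': 8}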
|CS Seminar:||Wednesday February 17, 10:30am, DC 1304|
|Speaker:||Rafae Bhatti, Database Security Group, Oracle Inc.|
|Title:||Security and Privacy For Healthcare Applications: Does Policy mean Protection?|
|Abstract:||With the adoption of Electronic Medical Records (EMRs), an increasing number of health-related Web applications are now available to consumers, providers, and partners. While this transformation offers huge benefits, there are security and privacy concerns integral to the process of electronic healthcare delivery. In this talk, we first survey the body of evidence to motivate the design of appropriate security solutions for electronic healthcare applications. Successful solutions must comply with the prime directive of healthcare - "nothing should interfere with delivery of care." We then formally present the problem of reconciling security and privacy policies with the actual healthcare workflow, which we refer to as the policy coverage problem. We outline a technical solution to the problem based on the concept of policy refinement, and develop a privacy protection architecture called PRIMA. We also offer guidelines for electronic healthcare applications to ensure adequate policy coverage. The ultimate goal is that electronic healthcare applications should be made secure without compromising usability.|
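The policy coverage problem can be illustrated with a small Python sketch (entirely my own construction, not PRIMA): given a set of policy rules and the accesses the actual care workflow requires, a coverage check flags the workflow accesses that no rule permits, which is exactly where policy and practice diverge.

    # Hypothetical policy rules: (role, resource, action) triples that are permitted.
    policy = {
        ("physician", "ehr", "read"),
        ("physician", "ehr", "write"),
        ("nurse", "ehr", "read"),
    }

    # Accesses the actual care workflow needs (e.g., mined from audit logs).
    workflow = [
        ("physician", "ehr", "write"),
        ("nurse", "ehr", "write"),      # needed in practice, not permitted
        ("pharmacist", "ehr", "read"),  # needed in practice, not permitted
    ]

    # Coverage check: every uncovered access is either a usability problem
    # (care is blocked) or gets worked around, defeating the policy.
    uncovered = [access for access in workflow if access not in policy]
    for role, resource, action in uncovered:
        print(f"policy gap: {role} needs {action} on {resource}")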
|DB Meeting:||Wednesday February 24, 2:30pm, DC 1331|
|Title:||Q-Cop: Avoiding Bad Query Mixes to Minimize Client Timeouts Under Heavy Load|
In three-tiered web applications, some form of admission control is required to ensure that throughput and response times are not significantly harmed during periods of heavy load. We propose Q-Cop, a prototype system for improving admission control decisions that considers a combination of the load on the system, the number of simultaneous queries being executed, the actual mix of queries being executed, and the expected time a user may wait for a reply before they or their browser give up (i.e., time out). Using TPC-W queries, we show that the response times of different types of queries can vary significantly depending not just on the number of queries being processed but on the mix of other queries that are running simultaneously. We develop a model of expected query execution times that accounts for the mix of queries being executed and integrate this model into a three-tiered system to make admission control decisions. Our results show that this approach makes more informed decisions about which queries to reject and, as a result, significantly reduces the number of requests that time out. Across the range of workloads examined, an average of 47% fewer requests are unsuccessful than with the next best approach.
This is joint work with Sean Tozer and Ashraf Aboulnaga, and is a practice talk for ICDE 2010.
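The abstract does not give the model's exact form; as an assumed illustration only, this Python sketch uses a simple linear mix-aware cost model (query types and coefficients invented) to predict a new query's execution time from the currently running mix, and rejects the query if the prediction exceeds the client timeout.

    # Hypothetical mix-aware cost model: predicted time for a query of
    # type t = base[t] + sum over running types u of interact[t][u] * count[u].
    base = {"browse": 0.2, "search": 0.5, "order": 1.0}
    interact = {
        "browse": {"browse": 0.01, "search": 0.05, "order": 0.10},
        "search": {"browse": 0.02, "search": 0.08, "order": 0.20},
        "order":  {"browse": 0.03, "search": 0.10, "order": 0.30},
    }
    TIMEOUT = 2.0  # seconds a user/browser will wait before giving up

    def predicted_time(qtype, running_mix):
        # running_mix: dict of query type -> number currently executing.
        return base[qtype] + sum(
            interact[qtype][u] * n for u, n in running_mix.items()
        )

    def admit(qtype, running_mix):
        # Reject queries that would likely time out anyway; serving them
        # would only waste capacity that admitted queries could use.
        return predicted_time(qtype, running_mix) <= TIMEOUT

    mix = {"browse": 10, "search": 5, "order": 3}
    print(admit("browse", mix))  # True:  0.2 + 0.1 + 0.25 + 0.3 = 0.85
    print(admit("order", mix))   # False: 1.0 + 0.3 + 0.5 + 0.9 = 2.7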
|DB Meeting:||Wednesday March 10, 2:30pm, DC 1331|
|Title:||Wrapper Induction from Noisy Examples|
Today's Web is a massive source of information that is mainly formatted for human consumption. However, many websites use scripts to generate structured HTML, which allows extraction rules, called wrappers, to effectively extract information of interest. Although many supervised and unsupervised information extraction approaches exist, building a noise-tolerant extraction system has received comparatively little attention.
In the first part of this talk I'm going to present new methods for wrapper induction from noisy examples. The setting assumes an automatic (noisy) annotator (e.g., dictionary lookup) that produces a few training examples of a target type from a given set of input pages. The objective is to learn a wrapper, based on a black-box wrapper induction algorithm, that correctly extracts instances of the target type in similar pages. By removing the need to manually annotate the pages, we are able to perform extraction at Web scale. This is joint work with Nilesh Dalvi and Ravi Kumar from Yahoo! Research.
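The induction algorithm itself is treated as a black box in the talk; purely as an illustration of learning from noisy annotations, this Python sketch scores candidate extraction rules (here, node paths in a drastically simplified page model, all data invented) by how well they agree with a noisy annotator across pages, and keeps the best-agreeing rule as the wrapper.

    # Simplified page model: each page maps a candidate rule (an XPath-like
    # node path) to the text it extracts. A real system works on DOM trees.
    pages = [
        {"/html/table/tr[1]/td": "Casablanca", "/html/div/span": "1942"},
        {"/html/table/tr[1]/td": "Vertigo",    "/html/div/span": "1958"},
        {"/html/table/tr[1]/td": "Chinatown",  "/html/div/span": "1974"},
    ]

    # Noisy annotator (e.g., dictionary lookup): per-page guesses for the
    # target type "movie title". One of the three labels is wrong.
    noisy_labels = ["Casablanca", "Vertigo", "1974"]

    def agreement(rule):
        # How often the rule's extraction matches the noisy annotation.
        return sum(page.get(rule) == label
                   for page, label in zip(pages, noisy_labels))

    # Pick the rule with the best agreement: a rule consistent on most
    # pages wins even though no rule matches every noisy label.
    candidates = {rule for page in pages for rule in page}
    wrapper = max(candidates, key=agreement)
    print(wrapper)  # /html/table/tr[1]/td (agrees on 2 of 3 pages)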
In the second part of the talk I'm going to give a demo of MashRank, a mashup system that integrates concepts from rank-aware query processing, probabilistic databases, and information extraction to enable building ranked mashups of sources with (possibly) uncertain ranking attributes. MashRank integrates information extraction into query processing by asynchronously pushing extracted data into pipelined rank-aware query plans, producing mashup results in an early-out fashion (a small sketch of this behaviour follows the reference below). This is joint work with my supervisor Ihab Ilyas and my colleague Mina Saleeb.
 "MashRank: Towards Uncertainty-Aware and Rank-Aware Mashups", Mohamed A. Soliman, Mina Saleeb, and Ihab F. Ilyas. In ICDE 2010.
|DB Seminar:||Wednesday March 17th, 2:30pm, DC 1302|
|Speaker:||Sunil Prabhakar, Purdue University|
|Title:||The Orion Uncertain Data Management System|
|DB Meeting:||Wednesday March 24, 2:30pm, DC 1331|
|Speaker:||Ivan Bowman, Sybase iAnywhere|
|Title:||The Perils of Upgrading|
|Abstract:||In this talk I will discuss how customers can encounter unintended consequences when migrating to a new version of database server software. I will discuss customer workloads that we investigated as part of problem reports related to moving to a new version of the SQL Anywhere database server software. I will characterize the workloads and problems that the customers encountered and explore ways that these problems can be avoided or mitigated.|
|DB Meeting:||Wednesday March 31, 2:30pm, DC 1331|
|Title:||Creating a Facility for Browsing/Searching Clustered Data|
|Abstract:||In this talk I will describe an application being developed that enables the user to browse and search a collection of clustered documents, and that provides facilities to help answer Business Intelligence questions. A range of different operations on collections of data are being examined in order to determine what functionality the application should provide and how it should be designed to optimize the performance of these operations.|
|DB Meeting:||Wednesday April 7, 2:30pm, DC 1331|
|Speaker:||Umar Farooq Minhas|
|Title:||High Availability for Database Systems through Remus|
Remus is a research prototype that provides high availability (HA) through asynchronous virtual machine replication on the open-source Xen hypervisor, running on commodity hardware in an application- and operating-system-agnostic manner. One problem with using Remus for HA is that database workloads, such as Online Transaction Processing (OLTP) workloads, dirty memory at a very high rate, which results in a high replication overhead for Remus. We propose the design and implementation of a Remus-aware database system to reduce this performance overhead without compromising the HA guarantees of Remus. In this talk, I will first present experimental results that quantify the overhead incurred by the TPC-H and TPC-C benchmarks running over PostgreSQL with Remus. I will then describe our proposed optimization, called memory deprotection, and present results from ongoing experiments showing that selective memory deprotection has the potential to improve performance for database workloads. (A toy model of the idea follows the reference below.)
 B. Cully, G. Lefebvre, D. Meyer, M. Feeley, N. Hutchinson, and A. Warfield. Remus: High availability via asynchronous virtual machine replication. In NSDI '08, 2008.
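To convey why deprotecting some memory helps (a toy model under my own assumptions, not the actual Remus or PostgreSQL mechanics), this Python sketch counts the bytes that must be shipped each replication epoch: only dirty pages that are still protected contribute to replication cost, so deprotecting high-churn buffers whose contents the database can recover by other means (e.g., its own logging) shrinks each checkpoint.

    # Toy replication-epoch model. Page ids and sizes are invented.
    PAGE_SIZE = 4096

    def epoch_replication_bytes(dirty_pages, deprotected):
        # Remus-style checkpointing ships every dirty page; memory
        # deprotection exempts pages the database can recover itself,
        # so they are skipped.
        shipped = [p for p in dirty_pages if p not in deprotected]
        return len(shipped) * PAGE_SIZE

    # OLTP-style epoch: the buffer pool (pages 0-99) churns heavily,
    # while other memory (pages 100-109) is dirtied too.
    dirty = set(range(0, 110))

    print(epoch_replication_bytes(dirty, deprotected=set()))             # 450560 bytes
    print(epoch_replication_bytes(dirty, deprotected=set(range(100))))   # 40960 bytes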
|DB Meeting:||Wednesday April 14, 2:30pm, DC 1331 CANCELLED|
|DB Seminar:||Wednesday April 21st, 2:30pm, DC 1302|
|Speaker:||Christoph Koch, Cornell University|
|Title:||DBToaster: Aggressive compilation techniques for online aggregation|
|DB Meeting:||Wednesday April 28, 2:30pm, DC 1331|
|Abstract:||The subject of my talk is Query Mesh, a query processing technique that creates multiple execution strategies for distinct subsets of data. Details of Query Mesh have appeared in several recent conference proceedings, authored by Nehme, Rundensteiner, and Bertino.|
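As a minimal illustration of the Query Mesh idea (my own toy example, not the authors' design), the following Python sketch routes each tuple to one of several filter orderings based on a simple classifier over the data, so distinct data subsets run under different execution strategies instead of a single plan for the whole input.

    # Two invented predicates with different costs and selectivities.
    def is_recent(t):    # cheap; very selective for "archive" tuples
        return t["year"] >= 2009

    def is_relevant(t):  # expensive; very selective for "fresh" tuples
        return "db" in t["tags"]

    # Query Mesh idea: a router classifies each tuple into a data subset,
    # and each subset gets its own operator ordering (execution strategy),
    # e.g., run the most selective predicate for that subset first.
    plans = {
        "archive": [is_recent, is_relevant],  # is_recent prunes archive tuples fast
        "fresh":   [is_relevant, is_recent],  # is_relevant prunes fresh tuples fast
    }

    def route(t):
        return "fresh" if t["source"] == "feed" else "archive"

    def execute(tuples):
        for t in tuples:
            # all() short-circuits, so the per-subset predicate order
            # determines how much work a non-qualifying tuple costs.
            if all(pred(t) for pred in plans[route(t)]):
                yield t

    data = [
        {"source": "feed",  "year": 2010, "tags": ["db"]},
        {"source": "crawl", "year": 1999, "tags": ["db"]},
    ]
    print(list(execute(data)))  # only the first tuple passes both predicates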