Database Research Group Events

Winter 2013

Events of interest to the Database Research Group are posted here, and are also mailed to the uw.cs.database newsgroup and the db-faculty, db-grads, db-friends mailing lists. Subscribe to one of these mailing lists to receive e-mail notification of upcoming events.

The DB group meets Wednesday afternoons at 2:30pm. The list below gives the times and locations of upcoming meetings. Each meeting lasts for an hour and features either a local speaker or, on Seminar days, an invited outside speaker. Everyone is welcome to attend.

Winter 2013 Events

DB Meeting: Wednesday January 16, 2:30pm, DC 1331
Speaker: Andrew Kane
Title: Space-Time Optimization of Hybrid Bitvector Intersection
Abstract: Various algorithms for in-memory intersection of ordered integer lists are used to evaluate queries in search engines. I will review a few based on compressed lists (delta vbyte), bitvectors and hybrid systems (Culpepper and Moffat 2010). I will also introduce two new metrics related to query based minimums and efficiency frontiers and show how these metrics can be used to guide the space-time optimization of the intersection algorithms, resulting in significant performance improvements.

DB Meeting: Wednesday January 23, 2:30pm, DC 1331
Speaker: Shahab Kamali
Title: Searching text documents with structured components
Abstract: Many online documents consist of two components: plain text and structured attribute-value pairs, where an attribute represents a property (e.g. "color") and the value is a specification for that property (e.g. "green"). Examples include product information pages that consist of reviews with plain text format and product specifications in the form of structured data. In many applications, retrieving collections of such documents with respect to a structured query that also consists of a collection of attribute-value pairs is desirable. While retrieving text documents with text queries is a well studied problem, taking both text and structured components into account and ranking documents according to a structured query is nontrivial. We propose an approach based on language modelling principles to leverage document scores with respect to a given query.

DB Meeting: Wednesday February 6, 2:30pm, DC 1331
Speaker: Ahmed El-Roby
Title: RACE: A Scalable and Elastic Parallel System for Discovering Repeats in Very Long Sequences
Abstract: A wide range of applications including bioinformatics, time series, and log analysis, depend on the identification of repetitions in very long sequences. Maximal pairs represent a superset of the most important types of repetitions. Existing maximal pair computation methods require both the input sequence and its index structure (which is at least an order of magnitude larger than the input) to fit in memory. Moreover, they are serial algorithms with prohibitively long execution time. Therefore, they are limited to small datasets, despite the fact that modern applications demand orders of magnitude longer sequences. This talk is about RACE, a parallel system for finding maximal repeats in very long sequences. RACE supports parallel execution on stand-alone multi-core systems, but can scale to thousands of nodes on clusters or supercomputers. RACE does not require the input or the index to fit in memory; therefore, it supports very long sequences with limited memory. Moreover, it uses a novel array representation that traverses the tree in a cache-efficient manner. RACE is particularly suitable for the cloud (e.g., Amazon EC2) since, based on availability, it can scale elastically to more or fewer machines during its execution. Since scaling out introduces overheads, mainly due to load imbalance, we propose a cost model to estimate the expected speedup for a specific problem size and computing infrastructure, based on statistics gathered through sampling. The model allows the user to select the appropriate combination of cloud resources based on the provider's prices and the required deadline. We conducted extensive experimental evaluation with large real datasets and large computing infrastructures. In contrast to existing methods, RACE can handle the entire human genome on a single machine with 16GB RAM. Moreover, for a problem that takes 10 hours of serial execution, RACE finishes in 28 seconds.

DB Meeting: Wednesday February 13, 2:30pm, DC 1331
Speaker: Dan Farrar, SAP Waterloo
Title: The Performance-Insightful Query Language
Abstract: will discuss the Performance-Insightful Query Language system developed by the AMP Lab at Berkeley. This design attempts to provide the predictable scalability of NoSQL architectures, without losing the development and maintenance benefits of data independence. The system provides a declarative query language and query processor above a key-value store. It enables the development of applications that can scale drastically and rapidly without degradation in user experience.
Armbrust, Michael, et al. "PIQL: Success-tolerant query processing in the cloud." Proceedings of the VLDB Endowment 5.3 (2011): 181-192.

DB Seminar: Wednesday February 27, 2:30pm, DC 1302
Speaker: Mohamed Mokbel, University of Minnesota
Title: Database Support for Recommender Systems

DB Meeting: Wednesday March 6, 2:30pm, DC 1331
Speaker: Pedram Ghodsnia
Title: Solid State Drives as a Persistent Cache for Database Systems
Abstract: Recent advances in solid state drives (SSDs) are resulting in a revolutionary breakthrough in storage technology. In terms of price/performance, SSD can be placed in between hard disk and main memory. This property makes it a good candidate for a second level cache in database systems. Recently, a number of interesting approaches for exploiting SSD as a second level cache are proposed. However, most existing methods are based on this implicit assumption that SSD is not reliable enough to maintain the cache content permanently. This implicit assumption results in underutilization of the second level cache. Long checkpoint times, frequent drops in performance, and long recovery and ramp-up times are some of the disadvantages of the existing approaches. With new advancements in SSD technology the reliability of SSD is improved significantly. In this talk we propose the idea of exploiting the persistency of SSD in designing a more efficient second level cache. Our empirical results show that the proposed approach addresses all aforementioned issues effectively.

DB Meeting: Wednesday March 13, 2:30pm, DC 1331
Speaker: Jiewen Wu
Title: Efficient evaluation of instance queries in knowledge bases
Abstract: Instance queries are the basic type of queries over description logic knowledge bases, and hence efficient evaluation of such queries is of paramount importance for answering more complex queries, e.g., conjunctive queries. We present a novel method of evaluating instance queries that derives from binary absorption. The method is designed to work well for large ABoxes (data) and where the TBox (schema) is not necessarily Horn, e.g., where disjunctions and negations are required. The method significantly improves the performance of answering instance queries, and particularly so in cases where a large number of concrete feature values are included. We also report on the results of an experimental evaluation that validates the efficacy of the method.

DB Meeting: Wednesday March 20, 2:30pm, DC 1331
Speaker: Simon Forsyth
Title: Providing Freshness for Cached Data in Unstructured Peer-to-Peer Systems
Abstract: Maintaining freshness of replicated data in peer-to-peer systems is challenging due to the high cost of update management. While updates have been studied in structured networks, they have been neglected in unstructured networks. We therefore confront the problem of maintaining fresh replicas of data in unstructured peer-to-peer networks. We leverage path replication to support efficient lazy updates and provide freshness for cached data in these systems using only local knowledge. In addition, we show that locally available information may be used to provide additional guarantees of freshness at an acceptable cost to performance. Through performance simulations based on both synthetic and real-world workloads from big data environments, we demonstrate the efficacy of our approach.

DB Seminar: Wednesday March 27, 2:30pm, DC 1302
Speaker: Michael Bender, SUNY Stonybrook
Title: Indexing Massive Data Sets

DB Meeting: Wednesday April 3, 2:30pm, DC 1331 CANCELLED
Speaker: Ani Nica, SAP Waterloo

DB Meeting: Wednesday April 10, 2:30pm, DC 1304
Speaker: Jaroslaw Szlichta, York University
Title: Fundamentals and Applications of Order Dependencies
Abstract: Dependencies have played a significant role in databases for many years. In this talk, we present order dependencies (ODs) between lexicographically ordered sets of tuples, which capture monotonicity property of the data. Queries can be evaluated more efficiently when the query optimizer uses order dependencies. They can be run even faster when the optimizer can also reason over known ODs to infer new ones. ODs can be declared as integrity constraints, and they can be inferred automatically for many types of SQL functions and algebraic expressions. We present optimization techniques using ODs for queries that involve join, order by, group by, partition by, and distinct. Essentially, ODs can further exploit interesting orders and eliminate or simplify potentially expensive sorts in the query plan. We evaluate these techniques over our prototype in IBM DB2 using TPC-DS benchmark. Our experimental results demonstrate a significant performance gain. We additionally devise a sound and complete axiomatization of order dependencies and investigate the expressiveness and complexity of ODs.

Bio: Jarek Szlichta is a Ph.D. student at York University in Toronto and he previously finished master’s degree at Warsaw University, Poland. His research concerns data dependencies, data quality, data warehousing and keyword search. He recently spent a 3-year fellowship at IBM Centre for Advanced Studies in Toronto. He is a recipient of IBM Student-of-the-Year award for 2012. His research at IBM includes business-intelligence query optimization, and its focus is on order dependencies. Previously he worked at Comarch R&D in Warsaw designing and implementing the ODL layer, a data layer for RDBMS independent query execution and optimization. The team was managed by Andrzej Gasiennica-Samek (ACM World Champion in programming contest from 2003) and was partially funded by EU with a budget of 2,3 million Euros. Additionally, he worked at Bosch as a SAP Consultant.

DB Seminar: Tuesday April 16, 2:30pm, DC 1302
Speaker: Franz Faerber
Title: The SAP HANA Platform: Vision and R&D Agenda

DB Meeting: Wednesday April 24, 2:30pm, DC 1331
Speaker: Taha Rafiq
Title: Elasca: Workload-Aware Elastic Scalability for Partition Based Database Systems
Abstract: Providing the ability to increase or decrease allocated resources on demand as the transactional load varies is essential for database management systems (DBMS) deployed in today's computing platforms, such as the cloud. This talk presents Elasca, a system that utilizes the flexibility provided by a partitioned, shared-nothing parallel DBMS to implement elasticity. The core idea behind Elasca is to start with a small number of servers that manage all the partitions, and to elastically scale out by dynamically adding new servers and redistributing database partitions among these servers as the load varies. Implementing this approach requires efficient mechanisms for addition/removal of servers and migration of partitions, as well as policies to efficiently determine the optimal placement of partitions on the given servers and plans for partition migration. Elasca incorporates both these features in an existing shared-nothing DBMS (VoltDB) to provide autonomic elastic scalability. Elasca consists of a mechanism for enabling elastic scalability, and a workload-aware online optimizer for determining optimal partition placement and migration plans. In this talk, I will focus on the partition placement problem in the context of an elastically scalable DBMS and present Elasca's optimizer for solving this problem. Elasca's optimizer minimizes computing resources required and balances load effectively without compromising system performance, even in the presence of variations in intensity and skew of the load. The optimizer also returns results quickly and provides good scalability; features that are essential for any optimizer that is meant to be used in an online setting. I will present results from experiments that show Elasca is able to achieve nearly 'optimal' performance while saving considerable computing and network resources.

This page is maintained by Khuzaima Daudjee.

Campaign Waterloo

Data Systems Group
David R. Cheriton School of Computer Science
University of Waterloo
Waterloo, Ontario, Canada N2L 3G1
Tel: 519-888-4567
Fax: 519-885-1208

Contact | Feedback: | Data Systems Group

Valid HTML 4.01!Valid CSS! Last modified: Saturday, 13-Apr-2013 17:24:33 EDT