Fall 2011

Events of interest to the Database Research Group are posted here.

The DB group meets Wednesday afternoons at 2:30pm. The list below gives the times and locations of upcoming meetings. Each meeting lasts for an hour and features either a local speaker or, on Seminar days, an invited outside speaker. Everyone is welcome to attend.

Recent and Upcoming Events

DB Meeting: Wednesday September 21, 1:00pm, DC 1331 (Note: special meeting time)
Speakers: Ashraf Aboulnaga, Gunes Aluc, Khuzaima Daudjee, Dave DeHaan, Ken Salem, Ning Zhang
Title: VLDB 2011 5 Minute Madness
  • Ashraf will talk about trending topics in social networks based on the paper "Structural Trend Analysis for Online Social Networks" by Ceren Budak, Divyakant Agrawal, Amr El Abbadi.
  • Gunes will talk about database cracking based on the paper "Merging What's Cracked, Cracking What's Merged: Adaptive Indexing in Main-Memory Column-Stores" by Stratos Idreos (CWI), Stefan Manegold (CWI), Harumi Kuno (HP Labs), Goetz Graefe (HP Labs).
  • Khuzaima will talk about something.
  • Dave will talk about a scheme for optimistic concurrency control allowing highly parallel updates to tree-structured indexes, based on the paper "Optimistic Concurrency Control by Melding Trees" by Philip A. Bernstein (Microsoft Research), Colin W. Reid (Microsoft), Ming Wu (Microsoft Research), Xinhao Yuan (Tsinghua University).
  • Patrick will talk about something.
  • Ken will talk about something so secret that nobody else at VLDB heard about it.
  • Ning will talk about graph databases.

MMath Thesis Seminar: Thursday September 22, 2:00pm, DC 1331
Speaker: Alexey Karyakin
Title: Dynamic Scale-out Mechanisms for Partitioned Shared-Nothing Databases

DB Seminar: Wednesday September 28, 2:30pm, DC 1302
Speaker: Jonathan Goldstein, Microsoft
Title: Temporal Analytics on Big Data for Web Advertising

DB Meeting: Wednesday October 5, 2:30pm, DC 1331
Speaker: Lukasz Golab
Title: Discovering Pattern Tableaux for Data Quality Analysis
Abstract: I'll present a case study that illustrates the utility of pattern tableau discovery for data quality analysis. Given a user-supplied integrity constraint, such as a Boolean predicate expected to be satisfied by every tuple, a functional dependency, or an inclusion dependency, a pattern tableau is a concise summary of subsets of the data that satisfy or fail the constraint. I'll describe Data Auditor---a system for automatic tableau discovery from data---and give examples of characterizing data quality in a network monitoring database used by a large Internet Service Provider. This is joint work with Flip Korn and Divesh Srivastava.

DB Seminar: Wednesday October 12, 2:00pm, DC 1302 (Please note the early start time.)
Speaker: Alon Halevy, Google Research
Title: Bringing (Web) Databases to the Masses

DB Meeting: Wednesday October 19, 2:30pm, DC 1331
Speaker: Shahab Kamali
Title: Answering math queries with general-purpose search engines
Abstract: Traditionally, general-purpose search engines such as Bing and Google are used to look up keywords within web pages. Hence, a query is assumed to consist of a bag of keywords, and the search result is a ranked list of documents. However, for some queries a short answer can better satisfy users' needs. For a class of such queries, which we call math queries, the answer should be calculated rather than looked up in a database. Arithmetic computations, unit conversions, and symbolic computations are examples of math queries. Our goal is to evaluate a search engine's ability to recognize and answer math queries. Determining whether an arbitrary query is a math query is a hard problem. We propose a novel approach for recognizing and classifying math queries using large-scale search logs. Traditional approaches to evaluating the quality of results rely mostly on users' interactions with the engine, typically measured by click information. Answers to math queries do not contain links, so most of the previously proposed metrics are not applicable in this case. We propose various evaluation metrics that can be applied to math queries, and present the results on a large collection of math queries taken from Bing's search logs.

DB Meeting: Wednesday October 26, 2:30pm, DC 1331 (CANCELLED)
Speaker: Amr El-Helw
Title: Column-Oriented Query Processing for Row Stores
Abstract: Column-oriented DBMSs have gained increasing interest due to their superior performance for analytical workloads. Prior efforts tried to determine whether the query processing techniques of column-oriented systems could be simulated in row-oriented databases, in the hope of improving their performance, especially for OLAP and data warehousing applications. In this talk, I show that column-oriented query processing can significantly improve the performance of row-oriented DBMSs, using techniques that take into account the unique characteristics of data obtained from indexes and exploit new technologies such as flash SSDs and multi-core processors to boost performance.

DB Meeting: Wednesday November 2, 2:30pm, DC 1331
Speaker: Ken Salem
Title: A Scalable, Highly Available Cloud Storage Tier for Relational DBMS
Abstract: I'll present some recent work on using an eventually consistent NoSQL system (Cassandra) to provide a scalable, available, wide area, multi-tenant storage service for relational database systems. This is joint work with Rui Liu and Ashraf Aboulnaga. (slides (PDF))

DB Meeting: Wednesday November 16, 2:30pm, DC 1331
Speaker: Pedram Ghodsnia
Title: Error Reduction and GPU-Accelerated Query Execution in Signature Files
Abstract: The signature file index is a well-studied method in information retrieval for indexing large text databases. Because of its small index size, it is a good candidate for environments where memory is scarce. This small index size, however, comes at the cost of a high false positive error rate and long query execution time. These two critical problems make the method impractical for many applications.

In the first part of this talk, we will address the problem of the high false positive error rate of signature files by introducing COCA Filters, a new variation of Bloom Filters which exploits the co-occurrence probability of words in documents to reduce the false positive error. We will show that by using this technique we can reduce the false positive error by a factor of up to 21, for the same index size.

In the second part of the talk, we will address the long query execution time of signature files by proposing a scalable approach that utilizes the massive computational power of Graphics Processing Units (GPUs) to accelerate query execution. The impressive speed-up and the scalability of our proposed method will be shown experimentally. The scalability of this approach allows us to reach the desired speed-up by increasing the number of GPUs.

The first part of the talk is based on a paper entitled "COCA Filters: Co-Occurrence Aware Bloom Filters", joint work with Kamran Tirdad, Ian Munro and Alejandro Lopez-Ortiz. This paper won the best student paper award at SPIRE 2011.
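
For readers unfamiliar with signature files, the trade-off described in this abstract can be illustrated with a minimal sketch. It is not the COCA Filter from the talk: it uses a plain Bloom-filter-style signature per document, and the names and parameters (SIGNATURE_BITS, HASHES_PER_WORD, word_bits, make_signature, may_contain, search) as well as the toy documents are assumptions made up for illustration. A signature can report a word as present when it is not (a false positive), so candidates are checked against the real document; it never misses a word that is present.

```python
import hashlib

SIGNATURE_BITS = 64   # hypothetical signature width, kept small on purpose
HASHES_PER_WORD = 3   # hypothetical number of bit positions set per word


def word_bits(word):
    """Return the bit positions that a word sets in a signature."""
    positions = []
    for i in range(HASHES_PER_WORD):
        digest = hashlib.sha256(f"{i}:{word}".encode("utf-8")).digest()
        positions.append(int.from_bytes(digest[:8], "big") % SIGNATURE_BITS)
    return positions


def make_signature(words):
    """Superimpose the bit patterns of all words into one integer bitmask."""
    signature = 0
    for word in words:
        for pos in word_bits(word):
            signature |= 1 << pos
    return signature


def may_contain(signature, word):
    """True if every bit for `word` is set; False means definitely absent."""
    return all((signature >> pos) & 1 for pos in word_bits(word))


def search(index, documents, word):
    """Return (candidates, confirmed).

    candidates: documents whose signature matches (may include false positives).
    confirmed:  candidates that really contain the word.
    """
    candidates = [doc_id for doc_id, sig in index.items() if may_contain(sig, word)]
    confirmed = [doc_id for doc_id in candidates if word in documents[doc_id]]
    return candidates, confirmed


# Toy data, invented for this sketch.
documents = {
    "d1": {"database", "index", "query"},
    "d2": {"bloom", "filter", "signature"},
}
index = {doc_id: make_signature(words) for doc_id, words in documents.items()}
candidates, confirmed = search(index, documents, "bloom")
```

The gap between candidates and confirmed is the false positive error the abstract refers to. Shrinking SIGNATURE_BITS makes the index smaller but widens that gap, which is the cost the first part of the talk aims to reduce.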

DB Meeting: Wednesday November 23, 2:30pm, DC 1331
Speaker: Ming-Yee Iu
Title: MapReduce Query Optimization and LINQ for Java
Abstract: Recently, there has been a blurring of the boundary between programming languages and query languages. Programmers increasingly want to include arbitrary code in their database queries and to embed database queries into their programming languages. To support these sorts of use patterns, we need appropriate code analysis tools. I will show how two database code analysis problems can be solved using symbolic execution.

I will first look at Hadoop MapReduce queries. These queries are written using arbitrary Java code, which makes them difficult to analyze and optimize. By analyzing certain queries with symbolic execution, we can extract input restrictions from them, significantly improving their performance.

Secondly, I will look at database queries in Java 8. When Java 8 is released next year, it will finally include limited support for functional programming. I will show how this functional support can be combined with symbolic execution to allow programmers to finally write database queries in a functional style, as in Microsoft's LINQ.

DB Seminar: Wednesday November 30, 2:30pm, DC 1302
Speaker: Molham Aref, LogicBlox
Title: Datalog for Enterprise Software: from Industrial Applications to Research

DB Meeting: Wednesday December 7, 2:30pm, DC 1331
Speaker: Dan Farrar
Title: Pricing models in cloud database platforms
Abstract: One of the main advantages of cloud computing is that one only needs to pay for whatever computing resources are actually used in a given time period. However, this means that when distributing workloads across cloud resources, administrators must take into account not only performance and durability, but also the cost implications of workload placement.

I will be discussing a paper that describes different cloud pricing models, analyzes their sensitivity to workload type and distribution, and proposes a pricing model that reduces these sensitivities: "Resource and Virtualization Costs up in the Cloud: Models and Design Choices." Daniel Gmach, Jerry Rolia, Ludmila Cherkasova (HP Labs). In Proc. 2011 IEEE/IFIP 41st Conf. on Dependable Systems & Networks, Hong Kong, pp. 395-402.

