Database Research Group Events

Fall 2010

Events of interest to the Database Research Group are posted here, and are also mailed to the uw.cs.database newsgroup and the db-faculty, db-grads, and db-friends mailing lists. Subscribe to one of these mailing lists to receive e-mail notification of upcoming events.

The DB group meets Wednesday afternoons at 2:30pm. The list below gives the times and locations of upcoming meetings. Each meeting lasts for an hour and features either a local speaker or, on Seminar days, an invited outside speaker. Everyone is welcome to attend.

Recent and Upcoming Events

DB Seminar: Wednesday September 29, 3:00pm, DC 1302
Speaker: Kelly Lyons
Title: Mediating Human-to-Human Interactions through Social Media Technology

DB Meeting: Wednesday October 6, 2:30pm, DC 1331
Speaker: Jane Mason
Title: Classifying Web Pages by Genre
Abstract: This talk discusses the classification of Web pages by genre, using n-gram representations of Web pages. The goal of this research was to develop an approach to the problem of Web page genre classification that is effective not only on balanced, single-labeled data sets, but also on unbalanced and multi-labeled data sets, which better represent a real world environment. Some of the questions associated with developing this new classification model will be discussed, and the results of experiments on a variety of data sets will be presented.

DB Seminar: Thursday October 14, 2:30pm, DC 1304
Speaker: Alkis Polyzotis
Title: Semi-Automatic Index Tuning for Database Systems

DB Meeting: Wednesday October 20, 2:30pm, DC 1331
Speaker: Jeff Pound
Title: Expressive and Flexible Search over Knowledge Bases and Text Collections
Abstract: In this talk I will present ongoing work on extending search capabilities by enabling richer query constructs for search systems. These keyword-based query constructs aim to trade off the expressiveness of a structured query language against the ease of use of keyword queries. A key component of this approach is the use of large reference knowledge bases (KBs) that enable semantic query understanding. Our solution efficiently finds the top-k ranking query interpretations over the reference KB while achieving high accuracy, even in the presence of incomplete KB coverage.

I will also discuss our back-end query processing approach that facilitates QUICK, our end-to-end semantic search system. Our back-end system is based on a compact graph index over the KB, which utilizes a lazy materialization strategy for transitive closure computation. This lazy materialization approach allows in-memory query processing, an efficient alternative to pre-computing very large (and thus disk-based) transitive closures.

NDS Seminar: Monday October 25, 9:00am, DC 3323
Speaker: Vikas Garg, IBM Research, Bangalore, India
Title: Real Time Memory Efficient Data Redundancy Removal
Abstract: Data-intensive computing has become a central theme in the research community and industry. There is an ever-growing need to process and analyze massive amounts of data from diverse sources such as telecom call data records, telescope imagery, online transaction records, web pages, stock markets, medical records (monitoring critical health conditions of patients), climate warning systems, etc. Removing redundancy in the data is an important problem, as it improves resource and compute efficiency for downstream processing of massive (1 billion to 10 billion records) datasets. In application domains such as IR, stock markets, telecom and others, there is a strong need for real-time data redundancy removal (referred to as DRR) of enormous amounts of data flowing at the rate of 1GB/s or more. Real-time scalable data redundancy removal on massive datasets is a challenging problem. We present the design of a novel parallel data redundancy removal algorithm for both in-memory and disk-based execution. We also develop a queueing-theoretic analysis to optimize the throughput of our parallel algorithm on multi-core architectures. For 500 million records, our parallel algorithm can perform complete de-duplication in 255 seconds on a 16-core Intel Xeon 5570 architecture, with in-memory execution. This gives a throughput of about 2M records/s. For 6 billion records, our parallel algorithm can perform complete de-duplication in less than 4.5 hours, using 6 cores of Intel Xeon 5570, with disk-based execution. This gives a throughput of around 370K records/s. To the best of our knowledge, this is the highest real-time throughput for data redundancy removal on such massive datasets. We also demonstrate the scalability of our algorithm with an increasing number of cores and increasing data size.

DB Meeting: Wednesday October 27, 2:30pm, DC 1331
Speaker: L Venkata Subramaniam, IBM Research India
Title: Data Cleansing as a Service
Abstract: Data quality is a major challenge in utilizing the information assets held in the databases of organizations. Often the data quality is poor and must be improved quickly. Data cleansing methods involve the use of data cleansing rules that are iteratively written for a given database. In this talk I will provide methods to quickly (i) estimate the deviation in accuracy for a base ruleset; (ii) discover noisy data points and syntactic and semantic variations, and handle them; and (iii) organize rules so that updating and managing them becomes easy and does not degrade performance.

DB Meeting: Wednesday November 10, 2:30pm, DC 1331
Speaker: David Toman
Title: Efficient Approach to Query Answering over Description Logic Ontologies
Abstract: Databases and related information systems can benefit from the use of ontologies to enrich the data with general background knowledge. The DL-Lite family of ontology languages was specifically tailored towards such ontology-based data access, enabling an implementation in a relational database management system (RDBMS) based on a query rewriting approach. In this paper, we propose an alternative approach to implementing ontology-based data access in DL. The distinguishing feature of our approach is to allow rewriting of both the query and the data. We show that, in contrast to the existing approaches, no exponential blowup is produced by the rewritings. Based on experiments with a number of real-world ontologies, we demonstrate that query execution in the proposed novel approach is often more efficient than in existing approaches, especially for large ontologies. We also show how to seamlessly integrate the data rewriting step of our approach into an RDBMS using views (which solves the update problem) and make an interesting observation regarding the succinctness of queries in the other query rewriting approaches.

[based on KR 2010 paper with Roman Kontchakov, Carsten Lutz, Frank Wolter and Michael Zakharyaschev]

DB Meeting: Wednesday November 17, 2:30pm, DC 1331
Speaker: Pedram Ghodsnia
Title: Utilizing the computational power of GPUs in database processing
Abstract: In recent years, GPUs have emerged as powerful tools for high-performance general-purpose computing. Compared to commodity multi-core CPUs, GPUs have an order of magnitude higher computational power; their memory bandwidth is much higher; and they can deliver equivalent performance at up to 1/10th the cost and 1/20th the power consumption. Because of these outstanding capabilities, GPUs have recently received enthusiastic attention in scientific research. Many researchers are studying the application of the new capabilities of GPUs in various computationally intensive areas such as physics simulation, molecular dynamics, computational finance, image processing, medical engineering, pattern recognition, and database processing.

In this talk, I will give a brief introduction to general-purpose computing on graphics processing units (GPGPU) and its applications in database processing. I will discuss the architectural features of the new generation of GPUs and the possible application of these features in database processing. I will also review some of the ongoing challenges and recent related studies.

DB Meeting: Wednesday November 24, 2:30pm, DC 1331
Speaker: Kevyn Collins-Thompson, Microsoft Research
Title: Risk-aware Information Retrieval
Abstract: Current Web search engines aim to improve result quality by attempting such operations as inferring a user's task intent in a session, automatically reformulating queries to use common word variants, or personalizing results for particular users or groups. While these algorithms can add great benefit for users, they are also risky: when their predictions are incorrect, they can make results worse than if they had not been applied at all. Moreover, such algorithms are typically optimized with respect to the mean of a retrieval metric across queries, while the variance of the metric is often ignored, which can increase the probability of making serious errors for users.

I'll discuss how the novel use of ideas from portfolio theory and computational finance can help address these problems in reliability and effectiveness for a range of important problems in Web search, from automatic query reformulation to document ranking. I'll give examples of new, efficient robust algorithms that significantly reduce downside variance without sacrificing strong average effectiveness across queries. This new line of research in risk-aware information retrieval develops theoretical models, optimization algorithms, and evaluation methods for estimating risk and managing risk-reward tradeoffs not adequately handled by current approaches.

DB Seminar: Wednesday December 1, 2:30pm, DC 1302
Speaker: Goetz Graefe, HP Labs
Title: A New Join Algorithm

DB Meeting: Wednesday December 8, 2:30pm, DC 1331
Speaker: Jiewen Wu
Title: Ontology-based Query Expansion
Abstract: Keyword querying is challenged by short queries that vaguely describe users' information needs. Query expansion, which targets short and general keyword queries, has been the subject of decades of research. Different expansion strategies have shown various improvements in retrieval effectiveness. Recently, knowledge-based query expansion has become popular as the Web becomes more semantic. In this talk, we discuss key factors that affect "semantic" query expansion and unify some existing expansion strategies. The trade-off between the gain in retrieval effectiveness and the computational cost is also discussed.

This page is maintained by Ken Salem.

Data Systems Group
David R. Cheriton School of Computer Science
University of Waterloo
Waterloo, Ontario, Canada N2L 3G1
Tel: 519-888-4567
Fax: 519-885-1208

Last modified: Friday, 01-Jun-2012 11:01:03 EDT