2016 Data Systems Group Events

Public talks of interest to the Data Systems Group are posted here and are also mailed to the dsg-faculty, dsg-grads, and dsg-friends mailing lists. Subscribe to one of these mailing lists to receive e-mail notification of upcoming events. Everyone is welcome to attend.

2016 Events



DSG Seminar Series: Monday January 11, 2:00 pm, DC 1302
Speaker: Stephen Green, Oracle Labs
Title: Research in Information Retrieval and Machine Learning at Oracle Labs


PhD Seminar: Tuesday January 26, 11:00 am, DC 1331
Speaker: Michael Mior
Title: NoSE: Schema Design for NoSQL Applications
Abstract: Database design is critical for high performance in relational databases and many tools exist to aid application designers in selecting an appropriate schema. While the problem of schema optimization is also highly relevant for NoSQL databases, existing tools for relational databases are inadequate for this setting. Application designers wishing to use a NoSQL database instead rely on rules of thumb to select an appropriate schema. We present a system for recommending database schemas for NoSQL applications. Our cost-based approach uses a novel binary integer programming formulation to guide the mapping from the application's conceptual data model to a database schema.

We implemented a prototype of this approach for the Cassandra extensible record store. Our prototype, the NoSQL Schema Evaluator (NoSE), is able to capture rules of thumb used by expert designers without explicitly encoding the rules. Automating the design process allows NoSE to produce efficient schemas and to examine more alternatives than would be possible with a manual rule-based approach.



Seminar: Thursday February 11, 12:00 noon, DC 1304
Speaker: Peter Unterbrunner, Snowflake Computing
Title: The Snowflake Elastic Data Warehouse


PhD Seminar: Wednesday March 2, 12:30 pm, DC 1331
Speaker: Aiman Al-Harbi
Title: Are Secondary Assessors Uncertain When They Disagree About Relevance Judgements?


DB Seminar Series: Monday April 11, 10:30 am, DC 1302
Speaker: Frank McSherry
Title: Next-generation Data-parallel Dataflow Systems


DSG Seminar Series: Monday April 18th, 2:30 pm, DC 1302
Speaker: Ricardo Baeza-Yates, UPF, Spain & UChile
Title: Data and Algorithmic Bias in the Web


PhD Seminar: Wednesday April 27th, 12:30 pm, DC 1331
Speaker: Gaurav Baruah
Title: Matching Nuggets with Sentences
Abstract: Nugget-based evaluation requires assessors to judge whether or not a given nugget is found in a given piece of text. In TREC tracks such as Temporal Summarization and Question Answering, assessors may need to keep track of over 100 nuggets per search topic. Matching these sets of nuggets to run submissions is time-consuming and tedious. In this talk, we present our work on estimating the potential for assistive user interfaces to reduce assessors’ nugget matching effort. We iteratively build upon different matching strategies and continuous active learning to help assessors match nuggets with sentences. The proposed matching strategies may simplify assessment for secondary assessors by potentially alleviating the memory information overload caused by a large number of nuggets. Across four nugget-based test collections, we found that our proposed matching strategies have the potential to reduce assessor effort while not hurting the quality of the collected judgements.


PhD Seminar: Wednesday May 4th, 12:30 pm, DC 1331
Speaker: Mustafa Korkmaz
Title: Energy Efficient Database Management Systems
Abstract: Data centers consume significant amounts of energy and the consumption is growing each year. Alongside efforts in the hardware domain, there are some mechanisms in the software domain to reduce energy consumption. One of these mechanisms is dynamic voltage and frequency scaling (DVFS) on CPUs. We show that a DBMS can exploit its knowledge of the workload and performance constraints to obtain power savings that are more than twice as large as the power savings achieved when DVFS is managed by the operating system. We will also discuss how we might extend the work to a more generalized power manager which adapts to different CPUs and workloads.


ICDE Practice Talk: Wednesday May 4th, 3:00 pm, DC 1331
Speaker: Michael Mior
Title: NoSE: Schema Design for NoSQL Applications
Abstract:

Database design is critical for high performance in relational databases and many tools exist to aid application designers in selecting an appropriate schema. While the problem of schema optimization is also highly relevant for NoSQL databases, existing tools for relational databases are inadequate for this setting. Application designers wishing to use a NoSQL database instead rely on rules of thumb to select an appropriate schema. We present a system for recommending database schemas for NoSQL applications. Our cost-based approach uses a novel binary integer programming formulation to guide the mapping from the application’s conceptual data model to a database schema.

We implemented a prototype of this approach for the Cassandra extensible record store. Our prototype, the NoSQL Schema Evaluator (NoSE), is able to capture rules of thumb used by expert designers without explicitly encoding the rules. Automating the design process allows NoSE to produce efficient schemas and to examine more alternatives than would be possible with a manual rule-based approach.



DSG Seminar Series: Monday May 9, 10:30 am, DC 1302
Speaker: Shivakumar Vaithyanathan, IBM Almaden Research Center
Title: Watson Content Services: Creation, Maintenance and Consumption of Knowledge Bases


DSG Seminar Series: Monday May 16, 10:30 am, DC 1302
Speaker: Kevyn Collins-Thompson, University of Michigan
Title: Connecting Searching with Learning


PhD Seminar: Wednesday June 8, 12:30 pm, DC 1331
Speaker: Xu Chu
Title: Qualitative Data Cleaning
Abstract:

Data quality is one of the most important problems in data management, since dirty data often leads to inaccurate data analytics results and wrong business decisions.

Data cleaning exercises often consist of two phases: error detection and error repairing. Error detection techniques can be either quantitative or qualitative, and error repairing is performed by applying data transformation scripts, by involving human experts, or sometimes both.

In this tutorial, we discuss the main facets and directions in designing qualitative data cleaning techniques. We present a taxonomy of current qualitative error detection techniques, as well as a taxonomy of current data repairing techniques. We will also discuss proposals for tackling the challenges for cleaning “big data” in terms of scale and distribution.



DSG Seminar Series: Tuesday July 12, 2:00 pm, DC 1304
Speaker: Jay Aslam, Northeastern University
Title: ML for IR: Sentiment Analysis and Multi-label Categorization


DSG Seminar: Wednesday August 24th, 12:30pm, DC 1311 Tuesday August 23rd, 12:30pm, DC 1304
Speaker: Karthik Ramasamy, Twitter
Title: Twitter Heron in Practice
Abstract:

Twitter generates billions and billions of events per day. Analyzing these events in real time presents a massive challenge. Twitter designed and deployed a new streaming system called Heron. Heron has been in production nearly 2 years and is widely used by several teams for diverse use cases. In this talk, I will describe Heron in detail and share our operating experiences and challenges of running Heron at scale.

Bio:

Karthik is the engineering manager and technical lead for Real Time Analytics at Twitter. He is the co-creator of Heron and has more than two decades of experience working in parallel databases, big data infrastructure and networking. He cofounded Locomatix, a company specializing in real-time stream processing on Hadoop and Cassandra using SQL, which was acquired by Twitter. Before Locomatix, he had a brief stint with Greenplum, where he worked on parallel query scheduling. Greenplum was eventually acquired by EMC for more than $300M. Prior to Greenplum, Karthik was at Juniper Networks, where he designed and delivered platforms, protocols, databases and high availability solutions for network routers that are widely deployed in the Internet. Before joining Juniper, he worked extensively at the University of Wisconsin on parallel database systems, query processing, scale-out technologies, storage engines and online analytical systems. Several of these research projects were spun out as a company later acquired by Teradata. He is the author of several publications and patents, as well as one of the best-selling books, “Network Routing: Algorithms, Protocols and Architectures.” He has a Ph.D. in Computer Science from UW Madison with a focus on databases.



PhD Seminar: Wednesday September 14, 12:30 pm, DC 1331
Speaker: Alexey Karyakin
Title: Main Memory Energy Efficiency in Database Systems
Abstract: Most research on computer systems energy efficiency has been focused on the CPU. CPU power management features such as power states and frequency scaling have been used by operating systems and applications to optimize energy use. However, with the increasing amount of main memory installed in servers, its contribution to the total server energy footprint may become significant. In this talk, I will analyze the factors that determine memory power consumption and how to optimize it for database workloads. In particular, I will present a strategy of physical memory allocation that maximizes its residency in low power states. Using this strategy for database buffer pool allocation reduces the power consumption of memory devices that are not used for database pages. We will also discuss methods of estimating and measuring memory power consumption in simulated and real world settings.


DSG Seminar Series: Monday September 26th, 10:15 am, DC 1302
Speaker: Per-Åke (Paul) Larson
Title: Database Systems Meet Non-Volatile Memory (NVRAM)


DSG Seminar Series: Monday October 3rd, 10:15 am, DC 1302
Speaker: Olga Papaemmanouil, Brandeis University
Title: Performance Management for Cloud Databases via Machine Learning


PhD Seminar: Wednesday October 5, 12:30 pm, DC 1331
Speaker: Michael Mior
Title: Schema Renormalization for NoSQL Databases
Abstract: Applications using NoSQL databases commonly denormalize and duplicate data to improve performance. While this can improve application performance, it also introduces several complications. Applications need to be careful when replicating inserted data. New queries in the application's workload can result in complex changes to the underlying schema. We show how a set of relations can be extracted from a NoSQL database and how we can make use of existing and novel techniques to produce a normalized logical schema for applications. We also describe several use cases for such a logical schema to reduce the burden of schema management on developers.


PhD Seminar: Wednesday October 19, 12:30 pm, DC 1331 (POSTPONED)
Speaker: Zeynep Korkmaz
Title: Graph Aware Cache Replacement Policy
Abstract: Densely linked data dominates the new generation of web applications. For example, large social-networking providers store and serve billions of photos and texts on behalf of users. They are growing so fast that they require powerful distributed storage and processing systems to handle social connections and diverse user behaviors. Deploying a caching layer is a well known solution to improve the performance of such systems. Existing caching approaches vary in architectural design; however, their cache replacement algorithms rely on either recency or frequency information, which does not represent well the connection density and unpredictable access patterns of social graphs. In this study, we propose a graph aware, object-level cache replacement policy to exploit social connections and structural properties of graph data in order to perform low latency graph computations.


SCS Distinguished Lecture Series: Thursday October 20th, 3:30 pm, DC 1302
Speaker: Ophir Frieder, Georgetown University
Title: Searching in Harsh Environments


DSG Seminar Series: Tuesday October 25th, 1:00 pm, DC 2585
Speaker: Wolfgang Gatterbauer, CMU
Title: Approximate lifted inference with probabilistic databases


PhD Seminar: Wednesday October 26, 12:30 pm, DC 1331
Speaker: Kareem El Gebaly
Title: In-Browser SQL Analytics with Afterburner
Abstract: This talk explores the idea of implementing an analytical RDBMS in pure JavaScript so that it runs completely inside a browser with no external dependencies. Our prototype, called Afterburner, generates compiled query plans that exploit typed arrays and asm.js, two relatively recent advances in JavaScript. On the TPC-H benchmark, we show that Afterburner achieves comparable performance to MonetDB running natively on the same machine. This is an interesting finding in that it shows how far JavaScript has come as an efficient execution platform. We also discuss how our techniques could support ubiquitous in-browser interactive analytics (potentially integrating with browser-based notebooks) and also present interesting opportunities for split execution strategies where query operators are distributed between the browser and backend servers.


PhD Seminar: Wednesday November 30, 12:30 pm, DC 1331
Speaker: Xu Chu
Title: Scalable Data Cleaning
Abstract:

Data quality is one of the most important problems in data management, since dirty data often leads to inaccurate data analytics results and wrong business decisions. There are many different data cleaning activities performed to improve data quality, such as filling in missing values, removing duplicate records, and fixing integrity constraint violations. There are usually three steps in data cleaning: data quality rules specification, error detection, and error repairing.

In this talk, I will present three projects as examples that tackle the challenges in each of the three steps in data cleaning. The first project, called denial constraints discovery, proposes to use denial constraints (DCs) as the formal language to express a variety of data quality rules. The second project, called holistic data cleaning, advocates the idea of accumulating evidence in detecting and repairing errors, which leads to better cleaning accuracy. The third project, called distributed data deduplication, tackles the scalability challenges in data cleaning.



DSG Seminar Series: Friday December 2nd, 10:30 am, DC 1304
Speaker: Amol Deshpande, University of Maryland
Title: Scalable Platforms for Graph Analytics and Collaborative Data Science


DSG Seminar Series: Monday December 19th, 10:30 am, DC 1304
Speaker: Fabian M Suchanek, Telecom ParisTech University
Title: A Hitchhiker's Guide to Ontology


This page is maintained by Ken Salem.


Data Systems Group
David R. Cheriton School of Computer Science
University of Waterloo
Waterloo, Ontario, Canada N2L 3G1
Tel: 519-888-4567
Fax: 519-885-1208

Contact | Feedback: db-webmaster@cs.uwaterloo.ca | Data Systems Group


Last modified: Wednesday, 04-Jan-2017 00:29:44 EST

