Public talks of interest to the Data Systems Group are posted here, and are also
mailed to the
dsg-faculty,
dsg-grads,
and dsg-friends
mailing lists.
Subscribe to one of these mailing lists to receive e-mail notification
of upcoming events.
Everyone is welcome to attend.
2016 Events
PhD Seminar:
|
Tuesday January 26, 11:00 am, DC 1331
|
Speaker:
|
Michael Mior
|
|
Title:
|
NoSE: Schema Design for NoSQL Applications
|
Abstract:
|
Database design is critical for high performance in relational databases and many tools exist to aid application designers in selecting an appropriate schema. While the problem of schema optimization is also highly relevant for NoSQL databases, existing tools for relational databases are inadequate for this setting. Application designers wishing to use a NoSQL database instead rely on rules of thumb to select an appropriate schema. We present a system for recommending database schemas for NoSQL applications. Our cost-based approach uses a novel binary integer programming formulation to guide the mapping from the application's conceptual data model to a database schema.
We implemented a prototype of this approach for the Cassandra extensible record store. Our prototype, the NoSQL Schema Evaluator (NoSE), is able to capture rules of thumb used by expert designers without explicitly encoding the rules. Automating the design process allows NoSE to produce efficient schemas and to examine more alternatives than would be possible with a manual rule-based approach.
|
Seminar:
|
Thursday February 11, 12:00 noon, DC 1304
|
Speaker:
|
Peter Unterbrunner, Snowflake Computing
|
|
Title:
|
The Snowflake Elastic Data Warehouse
|
PhD Seminar:
|
Wednesday March 2, 12:30 pm, DC 1331
|
Speaker:
|
Aiman Al-Harbi
|
|
Title:
|
Are Secondary Assessors Uncertain When They Disagree About Relevance Judgements?
|
PhD Seminar:
|
Wednesday April 27th, 12:30 pm, DC 1331
|
Speaker:
|
Gaurav Baruah
|
|
Title:
|
Matching Nuggets with Sentences
|
Abstract:
|
Nugget-based evaluation requires assessors to judge whether or not a given nugget is found in a given piece of text. In
TREC tracks such as Temporal Summarization and Question Answering, assessors may need to keep track of over
100 nuggets per search topic. Matching these sets of nuggets to run submissions is time-consuming and tedious. In this
talk, we present our work on estimating the potential for assistive user interfaces to reduce assessors’ nugget matching effort.
We iteratively build upon different matching strategies using continuous active learning to help assessors match nuggets with
sentences. The proposed matching strategies may simplify assessment for secondary assessors by potentially alleviating
the memory information overload caused by a large number of nuggets. Across four nugget-based test collections, we
found that our proposed matching strategies have the potential to reduce assessor effort while not hurting the quality
of the collected judgements.
|
PhD Seminar:
|
Wednesday May 4th, 12:30 pm, DC 1331
|
Speaker:
|
Mustafa Korkmaz
|
|
Title:
|
Energy Efficient Database Management Systems
|
Abstract:
|
Data centers consume significant amounts of energy and the consumption is growing each year. Alongside efforts in the hardware domain, there are some mechanisms in the software domain to reduce energy consumption. One of these mechanisms is dynamic voltage and frequency scaling (DVFS) on CPUs. We show that a DBMS can exploit its knowledge of the workload and performance constraints to obtain power savings that are more than twice as large as the power savings achieved when DVFS is managed by the operating system. We will also discuss how we might extend the work to a more generalized power manager which adapts to different CPUs and workloads.
|
ICDE Practice Talk:
|
Wednesday May 4th, 3:00 pm, DC 1331
|
Speaker:
|
Michael Mior
|
|
Title:
|
NoSE: Schema Design for NoSQL Applications
|
Abstract:
|
Database design is critical for high performance in relational databases and many tools exist to aid application designers in selecting an appropriate schema. While the problem of schema optimization is also highly relevant for NoSQL databases, existing tools for relational databases are inadequate for this setting. Application designers wishing to use a NoSQL database instead rely on rules of thumb to select an appropriate schema. We present a system for recommending database schemas for NoSQL applications. Our cost-based approach uses a novel binary integer programming formulation to guide the mapping from the application’s conceptual data model to a database schema.
We implemented a prototype of this approach for the Cassandra extensible record store. Our prototype, the NoSQL Schema Evaluator (NoSE), is able to capture rules of thumb used by expert designers without explicitly encoding the rules. Automating the design process allows NoSE to produce efficient schemas and to examine more alternatives than would be possible with a manual rule-based approach.
|
PhD Seminar:
|
Wednesday June 8, 12:30 pm, DC 1331
|
Speaker:
|
Xu Chu
|
|
Title:
|
Qualitative Data Cleaning
|
Abstract:
|
Data quality is one of the most important problems in data management, since dirty data often leads to inaccurate data analytics results and wrong business decisions.
Data cleaning exercises often consist of two phases: error detection and error repairing. Error detection techniques can be either quantitative or qualitative, and error repairing is performed by applying data transformation scripts, by involving human experts, or sometimes both.
In this tutorial, we discuss the main facets and directions in designing qualitative data cleaning techniques. We present a taxonomy of current qualitative error detection techniques, as well as a taxonomy of current data repairing techniques. We will also discuss proposals for tackling the challenges for cleaning “big data” in terms of scale and distribution.
|
DSG Seminar:
|
Wednesday August 24th, 12:30pm, DC 1311 Tuesday August 23rd, 12:30pm, DC 1304
|
Speaker:
|
Karthik Ramasamy, Twitter
|
|
Title:
|
Twitter Heron in Practice
|
Abstract:
|
Twitter generates billions and billions of events per day. Analyzing these events in
real time presents a massive challenge. Twitter designed and deployed a new
streaming system called
Heron. Heron has been in production for nearly two years and is widely used by
several teams for diverse use cases. In this talk, I will describe
Heron in detail and share our operating experiences and challenges of running
Heron at scale.
|
Bio:
|
Karthik is the engineering manager and technical lead for Real
Time Analytics at Twitter. He is the co-creator of
Heron and has more than two decades of experience working in
parallel databases, big data infrastructure and networking.
He cofounded Locomatix, a company that specializes in real time
streaming processing on Hadoop and Cassandra using SQL that was
acquired by Twitter. Before Locomatix, he had a brief stint with
Greenplum where he worked on parallel query scheduling. Greenplum
was eventually acquired by EMC for more than $300M. Prior to
Greenplum, Karthik was at Juniper Networks where
he designed and delivered platforms, protocols, databases and
high availability solutions for network routers that are widely
deployed in the Internet. Before joining Juniper, at the University
of Wisconsin, he worked extensively on parallel database systems, query
processing, scale-out technologies, storage engines and online
analytical systems. Several of these research projects were spun out as a
company later acquired by Teradata. He is the author of several
publications and patents, and of the best-selling book “Network Routing:
Algorithms, Protocols and Architectures.” He has a Ph.D. in Computer
Science from UW Madison with a focus on databases.
|
PhD Seminar:
|
Wednesday September 14, 12:30 pm, DC 1331
|
Speaker:
|
Alexey Karyakin
|
|
Title:
|
Main Memory Energy Efficiency in Database Systems
|
Abstract:
|
Most research on computer systems energy
efficiency has been focused on the CPU. CPU power management features
such as power states and frequency scaling have been used by operating
systems and applications to optimize energy use. However, with
increasing amounts of main memory installed in servers, its
contribution to the total server energy footprint may become
significant. In this talk, I will analyze the factors that determine
memory consumption and how to optimize it for database workloads. In
particular, I will present a strategy of physical memory allocation
that maximizes its residency in low power states. Using this strategy
for a database buffer pool allocation reduces power consumption of
memory devices which are not used for database pages. We will also
discuss methods of estimating and measuring memory power consumption
in simulated and real world settings.
|
PhD Seminar:
|
Wednesday October 5, 12:30 pm, DC 1331
|
Speaker:
|
Michael Mior
|
|
Title:
|
Schema Renormalization for NoSQL Databases
|
Abstract:
|
Applications using NoSQL databases commonly denormalize and
duplicate data to improve performance. While this can improve application
performance, it also introduces several complications. Applications need to
be careful when replicating inserted data. New queries in the application's
workload can result in complex changes to the underlying schema. We show
how a set of relations can be extracted from a NoSQL database and how we
can make use of existing and novel techniques to produce a normalized
logical schema for applications. We also describe several use cases for
such a logical schema to reduce the burden of schema management on
developers.
|
PhD Seminar:
|
Wednesday October 19, 12:30 pm, DC 1331 POSTPONED
|
Speaker:
|
Zeynep Korkmaz
|
|
Title:
|
Graph Aware Cache Replacement Policy
|
Abstract:
|
Dense linked data dominates the new generation of web applications. For example, large social-networking providers store and serve billions of photos and texts on behalf of users. They are growing so fast that they require powerful distributed storage and processing systems to handle social connections and diverse user behaviors. Deploying a caching layer is a well-known solution to improve the performance of such systems. Existing caching approaches vary in architecture design; however, their cache replacement algorithms rely on either recency or frequency information, which does not represent well the connection density and unpredictability of access patterns in social graphs. In this study, we propose a graph-aware, object-level cache replacement policy that exploits social connections and structural properties of graph data in order to perform low-latency graph computations.
|
SCS Distinguished Lecture Series:
|
Thursday October 20th, 3:30 pm, DC 1302
|
Speaker:
|
Ophir Frieder, Georgetown University
|
|
Title:
|
Searching in Harsh Environments
|
PhD Seminar:
|
Wednesday October 26, 12:30 pm, DC 1331
|
Speaker:
|
Kareem El Gebaly
|
|
Title:
|
In-Browser SQL Analytics with Afterburner
|
Abstract:
|
This talk explores the idea of implementing an analytical RDBMS in pure
JavaScript so that it runs completely inside a browser with no external
dependencies. Our prototype, called Afterburner, generates compiled
query plans that exploit typed arrays and asm.js, two relatively recent
advances in JavaScript. On the TPC-H benchmark, we show that
Afterburner achieves comparable performance to MonetDB running
natively on the same machine. This is an interesting finding in that it
shows how far JavaScript has come as an efficient execution platform.
We also discuss how our techniques could support ubiquitous in-browser
interactive analytics (potentially integrating with browser-based
notebooks) and also present interesting opportunities for split execution
strategies where query operators are distributed between the browser
and backend servers.
|
PhD Seminar:
|
Wednesday November 30, 12:30 pm, DC 1331
|
Speaker:
|
Xu Chu
|
|
Title:
|
Scalable Data Cleaning
|
Abstract:
|
Data quality is one of the most important problems in data management, since dirty data often leads to inaccurate data analytics results and wrong business decisions. There are many different data cleaning activities performed to improve data quality, such as filling in missing values, removing duplicate records, and fixing integrity constraint violations. There are usually three steps in data cleaning: data quality rules specification, error detection, and error repairing.
In this talk, I will present three projects as examples that tackle the challenges in each of the three steps in data cleaning. The first project, called denial constraints discovery, proposes to use denial constraints (DCs) as the formal language to express a variety of data quality rules. The second project, called holistic data cleaning, advocates the idea of accumulating evidence in detecting and repairing errors, which leads to better cleaning accuracy. The third project, called distributed data deduplication, tackles the scalability challenges in data cleaning.
|
This page is maintained
by
Ken Salem.