2017 Data Systems Group Events

Public talks of interest to the Data Systems Group are posted here, and are also mailed to the dsg-faculty, dsg-grads, and dsg-friends mailing lists. Subscribe to one of these mailing lists to receive e-mail notification of upcoming events. Everyone is welcome to attend.

2017 Events

DSG Seminar Series: Monday January 9, 10:30 am, DC 1302
Speaker: Felix Naumann, Hasso Plattner Institute
Title: Data Profiling

MMath Seminar: Monday January 30, 2:00 pm, DC 2310
Speaker: Shaikh Quader
Title: Measuring Document Type Biases in Enterprise Search

PhD Seminar: Wednesday February 1, 12:30 pm, DC 1331
Speaker: Xu Chu
Title: Big Data Cleaning
Abstract: Data quality is one of the most important problems in data management and data science, since dirty data often leads to inaccurate data analytics results and wrong business decisions. It is estimated that data scientists spend 60-80% of their time cleaning and organizing data rather than performing modelling or data mining. A typical data cleaning process consists of three steps: data quality rules specification, error detection, and error repairing. In this talk, I will discuss my proposals in dealing with challenges in each of these steps. First, I will introduce a system to automatically discover data quality rules from a possibly dirty sample data instance. Automatically discovering data quality rules is particularly useful since asking users to design them is an expensive process, which requires domain expertise, and is rarely done in practice. Second, I will show a holistic error detection and error repairing process, which accumulates evidence from a broad spectrum of data quality rules, and suggests more accurate data repairs in a holistic manner. Third, I will present a distribution strategy to scale up the common combinatorial operations used in data cleaning, such as comparing every tuple pair to detect duplicates. I will conclude the talk by discussing some ongoing work in cleaning relational data as well as other data forms (e.g., IoT data and unstructured data) and my long-term vision of debugging data analytics.
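As an illustration of the error detection step mentioned in the abstract, the following is a minimal Python sketch of checking one data quality rule by comparing every tuple pair. It is not the speaker's system: the rule (a functional dependency from `zip` to `city`), the function name `fd_violations`, and the sample rows are all assumptions made up for this example.

```python
from itertools import combinations


def fd_violations(rows, lhs, rhs):
    """Return index pairs (i, j) whose rows agree on `lhs` but differ on `rhs`.

    This checks a hypothetical functional dependency lhs -> rhs by brute-force
    pairwise comparison, which is quadratic in the number of rows.
    """
    violations = []
    for (i, a), (j, b) in combinations(enumerate(rows), 2):
        if a[lhs] == b[lhs] and a[rhs] != b[rhs]:
            violations.append((i, j))
    return violations


# Made-up sample data: rows 0 and 1 share a zip code but disagree on the city.
rows = [
    {"zip": "N2L 3G1", "city": "Waterloo"},
    {"zip": "N2L 3G1", "city": "Waterlo"},
    {"zip": "M5S 1A1", "city": "Toronto"},
]

print(fd_violations(rows, "zip", "city"))
```

The quadratic pairwise loop is the kind of combinatorial operation that the abstract says a distribution strategy would be needed to scale.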

PhD Seminar: Wednesday March 81, 12:30 pm, DC 1331
Speaker: Ahmed El-Roby
Title: Sapphire: Querying RDF Data Made Simple
Abstract: There is currently a large amount of publicly accessible structured data available as RDF data sets. For example, the Linked Open Data (LOD) cloud now consists of thousands of RDF data sets with over 30 billion triples, and the number and size of the data sets are continuously growing. Many of the data sets in the LOD cloud provide public SPARQL endpoints to allow issuing queries over them. These endpoints enable users to retrieve data using precise and highly expressive SPARQL queries. However, in order to do so, the user must have sufficient knowledge about the data sets that she wishes to query, that is, the structure of data, the vocabulary used within the data set, the exact values of literals, their data types, etc. Thus, while SPARQL is powerful, it is not easy to use. An alternative to SPARQL that does not require as much prior knowledge of the data is some form of keyword search over the structured data. Keyword search queries are easy to use, but inherently ambiguous in describing structured queries.

In this talk, I introduce Sapphire, a framework for querying RDF data that strikes a middle ground between ambiguous keyword search and difficult-to-use SPARQL. Sapphire does not replace either, but utilizes both where they are most effective. Sapphire helps the user construct expressive SPARQL queries that represent her information needs without requiring detailed knowledge about the queried data sets. These queries are then executed over public SPARQL endpoints from the LOD cloud. Sapphire guides the user in the query writing process by showing suggestions of query terms based on the queried data, and by recommending changes to the query based on a predictive user model.

PhD Seminar: Wednesday March 15, 12:30 pm, DC 1331
Speaker: Kareem El Gebaly
Title: Accelerating Interactive SQL Using Frontend Engines
Abstract: This talk explores the idea of in-browser interactive analytics with split execution strategies where query operators are distributed between the frontend and backend servers. Our frontend, Afterburner, is an in-browser analytical RDBMS in pure JavaScript that runs completely inside a browser with no external dependencies. Given a pointer to a SQL backend and some hints from the user about the next queries, Afterburner splits the queries into two parts: a one-time SQL query that runs at the backend and local SQL queries as per the user's interactions. To meet interactive analytics performance requirements, Afterburner uses compiled query plans that exploit typed arrays and asm.js, two relatively recent advances in JavaScript, to run queries inside the browser at performance comparable to state-of-the-art systems running natively. We show some interesting findings of how such a setup not only offloads the backend, but also can accelerate data exploration.
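The split between a one-time backend query and local follow-up queries can be sketched roughly as follows. This is not Afterburner, which is written in JavaScript and runs in the browser. It is a Python illustration that uses the standard `sqlite3` module as a stand-in for both the backend and the local engine, and the table, the helper names `materialize_locally` and `total_for`, and the data are hypothetical.

```python
import sqlite3


def materialize_locally(backend, one_time_sql):
    """Run the one-time query on the backend and copy its result
    into a table named local_view in a fresh in-memory database."""
    cur = backend.execute(one_time_sql)
    cols = [d[0] for d in cur.description]
    local = sqlite3.connect(":memory:")
    local.execute(f"CREATE TABLE local_view ({', '.join(cols)})")
    placeholders = ", ".join("?" for _ in cols)
    local.executemany(
        f"INSERT INTO local_view VALUES ({placeholders})", cur.fetchall()
    )
    return local


# Stand-in backend with made-up data.
backend = sqlite3.connect(":memory:")
backend.execute("CREATE TABLE sales (region TEXT, year INTEGER, amount REAL)")
backend.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("east", 2016, 10.0), ("east", 2017, 20.0),
     ("west", 2017, 5.0), ("west", 2017, 7.0)],
)

# One-time query at the backend: fetch only the rows the user will explore.
local = materialize_locally(
    backend, "SELECT region, amount FROM sales WHERE year = 2017"
)


def total_for(region):
    """Interactive follow-up query answered locally, without the backend."""
    row = local.execute(
        "SELECT SUM(amount) FROM local_view WHERE region = ?", (region,)
    ).fetchone()
    return row[0]


print(total_for("east"), total_for("west"))
```

After the single backend round trip, each interaction touches only the local copy, which is the sense in which such a setup offloads the backend.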

PhD Seminar: Wednesday April 19, 12:30 pm, DC 1331
Speaker: Mohamed Sabri
Title: A Hybrid Framework for Online Execution of Linked Data Queries
Abstract: Linked data has been widely adopted over the last few years, with the size of the Linked Data Cloud almost doubling every year. However, there is still no well-defined mechanism to query such a Web of Data. In this talk, we propose a framework that incorporates a set of optimizations to tackle various limitations in the state of the art. The framework aims at combining the centralized query optimization capabilities of the data warehouse-based approaches with the result freshness and explorative data source discovery capabilities of link-traversal approaches. This is achieved by augmenting baseline link-traversal query execution with a set of optimization techniques. The proposed optimizations fall under two categories: metadata-based optimizations and semantics-based optimizations.

DSG Seminar Series: Monday May 1, 10:30 am, DC 1302
Speaker: Peter Bailis, Stanford University
Title: MacroBase: Prioritizing Attention in Fast Data

DSG Seminar Series: Tuesday May 2, 10:30 am, DC 1302
Speaker: Patrick Valduriez, INRIA and Biology Computational Institute (IBC)
Title: The CloudMdsQL Multistore System

This page is maintained by Ken Salem.

Data Systems Group
David R. Cheriton School of Computer Science
University of Waterloo
Waterloo, Ontario, Canada N2L 3G1
Tel: 519-888-4567
Fax: 519-885-1208

Feedback: db-webmaster@cs.uwaterloo.ca

Last modified: Thursday, 06-Apr-2017 15:44:44 EDT