Public talks of interest to the Data Systems Group are posted here, and are also
mailed to the
dsg-faculty,
dsg-grads,
dsg-friends
mailing lists.
Subscribe to one of these mailing lists to receive e-mail notification
of upcoming events.
Everyone is welcome to attend.
Information about older events can be found in the
event archive.
2017 Events
MMath Seminar:
|
Monday January 30, 2:00 pm, DC 2310
|
Speaker:
|
Shaikh Quader
|
|
Title:
|
Measuring Document Type Biases in Enterprise Search
|
PhD Seminar:
|
Wednesday February 1, 12:30 pm, DC 1331
|
Speaker:
|
Xu Chu
|
|
Title:
|
Big Data Cleaning
|
Abstract:
|
Data quality is one of the most important problems in data management and
data science, since dirty data often leads to inaccurate data analytics results
and wrong business decisions. It is estimated that data scientists spend 60-80%
of their time cleaning and organizing data rather than performing modelling or
data mining. A typical data cleaning process consists of three steps: data
quality rules specification, error detection, and error repairing. In this talk,
I will discuss my proposals in dealing with challenges in each of these steps.
First, I will introduce a system to automatically discover data quality rules
from a possibly dirty sample data instance. Automatically discovering data
quality rules is particularly useful since asking users to design them is an
expensive process, which requires domain expertise, and is rarely done in
practice. Second, I will show a holistic error detection and error repairing
process, which accumulates evidence from a broad spectrum of data quality rules,
and suggests more accurate data repairs in a holistic manner. Third, I will
present a distribution strategy to scale up the common combinatorial operations
used in data cleaning such as comparing every tuple pair to detect duplicates.
I will conclude the talk by discussing some ongoing work in cleaning relational
data as well as other data forms (e.g., IoT data and unstructured data) and my
long-term vision of debugging data analytics.
|
PhD Seminar:
|
Wednesday March 81, 12:30 pm, DC 1331
|
Speaker:
|
Ahmed El-Roby
|
|
Title:
|
Sapphire: Querying RDF Data Made Simple
|
Abstract:
|
There is currently a large amount of publicly accessible structured data available as RDF data sets. For example, the Linked Open Data (LOD) cloud now consists of thousands of RDF data sets with over 30 billion triples, and the number and size of the data sets are continuously growing. Many of the data sets in the LOD cloud provide public SPARQL endpoints to allow issuing queries over them. These endpoints enable users to retrieve data using precise and highly expressive SPARQL queries. However, in order to do so, the user must have sufficient knowledge about the data sets that she wishes to query, that is, the structure of the data, the vocabulary used within the data set, the exact values of literals, their data types, etc. Thus, while SPARQL is powerful, it is not easy to use. An alternative to SPARQL that does not require as much prior knowledge of the data is some form of keyword search over the structured data. Keyword search queries are easy to use, but inherently ambiguous in describing structured queries.
In this talk, I introduce Sapphire, a framework for querying RDF data that strikes a middle ground between ambiguous keyword search and difficult-to-use SPARQL. Sapphire does not replace either, but utilizes both where they are most effective. Sapphire helps the user construct expressive SPARQL queries that represent her information needs without requiring detailed knowledge about the queried data sets. These queries are then executed over public SPARQL endpoints from the LOD cloud. Sapphire guides the user in the query writing process by showing suggestions of query terms based on the queried data, and by recommending changes to the query based on a predictive user model.
|
PhD Seminar:
|
Wednesday March 15, 12:30 pm, DC 1331
|
Speaker:
|
Kareem El Gebaly
|
|
Title:
|
Accelerating Interactive SQL Using Frontend Engines
|
Abstract:
|
This talk explores the idea of in-browser interactive analytics with split
execution strategies where query operators are distributed between the
frontend and backend servers. Our frontend, Afterburner, is an in-browser
analytical RDBMS in pure JavaScript that runs completely inside a browser
with no external dependencies. Given a pointer to a SQL backend and some
hints from the user about the next queries, Afterburner splits the queries
into two parts: a one-time SQL query that runs at the backend and local SQL
queries as per the user's interactions. To meet interactive analytics
performance requirements, Afterburner uses compiled query plans that exploit
typed arrays and asm.js, two relatively recent advances in JavaScript, to
run queries inside the browser at performance comparable to the state of
the art running natively. We show some interesting findings on how such a setup
not only offloads work from the backend but can also accelerate data exploration.
|
PhD Seminar:
|
Wednesday April 19, 12:30 pm, DC 1331
|
Speaker:
|
Mohamed Sabri
|
|
Title:
|
A Hybrid Framework for Online Execution of Linked Data Queries
|
Abstract:
|
Linked data has been widely adopted over the last few years, with the size of the Linked
Data Cloud almost doubling every year. However, there is still no well-defined
mechanism to query such a Web of Data. In this talk, we propose a framework that
incorporates a set of optimizations to tackle various limitations in the
state of the art. The framework
aims at combining the centralized query optimization capabilities of the data
warehouse-based approaches with the result freshness and explorative data source
discovery capabilities of link-traversal approaches. This is achieved by augmenting
baseline link-traversal query execution with a set of optimization techniques. The
proposed optimizations fall under two categories: metadata-based optimizations and
semantics-based optimizations.
|
PhD Seminar:
|
Wednesday September 20, 12:30 pm, DC 2585
|
Speaker:
|
Amira Ghenai
|
|
Title:
|
The Positive and Negative Influence of Search Results on People’s Decisions about the Efficacy of Medical Treatments
|
Abstract:
|
People regularly use web search engines to investigate the efficacy of medical treatments. Search results can contain documents that present incorrect information that contradicts current established medical understanding on whether a treatment is helpful or not for a health issue. If people are influenced by the incorrect information found in search results, they can make harmful decisions about the appropriate treatment. To determine the extent to which people can be influenced by search engine results, we conducted a controlled laboratory study that biased search results towards correct or incorrect information for 10 different medical treatments. We found that search engine results can significantly influence people both positively and negatively. Importantly, study participants made more incorrect decisions when they interacted with search results biased towards incorrect information than when they had no interaction with search results at all. For search domains such as health information, search engine designers and researchers must recognize that not all non-relevant information is the same. Some non-relevant information is incorrect and potentially harmful when people use it to make decisions that may negatively impact their lives.
Published as a full paper in the Proceedings of the 3rd ACM International Conference on the Theory of Information Retrieval (ICTIR), 2017.
|
PhD Seminar:
|
Wednesday September 27, 12:30 pm, DC 2585
|
Speaker:
|
Kareem El Gebaly
|
|
Title:
|
Analytics for Everyone
|
Abstract:
|
The process of analyzing relational data typically involves tasks
that help the analyst gain familiarity with or insight into the data
and arrive at findings or conclusions based on it. This process is
usually practiced by data experts (data scientists) who share their
output with a potentially less expert audience (everyone). Our goal is
to enable everyone to take part in this process rather than
passively consuming its outputs (analytics democratization). With
today's increasingly wide availability of data (data democratization)
on the internet (web), combined with already widespread personal
computing capabilities, such a goal is becoming more attainable. Two
main challenges face experts, such as data journalists, who
want to share their data exploration tasks over the web. First, the
infrastructure necessary for interactive data exploration is costly
and hard to manage, especially in data journalism use cases.
Second, their audiences need guidance because they would not know
where to start the data exploration task, since there are too many
starting points. To eliminate problems and costs related to managing
infrastructure, we propose an in-browser SQL engine (serverless),
i.e., a portable database. In addition, for databases that are too
large for the browser, we propose a hybrid architecture: a one-time
SQL query that runs at the backend and SQL queries running in the
browser as per the user's interactions. To guide the user's exploration
task, we introduce an information-theoretic technique that picks the
most informative parts from the entire data cube of a relational
table (explanation tables). We introduce optimizations that allow
for creating explanation tables under the modest resources available
in the browser, again, without any external dependencies.
Facilitating data exploration for everyone is one step closer towards
analytics democratization, where everyone can take part in data
exploration, not just the experts.
|
PhD Seminar:
|
Wednesday December 6, 12:30 pm, DC 2585
|
Speaker:
|
Alexey Karyakin
|
|
Title:
|
Improving Memory Energy Efficiency of Database Systems
|
Abstract:
|
Main memory is a significant contributor to the energy footprint of database systems. This is especially true for main-memory databases, which need large amounts of DRAM to store data. Unfortunately, energy consumption in database systems is not power proportional: it stays high even when memory utilization and workload intensity are low. The existing power saving mechanisms of DRAM, such as low power states, do not reduce memory energy consumption because existing database systems do not allow for sufficient memory idleness. In this talk, we will discuss ways to improve memory power efficiency in a modern main-memory RDBMS by applying "hot-cold" data classification mechanisms, rank-aware memory allocation, and conditioning the transaction execution to allow for more idleness in memory access.
|
PhD Seminar:
|
Wednesday December 13, 12:30 pm, DC 2585
|
Speaker:
|
Besat Kassaie
|
|
Title:
|
Applying Local Differential Privacy to Text
|
Abstract:
|
Electronic Health Record (EHR) systems have been developed to provide individuals with high quality and continuing health care. These systems are being adopted throughout the world. For example, the Optometry Clinic at the University of Waterloo's School of Optometry, which is one of the largest vision care centers in Canada, adopted an EHR system recently. As well as supporting individual health care, such data has the potential to be used in a broader scope to improve the lives of multiple generations. Researchers from various domains such as public health, social science, and economics can extract invaluable insights using this data.
Providing researchers with a reliable framework to work with EHR data entails addressing many problems, such as dealing with unstructured data, protecting the privacy of patients, and access control, among others. Health researchers would like to access all the data available in a clinical database to conduct accurate studies; however, at the same time, patients are very cautious about their medical data, which in most cases is considered private information.
The Optometry Clinic at UW wants to share medical data with optometry researchers. However, because data privacy cannot be ensured, clinicians can currently use the data only for primary health care, and it is not possible to share the data with researchers. There is a plethora of privacy legislation, such as PIPEDA, PHIPA, and HIPAA, that has to be considered in any system working on sensitive data. These concerns make it difficult to share medical data among researchers.
In this work, we propose a solution for applying differential privacy to medical domains. This problem has its own challenges resulting from the importance of data utility in the medical domain, correlations between data items that should be preserved to make a study possible, and finding and justifying the parameter values required in the differential privacy paradigm. However, having a well-structured set of records is an important implicit assumption in most works proposing differentially private data access algorithms. Considering that the optometry EHR system contains many unstructured data items, our aim is to propose a solution that applies an appropriate variant of differential privacy to text.
|
This page is maintained
by
Ken Salem.