[Please remove <h1>]
Spring 2014
Events of interest to the
Database Research Group are posted here, and are also
mailed to the uw.cs.database newsgroup and the
db-faculty,
db-grads,
db-friends
mailing lists.
Subscribe to one of these mailing lists to receive e-mail notification
of upcoming events.
The DB group meets Wednesday afternoons at 2:30pm.
The list below gives the
times and locations of upcoming meetings.
Each meeting lasts for an hour and features either
a local speaker or, on
Seminar days,
an invited outside speaker.
Everyone is welcome to attend.
Spring 2014 Events
DB Meeting:
|
Wednesday, May 7, 2:30pm, DC 1331
|
Speaker:
|
Matteo Riondato, Brown University
|
Title:
|
Fast Mining of Frequent Itemsets through Sampling
|
Abstract:
|
Frequent Itemsets mining is one of the key tasks in
knowledge discovery from databases. The cost of mining algorithms for
this problem depends on the number of itemsets and on the size of the
dataset. In this talk I will present three algorithms that cuts the
dependency on the dataset size by using random samples of the dataset
to extract approximations of the collection of frequent itemsets with
guaranteed high-quality. The algorithms use advanced data structures,
VC-dimension, and MapReduce.
Bio:
Matteo Riondato is graduating with a PhD in computer science from
Brown University. His dissertation, titled "Sampling-based Randomized
Algorithms for Big Data Analytics" explores the connection between
statistical learning theory and data mining. In his research, he tries
to bridge the gap between data analytics, database systems, and theory
by exploiting the power of modern statistics and probability in new
ways that are efficient for modern problems, modern systems, and real
data.
|
DB Meeting:
|
Wednesday May 14, 2:30pm, DC 1331
|
Speaker:
|
Daniel Nicoara
|
Title:
|
Scaling Graph Databases through Dynamic Distributed Partitioning
|
DB Meeting:
|
Wednesday May 28, 2:30pm, DC 1331
|
Speaker:
|
Olaf Hartig
|
Title:
|
Optimizing the Response Time of Traversal-Based Query Execution
|
Abstract:
|
Traversal-based query execution is a recent approach to query the Web as if it
was a distributed database. The novelty of this approach lies in integrating a
traversal-based retrieval of Web data into the query execution process. Hence,
this approach does not assume a-priori a fixed set of potentially relevant data
sources; instead, the traversal process discovers data and data sources on the
fly. While existing work studies techniques to implement the idea of traversal-
based query execution, query optimization is an open problem. What makes the
problem challenging is the lack of information about data that will be
discovered during query execution. This talk presents ongoing research to
address this problem. In particular, we will focus on response time
optimization. That is, we will discuss different approaches that enable
traversal-based query executions to report first elements of a query result as
early as possible.
|
DB Meeting:
|
Wednesday June 4, 2:30pm, DC 1331
|
Speaker:
|
Frank Tompa
|
Title:
|
A Retrospective of Structured Text Search
|
Abstract:
|
Structured text is ubiquitous in humanities research and throughout business, leading to the standardization of first SGML and later XML. I will review several of the alternatives that search engines have adopted for such texts, including options for defining document structure and for querying both structure and content.
|
DB Meeting:
|
Wednesday June 11, 2:30pm, DC 1331
|
Speaker:
|
Michael Mior
|
Title:
|
Automated Schema Design for NoSQL Databases
|
Abstract:
|
Selecting appropriate indices and materialized views is critical for high performance in relational databases. By example, we show that the problem of schema optimization is also highly relevant for NoSQL databases. We explore the problem of schema design in NoSQL databases with a goal of optimizing query performance while minimizing storage overhead. Our suggested approach uses the cost of executing a given workload for a given schema to guide the mapping from the application data model to a physical schema. We propose a cost-driven approach for optimization and discuss its usefulness as part of an automated schema design tool.
|
DB Meeting:
|
Wednesday July 9, 2:30pm, DC 1331
|
Speaker:
|
Mohamed Sabri
|
Title:
|
Optimizations for Live Linked Data Query Execution
|
Abstract:
|
|
DB Meeting:
|
Wednesday July 16, 2:30pm, DC 1331
|
Speaker:
|
Zhiping Wu
|
Title:
|
Data Structures for Fast Access Control in ECM Systems (MMath Thesis Presentation)
|
Abstract:
|
While many access control models have been proposed, little work has been done on the efficiency of access control systems. Because the access control sub-system of an Enterprise Content Management (ECM) system may be a bottleneck, we investigate the representation of permissions to improve its efficiency. Observing that there are many browsing-oriented permission request queries, we choose to implement a subject-oriented representation (i.e., maintaining a permission list for each subject). Additionally, we notice that we may encounter many contiguous IDs under one object (e.g., folder) with breadth-firrst ID numbering.
To optimize the efficiency taking into account the above two characteristics, this thesis presents a space-efficient data structure specifically tailored for representing permission lists in ECM systems. Besides the space efficiency, checking, granting or revocation of a permission is very fast using our data structure. It also supports fast union of two or more permission lists (determining the effective permissions inherited from users' groups). In addition, our data structure is scalable to support any increase in the number of objects and subjects.
We evaluate our representation by comparing it against the bitmap based representation and a hash table based representation while using random ID numbering and breadth-first numbering, respectively. Our experimental tests on both synthetic and real-world data show that the hash table outperforms our representation for regular permission queries (i.e., querying permissions on a single object each time) as well as browsing-oriented queries with random ID numbering. However, our tests also show that 1) our representation supports faster browsing-oriented queries with breadth-first ID numbering applied while consuming only half the space when compared to the hash table based representation, and 2) our representation is much more space and time efficient than the bitmap based representation for our application.
|
DB Meeting:
|
Wednesday July 23, 2:30pm, DC 1331
|
Speaker:
|
Jaemyung Kim
|
Title:
|
Database High Availability Using Shared Volume Service With Offloaded Writes
|
Abstract:
|
Hot standby techniques are widely used to implement highly available database systems. These techniques make use of two copies of the database, an active copy and a backup that is managed by the standby. Synchronization of these two database copies is the responsibility of the database systems than manage them. However, database systems are often deployed in settings in which a reliable, persistent, network-accessible storage service (such as cloud block storage services, cluster file systems, or NAS) is available. In this paper we address the following question: how can we improve hot standby techniques in settings in which the active and standby database systems have access to a common, reliable persistent storage service?
We present SHADOW systems, a novel approach to hot standby high availability. In a SHADOW system, the active and standby database systems share access to a single logical copy of the database, which resides in the persistent shared storage. SHADOW introduces write offloading, which frees the active system from the need to update the persistent database, placing that responsibilty on the standby system instead. SHADOW systems push the task of managing database replication out of the DBMS and into the underlying storage service. We have implemented SHADOW prototypes using PostgreSQL, and we present the results of a performance evaluation that shows that SHADOW systems outperform traditional synchronous hot standby replication. Because of write offloading, SHADOW systems can potentially outperform even a standalone DBMS, while providing fast failover and durability of committed updates.
|
DB/IR Seminar:
|
Tuesday August 5, 3:00pm, DC 1304 ← Note non-standard day and time
|
Speaker:
|
Alistair Moffat, , Department of Computing and Information Systems, The University of Melbourne
|
Title:
|
Waving the Magic WAND
|
Abstract:
|
Web search services process thousands of queries per second, and filter their answers from collections containing very large amounts of data. Fast response to queries is a critical service expectation. The well-known WAND (Weak AND) processing strategy is one way of reducing the amount of computation necessary when executing such a query. The value of WAND has now been validated in a wide range of studies, and has become one of the key baselines against which all new top-k processing algorithms are benchmarked. However, most previous implementations of WAND-based retrieval approaches have been in the context of the BM25 Okapi similarity scoring regime. Here we measure the performance of WAND in the context of the alternative Language Model similarity score computation, and find that the dramatic efficiency gains reported in previous studies are no longer achievable. That is, when the primary goal of a retrieval system is to maximize effectiveness, WAND is relatively unhelpful in terms of attaining the secondary objective of maximizing query throughput rates. However, the BlockMax-WAND algorithm does in fact help reducing the percentage of postings to be scored, but with additional computational overhead. We explore a variety of trade-offs between scoring metric and processing regime and present new insight into how score-safe algorithms interact with rank scoring.
|
DB Meeting:
|
Wednesday August 6, 2:30pm, DC 1331
|
Speaker:
|
|
Title:
|
VLDB 2014 dry-runs
|
Abstract:
|
Minyang Han, K. Daudjee, K. Ammar, M. T. Özsu, X. Wang, T. Jin, An Experimental Comparison of Pregel-like Graph Processing Systems.
Güneş Aluç, M. T. Özsu, K. Daudjee, Workload Matters: Why RDF Databases Need a New Design.
|
This page is maintained
by
Khuzaima Daudjee.