The DB group meets Wednesday afternoons at 2:30pm. The list below gives the times and locations of upcoming meetings. Each meeting lasts for an hour and features either a local speaker or, on Seminar days, an invited outside speaker. Everyone is welcome to attend.
|DB Meeting:||Wednesday January 19, 2:30pm, DC 1331|
|Title:||Spatial Data Support in SQL Anywhere 12|
|Abstract:||Since joining iAnywhere fifteen months ago, much of my time has been spent on the new spatial features just released in SQL Anywhere 12. Spatial databases were a new area for me, so in this talk I will give an overview of standards and practices for RDBMS support of spatial data. I will then discuss the spatial functionality implemented in SQL Anywhere 12, including how the product goals for SA differ from those of some other RDBMS products and how those goals impact our design. I will end with an overview and brief demo of Quantum GIS, an open-source desktop tool for manipulating geographic information.|
|DB Meeting:||Wednesday January 26, 2:30pm, DC 1331|
|Title:||HBaseDB: A solution for Multi-row distributed transactions with global strong snapshot isolation using HBase on clouds|
|Abstract:||Modern applications such as collaborative Web 2.0 applications and social network sites pose challenging requirements for scalable distributed transactions involving multiple data items on clouds. On the one hand, traditional database management systems (DBMSs) cannot provide the desired degree of scalability and availability while guaranteeing transactional properties at the same time, especially in the face of the various kinds of failures that occur on clouds. On the other hand, column-oriented data stores fall short of multi-row distributed transactional support, although they have been proven to scale and perform well on clouds with integrated fault tolerance schemes. Against this background, HBase, a representative open-source column-oriented store modeled after Google's BigTable system, has recently been studied for solutions that provide transactional data management capabilities. Unfortunately, none of the existing solutions supports distributed transactions with global strong snapshot isolation (SI) at high throughput and low latency. This paper presents a solution, called HBaseDB, supporting global strong SI for distributed transactions using HBase. HBaseDB targets the same type of OLTP workloads as HBase, featuring random data access. It is implemented as a lightweight client-side library in lieu of the standard HBase API for transactional processing, and requires no extra programs to be deployed or maintained and no modifications to the existing user data stored in HBase. Transactions are autonomously managed by the applications that issue them through the client library, without using any consensus-based protocols, atomic broadcast, or transactional locks on data for distributed synchronization and concurrency control. As a result of this simplicity in design, HBaseDB adds low overhead to HBase performance and directly inherits many desirable properties of HBase on clouds, such as scalability, fault tolerance, access transparency and high throughput.|
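To make the snapshot isolation setting concrete, here is a minimal, hypothetical sketch of timestamp-based SI over a multi-versioned store. It is not the HBaseDB protocol; the class names, the logical-clock oracle, and the conflict check are all illustrative assumptions: each transaction reads the latest version committed at or before its start timestamp, and a commit aborts on a write-write conflict with a transaction that committed after it started.

```python
class MVStore:
    """Multi-versioned key-value store with a logical timestamp oracle."""
    def __init__(self):
        self.versions = {}   # key -> list of (commit_ts, value)
        self.clock = 0       # logical timestamps

    def next_ts(self):
        self.clock += 1
        return self.clock

    def read(self, key, ts):
        # Latest version committed at or before the snapshot timestamp.
        candidates = [(c, v) for c, v in self.versions.get(key, []) if c <= ts]
        return max(candidates)[1] if candidates else None

class Txn:
    def __init__(self, store):
        self.store = store
        self.start_ts = store.next_ts()
        self.writes = {}

    def get(self, key):
        # Own writes first, then the snapshot as of start_ts.
        return self.writes.get(key, self.store.read(key, self.start_ts))

    def put(self, key, value):
        self.writes[key] = value

    def commit(self):
        # Write-write conflict: someone committed this key after we started.
        for key in self.writes:
            for commit_ts, _ in self.store.versions.get(key, []):
                if commit_ts > self.start_ts:
                    return False  # abort
        commit_ts = self.store.next_ts()
        for key, value in self.writes.items():
            self.store.versions.setdefault(key, []).append((commit_ts, value))
        return True
```

Under this sketch, a transaction that began before a concurrent writer committed still reads the old snapshot, and its own conflicting write is rejected at commit, which is exactly the SI behavior the abstract refers to.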
|DB Meeting:||Wednesday February 2, 2:30pm, DC 1331|
|Title:||Tailoring RDF Databases for Web Data Management|
|Abstract:||The Resource Description Framework (RDF) provides a flexible model to capture the many-to-many relationships in web data. However, due to these complex relationships, querying large data volumes potentially involves a large number of joins, even if one partitions the data. Existing techniques such as the property table approach, vertical partitioning and hexatuple indexing cannot fully address this problem. The property table approach requires the schema of the data to be analyzed in advance to construct the flattened tables. Vertical partitioning assumes that queries will have bound predicates, and is hence inefficient at handling fuzzy queries. Finally, for hexatuple indexing to be efficient in a distributed scenario, one must guarantee that the data can be partitioned with negligible cross-references among the partitions. Unfortunately, these conditions are hardly ever met in web data management: web data does not have a well-defined schema, fuzzy queries are as likely to occur in web applications as exact relational queries, and data partitioning in the presence of these many-to-many relationships is challenging. This talk will highlight the strengths and weaknesses of existing RDF databases as potential tools in web data management, and provide some perspective on how some of these challenges can be addressed.|
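The bound-versus-unbound predicate issue can be sketched in a few lines. The following toy layout (illustrative only, not any particular system's storage engine; the triples and function name are made up) keeps one two-column "table" per predicate, vertical-partitioning style, so a query with a bound predicate touches one partition while an unbound (fuzzy) predicate forces a scan of every partition.

```python
from collections import defaultdict

triples = [
    ("alice", "knows",   "bob"),
    ("bob",   "knows",   "carol"),
    ("alice", "worksAt", "uwaterloo"),
]

# Vertical partitioning: predicate -> list of (subject, object) pairs.
partitions = defaultdict(list)
for s, p, o in triples:
    partitions[p].append((s, o))

def query(subject=None, predicate=None, obj=None):
    """Match a triple pattern; None marks an unbound variable."""
    # Bound predicate: a single partition is scanned. Unbound predicate
    # (the "fuzzy" case): every partition must be scanned and unioned.
    preds = [predicate] if predicate else list(partitions)
    out = []
    for p in preds:
        for s, o in partitions[p]:
            if (subject is None or s == subject) and (obj is None or o == obj):
                out.append((s, p, o))
    return out
```

For example, `query(predicate="knows")` reads one partition, whereas `query(subject="alice")` must visit all of them, which is the inefficiency for fuzzy queries that the abstract points out.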
|DB Meeting:||Wednesday February 9, 2:30pm, DC 1331|
|Title:||Visual exploration of high-dimensional data by interactive navigation of low-dimensional data spaces|
The structure of a set of high dimensional data objects (e.g. images, documents, molecules, genetic expressions, etc.) is notoriously difficult to visualize. In contrast, lower dimensional structure (esp. 3 or fewer dimensions) is natural to us and easy to visualize. A not unreasonable approach, then, might be to explore one low dimensional visualization after another in the hope that, together, these will shed light on the higher dimensional structure.
In this talk, I will introduce some graph-theoretic structures which have low-dimensional spaces as nodes/vertices and transitions from one space to another as edges. To be concrete, suppose that each node is a 2-d scatterplot of the data and that an edge exists between nodes whose corresponding scatterplots share a variable. In this case, travel along an edge amounts to a 3-d transition effected by rotating one 2-d scatterplot into the next. More generally, imagine a user moving a "You are here" circle, or "bullet", from one node to another along defined edges, causing one data visualization to be smoothly morphed into the other. A walk on the graph represents a low-dimensional trajectory through the higher dimensional space. Of interest are walks along these graphs that reveal meaningful structure in the displayed data.
These ideas will be demonstrated on several different data sets using an interactive software package called RnavGraph (written by UW Ph.D. student Adrian Waddell). RnavGraph allows a user to visually explore any data set by dynamically walking the graph structure and interacting with the displayed data. It connects to the R statistical programming system.
Methods for constructing these graphs and for identifying interesting subgraphs will also be described and demonstrated. Some dimensionality reduction (manifold learning) methods will also be used to constrain the size of the graph.
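The concrete construction described above (nodes are 2-d scatterplots, an edge joins plots sharing a variable) is small enough to sketch directly. This is an illustrative toy, not RnavGraph's implementation; the variable names are invented.

```python
from itertools import combinations

variables = ["x1", "x2", "x3", "x4"]

# Nodes: every 2-d scatterplot, i.e. every unordered pair of variables.
nodes = [frozenset(pair) for pair in combinations(variables, 2)]

# Edges: two scatterplots are adjacent when they share exactly one
# variable; walking such an edge is a 3-d rotation through the shared axis.
edges = [(a, b) for a, b in combinations(nodes, 2) if len(a & b) == 1]
```

With 4 variables this yields 6 scatterplot nodes, and every pair of plots sharing a variable contributes an edge, giving the navigation graph a user's "bullet" would traverse.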
|DB Meeting:||Wednesday February 16, 2:30pm, DC 1331|
|Title:||HyPer: Hybrid OLTP and OLAP Main Memory Database System Based on Virtual Memory Snapshots|
|Abstract:||I'd like to present HyPer: Hybrid OLTP and OLAP Main Memory Database System - Based on Virtual Memory Snapshots by Kemper & Neumann. HyPer is interesting in that it exploits a feature of commodity hardware and operating systems, copy-on-write virtual memory snapshots, to improve database performance.|
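The virtual-memory-snapshot idea can be illustrated with an ordinary `fork()`: the child process gets a copy-on-write snapshot of the parent's memory, so analytical reads in the child see a consistent state while the parent keeps applying updates. This is only a schematic sketch of the mechanism, not HyPer's actual engine (the balance example is invented), and it assumes a POSIX system where `os.fork` is available.

```python
import os

data = {"balance": 100}  # in-memory OLTP state

r, w = os.pipe()
pid = os.fork()          # copy-on-write snapshot of the address space
if pid == 0:
    # Child ("OLAP"): reads the state as of the fork, unaffected by
    # any updates the parent makes afterwards.
    os.close(r)
    os.write(w, str(data["balance"]).encode())
    os._exit(0)
else:
    # Parent ("OLTP"): continues processing transactions.
    os.close(w)
    data["balance"] += 50                      # update after the snapshot
    snapshot_balance = int(os.read(r, 16).decode())
    os.waitpid(pid, 0)
```

The child reports the pre-update value even though the parent has already modified the data, which is the property HyPer relies on to run OLAP queries against a transactionally consistent snapshot.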
|DB Meeting:||Wednesday March 2, 2:30pm, DC 1331|
|Title:||Extending the Cache with Solid State Drives (SSD)|
In the past few years, the cost of flash memory has fallen dramatically while fabrication has become more efficient. Flash-based Solid State Drives (SSDs) have started to make inroads into the laptop, desktop storage, and enterprise server markets. The price and performance of flash memory fall between those of traditional RAM and hard disk drives. If flash memory is introduced to fill the gap between RAM and traditional rotating disks, a common question is whether it should work as a special part of main memory or as a special part of the storage system.
Rather than putting SSD and HDD side by side and using the SSD as an alternative storage option, we propose to treat memory, SSD, and HDD hierarchically. This project focuses on the issues of using the SSD as a second-tier cache below the memory cache. We will discuss the impact of the memory and SSD on each other, and how to utilize them efficiently.
|DB Meeting:||Wednesday March 9, 2:30pm, DC 1331 CANCELLED|
|DB Meeting:||Wednesday March 16, 2:30pm, DC 1331|
|Title:||Augmenting Data Warehouses from Text|
One theme in the Business Intelligence Network (BIN) aims to use the information contained in documents to provide input for business analytics.
I will review how text mining can be applied to extract facts that can be added to a data warehouse.
Special attention will be given to the problem of extracting temporal information from text and posing queries with temporal constraints against document data.
[Primary references for this material are Gomes and Tompa's Information Extraction in the Business Intelligence Context (unpublished); Zhang, Suchanek, Yue, and Weikum's TOB: Timely Ontologies for Business Relations (WebDB 2008); and Arikan, Bedathur, and Berberich's Time Will Tell: Leveraging Temporal Expressions in IR (WSDM 2009).]
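A toy version of the temporal problem described above might look as follows. This is an assumed pipeline for illustration only, not the BIN implementation or any of the cited systems; the documents, the regex, and the helper name are invented, and real temporal taggers handle far richer expressions than bare years.

```python
import re

docs = [
    "Acme acquired Widgets Inc. in 2006.",
    "Acme sold its Widgets division in 2010.",
]

YEAR = re.compile(r"\b(19|20)\d{2}\b")  # crude year-expression extractor

# Extraction step: attach a timestamp to each fact pulled from text.
facts = []
for doc in docs:
    match = YEAR.search(doc)
    if match:
        facts.append({"text": doc, "year": int(match.group(0))})

def facts_between(start, end):
    """Answer a query with a temporal constraint over extracted facts."""
    return [f["text"] for f in facts if start <= f["year"] <= end]
```

Once facts carry timestamps, a warehouse query can constrain them temporally, e.g. asking only for events in a given fiscal window.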
|DB Seminar:||Wednesday March 23, 2:30pm, MC 6005 (Please note change in room)|
|Speaker:||Bettina Kemme, McGill University|
|Title:||Data Consistency in Scalable Multi-tier Architectures|
|DB Meeting:||Wednesday March 30, 2:30pm, DC 1331|
|Title:||Materialization in Access Control Systems|
|Abstract:||In modern access control systems, hierarchies (subject, object or role) are often supported to make models more flexible. They can, however, increase the time to answer authorization requests due to potentially time-consuming permission propagation and conflict resolution, especially when the hierarchies are deep. Our work takes advantage of materialization in access control systems so that an authorization request does not always have to bear the cost of permission propagation and conflict resolution. By measuring the system and authorization costs, we illustrate that materialization can be useful in access control systems, especially from the user-experience point of view. Moreover, we can potentially reduce both system cost and authorization cost with better update mechanisms in future research.|
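The trade-off can be made concrete with a tiny role hierarchy. This is a hypothetical illustration (the roles, permissions, and functions are invented, and conflict resolution is omitted): effective permissions can either be recomputed by walking the hierarchy on every request, or materialized once and answered by lookup.

```python
# Each role inherits the permissions of its parent roles.
role_parents = {"admin": ["engineer"], "engineer": ["employee"], "employee": []}
direct_perms = {"employee": {"read"}, "engineer": {"write"}, "admin": {"delete"}}

def effective_perms(role):
    """On-demand propagation: walk every ancestor (costly when deep)."""
    perms = set(direct_perms.get(role, set()))
    for parent in role_parents.get(role, []):
        perms |= effective_perms(parent)
    return perms

# Materialization: propagate once; each request is then a set lookup.
materialized = {role: effective_perms(role) for role in role_parents}

def check(role, perm):
    return perm in materialized[role]
```

The obvious cost shifts to maintenance: when the hierarchy or a direct permission changes, the materialized sets must be refreshed, which is the update-mechanism question the abstract raises.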
|DB Seminar:||Wednesday April 6, 2:30pm, DC 1302|
|Speaker:||Daniel Abadi, Yale University|
|Title:||Scalable Database Systems for a Machine-Dominated World|
|DB Meeting:||Wednesday April 20, 2:30pm, DC 1331|
|Title:||Generating Efficient Execution Plans for Vertically Partitioned XML Databases|
Experience with relational systems has shown that distribution is an
effective way of improving the scalability of query evaluation. In
this paper, we show how distributed query evaluation can be performed
in a vertically partitioned XML database system. We propose a novel
technique for constructing distributed execution plans that is
independent of local query evaluation strategies. We then present a
number of optimizations that allow us to further improve the
performance of distributed query execution. Finally, we present a
response time-based cost model that allows us to pick the best
execution plan for a given query and database instance. Based on an
implementation of our techniques within a native XML database system,
we verify that our execution plans take advantage of the parallelism
in a distributed system and that our cost model is effective at
identifying the most advantageous plan.
This talk is an extended version of our upcoming presentation at VLDB.
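A response-time cost model of the general kind mentioned above can be sketched schematically. This is an illustrative assumption, not the paper's actual model: fragments of a distributed plan run in parallel, so a plan's response time is governed by its slowest fragment plus a merge step, and the optimizer picks the plan minimizing that quantity.

```python
def response_time(plan):
    """Response time = slowest parallel fragment + cost to merge results."""
    fragment_times, merge_cost = plan
    return max(fragment_times) + merge_cost

# Hypothetical candidate plans: (per-fragment times, merge cost).
plans = {
    "plan_a": ([40, 35, 38], 5),   # balanced fragments, modest merge
    "plan_b": ([90, 10, 10], 2),   # skewed: one slow fragment dominates
}

best = min(plans, key=lambda name: response_time(plans[name]))
```

Note how a response-time model prefers the balanced plan even though the skewed plan does less total work, which is precisely why such a model rewards parallelism in a distributed setting.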
|DB Meeting:||Wednesday April 27, 2:30pm, DC 1331|
|Title:||SLiM-FiD: A Durable SkipList-based In-Memory Index using Flash|
|Abstract:||We present SLiM-FiD, an in-memory indexing system based on SkipLists and HashTables. Our solution is optimized for using SSDs as secondary storage to provide persistence and was entered in the 3rd Annual SIGMOD Programming Contest. We cover our implementation and the associated persistence mechanism in detail as well as the reasons behind our choices of algorithms and data structures used. Finally, our experiments show that SLiM-FiD is able to outperform established key-value store technologies.|
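The SkipList-plus-HashTable pairing rests on a simple division of labor: the ordered structure serves range scans while the hash table serves O(1) point lookups. A minimal stand-in for that idea (using `bisect` over a sorted list in place of a real SkipList; the class and method names are illustrative, not SLiM-FiD's, and persistence to SSD is omitted entirely):

```python
import bisect

class OrderedKV:
    def __init__(self):
        self.keys = []     # sorted keys: plays the SkipList's role
        self.table = {}    # hash table for point lookups

    def put(self, key, value):
        if key not in self.table:
            bisect.insort(self.keys, key)   # keep keys in order
        self.table[key] = value

    def get(self, key):
        return self.table.get(key)          # O(1) expected point lookup

    def range(self, lo, hi):
        """All (key, value) pairs with lo <= key <= hi, in key order."""
        i = bisect.bisect_left(self.keys, lo)
        j = bisect.bisect_right(self.keys, hi)
        return [(k, self.table[k]) for k in self.keys[i:j]]
```

A real SkipList would make insertion O(log n) rather than the O(n) of `insort`, and would support the lock-free concurrent access that makes the structure attractive for in-memory indexing.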