The DB group meets Wednesday afternoons at 2:30pm. The list below gives the times and locations of upcoming meetings. Each meeting lasts for an hour and features either a local speaker or, on Seminar days, an invited outside speaker. Everyone is welcome to attend.
|DB Meeting:||Wednesday May 15, 2:30pm, DC 1331|
|Title:||Personalized User Interest Models for Entity Ranking|
In this talk I will explore the possibility of cross-referencing Web search query logs with large Web-extracted entity databases in order to build session-level profiles of the types of entities a user is interested in. Entity types describe classes of entities and may be very general, such as "Person" and "Organization", or very specific, such as "ETH_Zurich_alumni" and "Airlines_of_the_United_States". We build user interest models online and use them to rerank the results of subsequent queries issued to an entity-based search engine. We conduct experiments over a Web search log from a commercial search engine to evaluate the effectiveness of personalized reranking of entity results.
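As a rough illustration of the session-level profiling idea, the sketch below accumulates counts of entity types a user engages with and blends that affinity into the base retrieval score. The class names, the scoring formula, and the 0.5 blending weight are all assumptions for illustration, not the talk's actual model.

```python
from collections import Counter

# Hypothetical sketch: session-level entity-type profile used for reranking.
class SessionProfile:
    def __init__(self):
        self.type_counts = Counter()

    def observe(self, entity_types):
        """Record the types of an entity the user engaged with this session."""
        self.type_counts.update(entity_types)

    def affinity(self, entity_types):
        """Fraction of the session's interest mass covered by these types."""
        total = sum(self.type_counts.values())
        if total == 0:
            return 0.0
        return sum(self.type_counts[t] for t in entity_types) / total

def rerank(results, profile, weight=0.5):
    """Blend the engine's base score with the session-affinity score."""
    return sorted(
        results,
        key=lambda r: (1 - weight) * r["score"] + weight * profile.affinity(r["types"]),
        reverse=True,
    )

profile = SessionProfile()
profile.observe(["Person", "Airlines_of_the_United_States"])
profile.observe(["Airlines_of_the_United_States"])

results = [
    {"entity": "ETH Zurich", "types": ["Organization"], "score": 0.9},
    {"entity": "Delta Air Lines", "types": ["Airlines_of_the_United_States"], "score": 0.8},
]
# The airline outranks the higher-base-score result for this session.
print([r["entity"] for r in rerank(results, profile)])
```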
|DB Meeting:||Wednesday May 22, 2:30pm, DC 1331|
|Title:||Querying Linked Data on the Web|
The World Wide Web (WWW) is currently evolving into a Web of Linked Data,
where content providers publish and link their data as they have done for
Web documents for the past 20 years. While the execution of SQL-like queries
over this emerging dataspace opens possibilities not conceivable before,
querying the Web of Linked Data poses novel challenges. Due to the openness
of the WWW, it is impossible to know all data sources that might contribute
to the answer of a query. To tap the full potential of the Web, traditional
query execution paradigms are insufficient because they assume a fixed set
of potentially relevant data sources known beforehand. In the Web context
these data sources might not be known before executing the query.
In this talk we discuss how the Web of Linked Data, conceived as a database,
differs from traditional database scenarios. In this context we present
results on theoretical properties of queries over the Web of Linked Data.
Furthermore, we introduce a novel query execution paradigm that allows an
execution engine to discover potentially relevant data during the execution
of queries. Finally, we discuss an approach for selecting query execution
plans that is tailored to the new paradigm.
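The paradigm of discovering sources during execution can be sketched in miniature: start from a seed URI, dereference documents as they are found, match triples against the query pattern, and follow newly discovered URIs. The tiny in-memory "web", the URIs, and the triple-pattern query below are all illustrative assumptions, not the talk's actual system.

```python
# Hypothetical sketch of link-traversal query execution over Linked Data.
WEB = {  # URI -> triples published in the document at that URI
    "http://ex.org/alice": [
        ("http://ex.org/alice", "knows", "http://ex.org/bob"),
    ],
    "http://ex.org/bob": [
        ("http://ex.org/bob", "name", "Bob"),
    ],
}

def dereference(uri):
    """Stand-in for an HTTP lookup of a Linked Data URI."""
    return WEB.get(uri, [])

def traverse_and_match(seed_uris, pattern):
    """Discover sources during execution: follow URIs found in retrieved
    triples, collecting triples that match the pattern (None = wildcard)."""
    seen, frontier, matches = set(), list(seed_uris), []
    while frontier:
        uri = frontier.pop()
        if uri in seen:
            continue
        seen.add(uri)
        for triple in dereference(uri):
            if all(p is None or p == t for p, t in zip(pattern, triple)):
                matches.append(triple)
            for term in triple:  # follow newly discovered URIs
                if term.startswith("http://"):
                    frontier.append(term)
    return matches

# Starting from one seed, the engine still finds Bob's name, even though
# that source was not known before execution began.
print(traverse_and_match(["http://ex.org/alice"], (None, "name", None)))
```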
|DB Seminar:||Wednesday May 29, 2:30pm, DC 1302|
|Speaker:||Nick Koudas, University of Toronto|
|Title:||Alternate ways to search Twitter|
|DB Event:||Wednesday June 3, 1:00pm, DC 1331|
|Title:||ViewDF: a Flexible Framework for Incremental View Maintenance in Stream Data Warehouses (MMath Thesis Presentation)|
The increasing size of data and the demand for low latency in modern data analysis are pushing traditional data warehousing technologies beyond their limits. Several stream data warehouse (SDW) systems, which are warehouses that ingest append-only data feeds and support frequent refresh cycles, have been proposed, including different methods to improve the responsiveness of the systems. Materialized views are critical in large-scale data warehouses because of their ability to speed up queries, so an SDW maintains layers of materialized views. View maintenance in SDW systems introduces new challenges: some existing SDW systems do not discuss the maintenance of views, while others employ view maintenance techniques that are not efficient. This thesis presents ViewDF, a flexible framework for incremental maintenance of materialized views in SDW systems that generalizes existing techniques and enables new optimizations for views defined with operators that are common in stream analytics. We give a special view definition (ViewDF) that enhances the traditional way of creating views in SQL by being able to reference any partition of any table. We describe a prototype system based on this idea, which allows users to write ViewDFs directly and can automatically translate a broad class of queries into ViewDFs. Several optimizations are proposed, and experiments show that our system can improve view maintenance time by a factor of two or more in practical settings.
|DB Meeting:||Wednesday June 5, 2:30pm, DC 1331|
|Title:||Multi-Master Replication for Snapshot Isolation Databases (MMath Thesis Presentation)|
Lazy replication with snapshot isolation has emerged
as a popular choice for distributed databases. However, lazy
replication requires the execution of update transactions at one
(master) site so that it is relatively easy for a total SI order to be
determined for consistent installation of updates in the lazily
replicated system. In this talk, we propose a set of techniques that
support update transaction execution over multiple partitioned sites,
thereby allowing the master to scale. Our techniques determine
a total SI order for update transactions over multiple master sites
without requiring global coordination in the distributed system,
and ensure that updates are installed in this order at all replicas
in a consistent manner. The effectiveness of the proposed techniques is
demonstrated through a system built on top of PostgreSQL.
|DB Meeting:||Wednesday June 12, 2:30pm, DC 1331|
|Title:||CAC-DB: High Availability for Database Systems in Geographically Distributed Cloud Computing Environments (MMath Thesis Presentation)|
There have been solutions that provide transactional SQL-based DBMS services on the cloud, including solutions that use cloud shared storage systems to store the data. However, none of these solutions takes advantage of the shared cloud storage architecture to provide DBMS high availability. It is possible to run traditional DBMS high availability solutions in cloud environments. These solutions are typically based on log shipping. However, they do not work well if the primary and backup are in different, geographically distributed data centers. Furthermore, they do not take advantage of the capabilities of the underlying shared storage system. We present a new transparent high availability system for transactional SQL-based DBMSes on a shared storage architecture, which we call CAC-DB (Continuous Access Cloud DataBase). Our system is especially designed for eventually consistent cloud storage systems that run efficiently in multiple geographically distributed data centers. By taking advantage of shared storage, CAC-DB can run in a geographically distributed environment and achieves the following goal: if a data center fails, not only does the persistent image of the database on the storage tier survive, but the DBMS service can also resume almost uninterrupted and reach peak throughput in a very short time. At the same time, the throughput of the DBMS service during normal processing is not negatively affected.
|DB Meeting:||Wednesday June 19, 2:30pm, DC 1331|
|Title:||Scalable Scientific Computing Algorithms Using MapReduce (MMath Thesis Presentation)|
Cloud computing systems, like MapReduce and Pregel, provide a scalable and fault-tolerant environment for running computations at massive scale. However, these systems are designed primarily for data intensive computational tasks, while a large class of problems in scientific computing and business analytics are computationally intensive (i.e., they require a lot of CPU in addition to I/O). In this thesis, we investigate the use of cloud computing systems, in particular MapReduce, for computationally intensive problems, focusing on two classic problems that arise in scientific computing and also in analytics: maximum clique and matrix inversion.
The key contribution that enables us to effectively use MapReduce to solve the maximum clique problem on dense graphs is a recursive partitioning method that partitions the graph into several subgraphs of similar size and running time complexity. After partitioning, the maximum cliques of the different partitions can be computed independently, and the computation is sped up using a branch and bound method. Our experiments show that our approach leads to good scalability, which is unachievable by other partitioning methods since they result in partitions of different sizes and hence lead to load imbalance. Our method is more scalable than an MPI algorithm, and is simpler and more fault tolerant.
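One standard way to make per-partition clique search independent, in the spirit of the partitioning described above, is to assign each clique to the subproblem of its lowest-numbered vertex: that subproblem contains the vertex plus its higher-numbered neighbours, so partitions can be solved separately (as MapReduce mappers would) and the best answer taken. This single-node sketch, with a toy graph and brute-force search standing in for branch and bound, is an illustration of the decomposition, not the thesis's algorithm.

```python
from itertools import combinations

def is_clique(adj, vertices):
    """True if every pair of the given vertices is adjacent."""
    return all(v in adj[u] for u, v in combinations(vertices, 2))

def max_clique_in(adj, candidates):
    """Brute-force largest clique within one partition's candidate set."""
    for k in range(len(candidates), 0, -1):
        for subset in combinations(candidates, k):
            if is_clique(adj, subset):
                return list(subset)
    return []

def max_clique(adj):
    """Partition by lowest-numbered vertex: the subproblem for v holds v and
    its higher-numbered neighbours, so every clique falls entirely inside
    exactly one subproblem and subproblems are solvable independently."""
    vertices = sorted(adj)
    best = []
    for i, v in enumerate(vertices):
        part = [v] + [u for u in vertices[i + 1:] if u in adj[v]]
        cand = max_clique_in(adj, part)
        if len(cand) > len(best):
            best = cand
    return best

# Toy graph: triangle 0-1-2 plus a pendant vertex 3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(sorted(max_clique(adj)))  # the triangle: [0, 1, 2]
```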
For the matrix inversion problem, we show that a recursive block LU decomposition allows us to effectively compute in parallel both the lower-triangular (L) and upper-triangular (U) matrices using MapReduce. After computing the L and U matrices, their inverses are computed using MapReduce. The inverse of the original matrix, which is the product of the inverses of the L and U matrices, is also obtained using MapReduce. Our technique is the first matrix inversion technique that uses MapReduce. We show experimentally that our technique has good scalability, and it is simpler and more fault tolerant than MPI implementations such as ScaLAPACK.
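The inversion pipeline described above can be followed on a single machine: factor A = LU, invert each triangular factor by substitution, then multiply, since A⁻¹ = U⁻¹L⁻¹. The thesis runs each stage as MapReduce jobs over blocks; this plain-Python sketch of the same algebra (Doolittle factorization without pivoting, assumed to exist for the input) is for illustration only.

```python
def lu(a):
    """Doolittle LU decomposition without pivoting (assumes it exists)."""
    n = len(a)
    L = [[float(i == j) for j in range(n)] for i in range(n)]
    U = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            U[i][j] = a[i][j] - sum(L[i][k] * U[k][j] for k in range(i))
        for j in range(i + 1, n):
            L[j][i] = (a[j][i] - sum(L[j][k] * U[k][i] for k in range(i))) / U[i][i]
    return L, U

def inv_lower(L):
    """Invert a lower-triangular matrix by forward substitution."""
    n = len(L)
    X = [[0.0] * n for _ in range(n)]
    for j in range(n):
        X[j][j] = 1.0 / L[j][j]
        for i in range(j + 1, n):
            X[i][j] = -sum(L[i][k] * X[k][j] for k in range(j, i)) / L[i][i]
    return X

def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def transpose(A):
    return [list(row) for row in zip(*A)]

def inverse(a):
    """A^-1 = U^-1 L^-1, with U inverted via its lower-triangular transpose."""
    L, U = lu(a)
    Linv = inv_lower(L)
    Uinv = transpose(inv_lower(transpose(U)))
    return matmul(Uinv, Linv)

A = [[4.0, 3.0], [6.0, 3.0]]
print(inverse(A))  # [[-0.5, 0.5], [1.0, -0.666...]]
```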
|DB Meeting:||Wednesday July 3, 2:30pm, DC 1331|
|Title:||SIGMOD 5-minute Madness|
Speakers will each give a 6-minute summary of a piece of interesting
work that was presented at SIGMOD 2013:
-Ahmed will present "InfoGather+: Semantic Matching and Annotation of Numeric and Time-Varying Attributes in Web Tables"
-Ani (SAP Waterloo) will present something.
-Glenn (Conestoga College) will present the DBTEST workshop paper "Mutatis Mutandis"
-Gunes will present "DBMS Metrology: Measuring Query Time"
-Khaled will present something.
-Khuzaima will present "RTP: robust tenant placement for elastic in-memory database clusters"
-Olaf will present "Building an efficient RDF store over a relational database"
-Tamer will present "Crowd Mining".
-Zeynep will present "Shark: SQL and Rich Analytics at Scale"
|DB Event:||VLDB'13 Practice Talks, Wednesday July 9, 2:30pm, DC 3313|
|Speakers:||Rui Liu, Xin Liu|
Hybrid Storage Management for Database Systems (Xin Liu)
DAX: A Widely Distributed Multi-tenant Storage Service for DBMS (Rui Liu)
Hybrid Storage Management for Database Systems
The use of flash-based solid state drives (SSDs) in storage systems is growing. Adding SSDs to a storage system not only raises the question of how to manage the SSDs, but also raises the question of whether current buffer pool algorithms will still work effectively. We are interested in the use of hybrid storage systems, consisting of SSDs and hard disk drives (HDDs), for database management. We present cost-aware replacement algorithms, which are aware of the difference in performance between SSDs and HDDs, for both the DBMS buffer pool and the SSDs. In hybrid storage systems, the physical access pattern to the SSDs depends on the management of the DBMS buffer pool. We studied the impact of buffer pool caching policies on SSD access patterns. Based on these studies, we designed a cost-adjusted caching policy to effectively manage the SSD. We implemented these algorithms in MySQL's InnoDB storage engine and used the TPC-C workload to demonstrate that these cost-aware algorithms outperform previous algorithms.
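To make the cost-aware replacement idea concrete, here is a GreedyDual-style sketch: pages fetched from the slow device (HDD) receive more credit than pages fetched from the fast device (SSD), so the pool prefers to evict pages that are cheap to refetch. The device costs and policy details are illustrative assumptions, not the algorithms from the talk.

```python
# Illustrative cost-aware buffer pool replacement (GreedyDual-style sketch).
COST = {"ssd": 1.0, "hdd": 10.0}  # assumed relative refetch costs

class CostAwarePool:
    def __init__(self, capacity):
        self.capacity = capacity
        self.credit = {}      # page -> remaining credit
        self.inflation = 0.0  # ages surviving pages on each eviction

    def access(self, page, device):
        if page in self.credit:
            # Hit: refresh the page's credit relative to current inflation.
            self.credit[page] = self.inflation + COST[device]
            return "hit"
        if len(self.credit) >= self.capacity:
            # Evict the page with the least remaining credit.
            victim = min(self.credit, key=self.credit.get)
            self.inflation = self.credit.pop(victim)
        self.credit[page] = self.inflation + COST[device]
        return "miss"

pool = CostAwarePool(capacity=2)
pool.access("h1", "hdd")
pool.access("s1", "ssd")
pool.access("h2", "hdd")  # pool full: evicts s1 (cheap to refetch), keeps h1
print("s1" in pool.credit, "h1" in pool.credit)  # False True
```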
DAX: A Widely Distributed Multi-tenant Storage Service for DBMS Hosting
Many applications hosted on the cloud have sophisticated data management needs that are best served by a SQL-based relational DBMS. It is not difficult to run a DBMS in the cloud, and in many cases one DBMS instance is enough to support an application's workload. However, a DBMS running in the cloud (or even on a local server) still needs a way to persistently store its data and protect it against failures. One way to achieve this is to provide a scalable and reliable storage service that the DBMS can access over a network. This paper describes such a service, which we call DAX. DAX relies on multi-master replication and Dynamo-style flexible consistency, which enables it to run in multiple data centers and hence be disaster tolerant. Flexible consistency allows DAX to control the consistency level of each read or write operation, choosing between strong consistency at the cost of high latency or weak consistency with low latency. DAX makes this choice for each read or write operation by applying protocols that we designed based on the storage tier usage characteristics of database systems. With these protocols, DAX provides a storage service that can host multiple DBMS tenants, scaling with the number of tenants and the required storage capacity and bandwidth. DAX also provides high availability and disaster tolerance for the DBMS storage tier. Experiments using the TPC-C benchmark show that DAX provides up to a factor of 4 performance improvement over baseline solutions that do not exploit flexible consistency.
|DB Meeting:||Wednesday July 10, 2:30pm, DC 1331|
|Speaker:||Ani Nica, SAP Waterloo|
|DB Meeting:||Wednesday July 17, 2:30pm, DC 1331|
|Title:||Partial Materialization for On-Line Analytical Processing on Multi-Tagged Document Collections|
On-Line Analytical Processing (OLAP) systems are commonly used on top of structured data to help users make sense of large data collections by providing them with summary information that can be examined at various levels of detail. Partial materialization has been used as part of these OLAP systems as a way of reducing the time required to calculate summaries as well as satisfying the constraints of limited storage and available time for updates.
When dealing with large collections of tagged documents, one would also benefit from the summarization operations provided by an OLAP system. Such a system could make it less time consuming for users to explore and understand the information contained in large document collections. Tagged document collections, however, require different types of measures for summarizing the data, and the data exhibits considerably different properties than is the case with the data in traditional OLAP. To address these issues, an OLAP system will require a different design and partial materialization approach.
This talk will describe the document centric measures, the properties unique to multi-tagged documents, the partial materialization approach that can address these properties, and the system being developed to help users explore document collections.
|DB Meeting:||Wednesday July 24, 2:30pm, DC 1331|
|Title:||Modeling, Enforcement and Verification of Business Processes within Relational Database Systems|
In this talk I will present a systematic method of mapping a broad set
of process-centric business policies onto database-level constraints.
The underlying observation of this work is that a database represents
the union of all the states of every ongoing business process. Thus,
if we can establish a bijective relationship between progression in
individual business processes and changes in database state, we can
easily specify and reason over complex constraints in both the
business process and the database state space. The work presented
bridges several different areas of research, including database
systems, temporal logics, model checking, and business workflow/policy
management, to propose an accessible method of integrating, enforcing
and reasoning about the consequences of process-centric constraints
embedded in database systems.
|DB Meeting:||Wednesday July 31, 2:30pm, DC 1331|
|Speaker:||Kemafor Ogan, North Carolina State University|
|Title:||Querying RDF Data Models, On Land and in the Cloud|
Querying RDF data models presents some unique and interesting challenges from the data management perspective. First, due to its fine-grained model, queries on RDF models can require up to an order of magnitude more joins than typical relational workloads, and rest on more relaxed assumptions than those underlying relational join optimization. Further, a graph model allows the possibility of query classes not actively studied in relational query optimization, e.g. path or subgraph extraction queries. Second, RDF models are semantic. This means that additional information may be entailed beyond what is explicitly stated. Such entailments need to be accounted for in query processing to ensure complete answers. Third, in the context of the Linked Open Web, models are vertically fragmented and distributed, making cost-based optimization difficult, and operations for model discovery have to be part of query processing. Fourth, depending on the computing model and architecture used for processing, additional challenges and optimization goals may need to be considered, e.g. the length of the execution workflow in MapReduce cloud platforms.
In this talk, I will overview some of the work I have done with my students to address challenges primarily in the first, second and fourth categories. Specifically, I will discuss techniques we have proposed for Generalized Graph Pattern Queries (graph pattern queries extended with path variables) and Context-Aware Keyword Search on RDF (personalized keyword query interpretation using a user's query history). Finally, I will discuss our proposal for a Nested Data Model and Query Algebra (NTGA) that enables new algebraic optimization techniques for SPARQL queries, and its benefits for cloud-based processing.
Kemafor's general research interests revolve around techniques for transforming data into knowledge. Her doctoral research was motivated by the idea that gaining knowledge and deeper insight from data depends on being able to get answers to complex questions. To this end, she focused her research on developing new query models for Semantic Web data. She earned her doctoral degree in August 2007 from the University of Georgia under the direction of Prof. Amit Sheth and joined the Computer Science faculty at NC State in the fall of 2007. During the early part of her research career, she continued work on query models for the Semantic Web, including structure discovery queries, keyword search and skyline queries on RDF. More recently, she has focused on optimization techniques for querying "big" Semantic Web data. Her work is currently funded by organizations including the National Science Foundation and IBM. In her previous life, she was a biochemist, having earned a bachelor's degree in Biochemistry from the University of Nigeria in 1989. For her undergraduate thesis she studied serum total and prostatic acid phosphatase levels in the male community of her university town. Serum prostatic acid phosphatase is linked to prostate cancer activity.