Starting Fall 2015, the regular time for DSG talks is Monday at 10:30 am, although some talks may be on different days.
|DB Seminar Series:||Wednesday January 21, 2:30 pm, DC 1302|
|Speaker:||Cristiana Amza, University of Toronto|
|Title:||Stage-Aware Anomaly Detection through Execution Flow Tracking|
|DB Seminar:||Monday January 26, 1:30 pm, DC 2568|
|Speaker:||Alexandra Roatis, University of Waterloo|
|Title:||Efficient Querying and Analytics of Semantic Web Data|
The high rate of data publication and its increased complexity, for instance the heterogeneous, self-describing Semantic Web data, motivate the interest in efficient techniques for data manipulation. This talk focuses on leveraging mature relational data management technology for querying Semantic Web data.
The first part describes query answering over data subject to RDFS constraints, stored in relational data management systems. The implicit information resulting from RDF reasoning is required to correctly answer such queries. We introduce the database fragment of RDF, going beyond the expressive power of previously studied fragments. We devise novel techniques for answering Basic Graph Pattern queries within this fragment, exploring the two established approaches for handling RDF semantics, namely graph saturation and query reformulation. In particular, we consider graph updates within each approach and propose a method for incrementally maintaining the saturation. We experimentally study the performance trade-offs of our techniques, which can be deployed on top of any relational data management engine.
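As a rough illustration of the two approaches named above, the following sketch answers a single "instances of a class" query over a toy triple set, once by saturating the graph and once by reformulating the query. It is a minimal sketch under stated assumptions, not the speaker's system: only two RDFS rules (subclass transitivity and type propagation) are covered, the data lives in a Python set rather than a relational engine, and all function names and the `ex:` identifiers are hypothetical.

```python
# Toy RDF graph: a set of (subject, predicate, object) triples.
SUBCLASS = "rdfs:subClassOf"
TYPE = "rdf:type"


def saturate(triples):
    """Close the graph under subClassOf transitivity and type propagation."""
    closed = set(triples)
    while True:
        new = set()
        for (s, p, o) in closed:
            if p != SUBCLASS:
                continue
            for (s2, p2, o2) in closed:
                if p2 == SUBCLASS and s2 == o:
                    new.add((s, SUBCLASS, o2))   # s <= o and o <= o2 gives s <= o2
                if p2 == TYPE and o2 == s:
                    new.add((s2, TYPE, o))       # s2 : s and s <= o gives s2 : o
        if new <= closed:                        # fixpoint reached
            return closed
        closed |= new


def instances_by_saturation(triples, cls):
    """Evaluate the query directly on the saturated graph."""
    return {s for (s, p, o) in saturate(triples) if p == TYPE and o == cls}


def subclasses(triples, cls):
    """Return cls together with all of its transitive subclasses."""
    result, frontier = {cls}, [cls]
    while frontier:
        current = frontier.pop()
        for (s, p, o) in triples:
            if p == SUBCLASS and o == current and s not in result:
                result.add(s)
                frontier.append(s)
    return result


def instances_by_reformulation(triples, cls):
    """Rewrite the query into a union over subclasses; the data is untouched."""
    targets = subclasses(triples, cls)
    return {s for (s, p, o) in triples if p == TYPE and o in targets}


graph = {
    ("ex:Student", SUBCLASS, "ex:Person"),
    ("ex:PhDStudent", SUBCLASS, "ex:Student"),
    ("ex:alice", TYPE, "ex:PhDStudent"),
    ("ex:bob", TYPE, "ex:Person"),
}
```

Both functions return the same answer on this graph. Saturation pays the cost up front and must be maintained when the graph is updated, whereas reformulation leaves the data unchanged and pays at query time.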
The second part considers the new requirements for data analytics tools and methods emerging from the development of the Semantic Web. We fully redesign, from the bottom up, core data analytics concepts and tools in the context of RDF data. We propose the first complete formal framework for warehouse-style RDF analytics. Notably, we define analytical schemas tailored to heterogeneous, semantic-rich RDF graphs, analytical queries which (beyond relational cubes) allow flexible querying of the data and the schema as well as powerful aggregation and OLAP-style operations. Experiments on a fully-implemented platform demonstrate the practical interest of our approach.
|CS Seminar:||Thursday February 5, 10:30 am, DC 1304|
|Speaker:||Willis Lang, Microsoft Jim Gray Systems Lab|
|Title:||Redefining the Rules for Data Processing in the Cloud|
|DB Seminar Series:||
|Speaker:||Aaron Elmore, MIT and University of Chicago|
|Title:||Building an Elastic Main-Memory Database: E-Store|
|DB Seminar Series:||Wednesday May 6th, 2:30 pm, DC 1302|
|Speaker:||Sudipto Das, Microsoft Research|
|Title:||Performance Isolation in Multi-Tenant Relational Database-as-a-Service|
|PhD Seminar:||Monday May 25th, 1:30pm, DC 2314|
|Title:||ALEX: Automatic Link Exploration in Linked Data|
There has recently been an increase in the number of RDF knowledge bases published on the Internet. These rich RDF data sets can be useful in answering many queries, but much more interesting queries can be answered by integrating information from different data sets. This has given rise to research on automatically linking different RDF data sets representing different knowledge bases. This is challenging due to their scale and semantic heterogeneity. Various approaches have been proposed, but there is room for improving the quality of the generated links.
In this paper, we present ALEX, a system that aims at improving the quality of links between RDF data sets by using feedback provided by users on the answers to linked data queries. ALEX starts with a set of candidate links obtained using any automatic linking algorithm. ALEX utilizes user feedback to discover new links that did not exist in the set of candidate links while preserving link precision. ALEX discovers these new links by finding links that are similar to a link approved by the user through feedback on queries. ALEX uses a Monte-Carlo reinforcement learning method to learn how to explore in the space of possible links around a given link. Our experiments on real-world data sets show that ALEX is efficient and significantly improves the quality of links.
|Practice Talk:||Wednesday June 3rd, 1:00pm, DC 2314|
|Title:||EdgeX: Edge Replication for Web Applications|
This will be a short practice talk.
|MMath Thesis Presentation:||Monday July 6th, 2:00pm, Room DC 2310|
|Title:||On Improving Distributed Pregel-like Graph Processing Systems|
The considerable interest in distributed systems that can execute algorithms to process large graphs has led to the creation of many graph processing systems. However, existing systems suffer from two major issues: (1) poor performance due to frequent global synchronization barriers and limited scalability; and (2) lack of support for graph algorithms that require serializability, the guarantee that parallel executions of an algorithm produce the same results as some serial execution of that algorithm.
Many graph processing systems use the bulk synchronous parallel (BSP) model, which allows graph algorithms to be easily implemented and reasoned about. However, BSP suffers from poor performance due to stale messages and frequent global synchronization barriers. While asynchronous models have been proposed to alleviate these overheads, existing systems that implement such models have limited scalability or retain frequent global barriers and do not always support graph mutations or algorithms with multiple computation phases. We propose barrierless asynchronous parallel (BAP), a new computation model that overcomes the limitations of existing asynchronous models by reducing both message staleness and global synchronization while retaining support for graph mutations and algorithms with multiple computation phases. We present GiraphUC, which implements our BAP model in the open source distributed graph processing system Giraph, and evaluate it at scale to demonstrate that BAP provides efficient and transparent asynchronous execution of algorithms that are programmed synchronously.
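To make the BSP behaviour described above concrete, here is a minimal single-process sketch of superstep execution, using minimum-label propagation (a common way to find connected components) as the vertex program. It is an illustration only and does not reproduce Giraph, GiraphUC, or the BAP model; the function name `bsp_min_label` and the adjacency-dictionary input are assumptions made for the example.

```python
def bsp_min_label(adj):
    """Propagate the minimum vertex id under BSP.

    adj maps each vertex to a list of its neighbours (undirected graph).
    Returns (labels, supersteps).
    """
    labels = {v: v for v in adj}

    # Superstep 0: every vertex sends its own label to its neighbours.
    inbox = {v: [] for v in adj}
    for v in adj:
        for n in adj[v]:
            inbox[n].append(labels[v])
    supersteps = 1

    # Later supersteps: a vertex only sees messages sent in the previous one.
    while any(inbox.values()):
        outbox = {v: [] for v in adj}
        for v, msgs in inbox.items():
            if msgs and min(msgs) < labels[v]:
                labels[v] = min(msgs)
                for n in adj[v]:
                    outbox[n].append(labels[v])
        # Global barrier: messages become visible only after every vertex
        # has finished, so a label improved in this superstep reaches
        # neighbours one superstep late (the "stale message" effect).
        inbox = outbox
        supersteps += 1
    return labels, supersteps


labels, supersteps = bsp_min_label({0: [1], 1: [0, 2], 2: [1], 3: [4], 4: [3]})
```

In this sketch the swap `inbox = outbox` plays the role of the global synchronization barrier. An asynchronous model would instead let a vertex read labels updated earlier in the same pass, which is the kind of staleness reduction the abstract refers to.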
Second, very few systems provide serializability, despite the fact that many graph algorithms require it for accuracy, correctness, or termination. To address this deficiency, we provide a complete solution that can be implemented on top of existing graph processing systems to provide serializability. Our solution formalizes the notion of serializability and the conditions under which it can be provided for graph processing systems. We propose a partition-based synchronization technique that enforces these conditions efficiently to provide serializability. We implement this technique in Giraph and GiraphUC to demonstrate that it is configurable, transparent to algorithm developers, and more performant than existing techniques.
|DB Seminar Series:||Monday July 20th, 1:30pm, Room DC1304|
|Speaker:||Wolfgang Lehner, TU Dresden|
|Title:||Steps towards HW/SW-DB-CoDesign|
|Practice Talk:||Wednesday September 2, 12:45pm, DC 1316|
|Title:||Enhancing Exploration with a Faceted Browser through Summarization|
An enhanced faceted browsing system has been developed to support users' exploration of large multi-tagged document collections. It provides summary measures of document result sets at each step of navigation through a set of representative terms and a diverse set of documents. These summaries are derived from pre-materialized views that allow for quick calculation of centroids for various result sets. The utility and efficiency of the system are demonstrated on the New York Times Annotated Corpus.
This will be a short practice talk.
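One way pre-materialized views can make centroid computation cheap is sketched below. The abstract does not describe the actual view design, so this is an assumption: each view is taken to cover a disjoint group of documents and to store the sum of their term-weight vectors plus a document count, so the centroid of any union of groups is obtained without rescanning documents. The names `materialize` and `centroid` and the sample data are hypothetical.

```python
from collections import Counter


def materialize(docs):
    """Build a view for one document group: (sum of term vectors, doc count)."""
    total = Counter()
    for doc in docs:
        total.update(doc)          # adds the term weights of each document
    return total, len(docs)


def centroid(views):
    """Combine views of disjoint groups into the centroid of their union."""
    total, count = Counter(), 0
    for vec_sum, n in views:
        total.update(vec_sum)
        count += n
    if count == 0:
        return {}
    return {term: weight / count for term, weight in total.items()}


# Two hypothetical tag groups, each document a sparse term-weight vector.
sports = [{"game": 2.0, "team": 1.0}, {"game": 1.0}]
politics = [{"vote": 3.0}]

result = centroid([materialize(sports), materialize(politics)])
```

The views would be built once, ahead of time; at browsing time only the small per-group sums are combined, so the cost depends on the number of views touched rather than the number of documents in the result set.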
|CS Distinguished Lecture Series:||Monday Sep 14th, 2:00 pm, Humanities Theatre|
|Speaker:||Mike Stonebraker, MIT|
|Title:||The Land Sharks are on the Squawk Box (How Riding a Bicycle across America and Building Postgres Have a Lot in Common)|
|MMath thesis presentation:||Thursday Oct 1st, 2:00 pm, DC 3323|
|Title:||EdgeX: Edge Replication for Web Applications|
|PhD Seminar (Systems):||Friday Oct 2nd, 10:30 am, DC 1331|
|Title:||Towards Adaptive Resource Allocation for Database Workloads|
|DSG Seminar Series:||Monday Oct 5th, 10:30 am, M3-3127|
|Speaker:||Nesime Tatbul, Intel Labs and MIT|
|Title:||S-Store: A Streaming NewSQL System for Big Velocity Applications|
|MMath Research Paper presentation:||Tuesday Oct 6th, 10:00 am, DC 2310|
|Title:||Holistic Cleaning of Heterogeneous Data Sets using Conditional Denial Constraints|
|Seminar:||Monday Oct 19th, 10:30 am, DC 1302|
|Speaker:||Ashraf Aboulnaga, QCRI|
|Title:||Arabesque: A System for Distributed Graph Mining|
Distributed data processing platforms such as MapReduce and Pregel have substantially simplified the design and deployment of certain classes of distributed graph analytics algorithms. However, these platforms are not a good match for distributed graph mining problems, such as finding frequent subgraphs in a graph. Given an input graph, these problems require exploring a very large number of subgraphs and finding patterns that match some "interestingness" criteria desired by the user. These algorithms are very important for areas such as social networks, the Semantic Web, and bioinformatics. In this talk, I will present Arabesque, the first distributed data processing platform for implementing graph mining algorithms. Arabesque automates the process of exploring a very large number of subgraphs. It defines a high-level filter-process computational model that simplifies the development of scalable graph mining algorithms: Arabesque explores subgraphs and passes them to the application, which must simply compute outputs and decide whether the subgraph should be further extended. We use Arabesque's API to produce distributed solutions to three fundamental graph mining problems: frequent subgraph mining, counting motifs, and finding cliques. Our implementations require a handful of lines of code, scale to trillions of subgraphs, and represent in some cases the first available distributed solutions.
This is joint work with Carlos Teixeira, Alexandre Fonseca, Marco Serafini, Georgos Siganos, and Mohammed Zaki. It appears in SOSP 2015.
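The filter-process idea in the abstract can be sketched on a single machine as follows. This is not Arabesque's API: the `explore`, `is_clique`, `filter_fn`, and `process_fn` names are hypothetical, the exploration is sequential rather than distributed, and subgraphs are represented simply as sets of vertices. The application supplies only the two callbacks; the exploration loop handles extension and de-duplication.

```python
def explore(adj, filter_fn, process_fn, max_size):
    """Enumerate connected vertex sets by one-vertex extension.

    adj maps each vertex to the set of its neighbours.
    filter_fn decides whether a subgraph is kept and extended further.
    process_fn receives every subgraph that passes the filter.
    """
    seen = set()
    frontier = [frozenset([v]) for v in adj]
    while frontier:
        sub = frontier.pop()
        if sub in seen:
            continue
        seen.add(sub)
        if not filter_fn(adj, sub):
            continue                       # prune: never extend this subgraph
        process_fn(sub)
        if len(sub) < max_size:
            neighbours = set().union(*(adj[v] for v in sub)) - sub
            for n in neighbours:
                frontier.append(sub | {n})


def is_clique(adj, sub):
    """Filter for clique finding: every pair of vertices must be adjacent."""
    return all(u in adj[v] for u in sub for v in sub if u != v)


# Triangle 0-1-2 with a pendant edge 2-3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
found = []
explore(adj, is_clique, found.append, max_size=3)
```

Pruning is safe here because every subset of a clique is itself a clique, so any clique can still be reached by extending smaller ones that passed the filter.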
|DSG Seminar Series:||Monday Oct 26th, 10:30 am, Room TBA|
|Speaker:||Ankur Goyal, MemSQL|
|Title:||Key Innovations in MemSQL|
|CS Distinguished Lecture Series:||Tuesday Oct 27th, 3:30 pm, DC 1302|
|Speaker:||Susan Dumais, Microsoft|
|Title:||Personalized Search: Potential and Pitfalls|
|DSG Seminar Series:||Monday Nov 2nd, 10:30 am, DC 1302|
|Speaker:||Andy Pavlo, Carnegie Mellon University|
|Title:||I Don't Want to be the Mitt Romney of Databases|
|DSG Seminar Series:||Monday Nov 9th, 10:30 am, DC 1302|
|Speaker:||Shane Culpepper, RMIT|
|Title:||Efficient Location-aware Web Search|