The Data Systems Seminar Series provides a forum for presentation and discussion of interesting and current database issues. It complements our internal database meetings by bringing in external colleagues. The talks that are scheduled for this year are listed below.
The talks are usually held on a Monday at 10:30am in room DC 1302. Exceptions are flagged.
We will post the presentation notes whenever possible. Click on a presentation title to access them.
The Database Seminar Series is supported by
|Title:||Database Systems Meet Non-Volatile Memory (NVRAM)|
|Speaker:||Per-Åke (Paul) Larson|
|Abstract:||Byte-addressable, non-volatile memory (NVRAM) with close to DRAM speed is becoming a reality. Low capacity DIMMs (10s of MBs) are already available and high capacity DIMMs (100s of GB) are expected in 2017. This talk is about how database systems, in particular main-memory databases, can benefit from NVRAM. It will begin with an outline of the characteristics of different types of NVRAM and how the operating system manages NVRAM and provides applications with access to it. Ensuring that data structures such as indexes in NVRAM can be recovered to a consistent state without data or memory loss after a crash is challenging. The talk will discuss what causes the difficulties and how they can be overcome. It will then show how NVRAM can be used to greatly reduce the latency of commit processing and replication. By storing a main-memory database, including indexes, in NVRAM, it is possible to achieve near-instant recovery after a crash. The final part of the talk will discuss how this can be achieved.|
|Bio:||Paul has conducted research in the database field for over 35 years. He served as a Professor in the Department of Computer Science at the University of Waterloo for 15 years and as a Principal Researcher at Microsoft Research for close to 20 years. Paul is a Fellow of the ACM. He has worked in a variety of areas: file structures, materialized views, query processing, query optimization, column stores, and main-memory databases among others. Paul collaborated closely with the SQL Server team to drastically improve SQL Server performance by adding column store indexes, a novel main-memory engine (Hekaton), and support for real-time analytics.|
|Title:||Performance Management for Cloud Databases via Machine Learning|
|Speaker:||Olga Papaemmanouil, Brandeis University|
|Abstract:||Cloud computing has become one of the most active areas of computer science research, in large part because it allows computing to behave like a general utility that is always available on demand. While existing cloud infrastructures and services significantly reduce application development time, cloud users must still invest significant effort, because application deployment often involves a number of challenges including, but not limited to, performance monitoring, resource provisioning, and workload allocation. These tasks depend strongly on application-specific workload characteristics and performance objectives, so the burden of implementing them is left to the application developers.
We argue for a substantial shift away from human-crafted solutions and towards leveraging machine learning algorithms to address the above challenges. These algorithms can be trained on application-specific properties and customized performance goals to automatically learn how to provision resources as well as schedule the execution of incoming query workloads. Towards this vision, we have developed WiSeDB, a learning-based performance management service for cloud-deployed data management applications. In this talk, I will discuss how WiSeDB uses (a) supervised learning to automatically learn cost-effective models for guiding query placement, scheduling, and resource provisioning decisions for batch workload processing, and (b) reinforcement learning to naturally adapt to changes in query arrival rates and dynamic resource availability, while being decoupled from notoriously inaccurate performance prediction models.
|Bio:||Olga Papaemmanouil has been an Assistant Professor in the Department of Computer Science at Brandeis University since January 2009. Her research interests lie in the area of data management, with a recent focus on cloud databases, data exploration, query optimization, and query performance prediction. She received her undergraduate degree in Computer Science and Informatics from the University of Patras, Greece, in 1999. In 2001, she received her Sc.M. in Information Systems from the University of Economics and Business, Athens, Greece. She then joined the Computer Science Department at Brown University, where she completed her Ph.D. in Computer Science in 2008. She is the recipient of an NSF Career Award (2013) and a Paris Kanellakis Fellowship from Brown University (2002).|
|Title:||Approximate lifted inference with probabilistic databases|
|Speaker:||Wolfgang Gatterbauer, CMU|
|Abstract:||Performing inference over large uncertain data sets is becoming a central data management problem. Recent large knowledge bases, such as YAGO, NELL, or DeepDive, have millions to billions of uncertain tuples. Because general reasoning under uncertainty is highly intractable, many state-of-the-art systems today perform approximate inference by resorting to sampling. This talk shows an alternative approach that allows approximate ranking of answers to hard probabilistic queries in guaranteed polynomial time, using only basic operators of existing database management systems (i.e., no sampling required).
(1) The first part of this talk develops upper and lower bounds for the probability of Boolean functions by treating multiple occurrences of variables as independent and assigning them new individual probabilities. We call this approach dissociation and give an exact characterization of optimal oblivious bounds (i.e. when the new probabilities are chosen independent of the probabilities of all other variables). Our new bounds shed light on the connection between previous relaxation-based and model-based approximations and unify them as concrete choices in a larger design space.
(2) The second part then draws the connection to lifted inference and shows how the problem of approximate probabilistic inference can be entirely reduced to a standard query evaluation problem with aggregates. There are no iterations and no exponential blow-ups. All benefits of relational engines (such as cost-based optimizations, multi-core query processing, shared-nothing parallelization) are directly available to queries over probabilistic databases. To achieve this, we compute approximate rather than exact probabilities, with a one-sided guarantee: The probabilities are guaranteed to be upper bounds to the true probabilities, which we show is sufficient to rank the top query answers with high precision. We give experimental evidence on synthetic TPC-H data that this approach can be orders of magnitude faster and also more accurate than sampling-based approaches.
(Based on joint work with Dan Suciu from TODS 2014, VLDB 2015, and VLDBJ 2016: http://arxiv.org/pdf/1409.6052, http://arxiv.org/pdf/1412.1069, http://arxiv.org/pdf/1310.6257)
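As a rough illustration of the dissociation idea in part (1), the sketch below takes a small Boolean formula in which the variable x occurs twice, replaces the two occurrences with independent copies x1 and x2 that keep the original probability, and compares the resulting probability with the exact one. The formula, the variable names, and the brute-force `probability` helper are assumptions made for this example only; they are not taken from the papers, and the enumeration is exponential in the number of variables, so it serves only to check the bound on a toy input.

```python
from itertools import product


def probability(formula, probs):
    """Exact probability that `formula` is true, by enumerating every
    assignment of the independent Boolean variables in `probs`."""
    names = sorted(probs)
    total = 0.0
    for values in product([False, True], repeat=len(names)):
        world = dict(zip(names, values))
        if formula(world):
            weight = 1.0
            for name in names:
                weight *= probs[name] if world[name] else 1.0 - probs[name]
            total += weight
    return total


def phi(w):
    # Original formula: x occurs in both disjuncts.
    return (w["x"] and w["y"]) or (w["x"] and w["z"])


def phi_dissociated(w):
    # Dissociated formula: each occurrence of x becomes its own variable.
    return (w["x1"] and w["y"]) or (w["x2"] and w["z"])


# Hypothetical marginal probabilities chosen for the example.
p = {"x": 0.5, "y": 0.5, "z": 0.5}

exact = probability(phi, p)
upper = probability(
    phi_dissociated,
    {"x1": p["x"], "x2": p["x"], "y": p["y"], "z": p["z"]},
)

print(f"exact = {exact}, dissociated = {upper}")
```

In this example the dissociated formula has no repeated variables, so its probability could also be computed directly from the independent factors; the enumeration is used for both formulas only to keep the comparison uniform.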
|Bio:||Wolfgang Gatterbauer is an Assistant Professor in the Tepper School of Business and, by courtesy, in the Computer Science Department of Carnegie Mellon University. His current research focuses on scalable approaches to performing inference over uncertain data and is supported by a Career award from the National Science Foundation. Prior to joining CMU, he was a Post-Doc in the Database group at University of Washington. In earlier times, he won a Bronze medal at the International Physics Olympiad, worked in the steam turbine development department of ABB Alstom Power, and in the German office of McKinsey & Company.|
|Title:||Scalable Platforms for Graph Analytics and Collaborative Data Science|
|Speaker:||Amol Deshpande, University of Maryland|
|Abstract:||For several decades now, the amount of data available to us has been growing at a pace far higher than our ability to process it; this trend, popularly referred to as "big data", has accelerated many-fold in recent years with the emergence of efficient and mass-produced scientific instruments, increasing ease of generating and publishing data, and proliferation of Internet-connected devices. In this talk, I will present an overview of two recent projects from my group at UMD on building scalable platforms for large-scale data analytics.
First, I will discuss our ongoing work on building a platform, called "DataHub", for enabling collaborative data science, where teams of data scientists can simultaneously analyze, modify, and share datasets, to understand trends and to extract actionable insights. While numerous solutions exist for specific data analysis tasks, underlying infrastructure and data management capabilities for supporting ad hoc collaboration pipelines are still largely missing. I will present our vision for a unified, dataset-centric platform for addressing these challenges, and present our recent work on: (a) efficiently managing a large number of versioned datasets, (b) designing and supporting a unified query language to seamlessly query versioning and provenance information, and (c) lifecycle management of complex machine learning models like deep neural networks.
Second, I will present our initial work on extracting hidden graphs from relational databases. Although there has been much work on large-scale graph analytics, graphs are not the primary representation choice for most data today, and users who want to employ graph analytics are forced to extract data from their data stores, construct the requisite graphs, and then use a specialized engine to write and execute their graph analysis tasks. I will describe our work on a system called GraphGen, that enables users to declaratively specify graph extraction tasks over relational databases, visually explore the extracted graphs, and write and execute graph algorithms over them, either directly or using existing graph libraries like the widely used NetworkX Python library.
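To make the notion of a graph hidden in relational data concrete, here is a minimal sketch that derives a co-authorship graph from a single table with a self-join. It does not use GraphGen or its specification language; the `author_paper` table, its columns, the sample rows, and the `extract_coauthor_graph` function are all hypothetical and exist only for this illustration.

```python
import sqlite3
from collections import defaultdict


def extract_coauthor_graph(conn):
    """Return an adjacency map linking authors who share at least one paper."""
    rows = conn.execute(
        """
        SELECT DISTINCT a.author, b.author
        FROM author_paper AS a
        JOIN author_paper AS b ON a.paper = b.paper
        WHERE a.author < b.author
        """
    )
    graph = defaultdict(set)
    for u, v in rows:
        # Edges are undirected, so record both directions.
        graph[u].add(v)
        graph[v].add(u)
    return dict(graph)


# Hypothetical relational data: which author wrote which paper.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE author_paper (author TEXT, paper TEXT)")
conn.executemany(
    "INSERT INTO author_paper VALUES (?, ?)",
    [("ann", "p1"), ("bob", "p1"), ("bob", "p2"), ("cat", "p2"), ("dan", "p3")],
)

graph = extract_coauthor_graph(conn)
for author in sorted(graph):
    print(author, sorted(graph[author]))
```

The edges never exist as rows in the database; they only appear once the join is evaluated, which is why the abstract calls such graphs hidden. An author with no co-authors, such as "dan" here, does not show up as a node in this sketch.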
|Bio:||Amol Deshpande is a Professor in the Department of Computer Science at the University of Maryland with a joint appointment in the University of Maryland Institute for Advanced Computer Studies (UMIACS). He received his Ph.D. from University of California at Berkeley in 2004. His research interests include uncertain data management, adaptive query processing, data streams, graph analytics, and sensor networks. He is a recipient of an NSF Career award, and has received best paper awards at the VLDB 2004, EWSN 2008, and VLDB 2009 conferences.|
|Speaker:||Fabian M. Suchanek, Telecom ParisTech University|
|Abstract:||In this talk, I will give an overview of our recent work in the area of knowledge bases. I will first talk about our main project, the YAGO knowledge base. YAGO is now multilingual, and has grown into a larger project at the Max Planck Institute for Informatics and Télécom ParisTech. I will then talk about rule mining. We can find semantic correlations in the form of Horn rules in the knowledge base. In our newest work, we show how rule mining can be applied to predict the completeness or incompleteness of the data in the knowledge base. I will also talk about watermarking approaches to trace the provenance of ontological data. Finally, I will showcase our work on creativity in knowledge bases.
|Bio:||Fabian M. Suchanek is an associate professor at the Telecom ParisTech University in Paris. He obtained his PhD at the Max-Planck Institute for Informatics under the supervision of Gerhard Weikum. In his thesis, Fabian developed inter alia the YAGO-Ontology, one of the largest public ontologies, which earned him an honorable mention for the SIGMOD dissertation award. Fabian was a postdoc at Microsoft Research in Silicon Valley (reporting to Rakesh Agrawal) and at INRIA Saclay/France (reporting to Serge Abiteboul). He continued as the leader of the Otto Hahn Research Group "Ontologies" at the Max-Planck Institute for Informatics in Germany. Since 2013, he has been an associate professor at Télécom ParisTech University in Paris. Fabian teaches classes on the Semantic Web, Information Extraction and Knowledge Representation in France, in Germany, and in Senegal. With his students, he works on information extraction, rule mining, ontology matching, and other topics related to large knowledge bases. He has published around 50 scientific articles, among others at ISWC, VLDB, SIGMOD, WWW, CIKM, ICDE, and SIGIR, and his work has been cited more than 5500 times.|
|Speaker:||Felix Naumann, Hasso-Plattner-Institut für Softwaresystemtechnik|
|Abstract:||Data profiling comprises a broad range of methods to efficiently analyze a given data set. In a typical scenario, which mirrors the capabilities of commercial data profiling tools, tables of a relational database are scanned to derive metadata, such as data types and value patterns, completeness and uniqueness of columns, keys and foreign keys, and occasionally functional dependencies and association rules. Individual research projects have proposed several additional profiling tasks, such as the discovery of inclusion dependencies or conditional functional dependencies.
Data profiling deserves a fresh look for three reasons: First, the area itself is neither established nor defined in any principled way, despite significant research activity on individual parts in the past. Second, current data profiling techniques hardly scale beyond what can only be called small data. Third, more and more data beyond the traditional relational databases are being created and beg to be profiled. The talk highlights the state of the art and proposes new research directions and challenges.
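A few of the profiling tasks named in the abstract (column completeness, column uniqueness, and functional dependencies) can be sketched on a toy table as follows. This is only a naive illustration: the sample rows and the helper names `completeness`, `is_unique`, and `holds_fd` are assumptions made here, and the code checks a given candidate rather than discovering dependencies, which is the part that is hard to scale.

```python
def completeness(rows, column):
    """Fraction of rows whose value in `column` is not missing (None)."""
    return sum(row[column] is not None for row in rows) / len(rows)


def is_unique(rows, column):
    """True if no value in `column` occurs more than once."""
    values = [row[column] for row in rows]
    return len(values) == len(set(values))


def holds_fd(rows, lhs, rhs):
    """True if the functional dependency lhs -> rhs holds, i.e. rows that
    agree on all columns in `lhs` also agree on column `rhs`."""
    seen = {}
    for row in rows:
        key = tuple(row[c] for c in lhs)
        if seen.setdefault(key, row[rhs]) != row[rhs]:
            return False
    return True


# Hypothetical sample table.
rows = [
    {"id": 1, "zip": "N2L", "city": "Waterloo", "phone": "555-0100"},
    {"id": 2, "zip": "N2L", "city": "Waterloo", "phone": None},
    {"id": 3, "zip": "M5S", "city": "Toronto", "phone": "555-0199"},
    {"id": 4, "zip": "M5V", "city": "Toronto", "phone": None},
]

id_is_key = is_unique(rows, "id")
zip_is_key = is_unique(rows, "zip")
zip_determines_city = holds_fd(rows, ["zip"], "city")
city_determines_zip = holds_fd(rows, ["city"], "zip")
phone_completeness = completeness(rows, "phone")

print(id_is_key, zip_is_key, zip_determines_city, city_determines_zip, phone_completeness)
```

Each check is a single pass over the rows, but the number of candidate column combinations grows exponentially with the number of columns, which is one way to read the abstract's remark about scaling beyond small data.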
|Bio:||Felix Naumann studied mathematics, economics, and computer science at the University of Technology in Berlin. After receiving his diploma (MA) in 1997 he joined the graduate school "Distributed Information Systems" at Humboldt University of Berlin. He completed his PhD thesis on "Quality-driven Query Answering" in 2000. In 2001 and 2002 he worked at the IBM Almaden Research Center on topics of data integration. From 2003 to 2006 he was assistant professor for information integration at the Humboldt University of Berlin. Since then he has held the chair for information systems at the Hasso Plattner Institute at the University of Potsdam in Germany. His research interests are in data profiling, data cleansing, and text mining.|
|Title:||MacroBase: Prioritizing Attention in Fast Data|
|Speaker:||Peter Bailis, Stanford University|
|Abstract:||While data volumes continue to rise, the capacity of human attention remains limited. As a result, users need analytics engines that can assist in prioritizing attention in this "fast data" that is too large for manual inspection. We are developing MacroBase, a new data analytics engine designed to prioritize attention in fast data streams. MacroBase identifies deviations within streams and generates potential explanations that help contextualize and summarize relevant behaviors. As the first engine to combine streaming classification and streaming explanation operators, MacroBase exploits cross-layer optimizations that deliver order-of-magnitude speedups over existing alternatives while allowing flexible operation across domains including sensor, video, and relational data via extensible feature transform operators. As a result, MacroBase can deliver accurate results at speeds of up to 2M events per second per query on a single core, with operators for flexible operation over time-series, video, and relational data. MacroBase is a core component of the Stanford DAWN project, a new research initiative designed to enable more usable and efficient machine learning infrastructure.|
|Bio:||Peter Bailis is an assistant professor of Computer Science at Stanford University. Peter's research in the Future Data Systems group focuses on the design and implementation of next-generation, post-database data-intensive systems. His work spans large-scale data management, distributed protocol design, and architectures for high-volume complex decision support. He is the recipient of an NSF Graduate Research Fellowship, a Berkeley Fellowship for Graduate Study, best-of-conference citations for research appearing in both SIGMOD and VLDB, and the CRA Outstanding Undergraduate Researcher Award. He received a Ph.D. from UC Berkeley in 2015 and an A.B. from Harvard College in 2011, both in Computer Science.|
|Title:||The CloudMdsQL Multistore System|
|Speaker:||Patrick Valduriez, Inria and Biology Computational Institute (IBC)|
|Abstract:||The blooming of different cloud data management infrastructures, specialized for different kinds of data and tasks, has led to a wide diversification of DBMS interfaces and the loss of a common programming paradigm. In this talk, we present the design of a Cloud Multidatastore Query Language (CloudMdsQL), and its query engine. CloudMdsQL is a functional SQL-like language, capable of querying multiple heterogeneous data stores (relational and NoSQL) within a single query that may contain embedded invocations to each data store’s native query interface. The query engine has a fully distributed architecture, which provides important opportunities for optimization. The major innovation is that a CloudMdsQL query can exploit the full power of local data stores, by simply allowing some local data store native queries (e.g. a breadth-first search query against a graph database) to be called as functions, and at the same time be optimized, e.g. by pushing down select predicates, using bind join, performing join ordering, or planning intermediate data shipping. Our experimental validation, with various data stores (graph, document, relational, Spark/HDFS), and representative queries, shows that CloudMdsQL satisfies the five important requirements for a cloud multidatastore query language.
This work is partially funded by the European Commission under the Integrated Project CoherentPaaS.
|Bio:||Patrick Valduriez is a senior researcher at Inria and LIRMM, University of Montpellier, France. He has also been a professor of Computer Science at University Paris 6 and a researcher at Microelectronics and Computer Technology Corp. in Austin, Texas. He received his Ph.D. degree and Doctorat d'Etat in CS from University Paris 6 in 1981 and 1985, respectively. He is the head of the Zenith team (between Inria and University of Montpellier, LIRMM) that focuses on data management in large-scale distributed and parallel systems (P2P, cluster, grid, cloud), in particular, scientific data management. He has authored and co-authored over 250 technical papers and several textbooks, including “Principles of Distributed Database Systems”. He currently serves as associate editor of several journals, including the VLDB Journal, Distributed and Parallel Databases, and Internet and Databases. He has served as PC chair of major conferences such as SIGMOD and VLDB. He was the general chair of SIGMOD04, EDBT08 and VLDB09. He obtained the best paper award at VLDB00. He was the recipient of the 1993 IBM scientific prize in Computer Science in France and the 2014 Innovation Award from Inria – French Academy of Science – Dassault Systems. He is an ACM Fellow.|
|Speaker:||Laks Lakshmanan, University of British Columbia|