The Database Seminar Series provides a forum for presentation and discussion of interesting and current database issues. It complements our internal database meetings by bringing in external colleagues. The talks that are scheduled for this year are listed below.
The talks are usually held on a Monday at 10:30am in room DC 1302. Exceptions are flagged.
We will try to post the presentation notes whenever possible. Please click on the presentation title to access these notes.
The Database Seminar Series is supported by
Nesime Tatbul |
Ankur Goyal |
Andy Pavlo |
Shane Culpepper |
Stephen Green |
Frank McSherry |
Ricardo Baeza-Yates |
Shivakumar Vaithyanathan |
Kevyn Collins-Thompson |
Ellen Voorhees |
Jay Aslam |
5 October 2015, 10:30 am, M3-3127 (Please note special location!)
Title: | S-Store: A Streaming NewSQL System for Big Velocity Applications |
Speaker: | Nesime Tatbul, Intel Labs and MIT |
Abstract: | Managing high-speed data streams in real time has become an integral part of today’s big data applications. In a significant portion of these applications, we see a critical need for real-time stream processing to co-exist with transactional state management due to the presence of shared mutable state. Yet, existing systems treat streaming and transaction processing as two separate computational paradigms, which makes it difficult to build such applications to execute correctly and scalably. S-Store is a new data management system that provides a single, scalable platform for processing streams and transactions. S-Store takes its architectural foundation from H-Store - a modern distributed main-memory OLTP ("NewSQL") system, and adds well-defined primitives to support data-driven processing such as streams, windows, triggers, and dataflow graphs. Furthermore, it makes a number of careful extensions to H-Store's traditional transaction model in order to maintain correctness guarantees in the presence of data and processing dependencies among transaction executions that involve streams. These guarantees include ACID, ordered execution, and exactly-once processing. In this talk, I will present S-Store's design and implementation, and show how S-Store can ensure transactional integrity without sacrificing performance. |
Bio: | Nesime Tatbul is a senior research scientist at the Intel Science and Technology Center for Big Data based at MIT CSAIL. Before joining Intel Labs, she was a faculty member at the Computer Science Department of ETH Zurich. She received her B.S. and M.S. degrees in Computer Engineering from the Middle East Technical University (METU), and her M.S. and Ph.D. degrees in Computer Science from Brown University. During her graduate school years at Brown, she also worked as a research intern at the IBM Almaden Research Center, and as a consultant for the U.S. Army Research Institute of Environmental Medicine (USARIEM). Her research interests are in database systems, with a recent focus on data stream processing and distributed data management. She is the recipient of an IBM Faculty Award in 2008, a Best System Demonstration Award at the ACM SIGMOD 2005 Conference, and both the Best Poster Award and the Grand Challenge Award at the ACM DEBS 2011 Conference. She has served on the program committee for various conferences including ACM SIGMOD (as an industrial program co-chair in 2014 and as a group leader in 2011), VLDB, and IEEE ICDE (as a PC track chair for Streams, Sensor Networks, and Complex Event Processing in 2013). She has chaired a number of VLDB co-located workshops including the International Workshop on Data Management for Sensor Networks (DMSN) and the International Workshop on Business Intelligence for the Real-Time Enterprise (BIRTE). Her recent editorial duties include PVLDB (associate editor, Volume 5, 2011-2012) and ACM SIGMOD Record (associate editor, Research Surveys Column, since June 2012). |
26 October 2015, 10:30 am, DC 1304
2 November 2015, 10:30 am, DC 1302
Title: | I Don't Want to be the Mitt Romney of Databases |
Speaker: | Andy Pavlo, Carnegie Mellon University |
Abstract: | What can I say? Yes, I helped build a database management system (DBMS) for the "one percent." This previous system (H-Store) is able to get up to 40x higher throughput over traditional, disk-oriented DBMSs for on-line transaction processing workloads. But getting this great performance requires a significant upfront deployment cost (e.g., application rewriting, pre-partitioning). It is also unable to perform non-trivial analysis operations without the use of a separate data warehouse, which further increases costs and overhead. This makes a DBMS like ours accessible to only those organizations with ample resources. In this talk, I outline our vision for a new distributed DBMS (codenamed "Peloton") that we are building at CMU that is truly for the 99%. It will enable any application to get the same kind of performance as a specialized system like H-Store without any expensive setup or maintenance. The crux of the system is to employ machine learning techniques to support the efficient execution of hybrid workloads (transactions + analytics) through intelligent pre-fetching and automatic partitioning/tuning. In essence, our new DBMS is able to learn about how an application uses the database without any human intervention and reconfigure itself accordingly. |
Bio: | Andy Pavlo is an Assistant Professor of Databaseology in the Computer Science Department at Carnegie Mellon University. |
9 November 2015, 10:30 am, DC 1302
Title: | |
Speaker: | Shane Culpepper, RMIT |
Abstract: | Mobile search is quickly becoming the most common mode of search on the internet. This shift is driving changes in user behaviour, and search engine behaviour. Just over half of all search queries from mobile devices have local intent, making location-aware search an increasingly important problem. In this work, we explore the efficiency and effectiveness of two general types of geographical search queries, range queries and $k$ nearest neighbor queries, for common web search tasks. We test state-of-the-art spatial-textual indexing and search algorithms for both query types on two large datasets. Finally, we present a rank-safe dynamic pruning algorithm that is simple to implement and use with current inverted indexing techniques. Our algorithm is more efficient than the tightly coupled best-in-breed hybrid indexing algorithms that are commonly used for top-$k$ spatial textual queries, and more likely to find relevant documents than techniques derived from range queries. |
Bio: | Shane Culpepper is an ARC DECRA Research Fellow and Senior Lecturer at RMIT University in Melbourne, Australia. He completed a PhD in Computer Science at The University of Melbourne in 2008 under the supervision of Alistair Moffat. His research focuses primarily on designing efficient and scalable algorithms for a wide variety of information storage and retrieval problems. Research interests include information retrieval, text indexing, data compression, experimental algorithmics, and natural language processing. For more information, visit his homepage at http://www.culpepper.io. |
11 January 2016, 2:00 pm, DC 1302 (Please note unusual time!)
Title: | Research in Information Retrieval and Machine Learning at Oracle Labs |
Speaker: | Stephen Green, Oracle Labs |
Abstract: | This talk will describe current and past research in Information Retrieval and Machine Learning at Oracle Labs. Along the way we will talk about research at Oracle Labs in general, about work on scalable machine learning, feature selection, and sentiment analysis, and about what it is like to do research in an industrial setting. |
Bio: | Stephen Green is a Consulting Member of Technical Staff at Oracle Labs in Burlington, Massachusetts, where he is the Principal Investigator of the Information Retrieval and Machine Learning project.
He is the chief architect and implementer of the Minion search engine, a high-performance, open source Java search engine incorporating techniques from information retrieval, natural language processing, and knowledge representation. |
11 April 2016, 10:30 am, DC 1302
18 April 2016, 2:30 pm, DC 1302
9 May 2016, 10:30 am, DC 1302
Title: | Watson Content Services: Creation, Maintenance and Consumption of Knowledge Bases |
Speaker: | Shivakumar Vaithyanathan, IBM Almaden Research Center |
Abstract: | In this talk I will describe a scalable ontology-driven infrastructure for the creation, maintenance and consumption of knowledge bases from multiple (un/semi)structured data sources. The purpose of this infrastructure is to support the next generation of applications based on insights derived from public data, licensed data, and data sometimes referred to as dark (primarily for effect). I will describe the design and current status of the Content Services platform for (a) scalable infrastructure for continuous large-scale analysis of multiple (un/semi)structured sources to create an integrated view of entities, relationships and events, including support for incremental processing, flow automation, monitoring, failure recovery and versioning, and (b) flexible knowledge representation and querying over the structured representation of enriched data. At appropriate points in the talk I will discuss research challenges, briefly describing the current work in IBM Research to address these challenges. |
Bio: | Shivakumar Vaithyanathan is an IBM Fellow and Director, Watson Content Services. Prior to that he started and managed the Analytics Department at IBM Almaden. Multiple technologies developed under his direction ship with several IBM products and have also been released as open source. He has co-authored more than 40 papers in major conferences, including ACL, EMNLP, SIGMOD, VLDB, ICML, NIPS and UAI. |
16 May 2016, 10:30 am, DC 1302
Title: | Connecting Searching with Learning |
Speaker: | Kevyn Collins-Thompson, University of Michigan |
Abstract: | While search engines are widely used to find educational material, current search technology is optimized to provide information of generic relevance, not results that are oriented toward a user's learning goals. As a result, users often do not get effective access to the materials best suited for their specific learning needs. Moreover, little is known about the relationship between search interaction over time and actual learning outcomes. With collaborators, I have been exploring new content representations, implicit assessment methods, interaction features, and retrieval algorithms for search engines toward better understanding and support of human learning, broadly defined. This talk will summarize progress from recent projects oriented toward that goal, including a study of search ranking algorithms that incorporate learning-related features such as reading difficulty and concept density, and user studies exploring the relationship between search interaction patterns and learning outcomes. |
Bio: | Kevyn Collins-Thompson is an Associate Professor of Information and Computer Science at the University of Michigan. His research explores algorithms and software systems for optimally connecting people with information, especially toward educational goals. His research on personalization has been applied to real-world systems ranging from intelligent tutoring systems to Web search engines. Kevyn has also pioneered techniques for modeling the reading difficulty of text, and understanding and supporting how people learn language. He received his Ph.D. from the School of Computer Science at Carnegie Mellon University and his B.Math in Computer Science from the University of Waterloo. Before joining the University of Michigan in 2013 he was a researcher in the Context, Learning, and User Experience for Search (CLUES) group at Microsoft Research. |
7 July 2016, 2:00 pm, DC 2585 (please note unusual room and time)
12 July 2016, 2:00 pm, DC 1304 (please note unusual room and time)
Title: | ML for IR: Sentiment Analysis and Multi-label Categorization |
Speaker: | Jay Aslam, Northeastern University |
Abstract: | We consider two problems in information retrieval, sentiment analysis and multi-label categorization, and we explore the use of machine learning techniques to solve each of these problems. In sentiment analysis, we demonstrate the utility of skip-gram features and the use of L1 and L2 regularization within machine learning in order to effectively accomplish feature selection and predictive generalization. In multi-label categorization, where one must assign an object such as a text document to an appropriate subset of possible labels, we introduce a new technique based on conditional Bernoulli mixtures and demonstrate its utility on a number of benchmark data sets. |
Bio: | Jay Aslam is a Professor and Associate Dean of Faculty in the College of Computer and Information Science at Northeastern University. Prior to joining Northeastern University, he was on the faculty at Dartmouth College. Prof. Aslam obtained his PhD in Computer Science from MIT, and he held a postdoctoral position at Harvard University. Prof. Aslam's research interests include information retrieval, machine learning, and the design and analysis of algorithms. In machine learning, he has developed models and algorithms for learning in the presence of noisy or erroneous training data, and he has explored the use of machine learning to solve problems in transportation, computer security, wireless networking, human computation, and medical informatics. In information retrieval, he has applied techniques from machine learning, statistics, information theory, and social choice theory to develop algorithms for efficient search engine training and evaluation, metasearch, automatic information organization, and learning-to-rank. Prof. Aslam served as the General co-Chair for the 2009 ACM SIGIR Conference on Research and Development in Information Retrieval, and he currently serves as the Program co-Chair for SIGIR 2016. |