The Data Systems Seminar Series provides a forum for presentation and discussion of interesting and current database issues. It complements our internal database meetings by bringing in external colleagues. The talks that are scheduled for this year are listed below.
The talks are usually held on a Monday at 10:30am in room DC 1302. Exceptions are flagged.
We will try to post the presentation notes, whenever that is possible. Please click on the presentation title to access these notes.
The Database Seminar Series is supported by
|Title:||Enumerating Tree Decompositions: Why and How|
|Speaker:||Benny Kimelfeld, Technion|
Many intractable problems on graphs have efficient solvers when graphs are trees or forests. Tree decompositions often allow to apply such solvers to general graphs by grouping nodes into bags laid out in a tree structure, thereby decomposing the problem into the sub-problems induced by the bags. This approach has applications in a plethora of domains, partly because it allows the optimize inference on probabilistic graphical models, as well as evaluation of database queries. Nevertheless, a graph can have exponentially many tree decompositions and finding an ideal one is challenging, for two main reasons. First, the measure of goodness often depends on subtleties of the specific application at hand. Second, theoretical hardness is met already for the simplest measures such as the maximal size of bag (a.k.a. “width”). Therefore, we explore the approach of producing a large space of high-quality tree decompositions for the application to choose from.
I will describe our application of tree decompositions in the context of “worst-case optimal” joins --- a new breed of in-memory join algorithms that satisfy strong theoretical guarantees and were found to feature a significant speedup compared to traditional approaches. Specifically, I will explain how this development led us to the challenge of enumerating tree decompositions. Then, I will describe a novel enumeration algorithm for tree decompositions with a theoretical guarantee on the delay (the time between consecutive answers), and an experimental study thereof (on graphs from various relevant domains). Finally, I will describe recent results that provide guarantees on both the delay and the quality of the generated tree decompositions.The talk will be based on papers that appeared in EDBT 2017 and PODS 2017, co-authored with Nofar Carmeli, Yoav Etsion, Oren Kalinsky and Batya Kenig.
|Bio:||Benny Kimelfeld is an Associate Professor at Technion, Israel. In the past he has been at LogicBlox and at IBM Research – Almaden. His research interests are around aspects of data management, such as database theory and systems, algorithms for query evaluation, information extraction, information retrieval, data mining, and database uncertainty. He received his Ph.D. in Computer Science from The Hebrew University of Jerusalem, under the supervision of Prof. Yehoshua Sagiv.|
|Title:||Universal Information Extraction|
|Speaker:||Heng Ji, Rensselaer Polytechnic Institute|
|Abstract:||The goal of Information Extraction (IE) is to extract structured facts from a wide spectrum of heterogeneous unstructured data types including texts, speech, images and videos. Traditional IE techniques are limited to a certain source X (X = a particular language, domain, limited number of pre-defined fact types, single data modality...). When we move from X to a new source Y, we need to start from scratch again by annotating a substantial amount of training data and developing Y specific extraction capabilties. We propose a new Universal Information Extraction (IE) paradigm to combine the merits of traditional IE (high quality and fine granularity) and Open IE (high scalability). This framework aims to discover schemas and extract facts from any input corpus, without any annotated training data or predefined schema. It can also be extended to multiple data modalities (images, videos) and 282 languages by constructing a common semantic space and transfer learning across sources.|
|Bio:||Heng Ji is Edward P. Hamilton Development Chair Professor in Computer Science Department of Rensselaer Polytechnic Institute. She received her Ph.D. in Computer Science from New York University. Her research interests focus on Natural Language Processing, especially on Information Extraction and Knowledge Base Population. She was selected as "Young Scientist" and a member of the Global Future Council on the Future of Computing by the World Economic Forum in 2016 and 2017. She received "AI's 10 to Watch" Award by IEEE Intelligent Systems in 2013, NSF CAREER award in 2009, Google Research Awards in 2009 and 2014, IBM Watson Faculty Award in 2012 and 2014, Bosch Research Awards in 2015 and 2016. She coordinated the NIST TAC Knowledge Base Population task since 2010. She is now serving as the Program Committee Co-Chair of NAACL2018.|
|Title:||Enabling Data Science for the 99%|
|Speaker:||Aditya Parameswaran, University of Illinois-Urbana Champaign|
|Abstract:||There is a severe lack of interactive tools to help people manage, analyze, and make sense of large datasets. This talk will briefly cover three tools under development in our research group (with collaborators at Illinois, MIT, Maryland, and Chicago) that empower individuals and teams to perform interactive data analysis more effectively. The three tools span the spectrum of analyses types --- from browsing with DataSpread, a spreadsheet-database hybrid, to exploration with ZenVisage, a effortless visualization recommendation tool, and finally to analysis and collaboration with Orpheus, a database system that supports versioning as a first-class citizen.|
|Bio:||Aditya Parameswaran is an Assistant Professor in Computer Science at the University of Illinois (UIUC). He spent a year as a PostDoc at MIT CSAIL following his PhD at Stanford University, before starting at Illinois in August 2014. He develops systems and algorithms for "human-in-the-loop" data analytics, synthesizing techniques from database systems, data mining, and human computation. Aditya received the NSF CAREER Award, the TCDE Early Career Award, the C. W. Gear Junior Faculty Award from Illinois, multiple "best" Doctoral Dissertation Awards (from SIGMOD, SIGKDD, and Stanford), an "Excellent" Lecturer award from Illinois, a Google Faculty award, the Key Scientific Challenges award from Yahoo!, and multiple best-of-conference citations. He is an associate editor of SIGMOD Record and serves on the steering committee of the HILDA (Human-in-the-loop Data Analytics) Workshop. His research group is supported with funding from the NSF, the NIH, Adobe, the Siebel Energy Institute, and Google.|
|Title:||Citizen-Sourced Data for Public Health Modeling|
|Speaker:||Rumi Chunara, New York University|
|Abstract:||Knowledge generation through crowdsourcing is becoming increasingly possible and useful in many domain areas; yet requires new method development given the observational, unstructured and noisy nature of citizen-sourced data. In this talk I will discuss statistical and machine learning methods we are developing to integrate crowdsourced data into public health models. This includes, combining citizen-sourced and clinical data, accounting for biases, drawing inference from observational data, and generating relevant features. Examples will use empirical data from local and worldwide contexts.|
|Bio:||Rumi Chunara is an Assistant Professor at New York University, jointly appointed in Computer Science and in Global Public Health. Her research interests combine data mining and machine learning with social and ubiquitous computing. Specifically she focuses on feature extraction from and statistical modeling of unstructured and observational personally-generated data -- for epidemiological applications. She received her Ph.D. from MIT and was named an MIT Technology Review Innovator Under 35 in 2014.|
|Title:||Speedup Set Intersections in Graph Algorithms using SIMD Instructions|
|Speaker:||Lei Zou, Peking University|
|Abstract:||In this talk, I focus on accelerating a widely employed computing pattern— set intersection, to boost a group of relevant graph algorithms. Graph’s adjacency-lists can be naturally considered as node sets, thus set intersection is a primitive operation in many graph algorithms. We propose QFilter, a set intersection algorithm using SIMD instructions. QFilter adopts a merge-based framework and compares two blocks of elements iteratively by SIMD instructions. The key insight for our improvement is that we quickly filter most of unnecessary comparisons in one byte-checking step. We also present a binary representation called BSR that represent sets in a compact layout. From the combination of QFilter and BSR, we achieve dataparallelism in two levels — inter-chunk and intra-chunk parallelism. Furthermore, we find that node ordering impacts the performance of intersection in graph algorithms by affecting the compactness of BSR. We formulate the graph reordering problem as an optimization of the compactness of BSR, and prove its strong NP-completeness. Thus we propose an approximate algorithm that can find a better ordering and improve the performance by 39% on average. This work is accepted to SIGMOD 2018.|
|Bio:||Lei Zou is a Professor in the Institute of Computer Science and Technology (ICST) of Peking University (PKU). He joined PKU in 2009 after receiving his BS degree and Ph.D. degree in Computer Science at Huazhong University of Science and Technology (HUST) in 2003 and 2009, respectively. He received a CCF (China Computer Federation) Doctoral Dissertation Nomination Award in 2009, won Second Class Prize of CCF Natural Science Award in 2014 and Second Class Prize of Natural Science of the Ministry of Education, China in 2017. During his PhD, Lei Zou visited Hong Kong University of Science and Technology in 2007 and University of Waterloo in 2008 as a visiting scholar. His recent research interests include graph databases, knowledge graph, particularly in graph-based RDF data management. He has published more than 40 papers, including more than 30 papers published in reputed journals and major international conferences, such as SIGMOD, VLDB, ICDE, AAAI, TODS, TKDE, VLDB Journal.|
Making Approximate Query Processing Mainstream: Progress and the Road Ahead
|Speaker:||Barzan Mozafari, University of Michigan|
|Abstract:||Approximate Query Processing (AQP) has been a subject of academic research for over 25 years now. However, until recently, it has had little success in terms of commercial adoption. In talk, we explain the interface and deployment barriers that have historically slowed down the adoption of AQP by database vendors and enterprise users alike. We then discuss some of the recent advances that have successfully overcome some of these barriers. We also introduce several research directions and exciting opportunities that would not be possible in a database with precise answers. In particular, we explore several opportunities at the intersection of statistics and data management, including our Database Learning vision--- a database system that learns and becomes smarter over time---as well as novel abstractions for speeding up machine learning workloads through approximate operators and error-computation tradeoffs.|
|Bio:||Barzan Mozafari is a Morris Wellman Assistant Professor of Computer Science and Engineering at the University of Michigan, Ann Arbor, where he leads a research group designing the next generation of scalable databases using advanced statistical models. Prior to that, he was a Postdoctoral Associate at MIT. He earned his Ph.D. in Computer Science from UCLA in 2011. His research career has led to several open-source projects, including DBSeer (an automated database diagnosis tool), BlinkDB (a massively parallel approximate query engine), and SnappyData (an HTAP engine that empowers Apache Spark with transactions and real-time analytics). He has won the National Science Foundation CAREER award, as well as several best paper awards in ACM SIGMOD and EuroSys. He is the founder of Michigan Software Experts, and a strategic advisor to SnappyData, a company that commercializes the ideas introduced by BlinkDB.|
|Speaker:||Rachel Pottinger, University of British Columbia|
|Abstract:||Users are faced with an increasing onslaught of data, whether it's in their choices of movies to watch,
assimilating data from multiple sources, or finding information relevant to their lives on open data
registries. In this talk I discuss some of the recent and ongoing work about how to improve understanding
and exploration of such data, particularly by users with little database background.
|Bio:||Rachel Pottinger is an Associate Professor in Computer Science at the University of British Columbia. She received her PhD in computer science from the University of Washington in 2004. Her main research interest is data management, particularly semantic data integration, how to manage metadata, how to manage data that is currently not well supported by databases, and how to make data easier to understand and explore.|
Next Generation Indexes For Big Data Engineering
|Speaker:||Daniel Lemire, Université Télug|
|Abstract:||Maximizing performance in data engineering is a daunting challenge. We present some of our work on designing faster indexes, with a particular emphasis on compressed indexes. Some of our prior work includes (1) Roaring indexes which are part of multiple big-data systems such as Spark, Hive, Druid, Atlas, Pinot, Kylin, (2) EWAH indexes are part of Git (GitHub) and included in major Linux distributions.
We will present ongoing and future work on how we can process data faster while supporting the diverse systems found in the cloud (with upcoming ARM processors) and under multiple programming languages (e.g., Java, C++, Go, Python). We seek to minimize shared resources (e.g., RAM) while exploiting algorithms designed for the single-instruction-multiple-data (SIMD) instructions available on commodity processors. Our end goal is to process billions of records per second per core.
|Bio:||Daniel Lemire is a computer science professor at the Université du Québec (TELUQ). He has also been a research officer at the National Research Council of Canada and an entrepreneur. He has written over 70 peer-reviewed publications, including more than 40 journal articles. He has held competitive research grants for the last 15 years. He serves on the program committees of leading computer science conferences (e.g., ACM CIKM, WWW, ACM WSDM, ACM SIGIR, ACM RecSys).
|Speaker:||Panos Ipeirotis, NYU|