The Data Systems Seminar Series provides a forum for presentation and discussion of interesting and current database issues. It complements our internal database meetings by bringing in external colleagues. The talks that are scheduled for this year are listed below.
The talks are usually held on a Monday at 10:30am in room DC 1302. Exceptions are flagged.
We will try to post the presentation notes whenever possible. Please click on the presentation title to access these notes.
The Database Seminar Series is supported by
|Panos K. Chrysanthis|
|Title:||Enumerating Tree Decompositions: Why and How|
|Speaker:||Benny Kimelfeld, Technion|
|Abstract:||Many intractable problems on graphs have efficient solvers when graphs are trees or forests. Tree decompositions often make it possible to apply such solvers to general graphs by grouping nodes into bags laid out in a tree structure, thereby decomposing the problem into the sub-problems induced by the bags. This approach has applications in a plethora of domains, partly because it allows optimizing inference on probabilistic graphical models, as well as evaluation of database queries. Nevertheless, a graph can have exponentially many tree decompositions and finding an ideal one is challenging, for two main reasons. First, the measure of goodness often depends on subtleties of the specific application at hand. Second, theoretical hardness is met already for the simplest measures such as the maximal size of a bag (a.k.a. “width”). Therefore, we explore the approach of producing a large space of high-quality tree decompositions for the application to choose from.
I will describe our application of tree decompositions in the context of “worst-case optimal” joins --- a new breed of in-memory join algorithms that satisfy strong theoretical guarantees and were found to feature a significant speedup compared to traditional approaches. Specifically, I will explain how this development led us to the challenge of enumerating tree decompositions. Then, I will describe a novel enumeration algorithm for tree decompositions with a theoretical guarantee on the delay (the time between consecutive answers), and an experimental study thereof (on graphs from various relevant domains). Finally, I will describe recent results that provide guarantees on both the delay and the quality of the generated tree decompositions. The talk will be based on papers that appeared in EDBT 2017 and PODS 2017, co-authored with Nofar Carmeli, Yoav Etsion, Oren Kalinsky and Batya Kenig.
|Bio:||Benny Kimelfeld is an Associate Professor at Technion, Israel. In the past he has been at LogicBlox and at IBM Research – Almaden. His research interests are around aspects of data management, such as database theory and systems, algorithms for query evaluation, information extraction, information retrieval, data mining, and database uncertainty. He received his Ph.D. in Computer Science from The Hebrew University of Jerusalem, under the supervision of Prof. Yehoshua Sagiv.|
|Title:||Universal Information Extraction|
|Speaker:||Heng Ji, Rensselaer Polytechnic Institute|
|Abstract:||The goal of Information Extraction (IE) is to extract structured facts from a wide spectrum of heterogeneous unstructured data types including texts, speech, images and videos. Traditional IE techniques are limited to a certain source X (X = a particular language, domain, limited number of pre-defined fact types, single data modality...). When we move from X to a new source Y, we need to start from scratch again by annotating a substantial amount of training data and developing Y-specific extraction capabilities. We propose a new Universal Information Extraction (IE) paradigm to combine the merits of traditional IE (high quality and fine granularity) and Open IE (high scalability). This framework aims to discover schemas and extract facts from any input corpus, without any annotated training data or predefined schema. It can also be extended to multiple data modalities (images, videos) and 282 languages by constructing a common semantic space and applying transfer learning across sources.|
|Bio:||Heng Ji is the Edward P. Hamilton Development Chair Professor in the Computer Science Department of Rensselaer Polytechnic Institute. She received her Ph.D. in Computer Science from New York University. Her research interests focus on Natural Language Processing, especially on Information Extraction and Knowledge Base Population. She was selected as a "Young Scientist" and a member of the Global Future Council on the Future of Computing by the World Economic Forum in 2016 and 2017. She received the "AI's 10 to Watch" Award from IEEE Intelligent Systems in 2013, an NSF CAREER award in 2009, Google Research Awards in 2009 and 2014, IBM Watson Faculty Awards in 2012 and 2014, and Bosch Research Awards in 2015 and 2016. She has coordinated the NIST TAC Knowledge Base Population task since 2010. She is now serving as the Program Committee Co-Chair of NAACL 2018.|
|Title:||Enabling Data Science for the 99%|
|Speaker:||Aditya Parameswaran, University of Illinois at Urbana-Champaign|
|Abstract:||There is a severe lack of interactive tools to help people manage, analyze, and make sense of large datasets. This talk will briefly cover three tools under development in our research group (with collaborators at Illinois, MIT, Maryland, and Chicago) that empower individuals and teams to perform interactive data analysis more effectively. The three tools span the spectrum of analysis types --- from browsing with DataSpread, a spreadsheet-database hybrid, to exploration with ZenVisage, an effortless visualization recommendation tool, and finally to analysis and collaboration with Orpheus, a database system that supports versioning as a first-class citizen.|
|Bio:||Aditya Parameswaran is an Assistant Professor in Computer Science at the University of Illinois (UIUC). He spent a year as a PostDoc at MIT CSAIL following his PhD at Stanford University, before starting at Illinois in August 2014. He develops systems and algorithms for "human-in-the-loop" data analytics, synthesizing techniques from database systems, data mining, and human computation. Aditya received the NSF CAREER Award, the TCDE Early Career Award, the C. W. Gear Junior Faculty Award from Illinois, multiple "best" Doctoral Dissertation Awards (from SIGMOD, SIGKDD, and Stanford), an "Excellent" Lecturer award from Illinois, a Google Faculty award, the Key Scientific Challenges award from Yahoo!, and multiple best-of-conference citations. He is an associate editor of SIGMOD Record and serves on the steering committee of the HILDA (Human-in-the-loop Data Analytics) Workshop. His research group is supported with funding from the NSF, the NIH, Adobe, the Siebel Energy Institute, and Google.|
|Title:||Citizen-Sourced Data for Public Health Modeling|
|Speaker:||Rumi Chunara, New York University|
|Abstract:||Knowledge generation through crowdsourcing is becoming increasingly possible and useful in many domain areas, yet it requires new method development given the observational, unstructured and noisy nature of citizen-sourced data. In this talk I will discuss statistical and machine learning methods we are developing to integrate crowdsourced data into public health models. This includes combining citizen-sourced and clinical data, accounting for biases, drawing inference from observational data, and generating relevant features. Examples will use empirical data from local and worldwide contexts.|
|Bio:||Rumi Chunara is an Assistant Professor at New York University, jointly appointed in Computer Science and in Global Public Health. Her research interests combine data mining and machine learning with social and ubiquitous computing. Specifically, she focuses on feature extraction from, and statistical modeling of, unstructured and observational personally-generated data for epidemiological applications. She received her Ph.D. from MIT and was named an MIT Technology Review Innovator Under 35 in 2014.|
|Speaker:||Barzan Mozafari, University of Michigan|
|Speaker:||Panos K. Chrysanthis, University of Pittsburgh|
|Speaker:||Panos Ipeirotis, NYU|