Database Seminar Series (2010-2011)
The Database Seminar Series provides a forum for presentation and
discussion of interesting and current database issues. It complements our
internal database meetings by bringing in external colleagues. The talks
scheduled for 2010-2011 are listed below.
The talks are usually held on a Wednesday at 2:30pm. Unless
otherwise noted, all talks will be in room DC 1302. Coffee will be
served 30 minutes before the talk.
We will try to post the presentation notes whenever possible.
Please click on the presentation title to access these notes (usually in PDF format).
The Database Seminar Series is supported by Sybase iAnywhere.
29 September 2010, 3:00 pm
(Please note change in starting time)
||Mediating Human-to-Human Interactions through Social Media Technology
Kelly Lyons, University of Toronto
||With the increase in globalization, an increasing number of
companies conduct business across distance. Companies are
distributed across geographic boundaries and time zones, and
individual knowledge workers are working from home or otherwise
telecommuting. Companies are choosing to save money by reducing
the number of meetings involving travel and are further motivated
by a desire to reduce their carbon footprints. This means that
many aspects of business engagement between people (including
research collaborations, decision making, and even software
development) are taking place over distance supported by some
combination of technologies including teleconferences, video
conferences, electronic meeting software, and various
collaborative platforms. In this talk, we present current
research projects looking into how social media technology can
support human interactions over distance.
||Kelly Lyons is an Associate Professor in the Faculty of
Information at the University of Toronto. Prior to joining the
Faculty of Information, she was the Program Director of the IBM
Toronto Lab Centre for Advanced Studies (CAS). Her current
research interests include service science, social computing, and
collaboration. Currently, she is focusing on technologies, work
practices, and business models that support and mediate
human-to-human interactions in service systems. Kelly holds a
cross-appointment with the Department of Computer Science at the
University of Toronto, is a member of the University of Toronto's
Knowledge Media Design Institute, is an IBM Faculty Fellow,
Member-at-Large of the ACM Council, and a member of the ACM-W
Executive Committee. More details can be found on her webpage at:
14 October 2010, 2:30 pm, DC 1304
(Please note change in day and room)
||Semi-Automatic Index Tuning for Database Systems
Neoklis Polyzotis, University of California, Santa Cruz
||Database systems rely heavily on indexes in order to achieve
good performance. Selecting the appropriate indexes is a difficult
optimization problem, and modern database systems are equipped
with automated methods that recommend indexes based on some type
of workload analysis. Unfortunately, current methods either
require advanced knowledge of the database workload, or force the
administrator to relinquish control of which indexes are
created. This talk will summarize our recent work in
semi-automatic index tuning, a novel index recommendation
technique that addresses the shortcomings of previous
methods. Semi-automatic tuning leverages techniques from online
optimization, which allows us to prove strong bounds on the
quality of its recommendations. The experimental results show that
semi-automatic tuning outperforms previous methods by a large
margin, offering index recommendations that achieve close to
optimal savings in workload evaluation time.
||Neoklis Polyzotis is currently an associate professor at UC
Santa Cruz. His research focuses on database systems, and in
particular on on-line database tuning, scientific data management,
and cloud computing. He received an NSF CAREER award in 2004 and
IBM Faculty Awards in 2005 and 2006. He was also the runner-up for
best paper at VLDB 2007 and received the best newcomer paper award
at PODS 2008. He received his PhD from the
University of Wisconsin at Madison in 2003.
1 December 2010, 2:30 pm, MC 5136
(Please note change in room)
||A New Join Algorithm
||Goetz Graefe, HP Labs
Database query processing traditionally relies on three alternative
join algorithms: index nested loops join exploits an index on its
inner input, merge join exploits sorted inputs, and hash join exploits
differences in the sizes of the join inputs. Cost-based query
optimization chooses the most appropriate algorithm for each query and
for each operation. Unfortunately, mistaken algorithm choices during
compile-time query optimization are common yet expensive to
investigate and to resolve.
Our goal is to end mistaken choices among join algorithms by replacing
the three traditional join algorithms with a single one. Like merge
join, this new join algorithm exploits sorted inputs. Like hash join,
it exploits different input sizes for unsorted inputs. In fact, for
unsorted inputs, the cost functions for recursive hash join and for
hybrid hash join have guided our search for the new join algorithm. In
consequence, the new join algorithm can replace both merge join and
hash join in a database management system.
The in-memory components of the new join algorithm employ indexes. If
the database contains indexes for one (or both) of the inputs, the new
join can exploit persistent indexes instead of temporary in-memory
indexes. Using database indexes to match input records, the new join
algorithm can also replace index nested loops join.
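For readers unfamiliar with the three traditional join algorithms named in the abstract, the following is a minimal textbook sketch in Python. It is not the new algorithm presented in the talk. The row layout (tuples whose first field is the join key) and all function names are assumptions made for illustration.

```python
from collections import defaultdict


def index_nested_loops_join(outer, inner_index):
    """Probe a prebuilt index (dict: key -> list of rows) once per outer row."""
    return [(o, i) for o in outer for i in inner_index.get(o[0], [])]


def hash_join(build, probe):
    """Build a hash table on the (ideally smaller) input, then probe it."""
    table = defaultdict(list)
    for row in build:
        table[row[0]].append(row)
    return [(b, p) for p in probe for b in table.get(p[0], [])]


def merge_join(left, right):
    """Join two inputs that are already sorted on the key (first field)."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of rows on the right that share this key.
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            # Pair every left row holding this key with the whole run.
            while i < len(left) and left[i][0] == lk:
                out.extend((left[i], right[k]) for k in range(j, j_end))
                i += 1
            j = j_end
    return out
```

All three functions return the same set of matching row pairs; they differ only in what they require of their inputs (an index, sorted order, or neither), which is the choice a cost-based optimizer has to make.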
Goetz Graefe is a member of the Information Analytics Lab within
Hewlett-Packard Laboratories. His experience and expertise are focused
on database management systems, gained in academic research,
industrial consulting, and industrial product development.
Goetz's areas of expertise within database management systems cover
compile-time query optimization including extensible query
optimization, run-time query execution including parallel query
execution, indexing, and transactions. He has also worked on
transactional memory, specifically techniques for software
implementations of transactional memory.
Goetz pursued undergraduate studies in business and in computer
science at multiple German universities. In 1983, he was admitted to
the University of Wisconsin - Madison, where he was granted an MS
degree in 1984 and a Ph.D. in 1987.
23 March 2011, 2:30 pm, MC 6005
(Please note change in room)
||Data Consistency in Scalable Multi-tier Architectures
||Bettina Kemme, McGill University
||Most transactional e-commerce applications are implemented in
multi-tier architectures where the application tier implements
business logic and the database tier maintains persistent data. Each
of the tiers might be replicated for performance. The application tier
usually coordinates transaction execution across tiers, and caches
frequently accessed data. Often, it has its own concurrency control
mechanism that provides various degrees of isolation offering a
trade-off between consistency and performance.
While developers are aware that choosing lower levels of isolation
might lead to inconsistencies, there is often no understanding of how
often they occur. Furthermore, inconsistencies are often detected very
late, when reconciliation becomes an expensive task. Finally,
although a multi-component system might claim to offer a certain level
of isolation, it might actually fail to do so, as distribution aspects
are often not taken into account.
In this talk, I will present approaches that address these issues. I
will present solutions that automatically detect, quantify and
categorize consistency anomalies during run-time of multi-tier
applications. These approaches do not need to know anything about the
applications themselves, and are fully implemented in the application
server tier, in our case, JEE-compliant servers. I will also discuss
application server distribution strategies for various levels of
||Bettina Kemme is an Associate Professor at the School of Computer
Science of McGill University, Montreal where she leads the distributed
information systems lab. She holds degrees in Computer Science from
ETH Zurich (PhD) and Friedrich-Alexander-Universitaet Erlangen,
Germany (Inf.-Diplom). She was a recipient of the VLDB 10-year paper
award in 2010. Her research focuses on large-scale data management
with a main focus on data consistency.
6 April 2011, 2:30 pm
||Scalable Database Systems for a Machine-Dominated World
||Daniel Abadi, Yale University
||As machines slowly replace humans as the primary source of data
generation and transaction initiators, we enter a new era where data
generation and transaction processing increase at the speed of Moore's law,
permanently creating a need for scalable data management systems. In this
talk, I will present the architecture of two scalable data management
systems we have built in my group at Yale: the first is a scalable system
optimized for data analysis called HadoopDB that attempts to combine the
scalability of batch-processing systems such as Hadoop with the interactive
performance of parallel database systems. The talk will overview the ideas
from the initial paper on HadoopDB, and then will discuss some recent
The second system is designed for scalable transactional processing, with a
particular focus on the hard problem of achieving high throughput in
non-partitionable workloads. The basic idea is to replace the concurrency
control component of database systems with a deterministic protocol, and use
multiple such systems as building blocks for scalable transaction
processing. Such an approach enables low-cost consistent replication while
improving transactional throughput by eliminating two-phase commit. I will
present the basic architecture of the system in addition to some promising
early results on transactional processing benchmarks (e.g., TPC-C).
||Daniel Abadi is an assistant professor of computer science at Yale
University. Before joining the Yale faculty three and a half years ago, he
spent four years at the Massachusetts Institute of Technology where he
received his Ph.D. Abadi has been a recipient of a Churchill Scholarship, an
NSF CAREER Award, the 2008 SIGMOD Jim Gray Doctoral Dissertation Award, and
the 2007 VLDB best paper award. He tweets at @daniel_abadi.
4 May 2011, 2:30 pm
||Elastic Scalability of Data-intensive Applications in the Cloud
||Divyakant Agrawal, University of California, Santa Barbara
||Over the past two decades, database and systems researchers have made significant advances in the development of algorithms and techniques to provide data management solutions that carefully balance the three major requirements when dealing with critical data: high availability, reliability, and data consistency. However, over the past few years the data requirements, in terms of data availability and system scalability, from Internet-scale enterprises that provide services and cater to millions of users have been unprecedented. Cloud computing has emerged as an extremely successful paradigm for deploying Internet and Web-based applications. Scalability, elasticity, pay-per-use pricing, and autonomic control of large-scale operations are the major reasons for the success and widespread adoption of cloud infrastructures. Currently proposed solutions to scalable data management, driven primarily by prevalent application requirements, significantly downplay the data consistency requirements and instead focus on high scalability and resource elasticity to support data-rich applications for millions to tens of millions of users. In particular, the "newer" data management systems limit consistent access to the granularity of single objects, rows, or keys, thereby significantly trading off consistency in order to achieve very high scalability and availability. But the growing popularity of "cloud computing", the resulting shift of a large number of Internet applications to the cloud, and the quest towards providing data management services in the cloud have opened up the challenge of designing data management systems that provide consistency guarantees at a granularity which goes beyond single rows and keys. In this talk, we analyze the design choices that allowed modern scalable data management systems to achieve orders of magnitude higher levels of scalability compared to traditional databases. With this understanding, we highlight some design principles for data management systems that can be used to augment existing databases with new cloud features such as scalability, elasticity, and autonomy. We also present recent advances that have been made to strike a middle ground between the two radically different data management architectures: traditional database management systems where the data is treated as a "whole" versus modern key-value stores where data is treated as a collection of independent "granules".
||Dr. Divyakant Agrawal is a Professor of Computer Science at the University of California at Santa Barbara. His research expertise is in the areas of database systems, distributed computing, data warehousing, and large-scale information systems. Dr. Agrawal served as the Chair of the Computer Science Department at UCSB from 1999 to 2003. From January 2006 through December 2007, Dr. Agrawal served as VP of Data Solutions and Advertising Systems at the Internet search company ASK.com. Dr. Agrawal also served as a Visiting Senior Research Scientist at the NEC Laboratories of America in Cupertino, CA from 1997 to 2009. During his professional career, Dr. Agrawal has served on numerous Program Committees of International Conferences, Symposia, and Workshops, served as an editor of the journal Distributed and Parallel Databases (1993-2008) and the VLDB Journal (2003-2008), and currently serves on the editorial boards of the Proceedings of the VLDB and ACM Transactions on Database Systems. He recently served as the Program Chair of the 2010 ACM International Conference on Management of Data and as the General Chair of the 2010 ACM SIGSPATIAL Conference on Advances in Geographical Information Systems. Dr. Agrawal organized an NSF Workshop on the Science of Cloud Computing in March 2011, is serving as the General Co-Chair of the ACM SIGSPATIAL Conference on Advances in GIS (ACM GIS 2011), and is serving as the Program Co-Chair of the ACM Workshop on Large Scale Distributed Systems and Middleware (ACM LADIS 2011). Dr. Agrawal's research philosophy is to develop data management solutions that are theoretically sound and relevant in practice. He has published 300+ research manuscripts in prestigious forums (journals, conferences, symposia, and workshops) on a wide range of topics related to data management and distributed systems and has advised more than 30 doctoral students during his academic career. Recently, Dr. Agrawal has been recognized as an Association for Computing Machinery (ACM) Distinguished Scientist. His current interests are in the area of scalable data management and data analysis in cloud computing environments, security and privacy of data in the cloud, and scalable analytics over social network data and social media.
Monday, 27 June 2011, 10:30 am
(Please note change in day and starting time)
||MADDER and Self-Tuning Data Analytics on Hadoop with Starfish
||Shivnath Babu, Duke University
Timely and cost-effective analytics over "big data" is now a key
ingredient for success in many businesses, scientific and engineering
disciplines, and government endeavors. The Hadoop software
stack---which consists of an extensible MapReduce execution engine,
pluggable distributed storage engines, and a range of procedural to
declarative interfaces---is a popular choice for big data analytics.
Most practitioners of big data analytics---like computational
scientists, systems researchers, and business analysts---lack the
expertise to tune the system to get good performance. Unfortunately,
Hadoop's performance out of the box leaves much to be desired, leading
to suboptimal use of resources, time, and money (in pay-as-you-go
clouds). We introduce Starfish, a self-tuning system for big data
analytics. Starfish builds on Hadoop while adapting to user needs and
system workloads to provide good performance automatically, without
any need for users to understand and manipulate the many tuning knobs
in Hadoop. While Starfish's system architecture is guided by work on
self-tuning database systems, we discuss how new analysis practices
(dubbed the MADDER principles) over big data pose new challenges,
leading us to different design choices in Starfish.
Shivnath Babu is an Assistant Professor of Computer Science at
Duke University. He received his Ph.D. from Stanford University in 2005. He
has received a U.S. National Science Foundation CAREER Award and three
IBM Faculty Awards. His research interests include making
data-intensive computing systems easier to manage, automated cluster
sizing and problem diagnosis for systems running on cloud platforms,
as well as automated detection and recovery from data corruption
caused by hardware faults, software bugs, or human mistakes.
29 June 2011, 2:30 pm
||On Schema Discovery
||Renee Miller, University of Toronto
||Structured data is distinguished from unstructured data by the presence of a schema describing the logical structure and semantics of the data. The schema is the means through which we understand and query the underlying data. Schemas enable data independence. In this talk, I consider a few problems related to the discovery and maintenance of schemas. I'll discuss the changing role of schemas from prescriptive to descriptive and new applications of schemas in data curation and data quality. This talk is based on joint work with Fei Chiang, Periklis Andritsos, and Oktie Hassanzadeh.
Renée J. Miller is a professor of computer science and the Bell Canada
Chair of Information Systems at the University of Toronto. She
received the US Presidential Early Career Award for Scientists and
Engineers (PECASE), the highest honor bestowed by the United States
government on outstanding scientists and engineers beginning their
careers. She received an NSF CAREER Award, the Premier's Research
Excellence Award, and an IBM Faculty Award. She is a fellow of the
ACM. Her research interests are in the efficient, effective use of
large volumes of complex, heterogeneous data. This interest spans
data integration and exchange, inconsistent and uncertain data
management, and data curation and cleaning. She serves on the Board
of Trustees of the VLDB Endowment and was elected to serve as VLDB
President beginning January 2010. She is also serving as the PC Chair
for SIGMOD 2011. She leads a Canada-wide Strategic Research
Network on Business Intelligence. She received her PhD in Computer
Science from the University of Wisconsin, Madison and bachelor's
degrees in Mathematics and Cognitive Science from MIT.
6 July 2011, 2:30 pm, MC 5136
(Please note change in room)
||Uncertain Schema Matching: the Power of not Knowing
||Avigdor Gal, Technion
||Schema matching is the task of providing correspondences between concepts describing the meaning of data in various heterogeneous, distributed data sources. Schema matching is one of the basic operations required by the process of data and schema integration, and thus has a great effect on its outcomes, whether these involve targeted content delivery, view integration, database integration, query rewriting over heterogeneous sources, duplicate data elimination, or automatic streamlining of workflow activities that involve heterogeneous data sources. Although schema matching research has been ongoing for over 25 years, only recently has a realization emerged that schema matchers are inherently uncertain. Since 2003, work on uncertainty in schema matching has picked up, along with research on uncertainty in other areas of data management. This lecture presents the benefits of modelling schema matching as an uncertain process and shows a single unified framework for it. We also briefly cover two common methods that have been proposed to deal with uncertainty in schema matching, namely ensembles and top-K matchings, and discuss the applicability of this research to NisB, a European project offering a toolkit for enterprise integration. The talk is based on a recent manuscript, part of the Synthesis Lectures on Data Management by Morgan & Claypool.
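To make the idea of a matcher ensemble concrete, here is a minimal Python sketch under stated assumptions. The two matchers, the unweighted averaging, and all function names are hypothetical and are not taken from the talk or the manuscript. The ranking step is a deliberate simplification: it lists the k best candidates for a single attribute, whereas top-K matchings rank alternative matchings between whole schemas.

```python
from difflib import SequenceMatcher


def string_matcher(a, b):
    """Character-level similarity of two attribute names, in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def token_matcher(a, b):
    """Jaccard similarity of the underscore-separated name tokens."""
    ta, tb = set(a.lower().split("_")), set(b.lower().split("_"))
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def ensemble_score(a, b, matchers=(string_matcher, token_matcher)):
    """Combine several uncertain matchers by an unweighted average."""
    return sum(m(a, b) for m in matchers) / len(matchers)


def ranked_candidates(attr, target_attrs, k=2):
    """Return the k best (score, target attribute) pairs for one attribute."""
    scored = [(ensemble_score(attr, t), t) for t in target_attrs]
    return sorted(scored, reverse=True)[:k]
```

Keeping several ranked candidates instead of a single best guess is one simple way to carry a matcher's uncertainty forward to a later decision step.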
||Avigdor Gal is an Associate Professor at the Faculty of Industrial Engineering & Management at the Technion - Israel Institute of Technology. He received his D.Sc. degree from the Technion in 1995 in the area of temporal active databases. He has published more than 95 papers in journals (e.g. Journal of the ACM (JACM), ACM Transactions on Database Systems (TODS), IEEE Transactions on Knowledge and Data Engineering (TKDE), ACM Transactions on Internet Technology (TOIT), and the VLDB Journal), books (Schema Matching and Mapping) and conferences (ICDE, ER, CoopIS, BPM) on the topics of data integration, temporal databases, information systems architectures, and active databases. Avigdor is a member of the CoopIS (Cooperative Information Systems) Advisory Board, a member of IFIP WG 2.6, and a recipient of the IBM Faculty Award for 2002-2004. He is a member of the ACM and a senior member of the IEEE. Avigdor has served as a Program Co-Chair and General Chair of CoopIS and DEBS, and in various roles in ER and CIKM. He has served as a program committee member for SIGMOD, VLDB, ICDE and others. Avigdor is an Area Editor of the Encyclopedia of Database Systems.
13 July 2011, 2:30 pm, MC 2018B
(Please note change in room)
||Data Integration in the Cloud
||Andreas Thor, University of Maryland
||Cloud computing has become a popular paradigm for efficiently processing computationally and data-intensive tasks. Such tasks can be executed on demand on powerful distributed hardware and service infrastructures. The parallel execution of complex tasks is facilitated by different programming models (e.g., MapReduce), distributed data stores, and the ability to employ computing capacity on demand. Data integration can notably benefit from cloud computing because accessing multiple data sources and integration of instance data are usually expensive tasks.
In the first part of the talk we introduce CloudFuice, a data integration system that follows a mashup-like specification of advanced data flows for data integration. CloudFuice's task-based execution approach allows for an efficient, asynchronous, and parallel execution of data flows in the cloud and utilizes recent cloud-based web engineering instruments. The second part of the talk deals with the effectiveness and scalability of MapReduce-based implementations for entity resolution. In the presence of skewed data, sophisticated redistribution approaches become necessary to achieve load balancing among all reduce tasks to be executed in parallel. The proposed approaches support blocking techniques to reduce the search space of entity resolution and effectively distribute the entities of large blocks among multiple reduce tasks.
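The blocking idea mentioned above can be illustrated with a short Python sketch. It is a single-machine illustration under stated assumptions, not the MapReduce implementation from the talk: the blocking key (first three letters of a `name` field), the record layout, and the round-robin split of a large block's comparisons are all hypothetical choices.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    # Hypothetical blocking key: first three letters of the lowercased name.
    return record["name"].lower()[:3]


def candidate_pairs(records):
    """Yield only pairs of records that share a block, not all pairs."""
    blocks = defaultdict(list)
    for r in records:
        blocks[blocking_key(r)].append(r)
    for block in blocks.values():
        yield from combinations(block, 2)


def split_block(block, n_tasks):
    """Spread one large block's comparisons over n_tasks, round-robin."""
    tasks = [[] for _ in range(n_tasks)]
    for k, pair in enumerate(combinations(block, 2)):
        tasks[k % n_tasks].append(pair)
    return tasks
```

Without blocking, n records require n(n-1)/2 comparisons. With blocking, only records in the same block are compared, but one very large block can still overload a single worker, which is why the comparisons of such a block are split across several tasks.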
||Andreas Thor (http://dbs.uni-leipzig.de/de/person/andreas_thor) received a Diploma and a Ph.D. in Computer Science in 2002 and 2008, respectively, from the University of Leipzig, Germany. He holds an appointment as Research Scientist with the database group in Leipzig. Andreas is currently a visiting research scientist at the University of Maryland Institute for Advanced Computer Studies. Andreas' research deals with the integration of web data sources. More specifically, he has been working on approaches for entity resolution, ontology alignment, and flexible integration architectures.
Last modified: Wednesday, 13-Aug-2014 10:15:04 EDT