The Database Seminar Series provides a forum for presentation and discussion
of interesting and current database issues. It complements our internal database
meetings by bringing in external colleagues. The talks that are scheduled for
2006-2007 are below, and more will be listed as we get confirmations. Please
send your suggestions to M. Tamer Özsu.
Unless otherwise noted, all talks will be in room DC 1304. Coffee will be
served 30 minutes before the talk.
We will try to post the presentation notes whenever possible. Please
click on the presentation title to access these notes (usually in PDF format).
The Database Seminar Series is supported by iAnywhere Solutions, A Sybase Company.
25 September 2006, 11:00 AM
Title: |
MauveDB: Managing Uncertain Data using Statistical Models |
Speaker: |
Amol Deshpande, University of Maryland |
Abstract: |
Real-world data, especially that generated by distributed measurement
infrastructures such as wireless sensor networks, tends to be incomplete,
imprecise, and erroneous, and hence rarely usable in its raw form. The
traditional approach to dealing with this problem is to first synthesize
(filter) such data using a statistical or a probabilistic model, thus
resulting in a more robust interpretation of the data. However, current
database systems do not provide adequate support for statistical modeling
of data, especially when those models need to be frequently updated as new
data arrives in the system. Hence most scientists and engineers, who
depend on models for managing their data, do not use database systems for
archival or querying at all; at best, databases serve as a persistent raw
data store.
In this talk, I will present our approach to integrating statistical and
probabilistic models into database systems, in the context of data
management in wireless sensor networks. I will first present a data
acquisition approach for wireless sensor networks that demonstrates how
models can be used both to provide more meaningful answers to user
queries, and to significantly reduce the energy cost of acquiring data
from the underlying sensing devices. I will then present our recent work
on the "MauveDB" system, which uses an abstraction called "model-based
views" to seamlessly integrate models into traditional relational database
systems. |
Bio: |
Amol Deshpande is an Assistant Professor at the University of Maryland at College Park. He received his PhD from UC Berkeley in 2004. His research interests are adaptive query processing, sensor network data management, and statistical modeling of data. |
16 October 2006, 10:30 AM (Please note time change)
Title: |
Visions of Data Semantics: Another (and another) Look |
Speaker: |
Alex Borgida, Rutgers University |
Abstract: |
The problem of data semantics is establishing and maintaining a correspondence between a data source (e.g., a database, an XML document) and its intended subject matter. We review the (relatively minor) role data semantics has played in Databases under the term "semantic data models", its more prominent place in ontology-based information integration, and then outline two new views: (i) Semantics as a composition of mappings between models, and (ii) Attaching intensional aspects (stakeholder goals) to Information Systems. In each case we consider the benefits of this view for the important problem of data integration/loading.
Joint work with John Mylopoulos and students (Univ. of Toronto)
|
Bio: |
Alex Borgida holds a PhD degree from the University of Toronto, and is a Professor of Computer Science at Rutgers University, New Brunswick, NJ. His research is mainly concerned with knowledge representation and its applications. He has published in a variety of areas including Artificial Intelligence (description logics, explanation), Databases (exceptions, semantic data models, data mapping), and Software Engineering (requirements modeling, software specification). The main unifying thread of this work is a belief in the importance of languages, which shape the way we think of the problem (an unabashed Whorfian!), and the need to be precise and logical about the semantics of such languages.
Alex is co-recipient of the most influential paper award of the 1994 International Conference on Software Engineering, and is proud to have contributed to the design and implementation of the Classic language/logic, which was used by AT&T as part of a system that configured "billions of dollars' worth of equipment sold". |
27 November 2006, 10:30 AM
Title: |
Warehousing and Mining Massive RFID Data Sets |
Speaker: |
Jiawei Han, University of Illinois at Urbana-Champaign |
Abstract: |
Radio Frequency Identification (RFID) applications are set to play
an essential role in object tracking and supply chain management
systems. In the near future, it is expected that every major
retailer will use RFID systems to track the movement of products
from suppliers to warehouses, store backrooms, and eventually to
points of sale. The volume of information generated by such systems
can be enormous as each individual item (a pallet, a case, or an
SKU) will leave a trail of data as it moves through different
locations. As a departure from the traditional data cube, we
propose a new RFID data warehouse model that preserves object
transitions while providing significant compression and
path-dependent aggregates, based on the following observations: (1)
items usually move together in large groups through early stages in
the system (e.g., distribution centers) and only in later stages
(e.g., stores) do they move in smaller groups, and (2) although RFID
data is registered at the primitive level, data analysis usually
takes place at a higher abstraction level. Techniques for
summarizing data, query processing in RFID data warehouses, RFID
flow-cube construction, and data mining based on this framework are
developed. We also illustrate a few promising research topics for
mining such massive RFID data warehouses.
Besides this technical talk, I will give a short summary of our
recent research work on data mining. |
Bio: |
Jiawei Han is a Professor in the Department of Computer Science at the
University of Illinois at Urbana-Champaign. He has been working on
research into data mining, data warehousing, database systems, data
mining from spatiotemporal data, multimedia data, stream and RFID data,
social network data, and biological data, with over 300 journal and
conference publications. He has chaired or served on over 100
program committees of international conferences and workshops,
including as PC co-chair of the 2005 IEEE International Conference on
Data Mining (ICDM) and Americas Coordinator of the 2006 International
Conference on Very Large Data Bases (VLDB). He is also serving as
the founding Editor-In-Chief of ACM Transactions on Knowledge
Discovery from Data. He is an ACM Fellow and has received the 2004 ACM
SIGKDD Innovations Award and the 2005 IEEE Computer Society Technical
Achievement Award. His book "Data Mining: Concepts and Techniques"
(2nd ed., Morgan Kaufmann, 2006) is widely used as a textbook
worldwide.
|
16 January 2007, 9:30 AM; DC 1331 (New Room, New Time!)
Title: |
Complex Event Processing with Cayuga |
Speaker: |
Johannes Gehrke, Cornell University |
Abstract: |
Publish/subscribe (pub/sub) is a powerful paradigm enabling asynchronous interaction in large distributed settings ranging from Enterprise Application Integration and Internet-scale notification services to processing events from RFIDs and monitoring the blogosphere. By limiting subscriptions to simple filters on message topics or content, pub/sub systems achieve great scalability in the number of publishers and subscribers. However, today's pub/sub is unfortunately rather limited; in particular, users cannot express conditions that involve more than a single message.
In this talk, I will give an overview of the Cornell Cayuga System, a stateful pub/sub system for complex event processing. Cayuga supports powerful subscription features that include maintenance of state across messages, parameterization, and aggregation. Cayuga compiles subscriptions down to simple finite state automata that can be implemented very efficiently. I will conclude with experimental results and first experiences from several prototype deployments. |
Bio: |
Johannes Gehrke is an Associate Professor in the Department of Computer Science at Cornell University and an Associate Director of the Cornell Theory Center. Johannes' research interests are in the areas of data mining, search, data privacy, complex event processing, and applications of database and data mining technology to marketing and the sciences. Johannes has received a National Science Foundation Career Award, an Alfred P. Sloan Fellowship, an IBM Faculty Award, the Cornell College of Engineering James and Mary Tien Excellence in Teaching Award, and the Cornell University Provost's Award for Distinguished Scholarship. He is the author of numerous publications on data mining and database systems, and he co-authored the undergraduate textbook Database Management Systems (McGraw-Hill, 2002, currently in its third edition), used at universities all over the world.
Johannes was co-Chair of the 2003 ACM SIGKDD Cup, Program co-Chair of the 2004 ACM International Conference on Knowledge Discovery and Data Mining (KDD 2004), and he is Program Chair of the 33rd International Conference on Very Large Data Bases (VLDB 2007).
At Cornell, Johannes teaches in the Department of Computer Science, the Information Science Program, and the Johnson Graduate School of Management. He has given courses and tutorials on data mining, data stream processing, and data privacy on Wall Street and all over the world, and he has extensive industry experience as a technical advisor. |
5 February 2007, 10:30 AM
Title: |
Predicate-based Indexing of Annotated Data |
Speaker: |
Donald Kossmann, ETH Zürich |
Abstract: |
In many environments, data is annotated either by humans or by software
applications. A prominent example is the tagging of links and Web pages
as done by users of, e.g., del.icio.us. Other examples include Office
documents (e.g., Word) in which the data is annotated in order to encode
layout, versioning, comments, or references to, say, addresses stored in
Excel. This talk shows why today's generation of search engines does not
support such annotated data well. Furthermore, it shows how today's
search engines can be extended. The idea is to extend inverted file
indices with an additional column that contains predicates. These
predicates encode how to interpret the annotations. The talk
demonstrates how this extended approach can improve search on tagged Web
pages, desktop search, and enterprise search (i.e., Web-based Java
applications). Furthermore, results of preliminary performance
experiments (precision and recall) are presented.
Joint work with Cristian Duda. |
Bio: |
Donald Kossmann is a professor of Computer Science at ETH Zurich (Switzerland). He received his MS in 1991 from the University of Karlsruhe and completed his PhD in 1995 at the Technical University of Aachen. He is a co-founder of i-TV-T, a German company that develops eProcurement applications. His research interests lie in the area of database and information systems; in particular, Web-based information systems and database applications. |
20 February 2007, 2:00 PM - MC 5136 (Please note special time & place.)
Title: |
SwissQM: A Virtual Machine for Sensor Networks |
Speaker: |
Gustavo Alonso, ETH Zürich |
Abstract: |
Sensor networks have become one of the main lines of research in several areas
of computer science. The potential for sensor networks is well known and
numerous applications have been described and are being explored. A lesser-known
fact about wireless sensor networks is that they are very difficult and cumbersome
to program, deploy, and get to work in real settings. Recent experience
reports confirm the many problems encountered, which are caused both by the
nature of the problem and by the lack of appropriate tools and
abstractions to build real systems based on sensor networks. In this talk I will
give an overview of the typical problems encountered and describe some of the
efforts at ETH Zurich to come up with better infrastructures for sensor
networks. In particular, I will describe SwissQM, a virtual machine designed to
run on the sensors that is not only efficient but also offers the necessary
level of abstraction and interface to develop much of the functionality needed
to turn sensor networks into real systems.
Work done in collaboration with Rene Müller and Donald Kossmann. |
Bio: |
Gustavo Alonso is a professor in the Department of Computer Science at the
Swiss Federal Institute of Technology in Zurich (ETHZ). He
holds degrees in Telecommunications Engineering from the Madrid
Technical University (1989) and in computer science (M.S. 1992, Ph.D.
1994) from the University of California at Santa Barbara. Before joining ETH
Zurich, he was a visiting scientist in the IBM Almaden Research
Laboratory in San Jose, California. Currently, Gustavo Alonso leads the
Information and Communication Systems Research Group and is the Chair of
the Institute for Pervasive Computing. For more information on the activities of
the group, please visit www.iks.ethz.ch. |
12 March 2007, 9:00 AM (Please note special time.)
Title: |
Towards Declarative and Efficient Querying on Biological Datasets |
Speaker: |
Jignesh Patel, University of Michigan |
Abstract: |
Modern life sciences explorations often need to analyze and manage large volumes of complex biological data. Unfortunately, existing life sciences applications often employ awkward procedural querying methods and use query evaluation algorithms that do not scale as the data size increases. For example, data is often stored in flat files and queries are expressed and evaluated by programs written in Python. The perils of employing such procedural querying methods are well known to a database audience, namely a) severely limiting the ability to rapidly express complex queries, and b) often resulting in very inefficient query plans as sophisticated query optimization and evaluation methods are not employed. The problem is likely to get worse in the future as many life sciences datasets are growing at a rate faster than Moore's Law. Furthermore, the queries that scientists want to pose are also rapidly increasing in their complexity. The focus of this talk is on a database approach to querying biological datasets. I will describe the ongoing work in the Periscope project in which we are developing a system for declarative and efficient querying on genomic and proteomic datasets. |
Bio: |
Jignesh M. Patel is an Associate Professor at the University of Michigan. He graduated with a PhD from the University of Wisconsin in 1998. Since 1999, he has been a faculty member in the EECS department at the University of Michigan, where his research has focused on bioinformatics, spatial query processing, and XML query processing. He is the recipient of an NSF Career Award and multiple IBM Faculty Awards. He has served on a number of Program Committees including ACM SIGMOD and VLDB, and has served as an Associate Editor for the Systems and Prototype section of ACM SIGMOD Record, a Vice-Chair for the IEEE International Conference on Data Engineering 2005, and an Associate Editor for the IEEE Data Engineering Bulletin. |
30 April 2007, 10:30 AM
Title: |
Self-Managing DBMS Technology: The AutoAdmin Experience |
Speaker: |
Surajit Chaudhuri, Microsoft Research |
Abstract: |
The cost of ownership of any commercial database system is significant. The AutoAdmin project at Microsoft Research was initiated (well before the term Autonomic Computing became a buzzword) to develop techniques to reduce the overhead of database administration. Our goal was to make it easier to monitor the server and develop self-tuning techniques for performance management. The technology from this project has been incorporated into Microsoft SQL Server 2005 (and earlier releases: SQL Server 7.0 and SQL Server 2000). This talk will take a look at some of the past research results and discuss challenges and opportunities in self-tuning DBMS research. |
Bio: |
Surajit Chaudhuri is a Senior Researcher and leads the Data Management and Exploration Group at Microsoft Research. His areas of interest include self-tuning database systems, query optimization, data cleaning and other tools for data integration, and understanding the synergy between IR and DBMS. Outside of database research, he led the development of CMT, a conference management service hosted by Microsoft Research since 1999 for the academic community. Surajit has a PhD from Stanford University and is an ACM Fellow. He was awarded the SIGMOD Contributions Award in 2004. |
14 May 2007, 10:30 AM
Title: |
Data-Driven Processing in Sensor Networks |
Speaker: |
Jun Yang, Duke University |
Abstract: |
Wireless sensor networks enable data collection from the physical
environment on unprecedented scales. In this talk, I will describe
some data processing problems that arise in building an environmental
sensing network in Duke Forest, in collaboration with ecologists
and statisticians. Because of severe resource constraints on
battery-powered sensor nodes, it is infeasible to collect and report
all raw readings for centralized processing. An effective approach
is model-driven data acquisition, which avoids acquiring readings
that can be accurately predicted from known spatio-temporal models
of data. We argue for an alternative, data-driven approach, which
exploits models in optimizing push-based reporting, but does not
depend on the quality of models for correctness. A particularly
thorny issue with push-based reporting is transmission failures,
which are common in sensor networks, and make failed reports
indistinguishable from intentionally suppressed ones. The cost of
implementing reliable transmissions is prohibitively high. We show
how to inject application-level redundancy in data reporting to
enable efficient, effective, and principled resolution of uncertainty
in the missing data.
|
Bio: |
Jun Yang received his B.A. from the University of California at
Berkeley in 1995, and his Ph.D. from Stanford University in 2001.
He is currently an Assistant Professor of Computer Science at Duke
University. He is broadly interested in research on data management,
and is currently focusing on derived data maintenance, continuous
query systems, and sensor data processing. He is a recipient of the
National Science Foundation CAREER Award and the IBM Faculty Award.
|
25 June 2007, 10:30 AM
Title: |
Processing and Routing Streams in a Networked World |
Speaker: |
Ugur Cetintemel, Brown University |
Abstract: |
This talk will provide an overview of our recent work on building
distributed stream-oriented software infrastructures at Brown. Our
goal is to provide robust, scalable software support for an emerging
class of applications that require collection, processing and
distribution of large volumes of real-time data streams, generated by
a number of potentially distributed data sources (such as cameras,
weather stations, and network monitors). In particular, the talk will
cover the high-level design and the key features of Borealis, a
distributed stream processor, and XPORT, a distributed
publish/subscribe system. The talk will also highlight our ongoing
work on integrating these two systems and some future directions. |
Bio: |
Ugur Cetintemel is an assistant professor in the Computer Science
Department at Brown University. He received a Ph.D. in Computer
Science from the University of Maryland, College Park, in 2001. His
current work focuses on the architecture and performance of advanced
database systems, with an emphasis on streaming data. He is a Brown
University Manning Assistant Professor and a recipient of the NSF
CAREER Award. He is also a co-founder and a senior architect for
StreamBase Inc. |
17 August 2007, 2:00 PM (Please note special day and time)
Title: |
Experiment-Driven Management of Web Services and Scientific Applications |
Speaker: |
Shivnath Babu, Duke University |
Abstract: |
Database-backed Web services (e.g., Amazon, eBay, Yahoo!) play an
important role in our daily lives. The performance P (e.g.,
throughput) of a Web service S is a complex function of its workload
W, resource allocation R, and the large number of configuration
parameters C that affect S. Furthermore, P may be dictated by unknown
interactions among W, R, and C. We have developed a systematic
approach based on statistical design of experiments and active
machine-learning to discover these dependencies and interactions
accurately and comprehensively. Our approach plans a small set of
experiments, where each experiment observes P for a selected
combination. In this talk, I will describe how we use the
experiment-driven approach to process four basic queries in
Web-service management; a harness that leverages virtualization to
conduct experiments with specified combinations; and an
empirical evaluation using two multitier Web services that
demonstrates the feasibility and usefulness of our approach. I will
conclude by describing how we applied the same experiment-driven
approach to manage scientific applications in a utility computing
setting.
|
Bio: |
Shivnath Babu is an Assistant Professor of Computer Science at Duke
University. He received his Ph.D. from Stanford University in 2005. He
was awarded a National Science Foundation Early CAREER Award in 2007
for his work on the Ques project on Querying and Controlling Systems.
He is also the recipient of two IBM Faculty Awards. His current
research focuses on making large-scale databases and systems easier to
manage.
|