[Please remove <h1>]
The Database Seminar Series provides a forum for presentation and discussion
of interesting and current database issues. It complements our internal database
meetings by bringing in external colleagues. The talks that are scheduled for
2000-2001 are below, and more will be listed as we get confirmations. Please
send your suggestions to M. Tamer Özsu.
Unless otherwise noted, all talks will be in room DC 1304. Coffee will be
served 30 minutes before the talk.
We will try to post the presentation notes, whenever that is possible. Please
click on the presentation title to access these notes (usually in pdf format).
Database Seminar Series is supported by iAnywhere Solutions, A Sybase Company.
23 October 2000, 11:00 AM; MC 5158
Title: |
Courseware-On-Demand:
Generating New Course Material From Existing Courses |
Speaker: |
Vincent
Oria, New Jersey Institute of Technology |
Abstract: |
Creation of appropriate educational content is a major challenge and
cost factor in the increasing important areas of on-line and life-long
learning. The aim of the ongoing Courseware-on-Demand project is to enable
the creation of course content by recomposing parts extracted from existing
annotated course materials. Course materials comprising a variety of learning
objects (slides, video, audio, textbook, etc.), are gathered, decomposed
into elementary learning fragments, appropriately annotated, indexed, and
stored. They are further augmented with course structures and dependencies
(prerequisite and precedence). A course designer who has a particular training
goal in mind can pose a query to the Courseware-on-Demand system, expressing
that goal. The first step in processing the query is to locate course fragments
containing material relevant to the goal. Then, taking into account the
dependencies, the system can define a minimal path in the metadata network
that leads to the goal. The last step is to physically extract the parts
of the course materials retrieved during the previous step, and compose
the new teaching material.
This is joint work with Silvia Hollfelder and Peter Fankhauser, GMD-IPSI,
Darmstadt, Germany. |
Bio: |
Vincent Oria got a Diplôme d'Ingénieur in 1989 from the
Institut National Polytechnique, Yamoussoukro, Côte d'Ivoire (Ivory
Coast), a D.E.A in 1990 from Université Pierre et Marie Curie (Paris
VI), Paris, France, and a Ph.D. in 1994 in Computer Science from the Ecole
Nationale Supérieure des Télécommunications (ENST),
Paris, France. He worked as a Research Scientist at the ENST Paris from
1994 to 1996 before joining Prof. Tamer Ozsu's group at the University
of Alberta, Canada as a Post-Doct from 1996 to 1999 where he was the primarily
researcher of the DISIMA project. He is currently an Assistant Professor
in the Department of Computer and Information Science at the New Jersey
Institute of Technology (NJIT), USA. |
6 November 2000, 11:00 AM
Title: |
How
to Crawl the Web |
Speaker: |
Hector Garcia-Molina,
Stanford University |
Abstract: |
Joint work with Junghoo Cho
A crawler collects large numbers of web pages, to be used for building
an index or for data mining. Crawlers consume significant network and
computing resources, both at the visited web servers and at the site(s)
collecting the pages, and thus it is critical to make them efficient
and well behaved. In this talk I will discuss how to build a "good" crawler,
addressing questions such as:
- How can a crawler gather "important" pages only?
- How can a crawler efficiently maintain its collection "fresh"?
- How can a crawler be parallelized?
I will also summarize results from an experiment conducted on more than
half million web pages over 4 months, to estimate how web pages evolve
over time. |
Bio: |
Hector Garcia-Molina is the Leonard Bosack and Sandra Lerner Professor
in the Departments of Computer Science and Electrical Engineering at Stanford
University, Stanford, California. From August 1994 to December 1997 he
was the Director of the Computer Systems Laboratory at Stanford. From 1979
to 1991 he was on the faculty of the Computer Science Department at Princeton
University, Princeton, New Jersey. His research interests include distributed
computing systems and database systems. He received a BS in electrical
engineering from the Instituto Tecnologico de Monterrey, Mexico, in 1974.
From Stanford University, Stanford, California, he received in 1975 a MS
in electrical engineering and a PhD in computer science in 1979. Garcia-Molina
is a Fellow of the ACM, received the 1999 ACM SIGMOD Innovations Award,
and is a member of the President's Information Technology Advisory Committee
(PITAC). |
20 November 2000, 11:00 AM
Title: |
An
Authorization Model for Temporal Data |
Speaker: |
Avigdor Gal, Rutgers
University |
Abstract: |
The use of temporal data has become wide-spread in recent years, within
applications such as data warehouses and spatiotemporal databases. In this
paper, we extend the basic authorization model by facilitating it with
the capability to express authorizations based on the temporal attributes
associated with data, such as transaction time and valid time. In particular,
a subject can specify authorizations based on data validity or data update
time, using either absolute or relative time references. Such a specification
is essential in providing access control for predictive data, or in constraining
access to data based on currency considerations. We provide an expressive
language for specifying such access control to temporal data, using a variation
of temporal logic for specifying complex temporal constraints. We also
introduce an easy-to-use access control mechanism for stream data.
A joint work with Vijay Atluri, Rutgers University.
A paper describing this material is available here. |
Bio: |
Avigdor Gal is a faculty member at the Department of MSIS at Rutgers
University. He received his D.Sc. degree from the Technion-Israel Institute
of Technology in 1995 in the area of temporal active databases. He has
published more than 30 papers in journals, books, and conferences on the
topics of information systems architectures, active databases and temporal
databases. Together with Dr. John Mylopoulos, Avigdor has chaired the Distributed
Heterogeneous Information Services workshop at HICSS98 and he was the guest
editor of a special issue by the same name in the International Journal
of Cooperative Information Systems. Also, he was the General co-Chair of
CoopIS2000. Avigdor has consulted in the area of eCommerce and is a member
of the ACM and the IEEE computer society.. |
4 December 2000, 11:00 AM
Title: |
Consistent
Query Answers in Inconsistent Databases |
Speaker: |
Jan Chomicki, University
at Buffalo |
Abstract: |
As the amount of information available in online data sources explodes,
there is a growing concern about the consistency and quality of answers
to user queries. This talk addresses the issue of using logical integrity
constraints to gauge the consistency and quality of query answers. Although
it is impractical to enforce global integrity constraints across different
data sources and correct integrity violations by updating individual sources,
integrity constraints are still useful because they express important semantic
properties of data. This talk introduces formal notions of database repair
and consistent query answer: A consistent answer is true in every minimal
repair of the database. The information about answer consistency serves
as an important indication of its quality and reliability.
I will show a general method for transforming first-order queries in
such a way that the transformed query computes consistent answers to
the original query. I will characterize the scope of this method: the
classes of queries and integrity constraints it handles. For functional
dependencies, I will provide a complete characterization of the computational
complexity of computing consistent query answers to scalar aggregation
queries (those involving the SQL operators COUNT, MIN, MAX, SUM, or AVG).
I will also show that if the set of functional dependencies is in Boyce-Codd
Normal Form, the boundary between tractable and intractable can be pushed
further using the tools of perfect graph theory.
(Joint work with Marcelo Arenas and Leopoldo Bertossi, with a contribution
from Vijay Raghavan and Jeremy Spinrad.) |
Bio: |
Jan Chomicki received his M.S. in computer science with honors from Warsaw
University, Poland, and his Ph.D., also in computer science, from Rutgers
University. He is currently an Associate Professor at the Department of
Computer Science and Engineering, University at Buffalo (formerly: SUNY
Buffalo). He has held visiting positions at Hewlett-Packard Labs, European
Community Research Center, Bellcore, and Lucent Bell Labs. He is the recipient
of 3 National Science Foundation research grants, the author of over 40
publications, and an editor of the book "Logics for Databases and Information
Systems," Kluwer, 1998. |
22 January 2001, 11:00 AM
Title: |
W3C XML Query WG: A Status Report |
Speaker: |
Paul Cotton, Microsoft Canada |
Abstract: |
The World Wide Web Consortium (W3C) XML Query Working Group [1]
was chartered in September 1999 to develop a query language for XML [2]
documents. The goal of the XML Query Working Group is to produce a formal
data model for XML documents with Namespaces [3]
based on the XML Infoset [4]
and XML Schemas [5, 6, 7],
an algebra of query operators on that data model, and then a query language
with a concrete canonical syntax based on the proposed operators. The WG
produced its first Requirements document [8]
in January 2000. Subsequently it has published an XML Query Data Model
Working Draft [9] and
an XML Query Algebra Working Draft [10].
The XML Query WG is currently working on a syntax for the XQuery language.
This talk will provide an update on the current status of the W3C XML Query
WG. This update will include current information on the status of the XML
Query Requirements, Data Model and Algebra. The talk will also outline
the relationship of the work of the XML Query WG to other W3C XML standards
especially XML Schema and XPath [11]. |
Bio: |
Paul Cotton joined Microsoft in May, 2000 and is currently Program Manager
of XML Standards. Paul telecommutes to his Redmond job from his home in
Ottawa, Canada. Paul has 28 years of experience in the Information Technology
industry. He has been involved in computer standards work since 1988 when
he began representing Fulcrum Technologies in the SQL Access Group where
he was heavily involved in the development of the SQL Call-Level Interface
(CLI) which is the de jure standard based on ODBC. Paul has represented
his employer and Canada on the ISO SQL and SQL/Multi-Media committees since
1992 and was the Editor of the SQL/MM Full-Text and Still Image documents
from 1995 until joining Microsoft. Paul was a founding member in the consortium
efforts to standardize JDBC and SQLJ which provide interfaces to SQL for
the Java language. Paul has been participating in the W3C XML Activity
since mid-1998 when he became IBM's prime representative on the XML Linking
and Infoset Working Groups. Paul has been chairperson of the XML Query
Working Group and a member of the XML Coordination Group since September
1999. Paul was elected to the W3C Advisory Board in June 2000 soon after
joining Microsoft. Paul is also Microsoft's alternate on the XML Protocol
Working Group. Paul graduated with a M.Math in Computer Science from the
University of Waterloo in 1974. |
5 February 2001, 11:00 AM; DC 1302
Title: |
Consistent and Efficient Database Replication
based on Group Communication |
Speaker: |
Bettina Kemme, McGill University |
Abstract: |
Data replication is a common mechanism in distributed database systems
that provides performance and fault-tolerance. Most early work has focused
on eager replication: updates are immediately sent to all copies in order
to guarantee consistency. However, database designers often point out that
eager replication is not feasible in practice due to the high deadlock
rate, the number of messages involved and the poor response times. As a
result, the focus has shifted towards faster lazy replication where updates
are propagated asynchronously to remote copies. The price paid is poor
consistency or severe restrictions to the configuration of the system.
This talk suggests that, contrary to what has been claimed, eager replication
can be implemented in such a manner as to provide not only clear correctness
criteria but also reasonable performance and scalability. This is especially
true in environments where communication is fast (e.g., computer clusters).
I will present a suite of eager replication protocols based on novel techniques
that avoid most of the limitations of current approaches. These protocols
take advantage of powerful broadcast primitives to serialize transactions,
and to avoid deadlocks and an explicit 2-phase commit. The talk will also
discuss the practical implications of integrating these techniques into
a real database system, and provide experimental results that prove the
feasibility of the approach. |
Bio: |
Bettina Kemme received her M.S. in Computer Science in 1996 from the
Friedrich-Alexander University, Erlangen, Germany, and her Ph.D. in Computer
Science in 2000 from the Swiss Federal Institute of Technology (ETH), Zurich,
Switzerland. She is currently an Assistant Professor at the School of Computer
Science at McGill University, Montreal. |
28 February 2001, 2:30 PM, DC 1302
Title: |
The XML Query Algebra: State and
Challenges |
Speaker: |
Peter Fankhauser,
GMD-IPSI |
Abstract: |
The XML Query Algebra [1]
is designed as part of the W3C-Activity XML Query [2]
to provide a formal basis for an XML Query Language. It is guided by three
principles. Full support of wellformed and valid XML, compositionality,
and strong typing. This talk will introduce the overall design, the underlying
datamodel, the type system, and the operational semantics of the algebra.
Furthermore, the talk will exemplify how to map surface syntaxes such as
XPath [3] or XQuery to the
algebra, and discuss the relationship between the algebra's typesystem
and XML Schema's [4,5,6]
typesystem. |
Bio: |
Peter Fankhauser is head of the department OASYS (Open Adaptive Information
systems) and director of the XML Competence Center at the GMD Institute
for Integrated Publication and Information Systems (IPSI). His main research
and development interests lie in information brokering and transaction
brokering. He has been the technical leader in the development of tools
for unified access to information retrieval systems, document structure
recognition, and the integration of heterogeneous databases. He is member
of the W3C XML Query working group, and is one of the editors of the W3C
XML-Query Requirements and the XML-Query Algebra working drafts He has
received a PhD in computer science from the Technical University of Vienna
in 1997. |
12 March 2001, 11:00 AM
Title: |
Profile-Driven Data Management |
Speaker: |
Mitch Cherniack, Brandeis
University |
Abstract: |
Large amounts of data populate the Internet, yet it is largely unmanaged.
Access to this data requires a wide array of network resources including
server capacity, bandwidth, and cache space. Applications compete for these
resources with little overarching support for its intelligent allocation.
The database community has seen this problem before. Data management
techniques (e.g., indexing, data organization, buffer management) have
long been used in DBMS's to offset the demands of limited resources
made by competing applications' requests for data. Such techniques
are application-driven; a database administrator (DBA) consults with
users to determine their data needs, and tweaks the knobs of data management
tools accordingly. But the Internet environment, consisting of autonomous
data sources and global in scale, cannot be tamed by a DBA. We take
the view that in this environment, the DBA role must be automated,
and the specification of application-level data requirements must be
made formal by way of processible user profiles.
In this talk, I will discuss the Profile-Driven Data Management project:
a multi-institution endeavor funded by the National Science Foundation
in October, 2001. I will demonstrate the goals and scope of the project,
present some early results, and pose some of the many challenges that
lie ahead.
Joint Work with: Stan Zdonik (Brown University) and Michael J. Franklin
(UC Berkeley) |
Bio: |
Mitch Cherniack received his Ph.D. in Computer Science in 1999 from Brown
University. He is currently an Assistant Professor at Brandeis University,
Boston. |
19 March 2001, 11:00 AM; DC1302
Title: |
Optimizing Queries Using Materialized Views (Associated paper) |
Speaker: |
Per-Ake (Paul)
Larson, Microsoft Research |
Abstract: |
Materialized views can provide orders of magnitude improvements in query
processing time, especially for aggregation queries over large tables.
To realize this potential, the query optimizer must understand how
and when to exploit materialized views. This talk describes a fast
and scalable algorithm for determining whether part or all of a query
can be computed from materialized views and outlines how the algorithm
is integrated into a transformation-based optimizer. The algorithm
handles views composed of selections, joins and a final group-by. Multiple
rewrites using views may be generated and the optimizer chooses the
best alternative in the normal way. Experimental results based on a
prototype implementation in Microsoft SQL Server show outstanding performance
and scalability. Optimization time increases slowly with the number
of views but remains low even up to a thousand views.
|
Bio: |
Paul Larson is a senior researcher in the Database Group at Microsoft
Research. Prior to joining Microsoft, he was a Professor of Computer Science
at the University of Waterloo. |
26 March 2001, 11:00 AM
Title: |
A Configurable Application View Storage
System |
Speaker: |
Sibel Adali, RPI |
Abstract: |
Storage management is a well-known method for improving the efficiency
of data intensive and networked applications. Today's data management systems
handle many non-traditional data formats, ranging from spatial data to
images, video and other hybrid representations. This requires the use of
specialized methods to query, extract and transform data from multiple,
possibly distributed sources. There is a great need to develop efficient
and scalable methodologies for storing and reusing the results of computations
in such applications. In this paper, we introduce a configurable storage
management system that allows programmers to specify a collection of storage
management protocols for managing different types of data requests on top
of a shared and possibly distributed pool of resources. In addition, dynamic
protocol change rules allow the system to shift from one storage management
method to another depending on the availability of system resources. Furthermore,
the storage management system can be expanded with application specific
methods to look-up and re-use stored items. We show how the storage system
can be tuned to specific workload specifications with the help of a simulation
model that takes into account both the cost of storage management protocols
as well as the methods to look-up and re-use stored items. |
Bio: |
Sibel Adali is an Assistant Professor in Rensselaer Polytechnic Institute,
which she joined in 1996 after her PhD from University of Maryland. At
RPI, she leads the Multimedia Information Integration Lab. Her research
focuses on heterogeneous distributed information systems, database interoperability,
query optimization, and multimedia information systems. |
23 April 2001, 11:00 AM; DC 1302
Title: |
Data (and Links) on the Web |
Speaker: |
Alberto Mendelzon, University
of Toronto |
Abstract: |
In the excitement of extending database technology such as semistructured
models and languages to the Web, the importance of links between documents
has somehow been left behind. We will present two projects that keep links
front and center: the WebSQL/WebOQL query languages and the TOPIC system
for computing page reputations. WebSQL and WebOQL are query languages intended
to serve as high-level tools for building Web-based information systems
in the same way that SQL is used for building traditional information systems.
TOPIC is a method for analyzing incoming links to a page in order to determine
what are the topics that this page is best known for on the Web. |
Bio: |
Alberto Mendelzon was born in Buenos Aires and received his graduate
degrees from Princeton University. His research interests are in databases
and knowledge-bases including database design theory, query languages,
semi-structured data, global information systems, and OLAP. Alberto has
chaired or co-chaired the Program Committees of ACM PODS (1991), VLDB (1992),
DOOD (1995), WebDb (1998), DBPL (1999), and WWW8 (1999). He is an Associate
Editor of the ACM Transactions on Database Systems and has guest-edited
issues of the VLDB Journal, Theory and Practice of Object Systems,
and Journal of Computer and System Sciences. He is a member of the
Executive Board ACM PODS and Information Director for ACM SIGMOD. |
14 May 2001, 11:00 AM
Title: |
Data Provenance |
Speaker: |
Peter Buneman,
University of Pennsylvania |
Abstract: |
When you find some data on the Web, are you concerned about out how it
got there? Most of us have learned not to put much faith in what we find
when we casually browse the Web. But if you are a scientist or any kind
of scholar, you would like to have confidence in the accuracy and timeliness
of the data that you find. In particular, you would like to know how it
got there -- its provenance. In Bioinformatics there are literally hundreds
of public databases. Most of these are not source data: their contents
have been extracted from other databases by a process of filtering, transforming,
and manual correction and annotation. Thus, describing the provenance of
some piece of data is a complex issue. These "curated" databases have enormous
added value, yet they often fail to keep an adequate description of the
provenance of the data they contain.
In this talk I shall describe some recent work on a Digital Libraries
project to investigate data provenance. I shall try to describe the general
problem and then deal with some technical issues that have arisen, including
(a) a semistructured data model that may help in characterizing data
provenance (b) tracing provenance through database queries (c) efficient
archiving of scientific databases and (d) keys (canonical identifiers)
for semistructured data and XML.
Joint work with Sanjeev Khanna and Wang-Chiew Tan |
Bio: |
Peter Buneman is Professor of Computer and Information Science at the
University of Pennsylvania. His work in computer science has focussed mainly
on databases and programming languages, areas in which he has worked on
active databases, database semantics, type systems, approximate information,
query languages, and on semistructured data and data formats, an area in
which he has recently co-authored a book.
He has served on numerous program committees, editorial boards and
working groups, and has been program chair for ACM Sigmod, ACM PODS
and the International Conference on Database Theory. He was recently
elected fellow of the ACM. |
28 May 2001, 11:00 AM
Title: |
Combining Data Independence and
Realtime in Media Servers |
Speaker: |
Klaus Meyer-Wegener,
Technical University of Dresden |
Abstract: |
In order to be flexible and useful, media servers should provide data
independence and realtime support. Extensible database systems today offer
just the former, media servers just the latter. Going for data independence
makes realtime support even harder, since it requires additional mappings
and conversions: To play a video in a format different from the one used
for storage, realtime conversion is needed. The presentation will introduce
the implementation of data independence using so-called converter graphs.
The timing of each step in the graph can be described as a jitter-constrained
periodic stream, which allows to calculate some characteristics (e.g. buffer
size) and supports scheduling. Probabilities of conflict can be used to
determine the need for simultaneous playout of the same media stream. Finally,
statistical rate-monotonic scheduling would allow to introduce a notion
of quality, but an integrated model remains to be developed. This gives
an overview of ongoing work in a project called "Memo.real." It
is part of a Collaborative Research Center funded
by the Deutsche Forschungsgemeinschaft. This
implementation builds on the KANGAROO
prototype and the Dresden Realtime Operating
System (DROPS) |
Bio: |
Prof. Klaus Meyer-Wegener is a Professor at Dresden University of Technology,
the Head of the Database Group. His research interests include multimedia
databases, document and workflow management, and digital Libraries. |
11 June 2001, 11:00 AM
Title: |
Query Optimization via Data Mining |
Speaker: |
Jarek Gryz, York University |
Abstract: |
It is almost a trivial observation that the more we know about the data
the faster we can access it. This idea was first explored in the early
1980's in semantic query optimization (SQO), a technique that uses semantic
information stored in databases as integrity constraints to improve query
performance. Some of the ideas developed for SQO have been used commercially,
but to the best of our knowledge, no extensive, commercial implementations
of SQO exist today.
But SQO can be extended beyond the use of just integrity constraints.
We can use data mining to discover patterns in data that can be useful
for optimization in the same way integrity constraints are. Indeed,
by focusing on the most useful patterns we can provide better query
performance improvements than it is possible with the traditional SQO.
In the first part of this talk, I present an implementation of two
SQO techniques, Predicate Introduction and Join Elimination, in DB2
Universal Database. I present the implemented algorithms and performance
results using the TPCD and APB-1 OLAP benchmarks. In the second part
of the talk I will describe my current research on developing data
mining techniques for discovery of check constraints and empty joins
and the use of this information for query optimization.
|
Bio: |
Jarek Gryz is an Assistant Professor at York University. His research
interests include query Optimization via data mining, semantic query caching,
and query optimization in parallel database systems. |