Database Seminar Series (2002-2003)

The Database Seminar Series provides a forum for presentation and discussion of interesting and current database issues. It complements our internal database meetings by bringing in external colleagues. The talks that are scheduled for 2002-2003 are below, and more will be listed as we get confirmations. Please send your suggestions to M. Tamer Özsu.

Unless otherwise noted, all talks will be in room DC 1304. Coffee will be served 30 minutes before the talk.

We will try to post the presentation notes whenever possible. Please click on the presentation title to access these notes (usually in PDF format).

23 September 2002, 11:00 AM


Title: SINGAPORE: Towards flexible querying of heterogeneous data sources

Speaker: Klaus R. Dittrich, University of Zurich
Abstract: Data available on-line today is spread across heterogeneous data sources like traditional databases or repositories of various forms containing unstructured and semistructured data. Obviously, the "technical" availability alone is not at all sufficient for making meaningful use of existing information, and thus the problem of effectively and efficiently accessing and querying heterogeneous data is an important research issue. One popular approach is to integrate the data sources and offer users an a priori defined global schema. Alternatively, there are approaches which implement tools that let users define the query schema themselves. We propose a new approach where heterogeneous sources can be queried through a unified interface and underlying sources are integrated by means of a query language only. We present extensions to OQL which allow querying structurally heterogeneous (i.e., structured, semistructured, and unstructured) data alike, and integrating data on the fly. We also present some details of query preprocessing and show how techniques from database and information retrieval systems can be combined.
Bio: Prof. Klaus Dittrich received his diploma degree (M.Sc.) in Computer Science from the University of Karlsruhe. He earned his Ph.D. in 1982 at IPD Institute for Program Structures and Data Organization. In 1984 he spent a year as a post-doctoral fellow at IBM Almaden Research Center. He was head of the database department at FZI Research Center for Information Technologies at University of Karlsruhe from 1985 to 1989.

Since 1989 he has been a Professor of Computer Science at the University of Zurich and head of the Database Technology Research Group.

He took a sabbatical leave at Stanford University, USA and Hewlett Packard Labs, USA (1996) and was guest professor at Aalborg University, Denmark (1999).

He is a member of

and the current president of SI (Swiss Informaticians Society) and former president of IPEG (interuniversitäre Partnerschaft für Erdbeobachtung und Geoinformatik). He is also the secretary of the VLDB Endowment (Very Large Data Base Endowment Inc.). Until 1997 he was a member of the SIGMOD Advisory Committee.

Prof. Klaus Dittrich has been nominated as a distinguished speaker of the IEEE Europe Distinguished Visitor Program.

4 October 2002, 2:00 PM (Note the special time.)

Title: Profile Driven Data Management for Pervasive Environments
Speaker: Yelena Yesha, University of Maryland, Baltimore County
Abstract: The past few years have seen significant work in mobile data management, typically based on the client/proxy/server model. Mobile/wireless devices are treated as clients that are data consumers only, while data sources are on servers that typically reside on the wired network. With the advent of "pervasive computing" environments an alternative scenario arises where mobile devices gather and exchange data from not just wired sources, but also from their ethereal environment and one another. This is accomplished using ad-hoc connectivity engendered by Bluetooth-like systems. In this new scenario, mobile devices become both data consumers and producers. We describe the new data management challenges which this scenario introduces. We describe the design and present an implementation prototype of our framework, MoGATU, which addresses these challenges. An important component of our approach is to treat each device as an autonomous entity with its "goals" and "beliefs", expressed using a semantically rich language. We have implemented this framework over a combined Bluetooth and Ad-Hoc 802.11 network with clients running on a variety of mobile devices. We present experimental results validating our approach and measure system performance.
Bio: Yelena Yesha received the B.Sc. degree in Computer Science from York University, Toronto, Canada in 1984, and the M.Sc. and Ph.D. degrees in Computer and Information Science from The Ohio State University in 1986 and 1989, respectively.

Since 1989 she has been with the Department of Computer Science and Electrical Engineering at the University of Maryland Baltimore County, where she is presently a Verizon Professor. In addition, from December 1994 through August 1999 Dr. Yesha served as the Director of the Center of Excellence in Space Data and Information Sciences at NASA. Her research interests are in the areas of distributed databases, distributed systems, mobile computing, digital libraries, electronic commerce, and trusted information systems. She has published 8 books and over 100 refereed articles in these areas. Dr. Yesha was a program chair and general co-chair of the ACM International Conference on Information and Knowledge Management and a member of the program committees of many prestigious conferences.

She is a member of the editorial boards of the Very Large Databases Journal and the IEEE Transactions on Knowledge and Data Engineering, and is editor-in-chief of the International Journal of Digital Libraries.

During 1994, Dr. Yesha was the Director of the Center for Applied Information Technology at the National Institute of Standards and Technology.

Dr. Yesha is a senior member of IEEE, and a member of the ACM.

21 October 2002, 11:00 AM

Title: Bridging Relational Technology and XML
Speaker: Jayavel Shanmugasundaram, Cornell University
Abstract: XML has emerged as the standard data-exchange format for Internet-based business applications. These applications introduce a new set of data management requirements involving XML. However, for the foreseeable future, a significant amount of business data will continue to be stored in relational database systems. Thus, a bridge is needed to satisfy the requirements of these new XML-based applications while still leveraging relational database technology. In this talk, we shall describe the design and implementation of a middleware system that we believe achieves this goal. In particular, we shall describe a general framework for creating XML views of relational data, querying XML views, and storing and querying XML documents using a relational database system. Some of the interesting features of the system architecture are that it (a) provides users with a single XML query language for creating and querying XML views of relational data, (b) evaluates queries efficiently by pushing most computation down to the relational database engine, (c) allows users to query seamlessly over relational data and meta-data, and (d) allows users to write queries that span XML documents and XML views of relational data.
Bio: Jayavel Shanmugasundaram is an Assistant Professor in the Department of Computer Science at Cornell University. He received his Ph.D. degree from the University of Wisconsin at Madison, a master's degree from the University of Massachusetts at Amherst, and a bachelor's degree from the Regional Engineering College at Tiruchirappalli, India. Shanmugasundaram's research interests include Internet data management, database systems and query-processing in emerging system architectures. He is the author of several publications and patents, and his research ideas have been implemented in commercial data management products.

4 November 2002, 11:00 AM

Title: Mining Knowledge about Changes, Differences, and Trends
Speaker: Guozhu Dong, Wright State University
Abstract: Knowledge about changes, differences, and trends is very useful. For example, companies wish to identify important temporal changes and trends in customer purchase behavior, so that they can adjust their business priorities. Medical researchers wish to identify differences in gene group interactions between normal cell tissues and cancer cell tissues, so that they can discover better treatments for cancer.

We discuss some recent results on mining such knowledge. We are concerned with transactional data, relational data, and data cubes. We consider emerging patterns that capture differences and changes between a dataset pair, gradient patterns in a data cube that capture similar cells with big differences in measure values, and multidimensional multi-level trends in sets of time series in a data cube context. We discuss mining algorithms and ways to use the patterns.

Bio: Guozhu Dong is an associate professor at Wright State University, USA. He received his PhD from the University of Southern California in 1988. He previously taught at the University of Melbourne and Flinders University, both in Australia, and consulted for Lucent Bell Labs and LIT Singapore. His main research interests are in the areas of databases, data mining, and bioinformatics. He has published over 80 articles in these areas. He has served on numerous program committees, including ICDE, ICDM, ICDT, PODS, SIGKDD, and VLDB. He is a program co-chair of the International Conference on Web-Age Information Management (2003), and is on the editorial board of the International Journal of Information Technology.

2 December 2002, 11:00 AM

Title: FLORA-2: Programming with Logic and Objects
Speaker: Michael Kifer, SUNY at Stony Brook
Abstract: This talk is about a marriage of object-based and logic-based paradigms for programming knowledge-intensive applications.

The product of this marriage is FLORA-2, which is both a seamless integration of Frame Logic, HiLog and Transaction Logic in a single formalism, and an implementation that adds important pragmatic extensions. Together they make a powerful knowledge programming language.

Frame Logic relates to the object-oriented data model as classical predicate calculus relates to the relational data model. HiLog adds meta-programming, and Transaction Logic adds dynamics to the mix.

Although FLORA-2 has been released only in its alpha form, it is already very usable and has a following of dedicated users in the areas of information integration, semantic web, information systems design, agent building, etc.

Bio: Michael Kifer is a Professor with the Department of Computer Science, State University of New York at Stony Brook (USA). He received his Ph.D. in Computer Science in 1985 from the Hebrew University of Jerusalem, Israel, and the M.S. degree in Mathematics in 1976 from Moscow University, Russia.

Dr. Kifer's interests include database systems, knowledge representation, and Web information systems. He has published two textbooks and numerous articles in these areas. In 1999 and 2002 he was a recipient of the ACM-SIGMOD "Test of Time" awards for his work on object-oriented database languages.

21 January 2003, 1:00 PM, MC 5136 (Please note special date, time and place)

Title: Practical Considerations for Semantic Cache Management
Speaker: Björn Þór Jónsson, Reykjavik University
Abstract: The emergence of query-based on-line data services and e-commerce applications has prompted much recent research on data caching. This talk describes semantic caching, a caching architecture for such applications that caches the results of selection queries. Unlike most previous approaches to caching query results, data is not replicated in the semantic cache, thus improving the utility of the cache. Furthermore, partial results are re-used, reducing network traffic. The focus of the talk is on two performance studies using a prototype implementation that connects to a commercial relational server. One study focuses on relatively simple selection workloads and demonstrates several intrinsic benefits of semantic caching, including low overhead, insensitivity to the physical layout of the database, reduced network traffic, the ability to answer some queries without contacting the server, and the ability to incorporate application knowledge in replacement decisions. The second performance study focuses on complex selection workloads. It demonstrates that, despite the increased complexity of cache management, semantic caching works well in a wide range of network-constrained environments.
Bio: Dr. Björn Þór Jónsson is an associate professor in the School of Computer Science at Reykjavík University, Iceland. His research focuses on database caching architectures and multimedia database systems, in particular image and text databases. He has taught classes on database theory and application, database tuning and advanced database systems. Björn received his Ph.D. degree in Computer Science from the University of Maryland, College Park in 1999. The subject of his thesis was "Application-Oriented Buffering and Caching Techniques".

14 March 2003, 2:00 PM (Please note special date and time)

Title: TelegraphCQ: Continuous Dataflow Processing for an Uncertain World
Speaker: Michael Franklin, University of California, Berkeley
Abstract: Increasingly pervasive networks are leading towards a world where data is constantly in motion. In such a world, conventional techniques for query processing, which were developed under the assumption of a far more static and predictable computational environment, will not be sufficient. In response to this need, the Telegraph project at Berkeley has developed a suite of novel technologies for continuously adaptive query processing. We are currently building the next generation Telegraph system, called TelegraphCQ, which is focused on meeting the challenges that arise in handling large numbers of continuous queries over high-volume, highly-variable data streams. In this talk, I will describe the TelegraphCQ system architecture and its underlying technology, and report on our ongoing implementation effort leveraging the PostgreSQL open source code base. I will also discuss our overall research agenda, including related projects on high-volume XML filtering and query processing in ad hoc sensor networks.
Bio: Michael Franklin is an Associate Professor of Computer Science at the University of California, Berkeley. His research focuses on the architecture and performance of distributed databases and information systems. He received his Ph.D. from the University of Wisconsin, Madison in 1993. Previously, he was on the faculty at the University of Maryland, College Park, where he led projects on adaptive query processing and data dissemination. He served as Program Chair for the 2002 ACM SIGMOD Conference and is currently an Editor of ACM Transactions on Database Systems, Vice Chair of the SIGMOD Advisory Board, and a member of the Board of Trustees of the VLDB Endowment. He is also a technology advisor to the Mayfield Fund and sits on the technology advisory boards of several companies.

14 April 2003, 11:00 AM

Title: Hidden-Web Databases: Classification and Search
Speaker: Luis Gravano, Columbia University
Abstract: Many valuable text databases on the web have non-crawlable contents that are "hidden" behind search interfaces. Hence traditional search engines do not index this valuable information. One way to facilitate access to "hidden-web" databases is through commercial Yahoo!-like directories, which organize these databases manually into categories that users can browse. In this talk, I will describe a technique to automate the classification of hidden-web databases. Our technique adaptively probes the databases with queries derived from document classifiers, without retrieving any documents. A large-scale experimental evaluation over 130 real web databases indicates that our technique produces highly accurate database classification results using, on average, fewer than 200 queries of four words or fewer to classify a database.

An alternative way to facilitate access to hidden-web databases is through "metasearchers," which provide a unified query interface to search many databases at once. For efficiency, a critical task for a metasearcher is the selection of the most promising databases to search for a query, a task that typically relies on statistical summaries of the database contents. In this talk, I will also describe a recent technique to derive content summaries from hidden-web databases. We exploit our probing-based classification algorithm to adaptively zoom in on and extract documents that are representative of the topic coverage of the databases. We can then build content summaries from these topically-focused document samples. A large-scale experimental evaluation over a variety of databases indicates that our new content-summary construction technique is efficient and produces more accurate summaries than those from previously proposed strategies.
Bio: Luis Gravano has been on the faculty of the Computer Science Department, Columbia University since September 1997, where he has been an associate professor since July 2002. From January through August 2001, Luis was a Senior Research Scientist at Google (while on leave from Columbia University). He received his Ph.D. degree in Computer Science from Stanford University in 1997. He also received an M.S. degree from Stanford University in 1994 and a B.S. degree from the Escuela Superior Latinoamericana de Informatica (ESLAI), Argentina in 1990. Luis is an associate editor of the ACM Transactions on Information Systems, as well as database program chair for the upcoming ACM CIKM 2004. Luis is also a recipient of a CAREER award from the National Science Foundation.
This talk describes work performed jointly with Panos Ipeirotis (Columbia) and Mehran Sahami (Stanford/Google).

12 May 2003, 11:00 AM

Title: Bioinformatics: Gene Expression Data Analysis
Speaker: Aidong Zhang, University at Buffalo
Abstract: DNA microarray technology provides a broad snapshot of the state of the cell by measuring the expression levels of thousands of genes simultaneously. It has already had a significant impact on the field of bioinformatics and has posed a unique challenge: information in gene expression matrices is special in that the sample space and gene space are of very different dimensionality and it can be studied in either sample space or gene space. While most of the previous studies focus on clustering either genes or samples, it is interesting to ask whether we can partition the complete set of samples into exclusive groups (called phenotypes) and find a set of informative genes that can manifest the phenotypes. The mining of phenotypes and informative genes can provide valuable information to biologists for understanding the roles of genes and the phenotype structure of samples. In this talk, I will describe new techniques which simultaneously mine phenotypes and informative genes from gene expression data. These techniques integrate statistics, data mining, and machine learning methods in a unique fashion to achieve optimal solutions.
Bio: Aidong Zhang is a Professor in the Department of Computer Science and Engineering at the State University of New York at Buffalo. She received her Ph.D. degree in computer science from Purdue University, West Lafayette, Indiana, in 1994. Her research interests include bioinformatics, multimedia systems, content-based image retrieval, geographical information systems, and data mining. She serves on the editorial boards of ACM Multimedia Systems, the International Journal of Multimedia Tools and Applications, International Journal of Distributed and Parallel Databases, and ACM SIGMOD DiSC (Digital Symposium Collection).
She was co-chair of the technical program committee for ACM Multimedia 2001. Dr. Zhang is a recipient of the National Science Foundation CAREER award and SUNY Chancellor's Research Recognition award.

7 July 2003, 11:00 AM

Title: Database Support for Data Mining Applications
Speaker: Wolfgang Lehner, Technische Universität Dresden
Abstract: Database support for data mining has become an important research topic. Especially for large high-dimensional data volumes, comprehensive support from the database side is necessary. In this talk I will focus on the data-intensive subproblem of aggregating high-dimensional data in all possible low-dimensional projections (for instance, estimating low-dimensional histograms), which occurs in several established data mining techniques. I will argue that existing OLAP SQL extensions are insufficient for high-dimensional data and propose a new SQL operator, which seamlessly fits into the set of existing OLAP group-by operators. The new SQL operator is presented from both a SQL language and an implementation point of view. Different methods implementing the operator will be outlined and discussed in the context of the prototypical implementation within the Postgres database engine. Performance studies show that the operator yields a large speedup (up to a factor of 10) over existing methods provided by commercially available database systems.
Bio: Please see

31 July 2003, 11:00 AM; DC1302 (Please note change of regular place)

Title: Mining the Web: Search Engines
Speaker: Ricardo Baeza-Yates, University of Chile
Abstract: The Web grows and evolves faster than we like and expect, imposing scalability and relevance problems on Web search engines. In this talk we present how mining Web data and usage logs allows us to improve a search engine in several ways: page ranking, indices and interfaces. As a corollary we show several interesting relations among different Web characteristics: structure, dynamics, "quality", etc. Our results help to understand not only technical issues, but also social ones, as the Web is the collaborative work of many people, a few publishing, and all of them querying.
Bio: Ricardo Baeza-Yates obtained a Ph.D. in CS at U. of Waterloo, Canada, in 1989. In 1992 he was elected president of the Chilean Computer Science Society (SCCC) until 1995, being elected again for 1997-98. During 1993, he received the Organization of American States award for young researchers in exact sciences. In 1994 he received the award for the best engineering research in the last 4 years from the Institute of Engineers of Chile. In 1997, with two Brazilian colleagues, he obtained the COMPAQ prize for the best Brazilian research article in CS. He was recently elected to the IEEE CS Board of Governors for the period 2002-04. In 2002 he was appointed to the Chilean Academy of Sciences, being the first person from computer science to achieve this position in Chile. Currently he is a professor at the CS department of the University of Chile, where he was the chair in the period 1993-95. He is also director of the Center for Web Research, a project funded by the Millennium Scientific Initiative. His research interests include information retrieval, algorithms, and information visualization. He is co-author of the book Modern Information Retrieval, published in 1999 by Addison-Wesley, as well as co-author of the 2nd edition of the Handbook of Algorithms and Data Structures, Addison-Wesley, 1991; and co-editor of Information Retrieval: Algorithms and Data Structures, Prentice-Hall, 1992.

