Grand Central Station is a system that extends search to all digital sources of data. The system consists of components that access and understand data sources and generate searchable metadata ("Gatherers"), a metadata repository for satisfying ad hoc queries, and an extensible profiling system for processing persistent queries. The Gatherer is an extensible crawler framework written in Java that can use a variety of protocols (e.g., HTTP, FTP, NNTP, ODBC, CICS, POP3) to access and understand a wide range of data formats (HTML, Java bytecode, PowerPoint, TAR/Zip archives, and many others). The Gatherer summarizes each data source it encounters in an XML application we call SumML (Summary Metalanguage). Key features of the Gatherer are that it is easily extended by adding protocol-specific and data-source-specific code, and that it runs, unchanged, on any platform that supports Java. The metadata repository is less advanced, but the profiling framework is moving forward to encompass multimedia profiling. The system has been deployed in the form of a Java-specific search engine called "jCentral" and is accessible from IBM's Java home page. This talk will present and demonstrate the system, and discuss future directions in searching video, image, and audio.
Business Intelligence applications perform complex aggregation over large amounts (typically 1 to 10 Terabytes) of data. This puts increasing demands on database systems to provide native support for such processing, often referred to as Online Analytical Processing (OLAP). The SQL group-by clause has recently been extended to provide primitives for common OLAP computations in the DBMS, giving the DBMS more flexibility in processing and optimizing such aggregation. These computations are the data-cube, rollup, concatenations of rollup (multi-dimensional cube), and combinations of ad-hoc grouping elements. The specification of the group-by clause can expand into many grouping sets. For example, the cube alone will result in 2**n grouping sets, where n is the number of grouping elements. In this talk I will present the SQL OLAP extensions and describe a novel technique for stacking grouping operations. Our technique results in linear expansion of grouping sets, greatly reducing the complexity and resources required to optimize and compute such queries.
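To make the expansion concrete, the following is a minimal sketch that enumerates the grouping sets produced by cube, rollup, and a concatenation of grouping specifications. The function names are hypothetical and chosen for illustration only. The sketch shows the naive expansion whose size motivates the talk; it does not implement the stacking technique itself.

```python
from itertools import combinations, product

def cube_grouping_sets(columns):
    """CUBE: every subset of the grouping columns, i.e. 2**n grouping sets."""
    cols = list(columns)
    return [tuple(c)
            for r in range(len(cols), -1, -1)
            for c in combinations(cols, r)]

def rollup_grouping_sets(columns):
    """ROLLUP: every prefix of the grouping columns, i.e. n + 1 grouping sets."""
    cols = list(columns)
    return [tuple(cols[:k]) for k in range(len(cols), -1, -1)]

def concat_grouping_sets(*specs):
    """Concatenated specifications: the cross product of their grouping sets."""
    return [tuple(col for part in combo for col in part)
            for combo in product(*specs)]

if __name__ == "__main__":
    for n in range(1, 6):
        cols = [f"c{i}" for i in range(n)]
        print(n, len(cube_grouping_sets(cols)), len(rollup_grouping_sets(cols)))
```

Concatenating rollup(a, b) with rollup(c) in this sketch gives (2 + 1) * (1 + 1) = 6 grouping sets, which is how a multi-dimensional cube built from rollups grows multiplicatively.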
Some problems associated with this situation are:
My presentation is based on a VLDB-98 paper by Meng et al., with the
Data structure libraries, like LEDA and STL for C++, provide a toolkit for constructing both standard and experimental primary-memory data structures.
This talk will discuss preliminary work in building the equivalent of a data structure library but for secondary-memory data-structures, e.g., B-Tree, R-Tree, etc. Traditionally, secondary-memory data-structures are called access methods and each provides some specialized technique for quickly accessing related data on secondary storage. Unfortunately, traditional access methods are usually hand-coded from scratch, normally requiring substantial knowledge of the underlying file system to build correctly and efficiently. The complexity in building new access methods impedes experimenting with new specialized data structures for accessing non-traditional data, e.g., text, images, highly-related data (CAD/CAM), etc.
Currently, we have generalized one group of access methods: search trees. The generalizations are the parts of a search tree that a developer may need to specialize. We have also created a small set of specialization components for each generalization, so a developer is not required to write every component from scratch. By judiciously selecting components from the library and specializing only the components needed for a new algorithm or data structure, a new access method can be created, and subsequently tested, significantly faster than via the traditional approach.
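The idea of a generic search tree with pluggable specialization components can be sketched as follows. This is an in-memory illustration under stated assumptions, not the library described in the talk: the component names (`union`, `consistent`, `penalty`, `split`) are hypothetical, loosely following the style of generalized search trees, and no secondary storage is involved.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple

@dataclass
class Components:
    """Hypothetical specialization points supplied by the developer."""
    union: Callable[[List[Any]], Any]        # summarize several bounds as one
    consistent: Callable[[Any, Any], bool]   # can this bound match the query?
    penalty: Callable[[Any, Any], float]     # cost of placing a new bound here
    split: Callable[[List[Tuple[Any, Any]]], Tuple[list, list]]

class Node:
    def __init__(self, leaf):
        self.leaf = leaf
        self.entries = []  # (bound, value) in leaves, (bound, child) otherwise

class SearchTree:
    """Generic part: traversal, insertion, and overflow handling."""
    def __init__(self, comps, capacity=4):
        self.c, self.capacity, self.root = comps, capacity, Node(True)

    def _bound(self, node):
        return self.c.union([b for b, _ in node.entries])

    def insert(self, bound, value):
        sibling = self._insert(self.root, bound, value)
        if sibling is not None:  # the root overflowed: grow the tree by one level
            old, self.root = self.root, Node(False)
            self.root.entries = [(self._bound(old), old),
                                 (self._bound(sibling), sibling)]

    def _insert(self, node, bound, value):
        if node.leaf:
            node.entries.append((bound, value))
        else:
            # Descend into the child whose bound is cheapest to extend.
            i = min(range(len(node.entries)),
                    key=lambda j: self.c.penalty(node.entries[j][0], bound))
            child = node.entries[i][1]
            sibling = self._insert(child, bound, value)
            node.entries[i] = (self._bound(child), child)
            if sibling is not None:
                node.entries.append((self._bound(sibling), sibling))
        if len(node.entries) > self.capacity:
            left, right = self.c.split(node.entries)
            node.entries, new = left, Node(node.leaf)
            new.entries = right
            return new
        return None

    def search(self, query):
        out, stack = [], [self.root]
        while stack:
            node = stack.pop()
            for bound, payload in node.entries:
                if self.c.consistent(bound, query):
                    (out if node.leaf else stack).append(payload)
        return out

def interval_components():
    """One specialization: keys are closed intervals (lo, hi) on a line."""
    def split(entries):
        ordered = sorted(entries, key=lambda e: e[0])
        mid = len(ordered) // 2
        return ordered[:mid], ordered[mid:]
    return Components(
        union=lambda bs: (min(b[0] for b in bs), max(b[1] for b in bs)),
        consistent=lambda b, q: b[0] <= q[1] and q[0] <= b[1],
        penalty=lambda b, n: (max(b[1], n[1]) - min(b[0], n[0])) - (b[1] - b[0]),
        split=split,
    )

tree = SearchTree(interval_components())
for k in range(50):
    tree.insert((k, k), k)  # points stored as degenerate intervals
```

In this sketch only `interval_components` is specific to one kind of data; swapping in a different set of components (for example, rectangles instead of intervals) would reuse `SearchTree` unchanged, which is the kind of reuse the paragraph above describes.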
Friday 13 November 1998 - research group meeting, 2pm, DC1331
Daniela Florescu, Alon Levy, Alberto Mendelzon, "Database Techniques
for the World Wide Web: A Survey", SIGMOD Record, September 1998.
The article describes the current research activities and directions
in the area of databases and the WWW and focuses on three main areas:
modeling and querying the web; information extraction and integration;
and web site construction and restructuring.
I will present an overview of this paper.
Friday 20 November 1998 - research group meeting, 2pm, DC1331
This topic relates to the VLDB paper discussed by Forbes Burkowski on Oct 30/98.
Friday 27 November 1998 - research group meeting, 2pm, DC1331
My work on distributed ODBMS is concentrated around the development
of the TIGUKAT system (the name means "object" in Inuit).
TIGUKAT's object model is purely behavioral in nature and is uniform. Every concept that can be
modeled in TIGUKAT is a first-class object with well-defined behavior. This gives the system extensibility.
Current work involves the development of a query language and its optimization, incorporation of the
temporal dimension into the object model, development of a programming language, and distribution of the system.
My multimedia research focuses on data
management issues. We are developing an object-oriented multimedia
database system which can support SGML/HyTime compliant documents. An associated project is the development of
a distributed image database system. Current research involves the design of a multimedia query model and
language, development of a visual query interface, and content-based indexing and access of images.
Interoperability research goes beyond multidatabase by considering
the interoperability of information
systems in general. The approach is object-oriented and our focus is on the use of object-oriented
characteristics to deal with interoperability problems.