Fall 1998
- Friday 09 October 1998 - research group meeting, 2pm, DC1331
- Speaker: Connie Zhang
- Snacks: Reem Al-Halimi
- Friday 16 October 1998 - seminar double header
- Talk 1: 10:30am, DC1302
- Speaker: Dan Ford, IBM Almaden Research Center
- Title: Grand Central Station: Searching all digital sources of data
- Abstract:
- Talk 2: 2:00pm, DC1304
- Speaker: Roberta Cochrane, IBM Almaden Research Center
- Title: Intersection Stacking for Multi-dimensional Aggregation in RDBMSs
- Abstract:
-
- Speaker: Vlado Keselj
- Snacks: Connie Zhang
- when a user query comes to a global database, how do we select which local databases to subquery? (It is not efficient to query all of them.)
- how many documents should be retrieved from each local database, and how should they be merged? (the collection fusion problem)
Grand Central Station is a system that extends search to all digital sources of data. The system consists of components to access and understand data sources and generate searchable metadata ("Gatherers"), a metadata repository for satisfying ad hoc queries, and an extensible profiling system for processing persistent queries. The Gatherer is an extensible crawler framework written in Java that is capable of using a variety of protocols (e.g., http, ftp, nntp, odbc, cics, pop3) to access and understand a wide range of data formats (HTML, Java bytecode, PowerPoint, TAR/Zip archives, and many others). The Gatherer generates summaries of each data source it encounters in an instance of XML we call SumML (Summary Metalanguage). Key features of the Gatherer are its ability to be easily extended by adding protocol-specific and data-source-specific code, and its ability to run, unchanged, on any platform that supports Java. The metadata repository is less advanced, but the profiling framework is moving forward to encompass multimedia profiling. The system has been deployed in the form of a Java-specific search engine called "jCentral" and is accessible from IBM's Java home page. This talk will present and demonstrate the system, and discuss future directions into searching video, image and audio.
Business Intelligence applications perform complex aggregation for large amounts (typically 1 to 10 Terabytes) of data. This puts increasing demands on database systems to provide native support for such processing, often referred to as Online Analytical Processing (OLAP). SQL has recently extended the group-by clause to provide primitives for common OLAP computations in the DBMS, allowing the DBMS more flexibility in processing and optimizing such aggregation. These computations are the data-cube, rollup, concatenations of rollup (multi-dimensional cube), and combinations of ad-hoc grouping elements. The specification of the group-by clause can expand into many grouping sets. For example, the cube alone will result in 2**n grouping sets where n is the number of grouping elements. In this talk I will present the SQL OLAP extensions and describe a novel technique for stacking grouping operations. Our technique results in linear expansion of grouping sets, greatly reducing the amount of complexity and resources required to optimize and compute such queries.
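To make the 2**n figure in the abstract above concrete, here is a minimal Python sketch that enumerates the grouping sets a cube and a rollup expand into. The column names are hypothetical, and the sketch only illustrates the expansion; it does not attempt the stacking technique from the talk, whose details the abstract does not give.

```python
from itertools import combinations


def cube_grouping_sets(columns):
    """Every subset of the grouping columns, as a cube over them expands."""
    sets = []
    for size in range(len(columns), -1, -1):
        sets.extend(combinations(columns, size))
    return sets


def rollup_grouping_sets(columns):
    """Every prefix of the grouping columns, as a rollup over them expands."""
    return [tuple(columns[:size]) for size in range(len(columns), -1, -1)]


# Hypothetical grouping elements, chosen only for illustration.
columns = ["year", "region", "product"]
cube = cube_grouping_sets(columns)
rollup = rollup_grouping_sets(columns)

print(len(cube), len(rollup))  # 8 4
for grouping_set in rollup:
    print(grouping_set)
```

With n grouping elements the cube yields 2**n grouping sets while a single rollup yields n + 1, which is why the expansion becomes a concern once several of these constructs are combined in one group-by clause.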
Friday 23 October 1998 - research group meeting, 2pm, DC1331
- Topic: Determining Text Databases to Search in the Internet
Abstract:
- The Internet can be seen as a large document collection, i.e., as a kind
of large database. However, since it is very dynamic, a better
approximation would be to treat it as a collection of databases.
Actually, it is a hierarchy of databases with smaller or local databases,
and larger or global databases based upon the smaller ones. The larger
databases are not really databases, but interfaces to the local ones.
Some problems associated with this situation are:
My presentation is based on a VLDB-98 paper by Meng et al., with the above
title.
Friday 30 October 1998 - research group meeting, 2pm, DC1331
Friday 06 November 1998 - research group meeting, 2pm, DC1331
Data structure libraries, like LEDA/STL for C++, provide a toolkit for constructing both standard and experimental primary-memory data structures. This talk will discuss preliminary work in building the equivalent of a data structure library for secondary-memory data structures, e.g., B-Tree, R-Tree, etc. Traditionally, secondary-memory data structures are called access methods, and each provides some specialized technique for quickly accessing related data on secondary storage. Unfortunately, traditional access methods are usually hand-coded from scratch, normally requiring substantial knowledge of the underlying file system to build correctly and efficiently. The complexity of building new access methods impedes experimenting with new specialized data structures for accessing non-traditional data, e.g., text, images, highly-related data (CAD/CAM), etc.
Currently, we have generalized one group of access methods: search trees. The generalizations are the parts of the search tree that developers may need to specialize. As well, we have created a small set of specialization components for each generalization so that a developer is not required to write all the components from scratch. By judiciously selecting components from the library and specializing only the components needed for a new algorithm or data structure, a developer can create and test a new access method significantly faster than with the traditional approach.
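As a rough illustration of the idea of selecting and specializing components, the Python sketch below builds a tiny one-level paged index whose page-splitting behaviour is a pluggable component. Everything here (`PagedIndex`, `split_in_half`, `split_keep_left_full`) is a hypothetical name invented for the example, the pages are in-memory lists standing in for disk pages, and it is not the library described in the talk.

```python
from bisect import bisect_right, insort


def split_in_half(keys):
    """Default component: divide an overflowing page into two equal halves."""
    mid = len(keys) // 2
    return keys[:mid], keys[mid:]


def split_keep_left_full(keys):
    """Alternative component: keep the left page full, move only the last key."""
    return keys[:-1], keys[-1:]


class PagedIndex:
    """One-level index over sorted pages; the split policy is swappable."""

    def __init__(self, capacity, split=split_in_half):
        self.capacity = capacity
        self.split = split
        self.pages = [[]]  # in-memory stand-ins for disk pages

    def _page_for(self, key):
        # The first key of every page but the first acts as a separator.
        separators = [page[0] for page in self.pages[1:]]
        return bisect_right(separators, key)

    def insert(self, key):
        i = self._page_for(key)
        insort(self.pages[i], key)
        if len(self.pages[i]) > self.capacity:
            left, right = self.split(self.pages[i])
            self.pages[i:i + 1] = [left, right]

    def contains(self, key):
        return key in self.pages[self._page_for(key)]


balanced = PagedIndex(capacity=3)
packed = PagedIndex(capacity=3, split=split_keep_left_full)
for key in range(1, 9):
    balanced.insert(key)
    packed.insert(key)

print(balanced.pages)  # [[1, 2], [3, 4], [5, 6], [7, 8]]
print(packed.pages)    # [[1, 2, 3], [4, 5, 6], [7, 8]]
```

Only the split component differs between the two indexes; lookup and insertion are reused unchanged, which is the kind of reuse the abstract attributes to a component library.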
Friday 13 November 1998 - research group meeting, 2pm, DC1331
- This Friday I will be talking about a paper that appeared recently
in SIGMOD Record:
Daniela Florescu, Alon Levy, Alberto Mendelzon, "Database Techniques
for the World Wide Web: A Survey", SIGMOD Record, September 1998.
The article describes the current research activities and directions
in the area of databases and the WWW and focuses on three main areas:
modeling and querying the web; information extraction and integration;
and web site construction and restructuring.
I will present an overview of this paper.
Friday 20 November 1998 - research group meeting, 2pm, DC1331
- Trees provide a flexible and often interpretable way to model data. By
using one or more explanatory variables and a tree-structured set of
questions, tree models divide a population into similar groups. This
talk will provide an introduction and overview of tree models,
including CART (Breiman et al., 1984) and C4.5 (Quinlan, 1993).
Topics will include tree construction and validation, and some more
recent methods for the identification and selection of trees when many
different trees may fit the data well.
This topic relates to the VLDB paper discussed by Forbes Burkowski on Oct 30/98.
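A single "question" in such a tree can be sketched as a threshold test on one explanatory variable. The Python sketch below picks the threshold that minimizes the weighted Gini impurity of the two resulting groups, a splitting criterion commonly associated with CART. The data and function names are hypothetical, and a full tree would apply the same step recursively to each group.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a group: 0.0 when every label is the same."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(xs, ys):
    """Return (threshold, weighted impurity) of the best question 'x <= t?'."""
    pairs = sorted(zip(xs, ys))
    best_score, best_threshold = float("inf"), None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate equal values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best_score:
            best_score, best_threshold = score, threshold
    return best_threshold, best_score


# Hypothetical population: one explanatory variable and a class label.
xs = [1, 2, 3, 10, 11, 12]
ys = ["a", "a", "a", "b", "b", "b"]
print(best_split(xs, ys))  # (6.5, 0.0)
```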
Friday 27 November 1998 - research group meeting, 2pm, DC1331
- Monday 07 December 1998 - seminar, 10:30am, DC1304
- Speaker: M. Tamer Ozsu, University of Alberta
- Title: Distributed Objectbase Management Systems, Multimedia, Interoperability
- Schematic heterogeneity arises when information that is represented as
data under one schema is represented within the schema (as metadata)
in another. Schematic heterogeneity is an important class of
heterogeneity that arises frequently in integrating legacy data for
data warehousing applications. Traditional query languages and view
mechanisms are insufficient for reconciling and translating data
between schematically heterogeneous schemas. Higher-order query
languages that permit quantification over schema labels have been
proposed to permit querying and restructuring of data between
schematically disparate schemas. We extend this work by considering
how these languages can be used in practice with minimal extensions to
existing query processing engines. Specifically, we consider the
problem of using higher-order views to answer queries in a
heterogeneous environment. We give conditions under which a
higher-order view is usable for answering a query and provide query
translation algorithms. We show how our solutions permit schema
browsing and new forms of data independence that are important for
global information systems. This is ongoing work with Laura Haas
and the Garlic Heterogeneous Database group from IBM Almaden Research Labs.
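The notion of information being data under one schema and metadata under another can be illustrated with a small Python sketch. The stock-quote records and function names below are hypothetical, made up for the example; the second function loops over attribute names, which loosely mirrors quantification over schema labels and is not a rendering of the query languages discussed in the abstract.

```python
# Hypothetical stock quotes: here the ticker is ordinary data in each row.
rows = [
    {"date": "1998-11-02", "ticker": "ibm", "price": 150.0},
    {"date": "1998-11-02", "ticker": "sun", "price": 60.0},
    {"date": "1998-11-03", "ticker": "ibm", "price": 152.0},
    {"date": "1998-11-03", "ticker": "sun", "price": 61.0},
]


def tickers_to_columns(rows):
    """Restructure so that each ticker value becomes an attribute name."""
    by_date = {}
    for row in rows:
        wide_row = by_date.setdefault(row["date"], {"date": row["date"]})
        wide_row[row["ticker"]] = row["price"]
    return [by_date[date] for date in sorted(by_date)]


def columns_to_tickers(wide_rows):
    """Inverse direction: range over attribute names to recover them as data."""
    result = []
    for wide_row in wide_rows:
        for label, value in wide_row.items():
            if label != "date":
                result.append(
                    {"date": wide_row["date"], "ticker": label, "price": value}
                )
    return result


wide = tickers_to_columns(rows)
print(wide[0])  # {'date': '1998-11-02', 'ibm': 150.0, 'sun': 60.0}
print(columns_to_tickers(wide) == rows)  # True
```

A conventional query over the second layout has to name each ticker column explicitly, which is why translating between the two layouts calls for a language that can treat attribute names as values.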
- Abstract:
- My current research concentrates on three areas: (1) development of
distributed object database management systems (ODBMS), (2) multimedia
data management, and (3) interoperability of information systems.
My work on distributed ODBMS is concentrated around the development of
the TIGUKAT system (TIGUKAT means "object" in Inuit). TIGUKAT's object
model is purely behavioral in nature and is uniform. Every concept that
can be modeled in TIGUKAT is a first-class object with well-defined
behavior. This gives the system extensibility. Current work involves the
development of a query language and its optimization, incorporation of
the temporal dimension into the object model, development of a
programming language, and distribution of the system.
My multimedia research focuses on data management issues. We are
developing an object-oriented multimedia database system which can
support SGML/HyTime compliant documents. An associated project is the
development of a distributed image database system. Current research
involves the design of a multimedia query model and language,
development of a visual query interface, and context-based indexing and
access of images.
Interoperability research goes beyond multidatabase by considering the
interoperability of information systems in general. The approach is
object-oriented and our focus is on the use of object-oriented
characteristics to deal with interoperability problems.
- Friday 11 December - research group meeting, 2pm, DC1331
- Speaker: Anthony Cox
- Snacks: Mariano Consens
- Topic: A Searchable Source Code Repository
- This talk will describe a prototype source code repository implemented
using the Multitext system. Issues regarding the system structure, data
collection and querying will be presented.
- Wednesday 16 December - MMath thesis presentation, 10:30am, DC1331
- Speaker: Winnie Min-Min Yeung
- Title: Efficient Evaluation of Geometric Expressions with a Single Plane-Sweep
- Wednesday 16 December - MMath thesis presentation, 2pm, DC1331
- Speaker: Peter Yang
- Title: PAPRICCA: Pre-Analyzing, PRedIcate-Based Concurrency Control Algorithm
- Friday 18 December - MMath thesis presentation, 2pm, DC1304
- Speaker: Ming Lei
- Title: Efficient Processing of Spatial Join Qualification