Database Research Group Events

Spring 2000

Time: Friday, May 5, 2:00 pm, DC1331
Speaker: Frank Tompa
Topic: An example of text data mining

The data mining group at Stanford has been examining the problem of extracting useful relations from large collections of text. An intriguing approach to this task was described by Sergey Brin (one of the developers of Google) in 1998: extract a <book author, book title> database from the World Wide Web by using the following bootstrap process:

  1. Seed the target relation with a few tuples; e.g., <Melville, Moby Dick>.
  2. Repeat:
    1. Search the text for matches to a tuple's fields in close proximity; e.g., "the beast immortalized by Herman Melville in MOBY DICK. In a special ...."
    2. Recover patterns that match the surrounding context in the text; e.g., "... by $author in $title ...".
    3. Extract additional tuples by matching these patterns back to the text.

I will describe my experience with applying this technique to recovering the names of sports teams from the Ottawa Citizen. After that I will introduce the problem of classifying word senses in the Oxford English Dictionary by subject (e.g., recognizing that "dial-up", "and" in sense D, "cursor" in sense 2b, and "zone" in sense 9 designate computing usage of the words).

Snacks: Anthony Cox

Time: Friday, May 12, 2:00 pm, DC1331
Speaker: none
Topic: "open mike"

Time: Friday, May 19,
No meeting

Time: Friday, May 26, 2:00 pm, DC1331
Speaker: Arun Marathe
Topic: Applications of AML

I shall present several applications of the Array Manipulation Language (AML) that we have designed to manipuate array data. Applications include wavelet reconstruction, JPEG-based still image compression, and several queries from the satellite image processing domain. Array manipulations in these and other queries are highly structured and it will be seen how AML captures such regularities.

I shall also present query evaluation times for these queries (except JPEG) on an iterator-based AML evaluator. For fun, I shall compare these times with running times of the C++ programs that I wrote for the same queries.

Snacks: Frank Tompa

Time: Friday, June 2, 2:00 pm, DC1331
Speaker: Huizhu Liu
Topic: Pipeline Plan Generation for Network Management Applications

Well maintained computing systems (including hardware, software and network connections) provide a pleasant working environment for everybody who share the facility. That makes the network management work very important. In order to efficiently and effectively manage the system and provide the high quality service to everybody, system administrators have to collect statistical data and monitor the performance of the system all the time. However, with the introducing of more and more different hardware and software into the system, it is more and more difficult to do that. In this talk, we describe a framework to help make data retrieving easy by treating OS as data repositories and using standard distributed database management skills to manage the data. This talk will focus on the special features needed in query processing to make a distributed DBMS fit to handle the network management applications. Also we'll talk about exploiting union information in schema to generate more efficient plans. Part of this talk is based on the work of DEMO project.

Snacks: Arun Marathe

Time: Friday, June 9, 2:00 pm, DC1331
Speaker: Reem Al-Halimi
Topic: Presentation of a paper by A. McCallum entitled: Multi-Label, Multiclass Document Classification with a Mixture Model Trained by EM

The paper "describes a Bayesian Classification approach in which multiple class 'topics' that comprise a document are representer by a mixture model" which also provides information about which words in the document was responsible for generating which topic.

Snacks: Huizhu Liu

Time: Friday, June 16, 2:00 pm, DC1331
Speaker: none
Topic: "open mike"

Time: Friday, June 23, 2:00 pm, DC1331
Speaker: Matthew Young-Lai
Topic: One-Pass Evaluation of Region Algebra Expressions

Suppose we have a region algebra for querying text databases. Given an expression composed of functions from the algebra, what is an appropriate way to compute the result? The obvious approach is to evaluate the component queries one at a time in some appropriate order. I will describe an alternative that evaluates all queries simultaneously, thus avoiding the cost of writing intermediate results and reading them back in. This approach is applicable with certain restricted algebras, which I will characterize.

Snacks: Reem Al-Halimi

Time: Friday, June 30, 2:00 pm, DC1331
Speaker: Zarrin Langari
Topic: Classifying word senses in the OED

We aim to classify word senses in the Oxford English Dictionary based on their subjects (Anthropology, Music, ...) by applying some categorization techniques. The categories are denoted by subject labels used in the dictionary, where some of the word senses have an explicit subject, but most do not. It has been noted that some of the senses with no explicit label could be assigned a subject, and our intention is to choose a subject for these word senses.

In assigning subjects, one technique is to use a bootstrap process over word senses based on their citations (refer to the talk presented by Frank on May 5th). In anticipation of this task, we first did some preparation of the OED data. We intend to classify senses using a standard technique such as k-nearest neighbours, as well as a newer technique which we hope will have better performance. I will point out some problems that must be addressed in this work.

Snacks: Matthew Young-Lai

Time: Tuesday & Wednesday, July 3-4, DC1331
Workshop: Temporal Get-Together
Topics: Discussion will be stimulated by 4 presentations as follows:
  • Dr. Carlo Combi
    Vagueness, Indeterminacy, and Different Granularities in Temporal Databases
  • Dr. Iqbal Goralwalla
    Temporality in Object Database Management Systems
  • Dr. Claudio Bettini
    Free Schedules for Free Agents in Workflow Systems
  • Dr. David Toman
    On Expiration in Historical Data Warehouses
See full program for details.

Time: Friday, July 7, 2:00 pm, DC1331
Speaker: Khuzaima Daudjee
Topic: Client-Server Query Architectures

I will give an overview of query processing architectures, and talk about some issues involved in client-server query processing.

Snacks: Zarrin Langari

Time: Friday, July 14, 2:00 pm, DC1331
Speaker: Feng Zhao
Topic: Getting Good Data From the Web

The talk will cover (1) how a crawler works, (2) how to write a crawler with database help, (3) my approach in getting more useful data first, and (4) evaluation -- the work under construction: mainly about how to compare with other crawling methods.

Snacks: Khuzaima Daudjee

Time: Friday, July 21, 2:00 pm, DC1331
Speaker: Lubomir Stanchev
Topic: Reducing the Size of Auxiliary Data Needed to Maintain Materialized Views by Exploiting Integrity Constraints

A data warehouse consists of materialized views, which contain derived data from several data sources. The advantage of using materialized views is that they contain the results of standard queries and therefore when such queries are posed, the data sources those queries are based on, which are usually costly to access because of their size and remoteness, don't have to be accessed. However in order for the materialized views to contain up-to-date data they have to be updated periodically. Such synchronization of the materialized views with the data sources could be slow if the later have to be queried for a correct update to be done -- i.e. if just the data for the changes done to the data sources is insufficient. A way querying the data sources during materialized view update can be avoided is by storing any data from the data sources, which could be relevant to an update of the materialized views, on the data warehouse site. Such auxiliary data is stored in auxiliary views and has the characteristic that it makes the data stored at the data warehouse self-maintainable, i.e. it can be updated correctly from the log of changes done to the data sources. In the talk unique algorithms for keeping the size of such auxiliary views relatively small will be covered. This is achieved by using the additional information about the integrity constraints that hold on the data sources. More specifically object-oriented model to describe a database schema and a subset of OQL as the query language will be used. The proposed algorithms produce auxiliary views that make a materialized view self-maintainable relative to the three possible operations on a database: addition, deletion and object update.

Snacks: Feng Zhao

Time: Friday, July 28, 2:00 pm, DC1331
Speaker: Hans-Peter Kriegel (Institute for Computer Science, University of Munich)
Topic: Knowledge Discovery and Similarity Search in Databases

Both, the number and the size of spatial and multimedia databases, such as geographic and image databases, respectively, are rapidly growing because of the large amount of data obtained from satellite images, X-ray crystallography, computer tomography or other scientific equipment. This growth by far exceeds human capacities to analyze the databases in order to find implicit similarities, regularities, rules or clusters hidden in the data. Therefore, automated knowledge discovery and efficient similarity search becomes more and more important.

The major difference between knowledge discovery in relational databases and in spatial databases is that attributes of the neighbors of some object of interest may have an influence on the object itself. Therefore, spatial data mining algorithms heavily depend on the efficient processing of neighborhood relations since the neighbors of many objects have to be investigated in a single run of a typical algorithm. We define a small set of database primitives and we demonstrate that typical spatial data mining algorithms such as clustering, characterization and trend detection are well supported by the proposed database primitives.

In the second part of the talk we focus on similarity search in multimedia databases which is highly application- and user-dependent. Therefore we derive similarity models to be adaptable to application specific requirements and individual user preferences. Examples include flexible pixel-based shape similarity, 3D shape histograms and quadratic forms resulting in ellipsoid queries in high-dimensional data spaces. We show that these ellipsoid queries can be processed efficiently. Additionally, these queries support modification of the similarity matrix at query time. This is shown using some snapshots of our similarity search system.

The talk concludes with some considerations of future research topics.

This page is maintained by
Frank Tompa.