Time: Friday, May 5, 2:00 pm, DC1331
An example of text data mining
The data mining group at Stanford has been examining the problem of extracting useful relations from large collections of text. An intriguing approach to this task was described by Sergey Brin (one of the developers of Google) in 1998: extract a <book author, book title> database from the World Wide Web by using the following bootstrap process: start from a small seed set of known (author, title) pairs; find occurrences of those pairs on the Web; induce extraction patterns from the contexts of those occurrences; apply the patterns to find new pairs; and repeat.
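Brin's bootstrap (the DIPRE technique) can be sketched in a few lines. The corpus, seed pair, and pattern shape below are invented for illustration; real DIPRE uses richer patterns (URL prefix, and prefix/middle/suffix context strings) over actual Web pages:

```python
import re

def bootstrap(corpus, seeds, rounds=2):
    """Toy DIPRE-style bootstrap: grow a set of (author, title) pairs."""
    pairs = set(seeds)
    for _ in range(rounds):
        # Step 1: find occurrences of known pairs; the text between author
        # and title becomes a candidate extraction pattern.
        patterns = set()
        for author, title in pairs:
            for doc in corpus:
                m = re.search(re.escape(author) + r"(.{1,20}?)" + re.escape(title), doc)
                if m:
                    patterns.add(m.group(1))
        # Step 2: apply the induced patterns elsewhere to harvest new pairs.
        for pattern in patterns:
            for doc in corpus:
                for m in re.finditer(r"(\w[\w ]+?)" + re.escape(pattern) + r"([\w ]+)", doc):
                    pairs.add((m.group(1).strip(), m.group(2).strip()))
    return pairs
```

Each round alternates between occurrences and patterns, which is why a handful of seeds can grow into a large database.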
I will describe my experience with applying this technique to recovering the names of sports teams from the Ottawa Citizen. After that I will introduce the problem of classifying word senses in the Oxford English Dictionary by subject (e.g., recognizing that "dial-up", "and" in sense D, "cursor" in sense 2b, and "zone" in sense 9 designate computing usages of those words).
Time: Friday, May 12, 2:00 pm, DC1331
Time: Friday, May 19,
Time: Friday, May 26, 2:00 pm, DC1331
Applications of AML
I shall present several applications of the Array Manipulation Language (AML) that we have designed to manipulate array data. Applications include wavelet reconstruction, JPEG-based still image compression, and several queries from the satellite image processing domain. Array manipulations in these and other queries are highly structured, and I will show how AML captures such regularities.
I shall also present query evaluation times for these queries (except JPEG) on an iterator-based AML evaluator. For fun, I shall compare these times with running times of the C++ programs that I wrote for the same queries.
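As a rough illustration of the kind of structured array manipulation the talk refers to, here is a one-level Haar wavelet decomposition and reconstruction in plain Python (not AML; AML's own syntax is not shown here):

```python
def haar_decompose(signal):
    """One level of the Haar transform: pairwise averages and differences."""
    approx = [(a + b) / 2 for a, b in zip(signal[0::2], signal[1::2])]
    detail = [(a - b) / 2 for a, b in zip(signal[0::2], signal[1::2])]
    return approx, detail

def haar_reconstruct(approx, detail):
    """Invert one level: each (average, difference) pair yields two samples."""
    out = []
    for s, d in zip(approx, detail):
        out.extend([s + d, s - d])
    return out
```

The strided slices and pairwise interleaving are exactly the regular access patterns an array language can express and optimize directly.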
Time: Friday, June 2, 2:00 pm, DC1331
Pipeline Plan Generation for Network Management Applications
Well-maintained computing systems (including hardware, software, and network connections) provide a pleasant working environment for everyone who shares the facility, which makes network management important. To manage the system efficiently and effectively and to provide high-quality service, system administrators must continuously collect statistical data and monitor system performance. However, as more and more different hardware and software are introduced into the system, this becomes increasingly difficult. In this talk, we describe a framework that makes data retrieval easier by treating operating systems as data repositories and applying standard distributed database management techniques to the data. The talk will focus on the special features needed in query processing to make a distributed DBMS suitable for network management applications. We will also discuss exploiting union information in the schema to generate more efficient plans. Part of this talk is based on work from the DEMO project.
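A minimal sketch of the "operating system as data repository" idea, assuming monitoring data has already been loaded into relations (the hosts and figures below are invented for the example); management questions then become declarative queries:

```python
import sqlite3

# Expose monitoring data as a relation and query it with SQL, the way a
# distributed DBMS over OS data repositories might.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE load (host TEXT, cpu_pct REAL, ts INTEGER)")
conn.executemany("INSERT INTO load VALUES (?, ?, ?)", [
    ("db1", 92.5, 100), ("db1", 88.0, 160), ("web1", 35.0, 100),
])

# Declarative monitoring query: hosts whose average CPU load exceeds 80%.
hot = conn.execute(
    "SELECT host, AVG(cpu_pct) FROM load GROUP BY host HAVING AVG(cpu_pct) > 80"
).fetchall()
```

In the framework described in the talk, such tables would be virtual views over live OS statistics on remote hosts, which is where the distributed query-processing issues arise.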
Time: Friday, June 9, 2:00 pm, DC1331
Presentation of a paper by A. McCallum entitled: Multi-Label, Multiclass Document Classification with a Mixture Model Trained by EM
The paper "describes a Bayesian classification approach in which the multiple class 'topics' that comprise a document are represented by a mixture model", which also provides information about which words in the document were responsible for generating which topic.
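The "which word generated which topic" computation is, in essence, a per-word E-step. A simplified sketch, assuming fixed topic priors and per-topic word distributions (both invented here; in the paper they are per-document mixture weights and multinomials learned by EM):

```python
def word_responsibilities(doc, topic_priors, word_probs):
    """For each word in doc, compute the posterior probability that each
    topic generated it: r[t] proportional to P(t) * P(word | t).
    A simplified E-step of a mixture model; the small floor value stands
    in for proper smoothing of unseen words."""
    result = {}
    for word in doc:
        scores = {t: topic_priors[t] * word_probs[t].get(word, 1e-9)
                  for t in topic_priors}
        total = sum(scores.values())
        result[word] = {t: s / total for t, s in scores.items()}
    return result
```

Normalizing over topics is what lets the model attribute individual words to individual topics within one multi-label document.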
Time: Friday, June 16, 2:00 pm, DC1331
Time: Friday, June 23, 2:00 pm, DC1331
One-Pass Evaluation of Region Algebra Expressions
Suppose we have a region algebra for querying text databases. Given an expression composed of functions from the algebra, what is an appropriate way to compute the result? The obvious approach is to evaluate the component queries one at a time in some appropriate order. I will describe an alternative that evaluates all queries simultaneously, thus avoiding the cost of writing intermediate results and reading them back in. This approach is applicable with certain restricted algebras, which I will characterize.
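The one-pass idea can be illustrated with lazy streams: each operator consumes sorted streams of (start, end) regions and yields results without materializing intermediate lists. A sketch for a "contained in" operator, assuming (an assumption for this example) that the outer regions are sorted and non-overlapping:

```python
def contained_in(inner, outer):
    """Stream the regions of `inner` that lie inside some region of `outer`.
    Both inputs are iterables of (start, end) pairs sorted by start; the
    outer regions are assumed non-overlapping. Results are yielded lazily,
    so a whole tree of such operators runs in a single pass with no
    intermediate results written out."""
    outer = iter(outer)
    cur = next(outer, None)
    for s, e in inner:
        # Skip outer regions that end before this inner region starts;
        # since both streams are sorted, they can never contain anything later.
        while cur is not None and cur[1] < s:
            cur = next(outer, None)
        if cur is not None and cur[0] <= s and e <= cur[1]:
            yield (s, e)
```

Because the output is itself a sorted stream, operators compose: the result of one `contained_in` can feed another without buffering.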
Time: Friday, June 30, 2:00 pm, DC1331
Classifying word senses in the OED
We aim to classify word senses in the Oxford English Dictionary by subject (Anthropology, Music, ...) using text categorization techniques. The categories are denoted by subject labels used in the dictionary; some word senses carry an explicit subject label, but most do not. It has been noted that some of the senses with no explicit label could be assigned a subject, and our intention is to choose a subject for these word senses.
In assigning subjects, one technique is to use a bootstrap process over word senses based on their citations (refer to the talk presented by Frank on May 5th). In anticipation of this task, we first did some preparation of the OED data. We intend to classify senses using a standard technique such as k-nearest neighbours, as well as a newer technique which we hope will have better performance. I will point out some problems that must be addressed in this work.
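A minimal k-nearest-neighbours sketch for this task, assuming each sense has been reduced to a set of words drawn from its definition and citations (plain word overlap stands in here for a real similarity measure such as tf-idf-weighted cosine):

```python
from collections import Counter

def knn_label(query_words, labeled, k=3):
    """Assign a subject to an unlabeled word sense by majority vote among
    the k labeled senses whose word sets overlap it most.
    `labeled` is a list of (word_set, subject_label) pairs."""
    ranked = sorted(labeled,
                    key=lambda item: len(query_words & item[0]),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

The explicitly labeled senses serve as the training set, and each unlabeled sense is classified independently.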
Time: Tuesday & Wednesday, July 3-4, DC1331
Discussion will be stimulated by four presentations.
Time: Friday, July 7, 2:00 pm, DC1331
Client-Server Query Architectures
I will give an overview of query processing architectures, and talk about some issues involved in client-server query processing.
Time: Friday, July 14, 2:00 pm, DC1331
Getting Good Data From the Web
The talk will cover (1) how a crawler works, (2) how to write a crawler with database support, (3) my approach to retrieving the most useful data first, and (4) evaluation (work still in progress), mainly how the approach compares with other crawling methods.
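Points (1) and (3) can be sketched as a best-first crawl: a priority queue orders the frontier so that pages judged more useful are fetched first. The `fetch` and `score` functions below are stubs standing in for real HTTP retrieval and usefulness estimation (both are assumptions of this example, not the speaker's actual method):

```python
import heapq

def crawl(start, fetch, score, limit=100):
    """Best-first crawler skeleton. `fetch(url)` returns the outlinks of a
    page; `score(url)` estimates its usefulness. The frontier is a max-
    priority queue (negated scores in a min-heap), and a `seen` set keeps
    each URL from being enqueued twice."""
    frontier = [(-score(start), start)]
    seen = {start}
    visited = []
    while frontier and len(visited) < limit:
        _, url = heapq.heappop(frontier)
        visited.append(url)
        for link in fetch(url):
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score(link), link))
    return visited
```

Swapping the priority queue for a plain FIFO queue recovers ordinary breadth-first crawling, which is a natural baseline for the evaluation in point (4).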
Time: Friday, July 21, 2:00 pm, DC1331
Reducing the Size of Auxiliary Data Needed to Maintain Materialized Views by Exploiting Integrity Constraints
A data warehouse consists of materialized views, which contain data derived from several data sources. The advantage of materialized views is that they hold the results of standard queries, so when such queries are posed, the underlying data sources, which are usually costly to access because of their size and remoteness, need not be touched. However, for the materialized views to contain up-to-date data, they must be updated periodically. Synchronizing the materialized views with the data sources can be slow if the latter have to be queried for a correct update, that is, if the changes made to the data sources do not by themselves contain enough information.

Querying the data sources during view updates can be avoided by storing, at the warehouse site, any source data that could be relevant to an update of the materialized views. Such auxiliary data is kept in auxiliary views and makes the data stored at the warehouse self-maintainable: it can be updated correctly from the log of changes made to the data sources alone.

The talk will cover novel algorithms for keeping such auxiliary views relatively small. This is achieved by using additional information about the integrity constraints that hold on the data sources. More specifically, an object-oriented model is used to describe the database schema, with a subset of OQL as the query language. The proposed algorithms produce auxiliary views that make a materialized view self-maintainable with respect to the three possible operations on a database: addition, deletion, and object update.
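A toy sketch of self-maintenance: a materialized join is updated from source deltas alone, with an auxiliary view holding only the source columns the join needs rather than a full copy of the source. The schema and data are invented; if a foreign-key constraint guaranteed that every sale references an existing product, that is the kind of integrity-constraint information the talk's algorithms exploit to shrink the auxiliary data further:

```python
# Materialized view: the join of Sale and Product, as (product_name, qty).
view = [("widget", 5)]

# Auxiliary view: only the Product columns the join needs (id -> name),
# kept at the warehouse so the Product source never has to be queried.
aux_product = {1: "widget"}

def on_product_insert(pid, name):
    """Delta from the Product source: grow the auxiliary view only."""
    aux_product[pid] = name

def on_sale_insert(pid, qty):
    """Delta from the Sale source: the new join tuples are computable
    locally from the auxiliary view, with no round trip to the sources."""
    if pid in aux_product:
        view.append((aux_product[pid], qty))
```

Deletions and object updates need analogous handlers (and, in general, counts of derivations); the talk's contribution is deciding how little auxiliary data suffices for all three operations.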
Time: Friday, July 28, 2:00 pm, DC1331
Speaker: Hans-Peter Kriegel (Institute for Computer Science, University of Munich)
Knowledge Discovery and Similarity Search in Databases
Both the number and the size of spatial and multimedia databases, such as geographic and image databases, are rapidly growing because of the large amounts of data obtained from satellite images, X-ray crystallography, computer tomography, and other scientific equipment. This growth far exceeds human capacity to analyze the databases in order to find the implicit similarities, regularities, rules, or clusters hidden in the data. Therefore, automated knowledge discovery and efficient similarity search become more and more important.
The major difference between knowledge discovery in relational databases and in spatial databases is that attributes of the neighbors of an object of interest may have an influence on the object itself. Therefore, spatial data mining algorithms depend heavily on the efficient processing of neighborhood relations, since the neighbors of many objects have to be investigated in a single run of a typical algorithm. We define a small set of database primitives and demonstrate that typical spatial data mining algorithms, such as clustering, characterization, and trend detection, are well supported by them.
In the second part of the talk we focus on similarity search in multimedia databases, which is highly application- and user-dependent. We therefore derive similarity models that can be adapted to application-specific requirements and individual user preferences. Examples include flexible pixel-based shape similarity, 3D shape histograms, and quadratic forms resulting in ellipsoid queries in high-dimensional data spaces. We show that these ellipsoid queries can be processed efficiently, and that they support modification of the similarity matrix at query time. This is demonstrated using snapshots of our similarity search system.
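A quadratic-form distance is what turns similarity search into an ellipsoid query: under d²(x, y) = (x − y)ᵀ A (x − y), the points at equal distance from a query form an ellipsoid whenever A is not the identity. A small sketch in plain Python (the matrix below is illustrative, not one from the talk's system):

```python
def quad_form_distance(x, y, A):
    """Quadratic-form distance: (x - y)^T A (x - y).
    Off-diagonal entries of A let correlated dimensions (e.g. similar
    colors in a color histogram) contribute jointly to the distance."""
    d = [a - b for a, b in zip(x, y)]
    return sum(d[i] * A[i][j] * d[j]
               for i in range(len(d)) for j in range(len(d)))
```

Because A appears only in this formula, a user can change the similarity matrix at query time without re-indexing the data, which is the property the talk exploits.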
The talk concludes with some considerations of future research topics.