Spring 2000
Time: |
Friday,
May 5,
2:00 pm, DC1331 |
Speaker: |
Frank Tompa
|
Topic: |
An example of text data mining
The data mining group at Stanford has been examining the problem of
extracting useful relations from large collections of text.
An intriguing approach to this task was described by Sergey Brin
(one of the developers of Google) in 1998:
extract a <book author, book title> database from the World Wide Web
by using the following bootstrap process:
- Seed the target relation with a few tuples; e.g., <Melville, Moby Dick>.
- Repeat:
- Search the text for matches to a tuple's fields in close proximity; e.g., "the beast immortalized by Herman Melville in MOBY DICK. In a special ...."
- Recover patterns that match the surrounding context in the text; e.g., "... by $author in $title ...".
- Extract additional tuples by matching these patterns back to the text.
I will describe my experience applying this technique to recover the names of sports teams from the Ottawa Citizen. After that, I will introduce the problem of classifying word senses in the Oxford English Dictionary by subject (e.g., recognizing that "dial-up", "and" in sense D, "cursor" in sense 2b, and "zone" in sense 9 designate computing usage of the words).
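The bootstrap loop above can be sketched in a few lines. The corpus, seed pair, and capitalized-word heuristics below are invented for illustration; they stand in for Brin's actual pattern representation, which is richer:

```python
import re

# Illustrative corpus and seed tuple (not real data from the talk).
corpus = ("the beast immortalized by Herman Melville in Moby Dick. "
          "A novel by Jane Austen in Emma was serialized ...")

seeds = {("Herman Melville", "Moby Dick")}

def find_patterns(text, tuples):
    """Recover the context between known (author, title) pairs in close proximity."""
    patterns = set()
    for author, title in tuples:
        for m in re.finditer(re.escape(author) + r"(.{1,10})" + re.escape(title), text):
            patterns.add(m.group(1))   # e.g. " in "
    return patterns

def apply_patterns(text, patterns):
    """Match each recovered pattern back to the text to extract new pairs."""
    found = set()
    # Runs of capitalized words stand in for the author and title fields.
    field = r"([A-Z][a-z]+(?: [A-Z][a-z]+)*)"
    for infix in patterns:
        for m in re.finditer(field + re.escape(infix) + field, text):
            found.add((m.group(1), m.group(2)))
    return found

patterns = find_patterns(corpus, seeds)
tuples = seeds | apply_patterns(corpus, patterns)
```

Iterating the two steps over a larger corpus grows both the pattern set and the tuple set, which is what makes the process a bootstrap.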
|
Snacks: |
Anthony Cox
|
Time: |
Friday,
May 12,
2:00 pm, DC1331 |
Speaker: |
none
|
Topic: |
"open mike"
|
Time: |
Friday,
May 19,
|
No meeting
|
Time: |
Friday,
May 26,
2:00 pm, DC1331 |
Speaker: |
Arun Marathe
|
Topic: |
Applications of AML
I shall present several applications of the Array Manipulation
Language (AML), which we have designed to manipulate array data.
Applications include wavelet reconstruction, JPEG-based still
image compression, and several queries from the satellite image
processing domain. Array manipulations in these and other
queries are highly structured, and I will show how AML
captures such regularities.
I shall also present query evaluation times for these queries
(except JPEG) on an iterator-based AML evaluator. For fun,
I shall compare these times with running times of the C++
programs that I wrote for the same queries.
|
Snacks: |
Frank Tompa
|
Time: |
Friday,
June 2,
2:00 pm, DC1331 |
Speaker: |
Huizhu Liu
|
Topic: |
Pipeline Plan Generation for Network Management Applications
Well-maintained computing systems (including hardware,
software, and network connections) provide a pleasant
working environment for everybody who shares the
facility, which makes network management very
important. To manage the system efficiently and
effectively and to provide high-quality service to
everybody, system administrators have to collect
statistical data and monitor the performance of the
system at all times. However, as more and more
different hardware and software is introduced into the
system, this becomes increasingly difficult. In
this talk, we describe a framework that makes data
retrieval easier by treating operating systems as data
repositories and applying standard distributed database
management techniques to the data. The talk will focus
on the special features needed in query processing to
make a distributed DBMS suitable for network management
applications. We will also talk about exploiting union
information in the schema to generate more efficient plans.
Part of this talk is based on work from the DEMO project.
|
Snacks: |
Arun Marathe
|
Time: |
Friday,
June 9,
2:00 pm, DC1331 |
Speaker: |
Reem Al-Halimi
|
Topic: |
Presentation of a paper by A. McCallum entitled: Multi-Label, Multiclass
Document Classification with a Mixture Model Trained by EM
The paper "describes a Bayesian classification approach in which multiple
class 'topics' that comprise a document are represented by a mixture
model", which also provides information about which words in the document
are responsible for generating which topic.
|
Snacks: |
Huizhu Liu
|
Time: |
Friday,
June 16,
2:00 pm, DC1331 |
Speaker: |
none
|
Topic: |
"open mike"
|
Time: |
Friday,
June 23,
2:00 pm, DC1331 |
Speaker: |
Matthew Young-Lai
|
Topic: |
One-Pass Evaluation of Region Algebra Expressions
Suppose we have a region algebra for querying text databases. Given an
expression composed of functions from the algebra, what is an appropriate
way to compute the result? The obvious approach is to evaluate the
component queries one at a time in some appropriate order. I will
describe an alternative that evaluates all queries simultaneously, thus
avoiding the cost of writing intermediate results and reading them back
in. This approach is applicable with certain restricted algebras, which I
will characterize.
|
Snacks: |
Reem Al-Halimi
|
Time: |
Friday,
June 30,
2:00 pm, DC1331 |
Speaker: |
Zarrin Langari
|
Topic: |
Classifying word senses in the OED
We aim to classify word senses in the Oxford English Dictionary
based on their subjects (Anthropology, Music, ...) by applying
some categorization techniques.
The categories are denoted by subject labels used in the dictionary,
where some of the word senses have an explicit subject,
but most do not.
It has been noted that some of the senses with no explicit label
could be assigned a subject,
and our intention is to choose a subject for these word senses.
In assigning subjects, one technique is to use a bootstrap process
over word senses based on their citations
(refer to the talk presented by Frank on May 5th).
In anticipation of this task, we first did some preparation of the OED data.
We intend to classify senses using a standard technique such as
k-nearest neighbours,
as well as a newer technique which we hope will have better performance.
I will point out some problems that must be addressed in this work.
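As a rough illustration of the standard technique mentioned above, here is a toy k-nearest-neighbours classifier over bag-of-words vectors of citation text. The labelled citations and subject labels are invented for the sketch, not taken from the OED:

```python
from collections import Counter
import math

# Invented citation snippets with subject labels (illustrative only).
labelled = [
    ("move the cursor across the screen and click", "Computing"),
    ("dial-up access to the remote host over a modem", "Computing"),
    ("the string quartet played in a minor key", "Music"),
    ("a melody in the dorian mode", "Music"),
]

def vector(text):
    """Bag-of-words term counts."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity of two count vectors."""
    dot = sum(u[w] * v[w] for w in u)
    norm = math.sqrt(sum(c * c for c in u.values())) * \
           math.sqrt(sum(c * c for c in v.values()))
    return dot / norm

def knn_label(text, k=3):
    """Assign the majority label among the k most similar labelled citations."""
    v = vector(text)
    ranked = sorted(labelled, key=lambda lt: cosine(v, vector(lt[0])), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

A real classifier over the OED would need far more labelled senses per subject and better features, but the nearest-neighbour voting scheme is the same.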
|
Snacks: |
Matthew Young-Lai
|
Time: |
Tuesday & Wednesday,
July 3-4,
DC1331 |
Workshop: |
Temporal Get-Together
|
Topics: |
Discussion will be stimulated by 4 presentations as follows:
- Dr. Carlo Combi
Vagueness, Indeterminacy, and Different Granularities in Temporal Databases
- Dr. Iqbal Goralwalla
Temporality in Object Database Management Systems
- Dr. Claudio Bettini
Free Schedules for Free Agents in Workflow Systems
- Dr. David Toman
On Expiration in Historical Data Warehouses
See full program for details.
|
Time: |
Friday,
July 7,
2:00 pm, DC1331 |
Speaker: |
Khuzaima Daudjee
|
Topic: |
Client-Server Query Architectures
I will give an overview of query
processing architectures, and talk about some issues involved in
client-server query processing.
|
Snacks: |
Zarrin Langari
|
Time: |
Friday,
July 14,
2:00 pm, DC1331 |
Speaker: |
Feng Zhao
|
Topic: |
Getting Good Data From the Web
The talk will cover (1) how a crawler works,
(2) how to write a crawler with database help,
(3) my approach to getting more useful data first, and
(4) evaluation -- work in progress, mainly about how to compare
with other crawling methods.
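Point (3) amounts to ordering the crawl frontier by an estimate of page usefulness. A minimal sketch, in which an invented link graph and scoring table stand in for real HTTP fetches and page scoring:

```python
import heapq

# Invented link graph and usefulness scores (illustrative only).
links = {"seed": ["a", "b"], "a": ["c"], "b": [], "c": []}
scores = {"seed": 1.0, "a": 0.9, "b": 0.2, "c": 0.5}

def crawl(start):
    """Visit pages in order of estimated usefulness, highest first."""
    frontier = [(-scores[start], start)]   # max-heap via negated score
    seen = {start}
    order = []
    while frontier:
        _, url = heapq.heappop(frontier)
        order.append(url)                  # "fetch" the page
        for out in links.get(url, []):     # discover outgoing links
            if out not in seen:
                seen.add(out)
                heapq.heappush(frontier, (-scores[out], out))
    return order
```

With these scores, the crawl reaches the high-value page `c` before the low-value page `b`, even though `b` is discovered first; a breadth-first crawler would fetch them in discovery order.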
|
Snacks: |
Khuzaima Daudjee
|
Time: |
Friday,
July 21,
2:00 pm, DC1331 |
Speaker: |
Lubomir Stanchev
|
Topic: |
Reducing the Size of Auxiliary Data Needed to
Maintain Materialized Views by Exploiting Integrity Constraints
A data warehouse consists of materialized views, which contain data
derived from several data sources. The advantage of using materialized
views is that they contain the results of standard queries; when such
queries are posed, the data sources they are based on, which are usually
costly to access because of their size and remoteness, do not have to be
accessed. However, in order for the materialized views to contain
up-to-date data, they have to be updated periodically. Such
synchronization of the materialized views with the data sources can be
slow if the latter have to be queried for a correct update to be done --
i.e., if the data describing the changes made to the data sources is
insufficient on its own. One way to avoid querying the data sources
during a materialized view update is to store, at the data warehouse
site, any data from the data sources that could be relevant to an update
of the materialized views. Such auxiliary data is stored in auxiliary
views and makes the data stored at the data warehouse self-maintainable,
i.e., it can be updated correctly from the log of changes made to the
data sources. The talk will cover algorithms for keeping the size of
such auxiliary views relatively small, which is achieved by exploiting
additional information about the integrity constraints that hold on the
data sources. More specifically, an object-oriented model will be used
to describe the database schema, with a subset of OQL as the query
language. The proposed algorithms produce auxiliary views that make a
materialized view self-maintainable with respect to the three possible
operations on a database: addition, deletion, and object update.
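Self-maintainability can be illustrated with a toy example: a materialized count-per-key view updated purely from a log of changes, without ever re-querying the source. The schema and log format below are invented for the sketch:

```python
# Materialized view: counts of source objects by colour (invented schema).
view = {"red": 2, "blue": 1}

# Log of changes made to the data source since the last refresh.
log = [("add", "red"), ("add", "green"), ("delete", "blue")]

# Refresh the view from the log alone -- the source is never contacted.
for op, key in log:
    if op == "add":
        view[key] = view.get(key, 0) + 1
    else:  # delete
        view[key] -= 1
        if view[key] == 0:
            del view[key]
```

A count view happens to be self-maintainable with no auxiliary data at all; the interesting cases in the talk are views (e.g., involving joins) where extra auxiliary views must be stored to make log-only refresh correct.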
|
Snacks: |
Feng Zhao
|
Time: |
Friday,
July 28,
2:00 pm, DC1331 |
Speaker: |
Hans-Peter Kriegel
(Institute for Computer Science, University of Munich)
|
Topic: |
Knowledge Discovery and Similarity Search in Databases
Both the number and the size of spatial and multimedia databases, such
as geographic and image databases, are rapidly growing
because of the large amount of data obtained from satellite images,
X-ray crystallography, computer tomography, and other scientific
equipment. This growth far exceeds human capacities to analyze the
databases in order to find implicit similarities, regularities, rules,
or clusters hidden in the data. Therefore, automated knowledge discovery
and efficient similarity search become more and more important.
The major difference between knowledge discovery in relational databases
and in spatial databases is that attributes of the neighbors of some
object of interest may have an influence on the object itself.
Therefore, spatial data mining algorithms heavily depend on the
efficient processing of neighborhood relations since the neighbors of
many objects have to be investigated in a single run of a typical
algorithm. We define a small set of database primitives and we
demonstrate that typical spatial data mining algorithms such as
clustering, characterization and trend detection are well supported by
the proposed database primitives.
In the second part of the talk, we focus on similarity search in
multimedia databases, which is highly application- and user-dependent.
Therefore, we derive similarity models that are adaptable to
application-specific requirements and individual user preferences.
Examples include flexible pixel-based shape similarity, 3D shape
histograms, and quadratic forms resulting in ellipsoid queries in
high-dimensional data spaces.
We show that these ellipsoid queries can be processed efficiently.
Additionally, these queries support modification of the similarity
matrix at query time. This is shown using some snapshots of our
similarity search system.
The talk concludes with some considerations of future research topics.
|
This page is maintained by
Frank Tompa.