[Please remove <h1>]
Fall 2012
Events of interest to the
Database Research Group are posted here, and are also
mailed to the uw.cs.database newsgroup and the
db-faculty,
db-grads,
db-friends
mailing lists.
Subscribe to one of these mailing lists to receive e-mail notification
of upcoming events.
The DB group meets Wednesday afternoons at 2:30pm.
The list below gives the
times and locations of upcoming meetings.
Each meeting lasts for an hour and features either
a local speaker or, on
Seminar days,
an invited outside speaker.
Everyone is welcome to attend.
Fall 2012 Events
DB Meeting:
|
Wednesday September 26, 2:30pm, DC 1331
|
Speaker:
|
Bojana Bislimovska, Politecnico di Milano
|
Title:
|
Textual and Graph-Based Search over Software Model Repositories
|
Abstract:
|
Software model repositories are valuable sources of knowledge as they gather the best practices from past modeling activities. Therefore, efficient search of relevant information from large model repositories is of essential importance, allowing users to retrieve models or model fragments, and reuse them by incorporating them into their projects, thus improving the quality and productivity of model design. The variety of existing modeling languages imposes a search framework that uses the power of models, adapting to the syntax and semantics of the modeling language. In this talk, we present two methods for searching model repositories, contrasted with respect to the system architecture, the processing of queries and repository models, and the way in which metamodel information can be exploited to improve search. The first method explores classical Information Retrieval techniques, while the second one explores graph-based similarity search. We also provide an evaluation of the approaches through a case study of Web Application Models encoded in a Domain Specific Language. Furthermore, we discuss future directions for efficient graph-based search of model repositories.
Bio:
Bojana Bislimovska is a PhD student in Computer Science and Engineering at Politecnico di Milano. She received her BSc with honors in Computer Science, Information Technology and Automation from the Faculty of Electrical Engineering, University “Ss Cyril and Methodius”, Skopje, Macedonia, and her MSc in Computer Engineering from Politecnico di Milano, Italy. Her doctoral work explores techniques for analysis of semistructured data for information retrieval applications on a specific use-case of software models, experimenting with different querying, indexing and matching methodologies. Her research interests include graph data management, model-driven development, search engines and information retrieval.
|
DB Seminar:
|
Friday September 28, 10:45am - 11:20am, DC 1304
|
Speaker:
|
Stephan Ewen, TU Berlin
|
Title:
|
Spinning Fast Iterative Data Flows
|
Abstract:
|
Parallel data flow systems are a central part of most analytic
pipelines for big data. The iterative nature of many analysis and
machine learning algorithms, however, is still a challenge for current
systems. While certain types of bulk iterative algorithms are
supported by novel data flow frameworks, these systems cannot exploit
computational dependencies present in many algorithms, such as graph
algorithms. As a result, these algorithms are inefficiently executed
and have led to specialized systems based on other paradigms, such as
message passing or shared memory.
We propose a method to integrate "incremental iterations", a form of
workset iterations, with parallel data flows. After showing how to
integrate bulk iterations into a dataflow system and its optimizer, we
present an extension to the programming model for incremental
iterations. The extension alleviates for the lack of mutable state in
dataflows and allows for exploiting the "sparse computational
dependencies" inherent in many iterative algorithms. The evaluation of
a prototypical implementation shows that those aspects lead to up to
two orders of magnitude speedup in algorithm runtime, when exploited.
In our experiments, the improved dataflow system is highly competitive
with specialized systems while maintaining a transparent and unified
data flow abstraction.
Biography:
Stephan Ewen is a research associate at the department for Database
Systems and Information Management (DIMA) at the Berlin University of
Technology. He is working on the Stratosphere Project that aims at
creating a versatile and efficient analytics engine for deep analysis
of Big Data on cloud platforms. Within the project, Stephan works on
the system's data flow programming abstraction, the data flow
optimization and the parallel runtime system.
Prior to joining the DIMA group, Stephan completed the “Applied
Computer Science” program at the University of Cooperative Education
Stuttgart jointly with IBM Germany Ltd and got his Diploma from the
University of Stuttgart. In the course of his studies, Stephan Ewen
worked, among others, for the IBM Almaden Research Centre and the IBM
Development Laboratory Böblingen.
|
DB Seminar:
|
Friday September 28, 11:25am - 12 noon, DC 1304
CANCELLED
|
Speaker:
|
Alexander Alexandrov, TU Berlin
|
Title:
|
Generating a Myriad of atoms in the blink of an eye
|
Abstract:
|
Data from real-world applications is regarded as the golden standard
for database systems evaluation. Unfortunately, finding appropriate
real-world datasets is often hard due to various privacy-related
constraints. To overcome this problem, we developed the Myriad
Parallel Data Generator Toolkit -- a generic toolkit for declarative
specification of synthetic data generators that provides built-in
parallelization support for the specified data generation programs. In
this talk, I will motivate and present the main technical challenges
solved by the highly-parallel execution model of the Myriad Toolkit.
In addition, to demonstrate the usability of the toolkit, I will also
give a brief overview of the supported data generator specification
syntax and explain how different statistical constraints for the
generated data can be implemented using the appropriate combination of
specification routines.
Biography:
Alexander Alexandrov is a research assistant at the Database Systems
and Information Management research group at the Technical University
of Berlin. Before moving to Berlin for a Master in Computer Science at
the TU Berlin, he did a Bachelor of Science in Software and Internet
Technologies at the University of Mannheim. Alexander has been working
on the Stratosphere project both as student and research assistant
since 2009. His research interests include data generation,
evaluation, and query optimization for large-scale parallel batch
processing systems with partial operator semantics.
|
DB Meeting:
|
Wednesday October 3, 2:30pm, DC 1331
|
Speakers:
|
Ashraf Aboulnaga, Gunes Aluc, Khuzaima Daudjee, Pedram Ghodsnia, Tamer Ozsu, David Toman
|
Title:
|
VLDB 8-minute Madness: speakers will each give an 8-minute summary of a piece
of interesting work that was presented at VLDB 2012.
|
Abstract:
|
Ashraf will present "Only Aggressive Elephants are Fast Elephants" by
Jens Dittrich, Jorge-Arnulfo Quiané-Ruiz, Stefan Richter, Stefan Schuh, Alekh Jindal, Jörg Schad.
Gunes will present "Don't Thrash: How to Cache Your
Hash on Flash" by Bender et. al.
David will present something.
Khuzaima will present "Probabilistically Bounded Staleness
for Practical Partial Quorums".
Pedram will present "Dense Subgraph Maintenance under Streaming Edge Weight Updates for Real-time Story Identification".
Tamer will present "Efficient Multi-way Theta-Join Processing Using MapReduce" by X. Zhang, L. Chen, and M. Wang.
|
DB Meeting:
|
Wednesday October 17, 2:30pm, DC 1331
|
Speaker:
|
Ashraf Abdelbar
|
Title:
|
Classification Rule Discovery with Artificial Ant Colonies
|
Abstract:
|
Ant Colony Optimization (ACO) is a relatively young subfield of artificial intelligence that is inspired by the two phenomena of collective intelligence and of stigmergy exhibited by colonies of social insects such as ants and termites. Classification rule discovery is a well-established data mining task whose goal is to learn, from labeled cases, a classification rule model that can be used to predict the class of unlabeled cases. ACO approaches to classification rule discovery for large databases have been investigated for the past decade, and have evolved to become competitive with well-established rule mining algorithms such as Ripper, and C4.5-Rules. In this talk, we present a broad introduction to the ACO methodology, survey ACO approaches to classification rule mining, and focus on recent work that is based on the use of multiple pheromone types within the same colony.
Bio: Ashraf Abdelbar is a professor of computer science at the American University in Cairo. His research interests span across several subdisciplines of artificial intelligence, including neural networks, Bayesian networks, and swarm intelligence. He has close to 70 refereed publications, including articles in journals such as Artificial Intelligence, Neural Networks, and Swarm Intelligence. He is a recipient of the International Neural Network Society’s Young Investigator Award. He has been active in organizations such ABET and the ACM. He is a program evaluator for the Computing Accreditation Commission of ABET, and has served on the ACM’s Membership Activities Board. He has also been active with the ACM Programming Contest and has three times coached students that have made it to the ACM Programming Contest World Finals.
|
DB Meeting:
|
Wednesday October 24, 2:30pm, DC 1331
|
Speaker:
|
David DeHaan
|
Title:
|
Dynamic compilation of query execution plans
|
Abstract:
|
For the last 20+ years, most commercial RDBMs have processed relational queries by assembling pre-compiled physical operator classes into "Volcano-style" tuple-flow operator trees. As modern hardware has changed the nature of the I/O bottleneck, not only have traditional I/O-based cost models and assumptions used during optimization become less accurate, but entire execution frameworks need to be re-evaluated. In this talk I'll present a VLDB 2011 paper that proposes to replace tuple-flow operator trees with dynamic (per-query) code generation and compilation to produce query plans tailored to modern hardware.
Neumann, Thomas. "Efficiently Compiling Efficient Query Plans for Modern Hardware." VLDB 2011, pp 539-550.
|
DB Meeting:
|
Monday October 29, 11:00am, DC 1304
|
Speaker:
|
Azza Abouzied, Ph.D. student, Yale University
|
Title:
|
Learning Database Queries with Membership Questions
|
Abstract:
|
Writing complex database queries in a query language like SQL can be
challenging for casual and expert users alike. Of all query tasks,
quantification --- specifying constraints over a /set of
tuples/ instead of individual tuples --- is the most challenging. We
turn to examples to simplify quantified query specification. In an
example-driven interaction model, users provide simple propositions,
which serve as a query skeleton. We then learn the complete query by
asking the user to classify specifically constructed sets of tuples as
either answers or non-answers. We call such questions: /membership
questions/. For this interaction model to work, we need to identify an
expressive class of quantified queries that is exactly and efficiently
learnable from membership questions. In this talk, I will describe one
such class: /qhorn/. qhorn is the class of /conjunctions of/
/quantified Horn formulae/ on /n boolean variables. ///We show that a
subclass of qhorn is exactly and efficiently learnable with /O(n^k
)/ membership questions, where /k/ is the size of the query. We show
that a slight relaxation on the properties of this subclass results in
an intractable query class.
|
DB Meeting:
|
Wednesday October 31, 2:30pm, DC 1331
|
Speaker:
|
Frank Tompa
|
Title:
|
Exact and approximate phrase searching
|
Abstract:
|
Matching phrases in text is often required, but it can be quite expensive
to support. I'll review a variety of approaches that have been proposed
over the last 15 years.
|
DB Meeting:
|
Wednesday November 14, 2:30pm, DC 1331
|
Speaker:
|
Zhiping Wu
|
Title:
|
Efficient Implementation for Access Control Systems
|
Abstract:
|
A lot of work on access control models has been done; researchers, however, usually fail to consider the efficiency of access control systems. We are proposing a novel approach to implementing fast access control systems exploiting (partially) in-memory bitmaps. We break our objective into two sub-goals. The first one is to develop a new bitmap compression scheme that balances operational cost (time) and compression ratio (space) after noticing none of those existing schemes are satisfactory for our application; the second one is to make use of bitmaps to construct access control systems correctly and efficiently. The early tests show that our approach can answer authorization requests decades or hundreds of times faster than table-based implementation.
|
DB Meeting:
|
Wednesday November 21, 2:30pm, DC 1331
|
Speaker:
|
Khaled Ammar
|
Title:
|
Distributed Graph Mining Systems: A survey
|
Abstract:
|
Massive data analysis has become a major challenge for scientific discovery, as various scientific communities embrace computational techniques and, consequently, generate more data. Examples include computational biology, chemical data analysis, drug discovery, health informatics analysis, social networking, web link analysis, and communication networks. The increasing data volumes have increased the importance of the management, efficient querying, and analysis of these data. These data are usually interrelated, and the relationships among data are important and typically implemented using a graph data structure. Consequently, graphs have re-emerged as important data structures within this context.
In general, graph algorithms do not scale because they require random access for their data structures. Usually, graph algorithms are proposed as centralized algorithms with an assumption that all data resides in memory. These assumptions are not valid for the current growth in the data size. The technical challenge only gets more complicated by the dynamic nature of data and relations. In this talk, I will describe several recently proposed systems for distributed graph mining. Each system embraces some techniques to tackle the distributed challenge in graph mining. I will also provide a brief overview of a qualitative comparison between these systems through a discussion of similarities and differences between them.
|
DB Meeting:
|
Tuesday November 27, 11:00am, DC 2310
|
Speaker:
|
Changjiu Jin
|
Title:
|
Materialized Views for Eventually-Consistent Record Stores
|
Abstract:
|
In recent years, NoSQL systems, such as Cassandra, Hadoop, Voldmort and HBase have become widely used in
some social media companies, such as Twitter, Facebook and LinkedIn. They show significant advantages for bigdata
storage and processing compared to relational databases. There are some common features of these NoSQL
systems, such as their distributed, replicated keyed-record structure. They are mainly used by applications that
require high availability and scalability. However, in order to achieve high availability and scalability, they usually
sacrifice some features found in relational databases. For example, they do not support the SQL language. They
provide only weak data consistency guarantees, and do not support complex queries. The only way for such systems
to efficiently access stored data is by using a primary key value.
We propose the use of materialized views in multi-master, eventually-consistent keyed-record
stores. Materialized views are widely used in relational databases to simplify queries and speed up query processing.
However, there are many difference between NoSQL systems and relational databases, which make the query
optimization problem very challenging in NoSQL systems. For example, in some NoSQL systems, like Cassandra,
all the nodes in a cluster are masters, and no single master server responsible for serializing all updates to a record.
We propose a method to define materialized views and an asynchronous view maintenance mechanism. To solve
the data inconsistency issue between base tables and their views, we also proposed a session consistency guarantee.
We have implemented a prototype of our technique using Cassandra. By doing some experiments, we evaluated
the costs associated with incremental view maintenance by comparing with Cassandra’s native secondary indexes.
|
DB Meeting:
|
Wednesday December 5, 2:30pm, DC 1331
|
Speaker:
|
Mostafa Ead
|
Title:
|
PStorM: Profile Storage and Matching for Feedback-Based Tuning of MapReduce Jobs
|
Abstract:
|
The MapReduce programming model has become widely adopted for large scale analytics on big data. MapReduce systems such as Hadoop have many tuning parameters, many of which have a significant impact on performance. The Map and Reduce functions that make up a MapReduce job are developed using arbitrary programming constructs, which makes them black-box in nature and prevents users from making good parameter tuning decisions for a submitted MapReduce job. Some research projects, such as the Starfish system, aim to provide automatic tuning decisions for input MapReduce jobs. Starfish and similar systems rely on an execution profile of a MapReduce job being tuned, and this profile is assumed to come from a previous execution of the same job. Managing these execution profiles has not been previously studied. This talk presents PStorM, a profile store that organizes the collected profiling information in a scalable and extensible data model, and a profile matcher that accurately picks the relevant profiling information even for previously unseen MapReduce jobs. PStorM is currently integrated with the Starfish system, providing the necessary profiles that Starfish needs to tune a job. The talk presents results that demonstrate the accuracy and efficiency of profile matching. The results also show that the profiles returned by PStorM lead to Starfish tuning decisions that are as good as the decisions made by profiles collected from a pervious run of the job.
|
This page is maintained
by
Khuzaima Daudjee.