Database Research Group Events

Fall 2012

Events of interest to the Database Research Group are posted here, and are also mailed to the uw.cs.database newsgroup and the db-faculty, db-grads, db-friends mailing lists. Subscribe to one of these mailing lists to receive e-mail notification of upcoming events.

The DB group meets Wednesday afternoons at 2:30pm. The list below gives the times and locations of upcoming meetings. Each meeting lasts for an hour and features either a local speaker or, on Seminar days, an invited outside speaker. Everyone is welcome to attend.

Fall 2012 Events


DB Seminar: Wednesday September 19, 2:30pm, DC 1302
Speaker: Val Tannen, University of Pennsylvania
Title: Provisioning for What-If Analysis

DB Meeting: Wednesday September 26, 2:30pm, DC 1331
Speaker: Bojana Bislimovska, Politecnico di Milano
Title: Textual and Graph-Based Search over Software Model Repositories
Abstract: Software model repositories are valuable sources of knowledge as they gather the best practices from past modeling activities. Therefore, efficient search of relevant information from large model repositories is of essential importance, allowing users to retrieve models or model fragments, and reuse them by incorporating them into their projects, thus improving the quality and productivity of model design. The variety of existing modeling languages imposes a search framework that uses the power of models, adapting to the syntax and semantics of the modeling language. In this talk, we present two methods for searching model repositories, contrasted with respect to the system architecture, the processing of queries and repository models, and the way in which metamodel information can be exploited to improve search. The first method explores classical Information Retrieval techniques, while the second one explores graph-based similarity search. We also provide an evaluation of the approaches through a case study of Web Application Models encoded in a Domain Specific Language. Furthermore, we discuss future directions for efficient graph-based search of model repositories.

Bio: Bojana Bislimovska is a PhD student in Computer Science and Engineering at Politecnico di Milano. She received her BSc with honors in Computer Science, Information Technology and Automation from the Faculty of Electrical Engineering, University “Ss Cyril and Methodius”, Skopje, Macedonia, and her MSc in Computer Engineering from Politecnico di Milano, Italy. Her doctoral work explores techniques for analysis of semistructured data for information retrieval applications on a specific use-case of software models, experimenting with different querying, indexing and matching methodologies. Her research interests include graph data management, model-driven development, search engines and information retrieval.

DB Seminar: Friday September 28, 10:45am - 11:20am, DC 1304
Speaker: Stephan Ewen, TU Berlin
Title: Spinning Fast Iterative Data Flows
Abstract: Parallel data flow systems are a central part of most analytic pipelines for big data. The iterative nature of many analysis and machine learning algorithms, however, is still a challenge for current systems. While certain types of bulk iterative algorithms are supported by novel data flow frameworks, these systems cannot exploit computational dependencies present in many algorithms, such as graph algorithms. As a result, these algorithms are inefficiently executed and have led to specialized systems based on other paradigms, such as message passing or shared memory.
We propose a method to integrate "incremental iterations", a form of workset iterations, with parallel data flows. After showing how to integrate bulk iterations into a dataflow system and its optimizer, we present an extension to the programming model for incremental iterations. The extension alleviates for the lack of mutable state in dataflows and allows for exploiting the "sparse computational dependencies" inherent in many iterative algorithms. The evaluation of a prototypical implementation shows that those aspects lead to up to two orders of magnitude speedup in algorithm runtime, when exploited. In our experiments, the improved dataflow system is highly competitive with specialized systems while maintaining a transparent and unified data flow abstraction.

Biography:
Stephan Ewen is a research associate at the department for Database Systems and Information Management (DIMA) at the Berlin University of Technology. He is working on the Stratosphere Project that aims at creating a versatile and efficient analytics engine for deep analysis of Big Data on cloud platforms. Within the project, Stephan works on the system's data flow programming abstraction, the data flow optimization and the parallel runtime system. Prior to joining the DIMA group, Stephan completed the “Applied Computer Science” program at the University of Cooperative Education Stuttgart jointly with IBM Germany Ltd and got his Diploma from the University of Stuttgart. In the course of his studies, Stephan Ewen worked, among others, for the IBM Almaden Research Centre and the IBM Development Laboratory Böblingen.

DB Seminar: Friday September 28, 11:25am - 12 noon, DC 1304 CANCELLED
Speaker: Alexander Alexandrov, TU Berlin
Title: Generating a Myriad of atoms in the blink of an eye
Abstract: Data from real-world applications is regarded as the golden standard for database systems evaluation. Unfortunately, finding appropriate real-world datasets is often hard due to various privacy-related constraints. To overcome this problem, we developed the Myriad Parallel Data Generator Toolkit -- a generic toolkit for declarative specification of synthetic data generators that provides built-in parallelization support for the specified data generation programs. In this talk, I will motivate and present the main technical challenges solved by the highly-parallel execution model of the Myriad Toolkit. In addition, to demonstrate the usability of the toolkit, I will also give a brief overview of the supported data generator specification syntax and explain how different statistical constraints for the generated data can be implemented using the appropriate combination of specification routines.

Biography:
Alexander Alexandrov is a research assistant at the Database Systems and Information Management research group at the Technical University of Berlin. Before moving to Berlin for a Master in Computer Science at the TU Berlin, he did a Bachelor of Science in Software and Internet Technologies at the University of Mannheim. Alexander has been working on the Stratosphere project both as student and research assistant since 2009. His research interests include data generation, evaluation, and query optimization for large-scale parallel batch processing systems with partial operator semantics.

DB Meeting: Wednesday October 3, 2:30pm, DC 1331
Speakers: Ashraf Aboulnaga, Gunes Aluc, Khuzaima Daudjee, Pedram Ghodsnia, Tamer Ozsu, David Toman
Title: VLDB 8-minute Madness: speakers will each give an 8-minute summary of a piece of interesting work that was presented at VLDB 2012.
Abstract: Ashraf will present "Only Aggressive Elephants are Fast Elephants" by Jens Dittrich, Jorge-Arnulfo Quiané-Ruiz, Stefan Richter, Stefan Schuh, Alekh Jindal, Jörg Schad.
Gunes will present "Don't Thrash: How to Cache Your Hash on Flash" by Bender et. al.
David will present something.
Khuzaima will present "Probabilistically Bounded Staleness for Practical Partial Quorums".
Pedram will present "Dense Subgraph Maintenance under Streaming Edge Weight Updates for Real-time Story Identification".
Tamer will present "Efficient Multi-way Theta-Join Processing Using MapReduce" by X. Zhang, L. Chen, and M. Wang.

DB Seminar: Wednesday October 10, 2:30pm, DC 1331
Speaker: Gokul Soundararajan, NetApp
Title: MixApart: Decoupled Analytics for Shared Storage Systems

DB Meeting: Wednesday October 17, 2:30pm, DC 1331
Speaker: Ashraf Abdelbar
Title: Classification Rule Discovery with Artificial Ant Colonies
Abstract: Ant Colony Optimization (ACO) is a relatively young subfield of artificial intelligence that is inspired by the two phenomena of collective intelligence and of stigmergy exhibited by colonies of social insects such as ants and termites. Classification rule discovery is a well-established data mining task whose goal is to learn, from labeled cases, a classification rule model that can be used to predict the class of unlabeled cases. ACO approaches to classification rule discovery for large databases have been investigated for the past decade, and have evolved to become competitive with well-established rule mining algorithms such as Ripper, and C4.5-Rules. In this talk, we present a broad introduction to the ACO methodology, survey ACO approaches to classification rule mining, and focus on recent work that is based on the use of multiple pheromone types within the same colony.

Bio: Ashraf Abdelbar is a professor of computer science at the American University in Cairo. His research interests span across several subdisciplines of artificial intelligence, including neural networks, Bayesian networks, and swarm intelligence. He has close to 70 refereed publications, including articles in journals such as Artificial Intelligence, Neural Networks, and Swarm Intelligence. He is a recipient of the International Neural Network Society’s Young Investigator Award. He has been active in organizations such ABET and the ACM. He is a program evaluator for the Computing Accreditation Commission of ABET, and has served on the ACM’s Membership Activities Board. He has also been active with the ACM Programming Contest and has three times coached students that have made it to the ACM Programming Contest World Finals.

DB Meeting: Wednesday October 24, 2:30pm, DC 1331
Speaker: David DeHaan
Title: Dynamic compilation of query execution plans
Abstract: For the last 20+ years, most commercial RDBMs have processed relational queries by assembling pre-compiled physical operator classes into "Volcano-style" tuple-flow operator trees. As modern hardware has changed the nature of the I/O bottleneck, not only have traditional I/O-based cost models and assumptions used during optimization become less accurate, but entire execution frameworks need to be re-evaluated. In this talk I'll present a VLDB 2011 paper that proposes to replace tuple-flow operator trees with dynamic (per-query) code generation and compilation to produce query plans tailored to modern hardware.

Neumann, Thomas. "Efficiently Compiling Efficient Query Plans for Modern Hardware." VLDB 2011, pp 539-550.

DB Meeting: Monday October 29, 11:00am, DC 1304
Speaker: Azza Abouzied, Ph.D. student, Yale University
Title: Learning Database Queries with Membership Questions
Abstract: Writing complex database queries in a query language like SQL can be challenging for casual and expert users alike. Of all query tasks, quantification --- specifying constraints over a /set of tuples/ instead of individual tuples --- is the most challenging. We turn to examples to simplify quantified query specification. In an example-driven interaction model, users provide simple propositions, which serve as a query skeleton. We then learn the complete query by asking the user to classify specifically constructed sets of tuples as either answers or non-answers. We call such questions: /membership questions/. For this interaction model to work, we need to identify an expressive class of quantified queries that is exactly and efficiently learnable from membership questions. In this talk, I will describe one such class: /qhorn/. qhorn is the class of /conjunctions of/ /quantified Horn formulae/ on /n boolean variables. ///We show that a subclass of qhorn is exactly and efficiently learnable with /O(n^k )/ membership questions, where /k/ is the size of the query. We show that a slight relaxation on the properties of this subclass results in an intractable query class.

DB Meeting: Wednesday October 31, 2:30pm, DC 1331
Speaker: Frank Tompa
Title: Exact and approximate phrase searching
Abstract: Matching phrases in text is often required, but it can be quite expensive to support. I'll review a variety of approaches that have been proposed over the last 15 years.

DB Seminar: Tuesday November 6, 1:00pm, DC 2585
Speaker: Shivakumar Vaithyanathan, IBM Almaden Research Center
Title: Social Media Data Analytics Research

DB Meeting: Wednesday November 14, 2:30pm, DC 1331
Speaker: Zhiping Wu
Title: Efficient Implementation for Access Control Systems
Abstract: A lot of work on access control models has been done; researchers, however, usually fail to consider the efficiency of access control systems. We are proposing a novel approach to implementing fast access control systems exploiting (partially) in-memory bitmaps. We break our objective into two sub-goals. The first one is to develop a new bitmap compression scheme that balances operational cost (time) and compression ratio (space) after noticing none of those existing schemes are satisfactory for our application; the second one is to make use of bitmaps to construct access control systems correctly and efficiently. The early tests show that our approach can answer authorization requests decades or hundreds of times faster than table-based implementation.

DB Meeting: Wednesday November 21, 2:30pm, DC 1331
Speaker: Khaled Ammar
Title: Distributed Graph Mining Systems: A survey
Abstract: Massive data analysis has become a major challenge for scientific discovery, as various scientific communities embrace computational techniques and, consequently, generate more data. Examples include computational biology, chemical data analysis, drug discovery, health informatics analysis, social networking, web link analysis, and communication networks. The increasing data volumes have increased the importance of the management, efficient querying, and analysis of these data. These data are usually interrelated, and the relationships among data are important and typically implemented using a graph data structure. Consequently, graphs have re-emerged as important data structures within this context. In general, graph algorithms do not scale because they require random access for their data structures. Usually, graph algorithms are proposed as centralized algorithms with an assumption that all data resides in memory. These assumptions are not valid for the current growth in the data size. The technical challenge only gets more complicated by the dynamic nature of data and relations. In this talk, I will describe several recently proposed systems for distributed graph mining. Each system embraces some techniques to tackle the distributed challenge in graph mining. I will also provide a brief overview of a qualitative comparison between these systems through a discussion of similarities and differences between them.

DB Meeting: Tuesday November 27, 11:00am, DC 2310
Speaker: Changjiu Jin
Title: Materialized Views for Eventually-Consistent Record Stores
Abstract: In recent years, NoSQL systems, such as Cassandra, Hadoop, Voldmort and HBase have become widely used in some social media companies, such as Twitter, Facebook and LinkedIn. They show significant advantages for bigdata storage and processing compared to relational databases. There are some common features of these NoSQL systems, such as their distributed, replicated keyed-record structure. They are mainly used by applications that require high availability and scalability. However, in order to achieve high availability and scalability, they usually sacrifice some features found in relational databases. For example, they do not support the SQL language. They provide only weak data consistency guarantees, and do not support complex queries. The only way for such systems to efficiently access stored data is by using a primary key value. We propose the use of materialized views in multi-master, eventually-consistent keyed-record stores. Materialized views are widely used in relational databases to simplify queries and speed up query processing. However, there are many difference between NoSQL systems and relational databases, which make the query optimization problem very challenging in NoSQL systems. For example, in some NoSQL systems, like Cassandra, all the nodes in a cluster are masters, and no single master server responsible for serializing all updates to a record. We propose a method to define materialized views and an asynchronous view maintenance mechanism. To solve the data inconsistency issue between base tables and their views, we also proposed a session consistency guarantee. We have implemented a prototype of our technique using Cassandra. By doing some experiments, we evaluated the costs associated with incremental view maintenance by comparing with Cassandra’s native secondary indexes.

DB Seminar: Wednesday November 28, 2:30pm, DC 1302
Speaker: Arthur Zimek, University of Alberta
Title: There and Back Again: Outlier Detection between Statistical Reasoning and Efficient Database Methods

DB Meeting: Wednesday December 5, 2:30pm, DC 1331
Speaker: Mostafa Ead
Title: PStorM: Profile Storage and Matching for Feedback-Based Tuning of MapReduce Jobs
Abstract: The MapReduce programming model has become widely adopted for large scale analytics on big data. MapReduce systems such as Hadoop have many tuning parameters, many of which have a significant impact on performance. The Map and Reduce functions that make up a MapReduce job are developed using arbitrary programming constructs, which makes them black-box in nature and prevents users from making good parameter tuning decisions for a submitted MapReduce job. Some research projects, such as the Starfish system, aim to provide automatic tuning decisions for input MapReduce jobs. Starfish and similar systems rely on an execution profile of a MapReduce job being tuned, and this profile is assumed to come from a previous execution of the same job. Managing these execution profiles has not been previously studied. This talk presents PStorM, a profile store that organizes the collected profiling information in a scalable and extensible data model, and a profile matcher that accurately picks the relevant profiling information even for previously unseen MapReduce jobs. PStorM is currently integrated with the Starfish system, providing the necessary profiles that Starfish needs to tune a job. The talk presents results that demonstrate the accuracy and efficiency of profile matching. The results also show that the profiles returned by PStorM lead to Starfish tuning decisions that are as good as the decisions made by profiles collected from a pervious run of the job.

This page is maintained by Khuzaima Daudjee.

Campaign Waterloo

Data Systems Group
David R. Cheriton School of Computer Science
University of Waterloo
Waterloo, Ontario, Canada N2L 3G1
Tel: 519-888-4567
Fax: 519-885-1208

Contact | Feedback: db-webmaster@cs.uwaterloo.ca | Data Systems Group


Valid HTML 4.01!Valid CSS! Last modified: Friday, 07-Dec-2012 13:12:39 EST


Menu:ShowHide