The Data Systems Group builds innovative, high-impact platforms, systems, and applications for processing, managing, analyzing, and searching the vast collections of data that are integral to modern information societies – colloquially known as “big data” technologies. Our capabilities span the full spectrum from unstructured text collections to relational data, and everything in between, including semi-structured sources such as time series, log data, graphs, and other data types. We work at multiple layers in the software stack, ranging from storage management and execution platforms to user-facing applications and studies of user behaviour. Our research tackles all phases of the information lifecycle, from ingest and cleaning to inference and decision support, and covers the following areas.
Our research addresses issues that are fundamental to building an effective and efficient infrastructure for management and analysis of very large and heterogeneous data collections. The specific research topics include the following:
Cloud-based applications and database systems are often geographically distributed for disaster tolerance and to ensure low latency for geographically-distributed clients. Our work in this area has focused on techniques for ensuring database consistency and providing transactional guarantees in geographically distributed, multi-data-center settings. We are also developing tools to simplify the development of applications that use scalable NoSQL database systems.
We do both theoretical and systems research on this topic. Our theoretical work aims to develop formal models to answer two questions: (1) How difficult is it to parallelize different queries? and (2) How should we measure the efficiencies of parallel algorithms? Our systems projects in this area focus on replication, distributed transaction protocols, parallel graph processing, data integration, and application-level integration. Projects in this area investigate data management problems in Internet-scale (i.e., very large and widespread) distributed environments as well as in cluster computing environments.
Our goals are to investigate and improve data management systems geared towards processing high-volume, real-time data feeds, including data stream management systems, stream data warehouses, and event processing systems.
Recent research includes the development of new indexing structures and query processing for search, with a particular focus on modern multi-stage ranking architectures. Specific research includes alternative organizations of inverted indexes based on quantized impact scores and indexes based on treap data structures, which perform faster ranked unions and intersections while consuming less space. Other research allows for the comparative evaluation of first-stage rankers without the need for relevance assessment by directly measuring the impact on later stages.
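To make the idea of quantized impact scores concrete, the following is a minimal sketch of a generic impact-ordered inverted index, not the group's implementation. It assumes uniform quantization of precomputed term weights into small integers; the function names, the input layout, and the sample weights are all hypothetical.

```python
from collections import defaultdict


def build_impact_index(doc_term_scores, bits=8):
    """Build term -> [(impact, doc_id), ...], sorted by descending impact.

    doc_term_scores is a hypothetical input: {doc_id: {term: float_weight}}.
    Each weight is mapped to an integer in 1..(2**bits - 1) by uniform
    quantization against the largest weight in the collection.
    """
    levels = (1 << bits) - 1
    max_score = max(s for terms in doc_term_scores.values() for s in terms.values())
    index = defaultdict(list)
    for doc_id, terms in doc_term_scores.items():
        for term, score in terms.items():
            impact = max(1, round(score / max_score * levels))
            index[term].append((impact, doc_id))
    for postings in index.values():
        # Highest impacts first, so a query can stop early on long lists.
        postings.sort(key=lambda p: (-p[0], p[1]))
    return dict(index)


def rank(index, query_terms, k=10):
    """Score documents by summing integer impacts over the query terms."""
    acc = defaultdict(int)
    for term in query_terms:
        for impact, doc_id in index.get(term, []):
            acc[doc_id] += impact
    return sorted(acc.items(), key=lambda x: (-x[1], x[0]))[:k]


docs = {"d1": {"treap": 3.0, "index": 1.0}, "d2": {"index": 4.0}}
index = build_impact_index(docs, bits=2)
top = rank(index, ["index"])
```

Because the stored impacts are small integers, scoring needs only integer additions, and ordering each postings list by impact lets a query processor handle the most influential postings first.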
Our goal is to design database systems that exploit the capabilities of modern processors, storage, and networks. Recent projects include the design of energy-efficient, power-aware database systems, database support for hybrid storage systems that incorporate flash-based storage and non-volatile memory, and database systems that exploit RDMA for fast scaleout and load balancing.
Our research addresses the data management and analysis needs of specific applications. The particular application areas change over time; the topics that are currently under investigation include the following:
Graph data are of growing importance in many recent applications including semantic web (i.e., RDF), bioinformatics, social networks, software engineering, and physical networks. Our work in this area follows two tracks: the development of efficient and scalable RDF data stores (both their management and reasoning over them), and the development of generic parallel graph processing and analysis solutions.
Searching for mathematical information by matching formulas requires that attention be paid to matching structure as well as to matching individual terms. Efficiently combining formula search with text search is also among our research goals.
We are developing methods to identify sentiment expressed in natural language text and mine opinions from a variety of sources, such as user reviews and social media. Recent projects include the development of unsupervised methods for determining the polarity of words and for discovering aspects of products and services from online reviews.
Our research focuses on tracking evolving news events, extracting queries from new articles for tracking purposes, and creating user models for evaluating filter output. Other research aims to improve our understanding of language in social media by computing word embeddings over collections of tweets and other social media and comparing them to word embeddings computed from collections of standard English documents.
Our research addresses the development of database technology to fulfill the requirements of large classes of applications (e.g., geographic information systems, location-based services, graphic and simulation systems, historical data warehouses, multimedia systems) that deal with spatial objects that evolve over time, move, and may consist of multiple media types (video, audio, text, images).
The principal purpose of high-recall information retrieval is to find as close as practicable to 100% of the records or documents in a collection that are relevant to an information need. Motivating application domains include electronic discovery (eDiscovery) in civil litigation, spam filtering, systematic review for meta-analysis in evidence-based medicine, vanity search, and the creation of fully labeled test collections for IR evaluation.
A fundamental challenge in building effective data systems is to ensure their usability, which requires novel interaction paradigms that go beyond querying, as well as a better understanding of user behavior in order to design better interfaces. Our research in this area includes the following:
This research focuses on supporting users with complex information needs. One research direction is aimed at investigating the role of knowledge graphs automatically generated from unstructured text in assisting users with complex information needs. We also study user behaviour in exploratory search tasks.
This research addresses the problem of a traveler in a foreign city who is seeking attractions, food, drink, and entertainment that suit his or her personal interests, as inferred from interests in his or her home city. Our research includes content-oriented methods of mapping preferences from city to city, as well as the development of crowd-sourcing methods for evaluation.
Knowing which documents in a large collection are relevant to an information need is a required part of test collection construction as well as legal e-discovery. Commonly, a single person originates the information need and can act as a primary assessor of document relevance, but having only a single assessor is inefficient. To improve efficiency, secondary assessors are hired and provided with a written description of the information need. We are studying the behavior of secondary assessors and developing methods of reducing the number of errors that these assessors make.
Addressing all phases of the information lifecycle, from ingest and cleaning to inference and decision support, requires attention to each of these phases. Our work on lifecycle support includes the following:
Real-world data has various quality problems, including duplicate records, violated integrity constraints, and missing values. Performing analysis over such data produces less-than-satisfactory results. Data cleaning is the process of detecting and correcting these anomalies. We take a pragmatic approach to this problem that is aimed at developing solutions that can be adopted in real applications. We pursue three main directions: building generic data cleaning platforms, non-destructive data cleaning, and trusted and knowledge-based data cleaning.
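As a minimal illustration of the three anomaly types named above, the detection step can be sketched in Python. This is not one of the group's platforms: the record fields, the helper names, and the sample functional dependency (postal code determines city) are hypothetical, and the sketch only detects anomalies without repairing them.

```python
from collections import defaultdict


def find_missing(records, fields):
    """Return (row_index, field) pairs whose value is absent or empty."""
    return [(i, f) for i, r in enumerate(records)
            for f in fields if r.get(f) in (None, "")]


def find_duplicates(records, key_fields):
    """Group row indexes that share the same normalized key."""
    seen = defaultdict(list)
    for i, r in enumerate(records):
        key = tuple(str(r.get(f, "")).strip().lower() for f in key_fields)
        seen[key].append(i)
    return [rows for rows in seen.values() if len(rows) > 1]


def find_fd_violations(records, lhs, rhs):
    """Report lhs values that map to more than one rhs value (lhs -> rhs)."""
    values = defaultdict(set)
    for r in records:
        if r.get(lhs) is not None and r.get(rhs) is not None:
            values[r[lhs]].add(r[rhs])
    return {k: sorted(v) for k, v in values.items() if len(v) > 1}


records = [
    {"name": "Ann Lee", "zip": "N2L", "city": "Waterloo"},
    {"name": "ann lee ", "zip": "N2L", "city": "Waterloo"},
    {"name": "Bob Roy", "zip": "N2L", "city": "Kitchener"},
    {"name": "Cy Poe", "zip": None, "city": "Guelph"},
]
missing = find_missing(records, ["name", "zip", "city"])
duplicates = find_duplicates(records, ["name"])
violations = find_fd_violations(records, "zip", "city")
```

Keeping detection separate from repair, as here, leaves the original records untouched, which is one simple way to keep a cleaning step non-destructive.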
Offline measures of search systems are traditionally designed with little to no model of the users of the systems. We are developing new effectiveness measures that model user behavior with the search system. The primary result of creating these user-centered effectiveness measures is that we are able to make better predictions of search system effectiveness prior to deployment and online evaluation.
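For readers unfamiliar with user-model-based measures, one well-known example from the literature is rank-biased precision (RBP), which assumes a user who reads down the ranking and continues from each result to the next with a fixed persistence probability. The sketch below is illustrative only and is not one of the measures under development here; the function name and the default persistence value are assumptions.

```python
def rank_biased_precision(relevance, persistence=0.8):
    """Compute RBP for a ranked list of relevance values in [0, 1].

    relevance[i] is the relevance of the result at rank i + 1.
    persistence is the assumed probability that the user moves on
    from one result to the next.
    """
    if not 0 <= persistence < 1:
        raise ValueError("persistence must be in [0, 1)")
    return (1 - persistence) * sum(
        rel * persistence ** i for i, rel in enumerate(relevance)
    )


# Relevant results at ranks 1 and 3, with an impatient user (p = 0.5).
score = rank_biased_precision([1, 0, 1], persistence=0.5)
```

A low persistence value models an impatient user and concentrates the score on the top ranks, while a value near 1 models a user who reads deep into the list.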
When searching personal information, such as electronic medical records, it is important that privacy is protected. In one project, we aim to address the problem of evaluating retrieval systems on sensitive data without disclosing that data to the evaluators. We aim to develop fully autonomous evaluation systems that operate in a sandboxed virtual test environment which can be deployed behind a privacy-preserving firewall and report back only summary results to the researcher. In a second project, we wish to search one or more views of the data that embody effective de-identification of records. In addition, we intend to provide secure search for users who are authorized to access and update personal information for only some records in the database.
In addition to the general computing infrastructure provided by the Cheriton School of Computer Science, the Data Systems Group has established specific research laboratories to support this research. In particular, three separate cluster computing systems are being established to support these research projects. In aggregate, these clusters will provide over 100 multi-core computing nodes.
The research conducted within the group is funded by multiple agencies and industries. Currently active funding sources are the Natural Sciences and Engineering Research Council (NSERC) of Canada, Canada Foundation for Innovation, Ontario Research Fund, Mitacs, Google, Microsoft, Thomson Reuters, and IBM.