Linked Data Query Processing
at the 22nd International World Wide Web Conference (WWW 2013)
Location: Rio de Janeiro, Brazil
Date: May 14, 2013
The amount of Linked Data published on the Web has increased dramatically over the last five years. While some data publishers expose their datasets not only based on the Linked Data principles but also via dedicated query services, relying on such services for use cases of live, Linked Data consumption, results in ignoring a large portion of the Web of Linked Data. As a consequence, we currently witness the emergence of a new area of research that focuses on query processing approaches which rely only on the Linked Data principles. This tutorial aims to provide participants with an overview on this new field. We will discuss the foundations, the new problems that are specific to Linked Data query processing, and the works that aim to address them.
The tutorial covers the following main aspects in Linked Data query processing:
In the introductory part we will give a quick overview of the Linked Data principles and the query language SPARQL, which are requirements for understanding the rest of the tutorial. We will give a short overview of Linked Data query processing and the content of the tutorial. We will emphasize the specific challenges associated with Linked Data query processing and the differences to existing (database) works, particularly approaches to processing queries over Linked Data such as SPARQL endpoint federation and warehousing.
An awareness of foundations and theoretical properties of Linked Data queries is essential for understanding the possibilities and limitations of Linked Data query processing. Consequently, before we focus on systems-related topics for the major part of our tutorial, we will provide the participants with an overview on the main results of theory research in the context of Linked Data queries. More precisely, we will introduce different semantics for SPARQL-based Linked Data queries and discuss characteristics of queries under these semantics [BGS09, HS12, Har12].
For the execution of Linked Data queries, it is necessary to retrieve data by looking up URIs. There exist three classes of approaches for selecting the URIs that a query execution system looks up during the execution of a query: i) index-based approaches [HHK+10], ii) live exploration [HF12, HBF09], and iii) hybrid approaches [LT10]. In this part of the tutorial, we introduce each of these types and discuss their respective characteristics, advantages and limitations.
The actual process of executing a Linked Data query may consist of two separate phases: During the first phase, a query execution system selects URIs and uses them to retrieve data from the queried Web; during a subsequent, second phase, the system generates the query result using the data retrieved in the first phase [HHK+10, UHK+11]. Instead of separating these two phases, it is also possible to integrate the retrieval of data into the result construction process [HBF09, HF12, LT10, LT11, WTL+12]. We refer to Linked Data query execution approaches that apply the latter idea as integrated execution approaches. Analogously, separated execution approaches clearly separate data retrieval from result construction by two consecutive phases. In this part, we first discuss the general properties of both types of approaches. Thereafter, we focus on integrated execution approaches proposed in the literature. In particular, we describe the algorithms based on which these approaches combine data retrieval with query-local data processing.
An essential step of any procedure for processing queries is the query planning phase. This phase comprises the generation of candidate query execution plans and the selection of one of these plans for the actual execution. While such a selection usually aims for minimizing the overall executing time, possible optimization criteria in the context of Linked Data queries are more diverse and include minimizing response time, minimizing network traffic, or maximizing the degree of result completeness. The aforementioned index-based source selection approaches provide some information for assessing query execution plans w.r.t. these optimization criteria [LT10, LT11, UHK+11]. However, for live evaluation processes such information cannot be assumed to be available initially; instead, it may only be obtained incrementally during query execution. Hence, in this case it requires heuristics for selecting an initial plan [Har11a]. In this part of the tutorial we discuss query planning for Linked Data queries and review existing work that aims to address the problems in this phase of query processing.
Olaf is a postdoc fellow in the Database Research Group at the University of Waterloo. His research focuses on querying Linked Data on the Web. His aim is to develop and study concepts that allow users to query the Web of Linked Data as if it is a giant global database. As project maintainer and lead developer he is implementing these concepts in the free software project SQUIN. In addition to his hands-on experience in the field of processing Linked Data queries, he recently also published results regarding theoretical properties of such queries. Olaf presented several Linked Data related tutorials at major international conferences such as ISWC 2008, ISWC 2009, WWW 2010, and ICWE 2012; and he was lecturer at the Indian-Summer School on Linked Data 2011.