Stream WatDiv Benchmark Suite


Libo Gao M. Tamer Özsu Lukasz Golab Güneş Aluç

1. Overview

We extend the Waterloo SPARQL Diversity Test Suite (WatDiv) [1] to develop a streaming RDF benchmark called Stream WatDiv. The goal of Stream WatDiv is to provide a diversified workload for testing streaming RDF processing (SRP) engines.

Stream WatDiv contains two components: a data generator and a query generator.

2. Stream Data and Static Data

WatDiv defines a dataset description model through a dataset description language. Stream WatDiv generates two datasets, a streaming dataset and a static dataset, from the same dataset description model. Together, these two datasets describe an e-commerce website database. The streaming data records users' activities on the website, while the static data contains the information persistently stored on the website, such as the metadata of users and products.

Two parameters of the data generator, the stream scale factor and the static scale factor, control the sizes of the streaming dataset and the static dataset. Table 1 lists the entity types in the streaming data and the static data, and the corresponding instance counts when both scale factors are set to 1.

Table 1. Instance count of entities in streaming dataset and static dataset when both stream scale factor and static scale factor are set to 1. Entities marked with an asterisk * do not scale.
Streaming Entity Type   Instance Count      Static Entity Type          Instance Count
wsdbm:Purchase          1500                wsdbm:User                  1000
wsdbm:Review            1500                wsdbm:Topic*                250
wsdbm:Offer             900                 wsdbm:Product               250
                                            wsdbm:City*                 240
                                            wsdbm:SubGenre*             145
                                            wsdbm:Website               50
                                            wsdbm:Language*             25
                                            wsdbm:Country*              25
                                            wsdbm:Genre*                21
                                            wsdbm:ProductCategory*      15
                                            wsdbm:Retailer              12
                                            wsdbm:AgeGroup*             9
                                            wsdbm:Role*                 3
                                            wsdbm:Gender*               2

Streaming data covers six types of user activity. Three of them are more complex, each having its own entity type.

  • Purchase: The most important activity on the website is the purchase. Users can purchase various products, and each purchase entity records a user's purchase of a single product, including details such as price and date.
  • Review: Users can also write reviews for products. A review entity contains information such as the rating, title, content, and votes.
  • Offer: A retailer may make offers on products. An offer can be restricted to an eligible region and a valid time period, so the offer entity includes these details.

The remaining three are simpler, each consisting of a single triple.

  • Likes: Users can show interest in products by "liking" them.
  • Follows: Users can make friends by following each other.
  • Subscribes: If a user wants to get updates on other users or some products, the most convenient way is to subscribe to their websites.

Each triple in the streaming dataset is annotated with a timestamp. Timestamps are monotonically non-decreasing integers starting from zero, and the streaming dataset is sorted on them. Triples belonging to the same entity instance are grouped together under the same timestamp. The following is a snippet of streaming data: the first triple arrives at time 0, and then all triples related to Offer38737 appear together at time 1.

<http://db.uwaterloo.ca/~galuc/wsdbm/User227>     <http://db.uwaterloo.ca/~galuc/wsdbm/follows>     <http://db.uwaterloo.ca/~galuc/wsdbm/User579>     0
<http://db.uwaterloo.ca/~galuc/wsdbm/Offer38737>  <http://schema.org/eligibleQuantity>      "2"     1
<http://db.uwaterloo.ca/~galuc/wsdbm/Offer38737>  <http://purl.org/goodrelations/validThrough>      "2013-10-08"    1
<http://db.uwaterloo.ca/~galuc/wsdbm/Offer38737>  <http://purl.org/goodrelations/price>     "136"   1
<http://db.uwaterloo.ca/~galuc/wsdbm/Offer38737>  <http://purl.org/goodrelations/serialNumber>      "86717525"      1
<http://db.uwaterloo.ca/~galuc/wsdbm/Retailer10>  <http://purl.org/goodrelations/offers>    <http://db.uwaterloo.ca/~galuc/wsdbm/Offer38737>  1
<http://db.uwaterloo.ca/~galuc/wsdbm/Offer38737>  <http://purl.org/goodrelations/includes>  <http://db.uwaterloo.ca/~galuc/wsdbm/Product103>  1
<http://db.uwaterloo.ca/~galuc/wsdbm/User484>     <http://db.uwaterloo.ca/~galuc/wsdbm/follows>     <http://db.uwaterloo.ca/~galuc/wsdbm/User527>     2
<http://db.uwaterloo.ca/~galuc/wsdbm/User247>     <http://db.uwaterloo.ca/~galuc/wsdbm/likes>       <http://db.uwaterloo.ca/~galuc/wsdbm/Product0>    3
<http://db.uwaterloo.ca/~galuc/wsdbm/Review124335>        <http://purl.org/stuff/rev#text>  "TORI TOMMY HARRIET STUART ELVIRA VERNON DELLA ALISSA"  4
<http://db.uwaterloo.ca/~galuc/wsdbm/Review124335>        <http://purl.org/stuff/rev#rating>        "6"     4
<http://db.uwaterloo.ca/~galuc/wsdbm/Review124335>        <http://purl.org/stuff/rev#reviewer>      <http://db.uwaterloo.ca/~galuc/wsdbm/User906>     4
<http://db.uwaterloo.ca/~galuc/wsdbm/Product8>    <http://purl.org/stuff/rev#hasReview>     <http://db.uwaterloo.ca/~galuc/wsdbm/Review124335>        4
<http://db.uwaterloo.ca/~galuc/wsdbm/User19>      <http://db.uwaterloo.ca/~galuc/wsdbm/follows>     <http://db.uwaterloo.ca/~galuc/wsdbm/User848>     5
                                                                ...
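The timestamped format above is easy to consume programmatically. The following is a minimal Python sketch, assuming the whitespace-separated layout shown in the snippet (the timestamp is always the last token on a line, so literals may contain spaces); it groups incoming triples by timestamp so that all triples of one entity instance stay together:

```python
def group_by_timestamp(lines):
    """Group stream lines by their trailing integer timestamp.

    Each line is assumed to hold subject, predicate and object tokens
    followed by an integer timestamp, separated by whitespace; only
    the last token is split off, so literals with spaces are safe.
    """
    groups = {}  # insertion-ordered; timestamps arrive non-decreasing
    for line in lines:
        line = line.strip()
        if not line:
            continue
        body, ts = line.rsplit(None, 1)  # split off the trailing timestamp
        groups.setdefault(int(ts), []).append(body)
    return groups
```

A consumer can then replay each timestamp group as one unit, matching the grouping of entity instances described above.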
                

The timestamp interval between two consecutive triples indicates when the next triple should be sent to the engine: the higher the stream rate, the smaller the interval. The user sets the stream rate when running the data generator, and the interval is computed accordingly. In Stream WatDiv, timestamps are measured in milliseconds. If the stream rate exceeds 1000 triples/second, then several triples are sent to the engine as a batch with the same timestamp; the batch size is the number of triples that should be sent to the engine together at the same time.
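Under these rules, the inter-triple interval and batch size follow directly from the stream rate. A minimal sketch (the generator's exact rounding behaviour is an assumption here; integer division is used for illustration):

```python
def timing_for_rate(stream_rate):
    """Derive the per-triple timestamp interval (ms) and batch size
    for a given stream rate in triples/second.

    Timestamps have millisecond resolution, so rates above 1000
    triples/second force several triples to share one timestamp.
    """
    if stream_rate <= 1000:
        interval_ms = 1000 // stream_rate  # whole milliseconds between triples
        batch_size = 1
    else:
        interval_ms = 1                    # one batch per millisecond
        batch_size = stream_rate // 1000   # triples sharing each timestamp
    return interval_ms, batch_size
```

For example, 500 triples/second yields one triple every 2 ms, while 4000 triples/second yields batches of 4 triples per millisecond.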

3. Description of Workload

Generating streaming RDF queries involves two steps. First, the query generator produces a set of SPARQL queries that are as diverse as possible. Second, these SPARQL queries are equipped with time windows and translated into equivalent streaming RDF queries in different languages.

In the first step, Stream WatDiv reuses the query generator of WatDiv to generate SPARQL queries. The query generator performs a random walk over the datasets and generates a set of query templates with various structures and selectivities. These query templates normally contain placeholders and are instantiated by replacing the placeholders with real RDF terms.
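Instantiation is essentially string substitution. As an illustrative sketch (the `%v0%` placeholder syntax is borrowed from WatDiv's query templates, and the binding below is invented for the example):

```python
import re

def instantiate(template, bindings):
    """Replace WatDiv-style %name% placeholders with concrete RDF terms."""
    return re.sub(r'%(\w+)%', lambda m: bindings[m.group(1)], template)

# Hypothetical template and binding, for illustration only.
query = instantiate(
    'SELECT ?v1 WHERE { %v0% <http://schema.org/email> ?v1 . }',
    {'v0': '<http://db.uwaterloo.ca/~galuc/wsdbm/User227>'})
```

Each template can be instantiated many times with different bindings, yielding many concrete queries per template.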

Once the SPARQL queries are generated, they are equipped with time windows and translated into equivalent streaming RDF queries. There is currently no standard streaming RDF query language; each engine proposes its own. Stream WatDiv covers the C-SPARQL [2] and CQELS [3] query languages, and other languages can be added as needed.
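This translation step can be sketched as simple templating. The window clauses below follow the published C-SPARQL (FROM STREAM ... [RANGE ... STEP ...]) and CQELS (STREAM ... [RANGE ... SLIDE ...]) styles, but the function names and exact clause layout are illustrative assumptions, not the generator's actual output:

```python
def to_csparql(name, select, patterns, stream, window):
    # C-SPARQL attaches the window in a FROM STREAM clause
    return (f'REGISTER QUERY {name} AS '
            f'SELECT {select} '
            f'FROM STREAM <{stream}> [RANGE {window}] '
            f'WHERE {{ {patterns} }}')

def to_cqels(select, patterns, stream, window):
    # CQELS attaches the window to a STREAM block inside WHERE
    return (f'SELECT {select} '
            f'WHERE {{ STREAM <{stream}> [RANGE {window}] {{ {patterns} }} }}')
```

The same basic graph pattern is thus emitted once per target language, with only the window syntax differing.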

The query generator produces two types of queries: pure stream queries, which involve only streaming data, and hybrid queries, which involve both streaming data and static data.

Providing a diversified workload is important because it can expose engine misbehaviour that has not been detected before. For example, the following generated CQELS query makes CQELS crash during query registration. Other failing queries share the same pattern: the static part of the query is disjoint, and its triple patterns have high result cardinality, so the intermediate result of the static part is large. When CQELS registers a query, it pre-computes and caches the intermediate result of the static part; this mechanism makes CQELS run out of memory when registering queries of this type. The same query also makes C-SPARQL crash during query execution. Monitoring memory and CPU usage, we observed that after roughly the first 15 query executions C-SPARQL had used up all heap memory and the CPU was busy with garbage collection; at some point no more memory was available and the engine crashed.

SELECT ?v0 ?v1 ?v2 ?v3 ?v4 ?v5
FROM NAMED <http://dsg.uwaterloo.ca/watdiv/knowledge>
WHERE{
        STREAM <http://ex.org/streams/test> [RANGE ${WSIZE} SLIDE ${WSLIDE}] {
                ?v0     <http://db.uwaterloo.ca/~galuc/wsdbm/follows>     ?v2 .
                ?v2     <http://db.uwaterloo.ca/~galuc/wsdbm/follows>     ?v3 .
                ?v4     <http://db.uwaterloo.ca/~galuc/wsdbm/makesPurchase> ?v5 .
        }
        GRAPH <http://dsg.uwaterloo.ca/watdiv/knowledge>{
                ?v0     <http://schema.org/email>     ?v1 .
                ?v3     <http://db.uwaterloo.ca/~galuc/wsdbm/friendOf>        ?v4 .
        }
}
                

4. Setup and Source Code

The source code of Stream WatDiv can be downloaded from here.

The steps to compile Stream WatDiv are the same as for WatDiv. Once the executable has been built, the following commands can be used to generate the data and the workload.

To run the data generator, issue the following command.

  • ./watdiv -sd <model-file> <static-scale-factor> <stream-scale-factor> <rand-seed>

You will find a model file in the model sub-directory where Stream WatDiv was installed. The static scale factor and stream scale factor control the sizes of the static dataset and the stream dataset, respectively. Running the data generator with the same random seed, model file, and scale factors produces the same datasets. The stream dataset is written to a file named stream.txt; at this stage, its triples have not yet been annotated with timestamps.

To annotate streaming data with timestamps, issue the following command.

  • ./watdiv -ts <source-file> <dest-file> <stream rate>

Source-file is the stream.txt generated in the previous step; dest-file names the file to which the timestamped stream dataset is written. Timestamps are determined by the stream rate: a stream rate of 1 means 1 triple per second, and a higher stream rate results in a smaller interval between two consecutive timestamps.
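The relationship between the stream rate and the timestamps can be sketched as follows (a simplified Python illustration; the real annotator additionally keeps all triples of one entity instance on the same timestamp):

```python
def annotate_timestamps(triples, stream_rate):
    """Assign a millisecond timestamp to each triple for a given
    stream rate in triples/second.

    Rates above 1000 make consecutive triples share timestamps,
    i.e. they form batches sent to the engine together.
    """
    return [(triple, i * 1000 // stream_rate)
            for i, triple in enumerate(triples)]
```

At a rate of 2 triples/second the timestamps advance by 500 ms per triple; at 2000 triples/second, pairs of triples share each millisecond timestamp.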

To generate the workload, issue the following command.

  • ./watdiv -sq <model-file> <static-dataset> <stream-dataset> <max-query-size> <query-count> <constant-per-query-count>

Running the query generator requires the model file, the static dataset, and the stream dataset. The query generator conducts a random walk over the stream dataset and the static dataset, and each query may have a different number of triple patterns. Max-query-size bounds the number of triple patterns in each query, query-count specifies the number of queries to generate, and constant-per-query-count specifies the number of placeholders per query.

References

[1] Güneş Aluç, Olaf Hartig, M. Tamer Özsu, and Khuzaima Daudjee. Diversified stress testing of RDF data management systems. In Proc. 13th International Semantic Web Conference, pages 197-212, 2014.

[2] Davide Francesco Barbieri, Daniele Braga, Stefano Ceri, Emanuele Della Valle, and Michael Grossniklaus. Querying RDF streams with C-SPARQL. ACM SIGMOD Record, 39(1):20-26, 2010.

[3] Danh Le-Phuoc, Minh Dao-Tran, Josiane Xavier Parreira, and Manfred Hauswirth. A native and adaptive approach for unified processing of linked streams and linked data. In Proc. 10th International Semantic Web Conference, pages 370-388, 2011.