WatDiv Dataset Description Language Tutorial

Author: Güneş Aluç

1. Overview

The WatDiv dataset description language enables users to create their own test databases and/or customize existing ones.

The WatDiv dataset description language consists of four main constructs: namespace declarations (Section 2), entity declarations (Section 3), literal property declarations (Section 4), which are enclosed within entity declarations, and non-literal property declarations (Section 5).

A dataset description model needs to conform to the following syntax:

DatasetDescriptionModel ::= NamespaceDeclaration* EntityDeclaration+ PropertyDeclaration*

In other words, it must consist of zero or more NamespaceDeclaration constructs, followed by at least one EntityDeclaration construct, followed by zero or more PropertyDeclaration constructs.
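
The ordering rule above can be sketched as a small check. This is only an illustration, not part of WatDiv: the function name check_model_order is hypothetical, it recognizes constructs only by their leading keywords (#namespace, <type> or <type*>, and #association, introduced in Sections 2, 3 and 5), and it assumes that blank lines are ignored.

```python
def check_model_order(lines):
    """Return True if the top-level constructs appear in the required order."""
    stage = 0            # 0 = namespaces, 1 = entities, 2 = associations
    entities = 0         # number of entity declarations seen so far
    inside_type = False  # True while between <type...> and </type>
    for line in lines:
        text = line.strip()
        if not text:
            continue  # assumption: blank lines are ignored
        if inside_type:
            if text == "</type>":
                inside_type = False
            continue  # the body of an entity declaration is not inspected here
        if text.startswith("#namespace"):
            kind = 0
        elif text.startswith("<type>") or text.startswith("<type*>"):
            kind = 1
            entities += 1
            inside_type = True
        elif text.startswith("#association"):
            kind = 2
        else:
            return False  # unknown top-level construct
        if kind < stage:
            return False  # e.g. a namespace declared after an entity
        stage = kind
    return entities >= 1 and not inside_type


model = [
    "#namespace\twatdiv=http://db.uwaterloo.ca/watdiv/",
    "<type*>\t\twatdiv:Genre 20",
    "</type>",
    "<type>\t\twatdiv:Product 250",
    "</type>",
]
print(check_model_order(model))
```

A model with no entity declaration, or with a namespace declaration placed after an entity declaration, is rejected by this check.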

2. Declaring Namespaces

Namespace declarations are optional, but highly recommended as they simplify schema development and improve readability.

A namespace declaration consists of two parts, a namespace identifier (NSIdentifier) and a prefix (NSPrefix), and has the following syntax:

NSIdentifier ::= NCName
NSPrefix ::= URI
NSLocalName ::= NCName
NamespaceDeclaration ::= '#namespace'   '\s'   NSIdentifier   '='   NSPrefix   '\n'

Consider the following namespace declaration:

#namespace	watdiv=http://db.uwaterloo.ca/watdiv/

The prefix http://db.uwaterloo.ca/watdiv/ is assigned to the namespace identifier watdiv, which can be cross-referenced in subsequent statements using the syntax NSIdentifier:NSLocalName. In that case, NSIdentifier will be automatically replaced by the prefix http://db.uwaterloo.ca/watdiv/.

For example,

 watdiv:Genre 
will become
 http://db.uwaterloo.ca/watdiv/Genre 
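
As a rough illustration of this substitution, the sketch below expands a prefixed name with a plain dictionary lookup. The function expand_name and the namespaces dictionary are hypothetical names chosen for this example; the sketch does not show how the WatDiv generator itself is implemented.

```python
def expand_name(name, namespaces):
    """Expand 'NSIdentifier:NSLocalName' to a full URI; leave other names unchanged."""
    identifier, separator, local_name = name.partition(":")
    if separator and identifier in namespaces:
        return namespaces[identifier] + local_name
    return name  # already a full URI, or an undeclared identifier


# Mirrors the declaration: #namespace watdiv=http://db.uwaterloo.ca/watdiv/
namespaces = {"watdiv": "http://db.uwaterloo.ca/watdiv/"}
print(expand_name("watdiv:Genre", namespaces))
```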

3. Declaring Entities

The WatDiv dataset description model supports two types of entities: scalable and non-scalable.

The number of instantiations of a scalable entity increases proportionally with the scale factor used in dataset generation. In contrast, non-scalable entities always have a constant number of instantiations, irrespective of the scale factor.

Both scalable and non-scalable entity declarations have a similar syntax, as shown by the grammar rules below:

EntityName ::= URI   |   NSIdentifier   ':'   NSLocalName
PropertyName ::= URI   |   NSIdentifier   ':'   NSLocalName  
EntityCount ::= [0-9]+
NonScalableEntity ::=
'<type*>'   '\s'   EntityName   '\s'   EntityCount   '\n'
    PGroupDeclaration*
'</type>'   '\n'  
ScalableEntity ::=
'<type>'   '\s'   EntityName   '\s'   EntityCount   '\n'
    PGroupDeclaration*
'</type>'   '\n'  
EntityDeclaration ::= ScalableEntity   |   NonScalableEntity

In other words, non-scalable entity declarations start with the tag <type*>, while scalable entity declarations start with the tag <type>. Both end with the tag </type>.

The opening tag is followed by two parameters, namely, EntityName and EntityCount:

  • EntityName is a URI, either provided directly, or constructed through dereferencing namespaces.
  • EntityCount indicates the number of instances to be generated per scale factor. If a non-scalable entity is being declared, EntityCount corresponds directly to the total number of instances to be generated.

Consider the following non-scalable entity declaration:

<type*>		watdiv:Genre 20
	...	
</type>

The dataset generator will interpret the command and instantiate the following 20 Genre instances:

<http://db.uwaterloo.ca/watdiv/Genre0>
<http://db.uwaterloo.ca/watdiv/Genre1>
<http://db.uwaterloo.ca/watdiv/Genre2>

		...

<http://db.uwaterloo.ca/watdiv/Genre19>

These instances can appear as the subject or object of an RDF triple.

Next, consider the following scalable entity declaration:

<type>		watdiv:Product 250
	...
</type>

Let us assume that in this case, the dataset generator is invoked at scale factor 20. This means that there will be a total of 20 × 250 = 5000 Product instances, as illustrated below:

<http://db.uwaterloo.ca/watdiv/Product0>
<http://db.uwaterloo.ca/watdiv/Product1>
<http://db.uwaterloo.ca/watdiv/Product2>

		...

<http://db.uwaterloo.ca/watdiv/Product4999>
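
The instance counts in the two examples above follow from simple arithmetic, which the sketch below reproduces. The function instance_uris is a hypothetical helper written for this tutorial; it only mirrors the numbering scheme shown in the examples.

```python
def instance_uris(entity_uri, entity_count, scale_factor, scalable):
    """List the instance URIs of an entity, numbered from 0."""
    # Scalable entities grow with the scale factor; non-scalable ones do not.
    total = entity_count * scale_factor if scalable else entity_count
    return ["<%s%d>" % (entity_uri, i) for i in range(total)]


base = "http://db.uwaterloo.ca/watdiv/"
genres = instance_uris(base + "Genre", 20, scale_factor=20, scalable=False)
products = instance_uris(base + "Product", 250, scale_factor=20, scalable=True)
print(len(genres), genres[-1])
print(len(products), products[-1])
```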

4. Declaring Literal Properties

InstantiationProbability ::= [0-1] ( '.' [0-9]+ )?
PGroupDeclaration ::=
'<pgroup>'   '\s'   InstantiationProbability   ('\s'   '\@'   EntityName)?   '\n'
    LiteralPropertyDeclaration+
'</pgroup>'   '\n'

Each entity declaration consists of zero or more PGroupDeclaration constructs. A Property Group Declaration (PGroupDeclaration), whose syntax is given in the grammar above, is used within an EntityDeclaration to generate RDF triples in which

  • the entity that is being declared is the subject of the triple, and
  • the object of the triple is a literal.

Consider the following PGroupDeclaration:

<type>		watdiv:Product 250
  <pgroup>	1.0
    #predicate	watdiv:name	string
  </pgroup>
</type>

Let us assume that the dataset generator is invoked at scale factor 20, as in Section 3. Hence, 5000 Product instances will be generated. Then, the Property Group Declaration in the example dictates that for each Product instance [Product0 .. Product4999], the dataset generator will generate an RDF triple as follows:

<http://db.uwaterloo.ca/watdiv/Product0>	<http://db.uwaterloo.ca/watdiv/name>	"random sequence of words" .
<http://db.uwaterloo.ca/watdiv/Product1>	<http://db.uwaterloo.ca/watdiv/name>	"another random sequence of words" .
<http://db.uwaterloo.ca/watdiv/Product2>	<http://db.uwaterloo.ca/watdiv/name>	"yet another random sequence of words" .

								...

<http://db.uwaterloo.ca/watdiv/Product4999>	<http://db.uwaterloo.ca/watdiv/name>	"yet yet another random sequence of words" .

Often, it may be more realistic to generate a dataset in which not every Product instance has the attribute watdiv:name. This is really easy to achieve with WatDiv: just change the InstantiationProbability, which is the first parameter of the <pgroup> construct in a PGroupDeclaration.

Consider a slightly modified version of the example, in which the InstantiationProbability is 0.2:

<type>		watdiv:Product 250
  <pgroup>	0.2
    #predicate	watdiv:name	string
  </pgroup>
</type>

This would result in the generation of an RDF dataset in which only approximately 1/5 of the Product instances (about 1000 of the 5000 instances at scale factor 20) have the attribute watdiv:name. These Product instances are selected uniformly at random during dataset generation. Consequently, one might obtain a dataset such as the following:

<http://db.uwaterloo.ca/watdiv/Product4>	<http://db.uwaterloo.ca/watdiv/name>	"random sequence of words" .
<http://db.uwaterloo.ca/watdiv/Product9>	<http://db.uwaterloo.ca/watdiv/name>	"another random sequence of words" .
<http://db.uwaterloo.ca/watdiv/Product12>	<http://db.uwaterloo.ca/watdiv/name>	"yet another random sequence of words" .

								...

<http://db.uwaterloo.ca/watdiv/Product4996>	<http://db.uwaterloo.ca/watdiv/name>	"yet yet another random sequence of words" .
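
The effect of InstantiationProbability can be approximated with the following sketch. It assumes an independent random draw per instance, which is one plausible reading of "selected uniformly at random"; the tutorial does not state how the generator actually performs the selection. The function select_instances and the seed value are hypothetical.

```python
import random


def select_instances(total, probability, rng):
    """Return the indices of the instances for which the property group is instantiated."""
    # Assumption: each instance is kept independently with the given probability.
    return [i for i in range(total) if rng.random() < probability]


rng = random.Random(42)  # fixed seed so that the sketch is repeatable
selected = select_instances(5000, 0.2, rng)
print(len(selected))  # close to 0.2 * 5000 = 1000, but not exactly 1000
```

Under this assumption the number of selected instances varies slightly from run to run, which is why the text speaks of an approximate fraction.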

The WatDiv dataset generator is also sensitive to type assertions. A type assertion is an RDF triple of the form:

S <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> C .

where S and C are both URIs, C denotes an RDF Schema Class, and the triple reads as "S is an instance of class C".

For example, let us assume that the following type assertions have already been asserted for various Product instances (in Section 5, we will see how such type assertions can be made).

<http://db.uwaterloo.ca/watdiv/Product0>	<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>	<http://db.uwaterloo.ca/watdiv/Book>	.
<http://db.uwaterloo.ca/watdiv/Product1>	<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>	<http://db.uwaterloo.ca/watdiv/Book>	.

								...

<http://db.uwaterloo.ca/watdiv/Product998>	<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>	<http://db.uwaterloo.ca/watdiv/Book>	.
<http://db.uwaterloo.ca/watdiv/Product999>	<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>	<http://db.uwaterloo.ca/watdiv/Book>	.

Then, it is possible to restrict the domain of a PGroupDeclaration to a particular RDF Schema Class, using the following syntax:

<type>		watdiv:Product 250
  <pgroup>	1.0	@watdiv:Book
    #predicate	watdiv:name	string
  </pgroup>
</type>

The code snippet above will result in the generation of the following RDF triples:

<http://db.uwaterloo.ca/watdiv/Product0>	<http://db.uwaterloo.ca/watdiv/name>	"random sequence of words" .
<http://db.uwaterloo.ca/watdiv/Product1>	<http://db.uwaterloo.ca/watdiv/name>	"another random sequence of words" .
<http://db.uwaterloo.ca/watdiv/Product2>	<http://db.uwaterloo.ca/watdiv/name>	"yet another random sequence of words" .

								...

<http://db.uwaterloo.ca/watdiv/Product999>	<http://db.uwaterloo.ca/watdiv/name>	"yet yet another random sequence of words" .

Note how instances [Product1000 .. Product4999] have been excluded from the generation process due to the type restriction.
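
The type restriction amounts to filtering the candidate subjects by their existing type assertions, as in the sketch below. The function restrict_to_class and the tuple representation of triples are assumptions made for illustration only.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def restrict_to_class(subjects, triples, class_uri):
    """Keep only the subjects that have a type assertion for class_uri."""
    typed = {s for (s, p, o) in triples if p == RDF_TYPE and o == class_uri}
    return [s for s in subjects if s in typed]


base = "http://db.uwaterloo.ca/watdiv/"
products = [base + "Product%d" % i for i in range(5000)]
# Type assertions for Product0 .. Product999, as in the example above.
assertions = [(base + "Product%d" % i, RDF_TYPE, base + "Book") for i in range(1000)]
books = restrict_to_class(products, assertions, base + "Book")
print(len(books), books[-1])
```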

A PGroupDeclaration consists of one or more LiteralPropertyDeclaration constructs, each starting with the reserved keyword #predicate.

During dataset generation, the LiteralPropertyDeclarations within a single PGroupDeclaration are processed as a whole. This implies that for a particular entity instance, either all or none of the properties described by the LiteralPropertyDeclarations within the same PGroupDeclaration will be instantiated.

For example, the following code snippet:

<type>		watdiv:Product 250
  <pgroup>	0.2
    #predicate	watdiv:name		string
    #predicate	watdiv:purchaseDate	date
  </pgroup>
</type>

will result in approximately one in every five Product instances being instantiated with both the watdiv:name and watdiv:purchaseDate properties, and thus in the generation of an RDF dataset such as the one below:

<http://db.uwaterloo.ca/watdiv/Product4>	<http://db.uwaterloo.ca/watdiv/name>		"random sequence of words" .
<http://db.uwaterloo.ca/watdiv/Product4>	<http://db.uwaterloo.ca/watdiv/purchaseDate>	"2015-01-01" .
<http://db.uwaterloo.ca/watdiv/Product9>	<http://db.uwaterloo.ca/watdiv/name>		"another random sequence of words" .
<http://db.uwaterloo.ca/watdiv/Product9>	<http://db.uwaterloo.ca/watdiv/purchaseDate>	"2015-01-02" .
<http://db.uwaterloo.ca/watdiv/Product12>	<http://db.uwaterloo.ca/watdiv/name>		"yet another random sequence of words" .
<http://db.uwaterloo.ca/watdiv/Product12>	<http://db.uwaterloo.ca/watdiv/purchaseDate>	"2015-01-03" .

								...

<http://db.uwaterloo.ca/watdiv/Product4996>	<http://db.uwaterloo.ca/watdiv/name>	"yet yet another random sequence of words" .
<http://db.uwaterloo.ca/watdiv/Product4996>	<http://db.uwaterloo.ca/watdiv/purchaseDate>	"2015-01-31" .

Now, consider the following code snippet, which will result in a completely different behaviour:

<type>		watdiv:Product 250
  <pgroup>	0.2
    #predicate	watdiv:name		string
  </pgroup>
  <pgroup>	0.2
    #predicate	watdiv:purchaseDate	date
  </pgroup>
</type>

In the code snippet above, the properties watdiv:name and watdiv:purchaseDate do not appear within the same PGroupDeclaration construct. Therefore, each of the two properties will be instantiated for its own independently and randomly selected subset of Product instances, and the two subsets need not coincide as they did in the previous case. This behaviour is demonstrated by the following RDF dataset:

<http://db.uwaterloo.ca/watdiv/Product3>	<http://db.uwaterloo.ca/watdiv/name>		"random sequence of words" .
<http://db.uwaterloo.ca/watdiv/Product4>	<http://db.uwaterloo.ca/watdiv/purchaseDate>	"2015-01-01" .
<http://db.uwaterloo.ca/watdiv/Product7>	<http://db.uwaterloo.ca/watdiv/name>		"another random sequence of words" .
<http://db.uwaterloo.ca/watdiv/Product9>	<http://db.uwaterloo.ca/watdiv/purchaseDate>	"2015-01-02" .
<http://db.uwaterloo.ca/watdiv/Product12>	<http://db.uwaterloo.ca/watdiv/name>		"yet another random sequence of words" .
<http://db.uwaterloo.ca/watdiv/Product12>	<http://db.uwaterloo.ca/watdiv/purchaseDate>	"2015-01-03" .

								...

<http://db.uwaterloo.ca/watdiv/Product4996>	<http://db.uwaterloo.ca/watdiv/name>		"yet yet another random sequence of words" .
<http://db.uwaterloo.ca/watdiv/Product4999>	<http://db.uwaterloo.ca/watdiv/purchaseDate>	"2015-01-31" .
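
The difference between the two snippets can be sketched as follows. As before, an independent random draw per instance is assumed, and the functions grouped and independent are hypothetical names used only for this illustration.

```python
import random


def grouped(total, probability, properties, rng):
    """One <pgroup> holding all properties: a single draw decides all of them."""
    result = {}
    for i in range(total):
        if rng.random() < probability:
            result[i] = list(properties)
    return result


def independent(total, probability, properties, rng):
    """One <pgroup> per property: a separate draw for every property."""
    result = {}
    for prop in properties:
        for i in range(total):
            if rng.random() < probability:
                result.setdefault(i, []).append(prop)
    return result


props = ["watdiv:name", "watdiv:purchaseDate"]
together = grouped(5000, 0.2, props, random.Random(1))
separate = independent(5000, 0.2, props, random.Random(1))
# In the grouped case every selected instance carries both properties.
print(all(len(v) == 2 for v in together.values()))
# In the independent case some instances carry only one of the two.
print(any(len(v) == 1 for v in separate.values()))
```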
LiteralType ::= INTEGER | integer | STRING | string | DATE | date | NAME | name
LiteralValue ::= see RDF Documentation
DistributionType ::= UNIFORM | uniform | NORMAL | normal | ZIPFIAN | zipfian
LiteralPropertyDeclaration ::= '#predicate'   '\s'   PropertyName   '\s'   LiteralType   ( '\s'   LiteralValue   '\s'   LiteralValue ( '\s'   DistributionType )? )?   '\n'

The syntax of a LiteralPropertyDeclaration construct is provided in the table above. In summary, there are 3 use cases, which are all shown in the table below.

Use Case   Syntax and Example

1          Syntax:    '#predicate' '\s' PropertyName '\s' LiteralType '\n'
           Example:   #predicate   watdiv:annualIncome   integer

2          Syntax:    '#predicate' '\s' PropertyName '\s' LiteralType '\s' LiteralValue '\s' LiteralValue '\n'
           Example:   #predicate   watdiv:annualIncome   integer   11000 100000

3          Syntax:    '#predicate' '\s' PropertyName '\s' LiteralType '\s' LiteralValue '\s' LiteralValue '\s' DistributionType '\n'
           Example:   #predicate   watdiv:annualIncome   integer   11000 100000   normal

In the first use case, two parameters are used: PropertyName and LiteralType.

PropertyName sets the URI of the predicate of the RDF triples that are to be generated, and LiteralType determines the type of the literal objects in the RDF triples. Currently, WatDiv supports the generation of four types, namely, INTEGER, STRING, DATE and NAME. Syntactically, literals of these types are written as an XML Schema Definition Language numeral with no decimal point, a string, a date, and a string, respectively. The value space of each type is given in the table below. Unless range values are specified by the user (cf. use cases 2 and 3), the WatDiv data generator assumes the default ranges given in the same table.

WatDiv Datatype   Syntax                     Value Space                        Default Range
INTEGER           no decimal point numeral   C++ (signed) int                   [0, 65535]
STRING            string                     C++ char[]                         ["A", "z"]
DATE              date                       ["1970-01-01", Current Date]       ["1970-01-01", Current Date]
NAME              string                     Popular English first/last names   ["A", "z"]

In the second use case, two additional parameters set the minimum and maximum values, respectively, of the generated literal objects.

In the third use case, in addition to the minimum and maximum values, a DistributionType parameter instructs the WatDiv data generator to draw the literal object values from a particular distribution.
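
The three use cases can be pictured with the sketch below, restricted to the INTEGER type. Several points are assumptions rather than documented WatDiv behaviour: that values are drawn uniformly when no distribution is given, and that a normal distribution is centred on the middle of the range with a standard deviation of one sixth of its width and clamped to the range. The zipfian case is left out because the tutorial gives no parameters for it. The function integer_literal is a hypothetical name.

```python
import random


def integer_literal(rng, minimum=0, maximum=65535, distribution="uniform"):
    """Draw one integer literal; the defaults mirror the INTEGER default range."""
    kind = distribution.lower()  # the grammar accepts upper and lower case
    if kind == "uniform":
        return rng.randint(minimum, maximum)
    if kind == "normal":
        # Assumed parameterization, not taken from WatDiv.
        mean = (minimum + maximum) / 2.0
        stddev = (maximum - minimum) / 6.0
        value = int(round(rng.gauss(mean, stddev)))
        return min(maximum, max(minimum, value))
    raise ValueError("distribution not covered by this sketch: " + distribution)


rng = random.Random(3)
print(integer_literal(rng))                            # use case 1
print(integer_literal(rng, 11000, 100000))             # use case 2
print(integer_literal(rng, 11000, 100000, "normal"))   # use case 3
```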

5. Declaring Non-Literal Properties

SubjectCardinality ::= [1-2]
ObjectCardinality ::= [1-9][0-9]*
ObjectCardinalityDistribution ::= UNIFORM | uniform | NORMAL | normal
PropertyDeclaration ::= '#association'
        '\s' EntityName '\s' PropertyName '\s' EntityName
        '\s' SubjectCardinality '\s' ObjectCardinality ('[' ObjectCardinalityDistribution ']')?
        '\s' InstantiationProbability '\s' DistributionType
        ( '\@' EntityName '\@' EntityName )? '\n'

A PropertyDeclaration, whose syntax is given in the table above, dictates the generation of RDF triples, where both the subject and the object of a triple are URIs.

In other words, a PropertyDeclaration relates instances of two entities (indicated by the first and third parameters) using an RDF predicate (indicated by the second parameter). The first parameter determines the subject entity, and the third parameter determines the object entity.

For example, the following PropertyDeclaration

#association     watdiv:Product  watdiv:availableAt  watdiv:Retailer  ... (For readability, only the first three parameters are shown.)

might result in the generation of a set of RDF triples such as the following:

<http://db.uwaterloo.ca/watdiv/Product0>	<http://db.uwaterloo.ca/watdiv/availableAt>	<http://db.uwaterloo.ca/watdiv/Retailer23> .
<http://db.uwaterloo.ca/watdiv/Product1>	<http://db.uwaterloo.ca/watdiv/availableAt>	<http://db.uwaterloo.ca/watdiv/Retailer1> .
<http://db.uwaterloo.ca/watdiv/Product2>	<http://db.uwaterloo.ca/watdiv/availableAt>	<http://db.uwaterloo.ca/watdiv/Retailer46> .
								...

<http://db.uwaterloo.ca/watdiv/Product4999>	<http://db.uwaterloo.ca/watdiv/availableAt>	<http://db.uwaterloo.ca/watdiv/Retailer5> .

In a PropertyDeclaration, the relationship between the entities can be defined in multiple ways. In particular, this relationship can be one of:

  1. one-to-one,
  2. one-to-many,
  3. many-to-one, or
  4. many-to-many.

To achieve cases (1) and (2), SubjectCardinality must be set to 1, and to achieve cases (3) and (4), SubjectCardinality must be set to 2. Likewise, to achieve cases (1) and (3), ObjectCardinality must be set to 1. To achieve cases (2) and (4), ObjectCardinality, unlike SubjectCardinality, can be set to any integer value greater than or equal to 2.

ObjectCardinality is used by the WatDiv data generator to determine how many RDF triples to generate for each unique subject instance. More specifically, for each unique subject instance, the WatDiv data generator picks a number between 1 and the value of ObjectCardinality uniformly at random and generates that many RDF triples. Optionally, this distribution can be configured with the ObjectCardinalityDistribution parameter.

For example, for the following PropertyDeclaration,

#association     watdiv:Product  watdiv:availableAt  watdiv:Retailer	2 5 ... (For readability, the remaining parameters have been omitted.)

WatDiv may generate the following dataset:

<http://db.uwaterloo.ca/watdiv/Product0>	<http://db.uwaterloo.ca/watdiv/availableAt>	<http://db.uwaterloo.ca/watdiv/Retailer157> .
<http://db.uwaterloo.ca/watdiv/Product0>	<http://db.uwaterloo.ca/watdiv/availableAt>	<http://db.uwaterloo.ca/watdiv/Retailer198> .
<http://db.uwaterloo.ca/watdiv/Product0>	<http://db.uwaterloo.ca/watdiv/availableAt>	<http://db.uwaterloo.ca/watdiv/Retailer2> .
<http://db.uwaterloo.ca/watdiv/Product1>	<http://db.uwaterloo.ca/watdiv/availableAt>	<http://db.uwaterloo.ca/watdiv/Retailer1000> .

								...

<http://db.uwaterloo.ca/watdiv/Product4999>	<http://db.uwaterloo.ca/watdiv/availableAt>	<http://db.uwaterloo.ca/watdiv/Retailer5> .
<http://db.uwaterloo.ca/watdiv/Product4999>	<http://db.uwaterloo.ca/watdiv/availableAt>	<http://db.uwaterloo.ca/watdiv/Retailer1000> .
<http://db.uwaterloo.ca/watdiv/Product4999>	<http://db.uwaterloo.ca/watdiv/availableAt>	<http://db.uwaterloo.ca/watdiv/Retailer7> .
<http://db.uwaterloo.ca/watdiv/Product4999>	<http://db.uwaterloo.ca/watdiv/availableAt>	<http://db.uwaterloo.ca/watdiv/Retailer2> .

Note that in the dataset above, Product instances are related to Retailer instances in a many-to-many fashion, where each Product can be related to [1 .. 5] Retailers.
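
The role of ObjectCardinality described above can be sketched as follows. The sketch covers only the default behaviour (a uniformly random number of triples between 1 and ObjectCardinality per subject) and makes further assumptions: every subject participates, objects are drawn uniformly and may repeat, and the Retailer count of 1001 is a hypothetical value. SubjectCardinality, InstantiationProbability and DistributionType are not modelled, and association_triples is a hypothetical name.

```python
import random


def association_triples(subject_count, object_count, object_cardinality, rng):
    """Return (subject index, object index) pairs for one association."""
    triples = []
    for s in range(subject_count):
        # Between 1 and ObjectCardinality triples for each subject instance.
        how_many = rng.randint(1, object_cardinality)
        for _ in range(how_many):
            triples.append((s, rng.randrange(object_count)))
    return triples


rng = random.Random(7)
triples = association_triples(5000, 1001, 5, rng)
per_subject = {}
for s, o in triples:
    per_subject[s] = per_subject.get(s, 0) + 1
print(min(per_subject.values()), max(per_subject.values()))
```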

As in a PGroupDeclaration, the InstantiationProbability parameter controls the probability that an instance of the subject entity participates in the relationship. For example, in the following one-to-one relationship, only about 10% of Product instances will be available at a Retailer:

#association     watdiv:Product  watdiv:availableAt  watdiv:Retailer	1 1	0.1 ... (For readability, the remaining parameters have been omitted.)

The DistributionType parameter determines from which distribution the object entity instances should be drawn.

The last two optional parameters apply type restrictions to the subject and object entity instances, respectively.