Generating RDF for Application Testing⋆

                           Daniel Blum and Sara Cohen
                  {daniel.blum@mail,sara@cs}.huji.ac.il

                         School of Computer Science and Engineering
                             The Hebrew University of Jerusalem


        Abstract. Application testing is a critical component of application develop-
        ment. Testing of Semantic Web applications requires large RDF datasets, con-
        forming to an expected form or schema, and preferably, to an expected data dis-
        tribution. Finding such datasets often proves impossible, while generating input
        datasets is often cumbersome. The GRR (Generating Random RDF) system is a
        convenient, yet powerful, tool for generating random RDF, based on a SPARQL-
        like syntax. In this poster and demo, we show how large datasets can be easily
        generated using intuitive commands.


1 Introduction
Testing is a critical step in application development. For Semantic Web applications,
testing is a challenge due to both the large volume of input data needed, and the intricate
format that this data must have. While many Semantic Web applications focus on varied
and unexpected types of data, there are also many others that target specific domains.
For such applications, test datasets are useful only if they have at least two properties:
 1. The data should have the structure expected by the target application (e.g., conform
    to a specific RDF schema).
 2. The data should match the expected data distribution of the target application.
    Currently, there are several distinct sources for RDF datasets. First, there are down-
loadable RDF datasets that can be found on the web, e.g., Barton libraries, UniProt
catalog sequence, and WordNet. RDF Benchmarks, which include both large datasets
and sample queries, have also been developed, e.g., the Lehigh University Benchmark
(LUBM) [4] (which generates data about universities), the SP2Bench benchmark [7]
(which provides DBLP-style data) and the Berlin SPARQL Benchmark [1] (which is
built around an e-commerce use case). Such downloadable RDF datasets are usually an
excellent choice when testing the efficiency of an RDF storage system. However, they
are often unsuitable for experimenting with and analyzing a particular RDF application.
Specifically, since each of these datasets is built for a single given scenario, it may
have neither of the two properties listed above for the application at hand.
    Data generators are another source for datasets. A data generator is a program that
generates data according to user constraints. As such, data generators are usually more
flexible than benchmarks. Unfortunately, there are few data generators available for
RDF (SIMILE [8], RBench [6]), and none of these programs can produce data that conforms
to a specific given structure. Thus, again, the generated data will not have the specified
properties.
⋆
    This work was partially supported by GIF Grant (2201-1880.6/2008).
    In this demo, we present the GRR (Generating Random RDF) system for generating
RDF that satisfies both desirable properties given above. Thus, GRR is not a benchmark
system, but rather, a system to use for Semantic Web application testing. Using intuitive
data generation commands with a SPARQL-like syntax, GRR can produce data with a
complex graph structure, as well as draw the data values from desirable domains. Data
generation commands are translated into a series of SPARQL queries and update
commands which are applied directly to an RDF storage system.¹ A video demonstration of
GRR is available online,² and the system is available upon request.


2 Motivating Example
As a motivating example, we discuss the problem of generating the data described in
the LUBM Benchmark. Note that GRR is not limited to creating benchmark data. In our
demo, we will demonstrate using GRR to generate other types of data, such as FOAF [3]
(Friend of a Friend) datasets, which are used in social network applications.
    LUBM [4] is a collection of data describing university classes (i.e., entities), such
as departments, faculty members, students, courses, etc. These classes have a plethora
of properties (i.e., relations) between them, e.g., faculty members work for departments
and head departments, students take courses and are advised by faculty members, etc.
    In order to capture a real-world scenario, LUBM defines interdependencies between
the entities. For example, the number of students in a department is a function of the
number of faculty members. Specifically, LUBM requires there to be a 1:8-14 ratio of
faculty members to undergraduate students. As another example, the cardinality of a
property may be specified, e.g., each department must have a single head of department
(who must be a full professor). Properties may also be required to satisfy additional
constraints, e.g., the sets of courses taught by different faculty members must be
pairwise disjoint.
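
    Constraints of this kind can also be checked on a generated dataset. The following is a minimal Python sketch, assuming the relevant counts have already been extracted from the data (for instance, per department). The function names and the dictionary layout are illustrative assumptions, not part of LUBM or of the system described here.

```python
from typing import Dict, List


def ratio_ok(num_faculty: int, num_undergrads: int,
             low: int = 8, high: int = 14) -> bool:
    """True if there are between `low` and `high` undergraduates per faculty member."""
    return low * num_faculty <= num_undergrads <= high * num_faculty


def single_head_ok(heads_by_dept: Dict[str, List[str]]) -> bool:
    """True if every department has exactly one head.

    `heads_by_dept` (hypothetical layout) maps a department to the list of
    people connected to it as head.
    """
    return all(len(heads) == 1 for heads in heads_by_dept.values())
```

    For example, a department with 10 faculty members satisfies the ratio with any number of undergraduates from 80 to 140.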
    In the next section, we describe the GRR data generation language, and demonstrate
commands for producing LUBM benchmark data. Due to space limitations, we do not
provide all commands used to reproduce LUBM. However, we note that the data generation
commands needed to reproduce LUBM contain only about twice as many words as the
intuitive description of LUBM provided in [4].


3 Data Generation Commands
Data is generated by a sequence of data generation commands (dg-commands, for
short) c1, ..., cn, given as input a (possibly empty) RDF dataset R. The first
command c1 is evaluated over R, while each subsequent command ci (i > 1) is evaluated
over the output of the previous command c(i-1).
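
    This evaluation order amounts to a left fold over the command list. The sketch below illustrates it in Python under two simplifying assumptions that are not taken from the system itself: a dataset is modeled as a frozenset of (subject, predicate, object) triples, and a dg-command is modeled as any function from dataset to dataset. The names `run_commands`, `add_university` and `add_department` are hypothetical.

```python
from functools import reduce
from typing import Callable, FrozenSet, Iterable, Tuple

Triple = Tuple[str, str, str]
Dataset = FrozenSet[Triple]
Command = Callable[[Dataset], Dataset]  # hypothetical stand-in for a dg-command


def run_commands(commands: Iterable[Command],
                 initial: Dataset = frozenset()) -> Dataset:
    """Evaluate c1 over the input R, then each later command over the previous output."""
    return reduce(lambda data, cmd: cmd(data), commands, initial)


# Two toy commands, used only to show the chaining.
def add_university(data: Dataset) -> Dataset:
    return data | {("ex:univ0", "rdf:type", "ub:Univ")}


def add_department(data: Dataset) -> Dataset:
    # Sees any university created by an earlier command.
    univs = [s for (s, p, o) in data if p == "rdf:type" and o == "ub:Univ"]
    new = {("ex:dept0", "rdf:type", "ub:Dept")}
    new |= {("ex:dept0", "ub:subOrg", u) for u in univs}
    return data | new


result = run_commands([add_university, add_department])
```

    Because each command sees only the output of its predecessors, order matters: running `add_department` first would create the department without linking it to any university.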
    The general syntax of a single dg-command appears below. Note that square brack-
ets are used to denote optional portions, and the “*” indicates a component that can
appear any number of times.
 ¹ The Jena Semantic Web Framework for Java [5] is used in our implementation.
 ² http://www.cs.huji.ac.il/~danieb12/
(FOR (EACH | sampling-method)
    [WITH (GLOBAL DISTINCT | LOCAL DISTINCT | REPEATABLE)]
    {list of classes}
    [WHERE {list of conditions}] )*
[CREATE i-j {list of classes}]
[CONNECT {list of connections}]

    A dg-command contains any number of FOR clauses, and then optionally a CREATE
and/or CONNECT clause. Intuitively, the FOR clauses choose portions of the RDF input,
the CREATE clause creates new nodes in the RDF graph, and the CONNECT clause
connects nodes in the RDF graph. We require that at least one among the CREATE
and CONNECT clauses be present in every dg-command. We now briefly describe each
clause. (The full language semantics appears in [2].)
 – The FOR Clause: Each FOR clause defines (1) a query which will be applied against
   the RDF input, as well as (2) a method to choose a subset of the query results. For
   (1), the user provides a list of classes whose instances should be chosen (similar to
   a SPARQL SELECT clause), as well as any conditions (similar to a SPARQL WHERE
   clause). The correspondence to SPARQL is not precise as we allow for certain syn-
   tactic shortcuts, which avoid explicit variable use, and make dg-commands more
   readable. For (2), the user defines both the method with which answers should be
   sampled, as well as whether the sampling process is with/without repetition.
 – The CREATE Clause: The CREATE clause defines nodes that should be created.
   The user provides both a list of RDF classes, and a range determining how many
   instances of these classes should be created.³
 – The CONNECT Clause: The CONNECT clause determines the edges that should be
   generated in the RDF graph, by providing a list of triples.
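
      The interplay of the three clauses can be illustrated with a small in-memory simulation. This is only a sketch under stated assumptions, not the actual implementation (which, per the text, issues SPARQL queries and updates against an RDF store): triples are plain Python tuples, the number of created nodes is drawn uniformly from the given range, and the helper names `instances` and `for_each_create_connect`, as well as the `ex:` node names, are hypothetical.

```python
import random
from itertools import count
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]
RDF_TYPE = "rdf:type"
_ids = count()  # hypothetical source of fresh node names


def instances(data: Set[Triple], cls: str) -> List[str]:
    """FOR EACH {cls}: all nodes typed as `cls`, in a stable order."""
    return sorted(s for (s, p, o) in data if p == RDF_TYPE and o == cls)


def for_each_create_connect(data: Set[Triple], for_cls: str, new_cls: str,
                            low: int, high: int, predicate: str,
                            rng: random.Random) -> Set[Triple]:
    """For each instance of `for_cls`, CREATE low..high nodes of `new_cls`
    and CONNECT each new node to that instance via `predicate`."""
    out = set(data)
    for target in instances(data, for_cls):
        # randint is inclusive on both ends; uniform choice is an assumption.
        for _ in range(rng.randint(low, high)):
            node = f"ex:{new_cls.split(':')[-1]}{next(_ids)}"
            out.add((node, RDF_TYPE, new_cls))    # effect of CREATE
            out.add((node, predicate, target))    # effect of CONNECT
    return out


rng = random.Random(0)
data = {("ex:Univ0", RDF_TYPE, "ub:Univ")}
data = for_each_create_connect(data, "ub:Univ", "ub:Dept", 15, 25, "ub:subOrg", rng)
```

      After this call, `data` contains the original university plus between 15 and 25 departments, each with a type triple and a `ub:subOrg` edge to the university.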
      Several examples of dg-commands appear below. Explanations follow.
(c1) CREATE 1-5 {ub:Univ}

(c2) FOR EACH {ub:Univ}
     CREATE 15-25 {ub:Dept}
     CONNECT {ub:Dept ub:subOrg ub:Univ}

(c3) FOR EACH {ub:Faculty, ub:Dept}
     WHERE {ub:Faculty ub:worksFor ub:Dept}
     CREATE 8-14 {ub:Undergrad}
     CONNECT {ub:Undergrad ub:memberOf ub:Dept}

(c4) FOR EACH {ub:Dept}
     FOR 1 {ub:FullProf}
     WHERE {ub:FullProf ub:worksFor ub:Dept}
     CONNECT {ub:FullProf ub:headOf ub:Dept}

(c5) FOR 20%-20% {ub:Undergrad, ub:Dept}
     WHERE {ub:Undergrad ub:memberOf ub:Dept}
     FOR 1 {ub:Prof}
     WHERE {ub:Prof ub:memberOf ub:Dept}
     CONNECT {ub:Undergrad ub:advisor ub:Prof}

(c6) FOR EACH {ub:Undergrad}
     FOR 2-4 WITH LOCAL DISTINCT {ub:UndergradCourse}
     CONNECT {ub:Undergrad ub:takeCourse ub:UndergradCourse}

(c7) FOR EACH {foaf:Person ?p1}
     FOR 15-25 {foaf:Person ?p2} WHERE {FILTER( ?p1 != ?p2 )}
     CONNECT {?p1 foaf:knows ?p2}

 ³ Dg-commands do not directly define how textual (or other atomic) properties are created and
   associated with class instances. This information is provided in a simple auxiliary file
   which, e.g., associates each textual property with a sampling method or dictionary.

    Command c1 creates between 1 and 5 universities, and command c2 adds 15–25
departments as suborganizations for each university. Command c3 iterates over all pairs
of faculty members⁴ and departments, and adds 8-14 students per pair to the department
(thereby achieving the required 1:8-14 ratio of faculty members to undergraduates).
Command c4 chooses one full professor as the head of each department. Command c5
adds an advisor for 20% of all undergraduates. Command c6 assigns 2-4 courses for
each undergraduate. Note the use of WITH LOCAL DISTINCT which ensures that
the set of courses chosen per student does not contain repetition, while allowing differ-
ent students to be assigned the same courses. Finally, c7 demonstrates advanced features
including variables and a filter command, to connect people (in an FOAF RDF dataset)
to one another.
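
    The effect of WITH LOCAL DISTINCT can be mimicked in a few lines of Python. The sketch below is illustrative only: the function names and the `ex:` identifiers are hypothetical, and reading REPEATABLE as sampling with replacement is an assumption, since the text above describes only LOCAL DISTINCT in detail.

```python
import random
from typing import Dict, List, Sequence


def sample_local_distinct(owners: Sequence[str], pool: Sequence[str],
                          low: int, high: int,
                          rng: random.Random) -> Dict[str, List[str]]:
    """Per owner, pick low..high distinct items; different owners may share items."""
    return {o: rng.sample(pool, rng.randint(low, high)) for o in owners}


def sample_repeatable(owners: Sequence[str], pool: Sequence[str],
                      low: int, high: int,
                      rng: random.Random) -> Dict[str, List[str]]:
    """Per owner, pick low..high items with replacement (assumed meaning of REPEATABLE)."""
    return {o: rng.choices(pool, k=rng.randint(low, high)) for o in owners}


rng = random.Random(42)
students = [f"ex:Undergrad{i}" for i in range(50)]
courses = [f"ex:UndergradCourse{i}" for i in range(5)]

# Mirrors the shape of c6: 2-4 courses per student, no repeats within a student.
assignment = sample_local_distinct(students, courses, 2, 4, rng)
```

    With 50 students and only 5 courses, some course is necessarily assigned to several students, which LOCAL DISTINCT permits, while no single student receives the same course twice.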
    In our poster and demo, we will show how to recreate the LUBM benchmark using
24 dg-commands, of the style seen above. In addition, we will show how to create
interesting datasets for the FOAF schema. We will also allow those interested to write
their own dg-commands, which we will evaluate in GRR to create an RDF dataset.


References
1. Bizer, C., Schultz, A.: The Berlin SPARQL benchmark. International Journal of Semantic
   Web Information Systems 5(2), 1–24 (2009)
2. Blum, D., Cohen, S.: Grr: Generating random RDF. Tech. rep., The Hebrew University of
   Jerusalem (2010)
3. The friend of a friend (FOAF) project. http://www.foaf-project.org
4. Guo, Y., Pan, Z., Heflin, J.: LUBM: a benchmark for OWL knowledge base systems. Journal
   of Web Semantics 3(2-3), 158–182 (2005)
5. Jena–a Semantic Web framework for Java. http://jena.sourceforge.net
6. RBench website. http://139.91.183.30:9090/RDF/RBench/index.html
7. Schmidt, M., Hornung, T., Lausen, G., Pinkel, C.: SP2Bench: a SPARQL performance
   benchmark. In: ICDE. pp. 222–233. Shanghai, China (Mar 2009)
8. Simile website. http://simile.mit.edu/


 ⁴ The faculty members were created with an additional dg-command, which was omitted due to
   lack of space.