Tracing the Provenance of Object-Oriented Computations on RDF Data

Matthias Quasthoff, Christoph Meinel
Hasso Plattner Institute, University of Potsdam
{matthias.quasthoff, christoph.meinel}@hpi.uni-potsdam.de

Abstract. This paper presents a new method for tracing the provenance of RDF data within object-oriented computations. The proposed approach relieves software developers of the burden of manually keeping track of data provenance. Using object triple mapping, the source data items used for generating result data can be identified efficiently. The integration into an existing object triple mapping framework shows the feasibility of the approach. The work presented will help in developing compliant RDF-enabled applications using object-oriented programming and may support the efforts to bring the Web of data into the mainstream.

1 Introduction

The meteoric rise of interoperable websites backed by massive amounts of user-generated data over the last decade clearly demonstrates people's interest in using integrated services over the World Wide Web. With the advent of user-generated content and so-called Social Web application programming interfaces, it also became clear that some degree of decoupling between data on the Web and the services processing these data is desirable for a variety of reasons. For instance, users should not be forced to publish their data to some kind of "black box" Web service. Also, a shift from service control to user control over how user-generated data can be reused is desirable. The Linking Open Data initiative has shown that Semantic Web technologies such as the Resource Description Framework (RDF) and the SPARQL query language are the tools of choice for designing such interoperable Web applications. However, it is widely known that developing RDF-enabled applications usually requires some non-standard software engineering knowledge.
This is still regarded as a major obstacle to getting Semantic Web technologies picked up by non-academic software engineers and used in a wider range of commercial products. Besides the complexity introduced by using RDF technologies alone, using so-called linked data [1] from various data sources on the Web requires the consuming software to handle and process additional metadata about this data, mainly to

– check data licenses and privacy policies attached to the data,
– assess provenance information about the data consumed and publish provenance information along with the data created,
– assess trust and quality of the data consumed, e.g., based on provenance information.

Such metadata processing is not necessary in traditional software projects building upon rather centralized data storage technologies. Hence software developers consider it something new and an extra effort. We argue that, besides making the development of RDF-enabled applications simpler, higher-level operations such as this metadata handling must also be simplified for software developers. In this paper, we use Object Triple Mapping (OTM) [2–4], which has been shown to simplify the development of Semantic Web applications [5], and show how this approach also contributes to handling metadata in Semantic Web applications.

We show that OTM is suitable for processing and generating the metadata required for distributed Web applications in object-oriented programs. OTM [4] lets developers focus on object-oriented concepts representing actual business logic and hides the most frequent RDF operations from the source code. In Section 3 we extend existing basic formalizations of OTM to be able to attach metadata to object-oriented data items. Afterwards, in Section 4, we show how to pass such metadata through the object-oriented program's control flow.
Section 4 also contains a detailed example of how provenance information is collected during a specific object-oriented computation. The paper is concluded in Section 5.

2 Related work

This paper builds upon work from different fields of research on the Semantic Web. The Linking Open Data movement is driven by the observation that Semantic Web standards and technologies have matured enough to allow building real-world applications upon them, and that the publication of openly usable, real-world data on the Web might foster the development of such applications [6]. In spite of the success of the initiative, which has resulted in a myriad of widely usable datasets on the Web, we argue that the development of Semantic Web applications itself still needs to get simpler.

Lassila investigated the design of Semantic Web software, especially with regard to how to organize rather generic operations on RDF data, such as reasoning, and other operations more related to specific business logic [7]. Oren et al. address a related problem: in their work on ActiveRDF they showed how to access RDF data in an object-oriented manner, hiding most basic access patterns to RDF from software developers [3]. In their work, they use the PathLog language [8] to explain the handling of object-oriented concepts. The idea of Object Triple Mapping (OTM) is partly inspired by similar work on accessing relational databases [9]. Other, more implementation-centric approaches to OTM include So(m)mer [2] and Elmo for the Java programming language, and Surf RDF for Python. An experimental evaluation showed that developing simple Semantic Web applications using OTM is much easier for the Semantic Web beginner and leads to better source code [5]. The actual semantics of OTM and possible desirable extensions of the mapping are, however, not yet fully understood.
This is still an open issue, especially identifying the most common patterns implemented in different types of Semantic Web software and separating the core concepts of OTM from optional, use-case-dependent extensions.

In this paper, we investigate how to deal with provenance information of RDF data in object-oriented programs. Provenance information helps consumers to assess data with regard to various issues: Is the data source trustworthy? Has the data been generated using high-quality algorithms? But also, are we allowed to use the data for our purposes? The latter question can, e.g., refer to data licenses or privacy policies. In this paper, we show how Hartig and Zhao's work on Web data provenance [10] can be integrated into OTM. We will mainly focus on the question of which data items have been used to construct other data items, i.e., on tracking the input RDF data of an object-oriented computation and relating this input to the output RDF data of the computation.

3 Object-oriented access to Web data formalized

To formally describe how to attach provenance information to the results of object-oriented computations on RDF data, a formal representation of the RDF and OO data models is required, as well as a formal mapping between the two data models. In this section, the required formal concepts are presented. First, the RDF data model has an established formal notation building upon the following concepts [11].

Definition 1 (RDF data model) Let U be the set of URI references, B an infinite set of blank nodes, and L the set of literals.

– V := U ∪ B ∪ L is the set of RDF nodes,
– R := (U ∪ B) × U × V is the set of all triples or statements, that is, arcs connecting two nodes and labelled with a URI,
– any G ⊆ R is an RDF graph.

For their work on OTM, Oren et al. use PathLog [8] to describe object-oriented access to data [3]. PathLog itself differentiates between scalar and set-valued class members.
For the scope of this paper, however, dealing with set-valued class members will be sufficient. Also, we will use a simplified version of the semantic structure explained in [8].

Definition 2 (OO data model) Let N be a set of names. The set of PathLog references R_N is defined inductively as follows.

– n ∈ N is a reference, also called a simple reference.
– For references t_0, t_1 and a simple reference s,
  • t_0.s is a reference, called a path,
  • t_0 : s and t_0[s → t_1] are references, called molecules.

Furthermore, a semantic structure is a triple (N, O, I) such that

– O is a set of objects and
– the interpretation I : R_N → 2^O relates references to objects.

[Fig. 1. Representing social information in RDF (left) and OOP (right).]

In the following, we will show how RDF and PathLog can be used to express information in the respective data model.

Example 1 (comparison of RDF and OO data model) Let p_1, p_2, p_3 ∈ U be URIs denoting three people, n ∈ U be the URI foaf:name and k ∈ U be the URI foaf:knows¹. An RDF graph describing p_1 and p_2 might look as follows (Fig. 1, left).

  G := { ⟨p_1, n, "John Doe"⟩, ⟨p_1, k, p_2⟩, ⟨p_1, k, p_3⟩, ⟨p_2, n, "Jane Doe"⟩ }

Let furthermore p′_1, p′_2, p′_3 ∈ N be object names denoting three people and n′, k′ ∈ N fields labelled name and knownPeople. The OO representation of G (Fig. 1, right) requires a semantic structure (N, O, I) such that

  I(p′_1.n′) = {"John Doe"}
  I(p′_2.n′) = {"Jane Doe"}
  I(p′_1.k′) = I(p′_2) ∪ I(p′_3)

The mapping of RDF data to OOP concepts as shown in Example 1, and vice versa, has been formalized [5]. This formalization can be adapted to the concepts of PathLog as follows.
Definition 3 (Object triple mapping, OTM) An object triple mapping for an RDF graph G ⊆ (U ∪ B) × U × V is a tuple (N, O, I, m_t, m_a), such that

– (N, O, I) is a semantic structure,
– the vocabulary map m_t : F → U maps field names F ⊆ N to properties,
– the instance map m_a : O → U maps objects to resources,
– and the following holds for all o ∈ O, n ∈ I^{-1}({o}), f ∈ F:
  • the mapping is complete:
      ∀u ∈ U : ⟨m_a(o), m_t(f), u⟩ ∈ G → ∃o′ ∈ I(n.f) : m_a(o′) = u
      ∀o′ ∈ O : o′ ∈ I(n.f) → ⟨m_a(o), m_t(f), m_a(o′)⟩ ∈ G
  • the mapping is injective, i.e., the restriction of m_a to I(n.f) is injective.

[Fig. 2. Mapping OO concepts (right) on RDF (left).]

The idea of OTM is illustrated in Fig. 2 using the names from Example 1 and Definition 3. Besides such a purely formal description of Object Triple Mapping, concrete implementations need to consider additional aspects such as object equivalence. Depending on the context, deciding on semantic equivalence can be hard. For the scope of this paper on metadata processing, considering syntactical equivalence based solely on the URI returned by m_a is sufficient. Also, concrete implementations need to map the semantics of RDF (such as lists or reification), of RDF Schema (such as class hierarchies), and of OWL (such as constraints on classes and properties) to the semantics of their supported programming language. The general feasibility of such features is discussed in [12].

4 Tracing the provenance of object-oriented computations

4.1 Attaching metadata to RDF and OO data

RDF metadata will be attached to data items [10] like RDF graphs or triples. This applies to different kinds of metadata: data licenses and policies specify whether a specific data item can be used in specific contexts or for specific purposes. Similarly, provenance metadata describe how a data item has been derived, and what other data items have been employed for such derivation.
It is important to see that this kind of information must be attached to data items like RDF graphs or triples, not only to RDF resources. A similar concept is required to express metadata about object-oriented concepts. In this section, we introduce the concept of object-oriented data items. The OO data items corresponding to RDF triples will be values together with the variables they are assigned to, or the values returned by a method together with the method called. Both cases can be represented by the PathLog reference used to obtain a certain object.

Definition 4 (OO data item) Let (N, O, I, m_t, m_a) be an object triple mapping for some RDF graph G ⊆ (U ∪ B) × U × V, and r ∈ R_N a reference. Whenever we access an o ∈ I(r), we say o has been obtained using the OO data item (r, I) ∈ R_N × O^{R_N}.

¹ The foaf prefix denotes the URI namespace http://xmlns.com/foaf/0.1/

References r = n.f for names n, f ∈ N, with n referring to a single object I(n) = {o}, are essential in object-oriented source code. Due to the nature of OTM, an object o′ obtained using the data item (n.f, I) has actually been obtained using the RDF triple

  ⟨m_a(o), m_t(f), m_a(o′)⟩

More complex references might be translated to other types of RDF data items, e.g., SPARQL query results. OO data items can now be used to process or update metadata attached to them. Such metadata can be obligations or restrictions of use due to privacy or license regulations, but also trust and quality assessments. Due to the mapping defined between the object-oriented and RDF data models, not only can objects be mapped to resources, but OO data items can also be mapped to their corresponding RDF data items.

4.2 Aggregation of object-oriented provenance information

A significant part of tracing and generating provenance information is to relate the data items serving as input for a computation to its output data items. Such a computation, or data creation [10], could, e.g., be performed by a method call in an OO program.
However, further valuable information on data provenance can be aggregated from the object-oriented program flow within and around such methods.

Aggregation on data flow. If an object o has been obtained from a data item (r, I), a number of names appearing in r potentially identify objects o′_1, ..., o′_ℓ (a simple case is r = n.f, such that I(n) = {o′_1}). Naturally, all data items used to obtain o′_1, ..., o′_ℓ have also been used to eventually get hold of o.

Aggregation on object equality. A very common and important operation on objects o_1, o_2 is checking for equality. This can be triggered explicitly in source code by the software developer, but is also performed by higher-level operations provided by programming libraries, such as constructing the intersection of two sets of objects. If some action is taken because o_1 and o_2 have been found to be equal, all data items used to obtain o_1 and o_2, and the ones used to detect equality, have in effect been used for that action. Hence, for subsequent uses of o_1 and o_2, this aggregation of source data items should be considered. In implementations, this can be realized by adding side effects to equality checking.

4.3 Example: tracing provenance information of social network data

In this section, we present a comprehensive example of how to benefit from object-oriented provenance information on RDF data. The example illustrates that OO provenance information can be gathered during program execution without placing provenance-related artifacts in the OO source code. Given acquaintance information, we want to suggest new contacts for a person if the new contact is a mutual friend of at least two of the person's friends. The pseudo code for this computation is shown in Fig. 3. The relationships are described using the foaf:knows property k, and we introduce an additional RDF property s for suggested contacts. Consider the following RDF graph G ⊆ (U ∪ B) × U × V and person resources p_1, ..., p_4 ∈ U.
If G contains

  ⟨p_1, k, p_2⟩, ⟨p_1, k, p_3⟩, ⟨p_2, k, p_4⟩, ⟨p_3, k, p_4⟩

then we want the suggestion ⟨p_1, s, p_4⟩ to be created. Additionally, we want to be able to see which facts have been used to derive this statement, e.g., for later trust or quality assessments based on this provenance information.

To see what data items are used for the computation, let (N, O, I, m_t, m_a) be an object triple mapping and o_1, ..., o_4, o′_4 ∈ O be objects representing p_1, ..., p_4, where o_4 and o′_4 both represent p_4. Going through the computation of find_friends_for(o_1) as described in Fig. 3, the following OO provenance information is aggregated.

– In line 1, o_1 is passed via person, i.e., I(person) = {o_1}.
– In line 2, the OTM interpretation returns I(person.friends) = {o_2, o_3}. The objects o_2, o_3 assigned to friend1 will have the provenance information (person.friends, I).
– In line 3, the loop-local interpretations return I_1(friend1.friends) = {o_4} and I_2(friend1.friends) = {o′_4}. The provenance information of o_4 will be (friend1.friends, I_1) and for o′_4 it will be (friend1.friends, I_2).
– In line 4, o_4 is not found equal to a friend of o_1.
– In line 6 in the first iteration, o_4 is not a friendship candidate, but is added to the candidates in line 9.
– In line 6 in the second iteration, o′_4 is found to be equal to the friendship candidate o_4, hence the provenance of o_4 and of o′_4 are both aggregated to (friend1.friends, I_1), (friend1.friends, I_2).
– In line 7 in the second iteration, o′_4 is suggested as a friend, including this combined provenance information from line 6.

When the friend suggestion o′_4 to o_1 is mapped to the RDF statement ⟨p_1, s, p_4⟩, this statement will eventually have been created using all OO data items and the corresponding RDF statements listed in Table 1.
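The walk-through above can be replayed with a minimal Python sketch. It is not the authors' implementation: the class Traced, its method same_as, and the property name ex:suggested are hypothetical names introduced only for illustration, and provenance is recorded directly at the level of RDF triples rather than as (reference, interpretation) pairs. The friends accessor implements aggregation on data flow, and same_as implements aggregation on object equality as a side effect of a successful comparison.

```python
from dataclasses import dataclass, field

# Example graph from Section 4.3 as (subject, predicate, object) tuples.
# "ex:suggested" is a hypothetical name for the property s.
K, S = "foaf:knows", "ex:suggested"
G = {("p1", K, "p2"), ("p1", K, "p3"), ("p2", K, "p4"), ("p3", K, "p4")}


@dataclass
class Traced:
    """Hypothetical proxy: a resource URI plus the triples used to obtain it."""
    uri: str
    sources: set = field(default_factory=set)

    @property
    def friends(self):
        # Aggregation on data flow: everything that led to this object
        # has also led to each object reached through it.
        return [Traced(o, self.sources | {(s, p, o)})
                for (s, p, o) in sorted(G) if s == self.uri and p == K]

    def same_as(self, other):
        # Aggregation on object equality: a successful check merges the
        # provenance of both objects as a side effect.
        if self.uri != other.uri:
            return False
        merged = self.sources | other.sources
        self.sources, other.sources = merged, set(merged)
        return True


def find_friends_for(person):
    """Same control flow as the pseudo code in Fig. 3."""
    candidates, suggestions = [], []
    for friend1 in person.friends:
        for friend2 in friend1.friends:
            if any(friend2.same_as(f) for f in person.friends):
                continue  # already a friend, nothing to do
            if any(friend2.same_as(c) for c in candidates):
                suggestions.append(friend2)
            else:
                candidates.append(friend2)
    return suggestions


suggestions = find_friends_for(Traced("p1"))
for suggestion in suggestions:
    print(("p1", S, suggestion.uri), "derived from", sorted(suggestion.sources))
```

Under these assumptions, the single suggestion for p1 is p4, and its aggregated sources are exactly the four foaf:knows triples of G, which matches the RDF statements collected in Table 1. Note that find_friends_for itself contains no provenance-related code.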
With the help of trust-related metadata potentially attached to these source triples, the result statement ⟨p_1, s, p_4⟩ can now be dealt with appropriately in further steps of the computation.

  1  find_friends_for(person):
  2    foreach friend1 in person.friends
  3      foreach friend2 in friend1.friends
  4        if person.friends contains friend2
  5          nothing to do, continue
  6        else if candidates contains friend2
  7          add friend2 to suggestions
  8        else
  9          add friend2 to candidates

Fig. 3. Object-oriented search for new contacts in a social network

  OO data item              RDF statement      Mapping from OO to RDF data
  (person.friends, I)       ⟨p_1, k, p_2⟩      m_a(I(person)) = {p_1}
                            ⟨p_1, k, p_3⟩      m_t(friends) = k
                                               m_a(I(person.friends)) = {p_2, p_3}
  (friend1.friends, I_1)    ⟨p_2, k, p_4⟩      m_a(I_1(friend1)) = {p_2}
                                               m_t(friends) = k
                                               m_a(I_1(friend1.friends)) = {p_4}
  (friend1.friends, I_2)    ⟨p_3, k, p_4⟩      m_a(I_2(friend1)) = {p_3}
                                               m_t(friends) = k
                                               m_a(I_2(friend1.friends)) = {p_4}

Table 1. Data items leading to the suggestion ⟨p_1, s, p_4⟩ in Fig. 3

4.4 Implementation

The proposed approach for tracing the provenance of object-oriented computations has been integrated into our Java object triple mapping implementation OTMj [5]. In OTMj, an RDF resource u is mapped to a dynamic proxy object o implementing the interfaces corresponding to the known RDF types of u. This allowed for a straightforward integration of the proposed OO provenance model: if o is accessed using a data item (r, I), OTMj returns an object o′ acting as a proxy for o. In addition to o, the proxy o′ has the data item (r, I) attached. OTMj can be obtained as open source software.²

Our implementation uses reification to link source and result triples using the concept of data creations as defined by Hartig and Zhao [10]. Given a triple ⟨s_1, p_1, o_1⟩ having been created using another triple ⟨s_2, p_2, o_2⟩, our implementation creates metadata as illustrated in Fig. 4³.
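The reification pattern referred to as Fig. 4 can be sketched in a few lines of Python. This is an illustrative sketch and not OTMj code: the function describe_creation and the node labels :t1, :t2, ... and :c1 are hypothetical, triples are plain tuples of prefixed names, and the prv:performedBy and prv:performedAt properties are omitted because their values are elided in the figure. Only rdf:Statement, rdf:subject, rdf:predicate, rdf:object, prv:createdBy, prv:DataCreation and prv:usedData are taken from the paper.

```python
from itertools import count


def describe_creation(result, sources):
    """Return reification triples linking `result` to the triples it was created from.

    Hypothetical helper; node labels (:t1, :c1, ...) are chosen for illustration.
    """
    triples = []
    statement_ids = count(1)

    def reify(triple):
        # Describe one triple as an rdf:Statement node.
        node = f":t{next(statement_ids)}"
        s, p, o = triple
        triples.extend([
            (node, "rdf:type", "rdf:Statement"),
            (node, "rdf:subject", s),
            (node, "rdf:predicate", p),
            (node, "rdf:object", o),
        ])
        return node

    result_node = reify(result)
    creation = ":c1"
    triples.append((result_node, "prv:createdBy", creation))
    triples.append((creation, "rdf:type", "prv:DataCreation"))
    for source in sorted(sources):
        # Each source triple is reified and attached to the data creation.
        triples.append((creation, "prv:usedData", reify(source)))
    return triples


# "ex:suggested" is a hypothetical name for the property s of Section 4.3.
metadata = describe_creation(
    ("p1", "ex:suggested", "p4"),
    {("p2", "foaf:knows", "p4"), ("p3", "foaf:knows", "p4")},
)
for triple in metadata:
    print(" ".join(triple), ".")
```

Keeping the data creation as a separate node, rather than linking result and source statements directly, leaves room for further properties of the computation (who performed it and when) to be attached to that node.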
  :t1 a rdf:Statement ;
      rdf:subject s1 ;
      rdf:predicate p1 ;
      rdf:object o1 ;
      prv:createdBy :c1 .
  :t2 a rdf:Statement ;
      rdf:subject s2 ;
      rdf:predicate p2 ;
      rdf:object o2 .
  :c1 a prv:DataCreation ;
      prv:usedData :t2 ;
      prv:performedBy ... ;
      prv:performedAt ... .

Fig. 4. N3 representation of metadata generated by OTMj during data creation.

² http://projects.quasthoffs.de/otm-j
³ The prv prefix denotes the URI namespace http://trdf.sourceforge.net/provenance/ns#

5 Conclusion

In this paper we showed how metadata about RDF data can be processed and generated within object-oriented computations on these data using Object Triple Mapping (OTM). We extended existing formalisms of OTM and introduced the notion of object-oriented data items. Using these concepts we showed how to trace the provenance of RDF data items on their way through object-oriented computations and presented a detailed example and a ready-to-use implementation of the approach. Our next steps in research include a tighter integration with potential sources of provenance information such as SQUIN [13] or context-based reasoners [14], and with programming frameworks oriented towards generating complete applications instead of focusing on the data backend only, such as the Grails RDFa plugin⁴.

This paper shows that the development of Semantic Web applications can not only be simplified by focusing on established and widely understood abstractions like object-oriented programming, but that functionality essential for a working Web of data can also be integrated into these abstraction layers without adding further burden to software developers. Without these abstractions, metadata handling would have to be implemented per use case, a task that software engineers might skip due to time constraints or defer indefinitely.
We are looking forward to learning what will eventually make non-expert software developers use Semantic Web technologies and how our work will contribute to making this happen.

⁴ http://grails.org/plugin/rdfa

References

1. Berners-Lee, T.: Linked data. http://www.w3.org/DesignIssues/LinkedData.html (2006)
2. Story, H.: Java annotations and the semantic web. http://blogs.sun.com/bblfish/entry/java_annotations_the_semantic_web (2005)
3. Oren, E., Heitmann, B., Decker, S.: ActiveRDF: Embedding semantic web data into object-oriented languages. J. Web Sem. 6(3) (2008) 191–202
4. Quasthoff, M., Meinel, C.: Design patterns for object triple mapping. In: Proc. of IEEE SCC 2009. (2009)
5. Quasthoff, M., Sack, H., Meinel, C.: How to simplify building semantic web applications. In: Proc. of the 5th International Workshop on Semantic Web Enabled Software Engineering, CEUR-WS.org (2009)
6. Auer, S., Bizer, C., Kobilarov, G., Lehmann, J., Cyganiak, R., Ives, Z.: DBpedia: A nucleus for a web of open data. In: Proceedings of 6th International Semantic Web Conference, 2nd Asian Semantic Web Conference (ISWC+ASWC 2007). Springer (2008) 722–735
7. Lassila, O.: Programming semantic web applications: A synthesis of knowledge representation and semi-structured data. ISBN 978-951-22-8984-4 (2007)
8. Frohn, J., Lausen, G., Uphoff, H.: Access to objects by path expressions and rules. In Bocca, J.B., Jarke, M., Zaniolo, C., eds.: VLDB, Morgan Kaufmann (1994) 273–284
9. Fowler, M., Rice, D.: Patterns of Enterprise Application Architecture. Addison-Wesley (2003)
10. Hartig, O., Zhao, J.: Publishing and Consuming Provenance Metadata on the Web of Linked Data. In: Proceedings of the 3rd International Provenance and Annotation Workshop (IPAW). (2010)
11. Manola, F., Miller, E.: RDF primer. W3C recommendation 10 February 2004. http://www.w3.org/TR/2004/REC-rdf-primer-20040210/ (2004)
12. Quasthoff, M., Sack, H., Meinel, C.: Can software developers use linked data vocabulary? In: Proc. of I-Semantics '09. (2009)
13. Hartig, O., Bizer, C., Freytag, J.C.: Executing SPARQL queries over the web of linked data. In: Proc. of the 8th International Semantic Web Conference, Springer (2009)
14. Delbru, R., Polleres, A., Tummarello, G., Decker, S.: Context dependent reasoning for semantic documents in Sindice. In: Proceedings of the 4th International Workshop on Scalable Semantic Web Knowledge Base Systems (SSWS 2008), 7th International Semantic Web Conference, Karlsruhe, Germany (10 2008)