=Paper=
{{Paper
|id=None
|storemode=property
|title=Growing Triples on Trees: an XML-RDF Hybrid Model for Annotated Documents
|pdfUrl=https://ceur-ws.org/Vol-880/VLDS-p30-Goasdoue.pdf
|volume=Vol-880
|dblpUrl=https://dblp.org/rec/conf/vlds/GoasdoueKKLMZ11
}}
==Growing Triples on Trees: an XML-RDF Hybrid Model for Annotated Documents==
Growing Triples on Trees: an XML-RDF Hybrid Model for Annotated Documents François Goasdoué1 Konstantinos Karanasos1 Yannis Katsis2 Julien Leblay1 Ioana Manolescu1 Stamatis Zampetakis1,3 1 Leo team, INRIA Saclay and LRI, Univ. Paris-Sud 2 INRIA Saclay, ENS Cachan 3 Univ. of Crete firstname.lastname@inria.fr ABSTRACT istrator adds annotations to the articles, classifying them Content on today’s Web is typically document-structured according to their topics selected from a publicly available and richly connected; XML is by now widely adopted to ontology on news articles. By reasoning on the ontology, a represent Web data. Moreover, the vision of a computer- user searching for article topics (e.g., biology), also obtains understandable Web relies on Web (and real world) resources articles on more specific ones (e.g., bioengineering, biofuel described by simple properties having names or values; URIs energy, etc.). The readers can also annotate the documents are the normative method of identifying resources and RDF stating their personal opinion on an article or on specific (the Resource Description Framework) enjoys important trac- parts of it. Moreover, articles can be linked with each other. tion as a way to encode such statements. For instance, when a specific article is also published in a We present XR, a carefully designed hybrid model be- blog, this version can be linked to the newspaper version. tween XML and RDF, for describing RDF-annotated XML Hence, when searching for articles, it becomes feasible to documents. XR follows and combines the W3C’s XML, URI use semantic information that is not attached directly to a and RDF standards by assigning URIs to all XML nodes and given article, but to another linked version of it. enabling these URIs to appear in RDF statements. The XR Many more use cases of semantic annotations can be found. management platform thus provides the capabilities to cre- Among them, and closely related to the Web search, lately, ate and handle interconnected XML and RDF content. We search engines such as Google Search and Microsoft Bing, define the XR data model, its query language, and present let users reward search results they like or use information preliminary results with a prototype implementation. available through social networks (e.g., friends list) to return results more suitable for the users. Such annotations can be naturally expressed in RDF, which 1. INTRODUCTION is becoming the de facto standard for describing semanti- With its widespread acceptance, XML has become the cally rich data. First, by assigning URIs to any part of language of choice for publishing semi-structured data on the XML articles, we can refer to these parts in RDF state- the Web, be it business, scientific or governmental data. ments and, thus, add semantic information on the articles at Efficient and effective search for information within large any granularity, all the while leaving the original documents XML corpora is thus crucial. Optimizing the XML stor- intact. Second, in the spirit of the Linked Data initiative age and querying infrastructure is a well-trodden path to- (http://linkeddata.org), we can create semantic links be- wards increasing efficiency. However, on heterogeneous Web tween documents. Finally, RDF, especially when combined data authored in a distributed, independent fashion, one with RDF Schema (RDFS in short), enables the discovery could greatly improve search effectiveness by querying XML of new implicit knowledge through reasoning. sources jointly with data about the documents. In particu- Previous works have enabled the simultaneous process- lar, if semantic information about the XML documents is ing of XML and RDF data, typically by converting RDF to available, not only additional information can be exploited XML (or vice-versa) and then using a language and plat- during the search process, but also new knowledge can be form dedicated to the target format. Such translation re- discovered through a reasoning process. sults are generally awkward and bury content under syntax. Consider a newspaper publishing its articles in XML. To Fundamentally, the data models are quite different: XML enhance the way readers search for information, the admin- emphasizes structural relationships, whereas RDF consists of triples connected in arbitrarily complex graphs. As a con- sequence, the challenges and opportunities for optimization are very different for the two models. Thus, existing opti- Permission to make digital or hard copies of all or part of this work for mizers and techniques for one model are likely not to apply personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies well on the data converted from the other model. bear this notice and the full citation on the first page. To copy otherwise, to At a more conceptual level, these approaches juxtapose republish, to post on servers or to redistribute to lists, requires prior specific XML and RDF data, and do not consider the main idea of permission and/or a fee. This article was presented at the workshop Very this work, namely, turning XML nodes (or words) into RDF Large Data Search (VLDS) 2011. resources, and connecting these data sources. Copyright 2011. In this work, we propose a unified model allowing the com- (#3, ⟨sameAs⟩, #13), (#8, ⟨sameAs⟩, dbpedia:ACME), (#3, ⟨reportOn⟩, #8), (#8, rdf:type, ⟨Company⟩), bination of XML with RDF data into a single instance. Our (_:B, ⟨workAt⟩, #3), (_:B, ⟨email⟩, “bob@example.com”), main contributions are: (i) a data model for annotated doc- (#8, rdf:type, ⟨Organization⟩), (⟨Company⟩, rdf:subClassOf, ⟨Organization⟩), uments which can express instances where XML and RDF RDF sub-instance are interconnected (described in Section 2), (ii) a query lan- NewsFeed NewsArchive guage for querying the unified data model (Section 3) and #1 (iii) an implemented system with experiments encompass- Article Article Report … #2 #3 #13 ing the previous ideas (Section 4). The full technical report Title Date #4 … describing our work is available at [8]. #11 “Rise and Fall of Story #6 “03-04-2011” 2. THE XR DATA MODEL Phone Carrier”#5 #12 To represent annotated documents, we introduce the XR “Once upon a “... happily ever data model, extending and combining the standard XML time...” #7 after.” #10 structured data model and RDF semantic data model. An NamedEntity #8 instance of the XR data model comprises two sub-instances: An XML sub-instance, consisting of a set of XML trees, “ACME” and an RDF one, consisting of a set of RDF triples. The #9 XML sub-instance connection between the two is achieved by assigning to each Figure 1: XR instance featuring an annotated news article. XML node a unique URI, which can then be referred to from an RDF triple, as we will explain below. (p1 , relatedTo, bioengineering), the implicit triple (p1 , re- The XML sub-instance relies on a set N of possible XML latedTo, biology) is also conceptually part of the data set. element and attribute names, a set U of URIs, and a set L Implicit tuples must be reflected in RDF query answers. Ac- of literals. An XML tree is defined as usual: cordingly, XR queries (presented in Section 3) also consider Definition 2.1 (XML Tree). An XML tree is a fi- that implicit triples are part of the RDF sub-instance. The nite, unranked, ordered, labeled tree T = (N, E) with nodes management of implicit RDF (and thus, XR) triples is an N and edges E, where each node n ∈ N is assigned a label orthogonal issue, on which we will not delve further here. λ(n) ∈ N and a type τ (n) ∈ {attribute, element, text}. An We can now define an XR instance as follows: attribute node must be the child of an element node; it has Definition 2.4 (XR Instance). An XR instance is a a value belonging to L and it does not have any children. A pair (IX , IR ), where IX and IR are an XML and an RDF text node can only appear as a leaf. data instance, respectively, built upon the same set of URIs. A set of XML trees forms an XML instance: Note that the XML and the RDF sub-instances are de- Definition 2.2 (XML Instance). An XML instance fined over the same set of URIs U, thus allowing RDF triples IX is a finite set of XML trees together with a function as- to annotate XML nodes, as shown in the following example. signing to each node of these trees a unique URI from U. Example. Figure 1 shows an XR instance. The XML sub- The URI assignment function is crucial for interconnect- instance (in the bottom of the Figure) consists of two trees. ing the XML and RDF sub-instances. The unique identifiers The first represents a news feed composed of articles. Each assigned to the nodes allow the RDF sub-instance to refer article has a story (consisting of text nodes and a named- to nodes of the XML sub-instance. entity identifying a company name), a title and a date. The In addition to the aforementioned sets U and L, the RDF second tree refers to a news archive. Each XML node has a sub-instance also relies on a set of blank nodes B, whose role subscript corresponding to its URI. The RDF sub-instance is discussed below. comprises RDF triples which refer to (annotate) the XML sub-instance by using the URIs of the XML nodes. For in- Definition 2.3 (RDF Instance). An RDF instance stance, we state that the article with URI=‘#3’ is a report IR is a set of triples of the form (s, p, o), where s ∈ (U ∪ B), on the named-entity with URI=‘#8’, which is of type Orga- p ∈ U, and o ∈ (L ∪ U ∪ B). nization. Moreover, in the context of Open Linked Data, we As customary, the components of a triple (s, p, o) are re- use the sameAs property to show that two resources are the ferred to as its subject, property and object, respectively. same, e.g., the nodes with URIs ‘#3’ and ‘#13’ in the Fig- As defined above, the subject or the object of a triple can ure. RDF triples may also contain blank nodes (e.g., ∶B). be bound to a blank node. Blank nodes are used in RDF Finally, some triples may be implicitly derived, such as the [13] to denote unknown URIs or literals, similarly to labeled one stating that the named-entity is an Organization, which nulls in the database literature [1]. For instance, one can was inferred from the facts that the Company is a subclass use a blank node b1 in the triples (b1 , country, “F rance”) of Organization and the named-entity is a Company. and (b1 , city, “P aris”) to state that the country and city of b1 is France and Paris, resp., without using a concrete URI. 3. THE XRQ QUERY LANGUAGE Another advantage of adopting RDF is that XR inherits Given an XR data instance, users should be able to query its reasoning capabilities. As described in [13], an RDF data the data based on both its structure, described in the XML set contains not only explicit triples but also implicit triples, sub-instance, and its semantic annotations, stored in the derived from the explicit ones through a set of entailment RDF sub-instance. To this end, we design the XRQ query rules, the most important of which are due to knowledge language. In XRQ, the XML sub-instance is queried using encoded in RDF Schemas. In the example discussed in the tree patterns, while the RDF sub-instance is queried through introduction, an RDF Schema specifies that bioengineering triple patterns. Both are defined below. Most importantly, is a subClassOf biology. Assuming the paragraph p1 is re- reusing variables across tree patterns and triple patterns al- lated to bioengineering, modeled by the explicit RDF triple lows to relate nodes and resources of the two sub-instances. a XML documents and their RDF annotations. The following ($B, rdf:type, ⟨Organization⟩),($A, ⟨workAt⟩, $B), ($A, ⟨email⟩, $C) example illustrates the expressivity of XRQ. Example. Figure 2 shows an XRQ query, whose body NewsFeed NewsFeed (at the right-hand side) consists of the previously described ⟨$VT, $C⟩ :- Article tree and triple patterns. Observe that variable $B appears Article in both parts of the body, forming a join between the XML NamedEntity Date and RDF sub-instances. Two variables appear in the head of $B: uri $VD: val Title Date [val=”ACME”] $VT: val $VD: val the query: $V T from tree pattern (c) and $C from the first triple pattern. The answer set for this query will contain b c the titles of articles published on the same date as articles Figure 2: Sample XRQ query. referring to the organization called ACME, and emails of people working in this organization. Definition 3.1 (Tree Pattern). A tree pattern is a Query Semantics For each match of the tree patterns, finite, ordered, unranked, N -labeled tree with two types of triple patterns and the corresponding variables against the edges, namely child and descendant edges. We may attach XR instance (including the XML data, the explicit as well to each node at most one uri variable, one val variable and as the implicit triples, as explained in Section 2), we create one cont variable. We may also attach to a node an equality a copy of the head tuple with the variables replaced by their predicate of the form [val=c] for some c ∈ L. bound values. The query answer is the set of all such tuples. A tree pattern is a variant of tree patterns as presented in For more details see the extended version [8]. the literature [3] with the additional capability of attaching In [8] we also provide an extended version of XRQ, sup- one or more variables of different types to the nodes. Vari- porting more complex construction clauses, such that the ables serve two purposes: (i) to denote data items that are result of an XRQ query is an XR data instance. returned by the query (in the style of distinguished variables in conjunctive queries) and (ii) to express joins between tree 4. EXPERIMENTS or triple patterns. The variable type specifies the exact in- To experiment with XR, we implemented XRP, a proto- formation item from an XML node, to which the variable type platform that supports the storage of XR instances and will be bound. When a node nt of a tree pattern is matched the evaluation of XRQ queries, using Java 1.6. against a node nd of an XML tree, the variables attached to Architecture The data store is built on top of Berke- nt will be bound to the following concepts. A uri variable is leyDB. RDF triples are mapped to a simple three-attribute bound to the URI of nd . If nd is an element, a val variable relation. XML documents are stored also within a tabular is bound to the concatenation of all text descendants of nd ; format, enriched with XML-specific features such as struc- if nd is an attribute, a val variable is bound to the attribute tural XML identifiers and full serialized XML fragments. value. Finally, a cont variable is bound to the serialization For the experiments described here, we manually specified of the subtree rooted at nd . The semantics of val variables the XML data that each relation will store, by means of a follow from the XPath and XQuery specifications [16]. tree pattern. XRP supports easily changing the data or- Example. The lower part of Figure 2 depicts two tree pat- ganization by specifying other tree patterns, thanks to its terns. Pattern (b) finds articles containing a named-entity reliance on the ViP2P (http://vip2p.saclay.inria.fr) access path with a value equal to “ACME”. Variables $B and $V D are selection infrastructure [11]. Furthermore, we materialized bound to the URI of this entity and the publication date an index of all XML node URIs appearing in some RDF of the article, respectively. Observe that $V D appears also triples. This index facilitates the evaluation of the XRQ in the second tree-pattern, thus expressing a join between queries that combine data from the two sub-instances. them. Pattern (c) retrieves the titles of articles whose pub- The query engine supports the typical operators: scan, se- lication date matches that of pattern (b). lection, projection, joins on values and XML structural IDs, Definition 3.2 (Triple Pattern). A triple pattern is etc. To take advantage of indexes, we also implemented the a triple (s, p, o), where s, p are URIs or variables, whereas o sideways information passing BindJoin operator [6], which is a URI, a literal, or a variable. uses attribute values from one join input as keys to access Example. The top part of Figure 2 depicts three triple pat- the other input (an XML/RDF index is used to retrieve for terns, looking for entities of type Organization and finding each XML node URI, all the triples containing this URI). emails of resources known to be working at that organiza- Experimental Setup For the purpose of the experi- tion. Joins between triple patterns are expressed through ments, we created a synthetic XR instance with soccer league shared variables, such as $A and $B in the Figure. information. The XML sub-instance describes teams and By combining tree and triple patterns and endowing them players, while the RDF sub-instance annotates players with with a set of head variables, we obtain an XRQ query: properties. To assess the system’s scalability, we varied the number of player nodes (10K, 50K and 100K), while keeping Definition 3.3 (XRQ Query). An XRQ query con- the size of the XML document constant (95MB) by modulat- sists of a head and a body. The body is a set of tree and/or ing the length of some attributes that each player contains. triple patterns built over the same set of variables, whereas The number of annotations also increased proportionally by the head is a list of variables appearing also in the body. keeping an average of one annotation per player. Note that by using variables in multiple places within the We devised three queries, each joining a tree pattern with query, one can express joins. In general, three types of joins a triple pattern on URIs of player nodes. Q1 retrieves teams are possible: between two tree patterns, between two triple which have players annotated with a given property. Q2 is patterns or between a tree pattern and a triple pattern. The a relaxed version of Q1 returning teams regardless of the latter facilitates queries that cross the boundaries between property used. Finally, Q3 returns properties annotating Q1 HashJoin QR Q2 HashJoin QR Q3 HashJoin QR studies the connection between RDF and XML. This in- Q1 BindJoin QR cludes works on (i) transforming XML data to RDF and vice Q2 BindJoin QR Q3 BindJoin QR 8000 Time (ms) 8000 8000 versa [2], (ii) using the query language of one model to query Time (ms) Time (ms) 6000 6000 6000 4000 4000 4000 the other (e.g., use XQuery to query XML-ized RDF) [5, 10] 2000 2000 2000 0 0 0 and (iii) extending the query language of one model with 10K 50K 100K 10K 50K 100K 10K 50K 100K primitives for the other (e.g., adding XPath constructs to #of Nodes #of Nodes #of Nodes SPARQL)[15]. Finally, XML and RDF are modeled within Figure 3: Query Response (QR) time for Q1 , Q2 , Q3 on data a single rule-based framework in [7]. sets of varying number of players (and triples). As outlined in Section 1, transforming XML to RDF or vice-versa is feasible but brings a significant conversion over- players of a particular team. Observe that Q1 and Q3 are head, while the optimization techniques appropriate for one highly selective; Q1 applies a selection on the RDF instance, are likely not very efficient for the other. The main differ- whereas Q3 applies a selection on the XML instance. ence between our work and the previous ones integrating For each query, we compared execution times both with XML and RDF, though, is our treatment of any XML node and without indexes on the joined attributes. For Q1 and as a resource, and the resulting focus on connecting (in the Q2 , we used an index on the XML node URIs, while for Q3 , RDF/Semantic Web sense) data to and from documents. we used an index on the RDF subjects. Conclusion XRP is a first step towards enabling the The experiments were conducted on a 2.40GHz Intel X3430 modeling and querying of annotated XML documents. Our with 4GB RAM running Mandriva Linux (kernel 2.6.31.14). current and future work focuses on (i) checking the model’s Experimental Results Figure 3 shows the total query expressiveness against large-scale real-life annotated docu- evaluation times for each query over the three instances both ment applications built on previous content management with indexes (BindJoin) and without (HashJoin). systems such as Confolio (www.confolio.org), and (ii) devising The results indicate that XRP scales well with the number an architecture for XR management based on off-the-shelf of nodes in the XML sub-instance (and the number of triples efficient XML and RDF data management platforms. in the RDF sub-instance). For instance, evaluating the least selective query (Q2 ) over the largest data instance (100K) 6. REFERENCES requires less than 6.5 seconds. [1] S. Abiteboul, R. Hull, and V. Vianu. Foundations of As expected, the BindJoin plans (using the index) are gen- Databases. Addison-Wesley, 1995. erally faster (up to a factor of 4 in some cases). The only [2] W. Akhtar, J. Kopecký, T. Krennwallner, and A. Polleres. XSPARQL: Traveling between the XML and RDF worlds - case when BindJoin is slightly worse than HashJoin is for and avoiding the XSLT pilgrimage. In ESWC, 2008. non-selective queries (such as Q2 ). This is due to the over- [3] S. Amer-Yahia, S. Cho, L. V. Lakshmanan, and head incurred from accessing all items of an index. Neither D. Srivastava. Minimization of Tree Pattern Queries. In side of the join operator is more selective than the other and SIGMOD, 2001. thus the join operator has to match all RDF triples with all [4] S. Dill, N. Eiron, D. Gibson, D. Gruhl, R. Guha, tuples from the XML instance. BindJoin will need random A. Jhingran, T. Kanungo, S. Rajagopalan, A. Tomkins, access to the whole index, while HashJoin will simply apply J. A. Tomlin, and J. Y. Zien. SemTag and seeker: bootstrapping the semantic web via automated semantic a sequential scan to the same relation. annotation. In WWW, 2003. Our experiments demonstrate the applicability of many [5] M. Droop, M. Flarer, J. Groppe, S. Groppe, V. Linnemann, query and storage optimization techniques to the problem J. Pinggera, F. Santner, M. Schier, F. Schöpf, H. Staffler, of XRP data management, in a single-platform setting. To and S. Zugal. Bringing the XML and Semantic Web Worlds better exploit existing systems, we are also investigating the Closer: Transforming XML into RDF and Embedding deployment of an XR platform as an integrator, comple- XPath into SPARQL. In ICEIS. 2009. menting separate XML, respectively, RDF databases, while [6] D. Florescu, A. Levy, I. Manolescu, and D. Suciu. Query optimization in the presence of binding patterns. In preserving the single-model abstraction. SIGMOD, 1999. More experiments and a detailed description of the data [7] T. Furche, F. Bry, and O. Bolzer. Marriages of sets can be found in our extended report [8]. Convenience: Triples and Graphs, RDF and XML in Web Querying. In Principles and Practice of Semantic Web 5. RELATED WORK & CONCLUSION Reasoning. Springer Berlin / Heidelberg, 2005. We briefly discuss the main classes of related works. [8] Extended version of this work, available at Tools for annotating web pages Several works propose http://xr.saclay.inria.fr/papers/xr-vlds-extended.pdf, 2011. frameworks that let users semantically annotate web-pages [9] S. Handschuh and S. Staab. Authoring and annotation of either manually [9] or (semi-)automatically [4] (see [12] for web pages in CREAM. In WWW, 2002. a survey of such systems). However, they focus solely on [10] K. Karanasos and S. Zoupanos. Viewing a world of storing and querying annotations and they do not consider annotations through AnnoVIP (demo). In ICDE, 2010. the problem of simultaneously querying structured data and [11] I. Manolescu, K. Karanasos, V. Vassalos, and S. Zoupanos. Efficient XQuery rewriting using multiple views. In ICDE, the annotations on them. 2011. Embedding RDF annotations in XML documents [12] L. Reeve and H. Han. Survey of semantic annotation Other works look at the particular problem of embedding platforms. In ACM SAC, 2005. RDF annotations in XHTML [14]. However, this requires [13] RDF. http://www.w3.org/standards/techs/rdf. modifying the original XML document, which is not always [14] RDFa. http://www.w3.org/TR/xhtml-rdfa-primer/, 2004. possible, due to privacy reasons or access constraints. In [15] SPAT - SPARQL Annotations. contrast, in XRP the RDF annotations can be kept sepa- http://www.w3.org/2007/01/SPAT/, 2007. rately from the XML documents in their own physical store. [16] XQuery 1.0 and XPath 2.0 Data Model (XDM). From RDF to XML and back Another line of work http://www.w3.org/TR/xpath-datamodel/, 2010.