Benchmarking the Performance of Linked Data Translation Systems∗

Carlos R. Rivero (University of Sevilla, Spain, carlosrivero@us.es), Andreas Schultz (Freie Universität Berlin, Germany, a.schultz@fu-berlin.de), Christian Bizer (Freie Universität Berlin, Germany, christian.bizer@fu-berlin.de), David Ruiz (University of Sevilla, Spain, druiz@us.es)

∗Work partially done whilst visiting Freie Universität Berlin.

ABSTRACT

Linked Data sources on the Web use a wide range of different vocabularies to represent data describing the same type of entity. For some types of entities, like people or bibliographic records, common vocabularies have emerged that are used by multiple data sources. But even for representing data of these common types, different user communities use different competing common vocabularies. Linked Data applications that want to understand as much data from the Web as possible thus need to overcome vocabulary heterogeneity and translate the original data into a single target vocabulary. To support application developers with this integration task, several Linked Data translation systems have been developed. These systems provide languages to express declarative mappings that are used to translate heterogeneous Web data into a single target vocabulary. In this paper, we present a benchmark for comparing the expressivity as well as the runtime performance of data translation systems. Based on a set of examples from the LOD Cloud, we developed a catalog of fifteen data translation patterns and survey how often these patterns occur in the example set. Based on these statistics, we designed LODIB (the Linked Open Data Integration Benchmark), which aims to reflect the real-world heterogeneities that exist on the Web of Data. We apply the benchmark to test the performance of two data translation systems, Mosto and LDIF, and compare the performance of the systems with the SPARQL 1.1 CONSTRUCT query performance of the Jena TDB RDF store.

Categories and Subject Descriptors

D.2.12 [Interoperability]: Data mapping; H.2.5 [Heterogeneous Databases]: Data translation

1. INTRODUCTION

The Web of Linked Data is growing rapidly and covers a wide range of different domains, such as media, life sciences, publications, governments, or geographic data [4, 13]. Linked Data sources use vocabularies to publish their data, which consist of more or less complex data models that are represented using RDFS or OWL [13]. Some data sources try to reuse as much from existing vocabularies as possible in order to ease the integration of data from multiple sources [4]. Other data sources use completely proprietary vocabularies to represent their content or use a mixture of common and proprietary terms [7].

Due to these facts, there exists heterogeneity amongst vocabularies in the context of Linked Data. According to [5], on the one hand, 104 out of the 295 data sources in the LOD Cloud only use proprietary vocabularies. On the other hand, the rest of the sources (191) use common vocabularies to represent some of their content, but also often extend and mix common vocabularies with proprietary terms to represent other parts of their content. Some examples of the use of common vocabularies are the following: regarding publications, 31.19% of the data sources use the Dublin Core vocabulary, 4.75% use the Bibliographic Ontology, and 2.03% use the Functional Requirements for Bibliographic Records; in the context of people information, 27.46% of the data sources use the Friend of a Friend vocabulary, 3.39% use the vCard ontology, and 3.39% use the Semantically-Interlinked Online Communities ontology; finally, regarding geographic data sets, 8.47% of the data sources use the Geo Positioning vocabulary, and 2.03% use the GeoNames ontology.

To solve these heterogeneity problems, mappings are used to perform data translation, i.e., exchanging data from the source data set to the target data set [19, 21]. Data translation, a.k.a.
data exchange, is a major research topic in the database community, and it has been studied for relational, nested relational, and XML data models [3, 10, 11]. Current approaches to perform data translation rely on two types of mappings that are specified at different levels, namely: correspondences (modelling level) and executable mappings (implementation level). Correspondences are represented as declarative mappings that are then combined into executable mappings, which consist of queries that are executed over a source and translate the data into a target [7, 18, 19].

In the context of executable mappings, there exists a number of approaches to define and also automatically generate them. Qin et al. [18] devised a semi-automatic approach to generate executable mappings that relies on data mining; Euzenat et al. [9] and Polleres et al. [17] presented preliminary ideas on the use of executable mappings in SPARQL to perform data translation; Parreiras et al. [16] presented a Model-Driven Engineering approach that automatically transforms handcrafted mappings in MBOTL (a mapping language by means of which users can express executable mappings) into executable mappings in SPARQL or Java; Bizer and Schultz [7] proposed a SPARQL-like mapping language called R2R, which is designed to publish expressive, named executable mappings on the Web, and to flexibly combine partial executable mappings to perform data translation. Finally, Rivero et al. [19] devised an approach called Mosto to automatically generate executable mappings in SPARQL based on constraints of the source and target data models, and also correspondences between these data models. In addition, translating amongst vocabularies by means of mappings is one of the main research challenges in the context of Linked Data, and it is expected that research efforts on mapping approaches will increase in the next years [4]. As a conclusion, a benchmark to test data translation systems in this context seems highly relevant.

Copyright is held by the author/owner(s).
LDOW2012, April 16, 2012, Lyon, France.

Table 1: Prefixes of the sample patterns

Prefix  URI
rdfs:   http://www.w3.org/2000/01/rdf-schema#
xsd:    http://www.w3.org/2001/XMLSchema#
fb:     http://rdf.freebase.com/ns/
dbp:    http://dbpedia.org/ontology/
lgdo:   http://linkedgeodata.org/ontology/
gw:     http://govwild.org/ontology/
po:     http://purl.org/ontology/po/
lgdp:   http://linkedgeodata.org/property/
movie:  http://data.linkedmdb.org/resource/movie/
db:     http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugbank/
skos:   http://www.w3.org/2004/02/skos/core#
foaf:   http://xmlns.com/foaf/spec/
grs:    http://www.georss.org/georss/

LODIB is designed to measure the following: 1) Expressivity: the number of mapping patterns that can be expressed in a specific data translation system; 2) Time performance: the time needed to perform the data translation, i.e., loading the source file, executing the mappings, and serializing the result into a target file. In this context, LODIB provides a validation tool that examines if the source data is represented correctly in the target data set: we perform the data translation task in a particular scenario using LODIB, and the target data that we obtain are the expected target data when performing data translation using a particular system.
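To make the notion of an executable mapping concrete, the following is a minimal sketch of such a query in SPARQL; the src: and tgt: namespaces are hypothetical placeholders, and the query reflects the general query-based translation model rather than the concrete syntax of any of the systems above:

```sparql
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX src:  <http://example.org/source#>   # hypothetical source vocabulary
PREFIX tgt:  <http://example.org/target#>   # hypothetical target vocabulary

# Reclassify every src:Person as tgt:Person and rename src:name to rdfs:label
CONSTRUCT {
  ?x a tgt:Person ;
     rdfs:label ?name .
}
WHERE {
  ?x a src:Person ;
     src:name ?name .
}
```

Systems such as R2R and Mosto differ in how such queries are written or generated, not in this underlying translation model.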
To the best of our knowledge, there exist two benchmarks to test data translation systems: STBenchmark and DTSBench. STBenchmark [1] provides eleven patterns that occur frequently when integrating nested relational models, which makes it difficult for at least some of the patterns to extrapolate to our context, due to a number of inherent differences between nested relational models and the graph-based RDF data model that is used in the context of Linked Data [14]. DTSBench [21] allows testing data translation systems in the context of Linked Data using synthetic data translation tasks only, without taking real-world data from Linked Data sources into account.

In this paper, we present a benchmark to test data translation systems in the context of Linked Data. Our benchmark provides a catalogue of fifteen data translation patterns, each of which is a common data translation problem in the context of Linked Data. To motivate that these patterns are common in practice, we have analyzed 84 random examples of data translation in the Linked Open Data Cloud. After this analysis, we have studied the distribution of the patterns in these examples, and have designed LODIB, the Linked Open Data Integration Benchmark, to reflect this real-world heterogeneity that exists on the Web of Data.

The benchmark provides a data generator that produces three different synthetic data sets, which reflect the pattern distribution. These source data sets need to be translated into a single target vocabulary by the system under test. This generator allows us to scale the source data, and it also automatically generates the expected target data, i.e., the data expected after performing data translation over the source data. The data sets reflect the same e-commerce scenario that we already used for the BSBM benchmark [6].

This paper is organized as follows: Section 2 presents the mapping patterns of our benchmark; in Section 3, we describe the 84 data translation examples from the LOD Cloud that we have analyzed, and the counting of the occurrences of mapping patterns in the examples; Section 4 deals with the design of our benchmark; Section 5 describes the evaluation of our benchmark with two data translation systems (Mosto and LDIF), and compares their performance with the SPARQL 1.1 performance of the Jena TDB RDF store; Section 6 describes the related work on benchmarking in the Linked Data context; and, finally, Section 7 recaps on our main conclusions regarding LODIB.

2. MAPPING PATTERNS

A mapping pattern represents a common data translation problem that should be supported by any data translation system in the context of Linked Data. Our benchmark provides a catalogue of fifteen mapping patterns that we have repeatedly discovered as we analyzed the heterogeneity between different data sources in the Linked Open Data Cloud. In the rest of this section, we present these patterns in detail. Note that for vocabulary terms in concrete examples we use the prefixes shown in Table 1.

Rename Class (RC). Every source instance of a class C is reclassified into the same instance of the renamed class C' in the target. An example of this pattern is the renaming of class fb:location.citytown in Freebase into class dbp:City in DBpedia.

Rename Property (RP). Every source instance of a property P is transformed into the same instance of the renamed property P' in the target. An example is the renaming of property dbp:elevation in DBpedia into property lgdo:ele in LinkedGeoData, in which both properties represent the elevation of a geographic location.

Rename Class based on Property (RCP). This pattern is similar to the Rename Class pattern, but it is based on the existence of a property. Every source instance of a class C is reclassified into the same instance of the renamed class C' in the target if and only if the source instance is related with another instance by a property P. An example is the renaming of class dbp:Person in DBpedia into class fb:people.deceased_person in Freebase if and only if an instance of dbp:Person is related with an instance of property dbp:deathDate, i.e., if a deceased person in Freebase exists, there must exist a person with a date of death in DBpedia.

Rename Class based on Value (RCV). This pattern is similar to the previous pattern, but the property instance must have a specific value v to rename the source instance. An example is the renaming of class gw:Person in GovWILD into class fb:government.politician in Freebase if and only if each instance of gw:Person is related with an instance of property gw:profession and its value is the literal "politician". This means that only people whose profession is politician in GovWILD are translated into politicians in Freebase.

Reverse Property (RvP). This pattern is similar to the Rename Property pattern, but the property instance in the target is reversed, i.e., the subject is interchanged with the object. An example is the reverse of property fb:airports_operated in Freebase into property dbp:operator in DBpedia, in which the former relates an operator with an airport, and the latter relates an airport with an operator.

Resourcesify (Rsc). Every source instance of a property P is split into a target instance of property P' and an instance of property Q. Both instances are connected using a fresh resource, which establishes the original connection of the instance of property P. Note that the new target resource must be unique and consistent with the definition of the target vocabulary. An example is the creation of a new URI or blank node when translating property dbp:runtime in DBpedia into po:duration in BBC by creating a new instance of property po:version.

Deresourcesify (DRsc). Every source instance of a property P is renamed into a target instance of property P' if and only if P is related to another source instance of a property Q, that is, both instances use the same resource. In this case, the source needs more instances than the target to represent the same information. An example of this pattern is that an airport in DBpedia is related with its city served by property dbp:city, and the name of this city is given as the value of rdfs:label. This is transformed into property lgdp:city_served in LinkedGeoData, which relates an airport with its city served (as a literal).

1:1 Value to Value (1:1). The value of every source instance of a property P must be transformed by means of a function into the value of a target instance of property P'. An example is dbp:runtime in DBpedia, which is transformed into movie:runtime in LinkedMDB, in which the source is expressed in seconds and the target in minutes.

Value to URI (VtU). Every source instance of a property P is translated into a target instance of property P', and the source object value is transformed into a URI in the target. An example of this pattern is property grs:point in DBpedia, which is translated into property fb:location.location.geolocation in Freebase, and the value of every instance of grs:point is transformed into a URI.

URI to Value (UtV). This pattern is similar to the previous one, but the source instance relates to a URI that is transformed into a literal value in the target. An example of the URI to Value pattern is property dbp:wikiPageExternalLink in DBpedia, which is translated into property fb:common.topic.official_website in Freebase, and the URI of the source instance is translated to a literal value in the target.

Change Datatype (CD). Every source instance of a datatype property P whose type is TYPE is renamed into the same target instance of property P' whose type is TYPE'. An example of this pattern is property fb:people.person.date_of_birth in Freebase, whose type is xsd:dateTime, which is translated into target property dbp:birthDate in DBpedia, whose type is xsd:date.

Add Language Tag (ALT). In this pattern, every source instance of a property P is translated into a target instance of property P', and a new language tag TAG is added to the target literal. An example of this pattern is that db:genericName in Drug Bank is renamed into property rdfs:label in DBpedia and a new language tag "@en" is added.

Remove Language Tag (RLT). Every source instance of a property P is translated into a target instance of property P', and the source instance has a language tag TAG that is removed. An example is skos:altLabel in DataGov Statistics, which has a language tag "@en" and is translated into skos:altLabel in Ordnance Survey with the language tag removed.

N:1 Value to Value (N:1). A number of source instances of properties P1, P2, ..., Pn are translated into a single target instance of property P', and the value of the target instance is computed by means of a function over the values of the source instances. An example of this pattern is that we concatenate the values of properties foaf:givenName and foaf:surname in DBpedia into property fb:type.object.name in Freebase.

Aggregate (Agg). In this pattern, we count the number of source instances of property P, which is translated into a target instance of property Q. An example is property fb:metropolitan_transit.transit_system.transit_lines in Freebase, whose values are aggregated into a single value of dbp:numberOfLines for each city in DBpedia.

Finally, we present a summary of these mapping patterns in Table 2. The first column of this table stands for the code of each pattern; the second and third columns establish the triples to be retrieved in the source and the triples to be constructed in the target, using a SPARQL-like notation. Note that properties are represented as P and Q, classes as C, constant values as v, language tags as TAG, and datatypes as TYPE.

Table 2: Mapping patterns in the LOD Cloud

Code  Source triples                  Target triples
RC    ?x a C                          ?x a C'
RP    ?x P ?y                         ?x P' ?y
RCP   ?x a C                          ?x a C'
      FILTER EXISTS {
        {?x P ?y} UNION {?y P ?x} }
RCV   ?x a C . ?x P v                 ?x a C'
RvP   ?x P ?y                         ?y P' ?x
Rsc   ?x P ?y                         ?x Q ?z . ?z P' ?y
DRsc  ?x Q ?z . ?z P ?y               ?x P' ?y
1:1   ?x P ?y                         ?x P' f(?y)
VtU   ?x P ?y                         ?x P' toURI(?y)
UtV   ?x P ?y                         ?x P' toLiteral(?y)
CD    ?x P ?y^^TYPE                   ?x P' ?y^^TYPE'
ALT   ?x P ?y                         ?x P' ?y@TAG
RLT   ?x P ?y@TAG                     ?x P' ?y
N:1   ?x P1 ?v1 ... ?x Pn ?vn         ?x P' f(?v1, ..., ?vn)
Agg   ?x P ?y                         ?x Q count(?y)

3. LODIB GROUNDING

In order to base the LODIB Benchmark on realistic real-world distributions of these mapping patterns, we analyzed 84 data translation examples from the LOD Cloud and counted the occurrences of mapping patterns in these examples. First, we selected different Linked Data sources by exploring the LOD data set catalog maintained on CKAN (http://thedatahub.org/group/lodcloud). The criteria we followed was to choose sources that comprise a great number of owl:sameAs links with other Linked Data sources, i.e., more than 25,000. Furthermore, we tried to select sources from the major domains represented in the LOD Cloud. Therefore, the selected Linked Data sources are the following: ACM (RKB Explorer), DBLP (RKB Explorer), Dailymed, Drug Bank, DataGov Statistics, Ordnance Survey, DBpedia, GeoNames, Linked GeoData, LinkedMDB, New York Times, Music Brainz, Sider, GovWILD, ProductDB, and OpenLibrary. Note that, for each domain of the LOD Cloud, there are at least two Linked Data sources that contribute to our statistics, except for the domain of user-generated content.

After selecting these sources, we randomly selected 42 examples, each of which comprises a pair of instances that are connected by an owl:sameAs link. For each of these examples, we analyzed both directions: one instance is the source and the other instance is the target, and backwards. Therefore, the total number of examples we analyzed was 84. Then, we manually counted the number of mapping patterns that are needed to translate between the representations of the instances (neighboring instances were also considered to detect more complex structural mismatches). These statistics are publicly available at [22].

In the next step, we computed the averages of our mapping patterns grouped by the pair of source and target data sets. To compute them, in some cases, we analyzed the translation of one single instance, since the data set of the Linked Data source comprises only a couple of classes, such as Drug Bank or Ordnance Survey. In other cases, we analyzed more than one instance, since the data set comprises a large number of classes, such as DBpedia or Freebase.

Table 3 presents the statistics of the mapping patterns that we have found in the LOD Cloud. The first two columns stand for the source and target Linked Data data sets; the following columns contain the averages of each mapping pattern according to the source and the target, i.e., we count the occurrences of mapping patterns in a number of examples and compute the average. Note that, for certain data sets, we analyzed several examples of the same type; therefore, the final numbers of these columns are real numbers (not integers). Finally, the last column contains the total number of instances that we analyzed for each pair of Linked Data data sets.

On the one hand, the Rename Class and Rename Property mapping patterns appear in the vast majority of the analyzed examples, since these patterns are very common in practice. On the other hand, there are some patterns that are not so common, e.g., the Value to URI and URI to Value patterns appear only once in all analyzed examples (between DBpedia and Drug Bank). Table 4 presents the average occurrences of the LODIB mapping patterns over all analyzed examples.

4. LODIB DESIGN

Based on the previously described statistics, we have designed the LODIB Benchmark. The benchmark consists of three different source data sets that need to be translated by the system under test into a single target vocabulary. The topic of the data sets is the same e-commerce data set that we already used for the BSBM Benchmark [6]. The data sets describe products, reviews, people and some more lightweight classes, such as product price, using different source vocabularies. For the translation from the representation of an instance in the source data sets to the target vocabulary, data translation systems need to apply several of the presented mapping patterns. The descriptions of these data sets are publicly available at the LODIB homepage [22].

These data sets take the previously computed averages of Table 4 into account by multiplying them by a constant (11) and dividing each one by another constant (3, the total number of data translation tasks, i.e., from each source data set to the target data set). As a result, each of the three data translation tasks comprises a number of mapping patterns, and we present the numbers in Table 5, in which the total number of mapping patterns for each task is 18.
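As an illustration of what the system under test has to implement, the Resourcesify pattern requires minting a fresh target resource. A sketch in SPARQL 1.1 with placeholder src:/tgt: namespaces follows; the deterministic IRI-minting scheme is our own assumption, not part of the benchmark specification:

```sparql
PREFIX src: <http://example.org/source#>   # placeholder namespace
PREFIX tgt: <http://example.org/target#>   # placeholder namespace

# Rsc: split ?x src:birthDate ?d into ?x tgt:birth ?b and ?b tgt:birthDate ?d,
# minting a fresh IRI ?b derived from the subject's IRI.
CONSTRUCT {
  ?x tgt:birth ?b .
  ?b tgt:birthDate ?d .
}
WHERE {
  ?x src:birthDate ?d .
  BIND(IRI(CONCAT(STR(?x), "-BirthDate")) AS ?b)
}
```

Any minting scheme works as long as the new resource is unique and consistent with the target vocabulary, as the pattern definition requires.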
Table 3: Mapping patterns in Linked Data sources

Source           Target            RC   RP   RCP  RCV  RvP  Rsc  DRsc 1:1  VtU  UtV  CD   ALT  RLT  N:1  Agg  Total
ACM              DBLP              0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1
Dailymed         Drug Bank         1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1
DataGov Stats    Ordnance Survey   1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 1
DBLP             ACM               0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1
DBpedia          Freebase          2.14 8.57 0.64 0.00 2.21 2.29 0.00 0.57 0.00 0.00 1.14 0.00 0.00 0.14 0.07 14
DBpedia          Geonames          1.00 3.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 3
DBpedia          Linked GeoData    1.00 3.50 0.00 0.00 0.00 0.00 1.50 0.13 0.00 0.00 2.25 0.00 0.00 0.00 0.00 8
DBpedia          LinkedMDB         1.00 5.50 0.33 0.00 0.33 0.00 0.00 0.33 0.00 0.00 0.00 0.00 0.00 0.00 0.00 6
DBpedia          Drug Bank         1.00 1.00 0.00 0.00 0.00 0.00 0.00 1.00 1.00 0.00 0.00 0.00 3.00 0.00 0.00 1
DBpedia          New York Times    0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 1
DBpedia          Music Brainz      1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1
Drug Bank        DBpedia           1.00 1.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 1.00 0.00 3.00 0.00 1.00 0.00 1
Drug Bank        Freebase          1.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1
Drug Bank        Sider             1.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1
Drug Bank        Dailymed          1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1
Freebase         DBpedia           2.14 8.57 0.29 0.07 2.21 0.00 2.29 0.79 0.00 0.00 1.14 0.00 0.00 0.00 0.14 14
Freebase         GovWILD           1.00 4.50 0.00 0.00 0.00 0.00 2.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 2
Freebase         Drug Bank         1.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1
GeoNames         DBpedia           1.00 3.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1
GovWILD          Freebase          1.00 4.50 0.00 0.50 0.00 2.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.50 0.00 2
Linked GeoData   DBpedia           1.00 3.50 0.88 0.75 0.00 1.50 0.00 0.13 0.00 0.00 2.25 0.00 0.00 0.00 0.00 8
LinkedMDB        DBpedia           1.00 5.50 0.00 0.00 0.33 0.00 0.00 0.33 0.00 0.00 0.00 0.00 0.00 0.00 0.00 6
Music Brainz     DBpedia           1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1
New York Times   DBpedia           0.00 0.00 0.00 0.00 0.00 0.00 0.00 3.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1
OpenLibrary      ProductDB         0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1
Ordnance Survey  DataGov Stats     1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 1
ProductDB        OpenLibrary       0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1
Sider            Drug Bank         1.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1

Table 4: Average occurrences of the mapping patterns

RC   RP   RCP  RCV  RvP  Rsc  DRsc 1:1  VtU  UtV  CD   ALT  RLT  N:1  Agg
0.87 2.01 0.08 0.05 0.18 0.24 0.24 0.30 0.04 0.04 0.24 0.14 0.14 0.09 0.01

Table 5: Number of mapping patterns in the data translation tasks

Task    RC  RP  RCP RCV RvP Rsc DRsc 1:1 VtU UtV CD  ALT RLT N:1 Agg
Task 1  3   7   1   0   1   1   1    1   0   1   1   0   1   0   0
Task 2  3   7   0   1   0   1   1    1   1   0   1   1   0   1   0
Task 3  3   7   0   0   1   1   1    1   0   0   1   1   1   0   1

We have implemented a data generator to populate and scale the three source data sets that we have specified in the previous section; it is publicly available at the LODIB homepage [22]. In the data generator, we defined a number of data generation rules, and the generated data are scaled based on the number of product instances that each data set contains. In our implementation, we use an extension of the language used in [8], which allows defining particular value generation rules for basic types, such as xsd:string or xsd:date. In addition, missing properties often occur in the context of the Web of Data; therefore, we also provide 44 statistical distributions in our implementation to randomly distribute properties, including Uniform, Normal, Exponential, Zipf, Pareto and empirical distributions, to mention a few.

In this section, we provide examples of how a data translation system needs to translate the data from the source to the target vocabulary regarding the mapping patterns in the three data translation tasks. Specifically, Figure 1 presents a number of source triples that are translated into a number of target triples. Note that we use the prefixes src1:, src2:, src3: and tgt: for referring to the source data sets and the single target vocabulary of the data sets; and src1-data:, src2-data:, src3-data: and tgt-data: for referring to the source and target data. These examples are the following:

Rename Class (RC). Class src1:Product needs to be renamed into class tgt:Product, e.g., the src1-data:Canon-Ixus-20010 product instance.

Rename Property (RP). Property src1:name needs to be renamed into property rdfs:label, e.g., the name of the src1-data:Canon-Ixus-20010 product instance.

Rename Class based on Property (RCP). In this case, class src1:Person needs to be renamed into class tgt:Reviewer if and only if property src1:author exists, e.g., the src1-data:Smith-W person instance.

Rename Class based on Value (RCV). In this example, class src2:Product needs to be renamed into class tgt:OutdatedProduct if and only if property src2:outdated exists and has value "Yes", e.g., the src2-data:HTC-Wildfire-S product instance.

Reverse Property (RvP). In this example, property src1:author is reversed into property tgt:author, e.g., the src1-data:Review-CI-001 review instance and the src1-data:Smith-W person instance are related, and the relation is reversed in the target.

Resourcesify (Rsc). Property src1:birthDate needs to be renamed into property tgt:birthDate, and a new target instance of property tgt:birth is needed, e.g., the date of birth of the src1-data:Smith-W person instance.

Deresourcesify (DRsc). Property src2:revText needs to be renamed into property tgt:text if and only if the instance of property src2:revText is related to another source instance of property src2:hasText, e.g., the text of the src2-data:Review-HTC-W-S review instance.

1:1 Value to Value (1:1). Property src2:price needs to be renamed into property tgt:productPrice, and the value must be transformed by means of function usDollarsToEuros, since the source price is represented in US dollars and the target in Euros, e.g., the price of the src2-data:HTC-Wildfire-S product instance.

Value to URI (VtU). In this example, we need to rename property src1:personHomepage into property tgt:personHomepage, and the values of the source instances are transformed into URIs in the target, e.g., the homepage of the src1-data:Smith-W person instance.

URI to Value (UtV). In this example, we need to rename property src2:productHomepage into property tgt:productHomepage, and the URIs of the source instances are transformed into values in the target, e.g., the homepage of the src2-data:HTC-Wildfire-S product instance.

Change Datatype (CD). Property dc:date in the first source needs to be translated into dc:date, and its type is transformed from xsd:string into xsd:date, e.g., the date of the src1-data:Review-CI-001 review instance.

Add Language Tag (ALT). Property src2:mini-cv needs to be renamed into property tgt:bio, and a new language tag "@en" is added in the target, e.g., the CV of the src2-data:Doe-J person instance.

Remove Language Tag (RLT). Property src1:revText needs to be renamed into property tgt:text, and the language tag of the source is removed, e.g., the text of the src1-data:Review-CI-001 review instance.

N:1 Value to Value (N:1). Properties foaf:firstName and foaf:surname in the second source need to be translated into property tgt:name, and their values are concatenated to compose the target value, e.g., the first name and surname of the src2-data:Doe-J person instance.

Aggregate (Agg). We count the number of instances of source property src3:hasReview, and this number needs to be translated as the value of property tgt:totalReviews, e.g., the reviews of the src3-data:VPCS product instance.

Figure 1: Sample data translation tasks.

(a) Source triples:

  src1-data:Canon-Ixus-20010
    a src1:Product ;
    src1:name "Canon Ixus"^^xsd:string .

  src2-data:HTC-Wildfire-S
    a src2:Product ;
    src2:outdated "Yes"^^xsd:string ;
    src2:price "199.99"^^xsd:double ;
    src2:productHomepage <http://htc.com/> .

  src3-data:VPCS
    a src3:Product ;
    src3:hasReview src3-data:Review-VPCS-01 ;
    src3:hasReview src3-data:Review-VPCS-02 ;
    src3:hasReview src3-data:Review-VPCS-03 .

  src1-data:Review-CI-001
    a src1:Review ;
    src1:author src1-data:Smith-W ;
    dc:date "01/10/2011"^^xsd:string ;
    src1:revText "This camera is awesome!"@en .

  src2-data:Review-HTC-W-S
    a src2:Review ;
    src2:hasText src2-data:Review-HTC-W-S-Text .

  src2-data:Review-HTC-W-S-Text
    a src2:ReviewText ;
    src2:revText "Great phone"^^xsd:string .

  src1-data:Smith-W
    a src1:Person ;
    src1:birthDate "06/07/1979"^^xsd:date ;
    src1:personHomepage "wsmith.org"^^xsd:string .

  src2-data:Doe-J
    a src2:Person ;
    src2:mini-cv "Born in the US."^^xsd:string ;
    foaf:firstName "John"^^xsd:string ;
    foaf:surname "Doe"^^xsd:string .

(b) Target triples:

  src1-data:Canon-Ixus-20010
    a tgt:Product ;
    rdfs:label "Canon Ixus"^^xsd:string .

  src2-data:HTC-Wildfire-S
    a tgt:OutdatedProduct ;
    tgt:productPrice "152.59"^^xsd:double ;
    tgt:productHomepage "htc.com/"^^xsd:string .

  src3-data:VPCS
    a tgt:Product ;
    tgt:totalReviews "3"^^xsd:integer .

  src1-data:Review-CI-001
    a tgt:Review ;
    dc:date "01/10/2011"^^xsd:date ;
    tgt:text "This camera is awesome!" .

  src2-data:Review-HTC-W-S
    a tgt:Review ;
    tgt:text "Great phone"^^xsd:string .

  src1-data:Smith-W
    a tgt:Reviewer ;
    tgt:author src1-data:Review-CI-001 ;
    tgt:birthDate tgt-data:Smith-W-BirthDate ;
    tgt:personHomepage <http://wsmith.org> .

  tgt-data:Smith-W-BirthDate
    a tgt:Birth ;
    tgt:birthDate "06/07/1979"^^xsd:date .

  src2-data:Doe-J
    a tgt:Person ;
    tgt:bio "Born in the US."@en ;
    tgt:name "John Doe"^^xsd:string .

5. EXPERIMENTS

The LODIB benchmark can be used to measure two performance dimensions of a data translation system. First, we state the expressivity of the data translation system, that is, the number of mapping patterns that can be expressed in each system. Second, we measure the performance by taking the time to translate all source data sets to the target representation. For our benchmark experiment, we generated data sets in N-Triples format containing 25, 50, 75 and 100 million triples. For each data translation system and data set, the time is measured starting with reading the input data set file and ending when the output data set has been completely serialized to one or more N-Triples files.

We have applied the benchmark to test the performance of two data translation systems:

Mosto. It is a tool to automatically generate executable mappings amongst semantic-web ontologies [20]. It is based on an algorithm that relies on constraints, such as rdfs:domain, of the source and target ontologies to be integrated, and a number of 1-to-1 correspondences between TBox ontology entities [19]. The Mosto tool also allows running these automatically generated executable mappings using several semantic-web technologies, such as Jena TDB, Jena SDB, or Oracle 11g. For our tests we advised Mosto to generate (Jena-specific) SPARQL Construct queries. The data sets were translated using these generated queries and Jena TDB (version 0.8.10).

LDIF. It is an ETL-like component for integrating data from Linked Open Data sources [24]. LDIF's integration pipeline includes one module for vocabulary mapping, which executes mappings expressed in the R2R [7] mapping language. All the R2R mappings were written by hand.
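As an example of what such handwritten mappings must express, the Aggregate pattern (tgt:totalReviews in Figure 1) requires SPARQL 1.1 grouping. The following is a sketch with placeholder namespace URIs, not the benchmark's actual mapping files:

```sparql
PREFIX src3: <http://example.org/src3#>   # placeholder namespace
PREFIX tgt:  <http://example.org/tgt#>    # placeholder namespace

# Agg: count the reviews of each product and emit the count as tgt:totalReviews
CONSTRUCT {
  ?prod tgt:totalReviews ?n .
}
WHERE {
  {
    SELECT ?prod (COUNT(?rev) AS ?n)
    WHERE { ?prod src3:hasReview ?rev . }
    GROUP BY ?prod
  }
}
```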
5. EXPERIMENTS

The LODIB benchmark can be used to measure two performance dimensions of a data translation system. First, we state the expressivity of the data translation system, that is, the number of mapping patterns that can be expressed in each system. Second, we measure the performance by taking the time to translate all source data sets into the target representation. For our benchmark experiment, we generated data sets in N-Triples format containing 25, 50, 75 and 100 million triples. For each data translation system and data set, the time is measured starting with reading the input data set file and ending when the output data set has been completely serialized to one or more N-Triples files.

We have applied the benchmark to test the performance of two data translation systems:

Mosto: A tool to automatically generate executable mappings amongst semantic-web ontologies [20]. It is based on an algorithm that relies on constraints, such as rdfs:domain, of the source and target ontologies to be integrated, and on a number of 1-to-1 correspondences between TBox ontology entities [19]. The Mosto tool also allows running these automatically generated executable mappings using several semantic-web technologies, such as Jena TDB, Jena SDB, or Oracle 11g. For our tests, we advised Mosto to generate (Jena-specific) SPARQL Construct queries. The data sets were translated using these generated queries and Jena TDB (version 0.8.10).

LDIF: An ETL-like component for integrating data from Linked Open Data sources [24]. LDIF's integration pipeline includes one module for vocabulary mapping, which executes mappings expressed in the R2R [7] mapping language. All the R2R mappings were written by hand. LDIF supports different runtime profiles that apply to different workloads. For the smaller data sets, we used the in-memory profile, in which all the data is stored in memory. For the 100M data set, we executed the Hadoop version, which was run in single-node mode (pseudo-distributed) on the benchmarking machine, as the in-memory version was not able to process this use case.

To allow other researchers to reproduce our results, the configuration and all used mappings for Mosto and LDIF are publicly available at the LODIB homepage [22]. To set the results of these two systems into the context of the more popular tools in the Linked Data space, we compared the performance of both systems with the SPARQL 1.1 performance of the Jena TDB RDF store (version 0.8.10). All the mappings for Jena TDB were expressed as SPARQL 1.1 Construct queries, which we wrote manually. For loading the source data sets, we used the more efficient tdbloader2, which also generates data set statistics that are used by the TDB optimizer.

Table 6 gives an overview of the expressivity of the data translation systems. All mapping patterns are expressible in SPARQL 1.1, so all the mappings are actually executed on Jena TDB. The current implementation of the Mosto tool generates Jena-specific SPARQL Construct queries, which could, in general, cover all the mapping patterns. However, the goal of the Mosto tool is to automatically generate SPARQL Construct queries by means of constraints and correspondences without user intervention; therefore, a checkmark in Table 6 means that Mosto was able to automatically generate executable mappings from the source and target data sets and a number of correspondences amongst them. Note that the Mosto tool is not able to deal with the RCP and RCV mapping patterns, since it does not allow the renaming of classes based on conditional properties and/or values. Furthermore, it does not support the Agg mapping pattern, since it does not allow aggregating/counting properties. In R2R, it is not possible to express aggregates; therefore, no aggregation mapping was executed on LDIF.

In order to check whether the source data has been correctly and fully translated, we developed a validation tool that examines whether the source data is represented correctly in the target data set. Using the validation tool, we verified that all three systems produce proper results.

To compare the performance and the scaling behaviour of the systems, we have run the benchmark on an Intel i7 950 (4 cores, 3.07GHz, 1 x SATA HDD) machine with 24GB of RAM running Ubuntu 10.04.

Table 7 summarizes the overall runtimes for each mapping system and use case. Since Mosto and R2R were not able to express all mapping patterns, we created three groups: 1) one that did not execute the RCV, RCP and AGG mappings, 2) one without the AGG mapping, and 3) one executing the full set of mappings. The results show that Mosto and Jena TDB have, as expected, similar runtime performance, because Mosto internally uses Jena TDB. LDIF, on the other hand, is about twice as fast on the smallest data set and about three times as fast on the largest data set compared to Jena TDB and Mosto. One reason for the differences could be that LDIF highly parallelizes its workload, both in the in-memory and the Hadoop version.

6. RELATED WORK

The most closely related benchmarks are STBenchmark [1] and DTSBench [21]. Alexe et al. [1] devised STBenchmark, a benchmark that is used to test data translation systems in the context of nested relational models. It provides eleven patterns that occur frequently in the information integration context. Unfortunately, this benchmark is not suitable in our context, since semantic-web technologies have a number of inherent differences with respect to nested relational models [2, 14, 15, 25].

Rivero et al. [21] devised DTSBench, a benchmark to test data translation systems in the context of semantic-web technologies that provides seven data translation patterns. Furthermore, it provides seven parameters that allow creating a variety of synthetic, domain-independent data translation tasks to test such systems. This benchmark is suitable to test data translation amongst Linked Data sources; however, the patterns that it provides are inspired by the ontology evolution and information integration contexts, not the Linked Data context. Therefore, it allows generating synthetic tasks based on these patterns, but not real-world Linked Data translation tasks.

There are other benchmarks in the literature that are suitable to test semantic-web technologies. However, they cannot be applied to our context, since none of them focuses on data translation problems, i.e., they do not provide source and target data sets and a number of queries to perform data translation. Furthermore, these benchmarks focus mainly on SPARQL Select queries, which are not suitable to perform data translation, instead of SPARQL Construct queries. Guo et al. [12] presented LUBM, a benchmark to compare systems that support semantic-web technologies, which provides a single ontology, a data generator algorithm that allows creating scalable synthetic data, and fourteen SPARQL queries of the Select type. Wu et al. [26] presented the experience of the authors when implementing an inference engine for Oracle. Bizer and Schultz [6] presented BSBM, a benchmark to compare the performance of SPARQL queries using native RDF stores and SPARQL-to-SQL query rewriters. Schmidt et al. [23] presented SP2Bench, a benchmark to test SPARQL query management systems, which comprises both a data generator and a set of benchmark queries in SPARQL.

7. CONCLUSIONS

Linked Data sources try to reuse as many existing vocabularies as possible in order to ease the integration of data from multiple sources. Other data sources use completely proprietary vocabularies to represent their content, or use a mixture of common terms and proprietary terms. Due to these facts, there exists heterogeneity amongst vocabularies in the context of Linked Data. Data translation, which relies on executable mappings and consists of exchanging data from a source data set to a target data set, helps solve these heterogeneity problems.

In this paper, we presented LODIB, a benchmark to test data translation systems in the context of Linked Data. Our benchmark provides a catalogue of fifteen data translation patterns, each of which is a common data translation problem. Furthermore, we analyzed 84 random examples of data translation in the LOD Cloud and studied the distribution of the patterns in these examples. Taking these results into account, we devised three source data sets and one target data set based on the e-commerce domain that reflect the mapping pattern distribution. Each source data set comprises one data translation task.
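The roughly two- to three-fold LDIF speedup reported in the experiments can be recomputed from the group-1 rows of Table 7. The following is a small arithmetic check using the published runtimes (seconds), not part of the benchmark itself:

```python
# Sanity check of the speedup claim: group-1 runtimes (without the
# RCP, RCV and AGG mappings), in seconds, copied from Table 7.
mosto_jena_tdb = {25: 3121, 50: 7308, 75: 10622, 100: 15763}
ldif = {25: 1506, 50: 2803, 75: 4482, 100: 5718}

# Ratio of Mosto/Jena TDB runtime to LDIF runtime per data set size (M triples).
speedups = {size: round(mosto_jena_tdb[size] / ldif[size], 1)
            for size in mosto_jena_tdb}
print(speedups)  # {25: 2.1, 50: 2.6, 75: 2.4, 100: 2.8}
```

The ratios grow from about 2.1x on the 25M data set to about 2.8x on the 100M data set, consistent with the statement that LDIF is about twice as fast on the smallest and about three times as fast on the largest data set.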
Table 6: Expressivity of the mapping systems

                 RC  RP  RCP  RCV  RvP  Rsc  DRsc  1:1  VtU  UtV  CD  ALT  RLT  N:1  Agg
Mosto queries     X   X    -    -    X    X     X    X    X    X   X    X    X    X    -
SPARQL 1.1        X   X    X    X    X    X     X    X    X    X   X    X    X    X    X
R2R               X   X    X    X    X    X     X    X    X    X   X    X    X    X    -

Table 7: Runtimes of the mapping systems for each use case (in seconds)

                                       25M     50M     75M     100M
Mosto SPARQL queries / Jena TDB(1)   3,121   7,308  10,622   15,763
R2R / LDIF(1)                        1,506   2,803   4,482   *5,718
SPARQL 1.1 / Jena TDB(1)             2,720   6,418  10,481   16,548
R2R / LDIF(2)                        1,485   2,950   4,715   *5,784
SPARQL 1.1 / Jena TDB(2)             2,839   6,508  12,386   19,499
SPARQL 1.1 / Jena TDB                2,925   6,858  12,774   20,630

* Hadoop version of LDIF as single-node cluster; out of memory for the in-memory version.
(1) without the RCP, RCV and AGG mappings
(2) without the AGG mapping

Current benchmarks concerning data translation either focus on nested relational models, which are not suitable for our context since semantic-web technologies have a number of inherent differences with respect to these models, or address the general context of semantic-web technologies. To the best of our knowledge, LODIB is the first benchmark that is based on the real-world distribution of data translation patterns in the LOD Cloud, and that is specifically tailored towards the Linked Data context.

In this paper, we compared three data translation systems, Mosto, SPARQL 1.1/Jena TDB, and R2R, by scaling the three data translation tasks. In this context, Mosto is able to deal with 12 out of the 15 mapping patterns described in this paper, SPARQL 1.1/Jena TDB deals with 15 out of 15, and R2R deals with 14 out of 15. Furthermore, the results show that R2R outperforms both the Mosto and SPARQL 1.1/Jena TDB data translation systems when performing the three data translation tasks. Our empirical study has shown that, to translate data amongst data sets in the LOD Cloud, only a small set of simple mapping patterns is needed. In this context, the fifteen mapping patterns identified in this paper were enough to cover the vast majority of data translation problems when integrating these data sets.

As the Web of Data grows, the task of translating data amongst data sets moves into focus. We hope that the LODIB benchmark will be considered useful by the developers of the currently existing Linked Data translation systems as well as the systems to come. More information about LODIB is publicly available at the homepage [22], such as the exact specification of the benchmark data sets, the data generator, examples of the mapping patterns, and the statistics about these mappings that we found in the LOD Cloud.

Acknowledgments

Supported by the European Commission (FEDER), the Spanish and the Andalusian R&D&I programmes (grants P07-TIC-2602, P08-TIC-4100, TIN2008-04718-E, TIN2010-21744, TIN2010-09809-E, TIN2010-10811-E, and TIN2010-09988-E), and partially financed through funds received from the European Community's Seventh Framework Programme (FP7) under Grant Agreement No. 256975 (LATC) and Grant Agreement No. 257943 (LOD2).

8. REFERENCES

[1] B. Alexe, W. C. Tan, and Y. Velegrakis. STBenchmark: Towards a benchmark for mapping systems. PVLDB, 1(1):230–244, 2008.
[2] R. Angles and C. Gutiérrez. Survey of graph database models. ACM Comput. Surv., 40(1), 2008.
[3] M. Arenas and L. Libkin. XML data exchange: Consistency and query answering. J. ACM, 55(2), 2008.
[4] C. Bizer, T. Heath, and T. Berners-Lee. Linked Data - the story so far. Int. J. Semantic Web Inf. Syst., 5(3):1–22, 2009.
[5] C. Bizer, A. Jentzsch, and R. Cyganiak. State of the LOD cloud. Available at: http://www4.wiwiss.fu-berlin.de/lodcloud/state/#terms, 2011.
[6] C. Bizer and A. Schultz. The Berlin SPARQL benchmark. Int. J. Semantic Web Inf. Syst., 5(2):1–24, 2009.
[7] C. Bizer and A. Schultz. The R2R framework: Publishing and discovering mappings on the Web. In 1st International Workshop on Consuming Linked Data (COLD), 2010.
[8] D. Blum and S. Cohen. Grr: Generating random RDF. In ESWC (2), pages 16–30, 2011.
[9] J. Euzenat, A. Polleres, and F. Scharffe. Processing ontology alignments with SPARQL. In CISIS, pages 913–917, 2008.
[10] R. Fagin, P. G. Kolaitis, R. J. Miller, and L. Popa. Data exchange: semantics and query answering. Theor. Comput. Sci., 336(1):89–124, 2005.
[11] A. Fuxman, M. A. Hernández, C. T. H. Ho, R. J. Miller, P. Papotti, and L. Popa. Nested mappings: Schema mapping reloaded. In VLDB, pages 67–78, 2006.
[12] Y. Guo, Z. Pan, and J. Heflin. LUBM: A benchmark for OWL knowledge base systems. J. Web Sem., 3(2-3):158–182, 2005.
[13] T. Heath and C. Bizer. Linked Data: Evolving the Web into a Global Data Space. Morgan & Claypool, 2011.
[14] B. Motik, I. Horrocks, and U. Sattler. Bridging the gap between OWL and relational databases. J. Web Sem., 7(2):74–89, 2009.
[15] N. F. Noy and M. C. A. Klein. Ontology evolution: Not the same as schema evolution. Knowl. Inf. Syst., 6(4):428–440, 2004.
[16] F. S. Parreiras, S. Staab, S. Schenk, and A. Winter. Model driven specification of ontology translations. In ER, pages 484–497, 2008.
[17] A. Polleres, F. Scharffe, and R. Schindlauer. SPARQL++ for mapping between RDF vocabularies. In ODBASE, pages 878–896, 2007.
[18] H. Qin, D. Dou, and P. LePendu. Discovering executable semantic mappings between ontologies. In ODBASE, pages 832–849, 2007.
[19] C. R. Rivero, I. Hernández, D. Ruiz, and R. Corchuelo. Generating SPARQL executable mappings to integrate ontologies. In ER, pages 118–131, 2011.
[20] C. R. Rivero, I. Hernández, D. Ruiz, and R. Corchuelo. Mosto: Generating SPARQL executable mappings between ontologies. In ER Workshops, pages 345–348, 2011.
[21] C. R. Rivero, I. Hernández, D. Ruiz, and R. Corchuelo. On benchmarking data translation systems for semantic-web ontologies. In CIKM, pages 1613–1618, 2011.
[22] C. R. Rivero, A. Schultz, and C. Bizer. Linked Open Data Integration Benchmark (LODIB) specification. Available at: http://www4.wiwiss.fu-berlin.de/bizer/lodib/, 2012.
[23] M. Schmidt, T. Hornung, G. Lausen, and C. Pinkel. SP2Bench: A SPARQL performance benchmark. In ICDE, pages 222–233, 2009.
[24] A. Schultz, A. Matteini, R. Isele, C. Bizer, and C. Becker. LDIF - Linked Data integration framework. In 2nd International Workshop on Consuming Linked Data (COLD), 2011.
[25] M. Uschold and M. Grüninger. Ontologies and semantics for seamless connectivity. SIGMOD Record, 33(4):58–64, 2004.
[26] Z. Wu, G. Eadon, S. Das, E. I. Chong, V. Kolovski, M. Annamalai, and J. Srinivasan. Implementing an inference engine for RDFS/OWL constructs and user-defined rules in Oracle. In ICDE, pages 1239–1248, 2008.