Augmenting the Web of Data using Referers

Hannes Mühleisen
Freie Universität Berlin, Networked Information Systems Group
Königin-Luise-Str. 24/26, 14195 Berlin, Germany
muehleis@inf.fu-berlin.de

Anja Jentzsch
Freie Universität Berlin, Web-based Systems Group
Garystr. 21, 14195 Berlin, Germany
mail@anjajentzsch.de

Copyright is held by the author/owner(s). LDOW2011, March 29, 2011, Hyderabad, India.

ABSTRACT

Linked Data relies on one central concept: typed links connect entities stored within data sets published by different individuals. Manual input and mapping are common techniques to create these links. We propose a novel method in which HTTP Referer information is used to create new links between Linked Data entities stored in different data sets. We evaluate our method using 27.86 million real-world log entries from web servers hosting Linked Data.

1. INTRODUCTION

On the ever-growing Web of Data there is not only a relevant overlap of data on the same real-world concepts but also a growing number of entities that are related to each other. Since 2007, the number of data sets on the Web of Data has doubled every few months (http://lod-cloud.net). Data providers have to constantly keep up with the growth of the Web of Data and with new linking possibilities. We contribute to this development by providing a novel way for Linked Data publishers to find new and suitable link targets.

The Web of Data forms a single global data space for the very reason that its data sources are connected by links. However, as the current state of the Linked Open Data cloud shows, most data sources are not sufficiently interlinked: over 50% of them are interlinked with only one or two other data sources (http://lod-cloud.net/state). Almost two thirds of the data sources do not link back to all the data sources they are linked from. This leads to a weakly interlinked and often unidirectional graph of Linked Data, which impedes applications relying on link traversal. In addition, duplicate detection and record linkage are crucial preliminaries for the integration of data. While some fully automatic tools for link discovery do exist [6], most tools generate links semi-automatically based on user-defined link specifications [11, 12, 10].

In this paper, we propose a novel, fully automatic approach for back link generation. Here, the "Referer" request header defined by the HTTP protocol specification is used to discover remote documents containing Linked Data entities that link to local entities. Since RDF links between entities are typed, the type of a back link depends on the type of the "forward" link. In order to generate correctly typed back links, the remote document URLs are dereferenced and the retrieved documents analyzed. Documents are searched for the URLs of local entities. Matches are then used to determine or create the semantically correct link property to be used for a back link. Using the local and remote entity URLs along with the back link property, new back links can be created and inserted into the data set. We present a fully deterministic algorithm for this back link generation approach.

The remainder of this paper is structured as follows: Section 2 details our approach and algorithm for automatic back link generation for Linked Data sets. Section 3 describes the evaluation of our approach using 27.86 million log entries from web servers hosting Linked Data. Section 4 gives a short overview of related approaches for link generation, and Section 5 summarizes the findings of this paper and gives pointers to areas of future work.

2. LINK GENERATION APPROACH

According to the Linked Data principles, links between entities contained in different data sets and stored on different servers are an integral part of Linked Data [3]. These links into other data sets are often used to provide background information or to lead to other related entities. Unless they rely on central databases such as Sindice [13], many Linked Data tools and applications depend on considerable quantities of links. For example, the SQUIN SPARQL query processor uses link traversal to resolve patterns within a query [8].

Within the Resource Description Framework (RDF) data model, links are directed and have their semantics specified; each link is required to be labeled by a machine-readable URI. Data sets are independent in management and storage, and links between entities are not part of any meta-level central system, but reside in the data set they were created in. Figure 1 shows this principle: in (1), two entities, ds1:res1 and ds2:res2, are linked from ds2:res2 to ds1:res1 using the link type ex:p1. The back link from ds1:res1 to ds2:res2 is not required to be present, and its link type is unknown a priori. Part (2) shows the physical storage of the entities and the link: Dataset 1 contains the entity ds1:res1, and Dataset 2 contains both the entity ds2:res2 and the link. Should a link-traversal based tool encounter ds1:res1, it would have no way of reaching ds2:res2 without the help of central databases.

Figure 1: Linked Data Links and Storage Locations

So far, links between different data sets cannot be created automatically without complex entity recognition schemes or data structure conventions. Thus, link creation is often based on human interaction, which is a tedious process and is only practicable between two data sets at a time. An automatic or supporting process for link generation would be desirable, even if only a subset of possible links can be discovered. In the "classic" WWW, links are often created on the basis of a link exchange: web authors communicate the intent of linking to each other's sites, a process that can be beneficial for both sites and their visitors. The number of links is kept low so as not to distract readers. For Linked Data entities, however, a large number of links to other entities does not disrupt usage, as these entities are mainly published for use by computer programs. Hence, as the content is machine-readable, link exchange can be performed automatically.

The Linked Data principles define the Hypertext Transfer Protocol (HTTP) as the underlying data exchange protocol. Linked Data entities are thus requested and served using this protocol. The HTTP specification defines the Referer header field as part of HTTP requests [7]. (This spelling is used in this paper to be consistent with the HTTP specification.) This field can be set by the user agent program to the URL of the site that it was referred from.

    "The Referer[sic] request-header field allows the client to specify, for the server's benefit, the address (URI) of the resource from which the Request-URI was obtained [...] The Referer request-header allows a server to generate lists of back links to resources for interest, logging, optimized caching, etc." [7, sec. 14.36]

The value of the Referer header is commonly added to request log files by standard web servers, for example by the Apache HTTP Server. For human-only web sites, the Referer values are currently mainly analyzed to track visitor sources such as search engine queries. In the case of Linked Data, the second part of the definition, the generation of lists of back links, is more relevant: if RDF crawlers and user agents set this field correctly, a program could generate back links to local entities from the web server's log files and increase the overall connectivity of the Linked Data cloud.

If Referer information is to be used to create links between RDF entities, the link property URI has to be determined first, as RDF does not allow untyped links. For very generic cases, the RDF Schema (RDFS) specification defines the rdfs:seeAlso property, which indicates a resource that might provide additional information about the subject [4]. However, the Linked Data principles allow the retrieval of remote entities ("dereferencing") in order to gain more information about them. The dereferenced remote RDF document can then be processed into RDF statements, possibly yielding the link property that was used to refer to a local entity. Reconsider the situation depicted in Figure 1: if a Referer value of ds2:res2 is logged for an HTTP request to the server hosting ds1:res1 as part of Dataset 1, an automatic process can retrieve the document describing ds2:res2 to determine the property of the link pointing to ds1:res1, in this case ex:p1.

One of the strengths of RDF is the possibility to describe the vocabularies used to link entities in a machine-readable and dereferenceable way as well. This description can be encoded using either RDFS or the Web Ontology Language (OWL) [1]. Using the owl:inverseOf property, a property itself can define which property is to be used for back links. For example, the link property hasChild could have the inverse link property hasParent. Alternatively, vocabularies can specify properties to be symmetric; for example, the property hasFriend could be defined to be symmetric (assuming a mainstream sociocultural environment). Should a link property neither have an inverse link property nor be defined to be symmetric, the remote statement linking the local and remote entity can be included in the local data set. Since agents can follow properties regardless of their direction, these links can be useful to them as well.

Figure 2 gives examples for both cases. In both pictures, the dashed elements are new to the local data set. If the inverse property is unknown, the remote statement is included (1). If the inverse property is known, for example by dereferencing the property URL, the correct link property ex:p2, known to be the owl:inverseOf ex:p1, is included along with the URL of the remote resource ds2:res2 (2).

Figure 2: New Back Link Properties

From these prerequisites, the automatic generation or recommendation of back links in the Linked Data context is possible. The following algorithm can be executed fully automatically and, given that Referers are supplied by the user agents, will generate new and meaningful links between Linked Data entities in different data sets. In the following, RDF statements are encoded as triples in the notation (subject, predicate, object). Algorithm 1 details the process of link (and statement) generation. After the document pointed to by the Referer URL has been retrieved, two cases are differentiated. If the response contains RDF statements, they are checked for occurrences of the local entity URL as subject or as object. If the local entity occurs as a subject, the remote statement is returned. If the local entity occurs as an object in one of the statements, three cases are possible: First, the link property may be symmetric; in this case it is used to create the connecting statement (Line 11). Second, if the inverse property is known, that property is used to create the new statement (Line 14). Third, if neither is the case, the remote statement is returned as well. For non-RDF documents, a string search for the URI of the local entity within the remote document is performed; if a match is found, an rdfs:seeAlso link is created (Line 22), since this link property explicitly allows linking to non-RDF resources [4].

Algorithm 1 Link Generation from Referers
Require: Requested local entity URL u, Referer URL r
 1: rdoc ← retrieve(r)
 2: if isRDF(rdoc) then
 3:   statementSet ← parseRdf(rdoc)
 4:   for all statementSet as s do
 5:     if subject(s) == u then
 6:       return s
 7:     end if
 8:     if object(s) == u then
 9:       p ← predicate(s)
10:       if isSymmetric(p) then
11:         return (u, p, subject(s))
12:       end if
13:       if hasInverseProperty(p) then
14:         return (u, inverse(p), subject(s))
15:       end if
16:       n ← createNewLocalUrl()
17:       return s
18:     end if
19:   end for
20: else
21:   if contains(rdoc, u) then
22:     return (u, rdfs:seeAlso, r)
23:   end if
24: end if

The statements generated by this algorithm can now be used in a variety of ways. We propose two methods: First, the statements could be handed over for review by another software component or by the person responsible for the local data set. Second, an automatic inclusion into the data set is also feasible. In this case, we recommend storing the statements in a separate Named Graph, along with a machine-readable provenance annotation, for example using the Provenance Vocabulary [9].

3. EVALUATION

To answer our research question and validate our algorithm, real log files from web servers hosting Linked Data sets were analyzed. Two sets of log files were made available for the USEWOD 2011 Data Challenge [2]. The first set of files was created on the web server of the DBpedia project, the second set on the web server hosting the Semantic Web Dog Food project. Both servers used the Apache "combined" log format (http://httpd.apache.org/docs/current/logs.html#combined), which is the default setting. Each log entry is represented by one line in the log file and is similar to the following sample entry in the "combined" format (line breaks added, not an actual log entry):

    160.45.170.10 [07/Jan/2010:09:52:45 -0800]
    "GET /resource/South_Bend,_Indiana HTTP/1.1"
    303 40
    "http://en.openei.org/wiki/South_Bend,_Indiana"
    "Mozilla/4.0"

The format is structured into fields for client IP address, date and time, HTTP request method and URL, status code, bytes transmitted for the response, "Referer" request header field, and user agent (browser). In order to generate new links, two things have to be determined: first, the URL of the local resource that was requested, and second, the URL of the remote resource the user agent visited before. Both can be taken from the described log file format.

In total, about 27.86 million log entries were parsed, filtered, and checked for "interesting" Referer entries. Filtering included the removal of log entries without the optional Referer field, local redirects, and log entries with Referer entries pointing to result pages of search engines such as Google or Yahoo. For all remaining entries, the Referer URL was resolved, and the resulting HTML or RDF document was searched for the URL of the local resource identifying a local entity. Requests expressed their preference for RDF document responses using the Accept HTTP header. This operation was defined to have four possible outcomes:

• Not found: the local resource was not found in the remote document, neither in plain text nor in RDF.
• Text match: the local resource was found in a plain text or HTML response.
• RDF subject match: the local resource was found in a remote RDF statement as the subject entry.
• RDF object match: the local resource was found in a remote RDF statement as the object entry. In this case, the properties used to link to the local resource were also recorded.

For RDF matches (not considering possible links to HTML documents), new statements linking the local and remote resources were generated according to our algorithm. Then, an additional request was performed on the local data set to see whether it already contained this statement. If this was not the case, the new statement could have been added to the data set.

The frequencies of the possible outcomes mentioned above, as well as the properties used for object matches, can give an indication of whether the additional links created using our approach merit the additional effort of analyzing log files for Referer entries.

Table 1 contains the detailed results of our evaluation. For each data set, the first block gives the raw number of log entries, the number of log entries with Referers, the number of Referer URLs ultimately dereferenced, and the number of unique dereferencing results. The second block details the frequencies of the different result types described above. The third block gives the number of statements that could be generated from our results, and the number of statements generated by our algorithm that were not yet contained in the respective data set. The quality of the generated statements was evaluated by manual inspection, and no obviously bogus statements were found. It has to be noted that we limited the generation of new statements to RDF matches, since they enable more meaningful back links. Since this analysis included "live" data (accessible on 2011/03/10), results may vary for repeated analyses of the same log file set.

                       DBpedia         SWDF
    Log entries     19,770,157    8,092,552
    Referer set      1,328,595      533,188
    Dereferenced         4,217       20,451
    Unique results       3,255        6,146

    Result type
    Not found            2,229        4,821
    Text match             431        1,168
    Subject match          395           47
    Object match           200          110

    Statements
    Total                  595          157
    New                    507          136

Table 1: Evaluation Results

The most frequent properties used in object matches are given in Table 2 for the two data sets. Entries with fewer than ten occurrences are not included. Both the dereferencing results and the statements generated for the respective data sets are available online (http://page.mi.fu-berlin.de/muehleis/ldow2011/) in order to support further analysis.

    Property URI                                        Freq.

    DBpedia
    http://www.w3.org/2002/07/owl#sameAs                   95
    http://dbpedia.org/ontology/wikiPageRedirects          77
    http://rdfs.org/sioc/ns#links_to                       21
    http://www.rkbexplorer.com[..]#duplicate                3

    SWDF
    http://www.w3.org/2002/07/owl#sameAs                   42
    http://xmlns.com/foaf/0.1/knows                        35
    http://www.w3.org[..]rdf-schema#seeAlso                16

Table 2: Link Property Usage

Two main conclusions can be drawn from our evaluation: First, the generation of new links between Linked Data entities is indeed possible using log files that contain Referer values. Second, the comparably small number of statements generated shows the failure of Linked Data clients and crawlers to properly set the Referer header.

4. RELATED WORK

Link discovery between data entities across data sets requires record linkage and duplicate detection techniques. While there is a large amount of related work on these topics in the database community [15, 5] as well as on ontology matching in the knowledge representation community [6], the approaches for Linked Data are still limited at the moment.

The Silk Link Discovery Framework [11] is an identity resolution framework which generates RDF links between data items based on user-provided link specifications expressed in the Silk Link Specification Language. Silk is available in different variants, one of them being Silk Server. Silk Server can be used as an identity resolution component within applications that consume Linked Data from the Web. It provides an HTTP API for matching instances from an incoming stream of RDF data.

LIMES [12] is a link discovery framework for the Web of Data. It is available as a web interface as well as a standalone tool. It offers string metrics.

LinQuer [10] is a tool for semantic link discovery over relational data, based on string and semantic matching techniques and their combinations. The LinQuer framework rewrites linkage requirement queries into standard SQL queries that can be run over relational data sources. LinQuer is meant to be used together with relational-database-to-RDF wrappers such as D2R Server or Virtuoso RDF Views.

Raimond et al. [14] propose a link discovery algorithm that takes into account both the similarities of data entities on the Web of Data and those of their neighbor entities. The algorithm is implemented within the GNAT tool.

The RKBExplorer sameAs service (http://www.rkbexplorer.com/sameAs/) provides a unified view over different Linked Data sets by managing owl:sameAs links to identify duplicate URIs. The links have to be provided to the system from external sources, which also applies to the related BackLink service.

Most of the current approaches generate links semi-automatically based on user-defined link specifications. This requires data providers to keep up with new linking possibilities and schemata. Furthermore, except for Silk Server and RKBExplorer's sameAs service, the data sets to be linked have to be specified manually. This does not scale for the growing number of data sets on the Web of Data.

5. CONCLUSION

Acting on the fourth Linked Data principle, namely the need for cross-dataset links between Linked Data entities, we have identified the Referer request header field defined by the HTTP specification as a possible source for the automatic creation of those links. However, the presence of a Referer URL does not prove the presence of an existing link to a local entity. Thus, our approach is based on applying the third Linked Data principle, the possibility of dereferencing arbitrary URLs, to the Referer URL. When retrieving the document identified by the Referer, we were able to ascertain the presence of a link from a remote entity to a local entity along with the link type used. We were then also able to determine the semantically correct back link property and create a new locally stored back link leading from a local entity to a remote entity.

We have evaluated our fully automatic approach using log entries from web servers hosting the DBpedia and Semantic Web Dog Food data sets. In total, 27.86 million log entries were analyzed, and 24,668 Referer URLs were dereferenced, yielding 9,401 distinct results. From these results, we were able to generate 643 new typed links. Our results show the feasibility and practicability of automatic back link generation for Linked Data entities using Referer information in general and web server log files in particular.

From our results, the failure of many Linked Data clients and spider programs to add the Referer header field to their requests was identified as the main factor limiting the number of statements generated by our algorithm. We therefore would like to urge developers of Linked Data tools to set the Referer request header, whenever possible, to the resource where the URL of the document currently being retrieved was found.

5.1 Further Work

Since our approach can be used to directly add statements based on information loaded from remote sources, the statements generated are easily susceptible to malicious requests and malicious remote statements. For example, if an attacker published RDF data linking a popular DBpedia entity (e.g. dbpedia:Berlin) to an advertisement page and then issued a request for this entity with that document as Referer, the algorithm would automatically create a link from the popular resource to the advertisement page. To overcome this problem, one could evaluate provenance information in order to establish and enforce a required trust level before new links are created [9].

We would also like to create a generic tool for Linked Data server administrators, which they can use to automatically process their log entries for interesting Referers, generate new back links, and automatically publish these links in their local data set. Alternatively, the tool could display the new statements to an administrator for approval.

Acknowledgments

This work has been partially supported by the "DigiPolis" project funded by the German Federal Ministry of Education and Research (BMBF), support code 03WKP07B. The authors would like to thank the reviewers and their colleagues R. Oldakowski and M. Luczak-Rösch for their insights.

6. REFERENCES

[1] Sean Bechhofer, Frank van Harmelen, Jim Hendler, et al. OWL Web Ontology Language Reference, 2004.
[2] B. Berendt, L. Hollink, V. Hollink, M. Luczak-Rösch, K. H. Möller, and D. Vallet. USEWOD2011: 1st international workshop on usage analysis and the web of data. In 20th International World Wide Web Conference (WWW2011), Hyderabad, India, 2011.
[3] Tim Berners-Lee. Linked data, 2006. http://www.w3.org/DesignIssues/LinkedData.html accessed 2010-08-12.
[4] Dan Brickley, R.V. Guha, and Brian McBride. RDF vocabulary description language, 02 2004.
[5] Ahmed K. Elmagarmid, Panagiotis G. Ipeirotis, and Vassilios S. Verykios. Duplicate record detection: A survey. IEEE Trans. on Knowl. and Data Eng., 19(1):1–16, 2007.
[6] Jérôme Euzenat, Alfio Ferrara, Christian Meilicke, et al. Results of the ontology alignment evaluation initiative 2010. In Proc. 5th ISWC workshop on ontology matching (OM), Shanghai (CN), pages 85–117, 2010.
[7] Fielding, Gettys, Mogul, Frystyk, Masinter, Leach, and Berners-Lee. Hypertext transfer protocol – HTTP/1.1, 1999.
[8] Olaf Hartig and Andreas Langegger. A database perspective on consuming linked data on the web. Datenbank-Spektrum, Semantic Web Special Issue, 10 / 2010, 2010.
[9] Olaf Hartig and Jun Zhao. Publishing and consuming provenance metadata on the web of linked data. In Deborah L. McGuinness, James Michaelis, and Luc Moreau, editors, IPAW, volume 6378 of Lecture Notes in Computer Science, pages 78–90. Springer, 2010.
[10] Oktie Hassanzadeh, Reynold Xin, Renée J. Miller, Anastasios Kementsietsidis, et al. Linkage query writer. PVLDB, 2(2):1590–1593, 2009.
[11] Robert Isele, Anja Jentzsch, and Christian Bizer. Silk Server - Adding missing Links while consuming Linked Data. In 1st International Workshop on Consuming Linked Data (COLD 2010), Shanghai, 2010.
[12] Axel-Cyrille Ngonga Ngomo and Sören Auer. LIMES - a time-efficient approach for large-scale link discovery on the web of data, 2011.
[13] Eyal Oren, Renaud Delbru, Michele Catasta, Richard Cyganiak, et al. Sindice.com: a document-oriented lookup index for open linked data. Int. J. of Metadata and Semantics and Ontologies, 3:37–52, November 10 2008.
[14] Yves Raimond, Christopher Sutton, and Mark Sandler. Automatic interlinking of music datasets on the semantic web, 2008.
[15] William E. Winkler. Overview of record linkage and current research directions. Technical report, Bureau of the Census, 2006.
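
As an illustration of the decision logic of Algorithm 1 in Section 2, the following is a minimal Python sketch, not the implementation used in the evaluation. It assumes that retrieval and RDF parsing have already happened elsewhere: `statements` is the parsed remote document as plain (subject, predicate, object) string tuples, or None for a non-RDF response. The function name `generate_back_link` and the parameters `symmetric` and `inverse_of`, which stand in for vocabulary lookups of symmetric properties and owl:inverseOf, are hypothetical. Back links are emitted with the local entity as subject, following the description of Figure 2.

```python
from typing import AbstractSet, Iterable, Mapping, Optional, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

RDFS_SEE_ALSO = "http://www.w3.org/2000/01/rdf-schema#seeAlso"


def generate_back_link(
    local_url: str,
    referer_url: str,
    statements: Optional[Iterable[Triple]],
    raw_text: str = "",
    symmetric: AbstractSet[str] = frozenset(),
    inverse_of: Optional[Mapping[str, str]] = None,
) -> Optional[Triple]:
    """Return one statement for the local data set, or None if nothing matches.

    statements: parsed remote document, or None for a non-RDF response.
    symmetric / inverse_of: hypothetical stand-ins for vocabulary lookups.
    """
    inverse_of = inverse_of or {}
    if statements is not None:  # the remote document was RDF
        for s, p, o in statements:
            if s == local_url:
                return (s, p, o)  # subject match: reuse the remote statement
            if o == local_url:
                if p in symmetric:
                    return (local_url, p, s)  # same property, reversed
                if p in inverse_of:
                    return (local_url, inverse_of[p], s)  # inverse property
                return (s, p, o)  # no inverse known: reuse remote statement
        return None
    # Non-RDF document: plain string search for the local entity URL.
    if local_url in raw_text:
        return (local_url, RDFS_SEE_ALSO, referer_url)
    return None
```

For the situation in Figure 1, calling the function with the single remote statement (ds2:res2, ex:p1, ds1:res1) and a known inverse ex:p2 yields the back link (ds1:res1, ex:p2, ds2:res2). Like Algorithm 1, the sketch stops at the first matching statement; a variant that collects all matches would be a straightforward extension.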
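
The log filtering step described in Section 3 can be sketched in a similar way. The following Python fragment is an assumption-laden illustration rather than the tooling used for the evaluation: the regular expression follows the field order of the sample entry in Section 3 and additionally tolerates the two identity columns of the stock Apache "combined" format; the function name `interesting_referer`, the `local_host` parameter, and the short list of search engine markers are hypothetical.

```python
import re
from typing import Optional, Tuple
from urllib.parse import urlsplit

# Field order as in the sample entry of Section 3. The optional group
# accepts the "- -" identity/user columns of the stock "combined" format.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+)\s+(?:\S+\s+\S+\s+)?'
    r'\[(?P<time>[^\]]+)\]\s+'
    r'"(?P<method>[A-Z]+)\s+(?P<path>\S+)[^"]*"\s+'
    r'(?P<status>\d{3})\s+(?P<size>\d+|-)\s+'
    r'"(?P<referer>[^"]*)"\s+"(?P<agent>[^"]*)"\s*$'
)

# Illustrative markers only; not the list used in the evaluation.
SEARCH_HOST_MARKERS = ("google.", "yahoo.")


def interesting_referer(line: str, local_host: str) -> Optional[Tuple[str, str]]:
    """Return (local_url, referer_url) for a usable log entry, else None."""
    m = LOG_LINE.match(line.strip())
    if m is None:
        return None  # not a parsable log entry
    referer = m.group("referer")
    if referer in ("", "-"):
        return None  # optional Referer field not set
    try:
        host = (urlsplit(referer).hostname or "").lower()
    except ValueError:
        return None  # malformed Referer URL
    if not host or host == local_host.lower():
        return None  # unusable URL or local redirect
    if any(marker in host for marker in SEARCH_HOST_MARKERS):
        return None  # search engine result page
    return ("http://" + local_host + m.group("path"), referer)
```

Applied to the sample entry with an assumed `local_host` of dbpedia.org, the function returns the pair of local resource URL and Referer URL that Algorithm 1 expects as input. Matching search engines by substring of the host name is a deliberately crude choice that keeps the sketch short; a production filter would likely use an explicit host list.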