Augmenting the Web of Data using Referers

Hannes Mühleisen
Freie Universität Berlin
Networked Information Systems Group
Königin-Luise-Str. 24/26
14195 Berlin, Germany
muehleis@inf.fu-berlin.de

Anja Jentzsch
Freie Universität Berlin
Web-based Systems Group
Garystr. 21
14195 Berlin, Germany
mail@anjajentzsch.de


ABSTRACT
Linked Data relies on one central concept: typed links connect entities stored within data sets
published by different individuals. Manual input and mapping are common techniques to create
these links. We propose a novel method in which HTTP Referer information is used to create new
links between Linked Data entities stored in different data sets. We evaluate our method using
27.86 million real-world log entries from web servers hosting Linked Data.

1.     INTRODUCTION
On the ever growing Web of Data there is not only a relevant overlap of data on the same real
world concepts but also a growing number of entities that are related to each other. Since 2007,
the number of data sets on the Web of Data has been doubling every few months¹. Data providers
have to constantly keep up with the growth of the Web of Data and new linking possibilities. We
contribute to this development by providing a novel way for Linked Data publishers to find new
and suitable link targets.

The Web of Data forms a single global data space for the very reason that its data sources are
connected by links. However, as the current state of the Linked Open Data cloud shows, most data
sources are not sufficiently interlinked, with over 50% of them being interlinked with only one
or two other data sources². Almost two thirds of the data sources do not link back to all the
data sources they are linked from. This leads to a weakly interlinked and often unidirectional
graph of Linked Data, which impedes applications relying on link traversal. In addition,
duplicate detection and record linkage are crucial prerequisites for the integration of data.
While some fully automatic tools for link discovery do exist [6], most tools generate the links
semi-automatically based on user-defined link specifications [11, 12, 10].

In this paper, we propose a novel, fully automatic approach for back link generation. Here, the
“Referer” request header defined by the HTTP protocol specification is used to discover remote
documents containing Linked Data entities linking to local entities. Since RDF links between
entities are typed, the type of a back link depends on the type of the “forward” link. In order
to generate correctly typed back links, the remote document URLs are dereferenced, and the
retrieved document is analyzed. Documents are searched for the URL of local entities. Matches
are then used to determine or create the semantically correct link property to be used for a
back link. Using the local and remote entity URLs along with the back link property, new back
links can be created and inserted into the data set. We present a fully deterministic algorithm
for this back link generation approach.

The remainder of this paper is structured as follows: Section 2 details our approach and
algorithm for automatic back link generation for Linked Data sets. Section 3 describes the
evaluation of our approach using 27.86 million log entries from web servers hosting Linked Data.
Section 4 gives a short overview of related approaches for link generation, and finally
Section 5 summarizes the findings of this paper and gives pointers to areas of future work.

¹ http://lod-cloud.net
² http://lod-cloud.net/state

Copyright is held by the author/owner(s).
LDOW2011, March 29, 2011, Hyderabad, India.

2.   LINK GENERATION APPROACH
According to the Linked Data principles, links between entities contained in different data sets
and stored on different servers are an integral part of Linked Data [3]. These links into other
data sets are often used to provide background information, or lead to other related entities.
Apart from those using central databases such as Sindice [13], many Linked Data tools and
applications depend on considerable quantities of links. For example, the SQUIN SPARQL query
processor uses link traversal to resolve patterns within a query [8].

Within the Resource Description Framework (RDF) data model, links are directed and have link
semantics specified; each link is required to be labeled by a machine-readable URI. Data sets
are independent in management and storage, and links between entities are not a part of any
meta-level central system, but reside in the data set they were created in. Figure 1 shows this
principle: two entities, ds1:res1 and ds2:res2, are linked from ds2:res2 to ds1:res1 using the
link type ex:p1 (1). The back link from ds1:res1 to ds2:res2 is not required to be present, and
its link type is unknown a priori. Part (2) shows the physical storage of the entities and the
link: Dataset 1 contains the entity ds1:res1, and Dataset 2 contains both the entity ds2:res2
and the link. Should a link-traversal based tool encounter ds1:res1, it would have no way of
reaching ds2:res2 without the help of central databases.
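The reachability problem described for Figure 1 can be illustrated with a small sketch. It is not taken from the paper: the dictionary DATASETS and the functions dereference and reachable are hypothetical stand-ins for two Linked Data servers and a link-traversal client, and the dictionary lookup replaces a real HTTP request.

```python
# Hypothetical in-memory stand-ins for the two data sets of Figure 1.
# Each subject URI maps to the (predicate, object) pairs stored with it.
DATASETS = {
    "ds1:res1": [],                       # Dataset 1: entity without outgoing links
    "ds2:res2": [("ex:p1", "ds1:res1")],  # Dataset 2: entity plus the ex:p1 link
}


def dereference(uri):
    """Return the statements stored with `uri` (stand-in for an HTTP lookup)."""
    return DATASETS.get(uri, [])


def reachable(start):
    """Collect every URI a client can reach by following outgoing links only."""
    seen, todo = set(), [start]
    while todo:
        uri = todo.pop()
        if uri in seen:
            continue
        seen.add(uri)
        todo.extend(obj for _pred, obj in dereference(uri))
    return seen


print(sorted(reachable("ds2:res2")))  # both entities are found
print(sorted(reachable("ds1:res1")))  # only ds1:res1 itself is found
```

Starting from ds2:res2 the traversal finds both entities, while starting from ds1:res1 it finds nothing else, which is the gap a back link stored in Dataset 1 would close.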
[Figure 1, panel (1): ds2:res2 links to ds1:res1 via ex:p1; the back link is marked "?".
Panel (2): Dataset 1 holds ds1:res1; Dataset 2 holds ds2:res2 and the ex:p1 link.]

Figure 1: Linked Data Links and Storage Locations

Links between different data sets cannot, so far, be created automatically without complex
entity recognition schemes or data structure conventions. Thus, link creation is often based on
human interaction, which is a tedious process and is only practicable between two different
data sets at a time. An automatic or supporting process for link generation would be desirable,
even if only a subset of possible links can be discovered. In the “classic” WWW, links are often
created on the basis of a link exchange; web authors communicate the intent of linking to each
other’s sites, a process that can be beneficial for both sites and their visitors. The number of
links is kept low so as not to distract readers. For Linked Data entities, however, a large
number of links to other entities does not disrupt their usage, as these entities are mainly
published for use by computer programs. Hence, as content is machine-readable, link exchange
can be performed automatically.

The Linked Data specification defines the Hypertext Transfer Protocol (HTTP) as the underlying
data exchange protocol. Linked Data entities are thus requested and served using this protocol.
The HTTP specification defines the Referer³ header field as part of HTTP requests [7]. This
field can be set by the user agent program to the URL of the site that it was referred from.

³ This spelling is used in this paper to be consistent with the HTTP specification.

        “The Referer[sic] request-header field allows the client to specify, for the server’s
        benefit, the address (URI) of the resource from which the Request-URI was obtained [...]
        The Referer request-header allows a server to generate lists of back links to resources
        for interest, logging, optimized caching, etc.” [7, sec. 14.36]

The value of the Referer header is commonly added to request log files by standard web servers,
for example by the Apache HTTP Server. For human-only web sites, the Referer values are
currently mainly analyzed to track visitor sources such as search engine queries. In the case of
Linked Data, the part of the Referer definition concerning lists of back links is more relevant:
if RDF crawlers and user agents set this field correctly, a program could generate back links to
local entities from the web server’s log files and increase the overall connectivity of the
Linked Data cloud.

If Referer information is to be used to create links between RDF entities, the link property
URI has to be determined first, as RDF does not allow untyped links. For very generic cases, the
RDF Schema (RDFS) specification defines the rdfs:seeAlso property, which “indicates an entity
that might provide additional information” [4]. However, the Linked Data specification allows
the retrieval of remote entities (“dereferencing”) in order to gain more information about that
entity. The dereferenced remote RDF document can then be processed into RDF statements, possibly
yielding the link property that was used to refer to a local entity. Reconsider the situation
depicted in Figure 1: if a Referer value of ds2:res2 is logged for an HTTP request to the server
hosting ds1:res1 as part of Dataset 1, an automatic process can retrieve the document describing
ds2:res2 to determine the property of the link pointing to ds1:res1, in this case ex:p1.

One of the strengths of RDF is the possibility to describe the vocabularies used to link
entities in a machine-readable and dereferenceable way as well. This description can be encoded
using either RDFS or the Web Ontology Language (OWL) [1]. Using the owl:inverseOf property, a
property itself can define which property is to be used for back links. For example, the link
property hasChild could have the inverse link property hasParent. Alternatively, vocabularies
can specify properties to be symmetric; for example, the property hasFriend could be defined to
be symmetric (assuming a mainstream sociocultural environment). Should a link property neither
have an inverse link property nor be defined to be symmetric, the remote statement linking the
local and remote entity can be included in the local data set. Since agents can follow
properties regardless of their direction, these links can be useful to them as well.

Figure 2 gives examples for both cases. In both pictures, the dashed elements are new to the
local data set. If the inverse property is unknown, the remote statement is included (1). If the
inverse property is known, for example by dereferencing the property URL, the correct link
property ex:p2, known to be the owl:inverseOf ex:p1, is included along with the entity URL of
the remote resource ds2:res2 (2).

[Figure 2, panel (1): ds1:res1 and ds2:res2 connected by ex:p1. Panel (2): ds1:res1 and ds2:res2
connected by ex:p1 and ex:p2, with ex:p2 related to ex:p1 by owl:inverseOf.]

Figure 2: New Back Link Properties

From these prerequisites, the automatic generation or recommendation of back links in the Linked
Data context is possible. The following algorithm can be executed fully automatically,
and – given Referers are supplied by the user
agents – will generate new and meaningful links between          USEWOD 2011 Data Challenge [2]. The first set of files
Linked Data entities in different data sets. In the follow-      was created on the web server of the DBpedia project, the
ing, RDF statements are encoded as triples in the triple no-     second set on the web server hosting the Semantic Web Dog
tation (subject, predicate, object). Algorithm 1 details the     Food project. Both servers used the Apache “combined”
process of link (and statement) generation: After the document   log format⁴, which is the default setting. Each log entry
pointed to by the Referer URL has been retrieved,                is represented by one line in the log file. Each log entry
two cases are differentiated: If the response contains RDF       is similar to the following sample entry in the “combined”
statements, they are checked whether the local entity URL        format (line breaks added, not an actual log entry):
occurs as subject or as object. If the local entity occurs as
a subject, the remote statement is returned. If the local        160.45.170.10 [07/Jan/2010:09:52:45 -0800]
entity occurs as an object in one of the statements, three       "GET /resource/South_Bend,_Indiana HTTP/1.1"
cases are possible: First, the link property may be symmet-      303 40
ric, in this case it is used to create the connecting state-     "http://en.openei.org/wiki/South_Bend,_Indiana"
ment (Line 11). Second, if the inverse property is known,        "Mozilla/4.0"
that property is used to create the new statement (Line 14).
Third, if neither is the case, the remote statement is           The format is structured into fields for client IP address,
returned as well. For non-RDF documents, a string search         date and time, HTTP request method and URL, status code,
for the URI of the local entity within the remote document       bytes transmitted for the response, “Referer” request header
is performed; if a match is found, an rdfs:seeAlso link is       field, and user agent (browser). In order to generate new
created as well (Line 22), since this link property explicitly   links, two things have to be determined: First, the URL of
allows linking to non-RDF resources [4].                         the local resource that was requested, and second the URL
                                                                 of the remote resource the user agent visited before. This
Algorithm 1 Link Generation from Referers                        data can be taken from the described log file format.
Require: Requested local entity URL u, Referer URL r
 1: rdoc ← retrieve(r)                                           In total, about 27.86 million log entries were parsed,
 2: if isRDF(rdoc) then                                          filtered, and checked for “interesting” Referer entries.
 3:    statementSet ← parseRdf(rdoc)                             Filtering included the removal of log entries without the optional
 4:    for all statementSet as s do                              Referer field, local redirects, and log entries with Referer
 5:       if subject(s) == u then                                entries pointing to result pages of search engines such as
 6:          return s                                            Google, Yahoo, etc. For all remaining entries, the Referer
 7:       end if                                                 URL was resolved, and the resulting HTML or RDF docu-
 8:       if object(s) == u then                                 ment searched for the URL of the local resource identifying
 9:          p ← predicate(s)                                    a local entity. Requests expressed their preference for RDF
10:          if isSymmetric(p) then                              document responses using the Accept HTTP header. Thus,
11:             return (u, p, subject(s))                        this operation was defined to have four possible outcomes:
12:          end if
13:          if hasInverseProperty(p) then
14:             return (u, inverse(p), subject(s))                  • Not found – The local resource was not found in the
15:          end if                                                   remote document, neither in plain text nor RDF
16:          n ← createNewLocalUrl()
                                                                    • Text match – the local resource was found occurring
17:          return s
                                                                      in a plain text or HTML response
18:       end if
19:    end for                                                      • RDF subject match – the local resource was found in
20: else                                                              a remote RDF statement as the subject entry
21:    if contains(rdoc, u) then
22:       return (u, rdfs:seeAlso, r)                               • RDF object match – the local resource was found in a
23:    end if                                                         remote RDF statement as the object entry. In the last
24: end if                                                            case, the properties used to link to the local resource
                                                                      were also recorded
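Algorithm 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name generate_statement and its parameters are hypothetical, the Referer document is assumed to have been retrieved and parsed already, and the symmetric and inverse properties are passed in as plain lookups instead of being dereferenced. Back links are oriented from the local entity u to the remote entity, as the text describes.

```python
RDFS_SEE_ALSO = "http://www.w3.org/2000/01/rdf-schema#seeAlso"


def generate_statement(u, r, statements=None, text="",
                       symmetric=frozenset(), inverse_of=None):
    """Return one (subject, predicate, object) triple, or None if u is not found.

    u          -- URL of the requested local entity
    r          -- Referer URL
    statements -- triples parsed from the Referer document, None if it is not RDF
    text       -- body of a non-RDF Referer document
    symmetric  -- properties assumed to be known as symmetric
    inverse_of -- assumed mapping from a property to its owl:inverseOf partner
    """
    inverse_of = inverse_of or {}
    if statements is not None:                    # RDF response
        for s, p, o in statements:
            if s == u:                            # local entity is the subject
                return (s, p, o)
            if o == u:                            # local entity is the object
                if p in symmetric:
                    return (u, p, s)              # same property, reversed
                if p in inverse_of:
                    return (u, inverse_of[p], s)  # inverse property
                return (s, p, o)                  # keep the remote statement
        return None
    if u in text:                                 # plain text or HTML response
        return (u, RDFS_SEE_ALSO, r)
    return None


remote = [("ds2:res2", "ex:p1", "ds1:res1")]
print(generate_statement("ds1:res1", "ds2:res2", remote,
                         inverse_of={"ex:p1": "ex:p2"}))
```

Like Algorithm 1, the sketch returns on the first matching statement; a tool that should collect all matches would accumulate results instead of returning early.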
The statements generated by this algorithm can now be used
in a variety of ways. We propose two methods: First, the
statements could be handed over for review by another soft-      For RDF matches (not considering possible links to HTML
ware component or the person responsible for the local data      documents), new statements linking the local and remote
set. Second, an automatic inclusion into the data set is also    resources were generated according to our algorithm. Then,
feasible. In this case, we recommend storing the statements      an additional request was performed on the local data set
in a separate Named Graph, along with a machine-readable         to see whether the local data set already contains this state-
provenance annotation, for example using the Provenance          ment. If this was not the case, the new statement could have
Vocabulary [9].                                                  been added to the data set.

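The two handling steps described here, checking whether a generated statement is already present and keeping new statements in a separate Named Graph, can be sketched as below. The sketch rests on assumptions that are not from the paper: the local data set is available as a collection of triples, all terms are absolute URIs (no literals or blank nodes), N-Quads is chosen as one possible serialization, and the graph URI and function names are hypothetical. The provenance annotation itself is left out.

```python
def new_statements(generated, local_dataset):
    """Keep only generated triples that the local data set does not contain yet."""
    existing = set(local_dataset)
    return [t for t in generated if t not in existing]


def to_nquads(triples, graph_uri):
    """Serialize URI-only triples as N-Quads lines placed in one named graph."""
    return [f"<{s}> <{p}> <{o}> <{graph_uri}> ." for s, p, o in triples]


# Hypothetical example data
SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
local = [("http://example.org/a", SAME_AS, "http://example.com/a")]
generated = [
    ("http://example.org/a", SAME_AS, "http://example.com/a"),  # already known
    ("http://example.org/b", SAME_AS, "http://example.com/b"),  # new
]

GRAPH = "http://example.org/graphs/referer-links"  # hypothetical graph URI
for line in to_nquads(new_statements(generated, local), GRAPH):
    print(line)
```

Keeping the generated links in their own graph means they can be reviewed, or removed again, without touching the curated part of the data set.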
3.   EVALUATION                                                  The frequencies of the possible outcomes mentioned above
To answer our research question and validate our algorithm,      as well as the properties used for object matches can give
real log files from web servers hosting Linked Data sets were    ⁴ http://httpd.apache.org/docs/current/logs.html#combined
analyzed. Two sets of log files were made available for the
an indication whether the additional links created using our approach merit the additional
effort of analyzing log files for Referer entries.

Table 1 contains the detailed results of our evaluation. For each data set, the raw number of
log entries, the number of log entries with Referers, the number of Referer URLs ultimately
dereferenced, and the number of unique dereferencing results are given in the first block. The
second block details the frequencies for the different result types as described above. The
third block gives the total number of statements that could be generated from our results
according to our algorithm, and the number of those statements that were not yet contained in
the respective data set. The quality of the generated statements was evaluated using manual
inspection, and no obviously bogus statements were found. It has to be noted that we limited
the generation of new statements to RDF matches, since they enable more meaningful back links.
Since this analysis included “live” data⁵, results may vary for repeated analyses of the same
log file set.

                         DBpedia         SWDF
    Log entries       19,770,157    8,092,552
    Referer set        1,328,595      533,188
    Dereferenced           4,217       20,451
    Unique Results         3,255        6,146

    Result type
    Not found              2,229        4,821
    Text match               431        1,168
    Subject match            395           47
    Object match             200          110

    Statements
    Total                    595          157
    New                      507          136

                  Table 1: Evaluation Results

The most frequent properties used in object matches are given in Table 2 for the two data sets.
Entries with fewer than ten occurrences are not included. Both the dereferencing results and the
statements generated for the respective data sets are available online⁶ in order to support
further analysis.

    Property URI                                       Freq.
    DBpedia
    http://www.w3.org/2002/07/owl#sameAs                 95
    http://dbpedia.org/ontology/wikiPageRedirects        77
    http://rdfs.org/sioc/ns#links_to                     21
    http://www.rkbexplorer.com[..]#duplicate              3

    SWDF
    http://www.w3.org/2002/07/owl#sameAs                 42
    http://xmlns.com/foaf/0.1/knows                      35
    http://www.w3.org[..]rdf-schema#seeAlso              16

                 Table 2: Link Property Usage

Two main conclusions can be drawn from our evaluation: First, the generation of new links
between Linked Data entities is indeed possible using log files that contain Referer values.
Second, the comparatively small number of statements generated shows the failure of Linked Data
clients and crawlers to properly set the Referer header.

4.     RELATED WORK
Link discovery between data entities across data sets requires record linkage and duplicate
detection techniques. While there is a large amount of related work on these topics in the
database community [15, 5] as well as on ontology matching in the knowledge representation
community [6], the approaches for Linked Data are still limited at the moment.

The Silk Link Discovery Framework [11] is an identity resolution framework which generates RDF
links between data items based on user-provided link specifications, which are expressed using
the Silk Link Specification Language. Silk is available in different variants, one of them
being Silk Server. Silk Server can be used as an identity resolution component within
applications that consume Linked Data from the Web. It provides an HTTP API for matching
instances from an incoming stream of RDF data.

LIMES [12] is a link discovery framework for the Web of Data. It is available as a web interface
as well as a standalone tool. It offers string metrics.

LinQuer [10] is a tool for semantic link discovery over relational data, based on string and
semantic matching techniques and their combinations. The LinQuer framework rewrites linkage
requirement queries into standard SQL queries that can be run over relational data sources.
LinQuer is meant to be used together with relational-database-to-RDF wrappers such as D2R Server
or Virtuoso RDF Views.

Raimond et al. [14] propose a link discovery algorithm that takes into account both the
similarities of data entities on the Web of Data and of their neighbor entities. The algorithm
is implemented within the GNAT tool.

The RKBExplorer sameAs service⁷ provides a unified view over different Linked Data sets by
managing owl:sameAs links to identify duplicate URIs. The links have to be provided to the
system from external sources, which also applies to the related BackLink service.

Most of the current approaches generate links semi-automatically based on user-defined link
specifications. This requires data providers to keep up with new linking possibilities and
schemata. Furthermore, except for Silk Server and RKBExplorer’s sameAs service, data sets to be
linked have to be specified manually. This does not scale for the growing number of data sets on
the Web of Data.

⁵ Accessible on 2011/03/10
⁶ http://page.mi.fu-berlin.de/muehleis/ldow2011/
⁷ http://www.rkbexplorer.com/sameAs/
5.    CONCLUSION
Acting on the fourth Linked Data principle, namely the need for cross-dataset links between
Linked Data entities, we have identified the Referer request header field defined by the HTTP
specification as a possible source for automatic creation of those links. However, the presence
of a Referer URL does not prove the presence of an existing link to a local entity. Thus, our
approach is based on applying the third Linked Data principle, the possibility of dereferencing
arbitrary URLs, to the Referer URL. When retrieving the document identified by the Referer, we
were able to ascertain the presence of a link from a remote entity to a local entity along with
the link type used. We were then also able to determine the semantically correct back link
property and create a new locally stored back link leading from a local entity to a remote
entity.

We have evaluated our fully automatic approach using log entries from web servers hosting the
DBpedia and Semantic Web Dog Food data sets. In total, 27.86 million log entries were analyzed,
and 24,668 Referer URLs were dereferenced, yielding 9,401 distinct results. From these results,
we were able to generate 643 new typed links. Our results show the feasibility and
practicability of automatic back link generation for Linked Data entities using Referer
information in general and web server log files in particular.

From our results, the failure of many Linked Data clients and spider programs to add the Referer
header field to their requests was identified to be the main factor limiting the number of
statements generated by our algorithm. We therefore would like to urge developers of Linked Data
tools to set the Referer request header, whenever possible, to the resource where the URL of the
document currently retrieved was found.

5.1    Further Work
Since our approach can be used to directly add statements based on information loaded from
remote sources, the statements generated are easily susceptible to malicious requests and
malicious remote statements. For example, if an attacker published RDF data linking a popular
DBpedia entity (e.g. dbpedia:Berlin) to an advertisement page, and then issued a request to this
entity with that document as Referer, the algorithm would automatically create a link from the
popular resource to the advertisement page. To overcome this problem, one could evaluate
provenance information in order to establish and enforce a required trust level before new links
are created [9].

We would also like to create a generic tool for Linked Data server administrators, which they
can use to automatically process their log entries for interesting Referers, generate new back
links, and automatically publish these links again in their local data set. Alternatively, the
tool could also display the new statements to an administrator for approval.

Acknowledgments
This work has been partially supported by the “DigiPolis” project funded by the German Federal
Ministry of Education and Research (BMBF), support code 03WKP07B. The authors would like to thank
the reviewers and their colleagues R. Oldakowski and M. Luczak-Rösch for their insights.

6.   REFERENCES
 [1] Sean Bechhofer, Frank van Harmelen, Jim Hendler, et al. OWL web ontology language
     reference, 2004.
 [2] B. Berendt, L. Hollink, V. Hollink, M. Luczak-Rösch, K. H. Möller, and D. Vallet.
     USEWOD2011 - 1st international workshop on usage analysis and the web of data. In 20th
     International World Wide Web Conference (WWW2011), Hyderabad, India, 2011.
 [3] Tim Berners-Lee. Linked data, 2006. http://www.w3.org/DesignIssues/LinkedData.html
     accessed 2010-08-12.
 [4] Dan Brickley, R.V. Guha, and Brian McBride. RDF vocabulary description language, 02 2004.
 [5] Ahmed K. Elmagarmid, Panagiotis G. Ipeirotis, and Vassilios S. Verykios. Duplicate record
     detection: A survey. IEEE Trans. on Knowl. and Data Eng., 19(1):1–16, 2007.
 [6] Jérôme Euzenat, Alfio Ferrara, Christian Meilicke, et al. Results of the ontology alignment
     evaluation initiative 2010. In Proc. 5th ISWC workshop on ontology matching (OM),
     Shanghai (CN), pages 85–117, 2010.
 [7] Fielding, Gettys, Mogul, Frystyk, Masinter, Leach, and Berners-Lee. Hypertext transfer
     protocol - HTTP/1.1, 1999.
 [8] Olaf Hartig and Andreas Langegger. A database perspective on consuming linked data on the
     web. Datenbank-Spektrum, Semantic Web Special Issue, 10 / 2010, 2010.
 [9] Olaf Hartig and Jun Zhao. Publishing and consuming provenance metadata on the web of linked
     data. In Deborah L. McGuinness, James Michaelis, and Luc Moreau, editors, IPAW, volume 6378
     of Lecture Notes in Computer Science, pages 78–90. Springer, 2010.
[10] Oktie Hassanzadeh, Reynold Xin, Renée J. Miller, Anastasios Kementsietsidis, et al. Linkage
     query writer. PVLDB, 2(2):1590–1593, 2009.
[11] Robert Isele, Anja Jentzsch, and Christian Bizer. Silk Server - Adding missing Links while
     consuming Linked Data. In 1st International Workshop on Consuming Linked Data (COLD 2010),
     Shanghai, 2010.
[12] Axel-Cyrille Ngonga Ngomo and Sören Auer. LIMES - a time-efficient approach for large-scale
     link discovery on the web of data, 2011.
[13] Eyal Oren, Renaud Delbru, Michele Catasta, Richard Cyganiak, et al. Sindice.com: a
     document-oriented lookup index for open linked data. Int. J. of Metadata and Semantics and
     Ontologies, 3:37–52, November 10 2008.
[14] Yves Raimond, Christopher Sutton, and Mark Sandler. Automatic interlinking of music
     datasets on the semantic web, 2008.
[15] William E. Winkler. Overview of record linkage and current research directions. Technical
     report, Bureau of the Census, 2006.