<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Statistical Analysis of Web of Data Usage</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Markus Luczak-Rösch</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Markus Bischoff</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Freie Universität Berlin, Networked Information Systems WG</institution>
          ,
          <addr-line>14109 Berlin</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The Linked Data initiative has gained momentum inside as well as outside of the research community. Thus, it is already an accepted research issue to investigate usage mining in the context of the Web of Data from various perspectives. We are currently working on an approach that applies such usage mining methods and analysis to support ontology and dataset maintenance tasks. This paper presents one part of this work, namely a method to detect errors or weaknesses within ontologies used for Linked Data population based on statistics and network visualizations. We contribute a detailed description of a log file preprocessing algorithm for Web of Data endpoints, a set of statistical measures that help to visualize different usage aspects, and an exemplary analysis of one of the most prominent Linked Data sets, DBpedia, aimed at showing the feasibility and the potential of our approach.</p>
      </abstract>
      <kwd-group>
        <kwd>linked data</kwd>
        <kwd>web usage mining</kwd>
        <kwd>ontology maintenance</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>The Linked Data initiative has gained momentum inside as well as outside of the
research community, as the recent open government data approaches illustrate.
It is therefore reasonable to expect that the real-world
usage of Linked Data, in the sense of querying and accessing it, will increase. It
is already an accepted research issue to investigate usage mining in the context
of the Web of Linked Data (Web of Data for short). We are currently working
on an approach that applies such usage mining methods and analysis to support
dataset and ontology maintenance. This paper presents one part of this work, namely
a method to detect errors and weaknesses within ontologies used for Linked
Data population based on statistical measures and their visualization by use of
a network analysis tool.</p>
      <p>1.1 Motivation, Terminology and Challenges</p>
      <p>It is not in all cases trivial to apply the methods from classical Web usage
mining to this new discipline, which one could call Web of Data usage mining. A first
problem is the terminology as it is familiar to people in the context of the Web
of documents. To the best of our knowledge only one W3C effort exists which aimed
to define a terminology that characterizes the structure and the content of the
Web<sup>1</sup>. This terminology does not properly cover the entities which are of interest
on the Web of Data: resources that represent individual "things" named by URIs
(or IRIs respectively) and a collection of RDF statements about such resources
served in one place (a dataset) maintained by a Web data publisher. So far
this is only a need for an adapted set of terms. But, even though it is not a
requirement for a Linked Data endpoint to offer a SPARQL endpoint, lots of
dataset providers on the Web of Data do so. Hence, resources on the Web of
Data are requested directly via their URIs and by use of SPARQL queries, which
raises at least one central problem: the Web server very often observes requests for only one
single Web resource (the SPARQL endpoint URI) while potentially
more than one resource has been accessed as part of the query patterns.</p>
      <p>Analyzing server logs is an intuitive way to perform Web usage mining.
However, another problem on the Web of Data in its current shape is that the meaning
of HTTP status codes<sup>2</sup> does not always hold. When a client accesses a URI
which does not point to any resource on a Web server, the server responds with the
404 code. The SPARQL protocol<sup>3</sup> requires servers to respond with the 200 HTTP
status code and a serialization of the SPARQL results format that contains no
bindings in the case that a SELECT query is performed correctly but yields
an empty result set. The HTTP 1.1 status code definitions<sup>4</sup> would recommend
the use of the 204 status code in this case. At first sight this looks like a misuse of HTTP
response codes, but it may also be a desired feature for developers
who handle empty result sets in an application-dependent way and detect them when
the serialization of the result is processed. During our intensive work with logs
from several Web of Data endpoints such as DBpedia<sup>5</sup>, the Semantic Web Dog
Food server<sup>6</sup>, and Linked Geo Data<sup>7</sup> we observed that queries must be re-run to
find out whether they returned any result or not.</p>
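      <p>The distinction between an empty and a non-empty result can only be made on the response body, which a minimal Python sketch may illustrate. It assumes that the endpoint returns the SPARQL JSON results format; the function name has_bindings and both sample documents are hypothetical and not taken from the analyzed logs.</p>
      <preformat><![CDATA[
```python
import json

# Hypothetical response bodies; both would be delivered with HTTP status 200.
EMPTY_RESULT = '{"head": {"vars": ["s"]}, "results": {"bindings": []}}'
FILLED_RESULT = (
    '{"head": {"vars": ["s"]}, "results": {"bindings": ['
    '{"s": {"type": "uri", "value": "http://dbpedia.org/resource/Guano_Apes"}}]}}'
)


def has_bindings(body):
    """Return True if a SELECT result in SPARQL JSON format has at least one binding."""
    document = json.loads(body)
    return len(document.get("results", {}).get("bindings", [])) > 0


print(has_bindings(EMPTY_RESULT))   # False
print(has_bindings(FILLED_RESULT))  # True
```
]]></preformat>
      <p>Since the log only records the status code and the response size, this check is not possible on the log alone, which is why the queries have to be executed again.</p>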
      <p>Listing 1.1. Anonymized excerpt of a DBpedia log file showing some of the different
types of requests and the returned HTTP status codes.</p>
      <preformat>xxx.xxx.xxx.xxx [21/Sep/2009:00:00:00 -0600]
"GET /page/Jeroen_Simaeys HTTP/1.1"
200 26777 ""
"msnbot/2.0b (+http://search.msn.com/msnbot.htm)"
xxx.xxx.xxx.xxx [21/Sep/2009:00:00:00 -0600]
"GET /resource/Guano_Apes HTTP/1.1"
303 0 ""
"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
xxx.xxx.xxx.xxx [21/Sep/2009:00:00:01 -0600]
"GET /sparql?query=PREFIX+rdfs%3A+%3Chttp%3A%2F%2Fwww.w3.org..."
200 1844 ""
""</preformat>
      <p>1 http://www.w3.org/1999/05/WCA-terms
2 http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html
3 http://www.w3.org/TR/rdf-sparql-protocol/
4 http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html
5 http://dbpedia.org
6 http://data.semanticweb.org/
7 http://linkedgeodata.org/</p>
      <p>The above mentioned problems show that it is an interesting issue to analyze
usage on the Web of Data, especially requests against SPARQL endpoints.
This paper deals with the research question of how usage analysis can support the
maintenance of linked datasets. Altogether we make three central contributions:
first, an innovative log file preprocessing algorithm for Web of Data endpoints;
second, a set of statistical measures that help to visualize different usage aspects;
third, a statistical analysis of the usage of the DBpedia dataset with the purpose
of identifying problems in the data or the underlying schema. The remainder of
this paper is structured as follows: first, we present a survey of related work in
the following subsection. Afterwards, Section 2 introduces our preprocessing
algorithm for log files of Linked Data endpoints before Section 3 describes the
set of statistics and visualizations we propose for the analysis of the usage data.
Sections 4 and 5 complete this work with an evaluation of our approach
by an exemplary study and a discussion of the results as well as an outlook on
future work.</p>
    </sec>
    <sec id="sec-2">
      <title>1.2 Related Work</title>
      <p>
        Classical Web usage mining has been placed within the Web mining hierarchy
as a child of Web mining and a sibling to Web content mining [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Essential
parts of Web usage mining are the characteristic metrics and patterns one has
to identify, such as hits, page impressions, visits, time and navigation heuristics,
unique visitors, clickthrough, viewtime, sessions, path analysis, association rules,
sequential patterns, classification rules or clustering [
        <xref ref-type="bibr" rid="ref13 ref14">13,14</xref>
        ]. In this work we do
not apply complex data mining methods to our data, such as sequential pattern
mining or clustering, but remain on the statistical level.
      </p>
      <p>
        We mentioned several differences between the classical Web and the Web of
Data with reference to usage mining methods and techniques beforehand. Such
a difference is also recognizable when we regard the use of the Web of Data in
practice, which has been described in works such as [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ],[
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] and [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Altogether, one
can summarize that Linked Data typically is used (1) to provide unambiguous
concept identifiers within Web applications, (2) to enhance the experience of
Web users by aggregation and integration of corresponding content within CMS
systems and Web applications, and (3) to be browsed and mashed up in a
user-specific way. It becomes apparent that the classical browsing scenario plays a
minor role and is outweighed by the access and use of Web resources through
libraries or applications which are not or only indirectly connected with a human
user's interaction, and that the SPARQL<sup>8</sup> query language plays an important role in
these scenarios.
      </p>
      <p>
        Already in 2002 and again in 2004 Berendt et al. [
        <xref ref-type="bibr" rid="ref2 ref3">2,3</xref>
        ] identified a new research
area, the so-called Semantic Web mining. The authors describe how the two
disciplines, namely the Semantic Web and Web mining, may converge. They
present three perspectives which reflect this: First, the perspective how Web
mining can help to extract semantics from the Web. Second, the exploitation of
      </p>
      <sec id="sec-2-1">
        <title>8 http://www.w3.org/TR/rdf-sparql-query/</title>
        <p>
          semantics for Web mining. And third, the perspective of mining the Semantic
Web. The latter perspective is the one which best matches the focus of our
work. It is subdivided into Semantic Web structure and content mining as well as
Semantic Web usage mining. Again, the latter is the most
interesting one with reference to our work because it deals with the analysis of
the usage of semantic data on the Web. Even though Berendt et al. mention one
early approach that could result in log files which contain information about the
usage of semantically rich content [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], it seems that since then research
in that area and in the analysis of such log files has not been very active.
        </p>
        <p>
          Today this area gains new momentum due to the broader success of the
Linked Data ideas. To the best of our knowledge, in 2010 Möller et al. [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] published
the next notable piece of work in this area. As a motivation for Linked Data
usage analysis the authors raise a set of challenges, namely reliability,
peak-load, performance, usefulness, and attacks. Möller et al. address these challenges
by analyzing raw logs in order to learn about user clients, requested content
types, and the structure of SPARQL queries. Our work relies on the above
mentioned challenges but addresses them under a different scope. We preprocess
the logs in order to analyze the usage data on the level of basic graph patterns
and the ontology primitives used in them.
        </p>
        <p>
          Even after a very recent workshop on usage analysis and the Web of Data<sup>9</sup> [
          <xref ref-type="bibr" rid="ref4 ref5">4,5</xref>
          ]
this perspective is still unique. Only two papers at the workshop were related
to log file analysis and worked upon the USEWOD challenge dataset, which is
partially a subset of the data we are working on. Kirchberg et al. [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] present an
approach that combines data about real-world events and log files to retrieve a
notion of time-windowed relevance of data. An analysis of the syntactical
and structural use of SPARQL in real-world scenarios, intended to provide
recommendations for index and store designers, was introduced by Arias et al. [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ].
        </p>
        <sec id="sec-2-1-1">
          <title>2 Log File Preprocessing</title>
          <p>To overcome the above mentioned issues with log files of Web of Data endpoints
we propose an innovative preprocessing method. Our approach runs on server log
files following the extended common log format<sup>10</sup>. These logs contain information
about the access to RDF resources via their URIs and SPARQL queries. The first
step of our preprocessing is to clean the log of all entries that contain 40x and
50x response codes. Afterwards we transform each single request for resources
into a SPARQL DESCRIBE query to obtain a normalized view of the usage
of the dataset on the level of SPARQL queries. For all (1) basic graph patterns
and (2) triple patterns of each single query, as well as the original query itself,
we perform auto-generated queries that yield information about the success
of individual graph patterns and triple patterns and about the existence of resources and
predicates in the dataset. The pseudocode of our algorithm is shown in Listing 1.2
and the resulting usage database in Figure 1.</p>
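          <p>The first two steps described above (cleaning and normalization) could be sketched in Python as follows. This is only an illustration under stated assumptions and not the algorithm of Listing 1.2: the regular expression follows the layout of Listing 1.1, and BASE_URI, SPARQL_PATH and the function name to_query are hypothetical. The subsequent re-execution of queries per graph pattern and triple pattern is not covered.</p>
          <preformat><![CDATA[
```python
import re
from urllib.parse import parse_qs, urlsplit

# Assumed line layout, modelled on Listing 1.1:
# host [timestamp] "METHOD target PROTOCOL" status size ...
LOG_LINE = re.compile(r'^(\S+) \[([^\]]+)\] "(\S+) (\S+) [^"]*" (\d{3}) (\d+)')

BASE_URI = "http://dbpedia.org"  # assumption: host name of the dataset
SPARQL_PATH = "/sparql"          # assumption: path of the SPARQL endpoint


def to_query(line):
    """Map a log line to a SPARQL query string, or None if the entry is dropped."""
    match = LOG_LINE.match(line)
    if match is None:
        return None
    _host, _time, _method, target, status, _size = match.groups()
    # Step 1: clean the log of entries with 40x and 50x response codes.
    if status.startswith(("40", "50")):
        return None
    parts = urlsplit(target)
    if parts.path == SPARQL_PATH:
        # SPARQL requests carry the URL-encoded query in the "query" parameter.
        values = parse_qs(parts.query).get("query")
        return values[0] if values else None
    # Step 2: rewrite a plain resource request as a DESCRIBE query.
    return "DESCRIBE <" + BASE_URI + parts.path + ">"


PREFIX = 'xxx.xxx.xxx.xxx [21/Sep/2009:00:00:00 -0600] '
RESOURCE_LINE = PREFIX + '"GET /resource/Guano_Apes HTTP/1.1" 303 0 "" ""'
SPARQL_LINE = PREFIX + (
    '"GET /sparql?query=SELECT+%2A+WHERE+%7B%3Fs+%3Fp+%3Fo%7D HTTP/1.1" 200 1844 "" ""'
)
MISSING_LINE = PREFIX + '"GET /resource/Missing HTTP/1.1" 404 0 "" ""'

for sample in (RESOURCE_LINE, SPARQL_LINE, MISSING_LINE):
    print(to_query(sample))
```
]]></preformat>
          <p>In this sketch every non-SPARQL request path is treated as a resource request; a real implementation would have to decide how HTML page requests such as /page/... are mapped to resource URIs.</p>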
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>9 http://data.semanticweb.org/usewod/2011/ 10 http://www.w3.org/TR/WD-logfile.html</title>
        <sec id="sec-2-2-1">
          <title>3 Visualization of Web of Data Usage</title>
          <p>The visualization of the collected data is done with an extension of the software
"SONIVIS:Tool"<sup>11</sup>, which enables network generation and analysis. We
implemented network visualizations for different perspectives on usage data, e.g. ontology,
request hosts or time perspectives. Each perspective is supported by a set of
widgets that represent detailed information about a selected entity of the network.
To visualize the usage data on the basis of a given ontology, a transformation
of the preprocessed data is necessary. Hence, a mapping is established between the resources
used in queries and the classes which represent the corresponding types in the
ontology that was used for data population in the respective dataset.
In this section we introduce each of the implemented visualizations, the
underlying metrics and the interpretations of observations which the
visualizations make possible. Due to limited space we do not present images of each visualization here
but do so for a representative selection in Section 4.</p>
          <p>11 see http://sonivis.org</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3.1 Ontology Heat Map</title>
      <p>The ontology heat map provides an overview of the associated ontology
primitives<sup>12</sup> of resources and predicates being used in queries. This is the global
perspective on ontology usage. Its concept of a network visualization with weighted
nodes and edges, a so-called heat map, is also the basic concept of all further
visualizations.</p>
      <p>Views: The central network view shows how often a specific primitive was used
in queries. The more a certain primitive is used, the bigger the corresponding
node in the graph view becomes, and a specific color is applied to it. Zoom levels
make it possible to focus on parts of the network which are of special interest. Two widgets
contain lists that support (a) the examination of corresponding primitives of the
resources that are present in the collected usage data and (b) statistical results
for each primitive (count, absolute, relative).</p>
      <p>Metrics: The view is based on metrics that sum the number of requests for each
primitive that appears in triple patterns. "Count" is the absolute number of
occurrences used as a specific part of triple patterns. "Absolute" is the percentage
of triple patterns using a chosen primitive out of all requested triple patterns.
"Relative" is the percentage of queries that had no variable in that part of the
triple pattern and used the chosen primitive.</p>
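      <p>One possible reading of these three metrics is sketched below in Python. The sketch assumes that each triple pattern has already been mapped to primitives and that variables are marked with None; it interprets "Relative" as the share among those triple patterns that have no variable in the selected position. The sample patterns and the function name heat_map_metrics are hypothetical.</p>
      <preformat><![CDATA[
```python
from collections import Counter

# Hypothetical triple patterns after mapping resources to primitives;
# None marks a variable in that position.
PATTERNS = [
    ("Person", "birthplace", None),
    ("Person", "birthdate", None),
    (None, "birthplace", "Place"),
    ("Band", None, None),
]
SUBJECT, PREDICATE, OBJECT = 0, 1, 2


def heat_map_metrics(patterns, position):
    """Return {primitive: (count, absolute, relative)} for one triple position.

    absolute: percentage of all triple patterns using the primitive.
    relative: percentage of the patterns without a variable in this position.
    """
    patterns = list(patterns)
    bound = [p[position] for p in patterns if p[position] is not None]
    counts = Counter(bound)
    return {
        primitive: (count, 100.0 * count / len(patterns), 100.0 * count / len(bound))
        for primitive, count in counts.items()
    }


subject_metrics = heat_map_metrics(PATTERNS, SUBJECT)
object_metrics = heat_map_metrics(PATTERNS, OBJECT)
print(subject_metrics)
print(object_metrics)
```
]]></preformat>
      <p>In the sample data, "Person" occurs as subject in two of four patterns (absolute 50%) and in two of the three patterns with a bound subject.</p>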
      <p>Interpretation: With the results of this visualization one gets an overview of the
general usage of an ontology which was used for data population. It is possible
to see which ontology primitives are the most important ones for the users. For
example, resources of a specific class that are used in 50% of the queries seem to
be very important for the users, while a class may be deleted from the ontology
if no one accesses instances of it. The heat map is a starting point to analyze
suspicious primitives in detail by use of other visualizations.</p>
    </sec>
    <sec id="sec-4">
      <title>3.2 Primitive Usage Statistics</title>
      <p>The primitive usage statistics help to find out in which context specific primitives
were used, i.e. in which combination of classes and predicates.</p>
      <p>Views: The visualization offers three perspectives, "Subject", "Predicate" and
"Object", each of them indicating the part of a triple for which a primitive is
valid. In "Subject", for example, one can choose a class that was used as subject
and visualize as a graph which predicates or objects are used in
combination with it.</p>
      <p>12 A primitive is a class if a subject or object in a triple pattern is analyzed. It is
derived by resolving the rdf:type property of the resource. A primitive is a property
when a predicate is analyzed.</p>
      <p>Metrics: The underlying metric groups the primitives being used in combination
with a selected class/property, counts how often such a pattern was used and
returns whether the triple succeeded.</p>
      <p>Interpretation: With this view one can examine which information users are
interested in with respect to a specific class or property. For example, if only one
specific predicate is used in combination with a class, users
are only interested in one property of the class. With this information the Web
data publisher can reason about how the ontology and the data should evolve. If
a combination is used often and successfully, then the ontology
is modeled well and there is data conforming to it. If another combination, which is
conceptually possible, is queried very often but fails most of the time, it means
that there is not enough data. It could be interesting to further investigate the
triples of the query to learn exactly which resources the users were interested
in. If a combination was used often but is not modeled in the ontology,
it could be necessary to adjust the ontology to enable this pattern if the queries
are reasonable. Furthermore, it is possible that some combinations of primitives
are modeled in the ontology but are never used in queries. In this case one
could think about deleting this part of the model and the corresponding data.</p>
    </sec>
    <sec id="sec-5">
      <title>3.3 Resource Usage Statistics</title>
      <p>The resource usage statistics provide a more detailed view on triples that match
a chosen pattern of primitives. Furthermore, the view shows whether the
ontology conforms to these triple combinations.</p>
      <p>Views: The resource usage statistics are directly accessed from the primitive usage
statistics and offer not a network visualization but a group of widgets. The core
is a table containing all triples that match a chosen primitive combination. Below
this table there are lists that contain the classes (or properties) associated with the
resources of a chosen triple. If a predicate is in focus in combination with other
primitives, there are two lists showing the domain and range of the predicate
so that one can see if the ontology permits the observed usage. If the selected
combination contains a subject-object pattern, then there is a table that shows
all properties that can be used between the two resources. A colored rectangle
indicates whether the triple can conceptually be answered successfully (it is
green) or if the ontology does not allow this combination of primitives (it is
red).</p>
      <p>Metrics: This metric aggregates every triple conforming to a selected triple
pattern. It calculates how often each distinct triple was requested. To check if a
triple request can conceptually be answered, the respective types of the resources
are compared with the domain and range of the property.</p>
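      <p>The domain and range comparison could look like the following Python sketch. It is a simplified illustration and not the implementation used in our tool: the toy ontology in DOMAINS and RANGES is invented, the type sets are assumed to already contain all superclasses, and a property without declared domain or range is treated as unrestricted.</p>
      <preformat><![CDATA[
```python
# Toy ontology, invented for illustration only (not the DBpedia ontology).
DOMAINS = {"instrument": "MusicalArtist"}
RANGES = {"instrument": "Instrument"}


def conceptually_possible(subject_types, predicate, object_types):
    """Return True if domain and range of the predicate permit the given types.

    The type sets are assumed to already include all superclasses.
    A missing domain or range is treated as unrestricted.
    """
    domain = DOMAINS.get(predicate)
    range_ = RANGES.get(predicate)
    domain_ok = domain is None or domain in subject_types
    range_ok = range_ is None or range_ in object_types
    return domain_ok and range_ok


# A musical artist playing an instrument is permitted (green rectangle).
print(conceptually_possible({"MusicalArtist", "Person"}, "instrument", {"Instrument"}))
# A band as subject is not covered by the assumed domain (red rectangle).
print(conceptually_possible({"Band"}, "instrument", {"Instrument"}))
```
]]></preformat>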
      <p>Interpretation: With this view one can take a close look at the resources the
users are interested in and find out why a specific primitive
combination failed. In general a request fails when there is no data that matches
the query. This can have two reasons: (1) The ontology is modeled in a way
that the combination is possible but there is a lack of data. (2) The successful
answering of a triple is conceptually not possible, so there cannot be any valid
data. In the first case one could extend the dataset as the users seem to be
interested in these facts. In the second case, if a lot of users asked for such a
failing combination, one could decide whether it is a good idea to extend the ontology
and populate data. When data is detected that does not conform to the ontology,
there are inconsistencies between the dataset and the ontology which should be
examined as described in Section 3.6.</p>
    </sec>
    <sec id="sec-6">
      <title>3.4 Hosts Statistics</title>
      <p>The hosts statistics visualize the point of origin of requests as well as the request
times of different hosts.</p>
      <p>Views: The first view shows which classes a certain host uses in its queries. It consists of
a table containing all hosts that request the dataset as well as the number of
requests of each host. Additionally it lists the classes and the number of
requests for each of them by the chosen host. The second view visualizes, starting from a
class or property, which hosts used that primitive as a certain part of a triple and
how often this happened. The third view shows the request times of a selected
host.</p>
      <p>Metrics: The underlying metrics aggregate the distinct hosts which used the
dataset and all triple patterns each host performed. Furthermore the request
times of each host are calculated in an hourly format.</p>
      <p>Interpretation: With the information provided by this perspective one can
analyze the origin of requests and which parts of the dataset are used by different
hosts. First, one gains statistical information about how many different hosts
use the dataset, whether there are hosts that make many requests at a certain point
in time, and whether different hosts access different and potentially specific parts of the
data. Concretely, with the visualization of the primitives being used in queries
of a specific host one can discover whether a host has a regular set of patterns in the
performed queries and a significant number of requests. In that case the Web data
publisher could serve the respective data separately to improve the performance of
query answering for this host.</p>
    </sec>
    <sec id="sec-7">
      <title>3.5 Time Statistics</title>
      <p>The time statistics provide a global temporal view of the data and show how
many requests were received by the service per hour to find out if there are times
of high load.</p>
      <p>Views: This view presents a bar chart that shows the number of all requests per
hour. Below this chart there is a table containing the exact amount of queries
for each time interval.</p>
      <p>Metrics: The metric calculates the absolute number of queries in an hourly
format.</p>
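      <p>As an illustration, the hourly aggregation could be computed as in the following Python sketch. It assumes that the timestamps have been extracted from the log in the form shown in Listing 1.1 with a signed time zone offset, and that month abbreviations are parsed in an English locale; the function name requests_per_hour and the sample timestamps are hypothetical.</p>
      <preformat><![CDATA[
```python
from collections import Counter
from datetime import datetime

# Hypothetical timestamps in the layout of Listing 1.1.
STAMPS = [
    "21/Sep/2009:00:00:00 -0600",
    "21/Sep/2009:00:59:59 -0600",
    "21/Sep/2009:13:05:00 -0600",
]


def requests_per_hour(timestamps):
    """Count requests per hour of day (0 to 23) in the server's local time."""
    hours = Counter()
    for stamp in timestamps:
        parsed = datetime.strptime(stamp, "%d/%b/%Y:%H:%M:%S %z")
        hours[parsed.hour] += 1
    return dict(hours)


hourly = requests_per_hour(STAMPS)
print(hourly)
```
]]></preformat>
      <p>Grouping the same counts additionally by host would yield the per-host request times used in the hosts statistics.</p>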
      <p>Interpretation: As already mentioned, the amount of requests per hour can help
to detect times of very high load. If that is the case one could decide to increase
the server capacity so that the service does not break down during these times.
On the other hand one can see times during which the dataset is not requested
very often so the capacity of the server could be decreased to save resources
and money. An observation of the time statistics over a time period can help to
reason about the popularity of a dataset.</p>
    </sec>
    <sec id="sec-8">
      <title>3.6 Error Statistics</title>
      <p>The error statistics represent errors and missing elements in the dataset to provide
information on what could be changed in the data and the underlying ontology
to improve the dataset with respect to the users' needs.</p>
      <p>Views: The first view visualizes inconsistencies between the data and the
ontology as a table of triples that should not exist according to the schema. Two
additional lists contain the domain and range of the predicate so one can check
which classes are permitted as subject and object. The second view shows
combinations of classes and properties that are not modeled in the ontology, which
means that the predicate itself exists but that it cannot be used in combination
with a certain class. The last view visualizes properties that are used in requests
but do not exist in the dataset. These properties and the number of times they are used
are listed within a table.</p>
      <p>Metrics: The first two metrics aggregate all requested combinations of primitives
in triple patterns and check in the ontology whether such a combination is
allowed. For invalid combinations one metric checks whether there are requests
for this pattern that succeeded, which would be an inconsistency. The other
metric simply lists the distinct triple patterns that are requested but fail due
to the modeling. A third metric lists properties which are used in queries but
which do not appear in the populated data.</p>
      <p>Interpretation: With the inconsistency view one can easily see if there is data
that should not exist. This data should be deleted or the ontology should be
adapted to conform to it. The second view can provide information on how to
modify or extend the ontology with respect to the users' needs. If a lot of users
request a specific property of a class it can be reasonable to modify the ontology
and populate such data. With the third view one can observe which predicates
are used that are not represented within the own dataset, for example properties
of other ontologies or facts that have been deleted from the dataset. In the first
case a workaround could be to introduce "owl:sameAs" relations between the
concept in the locally used ontology and the external one that models the same
thing.</p>
      <sec id="sec-8-1">
        <title>4 Evaluation</title>
        <p>To evaluate our visualization concept for usage data derived from the
preprocessed log files of Web of Data endpoints we ran the method experimentally on
real-world log data of the DBpedia 3.3 dataset. For this purpose a local mirror of the
DBpedia 3.3 dataset was set up for the preprocessing and the respective
DBpedia 3.3 ontology was downloaded. We analyzed the log data of two randomly
chosen days, namely 2009-07-02 and 2009-07-11. The number of requests which
were analyzed was 631,512 and 1,083,390 respectively. The limited number of
days covered results from scalability issues of our method, which still requires a
re-execution of queries. We are aware that it would be reasonable to simply
extend a SPARQL server library to produce the above mentioned usage
data directly because this would avoid the effort of re-running each single query.
However, the log file analysis respects the state of the art of how servers on the Web
of Data produce usage data. In the following we present several exemplary
visualizations which are the most significant ones for what we conclude from our
analysis. It is not the goal of this paper to evaluate the usage of the DBpedia
dataset completely but to prove the feasibility and the usefulness of our visual
analysis approach in general. Thus, and due to limited space, we only present
selected visualizations and corresponding interpretations which represent each of
the possible maintenance recommendations our approach provides at least once.
A broader range of statistics and visualizations for both analyzed log files can
be found at
http://page.mi.fu-berlin.de/mluczak/pub/visual-analysis-of-web-ofdata-usage-dbpedia33/</p>
        <p>Ontology Heat Map Analysis: For both datasets we used the ontology heat map
as an indicator to step inside the DBpedia ontology and analyze specific
primitives in detail. The visualizations depicted in Figure 2 indicate that only very
few classes and properties have been used.</p>
        <p>To show that statistically for both datasets we generated the top 10 classes
as depicted in Table 1 and the usage of properties as shown in Table 2.</p>
        <p>The most representative observations of the ontology heat map analysis are:
1. Requested resources are instances of a small set of classes.
2. Only very few properties are used as predicates in queries.
3. In general resources of more generic classes like Person, Organization and
Place are used most frequently.
4. The class "MusicalArtist" is a hot spot in the log from 2009-07-11. Resources
of this class were used about 19,000 times as a subject and 900 times as an
object. The latter number is more than 50% of all queries containing an
object instead of a variable.
5. The data of 2009-07-11 indicates that there was a quite periodical and regular
usage since there are some predicates which were requested exactly 1000
times, such as "associatedBand", "instrument", and "nationality".</p>
        <p>We conclude from observations 1, 2 and 3 that there is potential to shape
the ontology. Observations 4 and 5 indicate that it is reasonable to examine which
other classes and properties are requested in context with "MusicalArtist" as well
as "associatedBand", "instrument", and "nationality". From observation 5 we also conclude
that there may be regular usage profiles of hosts.</p>
        <p>Primitive and Resource Usage Analysis: With the observations of the heat map
in mind, the primitive and resource usage analysis as shown in Figure 3 helps to
better understand the users' intention behind requests and to reveal concrete issues
of the data and the underlying ontology.</p>
        <p>Again we list the most important observations as follows:
1. A lot of triples containing a property like "birthdate" or "associatedBand"
failed.
2. The predicates which are requested 1000 times occur almost every time in
combination with resources of the class "MusicalArtist".
3. Most of the queries which used resources of the class "Band" failed since the
ontology does not allow the requested combinations.
4. The property "instrument" is often used as an object in queries.
5. A lot of users query for the president of a certain country.</p>
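        <table-wrap>
          <label>Table 1</label>
          <caption>
            <p>Top 10 classes used as subject and as object, (a) 2009-07-02.</p>
          </caption>
          <table>
            <thead>
              <tr><th>Class</th><th>Subject: Quantity</th><th>Subject: abs</th><th>Subject: rel</th><th>Object: Quantity</th><th>Object: abs</th><th>Object: rel</th></tr>
            </thead>
            <tbody>
              <tr><td>Organization</td><td>69175</td><td>10.3%</td><td>19.1%</td><td>46984</td><td>7.0%</td><td>72.2%</td></tr>
              <tr><td>Person</td><td>61845</td><td>9.2%</td><td>17.1%</td><td>187</td><td>&lt;0.1%</td><td>0.3%</td></tr>
              <tr><td>Place</td><td>25590</td><td>3.8%</td><td>7.1%</td><td>231</td><td>&lt;0.1%</td><td>0.4%</td></tr>
              <tr><td>Work</td><td>21394</td><td>3.2%</td><td>5.9%</td><td>34</td><td>&lt;0.1%</td><td>&lt;0.1%</td></tr>
              <tr><td>PopulatedPlace</td><td>20455</td><td>3.0%</td><td>5.7%</td><td>221</td><td>&lt;0.1%</td><td>0.3%</td></tr>
              <tr><td>Company</td><td>19006</td><td>2.8%</td><td>5.2%</td><td>9688</td><td>1.4%</td><td>14.9%</td></tr>
              <tr><td>Artist</td><td>17951</td><td>2.7%</td><td>5.9%</td><td>106</td><td>&lt;0.1%</td><td>0.2%</td></tr>
              <tr><td>Athlete</td><td>12530</td><td>1.9%</td><td>3.5%</td><td>0</td><td>0%</td><td>0%</td></tr>
              <tr><td>EducationalInstitution</td><td>12288</td><td>1.8%</td><td>3.4%</td><td>9967</td><td>1.5%</td><td>15.5%</td></tr>
              <tr><td>RadioStation</td><td>10968</td><td>1.6%</td><td>3.0%</td><td>9805</td><td>1.5%</td><td>15.1%</td></tr>
            </tbody>
          </table>
        </table-wrap>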
        <p>We conclude from observations 1 and 3 that the two exemplary properties
as well as the class "Band" are used in a different context than the one they
are valid for. In the case of observation 1 it is also possible that there is a lack
of data which conforms to the requests. Observation 4 indicates that the identifier
"instrument" is badly chosen or that the users' understanding of this concept is different.
A workaround could be to change it to "playsInstrument", which reflects the
character of a property more intuitively. The ontology models that a "Person" is
"president" of a "school", but observation 5 revealed a reasonable query, so the ontology
could be adjusted to fulfill this user requirement. Observation 2 supports the
aforementioned assumption that there may be regular usage profiles of hosts
which are worth a detailed inspection.</p>
        <p>Hosts and Time Analysis: The hosts and time analysis helps to detect hosts
which only use a specific set of patterns, hosts which access the dataset at specific
points in time, and times of heavy load in general. We discovered that the queries
containing the properties which were used 1000 times originated from the same
host. Figure 4 compares the access times of this host with the overall usage
activity of all hosts. In both analyzed log files there was constant traffic of
about 30,000 (2009-07-02) and 50,000 (2009-07-11) queries per hour, respectively.</p>
        <p>The two central observations of this analysis are:
1. The load is distributed evenly over the whole day.
2. There is at least one host that requests the dataset in a dedicated time
period.</p>
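        <p>The comparison of a single host with the overall activity can be sketched as follows. This is a minimal illustration under stated assumptions: the entry layout (host identifier, hour of day), the host names, and the sample entries are hypothetical and do not stem from the analyzed log files.</p>
        <preformat>
```python
from collections import Counter

# Hypothetical preprocessed log entries: (anonymized host id, hour of day).
SAMPLE_REQUESTS = [
    ("host-a", 0), ("host-b", 0), ("host-c", 1),
    ("host-x", 13), ("host-x", 13), ("host-x", 14),
    ("host-b", 13), ("host-c", 23),
]


def requests_per_hour(requests, host=None):
    """Return a 24-element list of request counts, optionally for one host."""
    counts = Counter(
        hour for h, hour in requests if host is None or h == host
    )
    return [counts.get(hour, 0) for hour in range(24)]


def active_hours(per_hour):
    """Return the hours in which at least one request was observed."""
    return [hour for hour, n in enumerate(per_hour) if n > 0]


if __name__ == "__main__":
    print("all hosts:", requests_per_hour(SAMPLE_REQUESTS))
    print("host-x active in hours:",
          active_hours(requests_per_hour(SAMPLE_REQUESTS, "host-x")))
```
        </preformat>
        <p>A host whose active hours form a short contiguous window, while the overall histogram stays flat, corresponds to the situation described in observation 2.</p>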
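        <table-wrap>
          <caption>
            <p>(a) 2009-07-02</p>
          </caption>
          <table>
            <thead>
              <tr><th>Predicate</th><th>Quantity</th><th>abs</th><th>rel</th></tr>
            </thead>
            <tbody>
              <tr><td>birthdate</td><td>103</td><td>0.02%</td><td>0.04%</td></tr>
              <tr><td>deathdate</td><td>72</td><td>0.01%</td><td>0.03%</td></tr>
              <tr><td>birthplace</td><td>44</td><td>&lt;0.01%</td><td>0.02%</td></tr>
              <tr><td>knownFor</td><td>13</td><td>&lt;0.01%</td><td>&lt;0.01%</td></tr>
              <tr><td>president</td><td>2</td><td>&lt;0.01%</td><td>&lt;0.01%</td></tr>
              <tr><td>capital</td><td>1</td><td>&lt;0.01%</td><td>&lt;0.01%</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <table-wrap>
          <caption>
            <p>(b) 2009-07-11</p>
          </caption>
          <table>
            <thead>
              <tr><th>Predicate</th></tr>
            </thead>
            <tbody>
              <tr><td>deathdate</td></tr>
              <tr><td>associatedBand</td></tr>
              <tr><td>deathplace</td></tr>
              <tr><td>employer</td></tr>
              <tr><td>genre</td></tr>
              <tr><td>instrument</td></tr>
              <tr><td>knownFor</td></tr>
              <tr><td>nationality</td></tr>
              <tr><td>occupation</td></tr>
              <tr><td>spouse</td></tr>
              <tr><td>term</td></tr>
            </tbody>
          </table>
        </table-wrap>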
        <p>Observation 2 at least allows the assumption that this dedicated host always
requested the same query pattern as part of an experiment on the DBpedia
dataset. However, the overall volume is still too limited to recommend scaling the
infrastructure to improve the performance of the service, or modularizing the
data and serving modules separately to requesting hosts which need only a specific
part of the dataset.</p>
        <p>Error Analysis: Figure 5 depicts how we addressed the detection of inconsistencies
in the data and the ontology. The scope of the exploration is defined by one of
the primitives which was suspicious in the primitive and resource usage analysis,
such as "instrument" or "Band".
1. There are triples in the dataset that use the property "associatedBand" but
do not use a resource of type "MusicalArtist" as the respective subject, which
contradicts the domain restriction of "associatedBand".
2. The users' needs and the modeling of the ontology obviously differ with
respect to the class "Band", since several predicates which are not valid are
requested in combination with resources of this type.</p>
        <p>Observation 1 reveals (a) an inconsistency in the populated data which
potentially results from (b) errors in, or the misuse of, mappings for the Wikipedia
infoboxes. The workaround could be either to change the modeling of the
domain restriction of "associatedBand" or to fix the mappings so that they no longer
match the cases that result in invalid data. A detailed analysis of observation 2 shows
that resources of the type "Band" are requested as subject in combination with the
predicate "nationality", for example. These triple patterns fail because "nationality" has a
domain restriction to "Person". This could be fixed to represent the fact that,
e.g., a British band exists.</p>
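        <p>A check of triples against domain restrictions, as used in the error analysis, might look like the following minimal sketch. The dictionaries below are hypothetical excerpts: the domain of "associatedBand" and "nationality" follows the description above, whereas the subclass links, the resource names, and the handling of untyped subjects are assumptions made for illustration only.</p>
        <preformat>
```python
# Hypothetical excerpt of the ontology: property -> required subject class.
DOMAIN = {"associatedBand": "MusicalArtist", "nationality": "Person"}

# Assumed subclass links (child -> parent) for illustration; must be acyclic.
SUBCLASS_OF = {"MusicalArtist": "Artist", "Artist": "Person"}

# Hypothetical instance data: resource -> asserted class.
TYPES = {"ex:Alice": "MusicalArtist", "ex:SomeBand": "Band"}


def is_a(cls, target):
    """True if `cls` equals `target` or is a transitive subclass of it."""
    while cls is not None:
        if cls == target:
            return True
        cls = SUBCLASS_OF.get(cls)
    return False


def domain_violations(triples):
    """Return the triples whose subject type contradicts the property domain.

    Properties without a known domain are skipped; subjects without a known
    type are reported as violations (an assumption of this sketch).
    """
    bad = []
    for s, p, o in triples:
        required = DOMAIN.get(p)
        if required is not None and not is_a(TYPES.get(s), required):
            bad.append((s, p, o))
    return bad


if __name__ == "__main__":
    sample = [
        ("ex:Alice", "associatedBand", "ex:SomeBand"),
        ("ex:SomeBand", "nationality", "ex:UK"),
    ]
    print(domain_violations(sample))
```
        </preformat>
        <p>In this sketch the band with a "nationality" is reported, which mirrors the failing request pattern discussed for the class "Band".</p>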
      </sec>
    </sec>
    <sec id="sec-9">
      <title>Summary</title>
      <p>
        We summarize that we analyzed the general usage of the DBpedia dataset,
detected inconsistencies in the populated data, and revealed user requirements
which are not yet modeled in the DBpedia ontology. With reference to
the challenges introduced by Möller et al. [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] we discovered the following: The
error statistics help to find inconsistencies within the served data and to improve
the data quality, which advances the reliability of a dataset. The time
statistics give insight into times of heavy load. It is possible to navigate from the
time statistics via the hosts statistics directly to the primitives usage statistics
and the resource usage statistics, which enable the detection of the most frequently
requested entities of a dataset. Thus, both perspectives of peak load are covered.
The hosts statistics can be used to detect hosts which use a specific set of query
patterns. The data conforming to these patterns could be served separately as
a module of the entire dataset. Specific indexes covering the most important
patterns could be configured based on the ontology heat map, the primitives
usage statistics, and the resource usage statistics. Both activities can improve the
system performance. The ontology heat map is an indicator of the suitability
and conciseness of an ontology, and the error statistics reveal user requirements
which do not conform to it. Bringing both perspectives together, one can
draw conclusions about the usefulness of existing data as well as its modeling
and restrict or extend the schema. The time statistics and the hosts statistics
allow the detection of external attacks.
      </p>
      <sec id="sec-9-1">
        <title>Discussion and Concluding Remarks</title>
        <p>In this paper we presented an approach that helps Web data publishers to
visualize and interpret the usage of their Web of Data endpoints with the goal
of recommending maintenance activities such as assuring the system's
performance or fixing bugs and weaknesses in the data or the underlying
schema. We motivated that such data analysis must be based on a
specific preprocessing of log files and we proposed an algorithm for this. Then we
presented six metrics, the associated visualizations, and descriptions of how
they should be interpreted. The approach was evaluated by an exemplary
usage analysis of the DBpedia dataset. The results of this analysis show that our
approach addresses a set of five accepted challenges properly. However, it does
not yet seem reasonable to take all the derived maintenance recommendations
into account for future evolution steps of the DBpedia dataset. For example,
roughly 1,500,000 people access the German Wikipedia per hour<sup>13</sup>, which
shows that the real-world usage of DBpedia is rather limited by comparison. Hence, we have to
admit that in case of a broader public success of DBpedia such an analysis would have
to be repeated.</p>
        <p>
          The metrics and the associated visualizations which were presented in this
paper are only a subset of all possibilities to perform a detailed usage analysis.
Even so, the evaluation has shown that they are significant for revealing
inconsistencies and weaknesses of a dataset and its underlying ontology. The
most closely related piece of work, by Möller et al. [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ], likewise presented only another subset of
patterns and metrics for Web of Data usage analysis. Bringing both approaches
together seems promising for obtaining a complete view of the usage of the Web of
Data.
        </p>
        <p>We are currently running further data mining experiments and analyses on
our usage data. We are going to extend the statistical view of Web of
Data usage with a content-based view by applying cluster, session, and
path analysis. Furthermore, we are currently using the preprocessed usage data
for an approach that automatically adapts the indexes of an RDF store based on
the popularity and complexity of patterns in queries performed in practice. The
application of network visualizations in our approach also offers the chance to
apply various network metrics (e.g. connectivity or centrality measures) to the
ontologies. We are also going to experiment with these structural properties of
various ontologies and with the effects of the changes to them which are derived from
our usage analysis.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Arias</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fernández</surname>
            ,
            <given-names>J.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Martínez-Prieto</surname>
            ,
            <given-names>M.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>de la Fuente</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>An empirical study of real-world SPARQL queries</article-title>
          .
          <source>CoRR abs/1103.5043</source>
          (
          <year>2011</year>
          ), http://arxiv.org/abs/1103.5043
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Berendt</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hotho</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stumme</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Towards semantic web mining</article-title>
          . In: International Semantic Web Conference (ISWC). pp.
          <fpage>264</fpage>
          –
          <lpage>278</lpage>
          . Springer (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Berendt</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          , et al.:
          <article-title>A roadmap for web mining: From web to semantic web</article-title>
          . In:
          <string-name>
            <surname>Berendt</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          , et al. (eds.) Web Mining: From Web to Semantic Web. pp.
          <fpage>1</fpage>
          –
          <lpage>22</lpage>
          . Springer, Heidelberg
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Berendt</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          , et al.:
          <article-title>USEWOD2011: 1st international workshop on usage analysis and the web of data</article-title>
          . In:
          <string-name>
            <surname>Srinivasan</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          , et al. (eds.)
          <source>WWW (Companion Volume)</source>
          . pp.
          <fpage>305</fpage>
          –
          <lpage>306</lpage>
          . ACM
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Berendt</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          , et al.:
          <article-title>Usage analysis and the web of data</article-title>
          .
          <source>ACM SIGIR Forum</source>
          <volume>45</volume>
          (
          <issue>11</issue>
          ),
          <fpage>63</fpage>
          –
          <lpage>70</lpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Bizer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Heath</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Berners-Lee</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Linked data - the story so far</article-title>
          .
          <source>International Journal on Semantic Web and Information Systems</source>
          <volume>5</volume>
          (
          <issue>3</issue>
          ),
          <fpage>1</fpage>
          –
          <lpage>22</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Cooley</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mobasher</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Srivastava</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Web mining: Information and pattern discovery on the world wide web</article-title>
          . In:
          <source>ICTAI '97: Proceedings of the 9th International Conference on Tools with Artificial Intelligence</source>
          . p.
          <fpage>558</fpage>
          . IEEE Computer Society, Washington, DC, USA
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Hausenblas</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Exploiting Linked Data For Building Web Applications</article-title>
          .
          <source>IEEE Internet Computing</source>
          <volume>13</volume>
          (
          <issue>4</issue>
          ),
          <fpage>68</fpage>
          –
          <lpage>73</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Heath</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>How will we interact with the web of data?</article-title>
          <source>IEEE Internet Computing</source>
          <volume>12</volume>
          (
          <issue>5</issue>
          ),
          <fpage>88</fpage>
          –
          <lpage>91</lpage>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Hotho</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Maedche</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Staab</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Studer</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>SEAL-II - the soft spot between richly structured and unstructured knowledge</article-title>
          .
          <source>Journal of Universal Computer Science</source>
          (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Kirchberg</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ko</surname>
            ,
            <given-names>R.K.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>B.S.</given-names>
          </string-name>
          :
          <article-title>From linked data to relevant data – time is the essence</article-title>
          .
          <source>CoRR abs/1103.5046</source>
          (
          <year>2011</year>
          ), http://arxiv.org/abs/1103.5046
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12. Moller,
          <string-name>
            <given-names>K.</given-names>
            ,
            <surname>Hausenblas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Cyganiak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            ,
            <surname>Grimnes</surname>
          </string-name>
          ,
          <string-name>
            <surname>G.A.</surname>
          </string-name>
          :
          <article-title>Learning from linked open data usage: Patterns &amp; metrics</article-title>
          .
          <source>In: Proceedings of the WebSci10: Extending the Frontiers of Society On-Line</source>
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Spiliopoulou</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Web Usage Mining for Web Site Evaluation</article-title>
          .
          <source>Communications of the ACM</source>
          <volume>43</volume>
          (
          <issue>8</issue>
          ),
          <fpage>127</fpage>
          –
          <lpage>134</lpage>
          (
          <year>August 2000</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Srivastava</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cooley</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Web usage mining: Discovery and applications of usage patterns from web data</article-title>
          .
          <source>SIGKDD Explorations</source>
          <volume>1</volume>
          ,
          <fpage>12</fpage>
          –
          <lpage>23</lpage>
          (
          <year>2000</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>