<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Access Logs Don't Lie: Towards Traffic Analytics for Linked Data Publishers</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Luca Costabello</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pierre-Yves Vandenbussche</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gofran Shukair</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Corine Deliot</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Neil Wilson</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>British Library</institution>
          ,
          <country country="UK">United Kingdom</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Fujitsu Ireland Ltd.</institution>
          ,
          <country country="IE">Ireland</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Considerable investment in RDF publishing has recently led to the birth of the Web of Data. But is this investment worth it? Are publishers aware of what the traffic to their linked datasets looks like? We propose an access analytics platform for linked datasets. The system mines traffic insights from the logs of registered RDF publishers and extracts Linked Data-specific metrics not available in traditional web analytics tools. We present a demo instance showing one month (December 2014) of real traffic to the British National Bibliography RDF dataset.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        We believe Linked Data publishers have limited awareness of how datasets are
accessed by visitors. While some works describe specific access metrics for linked
datasets [
        <xref ref-type="bibr" rid="ref1 ref2">1,2</xref>
        ], no comprehensive analytics tool for Linked Data publishers has
ever been proposed, and in most cases publishers have no choice but to manually
browse through records stored in server access logs. Applications for analysing
traffic to traditional websites exist, but none takes into account the specificities
of Linked Data: Google Analytics<sup>1</sup> and other popular web analytics platforms<sup>2</sup>
(e.g. Open Web Analytics, PIWIK<sup>3</sup>) are not designed for linked datasets. For
example, existing systems do not offer insights on SPARQL queries, nor do they properly
interpret 303 URIs. Besides, to the best of our knowledge, there are no tools that
detect Linked Data visitor sessions, or that help identify workload peaks of
SPARQL endpoints.
      </p>
      <p>This has two consequences. First, publishers struggle to justify Linked Data
investment to management. Second, they miss out on technical benefits: for
instance, limited awareness of traffic spikes prevents predicting peaks during
real-world events, and hinders the identification of visitors that overload triplestores
with repeated SPARQL queries.</p>
      <p><sup>1</sup> http://analytics.google.com</p>
      <p><sup>2</sup> https://en.wikipedia.org/wiki/List_of_web_analytics_software</p>
      <p><sup>3</sup> http://piwik.org | http://www.openwebanalytics.com</p>
    </sec>
    <sec id="sec-2">
      <title>Our Contribution</title>
      <p>We present a hosted analytics platform for linked datasets. The system mines
the logs of registered Linked Data publishers and extracts traffic insights. The
analytics system is designed for RDF data stores with or without a SPARQL
engine, and supports load-balancing scenarios. The online demo<sup>4</sup> shows one month
of traffic insights for the British National Bibliography (BNB) dataset<sup>5</sup>. The
system can easily accommodate any Linked Data publisher and only requires
modifying the log parser to match the publisher's log syntax.</p>
      <p>The system offers Linked Data-specific features which are currently not
supported by classic web analytics tools (e.g. SPARQL-specific statistics). We do not
track clients, thus preserving visitors' privacy. The system supports Linked Data
HTTP dereferencing with HTTP 303 patterns, and filters out search engine
and robot activity. It also detects Linked Data visitor sessions with an
unsupervised learning algorithm. To better identify workload peaks of a SPARQL
endpoint, supervised learning is adopted to label SPARQL queries as heavy or
light, according to SPARQL syntactic features.</p>
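      <p>As an illustration of what syntactic features of a SPARQL query might look like, the following minimal Python sketch turns a query string into a small feature dictionary. The keyword list, the function name syntactic_features, the regular-expression approach and the sample query are all assumptions made for this example; they are not the platform's actual feature set, and a production system would more likely rely on a SPARQL parser. The resulting features would then be passed to whichever binary classifier is used to label queries as heavy or light.</p>
      <preformat><![CDATA[
```python
import re

# Hypothetical feature set, chosen for illustration only.
KEYWORDS = ("OPTIONAL", "UNION", "FILTER", "GRAPH", "DISTINCT",
            "ORDER BY", "GROUP BY", "LIMIT")
VERBS = ("SELECT", "ASK", "CONSTRUCT", "DESCRIBE")

# IRIs and double-quoted literals are masked so that keywords
# appearing inside them are not counted.
_MASK = re.compile(r'<[^<>\s]*>|"[^"\n]*"')


def syntactic_features(query):
    """Return keyword counts, the number of distinct variables and the query verb."""
    text = _MASK.sub(" ", query)
    features = {}
    for kw in KEYWORDS:
        pattern = r"\b" + kw.replace(" ", r"\s+") + r"\b"
        name = kw.lower().replace(" ", "_")
        features[name] = len(re.findall(pattern, text, re.IGNORECASE))
    # ?x and $x denote the same variable, so only the name is kept.
    features["variables"] = len(set(re.findall(r"[?$](\w+)", text)))
    verb = re.search(r"\b(" + "|".join(VERBS) + r")\b", text, re.IGNORECASE)
    features["verb"] = verb.group(1).upper() if verb else None
    return features


# Made-up query used only to exercise the function.
sample_query = """PREFIX dct: <http://purl.org/dc/terms/>
SELECT DISTINCT ?book ?title WHERE {
  ?book dct:title ?title .
  OPTIONAL { ?book dct:creator ?author }
  FILTER regex(?title, "union", "i")
} LIMIT 10"""

features = syntactic_features(sample_query)
```
]]></preformat>
      <p>In this sketch the word "union" inside the string literal is not counted as a UNION keyword, because literals are masked before counting.</p>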
      <p>System Overview. Our traffic analytics platform is organised in the following
components (Figure 1):</p>
      <p>Extract-Transform-Load (ETL) Unit. On a daily basis, for registered
publishers, the Log Ingestion sub-component fetches and parses access logs
from one or more linked dataset servers (see Figure 2 for an example). Records
are filtered to remove noise from robots and search engine crawlers.</p>
      <p>Metrics Extraction Unit. Extracts traffic metrics from access logs.</p>
      <p>Data Warehouse and MOLAP Unit. Traffic metrics are stored in a data
warehouse equipped with an SQL-compliant MOLAP<sup>6</sup> unit that answers
queries with sub-second latency.</p>
      <p>Web user interface. The front end queries the RESTful APIs exposed by
the MOLAP Unit, and generates a web UI that shows traffic metrics filtered
by date, user agent type, and access protocol (Figure 3). The user interface
runs on Node.js, and charts are based on amCharts<sup>7</sup>.</p>
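      <p>The parsing and filtering steps of the ETL Unit can be illustrated with a minimal Python sketch. It assumes that logs follow the Apache Combined Log Format, since robot filtering needs the user agent field; the actual syntax depends on the publisher. The regular expression, the ROBOT_HINTS list, the function names and the two sample log lines are hypothetical and made up for this example. Real robot detection would likely use a maintained list of crawler signatures.</p>
      <preformat><![CDATA[
```python
import re
from datetime import datetime

# Assumed log syntax: Apache Combined Log Format.
LOG_LINE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Hypothetical substrings that mark a user agent as a robot.
ROBOT_HINTS = ("bot", "crawler", "spider", "slurp")


def parse_line(line):
    """Parse one log line into a dict, or return None if it does not match."""
    match = LOG_LINE.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    return record


def is_robot(record):
    """Flag records whose user agent contains one of the robot hints."""
    agent = record["agent"].lower()
    return any(hint in agent for hint in ROBOT_HINTS)


def ingest(lines):
    """Keep well-formed records that do not come from robots."""
    records = (parse_line(line) for line in lines)
    return [r for r in records if r is not None and not is_robot(r)]


# Made-up log lines used only to exercise the functions.
sample_lines = [
    '192.0.2.10 - - [01/Dec/2014:10:15:32 +0000] '
    '"GET /doc/resource/009910399 HTTP/1.1" 200 5120 "-" "curl/7.38.0"',
    '192.0.2.77 - - [01/Dec/2014:10:15:40 +0000] '
    '"GET /sparql?query=ASK%20%7B%7D HTTP/1.1" 200 48 "-" '
    '"Googlebot/2.1 (+http://www.google.com/bot.html)"',
    'not a log line',
]

kept = ingest(sample_lines)
```
]]></preformat>
      <p>Adapting the sketch to another publisher would amount to replacing LOG_LINE, which mirrors the point that only the log parser has to change.</p>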
      <p>Metrics. We support three groups of traffic metrics:</p>
      <p>Content Metrics. How many times RDF resources have been accessed. We
support the Linked Data dual access protocol; this means that the system counts
how many times an RDF resource is dereferenced with HTTP operations, but
also how many times its URI is included in SPARQL queries<sup>8</sup>. Unlike existing
tools, we support 303 URIs<sup>9</sup>, thus counting each HTTP 303 pattern as a single
request. We also provide aggregates by family of RDF resource: instances
(URIs accessed either in HTTP operations or included in SPARQL queries),
classes (URIs used as RDFS/OWL classes in SPARQL queries, objects of
rdf:type), properties (URIs used as predicates in SPARQL queries), graphs
(URIs used as graphs in SPARQL queries - FROM/FROM NAMED, USING/USING
NAMED, GRAPH).</p>
      <p><sup>4</sup> http://52.49.205.156/analytics/</p>
      <p><sup>5</sup> Released as Linked Open Data in July 2011, the dataset offers SPARQL and HTTP
access to almost 100 million statements about books and serials. It is available at
http://bnb.data.bl.uk</p>
      <p><sup>6</sup> Multidimensional Online Analytical Processing</p>
      <p><sup>7</sup> https://www.amcharts.com</p>
      <p><sup>8</sup> This is a lower-bound estimate: access logs do not contain SPARQL result sets.</p>
      <p><sup>9</sup> https://www.w3.org/TR/cooluris</p>
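      <p>The idea of counting each HTTP 303 pattern as a single request, described under Content Metrics, can be sketched as follows in Python. Access logs usually do not record the Location header, so the sketch assumes a hypothetical rule, document_path, that maps a resource URI to the document URI it redirects to (here by rewriting /id/ to /doc/). The record layout, the function names and the sample records are likewise assumptions for illustration and do not describe the platform's implementation.</p>
      <preformat><![CDATA[
```python
from collections import Counter


def document_path(resource_path):
    """Hypothetical redirect rule: /id/... is assumed to redirect to /doc/..."""
    return resource_path.replace("/id/", "/doc/", 1)


def count_resource_accesses(records):
    """Count accesses, collapsing a 303 followed by its 200 into one request.

    `records` is an iterable of (client, path, status) tuples in
    chronological order.
    """
    counts = Counter()
    pending = {}  # (client, expected document path) -> redirected resource path
    for client, path, status in records:
        if status == 303:
            pending[(client, document_path(path))] = path
        elif status == 200:
            resource = pending.pop((client, path), None)
            # A completed 303 pattern is credited to the resource URI;
            # a direct lookup is credited to the requested path.
            counts[resource if resource is not None else path] += 1
    return counts


# Made-up records used only to exercise the function.
sample_records = [
    ("192.0.2.10", "/id/resource/A", 303),
    ("192.0.2.10", "/doc/resource/A", 200),  # completes the 303 pattern
    ("192.0.2.20", "/doc/resource/A", 200),  # direct document lookup
    ("192.0.2.10", "/id/resource/B", 303),   # redirect never followed
]

counts = count_resource_accesses(sample_records)
```
]]></preformat>
      <p>In this sketch a redirect that is never followed is not counted at all, which is one possible design choice. A fuller version would probably also expire pending redirects after a short time window.</p>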
      <p>
        Audience Metrics. Besides traditional information about visitors (e.g.
location, network provider, user agent type), these measures include details of
visitor sessions (duration, size, depth, bounce rate), which we identify with
the unsupervised hierarchical agglomerative clustering (HAC) proposed by [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>
        Protocol Metrics. Information about the data access protocols used by
visitors. It includes a breakdown of requests by protocol (HTTP lookups vs
SPARQL queries), and various SPARQL-specific metrics: the count of
malformed queries, queries by verb, the count of light and heavy SPARQL queries
(obtained with an off-the-shelf supervised binary classifier trained on a
superset of the SPARQL syntactic features listed in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]).
      </p>
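      <p>To illustrate how agglomerative clustering can group one visitor's requests into sessions, as mentioned under Audience Metrics, the following Python sketch applies single-linkage clustering to request timestamps and stops merging once the smallest remaining gap exceeds a cut-off. This is a simplification and does not reproduce the stopping criterion of [<xref ref-type="bibr" rid="ref3">3</xref>]. The 1800-second cut-off, the function names, the sample timestamps and the definition of bounce rate as the share of single-request sessions are assumptions made for this example.</p>
      <preformat><![CDATA[
```python
def cluster_sessions(timestamps, max_gap=1800.0):
    """Single-linkage agglomerative clustering of request times (in seconds).

    Every request starts as its own cluster; the two closest clusters are
    merged repeatedly until the smallest gap exceeds `max_gap` (assumed value).
    """
    clusters = [[t] for t in sorted(timestamps)]
    while len(clusters) > 1:
        # On a timeline the closest pair of clusters is always adjacent.
        gaps = [clusters[i + 1][0] - clusters[i][-1]
                for i in range(len(clusters) - 1)]
        i = min(range(len(gaps)), key=gaps.__getitem__)
        if gaps[i] > max_gap:
            break
        clusters[i:i + 2] = [clusters[i] + clusters[i + 1]]
    return clusters


def session_metrics(sessions):
    """Summarise sessions: count, mean size, mean duration and bounce rate."""
    if not sessions:
        return {"sessions": 0, "mean_size": 0.0,
                "mean_duration": 0.0, "bounce_rate": 0.0}
    sizes = [len(s) for s in sessions]
    durations = [s[-1] - s[0] for s in sessions]
    return {
        "sessions": len(sessions),
        "mean_size": sum(sizes) / len(sizes),
        "mean_duration": sum(durations) / len(durations),
        # Assumed definition: share of sessions made of a single request.
        "bounce_rate": sizes.count(1) / len(sizes),
    }


# Made-up request times (seconds) for a single visitor.
sample_times = [0, 40, 95, 4000, 4030, 9000]

sessions = cluster_sessions(sample_times)
metrics = session_metrics(sessions)
```
]]></preformat>
      <p>With these sample timestamps the sketch yields three sessions of three, two and one requests. A fixed cut-off is the simplest possible choice; deriving the threshold from each visitor's own activity is left out here.</p>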
    </sec>
    <sec id="sec-3">
      <title>Conclusions and Future Perspectives</title>
      <p>Our analytics platform relieves Linked Data publishers from time-consuming log
mining and, unlike other popular web analytics platforms, supports Linked
Data-specific traffic metrics. Knowledge of traffic patterns helps gauge the popularity
of a dataset: for example, awareness of decreasing user retention might prompt
better promotion (e.g. hackathons, spreading the word on community mailing
lists, etc.). Likewise, if portions of a dataset are never accessed, perhaps better
data documentation is required.</p>
      <p>Note that the extracted metrics should be considered a lower-bound
estimate: because we do not track visitors, we have only a partial view of the
communication with the data store, and we cannot circumvent intermediate components
between visitors and datasets (e.g. caches, proxy servers, or NAT). Besides,
visitors might fake user agent strings or HTTP referrers, thus leading to client
identification mistakes.</p>
      <p><sup>10</sup> https://httpd.apache.org/docs/trunk/logs.html#common</p>
      <p>We will add new metrics in future extensions, such as finer-grained SPARQL
insights (e.g. useful to fine-tune SPARQL engine caches). Users have suggested
upgrading the web interface with secondary dimension capabilities to improve
reporting. Real-time monitoring is also part of the future work roadmap.</p>
      <p>Acknowledgments. This work has been supported by the TOMOE project funded
by Fujitsu Laboratories Limited in collaboration with the Insight Centre for Data Analytics
at the National University of Ireland Galway.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>D.</given-names>
            <surname>Fasel</surname>
          </string-name>
          and
          <string-name>
            <given-names>D.</given-names>
            <surname>Zumstein</surname>
          </string-name>
          .
          <article-title>A fuzzy data warehouse approach for web analytics</article-title>
          .
          <source>In Procs of WSKS</source>
          . Springer,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2. K. Moller, M. Hausenblas,
          <string-name>
            <given-names>R.</given-names>
            <surname>Cyganiak</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Handschuh</surname>
          </string-name>
          .
          <article-title>Learning from linked open data usage: Patterns &amp; metrics</article-title>
          .
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>G. C.</given-names>
            <surname>Murray</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Chowdhury</surname>
          </string-name>
          .
          <article-title>Identification of user sessions with hierarchical agglomerative clustering</article-title>
          .
          <source>ASIS&amp;T, 43(1):1–9</source>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>F.</given-names>
            <surname>Picalausa</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Vansummeren</surname>
          </string-name>
          .
          <article-title>What are real SPARQL queries like?</article-title>
          <source>In Procs of SWIM</source>
          , page 7
          . ACM,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>