Access Logs Don't Lie: Towards Traffic Analytics for Linked Data Publishers

Luca Costabello

Pierre-Yves Vandenbussche

Gofran Shukair

Corine Deliot

Neil Wilson

British Library, United Kingdom

Fujitsu Ireland Ltd., Ireland

Considerable investment in RDF publishing has recently led to the birth of the Web of Data. But is this investment worth it? Are publishers aware of what the traffic to their linked datasets looks like? We propose an access analytics platform for linked datasets. The system mines traffic insights from the logs of registered RDF publishers and extracts Linked Data-specific metrics not available in traditional web analytics tools. We present a demo instance showing one month (December 2014) of real traffic to the British National Bibliography RDF dataset.

Introduction

We believe Linked Data publishers have limited awareness of how visitors access their datasets. While some works describe specific access metrics for linked datasets [1,2], no comprehensive analytics tool for Linked Data publishers has been proposed, and in most cases publishers have no choice but to manually browse through the records stored in server access logs. Applications for analysing traditional website traffic exist, but none takes into account the specificities of Linked Data: Google Analytics¹ and other popular web analytics platforms² (e.g. Open Web Analytics, PIWIK³) are not designed for linked datasets. For example, existing systems do not offer insights on SPARQL queries, nor do they properly interpret 303 URIs. Besides, to the best of our knowledge, there are no tools that detect Linked Data visitor sessions, or that help identify workload peaks of SPARQL endpoints.

This has two consequences. First, publishers struggle to justify Linked Data investment to management. Second, they miss out on technical benefits: for instance, limited awareness of traffic spikes prevents predicting peaks during real-world events, and hinders the identification of visitors that overload triplestores with repeated SPARQL queries.

¹ http://analytics.google.com
² https://en.wikipedia.org/wiki/List_of_web_analytics_software
³ http://piwik.org | http://www.openwebanalytics.com

Our Contribution

We present a hosted analytics platform for linked datasets. The system mines the logs of registered Linked Data publishers and extracts traffic insights. It is designed for RDF data stores with or without a SPARQL engine, and supports load-balancing scenarios. The online demo⁴ shows one month of traffic insights for the British National Bibliography (BNB) dataset⁵. The system can accommodate any Linked Data publisher and only requires modifying the log parser to match the publisher's log syntax.

The system offers Linked Data-specific features that classic web analytics tools do not currently support (e.g. SPARQL-specific statistics). We do not track clients, thus preserving visitors' privacy. The system supports Linked Data HTTP dereferencing with HTTP 303 patterns, and filters out search engine and robot activity. It also detects Linked Data visitor sessions with an unsupervised learning algorithm. To better identify workload peaks of a SPARQL endpoint, supervised learning is adopted to label SPARQL queries as heavy or light, according to SPARQL syntactic features.
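To make the notion of SPARQL syntactic features concrete, the following is a minimal sketch of a feature extractor whose output could feed a binary classifier. The keyword list, the function name `syntactic_features`, and the naive keyword matching (which also matches keywords inside IRIs or literals) are assumptions for illustration; the actual feature set used by the platform is not reproduced here.

```python
import re

# Hypothetical feature set, chosen for illustration only.
KEYWORDS = ("FILTER", "OPTIONAL", "UNION", "GRAPH", "ORDER BY", "GROUP BY", "DISTINCT")
VERB_PATTERN = re.compile(r"\b(SELECT|CONSTRUCT|ASK|DESCRIBE)\b")


def syntactic_features(query: str) -> dict:
    """Return keyword counts and the query verb for one SPARQL query string."""
    text = query.upper()
    features = {}
    for keyword in KEYWORDS:
        # Allow any whitespace between the words of two-word keywords.
        pattern = r"\b" + keyword.replace(" ", r"\s+") + r"\b"
        features[keyword.lower().replace(" ", "_")] = len(re.findall(pattern, text))
    # The leftmost verb in the text is taken as the query verb (naive heuristic).
    verb = VERB_PATTERN.search(text)
    features["verb"] = verb.group(1) if verb else None
    return features


EXAMPLE_QUERY = (
    "PREFIX dct: <http://purl.org/dc/terms/> "
    "SELECT DISTINCT ?b WHERE { ?b dct:title ?t . "
    "OPTIONAL { ?b dct:creator ?c } FILTER(?t != '') } ORDER BY ?t"
)
example_features = syntactic_features(EXAMPLE_QUERY)
```

A feature dictionary of this kind would be turned into a numeric vector before training; a production extractor would more likely rely on a real SPARQL parser than on regular expressions.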

System Overview. Our traffic analytics platform is organised in the following components (Figure 1):

Extract-Transform-Load (ETL) Unit. On a daily basis, for registered publishers, the Log Ingestion sub-component fetches and parses access logs from one or more linked dataset servers (see Figure 2 for an example). Records are filtered to remove noise from robots and search engine crawlers.
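The parsing and filtering step can be sketched as follows. This is an illustration only: it assumes the Apache Combined Log Format, and the `ROBOT_HINTS` substrings, the function names, and the sample records are hypothetical. A real deployment would adapt the pattern to the publisher's log syntax and use a maintained robot list.

```python
import re
from typing import Optional

# Assumed log layout: Apache Combined Log Format (referrer and user agent optional).
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)(?: \S+)?" '
    r'(?P<status>\d{3}) (?P<size>\S+)'
    r'(?: "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)")?'
)

# Hypothetical substrings used to flag crawlers in user agent strings.
ROBOT_HINTS = ("bot", "crawler", "spider", "slurp")


def parse_line(line: str) -> Optional[dict]:
    """Parse one access log line; return None if it does not match the pattern."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    return record


def is_robot(record: dict) -> bool:
    """Flag records whose user agent contains one of the robot hints."""
    agent = (record.get("agent") or "").lower()
    return any(hint in agent for hint in ROBOT_HINTS)


def ingest(lines):
    """Parse lines, dropping unparseable records and robot traffic."""
    records = (parse_line(line) for line in lines)
    return [r for r in records if r is not None and not is_robot(r)]


# Made-up sample records for illustration.
SAMPLE = [
    '192.0.2.1 - - [01/Dec/2014:10:00:00 +0000] "GET /resource/123 HTTP/1.1" 303 0 "-" "curl/7.38.0"',
    '192.0.2.2 - - [01/Dec/2014:10:00:05 +0000] "GET /robots.txt HTTP/1.1" 200 120 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"',
]
kept = ingest(SAMPLE)
```

Only the first sample record survives filtering, since the second carries a crawler user agent. Because user agent strings can be faked, such a filter can only approximate robot removal.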

Metrics Extraction Unit. Extracts traffic metrics from access logs.

Data Warehouse and MOLAP Unit. Traffic metrics are stored in a data warehouse equipped with an SQL-compliant MOLAP⁶ unit that answers queries with sub-second latency.

Web user interface. The front end queries the RESTful APIs exposed by the MOLAP Unit, and generates a web UI that shows traffic metrics filtered by date, user agent type, and access protocol (Figure 3). The user interface runs on Node.js, and charts are based on amCharts⁷.

Metrics. We support three groups of traffic metrics:

Content Metrics. How many times RDF resources have been accessed. We support the dual access protocol of Linked Data: the system counts how many times an RDF resource is dereferenced with HTTP operations, but also how many times its URI is included in SPARQL queries⁸. Unlike existing tools, we support 303 URIs⁹, thus counting each HTTP 303 pattern as a single request. We also provide aggregates by family of RDF resource: instances (URIs accessed either in HTTP operations or included in SPARQL queries), classes (URIs used as RDFS/OWL classes in SPARQL queries, i.e. objects of rdf:type), properties (URIs used as predicates in SPARQL queries), and graphs (URIs used as graphs in SPARQL queries: FROM/FROM NAMED, USING/USING NAMED, GRAPH).

⁴ http://52.49.205.156/analytics/
⁵ Released as Linked Open Data in July 2011, the dataset offers SPARQL and HTTP access to almost 100 million statements about books and serials. It is available at http://bnb.data.bl.uk
⁶ Multidimensional Online Analytical Processing
⁷ https://www.amcharts.com
⁸ This is a lower bound estimation. Access logs do not contain SPARQL result sets.
⁹ https://www.w3.org/TR/cooluris
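Counting each HTTP 303 pattern as a single request can be illustrated with the sketch below. It is not the platform's implementation: since access log records typically carry no redirect target, it assumes that a 200 response served to the same host shortly after a 303 is the follow-up of that redirect. The `REDIRECT_WINDOW` value, the tuple layout, and the example paths are hypothetical.

```python
from datetime import datetime, timedelta

# Assumed window within which a follow-up request is attributed to a preceding 303.
REDIRECT_WINDOW = timedelta(seconds=2)


def count_resource_accesses(records):
    """Count accesses per path, collapsing a 303 and its follow-up into one.

    `records` is an iterable of (host, timestamp, path, status) tuples,
    sorted by timestamp.
    """
    counts = {}
    pending = {}  # host -> (resource path, time of the 303 response)
    for host, when, path, status in records:
        if status == 303:
            # Count the access against the resource URI that was dereferenced.
            pending[host] = (path, when)
            counts[path] = counts.get(path, 0) + 1
            continue
        redirect = pending.pop(host, None)
        if redirect is not None and status == 200 and when - redirect[1] <= REDIRECT_WINDOW:
            # Follow-up of a 303: already counted against the resource URI.
            continue
        counts[path] = counts.get(path, 0) + 1
    return counts


# Made-up records for illustration.
t0 = datetime(2014, 12, 1, 10, 0, 0)
example_counts = count_resource_accesses([
    ("192.0.2.1", t0, "/resource/123", 303),
    ("192.0.2.1", t0 + timedelta(seconds=1), "/doc/123", 200),
    ("192.0.2.1", t0 + timedelta(seconds=60), "/doc/456", 200),
])
```

In this example the redirect and its follow-up yield a single access to `/resource/123`, while the later direct lookup of `/doc/456` is counted on its own.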

Audience Metrics. Besides traditional information about visitors (e.g. location, network provider, user agent type), these measures include details of visitor sessions (duration, size, depth, bounce rate), which we identify with the unsupervised hierarchical agglomerative clustering (HAC) approach proposed by [3].

Protocol Metrics. Information about the data access protocols used by visitors. It includes a breakdown of requests by protocol (HTTP lookups vs SPARQL queries), and various SPARQL-specific metrics: the count of malformed queries, queries by verb, and the count of light and heavy SPARQL queries (obtained with an off-the-shelf supervised binary classifier trained on a superset of the SPARQL syntactic features listed in [4]).
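As an illustration of session detection with hierarchical agglomerative clustering, the sketch below groups one visitor's request times using single linkage and a fixed cutoff. It is a simplified variant written for this example: the stopping criterion of the method in [3] may differ, and the 1800-second cutoff, the function names, and the bounce rate definition (share of single-request sessions) are assumptions.

```python
def cluster_sessions(timestamps, max_gap):
    """Group one visitor's request times (in seconds) into sessions.

    Single-linkage agglomerative clustering: every request starts as its own
    cluster, and the two closest neighbouring clusters are merged until the
    smallest remaining gap exceeds `max_gap`.
    """
    clusters = [[t] for t in sorted(timestamps)]
    while len(clusters) > 1:
        # On sorted 1-D data, the single-linkage distance between neighbours
        # is the gap between the end of one cluster and the start of the next.
        gaps = [clusters[i + 1][0] - clusters[i][-1] for i in range(len(clusters) - 1)]
        i = min(range(len(gaps)), key=gaps.__getitem__)
        if gaps[i] > max_gap:
            break
        clusters[i:i + 2] = [clusters[i] + clusters[i + 1]]
    return clusters


def bounce_rate(sessions):
    """Share of sessions containing a single request (assumed definition)."""
    if not sessions:
        return 0.0
    return sum(1 for s in sessions if len(s) == 1) / len(sessions)


# Made-up request times for one visitor, in seconds.
sessions = cluster_sessions([0, 10, 25, 4000, 4030, 9000], max_gap=1800)
durations = [s[-1] - s[0] for s in sessions]
```

With these inputs the requests fall into three sessions, the last of which contains a single request. Session size and duration follow directly from each cluster.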

Conclusions and Future Perspectives

Our analytics platform relieves Linked Data publishers of time-consuming log mining and, unlike other popular web analytics platforms, supports linked data-specific traffic metrics. Knowledge of traffic patterns helps gauge the popularity of a dataset: for example, awareness of decreasing user retention might prompt better promotion (e.g. hackathons, spreading the word on community mailing lists). Likewise, if portions of a dataset are never accessed, perhaps better data documentation is required.

Note that the extracted metrics should be considered a lower-bound estimation: because we do not track visitors, we have a partial view of the communication with the data store, and we cannot circumvent intermediate components between visitors and datasets (e.g. caches, proxy servers, or NAT). Besides, visitors might fake user agent strings or the HTTP referrer, thus leading to client identification mistakes.

¹⁰ https://httpd.apache.org/docs/trunk/logs.html#common

We will add new metrics in future extensions, such as finer-grained SPARQL insights (e.g. useful to fine-tune SPARQL engine caches). Users suggest upgrading the web interface with secondary dimension capabilities to improve reporting. Real-time monitoring is also part of the roadmap.

Acknowledgments. This work has been supported by the TOMOE project funded by Fujitsu Laboratories Limited in collaboration with Insight Centre for Data Analytics at National University of Ireland Galway.

References

1. Fasel and Zumstein. A fuzzy data warehouse approach for web analytics. In Procs of WSKS. Springer, 2009.
2. K. Moller, M. Hausenblas, Cyganiak, and Handschuh. Learning from linked open data usage: Patterns & metrics. 2010.
3. G. C. Murray, Lin, and Chowdhury. Identification of user sessions with hierarchical agglomerative clustering. ASIS&T, 43(1):1-9, 2006.
4. Picalausa and Vansummeren. What are real SPARQL queries like? In Procs of SWIM, page 7. ACM, 2011.