<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Towards Linked Statistical Data Analysis</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sarven Capadisli</string-name>
          <email>info@csarven.ca</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sören Auer</string-name>
          <email>auer@cs.uni-bonn.de</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Reinhard Riedl</string-name>
          <email>reinhard.riedl@bfh.ch</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Bern University of Applied Sciences</institution>
          ,
          <addr-line>Bern</addr-line>
          ,
          <country country="CH">Switzerland</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Universität Leipzig, Institut für Informatik, AKSW</institution>
          ,
          <addr-line>Leipzig</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Bonn</institution>
          ,
          <addr-line>Bonn</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Linked Data principles are increasingly employed to publish high-fidelity, heterogeneous statistical datasets in a distributed way. Currently, there exists no simple way for researchers, journalists and interested people to compare statistical data retrieved from different data stores on the Web. Given that the RDF Data Cube vocabulary is used to describe statistical data, its use makes it possible to discover and identify statistical data artifacts in a uniform way. In this article, the design and implementation of an application and service is presented, which utilizes federated SPARQL queries to gather statistical data from distributed data stores. The R language for statistical computing is employed to perform statistical analyses and visualizations. The Shiny application and server bridges the front-end Web user interface with R on the server-side in order to compare statistical macrodata, and stores analyses results in RDF for future research. As a result, distributed linked statistical data can be more easily explored and analysed.</p>
      </abstract>
      <kwd-group>
        <kwd>Document ID: http://csarven.ca/linked-statistical-data-analysis</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1    Introduction</title>
      <p>Statistical data artifacts and the analyses conducted on the data are
fundamental to testing scientific theories about our societies and the universe(s)
we live in. As statistics are often used to add credibility to an argument or
advice, they influence the decisions we make. The decisions are, however,
complex beings of their own, with multiple variables based on facts, cognitive
processes, social demands, and maybe even factors that are unknown to us.
Regardless of uncontrollable forces, in order for society to track and learn
from its own vast knowledge about events and things, it needs to be able to
gather statistical information from heterogeneous and distributed sources. This
is to uncover insights, make predictions, or build smarter systems that
society needs to progress. This brings us to the core of our research challenge:
how do we reliably acquire statistical data in a uniform way and conduct
well-formed analyses that are accessible to different types of data consumers and
users?</p>
      <p>
        This article presents an approach - Statistical Linked Data Analyses - towards
this challenge, along with its contributions. In a nutshell, it takes advantage of Linked
Data design principles, which are widely accepted as a way to publish and
consume data on the Web without central coordination. The work herein offers a
Web-based user interface for researchers, journalists, or interested people to
compare statistical data from different sources against each other without
having any knowledge of the underlying technology or the expertise to develop
it themselves. The service we built runs decentralized
(federated) structured queries to retrieve data from various endpoints, conducts an
analysis on the data, and provides the analysis back to the user. For future
research, each analysis is stored so that it can be searched for and reused.
As pointed out in Statistical Linked Dataspaces [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], what linked statistics
provide, and in fact enable, are queries across datasets: given that the
dimension concepts are interlinked, one can learn from a certain observation's
dimension value, and enable the automation of cross-dataset queries.
      </p>
      <p>
        The RDF Data Cube vocabulary [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] is used to describe multi-dimensional
statistical data, and SDMX-RDF as one of the statistical information models. It
makes it possible to represent significant amounts of heterogeneous statistical
data as Linked Data where they can be discovered and identified in a uniform
way. The statistical artifacts that are produced, and which use this vocabulary,
are invaluable for statisticians, researchers, and developers.
      </p>
      <p>
        Linked SDMX Data [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] provided templates and tooling to transform
SDMX-ML data from statistical agencies to RDF/XML, resulting in linked
statistical datasets at 270a.info [
        <xref ref-type="bibr" rid="ref4">4</xref>
          ] using the RDF Data Cube vocabulary. In
addition to semantically uplifting the original data, provenance information
was tracked using the PROV Ontology [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] at transformation time,
while incorporating retrieval-time provenance data.
      </p>
    </sec>
    <sec id="sec-2">
      <title>3    Related Work</title>
      <p>
        Performing Statistical Methods on Linked Data [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] investigated simple statistical
calculations, such as linear regression and presented the results using R [
        <xref ref-type="bibr" rid="ref7">7</xref>
          ] and
SPARQL queries. It highlighted a wide range of typical data-integration issues
for heterogeneous statistical data. The other technical issues
raised are SPARQL query performance and the use of a central SPARQL
endpoint containing multiple data sources. As future work, it
pointed to a friendly user interface that allows selection of datasets and
statistical methods, and visualization of the results.
      </p>
      <p>
        Defining and Executing Assessment Tests on Linked Data for Statistical
Analysis [
        <xref ref-type="bibr" rid="ref8">8</xref>
          ] identifies the identification of data items, the analysis of data
characteristics, and data matching as key requirements for conducting statistical
analysis on integrated Linked Data.
      </p>
      <p>
        Linked Open Piracy: A story about e-Science, Linked Data, and statistics [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]
investigated analysis and visualization of piracy reports to answer domain
questions through a SPARQL client for R [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
      <p>
        Towards Next Generation Health Data Exploration: A Data Cube-based
Investigation into Population Statistics for Tobacco [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], presents the qb.js [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]
tool to explore data that is expressed as RDF Data Cubes. It is designed to
formulate and explore hypotheses. Under the hood, it makes a SPARQL query
to an endpoint which contains the data that it analyzes.
      </p>
      <p>
        Publishing Statistical Data on the Web [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] explains CubeViz [
        <xref ref-type="bibr" rid="ref14">14</xref>
          ], which was
developed to visualize multidimensional statistical data. It is a faceted
browser, which utilizes the RDF Data Cube vocabulary, with a chart
visualization component. The inspection and results are for a single dataset.
      </p>
      <p>
        Google Public Data Explorer [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], derived from the Gapminder [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] tool,
displays statistical data as line graphs, bar graphs, cross sectional plots or on
maps. The process to display the data requires the data to be uploaded in CSV
format, and accompanying Dataset Publishing Language (DSPL) [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] in XML to
describe the data and metadata of the datasets. Its visualizations and
comparisons are based on one dataset at a time.
      </p>
      <p>
        Generating Possible Interpretations for Statistics from Linked Open Data [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]
talks about Explain-a-LOD [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] tool which focuses on generating hypotheses
that explain statistics. It has a configuration to compare two variables, and then
provides possible interpretations of the correlation analysis for users to review.
      </p>
      <p>Looking at this state of the art, we can see a common pattern: the
analysis is conducted on central repositories. As statistical Linked Data is
published by different parties independently from one another, it is only
reasonable to work towards a solution that can gather, integrate and analyze the
data without having to resort to centralism.</p>
    </sec>
    <sec id="sec-3">
      <title>4    Analysis platform for Linked Statistical Data</title>
      <p>
        The analysis platform is focused on two goals: 1) a Web user interface for
researchers to compare macrodata observations and to view plots and analysis
results, 2) caching and storage of that analysis for future research and reuse.
Here, we describe the platform at stats.270a.info [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ].
      </p>
      <sec id="sec-3-1">
        <title>4.1    Functional Requirements</title>
        <p>The requirements for functionality and performance are that Linked Data design
principles are employed behind the scenes to pull in the statistical data that are
needed to conduct analysis, and to make the results of the analysis available
using the same methods for both, humans and machines. While achieving this
workflow includes many steps, the front-end interface for humans should aim for
minimum interactivity that is required to accomplish this. Finally, the
performance of the system should be reasonable for a Web user interface, as it
needs to display a visualization and present analysis. Additionally, essential
parts of the analysis should be cached and stored for future use both, for
application responsiveness and data discovery.</p>
      </sec>
      <sec id="sec-3-2">
        <title>4.2    User interface</title>
        <p>
          A web application was created to provide users with a simple interface to
conduct regression analysis and display scatter plots. The interface presents
three drop-down selection areas for the user: an independent variable, a
dependent variable, and a time series. Both the independent and dependent
variables are composed of a list of datasets with observations, and time series
are composed of reference periods of those observations. Upon selecting and
submitting datasets to compare, the interface presents a scatter plot with
the best line of best fit from a list of linear models that are tested. The points in
the scatter plot represent locations, in this case countries, which happen to have
a measure value for both variables, as well as for the reference period that was
selected by the user. Below the scatter plot, a table of analysis results is
presented. Figure 1 is a screenshot of the user interface.
        </p>
        <p>The datasets are compiled by gathering qb:DataSets (an RDF Data Cube
class for datasets) from each statistical dataspace at 270a.info. Similarly, the
reference periods are derived from calendar intervals, e.g., YYYY, YYYY-MM-DD,
YYYY-QQ.</p>
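        <p>As an illustration of how such reference-period notations can be normalized, the following sketch (in Python rather than the platform's R; the function name and the YYYY-Qn spelling of quarters are assumptions) maps each notation to the first day of the interval it denotes:</p>

```python
import re
from datetime import date

def parse_ref_period(notation):
    """Normalize a reference-period notation (YYYY, YYYY-MM-DD, or
    YYYY-Qn) to the first day of the interval it denotes."""
    if re.fullmatch(r"\d{4}", notation):                    # e.g. "2009"
        return date(int(notation), 1, 1)
    m = re.fullmatch(r"(\d{4})-Q([1-4])", notation)         # e.g. "2009-Q3"
    if m:
        year, quarter = int(m.group(1)), int(m.group(2))
        return date(year, 3 * (quarter - 1) + 1, 1)
    m = re.fullmatch(r"(\d{4})-(\d{2})-(\d{2})", notation)  # e.g. "2009-07-15"
    if m:
        return date(*map(int, m.groups()))
    raise ValueError(f"unrecognized reference period: {notation}")
```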
      </sec>
      <sec id="sec-3-3">
        <title>4.3    Data Requirements</title>
        <p>Our expectation from the data is that it is modeled using the RDF Data Cube
vocabulary and is well-formed. Specifically, it needs to pass some of the integrity
constraints as outlined by the vocabulary specification. For our application,
some of the essential checks are that: a unique data structure definition (DSD)
is used for each dataset, the DSD includes a measure (the value of each observation),
concept dimensions have code lists, and codes are taken from those code lists.</p>
        <p>
          In addition to adherence to well-formedness, to compare variables from two datasets,
there needs to be an agreement on the concepts that are being matched in the
respective observations. Here, the primary concern is the reference areas
(locations): making sure that the comparison made for the observations
from datasetx (independent variable) and datasety (dependent variable)
uses concepts that are interlinked (via the property skos:exactMatch).
Practically, a concept, e.g., Switzerland, from at least one of the datasets' code
lists should have an arc to the other. This ensures a reliable degree of
confidence that the particular concept is interchangeable, and hence that the measure
corresponding to the phenomenon being observed is about the same location in
both datasets. Concepts in the datasets were interlinked using the LInk discovery
framework for MEtric Spaces (LIMES) [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ]. Figure 2 shows the available outbound
interlinks for the datasets at http://270a.info/.
        </p>
        <p>The limitation of the interlinks is that reference areas (concepts) are
interlinked based on their notations and labels, excluding their temporality and
changes in their space. The shortcoming is that machine-readable metadata
is unavailable from many sources. Future plans to accommodate this will
provide a richer concept-assignment system and incorporate provenance.</p>
        <p>One additional requirement from the datasets is that the RDF Data Cube
component properties (e.g., dimensions, measures) either use
sdmx-dimension:refArea, sdmx-dimension:refPeriod, sdmx-measure:obsValue
directly or are rdfs:subPropertyOf one of them. Given decentralized mappings of the
statistical datasets (published as SDMX-ML), their commonality is expected to
be the use of, or a reference to, SDMX-RDF properties in order to achieve
generalized federated queries without having complete knowledge of the
structures of the datasets, but rather only the essential bits.</p>
        <p>In order to proceed with the analysis, we use the selections made by the user:
datasetx, datasety, and reference period, and then gather all observations with
corresponding reference areas and measures (observation values). Only the
observations whose reference-area values have interlinked concepts
are retained in the final result.</p>
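        <p>The retention step described above can be sketched as follows (a simplified, in-memory illustration in Python; the dictionary keys and a pre-fetched exactMatch value per row are assumptions, since the platform performs this join inside the federated SPARQL query itself):</p>

```python
def join_observations(rows_x, rows_y):
    """Pair observations from the independent (x) and dependent (y)
    datasets, keeping a pair only when the reference-area concepts are
    identical or linked by skos:exactMatch in either direction."""
    paired = []
    for rx in rows_x:
        for ry in rows_y:
            if (rx["refArea"] == ry["refArea"]
                    or rx.get("exactMatch") == ry["refArea"]
                    or ry.get("exactMatch") == rx["refArea"]):
                paired.append((rx["refArea"], rx["measure"], ry["measure"]))
    return paired
```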
      </sec>
      <sec id="sec-3-4">
        <title>4.4    Application</title>
        <p>
          Shiny [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ], an R package, along with Shiny server [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ] is used to build an
interactive web application. A Shiny application was built to essentially allow an
interaction between the front-end Web application and R. User inputs are set to
trigger an event which is sent to the Shiny server and handled by the application
written in R. While the application uses R for statistical analysis and
visualizations, other statistical computing software could be used to achieve
the goals of this research. The motivation to use R is that it is open source
and that it is a requirement of Shiny server.
        </p>
        <p>The application assembles a SPARQL query from the input values and then
sends it to the stats.270a.info/sparql endpoint, which dispatches federated queries
to the two SPARQL endpoints where the datasets are located. The SPARQL query
request is handled by the SPARQL client for R. The query results are retrieved
and given to R for statistical data analysis. R generates a scatter plot containing
the (in)dependent variables, where each point in the chart is a reference area
(e.g., country) for that particular reference-period selection. Regression analysis
is performed, where the correlation, p-value, and line of best fit are determined after
testing several linear models, and shown in the user interface.</p>
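        <p>A minimal sketch of how a best-fitting linear model might be chosen (in Python instead of the platform's R; the exact set of models the platform tests is not specified, so the candidate transformations below are assumptions):</p>

```python
import math

def fit_linear(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b, r_squared)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = my - b * mx
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return a, b, 1 - ss_res / ss_tot

def best_model(xs, ys):
    """Fit y against a few transformations of x and keep the fit with
    the highest R-squared (the candidate set here is an assumption)."""
    candidates = {
        "linear": lambda x: x,
        "log": lambda x: math.log(x),
        "sqrt": lambda x: math.sqrt(x),
    }
    scored = {name: fit_linear([f(x) for x in xs], ys)
              for name, f in candidates.items()}
    name = max(scored, key=lambda k: scored[k][2])
    return name, scored[name]
```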
      </sec>
      <sec id="sec-3-5">
        <title>4.5    Federated Queries</title>
        <p>During this research, establishing a correct and reasonably performing federated
query was one of the most demanding steps. This was due in part to ensuring
dataset integrity, and to finding a balance between processing and filtering applicable
observations at the remote endpoints and at the originating endpoint. The challenge
was compromising between what should be processed remotely and sent over the
wire versus handling some of that workload at the parent endpoint. Since one of
the requirements was to ensure that the concepts are interlinked at either one of
the endpoints (in which case, it is optional per endpoint), each endpoint had to
include each observation's reference area as well as its interlinked concept. The
results from both endpoints were first joined and then filtered in order to avoid
false negatives. That is, either conceptx has a skos:exactMatch relationship to
concepty, or vice versa, or conceptx and concepty are the same. One quick and
simple way to minimize the number of results was to filter out, at each endpoint,
exact matches which did not contain the other dataset's domain name, thereby
minimizing the number of join operations which had to be handled by the
parent endpoint.</p>
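        <p>The per-endpoint filtering trick can be illustrated as follows (a Python sketch of the idea; in the actual query it is realized as a REGEX FILTER on the skos:exactMatch target, and the function name here is illustrative):</p>

```python
from urllib.parse import urlparse

def links_to_domain(exact_matches, other_domain):
    """Keep only skos:exactMatch targets that point into the other
    dataset's domain, discarding links that could never join with the
    other endpoint's reference areas."""
    return [uri for uri in exact_matches
            if urlparse(uri).netloc == other_domain]
```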
        <p>
          In order to put the cost of the queries briefly into perspective, i.e., the conducted
tests and the sample sizes of the dataspaces that were used, the total numbers of
triples (including observations and metadata) per endpoint are: 50 thousand
(Transparency International [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ]), 54 million (Food and Agriculture
Organization of the United Nations [FAO] [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ]), 225 million (Organisation for
Economic Co-operation and Development [OECD] [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ]), 221 million (World
Bank [
          <xref ref-type="bibr" rid="ref27">27</xref>
          ]), 242 million (European Central Bank [ECB] [
          <xref ref-type="bibr" rid="ref28">28</xref>
          ]), 36 million
(International Monetary Fund [IMF] [
          <xref ref-type="bibr" rid="ref29">29</xref>
          ]).
        </p>
        <p>
          The anatomy of the query is shown in Figure 3. Essentially, the SPARQL
Endpoint URI and the dataset URI are the only requirements. The structure of
the statements and operations are particular to getting the most out of Apache
Jena's [
          <xref ref-type="bibr" rid="ref30">30</xref>
          ] TDB storage system [
          <xref ref-type="bibr" rid="ref31">31</xref>
          ], TDB Optimizer [
          <xref ref-type="bibr" rid="ref32">32</xref>
          ] and Fuseki [
          <xref ref-type="bibr" rid="ref33">33</xref>
          ]
SPARQL endpoints. Better performing queries can be achieved by knowing the
frequencies of the predicates upfront, and choosing better orders for a given
dataset to avoid processing of false negatives.
        </p>
        <preformat>
SELECT ?refAreaY ?x ?y ?identityX ?identityY
WHERE {
  SERVICE &lt;http://example.org/sparql&gt; {
    SELECT DISTINCT ?identityX ?refAreaX ?refAreaXExactMatch ?x
    WHERE {
      ?observationX qb:dataSet &lt;http://example.org/dataset/X&gt; .
      ?observationX ?propertyRefPeriodX exampleRefPeriod:1234 .
      ?propertyRefAreaX rdfs:subPropertyOf* sdmx-dimension:refArea .
      ?observationX ?propertyRefAreaX ?refAreaX .
      ?propertyMeasureX rdfs:subPropertyOf* sdmx-measure:obsValue .
      ?observationX ?propertyMeasureX ?x .
      &lt;http://example.org/dataset/X&gt;
        qb:structure/stats:identityDimension ?propertyIdentityX .
      ?observationX ?propertyIdentityX ?identityX .
      OPTIONAL {
        ?refAreaX skos:exactMatch ?refAreaXExactMatch .
        FILTER (REGEX(STR(?refAreaXExactMatch), "^http://example.net/"))
      }
    }
  }
  SERVICE &lt;http://example.net/sparql&gt; {
    SELECT DISTINCT ?identityY ?refAreaY ?refAreaYExactMatch ?y
    WHERE {
      ?observationY qb:dataSet &lt;http://example.net/dataset/Y&gt; .
      ?observationY ?propertyRefPeriodY exampleRefPeriod:1234 .
      ?propertyRefAreaY rdfs:subPropertyOf* sdmx-dimension:refArea .
      ?observationY ?propertyRefAreaY ?refAreaY .
      ?propertyMeasureY rdfs:subPropertyOf* sdmx-measure:obsValue .
      ?observationY ?propertyMeasureY ?y .
      &lt;http://example.net/dataset/Y&gt;
        qb:structure/stats:identityDimension ?propertyIdentityY .
      ?observationY ?propertyIdentityY ?identityY .
      OPTIONAL {
        ?refAreaY skos:exactMatch ?refAreaYExactMatch .
        FILTER (REGEX(STR(?refAreaYExactMatch), "^http://example.org/"))
      }
    }
  }
  FILTER (?refAreaYExactMatch = ?refAreaX
          || ?refAreaXExactMatch = ?refAreaY
          || ?refAreaY = ?refAreaX)
}
ORDER BY ?identityY ?identityX ?x ?y
        </preformat>
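        <p>Since the endpoint URI and the dataset URI are the only per-request inputs, assembling such a query can be as simple as template substitution. The following abridged Python sketch illustrates the idea (the query skeleton is shortened, and the function name is illustrative):</p>

```python
from string import Template

# Abridged skeleton of one SERVICE block of the federated query; only
# the endpoint and dataset URIs vary per request.
QUERY_TEMPLATE = Template("""
SELECT ?refArea ?x WHERE {
  SERVICE <$endpoint> {
    ?obs qb:dataSet <$dataset> ;
         ?propRefArea ?refArea ;
         ?propMeasure ?x .
    ?propRefArea rdfs:subPropertyOf* sdmx-dimension:refArea .
    ?propMeasure rdfs:subPropertyOf* sdmx-measure:obsValue .
  }
}
""")

def build_service_query(endpoint, dataset):
    """Fill the two request-specific URIs into the query skeleton."""
    return QUERY_TEMPLATE.substitute(endpoint=endpoint, dataset=dataset)
```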
        <p>
          For the time being, the use of NAMED GRAPHs in the SPARQL queries was
excluded for a good reason: for federated queries to work with the goal of
minimal knowledge about store organization, the queries had to work without
including graph names. However, by employing the Vocabulary of Interlinked
Datasets (VoID) [
          <xref ref-type="bibr" rid="ref34">34</xref>
          ], it is possible to extract both the location of the SPARQL
endpoint and the graph names within. This is left as a future
enhancement.
        </p>
        <p>As statistical datasets are multi-dimensional, slicing the datasets by only
reference area and reference period is insufficient. It is likely that there would
be duplicate results if we left the column order at reference area, measurex,
measurey. For this reason, there is an additional expectation that the datasets
indicate one other dimension to group the observations by. This grouping is
also used to display faceted scatter plots.</p>
        <p>
          Recommendations from On the Formulation of Performant SPARQL Queries
[
          <xref ref-type="bibr" rid="ref35">35</xref>
          ] and Querying over Federated SPARQL Endpoints — A State of the Art
Survey [
          <xref ref-type="bibr" rid="ref36">36</xref>
          ] were applied where applicable.
        </p>
      </sec>
      <sec id="sec-3-6">
        <title>4.6    Analysis caching and storing</title>
        <p>In order to optimize application reactivity for all users, analyses for previously
selected options are cached in the Shiny server session. That is, the service is
able to provide cached results which were originally triggered by different users.</p>
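        <p>The session-level cache can be sketched as follows (a Python illustration; the real implementation lives in the Shiny server session, and the class name and key structure here are assumptions):</p>

```python
class AnalysisCache:
    """In-memory cache keyed by the user's selection, so a result
    computed for one user is served to every later user who picks the
    same datasets and reference period."""

    def __init__(self, compute):
        self._compute = compute   # expensive federated query + analysis
        self._store = {}

    def get(self, dataset_x, dataset_y, period):
        key = (dataset_x, dataset_y, period)
        if key not in self._store:
            self._store[key] = self._compute(*key)
        return self._store[key]
```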
        <p>In addition to the cache that is closest to the user, results from the federated
queries, as well as previously conducted R analyses, are stored back
into the RDF store with a SPARQL Update. This serves multiple purposes. In
the event that the Shiny server is restarted and the cache is no longer available,
previously calculated results in the store can be reused, which is still more cost
efficient than making new federated queries.</p>
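        <p>Composing the update that writes an analysis back to the store might look like the following Python sketch (the predicate names and serialized values are supplied by the caller, since the actual property names of the stats vocabulary are not reproduced here):</p>

```python
def build_insert_update(analysis_uri, results, graph=None):
    """Compose a SPARQL 1.1 INSERT DATA update for cached analysis
    results. Keys in `results` are predicate names (prefixed or full
    IRIs) and values are already-serialized RDF terms."""
    triples = "\n".join(f"  <{analysis_uri}> {p} {o} ."
                        for p, o in results.items())
    inner = f"{{\n{triples}\n}}"
    if graph is not None:
        # Optionally target a named graph.
        inner = f"{{ GRAPH <{graph}> {inner} }}"
    return "INSERT DATA " + inner
```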
        <p>Another reason for storing the results back in the RDF store is to offer them
over the stats.270a.info SPARQL endpoint for additional discovery and reuse of
analyses by researchers. Interesting use cases for this approach emerge
immediately. For instance, a researcher or journalist can investigate analyses that
meet their criteria. Some examples are as follows:</p>
        <list list-type="bullet">
          <list-item>
            <p>analyses which are statistically significant and have to do with Gross Domestic Product (GDP) and health subjects</p>
          </list-item>
          <list-item>
            <p>a list of indicator pairs with strong correlations</p>
          </list-item>
          <list-item>
            <p>using the line of best fit of a regression analysis to predict or forecast possible outcomes</p>
          </list-item>
          <list-item>
            <p>countries which have a lower than average mortality rate along with high corruption</p>
          </list-item>
        </list>
      </sec>
      <sec id="sec-3-7">
        <title>4.7    URI patterns</title>
        <p>The design pattern for the analysis URIs, which refer to the data and the
analysis, aims to keep the length as minimal as possible, while leaving a trace
to encourage self-exploration and reuse. The general URI pattern with base
http://stats.270a.info/analysis/ is as follows:</p>
        <p>As the URIs for both the independent and dependent variables are based on datasets,
and the reference period is codified, their prefixed names are used instead in the
analysis URI to keep it short and friendly:</p>
        <p>For example, the URI http://stats.270a.info/analysis/
worldbank:SP.DYN.IMRT.IN/transparency:CPI2009/year:2009 refers to an
analysis which entails the infant mortality rate from the World Bank dataset as
the independent variable, 2009 corruption perceptions index from the
Transparency International dataset as the dependent variable, for the reference
interval year:2009. The variable values are prefixed names, which correspond to
their respective datasets, i.e., worldbank:SP.DYN.IMRT.IN becomes
http://worldbank.270a.info/dataset/SP.DYN.IMRT.IN, and
transparency:CPI2009 becomes http://transparency.270a.info/dataset/CPI2009
when processed.</p>
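        <p>The prefixed-name expansion described above can be sketched as follows (a Python illustration; only the two namespaces from the example are registered, and other dataspaces would be added the same way):</p>

```python
# Prefix-to-namespace mapping; the two entries follow the expansion
# example in the text.
PREFIXES = {
    "worldbank": "http://worldbank.270a.info/dataset/",
    "transparency": "http://transparency.270a.info/dataset/",
}

def expand(prefixed_name):
    """Expand a prefixed name such as worldbank:SP.DYN.IMRT.IN to the
    full dataset URI."""
    prefix, _, local = prefixed_name.partition(":")
    return PREFIXES[prefix] + local
```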
      </sec>
      <sec id="sec-3-8">
        <title>4.8    Vocabularies</title>
        <p>Besides the common vocabularies: RDF, RDFS, XSD, OWL, the RDF Data
Cube vocabulary is used to describe multi-dimensional statistical data, and
SDMX-RDF for the statistical information model. PROV-O is used for
provenance coverage.</p>
        <p>
          A statistical vocabulary (http://stats.270a.info/vocab)[
          <xref ref-type="bibr" rid="ref37">37</xref>
          ] is created to
describe analyses. It contains classes for analyses, summaries and each data row
that is retrieved. Some of the properties include: graph (e.g., scatter plot),
independent and dependent variables, reference period, sample size, p-value,
correlation value, correlation method that is used, adjusted R-squared, best
model that is tested, reference area, measure values for both variables, and the
identity concept for both variables.
        </p>
        <p>
          Future plans for this vocabulary are to reflect back on the experience, and to
consider alignment with the Semanticscience Integrated Ontology (SIO) [
          <xref ref-type="bibr" rid="ref38">38</xref>
          ]. While
SIO is richer, its queries are more complex than necessary for simple analysis reuse
at stats.270a.info.
        </p>
        <p>
          Putting it all together: following the Linked Data design principles, the platform
for linked statistical data analyses is now available for different types of users.
Human users with a Web browser can interact with the application with a few
clicks. This is arguably the simplest approach for researchers and journalists,
without having to go down the development road. Additionally, humans as well
as machines can consume the same analysis as an RDF or JSON serialization. In
the case of JSON, the analyses can be used as part of a widget on a webpage.
The Scalable Vector Graphics (SVG) format of the scatter plot can be used in
articles on the Web. Storing the analyses permanently and having them accessible
over a SPARQL endpoint opens up the possibility for researchers to discover
interesting statistics. Finally, with the help of Apache rewrites, Linked Data
Pages [
          <xref ref-type="bibr" rid="ref39">39</xref>
          ] handles the direction of these requests and provides dereferenceable
URIs for a follow-your-nose type of exploration. The source code [
          <xref ref-type="bibr" rid="ref40">40</xref>
          ] is available
at a public repository.
        </p>
        <p>
          We believe that the work presented here and the prior Linked SDMX Data
effort have contributed towards strengthening the relationship between the Semantic Web
/ Linked Data and statistical communities. The stats.270a.info service is
intended to allow humans and machines to explore statistical analyses.
        </p>
        <p>
          Some research and application areas are planned as future work.
Making the query-optimization file from Jena TDB available in RDF at the
SPARQL endpoints, or placing it in VoID along with LODStats [
          <xref ref-type="bibr" rid="ref41">41</xref>
          ], can help to devise better performing federated queries.
        </p>
        <p>With the availability of more interlinks across datasets, we can investigate
analysis that is not dependent on reference areas. For instance, interlinking
currencies, health matters, policies, or concepts on comparability can contribute
towards various analyses.</p>
        <p>
          Enriching the datasets with information on comparability can lead to
achieving more coherent results. This is particularly important given that the
European Statistics Code of Practice [
          <xref ref-type="bibr" rid="ref42">42</xref>
          ] from the European Commission lists
Coherence and Comparability as one of the principles that national and
community statistical authorities should adhere to. While the research at hand
is not obligated to follow those guidelines, they are highly relevant for providing
quality statistical analyses.
        </p>
        <p>The availability of the analyses in JSON serialization, and of the cached scatter
plots in SVG format, makes it possible for webpage widgets to use them. For
instance, they can be dynamically used in articles or wiki pages with all
references intact. As the Linked Data approach allows one to explore resources
from one item to another, consumers of the article can follow the trace all the
way back to the source. This is arguably an ideal scenario for showing provenance
and references for fact-checking in online or journal articles. Moreover, since the
analysis is stored, and the queried data can also be exported in different
formats, it can be reused to reproduce the results.</p>
        <p>This brings us to an outlook for Linked Statistical Data Analyses. The reuse
of linked analysis artifacts, as well as the approach of collecting data from different
sources, can help us build smarter systems. It can be employed in fact-checking
scenarios as well as in uncovering decision-making processes, where knowledge from
different sources is put to its potential use when combined.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>7    Acknowledgements</title>
      <p>Many thanks to colleagues who helped one way or another during the course
of this work (not implying any endorsement); in no particular order: Deborah
Hardoon (Transparency International), Axel-Cyrille Ngonga Ngomo (Universität
Leipzig, AKSW), Alberto Rascón (Berner Fachhochschule [BFS]), Michael
Mosimann (BFS), Joe Cheng (RStudio, Inc.), Government Linked Data Working
Group, Publishing Statistical Data group, Apache Jena, Andy Seaborne
(Epimorphics Ltd), Richard Cyganiak (Digital Enterprise Research Institute
[DERI]). And thanks to DERI for graciously offering to host this work on their servers.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Capadisli</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          : Statistical Linked Dataspaces.
          <source>Master's thesis</source>
          , National University of Ireland (
          <year>2012</year>
          ), http://csarven.ca/statistical-linked-dataspaces
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <article-title>The RDF Data Cube vocabulary</article-title>
          , http://www.w3.org/TR/vocab-data-cube/
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Capadisli</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Auer</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ngonga Ngomo</surname>
            ,
            <given-names>A.-C.</given-names>
          </string-name>
          :
          <article-title>Linked SDMX Data</article-title>
          ,
          <source>Semantic Web Journal</source>
          (
          <year>2013</year>
          ), http://csarven.ca/linked-sdmx-data
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>4. 270a.info, http://270a.info/</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>5. The PROV Ontology, http://www.w3.org/TR/prov-o/</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Zapilko</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mathiak</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Performing Statistical Methods on Linked Data</article-title>
          ,
          <source>Proc. Int'l Conf. on Dublin Core and Metadata Applications</source>
          (
          <year>2011</year>
          ), http://dcevents.dublincore.org/IntConf/dc-2011/paper/download/27/16
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>7. The R Project for Statistical Computing, http://www.r-project.org/</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Zapilko</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mathiak</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Defining and Executing Assessment Tests on Linked Data for Statistical Analysis</article-title>
          , COLD, ISWC (
          <year>2011</year>
          ), http://iswc2011.semanticweb.org/fileadmin/iswc/Papers/Workshops/COLD/cold2011_submission_13.pdf
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>van Hage</surname>
            ,
            <given-names>W. R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>van Erp</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Malaisé</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>Linked Open Piracy: A story about e-Science, Linked Data, and statistics</article-title>
          (
          <year>2012</year>
          ), http://www.few.vu.nl/~wrvhage/papers/LOP_JoDS_2012.pdf
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>10. SPARQL client for R, http://cran.r-project.org/web/packages/SPARQL/index.html</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>McCusker</surname>
            ,
            <given-names>J. P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McGuinness</surname>
            ,
            <given-names>D. L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thomas</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Courtney</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tatalovich</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Contractor</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Morgan</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shaikh</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Towards Next Generation Health Data Exploration: A Data Cube-based Investigation into Population Statistics for Tobacco</article-title>
          ,
          <source>Hawaii International Conference on System Sciences</source>
          (
          <year>2012</year>
          ), http://www.hicss.hawaii.edu/hicss_46/bp46/hc6.pdf
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>12. qb.js, http://orion.tw.rpi.edu/~jimmccusker/qb.js/</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Salas</surname>
            ,
            <given-names>P. E. R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mota</surname>
            ,
            <given-names>F. M. D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Martin</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Auer</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Breitman</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Casanova</surname>
            ,
            <given-names>M. A.</given-names>
          </string-name>
          :
          <article-title>Publishing Statistical Data on the Web</article-title>
          , ISWC (
          <year>2012</year>
          ), http://svn.aksw.org/papers/2012/ESWC_PublishingStatisticData/public.pdf
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>14. CubeViz, http://aksw.org/Projects/CubeViz</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>15. Google Public Data Explorer, http://www.google.com/publicdata/</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Gapminder</surname>
          </string-name>
          , http://www.gapminder.org/
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>17. Dataset Publishing Language, https://developers.google.com/public-data/</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Paulheim</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Generating Possible Interpretations for Statistics from Linked Open Data</article-title>
          , ESWC (
          <year>2012</year>
          ), http://www.ke.tu-darmstadt.de/bibtex/attachments/single/310
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>19. Explain-a-LOD, http://www.ke.tu-darmstadt.de/resources/explain-a-lod</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>20. stats.270a.info, http://stats.270a.info/</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Ngonga Ngomo</surname>
            ,
            <given-names>A.-C.</given-names>
          </string-name>
          :
          <article-title>LInk discovery framework for MEtric Spaces (LIMES): A Time-Efficient Hybrid Approach to Link Discovery</article-title>
          (
          <year>2011</year>
          ), http://aksw.org/Projects/limes
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>22. Shiny, http://www.rstudio.com/shiny/</mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>23. Shiny server, https://github.com/rstudio/shiny-server</mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>24. Transparency International, http://transparency.270a.info/</mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>25. Food and Agriculture Organization of the United Nations, http://fao.270a.info/</mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>26. Organisation for Economic Co-operation and Development, http://oecd.270a.info/</mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>27. World Bank, http://worldbank.270a.info/</mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>28. European Central Bank, http://ecb.270a.info/</mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>29. International Monetary Fund, http://imf.270a.info/</mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>30. Apache Jena, http://jena.apache.org/</mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>31. Jena TDB, http://jena.apache.org/documentation/tdb/index.html</mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>32. Jena TDB Optimizer, http://jena.apache.org/documentation/tdb/optimizer.html</mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>33. Jena Fuseki, https://jena.apache.org/documentation/serving_data/</mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>34. Vocabulary of Interlinked Datasets, http://www.w3.org/TR/void/</mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          35.
          <string-name>
            <surname>Loizou</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Groth</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>On the Formulation of Performant SPARQL Queries</article-title>
          , arXiv:1304.0567 (
          <year>2013</year>
          ), http://arxiv.org/abs/1304.0567
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          36.
          <string-name>
            <surname>Rakhmawati</surname>
            ,
            <given-names>N. R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Umbrich</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Karnstedt</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hasnain</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hausenblas</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Querying over Federated SPARQL Endpoints - A State of the Art Survey</article-title>
          , arXiv:1306.1723 (
          <year>2013</year>
          ), http://arxiv.org/abs/1306.1723
        </mixed-citation>
      </ref>
      </ref>
      <ref id="ref37">
        <mixed-citation>37. Stats Vocab, http://stats.270a.info/vocab</mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>38. Semanticscience Integrated Ontology, http://semanticscience.org/ontology/sio.owl</mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>39. Linked Data Pages, http://csarven.ca/statistical-linked-dataspaces#linkeddata-pages</mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>40. LSD Analysis code at GitHub, https://github.com/csarven/lsd-analysis</mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>
          41.
          <string-name>
            <surname>Demter</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Auer</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Martin</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lehmann</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>LODStats - An Extensible Framework for High-performance Dataset Analytics</article-title>
          , EKAW (
          <year>2012</year>
          ), http://svn.aksw.org/papers/2011/RDFStats/public.pdf
        </mixed-citation>
      </ref>
      <ref id="ref42">
        <mixed-citation>42. European Statistics Code of Practice, http://epp.eurostat.ec.europa.eu/portal/page/portal/quality/code_of_practice</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>