<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Multiscale Exploration of Spatial Statistical Datasets: A Linked Data Mashup Approach</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ba-Lam Do</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tuan-Dat Trinh</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Peter Wetz</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Elmar Kiesling</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Amin Anjomshoaa</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>A Min Tjoa</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Vienna University of Technology</institution>
          ,
          <addr-line>Vienna</addr-line>
          ,
          <country country="AT">Austria</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Many national and international organizations today leverage semantic web technologies to make statistical datasets available as Linked Open Data (LOD). A key advantage of this approach is that the data becomes not only publicly available, but also machine-readable and hence suitable for automated discovery and exploration. Although this has great potential to support interesting use cases, it remains difficult for end users today to utilize and combine these statistical Linked Data. Three challenges are: (i) directing users to relevant data sources based on a specified location; (ii) facilitating data integration despite a lack of outgoing links between datasets; and (iii) offering flexible means to integrate and aggregate data from various sources. As time and location are highly relevant dimensions in most statistical data, we address the identified challenges by first constructing geographical metadata for statistical sources. Following a mashup approach, we introduce mechanisms to recommend interesting datasets to end users and automatically enable data integration, visualization, and comparisons based on user-defined criteria.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        In recent years, statistical data communities have increasingly adopted
semantic web technologies in order to formalize and link their published datasets. In
particular, many organizations make use of the W3C RDF Data Cube
vocabulary [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] to publish their data as LOD. The data thus becomes accessible in a
standardized format. However, end users still lack means to utilize
and combine statistical LOD datasets on the web, which may be
attributed to the following reasons:
      </p>
      <p>(i) There are a large and growing number of statistical sources covering a
wide range of domains. Those are valuable assets, but without an appropriate
catalogue, users are unable to discover relevant statistical information.</p>
      <p>(ii) Each statistical LOD dataset typically uses a distinct terminology and
does not consistently include outgoing links to other sources. Consolidating
equivalent entities stored in different sources therefore poses a significant
challenge. For example, Ireland's 2011 census data<sup>1</sup> includes thousands of
geographical entities (county, province, town, city, etc.), none of which are linked to any
external resources. Other LOD sources refer to Irish cities with different URIs.
Collecting all statistical information on a city available as LOD is therefore
difficult.</p>
      <p>(iii) Current statistical data visualization applications typically focus on
displaying a given dataset of a particular granularity (e.g. country level, city level,
etc.). They do not, however, provide mechanisms for locating relevant datasets
suitable for meaningful comparisons.</p>
      <p>This paper addresses these issues, focusing on the geographical dimension
which is highly relevant in most statistical datasets. In particular, we use
geographical points as input for the data source discovery process. Based on a given
location, we can find appropriate datasets that provide statistical information.
For instance, users may locate their home on a map and trigger a retrieval
process that yields datasets containing data about the specified location. Using
the supplied location, we determine the geographical entities in which the location is
situated at different levels and allow users to compare data on the respective
areas with appropriate similar areas, e.g., compare the population of their
country to that of others, or compare income levels in their district to those in other
districts in the same city.</p>
      <p>To deal with the first issue of discovering relevant statistical datasets, we
analyze available SPARQL endpoints to construct geographical metadata, which
connects each dataset to the related locations. We use the Google Geocoding API<sup>2</sup>
to collect information on a geographical area based on its name.</p>
      <p>The second issue, i.e., lacking support for data integration due to missing or
inconsistent links, is addressed by using our constructed data catalogue. This
catalogue uses a consistent unique identifier for the same location, which can
refer to different identifiers in disparate datasets. This approach is possible even
in cases where the original location entities do not reference common external
entities.</p>
      <p>
        To allow users to automatically integrate data from different statistical sources
and overcome the third issue, we make use of the Linked Widgets platform [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], a flexible, interactive mashup platform. This platform implements semantic web
principles and allows users to model widgets and data flows. It also provides
mechanisms for semantic widget discovery, terminal matching, and automatic
widget composition. In this paper, we describe the design of a collection of
widgets targeted towards three use cases dealing with statistical data.
      </p>
      <p>The remainder of this paper is organized as follows. In Section 2, we discuss
the development of a geographical metadata catalogue. Section 3 illustrates our
mashup approach with a collection of widgets and three practical use cases.
Section 4 provides pointers to related work and we conclude in Section 5 with
an outlook on future research.</p>
      <p><sup>1</sup> http://data.cso.ie/datasets/index.html</p>
      <p><sup>2</sup> https://developers.google.com/maps/documentation/geocoding/</p>
    </sec>
    <sec id="sec-2">
      <title>Geographical Metadata for Statistical Data</title>
      <p>Countries are typically subdivided into hierarchical administrative units, e.g.,
regions, provinces, cities, districts, communes. Current statistical LOD sources
usually provide information describing speci c characteristics of these
administrative units in chronological order. Time and location are therefore typically
the most important dimensions for this kind of statistical data.</p>
      <p>We therefore develop a catalogue for spatial statistical data. If users provide
an address, e.g., Donaufelder Strasse 54, Austria, or simply define a point on
a map, we detect the corresponding administrative areas and use those to select
relevant datasets. We then provide suggestions to compare these areas with
other areas at the same administrative level. For example, users can compare
the population of their district to that of other districts in their city, or contrast
the income of their city with that of other cities in the country; on a higher level, they
can compare the happiness index of their country to that of other countries in
the world.</p>
      <p>We use SPARQL endpoints as input to the catalogue creation process and
perform four steps: (i) identify all datasets of the source; (ii) identify all
dimensions and measures for each dataset; (iii) from the set of detected dimensions,
determine the location dimension; (iv) for each text value of the location
dimension, identify its administrative area and create a unique resource in the
catalogue. We rely on the results returned by the Google Geocoding API to deal
with the different geographical hierarchies of countries, ensuring consistent entity
classification.</p>
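      <p>As a rough illustration of steps (i) to (iv), the following Python sketch builds SPARQL queries over the RDF Data Cube vocabulary and picks a location dimension with a simple heuristic. The query shapes, the function names, the resolution-ratio heuristic, and the geocoder stub are assumptions made for this example; they are not taken from the implementation described here.</p>
      <preformat><![CDATA[
```python
"""Sketch of the four catalogue-creation steps (hypothetical helper names)."""

QB_PREFIX = "PREFIX qb: <http://purl.org/linked-data/cube#>\n"


def list_datasets_query():
    # Step (i): every resource typed as qb:DataSet at the endpoint.
    return QB_PREFIX + "SELECT DISTINCT ?dataset WHERE { ?dataset a qb:DataSet . }"


def list_components_query(dataset_uri):
    # Step (ii): dimensions and measures declared in the dataset's structure.
    return (
        QB_PREFIX
        + "SELECT DISTINCT ?dimension ?measure WHERE {\n"
        + "  <" + dataset_uri + "> qb:structure ?dsd .\n"
        + "  ?dsd qb:component ?component .\n"
        + "  { ?component qb:dimension ?dimension . }\n"
        + "  UNION\n"
        + "  { ?component qb:measure ?measure . }\n"
        + "}"
    )


def pick_location_dimension(dimension_values, geocode):
    # Step (iii), assumed heuristic: choose the dimension whose sample labels
    # are most often resolved by the geocoder; None if nothing resolves.
    best_dimension, best_ratio = None, 0.0
    for dimension, labels in dimension_values.items():
        if not labels:
            continue
        resolved = sum(1 for label in labels if geocode(label) is not None)
        ratio = resolved / len(labels)
        if ratio > best_ratio:
            best_dimension, best_ratio = dimension, ratio
    return best_dimension


def build_catalogue_entries(labels, geocode):
    # Step (iv): map each text value to the administrative area it denotes,
    # dropping labels the geocoder cannot resolve.
    entries = {}
    for label in labels:
        area = geocode(label)
        if area is not None:
            entries[label] = area
    return entries


# Stand-in for a geocoding service; a real run would call an external API.
KNOWN_AREAS = {"Dublin": ("Ireland", "Dublin"), "Cork": ("Ireland", "Cork")}


def fake_geocode(label):
    return KNOWN_AREAS.get(label)


# Made-up dimension URIs and sample values for illustration only.
sample = {
    "http://example.org/dim/area": ["Dublin", "Cork", "Unknown place"],
    "http://example.org/dim/year": ["2010", "2011"],
}
location_dimension = pick_location_dimension(sample, fake_geocode)
entries = build_catalogue_entries(sample[location_dimension], fake_geocode)
```
]]></preformat>
      <p>In this sketch the area dimension is selected because two of its three sample labels resolve, whereas none of the year labels do; unresolved labels are simply left out of the catalogue entries.</p>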
      <p>The metadata structure of a dataset is illustrated in Listing 1. A definitive
mapping between the administrative area and the resource URI in the metadata
catalogue must exist so that the same location from different sources has the same
unique URI in the metadata catalogue. This facilitates integration and
aggregation. We define the mapping rule and provide a URI mapping service to convert
an administrative area to the corresponding URI in the catalogue.</p>
      <preformat>PREFIX qb: &lt;http://purl.org/linked-data/cube#&gt;
PREFIX lc: &lt;http://location-based-catalogue.ifs.tuwien.ac.at/&gt;

[] a lc:SPARQLEndpoint ;
   lc:hasDataSet [ lc:describes [ a lc:AdministrativeArea ] ;
                   qb:dimension [ a qb:DimensionProperty ] ;
                   qb:measure   [ a qb:MeasureProperty ] ;
                   qb:attribute [ a qb:AttributeProperty ] ; ]</preformat>
      <p>Listing 1: A SPARQL query for terminal matching</p>
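      <p>The mapping rule itself is not spelled out in this section. One possible rule is sketched below in Python: the names along the administrative hierarchy are normalized and joined into a path under the catalogue namespace, so that differently written labels for the same area yield the same URI. Only the namespace comes from Listing 1; the path layout and the function names slugify and area_to_uri are assumptions made for this example.</p>
      <preformat><![CDATA[
```python
import re
import unicodedata

# Namespace taken from Listing 1; the path layout below is an assumed rule.
LC_NAMESPACE = "http://location-based-catalogue.ifs.tuwien.ac.at/"


def slugify(name):
    # Strip accents, lower-case, and collapse non-alphanumerics to hyphens.
    ascii_name = (
        unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    )
    return re.sub(r"[^a-z0-9]+", "-", ascii_name.lower()).strip("-")


def area_to_uri(hierarchy):
    # hierarchy lists area names from the country down to the area itself.
    if not hierarchy:
        raise ValueError("hierarchy must contain at least the country")
    return LC_NAMESPACE + "/".join(slugify(part) for part in hierarchy)


# Two spellings of the same hierarchy, as different sources might supply them.
uri_a = area_to_uri(["Ireland", "County Dublin", "Dublin"])
uri_b = area_to_uri(["ireland", " County  Dublin ", "DUBLIN"])
```
]]></preformat>
      <p>Both calls produce the same URI, which is the property the catalogue needs in order to merge location entities that the original sources identify differently.</p>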
    </sec>
    <sec id="sec-3">
      <title>Widget and Mashup Approach</title>
      <sec id="sec-3-1">
        <title>Linked Widget Platform</title>
        <p>To enable users to flexibly select and combine statistical datasets and synthesize
desired information, we follow a visual programming paradigm implemented in
the Linked Widgets platform, a widget-based mashup platform. Its key
elements are Linked Widgets, an extension of standard web widgets backed by a
semantic model that follows Linked Data principles. This model describes data
input/output and metadata (such as data provenance and licensing terms) that
are useful for widget search and auto-composition features.</p>
        <p>There are three types of widgets: data, process, and visualization
widgets. The platform provides a graphical interface for creating a data flow and
composing various applications by connecting widgets in different ways. Other
stakeholders can develop widgets independently and contribute them to the
platform to extend its functionality. Whereas existing similar platforms are
oriented more towards low-level data processing with generic widgets, the Linked Widgets platform
is more problem-oriented and supports modeling on a higher, semantic level.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Statistical Widget Collection</title>
        <p>In the platform, widgets are grouped into widget collections. Each collection
addresses a different problem domain. We developed a collection<sup>3</sup> for statistical
data exploration based on spatial context. Each widget can have multiple inputs,
but only a single output. The output of a widget can serve as input of another
one, if their semantic models can be matched, i.e., they have a certain level of
overlap. Our exemplary statistical widget collection consists of the following four
widgets:</p>
        <p>Spatial Entity Recognizer (data widget): This widget accepts an address
text or a user-defined location point as input and calls the Google Geocoding
API service to obtain the corresponding spatial entities at different levels, i.e.,
country level, administrative area level one to five, and sublocality level. It then
uses the URI mapping service to return the URIs of these areas from the
catalogue. Note that these URIs may not be in the catalogue, because the catalogue
does not create instances of all locations in the world; it only contains instances
of locations that exist in the analyzed statistical datasets.</p>
        <p>Spatial Data Locator (process widget): This widget returns a list of datasets
related to the input spatial entity. It contains options to filter the output datasets
based on different domains, e.g., census, transportation, or income.</p>
        <p>Spatial Comparator (process widget): From the input of two
administrative areas, this widget returns a list of datasets that contain information on the
two areas that can be compared.</p>
        <p>Google Chart (visualization widget): This widget presents the input of a
single dataset or a list of datasets in appropriate charts.</p>
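        <p>To make the level extraction performed by the Spatial Entity Recognizer more concrete, the Python sketch below reads the address components of a geocoding reply and keeps one name per level. It assumes the JSON layout of the Google Geocoding API (a results array whose entries carry address_components with long_name and types); the function name extract_levels and the sample reply are hypothetical, and no network call is made.</p>
        <preformat><![CDATA[
```python
# Levels named in the widget description, using the address component type
# names of the Google Geocoding API.
LEVELS = (
    ["country"]
    + ["administrative_area_level_%d" % n for n in range(1, 6)]
    + ["sublocality"]
)


def extract_levels(geocoding_response):
    # Return {level: long_name} for the first result, skipping absent levels.
    if geocoding_response.get("status") != "OK":
        return {}
    results = geocoding_response.get("results", [])
    if not results:
        return {}
    found = {}
    for component in results[0].get("address_components", []):
        for level in LEVELS:
            if level in component.get("types", []) and level not in found:
                found[level] = component["long_name"]
    # Re-order from the widest to the narrowest level.
    return {level: found[level] for level in LEVELS if level in found}


# Hand-written sample shaped like a geocoding reply (not real API output).
sample_response = {
    "status": "OK",
    "results": [
        {
            "address_components": [
                {"long_name": "Floridsdorf",
                 "types": ["political", "sublocality", "sublocality_level_1"]},
                {"long_name": "Wien",
                 "types": ["administrative_area_level_1", "political"]},
                {"long_name": "Austria", "types": ["country", "political"]},
            ]
        }
    ],
}
levels = extract_levels(sample_response)
```
]]></preformat>
        <p>The resulting names would then be passed to the URI mapping service; levels that the reply does not contain are simply omitted.</p>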
        <p>All datasets returned by the two process widgets follow the W3C RDF Data
Cube vocabulary and represent the data in JSON-LD format.</p>
        <p><sup>3</sup> http://linkedwidgets.org/MashupPlatform.html?widgetCollectionId=SpatialStatisticalCollection</p>
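        <p>The selection logic of the two process widgets can be approximated with simple lookups over the catalogue. The Python sketch below uses a toy in-memory catalogue; its layout, the abbreviated identifiers, and the function names locate_datasets and comparable_datasets are assumptions for illustration, not the widgets' actual interface.</p>
        <preformat><![CDATA[
```python
# Toy in-memory catalogue: area URI -> {dataset URI: domain}. All identifiers
# and domain assignments here are invented for illustration.
CATALOGUE = {
    "lc:ireland/dublin": {
        "ds:census-2011": "census",
        "ds:income-2011": "income",
    },
    "lc:ireland/cork": {
        "ds:census-2011": "census",
        "ds:bus-routes": "transportation",
    },
}


def locate_datasets(area_uri, domain=None):
    # Spatial Data Locator: datasets about one area, optionally by domain.
    datasets = CATALOGUE.get(area_uri, {})
    return sorted(
        dataset
        for dataset, dataset_domain in datasets.items()
        if domain is None or dataset_domain == domain
    )


def comparable_datasets(area_a, area_b):
    # Spatial Comparator: datasets that hold information on both areas.
    in_a = set(CATALOGUE.get(area_a, {}))
    in_b = set(CATALOGUE.get(area_b, {}))
    return sorted(in_a & in_b)


dublin_all = locate_datasets("lc:ireland/dublin")
dublin_income = locate_datasets("lc:ireland/dublin", domain="income")
shared = comparable_datasets("lc:ireland/dublin", "lc:ireland/cork")
```
]]></preformat>
        <p>An area that is absent from the catalogue yields an empty list, which mirrors the remark that the catalogue only contains locations found in the analyzed datasets.</p>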
        <p>[Fig. 1: sample mashups; recoverable widget labels: Spatial Entity Recognizer, Spatial Comparator, Google Chart]</p>
        <p>Sample mashups created from the four widgets are shown in Fig. 1: (i) discovery
of statistical information on an area (at different administrative levels) based on
a user's address;<sup>4</sup> (ii) comparison, using criteria selected by the user, of a pair of areas
that have the same administrative level and the same parent area.<sup>5</sup></p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Related Work</title>
      <p>
        In the research field of Geographic Information Systems (GIS), the problem of
spatial dataset discovery and integration has also been addressed. Hariharan et
al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] create summaries of GIS datasets which are then used for source
discovery. They adapt histogram-based techniques that take textual and spatial
information into account. Jones et al. [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] present an architecture of a
spatially-aware search engine to discover and access geo-datasets on the web. Their
so-called SPIRIT Spatial Search Engine uses a geo-ontology and spatial indexing
to deal with query disambiguation, relevance ranking, and metadata extraction.
Li et al. [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] enable intelligent data discovery of geo-referenced data based on
the combination of Latent Semantic Analysis and Two-Tier Ranking. This
technique allows them to build a semantic search engine called Semantic Indexing
and Ranking (SIR), which outperforms existing keyword-matching approaches.
However, these approaches do not rely on Linked Data principles.
      </p>
      <p>
        In the Linked Data domain, with the growing number of statistical SPARQL
endpoints available, several useful applications have emerged from the semantic
web community. The Linked Data Query Wizard is a web-based tool for
displaying, accessing, filtering, exploring, and navigating Linked Data [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. It aims at
providing a tabular interface to cope with users' lack of knowledge about Linked
Data and its graph structure. CubeViz is a facetted browser for statistical data
utilizing the RDF Data Cube vocabulary [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Based on a dataset, CubeViz
generates a facetted browser that can be used to filter observations, which in turn
can be analyzed, explored, and visualized. Stats.270a.info provides an interface
to compare statistical data retrieved from different sources on the web [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. The
application makes use of the RDF Data Cube vocabulary to discover statistical
data in a uniform way. Furthermore, federated SPARQL queries are employed
to gather data from disparate sources.
      </p>
      <p><sup>4</sup> http://linkedwidgets.org?id=MashupSpatialDataLocator</p>
      <p><sup>5</sup> http://linkedwidgets.org?id=MashupSpatialDataComparator</p>
      <p>
        Zapilko et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] discuss issues and challenges for using Linked Open Data
for analyzing and enriching statistical data. They reveal problems regarding
data integration, link generation, aggregation level of the data, and complexities
within the data structures. Kämpgen [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] proposes to use expressive semantic
web ontologies to build a conceptual model of statistical Linked Data from
disparate sources. However, he also describes challenges that need to be addressed,
such as discovering datasets or integrating data of different granularity.
Kalampokis et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] argue that the availability of previously closed
governmental statistical datasets now makes it possible to gain unexpected and unexplored insights
into different domains. They support their claim by describing a linked open
government data analytics vision along with its technical requirements. Finally,
they present a use case dealing with UK general election data.
      </p>
      <p>Similar to the related work, we aim to provide an easy-to-use interface for lay
users to discover and explore statistical datasets. However, in contrast to current
implementations, we want to provide the user with greater control to influence
the process of dataset discovery, on the one hand, and data exploration, on the
other hand. This is achieved by employing an approach based on widgets. Users
can apply widgets to control the data processing flow from the beginning, i.e.,
dataset discovery, to the end, i.e., data visualization, allowing for the creation of
many different scenarios. Furthermore, through creating a metadata catalogue,
we offer a flexible means to compare different datasets, which are related via
administrative levels or the statistical criteria they offer.</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusion and Future Work</title>
      <p>This paper addresses two crucial issues in statistical Linked Data: (i) Where
and how can users discover statistical data? (ii) How can data reconciliation,
ontology matching and instance matching be performed with statistical data?</p>
      <p>We address these issues by building a data catalogue, describing identified
datasets, and finally fostering data discovery by adopting a mashup-based
approach. Our approach is built around geographical context information that
users provide by specifying a geographical point or an address. In future work,
we will focus on the temporal dimension of the data. For instance, users could
provide a timespan they are interested in, and our system would suggest datasets
that cover this timespan. Furthermore, we will support users in discovering
datasets based on keyword search.</p>
      <p>The current mashups work well with datasets that have a fairly limited
number of dimensions. For larger datasets, it is difficult to compare different areas in
a single chart. We can standardize locations through available geocoding APIs,
but standardizing dimensions or measures that do not have the same meaning
is still an open challenge. For instance, the year dimension has different URIs in
different statistical data sources. At present, our implementation is applicable
to datasets of the same statistical LOD source only, and we need to improve this
in the future.</p>
      <p>We are currently developing a Dataset Recommender widget, which accepts a single
spatial entity as input and finds all spatial entities that share the same parent.
Then, based on the criteria chosen by the user, it returns a list of datasets that
contain statistical information relating to these areas. When connecting its
input with the output of Spatial Entity Recognizer and its output with the input
of Google Chart, we obtain another mashup enabling users to compare their area
to appropriate, automatically identified, other areas.</p>
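      <p>Since the Dataset Recommender is still under development, the following Python sketch only illustrates the idea of sibling-based recommendation under simple assumptions: a toy parent table, a toy coverage table, and a threshold parameter min_siblings stand in for the catalogue and for the user-chosen criteria. All names and data are hypothetical.</p>
      <preformat><![CDATA[
```python
# Toy hierarchy: child area -> parent area (illustrative identifiers).
PARENT = {
    "vienna/floridsdorf": "vienna",
    "vienna/donaustadt": "vienna",
    "vienna/favoriten": "vienna",
    "vienna": "austria",
    "salzburg": "austria",
}

# Toy coverage: dataset -> set of areas it has observations for (made up).
COVERAGE = {
    "ds:population": {"vienna/floridsdorf", "vienna/donaustadt", "vienna/favoriten"},
    "ds:income": {"vienna/floridsdorf", "vienna/donaustadt"},
    "ds:tourism": {"vienna", "salzburg"},
}


def siblings(area):
    # All other areas that share the same parent as the given area.
    parent = PARENT.get(area)
    if parent is None:
        return []
    return sorted(a for a, p in PARENT.items() if p == parent and a != area)


def recommend(area, min_siblings=1):
    # Datasets covering the area and at least min_siblings of its siblings,
    # mapped to the siblings that each dataset also covers.
    peers = set(siblings(area))
    hits = {}
    for dataset, areas in COVERAGE.items():
        if area in areas:
            shared = sorted(areas & peers)
            if len(shared) >= min_siblings:
                hits[dataset] = shared
    return hits


suggestions = recommend("vienna/floridsdorf")
strict_suggestions = recommend("vienna/floridsdorf", min_siblings=2)
```
]]></preformat>
      <p>Each suggested dataset comes with the sibling areas it also covers, which is the information a chart widget would need to draw a comparison.</p>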
      <p>Additionally, we plan to automatically create outgoing links to DBpedia for the
location entities of the metadata catalogue. This would allow other developers
to more easily integrate statistical data with the Web of Data.</p>
      <p>Finally, it will be necessary to evaluate precision and recall of the location
detection process during the metadata construction process in a dedicated
experiment.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1. <string-name><surname>Capadisli</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Auer</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Riedl</surname>, <given-names>R.</given-names></string-name>: <article-title>Linked Statistical Data Analysis</article-title>. <source>ISWC SemStats</source> (<year>2013</year>), http://csarven.ca/linked-statistical-data-analysis
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2. <string-name><surname>Cyganiak</surname>, <given-names>R.</given-names></string-name> et al.: <source>The RDF Data Cube Vocabulary</source>. W3C Recommendation (<year>2014</year>)
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3. <string-name><surname>Hoefler</surname>, <given-names>P.</given-names></string-name> et al.: <article-title>Linked Data Query Wizard: A Novel Interface for Accessing SPARQL Endpoints</article-title>. In: <source>Proceedings of the Workshop on Linked Data on the Web</source>, CEUR-WS (<year>2014</year>)
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4. <string-name><surname>Kalampokis</surname>, <given-names>E.</given-names></string-name> et al.: <article-title>Linked Open Government Data Analytics</article-title>. In: Wimmer, M. et al. (eds.) <source>Electronic Government</source>, pp. <fpage>99</fpage>–<lpage>110</lpage>. Springer Berlin Heidelberg (<year>2013</year>)
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5. <string-name><surname>Kämpgen</surname>, <given-names>B.</given-names></string-name>: <article-title>DC Proposal: Online Analytical Processing of Statistical Linked Data</article-title>. In: Aroyo, L. et al. (eds.) <source>The Semantic Web – ISWC 2011</source>, pp. <fpage>301</fpage>–<lpage>308</lpage>. Springer Berlin Heidelberg (<year>2011</year>)
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6. <string-name><surname>Mader</surname>, <given-names>C.</given-names></string-name> et al.: <article-title>Facilitating the Exploration and Visualization of Linked Data</article-title>. In: Auer, S. et al. (eds.) <source>Linked Open Data – Creating Knowledge Out of Interlinked Data</source>, pp. <fpage>90</fpage>–<lpage>107</lpage>. Springer Berlin Heidelberg (<year>2014</year>)
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7. <string-name><surname>Trinh</surname>, <given-names>T.-D.</given-names></string-name> et al.: <article-title>Open Linked Widgets Mashup Platform</article-title>. In: <source>Proceedings of the AI Mashup Challenge 2014 ESWC Satellite Event</source>, CEUR-WS (<year>2014</year>)
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8. <string-name><surname>Zapilko</surname>, <given-names>B.</given-names></string-name> et al.: <article-title>Enriching and Analysing Statistics with Linked Open Data</article-title>. In: <source>NTTS - Conference on New Techniques and Technologies for Statistics</source>, Brussels (<year>2011</year>)
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9. <string-name><surname>Hariharan</surname>, <given-names>R.</given-names></string-name> et al.: <article-title>Discovering GIS Sources on the Web using Summaries</article-title>. In: <source>Proceedings of the 8th ACM/IEEE-CS Joint Conference on Digital Libraries</source>, pp. <fpage>94</fpage>–<lpage>103</lpage>. ACM (<year>2008</year>)
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10. <string-name><surname>Jones</surname>, <given-names>C.B.</given-names></string-name> et al.: <article-title>The SPIRIT Spatial Search Engine: Architecture, Ontologies and Spatial Indexing</article-title>. In: <source>Geographic Information Science</source>, pp. <fpage>125</fpage>–<lpage>139</lpage>. Springer Berlin Heidelberg (<year>2004</year>)
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11. <string-name><surname>Li</surname>, <given-names>W.</given-names></string-name> et al.: <article-title>Towards Geospatial Semantic Search: Exploiting Latent Semantic Relations in Geospatial Data</article-title>. <source>International Journal of Digital Earth</source> <volume>7</volume>(<issue>1</issue>), pp. <fpage>17</fpage>–<lpage>37</lpage>. Taylor &amp; Francis (<year>2014</year>)
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>