<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Assessing the Quality of Geospatial Linked Data - Experiences from Ordnance Survey Ireland (OSi)</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jeremy Debattista</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Eamon Clinton</string-name>
          <email>eamonn.clinton@osi.ie</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rob Brennan</string-name>
          <email>rob.brennan@scss.tcd.ie</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Knowledge and Data Engineering Group, ADAPT Centre, School of Computer Science and Statistics, Trinity College Dublin</institution>
          ,
          <addr-line>Dublin 2</addr-line>
          ,
          <country country="IE">Ireland</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Ordnance Survey Ireland</institution>
          ,
          <addr-line>Phoenix Park, Dublin 8</addr-line>
          ,
          <country country="IE">Ireland</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Ordnance Survey Ireland (OSi) is Ireland's national mapping agency, responsible for the digitisation of the island's infrastructure in terms of mapping. Generating data from various sensors (e.g. spatial sensors), OSi builds its knowledge in the Prime2 framework, a subset of which is transformed into geo-Linked Data. In this paper we discuss how the quality of the generated semantic data fares against datasets in the LOD cloud. We set up Luzzu, a scalable Linked Data quality assessment framework, in the OSi pipeline to continuously assess produced data in order to tackle any quality problems prior to publishing.</p>
      </abstract>
      <kwd-group>
        <kwd>Data Quality</kwd>
        <kwd>Geospatial Data</kwd>
        <kwd>Linked Data</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Ordnance Survey Ireland (OSi) has been publishing authoritative open Linked Data
since 2016 through its national geo-data portal data.geohive.ie [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The
semantically uplifted data analysed in this paper is a subset of OSi’s Prime2 national data
infrastructure (see §2). A key feature of authoritative data publication is the assurance
provided by the use of extensive quality processes, human curation and rule-based
quality checking within Prime2. However, Linked Data has its own quality measures
and challenges [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and so it was important to investigate the relative quality of the
Linked Data being produced. Also, it is hoped that the semantic enrichment of the data
provided by uplifting to RDF could help to provide further semantic quality assurances
and feedback for the underlying Prime2 data.
      </p>
      <p>
        The geohive ontology is an extension of the GeoSPARQL ontology for describing
geographical features and their geometries. The currently published data focuses on
describing Ireland’s administrative entries and their boundaries. However, a significant
expansion of the coverage of Prime2 spatial entities published as Linked Data is
planned for 2018. Geohive currently supports browsing via Pubby, and its
boundary entities are interlinked to equivalent DBpedia concepts where available.
Typical entity properties are multilingual labels in both English and Irish, WKT
(well-known text) representations of the 2D polygons of their geometry at multiple
resolutions, and administrative boundary classifications, e.g. county, barony, or city
council.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Assessing the Quality of Linked Geospatial Data</title>
      <p>
        As part of the OSi Linked Data publishing pipeline, the quality of the generated
Linked Data is assessed in order to maintain the high standards expected of a national
spatial data infrastructure. This is of utmost importance as OSi’s data is used for the
infrastructural planning and development of Ireland. The objectives of the quality
assessment in this pipeline are to help OSi, as a Linked Data producer, to (O1) identify whether
there are any errors in the R2RML mappings that are generating incorrect RDF data;
and (O2) check if Linked Data best practices are being followed. Based on these two
objectives we first identified 19 suitable Linked Data quality metrics (see Table 1) as
defined in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. In order to achieve these objectives, we deployed Luzzu [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], a Linked Data
quality assessment framework, on the OSi release server to assess the quality of the latest
snapshot of OSi boundary data<sup>1</sup>.
      </p>
      <p>
        In this work, the chosen metrics and quality results reflect a data producer view of
quality, rather than the data consumer view that is more common in the literature.
Furthermore, we assume that geographic boundary data (polygons) provided by OSi is
accurate<sup>2</sup> and thus no geo-specific quality measures need to be assessed.
Therefore, our quality assessment scope was narrowed to objective,
domain-independent Linked Data quality metrics. Debattista et al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] formally define 27 metrics and assess
them over the 2015 version of the LOD cloud, which includes geographic datasets.
Nonetheless, not all of these metrics were relevant to the OSi datasets and our
generation pipeline. With regard to our first objective (O1) we looked at the intrinsic
category metrics. These metrics enable us to understand if there are any consistency,
syntactic and conciseness issues in the generated RDF datasets. Metrics from the other
three categories were considered for our second objective (O2).
      </p>
      <p>
        The following results are based on the data dump that was available on 10 December
2017; newer versions of this data are available on data.geohive.ie. In this section,
we present and discuss the quality results and compare them to the mean values
observed in the LOD cloud, as reported in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p><sup>1</sup> https://www.osi.ie/education/third-level-and-academic/boundaries/</p>
      <p><sup>2</sup> https://www.osi.ie/about/osi-positional-accuracy/
      </p>
      <sec id="sec-2-1">
        <title>Quality Results and Discussion of OSi RDF Datasets.</title>
        <p>
          Nineteen quality metrics were assessed by Luzzu. Seven of these metrics fell
significantly below 100% (or the maximum score) and are worthy of further investigation.
Overall, however, the generated OSi Boundary dataset displays
high quality characteristics according to these objective measures, which are accepted within the
Linked Data community as state-of-the-art ways to measure Linked Data quality. A
more detailed breakdown is shown in Table 1. The second column (Value) shows the
exact value recorded for each metric. The overall quality picture is good, with 13 of
the 19 metrics equalling or exceeding the mean quality levels seen in LOD (as per [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]).
Three metrics in which we vastly exceed the quality seen in common LOD practice are
U1, CS9 and A3. U1 (Human Readable Labelling) shows that our dataset is well
annotated for human consumption; our result is more than double the LOD
average. For CS9 (correct usage of domain and range datatypes), at c. 76% we exceed
the LOD average of 60%, but we aim for 100% on this metric, so the R2RML mappings
were investigated. Similarly, for A3 (Dereferenceability), at 64% we are nearly double
the average of 37%, since OSi hosts definitions and resources as Linked Data in addition
to the dump files.
        </p>
        <p>Table 1 also allows us to more clearly identify the 6 metrics that are of most concern:
RC1, IN4, V2, I1, L1, L2. There is a secondary set of metrics that must also be
investigated based on their power as objective validators of the dataset (i.e. they can
easily detect defects that should not be present, despite the OSi data exceeding the LOD
average for these metrics): IN3, CN2, CS2, CS9, A3. These 5 represent metrics for
which the OSi should be achieving 100% scores and hence are worth investigating,
even when the score is very high like U1 (Labelling) at 99.94%.</p>
        <table-wrap id="tab-1">
          <label>Table 1</label>
          <table>
            <thead>
              <tr><th>Metric</th><th>Value</th></tr>
            </thead>
            <tbody>
              <tr><td>(SV3) Compatible Datatypes</td><td>100%</td></tr>
              <tr><td>(A3) Dereferenceability of URIs (Approximate)</td><td>64.1%</td></tr>
              <tr><td>(I1) Links to External Linked Data Providers</td><td>1 LD Prov</td></tr>
              <tr><td>(L1) Presence of a Machine-Readable Licence</td><td>0</td></tr>
              <tr><td>(L2) Presence of a Human-Readable Licence</td><td>0<sup>4</sup></td></tr>
            </tbody>
          </table>
        </table-wrap>
      </sec>
      <sec id="sec-2-2">
        <title>Root cause analysis of under-performing metrics</title>
        <p>RC1 - Approximately 33% of the OSi Boundary data URIs exceed the recommended
length for short URIs (60 characters). Part of this is due to the reuse of long internal OSi
GUIDs to identify spatial things. Therefore, techniques should be explored to
map the internal OSi identifiers to more meaningful URIs.</p>
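        <p>As a rough illustration of what the RC1 check measures, the following Python sketch computes the share of URIs that are longer than the 60-character threshold mentioned above. The function name, the example namespace and the example URIs are hypothetical and are not taken from Luzzu or from the OSi dataset; Luzzu's own implementation of the metric may differ.</p>
        <preformat><![CDATA[
```python
SHORT_URI_LIMIT = 60  # threshold for short URIs quoted in the text


def share_of_long_uris(uris, limit=SHORT_URI_LIMIT):
    """Return the fraction of URIs whose length exceeds `limit` characters."""
    uris = list(uris)
    if not uris:
        return 0.0
    long_count = sum(1 for uri in uris if len(uri) > limit)
    return long_count / len(uris)


# Hypothetical example URIs (not real OSi identifiers): two short,
# human-readable ones and one built from a long GUID-style identifier.
example_uris = [
    "http://example.org/resource/county/dublin",
    "http://example.org/resource/county/cork",
    "http://example.org/resource/feature/2ae19629-1d3c-13a3-e055-000000000001",
]

share = share_of_long_uris(example_uris)
print(f"{share:.0%} of URIs exceed {SHORT_URI_LIMIT} characters")
```
]]></preformat>
        <p>In this toy input only the GUID-based URI crosses the threshold, which mirrors the root cause described for RC1.</p>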
        <p>IN3 - Luzzu identified some 11 typos that have crept into the mappings, e.g.
ElectoralDistrict vs ElectoralDistricts.</p>
        <p>IN4 - While blank nodes are generally discouraged in LOD, it is part of the design of
the OSi Boundary dataset that feature geometries cannot be addressed directly and users
need to go through the associated feature concept, e.g. Co. Dublin, to access a geometry.</p>
        <p>V2 - The average number of languages used per resource is less than 2, i.e. less than
50% of all resources include a label in Irish as well as English. The way this metric is
calculated is not really representative, since all geometry resources are entirely unlabelled
yet count towards this metric. This suggests a potential revision of the metric.</p>
        <p>A3 - Despite having a high score here, we expect to achieve 100%. It is possible that
some of the errors were due to load throttling at the data.geohive.ie server as opposed
to the resources not being available.</p>
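        <p>The distortion described for V2 can be illustrated with a small Python sketch. It assumes that the metric is the mean number of distinct label languages per resource, which is a simplified reading of the description above and not necessarily Luzzu's exact definition. The resource names and labels are invented for illustration.</p>
        <preformat><![CDATA[
```python
def average_languages_per_resource(labels_by_resource):
    """Mean number of distinct label languages over all resources.

    `labels_by_resource` maps a resource name to a list of
    (label text, language tag) pairs. A resource without any label
    contributes 0 to the average.
    """
    if not labels_by_resource:
        return 0.0
    counts = [
        len({lang for _, lang in labels})
        for labels in labels_by_resource.values()
    ]
    return sum(counts) / len(counts)


# Hypothetical data: two bilingual features and their unlabelled geometries.
resources = {
    "feature/dublin": [("Dublin", "en"), ("Baile Átha Cliath", "ga")],
    "feature/cork": [("Cork", "en"), ("Corcaigh", "ga")],
    "geometry/dublin": [],
    "geometry/cork": [],
}

all_resources = average_languages_per_resource(resources)
features_only = average_languages_per_resource(
    {name: labels for name, labels in resources.items()
     if name.startswith("feature/")}
)
print(all_resources, features_only)
```
]]></preformat>
        <p>In this toy example every feature is fully bilingual, yet the average halves once the unlabelled geometries are counted, which is why a revision of the metric (for instance, restricting it to resources that are expected to carry labels) may be worth considering.</p>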
        <p>I1 - Our score is low here because we have only interlinked to DBpedia.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Final Remarks</title>
      <p>
        In this paper, we discuss and compare the quality of the generated OSi geo-linked
datasets. Our assessment was guided by two objectives: identifying problems in the mappings
that generate noisy data, and checking whether Linked Data best practices are followed. We assessed 19 different
quality metrics and compared them with the LOD cloud’s mean quality values as
assessed in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. We show that for most metrics assessed our published data has a better
quality value than the assessed mean value of the LOD cloud datasets.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Debattista</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lange</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Auer</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cortis</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Evaluating the Quality of the LOD Cloud: An Empirical Investigation</article-title>
          .
          <source>Semantic Web Journal</source>
          (
          <year>2018</year>
          , preprint)
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Debruyne</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Meehan</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Clinton</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
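          <string-name>
            <surname>McNerney</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nautiyal</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lavin</surname>
            ,
            <given-names>P.</given-names>
          </string-name>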
          ,
          <string-name>
            <surname>O'Sullivan</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <article-title>Ireland's Authoritative Geospatial Linked Data</article-title>
          ,
          <source>In Proceedings 16th International Semantic Web Conference (ISWC 2017)</source>
          , vol.
          <volume>10588</volume>
          , pp.
          <fpage>66</fpage>
          -
          <lpage>74</lpage>
          , Springer, Vienna, Austria (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Debattista</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Auer</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lange</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Luzzu - A Methodology and Framework for Linked Data Quality Assessment</article-title>
          .
          <source>J. Data and Information Quality</source>
          <volume>8</volume>
          ,
          <issue>1</issue>
          , Article 4 (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4. The OSi data has a human-readable license specified on the Geohive web page.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>