<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>TAA: A Platform for Triple Accuracy Measuring and Evidence Triples Discovering</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Shuangyan Liu</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Carlo Allocca</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mathieu d'Aquin</string-name>
          <email>mathieu.daquin@insight-centre.org</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Enrico Motta</string-name>
          <email>enrico.motta@open.ac.uk</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Insight Center Galway</institution>
          ,
          <country country="IE">Ireland</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Knowledge Media Institute, The Open University</institution>
          ,
          <country country="UK">UK</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Knowledge graph refinement is often a time-consuming process due to the lack of domain knowledge and of automatic tools that support maintainers in detecting erroneous information. Few tools are available to support the task of measuring the accuracy of the triples (source triples) in a knowledge graph. We developed a platform, called TAA, for measuring triple accuracy by discovering evidence triples in external knowledge graphs. It consists of a quality assessment pipeline whose components range from fetching target triples from external knowledge graphs to finding matched triples among the target triples and calculating a score representing the level of accuracy of a source triple. In addition, TAA presents the assessment result as an evidence graph and exposes its functionality to other applications via a REST web service.</p>
      </abstract>
      <kwd-group>
        <kwd>Data Quality</kwd>
        <kwd>Knowledge Graph Accuracy</kwd>
        <kwd>Quality Assessment Pipeline</kwd>
        <kwd>Evidence Graph</kwd>
        <kwd>RESTful Web Service</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>In recent years, research communities and industrial stakeholders have constructed
many large-scale knowledge graphs such as DBpedia3, YAGO4, Freebase5,
Wikidata6, Google Knowledge Graph, Microsoft Satori, and others. They are
used intensively in different application scenarios such as search, question answering,
natural language processing, data integration and analytics, and in specialised
areas such as digital humanities, business, life science and more. Due to the
diversity of data sources and the limitations of present knowledge graph construction
methods, most knowledge graphs face a variety of quality issues such as noisy
and vague data, inconsistency, inaccurate and out-of-date data, incomplete
information, and poor interlinking between KGs. To facilitate wide adoption and
advanced usage, it is crucial to ensure the quality of knowledge graphs.</p>
    </sec>
    <sec id="sec-2">
      <p>3 http://wiki.dbpedia.org/ 4 http://yago-knowledge.org/ 5 https://developers.google.com/freebase/ 6 http://www.wikidata.org</p>
      <p>
        Various studies have been conducted to investigate the quality of popular
knowledge graphs and the quality measures along different dimensions [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
However, not enough tools have been developed to assist with the task of triple
accuracy evaluation. We developed a system (TAA) to assist knowledge graph
maintainers and developers in verifying the accuracy of statements from a
knowledge graph. The TAA platform automatically discovers evidence triples
from external knowledge graphs, and assigns to a source triple a confidence score
representing the consensus of the evidence triples obtained.
      </p>
      <p>
The approach used in TAA consists of three phases. First, target triples whose
subjects are equivalent to the subject of a source triple are fetched, by querying
the sameAs links of the source subject using the sameAs.org service and the
source knowledge graph, and then retrieving the predicates and objects of the target
triples from external knowledge graphs through a content negotiation mechanism.
Second, we developed a predicate matching mechanism, based on predicate
semantic similarity calculation and on predicate type and value comparison, to find
the matched triples of a source triple. Finally, a confidence score representing the
accuracy level of a source triple is calculated from the consensus of the matched
triples. A detailed description of the approach is presented in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. In this paper,
we focus on explaining the design principles, architecture and functionality of
the TAA tool. We also present an analysis of other tools and systems for tackling
similar quality problems.
      </p>
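<p>The predicate matching and consensus steps can be illustrated with a small sketch. Everything below is a simplification and not TAA's actual implementation: it substitutes token-overlap similarity for the predicate semantic similarity measures described above, and a plain agreement ratio for the confidence calculation; all class and method names are hypothetical.</p>

```java
// Simplified sketch of phases 2 and 3 of the TAA pipeline (hypothetical names).
import java.util.*;

public class ConsensusSketch {

    // Crude predicate similarity: Jaccard overlap of lower-cased label tokens,
    // standing in for the semantic similarity measures used by TAA.
    static double predicateSimilarity(String p1, String p2) {
        Set<String> t1 = new HashSet<>(Arrays.asList(p1.toLowerCase().split("\\W+")));
        Set<String> t2 = new HashSet<>(Arrays.asList(p2.toLowerCase().split("\\W+")));
        Set<String> inter = new HashSet<>(t1);
        inter.retainAll(t2);
        Set<String> union = new HashSet<>(t1);
        union.addAll(t2);
        return union.isEmpty() ? 0.0 : (double) inter.size() / union.size();
    }

    // Phase 2: keep the objects of target triples whose predicate is similar
    // enough to the source predicate.
    static List<String> matchedObjects(String sourcePredicate,
                                       Map<String, String> targetPredicateToObject,
                                       double threshold) {
        List<String> objects = new ArrayList<>();
        for (Map.Entry<String, String> e : targetPredicateToObject.entrySet()) {
            if (predicateSimilarity(sourcePredicate, e.getKey()) >= threshold) {
                objects.add(e.getValue());
            }
        }
        return objects;
    }

    // Phase 3: confidence = fraction of matched objects agreeing with the source object.
    static double confidence(String sourceObject, List<String> matched) {
        if (matched.isEmpty()) return 0.0;
        long agreeing = matched.stream()
                .filter(o -> o.equalsIgnoreCase(sourceObject)).count();
        return (double) agreeing / matched.size();
    }

    public static void main(String[] args) {
        Map<String, String> targets = new LinkedHashMap<>();
        targets.put("birth place", "Ulm");
        targets.put("place of birth", "Ulm");
        targets.put("death place", "Princeton");
        List<String> matched = matchedObjects("birth place", targets, 0.5);
        System.out.println(matched);                    // objects of matched triples
        System.out.println(confidence("Ulm", matched)); // consensus score
    }
}
```

<p>A real implementation would also compare predicate types and value ranges before matching, as the text above notes, and would weight the consensus by the reliability of each external source.</p>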
      <sec id="sec-2-1">
        <title>The TAA Platform</title>
        <p>The guiding design principle of the TAA platform is to reduce the human workload in
the process of knowledge graph accuracy assessment. We achieve this goal by (i)
providing a pipeline for evidence triple discovery and accuracy measurement; (ii)
minimising the user interactions with the platform needed to complete an evaluation
task; and (iii) representing the assessment results in an intuitive way. The
architecture of TAA is shown in Figure 1. We designed and implemented the TAA
platform using the Oracle Jersey RESTful Web Service framework7. TAA therefore
provides not only a web application portal but also a REST web service
to support the triple accuracy assessment process. The Grizzly HTTP server8
is adopted in TAA as a container for REST and non-REST web resources.
The Grizzly server was chosen because it is fast and lightweight,
and supports features such as non-blocking IO and NIO buffers, which make
TAA scalable to a large number of users. In addition, JAX-RS9 resources are
developed to contain the methods that handle HTTP requests for performing triple
accuracy assessment tasks.</p>
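<p>To illustrate the REST interface idea without pulling in the Jersey and Grizzly dependencies, the sketch below uses the JDK's built-in com.sun.net.httpserver instead; the /assess path and the response fields are invented for illustration and do not reflect TAA's actual API.</p>

```java
// Minimal REST-style endpoint sketch; a stand-in for TAA's Jersey/Grizzly stack.
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

public class TaaServerSketch {

    // Start an HTTP server exposing a hypothetical /assess resource.
    public static HttpServer start(int port) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        // A real handler would run the assessment pipeline on the requested
        // triple; here a fixed JSON body just shows the shape of a response.
        server.createContext("/assess", exchange -> {
            String json = "{\"confidence\": 0.9, \"evidenceCount\": 3}";
            byte[] body = json.getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().set("Content-Type", "application/json");
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(body);
            }
        });
        server.start();
        return server;
    }

    public static void main(String[] args) throws Exception {
        HttpServer server = start(0); // port 0 = any free port
        System.out.println("Listening on port " + server.getAddress().getPort());
        server.stop(0);
    }
}
```

<p>A framework like JAX-RS replaces the hand-written handler above with annotated resource methods, and the container (Grizzly, in TAA's case) supplies the non-blocking IO that makes the service scale.</p>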
        <p>The core functionality of TAA is discovering evidence triples for a source
triple in external knowledge graphs. To achieve this objective, a pipeline
was developed containing a series of components adopting different technologies.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <p>7 https://jersey.github.io/ 8 https://javaee.github.io/grizzly/ 9 https://github.com/jax-rs</p>
      <p>[Figure 1: Architecture of the TAA platform. A KG maintainer or developer interacts through a browser or client application to specify a source triple and view assessment results; HTTP requests and responses connect the user interface to a REST web server hosting the REST web resources, the TAA pipeline and the evidence graph visualisation; the pipeline accesses the source knowledge graph via SPARQL and sameAs.org, and external knowledge graphs via SPARQL and content negotiation.]</p>
      <p>These technologies include semantic web and linked data technology, predicate semantic similarity
measures, natural language processing and outlier detection algorithms. The
pipeline components are responsible for fetching and identifying matched triples
and for generating a confidence score for the source triple based on the agreement
among the matched triples.</p>
      <p>In addition, TAA represents the evidence triples discovered using an evidence
graph. An evidence graph is a network of a source triple and its evidence triples,
and the types of identity links between the source subject and matched subjects.
We implemented the evidence graph using the D3 JavaScript API.10 The data
of the evidence graph is represented in the D3 graph JSON format, parsed and
visualised using the D3 API. An example of the evidence graph visualisation is
shown in Figure 2. Different colours are applied to the nodes of the evidence graph
to distinguish the source triple (blue) from its evidence triples (orange).</p>
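<p>The D3 force-layout convention expects a JSON object with nodes and links arrays. The sketch below builds such a document for one source triple and its evidence objects; the group field values and node identifiers are hypothetical, chosen only to mirror the blue/orange node distinction described above.</p>

```java
// Sketch of serialising an evidence graph into D3's {"nodes": [...], "links": [...]}
// force-layout JSON shape; field values beyond the nodes/links convention are invented.
import java.util.*;

public class EvidenceGraphJson {

    // Build a D3-style graph for one source triple and its evidence objects.
    static String toD3Json(String sourceSubject, String sourceObject,
                           List<String> evidenceObjects) {
        StringBuilder nodes = new StringBuilder();
        StringBuilder links = new StringBuilder();
        // Source triple: subject and object nodes, connected by a link.
        nodes.append("{\"id\":\"").append(sourceSubject).append("\",\"group\":\"source\"}");
        nodes.append(",{\"id\":\"").append(sourceObject).append("\",\"group\":\"source\"}");
        links.append("{\"source\":\"").append(sourceSubject)
             .append("\",\"target\":\"").append(sourceObject).append("\"}");
        // Evidence objects, attached to the source subject.
        int i = 0;
        for (String obj : evidenceObjects) {
            String id = "evidence" + (i++) + ":" + obj;
            nodes.append(",{\"id\":\"").append(id).append("\",\"group\":\"evidence\"}");
            links.append(",{\"source\":\"").append(sourceSubject)
                 .append("\",\"target\":\"").append(id).append("\"}");
        }
        return "{\"nodes\":[" + nodes + "],\"links\":[" + links + "]}";
    }

    public static void main(String[] args) {
        System.out.println(toD3Json("dbr:Albert_Einstein", "Ulm",
                Arrays.asList("Ulm", "Ulm, Germany")));
    }
}
```

<p>On the client side, D3 parses this document and maps the group value of each node to a colour, giving the blue source triple and orange evidence triples of Figure 2.</p>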
      <p>Different types of users can benefit from TAA. Knowledge graph maintainers
who need to verify the accuracy of triples can use TAA to collect evidence
from external KGs and obtain a confidence level for how accurate the source
triples are. Developers who build quality assessment applications can invoke the
functionality of TAA over HTTP from their own applications, since TAA is
developed as a REST web service that exposes its functionality over
the web. The RESTful design also allows TAA to represent the assessment results
in different formats, such as JSON and XML, depending on the type of client.</p>
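<p>The format selection can be pictured as a function of the client's Accept header, a simplification of what a JAX-RS framework does automatically through its @Produces annotations; the element and field names below are hypothetical.</p>

```java
// Sketch of choosing a JSON or XML representation of an assessment result
// based on the HTTP Accept header (hypothetical field and element names).
public class ResultNegotiationSketch {

    // Render a confidence score as XML if the client asked for it, else JSON.
    static String render(double confidence, String acceptHeader) {
        if (acceptHeader != null && acceptHeader.contains("application/xml")) {
            return "<assessment><confidence>" + confidence + "</confidence></assessment>";
        }
        // Default to JSON, which most REST clients expect.
        return "{\"confidence\": " + confidence + "}";
    }

    public static void main(String[] args) {
        System.out.println(render(0.9, "application/json"));
        System.out.println(render(0.9, "application/xml"));
    }
}
```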
      <p>The TAA UI currently supports the validation of a single source triple at
a time. However, the core data processing pipeline of TAA is capable of
simultaneously validating a cluster of homogeneous triples which share the same
predicate. To verify a cluster of triples at once, the data input
and result visualisation components of the TAA platform can be extended
accordingly. The TAA platform source code is available on GitHub.11
10 https://github.com/d3/d3/wiki
11 https://github.com/TriplesAccuracyAssessment/taa-demo</p>
      <sec id="sec-3-1">
        <title>Related Work</title>
        <p>
          Little research has been conducted on developing quantitative
measures of accuracy and automatic tools for assessing triple accuracy. DeFacto [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]
was developed for RDF fact validation. The difference between DeFacto and
TAA is that DeFacto retrieves web documents as evidence for verifying facts, whereas TAA discovers evidence triples.
[
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] adopted an outlier detection method to identify numerical errors in KGs.
Compared to TAA, [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] did not address the matching of similar properties for
two triples. In the TAA pipeline, a predicate matching algorithm is used for
matching similar properties of triples.
        </p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1. Farber,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Ell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            ,
            <surname>Menne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            ,
            <surname>Rettinger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            ,
            <surname>Bartscherer</surname>
          </string-name>
          ,
          <string-name>
            <surname>F.</surname>
          </string-name>
          :
          <article-title>Linked data quality of dbpedia, freebase, opencyc, wikidata, and yago</article-title>
          .
          <source>Semantic Web Journal</source>
          (to appear)
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Fleischhacker</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paulheim</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bryl</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          , Volker, J.,
          <string-name>
            <surname>Bizer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Detecting errors in numerical linked data using cross-checked outlier detection</article-title>
          . In:
          <string-name>
            <surname>Mika</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tudorache</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bernstein</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Welty</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Knoblock</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vrandecic</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Groth</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Noy</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Janowicz</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goble</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          (eds.)
          <source>ISWC 2014, Part I. LNCS</source>
          , vol.
          <volume>8796</volume>
          , pp.
          <fpage>357</fpage>
          –
          <lpage>372</lpage>
          . Springer, Cham (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Gerber</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Esteves</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lehmann</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , Buhmann, L.,
          <string-name>
            <surname>Usbeck</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ngomo</surname>
            ,
            <given-names>A.C.N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Speck</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>DeFacto - temporal and multilingual deep fact validation</article-title>
          .
          <source>Web Semantics: Science, Services and Agents on the World Wide Web</source>
          <volume>35</volume>
          ,
          <fpage>85</fpage>
          –
          <lpage>101</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>d'Aquin</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Motta</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Measuring accuracy of triples in knowledge graphs</article-title>
          . In:
          <string-name>
            <surname>Gracia</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bond</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McCrae</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Buitelaar</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chiarcos</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hellmann</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          (eds.) International Conference on Language,
          <source>Data and Knowledge. Lecture Notes in Computer Science</source>
          , vol.
          <volume>10318</volume>
          , pp.
          <fpage>343</fpage>
          –
          <lpage>357</lpage>
          . Springer, Cham (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>