<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Towards Linked Data Fact Validation through Measuring Consensus</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Shuangyan Liu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mathieu d'Aquin</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Enrico Motta</string-name>
          <email>enrico.mottag@open.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>The Open University</institution>
          ,
          <country country="UK">United Kingdom</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In the context of linked open data, different datasets can be interlinked, thereby providing rich background knowledge for a dataset under examination. We believe that knowledge from interlinked datasets can be used to validate the accuracy of a linked data fact. In this paper, we present a novel approach to linked data fact validation using linked open data published on the web. The approach utilises owl:sameAs links for retrieving evidence triples, together with a novel predicate similarity matching method. It computes the confidence score of an input fact based on a weighted average over the similarity of the retrieved evidence triples. We also demonstrate the feasibility of our approach using a sample of facts extracted from DBpedia.</p>
      </abstract>
      <kwd-group>
        <kwd>Linked Open Data</kwd>
        <kwd>Data Quality</kwd>
        <kwd>Fact Validation</kwd>
        <kwd>Semantic Similarity</kwd>
        <kwd>DBpedia</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Linked datasets created from unstructured sources are likely to contain factual
errors [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] (e.g. a wrong population number for a country). Measuring the
semantic accuracy of linked sources is viewed as one of the challenging dimensions of
data quality assessment [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Zaveri et al. defined semantic accuracy as "the
degree to which data values correctly represent the real world facts" [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. A simple
example to illustrate this: if a search engine returns CA as the state
where New York City is located, this is viewed as semantically inaccurate,
since the state CA does not represent the real-world state of NYC, i.e. NY.
      </p>
      <p>
        Different approaches to measuring the semantic accuracy of linked data were discussed in previous studies [
        <xref ref-type="bibr" rid="ref3 ref5">3,5</xref>
        ]. The DeFacto approach [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] validated facts by
using search engines to retrieve webpages that contain the actual statement phrased in natural
language, combined with a fact confirmation method. Paulheim and Bizer
presented in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] an algorithm for detecting type incompletion based on the
statistical distributions of properties and types, and an algorithm for identifying
wrong statements by finding large deviations between the actual types of the subjects
and/or objects and the apriori probabilities given by the distribution.
      </p>
      <p>However, no studies have investigated how to validate linked data facts by
leveraging the very nature of linked data, i.e. by collecting matched evidence triples from
other linked sources. This paper presents an approach to RDF fact validation
that collects consensus from other linked datasets. owl:sameAs links are
followed to collect triples describing the same real-world entities in other datasets.
A predicate matching method is described to collect "equivalent" facts, and a
consensus measure is presented to quantify the agreement among the sources.</p>
      <p>The rest of the paper is structured as follows. Section 2 presents the details
of our approach. The method and results of an experiment with sample facts
from DBpedia are described in Section 3. Finally, we conclude in Section 4 and
provide an outlook on future work.</p>
    </sec>
    <sec id="sec-2">
      <title>Approach</title>
      <p>Subject Links Crawling and Cleaning. The first task deals with automatically
collecting the resource or subject links equivalent to the subject of the input
fact(s). We approach the problem in two steps. First, the values of the
owl:sameAs property of the subject of a fact are retrieved, which can be achieved
by querying the underlying dataset of the input fact. Second, we fetch further
equivalent subject links by querying the http://sameas.org service.</p>
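The two collection steps can be sketched as follows; the function and the provenance labels are illustrative, not the paper's implementation, and the actual service calls are left out so the sketch stays self-contained.

```python
def collect_subject_links(sameas_values, sameas_org_values):
    """Merge equivalent-subject links obtained from the dataset's own
    owl:sameAs values and from the http://sameas.org service, tagging each
    link with its provenance so reliability weights can be assigned later."""
    links = {}
    for uri in sameas_values:
        links.setdefault(uri, set()).add("owl:sameAs")
    for uri in sameas_org_values:
        links.setdefault(uri, set()).add("sameas.org")
    return links
```

A link returned by both services simply carries both provenance tags, which is what the reliability weighting later relies on.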
      <p>There may be duplicated and non-resolvable subject links in the results
obtained via owl:sameAs and the http://sameas.org service. Duplication
can happen because two separate services are used and the resources that
they provide may overlap. It can also occur because the underlying
dataset contains multilingual versions of the same resources and links them
together via owl:sameAs. There are also several reasons for non-resolvable
subject links: the resources may have been deleted from the underlying dataset
without the values of the relevant owl:sameAs properties being updated
accordingly, or the services publishing the datasets may be down or retired.</p>
      <p>The erroneous subject links need to be cleaned before the next task can be
performed effectively and efficiently. We follow three steps for cleaning
the errors. First, all subject links are verified by "pinging" the corresponding
URIs; if a valid response is received within a given timeout, a subject link is
considered resolvable. Second, subject links with identical URIs are
de-duplicated. Finally, multilingual versions of the same resource are
removed from the result set.</p>
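A minimal sketch of the three cleaning steps, with the resolvability check and the language normalisation injected as functions so the example stays self-contained; which variant of a multilingual resource is kept is an assumption, since the text does not say.

```python
def clean_subject_links(links, is_resolvable, lang_key):
    """Apply the three cleaning steps: drop exact duplicate URIs, drop links
    that do not respond within the timeout ("pinging"), and keep only one
    language variant per resource. `lang_key` maps a URI to a
    language-independent key (an illustrative device, not from the paper)."""
    cleaned, seen, seen_resources = [], set(), set()
    for uri in links:
        if uri in seen:                    # duplicate with an identical URI
            continue
        seen.add(uri)
        if not is_resolvable(uri):         # "ping" failed within the timeout
            continue
        key = lang_key(uri)                # language-independent resource key
        if key in seen_resources:          # another language version kept already
            continue
        seen_resources.add(key)
        cleaned.append(uri)
    return cleaned
```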
      <p>In our approach the reliability of the subject links is determined according
to their provenance, i.e., the methods or services used to
retrieve the links, for example, the DBpedia owl:sameAs property and the http://sameas.org service. Details of how to determine the reliability of the subject
links are addressed later. The provenance information of the subject links is
retained for calculating the confidence score of an input fact.
(The following namespace conventions are used in this document: owl=http://www.w3.org/2002/07/owl, dbpedia=http://dbpedia.org/resource/, dbpedia-owl=http://dbpedia.org/ontology/, dbpprop=http://dbpedia.org/property/, yago=http://yago-knowledge.org/resource/.)</p>
      <p>Predicate Links and Objects Retrieving. The next task of fact validation
is collecting all triples that use the collected resources as subjects. This
problem cannot be tackled by simply dereferencing the URIs of the collected
subject links. (According to the W3C's note on dereferencing HTTP URIs, the act of retrieving a
representation of a resource identified by a URI is known as dereferencing that URI;
see http://www.w3.org/2001/tag/doc/httpRange-14/2007-05-31/HttpRange-14.)
There are three reasons. First, not all of the corresponding URIs
can be dereferenced, such as the URI of the mosquito Aedes vexans
(http://lod.geospecies.org/ses/4XSQO). Second,
some dereferenceable URIs may not return the real data of the resources because
they are redirected elsewhere, e.g. yago:Borough of Buckingham (http://tinyurl.com/mxdkv4s).
Finally, the content types of the representations of the information resources
obtained via dereferencing can differ.</p>
      <p>The non-dereferenceable URIs are removed from the set of subject links as
a result of the subject links cleaning task. For the dereferenceable
URIs, a combination of methods is applied to extract the desired predicates and
objects, and to convert them to a uniform format for the subsequent
tasks.</p>
      <p>The first method used in our approach is HTTP GET with the resource URI
and content negotiation. It allows the RDF facts of an information resource
to be obtained in most cases. Programming libraries such as the Jena API
(https://jena.apache.org/) can be used
to extract the desired data from the RDF data. The second method is HTTP
GET with a SPARQL query to a dataset endpoint. This method is adopted
when the resource URIs cannot return the real data of those resources and there
is a SPARQL endpoint associated with the knowledge base. Last but not
least, when only dumps of the data are available from the knowledge bases,
e.g. Wikidata (http://www.wikidata.org/), dedicated toolkits can be developed to extract the desired data from
the dumps.</p>
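The choice among the three retrieval methods can be sketched as a small dispatcher. The Accept header values and the DESCRIBE query are illustrative assumptions, and no HTTP request is actually issued; the function only returns the request each case would make.

```python
def retrieval_plan(uri, dereferenceable, endpoint=None, dump_only=False):
    """Return, as data, the request each of the three retrieval cases would
    issue for a subject link. Header and query strings are illustrative."""
    if dump_only:
        # only a data dump is available (e.g. Wikidata): a dedicated
        # toolkit must parse the downloaded dump instead
        return ("dump", None)
    if dereferenceable:
        # case 1: HTTP GET with content negotiation for an RDF serialisation
        return ("get", {"url": uri,
                        "headers": {"Accept": "application/rdf+xml, text/turtle"}})
    if endpoint is not None:
        # case 2: fall back to the knowledge base's SPARQL endpoint
        return ("sparql", {"url": endpoint, "query": "DESCRIBE <%s>" % uri})
    return ("skip", None)  # nothing usable for this link
```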
      <p>Predicate Similarity Measurement. After completing the aforementioned
tasks, a large number of triples whose subjects are equivalent to the subject
links of the input facts have been collected. The objective of the next task is selecting
the evidence triples whose predicates match the predicates of the input
facts.</p>
      <p>
        We choose to measure predicate similarity based on the semantic
similarity between the predicates of the input facts and those of the collected triples. String
similarity measures such as the Trigram similarity metric [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] are not used since they
cannot effectively detect predicates which are composed of different words but
actually have the same meaning. For example, the property
dbpedia-owl:populationTotal and the property yago:hasNumberOfPeople should be identified
as highly related.
      </p>
      <p>
        There are a number of semantic relatedness measures available, including
Jiang &amp; Conrath [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], Resnik [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], Lin [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], and Wu &amp; Palmer [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. They rely
massively on the enormous store of knowledge available in WordNet
(http://wordnet.princeton.edu/). The principle
of our approach for detecting highly related predicates is to apply a suitable
semantic relatedness measure to the predicates of the evidence triples. In addition,
our method is based on WS4J (https://code.google.com/p/ws4j/), which can generate a matrix of pairwise
similarity scores for two input sentences according to a selected semantic relatedness
measure. WS4J implements several of the semantic similarity algorithms described
above.
      </p>
      <p>Many predicates use compound words, such as
dbpedia-owl:populationTotal and yago:hasNumberOfPeople. Thus, our method should be able to handle
predicates made of compound words as well as predicates composed of single words.
Our method consists of three parts. First, a compound word splitter is used to
transform predicate names into space-separated words (i.e. sentences). Second, a
matrix of pairwise similarity scores is generated for the two input sentences by
means of WS4J. Finally, formulas are defined to measure the semantic similarity
of the input sentences (i.e. the predicates) using the pairwise similarity matrix.</p>
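The first part can be approximated with a simple camel-case splitter; this is a stand-in for the compound word splitter the approach actually uses, handling the example predicate names above.

```python
import re

def split_predicate_name(predicate):
    """Split a predicate's camel-cased local name into a space-separated
    'sentence', e.g. dbpedia-owl:populationTotal -> 'population Total'.
    A simplified stand-in for a full compound-word splitter."""
    local = re.split(r"[/#:]", predicate)[-1]  # strip namespace prefix or URI path
    words = re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", local)
    return " ".join(words)
```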
      <p>Table 2 provides an example of the pairwise similarity matrix for the
sentences "population Total" and "has Number Of People" (as generated by WS4J).</p>
      <p>Let r be the number of rows of a similarity matrix and c the number of
columns of the matrix. The scores in the nth row or column are represented by
the sets Srow(n) and Scolumn(n) respectively. For each word in the shorter sentence
(depending on whether r ≤ c or r &gt; c), we choose the max score in the row or column where
the word lies as the semantic similarity score of that word, noted W(n). This
leads to the following formula:</p>
      <p>W(n) = max(Srow(n)) if r ≤ c; max(Scolumn(n)) if r &gt; c    (1)</p>
      <p>Moreover, let Ω(W) be the set of similarity scores of the words in the shorter
sentence of a similarity matrix, and k the number of values in the set. If any
word in the shorter sentence has a similarity value greater than the threshold
τ, then the two input sentences may have similar meaning. Thus we define the
average of the scores belonging to Ω(W), denoted P, as the semantic similarity score
for the two input sentences (i.e. the predicates). This leads to the following
formula:</p>
      <p>P = ( Σ W∈Ω(W) W ) / k, provided ∃ W ∈ Ω(W) with W &gt; τ    (2)</p>
      <p>If no word in the shorter sentence has a similarity value greater than the
threshold, then the two input sentences cannot have similar meaning. In this
case, the similarity score for the two input sentences is set to zero.</p>
      <p>To obtain the set of matched predicates for the predicate of the input facts, a
threshold is applied, e.g., all predicates with P ≥ 0.5 are considered matched
predicates.</p>
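Formulas (1) and (2), together with the zero case, can be sketched as follows; the default threshold value is an illustrative choice.

```python
def predicate_similarity(matrix, tau=0.5):
    """Predicate similarity from a pairwise word-similarity matrix:
    for each word of the shorter sentence take the max score in its row
    (if r <= c) or column (if r > c); if at least one of these maxima
    exceeds the threshold tau, return their average, otherwise 0.0."""
    r, c = len(matrix), len(matrix[0])
    if r <= c:
        # rows correspond to the shorter sentence
        word_scores = [max(row) for row in matrix]
    else:
        # columns correspond to the shorter sentence
        word_scores = [max(matrix[i][j] for i in range(r)) for j in range(c)]
    if not any(w > tau for w in word_scores):
        return 0.0
    return sum(word_scores) / len(word_scores)
```

For a 2x4 matrix such as the "population Total" vs "has Number Of People" example, the two row maxima are averaged.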
      <p>Confidence Calculation. As mentioned in the first task above, the reliability
of the collected subject links is determined according to their provenance
(i.e., owl:sameAs or the http://sameas.org service). A weighting
factor is assigned to the subject links of the evidence triples to represent their
reliability. The value of a weighting factor ranges from 1 to 5; the greater the
value, the more reliable the subject link.</p>
      <p>We define a confidence score for the input fact to represent the degree to
which the evidence triples agree with the input fact (or triple). The confidence
of the input fact is based on the weighted average of the values of the objects of
the evidence triples, represented as µ.</p>
      <p>The values of the objects, denoted ν, are considered to be literal values
(either numerical or string). If the objects are strings, the string similarity
scores between the objects of the input fact and those of the evidence triples are used as
the values of ν. If the objects are numerical, the numerical values
of the objects are used directly. The weight ω is the product of the
reliability of the subject link and the similarity of the predicate link of an evidence
triple. Additionally, let m be the number of evidence triples collected through
the abovementioned tasks. Thus, µ is represented as:</p>
      <p>µ = ( Σ i=1..m ωi νi ) / ( Σ j=1..m ωj )    (3)</p>
      <p>Formula (3) is applied directly as the confidence score of an input fact when
the values of the objects of the evidence triples are strings.</p>
      <p>Furthermore, when the values of the objects are numerical, a further formula (4)
is applied to obtain the confidence score of the input fact. In formula (4), x
represents the numerical value of the object of the input fact,
while µ is the weighted average calculated via formula (3).</p>
      <p>Under formula (4), a smaller difference between the numerical value of the
object of the input fact and the weighted average value leads to a
higher confidence score.</p>
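A minimal sketch of the weighted average of formula (3); the tuple layout of `evidence` is an assumption made for illustration, and the sketch stops at the weighted average since the exact shape of formula (4) is not given here.

```python
def weighted_average(evidence):
    """Formula (3): weighted average = sum_i(w_i * v_i) / sum_j(w_j), where
    each weight w is (subject-link reliability, 1-5) * (predicate similarity)
    and v is the object value of an evidence triple (a number, or a string
    similarity score in the string case). `evidence` is a list of
    (value, reliability, predicate_similarity) tuples."""
    numerator = denominator = 0.0
    for value, reliability, pred_sim in evidence:
        w = reliability * pred_sim
        numerator += w * value
        denominator += w
    return numerator / denominator if denominator else None
```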
    </sec>
    <sec id="sec-3">
      <title>Experiment</title>
      <p>In order to test the feasibility of the approach described in the previous section,
we conducted an experiment with a property from DBpedia
(dbpedia-owl:populationTotal) and a sample of facts using this property as the predicate. This
property was selected because its values are numerical.</p>
      <p>We queried the DBpedia SPARQL endpoint to obtain all towns
in Milton Keynes that have a population of more than 10,000. The resulting 18
triples were utilised as the input facts. The subjects of these facts were used as
seeds to crawl equivalent subject links from other knowledge bases.</p>
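The exact query is not reproduced in the text; the following is an illustrative reconstruction of a query one could pose to the DBpedia SPARQL endpoint (the linking property and the resource name for the district are guesses).

```python
# Illustrative reconstruction of the experiment's query; property and
# resource names are assumptions, not taken from the paper.
QUERY = """
PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
SELECT ?town ?population WHERE {
  ?town dbpedia-owl:isPartOf <http://dbpedia.org/resource/Borough_of_Milton_Keynes> ;
        dbpedia-owl:populationTotal ?population .
  FILTER (?population > 10000)
}
"""
```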
      <p>The number of subject links retrieved for a single fact ranges from dozens to
several hundred. For example, 23 subject links were found for dbpedia:Stantonbury,
while 232 were retrieved for dbpedia:Buckingham. The number of
cleaned subject links is greatly reduced, ranging from a few to several tens.</p>
      <p>We selected a representative resource, dbpedia:Buckingham, to examine the
correctness of the subject links cleaning process. A total of 207 noisy subject
links were found for this resource, consisting of 172
non-resolvable links and 35 duplicate links. We manually examined the causes of
the non-resolvable links and re-classified 56 of the 172 as valid links (Figure 1).
These 56 links had initially been identified as invalid because of a small value of the
read-timeout field set in the tool used for the subject links cleaning process. This
allowed us to adjust the timeout field to a suitable value.</p>
      <p>(Figure 1 breaks down the causes of the non-resolvable links: 404 Not Found, Unknown Host, 500 Internal Server Error, and Socket Timeout.)</p>
      <p>We also found that different data access services were provided by the
knowledge bases where the subject links originated. Accordingly, we needed to
adopt different methods to deal with this diversity when retrieving the
predicate links and objects from these knowledge bases.</p>
      <p>
        In addition, the compound word splitter
(http://www.lina.univ-nantes.fr/?Compound-Splitting-Tool.html) was utilised in the predicate
similarity measurement process. It could split compound predicate names into
sentences. The Wu &amp; Palmer [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] semantic similarity measure (WUP) was selected
since the resulting similarity scores are normalised between 0 and 1. We also tested other
measures such as Lin [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. The WUP measure demonstrated the highest rate of
correctness (threshold 0.8). The distribution of the predicate similarity
scores generated is provided in Figure 2.
      </p>
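The WUP measure itself is easy to illustrate on a toy is-a taxonomy; WS4J computes the same measure over WordNet, whereas the tiny taxonomy below is invented purely for illustration.

```python
def wup_similarity(a, b, parent):
    """Wu & Palmer similarity on a toy is-a taxonomy:
    wup(a, b) = 2 * depth(lcs) / (depth(a) + depth(b)),
    where depth is counted from the root (root depth = 1) and lcs is the
    least common subsumer. `parent` maps each node to its parent
    (the root maps to None)."""
    def path_to_root(n):
        path = []
        while n is not None:
            path.append(n)
            n = parent[n]
        return path                        # [n, ..., root]
    pa, pb = path_to_root(a), path_to_root(b)
    lcs = next(n for n in pa if n in pb)   # deepest shared ancestor
    depth = lambda n: len(path_to_root(n))
    return 2.0 * depth(lcs) / (depth(a) + depth(b))
```

Scores fall between 0 and 1, with 1 for identical concepts, which is the normalisation property mentioned above.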
      <p>Furthermore, 45% of the sample facts (i.e. statements about the population
of the 18 subjects) were assigned a confidence score and 55% were not (as
no evidence triples were found for them). Figure 3 shows the distribution of the
confidence scores generated for the sample facts. 22% of the facts were identified
as highly reliable (score ≥ 0.9). Two facts were assigned very low confidence
scores (0.04 and -68.58). We manually examined the causes of the low confidence
values and discovered that, for each of these facts, a matched triple had a very large or
very small population number, which made the weighted average of the object values of
the evidence triples too large or too small. This was because the
subject links of the erroneous triples (retrieved from the sameas.org service)
pointed to resources not identical to the subjects of the facts (wrong subject
links). We corrected the errors by removing the erroneous triples from the set
of evidence triples. This led the fact initially scored at 0.04 to obtain
a much higher confidence (0.94), while no confidence score was produced for the fact
initially scored at -68.58, because no evidence triples remained. Based
on this experiment, we plan to extend our approach to verify abnormal evidence
triples with "fake" subject links in future work.</p>
    </sec>
    <sec id="sec-4">
      <title>Conclusion and Future Work</title>
      <p>In this paper, we presented an approach for validating linked data facts using
RDF triples retrieved from open knowledge bases. Our approach enables the
assessment of the accuracy of facts using the vast interlinked RDF resources on
the Web. This will become increasingly important given the fast growth of
LOD on the Web.</p>
      <p>The presented work is still at an early stage; the experiment discussed in this
paper focused on testing the feasibility of each component of the presented
approach. This can help refine our approach before an evaluation of the approach
as a whole is carried out. We are planning to demonstrate that the proposed
approach can be applied proficiently to arbitrary predicates, and to evaluate the
predicate similarity matching method with standard evaluation measures
(precision/recall) on well-known datasets. Moreover, we are also going to define a
gold standard and apply it to evaluate our method for validating
RDF facts.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Angell</surname>
            ,
            <given-names>R.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Freund</surname>
            ,
            <given-names>G.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Willett</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Automatic spelling correction using a trigram similarity measure</article-title>
          .
          <source>Information Processing &amp;amp; Management</source>
          <volume>19</volume>
          (
          <issue>4</issue>
          ),
          <volume>255</volume>
          –
          <fpage>261</fpage>
          (
          <year>1983</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Jiang</surname>
            ,
            <given-names>J.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Conrath</surname>
            ,
            <given-names>D.W.</given-names>
          </string-name>
          :
          <article-title>Semantic similarity based on corpus statistics and lexical taxonomy</article-title>
          .
          <source>In: Proceedings of International Conference on Research in Computational Linguistics</source>
          (
          <year>1997</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Lehmann</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gerber</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Morsey</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ngomo</surname>
            ,
            <given-names>A.C.N.</given-names>
          </string-name>
          :
          <article-title>DeFacto – deep fact validation</article-title>
          .
          <source>In: The Semantic Web{ISWC</source>
          <year>2012</year>
          , pp.
          <volume>312</volume>
          –
          <fpage>327</fpage>
          . Springer (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>An information-theoretic de nition of similarity</article-title>
          .
          <source>In: ICML</source>
          . vol.
          <volume>98</volume>
          , pp.
          <volume>296</volume>
          –
          <issue>304</issue>
          (
          <year>1998</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Paulheim</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bizer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Improving the quality of linked data using statistical distributions</article-title>
          .
          <source>International Journal on Semantic Web and Information Systems (IJSWIS) 10(2)</source>
          ,
          <volume>63</volume>
          –
          <fpage>86</fpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Resnik</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Using information content to evaluate semantic similarity in a taxonomy</article-title>
          .
          <source>In: Proceedings of the 14th International Joint Conference on Arti cial Intelligence</source>
          . pp.
          <volume>448</volume>
          –
          <issue>453</issue>
          (
          <year>1995</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Palmer</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Verbs semantics and lexical selection</article-title>
          .
          <source>In: Proceedings of the 32nd annual meeting on Association for Computational Linguistics</source>
          . pp.
          <volume>133</volume>
          –
          <fpage>138</fpage>
          .
          Association for Computational Linguistics (
          <year>1994</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Zaveri</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rula</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Maurino</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pietrobon</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lehmann</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Auer</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Quality assessment for linked data: A survey. Semantic Web journal</article-title>
          (to appear)
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>