<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Knowledge Validation Using Objectivity and Corroborativeness of Web Resources</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Seongchan Kim</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jiyeon Choi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mun Y. Yi</string-name>
          <email>munyi@kaist.ac.kr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Dept. of Knowledge Service Engineering</institution>
          ,
          <addr-line>KAIST</addr-line>
          ,
          <country>South Korea</country>
          <addr-line>sckim, jeeyeon51</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this study, we propose a method to validate knowledge candidates using the Web to minimize the false positive rate in a knowledge base (KB). Our approach assesses the objectivity and corroborativeness of a triple, which is the basic form of knowledge, using diverse Web resources. Compared to the state-of-the-art baseline of the Defacto framework, our approach achieves lower false positive rates, enabling more effective filtering of false triples in the construction of a KB.</p>
      </abstract>
      <kwd-group>
        <kwd>knowledge validation</kwd>
        <kwd>Web-based validation</kwd>
        <kwd>objectivity</kwd>
        <kwd>corroborativeness</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Knowledge validation, which identifies the truth value (true or false) of facts, is a
vital step in constructing a reliable, usable knowledge base (KB). To the greatest
extent possible, true facts should be distinguished and stored in a KB while
incorrect or unreliable facts are filtered out, because the existence of false triples in a KB
can cause serious problems for applications that use the KB. For instance,
a Q&amp;A system that references a KB containing false knowledge
can generate a wrong answer. Therefore, techniques for validating knowledge
are essential for the use of KB-based systems.</p>
      <p>
        Several studies in the literature have attempted to validate facts. Defacto
demonstrated a new approach for testing whether a given fact (i.e., an RDF
triple) could be trusted [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. More specifically, Defacto takes an RDF statement as input, searches the
Web, and tries to find evidence for validating the statement in
webpages. It combines the trustworthiness of a Web resource and textual evidence
for validating triples with a machine learning technique. Another study
introduced Honto? Search, a system that helps users
determine the trustworthiness of uncertain facts by considering their sentimental aspects
[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. This semi-automatic system incorporates the sentiments expressed on
the relevant webpages.
      </p>
      <p>
        In this paper, we present a new technique that utilizes the objectivity and
corroborativeness of Web resources for RDF truth validation. To validate an
RDF triple, we identify confirming sources on the Web as introduced in Defacto
[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. On top of that, we estimate to what extent the confirming sources are neutral
by performing sentiment analysis on the sentences in each page. In contrast to
Honto? Search [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], our approach automatically applies the results of sentiment
analysis to truth validation. Furthermore, we check the corroborativeness of the
RDF triple, that is, the extent to which different types of evidence support the truthfulness
of the triple. We count various types of Web resources such as images, videos,
and news for validation. Finally, we evaluate the performance of the proposed
approach by comparing it with Defacto, which serves as a baseline.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Approach</title>
      <p>
        Our primary purpose is to determine whether a given RDF triple (s, p, o) is true
or false. We treat this problem as a binary classification task. In this section, we
describe how we estimate the objectivity and corroborativeness of webpages.
Figure 1 shows a complete picture of our strategy for estimating the proposed
features of webpages retrieved for a given triple t = (s, p, o). The estimated
objectivity features replace the trustworthiness features proposed by Defacto
[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], and the corroborativeness features are added to the Defacto framework<sup>1</sup>.
The objectivity of a webpage is calculated by measuring the degree of
sentiment of all sentences in the document, and it is estimated as follows:
        <disp-formula><tex-math>\mathrm{objsum}(w) = \sum_{s \in w} \frac{2 - |\mathrm{sentiment}(s)|}{2}</tex-math></disp-formula>
        <disp-formula><tex-math>\mathrm{obj}(w) = \frac{\mathrm{objsum}(w)}{n}</tex-math></disp-formula>
where t is a triple and w is a web document retrieved for t. The document w consists of a set of
sentences s and is denoted as <inline-formula><tex-math>w = \{s_1, s_2, \ldots, s_n\}</tex-math></inline-formula>, where n is the number of
sentences. The sentiment value of a sentence takes one of the following values:
<inline-formula><tex-math>\mathrm{sentiment}(s) \in \{-2, -1, 0, 1, 2\}</tex-math></inline-formula> (very negative, negative, neutral, positive, very
positive). Therefore, obj(w) ranges from 0 (subjective) to 1 (objective).
After obtaining the objectivity score, we multiply it by the trustworthiness score
and the textual proof score of w, on the assumption that webpages with high
trustworthiness, proof score, and objectivity increase the confidence in the input fact.
      </p>
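      <p>The objectivity score defined above can be illustrated with a minimal Python sketch. It assumes that sentence-level sentiment labels on the five-point scale (-2 to 2) have already been produced by a sentiment analyzer; the function names and the handling of an empty page are illustrative assumptions, not part of the original method description.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from typing import Sequence

# Sentiment labels follow the five-point scale in the text:
# -2 very negative, -1 negative, 0 neutral, 1 positive, 2 very positive.
VALID_LABELS = {-2, -1, 0, 1, 2}


def sentence_objectivity(sentiment: int) -> float:
    """Map one sentence's sentiment label to a score in [0, 1]."""
    if sentiment not in VALID_LABELS:
        raise ValueError(f"unexpected sentiment label: {sentiment}")
    return (2 - abs(sentiment)) / 2


def page_objectivity(sentiments: Sequence[int]) -> float:
    """Average the sentence scores over a page, i.e. objsum(w) / n."""
    if not sentiments:
        # Assumption: the text does not say how an empty page is treated.
        raise ValueError("a page needs at least one sentence")
    objsum = sum(sentence_objectivity(s) for s in sentiments)
    return objsum / len(sentiments)


# Hypothetical pages, given as lists of sentence-level sentiment labels.
neutral_page = [0, 0, 0, 0]
mixed_page = [0, 1, -2, 0]
print(page_objectivity(neutral_page))
print(page_objectivity(mixed_page))
```
]]></preformat>
      <p>A page made up only of neutral sentences scores 1, and each sentence with stronger sentiment, whether positive or negative, pulls the score toward 0.</p>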
      <sec id="sec-2-1">
        <title>Footnote 1: http://aksw.org/Projects/DeFacto.html</title>
        <p>
          <disp-formula><tex-math>F^{f}_{\mathrm{objsum}}(t) = \sum_{w \in s(t)} \bigl( f(w) \cdot \mathrm{scw}(w) \cdot \mathrm{obj}(w) \bigr)</tex-math></disp-formula>
          <disp-formula><tex-math>F^{f}_{\mathrm{objmax}}(t) = \max_{w \in s(t)} \bigl( f(w) \cdot \mathrm{scw}(w) \cdot \mathrm{obj}(w) \bigr)</tex-math></disp-formula>
Here, w ranges over s(t), the webpages retrieved for the triple t.
f(w) is instantiated by three trustworthiness scores: topic majority (tmweb),
topic majority in search results (tmsearch), and topic coverage (tc) of w. scw(w)
is the proof score on a webpage w. A proof, which is one kind of supporting evidence on
a webpage that the given triple is true, is defined as a textual occurrence of
s and o within a certain token distance. The details of the trustworthiness and
proof scores are described in [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. We generate the objectivity features by multiplying the
three criteria (trustworthiness, proof score, and objectivity) and using the sum and the maximum of the products as different features. In total, we
propose six objectivity features by combining the three trustworthiness
measures (tmweb, tmsearch, and tc) with the two cases (sum and max): TM Web Obj Sum,
TM Web Obj Max, TM Search Obj Sum, TM Search Obj Max, TC Obj Sum, and TC Obj Max.
        </p>
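        <p>As a rough illustration of how the six objectivity features could be assembled, the following Python sketch multiplies each trustworthiness score by the proof score and the objectivity score of a page, then takes the sum and the maximum over the retrieved pages. The Page container, the underscore-separated feature keys, and the value returned for an empty page set are assumptions made for this example; the per-page scores are assumed to be computed elsewhere.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Page:
    """Hypothetical container for the per-page scores of one retrieved page."""
    tm_web: float     # topic majority (tmweb)
    tm_search: float  # topic majority in search results (tmsearch)
    tc: float         # topic coverage (tc)
    scw: float        # proof score of the page
    obj: float        # objectivity score in [0, 1]


def objectivity_features(pages: List[Page]) -> Dict[str, float]:
    """Build the sum and max features for each trustworthiness measure."""
    measures = (("TM_Web", "tm_web"), ("TM_Search", "tm_search"), ("TC", "tc"))
    features: Dict[str, float] = {}
    for label, attr in measures:
        # f(w) * scw(w) * obj(w) for every retrieved page w
        products = [getattr(p, attr) * p.scw * p.obj for p in pages]
        features[f"{label}_Obj_Sum"] = sum(products)
        # Assumption: a triple with no retrieved pages gets 0.0 for the max.
        features[f"{label}_Obj_Max"] = max(products, default=0.0)
    return features


# Two made-up pages retrieved for one triple.
pages = [
    Page(tm_web=0.5, tm_search=0.25, tc=0.75, scw=1.0, obj=1.0),
    Page(tm_web=1.0, tm_search=0.5, tc=0.25, scw=0.5, obj=0.5),
]
for name, value in objectivity_features(pages).items():
    print(name, value)
```
]]></preformat>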
        <p>Moreover, to validate RDF triples, we utilize various Web resources:
images, videos, and news. We count the number of hits for images, videos, and
news. Our rationale is that true triples are likely to have more supporting
evidence in multiple forms (i.e., images, videos, and news) than false ones
because diverse types of information about true triples accumulate on the
Web. For example, if we take an RDF triple such as ("Maroon 5", "Song", "Sugar")
and search the Web with the "AND" operator (e.g., "Maroon 5" AND "Song"
AND "Sugar") using the Bing Search API<sup>2</sup>, the API returns 14200 images,
833000 videos, and 16800 news items as of June 2015; however, with a false triple
("Maroon 5", "Song", "Blank Space"), the API returns 4530, 103000, and 4290,
respectively. ("Blank Space" is a song by Taylor Swift.) For corroborativeness,
we employ the following three features: Num Images, Num Videos, and Num News.</p>
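        <p>The corroborativeness features can be sketched in Python as follows. The example only builds the AND query string and packages hit counts into the three features; it makes no search API call, and the hit counts are the figures quoted in the text. The function names, dictionary keys, and underscore-separated feature names are assumptions for illustration.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from typing import Dict, Tuple

Triple = Tuple[str, str, str]


def build_and_query(triple: Triple) -> str:
    """Quote each element of the triple and join the parts with AND."""
    return " AND ".join(f'"{part}"' for part in triple)


def corroborativeness_features(hits: Dict[str, int]) -> Dict[str, int]:
    """Turn per-resource-type hit counts into the three named features."""
    return {
        "Num_Images": hits.get("images", 0),
        "Num_Videos": hits.get("videos", 0),
        "Num_News": hits.get("news", 0),
    }


# Hit counts quoted in the text (as of June 2015); nothing is fetched here.
true_hits = {"images": 14200, "videos": 833000, "news": 16800}
false_hits = {"images": 4530, "videos": 103000, "news": 4290}

print(build_and_query(("Maroon 5", "Song", "Sugar")))
print(corroborativeness_features(true_hits))
print(corroborativeness_features(false_hits))
```
]]></preformat>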
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Experiment</title>
      <p>We collected 570 RDF triples from DBpedia<sup>3</sup> as true triples, covering the 57 most
frequently used properties in DBpedia. For each property, we randomly selected triples containing
that property. We derived the false triples from the true triples with the
following restriction: a triple (s', p', o') is generated, where s' and o' are randomly
selected resources and p' is a property randomly selected from the defined
properties. As a baseline, we applied Defacto, which is, to the best of our knowledge, the only Web-based RDF
true/false validation framework. To measure the
objectivity of sentences in webpages, we used the sentiment analysis tool<sup>4</sup> in Stanford
CoreNLP. The classification into the two classes (true and false) was conducted
using ten-fold cross-validation. We used random forest (RF), which has been
widely adopted for classification, in the Weka toolkit<sup>5</sup> with the default
parameter values. We report four measures: precision (P), recall (R),
F1 score (F1) (micro-averaged), and false positive rate (FP rate).</p>
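      <p>One way to picture the generation of false triples is the Python sketch below. It assumes that the random resources are drawn from the subjects and objects of the true triples and that candidates coinciding with a known true triple are rejected; the text does not spell out either detail, so both are assumptions, as are the function name and parameters.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import random
from typing import List, Sequence, Set, Tuple

Triple = Tuple[str, str, str]


def generate_false_triples(
    true_triples: Sequence[Triple],
    properties: Sequence[str],
    count: int,
    seed: int = 0,
    max_attempts: int = 10000,
) -> List[Triple]:
    """Sample (s', p', o') with random resources and a random defined property."""
    rng = random.Random(seed)
    known: Set[Triple] = set(true_triples)
    # Assumption: the resource pool is every subject and object seen so far.
    resources = sorted({t[0] for t in true_triples} | {t[2] for t in true_triples})
    negatives: List[Triple] = []
    seen: Set[Triple] = set()
    attempts = 0
    while len(negatives) < count and attempts < max_attempts:
        attempts += 1
        candidate = (rng.choice(resources), rng.choice(properties), rng.choice(resources))
        # Assumption: skip duplicates and candidates that are known to be true.
        if candidate in known or candidate in seen:
            continue
        seen.add(candidate)
        negatives.append(candidate)
    return negatives


true_triples = [
    ("Maroon 5", "Song", "Sugar"),
    ("Taylor Swift", "Song", "Blank Space"),
]
for triple in generate_false_triples(true_triples, ["Song"], count=3):
    print(triple)
```
]]></preformat>
      <p>Random sampling of this kind can occasionally produce a triple that is true but missing from the reference set, so the generated negatives are best regarded as presumed false.</p>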
      <sec id="sec-3-1">
        <title>Footnotes: 2 https://datamarket.azure.com/dataset/bing/search; 3 http://live.dbpedia.org/sparql; 4 http://nlp.stanford.edu/sentiment/; 5 http://www.cs.waikato.ac.nz/ml/weka/</title>
        <p>The classification results are shown in Table 1. Note that the performance
was measured by replacing the six trustworthiness features proposed by
Defacto with the corresponding objectivity features (e.g., from TM Web Sum to TM Web Obj Sum) and by adding the
corroborativeness features. With RF, the FP rate decreased by 5.9% using
the objectivity features and by 6.7% using the corroborativeness features.</p>
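        <p>For reference, the reported measures can be computed from binary predictions as in the following Python sketch. It treats "true triple" as the positive class, so the FP rate is the share of false triples that are wrongly accepted as true. The labels in the example are made up for illustration and are not results from this study, and the sketch reports positive-class scores rather than the micro-averaged scores used in the paper.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[bool], y_pred: Sequence[bool]) -> Dict[str, float]:
    """Precision, recall, F1, and FP rate with True as the positive class."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    # FP rate: false triples predicted as true, over all false triples.
    fp_rate = fp / (fp + tn) if fp + tn else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "fp_rate": fp_rate}


# Made-up labels: four true triples followed by four false triples.
y_true = [True, True, True, True, False, False, False, False]
y_pred = [True, True, True, False, True, False, False, False]
print(binary_metrics(y_true, y_pred))
```
]]></preformat>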
        <p>
          The mean objectivity score of the websites retrieved for the true triples was
0.871 (StdDev: 0.202), and that of the websites retrieved for the false triples was 0.804 (StdDev: 0.262). This
indicates that the websites retrieved for the true triples express less sentiment in their
text than those retrieved for the false triples. The overall decrease in the FP rate
and the different distributions of the objectivity score in the two classes confirm the
effectiveness of objectivity in judging the trustworthiness of a fact. In addition,
our results are in general agreement with the results of a prior study, in which
users acknowledged that sentiment analysis for a fact is useful in fact validation [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ].
Furthermore, corroborativeness is shown to be highly effective, supporting the
rationale that diverse types of information accumulated on the Web, not limited to text,
are useful evidence about the truthfulness of the given triple.
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusion and Future Work</title>
      <p>In this paper, we presented an approach that applies the objectivity and
corroborativeness of webpages retrieved for RDF triples to the true/false validation
of those triples, showing the effectiveness of these concepts in knowledge
validation. For future work, we plan to extend the current study with
a deeper analysis of the differences in sentiment distributions between true
and false facts, and to test whether our approach holds even for
political triples, where it may have limitations.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgement</title>
      <p>This work was supported by an Institute for Information &amp; communications Technology Promotion (IITP)
grant funded by the Korea government (MSIP) (No. R0101-15-0054, WiseKB: Big data based
self-evolving knowledge base and reasoning platform).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Lehmann</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gerber</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Morsey</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ngonga Ngomo</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>DeFacto - Deep Fact Validation</article-title>
          .
          <source>In: ISWC2012</source>
          ,
          <fpage>312</fpage>
          –
          <lpage>327</lpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Yamamoto</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tezuka</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jatowt</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tanaka</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Supporting Judgment of Fact Trustworthiness Considering Temporal and Sentimental Aspects</article-title>
          .
          <source>In: WISE2008</source>
          ,
          <fpage>206</fpage>
          –
          <lpage>220</lpage>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>