<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Fact Validation with Knowledge Graph Embeddings</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ammar Ammar</string-name>
          <email>a.ammar@student.maastrichtuniversity.nl</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Remzi Celebi</string-name>
          <email>remzi.celebi@maastrichtuniversity.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute of Data Science, Maastricht University</institution>
          ,
          <addr-line>Maastricht</addr-line>
          ,
          <country country="NL">Netherlands</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Maastricht Centre for Systems Biology, Maastricht University</institution>
          ,
          <addr-line>Maastricht</addr-line>
          ,
          <country country="NL">Netherlands</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Fact validation in a knowledge graph is the task of determining whether a given fact (subject, predicate, object) should appear in the knowledge graph. In this paper, we describe our approach to the fact validation task of the Semantic Web Challenge 2019. We used embedding features with machine learning to predict facts that were missing from the knowledge graph. The embedding features were generated by applying the RDF2Vec knowledge graph embedding method to a knowledge graph containing only positive statements. To improve our machine learning model, we added the test facts that we could validate via public sources to the positive knowledge graph. We trained a Random Forest classifier on the training data (positive and negative statements) plus the verified test statements and made predictions for the test data.</p>
      </abstract>
      <kwd-group>
        <kwd>Fact validation</kwd>
        <kwd>Fact checking</kwd>
        <kwd>Knowledge Graph Embedding</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Knowledge graphs are currently among the most prominent implementations
of Semantic Web technologies. The task of fact checking in knowledge graphs,
which is to decide whether a fact t is missing from a given knowledge graph
G, is among the cornerstones of knowledge base management. Verified facts
can be used to (1) refine incomplete knowledge graphs, (2) detect violations in
knowledge graphs, (3) improve the quality of knowledge search, and (4) integrate
multiple knowledge graphs [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. This year, the International Semantic Web
Conference released a dataset for the fact validation challenge as part of
the Semantic Web Challenge. The challenge task is to assess the correctness of a
given statement about drugs, diseases, and products. The challenge participants are
asked to assign a trust score to each statement (i.e., a numerical
value between 0 and 1), where 0 means that they are sure that the statement is
false and 1 means that they are sure that the statement is true. For this challenge,
we propose a machine learning model that uses embedding features to predict whether a given
statement is true (i.e., to validate the correctness of the statement).
      </p>
      <p>Copyright © 2019 for this paper by its authors. Use permitted under Creative
Commons License Attribution 4.0 International (CC BY 4.0).
      </p>
    </sec>
    <sec id="sec-2">
      <title>Dataset</title>
      <p>The core dataset consists of a graph of entities (drugs, diseases and products)
and information linking these entities. The dataset was created by extracting
information from a well-known source and identifying links between entities. It
is split into a training set, made available for building a system, and a test
set on which the system makes predictions. Both
the training and test sets consist of 25k examples with positive and negative
statements, equally distributed among the following five properties:
        <list list-type="bullet">
          <list-item><p>http://dice-research.org/ontology/drugbank/interactsWith</p></list-item>
          <list-item><p>http://dice-research.org/ontology/drugbank/hasCommonIndication</p></list-item>
          <list-item><p>http://dice-research.org/ontology/drugbank/hasSameState</p></list-item>
          <list-item><p>http://dice-research.org/ontology/drugbank/hasIndication</p></list-item>
          <list-item><p>http://dice-research.org/ontology/drugbank/hasCommonProducer</p></list-item>
        </list>
      </p>
      <p>While the challenge organizers generated positive statements by identifying
the entities for which the proposed properties hold, they generated the negative
statements by replacing the entities in the positive statements such that the
generated triples are false or invalid.</p>
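      <p>The organizers' exact replacement procedure is not described here. As a minimal sketch of the general idea, assuming that any triple absent from the positive set counts as false, a negative statement could be produced by swapping the object of a positive triple for another entity. The function name corrupt_object and the toy identifiers are hypothetical.</p>
      <preformat><![CDATA[
```python
import random


def corrupt_object(triple, entities, positives, rng):
    """Return a copy of `triple` with its object replaced, or None if impossible.

    Assumption: a triple that is not in `positives` is treated as false.
    """
    subject, predicate, obj = triple
    # Keep only replacements that change the object and do not recreate a known fact.
    candidates = [
        entity
        for entity in entities
        if entity != obj and (subject, predicate, entity) not in positives
    ]
    if not candidates:
        return None
    return (subject, predicate, rng.choice(candidates))


# Toy data; identifiers are illustrative only.
positives = {("d1", "hasIndication", "x"), ("d1", "hasIndication", "y")}
negative = corrupt_object(
    ("d1", "hasIndication", "x"), ["x", "y", "z"], positives, random.Random(0)
)
```
]]></preformat>
      <p>Filtering the candidates against the positive set avoids emitting a "negative" that is in fact a known true statement.</p>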
    </sec>
    <sec id="sec-3">
      <title>Methods and Results</title>
      <p>We used embedding features to train a machine learning model that predicts
trust scores for test facts, as shown in Figure 1. To learn the embedding
features, the training KG, given in the form of reified statements, was converted
to explicit positive statements (e.g., &lt;drug, hasIndication, disease&gt;). We
trained our classifier on the training data (positive and negative statements) plus
the verified test statements and made predictions for the test data.</p>
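      <p>The conversion from reified to explicit statements can be sketched as follows. This is an illustration under stated assumptions, not the code used for the challenge: it assumes the standard RDF reification properties (rdf:subject, rdf:predicate, rdf:object), and the truth-value property URI TRUTH_VALUE is a hypothetical placeholder.</p>
      <preformat><![CDATA[
```python
from collections import defaultdict

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
# Hypothetical property URI; the challenge's actual truth-value property may differ.
TRUTH_VALUE = "http://example.org/hasTruthValue"


def positive_triples(reified):
    """Turn reified statements into explicit (s, p, o) triples, keeping true ones only."""
    # Group the property/value pairs by the statement node they describe.
    fields = defaultdict(dict)
    for statement, prop, value in reified:
        fields[statement][prop] = value

    explicit = []
    for parts in fields.values():
        is_true = float(parts.get(TRUTH_VALUE, 0.0)) == 1.0
        complete = all(RDF + name in parts for name in ("subject", "predicate", "object"))
        if is_true and complete:
            explicit.append(
                (parts[RDF + "subject"], parts[RDF + "predicate"], parts[RDF + "object"])
            )
    return explicit


# Toy input: one true and one false reified statement.
reified = [
    ("s1", RDF + "subject", "drug1"),
    ("s1", RDF + "predicate", "hasIndication"),
    ("s1", RDF + "object", "disease1"),
    ("s1", TRUTH_VALUE, "1.0"),
    ("s2", RDF + "subject", "drug2"),
    ("s2", RDF + "predicate", "hasIndication"),
    ("s2", RDF + "object", "disease1"),
    ("s2", TRUTH_VALUE, "0.0"),
]
explicit = positive_triples(reified)
```
]]></preformat>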
      <p>We added the verified facts to the positive knowledge graph to generate
better embedding feature vectors. We used the external resource DrugBank to
check whether the corresponding DrugBank relations hold for the test facts. This can be seen
as an enrichment of the training knowledge graph. While enriching with
DrugBank, we included only the properties that are relevant to the challenge dataset.
The properties extracted from the DrugBank XML (v5.1.1) are:
        <list list-type="bullet">
          <list-item><p>indication</p></list-item>
          <list-item><p>state</p></list-item>
          <list-item><p>drug-interactions</p></list-item>
          <list-item><p>packagers</p></list-item>
        </list>
      </p>
      <p>To validate test facts, we linked the challenge data to the DrugBank dataset by
mapping the drug and disease entities to DrugBank drugs and Human Disease
Ontology (DO) diseases, respectively. The challenge uses the same unique
identifiers as DrugBank for drugs. To link the challenge disease entities, the
text of each indication section in DrugBank was annotated with Human Disease
Ontology terms using the BioPortal API.</p>
      <p>After disease annotations were obtained for the drugs in both the training and test
datasets, the normalized Levenshtein similarity was computed between disease
names to match the annotated diseases (DO) with the challenge diseases. A
test fact is considered verified if the proposed relation between the
identified entities holds in DrugBank.</p>
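      <p>The name matching step can be illustrated with a small sketch. The text does not state how the similarity was normalized or which cut-off was applied, so the division by the longer name's length and the threshold of 0.9 are assumptions, as are the function names.</p>
      <preformat><![CDATA[
```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost)
            )
        previous = current
    return previous[-1]


def normalized_similarity(a, b):
    """Similarity in [0, 1]; assumes normalization by the longer string's length."""
    a, b = a.strip().lower(), b.strip().lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def best_match(name, candidates, threshold=0.9):
    """Return the most similar candidate, or None if it is below the assumed threshold."""
    best = max(candidates, key=lambda c: normalized_similarity(name, c), default=None)
    if best is not None and normalized_similarity(name, best) >= threshold:
        return best
    return None
```
]]></preformat>
      <p>For instance, best_match("hypertension", ["hypertention", "diabetes mellitus"]) accepts the misspelled candidate because a single substitution in a 12-character name gives a similarity of about 0.92.</p>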
      <p>
        For feature learning, we need a proper representation of the entities in the
knowledge graph that reflects their features. Here, we used the RDF2Vec
approach [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], in which a random walk algorithm generates a fixed number of walks of a
specific depth for each entity. We generated 200 walks per entity with a depth of 5.
We learned from a previous study [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] that a random walk depth of 5 gives the best results for knowledge graph
learning. Next, the random walks were fed into the Word2Vec algorithm, which treats
each walk as a sequence of terms (i.e., subjects, predicates and objects). We applied
Word2Vec with the CBOW architecture and a layer size of 200 to generate the
graph embeddings. In our experiments, the dimension of the feature vector was not
critical as long as it was in the range of 100-500.
      </p>
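      <p>The walk generation step can be sketched as below, using the parameters stated above (200 walks per entity, depth 5). This is a simplified illustration rather than the RDF2Vec reference implementation: it assumes that depth counts hops, that walks follow outgoing edges only, and that a walk stops early at an entity without outgoing edges. The function names are hypothetical.</p>
      <preformat><![CDATA[
```python
import random
from collections import defaultdict


def build_adjacency(triples):
    """Map each subject to its outgoing (predicate, object) edges."""
    adjacency = defaultdict(list)
    for subject, predicate, obj in triples:
        adjacency[subject].append((predicate, obj))
    return adjacency


def random_walks(triples, walks_per_entity=200, depth=5, seed=0):
    """Generate random walks as lists of alternating entity and predicate terms."""
    rng = random.Random(seed)
    adjacency = build_adjacency(triples)
    entities = sorted({s for s, _, _ in triples} | {o for _, _, o in triples})
    walks = []
    for entity in entities:
        for _walk_index in range(walks_per_entity):
            walk = [entity]
            node = entity
            for _hop in range(depth):
                edges = adjacency.get(node)
                if not edges:
                    break  # dead end: no outgoing edges from this node
                predicate, node = rng.choice(edges)
                walk.extend([predicate, node])
            walks.append(walk)
    return walks


# Toy graph: a -p-> b -q-> c
toy_walks = random_walks([("a", "p", "b"), ("b", "q", "c")], walks_per_entity=3)
```
]]></preformat>
      <p>Each walk is a token sequence that a Word2Vec implementation can consume as a sentence. One possible choice, not named in the text, is gensim's Word2Vec class with sg=0 (CBOW) and vector_size=200.</p>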
      <p>Fig. 2 shows the workflow for generating the embeddings. After that, a
Random Forest classifier was trained on the training set plus the verified
statements. We used 200 estimators for the Random Forest classifier
and left the remaining parameters at their defaults. To represent the feature
vector of a statement, we concatenated the embedding vectors of the subject and
object entities with a numerical value encoding the predicate between the entities.
The final model was used to predict a probability for each statement in the test
set. The predictions were submitted to the challenge website, which reported an
AUC of 0.99971976. This was the highest score among
all the participants in the challenge according to the leaderboard on the
challenge website<sup>3</sup>. The same method applied to the original dataset,
without the DrugBank enrichment, obtained an AUC of
0.9926. These results suggest that the main
contributor to the performance is the proposed embedding-based method rather than
the enrichment of the data with an external resource.</p>
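      <p>The statement representation described above can be sketched as follows. The exact predicate encoding is not specified in the text, so mapping each predicate to an integer index in sorted order is an assumption, and the function names and toy two-dimensional embeddings are illustrative only.</p>
      <preformat><![CDATA[
```python
def encode_predicates(predicates):
    """Assign each distinct predicate an integer code (assumed encoding)."""
    return {predicate: code for code, predicate in enumerate(sorted(set(predicates)))}


def statement_features(subject, predicate, obj, embeddings, predicate_codes):
    """Concatenate subject embedding, object embedding and the predicate code."""
    return (
        list(embeddings[subject])
        + list(embeddings[obj])
        + [float(predicate_codes[predicate])]
    )


# Toy 2-dimensional embeddings; the paper uses 200 dimensions per entity,
# which would give a 401-dimensional feature vector per statement.
embeddings = {"drug1": [0.1, 0.2], "disease1": [0.3, 0.4]}
codes = encode_predicates(["hasIndication", "interactsWith", "hasIndication"])
features = statement_features("drug1", "interactsWith", "disease1", embeddings, codes)
```
]]></preformat>
      <p>Vectors of this form could then be passed to a Random Forest implementation. One possible choice, not named in the text, is scikit-learn's RandomForestClassifier with n_estimators=200, whose predict_proba output can serve as the trust score.</p>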
      <p>Acknowledgments. This work was supported by funding from King Abdullah
University of Science and Technology (KAUST) Office of Sponsored Research
(OSR) under Award No. URF/1/3454-01-01.</p>
      <p><sup>3</sup> https://dice-group.github.io/semantic-web-challenge.github.io/</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Fact Checking in Knowledge Graphs with Ontological Subgraph Patterns</article-title>
          .
          <source>Data Science and Engineering</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Ristoski</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Paulheim</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <year>2016</year>
          , October.
          <article-title>RDF2Vec: RDF graph embeddings for data mining</article-title>
          .
          <source>In International Semantic Web Conference</source>
          (pp.
          <fpage>498</fpage>
          -
          <lpage>514</lpage>
          ). Springer, Cham.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Celebi</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yasar</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Uyar</surname>
            ,
            <given-names>G. H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gumus</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dikenelli</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Dumontier</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <year>2018</year>
          .
          <article-title>Evaluation of Knowledge Graph Embedding Approaches for Drug-Drug Interaction Prediction using Linked Open Data</article-title>
          .
          <source>Semantic Web Applications and Tools for Healthcare and Life Sciences</source>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>