<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>The Journal of Supercomputing 79 (2023) 18417-18444. doi:10.1007/s11227</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.1145/3292500.3330935</article-id>
      <title-group>
        <article-title>Graph Embeddings into RAG Architectures: Scalable Fact-Checking for Combating Disinformation with LLMs</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Orlando Abuanza Ubaque</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Diego Rincon-Yanez</string-name>
          <email>diego.rincon@adaptcentre.ie</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Declan O'Sullivan</string-name>
          <email>declan.osullivan@adaptcentre.ie</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>ADAPT Centre for Digital Content</institution>
          ,
          <addr-line>Dublin</addr-line>
          ,
          <country country="IE">Ireland</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Pontificia Universidad Javeriana</institution>
          ,
          <addr-line>Bogotá, Colombia</addr-line>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>School of Computer Science and Statistics, Trinity College Dublin</institution>
          ,
          <addr-line>Dublin</addr-line>
          ,
          <country country="IE">Ireland</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Semantic Systems</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <volume>13998</volume>
      <fpage>395</fpage>
      <lpage>405</lpage>
      <abstract>
        <p>The growing threat of disinformation and misinformation across digital platforms has intensified the demand for systems capable of producing verifiable and trustworthy outputs. With the widespread adoption of Large Language Models (LLMs) for a variety of tasks, the requirement to provide accurate and fact-verifiable answers grows daily. GraphRAGs have become a powerful approach for solving complex tasks that require factual context to deliver accurate and explainable answers. However, the Knowledge Bases (KBs) used to provide factual and contextual knowledge are composed of thousands or millions of statements, while the input an LLM can handle is limited, typically by the number of input tokens the model supports. This work addresses the problem of fact-checking by injecting Knowledge Graph Embedding (KGE) vector representations into LLMs using a Retrieval Augmented Generation (RAG) approach to obtain more accurate results. The results show a notable difference in output quality between two different vector representations and two KB construction methods.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The disinformation phenomenon is a growing concern due to its ability to distort mass public perception
of reality and undermine trust in valid information sources. Its impact can be significant, as evidenced
during the pandemic [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Based on intent, disinformation [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] can be classified into two categories:
information or news (1) created to deceive or (2) produced without the intention to deceive.
      </p>
      <p>
        With the significant adoption of Large Language Models (LLMs) and their application across many
disciplines, particularly the communication sciences, source validation and fact-checking are essential
features. One of the best-known constraints of Large Language Models is hallucination [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. These
instances represent statistical drift in the model's token generation, which sometimes
produces content that contradicts or misrepresents real-world factual knowledge.
      </p>
      <p>
        This work presents an approach to detecting disinformation by combining Knowledge Graphs and
their vector representations (known as knowledge graph embeddings) with large language models.
The approach utilises the WELFake dataset [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], a well-known fake news dataset. From this dataset, a
Knowledge Graph (KG) is generated for each news article using OpenIE techniques. Then, vector-space
representations are generated using low-dimensional embedding algorithms such as RotatE and
TransE. To retrieve context from the entire KB, a single article is decomposed into triples and
compared via its vector representation and social network analysis techniques. Finally, the retrieved
context is sent to a Large Language Model (LLM).
      </p>
      <p>CEUR Workshop Proceedings (ceur-ws.org)</p>
      <p>This paper is structured as follows: Section 2 presents a brief overview of Knowledge Graphs and their
embeddings, and of Retrieval Augmented Generation (RAG) techniques and scenarios. Section 3 presents
the overall procedure, with a special focus on retrieval from the KG; Section 4 details the
experimentation scenarios and the metrics utilised. Finally, Section 5 draws some conclusions.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related work</title>
      <p>
        Numerous studies have focused on detecting disinformation on social media using a diverse
range of AI models. Typically, these studies extract linguistic features and train models [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] such as
K-nearest neighbours (KNN), support vector machines (SVM), and stochastic gradient descent (SGD). In
some cases, contextual user information is also considered, in addition to finding patterns that
could help verify the validity of a particular story [6].
      </p>
      <p>NeuroSymbolic AI (NeSyAI), as a combination of neural and symbolic methods, positions itself as a promising
candidate for many applications [7, 8]. One benefit of neuro-symbolic solutions is the integration of
domain knowledge, for example in the form of Knowledge Graphs (KGs). Integrating KGs as a structured,
symbolic knowledge representation into RAG-type applications offers a powerful approach to reducing
hallucinations, combining the ability of language models to analyse text
with the capability to retrieve relevant information from external sources, such as specialised knowledge
bases [9].</p>
      <p>Knowledge Graphs used to counter disinformation have been notable for their ability to capture
and represent complex semantic relationships between entities. Although manual information
extraction and annotation can be employed to generate these KGs, such methods remain practical mainly
for smaller datasets or when precision in relationship capture is required [10]. Along the same lines,
hybrid approaches to disinformation detection have been proposed [11] that integrate a KG through heterogeneous
representation ensembles and use neural networks to combine representations from Language
Models, allowing a deeper understanding of context and of the relationships between the mentioned
entities.</p>
      <p>Integrating KGs into Retrieval-Augmented Generation (RAG) combines the ability of Language Models
to analyse text with the retrieval of relevant information from external sources, thereby enhancing accuracy
and reliability while mitigating hallucinations [12]. Techniques such as RotatE, TransE, or DistMult can
enrich knowledge representation [13]; combined with the efficient construction of KGs, they can provide a solid
foundation for knowledge representation, improving pre-trained language models and thereby contributing
to accuracy, effectiveness, and fact-checking capabilities [14].</p>
    </sec>
    <sec id="sec-3">
      <title>3. Injecting KGE into the GraphRAG Workflow</title>
      <p>To move closer to a scenario where LLMs can be confidently used for fake news detection, the LLMs
must be equipped with tools for this purpose. Specifically, the use of Retrieval-Augmented
Generation (RAG) with knowledge graph embeddings is explored. The proposed method constructs the
knowledge graphs (KGs) with the support of OpenIE methods, where entities and relationships are extracted
from unstructured text data (the news text). For this task, the well-known Stanford CoreNLP, accessed through the
Stanza library, provides pre-trained models and a pipeline-based approach to structure data into
subject-predicate-object (S, P, O) triplets. KGs were also generated using the REBEL model [15]. REBEL
reformulates the task as a sequence-to-sequence (seq2seq) problem within a pre-trained language model,
BART (Bidirectional and Auto-Regressive Transformers) [16]. Once the news articles are converted
into KGs, additional triples are added, inserting a TrueNews or FakeNews label into the news article
representation, following the representation shown in Figure 1.</p>
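      <p>The labelling step above can be sketched as follows. This is a minimal illustration, not the authors' code: each article's OpenIE output is held as (S, P, O) tuples and the veracity label is appended as one extra triple. The predicate name "hasVeracityLabel" and the article identifier are hypothetical choices for illustration.</p>
      <preformat>
```python
# Illustrative sketch only: attach a TrueNews/FakeNews label triple to the
# triples extracted from one news article, as described in the text.
# "hasVeracityLabel" is a made-up predicate name, not from the paper.

def build_labelled_kg(article_id, openie_triples, is_fake):
    """Return (subject, predicate, object) triples for one news article,
    with a TrueNews/FakeNews label triple attached."""
    triples = [(s, p, o) for (s, p, o) in openie_triples]
    label = "FakeNews" if is_fake else "TrueNews"
    triples.append((article_id, "hasVeracityLabel", label))
    return triples

kg = build_labelled_kg(
    "news_001",
    [("Clinton", "met", "Podesta")],
    is_fake=True,
)
```
      </preformat>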
      <sec id="sec-3-1">
        <title>3.1. Knowledge Graph Embedding Techniques</title>
        <p>Knowledge graph embeddings (KGEs) learn low-dimensional representations of the nodes and edges of a labelled,
directed multigraph in order to predict missing parts of a triple (entities or relations). They have been utilised in
various tasks, including fact-checking, question answering, link prediction, and entity linking [17].</p>
        <p>The TransE model is one of the most well-known in this category. It uses a translation-based approach
to model relationships: for a valid triplet (h, r, t), the embedding of the tail entity t should be close to the
embedding of the head h plus a vector representing the relation r in the embedding space, i.e. h + r ≈ t. This
model naturally captures hierarchical and structural relationships. However, its simplicity presents
limitations when modelling more complex relationships, such as non-transitive or many-to-many
relations [18].</p>
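        <p>The TransE criterion above can be written as a one-line scoring function; the following is an illustrative numpy sketch (not the trained models used in this work), where a small distance between h + r and t indicates a plausible triple.</p>
        <preformat>
```python
import numpy as np

# Illustrative TransE scoring sketch: a triple (h, r, t) is plausible when the
# translated head h + r lies close to the tail t, i.e. the distance is small.

def transe_score(h, r, t):
    """L2 distance of h + r from t; lower means more plausible."""
    return float(np.linalg.norm(h + r - t))

h = np.array([0.1, 0.2])
r = np.array([0.3, 0.1])
t = np.array([0.4, 0.3])
print(transe_score(h, r, t))  # approximately 0 for a well-fitting triple
```
        </preformat>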
        <p>RotatE, also one of the geometric models, represents relationships as rotations in the complex plane.
For the triplet (h, r, t), RotatE uses a rotation operation to transform the embedding of entity h into that
of t: t ≈ h ∘ r, where ∘ denotes the element-wise (Hadamard) product. This strategy is particularly useful for modelling
complex patterns such as symmetry, asymmetry, transitivity, and inversion [19].</p>
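        <p>The rotation t ≈ h ∘ r can likewise be sketched in a few lines of numpy (again illustrative, not the paper's trained embeddings): entities are complex vectors and each relation is a vector of unit-modulus complex numbers, so the Hadamard product applies a per-dimension rotation.</p>
        <preformat>
```python
import numpy as np

# Illustrative RotatE sketch: relations are unit-modulus complex vectors, so
# h * r rotates each dimension of the head embedding in the complex plane.

def rotate_score(h, r, t):
    """Distance of the rotated head h * r from t; lower means more plausible."""
    return float(np.linalg.norm(h * r - t))

theta = np.array([np.pi / 2, np.pi])   # rotation angle per dimension
r = np.exp(1j * theta)                 # unit-modulus relation embedding
h = np.array([1 + 0j, 2 + 0j])
t = h * r                              # tail obtained by rotating the head
print(rotate_score(h, r, t))           # approximately 0 by construction
```
        </preformat>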
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Enhanced Retrieval Augmented Scenario</title>
        <p>Once the KG is constructed, various manipulations can be performed to extract information from
it. A particularly effective approach is the use of a neighbourhood function, which searches for the
nearest neighbours in proximity using algorithms such as Dijkstra’s algorithm. By incorporating such
information, we can bridge the gap between surface-level entity recognition and a deeper, more accurate
understanding of the news. The graphs generated using Stanford NLP and REBEL were processed to obtain
embeddings in the vector space. The two Knowledge Graph Embedding (KGE) models were then applied
to capture the semantic relationships and structure of the graphs.</p>
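        <p>A neighbourhood function of the kind described above can be sketched as a plain Dijkstra traversal over the triple graph. This is a hedged illustration under assumptions not stated in the paper: edges are taken as unit-weight and the KG is treated as undirected for the purpose of finding nearby entities within a hop budget.</p>
        <preformat>
```python
import heapq

# Illustrative neighbourhood retrieval: Dijkstra over a unit-weight graph
# built from (s, p, o) triples, returning entities within `max_dist` hops.
# Treating the KG as undirected is an assumption for this sketch.

def neighbourhood(triples, source, max_dist=2):
    """Map each entity reachable from `source` within `max_dist` hops to its distance."""
    adj = {}
    for s, _, o in triples:
        adj.setdefault(s, []).append(o)
        adj.setdefault(o, []).append(s)
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, d):
            continue  # stale heap entry
        for v in adj.get(u, []):
            nd = d + 1
            if nd > max_dist or nd >= dist.get(v, nd + 1):
                continue
            dist[v] = nd
            heapq.heappush(heap, (nd, v))
    return dist

triples = [("Clinton", "met", "Podesta"), ("Podesta", "advised", "Band")]
print(neighbourhood(triples, "Clinton", max_dist=2))
```
        </preformat>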
        <p>For the search for nearest entities, the unique entities extracted from the subject and object nodes of
the news were mapped into the vector space of the globally generated KGE, which serves as the knowledge
base. Cosine similarity was then computed between the vector of the target entity and the vectors of all
other entities in the vector space. This step allowed the identification of the entities most similar to the
target entity in the news.</p>
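        <p>This similarity search can be sketched as follows (illustrative only, with toy two-dimensional vectors standing in for the KGE space): normalise the target and all knowledge-base vectors, take dot products, and keep the top-k indices.</p>
        <preformat>
```python
import numpy as np

# Illustrative nearest-entity search: cosine similarity between one target
# entity vector and every entity vector in the knowledge base, returning the
# indices of the k most similar entities.

def nearest_entities(target_vec, entity_matrix, k=3):
    """Indices of the k entities whose embeddings are most cosine-similar."""
    a = target_vec / np.linalg.norm(target_vec)
    b = entity_matrix / np.linalg.norm(entity_matrix, axis=1, keepdims=True)
    sims = b @ a                    # cosine similarity per entity
    return np.argsort(-sims)[:k]    # highest similarity first

entities = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
print(nearest_entities(np.array([1.0, 0.05]), entities, k=2))
```
        </preformat>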
        <p>Subsequently, for each combination of a subject entity from the news and its nearest neighbours,
the pre-trained KGE model was used to predict the most probable relation between them. This step
enabled the identification of potential connections between the news and the existing knowledge base.
Finally, the extracted entities, relationships, and triplets were used as augmented knowledge. This
context was fed into GPT-4o (the LLM) during the final stage to generate predictions or grounded responses
based on the provided knowledge.</p>
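        <p>The relation-prediction step can be illustrated with the TransE criterion used earlier: given a subject embedding and a neighbour embedding, score every known relation by the distance of h + r from t and keep the best. This is a sketch under assumptions; the relation names and vectors below are made up for illustration.</p>
        <preformat>
```python
import numpy as np

# Illustrative relation prediction: rank candidate relations between a head
# and a tail entity by the TransE distance and return the best-scoring name.
# The relation vocabulary here is hypothetical.

def predict_relation(h, t, relations):
    """Return the relation name whose TransE distance from h to t is smallest."""
    return min(relations, key=lambda name: np.linalg.norm(h + relations[name] - t))

relations = {
    "wrote": np.array([0.5, 0.0]),
    "met":   np.array([0.0, 0.5]),
}
h = np.array([0.2, 0.1])
t = np.array([0.2, 0.6])   # consistent with the "met" translation
print(predict_relation(h, t, relations))
```
        </preformat>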
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiment and Evaluation</title>
      <p>The evaluation was designed to assess both the RAG model’s performance and the quality of the
generated text, specifically in terms of factual accuracy, contextual understanding, and transparency. Six
metrics were applied to evaluate the performance of the tested models: precision, accuracy, contextual
understanding, compelling misinformation, transparency and traceability, and source-retrieval accuracy.
The average of the evaluations for each scenario on each metric was used to rate the proposed approach,
as shown in Figure 2. The results are presented in Table 1, where it is evident that the best-performing
model is TransE with REBEL for Knowledge Graph construction.</p>
      <p>However, on precision, specifically, accurately predicting whether a news item is false, it ties with
TransE with Stanford, which also achieved the best performance in retrieving information useful for
decision-making. On the other hand, the worst-performing model overall was RotatE with REBEL,
although it predicted truthfulness just as well as the baseline model.</p>
      <p>The most computationally and time-intensive model was RotatE with Stanford, requiring more than
8 hours of training on a Google Colab T4 GPU. In all cases, the models tended to predict true news
correctly but often misclassified false news.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions and Future Work</title>
      <p>While LLMs can model the veracity of news, they rely on the assumption that if the supposed source
is a real organisation, such as a magazine, a government, or a recognised person, everything it says
must be true. This leads to hallucinations in which false facts are concluded to be true. A notable example
was a case involving WikiLeaks, a news item supposedly published by the Washington Examiner.
The model erroneously claimed the news was true because it mentions Clinton, Podesta, and the source,
the Washington Examiner. In contrast, the complete pipeline of RAG, embeddings, and KG neighbours
prioritised the entities mentioned in the case, such as Clinton, Podesta, and Doug Band, since they were
found in both retrievals.</p>
      <p>News with high semantic content but few entities (people, places, or things) tends to perform worse,
sometimes causing hallucinations due to a lack of context, particularly with REBEL. This issue is significantly
reduced with Stanford, owing to the higher granularity of the extracted triplets. However, this comes at a
higher cost when generating embeddings, as Stanford yields roughly four times as many entities as REBEL.</p>
      <p>Future work will validate the proposed approach using evaluation metrics such as F1-score, accuracy,
and area under the ROC curve (AUC-ROC). These metrics will comprehensively assess the model’s
performance and its ability to accurately differentiate between classes. Additionally, hyperparameter tuning
and the inclusion of additional data will be explored to further optimise the model’s effectiveness.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This work was partially supported by the ADAPT Centre for Digital Content Technology under the
Research Ireland Research Centres Programme (Grant 13/RC/2106_P2).</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>The authors have not employed any Generative AI tools in creating this paper.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name><given-names>G. M.</given-names> <surname>Nieves-Cuervo</surname></string-name>,
          <string-name><given-names>E. F.</given-names> <surname>Manrique-Hernández</surname></string-name>,
          <string-name><given-names>A. F.</given-names> <surname>Robledo-Colonia</surname></string-name>,
          <string-name><given-names>E. K. A.</given-names> <surname>Grillo</surname></string-name>,
          <article-title>Infodemia: noticias falsas y tendencias de mortalidad por COVID-19 en seis países de América Latina</article-title>,
          <source>Revista Panamericana de Salud Pública</source>
          <volume>45</volume> (<year>2021</year>) <fpage>1</fpage>.
          doi:<pub-id pub-id-type="doi">10.26633/RPSP.2021.44</pub-id>.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name><given-names>H.</given-names> <surname>Allcott</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Gentzkow</surname></string-name>,
          <article-title>Social Media and Fake News in the 2016 Election</article-title>,
          <source>Journal of Economic Perspectives</source>
          <volume>31</volume> (<year>2017</year>) <fpage>211</fpage>-<lpage>236</lpage>.
          doi:<pub-id pub-id-type="doi">10.1257/jep.31.2.211</pub-id>.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name><given-names>N.</given-names> <surname>Dziri</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Milton</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Yu</surname></string-name>,
          <string-name><given-names>O.</given-names> <surname>Zaiane</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Reddy</surname></string-name>,
          <article-title>On the origin of hallucinations in conversational models: Is it the datasets or the models?</article-title>,
          in: <source>Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>,
          Association for Computational Linguistics, Seattle, United States, <year>2022</year>, pp. <fpage>5271</fpage>-<lpage>5285</lpage>.
          doi:<pub-id pub-id-type="doi">10.18653/v1/2022.naacl-main.387</pub-id>.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name><given-names>P. K.</given-names> <surname>Verma</surname></string-name>,
          <string-name><given-names>P.</given-names> <surname>Agrawal</surname></string-name>,
          <string-name><given-names>I.</given-names> <surname>Amorim</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Prodan</surname></string-name>,
          <article-title>WELFake: Word Embedding Over Linguistic Features for Fake News Detection</article-title>,
          <source>IEEE Transactions on Computational Social Systems</source>
          <volume>8</volume> (<year>2021</year>) <fpage>881</fpage>-<lpage>893</lpage>.
          doi:<pub-id pub-id-type="doi">10.1109/TCSS.2021.3068519</pub-id>.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name><given-names>Y.</given-names> <surname>Wang</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Wang</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Yang</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Zhang</surname></string-name>,
          <article-title>Detecting fake news by enhanced text representation with multi-EDU-structure awareness</article-title>,
          <source>Expert Systems with Applications</source>
          <volume>206</volume> (<year>2022</year>) <fpage>117781</fpage>.
          doi:<pub-id pub-id-type="doi">10.1016/j.eswa.2022.117781</pub-id>. arXiv:2205.15139.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>