<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>CrossFactual: A Novel Approach for Detecting Factual Inaccuracies in Machine-Generated Summaries</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Aniket Deroy</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Subhankar Maity</string-name>
          <email>subhankar.ai@kgpian.iitkgp.ac.in</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>IIT Kharagpur</institution>
          ,
          <addr-line>Kharagpur</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Detecting factual inaccuracy in machine-generated summaries is a novel and challenging task. Participants are tasked with identifying factual errors in summaries produced from English source documents, which are provided in Hindi and Gujarati. The training set includes English source documents along with summaries in English, Hindi, and Gujarati, enabling participants to familiarize themselves with error detection across languages. The test set consists solely of the English source document paired with summaries in Hindi and Gujarati. We focus on categorizing each data point based on the presence of factual inaccuracies, exploring four distinct types of factual errors. This study aims to enhance understanding of cross-lingual summary accuracy and contribute to improved evaluation frameworks in multilingual contexts. We use GPT-3.5 Turbo via prompting combined with several algorithmic approaches to detect factual inaccuracies in the machine-generated summaries across both languages. This paper presents a comparative analysis of factual inaccuracy detection models in Gujarati and Hindi, focusing on their performance across multiple experimental runs. The study reveals that Run 5 is the most effective model for both languages, achieving an F1 score of 0.0677, while other runs exhibit significantly lower scores, particularly Run 4. Notably, the ensemble approach demonstrates the highest performance results. Despite these advancements, the overall scores indicate ongoing challenges in creating robust models for detecting factual inaccuracies in Gujarati and Hindi. The findings emphasize the need for continued research and refinement to enhance the effectiveness of detection systems in these linguistic contexts.</p>
      </abstract>
      <kwd-group>
        <kwd>GPT</kwd>
        <kwd>Factual Inaccuracies</kwd>
        <kwd>Prompt Engineering</kwd>
        <kwd>Hindi</kwd>
        <kwd>Gujarati</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Detecting factual inaccuracy in machine-generated summaries presents a novel and challenging task,
particularly in a multilingual context [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. As automated summarization technologies advance, ensuring
the reliability of generated content becomes increasingly critical, especially when the output is intended
for diverse language speakers [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. This study focuses on identifying factual errors in summaries
produced from English source documents, specifically targeting Hindi and Gujarati languages.
      </p>
      <p>Participants are engaged in a rigorous evaluation process where they must identify inaccuracies within
these summaries. To facilitate this, the training set comprises English source documents along with their
corresponding summaries in English, Hindi, and Gujarati, allowing participants to develop a nuanced
understanding of factual error detection across languages. The test set narrows this focus, providing
only the English source document alongside summaries in Hindi and Gujarati, which encourages
participants to apply their learned skills in a practical setting.</p>
      <p>
        We emphasize the categorization of each data point based on the presence of factual inaccuracies,
exploring four distinct types of factual errors. By examining these variations, we aim to provide insights
into the nature of inaccuracies that can arise in machine-generated summaries. Additionally, we leverage
the capabilities of GPT-3.5 Turbo [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], employing prompting techniques in conjunction with various
algorithmic approaches to enhance the detection of factual discrepancies across both languages. This
study ultimately aims to deepen our understanding of cross-lingual summary accuracy and contribute
to the development of robust evaluation frameworks in multilingual contexts.
      </p>
      <p>This paper offers a comparative analysis of models for detecting factual inaccuracies in Gujarati
and Hindi, examining their performance across various experimental runs. The results indicate that
Run 5 is the most effective model for both languages, achieving an F1 score of 0.0677, while other runs,
particularly Run 4, show significantly lower scores. The ensemble approach stands out with the highest
performance metrics. However, despite these improvements, the overall results highlight persistent
challenges in developing robust detection models for factual inaccuracies in both languages. These
findings underscore the necessity for ongoing research and refinement to improve the effectiveness of
detection systems within these linguistic contexts.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        The task of detecting factual inaccuracies in machine-generated summaries has gained significant
attention in recent years, driven by advancements in natural language processing (NLP) and the
increasing reliance on automated summarization tools [
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ]. A considerable body of work has focused
on evaluating the quality of machine-generated text, particularly in terms of factual correctness and
coherence [
        <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
        ].
      </p>
      <p>
        One of the early approaches in this domain involved manual evaluation of summaries, where human
annotators assessed the fidelity of the content against the source material [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Studies by [
        <xref ref-type="bibr" rid="ref10 ref9">9, 10</xref>
        ]
highlighted the importance of ensuring that summaries accurately represent the source, laying the
groundwork for subsequent automated methods.
      </p>
      <p>
        With the advent of deep learning, researchers began exploring automatic evaluation metrics for
summarization. The introduction of ROUGE (Recall-Oriented Understudy for Gisting Evaluation) by
[
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] provided a quantitative method for assessing summary quality, although it primarily focuses on
lexical similarity rather than factual correctness. To address this gap, recent studies have proposed
metrics that consider factual consistency, such as FactCC and QAGS, which evaluate whether the
generated summary maintains the truthfulness of the original content [
        <xref ref-type="bibr" rid="ref12 ref13">12, 13</xref>
        ].
      </p>
      <p>
        In the realm of cross-lingual summarization, researchers like [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] and [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] have explored methods for
generating and evaluating summaries across different languages. These studies emphasize the
importance of understanding linguistic nuances and maintaining factual integrity when summarizing content
in languages with distinct grammatical and syntactic structures. Furthermore, work by [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] highlights
the challenges in cross-lingual settings, particularly when dealing with low-resource languages, and
emphasizes the need for tailored evaluation frameworks.
      </p>
      <p>
        Our approach builds upon this foundational work, particularly by incorporating both algorithmic
and human-driven methods for detecting factual inaccuracies in multilingual contexts. Leveraging
the capabilities of GPT-3.5 Turbo allows us to explore advanced prompting techniques that enhance
accuracy detection, aligning with trends in using transformer-based models for nuanced language
understanding [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. This study aims to bridge the gap between existing methodologies and the specific
challenges of cross-lingual factual accuracy, contributing valuable insights to the ongoing discourse in
this evolving field.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Dataset</title>
      <p>The test set contains 200 (article, summary) pairs in Gujarati and 200 (article, summary) pairs in
Hindi.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Task Definition</title>
      <p>
        The task [
        <xref ref-type="bibr" rid="ref18 ref19 ref20 ref21 ref22">18, 19, 20, 21, 22, 23, 24</xref>
        ] is, given Gujarati and Hindi summaries, to classify each summary into one of five categories,
namely Misrepresentation, False Attribution, Incorrect Quantities, Fabrication, and Correct.
      </p>
    </sec>
    <sec id="sec-5">
      <title>5. Methodology</title>
      <p>Prompting [25, 26] is a powerful technique that leverages large language models (LLMs) like GPT-3.5
Turbo to generate contextually relevant and accurate responses based on specific inputs. Here are
several reasons why prompting is beneficial, particularly in the context of detecting factual inaccuracies
in machine-generated summaries:
• Flexibility and Adaptability: Prompting allows researchers to customize the input to the model,
guiding it to focus on specific tasks such as factual accuracy detection [27]. This adaptability
enables a tailored approach that can be adjusted based on the nuances of the task or the languages
involved.
• Enhanced Contextual Understanding: LLMs excel at understanding context due to their
training on vast amounts of text [28]. By crafting well-designed prompts, we can help the
model better grasp the relationships between the source document and the generated summary,
facilitating more accurate assessments of factual correctness.
• Efficiency in Error Detection: Prompting can streamline the process of identifying factual
inaccuracies by generating direct queries related to specific claims or statements in the summaries
[29]. This efficiency reduces the need for extensive manual evaluation and allows for rapid
analysis of multiple summaries.
• Leveraging Knowledge: LLMs possess a wealth of general knowledge and can often identify
inaccuracies based on their understanding of facts and relationships [30]. By employing prompting,
we can harness this knowledge to flag discrepancies in the summaries, even when they are not
explicitly stated in the source material.
• Multilingual Capabilities: Given the cross-lingual nature of this study, prompting can be
particularly advantageous in handling different languages [31]. The model’s ability to process
and generate text in multiple languages enhances its utility in evaluating summaries produced in
Hindi and Gujarati from English sources.
• Combining with Algorithmic Approaches: Prompting can complement traditional algorithmic
methods, creating a hybrid approach that combines the strengths of both [26]. This synergy can
lead to more robust and comprehensive evaluations of factual accuracy.
• Facilitating User Interaction: Involving participants in the evaluation process through
prompting can lead to more engaging interactions, as users can pose questions or seek clarifications,
enhancing the overall assessment of factual accuracy [32].</p>
      <p>Overall, prompting serves as a versatile tool that enhances the capabilities of LLMs in detecting factual
inaccuracies, making it an integral part of our approach in this study.</p>
      <p>5.1. Prompt Engineering-Based Approach Combined with Algorithms</p>
      <p>• For the Misrepresentation class, the prompt is shown in Fig. 1.
• For the Incorrect_Quantities class, the prompt is depicted in Fig. 2.
• For the False_Attribution class, the prompt is given in Fig. 3.
• For the Fabrication class, the prompt is illustrated in Fig. 4.
We use the GPT-3.5 Turbo model at different temperature values via zero-shot prompting.
Next, we discuss four algorithms, namely Algorithm 1, Algorithm 2, Algorithm 3, and Algorithm 4.</p>
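      <p>As a concrete illustration, the per-class zero-shot prompts can be assembled programmatically. The sketch below is ours and only illustrative: the function name and the prompt wording are assumptions, not the exact prompts shown in Figs. 1-4.</p>
      <preformat>
```python
def build_prompt(article: str, summary: str, error_class: str) -> str:
    """Assemble a zero-shot yes/no prompt for one factual-error class.

    Illustrative template only; the exact prompts used in our runs are
    shown in Figs. 1-4.
    """
    return (
        f"You are given an English article and a machine-generated summary "
        f"in Hindi or Gujarati.\n\n"
        f"Article: {article}\n\n"
        f"Summary: {summary}\n\n"
        f"Does the summary contain a factual error of type '{error_class}'? "
        f"Answer with exactly one word: Yes or No."
    )

# The resulting string would then be sent to GPT-3.5 Turbo (e.g. via the
# chat-completions endpoint) at the chosen temperature.
prompt = build_prompt("PM inaugurates new rail line ...", "...", "Misrepresentation")
```
      </preformat>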
      <p>The fifth approach is an ensemble approach in which we run every algorithm (Algorithms 1-4) at three
different temperature values: 0.7, 0.8, and 0.9. We then take an ensemble of all the runs via majority
voting, in which the label that occurs the maximum number of times for a datapoint is selected. We
perform the same procedure for all the datapoints.</p>
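      <p>The majority vote described above can be sketched as follows. This is a minimal sketch; in particular, tie-breaking by first occurrence is an assumption the paper does not specify.</p>
      <preformat>
```python
from collections import Counter

def majority_vote(labels):
    """Return the label that occurs the maximum number of times.

    `labels` holds the predictions for one datapoint from all runs
    (Algorithms 1-4 at temperatures 0.7, 0.8, and 0.9). Counter.most_common
    breaks ties by insertion order, an assumption not fixed by the paper.
    """
    return Counter(labels).most_common(1)[0][0]

# One datapoint, twelve runs (4 algorithms x 3 temperatures):
votes = ["Correct", "Fabrication", "Correct", "Correct",
         "Misrepresentation", "Correct", "Correct", "Fabrication",
         "Correct", "Correct", "Misrepresentation", "Correct"]
label = majority_vote(votes)  # "Correct"
```
      </preformat>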
      <sec id="sec-5-1">
        <title>Algorithm 1:</title>
        <p>1. Input: A pair consisting of an article and its corresponding incorrect summaries (in Hindi or Gujarati).</p>
        <p>2. Step 1:
• Prompt the Large Language Model (LLM) with the pair (article and incorrect summaries) to
determine if it belongs to the Misrepresentation class.
• If the predicted label is Misrepresentation:
– Output: Misrepresentation
– End the algorithm for this datapoint.
• If the pair does not belong to the Misrepresentation class, prompt the LLM to check if it belongs
to the False Attribution class.
• If the predicted label is False Attribution:
– Output: False Attribution
– End the algorithm for this datapoint.
• If the pair does not belong to the False Attribution class, prompt the LLM to check if it
belongs to the Incorrect Quantities class.
• If the predicted label is Incorrect Quantities:
– Output: Incorrect Quantities
– End the algorithm for this datapoint.
• If the pair does not belong to any of the above classes, classify it as Correct.
3. Repeat this procedure for every datapoint in the dataset.</p>
      </sec>
      <sec id="sec-5-2">
        <title>Algorithm 2:</title>
        <p>1. Input: A pair consisting of an article and its corresponding incorrect summaries (in Hindi or Gujarati).</p>
        <p>2. Step 1:
• Prompt the Large Language Model (LLM) with the pair (article and incorrect summaries) to
determine if it belongs to the Fabrication class.
• If the predicted label is Fabrication:
– Output: Fabrication
– End the algorithm for this datapoint.
• If the pair does not belong to the Fabrication class, prompt the LLM to check if it belongs
to the Misrepresentation class.
• If the predicted label is Misrepresentation:
– Output: Misrepresentation
– End the algorithm for this datapoint.
• If the pair does not belong to the Misrepresentation class, prompt the LLM to check if it
belongs to the False Attribution class.
• If the predicted label is False Attribution:
– Output: False Attribution
– End the algorithm for this datapoint.
• If the pair does not belong to the False Attribution class, prompt the LLM to check if it
belongs to the Incorrect Quantities class.</p>
        <p>• If the predicted label is Incorrect Quantities:
– Output: Incorrect Quantities
– End the algorithm for this datapoint.
• If the pair does not belong to any of the above classes, classify it as Correct.
3. Repeat this procedure for every datapoint in the dataset.</p>
      </sec>
      <sec id="sec-5-2a">
        <title>Algorithm 3:</title>
        <p>1. Input: A pair consisting of an article and its corresponding incorrect summaries (in Gujarati or Hindi).</p>
        <p>2. Step 1:
• Prompt the Large Language Model (LLM) with the pair (article and incorrect summaries) to
determine if it belongs to the False_Attribution class.
• If the predicted label is False_Attribution:
– Output: False_Attribution
– End the algorithm for this datapoint.
• If the pair does not belong to the False_Attribution class, prompt the LLM to check if it
belongs to the Misrepresentation class.
• If the predicted label is Misrepresentation:
– Output: Misrepresentation
– End the algorithm for this datapoint.
• If the pair does not belong to the Misrepresentation class, prompt the LLM to check if it
belongs to the Fabrication class.
• If the predicted label is Fabrication:
– Output: Fabrication
– End the algorithm for this datapoint.
• If the pair does not belong to the Fabrication class, prompt the LLM to check if it belongs
to the Incorrect Quantities class.
• If the predicted label is Incorrect Quantities:
– Output: Incorrect Quantities
– End the algorithm for this datapoint.
• If the pair does not belong to any of the above classes, classify it as Correct.
3. Repeat this procedure for every datapoint in the dataset.</p>
      </sec>
      <sec id="sec-5-3">
        <title>Algorithm 4:</title>
        <p>1. Input: A pair consisting of an article and its corresponding incorrect summaries (in Gujarati or Hindi).</p>
        <p>2. Step 1:
• Prompt the Large Language Model (LLM) with the pair (article and incorrect summaries) to
determine if it belongs to the Incorrect_Quantities class.
• If the predicted label is Incorrect_Quantities:
– Output: Incorrect_Quantities
– End the algorithm for this datapoint.
• If the pair does not belong to the above class, classify it as Correct.
3. Repeat this procedure for every datapoint in the dataset.</p>
      </sec>
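      <p>All four algorithms share the same sequential structure and differ only in which classes are queried and in what order. Assuming a hypothetical helper ask_llm(article, summary, cls) that wraps a single GPT-3.5 Turbo yes/no prompt, the cascade can be sketched as:</p>
      <preformat>
```python
def classify(article, summary, class_order, ask_llm):
    """Query the LLM once per class, in order; the first positive answer wins.

    `ask_llm` is a hypothetical helper that returns True when the LLM judges
    the summary to contain that error type. If no class matches, the summary
    is labelled Correct, as in Algorithms 3 and 4.
    """
    for cls in class_order:
        if ask_llm(article, summary, cls):
            return cls
    return "Correct"

# Algorithm 1's order; Algorithms 2-4 permute or shorten this list.
ALGO1 = ["Misrepresentation", "False Attribution", "Incorrect Quantities"]

# A stub in place of the real GPT-3.5 Turbo call, for illustration only:
stub = lambda article, summary, cls: cls == "False Attribution"
result = classify("article text", "summary text", ALGO1, stub)  # "False Attribution"
```
      </preformat>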
    </sec>
    <sec id="sec-6">
      <title>6. Results</title>
      <p>Table 1 shows the results of factual inaccuracy detection in Gujarati. The results from the factual
inaccuracy detection task in Gujarati reveal varying performance across five experimental runs, measured
by F1 scores and their respective ranks. The F1 score is a key indicator of model accuracy, taking into
account both precision and recall; higher scores denote better performance.</p>
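      <p>For reference, the F1 score is the harmonic mean of precision and recall; a minimal computation sketch (the counts below are made up for illustration, not taken from Table 1):</p>
      <preformat>
```python
def f1_score(tp, fp, fn):
    """F1 for one class: harmonic mean of precision and recall."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Illustrative counts only (not from Table 1): low precision and recall
# produce F1 values in the same regime as those reported here.
score = round(f1_score(10, 120, 140), 4)  # 0.0714
```
      </preformat>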
      <p>Among the runs, Run 5 stands out with the highest F1 score of 0.0677, ranking 13th. This
indicates that it was the most effective in identifying factual inaccuracies compared to the others. In
contrast, Run 1 achieved a score of 0.0365 and ranked 15th, showing slightly better performance than
Runs 2 and 3 but still significantly trailing behind Run 5.</p>
      <p>Run 2 followed closely with an F1 score of 0.0364, ranking 16th, while Run 3 recorded a score of 0.0357
and ranked 18th, indicating a further decline in performance. Lastly, Run 4 had the lowest score at
0.0344, resulting in a rank of 19th, marking it as the least effective among all runs.</p>
      <p>Overall, the results highlight that while there are minor differences in performance, none of the
runs, except for Run 5, achieved satisfactory scores, which indicates ongoing challenges in developing
effective models for detecting factual inaccuracies in Gujarati text.</p>
      <p>Table 2 shows the results of factual inaccuracy detection in Hindi. The latest results from the factual
inaccuracy detection task in Hindi reflect varying performance across five experimental runs, measured
by their F1 scores and ranks.</p>
      <p>Run 5 continues to lead with the highest F1 score of 0.0677, securing a rank of 16th, indicating it
remains the most effective at detecting factual inaccuracies. Run 1 follows with a score of 0.0653, ranked
17th, showing a relatively strong performance and an improvement compared to its previous iteration.</p>
      <p>In contrast, Run 2 recorded an F1 score of 0.0364 and is ranked 18th, representing a modest performance
that is slightly better than Runs 3 and 4. Run 3 achieved a score of 0.0357, ranking 19th, indicating a
minor decline in effectiveness compared to Run 2. Lastly, Run 4 had the lowest score of 0.0344 and is
ranked 21st, marking it as the least effective among the runs.</p>
      <p>Overall, these results suggest that while Run 5 maintains its position as the top performer, Run 1
has shown some improvement. However, the other runs struggle with lower scores, highlighting the
ongoing challenges in effectively detecting factual inaccuracies in Hindi text.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion</title>
      <p>In conclusion, the comparative analysis of factual inaccuracy detection in both Gujarati and Hindi
demonstrates distinct performance trends among the experimental runs. In Gujarati, Run 5 emerges as
the most effective model, achieving an F1 score of 0.0677, while the other runs exhibit significantly lower
performance levels, with Run 4 lagging the furthest behind. Similarly, in Hindi, Run 5 retains its lead
with a score of 0.0677, but Run 1 also shows notable improvement. For both Hindi and Gujarati, the
ensemble approach shows the highest results. Despite these advancements, the overall scores across
the runs indicate persistent challenges in developing robust models for detecting factual inaccuracies in
both languages. The results underscore the need for further research and refinement to enhance the
effectiveness of such detection systems in the future.</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used ChatGPT in order to: Drafting content, Grammar
and spelling check, etc. After using this tool/service, the author(s) reviewed and edited the content as
needed and take(s) full responsibility for the publication’s content.</p>
      <p>[23] S. Satapara, P. Mehta, S. Modha, A. Hegde, S. HL, D. Ganguly, Key insights from the third ILSUM
track at FIRE 2024, in: Proceedings of the 16th Annual Meeting of the Forum for Information
Retrieval Evaluation, FIRE 2024, Gandhinagar, India, December 12-15, 2024, ACM, 2024.
[24] S. Satapara, P. Mehta, D. Ganguly, S. Modha, Fighting fire with fire: Adversarial prompting to
generate a misinformation detection dataset, CoRR abs/2401.04481 (2024). URL: https://doi.org/10.48550/arXiv.2401.04481. doi:10.48550/ARXIV.2401.04481. arXiv:2401.04481.
[25] L. Wang, X. Chen, X. Deng, H. Wen, M. You, W. Liu, Q. Li, J. Li, Prompt engineering in consistency
and reliability with the evidence-based guideline for LLMs, npj Digital Medicine 7 (2024) 41.
[26] P. Liu, W. Yuan, J. Fu, Z. Jiang, H. Hayashi, G. Neubig, Pre-train, prompt, and predict: A systematic
survey of prompting methods in natural language processing, ACM Computing Surveys 55 (2023)
1–35.
[27] X. Amatriain, Prompt design and engineering: Introduction and advanced methods, arXiv preprint
arXiv:2401.14423 (2024).
[28] P. Srivastava, M. Malik, V. Gupta, T. Ganu, D. Roth, Evaluating LLMs’ mathematical reasoning
in financial document question answering, in: Findings of the Association for Computational
Linguistics ACL 2024, 2024, pp. 3853–3878.
[29] L. Henrickson, A. Meroño-Peñuela, Prompting meaning: a hermeneutic approach to optimising
prompt engineering with ChatGPT, AI &amp; SOCIETY (2023) 1–16.
[30] J. Yang, H. Jin, R. Tang, X. Han, Q. Feng, H. Jiang, S. Zhong, B. Yin, X. Hu, Harnessing the power
of LLMs in practice: A survey on ChatGPT and beyond, ACM Transactions on Knowledge Discovery
from Data 18 (2024) 1–32.
[31] L. Huang, S. Ma, D. Zhang, F. Wei, H. Wang, Zero-shot cross-lingual transfer of prompt-based
tuning with a unified multilingual prompt, arXiv preprint arXiv:2202.11451 (2022).
[32] G. Xun, S. M. Land, A conceptual framework for scaffolding ill-structured problem-solving
processes using question prompts and peer interactions, Educational Technology Research and
Development 52 (2004) 5–22.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Cui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Bi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>Mage: Machine-generated text detection in the wild</article-title>
          ,
          <source>in: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>36</fpage>
          -
          <lpage>53</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Mansurov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Ivanov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Shelmanov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Tsvigun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O. M.</given-names>
            <surname>Afzal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mahmoud</surname>
          </string-name>
          , G. Puccetti,
          <string-name>
            <given-names>T.</given-names>
            <surname>Arnold</surname>
          </string-name>
          , et al.,
          <article-title>Semeval-2024 task 8: Multidomain, multimodel and multilingual machine-generated text detection</article-title>
          ,
          <source>arXiv preprint arXiv:2404.14183</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>U.</given-names>
            <surname>Hahn</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Mani</surname>
          </string-name>
          ,
          <article-title>The challenges of automatic summarization</article-title>
          ,
          <source>Computer</source>
          <volume>33</volume>
          (
          <year>2000</year>
          )
          <fpage>29</fpage>
          -
          <lpage>36</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>T. B. Brown</surname>
          </string-name>
          ,
          <article-title>Language models are few-shot learners</article-title>
          ,
          <source>arXiv preprint arXiv:2005.14165</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M. F.</given-names>
            <surname>Mridha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Lima</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Nur</surname>
          </string-name>
          , S.
          <string-name>
            <surname>C. Das</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Hasan</surname>
          </string-name>
          ,
          <string-name>
            <surname>M. M. Kabir</surname>
          </string-name>
          ,
          <article-title>A survey of automatic text summarization: Progress, process and challenges</article-title>
          ,
          <source>IEEE Access 9</source>
          (
          <year>2021</year>
          )
          <fpage>156043</fpage>
          -
          <lpage>156070</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>A. Das</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          <string-name>
            <surname>Kovatchev</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Lease</surname>
          </string-name>
          ,
          <article-title>The state of human-centered nlp technology for factchecking</article-title>
          ,
          <source>Information processing &amp; management 60</source>
          (
          <year>2023</year>
          )
          <fpage>103219</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>K.</given-names>
            <surname>Muthiah</surname>
          </string-name>
          ,
          <article-title>Automatic Coherent and Concise Text Summarization using Natural Language Processing</article-title>
          ,
          <source>Ph.D. thesis</source>
          , Dublin, National College of Ireland,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>C. van der</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gatt</surname>
          </string-name>
          , E. van
          <string-name>
            <surname>Miltenburg</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          <string-name>
            <surname>Krahmer</surname>
          </string-name>
          ,
          <article-title>Human evaluation of automatically generated text: Current trends and best practice guidelines</article-title>
          ,
          <source>Computer Speech &amp; Language</source>
          <volume>67</volume>
          (
          <year>2021</year>
          )
          <fpage>101151</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>E. H.</given-names>
            <surname>Hovy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Fukumoto</surname>
          </string-name>
          ,
          <article-title>Automated summarization evaluation with basic elements</article-title>
          , in:
          <source>LREC</source>
          , volume
          <volume>6</volume>
          ,
          <year>2006</year>
          , pp.
          <fpage>604</fpage>
          -
          <lpage>611</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>K. S.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <article-title>Automatic summarising: factors and directions</article-title>
          , arXiv preprint cmp-lg/9805011 (
          <year>1998</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>C.-Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <article-title>ROUGE: A package for automatic evaluation of summaries</article-title>
          , in: Text summarization branches out,
          <year>2004</year>
          , pp.
          <fpage>74</fpage>
          -
          <lpage>81</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>W.</given-names>
            <surname>Kryściński</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>McCann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Socher</surname>
          </string-name>
          ,
          <article-title>Evaluating the factual consistency of abstractive text summarization</article-title>
          , arXiv preprint arXiv:1910.12840 (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>A.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Cho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <article-title>Asking and answering questions to evaluate the factual consistency of summaries</article-title>
          ,
          <year>2020</year>
          . URL: https://arxiv.org/abs/2004.04228. arXiv:2004.04228.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Qin</surname>
          </string-name>
          ,
          <article-title>The factual inconsistency problem in abstractive text summarization: A survey</article-title>
          ,
          <year>2023</year>
          . URL: https://arxiv.org/abs/2104.14839. arXiv:2104.14839.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>R.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ouni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Eger</surname>
          </string-name>
          ,
          <article-title>Cross-lingual cross-temporal summarization: Dataset, models, evaluation</article-title>
          ,
          <source>Computational Linguistics</source>
          (
          <year>2024</year>
          )
          <fpage>1</fpage>
          -
          <lpage>44</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zhan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. F.</given-names>
            <surname>Wong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. S.</given-names>
            <surname>Chao</surname>
          </string-name>
          ,
          <article-title>A survey on LLM-generated text detection: Necessity, methods, and future directions</article-title>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2310.14724. arXiv:2310.14724.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          ,
          <article-title>BERT: Pre-training of deep bidirectional transformers for language understanding</article-title>
          ,
          <year>2019</year>
          . URL: https://arxiv.org/abs/1810.04805. arXiv:1810.04805.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>S.</given-names>
            <surname>Satapara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Modha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Modha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mehta</surname>
          </string-name>
          ,
          <article-title>Findings of the first shared task on Indian language summarization (ILSUM): approaches, challenges and the path ahead</article-title>
          , in: K. Ghosh,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mandl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Majumder</surname>
          </string-name>
          , M. Mitra (Eds.),
          <source>Working Notes of FIRE 2022 - Forum for Information Retrieval Evaluation, Kolkata, India, December 9-13, 2022</source>
          , volume
          <volume>3395</volume>
          of CEUR Workshop Proceedings, CEUR-WS.org,
          <year>2022</year>
          , pp.
          <fpage>369</fpage>
          -
          <lpage>382</lpage>
          . URL: https://ceur-ws.org/Vol-3395/T6-1.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>S.</given-names>
            <surname>Satapara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Modha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Modha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mehta</surname>
          </string-name>
          ,
          <article-title>FIRE 2022 ILSUM track: Indian language summarization</article-title>
          , in:
          <string-name>
            <given-names>D.</given-names>
            <surname>Ganguly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gangopadhyay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mitra</surname>
          </string-name>
          , P. Majumder (Eds.),
          <source>Proceedings of the 14th Annual Meeting of the Forum for Information Retrieval Evaluation</source>
          , FIRE 2022, Kolkata, India, December 9-13,
          <year>2022</year>
          , ACM, 2022, pp.
          <fpage>8</fpage>
          -
          <lpage>11</lpage>
          . URL: https://doi.org/10.1145/3574318.3574328. doi:10.1145/3574318.3574328.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>S.</given-names>
            <surname>Satapara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mehta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Modha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ganguly</surname>
          </string-name>
          ,
          <article-title>Key takeaways from the second shared task on indian language summarization (ILSUM 2023)</article-title>
          , in: K. Ghosh,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mandl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Majumder</surname>
          </string-name>
          , M. Mitra (Eds.),
          <source>Working Notes of FIRE 2023 - Forum for Information Retrieval Evaluation (FIRE-WN 2023), Goa, India, December 15-18, 2023</source>
          , volume
          <volume>3681</volume>
          of CEUR Workshop Proceedings, CEUR-WS.org,
          <year>2023</year>
          , pp.
          <fpage>724</fpage>
          -
          <lpage>733</lpage>
          . URL: https://ceur-ws.org/Vol-3681/T8-1.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>S.</given-names>
            <surname>Satapara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mehta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Modha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ganguly</surname>
          </string-name>
          ,
          <article-title>Indian language summarization at FIRE 2023</article-title>
          , in:
          <string-name>
            <given-names>D.</given-names>
            <surname>Ganguly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Majumdar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mitra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gangopadhyay</surname>
          </string-name>
          , P. Majumder (Eds.),
          <source>Proceedings of the 15th Annual Meeting of the Forum for Information Retrieval Evaluation</source>
          , FIRE 2023, Panjim, India, December 15-18,
          <year>2023</year>
          , ACM, 2023, pp.
          <fpage>27</fpage>
          -
          <lpage>29</lpage>
          . URL: https://doi.org/10.1145/3632754.3634662. doi:10.1145/3632754.3634662.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>S.</given-names>
            <surname>Satapara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mehta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Modha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hegde</surname>
          </string-name>
          ,
          <string-name>
            <surname>S. HL</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ganguly</surname>
          </string-name>
          ,
          <article-title>Overview of the third shared task on Indian language summarization (ILSUM 2024)</article-title>
          , in: K. Ghosh,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mandl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Majumder</surname>
          </string-name>
          , D. Ganguly (Eds.),
          <source>Working Notes of FIRE 2024 - Forum for Information Retrieval Evaluation, Gandhinagar, India, December 12-15, 2024</source>
          , CEUR Workshop Proceedings, CEUR-WS.org,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>