<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>An Approach to a Cost-Effective and Controllable Text Generation Architecture</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Iván Martínez-Murillo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Dept. of Software and Computing Systems, University of Alicante, Apdo. de Correos 99</institution>
          ,
          <addr-line>E-03080, Alicante</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <abstract>
        <p>Large Language Models (LLMs), which take advantage of the Transformer architecture, have obtained remarkable outcomes in the Generative Artificial Intelligence (AI) field. Specifically, these models have boosted the Natural Language Generation (NLG) field to new dimensions. Nonetheless, state-of-the-art NLG models also entail some challenges. Firstly, these models can generate text which may contain biased or hallucinated information, which can be used unethically. Secondly, these models sometimes fail at generating commonsense knowledge, a fundamental factor in human language. Thirdly, the expense of training LLMs is excessively high. Finally, most of the NLG research proposing more efficient architectures is focused on the English language. Thus, other languages, such as Spanish, still lack resources and models to address NLG. Given the challenges mentioned above, the main objective of this paper is to provide a detailed research line that aims to propose an efficient and effective text generation architecture that could generate high-quality text in Spanish.</p>
      </abstract>
      <kwd-group>
        <kwd>Natural Language Generation</kwd>
        <kwd>Hallucination</kwd>
        <kwd>Efficient architectures</kwd>
        <kwd>Spanish</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Generative Artificial Intelligence (AI) is a rapidly growing trend involving machine learning algorithms
to construct systems capable of generating new content. This trend has its origin in neural networks,
which were one of the earliest forms of generative AI [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. This rapid development of generative AI has
caused a surge of interest in AI tools across society.
      </p>
      <p>
        One of the important topics within the generative AI trend is the Natural Language Generation
(NLG) field. NLG is a sub-field of Natural Language Processing (NLP) that aims to generate natural
language to achieve a specific communicative goal [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The input to these systems can be linguistic
data such as text or voice, or non-linguistic data such as knowledge graphs or structured data. From
that input, currently available NLG tools can produce text similar to human-written texts. In light
of this performance, there is growing concern about detecting whether a text was generated by a human
or a machine. Shared tasks have been proposed to advance the state of the art in detecting
AI-generated content, such as the AuTexTification challenge [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] at the IberLEF 2023 or the Multidomain,
Multimodal and Multilingual Machine-Generated Text Detection task [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] at SemEval 2024.
      </p>
      <p>
        Large Language Models (LLMs) are the core of those NLG tools. These models, formed by millions
of parameters in their neural networks, are extremely expensive to train. Therefore, there is a need
to find more efficient architectures for text generation. Moreover, LLMs and NLG are not exempt from
mistakes, and one of the current major issues is the hallucination phenomenon, in which a generated text is
nonsensical or unfaithful to the provided source [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. This issue is present even in the most advanced LLMs
such as GPT-4 [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Figure 1 shows an example of hallucination in GPT-4o. When asked to generate the
six days of the week, GPT-4o fails to detect that the week does not consist of just six days and simply writes
the first six days.
      </p>
      <p>
        Another important issue is bias, which refers to misrepresentation errors that favour certain groups [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
Figure 1 shows an example of gender bias in GPT-4o. When we asked GPT-4o to generate a list of
five adjectives describing men and five describing women, it generated adjectives related to physical
attributes or ambition for men, whereas for women it generated adjectives related to self-care.
      </p>
      <p>
        Finally, these tools also lack logical reasoning or commonsense, a vital factor in human intelligence
[
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Figure 1 shows an output generated by ChatGPT. When asked to generate a sentence with three
concepts (lion, bicycle and helmet), ChatGPT outputs a sentence lacking commonsense.
      </p>
      <p>These limitations can be exploited in an unethical manner to generate misinformation.</p>
      <p>
        Furthermore, there is a notable disparity in NLG research: the majority of studies focus on
the English language [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], so there is a need to advance the field in less-represented languages.
      </p>
      <p>Because of this, the goal of this research proposal is to describe my thesis project, which focuses
on creating an efficient architecture that integrates external commonsense knowledge along with
controllability techniques to enhance the performance of a smaller model and mitigate the hallucination
problem.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Background and Related Work</title>
      <p>This section aims to contextualise this research project within the NLG state of the art.</p>
      <sec id="sec-2-1">
        <title>2.1. Natural Language Generation</title>
        <p>
          NLG is the sub-field of the Natural Language Processing (NLP) area that aims to produce meaningful
sentences to meet a communicative goal [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. NLG started to be studied in the 1970s [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], but it
was not until recent years that it became highly popular and advanced considerably. Considering their
task typology, NLG systems can be classified into three groups [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]: text abbreviation tasks, which aim to
condense information from long texts into short ones, such as text summarisation; text expansion tasks,
whose goal is to generate complete sentences from meaningful words, such as topic-to-essay; and, finally,
text rewriting and reasoning tasks, which aim to rewrite a text into another style or apply reasoning methods,
such as text simplification.
        </p>
        <p>
          In order to address those tasks, NLG systems have evolved through different architectures. Originally,
the NLG task consisted of a sequential scheme of three different stages (macroplanning, microplanning
and realisation). Macroplanning comprises the sub-tasks related to selecting what information
to include in the generated text. Microplanning receives the output of the macroplanning stage
as input and conducts different sub-tasks to decide how to include the selected information in the final
text. Finally, the realisation stage receives the generated plan from the previous steps and produces
a syntactically correct text. Modular architectures are the group of architectures that follow
this scheme, making a well-differentiated distinction between the distinct sub-tasks of each stage.
Reiter proposed the architecture that was considered the standard within this group [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ].
        </p>
        <p>As the NLG field started to mature, the distinction between sub-tasks became more flexible,
performing the generation in fewer steps. This group of architectures is named planning perspectives.
This scheme was similar to the modular architecture but required fewer sub-tasks in each stage.</p>
        <p>
          With the appearance of neural networks, the sub-task division disappeared, giving rise to global
approaches, which perform the generation in a single step. The most important milestone in this
group was the proposal of the Transformer architecture [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. This architecture introduced the concept of
self-attention, which considerably raised the performance of NLG systems. Models based on this architecture,
such as LLMs, can achieve high performance on NLG tasks and are the state of the art in the NLG field.
        </p>
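        <p>To illustrate the self-attention mechanism mentioned above, the following is a minimal pure-Python sketch of scaled dot-product self-attention. It is our own illustration, not code from the Transformer paper: for simplicity it uses identity projections, whereas a real Transformer learns separate query, key and value weight matrices.</p>
        <preformat>
```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def self_attention(X):
    """Scaled dot-product self-attention over a sequence of vectors.

    X is a list of token embeddings (lists of floats). The query, key and
    value projections are the identity here, purely for illustration.
    """
    d = len(X[0])
    out = []
    for q in X:                                   # one query per token
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in X]                     # similarity to every key
        weights = softmax(scores)                 # attention distribution
        # Output is the attention-weighted sum of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, X))
                    for j in range(d)])
    return out
```
        </preformat>
        <p>For instance, attending over two identical token vectors returns those same vectors, since the attention weights form a uniform distribution over identical values.</p>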
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Commonsense Knowledge</title>
        <p>
          Commonsense knowledge is an important factor in human communication, as it facilitates inference
without the explicit mention of context [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ].
        </p>
        <p>
          Originally, commonsense was incorporated into NLG systems through rules and ontologies.
Since the advent of neural networks, the focus has shifted to integrating commonsense into neural NLG
models through pre-trained models and commonsense graphs. However, there is still much work to
be done in this field to achieve complete commonsense reasoning and generation. Although current
state-of-the-art models exhibit some commonsense abilities, they are far from perfect. The integration
of commonsense knowledge into human language is a challenging task in the NLG field [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ], as there is
an urgent need to enhance the ability of NLG systems to generate texts containing that knowledge. Several
collaborative efforts have been proposed to advance the frontier of commonsense generation. In the
Avicenna task [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ], models are presented with two premises containing a syllogistic relation, and the
goal is to produce a conclusion that effectively completes the given relation. Other works study the
integration of commonsense in keyword-to-text tasks. For instance, the SituatedGen task [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ] requires
the generation of a pair of contrasting sentences based on a group of concepts that include temporal
or geographical entities. In the CommonGen [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ] and C2Gen [19] tasks, the challenge is to generate
a coherent sentence describing an everyday scenario given a set of words. Notably, the C2Gen task
additionally provides a contextual input to which the generated text must adhere.
        </p>
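        <p>The concept-to-text setting of CommonGen and C2Gen can be sketched as follows. This is our own illustration, not code from those benchmarks: a hypothetical helper checks whether a candidate sentence covers every input concept, using naive surface-form matching (the benchmarks' evaluation also accepts morphological variants, which this sketch ignores).</p>
        <preformat>
```python
def covers_concepts(sentence, concepts):
    """Naive check that a generated sentence mentions every input concept.

    Surface-form matching only: CommonGen-style evaluation normally accepts
    morphological variants ("ride" vs "riding"), which this sketch ignores.
    """
    tokens = {t.strip(".,;:!?\"'").lower() for t in sentence.split()}
    return all(c.lower() in tokens for c in concepts)

# A CommonGen-style instance: a concept set and two candidate sentences.
concepts = ["lion", "bicycle", "helmet"]
good = "A circus lion wearing a helmet rides a bicycle around the ring."
bad = "A lion sleeps in the shade."
```
        </preformat>
        <p>Here the first candidate covers all three concepts, while the second omits two of them, so it would fail the coverage constraint of the task.</p>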
      </sec>
      <sec id="sec-2-3">
        <title>2.3. NLG Evaluation Metrics</title>
        <p>As in any other NLP task, NLG systems should also be evaluated. The generated output is usually
compared against one or more reference texts through different evaluation metrics. Commonly, the
NLG evaluation metrics can be divided into three categories:
• Human evaluation metrics are the best and most reliable metrics [20]. Nonetheless, they also
present some problems: they are expensive and time-consuming because they require a great deal
of manual work, and replicability is hard to achieve due to the idiosyncrasies of every evaluator.
• Standard evaluation metrics are based on string or n-gram techniques that measure distributional
similarity via overlaps with a reference text. These metrics are not fully aligned with human
judgements, but even so, they are fast and computationally low-cost. Most of these metrics
evaluate the quality of a text by comparing it against a reference text based on features such as
words, characters or embeddings. Metrics measuring the word overlap between a candidate
sentence and a reference sentence are the most common; some examples are BLEU [21], ROUGE
[22], CIDEr [23], and SPICE [24]. Character overlap performs better in languages with high lexical
diversity; some metrics within this group are Extended Edit Distance [25] and chrF [26]. Finally,
embedding-based metrics tend to capture semantic similarity better; BERTScore [27] and Word
Mover's Distance [28] are examples of this group.
• Machine-learned evaluation metrics can generate a more reliable score given that, during training,
models learn how to generate and evaluate text in a more human-like way. BARTScore [29] and
GPTScore [30] are metrics that use BART or GPT models to evaluate the generated text.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Main Hypothesis and Objectives</title>
      <p>This PhD thesis hypothesises that integrating external commonsense knowledge into a smaller
architecture, in terms of parameters, compared to state-of-the-art LLMs will boost the quality of the
outputs when generating text. Those texts could be similar to the ones a human could write when
solving certain tasks. Moreover, external commonsense knowledge will help reduce the problem of
hallucinations. After reviewing the scientific literature, we have observed that some works have
attempted to incorporate external commonsense into NLG systems. However, although they improved
on the results of previous works, the performance was still not excellent. Moreover, most of the research
has been conducted on the English language. Therefore, the main objective of this research is to design
an efficient architecture that requires less computational expense than state-of-the-art LLMs.
This architecture will be capable of achieving human-like performance in generating text for different
NLG tasks, prioritising research for Spanish. Additionally, a key focus will be on minimising the
issue of hallucinations.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Methodology</title>
      <p>To fulfil the objectives of this PhD thesis, we have proposed a methodology consisting of four main
milestones, explained next: define an NLG pipeline; analyse, test and propose NLG
architectures; analyse and build a corpus in Spanish; and evaluate the proposed NLG architectures.
1. Define an NLG pipeline. The first step of my research was to propose a pipeline to follow. It
involves defining the task I want to address, collecting the corpus and external knowledge,
deciding how to integrate both in an NLG architecture and, finally, evaluating the results obtained.
Figure 2 shows the sequential steps of this pipeline. The first task this research project addresses
is commonsense generation through a concept-to-text task: given some words, the goal
is to generate a sentence that contains those words and has common sense. Afterwards, more
NLG tasks will be addressed.
2. Analyse, Test and Propose NLG Architectures. During the initial part of the research, we
explored different available NLG architectures in English and Spanish to find a suitable model to
address the NLG task based on the topics pursued. Nowadays, most of the architectures employed
are based on LLMs. Despite this, we will first test and compare traditional and neural architectures
to effectively decide which architecture performs better based on a set of criteria, such as the content
produced, efficiency, etc. Subsequently, we will integrate commonsense knowledge and
controllable generation techniques into the best-performing architecture obtained from our
experimentation to raise its performance, and compare the results against the state-of-the-art
models.
3. Analyse and Build a Corpus in Spanish. Datasets constructed for a specific task in Spanish are
hard to find, as most of the available corpora are in the English language. Moreover, in some cases,
these datasets may be biased [31]. Once we have proposed an efficient architecture, we will focus
on adapting that architecture to address different NLG tasks in Spanish, such as commonsense
generation, text simplification, or abstractive summarisation. Therefore, we will have to collect
and build specific corpora in Spanish to address these tasks. Moreover, as we are concerned
about the influence of the training data on the results of the NLG model, we will check and
preprocess, if necessary, those corpora so that they are unbiased and balanced.
4. Evaluate the Proposed NLG Architecture on different tasks. An important part of this research
is to measure the performance of our proposed model. Most NLG metrics tend to compare the
generated text against some reference texts. As NLG systems can generate texts in a large number
of different styles, these metrics may not be appropriate. To determine which metric is
the most appropriate for measuring the performance of each proposed task, an exploratory analysis
will be conducted and compared with a manual evaluation of those tasks.</p>
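      <p>The comparison between automatic metrics and manual evaluation in the last milestone is typically quantified with a rank correlation. The following sketch computes Spearman's correlation between hypothetical metric scores and human scores; it assumes no tied values (with ties, average ranks should be used instead, as scipy.stats.spearmanr does):</p>
      <preformat>
```python
def spearman(metric_scores, human_scores):
    """Spearman rank correlation between automatic and human scores.

    Minimal sketch assuming no tied scores; with ties, average ranks
    should be used instead.
    """
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        r = [0] * len(values)
        for rank, idx in enumerate(order, start=1):
            r[idx] = rank
        return r

    n = len(metric_scores)
    rx, ry = ranks(metric_scores), ranks(human_scores)
    # Classic formula, valid without ties: 1 - 6 * sum(d^2) / (n * (n^2 - 1)).
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))
```
      </preformat>
      <p>A correlation near 1 would indicate that a metric ranks system outputs the same way human evaluators do, which is the property the planned exploratory analysis would look for.</p>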
    </sec>
    <sec id="sec-5">
      <title>5. Experiments</title>
      <sec id="sec-5-1">
        <title>5.1. Conducted experiments</title>
        <p>Two experiments have been conducted: a preliminary experiment on English NLG architectures and
the problems of evaluating their results, and a similar experiment in Spanish.</p>
        <p>
          1. English NLG models evaluation: This experiment is detailed in [32]. We conducted an experiment
for the task of commonsense generation based on the CommonGen dataset [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ]. Based on a
set of concepts, the systems must generate a sentence with common sense that contains those
concepts. We experimented with a modular architecture and a global architecture, and evaluated
the results both manually and automatically. The manual evaluation showed that
global architectures performed better than the modular ones; however, none of them performed
acceptably. Furthermore, in the automatic evaluation, the employed metrics did
not reflect the difference in quality between both outputs, even though the global
architectures performed better overall on those metrics.
2. Spanish NLG preliminary experimentation: Following the ideas of the CommonGen dataset,
we wanted to extend the CommonGen task to other languages, in this case Spanish. To do so,
we have proposed a Spanish corpus called COCOTEROS [33]. We have collected Spanish text
from the Tatoeba dataset (available at https://tatoeba.org/es/) and formatted it to be similar to
CommonGen. With this Spanish corpus, we have fine-tuned a Spanish T5 model to generate
texts and evaluate its performance.
        </p>
        <p>We also performed a manual and automatic evaluation. The manual evaluation shows that the
produced sentences are not acceptable. Furthermore, in some cases, the generated sentences
do not contain all the words given as keywords. For the words "despedir", "aeropuerto", and
"amigo", the system generated the sentence "él se despidieron que vinieras al aeropuerto.". We also
evaluated the sentences automatically with the same metrics as in the English experiment. Table
1 shows the results obtained by this model on different NLG evaluation metrics, compared
with the results obtained by the English T5 model in the previous experiment. Although they
are two different datasets, we can see that the results achieved for Spanish are lower than the
results obtained in the English experimentation. Those results align with the manual evaluation,
which showed worse performance for the Spanish model than for the English model.</p>
        <p>[Table 1: SPICE scores for the T5-Small Spanish and T5-Small English models]</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Planned experiments</title>
        <p>Considering the obtained results, the next steps are experimentation with different architectures
that integrate external commonsense, and the search for the optimal way to evaluate the generated
outputs, which can be done manually or by finding other, more appropriate automatic metrics.
1. External commonsense integration: Figuring out how to obtain external knowledge for the
concept-to-text task and how to include it in the proposed model are two key points for incorporating
commonsense in NLG architectures. There are different options, such as ConceptNet, WordNet or
COMET, that could be interesting to explore. The research question is how to successfully extract
knowledge from them and incorporate it into an NLG system.
2. Output evaluation: As demonstrated in [32], the most common automatic evaluation metrics do
not align completely with human judgement, as they only measure the overlap with a
reference sentence. Thus, finding other methods to evaluate the generated text is important for
measuring the performance of the models. Some works already done to measure hallucinations
could be interesting to explore for evaluating outputs. Examples are the AlignScore metric [34] and
the Hughes Hallucination Evaluation Model [35].</p>
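        <p>As a sketch of how external knowledge could be retrieved from one of these resources, the following queries the public ConceptNet 5 Web API. The endpoint and JSON field names are assumptions based on the public api.conceptnet.io service, not part of this proposal; in practice responses would be cached and filtered by relation type before being fed to the generation model.</p>
        <preformat>
```python
import json
from urllib.request import urlopen

API = "http://api.conceptnet.io"

def concept_uri(term, lang="es"):
    # ConceptNet URIs use lowercase, underscore-separated terms.
    return "/c/{}/{}".format(lang, term.lower().replace(" ", "_"))

def related_edges(term, lang="es", limit=5):
    """Fetch edges for a concept from the public ConceptNet API (network call).

    Each edge links a start and an end concept through a labelled relation
    such as /r/UsedFor or /r/AtLocation.
    """
    url = "{}{}?limit={}".format(API, concept_uri(term, lang), limit)
    with urlopen(url) as resp:
        data = json.load(resp)
    return [(e["rel"]["label"], e["end"]["label"]) for e in data["edges"]]
```
        </preformat>
        <p>For Spanish concepts, building the URI with lang="es" (e.g. concept_uri("León") gives /c/es/león) would let the same pipeline query knowledge in the target language of this thesis.</p>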
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Research issues to discuss</title>
      <p>As explained, my research is currently focused on the task of concept-to-text generation: given some
concepts, a system must generate a sentence using those concepts. We have
compiled a Spanish corpus named COCOTEROS to accomplish this task. The next step is to incorporate
external commonsense into some pre-trained models to enhance the results obtained. Given that, the
research questions that arise are the following:
• Which is the most suitable external knowledge base to complement this task?
• In which part of the pipeline is it more efficient to embed the knowledge: during fine-tuning, during
inference, or in both?
• How can the hallucination issue be measured for this task in Spanish?
• If we use LLMs to address this task, will the results improve considerably?</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>This research work is part of the R&amp;D project "CORTEX: Conscious Text Generation"
(PID2021-123956OB-I00), funded by MCIN/AEI/10.13039/501100011033 and by "ERDF A way of making Europe".</p>
    </sec>
    <sec id="sec-8">
      <title>References</title>
      <p>[18] B. Y. Lin, W. Zhou, M. Shen, P. Zhou, C. Bhagavatula, Y. Choi, X. Ren, CommonGen: A constrained
text generation challenge for generative commonsense reasoning, in: Findings of the Association
for Computational Linguistics: EMNLP 2020, Association for Computational Linguistics, Online,
2020, pp. 1823–1840. URL: https://www.aclweb.org/anthology/2020.findings-emnlp.165.
[19] F. Carlsson, J. Öhman, F. Liu, S. Verlinden, J. Nivre, M. Sahlgren, Fine-grained controllable text
generation using non-residual prompting, in: Proceedings of the 60th Annual Meeting of the
Association for Computational Linguistics (Volume 1: Long Papers), 2022, pp. 6837–6857.
[20] K. R. Chandu, A. W. Black, Positioning yourself in the maze of neural text generation: A
task-agnostic survey, 2020. URL: https://arxiv.org/abs/2010.07279. doi:10.48550/ARXIV.2010.
07279.
[21] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, BLEU: a method for automatic evaluation of machine
translation, in: Proceedings of the 40th annual meeting of the Association for Computational
Linguistics, 2002, pp. 311–318.
[22] C.-Y. Lin, ROUGE: A package for automatic evaluation of summaries, in: Text summarization
branches out, 2004, pp. 74–81.
[23] R. Vedantam, C. Lawrence Zitnick, D. Parikh, CIDEr: Consensus-based image description
evaluation, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp.
4566–4575.
[24] P. Anderson, B. Fernando, M. Johnson, S. Gould, SPICE: Semantic propositional image caption
evaluation, in: Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The
Netherlands, October 11-14, 2016, Proceedings, Part V 14, Springer, 2016, pp. 382–398.
[25] P. Stanchev, W. Wang, H. Ney, EED: Extended edit distance measure for machine translation, in:
Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers,
Day 1), 2019, pp. 514–520.
[26] M. Popović, chrF: character n-gram f-score for automatic MT evaluation, in: Proceedings of the
tenth workshop on statistical machine translation, 2015, pp. 392–395.
[27] T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, Y. Artzi, BERTScore: Evaluating text generation
with bert, arXiv preprint arXiv:1904.09675 (2019).
[28] M. Kusner, Y. Sun, N. Kolkin, K. Weinberger, From word embeddings to document distances, in:</p>
      <p>International conference on machine learning, PMLR, 2015, pp. 957–966.
[29] W. Yuan, G. Neubig, P. Liu, BARTScore: Evaluating generated text as text generation, Advances
in Neural Information Processing Systems 34 (2021) 27263–27277.
[30] J. Fu, S.-K. Ng, Z. Jiang, P. Liu, GPTScore: Evaluate as you desire, arXiv preprint arXiv:2302.04166
(2023).
[31] J. S. Ernst, S. Marton, J. Brinkmann, E. Vellasques, D. Foucard, M. Kraemer, M. Lambert, Bias
mitigation for large language models using adversarial learning (2023).
[32] I. Martínez-Murillo, P. Moreda, E. Lloret, Analysing the problem of automatic evaluation of
language generation systems, Procesamiento del Lenguaje Natural 72 (2024) 123–136.
[33] M. M. Maestre, I. Martínez-Murillo, E. Lloret, P. Moreda, A. Suárez Cueto, COCOTEROS: A spanish
corpus with contextual knowledge for natural language generation, in: 40th Annual Conference
of the Spanish Association for Natural Language Processing 2024: Posters (SEPLN-P 2024), CEUR,
2024.
[34] Y. Zha, Y. Yang, R. Li, Z. Hu, AlignScore: Evaluating factual consistency with a unified alignment
function, in: A. Rogers, J. Boyd-Graber, N. Okazaki (Eds.), Proceedings of the 61st Annual
Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association
for Computational Linguistics, Toronto, Canada, 2023, pp. 11328–11348. URL: https://aclanthology.
org/2023.acl-long.634. doi:10.18653/v1/2023.acl-long.634.
[35] S. M. Hughes, M. Bae, Hughes hallucination evaluation model (HHEM) leaderboard, Hugging Face,
available at https://huggingface.co/spaces/vectara/leaderboard (accessed 22nd April, 2024) (2024).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q. V.</given-names>
            <surname>Liao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Muller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Houde</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Talamadupula</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Weisz</surname>
          </string-name>
          ,
          <article-title>Investigating explainability of generative AI for code through scenario-based design</article-title>
          ,
          <source>in: 27th International Conference on Intelligent User Interfaces</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>212</fpage>
          -
          <lpage>228</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>E.</given-names>
            <surname>Reiter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Dale</surname>
          </string-name>
          ,
          <article-title>Building applied natural language generation systems</article-title>
          ,
          <source>Natural Language Engineering</source>
          <volume>3</volume>
          (
          <year>1997</year>
          )
          <fpage>57</fpage>
          -
          <lpage>87</lpage>
          . doi:10.1017/S1351324997001502.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Sarvazyan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. Á.</given-names>
            <surname>González</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. Franco</given-names>
            <surname>Salvador</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Rangel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Chulvi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          , Overview of AuTexTification at IberLEF 2023:
          <article-title>Detection and attribution of machine-generated text in multiple domains</article-title>
          ,
          <source>in: Procesamiento del Lenguaje Natural</source>
          , Jaén, Spain,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Mansurov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Ivanov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Shelmanov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Tsvigun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Whitehouse</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O. M.</given-names>
            <surname>Afzal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mahmoud</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. F.</given-names>
            <surname>Aji</surname>
          </string-name>
          , P. Nakov,
          <article-title>M4: Multi-generator, multi-domain, and multi-lingual black-box machine-generated text detection</article-title>
          ,
          <source>arXiv:2305.14902</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Frieske</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Ishii</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. J.</given-names>
            <surname>Bang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Madotto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Fung</surname>
          </string-name>
          ,
          <article-title>Survey of hallucination in natural language generation</article-title>
          ,
          <source>ACM Comput. Surv</source>
          .
          <volume>55</volume>
          (
          <year>2023</year>
          ). URL: https://doi.org/10.1145/3571730. doi:10.1145/3571730.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>W. X.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Hou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Min</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Dong</surname>
          </string-name>
          , et al.,
          <article-title>A survey of large language models</article-title>
          ,
          <source>arXiv preprint arXiv:2303.18223</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>E.</given-names>
            <surname>Ferrara</surname>
          </string-name>
          ,
          <article-title>Should ChatGPT be biased? challenges and risks of bias in large language models</article-title>
          ,
          <source>arXiv preprint arXiv:2304.03738</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>H.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ning</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Teng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>Evaluating the logical reasoning ability of ChatGPT and GPT-4</article-title>
          ,
          <source>arXiv preprint arXiv:2304.03439</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>K.</given-names>
            <surname>Maurya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Desarkar</surname>
          </string-name>
          ,
          <article-title>Towards low-resource language generation with limited supervision</article-title>
          , in:
          <string-name>
            <given-names>Y.</given-names>
            <surname>Elazar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ettinger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Kassner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ruder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. A.</given-names>
            <surname>Smith</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the Big Picture Workshop</source>
          , Association for Computational Linguistics, Singapore, Singapore,
          <year>2023</year>
          , pp.
          <fpage>80</fpage>
          -
          <lpage>92</lpage>
          . URL: https://aclanthology.org/2023.bigpicture-1.7. doi:10.18653/v1/2023.bigpicture-1.7.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>D. D.</given-names>
            <surname>McDonald</surname>
          </string-name>
          ,
          <article-title>Natural language generation</article-title>
          ,
          <source>Handbook of natural language processing 2</source>
          (
          <year>2010</year>
          )
          <fpage>121</fpage>
          -
          <lpage>144</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>C.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Gong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <article-title>A survey of natural language generation</article-title>
          ,
          <source>ACM Comput. Surv.</source>
          <volume>55</volume>
          (
          <year>2022</year>
          ). URL: https://doi.org/10.1145/3554727. doi:10.1145/3554727.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>E.</given-names>
            <surname>Reiter</surname>
          </string-name>
          ,
          <article-title>Has a consensus NL generation architecture appeared, and is it psycholinguistically plausible?</article-title>
          ,
          <year>1994</year>
          . arXiv:cmp-lg/9411032.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kaiser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Polosukhin</surname>
          </string-name>
          ,
          <article-title>Attention is all you need</article-title>
          ,
          <year>2017</year>
          . arXiv:1706.03762.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>S.</given-names>
            <surname>Mahamood</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Clinciu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Gkatzia</surname>
          </string-name>
          ,
          <article-title>It's common sense, isn't it? Demystifying human evaluations in commonsense-enhanced NLG systems</article-title>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Shou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zeng</surname>
          </string-name>
          ,
          <article-title>Retrieval enhanced model for commonsense generation</article-title>
          , in:
          <string-name>
            <given-names>C.</given-names>
            <surname>Zong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Xia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Navigli</surname>
          </string-name>
          (Eds.),
          <source>Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021</source>
          , Association for Computational Linguistics, Online,
          <year>2021</year>
          , pp.
          <fpage>3056</fpage>
          -
          <lpage>3062</lpage>
          . URL: https://aclanthology.org/2021.findings-acl.269. doi:10.18653/v1/2021.findings-acl.269.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Aghahadi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Talebpour</surname>
          </string-name>
          ,
          <article-title>Avicenna: a challenge dataset for natural language generation toward commonsense syllogistic reasoning</article-title>
          ,
          <source>Journal of Applied Non-Classical Logics</source>
          <volume>32</volume>
          (
          <year>2022</year>
          )
          <fpage>55</fpage>
          -
          <lpage>71</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wan</surname>
          </string-name>
          ,
          <article-title>SituatedGen: Incorporating geographical and temporal contexts into generative commonsense reasoning</article-title>
          ,
          <source>arXiv preprint arXiv:2306.12552</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>B. Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bhagavatula</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <article-title>CommonGen: A constrained text generation challenge for generative commonsense reasoning</article-title>
          , in:
          <source>Findings of the Association for Computational Linguistics: EMNLP 2020</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>