<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>COCOTEROS: A Spanish Corpus with Contextual Knowledge for Natural Language Generation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>María Miró Maestre</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Iván Martínez-Murillo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Elena Lloret</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paloma Moreda</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Armando Suárez Cueto</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Dept. of Software and Computing Systems, University of Alicante, Apdo. de Correos 99</institution>
          ,
          <addr-line>E-03080, Alicante</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Contextual information is one of the key elements when automatically generating language with a more semantic-pragmatic perspective. To contribute to the study of this linguistic aspect, we present COCOTEROS, a COrpus of COnTextual TExt geneRatiOn in Spanish. COCOTEROS is available at https://huggingface.co/datasets/gplsi/cocoteros. The corpus is composed of sentences and automatically generated context pairs. For creating it, a semi-automatic weakly supervised methodology is implemented. Taking as a reference the Spanish section of the Tatoeba dataset, we filtered the sentences according to our research purpose. Then, we determined several linguistic parameters that the generated contexts need to fulfil considering their reference sentence. Finally, contexts were automatically generated using prompt engineering with Google's large language model Bard. Furthermore, we performed two types of evaluation to check both the linguistic quality and the presence of gender bias in the corpus: the former by manually measuring the magnitude estimation metric and the latter thanks to the GenBit automatic metric. The results show that COCOTEROS is an appropriate language resource to approach Natural Language Generation tasks from a semantic-pragmatic perspective for Spanish. For instance, the NLG task of concept-to-text generation could benefit from contextual information by generating sentences according to the information provided in the context and a set of given concepts. Additionally, regarding the task of question-answering, the inclusion of linguistic context can enhance the generation of more appropriate answers by serving as a guide on what information to include in the automatically generated answer.</p>
      </abstract>
      <kwd-group>
        <kwd>corpus</kwd>
        <kwd>contextual information</kwd>
        <kwd>natural language generation</kwd>
        <kwd>Spanish</kwd>
        <kwd>human evaluation</kwd>
        <kwd>large language models</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Natural Language Generation (NLG) systems are steadily improving their performance in a wide range of tasks where the information to be generated is delimited according to the objective of the task, e.g., text summarisation, machine translation or question answering (QA). One of the most important issues those systems have to deal with is the lack of sufficient contextual knowledge, as it prevents NLG models from better adapting the generated text to the communicative situation of each task. This leads to crucial problems such as hallucination and lack of commonsense in the produced text [<xref ref-type="bibr" rid="ref1">1</xref>]. In fact, one of the current concerns within the NLG discipline [<xref ref-type="bibr" rid="ref2">2</xref>] is the need to address tasks from a more ‘semantic-pragmatic perspective’ to solve the contextual inference difficulties that affect the output of the systems at issue. To address the lack of studies that bear in mind these linguistic levels of analysis, NLG is starting to put linguistic context in the research spotlight, given its importance for appropriately understanding human utterances. Indeed, Newman et al. [<xref ref-type="bibr" rid="ref3">3</xref>] already defended the consideration of context not only to create text automatically but also to assess the suitability of the generated text. This statement comes from the idea that communication-based features help to evaluate the performance of any model that imitates human language. Language itself is used to communicate ideas that are always expressed within a given communicative context, and it is this context that directly affects the structure of the utterance we want to say.</p>
      <p>Parallel to this, making NLG systems aware of contextual knowledge involves the creation of new resources, such as datasets, corpora, knowledge bases, etc., to train models in several languages, especially for those different from English or low-resourced ones. In the case of Spanish, we observed that most of the recently published corpora hardly address pragmatic-related issues with a contextual perspective, but rather focus on concrete pragmatic aspects such as metaphors to tackle identification tasks. Furthermore, the high performance of Large Language Models (LLMs) recently witnessed within the field of Natural Language Processing (NLP) has allowed researchers to use NLP tools to automatise data collection and corpus creation tasks, therefore reducing the time spent in collecting sufficient data for research purposes [<xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>].</p>
      <p>To bridge the gap of NLG systems that handle more semantic-pragmatic features of language, specifically contextual knowledge, we present COCOTEROS, a COrpus of COntextual TExt geneRatiOn in Spanish. This corpus comprises 4,845 sentences extracted from the existing Tatoeba dataset, together with 4,845 context sentences automatically generated with the Bard (footnote 1) language model and manually revised. Given the difficulties inherent to prompt engineering when using LLM-based chatbots, several linguistic parameters were determined to ensure the quality of the outputs automatically generated with Bard, including semantic similarity, length of the generated text, and forbidden keywords, among others. Moreover, we performed a human evaluation experiment based on the magnitude estimation metric with three linguistics specialists to measure the contextual appropriateness of the resulting contexts. In parallel, we measured gender bias with the GenBit tool [<xref ref-type="bibr" rid="ref6">6</xref>] to verify that our corpus would be useful for NLG tasks without adding gender biases to models trained in further experiments.</p>
      <p>In sum, the main contributions of this paper are:</p>
      <p>• Expansion of a subset of Tatoeba’s corpus with contextual information.
• Proposal of a weakly supervised methodology for building a corpus using prompt engineering.
• Creation of COCOTEROS, a novel Spanish corpus for commonsense NLG that includes contextual information.
• Corpus validation through human assessment with the magnitude estimation method.
• Corpus evaluation of gender bias with the GenBit automatic metric.</p>
      <p>We believe this corpus will provide the research community with a valuable resource in Spanish to test the performance of NLG systems in different tasks by considering semantic-pragmatic aspects of communication such as contextual appropriateness. Some of the NLG tasks that could use COCOTEROS are those related to concept-to-text generation, where sets of words are provided and the model has to generate a text given those concepts. These words can have multiple semantic meanings depending on their context. Having a prefixed context within which the sentence has to be generated could help to produce a more precise sentence. Moreover, COCOTEROS could also be used to train NLG models to automatically generate sentences in accordance with a given context as input information, therefore improving the model’s awareness of the different communicative situations it is trained with. As for NLP, some tasks that have already exploited the role of contextual knowledge for improving their classification systems are Named Entity Recognition (NER) [<xref ref-type="bibr" rid="ref7">7</xref>], word recognition and lexical processing to boost semantic disambiguation [<xref ref-type="bibr" rid="ref8">8</xref>], language learning [9] or even healthcare studies devoted to diseases or syndromes which critically affect language [10].</p>
      <p>Footnote 1: Since 8th February 2024, Bard is known as Gemini.</p>
      <p>SEPLN-2024: 40th Conference of the Spanish Society for Natural Language Processing. Valladolid, Spain. 24-27 September 2024. Contact: maria.miro@ua.es (M. M. Maestre); ivan.martinezmurillo@ua.es (I. Martínez-Murillo); elena.lloret@ua.es (E. Lloret); moreda@ua.es (P. Moreda); armando.suarez@ua.es (A. S. Cueto). ORCID: 0000-0001-7996-4440 (M. M. Maestre); 0009-0007-5684-0083 (I. Martínez-Murillo); 0000-0002-2926-294X (E. Lloret); 0000-0002-7193-1561 (P. Moreda); 0000-0002-8590-3000 (A. S. Cueto). © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.</p>
      <sec id="sec-1-1">
        <title>2. Related Work</title>
        <p>In view of the multidisciplinary nature of the task, the following theoretical background is on one side focused on the linguistic notion of context and its approach to NLG research (subsection 2.1). On the other side, subsection 2.2 includes prior NLG research focused on the creation of linguistic resources to address contextual-related tasks.</p>
      </sec>
      <sec>
        <title>2.1. Linguistic Context in NLG Research</title>
        <p>Messages need their surrounding communicative context in order to be completely understood [11]. This claim is well accepted within the NLP discipline, as many tasks try to solve context-related linguistic issues to improve the performance of NLP systems, e.g., coreference resolution [12], information retrieval [13], word sense disambiguation [14] or question answering [15]. Context, therefore, becomes a pragmatic element of great interest when processing language automatically. Similarly, when focusing on language generation, there are concrete applications such as dialogue systems where context is usually predefined so researchers can study the linguistic features surrounding such communicative context [16, 17].</p>
        <p>When addressing the task of contextual appropriateness (i.e., how appropriate a context is given a linguistic setting), several conceptions of context may come to mind, as linguistic theories tend to diverge on the definition of context given the wide range of perspectives from which context can be approached [18]. For the sake of the present research, we focus on the linguistic context of a given message, which can be defined as ‘any contiguous span of text within a document’ [19] or as ‘the set of utterances that precedes the current one’ [20]. These definitions align with the linguistic dimension of context known as ‘intratextual context’ (or ‘co-text’), which studies the relation of a piece of text to its surrounding text [18].</p>
      </sec>
      <sec id="sec-1-2">
        <title>2.2. Contextual Corpora for NLG</title>
        <p>The creation of linguistic resources directly oriented to analyse more complex linguistic phenomena such as context provides added value to the research community, as there are not as many resources available as for other far-reaching linguistic levels of analysis such as syntax or grammar. To motivate the study of this pragmatic element, several resources to analyse context from different perspectives have already been made available. Castilho et al. [21] created an English corpus annotated with context-aware issues for the task of Machine Translation into Brazilian Portuguese. Regarding dialogue tasks, Udagawa and Aizawa [22] addressed the common grounding problem by collecting a dialogue dataset with continuous and partially-observable context. As for controllable text generation, Lin et al. [23] created the CommonGen task and dataset to test to what extent a generation system can generate text with commonsense reasoning in English. To this end, the task is to generate a coherent sentence that includes several common concepts previously shown to the system. Derived from this work, Carlsson et al. [24] generated the C2Gen dataset of context sentences in English from which they extracted several keywords that had to be included in an automatically generated text. Finally, a recent English corpus worth mentioning is databricks-dolly-15k, a human-generated instruction corpus created to train the Dolly LLM [25]. This dataset was applied to different contextual tasks such as summarisation of Wikipedia articles or closed QA, where a question and a reference passage are input to the system to get factually correct responses.</p>
        <p>Focusing on Spanish resources, Sanchez-Bayona and Agerri [26] generated a corpus of Spanish metaphors, which depend directly on the contextual meaning to be clearly identified by an automatic system. As for Natural Language Inference (NLI), Kovatchev and Taulé [27] compiled the INFERES corpus to check the performance of machine learning systems on negation-based adversarial examples by using context paragraphs from topics extracted from the Spanish Wikipedia.</p>
        <p>After a thorough review of the current corpora that address contextual NLG tasks in Spanish, we can say that, to the best of our knowledge, there is no corpus focused on the contextual information generation task in Spanish. Consequently, for this research we build on the previous works by Lin et al. [23] and Carlsson et al. [24] to address the task of contextual information generation in Spanish.</p>
      </sec>
      <sec>
        <title>3. Corpus Creation</title>
        <p>The following subsections include the methodology steps to create COCOTEROS: i) we explain the reference sentences dataset collection process and how we filtered them (subsection 3.1); ii) we move on to determine the linguistic constraints that will comprise the prompt to generate automatic contexts (subsection 3.2); iii) we describe the context generation task (subsection 3.3); and iv) we include a manual post-edition to curate the results generated by the LLM (subsection 3.4). Figure 1 shows a visual pipeline of the methodology used for creating COCOTEROS.</p>
      </sec>
      <sec>
        <title>3.1. Data Collection and Filtering</title>
        <p>For the present study, we wanted to gather simple Spanish sentences with enough semantic content to automatically generate contexts linked to the situation stated in the reference sentence. We prioritised sentences with not too much linguistic information so that the context does not add extra information beyond the purpose of the task and is not too distant from the original sentence situation. To this end, we chose the Spanish section of sentences written on the website Tatoeba (footnote 2) as the original dataset from which we would select the sentences to generate the contexts. We first considered using other already published corpora such as CommonGen [23] or C2Gen [24] as original datasets because they also focused on the task of NLG with contextual information. However, to use these corpora we would have had to translate the original datasets into Spanish, which would imply choosing an appropriate automatic translation tool or manually translating the datasets to adapt the task into Spanish. Also, a further proofreading step would have been necessary to check the accuracy of the translations into Spanish, so we preferred to benefit from an already-existing Spanish dataset that could help us generate our context corpus.</p>
        <p>Tatoeba’s original dataset includes around 393,000 Spanish sentences either translated from other languages or directly written in Spanish. The dataset includes sentences with a range of 1 up to 44 words per sentence, so we first filtered them by selecting only those sentences made up of either 8 or 9 words, collecting a total of 60,170 Spanish sentences. We chose this section of the dataset after a previous preprocessing of an excerpt of the dataset with the Spacy tokenizer (footnote 3). In this preliminary preprocessing, we noticed that the more words the sentence comprised, the more risk we had of including too much semantic information in the sentence. This could entail the generation of contexts not linked to the original situation stated in the reference sentence. Similarly, we rejected those sentences made up of 7 words or fewer, as many of their keywords lacked enough linguistic information (verbs, nouns, etc.) to generate a context that could be in line with the situation stated in the reference sentence.</p>
        <p>Footnote 2: This dataset was released under a CC-BY License and can be found at https://tatoeba.org/es.</p>
        <p>Footnote 3: https://spacy.io/api/tokenizer</p>
      </sec>
      <sec id="sec-1-3">
        <title>3.2. Linguistic Constraints</title>
        <p>LLMs can be useful for supporting the automatic creation of corpora to study specific linguistic phenomena that would become very costly tasks if compiled manually. Nevertheless, generating a corpus with LLMs from scratch also entails several risks regarding linguistic appropriateness that could worsen the quality of the corpus, as happens with hallucination issues or lack of commonsense.</p>
        <p>Therefore, with the aim of automatically creating linguistic contexts referred to a given sentence, and to better control the output of our chosen LLM (further explained in Section 3.3), we determined several linguistic parameters to include in the prompt:</p>
        <p>• Definition of context: Following previous studies focused on context as described in Section 2.1, we started our prompt with a simple and straightforward definition of what we consider a linguistic context so the model could first get the idea of the task to accomplish.</p>
        <p>• Reference sentence or synonyms: On the first attempts to find the right prompt to compile the corpus, we observed that, even when including a short definition of linguistic context, the model sometimes generated a context including the reference sentence. Therefore, to better specify the linguistic nature of the context to be generated, we indicated that neither the reference sentence nor a sentence with similar semantic meaning could appear in the context.</p>
        <p>• Forbidden keywords: We extracted three keywords from each reference sentence that could semantically define the sentence meaning. The extraction was automatically performed by means of a random choice where we prioritised the selection of two nouns and one verb, as we consider them some of the main linguistic elements that define the semantic meaning of a sentence. Then, we added those batches of three keywords to the prompt as forbidden words that could not appear in the context to be generated. With this restriction, we wanted to ensure that, even if some of the linguistic structures in the reference sentence were repeated in the generated context, the semantic meaning of the context is related to, but changes somewhat from, the reference sentence. This is in line with the idea that the choice of words influences co-text and meaning potential [28], and we wanted to test to what extent LLMs can generate co-text with the same conceptual background while adding new words that can enlarge the semantic information of the new sentence.</p>
        <p>• Maximum context length: Inspired by the work presented in Carlsson et al. [24], we decided that an appropriate length for the generated context could be around 45 words. This decision also comes from preliminary prompt tests where we found that, if no length limitation was included, the model tended to over-elaborate, creating contexts of more than ten lines of text that strayed too far from the original situation stated in the reference sentence.</p>
      </sec>
      <sec id="sec-1-4">
        <title>3.3. Context Generation</title>
        <p>Once we filtered the original dataset, the next step was to generate an appropriate context for each of the selected sentences. For this, we benefited from the capabilities of LLMs, and in particular, we used Bard [29], Google’s recent LLM. Our decision was motivated by an empirical study we previously conducted in which several LLMs were compared to check how appropriately they fulfilled the task of generating a context resembling a sentence but without repeating or paraphrasing it. The LLMs compared were LLaMa (footnote 4) [30], Vicuna (footnote 4) [31], Bard, and ChatGPT (footnote 5). We automatically generated contexts for our subset of sentences of 8 or 9 words with Bard, which could generate a context in an average of 5 seconds. Nevertheless, Bard’s public version could be prompted only 130 times per day. The generation process was carried out through a zero-shot prompt that comprised the linguistic restrictions the generated context should or should not include, as stated in section 3.2. With this setup, we created an initial version of the COCOTEROS corpus with 5,000 contexts.</p>
        <p>Footnote 4: The tested version of LLaMa was llama-2-70b-chat, and Vicuna’s version was vicuna-33b. They were tested on https://chat.lmsys.org/</p>
        <p>Footnote 5: The tested version of ChatGPT was GPT 3.5 on https://chat.openai.com/</p>
      </sec>
      <sec id="sec-1-5">
        <title>3.4. Post-editing</title>
        <p>Finally, given Bard’s predefined chat-like communicative structures, we manually revised and post-edited the resulting contexts by eliminating all the information included in the response which was not the generated context itself (e.g. Bard’s output included sentences similar to “Aquí tienes un contexto relacionado para la frase ‘Tengo demasiadas cosas en la cabeza estos días’” (footnote 6)). As a remark, there were times when Bard generated several contexts for a single input, giving us the opportunity to choose between them, so we carried out a manual proofreading process where we checked every possible context to choose those that best matched the conception of context we determined for this research task. In line with this, in those cases where we could choose from two options, we selected the context describing a female-subject situation. We made this decision because we detected a somewhat higher proportion of reference sentences addressing male subjects, so the generated context was male-gendered too. Therefore, in those cases where the reference sentence was not gender-specific, we prioritised female contexts to balance gender in COCOTEROS. Further details on how we addressed gender bias in our corpus are shown in subsection 5.2.</p>
        <p>In this manual post-editing step we also discarded contexts that were repetitions or paraphrases of the reference sentence, as well as those that did not include enough semantic information to be considered appropriately generated contexts. Among the rest of the contexts we kept in COCOTEROS, there were times when Bard left some of the concepts in the generated text incomplete so the user could complete it according to his/her preferences, as in “Nos encontramos a [nombre de la persona] sentado en su escritorio” (footnote 7). Consequently, we had to modify those contexts by completing the missing information with generic concepts or names so we could add the resulting context to the final corpus.</p>
        <p>Footnote 6: Example translated into English for clarity purposes: Here’s a related context for the phrase “I have too many things on my mind these days”.</p>
        <p>Footnote 7: Example translated into English for clarity purposes: We found [name of the person] sitting at his/her desk.</p>
      </sec>
      <sec id="sec-1-6">
        <title>4. COCOTEROS - Corpus of Contextual Text Generation in Spanish</title>
        <p>As the first corpus focused on the contextual text generation task for Spanish, COCOTEROS contains a total of 4,845 pairs of reference sentences with their respective generated contexts, as illustrated in Figure 2. Moreover, the corpus includes the three keywords extracted from each reference sentence. The final number of contexts results from a manual post-edition of the original 5,000 contexts generated with Bard. We performed this post-edition because we noticed sexist content in some of the generated contexts, so we decided to discard those cases straightforwardly. Table 1 shows a statistical summary of COCOTEROS. Apart from the general corpus information, we found it interesting to check the average number of sentences and words per context because Bard sometimes generated contexts with very different lengths. Even though the prompt included the maximum length that the context could have (45 words), we found cases where the context had only 15 words, whereas other contexts contained more than four sentences, with a total of more than 50 words.</p>
        <table-wrap>
          <label>Table 1</label>
          <caption>
            <p>COCOTEROS data summary.</p>
          </caption>
          <table>
            <thead>
              <tr><th>Data</th><th>Total</th></tr>
            </thead>
            <tbody>
              <tr><td>Reference sentences</td><td>4,845</td></tr>
              <tr><td>Keywords</td><td>14,535</td></tr>
              <tr><td>Generated contexts</td><td>4,845</td></tr>
              <tr><td>Words in the sentences</td><td>40,827</td></tr>
              <tr><td>Words in the contexts</td><td>119,885</td></tr>
              <tr><td>Words in the corpus</td><td>175,247</td></tr>
              <tr><td>Average no. of sentences per context</td><td>2</td></tr>
              <tr><td>Average no. of words per context</td><td>25</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>The official version of the COCOTEROS corpus is available at https://huggingface.co/datasets/gplsi/cocoteros. With this, we aim to contribute to NLG research with a new language resource for studying contextual information generation in Spanish, as well as for other unexplored NLG tasks that can benefit from our corpus to address further research questions.</p>
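        <p>As an illustration of how a sentence, keywords and context triple relates to the prompt constraints of subsection 3.2, the following minimal sketch checks a record against three of them: the 45-word limit, the forbidden keywords, and the ban on repeating the reference sentence. The Record class, its field names, the regex tokenisation and the example context are assumptions made for this sketch; they are not the corpus schema or the authors' tooling, and the check does not cover paraphrases or inflected keyword forms.</p>
        <preformat>
```python
import re
from dataclasses import dataclass

MAX_CONTEXT_WORDS = 45  # length limit stated in the prompt (subsection 3.2)


@dataclass
class Record:
    """Hypothetical layout of one corpus entry (field names are assumptions)."""
    sentence: str   # reference sentence
    keywords: list  # the three forbidden keywords
    context: str    # generated context


def words(text):
    """Lowercased word tokens; a simple regex stand-in for a real tokenizer."""
    return re.findall(r"\w+", text.lower())


def check_constraints(record):
    """Return the list of violated constraints (empty when none is violated)."""
    problems = []
    context_words = words(record.context)
    if len(context_words) > MAX_CONTEXT_WORDS:
        problems.append("too_long")
    # Exact word-form match only: inflected variants are not detected.
    forbidden = set(k.lower() for k in record.keywords)
    used = sorted(forbidden.intersection(context_words))
    if used:
        problems.append("forbidden_keywords:" + ",".join(used))
    if record.sentence.lower() in record.context.lower():
        problems.append("repeats_reference")
    return problems


# Illustrative record: the context below is invented for this sketch.
example = Record(
    sentence="Tengo demasiadas cosas en la cabeza estos días",
    keywords=["cosas", "cabeza", "tengo"],
    context="Últimamente el trabajo y la mudanza no me dejan descansar.",
)
print(check_constraints(example))
```
</preformat>
        <p>A check of this kind could flag the over-length contexts mentioned above (more than 50 words) for manual review; detecting paraphrases of the reference sentence would still require human judgement or a semantic similarity measure.</p>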
      </sec>
    </sec>
    <sec id="sec-2">
      <title>5. Corpus Evaluation</title>
      <p>To ensure that the contexts included in COCOTEROS are appropriate for contextual generation tasks, we evaluated them taking into account different aspects: context appropriateness with the manual magnitude estimation method (subsection 5.1) and gender bias through the automatic GenBit metric (subsection 5.2).</p>
      <sec id="sec-2-1">
        <title>5.1. Context Appropriateness</title>
        <p>With the evolution of the latest LLMs, researchers face a need for consistent evaluation metrics that help them evaluate the outputs provided by these models when testing their performance for language generation tasks. To this end, we performed an experiment based on the magnitude estimation method [32] with the help of three linguistics specialists. Magnitude estimation is a method generally used in psychology to check the reaction of different subjects when presented with several stimuli. To measure the different levels of reaction subjects can have, they need to assign a score to a first stimulus (in our case, the generated context) where no ranges or limits are determined. Then, when a second stimulus is presented, they have to compare it with the first stimulus shown, and depending on the intensity of the reaction they have, its score will change based on the previous score they assigned to the first stimulus. In this manner, if the subjects’ reaction to the second stimulus is twice as strong as to the first stimulus, they will have to double the score they assign to the second stimulus. This method has been used successfully for evaluating automatically generated text in several NLG tasks [33, 34, 35, 36], as researchers demonstrated that it helps to detect more linguistic nuances as well as more distinctive rankings when comparing the outputs between the annotators in comparison to other more common methods such as Likert scales [33, 35].</p>
        <sec id="sec-2-1-1">
          <p>Taking this method as a basis, we wanted to measure the appropriateness of the generated context given its reference sentence. For this, we took a representative sample of sentences and contexts from the COCOTEROS corpus through Formula 1, presented in [37] and previously used in [38]:</p>
          <disp-formula>
            <tex-math>M = \frac{N \cdot K^{2} \cdot P \cdot Q}{E^{2} \cdot (N - 1) + K^{2} \cdot P \cdot Q}</tex-math>
            <label>(1)</label>
          </disp-formula>
          <p>where N is the population, K the confidence interval, P the probability of success, Q the probability of failure and E the error rate. The population N was 4,845 sentences and their respective contexts, and the values given to the rest of these parameters were taken as presented in [39], so that K = 0.95, E = 0.05, P = 0.5, and Q = 0.5. Once the formula was calculated, the resulting number of sentences M for testing contextual appropriateness was rounded to 90 sentences with their respective contexts. This subset of 90 sentences and contexts was selected at random from the final COCOTEROS corpus. With the subset of contexts already determined, we performed the magnitude estimation analysis to validate the generated contexts. To accomplish this, we explained to the annotators the methodology to score the subset of 90 generated contexts, with the only requirement that the lowest score they could assign was 1. In this manner, we ensured the subsequent normalisation of the values each of them might assign to each context. As a remark, we noticed that two annotators scored contexts based on a 1 to 100 ranking, even though we highlighted that there were no restrictions on the values they could choose for each context. Once we collected all the scores given by the annotators, we normalised the results by means of the z-score normalisation formula (Formula 2) as used in [40]:</p>
          <disp-formula>
            <tex-math>z_{ih} = \frac{x_{ih} - \mu_{h}}{\sigma_{h}}</tex-math>
            <label>(2)</label>
          </disp-formula>
          <p>where <inline-formula><tex-math>z_{ih}</tex-math></inline-formula> is annotator h’s z-score for the context when annotator h gave a magnitude estimation score of <inline-formula><tex-math>x_{ih}</tex-math></inline-formula> to that context. <inline-formula><tex-math>\mu_{h}</tex-math></inline-formula> is the mean and <inline-formula><tex-math>\sigma_{h}</tex-math></inline-formula> the standard deviation of the set of magnitude estimation scores for annotator h.</p>
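          <p>The sample size of Formula 1 can be reproduced with a few lines of code. The sketch below assumes the formula has the usual finite-population form, M = N·K²·P·Q / (E²·(N − 1) + K²·P·Q), and plugs in the parameter values reported above; the function name sample_size is chosen for this illustration only.</p>
          <preformat>
```python
def sample_size(n, k, p, q, e):
    """Sample size M for a finite population of size n (Formula 1).

    k: confidence value, p / q: probability of success / failure,
    e: error rate. Parameter names mirror the symbols in the text.
    """
    numerator = n * k**2 * p * q
    denominator = e**2 * (n - 1) + k**2 * p * q
    return numerator / denominator


# Values reported in the paper: N = 4,845, K = 0.95, P = Q = 0.5, E = 0.05.
m = sample_size(n=4845, k=0.95, p=0.5, q=0.5, e=0.05)
print(round(m, 2))
```
</preformat>
          <p>Under these assumptions the result is slightly below 89, which is consistent with the sample of 90 sentence and context pairs used in the evaluation.</p>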
          <p>Figure 3 shows the normalised results for the magnitude estimation evaluation. The 0 line marks the mean: positive values indicate contexts with higher scores, and negative values indicate contexts considered not suitably generated. As can be seen, the three annotators tend to agree on which linguistic contexts have an appropriate contextual relatedness to the reference sentence, even though each of them used a different range of scores within the magnitude estimation experiment. In spite of a few disagreeing cases in the total of 90 contexts, we observe that the annotators agree that more than half of the corpus sample comprises contexts with appropriate contextual relatedness, while the rest could be improved. After reviewing the results with the annotators, we concluded that they tended to heavily penalise those contexts that paraphrased the reference sentence, even if after that paraphrasing sentence the context included new excerpts of text that indeed served as an appropriate linguistic context.</p>
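          <p>The effect of the z-score normalisation (Formula 2) on annotators who use different score ranges can be illustrated with a short sketch. The scores below are made up for the example, and the use of the population standard deviation is an assumption, since the text does not state which variant was applied.</p>
          <preformat>
```python
from statistics import mean, pstdev


def z_scores(scores):
    """Normalise one annotator's magnitude estimation scores (Formula 2)."""
    mu = mean(scores)
    sigma = pstdev(scores)  # population standard deviation (an assumption)
    if sigma == 0:
        raise ValueError("all scores are identical; z-scores are undefined")
    return [(x - mu) / sigma for x in scores]


# Two hypothetical annotators rating the same four contexts on
# different scales (B uses values ten times larger than A).
annotators = {"A": [1, 2, 4, 8], "B": [10, 20, 40, 80]}
normalised = {name: z_scores(s) for name, s in annotators.items()}

for name, values in normalised.items():
    print(name, [round(v, 3) for v in values])
```
</preformat>
          <p>Both annotators end up with the same normalised values, which is why scores given on a free scale and scores given on a 1 to 100 scale can be compared after normalisation.</p>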
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>5.2. Gender Bias</title>
        <p>
          Several methodological issues come to mind when using an LLM to generate a new language resource for further training LLMs so they can learn how to approach new emerging NLG problems. One of those recently detected issues is the presence of gender bias in the human-compiled corpora that LLMs are trained with. This poses a new problem for the research community, as the remarkable performance those LLMs currently show is based on data that reflect and amplify societal biases detected in naturally occurring texts [41]. With a view to checking possible biases in our corpus, we used the Gender Bias Tool (GenBit) [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] to measure the apparent level of gender bias in the 4,845 generated contexts from COCOTEROS.
        </p>
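        <p>As a rough illustration of what a lexicon-based gender measurement involves, the sketch below computes the share of female and male gendered words among all gendered tokens in a list of contexts. It is not GenBit's API or its Spanish lexicon: the word lists, the function name and the example contexts are hypothetical, and GenBit's own scores additionally rely on co-occurrence statistics that this sketch does not model.</p>

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny word lists for illustration only.
FEMALE_WORDS = {"ella", "mujer", "madre", "hija", "hermana"}
MALE_WORDS = {"él", "hombre", "padre", "hijo", "hermano"}


def gender_word_shares(texts):
    """Return the share of female and male list words among all matched tokens."""
    counts = Counter()
    for text in texts:
        # Lowercase and split on word characters (accented letters included).
        for token in re.findall(r"\w+", text.lower()):
            if token in FEMALE_WORDS:
                counts["female"] += 1
            elif token in MALE_WORDS:
                counts["male"] += 1
    total = counts["female"] + counts["male"]
    if total == 0:
        return {"female": 0.0, "male": 0.0}
    return {gender: counts[gender] / total for gender in ("female", "male")}


# Hypothetical contexts, not taken from COCOTEROS.
contexts = [
    "Mi padre y mi hermano llegaron tarde.",
    "Ella habló con su madre.",
    "El hombre saludó a su hijo.",
]
shares = gender_word_shares(contexts)
```

        <p>A count of this kind only sees words that are in the lists, so proper names or gendered adjectives outside them are not reflected in the shares.</p>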
        <p>According to its developers, GenBit helps determine if
gender is distributed uniformly across data by measuring
the strength of association between a pre-defined list of
gender definition words and other words in the corpus
via co-occurrence statistics. Table 2 shows the obtained
results after processing COCOTEROS with the Spanish
metric provided in GenBit. Among these results are the proportions of female and male gender definition words that appear in the corpus, which are 0.335 and 0.665, respectively, in COCOTEROS. Considering the results, it seems there is a higher representation of words associated with the male gender than with the female. However, these results do not imply that those sentences containing female-gendered words are used in a sexist context, but that female-gendered terms appear less often in the corpus. We want to remark on this because the apparent underrepresentation of female-gendered words could be easily corrected by creating parallel contexts to those containing male-gendered words, so that we could balance the representation of both genders while expanding our corpus with further examples. Moreover, we have to bear in mind that words in Spanish have a specific grammatical gender, whereas English words do not. Consequently, a predominance of male-gendered words does not necessarily imply that the corpus is gender biased, but that the corpus includes more words with that grammatical gender, whether those words refer to objects, places or people.</p>
        <p>In addition, during our manual post-editing stage of the 4,845 contexts, we found that many of them described communicative situations where the subject is a woman. However, GenBit does not include female or male proper names and gendered adjectives in its Spanish section, so it cannot consider those contexts as gender-defined, which may also affect the final result of the gender bias metric. Therefore, the results achieved with the GenBit score serve as a first attempt to consider possible gender bias in our corpus, but we believe they cannot be conclusive given the different examples of gendered sentences found in our corpus that are not considered by the metric.</p>
        <sec id="sec-2-2-1">
          <title>6. Overall Discussion</title>
          <p>The results obtained throughout the experimentation process for creating and evaluating COCOTEROS open the door for discussion along several dimensions.</p>
          <p>Regarding the magnitude estimation evaluation, this metric helped us to detect further nuances in the scores each annotator assigned to contexts depending on their appropriateness. Those nuances could be future challenges to address in order to keep discovering knowledge on how to deal with contextual information in NLG systems. These results also helped us to determine one of the modifications to apply to COCOTEROS: in future work we will manually analyse and discard contexts with paraphrasing sentences, so that we only keep linguistic contexts that add contextual information to the reference sentence without using synonyms.</p>
          <p>Another key aspect of generating new resources is that they must not contain gender biases. An unbiased dataset is an important factor when training a language model, as bias is mostly introduced in the data used in the training phase of the model. As discussed in Section 5.2, remarkable efforts have been made to balance the number of sentences addressing both genders, as we are aware of the importance of dealing with gender underrepresentation when creating inclusive language resources that comply with gender balance standards. By doing this, we also want to encourage the rest of the community to take similar steps so that NLP resources and LLMs are trained on trustworthy resources with no biases. We used the GenBit tool to measure this, and although the results obtained are as expected, GenBit does not detect some grammatical categories such as male or female proper names and gendered adjectives, so the results cannot be conclusive.</p>
          <p>One problem worth commenting on regarding LLMs is hallucination, which occurs when a text is nonsensical or unfaithful to the input source. During post-processing, we detected that some generated contexts suffered from this (e.g., the reference sentence contained the word “father” while the context was generated with “grandfather”; the generated context was written in the masculine form when the reference sentence was in the feminine form; or the case of fake generated data, such as the winner of Eurovision 2023, which was not Germany). Nevertheless, we did not discard these sentences as our scope was to obtain appropriate contexts. Therefore, future work will focus on detecting and eliminating hallucinations to gather a corpus free of this issue.</p>
          <p>Finally, another of the main interests in generating new resources for the NLP community is creating multitask datasets so that linguistic resources become a valuable and reusable tool which can motivate new research. COCOTEROS will contribute to boosting NLP research specifically addressing semantic and pragmatic aspects for the Spanish language. Although it was originally conceived for NLG, the fact that it contains contexts associated with reference sentences could be beneficial for solving other NLP-related issues such as textual entailment, also known as Natural Language Inference (NLI) [42]. This task focuses on the semantic relations that may exist between several pieces of text and how such relations can be characterised and computationally analysed.</p>
        </sec>
        <sec id="sec-2-2-2">
          <title>7. Conclusions and Future Work</title>
          <p>In this paper we have presented COCOTEROS, a Spanish corpus of contextual knowledge for NLG, containing nearly 5,000 sentences with their corresponding contexts. The creation of COCOTEROS comes from the current need in NLP research to address tasks with a more semantic-pragmatic approach, as occurs with the generation of linguistic context. Also, we wanted to contribute to the research community with a well-defined Spanish resource to study contextual aspects in NLG, given the lack of enough linguistic resources to study pragmatic aspects of language for languages other than English.</p>
          <p>With the aim of verifying the level of linguistic and contextual appropriateness of COCOTEROS, we performed a two-fold evaluation. First of all, we used the magnitude estimation method with the help of three linguistics specialists to measure the linguistic and contextual appropriateness of a representative sample of the generated contexts. Then, we applied the GenBit metric to COCOTEROS to check the level of gender bias our corpus showed. On the one hand, results on the contextual appropriateness evaluation reflect the difficulties when addressing the contextual generation task even for human annotators, as annotators tended to differ on the degree of appropriateness of each context. Nevertheless, the magnitude estimation metric indicates that more than half of the evaluated contexts were scored favourably. On the other hand, the gender bias metric score shows that, with a few modifications, we could reduce the presence of gender bias in the corpus to a large extent. However, the resulting bias score cannot be conclusive as the metric did not consider some of the gender-linguistic features the generated contexts included.</p>
          <p>Several research directions are planned for future work. First, we would like to improve our resource, so further experiments will be made to balance gender representation in COCOTEROS, as well as to extend the number of contexts so this Spanish resource may be of help for addressing NLP tasks that need larger amounts of data. Finally, we aim to devote a branch of future research to adapting the COCOTEROS corpus to the task of intention identification to better understand which reasons make humans have a particular intention when uttering a message based on the context surrounding such intention. At the same time, we would check if LLMs can better detect specific communicative intentions depending on reference sentences and their linguistic context.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Acknowledgments</title>
      <sec id="sec-3-1">
        <p>The research work conducted is part of the R&amp;D projects “CORTEX: Conscious Text Generation” (PID2021-123956OB-I00), funded by MCIN/AEI/10.13039/501100011033/ and by “ERDF A way of making Europe”; “CLEAR.TEXT: Enhancing the modernization public sector organizations by deploying Natural Language Processing to make their digital content CLEARER to those with cognitive disabilities” (TED2021-130707B-I00), funded by MCIN/AEI/10.13039/501100011033 and “European Union NextGenerationEU/PRTR”; and the project “NL4DISMIS: Natural Language Technologies for dealing with dis- and misinformation” with grant reference (CIPROM/2021/21) funded by the Generalitat Valenciana. Moreover, it has also been partially funded by the Ministry of Economic Affairs and Digital Transformation and “European Union NextGenerationEU/PRTR” through the “ILENIA” project (grant number 2022/TL22/00215337) and “VIVES” subproject (grant number 2022/TL22/00215334).
        </p>
        <sec id="sec-3-1-1">
          <title>References</title>
          <p>[8] B. T. Johns, M. N. Jones, Content matters: Measures of contextual diversity must consider semantic content, Journal of Memory and Language 123 (2022) 104313. URL: https://www.sciencedirect.com/science/article/pii/S0749596X21000966. doi:10.1016/j.jml.2021.104313.</p>
          <p>[9] T. Heck, D. Meurers, On the relevance and learner dependence of co-text complexity for exercise difficulty, in: D. Alfter, E. Volodina, T. François, A. Jönsson, E. Rennes (Eds.), Proceedings of the 12th Workshop on NLP for Computer Assisted Language Learning, LiU Electronic Press, Tórshavn, Faroe Islands, 2023, pp. 71–84. URL: https://aclanthology.org/2023.nlp4call-1.9.</p>
          <p>[10] T. Tyagi, C. G. Magdamo, A. Noori, Z. Li, X. Liu, M. Deodhar, Z. Hong, W. Ge, E. M. Ye, Y. han Sheu, H. Alabsi, L. Brenner, G. K. Robbins, S. Zafar, N. Benson, L. Moura, J. Hsu, A. Serrano-Pozo, D. Prokopenko, R. E. Tanzi, B. T. Hyman, D. Blacker, S. S. Mukerji, M. B. Westover, S. Das, Using deep learning to identify patients with cognitive impairment in electronic health records, in: Proceedings of Machine Learning Research ML4H, 2021. arXiv:2111.09115.</p>
          <p>[11] J. Verschueren, Context and structure in a theory of pragmatics, Studies in Pragmatics 10 (2008) 14–24.</p>
          <p>[12] T. Lai, H. Ji, T. Bui, Q. H. Tran, F. Dernoncourt, W. Chang, A context-dependent gated module for incorporating symbolic semantics into event coreference resolution, in: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, Online, 2021, pp. 3491–3499. URL: https://aclanthology.org/2021.naacl-main.274. doi:10.18653/v1/2021.naacl-main.274.</p>
          <p>[13] L. Tamine, M. Daoud, Evaluation in contextual information retrieval: Foundations and recent advances within the challenges of context dynamicity and data privacy, ACM Comput. Surv. 51 (2018). URL: https://doi.org/10.1145/3204940. doi:10.1145/3204940.</p>
          <p>[14] C. Hadiwinoto, H. T. Ng, W. C. Gan, Improved word sense disambiguation using pre-trained contextualized word representations, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics, Hong Kong, China, 2019, pp. 5297–5306. URL: https://aclanthology.org/D19-1533. doi:10.18653/v1/D19-1533.</p>
          <p>[15] D. Su, M. Patwary, S. Prabhumoye, P. Xu, R. Prenger, M. Shoeybi, P. Fung, A. Anandkumar, B. Catanzaro, Context generation improves open domain question answering, in: Findings of the Association for Computational Linguistics: EACL 2023, Association for Computational Linguistics, Dubrovnik, Croatia, 2023, pp. 793–808. URL: https://aclanthology.org/2023.findings-eacl.60. doi:10.18653/v1/2023.findings-eacl.60.</p>
          <p>[16] C. Strathearn, D. Gkatzia, Task2Dial dataset: A novel dataset for commonsense-enhanced task-based dialogue grounded in documents, in: Proceedings of the 4th International Conference on Natural Language and Speech Processing (ICNLSP 2021), Association for Computational Linguistics, Trento, Italy, 2021, pp. 242–251. URL: https://aclanthology.org/2021.icnlsp-1.28.</p>
          <p>[17] D. Ghosal, S. Shen, N. Majumder, R. Mihalcea, S. Poria, CICERO: A dataset for contextualized commonsense inference in dialogues, in: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Dublin, Ireland, 2022, pp. 5010–5028. URL: https://aclanthology.org/2022.acl-long.344. doi:10.18653/v1/2022.acl-long.344.</p>
          <p>[18] R. Finkbeiner, J. Meibauer, P. B. Schumacher, What is a context? Linguistic approaches and challenges, volume 196 of Linguistik aktuell = linguistics today, John Benjamins Pub. Co., Amsterdam, 2012.</p>
          <p>[19] G. Hollis, Delineating linguistic contexts, and the validity of context diversity as a measure of a word’s contextual variability, Journal of Memory and Language 114 (2020) 104146. URL: https://www.sciencedirect.com/science/article/pii/S0749596X20300607. doi:10.1016/j.jml.2020.104146.</p>
          <p>[20] G. Ferrari, Types of contexts and their role in multimodal communication, Computational Intelligence 13 (1997) 414–426.</p>
          <p>[21] S. Castilho, J. L. Cavalheiro Camargo, M. Menezes, A. Way, DELA corpus - a document-level corpus annotated with context-related issues, in: Proceedings of the Sixth Conference on Machine Translation, Association for Computational Linguistics, Online, 2021, pp. 566–577. URL: https://aclanthology.org/2021.wmt-1.63.</p>
          <p>[22] T. Udagawa, A. Aizawa, A natural language corpus of common grounding under continuous and partially-observable context, in: Proceedings of the AAAI Conference on Artificial Intelligence, AAAI Press, 2019. URL: https://doi.org/10.1609/aaai.v33i01.33017120. doi:10.1609/aaai.v33i01.33017120.</p>
          <p>[23] B. Y. Lin, W. Zhou, M. Shen, P. Zhou, C. Bhagavatula, Y. Choi, X. Ren, CommonGen: A constrained text generation challenge for generative commonsense reasoning, in: Findings of the Association for Computational Linguistics: EMNLP 2020, Association for Computational Linguistics, Online, 2020, pp. 1823–1840. URL: https://aclanthology.org/2020.findings-emnlp.165. doi:10.18653/v1/2020.findings-emnlp.165.</p>
          <p>[24] F. Carlsson, J. Öhman, F. Liu, S. Verlinden, J. Nivre, M. Sahlgren, Fine-grained controllable text generation using non-residual prompting, in: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Dublin, Ireland, 2022, pp. 6837–6857. URL: https://aclanthology.org/2022.acl-long.471. doi:10.18653/v1/2022.acl-long.471.</p>
          <p>[25] M. Conover, M. Hayes, A. Mathur, J. Xie, J. Wan, S. Shah, A. Ghodsi, P. Wendell, M. Zaharia, R. Xin, Free Dolly: Introducing the world’s first truly open instruction-tuned LLM, 2023. URL: https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm.</p>
          <p>[26] E. Sanchez-Bayona, R. Agerri, Leveraging a new Spanish corpus for multilingual and cross-lingual metaphor detection, in: Proceedings of the 26th Conference on Computational Natural Language Learning (CoNLL), Association for Computational Linguistics, Abu Dhabi, United Arab Emirates (Hybrid), 2022, pp. 228–240. URL: https://aclanthology.org/2022.conll-1.16. doi:10.18653/v1/2022.conll-1.16.</p>
          <p>[27] V. Kovatchev, M. Taulé, InferES: A natural language inference corpus for Spanish featuring negation-based contrastive and adversarial examples, in: Proceedings of the 29th International Conference on Computational Linguistics, International Committee on Computational Linguistics, Gyeongju, Republic of Korea, 2022, pp. 3873–3884. URL: https://aclanthology.org/2022.coling-1.340.</p>
          <p>[28] W. G. Reijnierse, C. Burgers, T. Krennmayr, G. Steen, The role of co-text in the analysis of potentially deliberate metaphor, in: Drawing Attention to Metaphor: Case studies across time periods, cultures and modalities, John Benjamins Publishing Company, 2020, pp. 15–38.</p>
          <p>[29] J. Manyika, An overview of Bard: An early experiment with generative AI, Technical Report, Google AI, 2023. URL: https://ai.google/static/documents/google-about-bard.pdf.</p>
          <p>[30] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al., LLaMA: Open and efficient foundation language models, Computing Research Repository, arXiv:2302.13971. Version 1 (2023).</p>
          <p>[31] W.-L. Chiang, Z. Li, Z. Lin, Y. Sheng, Z. Wu, H. Zhang, L. Zheng, S. Zhuang, Y. Zhuang, J. E. Gonzalez, I. Stoica, E. P. Xing, Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality, 2023. URL: https://lmsys.org/blog/2023-03-30-vicuna/.</p>
          <p>[32] E. G. Bard, D. Robertson, A. Sorace, Magnitude estimation of linguistic acceptability, Language 72 (1996) 32–68. URL: http://www.jstor.org/stable/416793.</p>
          <p>[33] J. Novikova, O. Dušek, V. Rieser, RankME: Reliable human ratings for natural language generation, in: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), Association for Computational Linguistics, New Orleans, Louisiana, 2018, pp. 72–78. URL: https://aclanthology.org/N18-2012. doi:10.18653/v1/N18-2012.</p>
          <p>[34] A. Turpin, F. Scholer, S. Mizzaro, E. Maddalena, The benefits of magnitude estimation relevance assessments for information retrieval evaluation, in: Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’15, Association for Computing Machinery, New York, NY, USA, 2015, pp. 565–574. URL: https://doi.org/10.1145/2766462.2767760. doi:10.1145/2766462.2767760.</p>
          <p>[35] S. Santhanam, S. Shaikh, Understanding the impact of experiment design for evaluating dialogue system output, in: Proceedings of the The Fourth Widening Natural Language Processing Workshop, Association for Computational Linguistics, Seattle, USA, 2020, pp. 124–127. URL: https://aclanthology.org/2020.winlp-1.33. doi:10.18653/v1/2020.winlp-1.33.</p>
          <p>[36] R. Doust, P. Piwek, A model of suspense for narrative generation, in: Proceedings of the 10th International Conference on Natural Language Generation, Association for Computational Linguistics, Santiago de Compostela, Spain, 2017, pp. 178–187. URL: https://aclanthology.org/W17-3527. doi:10.18653/v1/W17-3527.</p>
          <p>[37] S. Pita-Fernández, Determinación del tamaño muestral, Cuadernos de atención primaria 3 (1996) 138–141.</p>
          <p>[38] C. Barros, M. Vicente, E. Lloret, To what extent does content selection affect surface realization in the context of headline generation?, Computer Speech &amp; Language 67 (2021) 101179. URL: https://www.sciencedirect.com/science/article/pii/S0885230820301121. doi:10.1016/j.csl.2020.101179.</p>
          <p>[39] Y. G. Vázquez, A. F. Orquín, A. M. Guijarro, S. V. Pérez, Integración de recursos semánticos basados en WordNet, Procesamiento del lenguaje natural 45 (2010) 161–168.</p>
          <p>[40] A. Siddharthan, N. Katsos, Offline sentence processing measures for testing readability with users, in: Proceedings of the First Workshop on Predicting and Improving Text Readability for target reader populations, Association for Computational Linguistics, Montréal, Canada, 2012, pp. 17–24. URL: https://aclanthology.org/W12-2203.</p>
          <p>[41] M. R. Costa-jussà, An analysis of gender bias studies in natural language processing, Nature Machine Intelligence 1 (2019) 495–496.</p>
          <p>[42] S. R. Bowman, G. Angeli, C. Potts, C. D. Manning, A large annotated corpus for learning natural language inference, in: Proceedings of 2015 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Lisbon, Portugal, 2015, pp. 632–642. URL: http://aclanthology.lst.uni-saarland.de/D15-1075.pdf.</p>
        </sec>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Frieske</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Ishii</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. J.</given-names>
            <surname>Bang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Madotto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Fung</surname>
          </string-name>
          ,
          <article-title>Survey of hallucination in natural language generation</article-title>
          ,
          <source>ACM Computing Surveys</source>
          <volume>55</volume>
          (
          <year>2023</year>
          ). URL: https://doi.org/10.1145/3571730. doi:10.1145/3571730.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M. a. T. Yan</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>From semantics to pragmatics: Where IS can lead in natural language processing (NLP) research</article-title>
          ,
          <source>European Journal of Information Systems</source>
          <volume>30</volume>
          (
          <year>2021</year>
          )
          <fpage>569</fpage>
          -
          <lpage>590</lpage>
          . doi:10.1080/0960085X.2020.1816145.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>B.</given-names>
            <surname>Newman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Cohn-Gordon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Potts</surname>
          </string-name>
          ,
          <article-title>Communication-based evaluation for natural language generation</article-title>
          ,
          <source>in: Proceedings of the Society for Computation in Linguistics</source>
          <year>2020</year>
          ,
          <article-title>Association for Computational Linguistics</article-title>
          , New York, New York,
          <year>2020</year>
          , pp.
          <fpage>116</fpage>
          -
          <lpage>126</lpage>
          . URL: https://aclanthology.org/2020.scil-1.16.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J. C. B.</given-names>
            <surname>Cruz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. K.</given-names>
            <surname>Resabal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. J.</given-names>
            <surname>Velasco</surname>
          </string-name>
          , C. Cheng,
          <article-title>Exploiting news article structure for automatic corpus generation of entailment datasets</article-title>
          , in: D. N. Pham,
          <string-name>
            <given-names>T.</given-names>
            <surname>Theeramunkong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Governatori</surname>
          </string-name>
          , F. Liu (Eds.),
          <source>PRICAI 2021: Trends in Artificial Intelligence</source>
          , Springer International Publishing, Cham,
          <year>2021</year>
          , pp.
          <fpage>86</fpage>
          -
          <lpage>99</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M. E.</given-names>
            <surname>Vallecillo-Rodríguez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Montejo-Raéz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. T.</given-names>
            <surname>Martín-Valdivia</surname>
          </string-name>
          ,
          <article-title>Automatic counter-narrative generation for hate speech in spanish</article-title>
          ,
          <source>Procesamiento del Lenguaje Natural</source>
          <volume>71</volume>
          (
          <year>2023</year>
          )
          <fpage>227</fpage>
          -
          <lpage>245</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>K.</given-names>
            <surname>Sengupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Maher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Groves</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Olieman</surname>
          </string-name>
          ,
          <article-title>Genbit: measure and mitigate gender bias in language datasets</article-title>
          ,
          <source>Microsoft Journal of Applied Research</source>
          <volume>16</volume>
          (
          <year>2021</year>
          )
          <fpage>63</fpage>
          -
          <lpage>71</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>T.</given-names>
            <surname>Surana</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.-N.</given-names>
            <surname>Ho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Tun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. S.</given-names>
            <surname>Chng</surname>
          </string-name>
          ,
          <article-title>CASSI: Contextual and semantic structure-based interpolation augmentation for low-resource NER</article-title>
          , in: H.
          <string-name>
            <surname>Bouamor</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Pino</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          Bali (Eds.),
          <source>Findings of the Association for Computational Linguistics: EMNLP</source>
          <year>2023</year>
          ,
          <article-title>Association for Computational Linguistics</article-title>
          , Singapore,
          <year>2023</year>
          , pp.
          <fpage>9729</fpage>
          -
          <lpage>9742</lpage>
          . URL: https://aclanthology.org/2023.findings-emnlp.651. doi:10.18653/v1/2023.findings-emnlp.651.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>B. T.</given-names>
            <surname>Johns</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. N.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <article-title>Content matters: Measures of contextual diversity must consider semantic content</article-title>
          ,
          <source>Journal of Memory and Language</source>
          <volume>123</volume>
          (
          <year>2022</year>
          )
          <article-title>104313</article-title>
          . URL: https://www.sciencedirect.com/
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>