=Paper= {{Paper |id=Vol-3846/paper04 |storemode=property |title=COCOTEROS: A Spanish Corpus with Contextual Knowledge for Natural Language Generation |pdfUrl=https://ceur-ws.org/Vol-3846/paper04.pdf |volume=Vol-3846 |authors=María Miró Maestre,Iván Martínez-Murillo,Elena Lloret,Paloma Moreda,Armando Suárez Cueto |dblpUrl=https://dblp.org/rec/conf/sepln/MaestreMLMC24 }} ==COCOTEROS: A Spanish Corpus with Contextual Knowledge for Natural Language Generation== https://ceur-ws.org/Vol-3846/paper04.pdf
COCOTEROS: A Spanish Corpus with Contextual Knowledge for Natural Language Generation

María Miró Maestre, Iván Martínez-Murillo, Elena Lloret, Paloma Moreda and Armando Suárez Cueto

Dept. of Software and Computing Systems, University of Alicante, Apdo. de Correos 99, E-03080, Alicante, Spain


Abstract
Contextual information is one of the key elements when automatically generating language from a more semantic-pragmatic perspective. To contribute to the study of this linguistic aspect, we present COCOTEROS, a COrpus of COnTextual TExt geneRatiOn in Spanish, available at https://huggingface.co/datasets/gplsi/cocoteros. The corpus is composed of pairs of sentences and automatically generated contexts. To create it, we implemented a semi-automatic, weakly supervised methodology. Taking the Spanish section of the Tatoeba dataset as a reference, we filtered the sentences according to our research purpose. Then, we determined several linguistic parameters that the generated contexts need to fulfil with respect to their reference sentence. Finally, contexts were automatically generated through prompt engineering with Google's large language model Bard. Furthermore, we performed two types of evaluation to check both the linguistic quality of the corpus and the presence of gender bias in it: the former through manual magnitude estimation and the latter with the automatic GenBit metric. The results show that COCOTEROS is an appropriate language resource for approaching Natural Language Generation (NLG) tasks from a semantic-pragmatic perspective in Spanish. For instance, the NLG task of concept-to-text generation could benefit from contextual information by generating sentences according to the information provided in the context and a set of given concepts. Additionally, in question answering, the inclusion of linguistic context can serve as a guide on what information to include in the automatically generated answer, enhancing the appropriateness of the output.

Keywords
corpus, contextual information, natural language generation, Spanish, human evaluation, large language models



SEPLN-2024: 40th Conference of the Spanish Society for Natural Language Processing. Valladolid, Spain. 24-27 September 2024.
maria.miro@ua.es (M. M. Maestre); ivan.martinezmurillo@ua.es (I. Martínez-Murillo); elena.lloret@ua.es (E. Lloret); moreda@ua.es (P. Moreda); armando.suarez@ua.es (A. S. Cueto)
ORCID: 0000-0001-7996-4440 (M. M. Maestre); 0009-0007-5684-0083 (I. Martínez-Murillo); 0000-0002-2926-294X (E. Lloret); 0000-0002-7193-1561 (P. Moreda); 0000-0002-8590-3000 (A. S. Cueto)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.


1. Introduction

Natural Language Generation (NLG) systems are steadily improving their performance in a wide range of tasks where the information to be generated is delimited according to the objective of the task, e.g., text summarisation, machine translation or question answering (QA). One of the most important issues these systems have to deal with is the lack of sufficient contextual knowledge, as it prevents NLG models from better adapting the generated text to the communicative situation of each task. This results in crucial problems such as hallucinations and lack of commonsense in the produced text [1]. In fact, one of the current concerns within the NLG discipline [2] is the need to address tasks from a more 'semantic-pragmatic perspective' to solve the contextual inference difficulties that affect the output of the systems at issue. To address the lack of studies that bear in mind these linguistic levels of analysis, NLG is starting to put linguistic context in the research spotlight, given its importance for appropriately understanding human utterances. Indeed, Newman et al. [3] already defended the consideration of context not only to create text automatically but also to assess the suitability of the generated text. This statement comes from the idea that communication-based features help to evaluate the performance of any model that imitates human language. Language itself is used to communicate ideas that are always expressed within a given communicative context, and it is such context that directly affects the structure of the utterance we want to say.

Parallel to this, making NLG systems aware of contextual knowledge involves the creation of new resources, such as datasets, corpora and knowledge bases, to train models in several languages, especially those different from English or low-resourced ones. In the case of Spanish, we observed that most of the recently published corpora hardly address pragmatic-related issues from a contextual perspective, but rather focus on concrete pragmatic aspects such as metaphors to tackle identification tasks. Furthermore, the high performance of Large Language Models (LLMs) recently witnessed within the field of Natural Language Processing (NLP) has allowed researchers to use NLP tools to automatise data collection and corpus creation tasks, therefore reducing the time spent in collecting sufficient data for research purposes




[4, 5].

To bridge the gap of NLG systems that handle more semantic-pragmatic features of language, specifically contextual knowledge, we present COCOTEROS, a COrpus of COntextual TExt geneRatiOn in Spanish. This corpus comprises 4,845 sentences extracted from the existing Tatoeba dataset, together with 4,845 context sentences automatically generated with the Bard¹ language model and manually revised. Given the difficulties inherent to prompt engineering when using LLM-based chatbots, several linguistic parameters were determined to ensure the quality of the outputs automatically generated with Bard, including semantic similarity, length of the generated text, and forbidden keywords, among others. Moreover, we performed a human evaluation experiment based on the magnitude estimation metric with three linguistics specialists to measure the contextual appropriateness of the resulting contexts. In parallel, we measured gender bias with the GenBit tool [6] to verify that our corpus would be useful for NLG tasks without adding gender biases to models trained in further experiments.

In sum, the main contributions of this paper are:

• Expansion of a subset of Tatoeba's corpus with contextual information.
• Proposal of a weakly supervised methodology for building a corpus using prompt engineering.
• Creation of COCOTEROS, a novel Spanish corpus for commonsense NLG that includes contextual information.
• Corpus validation through human assessment with the magnitude estimation method.
• Corpus evaluation of gender bias with the automatic GenBit metric.

We believe this corpus will provide the research community with a valuable resource in Spanish to test the performance of NLG systems in different tasks by considering semantic-pragmatic aspects of communication such as contextual appropriateness. Some of the NLG tasks that could use COCOTEROS are those related to concept-to-text generation, where sets of words are provided and the model has to generate a text given those concepts. These words can have multiple semantic meanings depending on their context, so having a prefixed context within which the sentence has to be generated could help produce a more precise sentence. Moreover, COCOTEROS could also be used to train NLG models to automatically generate sentences in accordance with a given context as input information, therefore improving the model's awareness of the different communicative situations it is trained with. As for NLP, some tasks that have already exploited the role of contextual knowledge for improving their classification systems are Named Entity Recognition (NER) [7], word recognition and lexical processing to boost semantic disambiguation [8], language learning [9] or even healthcare studies devoted to diseases or syndromes which critically affect language [10].

¹ Since 8 February 2024, Bard is known as Gemini.

2. Related Work

In view of the multidisciplinary nature of the task, the following theoretical background focuses on one side on the linguistic notion of context and its approach in NLG research (subsection 2.1). On the other side, subsection 2.2 covers prior NLG research focused on the creation of linguistic resources to address context-related tasks.

2.1. Linguistic Context in NLG Research

Messages need their surrounding communicative context in order to be completely understood [11]. This claim is well accepted within the NLP discipline, as many tasks try to solve context-related linguistic issues to improve the performance of NLP systems, e.g., coreference resolution [12], information retrieval [13], word sense disambiguation [14] or question answering [15]. Context, therefore, becomes a pragmatic element of great interest when processing language automatically. Similarly, when focusing on language generation, there are concrete applications such as dialogue systems where context is usually predetermined so researchers can study the linguistic features surrounding such communicative context [16, 17].

When addressing the task of contextual appropriateness (i.e., how appropriate a context is given a linguistic setting), several conceptions of context may come to mind, as linguistic theories tend to diverge on the definition of context given the wide range of perspectives from which it can be approached [18]. For the sake of the present research, we focus on the linguistic context of a given message, which can be defined as 'any contiguous span of text within a document' [19] or as 'the set of utterances that precedes the current one' [20]. These definitions align with the linguistic dimension of context known as 'intratextual context' (or 'co-text'), which studies the relation of a piece of text to its surrounding text [18].

2.2. Contextual Corpora for NLG

The creation of linguistic resources directly oriented to analysing more complex linguistic phenomena such as context provides added value to the research community, as there are not as many resources available as for other far-reaching linguistic levels of analysis such as syntax or grammar. To motivate the study of this pragmatic element, several resources to analyse context from different perspectives have already been made
available. Castilho et al. [21] created an English corpus annotated with context-aware issues for the task of Machine Translation into Brazilian Portuguese. Regarding dialogue tasks, Udagawa and Aizawa [22] addressed the common grounding problem by collecting a dialogue dataset with continuous and partially-observable context. As for controllable text generation, Lin et al. [23] created the CommonGen task and dataset to test to what extent a generation system can generate text with commonsense reasoning in English. To this end, the task is to generate a coherent sentence that includes several common concepts previously shown to the system. Derived from this work, Carlsson et al. [24] generated the C2Gen dataset of context sentences in English, from which they extracted several keywords that had to be included in an automatically generated text. Finally, a recent English corpus worth mentioning is databricks-dolly-15k, a human-generated instruction corpus created to train the Dolly LLM [25]. This dataset was applied to different contextual tasks such as summarisation of Wikipedia articles or closed QA, where a question and a reference passage are input to the system to get factually correct responses.

Focusing on Spanish resources, Sanchez-Bayona and Agerri [26] generated a corpus of Spanish metaphors, which depend directly on the contextual meaning to be clearly identified by an automatic system. As for Natural Language Inference (NLI), Kovatchev and Taulé [27] compiled the INFERES corpus to check the performance of machine learning systems on negation-based adversarial examples by using context paragraphs from topics extracted from the Spanish Wikipedia.

After a thorough review of the current corpora that address contextual NLG tasks in Spanish, we can say that, to the best of our knowledge, there is no corpus focused on the contextual information generation task in Spanish. Consequently, for this research we build on the previous works by Lin et al. [23] and Carlsson et al. [24] to address the task of contextual information generation in Spanish.

3. Corpus Creation

The following subsections describe the methodology used to create COCOTEROS: i) we explain how the reference sentence dataset was collected and filtered (subsection 3.1); ii) we determine the linguistic constraints that comprise the prompt used to automatically generate contexts (subsection 3.2); iii) we describe the context generation task (subsection 3.3); and iv) we include a manual post-edition to curate the results generated by the LLM (subsection 3.4). Figure 1 shows a visual pipeline of the methodology used for creating COCOTEROS.

Figure 1: Proposed methodology pipeline for corpus creation.

3.1. Data Collection and Filtering

For the present study, we wanted to gather simple Spanish sentences with enough semantic content to automatically generate contexts linked to the situation stated in the reference sentence. We prioritised sentences without too much linguistic information, so that the context does not add extra information beyond the purpose of the task and does not stray too far from the original sentence situation. To this end, we chose the Spanish section of sentences written on the website Tatoeba² as the original dataset from which to select the sentences for which to generate contexts. We first considered using other already published corpora such as CommonGen [23] or C2Gen [24] as original datasets, because they also focus on the task of NLG with contextual information. However, to use these corpora we would have had to translate the original datasets into Spanish, which would imply choosing an appropriate automatic translation tool or manually translating the datasets to adapt the task to Spanish. Also, a further proofreading step would have been necessary to check the accuracy of the translations into Spanish, so we preferred to benefit from an already-existing Spanish dataset that could help us generate our context corpus.

Tatoeba's original dataset includes around 393,000 Spanish sentences either translated from other languages or directly written in Spanish. The dataset includes sentences ranging from 1 up to 44 words per sentence, so we first filtered them by selecting only those sentences composed of either 8 or 9 words, collecting a total of 60,170 Spanish sentences. We chose this section of the dataset after a preliminary preprocessing of an excerpt of the dataset with the spaCy tokenizer³. In this preliminary preprocessing, we noticed that the more words a sentence comprised, the higher the risk of including too much semantic information in the sentence. This could entail the generation of contexts not linked to the original situation stated in the reference sentence. Similarly, we rejected those sentences made up of 7 words or fewer, as many of their keywords lacked enough linguistic information (verbs, nouns, etc.) to generate a context that could be in line with the situation stated in the reference sentence.

² This dataset was released under a CC-BY License and can be found at https://tatoeba.org/es.
³ https://spacy.io/api/tokenizer

3.2. Linguistic Constraints

LLMs can be useful for supporting the automatic creation of corpora to study specific linguistic phenomena that would be very costly to compile manually. Nevertheless, generating a corpus with LLMs from scratch also entails several risks regarding linguistic appropriateness that could worsen the quality of the corpus, as happens with hallucination issues or lack of commonsense.

Therefore, with the aim of automatically creating linguistic contexts referred to a given sentence, and to better control the output of our chosen LLM (further explained in Section 3.3), we determined several linguistic parameters to include in the prompt:

• Definition of context: Following previous studies focused on context as described in Section 2.1, we started our prompt with a simple and straightforward definition of what we consider a linguistic context, so the model could first get the idea of the task to accomplish.
• Reference sentence or synonyms: In the first attempts to find the right prompt to compile the corpus, we observed that, even when including a short definition of linguistic context, the model sometimes generated a context including the reference sentence. Therefore, to better specify the linguistic nature of the context to be generated, we indicated that neither the reference sentence nor a sentence with similar semantic meaning could appear in the context.
• Forbidden keywords: We extracted three keywords from each reference sentence that could semantically define the sentence meaning. The extraction was automatically performed by means of a random choice where we prioritised the selection of two nouns and one verb, as we consider them some of the main linguistic elements that define the semantic meaning of a sentence. Then, we added those batches of three keywords to the prompt as forbidden words that could not appear in the context to be generated. With this restriction, we wanted to ensure that, even if some of the linguistic structures in the reference sentence were repeated in the generated context, the semantic meaning of the context is related to, but departs somewhat from, the reference sentence. This goes in line with the idea that the choice of words influences co-text and meaning potential [28], and we wanted to test to what extent LLMs can generate co-text with the same conceptual background while adding new words that can enlarge the semantic information of the new sentence.
• Maximum context length: Inspired by the work presented in Carlsson et al. [24], we decided that an appropriate length for the generated context could be around 45 words. This decision also comes from preliminary prompt tests where we found that, if no length limitation was included, the model tended to overextend the generation process, creating contexts of more than ten lines of text that strayed too far from the original situation stated in the reference sentence.

3.3. Context Generation

Once we filtered the original dataset, the next step was to generate an appropriate context for each of the selected sentences. For this, we benefited from the capabilities of LLMs and, in particular, we used Bard [29], Google's recent LLM. Our decision was motivated by an empirical study we previously conducted in which several LLMs were compared to check how appropriately they fulfilled the task of generating a context resembling a sentence but without repeating or paraphrasing it. The LLMs
compared were LLaMa⁴ [30], Vicuna⁴ [31], Bard, and ChatGPT⁵. We automatically generated contexts for our subset of sentences of 8 or 9 words with Bard, which could generate a context in an average of 5 seconds. Nevertheless, Bard's public version could be prompted only 130 times per day. The generation process was carried out through a zero-shot prompt that comprised the linguistic restrictions the generated context should or should not fulfil, as stated in Section 3.2. With this setup, we created an initial version of the COCOTEROS corpus with 5,000 contexts.

3.4. Post-editing

Finally, given Bard's predefined chat-like communicative structures, we manually revised and post-edited the resulting contexts by eliminating all the information included in the response that was not the generated context itself (e.g., Bard's output included sentences similar to "Aquí tienes un contexto relacionado para la frase 'Tengo demasiadas cosas en la cabeza estos días'" ["Here is a related context for the sentence 'I have too many things on my mind these days'"] as a preliminary statement for each context⁶). As a remark, there were times when Bard generated several contexts for a single input, giving us the opportunity to choose between them, so we carried out a manual proofreading process where we checked every possible context to choose those that came closest to the conception of context we determined for this research task. In line with this, in those cases where we could choose from two options, we se-

sona] sentado en su escritorio"⁷. Consequently, we had to modify those contexts by completing the missing information with generic concepts or names so we could add the resulting context to the final corpus.

4. COCOTEROS - Corpus of Contextual Text Generation in Spanish

As the first corpus focused on the contextual text generation task for Spanish, COCOTEROS contains a total of 4,845 pairs of reference sentences with their respective generated contexts, as illustrated in Figure 2. Moreover, the corpus includes the three keywords extracted from each reference sentence. The final amount of contexts comes from a previous manual post-edition of the original 5,000 contexts generated with Bard. We performed this post-edition because we noticed sexist content in some of the generated contexts, so we decided to discard those cases straightaway.

Table 1 shows a statistical summary of COCOTEROS. Apart from the general corpus information, we found it interesting to check the average number of sentences and words per context, because Bard sometimes generated contexts with very different lengths. Even though the prompt included the maximum length that the context could have (45 words), we found cases where the context had only 15
                                                                        words, whereas other contexts contained more than four
lected the context describing a female-subject situation.
                                                                        sentences, with a total of more than 50 words.
We made this decision because we detected a somewhat
higher proportion of reference sentences addressing male
subjects, so the generated context was male-gendered too.               Table 1
                                                                        COCOTEROS data summary.
Therefore, in those cases where the reference sentence
was no gender-specific, we prioritised female contexts                                         Data                          Total
to balance gender in COCOTEROS. Further details on                                    Reference sentences                    4,845
how we addressed gender bias in our corpus are shown                                        Keywords                         14,535
in subsection 5.2.                                                                     Generated contexts                    4,845
   In this manual post-editing step we also discarded                                Words in the sentences                  40,827
contexts that were repetitions or paraphrasing of the ref-                           Words in the contexts                  119,885
                                                                                      Words in the corpus                   175,247
erence sentence, as well as those that did not include
                                                                               Average no. of sentences per context            2
enough semantic information to be considered appro-
                                                                                Average no. of words per context               25
priately generated contexts. Within the rest of contexts
we kept in COCOTEROS, there were times where Bard
left some of the concepts in the generated text incom-                     The official version of COCOTEROS corpus is available
plete so the user could complete it according to his/her                at https://huggingface.co/datasets/gplsi/cocoteros. With
preferences, as in “Nos encontramos a [nombre de la per-                this, we aim to contribute to NLG research with a new
                                                                        language resource for studying contextual information
                                                                        generation in Spanish, as well as for other unexplored
     4
       The tested version of LLama was llama-2-70b-chat, and            NLG tasks that can benefit from our corpus to address
Vicuna’s version was vicuna-33b.            They were tested on
https://chat.lmsys.org/
                                                                        further research questions.
     5
       Tested version of ChatGPT was GPT 3.5 on
https://chat.openai.com/
     6
       Example translated into English for clarity purposes: Here’s a
                                                                            7
related context for the phrase “I have too many things on my mind             Example translated into English for clarity purposes: We found
these days”                                                             [name of the person] sit on his/her desk
Figure 2: Excerpt of COCOTEROS corpus. Examples translated into English for clarity purposes.
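The per-context statistics reported in Table 1 can be recomputed with a simple counting pass over the sentence-context pairs. The sketch below is a minimal illustration over two invented records; the field names ("sentence", "context") and the naive sentence splitter are assumptions, not the official corpus schema or the authors' counting procedure.

```python
# Recomputing Table 1-style statistics from sentence-context pairs.
# The two records are invented examples; field names are illustrative.
records = [
    {"sentence": "Tengo demasiadas cosas en la cabeza estos días.",
     "context": "María no ha parado en toda la semana. Entre el trabajo y la mudanza apenas duerme."},
    {"sentence": "Hoy hace mucho calor en la ciudad.",
     "context": "El verano llegó antes de tiempo. Nadie salía a la calle al mediodía."},
]

def words(text):
    # Whitespace tokenisation is a rough proxy for word counts.
    return len(text.split())

def sentences(text):
    # Naive boundary count: one sentence per '.', '!' or '?'.
    return sum(text.count(p) for p in ".!?")

total_sentence_words = sum(words(r["sentence"]) for r in records)
total_context_words = sum(words(r["context"]) for r in records)
avg_sentences_per_context = sum(sentences(r["context"]) for r in records) / len(records)
avg_words_per_context = total_context_words / len(records)
```

Run over the full 4,845 pairs, counts of this kind yield the totals and averages shown in Table 1.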



5. Corpus Evaluation

To ensure that the contexts included in COCOTEROS are appropriate for contextual generation tasks, we evaluated them taking into account different aspects: context appropriateness, with the manual magnitude estimation method (Subsection 5.1), and gender bias, through the automatic GenBit metric (Subsection 5.2).

5.1. Context Appropriateness

With the evolution of the latest LLMs, researchers face the need for consistent evaluation metrics that help them assess the outputs these models provide when testing their performance on language generation tasks. To this end, we performed an experiment based on the magnitude estimation method [32] with the help of three linguistics specialists. Magnitude estimation is a method generally used in psychology to check the reaction of different subjects when presented with several stimuli. To measure the different levels of reaction subjects can have, they first assign a score to an initial stimulus (in our case, the generated context), with no ranges or limits determined. Then, when a second stimulus is presented, they compare it with the first one and, depending on the intensity of their reaction, adjust its score relative to the score they assigned to the first stimulus. In this manner, if a subject's reaction to the second stimulus is twice as strong as to the first, they double the score they assign to the second stimulus. This method has been used successfully for evaluating automatically generated text in several NLG tasks [33, 34, 35, 36], as researchers demonstrated that it helps detect more linguistic nuances, as well as more distinctive rankings when comparing outputs between annotators, than other more common methods such as Likert scales [33, 35].

Taking this method as a basis, we wanted to measure the appropriateness of the generated context given its reference sentence. For this, we took a representative sample of sentences and contexts from the COCOTEROS corpus through Formula 1, presented in [37] and previously used in [38]:

    M = (N · K² · P · Q) / (E² · (N − 1) + K² · P · Q)        (1)

where N is the population, K the confidence interval, P the probability of success, Q the probability of failure, and E the error rate. The population N was 4,845 sentences and their respective contexts, and the values given to the rest of the parameters were taken as presented in [39], so that K=0.95, E=0.05, P=0.5, and Q=0.5. Once the formula was calculated, the resulting number of sentences M for testing contextual appropriateness was rounded to 90 sentences with their respective contexts. This subset of 90 sentences and contexts was selected at random from the final COCOTEROS corpus. With the subset of contexts already determined, we performed the magnitude estimation analysis to validate the generated contexts.

To accomplish this, we explained the methodology for scoring the subset of 90 generated contexts to the annotators, with the only requirement that the lowest score they could assign was 1. In this manner, we ensured the subsequent normalisation of the values each of them might assign to each context. As a remark, we noticed that two annotators scored contexts on a 1 to 100 scale, even when we highlighted that there were no restrictions on the values they could choose for each context.

Figure 3: Results of Z-score normalised values for the magnitude estimation evaluation. Values higher than 0 indicate appropriate contexts, whereas negative values show unsuitably generated contexts.
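The sample-size computation of Formula 1 and the z-score normalisation of Formula 2 can be checked numerically. The sketch below plugs in the parameter values reported in the paper (N=4,845, K=0.95, P=Q=0.5, E=0.05) and normalises an invented set of annotator scores; it illustrates the formulas only and is not the authors' evaluation code.

```python
import statistics

# Formula 1 (sample size), with the parameter values reported in the paper.
N = 4845        # population: sentence-context pairs
K = 0.95        # confidence interval value used by the authors
P = Q = 0.5     # probability of success / failure
E = 0.05        # error rate

M = (N * K**2 * P * Q) / (E**2 * (N - 1) + K**2 * P * Q)
print(round(M))  # ~89, rounded up to the 90 pairs evaluated in the paper

# Formula 2 (z-score): normalise one annotator's raw magnitude
# estimation scores. The score list is invented for illustration.
scores = [1, 5, 10, 20, 50]
mu = statistics.mean(scores)
sigma = statistics.pstdev(scores)  # population standard deviation
z = [(x - mu) / sigma for x in scores]
```

After normalisation, each annotator's scores have zero mean and unit variance, which makes the three annotators' otherwise incommensurable ranges comparable in Figure 3.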



Once we collected all the scores given by the annotators, we normalised the results by means of the z-score normalisation formula (Formula 2), as used in [40]:

    Z_ih = (x_ih − μ_h) / σ_h        (2)

where Z_ih is annotator h's z-score for a context to which annotator h gave a magnitude estimation score of x_ih, and μ_h and σ_h are the mean and standard deviation of the set of magnitude estimation scores for annotator h.

Figure 3 shows the normalised results for the magnitude estimation evaluation. The 0 line serves as the mean: values above it indicate contexts with higher scores, and negative values show contexts considered unsuitably generated. As can be seen, the three annotators tend to agree on which linguistic contexts have an appropriate contextual relatedness to the reference sentence, even though each of them used a different range of scores within the magnitude estimation experiment. In spite of a few disagreements among the total of 90 contexts, we observe that the annotators agree that more than half of the corpus sample comprises contexts with appropriate contextual relatedness, while the rest could be improved. After evaluating the results with the annotators, we concluded that they tended to heavily penalise contexts that paraphrased the reference sentence, even if, after the paraphrasing sentence, the context included new excerpts of text that did serve as an appropriate linguistic context.

5.2. Gender Bias

Several methodological issues arise when using an LLM to generate a new language resource intended for further training of LLMs so that they can learn to approach newly emerging NLG problems. One recently detected issue is the presence of gender bias in the human-compiled corpora that LLMs are trained on. This poses a problem for the research community, as the remarkable performance those LLMs currently show is based on data that reflect and amplify societal biases present in naturally occurring texts [41]. To check for possible biases in our corpus, we used the Gender Bias Tool (GenBit) [6] to measure the apparent level of gender bias in the 4,845 generated contexts of COCOTEROS. According to its developers, GenBit helps determine whether gender is distributed uniformly across data by measuring the strength of association between a pre-defined list of gender definition words and other words in the corpus via co-occurrence statistics. Table 2 shows the results obtained after processing COCOTEROS with the Spanish metric provided in GenBit.

Table 2
GenBit gender bias results in the generated contexts.

         Metric        Results
      GenBit Score      0.724
      Female words      0.335
       Male words       0.665

Following the benchmarks stated in Sengupta et al. [6], the GenBit score of COCOTEROS is 0.724, which indicates a moderate gender bias in our corpus. This key metric comes from a parallel calculation in which GenBit computes the percentage of female- or male-gendered definition words that appear in the corpus, resulting in 0.335 and 0.665, respectively, in COCOTEROS.
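The female/male proportions in Table 2 can be pictured with a simplified count of gendered definition words. The sketch below is only an illustration of that proportion calculation: the tiny lexicon is invented, and GenBit's actual score additionally weights co-occurrence statistics between definition words and the rest of the vocabulary.

```python
# Simplified illustration of female/male definition-word proportions.
# The word lists below are NOT GenBit's official Spanish lexicon.
FEMALE = {"ella", "madre", "hija", "mujer", "abuela"}
MALE = {"él", "padre", "hijo", "hombre", "abuelo"}

def gender_proportions(contexts):
    """Return (female, male) shares of gendered-word occurrences."""
    fem = male = 0
    for text in contexts:
        for token in text.lower().split():
            word = token.strip(".,;:!?¡¿\"'")
            if word in FEMALE:
                fem += 1
            elif word in MALE:
                male += 1
    total = fem + male
    if total == 0:
        return 0.0, 0.0
    return fem / total, male / total

contexts = ["Su madre y su padre esperaban.", "Él miró a su abuelo."]
f, m = gender_proportions(contexts)
print(f, m)  # 0.25 0.75
```

As the paper notes, a simple lexicon match of this kind misses gendered proper names and adjectives, which is one reason the GenBit figures cannot be conclusive.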
Considering the results, there seems to be a higher representation of words associated with the male gender than with the female gender. However, these results do not imply that sentences containing female-gendered words are used in a sexist context, but rather that the appearance of female-gendered terms in the corpus is lower. We want to remark on this because the apparent underrepresentation of female-gendered words could easily be remedied by creating contexts parallel to those containing male-gendered words, balancing the representation of both genders while expanding the corpus with further examples. Moreover, we have to bear in mind that words in Spanish have grammatical gender, whereas English words do not. Consequently, a predominance of male-gendered words need not imply that the corpus is gender-biased, but rather that the corpus includes more words linked to that gender, whether those words refer to objects, places or people.

In addition, during our manual post-editing of the 4,845 contexts, we found that many of them described communicative situations where the subject is a woman. However, GenBit does not include female or male proper names and gendered adjectives in its Spanish section, so it cannot consider those contexts as gender-defined, which may also affect the final result of the gender bias metric. Therefore, the results achieved with the GenBit score serve as a first attempt to assess possible gender bias in our corpus, but we believe they cannot be conclusive, given the various examples of gendered sentences in our corpus that the metric does not take into account.


6. Overall Discussion

The results obtained throughout the experimentation process for creating and evaluating COCOTEROS open the door for discussion along several dimensions.

Regarding the magnitude estimation evaluation, this method helped us detect further nuances in the scores each annotator assigned to contexts depending on their appropriateness. Those nuances could be future challenges to address in order to keep discovering how to deal with contextual information in NLG systems. These results also helped us determine one of the modifications to apply to COCOTEROS: in future work we will manually analyse and discard contexts with paraphrasing sentences, so that we only keep linguistic contexts that add contextual information to the reference sentence without resorting to synonyms.

Another key aspect of generating new resources is that they must not contain gender biases. An unbiased dataset is an important factor when training a language model, as bias is mostly introduced through the data used in the training phase of the model. As discussed in Section 5.2, remarkable efforts have been made to balance the number of sentences addressing both genders, as we are aware of the importance of dealing with gender underrepresentation when creating inclusive language resources that comply with gender balance standards. By doing this, we also want to encourage the rest of the community to take similar steps so that NLP resources and LLMs are trained on trustworthy resources with no biases. We used the GenBit tool to measure this, and although the results obtained are as expected, GenBit does not detect some grammatical categories, such as male or female proper names and gendered adjectives, so the results cannot be conclusive.

One problem worth commenting on regarding LLMs is hallucination, which occurs when a generated text is nonsensical or unfaithful to the input source. During post-processing, we detected that some generated contexts suffered from this (e.g., the reference sentence contained the word "father" while the context was generated with "grandfather"; the generated context was written in the masculine form when the reference sentence was in the feminine form; or the case of fabricated data, such as naming Germany the winner of Eurovision 2023, which it was not). Nevertheless, we did not discard these sentences, as our scope was to obtain appropriate contexts. Therefore, future work will focus on detecting and eliminating hallucinations to gather a corpus free of this issue.

Finally, another of the main interests in generating new resources for the NLP community is creating multi-task datasets, so that linguistic resources become valuable and reusable tools that can motivate new research. COCOTEROS will contribute to boosting NLP research specifically addressing semantic and pragmatic aspects for the Spanish language. Although it was originally conceived for NLG, the fact that it contains contexts associated with reference sentences could be beneficial for solving other NLP-related issues such as textual entailment, also known as Natural Language Inference (NLI) [42]. This task focuses on the semantic relations that may exist between several pieces of text and how such relations can be characterised and computationally analysed.


7. Conclusions and Future Work

In this paper we have presented COCOTEROS, a Spanish corpus of contextual knowledge for NLG, containing nearly 5,000 sentences with their corresponding contexts. The creation of COCOTEROS stems from the current need in NLP research to address tasks with a more semantic-pragmatic approach, as occurs with the generation of linguistic context. We also wanted to contribute to the research community with a well-defined Spanish
resource to study contextual aspects in NLG, given the lack of linguistic resources for studying pragmatic aspects of language in languages other than English.

With the aim of verifying the level of linguistic and contextual appropriateness of COCOTEROS, we performed a two-fold evaluation. First of all, we used the magnitude estimation method with the help of three linguistics specialists to measure the linguistic and contextual appropriateness of a representative sample of the generated contexts. Then, we applied the GenBit metric to COCOTEROS to check the level of gender bias our corpus showed. On the one hand, the results of the contextual appropriateness evaluation reflect the difficulty of the contextual generation task even for human annotators, as annotators tended to differ on the degree of appropriateness of each context. Nevertheless, the magnitude estimation metric indicates that more than half of the evaluated contexts were scored favourably. On the other hand, the gender bias score shows that, with a few modifications, we could reduce the presence of gender bias in the corpus to a large extent. However, the resulting bias score cannot be conclusive, as the metric did not consider some of the gender-related linguistic features the generated contexts included.

Several research directions are planned for future work. First, we would like to improve our resource, so further experiments will be made to balance gender representation in COCOTEROS, as well as to extend the number of contexts so that this Spanish resource may help address NLP tasks that need larger amounts of data. Finally, we aim to devote a branch of future research to adapting the COCOTEROS corpus to the task of intention identification, to better understand what makes humans have a particular intention when uttering a message, based on the context surrounding such intention. At the same time, we would check whether LLMs can better detect specific communicative intentions depending on reference sentences and their linguistic context.


Acknowledgments

The research work conducted is part of the R&D projects "CORTEX: Conscious Text Generation" (PID2021-123956OB-I00), funded by MCIN/AEI/10.13039/501100011033/ and by "ERDF A way of making Europe"; "CLEAR.TEXT: Enhancing the modernization of public sector organizations by deploying Natural Language Processing to make their digital content CLEARER to those with cognitive disabilities" (TED2021-130707B-I00), funded by MCIN/AEI/10.13039/501100011033 and "European Union NextGenerationEU/PRTR"; and the project "NL4DISMIS: Natural Language Technologies for dealing with dis- and misinformation" (grant reference CIPROM/2021/21), funded by the Generalitat Valenciana. Moreover, it has also been partially funded by the Ministry of Economic Affairs and Digital Transformation and "European Union NextGenerationEU/PRTR" through the "ILENIA" project (grant number 2022/TL22/00215337) and the "VIVES" subproject (grant number 2022/TL22/00215334).


References

 [1] Z. Ji, N. Lee, R. Frieske, T. Yu, D. Su, Y. Xu, E. Ishii, Y. J. Bang, A. Madotto, P. Fung, Survey of hallucination in natural language generation, ACM Computing Surveys 55 (2023). URL: https://doi.org/10.1145/3571730. doi:10.1145/3571730.
 [2] M. a. T. Yan Li, D. Liu, From semantics to pragmatics: Where IS can lead in natural language processing (NLP) research, European Journal of Information Systems 30 (2021) 569–590. doi:10.1080/0960085X.2020.1816145.
 [3] B. Newman, R. Cohn-Gordon, C. Potts, Communication-based evaluation for natural language generation, in: Proceedings of the Society for Computation in Linguistics 2020, Association for Computational Linguistics, New York, New York, 2020, pp. 116–126. URL: https://aclanthology.org/2020.scil-1.16.
 [4] J. C. B. Cruz, J. K. Resabal, J. Lin, D. J. Velasco, C. Cheng, Exploiting news article structure for automatic corpus generation of entailment datasets, in: D. N. Pham, T. Theeramunkong, G. Governatori, F. Liu (Eds.), PRICAI 2021: Trends in Artificial Intelligence, Springer International Publishing, Cham, 2021, pp. 86–99.
 [5] M. E. Vallecillo-Rodríguez, A. Montejo-Raéz, M. T. Martín-Valdivia, Automatic counter-narrative generation for hate speech in Spanish, Procesamiento del Lenguaje Natural 71 (2023) 227–245.
 [6] K. Sengupta, R. Maher, D. Groves, C. Olieman, GenBit: measure and mitigate gender bias in language datasets, Microsoft Journal of Applied Research 16 (2021) 63–71.
 [7] T. Surana, T.-N. Ho, K. Tun, E. S. Chng, CASSI: Contextual and semantic structure-based interpolation augmentation for low-resource NER, in: H. Bouamor, J. Pino, K. Bali (Eds.), Findings of the Association for Computational Linguistics: EMNLP 2023, Association for Computational Linguistics, Singapore, 2023, pp. 9729–9742. URL: https://aclanthology.org/2023.findings-emnlp.651. doi:10.18653/v1/2023.findings-emnlp.651.
 [8] B. T. Johns, M. N. Jones, Content matters: Measures of contextual diversity must consider semantic content, Journal of Memory and Language 123 (2022) 104313. URL: https://www.sciencedirect.com/science/article/pii/S0749596X21000966. doi:10.1016/j.jml.2021.104313.
 [9] T. Heck, D. Meurers, On the relevance and learner dependence of co-text complexity for exercise difficulty, in: D. Alfter, E. Volodina, T. François, A. Jönsson, E. Rennes (Eds.), Proceedings of the 12th Workshop on NLP for Computer Assisted Language Learning, LiU Electronic Press, Tórshavn, Faroe Islands, 2023, pp. 71–84. URL: https://aclanthology.org/2023.nlp4call-1.9.
[10] T. Tyagi, C. G. Magdamo, A. Noori, Z. Li, X. Liu, M. Deodhar, Z. Hong, W. Ge, E. M. Ye, Y.-h. Sheu, H. Alabsi, L. Brenner, G. K. Robbins, S. Zafar, N. Benson, L. Moura, J. Hsu, A. Serrano-Pozo, D. Prokopenko, R. E. Tanzi, B. T. Hyman, D. Blacker, S. S. Mukerji, M. B. Westover, S. Das, Using deep learning to identify patients with cognitive impairment in electronic health records, in: Proceedings of Machine Learning Research ML4H, 2021. arXiv:2111.09115.
[11] J. Verschueren, Context and structure in a theory of pragmatics, Studies in Pragmatics 10 (2008) 14–24.
[12] T. Lai, H. Ji, T. Bui, Q. H. Tran, F. Dernoncourt, W. Chang, A context-dependent gated module for incorporating symbolic semantics into event coreference resolution, in: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, Online, 2021, pp. 3491–3499. URL: https://aclanthology.org/2021.naacl-main.274. doi:10.18653/v1/2021.naacl-main.274.
[13] L. Tamine, M. Daoud, Evaluation in contextual information retrieval: Foundations and recent advances within the challenges of context dynamicity and data privacy, ACM Comput. Surv. 51 (2018). URL: https://doi.org/10.1145/3204940. doi:10.1145/3204940.
[14] C. Hadiwinoto, H. T. Ng, W. C. Gan, Improved word sense disambiguation using pre-trained con-
     https://aclanthology.org/2023.findings-eacl.60. doi:10.18653/v1/2023.findings-eacl.60.
[16] C. Strathearn, D. Gkatzia, Task2Dial dataset: A novel dataset for commonsense-enhanced task-based dialogue grounded in documents, in: Proceedings of the 4th International Conference on Natural Language and Speech Processing (ICNLSP 2021), Association for Computational Linguistics, Trento, Italy, 2021, pp. 242–251. URL: https://aclanthology.org/2021.icnlsp-1.28.
[17] D. Ghosal, S. Shen, N. Majumder, R. Mihalcea, S. Poria, CICERO: A dataset for contextualized commonsense inference in dialogues, in: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Dublin, Ireland, 2022, pp. 5010–5028. URL: https://aclanthology.org/2022.acl-long.344. doi:10.18653/v1/2022.acl-long.344.
[18] R. Finkbeiner, J. Meibauer, P. B. Schumacher, What is a context? Linguistic approaches and challenges, volume 196 of Linguistik aktuell = linguistics today, John Benjamins Pub. Co., Amsterdam, 2012.
[19] G. Hollis, Delineating linguistic contexts, and the validity of context diversity as a measure of a word's contextual variability, Journal of Memory and Language 114 (2020) 104146. URL: https://www.sciencedirect.com/science/article/pii/S0749596X20300607. doi:10.1016/j.jml.2020.104146.
[20] G. Ferrari, Types of contexts and their role in multimodal communication, Computational Intelligence 13 (1997) 414–426.
[21] S. Castilho, J. L. Cavalheiro Camargo, M. Menezes, A. Way, DELA corpus - a document-level corpus annotated with context-related issues, in: Proceedings of the Sixth Conference on Machine Translation, Association for Computational Linguistics, Online, 2021, pp. 566–577. URL: https://aclanthology.org/2021.wmt-1.63.
     textualized word representations, in: Proceed-          [22] T. Udagawa, A. Aizawa, A natural language
     ings of the 2019 Conference on Empirical Meth-               corpus of common grounding under continuous
     ods in Natural Language Processing and the 9th               and partially-observable context, in: Proceed-
     International Joint Conference on Natural Lan-               ings of the AAAI Conference on Artificial Intel-
     guage Processing (EMNLP-IJCNLP), Association                 ligence, AAAI Press, 2019. URL: https://doi.org/
     for Computational Linguistics, Hong Kong, China,             10.1609/aaai.v33i01.33017120. doi:10.1609/aaai.
     2019, pp. 5297–5306. URL: https://aclanthology.org/          v33i01.33017120.
     D19-1533. doi:10.18653/v1/D19-1533.                     [23] B. Y. Lin, W. Zhou, M. Shen, P. Zhou, C. Bhagavat-
[15] D. Su, M. Patwary, S. Prabhumoye, P. Xu, R. Prenger,         ula, Y. Choi, X. Ren, CommonGen: A constrained
     M. Shoeybi, P. Fung, A. Anandkumar, B. Catan-                text generation challenge for generative common-
     zaro, Context generation improves open domain                sense reasoning, in: Findings of the Association
     question answering, in: Findings of the As-                  for Computational Linguistics: EMNLP 2020, As-
     sociation for Computational Linguistics: EACL                sociation for Computational Linguistics, Online,
     2023, Association for Computational Linguistics,             2020, pp. 1823–1840. URL: https://aclanthology.
     Dubrovnik, Croatia, 2023, pp. 793–808. URL: https:           org/2020.findings-emnlp.165. doi:10.18653/v1/
     2020.findings-emnlp.165.                               [32] E. G. Bard, D. Robertson, A. Sorace, Magnitude
[24] F. Carlsson, J. Öhman, F. Liu, S. Verlinden, J. Nivre,      estimation of linguistic acceptability, Language
     M. Sahlgren, Fine-grained controllable text gen-            72 (1996) 32–68. URL: http://www.jstor.org/stable/
     eration using non-residual prompting, in: Pro-              416793.
     ceedings of the 60th Annual Meeting of the As- [33] J. Novikova, O. Dušek, V. Rieser, RankME: Reliable
     sociation for Computational Linguistics (Volume 1:          human ratings for natural language generation, in:
     Long Papers), Association for Computational Lin-            Proceedings of the 2018 Conference of the North
     guistics, Dublin, Ireland, 2022, pp. 6837–6857. URL:        American Chapter of the Association for Computa-
     https://aclanthology.org/2022.acl-long.471. doi:10.         tional Linguistics: Human Language Technologies,
     18653/v1/2022.acl-long.471.                                 Volume 2 (Short Papers), Association for Compu-
[25] M. Conover, M. Hayes, A. Mathur, J. Xie, J. Wan,            tational Linguistics, New Orleans, Louisiana, 2018,
     S. Shah, A. Ghodsi, P. Wendell, M. Zaharia,                 pp. 72–78. URL: https://aclanthology.org/N18-2012.
     R. Xin, Free Dolly: Introducing the world’s first           doi:10.18653/v1/N18-2012.
     truly open instruction-tuned LLM, 2023. URL: [34] A. Turpin, F. Scholer, S. Mizzaro, E. Maddalena,
     https://www.databricks.com/blog/2023/04/12/                 The benefits of magnitude estimation relevance
     dolly-first-open-commercially-viable-instruction-tuned-llm. assessments for information retrieval evaluation,
[26] E. Sanchez-Bayona, R. Agerri, Leveraging a                  in: Proceedings of the 38th International ACM SI-
     new Spanish corpus for multilingual and cross-              GIR Conference on Research and Development in
     lingual metaphor detection, in: Proceedings of              Information Retrieval, SIGIR ’15, Association for
     the 26th Conference on Computational Natural                Computing Machinery, New York, NY, USA, 2015,
     Language Learning (CoNLL), Association for Com-             p. 565–574. URL: https://doi.org/10.1145/2766462.
     putational Linguistics, Abu Dhabi, United Arab              2767760. doi:10.1145/2766462.2767760.
     Emirates (Hybrid), 2022, pp. 228–240. URL: https: [35] S. Santhanam, S. Shaikh, Understanding the
     //aclanthology.org/2022.conll-1.16. doi:10.18653/           impact of experiment design for evaluating dia-
     v1/2022.conll-1.16.                                         logue system output, in: Proceedings of the The
[27] V. Kovatchev, M. Taulé, InferES : A natural language        Fourth Widening Natural Language Processing
     inference corpus for Spanish featuring negation-            Workshop, Association for Computational Linguis-
     based contrastive and adversarial examples, in:             tics, Seattle, USA, 2020, pp. 124–127. URL: https://
     Proceedings of the 29th International Conference            aclanthology.org/2020.winlp-1.33. doi:10.18653/
     on Computational Linguistics, International Com-            v1/2020.winlp-1.33.
     mittee on Computational Linguistics, Gyeongju, Re- [36] R. Doust, P. Piwek, A model of suspense for narra-
     public of Korea, 2022, pp. 3873–3884. URL: https:           tive generation, in: Proceedings of the 10th Inter-
     //aclanthology.org/2022.coling-1.340.                       national Conference on Natural Language Gener-
[28] W. G. Reijnierse, C. Burgers, T. Krennmayr, G. Steen,       ation, Association for Computational Linguistics,
     The role of co-text in the analysis of potentially          Santiago de Compostela, Spain, 2017, pp. 178–187.
     deliberate metaphor, in: Drawing Attention to               URL: https://aclanthology.org/W17-3527. doi:10.
     Metaphor: Case studies across time periods, cul-            18653/v1/W17-3527.
     tures and modalities, John Benjamins Publishing [37] S. Pita-Fernández, Determinación del tamaño mues-
     Company, 2020, pp. 15–38.                                   tral, Cuadernos de atención primaria 3 (1996) 138–
[29] J. Manyika, An overview of Bard: An early                   141.
     experiment with generative AI, Technical Re- [38] C. Barros, M. Vicente, E. Lloret,                  To what
     port, Tech. rep., Technical report, Google AI,              extent does content selection affect surface re-
     2023. URL: https://ai.google/static/documents/              alization in the context of headline genera-
     google-about-bard.pdf.                                      tion?,       Computer Speech & Language 67
[30] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A.       (2021) 101179. URL: https://www.sciencedirect.com/
     Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Ham-          science/article/pii/S0885230820301121. doi:https:
     bro, F. Azhar, et al., LLaMA: Open and efficient            //doi.org/10.1016/j.csl.2020.101179.
     foundation language models, Computing Research [39] Y. G. Vázquez, A. F. Orquín, A. M. Guijarro, S. V.
     Repository, arXiv:2302.13971. Version 1 (2023).             Pérez, Integración de recursos semánticos basados
[31] W.-L. Chiang, Z. Li, Z. Lin, Y. Sheng, Z. Wu,               en WordNet, Procesamiento del lenguaje natural
     H. Zhang, L. Zheng, S. Zhuang, Y. Zhuang, J. E.             45 (2010) 161–168.
     Gonzalez, I. Stoica, E. P. Xing, Vicuna: An open- [40] A. Siddharthan, N. Katsos, Offline sentence process-
     source chatbot impressing GPT-4 with 90%* Chat-             ing measures for testing readability with users, in:
     GPT quality, 2023. URL: https://lmsys.org/blog/             Proceedings of the First Workshop on Predicting
     2023-03-30-vicuna/.                                         and Improving Text Readability for target reader
     populations, Association for Computational Lin-
     guistics, Montréal, Canada, 2012, pp. 17–24. URL:
     https://aclanthology.org/W12-2203.
[41] M. R. Costa-jussà, An analysis of gender bias studies
     in natural language processing, Nature Machine
     Intelligence 1 (2019) 495–496.
[42] S. R. Bowman, G. Angeli, C. Potts, C. D. Manning, A large annotated corpus for learning natural language inference, in: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Lisbon, Portugal, 2015, pp. 632–642. URL: http://aclanthology.lst.uni-saarland.de/D15-1075.pdf.