=Paper=
{{Paper
|id=Vol-3846/paper04
|storemode=property
|title=COCOTEROS: A Spanish Corpus with Contextual Knowledge for Natural Language Generation
|pdfUrl=https://ceur-ws.org/Vol-3846/paper04.pdf
|volume=Vol-3846
|authors=María Miró Maestre,Iván Martínez-Murillo,Elena Lloret,Paloma Moreda,Armando Suárez Cueto
|dblpUrl=https://dblp.org/rec/conf/sepln/MaestreMLMC24
}}
==COCOTEROS: A Spanish Corpus with Contextual Knowledge for Natural Language Generation==
María Miró Maestre, Iván Martínez-Murillo, Elena Lloret, Paloma Moreda and
Armando Suárez Cueto
Dept. of Software and Computing Systems, University of Alicante, Apdo. de Correos 99, E-03080, Alicante, Spain
Abstract
Contextual information is one of the key elements when automatically generating language from a more semantic-pragmatic perspective. To contribute to the study of this linguistic aspect, we present COCOTEROS, a COrpus of COnTextual TExt geneRatiOn in Spanish, available at https://huggingface.co/datasets/gplsi/cocoteros. The corpus is composed of pairs of sentences and automatically generated contexts. To create it, we implemented a semi-automatic, weakly supervised methodology. Taking the Spanish section of the Tatoeba dataset as a reference, we filtered the sentences according to our research purpose. Then, we determined several linguistic parameters that the generated contexts need to fulfil with respect to their reference sentence. Finally, contexts were automatically generated using prompt engineering with Google's large language model Bard. Furthermore, we performed two types of evaluation to check both the linguistic quality and the presence of gender bias in the corpus: the former through manual assessment with the magnitude estimation method, and the latter with the GenBit automatic metric. The results show that COCOTEROS is an appropriate language resource for approaching Natural Language Generation (NLG) tasks from a semantic-pragmatic perspective for Spanish. For instance, the NLG task of concept-to-text generation could benefit from contextual information by generating sentences according to the information provided in the context and a set of given concepts. Additionally, regarding the task of question answering, the inclusion of linguistic context can enhance the generation of more appropriate answers by serving as a guide on what information to include in the automatically generated answer.
Keywords
corpus, contextual information, natural language generation, Spanish, human evaluation, large language models
SEPLN-2024: 40th Conference of the Spanish Society for Natural Language Processing. Valladolid, Spain. 24-27 September 2024.
maria.miro@ua.es (M. M. Maestre); ivan.martinezmurillo@ua.es (I. Martínez-Murillo); elena.lloret@ua.es (E. Lloret); moreda@ua.es (P. Moreda); armando.suarez@ua.es (A. S. Cueto)
ORCID: 0000-0001-7996-4440 (M. M. Maestre); 0009-0007-5684-0083 (I. Martínez-Murillo); 0000-0002-2926-294X (E. Lloret); 0000-0002-7193-1561 (P. Moreda); 0000-0002-8590-3000 (A. S. Cueto)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

1. Introduction

Natural Language Generation (NLG) systems are steadily improving their performance in a wide range of tasks where the information to be generated is delimited according to the objective of the task, e.g., text summarisation, machine translation or question answering (QA). One of the most important issues those systems have to deal with is the lack of sufficient contextual knowledge, as it prevents NLG models from better adapting the generated text to the communicative situation of each task. This results in crucial problems such as hallucination and lack of commonsense in the produced text [1]. In fact, one of the current concerns within the NLG discipline [2] is the need to address tasks from a more 'semantic-pragmatic perspective' to solve the contextual inference difficulties that affect the output of the systems at issue. To address the lack of studies that bear in mind these linguistic levels of analysis, NLG research is starting to put linguistic context in the spotlight, given its importance for appropriately understanding human utterances. Indeed, Newman et al. [3] already defended the consideration of context not only to create text automatically but also to assess the suitability of the generated text. This statement comes from the idea that communication-based features help to evaluate the performance of any model that imitates human language. Language is used to communicate ideas that are always expressed within a given communicative context, and it is this context that directly affects the structure of the utterance we want to produce.

In parallel, making NLG systems aware of contextual knowledge involves the creation of new resources, such as datasets, corpora and knowledge bases, to train models in several languages, especially languages other than English or low-resourced ones. In the case of Spanish, we observed that most of the recently published corpora hardly address pragmatic-related issues from a contextual perspective, but rather focus on concrete pragmatic aspects such as metaphors to tackle identification tasks. Furthermore, the high performance of Large Language Models (LLMs) recently witnessed within the field of Natural Language Processing (NLP) has allowed researchers to use NLP tools to automatise data collection and corpus creation tasks, therefore reducing the time spent in collecting sufficient data for research purposes [4, 5].

To bridge the gap of NLG systems that handle more semantic-pragmatic features of language, specifically contextual knowledge, we present COCOTEROS, a COrpus of COnTextual TExt geneRatiOn in Spanish. This corpus comprises 4,845 sentences extracted from the existing Tatoeba dataset, together with 4,845 context sentences automatically generated with the Bard language model (since 8th February 2024, Bard is known as Gemini) and manually revised. Given the difficulties inherent to prompt engineering when using LLM-based chatbots, several linguistic parameters were determined to ensure the quality of the outputs automatically generated with Bard, including semantic similarity, length of the generated text, and forbidden keywords, among others. Moreover, we performed a human evaluation experiment based on the magnitude estimation method with three linguistics specialists to measure the contextual appropriateness of the resulting contexts. In parallel, we measured gender bias with the GenBit tool [6] to verify that our corpus would be useful for NLG tasks without adding gender biases to models trained in further experiments. In sum, the main contributions of this paper are:

• Expansion of a subset of Tatoeba's corpus with contextual information.
• Proposal of a weakly supervised methodology for building a corpus using prompt engineering.
• Creation of COCOTEROS, a novel Spanish corpus for commonsense NLG that includes contextual information.
• Corpus validation through human assessment with the magnitude estimation method.
• Corpus evaluation of gender bias with the GenBit automatic metric.

We believe this corpus will provide the research community with a valuable resource in Spanish to test the performance of NLG systems in different tasks by considering semantic-pragmatic aspects of communication such as contextual appropriateness. Some of the NLG tasks that could use COCOTEROS are those related to concept-to-text generation, where sets of words are provided and the model has to generate a text given those concepts. These words can have multiple semantic meanings depending on their context, so having a prefixed context within which the sentence has to be generated could help to produce a more precise sentence. Moreover, COCOTEROS could also be used to train NLG models to automatically generate sentences in accordance with a given context as input information, therefore improving the model's awareness of the different communicative situations it is trained with. As for NLP, some tasks that have already exploited the role of contextual knowledge for improving their classification systems are Named Entity Recognition (NER) [7], word recognition and lexical processing to boost semantic disambiguation [8], language learning [9] or even healthcare studies devoted to diseases or syndromes which critically affect language [10].

2. Related Work

In view of the multidisciplinary nature of the task, the following theoretical background focuses on one side on the linguistic notion of context and its approach in NLG research (subsection 2.1). On the other side, subsection 2.2 includes prior NLG research focused on the creation of linguistic resources to address context-related tasks.

2.1. Linguistic Context in NLG Research

Messages need their surrounding communicative context in order to be completely understood [11]. This claim is well accepted within the NLP discipline, as many tasks try to solve context-related linguistic issues to improve the performance of NLP systems, i.e., coreference resolution [12], information retrieval [13], word sense disambiguation [14] or question answering [15]. Context, therefore, becomes a pragmatic element of great interest when processing language automatically. Similarly, when focusing on language generation, there are concrete applications such as dialogue systems where context is usually predetermined, so researchers can study the linguistic features surrounding such communicative context [16, 17].

When addressing the task of contextual appropriateness (i.e., how appropriate a context is given a linguistic setting), several conceptions of context may come to mind, as linguistic theories tend to diverge on the definition of context given the wide range of perspectives from which it can be approached [18]. For the sake of the present research, we focus on the linguistic context of a given message, which can be defined as 'any contiguous span of text within a document' [19] or as 'the set of utterances that precedes the current one' [20]. These definitions align with the linguistic dimension of context known as 'intratextual context' (or 'co-text'), which studies the relation of a piece of text to its surrounding text [18].

2.2. Contextual Corpora for NLG

The creation of linguistic resources directly oriented to analysing more complex linguistic phenomena such as context provides added value to the research community, as there are not as many resources available as for other far-reaching linguistic levels of analysis such as syntax or grammar. To motivate the study of this pragmatic element, several resources to analyse context from different perspectives have already been made
available. Castilho et al. [21] created an English corpus annotated with context-aware issues for the task of Machine Translation into Brazilian Portuguese. Regarding dialogue tasks, Udagawa and Aizawa [22] addressed the common grounding problem by collecting a dialogue dataset with continuous and partially-observable context. As for controllable text generation, Lin et al. [23] created the CommonGen task and dataset to test to what extent a generation system can generate text with commonsense reasoning in English. To this end, the task is to generate a coherent sentence that includes several common concepts previously shown to the system. Derived from this work, Carlsson et al. [24] generated the C2Gen dataset of context sentences in English, from which they extracted several keywords that had to be included in an automatically generated text. Finally, a recent English corpus worth mentioning is databricks-dolly-15k, a human-generated instruction corpus created to train the Dolly LLM [25]. This dataset was applied to different contextual tasks such as summarisation of Wikipedia articles or closed QA, where a question and a reference passage are input to the system to get factually correct responses.

Focusing on Spanish resources, Sanchez-Bayona and Agerri [26] generated a corpus of Spanish metaphors, which depend directly on the contextual meaning to be clearly identified by an automatic system. As for Natural Language Inference (NLI), Kovatchev and Taulé [27] compiled the INFERES corpus to check the performance of machine learning systems on negation-based adversarial examples by using context paragraphs from topics extracted from the Spanish Wikipedia.

After a thorough review of the current corpora that address contextual NLG tasks in Spanish, we can say that, to the best of our knowledge, there is no corpus focused on the contextual information generation task in Spanish. Consequently, this research builds on the previous works by Lin et al. [23] and Carlsson et al. [24] to address the task of contextual information generation in Spanish.

3. Corpus Creation

The following subsections describe the steps of the methodology used to create COCOTEROS: i) we explain the collection process of the reference sentence dataset and how we filtered it (subsection 3.1); ii) we determine the linguistic constraints that comprise the prompt used to generate contexts automatically (subsection 3.2); iii) we describe the context generation task (subsection 3.3); and iv) we include a manual post-edition to curate the results generated by the LLM (subsection 3.4). Figure 1 shows a visual pipeline of the methodology used for creating COCOTEROS.

Figure 1: Proposed methodology pipeline for corpus creation.

3.1. Data Collection and Filtering

For the present study, we wanted to gather simple Spanish sentences with enough semantic content to automatically generate contexts linked to the situation stated in the reference sentence. We prioritised sentences without too much linguistic information, so that the context would not add extra information beyond the purpose of the task nor stray too far from the situation of the original sentence. To this end, we chose the Spanish section of sentences written on the website Tatoeba (released under a CC-BY License and available at https://tatoeba.org/es) as the original dataset from which to select the sentences for generating the contexts. We first considered using other already published corpora such as CommonGen [23] or C2Gen [24] as original datasets, because they also focus on the task of NLG with contextual information. However, to use these corpora we would have had to translate the original datasets into Spanish, which would imply choosing an appropriate automatic translation tool or manually translating the datasets to adapt the task to Spanish. A further proofreading step would also have been necessary to check the accuracy of the translations, so we preferred to benefit from an already-existing Spanish dataset that could help us generate our context corpus.

Tatoeba's original dataset includes around 393,000 Spanish sentences, either translated from other languages or directly written in Spanish. The dataset includes sentences ranging from 1 up to 44 words per sentence, so we first filtered them by selecting only those sentences formed by either 8 or 9 words, collecting a total of 60,170 Spanish sentences. We chose this section of the dataset after preprocessing an excerpt of it with the spaCy tokenizer (https://spacy.io/api/tokenizer). In this preliminary preprocessing, we noticed that the more words a sentence comprised, the higher the risk of including too much semantic information in the sentence. This could entail the generation of contexts not linked to the original situation stated in the reference sentence. Similarly, we rejected those sentences made up of 7 words or fewer, as many of their keywords lacked enough linguistic information (verbs, nouns, etc.) to generate a context in line with the situation stated in the reference sentence.

3.2. Linguistic Constraints

LLMs can be useful for supporting the automatic creation of corpora to study specific linguistic phenomena whose compilation would become very costly if done manually. Nevertheless, generating a corpus with LLMs from scratch also entails several risks regarding linguistic appropriateness that could worsen the quality of the corpus, as happens with hallucination issues or lack of commonsense.

Therefore, with the aim of automatically creating linguistic contexts referring to a given sentence, and to better control the output of our chosen LLM (further explained in Section 3.3), we determined several linguistic parameters to include in the prompt:

• Definition of context: Following previous studies focused on context as described in Section 2.1, we started our prompt with a simple and straightforward definition of what we consider a linguistic context, so that the model could first get the idea of the task to accomplish.

• Reference sentence or synonyms: In the first attempts to find the right prompt to compile the corpus, we observed that, even when including a short definition of linguistic context, the model sometimes generated a context including the reference sentence. Therefore, to better specify the linguistic nature of the context to be generated, we indicated that neither the reference sentence nor a sentence with a similar semantic meaning could appear in the context.

• Forbidden keywords: We extracted three keywords from each reference sentence that could semantically define the sentence meaning. The extraction was performed automatically by means of a random choice where we prioritised the selection of two nouns and one verb, as we consider them some of the main linguistic elements that define the semantic meaning of a sentence. We then added those batches of three keywords to the prompt as forbidden words that could not appear in the context to be generated. With this restriction, we wanted to ensure that, even if some of the linguistic structures in the reference sentence were repeated in the generated context, the semantic meaning of the context would be related to, but differ somewhat from, the reference sentence. This is in line with the idea that the choice of words influences co-text and meaning potential [28], and we wanted to test to what extent LLMs can generate co-text with the same conceptual background while adding new words that enlarge the semantic information of the new sentence.

• Maximum context length: Inspired by the work presented in Carlsson et al. [24], we decided that an appropriate length for the generated context could be around 45 words. This decision also stems from preliminary prompt tests where we found that, if no length limitation was included, the model tended to get carried away in the generation process, creating contexts of more than ten lines of text that strayed too far from the original situation stated in the reference sentence.

3.3. Context Generation

Once we filtered the original dataset, the next step was to generate an appropriate context for each of the selected sentences. For this, we benefited from the capabilities of LLMs and, in particular, we used Bard [29], Google's recent LLM. Our decision was motivated by an empirical study we previously conducted in which several LLMs were compared to check how appropriately they fulfilled the task of generating a context resembling a sentence but without repeating or paraphrasing it. The LLMs
compared were LLaMa [30] (version llama-2-70b-chat) and Vicuna [31] (version vicuna-33b), both tested on https://chat.lmsys.org/, Bard, and ChatGPT (GPT-3.5, tested on https://chat.openai.com/). We automatically generated contexts for our subset of sentences of 8 or 9 words with Bard, which could generate a context in an average of 5 seconds. Nevertheless, Bard's public version could be prompted only 130 times per day. The generation process was carried out through a zero-shot prompt that comprised the linguistic restrictions that the generated context should or should not fulfil, as stated in Section 3.2. With this setup, we created an initial version of the COCOTEROS corpus with 5,000 contexts.

3.4. Post-editing

Finally, given Bard's predefined chat-like communicative structures, we manually revised and post-edited the resulting contexts by eliminating all the information included in the response which was not the generated context itself (e.g., Bard's output included sentences similar to "Aquí tienes un contexto relacionado para la frase 'Tengo demasiadas cosas en la cabeza estos días'" as a preliminary statement for each context; in English, "Here's a related context for the phrase 'I have too many things on my mind these days'"). As a remark, there were times when Bard generated several contexts for a single input, giving us the opportunity to choose between them, so we carried out a manual proofreading process where we checked every possible context to choose those that came closest to the conception of context we determined for this research task. In line with this, in those cases where we could choose from two options, we selected the context describing a female-subject situation. We made this decision because we detected a somewhat higher proportion of reference sentences addressing male subjects, so the generated context was male-gendered too. Therefore, in those cases where the reference sentence was not gender-specific, we prioritised female contexts to balance gender in COCOTEROS. Further details on how we addressed gender bias in our corpus are given in subsection 5.2.

In this manual post-editing step we also discarded contexts that were repetitions or paraphrases of the reference sentence, as well as those that did not include enough semantic information to be considered appropriately generated contexts. Within the rest of the contexts we kept in COCOTEROS, there were times where Bard left some of the concepts in the generated text incomplete so the user could complete it according to his/her preferences, as in "Nos encontramos a [nombre de la persona] sentado en su escritorio" (in English, "We found [name of the person] sitting at his/her desk"). Consequently, we had to modify those contexts by completing the missing information with generic concepts or names so we could add the resulting context to the final corpus.

4. COCOTEROS - Corpus of Contextual Text Generation in Spanish

As the first corpus focused on the contextual text generation task for Spanish, COCOTEROS contains a total of 4,845 pairs of reference sentences with their respective generated contexts, as illustrated in Figure 2. Moreover, the corpus includes the three keywords extracted from each reference sentence. The final amount of contexts comes from a previous manual post-edition of the original 5,000 contexts generated with Bard. We performed this post-edition because we noticed sexist content in some of the generated contexts, so we decided to discard those cases directly.

Table 1 shows a statistical summary of COCOTEROS. Apart from the general corpus information, we found it interesting to check the average number of sentences and words per context because Bard sometimes generated contexts with very different lengths. Even though the prompt included the maximum length that the context could have (45 words), we found cases where the context had only 15 words, whereas other contexts contained more than four sentences, with a total of more than 50 words.

Table 1
COCOTEROS data summary.

    Data                                      Total
    Reference sentences                       4,845
    Keywords                                 14,535
    Generated contexts                        4,845
    Words in the sentences                   40,827
    Words in the contexts                   119,885
    Words in the corpus                     175,247
    Average no. of sentences per context          2
    Average no. of words per context             25

The official version of the COCOTEROS corpus is available at https://huggingface.co/datasets/gplsi/cocoteros. With this, we aim to contribute to NLG research with a new language resource for studying contextual information generation in Spanish, as well as for other unexplored NLG tasks that can benefit from our corpus to address further research questions.
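Each corpus entry pairs a reference sentence with its three extracted keywords and its generated context, and aggregate figures like those in Table 1 can be recomputed from the records. A minimal sketch over two invented toy entries (the field names and example texts are illustrative assumptions, not the dataset's actual schema):

```python
# Toy records mirroring the corpus structure described above: a reference
# sentence of 8-9 words, three keywords (two nouns and a verb), and a
# generated context. All texts here are invented for illustration.
records = [
    {
        "sentence": "Tengo demasiadas cosas en la cabeza estos días.",
        "keywords": ["cosas", "cabeza", "tener"],
        "context": "Última semana del semestre. Los exámenes se acumulan y "
                   "apenas queda tiempo para prepararlos todos.",
    },
    {
        "sentence": "El tren salió de la estación sin previo aviso.",
        "keywords": ["tren", "estación", "salir"],
        "context": "Andén dos, primera hora de la mañana. Los pasajeros "
                   "esperaban impacientes mirando el panel de avisos.",
    },
]

def words(text: str) -> int:
    # Simple whitespace word count, as a rough approximation of the
    # tokeniser-based counts used in the paper.
    return len(text.split())

n_keywords = sum(len(r["keywords"]) for r in records)          # 3 per entry
words_sentences = sum(words(r["sentence"]) for r in records)
words_contexts = sum(words(r["context"]) for r in records)
avg_words_context = words_contexts / len(records)
```

On the full corpus, the same aggregation yields the totals reported in Table 1 (e.g., 14,535 keywords for 4,845 entries).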
Figure 2: Excerpt of COCOTEROS corpus. Examples translated into English for clarity purposes.
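The corpus evaluation below (Section 5) relies on two closed-form calculations: Formula 1 to size a representative evaluation sample, and Formula 2 to z-score-normalise each annotator's magnitude estimation scores. Both can be checked numerically; a minimal sketch using the parameter values reported in the paper, assuming population (not sample) standard deviation, with an invented score list for illustration:

```python
import statistics

def sample_size(n: int, k: float, e: float, p: float = 0.5, q: float = 0.5) -> float:
    # Formula 1: M = (N * K^2 * P * Q) / (E^2 * (N - 1) + K^2 * P * Q)
    return (n * k**2 * p * q) / (e**2 * (n - 1) + k**2 * p * q)

def z_scores(scores):
    # Formula 2: Z_ih = (x_ih - mu_h) / sigma_h, computed per annotator over
    # that annotator's full set of magnitude estimation scores.
    mu = statistics.mean(scores)
    sigma = statistics.pstdev(scores)
    return [(x - mu) / sigma for x in scores]

# Parameter values as reported in the paper: N=4,845, K=0.95, E=0.05, P=Q=0.5.
m = sample_size(4845, k=0.95, e=0.05)

# One annotator's raw magnitude estimation scores (invented example).
normalised = z_scores([1, 2, 5, 10, 100])
```

With these values the formula yields M ≈ 88.6, consistent with the 90 sentence-context pairs the paper reports after rounding; the normalised scores of each annotator have mean 0 and standard deviation 1, which makes the three annotators' differing score ranges comparable.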
5. Corpus Evaluation

To ensure that the contexts included in COCOTEROS are appropriate for contextual generation tasks, we evaluated them taking into account different aspects: context appropriateness with the manual magnitude estimation method (subsection 5.1) and gender bias through the automatic GenBit metric (subsection 5.2).

5.1. Context Appropriateness

With the evolution of the latest LLMs, researchers face a need for consistent evaluation metrics that help them evaluate the outputs provided by these models when testing their performance on language generation tasks. To this end, we performed an experiment based on the magnitude estimation method [32] with the help of three linguistics specialists. Magnitude estimation is a method generally used in psychology to check the reaction of different subjects when presented with several stimuli. To measure the different levels of reaction subjects can have, they need to assign a score to a first stimulus (in our case, the generated context) where no ranges or limits are determined. Then, when a second stimulus is presented, they have to compare it with the first one and, depending on the intensity of their reaction, its score will change based on the score they assigned to the first stimulus. In this manner, if subjects' reaction to the second stimulus is twice as strong as to the first, they have to double the score they assign to the second stimulus. This method has been used successfully for evaluating automatically generated text in several NLG tasks [33, 34, 35, 36], as researchers demonstrated that it helps to detect more linguistic nuances, as well as more distinctive rankings when comparing the outputs between the annotators, in comparison to other more common methods such as Likert scales [33, 35].

Taking this method as a basis, we wanted to measure the appropriateness of the generated context given its reference sentence. For this, we took a representative sample of sentences and contexts from the COCOTEROS corpus through Formula 1, presented in [37] and previously used in [38]:

    M = (N · K² · P · Q) / (E² · (N − 1) + K² · P · Q)    (1)

where N is the population, K the confidence interval, P the probability of success, Q the probability of failure and E the error rate. The population N was 4,845 sentences and their respective contexts, and the values given to the rest of these parameters were taken as presented in [39], so that K=0.95, E=0.05, P=0.5, and Q=0.5. Once the formula was calculated, the resulting number of sentences M for testing contextual appropriateness was rounded to 90 sentences with their respective contexts. This subset of 90 sentences and contexts was selected at random from the final COCOTEROS corpus. With the subset of contexts already determined, we performed the magnitude estimation analysis to validate the generated contexts.

To accomplish this, we explained the methodology for scoring the subset of 90 generated contexts to the annotators, with the only requirement that the lowest score they could assign was 1. In this manner, we ensured the subsequent normalisation of the values each of them might assign to each context. As a remark, we noticed that two annotators scored contexts based on a 1 to 100 ranking, even when we highlighted that there were no restrictions in the values they could choose for each context. Once we collected all the scores made by the annotators, we normalised the results by means of the z-score normalisation formula (Formula 2) as used in [40]:

    Z_ih = (x_ih − μ_h) / σ_h    (2)

where Z_ih is annotator h's z-score for the context when annotator h gave a magnitude estimation score of x_ih to that context, μ_h is the mean and σ_h the standard deviation of the set of magnitude estimation scores for annotator h.

Figure 3 shows the normalised results for the magnitude estimation evaluation. The 0 line serves as the mean, above which values indicate contexts with higher scores, while negative values show those contexts considered not suitably generated.

Figure 3: Results of Z-score normalised values for the magnitude estimation evaluation. Values higher than 0 indicate appropriate contexts, whereas negative values show not-suitably generated contexts.

As can be seen, the three annotators tend to agree on which linguistic contexts have an appropriate contextual relatedness to the reference sentence, even though each of them used a different range of scores within the magnitude estimation experiment. In spite of a few disagreeing cases in the total of 90 contexts, we observe that the annotators agree that more than half of the corpus sample comprises contexts with appropriate contextual relatedness, while the rest could be improved. After evaluating the results with the annotators, we concluded that they tended to heavily penalise contexts that paraphrased the reference sentence, even if, after the paraphrasing sentence, the context included new excerpts of text that indeed served as an appropriate linguistic context.

5.2. Gender Bias

Several methodological issues come to mind when using an LLM to generate a new language resource for further training LLMs so they can learn how to approach new emerging NLG problems. One of those recently detected issues is the presence of gender bias in the human-compiled corpora that LLMs are trained with. This poses a new problem for the research community, as the incredible performance those LLMs currently show is based on data that reflect and amplify societal biases detected in naturally occurring texts [41]. With an eye to checking possible biases in our corpus, we used the Gender Bias Tool (GenBit) [6] to measure the apparent level of gender bias in the 4,845 generated contexts from COCOTEROS. According to its developers, GenBit helps determine if gender is distributed uniformly across data by measuring the strength of association between a pre-defined list of gender definition words and other words in the corpus via co-occurrence statistics. Table 2 shows the results obtained after processing COCOTEROS with the Spanish metric provided in GenBit.

Table 2
GenBit gender bias results in the generated contexts.

    Metric          Results
    GenBit Score    0.724
    Female words    0.335
    Male words      0.665

Following the benchmarks stated in Sengupta et al. [6], the GenBit score of COCOTEROS is 0.724, which indicates a moderate gender bias in our corpus. This key metric comes from a parallel calculation in which GenBit computes the percentage of female- or male-gendered definition words that appear in the corpus, resulting in 0.335 and 0.665, respectively, in COCOTEROS. Considering these results, there seems to be a higher representation of words associated with the male gender than with the female. However, this does not imply that the sentences containing gendered words are used in a sexist context, but rather that the appearance of female-gendered terms in the corpus is lower. We want to remark on this because the apparent underrepresentation of female-gendered words could easily be remedied by creating parallel contexts to those containing male-gendered words, balancing the gender representation while expanding the corpus with further examples. Moreover, we have to bear in mind that words in Spanish have a specific grammatical gender, whereas English words do not. Consequently, a predominance of male-gendered words does not necessarily imply that the corpus is gender biased, but that the corpus includes more words linked to that gender, whether those words refer to objects, places or people.

In addition, during our manual post-editing stage of the 4,845 contexts, we found that many of them described communicative situations where the subject is a woman. However, GenBit does not include female or male proper names and gendered adjectives in its Spanish section, so it cannot consider those contexts as gender-defined, which may also affect the final result of the gender bias

training phase of the model. As discussed in Section 5.2, remarkable efforts have been made to balance the number of sentences addressing both genders, as we are aware of the importance of dealing with gender underrepresentation when creating inclusive language resources that comply with gender balance standards. By doing this, we also want to encourage the rest of the community to take similar steps so that NLP resources and LLMs are trained on trustworthy resources with no biases. We used the GenBit tool for this measurement and, although the results obtained are as expected, it is true that GenBit does not detect some grammatical categories, such as male or female proper names and gendered adjectives, so the results cannot be conclusive.

One problem worth commenting on regarding LLMs is hallucination, which occurs when a text is nonsensical or unfaithful to the input source. During post-processing, we detected that some generated contexts suffered from this (e.g., the reference sentence contained the word "father" while the context was generated with "grandfather"; the generated context was written in the masculine form when the reference sentence was in the feminine form; or the case of fake generated data, such as the winner of Eurovision 2023, which was not Germany). Nevertheless, we did not discard these sentences as our scope was to obtain appropriate contexts. Therefore, future works will focus on detecting and eliminating hallucinations to
metric. Therefore, the results achieved with GenBit score gather a corpus free of this issue.
serve as a first attempt to consider possible gender bias Finally, another of the main interests for generating
in our corpus, but we believe they cannot be conclusive new resources for the NLP community is creating multi-
given the different examples of gendered sentences found task datasets so that linguistic resources become a valu-
in our corpus not considered by the metric. able and reusable tool which can motivate new research.
COCOTEROS will contribute to boosting NLP research
specifically addressing semantic and pragmatic aspects
6. Overall Discussion and for Spanish language. Although it has been originally
conceived for NLG, its nature for containing contexts as-
The results obtained throughout the experimentation
sociated with reference sentences could be beneficial for
process for creating and evaluating COCOTEROS open
solving other NLP-related issues such as textual entail-
the door for discussion along several dimensions.
ment, also known as Natural Language Inference (NLI)
Regarding the magnitude estimation evaluation, this
[42]. This task focuses on the semantic relations that
metric helped us to detect further nuances in the scores
may exist between several pieces of text and how such
each annotator assigned to contexts depending on their
relations can be characterised and computationally anal-
appropriateness. Those nuances could be future chal-
ysed.
lenges to address to keep on discovering knowledge on
how to deal with contextual information in NLG systems.
Therefore, these results helped us to determine one of 7. Conclusions and Future Work
the modifications to apply to COCOTEROS, as in future
work we will manually analyse and discard contexts with In this paper we have presented COCOTEROS, a Span-
paraphrasing sentences, so we only leave linguistic con- ish corpus of contextual knowledge for NLG, contain-
texts that add contextual information to the reference ing nearly 5,000 sentences with their corresponding con-
sentence without using synonyms. texts. The creation of COCOTEROS comes from the cur-
Another key aspect of generating new resources is rent need in NLP research to address tasks with a more
that they must not contain gender biases. An unbiased semantic-pragmatic approach, as it occurs with the gener-
dataset is an important factor when training a language ation of linguistic context. Also, we wanted to contribute
model, as bias is mostly introduced in the data used in the to the research community with a well-defined Spanish
resource to study contextual aspects in NLG, given the funded by the Generalitat Valenciana. Moreover, it has
lack of enough linguistic resources to study pragmatic been also partially funded by the Ministry of Economic
aspects of language for languages other than English. Affairs and Digital Transformation and “European
With the aim of verifying the level of linguistic and con- Union NextGenerationEU/PRTR” through the "ILENIA"
textual appropriateness of COCOTEROS, we performed project (grant number 2022/TL22/00215337) and "VIVES"
a two-fold evaluation. First of all, we used the magni- subproject (grant number 2022/TL22/00215334).
tude estimation method with the help of three linguis-
tics specialists to measure the linguistic and contextual
appropriateness of a representative sample of the gen- References
erated contexts. Then, we applied the GenBit metric to
[1] Z. Ji, N. Lee, R. Frieske, T. Yu, D. Su, Y. Xu, E. Ishii,
COCOTEROS to check the level of gender bias our cor-
Y. J. Bang, A. Madotto, P. Fung, Survey of hal-
pus showed. On the one hand, results on the contextual
lucination in natural language generation, ACM
appropriateness evaluation reflect the difficulties when
Computing Surveys 55 (2023). URL: https://doi.org/
addressing the contextual generation task even for hu-
10.1145/3571730. doi:10.1145/3571730.
man annotators, as annotators tended to differ on the
[2] M. a. T. Yan Li, D. Liu, From semantics to prag-
degree of appropriateness of each context. Nevertheless,
matics: Where IS can lead in natural language pro-
the magnitude estimation metric indicates that more than
cessing (NLP) research, European Journal of Infor-
half of the evaluated contexts were scored favourably. On
mation Systems 30 (2021) 569–590. doi:10.1080/
the other hand, the gender bias metric score shows that,
0960085X.2020.1816145.
with a few modifications, we could reduce the presence of
[3] B. Newman, R. Cohn-Gordon, C. Potts,
gender bias in the corpus to a large extent. However, the
Communication-based evaluation for natural
resulting bias score cannot be conclusive as the metric
language generation, in: Proceedings of the
did not consider some of the gender-linguistic features
Society for Computation in Linguistics 2020,
the generated contexts included.
Association for Computational Linguistics,
Several research directions are planned for future work.
New York, New York, 2020, pp. 116–126. URL:
First, we would like to improve our resource, so further
https://aclanthology.org/2020.scil-1.16.
experiments will be made to balance gender representa-
[4] J. C. B. Cruz, J. K. Resabal, J. Lin, D. J. Velasco,
tion in COCOTEROS, as well as to extend the number
C. Cheng, Exploiting news article structure for au-
of contexts so this Spanish resource may be of help for
tomatic corpus generation of entailment datasets,
addressing NLP tasks that need more amounts of data.
in: D. N. Pham, T. Theeramunkong, G. Governatori,
Finally, we aim to devote a branch of future research to
F. Liu (Eds.), PRICAI 2021: Trends in Artificial Intel-
adapting COCOTEROS corpus to the task of intention
ligence, Springer International Publishing, Cham,
identification to better understand which reasons make
2021, pp. 86–99.
humans have a particular intention when uttering a mes-
[5] M. E. Vallecillo-Rodríguez, A. Montejo-Raéz, M. T.
sage based on the context surrounding such intention.
Martín-Valdivia, Automatic counter-narrative gen-
At the same time, we would check if LLMs can better
eration for hate speech in spanish, Procesamiento
detect specific communicative intentions depending on
del Lenguaje Natural 71 (2023) 227–245.
reference sentences and their linguistic context.
[6] K. Sengupta, R. Maher, D. Groves, C. Olieman, Gen-
bit: measure and mitigate gender bias in language
Acknowledgments datasets, Microsoft Journal of Applied Research 16
(2021) 63–71.
The research work conducted is part of the [7] T. Surana, T.-N. Ho, K. Tun, E. S. Chng, CASSI:
R&D projects “CORTEX: Conscious Text Genera- Contextual and semantic structure-based interpo-
tion” (PID2021-123956OB-I00), funded by MCIN/ lation augmentation for low-resource NER, in:
AEI/10.13039/501100011033/ and by “ERDF A way H. Bouamor, J. Pino, K. Bali (Eds.), Findings of the
of making Europe”; “CLEAR.TEXT:Enhancing the Association for Computational Linguistics: EMNLP
modernization public sector organizations by de- 2023, Association for Computational Linguistics,
ploying Natural Language Processing to make their Singapore, 2023, pp. 9729–9742. URL: https://
digital content CLEARER to those with cognitive aclanthology.org/2023.findings-emnlp.651. doi:10.
disabilities” (TED2021-130707B-I00), funded by 18653/v1/2023.findings-emnlp.651.
MCIN/AEI/10.13039/501100011033 and “European Union [8] B. T. Johns, M. N. Jones, Content matters: Mea-
NextGenerationEU/PRTR”; and the project “NL4DISMIS: sures of contextual diversity must consider seman-
Natural Language Technologies for dealing with dis- and tic content, Journal of Memory and Language 123
misinformation” with grant reference (CIPROM/2021/21) (2022) 104313. URL: https://www.sciencedirect.com/
science/article/pii/S0749596X21000966. doi:https: //aclanthology.org/2023.findings-eacl.60. doi:10.
//doi.org/10.1016/j.jml.2021.104313. 18653/v1/2023.findings-eacl.60.
[9] T. Heck, D. Meurers, On the relevance and learner dependence of co-text complexity for exercise difficulty, in: D. Alfter, E. Volodina, T. François, A. Jönsson, E. Rennes (Eds.), Proceedings of the 12th Workshop on NLP for Computer Assisted Language Learning, LiU Electronic Press, Tórshavn, Faroe Islands, 2023, pp. 71–84. URL: https://aclanthology.org/2023.nlp4call-1.9.
[10] T. Tyagi, C. G. Magdamo, A. Noori, Z. Li, X. Liu, M. Deodhar, Z. Hong, W. Ge, E. M. Ye, Y.-h. Sheu, H. Alabsi, L. Brenner, G. K. Robbins, S. Zafar, N. Benson, L. Moura, J. Hsu, A. Serrano-Pozo, D. Prokopenko, R. E. Tanzi, B. T. Hyman, D. Blacker, S. S. Mukerji, M. B. Westover, S. Das, Using deep learning to identify patients with cognitive impairment in electronic health records, in: Proceedings of Machine Learning Research ML4H, 2021. arXiv:2111.09115.
[11] J. Verschueren, Context and structure in a theory of pragmatics, Studies in Pragmatics 10 (2008) 14–24.
[12] T. Lai, H. Ji, T. Bui, Q. H. Tran, F. Dernoncourt, W. Chang, A context-dependent gated module for incorporating symbolic semantics into event coreference resolution, in: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, Online, 2021, pp. 3491–3499. URL: https://aclanthology.org/2021.naacl-main.274. doi:10.18653/v1/2021.naacl-main.274.
[13] L. Tamine, M. Daoud, Evaluation in contextual information retrieval: Foundations and recent advances within the challenges of context dynamicity and data privacy, ACM Computing Surveys 51 (2018). URL: https://doi.org/10.1145/3204940. doi:10.1145/3204940.
[14] C. Hadiwinoto, H. T. Ng, W. C. Gan, Improved word sense disambiguation using pre-trained contextualized word representations, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics, Hong Kong, China, 2019, pp. 5297–5306. URL: https://aclanthology.org/D19-1533. doi:10.18653/v1/D19-1533.
[15] D. Su, M. Patwary, S. Prabhumoye, P. Xu, R. Prenger, M. Shoeybi, P. Fung, A. Anandkumar, B. Catanzaro, Context generation improves open domain question answering, in: Findings of the Association for Computational Linguistics: EACL 2023, Association for Computational Linguistics, Dubrovnik, Croatia, 2023, pp. 793–808. URL: https://aclanthology.org/2023.findings-eacl.60. doi:10.18653/v1/2023.findings-eacl.60.
[16] C. Strathearn, D. Gkatzia, Task2Dial dataset: A novel dataset for commonsense-enhanced task-based dialogue grounded in documents, in: Proceedings of the 4th International Conference on Natural Language and Speech Processing (ICNLSP 2021), Association for Computational Linguistics, Trento, Italy, 2021, pp. 242–251. URL: https://aclanthology.org/2021.icnlsp-1.28.
[17] D. Ghosal, S. Shen, N. Majumder, R. Mihalcea, S. Poria, CICERO: A dataset for contextualized commonsense inference in dialogues, in: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Dublin, Ireland, 2022, pp. 5010–5028. URL: https://aclanthology.org/2022.acl-long.344. doi:10.18653/v1/2022.acl-long.344.
[18] R. Finkbeiner, J. Meibauer, P. B. Schumacher, What is a context? Linguistic approaches and challenges, volume 196 of Linguistik aktuell = Linguistics today, John Benjamins Pub. Co., Amsterdam, 2012.
[19] G. Hollis, Delineating linguistic contexts, and the validity of context diversity as a measure of a word's contextual variability, Journal of Memory and Language 114 (2020) 104146. URL: https://www.sciencedirect.com/science/article/pii/S0749596X20300607. doi:10.1016/j.jml.2020.104146.
[20] G. Ferrari, Types of contexts and their role in multimodal communication, Computational Intelligence 13 (1997) 414–426.
[21] S. Castilho, J. L. Cavalheiro Camargo, M. Menezes, A. Way, DELA corpus - a document-level corpus annotated with context-related issues, in: Proceedings of the Sixth Conference on Machine Translation, Association for Computational Linguistics, Online, 2021, pp. 566–577. URL: https://aclanthology.org/2021.wmt-1.63.
[22] T. Udagawa, A. Aizawa, A natural language corpus of common grounding under continuous and partially-observable context, in: Proceedings of the AAAI Conference on Artificial Intelligence, AAAI Press, 2019. URL: https://doi.org/10.1609/aaai.v33i01.33017120. doi:10.1609/aaai.v33i01.33017120.
[23] B. Y. Lin, W. Zhou, M. Shen, P. Zhou, C. Bhagavatula, Y. Choi, X. Ren, CommonGen: A constrained text generation challenge for generative commonsense reasoning, in: Findings of the Association for Computational Linguistics: EMNLP 2020, Association for Computational Linguistics, Online, 2020, pp. 1823–1840. URL: https://aclanthology.org/2020.findings-emnlp.165. doi:10.18653/v1/2020.findings-emnlp.165.
[24] F. Carlsson, J. Öhman, F. Liu, S. Verlinden, J. Nivre, M. Sahlgren, Fine-grained controllable text generation using non-residual prompting, in: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Dublin, Ireland, 2022, pp. 6837–6857. URL: https://aclanthology.org/2022.acl-long.471. doi:10.18653/v1/2022.acl-long.471.
[25] M. Conover, M. Hayes, A. Mathur, J. Xie, J. Wan, S. Shah, A. Ghodsi, P. Wendell, M. Zaharia, R. Xin, Free Dolly: Introducing the world's first truly open instruction-tuned LLM, 2023. URL: https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm.
[26] E. Sanchez-Bayona, R. Agerri, Leveraging a new Spanish corpus for multilingual and cross-lingual metaphor detection, in: Proceedings of the 26th Conference on Computational Natural Language Learning (CoNLL), Association for Computational Linguistics, Abu Dhabi, United Arab Emirates (Hybrid), 2022, pp. 228–240. URL: https://aclanthology.org/2022.conll-1.16. doi:10.18653/v1/2022.conll-1.16.
[27] V. Kovatchev, M. Taulé, InferES: A natural language inference corpus for Spanish featuring negation-based contrastive and adversarial examples, in: Proceedings of the 29th International Conference on Computational Linguistics, International Committee on Computational Linguistics, Gyeongju, Republic of Korea, 2022, pp. 3873–3884. URL: https://aclanthology.org/2022.coling-1.340.
[28] W. G. Reijnierse, C. Burgers, T. Krennmayr, G. Steen, The role of co-text in the analysis of potentially deliberate metaphor, in: Drawing Attention to Metaphor: Case studies across time periods, cultures and modalities, John Benjamins Publishing Company, 2020, pp. 15–38.
[29] J. Manyika, An overview of Bard: An early experiment with generative AI, Technical report, Google AI, 2023. URL: https://ai.google/static/documents/google-about-bard.pdf.
[30] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al., LLaMA: Open and efficient foundation language models, Computing Research Repository, arXiv:2302.13971, Version 1 (2023).
[31] W.-L. Chiang, Z. Li, Z. Lin, Y. Sheng, Z. Wu, H. Zhang, L. Zheng, S. Zhuang, Y. Zhuang, J. E. Gonzalez, I. Stoica, E. P. Xing, Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality, 2023. URL: https://lmsys.org/blog/2023-03-30-vicuna/.
[32] E. G. Bard, D. Robertson, A. Sorace, Magnitude estimation of linguistic acceptability, Language 72 (1996) 32–68. URL: http://www.jstor.org/stable/416793.
[33] J. Novikova, O. Dušek, V. Rieser, RankME: Reliable human ratings for natural language generation, in: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), Association for Computational Linguistics, New Orleans, Louisiana, 2018, pp. 72–78. URL: https://aclanthology.org/N18-2012. doi:10.18653/v1/N18-2012.
[34] A. Turpin, F. Scholer, S. Mizzaro, E. Maddalena, The benefits of magnitude estimation relevance assessments for information retrieval evaluation, in: Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '15, Association for Computing Machinery, New York, NY, USA, 2015, pp. 565–574. URL: https://doi.org/10.1145/2766462.2767760. doi:10.1145/2766462.2767760.
[35] S. Santhanam, S. Shaikh, Understanding the impact of experiment design for evaluating dialogue system output, in: Proceedings of the Fourth Widening Natural Language Processing Workshop, Association for Computational Linguistics, Seattle, USA, 2020, pp. 124–127. URL: https://aclanthology.org/2020.winlp-1.33. doi:10.18653/v1/2020.winlp-1.33.
[36] R. Doust, P. Piwek, A model of suspense for narrative generation, in: Proceedings of the 10th International Conference on Natural Language Generation, Association for Computational Linguistics, Santiago de Compostela, Spain, 2017, pp. 178–187. URL: https://aclanthology.org/W17-3527. doi:10.18653/v1/W17-3527.
[37] S. Pita-Fernández, Determinación del tamaño muestral, Cuadernos de atención primaria 3 (1996) 138–141.
[38] C. Barros, M. Vicente, E. Lloret, To what extent does content selection affect surface realization in the context of headline generation?, Computer Speech & Language 67 (2021) 101179. URL: https://www.sciencedirect.com/science/article/pii/S0885230820301121. doi:10.1016/j.csl.2020.101179.
[39] Y. G. Vázquez, A. F. Orquín, A. M. Guijarro, S. V. Pérez, Integración de recursos semánticos basados en WordNet, Procesamiento del Lenguaje Natural 45 (2010) 161–168.
[40] A. Siddharthan, N. Katsos, Offline sentence processing measures for testing readability with users, in: Proceedings of the First Workshop on Predicting and Improving Text Readability for target reader populations, Association for Computational Linguistics, Montréal, Canada, 2012, pp. 17–24. URL: https://aclanthology.org/W12-2203.
[41] M. R. Costa-jussà, An analysis of gender bias studies in natural language processing, Nature Machine Intelligence 1 (2019) 495–496.
[42] S. R. Bowman, G. Angeli, C. Potts, C. D. Manning, A large annotated corpus for learning natural language inference, in: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Lisbon, Portugal, 2015, pp. 632–642. URL: http://aclanthology.lst.uni-saarland.de/D15-1075.pdf.