COCOTEROS: A Spanish Corpus with Contextual Knowledge for Natural Language Generation

María Miró Maestre, Iván Martínez-Murillo, Elena Lloret, Paloma Moreda and Armando Suárez Cueto
Dept. of Software and Computing Systems, University of Alicante, Apdo. de Correos 99, E-03080, Alicante, Spain

Abstract
Contextual information is one of the key elements when automatically generating language from a more semantic-pragmatic perspective. To contribute to the study of this linguistic aspect, we present COCOTEROS, a COrpus of COntextual TExt geneRatiOn in Spanish, available at https://huggingface.co/datasets/gplsi/cocoteros. The corpus is composed of pairs of sentences and automatically generated contexts. To create it, we implemented a semi-automatic, weakly supervised methodology. Taking the Spanish section of the Tatoeba dataset as a reference, we filtered the sentences according to our research purpose. Then, we determined several linguistic parameters that the generated contexts need to fulfil with respect to their reference sentence. Finally, contexts were automatically generated using prompt engineering with Google's large language model Bard. Furthermore, we performed two types of evaluation to check both the linguistic quality of the corpus and the presence of gender bias in it: the former by manually applying the magnitude estimation method, and the latter through the GenBit automatic metric. The results show that COCOTEROS is an appropriate language resource for approaching Natural Language Generation (NLG) tasks from a semantic-pragmatic perspective in Spanish. For instance, the NLG task of concept-to-text generation could benefit from contextual information by generating sentences according to the information provided in the context and a set of given concepts.
Additionally, regarding the task of question answering, the inclusion of linguistic context can enhance the generation of more appropriate answers by serving as a guide on what information to include in the automatically generated answer.

Keywords: corpus, contextual information, natural language generation, Spanish, human evaluation, large language models

SEPLN-2024: 40th Conference of the Spanish Society for Natural Language Processing. Valladolid, Spain. 24-27 September 2024.
maria.miro@ua.es (M. M. Maestre); ivan.martinezmurillo@ua.es (I. Martínez-Murillo); elena.lloret@ua.es (E. Lloret); moreda@ua.es (P. Moreda); armando.suarez@ua.es (A. S. Cueto)
ORCID: 0000-0001-7996-4440 (M. M. Maestre); 0009-0007-5684-0083 (I. Martínez-Murillo); 0000-0002-2926-294X (E. Lloret); 0000-0002-7193-1561 (P. Moreda); 0000-0002-8590-3000 (A. S. Cueto)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

1. Introduction

Natural Language Generation (NLG) systems are steadily improving their performance in a wide range of tasks where the information to be generated is delimited according to the objective of the task, e.g., text summarisation, machine translation or question answering (QA). One of the most important issues these systems have to deal with is the lack of sufficient contextual knowledge, as it prevents NLG models from better adapting the generated text to the communicative situation of each task. This leads to crucial problems such as hallucination and lack of commonsense in the produced text [1].

In fact, one of the current concerns within the NLG discipline [2] is the need to address tasks from a more 'semantic-pragmatic perspective' to solve the contextual inference difficulties that affect the output of such systems. To address the lack of studies that bear in mind these linguistic levels of analysis, NLG is starting to put linguistic context in the research spotlight, given its importance for appropriately understanding human utterances. Indeed, Newman et al. [3] already defended the consideration of context not only to create text automatically but also to assess the suitability of the generated text. This statement comes from the idea that communication-based features help to evaluate the performance of any model that imitates human language. Language itself is used to communicate ideas that are always expressed within a given communicative context, and it is this context that directly affects the structure of the utterances we produce.

Parallel to this, making NLG systems aware of contextual knowledge involves the creation of new resources, such as datasets, corpora and knowledge bases, to train models in several languages, especially those other than English or low-resourced ones. In the case of Spanish, we observed that most recently published corpora hardly address pragmatic-related issues from a contextual perspective, but rather focus on concrete pragmatic aspects such as metaphors to tackle identification tasks. Furthermore, the high performance of Large Language Models (LLMs) recently witnessed within the field of Natural Language Processing (NLP) has allowed researchers to use NLP tools to automatise data collection and corpus creation tasks, therefore reducing the time spent in collecting sufficient data for research purposes [4, 5].
To bridge the gap of NLG systems that handle more semantic-pragmatic features of language, specifically contextual knowledge, we present COCOTEROS, a COrpus of COntextual TExt geneRatiOn in Spanish. This corpus comprises 4,845 sentences extracted from the existing Tatoeba dataset, together with 4,845 context sentences automatically generated with the Bard¹ language model and manually revised. Given the difficulties inherent to prompt engineering when using LLM-based chatbots, several linguistic parameters were determined to ensure the quality of the outputs automatically generated with Bard, including semantic similarity, length of the generated text, and forbidden keywords, among others. Moreover, we performed a human evaluation experiment based on the magnitude estimation metric with three linguistics specialists to measure the contextual appropriateness of the resulting contexts. In parallel, we measured gender bias with the GenBit tool [6] to verify that our corpus would be useful for NLG tasks without adding gender biases to models trained in further experiments.

¹ Since 8th February 2024, Bard is known as Gemini.

In sum, the main contributions of this paper are:

• Expansion of a subset of Tatoeba's corpus with contextual information.
• Proposal of a weakly supervised methodology for building a corpus using prompt engineering.
• Creation of COCOTEROS, a novel Spanish corpus for commonsense NLG that includes contextual information.
• Corpus validation through human assessment with the magnitude estimation method.
• Corpus evaluation of gender bias with the GenBit automatic metric.

We believe this corpus will provide the research community with a valuable resource in Spanish to test the performance of NLG systems in different tasks by considering semantic-pragmatic aspects of communication such as contextual appropriateness. Some of the NLG tasks that could use COCOTEROS are those related to concept-to-text generation, where sets of words are provided and the model has to generate a text given those concepts. These words can have multiple semantic meanings depending on their context, so having a prefixed context within which the sentence has to be generated could help to produce a more precise sentence. Moreover, COCOTEROS could also be used to train NLG models to automatically generate sentences in accordance with a given context provided as input, therefore improving the model's awareness of the different communicative situations it is trained with. As for NLP, some tasks that have already exploited the role of contextual knowledge for improving their classification systems are Named Entity Recognition (NER) [7], word recognition and lexical processing to boost semantic disambiguation [8], language learning [9] or even healthcare studies devoted to diseases or syndromes which critically affect language [10].

2. Related Work

In view of the multidisciplinary nature of the task, the following theoretical background focuses, on one side, on the linguistic notion of context and its treatment in NLG research (subsection 2.1). On the other side, subsection 2.2 covers prior NLG research focused on the creation of linguistic resources to address context-related tasks.

2.1. Linguistic Context in NLG Research

Messages need their surrounding communicative context in order to be completely understood [11]. This claim is well accepted within the NLP discipline, as many tasks try to solve context-related linguistic issues to improve the performance of NLP systems, e.g., coreference resolution [12], information retrieval [13], word sense disambiguation [14] or question answering [15]. Context, therefore, becomes a pragmatic element of great interest when processing language automatically. Similarly, when focusing on language generation, there are concrete applications such as dialogue systems where context is usually predetermined so that researchers can study the linguistic features surrounding such communicative context [16, 17].

When addressing the task of contextual appropriateness (i.e., how appropriate a context is given a linguistic setting), several conceptions of context may come to mind, as linguistic theories tend to diverge on the definition of context given the wide range of perspectives from which it can be approached [18]. For the sake of the present research, we focus on the linguistic context of a given message, which can be defined as 'any contiguous span of text within a document' [19] or as 'the set of utterances that precedes the current one' [20]. These definitions align with the linguistic dimension of context known as 'intratextual context' (or 'co-text'), which studies the relation of a piece of text to its surrounding text [18].

2.2. Contextual Corpora for NLG

The creation of linguistic resources directly oriented to analysing more complex linguistic phenomena such as context provides an added asset to the research community, as there are not as many resources available as for other far-reaching linguistic levels of analysis such as syntax or grammar. To motivate the study of this pragmatic element, several resources to analyse context from different perspectives have already been made available. Castilho et al. [21] created an English corpus annotated with context-aware issues for the task of Machine Translation into Brazilian Portuguese. Regarding dialogue tasks, Udagawa and Aizawa [22] addressed the common grounding problem by collecting a dialogue dataset with continuous and partially-observable context. As for controllable text generation, Lin et al. [23] created the CommonGen task and dataset to test to what extent a generation system can generate text with commonsense reasoning in English. To this end, the task is to generate a coherent sentence that includes several common concepts previously shown to the system. Derived from this work, Carlsson et al. [24] generated the C2Gen dataset of context sentences in English from which they extracted several keywords that had to be included in an automatically generated text. Finally, a recent English corpus worth mentioning is databricks-dolly-15k, a human-generated instruction corpus created to train the Dolly LLM [25]. This dataset was applied to different contextual tasks such as summarisation of Wikipedia articles or closed QA, where a question and a reference passage are input to the system to get factually correct responses.

Focusing on Spanish resources, Sanchez-Bayona and Agerri [26] generated a corpus of Spanish metaphors, which depend directly on the contextual meaning to be clearly identified by an automatic system. As for Natural Language Inference (NLI), Kovatchev and Taulé [27] compiled the INFERES corpus to check the performance of machine learning systems on negation-based adversarial examples by using context paragraphs on topics extracted from the Spanish Wikipedia.

After a thorough review of the current corpora that address contextual NLG tasks in Spanish, we can say that, to the best of our knowledge, there is no corpus focused on the contextual information generation task in Spanish. Consequently, this research builds on the previous works by Lin et al. [23] and Carlsson et al. [24] to address the task of contextual information generation in Spanish.

3. Corpus Creation

The following subsections describe the steps of the methodology used to create COCOTEROS: i) we explain how the reference sentence dataset was collected and filtered (subsection 3.1); ii) we determine the linguistic constraints that comprise the prompt used to generate contexts automatically (subsection 3.2); iii) we describe the context generation task (subsection 3.3); and iv) we apply a manual post-edition to curate the results generated by the LLM (subsection 3.4). Figure 1 shows a visual pipeline of the methodology used for creating COCOTEROS.

Figure 1: Proposed methodology pipeline for corpus creation.

3.1. Data Collection and Filtering

For the present study, we wanted to gather simple Spanish sentences with enough semantic content to automatically generate contexts linked to the situation stated in the reference sentence. We prioritised sentences without too much linguistic information, so that the context does not add extra information beyond the purpose of the task and does not stray too far from the original sentence situation. To this end, we chose the Spanish section of sentences written on the website Tatoeba² as the original dataset from which to select the sentences for which contexts would be generated. We first considered using other already published corpora, such as CommonGen [23] or C2Gen [24], as original datasets, because they also focus on the task of NLG with contextual information. However, to use these corpora we would have had to translate the original datasets into Spanish, which would imply choosing an appropriate automatic translation tool or manually translating the datasets to adapt the task to Spanish. Also, a further proofreading step would have been necessary to check the accuracy of the translations into Spanish, so we preferred to benefit from an already-existing Spanish dataset that could help us generate our context corpus.

² This dataset was released under a CC-BY License and can be found at https://tatoeba.org/es.

Tatoeba's original dataset includes around 393,000 Spanish sentences, either translated from other languages or directly written in Spanish. The dataset includes sentences ranging from 1 up to 44 words per sentence, so we first filtered them by selecting only those sentences composed of either 8 or 9 words, collecting a total of 60,170 Spanish sentences. We chose this section of the dataset after preprocessing an excerpt of it with the spaCy tokenizer³. In this preliminary preprocessing, we noticed that the more words a sentence comprised, the greater the risk of including too much semantic information in it, which could entail the generation of contexts not linked to the original situation stated in the reference sentence. Similarly, we rejected those sentences made up of 7 words or fewer, as many of their keywords lacked enough linguistic information (verbs, nouns, etc.) to generate a context in line with the situation stated in the reference sentence.

³ https://spacy.io/api/tokenizer
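The length filter described in this subsection can be sketched in a few lines. This is a minimal illustration, not the authors' actual code: it approximates the paper's spaCy-based tokenisation with a simple regex word counter, so counts may differ slightly on edge cases such as clitics or hyphenated forms.

```python
import re

def count_words(text: str) -> int:
    # Simple regex word counter over Unicode letters; the paper used the
    # spaCy tokenizer, so exact counts may differ on some sentences.
    return len(re.findall(r"[^\W\d_]+", text))

def keep_sentence(text: str) -> bool:
    # Keep only sentences made up of exactly 8 or 9 words,
    # mirroring the length filter applied to the Tatoeba sentences.
    return count_words(text) in (8, 9)

sentences = [
    "Tengo demasiadas cosas en la cabeza estos días.",  # 8 words: kept
    "Hola.",                                            # 1 word: dropped
]
filtered = [s for s in sentences if keep_sentence(s)]
```

Applied to the full Tatoeba section, a filter of this kind is what reduces the roughly 393,000 sentences to the 60,170 candidates of 8 or 9 words.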
3.2. Linguistic Constraints

LLMs can be useful for supporting the automatic creation of corpora to study specific linguistic phenomena that would be very costly to compile manually. Nevertheless, generating a corpus with LLMs from scratch also entails several risks regarding linguistic appropriateness that could worsen the quality of the corpus, as happens with hallucination issues or lack of commonsense.

Therefore, with the aim of automatically creating linguistic contexts referred to a given sentence, and to better control the output of our chosen LLM (further explained in Section 3.3), we determined several linguistic parameters to include in the prompt:

• Definition of context: Following previous studies focused on context, as described in Section 2.1, we started our prompt with a simple and straightforward definition of what we consider a linguistic context, so the model could first grasp the idea of the task to accomplish.

• Reference sentence or synonyms: In our first attempts to find the right prompt to compile the corpus, we observed that, even with a short definition of linguistic context included, the model sometimes generated a context containing the reference sentence. Therefore, to better specify the linguistic nature of the context to be generated, we indicated that neither the reference sentence nor a sentence with a similar semantic meaning could appear in the context.

• Forbidden keywords: We extracted three keywords from each reference sentence that could semantically define the sentence's meaning. The extraction was performed automatically by means of a random choice that prioritised the selection of two nouns and one verb, as we consider them some of the main linguistic elements that define the semantic meaning of a sentence. Then, we added those batches of three keywords to the prompt as forbidden words that could not appear in the context to be generated. With this restriction, we wanted to ensure that, even if some of the linguistic structures in the reference sentence were repeated in the generated context, the semantic meaning of the context is related to, but differs somewhat from, the reference sentence. This goes in line with the idea that the choice of words influences co-text and meaning potential [28], and we wanted to test to what extent LLMs can generate co-text with the same conceptual background while adding new words that enlarge the semantic information of the new sentence.

• Maximum context length: Inspired by the work presented in Carlsson et al. [24], we decided that an appropriate length for the generated context could be around 45 words. This decision also stems from preliminary prompt tests in which we found that, if no length limitation was included, the model tended to over-elaborate, creating contexts of more than ten lines of text that strayed too far from the original situation stated in the reference sentence.

3.3. Context Generation

Once we had filtered the original dataset, the next step was to generate an appropriate context for each of the selected sentences. For this, we benefited from the capabilities of LLMs and, in particular, we used Bard [29], Google's recent LLM. Our decision was motivated by an empirical study we previously conducted in which several LLMs were compared to check how appropriately they fulfilled the task of generating a context resembling a sentence without repeating or paraphrasing it. The LLMs compared were LLaMA⁴ [30], Vicuna⁴ [31], Bard, and ChatGPT⁵. We automatically generated contexts for our subset of sentences of 8 or 9 words with Bard, which could generate a context in an average of 5 seconds. Nevertheless, Bard's public version could be prompted only 130 times per day. The generation process was carried out through a zero-shot prompt that comprised the linguistic restrictions the generated context should or should not include, as stated in Section 3.2. With this setup, we created an initial version of the COCOTEROS corpus with 5,000 contexts.

⁴ The tested version of LLaMA was llama-2-70b-chat, and Vicuna's version was vicuna-33b. They were tested on https://chat.lmsys.org/
⁵ The tested version of ChatGPT was GPT-3.5 on https://chat.openai.com/

3.4. Post-editing

Finally, given Bard's predefined chat-like communicative structures, we manually revised and post-edited the resulting contexts by eliminating all the information included in the response which was not the generated context itself (e.g., Bard's output included sentences similar to "Aquí tienes un contexto relacionado para la frase 'Tengo demasiadas cosas en la cabeza estos días'"⁶ as a preliminary statement for each context). As a remark, there were times when Bard generated several contexts for a single input, giving us the opportunity to choose between them, so we carried out a manual proofreading process in which we checked every candidate context to choose those closest to the conception of context we determined for this research task. In line with this, in those cases where we could choose between two options, we selected the context describing a female-subject situation. We made this decision because we detected a somewhat higher proportion of reference sentences addressing male subjects, for which the generated context was male-gendered too. Therefore, in those cases where the reference sentence was not gender-specific, we prioritised female contexts to balance gender in COCOTEROS. Further details on how we addressed gender bias in our corpus are given in subsection 5.2.

⁶ Example translated into English for clarity purposes: Here's a related context for the phrase "I have too many things on my mind these days".

In this manual post-editing step we also discarded contexts that were repetitions or paraphrases of the reference sentence, as well as those that did not include enough semantic information to be considered appropriately generated contexts. Within the rest of the contexts kept in COCOTEROS, there were times when Bard left some of the concepts in the generated text incomplete so that the user could fill them in according to their preferences, as in "Nos encontramos a [nombre de la persona] sentado en su escritorio"⁷. Consequently, we had to modify those contexts by completing the missing information with generic concepts or names so that we could add the resulting context to the final corpus.

⁷ Example translated into English for clarity purposes: We found [name of the person] sitting at his/her desk.

4. COCOTEROS - Corpus of Contextual Text Generation in Spanish

As the first corpus focused on the contextual text generation task for Spanish, COCOTEROS contains a total of 4,845 pairs of reference sentences with their respective generated contexts, as illustrated in Figure 2. Moreover, the corpus includes the three keywords extracted from each reference sentence. The final amount of contexts comes from the manual post-edition of the original 5,000 contexts generated with Bard: we noticed sexist content in some of the generated contexts, so we decided to discard those cases straightaway.

Figure 2: Excerpt of COCOTEROS corpus. Examples translated into English for clarity purposes.

Table 1 shows a statistical summary of COCOTEROS. Apart from the general corpus information, we found it interesting to check the average number of sentences and words per context, because Bard sometimes generated contexts of very different lengths. Even though the prompt included the maximum length the context could have (45 words), we found cases where the context had only 15 words, whereas other contexts contained more than four sentences, with a total of more than 50 words.

Table 1: COCOTEROS data summary.

Data                                    Total
Reference sentences                     4,845
Keywords                                14,535
Generated contexts                      4,845
Words in the sentences                  40,827
Words in the contexts                   119,885
Words in the corpus                     175,247
Average no. of sentences per context    2
Average no. of words per context        25

The official version of the COCOTEROS corpus is available at https://huggingface.co/datasets/gplsi/cocoteros. With it, we aim to contribute to NLG research with a new language resource for studying contextual information generation in Spanish, as well as for other unexplored NLG tasks that can benefit from our corpus to address further research questions.
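For illustration, each COCOTEROS entry pairs a reference sentence with its three keywords and its generated context. The sketch below shows one hypothetical record and a Table-1-style statistic; the field names, the keywords and the context value are our own inventions for illustration (only the sentence is the example quoted in Section 3.4), so the dataset card on Hugging Face should be consulted for the real schema.

```python
from dataclasses import dataclass

@dataclass
class Entry:
    sentence: str        # 8-9-word reference sentence taken from Tatoeba
    keywords: tuple      # the three extracted keywords (two nouns, one verb)
    context: str         # Bard-generated, manually post-edited context

# Hypothetical record for illustration only: keywords and context here are
# invented placeholders, not actual corpus values.
entry = Entry(
    sentence="Tengo demasiadas cosas en la cabeza estos días.",
    keywords=("cosas", "cabeza", "tener"),
    context="Se acerca el final del trimestre y los plazos se acumulan.",
)

def avg_words_per_context(entries):
    # Table-1-style statistic: average number of words per generated context.
    return sum(len(e.context.split()) for e in entries) / len(entries)
```

Computed over the whole corpus, this kind of aggregation yields the averages reported in Table 1 (2 sentences and 25 words per context).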
5. Corpus Evaluation

To ensure that the contexts included in COCOTEROS are appropriate for contextual generation tasks, we evaluated them taking into account different aspects: context appropriateness with the manual magnitude estimation method (subsection 5.1) and gender bias through the automatic GenBit metric (subsection 5.2).

5.1. Context Appropriateness

With the evolution of the latest LLMs, researchers face the need for consistent evaluation metrics that help them assess the outputs provided by these models when testing their performance on language generation tasks. To this end, we performed an experiment based on the magnitude estimation method [32] with the help of three linguistics specialists. Magnitude estimation is a method generally used in psychology to check the reaction of different subjects when presented with several stimuli. To measure the different levels of reaction subjects can have, they need to assign a score to a first stimulus (in our case, the generated context) with no predetermined ranges or limits. Then, when a second stimulus is presented, they have to compare it with the first one and, depending on the intensity of their reaction, its score will change relative to the score they assigned to the first stimulus. In this manner, if the subjects' reaction to the second stimulus is twice as strong as to the first, they have to double the score they assign to the second stimulus. This method has been used successfully for evaluating automatically generated text in several NLG tasks [33, 34, 35, 36], as researchers demonstrated that it helps to detect more linguistic nuances, as well as more distinctive rankings when comparing the outputs between annotators, than other more common methods such as Likert scales [33, 35].

Taking this method as a basis, we wanted to measure the appropriateness of the generated context given its reference sentence. For this, we took a representative sample of sentences and contexts from the COCOTEROS corpus through Formula 1, presented in [37] and previously used in [38]:

M = (N · K² · P · Q) / (E² · (N − 1) + K² · P · Q)    (1)

where N is the population, K the confidence interval, P the probability of success, Q the probability of failure and E the error rate. The population N was 4,845 sentences and their respective contexts, and the values given to the rest of the parameters were taken as presented in [39], so that K=0.95, E=0.05, P=0.5, and Q=0.5. Once the formula was calculated, the resulting number of sentences M for testing contextual appropriateness was rounded to 90 sentences with their respective contexts. This subset of 90 sentences and contexts was selected at random from the final COCOTEROS corpus.

With the subset of contexts determined, we performed the magnitude estimation analysis to validate the generated contexts. To accomplish this, we explained the scoring methodology for the subset of 90 generated contexts to the annotators, with the only requirement being that the lowest score they could assign was 1. In this manner, we ensured the subsequent normalisation of the values each of them might assign to each context. As a remark, we noticed that two annotators scored contexts on a 1-to-100 scale, even though we had highlighted that there were no restrictions on the values they could choose for each context. Once we collected all the scores given by the annotators, we normalised the results by means of the z-score normalisation formula (Formula 2), as used in [40]:

Z_ih = (x_ih − μ_h) / σ_h    (2)

where Z_ih is annotator h's z-score for context i when annotator h gave a magnitude estimation score of x_ih to that context, and μ_h and σ_h are the mean and standard deviation of the set of magnitude estimation scores of annotator h.
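Both formulas can be reproduced in a few lines. The sketch below is a minimal illustration under one assumption the paper does not make explicit: we use the population standard deviation for σ_h.

```python
import statistics

def sample_size(N, K=0.95, E=0.05, P=0.5, Q=0.5):
    # Formula 1 with the parameter values reported in the paper; for
    # N = 4,845 this gives roughly 88.6, which the authors round to the
    # 90 sentences used in the evaluation.
    return (N * K**2 * P * Q) / (E**2 * (N - 1) + K**2 * P * Q)

def z_scores(scores):
    # Formula 2: z-score normalisation of one annotator's raw magnitude
    # estimation scores (population standard deviation assumed).
    mu = statistics.mean(scores)
    sigma = statistics.pstdev(scores)
    return [(x - mu) / sigma for x in scores]

m = sample_size(4845)
```

Normalising each annotator's scores separately in this way is what makes the three annotators comparable even though, as noted above, two of them spontaneously used a 1-to-100 scale.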
Figure 3 shows the normalised results for the magnitude estimation evaluation. The 0 line serves as the mean: values above it indicate contexts with higher scores, while negative values show contexts considered not suitably generated. As can be seen, the three annotators tend to agree on which linguistic contexts have an appropriate contextual relatedness to the reference sentence, even though each of them used a different range of scores within the magnitude estimation experiment. In spite of a few disagreeing cases within the total of 90 contexts, we observe that the annotators agree that more than half of the corpus sample comprises contexts with appropriate contextual relatedness, while the rest could be improved. After evaluating the results with the annotators, we concluded that they tended to heavily penalise contexts that paraphrased the reference sentence, even if, after the paraphrasing sentence, the context included new excerpts of text that indeed served as an appropriate linguistic context.

Figure 3: Results of Z-score normalised values for the magnitude estimation evaluation. Values higher than 0 indicate appropriate contexts, whereas negative values show not-suitably generated contexts.

5.2. Gender Bias

Several methodological issues come to mind when using an LLM to generate a new language resource for further training LLMs so they can learn how to approach new emerging NLG problems. One of those recently detected issues is the presence of gender bias in the human-compiled corpora that LLMs are trained with. This poses a new problem for the research community, as the impressive performance those LLMs currently show is based on data that reflect and amplify societal biases present in naturally occurring texts [41]. With an eye to checking possible biases in our corpus, we used the Gender Bias Tool (GenBit) [6] to measure the apparent level of gender bias in the 4,845 generated contexts of COCOTEROS. According to its developers, GenBit helps determine whether gender is distributed uniformly across the data by measuring the strength of association between a pre-defined list of gender definition words and the other words in the corpus via co-occurrence statistics. Table 2 shows the results obtained after processing COCOTEROS with the Spanish metric provided in GenBit.

Table 2: GenBit gender bias results in the generated contexts.

Metric          Results
GenBit Score    0.724
Female words    0.335
Male words      0.665

Following the benchmarks stated in Sengupta et al. [6], the GenBit score of COCOTEROS is 0.724, which indicates a moderate gender bias in our corpus. This key metric comes from a parallel calculation in which GenBit computes the percentage of female- and male-gendered definition words that appear in the corpus, resulting in 0.335 and 0.665, respectively, for COCOTEROS. Considering these results, there seems to be a higher representation of words associated with the male gender than with the female one. However, this does not imply that the sentences containing female-gendered words are used in a sexist context, but rather that the appearance of female-gendered terms in the corpus is lower. We want to remark on this because the apparent underrepresentation of female-gendered words could easily be addressed by creating parallel contexts to those containing male-gendered words, balancing the representation of both genders while expanding our corpus with further examples. Moreover, we have to bear in mind that words in Spanish have a specific grammatical gender, whereas English words do not. Consequently, a predominance of male-gendered words does not necessarily imply that the corpus is gender-biased, but that the corpus includes more words linked to that gender, whether those words refer to objects, places or people.

In addition, during our manual post-editing of the 4,845 contexts, we found that many of them described communicative situations where the subject is a woman. However, GenBit does not include female or male proper names and gendered adjectives in its Spanish section, so it cannot consider those contexts as gender-defined, which may also affect the final result of the gender bias metric. Therefore, the results achieved with the GenBit score serve as a first attempt to consider possible gender bias in our corpus, but we believe they cannot be conclusive given the different examples of gendered sentences found in our corpus that are not considered by the metric.

6. Overall Discussion

The results obtained throughout the experimentation process for creating and evaluating COCOTEROS open the door for discussion along several dimensions.

Regarding the magnitude estimation evaluation, this method helped us to detect further nuances in the scores each annotator assigned to contexts depending on their appropriateness. Those nuances could be future challenges to address in order to keep discovering how to deal with contextual information in NLG systems. These results also helped us to determine one of the modifications to apply to COCOTEROS: in future work we will manually analyse and discard contexts with paraphrasing sentences, so that we only keep linguistic contexts that add contextual information to the reference sentence without using synonyms.

Another key aspect of generating new resources is that they must not contain gender biases. An unbiased dataset is an important factor when training a language model, as bias is mostly introduced through the data used in the training phase of the model. As discussed in Section 5.2, remarkable efforts have been made to balance the number of sentences addressing both genders, as we are aware of the importance of dealing with gender underrepresentation when creating inclusive language resources that comply with gender balance standards. By doing this, we also want to encourage the rest of the community to take similar steps so that NLP resources and LLMs are trained on trustworthy resources with no biases. We used the GenBit tool to measure this and, although the results obtained are as expected, GenBit does not detect some grammatical categories, such as male or female proper names and gendered adjectives, so the results cannot be conclusive.

One problem worth commenting on regarding LLMs is hallucination, which occurs when a text is nonsensical or unfaithful to the input source. During post-processing, we detected that some generated contexts suffered from this issue (e.g., the reference sentence contained the word "father" while the context was generated with "grandfather"; the generated context was written in the masculine form when the reference sentence was in the feminine form; or the case of fake generated data, such as the winner of Eurovision 2023, which was not Germany). Nevertheless, we did not discard these sentences, as our scope was to obtain appropriate contexts. Therefore, future work will focus on detecting and eliminating hallucinations to gather a corpus free of this issue.

Finally, another of the main interests in generating new resources for the NLP community is creating multi-task datasets, so that linguistic resources become valuable and reusable tools which can motivate new research. COCOTEROS will contribute to boosting NLP research specifically addressing semantic and pragmatic aspects of the Spanish language. Although it was originally conceived for NLG, the fact that it contains contexts associated with reference sentences could be beneficial for solving other NLP-related issues such as textual entailment, also known as Natural Language Inference (NLI) [42]. This task focuses on the semantic relations that may exist between several pieces of text and how such relations can be characterised and computationally analysed.

7. Conclusions and Future Work

In this paper we have presented COCOTEROS, a Spanish corpus of contextual knowledge for NLG, containing nearly 5,000 sentences with their corresponding contexts. The creation of COCOTEROS stems from the current need in NLP research to address tasks with a more semantic-pragmatic approach, as occurs with the generation of linguistic context. We also wanted to contribute to the research community with a well-defined Spanish resource to study contextual aspects in NLG, given the lack of sufficient linguistic resources to study pragmatic aspects of language for languages other than English. With the aim of verifying the level of linguistic and contextual appropriateness of COCOTEROS, we performed a two-fold evaluation. First, we used the magnitude estimation method with the help of three linguistics specialists to measure the linguistic and contextual appropriateness of a representative sample of the generated contexts. Then, we applied the GenBit metric to COCOTEROS to check the level of gender bias our corpus showed.

Acknowledgements

[…] funded by the Generalitat Valenciana. Moreover, it has also been partially funded by the Ministry of Economic Affairs and Digital Transformation and "European Union NextGenerationEU/PRTR" through the "ILENIA" project (grant number 2022/TL22/00215337) and the "VIVES" subproject (grant number 2022/TL22/00215334).

References

[1] Z. Ji, N. Lee, R. Frieske, T. Yu, D. Su, Y. Xu, E. Ishii, Y. J. Bang, A. Madotto, P. Fung, Survey of hallucination […]
On the one hand, results on the contextual lucination in natural language generation, ACM appropriateness evaluation reflect the difficulties when Computing Surveys 55 (2023). URL: https://doi.org/ addressing the contextual generation task even for hu- 10.1145/3571730. doi:10.1145/3571730. man annotators, as annotators tended to differ on the [2] M. a. T. Yan Li, D. Liu, From semantics to prag- degree of appropriateness of each context. Nevertheless, matics: Where IS can lead in natural language pro- the magnitude estimation metric indicates that more than cessing (NLP) research, European Journal of Infor- half of the evaluated contexts were scored favourably. On mation Systems 30 (2021) 569–590. doi:10.1080/ the other hand, the gender bias metric score shows that, 0960085X.2020.1816145. with a few modifications, we could reduce the presence of [3] B. Newman, R. Cohn-Gordon, C. Potts, gender bias in the corpus to a large extent. However, the Communication-based evaluation for natural resulting bias score cannot be conclusive as the metric language generation, in: Proceedings of the did not consider some of the gender-linguistic features Society for Computation in Linguistics 2020, the generated contexts included. Association for Computational Linguistics, Several research directions are planned for future work. New York, New York, 2020, pp. 116–126. URL: First, we would like to improve our resource, so further https://aclanthology.org/2020.scil-1.16. experiments will be made to balance gender representa- [4] J. C. B. Cruz, J. K. Resabal, J. Lin, D. J. Velasco, tion in COCOTEROS, as well as to extend the number C. Cheng, Exploiting news article structure for au- of contexts so this Spanish resource may be of help for tomatic corpus generation of entailment datasets, addressing NLP tasks that need more amounts of data. in: D. N. Pham, T. Theeramunkong, G. Governatori, Finally, we aim to devote a branch of future research to F. 
Liu (Eds.), PRICAI 2021: Trends in Artificial Intel- adapting COCOTEROS corpus to the task of intention ligence, Springer International Publishing, Cham, identification to better understand which reasons make 2021, pp. 86–99. humans have a particular intention when uttering a mes- [5] M. E. Vallecillo-Rodríguez, A. Montejo-Raéz, M. T. sage based on the context surrounding such intention. Martín-Valdivia, Automatic counter-narrative gen- At the same time, we would check if LLMs can better eration for hate speech in spanish, Procesamiento detect specific communicative intentions depending on del Lenguaje Natural 71 (2023) 227–245. reference sentences and their linguistic context. [6] K. Sengupta, R. Maher, D. Groves, C. Olieman, Gen- bit: measure and mitigate gender bias in language Acknowledgments datasets, Microsoft Journal of Applied Research 16 (2021) 63–71. The research work conducted is part of the [7] T. Surana, T.-N. Ho, K. Tun, E. S. Chng, CASSI: R&D projects “CORTEX: Conscious Text Genera- Contextual and semantic structure-based interpo- tion” (PID2021-123956OB-I00), funded by MCIN/ lation augmentation for low-resource NER, in: AEI/10.13039/501100011033/ and by “ERDF A way H. Bouamor, J. Pino, K. Bali (Eds.), Findings of the of making Europe”; “CLEAR.TEXT:Enhancing the Association for Computational Linguistics: EMNLP modernization public sector organizations by de- 2023, Association for Computational Linguistics, ploying Natural Language Processing to make their Singapore, 2023, pp. 9729–9742. URL: https:// digital content CLEARER to those with cognitive aclanthology.org/2023.findings-emnlp.651. doi:10. disabilities” (TED2021-130707B-I00), funded by 18653/v1/2023.findings-emnlp.651. MCIN/AEI/10.13039/501100011033 and “European Union [8] B. T. Johns, M. N. 
Jones, Content matters: Mea- NextGenerationEU/PRTR”; and the project “NL4DISMIS: sures of contextual diversity must consider seman- Natural Language Technologies for dealing with dis- and tic content, Journal of Memory and Language 123 misinformation” with grant reference (CIPROM/2021/21) (2022) 104313. URL: https://www.sciencedirect.com/ science/article/pii/S0749596X21000966. doi:https: //aclanthology.org/2023.findings-eacl.60. doi:10. //doi.org/10.1016/j.jml.2021.104313. 18653/v1/2023.findings-eacl.60. [9] T. Heck, D. Meurers, On the relevance and learner [16] C. Strathearn, D. Gkatzia, Task2Dial dataset: A dependence of co-text complexity for exercise dif- novel dataset for commonsense-enhanced task- ficulty, in: D. Alfter, E. Volodina, T. François, based dialogue grounded in documents, in: Proceed- A. Jönsson, E. Rennes (Eds.), Proceedings of the 12th ings of the 4th International Conference on Natural Workshop on NLP for Computer Assisted Language Language and Speech Processing (ICNLSP 2021), Learning, LiU Electronic Press, Tórshavn, Faroe Is- Association for Computational Linguistics, Trento, lands, 2023, pp. 71–84. URL: https://aclanthology. Italy, 2021, pp. 242–251. URL: https://aclanthology. org/2023.nlp4call-1.9. org/2021.icnlsp-1.28. [10] T. Tyagi, C. G. Magdamo, A. Noori, Z. Li, X. Liu, [17] D. Ghosal, S. Shen, N. Majumder, R. Mihalcea, S. Po- M. Deodhar, Z. Hong, W. Ge, E. M. Ye, Y. han ria, CICERO: A dataset for contextualized com- Sheu, H. Alabsi, L. Brenner, G. K. Robbins, S. Za- monsense inference in dialogues, in: Proceedings far, N. Benson, L. Moura, J. Hsu, A. Serrano-Pozo, of the 60th Annual Meeting of the Association D. Prokopenko, R. E. Tanzi, B. T. Hyman, D. Blacker, for Computational Linguistics (Volume 1: Long S. S. Mukerji, M. B. Westover, S. Das, Using deep Papers), Association for Computational Linguis- learning to identify patients with cognitive impair- tics, Dublin, Ireland, 2022, pp. 5010–5028. 
URL: ment in electronic health records, in: Proceed- https://aclanthology.org/2022.acl-long.344. doi:10. ings of Machine Learning Research ML4H, 2021. 18653/v1/2022.acl-long.344. arXiv:2111.09115. [18] R. Finkbeiner, J. Meibauer, P. B. Schumacher, What [11] J. Verschueren, Context and structure in a theory of is a context? Linguistic approaches and challenges, pragmatics, Studies in Pragmatics 10 (2008) 14–24. volume 196 of Linguistik aktuell = linguistics today, [12] T. Lai, H. Ji, T. Bui, Q. H. Tran, F. Dernoncourt, John Benjamins Pub. Co., Amsterdam, 2012. W. Chang, A context-dependent gated module for [19] G. Hollis, Delineating linguistic contexts, and incorporating symbolic semantics into event coref- the validity of context diversity as a measure erence resolution, in: Proceedings of the 2021 Con- of a word’s contextual variability, Journal ference of the North American Chapter of the As- of Memory and Language 114 (2020) 104146. sociation for Computational Linguistics: Human URL: https://www.sciencedirect.com/science/ Language Technologies, Association for Compu- article/pii/S0749596X20300607. doi:https: tational Linguistics, Online, 2021, pp. 3491–3499. //doi.org/10.1016/j.jml.2020.104146. URL: https://aclanthology.org/2021.naacl-main.274. [20] G. Ferrari, Types of contexts and their role in multi- doi:10.18653/v1/2021.naacl-main.274. modal communication, Computational Intelligence [13] L. Tamine, M. Daoud, Evaluation in contex- 13 (1997) 414–426. tual information retrieval: Foundations and re- [21] S. Castilho, J. L. Cavalheiro Camargo, M. Menezes, cent advances within the challenges of context dy- A. Way, DELA corpus - a document-level corpus an- namicity and data privacy, ACM Comput. Surv. notated with context-related issues, in: Proceedings 51 (2018). URL: https://doi.org/10.1145/3204940. of the Sixth Conference on Machine Translation, doi:10.1145/3204940. Association for Computational Linguistics, Online, [14] C. Hadiwinoto, H. T. Ng, W. C. 
Gan, Improved 2021, pp. 566–577. URL: https://aclanthology.org/ word sense disambiguation using pre-trained con- 2021.wmt-1.63. textualized word representations, in: Proceed- [22] T. Udagawa, A. Aizawa, A natural language ings of the 2019 Conference on Empirical Meth- corpus of common grounding under continuous ods in Natural Language Processing and the 9th and partially-observable context, in: Proceed- International Joint Conference on Natural Lan- ings of the AAAI Conference on Artificial Intel- guage Processing (EMNLP-IJCNLP), Association ligence, AAAI Press, 2019. URL: https://doi.org/ for Computational Linguistics, Hong Kong, China, 10.1609/aaai.v33i01.33017120. doi:10.1609/aaai. 2019, pp. 5297–5306. URL: https://aclanthology.org/ v33i01.33017120. D19-1533. doi:10.18653/v1/D19-1533. [23] B. Y. Lin, W. Zhou, M. Shen, P. Zhou, C. Bhagavat- [15] D. Su, M. Patwary, S. Prabhumoye, P. Xu, R. Prenger, ula, Y. Choi, X. Ren, CommonGen: A constrained M. Shoeybi, P. Fung, A. Anandkumar, B. Catan- text generation challenge for generative common- zaro, Context generation improves open domain sense reasoning, in: Findings of the Association question answering, in: Findings of the As- for Computational Linguistics: EMNLP 2020, As- sociation for Computational Linguistics: EACL sociation for Computational Linguistics, Online, 2023, Association for Computational Linguistics, 2020, pp. 1823–1840. URL: https://aclanthology. Dubrovnik, Croatia, 2023, pp. 793–808. URL: https: org/2020.findings-emnlp.165. doi:10.18653/v1/ 2020.findings-emnlp.165. [32] E. G. Bard, D. Robertson, A. Sorace, Magnitude [24] F. Carlsson, J. Öhman, F. Liu, S. Verlinden, J. Nivre, estimation of linguistic acceptability, Language M. Sahlgren, Fine-grained controllable text gen- 72 (1996) 32–68. URL: http://www.jstor.org/stable/ eration using non-residual prompting, in: Pro- 416793. ceedings of the 60th Annual Meeting of the As- [33] J. Novikova, O. Dušek, V. 
Rieser, RankME: Reliable sociation for Computational Linguistics (Volume 1: human ratings for natural language generation, in: Long Papers), Association for Computational Lin- Proceedings of the 2018 Conference of the North guistics, Dublin, Ireland, 2022, pp. 6837–6857. URL: American Chapter of the Association for Computa- https://aclanthology.org/2022.acl-long.471. doi:10. tional Linguistics: Human Language Technologies, 18653/v1/2022.acl-long.471. Volume 2 (Short Papers), Association for Compu- [25] M. Conover, M. Hayes, A. Mathur, J. Xie, J. Wan, tational Linguistics, New Orleans, Louisiana, 2018, S. Shah, A. Ghodsi, P. Wendell, M. Zaharia, pp. 72–78. URL: https://aclanthology.org/N18-2012. R. Xin, Free Dolly: Introducing the world’s first doi:10.18653/v1/N18-2012. truly open instruction-tuned LLM, 2023. URL: [34] A. Turpin, F. Scholer, S. Mizzaro, E. Maddalena, https://www.databricks.com/blog/2023/04/12/ The benefits of magnitude estimation relevance dolly-first-open-commercially-viable-instruction-tuned-llm. assessments for information retrieval evaluation, [26] E. Sanchez-Bayona, R. Agerri, Leveraging a in: Proceedings of the 38th International ACM SI- new Spanish corpus for multilingual and cross- GIR Conference on Research and Development in lingual metaphor detection, in: Proceedings of Information Retrieval, SIGIR ’15, Association for the 26th Conference on Computational Natural Computing Machinery, New York, NY, USA, 2015, Language Learning (CoNLL), Association for Com- p. 565–574. URL: https://doi.org/10.1145/2766462. putational Linguistics, Abu Dhabi, United Arab 2767760. doi:10.1145/2766462.2767760. Emirates (Hybrid), 2022, pp. 228–240. URL: https: [35] S. Santhanam, S. Shaikh, Understanding the //aclanthology.org/2022.conll-1.16. doi:10.18653/ impact of experiment design for evaluating dia- v1/2022.conll-1.16. logue system output, in: Proceedings of the The [27] V. Kovatchev, M. 
Taulé, InferES : A natural language Fourth Widening Natural Language Processing inference corpus for Spanish featuring negation- Workshop, Association for Computational Linguis- based contrastive and adversarial examples, in: tics, Seattle, USA, 2020, pp. 124–127. URL: https:// Proceedings of the 29th International Conference aclanthology.org/2020.winlp-1.33. doi:10.18653/ on Computational Linguistics, International Com- v1/2020.winlp-1.33. mittee on Computational Linguistics, Gyeongju, Re- [36] R. Doust, P. Piwek, A model of suspense for narra- public of Korea, 2022, pp. 3873–3884. URL: https: tive generation, in: Proceedings of the 10th Inter- //aclanthology.org/2022.coling-1.340. national Conference on Natural Language Gener- [28] W. G. Reijnierse, C. Burgers, T. Krennmayr, G. Steen, ation, Association for Computational Linguistics, The role of co-text in the analysis of potentially Santiago de Compostela, Spain, 2017, pp. 178–187. deliberate metaphor, in: Drawing Attention to URL: https://aclanthology.org/W17-3527. doi:10. Metaphor: Case studies across time periods, cul- 18653/v1/W17-3527. tures and modalities, John Benjamins Publishing [37] S. Pita-Fernández, Determinación del tamaño mues- Company, 2020, pp. 15–38. tral, Cuadernos de atención primaria 3 (1996) 138– [29] J. Manyika, An overview of Bard: An early 141. experiment with generative AI, Technical Re- [38] C. Barros, M. Vicente, E. Lloret, To what port, Tech. rep., Technical report, Google AI, extent does content selection affect surface re- 2023. URL: https://ai.google/static/documents/ alization in the context of headline genera- google-about-bard.pdf. tion?, Computer Speech & Language 67 [30] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. (2021) 101179. URL: https://www.sciencedirect.com/ Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Ham- science/article/pii/S0885230820301121. doi:https: bro, F. Azhar, et al., LLaMA: Open and efficient //doi.org/10.1016/j.csl.2020.101179. 
foundation language models, Computing Research [39] Y. G. Vázquez, A. F. Orquín, A. M. Guijarro, S. V. Repository, arXiv:2302.13971. Version 1 (2023). Pérez, Integración de recursos semánticos basados [31] W.-L. Chiang, Z. Li, Z. Lin, Y. Sheng, Z. Wu, en WordNet, Procesamiento del lenguaje natural H. Zhang, L. Zheng, S. Zhuang, Y. Zhuang, J. E. 45 (2010) 161–168. Gonzalez, I. Stoica, E. P. Xing, Vicuna: An open- [40] A. Siddharthan, N. Katsos, Offline sentence process- source chatbot impressing GPT-4 with 90%* Chat- ing measures for testing readability with users, in: GPT quality, 2023. URL: https://lmsys.org/blog/ Proceedings of the First Workshop on Predicting 2023-03-30-vicuna/. and Improving Text Readability for target reader populations, Association for Computational Lin- guistics, Montréal, Canada, 2012, pp. 17–24. URL: https://aclanthology.org/W12-2203. [41] M. R. Costa-jussà, An analysis of gender bias studies in natural language processing, Nature Machine Intelligence 1 (2019) 495–496. [42] S. R. Bowman, G. Angeli, C. Potts, C. D. Manning, A large annotated corpus for learning natural lan- guage inference, in: Proceedings of 2015 Confer- ence on Empirical Methods in Natural Language Processing, Association for Computational Linguis- tics, Lisbon, Portugal, 2015, pp. 632–642. URL: http: //aclanthology.lst.uni-saarland.de/D15-1075.pdf.