COCOTEROS: A Spanish Corpus with Contextual Knowledge for Natural Language Generation

María Miró Maestre

Iván Martínez-Murillo

Elena Lloret

Paloma Moreda

Armando Suárez Cueto

Dept. of Software and Computing Systems, University of Alicante, Apdo. de Correos 99, E-03080, Alicante, Spain

Contextual information is one of the key elements when automatically generating language with a more semantic-pragmatic perspective. To contribute to the study of this linguistic aspect, we present COCOTEROS, a COrpus of COnTextual TExt geneRatiOn in Spanish. COCOTEROS is available at https://huggingface.co/datasets/gplsi/cocoteros. The corpus is composed of sentences and automatically generated context pairs. For creating it, a semi-automatic weakly supervised methodology is implemented. Taking as a reference the Spanish section of the Tatoeba dataset, we filtered the sentences according to our research purpose. Then, we determined several linguistic parameters that the generated contexts need to fulfil considering their reference sentence. Finally, contexts were automatically generated using prompt engineering with Google's large language model Bard. Furthermore, we performed two types of evaluation to check both the linguistic quality and the presence of gender bias in the corpus: the former by manually measuring the magnitude estimation metric and the latter thanks to the GenBit automatic metric. The results show that COCOTEROS is an appropriate language resource to approach Natural Language Generation tasks from a semantic-pragmatic perspective for Spanish. For instance, the NLG task of concept-to-text generation could benefit from contextual information by generating sentences according to the information provided in the context and a set of given concepts. Additionally, regarding the task of question-answering, the inclusion of linguistic context can enhance the generation of more appropriate answers by serving as a guide on what information to include in the automatically generated answer.

Keywords: corpus, contextual information, natural language generation, Spanish, human evaluation, large language models

1. Introduction

Natural Language Generation (NLG) systems are steadily improving their performance in a wide range of tasks where the information to be generated is delimited according to the objective of the task, e.g., text summarisation, machine translation or question answering (QA). One of the most important issues those systems have to deal with is the lack of sufficient contextual knowledge, as it prevents NLG models from better adapting the generated text to the communicative situation of each task. This leads to crucial problems such as hallucination and lack of commonsense in the produced text [1]. In fact, one of the current concerns within the NLG discipline [2] is the need to address tasks from a more ‘semantic-pragmatic perspective’ to solve these contextual inference difficulties that affect the output of the systems at issue. To address the lack of studies that bear in mind these linguistic levels of analysis, NLG is starting to put linguistic context in the research spotlight, given its importance for appropriately understanding human utterances. Indeed, Newman et al. [3] already defended the consideration of context not only to create text automatically but also to assess the suitability of the generated text. This statement comes from the idea that communication-based features help to evaluate the performance of any model that imitates human language. Language itself is used to communicate ideas always expressed within a given communicative context, and it is such context that directly affects the structure of the utterance we want to say.

Parallel to this, making NLG systems aware of contextual knowledge involves the creation of new resources, such as datasets, corpora, knowledge bases, etc., to train models in several languages, especially for those different from English or low-resourced ones. In the case of Spanish, we observed that most of the recently published corpora hardly address pragmatic-related issues with a contextual perspective, but rather focus on concrete pragmatic aspects such as metaphors to tackle identification tasks. Furthermore, the high performance of Large Language Models (LLMs) recently witnessed within the field of Natural Language Processing (NLP) has allowed researchers to use NLP tools to automatise data collection and corpus creation tasks, therefore reducing the time spent in collecting sufficient data for research purposes [4, 5].

To bridge the gap of NLG systems that handle more semantic-pragmatic features of language, specifically contextual knowledge, we present COCOTEROS, a COrpus of COntextual TExt geneRatiOn in Spanish. This corpus comprises 4,845 sentences extracted from the existing Tatoeba dataset, together with 4,845 context sentences automatically generated with the Bard¹ language model and manually revised. Given the difficulties inherent to prompt engineering when using LLM-based chatbots, several linguistic parameters were determined to ensure the quality of the automatically generated outputs with Bard, including semantic similarity, length of the generated text, and forbidden keywords, among others. Moreover, we performed a human evaluation experiment based on the magnitude estimation metric with three linguistics specialists to measure the contextual appropriateness of the resulting contexts. In parallel, we measured gender bias with the GenBit tool [6] to verify that our corpus would be useful for NLG tasks without adding gender biases to models trained in further experiments.

In sum, the main contributions of this paper are:

• Expansion of a subset of Tatoeba’s corpus with contextual information.
• Proposal of a weakly supervised methodology for building a corpus using prompt-engineering.
• Creation of COCOTEROS, a novel Spanish corpus for commonsense NLG that includes contextual information.
• Corpus validation through human assessment with the magnitude estimation method.
• Corpus evaluation of gender bias with the GenBit automatic metric.

We believe this corpus will provide the research community with a valuable resource in Spanish to test the performance of NLG systems in different tasks by considering semantic-pragmatic aspects of communication such as contextual appropriateness. Some of the NLG tasks that could use COCOTEROS are those related to concept-to-text generation, where sets of words are provided and the model has to generate a text given those concepts. These words can have multiple meanings depending on their context. Having a predefined context within which the sentence has to be generated could help to produce a more precise sentence. Moreover, COCOTEROS could also be used to train NLG models to automatically generate sentences in accordance with a given context as input information, therefore improving the model’s awareness of the different communicative situations it is trained with. As for NLP, some tasks that have already exploited the role of contextual knowledge for improving their classification systems are Named Entity Recognition (NER) [7], word recognition and lexical processing to boost semantic disambiguation [8], language learning [9] or even healthcare studies devoted to diseases or syndromes which critically affect language [10].

SEPLN-2024: 40th Conference of the Spanish Society for Natural Language Processing. Valladolid, Spain. 24-27 September 2024.
maria.miro@ua.es (M. M. Maestre); ivan.martinezmurillo@ua.es (I. Martínez-Murillo); elena.lloret@ua.es (E. Lloret); moreda@ua.es (P. Moreda); armando.suarez@ua.es (A. S. Cueto)
ORCID: 0000-0001-7996-4440 (M. M. Maestre); 0009-0007-5684-0083 (I. Martínez-Murillo); 0000-0002-2926-294X (E. Lloret); 0000-0002-7193-1561 (P. Moreda); 0000-0002-8590-3000 (A. S. Cueto)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

¹ Since 8th February 2024, Bard is known as Gemini.

2. Related Work

In view of the multidisciplinary nature of the task, the following theoretical background is on one side focused on the linguistic notion of context and its approach to NLG research (subsection 2.1). On the other side, subsection 2.2 includes prior NLG research focused on the creation of linguistic resources to address contextual-related tasks.

2.1. Linguistic Context in NLG Research

Messages need their surrounding communicative context in order to be completely understood [11]. This claim is well accepted within the NLP discipline, as many tasks try to solve context-related linguistic issues to improve NLP systems performance, e.g., coreference resolution [12], information retrieval [13], word sense disambiguation [14] or question answering [15]. Context, therefore, becomes a pragmatic element of great interest when processing language automatically. Similarly, when focusing on language generation, there are concrete applications such as dialogue systems where context is usually predetermined so researchers can study the linguistic features surrounding such communicative context [16, 17].

When addressing the task of contextual appropriateness (i.e., how appropriate a context is given a linguistic setting), several conceptions of context may come to mind, as linguistic theories tend to diverge on the definition of context given the wide range of perspectives from which context can be approached [18]. For the sake of the present research, we focus on the linguistic context of a given message, which can be defined as ‘any contiguous span of text within a document’ [19] or as ‘the set of utterances that precedes the current one’ [20]. These definitions align with the linguistic dimension of context known as ‘intratextual context’ (or ‘co-text’), which studies the relation of a piece of text to its surrounding text [18].

2.2. Contextual Corpora for NLG

The creation of linguistic resources directly oriented to analyse more complex linguistic phenomena such as context provides added value to the research community, as there are not as many resources available as for other far-reaching linguistic levels of analysis such as syntax or grammar. To motivate the study of this pragmatic element, several resources to analyse context from different perspectives have already been made available. Castilho et al. [21] created an English corpus annotated with context-aware issues for the task of Machine Translation into Brazilian Portuguese. Regarding dialogue tasks, Udagawa and Aizawa [22] addressed the common grounding problem by collecting a dialogue dataset with continuous and partially-observable context. As for controllable text generation, Lin et al. [23] created the CommonGen task and dataset to test to which extent a generation system can generate text with commonsense reasoning in English. To this end, the task is to generate a coherent sentence that includes several common concepts previously shown to the system. Derived from this work, Carlsson et al. [24] generated the C2Gen dataset of context sentences in English from which they extracted several keywords that had to be included in an automatically generated text. Finally, a recent English corpus worth mentioning is databricks-dolly-15k, a human-generated instruction corpus created to train the Dolly LLM [25]. This dataset was applied to different contextual tasks such as summarisation of Wikipedia articles or closed QA, where a question and a reference passage are input to the system to get factually correct responses.

Focusing on Spanish resources, Sanchez-Bayona and Agerri [26] generated a corpus of Spanish metaphors, which depend directly on the contextual meaning to be clearly identified by an automatic system. As for Natural Language Inference (NLI), Kovatchev and Taulé [27] compiled the INFERES corpus to check the performance of machine learning systems on negation-based adversarial examples by using context paragraphs from topics extracted from the Spanish Wikipedia.

After a thorough review of the current corpora that address contextual NLG tasks in Spanish, we can say that, to the best of our knowledge, there is no corpus focused on the contextual information generation task in Spanish. Consequently, for this research we build on the previous works by Lin et al. [23] and Carlsson et al. [24] to address the task of contextual information generation in Spanish.

3. Corpus Creation

The following subsections include the methodology steps to create COCOTEROS: i) we explain the reference sentences dataset collection process and how we filtered them (subsection 3.1); ii) we move on to determine the linguistic constraints that will comprise the prompt to generate automatic contexts (subsection 3.2); iii) we describe the context generation task (subsection 3.3); and iv) we include a manual post-edition to curate the results generated by the LLM (subsection 3.4). Figure 1 shows a visual pipeline of the methodology used for creating COCOTEROS.

3.1. Data Collection and Filtering

For the present study, we wanted to gather simple Spanish sentences with enough semantic content to automatically generate contexts linked to the situation stated in the reference sentence. We prioritised sentences with not too much linguistic information so the context does not add extra information besides the purpose of the task, being not too distant from the original sentence situation. To this end, we chose the Spanish section of sentences written on the website Tatoeba² as the original dataset from which we would select the sentences to generate the contexts. We first considered using other already published corpora such as CommonGen [23] or C2Gen [24] as original datasets because they also focused on the task of NLG with contextual information. However, for using these corpora we would have had to translate the original datasets into Spanish, which would imply choosing an appropriate automatic translation tool or manually translating the datasets for adapting the task into Spanish. Also, a further proofreading step would have been necessary to check the accuracy of the translations into Spanish, so we preferred to benefit from an already-existing Spanish dataset that could help us generate our context corpus.

Tatoeba’s original dataset includes around 393,000 Spanish sentences either translated from other languages or directly written in Spanish. The dataset includes sentences with a range of 1 up to 44 words per sentence, so we first filtered them by selecting only those sentences made up of either 8 or 9 words, collecting a total of 60,170 Spanish sentences. We chose this section from the dataset after a previous preprocessing of an excerpt of the dataset with the spaCy tokenizer³. In this preliminary preprocessing, we noticed that the more words the sentence comprised, the more risk we had of including too much semantic information in the sentence. This could entail the generation of contexts not linked to the original situation stated in the reference sentence. Similarly, we rejected those sentences made up of 7 words or less, as many of their keywords lacked enough linguistic information (verbs, nouns, etc.) to generate a context that could be in line with the situation stated in the reference sentence.

3.2. Linguistic Constraints

LLMs can be useful for supporting the automatic creation of corpora to study specific linguistic phenomena that would become very costly tasks if compiled manually. Nevertheless, generating a corpus with LLMs from scratch also entails several risks regarding linguistic appropriateness that could worsen the quality of the corpus, as it happens with hallucination issues or lack of commonsense.

² This dataset was released under a CC-BY License and can be found at https://tatoeba.org/es.

³ https://spacy.io/api/tokenizer
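The length filter described in subsection 3.1 can be sketched as follows. This is only a minimal illustration, not the authors' script: it counts words with a regular expression rather than with the spaCy tokenizer used in the paper, and the function names and sample sentences are hypothetical.

```python
import re


def count_words(sentence: str) -> int:
    """Count word tokens, ignoring punctuation (regex stand-in for a tokenizer)."""
    return len(re.findall(r"\w+", sentence))


def filter_by_length(sentences, min_words=8, max_words=9):
    """Keep only the sentences whose word count lies in [min_words, max_words]."""
    return [s for s in sentences if min_words <= count_words(s) <= max_words]


# Illustrative sample sentences.
sample = [
    "Tengo demasiadas cosas en la cabeza estos días.",  # 8 words: kept
    "No lo sé.",  # 3 words: too short
    "Mi hermana compró un coche rojo el año pasado.",  # 9 words: kept
    "Ayer por la tarde fuimos todos juntos al cine del centro comercial.",  # 12 words: too long
]

kept = filter_by_length(sample)
```

A real tokenizer may split some strings differently (for instance contractions or numbers with punctuation), so counts near the 8 and 9 word boundary could differ slightly from this approximation.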

Therefore, with the aim of automatically creating linguistic contexts referred to a given sentence, and to better control the output of our chosen LLM (further explained in Section 3.3), we determined several linguistic parameters to include in the prompt:

• Maximum context length: Inspired by the work presented in Carlsson et al. [24], we decided that an appropriate length for the generated context could be around 45 words. This decision comes also from preliminary prompt tests where we found that, if no length limitation was included, the model tended to delve into the generation process, creating contexts of more than ten lines of text that distanced too much from the original situation stated in the reference sentence.
• Definition of context: Following previous studies focused on context as described in Section 2.1, we started our prompt with a simple and straightforward definition of what we consider a linguistic context so the model could first get the idea of the task to accomplish.
• Reference sentence or synonyms: On the first attempts to find the right prompt to compile the corpus, we observed that, even by including a short definition of linguistic context, the model sometimes generated a context including the reference sentence. Therefore, to better specify the linguistic nature of the context to be generated, we indicated that neither the reference sentence nor a sentence with similar semantic meaning could appear in the context.
• Forbidden keywords: We extracted three keywords from each reference sentence that could semantically define the sentence meaning. The extraction was automatically performed by means of a random choice where we prioritised the selection of two nouns and one verb, as we consider them some of the main linguistic elements that define the semantic meaning of a sentence. Then, we added those batches of three keywords in the prompt as forbidden words that could not appear in the context to be generated. With this restriction, we wanted to ensure that, even if some of the linguistic structures in the reference sentence were repeated in the generated context, the semantic meaning of the context is related to, but changes somewhat from, the reference sentence. This goes in line with the idea that the choice of words influences co-text and meaning potential [28], and we wanted to test up to which point LLMs can generate co-text with the same conceptual background but adding new words that can enlarge the semantic information of the new sentence.

3.3. Context Generation

Once we filtered the original dataset, the next step was to generate an appropriate context for each of the selected sentences. For this, we benefited from the capabilities of LLMs, and in particular, we used Bard [29], Google’s recent LLM. Our decision was motivated by an empirical study we previously conducted in which several LLMs were compared to check how appropriately they fulfilled the task of generating a context resembling a sentence but without repeating or paraphrasing it. The LLMs compared were LLaMa⁴ [30], Vicuna⁴ [31], Bard, and ChatGPT⁵. We automatically generated contexts for our subset of sentences of 8 or 9 words with Bard, which could generate a context in an average of 5 seconds. Nevertheless, Bard’s public version could be prompted only 130 times per day. The generation process was made through a zero-shot prompt that comprised the linguistic restrictions the generated context should include or not, as stated in section 3.2. With this setup, we created an initial version of the COCOTEROS corpus with 5,000 contexts.

3.4. Post-editing

Finally, given Bard’s predefined chat-like communicative structures, we manually revised and post-edited the resulting contexts by eliminating all the information included in the response which was not the generated context itself (e.g. Bard’s output included similar sentences to “Aquí tienes un contexto relacionado para la frase ‘Tengo demasiadas cosas en la cabeza estos días’” as preliminary statements for each context⁶). As a remark, there were times when Bard generated several contexts for a single input, giving us the opportunity to choose between them, so we did a manual proofreading process where we checked every possible context to choose those that approximated more to the conception of context we determined for this research task. In line with this, in those cases where we could choose from two options, we selected the context describing a female-subject situation. We made this decision because we detected a somewhat higher proportion of reference sentences addressing male subjects, so the generated context was male-gendered too. Therefore, in those cases where the reference sentence was not gender-specific, we prioritised female contexts to balance gender in COCOTEROS. Further details on how we addressed gender bias in our corpus are shown in subsection 5.2.

In this manual post-editing step we also discarded contexts that were repetitions or paraphrasing of the reference sentence, as well as those that did not include enough semantic information to be considered appropriately generated contexts. Within the rest of contexts we kept in COCOTEROS, there were times where Bard left some of the concepts in the generated text incomplete so the user could complete it according to his/her preferences, as in “Nos encontramos a [nombre de la persona] sentado en su escritorio”⁷. Consequently, we had to modify those contexts by completing the missing information with generic concepts or names so we could add the resulting context to the final corpus.

⁴ The tested version of LLaMa was llama-2-70b-chat, and Vicuna’s version was vicuna-33b. They were tested on https://chat.lmsys.org/

⁵ Tested version of ChatGPT was GPT 3.5 on https://chat.openai.com/

⁶ Example translated into English for clarity purposes: Here’s a related context for the phrase “I have too many things on my mind these days”

⁷ Example translated into English for clarity purposes: We found [name of the person] sitting at his/her desk

4. COCOTEROS - Corpus of Contextual Text Generation in Spanish

As the first corpus focused on the contextual text generation task for Spanish, COCOTEROS contains a total of 4,845 pairs of reference sentences with their respective generated contexts as illustrated in Figure 2. Moreover, the corpus includes the three keywords extracted from each reference sentence. The final amount of contexts comes from a previous manual post-edition from the original 5,000 contexts generated with Bard. We performed this post-edition because we noticed sexist content in some of the generated contexts, so we decided to discard those cases straightforwardly. Table 1 shows a statistical summary of COCOTEROS. Apart from the corpus general information, we found it interesting to check the average sentences and words per context because Bard sometimes generated contexts with very different lengths. Even though the prompt included the maximum length that the context could have (45 words), we found cases where the context had only 15 words, whereas other contexts contained more than four sentences, with a total of more than 50 words.

Table 1
COCOTEROS data summary.

Data                                      Total
Reference sentences                       4,845
Keywords                                 14,535
Generated contexts                        4,845
Words in the sentences                   40,827
Words in the contexts                   119,885
Words in the corpus                     175,247
Average no. of sentences per context          2
Average no. of words per context             25

The official version of the COCOTEROS corpus is available at https://huggingface.co/datasets/gplsi/cocoteros. With this, we aim to contribute to NLG research with a new language resource for studying contextual information generation in Spanish, as well as for other unexplored NLG tasks that can benefit from our corpus to address further research questions.
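The totals reported in Table 1 can be cross-checked with a few lines of arithmetic. The sketch below uses only the figures from the table; the variable names are hypothetical, and reading the corpus total as the sum of sentence words, context words and keywords is an inference from the numbers rather than something the paper states.

```python
# Figures copied from Table 1.
reference_sentences = 4_845
keywords = 14_535
generated_contexts = 4_845
words_in_sentences = 40_827
words_in_contexts = 119_885
words_in_corpus = 175_247

# Three keywords were extracted per reference sentence.
keywords_per_sentence = keywords / reference_sentences

# Inferred reading: corpus total = sentence words + context words + keywords.
reconstructed_total = words_in_sentences + words_in_contexts + keywords

# Averages implied by the totals.
avg_words_per_sentence = words_in_sentences / reference_sentences
avg_words_per_context = words_in_contexts / generated_contexts

print(keywords_per_sentence)             # 3.0
print(reconstructed_total == words_in_corpus)
print(round(avg_words_per_sentence, 2))  # between 8 and 9, as the filter requires
print(round(avg_words_per_context))      # 25
```

The average sentence length falling between 8 and 9 words agrees with the filter applied in subsection 3.1, and the average context length agrees with the rounded value of 25 in the table.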

5. Corpus Evaluation

To ensure that the contexts included in COCOTEROS are appropriate for contextual generation tasks, we evaluated them taking into account different aspects: context appropriateness with the manual magnitude estimation method (subsection 5.1) and gender bias through the automatic GenBit metric (subsection 5.2).

5.1. Context Appropriateness

With the evolution of the latest LLMs, researchers face a need for consistent evaluation metrics that help them evaluate the outputs provided by these models when testing their performance for language generation tasks. To this end, we performed an experiment based on the magnitude estimation method [32] with the help of three linguistics specialists. Magnitude estimation is a method generally used in psychology to check the reaction of different subjects when presented with several stimuli. To measure the different levels of reaction subjects can have, they need to assign a score to a first stimulus (in our case, the generated context) where no ranges or limits are determined. Then, when a second stimulus is presented, they have to compare it with the first stimulus shown, and depending on the intensity of the reaction they have, its score will change based on the previous score they assigned to the first stimulus. In this manner, if subjects’ reaction to the second stimulus is twice as much as to the first stimulus, they will have to double the score they assign to the second stimulus. This method has been used positively for evaluating automatically generated text in several NLG tasks [33, 34, 35, 36], as researchers demonstrated that it helps to detect more linguistic nuances as well as more distinctive rankings when comparing the outputs between the annotators in comparison to other more common methods such as Likert scales [33, 35].

Taking this method as a basis, we wanted to measure the appropriateness of the generated context given its reference sentence. For this, we took a representative sample of sentences and contexts from the COCOTEROS corpus through Formula 1, presented in [37] and previously used in [38]:

$$M = \frac{N \cdot K^2 \cdot P \cdot Q}{E^2 \cdot (N - 1) + K^2 \cdot P \cdot Q} \qquad (1)$$

where N is the population, K the confidence interval, P the probability of success, Q the probability of failure and E the error rate. The population N was 4,845 sentences and their respective contexts, and the values given to the rest of these parameters were taken as presented in [39], so that K=0.95, E=0.05, P=0.5, and Q=0.5. Once the formula was calculated, the resulting number of sentences M for testing contextual appropriateness was rounded to 90 sentences with their respective contexts. This subset of 90 sentences and contexts was selected at random from the final COCOTEROS corpus. With the subset of contexts already determined, we performed the magnitude estimation analysis to validate the generated contexts. To accomplish this, we explained the methodology to score the subset of 90 generated contexts to the annotators, with the only requirement that the lowest score they could assign was 1. In this manner, we ensured the subsequent normalisation of the values each of them may assign to each context. As a remark, we noticed that two annotators scored contexts based on a 1 to 100 ranking, even when we highlighted that there were no restrictions in the values they could choose for each context. Once we collected all the scores made by the annotators, we normalised the results by means of the z-score normalisation formula (Formula 2) as used in [40]:

$$z_{ih} = \frac{x_{ih} - \mu_h}{\sigma_h} \qquad (2)$$

where $z_{ih}$ is annotator h’s z-score for context i when annotator h gave a magnitude estimation score of $x_{ih}$ to that context. $\mu_h$ is the mean and $\sigma_h$ the standard deviation of the set of magnitude estimation scores for annotator h.
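Both formulas can be checked numerically. The sketch below is an illustration under stated assumptions rather than the authors' code: the function names are hypothetical, the annotator scores are invented, and the population standard deviation is used because the paper does not say whether the sample or population variant was applied.

```python
from statistics import mean, pstdev


def sample_size(N, K=0.95, E=0.05, P=0.5, Q=0.5):
    """Formula 1: sample size M for a finite population of size N."""
    return (N * K**2 * P * Q) / (E**2 * (N - 1) + K**2 * P * Q)


def z_scores(scores):
    """Formula 2: z-score normalisation of one annotator's scores."""
    mu = mean(scores)
    sigma = pstdev(scores)  # assumption: population standard deviation
    return [(x - mu) / sigma for x in scores]


# With the values reported in the paper, M is about 88.6,
# which the authors round to 90 sentences.
m = sample_size(4845)

# Invented scores: two annotators using different ranges for the same contexts.
annotator_a = [1, 2, 3, 2, 4]
annotator_b = [10, 20, 30, 20, 40]
z_a = z_scores(annotator_a)
z_b = z_scores(annotator_b)
```

Because annotator_b's scores are annotator_a's multiplied by ten, both lists of z-scores coincide up to floating-point rounding, which is why the normalisation makes annotators with different score ranges comparable.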

Figure 3 shows the normalised results for the magnitude estimation evaluation. The 0 line serves as the mean from which upper numbers indicate those contexts with higher scores, and the negative numbers show those contexts considered not suitably generated. As can be seen, the three annotators tend to agree on which linguistic contexts have an appropriate contextual relatedness to the reference sentence, even though each of them used a different range of scores within the magnitude estimation experiment. In spite of a few disagreeing cases in the total of 90 contexts, we observe that the annotators agree that more than half of the corpus sample comprises contexts with appropriate contextual relatedness, while the rest could be improved. After evaluating the results with the annotators, we concluded that they tended to highly penalise those contexts that paraphrased the reference sentence, even if after that paraphrasing sentence the context included new excerpts of text that indeed served as an appropriate linguistic context.

5.2. Gender Bias

Several methodological issues come to mind when using an LLM to generate a new language resource for further training LLMs so they can learn how to approach new emerging NLG problems. One of those recently detected issues is the presence of gender bias in the human-compiled corpora that LLMs are trained with. This poses a new problem for the research community, as the incredible performance those LLMs currently show is based on data that reflect and amplify societal biases detected in naturally occurring texts [41]. With an eye to checking possible biases in our corpus, we used the Gender Bias Tool (GenBit) [6] to measure the apparent level of gender bias in the 4,845 generated contexts from COCOTEROS.
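As a rough illustration of what counting gendered terms in a corpus involves, the sketch below computes the share of female and male words in a list of texts. It is not GenBit and does not reproduce its metric: the two word lists are tiny, hypothetical examples, and the function name is an assumption made for this sketch.

```python
import re
from collections import Counter

# Hypothetical, deliberately small word lists (GenBit ships its own lists).
FEMALE_WORDS = {"ella", "mujer", "madre", "hermana", "niña", "abuela"}
MALE_WORDS = {"él", "hombre", "padre", "hermano", "niño", "abuelo"}


def gender_word_shares(texts):
    """Return the proportion of female and male list words found in texts."""
    counts = Counter()
    for text in texts:
        for token in re.findall(r"\w+", text.lower()):
            if token in FEMALE_WORDS:
                counts["female"] += 1
            elif token in MALE_WORDS:
                counts["male"] += 1
    total = counts["female"] + counts["male"]
    if total == 0:
        return {"female": 0.0, "male": 0.0}
    return {"female": counts["female"] / total, "male": counts["male"] / total}


# Invented example contexts.
contexts = [
    "Mi padre y mi hermano llegaron tarde.",
    "Ella habló con su abuelo.",
]
shares = gender_word_shares(contexts)
print(shares)  # {'female': 0.25, 'male': 0.75}
```

A list-based count like this ignores proper names and gendered adjectives, so it can only give a first approximation of how gender is represented in a corpus.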

According to its developers, GenBit helps determine if gender is distributed uniformly across data by measuring the strength of association between a pre-defined list of gender definition words and other words in the corpus via co-occurrence statistics. Table 2 shows the obtained results after processing COCOTEROS with the Spanish metric provided in GenBit. […] definition words that appear in the corpus, resulting in 0.335 and 0.665, respectively, in COCOTEROS. Considering the results, it seems there is a higher representation of words associated with the male gender rather than with the female. However, these results do not imply that those sentences containing female-gendered words are used in a sexist context but that the appearance of female-gendered terms in the corpus is lower. We want to remark on this because the apparent underrepresentation of female-gendered words could be modified easily by creating parallel contexts to those where there are male-gendered words so that we could balance both gender representation at the same time that we expand our corpus with further examples. Moreover, we have to bear in mind that words in Spanish have a specific grammatical gender, whereas English words do not. Consequently, a predominance of male-gendered words does not need to imply that the corpus is gender biased, but that the corpus includes more words linked to that grammatical gender, whether those words refer to objects, places or people.

In addition, during our manual post-editing stage of the 4,845 contexts, we found that many of them described communicative situations where the subject is a woman. However, GenBit does not include female or male proper names and gendered adjectives in its Spanish section, so it cannot consider those contexts as gender-defined, which may also affect the final result of the gender bias metric. Therefore, the results achieved with the GenBit score serve as a first attempt to consider possible gender bias in our corpus, but we believe they cannot be conclusive given the different examples of gendered sentences found in our corpus not considered by the metric.

6. Overall Discussion and […]

The results obtained throughout the experimentation process for creating and evaluating COCOTEROS open the door for discussion along several dimensions.

Regarding the magnitude estimation evaluation, this metric helped us to detect further nuances in the scores each annotator assigned to contexts depending on their appropriateness. Those nuances could be future challenges to address to keep on discovering knowledge on how to deal with contextual information in NLG systems. Therefore, these results helped us to determine one of the modifications to apply to COCOTEROS, as in future work we will manually analyse and discard contexts with paraphrasing sentences, so we only leave linguistic contexts that add contextual information to the reference sentence without using synonyms.

Another key aspect of generating new resources is that they must not contain gender biases. An unbiased dataset is an important factor when training a language model, as bias is mostly introduced in the data used in the training phase of the model. As discussed in Section 5.2, remarkable efforts have been made to balance the number of sentences addressing both genders as we are aware of the importance of dealing with gender underrepresentation when creating inclusive language resources that comply with gender balance standards. By doing this, we also want to encourage the rest of the community to take similar steps so that NLP resources and LLMs are trained on trustworthy resources with no biases. We used the GenBit tool for measuring this number, and although the results obtained are the expected ones, it is true that GenBit does not detect some grammatical categories such as male or female proper names and gendered adjectives, so the results cannot be conclusive.

One problem worth commenting on regarding LLMs is hallucination, which occurs when a text is nonsensical or unfaithful to the input source. During post-processing, we detected that some generated contexts suffered from this (e.g., the reference sentence contained the word “father” while the context was generated with “grandfather”; the generated context was written in the masculine form when the reference sentence was in the feminine form; or the case of fake generated data, such as the winner of Eurovision 2023 which was not Germany). Nevertheless, we did not discard these sentences as our scope was to obtain appropriate contexts. Therefore, future works will focus on detecting and eliminating hallucinations to gather a corpus free of this issue.

Finally, another of the main interests for generating new resources for the NLP community is creating multitask datasets so that linguistic resources become a valuable and reusable tool which can motivate new research. COCOTEROS will contribute to boosting NLP research specifically addressing semantic and pragmatic aspects for the Spanish language. Although it has been originally conceived for NLG, the fact that it contains contexts associated with reference sentences could be beneficial for solving other NLP-related issues such as textual entailment, also known as Natural Language Inference (NLI) [42]. This task focuses on the semantic relations that may exist between several pieces of text and how such relations can be characterised and computationally analysed.

7. Conclusions and Future Work

In this paper we have presented COCOTEROS, a Spanish corpus of contextual knowledge for NLG, containing nearly 5,000 sentences with their corresponding contexts. The creation of COCOTEROS comes from the current need in NLP research to address tasks with a more semantic-pragmatic approach, as it occurs with the generation of linguistic context. Also, we wanted to contribute to the research community with a well-defined Spanish resource to study contextual aspects in NLG, given the lack of enough linguistic resources to study pragmatic aspects of language for languages other than English.

[…] funded by the Generalitat Valenciana. Moreover, it has also been partially funded by the Ministry of Economic Affairs and Digital Transformation and “European

With the aim of verifying the level of linguistic and con- Union NextGenerationEU/PRTR” through the "ILENIA" textual appropriateness of COCOTEROS, we performed project (grant number 2022/TL22/00215337) and "VIVES" a two-fold evaluation. First of all, we used the magni- subproject (grant number 2022/TL22/00215334). tude estimation method with the help of three linguistics specialists to measure the linguistic and contextual appropriateness of a representative sample of the gen- References erated contexts. Then, we applied the GenBit metric to COCOTEROS to check the level of gender bias our corpus showed. On the one hand, results on the contextual appropriateness evaluation reflect the dificulties when addressing the contextual generation task even for human annotators, as annotators tended to difer on the degree of appropriateness of each context. Nevertheless, the magnitude estimation metric indicates that more than half of the evaluated contexts were scored favourably. On the other hand, the gender bias metric score shows that, with a few modifications, we could reduce the presence of gender bias in the corpus to a large extent. However, the resulting bias score cannot be conclusive as the metric did not consider some of the gender-linguistic features the generated contexts included.

Several research directions are planned for future work.

First, we would like to improve our resource, so further experiments will be made to balance gender representation in COCOTEROS, as well as to extend the number of contexts so this Spanish resource may be of help for addressing NLP tasks that need more amounts of data.

Finally, we aim to devote a branch of future research to adapting COCOTEROS corpus to the task of intention identification to better understand which reasons make humans have a particular intention when uttering a message based on the context surrounding such intention.

At the same time, we would check if LLMs can better detect specific communicative intentions depending on reference sentences and their linguistic context.

Acknowledgments The research work conducted is part of the

R&D projects “CORTEX: Conscious Text Generation” (PID2021-123956OB-I00), funded by MCIN/ AEI/10.13039/501100011033/ and by “ERDF A way of making Europe”; “CLEAR.TEXT:Enhancing the modernization public sector organizations by deploying Natural Language Processing to make their digital content CLEARER to those with cognitive disabilities” (TED2021-130707B-I00), funded by MCIN/AEI/10.13039/501100011033 and “European Union NextGenerationEU/PRTR”; and the project “NL4DISMIS: Natural Language Technologies for dealing with dis- and misinformation” with grant reference (CIPROM/2021/21) science/article/pii/S0749596X21000966. doi:https: //aclanthology.org/2023.findings-eacl.60. doi: 10. //doi.org/10.1016/j.jml.2021.104313. 18653/v1/2023.findings-eacl.60. [9] T. Heck, D. Meurers, On the relevance and learner [16] C. Strathearn, D. Gkatzia, Task2Dial dataset: A dependence of co-text complexity for exercise dif- novel dataset for commonsense-enhanced taskifculty, in: D. Alfter, E. Volodina, T. François, based dialogue grounded in documents, in: ProceedA. Jönsson, E. Rennes (Eds.), Proceedings of the 12th ings of the 4th International Conference on Natural Workshop on NLP for Computer Assisted Language Language and Speech Processing (ICNLSP 2021), Learning, LiU Electronic Press, Tórshavn, Faroe Is- Association for Computational Linguistics, Trento, lands, 2023, pp. 71–84. URL: https://aclanthology. Italy, 2021, pp. 242–251. URL: https://aclanthology. org/2023.nlp4call-1.9. org/2021.icnlsp-1.28. [10] T. Tyagi, C. G. Magdamo, A. Noori, Z. Li, X. Liu, [17] D. Ghosal, S. Shen, N. Majumder, R. Mihalcea, S. PoM. Deodhar, Z. Hong, W. Ge, E. M. Ye, Y. han ria, CICERO: A dataset for contextualized comSheu, H. Alabsi, L. Brenner, G. K. Robbins, S. Za- monsense inference in dialogues, in: Proceedings far, N. Benson, L. Moura, J. Hsu, A. Serrano-Pozo, of the 60th Annual Meeting of the Association D. Prokopenko, R. E. Tanzi, B. T. Hyman, D. 
Blacker, for Computational Linguistics (Volume 1: Long S. S. Mukerji, M. B. Westover, S. Das, Using deep Papers), Association for Computational Linguislearning to identify patients with cognitive impair- tics, Dublin, Ireland, 2022, pp. 5010–5028. URL: ment in electronic health records, in: Proceed- https://aclanthology.org/2022.acl-long.344. doi:10. ings of Machine Learning Research ML4H, 2021. 18653/v1/2022.acl-long.344. arXiv:2111.09115. [18] R. Finkbeiner, J. Meibauer, P. B. Schumacher, What [11] J. Verschueren, Context and structure in a theory of is a context? Linguistic approaches and challenges, pragmatics, Studies in Pragmatics 10 (2008) 14–24. volume 196 of Linguistik aktuell = linguistics today, [12] T. Lai, H. Ji, T. Bui, Q. H. Tran, F. Dernoncourt, John Benjamins Pub. Co., Amsterdam, 2012.

W. Chang, A context-dependent gated module for [19] G. Hollis, Delineating linguistic contexts, and incorporating symbolic semantics into event coref- the validity of context diversity as a measure erence resolution, in: Proceedings of the 2021 Con- of a word’s contextual variability, Journal ference of the North American Chapter of the As- of Memory and Language 114 (2020) 104146. sociation for Computational Linguistics: Human URL: https://www.sciencedirect.com/science/ Language Technologies, Association for Compu- article/pii/S0749596X20300607. doi:https: tational Linguistics, Online, 2021, pp. 3491–3499. //doi.org/10.1016/j.jml.2020.104146. URL: https://aclanthology.org/2021.naacl-main.274. [20] G. Ferrari, Types of contexts and their role in multidoi:10.18653/v1/2021.naacl-main.274. modal communication, Computational Intelligence [13] L. Tamine, M. Daoud, Evaluation in contex- 13 (1997) 414–426.

tual information retrieval: Foundations and re- [21] S. Castilho, J. L. Cavalheiro Camargo, M. Menezes, cent advances within the challenges of context dy- A. Way, DELA corpus - a document-level corpus annamicity and data privacy, ACM Comput. Surv. notated with context-related issues, in: Proceedings 51 (2018). URL: https://doi.org/10.1145/3204940. of the Sixth Conference on Machine Translation, doi:10.1145/3204940. Association for Computational Linguistics, Online, [14] C. Hadiwinoto, H. T. Ng, W. C. Gan, Improved 2021, pp. 566–577. URL: https://aclanthology.org/ word sense disambiguation using pre-trained con- 2021.wmt-1.63. textualized word representations, in: Proceed- [22] T. Udagawa, A. Aizawa, A natural language ings of the 2019 Conference on Empirical Meth- corpus of common grounding under continuous ods in Natural Language Processing and the 9th and partially-observable context, in: ProceedInternational Joint Conference on Natural Lan- ings of the AAAI Conference on Artificial Intelguage Processing (EMNLP-IJCNLP), Association ligence, AAAI Press, 2019. URL: https://doi.org/ for Computational Linguistics, Hong Kong, China, 10.1609/aaai.v33i01.33017120. doi:10.1609/aaai. 2019, pp. 5297–5306. URL: https://aclanthology.org/ v33i01.33017120.

D19-1533. doi:10.18653/v1/D19-1533. [23] B. Y. Lin, W. Zhou, M. Shen, P. Zhou, C. Bhagavat[15] D. Su, M. Patwary, S. Prabhumoye, P. Xu, R. Prenger, ula, Y. Choi, X. Ren, CommonGen: A constrained M. Shoeybi, P. Fung, A. Anandkumar, B. Catan- text generation challenge for generative commonzaro, Context generation improves open domain sense reasoning, in: Findings of the Association question answering, in: Findings of the As- for Computational Linguistics: EMNLP 2020, Association for Computational Linguistics: EACL sociation for Computational Linguistics, Online, 2023, Association for Computational Linguistics, 2020, pp. 1823–1840. URL: https://aclanthology. Dubrovnik, Croatia, 2023, pp. 793–808. URL: https: org/2020.findings-emnlp.165. doi: 10.18653/v1/ 2020.findings-emnlp.165. [32] E. G. Bard, D. Robertson, A. Sorace, Magnitude [24] F. Carlsson, J. Öhman, F. Liu, S. Verlinden, J. Nivre, estimation of linguistic acceptability, Language M. Sahlgren, Fine-grained controllable text gen- 72 (1996) 32–68. URL: http://www.jstor.org/stable/ eration using non-residual prompting, in: Pro- 416793. ceedings of the 60th Annual Meeting of the As- [33] J. Novikova, O. Dušek, V. Rieser, RankME: Reliable sociation for Computational Linguistics (Volume 1: human ratings for natural language generation, in: Long Papers), Association for Computational Lin- Proceedings of the 2018 Conference of the North guistics, Dublin, Ireland, 2022, pp. 6837–6857. URL: American Chapter of the Association for Computahttps://aclanthology.org/2022.acl-long.471. doi:10. tional Linguistics: Human Language Technologies, 18653/v1/2022.acl-long.471. Volume 2 (Short Papers), Association for Compu[25] M. Conover, M. Hayes, A. Mathur, J. Xie, J. Wan, tational Linguistics, New Orleans, Louisiana, 2018, S. Shah, A. Ghodsi, P. Wendell, M. Zaharia, pp. 72–78. URL: https://aclanthology.org/N18-2012. R. Xin, Free Dolly: Introducing the world’s first doi:10.18653/v1/N18-2012. truly open instruction-tuned LLM, 2023. 
URL: [34] A. Turpin, F. Scholer, S. Mizzaro, E. Maddalena, https://www.databricks.com/blog/2023/04/12/ The benefits of magnitude estimation relevance dolly-first-open-commercially-viable-instruction-tuned-llm.assessments for information retrieval evaluation, [26] E. Sanchez-Bayona, R. Agerri, Leveraging a in: Proceedings of the 38th International ACM SInew Spanish corpus for multilingual and cross- GIR Conference on Research and Development in lingual metaphor detection, in: Proceedings of Information Retrieval, SIGIR ’15, Association for the 26th Conference on Computational Natural Computing Machinery, New York, NY, USA, 2015, Language Learning (CoNLL), Association for Com- p. 565–574. URL: https://doi.org/10.1145/2766462. putational Linguistics, Abu Dhabi, United Arab 2767760. doi:10.1145/2766462.2767760. Emirates (Hybrid), 2022, pp. 228–240. URL: https: [35] S. Santhanam, S. Shaikh, Understanding the //aclanthology.org/2022.conll-1.16. doi:10.18653/ impact of experiment design for evaluating diav1/2022.conll-1.16. logue system output, in: Proceedings of the The [27] V. Kovatchev, M. Taulé, InferES : A natural language Fourth Widening Natural Language Processing inference corpus for Spanish featuring negation- Workshop, Association for Computational Linguisbased contrastive and adversarial examples, in: tics, Seattle, USA, 2020, pp. 124–127. URL: https:// Proceedings of the 29th International Conference aclanthology.org/2020.winlp-1.33. doi:10.18653/ on Computational Linguistics, International Com- v1/2020.winlp-1.33. mittee on Computational Linguistics, Gyeongju, Re- [36] R. Doust, P. Piwek, A model of suspense for narrapublic of Korea, 2022, pp. 3873–3884. URL: https: tive generation, in: Proceedings of the 10th Inter//aclanthology.org/2022.coling-1.340. national Conference on Natural Language Gener[28] W. G. Reijnierse, C. Burgers, T. Krennmayr, G. 
Steen, ation, Association for Computational Linguistics, The role of co-text in the analysis of potentially Santiago de Compostela, Spain, 2017, pp. 178–187. deliberate metaphor, in: Drawing Attention to URL: https://aclanthology.org/W17-3527. doi:10. Metaphor: Case studies across time periods, cul- 18653/v1/W17-3527. tures and modalities, John Benjamins Publishing [37] S. Pita-Fernández, Determinación del tamaño muesCompany, 2020, pp. 15–38. tral, Cuadernos de atención primaria 3 (1996) 138– [29] J. Manyika, An overview of Bard: An early 141.

experiment with generative AI, Technical Re- [38] C. Barros, M. Vicente, E. Lloret, To what port, Tech. rep., Technical report, Google AI, extent does content selection afect surface re2023. URL: https://ai.google/static/documents/ alization in the context of headline generagoogle-about-bard.pdf. tion?, Computer Speech & Language 67 [30] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. (2021) 101179. URL: https://www.sciencedirect.com/ Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Ham- science/article/pii/S0885230820301121. doi:https: bro, F. Azhar, et al., LLaMA: Open and eficient //doi.org/10.1016/j.csl.2020.101179. foundation language models, Computing Research [39] Y. G. Vázquez, A. F. Orquín, A. M. Guijarro, S. V.

Repository, arXiv:2302.13971. Version 1 (2023). Pérez, Integración de recursos semánticos basados [31] W.-L. Chiang, Z. Li, Z. Lin, Y. Sheng, Z. Wu, en WordNet, Procesamiento del lenguaje natural H. Zhang, L. Zheng, S. Zhuang, Y. Zhuang, J. E. 45 (2010) 161–168.

Gonzalez, I. Stoica, E. P. Xing, Vicuna: An open- [40] A. Siddharthan, N. Katsos, Ofline sentence processsource chatbot impressing GPT-4 with 90%* Chat- ing measures for testing readability with users, in: GPT quality, 2023. URL: https://lmsys.org/blog/ Proceedings of the First Workshop on Predicting 2023-03-30-vicuna/. and Improving Text Readability for target reader populations, Association for Computational Linguistics, Montréal, Canada, 2012, pp. 17–24. URL: https://aclanthology.org/W12-2203. [41] M. R. Costa-jussà, An analysis of gender bias studies in natural language processing, Nature Machine

Intelligence 1 (2019) 495–496. [42] S. R. Bowman, G. Angeli, C. Potts, C. D. Manning,

A large annotated corpus for learning natural language inference, in: Proceedings of 2015 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Lisbon, Portugal, 2015, pp. 632–642. URL: http: //aclanthology.lst.uni-saarland.de/D15-1075.pdf .

[1]

Ji ,

Lee ,

Frieske ,

Yu ,

Su ,

Xu ,

Ishii ,

Y. J.

Bang ,

Madotto ,

Fung , Survey of hallucination in natural language generation , ACM Computing Surveys 55 ( 2023 ). URL: https://doi.org/ 10.1145/3571730. doi: 10 .1145/3571730.

[2]

M. a. T. Yan

Li ,

Liu , From semantics to pragmatics: Where IS can lead in natural language processing (NLP) research , European Journal of Information Systems 30 ( 2021 ) 569 - 590 . doi: 10 .1080/ 0960085X. 2020 . 1816145 .

[3]

Newman ,

Cohn-Gordon ,

Potts , Communication-based evaluation for natural language generation , in: Proceedings of the Society for Computation in Linguistics 2020 , Association for Computational Linguistics , New York, New York, 2020 , pp. 116 - 126 . URL: https://aclanthology.org/ 2020 .scil- 1 . 16 .

[4]

J. C. B.

Cruz ,

J. K.

Resabal ,

Lin ,

D. J.

Velasco , C. Cheng, Exploiting news article structure for automatic corpus generation of entailment datasets , in: D. N. Pham,

Theeramunkong ,

Governatori , F. Liu (Eds.), PRICAI 2021: Trends in Artificial Intelligence , Springer International Publishing, Cham, 2021 , pp. 86 - 99 .

[5]

M. E.

Vallecillo-Rodríguez ,

Montejo-Raéz ,

M. T.

Martín-Valdivia , Automatic counter-narrative generation for hate speech in spanish , Procesamiento del Lenguaje Natural 71 ( 2023 ) 227 - 245 .

[6]

Sengupta ,

Maher ,

Groves ,

Olieman , Genbit: measure and mitigate gender bias in language datasets , Microsoft Journal of Applied Research 16 ( 2021 ) 63 - 71 .

[7]

Surana ,

T.-N.

Ho ,

Tun ,

E. S.

Chng , CASSI: Contextual and semantic structure-based interpolation augmentation for low-resource NER , in: H. Bouamor , J. Pino , K. Bali (Eds.), Findings of the Association for Computational Linguistics: EMNLP 2023 , Association for Computational Linguistics , Singapore, 2023 , pp. 9729 - 9742 . URL: https:// aclanthology.org/ 2023 .findings-emnlp. 651 . doi: 10 . 18653/v1/ 2023 .findings-emnlp. 651 .

[8]

B. T.

Johns ,

M. N.

Jones , Content matters: Measures of contextual diversity must consider semantic content , Journal of Memory and Language 123 ( 2022 ) 104313 . URL: https://www.sciencedirect.com/