Syntax Savants-UA at IberLEF 2024: Leveraging FLAN-T5-XXL for Automatic 5W1H Identification in Texts

Eduardo Grande¹, Ahmed Begga²
¹ Department of Software and Computing Systems, University of Alicante, Spain
² Department of Computer Science and AI, University of Alicante, Spain

IberLEF 2024, September 2024, Valladolid, Spain
eduardo.grande@ua.es (E. Grande); ahmed.begga@ua.es (A. Begga)
https://cvnet.cpd.ua.es/curriculum-breve/es/grande-ruiz-eduardo/327690 (E. Grande); https://github.com/AhmedBegggaUA (A. Begga)
ORCID: 0000-0002-4894-3943 (E. Grande); 0009-0000-8733-2072 (A. Begga)

Abstract
The 5W1H technique involves formulating six questions to briefly capture the basic information of a news text: What, When, Where, Who, Why, and How. This method enables an easy understanding of the news. This work, part of the FLARES competition at the Iberian Languages Evaluation Forum 2024, presents a Natural Language Processing (NLP) model, specifically FLAN-T5-XXL, designed to automatically extract 5W1H elements, achieving an F1-score of 0.543. Our methodology includes creating task-specific templates and prompts and then fine-tuning the FLAN-T5-XXL model on annotated Spanish news articles provided by the competition organizers. We leverage Low-Rank Adaptation (LoRA) for efficient fine-tuning and conduct a comprehensive hyperparameter optimization to enhance model performance and output generation. The evaluation demonstrates that FLAN-T5-XXL significantly outperforms the other models tested. This study underscores the potential of generative NLP models in automating information extraction from news texts.

Keywords
Natural Language Processing, Large Language Models, Information extraction, 5W1H

1. Introduction

The advent of digital media has led to a dramatic surge in the volume of news articles [1]. These articles contain a wealth of pertinent information such as events, people, reasons, times, places, and methods [2, 3]. Consequently, news-text mining is increasingly used across domains for its practical applications: credibility prediction, bridging the gap between unstructured and structured information to provide richer datasets, and identifying potential biases or misrepresentations [4, 5].

In journalism, numerous studies have shown that narrative text is highly expressive, allowing journalists to describe events accurately [6]. There is therefore a demand for tools that process this valuable narrative text and extract useful knowledge to assist journalists [7]. In this regard, knowledge extraction is a subdomain of Natural Language Processing (NLP) used to computationally analyze such unstructured textual data and extract structured information [8]. This technique aims to identify related journalistic concepts, which is a cornerstone of identifying journalistic information, organizing data into a representation suitable for efficient analysis, and consolidating the usable knowledge that journalists disperse across files of various formats, such as news reports and articles [9].

In this context, FLARES proposes a challenge [10] within the IberLEF 2024 competition [11]: a set of shared sub-tasks that demand the development of tools for Fine-Grained Language-based Reliability Detection in Spanish News. In this work, we focus on the first sub-task, 5W1H identification, which aims to locate and classify the WHAT, WHO, WHY, WHEN, WHERE, and HOW elements within a text.
The proposed system leverages the FLAN-T5-XXL [12] model, a state-of-the-art deep learning architecture [13], to perform this task.

The remainder of the manuscript is organized as follows. Section 2 describes the FLARES sub-task of 5W1H identification. Section 3 details the FLAN-T5-XXL-based system proposed for the challenge. Section 4 presents and discusses the results achieved in the challenge. Finally, Section 5 summarizes the findings of this work.

2. Task and dataset description

Assessing the reliability of the language used in news writing is becoming increasingly crucial in today's digital media landscape. Identifying specific segments of a news article to gauge linguistic credibility offers a more nuanced understanding of the message's truthfulness. This approach not only enhances our grasp of how information is presented but also paves the way for more effective techniques for spotting fake or misleading news. Recent studies [14] have highlighted the importance of analyzing language style, tone, and structure in identifying deceptive content. Style and language have proven valuable in distinguishing between fake and true articles, and specific linguistic features have been shown to indicate potential biases or misrepresentations in online content. These studies underscore the growing importance of linguistic analysis for discerning trustworthy news in the digital age.

In this context, we harness the "5W1H" technique, commonly employed by journalists to present the key information of a news item clearly and explicitly. This method focuses on identifying the WHAT, WHO, WHY, WHEN, WHERE, and HOW elements within a piece of news. By applying this technique, we can systematically evaluate the reliability of the language across these dimensions. Analyzing the presence of these fundamental journalistic questions offers a structured approach to gauging the linguistic integrity and possible biases of the content. Moreover, the challenge uses texts in Spanish, aiming to advance the development of techniques of this nature tailored specifically to this language [15]. This integration of journalistic methodology with linguistic analysis provides a comprehensive framework and could pave the way for enhancing the authenticity and trustworthiness of information in the Spanish digital media landscape.

The FLARES dataset [10] consists of Spanish news articles sourced from various digital media outlets, covering a wide range of topics such as politics, economy, health, technology, and culture. This diversity ensures a broad spectrum of linguistic styles and structures, providing a rich resource for analyzing language reliability. Each article in the dataset is annotated with the 5W1H elements, highlighting the WHAT, WHO, WHY, WHEN, WHERE, and HOW aspects of the content. Additionally, annotations include markers of linguistic credibility, such as tone, style, and indicators of potential bias. These annotations support the development and testing of models aimed at detecting the fine-grained linguistic reliability of news content. To ensure high quality and consistency, the articles are curated from reputable sources and manually reviewed by experts proficient in Spanish. This review process guarantees that the annotations accurately reflect essential journalistic elements and credibility indicators. A hypothetical record illustrating the span-based annotation scheme implied by the evaluation protocol of Section 4 is sketched below.
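The official FLARES file format is not reproduced in this paper. Purely as an illustration of the span-based scheme, a single annotated record could be represented along the following lines; all field names and the example sentence are hypothetical, not the official schema:

    # Hypothetical annotated record; not the official FLARES schema.
    # Spans are (start, end) character offsets into "text", end-exclusive.
    example = {
        "text": "El Gobierno aprobó ayer en Madrid un nuevo paquete de medidas.",
        "annotations": [
            {"label": "WHO",   "start": 0,  "end": 11},  # "El Gobierno"
            {"label": "WHAT",  "start": 12, "end": 18},  # "aprobó"
            {"label": "WHEN",  "start": 19, "end": 23},  # "ayer"
            {"label": "WHERE", "start": 27, "end": 33},  # "Madrid"
        ],
    }

    for ann in example["annotations"]:
        print(ann["label"], "->", example["text"][ann["start"]:ann["end"]])

Character offsets of this kind are what the matching categories of Section 4 (correct, partial, and so on) are computed over.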
The dataset serves as a valuable resource for advancing research in linguistic reliability detection and for fostering the creation of more reliable and trustworthy Spanish digital media content.

Table 1 presents the distribution of the 5W1H elements in the training set. Each text can contain multiple instances of 5W1H elements, including more than one instance of the same element.

Table 1
Distribution of 5W1H elements in the training set.

Element            Count
Number of texts    1,585
WHAT               2,711
WHERE              1,024
WHEN                 778
WHO                  533
HOW                  563

This comprehensive annotation approach helps in systematically evaluating and improving the reliability of news content in Spanish.

3. Methodology

To assess the linguistic reliability of Spanish news articles, we implemented a comprehensive methodology that integrates template creation, prompt engineering, and the fine-tuning of a Large Language Model (LLM).

Figure 1: Overview of the methodology for Fine-Grained Language-based Reliability Detection. [Diagram: templates and instructions are combined with the labeled train set; FLAN-T5-XXL is fine-tuned on the result; the fine-tuned model is then run on the unlabeled test set to produce predictions.]

The individual steps shown in Figure 1 are explained below.

3.1. Template creation

Following the procedure established by Google when creating FLAN [12], we defined a set of templates for this specific task of extracting 5W1H information. As discussed in the following sections, we created six models, one per category; accordingly, each template is defined as a set of six templates, one per 5W1H element. In total, we generated 25 template sets, both manually and with ChatGPT 3.5 [16].

{
  "templates": {
    "1": [
      "Dado el texto: \"{text}\", la W What es: \"{W_what}\"",
      "Dado el texto: \"{text}\", la W Who es: \"{W_who}\"",
      "Dado el texto: \"{text}\", la W Where es: \"{W_where}\"",
      "Dado el texto: \"{text}\", la W When es: \"{W_when}\"",
      "Dado el texto: \"{text}\", la W Why es: \"{W_why}\"",
      "Dado el texto: \"{text}\", la H How es: \"{W_how}\""
    ],
    ...
  }
}

Figure 2: Example of a template set.

El método 5W + 1H es una técnica de análisis que se utiliza para obtener una comprensión profunda de un problema o situación, identificando seis preguntas clave: qué (What), quién (Who), dónde (Where), cuándo (When), por qué (Why) y cómo (How). Al responder estas preguntas de manera exhaustiva, podemos obtener una visión completa del problema y establecer una base sólida para su resolución. El What describe la acción o evento principal que se está investigando o describiendo, proporcionando la base de la comprensión. Es la parte central que responde a la pregunta sobre qué está ocurriendo o qué se está discutiendo. Dado el texto: "Dos días, exactamente han pasado dos días desde que Sánchez compareciera en rueda de prensa en la Moncloa afirmando que a España llegarían, entre abril y septiembre, un total de 87 millones de vacunas para darnos cuenta de que las mentiras de Sánchez hacen bueno ese refrán que dice que 'la mentira tiene las patas muy cortas'", la W What es:

Figure 3: Example of a fully constructed prompt, combining a pre-prompt, the description of the What element, and a template instantiated with a data sample.

These templates are then combined with the provided data to generate the prompts that serve as inputs to the model. Figure 2 shows one example template set, and a minimal sketch of this instantiation step is given below.
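No data-construction code is published with this paper. The following minimal Python sketch shows how one template from a set such as that of Figure 2 might be instantiated with an annotated sample to produce an input/target training pair. The function name, the split of the template into input and target, and the gold answer shown are our assumptions:

    # Illustrative sketch of template instantiation (names are ours).
    # One plausible split: the text up to "es:" becomes the model input,
    # and the annotated 5W1H span becomes the generation target.
    template_set = {
        "what": 'Dado el texto: "{text}", la W What es: ',
        "who":  'Dado el texto: "{text}", la W Who es: ',
        # ... one entry each for where, when, why and how
    }

    def build_pair(element, text, gold_answer, pre_prompt="", description=""):
        """Prepend optional pre-prompt and W/H description, then fill the template."""
        prompt = pre_prompt + description + template_set[element].format(text=text)
        return {"input": prompt, "target": gold_answer}

    # Hypothetical usage with a (truncated) training sample:
    pair = build_pair("what", "Dos días, exactamente han pasado dos días...",
                      "han pasado dos días")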
To test further strategies, additional templates were defined. These were not meant to be merged with the data to create input prompts; instead, they are appended before the main prompt so that the final prompt carries more information about the task to perform. These templates are:

• Pre-prompts: a pre-prompt is a general template that explains what the 5W1H elements are. The aim is that explaining the task to the model in this way improves the results.
• Descriptions of a W: whereas pre-prompts give a general description of the 5W1H method, a description characterizes a single W/H, specifying what type of answer the corresponding W/H should elicit.

Figure 3 shows a complete example of a prompt that contains a pre-prompt, a description, and a template, instantiated with an element of the data.

3.2. Input data preparation

Once all the templates have been defined, input data preparation consists of merging those templates with the annotated corpus provided. The generated prompts are zero-shot: a single input prompt contains only the information of the templates and the given text, without any example answers, as already shown in Figure 3. To merge the data, the templates are iterated over cyclically until each piece of data has been assigned a single template. If pre-prompts and/or descriptions are appended, they are chosen randomly among the defined ones. The prepared input prompts were saved to a JSON file that could be easily loaded when training the models.

3.3. Model fine-tuning

The fine-tuning of the FLAN-T5-XXL [17] model was conducted using parameters tailored to optimize performance on the given task, adapting the model to effectively answer the 5W1H questions. We configured the following hyperparameters: a learning rate of 1 × 10⁻³ to control the adjustment of the model weights during training, and a batch size of 32. Training ran for one epoch, which was sufficient for the model to learn the patterns in the training data without overfitting. All these values were selected through a hyperparameter search, so the chosen configuration is the one that best improved model performance. The maximum input length was capped at 1024 tokens so that the model could handle lengthy input sequences, and the maximum output length was set to 400 tokens to accommodate detailed responses.

To further enhance the model's performance, we implemented Low-Rank Adaptation (LoRA) [18], an efficient fine-tuning technique that injects trainable rank-decomposition matrices into each layer of the transformer architecture, optimizing the model without drastically increasing the computational load. For our task, we set the rank of the decomposition matrices (r) to 8 and the scaling factor (lora_alpha) to 16 for the LoRA layers. We targeted the query (q) and value (v) matrices within the transformer layers, as these are critical components of the model's attention mechanism. A dropout rate of 0.05 was applied to the LoRA layers to prevent overfitting and enhance the model's generalizability. The task type for the LoRA configuration was sequence-to-sequence language modelling, which is appropriate for generating text responses from input prompts. These configurations were instrumental in adapting the pre-trained FLAN-T5-XXL model to our specific requirements, ensuring that it could effectively process and respond to 5W1H questions. A sketch of this setup is given below.
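The training code itself is not released with this paper. The sketch below reconstructs the reported configuration using the Hugging Face transformers and peft libraries, which we assume here; settings not stated in the text, such as the output directory, are placeholders:

    from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                              Seq2SeqTrainingArguments)
    from peft import LoraConfig, TaskType, get_peft_model

    model_name = "google/flan-t5-xxl"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

    # LoRA configuration as reported in the text: rank 8, scaling factor 16,
    # query/value projections targeted, dropout 0.05, seq2seq task type.
    lora_config = LoraConfig(
        r=8,
        lora_alpha=16,
        target_modules=["q", "v"],  # query and value matrices of each attention block
        lora_dropout=0.05,
        task_type=TaskType.SEQ_2_SEQ_LM,
    )
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()  # only the LoRA matrices are trainable

    # Reported training hyperparameters; output_dir is a placeholder.
    args = Seq2SeqTrainingArguments(
        output_dir="flan-t5-xxl-5w1h",
        learning_rate=1e-3,
        per_device_train_batch_size=32,
        num_train_epochs=1,
    )
    # Inputs are truncated at 1024 tokens and outputs capped at 400 tokens
    # at tokenization/generation time, as described above.

Wrapping the frozen base model with get_peft_model means that only the injected rank-8 matrices are updated, which is what keeps fine-tuning an 11B-parameter model tractable.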
3.4. Inference and output parsing

During the inference phase, we conducted a comprehensive hyperparameter search to determine the optimal settings for generating the best possible results. The hyperparameters we varied were the number of beams for beam search, the use of early stopping, whether to sample, the top-p and top-k sampling parameters, and the temperature. The specific values tested are summarized in Table 2.

Table 2
Hyperparameters and their values used in the inference phase.

Hyperparameter    Values
num_beams         3, 4, 5, 6, 7, 8
early_stopping    True, False
do_sample         True, False
top_p             0.5, 0.7, 0.9, 1
top_k             50, 100, 200
temperature       0.5, 0.6, 0.7, 0.8, 0.9, 1

By systematically varying these parameters, we aimed to find the combination that produced the most accurate and coherent answers to the 5W1H questions. The number of beams controls the breadth of the search space: higher values provide more candidate sequences but also increase computational complexity. Early stopping was tested to determine whether halting the search as soon as a complete sequence was found would yield better or faster results. The sampling parameters do_sample, top_p, and top_k were adjusted to balance deterministic outputs against diverse, creative responses. The temperature parameter controls the randomness of predictions, with lower values making the model more conservative and higher values introducing more variability.

Once inference was complete, the results were parsed to isolate the positions of the 5W1H answers. This was a crucial step because our evaluation metric relies on accurately identifying the spans of the text where the answers are located. For instances where no valid 5W1H answer was found, we formatted the response as an empty string instead of the placeholder "No answer". This ensured consistency and allowed seamless integration with our evaluation pipeline. By recording the start and end positions of each 5W1H answer within the generated text, we were able to evaluate the model's performance precisely. This approach allowed us not only to assess the accuracy of the responses but also to understand how well the model could locate relevant information within a given context. The detailed hyperparameter tuning and meticulous output parsing were essential to optimize the model's capability to provide reliable and contextually appropriate answers to the 5W1H questions. Sketches of the generation sweep and of the span-parsing step are given below.
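The hyperparameter search over the values of Table 2 can be sketched as a simple grid loop over the generate method of the fine-tuned model, continuing the names of the previous sketch; the prompts shown and the development-set scoring helper are hypothetical:

    from itertools import product

    prompts = ["..."]  # prepared zero-shot prompts (Section 3.2); placeholder
    inputs = tokenizer(prompts, return_tensors="pt", padding=True,
                       truncation=True, max_length=1024)

    # Candidate values reproduced from Table 2.
    grid = {
        "num_beams": [3, 4, 5, 6, 7, 8],
        "early_stopping": [True, False],
        "do_sample": [True, False],
        "top_p": [0.5, 0.7, 0.9, 1.0],
        "top_k": [50, 100, 200],
        "temperature": [0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
    }

    best_score, best_config = float("-inf"), None
    for values in product(*grid.values()):
        config = dict(zip(grid.keys(), values))
        # Note: top_p, top_k and temperature only take effect when do_sample=True.
        outputs = model.generate(**inputs, max_new_tokens=400, **config)
        answers = tokenizer.batch_decode(outputs, skip_special_tokens=True)
        score = evaluate_on_dev(answers)  # hypothetical scoring helper
        if score > best_score:
            best_score, best_config = score, config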
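The parsing step, that is, recovering the character offsets of each generated answer, might look like the following sketch; matching on the first verbatim occurrence via str.find is our assumption about the exact matching logic:

    def locate_answer(source_text, generated_answer):
        """Return (start, end) character offsets of the answer in the text, or None."""
        answer = generated_answer.strip()
        if not answer:   # "no answer" cases are emitted as empty strings
            return None
        start = source_text.find(answer)
        if start == -1:  # the generated answer is not a verbatim span of the text
            return None
        return start, start + len(answer)

    span = locate_answer("Dos días, exactamente han pasado dos días...",
                         "han pasado dos días")
    print(span)  # (22, 41): offsets of the kind used by the span-based evaluation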
4. Results

To evaluate the performance of our model, we used the metrics proposed by [19]. This evaluation method defines five categories of matches for span classification: correct, partial, missing, incorrect, and spurious matches.

• Correct matches. Reported when a text in the predicted file matches a corresponding text span in the gold file exactly, including the start and end indices and the 5W1H label. Only one correct match per entry in the gold file is counted.
• Incorrect matches. Reported when the start and end indices match, but not the 5W1H type.
• Partial matches. Reported when two intervals [start, end] have a non-empty intersection and match the 5W1H label. A partial phrase is only matched against a single correct phrase.
• Missing matches. Those that appear in the gold file but not in the predicted file.
• Spurious matches. Those that appear in the predicted file but not in the gold file.

These metrics ensure a comprehensive and accurate evaluation of the model's ability to correctly identify and classify the spans related to the 5W1H questions.

The results obtained by following the pipeline outlined in Figure 1 (with the exception that we tested other LLMs in addition to FLAN-T5-XXL) are shown in Table 3. Our best result on the test set was achieved with the FLAN-T5-XXL model, with a score of 54.256%. Other models, namely FLAN-T5-Large [12], Bloom-7B1 [20], Bloomz-7B1 [21], and Llama-3-8B-Instruct [22], were also tested for comparison.

Table 3
Performance of various LLMs on the 5W1H task.

Model                 Score (%)
FLAN-T5-XXL           54.256
FLAN-T5-Large         44.222
Bloomz-7B1            43.875
Bloom-7B1             29.338
Llama-3-8B-Instruct   13.606

These results demonstrate that the FLAN-T5-XXL model significantly outperformed the other models in this task, showcasing its ability to effectively understand and respond to 5W1H questions.

5. Conclusions

In this study, we explored the fine-tuning and evaluation of various LLMs for the task of answering 5W1H questions. Our approach involved detailed data preparation, systematic hyperparameter tuning during inference, and precise output parsing to ensure robust evaluation. The FLAN-T5-XXL model emerged as the top performer, significantly surpassing the other tested models: FLAN-T5-Large, Bloom-7B1, Bloomz-7B1, and Llama-3-8B-Instruct. This superior performance underscores the efficacy of advanced fine-tuning techniques and the importance of hyperparameter optimization in enhancing model capabilities.

By using the comprehensive evaluation metrics described above, we ensured a rigorous assessment of our models, considering the various match types: correct, partial, missing, incorrect, and spurious. This thorough evaluation methodology provided clear insights into the strengths and limitations of each model. The results demonstrate the potential of LLMs, particularly FLAN-T5-XXL, to effectively address complex NLP tasks that require a deep understanding and generation of text based on specific prompts. Future work could focus on further refining these models, exploring additional fine-tuning techniques, and extending the evaluation to broader datasets and more diverse question types.

Overall, this study highlights the critical role of fine-tuning and careful evaluation in maximizing the performance of large language models, paving the way for more accurate and contextually aware AI systems.

Acknowledgments

The work of the author Eduardo Grande is part of grant PRE2022-101573, funded by MCIN/AEI/10.13039/501100011033 and the ESF+. The author Ahmed Begga is funded by project PID2022-142516OB-I00 of the Spanish Government.

References

[1] J. Lugea, Linguistic Approaches to Fake News Detection, 2021, pp. 287–302.
[2] B. D. Horne, S. Adali, This just in: Fake news packs a lot in title, uses simpler, repetitive content in text body, more similar to satire than real news, CoRR abs/1703.09398 (2017). URL: http://arxiv.org/abs/1703.09398. arXiv:1703.09398.
[3] A. Bessi, M. Coletto, G. A. Davidescu, A. Scala, G. Caldarelli, W. Quattrociocchi, Science vs conspiracy: Collective narratives in the age of misinformation, PLoS ONE 10 (2014). URL: https://api.semanticscholar.org/CorpusID:2274379.
[4] F. Zollo, P. Kralj Novak, M. Del Vicario, A. Bessi, I. Mozetic, A. Scala, G. Caldarelli, W. Quattrociocchi, Emotional dynamics in the age of misinformation, PLoS ONE 10 (2015). doi:10.1371/journal.pone.0138740.
[5] A. Bessi, A. Scala, L. Rossi, Q. Zhang, W. Quattrociocchi, The economy of attention in the age of (mis)information, Journal of Trust Management 1 (2014). doi:10.1186/s40493-014-0012-y.
[6] A. Noain-Sánchez, Addressing the impact of artificial intelligence on journalism: the perception of experts, journalists and academics, Communication & Society 35 (2022) 105–121. doi:10.15581/003.35.3.105-121.
[7] M. Túñez López, C. Fieiras-Ceide, M. Vaz-Álvarez, Impact of artificial intelligence on journalism: transformations in the company, products, contents and professional profile, Communication & Society 34 (2021) 177–193. doi:10.15581/003.
[8] S. Fu, D. Chen, H. He, S. Liu, S. Moon, K. J. Peterson, F. Shen, L. Wang, Y. Wang, A. Wen, Y. Zhao, S. Sohn, H. Liu, Clinical concept extraction: A methodology review, Journal of Biomedical Informatics 109 (2020) 103526. URL: https://www.sciencedirect.com/science/article/pii/S1532046420301544. doi:10.1016/j.jbi.2020.103526.
[9] J. Templeton, I. Timmis, A Flexible Framework for Integrating Data-Driven Learning, 2023, pp. 39–58. doi:10.1007/978-3-031-11220-1_3.
[10] R. Sepúlveda-Torres, A. Bonet-Jover, I. Diab, I. Guillén-Pacho, I. Cabrera-de Castro, C. Badenes-Olmedo, E. Saquete, M. T. Martín-Valdivia, P. Martínez-Barco, L. A. Ureña-López, Overview of FLARES at IberLEF 2024: Fine-Grained Language-based Reliability Detection in Spanish News, Procesamiento del Lenguaje Natural 73 (2024).
[11] L. Chiruzzo, S. M. Jiménez-Zafra, F. Rangel, Overview of IberLEF 2024: Natural Language Processing Challenges for Spanish and other Iberian Languages, in: Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2024), co-located with the 40th Conference of the Spanish Society for Natural Language Processing (SEPLN 2024), CEUR-WS.org, 2024.
[12] H. W. Chung, L. Hou, S. Longpre, B. Zoph, Y. Tay, W. Fedus, Y. Li, X. Wang, M. Dehghani, S. Brahma, A. Webson, S. S. Gu, Z. Dai, M. Suzgun, X. Chen, A. Chowdhery, A. Castro-Ros, M. Pellat, K. Robinson, D. Valter, S. Narang, G. Mishra, A. Yu, V. Zhao, Y. Huang, A. Dai, H. Yu, S. Petrov, E. H. Chi, J. Dean, J. Devlin, A. Roberts, D. Zhou, Q. V. Le, J. Wei, Scaling instruction-finetuned language models, 2022. arXiv:2210.11416.
[13] W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y. Hou, Y. Min, B. Zhang, J. Zhang, Z. Dong, Y. Du, C. Yang, Y. Chen, Z. Chen, J. Jiang, R. Ren, Y. Li, X. Tang, Z. Liu, P. Liu, J.-Y. Nie, J.-R. Wen, A survey of large language models, 2023. arXiv:2303.18223.
[14] B. D. Horne, S. Adali, This just in: Fake news packs a lot in title, uses simpler, repetitive content in text body, more similar to satire than real news, 2017. arXiv:1703.09398.
[15] A. Bonet-Jover, R. Sepúlveda-Torres, E. Saquete, P. Martínez-Barco, M. Nieto-Pérez, RUN-AS: a novel approach to annotate news reliability for disinformation detection, Language Resources and Evaluation (2023) 1–31.
[16] OpenAI, ChatGPT 3.5, https://openai.com/chatgpt/, 2022. Accessed: June 7, 2024.
[17] J. Wei, M. Bosma, V. Y. Zhao, K. Guu, A. W. Yu, B. Lester, N. Du, A. M. Dai, Q. V. Le, Finetuned language models are zero-shot learners, in: The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022, OpenReview.net, 2022. URL: https://openreview.net/forum?id=gEZrGCozdqR.
[18] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen, LoRA: Low-rank adaptation of large language models, 2021. arXiv:2106.09685.
[19] A. Piad-Morffis, Y. Gutiérrez, H. Cañizares-Diaz, S. Estévez-Velarde, R. Muñoz, A. Montoyo, Y. Almeida-Cruz, Overview of the eHealth knowledge discovery challenge at IberLEF 2020, 2020.
[20] B. Workshop, T. L. Scao, A. Fan, C. Akiki, E. Pavlick, S. Ilić, D. Hesslow, R. Castagné, A. S. Luccioni, F. Yvon, et al., BLOOM: A 176B-parameter open-access multilingual language model, arXiv preprint arXiv:2211.05100 (2022).
[21] N. Muennighoff, T. Wang, L. Sutawika, A. Roberts, S. Biderman, T. L. Scao, M. S. Bari, S. Shen, Z.-X. Yong, H. Schoelkopf, et al., Crosslingual generalization through multitask finetuning, arXiv preprint arXiv:2211.01786 (2022).
[22] AI@Meta, Llama 3 model card (2024). URL: https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md.