<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <article-id pub-id-type="doi">10.1007/978-3-030-03840-3_29</article-id>
      <title-group>
        <article-title>Leveraging LLMs for Event Extraction in Italian Documents: a Roadmap for Future Research</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Federica Rollo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giovanni Bonisoli</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Laura Po</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>"Enzo Ferrari" Engineering Department, University of Modena and Reggio Emilia</institution>
          ,
          <addr-line>MO 41121</addr-line>
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>6</volume>
      <fpage>29</fpage>
      <lpage>30</lpage>
      <abstract>
<p>Event extraction is a task of significant interest in the field of Natural Language Processing (NLP) and plays a vital role in various applications, such as information retrieval and document summarization. Large Language Models (LLMs) have demonstrated remarkable capabilities in natural language understanding and generation. In this paper, we present a roadmap for the application of LLMs for event extraction from Italian documents, aiming to address the gap in research and resources for event extraction in non-English languages. We first discuss the challenges of event extraction and the current state-of-the-art approaches based on LLMs. Next, we present potential Italian datasets suitable for adapting linguistic models to the domain of event extraction. Furthermore, we outline future research directions and potential areas for improvement in this evolving field.</p>
      </abstract>
      <kwd-group>
<kwd>event extraction</kwd>
        <kwd>Large Language Model</kwd>
        <kwd>Italian language</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <sec id="sec-1-1">
        <title>2.2. Challenges</title>
        <p>Due to the complexity of natural language, event extraction poses several challenges that require sophisticated techniques to address effectively.</p>
        <p>The first challenge consists of detecting multiple events described in the same document and understanding which are the references to each event. Natural language often contains ambiguous expressions that can refer to multiple events or entities. This ambiguity, along with the use of coreference, further complicates the task of accurately extracting event data from text, since resolving ambiguity requires contextual understanding and disambiguation techniques.</p>
        <p>Identifying relevant elements for each event requires distinguishing between event triggers (words or phrases that indicate the occurrence of an event) and background information and noise. Another complexity is given by the variability in language usage, writing styles, syntactic structures, and document length. Indeed, event extraction can be performed on short texts like tweets, longer documents such as news articles, and lengthy documents such as investigative reports or government documents. All these factors require the use of techniques able to accommodate these variations, to achieve accurate and reliable results across diverse text types and genres.</p>
        <p>Two of the key aspects of events are time and space, i.e., when the event took place and where. The recognition and standardization of temporal and spatial expressions can be complex, since temporal references can be expressed in various formats (such as dates, times, parts of the day). In addition, a document describing an event can refer to the location of the event providing information at different granularity, for example indicating the name of the city, specifying the address, and/or describing the type of the place, like an apartment, a shop, or a park. During event extraction, the references to all these locations should be identified.</p>
      </sec>
      <sec id="sec-1-2">
        <title>2.3. Large Language Model-based approaches</title>
        <p>Several approaches have been proposed for event extraction in recent surveys, from traditional methods which rely on the use of linguistic rules for pattern identification within the text, to more advanced solutions such as machine learning and deep learning algorithms able to learn patterns after training on annotated data, and the use of pre-trained language models [2, 3]. LLM-based approaches have emerged as a promising avenue for event extraction in recent years. These models leverage the power of machine learning and deep learning algorithms, as they are pre-trained on vast amounts of text data and then fine-tuned for specific tasks. By encoding contextual information and capturing semantic relationships within the text, LLMs seem to be promising in identifying and extracting events from various sources.</p>
        <p>We identified three main approaches based on the use of LLMs that could reach good performance in event extraction: sequence labeling models, extractive Question Answering (QA) models, and instruction-tuned models.</p>
        <sec id="sec-1-2-1">
          <title>Sequence Labeling models</title>
          <p>In sequence labeling, each token in a sequence is assigned a label based on its role or category within the context of the sequence. Sequence labeling models can be used to identify those text spans reporting relevant information within a text. Therefore, sequence labeling is widely employed for several classical NLP tasks like part-of-speech (POS) tagging, named entity recognition (NER), and text chunking. Sequence labeling models are suitable for the scenario of event extraction, where they can identify and classify those parts of text reporting information about events. Indeed, some works in the literature have already treated event extraction as a sequence labeling or NER problem [4, 5], also for the Italian language [6].</p>
        </sec>
        <sec id="sec-1-2-2">
          <title>Extractive Question Answering</title>
          <p>The goal of extractive QA models is to understand an input question in natural language and extract the answer as a span from an input text. QA models can facilitate rapid and efficient access to event-related information by automatically identifying text spans containing the desired answers to specific questions. For instance, the question “When did the event take place?” (Q1) can be formulated to retrieve the date of the event.</p>
          <p>The results of these models depend significantly on the quality of the input documents, as well as on the structure of the questions provided to the models. Prior knowledge about the kind of event described in the document allows formulating ad hoc questions. For instance, considering the document in Figure 1, the question “When did the air crash take place?” (Q2) should provide more accurate answers than Q1. In addition, questions should be enriched with other details about the event after a partial process of event extraction. For example, the question “When did the Flight 345 crash?” (Q3) contains the reference to the flight number and should help the QA models to select the correct context for the extraction of the date.</p>
          <p>Within QA models, distinctions arise between Single-Span QA (SQA) and Multi-Span QA (MQA). While the former identifies a single text segment for each question, the latter locates answers even when distributed across non-consecutive text segments, potentially located far apart within a document. Given the prevalence of such scenarios, especially in complex inquiries and detailed documents, the limitations of SQA models are evident. An example is the annotation of “casualties and losses” in Figure 1. The recent surge in MQA model development [7, 8, 9] underscores a notable interest.</p>
          <p>In the current state of the art, the only Italian dataset properly designed for training QA models is SQuAD-it [10], derived from the automatic translation of the English SQuAD dataset and consisting of a list of question-answer pairs. However, this dataset can be used only for SQA; therefore, it is unsuitable for complex tasks like event extraction, which requires the ability to retrieve multiple spans for one question.</p>
        </sec>
        <sec id="sec-1-2-3">
          <title>Instruction-Tuned models</title>
          <p>Among LLMs, Auto-Regressive models such as the GPT [1] or Llama [11] series stand out. These models leverage advanced deep learning techniques to predict the subsequent word based on an input text. This prediction process is repeated multiple times, with each predicted word being added to the original text. By training on vast amounts of text data, Auto-Regressive LLMs effectively capture complex patterns and structures in language, leading them to generate full and coherent text which is contextually relevant to the input text.</p>
          <p>The research in recent years has led to the development of instruction tuning [12] to bridge the gap between the next-word prediction objective of LLMs and the users’ objective of having their instructions followed helpfully and safely. Instruction tuning involves a fine-tuning of Auto-Regressive LLMs with input-output pairs, where the input denotes the human instructions and the output denotes the desired output that follows the instruction. The results of this process are the Instruction-Tuned LLMs, designed specifically to provide appropriate results based on instruction inputs. This ability is also enhanced as a cross-task generalization, leading Instruction-Tuned LLMs to better performances on novel tasks.</p>
          <p>Instruction-Tuned LLMs can be employed to solve a wide range of NLP tasks through various techniques of prompt engineering [13], i.e., the process of designing task-specific instructions to guide model output. Therefore, the utilization of these models can also yield benefits for event extraction.</p>
          <p>Currently, there are several Instruction-Tuned LLMs capable of understanding and generating text. For these, Italian represents a minority percentage of the training data compared to languages more widely used on the web, such as English. Among these models, there are proprietary ones like GPT-3.5 and GPT-4 from OpenAI and Gemini from Google, and open-source families of LLMs like Mistral [14] and Mixtral [15] from Mistral AI and Llama [11] and Llama 2 [16] from Meta. From this last family, Llamantino [17] has been derived through a language adaptation process to the Italian language.</p>
        </sec>
      </sec>
      <sec id="sec-1-6">
        <title>3. Italian datasets</title>
        <p>Currently, there are few Italian datasets suitable for event extraction. Some of them provide a comprehensive annotation of event-related data, while in other cases, only one type of information (e.g., the temporal references) is annotated.</p>
        <sec id="sec-1-6-1">
          <title>3.1. EVENTI</title>
          <p>The EVENTI corpus (https://sites.google.com/site/eventievalita2014/data-tools) was built in 2014 for the evaluation of Temporal Information Processing systems in the EVENTI evaluation exercise [18] at the EVALITA workshop. The corpus consists of three datasets: the Main task training data (274 documents) and test data (92 documents) of contemporary news articles, and the Pilot task test data (10 documents) of historical news articles. The annotation guidelines involve the use of four tags to annotate different elements within news texts: the EVENT tag is used to annotate all the mentions of events, including verbs, nouns, prepositional phrases, and adjectives; the TIMEX3 tag is used for temporal expressions; the SIGNAL tag identifies textual items which encode a relation either between EVENTs, or TIMEX3s, or both; the TLINK tag is used for temporal dependencies between EVENTs and/or temporal expressions.</p>
        </sec>
      </sec>
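<p>The EVENT and TIMEX3 annotations just described map naturally onto the sequence-labeling formulation of Section 2.3: each token receives a BIO tag, and annotated spans are recovered by decoding contiguous tag runs. A minimal pure-Python sketch (the Italian sentence and its tags are invented for illustration):</p>

```python
# Decode BIO tags (e.g., B-EVENT, I-EVENT, B-TIMEX3) back into labeled
# text spans, as a sequence-labeling event extractor would.
def bio_to_spans(tokens, tags):
    spans, current = [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                spans.append(current)
            current = {"label": tag[2:], "tokens": [tok]}
        elif tag.startswith("I-") and current and tag[2:] == current["label"]:
            current["tokens"].append(tok)
        else:  # "O" or an inconsistent I- tag closes any open span
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return [(s["label"], " ".join(s["tokens"])) for s in spans]

# "The plane crashed on May 14th": one event trigger, one temporal expression.
tokens = ["L'aereo", "è", "precipitato", "il", "14", "maggio"]
tags = ["O", "O", "B-EVENT", "O", "B-TIMEX3", "I-TIMEX3"]
print(bio_to_spans(tokens, tags))
# prints: [('EVENT', 'precipitato'), ('TIMEX3', '14 maggio')]
```

In practice the tags would come from a fine-tuned token-classification model rather than being hand-written; the decoding step shown here is the same either way.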
      <sec id="sec-1-3">
        <title>3.2. NewsReader MEANTIME</title>
        <p>The NewsReader MEANTIME (Multilingual Event ANd TIME) corpus is a multilingual semantically annotated corpus of 480 Wikinews articles in four languages: English, Italian, Spanish, and Dutch [19]. The corpus was released in 2016 and derives from the NewsReader Project (http://www.newsreader-project.eu/) [20], which aims at extracting information about what happened to whom, when, and where, processing a large volume of financial and economic data. The corpus is enriched with annotations that span multiple levels, including entities, entity mentions, events, temporal information, semantic roles, and intra-document and cross-document event and entity coreference.</p>
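<p>At decoding time, the difference between the SQA and MQA models discussed in Section 2.3 amounts to how candidate answer spans are selected: the single best-scoring span versus every span whose score clears a threshold. A minimal pure-Python sketch (the candidate spans, their scores, and the 0.5 threshold are invented assumptions, not values from any cited model):</p>

```python
# Toy candidate answer spans with model confidence scores (invented values),
# e.g., for a question about the "casualties and losses" of an air crash.
candidates = [
    ("three crew members", 0.81),
    ("two passengers", 0.74),
    ("the aircraft", 0.12),
]

def single_span(cands):
    # SQA: keep only the single best-scoring span.
    return max(cands, key=lambda c: c[1])[0]

def multi_span(cands, threshold=0.5):
    # MQA: keep every span whose score clears the threshold, so answers
    # scattered across non-consecutive text segments are all returned.
    return [text for text, score in cands if score > threshold]

print(single_span(candidates))  # prints: three crew members
print(multi_span(candidates))   # prints: ['three crew members', 'two passengers']
```

The sketch makes the limitation of SQA concrete: when the correct answer is spread over two spans, an argmax over candidates can, by construction, return only one of them.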
      </sec>
    <sec id="sec-1-4">
      <title>3.3. De Gasperi</title>
      <p>The De Gasperi corpus [21] is a collection of historical documents by Alcide De Gasperi, the first Prime Minister of the Italian Republic. The corpus was released in 2019 and includes 2,762 documents published between 1901 and 1954, originally released in an oral or written form. In addition to the raw text, a set of meta-data and additional semi-automatically annotated information is available. The corpus contains different kinds of documents, like daily press written by De Gasperi when he worked as a journalist for newspapers in Trentino, and speeches in institutional venues when he was a Member of the Italian Parliament. In each document, references to persons and places are annotated.</p>
    </sec>
    <sec id="sec-1-4-1">
      <title>3.4. DICE</title>
      <p>DICE (https://github.com/federicarollo/Italian-Crime-News) [22] is a collection of 10,395 Italian news articles describing crime events that happened in the Modena province between 2011 and 2021. The news articles are extracted from one of the most popular local newspapers, “Gazzetta di Modena”, following the approach described in [23]. Thanks to an agreement between the University of Modena and Reggio Emilia and the Gazzetta di Modena, DICE was released online in 2023, free to redistribute and transform without encountering legal copyright issues, under an Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license.</p>
      <p>Along with the data related to the title, the text, and the publication date of each news article, which are crawled from the newspaper’s webpage, several annotations are available on the data. The crime event category (e.g., theft, robbery) is assigned to each news article using text categorization approaches based on word embeddings [24, 25]. The news articles underwent automated NLP processes to extract temporal references, entities, and corresponding DBpedia resources. Duplicates are annotated to identify news articles referring to the same crime event. The theft-related news articles are annotated manually following a sophisticated annotation schema to identify stolen items (What), crime locations (Where), and references to authors and victims and their sociodemographic characteristics (Who). The annotation provided in the dataset is multi-span, since it involves identifying and linking multiple text spans within the document.</p>
    </sec>
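<p>A multi-span annotation of the kind described for DICE can be pictured as linking several disjoint character spans of one article to event roles. A minimal sketch of such a record (the article text, offsets, and field names are invented for illustration and do not reproduce DICE’s actual schema):</p>

```python
# A hypothetical multi-span annotation for a theft article: each role
# (What/Where/Who) points to one or more character spans in the text.
article = "Ieri a Modena, in via Emilia, ignoti hanno rubato una bicicletta."

annotation = {
    "category": "theft",
    "what": [(54, 64)],            # "bicicletta"
    "where": [(7, 13), (22, 28)],  # "Modena", "Emilia": two disjoint spans
    "who_author": [(30, 36)],      # "ignoti"
}

def resolve(text, spans):
    # Materialize the annotated fragments from their character offsets.
    return [text[s:e] for s, e in spans]

print(resolve(article, annotation["where"]))
# prints: ['Modena', 'Emilia']
```

Note how the Where role is filled by two non-contiguous spans, which is exactly the situation that makes single-span QA models insufficient for this dataset.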
      <sec id="sec-1-5">
        <title>3.5. EventNet-ITA</title>
        <p>EventNet-ITA (https://huggingface.co/datasets/mrovera/eventnet-ita) [26] is an Italian corpus for Frame Parsing
applied to events released in 2024. Semantic Frame
Parsing is a task which aims at identifying semantic frames
within textual data. A semantic frame [27] is a cognitive
structure that organizes and represents knowledge about
a concept or situation. It consists of a set of
interconnected elements such as roles, attributes, and relations,
which collectively define the meaning and typical
features of that concept or situation. Frames help humans
understand and interpret language by providing a mental
framework for comprehending and categorizing
information.</p>
        <p>EventNet-ITA is built upon the idea of enabling frame
parsing for event extraction. It is composed of 53,854
sentences manually annotated with 205 semantic frames
of events and covers different domains, like conflictual,
social, communication, legal, geopolitical, economic and
biographical events.</p>
      </sec>
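<p>A frame-parsing output of the kind EventNet-ITA targets can be pictured as a frame label evoked by a lexical trigger, plus filled roles. A minimal sketch (the frame name, role names, and sentence are invented for illustration and do not reproduce EventNet-ITA’s actual inventory):</p>

```python
# A hypothetical frame instance: the lexical trigger evokes a frame,
# and sentence fragments fill its roles (frame elements).
sentence = "Nel 1921 Gramsci fondò il Partito Comunista a Livorno."

frame_instance = {
    "frame": "Founding",  # evoked by the trigger below
    "trigger": "fondò",
    "roles": {
        "Founder": "Gramsci",
        "Organization": "il Partito Comunista",
        "Place": "Livorno",
        "Time": "Nel 1921",
    },
}

def summarize(fi):
    # Flatten the frame instance into role=value pairs for inspection.
    pairs = ", ".join(f"{role}={text}" for role, text in fi["roles"].items())
    return f"{fi['frame']}({pairs})"

print(summarize(frame_instance))
# prints: Founding(Founder=Gramsci, Organization=il Partito Comunista, Place=Livorno, Time=Nel 1921)
```

Seen this way, frame parsing subsumes event extraction: the frame corresponds to the event type, the trigger to the event mention, and the roles to the who/what/when/where elements discussed in Section 2.2.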
    </sec>
    <sec id="sec-2">
      <title>4. Future directions</title>
      <p>Automated information extraction from documents continues to captivate the scientific community due to its manifold advantages, facilitating improved information accessibility across various domains. By leveraging LLMs and exploiting annotated datasets, researchers can develop robust event extraction systems capable of achieving high accuracy and efficiency across a wide range of text sources. As the field continues to advance, further research into LLMs and their applications in event extraction is expected to drive continued innovation and progress in this area.</p>
      <p>Future directions will focus on three key aspects:</p>
      <p>• Definition of an Italian benchmark: while we have identified five Italian datasets suitable for event extraction, further efforts are needed to expand their annotation and support comprehensive event extraction tasks. This entails defining a standardized benchmark for evaluating event extraction, enabling comparisons between different approaches and fostering the development of more accurate and reliable event extraction models.</p>
      <p>• Evaluation of LLMs on the benchmark: despite the limited literature on Italian event extraction, our preliminary evaluation of three BERT-based models showed promising results [22]. However, challenges persist, particularly related to the size and quality of the available data and to comparing the various approaches outlined in Section 2.3. The evaluation will include the recent Minerva models, which represent the first family of LLMs trained from scratch on Italian data.</p>
      <p>• Automation of the annotation process: since manual annotation is a time-consuming process, new strategies will be studied to automate the process of annotation, employing LLMs for data augmentation, i.e., to expand the spans to extract from the text (like “May 14th”) and generate a document with that span in the expected role in the event described (like “create a document describing an event that occurred on May 14th”). This methodology allows for obtaining […]. Furthermore, this approach offers control over the development of balanced and unbiased datasets, essential for training accurate and equitable AI models.</p>
    </sec>
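<p>The idea of employing LLMs for data augmentation can be sketched as a prompt template that, given a known span and its expected event role, asks an instruction-tuned model to generate a document exercising that span. The template below is a hypothetical illustration, not a prompt used in this work:</p>

```python
# Build a hypothetical augmentation prompt: given an annotated span and
# its role, ask an instruction-tuned LLM to generate a synthetic document
# that is annotated "for free", since the span and role are known upfront.
TEMPLATE = (
    "Create a news article in Italian describing an event "
    "whose {role} is '{span}'. Report the {role} verbatim in the text."
)

def augmentation_prompt(span, role):
    return TEMPLATE.format(span=span, role=role)

print(augmentation_prompt("May 14th", "date of the event"))
```

The generated article would then be paired with the known (span, role) annotation, which is what gives this approach its control over the balance and coverage of the resulting dataset.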
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, et al., Language models are few-shot learners, in: H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, H. Lin (Eds.), Advances in Neural Information Processing Systems, volume 33, Curran Associates, Inc., 2020, pp. 1877-1901.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] G. Frisoni, G. Moro, A. Carbonaro, A survey on event extraction […], IEEE Access 9 (2021) 160721-160757. doi:10.1109/ACCESS.2021.3130956.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] W. Xiang, B. Wang, A survey of event extraction from text, IEEE Access 7 (2019) 173111-173137. doi:10.1109/ACCESS.2019.2956831.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] A. Ramponi, R. van der Goot, R. Lombardo, B. Plank, […], in: B. Webber, T. Cohn, Y. He, Y. Liu (Eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, Online, 2020, pp. 5357-5367. doi:10.18653/v1/2020.emnlp-main.431.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] S. Pongpaichet, B. Sukosit, C. Duangtanawat, et al., […] news articles, IEEE Access 12 (2024) 22778-22802. doi:10.1109/ACCESS.2024.3363879.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] N. Viani, T. A. Miller, D. Dligach, S. Bethard, et al., […] reports, in: A. ten Teije, C. Popow, J. H. Holmes, et al. (Eds.), […], Springer International Publishing, Cham, 2017, pp. 198-202.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] H. Li, M. Tomko, M. Vasardani, T. Baldwin, MultiSpanQA: A dataset for multi-span question answering, in: […] Ruíz (Eds.), Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL 2022, Seattle, WA, United States, July 10-15, 2022, Association for Computational Linguistics, 2022, pp. 1250-1260. doi:10.18653/V1/2022.NAACL-MAIN.90.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] E. Segal, A. Efrat, M. Shoham, A. Globerson, J. Berant, A simple and effective model for answering multi-span questions, 2020, pp. 3074-3080.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] M. Zhu, A. Ahuja, D. Juan, W. Wei, C. K. Reddy, […], in: Findings of the Association for Computational Linguistics: EMNLP 2020, Online Event, 16-20 November 2020, volume EMNLP 2020 of Findings of ACL, Association for Computational Linguistics, 2020, pp. 3840-3849. doi:10.18653/V1/2020.FINDINGS-EMNLP.342.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] D. Croce, A. Zelenanska, R. Basili, Neural learning for question answering in Italian, Lecture Notes […].</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>