<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>CEUR Workshop Proceedings</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.3115/V1/P15-1017</article-id>
      <title-group>
        <article-title>Leveraging Pretrained and Large Language Models for Inference</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Gustavo Miguel Flores</string-name>
          <email>gustavo.flores@eurecom.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Youssra Rebboud</string-name>
          <email>youssra.rebboud@eurecom.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pasquale Lisena</string-name>
          <email>pasquale.lisena@eurecom.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Raphaël Troncy</string-name>
          <email>raphael.troncy@eurecom.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Event Relation Extraction, Event Knowledge Graphs, Causal Event Relations, Web Platform,</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>EURECOM</institution>
          ,
          <addr-line>Sophia Antipolis, Biot</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>1</volume>
      <fpage>26</fpage>
      <lpage>31</lpage>
      <abstract>
        <p>Event relation extraction is crucial for understanding the temporal sequence and interconnections between events. To demonstrate this, we developed a Streamlit-based application that showcases our event relation extraction system, capable of identifying semantically accurate relations such as Direct-cause, Enable, Intend, and Prevent. The system features an API that simplifies inference and displays results in a user-friendly manner. Users can input text, such as a sentence, and the application highlights the extracted events and their corresponding relationships. The backend runs a series of pre-trained language models trained on datasets focused on events and their semantic relations. The app allows users to switch between various models, including Hugging Face's RoBERTa, REBEL, and large language models such as Zephyr. The demo is available at https://demo.kflow.eurecom.fr/.</p>
      </abstract>
      <kwd-group>
        <kwd>Event Relation Extraction</kwd>
        <kwd>Event Knowledge Graphs</kwd>
        <kwd>Causal Event Relations</kwd>
        <kwd>Web Platform</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Understanding the flow of events and their interconnections is crucial for tasks such as narrative
comprehension, historical analysis, and machine learning applications. The way information is
represented significantly impacts the contextual knowledge models can access, often encoded
through relational triplets. While knowledge about entities is important, understanding the context
surrounding those entities, especially events, is equally vital. Events are instances that occur in time
and space, inherently existing within a web of causal relationships that can provide critical insights
into their nature and consequences [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>
        Given the significance of event knowledge, researchers have developed various methods to
represent events and their relationships. Event Relation Extraction (ERE) is the task of identifying
and predicting relationships between events in text, enabling a deeper understanding of their
progression and impact [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>However, with the proliferation of different machine learning
models, architectures, and tuning parameters, evaluating their performance on this task remains
challenging.</p>
      <p>This research aims to address this challenge by developing a comprehensive Event Relation
Extraction pipeline. The pipeline allows users to experiment with various models and datasets,
enabling them to input sentences, extract events, and visualize the relationships between them.
For everyday users, this pipeline enhances the understanding of event flows in textual data. For
researchers, it offers a qualitative evaluation tool that allows them to analyze and compare model
performance on event relation extraction tasks, identifying potential strengths and weaknesses. To
make this accessible, we developed a user-friendly Streamlit web application (https://demo.kflow.eurecom.fr/)
that visually presents the extracted events and their relations. The application supports
multiple pre-trained models and datasets, providing an interactive platform for generating and
analyzing inferences.</p>
      <p>
        The structure of this demo paper is as follows: in Section 2 we cover the related work in
event and relation extraction; Section 3 explains the pipeline architecture, detailing the tasks
performed, how inferences are generated, and how users can interact with the system; finally,
Section 4 highlights observations, potential improvements, and future development directions.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Related work</title>
      <p>
        The study of event relations has historically focused on temporal relationships, where researchers
aimed to represent the temporal order of events [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Subsequently, attention shifted towards
causal relations between events, which sought to understand the influence of one event on
another. In these causal relationships, the cause is typically regarded as the subject, while the effect
is viewed as the object. In our work, we aim to move beyond basic causality and focus on extracting
fine-grained causal relationships between events. These nuanced event relations, initially
introduced by [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], were accompanied by the creation of the first dataset specifically designed to
capture such detailed event relations.
      </p>
      <p>Event Relation Extraction (ERE) is generally divided into two main subtasks: (1) identifying the
type of relation, and (2) extracting the corresponding spans of the subject and object from the
sentence. Early work in this domain was carried out by the Linguistic Data Consortium (LDC)
through the Automatic Content Extraction (ACE) program [4], which focused on texts from various
domains such as newswire, broadcast news, conversational speech, weblogs, Usenet, and
telephone conversations. The primary objective of ACE was to develop information extraction
techniques that could facilitate the automatic processing of human language in textual form.</p>
      <p>
        In recent years, neural models have gained prominence in event extraction tasks. With
advancements in deep learning, researchers have explored the use of Convolutional Neural
Networks (CNNs) [5], Recurrent Neural Networks (RNNs) [6], and, more recently,
transformer-based models [7]. Pretrained language models (PLMs) have become a focal point in event
extraction studies due to their ability to learn general-purpose representations from raw text,
which aids in extracting relevant event relations [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. BERT, in particular, has demonstrated strong
performance in this area, as highlighted in a study [7] showing that BERT could achieve
state-of-the-art results without the need for task-specific architectures or external resources [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Large
Language Models (LLMs) have also demonstrated strong performance in relation extraction tasks. In
the work of [8], the Flan-T5 model [9] significantly outperformed previous baselines on the
CoNLL04 dataset [10], underscoring the potential of LLMs for event relation extraction.
      </p>
      <p>
        For precise event relations, such as Direct-cause, Enable, Intend, and Prevent, [11] proposed an
approach to augment the dataset introduced by [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] using GPT. They then employed BERT [12]
to perform event relation extraction tasks. While their method achieved good performance on the
relation classification subtask, it showed limitations in the quality of event extraction.
      </p>
      <p>
        In this work, we aim to provide an API based on an event relation extraction pipeline that
leverages various pre-trained language models (PLMs) and large language models (LLMs) instead of
relying solely on BERT [12]. Although detailed performance results cannot be shared here, as they
are under review in another study, we offer insights into the models' performance and provide access
to the code and a link to the API.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Platform and API for Event Relation Extraction</title>
      <sec id="sec-3-1">
        <title>3.1. Event Relation Extraction from Text</title>
        <p>
          In our pipeline, the goal is to perform event relation extraction from textual data, focusing on
four semantically precise event relations: Direct-Cause, Enable, Intend, and Prevent [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. These
relations are categorized under the broader supertype of Cause.
        </p>
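        <p>The following minimal sketch represents this small taxonomy as a Python data structure. It is purely illustrative; the pipeline itself only needs the relation labels as strings.</p>
        <preformat>
# Illustrative sketch of the relation taxonomy (not part of the pipeline code).
from enum import Enum

class CausalRelation(Enum):
    DIRECT_CAUSE = "Direct-Cause"
    ENABLE = "Enable"
    INTEND = "Intend"
    PREVENT = "Prevent"

SUPERTYPE = "Cause"  # all four fine-grained relations specialise the broader Cause relation
        </preformat>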
        <p>[Figure 1: Overview of the pipeline. The input text is passed to the RD module (output: 0/1; model: RoBERTa); the filtered sentences go to the RC module (output: relation type; models: RoBERTa, LangChain LLMs, REBEL) and then to the EE module (output: (subject, object); models: RoBERTa, LangChain LLMs, REBEL), all served through the Streamlit application framework and its UI, backed by REBEL, LangChain, and Hugging Face (RoBERTa).]</p>
        <p>The pipeline performs three tasks: Relation Detection (RD), Relation Classification (RC), and Event
Extraction (EE). Dividing the task into three subtasks could enable testing a broader combination of
models for each task, allowing evaluation of strengths and weaknesses for each subtask independently.
In the RD phase, the model filters out sentences that do not contain a causal event relation; this task is not
optional. The sentences containing a causal relation pass to the RC module, which determines which of the
four relation types is expressed in the sentence. Finally, once the relation type is decided, the EE module extracts the
subject and the object of the event relation in the given sentence. Figure 1 illustrates the pipeline modules.</p>
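        <p>As an illustration, the following sketch shows how the three subtasks could be chained on a single sentence. The function names (detect_relation, classify_relation, extract_events, run_pipeline) are hypothetical placeholders and do not reflect the actual pipeline code.</p>
        <preformat>
# Minimal sketch of the RD -> RC -> EE chaining described above.
def run_pipeline(sentence: str, rd_model, rc_model, ee_model):
    """Run relation detection, classification and event extraction on one sentence."""
    # RD: keep only sentences that contain a causal event relation (output 0/1)
    if not rd_model.detect_relation(sentence):
        return None
    # RC: decide which of the four relations the sentence expresses
    relation_type = rc_model.classify_relation(sentence)   # e.g. "Direct-Cause"
    # EE: extract the subject and object spans of that relation
    subject, obj = ee_model.extract_events(sentence, relation_type)
    return {"relation": relation_type, "subject": subject, "object": obj}
        </preformat>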
        <p>The most integral component of the pipeline is the set of ERE models. The pipeline runs only one model for
each task (RC, RD, and EE), chosen from the available options. At present, the models included are:
• the BERT family of models by Hugging Face, for RC, RD, and EE [13];
• REBEL, for RD and EE [14];
• the large language models (LLMs) available through the LangChain library (https://Python.langchain.com/), for RD and EE.</p>
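        <p>For the Hugging Face models, the RC step amounts to standard sequence classification. The sketch below uses the public transformers API; the checkpoint path and label order are assumptions for illustration and are not published artifacts of this work.</p>
        <preformat>
# Sketch: using a Hugging Face sequence-classification model for the RC step.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

LABELS = ["Direct-Cause", "Enable", "Intend", "Prevent"]  # assumed label order
MODEL_DIR = "path/to/finetuned-roberta-rc"                # placeholder checkpoint path

tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_DIR)

def classify_relation(sentence: str) -> str:
    """Return the predicted relation type for a sentence already flagged as causal."""
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return LABELS[int(logits.argmax(dim=-1))]
        </preformat>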
        <p>The available LLMs are Zephyr [15], DPO [16], UNA [17], SOLAR [18], and GPT4 [19].
Both the BERT family models and REBEL were trained using a combination of two datasets, the
Event Relations Dataset from [11] and the CausalNewsCorpus [20], which together comprise 5613
example sentences annotated with the four relations Direct-Cause, Enable, Intend, and Prevent,
together with the subject and object of each relation. For the LLMs, the same prompt template
(https://github.com/ANR-kFLOW/Relation_extraction/blob/main/LLMs_as_Relation_Classifiors_and_Event_Extractors/prompt_template.yml)
was used for every model. The chosen LLMs ranked among the top
performers on the Hugging Face Open LLM Leaderboard at the time of writing, excelling across
various benchmarks, including the Massive Multitask Language Understanding benchmark (MMLU) [21].</p>
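        <p>The sketch below illustrates how such a prompt template could be filled before being sent to an LLM. The YAML key and template fields are assumptions for illustration; the actual template is the prompt_template.yml file linked above.</p>
        <preformat>
# Sketch: filling a prompt template for the LLM-based steps.
import yaml

# The key "relation_classification" and the fields below are assumed for illustration.
with open("prompt_template.yml") as f:
    template = yaml.safe_load(f)["relation_classification"]

prompt = template.format(
    sentence="The drought caused widespread crop failure.",
    relations="Direct-Cause, Enable, Intend, Prevent",
)
# `prompt` is then sent to the selected LLM (e.g. Zephyr or GPT4) through LangChain.
        </preformat>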
        <p>The RoBERTa model performed well in the relation detection task, achieving an average
F1-score of 0.86. In contrast, the REBEL model excelled in both the relation classification and
event extraction tasks, with F1-scores of 0.975 and 0.829, respectively, showcasing its overall
effectiveness. The detailed performance results of these models are currently under review for
another conference and cannot be disclosed at this time. However, the code and data for
this work are available at https://github.com/ANR-kFLOW/Relation_extraction/tree/main. Figure
3 shows an example of an accurate and an inaccurate prediction produced using RoBERTa as a
filter (RD) and REBEL for both RC and EE.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Relation Extraction Pipeline</title>
        <p>The pipeline is written in Python, and the specifications for the inferences can be passed through the
command line, a configuration file, or the user interface developed for the pipeline.</p>
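        <p>As a sketch of the command-line entry point, the arguments could be declared as follows; the flag names and defaults are illustrative and do not correspond to the pipeline's actual interface.</p>
        <preformat>
# Sketch: passing inference specifications on the command line (illustrative flags only).
import argparse

parser = argparse.ArgumentParser(description="Event relation extraction inference")
parser.add_argument("--rd-model", default="roberta-rd", help="model used for relation detection")
parser.add_argument("--rc-model", default="rebel-rc", help="model used for relation classification")
parser.add_argument("--ee-model", default="rebel-ee", help="model used for event extraction")
parser.add_argument("--config", help="optional YAML configuration file overriding the flags")
args = parser.parse_args()
        </preformat>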
        <p>After training, the pretrained models can be made available to the pipeline by saving them
in a common folder, making them available for the inference stage.</p>
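        <p>A minimal sketch of how such a shared folder could be scanned at inference time is shown below; the folder layout (one sub-folder per task, each holding checkpoints) is an assumption for illustration.</p>
        <preformat>
# Sketch: discovering trained checkpoints saved in a shared models folder.
from pathlib import Path

MODELS_DIR = Path("models")  # assumed layout: one sub-folder per task (rd, rc, ee)

def available_checkpoints(task: str) -> list[str]:
    """List the saved checkpoints for a task such as 'rd', 'rc' or 'ee'."""
    return sorted(p.name for p in (MODELS_DIR / task).iterdir() if p.is_dir())
        </preformat>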
        <p>The information that the pipeline can receive from users is: the path to a pretrained model for a given
task, the choice to skip performing one of the tasks, and the user's OpenAI key (if GPT4 is used for
any of the tasks). The pipeline has a default configuration that is run if the user does not provide
any instructions. If the user's instructions do not cover all arguments, the pipeline
will fill in the missing arguments with the default values.</p>
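        <p>A minimal sketch of this default-filling behaviour follows; the configuration keys are assumed for illustration and are not the pipeline's actual schema.</p>
        <preformat>
# Sketch: completing a partial user configuration with default values.
DEFAULTS = {
    "rd_model": "roberta-rd",
    "rc_model": "rebel-rc",
    "ee_model": "rebel-ee",
    "skip_tasks": [],
    "openai_key": None,   # only needed when GPT4 is selected
}

def resolve_config(user_config: dict) -> dict:
    """Return the user configuration with any missing argument taken from the defaults."""
    return {**DEFAULTS, **{k: v for k, v in user_config.items() if v is not None}}
        </preformat>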
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Streamlit Platform Architecture</title>
        <p>This application provides users with a curated demonstration of the capabilities of the
models. It is developed using Streamlit, which acts as both the web application and
the web API, as shown in Figure 2. The Streamlit application receives input from the user and passes it
along to the Python pipeline via a configuration file. The user can write their own text or use an input preset. After the
user chooses the model to use for each task, the inference running in the back end is
produced.</p>
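        <p>The sketch below illustrates this interaction flow with the public Streamlit API. The model names and the run_pipeline() helper are placeholders, not the demo's exact code.</p>
        <preformat>
# Sketch: a minimal Streamlit front end for the pipeline (illustrative only).
import streamlit as st

def run_pipeline(text: str, rc_model: str, ee_model: str) -> dict:
    # Placeholder for the call into the Python pipeline via its configuration file.
    return {"relation": "Direct-Cause", "subject": "The heavy rain", "object": "the river to flood"}

text = st.text_area("Input text", "The heavy rain caused the river to flood.")
rc_model = st.selectbox("Relation classification model", ["RoBERTa", "REBEL", "Zephyr"])
ee_model = st.selectbox("Event extraction model", ["RoBERTa", "REBEL", "Zephyr"])

if st.button("Run inference"):
    result = run_pipeline(text, rc_model=rc_model, ee_model=ee_model)
    st.write(result)
        </preformat>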
        <p>The output returned to the user includes the original text used to produce the inference,
with the subject and object of the extracted event relation highlighted. Next to the highlights are labels
indicating which part of the span it is (subject or object), together with the event relation type.</p>
        <p>There are two different versions of how the highlighting is formatted: one for spans that do not
overlap, and one for spans that overlap. For spans that do not overlap, the
highlights are color-coded according to the classification of the event relation, and there are labels at the end
of each highlight. Where spans can overlap one another, the spans are
encased in color-coded brackets. The color of the bracket indicates which part of the span
it contains (subject or object). The labels are placed at the closing bracket to avoid cluttering
the sentence. The classification of the relation can be cause, intend, prevent, enable, or
other, where other refers to the case when the model producing the classification gives a nonstandard response.
Some models, such as RoBERTa [13], identify multiple event relations in a given sentence. In that case,
a sentence with multiple event relations is displayed once per detected event
relation, so that only one span is displayed at a time, for visual clarity.
Figure 4 shows a screenshot of the Streamlit demo.</p>
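        <p>A rough sketch of the bracket-based formatting used when spans may overlap is given below; the bracket markers and label placement are illustrative, not the demo's exact rendering.</p>
        <preformat>
# Sketch: bracket-encasing subject and object spans with labels (illustrative only).
def bracket_spans(sentence: str, subject: str, obj: str, relation: str) -> str:
    annotated = sentence.replace(subject, f"[{subject}](subject)")
    annotated = annotated.replace(obj, f"[{obj}](object, {relation})")
    return annotated

print(bracket_spans("The drought caused crop failure.", "The drought", "crop failure", "Direct-Cause"))
# -> [The drought](subject) caused [crop failure](object, Direct-Cause).
        </preformat>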
      <sec id="sec-3-1">
        <title>4. Conclusion and Future Work</title>
      <p>In this work we have constructed an API for event relation extraction based on a set of
pretrained language models (BERT, RoBERTa, and REBEL) together with a few LLMs such as GPT4
and Zephyr. The API was created to help streamline the process of performing inferences on
textual input from a given user, and to aid the process of comparing ERE models to one another.
The API is accessible at https://demo.kflow.eurecom.fr/.</p>
      <p>In the future, the platform will allow a user to compare multiple models in an A/B testing
fashion: the user compares the inferences generated by the models side
by side and records their evaluation of how one model compares to another. First, an
automatic test will apply widely adopted metrics, e.g. precision, recall and F1-score, on a
predefined ground truth to evaluate the performance of the models. These comparisons and
evaluations will be saved so that users can later use these metrics to determine the
best performing models. The best three models for a given task will then be selected for human evaluation
through a UI. A further addition to the pipeline could be functionality for training
the models through the pipeline itself.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Acknowledgements</title>
      <p>This work has been partially supported by the French National Research Agency (ANR) within the
kFLOW project (Grant no. ANR-21-CE23-0028).</p>
      </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Rebboud</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Lisena</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Troncy</surname>
          </string-name>
          , Beyond Causality:
          <article-title>Representing Event Relations in Knowledge Graphs</article-title>
          , in: Knowledge Engineering and Knowledge Management: 23rd International Conference, EKAW 2022, Bolzano, Italy, September 26-29, 2022, Proceedings, Springer-Verlag, Berlin, Heidelberg,
          <year>2022</year>
          , pp.
          <fpage>121</fpage>
          -
          <lpage>135</lpage>
          . doi:10.1007/978-3-031-17105-5_9.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Jin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Jin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Cheng</surname>
          </string-name>
          , ProtoEM:
          <article-title>A prototype-enhanced matching framework for event relation extraction</article-title>
          ,
          <year>2023</year>
          . URL: https://arxiv.org/abs/2309.12892. arXiv:2309.12892.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>K.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zuo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <article-title>Extracting events and their relations from texts: A survey on recent research progress and challenges</article-title>
          ,
          <source>AI Open</source>
          1 (
          <year>2020</year>
          )
          <fpage>22</fpage>
          -
          <lpage>39</lpage>
          . URL: https://www.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>