<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>LLI-UAM Team at FinancES 2023: Noise, Data Augmentation and Hallucinations</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jordi Porta-Zamorano</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yanco Torterolo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Antonio Moreno-Sandoval</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Laboratorio de Lingüística Informática, Universidad Autónoma de Madrid</institution>
          ,
          <addr-line>Cantoblanco, 28049, Madrid</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <abstract>
        <p>This paper describes the T5-based system developed for the FinancES 2023 Shared Task by the Laboratorio de Lingüística Informática at UAM. The LLI-UAM system achieved a good ranking in all the tasks. The paper also describes some experiments on noise, data augmentation, and hallucination mitigation. In particular, we used corrected versions of the datasets to evaluate the impact of noise. Moreover, ChatGPT was utilised to augment the data and improve tagging accuracy. We also describe the presence of hallucinations. Ultimately, we identify the best model for each task and draw conclusions based on our findings.</p>
      </abstract>
      <kwd-group>
        <kwd>Data augmentation</kwd>
        <kwd>ChatGPT</kwd>
        <kwd>noise</kwd>
        <kwd>hallucinations</kwd>
        <kwd>mT5</kwd>
        <kwd>FinancES shared task</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>[Table 1: official leaderboard results for the FinancES 2023 tasks (Ranking, Team, and F1 scores per task); (a) Task 1: financial targeted sentiment analysis.]</p>
      <p>Each headline is annotated with respect to three participants: (1) the target of the news
item; (2) the individual economic agent: companies; and (3) the individual economic
agent/patient: consumers. The news item impacts the target and the economic participants, and
this impact is categorised as positive, negative, or neutral. FinancES proposes two tasks: (1) identifying the target entity
in the text and determining the emotional polarity towards that target, and (2) assessing the
impact of a news headline on companies and consumers regarding their stance and expressed
polarity values. Our systems reached the second position for Task 1 and the first position for
Task 2 in the official leaderboards, as can be seen in Table 1 for the LLI-UAM team.</p>
      <p>This paper outlines the system developed by the LLI-UAM team and presents its contributions
to the FinancES shared task. First, the dataset and the noise found in the examples are described
(sections 2 and 3, respectively). Next, we show how data augmentation was performed
with ChatGPT. In section 5, we describe the deep learning model used. The longest part is
devoted to discussing the results of the different experiments (noise, data augmentation, and
hallucinations).</p>
      <p>[Table 2: sample headlines from the dataset, e.g., “Peligroso atasco en los fondos que invierten
en renovables”; “El fondo de recuperación de la UE ’va demasiado lento’, según el ministro de
economía francés”; “El PSOE propone un intrumento para abaratar los préstamos a las empresas
similar al británico”; “Madrid negocia con Economía para poder destinar las viviendas del
’banco malo’ a desahuciados”.]</p>
    </sec>
    <sec id="sec-2">
      <title>2. The FinancES Dataset</title>
      <p>The FinancES dataset and its annotation process are detailed in [5] and [6]. The dataset consists
of news headlines written in Spanish, collected from digital newspapers specialised in economic
and financial news from various Spanish-speaking countries. Each headline is labelled to identify
the target and sentiment polarity across three dimensions: target, companies, and consumers,
employing a three-class polarity value system (positive, neutral, or negative). According to
[6], three organising committee members manually annotated each headline. In cases of
disagreement, the annotators engaged in discussions to resolve the matter, and if no consensus
was reached, the headline was excluded. Table 2 illustrates a selection of examples from the
dataset.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Noise in the Dataset</title>
      <p>Annotated data holds paramount importance for training and evaluating machine learning
models. Consequently, the annotations should exhibit a high level of accuracy. However,
recent research has demonstrated that this is not always the case, revealing a surprising
number of annotation errors or inconsistencies in even widely-used datasets [7]. Since humans
typically carry out dataset annotations, errors or inconsistencies are an inherent possibility.
Such inaccuracies can adversely affect a model’s performance, potentially leading to erroneous
predictions. Although effective, rectifying these labelling errors incurs high costs and demands
substantial time investments.</p>
      <p>Upon inspecting the dataset, we identified errors in target and polarity labels that could
impact the model’s performance. These errors were readily apparent, as we expected the target
to be mentioned within the news headline and the polarity values to adhere to the three labels.
Most of the spotted target and label errors are recurrent:
• The omission of the segment el within the target, e.g., *Barcó (Barceló)
• Extra blanks and extra or missing quotation marks in the target or sentiment field, e.g.,
*telecos’ (’telecos’), *positive⊔ (positive)
• Casing, e.g., *SUARA (Suara) or *hosteLEría (hostelería)
• Typographical errors in the sentiment labels, e.g., *postive (positive)</p>
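      <p>To make these review criteria concrete, the following sketch (our own illustration, not the script actually used during the review) flags rows whose target string does not occur verbatim in the headline or whose polarity labels fall outside the three-class set; the column names text, target, target_sentiment, companies_sentiment, and consumers_sentiment are assumptions about the CSV layout.</p>
      <preformat>
# Sketch: flag rows whose target is not in the headline or whose labels are
# outside the three-class polarity set. Column names are assumed.
import csv

VALID_POLARITIES = {"positive", "neutral", "negative"}
LABEL_COLUMNS = ["target_sentiment", "companies_sentiment", "consumers_sentiment"]

def flag_suspicious_rows(path):
    suspicious = []
    with open(path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            problems = []
            # The target is expected to be mentioned verbatim in the headline.
            if row["target"].strip().lower() not in row["text"].lower():
                problems.append("target not found in headline")
            # Polarity labels must adhere to the three-class scheme
            # (catches typos such as *postive and trailing blanks).
            for column in LABEL_COLUMNS:
                if row[column].strip() not in VALID_POLARITIES:
                    problems.append(f"invalid label in {column}: {row[column]!r}")
            if problems:
                suspicious.append((row["text"], problems))
    return suspicious
      </preformat>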
      <p>Validating polarity values presents additional challenges, requiring domain knowledge in
finance, and only a few labels were modified during the review process. Namely, ten were
adjusted due to typographical errors, extra spaces, or extra characters.</p>
      <p>Over three hundred instances from the dataset samples were corrected and handled separately.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Data Augmentation</title>
      <p>Machine learning models’ effectiveness and overall capabilities rely on the training data’s
quality, quantity, and relevance. Unfortunately, gathering enough data can be difficult and
costly, resulting in a shortage of available data.</p>
      <p>Data augmentation (DA) refers to strategies for increasing the diversity of training examples
without gathering more data directly [8]. Similarly to AugGPT [9], we used ChatGPT to
rephrase or paraphrase some of the training examples to enhance the training set. To this end,
we designed a step-wise prompt that provides an example of rephrasing a news headline five
times. In addition, the prompt includes instructions on maintaining the target’s format and
other labels in the generated examples since both the input and rephrased examples are in CSV.</p>
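      <p>The exact prompt is not reproduced here; the sketch below only illustrates the general idea using the openai Python client, where the gpt-3.5-turbo model, the Spanish prompt wording, and the helper name rephrase_csv_row are our assumptions rather than the configuration actually used.</p>
      <preformat>
# Sketch of the augmentation call; prompt wording and model choice are assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def rephrase_csv_row(csv_row, n_paraphrases=5):
    """Ask the chat model for n paraphrases of a CSV-formatted training example,
    keeping the target and the three sentiment labels unchanged."""
    prompt = (
        f"Reescribe el titular de esta fila CSV de {n_paraphrases} maneras "
        "distintas. Mantén el formato CSV y no cambies el target ni las "
        "etiquetas de sentimiento.\n" + csv_row
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
    )
    # One paraphrased CSV row per line of the reply.
    return response.choices[0].message.content.splitlines()
      </preformat>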
      <p>For instance, the headline Pemex no reemplazará a 3.000 empleados que se jubilarán este año
can be restated as:
• La petrolera Pemex no tiene previsto sustituir a 3.000 trabajadores que se retirarán este año
• Pemex se niega a contratar nuevos empleados para reemplazar a los 3.000 trabajadores que
se jubilarán este año
• La petrolera Pemex enfrenta la no renovación de 3.000 trabajadores jubilados y no tiene
planes para reemplazarlos
• El plan de Pemex no incluye el reemplazo de los 3.000 trabajadores que se retirarán este año,
lo que podría afectar su productividad
• Pemex anuncia que no habrá sustitución de los 3.000 empleados que se jubilarán este año, lo
que genera preocupación sobre su capacidad para mantener su producción
The last two rephrases of the previous example add a consequence or effect to the initial
statement.</p>
      <p>It is important to note that ChatGPT may occasionally fail to preserve the original intended
meaning or the target in its rephrased versions. The latter is more likely to occur when the
target word is capitalised at the start of a sentence (but is not a proper noun) or when it can
be easily paraphrased. The headline Restaurantes elásticos para sobrevivir exemplifies both
problems in preserving the target Restaurantes elásticos:
• La adaptabilidad de los restaurantes les permite sobrevivir en situaciones cambiantes
• La capacidad de los restaurantes para adaptarse les ayuda a superar los desafíos
• Los restaurantes se vuelven flexibles para mantenerse a flote en tiempos difíciles
• La elasticidad de los restaurantes es clave para su supervivencia
• Los restaurantes demuestran su resiliencia al adaptarse a las circunstancias</p>
      <p>Data augmentation was performed blindly on 527 training examples, resulting in five
rephrased entries from each. Additionally, we chose 250 entries that were accurately annotated
from a linguistic perspective to augment similarly. As a result, we created a new dataset with
3885 examples to use for experimentation.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Modeling</title>
      <p>We decided to approach all the FinancES tasks with the T5 end-to-end architecture [10]. We
initially considered ByT5 [11] and mT5 [12]. After conducting multiple experiments, it was
determined that ByT5 (a byte-based multilingual version of T5) was less effective than mT5 due
to longer training times and inferior results. The mT5 model is a massively multilingual
pretrained text-to-text transformer that can be simultaneously fine-tuned on multiple downstream
tasks using a task prefix or prompt.</p>
      <p>The tasks related to target, target sentiment, company sentiment, and consumer sentiment
annotations have been divided into sub-tasks. This is illustrated in Figure 1, where each
annotation in the example has a different prefix indicating the specific task that needs to be
performed on the headline and the expected output. The mT5 model comes in different sizes,
but only the small, base, and large models were chosen for experimentation, since only these models
fit into the single RTX 3090 24GB GPU card available for this work.</p>
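      <p>As a minimal illustration of this prefix-based multi-task setup (using the Hugging Face transformers library, which may differ from the training code actually used), a single headline can be converted into the four prefixed inputs and decoded with an mT5 checkpoint; the public google/mt5-small checkpoint below is only a stand-in for the fine-tuned model.</p>
      <preformat>
# Sketch: one headline becomes four prefixed text-to-text inputs; an mT5
# checkpoint generates one answer per sub-task.
from transformers import AutoTokenizer, MT5ForConditionalGeneration

PREFIXES = ["target", "target_sentiment", "companies_sentiment", "consumers_sentiment"]

tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")
model = MT5ForConditionalGeneration.from_pretrained("google/mt5-small")

def predict_all(headline):
    predictions = {}
    for prefix in PREFIXES:
        inputs = tokenizer(f"{prefix}: {headline}", return_tensors="pt")
        output_ids = model.generate(**inputs, max_new_tokens=16)
        predictions[prefix] = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    return predictions

print(predict_all("Renfe afronta mañana un nuevo día de paros parciales de los maquinistas"))
      </preformat>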
    </sec>
    <sec id="sec-6">
      <title>6. Experiments and Results</title>
      <p>The data provided for training was divided into two sets: the training set and the development
set. The examples in the development set align with the ones given to participants for
practice. We conducted experiments using four different versions of the original training set:
1. The original training set (T)
2. The augmented original training set (T+A)
3. The corrected training set (T’)
4. The augmented corrected training set (T’+A’)</p>
      <p>[Figure 1: (a) original example: the headline “Renfe afronta mañana un nuevo día de paros
parciales de los maquinistas” annotated for Target, Target Sentiment, Companies Sentiment, and
Consumers Sentiment; (b) converted example: the same headline prefixed with target:,
target_sentiment:, companies_sentiment:, and consumers_sentiment:, with expected outputs
Renfe, negative, negative, and negative, respectively.]</p>
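      <p>Following Figure 1(b), converting one annotated headline into the four prefixed (input, output) training pairs can be sketched as follows; the field names of the annotated row are assumptions about the data layout.</p>
      <preformat>
# Sketch: turn one annotated headline into the four text-to-text training pairs
# of Figure 1(b). The field names of the annotated row are assumptions.
def to_training_pairs(row):
    headline = row["text"]
    return [
        (f"target: {headline}", row["target"]),
        (f"target_sentiment: {headline}", row["target_sentiment"]),
        (f"companies_sentiment: {headline}", row["companies_sentiment"]),
        (f"consumers_sentiment: {headline}", row["consumers_sentiment"]),
    ]

example = {
    "text": "Renfe afronta mañana un nuevo día de paros parciales de los maquinistas",
    "target": "Renfe",
    "target_sentiment": "negative",
    "companies_sentiment": "negative",
    "consumers_sentiment": "negative",
}
for source, target in to_training_pairs(example):
    print(source, "->", target)
      </preformat>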
      <p>As the development set, we used the corrected versions for all the experiments. Throughout
the training process, we used a more straightforward metric called exact match (also known as
subset accuracy) instead of more complicated F1-based metrics in our multi-task framework.
This metric was employed as the early-stopping criterion on the development set. The following
hyperparameters, chosen tentatively, were common to all the experiments:
• learning rate: 1e-4 (constant)
• weight decay: 0.01
• batch size: 12
• optimizer: Adafactor
• epochs: 100 / patience: 10</p>
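      <p>A minimal training-configuration sketch with these hyperparameters, assuming the Hugging Face Seq2SeqTrainer and pre-tokenised train_dataset and dev_dataset objects (the actual training loop may differ), is the following.</p>
      <preformat>
# Sketch of a training setup with the listed hyperparameters; exact match on the
# development set drives early stopping. Datasets are assumed prepared elsewhere.
import numpy as np
from transformers import (AutoTokenizer, MT5ForConditionalGeneration,
                          Seq2SeqTrainingArguments, Seq2SeqTrainer,
                          EarlyStoppingCallback)

tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")
model = MT5ForConditionalGeneration.from_pretrained("google/mt5-small")

def exact_match_metrics(eval_pred):
    # Replace the -100 padding used for labels before decoding.
    labels = np.where(eval_pred.label_ids != -100,
                      eval_pred.label_ids, tokenizer.pad_token_id)
    preds = tokenizer.batch_decode(eval_pred.predictions, skip_special_tokens=True)
    refs = tokenizer.batch_decode(labels, skip_special_tokens=True)
    exact = sum(p.strip() == r.strip() for p, r in zip(preds, refs)) / len(refs)
    return {"exact_match": exact}

args = Seq2SeqTrainingArguments(
    output_dir="mt5-finances",
    learning_rate=1e-4,
    lr_scheduler_type="constant",
    weight_decay=0.01,
    per_device_train_batch_size=12,
    optim="adafactor",
    num_train_epochs=100,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    predict_with_generate=True,
    load_best_model_at_end=True,
    metric_for_best_model="exact_match",
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,   # tokenised text-to-text pairs, assumed prepared
    eval_dataset=dev_dataset,      # corrected development set, assumed prepared
    compute_metrics=exact_match_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=10)],
)
trainer.train()
      </preformat>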
      <sec id="sec-6-1">
        <title>6.1. Results on Noise and Data Augmentation</title>
      </sec>
      <sec id="sec-6-2">
        <title>6.2. Results on Hallucinations</title>
        <p>Any language model generating content is prone to hallucinate unintended text, which can
harm the system’s performance [13]. Using mT5 for the tasks, we only observed hallucinations
in the form of unfaithful text in target identification or fabricated targets. We categorize these
mistakes as "hallucinations" within our system. They can be grouped as follows:
• Typographical hallucinations. Affecting spacing: *jubilacionesforzosas y anticipadas
(jubilaciones forzosas y anticipadas), serial punctuation: *Santander, Sabadell BBVA y CaixaBank
(Santander, Sabadell, BBVA y CaixaBank); casing: *hidrógeno (Hidrógeno), *Dos Heridos
(Dos heridos), *empresas Alicantinas (empresas alicantinas), *coronavirus (Coronavirus),
*ministerio (Ministerio), and *Unicaja Y LIBERBANK (Unicaja y Liberbank); but one of the
most recurrent patterns observed is the capitalized segment le inside a word: *TeLEfónica,
*hosteLEría, *hosteLEros, *TeLEcinco, *hoteLEs, *cadenas hoteLEras, and *TeLEpizza, or the
segment el: *MerkEL, and *ELéctricas.
• Hallucinations inside words: *Cada hogagar (Cada hogar), *Barcclays (Barclays),
*startupups (start-ups), *modeloo Alzira (modelo Alzira), *Tefónica (Telefónica), *"inflación
multipólica" ("inflación monopólica" ), and *marcas líders en gran consumo (marcas líderes en</p>
        <p>gran consumo).
• Lexical hallucinations (some words are replaced by others that are somehow related): *teatral Lliure
(teatro Lliure), *motos minera (marcha minera), *Mar del Norte (Mar del Sur), *Bolsa de
Buenos Argentina (Bolsa de Buenos Aires), *Argentina de regulación (Aires de regulación),
and *web (Internet).</p>
        <p>However, typographical hallucinations like TeLEcinco or TeLEpizza are not considered genuine
hallucinations because they replicate the same errors found in the dataset samples, such as
TeLEfónica. These can be more accurately explained as an instance of noise amplification or
error overfitting.</p>
        <p>[Table 3: uncorrected and corrected target F1, and their difference, for each model size (small,
base, large) and training set (T, T’, T+A, T’+A’).]</p>
        <p>In order to deal with hallucinations in the post-processing stage, it is necessary to anchor
the target predictions to the headline text. This can be achieved by identifying all headline
variations, completing partial words, and using a limited form of string matching.</p>
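        <p>The anchoring procedure is not spelled out in detail; one possible realisation, sketched below with the standard-library difflib module (an assumption, not necessarily the implementation used), selects the headline word span most similar to the predicted target and keeps the prediction unchanged when no span is similar enough.</p>
        <preformat>
# Sketch: anchor a (possibly hallucinated) target prediction to the headline by
# choosing the most similar contiguous word span of the headline; a limited
# form of string matching, as described above. Standard library only.
from difflib import SequenceMatcher

def anchor_target(prediction, headline, min_ratio=0.6):
    words = headline.split()
    pred_norm = prediction.lower()
    best_span, best_ratio = prediction, 0.0
    # Consider every contiguous span of headline words as a candidate target.
    for start in range(len(words)):
        for end in range(start + 1, len(words) + 1):
            span = " ".join(words[start:end])
            ratio = SequenceMatcher(None, pred_norm, span.lower()).ratio()
            if ratio > best_ratio:
                best_span, best_ratio = span, ratio
    # Only replace the prediction when the match is convincing enough.
    return best_span if best_ratio >= min_ratio else prediction

# Example: repairs the hallucinated casing inside "TeLEfónica".
print(anchor_target("TeLEfónica", "Telefónica lanza una nueva oferta"))
        </preformat>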
        <p>Table 3 displays the results for post-correction of the target, showing the F1s of the original and
corrected versions of the systems’ output. The diference column indicates a slight improvement
in the corrected versions regardless of the model’s size or training set used. However, the
improvement decreases on average as the model’s size increases.</p>
      </sec>
      <sec id="sec-6-3">
        <title>6.3. Best Systems</title>
        <p>Finally, because no single training dataset fits all tasks, the best-performing systems for each of
the FinancES 2023 tasks are the following:
• Task 1:
– Model: mT5-large
– Training set: Uncorrected training set (T)
– Task F1: 0.8019
– Target F1: 0.8732
– Target sentiment F1: 0.7305</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusions and Future Work</title>
      <p>While correcting hallucinations and using the large model are beneficial, it remains to be
determined whether augmenting or correcting the training set improves performance on the FinancES
shared tasks. According to [8], a plausible hypothesis suggests that adding more data may
not necessarily improve the performance of large pre-trained transformers when working on
tasks that already have sufficient representation in the pretraining data. Whether or not this
hypothesis applies to the FinancES tasks is left as future work.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgements</title>
      <p>This publication is part of the project “Computational linguistic methods for readability and
simplification of financial narratives” (CLARA-FINT, PID2020-116001RB-C31), funded by the
Spanish Ministry of Science and Innovation and the State Research Agency.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Moreno-Sandoval</surname>
          </string-name>
          (Ed.),
          <source>Financial Narrative Processing in Spanish, Tirant lo Blanch</source>
          , Valencia,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Moreno-Sandoval</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gisbert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. A.</given-names>
            <surname>Haya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Guerrero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Montoro</surname>
          </string-name>
          ,
          <article-title>Tone Analysis in Spanish Financial Reporting Narratives</article-title>
          ,
          <source>in: Proceedings of the Second Financial Narrative Processing Workshop (FNP</source>
          <year>2019</year>
          )
          <article-title>NoDaLiDa, Association for Computational Linguistics</article-title>
          , Online,
          <year>2019</year>
          . URL: https://aclanthology.org/W19-6406.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] N. Bel, G. Bracons, S. Anderberg, Finding Evidence of Fraudster Companies in the CEO’s Letter to Shareholders with Sentiment Analysis, Information 12 (2021).</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] S. M. Jiménez-Zafra, F. Rangel, M. Montes-y Gómez, Overview of IberLEF 2023: Natural Language Processing Challenges for Spanish and other Iberian Languages, in: Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2023), co-located with the 39th Conference of the Spanish Society for Natural Language Processing (SEPLN 2023), CEUR-WS.org, 2023.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] J. A. García-Díaz, Á. Almela, F. García-Sánchez, G. Alcaraz-Mármol, M. J. Marín-Pérez, R. Valencia-García, Overview of FinancES 2023: Financial Targeted Sentiment Analysis in Spanish, Procesamiento del Lenguaje Natural 71 (2023).</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] R. Pan, J. A. García-Díaz, F. García-Sánchez, R. Valencia-García, Evaluation of transformer models for financial targeted sentiment analysis in Spanish, PeerJ Computer Science 9 (2023).</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] C. G. Northcutt, A. Athalye, J. Mueller, Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks, 2021. arXiv:2103.14749.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] S. Y. Feng, V. Gangal, J. Wei, S. Chandar, S. Vosoughi, T. Mitamura, E. Hovy, A Survey of Data Augmentation Approaches for NLP, in: Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, Association for Computational Linguistics, Online, 2021.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] H. Dai, Z. Liu, W. Liao, X. Huang, Y. Cao, Z. Wu, L. Zhao, S. Xu, W. Liu, N. Liu, S. Li, D. Zhu, H. Cai, L. Sun, Q. Li, D. Shen, T. Liu, X. Li, AugGPT: Leveraging ChatGPT for Text Data Augmentation, 2023. arXiv:2302.13007.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] J. Ni, G. Hernandez Abrego, N. Constant, J. Ma, K. Hall, D. Cer, Y. Yang, Sentence-T5: Scalable Sentence Encoders from Pre-trained Text-to-Text Models, in: Findings of the Association for Computational Linguistics: ACL 2022, Association for Computational Linguistics, Dublin, Ireland, 2022.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] L. Xue, A. Barua, N. Constant, R. Al-Rfou, S. Narang, M. Kale, A. Roberts, C. Raffel, ByT5: Towards a Token-Free Future with Pre-trained Byte-to-Byte Models, Transactions of the Association for Computational Linguistics 10 (2022).</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] L. Xue, N. Constant, A. Roberts, M. Kale, R. Al-Rfou, A. Siddhant, A. Barua, C. Raffel, mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer, in: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, Online, 2021.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] Z. Ji, N. Lee, R. Frieske, T. Yu, D. Su, Y. Xu, E. Ishii, Y. J. Bang, A. Madotto, P. Fung, Survey of Hallucination in Natural Language Generation, ACM Computing Surveys 55 (2023) 1–38.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>