<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>LLI-UAM Team at FinancES 2023: Noise, Data Augmentation and Hallucinations</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jordi Porta-Zamorano</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yanco Torterolo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Antonio Moreno-Sandoval</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Laboratorio de Lingüística Informática, Universidad Autónoma de Madrid</institution>
          ,
          <addr-line>Cantoblanco, 28049, Madrid</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <abstract>
        <p>This paper describes the T5-based system developed for the FinancES 2023 Shared Task by the Laboratorio de Lingüística Informática at UAM. The LLI-UAM system achieved a good ranking in all the tasks. The paper also describes some experiments on noise, data augmentation, and hallucination mitigation. In particular, we used corrected versions of the datasets to evaluate the impact of noise. Moreover, ChatGPT was utilised to augment the data and improve tagging accuracy. We also describe the presence of hallucinations. Ultimately, we identify the best model for each task and draw conclusions based on our findings.</p>
      </abstract>
      <kwd-group>
        <kwd>Data augmentation</kwd>
        <kwd>ChatGPT</kwd>
        <kwd>noise</kwd>
        <kwd>hallucinations</kwd>
        <kwd>mT5</kwd>
        <kwd>FinancES shared task</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>[Table 1: official leaderboard results for the FinancES 2023 tasks (Ranking, Team, and F1 scores per task); (a) Task 1: financial targeted sentiment analysis.]</p>
      <p>Each headline is annotated with respect to three participants: (1) the target of the news
item; (2) the individual economic agent: companies; and (3) the individual economic
agent/patient: consumers. The news item impacts the target and the economic participants, and
this impact is categorised as positive, negative, or neutral. FinancES proposes two tasks: (1) identifying the target entity
in the text and determining the emotional polarity towards that target, and (2) assessing the
impact of a news headline on companies and consumers regarding their stance and expressed
polarity values. Our systems reached the second position for Task 1 and the first position for
Task 2 in the official leaderboards, as can be seen in Table 1 for the LLI-UAM team.</p>
      <p>This paper outlines the system developed by the LLI-UAM team and presents its contributions
to the FinancES shared task. First, the dataset and the noise found in the examples are described
(sections 2 and 3, respectively). Next, we show how data augmentation was performed
with ChatGPT. In section 5, we describe the deep learning model used. The longest part is
devoted to discussing the results of the different experiments (noise, data augmentation, and
hallucinations).</p>
      <p>[Table 2: sample headlines from the dataset, e.g., “Peligroso atasco en los fondos que invierten
en renovables”; “El fondo de recuperación de la UE ’va demasiado lento’, según el ministro de
economía francés”; “El PSOE propone un intrumento para abaratar los préstamos a las empresas
similar al británico”; “Madrid negocia con Economía para poder destinar las viviendas del
’banco malo’ a desahuciados”.]</p>
    </sec>
    <sec id="sec-2">
      <title>2. The FinancES Dataset</title>
      <p>The FinancES dataset and its annotation process are detailed in [5] and [6]. The dataset consists
of news headlines written in Spanish, collected from digital newspapers specialised in economic
and financial news from various Spanish-speaking countries. Each headline is labelled to identify
the target and sentiment polarity across three dimensions: target, companies, and consumers,
employing a three-class polarity value system (positive, neutral, or negative). According to
[6], three organising committee members manually annotated each headline. In cases of
disagreement, the annotators engaged in discussions to resolve the matter, and if no consensus
was reached, the headline was excluded. Table 2 illustrates a selection of examples from the
dataset.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Noise in the Dataset</title>
      <p>Annotated data holds paramount importance for training and evaluating machine learning
models. Consequently, the annotations should exhibit a high level of accuracy. However,
recent research has demonstrated that this is not always the case, revealing a surprising
number of annotation errors or inconsistencies in even widely-used datasets [7]. Since humans
typically carry out dataset annotations, errors or inconsistencies are an inherent possibility.
Such inaccuracies can adversely affect a model’s performance, potentially leading to erroneous
predictions. Although effective, rectifying these labelling errors incurs high costs and demands
substantial time investments.</p>
      <p>Upon inspecting the dataset, we identified errors in target and polarity labels that could
impact the model’s performance. These errors were readily apparent, as we expected the target
to be mentioned within the news headline and the polarity values to adhere to the three labels.
Most of the spotted target and label errors are recurrent:
• The omission of the segment el within the target, e.g., *Barcó (Barceló)
• Extra blanks and extra or missing quotation marks in the target or sentiment field, e.g.,
*telecos’ (’telecos’), *positive⊔ (positive)
• Casing, e.g., *SUARA (Suara) or *hosteLEría (hostelería)
• Typographical errors in the sentiment labels, e.g., *postive (positive)</p>
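      <p>To make these review criteria concrete, the following sketch (our own illustration, not the script actually used during the review) flags rows whose target string does not occur verbatim in the headline or whose polarity labels fall outside the three-class set; the column names text, target, target_sentiment, companies_sentiment, and consumers_sentiment are assumptions about the CSV layout.</p>
      <preformat>
# Sketch: flag rows whose target is not in the headline or whose labels are
# outside the three-class polarity set. Column names are assumed.
import csv

VALID_POLARITIES = {"positive", "neutral", "negative"}
LABEL_COLUMNS = ["target_sentiment", "companies_sentiment", "consumers_sentiment"]

def flag_suspicious_rows(path):
    suspicious = []
    with open(path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            problems = []
            # The target is expected to be mentioned verbatim in the headline.
            if row["target"].strip().lower() not in row["text"].lower():
                problems.append("target not found in headline")
            # Polarity labels must adhere to the three-class scheme
            # (catches typos such as *postive and trailing blanks).
            for column in LABEL_COLUMNS:
                if row[column].strip() not in VALID_POLARITIES:
                    problems.append(f"invalid label in {column}: {row[column]!r}")
            if problems:
                suspicious.append((row["text"], problems))
    return suspicious
      </preformat>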
      <p>Validating polarity values presents additional challenges, requiring domain knowledge in
finance, and only a few labels were modified during the review process. Namely, ten were
adjusted due to typographical errors, extra spaces, or extra characters.</p>
      <p>Over three hundred instances from the dataset samples were corrected and handled separately.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Data Augmentation</title>
      <p>Machine learning models’ effectiveness and overall capabilities rely on the training data’s
quality, quantity, and relevance. Unfortunately, gathering enough data can be difficult and
costly, resulting in a shortage of available data.</p>
      <p>Data augmentation (DA) refers to strategies for increasing the diversity of training examples
without gathering more data directly [8]. Similarly to AugGPT [9], we used ChatGPT to
rephrase or paraphrase some of the training examples to enhance the training set. To this end,
we designed a step-wise prompt that provides an example of rephrasing a news headline five
times. In addition, the prompt includes instructions on maintaining the target’s format and
other labels in the generated examples since both the input and rephrased examples are in CSV.</p>
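      <p>The exact prompt is not reproduced here; the sketch below only illustrates the general idea using the openai Python client, where the gpt-3.5-turbo model, the Spanish prompt wording, and the helper name rephrase_csv_row are our assumptions rather than the configuration actually used.</p>
      <preformat>
# Sketch of the augmentation call; prompt wording and model choice are assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def rephrase_csv_row(csv_row, n_paraphrases=5):
    """Ask the chat model for n paraphrases of a CSV-formatted training example,
    keeping the target and the three sentiment labels unchanged."""
    prompt = (
        f"Reescribe el titular de esta fila CSV de {n_paraphrases} maneras "
        "distintas. Mantén el formato CSV y no cambies el target ni las "
        "etiquetas de sentimiento.\n" + csv_row
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
    )
    # One paraphrased CSV row per line of the reply.
    return response.choices[0].message.content.splitlines()
      </preformat>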
      <p>For instance, the headline Pemex no reemplazará a 3.000 empleados que se jubilarán este año
can be restated as:
• La petrolera Pemex no tiene previsto sustituir a 3.000 trabajadores que se retirarán este año
• Pemex se niega a contratar nuevos empleados para reemplazar a los 3.000 trabajadores que
se jubilarán este año
• La petrolera Pemex enfrenta la no renovación de 3.000 trabajadores jubilados y no tiene
planes para reemplazarlos
• El plan de Pemex no incluye el reemplazo de los 3.000 trabajadores que se retirarán este año,
lo que podría afectar su productividad
• Pemex anuncia que no habrá sustitución de los 3.000 empleados que se jubilarán este año, lo
que genera preocupación sobre su capacidad para mantener su producción
The last two rephrases of the previous example add a consequence or effect to the initial
statement.</p>
      <p>It is important to note that ChatGPT may occasionally fail to preserve the original intended
meaning or the target in its rephrased versions. The latter is more likely to occur when the
target word is capitalised at the start of a sentence (but is not a proper noun) or when it can
be easily paraphrased. The headline Restaurantes elásticos para sobrevivir exemplifies both
problems in preserving the target Restaurantes elásticos:
• La adaptabilidad de los restaurantes les permite sobrevivir en situaciones cambiantes
• La capacidad de los restaurantes para adaptarse les ayuda a superar los desafíos
• Los restaurantes se vuelven flexibles para mantenerse a flote en tiempos difíciles
• La elasticidad de los restaurantes es clave para su supervivencia
• Los restaurantes demuestran su resiliencia al adaptarse a las circunstancias</p>
      <p>Data augmentation was performed blindly on 527 training examples, resulting in five
rephrased entries from each. Additionally, we chose 250 entries that were accurately annotated
from a linguistic perspective to augment similarly. As a result, we created a new dataset with
3885 examples to use for experimentation.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Modeling</title>
      <p>We decided to approach all the FinancES tasks with the T5 end-to-end architecture [10]. We
initially considered ByT5 [11] and mT5 [12]. After conducting multiple experiments, it was
determined that ByT5 (a byte-based multilingual version of T5) was less effective than mT5 due
to longer training times and inferior results. The mT5 model is a massively multilingual
pretrained text-to-text transformer that can be simultaneously fine-tuned on multiple downstream
tasks using a task prefix or prompt.</p>
      <p>The tasks related to target, target sentiment, company sentiment, and consumer sentiment
annotations have been divided into sub-tasks. This is illustrated in Figure 1, where each
annotation in the example has a different prefix indicating the specific task that needs to be
performed on the headline and the expected output. The mT5 model comes in different sizes,
but only the small, base, and large models were chosen for experimentation, since only these models
fit into the single RTX 3090 24GB GPU card available for this work.</p>
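      <p>As a minimal illustration of this prefix-based multi-task setup (using the Hugging Face transformers library, which may differ from the training code actually used), a single headline can be converted into the four prefixed inputs and decoded with an mT5 checkpoint; the public google/mt5-small checkpoint below is only a stand-in for the fine-tuned model.</p>
      <preformat>
# Sketch: one headline becomes four prefixed text-to-text inputs; an mT5
# checkpoint generates one answer per sub-task.
from transformers import AutoTokenizer, MT5ForConditionalGeneration

PREFIXES = ["target", "target_sentiment", "companies_sentiment", "consumers_sentiment"]

tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")
model = MT5ForConditionalGeneration.from_pretrained("google/mt5-small")

def predict_all(headline):
    predictions = {}
    for prefix in PREFIXES:
        inputs = tokenizer(f"{prefix}: {headline}", return_tensors="pt")
        output_ids = model.generate(**inputs, max_new_tokens=16)
        predictions[prefix] = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    return predictions

print(predict_all("Renfe afronta mañana un nuevo día de paros parciales de los maquinistas"))
      </preformat>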
    </sec>
    <sec id="sec-6">
      <title>6. Experiments and Results</title>
      <p>The data provided for training was divided into two sets: the training set and the development
set. The examples in the development set align with the ones given to participants for
practice. We conducted experiments using four different versions of the original training set:
1. The original training set (T)
2. The augmented original training set (T+A)
3. The corrected training set (T’)
4. The augmented corrected training set (T’+A’)</p>
      <p>[Figure 1: (a) original example: the headline “Renfe afronta mañana un nuevo día de paros
parciales de los maquinistas” annotated for Target, Target Sentiment, Companies Sentiment, and
Consumers Sentiment; (b) converted example: the same headline prefixed with target:,
target_sentiment:, companies_sentiment:, and consumers_sentiment:, with expected outputs
Renfe, negative, negative, and negative, respectively.]</p>
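      <p>Following Figure 1(b), converting one annotated headline into the four prefixed (input, output) training pairs can be sketched as follows; the field names of the annotated row are assumptions about the data layout.</p>
      <preformat>
# Sketch: turn one annotated headline into the four text-to-text training pairs
# of Figure 1(b). The field names of the annotated row are assumptions.
def to_training_pairs(row):
    headline = row["text"]
    return [
        (f"target: {headline}", row["target"]),
        (f"target_sentiment: {headline}", row["target_sentiment"]),
        (f"companies_sentiment: {headline}", row["companies_sentiment"]),
        (f"consumers_sentiment: {headline}", row["consumers_sentiment"]),
    ]

example = {
    "text": "Renfe afronta mañana un nuevo día de paros parciales de los maquinistas",
    "target": "Renfe",
    "target_sentiment": "negative",
    "companies_sentiment": "negative",
    "consumers_sentiment": "negative",
}
for source, target in to_training_pairs(example):
    print(source, "->", target)
      </preformat>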
      <p>As the development set, we used the corrected versions for all the experiments. Throughout
the training process, we used a more straightforward metric called exact match (also known as
subset accuracy) instead of more complicated F1-based metrics in our multi-task framework.
This metric was employed as the early-stopping criterion on the development set. The following
hyperparameters, chosen tentatively, were common to all the experiments:
• learning rate: 1e-4 (constant)
• weight decay: 0.01
• batch size: 12
• optimizer: Adafactor
• epochs: 100 / patience: 10</p>
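      <p>A minimal training-configuration sketch with these hyperparameters, assuming the Hugging Face Seq2SeqTrainer and pre-tokenised train_dataset and dev_dataset objects (the actual training loop may differ), is the following.</p>
      <preformat>
# Sketch of a training setup with the listed hyperparameters; exact match on the
# development set drives early stopping. Datasets are assumed prepared elsewhere.
import numpy as np
from transformers import (AutoTokenizer, MT5ForConditionalGeneration,
                          Seq2SeqTrainingArguments, Seq2SeqTrainer,
                          EarlyStoppingCallback)

tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")
model = MT5ForConditionalGeneration.from_pretrained("google/mt5-small")

def exact_match_metrics(eval_pred):
    # Replace the -100 padding used for labels before decoding.
    labels = np.where(eval_pred.label_ids != -100,
                      eval_pred.label_ids, tokenizer.pad_token_id)
    preds = tokenizer.batch_decode(eval_pred.predictions, skip_special_tokens=True)
    refs = tokenizer.batch_decode(labels, skip_special_tokens=True)
    exact = sum(p.strip() == r.strip() for p, r in zip(preds, refs)) / len(refs)
    return {"exact_match": exact}

args = Seq2SeqTrainingArguments(
    output_dir="mt5-finances",
    learning_rate=1e-4,
    lr_scheduler_type="constant",
    weight_decay=0.01,
    per_device_train_batch_size=12,
    optim="adafactor",
    num_train_epochs=100,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    predict_with_generate=True,
    load_best_model_at_end=True,
    metric_for_best_model="exact_match",
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,   # tokenised text-to-text pairs, assumed prepared
    eval_dataset=dev_dataset,      # corrected development set, assumed prepared
    compute_metrics=exact_match_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=10)],
)
trainer.train()
      </preformat>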
      <sec id="sec-6-1">
        <title>6.1. Results on Noise and Data Augmentation</title>
      </sec>
      <sec id="sec-6-2">
        <title>6.2. Results on Hallucinations</title>
        <p>Any language model generating content is prone to hallucinate unintended text, which can
harm the system’s performance [13]. Using mT5 for the tasks, we only observed hallucinations
in the form of unfaithful text in target identification or fabricated targets. We categorize these
mistakes as "hallucinations" within our system. They can be grouped as follows:
• Typographical hallucinations. Affecting spacing: *jubilacionesforzosas y anticipadas
(jubilaciones forzosas y anticipadas), serial punctuation: *Santander, Sabadell BBVA y CaixaBank
(Santander, Sabadell, BBVA y CaixaBank); casing: *hidrógeno (Hidrógeno), *Dos Heridos
(Dos heridos), *empresas Alicantinas (empresas alicantinas), *coronavirus (Coronavirus),
*ministerio (Ministerio), and *Unicaja Y LIBERBANK (Unicaja y Liberbank); but one of the
most recurrent patterns observed is the capitalized segment le inside a word: *TeLEfónica,
*hosteLEría, *hosteLEros, *TeLEcinco, *hoteLEs, *cadenas hoteLEras, and *TeLEpizza, or the
segment el: *MerkEL, and *ELéctricas.
• Hallucinations inside words: *Cada hogagar (Cada hogar), *Barcclays (Barclays),
*startupups (start-ups), *modeloo Alzira (modelo Alzira), *Tefónica (Telefónica), *"inflación
multipólica" ("inflación monopólica" ), and *marcas líders en gran consumo (marcas líderes en</p>
        <p>gran consumo).
• Lexical hallucinations (some words are replaced by others that are somehow related): *teatral Lliure
(teatro Lliure), *motos minera (marcha minera), *Mar del Norte (Mar del Sur), *Bolsa de
Buenos Argentina (Bolsa de Buenos Aires), *Argentina de regulación (Aires de regulación),
and *web (Internet).</p>
        <p>However, typographical hallucinations like TeLEcinco or TeLEpizza are not considered genuine
hallucinations because they replicate the same errors found in the dataset samples, such as
TeLEfónica. These can be more accurately explained as an instance of noise amplification or
error overfitting.</p>
        <p>[Table 3: uncorrected and corrected target F1, and their difference, for each model size (small,
base, large) and training set (T, T’, T+A, T’+A’).]</p>
        <p>In order to deal with hallucinations in the post-processing stage, it is necessary to anchor
the target predictions to the headline text. This can be achieved by identifying all headline
variations, completing partial words, and using a limited form of string matching.</p>
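        <p>The anchoring procedure is not spelled out in detail; one possible realisation, sketched below with the standard-library difflib module (an assumption, not necessarily the implementation used), selects the headline word span most similar to the predicted target and keeps the prediction unchanged when no span is similar enough.</p>
        <preformat>
# Sketch: anchor a (possibly hallucinated) target prediction to the headline by
# choosing the most similar contiguous word span of the headline; a limited
# form of string matching, as described above. Standard library only.
from difflib import SequenceMatcher

def anchor_target(prediction, headline, min_ratio=0.6):
    words = headline.split()
    pred_norm = prediction.lower()
    best_span, best_ratio = prediction, 0.0
    # Consider every contiguous span of headline words as a candidate target.
    for start in range(len(words)):
        for end in range(start + 1, len(words) + 1):
            span = " ".join(words[start:end])
            ratio = SequenceMatcher(None, pred_norm, span.lower()).ratio()
            if ratio > best_ratio:
                best_span, best_ratio = span, ratio
    # Only replace the prediction when the match is convincing enough.
    return best_span if best_ratio >= min_ratio else prediction

# Example: repairs the hallucinated casing inside "TeLEfónica".
print(anchor_target("TeLEfónica", "Telefónica lanza una nueva oferta"))
        </preformat>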
        <p>Table 3 displays the results for post-correction of the target, showing the F1s of the original and
corrected versions of the systems’ output. The diference column indicates a slight improvement
in the corrected versions regardless of the model’s size or training set used. However, the
improvement decreases on average as the model’s size increases.</p>
      </sec>
      <sec id="sec-6-3">
        <title>6.3. Best Systems</title>
        <p>Finally, because no single training dataset fits all tasks, the best-performing systems for each of
the FinancES 2023 tasks are the following:
• Task 1:
– Model: mT5-large
– Training set: Uncorrected training set (T)
– Task F1: 0.8019
– Target F1: 0.8732
– Target sentiment F1: 0.7305</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusions and Future Work</title>
      <p>While correcting hallucinations and using the large model are beneficial, it remains to be
determined whether augmenting or correcting the training set improves performance on the FinancES
shared tasks. According to [8], a plausible hypothesis suggests that adding more data may
not necessarily improve the performance of large pre-trained transformers when working on
tasks that already have sufficient representation in the pretraining data. Whether or not this
hypothesis applies to the FinancES tasks is left as future work.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgements</title>
      <p>This publication is part of the project “Computational linguistic methods for readability and
simplification of financial narratives” (CLARA-FINT, PID2020-116001RB-C31), funded by the
Spanish Ministry of Science and Innovation and the State Research Agency.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Moreno-Sandoval</surname>
          </string-name>
          (Ed.),
          <source>Financial Narrative Processing in Spanish, Tirant lo Blanch</source>
          , Valencia,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Moreno-Sandoval</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gisbert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. A.</given-names>
            <surname>Haya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Guerrero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Montoro</surname>
          </string-name>
          ,
          <article-title>Tone Analysis in Spanish Financial Reporting Narratives</article-title>
          ,
          <source>in: Proceedings of the Second Financial Narrative Processing Workshop (FNP</source>
          <year>2019</year>
          )
          <article-title>NoDaLiDa, Association for Computational Linguistics</article-title>
          , Online,
          <year>2019</year>
          . URL: https://aclanthology.org/W19-6406.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] N. Bel, G. Bracons, S. Anderberg, Finding Evidence of Fraudster Companies in the CEO’s Letter to Shareholders with Sentiment Analysis, Information 12 (2021).</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] S. M. Jiménez-Zafra, F. Rangel, M. Montes-y Gómez, Overview of IberLEF 2023: Natural Language Processing Challenges for Spanish and other Iberian Languages, in: Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2023), co-located with the 39th Conference of the Spanish Society for Natural Language Processing (SEPLN 2023), CEUR-WS.org, 2023.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] J. A. García-Díaz, Á. Almela, F. García-Sánchez, G. Alcaraz-Mármol, M. J. Marín-Pérez, R. Valencia-García, Overview of FinancES 2023: Financial Targeted Sentiment Analysis in Spanish, Procesamiento del Lenguaje Natural 71 (2023).</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] R. Pan, J. A. García-Díaz, F. García-Sánchez, R. Valencia-García, Evaluation of transformer models for financial targeted sentiment analysis in Spanish, PeerJ Computer Science 9 (2023).</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] C. G. Northcutt, A. Athalye, J. Mueller, Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks, 2021. arXiv:2103.14749.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] S. Y. Feng, V. Gangal, J. Wei, S. Chandar, S. Vosoughi, T. Mitamura, E. Hovy, A Survey of Data Augmentation Approaches for NLP, in: Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, Association for Computational Linguistics, Online, 2021.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] H. Dai, Z. Liu, W. Liao, X. Huang, Y. Cao, Z. Wu, L. Zhao, S. Xu, W. Liu, N. Liu, S. Li, D. Zhu, H. Cai, L. Sun, Q. Li, D. Shen, T. Liu, X. Li, AugGPT: Leveraging ChatGPT for Text Data Augmentation, 2023. arXiv:2302.13007.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] J. Ni, G. Hernandez Abrego, N. Constant, J. Ma, K. Hall, D. Cer, Y. Yang, Sentence-T5: Scalable Sentence Encoders from Pre-trained Text-to-Text Models, in: Findings of the Association for Computational Linguistics: ACL 2022, Association for Computational Linguistics, Dublin, Ireland, 2022.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] L. Xue, A. Barua, N. Constant, R. Al-Rfou, S. Narang, M. Kale, A. Roberts, C. Raffel, ByT5: Towards a Token-Free Future with Pre-trained Byte-to-Byte Models, Transactions of the Association for Computational Linguistics 10 (2022).</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] L. Xue, N. Constant, A. Roberts, M. Kale, R. Al-Rfou, A. Siddhant, A. Barua, C. Raffel, mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer, in: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, Online, 2021.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] Z. Ji, N. Lee, R. Frieske, T. Yu, D. Su, Y. Xu, E. Ishii, Y. J. Bang, A. Madotto, P. Fung, Survey of Hallucination in Natural Language Generation, ACM Computing Surveys 55 (2023) 1–38.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>