=Paper=
{{Paper
|id=Vol-3878/112_main_short
|storemode=property
|title=Topic Modeling for Auditing Purposes in the Banking Sector
|pdfUrl=https://ceur-ws.org/Vol-3878/112_main_short.pdf
|volume=Vol-3878
|authors=Alessandro Giaconia,Valeria Chiariello,Marco Passarotti
|dblpUrl=https://dblp.org/rec/conf/clic-it/GiaconiaCP24
}}
==Topic Modeling for Auditing Purposes in the Banking Sector==
Alessandro Giaconia1,*, Valeria Chiariello2, Sara Giannuzzi2 and Marco Passarotti1
1 CIRCSE Research Centre, Università Cattolica del Sacro Cuore, Largo Gemelli 1, 20123 Milano, Italy
2 CREDEM, Via Emilia San Pietro 4, 42121 Reggio Emilia, Italy

CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04–06, 2024, Pisa, Italy
* Corresponding author.
Emails: alessandro.giaconia01@icatt.it (A. Giaconia); vchiariello@credem.it (V. Chiariello); sgiannuzzi@credem.it (S. Giannuzzi); marco.passarotti@unicatt.it (M. Passarotti)
ORCID: 0000-0002-9806-7187 (M. Passarotti)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073.
Abstract
This study explores the application of topic modeling techniques for auditing purposes in the banking sector, focusing on the analysis of
reviews of anti-money laundering alerts. We compare three topic modeling algorithms: Latent Dirichlet Allocation (LDA), Embedded
Topic Model (ETM), and Product of Experts LDA (ProdLDA), using a dataset of 35,000 suspicious activity reports from an Italian bank.
The models were evaluated using the coherence score, NPMI coherence, and topic diversity metrics. Our results show that ProdLDA
consistently outperformed LDA and ETM, with the best performance achieved using 1-gram word embeddings. The study reveals
distinct topics related to specific client activities, cross-border transactions, and high-risk business sectors, like gambling. These results
demonstrate the potential of advanced topic modeling techniques in enhancing the efficiency and effectiveness of auditing processes in
the banking sector, particularly in the analysis of activities that could be tied to money laundering and terrorism financing.
Keywords
Topic modeling, Auditing, Banking sector
1. Introduction

There has always been a close connection between banks and the collection of different kinds of empirical data: banks, just like any other company, have always poured large amounts of resources into understanding numbers and how to deal with them. Numerical data, being closely related to the financial performance of companies, has always taken the spotlight. Linguistic data, on the other hand, has received far less attention, due to the difficulty of analyzing it and to underwhelming performance.

But things are changing. More and more companies are recognizing the value of language, which carries information that no number can convey. Natural Language Processing (NLP) tasks, language resources, and computational linguistics practices, such as sentiment analysis [1] and word embeddings [2], have now become a staple in many settings. In fact, there is a wide variety of linguistic data that banks can exploit: emails, bank transfer descriptions, internal communications, and customer feedback. Some peculiar issues arise when dealing with linguistic data in the banking sector, such as the use of acronyms, abbreviations, and technical terminology. These data are often proprietary, meaning that the bank owns them and access is forbidden to external parties. While the quantity of information they contain is massive, a downside is that the impossibility of sharing it with other banks hinders any more global analysis.

In this context, this paper explores the application of topic modeling techniques to the auditing process, in particular for the analysis of reviews of anti-money laundering (AML) alerts. Topic modeling can, in fact, be an incredibly helpful tool for auditors who want to perform in-depth analyses of large amounts of data.

We first present an overview of topic modeling algorithms and of their applications in the banking sector, both as documented in scientific research and as concretely deployed within banks. We then provide a comprehensive description of the data employed, followed by the preprocessing operations. We then present the results and their interpretation, leading into the conclusions. Finally, we suggest a number of future works that could expand on this topic.

2. Related work

Topic modeling is an unsupervised NLP task consisting in the extraction of latent themes from a given corpus. Latent Dirichlet Allocation (LDA) [3] is a probabilistic generative model that became the most widely used and expanded-upon topic model. However, LDA faces several limitations, such as poor scalability, low performance on large datasets, and difficulty handling polysemy and homonymy [4].

To overcome the limitations of LDA, much effort has been put into developing models that rely on word embeddings and neural networks, like ETM [5] and ProdLDA [6]. These models have been shown to provide better performance than LDA, at the cost of a higher computational effort [7].

In the last decade, topic modeling has already been widely employed in the banking sector, and in auditing as well: [8] focused on the assessment and handling of fraud, while [9] analyzed financial misreporting. Another popular subject of analysis is accounting (for example, [10]).

3. Data

The data employed is a collection of reviews of anti-money laundering alerts that are automatically raised by a rule-based detection tool, whose name cannot be disclosed due to a specific request. This tool is widely employed across all Italian banks and is aimed at tackling potential money laundering and terrorism financing schemes. It uses advanced algorithms to identify patterns that deviate from standard behavior.

An activity is considered suspicious whenever it exceeds certain risk thresholds. These activities are then reviewed by a human operator, who evaluates whether the movement is actually tied to illegal operations. If the operation is not considered dangerous, or if there is not enough evidence to decide whether the activity is actually a threat, the operator writes a brief review consisting of two sections. The first section is a description of the analyzed activity; the second is either an explanation of why the activity was not considered dangerous, or a statement about the lack of evidence and the need to keep monitoring. This latter kind of review usually ends with expressions such as 'monitoriamo' ('we keep monitoring') and 'continuiamo a monitorare' ('we continue to monitor'). The dataset employed consists of such reviews.

In Table 1 we provide two examples of documents, with their corresponding English translations. The English translations have been cleaned of abbreviations and spelling mistakes.
Italian:
CASEIFICIO.MOVIM.COERENTE CON TIPO DI ATTIVITA’(ACCONTI A CONF.E PAGAM FORNITORI).
IL CASEIF SI STA FONDENDO CON ALTRA LATTERIA, STA VENDENDO FORMAGGIO E SALDANDO I
DEBITI.OK DOC REDD., OK ADEG.VERIF.NON SEGNALARE
English:
Cheese factory. Movement consistent with the type of activity (advance payments to contributors and payments
to suppliers). The cheese factory is merging with another milk factory, it’s selling cheese and settling debts.
Income documentation is ok, adequate verification is ok. Do not report.
Italian:
TRATTASI DI FRUTTA E VERDURA ATTIVO SULLA PIAZZA DI ***UNICO FRUTTA E VERDURA DELLA
PIZZA. ATTIVO CC CHE RACC INCASSI E ADDEBRELATIVI ALL’ATTIVITA’.AL MOMENTO NO PART
ANOMALIE. MONITORIAMO
English:
Case of greengrocer active in the square of ***, only greengrocer in the square. Active bank account, that
collects income and charges relative to the activity. No particular anomalies at the moment. We keep
monitoring.
Table 1: Examples of sentences from the dataset, with translations
Due to hardware limitations, we worked with a random selection of 35,000 documents. The data is owned by Credem and is not publicly available, due to legal constraints. It is not possible to reveal the time period in which these documents were collected, nor the size of the whole dataset. Each document contains an average of 20.94 tokens.

It is important to note that the documents feature an abundance of spelling errors, abbreviations, acronyms, and missing blank spaces between words. This is in part due to a 300-character limit. By comparing the tokens in the dataset with a dictionary of 4 million Italian words1, we obtain the results shown in Table 2.

Metric | Value
Total number of tokens | 1,474,077
Total number of Out Of Vocabulary (OOV) tokens | 193,482
Total number of OOV types | 29,809
Number of sentences containing 1+ OOVs | 60,870
Ratio of OOVs over the total number of tokens | 0.1313

Table 2: OOVs in the complete dataset

The dictionary has been further enhanced in a data-driven fashion, by including a list of Italian first names2 and surnames3, and a list of the most frequent acronyms featured in the dataset, so that these are not incorrectly counted as OOVs. In order to find the acronyms, we created a list of all OOVs in the dataset, sorted by descending frequency. The 20 most frequent acronyms were added to the dictionary, such as PEP (Persona Politicamente Esposta, 'politically exposed person') and CC (Conto Corrente, 'current account').

The table shows that about 13% of the dataset consists of OOVs. In comparison, the UD_Italian-ISDT treebank4, tested against the same enhanced dictionary, contains only 6% of OOVs. For this comparison, the treebank was employed in its entirety, comprising its training, test, and development sets. The result reveals a peculiar dataset, containing a considerable amount of OOVs, which will require robust methods of analysis.

Before processing the data, we performed data cleaning through stopword removal and lemmatization.

Stopword removal covers prepositions, articles, and conjunctions. This operation is helpful in reducing the number of tokens to be processed, gaining in efficiency, while also excluding data without semantic content. It was performed using the Italian stopword list provided by the Natural Language Toolkit5 (NLTK).

After stopword removal, the number of tokens in the complete dataset is reduced to 972,019, with an average of 13.47 tokens per document. Since we use 35,000 rows, about half of the dataset, the number of tokens in our selection is 471,293.

Secondly, we performed lemmatization. The model employed is it_core_news_lg, provided by spaCy6, which comprises 500,000 vectors of 300 dimensions. Lemmatization is helpful in maintaining consistency throughout the whole dataset, as well as in improving text understanding and efficiency. The spaCy model has a reported lemmatization accuracy of 97%, which is a satisfactory performance7. However, we tested the model's performance on our dataset: we created a sample of 100 randomly selected documents, which were manually lemmatized to act as a gold standard, and compared the model's lemmas against it. The model's accuracy score was 79%, much lower than its reported accuracy. This underwhelming result further indicates how challenging the dataset is to analyze.

Before preprocessing, the TTR (Type/Token Ratio) was 0.0541; after these operations, the Lemma/Token Ratio stands at 0.0428. The lower score indicates that we managed to reduce dispersion, which helps improve the performance of the algorithms, since word forms that used to be distinct are now treated as the same. A sketch of this preprocessing pipeline follows.
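The following is a rough sketch of that pipeline (our illustration, not the authors' released code; it assumes NLTK's Italian stopword list and spaCy's it_core_news_lg model have been downloaded):

```python
import spacy
from nltk.corpus import stopwords

# Assumes `nltk.download("stopwords")` and
# `python -m spacy download it_core_news_lg` have been run.
nlp = spacy.load("it_core_news_lg")
italian_stopwords = set(stopwords.words("italian"))

def preprocess(review: str) -> list[str]:
    """Tokenize one review, drop Italian stopwords, and lemmatize."""
    return [
        token.lemma_.lower()
        for token in nlp(review.lower())
        if token.is_alpha and token.text not in italian_stopwords
    ]

def dispersion(token_lists: list[list[str]]) -> float:
    """Type/Token (or Lemma/Token) Ratio over the whole corpus."""
    tokens = [t for doc in token_lists for t in doc]
    return len(set(tokens)) / len(tokens)
```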
1. https://github.com/sigmasaur/AnagramSolver/blob/main/dictionary.txt
2. https://gist.github.com/pdesterlich/2562329
3. https://github.com/PaoloSarti/lista_cognomi_italiani/blob/master/cognomi.txt
4. https://github.com/UniversalDependencies/UD_Italian-ISDT
5. https://www.nltk.org/
6. https://spacy.io/
7. https://spacy.io/models/it
4. Processing

We have chosen three models for our analysis: LDA, ETM, and ProdLDA. These models were selected due to their different natures: the first is generative, the second is embedding-based, and the third is neural-network-based.

LDA assumes that each document is a mixture of topics and that each topic is a distribution over words. It uses Dirichlet priors to model the distribution of topics within documents and of words within topics.

ETM represents words as vectors in a continuous space (word embeddings) and models topics as distributions over these embeddings, enabling it to capture more semantic relationships between words than traditional models like LDA.

ProdLDA is a neural-network-based variant of LDA that uses a variational autoencoder (VAE) framework. ProdLDA models the document-topic and topic-word distributions using neural networks, and it represents a "product of experts" model, focusing on improving topic coherence and overcoming the limitations of LDA. The contrast between the two word-generating assumptions is sketched below.
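In our notation (a summary following [3] and [6], not formulas from the original paper), with document-topic proportions θ_d and topic-word weights β, the two models generate a word w as:

```latex
% LDA: a mixture of experts; each row \beta_k is a normalized
% distribution over the vocabulary.
p(w \mid \theta_d, \beta) = \sum_{k=1}^{K} \theta_{d,k}\, \beta_{k,w}

% ProdLDA: a weighted product of experts; \beta is unnormalized and
% the product is renormalized by a softmax over the vocabulary.
p(w \mid \theta_d, \beta) = \mathrm{softmax}\!\left(\theta_d^{\top} \beta\right)_{w}
```

Replacing the mixture with a product lets individual topics act as soft constraints that sharpen one another, which is the intuition behind ProdLDA's more coherent topics.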
The tool used for optimizing, training, and comparing these models is the OCTIS (Optimizing and Comparing Topic models Is Simple!) library, developed by [11]. It allows users to compare the performance of various models with respect to different metrics, like Topic Diversity and Coherence Score.

Before training, a fundamental step is hyperparameter optimization, which controls the behavior of the algorithm and, therefore, its performance.

OCTIS makes it possible to perform Multi-Objective Bayesian Optimization [12], a method that searches for the best hyperparameter configuration while considering several evaluation metrics at once. In particular, we employ the following evaluation metrics (a minimal implementation of the last one is sketched after the list):

• the Coherence Score, measuring how interpretable the topics are [13];
• NPMI (Normalized Pointwise Mutual Information), measuring the statistical similarity of the words inside a topic [14];
• Topic Diversity, measuring how different topics are from one another [15].
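A minimal Topic Diversity implementation, following the usual definition (the fraction of unique words among the top-k words of all topics; our sketch, not OCTIS code — OCTIS ships an equivalent metric, TopicDiversity, in its diversity_metrics module):

```python
def topic_diversity(topics: list[list[str]], topk: int = 10) -> float:
    """Fraction of unique words among the top-k words of every topic.

    Returns 1.0 when no word is shared between topics, and values
    close to 0.0 when the topics are highly redundant.
    """
    unique_words = {word for topic in topics for word in topic[:topk]}
    return len(unique_words) / (topk * len(topics))
```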
However, certain limitations need to be considered. In particular, the hardware employed was incapable of handling such computational effort; and, since the data is protected by privacy laws, using another, more powerful machine is out of the question.

To overcome this problem, we relied on SOBO (Single-Objective Bayesian Optimization) [16], which finds the best hyperparameter configuration with respect to a single metric. In particular, we chose the Coherence Score as the target evaluation metric: since it measures semantic coherence, it can be considered a good indicator of topic quality. SOBO works by training the model n times, each time with different hyperparameters; the output of the process is the configuration that provides the best result (see the sketch after the following list).

Algorithms were optimized and trained in four different configurations:

• without the enhancement of word embeddings;
• enhanced by 1-gram Word2Vec [17] embeddings;
• enhanced by 2-gram Word2Vec embeddings;
• enhanced by pre-trained embeddings.
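A minimal sketch of this optimization loop with OCTIS, assuming the preprocessed corpus has been exported to OCTIS's custom-dataset folder format and using the ProdLDA search space of Table 3 (parameter names follow the OCTIS documentation; the values of number_of_call and model_runs are placeholders, since the paper does not report them):

```python
from octis.dataset.dataset import Dataset
from octis.models.ProdLDA import ProdLDA
from octis.evaluation_metrics.coherence_metrics import Coherence
from octis.optimization.optimizer import Optimizer
from skopt.space.space import Categorical, Integer, Real

# Load the preprocessed corpus (hypothetical folder name).
dataset = Dataset()
dataset.load_custom_dataset_from_folder("aml_reviews")

# Search space mirroring the ProdLDA rows of Table 3.
search_space = {
    "num_topics": Integer(2, 50),
    "dropout": Real(0.0, 0.95),
    "num_neurons": Categorical([100, 200, 300]),
    "num_layers": Categorical([1, 2, 3]),
    "activation": Categorical(["softplus", "relu", "sigmoid"]),
}

# Single-objective target: coherence of the produced topics.
coherence = Coherence(texts=dataset.get_corpus())

optimizer = Optimizer()
result = optimizer.optimize(
    ProdLDA(), dataset, coherence, search_space,
    number_of_call=30,   # n optimization iterations (placeholder)
    model_runs=5,        # trainings per configuration (placeholder)
    save_path="results/prodlda",
)
result.save_to_csv("prodlda_runs.csv")
```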
The Word2Vec embeddings are created from our dataset; Table 4 shows the composition of these word embeddings, and a training sketch is given below. We can check the quality of the created embeddings by employing the Bokeh library8, which allows us to perform interactive visualization, creating a representation of the vector space that can be easily examined. As we can see in Figure 1, the word embeddings produce a plot where the different semantic fields are clearly divided and distinct from one another.

The pre-trained embeddings, instead, are trained on Common Crawl and Wikipedia9. Their composition can be seen in Table 5.

8. https://bokeh.org/
9. https://fasttext.cc/docs/en/crawl-vectors.html
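A sketch of the training call for the 1-gram embeddings with the parameters of Table 4; gensim is our assumption (the paper does not name the training library), and tokenized_reviews stands for the preprocessed corpus, one list of lemmas per review:

```python
from gensim.models import Word2Vec
from gensim.models.phrases import Phrases, Phraser

# 1-gram embeddings with the parameters of Table 4.
w2v = Word2Vec(
    sentences=tokenized_reviews,  # assumed: list of lemma lists
    min_count=20,                 # drop lemmas seen fewer than 20 times
    window=5,                     # context window
    vector_size=200,              # embedding dimensionality
    min_alpha=0.0007,             # final learning rate
    negative=20,                  # negative samples per update
    workers=6,                    # training threads
)

# For the 2-gram variant, frequent collocations can first be merged
# into single tokens (our assumption; the bigram tool is not specified).
bigram = Phraser(Phrases(tokenized_reviews, min_count=20))
w2v_2grams = Word2Vec(sentences=bigram[tokenized_reviews],
                      min_count=20, window=5, vector_size=200,
                      min_alpha=0.0007, negative=20, workers=6)
```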
Model | Hyperparameter | Values / [Range]
LDA | Num. of topics | [2, 50]
LDA | α | [0.001, 5]
LDA | β | [0.001, 5]
ProdLDA | Num. of topics | [2, 50]
ProdLDA | Dropout | [0, 0.95]
ProdLDA | Num. of neurons | 100, 200, 300
ProdLDA | Num. of layers | 1, 2, 3
ProdLDA | Activation function | softplus, relu, sigmoid
ETM | Num. of topics | [2, 50]
ETM | Dropout | [0, 0.95]
ETM | Hidden size | 100, 200, 300
ETM | Activation function | softplus, relu, sigmoid

Table 3: Hyperparameters and values
Figure 1: Vectorial distribution
5. Results and discussion

In Table 6 we report the average scores of the evaluation metrics for each model run, with and without the enhancement of the aforementioned embeddings.

We can clearly see that ProdLDA provided the best performance across all runs. In particular, the dataset enhanced by 1-gram embeddings yielded the best overall performance, with an average score of 0.564. Much worse is the performance of both LDA and ETM, which failed to create distinct and interpretable topics. In the remainder of this section, in Table 7 we show some of the topics created by 1-gram ProdLDA, together with examples of the most relevant associated words.

The topics of 1-gram ProdLDA were examined by seven bank employees working in the auditing sector. They were asked how interpretable the topics were, and to give each topic a label indicating what it was about. The label chosen for each topic was the one most frequently assigned by the employees. Out of the 12 topics created, only one was considered non-interpretable, confirming the excellent performance provided by ProdLDA. However, this non-interpretable topic was also the most frequent, as shown in Figure 2.

We can clearly see the even distribution of the documents associated with each topic. The most frequent topic, labeled "X", is the aforementioned non-interpretable topic, containing miscellaneous or difficult-to-categorize documents. Most of the topics refer to specific client activities, like bank transfers, payments, or activities related to the bank account.

There are also some more specific topics. An entire topic is dedicated to tobacconists and gambling. This kind of activity typically makes wide use of cash, which can potentially be tied to money laundering schemes. This level of specificity in auditing could indicate either regulatory requirements for these sectors or the bank's recognition of unique risks associated with these business types.

There is also a specific topic for suspicious activities involving foreign countries or carried out by foreign users. Dealing with cross-border regulations on transfers can be difficult for the bank, suggesting that particular effort should be put into developing efficient strategies for auditing cross-border activities.

Using 2-gram word embeddings was the best option for both LDA and ETM. However, for ProdLDA, 1-gram word embeddings provided slightly better performance. Nonetheless, 2-grams were generally the better option, especially considering the sharp difference in ETM. On the other hand, enhancing the dataset with pre-trained embeddings did not have a significant impact: the performance improvement of LDA was minimal, while for ETM and ProdLDA it actually lowered the outcome.

Figure 2: Topic distribution
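The per-topic document counts behind Figure 2 and the top words of Table 7 can be read off a trained OCTIS model's output dictionary (a sketch under the same assumptions as above; best_params is a placeholder for the configuration found by SOBO):

```python
import numpy as np
from octis.models.ProdLDA import ProdLDA

# Train once with the best configuration found by SOBO.
output = ProdLDA(**best_params).train_model(dataset)

top_words = output["topics"]                       # top words per topic
doc_topic = np.array(output["topic-document-matrix"])

# Assign each document to its most probable topic, as in Figure 2.
assignments = doc_topic.argmax(axis=0)
counts = np.bincount(assignments, minlength=doc_topic.shape[0])
for k, (words, n_docs) in enumerate(zip(top_words, counts)):
    print(f"Topic {k} ({n_docs} docs): {', '.join(words[:5])}")
```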
Parameter | Value
min_count | 20
window | 5
vector_size | 200
min_alpha | 0.0007
number of negative samples | 20
workers | 6

Table 4: Word2Vec embeddings model parameters

Parameter | Value
Character n-grams | 5
window | 5
vector_size | 300
number of negative samples | 10

Table 5: Pre-trained embeddings model parameters

Model | None | 1-gram | 2-gram | Pre-trained | Total avg
LDA | 0.384 | 0.397 | 0.410 | 0.390 | 0.395
ETM | 0.424 | 0.354 | 0.455 | 0.416 | 0.412
ProdLDA | 0.552 | 0.564 | 0.552 | 0.535 | 0.550

Table 6: Average of the metrics' scores, by embedding enhancement

Label | Top words
Tobacconists and gambling | tabaccheria ('tobacco shop'), bar, lottomatica (a lottery operator), tabacchi ('tobacco products'), servizi ('services')
Foreign activities | origine ('origin'), egitto ('Egypt'), periodo ('period'), tunisia ('Tunisia'), vacanza ('holiday')
Family ties | cointestato ('jointly held'), successione ('inheritance'), moglie ('wife'), fratello ('brother'), marito ('husband')

Table 7: ProdLDA topics

6. Conclusions and future work

NLP is now an essential component of the banking sector, and any company that wants to be competitive should make use of linguistic data science. In particular, in this paper we presented an NLP task, topic modeling, and showed how it can be implemented in the daily job of bank employees, in order to perform more detailed investigations. Topic modeling can be a key component in the understanding and identification of money laundering schemes, as it allows auditors to perform more in-depth and focused analyses. For example, auditors could investigate patterns from recent years, in order to better understand whether an activity is part of a larger trend or is an anomaly that deserves attention.
After citing other implementations of topic modeling in banking, we described the data employed and its preprocessing, consisting of stopword removal and lemmatization. Examples were provided, showing the peculiarities of the documents in the dataset. Then, the data was processed using three algorithms: LDA, ETM, and ProdLDA. These algorithms were evaluated using three metrics: Coherence Score, NPMI score, and Topic Diversity. The optimal hyperparameters were found using SOBO. Optimization and processing were performed using four different configurations: without additional word embeddings, enhanced by 1-gram word embeddings created from our dataset, enhanced by 2-gram word embeddings created from our dataset, and enhanced by pre-trained word embeddings. The results show that ProdLDA's performance was far superior to that of its competitors, especially when employing 1-gram Word2Vec embeddings. The algorithm produced distinct and interpretable topics, which can provide great insight into the data.

This experiment also has large potential to be expanded. In particular, future work could employ a more computationally capable machine, in order to make use of the whole dataset and to perform MOBO, obtaining more precise hyperparameters. Finally, it is also possible to perform the same analysis on different kinds of data, in order to observe more clearly the differences and similarities between one kind of linguistic data and another. There are also new techniques that could have a great impact on this research, such as LLMs, attention-based topic modeling, and contrastive topic modeling.

References

[1] C. Nopp, A. Hanbury, Detecting risks in the banking system by sentiment analysis, in: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 2015, pp. 591–600.
[2] I. Raicu, N. Boitout, R. Bologa, M. G. Sturza, Word embeddings in Romanian for the retail banking domain, Bucharest University of Economic Studies (2020).
[3] D. M. Blei, A. Y. Ng, M. I. Jordan, Latent Dirichlet allocation, Journal of Machine Learning Research 3 (2003) 993–1022.
[4] X.-Y. Jing, D. Zhang, Y.-Y. Tang, An improved LDA approach, IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 34 (2004) 1942–1951.
[5] A. B. Dieng, F. J. Ruiz, D. M. Blei, The dynamic embedded topic model, arXiv preprint arXiv:1907.05545 (2019).
[6] A. Srivastava, C. Sutton, Autoencoding variational inference for topic models, arXiv preprint arXiv:1703.01488 (2017).
[7] X. Wu, T. Nguyen, A. T. Luu, A survey on neural topic models: methods, applications, and challenges, Artificial Intelligence Review 57 (2024) 18.
[8] M. Soltani, A. Kythreotis, A. Roshanpoor, Two decades of financial statement fraud detection literature review; combination of bibliometric analysis and topic modeling approach, Journal of Financial Crime 30 (2023) 1367–1388.
[9] N. C. Brown, R. M. Crowley, W. B. Elliott, What are you saying? Using topic to detect financial misreporting, Journal of Accounting Research 58 (2020) 237–291.
[10] J.-C. Yen, T. Wang, A topic modeling-based review of digital transformation literature in accounting, in: Digital Transformation in Accounting and Auditing, Springer, 2024, pp. 105–118.
[11] S. Terragni, E. Fersini, B. G. Galuzzi, P. Tropeano, A. Candelieri, OCTIS: Comparing and optimizing topic models is simple!, in: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, 2021, pp. 263–270.
[12] S. Terragni, E. Fersini, M. Passarotti, V. Patti, OCTIS 2.0: Optimizing and comparing topic models in Italian is even simpler!, in: CLiC-it, 2021.
[13] S. Syed, M. Spruit, Full-text or abstract? Examining topic coherence scores using latent Dirichlet allocation, in: 2017 IEEE International Conference on Data Science and Advanced Analytics (DSAA), IEEE, 2017, pp. 165–174.
[14] S. M. Watford, R. G. Grashow, Y. Vanessa, R. A. Rudel, K. P. Friedman, M. T. Martin, Novel application of normalized pointwise mutual information (NPMI) to mine biomedical literature for gene sets associated with disease: Use case in breast carcinogenesis, Computational Toxicology 7 (2018) 46–57.
[15] Y. Wu, X. Wang, W. Zhao, X. Lv, A novel topic clustering algorithm based on graph neural network for question topic diversity, Information Sciences 629 (2023) 685–702.
[16] P. Feliot, J. Bect, E. Vazquez, A Bayesian approach to constrained single- and multi-objective optimization, Journal of Global Optimization 67 (2017) 97–133.
[17] T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient estimation of word representations in vector space, arXiv preprint arXiv:1301.3781 (2013).