<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Topic Modeling for Auditing Purposes in the Banking Sector</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alessandro Giaconia</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Valeria Chiariello</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sara Giannuzzi</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marco Passarotti</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CIRCSE Research Centre, Università Cattolica del Sacro Cuore</institution>
          ,
          <addr-line>Largo Gemelli 1, 20123 Milano</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>CREDEM</institution>
          ,
          <addr-line>Via Emilia San Pietro 4, 42121 Reggio Emilia</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This study explores the application of topic modeling techniques for auditing purposes in the banking sector, focusing on the analysis of reviews of anti-money laundering alerts. We compare three topic modeling algorithms: Latent Dirichlet Allocation (LDA), Embedded Topic Model (ETM), and Product of Experts LDA (ProdLDA), using a dataset of 35,000 suspicious activity reports from an Italian bank. The models were evaluated using the coherence score, NPMI coherence, and topic diversity metrics. Our results show that ProdLDA consistently outperformed LDA and ETM, with the best performance achieved using 1-gram word embeddings. The study reveals distinct topics related to specific client activities, cross-border transactions, and high-risk business sectors, like gambling. These results demonstrate the potential of advanced topic modeling techniques in enhancing the efficiency and effectiveness of auditing processes in the banking sector, particularly in the analysis of activities that could be tied to money laundering and terrorism financing.</p>
      </abstract>
      <kwd-group>
        <kwd>Topic modeling</kwd>
        <kwd>Auditing</kwd>
        <kwd>Banking sector</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>There has always been a close connection between banks and the collection of different kinds of empirical data: banks, just like any other company, have always poured large amounts of resources into understanding numbers and how to deal with them. Numerical data, being closely related to the financial performance of companies, has always taken the spotlight.</p>
      <p>On the other hand, linguistic data has always been much less considered, due to the difficulties of analysis and underwhelming performance.</p>
      <p>But things are changing. More and more companies are understanding the value of language, which contains information that no number can convey. Different Natural Language Processing (NLP) tasks, language resources, and computational linguistics practices have now become a staple in many organizations, like sentiment analysis [<xref ref-type="bibr" rid="ref1">1</xref>] and word embeddings [<xref ref-type="bibr" rid="ref2">2</xref>].</p>
      <p>In fact, there is a wide variety of linguistic data that banks can exploit: emails, bank transfer descriptions, internal communications, and customer feedback. Some peculiar issues arise when dealing with linguistic data in the banking sector, like the usage of acronyms, abbreviations, and technical terminology. These data are often proprietary, meaning that the bank owns them and access is forbidden to externals. While the quantity of information they contain is massive, a downside is that the impossibility of sharing them with other banks hinders the possibility of a more global analysis.</p>
      <p>In this context, this paper explores the application of topic modeling techniques to the auditing process, in particular regarding the analysis of reviews of anti-money laundering (AML) alerts. Topic modeling can, in fact, be an incredibly helpful tool for auditors who want to perform an in-depth analysis of large amounts of data.</p>
      <p>An overview of topic modeling algorithms and applications in the banking sector, both documented in scientific research and in concrete applications within banks, will be presented. Then, we will provide a comprehensive description of the data employed, followed by the preprocessing operations. We will then present the results and their interpretation, leading us into the conclusions. Finally, we will present a number of suggestions for future work, which can expand this topic.</p>
    </sec>
    <sec id="sec-1-1">
      <title>2. Related work</title>
      <p>Topic modeling is an unsupervised NLP task, consisting in the extraction of latent themes from a given corpus. Latent Dirichlet Allocation, or LDA [<xref ref-type="bibr" rid="ref3">3</xref>], is a probabilistic generative model, which became the most widely used and expanded-upon topic model. However, LDA faces several limitations, like scalability, low performance on large datasets, and the struggle against polysemy and homonymy [<xref ref-type="bibr" rid="ref4">4</xref>].</p>
      <p>To overcome the limitations of LDA, a lot of effort has been put into developing models that rely on word embeddings and neural networks, like ETM [<xref ref-type="bibr" rid="ref5">5</xref>] and ProdLDA [<xref ref-type="bibr" rid="ref6">6</xref>]. These models have been proved to provide better performance than LDA, at the cost of a higher computational effort [<xref ref-type="bibr" rid="ref7">7</xref>].</p>
      <p>In the last decade, topic modeling has already been largely employed in the banking sector, and in auditing as well. [<xref ref-type="bibr" rid="ref8">8</xref>] focused on the assessment and handling of frauds, while [<xref ref-type="bibr" rid="ref9">9</xref>] analyzed financial misreporting. Another popular subject of analysis is accounting (for example [<xref ref-type="bibr" rid="ref10">10</xref>]).</p>
    </sec>
    <sec id="sec-1-3">
      <title>3. Data</title>
      <p>The data employed is a collection of reviews of anti-money laundering alerts that are automatically detected by a rule-based detection tool, whose name cannot be disclosed due to a specific request. This tool is widely employed across all Italian banks and is aimed at tackling potential money laundering and terrorism financing schemes. It uses advanced algorithms to identify patterns that deviate from standard behavior.</p>
      <p>An activity is considered suspicious whenever it exceeds certain risk thresholds. These activities are then reviewed by a human operator, who evaluates whether the movement is actually tied to illegal operations or not. If the operation is not considered dangerous, or if there is not enough evidence to decide whether the activity is actually a threat, the operator writes a brief review, consisting of two sections. The first one is a description of the analyzed activity. The second section is either an explanation of why the activity was not considered dangerous, or a statement about the lack of evidence. This kind of review usually ends with expressions such as 'monitoriamo' and 'continuiamo a monitorare' ('we keep monitoring'). The dataset employed consists of such reviews.</p>
      <p>In Table 1 we provide two examples of documents, with their corresponding English translation. The English translations have been cleaned of abbreviations and spelling mistakes.</p>
      <p>Italian: CASEIFICIO.MOVIM.COERENTE CON TIPO DI ATTIVITA'(ACCONTI A CONF.E PAGAM FORNITORI). IL CASEIF SI STA FONDENDO CON ALTRA LATTERIA, STA VENDENDO FORMAGGIO E SALDANDO I DEBITI.OK DOC REDD., OK ADEG.VERIF.NON SEGNALARE</p>
      <p>English: Cheese factory. Consistent movement with type of activities (advance payments to contributors and payments to suppliers). The cheese factory is merging with another milk factory; it is selling cheese and settling debts. Income documentation is ok, adequate verification is ok. Do not report.</p>
      <p>Italian: TRATTASI DI FRUTTA E VERDURA ATTIVO SULLA PIAZZA DI *** UNICO FRUTTA E VERDURA DELLA PIZZA. ATTIVO CC CHE RACC INCASSI E ADDEB RELATIVI ALL'ATTIVITA'. AL MOMENTO NO PART ANOMALIE. MONITORIAMO</p>
      <p>English: Case of a greengrocer active in the square of ***, the only greengrocer in the square. Active bank account, which collects income and charges related to the activity. No particular anomalies at the moment. We keep monitoring.</p>
      <p>Due to hardware limitations, we worked on a selection of 35,000 documents, chosen randomly. The data is owned by Credem and is not publicly available, due to legal constraints. It is not possible to reveal the time period in which these documents were collected, nor the whole dataset size.</p>
      <p>Each document has an average of 20.94 tokens.</p>
      <p>It is important to note that the documents feature an abundance of spelling errors, abbreviations, acronyms, and missing blank spaces between words. This is in part due to a 300-character limit. By comparing the tokens in the dataset with a dictionary of 4 million Italian words1, we obtain the results shown in Table 2.</p>
      <table-wrap id="tab2">
        <label>Table 2</label>
        <caption><p>OOVs in the complete dataset</p></caption>
        <table>
          <thead>
            <tr><th>Metric</th><th>Value</th></tr>
          </thead>
          <tbody>
            <tr><td>Total number of tokens</td><td>1,474,077</td></tr>
            <tr><td>Total number of Out Of Vocabulary tokens (OOV)</td><td>193,482</td></tr>
            <tr><td>Total number of OOV types</td><td>29,809</td></tr>
            <tr><td>Number of sentences containing 1+ OOVs</td><td>60,870</td></tr>
            <tr><td>Ratio of OOVs over the total number of tokens</td><td>0.1313</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>The dictionary has been further enhanced in a data-driven approach, by including a list of Italian names2 and surnames3, and a list of the most frequent acronyms featured in the dataset, so that they are not incorrectly considered OOVs. In order to find the acronyms, we created a list of all OOVs in the dataset, sorted by descending frequency. The 20 most frequent acronyms were added to the dictionary, such as PEP (Persona Politicamente Esposta, 'politically exposed person') and CC (Conto Corrente, 'bank account').</p>
      <p>The table shows that about 13% of the dataset consists of OOVs. In comparison, the UD_Italian-ISDT treebank4, tested against the same enhanced dictionary, contains only 6% of OOVs. For this comparison, the treebank in its entirety has been employed, consisting of the training, test, and development sets.</p>
      <p>The result shows a peculiar dataset, containing a considerable amount of OOVs, which will require robust methods of analysis.</p>
      <p>Before processing the data, we performed data cleaning through stopword removal and lemmatization.</p>
      <p>The removed stopwords include prepositions, articles, and conjunctions. This operation is helpful in reducing the number of tokens to be processed, gaining in efficiency, while also excluding data without semantic content. It was performed using the stopword removal tool for Italian provided by the Natural Language Toolkit5 (NLTK).</p>
      <p>After performing stopword removal, the number of tokens in the complete dataset is reduced to 972,019, with an average of 13.47 tokens per document. Since we are using 35,000 rows, about half of the dataset, the number of tokens is 471,293.</p>
      <p>Secondly, we performed lemmatization. The model employed is it_core_news_lg, provided by spaCy6, which contains 500,000 300-dimensional vectors. Lemmatization is helpful in maintaining consistency throughout the whole dataset, as well as improving text understanding and efficiency. The spaCy model employed has a lemmatization accuracy of 97%, which is a satisfactory performance7. Nevertheless, the model's performance on our dataset was tested: we created a sample of 100 randomly selected documents, which were manually lemmatized, acting as the gold standard. The model's lemmas were then compared to the gold standard. The model's accuracy score was 79%, which is much lower than its usual accuracy. This underwhelming result further indicates how challenging the dataset is to analyze.</p>
      <p>Before preprocessing, the TTR (Type/Token Ratio) was 0.0541; after this operation, the Lemma/Token Ratio is attested at 0.0428. The score is lower, indicating that we managed to reduce dispersion. Reducing dispersion is helpful in improving the performance of the algorithms, since word forms that used to be different are now considered to be the same.</p>
      <p>1: https://github.com/sigmasaur/AnagramSolver/blob/main/dictionary.txt
2: https://gist.github.com/pdesterlich/2562329
3: https://github.com/PaoloSarti/lista_cognomi_italiani/blob/master/cognomi.txt
4: https://github.com/UniversalDependencies/UD_Italian-ISDT</p>
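      <p>The dictionary comparison behind the OOV statistics can be sketched as follows; this is a minimal illustration on toy data, where the small word set below stands in for the 4-million-word dictionary enhanced with names, surnames, and acronyms:</p>

```python
# Sketch of the OOV statistics of Table 2, on toy data.
# `dictionary` stands in for the enhanced 4-million-word Italian word list.

def oov_stats(documents, dictionary):
    """Count tokens, OOV tokens, OOV types, and documents with 1+ OOVs."""
    total_tokens = 0
    oov_tokens = 0
    oov_types = set()
    docs_with_oov = 0
    for doc in documents:
        tokens = doc.lower().split()
        total_tokens += len(tokens)
        doc_oovs = [t for t in tokens if t not in dictionary]
        oov_tokens += len(doc_oovs)
        oov_types.update(doc_oovs)
        if doc_oovs:
            docs_with_oov += 1
    return {
        "tokens": total_tokens,
        "oov_tokens": oov_tokens,
        "oov_types": len(oov_types),
        "docs_with_oov": docs_with_oov,
        "oov_ratio": oov_tokens / total_tokens,
    }

dictionary = {"conto", "corrente", "attivo", "che", "incassi"}
docs = ["conto corrente attivo", "cc che racc incassi"]  # "cc", "racc" are OOV
stats = oov_stats(docs, dictionary)
```

      <p>On the real data, the same counts yield the 0.1313 OOV ratio reported in Table 2.</p>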
      <sec id="sec-1-2">
        <p>5: https://www.nltk.org/
6: https://spacy.io/
7: https://spacy.io/models/it</p>
      </sec>
    </sec>
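    <p>The preprocessing pipeline described above (stopword removal, then lemmatization, with the drop in the Type/Token Ratio as a dispersion check) can be sketched as follows; the inline stopword list and lemma table are toy stand-ins for NLTK's Italian stopword list and spaCy's it_core_news_lg lemmatizer:</p>

```python
# Sketch of the preprocessing step: stopword removal followed by
# lemmatization, tracking how the Type/Token Ratio drops.
# The inline stopword list and lemma table are toy stand-ins for
# NLTK's Italian stopwords and spaCy's it_core_news_lg lemmatizer.

STOPWORDS = {"il", "e", "i", "di", "la"}   # articles, conjunctions, prepositions
LEMMAS = {"vendendo": "vendere", "venduto": "vendere", "ha": "avere"}

def preprocess(tokens):
    kept = [t for t in tokens if t not in STOPWORDS]   # stopword removal
    return [LEMMAS.get(t, t) for t in kept]            # lemmatization

def type_token_ratio(tokens):
    return len(set(tokens)) / len(tokens)

raw = "il caseificio sta vendendo formaggio e ha venduto latte".split()
clean = preprocess(raw)
```

    <p>Mapping 'vendendo' and 'venduto' to the same lemma is what lowers the ratio: distinct word forms collapse into one type, reducing dispersion as described above.</p>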
    <sec id="sec-2">
      <title>4. Processing</title>
      <sec id="sec-2-1">
        <p>The Word2Vec embeddings are created from our dataset. Table 4 shows the composition of these word embeddings.</p>
        <p>We can check the quality of the created embeddings by employing the library Bokeh8. Bokeh allows us to perform interactive visualization, creating a representation of the vector space that can be easily examined. As we can see in Figure 1, the word embeddings create a plot where the different semantic fields are nicely divided and distinct from one another.</p>
        <p>The pre-trained embeddings, instead, are trained on
Common Crawl and Wikipedia9. The pre-trained embeddings
composition can be seen in Table 5.</p>
      </sec>
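      <p>Besides the visual inspection with Bokeh, the same property (semantic fields forming distinct regions) can be probed numerically via nearest neighbours; a minimal sketch, with toy vectors standing in for the trained Word2Vec embeddings:</p>

```python
import math

# Sketch of a non-visual sanity check on the embedding space:
# words from the same semantic field should be mutual nearest neighbours.
# The 3-d vectors below are toy stand-ins for the real Word2Vec vectors.

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

vectors = {
    "bonifico":  (0.9, 0.1, 0.0),   # transfer-related field
    "pagamento": (0.8, 0.2, 0.1),
    "formaggio": (0.0, 0.1, 0.9),   # food-related field
}

def nearest(word):
    return max((w for w in vectors if w != word),
               key=lambda w: cosine(vectors[word], vectors[w]))
```

      <p>If the space is well formed, 'bonifico' ('bank transfer') and 'pagamento' ('payment') pick each other as nearest neighbours, while 'formaggio' ('cheese') sits far from both.</p>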
      <sec id="sec-2-2">
        <p>We have chosen three models for our analysis: LDA, ETM, and ProdLDA. These models were selected due to their different natures: the first is generative, the second is embedding-based, and the third is neural-network-based.</p>
        <p>LDA assumes that each document is a mixture of topics and
that each topic is a distribution over words. It uses Dirichlet
priors to model the distribution of topics within documents
and words within topics.</p>
        <p>ETM represents words as vectors in a continuous space (word embeddings) and models topics as distributions over these embeddings, enabling it to capture more semantic relationships between words compared to traditional models like LDA.</p>
        <p>ProdLDA is a neural-network-based variant of LDA that uses a variational autoencoder (VAE) framework. ProdLDA models document-topic and topic-word distributions using neural networks, and it represents a "product of experts" model, focusing on improving topic coherence and overcoming the limitations of LDA.</p>
        <p>The tool used for optimizing, training, and comparing these models is the OCTIS (Optimizing and Comparing Topic Models is Simple!) library, developed by [<xref ref-type="bibr" rid="ref11">11</xref>]. It allows users to compare the performance of various models with respect to different metrics, like Topic Diversity and Coherence Score.</p>
        <p>Before training, a fundamental step is hyperparameter optimization, which controls the behavior of the algorithm and, therefore, its performance.</p>
        <p>OCTIS allows users to perform Multi-Objective Bayesian Optimization [12], a method that searches for the best hyperparameter configuration considering several evaluation metrics at once. In particular, the evaluation metrics we employ are:</p>
        <list list-type="bullet">
          <list-item><p>the Coherence Score, measuring how interpretable the topics are [13];</p></list-item>
          <list-item><p>the NPMI (Normalized Pointwise Mutual Information), measuring the statistical similarity of the words inside a topic [14];</p></list-item>
          <list-item><p>Topic Diversity, measuring how different topics are from one another [15].</p></list-item>
        </list>
        <p>However, certain limitations need to be considered. In particular, the hardware employed was incapable of handling such computational efforts; and, since the data is protected by privacy laws, using another, more powerful machine is out of the question.</p>
        <p>To overcome this problem, we relied on SOBO (Single-Objective Bayesian Optimization) [16], which finds the best hyperparameter configuration with respect to only one metric. In particular, we chose the Coherence Score as the target evaluation metric. This metric was chosen because it measures semantic coherence and can therefore be considered a good indicator of topic quality. SOBO works by training the model n times, each time with different hyperparameters. The output of this process is the configuration that provides the best result.</p>
        <p>Algorithms were optimized and trained in four different configurations:</p>
        <list list-type="bullet">
          <list-item><p>without the enhancement of word embeddings;</p></list-item>
          <list-item><p>enhanced by 1-gram Word2Vec [17] embeddings;</p></list-item>
          <list-item><p>enhanced by 2-gram Word2Vec embeddings;</p></list-item>
          <list-item><p>enhanced by pre-trained embeddings.</p></list-item>
        </list>
        <p>The hyperparameters optimized for each model are the following: for LDA, the number of topics; for ProdLDA, the number of topics, dropout, number of neurons, number of layers, and activation function; for ETM, the number of topics, dropout, hidden size, and activation function.</p>
        <p>8: https://bokeh.org/
9: https://fasttext.cc/docs/en/crawl-vectors.html</p>
      </sec>
    </sec>
    <sec id="sec-2-3">
      <title>5. Results and discussion</title>
      <p>In Table 6 we report the average scores of the evaluation metrics for each model run, either enhanced or not enhanced by the aforementioned embeddings. We can clearly see that ProdLDA provided the best performance across all runs. In particular, the dataset enhanced by 1-gram embeddings yielded the best overall performance, with an average score of 0.564. Much worse is the performance of both LDA and ETM, which failed at creating distinct and interpretable topics. In the remainder of this section, in Table 7 we show some of the topics created by 1-gram-ProdLDA, together with examples of the most relevant associated words.</p>
      <p>The topics of 1-gram-ProdLDA were examined by seven bank employees working in the auditing sector. They were asked how interpretable the topics were, and to give each topic a label indicating what it was about. The chosen label for each topic was the most frequent one assigned to that topic by the employees. Out of the 12 topics created, only one was considered non-interpretable, confirming the excellent performance provided by ProdLDA. However, this non-interpretable topic was also the most frequent, as shown in Figure 2.</p>
      <p>We can clearly see the even distribution of the documents associated with each topic. The most frequent topic, labeled as "X", is the aforementioned non-interpretable topic, containing miscellaneous or difficult-to-categorize documents. Most of the topics refer to specific clients' activities, like bank transfers, payments, or activities related to the bank account.</p>
      <p>There are also some more specific topics. An entire topic is dedicated to tobacconists and gambling. This kind of activity typically makes wide use of cash, which can potentially be tied to money laundering schemes. This level of specificity in auditing could indicate either regulatory requirements for these sectors or the bank's recognition of unique risks associated with these business types.</p>
      <p>There is also a specific topic for suspicious activities with foreign countries or carried out by foreign users. Dealing with cross-border regulations on transfers can be difficult for the bank, suggesting that particular effort should be put into developing efficient strategies for auditing cross-border activities.</p>
      <p>Using 2-gram word embeddings was the best option for both LDA and ETM. However, in ProdLDA, 1-gram word embeddings provided a slightly better performance. Nonetheless, 2-grams were generally the better option, especially considering the sharp difference in ETM. On the other hand, enhancing the dataset with pre-trained embeddings did not result in a significant impact.</p>
    </sec>
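    <p>The evaluation metrics used above have simple definitional cores; the following is a minimal sketch of Topic Diversity and NPMI on toy topics and documents (definitional versions, not OCTIS's exact implementations):</p>

```python
import math

# Sketch of two of the evaluation metrics (definitional versions, not
# OCTIS's exact implementations): topic diversity as the fraction of
# unique words among the top-k words of all topics, and NPMI from
# document-level co-occurrence probabilities.

def topic_diversity(topics):
    """Fraction of unique words across the top words of all topics."""
    all_words = [w for topic in topics for w in topic]
    return len(set(all_words)) / len(all_words)

def npmi(w1, w2, documents, eps=1e-12):
    """Normalized PMI of two words over a set of documents (word sets)."""
    n = len(documents)
    p1 = sum(w1 in d for d in documents) / n
    p2 = sum(w2 in d for d in documents) / n
    p12 = sum(w1 in d and w2 in d for d in documents) / n
    if p12 == 0:
        return -1.0   # never co-occur
    return math.log(p12 / (p1 * p2)) / -math.log(p12 + eps)

topics = [["bonifico", "pagamento"], ["tabacchi", "gioco"]]
docs = [{"bonifico", "pagamento"}, {"bonifico", "pagamento"}, {"tabacchi", "gioco"}]
```

    <p>A diversity of 1.0 means no word is shared between topics; NPMI close to 1 indicates words that almost always co-occur, while -1 indicates words that never do.</p>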
    <sec id="sec-3">
      <title>6. Conclusions and future work</title>
      <sec id="sec-3-1">
        <p>NLP is now an essential component of the banking sector, and any company that wants to be competitive should make use of linguistic data science. In this paper we presented an NLP task, topic modeling, and showed how it can be implemented in the daily job of bank employees, in order to perform more detailed investigations. In particular, topic modeling can be a key component in the understanding and identification of money laundering schemes, as it allows auditors to perform more in-depth and focused analyses. For example, auditors could investigate patterns from recent years, in order to better understand whether an activity is part of a larger trend or an anomaly that deserves attention.</p>
        <p>After citing other implementations of topic modeling in banking, we described the data employed and its preprocessing, consisting of stopword removal and lemmatization. Examples were provided, showing the peculiarities of the documents in the dataset. Then, the data was processed using three algorithms: LDA, ETM, and ProdLDA. These algorithms were evaluated using three metrics: coherence score, NPMI score, and topic diversity. The optimal hyperparameters were found using SOBO. Optimization and processing were performed using four different configurations: without additional word embeddings, enhanced by 1-gram word embeddings created from our dataset, enhanced by 2-gram word embeddings created from our dataset, and enhanced by pre-trained word embeddings. The results show that ProdLDA's performance was far superior to that of its competitors, especially when employing 1-gram Word2Vec embeddings. The algorithm produced distinct and interpretable topics, which can provide great insight into the data.</p>
        <p>This experiment also has great potential to be expanded. In particular, future work could employ a more powerful machine, in order to make use of the whole dataset, as well as to perform MOBO and obtain more precise hyperparameters. Finally, it is also possible to perform the same analysis on different kinds of data, in order to observe more clearly the differences and similarities between one kind of linguistic data and another. There are also new techniques that could have a great impact on this research, such as LLMs, attention-based topic modeling, and contrastive topic modeling.</p>
        <p>[12] S. Terragni, E. Fersini, M. Passarotti, V. Patti, OCTIS 2.0: Optimizing and comparing topic models in Italian is even simpler!, in: CLiC-it, 2021.
[13] S. Syed, M. Spruit, Full-text or abstract? Examining topic coherence scores using latent Dirichlet allocation, in: 2017 IEEE International Conference on Data Science and Advanced Analytics (DSAA), IEEE, 2017, pp. 165-174.
[14] S. M. Watford, R. G. Grashow, Y. Vanessa, R. A. Rudel, K. P. Friedman, M. T. Martin, Novel application of normalized pointwise mutual information (NPMI) to mine biomedical literature for gene sets associated with disease: Use case in breast carcinogenesis, Computational Toxicology 7 (2018) 46-57.
[15] Y. Wu, X. Wang, W. Zhao, X. Lv, A novel topic clustering algorithm based on graph neural network for question topic diversity, Information Sciences 629 (2023) 685-702.
[16] P. Feliot, J. Bect, E. Vazquez, A Bayesian approach to constrained single- and multi-objective optimization, Journal of Global Optimization 67 (2017) 97-133.
[17] T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient estimation of word representations in vector space, arXiv preprint arXiv:1301.3781 (2013).</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>C.</given-names>
            <surname>Nopp</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hanbury</surname>
          </string-name>
          ,
          <article-title>Detecting risks in the banking system by sentiment analysis</article-title>
          ,
          <source>in: Proceedings of the 2015 conference on empirical methods in natural language processing</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>591</fpage>
          -
          <lpage>600</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>I.</given-names>
            <surname>Raicu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Boitout</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Bologa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. G.</given-names>
            <surname>Sturza</surname>
          </string-name>
          ,
          <article-title>Word embeddings in romanian for the retail banking domain</article-title>
          , Bucharest University of Economic Studies (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>D. M.</given-names>
            <surname>Blei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Y.</given-names>
            <surname>Ng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. I.</given-names>
            <surname>Jordan</surname>
          </string-name>
          ,
          <article-title>Latent dirichlet allocation</article-title>
          ,
          <source>Journal of machine Learning research 3</source>
          (
          <year>2003</year>
          )
          <fpage>993</fpage>
          -
          <lpage>1022</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>X.-Y.</given-names>
            <surname>Jing</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.-Y.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <article-title>An improved lda approach</article-title>
          ,
          <source>IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics)</source>
          <volume>34</volume>
          (
          <year>2004</year>
          )
          <fpage>1942</fpage>
          -
          <lpage>1951</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A. B.</given-names>
            <surname>Dieng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. J.</given-names>
            <surname>Ruiz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. M.</given-names>
            <surname>Blei</surname>
          </string-name>
          ,
          <article-title>The dynamic embedded topic model</article-title>
          ,
          <source>arXiv preprint arXiv:1907.05545</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Srivastava</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Sutton</surname>
          </string-name>
          ,
          <article-title>Autoencoding variational inference for topic models</article-title>
          ,
          <source>arXiv preprint arXiv:1703.01488</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>X.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. T.</given-names>
            <surname>Luu</surname>
          </string-name>
          ,
          <article-title>A survey on neural topic models: methods, applications, and challenges</article-title>
          ,
          <source>Artificial Intelligence Review</source>
          <volume>57</volume>
          (
          <year>2024</year>
          )
          <fpage>18</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M.</given-names>
            <surname>Soltani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kythreotis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Roshanpoor</surname>
          </string-name>
          ,
          <article-title>Two decades of financial statement fraud detection literature review; combination of bibliometric analysis and topic modeling approach</article-title>
          ,
          <source>Journal of Financial Crime</source>
          <volume>30</volume>
          (
          <year>2023</year>
          )
          <fpage>1367</fpage>
          -
          <lpage>1388</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>N. C.</given-names>
            <surname>Brown</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. M.</given-names>
            <surname>Crowley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. B.</given-names>
            <surname>Elliott</surname>
          </string-name>
          ,
          <article-title>What are you saying? using topic to detect financial misreporting</article-title>
          ,
          <source>Journal of Accounting Research</source>
          <volume>58</volume>
          (
          <year>2020</year>
          )
          <fpage>237</fpage>
          -
          <lpage>291</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>J.-C.</given-names>
            <surname>Yen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>A topic modeling-based review of digital transformation literature in accounting</article-title>
          ,
          <source>in: Digital Transformation in Accounting and Auditing</source>
          , Springer,
          <year>2024</year>
          , pp.
          <fpage>105</fpage>
          -
          <lpage>118</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>S.</given-names>
            <surname>Terragni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Fersini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. G.</given-names>
            <surname>Galuzzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Tropeano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Candelieri</surname>
          </string-name>
          ,
          <article-title>Octis: Comparing and optimizing topic models is simple!</article-title>
          ,
          <source>in: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>263</fpage>
          -
          <lpage>270</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>