<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Topic Modeling for Auditing Purposes in the Banking Sector</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Alessandro</forename><surname>Giaconia</surname></persName>
							<email>alessandro.giaconia01@icatt.it</email>
							<affiliation key="aff0">
								<orgName type="department">CIRCSE Research Centre</orgName>
								<orgName type="institution">Università Cattolica del Sacro Cuore</orgName>
								<address>
									<addrLine>Largo Gemelli 1</addrLine>
									<postCode>20123</postCode>
									<settlement>Milano</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Valeria</forename><surname>Chiariello</surname></persName>
							<email>vchiariello@credem.it</email>
							<affiliation key="aff1">
								<orgName type="department">CREDEM</orgName>
								<address>
									<addrLine>Via Emilia San Pietro 4</addrLine>
									<postCode>42121</postCode>
									<settlement>Reggio Emilia</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Sara</forename><surname>Giannuzzi</surname></persName>
							<email>sgiannuzzi@credem.it</email>
							<affiliation key="aff1">
								<orgName type="department">CREDEM</orgName>
								<address>
									<addrLine>Via Emilia San Pietro 4</addrLine>
									<postCode>42121</postCode>
									<settlement>Reggio Emilia</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Marco</forename><surname>Passarotti</surname></persName>
							<email>marco.passarotti@unicatt.it</email>
							<affiliation key="aff0">
								<orgName type="department">CIRCSE Research Centre</orgName>
								<orgName type="institution">Università Cattolica del Sacro Cuore</orgName>
								<address>
									<addrLine>Largo Gemelli 1</addrLine>
									<postCode>20123</postCode>
									<settlement>Milano</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Topic Modeling for Auditing Purposes in the Banking Sector</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">0DB215CCD6AD95DB3E1D1E4E2A97EFD6</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T17:32+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Topic modeling</term>
					<term>Auditing</term>
					<term>Banking sector</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>This study explores the application of topic modeling techniques for auditing purposes in the banking sector, focusing on the analysis of reviews of anti-money laundering alerts. We compare three topic modeling algorithms: Latent Dirichlet Allocation (LDA), Embedded Topic Model (ETM), and Product of Experts LDA (ProdLDA), using a dataset of 35,000 suspicious activity reports from an Italian bank. The models were evaluated using the coherence score, NPMI coherence, and topic diversity metrics. Our results show that ProdLDA consistently outperformed LDA and ETM, with the best performance achieved using 1-gram word embeddings. The study reveals distinct topics related to specific client activities, cross-border transactions, and high-risk business sectors, like gambling. These results demonstrate the potential of advanced topic modeling techniques in enhancing the efficiency and effectiveness of auditing processes in the banking sector, particularly in the analysis of activities that could be tied to money laundering and terrorism.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>There has always been a close connection between banks and the collection of different kinds of empirical data: banks, just like any other company, have always poured large amounts of resources into understanding numbers and how to deal with them. Numerical data, being closely related to the financial performance of companies, has always taken the spotlight.</p><p>Linguistic data, on the other hand, has received far less attention, due to the difficulty of its analysis and its underwhelming performance.</p><p>But things are changing. More and more companies are recognizing the value of language, which contains information that no number can convey. Natural Language Processing (NLP) tasks, language resources, and computational linguistics practices, such as sentiment analysis <ref type="bibr" target="#b0">[1]</ref> and word embeddings <ref type="bibr" target="#b1">[2]</ref>, have now become a staple in many organizations.</p><p>In fact, there is a wide variety of linguistic data that banks can exploit: emails, bank transfer descriptions, internal communications, and customer feedback. Some peculiar issues arise when dealing with linguistic data in the banking sector, like the use of acronyms, abbreviations, and technical terminology. These data are often proprietary, meaning that the bank owns them and access is forbidden to external parties. While the quantity of information they contain is massive, the impossibility of sharing it with other banks hinders a more global analysis.</p><p>In this context, this paper explores the application of topic modeling techniques to the auditing process, in particular the analysis of reviews of anti-money laundering (AML) alerts. Topic modeling can, in fact, be an incredibly helpful tool for auditors who want to perform an in-depth analysis of large amounts of data.</p><p>We first present an overview of topic modeling algorithms and their applications in the banking sector, both in scientific research and in concrete applications within banks. We then provide a comprehensive description of the data employed, followed by the preprocessing operations, the processing itself, and a discussion of the results.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related work</head><p>Topic modeling is an unsupervised NLP task, consisting in the extraction of latent themes from a given corpus. Latent Dirichlet Allocation, or LDA <ref type="bibr" target="#b2">[3]</ref>, is a probabilistic generative model, which became the most widely used and expanded-upon topic model. However, LDA faces several limitations, like scalability, low performance on large datasets, and difficulty handling polysemy and homonymy <ref type="bibr" target="#b3">[4]</ref>.</p><p>To overcome the limitations of LDA, much effort has been put into developing models that rely on word embeddings and neural networks, like ETM <ref type="bibr" target="#b4">[5]</ref> and ProdLDA <ref type="bibr" target="#b5">[6]</ref>. These models have been shown to provide better performance than LDA, at the cost of a higher computational effort <ref type="bibr" target="#b6">[7]</ref>.</p><p>In the last decade, topic modeling has already been largely employed in the banking sector, including in auditing. <ref type="bibr" target="#b7">[8]</ref> focused on the assessment and handling of frauds, while <ref type="bibr" target="#b8">[9]</ref> analyzed financial misreporting. Another popular subject of analysis is accounting (for example <ref type="bibr" target="#b9">[10]</ref>).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Data</head><p>The data employed is a collection of reviews of anti-money laundering alerts that are automatically detected by a rule-based detection tool, whose name cannot be disclosed due to a specific request. This tool is widely employed across all Italian banks, and is aimed at tackling potential money laundering and terrorism financing schemes. It uses advanced algorithms to identify patterns that deviate from standard behavior.</p><p>An activity is considered suspicious whenever it exceeds certain risk thresholds. These activities are then reviewed by a human operator, who evaluates whether the movement is actually tied to illegal operations or not. If the operation is not considered dangerous, or if there is not enough evidence to decide whether the activity is actually a threat, the operator writes a brief review consisting of two sections. The first is a description of the analyzed activity; the second is either an explanation of why the activity was not considered dangerous, or a statement about the lack of evidence and the need to keep monitoring. This latter kind of review usually ends with expressions such as 'monitoriamo' and 'continuiamo a monitorare' ('we keep monitoring'). The dataset employed consists of such reviews.</p><p>In Table <ref type="table">1</ref> we provide two examples of documents, with their corresponding English translations. The English translations have been cleaned of abbreviations and spelling mistakes.</p><p>Due to hardware limitations, we worked with a selection of 35,000 documents, chosen randomly. The data is owned by Credem and is not publicly available, due to legal constraints. It is not possible to reveal the time period in which these documents were collected, nor the whole dataset size.</p><p>Each document contains an average of 20.94 tokens.</p><p>It is important to note that the documents feature an abundance of spelling errors, abbreviations, acronyms, and missing blank spaces between words. This is in part due to a 300-character limit. By comparing the tokens in the dataset with a dictionary of 4 million Italian words 1 , we obtain the results shown in Table <ref type="table">2</ref>:</p></div>
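The out-of-vocabulary check described above can be sketched as follows; the dictionary here is a tiny stand-in for the 4-million-word list cited in the paper, and the file path in the comment is illustrative:

```python
# Count tokens missing from a reference dictionary (OOVs)
def oov_stats(tokens, dictionary):
    """Return (total tokens, OOV count, OOV rate)."""
    oov = [t for t in tokens if t.lower() not in dictionary]
    return len(tokens), len(oov), len(oov) / len(tokens)

# In practice the dictionary would be loaded from the word list, e.g.:
#   with open("dictionary.txt", encoding="utf-8") as f:
#       dictionary = {line.strip().lower() for line in f}
dictionary = {"caseificio", "movimento", "coerente", "con", "tipo", "di"}
tokens = ["CASEIFICIO", "MOVIM", "COERENTE", "CON", "TIPO", "DI", "ATTIVITA"]
total, n_oov, rate = oov_stats(tokens, dictionary)
print(total, n_oov, round(rate, 2))  # 7 2 0.29: "MOVIM" and "ATTIVITA" are OOV
```

The same function, run against the enhanced dictionary (names, surnames, frequent acronyms), yields the roughly 13% OOV rate reported in Table 2.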
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 2</head><p>OOVs in the complete dataset</p><p>The dictionary<note place="foot" n="1">https://github.com/sigmasaur/AnagramSolver/blob/main/dictionary.txt</note> has been further enhanced in a data-driven approach, by including a list of Italian names<note place="foot" n="2">https://gist.github.com/pdesterlich/2562329</note> and surnames<note place="foot" n="3">https://github.com/PaoloSarti/lista_cognomi_italiani/blob/master/cognomi.txt</note>, and a list of the most frequent acronyms featured in the dataset, so that they are not incorrectly considered OOVs. In order to find the acronyms, we created a list of all OOVs in the dataset, in descending order of frequency. The 20 most frequent acronyms were added to the dictionary, such as PEP (Persona Politicamente Esposta, 'politically exposed person') and CC (Conto Corrente, 'current account').</p><p>The table shows that about 13% of the dataset is made of OOVs. In comparison, the UD_Italian-ISDT treebank<note place="foot" n="4">https://github.com/UniversalDependencies/UD_Italian-ISDT</note>, tested against the same enhanced dictionary, contains only 6% of OOVs. For this comparison, the treebank in its entirety has been employed, consisting of the training, test, and development sets.</p><p>The result shows a peculiar dataset, containing a considerable amount of OOVs, which will require robust methods of analysis.</p><p>Before processing the data, we performed data cleaning through stopwords removal and lemmatization.</p><p>The removed stopwords include prepositions, articles, and conjunctions. This operation is helpful in reducing the number of tokens to be processed, gaining in efficiency, while also excluding data without semantic content. It was performed using the stopwords removal tool for Italian provided by the Natural Language Toolkit<ref type="foot" target="#foot_0">5</ref> (NLTK).</p><p>After performing stopwords removal, the number of tokens in the complete dataset is reduced to 972,019, with an average of 13.47 tokens per document. Since we are using 35,000 rows, about half of the dataset, the number of tokens is 471,293.</p><p>Secondly, we performed lemmatization. The model employed is it_core_news_lg, provided by spaCy<ref type="foot" target="#foot_1">6</ref>, which comprises 500,000 300-dimensional vectors. Lemmatization is helpful in maintaining consistency throughout the whole dataset, as well as improving text understanding and efficiency. The spaCy model employed has a reported lemmatization accuracy of 97%, which is a satisfactory performance<ref type="foot" target="#foot_2">7</ref>. Nevertheless, we tested the model's performance on our dataset. We created a sample of 100 randomly selected documents, which were then manually lemmatized, acting as the gold standard. The model's lemmas were then compared to the gold standard. The model's accuracy score was 79%, which is much lower than its usual accuracy. This underwhelming result further indicates how challenging the dataset is to analyze.</p><p>Before preprocessing, the TTR (Type/Token Ratio) was 0.0541; after this operation, the Lemma/Token Ratio stands at 0.0428. The lower score indicates that we managed to reduce dispersion. Reducing dispersion is helpful in improving the performance of the algorithms, since word forms that used to be different are now treated as the same.</p></div>
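The preprocessing pipeline and the dispersion measurement above can be sketched as follows; the stopword list and lemma table here are tiny stand-ins for NLTK's Italian stopwords and the spaCy it_core_news_lg lemmatizer actually used:

```python
# Toy stand-ins for NLTK's Italian stopwords and a spaCy lemmatizer
STOPWORDS = {"di", "il", "la", "e", "con", "a"}
LEMMAS = {"monitoriamo": "monitorare", "continuiamo": "continuare",
          "incassi": "incasso", "pagamenti": "pagamento"}

def preprocess(tokens):
    """Remove stopwords, then map each remaining form to its lemma."""
    kept = [t.lower() for t in tokens if t.lower() not in STOPWORDS]
    return [LEMMAS.get(t, t) for t in kept]

def ratio(units, tokens):
    """Type/Token (or Lemma/Token) ratio: distinct units over total tokens."""
    return len(set(units)) / len(tokens)

doc = ["monitoriamo", "e", "continuiamo", "a", "monitorare", "incassi", "pagamenti"]
lemmas = preprocess(doc)
print(lemmas)  # stopwords removed, inflected forms conflated into lemmas
print(round(ratio(lemmas, doc), 2))
```

Conflating 'monitoriamo' and 'monitorare' into a single lemma is exactly the kind of dispersion reduction that lowered the ratio from 0.0541 to 0.0428 on the real dataset.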
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Processing</head><p>We have chosen three models for our analysis: LDA, ETM, and ProdLDA. These models were selected due to their different natures: the first is generative, the second is embedding-based, and the third is neural-network-based.</p><p>LDA assumes that each document is a mixture of topics and that each topic is a distribution over words. It uses Dirichlet priors to model the distribution of topics within documents and of words within topics.</p><p>ETM represents words as vectors in a continuous space (word embeddings) and models topics as distributions over these embeddings, enabling it to capture more semantic relationships between words than traditional models like LDA.</p><p>ProdLDA is a neural-network-based variant of LDA that uses a variational autoencoder (VAE) framework. ProdLDA models document-topic and topic-word distributions using neural networks, and it represents a "product of experts" model, focusing on improving topic coherence and overcoming the limitations of LDA.</p><p>The tool used for optimizing, training, and comparing these models is the OCTIS (Optimizing and Comparing Topic Models is Simple!) library, developed by <ref type="bibr" target="#b10">[11]</ref>. It allows users to compare the performance of various models with respect to different metrics, like Topic Diversity and the Coherence Score.</p><p>Before training, a fundamental step is hyperparameter optimization, since hyperparameters control the behavior of the algorithm and, therefore, its performance.</p><p>OCTIS allows users to perform Multi-Objective Bayesian Optimization <ref type="bibr" target="#b11">[12]</ref>, a method that searches for the best hyperparameter configuration considering several evaluation metrics at once; in particular, the evaluation metrics we employ are:</p><p>• the Coherence Score, measuring how interpretable the topics are <ref type="bibr" target="#b12">[13]</ref>; • the NPMI (Normalized Pointwise Mutual Information), measuring the statistical similarity of words inside a topic <ref type="bibr" target="#b13">[14]</ref>; • Topic Diversity, measuring how different topics are from one another <ref type="bibr" target="#b14">[15]</ref>.</p><p>However, certain limitations need to be considered. In particular, the hardware employed was incapable of handling such computational effort; and, since the data is protected by privacy laws, using another, more powerful machine is out of the question.</p><p>To overcome this problem, we relied on SOBO (Single-Objective Bayesian Optimization) <ref type="bibr" target="#b15">[16]</ref>, which finds the best hyperparameter configuration with respect to only one metric. In particular, we chose the Coherence Score as the target evaluation metric, since it measures semantic coherence and can therefore be considered a good indicator of topic quality. SOBO works by training the model n times, each time with different hyperparameters. The output of this process is the configuration that provides the best result.</p><p>Algorithms were optimized and trained in four different configurations:</p><p>• without the enhancement of word embeddings; • enhanced by 1-gram Word2Vec <ref type="bibr" target="#b16">[17]</ref> embeddings; • enhanced by 2-gram Word2Vec embeddings; • enhanced by pre-trained embeddings.</p><p>The Word2Vec embeddings are created from our dataset. Table <ref type="table" target="#tab_3">4</ref> shows the composition of these word embeddings.</p><p>We can check the quality of the created embeddings by employing the Bokeh library<ref type="foot" target="#foot_3">8</ref>. Bokeh allows us to perform interactive visualization, creating a representation of the vector space that can be easily examined. As we can see in Figure <ref type="figure" target="#fig_1">1</ref>, the word embeddings produce a plot where the different semantic fields are nicely divided and distinct from one another.</p><p>The pre-trained embeddings, instead, are trained on Common Crawl and Wikipedia<ref type="foot" target="#foot_4">9</ref>. The pre-trained embeddings composition can be seen in Table <ref type="table">5</ref>.</p></div>
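Of the three evaluation metrics listed above, Topic Diversity <ref type="bibr" target="#b14">[15]</ref> is the simplest to state: the fraction of unique words among the top-k words of all topics. A minimal sketch, using the labeled topics of Table 7 as input:

```python
# Topic Diversity: fraction of unique words among the top-k words of all
# topics. 1.0 means no word is shared between topics; values near 0 mean
# the topics are largely redundant.
def topic_diversity(topics, topk=5):
    top_words = [w for topic in topics for w in topic[:topk]]
    return len(set(top_words)) / len(top_words)

# Top words of three of the ProdLDA topics shown in Table 7
topics = [
    ["tabaccheria", "bar", "lottomatica", "tabacchi", "servizi"],
    ["origine", "egitto", "periodo", "tunisia", "vacanza"],
    ["cointestato", "successione", "moglie", "fratello", "marito"],
]
print(topic_diversity(topics))  # 1.0: these three topics share no top word
```

The Coherence Score and NPMI additionally require corpus co-occurrence statistics, which is why OCTIS computes them from the training corpus rather than from the topic lists alone.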
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Results and discussion</head><p>Table <ref type="table" target="#tab_4">6</ref> reports the average scores of the evaluation metrics for each model run, either enhanced or not enhanced by the aforementioned embeddings.</p><p>We can clearly see that ProdLDA provided the best performance across all runs. In particular, the dataset enhanced by 1-gram embeddings yielded the best overall performance, with an average score of 0.564. Much worse is the performance of both LDA and ETM, which failed to create distinct and interpretable topics. In the remainder of this section, in Table <ref type="table" target="#tab_5">7</ref> we show some of the topics created by 1-gram ProdLDA, together with examples of the most relevant associated words.</p><p>The topics of 1-gram ProdLDA were examined by seven bank employees working in the auditing sector. They were asked how interpretable the topics were, and to give each topic a label indicating what it was about. The chosen label for each topic was the one most frequently assigned by the employees. Out of the 12 topics created, only one was considered non-interpretable, confirming the excellent performance provided by ProdLDA. However, this non-interpretable topic was also the most frequent, as shown in Figure <ref type="figure" target="#fig_2">2</ref>.</p><p>We can clearly see the even distribution of the documents associated with each topic. The most frequent topic, labeled as "X", is the aforementioned non-interpretable topic, containing miscellaneous or difficult-to-categorize documents. Most of the topics refer to specific clients' activities, like bank transfers, payments, or activities related to the bank account.</p><p>There are also some more specific topics. An entire topic is dedicated to tobacconists and gambling. This kind of activity typically makes wide use of cash, which can potentially be tied to money laundering schemes. This level of specificity in auditing could indicate either regulatory requirements for these sectors or the bank's recognition of unique risks associated with these business types.</p><p>There is also a specific topic for suspicious activities involving foreign countries or carried out by foreign customers. Dealing with cross-border regulations on transfers can be difficult for the bank, suggesting that particular effort should be put into developing efficient strategies for auditing cross-border activities.</p><p>Using 2-gram word embeddings was the best option for both LDA and ETM. However, for ProdLDA, 1-gram word embeddings provided a slightly better performance. Nonetheless, 2-grams were generally the better option, especially considering the sharp difference in ETM. On the other hand, enhancing the dataset with pre-trained embeddings did not result in a significant impact: the performance improvement of LDA was marginal, while ETM and ProdLDA even scored slightly lower than without embeddings.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Conclusions and future work</head><p>NLP is now an essential component of the banking sector, and any company that wants to be competitive should make use of linguistic data science. In particular, in this paper we presented an NLP task, topic modeling, and showed how it can be implemented in the daily job of bank employees, in order to perform more detailed investigations. Topic modeling can be a key component in the understanding and identification of money laundering schemes, as it allows auditors to perform more in-depth and focused analyses. For example, auditors could investigate patterns from recent years, in order to better understand whether an activity is part of a larger trend or an anomaly that deserves attention. After citing other implementations of topic modeling in banking, we described the data employed and its preprocessing, consisting in stopwords removal and lemmatization. Examples were provided, showing the peculiarities of the documents in the dataset. Then, the data was processed using three algorithms: LDA, ETM, and ProdLDA. These algorithms were evaluated using three metrics: coherence score, NPMI score, and topic diversity. The optimal hyperparameters were found using SOBO. Optimization and processing were performed using four different configurations: without additional word embeddings, enhanced by 1-gram word embeddings created from our dataset, enhanced by 2-gram word embeddings created from our dataset, and enhanced by pre-trained word embeddings. The results show that ProdLDA's performance was far superior to that of its competitors, especially when employing 1-gram Word2Vec embeddings. The algorithm produced distinct and interpretable topics, which can provide great insight into the data.</p><p>This experiment also has large potential for expansion. In particular, future work could employ a more powerful machine, in order to make use of the whole dataset, as well as perform MOBO to obtain more precise hyperparameters. Finally, it is also possible to perform the same analysis on different kinds of data, in order to observe more clearly the differences and similarities between one kind of linguistic data and another. There are also new techniques that could have a great impact on this research, such as LLMs, attention-based topic modeling, and contrastive topic modeling.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 5</head><p>Pre-trained embeddings model parameters</p></div>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Vectorial distribution</figDesc><graphic coords="4,184.82,240.75,225.64,192.46" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Topic distribution</figDesc><graphic coords="4,139.69,557.11,315.90,188.45" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1</head><label>1</label><figDesc>Examples of sentences from the dataset with translations</figDesc><table><row><cell>Italian: CASEIFICIO.MOVIM.COERENTE CON TIPO DI ATTIVITA'(ACCONTI A CONF.E PAGAM FORNITORI). IL CASEIF SI STA FONDENDO CON ALTRA LATTERIA, STA VENDENDO FORMAGGIO E SALDANDO I DEBITI.OK DOC REDD., OK ADEG.VERIF.NON SEGNALARE</cell></row><row><cell>English: Cheese factory. Consistent movement with type of activities (advance payments to contributors and payments to suppliers). The cheese factory is merging with another milk factory, it's selling cheese and settling debts. Income documentation is ok, adequate verification is ok. Do not report.</cell></row><row><cell>Italian: TRATTASI DI FRUTTA E VERDURA ATTIVO SULLA PIAZZA DI ***UNICO FRUTTA E VERDURA DELLA PIZZA. ATTIVO CC CHE RACC INCASSI E ADDEBRELATIVI ALL'ATTIVITA'.AL MOMENTO NO PART ANOMALIE. MONITORIAMO</cell></row><row><cell>English: Case of greengrocer active in the square of ***, only greengrocer in the square. Active bank account, that collects income and charges relative to the activity. No particular anomalies at the moment. We keep monitoring.</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 3</head><label>3</label><figDesc>Hyperparameters and values</figDesc><table /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 4</head><label>4</label><figDesc>Word2Vec embeddings model parameters</figDesc><table><row><cell>Parameter</cell><cell>Value</cell></row><row><cell>Character n-grams</cell><cell>5</cell></row><row><cell>window</cell><cell>5</cell></row><row><cell>vector_size</cell><cell>300</cell></row><row><cell>number of negative samples</cell><cell>10</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_4"><head>Table 6</head><label>6</label><figDesc>Average of the metrics' scores</figDesc><table><row><cell></cell><cell>None</cell><cell>1-gram</cell><cell>2-gram</cell><cell>Pre-trained</cell><cell>Total avg</cell></row><row><cell>LDA</cell><cell>0.384</cell><cell>0.397</cell><cell>0.410</cell><cell>0.390</cell><cell>0.395</cell></row><row><cell>ETM</cell><cell>0.424</cell><cell>0.354</cell><cell>0.455</cell><cell>0.416</cell><cell>0.412</cell></row><row><cell>ProdLDA</cell><cell>0.552</cell><cell>0.564</cell><cell>0.552</cell><cell>0.535</cell><cell>0.550</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_5"><head>Table 7</head><label>7</label><figDesc>ProdLDA topics</figDesc><table><row><cell>Label</cell><cell>Top words</cell></row><row><cell>Tobacconists and gambling</cell><cell>tabaccheria, bar, lottomatica, tabacchi, servizi</cell></row><row><cell>Foreign activities</cell><cell>origine, egitto, periodo, tunisia, vacanza</cell></row><row><cell>Family ties</cell><cell>cointestato, successione, moglie, fratello, marito</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_0">https://www.nltk.org/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_1">https://spacy.io/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="7" xml:id="foot_2">https://spacy.io/models/it</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="8" xml:id="foot_3">https://bokeh.org/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="9" xml:id="foot_4">https://fasttext.cc/docs/en/crawl-vectors.html</note>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Detecting risks in the banking system by sentiment analysis</title>
		<author>
			<persName><forename type="first">C</forename><surname>Nopp</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Hanbury</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2015 conference on empirical methods in natural language processing</title>
				<meeting>the 2015 conference on empirical methods in natural language processing</meeting>
		<imprint>
			<date type="published" when="2015">2015</date>
			<biblScope unit="page" from="591" to="600" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<author>
			<persName><forename type="first">I</forename><surname>Raicu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Boitout</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Bologa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">G</forename><surname>Sturza</surname></persName>
		</author>
		<title level="m">Word embeddings in romanian for the retail banking domain</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
		<respStmt>
			<orgName>Bucharest University of Economic Studies</orgName>
		</respStmt>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Latent dirichlet allocation</title>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">M</forename><surname>Blei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">Y</forename><surname>Ng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">I</forename><surname>Jordan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of machine Learning research</title>
		<imprint>
			<biblScope unit="volume">3</biblScope>
			<biblScope unit="page" from="993" to="1022" />
			<date type="published" when="2003">2003</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">An improved lda approach</title>
		<author>
			<persName><forename type="first">X.-Y</forename><surname>Jing</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y.-Y</forename><surname>Tang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics)</title>
		<imprint>
			<biblScope unit="volume">34</biblScope>
			<biblScope unit="page" from="1942" to="1951" />
			<date type="published" when="2004">2004</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">B</forename><surname>Dieng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><forename type="middle">J</forename><surname>Ruiz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">M</forename><surname>Blei</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1907.05545</idno>
		<title level="m">The dynamic embedded topic model</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><surname>Srivastava</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Sutton</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1703.01488</idno>
		<title level="m">Autoencoding variational inference for topic models</title>
				<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">A survey on neural topic models: methods, applications, and challenges</title>
		<author>
			<persName><forename type="first">X</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Nguyen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">T</forename><surname>Luu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Artificial Intelligence Review</title>
		<imprint>
			<biblScope unit="volume">57</biblScope>
			<biblScope unit="page">18</biblScope>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Two decades of financial statement fraud detection literature review; combination of bibliometric analysis and topic modeling approach</title>
		<author>
			<persName><forename type="first">M</forename><surname>Soltani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Kythreotis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Roshanpoor</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Financial Crime</title>
		<imprint>
			<biblScope unit="volume">30</biblScope>
			<biblScope unit="page" from="1367" to="1388" />
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">What are you saying? Using topic to detect financial misreporting</title>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">C</forename><surname>Brown</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">M</forename><surname>Crowley</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">B</forename><surname>Elliott</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Accounting Research</title>
		<imprint>
			<biblScope unit="volume">58</biblScope>
			<biblScope unit="page" from="237" to="291" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">A topic modeling-based review of digital transformation literature in accounting</title>
		<author>
			<persName><forename type="first">J.-C</forename><surname>Yen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Wang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Digital Transformation in Accounting and Auditing</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2024">2024</date>
			<biblScope unit="page" from="105" to="118" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">OCTIS: Comparing and optimizing topic models is simple!</title>
		<author>
			<persName><forename type="first">S</forename><surname>Terragni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Fersini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">G</forename><surname>Galuzzi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Tropeano</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Candelieri</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations</title>
				<meeting>the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations</meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="263" to="270" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">OCTIS 2.0: Optimizing and comparing topic models in Italian is even simpler!</title>
		<author>
			<persName><forename type="first">S</forename><surname>Terragni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Fersini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Passarotti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Patti</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">CLiC-it</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Full-text or abstract? Examining topic coherence scores using latent Dirichlet allocation</title>
		<author>
			<persName><forename type="first">S</forename><surname>Syed</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Spruit</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IEEE International Conference on Data Science and Advanced Analytics (DSAA)</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="165" to="174" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Novel application of normalized pointwise mutual information (NPMI) to mine biomedical literature for gene sets associated with disease: Use case in breast carcinogenesis</title>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">M</forename><surname>Watford</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">G</forename><surname>Grashow</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><forename type="middle">Y</forename><surname>De La Rosa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">A</forename><surname>Rudel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">P</forename><surname>Friedman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">T</forename><surname>Martin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Computational Toxicology</title>
		<imprint>
			<biblScope unit="volume">7</biblScope>
			<biblScope unit="page" from="46" to="57" />
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">A novel topic clustering algorithm based on graph neural network for question topic diversity</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Zhao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Lv</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Information Sciences</title>
		<imprint>
			<biblScope unit="volume">629</biblScope>
			<biblScope unit="page" from="685" to="702" />
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">A Bayesian approach to constrained single- and multi-objective optimization</title>
		<author>
			<persName><forename type="first">P</forename><surname>Feliot</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Bect</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Vazquez</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Global Optimization</title>
		<imprint>
			<biblScope unit="volume">67</biblScope>
			<biblScope unit="page" from="97" to="133" />
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<monogr>
		<author>
			<persName><forename type="first">T</forename><surname>Mikolov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Corrado</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Dean</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1301.3781</idno>
		<title level="m">Efficient estimation of word representations in vector space</title>
				<imprint>
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
