<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title/>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.1111/rsp3.12632</article-id>
      <title-group>
        <article-title>Probabilistic thematic modelling of Ukrainian-language texts based on the Latent Dirichlet Allocation algorithm</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Victoria Vysotska</string-name>
          <email>Victoria.A.Vysotska@lpnu.ua</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Denys Ptushkin</string-name>
          <email>denys.ptushkin.sa.2022@lpnu.ua</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rostyslav Fedchuk</string-name>
          <email>rostyslav.b.fedchuk@lpnu.ua</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Roman Lynnyk</string-name>
          <email>roman.o.lynnyk@lpnu.ua</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <addr-line>Vinnytsia</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Lviv Polytechnic National University</institution>
          ,
          <addr-line>Stepan Bandera 12, 79013 Lviv</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <volume>3722</volume>
      <issue>33</issue>
      <fpage>56</fpage>
      <lpage>75</lpage>
      <abstract>
        <p>The article presents the results of a study of methods for thematic modelling of texts using the Latent Dirichlet Allocation (LDA) algorithm on a Ukrainian-language corpus of documents. The proposed model automatically detects hidden topics in large volumes of unstructured text data without prior labelling. The model was implemented in Python using the Gensim and pyLDAvis libraries. Perplexity and coherence metrics were used to assess the quality of the model; they showed that the optimal number of topics depends on the characteristics of the corpus and on the hyperparameters α and β. The results demonstrate the suitability of the method for a wide range of applied tasks – analysis of user reviews, media analytics, classification of scientific publications, and monitoring of social networks. A comparative study with alternative approaches (K-means, NMF, BERTopic, transformer models) showed that LDA provides the best balance between interpretability, speed, and computational efficiency. The developed program module, the "Thematic Analysis Module", implements an automated system for thematic modelling that can be used both in scientific research and in analytical information systems.</p>
      </abstract>
      <kwd-group>
        <kwd>thematic modelling</kwd>
        <kwd>Latent Dirichlet Allocation</kwd>
        <kwd>LDA</kwd>
        <kwd>natural language processing</kwd>
        <kwd>machine learning</kwd>
        <kwd>probabilistic model</kwd>
        <kwd>coherence</kwd>
        <kwd>TF-IDF</kwd>
        <kwd>Gensim</kwd>
        <kwd>Ukrainian-language corpus of texts</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Today, humanity is in an information glut: a massive amount of text data is generated every day – news, scientific publications, messages in social networks, forums, blogs, and instant messengers. This information is often unstructured and challenging to subject to classical analysis, which creates the need for automated tools to classify, sort, filter, and understand it. One of the most effective modern methods of analysing such texts is thematic modelling. It automatically detects topics hidden in texts based on the probabilistic distribution of words. For example, without having to read thousands of product reviews, you can automatically discover that people most often talk about "price", "quality", "delivery", "packaging", etc. This approach is actively used in the following areas:</p>
      <p>Journalism and media analytics – to track information campaigns and trends in the media;</p>
      <p>Business and marketing – to analyse user reviews, surveys, and customer feedback;</p>
      <p>Science – classification of scientific publications by topic;</p>
      <p>Public administration – monitoring of public moods and thematic appeals of citizens;</p>
      <p>Education – automatic classification of educational materials.</p>
      <p> Thus, thematic modelling is one of the key natural language processing (NLP)
tools, allowing you to efficiently work with large text arrays without the need for manual
processing.</p>
      <p>The purpose of this work is the in-depth development of information technology for thematic modelling of texts, as well as the practical implementation of the thematic model on a specific corpus of Ukrainian-language documents using Python tools. During the work, it is planned to investigate how the pre-processing of texts and the choice of the number of topics, algorithms, and parameters affect the quality of the thematic model, as well as to analyse the practical results of modelling and the possibilities of their application in a real environment.</p>
      <p>To achieve the goal, it is necessary to solve the following tasks:</p>
      <p>Analyse the literature on thematic modelling (LDA, NMF, PLSA).</p>
      <p>Choose a corpus of texts for modelling (e.g., news, articles, forum posts).</p>
      <p>Clean up the data – remove HTML tags, numbers, punctuation, stop words.</p>
      <p>Perform lemmatisation or stemming (if necessary, in Ukrainian).</p>
      <p>Create a Bag-of-Words or TF-IDF matrix.</p>
      <p>Build LDA models with different numbers of topics.</p>
      <p>Visualise the results obtained.</p>
      <p>Analyse the interpretation of topics.</p>
      <p>Compare the quality of models by coherence.</p>
      <p>The object of research is the text corpus – a set of documents in natural language (in our case, Ukrainian). These can be news, social messages, reviews, scientific articles, product descriptions, etc. Such texts are unstructured, which makes it difficult to analyse them without preprocessing. That is why the object of research is interesting from the point of view of practical data processing. The subject of the study is algorithms and methods of thematic modelling, in particular:</p>
      <p>Latent Dirichlet Allocation (LDA);</p>
      <p>Non-Negative Matrix Factorisation (NMF);</p>
      <p>Probabilistic Latent Semantic Analysis (PLSA);</p>
      <p>TF-IDF and Bag-of-Words for text representation;</p>
      <p>Quality assessment metrics: coherence, perplexity.</p>
      <p>Although thematic modelling is a well-known technique, its application to Ukrainian-language texts has not yet been sufficiently researched. Most libraries and examples focus on English-language content. Therefore, the novelty of this work lies in:</p>
      <p>Implementation of thematic modelling specifically for the Ukrainian language;</p>
      <p>Comparison of models with different numbers of topics for a real case;</p>
      <p>Application of modern methods of pre-processing of Ukrainian-language texts (for example, through langdetect, pymorphy2-uk or Stanza);</p>
      <p>Visualisation of results and analysis of the correspondence of topics to the real content of documents.</p>
      <p>Also, the novelty lies in the application of coherence to automatically assess the quality of the model without human intervention. The developed model has a number of real-world applications:</p>
      <p>Information systems (filtering news, searching by topics, classification of documents).</p>
      <p>Education (automatic grouping of educational materials by topic).</p>
      <p>Marketing (classification of customer reviews by topic to identify pain points).</p>
      <p>Science (analysis of scientific publications and identification of new research trends).</p>
      <p>Security (monitoring social media to identify radical topics).</p>
      <p>Electronic democracy (analysis of citizens' appeals in petitions, complaints, and forums).</p>
      <p>The model is universal and can be adapted to any subject area containing large amounts of textual information. It is necessary to investigate the methods of thematic modelling of texts, in particular the Latent Dirichlet Allocation (LDA) algorithm, which automatically identifies the main topics in a large amount of text data. Preliminary processing of the corpus of documents was carried out, a thematic model was built, and its results were analysed. The study confirmed the effectiveness of thematic modelling as a tool for classifying and analysing unstructured texts. The practical implementation of the model demonstrated that this approach can be used in various fields – from journalism and marketing to science and education. The results showed that the quality of thematic modelling depends on the pre-processing of the data, the choice of the number of topics, and the parameters of the model. Thus, the work contributed to the consolidation of knowledge in computational linguistics and to practical skills in natural language processing.</p>
      <sec id="sec-1-1">
        <title>2. Related works</title>
        <p>In today's information age, society generates enormous amounts of text data every day. News sites, social networks, forums, emails, blogs, user reviews, documents – all this creates a powerful flow of information that needs to be stored, processed, and analysed. According to analytical agencies, tens of millions of new texts of various formats are created every day in the world, and this trend is only growing, making manual analysis infeasible. There is an urgent need for tools that can automatically reveal meaning and structure in unstructured text.</p>
        <p>One of the most promising areas in this area is thematic modelling of texts – a method of
identifying hidden thematic structures in large amounts of text data. Thematic modelling allows
you to understand what the documents are about, without the need to read them thoroughly. It
automatically classifies texts by content, highlights key topics, and allows you to visualise the
results, which significantly simplifies analysis.</p>
        <p>The principle of thematic modelling is that each document consists of a particular set of topics, and each topic consists of a specific set of words. For example, if the system analyses a news corpus, it can detect topics such as "politics", "economy", "sports", and "education", even if these labels are not set manually. Thematic modelling algorithms, in particular Latent Dirichlet Allocation (LDA), are based on statistical patterns of the joint appearance of words in texts and are able to automatically find relationships between words and group them into meaningful topics. The relevance of this topic is due not only to the rapid growth of textual data but also to the need to interpret it effectively. In many fields, from media and journalism to education, marketing, and research, thematic modelling is becoming an indispensable tool. It allows you to:</p>
        <p>Analyse large amounts of news;</p>
        <p>Identify trends in social networks;</p>
        <p>Carry out automatic classification of documents;</p>
        <p>Segment customer reviews by topic;</p>
        <p>Build dashboards for decision-making.</p>
        <p>
          Latent Dirichlet Allocation (LDA) is a classical probabilistic generative topic model proposed in [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. LDA formalises a document as a mixture of topics and a topic as a distribution of words; it was this work that laid the mathematical foundation for most of the further research in the field. Its advantages are ease of interpretation, relative ease of implementation, and low hardware requirements; the disadvantage is a weak ability to capture context (sequence/order) and problems with short texts.
        </p>
        <p>
          Other classical methods – PLSA and NMF – use linear/probabilistic factorisations of the document-term matrix [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. NMF sometimes gives more stable and interpretable themes on a small corpus, but lacks the Bayesian regularisation of LDA and can be sensitive to noise. Comparative studies show that no "classic" dominates universally – the choice depends on the size of the corpus, the length of the documents, and the goals of the analysis.
        </p>
        <p>
          Modern approaches: embedding representations and hybrids:
1. BERTopic is a practical cluster-embedding approach that combines transformer embeddings of documents (BERT-like) with density-based clustering and c-TF-IDF for describing topics [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. BERTopic shows good semantic coherence of topics, especially on short and variable texts (tweets, comments), but requires more resources and depends on the quality of the embeddings.
        </p>
        <p>
          2. Contextualised Topic Models (CTM) and their development – methods that combine the BoW part with contextual embeddings (BERT) in variational autoencoders [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. They increase the coherence of topics compared to classical LDA, especially on data where context significantly changes the meaning of words. CTM and its derivatives (improved through negative sampling, pretraining, etc.) are now actively researched and often give better NPMI/UMass results than LDA.
        </p>
        <p>3. Top2Vec / embedding-based clustering – an approach in which document and word embeddings are used to simultaneously identify topics and semantic centres (without an explicit choice of K). It works well for large corpora with moderate document lengths. The downside is that interpreting topics sometimes requires additional c-TF-IDF or manual filtering.</p>
        <p>
          The general trend in recent years has been to replace or supplement purely frequency-based representations (BoW/TF-IDF) with contextual embeddings (BERT, MiniLM, etc.). This improves the quality of topics (semantics), but increases computational costs and can complicate interpretation in some cases [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ].
        </p>
        <p>
          In comparative work, a set of metrics is most often used: perplexity (a probabilistic measure), coherence (UMass, Cv) and NPMI [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. Almost all modern research emphasises that perplexity and coherence sometimes conflict (perplexity can decrease while coherence deteriorates), so it is recommended to use a combination of metrics to choose the optimal model and number of topics.
        </p>
        <p>
          Although most of the methodological works are tested on English-language corpora (20 Newsgroups, Wikipedia, ArXiv abstracts), there are more and more publications dedicated to Ukrainian-language corpora. A study of themes in folk songs of Podillia (a case study) [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] applied LDA to folklore texts; the authors note the importance of lemmatisation and morphological normalisation given the productive word formation of the Ukrainian language [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. An analysis of discussions and media coverage of the Russo-Ukrainian war showed [7] that for social networks/tweets it is advisable to compare LDA with embedding-based models (BERTopic/CTM): transformer approaches are better at catching context and nuances, while LDA gives more stable clusters for a large number of short, noisy messages. These studies emphasise two theses that are important for Ukrainian: (1) pre-processing (lemmatisation, removal of inflectional forms, stop words) significantly affects the quality of topics; (2) the choice of model depends on the genre of the texts – for long forms (articles), LDA/NMF work well; for short/social media, CTM/BERTopic/Top2Vec gives a better semantic grouping [7].
        </p>
        <table-wrap id="table-1">
          <label>Table 1</label>
          <caption>
            <p>Comparison of topic modelling approaches.</p>
          </caption>
          <table>
            <thead>
              <tr>
                <th>Method</th>
                <th>Approach</th>
                <th>Advantages</th>
                <th>Disadvantages</th>
                <th>Typical use</th>
              </tr>
            </thead>
            <tbody>
              <tr>
                <td>LDA (BoW)</td>
                <td>Bayesian generative model</td>
                <td>Interpretability, low resources</td>
                <td>Weak context, problems with short texts</td>
                <td>Basic reference method; suitable for long documents; requires lemmatisation</td>
              </tr>
              <tr>
                <td>NMF</td>
                <td>Linear factorisation of the document-term matrix</td>
                <td>Simplicity, sometimes better stability with small data</td>
                <td>Sensitivity to noise, no priors</td>
                <td>Alternative to LDA on small corpora</td>
              </tr>
              <tr>
                <td>BERTopic</td>
                <td>Transformer embeddings with clustering and c-TF-IDF</td>
                <td>Good semantic coherence, especially on short texts</td>
                <td>Resource-intensive, depends on embedding quality</td>
                <td>Short, noisy texts (tweets, comments)</td>
              </tr>
              <tr>
                <td>Top2Vec / embedding clustering</td>
                <td>Coordination of documents and words in an embedded space</td>
                <td>Does not need a prior K, good semantic centres, quick</td>
                <td>Interpretation is sometimes more complicated</td>
                <td>Large corpora, overview of themes</td>
              </tr>
            </tbody>
          </table>
        </table-wrap>
        <p>
          LDA remains a "practical standard" – it provides interpretable topics and serves as a good baseline for any thematic analysis [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. It is especially valuable when resources are limited or when the results must be explained to a non-professional audience.
        </p>
        <p>
          Contextual models (CTM, BERTopic) show a marked improvement in the semantic quality of topics (NPMI/Cv), especially on short or highly contextual texts [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. If the project can afford the computational cost, these approaches give a better interpretation of the topics.
        </p>
        <p>
          Assessment should be multidimensional [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. Perplexity ≠ coherence: in practice, it is advised to minimise perplexity and simultaneously maximise NPMI/Cv (or perform human validation for the most important topics).
        </p>
        <p>
          The peculiarities of the Ukrainian language (morphology, inflexion, word formation) make high-quality linguistic preprocessing critical: tokenisation, lemmatisation (pymorphy2-uk / Stanza / spaCy pipelines), removal of stop words, and filtering of n-grams. Studies on Ukrainian corpora confirm that without such preprocessing the quality of topics drops sharply [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ].
        </p>
        <p>
          Practical recommendations for research:
1. Implementation of LDA as a baseline (Gensim) after careful linguistic pre-processing
(lemmatisation, stop words, removal of frequent noise tokens), estimation of perplexity and
NPMI/Cv on the K grid [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ].
        </p>
        <p>
          2. BERTopic testing (with Ukrainian/multilingual embeddings – mBERT or lightweight MiniLM
models) and CTM – comparison of NPMI and Cv; on short texts, BERTopic is expected to win [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ].
        </p>
        <p>3. Validation: in parallel, a small manual assessment (human judgment) of 10-20 topics gives a qualitative check of the metrics.</p>
        <p>4. Resources: if resources are constrained, use LDA/NMF; if transformers can be run, CTM/BERTopic will give better semantics.</p>
        <p>5. Documentation: fixing hyperparameters (α, β, minimum frequency of terms, seed) so that the
results are reproducible [9].</p>
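        <p>As an illustration of point 1 above, the following is a minimal sketch of the K-grid evaluation with Gensim; it assumes that texts is a list of already lemmatised and cleaned token lists, and all names and parameter values are illustrative rather than the settings used in this work.</p>
        <preformat>
# Minimal K-grid sketch (assumption: `texts` is a list of lemmatised token lists).
from gensim.corpora import Dictionary
from gensim.models import LdaModel
from gensim.models.coherencemodel import CoherenceModel

dictionary = Dictionary(texts)
dictionary.filter_extremes(no_below=5, no_above=0.5)   # drop rare and overly frequent terms
corpus = [dictionary.doc2bow(doc) for doc in texts]

for k in [5, 10, 15, 20, 25]:                          # grid over the number of topics K
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=k,
                   alpha='auto', eta='auto', passes=10, random_state=42)
    cv = CoherenceModel(model=lda, texts=texts, dictionary=dictionary,
                        coherence='c_v').get_coherence()
    npmi = CoherenceModel(model=lda, texts=texts, dictionary=dictionary,
                          coherence='c_npmi').get_coherence()
    bound = lda.log_perplexity(corpus)                 # per-word likelihood bound
    print(f"K={k}: bound={bound:.3f}, Cv={cv:.3f}, NPMI={npmi:.3f}")
        </preformat>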
        <p>Classical approaches to topic modelling (LDA, PLSA, NMF) formalise a document as a mixture of topics and a topic as a distribution of words, and they are widely used as a baseline in thematic analysis studies. Having settled on LDA, we are guided by its interpretability and stability on large corpora. Modern approaches (BERTopic, Contextualised Topic Models, Top2Vec) combine contextual embeddings and clustering, which increases the semantic coherence of topics but requires more computing resources. Evaluation of models is carried out by perplexity and coherence (UMass, NPMI, Cv), since a combined approach to validation gives the most reliable results. For Ukrainian-language corpora, the importance of lemmatisation and morphological normalisation is additionally emphasised (case studies: Blei et al. 2003; BERTopic; CTM and comparative studies).</p>
        <p>Within the framework of this work, the development of a thematic model of texts is considered, which allows key topics to be automatically singled out from a large corpus of Ukrainian-language texts. The focus is on the LDA (Latent Dirichlet Allocation) algorithm, which is one of the most common and at the same time most interpretable methods of thematic analysis. A comparison of this approach with other methods such as clustering, classification, and modern neural approaches (transformers) will also be carried out, and the advantages and disadvantages of each technique will be identified. Special attention is paid to the formulation of the problem that the proposed thematic model is designed to solve. First of all, it is about automating the understanding of text data in situations where labels are missing and human analysis is too costly or impossible. Thus, this section lays the theoretical and methodological basis for the implementation of the work, demonstrating not only technical aspects but also the strategic significance of thematic modelling in the digital information age.</p>
        <p>Within the framework of this work, a tool for thematic modelling of texts is being developed,
the primary purpose of which is to automatically identify content topics in the corpus of
documents without preliminary markup or manually specified categories. This approach allows
you to better understand the structure and content of large text arrays, identify hidden patterns,
and optimise the content analysis process.</p>
        <p>The product being developed is a thematic model built using the Latent Dirichlet Allocation (LDA) algorithm, which belongs to the category of probabilistic models. LDA allows each document to be represented as a combination of several topics, and each topic as a set of keywords with appropriate weights. Based on the statistics of the co-occurrence of words in different documents, the model identifies the words that most often occur together and groups them into topics. This approach is beneficial in cases where the structure of the texts is not strictly defined and manual classification is too costly or subjective. The texts were chosen so that they cover a wide range of topics, including politics, technology, education, health, and economics; this choice is justified by the fact that, in real conditions, texts are of a mixed nature and often include several topics at the same time, so high-quality thematic modelling should take this into account. Texts first undergo standard processing: clearing punctuation and special characters; lowercasing; removing stop words; lemmatisation (if necessary); tokenisation. The product is developed in the Python programming language, using the following libraries:
1. Gensim – a library for building LDA models and working with text data;
2. pyLDAvis – visualisation of the constructed topics (interactive graphs that show the placement of topics in vector space);</p>
        <p>3. NLTK / spaCy / Stanza – for pre-processing of texts: tokenisation, lemmatisation, removal of stop words;
4. Pandas – convenient work with text datasets;
5. Matplotlib / Seaborn – additional visualisation of results.</p>
        <p>This stack of tools allows the complete cycle of thematic modelling to be implemented effectively – from text processing to visual analysis of results.</p>
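        <p>A minimal end-to-end sketch of this cycle with the stack above follows; the raw_docs input and the abbreviated stop-word list are illustrative placeholders, not the corpus or resources used in this work.</p>
        <preformat>
# End-to-end sketch (assumptions: `raw_docs` is a list of Ukrainian-language strings;
# the stop-word list is a tiny illustrative stub).
import re
from gensim.corpora import Dictionary
from gensim.models import LdaModel

STOP_WORDS = {"і", "та", "в", "на", "що", "з", "до", "як"}

def preprocess(text):
    text = text.lower()                                   # lowercasing
    tokens = re.findall(r"[а-щьюяґєіїa-z']+", text)       # strip punctuation, numbers, symbols
    return [t for t in tokens if t not in STOP_WORDS and len(t) > 2]

texts = [preprocess(doc) for doc in raw_docs]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(doc) for doc in texts]       # Bag-of-Words representation

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=10,
               passes=10, random_state=42)

for topic_id, words in lda.show_topics(num_topics=10, num_words=5, formatted=False):
    print(topic_id, [w for w, _ in words])                # top keywords per topic
        </preformat>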
        <p>In the field of natural language processing (NLP), there are several methods that, to some extent,
perform the function of grouping, classifying or summarising text documents. Although thematic
modelling, in particular based on LDA, is a specialised approach to identifying topics, it is worth
considering other methods that can act as its counterparts in specific contexts.</p>
        <p>1. K-means clustering is one of the most popular methods of unsupervised learning, which distributes objects (in our case, documents) into groups called clusters. The algorithm tries to minimise the distance between documents within the same cluster and maximise the distance between different clusters. Each document is represented as a vector (for example, based on TF-IDF), and the cluster itself is defined through its centre of mass. Advantages:</p>
        <p>Easy to implement and quick to train.</p>
        <p>Does not require labels (unsupervised).</p>
        <p>Scales well for large amounts of text.</p>
        <p>Disadvantages:</p>
        <p>Clusters do not have a clear, meaningful description (there is no list of words as in LDA).</p>
        <p>It is challenging to interpret what each cluster is about.</p>
        <p>Does not take topics into account, only "groups of similar texts".</p>
        <p>K-means groups texts by similarity, while LDA detects semantic themes within texts. Clusters are "similar documents", topics are "similar words".</p>
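        <p>For contrast, a minimal K-means sketch with scikit-learn (reusing the preprocessed texts from the pipeline sketch above; cluster counts and feature limits are illustrative):</p>
        <preformat>
# K-means clusters documents by vector similarity; it yields no word distributions,
# so the closest substitute for "topics" is the terms nearest each cluster centre.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [" ".join(tokens) for tokens in texts]
tfidf = TfidfVectorizer(max_features=5000)
X = tfidf.fit_transform(docs)                       # TF-IDF document vectors

km = KMeans(n_clusters=10, n_init=10, random_state=42)
labels = km.fit_predict(X)

terms = tfidf.get_feature_names_out()
for i, centre in enumerate(km.cluster_centers_):
    top = centre.argsort()[-5:][::-1]               # highest-weight terms near the centre
    print(f"cluster {i}:", [terms[j] for j in top])
        </preformat>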
        <p>2. Classification of texts (SVM, Naive Bayes) – these algorithms belong to supervised learning, which requires pre-labelled data. Each text must have a predefined category (e.g. "sports", "education", "politics"), and the model learns to recognise these categories in new examples.</p>
        <p>Advantages:</p>
        <p>High accuracy with proper data preparation.</p>
        <p>Easy to use (especially Naive Bayes).</p>
        <p>Works well with short texts.</p>
        <p>Disadvantages:</p>
        <p>Does not work without labels – a large number of documents must be classified manually for training.</p>
        <p>It does not detect new topics; it works only with those that are already known.</p>
        <p>Less flexible in a dynamic environment (changing topics requires retraining).</p>
        <p>The classification requires tagged training data, while LDA is fully automated and suitable for
exploring new, previously undefined topics.</p>
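        <p>A minimal supervised sketch for this comparison (scikit-learn; docs and labels are assumed to be a pre-labelled dataset, which is exactly what LDA does not require):</p>
        <preformat>
# Naive Bayes baseline: learns only the categories present in `labels` and
# cannot discover new topics, unlike LDA.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(docs, labels,
                                                    test_size=0.2, random_state=42)
clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
clf.fit(X_train, y_train)
print("accuracy:", clf.score(X_test, y_test))
        </preformat>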
        <p>3. Transformers (BERT, GPT, BERTopic) – transformer-based models are modern approaches in NLP that take into account the context of an entire sentence or text. Models like BERT (Bidirectional Encoder Representations from Transformers) generate vector representations of texts that preserve semantics at a deeper level. BERTopic is an example of thematic modelling that combines BERT and clustering. Advantages:</p>
        <p>High-quality results.</p>
        <p>Taking into account the context and order of words.</p>
        <p>The ability to analyse the nuances of language, synonyms, and irony.</p>
        <p>Disadvantages:</p>
        <p>Need for powerful hardware (GPU/TPU).</p>
        <p>Complexity of implementation (not "out of the box").</p>
        <p>Weak interpretability (results are difficult to explain – a "black box").</p>
        <p>Transformers are stronger in quality, but more challenging to implement. LDA loses in accuracy, but wins in simplicity, interpretability, and resources.</p>
        <p>4. Alternative thematic models: NMF and PLSA. NMF (Non-negative Matrix Factorisation) decomposes the document-term matrix into two smaller matrices that reflect topics and word distributions. It works similarly to LDA, but is based on linear algebra rather than probabilities. Advantages:</p>
        <p>A simple approach without complicated statistics.</p>
        <p>Can give clear topics for small corpora.</p>
        <p>Disadvantages:</p>
        <p>Themes are less stable when data changes.</p>
        <p>Less interpretable compared to LDA.</p>
        <p>PLSA (Probabilistic Latent Semantic Analysis) is a precursor to LDA – a statistical model that also identifies topics by word distributions in documents. Its advantage is that, as the theoretical "foundation" for LDA, it is theoretically powerful. Disadvantages:</p>
        <p>The model is prone to overfitting.</p>
        <p>Does not scale to large amounts of data.</p>
        <p>Does not allow new documents to be modelled without re-learning.</p>
        <p>NMF is simpler, PLSA is theoretically deeper, but both are inferior to LDA in flexibility, scalability, and resilience.</p>
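        <p>A corresponding NMF sketch on the same TF-IDF matrix (scikit-learn; reuses the X and terms names from the K-means sketch above):</p>
        <preformat>
# NMF factorises the document-term matrix X into document-topic weights W
# and topic-term weights H, using linear algebra rather than probabilities.
from sklearn.decomposition import NMF

nmf = NMF(n_components=10, init='nndsvd', random_state=42)
W = nmf.fit_transform(X)        # shape (M, K): document-topic weights
H = nmf.components_             # shape (K, N): topic-term weights

for i, row in enumerate(H):
    top = row.argsort()[-5:][::-1]
    print(f"topic {i}:", [terms[j] for j in top])
        </preformat>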
        <p>The advantages and disadvantages of these approaches can be summarised as follows. LDA produces themes that can be interpreted and requires small resources, although it can be slow on large corpora. Supervised classifiers (SVM, Naive Bayes) work well with labels and achieve accuracy in narrow tasks, but are unable to work with unstructured topics and do not scale to new themes. Transformers deliver high quality, but need fine-tuning and substantial resources (GPU, TPU), their themes are often uninterpretable, and it is difficult to explain the reason for a classification (a black box).</p>
        <p>Thematic modelling, especially in the implementation of Latent Dirichlet Allocation (LDA), has a number of significant advantages that make it a versatile and convenient tool for analysing large corpora of text data. Unlike other approaches (e.g., classification or transformers), LDA strikes an optimal balance between interpretability, automation, and efficiency.</p>
        <p>1. Unsupervised learning. One of the most valuable properties of thematic modelling is its independence from labelled data. The algorithm does not require prior manual classification of documents – that is, there is no need to create a training sample where each text is manually assigned to a specific topic. This is essential in cases where:</p>
        <p>Labels are difficult or expensive to obtain.</p>
        <p>The subject matter of the data changes over time.</p>
        <p>It is necessary to explore a new, unexplored corpus of texts.</p>
        <p>Thus, thematic modelling is an indispensable tool for exploratory analysis, when it is necessary
to find out: "what the texts are about", and not just classify them into already known categories.</p>
        <p>2. Visualisation capability. Modern libraries, including pyLDAvis, make it easy to visualise the results of thematic modelling. This opens up opportunities for intuitive analysis even for users without technical training. Thanks to visualisation, you can:</p>
        <p>See how topics are placed in a vector space.</p>
        <p>Evaluate which words are key for each topic.</p>
        <p>Check which documents belong to which topics and how strongly.</p>
        <p>Explore the intersections between topics (the more topics overlap, the more similar they are).</p>
        <p>It makes thematic modelling a powerful tool for data analytics and presentation.</p>
        <p>3. Flexibility of the model. The user independently sets the number of topics that the model should find. This allows the analysis to be adapted to different tasks:</p>
        <p>If you need a general overview, you can choose a smaller number of topics (for example, 5-10).</p>
        <p>If you need detail, the model can be reconfigured for 20-30 topics.</p>
        <p>In addition to the number of topics, you can flexibly customise: the number of keywords in a topic; the alpha and beta distribution parameters (affecting the "smearing" of topics across documents); the filtering of rarely used or commonplace words.</p>
        <p>This flexibility allows the model to be optimised for a specific type of content or business task.
4. Interpretation of results. Unlike many modern models (especially transformers), LDA provides transparency in the results. Each topic is clearly expressed in the form of a set of words, and each document has a distribution of topics with the weight of each of them. This makes it possible to:</p>
        <p>Quickly describe the essence of the topic (by keywords).</p>
        <p>Understand how the content of the document is related to the issues.</p>
        <p>Check the logic of the results based on human intuition.</p>
        <p>Justify the conclusions of analytics to customers or management.</p>
        <p>LDA models are one of the few in machine learning that can be explained and defended in front
of a non-professional audience.</p>
        <p>5. Efficiency and low resource requirements. LDA models do not require significant computing power. They can be launched:</p>
        <p>on a regular laptop or server without a GPU;</p>
        <p>with small corpora of texts (even several hundred documents);</p>
        <p>with limited RAM.</p>
        <p>It opens up access to thematic analysis for small companies, research projects, and university
laboratories. Even for educational purposes, LDA is an excellent demonstration of how text
analytics works in the real world.</p>
        <p>In addition to the implementation of thematic modelling through open libraries (for example,
Gensim and LDA), there are many ready-made commercial or SaaS solutions on the market that
provide the functionality of automatic analysis of text topics. Such services, as a rule, are aimed at
business intelligence, automation of feedback processing, customer requests, social networks, etc.
Below is a detailed review and comparison of the most well-known platforms.</p>
        <p>IBM Watson Natural Language Understanding is a platform with a set of tools for natural language processing. One of its components, topic classification, allows you to identify common topics in the text (for example, politics, healthcare, finance). In this case, the thematic analysis is based on pre-trained classifiers. Advantages:</p>
        <p>Support for many languages.</p>
        <p>High-quality results.</p>
        <p>The API integrates seamlessly into business systems.</p>
        <p>It provides not only themes but also emotional tone, categories, concepts, and objects.</p>
        <p>Disadvantages:</p>
        <p>Only works with a fixed list of topics.</p>
        <p>There is no full-fledged topic-forming model (as in LDA).</p>
        <p>Commercial model: paid for a large number of requests.</p>
        <p>Limited flexibility for the user (no access to simulation engines).</p>
        <p>Google Cloud Natural Language API – Google's text processing service includes content classification, where documents are classified according to a hierarchy of ~700 topics (for example, /Arts &amp; Entertainment/Music or /Business/Banking). It is based on deep neural networks and a predefined topic dictionary. Advantages:</p>
        <p>Reliability and speed from Google.</p>
        <p>An extensive database of topics and subtopics.</p>
        <p>Convenient to integrate into cloud services.</p>
        <p>Support for many formats.</p>
        <p>Disadvantages:</p>
        <p>Topics are hardcoded – it is impossible to identify new ones.</p>
        <p>Interpretation of the result is only possible within the Google framework.</p>
        <p>There is no transparency – it is not clear which words influenced the classification.</p>
        <p>The cost increases when processing large arrays of texts.</p>
        <p>MonkeyLearn is a cloud-based platform for text analysis that allows you to create your own
classifiers and pre-trained thematic templates. It is positioned as a no-code/low-code tool for
business users. Advantages:</p>
        <p>Ready-made templates (for example: customer support, surveys, e-commerce).</p>
        <p>You can create custom models without programming.</p>
        <p>Visual interface for customising categories.</p>
        <p>It has an API and integration with Google Sheets, Zapier, etc.</p>
        <p>Disadvantages:</p>
        <p>The free version is minimal.</p>
        <p>Less flexible for complex analysis.</p>
        <p>It is not a full-fledged thematic modelling (works as a classifier).</p>
        <p>Gensim (LDA implementation) is an open-source Python library for thematic modelling. It implements LDA (Latent Dirichlet Allocation), as well as other methods for analysing the latent structure of texts, and supports model training both in memory and from streaming data. Advantages:</p>
        <p>Complete freedom of customisation: number of topics, words, alpha/beta.</p>
        <p>Open core – can be expanded and adapted.</p>
        <p>Visualisation capability (via pyLDAvis).</p>
        <p>Works locally, without cloud costs.</p>
        <p>Disadvantages:</p>
        <p>Requires programming (not suitable for non-specialists).</p>
        <p>Requires independent processing of texts (cleaning, tokenisation, etc.).</p>
        <p>It does not have a graphical interface "out of the box".</p>
        <p>BERTopic is a modern library for thematic modelling that combines contextual vector representations from BERT and clustering (e.g. HDBSCAN) to detect topics. Topics are created based on the similarity of text vectors. Advantages:</p>
        <p>It takes the context into account, so that "bank" and "riverbank" will not end up in the same topic.</p>
        <p>A more accurate model for short or unstructured data.</p>
        <p>Topics can have dynamic depth (topics within topics).</p>
        <p>It has integration with visualisation and meta-information.</p>
        <p>Disadvantages:</p>
        <p>Requires a lot of resources (GPU for fast work).</p>
        <p>Complexity of installation (transformers and BERT models are needed).</p>
        <p>The interpretation is more complicated than that of the classic LDA.</p>
      </sec>
      <sec id="sec-1-2">
        <title>3. Problem formulation</title>
        <p>In the XXI century, humanity lives in the conditions of the information revolution: text data is created at an unprecedented speed – in news, social networks, instant messengers, reviews, reports, blogs, comments, and documents. All this forms a complex and extensive information ecosystem that requires tools for systematisation, analysis, and understanding. Much of this information is unstructured – i.e., it has no clear tags, headings, or categories – and is therefore difficult to process automatically using traditional methods.</p>
        <p>Modern society faces the challenge of efficiently processing large amounts of unstructured text data. Classical methods of analysis – for example, manual classification, keyword search, rule-based systems – are not able to scale to large data sets and do not allow content topics to be identified automatically without prior human intervention. This makes the following difficult:</p>
        <p>Decision-making in business, science, and journalism;</p>
        <p>Identifying trends and topics in social networks;</p>
        <p>Customer feedback analytics;</p>
        <p>Creation of personalised recommendation systems.</p>
        <p>Classic classifiers (SVM, Naive Bayes) require labelled data, which means that someone has to manually specify which topic each document belongs to. In cases where thousands of texts are involved, this becomes an impractical, expensive, and slow process. In turn, clustering algorithms (for example, K-means), although they allow documents to be grouped, do not provide interpretable topics – we cannot say "what" each cluster is about without additional analysis. In connection with the described problem, the product under development faces several tasks:</p>
        <p>1. Develop a thematic model that automatically highlights topics from the corpus of documents.
The model must work without previous labels, and therefore belongs to the unsupervised learning
class. The algorithm should determine the most likely topics in a large corpus based on statistical
patterns of word distribution.</p>
        <p>2. Ensure the interpretation of the results. Unlike the "black boxes" of modern deep learning, the
developed system should provide clear and transparent results. A list of keywords should express
the topic, and each document should show which topics are present in it and in what ratio.</p>
        <p>3. Provide a flexible tool that works even without labelled data. The thematic model should
work on any corpus of texts – news, forums – without the need for mark-up. It should be scalable,
customizable (for example, change the number of themes) and available for local launch, without
cloud dependency.</p>
        <p>4. Compare several approaches and justify the feasibility of choosing LDA. In order for the choice of thematic modelling (in particular, the LDA algorithm) to be well-founded, it is necessary to:</p>
        <p>Compare it with classifiers, clustering, and transformers.</p>
        <p>Assess the advantages and limitations of the other approaches.</p>
        <p>Show that LDA is the best compromise between interpretability, automation, and technical simplicity.</p>
        <p>A comprehensive study of thematic modelling of texts as a modern tool for analysing large
volumes of unstructured data has been carried out. The main goal was to create a model capable of
automatically detecting content topics in the body of documents without pre-labelling, as well as
comparing this approach with similar methods. A thematic model based on the Latent Dirichlet
Allocation (LDA) algorithm using Gensim and pyLDAvis libraries has been developed. The corpus
of texts used has undergone a complete cycle of pre-processing: tokenisation, cleaning, deletion of
stop words, and, if necessary, lemmatisation. After building the model, a set of topics was obtained,
each of which is described by a list of words with the highest probability, and an analysis of the
distribution of topics across documents was carried out.</p>
      </sec>
      <sec id="sec-1-3">
        <title>4. Methods</title>
        <p>Topic modelling as a task of extracting hidden topics in the corpus of texts has become one of the
key paradigms in natural language processing and text data analysis. This section provides an
overview of the three main approaches – classical generative models, factorisation-based models,
and modern approaches with contextual embedding – with a focus on their applicability, strengths
and weaknesses, and challenges for Ukrainian-language corpora.</p>
        <p>
          The original and most common approach is Latent Dirichlet Allocation (LDA), proposed in [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. The model formalises a document as a mixture of topics and a topic as a distribution of words, using a priori Dirichlet distributions for the θ (document-topic) and φ (topic-word) distributions. Among its advantages are high interpretability of topics, support for large corpora, and relatively simple implementation. However, a number of studies have noted the model's weak ability to take into account word order, context, or short texts, as well as the instability of the results with respect to initialisation or document order [10]. Other researchers classify LDA in their reviews as the "dominant" model in topic modelling before the era of deep learning [11]. The "order effect" in LDA was also investigated, and the LDADE approach for tuning hyperparameters to reduce the instability of topic distributions was proposed [10].
        </p>
        </p>
        <p>Along with LDA, methods based on non-negative matrix factorisation (NMF), latent semantic analysis (LSA), and pLSA are widely discussed in the literature. These methods, although not as common as LDA, sometimes show better stability on small corpora or with limited resources. A study [12] compared LDA, NMF, and embedding clustering on tweet data and found that traditional models have limitations and are less stable on short texts. Some works also pay attention to dynamic versions of topic models (e.g. Dynamic Topic Model, HDP) for tracking thematic changes over time [13].</p>
        <p>
          In recent years, models that combine embedding text representations (e.g., BERT-like models)
with clustering algorithms or variational autoencoders have been growing in popularity. For
example, BERTopic uses document embedding (based on transformers), then UMAP to reduce
dimensionality, HDBSCAN for clustering, and c-TF-IDF to form a representation of topics [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ].
BERTopic demonstrates higher topic coherence compared to LDA, especially on short or variable
texts. A study [14] noted that embedding models (e.g., BERTopic or Combined Topic Model CTM)
can outperform LDAs in terms of NPMI/coherence, but require more computational resources. In
addition, studies on Indo-Aryan languages (e.g. Hindi) show that BERTopic consistently
outperforms classical methods on short texts [15].
        </p>
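        <p>For comparison, a minimal BERTopic sketch along the lines described; the multilingual embedding model named here is one common choice and an assumption, not the configuration of the cited studies:</p>
        <preformat>
# BERTopic sketch: transformer embeddings, then (internally) UMAP + HDBSCAN
# clustering and c-TF-IDF topic descriptions.
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
topic_model = BERTopic(embedding_model=embedder, language="multilingual")
topics, probs = topic_model.fit_transform(raw_docs)   # raw, unlemmatised strings

print(topic_model.get_topic_info().head())            # topic sizes and c-TF-IDF labels
        </preformat>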
        <p>
          Evaluation of thematic models is carried out through metrics such as perplexity, coherence (UMass, Cv) and NPMI. The reviews emphasise that reducing perplexity does not guarantee an increase in coherence, so a combined approach is recommended [11-12]. There are additional challenges for corpora of the Ukrainian language: rich inflexion, word formation, morphological variants, and a lack of large, curated datasets. For example, in the study of Podillya folklore, the need for lemmatisation and morphological normalisation before the use of LDA is emphasised [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. Thus, for Ukrainian-language corpora it is essential to take into account:
        </p>
        <p>Careful pre-processing of the text (lemmatisation, correction of inflexions);</p>
        <p>Choice of model depending on the genre (long formats – LDA/NMF, short/social texts – BERTopic/CTM);</p>
        <p>Multi-metric evaluation (perplexity and NPMI/coherence) and, if possible, human validation.</p>
        <p>Thus, the literature demonstrates the evolution of thematic modelling [16-21]: from traditional generative models to modern approaches with contextual embeddings. For the analysis of Ukrainian-language texts, it is advisable to use a hybrid strategy: LDA as the basic model for long documents, and embedding-based models for short/noisy texts, while at the same time ensuring high-quality pre-processing and combined assessment. In the following sections, these recommendations are taken into account when choosing a model, adjusting hyperparameters, and evaluating the results.</p>
        <p>Within the framework of this study, the construction of the thematic model was carried out on
the basis of a probabilistic approach implemented through Latent Dirichlet Allocation (LDA). The
LDA algorithm assumes that each document in the corpus is a mixture of several topics, and each
topic is a distribution of word probabilities. Thus, the mathematical essence of the model is to
restore the hidden parameters of these distributions. Let:</p>
        <p>$D = \{d_1, d_2, \ldots, d_M\}$ – a corpus consisting of $M$ documents;</p>
        <p>$V = \{w_1, w_2, \ldots, w_N\}$ – a dictionary containing $N$ unique words;</p>
        <p>$K$ – the number of hidden topics.</p>
        <p>Each document dm is modelled as a stochastic process of generating words according to the
following steps:</p>
        <p>1. For each document $d_m$, a distribution of topics is drawn, $\theta_m \sim \mathrm{Dir}(\alpha)$, where $\alpha$ is a hyperparameter of the Dirichlet distribution that controls the "blurring" of topics in the document.</p>
        <p>2. For each topic $k$, a distribution of words is drawn, $\phi_k \sim \mathrm{Dir}(\beta)$, where $\beta$ is a hyperparameter that controls the "blurring" of words in the topic.</p>
        <p>3. For each word $w_{mn}$ in document $d_m$: a topic is drawn, $z_{mn} \sim \mathrm{Mult}(\theta_m)$; next, a word on this topic is drawn, $w_{mn} \sim \mathrm{Mult}(\phi_{z_{mn}})$.</p>
        <p>The total probability of the corpus of documents is given as
$$P(W, Z \mid \alpha, \beta) = \prod_{m=1}^{M} \int P(\theta_m \mid \alpha) \left( \prod_{n=1}^{N_m} \sum_{z_{mn}} P(z_{mn} \mid \theta_m)\, P(w_{mn} \mid z_{mn}, \beta) \right) d\theta_m, \qquad (1)$$
where $W$ denotes all observed words and $Z$ the hidden variables (topics for each word).</p>
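        <p>To make the generative process concrete, here is a toy NumPy simulation of steps 1-3; all sizes and hyperparameter values are illustrative.</p>
        <preformat>
# Toy simulation of the LDA generative process behind eq. (1).
import numpy as np

rng = np.random.default_rng(42)
M, K, N, doc_len = 5, 3, 8, 20     # documents, topics, vocabulary size, words per document
alpha, beta = 0.5, 0.1

phi = rng.dirichlet([beta] * N, size=K)                 # phi_k ~ Dir(beta), topic-word
for m in range(M):
    theta_m = rng.dirichlet([alpha] * K)                # theta_m ~ Dir(alpha), document-topic
    z = rng.choice(K, size=doc_len, p=theta_m)          # z_mn ~ Mult(theta_m)
    words = [int(rng.choice(N, p=phi[k])) for k in z]   # w_mn ~ Mult(phi_{z_mn})
    print(f"doc {m}: topic counts {np.bincount(z, minlength=K)}, words {words}")
        </preformat>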
        <p>The purpose of thematic modelling is to find the posterior distribution
$$P(\theta, \phi, Z \mid W, \alpha, \beta), \qquad (2)$$
which is approximated using the variational Bayesian approach or the Gibbs sampling method. For each document, the following are calculated: $\theta_m = (\theta_{m1}, \theta_{m2}, \ldots, \theta_{mK})$ – the probability vector of topics; $\phi_k = (\phi_{k1}, \phi_{k2}, \ldots, \phi_{kN})$ – the probability vector of words in the topic.</p>
        <p>After training the model, each document is represented as a combination of topics with weights $\theta_m$, and each topic is defined by the most likely words from the vector $\phi_k$.</p>
        <p>Two primary metrics evaluate the quality of thematic modelling:</p>
        <p>1. Perplexity is a measure of the consistency of the model with held-out test data:
$$\mathrm{Perplexity}(D_{\mathrm{test}}) = \exp\left\{ -\frac{\sum_{d=1}^{M} \log P(w_d)}{\sum_{d=1}^{M} N_d} \right\}. \qquad (3)$$
A lower perplexity value corresponds to better model consistency.</p>
        <p>2. Topic coherence assesses the semantic consistency of the most important words of a topic. For a set of words $W_t = \{w_1, w_2, \ldots, w_M\}$, coherence is defined as
$$C(W_t) = \sum_{i &lt; j} \log \frac{D(w_i, w_j) + \epsilon}{D(w_j)}, \qquad (4)$$
where $D(w_i, w_j)$ is the number of documents in which the words $w_i$ and $w_j$ occur together, $D(w_j)$ is the number of documents containing $w_j$, and $\epsilon$ is a smoothing factor.</p>
        <p>The result of modelling is a set of thematic distributions
$$\Phi = \{\phi_1, \phi_2, \ldots, \phi_K\}, \qquad \Theta = \{\theta_1, \theta_2, \ldots, \theta_M\}, \qquad (5)$$
which allows the construction of an $M \times K$ matrix, where each element $\theta_{mk}$ is interpreted as the probability that document $d_m$ belongs to topic $k$.</p>
        <p>This matrix forms the basis for further analysis – clustering of documents, construction of semantic maps, and visualisation of thematic structures.</p>
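        <p>A short sketch of computing both metrics for a trained Gensim model (reusing lda, corpus, texts, and dictionary from the earlier sketches; note that Gensim's log_perplexity returns a base-2 per-word bound):</p>
        <preformat>
# Evaluating a trained model with the two metrics above.
import numpy as np
from gensim.models.coherencemodel import CoherenceModel

bound = lda.log_perplexity(corpus)      # per-word log-likelihood bound (base 2)
print("perplexity:", np.exp2(-bound))   # eq. (3) style: lower is better

umass = CoherenceModel(model=lda, corpus=corpus, dictionary=dictionary,
                       coherence='u_mass').get_coherence()   # document co-occurrence, eq. (4) style
cv = CoherenceModel(model=lda, texts=texts, dictionary=dictionary,
                    coherence='c_v').get_coherence()
print("UMass:", umass, "Cv:", cv)
        </preformat>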
        <p>The goal of the developed software product is to create an efficient, automated system for
thematic modelling of texts, which allows the user to quickly identify key topics in unstructured
text documents without prior mark-up or classification. This tool should provide the ability to:
1. Process large amounts of textual information;
2. Analyse the content of documents without manual intervention;
3. Identify hidden topics by applying machine learning methods, in particular the
Latent Dirichlet Allocation (LDA) algorithm;</p>
        <p>4. Display the results in an understandable, interpretable form – as lists of keywords that form topics and their distribution across documents;</p>
        <p>5. Visualise the results of thematic analysis to improve understanding of the structure
of the text corpus.</p>
        <p>As a result of the development, a software module will be implemented that helps the user
interpret large arrays of texts, reduce the time for their processing, and identify semantic trends in
the content without deep linguistic or technical preparation.</p>
        <p>A software product for thematic modelling of texts should implement a complete cycle of automated analysis of textual information, from loading data to displaying results in a convenient form. The main functions that the system should implement include:</p>
        <p>1. Loading the text corpus:</p>
        <p>Load input text data from a file (e.g., .txt, .csv, .json).</p>
        <p>Support for entering one or more documents.</p>
        <p>Ability to work with Ukrainian-language texts.</p>
        <p>2. Pre-processing of texts:</p>
        <p>Clearing punctuation, numbers, and special characters.</p>
        <p>Lowercasing the text.</p>
        <p>Deleting stop words (in Ukrainian).</p>
        <p>Tokenisation – the division of text into separate words (tokens).</p>
        <p>Lemmatisation (if necessary) – bringing words to their original form.</p>
        <p>3. Building a thematic model:</p>
        <p>Formation of a vector representation of texts (for example, Bag-of-Words or TF-IDF).</p>
        <p>Building an LDA model to define topics in the corpus.</p>
        <p>Setting the number of topics, dictionary sizes, and the alpha and beta parameters.</p>
        <p>4. Generating results:</p>
        <p>Creating lists of topics with a set of keywords for each.</p>
        <p>Calculating the distribution of topics for each document.</p>
        <p>Saving results in text or tabular format.</p>
        <p>5. Visualisation of results:</p>
        <p>Building an interactive thematic map (through the pyLDAvis library).</p>
        <p>Displaying the weights of words in topics.</p>
        <p>Visualising similarities and overlaps between topics.</p>
        <p>6. Data saving/export (a save/export sketch is given below):</p>
        <p>Option to export the model, topic lists, or topic breakdowns to a file.</p>
        <p>Ability to reuse the saved model.</p>
        <p>7. User interface (optional). A simple graphical interface (or console menu) where the user can: select a file; set model parameters (number of topics); view the visualisation.</p>
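        <p>A sketch of the saving/export functions from point 6 (Gensim and pandas; file names are illustrative, and the lda model and corpus come from the sketches above):</p>
        <preformat>
# Persisting the model and exporting the document-topic breakdown.
import pandas as pd

lda.save("lda_model.bin")               # the saved model can be reused later:
# from gensim.models import LdaModel
# lda = LdaModel.load("lda_model.bin")

rows = [{f"topic_{k}": p
         for k, p in lda.get_document_topics(bow, minimum_probability=0.0)}
        for bow in corpus]
pd.DataFrame(rows).to_csv("doc_topics.csv", index=False)   # distribution of topics by document
        </preformat>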
        <p>Thus, the software product should cover all the key stages of thematic analysis of texts – from
processing and modelling to interpretation and output of results.</p>
        <p>The developed software product is aimed at users who need to analyse a large amount of textual
information, but do not have sufficient technical or linguistic knowledge for its deep processing.
Thanks to the automation of the main processes of thematic analysis, the system allows you to
solve a number of applied problems.</p>
        <p>1. Analysis of a large volume of texts. In the real world, processing thousands of documents
manually is extremely time-consuming and resource-intensive. The software product allows you to
automatically process large corpora of texts without the need for human intervention at each stage.</p>
        <p>2. Automatic detection of topics in documents. The user gets the opportunity to find out what
the texts are about, even if the issues have not been determined in advance. The LDA-based model
allows you to automatically generate topics based on statistical patterns in the data.</p>
        <p>3. Classification and grouping of texts by topic. The system allows you to determine which
documents are related to a particular topic, which makes it possible to segment texts by content
(for example: economics, politics, sports, culture).</p>
        <p>4. Building thematic profiles of documents. Each document has a distribution of topics that
helps to assess which topics dominate the text and which are secondary. It is beneficial for reports,
news, research articles, or social media content.</p>
        <p>5. Visualisation of results for analytics. The user receives an interactive visualisation (via
pyLDAvis), which allows a better understanding of the structure of topics, their relationships, and
their distribution in the space of texts. It makes the analysis accessible even to non-professional
users.</p>
        <p>6. Decision support. Thanks to the interpretability of the results of thematic modelling, the user can identify trends faster, filter important documents, and draw conclusions based on the actual content of the texts.</p>
        <p>7. Saving the results for further processing. The resulting topics and breakdowns can be stored,
used in other systems, or used to generate reports, making the product useful in research,
educational, and business contexts.</p>
        <p>Thus, the software product removes the need for the user to manually process texts and allows
you to obtain valuable semantic insights automatically, quickly and in a convenient format.</p>
        <p>The system being developed, conventionally called the "Thematic Analysis Module", is a
software tool for processing, analysing and visualising large volumes of text documents in order to
automatically identify semantic topics. The system belongs to the application software that
implements computational linguistics and machine learning methods for the needs of text analysis.
The principle of operation of the module is based on the Latent Dirichlet Allocation (LDA)
algorithm, one of the most common approaches to thematic modelling, which allows you to
determine a set of hidden topics in the corpus of texts without human intervention.</p>
        <p>1. What actions take place on the input data? After loading the text corpus, the system performs several sequential stages of processing on the input data to prepare it for thematic analysis:</p>
        <p>– Text pre-processing:</p>
        <p>Clearing texts of punctuation, special characters, and numbers.</p>
        <p>Normalisation (converting all words to lower case).</p>
        <p>Removing stop words that do not carry a semantic load (for example, "and", "or", "also").</p>
        <p>Tokenisation – the division of text into separate words (tokens).</p>
        <p>Lemmatisation – the reduction of words to their original form (for example: "worked" → "work").</p>
        <p>– Building a textual representation:</p>
        <p>Creating a document-term matrix using Bag-of-Words or TF-IDF methods;</p>
        <p>Formation of a dictionary (all unique words of the corpus).</p>
        <p>– Thematic modelling:</p>
        <p>Building an LDA model – automatic detection of topics that are repeated in texts;</p>
        <p>Definition of the set of words that best characterise each topic;</p>
        <p>Calculation of the distribution of topics in each document (i.e. what each document is about).</p>
        <p>2. What the user sees at the output. After completing all stages of processing and modelling, the system provides the user with the result in a clear and visualised form:</p>
<p>– Topics. A list of detected topics, each represented by a set of keywords with the highest
weight (importance). For example:
Topic 1: ['economy', 'profit', 'currency', 'inflation', 'bank']
Topic 2: ['sport', 'match', 'team', 'goal', 'tournament']</p>
        <p>– Distribution of topics by documents. For each document, it is displayed which topics dominate it and
with what probability. For example: Document No. 7: Topic 1 – 60%, Topic 3 – 30%, Topic 5 – 10%.</p>
        <p>– Interactive visualisation. Using the pyLDAvis library, the results are displayed as an
interactive topic map where:
 circles are topics;
 circle size is the frequency of the topic;
 overlap is the similarity between topics;
 hovering the cursor shows the topic keywords.</p>
        <p>– Tables with results. Export in CSV or JSON formats:
 a table with topics;
 the distribution of topics by documents;
 the top words for each topic.</p>
        <p>The user can use all these results for:
 building reports;
 content analytics;
 segmentation of texts by topic;
 automatic classification or preparation of training data.</p>
<p>First of all, it must have the functionality to load the corpus of documents in a convenient
format (for example, TXT or CSV), which can contain both individual texts and large arrays of text
data, and then perform pre-processing: tokenisation, deletion of stop words and, if necessary,
lemmatisation to bring words to their basic form. Based on the cleaned texts, the program should build a thematic model using the
Latent Dirichlet Allocation (LDA) method, which allows you to automatically determine a set of
topics in a given corpus, each of which will be represented by a set of keywords. Once processed,
the results must be stored in a file or database, allowing them to be reused, exported, or integrated
with other systems. Finally, the system should display the results to the user clearly and intuitively
– both in the form of lists of topics and tables, and in the form of interactive visualisation of topics,
which significantly simplifies the analysis of the data obtained.</p>
        <p>The project aims to create an effective tool for automatic thematic analysis of texts, which
allows you to identify content structures in unstructured text data without the need for their
preliminary labelling. The goal of the project is to provide the user with an accessible tool for
analysing large corpora of texts, with the ability to visualise, store and interpret the results without
deep technical knowledge. The Python programming language was chosen as the implementation
environment of the software product using the libraries Gensim (for building a thematic model of
LDA) and pyLDAvis (for visualising the results). Additionally, the NLTK, Pandas, and Matplotlib
libraries can be used to process texts and present results. The program provides an interface in the
form of a CLI (command line) or, by extension, a simple graphical interface (GUI) based on the
Tkinter or Streamlit libraries, which allows you to interact with the user conveniently - select a file,
set analysis parameters, and start processing. Among the main limitations of the product are the
focus on texts in Ukrainian (for correct lemmatisation and a list of stop words), as well as the need
for a pre-prepared document format (for example, one document – one line in a file). There are two
types of users: the regular user, who runs the analysis, views the results, and exports them in a
convenient format, and the administrator or developer, who can change model settings, update
dictionaries, or change the architecture of the module for a specific application. Usage scenarios:
1. Download the document corpus – the user imports a set of texts for analysis.
2. Configure the model – sets the parameters of thematic modelling (number of topics,
type of filtering).</p>
        <p>3. Run simulation – the system pre-processes and builds an LDA model.</p>
        <p>4. View results – the user receives a list of topics, keywords, and breakdowns by
documents.</p>
        <p>5. Save results – the results are exported to a file (CSV, JSON, etc.).</p>
        <p>A class diagram displays the structure of a system, that is, what classes it consists of, what
functions each class performs, and how these classes are related to each other. It answers the
question: "What parts are included in the program and what do they do?". The main classes in the
diagram are:
 CorpusLoader – responsible for loading text data.
 Preprocessor – cleans, tokenises, and lemmatises texts.
 LDAModel – builds the thematic model, manages training, and produces results.
 Visualizer – creates a visualisation of topics (e.g. via pyLDAvis).
 Exporter – saves the results to a file.
Links between the classes:
 Each class has one or more methods, which are displayed at the bottom of its
rectangle.</p>
        <p> The arrows between classes show dependencies: for example, LDAModel uses
Preprocessor to prepare data, and Visualizer uses it to plot based on model results.</p>
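        <p>The following is an illustrative sketch of the class structure from the diagram; the class roles come from the text above, while the method names and bodies are assumptions made for the example.</p>
        <preformat>
# Illustrative skeleton of the diagram's classes; method names and bodies
# are assumptions, only the class roles come from the description above.
from gensim import corpora, models
import pyLDAvis.gensim_models


class CorpusLoader:
    """Loads text data, e.g. one document per line of a file."""
    def load(self, path):
        with open(path, encoding="utf-8") as f:
            return [line.strip() for line in f if line.strip()]


class Preprocessor:
    """Cleans, tokenises and (optionally) lemmatises texts."""
    def process(self, texts):
        return [t.lower().split() for t in texts]  # placeholder for real cleaning


class LDAModel:
    """Builds the topic model and produces per-document distributions."""
    def __init__(self, num_topics=22):
        self.num_topics = num_topics

    def fit(self, tokenised_texts):
        self.dictionary = corpora.Dictionary(tokenised_texts)
        self.corpus = [self.dictionary.doc2bow(t) for t in tokenised_texts]
        self.model = models.LdaModel(self.corpus, id2word=self.dictionary,
                                     num_topics=self.num_topics)
        return self.model


class Visualizer:
    """Creates the pyLDAvis visualisation from a fitted LDAModel."""
    def prepare(self, lda):
        return pyLDAvis.gensim_models.prepare(lda.model, lda.corpus, lda.dictionary)


class Exporter:
    """Saves topic/keyword results to a file."""
    def save_topics(self, lda, path):
        with open(path, "w", encoding="utf-8") as f:
            for topic_id, words in lda.model.print_topics():
                f.write(f"{topic_id}\t{words}\n")
        </preformat>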
<p>This diagram is needed to describe the architecture of the code and helps the programmer to
implement the system correctly. The sequence diagram shows the order in which actions are
performed in time – that is, what exactly happens in the system from launch to the receipt of the
result. It models how objects pass requests to each other and in what order.
Sequence of events:</p>
        <p>The user gives a command through the interface to start the analysis.</p>
        <p>CorpusLoader loads data.</p>
        <p>Pre-processor cleans and processes texts.</p>
        <p>LDAModel performs thematic modelling.</p>
        <p>Visualizer displays themes on the screen.</p>
        <p>Exporter allows you to save the results. Each object has a vertical "lifeline", and arrows show
the interactions between them. The lower on the diagram, the later in time the action takes place.
This diagram gives an idea of the logic of executing a program step by step and is very useful for
testing or scripting.</p>
        <p>During the work, it was determined that the proposed system – the "Thematic Analysis Module"
– is designed to automatically process large volumes of unstructured text in order to identify key
topics without preliminary data labelling. This approach is especially relevant in today's
information environment, where a vast number of text messages are generated every day, which
require quick and meaningful analysis. In the process of formalisation, the input and output data of
the system, the main stages of word processing (cleaning, tokenisation, lemmatisation), building a
model and displaying the results were described. It is established that the system should provide
the loading of the corpus of documents, the adjustment of model parameters, the execution of
simulations, the visualisation of the results and the ability to export the received data. All these
features have been described in the form of functional requirements for the product. In order to
structurally present the work of the module, a technical task was created, which describes in detail
the implementation environment (Python, Gensim, pyLDAvis), user interaction interface, target
audience (ordinary user, analyst) and system limitations (focus on Ukrainian-language texts).
Particular attention is paid to the visualisation of the system architecture using UML diagrams.
Created:</p>
        <p> a class diagram showing the internal structure of the program and the relationships
between its modules (CorpusLoader, Pre-processor, LDAModel, Visualizer, Exporter);
 Sequence Diagram, which illustrates the step-by-step process of performing
thematic analysis – from loading texts to saving results.</p>
<p>The work done made it possible to systematise the logic of the software product's
functioning and to determine the key elements and their interaction. The results obtained form a solid
basis for further development, testing, and implementation of thematic modelling in real text
analysis tasks. Thus, the work has been completed, and the goals set – to formulate the
requirements, build the architecture and model the system – have been achieved in full.</p>
<p>The task of thematic modelling is to find latent (hidden) topics in a large corpus of text
documents without predefined labels. Formally, each document d_m from the corpus is modelled as a
probabilistic mixture of latent topics, and each topic as a probability distribution over words.</p>
        <p>– Model construction: a statistical model is built that forms probabilistic distributions of
words in topics and of topics in documents. Each text is represented as a vector
with the weights of belonging to each topic.</p>
        <p>– Algorithm for classifying new text: the new text is processed (lemmatisation, tokenisation),
converted to the BoW format, and passed to LDA, which returns the probability vector.
The highest probability determines the topic.</p>
        <p>– CoherenceModel (coherence score): calculates the semantic consistency of topics using the c_v metric,
which is based on the mutual appearance of terms in documents. It is used to
evaluate the quality of the model.</p>
<p>Stages of logical inference in the system:
 The model is trained on the basis of the processed corpus of documents.
 Each topic becomes a probabilistic distribution of words.
 Each document receives a distribution of topics.
 Coherence determines how stable and consistent the topics are (the higher, the better).
 New text can be classified via lda_model.get_document_topics() – without the need
to retrain the model.</p>
        <p>Thus, the system automatically concludes about the topic of the new document without manual
intervention. During the operation of the software, various data structures are formed and used.
They are necessary for storing the corpus of texts, a glossary of terms, the results of modelling
topics and the subsequent classification of new documents. The following are the key structures
that are stored or used in memory at runtime.
– processed_texts.pkl (.pkl file, Pickle): saves preprocessed texts as lists of lemmas; allows
you not to repeat preprocessing when restarting.
– dictionary (gensim.corpora.Dictionary): a unique glossary of terms generated from the
processed texts; each word has a unique ID.
– corpus (list of lists, Bag-of-Words): represents each document as a dictionary-based
frequency vector of words; required for LDA training.
– lda_model (gensim.models.LdaModel object): a model that contains topics, their distributions,
probabilities, and parameters; retains the knowledge of the model after training.
– lda_model.print_topics(): returns a list of topics with their keywords and probabilities;
it is used to interpret the results.
– lda_model.get_document_topics(): returns a probabilistic distribution of topics for a
single document; it is used to classify new texts.</p>
        <p>The software has a text-based interface in the form of an interactive Jupyter Notebook, which
provides a convenient user experience with the system. All the main functions are implemented
through clear code blocks with output and visualisation.</p>
<p>The main functions exposed to the user are:
 Output of topic keywords – for each generated topic, a list of keywords with the highest
probabilities is displayed. It allows you to interpret the content of the topic.
 Automatic naming of topics – implemented through the generate_smart_title(keywords) function,
which creates a conditional name of the topic based on keywords (for example,
"Presidential activity", "Education and science").
 Interactive visualisation (pyLDAvis) – displays topics in the form of circles and shows the
relationships between them and their keywords. The user can click on the topics and view their
content.
 Saving models and data – the processed data (processed_texts), dictionary, corpus, and trained
model are stored in .pkl files, which avoids reprocessing and retraining.
 Classification of new texts – the user can enter any new Ukrainian text, and the program will
automatically determine which topic it most likely belongs to.
 Coherence inference – the system outputs an assessment of the quality of the model based on
the c_v metric, which allows you to choose the optimal number of topics.
 Graph of quality versus the number of topics – a graph is created to analyse how coherence
changes with a different number of topics (e.g., 5–40). It helps to automatically select the best
model.</p>
        <p>The software's interface is built to be intuitive for both technical and non-technical users. It
allows you to both study the structure of topics in the corpus and analyse new documents in real
time. In the developed software, all modules function within a single data processing flow. Each
component does its part of the work and passes the results to the input of the next module. This
modular structure allows for consistent processing, prevents redundant resource use, and provides
flexibility for modifications.</p>
<p>CSV → Word Processing → Saving → Building a Corpus → LDA Model →
→ Topic generation → Coherence → Visualisation → Analysis of new texts</p>
        <p>Table 18. Co-operation processes (the process and the modules participating in it):
1. Loading and pre-processing of texts – the user imports a .csv file → the data processing module
cleans and lemmatises the texts.
2. Saving an intermediate result – the processed texts are stored in processed_texts.pkl for reuse.
3. Creating a dictionary/corpus – the data is passed to the dictionary and corpus via Gensim.
4. Training the LDA model – the corpus and dictionary are passed to the Thematic Modelling
Module, where a topic model is created.
5. Model assessment – the model is passed to the Coherence Evaluation Module, where the c_v is
calculated.
6. Visualisation of results – the topic model goes to the Visualisation Module, where an
interactive topic map (pyLDAvis) is built.
7. Generating names – the keywords of each topic are analysed in the Interpretation Module,
which assigns a human-readable name.
8. Classification of new text – the user enters the text; it is processed and transmitted to
lda_model → the model returns the probability of belonging to the topics.</p>
        <p>System requirements:
– Operating system: Windows 10 / 11, Linux, macOS.
– Python: version 3.9 or higher.
– Libraries: stanza, gensim, pyLDAvis, pandas, matplotlib, pickle, tqdm.
– Development environment: recommended Jupyter Notebook / Jupyter Lab.
– Internet connection: only required to load the stanza language model.</p>
<p>Install the dependencies: pip install stanza gensim pyLDAvis pandas matplotlib tqdm. A parser was
also developed specifically for this project, thanks to which it was possible to assemble a
unique data set from articles from the President's Office, which helped to train the model well for
further use. The main steps are:
1. Running pre-processing: execute the preprocess(text) function, which cleans and
lemmatises the texts.
2. Creating a dictionary and corpus: use gensim.corpora.Dictionary and corpus =
[dictionary.doc2bow(text) for text in processed_texts].
3. Model training: build an LDA model via LdaModel(...).
4. Model estimation: coherence calculation via CoherenceModel.
5. Topic visualisation: use pyLDAvis to create a topic map.
6. Parsing new text: call lda_model.get_document_topics() for a new document.
A sketch of steps 2–4 is given below.</p>
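        <p>A minimal sketch of steps 2–4, assuming processed_texts holds the lemmatised token lists produced by step 1 (a tiny stand-in sample is used here so the fragment runs on its own):</p>
        <preformat>
from gensim import corpora
from gensim.models import LdaModel, CoherenceModel

# Stand-in for the real lemmatised corpus produced by pre-processing (step 1).
processed_texts = [["президент", "україна", "підтримка"],
                   ["спорт", "матч", "команда", "гол"]]

dictionary = corpora.Dictionary(processed_texts)                   # step 2: dictionary
corpus = [dictionary.doc2bow(text) for text in processed_texts]    # step 2: BoW corpus

lda_model = LdaModel(corpus=corpus, id2word=dictionary,            # step 3: training
                     num_topics=2, passes=10, alpha="auto")

coherence = CoherenceModel(model=lda_model, texts=processed_texts, # step 4: estimation
                           dictionary=dictionary, coherence="c_v")
print("c_v coherence:", coherence.get_coherence())
        </preformat>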
<p>Software name: news parser from the website of the President of Ukraine. Purpose: to
automatically collect news headlines and texts from the "Administration" section. Source:
president.gov.ua. Result of work: a CSV file with news headlines and texts. Language: Python 3.x.
Libraries: selenium, bs4, csv, time. Runtime: on-premises environment (Windows with ChromeDriver
installed). Access method: via Chrome browser control with Selenium.</p>
<p>Parser algorithm:
1. Browser launch: running the Chrome browser in headless mode with the specified user-agent.
2. Page navigation: go to the news page with the ?page=n parameter.
3. Collection of news links: search for blocks .item_stat.cat_stat; from each, the first link is taken.
4. Header extraction: from the tag &lt;h1 itemprop="name"&gt;.
5. Text extraction: from the &lt;div itemprop="articleBody"&gt; tag, all &lt;p&gt; elements.
6. Verification: skip news without text, prevent duplication.
7. Saving: saving the result to the file president_news.csv.</p>
        <p>Design decisions:
– Avoiding duplicates: from each .item_stat.cat_stat block, only the first &lt;a&gt; is taken.
– Selective content collection: text is taken only from articleBody, which excludes
footers/menus/meta.
– Waiting for dynamic content: WebDriverWait is used to wait for the DOM to fully load.
– Work in headless mode: the parser can be run in the background without displaying the browser.
– Flexible scaling: can be expanded to any number of pages (via the pages parameter).</p>
<p>The structure of the output CSV file is shown in Fig. 10. A big problem arose when creating the
parser: the website president.gov.ua uses protection against automated requests (bots). This
protection includes filtering requests from libraries like requests, even with fake headers
(User-Agent). In particular, when using standard parsing through requests and BeautifulSoup, the
server returned a 403 Forbidden response code, which indicates that the request was blocked. To
circumvent this
limitation, the project implemented automation of interaction with the site through the Google
Chrome web browser, using the Selenium WebDriver tool and the ChromeDriver driver. It allows
you to emulate the behaviour of a real user - open pages in the browser, load dynamic JavaScript
content, interact with DOM elements and wait for the page to be fully rendered. The parser works
in headless mode, that is, the browser does not open graphically, but all processes related to the
display and processing of the web page are performed as in a real browser. It allows you to
discreetly bypass anti-bot protection, while maintaining a high processing speed and minimal load
on the system. Also, to minimise detection by security mechanisms, custom headers of HTTP
requests were installed, including User-Agent, Accept-Language, Referer, and others, which
simulate a typical request from an ordinary browser user. In addition, the code implements waiting
(WebDriverWait) so that you do not try to extract information before the site fully loads the
content via JavaScript. Thanks to this solution, the developed software works stably with the
official website of the President of Ukraine, bypassing server checks for the bot and ensuring the
correct extraction of news texts.</p>
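        <p>A condensed sketch of the workaround described above: headless Chrome via Selenium, a browser-like User-Agent, and explicit waits before parsing. The selectors follow the text (.item_stat.cat_stat, the itemprop attributes); the exact news listing URL is an assumption, since the text names only the domain and the ?page=n parameter.</p>
        <preformat>
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = webdriver.ChromeOptions()
options.add_argument("--headless=new")  # no browser window is displayed
options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64)")

driver = webdriver.Chrome(options=options)
driver.get("https://www.president.gov.ua/news?page=1")  # assumed listing URL

# Wait until the JavaScript-rendered news blocks appear in the DOM.
WebDriverWait(driver, 15).until(EC.presence_of_all_elements_located(
    (By.CSS_SELECTOR, ".item_stat.cat_stat")))

# From each block, take only the first link to avoid duplicates.
links = [block.find_element(By.TAG_NAME, "a").get_attribute("href")
         for block in driver.find_elements(By.CSS_SELECTOR, ".item_stat.cat_stat")]

for url in links:
    driver.get(url)
    WebDriverWait(driver, 15).until(EC.presence_of_element_located(
        (By.CSS_SELECTOR, "div[itemprop='articleBody']")))
    title = driver.find_element(By.CSS_SELECTOR, "h1[itemprop='name']").text
    body = " ".join(p.text for p in driver.find_elements(
        By.CSS_SELECTOR, "div[itemprop='articleBody'] p"))
    # ... rows with title and body are appended to president_news.csv here ...

driver.quit()
        </preformat>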
        <p>Software for thematic modelling of Ukrainian-language texts based on the Latent Dirichlet
Allocation (LDA) algorithm has been developed. The developed system allows you to automatically
identify meaningful topics in a collection of texts, interpret them through keywords, and also
classify new documents according to the built model. A feature of the implementation is the use of
the Stanza library for the lemmatisation of Ukrainian texts, which provides deep linguistic data
processing. The program covers the entire cycle of thematic modelling: from collecting and
preprocessing texts to building an LDA model, assessing its quality using coherence metrics, and
visualising the resulting topics. In the process of implementation, mechanisms for automatic
generation of conventional names of topics were also implemented, which significantly facilitates
the perception of the results of the analysis. The functionality of the program includes the ability to
save processed texts, reuse models, view an interactive topic map, and classify new texts in real
time. It has been proven that the model is capable of detecting thematic clusters with high
coherence, which indicates its efficiency and accuracy. Thus, the goals of the work have been
achieved. The developed software is a universal tool for text data analytics. It can be used to solve
practical problems in the fields of journalism, public administration, education, and social
sentiment research.</p>
      </sec>
      <sec id="sec-1-4">
        <title>6. Results</title>
<p>In the modern information space, it is essential to be able to quickly analyse large amounts of
text data and isolate the main topics and content areas from it. One of the key approaches to
solving this problem is thematic modelling of texts, which allows you to automatically detect the
hidden structure of information in a large corpus of documents without manual mark-up. This
approach is based on the use of machine learning algorithms, in particular, Latent Dirichlet
Allocation (LDA), which allows you to break down texts into topics based on statistical patterns in
word distribution. Within the framework of the previous work, full-fledged software for thematic
modelling of Ukrainian-language texts was implemented. The implementation included the stages
of pre-processing of texts, lemmatization using the stanza library, construction of a dictionary of
terms, creation of a corpus in the Bag-of-Words format, training of the LDA model, output of topic
keywords, evaluation of the quality of the model by the coherence metric (c_v), generation of
conditional names of topics and classification of new texts. The study is devoted to checking the
operability of the developed software tool by running a control case. Such an example allows you
to make sure that all modules of the system function in a coordinated manner, the results of the
simulation correspond to the content of the text, and the system correctly classifies new documents
according to the topics that were discovered during the training. The analysis of the control
example allows you to confirm that the results obtained are logical, meaningfully relevant, and
correspond to the task. Thus, the purpose of this work is to launch and analyse a control case
demonstrating the full cycle of the software - from loading a new text to determining its topic, with
the output of topic keywords, probabilistic distribution and interpretation of results.</p>
        <p>The purpose of the control example is to check the operability of the software for thematic
modelling of texts. For this purpose, a test task is formed, which should reflect the key functionality
of the system for determining the subject matter of the Ukrainian-language text on the basis of the
already trained LDA model. The user enters a new Ukrainian-language text that was not included
in the training corpus, and the system should:</p>
        <p>Carry out full pre-processing of the text (cleaning, tokenisation, lemmatisation).</p>
        <p>Convert text to a numeric format according to the already built dictionary.</p>
        <p>Transmit text to the trained LDA model.</p>
        <p>Obtain a probabilistic distribution of topics identified in the previous analysis.</p>
        <p>Identify the topic with the highest probability.</p>
        <p>Output keywords and the automatically generated name of this topic.</p>
        <p>To train the thematic model and further test the software, a corpus of Ukrainian texts, collected
from open sources, in particular, from the official website of the President of Ukraine, was used.
The data is from news, public speeches, event reports, international meetings, decrees, and other
documents covering socio-political topics. At the initial implementation stage, the first dataset of
approximately 300 documents was created. This set made it possible to check the correctness of the
main modules of the system – word processing, dictionary construction, creation of a corpus in the
Bag-of-Words format, model training, knowledge base construction and primary classification.
However, in the analysis process, it was found that the model trained on this set showed
insufficient topic resolution, and the coherence (topic quality metric) was below the desired level.
The topics were often mixed, vaguely defined, or too general. In order to improve the quality of the
model and expand the thematic coverage, a new, significantly larger corpus was created. The
second dataset, which was formed as a result, consisted of more than 830 documents, which made
it possible to provide better statistical representativeness of words and contexts. The new set of
texts covered a wide range of topics: international politics, internal governance, educational issues,
commemoration of historical memory, humanitarian initiatives, etc. The extended corpus was used
for the final training of the LDA model, the classification of texts and the execution of a control
example within this work.</p>
        <p>Each text in the dataset is saved in .csv format, in the text column. Additional metadata, such as
headers or dates, is not used in the model. Thus, all texts underwent the same processing cycle,
which ensured the purity of the experiment and the ability to compare the results. The use of two
different corpora in the development process made it possible to assess the impact of sample size
on the quality of thematic modelling. It also highlights the flexibility and scalability of the software
created, which can work efficiently with corpora of different sizes.</p>
        <p>To check the functionality of the developed software, a separate fragment of Ukrainian text was
selected, which was not included in the training corpus. This approach allows you to
objectively assess the ability of the model to generalise - that is, the ability to apply the formed
topics to new, previously unknown texts. A control example simulates a real situation when the
user submits an arbitrary text for input and expects the system to correctly recognise its content.
The selected fragment refers to the commemoration of the victims of political repression and is a
typical example of official political communication. It has a clear thematic focus and contains
specific vocabulary that allows you to test the model's ability to identify keywords and classify the
document towards the relevant topic. The text was taken from an open source and was not
included in the training dataset in advance, which guarantees the fairness of the test. The control
fragment (in translation) reads: "After the inaugural mass, Pope Leo XIV held an audience with
President of Ukraine Volodymyr Zelenskyy and First Lady Olena Zelenska, the first granted to heads
of state. The President congratulated Pope Leo XIV on the beginning of his pontificate and noted
that he is a hope for millions of people who want peace."</p>
        <p>This example was specifically selected as a control example, since its theme potentially
correlates with one or more topics formed by the LDA model (in particular, with issues related to
historical memory, political repression or state policy in the field of culture). The following sections
will provide a step-by-step analysis of the processing of this fragment, the results of classification,
and an assessment of how the model has determined its topic correctly. A control fragment of the
text was submitted for input to the software to check the full cycle of its processing and
classification. After loading the text, the system automatically carried out all the stages of analysis
in accordance with the logic embedded in the architecture of the software tool. In the first step, text
pre-processing is performed, which includes lowercase, tokenisation, filtering of service words, and
lemmatisation using the Stanza library. It allows you to bring the text to a unified form, where each
word is represented in its basic grammatical form. For example, the phrase "honoured the memory
of the dead" after lemmatisation turns into a sequence of lemmas "honour", "memory", "deceased".
Next, the cleaned text is transformed into a numeric format using a pre-saved dictionary. To do
this, each lemma is replaced with a corresponding numerical identifier, and the frequency of its
appearance in the text is recorded in the Bag-of-Words format. This format allows you to present
text as a vector that the model can interpret as input for thematic analysis. The third step is to
transfer the processed text to the trained LDA model, which conducts the classification. The model
returns the probability distribution of topics formed in the process of previous training. As a result,
a list of topics with corresponding probability values is obtained. The topic is most likely to be
interpreted as the main one to which the input text belongs. At the final stage, the system displays
the topic ID, a list of its keywords, and the generated conditional name formed on the basis of the
detected topic semantics. It allows the user not only to see the numerical results of the
classification, but also to interpret them understandably. Thus, the submitted text goes through a
complete cycle of processing: from natural language to a formalised topic with interpretation. It
confirms the ability of the software to correctly identify the subject of a new document based on an
already trained model.</p>
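        <p>A sketch of this processing cycle, assuming the dictionary and lda_model were restored from the saved .pkl files and that the Stanza Ukrainian model has been downloaded (stanza.download("uk")); the sample text and the stop-word set are placeholders:</p>
        <preformat>
import stanza

nlp = stanza.Pipeline("uk", processors="tokenize,pos,lemma")

def preprocess(text, stop_words=frozenset()):
    """Lowercase, tokenise and lemmatise; keep alphabetic, non-stop lemmas."""
    doc = nlp(text.lower())
    return [word.lemma for sent in doc.sentences for word in sent.words
            if word.lemma and word.lemma.isalpha() and word.lemma not in stop_words]

new_text = "Президент провів зустріч щодо підтримки України."  # placeholder input
lemmas = preprocess(new_text)
bow = dictionary.doc2bow(lemmas)                   # numeric Bag-of-Words form
topic_probs = lda_model.get_document_topics(bow)   # [(topic_id, probability), ...]
best_topic, prob = max(topic_probs, key=lambda tp: tp[1])
print(best_topic, f"{prob:.1%}", lda_model.show_topic(best_topic, topn=5))
        </preformat>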
        <p>In the process of developing software for thematic modelling of texts, there was a need to
optimise the performance of the pre-processing subsystem. One of the key elements at this stage is
the filtering of stop words - that is, those tokens that do not carry a semantic load, but significantly
increase the amount of processing during lemmatisation and vocabulary construction. Initially, we
used a complete list of Ukrainian stop words, containing more than 300 elements. However, when
tested on a full case with more than 800 documents, the processing time exceeded 150 minutes,
which is completely inefficient for practical use. In view of this, a custom optimised list was
created, which includes only the most frequent service words – about 50. It made it possible to reduce
the processing time to 51 minutes without significantly losing the quality of the topics.</p>
        <p>This graph shows how the size of the stop word list affects the processing time of texts during
thematic modelling. As you can see, when using a smaller custom list, the processing of the entire
corpus took 51 minutes, while when using the complete list of stop words, it took more than 150
minutes. It is because a higher number of stop words significantly increases the filtering and
processing time of each token in the text, especially when processing involves lemmatisation. In
this regard, in order to maintain the effectiveness of software execution, it was decided to use a
limited but relevant list of the most frequent service words. It made it possible to significantly
reduce the processing time without a critical loss of simulation quality. This analysis confirms that
thoughtful optimisation during the pre-processing phase has a significant impact on the overall
performance and efficiency of the system.</p>
        <p>For the convenience of the user and control over the execution of the software, the output of the
progress of word processing in real time was implemented. It became essential after expanding the
corpus of texts to more than 800 documents, as the pre-processing time (lemmatisation, filtering,
and tokenisation) increased to tens of minutes. To avoid a situation where the user does not
understand whether the program is "frozen" or really working, a progress bar was added using the
tqdm library, which displays a dynamic scale with the number of documents already processed. It
allows you to visually observe the progress of processing, estimate the pace of execution and
navigate the remaining time until completion. In this way, the output of execution progress has
increased the clarity, predictability, and usability of the system, which is an integral part of
frontend interaction even in console applications.</p>
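        <p>A minimal sketch of this progress display, assuming raw_texts holds the loaded documents and preprocess is the lemmatisation function from the pre-processing step:</p>
        <preformat>
from tqdm import tqdm

# tqdm wraps the iterable and prints a live progress bar with the count,
# rate and estimated remaining time for the documents being processed.
processed_texts = [preprocess(text)
                   for text in tqdm(raw_texts, desc="Pre-processing", unit="doc")]
        </preformat>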
        <p>To optimise performance and avoid re-wasting time on text processing, a mechanism for saving
already processed data has been implemented. It allows the software to run much faster when
reused, especially in the context of experiments, testing models, or changing classification
parameters. After the pre-processing step is completed, all cleaned and lemmatised texts are
automatically saved as a serialised object in .pkl (pickle) format. In particular, the
processed_texts.pkl file stores a list of tokenised texts that have already passed all stages of
preprocessing: lowering, removing stop words, lemmatisation, etc. In the future, when the system
starts, the program first checks whether the file with the processed data exists. If the file is found,
the data is loaded from the disk, and there is no need to process more than 800 documents again,
which can take up to an hour. This approach provides significant resource savings and improves
user experience, especially in environments with limited execution time, such as during
demonstrations, training, or research.</p>
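        <p>A sketch of this caching mechanism, assuming raw_texts and preprocess from the earlier steps:</p>
        <preformat>
import os
import pickle

CACHE = "processed_texts.pkl"

if os.path.exists(CACHE):
    with open(CACHE, "rb") as f:
        processed_texts = pickle.load(f)   # seconds instead of up to an hour
else:
    processed_texts = [preprocess(text) for text in raw_texts]
    with open(CACHE, "wb") as f:
        pickle.dump(processed_texts, f)    # reuse on the next launch
        </preformat>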
        <p>In the process of developing a system of thematic modelling of texts, it was decided to use
machine learning not only to build the model itself, but also to optimally select the number of
topics. It is critically important because too few topics can lead to over-generalisation and loss of
content, while too many topics can lead to excessive division of texts, which reduces the
quality of classification.</p>
<p>For each of the models, a coherence metric (in particular, c_v) was calculated, which shows how
logically the words within a topic are related in terms of semantic proximity, for each candidate
value of num_topics. The quality of the constructed topics was analysed, and among all the options,
the number of topics that provided the highest coherence was chosen. Thus, the decision was not made manually, but on the
basis of an objective indicator of the quality of the model, calculated during the training. Thanks to
this approach, it was possible to achieve a more stable, interpreted, and high-quality thematic
model that confirms the effectiveness of the use of machine learning methods in the tasks of
thematic analysis of texts. In order to determine the optimal number of topics for building a
thematic model, a series of experiments was conducted using machine learning and coherence
metrics (in particular, c_v). In the course of these experiments, 29 LDA models were built with a
different number of topics from 12 to 40. Based on each of the models, the coherence of the
indicator was calculated, reflecting how logically the words in the topic are related to each other
from the point of view of real language usage. The visualisation of the results was presented in the
form of a line graph, where the number of topics is displayed along the X axis, and the coherence
values are displayed along the Y axis. The graph clearly shows the fluctuations in the quality of the
models. The highest coherence values were achieved for the following configurations:
– num_topics=12 → coherence = 0.5135
– num_topics=22 → coherence = 0.5167
– num_topics=27 → coherence = 0.5102
– num_topics=28 → coherence = 0.5025</p>
        <p>It indicates that these configurations describe the topics in the corpus in the most balanced way,
providing high semantic coherence of topic keywords. As can be seen from the graph, too many
topics lead to a decrease in coherence, since the model "blurs" the context between them. Based on
this analysis, the optimal number of topics was selected – 22, which provides the highest coherence
within the framework of the experiment. Thus, the process of choosing the number of topics was not
implemented manually, but based on a quality metric that follows the principles of reasonable
tuning of machine learning models. A sketch of this selection procedure follows.</p>
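        <p>A sketch of the selection procedure, assuming corpus, dictionary and processed_texts come from the earlier steps: one model is trained per candidate value (12–40, i.e. 29 models) and the value with the highest c_v is kept.</p>
        <preformat>
from gensim.models import LdaModel, CoherenceModel

scores = {}
for k in range(12, 41):                       # 29 candidate topic counts
    model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=k, passes=10)
    cm = CoherenceModel(model=model, texts=processed_texts,
                        dictionary=dictionary, coherence="c_v")
    scores[k] = cm.get_coherence()

best_k = max(scores, key=scores.get)          # 22 in the experiments reported here
print(best_k, scores[best_k])
        </preformat>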
<p>This code fragment is implemented at the training stage of the LDA model, which is the basis
for the thematic modelling of texts. Its goal is to create a machine model that will be able to detect
hidden topics in Ukrainian-language texts based on the joint appearance of words in documents.
The parameter num_topics=22 was chosen because a prior automated coherence analysis determined
that 22 topics provided the highest topic quality (coherence ≈ 0.5167). Thanks to the passes=10
parameter, the model passes through the entire corpus 10 times, which gives greater stability of the
topics. Setting alpha='auto' allows the model to independently adapt the distribution of topics in
documents, which is especially useful when working with imperfectly balanced data. A sound signal,
Beep(1000, 1000), is triggered after the completion of the training; it is added for convenience,
since training can take tens of minutes, and the beep removes the need to constantly monitor the
laptop. This stage is key, because it is here that the model is formed (a sketch of the training call
is given after the list below), which will later:
</p>
<p>
 classify new texts by topic;
 allow you to visualise the connections between words;
 serve as the basis for the interpretation and generation of topic names.</p>
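        <p>A sketch of the training call described above (the hyperparameters are those named in the text; winsound is the Windows-only standard library module used for the sound signal):</p>
        <preformat>
from gensim.models import LdaModel
import winsound

lda_model = LdaModel(corpus=corpus, id2word=dictionary,
                     num_topics=22,   # chosen by the coherence analysis (c_v ≈ 0.5167)
                     passes=10,       # ten full passes over the corpus for stability
                     alpha="auto")    # adapt topic proportions to imbalanced data

winsound.Beep(1000, 1000)  # 1000 Hz for one second: training has finished
        </preformat>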
<p>After completing the training of the thematic modelling (LDA) model based on
Ukrainian-language news texts, the system formed 22 topics. Each of the topics is represented by a set of the
most relevant words with corresponding weights reflecting their significance for a particular
topic. These keywords are the result of a probabilistic distribution of words in the corpus and allow
you to gain a deeper understanding of the content of each topic. One example is the theme
dominated by the words "Ukraine", "President", "Volodymyr", "Zelensky", "support" - indicates
political content, in particular related to leadership and international activities. Another topic may
include words like "generation", "youth", "culture", which indicate a completely different semantic
emphasis. The results obtained make it possible to automatically interpret topics, analyse
information flows and structure large volumes of texts. Thematic word distributions are further
used to generate topic names, which makes the models more understandable for the user. It also
opens up the possibility of classifying new texts: the system can determine which topic the newly
received text belongs to, with the corresponding probability. Thus, this stage is critically important
in the entire chain of operation of the software tool, because it is on its basis that a knowledge base
is formed, which provides all the further functionality of analysis, interpretation and visualisation
of text data.</p>
        <p>The image shows an interactive visualisation of the results of thematic modelling created using
the pyLDAvis library. This approach allows you to intuitively understand the structure of the
constructed LDA model and assess how clearly the topics are delineated and which words are the
most characteristic for each of them. The visualisation consists of two parts: the left pane shows a
map of topics, and the right pane shows a list of the most relevant terms for the selected topic. On
the left side of the visualisation, the so-called "Intertopic Distance Map" is displayed, which
demonstrates how topics are arranged in vector space. Each circle represents a different topic, and
its size reflects the proportion of documents related to that topic. The distance between the circles
indicates the similarity of the topics: the closer the circles, the more similar the topics in content,
and if the circles do not intersect, this shows a clear separation of topics. For example, the largest
circle on the graph is topic 1, which occupies the largest share in the corpus of texts. The right
pane lists the 30 most important terms for the selected topic. Light blue bars indicate the total
frequency of use of a word in all texts, while red bars indicate the frequency of this word in the
selected topic. It allows you to see which words are really relevant to a particular topic and not just
frequently used in the corpus. In our case, topic one is characterised by the words "Ukraine",
"president", "Zelensky", "Volodymyr", "support", "state", etc., which indicates political topics related
to state power and the country's leadership. This type of visualisation is beneficial for analysing the
quality of the model, interpreting the content of topics, and later use in the user interface or
reports. It allows not only an analyst, but also an ordinary user without deep knowledge of
machine learning to quickly understand what each topic is about and how well the model divided
the topics of the documents.</p>
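        <p>A sketch of producing this interactive map, assuming the trained model, corpus and dictionary from the earlier steps:</p>
        <preformat>
import pyLDAvis
import pyLDAvis.gensim_models

vis = pyLDAvis.gensim_models.prepare(lda_model, corpus, dictionary)
pyLDAvis.display(vis)                        # inline display in Jupyter Notebook
pyLDAvis.save_html(vis, "lda_topics.html")   # or export as a standalone page
        </preformat>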
        <p>In the course of the implementation of the software for thematic modelling, all documents from
the corpus of texts were divided into topics that the trained LDA model defined. It made it possible
to see which topics are the most common among the analysed texts, as well as to identify less
covered or even highly specialised areas. The image shows the final statistics: each topic
corresponds to a certain number of documents. For example, the most significant number of texts,
208, fell into topic 11. It means that this topic is the most representative of the corpus, and its
content has the most significant information load. Topics 2 (107 documents), 12 (76 documents) and 20 (68
documents) also have a considerable number, suggesting that the texts are mostly centred around a few
leading topics. At the same time, some topics cover only 1-3 documents (for example, topics 21, 4,
3, 10, 16). It may be due to the fact that some texts cover particular events or topics that do not
have a broad representation in the general corpus. Such a distribution is proper both for assessing
the balance of a data set and for further use in analysis, for example, to identify thematic priorities
in news content, to identify topics that require additional attention, or to divide texts into thematic
clusters. It also allows you to form an idea of the thematic coverage, which can be used for
decision-making in a journalistic, informational or analytical context. After the LDA model formed
topics in the form of a set of keywords, there was a need to make them more understandable to a
person. After all, a set of words is just a machine representation, from which it is difficult to
quickly understand what precisely the topic is about. Therefore, a special module was implemented
that automatically generates topic names based on the keywords that characterise them. For
example, if among the keywords of the topic there are often "president", "office", "Volodymyr", then
such a topic can be called "Presidential activity". If the words refer to such concepts as "child",
"protection", "rights", the topic is called "Protection of children's rights", etc. Thus, we do not just
leave topics in the form of machine combinations of words, but transform them into
human-readable titles. It greatly facilitates the perception of modelling results and makes them suitable for
practical application both in reports and in interactive text analysis. The user can now easily
navigate which topic means which without having to analyse a technical set of words. It is an
essential step in "interpreting" the model and bringing the results of machine learning closer to real
use.</p>
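        <p>A sketch of how generate_smart_title(keywords) could work; the keyword-to-title rules below are illustrative assumptions, not the project's actual rule set:</p>
        <preformat>
def generate_smart_title(keywords):
    """Map a topic's keywords to a human-readable name via simple trigger rules."""
    rules = [
        ({"president", "office"}, "Presidential activity"),
        ({"child", "protection", "rights"}, "Protection of children's rights"),
        ({"education", "science"}, "Education and science"),
    ]
    kw = set(keywords)
    for trigger, title in rules:
        if trigger.issubset(kw):       # all trigger words appear among the keywords
            return title
    return ", ".join(keywords[:3])     # fallback: top-3 keywords as the name

print(generate_smart_title(["president", "office", "support"]))
        </preformat>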
        <p>At the final stage of the work, the function of recognising the topic of third-party text by
calculating the probabilities of its belonging to already trained topics was implemented. It made it
possible to assess the practical ability of the built LDA model to classify new documents without
prior reference to the training corpus. An experiment was conducted with a test piece of text
that was not part of the training kit. The distribution of topics for this text was analysed separately
for two models: the one that was trained on a smaller corpus of about 300 texts and the one based
on an extended corpus of 830 articles. For a larger dataset, the central theme received a weight of
49.62%, which indicates the high confidence of the model in the classification.</p>
        <p>On the other hand, on a smaller dataset, the main topic had a similar weight - 48%, but the
second topic was almost equal to it - 46%, which may indicate a lower accuracy of the model due to
a lack of training examples. The graph below shows a comparison of the distribution of topics
between the two models. Visualisation confirms that the growth of the volume of data significantly
improves the clarity of classification and reduces the blurring of results. It also reduces the chance
of misclassification of text between two nearly equivalent topics. Thus, the increase in the learning
corpus directly affects the quality of topic recognition in new documents.</p>
        <p>A test run of the software for thematic modelling of Ukrainian-language texts was carried out,
which confirmed its operability and compliance with the task. The main goal was to check whether
the system built on the basis of the LDA model is able to recognise topics in new texts and provide
a meaningful interpretation of the results. In the course of the work, a test task was formulated - to
automatically determine the topic of the new Ukrainian-language text. For this, a model previously
trained on a large body of news articles was used. Two training options were tested: on a smaller
set (approximately 300 documents) and on a much larger set (more than 800 documents). It made it
possible to see the impact of the amount of data on the accuracy of the distribution of topics. As
the analysis showed, the model trained on a larger dataset demonstrated higher coherence and a
more stable probability distribution, which indicates a higher quality of thematic classification.
During testing, a complete cycle was implemented: pre-processing of the text with lemmatisation,
construction of thematic distribution, interpretation of topics, generation of topic names, and
display of results in a convenient form. It was conveniently organised to control the processing
execution (through process output and completion signals), save data to avoid re-wasting
resources, and visualise the results in the form of diagrams and pyLDAvis graphs. Summing up, it
can be argued that the developed software not only demonstrates correct technical implementation
but is also able to provide flexible, effective thematic modelling of text data. Such a tool can be
helpful for analysts, journalists, researchers, or information systems that require quick orientation
in large arrays of Ukrainian-language texts.</p>
      </sec>
      <sec id="sec-1-5">
        <title>7. Discussion</title>
        <p>When developing software, coherence (a measure of consistency of topics) is compared when using
different amounts of data. For the first dataset (~300 texts), the coherence of the model was 0.462,
while after switching to the extended dataset (~830 texts), it increased to 0.516. It suggests that a
larger body allows the model to better shape topics - the keywords in them are more related, and
the classification results are more resistant to random deviations. Thus, the quality of the model
directly depends on the volume of the training corpus.</p>
        <p>In order not to repeat the lengthy processing process each time, it is implemented to save the
processed case to the processed_texts.pkl file. It allows you to load ready-made data at the
subsequent launch of the program, which reduces the waiting time from tens of minutes to several
seconds. In addition, visual observation of pre-processing progress via tqdm has been implemented,
and a sound signal has been added after the model training is completed, which is convenient for
long calculations.</p>
        <p>The left graph demonstrates how an increase in the volume of the dataset has a positive effect
on the quality of the thematic model. When the first dataset, consisting of ~300 documents, was
used, the coherence of the model (a measure of its thematic consistency) was approximately 0.462.
After expanding the corpus to more than 800 documents, the coherence value increased to 0.516,
indicating an improvement in the quality of the topic classification. The right graph illustrates how
the number of stop words affects the processing time of the text. When using a smaller list of stop
words, the process of pre-processing the entire text took approximately 51 minutes. However, an
attempt to apply a complete extended list led to a significant increase in duration - more than 150
minutes. It showed that, in order to maintain processing efficiency, a balance must be found
between the depth of text clean-up and performance.</p>
        <p>The main goal of this work was not just to check the functionality of the model but also to
assess how stable, fast, and qualitatively it works under different conditions. It was found that the
quality of thematic modelling directly depends on the volume of the training corpus. With an
increase in the number of documents from ~300 to more than 800, the coherence of the model
increased significantly from 0.462 to 0.516, which indicates better structured and accurate topics. In
this way, the model becomes more meaningfully expressive and resistant to mixed themes.
Separately, the performance of the system was analysed, particularly the time required for word
processing. It turned out that the use of a complete list of stop words dramatically increases the
duration of pre-processing from 51 to more than 150 minutes. It was the basis for the decision to
use a shortened, optimised list of stop words, which allows you to maintain a balance between
processing depth and performance. In addition, a number of technical improvements have been
implemented: a progress bar (tqdm), a sound signal about completion, and saving processed texts
to a file (pickle). It made it possible to save time significantly when restarting the program and
made interaction with it more comfortable. Thanks to the analysis, it became apparent that the
created software is not only functional, but also efficient, scalable and suitable for further use in
real tasks of text data analysis. The results of the work confirmed the feasibility of using machine
learning for thematic modelling and the importance of correctly adjusting parameters to achieve
maximum quality.</p>
      </sec>
      <sec id="sec-1-6">
        <title>8. Conclusions</title>
        <p>Software for thematic modelling of Ukrainian-language texts based on the LDA (Latent Dirichlet
Allocation) algorithm was designed, implemented and tested. The system was created from scratch,
taking into account the peculiarities of the Ukrainian language, the specifics of working with text
corpora and the requirements for the interpretation of results for an ordinary user. Several datasets
were built: at the first stage, a test case of about 300 documents, and later a full-fledged extended
case with a volume of more than 830 documents. It made it possible to conclude the effect of the
amount of training data on the quality of the model, in particular, on coherence (which increased
from 0.46 to 0.516 with an increase in the corpus). The system covers all the main stages of text
analysis: pre-processing (cleaning, tokenisation, lemmatisation via stanza), conversion to numerical
format, training a thematic model, building a dictionary of topics, automatic assignment of new
texts to topics, as well as generation of conditional names of topics for user convenience.
Visualisation of results via pyLDAvis was also implemented, which made it possible to better
interpret the topic space and estimate the distances between them. Particular attention was paid to
usability: saving processed data (pickle), displaying processing progress via tqdm, sound
notifications about the completion of calculations, and optimisation of work with stop words.
Thanks to these solutions, the software became not only functional, but also practical in use. After
training the model, the functionality of classifying new (third-party) texts was implemented and
tested. The results demonstrate that the system is able to correctly determine the subject matter of
even those documents that it has not seen before. The results were compared using two variants of
the trained model on smaller and larger datasets. In both cases, the model returned meaningful and
logical results, but with the larger corpus, the results were more stable and more confidently
interpreted. This project is essential in terms of the practical application of natural language
processing and machine learning methods. It proved that it is possible to effectively perform
thematic modelling of Ukrainian-language documents using modern tools (gensim, stanza,
pyLDAvis) even without the use of powerful clusters or ample computing resources. It has been
confirmed that the quality of the LDA model significantly depends on the hull volume, purity, and
quality of pre-processing, optimal selection of the number of topics, and balance between the
completeness of the stop dictionary and the speed of processing. The developed software can be
adapted to other languages, extended for more complex corpora, or integrated into larger systems
such as web applications, dashboards, or content filtering systems. Prospects for further research:
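        <p>The following condensed sketch illustrates the pipeline stages listed above, from stanza
lemmatisation to gensim LDA training, coherence evaluation, pyLDAvis visualisation, and assignment
of an unseen text to topics. Hyperparameter values, variable names, and the choice of the c_v
coherence metric are illustrative assumptions, not the authors' exact settings.</p>
        <preformat>
import stanza
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel, LdaModel
import pyLDAvis
import pyLDAvis.gensim_models

# stanza.download("uk")  # required once before first use
nlp = stanza.Pipeline("uk", processors="tokenize,mwt,pos,lemma")

def lemmatise(text):
    # Lowercased lemmas of alphabetic tokens only.
    doc = nlp(text)
    return [w.lemma.lower() for s in doc.sentences for w in s.words
            if w.lemma is not None and w.lemma.isalpha()]

raw_texts = ["..."]  # assumed: the document corpus as a list of strings
new_text = "..."     # assumed: an unseen (third-party) document

docs = [lemmatise(t) for t in raw_texts]
dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]

# Illustrative hyperparameters; as discussed above, the optimal number
# of topics depends on the corpus.
lda = LdaModel(corpus=corpus, id2word=dictionary,
               num_topics=10, alpha="auto", passes=10)

# Coherence evaluation (c_v shown here; the exact metric used for the
# 0.46 -> 0.516 comparison is an assumption).
coherence = CoherenceModel(model=lda, texts=docs, dictionary=dictionary,
                           coherence="c_v").get_coherence()

# Interactive visualisation of the topic space, saved as HTML.
pyLDAvis.save_html(pyLDAvis.gensim_models.prepare(lda, corpus, dictionary),
                   "lda.html")

# Assign an unseen document to the trained topics.
print(lda.get_document_topics(dictionary.doc2bow(lemmatise(new_text))))
        </preformat>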
        <p>Prospects for further research:</p>
        <p>• Integration of other topic models, such as BERTopic or NMF with modern vector
representations (e.g. BERT or FastText), which could increase the accuracy and flexibility of
topic definition.</p>
        <p> Evaluation of the quality of the model by the user - implementation of feedback
mechanisms (for example, if the user agrees or disagrees with the topic assigned to the
text).</p>
        <p> Analysis of the dynamics of topics over time - identifying how popular issues in
the news stream or publications change over periods.</p>
        <p> Clustering of users or sources - based on the topics they produce or read; it is
possible to build recommended systems.</p>
        <p> Deeper coherence research involves various metrics (u_mass, c_npmi) and manual
evaluation by experts.</p>
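        <p>As a sketch of the metric comparison suggested above, reusing the lda, corpus, docs, and
dictionary names assumed in the earlier pipeline sketch: u_mass is computed from the corpus alone,
while c_v and c_npmi require the tokenised texts.</p>
        <preformat>
from gensim.models import CoherenceModel

for metric in ("u_mass", "c_v", "c_npmi"):
    # u_mass works from co-occurrence counts in the bag-of-words corpus;
    # the sliding-window metrics (c_v, c_npmi) need the tokenised texts.
    cm = CoherenceModel(model=lda,
                        corpus=corpus if metric == "u_mass" else None,
                        texts=None if metric == "u_mass" else docs,
                        dictionary=dictionary,
                        coherence=metric)
    print(metric, round(cm.get_coherence(), 3))
        </preformat>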
        <p>The study resulted in a complete, functional and optimised system for thematic analysis of
Ukrainian-language texts, combining elements of machine learning, natural language processing, and
visual analytics. Work on the project deepened practical skills in building NLP models, optimising
code, and interpreting results. Beyond the practical outcome, it was also a meaningful learning
experience, forming a basis for more complex research or commercial solutions in the future.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>Acknowledgements</title>
    </sec>
    <sec id="sec-3">
      <title>Declaration on Generative AI</title>
      <p>The authors have not employed any Generative AI tools.</p>
    </sec>
  </body>
</article>