<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title/>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.1111/rsp3.12632</article-id>
      <title-group>
        <article-title>Probabilistic thematic modelling of Ukrainian-language texts based on the Latent Dirichlet Allocation algorithm</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Victoria Vysotska</string-name>
          <email>Victoria.A.Vysotska@lpnu.ua</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Denys Ptushkin</string-name>
          <email>denys.ptushkin.sa.2022@lpnu.ua</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rostyslav Fedchuk</string-name>
          <email>rostyslav.b.fedchuk@lpnu.ua</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Roman Lynnyk</string-name>
          <email>roman.o.lynnyk@lpnu.ua</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <addr-line>Vinnytsia</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Lviv Polytechnic National University</institution>
          ,
          <addr-line>Stepan Bandera 12, 79013 Lviv</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <volume>3722</volume>
      <issue>33</issue>
      <fpage>56</fpage>
      <lpage>75</lpage>
      <abstract>
        <p>The article presents the results of a study of methods for thematic modelling of texts using the Latent Dirichlet Allocation (LDA) algorithm on a Ukrainian-language corpus of documents. The proposed model automatically detects hidden topics in large volumes of unstructured text data without prior labelling. The model was implemented in Python using the Gensim and pyLDAvis libraries. Perplexity and coherence metrics were used to assess the quality of the model; they showed that the optimal number of topics depends on the characteristics of the corpus and on the hyperparameters α and β. The results demonstrate the suitability of the method for a wide range of applied tasks – analysis of user reviews, media analytics, classification of scientific publications, and monitoring of social networks. A comparative study with alternative approaches (K-means, NMF, BERTopic, transformer models) showed that LDA provides the best balance between interpretability, speed, and computational efficiency. The developed program module, the "Thematic Analysis Module", implements an automated system for thematic modelling that can be used both in scientific research and in analytical information systems.</p>
      </abstract>
      <kwd-group>
        <kwd>thematic modelling</kwd>
        <kwd>Latent Dirichlet Allocation</kwd>
        <kwd>LDA</kwd>
        <kwd>natural language processing</kwd>
        <kwd>machine learning</kwd>
        <kwd>probabilistic model</kwd>
        <kwd>coherence</kwd>
        <kwd>TF-IDF</kwd>
        <kwd>Gensim</kwd>
        <kwd>Ukrainian-language corpus of texts</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Today, humanity is in an information glut: a massive amount of text data is generated every day – news, scientific publications, messages in social networks, forums, blogs, and instant messengers. This information is often unstructured and challenging to subject to classical analysis, which creates the need for automated tools to classify, sort, filter, and understand it. One of the most effective modern methods of analysing such texts is thematic modelling. It automatically detects topics hidden in texts based on the probabilistic distribution of words. For example, without having to read thousands of product reviews, you can automatically discover that people most often talk about "price", "quality", "delivery", "packaging", etc. This approach is actively used in the following areas:</p>
      <p>Journalism and media analytics – to track information campaigns and trends in the media;</p>
      <p>Business and marketing – to analyse user reviews, surveys, and customer feedback;</p>
      <p>Science – classification of scientific publications by topic;</p>
      <p>Public administration – monitoring of public moods and thematic appeals of citizens;</p>
      <p>Education – automatic classification of educational materials.</p>
      <p> Thus, thematic modelling is one of the key natural language processing (NLP)
tools, allowing you to efficiently work with large text arrays without the need for manual
processing.</p>
      <p>The purpose of this work is the in-depth development of information technology for thematic modelling of texts, as well as the practical implementation of the thematic model on a specific corpus of Ukrainian-language documents using Python tools. During the work, it is planned to investigate how the pre-processing of texts and the choice of the number of topics, algorithms, and parameters affect the quality of the thematic model, as well as to analyse the practical results of modelling and the possibilities of their application in a real environment.</p>
      <p>To achieve the goal, it is necessary to solve the following tasks:</p>
      <p>Analyse the literature on thematic modelling (LDA, NMF, PLSA).</p>
      <p>Choose a corpus of texts for modelling (e.g., news, articles, forum posts).</p>
      <p>Clean up the data – remove HTML tags, numbers, punctuation, stop words.</p>
      <p>Perform lemmatisation or stemming (if necessary, in Ukrainian).</p>
      <p>Create a Bag-of-Words or TF-IDF matrix.</p>
      <p>Build LDA models with different numbers of topics.</p>
      <p>Visualise the results obtained.</p>
      <p>Analyse the interpretation of topics.</p>
      <p>Compare the quality of models by coherence.</p>
      <p>The object of research is the text corpus – a set of documents in natural language (in our case, Ukrainian). These can be news, social messages, reviews, scientific articles, product descriptions, etc. Such texts are unstructured, which makes it difficult to analyse them without preprocessing. That is why the object of research is interesting from the point of view of practical data processing. The subject of the study is algorithms and methods of thematic modelling, in particular:</p>
      <p>Latent Dirichlet Allocation (LDA);</p>
      <p>Non-Negative Matrix Factorisation (NMF);</p>
      <p>Probabilistic Latent Semantic Analysis (PLSA);</p>
      <p>TF-IDF and Bag-of-Words for text representation;</p>
      <p>Quality assessment metrics: coherence, perplexity.</p>
      <p>Although thematic modelling is a well-known technique, its application to Ukrainian-language texts has not yet been sufficiently researched. Most libraries and examples focus on English-language content. Therefore, the novelty of this work lies in:</p>
      <p>Implementation of thematic modelling specifically for the Ukrainian language;</p>
      <p>Comparison of models with different numbers of topics for a real case;</p>
      <p>Application of modern methods of pre-processing of Ukrainian-language texts (for example, through langdetect, pymorphy2-uk or Stanza);</p>
      <p>Visualisation of results and analysis of the correspondence of topics to the real content of documents.</p>
      <p>Also, the novelty lies in the application of coherence to automatically assess the quality of the model without human intervention. The developed model has a number of real-world applications:</p>
      <p>Information systems (filtering news, searching by topics, classification of documents).</p>
      <p>Education (automatic grouping of educational materials by topic).</p>
      <p>Marketing (classification of customer reviews by topic to identify pain points).</p>
      <p>Science (analysis of scientific publications and identification of new research trends).</p>
      <p>Security (monitoring social media to identify radical topics).</p>
      <p>Electronic democracy (analysis of citizens' appeals in petitions, complaints, and forums).</p>
      <p>The model is universal and can be adapted to any subject area containing large amounts of textual information. It is necessary to investigate the methods of thematic modelling of texts, in particular the Latent Dirichlet Allocation (LDA) algorithm, which automatically identifies the main topics in a large amount of text data. Preliminary processing of the corpus of documents was carried out, a thematic model was built, and its results were analysed. The study confirmed the effectiveness of thematic modelling as a tool for classifying and analysing unstructured texts. The practical implementation of the model demonstrated that this approach can be used in various fields – from journalism and marketing to science and education. The results showed that the quality of thematic modelling depends on the pre-processing of the data, the choice of the number of topics, and the parameters of the model. Thus, the work contributed to the consolidation of knowledge in computational linguistics and to practical skills in natural language processing.</p>
      <sec id="sec-1-1">
        <title>2. Related works</title>
        <p>In today's information age, society generates enormous amounts of text data every day. News sites, social networks, forums, emails, blogs, user reviews, documents – all this creates a powerful flow of information that needs to be stored, processed, and analysed. According to analytical agencies, tens of millions of new texts of various formats are created every day in the world, and this trend is only growing, making manual analysis infeasible. There is an urgent need for tools that can automatically reveal meaning and structure in unstructured text.</p>
        <p>One of the most promising areas in this area is thematic modelling of texts – a method of
identifying hidden thematic structures in large amounts of text data. Thematic modelling allows
you to understand what the documents are about, without the need to read them thoroughly. It
automatically classifies texts by content, highlights key topics, and allows you to visualise the
results, which significantly simplifies analysis.</p>
        <p>The principle of thematic modelling is that each document consists of a particular set of topics, and each topic consists of a specific set of words. For example, if the system analyses a news corpus, it can detect topics such as "politics", "economy", "sports", and "education", even if these labels are not set manually. Thematic modelling algorithms, in particular Latent Dirichlet Allocation (LDA), are based on statistical patterns of the joint appearance of words in texts and are able to automatically find relationships between words and group them into meaningful topics. The relevance of this topic is due not only to the rapid growth of textual data but also to the need to interpret it effectively. In many fields, from media and journalism to education, marketing, and research, thematic modelling is becoming an indispensable tool. It allows you to:</p>
        <p>Analyse large amounts of news;</p>
        <p>Identify trends in social networks;</p>
        <p>Carry out automatic classification of documents;</p>
        <p>Segment customer reviews by topic;</p>
        <p>Build dashboards for decision-making.</p>
        <p>
          Latent Dirichlet Allocation (LDA) is a classical probabilistic generative topic model proposed in [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. LDA formalises a document as a mixture of topics and a topic as a distribution of words; it was this work that laid the mathematical foundation for most of the further research in the field. Its advantages are ease of interpretation, relative ease of implementation, and low hardware requirements; the disadvantage is a weak ability to capture context (sequence/order) and problems with short texts.
        </p>
        <p>
          Other classical methods – PLSA and NMF – use linear/probabilistic factorisations of the document-term matrix [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. NMF sometimes gives more stable and interpretable themes on a small corpus, but lacks the Bayesian regularisation of LDA and can be sensitive to noise. Comparative studies show that no "classic" dominates universally – the choice depends on the size of the corpus, the length of the documents, and the goals of the analysis.
        </p>
        <p>
          Modern approaches: embedding representations and hybrids:
1. BERTopic is a practical cluster-embedding approach that combines transformer embeddings of documents (BERT-like) with density-based clustering and c-TF-IDF for describing topics [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. BERTopic shows good semantic coherence of topics, especially on short and variable texts (tweets, comments), but requires more resources and depends on the quality of the embeddings.
        </p>
        <p>
          2. Contextualised Topic Models (CTM) and their development – methods that combine the BoW part with contextual embeddings (BERT) in variational autoencoders [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. They increase the coherence of topics compared to classical LDA, especially on data where context significantly changes the meaning of words. CTM and its derivatives (improved through negative sampling, pretraining, etc.) are now actively researched and often give better NPMI/UMass results than LDA.
        </p>
        <p>3. Top2Vec / embedding-based clustering – an approach in which document and word embeddings are used to simultaneously identify topics and semantic centres (without an explicit choice of K). It works well for large corpora with moderate document lengths. The downside is that interpreting topics sometimes requires additional c-TF-IDF or manual filtering.</p>
        <p>
          The general trend in recent years has been to replace or supplement purely frequency-based representations (BoW/TF-IDF) with contextual embeddings (BERT, MiniLM, etc.). This improves the quality of topics (semantics), but increases computational costs and can complicate interpretation in some cases [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ].
        </p>
        <p>
          In comparative work, a set of metrics is most often used: perplexity (a probabilistic measure), coherence (UMass, Cv) and NPMI [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. Almost all modern research emphasises that perplexity and coherence sometimes conflict (perplexity can decrease while coherence deteriorates), so it is recommended to use a combination of metrics to choose the optimal model and number of topics.
        </p>
        <p>
          Although most of the methodological works are tested on English-language corpora (20 Newsgroups, Wikipedia, ArXiv abstracts), there are more and more publications dedicated to Ukrainian-language corpora. A study of themes in folk songs of Podillia (a case study) [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] applied LDA to folklore texts; the authors note the importance of lemmatisation and morphological normalisation given the productive word formation of the Ukrainian language [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. An analysis of discussions and media coverage of the Russo-Ukrainian war showed [7] that for social networks/tweets it is advisable to compare LDA with embedding-based models (BERTopic/CTM): transformer approaches are better at catching context and nuances, while LDA gives more stable clusters for a large number of short, noisy messages. These studies emphasise two theses that are important for Ukrainian: (1) pre-processing (lemmatisation, removal of inflectional forms, stop words) significantly affects the quality of topics; (2) the choice of model depends on the genre of the texts – for long forms (articles), LDA/NMF work well; for short/social media, CTM/BERTopic/Top2Vec gives a better semantic grouping [7].
        </p>
        <table-wrap id="table-1">
          <label>Table 1</label>
          <caption>
            <p>Comparison of topic modelling approaches.</p>
          </caption>
          <table>
            <thead>
              <tr>
                <th>Method</th>
                <th>Approach</th>
                <th>Advantages</th>
                <th>Disadvantages</th>
                <th>Typical use</th>
              </tr>
            </thead>
            <tbody>
              <tr>
                <td>LDA (BoW)</td>
                <td>Bayesian generative model</td>
                <td>Interpretability, low resources</td>
                <td>Weak context, problems with short texts</td>
                <td>Basic reference method; suitable for long documents; requires lemmatisation</td>
              </tr>
              <tr>
                <td>NMF</td>
                <td>Linear factorisation of the document-term matrix</td>
                <td>Simplicity, sometimes better stability with small data</td>
                <td>Sensitivity to noise, no priors</td>
                <td>Alternative to LDA on small corpora</td>
              </tr>
              <tr>
                <td>BERTopic</td>
                <td>Transformer embeddings with clustering and c-TF-IDF</td>
                <td>Good semantic coherence, especially on short texts</td>
                <td>Resource-intensive, depends on embedding quality</td>
                <td>Short, noisy texts (tweets, comments)</td>
              </tr>
              <tr>
                <td>Top2Vec / embedding clustering</td>
                <td>Coordination of documents and words in an embedded space</td>
                <td>Does not need a prior K, good semantic centres, quick</td>
                <td>Interpretation is sometimes more complicated</td>
                <td>Large corpora, overview of themes</td>
              </tr>
            </tbody>
          </table>
        </table-wrap>
        <p>
          LDA remains a "practical standard" – it provides interpretable topics and serves as a good baseline for any thematic analysis [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. It is especially valuable when resources are limited or when the results must be explained to a non-professional audience.
        </p>
        <p>
          Contextual models (CTM, BERTopic) show a marked improvement in the semantic quality of topics (NPMI/Cv), especially on short or highly contextual texts [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. If the project can afford the computational cost, these approaches give a better interpretation of the topics.
        </p>
        <p>
          Assessment should be multidimensional [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. Perplexity ≠ coherence: in practice, it is advised to minimise perplexity and simultaneously maximise NPMI/Cv (or perform human validation for the most important topics).
        </p>
        <p>
          The peculiarities of the Ukrainian language (morphology, inflexion, word formation) make high-quality linguistic preprocessing critical: tokenisation, lemmatisation (pymorphy2-uk / Stanza / spaCy pipelines), removal of stop words, and filtering of n-grams. Studies on Ukrainian corpora confirm that without such preprocessing the quality of topics drops sharply [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ].
        </p>
        <p>
          Practical recommendations for research:
1. Implementation of LDA as a baseline (Gensim) after careful linguistic pre-processing
(lemmatisation, stop words, removal of frequent noise tokens), estimation of perplexity and
NPMI/Cv on the K grid [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ].
        </p>
        <p>
          2. BERTopic testing (with Ukrainian/multilingual embeddings – mBERT or lightweight MiniLM
models) and CTM – comparison of NPMI and Cv; on short texts, BERTopic is expected to win [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ].
        </p>
        <p>3. Validation: in parallel, a small manual assessment (human judgment) of 10-20 topics gives a qualitative check of the metrics.</p>
        <p>4. Resources: if resources are constrained, use LDA/NMF; if transformers can be run, CTM/BERTopic will give better semantics.</p>
        <p>5. Documentation: fixing hyperparameters (α, β, minimum frequency of terms, seed) so that the
results are reproducible [9].</p>
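        <p>As an illustration of point 1 above, the following is a minimal sketch of the K-grid evaluation with Gensim; it assumes that texts is a list of already lemmatised and cleaned token lists, and all names and parameter values are illustrative rather than the settings used in this work.</p>
        <preformat>
# Minimal K-grid sketch (assumption: `texts` is a list of lemmatised token lists).
from gensim.corpora import Dictionary
from gensim.models import LdaModel
from gensim.models.coherencemodel import CoherenceModel

dictionary = Dictionary(texts)
dictionary.filter_extremes(no_below=5, no_above=0.5)   # drop rare and overly frequent terms
corpus = [dictionary.doc2bow(doc) for doc in texts]

for k in [5, 10, 15, 20, 25]:                          # grid over the number of topics K
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=k,
                   alpha='auto', eta='auto', passes=10, random_state=42)
    cv = CoherenceModel(model=lda, texts=texts, dictionary=dictionary,
                        coherence='c_v').get_coherence()
    npmi = CoherenceModel(model=lda, texts=texts, dictionary=dictionary,
                          coherence='c_npmi').get_coherence()
    bound = lda.log_perplexity(corpus)                 # per-word likelihood bound
    print(f"K={k}: bound={bound:.3f}, Cv={cv:.3f}, NPMI={npmi:.3f}")
        </preformat>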
        <p>Classical approaches to topic modelling (LDA, PLSA, NMF) formalise a document as a mixture of topics and a topic as a distribution of words, and they are widely used as a baseline in thematic analysis studies. Having settled on LDA, we are guided by its interpretability and stability on large corpora. Modern approaches (BERTopic, Contextualised Topic Models, Top2Vec) combine contextual embeddings and clustering, which increases the semantic coherence of topics but requires more computing resources. Evaluation of models is carried out by perplexity and coherence (UMass, NPMI, Cv), since a combined approach to validation gives the most reliable results. For Ukrainian-language corpora, the importance of lemmatisation and morphological normalisation is additionally emphasised (case studies: Blei et al. 2003; BERTopic; CTM and comparative studies).</p>
        <p>Within the framework of this work, the development of a thematic model of texts is considered, which allows key topics to be automatically singled out from a large corpus of Ukrainian-language texts. The focus is on the LDA (Latent Dirichlet Allocation) algorithm, which is one of the most common and at the same time most interpretable methods of thematic analysis. A comparison of this approach with other methods such as clustering, classification, and modern neural approaches (transformers) will also be carried out, and the advantages and disadvantages of each technique will be identified. Special attention is paid to the formulation of the problem that the proposed thematic model is designed to solve. First of all, it is about automating the understanding of text data in situations where labels are missing and human analysis is too costly or impossible. Thus, this section lays the theoretical and methodological basis for the implementation of the work, demonstrating not only technical aspects but also the strategic significance of thematic modelling in the digital information age.</p>
        <p>Within the framework of this work, a tool for thematic modelling of texts is being developed,
the primary purpose of which is to automatically identify content topics in the corpus of
documents without preliminary markup or manually specified categories. This approach allows
you to better understand the structure and content of large text arrays, identify hidden patterns,
and optimise the content analysis process.</p>
        <p>The product being developed is a thematic model built using the Latent Dirichlet Allocation (LDA) algorithm, which belongs to the category of probabilistic models. LDA allows each document to be represented as a combination of several topics, and each topic as a set of keywords with appropriate weights. Based on the statistics of the co-occurrence of words in different documents, the model identifies the words that most often occur together and groups them into topics. This approach is beneficial in cases where the structure of the texts is not strictly defined and manual classification is too costly or subjective. The texts were chosen so that they cover a wide range of topics, including politics, technology, education, health, and economics; this choice is justified by the fact that, in real conditions, texts are of a mixed nature and often include several topics at the same time, so high-quality thematic modelling should take this into account. Texts first undergo standard processing: clearing punctuation and special characters; lowercasing; removing stop words; lemmatisation (if necessary); tokenisation. The product is developed in the Python programming language, using the following libraries:
1. Gensim – a library for building LDA models and working with text data;
2. pyLDAvis – visualisation of the constructed topics (interactive graphs that show the placement of topics in vector space);</p>
        <p>3. NLTK / spaCy / Stanza – for pre-processing of texts: tokenisation, lemmatisation, removal of stop words;
4. Pandas – convenient work with text datasets;
5. Matplotlib / Seaborn – additional visualisation of results.</p>
        <p>This stack of tools allows the complete cycle of thematic modelling to be implemented effectively – from text processing to visual analysis of results.</p>
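        <p>A minimal end-to-end sketch of this cycle with the stack above follows; the raw_docs input and the abbreviated stop-word list are illustrative placeholders, not the corpus or resources used in this work.</p>
        <preformat>
# End-to-end sketch (assumptions: `raw_docs` is a list of Ukrainian-language strings;
# the stop-word list is a tiny illustrative stub).
import re
from gensim.corpora import Dictionary
from gensim.models import LdaModel

STOP_WORDS = {"і", "та", "в", "на", "що", "з", "до", "як"}

def preprocess(text):
    text = text.lower()                                   # lowercasing
    tokens = re.findall(r"[а-щьюяґєіїa-z']+", text)       # strip punctuation, numbers, symbols
    return [t for t in tokens if t not in STOP_WORDS and len(t) > 2]

texts = [preprocess(doc) for doc in raw_docs]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(doc) for doc in texts]       # Bag-of-Words representation

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=10,
               passes=10, random_state=42)

for topic_id, words in lda.show_topics(num_topics=10, num_words=5, formatted=False):
    print(topic_id, [w for w, _ in words])                # top keywords per topic
        </preformat>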
        <p>In the field of natural language processing (NLP), there are several methods that, to some extent,
perform the function of grouping, classifying or summarising text documents. Although thematic
modelling, in particular based on LDA, is a specialised approach to identifying topics, it is worth
considering other methods that can act as its counterparts in specific contexts.</p>
        <p>1. K-means clustering is one of the most popular methods of unsupervised learning, which distributes objects (in our case, documents) into groups called clusters. The algorithm tries to minimise the distance between documents within the same cluster and maximise the distance between different clusters. Each document is represented as a vector (for example, based on TF-IDF), and the cluster itself is defined through its centre of mass. Advantages:</p>
        <p>Easy to implement and quick to train.</p>
        <p>Does not require labels (unsupervised).</p>
        <p>Scales well for large amounts of text.</p>
        <p>Disadvantages:</p>
        <p>Clusters do not have a clear, meaningful description (there is no list of words as in LDA).</p>
        <p>It is challenging to interpret what each cluster is about.</p>
        <p>Does not take topics into account, only "groups of similar texts".</p>
        <p>K-means groups texts by similarity, while LDA detects semantic themes within texts. Clusters are "similar documents", topics are "similar words".</p>
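        <p>For contrast, a minimal K-means sketch with scikit-learn (reusing the preprocessed texts from the pipeline sketch above; cluster counts and feature limits are illustrative):</p>
        <preformat>
# K-means clusters documents by vector similarity; it yields no word distributions,
# so the closest substitute for "topics" is the terms nearest each cluster centre.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [" ".join(tokens) for tokens in texts]
tfidf = TfidfVectorizer(max_features=5000)
X = tfidf.fit_transform(docs)                       # TF-IDF document vectors

km = KMeans(n_clusters=10, n_init=10, random_state=42)
labels = km.fit_predict(X)

terms = tfidf.get_feature_names_out()
for i, centre in enumerate(km.cluster_centers_):
    top = centre.argsort()[-5:][::-1]               # highest-weight terms near the centre
    print(f"cluster {i}:", [terms[j] for j in top])
        </preformat>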
        <p>2. Classification of texts (SVM, Naive Bayes) – these algorithms belong to supervised learning, which requires pre-labelled data. Each text must have a predefined category (e.g. "sports", "education", "politics"), and the model learns to recognise these categories in new examples.</p>
        <p>Advantages:</p>
        <p>High accuracy with proper data preparation.</p>
        <p>Easy to use (especially Naive Bayes).</p>
        <p>Works well with short texts.</p>
        <p>Disadvantages:</p>
        <p>Does not work without labels – a large number of documents must be classified manually for training.</p>
        <p>It does not detect new topics; it works only with those that are already known.</p>
        <p>Less flexible in a dynamic environment (changing topics requires retraining).</p>
        <p>The classification requires tagged training data, while LDA is fully automated and suitable for
exploring new, previously undefined topics.</p>
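        <p>A minimal supervised sketch for this comparison (scikit-learn; docs and labels are assumed to be a pre-labelled dataset, which is exactly what LDA does not require):</p>
        <preformat>
# Naive Bayes baseline: learns only the categories present in `labels` and
# cannot discover new topics, unlike LDA.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(docs, labels,
                                                    test_size=0.2, random_state=42)
clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
clf.fit(X_train, y_train)
print("accuracy:", clf.score(X_test, y_test))
        </preformat>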
        <p>3. Transformers (BERT, GPT, BERTopic) – transformer-based models are modern approaches in NLP that take into account the context of an entire sentence or text. Models like BERT (Bidirectional Encoder Representations from Transformers) generate vector representations of texts that preserve semantics at a deeper level. BERTopic is an example of thematic modelling that combines BERT and clustering. Advantages:</p>
        <p>High-quality results.</p>
        <p>Taking into account the context and order of words.</p>
        <p>The ability to analyse the nuances of language, synonyms, and irony.</p>
        <p>Disadvantages:</p>
        <p>Need for powerful hardware (GPU/TPU).</p>
        <p>Complexity of implementation (not "out of the box").</p>
        <p>Weak interpretability (results are difficult to explain – a "black box").</p>
        <p>Transformers are stronger in quality, but more challenging to implement. LDA loses in accuracy, but wins in simplicity, interpretability, and resources.</p>
        <p>4. Alternative thematic models: NMF and PLSA. NMF (Non-negative Matrix Factorisation) decomposes the document-term matrix into two smaller matrices that reflect topics and word distributions. It works similarly to LDA, but is based on linear algebra rather than probabilities. Advantages:</p>
        <p>A simple approach without complicated statistics.</p>
        <p>Can give clear topics for small corpora.</p>
        <p>Disadvantages:</p>
        <p>Themes are less stable when data changes.</p>
        <p>Less interpretable compared to LDA.</p>
        <p>PLSA (Probabilistic Latent Semantic Analysis) is a precursor to LDA – a statistical model that also identifies topics by word distributions in documents. Its advantage is that, as the theoretical "foundation" for LDA, it is theoretically powerful. Disadvantages:</p>
        <p>The model is prone to overfitting.</p>
        <p>Does not scale to large amounts of data.</p>
        <p>Does not allow new documents to be modelled without re-learning.</p>
        <p>NMF is simpler, PLSA is theoretically deeper, but both are inferior to LDA in flexibility, scalability, and resilience.</p>
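        <p>A corresponding NMF sketch on the same TF-IDF matrix (scikit-learn; reuses the X and terms names from the K-means sketch above):</p>
        <preformat>
# NMF factorises the document-term matrix X into document-topic weights W
# and topic-term weights H, using linear algebra rather than probabilities.
from sklearn.decomposition import NMF

nmf = NMF(n_components=10, init='nndsvd', random_state=42)
W = nmf.fit_transform(X)        # shape (M, K): document-topic weights
H = nmf.components_             # shape (K, N): topic-term weights

for i, row in enumerate(H):
    top = row.argsort()[-5:][::-1]
    print(f"topic {i}:", [terms[j] for j in top])
        </preformat>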
        <p>The advantages and disadvantages of these approaches can be summarised as follows. LDA produces themes that can be interpreted and requires small resources, although it can be slow on large corpora. Supervised classifiers (SVM, Naive Bayes) work well with labels and achieve accuracy in narrow tasks, but are unable to work with unstructured topics and do not scale to new themes. Transformers deliver high quality, but need fine-tuning and substantial resources (GPU, TPU), their themes are often uninterpretable, and it is difficult to explain the reason for a classification (a black box).</p>
        <p>Thematic modelling, especially in the implementation of Latent Dirichlet Allocation (LDA), has a number of significant advantages that make it a versatile and convenient tool for analysing large corpora of text data. Unlike other approaches (e.g., classification or transformers), LDA strikes an optimal balance between interpretability, automation, and efficiency.</p>
        <p>1. Unsupervised learning. One of the most valuable properties of thematic modelling is its independence from labelled data. The algorithm does not require prior manual classification of documents – that is, there is no need to create a training sample where each text is manually assigned to a specific topic. This is essential in cases where:</p>
        <p>Labels are difficult or expensive to obtain.</p>
        <p>The subject matter of the data changes over time.</p>
        <p>It is necessary to explore a new, unexplored corpus of texts.</p>
        <p>Thus, thematic modelling is an indispensable tool for exploratory analysis, when it is necessary
to find out: "what the texts are about", and not just classify them into already known categories.</p>
        <p>2. Visualisation capability. Modern libraries, including pyLDAvis, make it easy to visualise the results of thematic modelling. This opens up opportunities for intuitive analysis even for users without technical training. Thanks to visualisation, you can:</p>
        <p>See how topics are placed in a vector space.</p>
        <p>Evaluate which words are key for each topic.</p>
        <p>Check which documents belong to which topics and how strongly.</p>
        <p>Explore the intersections between topics (the more topics overlap, the more similar they are).</p>
        <p>It makes thematic modelling a powerful tool for data analytics and presentation.</p>
        <p>3. Flexibility of the model. The user independently sets the number of topics that the model should find. This allows the analysis to be adapted to different tasks:</p>
        <p>If you need a general overview, you can choose a smaller number of topics (for example, 5-10).</p>
        <p>If you need detail, the model can be reconfigured for 20-30 topics.</p>
        <p>In addition to the number of topics, you can flexibly customise: the number of keywords in a topic; the alpha and beta distribution parameters (affecting the "smearing" of topics across documents); the filtering of rarely used or commonplace words.</p>
        <p>This flexibility allows the model to be optimised for a specific type of content or business task.
4. Interpretation of results. Unlike many modern models (especially transformers), LDA provides transparency in the results. Each topic is clearly expressed in the form of a set of words, and each document has a distribution of topics with the weight of each of them. This makes it possible to:</p>
        <p>Quickly describe the essence of the topic (by keywords).</p>
        <p>Understand how the content of the document is related to the issues.</p>
        <p>Check the logic of the results based on human intuition.</p>
        <p>Justify the conclusions of analytics to customers or management.</p>
        <p>LDA models are one of the few in machine learning that can be explained and defended in front
of a non-professional audience.</p>
        <p>5. Efficiency and low resource requirements. LDA models do not require significant computing power. They can be launched:</p>
        <p>on a regular laptop or server without a GPU;</p>
        <p>with small corpora of texts (even several hundred documents);</p>
        <p>with limited RAM.</p>
        <p>It opens up access to thematic analysis for small companies, research projects, and university
laboratories. Even for educational purposes, LDA is an excellent demonstration of how text
analytics works in the real world.</p>
        <p>In addition to the implementation of thematic modelling through open libraries (for example,
Gensim and LDA), there are many ready-made commercial or SaaS solutions on the market that
provide the functionality of automatic analysis of text topics. Such services, as a rule, are aimed at
business intelligence, automation of feedback processing, customer requests, social networks, etc.
Below is a detailed review and comparison of the most well-known platforms.</p>
        <p>IBM Watson Natural Language Understanding is a platform with a set of tools for natural language processing. One of its components, topic classification, allows you to identify common topics in the text (for example, politics, healthcare, finance). In this case, the thematic analysis is based on pre-trained classifiers. Advantages:</p>
        <p>Support for many languages.</p>
        <p>High-quality results.</p>
        <p>The API integrates seamlessly into business systems.</p>
        <p>It provides not only themes but also emotional tone, categories, concepts, and objects.</p>
        <p>Disadvantages:</p>
        <p>Only works with a fixed list of topics.</p>
        <p>There is no full-fledged topic-forming model (as in LDA).</p>
        <p>Commercial model: paid for a large number of requests.</p>
        <p>Limited flexibility for the user (no access to simulation engines).</p>
        <p>Google Cloud Natural Language API – Google's text processing service includes content classification, where documents are classified according to a hierarchy of ~700 topics (for example, /Arts &amp; Entertainment/Music or /Business/Banking). It is based on deep neural networks and a predefined topic dictionary. Advantages:</p>
        <p>Reliability and speed from Google.</p>
        <p>An extensive database of topics and subtopics.</p>
        <p>Convenient to integrate into cloud services.</p>
        <p>Support for many formats.</p>
        <p>Disadvantages:</p>
        <p>Topics are hardcoded – it is impossible to identify new ones.</p>
        <p>Interpretation of the result is only possible within the Google framework.</p>
        <p>There is no transparency – it is not clear which words influenced the classification.</p>
        <p>The cost increases when processing large arrays of texts.</p>
        <p>MonkeyLearn is a cloud-based platform for text analysis that allows you to create your own
classifiers and pre-trained thematic templates. It is positioned as a no-code/low-code tool for
business users. Advantages:</p>
        <p>Ready-made templates (for example: customer support, surveys, e-commerce).</p>
        <p>You can create custom models without programming.</p>
        <p>Visual interface for customising categories.</p>
        <p>It has an API and integration with Google Sheets, Zapier, etc.</p>
        <p>Disadvantages:</p>
        <p>The free version is minimal.</p>
        <p>Less flexible for complex analysis.</p>
        <p>It is not a full-fledged thematic modelling (works as a classifier).</p>
        <p>Gensim (LDA implementation) is an open-source Python library for thematic modelling. It implements LDA (Latent Dirichlet Allocation), as well as other methods for analysing the latent structure of texts, and supports model training both in memory and from streaming data. Advantages:</p>
        <p>Complete freedom of customisation: number of topics, words, alpha/beta.</p>
        <p>Open core – can be expanded and adapted.</p>
        <p>Visualisation capability (via pyLDAvis).</p>
        <p>Works locally, without cloud costs.</p>
        <p>Disadvantages:</p>
        <p>Requires programming (not suitable for non-specialists).</p>
        <p>Requires independent processing of texts (cleaning, tokenisation, etc.).</p>
        <p>It does not have a graphical interface "out of the box".</p>
        <p>BERTopic is a modern library for thematic modelling that combines contextual vector representations from BERT and clustering (e.g. HDBSCAN) to detect topics. Topics are created based on the similarity of text vectors. Advantages:</p>
        <p>It takes the context into account, so that "bank" and "riverbank" will not end up in the same topic.</p>
        <p>A more accurate model for short or unstructured data.</p>
        <p>Topics can have dynamic depth (topics within topics).</p>
        <p>It has integration with visualisation and meta-information.</p>
        <p>Disadvantages:</p>
        <p>Requires a lot of resources (GPU for fast work).</p>
        <p>Complexity of installation (transformers and BERT models are needed).</p>
        <p>The interpretation is more complicated than that of the classic LDA.</p>
      </sec>
      <sec id="sec-1-2">
        <title>3. Problem formulation</title>
        <p>In the XXI century, humanity lives in the conditions of the information revolution: text data is created at an unprecedented speed – in news, social networks, instant messengers, reviews, reports, blogs, comments, and documents. All this forms a complex and extensive information ecosystem that requires tools for systematisation, analysis, and understanding. Much of this information is unstructured – i.e., it has no clear tags, headings, or categories – and is therefore difficult to process automatically using traditional methods.</p>
        <p>Modern society faces the challenge of efficiently processing large amounts of unstructured text data. Classical methods of analysis – for example, manual classification, keyword search, rule-based systems – are not able to scale to large data sets and do not allow content topics to be identified automatically without prior human intervention. This makes the following difficult:</p>
        <p>Decision-making in business, science, and journalism;</p>
        <p>Identifying trends and topics in social networks;</p>
        <p>Customer feedback analytics;</p>
        <p>Creation of personalised recommendation systems.</p>
        <p>Classic classifiers (SVM, Naive Bayes) require labelled data, which means that someone has to manually specify which topic each document belongs to. In cases where thousands of texts are involved, this becomes an impractical, expensive, and slow process. In turn, clustering algorithms (for example, K-means), although they allow documents to be grouped, do not provide interpretable topics – we cannot say "what" each cluster is about without additional analysis. In connection with the described problem, the product under development faces several tasks:</p>
        <p>1. Develop a thematic model that automatically highlights topics from the corpus of documents.
The model must work without previous labels, and therefore belongs to the unsupervised learning
class. The algorithm should determine the most likely topics in a large corpus based on statistical
patterns of word distribution.</p>
        <p>2. Ensure the interpretation of the results. Unlike the "black boxes" of modern deep learning, the
developed system should provide clear and transparent results. A list of keywords should express
the topic, and each document should show which topics are present in it and in what ratio.</p>
        <p>3. Provide a flexible tool that works even without labelled data. The thematic model should
work on any corpus of texts – news, forums – without the need for mark-up. It should be scalable,
customizable (for example, change the number of themes) and available for local launch, without
cloud dependency.</p>
        <p>4. Compare several approaches and justify the feasibility of choosing LDA. In order for the choice of thematic modelling (in particular, the LDA algorithm) to be well-founded, it is necessary to:</p>
        <p>Compare it with classifiers, clustering, and transformers.</p>
        <p>Assess the advantages and limitations of the other approaches.</p>
        <p>Show that LDA is the best compromise between interpretability, automation, and technical simplicity.</p>
        <p>A comprehensive study of thematic modelling of texts as a modern tool for analysing large
volumes of unstructured data has been carried out. The main goal was to create a model capable of
automatically detecting content topics in the body of documents without pre-labelling, as well as
comparing this approach with similar methods. A thematic model based on the Latent Dirichlet
Allocation (LDA) algorithm using Gensim and pyLDAvis libraries has been developed. The corpus
of texts used has undergone a complete cycle of pre-processing: tokenisation, cleaning, deletion of
stop words, and, if necessary, lemmatisation. After building the model, a set of topics was obtained,
each of which is described by a list of words with the highest probability, and an analysis of the
distribution of topics across documents was carried out.</p>
      </sec>
      <sec id="sec-1-3">
        <title>4. Methods</title>
        <p>Topic modelling as a task of extracting hidden topics in the corpus of texts has become one of the
key paradigms in natural language processing and text data analysis. This section provides an
overview of the three main approaches – classical generative models, factorisation-based models,
and modern approaches with contextual embedding – with a focus on their applicability, strengths
and weaknesses, and challenges for Ukrainian-language corpora.</p>
        <p>
          The original and most common approach is Latent Dirichlet Allocation (LDA), proposed in [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. The model formalises a document as a mixture of topics and a topic as a distribution of words, using a priori Dirichlet distributions for the θ (document-topic) and φ (topic-word) distributions. Among its advantages are high interpretability of topics, support for large corpora, and relatively simple implementation. However, a number of studies have noted the model's weak ability to take into account word order, context, or short texts, as well as the instability of the results with respect to initialisation or document order [10]. Other researchers classify LDA in their reviews as the "dominant" model in topic modelling before the era of deep learning [11]. The "order effect" in LDA was also investigated, and the LDADE approach for tuning hyperparameters to reduce the instability of topic distributions was proposed [10].
        </p>
        </p>
        <p>Along with LDA, methods based on non-negative matrix factorisation (NMF), latent semantic analysis (LSA), and pLSA are widely discussed in the literature. These methods, although not as common as LDA, sometimes show better stability on small corpora or with limited resources. A study [12] compared LDA, NMF, and embedding clustering on tweet data and found that traditional models have limitations and are less stable on short texts. Some works also pay attention to dynamic versions of topic models (e.g. Dynamic Topic Model, HDP) for tracking thematic changes over time [13].</p>
        <p>
          In recent years, models that combine embedding text representations (e.g., BERT-like models)
with clustering algorithms or variational autoencoders have been growing in popularity. For
example, BERTopic uses document embedding (based on transformers), then UMAP to reduce
dimensionality, HDBSCAN for clustering, and c-TF-IDF to form a representation of topics [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ].
BERTopic demonstrates higher topic coherence compared to LDA, especially on short or variable
texts. A study [14] noted that embedding models (e.g., BERTopic or Combined Topic Model CTM)
can outperform LDAs in terms of NPMI/coherence, but require more computational resources. In
addition, studies on Indo-Aryan languages (e.g. Hindi) show that BERTopic consistently
outperforms classical methods on short texts [15].
        </p>
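        <p>For comparison, a minimal BERTopic sketch along the lines described; the multilingual embedding model named here is one common choice and an assumption, not the configuration of the cited studies:</p>
        <preformat>
# BERTopic sketch: transformer embeddings, then (internally) UMAP + HDBSCAN
# clustering and c-TF-IDF topic descriptions.
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
topic_model = BERTopic(embedding_model=embedder, language="multilingual")
topics, probs = topic_model.fit_transform(raw_docs)   # raw, unlemmatised strings

print(topic_model.get_topic_info().head())            # topic sizes and c-TF-IDF labels
        </preformat>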
        <p>
          Evaluation of thematic models is carried out through metrics such as perplexity, coherence (UMass, Cv) and NPMI. The reviews emphasise that reducing perplexity does not guarantee an increase in coherence, so a combined approach is recommended [11-12]. There are additional challenges for corpora of the Ukrainian language: rich inflexion, word formation, morphological variants, and a lack of large, curated datasets. For example, in the study of Podillya folklore, the need for lemmatisation and morphological normalisation before the use of LDA is emphasised [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. Thus, for Ukrainian-language corpora it is essential to take into account:
        </p>
        <p>Careful pre-processing of the text (lemmatisation, correction of inflexions);</p>
        <p>Choice of model depending on the genre (long formats – LDA/NMF, short/social texts – BERTopic/CTM);</p>
        <p>Multi-metric evaluation (perplexity and NPMI/coherence) and, if possible, human validation.</p>
        <p>Thus, the literature demonstrates the evolution of thematic modelling [16-21]: from traditional generative models to modern approaches with contextual embeddings. For the analysis of Ukrainian-language texts, it is advisable to use a hybrid strategy: LDA as the basic model for long documents, and embedding-based models for short/noisy texts, while at the same time ensuring high-quality pre-processing and combined assessment. In the following sections, these recommendations are taken into account when choosing a model, adjusting hyperparameters, and evaluating the results.</p>
        <p>Within the framework of this study, the construction of the thematic model was carried out on
the basis of a probabilistic approach implemented through Latent Dirichlet Allocation (LDA). The
LDA algorithm assumes that each document in the corpus is a mixture of several topics, and each
topic is a distribution of word probabilities. Thus, the mathematical essence of the model is to
restore the hidden parameters of these distributions. Let:</p>
        <p>$D = \{d_1, d_2, \ldots, d_M\}$ – a corpus consisting of $M$ documents;</p>
        <p>$V = \{w_1, w_2, \ldots, w_N\}$ – a dictionary containing $N$ unique words;</p>
        <p>$K$ – the number of hidden topics.</p>
        <p>Each document dm is modelled as a stochastic process of generating words according to the
following steps:</p>
        <p>1. For each document $d_m$, a distribution of topics is drawn, $\theta_m \sim \mathrm{Dir}(\alpha)$, where $\alpha$ is a hyperparameter of the Dirichlet distribution that controls the "blurring" of topics in the document.</p>
        <p>2. For each topic $k$, a distribution of words is drawn, $\phi_k \sim \mathrm{Dir}(\beta)$, where $\beta$ is a hyperparameter that controls the "blurring" of words in the topic.</p>
        <p>3. For each word $w_{mn}$ in document $d_m$: a topic is drawn, $z_{mn} \sim \mathrm{Mult}(\theta_m)$; next, a word on this topic is drawn, $w_{mn} \sim \mathrm{Mult}(\phi_{z_{mn}})$.</p>
        <p>The total probability of the corpus of documents is given as
$$P(W, Z \mid \alpha, \beta) = \prod_{m=1}^{M} \int P(\theta_m \mid \alpha) \left( \prod_{n=1}^{N_m} \sum_{z_{mn}} P(z_{mn} \mid \theta_m)\, P(w_{mn} \mid z_{mn}, \beta) \right) d\theta_m, \qquad (1)$$
where $W$ denotes all observed words and $Z$ the hidden variables (topics for each word).</p>
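        <p>To make the generative process concrete, here is a toy NumPy simulation of steps 1-3; all sizes and hyperparameter values are illustrative.</p>
        <preformat>
# Toy simulation of the LDA generative process behind eq. (1).
import numpy as np

rng = np.random.default_rng(42)
M, K, N, doc_len = 5, 3, 8, 20     # documents, topics, vocabulary size, words per document
alpha, beta = 0.5, 0.1

phi = rng.dirichlet([beta] * N, size=K)                 # phi_k ~ Dir(beta), topic-word
for m in range(M):
    theta_m = rng.dirichlet([alpha] * K)                # theta_m ~ Dir(alpha), document-topic
    z = rng.choice(K, size=doc_len, p=theta_m)          # z_mn ~ Mult(theta_m)
    words = [int(rng.choice(N, p=phi[k])) for k in z]   # w_mn ~ Mult(phi_{z_mn})
    print(f"doc {m}: topic counts {np.bincount(z, minlength=K)}, words {words}")
        </preformat>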
        <p>The purpose of thematic modelling is to find the posterior distribution
$$P(\theta, \phi, Z \mid W, \alpha, \beta), \qquad (2)$$
which is approximated using the variational Bayesian approach or the Gibbs sampling method. For each document, the following are calculated: $\theta_m = (\theta_{m1}, \theta_{m2}, \ldots, \theta_{mK})$ – the probability vector of topics; $\phi_k = (\phi_{k1}, \phi_{k2}, \ldots, \phi_{kN})$ – the probability vector of words in the topic.</p>
        <p>After training the model, each document is represented as a combination of topics with weights $\theta_m$, and each topic is defined by the most likely words from the vector $\phi_k$.</p>
        <p>Two primary metrics evaluate the quality of thematic modelling:</p>
        <p>1. Perplexity is a measure of the consistency of the model with held-out test data:
$$\mathrm{Perplexity}(D_{\mathrm{test}}) = \exp\left\{ -\frac{\sum_{d=1}^{M} \log P(w_d)}{\sum_{d=1}^{M} N_d} \right\}. \qquad (3)$$
A lower perplexity value corresponds to better model consistency.</p>
        <p>2. Topic coherence assesses the semantic consistency of the most important words of a topic. For a set of words $W_t = \{w_1, w_2, \ldots, w_M\}$, coherence is defined as
$$C(W_t) = \sum_{i &lt; j} \log \frac{D(w_i, w_j) + \epsilon}{D(w_j)}, \qquad (4)$$
where $D(w_i, w_j)$ is the number of documents in which the words $w_i$ and $w_j$ occur together, $D(w_j)$ is the number of documents containing $w_j$, and $\epsilon$ is a smoothing factor.</p>
        <p>The result of modelling is a set of thematic distributions
$$\Phi = \{\phi_1, \phi_2, \ldots, \phi_K\}, \qquad \Theta = \{\theta_1, \theta_2, \ldots, \theta_M\}, \qquad (5)$$
which allows the construction of an $M \times K$ matrix, where each element $\theta_{mk}$ is interpreted as the probability that document $d_m$ belongs to topic $k$.</p>
        <p>This matrix forms the basis for further analysis – clustering of documents, construction of semantic maps, and visualisation of thematic structures.</p>
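        <p>A short sketch of computing both metrics for a trained Gensim model (reusing lda, corpus, texts, and dictionary from the earlier sketches; note that Gensim's log_perplexity returns a base-2 per-word bound):</p>
        <preformat>
# Evaluating a trained model with the two metrics above.
import numpy as np
from gensim.models.coherencemodel import CoherenceModel

bound = lda.log_perplexity(corpus)      # per-word log-likelihood bound (base 2)
print("perplexity:", np.exp2(-bound))   # eq. (3) style: lower is better

umass = CoherenceModel(model=lda, corpus=corpus, dictionary=dictionary,
                       coherence='u_mass').get_coherence()   # document co-occurrence, eq. (4) style
cv = CoherenceModel(model=lda, texts=texts, dictionary=dictionary,
                    coherence='c_v').get_coherence()
print("UMass:", umass, "Cv:", cv)
        </preformat>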
        <p>The goal of the developed software product is to create an efficient, automated system for
thematic modelling of texts, which allows the user to quickly identify key topics in unstructured
text documents without prior mark-up or classification. This tool should provide the ability to:
1. Process large amounts of textual information;
2. Analyse the content of documents without manual intervention;
3. Identify hidden topics by applying machine learning methods, in particular the
Latent Dirichlet Allocation (LDA) algorithm;</p>
        <p>4. Display the results in an understandable, interpretable form – as lists of keywords that form topics and their distribution across documents;</p>
        <p>5. Visualise the results of thematic analysis to improve understanding of the structure
of the text corpus.</p>
        <p>As a result of the development, a software module will be implemented that helps the user
interpret large arrays of texts, reduce the time for their processing, and identify semantic trends in
the content without deep linguistic or technical preparation.</p>
        <p>A software product for thematic modelling of texts should implement a complete cycle of automated analysis of textual information, from loading data to displaying results in a convenient form. The main functions that the system should implement include:</p>
        <p>1. Loading the text corpus:</p>
        <p>Load input text data from a file (e.g., .txt, .csv, .json).</p>
        <p>Support for entering one or more documents.</p>
        <p>Ability to work with Ukrainian-language texts.</p>
        <p>2. Pre-processing of texts:</p>
        <p>Clearing punctuation, numbers, and special characters.</p>
        <p>Lowercasing the text.</p>
        <p>Deleting stop words (in Ukrainian).</p>
        <p>Tokenisation – the division of text into separate words (tokens).</p>
        <p>Lemmatisation (if necessary) – bringing words to their original form.</p>
        <p>3. Building a thematic model:</p>
        <p>Formation of a vector representation of texts (for example, Bag-of-Words or TF-IDF).</p>
        <p>Building an LDA model to define topics in the corpus.</p>
        <p>Setting the number of topics, dictionary sizes, and the alpha and beta parameters.</p>
        <p>4. Generating results:</p>
        <p>Creating lists of topics with a set of keywords for each.</p>
        <p>Calculating the distribution of topics for each document.</p>
        <p>Saving results in text or tabular format.</p>
        <p>5. Visualisation of results:</p>
        <p>Building an interactive thematic map (through the pyLDAvis library).</p>
        <p>Displaying the weights of words in topics.</p>
        <p>Visualising similarities and overlaps between topics.</p>
        <p>6. Data saving/export (a save/export sketch is given below):</p>
        <p>Option to export the model, topic lists, or topic breakdowns to a file.</p>
        <p>Ability to reuse the saved model.</p>
        <p>7. User interface (optional). A simple graphical interface (or console menu) where the user can: select a file; set model parameters (number of topics); view the visualisation.</p>
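        <p>A sketch of the saving/export functions from point 6 (Gensim and pandas; file names are illustrative, and the lda model and corpus come from the sketches above):</p>
        <preformat>
# Persisting the model and exporting the document-topic breakdown.
import pandas as pd

lda.save("lda_model.bin")               # the saved model can be reused later:
# from gensim.models import LdaModel
# lda = LdaModel.load("lda_model.bin")

rows = [{f"topic_{k}": p
         for k, p in lda.get_document_topics(bow, minimum_probability=0.0)}
        for bow in corpus]
pd.DataFrame(rows).to_csv("doc_topics.csv", index=False)   # distribution of topics by document
        </preformat>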
        <p>Thus, the software product should cover all the key stages of thematic analysis of texts – from
processing and modelling to interpretation and output of results.</p>
        <p>The developed software product is aimed at users who need to analyse a large amount of textual
information, but do not have sufficient technical or linguistic knowledge for its deep processing.
Thanks to the automation of the main processes of thematic analysis, the system allows you to
solve a number of applied problems.</p>
        <p>1. Analysis of a large volume of texts. In the real world, processing thousands of documents
manually is extremely time-consuming and resource-intensive. The software product allows you to
automatically process large corpora of texts without the need for human intervention at each stage.</p>
        <p>2. Automatic detection of topics in documents. The user gets the opportunity to find out what
the texts are about, even if the issues have not been determined in advance. The LDA-based model
allows you to automatically generate topics based on statistical patterns in the data.</p>
        <p>3. Classification and grouping of texts by topic. The system allows you to determine which
documents are related to a particular topic, which makes it possible to segment texts by content
(for example: economics, politics, sports, culture).</p>
        <p>4. Building thematic profiles of documents. Each document has a distribution of topics that
helps to assess which topics dominate the text and which are secondary. It is beneficial for reports,
news, research articles, or social media content.</p>
        <p>5. Visualisation of results for analytics. The user receives an interactive visualisation (via
pyLDAvis), which allows a better understanding of the structure of topics, their relationships, and
their distribution in the space of texts. It makes the analysis accessible even to non-professional
users.</p>
        <p>6. Decision support. Thanks to the interpretability of the results of thematic modelling, the user can identify trends faster, filter important documents, and draw conclusions based on the actual content of the texts.</p>
        <p>7. Saving the results for further processing. The resulting topics and breakdowns can be stored,
used in other systems, or used to generate reports, making the product useful in research,
educational, and business contexts.</p>
        <p>Thus, the software product removes the need for the user to manually process texts and allows
you to obtain valuable semantic insights automatically, quickly and in a convenient format.</p>
        <p>The system being developed, conventionally called the "Thematic Analysis Module", is a
software tool for processing, analysing and visualising large volumes of text documents in order to
automatically identify semantic topics. The system belongs to the application software that
implements computational linguistics and machine learning methods for the needs of text analysis.
The principle of operation of the module is based on the Latent Dirichlet Allocation (LDA)
algorithm, one of the most common approaches to thematic modelling, which allows you to
determine a set of hidden topics in the corpus of texts without human intervention.</p>
        <p>1. What actions take place on the input data? After loading the text corpus, the system performs several sequential stages of processing on the input data to prepare it for thematic analysis:</p>
        <p>– Text pre-processing:</p>
        <p>Clearing texts of punctuation, special characters, and numbers.</p>
        <p>Normalisation (converting all words to lower case).</p>
        <p>Removing stop words that do not carry a semantic load (for example, "and", "or", "also").</p>
        <p>Tokenisation – the division of text into separate words (tokens).</p>
        <p>Lemmatisation – the reduction of words to their original form (for example: "worked" → "work").</p>
        <p>– Building a textual representation:</p>
        <p>Creating a document-term matrix using Bag-of-Words or TF-IDF methods;</p>
        <p>Formation of a dictionary (all unique words of the corpus).</p>
        <p>– Thematic modelling:</p>
        <p>Building an LDA model – automatic detection of topics that are repeated in texts;</p>
        <p>Definition of the set of words that best characterise each topic;</p>
        <p>Calculation of the distribution of topics in each document (i.e. what each document is about).</p>
        <p>2. What the user sees at the output. After completing all stages of processing and modelling, the system provides the user with the result in a clear and visualised form:</p>
<p>– Topics. A list of detected topics, each represented by a set of keywords with the highest
weight (importance). For example:
Topic 1: ['economy', 'profit', 'currency', 'inflation', 'bank']
Topic 2: ['sport', 'match', 'team', 'goal', 'tournament']</p>
        <p>– Distribution of topics by documents. For each document, it is displayed which topics dominate it and
with what probability. For example: Document No. 7: Topic 1 – 60%, Topic 3 – 30%, Topic 5 – 10%.</p>
        <p>– Interactive visualisation. Using the pyLDAvis library, the results are displayed as an
interactive topic map where:
 circles are topics;
 circle size is the frequency of the topic;
 overlap is the similarity between topics;
 hovering the cursor shows the topic keywords.</p>
        <p>– Tables with results. Export in CSV or JSON formats:
 a table with topics;
 the distribution of topics by documents;
 the top words for each topic.</p>
        <p>The user can use all these results for:
 building reports;
 content analytics;
 segmentation of texts by topic;
 automatic classification or preparation of training data.</p>
<p>First of all, it must have the functionality to load the corpus of documents in a convenient
format (for example, TXT or CSV), which can contain both individual texts and large arrays of text
data, and then perform pre-processing: tokenisation, deletion of stop words and, if necessary,
lemmatisation to bring words to their basic form. Based on the cleaned texts, the program should build a thematic model using the
Latent Dirichlet Allocation (LDA) method, which allows you to automatically determine a set of
topics in a given corpus, each of which will be represented by a set of keywords. Once processed,
the results must be stored in a file or database, allowing them to be reused, exported, or integrated
with other systems. Finally, the system should display the results to the user clearly and intuitively
– both in the form of lists of topics and tables, and in the form of interactive visualisation of topics,
which significantly simplifies the analysis of the data obtained.</p>
        <p>The project aims to create an effective tool for automatic thematic analysis of texts, which
allows you to identify content structures in unstructured text data without the need for their
preliminary labelling. The goal of the project is to provide the user with an accessible tool for
analysing large corpora of texts, with the ability to visualise, store and interpret the results without
deep technical knowledge. The Python programming language was chosen as the implementation
environment of the software product using the libraries Gensim (for building a thematic model of
LDA) and pyLDAvis (for visualising the results). Additionally, the NLTK, Pandas, and Matplotlib
libraries can be used to process texts and present results. The program provides an interface in the
form of a CLI (command line) or, by extension, a simple graphical interface (GUI) based on the
Tkinter or Streamlit libraries, which allows you to interact with the user conveniently - select a file,
set analysis parameters, and start processing. Among the main limitations of the product are the
focus on texts in Ukrainian (for correct lemmatisation and a list of stop words), as well as the need
for a pre-prepared document format (for example, one document – one line in a file). There are two
types of users: the regular user, who runs the analysis, views the results, and exports them in a
convenient format, and the administrator or developer, who can change model settings, update
dictionaries, or change the architecture of the module for a specific application. Usage scenarios:
1. Download the document corpus – the user imports a set of texts for analysis.
2. Configure the model – sets the parameters of thematic modelling (number of topics,
type of filtering).</p>
        <p>3. Run simulation – the system pre-processes and builds an LDA model.</p>
        <p>4. View results – the user receives a list of topics, keywords, and breakdowns by
documents.</p>
        <p>5. Save results – the results are exported to a file (CSV, JSON, etc.).</p>
        <p>A class diagram displays the structure of a system, that is, what classes it consists of, what
functions each class performs, and how these classes are related to each other. It answers the
question: "What parts are included in the program and what do they do?". The main classes in the
diagram are:
 CorpusLoader – responsible for loading text data.
 Preprocessor – cleans, tokenises, and lemmatises texts.
 LDAModel – builds the thematic model, manages training, and produces results.
 Visualizer – creates a visualisation of topics (e.g. via pyLDAvis).
 Exporter – saves the results to a file.
Links between the classes:
 Each class has one or more methods, which are displayed at the bottom of its
rectangle.</p>
        <p> The arrows between classes show dependencies: for example, LDAModel uses
Preprocessor to prepare data, and Visualizer uses it to plot based on model results.</p>
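        <p>The following is an illustrative sketch of the class structure from the diagram; the class roles come from the text above, while the method names and bodies are assumptions made for the example.</p>
        <preformat>
# Illustrative skeleton of the diagram's classes; method names and bodies
# are assumptions, only the class roles come from the description above.
from gensim import corpora, models
import pyLDAvis.gensim_models


class CorpusLoader:
    """Loads text data, e.g. one document per line of a file."""
    def load(self, path):
        with open(path, encoding="utf-8") as f:
            return [line.strip() for line in f if line.strip()]


class Preprocessor:
    """Cleans, tokenises and (optionally) lemmatises texts."""
    def process(self, texts):
        return [t.lower().split() for t in texts]  # placeholder for real cleaning


class LDAModel:
    """Builds the topic model and produces per-document distributions."""
    def __init__(self, num_topics=22):
        self.num_topics = num_topics

    def fit(self, tokenised_texts):
        self.dictionary = corpora.Dictionary(tokenised_texts)
        self.corpus = [self.dictionary.doc2bow(t) for t in tokenised_texts]
        self.model = models.LdaModel(self.corpus, id2word=self.dictionary,
                                     num_topics=self.num_topics)
        return self.model


class Visualizer:
    """Creates the pyLDAvis visualisation from a fitted LDAModel."""
    def prepare(self, lda):
        return pyLDAvis.gensim_models.prepare(lda.model, lda.corpus, lda.dictionary)


class Exporter:
    """Saves topic/keyword results to a file."""
    def save_topics(self, lda, path):
        with open(path, "w", encoding="utf-8") as f:
            for topic_id, words in lda.model.print_topics():
                f.write(f"{topic_id}\t{words}\n")
        </preformat>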
<p>This diagram is needed to describe the architecture of the code and helps the programmer to
implement the system correctly. The sequence diagram shows the order in which actions are
performed in time – that is, what exactly happens in the system from launch to the receipt of the
result. It models how objects pass requests to each other and in what order.
Sequence of events:</p>
        <p>The user gives a command through the interface to start the analysis.</p>
        <p>CorpusLoader loads data.</p>
        <p>Pre-processor cleans and processes texts.</p>
        <p>LDAModel performs thematic modelling.</p>
        <p>Visualizer displays themes on the screen.</p>
        <p>Exporter allows you to save the results. Each object has a vertical "lifeline", and arrows show
the interactions between them. The lower on the diagram, the later in time the action takes place.
This diagram gives an idea of the logic of executing a program step by step and is very useful for
testing or scripting.</p>
        <p>During the work, it was determined that the proposed system – the "Thematic Analysis Module"
– is designed to automatically process large volumes of unstructured text in order to identify key
topics without preliminary data labelling. This approach is especially relevant in today's
information environment, where a vast number of text messages are generated every day, which
require quick and meaningful analysis. In the process of formalisation, the input and output data of
the system, the main stages of word processing (cleaning, tokenisation, lemmatisation), building a
model and displaying the results were described. It is established that the system should provide
the loading of the corpus of documents, the adjustment of model parameters, the execution of
simulations, the visualisation of the results and the ability to export the received data. All these
features have been described in the form of functional requirements for the product. In order to
structurally present the work of the module, a technical task was created, which describes in detail
the implementation environment (Python, Gensim, pyLDAvis), user interaction interface, target
audience (ordinary user, analyst) and system limitations (focus on Ukrainian-language texts).
Particular attention is paid to the visualisation of the system architecture using UML diagrams.
Created:</p>
        <p> a class diagram showing the internal structure of the program and the relationships
between its modules (CorpusLoader, Pre-processor, LDAModel, Visualizer, Exporter);
 Sequence Diagram, which illustrates the step-by-step process of performing
thematic analysis – from loading texts to saving results.</p>
<p>The work done made it possible to systematise the logic of the software product's
functioning and to determine the key elements and their interaction. The results obtained form a solid
basis for further development, testing, and implementation of thematic modelling in real text
analysis tasks. Thus, the work has been completed, and the goals set – to formulate the
requirements, build the architecture and model the system – have been achieved in full.</p>
<p>The task of thematic modelling is to find latent (hidden) topics in a large corpus of text
documents without predefined labels. Formally, each document d_m from the corpus is modelled as a
probabilistic mixture of latent topics, and each topic as a probability distribution over words.</p>
        <p>– Model construction: a statistical model is built that forms probabilistic distributions of
words in topics and of topics in documents. Each text is represented as a vector
with the weights of belonging to each topic.</p>
        <p>– Algorithm for classifying new text: the new text is processed (lemmatisation, tokenisation),
converted to the BoW format, and passed to LDA, which returns the probability vector.
The highest probability determines the topic.</p>
        <p>– CoherenceModel (coherence score): calculates the semantic consistency of topics using the c_v metric,
which is based on the mutual appearance of terms in documents. It is used to
evaluate the quality of the model.</p>
<p>Stages of logical inference in the system:
 The model is trained on the basis of the processed corpus of documents.
 Each topic becomes a probabilistic distribution of words.
 Each document receives a distribution of topics.
 Coherence determines how stable and consistent the topics are (the higher, the better).
 New text can be classified via lda_model.get_document_topics() – without the need
to retrain the model.</p>
        <p>Thus, the system automatically concludes about the topic of the new document without manual
intervention. During the operation of the software, various data structures are formed and used.
They are necessary for storing the corpus of texts, a glossary of terms, the results of modelling
topics and the subsequent classification of new documents. The following are the key structures
that are stored or used in memory at runtime.
– processed_texts.pkl (.pkl file, Pickle): saves preprocessed texts as lists of lemmas; allows
you not to repeat preprocessing when restarting.
– dictionary (gensim.corpora.Dictionary): a unique glossary of terms generated from the
processed texts; each word has a unique ID.
– corpus (list of lists, Bag-of-Words): represents each document as a dictionary-based
frequency vector of words; required for LDA training.
– lda_model (gensim.models.LdaModel object): a model that contains topics, their distributions,
probabilities, and parameters; retains the knowledge of the model after training.
– lda_model.print_topics(): returns a list of topics with their keywords and probabilities;
it is used to interpret the results.
– lda_model.get_document_topics(): returns a probabilistic distribution of topics for a
single document; it is used to classify new texts.</p>
        <p>The software has a text-based interface in the form of an interactive Jupyter Notebook, which
provides a convenient user experience with the system. All the main functions are implemented
through clear code blocks with output and visualisation.</p>
<p>The main functions exposed to the user are:
 Output of topic keywords – for each generated topic, a list of keywords with the highest
probabilities is displayed. It allows you to interpret the content of the topic.
 Automatic naming of topics – implemented through the generate_smart_title(keywords) function,
which creates a conditional name of the topic based on keywords (for example,
"Presidential activity", "Education and science").
 Interactive visualisation (pyLDAvis) – displays topics in the form of circles and shows the
relationships between them and their keywords. The user can click on the topics and view their
content.
 Saving models and data – the processed data (processed_texts), dictionary, corpus, and trained
model are stored in .pkl files, which avoids reprocessing and retraining.
 Classification of new texts – the user can enter any new Ukrainian text, and the program will
automatically determine which topic it most likely belongs to.
 Coherence inference – the system outputs an assessment of the quality of the model based on
the c_v metric, which allows you to choose the optimal number of topics.
 Graph of quality versus the number of topics – a graph is created to analyse how coherence
changes with a different number of topics (e.g., 5–40). It helps to automatically select the best
model.</p>
        <p>The software's interface is built to be intuitive for both technical and non-technical users. It
allows you to both study the structure of topics in the corpus and analyse new documents in real
time. In the developed software, all modules function within a single data processing flow. Each
component does its part of the work and passes the results to the input of the next module. This
modular structure allows for consistent processing, prevents redundant resource use, and provides
flexibility for modifications.</p>
<p>CSV → Word Processing → Saving → Building a Corpus → LDA Model →
→ Topic generation → Coherence → Visualisation → Analysis of new texts</p>
        <p>Table 18. Co-operation processes (the process and the modules participating in it):
1. Loading and pre-processing of texts – the user imports a .csv file → the data processing module
cleans and lemmatises the texts.
2. Saving an intermediate result – the processed texts are stored in processed_texts.pkl for reuse.
3. Creating a dictionary/corpus – the data is passed to the dictionary and corpus via Gensim.
4. Training the LDA model – the corpus and dictionary are passed to the Thematic Modelling
Module, where a topic model is created.
5. Model assessment – the model is passed to the Coherence Evaluation Module, where the c_v is
calculated.
6. Visualisation of results – the topic model goes to the Visualisation Module, where an
interactive topic map (pyLDAvis) is built.
7. Generating names – the keywords of each topic are analysed in the Interpretation Module,
which assigns a human-readable name.
8. Classification of new text – the user enters the text; it is processed and transmitted to
lda_model → the model returns the probability of belonging to the topics.</p>
        <p>System requirements:
– Operating system: Windows 10 / 11, Linux, macOS.
– Python: version 3.9 or higher.
– Libraries: stanza, gensim, pyLDAvis, pandas, matplotlib, pickle, tqdm.
– Development environment: recommended Jupyter Notebook / Jupyter Lab.
– Internet connection: only required to load the stanza language model.</p>
<p>Install the dependencies: pip install stanza gensim pyLDAvis pandas matplotlib tqdm. A parser was
also developed specifically for this project, thanks to which it was possible to assemble a
unique data set from articles from the President's Office, which helped to train the model well for
further use. The main steps are:
1. Running pre-processing: execute the preprocess(text) function, which cleans and
lemmatises the texts.
2. Creating a dictionary and corpus: use gensim.corpora.Dictionary and corpus =
[dictionary.doc2bow(text) for text in processed_texts].
3. Model training: build an LDA model via LdaModel(...).
4. Model estimation: coherence calculation via CoherenceModel.
5. Topic visualisation: use pyLDAvis to create a topic map.
6. Parsing new text: call lda_model.get_document_topics() for a new document.
A sketch of steps 2–4 is given below.</p>
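        <p>A minimal sketch of steps 2–4, assuming processed_texts holds the lemmatised token lists produced by step 1 (a tiny stand-in sample is used here so the fragment runs on its own):</p>
        <preformat>
from gensim import corpora
from gensim.models import LdaModel, CoherenceModel

# Stand-in for the real lemmatised corpus produced by pre-processing (step 1).
processed_texts = [["президент", "україна", "підтримка"],
                   ["спорт", "матч", "команда", "гол"]]

dictionary = corpora.Dictionary(processed_texts)                   # step 2: dictionary
corpus = [dictionary.doc2bow(text) for text in processed_texts]    # step 2: BoW corpus

lda_model = LdaModel(corpus=corpus, id2word=dictionary,            # step 3: training
                     num_topics=2, passes=10, alpha="auto")

coherence = CoherenceModel(model=lda_model, texts=processed_texts, # step 4: estimation
                           dictionary=dictionary, coherence="c_v")
print("c_v coherence:", coherence.get_coherence())
        </preformat>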
<p>Software name: news parser from the website of the President of Ukraine. Purpose: to
automatically collect news headlines and texts from the "Administration" section. Source:
president.gov.ua. Result of work: a CSV file with news headlines and texts. Language: Python 3.x.
Libraries: selenium, bs4, csv, time. Runtime: on-premises environment (Windows with ChromeDriver
installed). Access method: via Chrome browser control with Selenium.</p>
<p>Parser algorithm:
1. Browser launch: running the Chrome browser in headless mode with the specified user-agent.
2. Page navigation: go to the news page with the ?page=n parameter.
3. Collection of news links: search for blocks .item_stat.cat_stat; from each, the first link is taken.
4. Header extraction: from the tag &lt;h1 itemprop="name"&gt;.
5. Text extraction: from the &lt;div itemprop="articleBody"&gt; tag, all &lt;p&gt; elements.
6. Verification: skip news without text, prevent duplication.
7. Saving: saving the result to the file president_news.csv.</p>
        <p>Design decisions:
– Avoiding duplicates: from each .item_stat.cat_stat block, only the first &lt;a&gt; is taken.
– Selective content collection: text is taken only from articleBody, which excludes
footers/menus/meta.
– Waiting for dynamic content: WebDriverWait is used to wait for the DOM to fully load.
– Work in headless mode: the parser can be run in the background without displaying the browser.
– Flexible scaling: can be expanded to any number of pages (via the pages parameter).</p>
<p>The structure of the output CSV file is shown in Fig. 10. A big problem arose when creating the
parser: the website president.gov.ua uses protection against automated requests (bots). This
protection includes filtering requests from libraries like requests, even with fake headers
(User-Agent). In particular, when using standard parsing through requests and BeautifulSoup, the
server returned a 403 Forbidden response code, which indicates that the request was blocked. To
circumvent this
limitation, the project implemented automation of interaction with the site through the Google
Chrome web browser, using the Selenium WebDriver tool and the ChromeDriver driver. It allows
you to emulate the behaviour of a real user - open pages in the browser, load dynamic JavaScript
content, interact with DOM elements and wait for the page to be fully rendered. The parser works
in headless mode, that is, the browser does not open graphically, but all processes related to the
display and processing of the web page are performed as in a real browser. It allows you to
discreetly bypass anti-bot protection, while maintaining a high processing speed and minimal load
on the system. Also, to minimise detection by security mechanisms, custom headers of HTTP
requests were installed, including User-Agent, Accept-Language, Referer, and others, which
simulate a typical request from an ordinary browser user. In addition, the code implements waiting
(WebDriverWait) so that you do not try to extract information before the site fully loads the
content via JavaScript. Thanks to this solution, the developed software works stably with the
official website of the President of Ukraine, bypassing server checks for the bot and ensuring the
correct extraction of news texts.</p>
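        <p>A condensed sketch of the workaround described above: headless Chrome via Selenium, a browser-like User-Agent, and explicit waits before parsing. The selectors follow the text (.item_stat.cat_stat, the itemprop attributes); the exact news listing URL is an assumption, since the text names only the domain and the ?page=n parameter.</p>
        <preformat>
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = webdriver.ChromeOptions()
options.add_argument("--headless=new")  # no browser window is displayed
options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64)")

driver = webdriver.Chrome(options=options)
driver.get("https://www.president.gov.ua/news?page=1")  # assumed listing URL

# Wait until the JavaScript-rendered news blocks appear in the DOM.
WebDriverWait(driver, 15).until(EC.presence_of_all_elements_located(
    (By.CSS_SELECTOR, ".item_stat.cat_stat")))

# From each block, take only the first link to avoid duplicates.
links = [block.find_element(By.TAG_NAME, "a").get_attribute("href")
         for block in driver.find_elements(By.CSS_SELECTOR, ".item_stat.cat_stat")]

for url in links:
    driver.get(url)
    WebDriverWait(driver, 15).until(EC.presence_of_element_located(
        (By.CSS_SELECTOR, "div[itemprop='articleBody']")))
    title = driver.find_element(By.CSS_SELECTOR, "h1[itemprop='name']").text
    body = " ".join(p.text for p in driver.find_elements(
        By.CSS_SELECTOR, "div[itemprop='articleBody'] p"))
    # ... rows with title and body are appended to president_news.csv here ...

driver.quit()
        </preformat>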
        <p>Software for thematic modelling of Ukrainian-language texts based on the Latent Dirichlet
Allocation (LDA) algorithm has been developed. The developed system allows you to automatically
identify meaningful topics in a collection of texts, interpret them through keywords, and also
classify new documents according to the built model. A feature of the implementation is the use of
the Stanza library for the lemmatisation of Ukrainian texts, which provides deep linguistic data
processing. The program covers the entire cycle of thematic modelling: from collecting and
preprocessing texts to building an LDA model, assessing its quality using coherence metrics, and
visualising the resulting topics. In the process of implementation, mechanisms for automatic
generation of conventional names of topics were also implemented, which significantly facilitates
the perception of the results of the analysis. The functionality of the program includes the ability to
save processed texts, reuse models, view an interactive topic map, and classify new texts in real
time. It has been proven that the model is capable of detecting thematic clusters with high
coherence, which indicates its efficiency and accuracy. Thus, the goals of the work have been
achieved. The developed software is a universal tool for text data analytics. It can be used to solve
practical problems in the fields of journalism, public administration, education, and social
sentiment research.</p>
      </sec>
      <sec id="sec-1-4">
        <title>6. Results</title>
<p>In the modern information space, it is essential to be able to quickly analyse large amounts of
text data and isolate the main topics and content areas from it. One of the key approaches to
solving this problem is thematic modelling of texts, which allows you to automatically detect the
hidden structure of information in a large corpus of documents without manual mark-up. This
approach is based on the use of machine learning algorithms, in particular, Latent Dirichlet
Allocation (LDA), which allows you to break down texts into topics based on statistical patterns in
word distribution. Within the framework of the previous work, full-fledged software for thematic
modelling of Ukrainian-language texts was implemented. The implementation included the stages
of pre-processing of texts, lemmatization using the stanza library, construction of a dictionary of
terms, creation of a corpus in the Bag-of-Words format, training of the LDA model, output of topic
keywords, evaluation of the quality of the model by the coherence metric (c_v), generation of
conditional names of topics and classification of new texts. The study is devoted to checking the
operability of the developed software tool by running a control case. Such an example allows you
to make sure that all modules of the system function in a coordinated manner, the results of the
simulation correspond to the content of the text, and the system correctly classifies new documents
according to the topics that were discovered during the training. The analysis of the control
example allows you to confirm that the results obtained are logical, meaningfully relevant, and
correspond to the task. Thus, the purpose of this work is to launch and analyse a control case
demonstrating the full cycle of the software - from loading a new text to determining its topic, with
the output of topic keywords, probabilistic distribution and interpretation of results.</p>
        <p>The purpose of the control example is to check the operability of the software for thematic
modelling of texts. For this purpose, a test task is formed, which should reflect the key functionality
of the system for determining the subject matter of the Ukrainian-language text on the basis of the
already trained LDA model. The user enters a new Ukrainian-language text that was not included
in the training corpus, and the system should:</p>
        <p>Carry out full pre-processing of the text (cleaning, tokenisation, lemmatisation).</p>
        <p>Convert text to a numeric format according to the already built dictionary.</p>
        <p>Transmit text to the trained LDA model.</p>
        <p>Obtain a probabilistic distribution of topics identified in the previous analysis.</p>
        <p>Identify the topic with the highest probability.</p>
        <p>Output keywords and the automatically generated name of this topic.</p>
        <p>To train the thematic model and further test the software, a corpus of Ukrainian texts, collected
from open sources, in particular, from the official website of the President of Ukraine, was used.
The data is from news, public speeches, event reports, international meetings, decrees, and other
documents covering socio-political topics. At the initial implementation stage, the first dataset of
approximately 300 documents was created. This set made it possible to check the correctness of the
main modules of the system – word processing, dictionary construction, creation of a corpus in the
Bag-of-Words format, model training, knowledge base construction and primary classification.
However, in the analysis process, it was found that the model trained on this set showed
insufficient topic resolution, and the coherence (topic quality metric) was below the desired level.
The topics were often mixed, vaguely defined, or too general. In order to improve the quality of the
model and expand the thematic coverage, a new, significantly larger corpus was created. The
second dataset, which was formed as a result, consisted of more than 830 documents, which made
it possible to provide better statistical representativeness of words and contexts. The new set of
texts covered a wide range of topics: international politics, internal governance, educational issues,
commemoration of historical memory, humanitarian initiatives, etc. The extended corpus was used
for the final training of the LDA model, the classification of texts and the execution of a control
example within this work.</p>
        <p>Each text in the dataset is saved in .csv format, in the text column. Additional metadata, such as
headers or dates, is not used in the model. Thus, all texts underwent the same processing cycle,
which ensured the purity of the experiment and the ability to compare the results. The use of two
different corpora in the development process made it possible to assess the impact of sample size
on the quality of thematic modelling. It also highlights the flexibility and scalability of the software
created, which can work efficiently with corpora of different sizes.</p>
        <p>To check the functionality of the developed software, a separate fragment of Ukrainian text was
selected, which was not included in the training corpus. This approach allows you to
objectively assess the ability of the model to generalise - that is, the ability to apply the formed
topics to new, previously unknown texts. A control example simulates a real situation when the
user submits an arbitrary text for input and expects the system to correctly recognise its content.
The selected fragment refers to the commemoration of the victims of political repression and is a
typical example of official political communication. It has a clear thematic focus and contains
specific vocabulary that allows you to test the model's ability to identify keywords and classify the
document towards the relevant topic. The text was taken from an open source and was not
included in the training dataset in advance, which guarantees the fairness of the test. The control
fragment (in translation) reads: "After the inaugural mass, Pope Leo XIV held an audience with
President of Ukraine Volodymyr Zelenskyy and First Lady Olena Zelenska, the first granted to heads
of state. The President congratulated Pope Leo XIV on the beginning of his pontificate and noted
that he is a hope for millions of people who want peace."</p>
        <p>This example was specifically selected as a control example, since its theme potentially
correlates with one or more topics formed by the LDA model (in particular, with issues related to
historical memory, political repression or state policy in the field of culture). The following sections
will provide a step-by-step analysis of the processing of this fragment, the results of classification,
and an assessment of how the model has determined its topic correctly. A control fragment of the
text was submitted for input to the software to check the full cycle of its processing and
classification. After loading the text, the system automatically carried out all the stages of analysis
in accordance with the logic embedded in the architecture of the software tool. In the first step, text
pre-processing is performed, which includes lowercase, tokenisation, filtering of service words, and
lemmatisation using the Stanza library. It allows you to bring the text to a unified form, where each
word is represented in its basic grammatical form. For example, the phrase "honoured the memory
of the dead" after lemmatisation turns into a sequence of lemmas "honour", "memory", "deceased".
Next, the cleaned text is transformed into a numeric format using a pre-saved dictionary. To do
this, each lemma is replaced with a corresponding numerical identifier, and the frequency of its
appearance in the text is recorded in the Bag-of-Words format. This format allows you to present
text as a vector that the model can interpret as input for thematic analysis. The third step is to
transfer the processed text to the trained LDA model, which conducts the classification. The model
returns the probability distribution of topics formed in the process of previous training. As a result,
a list of topics with corresponding probability values is obtained. The topic is most likely to be
interpreted as the main one to which the input text belongs. At the final stage, the system displays
the topic ID, a list of its keywords, and the generated conditional name formed on the basis of the
detected topic semantics. It allows the user not only to see the numerical results of the
classification, but also to interpret them understandably. Thus, the submitted text goes through a
complete cycle of processing: from natural language to a formalised topic with interpretation. It
confirms the ability of the software to correctly identify the subject of a new document based on an
already trained model.</p>
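        <p>A sketch of this processing cycle, assuming the dictionary and lda_model were restored from the saved .pkl files and that the Stanza Ukrainian model has been downloaded (stanza.download("uk")); the sample text and the stop-word set are placeholders:</p>
        <preformat>
import stanza

nlp = stanza.Pipeline("uk", processors="tokenize,pos,lemma")

def preprocess(text, stop_words=frozenset()):
    """Lowercase, tokenise and lemmatise; keep alphabetic, non-stop lemmas."""
    doc = nlp(text.lower())
    return [word.lemma for sent in doc.sentences for word in sent.words
            if word.lemma and word.lemma.isalpha() and word.lemma not in stop_words]

new_text = "Президент провів зустріч щодо підтримки України."  # placeholder input
lemmas = preprocess(new_text)
bow = dictionary.doc2bow(lemmas)                   # numeric Bag-of-Words form
topic_probs = lda_model.get_document_topics(bow)   # [(topic_id, probability), ...]
best_topic, prob = max(topic_probs, key=lambda tp: tp[1])
print(best_topic, f"{prob:.1%}", lda_model.show_topic(best_topic, topn=5))
        </preformat>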
        <p>In the process of developing software for thematic modelling of texts, there was a need to
optimise the performance of the pre-processing subsystem. One of the key elements at this stage is
the filtering of stop words - that is, those tokens that do not carry a semantic load, but significantly
increase the amount of processing during lemmatisation and vocabulary construction. Initially, we
used a complete list of Ukrainian stop words, containing more than 300 elements. However, when
tested on a full case with more than 800 documents, the processing time exceeded 150 minutes,
which is completely inefficient for practical use. In view of this, a custom optimised list was
created, which includes only the most frequent service words – about 50. It made it possible to reduce
the processing time to 51 minutes without significantly losing the quality of the topics.</p>
        <p>This graph shows how the size of the stop word list affects the processing time of texts during
thematic modelling. As you can see, when using a smaller custom list, the processing of the entire
corpus took 51 minutes, while when using the complete list of stop words, it took more than 150
minutes. It is because a higher number of stop words significantly increases the filtering and
processing time of each token in the text, especially when processing involves lemmatisation. In
this regard, in order to maintain the effectiveness of software execution, it was decided to use a
limited but relevant list of the most frequent service words. It made it possible to significantly
reduce the processing time without a critical loss of simulation quality. This analysis confirms that
thoughtful optimisation during the pre-processing phase has a significant impact on the overall
performance and efficiency of the system.</p>
        <p>For the convenience of the user and control over the execution of the software, the output of the
progress of word processing in real time was implemented. It became essential after expanding the
corpus of texts to more than 800 documents, as the pre-processing time (lemmatisation, filtering,
and tokenisation) increased to tens of minutes. To avoid a situation where the user does not
understand whether the program is "frozen" or really working, a progress bar was added using the
tqdm library, which displays a dynamic scale with the number of documents already processed. It
allows you to visually observe the progress of processing, estimate the pace of execution and
navigate the remaining time until completion. In this way, the output of execution progress has
increased the clarity, predictability, and usability of the system, which is an integral part of
frontend interaction even in console applications.</p>
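        <p>A minimal sketch of this progress display, assuming raw_texts holds the loaded documents and preprocess is the lemmatisation function from the pre-processing step:</p>
        <preformat>
from tqdm import tqdm

# tqdm wraps the iterable and prints a live progress bar with the count,
# rate and estimated remaining time for the documents being processed.
processed_texts = [preprocess(text)
                   for text in tqdm(raw_texts, desc="Pre-processing", unit="doc")]
        </preformat>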
        <p>To optimise performance and avoid re-wasting time on text processing, a mechanism for saving
already processed data has been implemented. It allows the software to run much faster when
reused, especially in the context of experiments, testing models, or changing classification
parameters. After the pre-processing step is completed, all cleaned and lemmatised texts are
automatically saved as a serialised object in .pkl (pickle) format. In particular, the
processed_texts.pkl file stores a list of tokenised texts that have already passed all stages of
preprocessing: lowering, removing stop words, lemmatisation, etc. In the future, when the system
starts, the program first checks whether the file with the processed data exists. If the file is found,
the data is loaded from the disk, and there is no need to process more than 800 documents again,
which can take up to an hour. This approach provides significant resource savings and improves
user experience, especially in environments with limited execution time, such as during
demonstrations, training, or research.</p>
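        <p>A sketch of this caching mechanism, assuming raw_texts and preprocess from the earlier steps:</p>
        <preformat>
import os
import pickle

CACHE = "processed_texts.pkl"

if os.path.exists(CACHE):
    with open(CACHE, "rb") as f:
        processed_texts = pickle.load(f)   # seconds instead of up to an hour
else:
    processed_texts = [preprocess(text) for text in raw_texts]
    with open(CACHE, "wb") as f:
        pickle.dump(processed_texts, f)    # reuse on the next launch
        </preformat>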
        <p>In the process of developing a system of thematic modelling of texts, it was decided to use
machine learning not only to build the model itself, but also to optimally select the number of
topics. It is critically important because too few topics can lead to over-generalisation and loss of
content, while too many topics can lead to excessive division of texts, which reduces the
quality of classification.</p>
<p>For each of the models, a coherence metric (in particular, c_v) was calculated, which shows how
logically the words within a topic are related in terms of semantic proximity, for each candidate
value of num_topics. The quality of the constructed topics was analysed, and among all the options,
the number of topics that provided the highest coherence was chosen. Thus, the decision was not made manually, but on the
basis of an objective indicator of the quality of the model, calculated during the training. Thanks to
this approach, it was possible to achieve a more stable, interpreted, and high-quality thematic
model that confirms the effectiveness of the use of machine learning methods in the tasks of
thematic analysis of texts. In order to determine the optimal number of topics for building a
thematic model, a series of experiments was conducted using machine learning and coherence
metrics (in particular, c_v). In the course of these experiments, 29 LDA models were built with a
different number of topics from 12 to 40. Based on each of the models, the coherence of the
indicator was calculated, reflecting how logically the words in the topic are related to each other
from the point of view of real language usage. The visualisation of the results was presented in the
form of a line graph, where the number of topics is displayed along the X axis, and the coherence
values are displayed along the Y axis. The graph clearly shows the fluctuations in the quality of the
models. The highest coherence values were achieved for the following configurations:
– num_topics=12 → coherence = 0.5135
– num_topics=22 → coherence = 0.5167
– num_topics=27 → coherence = 0.5102
– num_topics=28 → coherence = 0.5025</p>
        <p>It indicates that these configurations describe the topics in the corpus in the most balanced way,
providing high semantic coherence of topic keywords. As can be seen from the graph, too many
topics lead to a decrease in coherence, since the model "blurs" the context between them. Based on
this analysis, the optimal number of topics was selected – 22, which provides the highest coherence
within the framework of the experiment. Thus, the process of choosing the number of topics was not
implemented manually, but based on a quality metric that follows the principles of reasonable
tuning of machine learning models. A sketch of this selection procedure follows.</p>
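        <p>A sketch of the selection procedure, assuming corpus, dictionary and processed_texts come from the earlier steps: one model is trained per candidate value (12–40, i.e. 29 models) and the value with the highest c_v is kept.</p>
        <preformat>
from gensim.models import LdaModel, CoherenceModel

scores = {}
for k in range(12, 41):                       # 29 candidate topic counts
    model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=k, passes=10)
    cm = CoherenceModel(model=model, texts=processed_texts,
                        dictionary=dictionary, coherence="c_v")
    scores[k] = cm.get_coherence()

best_k = max(scores, key=scores.get)          # 22 in the experiments reported here
print(best_k, scores[best_k])
        </preformat>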
<p>This code fragment is implemented at the training stage of the LDA model, which is the basis
for the thematic modelling of texts. Its goal is to create a machine model that will be able to detect
hidden topics in Ukrainian-language texts based on the joint appearance of words in documents.
The parameter num_topics=22 was chosen because a prior automated coherence analysis determined
that 22 topics provided the highest topic quality (coherence ≈ 0.5167). Thanks to the passes=10
parameter, the model passes through the entire corpus 10 times, which gives greater stability of the
topics. Setting alpha='auto' allows the model to independently adapt the distribution of topics in
documents, which is especially useful when working with imperfectly balanced data. A sound signal,
Beep(1000, 1000), is triggered after the completion of the training; it is added for convenience,
since training can take tens of minutes, and the beep removes the need to constantly monitor the
laptop. This stage is key, because it is here that the model is formed (a sketch of the training call
is given after the list below), which will later:
</p>
<p>
 classify new texts by topic;
 allow you to visualise the connections between words;
 serve as the basis for the interpretation and generation of topic names.</p>
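        <p>A sketch of the training call described above (the hyperparameters are those named in the text; winsound is the Windows-only standard library module used for the sound signal):</p>
        <preformat>
from gensim.models import LdaModel
import winsound

lda_model = LdaModel(corpus=corpus, id2word=dictionary,
                     num_topics=22,   # chosen by the coherence analysis (c_v ≈ 0.5167)
                     passes=10,       # ten full passes over the corpus for stability
                     alpha="auto")    # adapt topic proportions to imbalanced data

winsound.Beep(1000, 1000)  # 1000 Hz for one second: training has finished
        </preformat>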
<p>After completing the training of the thematic modelling (LDA) model based on
Ukrainian-language news texts, the system formed 22 topics. Each of the topics is represented by a set of the
most relevant words with corresponding weights reflecting their significance for a particular
topic. These keywords are the result of a probabilistic distribution of words in the corpus and allow
you to gain a deeper understanding of the content of each topic. One example is the theme
dominated by the words "Ukraine", "President", "Volodymyr", "Zelensky", "support" - indicates
political content, in particular related to leadership and international activities. Another topic may
include words like "generation", "youth", "culture", which indicate a completely different semantic
emphasis. The results obtained make it possible to automatically interpret topics, analyse
information flows and structure large volumes of texts. Thematic word distributions are further
used to generate topic names, which makes the models more understandable for the user. It also
opens up the possibility of classifying new texts: the system can determine which topic the newly
received text belongs to, with the corresponding probability. Thus, this stage is critically important
in the entire chain of operation of the software tool, because it is on its basis that a knowledge base
is formed, which provides all the further functionality of analysis, interpretation and visualisation
of text data.</p>
        <p>The image shows an interactive visualisation of the results of thematic modelling created using
the pyLDAvis library. This approach allows you to intuitively understand the structure of the
constructed LDA model and assess how clearly the topics are delineated and which words are the
most characteristic for each of them. The visualisation consists of two parts: the left pane shows a
map of topics, and the right pane shows a list of the most relevant terms for the selected topic. On
the left side of the visualisation, the so-called "Intertopic Distance Map" is displayed, which
demonstrates how topics are arranged in vector space. Each circle represents a different topic, and
its size reflects the proportion of documents related to that topic. The distance between the circles
indicates the similarity of the topics: the closer the circles, the more similar the topics in content,
and if the circles do not intersect, this shows a clear separation of topics. For example, the largest
circle on the graph is topic 1, which occupies the largest share in the corpus of texts. The right
pane lists the 30 most important terms for the selected topic. Light blue bars indicate the total
frequency of use of a word in all texts, while red bars indicate the frequency of this word in the
selected topic. It allows you to see which words are really relevant to a particular topic and not just
frequently used in the corpus. In our case, topic one is characterised by the words "Ukraine",
"president", "Zelensky", "Volodymyr", "support", "state", etc., which indicates political topics related
to state power and the country's leadership. This type of visualisation is beneficial for analysing the
quality of the model, interpreting the content of topics, and later use in the user interface or
reports. It allows not only an analyst, but also an ordinary user without deep knowledge of
machine learning to quickly understand what each topic is about and how well the model divided
the topics of the documents.</p>
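        <p>A sketch of producing this interactive map, assuming the trained model, corpus and dictionary from the earlier steps:</p>
        <preformat>
import pyLDAvis
import pyLDAvis.gensim_models

vis = pyLDAvis.gensim_models.prepare(lda_model, corpus, dictionary)
pyLDAvis.display(vis)                        # inline display in Jupyter Notebook
pyLDAvis.save_html(vis, "lda_topics.html")   # or export as a standalone page
        </preformat>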
        <p>In the course of the implementation of the software for thematic modelling, all documents from
the corpus of texts were divided into topics that the trained LDA model defined. It made it possible
to see which topics are the most common among the analysed texts, as well as to identify less
covered or even highly specialised areas. The image shows the final statistics: each topic
corresponds to a certain number of documents. For example, the most significant number of texts,
208, fell into topic 11. It means that this topic is the most representative of the corpus, and its
content has the most significant information load. Topics 2 (107 documents), 12 (76 documents) and 20 (68
documents) also have a considerable number, suggesting that the texts are mostly centred around a few
leading topics. At the same time, some topics cover only 1-3 documents (for example, topics 21, 4,
3, 10, 16). It may be due to the fact that some texts cover particular events or topics that do not
have a broad representation in the general corpus. Such a distribution is proper both for assessing
the balance of a data set and for further use in analysis, for example, to identify thematic priorities
in news content, to identify topics that require additional attention, or to divide texts into thematic
clusters. It also allows you to form an idea of the thematic coverage, which can be used for
decision-making in a journalistic, informational or analytical context. After the LDA model formed
topics in the form of a set of keywords, there was a need to make them more understandable to a
person. After all, a set of words is just a machine representation, from which it is difficult to
quickly understand what precisely the topic is about. Therefore, a special module was implemented
that automatically generates topic names based on the keywords that characterise them. For
example, if among the keywords of the topic there are often "president", "office", "Volodymyr", then
such a topic can be called "Presidential activity". If the words refer to such concepts as "child",
"protection", "rights", the topic is called "Protection of children's rights", etc. Thus, we do not just
leave topics in the form of machine combinations of words, but transform them into
human-readable titles. It greatly facilitates the perception of modelling results and makes them suitable for
practical application both in reports and in interactive text analysis. The user can now easily
navigate which topic means which without having to analyse a technical set of words. It is an
essential step in "interpreting" the model and bringing the results of machine learning closer to real
use.</p>
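        <p>A sketch of how generate_smart_title(keywords) could work; the keyword-to-title rules below are illustrative assumptions, not the project's actual rule set:</p>
        <preformat>
def generate_smart_title(keywords):
    """Map a topic's keywords to a human-readable name via simple trigger rules."""
    rules = [
        ({"president", "office"}, "Presidential activity"),
        ({"child", "protection", "rights"}, "Protection of children's rights"),
        ({"education", "science"}, "Education and science"),
    ]
    kw = set(keywords)
    for trigger, title in rules:
        if trigger.issubset(kw):       # all trigger words appear among the keywords
            return title
    return ", ".join(keywords[:3])     # fallback: top-3 keywords as the name

print(generate_smart_title(["president", "office", "support"]))
        </preformat>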
        <p>At the final stage of the work, the function of recognising the topic of third-party text by
calculating the probabilities of its belonging to already trained topics was implemented. It made it
possible to assess the practical ability of the built LDA model to classify new documents without
prior reference to the training corpus. An experiment was conducted with a test piece of text
that was not part of the training kit. The distribution of topics for this text was analysed separately
for two models: the one that was trained on a smaller corpus of about 300 texts and the one based
on an extended corpus of 830 articles. For a larger dataset, the central theme received a weight of
49.62%, which indicates the high confidence of the model in the classification.</p>
        <p>On the other hand, on a smaller dataset, the main topic had a similar weight - 48%, but the
second topic was almost equal to it - 46%, which may indicate a lower accuracy of the model due to
a lack of training examples. The graph below shows a comparison of the distribution of topics
between the two models. Visualisation confirms that the growth of the volume of data significantly
improves the clarity of classification and reduces the blurring of results. It also reduces the chance
of misclassification of text between two nearly equivalent topics. Thus, the increase in the learning
corpus directly affects the quality of topic recognition in new documents.</p>
        <p>A test run of the software for thematic modelling of Ukrainian-language texts was carried out,
which confirmed its operability and compliance with the task. The main goal was to check whether
the system built on the basis of the LDA model is able to recognise topics in new texts and provide
a meaningful interpretation of the results. In the course of the work, a test task was formulated - to
automatically determine the topic of the new Ukrainian-language text. For this, a model previously
trained on a large body of news articles was used. Two training options were tested: on a smaller
set (approximately 300 documents) and on a much larger set (more than 800 documents). It made it
possible to see the impact of the amount of data on the accuracy of the distribution of topics. As
the analysis showed, the model trained on a larger dataset demonstrated higher coherence and a
more stable probability distribution, which indicates a higher quality of thematic classification.
During testing, a complete cycle was implemented: pre-processing of the text with lemmatisation,
construction of thematic distribution, interpretation of topics, generation of topic names, and
display of results in a convenient form. It was conveniently organised to control the processing
execution (through process output and completion signals), save data to avoid re-wasting
resources, and visualise the results in the form of diagrams and pyLDAvis graphs. Summing up, it
can be argued that the developed software not only demonstrates correct technical implementation
but is also able to provide flexible, effective thematic modelling of text data. Such a tool can be
helpful for analysts, journalists, researchers, or information systems that require quick orientation
in large arrays of Ukrainian-language texts.</p>
      </sec>
      <sec id="sec-1-5">
        <title>7. Discussion</title>
        <p>When developing software, coherence (a measure of consistency of topics) is compared when using
different amounts of data. For the first dataset (~300 texts), the coherence of the model was 0.462,
while after switching to the extended dataset (~830 texts), it increased to 0.516. It suggests that a
larger body allows the model to better shape topics - the keywords in them are more related, and
the classification results are more resistant to random deviations. Thus, the quality of the model
directly depends on the volume of the training corpus.</p>
        <p>In order not to repeat the lengthy processing process each time, it is implemented to save the
processed case to the processed_texts.pkl file. It allows you to load ready-made data at the
subsequent launch of the program, which reduces the waiting time from tens of minutes to several
seconds. In addition, visual observation of pre-processing progress via tqdm has been implemented,
and a sound signal has been added after the model training is completed, which is convenient for
long calculations.</p>
        <p>The left graph demonstrates how an increase in the volume of the dataset has a positive effect
on the quality of the thematic model. When the first dataset, consisting of ~300 documents, was
used, the coherence of the model (a measure of its thematic consistency) was approximately 0.462.
After expanding the corpus to more than 800 documents, the coherence value increased to 0.516,
indicating an improvement in the quality of the topic classification. The right graph illustrates how
the number of stop words affects the processing time of the text. When using a smaller list of stop
words, the process of pre-processing the entire text took approximately 51 minutes. However, an
attempt to apply a complete extended list led to a significant increase in duration - more than 150
minutes. It showed that, in order to maintain processing efficiency, a balance must be found
between the depth of text clean-up and performance.</p>
        <p>The main goal of this work was not just to check the functionality of the model but also to
assess how stable, fast, and qualitatively it works under different conditions. It was found that the
quality of thematic modelling directly depends on the volume of the training corpus. With an
increase in the number of documents from ~300 to more than 800, the coherence of the model
increased significantly from 0.462 to 0.516, which indicates better structured and accurate topics. In
this way, the model becomes more meaningfully expressive and resistant to mixed themes.
Separately, the performance of the system was analysed, particularly the time required for word
processing. It turned out that the use of a complete list of stop words dramatically increases the
duration of pre-processing from 51 to more than 150 minutes. It was the basis for the decision to
use a shortened, optimised list of stop words, which allows you to maintain a balance between
processing depth and performance. In addition, a number of technical improvements have been
implemented: a progress bar (tqdm), a sound signal about completion, and saving processed texts
to a file (pickle). It made it possible to save time significantly when restarting the program and
made interaction with it more comfortable. Thanks to the analysis, it became apparent that the
created software is not only functional, but also efficient, scalable and suitable for further use in
real tasks of text data analysis. The results of the work confirmed the feasibility of using machine
learning for thematic modelling and the importance of correctly adjusting parameters to achieve
maximum quality.</p>
      </sec>
      <sec id="sec-1-6">
        <title>8. Conclusions</title>
        <p>Software for thematic modelling of Ukrainian-language texts based on the LDA (Latent Dirichlet
Allocation) algorithm was designed, implemented and tested. The system was created from scratch,
taking into account the peculiarities of the Ukrainian language, the specifics of working with text
corpora and the requirements for the interpretation of results for an ordinary user. Several datasets
were built: at the first stage, a test case of about 300 documents, and later a full-fledged extended
case with a volume of more than 830 documents. It made it possible to conclude the effect of the
amount of training data on the quality of the model, in particular, on coherence (which increased
from 0.46 to 0.516 with an increase in the corpus). The system covers all the main stages of text
analysis: pre-processing (cleaning, tokenisation, lemmatisation via stanza), conversion to numerical
format, training a thematic model, building a dictionary of topics, automatic assignment of new
texts to topics, as well as generation of conditional names of topics for user convenience.
Visualisation of results via pyLDAvis was also implemented, which made it possible to better
interpret the topic space and estimate the distances between them. Particular attention was paid to
usability: saving processed data (pickle), displaying processing progress via tqdm, sound
notifications about the completion of calculations, and optimisation of work with stop words.
Thanks to these solutions, the software became not only functional, but also practical in use. After
training the model, the functionality of classifying new (third-party) texts was implemented and
tested. The results demonstrate that the system is able to correctly determine the subject matter of
even those documents that it has not seen before. The results were compared using two variants of
the trained model on smaller and larger datasets. In both cases, the model returned meaningful and
logical results, but with the larger corpus, the results were more stable and more confidently
interpreted. This project is essential in terms of the practical application of natural language
processing and machine learning methods. It proved that it is possible to effectively perform
thematic modelling of Ukrainian-language documents using modern tools (gensim, stanza,
pyLDAvis) even without the use of powerful clusters or ample computing resources. It has been
confirmed that the quality of the LDA model significantly depends on the hull volume, purity, and
quality of pre-processing, optimal selection of the number of topics, and balance between the
completeness of the stop dictionary and the speed of processing. The developed software can be
adapted to other languages, extended for more complex corpora, or integrated into larger systems
such as web applications, dashboards, or content filtering systems. Prospects for further research:
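        <p>The following condensed sketch illustrates the pipeline stages listed above, from stanza
lemmatisation to gensim LDA training, coherence evaluation, pyLDAvis visualisation, and assignment
of an unseen text to topics. Hyperparameter values, variable names, and the choice of the c_v
coherence metric are illustrative assumptions, not the authors' exact settings.</p>
        <preformat>
import stanza
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel, LdaModel
import pyLDAvis
import pyLDAvis.gensim_models

# stanza.download("uk")  # required once before first use
nlp = stanza.Pipeline("uk", processors="tokenize,mwt,pos,lemma")

def lemmatise(text):
    # Lowercased lemmas of alphabetic tokens only.
    doc = nlp(text)
    return [w.lemma.lower() for s in doc.sentences for w in s.words
            if w.lemma is not None and w.lemma.isalpha()]

raw_texts = ["..."]  # assumed: the document corpus as a list of strings
new_text = "..."     # assumed: an unseen (third-party) document

docs = [lemmatise(t) for t in raw_texts]
dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]

# Illustrative hyperparameters; as discussed above, the optimal number
# of topics depends on the corpus.
lda = LdaModel(corpus=corpus, id2word=dictionary,
               num_topics=10, alpha="auto", passes=10)

# Coherence evaluation (c_v shown here; the exact metric used for the
# 0.46 -> 0.516 comparison is an assumption).
coherence = CoherenceModel(model=lda, texts=docs, dictionary=dictionary,
                           coherence="c_v").get_coherence()

# Interactive visualisation of the topic space, saved as HTML.
pyLDAvis.save_html(pyLDAvis.gensim_models.prepare(lda, corpus, dictionary),
                   "lda.html")

# Assign an unseen document to the trained topics.
print(lda.get_document_topics(dictionary.doc2bow(lemmatise(new_text))))
        </preformat>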
        <p>Prospects for further research:</p>
        <p>• Integration of other topic models, such as BERTopic or NMF with modern vector
representations (e.g. BERT or FastText), which could increase the accuracy and flexibility of
topic definition.</p>
        <p> Evaluation of the quality of the model by the user - implementation of feedback
mechanisms (for example, if the user agrees or disagrees with the topic assigned to the
text).</p>
        <p> Analysis of the dynamics of topics over time - identifying how popular issues in
the news stream or publications change over periods.</p>
        <p> Clustering of users or sources - based on the topics they produce or read; it is
possible to build recommended systems.</p>
        <p> Deeper coherence research involves various metrics (u_mass, c_npmi) and manual
evaluation by experts.</p>
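        <p>As a sketch of the metric comparison suggested above, reusing the lda, corpus, docs, and
dictionary names assumed in the earlier pipeline sketch: u_mass is computed from the corpus alone,
while c_v and c_npmi require the tokenised texts.</p>
        <preformat>
from gensim.models import CoherenceModel

for metric in ("u_mass", "c_v", "c_npmi"):
    # u_mass works from co-occurrence counts in the bag-of-words corpus;
    # the sliding-window metrics (c_v, c_npmi) need the tokenised texts.
    cm = CoherenceModel(model=lda,
                        corpus=corpus if metric == "u_mass" else None,
                        texts=None if metric == "u_mass" else docs,
                        dictionary=dictionary,
                        coherence=metric)
    print(metric, round(cm.get_coherence(), 3))
        </preformat>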
        <p>The study resulted in a complete, functional and optimised system for thematic analysis of
Ukrainian-language texts, combining elements of machine learning, natural language processing, and
visual analytics. Work on the project deepened practical skills in building NLP models, optimising
code, and interpreting results. Beyond the practical outcome, it was also a meaningful learning
experience, forming a basis for more complex research or commercial solutions in the future.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>Acknowledgements</title>
    </sec>
    <sec id="sec-3">
      <title>Declaration on Generative AI</title>
      <p>The authors have not employed any Generative AI tools.</p>
    </sec>
  </body>
</article>