<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Bratislava University of Economics and Management</institution>
          ,
          <addr-line>Furdekova 16, Bratislava</addr-line>
          ,
          <country>Slovak Republic</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>CNRS, Univ. Lille, UMR 8163 - STL - Savoirs Textes Langage</institution>
          ,
          <addr-line>F-59000 Lille</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>National Technical University “Kharkiv Polytechnic Institute”</institution>
          ,
          <addr-line>Kyrpychova str. 2, Kharkiv, 61002</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>CNRS, Maison Européenne des Sciences de l'Homme et de la Société (MESHS, UAR3185)</institution>
          ,
          <addr-line>365 bis, rue Jules Guesde 59650 Villeneuve d'Ascq</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <volume>000</volume>
      <fpage>0</fpage>
      <lpage>0002</lpage>
      <abstract>
        <p>Assessing semantic similarity between texts of different length, such as questions and their extended answers, remains challenging in natural language processing. This study investigates whether topic modelling, specifically BERTopic, can effectively capture semantic similarity in such cases. The EFCAMDAT corpus, a large-scale dataset of learners' written texts, was used for experimentation. The research addresses two key questions: (1) Do students' texts correlate with the questions they are asked? (2) How does this correlation vary across different levels of foreign language proficiency? The findings indicate that semantic similarity between questions and answers can be identified using topic modelling. Keyword analysis confirms a correlation between the examined elements; however, the method for determining semantic similarity still requires further refinement to improve accuracy.</p>
      </abstract>
      <kwd-group>
        <kwd>text semantic analysis</kwd>
        <kwd>topic analysis</kwd>
        <kwd>topic modelling</kwd>
        <kwd>semantic similarity</kwd>
        <kwd>text summarisation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Text semantics evaluation is a strategic area of linguistics for both linguistic theory and natural
language processing (NLP). On the one hand, it belongs to the AI tasks of understanding texts in different
languages; on the other, it addresses the lack of processed data in low-resource languages and
domains through the development of transfer learning models.</p>
      <p>
        Semantic analysis, or the process of recognizing the semantics of a text and establishing
relationships between words, phrases, sentences, paragraphs and texts by their
language-independent meanings, is an important component in various NLP tasks, such as sense recognition
[1], text summarization [
        <xref ref-type="bibr" rid="ref1">2</xref>
        ], short text evaluation [
        <xref ref-type="bibr" rid="ref2">3</xref>
        ], determining the degree of semantic similarity
between texts [
        <xref ref-type="bibr" rid="ref3">4</xref>
        ], text classification [
        <xref ref-type="bibr" rid="ref4">5</xref>
        ], clustering of text documents [
        <xref ref-type="bibr" rid="ref5 ref6">6, 7</xref>
        ], and cross-lingual transfer modelling. Semantic analysis is the first step in the search for a
formalised unit of meaning that unites texts of different sizes (in our case, questions for English as a
Foreign Language (EFL) learners and the answers provided by students with varying language levels
from the EFCAMDAT corpus), and for a tool that allows us to evaluate the results of condensing
meaning into a single phrase or even a word and, vice versa, of expanding an idea into a whole text
while retaining its main idea. This can be the basis of a transfer learning model for low-resource
languages and domains, in particular for Ukrainian.
      </p>
      <p>We hypothesise that meaning transfer can occur between languages, within a language, its
stylistic layers, or between common language and professional domains. Thus, if 100 people answer
the same question, their answers can be reduced to the essence of the question, i.e. a single element
of meaning. The research aims to test how to determine that given texts refer to one question since
the answers to this question must have something in common - the elements of meaning we are
looking for.</p>
      <p>To solve this problem, we use topic modelling based on the ModernBERT large language model.
As research material, we use the EFCAMDAT corpus of English texts, which contains
answers to questions from learners of English as a second language (L2). This study compares texts
of different lengths on the same topic. The questions in the EFCAMDAT corpus are examples of short
texts, and we call them text-questions or simply questions. Student answers are examples of long
texts, and we call them text-answers or answers. We expect the topic modelling method to show
us the correlation between the corresponding text-questions and text-answers. We assume that
questions are topics and answers are texts that correspond to these topics. Hence, the BERTopic
model should match answers and questions with some minimal error. In this way, we want to answer
the research questions:
1) Do students' texts correlate with the questions they are asked?
2) How does this correlation vary across different levels of foreign language proficiency?</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        The main objective of semantic similarity is to measure the distance between the semantic meanings
of a pair of words, phrases, sentences, or documents [
        <xref ref-type="bibr" rid="ref3">4</xref>
        ]. Semantic similarity is a metric defined over
a set of documents or terms, where the idea of distance between items is based on the likeness of
their meaning or semantic content as opposed to lexicographical similarity [
        <xref ref-type="bibr" rid="ref9">10</xref>
        ]. We used it to
estimate the strength of the semantic relationship between language units (words), concepts and
texts through topic modelling, based on comparing information across texts that answer the same
question. It should be noted, however, that the term semantic similarity usually covers only the
‘is a’ relationship; in that case, we can also use the notion of 'semantic relatedness'
[
        <xref ref-type="bibr" rid="ref10">11</xref>
        ]. Defining semantic relatedness also requires understanding the lexical hierarchy [
        <xref ref-type="bibr" rid="ref10">11</xref>
        ], using the
concept of the lexical-semantic field [
        <xref ref-type="bibr" rid="ref11 ref12">12, 13</xref>
        ] and methods of measuring it [
        <xref ref-type="bibr" rid="ref13">14</xref>
        ]. Lexical semantics is
also related to concepts such as connotation (semiotics) [
        <xref ref-type="bibr" rid="ref14">15</xref>
        ] and collocation, a specific combination
of words that often co-occur with or surround a given word.
      </p>
      <p>
        In addition, words in the language are combined non-compositionally to form multi-word
expressions (syntagms), whose meaning cannot be derived from the standard representation of their
components [
        <xref ref-type="bibr" rid="ref15">16</xref>
        ]. Thus, for many domains or languages, it is essential not only to have
cross-linguistic representations of individual words but also to compose them into correct embeddings of
phrases, sentences, and other high-level cross-linguistic representations [
        <xref ref-type="bibr" rid="ref16 ref17">17, 18</xref>
        ].
      </p>
      <p>
        The method of Topic Modelling of texts is relevant today and has great potential for improving
the transfer learning model for low-resource languages and domains. However, topic modelling
methods such as latent Dirichlet allocation (LDA), latent semantic analysis (LSA), or keyword
extraction technique KeyBERT are not suitable because they are based on the “bag of words”
principle. Keywords alone are insufficient to solve the task of correlating text with a topic (question).
It is necessary to account for the size of the context so that the weights are proportional [
        <xref ref-type="bibr" rid="ref18">19</xref>
        ].
BERTopic is better suited because it is considered the most contextualised model, owing to its class-based TF-IDF
methodology [
        <xref ref-type="bibr" rid="ref19 ref20">20, 21</xref>
        ].
      </p>
      <p>Semantic analysis also involves identifying, as far as possible, features specific to certain linguistic
(professional-domain) and cultural contexts. We see opportunities for semantic analysis of these
features through cross-linguistic comparison of the scope of lexical and semantic fields of individual
concepts. We consider this a novel approach to the task of cross-lingual transfer.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Materials and Methods</title>
      <sec id="sec-3-1">
        <title>3.1. Materials</title>
        <p>
          The material for the study is the EFCAMDAT [
          <xref ref-type="bibr" rid="ref29">30</xref>
          ] corpus [
          <xref ref-type="bibr" rid="ref21">22</xref>
          ], which consists of students' answers
to the questions asked. The students come from all over the world and study English as a foreign
language (L2). They may have different CEFR levels (A1, A2, B1, B2, C1, C2). The corpus was first
released in July 2013.
        </p>
        <p>
          The EFCAMDAT corpus is an open-access corpus of student work submitted to Englishtown, an
online school from EF Education First [
          <xref ref-type="bibr" rid="ref30">31</xref>
          ]. The entire Englishtown course offers 16 levels of
language proficiency according to common standards such as TOEFL, IELTS and the Common
European Framework of Reference for Languages (CEFR) [
          <xref ref-type="bibr" rid="ref22">23</xref>
          ].
        </p>
        <p>The EFCAMDAT corpus consists of scripted writing tasks on a specific question at the end of
each lesson. The corpus does not contain direct information on learners' first language (L1), so
nationality is the closest available proxy. EFCAMDAT contains data on students of 198
nationalities. The answers of Ukrainian students make up less than 1%, namely 0.11%, of the total
number of answers.</p>
        <p>
          For this study, we used the data collected for the second release in September 2017, which contains
1,180,310 scripts (with 7,126,752 sentences and 83,543,480 tokens) written by 174,743 students. This
text corpus includes information on learner errors, parts of speech, and grammatical relationships.
All tasks were evaluated by English teachers. Currently, EFCAMDAT contains teacher feedback for
66% of the answers [
          <xref ref-type="bibr" rid="ref23">24</xref>
          ].
        </p>
        <p>As materials, the questions and a certain number of students' answers to these questions were
taken from the EFCAMDAT corpus. We assume that the questions are topics and the answers are
texts corresponding to these topics. At each level, we have 24 questions (except for C2, where the
number of questions is 8).</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Topic modelling methods</title>
        <p>
          First, we use the LDA method to compare the topic modelling of student answers to the manual
thematic distribution of questions. LDA is applied to a bag-of-words (BOW) representation: information
on the frequency of words is exploited, but their contexts are lost [
          <xref ref-type="bibr" rid="ref24 ref25">25, 26</xref>
          ].
        </p>
        <p>
          To match students' text-answers to the studied text-questions, we used topic modelling
based on the ModernBERT large language model. ModernBERT is a modernised bidirectional
encoder-only Transformer model (BERT-style) pre-trained on 2 trillion tokens of English text and code,
with a native context length of up to 8,192 tokens. ModernBERT's native long context makes
it well suited to tasks that require processing long documents, such as retrieval, classification, and
semantic search within large corpora. Because the model was trained on a large corpus of text and code,
it is suitable for a wide range of downstream tasks, including code retrieval and hybrid (text +
code) semantic search [
          <xref ref-type="bibr" rid="ref20">21</xref>
          ].
        </p>
        <p>
          For topic extraction, we used the BERTopic method, which is based on the TF-IDF statistical matrix
and takes better account of context than LDA and other topic modelling models [
          <xref ref-type="bibr" rid="ref26 ref27">27, 28</xref>
          ]. BERTopic
[
          <xref ref-type="bibr" rid="ref31">32</xref>
          ] generates document embeddings with pre-trained transformer-based language models, clusters
these embeddings, and generates topic representations with the class-based TF-IDF procedure. The
semantic properties of text embedding representations allow the meaning of texts to be encoded in
such a way that similar texts are close in vector space.
        </p>
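<p>The closeness of similar texts in vector space is typically quantified with cosine similarity. The following minimal sketch, with toy three-dimensional vectors standing in for real sentence embeddings, illustrates the idea:</p>

```python
import math

def cosine_similarity(u, v):
    # Cosine of the angle between two embedding vectors:
    # near 1.0 = same direction (similar meaning), near 0.0 = unrelated.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors standing in for embeddings of a question and two answers.
question = [0.9, 0.1, 0.0]
on_topic_answer = [0.8, 0.2, 0.1]
off_topic_answer = [0.0, 0.1, 0.9]

print(cosine_similarity(question, on_topic_answer) >
      cosine_similarity(question, off_topic_answer))  # True: the on-topic answer is closer
```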
        <p>
          BERTopic generates topic representations in three steps. First, each document is converted to its
embedding representation using a pre-trained language model. Then, before clustering these
embeddings, the dimensionality of the resulting embeddings is reduced to optimise the clustering
process. Finally, from the clusters of documents, topic representations are extracted using a custom
class-based variation of TF-IDF [
          <xref ref-type="bibr" rid="ref19">20</xref>
          ].
        </p>
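<p>The class-based TF-IDF step can be sketched in plain Python. This is a simplified rendering of the c-TF-IDF idea (term frequency within a topic's pooled documents, scaled by an inverse overall frequency), not the library's exact implementation, and the toy topics are illustrative:</p>

```python
import math
from collections import Counter

def c_tf_idf(classes):
    """Simplified class-based TF-IDF: `classes` maps a topic id to the list of
    words pooled from all documents clustered into that topic."""
    counts = {c: Counter(words) for c, words in classes.items()}
    total_freq = Counter()  # frequency of each word over all classes
    for cnt in counts.values():
        total_freq.update(cnt)
    avg_words = sum(len(w) for w in classes.values()) / len(classes)
    scores = {}
    for c, cnt in counts.items():
        n_words = sum(cnt.values())
        # Weight = in-class term frequency, scaled down for words common everywhere.
        scores[c] = {t: (f / n_words) * math.log(1 + avg_words / total_freq[t])
                     for t, f in cnt.items()}
    return scores

# Two toy topics built from the words of the answers assigned to them.
topics = {
    0: "travel trip hotel flight travel hotel".split(),
    1: "family mother father family home".split(),
}
scores = c_tf_idf(topics)
print(max(scores[1], key=scores[1].get))  # prints family
```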
        <p>
          To find the best clustering model for the EFCAMDAT corpus, we first tested BERTopic methods
on the 20NewsGroups [
          <xref ref-type="bibr" rid="ref32">33</xref>
          ] dataset, a classic example of topic modelling, one of the three datasets
used to validate BERTopic. 20NewsGroups contains 18,846 news items from English-language
forums. This dataset was pre-processed using Galileo [
          <xref ref-type="bibr" rid="ref33">34</xref>
          ] by removing punctuation, lemmatisation,
and stop words, as well as removing documents containing less than 5 words and empty messages,
which amounted to 1,163 messages. The number of analysed articles after cleaning was 17,734
articles in 20 thematic categories.
        </p>
        <p>
          Our results for Topic modelling of this benchmark 20NewsGroups align with those reported in
[
          <xref ref-type="bibr" rid="ref26 ref27">27, 28</xref>
          ]. In addition, BERTopic demonstrated good coherence and accuracy in formulating topics
using 4 keywords compared to topic titles in 20NewsGroups. Although the BERTopic model does
not currently have the best results in terms of CV (topic coherence) and TD (topic diversity) metrics,
according to [
          <xref ref-type="bibr" rid="ref28">29</xref>
          ], it is among the three most accurate and fastest models.
        </p>
        <p>It should be noted that, when testing BERTopic on the 20NewsGroups benchmark, the first topic
is “Topic -1”, which groups texts that have not been assigned to any of the topics, the so-called
outliers. That is why we set the distribution to 20 topics (with and without outliers) and 21 topics
(with and without outliers) to compare the results with the distribution marked in the benchmark.
We also allowed the texts to be divided into an automatically determined number of topics
and obtained the following results:
1) The division into 20 topics with outliers (Topic -1) assigns almost 40-50%, and sometimes 60%,
of the articles from each topic to outliers.
2) The division into 21 topics, i.e. 20 topics plus the outlier topic (Topic -1), assigns the texts to
the relevant topics with high confidence.
3) Many proposed topics (labels) of the 20NewsGroups benchmark coincide with a single Topic,
which is appropriate for closely related topics, for example, computers (5 labels) or sports (2
labels). In our results, combining space technology with cars and motorcycles, or atheism with
religion and politics, is possible. However, in the distribution into 20 Topics with outliers, the
topics were more clearly identified by the top 4 keywords than in the distribution into 21
Topics without outliers: despite the outliers, the remaining topics were characterised more
accurately by their 4 keywords.
4) In addition, the 20NewsGroups benchmark was analysed with the BERTopic model using the
previous-generation MiniLM vectoriser instead of the ModernBERT vectoriser. The results
were similar to the previous ones, with about a third of the documents for each label
assigned to Topic -1 (outliers) when divided into 20 topics. However, when divided into 21 topics,
the documents were more accurately grouped into separate Topics.</p>
        <p>Thus, we concluded that the best distribution for topic analysis of the EFCAMDAT corpus using
BERTopic should be 25 topics at levels A1-C1 and 9 topics at level C2, so that the topics outside the
Topic -1 outliers correspond in number to the distribution of the data across questions.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiment</title>
      <p>This section describes the results of experiments with the EFCAMDAT corpus in its second release of
2017. First, we clean and preprocess the texts. Then, we compare LDA-based topics with manually
chosen ones. Finally, we run the BERTopic model to evaluate semantic similarity between text-answers
and text-questions.</p>
      <sec id="sec-4-1">
        <title>4.1. Preparing corpus EFCAMDAT for the study</title>
        <p>For levels B1-C2, the study was conducted on the entire data set. Since the number of answers at the
lower levels A1-A2 was too large, the topic analysis there was conducted on a random subset selected
proportionally for each question. Figure 1 shows an example of the stratification validation of this
proportional random selection of answers for level A1.</p>
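<p>The proportional random selection per question can be sketched as follows; the field names and the 10% sampling fraction are illustrative assumptions rather than the study's exact parameters:</p>

```python
import random
from collections import defaultdict

def proportional_sample(answers, fraction, seed=42):
    """Draw the same fraction of answers from every question (stratum),
    so the sample preserves the per-question proportions of the level."""
    rng = random.Random(seed)
    by_question = defaultdict(list)
    for a in answers:
        by_question[a["question_id"]].append(a)
    sample = []
    for qid, group in by_question.items():
        k = max(1, round(len(group) * fraction))  # keep at least one answer per question
        sample.extend(rng.sample(group, k))
    return sample

# Toy level: 3 questions with 100, 50 and 10 answers each.
answers = [{"question_id": q, "text": f"answer {i}"}
           for q, n in [("q1", 100), ("q2", 50), ("q3", 10)] for i in range(n)]
sample = proportional_sample(answers, fraction=0.1)
print(len(sample))  # 10 + 5 + 1 = 16
```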
        <p>At the B1 level, some broken lines were removed, which allowed us to obtain a more accurate topic
modelling result.</p>
        <p>In data preparation, the texts were comprehensively cleaned: removal of empty answers,
conversion to lowercase, deletion of non-Latin characters, and removal of stop words, including auxiliary
verbs. We also cut off answers with poor scores from teachers, i.e. those that received a score below 64 on
a 100-point scale. The number of answers remaining after the cleaning is presented in Table 1.</p>
        <p>Table 1. Quantitative results of EFCAMDAT corpus cleaning by level (C2, C1, B2, B1, A2, A1) and in total.</p>
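<p>The cleaning steps above can be sketched in Python; the stop-word list, field names, and example records are illustrative placeholders rather than the resources actually used:</p>

```python
import re

STOP_WORDS = {"the", "a", "an", "is", "are", "was", "were", "do", "does", "have", "has"}  # toy list

def clean_answer(text):
    """Lowercase, strip non-Latin characters, drop stop words (incl. auxiliary verbs)."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # keep Latin letters and whitespace only
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return " ".join(tokens)

def filter_corpus(records, min_score=64):
    """Keep non-empty answers whose teacher score is at least 64/100."""
    cleaned = []
    for r in records:
        if r["score"] >= min_score:
            text = clean_answer(r["text"])
            if text:  # drop answers that became empty after cleaning
                cleaned.append({"text": text, "score": r["score"]})
    return cleaned

records = [
    {"text": "My family is very BIG - 7 людей!", "score": 80},
    {"text": "The weather was nice.", "score": 50},  # removed: score below 64
    {"text": "!!!", "score": 90},                    # removed: empty after cleaning
]
print(filter_corpus(records))  # prints [{'text': 'my family very big', 'score': 80}]
```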
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Topic modelling of EFCAMDAT by LDA</title>
        <p>
          The LDA model was created using Gensim [
          <xref ref-type="bibr" rid="ref34">35</xref>
          ], a Python library for topic modelling, where we set
the number of topics depending on how many topics the questions were manually divided into at
each level of the EFCAMDAT corpus. LDA is trained directly on our EFCAMDAT data.
        </p>
        <p>All the answers in the EFCAMDAT corpus are grouped into subsets that correspond to 128
questions at a particular level of proficiency. There are no common questions across levels. We have
also manually identified 10 topics based on the questions and indicated at which levels they occur, as
can be seen in Table 2.</p>
        <p>Table 2. Manually identified topics under which the 128 EFCAMDAT questions are grouped, by level (A1-C2): Family, Business, Travel, Goods, Feelings, Habits, Party, Home, Health, and Learning / Language Training.</p>
        <sec id="sec-4-2-1">
          <title>4.2.1. LDA results for levels B1-B2</title>
          <p>Using the LDA method, we analysed only levels B1-B2 to see whether the approach would be effective
for the entire data set, taking into account some system slowdowns and redundancy at levels A1-A2.
LDA was run in 20 passes with preliminary lemmatisation. We obtained the top 10 words for each
topic, a visualisation of the top 30 words, and the metrics shown in Table 3 (LDA model metrics).</p>
          <p>According to the comparison results, the word-topic coherence index is almost the same for both
levels and is around 0.6: B1 = 0.66; B2 = 0.59. However, the number of answers at level B1 is 2.74
times higher than at level B2. Therefore, the size of the vocabulary is 1.64 times larger at level B1
than at level B2. At the same time, the length of an answer is 1.4 times longer at B2.</p>
          <p>Based on the results of the Topic analysis, 10 keywords were also selected using the LDA method
for each of the 9 topics at the B1 level and 8 topics at the B2 level. Their weight indicates how specific
and relevant a word is to a given topic compared to other topics.</p>
          <p>The keywords were ranked according to their weight and importance within the topic, as can be
seen in Figures 1 and 2. A comparison of keywords for the same topics for B1 and B2 shows that at
B2, the top 10 words are better when matching manually selected topics.</p>
          <p>The full probability distribution (Figures 2, 3) over all topics for each answer allows us to see how
strongly the answer is related to each topic, not just the dominant one.</p>
          <p>According to the results of the LDA thematic analysis, the coherence of answers with the topics
that were manually selected is higher at the B2 level (68.84% to 91.25%), compared to the B1 level,
where the coherence is quite low (starting at 22.96%, exceeding 60% in only 3 cases and reaching a
maximum of 75%).</p>
          <p>For example, at the B2 level, it is interesting to note the selection of Topic 0, which shows high
coherence (90.35%) and, based on the lexical composition of 10 keywords, can be attributed to the
topic Business selected manually.</p>
          <p>However, according to the graphs in Figure 4, where this topic is represented by circle 3 and the
distribution of vocabulary by 30 keywords, we see that this topic is divided into several lexical and
semantic fields that do not even overlap.</p>
          <p>Thus, we conclude that:
1) At level B1, the thematic distribution does not correspond to the 9 manually identified themes
per question, in contrast to level B2, where some of the manually identified themes can be correlated
with the LDA distribution of 8 themes.</p>
          <p>2) The visualisations do not always correspond to the given topic numbering id, but they allow
us to expand the range of words that are relevant to a topic.</p>
        </sec>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Topic modelling of EFCAMDAT by BERTopic</title>
        <p>
          Topic modelling was done using Google Colab for all language levels A1-C2 of the EFCAMDAT
corpus. We used the ModernBERT-base [
          <xref ref-type="bibr" rid="ref35">36</xref>
          ], which has 22 layers and 149 million parameters. The
all-MiniLM-L6-v2 was used as a vectorizer for embedding, which provides a good balance between
quality and processing speed. To calculate the probability that certain themes are present in a
document, the HDBSCAN model was used at the clustering stage.
        </p>
      </sec>
      <sec id="sec-4-4">
        <title>4.3.1. Automatic detection of topics in EFCAMDAT with BERTopic</title>
        <p>In the first step of topic modelling, we did not constrain BERTopic to a fixed number of topics, so as
not to limit our ‘horizons’.</p>
        <p>The results of the topic analysis were presented in the form of: 1) clouds of answers by topic;
2) a hierarchical classification of topics; 3) a map of distances between topics.</p>
        <p>As a result, the following number of Topics were identified, which correspond to a certain number
of EFCAMDAT questions, as we can see in Table 4:</p>
        <p>The difference in the number of Topics for the same number of Questions by level is explained
by the different number of answers, which increases with the lower levels, especially A1-A2.</p>
        <p>The resulting visualisations form rather dense areas of topics, which are close to each other, as
we can see in Figure 5. Hence, we can assume that these are the semantic fields of the topics set in
the themes, which are revealed in the students' answers.</p>
        <p>Comparing the results of the analysis between levels is complicated by the fact that the topics
assigned to students at A1-C2 levels do not match. Accordingly, the grouping of topics according to
BERTopic modelling does not correspond to the Topics that we identified manually in Table 2.</p>
        <p>For example, in Figure 6, the red and turquoise groupings at A2 are probably answers to the themes
‘A2|Writing a resume’, which we assigned to the topic ‘Yourself|Family’ rather than to a profession topic,
which we did not even select, and ‘A2|Complaining about a meal’, which was assigned to the general
topics ‘Food’ or ‘Feelings’ but does not convey an associative connection with the topic of restaurants,
which was likewise absent from the 10 manually selected topics.</p>
        <p>Since there is no direct correspondence between the manually selected topics and questions, the
automatically obtained clusters of topics correlated with the questions can likely lead to another set
of topics based on the semes and sememes from lexical and semantic fields.</p>
      </sec>
      <sec id="sec-4-5">
        <title>4.3.2. Identifying 24 and 25 questions in EFCAMDAT by BERTopic</title>
        <p>EFCAMDAT answers were divided by BERTopic into 24 topics, according to the number of questions
at levels A1-C1. At the C2 level, we divided them into 8 topics, according to the number of questions.</p>
        <p>Then, we correlated the EFCAMDAT questions with the BERTopic topics. Considering that
BERTopic a priori allocates Topic -1 for outliers, we also divided the answers into 25 topics at levels A1-C1
and 9 topics at level C2, with and without outliers.</p>
        <p>At this stage, we conduct a quantitative analysis of the topic modelling by BERTopic:
1) determine the number of answers for the 24 topics identified by ModernBERT and check the
quantitative ratio of topics to answers;
2) check whether all answers are included in the topics, and vice versa;
3) calculate the correlation between the answers and one of the ModernBERT topics;
4) find out whether there is an error and what it depends on;
5) compare the analysis data between levels;
6) visualise the results.</p>
        <p>In the third and final step, we perform a qualitative analysis of the distribution of the maximally
cleaned EFCAMDAT:
1) create correlation matrices of Themes and Topics for each level;
2) compare the Topic names defined by the 4 keywords from Themes with the 10 and 100 keywords
from Topics;
3) determine what has a greater impact on the correlation of a text with the given topic.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <p>The distribution of answers to EFCAMDAT questions by BERTopic is not homogeneous, as can be
seen when comparing the results of the quantitative analysis by levels:
1) Level C2: only 2 Topics out of 8 have an accuracy of &gt;98%, 3 Topics &gt;60%, 1 Topic
&gt;50%. There are no leaders at this level, as Topic 2 (Topic 0) is &gt;90% and Topic 4 (Topic 2) is
&gt;90% for only 1 question.
2) Level C1: 3 topics out of 24 have an accuracy of &gt;90%, 4 topics &gt;85%, 7 topics &gt;50%. Topic 2
(Topic 0) is in the lead with &gt;90% = 3 questions, &gt;50% = 2 questions.
3) Level B2: 6 topics out of 24 have an accuracy of &gt;90%, 7 topics &gt;85%, 9 topics &gt;50%. Topic 2
(Topic 0) is in the lead with &gt;90% = 4 questions, &gt;80% = 2 questions, &gt;50% = 5 questions.
4) Level B1: 11 topics out of 24 &gt;90%, 14 topics &gt;85%, 2 topics &gt;82%, 7 topics &gt;50%. Topic 2
(Topic 0) is the leader with &gt;90% = 7 questions, &gt;80% = 3 questions, &gt;50% = 6 questions.
5) Levels A2-A1: the data were truncated by 52% and 74%, respectively, so we consider them
unsuitable for quantitative analysis.</p>
      <p>The distribution of topics by level is not homogeneous: at the lower level B1, there are more
questions that correlate with answers by more than 90% compared to level C1.</p>
      <p>However, at the same level B1, there are more topics that overlap with each other than at level
C1. In Figures 7 and 8, we can see that answers to different questions overlap in Topic 0, while at
level C1, there are far fewer such overlaps.</p>
      <p>This is also confirmed by the percentage of answers that relate to Themes compared to Topics
and is clearly visible in the pie charts (Figure 9).</p>
      <p>The quantitative analysis has led to the following conclusions:
1) The lower the level of language learning, the fewer words learners have to express the same
idea accurately. The fewer words, the smaller the volume of the lexical and semantic field (LSF) for
one concept. That is why words from different lexical and semantic fields are used when revealing
Themes in lower levels. Thus, Topics, which are clearly more than 24, can overlap for different tasks,
which is also confirmed by the BERTopic distribution.</p>
      <p>2) The uneven distribution of Topics may also depend on the wording of the task: more general
questions are distributed to a larger number of Topics than more specific ones and vice versa.</p>
      <p>3) In most cases, the composition of the top 4 or top 10 keywords in the answers allows us to
understand which Theme they relate to. This confirms the hypothesis that a question can be reduced
to a single concept represented by a single word, and then expanded into a lexical and
semantic field of 4, 10 or even 100 keywords that make up the concept and can be used in an answer
to the same question.</p>
      <p>The latter conclusion is also confirmed by the results of the qualitative analysis, which we
present in the form of matrices (Figure 10), where the names of the EFCAMDAT Themes are listed
horizontally and the BERTopic Topics vertically. At their intersection, we see the number of relevant
answers correlating Questions with Topics, and the names of the 4 keywords of the latter can
quickly indicate how close this correlation is.</p>
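<p>Such a correlation matrix can be built by simple counting; the theme names, topic ids, and assignments below are illustrative:</p>

```python
from collections import Counter

def correlation_matrix(assignments, themes, topics):
    """Count, for each (Theme, Topic) pair, how many answers to a question
    from that Theme were placed in that Topic by the model."""
    counts = Counter((theme, topic) for theme, topic in assignments)
    return [[counts[(th, tp)] for tp in topics] for th in themes]

themes = ["Business", "Travel"]  # manual question themes (rows)
topics = [0, 1, -1]              # BERTopic topic ids (columns), -1 = outliers
assignments = [("Business", 0), ("Business", 0), ("Business", -1),
               ("Travel", 1), ("Travel", 1), ("Travel", 0)]
matrix = correlation_matrix(assignments, themes, topics)
print(matrix)  # prints [[2, 0, 1], [1, 2, 0]]
```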
      <p>We propose to take the correlation matrix for level C1 in Figure 10 as an example. The same
trends emerge from other levels with 24 topics and level C2 with 9 Topics.</p>
      <p>It should be noted that there is a small number of outlier answers, which, depending on the division
into 24 or 25 Topics (with or without outliers), varies between 5% and 8% across
levels. However, there is a tendency for the final topics to have a much smaller number of answers
than the first ones. At the same time, the outlier topic (Topic -1) is present in all variants of the
calculations for 24 or 25 Topics, with or without outliers.</p>
      <p>As for the comparison of the 24- and 25-Topic divisions: although the 24 Topics do not match the
number of EFCAMDAT questions, since Topic -1 collects ‘outliers’ that fall outside the themes, the
24-Topic division shows a clearer correlation between the top 4 Topic keywords and the wording of
the Themes.</p>
      <p>For example, at the C1 level, in question 17.2441, Writing a campaign speech, the keywords of
Topic 0 (0_student_council_vote_student_president) include the word vote, which corresponds to the
idea of a campaign speech in the question; in the 25-Topic division, however, this word is no longer
among the Top 4 and appears only in fifth place in the Top 10 keywords [‘school’, ‘students’,
‘student’, ‘president’, ‘vote’, ‘council’, ‘thank’, ‘best’, ‘better’, ‘university’].</p>
      <p>This clearer correlation between Themes (questions) and Topics under the 24-Topic division is
observed at all levels.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Discussion</title>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusions</title>
      <p>The unequal distribution of topics can be explained by the fact that students with a lower level of
language learning use fewer words when expressing their ideas. With fewer words, we obtain a
smaller volume of lexical and semantic fields and a smaller number of Topics.</p>
      <p>We assume that this also corresponds to the very idea of BERT: the less diverse the vectors, the
more homogeneous the clusters.</p>
      <p>This research explored the potential of BERTopic to assess semantic similarity between questions
and answers in the EFCAMDAT corpus. By treating questions as topics and student answers as
corresponding texts, we aimed to determine whether answers maintain a semantic correlation with
their questions and how this correlation varies across language proficiency levels.</p>
      <p>Our analysis revealed that BERTopic can effectively identify semantic links between questions
and answers, with keyword analysis confirming a meaningful correlation. However, our results also
indicate that topic modelling methods require further refinement to improve precision in
determining semantic similarity. The comparison with the benchmark 20NewsGroups dataset
demonstrated that topic distribution plays a crucial role in ensuring meaningful clustering.
Specifically, our findings suggest that the optimal topic distribution for EFCAMDAT is 25 topics for
levels A1-C1 and 9 topics for level C2, allowing a clearer mapping between student answers and the
corresponding questions.</p>
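      <p>As a configuration sketch (not the authors' exact pipeline), fixing the topic count and folding outliers back can be expressed with BERTopic's documented <monospace>nr_topics</monospace> and <monospace>reduce_outliers</monospace> options; <monospace>answers</monospace> below stands in for the learner texts of one proficiency level, which are not shown here.</p>

```python
# Sketch only: corpus loading is elided, and the settings reflect our
# reading of the study (25 topics for A1-C1, 9 for C2), not a verified
# replication script.
from bertopic import BERTopic

answers = [...]  # learner answers for one CEFR level (placeholder)

topic_model = BERTopic(
    language="english",
    nr_topics=25,  # 9 for the C2 level
)
topics, _ = topic_model.fit_transform(answers)

# optionally fold the outlier Topic -1 back into the nearest topics
topics = topic_model.reduce_outliers(answers, topics)

print(topic_model.get_topic(0)[:4])  # top-4 keywords of the first topic
```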
      <p>Beyond its immediate findings, this study provides a foundation for further research into text
meaning. The ability to extract core semantic units from texts of varying length has broader
implications for transfer learning in low-resource languages, such as Ukrainian, as well as in
domain-specific applications. Future work should focus on refining topic modelling approaches to enhance
their ability to capture nuanced semantic relationships and to reduce classification errors.</p>
      <p>By advancing methods for semantic similarity detection, this research contributes to the broader
field of computational linguistics and NLP, offering new insights into how topic modelling can be
leveraged for analysing texts.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgements</title>
      <p>The research study depicted in this paper is funded by the program PAUSE ANR (the French National
Research Agency) associated with the project ANR-17-CE19-0016 CLEAR (Communication, Literacy,
Education, Accessibility, Readability).</p>
    </sec>
    <sec id="sec-9">
      <title>Declaration on Generative AI</title>
      <p>The authors have not employed any Generative AI tools.</p>
    </sec>
    <sec id="sec-10">
      <title>References</title>
      <p>[1] A. Bordes, X. Glorot, J. Weston, Joint learning of words and meaning representations for
open-text semantic parsing, in: International Conference on Artificial Intelligence and Statistics, 2012.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>W.</given-names>
            <surname>Gomaa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Fahmy</surname>
          </string-name>
          ,
          <article-title>A survey of text similarity approaches</article-title>
          ,
          <source>International Journal of Computer Applications</source>
          <volume>68</volume>
          (
          <year>2013</year>
          )
          <fpage>13</fpage>
          -
          <lpage>18</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>H. T.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. H.</given-names>
            <surname>Duong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Cambria</surname>
          </string-name>
          ,
          <article-title>Learning short-text semantic similarity with word embeddings and external knowledge sources</article-title>
          ,
          <source>Knowledge-Based Systems</source>
          <volume>182</volume>
          (
          <year>2019</year>
          )
          <fpage>104842</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M. H.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. Q.</given-names>
            <surname>Tran</surname>
          </string-name>
          ,
          <article-title>Estimation in semantic similarity of texts</article-title>
          ,
          <source>Journal of Information Science and Engineering</source>
          <volume>37</volume>
          (
          <year>2021</year>
          )
          <fpage>617</fpage>
          -
          <lpage>633</lpage>
          . doi:10.6688/JISE.202105_37(3).0008.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>N.</given-names>
            <surname>Khairova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kupriianov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Vorzhevitina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Shanidze</surname>
          </string-name>
          ,
          <article-title>Models for effective categorization and classification of texts into specific thematic groups</article-title>
          , in: CLW-2024
          <source>: Computational Linguistics Workshop at 8th Int. Conf. on Computational Linguistics and Intelligent Systems (CoLInS-2024)</source>
          , Vol.
          <volume>4</volume>
          ,
          CEUR-WS
          , Lviv, Ukraine,
          <year>2024</year>
          , pp.
          <fpage>37</fpage>
          -
          <lpage>49</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Y.-S.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.-J.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <article-title>A similarity measure for text classification and clustering</article-title>
          ,
          <source>IEEE Trans. on Knowledge and Data Engineering</source>
          <volume>26</volume>
          (
          <year>2014</year>
          )
          <fpage>1575</fpage>
          -
          <lpage>1590</lpage>
          . doi:10.1109/TKDE.2013.19.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Kutuzov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kopotev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Sviridenko</surname>
          </string-name>
          , L. Ivanova,
          <article-title>Clustering comparable corpora of Russian and Ukrainian academic texts: Word embeddings and semantic fingerprints</article-title>
          ,
          <source>arXiv preprint arXiv:1604.05372</source>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ruder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Søgaard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Vulić</surname>
          </string-name>
          ,
          <article-title>Unsupervised cross-lingual representation learning</article-title>
          ,
          <source>in: Proceedings of the Annual Meeting of the Association for Computational Linguistics</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>F.</given-names>
            <surname>Remy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Delobelle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Avetisyan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Khabibullina</surname>
          </string-name>
          , M. de Lhoneux, T. Demeester,
          <article-title>Transtokenization and cross-lingual vocabulary transfers: Language adaptation of LLMs for lowresource NLP</article-title>
          ,
          <source>in: Proceedings of the Conference On Language Modeling</source>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>S.</given-names>
            <surname>Harispe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ranwez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Montmain</surname>
          </string-name>
          ,
          <article-title>Semantic similarity from natural language and ontology analysis</article-title>
          ,
          <source>Springer Nature</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Bagheri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Ensan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Jovanovic</surname>
          </string-name>
          ,
          <article-title>The state of the art in semantic relatedness: a framework for comparison</article-title>
          ,
          <source>The Knowledge Engineering Review</source>
          <volume>32</volume>
          (
          <year>2017</year>
          )
          <fpage>1</fpage>
          -
          <lpage>30</lpage>
          . doi:10.1017/S0269888917000029.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>P. B.</given-names>
            <surname>Andersen</surname>
          </string-name>
          ,
          <article-title>A theory of computer semiotics: semiotic approaches to construction and assessment of computer systems</article-title>
          , Vol.
          <volume>3</volume>
          , Cambridge University Press,
          <year>1990</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>H.</given-names>
            <surname>Jackson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Zé Amvela</surname>
          </string-name>
          ,
          <article-title>Words, Meaning and Vocabulary</article-title>
          , Continuum,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>M.</given-names>
            <surname>Vakulenko</surname>
          </string-name>
          ,
          <article-title>Semantic comparison of texts by the metric approach</article-title>
          ,
          <source>Digital Scholarship in the Humanities</source>
          <volume>38</volume>
          (
          <issue>2</issue>
          ) (
          <year>2022</year>
          )
          <fpage>766</fpage>
          -
          <lpage>771</lpage>
          . doi:10.1093/llc/fqac059.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>A.</given-names>
            <surname>Akmajian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Demers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Farmer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Harnish</surname>
          </string-name>
          ,
          <article-title>Linguistics: An Introduction to Language and Communication</article-title>
          , MIT Press,
          <year>2001</year>
          . doi:10.7551/mitpress/4252.001.0001.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>I.</given-names>
            <surname>Mel'čuk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Milićević</surname>
          </string-name>
          ,
          <article-title>An Advanced Introduction to Semantics: A Meaning-Text Approach</article-title>
          , Cambridge University Press,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>A.</given-names>
            <surname>Conneau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Khandelwal</surname>
          </string-name>
          , et al.,
          <article-title>Unsupervised cross-lingual representation learning at scale</article-title>
          ,
          <source>arXiv preprint arXiv:1911.02116v2</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>M.</given-names>
            <surname>Pikuliak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Šimko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bieliková</surname>
          </string-name>
          ,
          <article-title>Cross-lingual learning for text processing: A survey</article-title>
          ,
          <source>Expert Systems with Applications</source>
          <volume>165</volume>
          (
          <year>2021</year>
          ). doi:10.1016/j.eswa.2020.113765.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>M.</given-names>
            <surname>Vakulenko</surname>
          </string-name>
          ,
          <article-title>Deep contextual disambiguation of homonyms and polysemants</article-title>
          ,
          <source>Digital Scholarship in the Humanities</source>
          (
          <year>2022</year>
          ). doi:10.1093/llc/fqac081.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>M. R.</given-names>
            <surname>Grootendorst</surname>
          </string-name>
          ,
          <article-title>BERTopic: Neural topic modeling with a class-based TF-IDF procedure</article-title>
          ,
          <source>arXiv preprint arXiv:2203.05794</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>B.</given-names>
            <surname>Warner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chaffin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Clavié</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Weller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Hallström</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Taghadouini</surname>
          </string-name>
          , ... I. Poli,
          <article-title>Smarter, better, faster, longer: A modern bidirectional encoder for fast, memory efficient, and long context finetuning and inference</article-title>
          ,
          <source>arXiv preprint arXiv:2412.13663</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>J.</given-names>
            <surname>Geertzen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Alexopoulou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Korhonen</surname>
          </string-name>
          ,
          <article-title>Automatic linguistic annotation of large scale L2 databases: The EF-Cambridge open language database (EFCAMDAT)</article-title>
          ,
          <source>in: 31st Second Language Research Forum (SLRF)</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [23]
          Council of Europe,
          <article-title>Common European Framework of Reference for Languages: Learning, Teaching, Assessment</article-title>
          , Strasbourg,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Geertzen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Baker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Korhonen</surname>
          </string-name>
          , T. Alexopoulou, The EF Cambridge Open Language Database (EFCAMDAT): Information for Users, University of Cambridge and EF Education First,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>D. M.</given-names>
            <surname>Blei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Y.</given-names>
            <surname>Ng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. I.</given-names>
            <surname>Jordan</surname>
          </string-name>
          , Latent Dirichlet Allocation,
          <source>Journal of Machine Learning Research</source>
          <volume>3</volume>
          (
          <year>2003</year>
          )
          <fpage>993</fpage>
          -
          <lpage>1022</lpage>
          . doi:10.1162/jmlr.2003.3.4-5.993.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>D. M.</given-names>
            <surname>Blei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. D.</given-names>
            <surname>McAuliffe</surname>
          </string-name>
          , Supervised Topic Models,
          <source>in: Advances in Neural Information Processing Systems</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>R.</given-names>
            <surname>Egger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <article-title>A Topic Modeling Comparison Between LDA, NMF, Top2Vec, and BERTopic to Demystify Twitter Posts</article-title>
          ,
          <source>Frontiers in Sociology</source>
          <volume>7</volume>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>O.</given-names>
            <surname>Babalola</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ojokoh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Boyinbode</surname>
          </string-name>
          ,
          <article-title>Comprehensive Evaluation of LDA, NMF, and BERTopic's Performance on News Headline Topic Modeling</article-title>
          ,
          <source>Journal of Computing Theories and Applications</source>
          <volume>2</volume>
          (
          <year>2024</year>
          )
          <fpage>268</fpage>
          -
          <lpage>289</lpage>
          . doi:10.62411/jcta.11635.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>X.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. T.</given-names>
            <surname>Luu</surname>
          </string-name>
          ,
          <article-title>FASTopic: Pretrained Transformer is a Fast, Adaptive, Stable, and Transferable Topic Model</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>37</volume>
          (
          <year>2025</year>
          )
          <fpage>84447</fpage>
          -
          <lpage>84481</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>A.</given-names>
            <surname>Michalak</surname>
          </string-name>
          ,
          <article-title>NLP research on the EFCAMDAT dataset</article-title>
          , GitHub, URL: https://github.com/amichw/EFCAMDAT.
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [31]
          EF Education First
          ,
          <article-title>Learn English online, EF English Live</article-title>
          , URL: https://englishlive.ef.com/enus/learn-english-online.
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>M.</given-names>
            <surname>Grootendorst</surname>
          </string-name>
          ,
          <article-title>BERTopic documentation</article-title>
          , URL: https://maartengr.github.io/BERTopic/index.html.
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [33]
          Hugging Face,
          <article-title>20 newsgroups fixed dataset</article-title>
          , URL: https://huggingface.co/datasets/rungalileo/20_Newsgroups_Fixed.
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [34]
          Galileo,
          <article-title>Improving your ML datasets with Galileo (Part 1)</article-title>
          , URL: https://www.galileo.ai/blog/improving-your-ml-datasets-with-galileo-part-1.
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>R.</given-names>
            <surname>Řehůřek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Sojka</surname>
          </string-name>
          ,
          <article-title>Gensim: Topic modelling for humans</article-title>
          , PyPI, URL: https://pypi.org/project/gensim/.
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>[36] AnswerDotAI, ModernBERT-base, Hugging Face, URL: https://huggingface.co/answerdotai/ModernBERT-base.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>