<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>Bridging the Evaluation Gap: Leveraging Large Language Models for Topic Model Evaluation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Zhiyin Tan</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jennifer D'Souza</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>L3S Research Center, Leibniz University Hannover</institution>
          ,
          <addr-line>Hannover</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>TIB Leibniz Information Centre for Science and Technology</institution>
          ,
          <addr-line>Hannover</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>This study presents a framework for automated evaluation of dynamically evolving topic taxonomies in scientific literature using Large Language Models (LLMs). In digital library systems, topic modeling plays a crucial role in efficiently organizing and retrieving scholarly content, guiding researchers through complex knowledge landscapes. As research domains proliferate and shift, traditional human-centric and static evaluation methods struggle to maintain relevance. The proposed approach harnesses LLMs to measure key quality dimensions, such as coherence, repetitiveness, diversity, and topic-document alignment, without heavy reliance on expert annotators or narrow statistical metrics. Tailored prompts guide LLM assessments, ensuring consistent and interpretable evaluations across various datasets and modeling techniques. Experiments on benchmark corpora demonstrate the method's robustness, scalability, and adaptability, underscoring its value as a more holistic and dynamic alternative to conventional evaluation strategies.</p>
      </abstract>
      <kwd-group>
        <kwd>Topic modeling</kwd>
        <kwd>Large language models</kwd>
        <kwd>Evaluation</kwd>
        <kwd>Natural language processing</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Topic taxonomies of Science have been traditionally used to simplify literature search, to study the
structure and dynamics of scientific disciplines, or to facilitate bibliometric research evaluations [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
These taxonomies, often hierarchical and multidisciplinary, provide a framework for categorizing
knowledge and can significantly influence the dissemination and evolution of scientific information
[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. They play a crucial role in the organization of academic databases, directly impact the efficiency
of information retrieval systems, and serve as essential tools for structuring and navigating vast
repositories within digital library systems. As scientific output continues to grow exponentially, the
need for effective and dynamically adaptable taxonomic systems [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ] becomes increasingly important,
not just for academic researchers but also for policymakers and funding agencies aiming to identify
and support pivotal research areas [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>
        As we shift towards dynamically updatable taxonomies to manage the growing volumes of scientific
literature, developing robust evaluation methods is essential to ensure their effectiveness [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. These
methods must verify that taxonomies adapt to the rapid evolution of scientific domains while consistently
producing meaningful and coherent topics. Evaluating these dynamic systems involves assessing their
accuracy in reflecting current research trends and their capacity to interlink related disciplines seamlessly
[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. This process is crucial for maintaining the integrity and utility of taxonomies in facilitating efficient
research discovery and supporting informed decision-making in scientific policy and funding strategies.
Human evaluation is laborious and time-consuming; thus, we also need automated evaluation systems
to efficiently manage and validate these complex, dynamic structures [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>
        Building on the premise of automated evaluation systems, Large Language Models (LLMs) [
        <xref ref-type="bibr" rid="ref8 ref9">8, 9</xref>
        ] are
well-suited to function as evaluators of dynamic taxonomies due to their advanced natural language
understanding and generation capabilities [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. Trained on extensive corpora, these models excel in
discerning linguistic patterns and semantic relationships within complex datasets, making them ideal
for assessing the coherence and relevance of topics generated by taxonomies. Moreover, LLMs offer
a scalable, consistent, and context-sensitive approach to evaluation, overcoming key challenges of
traditional methods such as reliance on human annotators or narrowly focused statistical metrics. With
the ability to simulate nuanced human-like reasoning, LLMs can evaluate multiple dimensions of topic
quality—such as coherence, repetitiveness, diversity, and topic-document alignment—while providing
detailed, interpretable feedback.
      </p>
      <p>In this work, we propose a novel framework that leverages LLMs as evaluators for topic model
outputs, addressing key limitations in existing evaluation methodologies. The contributions of this work
are threefold. First, we introduce a comprehensive set of metrics—coherence, repetitiveness, diversity,
and topic-document alignment—that collectively capture multiple facets of topic model quality. Second,
we design and implement tailored LLM prompts for each metric, ensuring consistency, interpretability,
and adaptability across different datasets and topic modeling techniques. Third, we validate the
framework through extensive experiments on the benchmark datasets 20 Newsgroups (20NG,
http://qwone.com/~jason/20Newsgroups/) and a subset of scholarly documents from the International
System for Agricultural Science and Technology (AGRIS, https://agris.fao.org/), demonstrating its
robustness and effectiveness. This study not only provides a scalable and holistic solution for topic model
evaluation but also paves the way for broader applications of LLMs in addressing dynamic and
context-sensitive challenges in natural language processing. The code is available at
https://github.com/zhiyintan/topic-model-LLMjudgment.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <sec id="sec-2-1">
        <title>2.1. Topic Modeling Approaches and Their Evolution</title>
        <p>
          The field of topic modeling has evolved significantly, advancing from early matrix factorization
techniques to modern probabilistic and neural architectures. One foundational work, Latent Semantic
Indexing (LSI) [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ], utilized singular value decomposition to uncover conceptual associations between
words and documents through co-occurrence patterns, though it lacked a fully generative probabilistic
model. Building on these concepts, Probabilistic Latent Semantic Indexing (pLSI) [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] introduced a
probabilistic framework where words are generated from a mixture of topics. This probabilistic perspective
paved the way for LDA [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ], a seminal contribution that introduced a fully generative Bayesian model
capable of inferring a corpus-wide set of latent topics and their associated per-document distributions.
LDA’s flexible yet tractable variational inference techniques (as well as alternative inference algorithms
like Gibbs sampling, as demonstrated in [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]) solidified it as a cornerstone in topic modeling research.
        </p>
        <p>
          In more recent years, researchers have embraced neural variational inference frameworks—inspired
by VAE [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] —to develop neural topic models (NTMs) such as NVDM [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ], GSM/GSB/RSB [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ],
ProdLDA/AVITM [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ], and ETM [19]. These models employ continuous latent representations and deep
neural networks to capture richer semantic structures and often rely on embeddings to represent
words and documents. Another emerging direction leverages contextualized embeddings and large
language models (LLMs) as building blocks or alternatives to traditional topic models. Methods such
as clustering-based approaches [20], CombinedTM [21], and BERTopic [22] have demonstrated that
directly clustering embeddings of documents and words can yield coherent, diverse topics. More
recently, research efforts like [23, 24, 25] propose employing LLMs directly to generate and refine topics,
showing promise for overcoming certain limitations of classical topic modeling frameworks.
        </p>
        <sec id="sec-2-1-1">
          <title>1http://qwone.com/ jason/20Newsgroups/ 2https://agris.fao.org/ 3https://github.com/zhiyintan/topic-model-LLMjudgment</title>
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Evaluation of Topic Models: Methods and Limitations</title>
        <p>
          Topic model evaluation has shifted from basic statistical measures to methods that capture human
interpretability. Early work used held-out likelihood, introduced in [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] and consistently employed by
[26], which measures how well a trained model can predict unseen data, although it can be computationally
demanding. Similarly, log probability [27] aggregates the likelihood of observed words and documents,
providing an intuitive gauge of model fit but potentially favoring complexity over interpretability. Perplexity,
a normalized form of held-out likelihood, has been widely reported [
          <xref ref-type="bibr" rid="ref13 ref18">13, 26, 28, 27, 29, 18, 30, 31, 32, 19</xref>
          ]. However, low perplexity does not
necessarily translate to coherent or meaningful topics [33], highlighting a fundamental disconnect
between statistical quality and human interpretability. To bridge this gap, human-centered evaluations like
word intrusion and topic intrusion tasks [33] were introduced, where human judges identify out-of-place
words or topics. Human-rated coherence, first explored in [34] and later adopted by [35, 36, 37], asks
annotators to score topics on an ordinal scale. Although these methods closely reflect human
understanding, their reliance on manual annotation constrains scalability and efficiency. In response, automated
metrics were introduced to approximate human judgments. Topic words-based coherence metrics
measure how strongly top-ranked topic words co-occur in the underlying data. Early approaches, such
as coherence UCI [38] and coherence UMass [35], rely on word co-occurrence frequencies and statistics
derived from the training corpus. More refined metrics, like NPMI [39], normalize pointwise mutual
information to better align coherence scores with human judgments [
          <xref ref-type="bibr" rid="ref18">36, 40, 18, 30, 31, 41, 42, 22, 43</xref>
          ], while
coherence v [44] uses a variation of NPMI computed over a sliding window and adds a weight to
assign more strength to more related words. Moreover, embedding-based coherence, used by
[45, 46, 30, 21], further improves the match to human judgement [46]. Other metrics
assess different aspects of topic quality beyond coherence. Diversity metrics ensure that the
discovered topics are distinctive, not redundant or overlapping. For instance, topic diversity [19] counts the
proportion of unique top words across topics, while topic redundancy [47] and topic uniqueness [48]
measure how frequently top words appear across multiple topics. Similarly, inverted ranked-biased
overlap [21] and embedding-based diversity metrics [49, 50] compare ranked word lists or semantic
distances to ensure substantial topic variety. Document-level evaluations measure how well topics
capture a document's content. [51] ask annotators to rate each topic's relevance to a given document.
[52] vectorize documents associated with the selected topic and calculate a coherence score based on
the document vectors. [53] tests whether an outlier topic can be identified given a document and a
few topics. Supervised coverage-based methods [54] match model-generated topics to a fixed set of
human-defined topics, though these methods are resource intensive. More recently, LLM-based evaluations
have emerged as a promising new paradigm. Studies [37, 55] demonstrate that large language models
can simulate human reasoning, providing nuanced judgments of topic coherence and word intrusions. [56]
proposes a set of metrics to quantify the agreement between keywords generated from documents
by an LLM and topic words generated from the same documents by a topic model. By leveraging LLMs'
extensive world knowledge and contextual reasoning, this approach overcomes the limitations of statistical
co-occurrence and embedding similarities that often fail to capture semantic quality, and avoids the
resource intensity of human-centric evaluations. However, current LLM-based methods often focus on a
single aspect, such as coherence. In contrast, our approach integrates multiple LLM-based metrics—including
coherence, repetitiveness, diversity, and topic–document alignment—into a unified framework. Rather
than simply measuring co-occurrence, our framework provides interpretable evidence (e.g., flagged
outlier words and identified duplicate concepts) that explains topic flaws. Following [33, 57], topic model
evaluations should assess the model's practical capabilities rather than rely on legacy metrics detached
from their intended usage. Our novel topic–document alignment metrics explicitly reveal discrepancies
between topic words and document content, which is crucial for applications like recommendation,
summarization, and classification. By integrating document-level and topic words-based assessments,
our comprehensive, adversarially validated framework bridges the gap between statistical measures
and human-centered evaluations, offering actionable insights to improve topic model performance.
Prompt 1: Prompt for evaluating coherence.
        </p>
        <p>Prompt for coherence rating metric rate
Given a list of words [TOPIC WORDS] representing a topic, assess the degree of semantic consistency
among the words in the context of the topic. Consider the remaining words in the list as the contextual
basis for each word’s semantics. Rate the coherence of the topic on a scale of 1 to 3, where 1 indicates
that the words are mostly unrelated, and 3 indicates that the words are highly related and form a clear,
unified theme.</p>
        <p>The rate is: [RATE]
Prompt for outlier detection metric outlier
Given a list of words [TOPIC WORDS] representing a topic. Consider the remaining words in the list as
the contextual basis for each word’s semantics. Which words are not semantically consistent with the
remaining words and put them into a comma-separated list.</p>
        <p>The semantically inconsistent words are: [WORD LIST]
Prompt for adversarial test of outlier detection metric outlier
Which words from this list [TOPIC WORDS] are not semantically consistent with the remaining words?
The semantically inconsistent words are: [WORD LIST]</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Our Solution: Evaluation Metrics and LLM as Evaluator</title>
      <p>In this section, we present our comprehensive evaluation framework that leverages LLMs as evaluators
for topic models. We begin with topic words-based evaluation (Section 3.1), where we introduce the
coherence metrics rate and outlier and the repetitiveness metrics rate and duplicate, as well as
adversarial tests to validate metric robustness. Next, cross-topic evaluation (Section 3.2) focuses on a
topic diversity rating rate that captures thematic distinctiveness across topics. Finally, topic–document
alignment (Section 3.3) introduces the novel metrics irrelevant topic words ir-topic and missing themes
missing-theme to evaluate the correspondence between topic words and document content. Together,
these subsections form an integrated, interpretable, and scalable framework for evaluating topic model
performance.</p>
      <sec id="sec-3-1">
        <title>3.1. Topic words-based Evaluation</title>
        <p>We evaluate individual topics by examining two key dimensions: (1) coherence, which measures
the semantic consistency of the top-ranked topic words using the metrics rate and outlier, and (2)
repetitiveness, which assesses potential redundancy within the topic words using the metrics rate
and duplicate. This two-pronged evaluation enables us to quantify both the semantic integrity and the
diversity of the generated topic words.</p>
        <p>Coherence Topic coherence measures how well the top-ranked topic words form a semantically
unified theme. Inspired by [37], we first employ a coherence rating metric, rate, which asks the
LLM to assess the overall semantic consistency of the topic words on a 3-point scale (with 1 indicating
minimal alignment and 3 indicating strong coherence). While rate yields an overall numerical score,
it does not reveal which, or how many, specific words are responsible for any lack of coherence. To
enhance interpretability, we introduce an auxiliary outlier detection metric, outlier, that explicitly
identifies semantic outlier words. In this procedure, the LLM extracts candidate outliers over 5 iterations,
and a word flagged in at least 3 out of 5 iterations is deemed a semantic outlier. We then count the
number of outliers as the final evaluation result, and the outlier words themselves are saved for later case
studies. In addition, we perform an adversarial test of outlier to validate the reliability of the outlier
detection, inspired by established word intrusion methodologies [37, 33]. A semantically unrelated term
(e.g., "Shakespeare") is inserted into the topic list, and the LLM is expected to correctly identify the
inserted term. Even if the LLM flags additional words along with the inserted term, the detection is
considered successful. Each successful detection is assigned a score of 1 and each unsuccessful attempt a
score of 0. The final result is calculated as the percentage of successful detections over the total number
of tests. Prompt 1 presents the prompts for the two evaluation metrics as well as for the adversarial test.
Prompt 2: Prompt for evaluating repetitiveness.</p>
        <p>Prompt for repetitiveness rating metric rate
Given a list of words [TOPIC WORDS] representing a topic, evaluate if there are words that are semantically
equivalent. Rate the repetitiveness on a scale of 1 to 3, where 1 indicates highly repetitive with significant
semantic overlap, and 3 indicates minimal repetition with diverse and distinctive words.</p>
        <p>The rate is: [RATE]
Prompt for duplicate concept detection metric duplicate
Given a list of words [TOPIC WORDS] representing a topic, identify pairs of words that refer to the exact
same concept or idea (not just related or similar). Provide each pair as a tuple in a comma-separated list.
The word pairs are: [WORD LIST]
Prompt for adversarial test of duplicate concept detection metric duplicate
Given a list of words [TOPIC WORDS] representing a topic, which words from the list have the exact
same concept or idea (not just related or similar)?</p>
        <p>The word pair with [ANCHOR] is (’None’ or a word): [None/WORD]
Prompt 3: Prompt for evaluating diversity.</p>
        <p>Prompt for diversity rating metric rate
Given two groups of topic words: [TOPIC WORDS 1], [TOPIC WORDS 2], analyze the themes represented
by the two groups. Rate on a scale of 1-3 based on the degree of thematic distinctiveness between the two
groups: Rate 1: Partial overlapping themes. Rate 3: Highly distinctive themes.</p>
        <p>The rate is: [RATE]
Repetitiveness While coherence focuses on thematic alignment, we introduce a repetitiveness
rating metric rate to assess whether the perceived coherence is due to redundant topic words. rate
is given on a 3-point scale, where a rating of 1 indicates high repetitiveness with significant semantic
overlap, and a rating of 3 indicates minimal repetition with diverse and distinctive words. To further
elucidate these ratings, we introduce an auxiliary duplicate concept detection metric, duplicate,
which explicitly identifies exact semantic repetitions in the topic word list. duplicate is critical as it
helps distinguish genuine topic coherence from inflated scores due to redundancy. For each topic word,
we compute a binary indicator: if a word has at least one other conceptual repetition in the list, it is
assigned a value of 1; otherwise, 0. The final duplicate score for a topic is the sum of these indicators,
reflecting the number of topic words that have at least one duplicate in the list. In addition, we perform
an adversarial test of duplicate to validate the reliability of using the LLM for duplicate concept
detection. In this test, we randomly select an anchor word from the topic word list and manually
choose a conceptually identical word to serve as its duplicate. We then insert this duplicate into the
topic word list. The LLM is expected to identify the inserted duplicate given the anchor word. Each
successful detection is scored as 1, while an unsuccessful one is scored as 0. Prompt 2 presents the
prompts designed for both rate and duplicate, as well as for the associated adversarial test. A sketch of
how these per-topic judgments are aggregated is given below.</p>
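        <p>To make the aggregation concrete, the following minimal Python sketch (function and variable
names are ours, not from the released code) tallies the 3-of-5 outlier vote, the per-word duplicate
indicators, and the adversarial success rate:</p>
        <preformat>
from collections import Counter

def aggregate_outliers(iteration_flags, threshold=3):
    """iteration_flags: one list of LLM-flagged words per iteration (5 here).
    A word is a semantic outlier if flagged in at least `threshold` iterations."""
    votes = Counter(w for flags in iteration_flags for w in set(flags))
    outliers = [w for w, c in votes.items() if c >= threshold]
    return len(outliers), outliers  # count is the metric; words kept for case studies

def duplicate_score(topic_words, duplicate_pairs):
    """duplicate_pairs: (word, word) tuples returned by the LLM.
    A topic word contributes 1 if it appears in at least one pair."""
    dup_words = {w for pair in duplicate_pairs for w in pair}
    return sum(1 for w in topic_words if w in dup_words)

def adversarial_success_rate(detections):
    """detections: booleans, one per test, True if the inserted term was found."""
    return 100.0 * sum(detections) / len(detections)
        </preformat>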
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Cross-topic Evaluation</title>
        <p>Diversity Diversity quantifies the uniqueness among generated topics by assessing the thematic
distinctiveness of their associated top words. Inspired by [50]’s word embedding-based pairwise distance,
we exhaustively extract all possible pairs of topics, with each topic represented by its corresponding
topic word list. For each pair, the LLM rates the thematic distinctiveness on a 3-point scale, where
a rating of 1 denotes partial overlap (low diversity) and a rating of 3 denotes minimal overlap (high
distinctiveness). Finally, the average of all pairwise ratings is computed to yield the overall diversity
rating, rate. The prompt for diversity evaluation is provided in Prompt 3.</p>
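        <p>As a sketch, assuming a helper llm_rate_diversity that wraps Prompt 3 and returns an integer in
{1, 2, 3} (the helper name is hypothetical), the overall diversity rating is the mean over all unordered
topic pairs:</p>
        <preformat>
from itertools import combinations

def diversity_rate(topics, llm_rate_diversity):
    """topics: list of topic word lists. llm_rate_diversity: callable taking
    two word lists and returning the LLM's 1-3 distinctiveness rating."""
    ratings = [llm_rate_diversity(a, b) for a, b in combinations(topics, 2)]
    return sum(ratings) / len(ratings)
        </preformat>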
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Topic-document Alignment</title>
        <p>Document-level evaluation focuses on assessing how effectively topics capture the underlying themes of
documents. Early methods relied on human annotations [51, 53] or supervised matching against curated
references [54] to measure topic relevance. However, these approaches are resource-intensive and lack
scalability.
Prompt 4: Prompt for evaluating topic-document alignment.</p>
        <p>Prompt for irrelevant topic words detection metric ir-topic
Given a document: [DOCUMENT] and a topic word list [TOPIC WORDS], identify which topics in the
word list are not relevant to the document.</p>
        <p>Return these extraneous topics, or [ ] if all topics in the word list are relevant to the document.
Return the extraneous topics list or [ ]: [TOPIC WORDS/[ ]]
Prompt for missing themes detection metric missing-theme
Given a document: [DOCUMENT] and a topic word list [TOPIC WORDS], identify which themes present
in the document are not included in the topic word list.</p>
        <p>Return these missing themes, or [ ] if all themes from the document are included in the word list.</p>
        <p>Return the missed themes list or [ ]: [MISSING THEMES/[ ]]</p>
        <p>Recent work by [56] introduces a set of metrics that quantify the agreement between
keywords generated by LLMs from documents and the topic words produced by topic models. Although
these metrics capture similarity, they do not quantify the degree of mismatch between the two sets.
This unaccounted discrepancy is critical for evaluating how well a topic model covers less frequent
or nuanced themes, which are often key to understanding long-tail phenomena. By incorporating
measures of mismatch, we can gain a more complete picture of the model’s limitations and identify
specific areas where the topic representation may require further improvement. Motivated by these
limitations, we propose two novel LLM-based metrics for topic-document alignment: irrelevant topic
words detection metric ir-topic and missing themes detection metric missing-theme. These metrics
leverage the contextual understanding of LLMs to assess both overrepresentation (irrelevant topic
words) and underrepresentation (missing themes) in topic-document relationships, providing a more
comprehensive evaluation of how well topics capture document content. Prompts for these evaluations
are provided in Prompt 4.</p>
        <p>Irrelevant Topic Words Detection Metric ir-topic assesses the extent to which a topic contains
words that are not relevant to the content of its associated documents. For each topic-document pair, we
instruct the LLM to identify topic words that are not explicitly or implicitly related to the document. The
number of extraneous words is tallied for each document and then averaged across all pairs, providing
a precise measure of overrepresentation.</p>
        <p>Missing Themes Detection Conversely, metric missing-theme quantifies the extent to which a topic
fails to capture key themes present in the documents. For each topic-document pair, the LLM extracts
significant themes from the document that are absent in the topic word list and counts these missing
themes. The resulting counts are then averaged across all pairs, yielding a measure of
underrepresentation.</p>
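        <p>As a minimal sketch of how these two alignment metrics are averaged over topic-document pairs
(the helpers llm_irrelevant_words and llm_missing_themes are hypothetical wrappers around Prompt 4,
not functions from the released code):</p>
        <preformat>
def alignment_scores(pairs, llm_irrelevant_words, llm_missing_themes):
    """pairs: iterable of (document_text, topic_words) tuples.
    Each helper calls the LLM with the corresponding Prompt 4 template
    and returns a (possibly empty) list of words or themes."""
    ir_counts, miss_counts = [], []
    for doc, words in pairs:
        ir_counts.append(len(llm_irrelevant_words(doc, words)))   # overrepresentation
        miss_counts.append(len(llm_missing_themes(doc, words)))   # underrepresentation
    n = len(ir_counts)
    return sum(ir_counts) / n, sum(miss_counts) / n  # (ir-topic, missing-theme)
        </preformat>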
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experimental Setup</title>
      <sec id="sec-4-1">
        <title>4.1. Data</title>
        <p>
          Datasets The 20NG dataset is a widely used benchmark comprising approximately 20,000 newsgroup
posts organized into 20 categories. We adopt the pre-processed version from OCTIS
(https://github.com/MIND-Lab/OCTIS/), which contains 11,415 training and 4,894 test documents. Known
for its diverse topics, 20NG has been extensively employed in topic model evaluations [
          <xref ref-type="bibr" rid="ref17">29, 17, 32, 43</xref>
          ]. In addition, from AGRIS, a repository of approximately 14 million records of food and agricultural
scholarly documents, we extract a subset by excluding non-English documents, documents with titles
shorter than five tokens or abstracts shorter than 40 tokens, and duplicate records (by DOI); we name
this subset AGRIS. The final dataset comprises 50,067 documents (45,060 for training and 5,007 for
testing). For each document, we retain the title and abstract. To support sentence-level analysis,
abstracts are segmented using SaT [58], yielding 454,850 training and 50,703
test entries. This granularity allows multiple topics to be assigned to a single document, reflecting the
multi-faceted nature of scholarly texts, where a single work often spans diverse thematic areas.
Domain-Specific Stopword Removal Stopword removal is a critical preprocessing step,
particularly for domain-specific data. Stopwords are frequent but low-information terms, and their removal,
guided by Zipf’s law [59], reduces token counts while preserving vocabulary diversity, optimizing
computational efficiency without compromising semantic integrity. Generic stopword lists often overlook
contextually irrelevant terms in specialized domains. For AGRIS, we employed an information-theoretic
framework [60] to identify and remove domain-specific stopwords. The 20NG dataset, pre-processed by
OCTIS, required no further stopword refinement.</p>
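        <p>The exact criterion of [60] is not reproduced here; as one illustrative information-theoretic
heuristic (our own sketch, not the paper's method), a frequent term whose distribution across documents
is near-uniform, i.e., has high normalized entropy, carries little discriminative information and is a
stopword candidate:</p>
        <preformat>
import math
from collections import Counter, defaultdict

def stopword_candidates(docs, min_count=10, top_k=100):
    """docs: list of token lists. Scores each sufficiently frequent term by the
    normalized entropy of its distribution over documents; high-entropy terms
    are low-information stopword candidates. Illustrative heuristic only."""
    term_doc_counts = defaultdict(Counter)
    for i, doc in enumerate(docs):
        for tok in doc:
            term_doc_counts[tok][i] += 1
    scores = {}
    for term, counts in term_doc_counts.items():
        total = sum(counts.values())
        if total >= min_count and len(counts) >= 2:  # skip rare terms
            probs = [c / total for c in counts.values()]
            entropy = -sum(p * math.log(p) for p in probs)
            scores[term] = entropy / math.log(len(docs))  # normalized to [0, 1]
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
        </preformat>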
        <sec id="sec-4-1-1">
          <title>4.2. Topic Models</title>
          <p>
            We evaluated four topic models chosen for their methodological diversity and proven performance,
spanning traditional probabilistic approaches to neural and embedding-based methods, enabling a
comprehensive comparison. Their key characteristics and implementations are detailed below.
• Latent Dirichlet Allocation (LDA) [
            <xref ref-type="bibr" rid="ref13">13</xref>
            ]: a foundational probabilistic topic model that represents
documents as mixtures of topics, with each topic modeled as a distribution over words. We use
the Gensim implementation5.
• Product of Experts Latent Dirichlet Allocation (ProdLDA) [
            <xref ref-type="bibr" rid="ref18">18</xref>
            ]: a neural adaptation of LDA
that leverages variational autoencoders to enhance scalability and improve topic coherence. we
adopt the code provided by the TopMost toolkit6.
• CombinedTM [21]: it integrates contextual embeddings from pre-trained transformers into the
LDA framework, efectively capturing semantic nuances through deep neural embeddings. We
used the oficial implementation 7.
• BERTopic [22]: combines document embeddings with a class-based TF-IDF procedure to generate
coherent and interpretable topics. For this study, we configured BERTopic with UMAP for
dimensionality reduction and HDBSCAN for clustering, following its standard pipeline8.
For each model, we conducted an extensive parameter tuning process to identify the optimal settings
for two key evaluation metrics: v [44] for topic coherence and unique [19] for topic diversity. Once
the optimal configurations were determined, we obtain the top 10 topic words and the topic-document
pairs from each model on both the 20NG and AGRIS datasets with number of topic  = 50 and
 = 100. Each configuration was run ten times to account for variability in probabilistic and
neuralbased outputs. This resulted in ten aggregated sets of results for each model and dataset, ensuring a
robust and statistically sound evaluation. This rigorous evaluation protocol not only ensures a fair
comparison across the diverse modeling paradigms but also provides comprehensive insights into each
model’s strengths and limitations in capturing thematic content across varied datasets.
          </p>
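          <p>For instance, with the Gensim implementation, training LDA and extracting the top 10 words per
topic for K = 50 looks roughly like the following (corpus construction is simplified; our experiments
additionally tuned hyperparameters, which this sketch omits):</p>
          <preformat>
from gensim.corpora import Dictionary
from gensim.models import LdaModel

def train_and_extract(token_lists, num_topics=50, top_n=10):
    """token_lists: tokenized training documents (lists of strings)."""
    dictionary = Dictionary(token_lists)
    corpus = [dictionary.doc2bow(doc) for doc in token_lists]
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=num_topics)
    # show_topic returns (word, probability) pairs for one topic
    return [[word for word, _ in lda.show_topic(t, topn=top_n)]
            for t in range(num_topics)]
          </preformat>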
        </sec>
        <sec id="sec-4-1-2">
          <title>4.3. Evaluation</title>
          <p>Metrics We employ two widely recognized automated metrics as baselines: the coherence metric
v [44], which measures the semantic consistency of top-ranked topic words, and the diversity metric
unique [19], which quantifies the proportion of unique words across all topics. To address the limitations
of automated metrics, we also use the suite of LLM-based metrics described in Section 3 for a more
nuanced evaluation of topic quality. These include: (1) Coherence Metrics: coherence rating metric
rate and outlier detection metric outlier. (2) Repetitiveness Metrics: repetitiveness rating metric
rate and duplicate concept detection metric duplicate. (3) Diversity Metric: diversity rating metric
rate. (4) Topic-document Alignment Metrics: irrelevant topic words detection metric ir-topic and
missing themes detection metric missing-theme. For the topic-document alignment metrics, a sample
set was constructed by randomly selecting one iteration of the model's output and sampling up to 100
associated documents per topic—yielding 59,499 samples from AGRIS and 38,321 from 20NG—thus
ensuring a comprehensive evaluation. These metrics harness LLMs' deep semantic understanding to
provide a comprehensive, multi-dimensional evaluation of topic quality.</p>
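          <p>As a minimal sketch of how a rating prompt can be issued and parsed (using the Hugging Face
transformers chat pipeline with one of our evaluators; the decoding settings and the parsing rule are
illustrative choices, not a specification of our released code):</p>
          <preformat>
import re
from transformers import pipeline

generator = pipeline("text-generation",
                     model="Qwen/Qwen2.5-14B-Instruct", device_map="auto")

def rate_coherence(topic_words):
    prompt = (f"Given a list of words {topic_words} representing a topic, "
              "assess the degree of semantic consistency among the words in the "
              "context of the topic. Rate the coherence of the topic on a scale "
              "of 1 to 3. The rate is:")
    out = generator([{"role": "user", "content": prompt}],
                    max_new_tokens=8, do_sample=False)
    reply = out[0]["generated_text"][-1]["content"]
    match = re.search(r"[1-3]", reply)
    return int(match.group()) if match else None  # None flags an unparsable reply
          </preformat>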
          <p>LLM as Evaluators We selected three open-source LLMs as evaluators for the proposed metrics:
Mistral-7B-Instruct-v0.3 (https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3, referred to as Mistral),
Meta-Llama-3.1-8B-Instruct (https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct, referred to as
Llama), and Qwen2.5-14B-Instruct (https://huggingface.co/Qwen/Qwen2.5-14B-Instruct, referred to as
Qwen). These LLMs, chosen for their diverse pretraining corpora and instruction-tuning objectives,
exhibit robust semantic understanding. Their complementary strengths ensure reliable, scalable, and
reproducible evaluations, while promoting transparency and facilitating replication in the research
community.</p>
          <p>Efficiency and Scalability We evaluated the computational efficiency and scalability of our
LLM-based evaluation framework. All experiments were conducted on an NVIDIA A100 GPU, with
approximately 40 GB allocated for coherence, repetitiveness, and diversity evaluations, and about 70 GB for
topic-document alignment. The combined evaluation time (three LLMs) for coherence, repetitiveness,
and diversity ranged from 15 minutes for K = 50 to 35 minutes for K = 100, while topic-document
alignment evaluation required between 2 and 3 hours. On GPUs with lower performance and memory,
reducing the batch size enables smooth operation at the expense of increased processing time, while
parallel computing can reduce runtime proportionally to the number of GPUs available. These results
demonstrate that our framework is both computationally efficient and scalable, making it suitable for
extensive evaluations of topic models.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results and Discussion</title>
      <sec id="sec-5-1">
        <title>5.1. Quantitative Results</title>
        <p>Adversarial Test We sampled 100 topics (each with 10 words) from four topic models applied to
the 20NG and AGRIS datasets. For the adversarial test of outlier detection outlier, the success rates on
20NG are 77% (Mistral), 81% (Llama), and 90% (Qwen), and on AGRIS, 82% (Mistral), 85% (Llama), and
93% (Qwen). For duplicate concept detection duplicate, success rates on 20NG are 37% (Mistral),
81% (Llama), and 84% (Qwen), and on AGRIS, 29% (Mistral), 74% (Llama), and 81% (Qwen). These results
highlight significant variability among LLM evaluators and underscore the importance of using multiple
evaluators to reliably assess topic quality.</p>
        <p>Coherence Tables 1 and 2 indicate that, based on the coherence rating metric rate, BERTopic
consistently outperforms the other topic models on both datasets, which aligns with the automated
metric v. With respect to the outlier detection metric outlier, for K = 50 two of the three LLMs report
the fewest outliers for BERTopic, supporting its superior coherence. At K = 100, Qwen maintains its
preference for BERTopic while Mistral finds that LDA and CombinedTM are comparable. In contrast,
Llama's outlier counts suggest that BERTopic has more outliers. These discrepancies are further
analyzed in Section 5.3.</p>
        <sec id="sec-5-1-1">
          <title>9https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3 10https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct 11https://huggingface.co/Qwen/Qwen2.5-14B-Instruct</title>
          <p>Evaluation results of repetitiveness for 20NG: LLM-based metrics rate and duplicate.</p>
          <p>Evaluation results of repetitiveness for AGRIS: LLM-based metrics rate and duplicate.
Evaluation results of coherence for AGRIS: automated metric v vs. LLM-based metrics rate and outlier.
concepts). For instance, although BERTopic shows high coherence, Tables 3 and 4 reveal that, most
LLM evaluators report a low rate and a high duplicate for it, suggesting its apparent coherence may be
inflated by redundant word selection. In contrast, ProdLDA on 20NG and LDA on AGRIS tend to exhibit
relatively high rate and low duplicate scores, implying that their coherence is less likely to be enhanced
by redundant word selection. These findings underscore the importance of assessing repetitiveness
alongside coherence to ensure that high coherence truly reflects meaningful topic quality.</p>
        <p>Diversity Table 5 compares the automated diversity metric unique with the LLM-based diversity
rating rate for both the 20NG and AGRIS datasets. For 20NG, ProdLDA consistently shows the lowest
diversity across both metrics, while LDA exhibits relatively high diversity. BERTopic and CombinedTM
yield intermediate scores. In AGRIS, however, the results are less consistent: for K = 50, Mistral
and Qwen rate LDA as highly diverse, whereas Llama assigns it a lower diversity, and for K = 100,
Mistral again favors LDA while Llama and Qwen indicate lower diversity for LDA. Furthermore,
although LLM-based evaluations generally rate BERTopic as highly diverse, its unique scores remain
moderate—suggesting that even when topics share common words, contextual nuances preserve
thematic distinctiveness. Overall, the divergence between unique and rate on AGRIS underscores the
importance of considering both lexical uniqueness and semantic context when assessing topic diversity.
Irrelevant Topic Words Table 6 presents the evaluation of the irrelevant topic words detection metric
ir-topic, where lower counts indicate better topic-document alignment. On 20NG, LDA, ProdLDA, and
CombinedTM consistently yield lower counts compared to BERTopic, indicating that their topic words
are more closely aligned with document content. In AGRIS, at K = 50 two of the three LLM evaluators
favor LDA for having the fewest irrelevant words, whereas at K = 100, CombinedTM achieves the
lowest count, suggesting its superior ability to capture nuanced document themes—likely due to its
effective integration of contextual embeddings.</p>
          <p>Missing Themes Table 7 reports the missing themes detection metric missing-theme, which quantifies
how many key document themes are omitted from the topic word list, with lower counts indicating
better thematic coverage. For 20NG at K = 50, two out of three LLM evaluators rate BERTopic as having
the lowest missing theme counts, suggesting that its topic words more comprehensively represent the
document themes. At K = 100, Qwen continues to favor BERTopic, while both Mistral and Llama
indicate that LDA provides the best coverage. In AGRIS, however, the differences across topic models
are minimal. Overall, missing-theme provides valuable insight into the extent to which topic models
may fail to capture less frequent or nuanced themes from documents, which is vital for understanding
long-tail phenomena and enhancing downstream applications.</p>
        <p>Divergent LLM Evaluation Patterns The evaluation results reveal distinct evaluation tendencies
among the three LLMs, offering valuable insights for researchers using LLMs as evaluators. In terms of
coherence, Qwen consistently flags a higher number of outliers outlier, suggesting a stricter criterion
for semantic consistency, while Mistral reports lower outlier counts, indicative of a more lenient
evaluation; Llama's results generally fall between these extremes. For repetitiveness, Qwen detects
fewer duplicate concepts duplicate compared to Llama, with Mistral's assessments again falling in
between—demonstrating variable sensitivity to lexical redundancy across evaluators. In topic–document
alignment, Qwen registers higher counts of irrelevant topic words ir-topic yet lower counts of missing
themes missing-theme for 20NG than for AGRIS, implying that 20NG documents exhibit greater
thematic diversity and complexity. These insights underscore the influence of evaluator-specific biases
on metric outcomes and highlight the importance of carefully selecting an LLM evaluator based on the
intended application.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Visualization</title>
        <p>Standardization of Metrics To enable fair comparisons across LLM evaluators (Mistral, Llama,
Qwen), all metrics are normalized to the [0, 1] range using the following piecewise function:</p>
        <p>$s_{\mathrm{norm}} = \begin{cases} 0.5 + \dfrac{s - s_{\mathrm{mean}}}{s_{\mathrm{max}} - s_{\mathrm{min}}}, &amp; \text{if higher values indicate better performance,} \\ 1 - \left(0.5 + \dfrac{s - s_{\mathrm{mean}}}{s_{\mathrm{max}} - s_{\mathrm{min}}}\right), &amp; \text{if higher values indicate poorer performance.} \end{cases}$</p>
        <p>Here, $s_{\mathrm{norm}}$ is the normalized score, while $s_{\mathrm{mean}}$, $s_{\mathrm{max}}$, and $s_{\mathrm{min}}$ are the mean, maximum, and minimum values of the raw score $s$ within each evaluator group.</p>
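        <p>A direct transcription of this normalization in Python (names are ours):</p>
        <preformat>
def normalize(values, higher_is_better):
    """Apply the piecewise normalization within one evaluator group."""
    mean = sum(values) / len(values)
    spread = (max(values) - min(values)) or 1.0  # guard against a zero range
    def norm(s):
        score = 0.5 + (s - mean) / spread
        return score if higher_is_better else 1.0 - score
    return [norm(s) for s in values]
        </preformat>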
        <p>Visualization and Analysis The radar plots (Figures 1 and 2) show clear discrepancies in Llama's
coherence scores. A high coherence rate rate should imply fewer outliers outlier, yet Llama rates
BERTopic high while flagging more outliers than for other models. Similarly, Mistral gives low rate
scores to CombinedTM and LDA but paradoxically finds fewer outliers in their topics.</p>
        <p>[Table 8: Examples of outlier detection discrepancies across LLMs. For the topic words "interested
book advance fax printer print email address mail mailing", Mistral flags "fax", Llama flags "fax, printer,
print", and Qwen flags "advance, fax". For "keyboard window output problem work time run input
response drug", all three flag "drug". For "science evidence theory scientific observation scientist fact
explain bug claim", Mistral and Llama flag "bug", and Qwen flags "bug, claim".]</p>
        <p>[Table 9: Examples of duplicate concept pairs detected by each LLM. Mistral: (faith, belief),
(christian, church); (patient, adult), (child, adult); (client, customer), (mail, email). Llama: (doctrine,
scripture), (faith, religion), (christian, church); (child, patient), (disease, health), (treatment, drug);
(client, directory), (software, project), (search, package). Qwen: (faith, belief), (scripture, doctrine);
(disease, treatment), (medical, drug); None.]</p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Qualitative Analysis</title>
        <p>In this section, we provide a qualitative analysis of representative examples to explore discrepancies
and patterns in outlier detection and duplicate concept detection.</p>
        <p>Outlier Detection Discrepancies Outlier detection is a crucial aspect of evaluating topic coherence,
as it identifies semantically inconsistent words in a topic's word list. Across the examples, the outliers
identified by the models often intersect but also reflect unique insights. Table 8 shows examples of
outlier detection discrepancies across different LLMs. Compared to the other two models, Mistral is
more cautious in detecting outliers among topic words. By contrast, Qwen is relatively more aggressive,
flagging topic words with unclear semantic grounding as outliers.
Duplicate Concept Detection Contradictions The extracted duplicate pairs often differ
significantly among the LLMs, showcasing varying thresholds for identifying conceptual overlap. Table 9
shows that Mistral treats semantically related nouns (e.g., "christian" and "church"), nouns whose
referents merely intersect (e.g., "patient" and "adult"), and nouns that belong to the same category (e.g.,
"child" and "adult") as conceptually identical. It has also produced hallucinations (e.g., detecting a
non-existent repetition of the word "customer" for "client" and a non-existent repetition of the word
"email" for "mail"). Llama treats grammatically related words (e.g., the verb "search" and its potential
object "package") and semantically opposite words (e.g., "disease" and "health") as conceptually
identical.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>In this work, we have introduced a comprehensive framework for evaluating topic models using
LLM-based metrics that complements traditional automated metrics by incorporating nuanced measures of
coherence, repetitiveness, diversity, and topic–document alignment. We designed specific evaluation
protocols—including adversarial tests—to reveal not only the strengths and weaknesses of various
topic models but also the intrinsic biases and judgment tendencies of different LLM evaluators. Our
experiments on both the 20NG and AGRIS datasets demonstrate that LLMs can provide rich,
context-sensitive insights into topic quality, while also highlighting evaluator-specific variations that are crucial
for informed application in downstream tasks.</p>
      <p>These findings illustrate the potential of our framework to expand the boundaries of topic model
assessment by emphasizing both interpretability and practical application needs. Future work will focus
on further refining these metrics, exploring additional LLM evaluators, and assessing how evaluator
biases impact downstream tasks, thereby fostering more robust and actionable topic model assessments.</p>
      <p>[19] A. B. Dieng, F. J. R. Ruiz, D. M. Blei, Topic modeling in embedding spaces, Transactions of the Association for Computational Linguistics 8 (2020) 439–453. URL: https://aclanthology.org/2020.tacl-1.29. doi:10.1162/tacl_a_00325.
[20] S. Sia, A. Dalmia, S. J. Mielke, Tired of topic models? Clusters of pretrained word embeddings make for fast and good topics too!, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Online, 2020, pp. 1728–1736. URL: https://aclanthology.org/2020.emnlp-main.135. doi:10.18653/v1/2020.emnlp-main.135.
[21] F. Bianchi, S. Terragni, D. Hovy, Pre-training is a hot topic: Contextualized document embeddings improve topic coherence, in: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), Association for Computational Linguistics, Online, 2021, pp. 759–766. URL: https://aclanthology.org/2021.acl-short.96. doi:10.18653/v1/2021.acl-short.96.
[22] M. Grootendorst, BERTopic: Neural topic modeling with a class-based TF-IDF procedure, arXiv preprint arXiv:2203.05794 (2022).
[23] C. Pham, A. Hoyle, S. Sun, P. Resnik, M. Iyyer, TopicGPT: A prompt-based topic modeling framework, in: Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), Association for Computational Linguistics, Mexico City, Mexico, 2024, pp. 2956–2984. URL: https://aclanthology.org/2024.naacl-long.164. doi:10.18653/v1/2024.naacl-long.164.
[24] Y. Mu, C. Dong, K. Bontcheva, X. Song, Large language models offer an alternative to the traditional approach of topic modelling, in: Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), ELRA and ICCL, Torino, Italia, 2024, pp. 10160–10171. URL: https://aclanthology.org/2024.lrec-main.887.
[25] Y. Mu, P. Bai, K. Bontcheva, X. Song, Addressing topic granularity and hallucination in large language models for topic modelling, 2024. URL: https://arxiv.org/abs/2405.00611. arXiv:2405.00611.
[26] D. M. Blei, J. D. Lafferty, A correlated topic model of science, The Annals of Applied Statistics 1 (2007) 17–35. URL: https://projecteuclid.org/journals/annals-of-applied-statistics/volume-1/issue-1/A-correlated-topic-model-of-Science/10.1214/07-AOAS114.full.
[27] D. Newman, A. Asuncion, P. Smyth, M. Welling, Distributed algorithms for topic models, Journal of Machine Learning Research 10 (2009).
[28] C. Wang, D. Blei, D. Heckerman, Continuous time dynamic topic models, in: Proceedings of the Twenty-Fourth Conference on Uncertainty in Artificial Intelligence, UAI'08, AUAI Press, Arlington, Virginia, USA, 2008, pp. 579–586.
[29] G. E. Hinton, R. R. Salakhutdinov, Replicated softmax: An undirected topic model, in: Advances in Neural Information Processing Systems, volume 22, Curran Associates, Inc., 2009. URL: https://proceedings.neurips.cc/paper_files/paper/2009/file/31839b036f63806cba3f47b93af8ccb5-Paper.pdf.
[30] R. Ding, R. Nallapati, B. Xiang, Coherence-aware neural topic modeling, in: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Brussels, Belgium, 2018, pp. 830–836. URL: https://aclanthology.org/D18-1096. doi:10.18653/v1/D18-1096.
[31] D. Card, C. Tan, N. A. Smith, Neural models for documents with metadata, in: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Melbourne, Australia, 2018, pp. 2031–2040. URL: https://aclanthology.org/P18-1189. doi:10.18653/v1/P18-1189.
[32] H. Zhang, B. Chen, D. Guo, M. Zhou, WHAI: Weibull hybrid autoencoding inference for deep topic modeling, arXiv preprint arXiv:1803.01328 (2018).
[33] J. Chang, S. Gerrish, C. Wang, J. Boyd-Graber, D. Blei, Reading tea leaves: How humans interpret topic models, in: Advances in Neural Information Processing Systems, volume 22, Curran Associates, Inc., 2009. URL: https://proceedings.neurips.cc/paper_files/paper/2009/file/f92586a25bb3145facd64ab20fd554f-Paper.pdf.
[34] D. Newman, J. H. Lau, K. Grieser, T. Baldwin, Automatic evaluation of topic coherence, in: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Association for Computational Linguistics, Los Angeles, California, 2010, pp. 100–108. URL: https://aclanthology.org/N10-1012.
[35] D. Mimno, H. Wallach, E. Talley, M. Leenders, A. McCallum, Optimizing semantic coherence in topic models, in: Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Edinburgh, Scotland, UK, 2011, pp. 262–272. URL: https://aclanthology.org/D11-1024.
[36] N. Aletras, M. Stevenson, Evaluating topic coherence using distributional semantics, in: Proceedings of the 10th International Conference on Computational Semantics (IWCS 2013) – Long Papers, Association for Computational Linguistics, Potsdam, Germany, 2013, pp. 13–22. URL: https://aclanthology.org/W13-0102.
[37] D. Stammbach, V. Zouhar, A. Hoyle, M. Sachan, E. Ash, Revisiting automated topic model evaluation with large language models, in: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Singapore, 2023, pp. 9348–9357. URL: https://aclanthology.org/2023.emnlp-main.581. doi:10.18653/v1/2023.emnlp-main.581.
[38] D. Newman, Y. Noh, E. Talley, S. Karimi, T. Baldwin, Evaluating topic models for digital libraries, in: Proceedings of the 10th Annual Joint Conference on Digital Libraries, JCDL '10, Association for Computing Machinery, New York, NY, USA, 2010, pp. 215–224. URL: https://doi.org/10.1145/1816123.1816156. doi:10.1145/1816123.1816156.
[39] G. Bouma, Normalized (pointwise) mutual information in collocation extraction, Proceedings of GSCL 30 (2009) 31–40.
[40] J. H. Lau, D. Newman, T. Baldwin, Machine reading tea leaves: Automatically evaluating topic coherence and topic model quality, in: Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, Association for Computational Linguistics, Gothenburg, Sweden, 2014, pp. 530–539. URL: https://aclanthology.org/E14-1056. doi:10.3115/v1/E14-1056.
[41] R. Wang, D. Zhou, Y. He, ATM: Adversarial-neural topic model, Information Processing &amp; Management 56 (2019) 102098. URL: https://www.sciencedirect.com/science/article/pii/S0306457319300500. doi:10.1016/j.ipm.2019.102098.
[42] R. Wang, X. Hu, D. Zhou, Y. He, Y. Xiong, C. Ye, H. Xu, Neural topic modeling with bidirectional adversarial training, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Online, 2020, pp. 340–350. URL: https://aclanthology.org/2020.acl-main.32. doi:10.18653/v1/2020.acl-main.32.
[43] X. Wu, X. Dong, T. T. Nguyen, A. T. Luu, Effective neural topic modeling with embedding clustering regularization, in: International Conference on Machine Learning, PMLR, 2023, pp. 37335–37357.
[44] M. Röder, A. Both, A. Hinneburg, Exploring the space of topic coherence measures, in: Proceedings of the Eighth ACM International Conference on Web Search and Data Mining, WSDM '15, Association for Computing Machinery, New York, NY, USA, 2015, pp. 399–408. URL: https://doi.org/10.1145/2684822.2685324. doi:10.1145/2684822.2685324.
[45] T. Schnabel, I. Labutov, D. Mimno, T. Joachims, Evaluation methods for unsupervised word embeddings, in: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Lisbon, Portugal, 2015, pp. 298–307. URL: https://aclanthology.org/D15-1036. doi:10.18653/v1/</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>L.</given-names>
            <surname>Waltman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. J.</given-names>
            <surname>Van Eck</surname>
          </string-name>
          ,
          <article-title>A new methodology for constructing a publication-level classification system of science</article-title>
          ,
          <source>Journal of the American Society for Information Science and Technology</source>
          <volume>63</volume>
          (
          <year>2012</year>
          )
          <fpage>2378</fpage>
          -
          <lpage>2392</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M. F.</given-names>
            <surname>Moura</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. M.</given-names>
            <surname>Marcacini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. M.</given-names>
            <surname>Nogueira</surname>
          </string-name>
          , M. d. S. Conrado,
          <string-name>
            <given-names>S. O.</given-names>
            <surname>Rezende</surname>
          </string-name>
          ,
          <article-title>A proposal for building domain topic taxonomies</article-title>
          ,
          <source>Workshop on Web and Text Intelligence, Simpósio Brasileiro em ...</source>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Dauxais</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Zaratiana</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Laneuville</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. D.</given-names>
            <surname>Hernandez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Holat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Grosman</surname>
          </string-name>
          ,
          <article-title>Towards automation of topic taxonomy construction</article-title>
          ,
          <source>in: International Symposium on Intelligent Data Analysis</source>
          , Springer,
          <year>2022</year>
          , pp.
          <fpage>26</fpage>
          -
          <lpage>38</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Kotitsas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Pappas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Manola</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Papageorgiou</surname>
          </string-name>
          ,
          <article-title>Scinobo: a novel system classifying scholarly communication in a dynamically constructed hierarchical field-of-science taxonomy</article-title>
          ,
          <source>Frontiers in Research Metrics and Analytics</source>
          <volume>8</volume>
          (
          <year>2023</year>
          )
          <fpage>1149834</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M. H.</given-names>
            <surname>Bhat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Shafi</surname>
          </string-name>
          , et al.,
          <article-title>Taxonomies in knowledge organisation – need, description and benefits</article-title>
          ,
          <source>Annals of Library and Information Studies (ALIS)</source>
          <volume>61</volume>
          (
          <year>2014</year>
          )
          <fpage>102</fpage>
          -
          <lpage>111</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Shang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <article-title>Nettaxo: Automated topic taxonomy construction from text-rich network</article-title>
          ,
          <source>in: Proceedings of The Web Conference 2020</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>1908</fpage>
          -
          <lpage>1919</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>P.</given-names>
            <surname>Langlais</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. L.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <article-title>Rate: a reproducible automatic taxonomy evaluation by filling the gap</article-title>
          ,
          <source>in: Proceedings of the 15th International Conference on Computational Semantics</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>173</fpage>
          -
          <lpage>182</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>X.</given-names>
            <surname>Amatriain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sankar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bing</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. K.</given-names>
            <surname>Bodigutla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. J.</given-names>
            <surname>Hazen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kazi</surname>
          </string-name>
          ,
          <article-title>Transformer models: an introduction and catalog</article-title>
          ,
          <source>arXiv preprint arXiv:2302.07730</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J.</given-names>
            <surname>D'Souza</surname>
          </string-name>
          ,
          <article-title>A catalog of transformer models</article-title>
          ,
          <year>2023</year>
          . URL: https://orkg.org/comparison/R609337/. doi:10.48366/R609337.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>D.</given-names>
            <surname>Stammbach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Zouhar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hoyle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sachan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Ash</surname>
          </string-name>
          ,
          <article-title>Revisiting automated topic model evaluation with large language models</article-title>
          ,
          <source>in: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>9348</fpage>
          -
          <lpage>9357</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>C. H.</given-names>
            <surname>Papadimitriou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Tamaki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Raghavan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Vempala</surname>
          </string-name>
          ,
          <article-title>Latent semantic indexing: A probabilistic analysis</article-title>
          ,
          <source>in: Proceedings of the seventeenth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems</source>
          ,
          <year>1998</year>
          , pp.
          <fpage>159</fpage>
          -
          <lpage>168</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>T.</given-names>
            <surname>Hofmann</surname>
          </string-name>
          ,
          <article-title>Probabilistic latent semantic indexing</article-title>
          ,
          <source>in: Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval</source>
          ,
          <year>1999</year>
          . URL: https://sigir.org/wp-content/uploads/2017/06/p211.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>D. M.</given-names>
            <surname>Blei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Y.</given-names>
            <surname>Ng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. I.</given-names>
            <surname>Jordan</surname>
          </string-name>
          ,
          <article-title>Latent dirichlet allocation</article-title>
          ,
          <source>Journal of Machine Learning Research</source>
          <volume>3</volume>
          (
          <year>2003</year>
          )
          <fpage>993</fpage>
          -
          <lpage>1022</lpage>
          . URL: https://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>T. L.</given-names>
            <surname>Griffiths</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Steyvers</surname>
          </string-name>
          ,
          <article-title>Finding scientific topics</article-title>
          ,
          <source>Proceedings of the National Academy of Sciences</source>
          <volume>101</volume>
          (
          <year>2004</year>
          )
          <fpage>5228</fpage>
          -
          <lpage>5235</lpage>
          . URL: https://dyurovsky.github.io/learning-humans-machines/class/24-class/papers/grifiths2004.pdf. doi:10.1073/pnas.0307752101.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>D. P.</given-names>
            <surname>Kingma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Welling</surname>
          </string-name>
          ,
          <article-title>Auto-encoding variational Bayes</article-title>
          ,
          <source>in: 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings</source>
          ,
          <year>2014</year>
          . arXiv:http://arxiv.org/abs/1312.6114v10.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Miao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Blunsom</surname>
          </string-name>
          ,
          <article-title>Neural variational inference for text processing</article-title>
          , in:
          <string-name>
            <given-names>M. F.</given-names>
            <surname>Balcan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. Q.</given-names>
            <surname>Weinberger</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of The 33rd International Conference on Machine Learning</source>
          , volume
          <volume>48</volume>
          <source>of Proceedings of Machine Learning Research</source>
          , PMLR, New York, New York, USA,
          <year>2016</year>
          , pp.
          <fpage>1727</fpage>
          -
          <lpage>1736</lpage>
          . URL: https://proceedings.mlr.press/v48/miao16.html.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Miao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Grefenstette</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Blunsom</surname>
          </string-name>
          ,
          <article-title>Discovering discrete latent topics with neural variational inference</article-title>
          , in:
          <string-name>
            <given-names>D.</given-names>
            <surname>Precup</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. W.</given-names>
            <surname>Teh</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 34th International Conference on Machine Learning</source>
          , volume
          <volume>70</volume>
          <source>of Proceedings of Machine Learning Research, PMLR</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>2410</fpage>
          -
          <lpage>2419</lpage>
          . URL: https://proceedings.mlr.press/v70/miao17a.html.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>A.</given-names>
            <surname>Srivastava</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Sutton</surname>
          </string-name>
          ,
          <article-title>Autoencoding variational inference for topic models</article-title>
          ,
          <source>in: International Conference on Learning Representations</source>
          ,
          <year>2017</year>
          . URL: https://openreview.net/forum?id=BybtVK9lg.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>