<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Comparison of Embedding Techniques for Topic Modeling Coherence Measures</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mark Belford</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Derek Greene</string-name>
          <email>derek.greene@ucd.ie</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Insight Centre for Data Analytics, University College Dublin</institution>
          ,
          <country country="IE">Ireland</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The quality of topic modeling solutions is often evaluated using topic coherence measures, which attempt to quantify the semantic meaningfulness of the topic descriptors. One popular approach to evaluating coherence is the use of word embeddings, where terms are represented as vectors in a semantic space. However, a number of popular embedding methodologies and variants can be used to construct these vectors, which raises questions regarding the optimal embedding approach to use when calculating the coherence of solutions produced for a given dataset. In this work we evaluate the differences between two popular word embedding algorithms and their variants, using two distinct external reference corpora, to discover whether these underlying choices have a substantial impact on the resulting coherence scores.</p>
      </abstract>
      <kwd-group>
        <kwd>Topic Modeling</kwd>
        <kwd>Coherence</kwd>
        <kwd>Embeddings</kwd>
      </kwd-group>
      <kwd-group kwd-group-type="ACM-2012">
        <kwd>Information systems → Document topic models</kwd>
      </kwd-group>
      <funding-group>
        <funding-statement>Mark Belford: This research was supported by Science Foundation Ireland (SFI) under Grant Number SFI/12/RC/2289.</funding-statement>
      </funding-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Topic modeling facilitates the discovery of the underlying latent themes or topics in a corpus of text documents. Topics are frequently represented by their top n terms, referred to as topic descriptors. There are many popular topic modeling approaches, including probabilistic techniques such as Latent Dirichlet Allocation (LDA) [2] and those based on matrix factorization, such as Non-negative Matrix Factorization (NMF) [5]. Ideally, topic modeling solutions should be of high quality and easily interpretable; unfortunately this is not always the case, as poor solutions can arise for a number of reasons, such as the stochastic nature of traditional topic modeling algorithms [1]. With this in mind, quality metrics are frequently used to evaluate solutions, with topic coherence being the most common. These measures typically attempt to evaluate the semantic coherence of a set of topics, relative to a background corpus.</p>
      <p>While originally a human-evaluated task [4], there now exists a variety of automated coherence methodologies [7, 8, 12]. A more recently proposed approach to evaluating coherence utilises word embedding algorithms, such as word2vec [6] and fastText [3]. In both of these approaches, words are represented in a dense, low-dimensional vector space, where words with similar meaning and usage lie close to one another. Both algorithms offer two model variants to construct these vectors: Continuous Bag-Of-Words (CBOW) and Skip-Gram (SG). The goal of CBOW is to predict a target word using the surrounding context words, based on a sliding window, while SG is the inverse, where the goal is to predict the surrounding context words for a given target word. Word embedding models must be trained on large external reference corpora to facilitate making these predictions. However, questions arise regarding which of these embedding approaches to utilise when calculating topic coherence for a given dataset, especially as there are many facets left to the user to specify, and these may have an impact on the results. With this in mind, we propose the following research question: how does the choice of embedding algorithm, selected variant, and background reference corpus impact the resulting coherence scores?</p>
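The distinction between the two variants can be illustrated by the training pairs each derives from a sliding window; a minimal sketch, where the window size and the tokenised sentence are purely illustrative:

```python
def training_pairs(tokens, window, variant):
    """Enumerate (input, predicted) pairs from a sliding window.

    CBOW predicts the target word from its pooled surrounding context;
    Skip-Gram (SG) predicts each context word from the target.
    """
    pairs = []
    for i, target in enumerate(tokens):
        context = [tokens[j]
                   for j in range(max(0, i - window),
                                  min(len(tokens), i + window + 1))
                   if j != i]
        if variant == "cbow":
            pairs.append((tuple(context), target))       # context -> target
        else:                                            # "sg"
            pairs.extend((target, c) for c in context)   # target -> each context word
    return pairs

sentence = ["topic", "models", "discover", "latent", "themes"]
cbow = training_pairs(sentence, window=1, variant="cbow")
sg = training_pairs(sentence, window=1, variant="sg")
```

CBOW pools the whole context into a single prediction per position, while SG emits one prediction per context word, which is why SG produces more training pairs per sentence.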
    </sec>
    <sec id="sec-2">
      <title>Methodology</title>
      <p>To calculate the coherence of topic descriptors using word embeddings we utilise the approach
proposed by [9]. This technique quantifies the intra-topic coherence based on word similarities
using their learned vector representations from a given embedding model. However, it is
possible that some of these top terms may not have a corresponding vector in the embedding
model due to not appearing in the vocabulary of the external reference corpus used for
training. To account for this we propose a small modification to this approach in which we
construct the list of top terms as the first N terms that appear in a descriptor but are also
contained in the embedding vocabulary. By following this approach coherence scores for
topics are calculated using the formulation seen in Equation 1. While fastText can generate
vectors for terms that are not present in the reference corpus vocabulary we chose not to
utilise this feature to ensure a fair comparison with word2vec. Frequently topic coherence is
only measured at the individual topic level, such as in Equation 1. However, we can also
calculate an overall coherence score at the model level by simply computing the average of
these individual topic descriptor coherence scores.</p>
      <p>
        <disp-formula id="eq1">
          <label>(1)</label>
          <tex-math><![CDATA[ TC = \frac{1}{\binom{N}{2}} \sum_{j=2}^{N} \sum_{i=1}^{j-1} \operatorname{similarity}(w_i, w_j) ]]></tex-math>
        </disp-formula>
      </p>
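The modified descriptor coherence of Equation 1 can be sketched as follows, assuming cosine similarity over unit-normalised vectors and a simple in-memory dictionary of term vectors; the toy vocabulary is purely illustrative, and at least two descriptor terms are assumed to be in the embedding vocabulary:

```python
import numpy as np

def topic_coherence(descriptor, embeddings, n=10):
    """TC for one topic: mean pairwise similarity over the first n
    descriptor terms that also appear in the embedding vocabulary."""
    terms = [t for t in descriptor if t in embeddings][:n]
    vecs = [embeddings[t] / np.linalg.norm(embeddings[t]) for t in terms]
    total, pairs = 0.0, 0
    for j in range(1, len(vecs)):
        for i in range(j):
            total += float(vecs[i] @ vecs[j])  # cosine similarity of unit vectors
            pairs += 1
    return total / pairs  # pairs equals N-choose-2 for N retained terms

# Toy vectors for illustration only -- real embeddings are 100-dimensional.
vocab = {"economy": np.array([1.0, 0.0]),
         "market": np.array([1.0, 0.0]),
         "football": np.array([0.0, 1.0])}
```

Out-of-vocabulary terms (e.g. a descriptor term absent from the reference corpus) are simply skipped, mirroring the modification described above.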
      <p>For our experiments, we constructed 15 yearly datasets from The Guardian API, where
associated article section labels were used as ground truth topics (e.g. “politics”, “technology”).
We then built 100-dimensional CBOW/SG word2vec and fastText embeddings on two larger
background corpora: (1) 1.6m Guardian news articles published from 2004-2018, (2) 4.9m
Wikipedia long abstracts collected in 2016 [10]. These variant and corpus combinations
yielded 8 embeddings, as seen in Table 2. For each dataset, we generate 100 runs of
randomly-initialized NMF, and compute 100 corresponding model-level coherence scores,
before averaging this set to compute a final coherence value, as seen in Equation 2. We
repeat this process over a range of topic numbers k ∈ [2, 30] for each embedding and dataset
combination. Table 1 provides a detailed breakdown of these datasets.</p>
      <p>
        <disp-formula id="eq2">
          <label>(2)</label>
          <tex-math><![CDATA[ \operatorname{MeanTC} = \frac{1}{r} \sum_{i=1}^{r} TC(\mathrm{model}_i) ]]></tex-math>
        </disp-formula>
      </p>
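The averaging in Equation 2 is straightforward; a minimal sketch, where each run is represented by the list of its per-topic TC scores from Equation 1 (the score values are illustrative):

```python
def mean_coherence(runs):
    """Equation 2: average the model-level coherence over r runs, where
    each run is given as the list of its per-topic TC scores."""
    model_scores = [sum(topics) / len(topics) for topics in runs]
    return sum(model_scores) / len(model_scores)

# e.g. three randomly-initialised NMF runs of a k=2 model
runs = [[0.40, 0.60], [0.50, 0.50], [0.45, 0.65]]
```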
    </sec>
    <sec id="sec-3">
      <title>Ranked Correlation</title>
      <p>We first investigated whether there was a noticeable difference between the embedding
approaches with respect to their coherence scores, by measuring the Spearman rank correlation
between the average topic coherence scores produced on each of the 15 Guardian datasets.
These results are displayed as a heatmap plot in Figure 1. It is evident that there is
a large difference between embedding models trained on different background corpora,
with such models having much lower correlation scores with respect to each other.
It is also worth noting that, when trained on the same background corpus, the different
embedding algorithms exhibit relatively high correlation scores. This suggests that they
may perform similarly when trained on the same data. Exploring this further, there is
also a high level of correlation between the variants of the different embedding
algorithms (i.e. CBOW vs. SG) when utilising the same reference corpus.</p>
      <fig id="fig1">
        <label>Figure 1</label>
        <caption>
          <p>Heatmap of the average Spearman rank correlation between the coherence scores of the eight embedding combinations (Guardian/Wikipedia corpus × fastText/word2vec × CBOW/SG).</p>
        </caption>
      </fig>
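The pairwise comparison can be sketched as follows; this is a minimal, tie-free Spearman computation for illustration (a real analysis would typically use scipy.stats.spearmanr, which also handles tied ranks):

```python
import numpy as np

def spearman(x, y):
    """Spearman rank correlation: the Pearson correlation of the ranks.

    Note: no tie handling -- adequate for illustration, since coherence
    scores rarely tie exactly.
    """
    def ranks(v):
        order = np.argsort(v)
        r = np.empty(len(v))
        r[order] = np.arange(1, len(v) + 1)
        return r
    rx, ry = ranks(np.asarray(x)), ranks(np.asarray(y))
    return float(np.corrcoef(rx, ry)[0, 1])
```

Applied to the per-dataset average coherence scores of two embedding combinations, a value near 1.0 indicates that they rank the 15 datasets almost identically.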
    </sec>
    <sec id="sec-4">
      <title>Ground Truth Evaluation</title>
      <p>A common application of topic coherence is to select an appropriate number of topics k.
Therefore, we further explored the effect of embedding choice as follows. For each dataset
and embedding model, we sorted the coherence scores for different k values to identify the
top values of k. We then counted the number of times the “ground truth value” of k appears
within the top n recommendations, for n = 1 to n = 5, as seen in Table 3. For example, the
wikipedia-w2v-cbow embedding correctly identifies the ground truth number of topics when
n = 5 for 14 of the 15 datasets. Surprisingly, using the Wikipedia corpus rather than the
domain-specific Guardian corpus produces better embeddings with respect to identifying the
“correct” number of topics. This may be due to a temporal effect: The Guardian news
articles span a 15-year period, while the Wikipedia dump reflects a relatively recent
collection of articles. It is also interesting to note that fastText performs considerably worse
than the word2vec model in these cases. Across all combinations it is also clear that the
CBOW variant performs better than SG, which is likely due to CBOW only having to predict
a single target word rather than the context words around it.</p>
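The selection procedure described above can be sketched as follows; the mapping from k to mean coherence is purely illustrative:

```python
def topn_hit(coherence_by_k, true_k, n):
    """Return 1 if the ground-truth k is among the n values of k with the
    highest mean coherence, else 0 (applied per dataset, then summed
    across datasets to fill one cell of Table 3)."""
    ranked = sorted(coherence_by_k, key=coherence_by_k.get, reverse=True)
    return int(true_k in ranked[:n])

# Illustrative mean coherence scores for one dataset over candidate k values
scores = {2: 0.41, 3: 0.52, 4: 0.58, 5: 0.49, 6: 0.44}
```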
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>In this work we have demonstrated that care should be taken when utilising word embeddings
in the process of measuring topic coherence. It is clear that the choice of embedding algorithm,
model variant, and background corpus has a large impact on the resulting coherence values,
which could potentially influence topic model parameter selection choices, and ultimately
affect the interpretations made from the topics identified on a given corpus.</p>
      <table-wrap id="tbl3">
        <label>Table 3</label>
        <caption>
          <p>Number of times the ground truth value of k was identified in the top n elements for each embedding combination.</p>
        </caption>
      </table-wrap>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <given-names>Mark</given-names>
            <surname>Belford</surname>
          </string-name>
          , Brian Mac Namee, and
          <string-name>
            <given-names>Derek</given-names>
            <surname>Greene</surname>
          </string-name>
          .
          <article-title>Stability of topic modeling via matrix factorization</article-title>
          .
          <source>Expert Systems with Applications</source>
          ,
          <volume>91</volume>
          :
          <fpage>159</fpage>
          -
          <lpage>169</lpage>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <given-names>David M</given-names>
            <surname>Blei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Andrew Y</given-names>
            <surname>Ng</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Michael I</given-names>
            <surname>Jordan</surname>
          </string-name>
          .
          <article-title>Latent dirichlet allocation</article-title>
          .
          <source>Journal of machine Learning research</source>
          ,
          <volume>3</volume>
          :
          <fpage>993</fpage>
          -
          <lpage>1022</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <given-names>Piotr</given-names>
            <surname>Bojanowski</surname>
          </string-name>
          , Edouard Grave, Armand Joulin, and
          <string-name>
            <given-names>Tomas</given-names>
            <surname>Mikolov</surname>
          </string-name>
          .
          <article-title>Enriching word vectors with subword information</article-title>
          .
          <source>Transactions of the Association for Computational Linguistics</source>
          ,
          <volume>5</volume>
          :
          <fpage>135</fpage>
          -
          <lpage>146</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <given-names>Jonathan</given-names>
            <surname>Chang</surname>
          </string-name>
          , Jordan Boyd-Graber, Sean Gerrish, Chong Wang, and
          <string-name>
            <given-names>David M</given-names>
            <surname>Blei</surname>
          </string-name>
          .
          <article-title>Reading tea leaves: How humans interpret topic models</article-title>
          .
          <source>In Advances in neural information processing systems</source>
          , pages
          <fpage>288</fpage>
          -
          <lpage>296</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <given-names>Daniel D</given-names>
            <surname>Lee</surname>
          </string-name>
          and
          <string-name>
            <given-names>H Sebastian</given-names>
            <surname>Seung</surname>
          </string-name>
          .
          <article-title>Learning the parts of objects by non-negative matrix factorization</article-title>
          .
          <source>Nature</source>
          ,
          <volume>401</volume>
          (
          <issue>6755</issue>
          ):
          <fpage>788</fpage>
          -
          <lpage>791</lpage>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <given-names>Tomas</given-names>
            <surname>Mikolov</surname>
          </string-name>
          , Kai Chen, Greg Corrado, and
          <string-name>
            <given-names>Jeffrey</given-names>
            <surname>Dean</surname>
          </string-name>
          .
          <article-title>Efficient estimation of word representations in vector space</article-title>
          .
          <source>arXiv preprint arXiv:1301.3781</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <given-names>David</given-names>
            <surname>Mimno</surname>
          </string-name>
          , Hanna M. Wallach, Edmund Talley, Miriam Leenders, and
          <string-name>
            <given-names>Andrew</given-names>
            <surname>McCallum</surname>
          </string-name>
          .
          <article-title>Optimizing semantic coherence in topic models</article-title>
          .
          <source>In Proceedings of the conference on empirical methods in natural language processing</source>
          , pages
          <fpage>262</fpage>
          -
          <lpage>272</lpage>
          . Association for Computational Linguistics,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <given-names>David</given-names>
            <surname>Newman</surname>
          </string-name>
          , Jey Han Lau,
          <string-name>
            <given-names>Karl</given-names>
            <surname>Grieser</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Timothy</given-names>
            <surname>Baldwin</surname>
          </string-name>
          .
          <article-title>Automatic evaluation of topic coherence</article-title>
          .
          <source>In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics</source>
          , pages
          <fpage>100</fpage>
          -
          <lpage>108</lpage>
          . Association for Computational Linguistics,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <given-names>Derek</given-names>
            <surname>O'Callaghan</surname>
          </string-name>
          , Derek Greene, Joe Carthy, and
          <string-name>
            <given-names>Pádraig</given-names>
            <surname>Cunningham</surname>
          </string-name>
          .
          <article-title>An analysis of the coherence of descriptors in topic modeling</article-title>
          .
          <source>Expert Systems with Applications</source>
          ,
          <volume>42</volume>
          (
          <issue>13</issue>
          ):
          <fpage>5645</fpage>
          -
          <lpage>5657</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <given-names>M. Atif</given-names>
            <surname>Qureshi</surname>
          </string-name>
          and
          <string-name>
            <given-names>Derek</given-names>
            <surname>Greene</surname>
          </string-name>
          .
          <article-title>EVE: Explainable vector based embedding technique using wikipedia</article-title>
          .
          <source>Journal of Intelligent Information Systems</source>
          ,
          <year>Jun 2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <given-names>Radim</given-names>
            <surname>Řehůřek</surname>
          </string-name>
          and
          <string-name>
            <given-names>Petr</given-names>
            <surname>Sojka</surname>
          </string-name>
          .
          <article-title>Software framework for topic modelling with large corpora</article-title>
          .
          <source>In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks</source>
          , pages
          <fpage>45</fpage>
          -
          <lpage>50</lpage>
          , Valletta, Malta, May
          <year>2010</year>
          . ELRA.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <given-names>Michael</given-names>
            <surname>Röder</surname>
          </string-name>
          , Andreas Both, and
          <string-name>
            <given-names>Alexander</given-names>
            <surname>Hinneburg</surname>
          </string-name>
          .
          <article-title>Exploring the space of topic coherence measures</article-title>
          .
          <source>In Proceedings of the eighth ACM international conference on Web search and data mining</source>
          , pages
          <fpage>399</fpage>
          -
          <lpage>408</lpage>
          . ACM,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>