<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Topic Models with Sparse and Group-Sparsity Inducing Priors</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>TU Dortmund University</institution>
          ,
          <addr-line>Otto Hahn Str. 12, 44227 Dortmund</addr-line>
        </aff>
      </contrib-group>
      <fpage>33</fpage>
      <lpage>40</lpage>
      <abstract>
<p>The quality of topic models depends strongly on the quality of the documents used. Insufficient information may result in topics that are difficult to interpret or evaluate. Including external data can help to increase the quality of topic models. We propose sparsity and group-sparsity inducing priors on the metaparameters of the word-topic probabilities in fully Bayesian Latent Dirichlet Allocation (LDA). This enables a controlled integration of information about words.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Topic models have been used very successfully for text analysis over the last decade. Topic
models assign a number of latent topics to the documents and words of a given corpus.
These topics can be interpreted as different meanings of words or as semantic clusters of
the documents in a text corpus. In text analysis, the topics can be used in many ways.</p>
      <p>The estimation of the topics depends strongly on the amount of text data used.
When only very limited amounts of text are available to estimate a topic
model, the quality of the found topics can be quite poor. In such situations, external
information about the words can be very beneficial. For instance, prior word probabilities
can guide the sampling of word-topic distributions from a Dirichlet distribution by adding prior
weight to more likely words. In this sense, we try to align the topics with an
external probability model, such as a language model $p(w)$, over some of the words. Structural
external information, such as similarities between words, can provide further help to align the
topics. Hence, prior weights of whole groups of similar words can be used to estimate the
topics.</p>
      <p>
        To measure the quality of the found topics, intrinsic measures such as perplexity
have been used in the past. Recently, coherence measures have been introduced as an
evaluation measure for topics that agrees well with human judgements, see [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. These
coherence measures use external information to evaluate how closely related the most
likely words in the topics are. To extract coherent topics with a topic model, we must
assume that enough coherent documents are available. This is not always the case. In Word Sense
Induction, for instance, there may be rare words that appear in only a few documents. In
such a case, these documents might not be enough to generate coherent topics. Further,
very sparse documents, as in collections of blog posts or tweets, might also lack enough
information to extract coherent topics.
      </p>
      <p>
        To increase the coherence, we propose to integrate external information such as word
probabilities or word similarities from external data sources. To control the influence
of the external information, we additionally weight this information. We integrate
external word probability information by appropriate prior distributions. We add a
sparsity prior and a group sparsity prior on the log-likelihood of the topic model, see [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ].
The sparsity-inducing priors can now actively control the amount and the weight of the
external information to be integrated into the estimation of a topic model. From the group
sparsity we expect more coherence, since whole groups of words are considered. These
groups are expected to be more coherent since they are similar based on some external
information.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        There are many previous approaches that integrate external information into the
generation of a topic model. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] use a regression model on the hyperparameters of the Dirichlet
prior for LDA. They use Dirichlet multinomial regression to make the prior probability
of the document topic distribution dependent on document features. [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] integrate word
features into LDA by adding a Logistic prior on the parameter of the Dirichlet prior
of the word topic distribution. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] integrate correlation information about words into a
topic model. They propose regularized topic models that have structural priors instead
of Dirichlet priors. These structural priors contain word co-occurrence statistics for
instance. [
        <xref ref-type="bibr" rid="ref7">7</xref>
          ] propose a Pólya urn model to integrate co-occurrence statistics into a topic
model. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] use First Order Logic incorporated into LDA to leverage domain knowledge.
[
        <xref ref-type="bibr" rid="ref3">3</xref>
          ] incorporate information about words that should or should not appear together in a topic
into the topic model. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] integrate lexical semantic relations like synonyms or antonyms
derived from external dictionaries into a topic model.
      </p>
      <p>
        In recent years, many approaches have been proposed to evaluate topic models.
[
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] propose to estimate the probability of some held-out documents of the collection
used for topic modelling. The authors propose several sampling techniques to efficiently
approximate this probability. [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] propose to evaluate topic models based on external
information. They use pointwise mutual information (PMI) based on co-occurrence
statistics from external text sources to evaluate topics. [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] evaluate topic models by
coherence of the topics. The authors showed that the coherence measure agrees with
human evaluations of the topics. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] evaluate topics based on distributional semantics.
They find semantic spaces such that words that are semantically related based on
statistics on Wikipedia are close in these spaces. [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] developed a framework for measuring
coherence in topics. The authors performed large empirical experiments on standard
data sets and possible coherence measures to evaluate the framework.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Topic Models with Prior Information about Words</title>
      <p>
        We integrate external information into LDA via priors on the word-topic distributions.
Similar to the approach by [
        <xref ref-type="bibr" rid="ref8">8</xref>
          ], we define an asymmetric Dirichlet prior with
metaparameter $\beta$ on the word-topic distribution. $\beta$ specifies the prior belief about the
distribution of the words before we have seen any data. We make $\beta$ dependent on the word
distribution $p(w)$ from the external information. We set $\beta_{w,t} = \exp(\lambda_{w,t}) \cdot p(w)$ for
a weight parameter $\lambda_{w,t}$ that controls the individual influence of the prior information in each
topic. If $\lambda_{w,t}$ is zero, the prior belief about the probability of $w$ is used directly. If $\lambda_{w,t}$ is
less than zero, the prior belief is weighted down. If $\lambda_{w,t}$ is greater than zero, the prior
belief is weighted up.
      </p>
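      <p>As a minimal illustration (not the implementation used in this paper), the following sketch shows how the metaparameter $\beta$ could be assembled from an external word distribution $p(w)$ and the weights $\lambda$; the array names and sizes are our own assumptions.</p>
      <preformat>
import numpy as np

# Hypothetical sizes: V words in the vocabulary, T topics.
V, T = 5000, 20

# p_w: external word distribution p(w), e.g. from a language model; random here for illustration.
p_w = np.random.dirichlet(np.ones(V))

# lam[w, t]: weight of the external prior information for word w in topic t.
# lam = 0 uses p(w) directly, negative lam weights it down, positive lam weights it up.
lam = np.zeros((V, T))

# Asymmetric Dirichlet metaparameter beta[w, t] = exp(lam[w, t]) * p(w).
beta = np.exp(lam) * p_w[:, None]

# beta_tilde[t] = sum_w beta[w, t], needed for the collapsed likelihood below.
beta_tilde = beta.sum(axis=0)
      </preformat>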
      <p>The optimal parameters $\lambda$ must be found by optimizing the likelihood of the topic
model. We perform alternating optimization of the parameters $\lambda$ with quasi-Newton
methods and Gibbs sampling of the topics to find the optimal topic model.</p>
      <p>For the optimization of the parameters we minimize the part of the negative log-likelihood
of standard LDA that depends on $\lambda$:</p>
      <p>$L = \sum_t \big[ \log\Gamma(\tilde{\beta}_t + n_t) - \log\Gamma(\tilde{\beta}_t) \big] + \sum_t \sum_{w:\, n_{w,t} > 0} \big[ \log\Gamma(\beta_{w,t}) - \log\Gamma(\beta_{w,t} + n_{w,t}) \big]$,
with $\tilde{\beta}_t = \sum_w \beta_{w,t}$ and $n_t = \sum_w n_{w,t}$.</p>
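      <p>For illustration only, the following sketch evaluates this part of the negative log-likelihood with log-gamma functions; the function and argument names are our own, and the counts $n_{w,t}$ are assumed to come from the Gibbs sampler.</p>
      <preformat>
import numpy as np
from scipy.special import gammaln

def neg_log_lik(lam, p_w, n_wt):
    """Part of the collapsed LDA negative log-likelihood that depends on lam.

    lam  : (V, T) weights on the external prior information
    p_w  : (V,)   external word distribution p(w)
    n_wt : (V, T) counts of word w assigned to topic t (from Gibbs sampling)
    """
    beta = np.exp(lam) * p_w[:, None]      # beta[w, t] = exp(lam[w, t]) * p(w)
    beta_tilde = beta.sum(axis=0)          # beta_tilde[t] = sum_w beta[w, t]
    n_t = n_wt.sum(axis=0)                 # number of tokens assigned to topic t
    mask = n_wt > 0                        # only words observed in topic t contribute
    L = np.sum(gammaln(beta_tilde + n_t) - gammaln(beta_tilde))
    L += np.sum(gammaln(beta[mask]) - gammaln(beta[mask] + n_wt[mask]))
    return L
      </preformat>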
      <sec id="sec-3-1">
        <title>Sparsity Priors for LDA</title>
        <p>We propose to use a sparsity-inducing prior on the weight parameters $\lambda_{w,t}$ that control
the influence of the prior information about word $w$ in topic $t$. We expect that some parts of the prior
information play a bigger role than other parts in the estimated topic model. To find out
which parts are important, we impose sparsity to identify them.</p>
        <p>We add a Laplace prior on the parameters $\lambda$ to obtain sparsity. This means we aim
at reducing the amount of adaptation of the external information. This has three
advantages. First, we can easily read from the parameters which parts of the prior
information influence the topics. Second, we get a simpler model that adapts the external prior
information only for some words. Third, we gain control over the amount of external
information to be integrated into the topic model.</p>
        <p>The difference to standard LDA is that we now have an asymmetric prior that is
derived from the external information (the word probabilities) and that the weight of this
information has a Laplace prior. Adding the Laplace prior on the parameters $\lambda$ of the DMR
and optimizing the negative log-likelihood is the same as putting a sparsity-inducing
penalty on them. The negative log-likelihood is simply extended by $\sigma_1 \lVert \lambda \rVert_1$:
$L_1 = L + \sigma_1 \lVert \lambda \rVert_1$.</p>
        <p>
          Hence, the Laplace prior is integrated into the optimization via a sparse lasso penalty
$\lVert \lambda \rVert_1$. We solve the optimization problem via orthant-wise quasi-Newton optimization
[
          <xref ref-type="bibr" rid="ref2">2</xref>
          ].
The previous idea of limiting the adaptation of the external prior information for some
words does not consider that the information about similar words should also be treated
similarly. For instance, if the prior information about the word “book” is not adapted,
we should also not adapt the information about “author” or “books”.
        </p>
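        <p>Orthant-wise quasi-Newton optimization is not part of common scientific Python libraries; as one hedged alternative sketch (our reformulation, not the solver used in this paper), the $\ell_1$ penalty can be handled by splitting $\lambda$ into non-negative parts and running bound-constrained L-BFGS-B:</p>
        <preformat>
import numpy as np
from scipy.optimize import minimize

def l1_minimize(objective, grad, x0, sigma1=0.1):
    """Minimize objective(x) + sigma1 * ||x||_1 by writing x = p - q with p, q >= 0."""
    n = x0.size

    def fun(z):
        p, q = z[:n], z[n:]
        return objective(p - q) + sigma1 * np.sum(p + q)

    def jac(z):
        p, q = z[:n], z[n:]
        g = grad(p - q)
        return np.concatenate([g + sigma1, -g + sigma1])

    z0 = np.concatenate([np.maximum(x0, 0), np.maximum(-x0, 0)])
    bounds = [(0, None)] * (2 * n)
    res = minimize(fun, z0, jac=jac, method="L-BFGS-B", bounds=bounds)
    return res.x[:n] - res.x[n:]

# Toy usage: a sparse minimizer of a small quadratic objective.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 0.1])
obj = lambda x: 0.5 * x @ A @ x - b @ x
grd = lambda x: A @ x - b
print(l1_minimize(obj, grd, np.zeros(2)))
        </preformat>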
        <p>We propose to add a group lasso penalty to the negative log-likelihood to gain group
sparsity:</p>
        <p>$L_2 = L + \sigma_1 \lVert \lambda \rVert_1 + \gamma_1 \sum_g \lVert \lambda_g \rVert_2$,
with the group lasso penalty $\gamma_1 \sum_g \lVert \lambda_g \rVert_2$ over the groups $g$, where the weight $\gamma_1$ is
determined by the variance of the corresponding prior. Conceptually, this is the same as having
a prior on the parameters that induces group sparsity.</p>
        <p>
          Similarly to above, we solve the group lasso via blockwise coordinate descent with
proximal operators for the group penalty, see [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] for more details.
        </p>
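        <p>The proximal operator of the group penalty is block soft-thresholding; the following is a simplified proximal-gradient sketch of one such update (our own illustration of the standard operator, not the exact blockwise coordinate descent used here).</p>
        <preformat>
import numpy as np

def prox_group(lam_g, step, gamma1):
    """Proximal operator of step * gamma1 * ||lam_g||_2 (block soft-thresholding)."""
    norm = np.linalg.norm(lam_g)
    if step * gamma1 >= norm:
        return np.zeros_like(lam_g)        # the whole group is switched off
    return (1.0 - step * gamma1 / norm) * lam_g

def proximal_gradient_step(lam, groups, grad, step, gamma1):
    """One step: gradient step on the smooth part, then shrinkage of every group."""
    lam = lam - step * grad(lam)
    for g in groups:                        # groups: list of index arrays
        lam[g] = prox_group(lam[g], step, gamma1)
    return lam
        </preformat>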
      </sec>
      <sec id="sec-3-2">
        <title>Finding Groups</title>
        <p>
          To find the groups for the grouped sparsity priors on the weight parameters we use
external information about similarities of words. From such similarities we can easily
generate clusters that are used as groups. We divide the weight parameter $\lambda = (\lambda_1, \cdots, \lambda_G)$
into $G$ partial weights $\lambda_g = (\lambda_{w_1,g}, \cdots, \lambda_{w_k,g})$. The partial weights build a group $g$ if
the words $w_1, \cdots, w_k$ form a cluster based on the similarities from the external
information. The similarities we use are based on WordNet (see [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]). We generate a so-called
affinity matrix $M$ with $(M)_{ij} = \exp(-(1 - \mathrm{sim}(w_i, w_j)))$, where $\mathrm{sim}$ is the similarity
derived from WordNet. Next, we perform spectral clustering [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] to find the groups.
Spectral clustering performs a simple k-means clustering on the words projected onto a
low-dimensional space spanned by the eigenvectors of the affinity matrix.
        </p>
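        <p>For illustration, the groups could be derived as follows with a precomputed affinity matrix and spectral clustering; the similarity function is only a stand-in for a WordNet-based measure, and the library call is one possible choice, not necessarily the one used here.</p>
        <preformat>
import numpy as np
from sklearn.cluster import SpectralClustering

words = ["book", "author", "novel", "cloud", "rain", "storm"]
GROUP_A = {"book", "author", "novel"}

def sim(w1, w2):
    # Stand-in for a WordNet-based similarity in [0, 1] (e.g. a path similarity).
    if w1 == w2:
        return 1.0
    return 0.8 if (w1 in GROUP_A) == (w2 in GROUP_A) else 0.1

# Affinity matrix M_ij = exp(-(1 - sim(w_i, w_j))).
M = np.array([[np.exp(-(1.0 - sim(wi, wj))) for wj in words] for wi in words])

# Spectral clustering: k-means on the words projected onto the leading eigenvectors of M.
labels = SpectralClustering(n_clusters=2, affinity="precomputed", random_state=0).fit_predict(M)

# groups[g] lists the vocabulary indices whose weights form the group lam_g.
groups = [np.where(labels == g)[0] for g in range(2)]
print(groups)
        </preformat>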
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Experiments</title>
      <p>
        In this section, we investigate the topics extracted by our proposed methods (SparsePrior:
LDA with the sparsity prior; GroupPrior: LDA with the group sparsity prior) and
compare them with two standard state-of-the-art implementations of topic models that
integrate external information about words: (RegLDA) by [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] and (WordFeatures) by
[
        <xref ref-type="bibr" rid="ref14">14</xref>
          ]. Additionally, we also compare to standard LDA with Gibbs sampling,
without external information. For each method, we use $T = 20$ topics, 1000 iterations, and
set $\alpha = 50/T$, $\beta = 0.1$ (for standard LDA and topic models with a structural prior),
$\sigma_1 = 0.1$, $\gamma_1 = 0.1$.
      </p>
      <sec id="sec-4-1">
        <title>Data sets</title>
        <p>We use two standard text data sets used in previous approaches to topic modelling. First,
we use the 20 Newsgroups<sup>1</sup> data set. The data set contains about 20,000 text documents
from 20 different newsgroups. Overall we have 1000 documents per newsgroup. We
additionally remove stop words and prune very infrequent and very frequent words.
Second, we use the Senseval-3<sup>2</sup> data set of English lexical samples.
1 http://qwone.com/~jason/20Newsgroups/
2 http://www.senseval.org/senseval3</p>
        <p>The Senseval-3 data set contains
texts from Penn Treebank II Wall Street Journal articles. The sizes of the data sets range
from 20 to 200 documents per word. Further, we use the Wikipedia talk pages to apply
the method to a more recent data source of internet-based communication. As an example,
we extract 10,000 postings of discussions on Wikipedia from 2002 to 2014 that contain
the term “cloud”.</p>
        <p>[Table 1: coherence measures (NPMI, UCI, UMass) and negative log-likelihood (nLL) for each method on the 20 newsgroups and Wikipedia data sets.]</p>
      </sec>
      <sec id="sec-4-2">
        <title>Coherence Results</title>
        <p>
          In the first experiments, we compare to the state-of-the-art LDA implementations with
external information about words and standard LDA in terms of quality. We want to
show that our model produces more coherent topics. To evaluate the coherence of
the found topics, we use Pointwise Mutual Information (UCI), normalized Pointwise
Mutual Information (NPMI) and arithmetic mean of conditional probability (UMass),
see [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]. Further, for the two larger data sets, 20 newsgroups and the postings from
Wikipedia, we also estimate the negative log-likelihood (nLL) on a held-out data set.
Finally, on the SensEval data set, we also estimate the mutual information (MI) of the
found topics with the true sense.
        </p>
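        <p>As an illustration of the NPMI coherence measure (our own sketch of the standard definition, not the evaluation code used for the experiments), computed from document co-occurrence counts of the top words of a topic in a reference corpus:</p>
        <preformat>
import numpy as np
from itertools import combinations

def npmi_coherence(top_words, documents, eps=1e-12):
    """Average NPMI over all pairs of top words, with probabilities
    estimated from document co-occurrence in an external reference corpus."""
    docs = [set(d) for d in documents]
    n = len(docs)

    def p(*ws):
        return sum(all(w in d for w in ws) for d in docs) / n

    scores = []
    for wi, wj in combinations(top_words, 2):
        p_i, p_j, p_ij = p(wi), p(wj), p(wi, wj)
        pmi = np.log((p_ij + eps) / (p_i * p_j + eps))
        scores.append(pmi / -np.log(p_ij + eps))   # NPMI = PMI / -log p(wi, wj)
    return float(np.mean(scores))

# Toy usage with a tiny reference corpus.
corpus = [["cloud", "rain", "storm"], ["cloud", "storage", "server"], ["rain", "storm"]]
print(npmi_coherence(["cloud", "rain", "storm"], corpus))
        </preformat>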
        <p>The results on the 20 newsgroups data set on the left in Table 1 show that our
proposed group sparsity prior results in topics with better coherence measures than the
state-of-the-art methods and standard LDA. Of the state-of-the-art competitors,
only WordFeatures performs comparably well. In terms of log-likelihood,
WordFeatures performs best. For the Wikipedia talk pages we get similar results, as shown in the
middle of Table 1.</p>
        <p>Finally, we compare the different topic model methods on a collection of very small
data sets. Table 2 shows the resulting coherence values on the SensEval data set. LDA
with our proposed grouped sparsity prior performs better on all data samples compared
to the competitors.</p>
        <p>We are especially interested in how the different methods perform on very small
data sets. To investigate this, we evaluate the NPMI for the different methods on
different sample sizes and different document lengths of the samples. For the 20 news
groups data, we sample 100, 100, 5000 and 10000 documents to extract topics. From
the Wikipedia talk pages we extract postings of different context sizes from 100 to 1000
characters. In Figure 1, we see that our proposed sparsity and group sparsity priors
result in the highest NPMI for small samples and small context sizes. In these situations
our proposed method of using the group sparsity pays off the most.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>In this paper we propose to integrate external information about words into topic
models to increase topic coherence. We use different priors on the metaparameters of LDA.
To control the amount of integration of the external information, we weight it
individually. Adding sparsity-inducing priors on these weights enables active control
over how much we adapt the external information. By this we trade off topic
coherence against the likelihood of the topics. Our proposed group sparsity prior further enables the
integration of external similarity information about words. Now, we can influence the
external information for whole groups of words that are similar. The results on large data
collections showed the benefit of our proposed method in terms of topic coherence.
Finally, we showed that on very small data sets, the group sparsity inducing prior results
in better performance.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>Nikolaos</given-names>
            <surname>Aletras</surname>
          </string-name>
          and
          <string-name>
            <given-names>Mark</given-names>
            <surname>Stevenson</surname>
          </string-name>
          .
          <article-title>Evaluating topic coherence using distributional semantics</article-title>
          .
          <source>In Proceedings of the 10th International Conference on Computational Semantics (IWCS</source>
          <year>2013</year>
          )
          <article-title>- Long Papers</article-title>
          , pages
          <fpage>13</fpage>
          -
          <lpage>22</lpage>
          , Potsdam, Germany,
          <year>March 2013</year>
          .
          <article-title>Association for Computational Linguistics</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>Galen</given-names>
            <surname>Andrew</surname>
          </string-name>
          and
          <string-name>
            <given-names>Jianfeng</given-names>
            <surname>Gao</surname>
          </string-name>
          .
          <article-title>Scalable training of l1-regularized log-linear models</article-title>
          .
          <source>In Proceedings of the 24th International Conference on Machine Learning, ICML '07</source>
          , pages
          <fpage>33</fpage>
          -
          <lpage>40</lpage>
          , New York, NY, USA,
          <year>2007</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>David</given-names>
            <surname>Andrzejewski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Xiaojin</given-names>
            <surname>Zhu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Mark</given-names>
            <surname>Craven</surname>
          </string-name>
          .
          <article-title>Incorporating domain knowledge into topic modeling via dirichlet forest priors</article-title>
          .
          <source>In Proceedings of the 26th Annual International Conference on Machine Learning, ICML '09</source>
          , pages
          <fpage>25</fpage>
          -
          <lpage>32</lpage>
          , New York, NY, USA,
          <year>2009</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>David</given-names>
            <surname>Andrzejewski</surname>
          </string-name>
          , Xiaojin Zhu,
          <string-name>
            <given-names>Mark</given-names>
            <surname>Craven</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Benjamin</given-names>
            <surname>Recht</surname>
          </string-name>
          .
          <article-title>A framework for incorporating general domain knowledge into latent dirichlet allocation using first-order logic</article-title>
          .
          <source>In IJCAI 2011, Proceedings of the 22nd International Joint Conference on Artificial Intelligence</source>
          , Barcelona, Catalonia, Spain,
          <source>July 16-22</source>
          ,
          <year>2011</year>
          , pages
          <fpage>1171</fpage>
          -
          <lpage>1177</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>Francis</given-names>
            <surname>Bach</surname>
          </string-name>
          , Rodolphe Jenatton, and
          <string-name>
            <given-names>Julien</given-names>
            <surname>Mairal</surname>
          </string-name>
          .
          <article-title>Optimization with Sparsity-Inducing Penalties (Foundations and Trends(R) in Machine Learning)</article-title>
          . Now Publishers Inc., Hanover, MA, USA,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>Zhiyuan</given-names>
            <surname>Chen</surname>
          </string-name>
          , Arjun Mukherjee, Bing Liu, Meichun Hsu, Malu Castellanos, and
          <string-name>
            <given-names>Riddhiman</given-names>
            <surname>Ghosh</surname>
          </string-name>
          .
          <article-title>Discovering coherent topics using general knowledge</article-title>
          .
          <source>In Proceedings of the 22Nd ACM International Conference on Information &amp; Knowledge Management, CIKM</source>
          , pages
          <fpage>209</fpage>
          -
          <lpage>218</lpage>
          , New York, NY, USA,
          <year>2013</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>David</given-names>
            <surname>Mimno</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Hanna M.</given-names>
            <surname>Wallach</surname>
          </string-name>
          , Edmund Talley, Miriam Leenders, and
          <string-name>
            <surname>Andrew McCallum</surname>
          </string-name>
          .
          <article-title>Optimizing semantic coherence in topic models</article-title>
          .
          <source>In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP '11</source>
          , pages
          <fpage>262</fpage>
          -
          <lpage>272</lpage>
          , Stroudsburg, PA, USA,
          <year>2011</year>
          .
          <article-title>Association for Computational Linguistics</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>David</surname>
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Mimno</surname>
            and
            <given-names>Andrew</given-names>
          </string-name>
          <string-name>
            <surname>McCallum</surname>
          </string-name>
          .
          <article-title>Topic models conditioned on arbitrary features with dirichlet-multinomial regression</article-title>
          .
          <source>CoRR, abs/1206.3278</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>David</given-names>
            <surname>Newman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Edwin V.</given-names>
            <surname>Bonilla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>and Wray L.</given-names>
            <surname>Buntine</surname>
          </string-name>
          .
          <article-title>Improving topic coherence with regularized topic models</article-title>
          . In John Shawe-Taylor, Richard S. Zemel,
          <string-name>
            <surname>Peter L. Bartlett</surname>
          </string-name>
          ,
          <string-name>
            <surname>Fernando</surname>
            <given-names>C. N.</given-names>
          </string-name>
          <string-name>
            <surname>Pereira</surname>
          </string-name>
          , and Kilian Q. Weinberger, editors,
          <source>NIPS</source>
          , pages
          <fpage>496</fpage>
          -
          <lpage>504</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10. David Newman,
          <string-name>
            <given-names>Sarvnaz</given-names>
            <surname>Karimi</surname>
          </string-name>
          , and Lawrence Cavedon.
          <article-title>External evaluation of topic models</article-title>
          .
          <source>In Australasian Document Computing Symposium</source>
          , pages
          <fpage>11</fpage>
          -
          <lpage>18</lpage>
          , Sydney,
          <year>December 2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11. David Newman, Jey Han Lau,
          <string-name>
            <given-names>Karl</given-names>
            <surname>Grieser</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Timothy</given-names>
            <surname>Baldwin</surname>
          </string-name>
          .
          <article-title>Automatic evaluation of topic coherence</article-title>
          .
          <source>In Human Language Technologies</source>
          :
          <article-title>The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics</article-title>
          ,
          <source>HLT '10</source>
          , pages
          <fpage>100</fpage>
          -
          <lpage>108</lpage>
          , Stroudsburg, PA, USA,
          <year>2010</year>
          .
          <article-title>Association for Computational Linguistics</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Andrew</surname>
            <given-names>Y.</given-names>
          </string-name>
          <string-name>
            <surname>Ng</surname>
            ,
            <given-names>Michael I.</given-names>
          </string-name>
          <string-name>
            <surname>Jordan</surname>
            ,
            <given-names>and Yair</given-names>
          </string-name>
          <string-name>
            <surname>Weiss</surname>
          </string-name>
          .
          <article-title>On spectral clustering: Analysis and an algorithm</article-title>
          .
          <source>In ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS</source>
          , pages
          <fpage>849</fpage>
          -
          <lpage>856</lpage>
          . MIT Press,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Ted</surname>
            <given-names>Pedersen</given-names>
          </string-name>
          , Siddharth Patwardhan, and
          <string-name>
            <given-names>Jason</given-names>
            <surname>Michelizzi</surname>
          </string-name>
          .
          <article-title>WordNet::Similarity: Measuring the relatedness of concepts</article-title>
          .
          <source>In Demonstration Papers at HLT-NAACL</source>
          <year>2004</year>
          , HLT-NAACL-Demonstrations '
          <volume>04</volume>
          , pages
          <fpage>38</fpage>
          -
          <lpage>41</lpage>
          , Stroudsburg, PA, USA,
          <year>2004</year>
          .
          <article-title>Association for Computational Linguistics</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>James</surname>
            <given-names>Petterson</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Alexander J.</given-names>
            <surname>Smola</surname>
          </string-name>
          , Tiberio S. Caetano, Wray L.
          <string-name>
            <surname>Buntine</surname>
          </string-name>
          , and
          <string-name>
            <surname>Shravan</surname>
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Narayanamurthy</surname>
          </string-name>
          .
          <article-title>Word features for latent dirichlet allocation</article-title>
          . In John D. Lafferty,
          <string-name>
            <surname>Christopher</surname>
            <given-names>K. I. Williams</given-names>
          </string-name>
          , John Shawe-Taylor, Richard S. Zemel, and Aron Culotta, editors,
          <source>NIPS</source>
          , pages
          <fpage>1921</fpage>
          -
          <lpage>1929</lpage>
          . Curran Associates, Inc.,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15. Michael Röder, Andreas Both, and
          <string-name>
            <given-names>Alexander</given-names>
            <surname>Hinneburg</surname>
          </string-name>
          .
          <article-title>Exploring the space of topic coherence measures</article-title>
          .
          <source>In Proceedings of the Eighth ACM International Conference on Web Search and Data Mining, WSDM '15</source>
          , pages
          <fpage>399</fpage>
          -
          <lpage>408</lpage>
          , New York, NY, USA,
          <year>2015</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Noah</surname>
            <given-names>Simon</given-names>
          </string-name>
          , Jerome Friedman, Trevor Hastie, and
          <string-name>
            <given-names>Robert</given-names>
            <surname>Tibshirani</surname>
          </string-name>
          .
          <article-title>A sparse-group lasso</article-title>
          .
          <source>Journal of Computational and Graphical Statistics</source>
          ,
          <volume>22</volume>
          (
          <issue>2</issue>
          ):
          <fpage>231</fpage>
          -
          <lpage>245</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Hanna M. Wallach</surname>
            , Iain Murray, Ruslan Salakhutdinov, and
            <given-names>David</given-names>
          </string-name>
          <string-name>
            <surname>Mimno</surname>
          </string-name>
          .
          <article-title>Evaluation methods for topic models</article-title>
          .
          <source>In Proceedings of the 26th Annual International Conference on Machine Learning, ICML '09</source>
          , pages
          <fpage>1105</fpage>
          -
          <lpage>1112</lpage>
          , New York, NY, USA,
          <year>2009</year>
          . ACM.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>