Topic Modelling Games

Rocco Tripodi
Sapienza NLP Group
Department of Computer Science, Sapienza University of Rome
tripodi@di.uniroma1.it

Abstract

English. This paper presents a new topic modelling framework inspired by game-theoretic principles. It is formulated as a normal-form game in which words are represented as players and topics as strategies that the players select. The strategies of each player are modelled with a probability distribution guided by a utility function that the players try to maximize. This function induces players to select strategies similar to those selected by similar players and to choose strategies not shared with those selected by dissimilar players. The proposed framework is compared with state-of-the-art models, demonstrating good performances on standard benchmarks.

Italiano. [Translated from the Italian] This article presents a topic modelling approach inspired by game theory. Topic modelling is viewed as a normal-form game in which the words represent the players and the topics the strategies that the players can choose. Each player chooses the strategies to employ through a probability distribution that is influenced by a utility function the players try to maximize. This function encourages players to choose strategies similar to those employed by similar players and discourages the choice of strategies shared with dissimilar players. The comparison with state-of-the-art models demonstrates good performances on several evaluation datasets.

1 Introduction

Topic modeling is a technique that discovers the underlying topics contained in a collection of documents (Blei, 2012; Griffiths and Steyvers, 2004). It can be used in different tasks of text classification, document retrieval, and sentiment analysis, providing at the same time vector representations of words and documents. State-of-the-art systems are based on probabilistic (Blei et al., 2003; Mcauliffe and Blei, 2008; Chong et al., 2009) and neural network models (Bengio et al., 2003; Hinton and Salakhutdinov, 2009; Larochelle and Lauly, 2012; Cao et al., 2015). A different perspective, based on game theory, is proposed in this article.

The use of game-theoretic principles in machine learning (Goodfellow et al., 2014), pattern recognition (Pavan and Pelillo, 2007) and natural language processing (Tripodi et al., 2016; Tripodi and Navigli, 2019) is a promising field of research that is producing original models. The main difference between computational models based on optimization techniques and game-theoretic models is that the former try to maximize (or minimize) a function (which in many cases is non-convex), while the latter try to find the equilibrium state of a dynamical system. The equilibrium concept is useful because it represents a state in which all the constraints of a given system are satisfied and no object of the system has an incentive to deviate from it, since a different configuration would immediately lead to a worse situation in terms of payoff and fitness, at both the object and the system level. Furthermore, it is guaranteed that the system converges to a mixed-strategy Nash equilibrium (Nash, 1951). So far, game-theoretic models have been used in classification and clustering tasks (Pavan and Pelillo, 2007; Tripodi and Pelillo, 2017). In this work, a game-theoretic model is proposed for inferring a low-dimensional representation of words that can capture their latent semantics.

In this work, topic modeling is interpreted as a symmetric non-cooperative game (Weibull, 1997) in which the words are the players and the topics are the strategies that the players can select. Two players are matched to play the games together according to the co-occurrence patterns found in the corpus under study. The players use a probability distribution over their strategies to play the games, and obtain a payoff for each strategy. This reward helps them to adjust their strategy selection in future games, considering which strategies have been effective in previous games; it allows concentrating more mass on the strategies that get high rewards. The underlying idea behind the payoff function is to create two influence dynamics: the first forces similar players (words that appear in similar contexts) to select similar strategies; the second forces dissimilar players (words that do not share any context) to select different strategies. The games are played repeatedly until the system converges, that is, until the difference between the strategy distributions of the players at time t and at time t − 1 is under a small threshold. The convergence of the system corresponds to an equilibrium, a situation in which there is an optimal association of words and topics.

2 Related Work

Hofmann (1999) proposed one of the earliest topic models, probabilistic Latent Semantic Indexing (pLSI). It represents each word in a document as a sample from a mixture model, where topics are represented as multinomial random variables and documents as mixtures of topics. Latent Dirichlet Allocation (LDA) (Blei et al., 2003), the most widely used topic model, is a generalization of pLSI that introduces Dirichlet priors both for the word multinomial distributions over topics and for the topic multinomial distributions over documents. This line of research has been developed by building different features on top of LDA, to infer correlations among topics (Lafferty and Blei, 2006) or to model words and labels jointly in a supervised way (Mcauliffe and Blei, 2008).

Topic models based on neural network principles were introduced with the neural network language model proposed by Bengio et al. (2003). This paradigm is very popular in NLP, and many topic models are based on it because these techniques make it possible to obtain a low-dimensional representation of the data. In particular, auto-encoders (Ranzato and Szummer, 2008), Boltzmann machines (Hinton and Salakhutdinov, 2009) and autoregressive distributions (Larochelle and Lauly, 2012) have been used to model documents with layer-wise neural network tools. The Neural Topic Model (NTM; Cao et al., 2015) tries to overcome some limitations of classical topic models, such as the initialization problem and the generalization to n-grams. It exploits word embeddings to represent n-grams and uses backpropagation to adjust the weights of the network between the embedding layer and the word-topic and document-topic layers. A general framework for topic modeling, also based on neural networks, is the Sparse Contextual Hidden and Observed Language AutoencodeR (SCHOLAR; Card et al., 2018). It allows using covariates to influence the topic distributions and labels to include supervision. Like Sparse Additive GEnerative models (SAGE; Eisenstein et al., 2011), it can produce sparse topic representations, but differently from SAGE and the Structural Topic Model (STM; Roberts et al., 2014) it can easily consider a larger set of metadata. A graphical topic model was proposed by Gerlach et al. (2018). In this framework, the task of finding topical structures is interpreted as the task of finding communities in complex networks. It is particularly interesting because it shows analogies with traditional topic models and overcomes some of their limitations, such as the bound to a Bayesian prior and the need to specify the number of topics in advance.

3 Topic Modelling Games

Normal-form games consist of a finite set of players N = (1, ..., n), a finite set of pure strategies S_i = {1, ..., m_i} for each player i ∈ N, and a payoff (utility) function u_i : S → R that associates a payoff to each combination of strategies S = S_1 × S_2 × ... × S_n. The payoff function depends not only on the strategy chosen by a single player but on the combination of strategies played at the same time by the players. Each player tries to maximize the value of u_i. Furthermore, in non-cooperative games the players choose their strategies independently, considering what the other players can play and trying to find the best response to the strategies of the co-players. Nash equilibria (Nash, 1951) are the key concept of game theory and can be defined as those strategy combinations in which each strategy is a best response to the strategy of the co-player, and no player has an incentive to unilaterally deviate from them, because there is no way to do better. In addition to playing pure strategies, which correspond to selecting just one strategy from those available in S_i, a player i can also use mixed strategies, which are probability distributions over pure strategies. A mixed strategy over S_i is defined as a vector x_i = (x_1, ..., x_{m_i}) such that x_j ≥ 0 and Σ_j x_j = 1. In a two-player game, a strategy profile can be defined as a pair (x_i, x_j). The expected payoff for this strategy profile is computed as:

    u(x_i, x_j) = x_i^T · A_{ij} x_j

where A_{ij} is the m_i × m_j payoff matrix between players i and j.

Evolutionary game theory (Weibull, 1997) has introduced two important modifications: 1. the games are played repeatedly; and 2. the players update their mixed strategies over time until it is no longer possible to improve the payoff. With these two modifications, the players can develop an inductive learning process that allows them to learn their strategy distributions according to what the other players are selecting. The payoff corresponding to the h-th pure strategy is computed as:

    u(x_i^h) = x_i^h · Σ_{j=1}^{n_i} (A_{ij} x_j)_h        (1)

The average payoff of player i is calculated as:

    u(x_i) = Σ_{h=1}^{m_i} u(x_i^h)        (2)

To find the Nash equilibrium of the game, it is common to use the replicator dynamics equation (Weibull, 1997), which allows better-than-average strategies to grow at each iteration. It can be considered as an inductive learning process in which the players learn from past experiences how to play their best strategy. It is important to notice that each player optimizes its individual strategy space, but this operation is done according to what the other players are simultaneously doing, so the local optimization is the result of a global process.

Data Preparation. The players of the topic modelling games are the words v = (1, ..., n) in the vocabulary V of the corpus under analysis, and the strategies S = (1, ..., m) are the topics to extract from the same corpus. The strategy space x_i of each player i is represented as a probability distribution that can be interpreted as the mixture of topics typically used in topic modeling. The strategy space of the games can be represented as an n × m matrix X, where each row represents the probability distribution of a player over its m strategies (the topics that have to be extracted from the corpus). The interactions among the players are modeled using the n × n adjacency matrix W of an undirected weighted graph, in which each entry w_ij encodes the similarity between two words.

Payoff Function and System Dynamics. The payoff function of the game is constructed by exploiting the information stored in W. This matrix gives us the structural information of the corpus. It allows us to select the players with whom each player plays the games, indicated by the presence of an edge between two nodes (players), and to quantify the level of influence that each player has on the other, indicated by the weight on each edge. The absence of an edge in this graph indicates that two words are distributionally dissimilar. Using these three sources of information, we model a payoff function that forces similar players to choose similar strategies (topics) and dissimilar players to choose different ones. The payoff of a player is calculated as:

    u(x_i^h) = x_i^h ( Σ_{j=1}^{n_i} (A_{ij} x_j)_h − ε Σ_{g=1}^{neg_i} (x_g)_h )        (3)

where the first summation is over the n_i direct neighbors of player i, the players with whom i shares some similarity, and the second summation is over the neg_i negative players of player i, the players with whom i does not share any similarity. With the first summation, player i negotiates a correlated strategy (topic) with its neighbors; with the second, it deviates from the strategies chosen by the negative players. This is done by subtracting the payoff that i would have gained if these negative players had been its neighbors. The negative players are sampled from V according to frequency, in the same way negative samples are selected in word embedding models (Mikolov et al., 2013; Tripodi and Pira, 2017). The probability of selecting word w_i as negative is:

    P(w_i) = f(w_i)^{3/4} / Σ_{j=0}^{n} f(w_j)^{3/4}        (4)

where f(w_i) is the frequency of word w_i. Since the similarity with negative players is 0, we introduced the parameter ε to weight their influence, and set it to the minimum positive value in A (min(A > 0)). The number of negative players, neg_i, is set to n_i (the number of neighbours of player i).

Once the players have played all the games with their neighbors and negative players, the average payoff of each player can be calculated with Equation (2). The payoff is higher when two words are highly correlated and have similar mixed strategies. For this reason, the replicator dynamics equation (Weibull, 1997) is used to compute the dynamics of the system. It pushes the players to be influenced by the mixed strategies of the co-players; this influence is proportional to the similarity between two players (A_{ij}). Once the influence dynamics no longer affect the players, the Nash equilibrium of the system is reached.

4 Experimental Results

In this section, we evaluate TMG and compare it with state-of-the-art systems.

4.1 Data and Setting

The datasets used to evaluate TMG are 20 Newsgroups¹ (20NG) and NIPS². 20NG is a collection of about 20,000 documents organized into 20 different classes. NIPS is composed of about 1,700 NIPS conference papers published between 1987 and 1999, with no class information. Each text was tokenized and lowercased. The stop-words were removed, and the vocabulary was constructed considering the 1000 and 2000 most frequent words in 20NG and NIPS, respectively. This choice is in line with previous work (Card et al., 2018). To keep the model as simple as possible, tf-idf weighting was used to construct the feature vectors of the words, and cosine similarity was employed to create the adjacency matrix A. It is important to notice that other sources of information, derived from pre-trained word embeddings, syntactic structures or document metadata, could easily be included at this stage. A is then sparsified, taking only the r nearest neighbours of each node, with r = log(n); this operation reduces the computational cost of the algorithm and guarantees that the graph remains connected (Von Luxburg, 2007).

The strategy space of the players was initialized using a normal distribution, to reduce the parameters of the framework³. The last two parameters of the system concern the stopping criteria of the dynamics: 1. the maximum number of iterations (10^5); and 2. the minimum difference between two different iterations (10^{-3}), calculated as Σ_{i=1}^{n} |x_i(t − 1) − x_i(t)|.

³ Experimentally, it was also observed that using a Dirichlet distribution with different α parameters to initialize the strategy space did not affect the performance of the model much.

TMG has been compared with SCHOLAR⁴, LDA⁵ and NVDM⁶. We configured the NVDM network with two 500-dimensional encoder layers and ReLU non-linearities. SCHOLAR has been configured using a more complex setting that consists of a single-layer encoder and a 4-layer generator. LDA has been run with the following parameters: α = 50, iterations = 1000 and topic threshold = 0.

    Dataset   TMG    SCHOLAR   NVDM   LDA
    20NG      824    819       927    791
    NIPS      1311   1370      1564   1017

    Table 1: Comparison of the models in terms of perplexity.

4.2 Evaluation

In this section, we compare the generalization performances of TMG with those of the models presented in the previous section. For the evaluation we used perplexity (PPL), even if it has been shown not to correlate with the human interpretation of topics (Chang et al., 2009). We computed perplexity on unobserved documents C as:

    PPL(C) = exp( − (1/N) Σ_{n=1}^{N} log P(C_n) / D_n )        (5)

where N is the number of documents in the collection C and D_n is the number of words in document n. Low perplexity suggests less uncertainty about the documents. Held-out documents represent 15% of each dataset. Perplexity is computed for 10 topics on the NIPS dataset and 20 topics on the 20 Newsgroups dataset; these numbers correspond to the real number of classes of each dataset.

Table 1 shows the comparison of perplexity. As reported in previous work (Card et al., 2018), it is difficult to achieve a lower perplexity than LDA.
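As an illustration, the interplay of Equations (1)-(4) with the replicator dynamics can be sketched in a few lines of Python. The sketch makes assumptions the text leaves open: each partial payoff matrix A_ij is taken to be the identity scaled by the similarity w_ij (so that (A_ij x_j)_h = w_ij x_j^h, as in related graph-transduction games), ε is a fixed constant rather than min(A > 0), and payoffs are clipped to stay positive so that the multiplicative replicator update remains well defined.

```python
import numpy as np

rng = np.random.default_rng(0)

def negative_sampling_probs(freq):
    """Eq. (4): probability of drawing each word as a negative player."""
    p = freq ** 0.75
    return p / p.sum()

def tmg(W, freq, m, eps=1e-2, max_iter=10**5, tol=1e-3):
    """Illustrative sketch of the topic modelling games dynamics.

    W    : (n, n) similarity matrix; w_ij > 0 marks neighbours
    freq : (n,) word frequencies, used for negative sampling
    m    : number of topics (pure strategies of each player)
    """
    n = W.shape[0]
    X = np.abs(rng.normal(size=(n, m)))   # strategy space, one row per player
    X /= X.sum(axis=1, keepdims=True)
    p_neg = negative_sampling_probs(freq)
    n_nbr = (W > 0).sum(axis=1)           # n_i, also used as neg_i

    for _ in range(max_iter):
        X_old = X.copy()
        # Positive term of Eq. (3): sum_j (A_ij x_j)_h = sum_j w_ij x_j^h.
        pos = W @ X
        # Negative term: neg_i players drawn according to Eq. (4),
        # down-weighted by eps because their similarity is 0.
        neg = np.zeros_like(X)
        for i in range(n):
            g = rng.choice(n, size=int(n_nbr[i]), p=p_neg)
            neg[i] = eps * X[g].sum(axis=0)
        payoff = np.clip(pos - neg, 1e-12, None)
        # Discrete replicator dynamics: strategies with better-than-average
        # payoff gain probability mass.
        X = X * payoff
        X /= X.sum(axis=1, keepdims=True)
        if np.abs(X_old - X).sum() < tol:  # stopping criterion of Section 4.1
            break
    return X
```

At convergence, each row of X is the mixed strategy of a word, i.e. its distribution over the m topics.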
¹ http://qwone.com/~jason/20Newsgroups/
² http://www.cs.nyu.edu/~roweis/data.html
⁴ https://github.com/dallascard/scholar
⁵ http://mallet.cs.umass.edu
⁶ https://github.com/ysmiao/nvdm

The results in these experiments follow the same pattern: LDA has the lowest perplexity, TMG and SCHOLAR have similar results, and NVDM performs slightly worse on both datasets.

4.3 Topic Coherence and Interpretability

It has been shown that perplexity does not necessarily correlate well with topic coherence (Chang et al., 2009; Srivastava and Sutton, 2017). For this reason, we also evaluated the performance of our system on coherence (Chang et al., 2009; Das et al., 2015). Coherence is calculated by computing the relatedness between topic words using pointwise mutual information (PMI). We used Wikipedia (2018.05.01 dump) as the corpus to compute co-occurrence statistics, using a sliding window of 5 words on the left and on the right of each target word. For each topic, we selected the 10 words with the highest mass; we then calculated the PMI among all the word pairs and finally computed the coherence as the arithmetic mean of all these values. This metric has been shown to correlate well with human judgments (Lau et al., 2017). We used two different sources of information for the computation of the PMI: one is internal and corresponds to the dataset under analysis; the other is external and is represented by the English Wikipedia corpus.

[Figure 1: Internal PMI mean and std values; (a) 20NG, (b) NIPS.]

[Figure 2: External PMI mean and std values; (a) 20NG, (b) NIPS.]

[Figure 3: Sparsity mean and std values; (a) 20NG, (b) NIPS.]

Internal PMI. Figure 1 presents the PMI values of the different models computed on the two corpora. As can be seen from Figure 1a, TMG has a low PMI compared to all the other systems on the 20 Newsgroups dataset when there are few topics to extract (i.e., 2 and 5). The situation changes drastically when the number of topics increases: in fact, it has the highest performance on this dataset when extracting 10, 20, 50 and 100 topics. The performances of NVDM and SCHOLAR are similar and follow a decreasing pattern, with very high values at the beginning. On the contrary, the performance of LDA follows the opposite pattern: this model seems to work better when the number of topics to extract is high. On NIPS (Figure 1b) the performances of the systems are similar to those on 20 Newsgroups. The only exception is that TMG always has the highest PMI and seems to behave better also when the number of topics to extract is high. This is probably because the number of words in NIPS is higher, and for this reason it is reasonable to also have a higher number of topics. This is also confirmed by the qualitative analysis of the topics in Section 4.4, where it is shown that with low values of k it is possible to extract general topics, and that increasing its value it is possible to extract more specific ones.

In general, we can find three different patterns in these experiments: 1. NVDM and SCHOLAR work well when extracting a low number of topics; 2. LDA works well when it has to extract a large number of topics; 3. TMG works well when extracting a number of topics that is close to the real number of classes in the datasets. Another aspect to take into account is that, even if TMG has the highest performance, its results also have a high standard deviation. This is due to the stochastic nature of negative sampling.
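The PMI-based coherence described above can be sketched as follows. The tokenised corpus, the window handling and the normalisation are simplified toy versions of the procedure in the text; unseen pairs are given a PMI of 0 for brevity.

```python
import math
from collections import Counter
from itertools import combinations

def pmi_coherence(topic_words, corpus, window=5):
    """Mean pairwise PMI of the topic words, with co-occurrences
    counted in a sliding window over a tokenised corpus."""
    word_counts = Counter()
    pair_counts = Counter()
    n_tokens = 0
    for doc in corpus:
        for i, w in enumerate(doc):
            word_counts[w] += 1
            n_tokens += 1
            # Count each co-occurring pair once by looking only left.
            for c in doc[max(0, i - window):i]:
                pair_counts[frozenset((w, c))] += 1
    n_pairs = sum(pair_counts.values()) or 1

    def pmi(a, b):
        joint = pair_counts[frozenset((a, b))]
        if joint == 0:
            return 0.0  # simplification: unseen pairs contribute 0
        p_ab = joint / n_pairs
        p_a = word_counts[a] / n_tokens
        p_b = word_counts[b] / n_tokens
        return math.log(p_ab / (p_a * p_b))

    pairs = list(combinations(topic_words, 2))
    return sum(pmi(a, b) for a, b in pairs) / len(pairs)
```

Words that frequently co-occur within the window score high; words that never share a window pull the topic's mean PMI down, which is exactly what makes the measure a proxy for human-judged coherence.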
PMI     Topic words
29.71   turks soviet turkish armenian armenia passes roads armenians argic proceeded
15.27   schneider allan morality keith atheists moral political pasadena objective animals
12.70   drive ide scsi controller drives mb disk isa bus floppy
11.72   vms disclaimer vnews vax necessarily represents views expressed news poster
10.79   god jesus christians christ christianity bible christian faith church belief
10.18   intellect banks gordon surrender univ pittsburgh significant hospital level blood
8.94    bike ride riding dod bikes motorcycle bmw honda road advice
8.93    providing encryption clipper key escrow crypto keys chip secure wiretap
8.55    fbi compound batf fire waco children koresh gas branch started
7.52    gun firearms guns criminals crime weapons criminal violent weapon armed
7.45    team game play season hockey league nhl players cup stanley
7.14    space orbit shuttle launch earth mission flight nasa moon solar
6.92    male gay men sexual percentage study sex apparent showing women
6.21    tim israel israeli arab jews arabs policy war land north
6.13    amateur georgia intelligence ai programs michael radio adams ignore occur

Table 2: Best topics (one topic per row) extracted from 20 Newsgroups using TMG (setting k = 20), ordered by external PMI (left column).
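Tables like the one above can be read off the equilibrium word-topic matrix X: for each topic (a column of X) one takes the k words with the highest probability mass. A minimal sketch, where the matrix and vocabulary in the test are toy placeholders:

```python
import numpy as np

def top_words(X, vocab, k=10):
    """For each topic (column of X), return the k words with highest mass."""
    return [[vocab[i] for i in np.argsort(X[:, t])[::-1][:k]]
            for t in range(X.shape[1])]
```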
PMI      Topic words
304.85   ocular eye fovea dominance saccades saccadic fixation foveal eyes saccade
283.66   dendrites dendritic soma dendrite axonal axons nmda pyramidal somatic axon
276.39   oscillatory oscillations oscillators oscillator oscillation synchronization decoding locking synchronize synchronized
230.50   crowdsourcing crowds workers worker labelers crowd turk wisdom expertise dawid
218.51   kaiming shaoqing xiangyu jian yangqing karen sergey trevor sergio jitendra
196.86   retina photoreceptor retinal vertebrate schulten photoreceptors ganglion kohonen bipolar visualizing
176.75   auditory sound sounds cochlear ear hearing ears acoust tone cochlea
146.30   graph edges graphs optimisation edge vertices optimise optimising optimised vertex
146.25   disturbances plant controllers controller disturbance plants activate activated activating activates
145.84   lifted propositional predicate grounding predicates domingos clauses compilation formulas logical

Table 3: Topics (one per row) extracted from NIPS using TMG (setting k = 10), ordered by external PMI (left column).

Sparsity. We compared the sparsity of the word-topic matrices X in Figures 3a and 3b, computed as s = |X > 10^{-3}| / |X|. From both figures, we can see that TMG can produce highly sparse representations, especially when the number of topics to extract is low. This is a nice feature, since it provides more interpretable results. Only SCHOLAR produces sparser representations, when the number of topics to extract is high. Experimentally, we also noticed that in TMG we can control the sparsity of X by increasing the number of iterations of the game dynamics.

4.4 Qualitative Evaluation

Examples of topics extracted from 20NG and NIPS are presented in Tables 2 and 3, respectively⁷. The first difference that emerges from these results concerns the external PMI values: the texts in NIPS use a very specific language, and for this reason their PMI values are very high. We can also see that TMG groups a highly coherent set of words in each topic. We can easily identify in Table 2 the topics into which the dataset is organized, in particular: talk.politics.mideast, alt.atheism, comp.graphics, soc.religion.christian, talk.politics.misc, rec.motorcycles, sci.crypt, talk.politics.guns, rec.sport.hockey, sci.space, talk.politics.misc. We can also easily identify in Table 3 highly coherent topics, related to optics, signal analysis, optimization, crowdsourcing, audio, graph theory and logic. We noticed that these topics are general and that it is possible to discover more specific topics by increasing the number of topics to extract; for example, we discovered topics related to topic modelling and generative adversarial networks.

⁷ For space limitations we present only 15 topics for 20NG.

5 Conclusion and Future Work

This paper presented a new topic modeling framework based on game-theoretic principles. The results of its evaluation show that the model performs well compared to state-of-the-art systems and that it can extract topically and semantically related groups of words. In this work, the model was kept as simple as possible, to assess whether a game-theoretic framework is in itself suited for topic modeling. In future work, it will be interesting to introduce the topic-document distribution, to test the model on classification tasks, and to use covariates to extract topics along different dimensions, such as time, authorship, or opinion. The framework is open and flexible, and in future work it will be tested with different initializations of the strategy space, graph structures, and payoff functions. It will be particularly interesting to test it using word embeddings and syntactic information.

References

[Bengio et al.2003] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3(Feb):1137–1155.

[Blei et al.2003] David M Blei, Andrew Y Ng, and Michael I Jordan. 2003. Latent dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993–1022.

[Blei2012] David M. Blei. 2012. Probabilistic topic models. Communications of the ACM, 55(4):77–84, April.

[Cao et al.2015] Ziqiang Cao, Sujian Li, Yang Liu, Wenjie Li, and Heng Ji. 2015. A novel neural topic model and its supervised extension. In AAAI, pages 2210–2216.

[Card et al.2018] Dallas Card, Chenhao Tan, and Noah A Smith. 2018. Neural models for documents with metadata. In Proceedings of the 56th Annual Meeting of the ACL, volume 1, pages 2031–2040.

[Chang et al.2009] Jonathan Chang, Sean Gerrish, Chong Wang, Jordan L Boyd-Graber, and David M Blei. 2009. Reading tea leaves: How humans interpret topic models. In NIPS, pages 288–296.

[Chong et al.2009] Wang Chong, David Blei, and Fei-Fei Li. 2009. Simultaneous image classification and annotation. In CVPR 2009, pages 1903–1910. IEEE.

[Das et al.2015] Rajarshi Das, Manzil Zaheer, and Chris Dyer. 2015. Gaussian lda for topic models with word embeddings. In Proceedings of the 53rd Annual Meeting of the ACL, volume 1, pages 795–804.

[Eisenstein et al.2011] Jacob Eisenstein, Amr Ahmed, and Eric P Xing. 2011. Sparse additive generative models of text.

[Gerlach et al.2018] Martin Gerlach, Tiago P. Peixoto, and Eduardo G. Altmann. 2018. A network approach to topic models. Science Advances, 4(7).

[Goodfellow et al.2014] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In NIPS, pages 2672–2680.

[Griffiths and Steyvers2004] Thomas L Griffiths and Mark Steyvers. 2004. Finding scientific topics. Proceedings of the National Academy of Sciences, 101(suppl 1):5228–5235.

[Hinton and Salakhutdinov2009] Geoffrey E Hinton and Ruslan R Salakhutdinov. 2009. Replicated softmax: an undirected topic model. In NIPS, pages 1607–1614.

[Hofmann1999] Thomas Hofmann. 1999. Probabilistic latent semantic indexing. In Proceedings of the 22nd Annual International ACM SIGIR Conference, pages 50–57. ACM.

[Lafferty and Blei2006] John D Lafferty and David M Blei. 2006. Correlated topic models. In NIPS, pages 147–154.

[Larochelle and Lauly2012] Hugo Larochelle and Stanislas Lauly. 2012. A neural autoregressive topic model. In NIPS, pages 2708–2716.

[Lau et al.2017] Jey Han Lau, Timothy Baldwin, and Trevor Cohn. 2017. Topically driven neural language model. In Proceedings of the 55th Annual Meeting of the ACL, volume 1, pages 355–365.

[Mcauliffe and Blei2008] Jon D Mcauliffe and David M Blei. 2008. Supervised topic models. In NIPS, pages 121–128.

[Mikolov et al.2013] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. CoRR, abs/1301.3781.

[Nash1951] John Nash. 1951. Non-cooperative games. Annals of Mathematics, pages 286–295.

[Pavan and Pelillo2007] Massimiliano Pavan and Marcello Pelillo. 2007. Dominant sets and pairwise clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(1).

[Ranzato and Szummer2008] Marc'Aurelio Ranzato and Martin Szummer. 2008. Semi-supervised learning of compact document representations with deep networks. In Proceedings of the 25th International Conference on Machine Learning, pages 792–799. ACM.

[Roberts et al.2014] Margaret E Roberts, Brandon M Stewart, Dustin Tingley, Christopher Lucas, Jetson Leder-Luis, Shana Kushner Gadarian, Bethany Albertson, and David G Rand. 2014. Structural topic models for open-ended survey responses. American Journal of Political Science, 58(4):1064–1082.

[Srivastava and Sutton2017] Akash Srivastava and Charles Sutton. 2017. Autoencoding variational inference for topic models. In International Conference on Learning Representations (ICLR).

[Tripodi and Navigli2019] Rocco Tripodi and Roberto Navigli. 2019. Game theory meets embeddings: a unified framework for word sense disambiguation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 88–99, Hong Kong, China, November. Association for Computational Linguistics.

[Tripodi and Pelillo2017] Rocco Tripodi and Marcello Pelillo. 2017. A game-theoretic approach to word sense disambiguation. Computational Linguistics, 43(1):31–70.

[Tripodi and Pira2017] Rocco Tripodi and Stefano Li Pira. 2017. Analysis of italian word embeddings. In Proceedings of the Fourth Italian Conference on Computational Linguistics (CLiC-it 2017), Rome, Italy, December 11-13, 2017.

[Tripodi et al.2016] Rocco Tripodi, Sebastiano Vascon, and Marcello Pelillo. 2016. Context aware nonnegative matrix factorization clustering. In 23rd International Conference on Pattern Recognition (ICPR 2016), Cancún, Mexico, December 4-8, 2016, pages 1719–1724.

[Von Luxburg2007] Ulrike Von Luxburg. 2007. A tutorial on spectral clustering. Statistics and Computing, 17(4):395–416.

[Weibull1997] J. W. Weibull. 1997. Evolutionary game theory. MIT Press.