Topic Modelling Games

Rocco Tripodi
Sapienza NLP Group
Department of Computer Science, Sapienza University of Rome
tripodi@di.uniroma1.it

Abstract

English. This paper presents a new topic modelling framework inspired by game-theoretic principles. It is formulated as a normal-form game in which words are represented as players and topics as strategies that the players select. The strategies of each player are modelled with a probability distribution guided by a utility function that the players try to maximize. This function induces players to select strategies similar to those selected by similar players and to choose strategies not shared with those selected by dissimilar players. The proposed framework is compared with state-of-the-art models, demonstrating good performances on standard benchmarks.

Italiano. [Translated from the Italian] This article presents a topic modelling approach inspired by game theory. Topic modelling is viewed as a normal-form game in which the words represent the players and the topics the strategies that the players can choose. Each player chooses the strategies to employ through a probability distribution that is influenced by a utility function the players try to maximize. This function encourages players to choose strategies similar to those employed by similar players and discourages the choice of strategies shared with dissimilar players. The comparison with state-of-the-art models demonstrates good performances on several evaluation datasets.

1 Introduction

Topic modeling is a technique that discovers the underlying topics contained in a collection of documents (Blei, 2012; Griffiths and Steyvers, 2004). It can be used in different tasks of text classification, document retrieval, and sentiment analysis, providing at the same time vector representations of words and documents. State-of-the-art systems are based on probabilistic (Blei et al., 2003; Mcauliffe and Blei, 2008; Chong et al., 2009) and neural network models (Bengio et al., 2003; Hinton and Salakhutdinov, 2009; Larochelle and Lauly, 2012; Cao et al., 2015). A different perspective, based on game theory, is proposed in this article.

The use of game-theoretic principles in machine learning (Goodfellow et al., 2014), pattern recognition (Pavan and Pelillo, 2007) and natural language processing (Tripodi et al., 2016; Tripodi and Navigli, 2019) is a promising field of research that is producing original models. The main difference between computational models based on optimization techniques and game-theoretic models is that the former try to maximize (or minimize) a function (which in many cases is non-convex), while the latter try to find the equilibrium state of a dynamical system. The equilibrium concept is useful because it represents a state in which all the constraints of a given system are satisfied and no object of the system has an incentive to deviate from it, since a different configuration would immediately lead to a worse situation in terms of payoff and fitness, at both the object and the system level. Furthermore, it is guaranteed that the system converges to a mixed-strategy Nash equilibrium (Nash, 1951). So far, game-theoretic models have been used in classification and clustering tasks (Pavan and Pelillo, 2007; Tripodi and Pelillo, 2017). In this work, a game-theoretic model is proposed for inferring a low-dimensional representation of words that can capture their latent semantics.

In this work, topic modeling is interpreted as a symmetric non-cooperative game (Weibull, 1997) in which the words are the players and the topics are the strategies that the players can select. Two players are matched to play the games together according to the co-occurrence patterns found in the corpus under study. The players use a probability distribution over their strategies to play the games, and obtain a payoff for each strategy. This reward helps them to adjust their strategy selection in future games, considering which strategies have been effective in previous games; it allows concentrating more mass on the strategies that get high rewards. The underlying idea behind the payoff function is to create two influence dynamics: the first forces similar players (words that appear in similar contexts) to select similar strategies; the second forces dissimilar players (words that do not share any context) to select different strategies. The games are played repeatedly until the system converges, that is, until the difference between the strategy distributions of the players at time t and at time t − 1 is under a small threshold. The convergence of the system corresponds to an equilibrium, a situation in which there is an optimal association of words and topics.

2 Related Work

Hofmann (1999) proposed one of the earliest topic models, probabilistic Latent Semantic Indexing (pLSI). It represents each word in a document as a sample from a mixture model, where topics are represented as multinomial random variables and documents as mixtures of topics. Latent Dirichlet Allocation (LDA) (Blei et al., 2003), the most widely used topic model, is a generalization of pLSI that introduces Dirichlet priors both for the word multinomial distributions over topics and for the topic multinomial distributions over documents. This line of research has been developed by building different features on top of LDA, to infer correlations among topics (Lafferty and Blei, 2006) or to model words and labels jointly in a supervised way (Mcauliffe and Blei, 2008).

Topic models based on neural network principles were introduced with the neural network language model proposed by Bengio et al. (2003). This paradigm is very popular in NLP, and many topic models are based on it because these techniques make it possible to obtain a low-dimensional representation of the data. In particular, auto-encoders (Ranzato and Szummer, 2008), Boltzmann machines (Hinton and Salakhutdinov, 2009) and autoregressive distributions (Larochelle and Lauly, 2012) have been used to model documents with layer-wise neural network tools. The Neural Topic Model (NTM; Cao et al., 2015) tries to overcome some limitations of classical topic models, such as the initialization problem and the generalization to n-grams. It exploits word embeddings to represent n-grams and uses backpropagation to adjust the weights of the network between the embedding layer and the word-topic and document-topic layers. A general framework for topic modeling, also based on neural networks, is the Sparse Contextual Hidden and Observed Language AutoencodeR (SCHOLAR; Card et al., 2018). It allows using covariates to influence the topic distributions and labels to include supervision. Like Sparse Additive GEnerative models (SAGE; Eisenstein et al., 2011), it can produce sparse topic representations, but differently from SAGE and the Structural Topic Model (STM; Roberts et al., 2014) it can easily consider a larger set of metadata. A graphical topic model was proposed by Gerlach et al. (2018). In this framework, the task of finding topical structures is interpreted as the task of finding communities in complex networks. It is particularly interesting because it shows analogies with traditional topic models and overcomes some of their limitations, such as the bound to a Bayesian prior and the need to specify the number of topics in advance.

3 Topic Modelling Games

Normal-form games consist of a finite set of players N = (1, ..., n), a finite set of pure strategies S_i = {1, ..., m_i} for each player i ∈ N, and a payoff (utility) function u_i : S → R that associates a payoff to each combination of strategies S = S_1 × S_2 × ... × S_n. The payoff function depends not only on the strategy chosen by a single player but on the combination of strategies played at the same time by the players. Each player tries to maximize the value of u_i. Furthermore, in non-cooperative games the players choose their strategies independently, considering what the other players can play and trying to find the best response to the strategies of the co-players. Nash equilibria (Nash, 1951) are the key concept of game theory and can be defined as those strategy combinations in which each strategy is a best response to the strategy of the co-player, and no player has an incentive to unilaterally deviate from them, because there is no way to do better. In addition to playing pure strategies, which correspond to selecting just one strategy from those available in S_i, a player i can also use mixed strategies, which are probability distributions over pure strategies. A mixed strategy over S_i is defined as a vector x_i = (x_1, ..., x_{m_i}) such that x_j ≥ 0 and Σ_j x_j = 1. In a two-player game, a strategy profile can be defined as a pair (x_i, x_j). The expected payoff for this strategy profile is computed as:

    u(x_i, x_j) = x_i^T · A_{ij} x_j

where A_{ij} is the m_i × m_j payoff matrix between players i and j.

Evolutionary game theory (Weibull, 1997) has introduced two important modifications: 1. the games are played repeatedly; and 2. the players update their mixed strategies over time until it is no longer possible to improve the payoff. With these two modifications, the players can develop an inductive learning process that allows them to learn their strategy distributions according to what the other players are selecting. The payoff corresponding to the h-th pure strategy is computed as:

    u(x_i^h) = x_i^h · Σ_{j=1}^{n_i} (A_{ij} x_j)_h        (1)

The average payoff of player i is calculated as:

    u(x_i) = Σ_{h=1}^{m_i} u(x_i^h)        (2)

To find the Nash equilibrium of the game, it is common to use the replicator dynamics equation (Weibull, 1997), which allows better-than-average strategies to grow at each iteration. It can be considered as an inductive learning process in which the players learn from past experiences how to play their best strategy. It is important to notice that each player optimizes its individual strategy space, but this operation is done according to what the other players are simultaneously doing, so the local optimization is the result of a global process.

Data Preparation. The players of the topic modelling games are the words v = (1, ..., n) in the vocabulary V of the corpus under analysis, and the strategies S = (1, ..., m) are the topics to extract from the same corpus. The strategy space x_i of each player i is represented as a probability distribution that can be interpreted as the mixture of topics typically used in topic modeling. The strategy space of the games can be represented as an n × m matrix X, where each row represents the probability distribution of a player over its m strategies (the topics that have to be extracted from the corpus). The interactions among the players are modeled using the n × n adjacency matrix W of an undirected weighted graph, in which each entry w_ij encodes the similarity between two words.

Payoff Function and System Dynamics. The payoff function of the game is constructed by exploiting the information stored in W. This matrix gives us the structural information of the corpus. It allows us to select the players with whom each player plays the games, indicated by the presence of an edge between two nodes (players), and to quantify the level of influence that each player has on the other, indicated by the weight on each edge. The absence of an edge in this graph indicates that two words are distributionally dissimilar. Using these three sources of information, we model a payoff function that forces similar players to choose similar strategies (topics) and dissimilar players to choose different ones. The payoff of a player is calculated as:

    u(x_i^h) = x_i^h ( Σ_{j=1}^{n_i} (A_{ij} x_j)_h − ε Σ_{g=1}^{neg_i} (x_g)_h )        (3)

where the first summation is over the n_i direct neighbors of player i, the players with whom i shares some similarity, and the second summation is over the neg_i negative players of player i, the players with whom i does not share any similarity. With the first summation, player i negotiates a correlated strategy (topic) with its neighbors; with the second, it deviates from the strategies chosen by the negative players. This is done by subtracting the payoff that i would have gained if these negative players had been its neighbors. The negative players are sampled from V according to frequency, in the same way negative samples are selected in word embedding models (Mikolov et al., 2013; Tripodi and Pira, 2017). The probability of selecting word w_i as negative is:

    P(w_i) = f(w_i)^{3/4} / Σ_{j=0}^{n} f(w_j)^{3/4}        (4)

where f(w_i) is the frequency of word w_i. Since the similarity with negative players is 0, we introduced the parameter ε to weight their influence, and set it to the minimum positive value in A (min(A > 0)). The number of negative players, neg_i, is set to n_i (the number of neighbours of player i).

Once the players have played all the games with their neighbors and negative players, the average payoff of each player can be calculated with Equation (2). The payoff is higher when two words are highly correlated and have similar mixed strategies. For this reason, the replicator dynamics equation (Weibull, 1997) is used to compute the dynamics of the system. It pushes the players to be influenced by the mixed strategies of the co-players; this influence is proportional to the similarity between two players (A_{ij}). Once the influence dynamics no longer affect the players, the Nash equilibrium of the system is reached.

4 Experimental Results

In this section, we evaluate TMG and compare it with state-of-the-art systems.

4.1 Data and Setting

The datasets used to evaluate TMG are 20 Newsgroups¹ (20NG) and NIPS². 20NG is a collection of about 20,000 documents organized into 20 different classes. NIPS is composed of about 1,700 NIPS conference papers published between 1987 and 1999, with no class information. Each text was tokenized and lowercased. The stop-words were removed, and the vocabulary was constructed considering the 1000 and 2000 most frequent words in 20NG and NIPS, respectively. This choice is in line with previous work (Card et al., 2018). To keep the model as simple as possible, tf-idf weighting was used to construct the feature vectors of the words, and cosine similarity was employed to create the adjacency matrix A. It is important to notice that other sources of information, derived from pre-trained word embeddings, syntactic structures or document metadata, could easily be included at this stage. A is then sparsified, taking only the r nearest neighbours of each node, with r = log(n); this operation reduces the computational cost of the algorithm and guarantees that the graph remains connected (Von Luxburg, 2007).

The strategy space of the players was initialized using a normal distribution, to reduce the parameters of the framework³. The last two parameters of the system concern the stopping criteria of the dynamics: 1. the maximum number of iterations (10^5); and 2. the minimum difference between two different iterations (10^{-3}), calculated as Σ_{i=1}^{n} |x_i(t − 1) − x_i(t)|.

³ Experimentally, it was also observed that using a Dirichlet distribution with different α parameters to initialize the strategy space did not affect the performance of the model much.

TMG has been compared with SCHOLAR⁴, LDA⁵ and NVDM⁶. We configured the NVDM network with two 500-dimensional encoder layers and ReLU non-linearities. SCHOLAR has been configured using a more complex setting that consists of a single-layer encoder and a 4-layer generator. LDA has been run with the following parameters: α = 50, iterations = 1000 and topic threshold = 0.

    Dataset   TMG    SCHOLAR   NVDM   LDA
    20NG      824    819       927    791
    NIPS      1311   1370      1564   1017

    Table 1: Comparison of the models in terms of perplexity.

4.2 Evaluation

In this section, we compare the generalization performances of TMG with those of the models presented in the previous section. For the evaluation we used perplexity (PPL), even if it has been shown not to correlate with the human interpretation of topics (Chang et al., 2009). We computed perplexity on unobserved documents C as:

    PPL(C) = exp( − (1/N) Σ_{n=1}^{N} log P(C_n) / D_n )        (5)

where N is the number of documents in the collection C and D_n is the number of words in document n. Low perplexity suggests less uncertainty about the documents. Held-out documents represent 15% of each dataset. Perplexity is computed for 10 topics on the NIPS dataset and 20 topics on the 20 Newsgroups dataset; these numbers correspond to the real number of classes of each dataset.

Table 1 shows the comparison of perplexity. As reported in previous work (Card et al., 2018), it is difficult to achieve a lower perplexity than LDA.
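As an illustration, the interplay of Equations (1)-(4) with the replicator dynamics can be sketched in a few lines of Python. The sketch makes assumptions the text leaves open: each partial payoff matrix A_ij is taken to be the identity scaled by the similarity w_ij (so that (A_ij x_j)_h = w_ij x_j^h, as in related graph-transduction games), ε is a fixed constant rather than min(A > 0), and payoffs are clipped to stay positive so that the multiplicative replicator update remains well defined.

```python
import numpy as np

rng = np.random.default_rng(0)

def negative_sampling_probs(freq):
    """Eq. (4): probability of drawing each word as a negative player."""
    p = freq ** 0.75
    return p / p.sum()

def tmg(W, freq, m, eps=1e-2, max_iter=10**5, tol=1e-3):
    """Illustrative sketch of the topic modelling games dynamics.

    W    : (n, n) similarity matrix; w_ij > 0 marks neighbours
    freq : (n,) word frequencies, used for negative sampling
    m    : number of topics (pure strategies of each player)
    """
    n = W.shape[0]
    X = np.abs(rng.normal(size=(n, m)))   # strategy space, one row per player
    X /= X.sum(axis=1, keepdims=True)
    p_neg = negative_sampling_probs(freq)
    n_nbr = (W > 0).sum(axis=1)           # n_i, also used as neg_i

    for _ in range(max_iter):
        X_old = X.copy()
        # Positive term of Eq. (3): sum_j (A_ij x_j)_h = sum_j w_ij x_j^h.
        pos = W @ X
        # Negative term: neg_i players drawn according to Eq. (4),
        # down-weighted by eps because their similarity is 0.
        neg = np.zeros_like(X)
        for i in range(n):
            g = rng.choice(n, size=int(n_nbr[i]), p=p_neg)
            neg[i] = eps * X[g].sum(axis=0)
        payoff = np.clip(pos - neg, 1e-12, None)
        # Discrete replicator dynamics: strategies with better-than-average
        # payoff gain probability mass.
        X = X * payoff
        X /= X.sum(axis=1, keepdims=True)
        if np.abs(X_old - X).sum() < tol:  # stopping criterion of Section 4.1
            break
    return X
```

At convergence, each row of X is the mixed strategy of a word, i.e. its distribution over the m topics.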
¹ http://qwone.com/~jason/20Newsgroups/
² http://www.cs.nyu.edu/~roweis/data.html
⁴ https://github.com/dallascard/scholar
⁵ http://mallet.cs.umass.edu
⁶ https://github.com/ysmiao/nvdm

The results in these experiments follow the same pattern: LDA has the lowest perplexity, TMG and SCHOLAR have similar results, and NVDM performs slightly worse on both datasets.

4.3 Topic Coherence and Interpretability

It has been shown that perplexity does not necessarily correlate well with topic coherence (Chang et al., 2009; Srivastava and Sutton, 2017). For this reason, we also evaluated the performance of our system on coherence (Chang et al., 2009; Das et al., 2015). Coherence is calculated by computing the relatedness between topic words using pointwise mutual information (PMI). We used Wikipedia (2018.05.01 dump) as the corpus to compute co-occurrence statistics, using a sliding window of 5 words on the left and on the right of each target word. For each topic, we selected the 10 words with the highest mass; we then calculated the PMI among all the word pairs and finally computed the coherence as the arithmetic mean of all these values. This metric has been shown to correlate well with human judgments (Lau et al., 2017). We used two different sources of information for the computation of the PMI: one is internal and corresponds to the dataset under analysis; the other is external and is represented by the English Wikipedia corpus.

[Figure 1: Internal PMI mean and std values; (a) 20NG, (b) NIPS.]

[Figure 2: External PMI mean and std values; (a) 20NG, (b) NIPS.]

[Figure 3: Sparsity mean and std values; (a) 20NG, (b) NIPS.]

Internal PMI. Figure 1 presents the PMI values of the different models computed on the two corpora. As can be seen from Figure 1a, TMG has a low PMI compared to all the other systems on the 20 Newsgroups dataset when there are few topics to extract (i.e., 2 and 5). The situation changes drastically when the number of topics increases: in fact, it has the highest performance on this dataset when extracting 10, 20, 50 and 100 topics. The performances of NVDM and SCHOLAR are similar and follow a decreasing pattern, with very high values at the beginning. On the contrary, the performance of LDA follows the opposite pattern: this model seems to work better when the number of topics to extract is high. On NIPS (Figure 1b) the performances of the systems are similar to those on 20 Newsgroups. The only exception is that TMG always has the highest PMI and seems to behave better also when the number of topics to extract is high. This is probably because the number of words in NIPS is higher, and for this reason it is reasonable to also have a higher number of topics. This is also confirmed by the qualitative analysis of the topics in Section 4.4, where it is shown that with low values of k it is possible to extract general topics, and that increasing its value it is possible to extract more specific ones.

In general, we can find three different patterns in these experiments: 1. NVDM and SCHOLAR work well when extracting a low number of topics; 2. LDA works well when it has to extract a large number of topics; 3. TMG works well when extracting a number of topics that is close to the real number of classes in the datasets. Another aspect to take into account is that, even if TMG has the highest performance, its results also have a high standard deviation. This is due to the stochastic nature of negative sampling.
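The PMI-based coherence described above can be sketched as follows. The tokenised corpus, the window handling and the normalisation are simplified toy versions of the procedure in the text; unseen pairs are given a PMI of 0 for brevity.

```python
import math
from collections import Counter
from itertools import combinations

def pmi_coherence(topic_words, corpus, window=5):
    """Mean pairwise PMI of the topic words, with co-occurrences
    counted in a sliding window over a tokenised corpus."""
    word_counts = Counter()
    pair_counts = Counter()
    n_tokens = 0
    for doc in corpus:
        for i, w in enumerate(doc):
            word_counts[w] += 1
            n_tokens += 1
            # Count each co-occurring pair once by looking only left.
            for c in doc[max(0, i - window):i]:
                pair_counts[frozenset((w, c))] += 1
    n_pairs = sum(pair_counts.values()) or 1

    def pmi(a, b):
        joint = pair_counts[frozenset((a, b))]
        if joint == 0:
            return 0.0  # simplification: unseen pairs contribute 0
        p_ab = joint / n_pairs
        p_a = word_counts[a] / n_tokens
        p_b = word_counts[b] / n_tokens
        return math.log(p_ab / (p_a * p_b))

    pairs = list(combinations(topic_words, 2))
    return sum(pmi(a, b) for a, b in pairs) / len(pairs)
```

Words that frequently co-occur within the window score high; words that never share a window pull the topic's mean PMI down, which is exactly what makes the measure a proxy for human-judged coherence.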
PMI     Topic words
29.71   turks soviet turkish armenian armenia passes roads armenians argic proceeded
15.27   schneider allan morality keith atheists moral political pasadena objective animals
12.70   drive ide scsi controller drives mb disk isa bus floppy
11.72   vms disclaimer vnews vax necessarily represents views expressed news poster
10.79   god jesus christians christ christianity bible christian faith church belief
10.18   intellect banks gordon surrender univ pittsburgh significant hospital level blood
8.94    bike ride riding dod bikes motorcycle bmw honda road advice
8.93    providing encryption clipper key escrow crypto keys chip secure wiretap
8.55    fbi compound batf fire waco children koresh gas branch started
7.52    gun firearms guns criminals crime weapons criminal violent weapon armed
7.45    team game play season hockey league nhl players cup stanley
7.14    space orbit shuttle launch earth mission flight nasa moon solar
6.92    male gay men sexual percentage study sex apparent showing women
6.21    tim israel israeli arab jews arabs policy war land north
6.13    amateur georgia intelligence ai programs michael radio adams ignore occur

Table 2: Best topics (one topic per row) extracted from 20 Newsgroups using TMG (setting k = 20), ordered by external PMI (left column).
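Tables like the one above can be read off the equilibrium word-topic matrix X: for each topic (a column of X) one takes the k words with the highest probability mass. A minimal sketch, where the matrix and vocabulary in the test are toy placeholders:

```python
import numpy as np

def top_words(X, vocab, k=10):
    """For each topic (column of X), return the k words with highest mass."""
    return [[vocab[i] for i in np.argsort(X[:, t])[::-1][:k]]
            for t in range(X.shape[1])]
```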
PMI      Topic words
304.85   ocular eye fovea dominance saccades saccadic fixation foveal eyes saccade
283.66   dendrites dendritic soma dendrite axonal axons nmda pyramidal somatic axon
276.39   oscillatory oscillations oscillators oscillator oscillation synchronization decoding locking synchronize synchronized
230.50   crowdsourcing crowds workers worker labelers crowd turk wisdom expertise dawid
218.51   kaiming shaoqing xiangyu jian yangqing karen sergey trevor sergio jitendra
196.86   retina photoreceptor retinal vertebrate schulten photoreceptors ganglion kohonen bipolar visualizing
176.75   auditory sound sounds cochlear ear hearing ears acoust tone cochlea
146.30   graph edges graphs optimisation edge vertices optimise optimising optimised vertex
146.25   disturbances plant controllers controller disturbance plants activate activated activating activates
145.84   lifted propositional predicate grounding predicates domingos clauses compilation formulas logical

Table 3: Topics (one per row) extracted from NIPS using TMG (setting k = 10), ordered by external PMI (left column).

Sparsity. We compared the sparsity of the word-topic matrices X in Figures 3a and 3b, computed as s = |X > 10^{-3}| / |X|. From both figures, we can see that TMG can produce highly sparse representations, especially when the number of topics to extract is low. This is a nice feature, since it provides more interpretable results. Only SCHOLAR produces sparser representations, when the number of topics to extract is high. Experimentally, we also noticed that in TMG we can control the sparsity of X by increasing the number of iterations of the game dynamics.

4.4 Qualitative Evaluation

Examples of topics extracted from 20NG and NIPS are presented in Tables 2 and 3, respectively⁷. The first difference that emerges from these results concerns the external PMI values: the texts in NIPS use a very specific language, and for this reason their PMI values are very high. We can also see that TMG groups a highly coherent set of words in each topic. We can easily identify in Table 2 the topics into which the dataset is organized, in particular: talk.politics.mideast, alt.atheism, comp.graphics, soc.religion.christian, talk.politics.misc, rec.motorcycles, sci.crypt, talk.politics.guns, rec.sport.hockey, sci.space, talk.politics.misc. We can also easily identify in Table 3 highly coherent topics, related to optics, signal analysis, optimization, crowdsourcing, audio, graph theory and logic. We noticed that these topics are general and that it is possible to discover more specific topics by increasing the number of topics to extract; for example, we discovered topics related to topic modelling and generative adversarial networks.

⁷ For space limitations we present only 15 topics for 20NG.

5 Conclusion and Future Work

This paper presented a new topic modeling framework based on game-theoretic principles. The results of its evaluation show that the model performs well compared to state-of-the-art systems and that it can extract topically and semantically related groups of words. In this work, the model was kept as simple as possible, to assess whether a game-theoretic framework is in itself suited for topic modeling. In future work, it will be interesting to introduce the topic-document distribution, to test the model on classification tasks, and to use covariates to extract topics along different dimensions, such as time, authorship, or opinion. The framework is open and flexible, and in future work it will be tested with different initializations of the strategy space, graph structures, and payoff functions. It will be particularly interesting to test it using word embeddings and syntactic information.

References

[Bengio et al.2003] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3(Feb):1137–1155.

[Blei et al.2003] David M Blei, Andrew Y Ng, and Michael I Jordan. 2003. Latent dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993–1022.

[Blei2012] David M. Blei. 2012. Probabilistic topic models. Communications of the ACM, 55(4):77–84, April.

[Cao et al.2015] Ziqiang Cao, Sujian Li, Yang Liu, Wenjie Li, and Heng Ji. 2015. A novel neural topic model and its supervised extension. In AAAI, pages 2210–2216.

[Card et al.2018] Dallas Card, Chenhao Tan, and Noah A Smith. 2018. Neural models for documents with metadata. In Proceedings of the 56th Annual Meeting of the ACL, volume 1, pages 2031–2040.

[Chang et al.2009] Jonathan Chang, Sean Gerrish, Chong Wang, Jordan L Boyd-Graber, and David M Blei. 2009. Reading tea leaves: How humans interpret topic models. In NIPS, pages 288–296.

[Chong et al.2009] Wang Chong, David Blei, and Fei-Fei Li. 2009. Simultaneous image classification and annotation. In CVPR 2009, pages 1903–1910. IEEE.

[Das et al.2015] Rajarshi Das, Manzil Zaheer, and Chris Dyer. 2015. Gaussian lda for topic models with word embeddings. In Proceedings of the 53rd Annual Meeting of the ACL, volume 1, pages 795–804.

[Eisenstein et al.2011] Jacob Eisenstein, Amr Ahmed, and Eric P Xing. 2011. Sparse additive generative models of text.

[Gerlach et al.2018] Martin Gerlach, Tiago P. Peixoto, and Eduardo G. Altmann. 2018. A network approach to topic models. Science Advances, 4(7).

[Goodfellow et al.2014] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In NIPS, pages 2672–2680.

[Griffiths and Steyvers2004] Thomas L Griffiths and Mark Steyvers. 2004. Finding scientific topics. Proceedings of the National Academy of Sciences, 101(suppl 1):5228–5235.

[Hinton and Salakhutdinov2009] Geoffrey E Hinton and Ruslan R Salakhutdinov. 2009. Replicated softmax: an undirected topic model. In NIPS, pages 1607–1614.

[Hofmann1999] Thomas Hofmann. 1999. Probabilistic latent semantic indexing. In Proceedings of the 22nd Annual International ACM SIGIR Conference, pages 50–57. ACM.

[Lafferty and Blei2006] John D Lafferty and David M Blei. 2006. Correlated topic models. In NIPS, pages 147–154.

[Larochelle and Lauly2012] Hugo Larochelle and Stanislas Lauly. 2012. A neural autoregressive topic model. In NIPS, pages 2708–2716.

[Lau et al.2017] Jey Han Lau, Timothy Baldwin, and Trevor Cohn. 2017. Topically driven neural language model. In Proceedings of the 55th Annual Meeting of the ACL, volume 1, pages 355–365.

[Mcauliffe and Blei2008] Jon D Mcauliffe and David M Blei. 2008. Supervised topic models. In NIPS, pages 121–128.

[Mikolov et al.2013] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. CoRR, abs/1301.3781.

[Nash1951] John Nash. 1951. Non-cooperative games. Annals of Mathematics, pages 286–295.

[Pavan and Pelillo2007] Massimiliano Pavan and Marcello Pelillo. 2007. Dominant sets and pairwise clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(1).

[Ranzato and Szummer2008] Marc'Aurelio Ranzato and Martin Szummer. 2008. Semi-supervised learning of compact document representations with deep networks. In Proceedings of the 25th International Conference on Machine Learning, pages 792–799. ACM.

[Roberts et al.2014] Margaret E Roberts, Brandon M Stewart, Dustin Tingley, Christopher Lucas, Jetson Leder-Luis, Shana Kushner Gadarian, Bethany Albertson, and David G Rand. 2014. Structural topic models for open-ended survey responses. American Journal of Political Science, 58(4):1064–1082.

[Srivastava and Sutton2017] Akash Srivastava and Charles Sutton. 2017. Autoencoding variational inference for topic models. In International Conference on Learning Representations (ICLR).

[Tripodi and Navigli2019] Rocco Tripodi and Roberto Navigli. 2019. Game theory meets embeddings: a unified framework for word sense disambiguation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 88–99, Hong Kong, China, November. Association for Computational Linguistics.

[Tripodi and Pelillo2017] Rocco Tripodi and Marcello Pelillo. 2017. A game-theoretic approach to word sense disambiguation. Computational Linguistics, 43(1):31–70.

[Tripodi and Pira2017] Rocco Tripodi and Stefano Li Pira. 2017. Analysis of italian word embeddings. In Proceedings of the Fourth Italian Conference on Computational Linguistics (CLiC-it 2017), Rome, Italy, December 11-13, 2017.

[Tripodi et al.2016] Rocco Tripodi, Sebastiano Vascon, and Marcello Pelillo. 2016. Context aware nonnegative matrix factorization clustering. In 23rd International Conference on Pattern Recognition (ICPR 2016), Cancún, Mexico, December 4-8, 2016, pages 1719–1724.

[Von Luxburg2007] Ulrike Von Luxburg. 2007. A tutorial on spectral clustering. Statistics and Computing, 17(4):395–416.

[Weibull1997] J. W. Weibull. 1997. Evolutionary game theory. MIT Press.