=Paper=
{{Paper
|id=Vol-2481/paper59
|storemode=property
|title=To be Fair: a Case for Cognitively-Inspired Models of Meaning
|pdfUrl=https://ceur-ws.org/Vol-2481/paper59.pdf
|volume=Vol-2481
|authors=Simon Preissner,Aurélie Herbelot
|dblpUrl=https://dblp.org/rec/conf/clic-it/PreissnerH19
}}
==To be Fair: a Case for Cognitively-Inspired Models of Meaning==
Simon Preissner (Center for Mind/Brain Sciences, University of Trento; simon.preissner@gmx.de)
Aurélie Herbelot (Center for Mind/Brain Sciences & Dept. of Information Engineering and Computer Science, University of Trento; aurelie.herbelot@unitn.it)

Abstract

In recent years, the cost of Natural Language Processing algorithms has become more and more evident. That cost has many facets, including training times, storage, replicability, interpretability, equality of access to experimental paradigms, and even environmental impact. In this paper, we review the requirements of a 'good' model and argue that a move is needed towards lightweight and interpretable implementations, which promote scientific fairness and paradigmatic diversity, and ultimately foster applications available to all, regardless of financial prosperity. We propose that the community still has much to learn from cognitively-inspired algorithms, which often show extreme efficiency and can 'run' on very simple organisms. As a case study, we investigate the fruit fly's olfactory system as a distributional semantics model. We show that, even in its rawest form, it provides many of the features that we might require from an ideal model of meaning acquisition.

1 Introduction

In recent years, the Natural Language Processing (NLP) community has seen an increase in the popularity of expensive models requiring enormous computational resources to train and run. The cost of such models is multi-faceted. From the point of view of shaping the scientific community, they create a huge gap between researchers in wealthy institutions and those with fewer resources, and they often make replication prohibitive. From the point of view of applicability, they make the end-user dependent on high-tech hardware which they may not be able to afford, or on cloud services which may have problematic privacy side-effects (and are not available to those with poor Internet access). Training such models can often take a long time and extraordinary amounts of energy, generating CO2 emissions disproportionate to the models' improvements (Strubell et al., 2019). Finally, from a pure modelling point of view, complexity often comes with a loss of interpretability, which weakens theoretical insights. Whilst we appreciate that a part of NLP is focused on engineering applications rather than on modelling natural language proper, the linguists and cognitive scientists in the community have a duty to provide transparent, explanatory simulations of particular phenomena.

Such considerations call for smaller and more interpretable systems. In this paper, we offer an example investigation into one of the most widely used techniques in NLP: the vectorial representation of word meanings. Our starting point is the set of requirements that should be fulfilled by an ideal model of lexical acquisition, as expressed in QasemiZadeh et al. (2017): (A) high performance on fundamental lexical tasks, (B) efficiency, (C) low dimensionality for compact storage, (D) amenability to incremental learning, and (E) interpretability. As we will show in §2, state-of-the-art systems still fail to integrate all those points. (A-D) are, however, basic features of human and animal cognition. It seems, therefore, that we should find inspiration in algorithms from cognitive science, which in turn would allow us to derive interpretability (E) from the clear underpinnings of biological or psychological theories.
We propose that a good place to find appropriate algorithms is the natural world, as many organisms display core cognitive abilities such as incremental learning, generalization or classification, which many NLP systems need to emulate. Such faculties develop in extremely simple systems, which are good contenders for the type of models we advocate here. One success story from 'algorithmic' cognitive science is based on the neural architecture of the fruit fly's olfactory system, which clusters patterns of chemicals into categories of smells (Stevens, 2015), and has inspired the so-called Fruit Fly Optimization Algorithm (Pan, 2011; here: Fruit Fly Algorithm or 'FFA'). The FFA has been implemented as a lightweight neural algorithm that performs random indexing for locality-sensitive hashing (LSH) (Dasgupta et al., 2017). This LSH algorithm has successfully been applied to various tasks, particularly in information retrieval and for data compression (Andoni and Indyk, 2008). As a simple LSH algorithm, the FFA compresses data while preserving the notion of similarity of the original data, which is one of the core mechanisms involved in constructing vector representations of word meaning. To our knowledge, it has however never been taken as the basis for building distributional semantic models from scratch, even though it seems to naturally fulfill a number of requirements of those models.

In the following, we present the FFA and show how it can be adapted to create vector spaces of word meaning (§4). We then apply the FFA in an incremental setup (§5) and assess its worth as a model, according to the various criteria highlighted above (§6), including a possible interpretation of the FFA's output.

(Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).)
2 Related work

In Distributional Semantics (DS: Turney and Pantel, 2010; Erk, 2012), the meaning of words is represented by points in a multidimensional space, derived from word co-occurrence statistics. The quality of models usually correlates with the amount of data that is used. With increasing processing resources and larger corpora available, a variety of approaches have been developed in that area (e.g., Bengio et al., 2003; Pennington et al., 2014; Mikolov et al., 2013). State-of-the-art models perform remarkably well and are often a core component of NLP applications. Recent work on DS (e.g., ELMo (Peters et al., 2018) and BERT (Devlin et al., 2018)) shifts the scope of representations from word meaning to sentence meaning, pushing performance, but also model complexity, even further.

The latest DS techniques yield high performance, but they have multiple shortcomings. First, they require massive amounts of text, followed by computationally intensive procedures involving weighting, dimensionality reduction, complex attention mechanisms, etc. The high complexity of most current architectures often comes at the cost of flexibility: once a language model is constructed, any new data requires a re-run of the complete system in order to be incorporated. This makes incrementality unsatisfiable in those frameworks (Sahlgren, 2005; Baroni et al., 2007). Further, architectures themselves have become increasingly complex, at the expense of transparency. We recall that even Word2Vec (W2V: Mikolov et al., 2013), which is a comparatively simple system by today's standards, has attracted a large amount of literature attempting to explain the effects of various hyperparameters in the model (Levy and Goldberg, 2014; Levy et al., 2015; Gittens et al., 2017). Finally, high-performance DS representations are hardly or not at all interpretable. As a result, much research has been dedicated to producing representations that are intuitively interpretable by humans (Murphy et al., 2012; Luo et al., 2015; Fyshe et al., 2015; Shin et al., 2018). These approaches typically attempt to preserve or reconstruct word labels as the basis of the dimensionality-reduced representations, but they can themselves require intensive procedures. In summary, it becomes apparent that the ideal vector-based semantics model that fulfills all requirements highlighted in our introduction has not yet been found.

The Fruit Fly Algorithm we present here can be related to two existing techniques in computer science: Random Indexing and Locality-Sensitive Hashing. Random Indexing (RI) is a simple and efficient method for dimensionality reduction (cf. Sahlgren, 2005), originally used to solve clustering problems (Kaski, 1998). It is also a less-travelled technique in distributional semantics (Kanerva et al., 2000; QasemiZadeh et al., 2017; QasemiZadeh and Kallmeyer, 2016). Its advocates argue that it fulfills a number of requirements of an ideal vector space construction method, in particular incrementality. As for Locality-Sensitive Hashing (LSH: Slaney and Casey, 2008), it is a way to produce hashes that preserve a notion of distance between points in a space, thus satisfying storage efficiency whilst maintaining the spatial configuration of a representation. A comparison of various hash functions for LSH, including RI, is provided by Paulevé et al. (2010).

3 Data

In the spirit of 'training small', the corpus used for our experiments is a subset of 100M words from the ukWaC corpus (Ferraresi et al., 2008), minimally pre-processed (tokenized and stripped of punctuation signs); this results in a corpus of 87.8M words. Following common practice, we quantitatively evaluate the FFA as a lexical acquisition algorithm by testing it over the MEN similarity dataset (Bruni et al., 2014), which consists of 3000 word pairs (751 unique English words), human-annotated for semantic relatedness.

For our experiments, we compute two co-occurrence count spaces over our corpus, with different context sizes (±2 and ±5 around the target). We only consider the 10k most frequent words in the data, ensuring we cover all 751 words in MEN.
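As an illustration of the counting step just described, the sketch below accumulates symmetric co-occurrence counts within a ±N window for a fixed target vocabulary such as the 10k most frequent types. This is not the authors' released code; the function name build_cooccurrence and the inputs corpus_tokens and top10k are our own, hypothetical names.

<pre>
from collections import Counter, defaultdict

def build_cooccurrence(tokens, vocab, window=5):
    """Accumulate symmetric co-occurrence counts within +/- `window`
    positions, for target words restricted to `vocab`."""
    counts = defaultdict(Counter)                  # counts[target][context] = frequency
    for i, target in enumerate(tokens):
        if target not in vocab:
            continue
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[target][tokens[j]] += 1
    return counts

# e.g., counts_5 = build_cooccurrence(corpus_tokens, top10k, window=5)
</pre>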
4 Model

The Fruit Fly Algorithm mimics the olfactory system of the fruit fly, which assigns a pattern of binary activations to a particular smell (i.e., a combination of multiple chemicals), using sparse connections between just two neuronal layers. This mechanism allows the fly to 'conceptualize' its environment and to react appropriately to new smells by relating them to previous experiences. Our implementation of the FFA is an extension of the work of Dasgupta et al. (2017) which allows us to generate a semantic space by hashing each word – as represented by its co-occurrences in a corpus – to a pattern of binary activations.

As in the original implementation, our FFA is a simple feedforward architecture consisting of two layers connected by random projections (Fig. 1). The input layer, the projection neuron layer or PN layer, consists of m nodes {x_1, ..., x_m} which encode the raw co-occurrence counts of a target word with a particular context. To satisfy incrementality, m is variable and can grow as the algorithm encounters new data: if a new context is observed, a node x_{m+1} is recruited to encode that context. A logarithmic function is applied to the input in order to diminish the frequency effects of natural languages (Zipf, 1932). This 'flattens' activation across the PN layer, reducing the impact of very frequent words (e.g., stopwords). The second layer (Kenyon Cell layer or KC layer) consists of n nodes {y_1, ..., y_n}. It is larger than the PN layer and fixed at a constant size (n does not grow). PN and KC are not fully connected. Instead, each KC receives a constant number of connections from the PN layer, randomly and uniformly allocated. In other words, the mapping from PN to KC is a bipartite connection matrix M such that M_ji = 1 if x_i is connected to y_j, and 0 otherwise. The connectivity of each PN is thus variable, albeit uniformly distributed. The activation function on each KC is simply the sum of the activations of its connected PNs. In the end, hashing is carried out via a winner-takes-all (WTA) procedure that 'remembers' the IDs of a small percentage of the most activated KCs as a compact representation of the word's meaning: WTA(y_i) = 1 if y_i is among the k top values in y, and 0 otherwise.

[Figure 1: Schematic of the adapted FFA, with input size m = 4 and output size n = 6 (dense representation: 2). Darker cells correspond to higher activation.]

The FFA's hyperparameters are expressed as a 5-tuple (f, m, n, c, h), where f is the flattening function, m is the size of the PN layer (initially 0), n is the size of the KC layer, c is the number of connections leading to any one KC, and h is the percentage of activated KCs to be hashed.

Note that, since both the connectivity per KC and the size of the KC layer are constant, the overall number of connections is constant. Thus, the expansion mechanism (which increments m) does not create new connections: it randomly selects existing PNs and reallocates connections from those PNs to the new PN. In the reallocation process, we encode a bias towards taking connections from those PNs with the most outgoing connections, in order to ensure even connectivity of the PN layer. For example, in a setup with parameters (f = ln, m = 300, n = 10000, c = 14, h = 8), the average number of connections going out from each PN is (n × c)/m = 466.67: some PNs have 466 connections, some have 467 or more. The next newly encountered word will lead to the creation of x_301, and the expansion process will reallocate ⌊(n × c)/301⌋ = 465 already existing connections to x_301. For this, it will choose PNs with 467 or more connections with a higher probability than those with 466 connections. The parameters after the expansion process are (f = ln, m = 301, n = 10000, c = 14, h = 8).
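To make the two-layer architecture concrete, here is a minimal Python sketch of the hashing step as we read it from the description above: log-flattening of the PN activations (f = ln), summation over the random PN→KC connections, and winner-takes-all over the top h percent of KCs. The class and all names are our own, not the released implementation, and the expansion step below simplifies the paper's reallocation bias (it rewires connection slots uniformly at random rather than preferring the best-connected PNs).

<pre>
import numpy as np

class FruitFlySketch:
    """Minimal sketch of the adapted FFA with hyperparameters (f, m, n, c, h)."""

    def __init__(self, n=40000, c=20, h=0.08, seed=0):
        self.n, self.c, self.h = n, c, h
        self.m = 0                                   # PN layer starts empty
        self.rng = np.random.default_rng(seed)
        self.conn = np.zeros((n, c), dtype=int)      # conn[j] = the c PNs feeding KC j

    def add_pn(self):
        """Recruit a new PN for a newly observed context word. The total number
        of connections (n*c) stays constant: some existing connection slots are
        rewired to the new PN (simplified: slots chosen uniformly at random)."""
        if self.m > 0:
            quota = (self.n * self.c) // (self.m + 1)
            rows = self.rng.integers(0, self.n, size=quota)
            cols = self.rng.integers(0, self.c, size=quota)
            self.conn[rows, cols] = self.m           # point these slots at the new PN
        self.m += 1

    def hash(self, pn_counts):
        """Hash a raw co-occurrence count vector (length m) to a binary KC vector."""
        x = np.log1p(np.asarray(pn_counts, dtype=float))  # f = ln: flatten frequencies
        kc = x[self.conn].sum(axis=1)                # each KC sums its c connected PNs
        k = int(self.h * self.n)                     # winner-takes-all: keep top h percent
        winners = np.argpartition(kc, -k)[-k:]
        out = np.zeros(self.n, dtype=np.uint8)
        out[winners] = 1
        return out
</pre>

In this reading, hashing a word is a single gather-and-sum over the connection matrix followed by a top-k selection; no training step is involved.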
The expansion of dimensions from the PN layer to the KC layer, in combination with random projections, can be interpreted as a form of 'zooming' into a concept for a particular target word: multiple context words are randomly projected onto a single KC. If several of these context words are important for the target (i.e., their PNs have high activation), the corresponding KC will be activated in the final hash. We can imagine this process as aggregating dimensions of the original co-occurrence space, thus generating latent features which give different 'views' into the raw data. For example, one might imagine that random projections from the PNs beak, bill, bank, wing, and feather have one KC in common. This KC might be somewhat activated by the PNs bank and bill in finance contexts, but more crucially, it will consistently be strongly activated for target words related to birds and thus selected for the final hashes of those words. Note that this behaviour lets us backtrack from a dimensionality-reduced representation to the most characteristic contexts for a particular target word, and gives interpretability to the KCs. We will come back to that feature in §6.

5 Experiments and results

In order to characterize the behavior and performance of our incremental FFA, we evaluate the quality of its output vectors against the MEN test set by means of the non-parametric Spearman rank correlation ρ. In order to run the experiments with a sound configuration of the hyperparameters f, n, c, and h, we first perform a grid search, applying various configurations of the FFA to the counts (window size: ±5) of the 10k most frequent words of a held-out corpus. (We restricted the grid search and the subsequent experiment setup to a vocabulary of 10k words for more convenient experimentation; the actual FFA potentially has no such limit.) For this setting, the grid search yields the following optimal configuration: (f = ln, n = 40000, c = 20, h = 0.08); we use this for all further experiments. (The source code of this implementation of the FFA will be released for public use at git@github.com:SimonPreissner/semantic-fruitfly.git.) The grid search revealed, in fact, that the factor of expansion n/m is minimally important.

Next, we incrementally generate a raw frequency-count model of the 10k most frequent words of our corpus, expanding the FFA in parallel with every newly encountered word. Every 1M processed words, the aggregated co-occurrences are hashed by the FFA and the corresponding word vectors (i.e., binary hashes) are stored for evaluation. We compare a) the raw frequency space (input to the FFA); b) the final hashes (output of the FFA); c) a separate Word2Vec (W2V) model trained on exactly the same data, using standard hyperparameters and a minimum count set to match the 10k target words of our co-occurrence space. We repeat this experiment for window sizes ±2 and ±5.

Figure 2 shows the results of our incremental simulation. For the window size ±5, we reach ρ = 0.100 for raw counts, ρ = 0.345 for the FFA output, and ρ = 0.600 for W2V. The 2-word-context setup yields very similar results. The FFA hashing thus has a clear and positive effect (+0.245 from 80M words onwards for the ±5 setup). The amount of improvement is already large at the beginning of training (+0.136 at 5M words) and slowly increases with corpus size. Results are comparable to W2V for very small corpus sizes, but start lagging behind after around 10M words.

[Figure 2: ρ-values of co-occurrence counts, hashed spaces, and Word2Vec models (window sizes ±2 (lines) and ±5 (dotted)). The blue dot shows the performance on POS-tagged data with FFA-5.]
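A sketch of the evaluation loop against MEN, under the assumption that similarity between two binary hashes is measured by cosine (the paper does not spell out the measure); spearmanr is SciPy's non-parametric rank correlation, and men_pairs is a hypothetical list of (word1, word2, human score) triples.

<pre>
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v) / denom if denom else 0.0

def evaluate_men(hashes, men_pairs):
    """hashes: dict mapping a word to its binary KC vector.
    Returns Spearman's rho between model similarities and human judgements."""
    model, gold = [], []
    for w1, w2, score in men_pairs:
        if w1 in hashes and w2 in hashes:
            model.append(cosine(hashes[w1].astype(float), hashes[w2].astype(float)))
            gold.append(score)
    rho, _ = spearmanr(model, gold)
    return rho
</pre>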
6 Discussion

Investigating a cognitive algorithm from scratch requires a clear stance on evaluation: we cannot expect a very simple model to beat the performance of heavily-trained systems, but we can require it to give satisfactory results whilst also being a good model in the strong sense of the term, that is, simulating all observable features of a given real-world phenomenon. Our discussion keeps this in mind, as we focus on the 'wish list' highlighted in §1.

Performance: hashing increases performance over the raw co-occurrence space by over 20 points overall. The system is however outperformed by W2V after seeing around 10M words. In the spirit of providing a comprehensive evaluation of the modelling power of the FFA, we attempt to pull apart the aspects of the learning process that are captured by its very simple algorithm, and those that are not. In other words, which feature results in the large increase over baseline performance? What does the FFA fail to model with respect to W2V? We know that the algorithm generates latent features out of the original space dimensions, encapsulated in each KC. We have tuned the size of the KC layer, so the number of features captured by the FFA should be optimal for our task. We assume that the performance displayed by the algorithm is due to correctly generalizing over contexts. As for its lack of performance, we can make hypotheses based on what we know from other DS models. The FFA does not perform any subsampling or weighting of its input data, and the log function we use to minimize the impact of very frequent items is probably too crude to fulfill that purpose. When we informally inspect the performance of the algorithm on a POS-tagged version of our corpus, keeping only verbs, nouns and adjectives in the input and filtering some highly frequent stopwords (punctuation, auxiliaries), we obtain ρ ≈ 0.51 over the whole corpus (using the top 4000 dimensions of the co-occurrence matrix, with n = 16000, c = 20 and h = 0.08), coming close to W2V's performance and thus indicating that a higher-level 'attention' mechanism could indeed be added to the input layer. (Note that the olfactory system of actual fruit flies only has ≈ 50 odorant receptors, which makes it potentially less crucial to successfully suppress large parts of the input.)

Dimensionality: the size of the hashes produced by the FFA is variable; in the experiments, it was set to 3200, which is much larger than the optimal 300-400 dimensions of W2V. (This results from expressing the n = 40k-dimensional binary vector as the positions of its 1s, which make up h = 8% of the vector, yielding a much smaller representation of length n × h = 3200.) However, the hash corresponds to a sparse vector of integers and is thus efficiently stored and manipulated. The hyperparameter grid search revealed that the factor of expansion from the PN layer to the KC layer is much less important than expected, although the expansion is a core characteristic of the FFA and, intuitively, its factor should have an effect on performance. This suggests that the FFA does not require inconveniently high-dimensional hash signatures to reach its performance. However, it will take further experiments, especially with larger vocabularies, to fully characterize this behaviour.
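The storage argument can be made concrete: a 40k-dimensional binary hash with h = 8% active KCs is fully described by the 3200 positions of its 1s. A small sketch (our own illustration; the overlap measure below is one possible similarity over such signatures, not necessarily the one used in the paper):

<pre>
import numpy as np

def to_sparse(hash_vec):
    """Store a binary KC hash as the sorted positions of its 1s
    (3200 integers for n = 40000 and h = 0.08, instead of a 40000-bit vector)."""
    return np.flatnonzero(hash_vec)

def overlap(ids_a, ids_b):
    """Fraction of shared active KCs between two sparse hashes."""
    return len(np.intersect1d(ids_a, ids_b, assume_unique=True)) / len(ids_a)
</pre>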
Incrementality: the FFA is fully incremental. Note that in our experiments, the W2V space is retrained from scratch after each addition of 1M words to the corpus, while the FFA simply increments counts in its stored co-occurrence space. It is also in stark contrast with weighted count-based distributional models, which require some global PMI (re-)computation to outperform the raw co-occurrence count vectors.

Time efficiency: our FFA runs without costly learning mechanisms; its two most costly operations are (1) the expansion of the PN layer along with new vocabulary and (2) the projection from the PN layer to the KC layer. Following Zipf's Law, most new words are encountered within the first few millions of words. As a consequence, the frequency of expansion operations on the PN layer is high at first, but decreases rapidly, resulting in fast scaling to large amounts of text. Hashing is solely dependent on the number of connections per KC and the size of the KC layer (both constant).

Interpretability: the FFA's two-layer architecture allows for uncomplicated backtracking. Each of the activated nodes in a word's hash represents a single KC. The connections of these 'winner' KCs with the PN layer let us reconstruct which context words originally contributed to the largest activations in the KC layer. To illustrate this, we use the hashes obtained at the last iteration of our incremental experiment (based on window ±5) and identify the k = 50 most characteristic PNs for each hash, ignoring stopwords. Table 1 reports the characteristic PNs shared by various sets of input words. For example, for the words hawk, pigeon, and parrot, the tailed, black, breasted, red, and dove PNs are among the most influential, contributing to many of the activated KCs. Similarly, we can connect beard to wig and cold to dirty; the shared important words of the latter seem to encode shared collocates (cold/dirty war, cold/dirty mind, get cold/dirty).

Table 1: Top PNs for selected sets of words. The importance of a PN for a word is estimated by the number of connections to KCs that are activated in the word's hash (window size ±5).

Hashed Words                | Mutual Important Words
hawk, pigeon, parrot        | tailed, breasted, black, red, dove
library, collection, museum | collection, national, new, art
beard, wig                  | man, wearing, long, like, hair
cold, dirty                 | get, said, war, mind
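The notion of importance used in Table 1 can be computed directly from the connection matrix: take the KCs activated in a word's hash and count, for each PN, how many of those 'winner' KCs it feeds. A sketch, reusing the hypothetical FruitFlySketch class from §4; context_words is an assumed list mapping PN indices to context-word strings.

<pre>
import numpy as np

def top_pns(fly, word_hash, context_words, k=50, stopwords=frozenset()):
    """Backtrack from a word's binary hash to its k most characteristic PNs.
    A PN's importance = number of activated ('winner') KCs it connects to."""
    winners = np.flatnonzero(word_hash)                  # IDs of activated KCs
    pn_ids, counts = np.unique(fly.conn[winners].ravel(), return_counts=True)
    ranked = pn_ids[np.argsort(-counts)]                 # PNs by descending importance
    words = [context_words[i] for i in ranked if context_words[i] not in stopwords]
    return words[:k]
</pre>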
7 Conclusion

We started this paper by suggesting that NLP should explore a different class of algorithms for its most fundamental tasks. We argued that it is worth investigating cognitively-inspired architectures, which may not (yet) perform at state-of-the-art level, but give us insights into potentially more plausible ways to model linguistic faculties in the mind. We also made a case for 'small' and 'fair' systems, in reach of all researchers and end-users.

As an illustration, we have explored what the olfactory system of a fruit fly can do for the representation of word meanings. The algorithm is certainly 'fair' in terms of complexity and required resources. Being based on an actual cognitive mechanism, it naturally encodes requirements such as (processing and storage) efficiency. Its simplicity lends itself to incremental learning and interpretability. Performance on a relatedness dataset highlights that the raw model successfully captures latent concepts in the data, but would probably require an extra attention layer, as indicated by the stronger results obtained on additionally pre-processed data.

We hope to have demonstrated that such a study is accessible to all, and actually gives insights into the minimal components of a model in a way that more complex systems do not achieve. We particularly draw attention to the fact that the interesting behaviour of the fruit fly with respect to interpretability and incrementality makes it a worthy competitor for other distributional models – or at the very least, a source of inspiration.

References

Alexandr Andoni and Piotr Indyk. 2008. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. Communications of the ACM, 51(1):117.

Marco Baroni, Alessandro Lenci, and Luca Onnis. 2007. ISA meets Lara: An incremental word space model for cognitively plausible simulations of semantic learning. In Proceedings of the Workshop on Cognitive Aspects of Computational Language Acquisition, pages 49–56.

Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3(Feb):1137–1155.

Elia Bruni, Nam-Khanh Tran, and Marco Baroni. 2014. Multimodal distributional semantics. Journal of Artificial Intelligence Research, 49:1–47.

Sanjoy Dasgupta, Charles F. Stevens, and Saket Navlakha. 2017. A neural algorithm for a fundamental computing problem. Science, 358(6364):793–796.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Katrin Erk. 2012. Vector space models of word meaning and phrase meaning: A survey. Language and Linguistics Compass, 6(10):635–653.

Adriano Ferraresi, Eros Zanchetta, Marco Baroni, and Silvia Bernardini. 2008. Introducing and evaluating ukWaC, a very large web-derived corpus of English. In Proceedings of the 4th Web as Corpus Workshop (WAC-4): Can we beat Google, pages 47–54.

Alona Fyshe, Leila Wehbe, Partha P. Talukdar, Brian Murphy, and Tom M. Mitchell. 2015. A compositional and interpretable semantic space. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 32–41.

Alex Gittens, Dimitris Achlioptas, and Michael W. Mahoney. 2017. Skip-gram – Zipf + uniform = vector additivity. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 69–76.

Pentti Kanerva, Jan Kristoferson, and Anders Holst. 2000. Random indexing of text samples for latent semantic analysis. In Proceedings of the Annual Meeting of the Cognitive Science Society, volume 22.

Samuel Kaski. 1998. Dimensionality reduction by random mapping: Fast similarity computation for clustering. In 1998 IEEE International Joint Conference on Neural Networks Proceedings. IEEE World Congress on Computational Intelligence (Cat. No. 98CH36227), volume 1, pages 413–418. IEEE.
Omer Levy and Yoav Goldberg. 2014. Neural word embedding as implicit matrix factorization. In Advances in Neural Information Processing Systems, pages 2177–2185.

Omer Levy, Yoav Goldberg, and Ido Dagan. 2015. Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics, 3:211–225.

Hongyin Luo, Zhiyuan Liu, Huanbo Luan, and Maosong Sun. 2015. Online learning of interpretable word embeddings. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1687–1692.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Brian Murphy, Partha Talukdar, and Tom Mitchell. 2012. Learning effective and interpretable semantic models using non-negative sparse embedding. In Proceedings of COLING 2012, pages 1933–1950.

Wen-Tsao Pan. 2011. A new evolutionary computation approach: Fruit fly optimization algorithm. In Proceedings of the Conference on Digital Technology and Innovation Management.

Loïc Paulevé, Hervé Jégou, and Laurent Amsaleg. 2010. Locality sensitive hashing: A comparison of hash function types and querying mechanisms. Pattern Recognition Letters, 31(11):1348–1358.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. arXiv preprint arXiv:1802.05365.

Behrang QasemiZadeh and Laura Kallmeyer. 2016. Random positive-only projections: PPMI-enabled incremental semantic space construction. In Proceedings of the Fifth Joint Conference on Lexical and Computational Semantics, pages 189–198.

Behrang QasemiZadeh, Laura Kallmeyer, and Aurélie Herbelot. 2017. Projection aléatoire non-négative pour le calcul de word embedding. In 24e Conférence sur le Traitement Automatique des Langues Naturelles (TALN), pages 109–122.

Magnus Sahlgren. 2005. An introduction to random indexing. In Proceedings of the Methods and Applications of Semantic Indexing Workshop at the 7th International Conference on Terminology and Knowledge Engineering (TKE).

Jamin Shin, Andrea Madotto, and Pascale Fung. 2018. Interpreting word embeddings with eigenvector analysis. In 32nd Conference on Neural Information Processing Systems (NIPS 2018), IRASL workshop.

Malcolm Slaney and Michael Casey. 2008. Locality-sensitive hashing for finding nearest neighbors [lecture notes]. IEEE Signal Processing Magazine, 25(2):128–131.

Charles F. Stevens. 2015. What the fly's nose tells the fly's brain. Proceedings of the National Academy of Sciences, 112(30):9460–9465.

Emma Strubell, Ananya Ganesh, and Andrew McCallum. 2019. Energy and policy considerations for deep learning in NLP. arXiv preprint arXiv:1906.02243.

Peter D. Turney and Patrick Pantel. 2010. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37:141–188.

George Kingsley Zipf. 1932. Selected Studies of the Principle of Relative Frequency in Language. Harvard University Press.