To be Fair: a Case for Cognitively-Inspired Models of Meaning

Simon Preissner
Center for Mind/Brain Sciences, University of Trento
simon.preissner@gmx.de

Aurélie Herbelot
Center for Mind/Brain Sciences & Dept. of Information Engineering and Computer Science, University of Trento
aurelie.herbelot@unitn.it
Abstract

In recent years, the cost of Natural Language Processing algorithms has become more and more evident. That cost has many facets, including training times, storage, replicability, interpretability, equality of access to experimental paradigms, and even environmental impact. In this paper, we review the requirements of a 'good' model and argue that a move is needed towards lightweight and interpretable implementations, which promote scientific fairness and paradigmatic diversity, and ultimately foster applications available to all, regardless of financial prosperity. We propose that the community still has much to learn from cognitively-inspired algorithms, which often show extreme efficiency and can 'run' on very simple organisms. As a case study, we investigate the fruit fly's olfactory system as a distributional semantics model. We show that, even in its rawest form, it provides many of the features that we might require from an ideal model of meaning acquisition.[1]

[1] Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

In recent years, the Natural Language Processing (NLP) community has seen an increase in the popularity of expensive models requiring enormous computational resources to train and run. The cost of such models is multi-faceted. From the point of view of shaping the scientific community, they create a huge gap between researchers in wealthy institutions and those with fewer resources, and they often make replication prohibitive. From the point of view of applicability, they make the end-user dependent on high-tech hardware which they may not be able to afford, or on cloud services which may have problematic privacy side-effects (and are not available to those with poor Internet access). Training such models can often take a long time and extraordinary amounts of energy, generating CO2 emissions disproportionate to the models' improvements (Strubell et al., 2019). From a pure modelling point of view, finally, complexity often comes with a loss of interpretability, which weakens theoretical insights. Whilst we appreciate that a part of NLP is focused on engineering applications rather than modelling natural language proper, the linguists and cognitive scientists in the community have a duty to provide transparent, explanatory simulations of particular phenomena.

Such considerations call for smaller and more interpretable systems. In this paper, we offer an example investigation into one of the most widely used techniques in NLP: the vectorial representation of word meanings. Our starting point is the set of requirements that should be fulfilled by an ideal model of lexical acquisition, as expressed in QasemiZadeh et al. (2017): (A) high performance on fundamental lexical tasks, (B) efficiency, (C) low dimensionality for compact storage, (D) amenability to incremental learning, (E) interpretability. As we will show in §2, state-of-the-art systems still fail to integrate all those points. (A-D) are, however, basic features of human and animal cognition. It seems, therefore, that we should find inspiration in algorithms from cognitive science, which in turn would allow us to derive interpretability (E) from the clear underpinnings of biological or psychological theories.

We propose that a good place to find appropriate algorithms is the natural world, as many organisms display core cognitive abilities such as incremental learning, generalization or classification, which many NLP systems need to emulate. Such faculties develop in extremely simple systems, which are good contenders for the type of models we advocate here.
One success story from 'algorithmic' cognitive science is based on the neural architecture of the fruit fly's olfactory system, which clusters patterns of chemicals into categories of smells (Stevens, 2015), and has inspired the so-called Fruit Fly Optimization Algorithm (Pan, 2011; here: Fruit Fly Algorithm or 'FFA'). The FFA has been implemented as a lightweight neural algorithm that performs random indexing for locality-sensitive hashing (LSH) (Dasgupta et al., 2017). This LSH algorithm has successfully been applied to various tasks, particularly in information retrieval and data compression (Andoni and Indyk, 2008). As a simple LSH algorithm, the FFA compresses data while preserving the notion of similarity of the original data, which is one of the core mechanisms involved in constructing vector representations of word meaning. To our knowledge, however, it has never been taken as the basis for building distributional semantic models from scratch, even though it seems to naturally fulfill a number of requirements of those models.

In the following, we present the FFA and show how it can be adapted to create vector spaces of word meaning (§4). We then apply the FFA in an incremental setup (§5) and assess its worth as a model according to the various criteria highlighted above (§6), including a possible interpretation of the FFA's output.

2 Related work

In Distributional Semantics (DS: Turney and Pantel, 2010; Erk, 2012), the meaning of words is represented by points in a multidimensional space, derived from word co-occurrence statistics. The quality of models usually correlates with the amount of data that is used. With increasing processing resources and larger corpora available, a variety of approaches have been developed in that area (e.g., Bengio et al., 2003; Pennington et al., 2014; Mikolov et al., 2013). State-of-the-art models perform remarkably well and are often a core component of NLP applications. Recent work on DS (e.g., ELMo (Peters et al., 2018) and BERT (Devlin et al., 2018)) shifts the scope of representations from word meaning to sentence meaning, pushing performance, but also model complexity, even further.

The latest DS techniques yield high performance, but they have multiple shortcomings. First, they require massive amounts of text, followed by computationally intensive procedures involving weighting, dimensionality reduction, complex attention mechanisms, etc. The high complexity of most current architectures often comes at the cost of flexibility: once a language model is constructed, any new data requires a re-run of the complete system in order to be incorporated. This makes incrementality unsatisfiable in those frameworks (Sahlgren, 2005; Baroni et al., 2007). Further, architectures themselves have become increasingly complex, at the expense of transparency. We recall that even Word2Vec (W2V: Mikolov et al., 2013), which is a comparatively simple system by today's standards, has attracted a large amount of literature attempting to explain the effects of various hyperparameters in the model (Levy and Goldberg, 2014; Levy et al., 2015; Gittens et al., 2017). Finally, high-performance DS representations are hardly or not at all interpretable. As a result, much research has been dedicated to producing representations that are intuitively interpretable by humans (Murphy et al., 2012; Luo et al., 2015; Fyshe et al., 2015; Shin et al., 2018). These approaches typically attempt to preserve or reconstruct word labels for the basis of the dimensionality-reduced representations, but they can themselves require intensive procedures. In summary, it becomes apparent that the ideal vector-based semantics model that fulfills all requirements highlighted in our introduction has not yet been found.

The Fruit Fly Algorithm we present here can be related to two existing techniques in computer science: Random Indexing and Locality-Sensitive Hashing. Random Indexing (RI) is a simple and efficient method for dimensionality reduction (cf. Sahlgren, 2005), originally used to solve clustering problems (Kaski, 1998). It is also a less-travelled technique in distributional semantics (Kanerva et al., 2000; QasemiZadeh et al., 2017; QasemiZadeh and Kallmeyer, 2016). Its advocates argue that it fulfills a number of requirements of an ideal vector space construction method, in particular incrementality. As for Locality-Sensitive Hashing (LSH: Slaney and Casey, 2008), it is a way to produce hashes that preserve a notion of distance between points in a space, thus satisfying storage efficiency whilst maintaining the spatial configuration of a representation. A comparison of various hash functions for LSH, including RI, is provided by Paulevé et al. (2010).
3 Data

In the spirit of 'training small', the corpus used for our experiments is a subset of 100M words from the ukWaC corpus (Ferraresi et al., 2008), minimally pre-processed (tokenized and stripped of punctuation signs); this results in a corpus of 87.8M words. Following common practice, we quantitatively evaluate the FFA as a lexical acquisition algorithm by testing it over the MEN similarity dataset (Bruni et al., 2014), which consists of 3000 word pairs (751 unique English words), human-annotated for semantic relatedness.

For our experiments, we compute two co-occurrence count spaces over our corpus, with different context sizes (±2 and ±5 around the target). We only consider the 10k most frequent words in the data, ensuring we cover all 751 words in MEN.
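As a rough illustration of this counting step, the sketch below builds such a count space from a list of tokens. It is our own minimal reading of the setup, not the authors' released code: tokenisation, the data structures and the vocabulary handling are assumptions.

from collections import Counter, defaultdict

def cooccurrence_counts(tokens, window=5, vocab_size=10000):
    """Raw co-occurrence counts over the `vocab_size` most frequent words,
    using a symmetric +/- `window` context around each target."""
    freqs = Counter(tokens)
    vocab = {w for w, _ in freqs.most_common(vocab_size)}
    counts = defaultdict(Counter)
    for i, target in enumerate(tokens):
        if target not in vocab:
            continue
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            context = tokens[j]
            if j != i and context in vocab:
                counts[target][context] += 1
    return counts

For instance, counts = cooccurrence_counts(corpus_tokens, window=2) would produce the ±2 space described above, given a pre-tokenised corpus.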
4 Model

The Fruit Fly Algorithm mimics the olfactory system of the fruit fly, which assigns a pattern of binary activations to a particular smell (i.e., a combination of multiple chemicals), using sparse connections between just two neuronal layers. This mechanism allows the fly to 'conceptualize' its environment and to react appropriately to new smells by relating them to previous experiences. Our implementation of the FFA is an extension of the work of Dasgupta et al. (2017) which allows us to generate a semantic space by hashing each word, as represented by its co-occurrences in a corpus, to a pattern of binary activations.

[Figure 1: Schematic of the adapted FFA, with input size m = 4 and output size n = 6 (dense representation: 2). Darker cells correspond to higher activation.]

As in the original implementation, our FFA is a simple feedforward architecture consisting of two layers connected by random projections (Fig. 1). The input layer, the projection neuron layer or PN layer, consists of m nodes {x_1 ... x_m} which encode the raw co-occurrence counts of a target word with a particular context. To satisfy incrementality, m is variable and can grow as the algorithm encounters new data. If a new context is observed, then a node x_{m+1} is recruited to encode that context. A logarithmic function is applied to the input in order to diminish frequency effects of natural languages (Zipf, 1932). This 'flattens' activation across the PN layer, reducing the impact of very frequent words (e.g., stopwords). The second layer (Kenyon Cell layer or KC layer) consists of n nodes {y_1 ... y_n}. It is larger than the PN layer and fixed at a constant size (n does not grow). PN and KC are not fully connected. Instead, each KC receives a constant number of connections from the PN layer, randomly and uniformly allocated. In other words, the mapping from PN to KC is a bipartite connection matrix M such that M_ji = 1 if x_i is connected to y_j, and 0 otherwise. The connectivity of each PN is thus variable, albeit uniformly distributed. The activation function on each KC is simply the sum of the activations of its connected PNs. In the end, hashing is carried out via a winner-takes-all (WTA) procedure that 'remembers' the IDs of a small percentage of the most activated KCs as a compact representation of the word's meaning: WTA(y_i) = 1 if y_i is one of the k top values in y, and 0 otherwise.

The FFA's hyperparameters are expressed as a 5-tuple (f, m, n, c, h), where f is the flattening function, m is the size of the PN layer (initially 0), n is the size of the KC layer, c is the number of connections leading to any one KC, and h is the percentage of activated KCs to be hashed.
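To make the hashing step concrete, here is a minimal sketch of one possible reading of the description above (log flattening, sparse random PN-to-KC projections, summed KC activations, winner-takes-all). The class name, the use of NumPy and the non-incremental initialisation with a fixed PN size are our assumptions; the released implementation may differ.

import numpy as np

class FruitFly:
    """Schematic two-layer FFA: a PN layer of raw counts, a larger KC layer
    reached through sparse random connections, and a WTA hash over the KCs."""

    def __init__(self, pn_size, kc_size=40000, proj_size=20,
                 hash_percent=8, seed=0):
        rng = np.random.default_rng(seed)
        self.kc_size = kc_size              # n: number of Kenyon cells
        self.proj_size = proj_size          # c: PN connections per KC
        self.hash_percent = hash_percent    # h, as a percentage (8 = 0.08)
        # each KC receives proj_size connections drawn uniformly from the PNs
        self.projections = [rng.choice(pn_size, size=proj_size, replace=False)
                            for _ in range(kc_size)]

    def hash_vector(self, counts):
        """Hash one word's raw co-occurrence counts into a binary KC pattern."""
        flattened = np.log(1.0 + np.asarray(counts, dtype=float))   # f = ln
        kc_activation = np.array([flattened[pns].sum()
                                  for pns in self.projections])
        k = int(self.kc_size * self.hash_percent / 100)             # WTA
        hash_vec = np.zeros(self.kc_size, dtype=np.int8)
        hash_vec[np.argsort(kc_activation)[-k:]] = 1
        return hash_vec

Usage would be along the lines of fly = FruitFly(pn_size=10000) followed by fly.hash_vector(counts_row) for each row of the raw co-occurrence space.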
Note that, since both the connectivity per KC and the size of the KC layer are constant, the overall number of connections is constant. Thus, the expansion mechanism (which increments m) does not create new connections: it randomly selects existing PNs and reallocates connections from those PNs to the new PN. In the reallocation process, we encode a bias towards taking connections from the PNs with the most outgoing connections, in order to keep the connectivity of the PN layer even. For example, in a setup with parameters (f = ln, m = 300, n = 10000, c = 14, h = 8), the average number of connections going out from each PN is (n × c)/m ≈ 466.67: some PNs have 466 connections, some have 467 or more. The next newly encountered word will lead to the creation of x_301, and the expansion process will reallocate ⌊(n × c)/301⌋ = 465 already existing connections to x_301. For this, it will choose PNs with 467 or more connections with a higher probability than those with 466 connections. The parameters after the expansion process are (f = ln, m = 301, n = 10000, c = 14, h = 8).
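The reallocation step can be sketched as follows, continuing the listing above (projections maps each KC to the indices of its PNs). This is our own, unoptimised approximation; in particular, the degree-proportional sampling is a stand-in for the bias towards well-connected PNs described in the text.

import numpy as np

def expand_pn_layer(projections, pn_size, rng=None):
    """Add PN number `pn_size` and reallocate about (n*c)/(m+1) existing
    connections to it, preferring donors among the best-connected PNs."""
    if rng is None:
        rng = np.random.default_rng(0)
    new_pn = pn_size
    total = sum(len(pns) for pns in projections)     # n * c, stays constant
    to_move = total // (pn_size + 1)                 # e.g. 465 for m + 1 = 301
    out_degree = np.bincount(np.concatenate(projections),
                             minlength=pn_size).astype(float)
    for _ in range(to_move):
        donor = rng.choice(pn_size, p=out_degree / out_degree.sum())
        # rewire one of the donor's connections to the new PN
        candidates = [kc for kc, pns in enumerate(projections)
                      if donor in pns and new_pn not in pns]
        if not candidates:
            continue
        kc = int(rng.choice(candidates))
        pns = projections[kc].copy()
        pns[np.where(pns == donor)[0][0]] = new_pn
        projections[kc] = pns
        out_degree[donor] -= 1
    return projections, pn_size + 1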
The expansion of dimensions from the PN layer to the KC layer, in combination with random projections, can be interpreted as a form of 'zooming' into a concept for a particular target word: multiple context words are randomly projected onto a single KC. If several of these context words are important for the target (i.e., their PNs have high activation), the corresponding KC will be activated in the final hash. We can imagine this process as aggregating dimensions of the original co-occurrence space, thus generating latent features which give different 'views' into the raw data. For example, one might imagine that the random projections from the PNs beak, bill, bank, wing, and feather have one KC in common. This KC might be somewhat activated by the PNs bank and bill in finance contexts, but more crucially, it will consistently be strongly activated for target words related to birds and thus selected for the final hashes of those words. Note that this behaviour lets us backtrack from a dimensionality-reduced representation to the most characteristic contexts for a particular target word, and gives interpretability to the KCs. We will come back to that feature in §6.

5 Experiments and results

In order to characterize the behavior and performance of our incremental FFA, we evaluate the quality of its output vectors against the MEN test set by means of the non-parametric Spearman rank correlation ρ. In order to run the experiments with a sound configuration of the hyperparameters f, n, c, and h, we first perform a grid search, applying various configurations of the FFA to the counts (window size: ±5) of the 10k most frequent words of a held-out corpus.[2] For this setting, the grid search yields the following optimal configuration: (f = ln, n = 40000, c = 20, h = 0.08); we use this for all further experiments.[3] (The grid search in fact revealed that the factor of expansion n/m is minimally important.)

[2] We restricted the grid search and the subsequent experimental setup to a vocabulary of 10k words for more convenient experimentation. The actual FFA potentially has no such limit.

[3] The source code of this implementation of the FFA will be released for public use on git@github.com:SimonPreissner/semantic-fruitfly.git

Next, we incrementally generate a raw frequency-count model of the 10k most frequent words of our corpus, expanding the FFA in parallel with every newly encountered word. Every 1M processed words, the aggregated co-occurrences are hashed by the FFA and the corresponding word vectors (i.e., binary hashes) are stored for evaluation. We compare a) the raw frequency space (input to the FFA); b) the final hashes (output of the FFA); c) a separate Word2Vec (W2V) model trained on exactly the same data, using standard hyperparameters and a minimum count set to match the 10k target words of our co-occurrence space. We repeat this experiment for window sizes ±2 and ±5.

[Figure 2: ρ-values of co-occurrence counts, hashed spaces, and Word2Vec models (window sizes ±2 (lines) and ±5 (dotted)). The blue dot shows the performance on POS-tagged data with FFA-5.]

Figure 2 shows the results of our incremental simulation. For the window size ±5, we reach ρ = 0.100 for raw counts, ρ = 0.345 for the FFA output, and ρ = 0.600 for W2V. The 2-word-context setup yields very similar results. The FFA hashing thus has a clear and positive effect (+0.245 from 80M words onwards for the ±5 setup). The improvement is already large at the beginning of training (+0.136 at 5M words) and slowly increases with corpus size. Results are comparable to W2V for very small corpus sizes, but start lagging behind after around 10M words.
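For completeness, the evaluation loop against MEN can be sketched as follows. The use of cosine similarity over the binary hashes and the (word1, word2, score) tuple format for MEN are our assumptions; the paper does not commit to a particular similarity measure on the hashes.

import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    return float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)

def evaluate_men(hashes, men_pairs):
    """Spearman rho between model similarities and MEN relatedness scores.

    hashes:    dict mapping words to binary hash vectors
    men_pairs: iterable of (word1, word2, human_score) tuples
    """
    model_scores, gold_scores = [], []
    for w1, w2, gold in men_pairs:
        if w1 in hashes and w2 in hashes:
            model_scores.append(cosine(hashes[w1], hashes[w2]))
            gold_scores.append(gold)
    rho, _ = spearmanr(model_scores, gold_scores)
    return rho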
6 Discussion

Investigating cognitive algorithms from scratch requires a clear stance on evaluation: we cannot expect a very simple model to beat the performance of heavily-trained systems, but we can require it to give satisfactory results whilst also being a good model in the strong sense of the term, that is, simulating all observable features of a given real-world phenomenon. Our discussion keeps this in mind, as we focus on the 'wish list' highlighted in §1.

Performance: hashing increases performance over the raw co-occurrence space by over 20 points overall. The system is however outperformed by W2V after seeing around 10M words. In the spirit of providing a comprehensive evaluation of the modelling power of the FFA, we attempt to pull apart those aspects of the learning process that are captured by its very simple algorithm and those that are not. In other words, which feature results in the large increase over baseline performance? What does the FFA fail to model with respect to W2V? We know that the algorithm generates latent features out of the original space dimensions, encapsulated in each KC. We have tuned the size of the KC layer, so the number of features captured by the FFA should be optimal for our task. We assume that the performance displayed by the algorithm is due to correctly generalizing over contexts. As for its lack of performance, we can make hypotheses based on what we know from other DS models. The FFA does not perform any subsampling or weighting of its input data, and the log function we use to minimize the impact of very frequent items is probably too crude to fulfill that purpose. When we informally inspect the performance of the algorithm on a POS-tagged version of our corpus, keeping only verbs, nouns and adjectives in the input and filtering some highly frequent stopwords (punctuation, auxiliaries), we obtain ρ ≈ 0.51 over the whole corpus,[4] coming close to W2V's performance and thus indicating that indeed, a higher-level 'attention' mechanism could be added to the input layer. (Note that the olfactory system of actual fruit flies only has ≈ 50 odorant receptors, which makes it potentially less crucial to successfully suppress large parts of the input.)

[4] We use the top 4000 dimensions of the co-occurrence matrix, with n = 16000, c = 20 and h = 0.08.

Dimensionality: the size of the hashes produced by the FFA is variable; in the experiments, it was set to 3200,[5] which is much larger than the optimal 300-400 dimensions of W2V. However, the hash corresponds to a sparse vector of integers and is thus efficiently stored and manipulated. The hyperparameter grid search revealed that the factor of expansion from PN layer to KC layer is much less important than expected, even though the expansion is a core characteristic of the FFA and, intuitively, its factor should have an effect on performance. This suggests that the FFA does not require inconveniently high-dimensional hash signatures to reach its performance. However, it will take further experiments, especially with larger vocabularies, to fully characterize this behaviour.

[5] This results from expressing the (n = 40k-dimensional) binary vector as the positions of its 1s, which make up h = 8% of the vector. This yields a much smaller representation of length n × h = 3200.
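Concretely, the storage scheme of footnote 5 just keeps the indices of the active KCs. A minimal sketch (ours, under the same NumPy assumptions as the earlier listings):

import numpy as np

def to_sparse(hash_vec):
    """Keep only the positions of the 1s: n*h integers
    (3200 for n = 40k and h = 8%) instead of an n-dimensional vector."""
    return np.flatnonzero(hash_vec).astype(np.int32)

def to_dense(active_kcs, kc_size=40000):
    """Recover the full binary vector when needed."""
    dense = np.zeros(kc_size, dtype=np.int8)
    dense[active_kcs] = 1
    return dense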
Incrementality: the FFA is fully incremental. Note that in our experiments, the W2V space is retrained from scratch after each addition of 1M words to the corpus, while the FFA simply increments counts in its stored co-occurrence space. This is also in stark contrast with weighted count-based distributional models, which require some global PMI (re-)computation to outperform the raw co-occurrence count vectors.

Time efficiency: our FFA runs without costly learning mechanisms; its two most costly operations are (1) the expansion of the PN layer along with new vocabulary and (2) the projection from PN layer to KC layer. Following Zipf's Law, most new words are encountered within the first few million words. As a consequence, the frequency of expansion operations on the PN layer is high at first, but decreases rapidly, resulting in fast scaling to large amounts of text. Hashing depends solely on the number of connections per KC and the size of the KC layer (both constant).

Interpretability: the FFA's two-layer architecture allows for uncomplicated backtracking. Each of the activated nodes in a word's hash represents a single KC. The connections of these 'winner' KCs with the PN layer let us reconstruct which context words originally contributed to the largest activations in the KC layer. To illustrate this, we use the hashes obtained at the last iteration of our incremental experiment (based on window ±5) and identify the k = 50 most characteristic PNs for each hash, ignoring stopwords. Table 1 reports the characteristic PNs shared by various sets of input words.
Hashed Words                  Mutual Important Words
hawk, pigeon, parrot          tailed, breasted, black, red, dove
library, collection, museum   collection, national, new, art
beard, wig                    man, wearing, long, like, hair
cold, dirty                   get, said, war, mind

Table 1: Top PNs for selected sets of words. The importance of a PN for a word is estimated by the number of connections to KCs that are activated in the word's hash (window size ±5).

For example, for the words hawk, pigeon, and parrot, the tailed, black, breasted, red, and dove PNs are among the most influential, contributing to many of the activated KCs. Similarly, we can connect beard to wig and cold to dirty; the shared important words of the latter pair seem to encode shared collocates (cold/dirty war, cold/dirty mind, get cold/dirty).
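The backtracking behind Table 1 can be sketched as follows, reusing the projections structure from the earlier listings (one array of PN indices per KC). This is our reading of the procedure, with a PN counting once for every winner KC it feeds into; the variable names are ours.

from collections import Counter

def important_pns(hash_vec, projections, id_to_word, top_k=50, stopwords=()):
    """Rank context words (PNs) by how many of their connections lead into
    KCs that are active in the word's hash."""
    pn_scores = Counter()
    for kc, active in enumerate(hash_vec):
        if active:
            for pn in projections[kc]:
                word = id_to_word[pn]
                if word not in stopwords:
                    pn_scores[word] += 1
    return [w for w, _ in pn_scores.most_common(top_k)]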

7 Conclusion

We started this paper by suggesting that NLP should explore a different class of algorithms for its most fundamental tasks. We argued that it is worth investigating cognitively-inspired architectures, which may not (yet) perform at state-of-the-art level, but give us insights into potentially more plausible ways to model linguistic faculties in the mind. We also made a case for 'small' and 'fair' systems, in reach of all researchers and end-users.

As an illustration, we have explored what the olfactory system of a fruit fly can do for the representation of word meanings. The algorithm is certainly 'fair' in terms of complexity and required resources. Being based on an actual cognitive mechanism, it naturally encodes requirements such as (processing and storage) efficiency. Its simplicity lends itself to incremental learning and interpretability. Performance on a relatedness dataset highlights that the raw model successfully captures latent concepts in the data but would probably require an extra attention layer, as indicated by the stronger results obtained on additionally pre-processed data.

We hope to have demonstrated that such a study is accessible to all, and actually sheds light on the minimal components of a model in a way that more complex systems do not achieve. We particularly draw attention to the fact that the interesting behaviour of the fruit fly with respect to interpretability and incrementality makes it a worthy competitor for other distributional models, or at the very least, a source of inspiration.

References

Alexandr Andoni and Piotr Indyk. 2008. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. Communications of the ACM, 51(1):117.

Marco Baroni, Alessandro Lenci, and Luca Onnis. 2007. ISA meets Lara: An incremental word space model for cognitively plausible simulations of semantic learning. In Proceedings of the Workshop on Cognitive Aspects of Computational Language Acquisition, pages 49–56.

Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3(Feb):1137–1155.

Elia Bruni, Nam-Khanh Tran, and Marco Baroni. 2014. Multimodal distributional semantics. Journal of Artificial Intelligence Research, 49:1–47.

Sanjoy Dasgupta, Charles F. Stevens, and Saket Navlakha. 2017. A neural algorithm for a fundamental computing problem. Science, 358(6364):793–796.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Katrin Erk. 2012. Vector space models of word meaning and phrase meaning: A survey. Language and Linguistics Compass, 6(10):635–653.

Adriano Ferraresi, Eros Zanchetta, Marco Baroni, and Silvia Bernardini. 2008. Introducing and evaluating ukWaC, a very large web-derived corpus of English. In Proceedings of the 4th Web as Corpus Workshop (WAC-4) Can we beat Google, pages 47–54.

Alona Fyshe, Leila Wehbe, Partha P. Talukdar, Brian Murphy, and Tom M. Mitchell. 2015. A compositional and interpretable semantic space. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 32–41.

Alex Gittens, Dimitris Achlioptas, and Michael W. Mahoney. 2017. Skip-gram - Zipf + uniform = vector additivity. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 69–76.
Pentti Kanerva, Jan Kristoferson, and Anders Holst. 2000. Random indexing of text samples for latent semantic analysis. In Proceedings of the Annual Meeting of the Cognitive Science Society, volume 22.

Samuel Kaski. 1998. Dimensionality reduction by random mapping: Fast similarity computation for clustering. In 1998 IEEE International Joint Conference on Neural Networks Proceedings. IEEE World Congress on Computational Intelligence (Cat. No. 98CH36227), volume 1, pages 413–418. IEEE.

Omer Levy and Yoav Goldberg. 2014. Neural word embedding as implicit matrix factorization. In Advances in Neural Information Processing Systems, pages 2177–2185.

Omer Levy, Yoav Goldberg, and Ido Dagan. 2015. Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics, 3:211–225.

Hongyin Luo, Zhiyuan Liu, Huanbo Luan, and Maosong Sun. 2015. Online learning of interpretable word embeddings. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1687–1692.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Brian Murphy, Partha Talukdar, and Tom Mitchell. 2012. Learning effective and interpretable semantic models using non-negative sparse embedding. In Proceedings of COLING 2012, pages 1933–1950.

Wen-Tsao Pan. 2011. A new evolutionary computation approach: Fruit fly optimization algorithm. In Proceedings of the Conference on Digital Technology and Innovation Management.

Loïc Paulevé, Hervé Jégou, and Laurent Amsaleg. 2010. Locality sensitive hashing: A comparison of hash function types and querying mechanisms. Pattern Recognition Letters, 31(11):1348–1358.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. arXiv preprint arXiv:1802.05365.

Behrang QasemiZadeh and Laura Kallmeyer. 2016. Random positive-only projections: PPMI-enabled incremental semantic space construction. In Proceedings of the Fifth Joint Conference on Lexical and Computational Semantics, pages 189–198.

Behrang QasemiZadeh, Laura Kallmeyer, and Aurélie Herbelot. 2017. Projection aléatoire non-négative pour le calcul de word embedding [Non-negative random projection for computing word embeddings]. In 24e Conférence sur le Traitement Automatique des Langues Naturelles (TALN), pages 109–122.

Magnus Sahlgren. 2005. An introduction to random indexing. In Proceedings of the Methods and Applications of Semantic Indexing Workshop at the 7th International Conference on Terminology and Knowledge Engineering (TKE).

Jamin Shin, Andrea Madotto, and Pascale Fung. 2018. Interpreting word embeddings with eigenvector analysis. 32nd Conference on Neural Information Processing Systems (NIPS 2018), IRASL workshop.

Malcolm Slaney and Michael Casey. 2008. Locality-sensitive hashing for finding nearest neighbors [lecture notes]. IEEE Signal Processing Magazine, 25(2):128–131.

Charles F. Stevens. 2015. What the fly's nose tells the fly's brain. Proceedings of the National Academy of Sciences, 112(30):9460–9465.

Emma Strubell, Ananya Ganesh, and Andrew McCallum. 2019. Energy and policy considerations for deep learning in NLP. arXiv preprint arXiv:1906.02243.

Peter D. Turney and Patrick Pantel. 2010. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37:141–188.

George Kingsley Zipf. 1932. Selected Studies of the Principle of Relative Frequency in Language. Harvard University Press.