How Much Competence Is There in Performance?
Assessing the Distributional Hypothesis in Word Bigrams

Johann Seltmann, University of Potsdam♦, jseltmann@uni-potsdam.de
Luca Ducceschi, University of Trento♣, luca.ducceschi@unitn.it
Aurélie Herbelot, University of Trento♠, aurelie.herbelot@unitn.it

♦ Department of Linguistics; ♣ Dept. of Psychology and Cognitive Science; ♠ Center for Mind/Brain Sciences, Dept. of Information Engineering and Computer Science
Abstract

The field of Distributional Semantics (DS) is built on the 'distributional hypothesis', which states that meaning can be recovered from statistical information in observable language. It is however notable that the computations necessary to obtain 'good' DS representations are often very involved, implying that if meaning is derivable from linguistic data, it is not directly encoded in it. This prompts fundamental questions about language acquisition: if we regard text data as linguistic performance, what kind of 'innate' mechanisms must operate over that data to reach competence? In other words, how much of semantic acquisition is truly data-driven, and what must be hard-encoded in a system's architecture? In this paper, we introduce a new methodology to pull those questions apart. We use state-of-the-art computational models to investigate the amount and nature of transformations required to perform particular semantic tasks. We apply that methodology to one of the simplest structures in language: the word bigram, giving insights into the specific contribution of that linguistic component.1

1 Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

The traditional notions of performance and competence come from Chomsky's work on syntax (Chomsky, 1965), where much emphasis is put on the mental processes underpinning language acquisition. Chomsky posits the existence of a Universal Grammar (UG), innate in the human species, which gets specialised to the particular language of a speaker. By exposure to the imperfect utterances of their community (referred to as performance data), an individual configures their UG to reach some ideal knowledge of that community's language, thereby reaching competence.

The present paper borrows the notions of 'performance', 'competence' and 'innateness' to critically analyse the semantic 'acquisition' processes simulated by Distributional Semantics models (DSMs). Our goal is to tease apart how much of their observed competence is due to the performance data they are exposed to, and how much is contributed by 'innate' properties of those systems, i.e. by their specific architectures.

DSMs come in many shapes. Traditional unsupervised architectures rely on counting co-occurrences of words with other words or documents (Turney and Pantel, 2010; Erk, 2012; Clark, 2012). Their neural counterparts, usually referred to as 'predictive models' (Baroni et al., 2014), learn from a language modelling task over raw linguistic data (e.g. Word2Vec, Mikolov et al., 2013; GloVe, Pennington et al., 2014). The most recent language embedding models (Vaswani et al., 2017; Radford et al., 2018), ELMo (Peters et al., 2018), or BERT (Devlin et al., 2018) compute contextualised word and sentence representations, yielding state-of-the-art results on sentence-related tasks, including translation. In spite of their differences, all models claim to rely on the Distributional Hypothesis (Harris, 1954; Firth, 1957), that is, the idea that distributional patterns of occurrence in language correlate with specific aspects of meaning.
The Distributional Hypothesis, as stated in the DSM literature, makes semantic acquisition sound like an extremely data-driven procedure. But we should ask to what extent meaning is indeed to be found in statistical patterns. The question is motivated by the observation that the success of the latest DSMs relies on complex mechanisms being applied to the underlying linguistic data or the task at hand (e.g. attention, self-attention, negative sampling, particular objective functions). Such mechanisms have been shown to apply very significant transformations to the original input data: for instance, the Word2Vec objective function introduces parallelisms in the space that make it perform particularly well on analogy tasks (Gittens et al., 2017). Models such as BERT apply extensive processing to the input through stacks of encoders. So while meaning can be derived from training regimes involving raw data, it is not directly encoded in it.

Interestingly, Harris himself (Harris, 1954) points out that a) distributional structure is in no simple relation to the structure of meaning; b) different distributions in language encode different phenomena with various levels of complexity. We take both points as highlighting the complex relation between linguistic structure and the cognitive mechanisms that must be applied to the raw input to retrieve semantic information. The point of our paper is to understand better what is encoded in observable linguistic structures (at the level of raw performance data), and how much distortion of the input needs to be done to acquire meaning (i.e. what cognitive mechanisms are involved in learning semantic competence).

In the spirit of Harris, we think it is worth investigating the behaviour of specific components of language and understanding which aspects of meaning they encode, and to what extent. The present work illustrates our claim by presenting an exploratory analysis of one of the simplest recoverable structures in corpora: the word bigram. Our methodology is simple: we test the raw distributional behaviour of the constituent over different tasks, comparing it to a state-of-the-art model. We posit that each task embodies a specific aspect of competence. By inspecting the difference in performance between the simplest and more complex models, we get some insight into the way a particular structure (here, the bigram) contributes to the acquisition of specific linguistic faculties. The failure of raw linguistic data to encode a particular competence points at some necessary, 'innate' constraint of the acquisition process, which might be encoded in a model's architecture as well as in the specific task that it is required to solve.

In what follows, we propose to investigate the behaviour of the bigram with respect to three different levels of semantic competence, corresponding to specific tasks from the DS literature: a) word relatedness; b) sentence relatedness; c) sentence autoencoding (Turney, 2014; Bowman et al., 2016). The first two tasks test to what extent the linguistic structure under consideration encodes topicality: if it does, it should prove able to cluster together similar lexical items, both in isolation and as the constituents of sentences. The third task evaluates the ability of a system to build a sentence representation and, from that representation alone, recover the original utterance. That is, it tests distinguishability of representations. Importantly, distinguishability is at odds with the relatedness tasks, which favour clusterability. The type of space learned from the raw data will necessarily favour one or the other. Our choice of tasks thus allows us to understand which type of space can be learned from the bigram: we will expand on this in our discussion (§6).2

2 Our code for this investigation can be found under https://github.com/sejo95/DSGeneration.git.

2 Related work
The Distributional Hypothesis is naturally encoded in count-based models of Distributional Semantics (DS), which build lexical representations by gathering statistics over word co-occurrences. Over the years, however, these simple models have been superseded by so-called predictive models such as Word2Vec (Mikolov et al., 2013) or FastText (Bojanowski et al., 2017), which operate via language modelling tasks. These neural models involve sets of more or less complex procedures, from subsampling to negative sampling and subword chunking, which give them a clear advantage over methods that stick more closely to distributions in corpora. At the level of higher constituents, the assumption is that a) additional composition functions must be learned over the word representations to generate meaning 'bottom-up' (Clark, 2012; Erk, 2012); b) the semantics of a sentence influences the meaning of its parts 'top-down', leading to a notion of contextualised word semantics, retrievable by yet another class of distributional models (Erk and Padó, 2008; Erk et al., 2010; Thater et al., 2011; Peters et al., 2018). Bypassing the word level, some research investigates the meaning of sentences directly. Following from classic work on seq2seq architectures and attention, various models have been proposed to generate sentence embeddings through highly parameterised stacks of encoders (Vaswani et al., 2017; Radford et al., 2018; Devlin et al., 2018).

This very brief overview of work in DS shows the variety of models that have been proposed to encode meaning at different levels of constituency, building on more and more complex mechanisms. Aside from those efforts, much research has also focused on finding ideal hyperparameters for the developed architectures (Bullinaria and Levy, 2007; Baroni et al., 2014), ranging from the amount of context taken into account by the model to the type of task it should be trained on. Overall, it is fair to say that if meaning can be retrieved from raw language data, the process requires knowing the right transformations to apply to that data, and the right parametrisation for those transformations, including the type of linguistic structure the model should focus on. One important question remains for the linguist to answer: how much semantics was actually contained in corpus statistics, and where? We attempt to set up a methodology to answer this question, and use two different types of tasks (relatedness and autoencoding) to support our investigation.

While good progress has been made in the DS community on modelling relatedness, distinguishability has received less attention. Some approaches to autoencoding suggest using syntactic elements (such as syntax trees) for the decomposition of an embedding vector into a sentence (Dinu and Baroni, 2014; Iyyer et al., 2014). However, some research suggests that this may not be necessary and that continuous bag-of-words representations and n-gram models contain enough word order information to reconstruct sentences (Schmaltz et al., 2016; Adi et al., 2017). Our own methodology is inspired by White et al. (2016b), who decode a sentence vector into a bag of words using a greedy search over the vocabulary. In order to also recover word order, those authors expand their original system in White et al. (2016a) by combining it with a traditional trigram model, which they use to reconstruct the original sentence from the bag of words.

3 Methodology

3.1 A bigram model of Distributional Semantics

We construct a count-based DS model by taking bigrams as our context windows. Specifically, for a word w_i, we construct an embedding vector v_i which has one entry for each word w_j in the model. The entry v_ij then contains the bigram probability p(w_j|w_i).

We talked in our introduction of 'raw' linguistic structure without specifying at which level it is to be found. Following Church and Hanks (1990), we consider the joint probability of two events, relative to their probability of occurring independently, to be a good correlate of the fundamental psycholinguistic notion of association. As per previous work, we thus assume that a PMI-weighted DS space gives the most basic representation of the information contained in the structure of interest. For our bigram model, the numerator and denominator of the PMI calculation exactly correspond to elements in our bigram matrix B weighted by elements of our unigram vector U:

    pmi(w_i, w_j) ≡ log( p(w_j|w_i) / p(w_j) )    (1)

In practice, we use PPMI weighting and map every negative PMI value to 0.
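The following Python sketch illustrates this kind of bigram space; it is a minimal re-implementation for illustration only (the function names, the <s>/</s> markers and the dense numpy representation are illustrative choices; the actual code used for the experiments is linked in footnote 2).

# Minimal sketch of the bigram space described above: B holds bigram counts,
# U unigram counts, and rows of the PPMI matrix serve as word vectors.
import numpy as np

BOS, EOS = "<s>", "</s>"

def build_counts(sentences, vocab):
    """`vocab` is assumed to contain BOS and EOS; out-of-vocabulary tokens are skipped."""
    idx = {w: i for i, w in enumerate(vocab)}
    B = np.zeros((len(vocab), len(vocab)))   # B[i, j]: count of the bigram "w_i w_j"
    U = np.zeros(len(vocab))                 # U[i]: count of w_i
    for sent in sentences:
        toks = [BOS] + [t for t in sent if t in idx] + [EOS]
        for t in toks:
            U[idx[t]] += 1
        for a, b in zip(toks, toks[1:]):
            B[idx[a], idx[b]] += 1
    return B, U, idx

def ppmi(B, U):
    """PPMI(w_i, w_j) = max(0, log p(w_j|w_i) - log p(w_j)), as in Equation (1)."""
    p_cond = B / np.maximum(U[:, None], 1)   # p(w_j | w_i) ≈ count(w_i w_j) / count(w_i)
    p_marg = U / U.sum()                     # p(w_j)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_cond) - np.log(p_marg)
    pmi[~np.isfinite(pmi)] = 0.0             # unseen bigrams contribute nothing
    return np.maximum(pmi, 0.0)              # map negative PMI values to 0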
Word relatedness: following standard practice, we compute relatedness scores as the cosine similarity of two PPMI-weighted word vectors, cos(w_i, w_j). For evaluation, we use the MEN test collection (Bruni et al., 2014), which contains 3000 word pairs annotated for relatedness; we compute the Spearman ρ correlation between system and human scores.

Sentence relatedness: we follow the proof given by Paperno and Baroni (2016), indicating that the meaning of a phrase ab in a count-based model with PMI weighting is roughly equivalent to the addition of the PMI-weighted vectors of a and b (shifted by some usually minor correction). Thus, we can compute the similarity of two sentences S1 and S2 as:

    cos( Σ_{w_i ∈ S1} w_i , Σ_{w_j ∈ S2} w_j )    (2)

We report sentence relatedness scores on the SICK dataset (Marelli et al., 2014), which contains 10,000 utterance pairs annotated for relatedness. We calculate the relatedness for each pair in the dataset and order the pairs according to the results. We then report the Spearman correlation between the results of the model and the ordering of the dataset.
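Given such a PPMI matrix, the two relatedness measures reduce to cosine similarities over single rows (words) and over sums of rows (sentences). A minimal sketch, with illustrative helper names and the MEN/SICK evaluation only indicated in a comment:

# Relatedness scoring over the PPMI space from the previous sketch.
# `M` is the PPMI matrix and `idx` the word-to-row mapping.
import numpy as np

def cosine(u, v):
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v) / denom if denom > 0 else 0.0

def sent_vector(M, idx, tokens):
    # Additive composition: a sentence vector is the sum of its word vectors.
    v = np.zeros(M.shape[1])
    for t in tokens:
        if t in idx:
            v += M[idx[t]]
    return v

def word_relatedness(M, idx, w1, w2):
    return cosine(M[idx[w1]], M[idx[w2]])

def sentence_relatedness(M, idx, s1, s2):
    return cosine(sent_vector(M, idx, s1), sent_vector(M, idx, s2))

# Evaluation against MEN or SICK then correlates system and gold scores, e.g.:
# rho, _ = scipy.stats.spearmanr(system_scores, gold_scores)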
Autoencoding of sentences: White et al. (2016b) encode a sentence as the sum of the word embedding vectors of the words of that sentence. They decode that vector (the target) back into a bag of words in two steps. The first step, greedy addition, begins with an empty bag of words. In each step, a word is selected such that the sum of the word vectors in the bag and the vector of the candidate item is closest to the target (using Euclidean distance as similarity measure). This is repeated until no new word would bring the sum closer to the target than it already is. The second step, n-substitution, begins with the bag of n words found by the greedy addition. For each subbag of size m ≤ n, it considers replacing it with another possible subbag of size ≤ m. The replacement that brings the sum closest to the target vector is chosen. We follow the same procedure, except that we only consider subbags of size 1, i.e. substitution of single words, for computational efficiency. In addition, the bigram component of our model B lets us turn the bags of words back into an ordered sequence.3 We use a beam search to perform this step, following Schmaltz et al. (2016).

3 Note that although a bigram language model would normally perform rather poorly on sentence generation, having a constrained bag-of-words to reorder makes the task considerably simpler.
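A compact sketch of this decoding and reordering pipeline, written for illustration rather than taken from the released code: greedy addition and single-word substitution recover a bag of word indices from a target sum vector, and a beam search over bigram log-probabilities reorders the bag. The beam width, the cap on bag size and all names are illustrative assumptions.

# `E` is a word-by-dimension embedding matrix; `logp[i, j]` is a (smoothed)
# log p(w_j | w_i); `bos`/`eos` are the sentence-boundary indices.
import numpy as np

def greedy_addition(target, E, max_words=30):
    """Add the word whose vector brings the running sum closest to `target`
    (Euclidean distance) until no further addition improves the distance."""
    bag, total = [], np.zeros_like(target)
    best = np.linalg.norm(target - total)
    for _ in range(max_words):                       # cap for safety only
        dists = np.linalg.norm(target - (total + E), axis=1)
        cand = int(np.argmin(dists))
        if dists[cand] >= best:
            break
        bag.append(cand)
        total += E[cand]
        best = dists[cand]
    return bag

def one_substitution(target, E, bag):
    """Replace single words in the bag whenever a swap brings the sum closer."""
    bag = list(bag)
    improved = True
    while improved:
        improved = False
        total = E[bag].sum(axis=0) if bag else np.zeros_like(target)
        best = np.linalg.norm(target - total)
        for pos, w in enumerate(bag):
            rest = total - E[w]
            dists = np.linalg.norm(target - (rest + E), axis=1)
            cand = int(np.argmin(dists))
            if dists[cand] < best and cand != w:
                bag[pos], total, best, improved = cand, rest + E[cand], dists[cand], True
    return bag

def reorder(bag, logp, bos, eos, beam_width=10):
    """Beam search for a high-probability ordering of the bag under the bigram model."""
    beams = [(0.0, bos, list(bag), [])]   # (score, last word, words left to place, sequence)
    for _ in range(len(bag)):
        expanded = []
        for score, last, remaining, seq in beams:
            for k in set(remaining):
                rest = list(remaining)
                rest.remove(k)
                expanded.append((score + logp[last, k], k, rest, seq + [k]))
        expanded.sort(key=lambda h: h[0], reverse=True)
        beams = expanded[:beam_width]     # keep the best partial orderings
    best = max(beams, key=lambda h: h[0] + logp[h[1], eos])
    return best[3]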
We evaluate sentence autoencoding in two ways. First, we test the bag-of-words reconstruction on its own, by feeding the system the encoded sentence embedding and evaluating whether it can retrieve all single words contained in the original utterance. We report the proportion of perfectly reconstructed bags-of-words across all test instances. Second, we test the entire autoencoding process, including word re-ordering. We use two different metrics: a) the BLEU score (Papineni et al., 2002), which computes how many n-grams of a decoded sentence are shared with several reference sentences, giving a precision score; b) the CIDEr-D score (Vedantam et al., 2015), which accounts for both precision and recall and is computed using the average cosine similarity between the vector of a candidate sentence and a set of reference vectors. For this evaluation, we use the PASCAL-50S dataset (included in CIDEr-D), a caption generation dataset that contains 1000 images with 50 reference captions each. We encode and decode the first reference caption for each image and use the remaining 49 as references for the CIDEr and BLEU calculations.
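The first evaluation mode and the BLEU part of the second could be computed roughly as follows; the use of NLTK's sentence_bleu and all variable names are illustrative choices, and CIDEr-D would require a separate implementation (e.g. the scorer distributed with the metric).

# Sketch of the two evaluation modes: exact bag-of-words match and BLEU
# against multiple references. CIDEr-D is omitted here.
from collections import Counter
from nltk.translate.bleu_score import sentence_bleu

def exact_bag_match_rate(decoded_bags, gold_sentences):
    """Proportion of test items whose decoded bag equals the gold bag of words."""
    hits = sum(Counter(bag) == Counter(gold)
               for bag, gold in zip(decoded_bags, gold_sentences))
    return hits / len(gold_sentences)

def bleu_against_references(decoded_sentence, reference_sentences):
    """`decoded_sentence` is a token list; `reference_sentences` is a list of
    token lists (e.g. the 49 remaining PASCAL-50S captions for an image)."""
    return sentence_bleu(reference_sentences, decoded_sentence)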
For the actual implementation of the model, we build B and U from 90% of the BNC (≈ 5.4 million sentences), retaining 10% for development purposes. We limit our vocabulary to the 50,000 most common words in the corpus; the matrix is therefore of size 50002 × 50002, including tokens for sentence beginning and end.
3.2 Comparison

In what follows, we compare our model to two Word2Vec models, which provide an upper bound for what a DS model may be able to achieve. One model, W2V-BNC, is trained from scratch on our BNC background corpus, using gensim (Řehůřek and Sojka, 2010) with 300 dimensions, window size ±5, and ignoring words that occur fewer than five times in the corpus. The other model, W2V-LARGE, is given by out-of-the-box vectors released by Baroni et al. (2014): that model is trained on 2.5B words, giving an idea of the system's performance on larger data. In all cases, we limit the vocabulary to the same 50,000 words included in the bigram model.
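For reference, training a W2V-BNC-style baseline with gensim amounts to a configuration along these lines; the toy corpus below is a placeholder for the tokenised BNC, and the dimensionality argument is called size in gensim 3.x releases and vector_size in later ones.

# Rough sketch of the W2V-BNC configuration described above:
# 300 dimensions, window of ±5, rare words ignored.
from gensim.models import Word2Vec

sentences = [["the", "cat", "sat"], ["the", "dog", "ran"]]  # stands in for the BNC

w2v_bnc = Word2Vec(
    sentences=sentences,
    vector_size=300,   # called `size` in gensim 3.x
    window=5,
    min_count=1,       # min_count=5 on the full corpus
)
vec = w2v_bnc.wv["cat"]   # access a trained word vector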
Note that given space restrictions, we do not disentangle the contribution of the models themselves and the particular type of linguistic structure they are trained on. Our results should thus be taken as an indication of the amount of information encoded in a raw bigram model compared to what can be obtained by a state-of-the-art model using the best linguistic structure at its disposal (here, a window of ±5 words around the target).

4 Results

Word relatedness: the bigram model obtains an acceptable ρ = 0.48 on the MEN dataset. W2V-BNC and W2V-LARGE perform very well, reaching ρ = 0.72 and ρ = 0.80. Note that whilst the bigram model lags well behind W2V, it achieves its score with what is in essence a unidirectional model with a window of size 1 – that is, with as minimal input as it can get, seeing 10 times fewer co-occurrences than W2V-BNC.

Sentence relatedness: the bigram model obtains ρ = 0.40 on the sentence relatedness task. Interestingly, that score increases by 10 points, to ρ = 0.50, when filtering away frequent words with probability over 0.005. W2V-BNC and W2V-LARGE give respectively ρ = 0.59 and ρ = 0.61.

Sentence autoencoding: we evaluate sentence autoencoding on sentences from the Brown corpus (Kučera and Francis, 1967), using seven bins for different sentence lengths (from 3-5 words to 21-23 words). Each bin contains 500 sentences. In some cases, the sentences contain words that are not present in the matrix and which are therefore skipped for encoding. We thus look at two different values: a) in how many cases the reconstruction returns exactly the words in the sentence; b) in how many cases the reconstruction returns the words in the sentence which are contained in the matrix (results in Table 1).

 sent. length    original sents.        in matrix
                  W2V      CB          W2V      CB
     3-5         0.556    0.792       0.686    0.988
     6-8         0.380    0.62        0.646    0.988
     9-11        0.279    0.586       0.548    1.0
    12-14        0.210    0.578       0.402    1.0
    15-17        0.178    0.338       0.366    0.978
    18-20        0.366    0.404       0.984    0.974
    21-23        0.306    0.392       0.982    0.968

Table 1: Fraction of exact matches in bag-of-words reconstruction (W2V refers to W2V-LARGE).

The bigram model shines in this task: ignoring words not contained in the matrix leads to almost perfect reconstruction. In comparison, the W2V model has extremely erratic performance (Table 1), with scores decreasing as a function of sentence length (from 0.686 for length 3-5 to 0.366 for length 15-17), but increasing again for lengths over 18.

One interesting aspect of the bigram model is that it also affords a semantic competence that W2V does not naturally have: encoding a sequence and decoding it back into an ordered sequence. We inspect how well the model does at that task, compared to a random reordering baseline. Results are listed in Table 2. The bigram model clearly beats the baseline for all sentence lengths. But it is expectedly limited by the small n-gram size provided by the model. Table 3 contains examples of sentences from the Brown corpus and their reconstructions. We see that local ordering is reasonably modelled, but the overall sentence structure fails to be captured.

                     all      2-10     11-23
 CIDEr-D bigram     1.940    1.875    2.047
 BLEU bigram        0.193    0.209    0.176
 CIDEr-D random     1.113    1.1      1.134
 BLEU random        0.053    0.059    0.045

Table 2: CIDEr-D and BLEU scores on reordering of bags-of-words using our bigram matrix and random reordering. Results are given for all sentences as well as sentences of lengths 2-10 and 11-23.

 Original sentence: They have to be.
 Reconstruction:    they have to be .

 Original sentence: Six of these were proposed by religious groups.
 Reconstruction:    by these were six of religious groups proposed .

 Original sentence: His reply, he said, was that he agreed to the need for unity in the country now.
 Reconstruction:    the need for the country , in his reply , he said that he was now agreed to unity .

Table 3: Examples of decoded and reordered sentences. All words in the original sentences were retrieved by the model, but the ordering is only perfectly recovered in the first case.

5 Discussion

On the back of our results, we can start commenting on the particular contribution of bigrams to the semantic competences tested here. First, bigrams are moderately efficient at capturing relatedness: in spite of encoding extremely minimal co-occurrence information, they reach about two thirds of W2V's performance, trained on the same data with a much larger window and a complex algorithm (see ρ = 0.48 for the bigram model vs ρ = 0.72 for W2V-BNC). So relatedness, the flagship task of DS, seems to be present in the most basic structures of language use, although in moderate amount.

The result of the bigram model on sentence relatedness is consistent with its performance at the word level. The improved result obtained by filtering out frequent words, though, reminds us that logical terms are perhaps not so amenable to the distributional hypothesis, despite indications to the contrary (Abrusán et al., 2018).
As for sentence autoencoding, the excellent results of the bigram model might at first be considered trivial and due to the dimensionality of the space, much larger for the bigram model than for W2V. Indeed, at the bag-of-words level, sentence reconstruction can in principle be perfectly achieved by having a space of the dimensionality of the vocabulary, with each word symbolically expressed as a one-hot vector.4 However, as noted in §2, the ability to encode relatedness is at odds with the ability to distinguish between meanings. There is a trade-off between having a high-dimensionality space (which allows for more discrimination between vectors and thus easier reconstruction – see White et al., 2016b) and capturing latent features between concepts (which is typically better achieved with lower dimensionality). Interestingly, bigrams seem to be biased towards more symbolic representations, generating representations that distinguish very well between word meanings, but they do also encapsulate a reasonable amount of lexical information. This makes them somewhat of a hybrid constituent, between proper symbols and continuous vectors.

4 To make this clear, if we have a vocabulary V = {cat, dog, run} and we define cat = [100], dog = [010] and run = [001], then, trivially, [011] corresponds to the bag-of-words {dog, run}.

6 Conclusion

So what can be said about bigrams as distributional structure? They encode a very high level of lexical discrimination while accounting for some basic semantic similarity. They of course also encode minimal sequential information, which can be used to retrieve local sentence ordering. Essentially, they result in representations that are perhaps more 'symbolic' than continuous. It is important to note that the reasonable correlations obtained on relatedness tasks were achieved after application of PMI weighting, implying that the raw structure requires some minimal preprocessing to generate lexical information.

On the back of our results, we can draw a few conclusions with respect to the relation of performance and competence at the level of bigrams. Performance data alone produces very distinct word representations without any further processing. Some traces of lexical semantics are present, but require some hard-encoded preprocessing step in the shape of the PMI function. We conclude from this that, as a constituent involved in acquisition, the bigram is mostly a marker of the uniqueness of word meaning. Interestingly, we note that the notion of contrast (words that differ in form differ in meaning) is an early feature of children's language acquisition (Clark, 1988). The fact that it is encoded in one of the simplest structures in language is perhaps no coincidence.

In future work, we plan a more encompassing study of other linguistic components. Crucially, we will also investigate which aspects of state-of-the-art models such as W2V contribute to score improvement on lexical aspects of semantics. We hope to thus gain insights into the specific cognitive processes required to bridge the gap between raw distributional structure as it is found in corpora, and actual speaker competence.

References

Márta Abrusán, Nicholas Asher, and Tim Van de Cruys. 2018. Content vs. function words: The view from distributional semantics. In Proceedings of Sinn und Bedeutung 22.

Yossi Adi, Einat Kermany, Yonatan Belinkov, Ofer Lavi, and Yoav Goldberg. 2017. Fine-grained analysis of sentence embeddings using auxiliary prediction tasks. International Conference on Learning Representations (ICLR), Toulon, France.

Marco Baroni, Georgiana Dinu, and Germán Kruszewski. 2014. Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In ACL (1), pages 238–247.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.

Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew Dai, Rafal Jozefowicz, and Samy Bengio. 2016. Generating sentences from a continuous space. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning, pages 10–21. Association for Computational Linguistics.

E. Bruni, N. K. Tran, and M. Baroni. 2014. Multimodal distributional semantics. Journal of Artificial Intelligence Research, 49:1–47.

John A. Bullinaria and Joseph P. Levy. 2007. Extracting semantic representations from word co-occurrence statistics: A computational study. Behavior Research Methods, 39(3):510–526.

Noam Chomsky. 1965. Aspects of the Theory of Syntax. MIT Press.

Kenneth Ward Church and Patrick Hanks. 1990. Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1):22–29.

Eve V. Clark. 1988. On the logic of contrast. Journal of Child Language, 15(2):317–335.

Stephen Clark. 2012. Vector space models of lexical meaning. In Shalom Lappin and Chris Fox, editors, Handbook of Contemporary Semantics – second edition. Wiley-Blackwell.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Georgiana Dinu and Marco Baroni. 2014. How to make words with vectors: Phrase generation in distributional semantics. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 624–633.

Katrin Erk. 2012. Vector space models of word meaning and phrase meaning: a survey. Language and Linguistics Compass, 6:635–653.

Katrin Erk and Sebastian Padó. 2008. A structured vector space model for word meaning in context. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing (EMNLP 2008), pages 897–906, Honolulu, HI.

Katrin Erk, Sebastian Padó, and Ulrike Padó. 2010. A flexible, corpus-driven model of regular and inverse selectional preferences. Computational Linguistics, 36(4):723–763.

John Rupert Firth. 1957. A synopsis of linguistic theory, 1930–1955. Philological Society, Oxford.

Alex Gittens, Dimitris Achlioptas, and Michael W. Mahoney. 2017. Skip-gram – Zipf + uniform = vector additivity. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 69–76.

Zelig Harris. 1954. Distributional structure. Word, 10(2-3):146–162.

Mohit Iyyer, Jordan Boyd-Graber, and Hal Daumé III. 2014. Generating sentences from semantic vector space representations. In NIPS Workshop on Learning Semantics.

Henry Kučera and Winthrop Nelson Francis. 1967. Computational Analysis of Present-Day American English. Dartmouth Publishing Group.

Marco Marelli, Stefano Menini, Marco Baroni, Luisa Bentivogli, Raffaella Bernardi, and Roberto Zamparelli. 2014. A SICK cure for the evaluation of compositional distributional semantic models. In LREC, pages 216–223.

Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. 2013. Linguistic regularities in continuous space word representations. In HLT-NAACL, pages 746–751.

Denis Paperno and Marco Baroni. 2016. When the whole is less than the sum of its parts: How composition affects PMI values in distributional semantic vectors. Computational Linguistics, 42(2):345–350.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL '02, pages 311–318, Stroudsburg, PA, USA. Association for Computational Linguistics.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training.

Radim Řehůřek and Petr Sojka. 2010. Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45–50, Valletta, Malta. ELRA. http://is.muni.cz/publication/884893/en.

Allen Schmaltz, Alexander M. Rush, and Stuart Shieber. 2016. Word ordering without syntax. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2319–2324, Austin, Texas. Association for Computational Linguistics.

S. Thater, H. Fürstenau, and M. Pinkal. 2011. Word meaning in context: A simple and effective vector model. In Proceedings of the 5th International Joint Conference on Natural Language Processing, Chiang Mai, Thailand.

Peter D. Turney. 2014. Semantic composition and decomposition: From recognition to generation. CoRR, abs/1405.7908.

Peter D. Turney and Patrick Pantel. 2010. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37:141–188.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

R. Vedantam, C. L. Zitnick, and D. Parikh. 2015. CIDEr: Consensus-based image description evaluation. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4566–4575.
L. White, R. Togneri, W. Liu, and M. Bennamoun. 2016a. Modelling sentence generation from sum of word embedding vectors as a mixed integer programming problem. In 2016 IEEE 16th International Conference on Data Mining Workshops (ICDMW), pages 770–777.

Lyndon White, Roberto Togneri, Wei Liu, and Mohammed Bennamoun. 2016b. Generating bags of words from the sums of their word embeddings. In 17th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing).