               In Search for Linear Relations in Sentence Embedding Spaces

                                                      Petra Barančíková, Ondřej Bojar

                                                               Charles University
                                                     Faculty of Mathematics and Physics
                                                 Institute of Formal and Applied Linguistics
                                                {barancikova,bojar}@ufal.mff.cuni.cz

Abstract: We present an introductory investigation into continuous-space vector representations of sentences. We acquire pairs of very similar sentences differing only by a small alteration (such as a change of a noun, or adding an adjective, a noun or punctuation) from datasets for natural language inference using a simple pattern method. We look into how such a small change within the sentence text affects its representation in the continuous space and how such alterations are reflected by some of the popular sentence embedding models. We found that vector differences of some embeddings actually reflect small changes within a sentence.

Figure 1: An illustration of a continuous multi-dimensional vector space representing individual sentences, a ‘space of sentences’ (upper plot), where each sentence is represented as a dot. Pairs of related sentences are connected with arrows; dashing indicates various relation types. The lower plot illustrates a possible ‘space of operations’ (here vector difference, so all arrows are simply moved to start at a common origin). The hope is that similar operations (e.g. all vector transformations extracted from sentence pairs differing in the speed of travel, “running instead of walking”) would be represented close to each other in the space of operations, i.e. form a more or less compact cluster.
1   Introduction

Continuous-space representations of sentences, so-called sentence embeddings, are becoming an interesting object of study, see e.g. the BlackboxNLP workshop.1 Representing sentences in a continuous space, i.e. commonly with a long vector of real numbers, can be useful in multiple ways, analogous to continuous word representations (word embeddings). Word embeddings have provably made downstream processing robust to unimportant input variations or minor errors (sometimes incl. typos), they have greatly boosted the performance of many tasks in low-data conditions, and they can form the basis of empirically-driven lexicographic explanations of word meanings.
   One notable observation was made in [15], showing that several interesting relations between words have their immediate geometric counterpart in the continuous vector space.
   Our aim is to examine existing continuous representations of whole sentences, looking for an analogous behaviour. The idea of what we are hoping for is illustrated in Figure 1. As with words, we would like to learn if and to what extent some simple geometric operations in the continuous space correspond to simple semantic operations on the sentence strings. Similarly to [15], we are deliberately not including this aspect in the training objective of the sentence representations but instead search for properties that are learned in an unsupervised way, as a side-effect of the original training objective, data and setup.
   This approach has the potential of explaining the good or bad performance of the examined types of representations in various tasks.
   The paper is structured as follows: Section 2 reviews the closest related work. Sections 3 and 4, respectively, describe the dataset of sentences and the sentence embedding methods we use. Section 5 presents the selection of operations on the sentence vectors. Section 6 provides the main experimental results of our work. We conclude in Section 7.

2   Related Work

A series of tests measuring how well word embeddings capture semantic and syntactic information was defined in [15]. These tests include for example the comparison of adjectives (“easy”→“easier”→“easiest”), chang-

       Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
     1 https://blackboxnlp.github.io/
Figure 2: Example of our pattern extraction method. In the first step, the longest common substring of tokens (“ear is playing a guitar .”) is found and replaced with the variable X. In the second step, “with a tattoo behind” is substituted with the variable Y. As the variables are not listed alphabetically in the premise, they are swapped in the last step.
step                           premise                                                        hypothesis
 1.     a man with a tattoo behind his ear is playing a guitar .       a woman with a tattoo behind her ear is playing a guitar .
 2.     a man with a tattoo behind his X                               a woman with a tattoo behind her X
 3.     a man Y his X                                                  a woman Y her X
 4.     a man X his Y                                                  a woman X her Y
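The two-step replacement shown in Figure 2 can be sketched with Python's `difflib`. This is a minimal sketch, not the authors' actual implementation: `extract_pattern` is a hypothetical helper name, tokenization is naive whitespace splitting (the paper uses NLTK), and the sketch assumes the second match does not overlap the first variable (which holds for the Figure 2 example).

```python
from difflib import SequenceMatcher

def extract_pattern(premise: str, hypothesis: str) -> tuple:
    """Replace the longest common run of whole tokens with X, repeat
    once for Y, then canonicalize variable order (simplification:
    assumes the second match does not overlap the first variable)."""
    p, h = premise.split(), hypothesis.split()
    for var in ("X", "Y"):
        m = SequenceMatcher(a=p, b=h, autojunk=False).find_longest_match(
            0, len(p), 0, len(h))
        # stop if nothing is shared, or only already-inserted variables match
        if m.size == 0 or all(t in ("X", "Y") for t in p[m.a:m.a + m.size]):
            break
        p = p[:m.a] + [var] + p[m.a + m.size:]
        h = h[:m.b] + [var] + h[m.b + m.size:]
    # canonical form: variables must appear alphabetically in the premise
    if "X" in p and "Y" in p and p.index("Y") < p.index("X"):
        swap = {"X": "Y", "Y": "X"}
        p = [swap.get(t, t) for t in p]
        h = [swap.get(t, t) for t in h]
    return " ".join(p), " ".join(h)

print(extract_pattern(
    "a man with a tattoo behind his ear is playing a guitar .",
    "a woman with a tattoo behind her ear is playing a guitar ."))
```

On the Figure 2 pair, this reproduces the final canonical pattern “a man X his Y → a woman X her Y”.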


Figure 3: Top 10 patterns extracted from sentence pairs labelled as entailments, contradictions and neutrals, respectively. Note the “X → X” pattern indicating no change in the sentence string at all.

                  entailments                          contradictions                        neutrals
  1.   X → X                     693    X man Y → X woman Y      413    X Y → X sad Y           701
  2.   X man Y → X person Y      224    X woman Y → X man Y      196    X Y → X big Y           119
  3.   X . → X                   207    X men Y → X women Y      111    X Y → X fat Y            69
  4.   X woman Y → X person Y    118    X boy Y → X girl Y       109    X young Y → X sad Y      68
  5.   X boy Y → X person Y       65    X dog Y → X cat Y         98    X people Y → X men Y     60
  6.   X Y → Y , X .              61    X girl Y → X boy Y        97    X → sad X                51
  7.   X men Y → X people Y       56    X women Y → X men Y       64    X → X                    41
  8.   two X → X                  56    X Y → X not Y             56    X person Y → X man Y     34
  9.   X girl Y → X person Y      55    two X → three X           46    X Y → X red Y            30
 10.   X , Y → Y X .              53    X child Y → X man Y       44    X Y → X busy Y           28



ing the tense of a verb (“walking”→“walk”) or getting the capital (“Athens”→“Greece”) or the currency of a state (“Angola”→“kwanza”). References [2; 13] have further refined the support of sub-word units, leading to considerable improvements in representing morpho-syntactic properties of words. Vylomova, Rimmel, Cohn and Baldwin [26] largely extended the set of considered semantic relations of words.
   Sentence embeddings are most commonly evaluated extrinsically in so-called ‘transfer tasks’, i.e. by comparing the evaluated representations based on their performance in sentence sentiment analysis, question type prediction, natural language inference and other assignments. Reference [8] introduces ‘probing tasks’ for intrinsic evaluation of sentence embeddings. They measure to what extent linguistic features like sentence length, word order, or the depth of the syntactic tree are available in a sentence embedding. This work was extended into SentEval [6], a toolkit for evaluating the quality of sentence embeddings both intrinsically and extrinsically. It contains 17 transfer tasks and 10 probing tasks. SentEval has been applied to many recent sentence embedding techniques, showing that no method has a consistently good performance across all tasks [18].
   Voleti, Liss and Berisha [25] examine how errors (such as incorrect word substitutions caused by automatic speech recognition) in a sentence affect its embedding. The embeddings of corrupted sentences are then used in textual similarity tasks and the performance is compared with the original embeddings. The results suggest that pretrained neural sentence encoders are much more robust to the introduced errors than bag-of-words embeddings.

3   Examined Sentences

Because manual creation of sentence variations is costly, we reuse existing data from SNLI [3] and MultiNLI [27]. Both these collections consist of pairs of sentences—a premise and a hypothesis—and their relationship (entailment/contradiction/neutral). The two datasets together contain 982k unique sentence pairs. All sentences were lowercased and tokenized using NLTK [14].
   From all the available sentence pairs, we select only a subset where the difference between the sentences in the pair can be described with a simple pattern. Our method goes as follows: given two sentences, a premise p and the corresponding hypothesis h, we find the longest common substring consisting of whole words and replace it with a variable. This is repeated once more, so our sentence patterns can have up to two variables. In the last step, we make sure the pattern is in a canonical form by switching the variables to ensure they are alphabetically sorted in p. The process is illustrated in Figure 2.
   The ten most common patterns for each NLI relation are shown in Figure 3. Many of the obtained patterns clearly match the sentence pair label. For instance, pattern no. 2 (“X man Y → X person Y”) can be expected to lead to
a sentence pair illustrating entailment. If a man appears in a story, we can infer that a person appeared in the story. The contradictions illustrate typical oppositions like man–woman, dog–cat. Neutrals are various refinements of the content described by the sentences, probably in part due to the original instruction in SNLI that the hypothesis “might be true” given the premise in the neutral relation.
   We kept only patterns appearing with at least 20 different sentence pairs in order to have large and variable sets of sentence pairs in the subsequent experiments. We also ignored the overall most common pattern, namely the identity, because it does not alter the sentence at all. Strangely enough, identity was observed not just among entailment pairs (693 cases), but also in neutral (41 cases) and contradiction (22 cases) pairs.
   Altogether, we collected 4.2k unique sentence pairs in 60 patterns. Only 10% of this data comes from MultiNLI; the majority is from SNLI.

4   Sentence Embeddings

We experiment with several popular pretrained sentence embeddings.
   InferSent2 [7] is the first embedding model that used supervised learning to compute sentence representations. It was trained to predict inference labels on the SNLI dataset. The authors tested 7 different architectures; a BiLSTM encoder with max pooling achieved the best results. InferSent comes in two versions: InferSent_1 is trained with GloVe embeddings [17] and InferSent_2 with fastText [2]. InferSent representations are by far the largest, with a dimensionality of 4096 in both versions.
   Similarly to InferSent, the Universal Sentence Encoder [4] uses unsupervised learning augmented with training on supervised data from SNLI. There are two models available. USE_T3 is a transformer network [23] designed for higher accuracy at the cost of larger memory use and computational time. USE_D4 is a deep averaging network [12], where word and bi-gram embeddings are averaged and used as input to a deep neural network that computes the final sentence embeddings. This second model is faster and more efficient, but its accuracy is lower. Both models output representations with 512 dimensions.
   Unlike the previous models, BERT5 (Bidirectional Encoder Representations from Transformers) [10] is a deep unsupervised language representation, pre-trained using only unlabeled text. It has two self-supervised training objectives: masked language modelling and next sentence classification. It is considered bidirectional as the Transformer encoder reads the entire sequence of words at once. We use a pre-trained BERT-Large model with Whole Word Masking. BERT gives embeddings for every (sub)word unit; as a sentence embedding, we take the [CLS] token, which is inserted at the beginning of every sentence. BERT embeddings have 1,024 dimensions.
   ELMo6 (Embeddings from Language Models) [5] uses representations from a biLSTM that is trained with the language model objective on a large text dataset. Its embeddings are a function of the internal layers of the bi-directional language model (biLM), which should capture not only semantics and syntax, but also the different meanings a word can have in different contexts (polysemy). Similarly to BERT, each ELMo token representation is a function of the entire input sentence: one word gets different embeddings in different contexts. ELMo computes an embedding for every token and we compute the final sentence embedding as the average over all tokens. It has dimensionality 1024.
   LASER7 (Language-Agnostic SEntence Representations) [1] is a five-layer bi-directional LSTM (BiLSTM) network. The 1,024-dimension vectors are obtained by max-pooling over its last states. It was trained to translate from more than 90 languages to English or Spanish at the same time; the source language was selected randomly in each batch.

5   Choosing Vector Operations

Mikolov, Chen, Corrado and Dean [15] used a simple vector difference as the operation that relates two word embeddings. For sentence embeddings, we experiment a little and consider four simple operations: addition, subtraction, multiplication and division, all applied elementwise. More operations could also be considered as long as they are reversible, so that we can isolate the vector change for a particular sentence alteration and apply it to the embedding of any other sentence. Hopefully, we would then land in the area where the correspondingly altered sentence is embedded.
   The underlying idea of our analysis was already sketched in Figure 1. From every sentence pair in our dataset, we extract the pattern, i.e. the string edit of the sentences. The arithmetic operation needed to move from the embedding of the first sentence to the embedding of the second sentence (in the continuous space of sentences) can be represented as a point in what we call the space of operations. Considering all sentence pairs that share the same edit pattern, we obtain many points in the space of operations. If the space of sentences reflects the particular edit pattern in an accessible way, all the corresponding points in the space of operations will be close together, forming a cluster.
   To select which of the arithmetic operations best suits the data, we test pattern clustering with three common clustering performance evaluation methods:

     2 https://github.com/facebookresearch/InferSent
     3 https://tfhub.dev/google/universal-sentence-encoder-large/3
     4 https://tfhub.dev/google/universal-sentence-encoder/2
     5 https://github.com/google-research/bert
     6 https://github.com/HIT-SCIR/ELMoForManyLangs
     7 https://github.com/facebookresearch/LASER
Table 1: The quality of pattern clustering in terms of the three cluster evaluation measures in the space of operations. For all the scores, the value of 1 represents a perfect assignment and 0 corresponds to random label assignment. All the numbers were computed using the Scikit-learn library [16]. The best operation according to each cluster score across the various embeddings is shown in bold.

                              Adjusted Rand Index              V-measure                Adjusted Mutual Information
  embedding        dim.     -      +      *      /        -      +      *      /         -      +      *      /
  InferSent_1      4096    0.58   0.03   0.03   0.00     0.91   0.28   0.24   0.03      0.87   0.18   0.14   0.00
  ELMo             1024    0.55   0.03   0.02   0.00     0.85   0.28   0.23   0.03      0.82   0.18   0.13   0.00
  LASER            1024    0.48   0.02   0.01   0.00     0.79   0.19   0.15   0.03      0.76   0.09   0.04   0.00
  USE_T             512    0.25   0.04   0.08   0.00     0.73   0.25   0.30   0.03      0.69   0.14   0.20   0.00
  InferSent_2      4096    0.31   0.04   0.04   0.01     0.69   0.28   0.28   0.10      0.65   0.19   0.19   0.03
  BERT             1024    0.33   0.02   0.01   0.00     0.66   0.22   0.16   0.03      0.62   0.12   0.06   0.00
  USE_D             512    0.21   0.05   0.08   0.00     0.65   0.27   0.33   0.03      0.58   0.17   0.23   0.00
  average          1775    0.39   0.03   0.04   0.00     0.75   0.25   0.24   0.04      0.71   0.15   0.14   0.00
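The evaluation behind Table 1 (build operation vectors for each sentence pair, cluster them, score against the pattern labels) can be sketched as follows. This is a toy illustration on synthetic stand-in data, not the paper's actual embeddings; the variable names and the synthetic setup are our own assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (adjusted_mutual_info_score,
                             adjusted_rand_score, v_measure_score)

rng = np.random.default_rng(0)

# Hypothetical stand-in data: three edit patterns, each shifting a
# sentence embedding by a pattern-specific offset plus a little noise.
dim, per_pattern = 50, 40
offsets = rng.normal(size=(3, dim))            # one "operation" per pattern
premises = rng.normal(size=(3 * per_pattern, dim))
labels = np.repeat(np.arange(3), per_pattern)
hypotheses = premises + offsets[labels] + 0.05 * rng.normal(size=premises.shape)

# Space of operations: elementwise subtraction (the best operation in Table 1).
operations = hypotheses - premises

# Cluster the operation vectors and score against the true pattern labels.
pred = KMeans(n_clusters=3, n_init=100, random_state=0).fit_predict(operations)
print(adjusted_rand_score(labels, pred),
      v_measure_score(labels, pred),
      adjusted_mutual_info_score(labels, pred))
```

On this well-separated toy data all three scores come out near 1; on real embeddings the scores spread out as in Table 1.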


  • Adjusted Rand Index [11] is a measure of the similarity between two cluster assignments adjusted with chance normalization. The score ranges from −1 to +1, with 1 being the perfect match score and values around 0 meaning random label assignment. Negative numbers show worse agreement than what is expected from a random result.

  • V-measure [19] is the harmonic mean of homogeneity (each cluster should contain only members of one class) and completeness (all members of one class should be assigned to the same cluster). The score ranges from 0 (the worst situation) to 1 (perfect score).

  • Adjusted Mutual Information [21] measures the agreement of the two clusterings with a correction for chance agreement. A random label assignment gets a score close to 0, while two identical clusterings get a score of 1.

   As a detailed description of these measures is out of the scope of this article, we refer readers to the related literature (e.g. [24]). We use these scores to compare patterns with labels predicted by k-means (best result of 100 random initialisations). The results are presented in Table 1. It is apparent that by far the best separation is achieved using the most intuitive operation, vector subtraction.
   There seems to be a weak correlation between the size of the embeddings and the scores. The smallest embeddings, USE_D and USE_T, get the worst scores, while the largest embedding, InferSent_1, scores best. However, InferSent_2, also with dimensionality 4096, performs poorly. The fact that several of the embeddings were trained on SNLI does not seem to benefit those embeddings: among the three top-scoring embeddings, only InferSent_1 was trained on the data that we use for the evaluation.

6   Experiments

For the following exploration of the continuous space of operations, we focus only on the ELMo embeddings. They scored second best in all measures, but unlike the best-scoring InferSent_1, ELMo was not trained on SNLI, which is the major source of our sentence pairs.
   The t-SNE [22] visualisation of subtractions of ELMo vectors is presented in Figure 4. The visualisation is constructed automatically and, of course, without the knowledge of the pattern label. It shows that the patterns are generally grouped into compact clusters, with the exception of a ‘chaos cloud’ in the middle and several outliers. There are also several patterns that seem inseparable, e.g. “two X → X” and “three X → X”, or “X white Y → X Y” and “X black Y → X Y”.
   We identified the patterns responsible for the noisy center and the outliers by computing weighted inertia for each pattern (the sum of squared distances of samples to their cluster center divided by the sample size). The clusters with the highest inertia consist of patterns representing a change of word order and/or adding or removing punctuation. These patterns are:

      X is Y . → Y is X      X Y . → Y X .        X → X .
      X , Y . → Y X .        X , Y . → Y , X .
      X Y . → Y , X .        X . → X

   To see if the space of operations can also be interpreted automatically, i.e. if the sentence relations are generalizable, we remove the noisy patterns listed above and apply fully unsupervised clustering: we do not even disclose the expected number of patterns, i.e. clusters. We try two metrics for finding the optimal number of clusters: the Davies-Bouldin index [9] and the Silhouette Coefficient [20]. Both are designed to measure compactness and separation of the clusters, i.e. they reward dense clusters that are far from each other. Both the Davies-Bouldin index and the Silhouette Coefficient agree that the best separation is achieved
Figure 4: t-SNE representation of patterns. The points in the operation space are obtained by subtracting the ELMo
embedding of the hypothesis from the ELMo embedding of the premise. Best viewed in color. Colors correspond to the
sentence patterns.
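A Figure-4-style projection can be produced with scikit-learn's t-SNE. The sketch below uses hypothetical stand-in difference vectors (random clusters of 1024-dimensional points, matching ELMo's dimensionality) rather than the actual ELMo data, and writes the plot to a file whose name is our own choice.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless rendering
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

rng = np.random.default_rng(1)

# Hypothetical stand-in: 6 patterns, 50 difference vectors each.
centers = rng.normal(scale=4.0, size=(6, 1024))
ops = np.repeat(centers, 50, axis=0) + rng.normal(size=(300, 1024))
pattern_ids = np.repeat(np.arange(6), 50)

# Project the 1024-dim operation vectors to 2D and color by pattern.
xy = TSNE(n_components=2, perplexity=30, init="pca",
          random_state=0).fit_transform(ops)
plt.scatter(xy[:, 0], xy[:, 1], c=pattern_ids, cmap="tab10", s=8)
plt.savefig("operations_tsne.png")
```

With separation this clean, each pattern shows up as its own compact blob, mirroring the clusters visible in Figure 4.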




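The fully unsupervised step (choosing the number of clusters with the Davies-Bouldin index and the Silhouette Coefficient, then running k-means) can be sketched as below. The data is a hypothetical stand-in with four planted patterns; on such data both criteria should agree on k = 4, just as they agree on 9 clusters for the real operation vectors.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score, silhouette_score

rng = np.random.default_rng(2)

# Hypothetical operation vectors with 4 well-separated underlying patterns.
centers = rng.normal(scale=3.0, size=(4, 16))
X = np.concatenate([rng.normal(loc=c, scale=0.3, size=(60, 16))
                    for c in centers])

scores = {}
for k in range(2, 10):
    pred = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    # Davies-Bouldin: lower is better; Silhouette: higher is better.
    scores[k] = (davies_bouldin_score(X, pred), silhouette_score(X, pred))

k_db = min(scores, key=lambda k: scores[k][0])
k_sil = max(scores, key=lambda k: scores[k][1])
print(k_db, k_sil)
```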
Figure 5: t-SNE representation of patterns as in Figure 4, with colors now coding fully automatic clusters. Each cluster is labelled with the set of patterns extracted from sentence pairs assigned to the cluster. The numbers in parentheses indicate how many sentence pairs belong to the given pattern within this cluster and overall, respectively. For instance, the line “two X → X (52/56)” says that of the 56 sentence pairs differing in the prefix “two”, 52 were automatically clustered together based on the subtraction of their ELMo embeddings.




1: [X woman Y -> X person Y (115/119),
    X boy Y -> X person Y (64/65),
    X girl Y -> X person Y (54/55),
    X child Y -> X person Y (30/30),
    X women Y -> X people Y (1/23)]

2: [X people Y -> X dogs Y (36/36),
    X person Y -> X dog Y (20/20)]

3: [X Y -> X sad Y (680/703),
    X young Y -> X sad Y (68/68),
    X -> sad X (50/51),
    X little Y -> X sad Y (19/21),
    X -> there is X (9/25),
    X Y -> X big Y (1/122)]

4: [two X -> X (52/56),
    a group of X -> X (36/38),
    three X -> X (24/24)]

5: [X man Y -> X person Y (218/227),
    X Y -> X not Y (1/56)]

6: [X man Y -> X woman Y (414/414),
    X men Y -> X women Y (109/111),
    X boy Y -> X girl Y (107/109),
    man X -> woman X (31/31),
    X boys Y -> X girls Y (21/27),
    X boy Y -> X person Y (1/65),
    X man Y -> X person Y (1/227)]

7: [X men Y -> X people Y (57/57),
    X young Y -> X Y (55/55),
    X black Y -> X Y (41/41),
    X red Y -> X Y (36/36),
    X white Y -> X Y (34/34),
    X little Y -> X Y (31/31),
    X not Y -> X Y (28/29),
    X blue Y -> X Y (27/27),
    X Y -> X is Y (22/24), ...]

8: [X Y -> X big Y (121/122),
    X dog Y -> X cat Y (98/98),
    X Y -> X fat Y (69/69),
    X people Y -> X men Y (59/60),
    X Y -> X not Y (55/56),
    two X -> three X (45/46),
    but X -> X (32/32),
    X Y -> X busy Y (30/30),
    X Y -> X red Y (30/30),
    X Y -> X n't Y (28/28),
    X blue Y -> X red Y (27/27),
    X red Y -> X blue Y (27/27),
    X dogs Y -> X cats Y (26/26),
    X Y -> X sad Y (23/703),
    X . -> X outside . (20/21),
    X -> there is X (13/25),
    X children Y -> X men Y (9/23),
    X boys Y -> X girls Y (6/27),
    two X -> X (4/56), ...]

9: [X woman Y -> X man Y (196/196),
    X girl Y -> X boy Y (96/97),
    X women Y -> X men Y (64/64),
    X child Y -> X man Y (45/45),
    X person Y -> X man Y (35/37),
    X girls Y -> X boys Y (29/29),
    X lady Y -> X man Y (27/27),
    X women Y -> X people Y (17/23),
    X children Y -> X men Y (14/23)]
at 9 clusters. Running k-Means with 9 clusters, we get the result as plotted in Figure 5.

   Manually inspecting the contents of the automatically identified clusters, we see that many of them are meaningful. For instance, Cluster 1 captures 90% (altogether 264 out of 292) of the sentence pairs exhibiting the pattern of generalizing women, boys or girls to people. The counterpart for men belonging to people is spread into Cluster 5 (218 out of 227 pairs) for the singular case and into the less clean Cluster 7, which contains all 57 plural pairs "X men Y → X people Y" together with various other oppositions. Cluster 2 covers all sentence pairs where a person is replaced with a dog. Cluster 3 is primarily connected with sentence pairs introducing bad mood. Cluster 4 unites patterns that represent omitting a numeral or a group. Cluster 6 covers gender oppositions in one direction and Cluster 9 adds the other direction (with some noise for child/man, person/man and similar), etc.
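The clustering step can be illustrated with a minimal, self-contained sketch. This is plain NumPy with a deterministic farthest-point initialization, not the scikit-learn setup used in our experiments, and the synthetic vectors merely stand in for the real sentence-pair difference vectors:

```python
import numpy as np

def kmeans(X, k, iters=100):
    """Plain Lloyd's algorithm with deterministic farthest-point init."""
    centers = [X[0]]
    for _ in range(k - 1):
        # next center: the point farthest from all centers chosen so far
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[d.argmax()])
    centers = np.array(centers)
    for _ in range(iters):
        # assign every vector to its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # recompute each center as the mean of its assigned vectors
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return labels

# Synthetic stand-ins for difference vectors emb(s2) - emb(s1):
# two groups of "edit directions", well separated in the space.
rng = np.random.default_rng(1)
diffs = np.vstack([rng.normal(0.0, 0.1, (30, 5)) + np.eye(5)[0],
                   rng.normal(0.0, 0.1, (30, 5)) + np.eye(5)[1]])
labels = kmeans(diffs, k=2)
```

With well-separated difference vectors, the two edit types end up in distinct clusters; on the real data, the number of clusters (here 9) has to be chosen by a separate model-selection step.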
7   Conclusion and Future Work

We examined vector spaces of sentence representations as inferred automatically by sentence embedding methods such as InferSent or ELMo. Our goal was to find out whether some simple arithmetic operations in the vector space correspond to meaningful edit operations on the sentence strings.

   Our first explorations of 60 sentence edit patterns document that this is indeed the case. Automatically identified frequent patterns with 20 or more occurrences in the SNLI and MultiNLI datasets correspond to simple vector differences. The ELMo space (and others such as Infersent_1, LASER and USE-T, which are omitted due to paper length requirements) exhibits this property very well.

   Unfortunately, choosing ELMo as an example might not have been the best option: we compute ELMo embeddings by averaging contextualized word embeddings, and the majority of the patterns amount to removing, adding or changing a single word. The difference between two such sentence embeddings may thus reduce to a simple difference between the embeddings of the substituted words, depending on the effect of the contextualization. In that case, the differences in vector space would reflect the word embeddings rather than the sentence embeddings.
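The effect is easy to see arithmetically: if a sentence embedding is the mean of its n word vectors (ignoring contextualization), substituting a single word shifts the sentence vector by exactly (w_new − w_old)/n, independently of all the other words. A toy NumPy check with random stand-ins for word vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
n, dim = 6, 8                       # sentence length, embedding size
words = rng.normal(size=(n, dim))   # stand-ins for (non-contextual) word vectors
w_old, w_new = words[2].copy(), rng.normal(size=dim)

sent_a = words.mean(axis=0)         # mean-pooled sentence embedding
words_b = words.copy()
words_b[2] = w_new                  # substitute a single word
sent_b = words_b.mean(axis=0)

# the sentence-level difference collapses to a scaled word-level difference
assert np.allclose(sent_b - sent_a, (w_new - w_old) / n)
```

For truly contextual embeddings the substitution also perturbs the vectors of the surrounding words, so the identity holds only approximately.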
   It should be noted that our search made use of only about 0.5% of the sentence pairs available in SNLI and MultiNLI. The remaining sentence pairs differ beyond what was extractable automatically using our simple pattern method. A different approach to a fine-grained description of the semantic relation between two sentences would have to be taken to better exploit the available data.
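To make the limitation concrete, the pairs our simple pattern method captures are essentially those whose tokenizations differ in exactly one contiguous span. The helper below is a hypothetical sketch of such a detector, not the exact implementation used in our experiments:

```python
import difflib

def single_edit_pattern(src, tgt):
    """Return a pattern such as 'X man Y -> X woman Y' if the two
    tokenized sentences differ in exactly one contiguous span,
    otherwise None.  X/Y stand for the shared left/right context."""
    a, b = src.split(), tgt.split()
    ops = [op for op in difflib.SequenceMatcher(a=a, b=b).get_opcodes()
           if op[0] != 'equal']
    if len(ops) != 1:
        return None
    _, i1, i2, j1, j2 = ops[0]
    left = 'X' if i1 > 0 else ''
    right = 'Y' if i2 < len(a) else ''

    def side(tokens):
        return ' '.join(t for t in (left, ' '.join(tokens), right) if t)

    return side(a[i1:i2]) + ' -> ' + side(b[j1:j2])

print(single_edit_pattern('a man is sleeping', 'a woman is sleeping'))
# prints: X man Y -> X woman Y
```

A pair like "a man runs fast" / "a woman runs slowly" differs in two separate spans, so the detector returns None; such pairs are exactly the ones beyond the reach of the pattern method.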
   Our long-term plan is to further verify these observations using a more diverse set of vector operations and a larger set of sentence alternations, primarily by extending the set of alternation types. We also plan to examine the possibilities of generating sentence strings back from the sentence embedding space. If successful, our method could lead to controlled paraphrasing via the continuous space: take an input sentence, embed it, modify the embedding using a vector operation and generate the target sentence in standard textual form.

Acknowledgment

This work has been supported by grant No. 18-24210S of the Czech Science Foundation. It has used language resources and tools stored and distributed by the LINDAT/CLARIN project of the Ministry of Education, Youth and Sports of the Czech Republic (project LM2015071).