=Paper=
{{Paper
|id=Vol-2473/paper24
|storemode=property
|title=In Search for Linear Relations in Sentence Embedding Spaces
|pdfUrl=https://ceur-ws.org/Vol-2473/paper24.pdf
|volume=Vol-2473
|authors=Petra Barančíková,Ondřej Bojar
|dblpUrl=https://dblp.org/rec/conf/itat/BarancikovaB19
}}
==In Search for Linear Relations in Sentence Embedding Spaces==
Petra Barančíková, Ondřej Bojar
Charles University
Faculty of Mathematics and Physics
Institute of Formal and Applied Linguistics
{barancikova,bojar}@ufal.mff.cuni.cz
Abstract: We present an introductory investigation into continuous-space vector representations of sentences. We acquire pairs of very similar sentences differing only by small alterations (such as a change of a noun, or adding an adjective, noun or punctuation) from datasets for natural language inference using a simple pattern method. We look into how such a small change within the sentence text affects its representation in the continuous space and how such alterations are reflected by some of the popular sentence embedding models. We found that vector differences of some embeddings actually reflect small changes within a sentence.

Figure 1: An illustration of a continuous multi-dimensional vector space representing individual sentences, a ‘space of sentences’ (upper plot) where each sentence is represented as a dot. Pairs of related sentences are connected with arrows; dashing indicates various relation types. The lower plot illustrates a possible ‘space of operations’ (here vector difference, so all arrows are simply moved to start at a common origin). The hope is that similar operations (e.g. all vector transformations extracted from sentence pairs differing in the speed of travel, “running instead of walking”) would be represented close to each other in the space of operations, i.e. form a more or less compact cluster.
1 Introduction

Continuous-space representations of sentences, so-called sentence embeddings, are becoming an interesting object of study; consider e.g. the BlackBox workshop.1 Representing sentences in a continuous space, i.e. commonly with a long vector of real numbers, can be useful in multiple ways, analogous to continuous word representations (word embeddings). Word embeddings have provably made downstream processing robust to unimportant input variations or minor errors (sometimes incl. typos), they have greatly boosted the performance of many tasks in low data conditions and can form the basis of empirically-driven lexicographic explanations of word meanings.
One notable observation was made in [15], showing that several interesting relations between words have their immediate geometric counterpart in the continuous vector space.

Our aim is to examine existing continuous representations of whole sentences, looking for an analogous behaviour. The idea of what we are hoping for is illustrated in Figure 1. As with words, we would like to learn if and to what extent some simple geometric operations in the continuous space correspond to simple semantic operations on the sentence strings. Similarly to [15], we are deliberately not including this aspect in the training objective of the sentence representations but instead search for properties that are learned in an unsupervised way, as a side-effect of the original training objective, data and setup. This approach has the potential of explaining the good or bad performance of the examined types of representations in various tasks.

The paper is structured as follows: Section 2 reviews the closest related work. Sections 3 and 4 describe the dataset of sentences and the sentence embedding methods we use, respectively. Section 5 presents the selection of operations on the sentence vectors. Section 6 provides the main experimental results of our work. We conclude in Section 7.

2 Related Work

A series of tests measuring how well word embeddings capture semantic and syntactic information is defined in [15]. These tests include for example declination of adjectives (“easy”→“easier”→“easiest”), changing the tense of a verb (“walking”→“walk”) or getting the capital (“Athens”→“Greece”) or currency of a state (“Angola”→“kwanza”).

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
1 https://blackboxnlp.github.io/
Figure 2: Example of our pattern extraction method. In the first step, the longest common subsequence of tokens (ear is
playing a guitar .) is found and replaced with the variable X. In the second step, with a tattoo behind is substituted with
the variable Y. As the variables are not listed alphabetically in the premise, they are switched in the last step.
step premise hypothesis
1. a man with a tattoo behind his ear is playing a guitar . a woman with a tattoo behind her ear is playing a guitar .
2. a man with a tattoo behind his X a woman with a tattoo behind her X
3. a man Y his X a woman Y her X
4. a man X his Y a woman X her Y
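The two-step substitution illustrated in Figure 2 can be sketched as follows. This is our illustrative reconstruction, not the authors' published code: the function name `make_pattern` and the use of `difflib.SequenceMatcher` for finding the longest common token run are our own choices.

```python
from difflib import SequenceMatcher

def make_pattern(premise, hypothesis):
    """Reduce a sentence pair to its edit pattern (a sketch; names are ours).

    Twice, the longest common contiguous run of whole tokens is replaced
    by a variable (X, then Y); finally the variables are swapped if needed
    so that they appear alphabetically sorted in the premise (canonical form).
    """
    p, h = premise.split(), hypothesis.split()
    for var in "XY":  # the patterns use at most two variables
        m = SequenceMatcher(a=p, b=h, autojunk=False).find_longest_match(
            0, len(p), 0, len(h))
        if m.size == 0:
            break
        p = p[:m.a] + [var] + p[m.a + m.size:]
        h = h[:m.b] + [var] + h[m.b + m.size:]
    if "X" in p and "Y" in p and p.index("Y") < p.index("X"):
        # canonical form: rename the variables consistently in both sides
        swap = {"X": "Y", "Y": "X"}
        p = [swap.get(t, t) for t in p]
        h = [swap.get(t, t) for t in h]
    return " ".join(p), " ".join(h)

# The Figure 2 example:
print(make_pattern(
    "a man with a tattoo behind his ear is playing a guitar .",
    "a woman with a tattoo behind her ear is playing a guitar ."))
# → ('a man X his Y', 'a woman X her Y')
```

On the Figure 2 pair, the first pass replaces “ear is playing a guitar .” with X, the second replaces “with a tattoo behind” with Y, and the final swap yields the canonical pattern of step 4.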
Figure 3: Top 10 patterns extracted from sentence pairs labelled as entailments, contradictions and neutrals, respectively. Note the “X → X” pattern indicating no change in the sentence string at all.
      entailments                       contradictions                  neutrals
      premise → hypothesis       #      premise → hypothesis      #     premise → hypothesis    #
 1.   X → X                    693      X man Y → X woman Y     413     X Y → X sad Y         701
 2.   X man Y → X person Y     224      X woman Y → X man Y     196     X Y → X big Y         119
 3.   X . → X                  207      X men Y → X women Y     111     X Y → X fat Y          69
 4.   X woman Y → X person Y   118      X boy Y → X girl Y      109     X young Y → X sad Y    68
 5.   X boy Y → X person Y      65      X dog Y → X cat Y        98     X people Y → X men Y   60
 6.   X Y → Y , X .             61      X girl Y → X boy Y       97     X → sad X              51
 7.   X men Y → X people Y      56      X women Y → X men Y      64     X → X                  41
 8.   two X → X                 56      X Y → X not Y            56     X person Y → X man Y   34
 9.   X girl Y → X person Y     55      two X → three X          46     X Y → X red Y          30
10.   X , Y → Y X .             53      X child Y → X man Y      44     X Y → X busy Y         28
References [2; 13] have further refined the support of sub-word units, leading to considerable improvements in representing morpho-syntactic properties of words. Vylomova, Rimell, Cohn and Baldwin [26] largely extended the set of considered semantic relations of words.

Sentence embeddings are most commonly evaluated extrinsically in so-called ‘transfer tasks’, i.e. by comparing the evaluated representations based on their performance in sentence sentiment analysis, question type prediction, natural language inference and other assignments. Reference [8] introduces ‘probing tasks’ for intrinsic evaluation of sentence embeddings. These measure to what extent linguistic features like sentence length, word order, or the depth of the syntactic tree are available in a sentence embedding. This work was extended to SentEval [6], a toolkit for evaluating the quality of sentence embeddings both intrinsically and extrinsically. It contains 17 transfer tasks and 10 probing tasks. SentEval has been applied to many recent sentence embedding techniques, showing that no method has a consistently good performance across all tasks [18].

Voleti, Liss and Berisha [25] examine how errors (such as incorrect word substitution caused by automatic speech recognition) in a sentence affect its embedding. The embeddings of corrupted sentences are then used in textual similarity tasks and the performance is compared with the original embedding. The results suggest that pretrained neural sentence encoders are much more robust to introduced errors than bag-of-words embeddings.

3 Examined Sentences

Because manual creation of sentence variations is costly, we reuse existing data from SNLI [3] and MultiNLI [27]. Both these collections consist of pairs of sentences—a premise and a hypothesis—and their relationship (entailment/contradiction/neutral). The two datasets together contain 982k unique sentence pairs. All sentences were lowercased and tokenized using NLTK [14].

From all the available sentence pairs, we select only a subset where the difference between the sentences in the pair can be described with a simple pattern. Our method goes as follows: given two sentences, a premise p and the corresponding hypothesis h, we find the longest common substring consisting of whole words and replace it with a variable. This is repeated once more, so our sentence patterns can have up to two variables. In the last step, we make sure the pattern is in a canonical form by switching the variables to ensure they are alphabetically sorted in p. The process is illustrated in Figure 2.

The ten most common patterns for each NLI relation are shown in Figure 3. Many of the obtained patterns clearly match the sentence pair label. For instance, pattern no. 2 (“X man Y → X person Y”) can be expected to lead to
a sentence pair illustrating entailment: if a man appears in a story, we can infer that a person appeared in the story. The contradictions illustrate typical oppositions like man–woman, dog–cat. Neutrals are various refinements of the content described by the sentences, probably in part due to the original instruction in SNLI that the hypothesis “might be true” given the premise in the neutral relation.

We kept only patterns appearing with at least 20 different sentence pairs in order to have large and variable sets of sentence pairs in subsequent experiments. We also ignored the overall most common pattern, namely the identity, because it does not alter the sentence at all. Strangely enough, identity was observed not just among entailment pairs (693 cases), but also in neutral (41 cases) and contradiction (22) pairs.

Altogether, we collected 4.2k unique sentence pairs in 60 patterns. Only 10% of this data comes from MultiNLI; the majority is from SNLI.

4 Sentence Embeddings

We experiment with several popular pretrained sentence embeddings.

InferSent2 [7] is the first embedding model that used supervised learning to compute sentence representations. It was trained to predict inference labels on the SNLI dataset. The authors tested 7 different architectures; a BiLSTM encoder with max pooling achieved the best results. InferSent comes in two versions: InferSent_1 is trained with GloVe embeddings [17] and InferSent_2 with fastText [2]. InferSent representations are by far the largest, with a dimensionality of 4096 in both versions.

Similarly to InferSent, Universal Sentence Encoder [4] uses unsupervised learning augmented with training on supervised data from SNLI. There are two models available. USE_T3 is a transformer network [23] designed for higher accuracy at the cost of larger memory use and computational time. USE_D4 is a deep averaging network [12], where word and bi-gram embeddings are averaged and used as input to a deep neural network that computes the final sentence embedding. This second model is faster and more efficient but its accuracy is lower. Both models output representations with 512 dimensions.

Unlike the previous models, BERT5 (Bidirectional Encoder Representations from Transformers) [10] is a deep unsupervised language representation, pre-trained using only unlabeled text. It has two self-supervised training objectives: masked language modelling and next sentence classification. It is considered bidirectional as the Transformer encoder reads the entire sequence of words at once. We use a pre-trained BERT-Large model with Whole Word Masking. BERT gives embeddings for every (sub)word unit; as the sentence embedding, we take the [CLS] token, which is inserted at the beginning of every sentence. BERT embeddings have 1,024 dimensions.

ELMo6 (Embeddings from Language Models) [5] uses representations from a biLSTM that is trained with the language model objective on a large text dataset. Its embeddings are a function of the internal layers of the bidirectional Language Model (biLM), which should capture not only semantics and syntax, but also the different meanings a word can represent in different contexts (polysemy). Similarly to BERT, each token representation of ELMo is a function of the entire input sentence: one word gets different embeddings in different contexts. ELMo computes an embedding for every token and we compute the final sentence embedding as the average over all tokens. It has a dimensionality of 1024.

LASER7 (Language-Agnostic SEntence Representations) [1] is a five-layer bi-directional LSTM (BiLSTM) network. The 1,024-dimension vectors are obtained by max-pooling over its last states. It was trained to translate from more than 90 languages to English or Spanish at the same time; the source language was selected randomly in each batch.

5 Choosing Vector Operations

Mikolov, Chen, Corrado and Dean [15] used a simple vector difference as the operation that relates two word embeddings. For sentence embeddings, we experiment a little and consider four simple operations: addition, subtraction, multiplication and division, all applied elementwise. More operations could also be considered as long as they are reversible, so that we can isolate the vector change for a particular sentence alternation and apply it to the embedding of any other sentence. Hopefully, we would then land in the area where the correspondingly altered sentence is embedded.

The underlying idea of our analysis was already sketched in Figure 1. From every sentence pair in our dataset, we extract the pattern, i.e. the string edit of the sentences. The arithmetic operation needed to move from the embedding of the first sentence to the embedding of the second sentence (in the continuous space of sentences) can be represented as a point in what we call the space of operations. Considering all sentence pairs that share the same edit pattern, we obtain many points in the space of operations. If the space of sentences reflects the particular edit pattern in an accessible way, all the corresponding points in the space of operations will be close together, forming a cluster.
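The construction of the space of operations can be sketched in a few lines. The `embed` function below is a deliberately toy stand-in for the real encoders (ELMo, LASER, ...): it averages deterministic pseudo-random word vectors, mimicking only the averaging step used for ELMo, so that the example is self-contained and runnable.

```python
import zlib
import numpy as np

def embed(sentence, dim=16):
    """Toy stand-in for a real sentence encoder (purely illustrative):
    deterministic pseudo-random word vectors averaged over the tokens,
    mimicking how ELMo token vectors are averaged into a sentence vector."""
    vecs = [np.random.default_rng(zlib.crc32(t.encode())).standard_normal(dim)
            for t in sentence.split()]
    return np.mean(vecs, axis=0)

def operation_point(premise, hypothesis):
    # One sentence pair maps to one point in the space of operations:
    # the elementwise difference of the two sentence embeddings.
    return embed(hypothesis) - embed(premise)

# Pairs sharing the edit pattern "X sad Y -> X little Y" should yield
# nearby points; with this additive toy encoder they coincide exactly,
# which is the idealized version of the compact clusters we hope for.
pairs = [("a sad boy is walking .", "a little boy is walking ."),
         ("a sad girl is walking .", "a little girl is walking .")]
cloud = np.stack([operation_point(p, h) for p, h in pairs])
```

With a purely additive bag-of-words encoder the cloud collapses to a single point; the interesting empirical question is how far the contextualized encoders deviate from this ideal.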
2 https://github.com/facebookresearch/InferSent
3 https://tfhub.dev/google/universal-sentence-encoder-large/3
4 https://tfhub.dev/google/universal-sentence-encoder/2
5 https://github.com/google-research/bert
6 https://github.com/HIT-SCIR/ELMoForManyLangs
7 https://github.com/facebookresearch/LASER

To select which of the arithmetic operations best suits the data, we test pattern clustering with three common clustering performance evaluation methods:
Table 1: Quality of pattern clustering in terms of three cluster evaluation measures in the space of operations. For all the scores, the value of 1 represents a perfect assignment and 0 corresponds to a random label assignment. All the numbers were computed using the Scikit-learn library [16]. The best operation according to each cluster score across the various embeddings is in bold.
Adjusted Rand Index | V-measure | Adjusted Mutual Information
embedding dim. - + * / - + * / - + * /
InferSent_1 4096 0.58 0.03 0.03 0.00 0.91 0.28 0.24 0.03 0.87 0.18 0.14 0.00
ELMo 1024 0.55 0.03 0.02 0.00 0.85 0.28 0.23 0.03 0.82 0.18 0.13 0.00
LASER 1024 0.48 0.02 0.01 0.00 0.79 0.19 0.15 0.03 0.76 0.09 0.04 0.00
USE_T 512 0.25 0.04 0.08 0.00 0.73 0.25 0.30 0.03 0.69 0.14 0.20 0.00
InferSent_2 4096 0.31 0.04 0.04 0.01 0.69 0.28 0.28 0.10 0.65 0.19 0.19 0.03
BERT 1024 0.33 0.02 0.01 0.00 0.66 0.22 0.16 0.03 0.62 0.12 0.06 0.00
USE_D 512 0.21 0.05 0.08 0.00 0.65 0.27 0.33 0.03 0.58 0.17 0.23 0.00
average 1775 0.39 0.03 0.04 0.00 0.75 0.25 0.24 0.04 0.71 0.15 0.14 0.00
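The comparison behind Table 1 can be sketched with scikit-learn directly: cluster the operation points without their pattern labels, then score the predicted assignment against the patterns. The data below are synthetic stand-ins (three well-separated toy clouds), not the paper's actual difference vectors.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (adjusted_rand_score, v_measure_score,
                             adjusted_mutual_info_score)

rng = np.random.default_rng(0)
# Toy stand-in for the space of operations: three compact pattern
# clouds of difference vectors (the real data are e.g. 1,024-dim
# ELMo differences, one point per sentence pair).
true_patterns = np.repeat([0, 1, 2], 50)
points = rng.normal(scale=0.1, size=(150, 8)) + true_patterns[:, None] * 5.0

# Cluster without the pattern labels (k-Means, many restarts),
# then compare the two assignments with the three measures.
predicted = KMeans(n_clusters=3, n_init=100, random_state=0).fit_predict(points)

ari = adjusted_rand_score(true_patterns, predicted)
v = v_measure_score(true_patterns, predicted)
ami = adjusted_mutual_info_score(true_patterns, predicted)
```

On such cleanly separated toy clouds all three scores reach 1; the table shows how far each embedding/operation combination falls short of that on the real patterns.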
• Adjusted Rand Index [11] is a measure of the similarity between two cluster assignments adjusted with chance normalization. The score ranges from −1 to +1, with 1 being the perfect match score and values around 0 meaning random label assignment. Negative numbers show worse agreement than what is expected from a random result.

• V-measure [19] is the harmonic mean of homogeneity (each cluster should contain only members of one class) and completeness (all members of one class should be assigned to the same cluster). The score ranges from 0 (the worst situation) to 1 (perfect score).

• Adjusted Mutual Information [21] measures the agreement of the two clusterings with a correction for agreement by chance. A random label assignment gets a score close to 0, while two identical clusterings get a score of 1.

As a detailed description of these measures is out of the scope of this article, we refer readers to related literature (e.g. [24]). We use these scores to compare patterns with labels predicted by k-Means (best result of 100 random initialisations). The results are presented in Table 1. It is apparent that the best distribution by far is achieved using the most intuitive operation, vector subtraction.

There seems to be a weak correlation between the size of the embeddings and the scores. The smallest embeddings, USE_D and USE_T, get the worst scores, while the largest embeddings, InferSent_1, are the best scoring. However, InferSent_2, with the same dimensionality of 4096, performs poorly. The fact that several of the embeddings were trained on SNLI does not seem to benefit those embeddings. Among the three top-scoring embeddings, only InferSent_1 was trained on the data that we use for the evaluation of embeddings.

6 Experiments

For the following exploration of the continuous space of operations, we focus only on the ELMo embeddings. They scored second best in all scores, but unlike the best scoring InferSent_1, ELMo was not trained on SNLI, which is the major source of our sentence pairs.

The t-SNE [22] visualisation of subtractions of ELMo vectors is presented in Figure 4. The visualisation is constructed automatically and, of course, without the knowledge of the pattern labels. It shows that the patterns are generally grouped together into compact clusters, with the exception of a ‘chaos cloud’ in the middle and several outliers. Also, there are several patterns that seem inseparable, e.g. “two X → X” and “three X → X”, or “X white Y → X Y” and “X black Y → X Y”.

We identified the patterns responsible for the noisy center and the outliers by computing weighted inertia for each pattern (the sum of squared distances of samples to their cluster center divided by the sample size). The clusters with the highest inertia consist of patterns representing a change of word order and/or adding or removing punctuation. These patterns are:

X is Y . → Y is X
X Y . → Y X .
X → X .
X , Y . → Y X .
X , Y . → Y , X .
X Y . → Y , X .
X . → X

To see if the space of operations can also be interpreted automatically, i.e. if the sentence relations are generalizable, we remove the noisy patterns as above and apply fully unsupervised clustering: we do not even disclose the expected number of patterns, i.e. clusters. We try two metrics for finding the optimal number of clusters: the Davies-Bouldin index [9] and the Silhouette Coefficient [20]. They are both designed to measure compactness and separation of the clusters, i.e. they reward dense clusters that are far from each other. Both the Davies-Bouldin index and the Silhouette Coefficient agree that the best separation is achieved at 9 clusters.
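The two diagnostics described above, per-pattern weighted inertia and scanning candidate cluster counts with the Davies-Bouldin index and the Silhouette Coefficient, can be sketched as follows. The data are again synthetic stand-ins (four compact toy clouds) and `pattern_inertia` is our own helper name.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score, silhouette_score

rng = np.random.default_rng(1)
# Toy space of operations: four compact pattern clouds.
labels = np.repeat(np.arange(4), 40)
X = rng.normal(scale=0.1, size=(160, 6)) + labels[:, None] * 4.0

def pattern_inertia(points):
    """Weighted inertia of one pattern: sum of squared distances of its
    points to their centroid, divided by the sample size."""
    center = points.mean(axis=0)
    return np.sum((points - center) ** 2) / len(points)

# The noisiest pattern is the one with the highest weighted inertia.
noisiest = max(range(4), key=lambda p: pattern_inertia(X[labels == p]))

# Scan candidate cluster counts with k-Means; Davies-Bouldin is better
# when lower, Silhouette when higher.
scores = {}
for k in range(2, 8):
    pred = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = (davies_bouldin_score(X, pred), silhouette_score(X, pred))

best_by_db = min(scores, key=lambda k: scores[k][0])
best_by_silhouette = max(scores, key=lambda k: scores[k][1])
```

On this toy data both criteria agree on 4 clusters, mirroring how they agree on 9 clusters for the real difference vectors.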
Figure 4: t-SNE representation of patterns. The points in the operation space are obtained by subtracting the ELMo
embedding of the hypothesis from the ELMo embedding of the premise. Best viewed in color. Colors correspond to the
sentence patterns.
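A visualisation along the lines of Figure 4 can be produced with scikit-learn's t-SNE; the toy difference vectors below merely stand in for the real ELMo subtractions.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(2)
# Toy difference vectors for three edit patterns.
pattern_ids = np.repeat([0, 1, 2], 30)
diffs = rng.normal(scale=0.05, size=(90, 10)) + pattern_ids[:, None] * 3.0

# Project the space of operations to 2-D for plotting; the points can
# then be coloured by their pattern id, as in Figure 4.
coords = TSNE(n_components=2, perplexity=15, random_state=0).fit_transform(diffs)
```

Note that t-SNE is only a visual aid: the cluster evaluation itself is run on the original high-dimensional difference vectors, not on the 2-D projection.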
Figure 5: t-SNE representation of patterns as in Figure 4, with colors now coding fully automatic clusters. Each cluster is labelled with the set of patterns extracted from sentence pairs assigned to the cluster. The numbers in parentheses indicate how many sentence pairs belong to the given pattern within this cluster and overall, respectively. For instance, the line “two X → X (52/56)” says that of the 56 sentence pairs differing in the prefix “two”, 52 were automatically clustered together based on the subtraction of their ELMo embeddings.
1: [X woman Y → X person Y (115/119),
    X boy Y → X person Y (64/65),
    X girl Y → X person Y (54/55),
    X child Y → X person Y (30/30),
    X women Y → X people Y (1/23)]

2: [X people Y → X dogs Y (36/36),
    X person Y → X dog Y (20/20)]

3: [X Y → X sad Y (680/703),
    X young Y → X sad Y (68/68),
    X → sad X (50/51),
    X little Y → X sad Y (19/21),
    X → there is X (9/25),
    X Y → X big Y (1/122)]

4: [two X → X (52/56),
    a group of X → X (36/38),
    three X → X (24/24)]

5: [X man Y → X person Y (218/227),
    X Y → X not Y (1/56)]

6: [X man Y → X woman Y (414/414),
    X men Y → X women Y (109/111),
    X boy Y → X girl Y (107/109),
    man X → woman X (31/31),
    X boys Y → X girls Y (21/27),
    X boy Y → X person Y (1/65),
    X man Y → X person Y (1/227)]

7: [X men Y → X people Y (57/57),
    X young Y → X Y (55/55),
    X black Y → X Y (41/41),
    X red Y → X Y (36/36),
    X white Y → X Y (34/34),
    X little Y → X Y (31/31),
    X not Y → X Y (28/29),
    X blue Y → X Y (27/27),
    X Y → X is Y (22/24), ...]

8: [X Y → X big Y (121/122),
    X dog Y → X cat Y (98/98),
    X Y → X fat Y (69/69),
    X people Y → X men Y (59/60),
    X Y → X not Y (55/56),
    two X → three X (45/46),
    but X → X (32/32),
    X Y → X busy Y (30/30),
    X Y → X red Y (30/30),
    X Y → X n't Y (28/28),
    X blue Y → X red Y (27/27),
    X red Y → X blue Y (27/27),
    X dogs Y → X cats Y (26/26),
    X Y → X sad Y (23/703),
    X . → X outside . (20/21),
    X → there is X (13/25),
    X children Y → X men Y (9/23),
    X boys Y → X girls Y (6/27),
    two X → X (4/56), ...]

9: [X woman Y → X man Y (196/196),
    X girl Y → X boy Y (96/97),
    X women Y → X men Y (64/64),
    X child Y → X man Y (45/45),
    X person Y → X man Y (35/37),
    X girls Y → X boys Y (29/29),
    X lady Y → X man Y (27/27),
    X women Y → X people Y (17/23),
    X children Y → X men Y (14/23)]
Running k-Means with 9 clusters, we get the result as plotted in Figure 5.

Manually inspecting the contents of the automatically identified clusters, we see that many clusters are meaningful in some way. For instance, Cluster 1 captures 90% (altogether 264 out of 292) of the sentence pairs exhibiting the pattern of generalizing women, boys or girls to people. The counterpart for men belonging to people is spread into Cluster 5 (218 out of 227 pairs) for the singular case and the not so clean Cluster 7 containing 57/57 of the plural pairs “X men Y → X people Y” together with various oppositions. Cluster 2 covers all sentence pairs where a person is replaced with a dog. Cluster 3 is primarily connected with sentence pairs introducing bad mood. Cluster 4 unites patterns that represent omitting a numeral/group. Cluster 6 covers gender oppositions in one direction and Cluster 9 adds the other direction (with some noise for child/man and person/man and similar), etc.

7 Conclusion and Future Work

We examined vector spaces of sentence representations as inferred automatically by sentence embedding methods such as InferSent or ELMo. Our goal was to find out if some simple arithmetic operations in the vector space correspond to meaningful edit operations on the sentence strings.

Our first explorations of 60 sentence edit patterns document that this is indeed the case. Automatically identified frequent patterns with 20 or more occurrences in the SNLI and MultiNLI datasets correspond to simple vector differences. The ELMo space (and others such as InferSent_1, LASER and USE_T, which are omitted due to paper length requirements) exhibits this property very well.

Unfortunately, choosing ELMo as the example might not have been the best option: we compute ELMo embeddings by averaging contextualized word embeddings, and the majority of the patterns are just removing, adding or changing a single word. The difference between two such sentence embeddings may be a simple difference between the embeddings of the words substituted, depending on the effect of the contextualization. Thus, the differences in vector space would reflect the word embeddings rather than the sentence embeddings.

It should be noted that our search made use of only about 0.5% of the sentence pairs available in SNLI and MultiNLI. The remaining sentence pairs differ beyond what was extractable automatically using our simple pattern method. A different approach for a fine-grained description of the semantic relation between two sentences would have to be taken for a better exploitation of the available data.

Our plans for the long term are to further verify these observations using a more diverse set of vector operations and a larger set of sentence alternations, primarily by extending the set of alternation types. We also plan to examine the possibilities of generating sentence strings back from the sentence embedding space. If successful, our method could lead to controlled paraphrasing via the continuous space: take an input sentence, embed it, modify the embedding using a vector operation and generate the target sentence in the standard textual form.

Acknowledgment

This work has been supported by the grant No. 18-24210S of the Czech Science Foundation. It has been using language resources and tools stored and distributed by the LINDAT/CLARIN project of the Ministry of Education, Youth and Sports of the Czech Republic (project LM2015071).

References

[1] M. Artetxe and H. Schwenk. Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond. CoRR, abs/1812.10464, 2018.

[2] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov. Enriching word vectors with subword information. CoRR, abs/1607.04606, 2016.

[3] S. R. Bowman, G. Angeli, C. Potts, and C. D. Manning. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, 2015.

[4] D. Cer, Y. Yang, S. Kong, N. Hua, N. Limtiaco, R. S. John, N. Constant, M. Guajardo-Cespedes, S. Yuan, C. Tar, Y. Sung, B. Strope, and R. Kurzweil. Universal sentence encoder. CoRR, abs/1803.11175, 2018.

[5] W. Che, Y. Liu, Y. Wang, B. Zheng, and T. Liu. Towards better UD parsing: Deep contextualized word embeddings, ensemble, and treebank concatenation. CoRR, abs/1807.03121, 2018.

[6] A. Conneau and D. Kiela. SentEval: An evaluation toolkit for universal sentence representations. arXiv preprint arXiv:1803.05449, 2018.

[7] A. Conneau, D. Kiela, H. Schwenk, L. Barrault, and A. Bordes. Supervised learning of universal sentence representations from natural language inference data. CoRR, abs/1705.02364, 2017.

[8] A. Conneau, G. Kruszewski, G. Lample, L. Barrault, and M. Baroni. What you can cram into a single vector: Probing sentence embeddings for linguistic properties. CoRR, abs/1805.01070, 2018.

[9] D. L. Davies and D. W. Bouldin. A cluster separation measure. IEEE Trans. Pattern Anal. Mach. Intell., 1(2):224–227, Feb. 1979.

[10] J. Devlin, M. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805, 2018.

[11] L. Hubert and P. Arabie. Comparing partitions. Journal of Classification, 2(1):193–218, Dec. 1985.

[12] M. Iyyer, V. Manjunatha, J. Boyd-Graber, and H. Daumé III. Deep unordered composition rivals syntactic methods for text classification. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1681–1691, Beijing, China, July 2015. Association for Computational Linguistics.

[13] T. Kocmi and O. Bojar. SubGram: Extending skip-gram word representation with substrings. CoRR, abs/1806.06571, 2018.

[14] E. Loper and S. Bird. NLTK: The Natural Language Toolkit. In Proceedings of the ACL Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics. Philadelphia: Association for Computational Linguistics, 2002.

[15] T. Mikolov, K. Chen, G. S. Corrado, and J. Dean. Efficient estimation of word representations in vector space, 2013.

[16] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

[17] J. Pennington, R. Socher, and C. Manning. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar, Oct. 2014. Association for Computational Linguistics.

[18] C. S. Perone, R. Silveira, and T. S. Paula. Evaluation of sentence embeddings in downstream and linguistic probing tasks. CoRR, abs/1806.06259, 2018.

[19] A. Rosenberg and J. Hirschberg. V-measure: A conditional entropy-based external cluster evaluation measure. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 410–420, Prague, Czech Republic, June 2007. Association for Computational Linguistics.

[20] P. Rousseeuw. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math., 20(1):53–65, Nov. 1987.

[21] A. Strehl and J. Ghosh. Cluster ensembles: A knowledge reuse framework for combining partitionings. In Eighteenth National Conference on Artificial Intelligence, pages 93–98, Menlo Park, CA, USA, 2002. American Association for Artificial Intelligence.

[22] L. van der Maaten and G. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605, 2008.

[23] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In NIPS, 2017.

[24] N. X. Vinh and J. Epps. A novel approach for automatic number of clusters detection in microarray data based on consensus clustering. In 2009 Ninth IEEE International Conference on Bioinformatics and BioEngineering, pages 84–91, June 2009.

[25] R. Voleti, J. M. Liss, and V. Berisha. Investigating the effects of word substitution errors on sentence embeddings. CoRR, abs/1811.07021, 2018.

[26] E. Vylomova, L. Rimell, T. Cohn, and T. Baldwin. Take and took, gaggle and goose, book and read: Evaluating the utility of vector differences for lexical relation learning. CoRR, abs/1509.01692, 2015.

[27] A. Williams, N. Nangia, and S. R. Bowman. A broad-coverage challenge corpus for sentence understanding through inference. In NAACL-HLT, 2018.