=Paper=
{{Paper
|id=Vol-3174/paper2
|storemode=property
|title=Theoretical Study and Empirical Investigation of Sentence Analogies
|pdfUrl=https://ceur-ws.org/Vol-3174/paper2.pdf
|volume=Vol-3174
|authors=Stergos Afantenos,Suryani Lim,Henri Prade,Gilles Richard
|dblpUrl=https://dblp.org/rec/conf/ijcai/AfantenosLPR22
}}
==Theoretical Study and Empirical Investigation of Sentence Analogies==
Stergos Afantenos¹, Suryani Lim², Henri Prade¹ and Gilles Richard¹
¹ IRIT, University of Toulouse, France
² Federation University, Churchill, Australia
Abstract
Analogies between 4 sentences, “a is to b as c is to d”, are usually defined between two pairs of sentences (a, b) and (c, d) by constraining a relation R holding between the sentences of the first pair to hold for the second pair as well. From a theoretical perspective, three postulates define an analogy, one of which is the “central permutation” postulate, which allows the permutation of the central elements b and c. This postulate is no longer appropriate for sentence analogies, since the existence of R offers no guarantee in general for the existence of some relation S such that S holds for the pairs (a, c) and (b, d). In this paper, the “central permutation” postulate is replaced by a weaker “internal reversal” postulate to provide an appropriate definition of sentence analogies. To empirically validate this postulate, we build an LSTM as well as baseline Random Forest models capable of learning analogies based on quadruplets. We use the Penn Discourse Treebank (PDTB), the Stanford Natural Language Inference (SNLI) and the Microsoft Research Paraphrase (MSRP) corpora. Our experiments show that our models, trained on samples of analogies between (a, b) and (c, d), recognize analogies between (b, a) and (d, c) when the underlying relation is symmetrical, thus validating the formal model of sentence analogies using the “internal reversal” postulate.
1. Introduction
Analogy plays a crucial role in human cognition and intelligence. It has been characterized as “the core of cognition” [1] and has recently gained some interest from the computational linguistics and machine learning communities (see [2, 3]). Word analogies¹ such as “Paris is to France as Berlin is to Germany” are now well captured via word embeddings [4, 5]. If →a, →b, →c, →d are the embeddings of words a, b, c, d, then a : b :: c : d² holds iff (→a, →b, →c, →d) is a parallelogram in the underlying vector space [6].
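The parallelogram condition can be checked directly: a : b :: c : d holds in this model iff b − a = d − c componentwise. A minimal sketch with toy 2-d vectors (illustrative values, not actual word embeddings):

```python
def is_parallelogram(a, b, c, d, tol=1e-6):
    """a : b :: c : d holds in the vector model iff b - a == d - c,
    i.e. the four embeddings form a parallelogram."""
    return all(abs((bi - ai) - (di - ci)) <= tol
               for ai, bi, ci, di in zip(a, b, c, d))

# Toy 2-d "embeddings" (hypothetical values, not real word vectors):
paris, france = [0.0, 0.0], [1.0, 2.0]
berlin, germany = [3.0, 1.0], [4.0, 3.0]
print(is_parallelogram(paris, france, berlin, germany))  # True
```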
Although analogies between words have been extensively studied, analogies between sentences have received very scant attention from the community, to the best of our knowledge. Instead of dealing with words, dealing with sentences raises 2 challenges:
β’ How to embed sentences in a vector space?
IARML@IJCAI-ECAI'2022: Workshop on the Interactions between Analogical Reasoning and Machine Learning, at IJCAI-ECAI'2022, July, 2022, Vienna, Austria
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
¹ In the following, “analogy” refers to a quaternary relation linking 4 items of the form “a is to b as c is to d”, called an analogical proportion.
² a : b :: c : d is a standard notation for analogical proportion.
β’ How do we define a sentence analogy?
We expect sentence embeddings to be dense vectors that reflect the semantic properties of a sentence. Various approaches are available: the simplest embeds each word and takes the average of the vectors, although the order of the words is then lost. Another option, described in [7], makes use of the Discrete Cosine Transform and allows the sentence to be recovered from its embedding.
The question of defining sentence analogy is especially delicate. Indeed, the aforementioned parallelogram model used for words reflects the usual postulates of analogies, namely if a : b :: c : d holds, then c : d :: a : b (symmetry) and a : c :: b : d (central permutation) should hold as well. This latter postulate (already questionable between words [8]) is still more debatable with sentences. In the NLP community, analogies between sentences are usually induced from predefined relationships between sentences. A quadruplet of sentences a, b, c, d defines an analogy a : b :: c : d if the (implicit or explicit) relation that holds between the sentences of the first pair (a, b) also holds for the second pair (c, d). Let us consider the following example:
John sneezed loudly (a). Mary was startled (b).
Bob took an analgesic (c). His headache stopped (d).
In that case, the implicit relation R between the sentences in a pair is a kind of causal relation. This example indicates that central permutation makes no sense here and raises the question of defining a weaker notion of analogy obeying another system of postulates. By which postulate should central permutation be replaced? In this paper, we propose to introduce a postulate we call “internal reversal”, which expresses that if a : b :: c : d holds then b : a :: d : c holds as well, and we study its consequences. So our main goal is to:
β’ theoretically investigate the formal consequences of this new model,
β’ empirically validate the model by implementing various classifiers of sentence analogies.
After presenting in Section 3 the standard formal definitions of analogies, including the “central permutation” postulate, and their immediate consequences, we focus on the replacement of the “central permutation” postulate by the internal reversal postulate. Besides fitting better with what is accepted as sentence analogies in the NLP community, this postulate also impacts the machine learning perspective that we implement.
For natural language sentences, “internal reversal”, as a formal postulate, may have some limitations. For instance, if R = R⁻¹, where R is the common relation that holds between the two pairs of sentences (a, b) and (c, d) (e.g., b is a paraphrase of a), one would expect that internal reversal holds straightforwardly. In that case, a machine learning model trained to recognize a : b :: c : d should also recognize b : a :: d : c.
We investigate the conditions under which a machine learning model trained on quadruplets of sentences (a, b, c, d) representing positive and negative instances of analogies is capable of identifying analogies for which the operation of internal reversal has been performed. We have devised several series of experiments using various underlying models and datasets.
The paper is structured as follows. After reviewing the related work (Section 2), in Section 3 we recall the formal definitions of analogical proportions and investigate the new case of sentence analogies, suggesting the “internal reversal” postulate as a better fit and examining its consequences. In Section 4, we consider the consequences of the formal definition from a machine learning perspective, by suggesting a rigorous extension of an initial training set. Sections 5 and 6 are dedicated to the description of the context, protocol and results of our experiments. This work is an extension of [9], replacing artificially created datasets with human-annotated ones.
2. Related work
Due to the advent of neural models and distributed representations of words, lexical analogies have been the focus of various works in computational linguistics [10, 11, 12, 13, for example].
In terms of analogies on the sentential level, few works exist. [14] investigate how existing embedding approaches can capture sentential analogies. They create two different kinds of datasets: one consists of replacing words with word analogies from the Google word analogy dataset [15], while the other is based on analogies between sentences that share common relations (entailment, negation, passivization, for example) or syntactic patterns (comparisons, opposites, plurals, among others). The goal is to optimize arg max_{d∈V} cos(v_d, v_b − v_a + v_c) with the additional constraint that d ∉ {a, b, c}. Using these datasets, analogies are evaluated using various embeddings, such as GloVe [5], word2vec [15], fastText [16, 17], etc., showing that capturing syntactic analogies based on lexical analogies from the Google word analogies dataset is more effective than recognising analogies based on more semantic information. [18] use a similar approach to identify the most plausible answer a_i to a given question q from a pool A of candidate answers, by leveraging analogies between (q, a_i) and various pairs of what they call “prototypical” question/answer pairs, assuming that there is an analogy between (q, a_i) and the prototypical pair (q_p, a_p). The goal is to select the candidate answer a*_i ∈ A such that:

a*_i = arg min_{a_i ∈ A} ||(q_p − a_p) − (q − a_i)||

The authors limit the question/answer pairs to wh- questions from WikiQA and TrecQA.
They use Siamese bi-GRUs as their architecture to represent the four sentences. In this manner, the authors learn embedding representations for the sentences, which they compare against various baselines including random vectors, word2vec, InferSent and Sent2Vec, obtaining better results with the WikiQA corpus. Most of the tested sentence embedding models succeed in recognizing syntactic analogies based on lexical ones, but had a harder time capturing analogies between pairs of sentences based on semantics.
Instead of training a model to select the best candidate amongst a given set of candidates [18], [19] train an encoder-decoder model based on LSTMs to generate the d given a pair (a, b) and a candidate c. The authors obtain vector encodings →a, →b, →c using an LSTM guided by two loss functions. They then experiment with concatenation, summation and arithmetic analogy on these vectors to obtain a new vector, which is then used as input for the decoding mechanism, showing that arithmetic analogy outperforms the other methods.
In this paper, the aim is to empirically validate the “internal reversal” postulate (without focusing on accuracy). To our knowledge, such a study has not been conducted before.
3. Theoretical Foundations of Analogies
We briefly recall the formal definition of analogy as found in [20, 21, 22]. We focus on a widely accepted definition for sentence analogies, and we investigate to what extent sentence analogies obey the formal postulates and what has to be modified in the formal setting to fit with this particular definition.
3.1. Formal definitions
Given a set of items X, a (proportional) analogy is a quaternary relation supposed to obey the 3 following postulates (e.g., [21]):
∀a, b, c, d ∈ X:
1. a : b :: a : b (reflexivity);
2. a : b :: c : d ⇒ c : d :: a : b (symmetry);
3. a : b :: c : d ⇒ a : c :: b : d (central permutation).
These postulates have straightforward consequences like:
• a : a :: b : b (identity);
• a : b :: c : d ⇒ b : a :: d : c (internal reversal);
• a : b :: c : d ⇒ d : b :: c : a (extreme permutation);
• a : b :: c : d ⇒ d : c :: b : a (complete reversal).
Among the 24 permutations of a, b, c, d, the previous postulates induce 3 distinct classes, each containing 8 distinct proportions regarded as equivalent due to the postulates: a : b :: c : d has in its class c : d :: a : b, a : c :: b : d, c : a :: d : b, b : a :: d : c, d : c :: b : a, d : b :: c : a, and b : d :: a : c. But b : a :: c : d and c : b :: a : d do not belong to the class of a : b :: c : d and are elements of the two other classes.
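The partition into 3 classes of 8 can be verified mechanically by closing each quadruplet under the symmetry and central permutation postulates; a small pure-Python check (illustrative, not part of the paper):

```python
from itertools import permutations

def closure(start):
    """Close a quadruple under symmetry and central permutation."""
    seen, frontier = {start}, [start]
    while frontier:
        a, b, c, d = frontier.pop()
        for img in ((c, d, a, b),      # symmetry
                    (a, c, b, d)):     # central permutation
            if img not in seen:
                seen.add(img)
                frontier.append(img)
    return seen

quads = set(permutations("abcd"))
classes = []
while quads:
    cls = closure(next(iter(quads)))
    classes.append(cls)
    quads -= cls

# The 24 permutations fall into 3 classes of 8 equivalent proportions.
print(len(classes), sorted(len(c) for c in classes))  # 3 [8, 8, 8]
```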
3.2. Sentence analogies
In the NLP community, the 4 items a, b, c, d are sentences in natural language, not necessarily the same. It is widely admitted that the sentences are in analogy (i.e., a : b :: c : d) as soon as there is a relation R between sentences such that R(a, b) and R(c, d). The example from the introduction is a perfect illustration of this definition, where the relation R is just causality:
John sneezed loudly (a). Mary was startled (b).
Bob took an analgesic (c). His headache stopped (d).
But:
Il fait beau aujourd'hui (a). Today we have nice weather (b).
Il vaut mieux éviter la guerre (c). It is better to avoid war (d).
is another example of analogy between sentences, where the implicit relation R is “b is the English translation of the French sentence a”. From a logical viewpoint, this can be expressed as:

a : b :: c : d iff ∃R s.t. R(a, b) ∧ R(c, d)   (1)

where ∧ is just the formal notation for the and connector. This definition can be considered as quite vague because, as advocated in [23, 24], there is always a way to find such a relation
R between 2 sentences. A more effective option used in the NLP community is to consider that the underlying relation R belongs to a finite set S of relations. Such relations can be, for example, discourse relations (Elaboration, Continuation, Contrast, Concession, etc.) or a Causality relation as is the case in the above example. Then, the formal definition has to be refined into:

a : b :: c : d iff ∃R ∈ S s.t. R(a, b) ∧ R(c, d)   (2)

where S = {R1, . . . , Rn} is a finite non-empty set of target relations. With this definition, we constrain the relation R to belong to a predefined set. Obviously, in the case of French-English translation, the list S is reduced to only one relation.
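Definition (2) can be sketched as a membership test over a predefined set of relations. The toy relations below are hypothetical stand-ins for annotated causal or translation relations:

```python
# Sketch of Definition (2): a : b :: c : d holds iff some relation R in a
# predefined set S holds for both pairs (a, b) and (c, d).
CAUSES = {("John sneezed loudly", "Mary was startled"),
          ("Bob took an analgesic", "His headache stopped")}
TRANSLATES = {("Il fait beau aujourd'hui", "Today we have nice weather")}

S = [lambda x, y: (x, y) in CAUSES,
     lambda x, y: (x, y) in TRANSLATES]

def is_analogy(a, b, c, d, relations=S):
    """a : b :: c : d iff there exists R in S with R(a, b) and R(c, d)."""
    return any(R(a, b) and R(c, d) for R in relations)

print(is_analogy("John sneezed loudly", "Mary was startled",
                 "Bob took an analgesic", "His headache stopped"))  # True
```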
It is quite clear that reflexivity and symmetry are still valid postulates for sentence analogies, i.e., they are satisfied with both of the above definitions. Back to our initial example:
John sneezed loudly (a). Mary was startled (b).
Bob took an analgesic (c). His headache stopped (d).
Definition (1) or (2) still applies to c : d :: a : b:
Bob took an analgesic (c). His headache stopped (d).
John sneezed loudly (a). Mary was startled (b).
which is then a valid analogy. Nevertheless, central permutation is not satisfied with the above definitions (1) or (2).
3.3. Internal reversal for sentence analogies
Let us now focus on the “internal reversal” postulate as an alternative to “central permutation”:

a : b :: c : d ⇒ b : a :: d : c (internal reversal)

By definition, if R(a, b) holds then R⁻¹(b, a) holds. Definition (1) supports “internal reversal”: for instance, if relation R(a, b) is interpreted as “a is a cause of b”, R⁻¹(b, a) can be the passive form “b is a consequence of a”. But Definition (2) does not support “internal reversal”, except if, for each relation R in the set S of built-in relations, we also have its counterpart R⁻¹. A simple way to ensure this property is to consider relations R such that R = R⁻¹. For instance, R(a, b) is defined as “b is a paraphrase of a”.
In the general case, a proper definition of a sentence analogy supporting the 3 postulates (reflexivity, symmetry, internal reversal) would be:

a : b :: c : d iff ∃R ∈ S s.t. (R(a, b) ∧ R(c, d)) ∨ (R⁻¹(a, b) ∧ R⁻¹(c, d))   (3)
This leads to a formal definition of sentence analogies with:
1. a : b :: a : b (reflexivity);
2. a : b :: c : d ⇒ c : d :: a : b (symmetry);
3. a : b :: c : d ⇒ b : a :: d : c (internal reversal).
As immediate consequences, we get that:
• there are only 4 equivalent forms (instead of 8 with the central permutation postulate) for an analogy: a : b :: c : d, c : d :: a : b, b : a :: d : c, and d : c :: b : a;
• a : b :: c : d ⇒ d : c :: b : a (complete reversal);
• a : a :: a : a (full identity) is still satisfied;
• a : a :: b : b (identity) is no longer a consequence of the new postulates.
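Under the weaker postulate system, closing a quadruplet under symmetry and internal reversal alone yields an orbit of 4, and central permutation is no longer derivable; a quick pure-Python check (illustrative, not part of the paper):

```python
def equivalent_forms(quad):
    """Orbit of a quadruple under symmetry (c d :: a b) and internal
    reversal (b a :: d c) only, i.e. the new postulate system."""
    seen, frontier = {quad}, [quad]
    while frontier:
        a, b, c, d = frontier.pop()
        for img in ((c, d, a, b), (b, a, d, c)):
            if img not in seen:
                seen.add(img)
                frontier.append(img)
    return seen

forms = equivalent_forms(("a", "b", "c", "d"))
print(len(forms))  # 4: a b c d, c d a b, b a d c, d c b a
```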
4. Implications for Machine Learning
Let us assume that we have at our disposal a repository of pairs of sentences (a, b) with their associated relation R. From this repository, we need a training set of examples for the classifier. Given the previous section, several steps can be implemented.
1) Building an initial training set of analogies a : b :: c : d can be done by joining 2 pairs (a, b) and (c, d) belonging to the same relation R. This constitutes a set of positive examples X+ such that for every quadruplet (a, b, c, d) = x ∈ X+ the training instances are (x, y) with y = 1. In terms of negative examples, joining 2 pairs (a, b) and (c, d) belonging to different relations leads to a set of negative examples X− such that for every quadruplet (a, b, c, d) = x ∈ X− the training instances are (x, y) with y = 0. The training set X = X+ ∪ X− is then a set of quadruplets of sentences a, b, c, d such that:
• if the implicit/explicit relation R between the pair (a, b) also holds for the pair (c, d), then (a, b, c, d) ∈ X+
• if the implicit/explicit relation R between the pair (a, b) does not hold for the pair (c, d), then (a, b, c, d) ∈ X−
Applying the symmetry postulate allows us to double the size of X+, just by adding (c, d, a, b) to X+ as soon as (a, b, c, d) ∈ X+. We thereby improve the theoretical imbalance between X+ and X−.
2) The same method applies with the internal reversal postulate, by adding (b, a, d, c) to X+ as soon as (a, b, c, d) ∈ X+. This again doubles the size of X+.
At this stage, we have multiplied by 4 the initial size of our positive training set X+ by introducing common sense analogies deducible from the initial ones, but not necessarily related to the initial list of relations S. Can we do more?
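Steps 1) and 2) above can be sketched as a closure of the positive set under symmetry and internal reversal (a minimal sketch; the actual pipeline builds quadruplets from relation-annotated corpora):

```python
def augment_positives(positives):
    """Close a set of positive quadruplets (a, b, c, d) under the symmetry
    and internal-reversal postulates, quadrupling its size (assuming the
    generated quadruplets are all distinct)."""
    out = set()
    for a, b, c, d in positives:
        out.update({(a, b, c, d),
                    (c, d, a, b),    # symmetry
                    (b, a, d, c),    # internal reversal
                    (d, c, b, a)})   # both combined: complete reversal
    return out

pos = {("John sneezed loudly", "Mary was startled",
        "Bob took an analgesic", "His headache stopped")}
print(len(augment_positives(pos)))  # 4
```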
The Identity Relation For completeness' sake, one could argue that it is still possible to extend the set of positive examples, since it seems acceptable to consider a : a :: b : b as a valid sentence analogy even though the identity relation Id likely does not belong to S. Indeed, the identity relation Id holds between the pairs (a, a) and (b, b). Although recognition of analogies based on the identity relation might seem trivial from an NLP perspective, it could still be a useful task if we want to evaluate the quality of our classifier. In other words, if a potential classifier is not able to identify analogies based on the identity relation, one should probably reconsider the underlying approach.
The Inverse Relation A scenario that appears quite often in Natural Language Processing, although far from being a generalized phenomenon, is that a relation R between sentences (or larger portions of text for that matter) is its own inverse R⁻¹. Instances of such a relation include, for example, the paraphrase: if a is a paraphrase of b, obviously b is a paraphrase of a. The same holds for the operation of translation: if sentence a is a translation of b then again b is a translation of a. Following our initial definition of analogy, we will have to accept:

a : b :: b : a when R is its own inverse
Before moving to the details of the empirical validation, we describe the datasets we use in
the following section.
5. Experiments
As explained earlier in this paper, our main goal is the empirical validation of internal reversal
for sentential analogies, using various corpora. To investigate this postulate we devise the
following sets of experiments.
5.1. Experimental settings
Base setting Given a training set (X_train, Y_train) = ({x_i}_{i=1}^n, {y_i}_{i=1}^n) and a test set (X_test, Y_test) = ({x_i}_{i=1}^m, {y_i}_{i=1}^m), with m typically being a tenth of n, x_i representing a quadruplet of sentences a : b :: c : d³ and y_i ∈ {0, 1}, we learn a model h_b capable of identifying analogies with a certain accuracy. Crucially, |{y_i : y_i = 1}| = |{y_i : y_i = 0}| both for the training and testing sets. Due to the huge number of instances at our disposal, there is no need at this stage to implement any further data augmentation process, as explained in the previous section. In other words, we have an equal number of positive and negative instances in the training and testing sets, for a total of 4M instances.
Internal reversal on the test set (Experimental setting 1) In this series of experiments, we used the same training set (X_train, Y_train) = ({x_i}_{i=1}^n, {y_i}_{i=1}^n) as for the base setting h_b. To construct the test set, we perform internal reversal on all the instances of the train set used in the base setting. Our goal is to see whether we get similar results on analogies after internal reversal.
Test set from train distribution with internal reversal (Experimental setting 2) For this series of experiments we use the same training set (X_train, Y_train) = ({x_i}_{i=1}^n, {y_i}_{i=1}^n) as for the base setting h_b. The test set, though, is constructed in the following way: for every positive instance (x_{a:b::c:d}, 1) in (X_train, Y_train) we add the internal reversal pair (x_{b:a::d:c}, 1) to the new testing set (X_test, Y_test), whose size thus is n/2. In contrast to experimental setting 1, where the underlying sentences between train and test distributions are different, in this series of experiments we want to see how well a trained model can detect analogies after performing internal reversal on the same set of pairs of sentences.
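The construction of the setting-2 test set can be sketched as follows (toy labels, not the actual corpora):

```python
def internal_reversal_test_set(train):
    """Experimental setting 2 (sketch): keep only positive training
    quadruplets and replace each a : b :: c : d with b : a :: d : c."""
    return [((b, a, d, c), 1) for (a, b, c, d), y in train if y == 1]

train = [(("a1", "b1", "c1", "d1"), 1),   # positive instance
         (("a2", "b2", "c2", "d2"), 0)]   # negative instance, dropped
print(internal_reversal_test_set(train))  # [(('b1', 'a1', 'd1', 'c1'), 1)]
```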
³ Henceforth, we will denote a representation for a quadruplet of sentences a : b :: c : d by the vector x_{a:b::c:d}.
Augmenting training and test sets (Experimental setting 3) In this series of experiments we learn a model h_a using (X_train, Y_train) = ({x_i}_{i=1}^{n+n/2}, {y_i}_{i=1}^{n+n/2}) and a test set (X_test, Y_test) = ({x_i}_{i=1}^{m+m/2}, {y_i}_{i=1}^{m+m/2}), where both train and test sets have been augmented using the following rule: for each instance (x_{a:b::c:d}, 1) in the train or test set we add the instance (x_{b:a::d:c}, 1). In other words, we double only the positive instances by adding the internal reversal of a quadruplet as a positive instance.
Augmenting test set (Experimental setting 4) In this series of experiments the train set, and thus the model learnt, is the same as in the base setting. In other words, we have a training set (X_train, Y_train) = ({x_i}_{i=1}^n, {y_i}_{i=1}^n) from which we learn a model h_b. For testing, though, we have a new test set (X'_test, Y'_test) = ({x'_i}_{i=1}^m, {y'_i}_{i=1}^m) which results from the (X_test, Y_test) of the base setting by keeping only the positive instances. This subset is then augmented with an instance (x_{b:a::d:c}, 1) for every instance (x_{a:b::c:d}, 1) in the subset, resulting thus in m total positive instances.
5.2. Datasets
To perform our experiments, we used three corpora: the Penn Discourse TreeBank (PDTB), the Stanford Natural Language Inference corpus (SNLI) and the Microsoft Research Paraphrase Corpus (MRPC).
PDTB dataset The first dataset that we use is PDTB version 2.1 [25] (36,000 pairs of sentences annotated with discourse relations). Relations can be explicitly expressed via a discourse marker, or implicitly expressed, in which case no such discourse marker exists and the annotators provide the one that most closely describes the implicit discourse relation. Relations are organized in a taxonomy of depth 3: level 1 (L1), the top level, has four types of relations (Temporal, Contingency, Expansion and Comparison); level 2 (L2) has 16 relation types; and level 3 (L3) has 23 relation types. For this series of experiments, we used the L1 relations.
SNLI dataset SNLI is a corpus of pairs of sentences from [26]. SNLI was created and annotated manually. It contains 570K human-written sentence pairs, considered a sufficient number of pairs for machine learning. The sentence pairs are annotated with entailment, contradiction and semantic independence. More precisely, a pair of sentences a and b can be annotated with either an Entailment, Contradiction or Neutral relation. Construction of the corpus was done using Mechanical Turk workers, each of whom was presented with a premise in the form of a sentence and asked to provide three hypotheses, in sentential form, one for each of the aforementioned labels. 10% of the corpus was validated by trusted Mechanical Turk workers. Overall, a Fleiss κ of 0.70 was achieved. For our experiments we considered the Neutral relation as symmetric.
MRPC dataset The third corpus is the Microsoft Research Paraphrase Corpus (MRPC) [27]. It contains about 5800 pairs of sentences which may or may not be paraphrases of each other. Each pair of sentences was annotated by two annotators; in case of disagreement, a third annotator resolved the conflict. After this, about two-thirds of the pairs were annotated as paraphrases and one-third as not.
5.3. Embedding techniques
There are well-known word embeddings such as word2vec [15], GloVe [5], BERT [28], fastText [17], etc. It is standard to start from a word embedding to build a sentence embedding. Sentence embedding techniques represent entire sentences and their semantic information as vectors. In this paper, we focus on 2 techniques relying on an initial word embedding.
- The simplest method is to average the word embeddings of all words in a sentence. Although this method ignores both the order of the words and the structure of the sentence, it performs well in many tasks. The final vector has the dimension of the initial word embedding.
- The other approach, suggested in [7], makes use of the Discrete Cosine Transform (DCT) as a simple and efficient way to model both word order and structure in sentences while maintaining practical efficiency. Using the inverse transformation, the original word sequence can be reconstructed. A small constant parameter K needs to be set: one can choose how many features are embedded per sentence by adjusting the value of K, but this increases the final size of the sentence vector by a factor of K. If the initial embedding of words is of dimension n, the final sentence dimension will be K * n (see [7] for a complete description).
In our experiments, we use the average method to embed sentences as it is at least as effective
as DCT [9].
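The averaging method can be sketched in a few lines (pure Python for clarity; `word_vectors` is a hypothetical token-to-vector lookup standing in for GloVe):

```python
def average_embedding(sentence, word_vectors, dim):
    """Average-of-word-vectors sentence embedding: tokens missing from the
    vocabulary are skipped; an all-unknown sentence maps to the zero vector."""
    tokens = [t for t in sentence.lower().split() if t in word_vectors]
    if not tokens:
        return [0.0] * dim
    vecs = [word_vectors[t] for t in tokens]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

# Toy 2-d vocabulary (illustrative values, not real GloVe vectors):
toy = {"john": [1.0, 0.0], "sneezed": [0.0, 1.0]}
print(average_embedding("John sneezed", toy, dim=2))  # [0.5, 0.5]
```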
5.4. Models
Random Forest (RF) We have tested our hypothesis on a classical method successfully used for word analogy classification [29]: Random Forests (RF). The parameters for RF are 100 trees, no maximum depth, and a minimum split of 2. We also use LSTMs, but any other model (SVM, etc.) could have been used.
Bi-LSTM architecture Given a quadruplet of sentences a : b :: c : d which can be an analogy or not, we represent each sentence by its input tokens a = {w_1^a, . . . , w_n^a}, b = {w_1^b, . . . , w_n^b}, c = {w_1^c, . . . , w_n^c} and d = {w_1^d, . . . , w_n^d}. Although sentences can have different lengths, we have empirically fixed n = 35; if a sentence has fewer than 35 word tokens we use padding. Each word token w_i^s (with s ∈ {a, b, c, d} and i ∈ [1 . . . n]) is represented by a GloVe vector of 300 dimensions. In this series of experiments, the LSTM did not use averaging or DCT, since the recurrent nature of LSTMs itself accounts for the structure of a sentence.
Our architecture is composed of four Bi-LSTMs whose output is passed to a feed-forward network. More precisely, for each sentence we recursively calculate h_t = o_t ⊙ tanh(C_t), with ⊙ representing the Hadamard product and

o_t = σ(W_o · [h_{t−1}, x_t] + b_o)

where

C_t = f_t ⊙ C_{t−1} + i_t ⊙ C̃_t

and

f_t = σ(W_f · [h_{t−1}, x_t] + b_f)
C̃_t = tanh(W_C · [h_{t−1}, x_t] + b_C)
i_t = σ(W_i · [h_{t−1}, x_t] + b_i)

In the above, x_t represents the vector for token w_t in a given sentence. These representations are obtained in both directions. Thus, for each sentence, the following representations are obtained:

a = →h^a ++ ←h^a ;  b = →h^b ++ ←h^b ;  c = →h^c ++ ←h^c ;  d = →h^d ++ ←h^d

with ++ representing the concatenation operation and →h^s, ←h^s the forward and backward hidden states for sentence s.
The above representations are given as input to a single-layer feed-forward network:

h = f(W h_LSTM + b)

with

h_LSTM = →h^a ++ ←h^a ++ →h^b ++ ←h^b ++ →h^c ++ ←h^c ++ →h^d ++ ←h^d

using the Rectified Linear Unit (ReLU) as activation function f. Finally, the prediction is performed using a sigmoid function:

ŷ = σ(W_s h_LSTM + b_s) = 1 / (1 + e^{−(W_s h_LSTM + b_s)})

The architecture is guided by a standard binary cross-entropy loss function.
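The gate equations above can be traced with a single pure-Python LSTM step (scalar weights for readability; the actual model uses 300-dimensional GloVe inputs and learned weight matrices):

```python
import math

def lstm_step(x_t, h_prev, C_prev, W, b):
    """One LSTM step following the gate equations: W and b hold the
    parameters of the forget (f), input (i), candidate (C) and output (o)
    gates, each a pair of scalar weights applied to [h_{t-1}, x_t]."""
    sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))
    dot = lambda w: w[0] * h_prev + w[1] * x_t   # W . [h_{t-1}, x_t]
    f_t = sigmoid(dot(W["f"]) + b["f"])          # forget gate
    i_t = sigmoid(dot(W["i"]) + b["i"])          # input gate
    C_tilde = math.tanh(dot(W["C"]) + b["C"])    # candidate cell state
    o_t = sigmoid(dot(W["o"]) + b["o"])          # output gate
    C_t = f_t * C_prev + i_t * C_tilde           # new cell state
    h_t = o_t * math.tanh(C_t)                   # new hidden state
    return h_t, C_t

# Hypothetical fixed weights, just to trace the arithmetic:
W = {k: (0.5, 0.5) for k in "fiCo"}
b = {k: 0.0 for k in "fiCo"}
h, C = lstm_step(x_t=1.0, h_prev=0.0, C_prev=0.0, W=W, b=b)
print(-1.0 < h < 1.0)  # True: the hidden state is bounded by tanh
```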
6. Results and discussion
Results of our experiments for LSTMs and RFs are shown in Tables 1 and 2 respectively. In all cases, we randomly generated quadruplets (a, b, c, d) which we annotated as analogies (class 1) if the pairs (a, b) and (c, d) shared the same relation, or with class 0 if they did not. For PDTB and SNLI we randomly generated 2 million instances for training; the testing and development corpora contained 200,000 instances each. For the paraphrase corpus, we generated 4 million instances for training, and the testing and development corpora contained 200,000 instances each. Each dataset contains an equal number of positive and negative instances. As we can see, the base settings for all datasets perform quite moderately, which is to be expected since our aim was not to create a general model for sentential analogies, which would require much more data and powerful models with billions of parameters. Instead, our goal was to examine under which conditions internal reversal holds. As we can see, in experimental setting 1, for which the test set is the same as the train set but with internal reversal, the results on PDTB and SNLI, which contain relations that are not symmetric, are worse than the base setting. This is not the case, though, for the paraphrase corpus, for which the results are better than the base setting.
In the second set of experiments, we decided to focus solely on the positive instances and examine whether learning the analogy a : b :: c : d also implicitly learnt its internal reversal, that is b : a :: d : c. We used the same base setting that we had learnt, but for testing we created a new dataset: starting from an empty set, we took every positive instance of the training set, performed an internal reversal, and added it to the new test dataset. The resulting dataset has no common instances with the training dataset, but every instance of it is an internal reversal of a positive instance of the training set. As we can see, there is almost no difference in scores for PDTB and SNLI, but the results for the paraphrase corpus (93.412% F1 for LSTMs and 87.544% for RFs) clearly show that when a relation is symmetrical, the model makes almost no distinction between an analogy and its internal reversal. It is interesting to observe that the trend for the LSTM is similar to the RF; however, the results from the LSTM appear to be more stable.
In the third series of experiments, we augmented both the training and testing datasets with the internal reversal. All three datasets showed a significant increase, of almost 20 percentage points in some cases, for the detection of analogies. In the fourth and final set of experiments, we used the same base setting that we had used initially. The test set was constructed from the same test set as the base setting, but we removed all negative instances and focused solely on the positive ones, augmented with the internal reversal. Again we see a significant increase in the results for the detection of analogies, further showing that the model learns internal reversal as well. In Table 1, experimental setting 2 for MRPC has the highest F1: this corpus has more symmetrical relationships when compared to PDTB and SNLI. In Table 2 we observe the trend already seen with the LSTM: experimental setting 2 has the highest F1.
Table 1: Results for LSTM

PDTB                     Precision  Recall   F1       Accuracy
base setting   class 1   54.274     47.476   50.648   53.739
               class 0   53.322     60.001   56.465
Exp. Set. 1    class 1   48.91      39.76    43.863   49.114
               class 0   49.254     58.468   53.467
Exp. Set. 2    class 1   100.0      39.76    56.898   39.76
Exp. Set. 3    class 1   70.16      79.346   74.471   63.733
               class 0   44.038     32.507   37.404
Exp. Set. 4    class 1   100.0      46.585   63.56    46.585

SNLI                     Precision  Recall   F1       Accuracy
base setting   class 1   67.862     67.811   67.837   67.859
               class 0   67.856     67.907   67.882
Exp. Set. 1    class 1   50.111     49.57    49.839   50.11
               class 0   50.11      50.651   50.379
Exp. Set. 2    class 1   100.0      49.57    66.283   49.57
Exp. Set. 3    class 1   84.489     83.982   84.235   79.047
               class 0   68.365     69.185   68.772
Exp. Set. 4    class 1   100.0      59.086   74.282   59.086

MRPC                     Precision  Recall   F1       Accuracy
base setting   class 1   53.45      61.487   57.188   53.969
               class 0   54.671     46.45    50.227
Exp. Set. 1    class 1   80.454     87.638   83.892   83.173
               class 0   86.426     78.708   82.387
Exp. Set. 2    class 1   100.0      87.638   93.412   87.638
Exp. Set. 3    class 1   69.033     72.395   70.674   59.946
               class 0   38.832     35.05    36.844
Exp. Set. 4    class 1   100.0      62.752   77.114   62.75

Table 2: Results for Random Forest

PDTB                     Precision  Recall   F1       Accuracy
base setting   class 1   54.604     33.778   41.737   53.314
               class 0   52.744     72.468   61.053
Exp. Set. 1    class 1   51.254     31.096   38.708   50.826
               class 0   50.639     70.504   58.943
Exp. Set. 2    class 1   100.00     31.096   47.440   31.096
Exp. Set. 3    class 1   66.263     99.953   79.694   66.267
               class 0   69.847     0.213    0.424
Exp. Set. 4    class 1   100.00     32.117   48.619   32.117

SNLI                     Precision  Recall   F1       Accuracy
base setting   class 1   50.725     47.006   48.794   50.729
               class 0   50.732     54.443   52.522
Exp. Set. 1    class 1   50.302     46.189   48.158   50.285
               class 0   50.270     54.379   52.244
Exp. Set. 2    class 1   100.00     46.189   63.191   46.189
Exp. Set. 3    class 1   70.368     86.903   77.766   66.898
               class 0   50.797     26.979   35.241
Exp. Set. 4    class 1   100.00     46.313   63.307   46.313

MRPC                     Precision  Recall   F1       Accuracy
base setting   class 1   54.327     69.353   60.927   54.739
               class 0   55.502     39.599   46.221
Exp. Set. 1    class 1   58.916     77.847   67.071   61.374
               class 0   66.313     44.547   53.293
Exp. Set. 2    class 1   100.00     77.847   87.544   77.847
Exp. Set. 3    class 1   67.523     99.649   80.499   67.437
               class 0   48.952     0.698    1.377
Exp. Set. 4    class 1   100.0      69.139   81.754   69.139
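The per-class precision, recall, F1 and overall accuracy values reported in Tables 1 and 2 can be computed with a few lines of Python. This is an illustrative sketch of the standard metrics, not the authors' evaluation code:

```python
def class_scores(y_true, y_pred, cls):
    """Precision, recall and F1 for one class (analogy = 1, non-analogy = 0)."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def accuracy(y_true, y_pred):
    """Fraction of correctly classified quadruplets over both classes."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```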
7. Conclusion and future work
In this paper, we have suggested a new formal model dedicated to sentence analogies, replacing the standard model for word analogies: a weaker "internal reversal" postulate takes the place of the well-known "central permutation" postulate. From a purely formal viewpoint, we have investigated the consequences of this new model and the extent to which it fits sentence analogies. To validate this approach in practice, we have implemented sentence analogy classifiers using well-known machine learning algorithms. We have also designed two machine learning protocols involving different ways of building a training set, all derived from the expected formal properties. Our results show that an "internal reversal" sentence analogy is recognized by our algorithms as a valid analogy as soon as the underlying relation between sentences is symmetric (e.g., "to be a paraphrase of"). When this relation is not symmetric (e.g., "to be a consequence of"), "internal reversal" sentence analogies are not always recognized; perhaps, in the general case, learning R is not the same as learning R^{-1}. Alternatively, finding a more accurate postulate might be a valid track of research for the future. Analogy postulates could also be used to further constrain the classifier.
Acknowledgments
The authors would like to express their gratitude to the anonymous reviewers for their valuable
comments. They would also like to thank the organizers of this workshop.