=Paper=
{{Paper
|id=Vol-3174/paper2
|storemode=property
|title=Theoretical Study and Empirical Investigation of Sentence Analogies
|pdfUrl=https://ceur-ws.org/Vol-3174/paper2.pdf
|volume=Vol-3174
|authors=Stergos Afantenos,Suryani Lim,Henri Prade,Gilles Richard
|dblpUrl=https://dblp.org/rec/conf/ijcai/AfantenosLPR22
}}
==Theoretical Study and Empirical Investigation of Sentence Analogies==
Stergos Afantenos¹, Suryani Lim², Henri Prade¹ and Gilles Richard¹
¹ IRIT, University of Toulouse, France
² Federation University, Churchill, Australia
Abstract
Analogies between 4 sentences, “a is to b as c is to d”, are usually defined between two pairs of sentences (a, b) and (c, d) by constraining a relation R holding between the sentences of the first pair to hold for the second pair as well. From a theoretical perspective, three postulates define an analogy, one of which is the “central permutation” postulate, which allows the permutation of the central elements b and c. This postulate is no longer appropriate for sentence analogies, since the existence of R offers no guarantee in general for the existence of some relation S such that S holds for the pairs (a, c) and (b, d). In this paper, the “central permutation” postulate is replaced by a weaker “internal reversal” postulate to provide an appropriate definition of sentence analogies. To empirically validate this postulate, we build an LSTM as well as baseline Random Forest models capable of learning analogies based on quadruplets. We use the Penn Discourse Treebank (PDTB), the Stanford Natural Language Inference (SNLI) and the Microsoft Research Paraphrase (MSRP) corpora. Our experiments show that our models, trained on samples of analogies between (a, b) and (c, d), recognize analogies between (b, a) and (d, c) when the underlying relation is symmetrical, thus validating the formal model of sentence analogies using the “internal reversal” postulate.
1. Introduction
Analogy plays a crucial role in human cognition and intelligence. It has been characterized as “the core of cognition” [1] and has recently gained some interest from the computational linguistics and machine learning communities (see [2, 3]). Word analogies¹ such as “Paris is to France as Berlin is to Germany” are now well captured via word embeddings [4, 5]. If →a, →b, →c, →d are the embeddings of words a, b, c, d, then a : b :: c : d² holds iff (→a, →b, →c, →d) is a parallelogram in the underlying vector space [6].
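The parallelogram condition can be checked directly: a : b :: c : d holds in this model iff b − a = d − c componentwise. A minimal sketch with toy 2-d vectors (illustrative values, not actual word embeddings):

```python
def is_parallelogram(a, b, c, d, tol=1e-6):
    """a : b :: c : d holds in the vector model iff b - a == d - c,
    i.e. the four embeddings form a parallelogram."""
    return all(abs((bi - ai) - (di - ci)) <= tol
               for ai, bi, ci, di in zip(a, b, c, d))

# Toy 2-d "embeddings" (hypothetical values, not real word vectors):
paris, france = [0.0, 0.0], [1.0, 2.0]
berlin, germany = [3.0, 1.0], [4.0, 3.0]
print(is_parallelogram(paris, france, berlin, germany))  # True
```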
Although analogies between words have been extensively studied, analogies between sentences have received very scant attention from the community, to the best of our knowledge. Instead of dealing with words, dealing with sentences raises 2 challenges:
β’ How to embed sentences in a vector space?
IARML@IJCAI-ECAI'2022: Workshop on the Interactions between Analogical Reasoning and Machine Learning, at IJCAI-ECAI'2022, July, 2022, Vienna, Austria
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
¹ In the following, “analogy” refers to a quaternary relation linking 4 items of the form “a is to b as c is to d”, called an analogical proportion.
² a : b :: c : d is a standard notation for analogical proportion.
β’ How do we define a sentence analogy?
We expect sentence embeddings to be dense vectors that reflect the semantic properties of a sentence. Various approaches are available: the simplest embeds each word and takes the average of the vectors, although the order of the words is then lost. Another option, described in [7], makes use of the Discrete Cosine Transform and allows the sentence to be recovered from its embedding.
The question of defining sentence analogy is especially delicate. Indeed, the aforementioned parallelogram model used for words reflects the usual postulates of analogies, namely if a : b :: c : d holds, then c : d :: a : b (symmetry) and a : c :: b : d (central permutation) should hold as well. This latter postulate (already questionable between words [8]) is still more debatable with sentences. In the NLP community, analogies between sentences are usually induced from predefined relationships between sentences. A quadruplet of sentences a, b, c, d defines an analogy a : b :: c : d if the (implicit or explicit) relation that holds between the sentences of the first pair (a, b) also holds for the second pair (c, d). Let us consider the following example:
John sneezed loudly (a). Mary was startled (b).
Bob took an analgesic (c). His headache stopped (d).
In that case, the implicit relation R between the sentences in a pair is a kind of causal relation. This example indicates that central permutation makes no sense here and raises the question of defining a weaker notion of analogy obeying another system of postulates. By which postulate should central permutation be replaced? In this paper, we propose to introduce a postulate we call “internal reversal”, which expresses that if a : b :: c : d holds then b : a :: d : c holds as well, and we study its consequences. So our main goal is to:
β’ theoretically investigate the formal consequences of this new model,
β’ empirically validate the model by implementing various classifiers of sentence analogies.
After presenting in Section 3 the standard formal definitions of analogies, including the “central permutation” postulate, and their immediate consequences, we focus on the replacement of the “central permutation” postulate by the internal reversal postulate. Besides fitting better with what is accepted as sentence analogies in the NLP community, this postulate also impacts the machine learning perspective that we implement.
For natural language sentences, “internal reversal”, as a formal postulate, may have some limitations. For instance, if R = R⁻¹, where R is the common relation that holds between the two pairs of sentences (a, b) and (c, d) (e.g., b is a paraphrase of a), one would expect that internal reversal holds straightforwardly. In that case, a machine learning model trained to recognize a : b :: c : d should also recognize b : a :: d : c.
We investigate the conditions under which a machine learning model trained on quadruplets of sentences (a, b, c, d) representing positive and negative instances of analogies is capable of identifying analogies for which the operation of internal reversal has been performed. We have devised several series of experiments using various underlying models and datasets.
The paper is structured as follows. After reviewing the related work (Section 2), in Section 3 we recall the formal definitions of analogical proportions and investigate the new case of sentence analogies, suggesting the “internal reversal” postulate as a better fit and examining its consequences. In Section 4, we consider the consequences of the formal definition from a machine learning perspective, by suggesting a rigorous extension of an initial training set. Sections 5 and 6 are dedicated to the description of the context, protocol and results of our experiments. This work is an extension of [9], replacing artificially created datasets with human-annotated ones.
2. Related work
Due to the advent of neural models and distributed representations of words, lexical analogies have been the focus of various works in computational linguistics [10, 11, 12, 13, for example].
In terms of analogies on the sentential level, few works exist. [14] investigate how existing embedding approaches can capture sentential analogies. They create two different kinds of datasets: one consists of replacing words with word analogies from the Google word analogy dataset [15], while the other is based on analogies between sentences that share common relations (entailment, negation, passivization, for example) or syntactic patterns (comparisons, opposites, plurals, among others). The goal is to optimize arg max_{d∈V} cos(v_d, v_b − v_a + v_c) with the additional constraint that d ∉ {a, b, c}. Using these datasets, analogies are evaluated using various embeddings, such as GloVe [5], word2vec [15], fastText [16, 17], etc., showing that capturing syntactic analogies based on lexical analogies from the Google word analogies dataset is more effective than recognising analogies based on more semantic information. [18] use a similar approach to identify the most plausible answer a_i to a given question q from a pool A of candidate answers, by leveraging analogies between (q, a_i) and various pairs of what they call “prototypical” question/answer pairs, assuming that there is an analogy between (q, a_i) and the prototypical pair (q_p, a_p). The goal is to select the candidate answer a*_i ∈ A such that:

a*_i = arg min_{a_i ∈ A} ||(q_p − a_p) − (q − a_i)||

The authors limit the question/answer pairs to wh- questions from WikiQA and TrecQA.
They use Siamese bi-GRUs as their architecture to represent the four sentences. In this manner, the authors learn embedding representations for the sentences, which they compare against various baselines including random vectors, word2vec, InferSent and Sent2Vec, obtaining better results with the WikiQA corpus. Most of the tested sentence embedding models succeed in recognizing syntactic analogies based on lexical ones, but had a harder time capturing analogies between pairs of sentences based on semantics.
Instead of training a model to select the best candidate amongst a given set of candidates [18], [19] train an encoder-decoder model based on LSTMs to generate the d given a pair (a, b) and a candidate c. The authors obtain vector encodings →a, →b, →c using an LSTM guided by two loss functions. They then experiment with concatenation, summation and arithmetic analogy on these vectors to obtain a new vector, which is then used as input for the decoding mechanism, showing that arithmetic analogy outperforms the other methods.
In this paper, the aim is to empirically validate the “internal reversal” postulate (without focusing on accuracy). To our knowledge, such a study has not been conducted before.
3. Theoretical Foundations of Analogies
We briefly recall the formal definition of analogy as found in [20, 21, 22]. We focus on a widely accepted definition for sentence analogies, and we investigate to what extent sentence analogies obey the formal postulates and what has to be modified in the formal setting to fit with this particular definition.
3.1. Formal definitions
Given a set of items X, a (proportional) analogy is a quaternary relation supposed to obey the 3 following postulates (e.g., [21]):
∀a, b, c, d ∈ X:
1. a : b :: a : b (reflexivity);
2. a : b :: c : d ⇒ c : d :: a : b (symmetry);
3. a : b :: c : d ⇒ a : c :: b : d (central permutation).
These postulates have straightforward consequences like:
• a : a :: b : b (identity);
• a : b :: c : d ⇒ b : a :: d : c (internal reversal);
• a : b :: c : d ⇒ d : b :: c : a (extreme permutation);
• a : b :: c : d ⇒ d : c :: b : a (complete reversal).
Among the 24 permutations of a, b, c, d, the previous postulates induce 3 distinct classes, each containing 8 distinct proportions regarded as equivalent due to the postulates: a : b :: c : d has in its class c : d :: a : b, a : c :: b : d, c : a :: d : b, b : a :: d : c, d : c :: b : a, d : b :: c : a, and b : d :: a : c. But b : a :: c : d and c : b :: a : d do not belong to the class of a : b :: c : d and are elements of the two other classes.
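The partition into 3 classes of 8 can be verified mechanically by closing each quadruplet under the symmetry and central permutation postulates; a small pure-Python check (illustrative, not part of the paper):

```python
from itertools import permutations

def closure(start):
    """Close a quadruple under symmetry and central permutation."""
    seen, frontier = {start}, [start]
    while frontier:
        a, b, c, d = frontier.pop()
        for img in ((c, d, a, b),      # symmetry
                    (a, c, b, d)):     # central permutation
            if img not in seen:
                seen.add(img)
                frontier.append(img)
    return seen

quads = set(permutations("abcd"))
classes = []
while quads:
    cls = closure(next(iter(quads)))
    classes.append(cls)
    quads -= cls

# The 24 permutations fall into 3 classes of 8 equivalent proportions.
print(len(classes), sorted(len(c) for c in classes))  # 3 [8, 8, 8]
```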
3.2. Sentence analogies
In the NLP community, the 4 items a, b, c, d are sentences in natural language, not necessarily the same. It is widely admitted that the sentences are in analogy (i.e., a : b :: c : d) as soon as there is a relation R between sentences such that R(a, b) and R(c, d). The example from the introduction is a perfect illustration of this definition, where the relation R is just causality:
John sneezed loudly (a). Mary was startled (b).
Bob took an analgesic (c). His headache stopped (d).
But:
Il fait beau aujourd'hui (a). Today we have nice weather (b).
Il vaut mieux éviter la guerre (c). It is better to avoid war (d).
is another example of analogy between sentences, where the implicit relation R is “b is the English translation of the French sentence a”. From a logical viewpoint, this can be expressed as:

a : b :: c : d iff ∃R s.t. R(a, b) ∧ R(c, d)   (1)

where ∧ is just the formal notation for the and connector. This definition can be considered as quite vague because, as advocated in [23, 24], there is always a way to find such a relation
R between 2 sentences. A more effective option used in the NLP community is to consider that the underlying relation R belongs to a finite set S of relations. Such relations can be, for example, discourse relations (Elaboration, Continuation, Contrast, Concession, etc.) or a Causality relation as is the case in the above example. Then, the formal definition has to be refined into:

a : b :: c : d iff ∃R ∈ S s.t. R(a, b) ∧ R(c, d)   (2)

where S = {R1, . . . , Rn} is a finite non-empty set of target relations. With this definition, we constrain the relation R to belong to a predefined set. Obviously, in the case of French-English translation, the list S is reduced to only one relation.
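Definition (2) can be sketched as a membership test over a predefined set of relations. The toy relations below are hypothetical stand-ins for annotated causal or translation relations:

```python
# Sketch of Definition (2): a : b :: c : d holds iff some relation R in a
# predefined set S holds for both pairs (a, b) and (c, d).
CAUSES = {("John sneezed loudly", "Mary was startled"),
          ("Bob took an analgesic", "His headache stopped")}
TRANSLATES = {("Il fait beau aujourd'hui", "Today we have nice weather")}

S = [lambda x, y: (x, y) in CAUSES,
     lambda x, y: (x, y) in TRANSLATES]

def is_analogy(a, b, c, d, relations=S):
    """a : b :: c : d iff there exists R in S with R(a, b) and R(c, d)."""
    return any(R(a, b) and R(c, d) for R in relations)

print(is_analogy("John sneezed loudly", "Mary was startled",
                 "Bob took an analgesic", "His headache stopped"))  # True
```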
It is quite clear that reflexivity and symmetry are still valid postulates for sentence analogies, i.e., they are satisfied with both of the above definitions. Back to our initial example:
John sneezed loudly (a). Mary was startled (b).
Bob took an analgesic (c). His headache stopped (d).
Definition (1) or (2) still applies to c : d :: a : b:
Bob took an analgesic (c). His headache stopped (d).
John sneezed loudly (a). Mary was startled (b).
which is then a valid analogy. Nevertheless, central permutation is not satisfied with the above definitions (1) or (2).
3.3. Internal reversal for sentence analogies
Let us now focus on the “internal reversal” postulate as an alternative to “central permutation”:

a : b :: c : d ⇒ b : a :: d : c (internal reversal)

By definition, if R(a, b) holds then R⁻¹(b, a) holds. Definition (1) supports “internal reversal”: for instance, if relation R(a, b) is interpreted as “a is a cause of b”, R⁻¹(b, a) can be the passive form “b is a consequence of a”. But Definition (2) does not support “internal reversal”, except if, for each relation R in the set S of built-in relations, we also have its counterpart R⁻¹. A simple way to ensure this property is to consider relations R such that R = R⁻¹. For instance, R(a, b) is defined as “b is a paraphrase of a”.
In the general case, a proper definition of a sentence analogy supporting the 3 postulates (reflexivity, symmetry, internal reversal) would be:

a : b :: c : d iff ∃R ∈ S s.t. (R(a, b) ∧ R(c, d)) ∨ (R⁻¹(a, b) ∧ R⁻¹(c, d))   (3)
This leads to a formal definition of sentence analogies with:
1. a : b :: a : b (reflexivity);
2. a : b :: c : d ⇒ c : d :: a : b (symmetry);
3. a : b :: c : d ⇒ b : a :: d : c (internal reversal).
As immediate consequences, we get that:
• there are only 4 equivalent forms (instead of 8 with the central permutation postulate) for an analogy: a : b :: c : d, c : d :: a : b, b : a :: d : c, and d : c :: b : a;
• a : b :: c : d ⇒ d : c :: b : a (complete reversal);
• a : a :: a : a (full identity) is still satisfied;
• a : a :: b : b (identity) is no longer a consequence of the new postulates.
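Under the weaker postulate system, closing a quadruplet under symmetry and internal reversal alone yields an orbit of 4, and central permutation is no longer derivable; a quick pure-Python check (illustrative, not part of the paper):

```python
def equivalent_forms(quad):
    """Orbit of a quadruple under symmetry (c d :: a b) and internal
    reversal (b a :: d c) only, i.e. the new postulate system."""
    seen, frontier = {quad}, [quad]
    while frontier:
        a, b, c, d = frontier.pop()
        for img in ((c, d, a, b), (b, a, d, c)):
            if img not in seen:
                seen.add(img)
                frontier.append(img)
    return seen

forms = equivalent_forms(("a", "b", "c", "d"))
print(len(forms))  # 4: a b c d, c d a b, b a d c, d c b a
```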
4. Implications for Machine Learning
Let us assume that we have at our disposal a repository of pairs of sentences (a, b) with their associated relation R. From this repository, we need a training set of examples for the classifier. Given the previous section, several steps can be implemented.
1) Building an initial training set of analogies a : b :: c : d can be done by joining 2 pairs (a, b) and (c, d) belonging to the same relation R. This constitutes a set of positive examples X+ such that for every quadruplet (a, b, c, d) = x ∈ X+ the training instances are (x, y) with y = 1. In terms of negative examples, joining 2 pairs (a, b) and (c, d) belonging to different relations leads to a set of negative examples X− such that for every quadruplet (a, b, c, d) = x ∈ X− the training instances are (x, y) with y = 0. The training set X = X+ ∪ X− is then a set of quadruplets of sentences a, b, c, d such that:
• if the implicit/explicit relation R between the pair (a, b) also holds for the pair (c, d), then (a, b, c, d) ∈ X+
• if the implicit/explicit relation R between the pair (a, b) does not hold for the pair (c, d), then (a, b, c, d) ∈ X−
Applying the symmetry postulate allows us to double the size of X+, just by adding (c, d, a, b) to X+ as soon as (a, b, c, d) ∈ X+. We thereby improve the theoretical imbalance between X+ and X−.
2) The same method applies with the internal reversal postulate, by adding (b, a, d, c) to X+ as soon as (a, b, c, d) ∈ X+. This again doubles the size of X+.
At this stage, we have multiplied by 4 the initial size of our positive training set X+ by introducing common sense analogies deducible from the initial ones, but not necessarily related to the initial list of relations S. Can we do more?
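Steps 1) and 2) above can be sketched as a closure of the positive set under symmetry and internal reversal (a minimal sketch; the actual pipeline builds quadruplets from relation-annotated corpora):

```python
def augment_positives(positives):
    """Close a set of positive quadruplets (a, b, c, d) under the symmetry
    and internal-reversal postulates, quadrupling its size (assuming the
    generated quadruplets are all distinct)."""
    out = set()
    for a, b, c, d in positives:
        out.update({(a, b, c, d),
                    (c, d, a, b),    # symmetry
                    (b, a, d, c),    # internal reversal
                    (d, c, b, a)})   # both combined: complete reversal
    return out

pos = {("John sneezed loudly", "Mary was startled",
        "Bob took an analgesic", "His headache stopped")}
print(len(augment_positives(pos)))  # 4
```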
The Identity Relation For completeness' sake, one could argue that it is still possible to extend the set of positive examples, since it seems acceptable to consider a : a :: b : b as a valid sentence analogy even though the identity relation Id likely does not belong to S. Indeed, the identity relation Id holds between the pairs (a, a) and (b, b). Although recognition of analogies based on the identity relation might seem trivial from an NLP perspective, it could still be a useful task if we want to evaluate the quality of our classifier. In other words, if a potential classifier is not able to identify analogies based on the identity relation, one should probably reconsider the underlying approach.
The Inverse Relation A scenario that appears quite often in Natural Language Processing, although far from being a generalized phenomenon, is that a relation R between sentences (or larger portions of text for that matter) is its own inverse R⁻¹. Instances of such a relation include, for example, the paraphrase: if a is a paraphrase of b, obviously b is a paraphrase of a. The same holds for the operation of translation: if sentence a is a translation of b then again b is a translation of a. Following our initial definition of analogy, we will have to accept:

a : b :: b : a when R is its own inverse
Before moving to the details of the empirical validation, we describe the datasets we use in
the following section.
5. Experiments
As explained earlier in this paper, our main goal is the empirical validation of internal reversal
for sentential analogies, using various corpora. To investigate this postulate we devise the
following sets of experiments.
5.1. Experimental settings
Base setting Given a training set (X_train, Y_train) = ({x_i}_{i=1}^n, {y_i}_{i=1}^n) and a test set (X_test, Y_test) = ({x_i}_{i=1}^m, {y_i}_{i=1}^m), with m typically being a tenth of n, x_i representing a quadruplet of sentences a : b :: c : d³ and y_i ∈ {0, 1}, we learn a model h_b capable of identifying analogies with a certain accuracy. Crucially, |{y_i : y_i = 1}| = |{y_i : y_i = 0}| both for the training and testing sets. Due to the huge number of instances at our disposal, there is no need at this stage to implement any further data augmentation process, as explained in the previous section. In other words, we have an equal number of positive and negative instances in the training and testing sets, for a total of 4M instances.
Internal reversal on the test set (Experimental setting 1) In this series of experiments, we used the same training set (X_train, Y_train) = ({x_i}_{i=1}^n, {y_i}_{i=1}^n) as for the base setting h_b. To construct the test set, we perform internal reversal on all the instances of the train set used in the base setting. Our goal is to see whether we get similar results on analogies after internal reversal.
Test set from train distribution with internal reversal (Experimental setting 2) For this series of experiments we use the same training set (X_train, Y_train) = ({x_i}_{i=1}^n, {y_i}_{i=1}^n) as for the base setting h_b. The test set, though, is constructed in the following way: for every positive instance (x_{a:b::c:d}, 1) in (X_train, Y_train) we add the internal reversal pair (x_{b:a::d:c}, 1) to the new testing set (X_test, Y_test), whose size thus is n/2. In contrast to experimental setting 1, where the underlying sentences between train and test distributions are different, in this series of experiments we want to see how well a trained model can detect analogies after performing internal reversal on the same set of pairs of sentences.
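The construction of the setting-2 test set can be sketched as follows (toy labels, not the actual corpora):

```python
def internal_reversal_test_set(train):
    """Experimental setting 2 (sketch): keep only positive training
    quadruplets and replace each a : b :: c : d with b : a :: d : c."""
    return [((b, a, d, c), 1) for (a, b, c, d), y in train if y == 1]

train = [(("a1", "b1", "c1", "d1"), 1),   # positive instance
         (("a2", "b2", "c2", "d2"), 0)]   # negative instance, dropped
print(internal_reversal_test_set(train))  # [(('b1', 'a1', 'd1', 'c1'), 1)]
```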
³ Henceforth, we will denote a representation for a quadruplet of sentences a : b :: c : d by the vector x_{a:b::c:d}.
Augmenting training and test sets (Experimental setting 3) In this series of experiments we learn a model h_a using (X_train, Y_train) = ({x_i}_{i=1}^{n+n/2}, {y_i}_{i=1}^{n+n/2}) and a test set (X_test, Y_test) = ({x_i}_{i=1}^{m+m/2}, {y_i}_{i=1}^{m+m/2}), where both train and test sets have been augmented using the following rule: for each instance (x_{a:b::c:d}, 1) in the train or test set we add the instance (x_{b:a::d:c}, 1). In other words, we double only the positive instances by adding the internal reversal of a quadruplet as a positive instance.
Augmenting test set (Experimental setting 4) In this series of experiments the train set, and thus the model learnt, is the same as in the base setting. In other words, we have a training set (X_train, Y_train) = ({x_i}_{i=1}^n, {y_i}_{i=1}^n) from which we learn a model h_b. For testing, though, we have a new test set (X'_test, Y'_test) = ({x'_i}_{i=1}^m, {y'_i}_{i=1}^m) which results from the (X_test, Y_test) of the base setting by keeping only the positive instances. This subset is then augmented with an instance (x_{b:a::d:c}, 1) for every instance (x_{a:b::c:d}, 1) in the subset, resulting thus in m total positive instances.
5.2. Datasets
To perform our experiments, we used three corpora: the Penn Discourse TreeBank (PDTB), the Stanford Natural Language Inference corpus (SNLI) and the Microsoft Research Paraphrase Corpus (MRPC).
PDTB dataset The first dataset that we use is PDTB version 2.1 [25] (36,000 pairs of sentences annotated with discourse relations). Relations can be explicitly expressed via a discourse marker, or implicitly expressed, in which case no such discourse marker exists and the annotators provide the one that most closely describes the implicit discourse relation. Relations are organized in a taxonomy of depth 3: level 1 (L1), the top level, has four types of relations (Temporal, Contingency, Expansion and Comparison); level 2 (L2) has 16 relation types; and level 3 (L3) has 23 relation types. For this series of experiments, we used the L1 relations.
SNLI dataset SNLI is a corpus of pairs of sentences from [26]. SNLI was created and annotated manually. It contains 570K human-written sentence pairs, considered a sufficient number of pairs for machine learning. The sentence pairs are annotated with entailment, contradiction and semantic independence. More precisely, a pair of sentences a and b can be annotated with either an Entailment, Contradiction or Neutral relation. Construction of the corpus was done using Mechanical Turk workers, each of whom was presented with a premise in the form of a sentence and asked to provide three hypotheses, in sentential form, one for each of the aforementioned labels. 10% of the corpus was validated by trusted Mechanical Turk workers. Overall, a Fleiss κ of 0.70 was achieved. For our experiments we considered the Neutral relation as symmetric.
MRPC dataset The third corpus is the Microsoft Research Paraphrase Corpus (MRPC) [27]. It contains about 5800 pairs of sentences which may or may not be paraphrases of each other. Each pair of sentences was annotated by two annotators; in case of disagreement, a third annotator resolved the conflict. After this, about two-thirds of the pairs were annotated as paraphrases and one-third as not.
5.3. Embedding techniques
There are well-known word embeddings such as word2vec [15], GloVe [5], BERT [28], fastText [17], etc. It is standard to start from a word embedding to build a sentence embedding. Sentence embedding techniques represent entire sentences and their semantic information as vectors. In this paper, we focus on 2 techniques relying on an initial word embedding.
- The simplest method is to average the word embeddings of all words in a sentence. Although this method ignores both the order of the words and the structure of the sentence, it performs well in many tasks. The final vector has the dimension of the initial word embedding.
- The other approach, suggested in [7], makes use of the Discrete Cosine Transform (DCT) as a simple and efficient way to model both word order and structure in sentences while maintaining practical efficiency. Using the inverse transformation, the original word sequence can be reconstructed. A small constant parameter K needs to be set: one can choose how many features are embedded per sentence by adjusting the value of K, but this increases the final size of the sentence vector by a factor of K. If the initial embedding of words is of dimension n, the final sentence dimension will be K * n (see [7] for a complete description).
In our experiments, we use the average method to embed sentences as it is at least as effective
as DCT [9].
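The averaging method can be sketched in a few lines (pure Python for clarity; `word_vectors` is a hypothetical token-to-vector lookup standing in for GloVe):

```python
def average_embedding(sentence, word_vectors, dim):
    """Average-of-word-vectors sentence embedding: tokens missing from the
    vocabulary are skipped; an all-unknown sentence maps to the zero vector."""
    tokens = [t for t in sentence.lower().split() if t in word_vectors]
    if not tokens:
        return [0.0] * dim
    vecs = [word_vectors[t] for t in tokens]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

# Toy 2-d vocabulary (illustrative values, not real GloVe vectors):
toy = {"john": [1.0, 0.0], "sneezed": [0.0, 1.0]}
print(average_embedding("John sneezed", toy, dim=2))  # [0.5, 0.5]
```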
5.4. Models
Random Forest (RF) We have tested our hypothesis on a classical method successfully used for word analogy classification [29]: Random Forests (RF). The parameters for RF are 100 trees, no maximum depth, and a minimum split of 2. We also use LSTMs, but any other model (SVM, etc.) could have been used.
Bi-LSTM architecture Given a quadruplet of sentences a : b :: c : d which can be an analogy or not, we represent each sentence by its input tokens a = {w_1^a, . . . , w_n^a}, b = {w_1^b, . . . , w_n^b}, c = {w_1^c, . . . , w_n^c} and d = {w_1^d, . . . , w_n^d}. Although sentences can have different lengths, we have empirically fixed n = 35; if a sentence has fewer than 35 word tokens we use padding. Each word token w_i^s (with s ∈ {a, b, c, d} and i ∈ [1 . . . n]) is represented by a GloVe vector of 300 dimensions. In this series of experiments, the LSTM did not use averaging or DCT, since the recurrent nature of LSTMs itself accounts for the structure of a sentence.
Our architecture is composed of four Bi-LSTMs whose output is passed to a feed-forward network. More precisely, for each sentence we recursively calculate h_t = o_t ⊙ tanh(C_t), with ⊙ representing the Hadamard product and

o_t = σ(W_o · [h_{t−1}, x_t] + b_o)

where

C_t = f_t ⊙ C_{t−1} + i_t ⊙ C̃_t

and

f_t = σ(W_f · [h_{t−1}, x_t] + b_f)
C̃_t = tanh(W_C · [h_{t−1}, x_t] + b_C)
i_t = σ(W_i · [h_{t−1}, x_t] + b_i)

In the above, x_t represents the vector for token w_t in a given sentence. These representations are obtained in both directions. Thus, for each sentence, the following representations are obtained:

a = →h^a ++ ←h^a ;  b = →h^b ++ ←h^b ;  c = →h^c ++ ←h^c ;  d = →h^d ++ ←h^d

with ++ representing the concatenation operation and →h^s, ←h^s the forward and backward hidden states for sentence s.
The above representations are given as input to a single-layer feed-forward network:

h = f(W h_LSTM + b)

with

h_LSTM = →h^a ++ ←h^a ++ →h^b ++ ←h^b ++ →h^c ++ ←h^c ++ →h^d ++ ←h^d

using the Rectified Linear Unit (ReLU) as activation function f. Finally, the prediction is performed using a sigmoid function:

ŷ = σ(W_s h_LSTM + b_s) = 1 / (1 + e^{−(W_s h_LSTM + b_s)})

The architecture is guided by a standard binary cross-entropy loss function.
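The gate equations above can be traced with a single pure-Python LSTM step (scalar weights for readability; the actual model uses 300-dimensional GloVe inputs and learned weight matrices):

```python
import math

def lstm_step(x_t, h_prev, C_prev, W, b):
    """One LSTM step following the gate equations: W and b hold the
    parameters of the forget (f), input (i), candidate (C) and output (o)
    gates, each a pair of scalar weights applied to [h_{t-1}, x_t]."""
    sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))
    dot = lambda w: w[0] * h_prev + w[1] * x_t   # W . [h_{t-1}, x_t]
    f_t = sigmoid(dot(W["f"]) + b["f"])          # forget gate
    i_t = sigmoid(dot(W["i"]) + b["i"])          # input gate
    C_tilde = math.tanh(dot(W["C"]) + b["C"])    # candidate cell state
    o_t = sigmoid(dot(W["o"]) + b["o"])          # output gate
    C_t = f_t * C_prev + i_t * C_tilde           # new cell state
    h_t = o_t * math.tanh(C_t)                   # new hidden state
    return h_t, C_t

# Hypothetical fixed weights, just to trace the arithmetic:
W = {k: (0.5, 0.5) for k in "fiCo"}
b = {k: 0.0 for k in "fiCo"}
h, C = lstm_step(x_t=1.0, h_prev=0.0, C_prev=0.0, W=W, b=b)
print(-1.0 < h < 1.0)  # True: the hidden state is bounded by tanh
```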
6. Results and discussion
Results of our experiments for LSTMs and RFs are shown in Tables 1 and 2 respectively. In all cases, we randomly generated quadruplets (a, b, c, d) which we annotated as analogies (class 1) if the pairs (a, b) and (c, d) shared the same relation, or with class 0 if they did not. For PDTB and SNLI we randomly generated 2 million instances for training; the testing and development corpora contained 200,000 instances each. For the paraphrase corpus, we generated 4 million instances for training, and the testing and development corpora contained 200,000 instances each. Each dataset contains an equal number of positive and negative instances. As we can see, the base settings for all datasets perform quite moderately, which is to be expected since our aim was not to create a general model for sentential analogies, which would require much more data and powerful models with billions of parameters. Instead, our goal was to examine under which conditions internal reversal holds. As we can see, in experimental setting 1, for which the test set is the same as the train set but with internal reversal, the results on PDTB and SNLI, which contain relations that are not symmetric, are worse than the base setting. This is not the case, though, for the paraphrase corpus, for which the results are better than the base setting.
In the second set of experiments, we decided to focus solely on the positive instances and examine whether learning the analogy a : b :: c : d also implicitly learnt its internal reversal, that is b : a :: d : c. We used the same base setting that we had learnt, but for testing we created a new dataset: starting from an empty set, we took every positive instance of the training set, performed an internal reversal, and added it to the new test dataset. The resulting dataset has no common instances with the training dataset, but every instance of it is an internal reversal of a positive instance of the training set. As we can see, there is almost no difference in scores for PDTB and SNLI, but the results for the paraphrase corpus (93.412% F1 for LSTMs and 87.544% for RFs) clearly show that when a relation is symmetrical, the model makes almost no distinction between an analogy and its internal reversal. It is interesting to observe that the trend for the LSTM is similar to the RF; however, the results from the LSTM appear to be more stable.
In the third series of experiments, we augmented both the training and testing datasets with the internal reversal. All three datasets showed a significant increase, of almost 20 percentage points in some cases, for the detection of analogies. In the fourth and final set of experiments, we used the same base setting that we had used initially. The test set was constructed from the same test set as the base setting, but we removed all negative instances and focused solely on the positive ones, augmented with the internal reversal. Again we see a significant increase in the results for the detection of analogies, further showing that the model learns internal reversal as well. In Table 1, experimental setting 2 for MRPC has the highest F1: this corpus has more symmetrical relationships when compared to PDTB and SNLI. In Table 2 we observe the trend already seen with the LSTM: experimental setting 2 has the highest F1.
Table 1: Results for LSTM

PDTB                     Precision  Recall   F1       Accuracy
base setting   class 1   54.274     47.476   50.648   53.739
               class 0   53.322     60.001   56.465
Exp. Set. 1    class 1   48.91      39.76    43.863   49.114
               class 0   49.254     58.468   53.467
Exp. Set. 2    class 1   100.0      39.76    56.898   39.76
Exp. Set. 3    class 1   70.16      79.346   74.471   63.733
               class 0   44.038     32.507   37.404
Exp. Set. 4    class 1   100.0      46.585   63.56    46.585

SNLI                     Precision  Recall   F1       Accuracy
base setting   class 1   67.862     67.811   67.837   67.859
               class 0   67.856     67.907   67.882
Exp. Set. 1    class 1   50.111     49.57    49.839   50.11
               class 0   50.11      50.651   50.379
Exp. Set. 2    class 1   100.0      49.57    66.283   49.57
Exp. Set. 3    class 1   84.489     83.982   84.235   79.047
               class 0   68.365     69.185   68.772
Exp. Set. 4    class 1   100.0      59.086   74.282   59.086

MRPC                     Precision  Recall   F1       Accuracy
base setting   class 1   53.45      61.487   57.188   53.969
               class 0   54.671     46.45    50.227
Exp. Set. 1    class 1   80.454     87.638   83.892   83.173
               class 0   86.426     78.708   82.387
Exp. Set. 2    class 1   100.0      87.638   93.412   87.638
Exp. Set. 3    class 1   69.033     72.395   70.674   59.946
               class 0   38.832     35.05    36.844
Exp. Set. 4    class 1   100.0      62.752   77.114   62.75

Table 2: Results for Random Forest

PDTB                     Precision  Recall   F1       Accuracy
base setting   class 1   54.604     33.778   41.737   53.314
               class 0   52.744     72.468   61.053
Exp. Set. 1    class 1   51.254     31.096   38.708   50.826
               class 0   50.639     70.504   58.943
Exp. Set. 2    class 1   100.00     31.096   47.440   31.096
Exp. Set. 3    class 1   66.263     99.953   79.694   66.267
               class 0   69.847     0.213    0.424
Exp. Set. 4    class 1   100.00     32.117   48.619   32.117

SNLI                     Precision  Recall   F1       Accuracy
base setting   class 1   50.725     47.006   48.794   50.729
               class 0   50.732     54.443   52.522
Exp. Set. 1    class 1   50.302     46.189   48.158   50.285
               class 0   50.270     54.379   52.244
Exp. Set. 2    class 1   100.00     46.189   63.191   46.189
Exp. Set. 3    class 1   70.368     86.903   77.766   66.898
               class 0   50.797     26.979   35.241
Exp. Set. 4    class 1   100.00     46.313   63.307   46.313

MRPC                     Precision  Recall   F1       Accuracy
base setting   class 1   54.327     69.353   60.927   54.739
               class 0   55.502     39.599   46.221
Exp. Set. 1    class 1   58.916     77.847   67.071   61.374
               class 0   66.313     44.547   53.293
Exp. Set. 2    class 1   100.00     77.847   87.544   77.847
Exp. Set. 3    class 1   67.523     99.649   80.499   67.437
               class 0   48.952     0.698    1.377
Exp. Set. 4    class 1   100.0      69.139   81.754   69.139
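The per-class precision, recall, F1 and overall accuracy values reported in Tables 1 and 2 can be computed with a few lines of Python. This is an illustrative sketch of the standard metrics, not the authors' evaluation code:

```python
def class_scores(y_true, y_pred, cls):
    """Precision, recall and F1 for one class (analogy = 1, non-analogy = 0)."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def accuracy(y_true, y_pred):
    """Fraction of correctly classified quadruplets over both classes."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```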
7. Conclusion and future work
In this paper, we have suggested a new formal model dedicated to sentence analogies, replacing the standard model for word analogies: a weaker "internal reversal" postulate takes the place of the well-known "central permutation" postulate. From a purely formal viewpoint, we have investigated the consequences of this new model and the extent to which it fits sentence analogies. To validate this approach in practice, we have implemented sentence analogy classifiers using well-known machine learning algorithms. We have also designed two machine learning protocols involving different ways of building a training set, all derived from the expected formal properties. Our results show that an "internal reversal" sentence analogy is recognized by our algorithms as a valid analogy as soon as the underlying relation between sentences is symmetric (e.g., "to be a paraphrase of"). When this relation is not symmetric (e.g., "to be a consequence of"), "internal reversal" sentence analogies are not always recognized; perhaps, in the general case, learning R is not the same as learning R^{-1}. Alternatively, finding a more accurate postulate might be a valid track of research for the future. Analogy postulates could also be used to further constrain the classifier.
Acknowledgments
The authors would like to express their gratitude to the anonymous reviewers for their valuable
comments. They would also like to thank the organizers of this workshop.