Theoretical study and empirical investigation of sentence analogies

Stergos Afantenos1, Suryani Lim2, Henri Prade1 and Gilles Richard1
1 IRIT, University of Toulouse, France
2 Federation University, Churchill, Australia

Abstract
Analogies between 4 sentences, β€œπ‘Ž is to 𝑏 as 𝑐 is to 𝑑”, are usually defined between two pairs of sentences (π‘Ž, 𝑏) and (𝑐, 𝑑) by constraining a relation 𝑅 holding between the sentences of the first pair to hold for the second pair as well. From a theoretical perspective, three postulates define an analogy, one of which is the β€œcentral permutation” postulate, which allows the permutation of the central elements 𝑏 and 𝑐. This postulate is no longer appropriate for sentence analogies, since the existence of 𝑅 offers no guarantee in general for the existence of some relation 𝑆 such that 𝑆 also holds for the pairs (π‘Ž, 𝑐) and (𝑏, 𝑑). In this paper, the β€œcentral permutation” postulate is replaced by a weaker β€œinternal reversal” postulate to provide an appropriate definition of sentence analogies. To empirically validate this postulate, we build an LSTM as well as baseline Random Forest models capable of learning analogies from quadruplets. We use the Penn Discourse Treebank (PDTB), the Stanford Natural Language Inference (SNLI) and the Microsoft Research Paraphrase (MRPC) corpora. Our experiments show that our models, trained on samples of analogies between (π‘Ž, 𝑏) and (𝑐, 𝑑), recognize analogies between (𝑏, π‘Ž) and (𝑑, 𝑐) when the underlying relation is symmetric, thus validating the formal model of sentence analogies based on the β€œinternal reversal” postulate.

1. Introduction
Analogy plays a crucial role in human cognition and intelligence. It has been characterized as β€œthe core of cognition” [1] and has recently gained some interest from the computational linguistics and machine learning communities (see [2, 3]).
Word analogies1 such as β€œParis is to France as Berlin is to Germany” are now well captured via word embeddings [4, 5]. If βƒ—π‘Ž, ⃗𝑏, ⃗𝑐, ⃗𝑑 are the embeddings of words π‘Ž, 𝑏, 𝑐, 𝑑, then π‘Ž : 𝑏 :: 𝑐 : 𝑑2 holds iff (βƒ—π‘Ž, ⃗𝑏, ⃗𝑐, ⃗𝑑) forms a parallelogram in the underlying vector space [6]. Although analogies between words have been extensively studied, analogies between sentences have, to the best of our knowledge, received very scant attention from the community. Moving from words to sentences raises two challenges:

β€’ How to embed sentences in a vector space?
β€’ How do we define a sentence analogy?

We expect sentence embeddings to be dense vectors that reflect the semantic properties of a sentence. Various approaches are available: one can embed each word and take the average of the word vectors, in which case the order of the words is lost. Another option, described in [7], makes use of the Discrete Cosine Transform and allows the sentence to be recovered from its embedding. The question of defining sentence analogy is especially delicate.

IARML@IJCAI-ECAI’2022: Workshop on the Interactions between Analogical Reasoning and Machine Learning, at IJCAI-ECAI’2022, July 2022, Vienna, Austria
1 In the following, β€˜analogy’ refers to a quaternary relation linking 4 items of the form β€œπ‘Ž is to 𝑏 as 𝑐 is to 𝑑”, called analogical proportion.
2 π‘Ž : 𝑏 :: 𝑐 : 𝑑 is a standard notation for analogical proportion.
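The averaging approach just mentioned can be sketched in a few lines. The toy 3-dimensional vectors and the function name below are ours and purely illustrative; a real setting would use pretrained embeddings such as GloVe.

```python
import numpy as np

# Toy word embeddings (illustrative only; the paper uses 300-dimensional GloVe vectors).
word_vectors = {
    "john":    np.array([0.2, 0.1, 0.7]),
    "sneezed": np.array([0.9, 0.3, 0.1]),
    "loudly":  np.array([0.4, 0.8, 0.2]),
}

def embed_sentence(sentence, vectors):
    """Average the word embeddings of all words in the sentence.
    Word order is lost, as noted in the text."""
    words = [w for w in sentence.lower().split() if w in vectors]
    return np.mean([vectors[w] for w in words], axis=0)

v = embed_sentence("John sneezed loudly", word_vectors)
# The sentence vector keeps the dimension of the initial word embedding.
assert v.shape == (3,)
```

Note that the resulting vector has the same dimension as the word embedding, regardless of sentence length.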
Indeed, the aforementioned parallelogram model used for words reflects the usual postulates of analogies, namely: if π‘Ž : 𝑏 :: 𝑐 : 𝑑 holds, then 𝑐 : 𝑑 :: π‘Ž : 𝑏 (symmetry) and π‘Ž : 𝑐 :: 𝑏 : 𝑑 (central permutation) should hold as well. This latter postulate (already questionable between words [8]) is even more debatable with sentences. In the NLP community, analogies between sentences are usually induced from predefined relationships between sentences. A quadruplet of sentences π‘Ž, 𝑏, 𝑐, 𝑑 defines an analogy π‘Ž : 𝑏 :: 𝑐 : 𝑑 if the (implicit or explicit) relation that holds between the sentences of the first pair (π‘Ž, 𝑏) also holds for the second pair (𝑐, 𝑑). Let us consider the following example:

John sneezed loudly (a). Mary was startled (b).
Bob took an analgesic (c). His headache stopped (d).

In that case, the implicit relation 𝑅 between the sentences of a pair is a kind of causal relation. This example indicates that central permutation makes no sense here and raises the question of defining a weaker notion of analogy obeying another system of postulates. With which postulate should central permutation be replaced? In this paper, we propose a postulate we call β€œinternal reversal”, which expresses that if π‘Ž : 𝑏 :: 𝑐 : 𝑑 holds then 𝑏 : π‘Ž :: 𝑑 : 𝑐 holds as well, and we study its consequences. Our main goals are to:

β€’ theoretically investigate the formal consequences of this new model,
β€’ empirically validate the model by implementing various classifiers of sentence analogies.

After presenting in Section 3 the standard formal definitions of analogies, including the β€œcentral permutation” postulate, and their immediate consequences, we focus on the replacement of the β€œcentral permutation” postulate by the internal reversal postulate. Besides fitting better with what is accepted as a sentence analogy in the NLP community, this postulate also impacts the machine learning perspective that we implement.
For natural language sentences, β€œinternal reversal”, as a formal postulate, may have some limitations. However, if 𝑅 = π‘…βˆ’1, where 𝑅 is the common relation that holds between two pairs of sentences (π‘Ž, 𝑏) and (𝑐, 𝑑) (e.g., π‘Ž is a paraphrase of 𝑏), one would expect internal reversal to hold straightforwardly. In that case, a machine learning model trained to recognize π‘Ž : 𝑏 :: 𝑐 : 𝑑 should also recognize 𝑏 : π‘Ž :: 𝑑 : 𝑐. We investigate the conditions under which a machine learning model, trained on quadruplets of sentences (π‘Ž, 𝑏, 𝑐, 𝑑) representing positive and negative instances of analogies, is capable of identifying analogies on which the operation of internal reversal has been performed. We have devised several series of experiments using various underlying models and datasets.

The paper is structured as follows. After reviewing the related work (Section 2), in Section 3 we recall the formal definitions of analogical proportions and investigate the new case of sentence analogies, suggesting the β€œinternal reversal” postulate as a better fit and examining its consequences. In Section 4, we consider the consequences of the formal definition from a machine learning perspective, by suggesting a rigorous extension of an initial training set. Sections 5 and 6 are dedicated to the description of the context, protocol and results of our experiments. This work is an extension of [9], replacing artificially created datasets with human-annotated ones.

2. Related work
Due to the advent of neural models and distributed representations of words, lexical analogies have been the focus of various works in computational linguistics [10, 11, 12, 13, for example]. At the sentential level, few works exist. [14] investigate how existing embedding approaches can capture sentential analogies.
They create two different kinds of datasets, one consisting of replacing words with word analogies from the Google word analogy dataset [15], while the other is based on analogies between sentences that share common relations (entailment, negation, passivization, for example) or syntactic patterns (comparisons, opposites, plurals, among others). The goal is to optimize arg maxπ‘‘βˆˆπ‘‰ cos(𝑣⃗𝑑, 𝑣⃗𝑏 βˆ’ π‘£βƒ—π‘Ž + 𝑣⃗𝑐) with the additional constraint that 𝑑 βˆ‰ {π‘Ž, 𝑏, 𝑐}. Using these datasets, analogies are evaluated using various embeddings, such as GloVe [5], word2vec [15], fastText [16, 17], etc., showing that capturing syntactic analogies based on lexical analogies from the Google word analogies dataset is more effective than recognising analogies based on more semantic information.

[18] use a similar approach to identify the most plausible answer π‘Žπ‘– to a given question π‘ž from a pool 𝐴 of answers by leveraging analogies between (π‘ž, π‘Žπ‘–) and various pairs of what they call β€œprototypical” question/answer pairs, assuming that there is an analogy between (π‘ž, π‘Žπ‘–) and the prototypical pair (π‘žπ‘, π‘Žπ‘). The goal is to select the candidate answer π‘Ž*𝑖 ∈ 𝐴 such that:

π‘Ž*𝑖 = arg min𝑖 ||(π‘žπ‘ βˆ’ π‘Žπ‘) βˆ’ (π‘ž βˆ’ π‘Žπ‘–)||

The authors limit the question/answer pairs to wh- questions from WikiQA and TrecQA. They use Siamese bi-GRUs as their architecture to represent the four sentences. In this manner, the authors learn embedding representations for the sentences, which they compare against various baselines including random vectors, word2vec, InferSent and Sent2Vec, obtaining better results with the WikiQA corpus. Most of the tested sentence embedding models succeeded in recognizing syntactic analogies based on lexical ones but had a harder time capturing analogies between pairs of sentences based on semantics.
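The retrieval objective above can be sketched as follows, assuming cosine similarity as the scoring function. The toy embeddings and the function name are ours; the four "country/capital" vectors are chosen to form a perfect parallelogram, with a distractor added.

```python
import numpy as np

def solve_analogy(a, b, c, vectors):
    """Return the word d maximizing cos(v_d, v_b - v_a + v_c), with d not in {a, b, c}."""
    target = vectors[b] - vectors[a] + vectors[c]
    best, best_sim = None, -np.inf
    for word, v in vectors.items():
        if word in (a, b, c):
            continue  # additional constraint: d must differ from a, b, c
        sim = v @ target / (np.linalg.norm(v) * np.linalg.norm(target))
        if sim > best_sim:
            best, best_sim = word, sim
    return best

# Toy embeddings; "germany" completes the parallelogram, "spain" is a distractor.
vecs = {
    "paris":   np.array([1.0, 0.0, 0.0]),
    "france":  np.array([1.0, 1.0, 0.0]),
    "berlin":  np.array([0.0, 0.0, 1.0]),
    "germany": np.array([0.0, 1.0, 1.0]),
    "spain":   np.array([1.0, 0.9, 0.1]),
}
print(solve_analogy("paris", "france", "berlin", vecs))  # germany
```

With real pretrained embeddings the loop would run over the whole vocabulary 𝑉, which is why efficient implementations vectorize the similarity computation.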
Instead of training a model to select the best candidate amongst a given set of candidates (as in [18]), [19] train an encoder-decoder model based on LSTMs to generate the 𝑑 given a pair (π‘Ž, 𝑏) and a candidate 𝑐. The authors obtain vector encodings βƒ—π‘Ž, ⃗𝑏, ⃗𝑐 using an LSTM guided by two loss functions. They then experiment with concatenation, summation and arithmetic analogy on these vectors to obtain a new vector which is then used as input for the decoding mechanism, showing that arithmetic analogy outperforms the other methods.

In this paper, the aim is to empirically validate the β€œinternal reversal” postulate (without focusing on accuracy). To our knowledge, such a study has not been conducted before.

3. Theoretical Foundations of Analogies
We briefly recall the formal definition of analogy as found in [20, 21, 22]. We focus on a widely accepted definition for sentence analogies and we investigate to what extent sentence analogies obey the formal postulates and what has to be modified in the formal setting to fit with this particular definition.

3.1. Formal definitions
Given a set of items 𝑋, a (proportional) analogy is a quaternary relation supposed to obey the 3 following postulates (e.g., [21]): βˆ€π‘Ž, 𝑏, 𝑐, 𝑑 ∈ 𝑋:

1. π‘Ž : 𝑏 :: π‘Ž : 𝑏 (reflexivity);
2. π‘Ž : 𝑏 :: 𝑐 : 𝑑 β†’ 𝑐 : 𝑑 :: π‘Ž : 𝑏 (symmetry);
3. π‘Ž : 𝑏 :: 𝑐 : 𝑑 β†’ π‘Ž : 𝑐 :: 𝑏 : 𝑑 (central permutation).

These postulates have straightforward consequences such as:

β€’ π‘Ž : π‘Ž :: 𝑏 : 𝑏 (identity);
β€’ π‘Ž : 𝑏 :: 𝑐 : 𝑑 β†’ 𝑏 : π‘Ž :: 𝑑 : 𝑐 (internal reversal);
β€’ π‘Ž : 𝑏 :: 𝑐 : 𝑑 β†’ 𝑑 : 𝑏 :: 𝑐 : π‘Ž (extreme permutation);
β€’ π‘Ž : 𝑏 :: 𝑐 : 𝑑 β†’ 𝑑 : 𝑐 :: 𝑏 : π‘Ž (complete reversal).
Among the 24 permutations of π‘Ž, 𝑏, 𝑐, 𝑑, the previous postulates induce 3 distinct classes, each containing 8 distinct proportions regarded as equivalent due to the postulates: π‘Ž : 𝑏 :: 𝑐 : 𝑑 has in its class 𝑐 : 𝑑 :: π‘Ž : 𝑏, 𝑐 : π‘Ž :: 𝑑 : 𝑏, 𝑑 : 𝑏 :: 𝑐 : π‘Ž, 𝑑 : 𝑐 :: 𝑏 : π‘Ž, 𝑏 : π‘Ž :: 𝑑 : 𝑐, 𝑏 : 𝑑 :: π‘Ž : 𝑐, and π‘Ž : 𝑐 :: 𝑏 : 𝑑. But 𝑏 : π‘Ž :: 𝑐 : 𝑑 and π‘Ž : 𝑑 :: 𝑐 : 𝑏 do not belong to the class of π‘Ž : 𝑏 :: 𝑐 : 𝑑 and are elements of the two other classes.

3.2. Sentence analogies
In the NLP community, the 4 items π‘Ž, 𝑏, 𝑐, 𝑑 are sentences in natural language, not necessarily identical. It is widely admitted that the sentences are in analogy (i.e., π‘Ž : 𝑏 :: 𝑐 : 𝑑) as soon as there is a relation 𝑅 between sentences such that 𝑅(π‘Ž, 𝑏) and 𝑅(𝑐, 𝑑). The example from the introduction is a perfect illustration of this definition, where the relation 𝑅 is just causality:

John sneezed loudly (a). Mary was startled (b).
Bob took an analgesic (c). His headache stopped (d).

But:

Il fait beau aujourd’hui (a). Today we have nice weather (b).
Il vaut mieux Γ©viter la guerre (c). It is better to avoid war (d).

is another example of an analogy between sentences, where the implicit relation 𝑅 is β€œπ‘ is the English translation of the French sentence π‘Žβ€. From a logical viewpoint, this can be expressed as:

π‘Ž : 𝑏 :: 𝑐 : 𝑑 iff βˆƒπ‘… s.t. 𝑅(π‘Ž, 𝑏) ∧ 𝑅(𝑐, 𝑑) (1)

where ∧ is just the formal notation for the and connector. This definition can be considered as quite vague because, as advocated in [23, 24], there is always a way to find such a relation 𝑅 between 2 sentences. A more effective option used in the NLP community is to consider that the underlying relation 𝑅 belongs to a finite set 𝑆 of relations.
Such relations can be, for example, discourse relations (Elaboration, Continuation, Contrast, Concession, etc.) or a Causality relation, as is the case in the above example. Then, the formal definition has to be refined into:

π‘Ž : 𝑏 :: 𝑐 : 𝑑 iff βˆƒπ‘… ∈ 𝑆 s.t. 𝑅(π‘Ž, 𝑏) ∧ 𝑅(𝑐, 𝑑) (2)

where 𝑆 = {𝑅1, . . . , 𝑅𝑛} is a finite non-empty set of target relations. With this definition, we constrain the relation 𝑅 to belong to a predefined set. Obviously, in the case of French-English translation, the list 𝑆 is reduced to only one relation. It is quite clear that reflexivity and symmetry are still valid postulates for sentence analogies, i.e., they are satisfied with both above definitions. Back to our initial example:

John sneezed loudly (a). Mary was startled (b).
Bob took an analgesic (c). His headache stopped (d).

Definition (1) or (2) still applies to 𝑐 : 𝑑 :: π‘Ž : 𝑏:

Bob took an analgesic (c). His headache stopped (d).
John sneezed loudly (a). Mary was startled (b).

which is then a valid analogy. Nevertheless, central permutation is not satisfied by the above Definitions (1) or (2).

3.3. Internal reversal for sentence analogies
Let us now focus on the β€œinternal reversal” postulate as an alternative to β€œcentral permutation”:

π‘Ž : 𝑏 :: 𝑐 : 𝑑 β†’ 𝑏 : π‘Ž :: 𝑑 : 𝑐 (internal reversal)

By definition, if 𝑅(π‘Ž, 𝑏) holds then π‘…βˆ’1(𝑏, π‘Ž) holds. Definition (1) supports β€œinternal reversal”: for instance, if the relation 𝑅(π‘Ž, 𝑏) is interpreted as β€œπ‘Ž is a cause of 𝑏”, π‘…βˆ’1(𝑏, π‘Ž) can be the passive form β€œπ‘ is a consequence of π‘Žβ€. But Definition (2) does not support β€œinternal reversal” unless, for each relation 𝑅 in the set 𝑆 of built-in relations, we also have its counterpart π‘…βˆ’1. A simple way to ensure this property is to consider relations 𝑅 such that 𝑅 = π‘…βˆ’1, for instance 𝑅(π‘Ž, 𝑏) defined as β€œπ‘Ž is a paraphrase of 𝑏”.
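Definition (2) can be sketched programmatically by modeling each relation in 𝑆 as a set of sentence pairs. The relation names and example pairs below are taken from the running examples; the function name is ours.

```python
def is_analogy(a, b, c, d, S):
    """Definition (2): a : b :: c : d iff some relation R in S holds
    for both (a, b) and (c, d). Relations are modeled as sets of pairs."""
    return any((a, b) in R and (c, d) in R for R in S.values())

# Hypothetical relation repository with two target relations.
S = {
    "Causality": {
        ("John sneezed loudly", "Mary was startled"),
        ("Bob took an analgesic", "His headache stopped"),
    },
    "Translation(fr->en)": {
        ("Il fait beau aujourd'hui", "Today we have nice weather"),
        ("Il vaut mieux eviter la guerre", "It is better to avoid war"),
    },
}

# The running example is a valid analogy under Definition (2)...
assert is_analogy("John sneezed loudly", "Mary was startled",
                  "Bob took an analgesic", "His headache stopped", S)
# ...but central permutation fails: no relation in S links
# "John sneezed loudly" to "Bob took an analgesic".
assert not is_analogy("John sneezed loudly", "Bob took an analgesic",
                      "Mary was startled", "His headache stopped", S)
```

The second assertion makes the failure of central permutation concrete: swapping the central elements pairs sentences that no relation in 𝑆 connects.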
In the general case, a proper definition of a sentence analogy supporting the 3 postulates (reflexivity, symmetry, internal reversal) would be:

π‘Ž : 𝑏 :: 𝑐 : 𝑑 iff βˆƒπ‘… ∈ 𝑆 s.t. (𝑅(π‘Ž, 𝑏) ∧ 𝑅(𝑐, 𝑑)) ∨ (π‘…βˆ’1(π‘Ž, 𝑏) ∧ π‘…βˆ’1(𝑐, 𝑑)) (3)

This leads to a formal definition of sentence analogies with:

1. π‘Ž : 𝑏 :: π‘Ž : 𝑏 (reflexivity);
2. π‘Ž : 𝑏 :: 𝑐 : 𝑑 β†’ 𝑐 : 𝑑 :: π‘Ž : 𝑏 (symmetry);
3. π‘Ž : 𝑏 :: 𝑐 : 𝑑 β†’ 𝑏 : π‘Ž :: 𝑑 : 𝑐 (internal reversal).

As immediate consequences, we get that:

β€’ there are only 4 equivalent forms (instead of 8 with the central permutation postulate) for an analogy: π‘Ž : 𝑏 :: 𝑐 : 𝑑, 𝑐 : 𝑑 :: π‘Ž : 𝑏, 𝑑 : 𝑐 :: 𝑏 : π‘Ž, and 𝑏 : π‘Ž :: 𝑑 : 𝑐;
β€’ π‘Ž : 𝑏 :: 𝑐 : 𝑑 β†’ 𝑑 : 𝑐 :: 𝑏 : π‘Ž (complete reversal);
β€’ π‘Ž : π‘Ž :: π‘Ž : π‘Ž (full identity) is still satisfied;
β€’ π‘Ž : π‘Ž :: 𝑏 : 𝑏 (identity) is no longer a consequence of the new postulates.

4. Implications for Machine Learning
Let us assume that we have at our disposal a repository of pairs of sentences (π‘Ž, 𝑏) with their associated relation 𝑅. From this repository, we need a training set of examples for the classifier. Given the previous section, several steps can be implemented.

1) Building an initial training set of analogies π‘Ž : 𝑏 :: 𝑐 : 𝑑 can be done by joining 2 pairs (π‘Ž, 𝑏) and (𝑐, 𝑑) belonging to the same relation 𝑅. This constitutes a set of positive examples 𝒳+ such that for every quadruplet (π‘Ž, 𝑏, 𝑐, 𝑑) = x ∈ 𝒳+ the training instances are {x, 𝑦} with 𝑦 = 1. In terms of negative examples, joining 2 pairs (π‘Ž, 𝑏) and (𝑐, 𝑑) belonging to different relations leads to building a set of negative examples π’³βˆ’ such that for every quadruplet (π‘Ž, 𝑏, 𝑐, 𝑑) = x ∈ π’³βˆ’ the training instances are {x, 𝑦} with 𝑦 = 0.
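The difference between the two postulate systems can be checked mechanically: closing a quadruple under symmetry and central permutation yields the 8 equivalent forms of Section 3.1, while closing it under symmetry and internal reversal yields only 4 forms. A small sketch (function names are ours):

```python
def closure(start, ops):
    """Compute the closure of a quadruple under a set of permutation operations."""
    seen, frontier = {start}, [start]
    while frontier:
        q = frontier.pop()
        for op in ops:
            r = op(q)
            if r not in seen:
                seen.add(r)
                frontier.append(r)
    return seen

symmetry = lambda q: (q[2], q[3], q[0], q[1])  # a:b::c:d -> c:d::a:b
central  = lambda q: (q[0], q[2], q[1], q[3])  # a:b::c:d -> a:c::b:d
internal = lambda q: (q[1], q[0], q[3], q[2])  # a:b::c:d -> b:a::d:c

q = ('a', 'b', 'c', 'd')
print(len(closure(q, [symmetry, central])))   # 8 forms (word analogies)
print(len(closure(q, [symmetry, internal])))  # 4 forms (sentence analogies)
```

The 4-element closure is exactly {π‘Ž : 𝑏 :: 𝑐 : 𝑑, 𝑐 : 𝑑 :: π‘Ž : 𝑏, 𝑑 : 𝑐 :: 𝑏 : π‘Ž, 𝑏 : π‘Ž :: 𝑑 : 𝑐}, confirming that complete reversal remains a consequence of the weaker postulates.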
The training set 𝒳 = 𝒳+ βˆͺ π’³βˆ’ is then a set of quadruplets of sentences π‘Ž, 𝑏, 𝑐, 𝑑 such that:

β€’ if the implicit/explicit relation 𝑅 between the pair (π‘Ž, 𝑏) also holds for the pair (𝑐, 𝑑), then (π‘Ž, 𝑏, 𝑐, 𝑑) ∈ 𝒳+
β€’ if the implicit/explicit relation 𝑅 between the pair (π‘Ž, 𝑏) does not hold for the pair (𝑐, 𝑑), then (π‘Ž, 𝑏, 𝑐, 𝑑) ∈ π’³βˆ’

Applying the symmetry postulate allows us to double the size of 𝒳+, simply by adding (𝑐, 𝑑, π‘Ž, 𝑏) to 𝒳+ as soon as (π‘Ž, 𝑏, 𝑐, 𝑑) ∈ 𝒳+. We thereby reduce the theoretical imbalance between 𝒳+ and π’³βˆ’.

2) The same method applies with the internal reversal postulate, by adding (𝑏, π‘Ž, 𝑑, 𝑐) to 𝒳+ as soon as (π‘Ž, 𝑏, 𝑐, 𝑑) ∈ 𝒳+. This again doubles the size of 𝒳+. At this stage, we have multiplied the initial size of our positive training set 𝒳+ by 4, by introducing common-sense analogies deducible from the initial ones, but not necessarily related to the initial list of relations 𝑆. Can we do more?

The Identity Relation. For completeness' sake, one could argue that it is still possible to extend the set of positive examples, since it seems acceptable to consider π‘Ž : π‘Ž :: 𝑏 : 𝑏 as a valid sentence analogy even though the identity relation 𝐼𝑑 likely does not belong to 𝑆. Indeed, the identity relation 𝐼𝑑 holds between the pairs (π‘Ž, π‘Ž) and (𝑏, 𝑏). Although recognition of analogies based on the identity relation might seem trivial from an NLP perspective, it could still be a useful task in case we want to evaluate the quality of our classifier. In other words, if a potential classifier is not able to identify analogies based on the identity relation, one should probably reconsider the underlying approach.

The Inverse Relation. A scenario that appears quite often in Natural Language Processing, although far from being a generalized phenomenon, is that a relation 𝑅 between sentences (or larger portions of text, for that matter) is its own inverse π‘…βˆ’1.
An instance of such a relation is, for example, the paraphrase relation. If π‘Ž is a paraphrase of 𝑏, obviously 𝑏 is a paraphrase of π‘Ž. The same holds for the operation of translation: if sentence π‘Ž is a translation of 𝑏, then again 𝑏 is a translation of π‘Ž. Following our initial definition of analogy, we will have to accept:

π‘Ž : 𝑏 :: 𝑏 : π‘Ž when 𝑅 is its own inverse

Before moving to the details of the empirical validation, we describe the datasets we use in the following section.

5. Experiments
As explained earlier in this paper, our main goal is the empirical validation of internal reversal for sentential analogies, using various corpora. To investigate this postulate we devise the following sets of experiments.

5.1. Experimental settings
Base setting. Given a training set (π’³π‘‘π‘Ÿπ‘Žπ‘–π‘›, π’΄π‘‘π‘Ÿπ‘Žπ‘–π‘›) = ({x𝑖}𝑛𝑖=1, {𝑦𝑖}𝑛𝑖=1) and a test set (𝒳𝑑𝑒𝑠𝑑, 𝒴𝑑𝑒𝑠𝑑) = ({x𝑖}π‘šπ‘–=1, {𝑦𝑖}π‘šπ‘–=1), with π‘š typically being a tenth of 𝑛, x𝑖 representing a quadruplet of sentences π‘Ž : 𝑏 :: 𝑐 : 𝑑,3 and 𝑦𝑖 ∈ {0, 1}, we learn a model ℋ𝑏 capable of identifying analogies with a certain accuracy. Crucially, |{π‘¦π‘˜ : π‘¦π‘˜ = 1}| = |{π‘¦π‘˜ : π‘¦π‘˜ = 0}| both for the training and testing sets. Due to the huge number of instances at our disposal, there is no need at this stage to implement any further data augmentation process, as explained in the previous section. In other words, we have an equal number of positive and negative instances in the training and testing sets, for a total of 4M instances.

Internal reversal on the test set (Experimental setting 1). In this series of experiments, we used the same training set (π’³π‘‘π‘Ÿπ‘Žπ‘–π‘›, π’΄π‘‘π‘Ÿπ‘Žπ‘–π‘›) = ({x𝑖}𝑛𝑖=1, {𝑦𝑖}𝑛𝑖=1) as the base setting ℋ𝑏. To construct the test set, we perform internal reversal on all the instances of the train set that we have used in the base setting.
Our goal is to see whether we get similar results on the internally reversed analogies.

Test set from train distribution with internal reversal (Experimental setting 2). For this series of experiments we use the same training set (π’³π‘‘π‘Ÿπ‘Žπ‘–π‘›, π’΄π‘‘π‘Ÿπ‘Žπ‘–π‘›) = ({x𝑖}𝑛𝑖=1, {𝑦𝑖}𝑛𝑖=1) as the base setting ℋ𝑏. The test set though is constructed in the following way: for every positive instance (xπ‘Ž:𝑏::𝑐:𝑑, 1) in (π’³π‘‘π‘Ÿπ‘Žπ‘–π‘›, π’΄π‘‘π‘Ÿπ‘Žπ‘–π‘›) we add the internally reversed instance (x𝑏:π‘Ž::𝑑:𝑐, 1) to the new testing set (𝒳𝑑𝑒𝑠𝑑, 𝒴𝑑𝑒𝑠𝑑), whose size thus is 𝑛/2. In contrast to experimental setting 1, where the underlying sentences of the train and test distributions are different, in this series of experiments we want to see how well a trained model can detect analogies after performing internal reversal on the same set of pairs of sentences.

Augmenting training and test sets (Experimental setting 3). In this series of experiments we learn a model β„‹π‘Ž using a training set (π’³π‘Žπ‘‘π‘Ÿπ‘Žπ‘–π‘›, π’΄π‘Žπ‘‘π‘Ÿπ‘Žπ‘–π‘›) = ({x𝑖}𝑖=1..𝑛+𝑛/2, {𝑦𝑖}𝑖=1..𝑛+𝑛/2) and a test set (π’³π‘Žπ‘‘π‘’π‘ π‘‘, π’΄π‘Žπ‘‘π‘’π‘ π‘‘) = ({x𝑖}𝑖=1..π‘š+π‘š/2, {𝑦𝑖}𝑖=1..π‘š+π‘š/2), where both train and test sets have been augmented using the following rule: for each instance (xπ‘Ž:𝑏::𝑐:𝑑, 1) in the train or test set we add the instance (x𝑏:π‘Ž::𝑑:𝑐, 1). In other words, we double only the positive instances by adding the internal reversal of a quadruplet as a positive instance.

Augmenting the test set (Experimental setting 4). In this series of experiments the train set, and thus the model learnt, is the same as in the base setting.

3 Henceforth, we will denote a representation for a quadruplet of sentences π‘Ž : 𝑏 :: 𝑐 : 𝑑 by the vector xπ‘Ž:𝑏::𝑐:𝑑.
In other words, we have a training set (π’³π‘‘π‘Ÿπ‘Žπ‘–π‘›, π’΄π‘‘π‘Ÿπ‘Žπ‘–π‘›) = ({x𝑖}𝑛𝑖=1, {𝑦𝑖}𝑛𝑖=1) from which we learn a model ℋ𝑏. For testing though we have a new test set (π’³π‘Žπ‘‘π‘‘π‘’π‘ π‘‘, π’΄π‘Žπ‘‘π‘‘π‘’π‘ π‘‘) = ({x𝑖}π‘šπ‘–=1, {𝑦𝑖}π‘šπ‘–=1) which results from (𝒳𝑑𝑒𝑠𝑑, 𝒴𝑑𝑒𝑠𝑑) of the base setting by keeping only the positive instances. This subset is then augmented with an instance (x𝑏:π‘Ž::𝑑:𝑐, 1) for every instance (xπ‘Ž:𝑏::𝑐:𝑑, 1) in the subset, thus resulting in π‘š positive instances in total.

5.2. Datasets
To perform our experiments, we used three corpora: the Penn Discourse TreeBank (PDTB), the Stanford Natural Language Inference corpus (SNLI) and the Microsoft Research Paraphrase Corpus (MRPC).

PDTB dataset. The first dataset that we use is PDTB version 2.1 [25], containing 36,000 pairs of sentences annotated with discourse relations. Relations can be explicitly expressed via a discourse marker, or implicitly expressed, in which case no such discourse marker exists and the annotators provide the one that most closely describes the implicit discourse relation. Relations are organized in a taxonomy of depth 3. Level 1 (L1), the top level, has four types of relations (Temporal, Contingency, Expansion and Comparison), level 2 (L2) has 16 relation types and level 3 (L3) has 23 relation types. For this series of experiments, we used the L1 relations.

SNLI dataset. SNLI is a corpus of pairs of sentences from [26]. SNLI was created and annotated manually. It contains 570K human-written sentence pairs, considered a sufficient number of pairs for machine learning. Each pair of sentences π‘Ž and 𝑏 is annotated with one of the Entailment, Contradiction or Neutral (semantic independence) relations.
Construction of the corpus was done using Mechanical Turk workers, each of whom was presented with a premise in the form of a sentence and asked to provide three hypotheses, in sentential form, one for each of the aforementioned labels. 10% of the corpus was validated by trusted Mechanical Turk workers. Overall, a Fleiss πœ… of 0.70 was achieved. For our experiments we considered the Neutral relation as symmetric.

MRPC dataset. The third corpus is the Microsoft Research Paraphrase Corpus (MRPC [27]). It contains about 5800 pairs of sentences which can either be a paraphrase of each other or not. Each pair of sentences was annotated by two annotators. In case of disagreement, a third annotator resolved the conflict. After this, about two-thirds of the pairs were annotated as paraphrases and one-third as not.

5.3. Embedding techniques
There are well-known word embeddings such as word2vec [15], GloVe [5], BERT [28], fastText [17], etc. It is standard to start from a word embedding to build a sentence embedding. Sentence embedding techniques represent entire sentences and their semantic information as vectors. In this paper, we focus on 2 techniques relying on an initial word embedding.
- The simplest method is to average the word embeddings of all words in a sentence. Although this method ignores both the order of the words and the structure of the sentence, it performs well in many tasks. The final vector has the dimension of the initial word embedding.
- The other approach, suggested in [7], makes use of the Discrete Cosine Transform (DCT) to model both word order and structure in sentences while maintaining practical efficiency. Using the inverse transformation, the original word sequence can be reconstructed. A parameter 𝑙, a small constant, needs to be set.
One can choose how many features are embedded per sentence by adjusting the value of 𝑙, but this increases the final size of the sentence vector by a factor of 𝑙. If the initial word embedding is of dimension 𝑛, the final sentence dimension will be 𝑛 Γ— 𝑙 (see [7] for a complete description). In our experiments, we use the average method to embed sentences, as it is at least as effective as DCT [9].

5.4. Models
Random Forest (RF). We have tested our hypothesis on a classical method successfully used for word analogy classification [29]: Random Forests (RF). The parameters for RF are 100 trees, no maximum depth, and a minimum split of 2. We also use LSTMs, but any other model (SVM, etc.) could have been used.

Bi-LSTM architecture. Given a quadruplet of sentences π‘Ž : 𝑏 :: 𝑐 : 𝑑, which can be an analogy or not, we represent each sentence by its input tokens π‘Ž = {𝑀1π‘Ž, . . . , π‘€π‘˜π‘Ž}, 𝑏 = {𝑀1𝑏, . . . , π‘€π‘˜π‘}, 𝑐 = {𝑀1𝑐, . . . , π‘€π‘˜π‘} and 𝑑 = {𝑀1𝑑, . . . , π‘€π‘˜π‘‘}. Although sentences can have different lengths, we have empirically fixed π‘˜ = 35; if a sentence has fewer than 35 word tokens we use padding. Each word token 𝑀𝑖𝑠 (with 𝑠 ∈ {π‘Ž, 𝑏, 𝑐, 𝑑} and 𝑖 ∈ [1 . . . π‘˜]) is represented by a GloVe vector of 300 dimensions. In this series of experiments, the LSTMs did not use averaging or DCT, since the recurrent nature of LSTMs itself accounts for the structure of a sentence. Our architecture is composed of four Bi-LSTMs whose output is passed on to a feed-forward network. More precisely, for each sentence we recursively calculate

h𝑑 = π‘œπ‘‘ βŠ— tanh(𝐢𝑑)

with βŠ— representing the Hadamard operation and

π‘œπ‘‘ = 𝜎(Wπ‘œ Β· [hπ‘‘βˆ’1, x𝑑] + bπ‘œ)
where

𝐢𝑑 = 𝑓𝑑 βŠ— πΆπ‘‘βˆ’1 + 𝑖𝑑 βŠ— C̃𝑑

and

𝑖𝑑 = 𝜎(W𝑖 Β· [hπ‘‘βˆ’1, x𝑑] + b𝑖)
C̃𝑑 = tanh(W𝐢 Β· [hπ‘‘βˆ’1, x𝑑] + b𝐢)
𝑓𝑑 = 𝜎(W𝑓 Β· [hπ‘‘βˆ’1, x𝑑] + b𝑓)

In the above, x𝑑 represents the vector for token 𝑀𝑑 in a given sentence. These representations are obtained in both directions. Thus, each sentence 𝑠 ∈ {π‘Ž, 𝑏, 𝑐, 𝑑} is represented by the concatenation of its forward and backward hidden states:

𝑠 = β†’h𝑠𝑑 ++ ←h𝑠𝑑

with ++ representing the concatenation operation. The above representations are given as input to a single-layer feed-forward network h𝑓 = 𝑓(W𝑇 h𝐿𝑆𝑇𝑀 + b) with

h𝐿𝑆𝑇𝑀 = β†’hπ‘Žπ‘‘ ++ ←hπ‘Žπ‘‘ ++ β†’h𝑏𝑑 ++ ←h𝑏𝑑 ++ β†’h𝑐𝑑 ++ ←h𝑐𝑑 ++ β†’h𝑑𝑑 ++ ←h𝑑𝑑

using the Rectified Linear Unit (ReLU) as activation function. Finally, the prediction is performed using a sigmoid function:

𝑦̂ = 𝜎(W𝑇 h𝐿𝑆𝑇𝑀 + b) = 1 / (1 + π‘’βˆ’(W𝑇 h𝐿𝑆𝑇𝑀 + b))

The architecture is trained with a standard binary cross-entropy loss function.

6. Results and discussion
The results of our experiments for LSTMs and RFs are shown in Tables 1 and 2 respectively. In all cases, we randomly generated quadruplets (π‘Ž, 𝑏, 𝑐, 𝑑) which we annotated as analogies (class 1) if the pairs (π‘Ž, 𝑏) and (𝑐, 𝑑) shared the same relation, or as class 0 if they did not. For PDTB and SNLI we randomly generated 2 million instances for training; the testing and development corpora contained 200,000 instances each. For the paraphrase corpus, we generated 4 million instances for training, while the testing and development corpora contained 200,000 instances each. Each dataset contains an equal number of positive and negative instances.
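The quadruplet generation just described can be sketched as follows; the function name and the shape of the pair repository are ours.

```python
import random

def make_quadruplets(pairs, n_instances, seed=0):
    """Generate labeled analogy quadruplets from relation-annotated sentence pairs.

    `pairs` maps a relation name to a list of (a, b) sentence pairs.
    A quadruplet (a, b, c, d) is labeled 1 if (a, b) and (c, d) are drawn
    from the same relation, 0 otherwise (balanced half and half).
    """
    rng = random.Random(seed)
    relations = list(pairs)
    data = []
    for i in range(n_instances):
        if i % 2 == 0:  # positive: both pairs drawn from the same relation
            r = rng.choice(relations)
            (a, b), (c, d) = rng.sample(pairs[r], 2)
            data.append(((a, b, c, d), 1))
        else:           # negative: pairs drawn from two different relations
            r1, r2 = rng.sample(relations, 2)
            a, b = rng.choice(pairs[r1])
            c, d = rng.choice(pairs[r2])
            data.append(((a, b, c, d), 0))
    return data

# Hypothetical repository with toy sentence pairs for two relations.
repo = {
    "Causality": [("John sneezed loudly", "Mary was startled"),
                  ("Bob took an analgesic", "His headache stopped")],
    "Paraphrase": [("He left quickly", "He departed in a hurry"),
                   ("It is raining", "Rain is falling")],
}
sample = make_quadruplets(repo, 4)
# Each element is ((a, b, c, d), label), with labels balanced between 1 and 0.
```

Augmentation via symmetry and internal reversal (Section 4) would then add (c, d, a, b) and (b, a, d, c) for every positive quadruplet (a, b, c, d).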
As we can see, the base settings for all datasets perform quite moderately, which is to be expected since our aim was not to create a general model for sentential analogies, which would require much more data and powerful models with billions of parameters. Instead, our goal was to examine under which conditions internal reversal holds. In experimental setting 1, for which the test set is the same as the train set but with internal reversal, results on PDTB and SNLI, which contain relations that are not symmetric, are worse than the base setting. This is not the case though for the paraphrase corpus, for which results are better than the base setting.

In the second set of experiments, we decided to focus solely on the positive instances and examine whether learning the analogy π‘Ž : 𝑏 :: 𝑐 : 𝑑 also implicitly teaches internal reversal, that is 𝑏 : π‘Ž :: 𝑑 : 𝑐. We used the same base model that we had learnt, but for testing we created a new dataset. Starting from an empty set, we took every positive instance of the training set, performed an internal reversal, and added the result to the new test dataset. The resulting dataset has no instances in common with the training dataset, but every instance in it is an internal reversal of a positive instance of the training set. As we can see, there is almost no difference in scores for PDTB and SNLI, but the results for the paraphrase corpus (93.412% 𝐹1 for LSTMs and 87.544% for RFs) clearly show that when a relation is symmetric the model makes almost no difference between an analogy and its internal reversal. It is interesting to observe that the trend for the LSTM is similar to the RF; however, the results from the LSTM appear to be more stable.

In the third series of experiments, we augmented both the training and testing datasets with the internal reversal.
All three datasets showed a significant increase, of almost 20 percentage points in some cases, in the detection of analogies. In the fourth and final set of experiments, we used the same base setting that we had used initially. The test set was constructed from the same test set as the base setting, but we removed all negative instances and focused solely on the positive ones, augmented with the internal reversal. Again we can see a significant increase in the results for the detection of analogies, further showing that the model learns internal reversal as well. In Table 1, experimental setting 2 for MRPC has the highest F1: this corpus has more symmetric relationships than PDTB and SNLI. In Table 2 we observe the trend already seen with the LSTM: experimental setting 2 has the highest F1.

Table 1: Results for LSTM

PDTB                     Precision  Recall   F1       Accuracy
base setting   class 1   54.274     47.476   50.648   53.739
               class 0   53.322     60.001   56.465
Exp. Set. 1    class 1   48.91      39.76    43.863   49.114
               class 0   49.254     58.468   53.467
Exp. Set. 2    class 1   100.0      39.76    56.898   39.76
Exp. Set. 3    class 1   70.16      79.346   74.471   63.733
               class 0   44.038     32.507   37.404
Exp. Set. 4    class 1   100.0      46.585   63.56    46.585

SNLI
base setting   class 1   67.862     67.811   67.837   67.859
               class 0   67.856     67.907   67.882
Exp. Set. 1    class 1   50.111     49.57    49.839   50.11
               class 0   50.11      50.651   50.379
Exp. Set. 2    class 1   100.0      49.57    66.283   49.57
Exp. Set. 3    class 1   84.489     83.982   84.235   79.047
               class 0   68.365     69.185   68.772
Exp. Set. 4    class 1   100.0      59.086   74.282   59.086

MRPC
base setting   class 1   53.45      61.487   57.188   53.969
               class 0   54.671     46.45    50.227
Exp. Set. 1    class 1   80.454     87.638   83.892   83.173
               class 0   86.426     78.708   82.387
Exp. Set. 2    class 1   100.0      87.638   93.412   87.638
Exp. Set. 3    class 1   69.033     72.395   70.674   59.946
               class 0   38.832     35.05    36.844
Exp. Set. 4    class 1   100.0      62.752   77.114   62.75

Table 2: Results for Random Forest

PDTB                     Precision  Recall   F1       Accuracy
base setting   class 1   54.604     33.778   41.737   53.314
               class 0   52.744     72.468   61.053
Exp. Set. 1    class 1   51.254     31.096   38.708   50.826
               class 0   50.639     70.504   58.943
Exp. Set. 2    class 1   100.00     31.096   47.440   31.096
Exp. Set. 3    class 1   66.263     99.953   79.694   66.267
               class 0   69.847     0.213    0.424
Exp. Set. 4    class 1   100.00     32.117   48.619   32.117

SNLI
base setting   class 1   50.725     47.006   48.794   50.729
               class 0   50.732     54.443   52.522
Exp. Set. 1    class 1   50.302     46.189   48.158   50.285
               class 0   50.270     54.379   52.244
Exp. Set. 2    class 1   100.00     46.189   63.191   46.189
Exp. Set. 3    class 1   70.368     86.903   77.766   66.898
               class 0   50.797     26.979   35.241
Exp. Set. 4    class 1   100.00     46.313   63.307   46.313

MRPC
base setting   class 1   54.327     69.353   60.927   54.739
               class 0   55.502     39.599   46.221
Exp. Set. 1    class 1   58.916     77.847   67.071   61.374
               class 0   66.313     44.547   53.293
Exp. Set. 2    class 1   100.00     77.847   87.544   77.847
Exp. Set. 3    class 1   67.523     99.649   80.499   67.437
               class 0   48.952     0.698    1.377
Exp. Set. 4    class 1   100.0      69.139   81.754   69.139

7. Conclusion and future work
In this paper, we have suggested a new formal model dedicated to sentence analogies, replacing the standard model for word analogies. A weaker β€œinternal reversal” postulate takes the place of the well-known β€œcentral permutation” postulate. From a purely formal viewpoint, we have investigated the consequences of this new model and to what extent it fits with sentence analogies. To validate this approach in practice, we have implemented sentence analogy classifiers, using well-known machine learning algorithms. We have also designed two machine learning protocols involving different ways to build a training set, all derived from the formal expected properties. Our results show that an β€œinternal reversal” sentence analogy is recognized by our algorithms as a valid analogy as soon as the underlying relation between sentences is symmetric (e.g. β€œto be a paraphrase of”).
When this relation is not symmetric (e.g., β€œto be a consequence of”), β€œinternal reversal” sentence analogies are not always recognized. Perhaps, in the general case, learning a relation 𝑅 is not the same as learning its inverse 𝑅⁻¹. Alternatively, finding a more accurate postulate might be a worthwhile line of research for the future. Analogy postulates could also be used to further constrain the classifier.

Acknowledgments

The authors would like to express their gratitude to the anonymous reviewers for their valuable comments. They would also like to thank the organizers of this workshop.

References

[1] D. R. Hofstadter, Analogy as the Core of Cognition, MIT Press, 2001, pp. 499–538.
[2] C. Allen, T. Hospedales, Analogies explained: Towards understanding word embeddings, in: K. Chaudhuri, R. Salakhutdinov (Eds.), Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, PMLR, 2019, pp. 223–231. URL: http://proceedings.mlr.press/v97/allen19a.html.
[3] F. Chollet, On the measure of intelligence, 2019. arXiv:1911.01547.
[4] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, J. Dean, Distributed representations of words and phrases and their compositionality, in: C. J. C. Burges et al. (Eds.), Advances in Neural Information Processing Systems 26, Curran Associates Inc., 2013, pp. 3111–3119.
[5] J. Pennington, R. Socher, C. D. Manning, GloVe: Global vectors for word representation, in: EMNLP, 2014, pp. 1532–1543.
[6] D. E. Rumelhart, A. A. Abrahamson, A model for analogical reasoning, Cognitive Psychol. 5 (1973) 1–28.
[7] N. Almarwani, H. Aldarmaki, M. Diab, Efficient sentence embedding using discrete cosine transform, in: EMNLP, 2019, pp. 3663–3669.
[8] S. Lim, H. Prade, G. Richard, Classifying and completing word analogies by machine learning, Int. J. Approx. Reason. 132 (2021) 1–25.
[9] S. Afantenos, T. Kunza, S. Lim, H. Prade, G. Richard, Analogies between sentences: theoretical aspects - preliminary experiments, in: Proc. 16th Europ. Conf. Symb. & Quantit. Appr. to Reas. with Uncert. (ECSQARU), 2021.
[10] Z. Bouraoui, S. Jameel, S. Schockaert, Relation induction in word embeddings revisited, in: COLING, Assoc. Computat. Ling., 2018, pp. 1627–1637.
[11] A. Drozd, A. Gladkova, S. Matsuoka, Word embeddings, analogies, and machine learning: Beyond king - man + woman = queen, in: COLING, 2016, pp. 3519–3530.
[12] P. D. Turney, A uniform approach to analogies, synonyms, antonyms, and associations, in: COLING, 2008, pp. 905–912.
[13] P. D. Turney, Distributional semantics beyond words: Supervised learning of analogy and paraphrase, TACL 1 (2013) 353–366.
[14] X. Zhu, G. de Melo, Sentence analogies: Linguistic regularities in sentence embeddings, in: COLING, 2020.
[15] T. Mikolov, K. Chen, G. S. Corrado, J. Dean, Efficient estimation of word representations in vector space, CoRR abs/1301.3781 (2013).
[16] P. Bojanowski, E. Grave, A. Joulin, T. Mikolov, Enriching word vectors with subword information, Transactions of the Association for Computational Linguistics 5 (2017) 135–146.
[17] T. Mikolov, E. Grave, P. Bojanowski, C. Puhrsch, A. Joulin, Advances in pre-training distributed word representations, in: Proc. of LREC, 2018.
[18] A. Diallo, M. Zopf, J. FΓΌrnkranz, Learning analogy-preserving sentence embeddings for answer selection, in: Proc. 23rd Conf. Computational Natural Language Learning, Assoc. Computat. Ling., 2019, pp. 910–919.
[19] L. Wang, Y. Lepage, Vector-to-sequence models for sentence analogies, in: 2020 International Conference on Advanced Computer Science and Information Systems (ICACSIS), 2020, pp. 441–446. doi:10.1109/ICACSIS51025.2020.9263191.
[20] Y. Lepage, De l'analogie rendant compte de la commutation en linguistique, Habilit. Γ  Diriger des Recher., Univ. J. Fourier, Grenoble (2003). URL: https://tel.archives-ouvertes.fr/tel-00004372/en.
[21] Y.
Lepage, Analogy and formal languages, Electr. Notes Theor. Comput. Sci. 53 (2001).
[22] H. Prade, G. Richard, From analogical proportion to logical proportions, Logica Univers. 7 (2013) 441–505.
[23] M. Hesse, On defining analogy, Proceedings of the Aristotelian Society 60 (1959) 79–100.
[24] M. Hesse, Analogy and confirmation theory, Philosophy of Science 31 (1964) 319–327.
[25] R. Prasad, N. Dinesh, A. Lee, E. Miltsakaki, L. Robaldo, A. Joshi, B. Webber, The Penn Discourse TreeBank 2.0, in: LREC 08, 2008. URL: http://www.lrec-conf.org/proceedings/lrec2008/pdf/754_paper.pdf.
[26] S. R. Bowman, G. Angeli, C. Potts, C. D. Manning, A large annotated corpus for learning natural language inference, in: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, 2015.
[27] W. B. Dolan, C. Brockett, Automatically constructing a corpus of sentential paraphrases, in: Proceedings of the Third International Workshop on Paraphrasing (IWP2005), 2005. URL: https://www.aclweb.org/anthology/I05-5002.
[28] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, CoRR abs/1810.04805 (2018).
[29] S. Lim, H. Prade, G. Richard, Solving word analogies: A machine learning perspective, in: Proc. 15th Europ. Conf. Symb. & Quantit. Appr. to Reas. with Uncert. (ECSQARU), LNCS 11726, Springer, 2019, pp. 238–250.