=Paper=
{{Paper
|id=Vol-3212/paper9
|storemode=property
|title=Neural Analogical Reasoning
|pdfUrl=https://ceur-ws.org/Vol-3212/paper9.pdf
|volume=Vol-3212
|authors=Atharv Sonwane,Abhinav Lalwani,Sweta Mahajan,Gautam Shroff,Lovekesh Vig
|dblpUrl=https://dblp.org/rec/conf/nesy/SonwaneLMSV22
}}
==Neural Analogical Reasoning==
Neural Analogical Reasoning

Atharv Sonwane1,∗,†, Abhinav Lalwani1,∗, Sweta Mahajan2, Gautam Shroff3 and Lovekesh Vig3

1 APPCAIR, BITS Pilani, K K Birla Goa Campus
2 Indian Institute of Technology, Delhi
3 TCS Research, New Delhi

Abstract

Symbolic systems excel at reusing and composing modular functional units when solving problems such as simple analogical reasoning. However, they are less amenable to processing real-world data (e.g. images), and rely on additional (often hard-coded) mechanisms to convert such high-dimensional data to symbolic descriptions. In this work, we describe a modular approach, 'Neural Analogical Reasoning', in which elementary neural transformations operate and compose on distributed representations of high-dimensional inputs. We apply this approach to a class of visual analogical reasoning problems that involve discovering the sequence of transformations by which pairs of input-output images are related, so as to analogously transform future inputs. This can be viewed as a program synthesis task and solved via symbolic search if the inputs are represented in symbolic form. Instead, we search over sequences of elementary neural network transformations that manipulate distributed representations of the inputs. We present two variations of learning useful representations for this task and compare both with end-to-end meta-learning based approaches to demonstrate the importance of performing an explicit search.

Keywords: neural reasoning, visual analogy making

NeSy 2022, 16th International Workshop on Neural-Symbolic Learning and Reasoning, Cumberland Lodge, Windsor, UK
∗ These authors contributed equally. † Corresponding author: f20181021@goa.bits-pilani.ac.in (A. Sonwane)

1. Introduction

Consider the class of visual reasoning problems demonstrated in Figure 1, in which each task involves a functional analogy (i.e., 𝑥 ∶ 𝑓(𝑥) ∶∶ 𝑦 ∶ 𝑓(𝑦)) [1]. Analogical reasoning in this context can be thought of as the reuse of some transformation 𝑓, and analogy making as the discovery of such transformations. By modelling these transformations as compositions of elementary functional units, we can cast analogy making as a program synthesis problem: given a set of input-output pairs 𝑆 and a set of functional units 𝑇, we must discover a composition 𝑓 = 𝑇1 ∘ 𝑇2 ∘ … ∘ 𝑇𝑛 with 𝑇𝑖 ∈ 𝑇 such that 𝑓(𝑥𝑖) = 𝑥𝑜 for all (𝑥𝑖, 𝑥𝑜) ∈ 𝑆. This program 𝑓 can then be applied to a query image to generate an analogous output. The key questions to be answered in such a framework are how the elementary functional units 𝑇𝑖 are to be represented, and how they can be made composable with each other and usable across a wide enough range of inputs.

Figure 1: Given example input-output image pairs, generate the analogous output for the given query image.

Classical reasoning systems can solve such problems via discrete search over compositions of elementary primitives [2, 3]. However, they require both the inputs and the elementary primitives to be represented with predefined symbols. For the problem in Figure 1, the images would need to be explicitly annotated with discrete symbolic features such as the type of shape and its location in the image.
Such use of predefined symbols allows classical reasoning systems to leverage compositionality and reuse computational units to perform robust reasoning. However, it restricts their applicability in more complex domains where feature extraction may be non-trivial, while also making it difficult to discover new knowledge from data. Deep Neural Networks (DNNs), on the other hand, have shown an impressive ability to learn compressed distributed representations of high-dimensional data in complex domains such as images [4], text [5] and graphs [6]. Despite these successes, it remains unclear how to reason over such learned representations to solve analogy problems.

In this paper, we take the approach of combining a discrete search procedure with neural primitives that manipulate distributed representations in a latent space. This allows us to leverage both the flexibility of DNNs and the ability of discrete search to construct arbitrarily complex compositions. Our proposed system consists of (1) a distributed representation space 𝑍 into and from which images can be converted with an encoder 𝐸 ∶ 𝐼 → 𝑍 and a decoder 𝐷 ∶ 𝑍 → 𝐼, and (2) neural primitives 𝑇𝑖 ∶ 𝑍 → 𝑍 that approximate the action of symbolic primitives within this representation space and are composable with each other. Given a task as described in Figure 1, our system searches over sequences of such neural primitives to find a program that transforms the given input into the given output.

By using a search procedure in place of a completely end-to-end learned system, we retain the flexibility of having independent functional units that can be reused, debugged and composed arbitrarily to build transformations in an interpretable manner. Using learned primitives means that tedious hand-crafting of new symbols and procedures can be replaced with providing relevant examples of the transformation to be learned. Thus, our system aims to retain the power of compositional reasoning that symbolic primitives provide while generalizing beyond the practical capacity of a purely symbolic approach to richer inputs.

In such a framework, the ability of primitives to compose with each other while generalising to similar but unseen domains is essential for successful analogical reasoning. We obtain almost arbitrarily composable primitives by using neural networks that manipulate distributed representations within the same latent space (𝑇𝑖 ∶ 𝑍 → 𝑍). To learn these neural primitives as well as the latent space they operate on, we take inspiration from the Neural Algorithmic Reasoning framework [7], which describes how to neuralise an erstwhile symbolic algorithm (in our case, a transformation) so that it can deal with rich inputs directly, without requiring them to be explicitly converted to a symbolic form. We also describe and evaluate two variations of learning distributed representations of the input: a guided approach, in which the representation space is derived from symbolic descriptions of the inputs, and an unguided approach, in which the representation is learned directly from the high-dimensional inputs (in this case, images).
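To make this setup concrete, the following is a minimal sketch of the interfaces just described, assuming PyTorch; it is not the actual implementation, and names such as Primitive and apply_program are illustrative.

```python
# Minimal sketch (not the authors' code) of the interfaces described above,
# assuming PyTorch. Names such as `Primitive` and `apply_program` are
# illustrative placeholders.
from typing import Callable, Sequence
import torch
import torch.nn as nn

Latent = torch.Tensor            # an element of the representation space Z
Image = torch.Tensor             # an element of the image space I

class Primitive(nn.Module):
    """A neural primitive T_i : Z -> Z, here a small MLP over flattened latents."""
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, z: Latent) -> Latent:
        return self.net(z)

def apply_program(program: Sequence[Callable[[Latent], Latent]],
                  encode: Callable[[Image], Latent],
                  decode: Callable[[Latent], Image],
                  x: Image) -> Image:
    """Apply the sequence T_1, ..., T_n in latent space and decode the result."""
    z = encode(x)
    for t in program:
        z = t(z)
    return decode(z)
```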
Contributions. Through this paper we (1) introduce a challenging few-shot visual analogical reasoning task; (2) propose a neural reasoning framework that makes use of learned primitives along with an explicit search procedure to solve this task; (3) evaluate two different methods for representation learning within this framework; (4) demonstrate that elementary primitives learned through this approach are able to both compose and generalise, satisfying essential requirements of analogical reasoning; and (5) show the importance of a search procedure when solving the task by comparing with end-to-end neural meta-learning approaches.

2. Related Works

Analogical reasoning has been considered a fundamental trait of human intelligence throughout the last half-century of AI research [8]. More recently, there has been a focus on characterising similar types of abstract reasoning as the fundamental building blocks of intelligence [9, 10, 11]. We refer to [1] for a review of the broad approaches to analogy making in AI systems; our particular method builds functional analogies. However, the ability to represent concepts in a modular way and to transfer them across domains still eludes many modern machine learning approaches [12].

Recent attempts at utilising neural networks for reasoning tasks focus on creating modular architectures of neural networks that can be learned end to end [13, 14, 10]. However, the reliance on fixed architectures limits the ability to learn and reuse functional units. Further, these approaches limit inference to applying the trained network once or at most a few times, rather than searching for the solution as is done in symbolic program synthesis [2, 3]. Our system is able to reuse neural transformation units in an unconstrained manner by exploiting the ability of search to create arbitrarily complex compositions. The idea of inferring programs made up of neural modules has also been used for visual question answering [15]; however, that work utilises a learned sequence model to generate programs and solves a classification task as opposed to a generative one.

The Neural Algorithmic Reasoning [7] framework takes a different approach to reasoning with neural networks. Instead of learning end-to-end on a particular task, it proposes to learn a symbolic algorithm as a neural network, together with a high-dimensional representation space on which it can act. Such a neural algorithm can then be utilised in various rich domains by learning an encoder and decoder for it. This approach has proven successful for learning to execute graph algorithms [16] and is even able to transfer knowledge to learn related graph algorithms [17]. We use a variant of this framework to construct neural networks that learn from input and output data to execute symbolic primitives over a learned representation space. We also investigate the properties of this representation space, since learning disentangled representations has been shown to be important for visual abstract reasoning [18, 19, 20].

Other symbolic approaches to analogical reasoning include learning to rank possible relations based on features from a relational database [21], as well as Structure Mapping [22] and more recent approaches [23] that use graph representation learning to emulate Structure Mapping. Our method differs from such approaches by operating directly on images without requiring structured symbolic descriptions at test time.
Our work also builds on a previous short abstract [24], specifically by exploring an unguided way of representation learning which enables the overall system to work without any symbolic descriptions at all.

3. Methodology

Problem Setting. We introduce a class of visual analogical reasoning problems, as shown in Figure 1, to motivate our 'neural reasoning' system. Each problem consists of input-output image pairs which are related by an unknown transformation. Given these pairs, the task is to discover the transformation and apply it to a query image to generate the solution. The images considered consist of items from a set of 20 simple shapes and handwritten characters (from the Omniglot dataset [11]) placed on a 3 × 3 grid. The transformations between the input and output images are programs composed of either control-flow operations (reset and out) or spatial transformations of shapes in the image. The latter consist of positional shifts in the cardinal directions (e.g. shift-right moves each shape in the image one grid position to the right), shape conversions (e.g. to-square converts each shape in the image to a square), and affine transformations (e.g. affine-right skews each shape to the right).

Figure 2: Example of applying various transformations to an image from the problem setting: (a) original image 𝑥, (b) shift-right(𝑥), (c) to-square(𝑥), (d) affine-right(𝑥).

This problem setting can be interpreted as solving weakly supervised evaluation tasks, each consisting of input-output image pairs and a query image for which we have to generate the output, given some strongly supervised training data consisting of similar input-output pairs and the associated transformation procedure. We utilise the strongly supervised data to learn: (1) a representation space for the input data, and (2) neural transforms that manipulate distributed representations in this space. These can then be used to solve the weakly supervised testing tasks.

Representation Learning. Reasoning requires constructing compositions of elementary functions. To use neural networks as functional units, we first need to obtain a latent space of real vectors into which we can embed the domains and co-domains of our original elementary functions. In particular, we are interested in representation spaces that are rich enough to represent a wide variety of shapes and sparse enough to represent concepts distinctly, so as to allow for robust manipulation. We approach this in two related ways: (1) a guided representation derived from symbolic descriptions of the input, and (2) an unguided representation learned directly from images. The guided approach generates more composable transforms at the cost of requiring annotated data and augmentations to enable generalisation. The unguided approach generalises more naturally and can be trained without annotations, but suffers from worse transform composability.

In the guided approach, we aim to shape the latent space to include information about the input images that is relevant to the transformations, namely shape and position. We first train an autoencoder (encoder 𝐸𝑠, decoder 𝐷𝑠) on symbolic descriptions of the input images present in the training data by minimising the negative log-likelihood loss between the inputs and the reconstructed vectors.
The symbolic descriptions used are concatenations of a one-hot vector denoting the shape present in the image and another denoting the position of this shape in the grid. We then train a CNN-based encoder 𝐸𝑥 that takes images (denoted 𝑥) as inputs to fit the outputs of 𝐸𝑠 (with frozen weights), by minimising the mean-squared error between the embeddings of images and the embeddings of the corresponding symbolic vectors: MSE(𝐸𝑥(𝑥⁽ⁱ⁾), 𝐸𝑠(𝑠⁽ⁱ⁾)). To handle shapes at test time that were not seen during training, we also include an additional shape label (unseen) and train 𝐸𝑥 to map shapes whose labels were not included in the symbolic description to the latent vectors generated by 𝐸𝑠 for the unseen label.

While this approach ensures that the latent space encapsulates the relevant information, it is limited by its reliance on symbolic descriptions. Adding any new attribute of the input over which we would like to include transformations (e.g. a new shape, or continuous - such as affine - transformations of existing shapes) would require modifying the input space (the size of the one-hot input vector), requiring both the latent space and the transforms to be re-trained. Learning the latent representation in an unguided way alleviates these limitations at the cost of not always capturing the relevant information.

In the unguided approach, we train a CNN-based autoencoder (𝐸𝑥, 𝐷𝑥) by minimising the mean-squared reconstruction error: MSE(𝐷𝑥(𝐸𝑥(𝑥⁽ⁱ⁾)), 𝑥⁽ⁱ⁾). We expect the autoencoder to generalise to unseen shapes and positions given enough data. One of the main issues encountered here is the accumulation of noise after applying each transform, which degrades compositional performance. Using a sigmoid activation after the final layer of the decoder drastically reduces this noise, as the image pixels are binary. Further details can be found in the attached appendix.

Transform Training. Our system makes use of neural networks to approximate the actions of the elementary spatial transformations that make up the mapping between input and output images. These neural primitives 𝑇𝑖 ∈ 𝑇 need to (1) be universally composable with each other by operating on the same distributed representation space 𝑍 (i.e. they must be of the form 𝑇𝑖 ∶ 𝑍 → 𝑍), and (2) generalise to latent representations of similar images, enabling the transfer of analogical concepts (here, compositions 𝑓 = 𝑇1 ∘ 𝑇2 ∘ … ∘ 𝑇𝑛) across inputs. For each elementary transformation tf we construct a neural network 𝑇tf and train it to satisfy the constraint that tf(𝑎) = 𝑏 ⟹ 𝑇tf(𝐸(𝑎)) = 𝐸(𝑏) for some encoder 𝐸, using the procedure from [24]. Further details can be found in the attached appendix.
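As an illustration, the sketch below shows one plausible way to train a single primitive under this constraint, using a latent-space mean-squared error against a frozen encoder (PyTorch). The code and names are illustrative rather than the released implementation; note that the guided variant shown in Figure 9 of the appendix instead trains through the frozen symbolic decoder with a log-likelihood loss.

```python
# Minimal sketch (assumptions, not the released code) of training one neural
# primitive T_tf so that T_tf(E(a)) ≈ E(b) whenever tf(a) = b, with the
# encoder E frozen. `encoder`, `pairs` and the MLP sizes are illustrative.
import torch
import torch.nn as nn

def train_primitive(encoder: nn.Module, pairs, latent_dim: int = 1024,
                    hidden: int = 256, steps: int = 4000, lr: float = 1e-3):
    """`pairs` yields batches (a, b) of images with b = tf(a)."""
    t_tf = nn.Sequential(nn.Linear(latent_dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, latent_dim))
    opt = torch.optim.Adam(t_tf.parameters(), lr=lr)
    encoder.eval()                       # E is trained beforehand and kept frozen
    for step, (a, b) in zip(range(steps), pairs):
        with torch.no_grad():            # no gradients flow into the encoder
            za = encoder(a).flatten(1)   # E(a)
            zb = encoder(b).flatten(1)   # E(b), the regression target
        loss = nn.functional.mse_loss(t_tf(za), zb)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return t_tf
```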
Program Execution. We model the mappings between input and output images in the visual analogy tasks as programs composed of primitives which are either control-flow operations (reset and out) or spatial transformations. Figure 3 describes the execution of such programs. Since spatial transformations act independently on each shape, we first extract individual shapes by identifying connected components within the image. Each extracted image is converted into a latent vector using the encoder 𝐸𝑥 and stored in the input and memory buffers. The program is then executed sequentially, with neural primitives being applied to the vectors in the memory buffer and control-flow operations indicating the transfer of vectors between buffers. After execution is complete, the vectors present in the output buffer are converted to images - using 𝐷𝑥 in the unguided case, or by manual construction according to the symbolic descriptions obtained using 𝐷𝑠 in the guided case - and combined together to form the result. Further details can be found in [24] and the appendix.

Figure 3: Program execution. Connected components are extracted from the input image and encoded with 𝐸𝑥 into latent vectors (pre-processing); latent transforms and the reset and out primitives move vectors between the input, memory and output buffers; the output buffer is decoded with 𝐷𝑥 and combined into the result image (post-processing).

Searching for Solutions. Given a representation space and a set of transforms, we can solve the analogy tasks being considered by searching over possible programs. A naive breadth-first search has a branching factor of 12 (4 shift operations, 4 shape-conversion operations, 2 affine operations, reset and out), making it intractable for longer programs. Instead, we exploit the fact that both the input and output images are available to extensively prune the search tree; details can be found in [24] and the attached appendix.

4. Results & Discussion

Compositional Abilities of Transformations. A symbolic analogue to our approach would simply be a search over sequences of predefined hard-coded primitives. Our system instead uses learned primitives acting on a learned latent representation. Since these are approximations of the underlying symbols, errors may accumulate in the system, leading to incorrect solutions or an inability to find a solution. To evaluate how well the neural primitives in our system compose, we use a dataset of input-output pairs generated from 20,000 random programs of up to length 10 (details in the appendix), and compute the ratio of pairs for which a valid program is found.

The results are summarised in Figure 4. The guided approach is able to find correct solution programs for 100% of the examples in the dataset. The unguided approach shows similar performance, with an overall success rate of 94.7%. The minor drop in performance of the unguided approach can be attributed to (1) the absence of symbolic annotations during training, (2) verification of the generated output being performed using MSE with the target, which is more susceptible to noise than comparing multi-hot vectors (as done in the guided case), and (3) the inclusion of the affine transforms, which results in noisy reconstructions for certain shapes, since these more complex transforms are not learned as well as the others. Despite this, the high success rate of both approaches demonstrates that independently trained neural transforms can compose robustly enough to be used in our approach.

Figure 4: Performance at discovering programs of varying length (1-10) for the guided and unguided approaches.

Comparison with End-to-end Neural Approaches. The analogical reasoning task presented in this paper can be thought of as a few-shot learning problem, where an output has to be found for a query given input-output examples - a setting well suited to end-to-end meta-learning algorithms. In order to demonstrate the significance of discrete search in performing compositional reasoning, we compare our method against three end-to-end neural meta-learning algorithms - Conditional Neural Processes (CNP) [25], Matching Networks (MAN) [26], and Model-Agnostic Meta-Learning (MAML) [27] - on a few-shot classification version of the analogical reasoning task (details in the attached appendix). From the results in Table 1, we can see that despite being presented with a more difficult task (generation vs. classification), our approach significantly outperforms the end-to-end meta-learning techniques.

Table 1: Performance of our approach (Guided, Unguided) and of fully NN-based meta-learning algorithms (CNP, MAN, MAML) on a simplified few-shot classification version of the task.

| Approach | Guided | Unguided | CNP | MAN | MAML |
|---|---|---|---|---|---|
| Performance | 100% | 94.7% | 64% | 74% | 73% |
Figure 5: Visualisation of latent representations of input images using t-SNE: (a) guided representation, (b) unguided representation, (c) unguided representation with rotations and affine transformations included. Each point is the embedding of an image containing a single shape in one grid position; points are coloured by the shape contained in the image (left plots) and by the grid position of that shape (right plots).

Representation Learning. The learned representation space upon which compositional reasoning is performed is a critical part of our system. The experiments discussed above demonstrate that this space can be both disentangled enough to support robust transformations and diverse enough to support interpolation and generalisation. To explore this further, we visualise the latent embeddings of images containing a single shape in Figure 5. We can see that for both the guided and unguided cases, the representation space is strongly structured around the grid positions of the shapes in the images. For the unguided case, the clusters are more sparsely structured, indicating the presence of another component of variation that captures more shape information. Finally, Figure 5c shows that affine transformations of shapes lie closer to each other than to the original shapes, indicating that affine transforms cause a shift in the distribution of latent embeddings. This explains why excluding affine transformations of seen shapes when training transforms causes reduced performance: such transforms, applied to affine-transformed shapes, would introduce significant noise since they are not aware of this distribution shift. This highlights the importance of having a well-structured latent space.
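A Figure 5-style visualisation can be produced along the following lines; this is only a sketch, assuming a trained encoder, a labelled batch of single-shape images, and scikit-learn/matplotlib, with all names illustrative.

```python
# Sketch of a Figure 5-style visualisation, assuming a trained `encoder` and a
# dataset of single-shape images with known shape/position labels (names are
# illustrative, not the authors' code).
import torch
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_latent_tsne(encoder, images, shape_labels, position_labels):
    with torch.no_grad():
        z = encoder(images).flatten(1).cpu().numpy()   # latent embeddings
    z2 = TSNE(n_components=2).fit_transform(z)         # project to 2D
    fig, (ax_shape, ax_pos) = plt.subplots(1, 2, figsize=(10, 4))
    ax_shape.scatter(z2[:, 0], z2[:, 1], c=shape_labels, cmap="tab20", s=8)
    ax_shape.set_title("coloured by shape")
    ax_pos.scatter(z2[:, 0], z2[:, 1], c=position_labels, cmap="tab10", s=8)
    ax_pos.set_title("coloured by grid position")
    plt.show()
```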
Generalisation. To be of practical use for computation and reasoning, our system needs to generalise to visual analogy tasks with similar images but with characteristics not present in the training data. To evaluate this, we exclude images containing 5 (of the 20 total) shapes from the training data, but include these images during testing (performed as in Section 4). We evaluate generalisation for the transforms and for the representation space (encoder and decoder) separately.

Table 2: End-to-end performance on the generalisation experiments. The Regular case is where all 20 shapes are used for training and evaluation of both representations and transforms.

| Approach | Regular | Transforms | Representation |
|---|---|---|---|
| Guided | 100% | 95% | 100% |
| Unguided | 94.7% | 94.8% | 92.4% |

The results in Table 2 indicate that both the guided and unguided approaches generalise well to shapes not present during training. This suggests that, with a diverse and large enough training set, our representation space could be rich enough to support a large variety of inputs. In the unguided approach, shift transforms work robustly for images containing shapes not seen during training, indicating that the representation space captures information about position in a disentangled manner. The affine transformations, which manipulate the shape itself, also work well for shapes unseen during transform training, but only if affine versions of seen shapes are present when learning the representation space. This is explained by the distribution shift in latent embeddings caused by affine transformations, as seen in Figure 5c and discussed in Section 4. Incorporating explicit measures to disentangle the distributed representations with respect to continuous shape transformations, in addition to shape vs. position, may alleviate this requirement, enabling robust system-level generalisation. More generally, these results suggest that representations learned directly and without supervision can capture relevant information in a distributed enough manner to support reasoning within our framework, and that providing explicit symbolic guidance aids compositionality at some cost in generalisability.

Conclusion

Our neural analogical reasoning approach performs compositional reasoning via search over transformations of distributed representations, both learned via neural networks, thus substituting handcrafted symbols and procedures with the provision of relevant training data. Experiments show that our approach generalises to data drawn from related but different distributions (i.e., unseen shapes). Future work could tackle more complete representation-learning methods that avoid the extraction of connected components, enabling the system to handle more complex problems with overlapping and arbitrarily sized shapes. We envisage a more general system for analogical reasoning (in the limited sense used here) based on our NAR approach: one that can operate on rich input distributions while continuously learning to detect when new shapes and/or transformations are encountered and retrain its elements (the representation space and/or new transformation networks) appropriately.

Acknowledgments

This work is supported by "The DataLab" agreement between BITS Pilani, K.K. Birla Goa Campus and TCS Research. We thank Tirtharaj Dash and Ashwin Srinivasan for their input.

References

[1] H. Prade, G. Richard, Analogical proportions: why they are useful in AI, in: Proceedings of the 30th International Joint Conference on Artificial Intelligence (IJCAI 2021), Montreal, 2021, pp. 21–26.
[2] K. Ellis, C. Wong, M. Nye, M. Sablé-Meyer, L. Morales, L. B. Hewitt, L. Cary, A. Solar-Lezama, J. B. Tenenbaum, DreamCoder: bootstrapping inductive program synthesis with wake-sleep library learning, in: Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation, 2021.
[3] S. Gulwani, Automating string processing in spreadsheets using input-output examples, in: POPL '11, 2011.
[4] L. Jiao, J. Gao, X. Liu, F. Liu, S. Yang, B. Hou, Multi-scale representation learning for image classification: A survey, IEEE Transactions on Artificial Intelligence (2021). doi:10.1109/TAI.2021.3135248.
[5] Y. Yu, X. Si, C. Hu, J. Zhang, A review of recurrent neural networks: LSTM cells and network architectures, Neural Computation 31 (2019) 1235–1270. doi:10.1162/neco_a_01199.
[6] F. Chen, Y.-C. Wang, B. Wang, C.-C. J. Kuo, Graph representation learning: A survey, APSIPA Transactions on Signal and Information Processing 9 (2020).
[7] P. Veličković, C. Blundell, Neural algorithmic reasoning, Patterns 2 (2021) 100273. doi:10.1016/j.patter.2021.100273.
[8] D. R. Hofstadter, Fluid concepts and creative analogies: Computer models of the fundamental mechanisms of thought, Basic Books, 1995.
[9] F. Chollet, On the measure of intelligence, ArXiv abs/1911.01547 (2019).
[10] D. Barrett, F. Hill, A. Santoro, A. Morcos, T. Lillicrap, Measuring abstract reasoning in neural networks, in: J. Dy, A. Krause (Eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, PMLR, 2018, pp. 511–520. URL: https://proceedings.mlr.press/v80/barrett18a.html.
[11] B. M. Lake, R. Salakhutdinov, J. B. Tenenbaum, Human-level concept learning through probabilistic program induction, Science 350 (2015) 1332–1338.
[12] B. M. Lake, M. Baroni, Generalization without systematicity: On the compositional skills of sequence-to-sequence recurrent networks, in: ICML, 2018.
[13] N. Rahaman, M. W. Gondal, S. Joshi, P. V. Gehler, Y. Bengio, F. Locatello, B. Schölkopf, Dynamic inference with neural interpreters, ArXiv abs/2110.06399 (2021).
[14] V. Kolev, B. Georgiev, S. Penkov, Neural abstract reasoner, ArXiv abs/2011.09860 (2020).
[15] J. Johnson, B. Hariharan, L. van der Maaten, J. Hoffman, L. Fei-Fei, C. L. Zitnick, R. B. Girshick, Inferring and executing programs for visual reasoning, in: 2017 IEEE International Conference on Computer Vision (ICCV), 2017, pp. 3008–3017.
[16] P. Veličković, R. Ying, M. Padovano, R. Hadsell, C. Blundell, Neural execution of graph algorithms, ArXiv abs/1910.10593 (2020).
[17] L.-P. Xhonneux, A. Deac, P. Veličković, J. Tang, How to transfer algorithmic reasoning knowledge to learn new algorithms?, ArXiv abs/2110.14056 (2021).
[18] S. van Steenkiste, F. Locatello, J. Schmidhuber, O. Bachem, Are disentangled representations helpful for abstract visual reasoning?, in: NeurIPS, 2019.
[19] X. Steenbrugge, S. Leroux, T. Verbelen, B. Dhoedt, Improving generalization for abstract reasoning tasks using disentangled feature representations, ArXiv abs/1811.04784 (2018).
[20] F. Hill, A. Santoro, D. Barrett, A. Morcos, T. Lillicrap, Learning to make analogies by contrasting abstract relational structure, in: International Conference on Learning Representations, 2019. URL: https://openreview.net/forum?id=SylLYsCcFm.
[21] R. Silva, K. A. Heller, Z. Ghahramani, Analogical reasoning with relational Bayesian sets, in: AISTATS, 2007.
[22] D. Gentner, Structure-mapping: A theoretical framework for analogy, Cognitive Science 7 (1983) 155–170.
[23] M. Crouse, C. Nakos, I. Abdelaziz, K. D. Forbus, Neural analogical matching, ArXiv abs/2004.03573 (2021).
[24] A. Sonwane, G. Shroff, L. Vig, A. Srinivasan, T. Dash, Solving visual analogies using neural algorithmic reasoning, CoRR abs/2111.10361 (2021). URL: https://arxiv.org/abs/2111.10361.
[25] M. Garnelo, D. Rosenbaum, C. Maddison, T. Ramalho, D. Saxton, M. Shanahan, Y. W. Teh, D. Rezende, S. M. A. Eslami, Conditional neural processes, in: J. Dy, A. Krause (Eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, PMLR, 2018, pp. 1704–1713. URL: https://proceedings.mlr.press/v80/garnelo18a.html.
[26] O. Vinyals, C. Blundell, T. Lillicrap, K. Kavukcuoglu, D. Wierstra, Matching networks for one shot learning, in: D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, R. Garnett (Eds.), Advances in Neural Information Processing Systems, volume 29, Curran Associates, Inc., 2016. URL: https://proceedings.neurips.cc/paper/2016/file/90e1357833654983612fb05e3ec9148c-Paper.pdf.
[27] C. Finn, P. Abbeel, S. Levine, Model-agnostic meta-learning for fast adaptation of deep networks, in: Proceedings of the 34th International Conference on Machine Learning, ICML'17, JMLR.org, 2017, pp. 1126–1135.
A. Details of Methodology

A.1. Problem Setting

Figure 6: The set of base shapes and characters (modulo rotations and skews) used to construct visual analogy tasks.

Table 3: All spatial transformations used in the visual analogy tasks.

| Type | Transformations | Meaning |
|---|---|---|
| Shifts | shift-right, shift-left, shift-up, shift-down | Move all of the shapes in the image by one grid position |
| Shape conversions | to-square, to-circle, to-delta, to-triangle | Convert all of the shapes in the image to a particular shape |
| Affines | affine-right, affine-left | Skew all of the shapes in the image either to the left or to the right |

Details of Images used in Visual Analogy Tasks. Each image used in the visual analogy tasks consists of shapes or characters from the set described in Figure 6 placed on a 3 × 3 grid. The size of each such image is 64 × 64 pixels, with the shapes and characters placed inside being 20 × 20 pixels. Note that the grid is not explicitly drawn and all images are black and white (the background being white and the shapes being solid black). The handwritten characters used are taken from the Omniglot dataset.

A.2. Guided Representation Space

Figure 7: Symbolic descriptions of input images are multi-hot vectors with 𝑁 shape slots and 𝑃 = 𝐵height + 𝐵width position slots, constructed by concatenating a one-hot vector denoting which shape is present in the image with one-hot vectors denoting the position of the shape. With 20 possible shapes and a 3 × 3 grid, this multi-hot vector is of length 26.

Figure 8: Two-step procedure for obtaining the image encoder using a latent representation space based on symbolic descriptions: (1) an autoencoder (𝐸𝑠, 𝐷𝑠) is trained on symbolic descriptions with a negative log-likelihood loss; (2) the image encoder 𝐸𝑥 is then trained against the frozen symbolic model.

Representation Learning. For the guided representation space we first train an autoencoder directly on the symbolic descriptions (as described in Figure 7) of images containing a single shape, followed by training an encoder from said images to the latent embeddings of their symbolic descriptions. This procedure is described in Figure 8. The size of the latent embeddings is 32. The encoder and decoder of the autoencoder over symbolic descriptions are both MLPs with a single hidden layer of size 16. They are trained using a batch size of 36 for 10,000 weight updates with a learning rate of 0.0003. For the encoder from images to the latent embeddings of their symbolic descriptions we use a CNN with 6 convolutional layers, each using a stride of 2, padding of 1 and 3 × 3 kernels. Each layer doubles the number of channels, halves the spatial dimensions of its input and applies a rectified linear unit activation. Since the input image has dimensions 1 × 64 × 64, the output of these six layers is 128 × 1 × 1. This is flattened and passed through a final linear transformation to output the latent embedding of size 32. This encoder is trained with a batch size of 180 (since there are 20 × 9 = 180 possible combinations of shape and grid position) for 50,000 weight updates with a learning rate of 0.0003.
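A plausible PyTorch instantiation of this image encoder and its fitting procedure is sketched below (not the released code). The channel progression starting at 4 is an assumption chosen so that six stride-2 layers map a 1 × 64 × 64 input to the reported 128 × 1 × 1 feature map, and all names are illustrative.

```python
# A plausible PyTorch instantiation of the image encoder described above
# (illustrative, not the released code).
import torch
import torch.nn as nn

class GuidedImageEncoder(nn.Module):
    def __init__(self, latent_dim: int = 32):
        super().__init__()
        layers, in_ch = [], 1
        for out_ch in [4, 8, 16, 32, 64, 128]:   # doubles channels each layer (assumed start at 4)
            layers += [nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1),
                       nn.ReLU()]
            in_ch = out_ch
        self.conv = nn.Sequential(*layers)        # 1x64x64 -> 128x1x1
        self.fc = nn.Linear(128, latent_dim)      # final linear to the 32-d latent

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc(self.conv(x).flatten(1))

# Fitting sketch: regress E_x onto the frozen symbolic embeddings E_s(s) with MSE,
# as in the main text; hyperparameters follow the appendix.
def fit_to_symbolic(e_x, e_s, batches, steps=50_000, lr=3e-4):
    opt = torch.optim.Adam(e_x.parameters(), lr=lr)
    for step, (x, s) in zip(range(steps), batches):   # x: images, s: multi-hot descriptions
        with torch.no_grad():
            target = e_s(s)                           # frozen symbolic encoder
        loss = nn.functional.mse_loss(e_x(x), target)
        opt.zero_grad(); loss.backward(); opt.step()
    return e_x
```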
Neural Transforms. For the guided representation space, we use MLPs as neural transforms with a single hidden layer of size 32. These are trained with a batch size of 180 for 5,000 weight updates with a learning rate of 0.0003. An overview of the training procedure is given in Figure 9.

Figure 9: Procedure for training transforms for the guided representation space (e.g. 𝑇shift-right is applied to 𝐸𝑥(𝑥) and decoded with 𝐷𝑠; a negative log-likelihood loss compares the result against the symbolic description of the shifted image).

A.3. Unguided Representation Space

Representation Learning. For the unguided representation space we train an autoencoder directly on the images containing a single shape, with a latent embedding size of 16 × 8 × 8. The encoder is a CNN with 3 convolutional layers, each followed by a batch normalisation layer. Each layer uses a stride of 2 and padding of 1. The first two layers use 4 × 4 kernels and the last one uses a 3 × 3 kernel. The first layer outputs 8 channels while the second and third both output 16 channels. The decoder is made up of 3 transposed convolution layers, each followed by batch normalisation. All three layers use a stride of 2, padding of 1 and 4 × 4 kernels; the first two layers output 8 channels. The autoencoder is trained using a batch size of 32 for 20,000 weight updates with a learning rate of 0.001.

Neural Transforms. For the unguided representation space we use MLPs as neural transforms, with a single hidden layer of size 256 for shift and shape-conversion transforms and of size 2048 for affine transforms (although similar performance is observed for size 256). These are trained with a batch size of 18 for 4,000 weight updates with a learning rate of 0.001 on a dataset of images with a single shape present. Note that the latent embeddings are flattened to a 1024-long vector before being passed to the transforms.

A.4. Program Execution Algorithm

Algorithm 1: Program Execution
Input: input latent vectors {𝑧𝑘}, program 𝑃
Output: output latent vectors
1: input ← {𝑧𝑘}, memory ← {𝑧𝑘}, output ← ∅
2: for 𝑖 = 1 ∶ length(𝑃) do
3:   if 𝑃[𝑖] = out then
4:     output ← output ∪ memory
5:   else if 𝑃[𝑖] = reset then
6:     memory ← input
7:   else
8:     memory ← {𝑃[𝑖](𝑧𝑘), ∀𝑧𝑘 ∈ memory}
9:   end if
10: end for
11: return output

A.5. Search Algorithm

Algorithm 2: Limited-Depth BFS
Input: encoder 𝐸, decoder 𝐷, transforms 𝑇 = {𝑇𝑖}, example 𝜆 = (𝑥𝑖, 𝑥𝑜), depth limit 𝑑
Output: program satisfying the given example
1: targets ← connected_components(𝑥𝑜)
2: q ← FIFO(); q.push([ ])
3: while True do
4:   program ← q.pop()
5:   if length(program) = 𝑑 then
6:     return None
7:   end if
8:   for 𝑡𝑖 ∈ 𝑇 do
9:     new_program ← program + 𝑡𝑖
10:    if satisfies(𝜆, new_program, 𝐸, 𝐷) then
11:      return new_program
12:    end if
13:    q.push(new_program)
14:   end for
15: end while

Caching program state to speed up search. Alongside pruning, we also reduce the constant factor involved at each step of the search by keeping track of the program state after partial execution. This yields a significant reduction in the time taken, since the majority of it is spent in neural network forward passes: transforms are only applied when a new transform is added to the program. We do the same for output decoding, only applying the output decoder whenever the out primitive is called and keeping track of the output buffer in subsequent nodes.
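The following Python sketch shows how Algorithms 1 and 2 fit together; it is illustrative only, and connected_components and satisfies are assumed helpers rather than functions from the released code. The caching described above would memoise the buffer state of each partial program so that only the newly appended primitive requires a forward pass.

```python
# Runnable sketch of Algorithms 1 and 2 (illustrative, not the released code).
# `primitives` maps primitive names to latent-space transforms; RESET/OUT are
# control flow; `connected_components` and `satisfies` are assumed helpers.
from collections import deque

RESET, OUT = "reset", "out"

def execute(program, inputs, primitives):
    """Algorithm 1: run a program over input/memory/output buffers of latent vectors."""
    memory, output = list(inputs), []
    for op in program:
        if op == OUT:
            output.extend(memory)
        elif op == RESET:
            memory = list(inputs)
        else:
            memory = [primitives[op](z) for z in memory]
    return output

def search(example, encode, decode, satisfies, connected_components,
           primitives, depth_limit=10):
    """Algorithm 2: limited-depth breadth-first search over primitive sequences."""
    x_in, x_out = example
    inputs = [encode(c) for c in connected_components(x_in)]
    queue = deque([[]])                       # FIFO of partial programs
    while queue:
        program = queue.popleft()
        if len(program) == depth_limit:       # all shorter programs already tried
            return None
        for name in list(primitives) + [RESET, OUT]:
            candidate = program + [name]
            # compare the decoded output buffer against the target image x_out
            if satisfies(x_out, decode, execute(candidate, inputs, primitives)):
                return candidate
            queue.append(candidate)
    return None
```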
A.6. Example Run-through

Consider the example task shown in Figure 10. An input image 𝑥𝑖 and a target image 𝑥𝑟 make up the example for an unknown transformation sequence, which then needs to be applied to the query image 𝑥𝑞. We have an encoder 𝐸𝑥 and decoder 𝐷𝑥 trained as described in Section 3, and a set of transforms 𝑇𝑖 trained according to Section 3.

Figure 10: Demonstration on an example task.

Solving this task using the guided approach, the first step is isolating connected components in the input image. This results in two images for the input in our example, 𝑥𝑖⁽¹⁾ and 𝑥𝑖⁽²⁾, each of which is converted into its latent representation using the encoder and stored in the input and memory buffers: 𝐼 = 𝑀 = {𝑧𝑖⁽¹⁾, 𝑧𝑖⁽²⁾}. Having obtained the initial program state, the search procedure enumerates sequences of transforms which it tests against the example. Suppose that the search arrives at shift-left reset shift-down out. We first apply the neural network 𝑇shift-left to the latent vectors present in the memory buffer: 𝑧𝑖′⁽¹⁾ = 𝑇shift-left(𝑧𝑖⁽¹⁾) and 𝑧𝑖′⁽²⁾ = 𝑇shift-left(𝑧𝑖⁽²⁾). Here 𝑧𝑖′⁽¹⁾ and 𝑧𝑖′⁽²⁾ are now latent embeddings of the characters in the original image, but shifted to the left by one grid position. reset is a special primitive that resets the memory buffer with the contents of the input buffer: 𝑀 = 𝐼 = {𝑧𝑖⁽¹⁾, 𝑧𝑖⁽²⁾}. The application of shift-down proceeds similarly to that of shift-left, using the appropriate neural network transform. Finally, out is another special primitive which adds all the latent vectors present in the memory buffer to the output buffer: 𝑂 = {𝑧𝑖″⁽¹⁾, 𝑧𝑖″⁽²⁾}. After executing the program, we use the decoder 𝐷𝑥 to produce the images corresponding to the latent vectors present in the output buffer, which are then added together to form the output image. Since this resulting image would contain the same characters as the input image but shifted down by one grid position, comparison with the target using a distance function as described in Section 3 leads us to reject the current transform sequence and keep searching. Eventually, the search arrives at a sequence that satisfies the example, which in our case is shift-down shift-down affine-right out. This is then applied to the query image to obtain the final result.

A.7. Data Generation

Random example (input, output) pairs can be generated by applying a randomly generated program to a randomly selected image. During generation of the program, constraints are applied to minimise the generation of redundant programs: (1) no consecutive shape-conversion transforms, (2) no consecutive out or reset operations, (3) all programs must end with an out operation, and (4) reset operations must follow an out operation. When generating a set of random examples, it is also ensured that all generated examples are unique. A sketch of such a generator is given after the next subsection.

A.8. Details of Evaluation Dataset

To evaluate the ability of the transforms in our system to compose without error, we use a dataset of (input, output) pairs generated from 20,000 random programs (2,000 programs for each length from 1 to 10) and compute the ratio of pairs for which a valid program is found. Each task is one-shot (it contains a single input-output pair), and the images can contain multiple shapes. The programs are made up of the 4 types of shift transforms and the 2 affine transforms. We exclude shape conversions from these experiments since their application to an unseen shape turns it into a seen shape, which would not allow us to properly test generalisation performance. Also, programs containing affine transforms were not used in evaluation for the guided case since, as discussed in Section 3, continuous shape transformations apply only in the unguided case.
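A generator respecting the A.7 constraints could look roughly as follows. This is an illustrative sketch only: primitive names follow Table 3, "length" here counts all primitives including control-flow operations, and constraint (2) is interpreted as forbidding immediate repetition of the same control operation.

```python
# Sketch of a random-program generator obeying the constraints from A.7
# (illustrative only; primitive names follow Table 3).
import random

SHIFTS = ["shift-right", "shift-left", "shift-up", "shift-down"]
SHAPE_CONVERSIONS = ["to-square", "to-circle", "to-delta", "to-triangle"]
AFFINES = ["affine-right", "affine-left"]

def random_program(length: int, rng: random.Random) -> list:
    """Sample a program of `length` primitives that ends with `out`, has no
    consecutive shape conversions, no repeated out/reset, and uses `reset`
    only immediately after an `out`."""
    program = []
    while len(program) < length - 1:
        prev = program[-1] if program else None
        options = SHIFTS + AFFINES
        if prev not in SHAPE_CONVERSIONS:                 # constraint (1)
            options = options + SHAPE_CONVERSIONS
        if prev != "out" and len(program) < length - 2:   # constraint (2)
            options = options + ["out"]
        if prev == "out":                                 # constraint (4)
            options = options + ["reset"]
        program.append(rng.choice(options))
    return program + ["out"]                              # constraint (3)

rng = random.Random(0)
print(random_program(6, rng))   # prints a length-6 program ending in 'out'
```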
B. Details of Meta Learning Implementations

We cast our analogical reasoning task into a few-shot classification problem: each task consists of 6 examples representing a single transformation, where each example consists of an input image, 4 options for the possible outputs, and an index indicating the correct output, as shown in Figure 11. The task also contains a query image and 4 option images for that query. This is further deconstructed into a binary classification problem: for each example, we compute the pixel difference between the input image and each of the 4 option images. Each such pixel difference is assigned a binary label, resulting in 6 × 4 = 24 meta-train examples and 4 meta-test queries per task. We used custom-made Convolutional Neural Networks to extract features from the input images. CNP and MAN were run for 60 epochs whereas MAML was run for 30 epochs. MAML uses a dropout of 0.5 and CNP uses a dropout of 0.8. We used the ReduceLROnPlateau learning rate scheduler with a threshold of 10⁻³, modifying the other parameters as required. Both dropout and the learning rate scheduler were essential to reduce fluctuations in the training loss.

Figure 11: Visualisation of a task from the few-shot classification dataset used to evaluate end-to-end meta-learners. Each row is an example, containing an input image, 4 possible output images and the index (starting from 0) of the correct choice. The transformation here is shift-down, shift-down, to-circle, which corresponds to shifting the objects present in the input image down by two grid positions and replacing them with circles.
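The recasting of a single task into this binary-classification episode can be sketched as follows (illustrative only, assuming images are given as NumPy arrays; the function name and argument layout are not from the released code).

```python
# Sketch (illustrative) of recasting one analogy task into the binary
# classification episode described above: each (input, option) pixel
# difference becomes one example, labelled 1 only for the correct option.
import numpy as np

def build_episode(examples, query_image, query_options):
    """examples: list of (input_image, [4 option images], correct_index)."""
    train_x, train_y = [], []
    for inp, options, correct in examples:
        for k, opt in enumerate(options):
            train_x.append(opt.astype(np.float32) - inp.astype(np.float32))
            train_y.append(1 if k == correct else 0)
    test_x = [opt.astype(np.float32) - query_image.astype(np.float32)
              for opt in query_options]
    # 6 examples x 4 options = 24 meta-train pairs, 4 meta-test queries
    return np.stack(train_x), np.array(train_y), np.stack(test_x)
```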
C. Further Results

C.1. Evaluations of the Representation Space

Figure 12: t-SNE plots of the latent embeddings of the autoencoder in the guided case trained with a mean-squared distance loss, as opposed to the log-likelihood loss used in the final system. Each point represents the embedding of a specific shape in a specific grid position; the left-hand plots colour points according to shape and the right-hand plots according to grid position. The clusters for each position are much more tightly grouped than when using the log-likelihood loss, making it much harder to distinguish between shapes in the same position.

Figure 13: Reconstruction error of different autoencoders (a vanilla autoencoder, an AE with a sigmoid output, and an AE with a sigmoid output trained with affine shapes included) evaluated on different data regimes in the unguided case (seen/unseen shape × seen/unseen position). The prefix seen indicates that the error is reported over variants of the particular attribute (shape or position) that were present during training, and unseen indicates that the error is over variants that were not present during training. For example, the unseen shape, seen position case reports the reconstruction error for shapes that were not present during training of the autoencoder but in positions that were included. Here, with sigmoid denotes that the autoencoder uses a sigmoid activation on the output layer, and with affine denotes that images with affine transformations of the shapes and characters were included during training. These results indicate that reconstruction is improved drastically by the use of the sigmoid and also somewhat by more diverse training data.

C.2. Ability of Neural Transforms to Compose

Figure 14: Average success ratio of the system using the unguided representation space on programs of varying length (1-10). With and without affine indicate whether the dataset of programs being evaluated contains affine transforms.

C.3. Evaluation of Guided Representation Space Generalisation

To evaluate how well the guided representation space (encoder and decoder) generalises to unseen situations, we use the following scheme. Consider a tuple (𝑀, 𝑁, 𝑅, 𝑇) ∈ 𝒫(𝑄)⁴, where 𝑄 is the set of all shapes as described in Figure 6, 𝑀 denotes the shapes present in the training data for the latent space (here |𝑀| + 1 + 6 is the size of the multi-hot vector, where 6 is for the positional encoding and 1 is for the unseen-shape slot), 𝑁 denotes the shapes present in the training data for the transforms, 𝑅 denotes the shapes present in the training data for the encoder, and 𝑇 denotes the shapes present in the examples on which the latent space is evaluated. We perform the following 4 experiments based on this scheme to test the generalisation capabilities of the guided representation space. The results are given in Figure 15. Note that the total number of possible shapes is |𝑄| = 20.

1. For 𝑀 = 𝑁 = ∅ (all shapes considered unseen in the multi-hot) and 𝑇 = 𝑄 (all shapes included in the test examples), we vary the shapes shown to the CNN encoder during training from |𝑅| = 1 (just one shape included in the training data) to 𝑅 = 𝑄 (all shapes included) and record the evaluation performance.

2. For 𝑀 = 𝑁 = ∅ (all shapes considered unseen in the multi-hot), we vary the shapes shown to the CNN encoder during training from |𝑅| = 1 to 𝑅 = 𝑄 and record the evaluation performance of this encoder on 𝑇 = 𝑅′ (testing only on shapes not present in the encoder's training data).

3. For 𝑀 = 𝑁 = {square, triangle, circle, delta} (four shapes plus unseen in the multi-hot) and 𝑇 = 𝑄 (all shapes included in the test examples), we vary the shapes shown to the CNN encoder during training from |𝑅| = 1 to 𝑅 = 𝑄 and record the evaluation performance.

4. For 𝑀 = 𝑁 = {square, triangle, circle, delta} (four shapes plus unseen in the multi-hot), we vary the shapes shown to the CNN encoder during training from |𝑅| = 1 to 𝑅 = 𝑄 and record the evaluation performance of this encoder on 𝑇 = 𝑅′ (testing only on shapes not present in the encoder's training data).

Figure 15: Generalisation of the guided representation space to unseen shapes.
C.4. Evaluation of Transform Generalisation in Guided Representation Space

To evaluate how well neural transforms learned in the guided representation space generalise to unseen situations, we first learn the representation space with data containing all of the shapes and positions. We then train the neural transforms on (input, output) image pairs containing only a subset of the shapes (or input positions). During testing, we include all shapes (or positions). This can be thought of as keeping 𝑀 = 𝑅 = 𝑇 = 𝑄 while 𝑁 is varied from 𝑄 to ∅. If the transforms give similar performance to those trained with all the data, we can say that they, together with the latent space, are generalising well. The results are given in Figure 16.

Figure 16: Generalisation of transforms to unseen attributes for the guided representation space.

C.5. Evaluation of Unguided Representation Space Generalisation

Figure 17: Average success ratio of the system using the unguided representation space on programs of varying length (1-10), on a dataset of (input, output) examples containing shapes that were not seen during training. Out of the 20 total shapes, 15 were present during training and all 20 were present in the evaluation dataset. With and without affine indicate whether the dataset of programs being evaluated contains affine transforms.

C.6. Evaluation of Search Performance

Figure 18: Comparison between the running times of naive and pruned BFS for program search. Pruning leads to significant performance gains, enabling the system to search for longer programs within shorter time budgets. While this plot is for the guided representation case, similar behaviour is observed for the unguided case, since both use the same search procedure and pruning measures.