<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Knowledge-based Analogical Reasoning in Neuro-symbolic Latent Spaces</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Vishwa Shah</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Aditya Sharma</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gautam Shroff</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lovekesh Vig</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tirtharaj Dash</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ashwin Srinivasan</string-name>
        </contrib>
        <aff id="aff1">
          <institution>APPCAIR, BITS Pilani</institution>
          ,
          <addr-line>K.K. Birla Goa Campus</addr-line>
        </aff>
        <aff id="aff0">
          <label>0</label>
          <institution>TCS Research</institution>
          ,
          <addr-line>New Delhi</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>Analogical Reasoning problems pose unique challenges for both connectionist and symbolic AI systems, as they entail a carefully crafted solution combining background knowledge, deductive reasoning and visual pattern recognition. While symbolic systems are designed to ingest explicit domain knowledge and perform deductive reasoning, they are sensitive to noise and require inputs to be mapped to a predetermined set of symbolic features. Connectionist systems, on the other hand, are able to directly ingest rich input spaces such as images, text or speech and can perform robust pattern recognition even with noisy inputs. However, connectionist models struggle to incorporate explicit domain knowledge and perform deductive reasoning. In this paper, we propose a framework that combines the pattern recognition capabilities of neural networks with symbolic reasoning and background knowledge for solving a class of Analogical Reasoning problems where the set of example attributes and possible relations across them are known a priori. We take inspiration from the ‘neural algorithmic reasoning’ approach [DeepMind 2020] and exploit problem-specific background knowledge by (i) learning a distributed representation based on a symbolic model of the current problem, (ii) training neural-network transformations reflective of the relations involved in the problem, and finally (iii) training a neural network encoder from images to the distributed representation in (i). These three elements enable us to perform search-based reasoning using neural networks as elementary functions manipulating distributed representations. We test our approach on visual analogy problems in Raven’s Progressive Matrices, and achieve accuracy competitive with human performance and, in certain cases, superior to initial end-to-end neural-network based approaches. While recent neural models trained at scale currently yield the overall SOTA, we submit that our novel neuro-symbolic reasoning approach is a promising direction for this problem, and is arguably more general, especially for problems where sufficient domain knowledge is available.</p>
      </abstract>
      <kwd-group>
        <kwd>neural reasoning</kwd>
        <kwd>visual analogy</kwd>
        <kwd>neuro-symbolic learning</kwd>
        <kwd>RPMs</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Many symbolic reasoning algorithms can be viewed as searching for a solution in a space defined
by prior domain knowledge. Given sufficient domain knowledge represented in symbolic form,
‘difficult’ reasoning problems, such as analogical reasoning, can be ‘solved’ via exhaustive search,
even though they are challenging for the average human. However, such algorithms operate on
a symbolic space, whereas humans are easily able to consume rich data such as images or speech.
Neural networks on the other end are proficient in encoding high-dimensional continuous data
and are robust to noisy inputs, but struggle with deductive reasoning and absorbing explicit
domain knowledge. ‘Neural Algorithmic Reasoning’[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] presents an approach to jointly model
neural and symbolic learning, wherein rich inputs are encoded to a latent representation that
has been learned from symbolic inputs. This design allows neural learners and algorithms
to complement each other’s weaknesses. Through this work, we aim to investigate a variation
of the neural algorithmic reasoning approach applied to analogical reasoning, using Raven’s
Progressive Matrices [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] problems as a test case.
      </p>
      <p>
        Our approach essentially exploits domain
knowledge to train a suite of neural networks, one
for each known domain predicate. For RAVEN
problems, these predicates represent rules that
might apply to ordered sets (rows) of images in a
particular problem. Further, these neural
predicates are trained to operate on a special
high-dimensional representation space (a ‘symbolic latent
space’) that is derived, via self-supervised learning,
from the symbolic input space. Note that a purely
symbolic algorithm can consume symbolic inputs
to solve the problem exactly; however, a distributed
representation allows for real-world analogical
reasoning over rich input spaces such as images or
speech (see Fig. 1 for a RAVEN problem, showing the
Problem Matrix (top) and Answer Options (bottom); one can
also imagine tasks with speech inputs where the analogous example signals are high-pitch-transformed
versions of the original). Our approach differs from [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] where the symbolic latent
space is derived via a supervised approach; by using a self-supervised learning approach we are
able to use the same representation space to train multiple neural predicates, unlike in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] where
only a single function is learned. Next, we train a neural network encoder to map real-world
images (here sub-images of the RAVENs matrices) to the ‘symbolic latent space’. Finally, using
the above elements together we are able to perform symbolic search-based reasoning, albeit
using neural-networks as primitive predicates, to solve analogical reasoning problems.
Contributions: (1) We adapt and extend Neural Algorithmic Reasoning to propose a
neuro-symbolic approach for a class of visual analogical reasoning problems. (2) We present
experimental results on the RAVEN Progressive Matrices dataset, compare our neuro-symbolic
approach to purely connectionist approaches, and analyse the results. In certain cases, our
approach is superior to initial neural approaches, as well as to human performance (though more
recent neural approaches trained at scale remain SOTA). (3) We submit that our approach can
be viewed as a novel and more general neuro-symbolic procedure that uses domain knowledge
to train neural network predicates operating on a special, ‘symbolically-derived latent space’,
which are then used as elementary predicates in a symbolic search process.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Problem Definition</title>
      <p>
        In general, ‘RAVEN-like’ analogical reasoning tasks can be viewed as comprising k ordered
example sets S_1, S_2, ..., S_k containing n input samples each, an additional test set containing n − 1 samples
and a target n-th sample. Each sample s_ij, where i ∈ [1...k] and j ∈ [1...n], is comprised of a set
of entities E_ij, and each entity e ∈ E_ij has attributes from a set A of m predefined attributes
a_1, a_2, ..., a_m ∈ A. Assume a predefined set of all possible rules R = {r_1, r_2, ..., r_p} that can hold over
sample attributes in an example set. For a given task the objective is to infer which rules hold
across the n samples in each of the k example sets in order to subsequently predict the analogous
missing sample for the test set, either by generating the target sample as in the ARC challenge
[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], or by classifying from a set of possible choices as in RPMs. Note that the problem definition
assumes prior domain knowledge about possible sample entities, their possible attribute values,
and possible rules over sample attributes within the example sets. It is worth mentioning that
while here we investigate visual analogies, the problem definition can accommodate input
samples of any datatype such as audio or text as long as the problem structure is unchanged.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Proposed Approach</title>
      <p>We adapt a variation of the neural algorithmic reasoning approach to the problem defined
in Section 2, where we (i) learn a latent representation based on the symbolic representation
of the tasks, via self-supervised learning; (ii) train neural networks to infer the rules involved
in the problem; (iii) train a neural encoder from images to align with the symbolic latent
representations in (i), and (iv) use the above elements to solve a given task via a neuro-symbolic
search procedure, i.e., where the elementary predicates are neural networks. We assume the
presence of a training dataset with correct answer labels for the analogical reasoning
tasks. The components (i), (ii) and (iii) are trained independently, as for each of them we know
or can determine the inputs and the targets depending on their function. We also evaluate an
alternative of (v) encoding an image to the symbolic latent space via the neural encoder in (iii)
above, and decode it to symbolic form using the decoder trained in (i) on the symbolic space, on
which purely symbolic search is used to solve a problem instance.</p>
      <sec id="sec-3-1">
        <title>3.1. Learning a Distributed Representation from the Symbolic Space</title>
        <p>We begin with a symbolic multi-hot task representation s, which is a series of concatenated
one-hot vectors, one for each image entity attribute. Each attribute can take a value from a finite set
and hence is represented using a one-hot vector. To obtain the latent representations of the
tasks, we train an auto-encoder A_S = (E_S, D_S) on the symbolic task definitions S, where
the encoder E_S maps from the symbolic space to the latent space and the decoder D_S maps the
representation from the latent space back to the symbolic space, as shown in component (i) of
Fig. 2. As we want to reconstruct the multi-hot representation, a sum of negative log-likelihoods
is computed, one for each one-hot encoding present in the multi-hot representation. We provide an
example in C.1, where the parameters θ and φ are obtained via gradient descent on a combination
of negative log-likelihood loss functions, as shown in equations 1 and 2 in the appendix.</p>
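As a minimal, hypothetical sketch of the loss just described (dimensions and attribute names are illustrative, not the paper's exact configuration): the symbolic input is a multi-hot vector, one one-hot block per attribute, and the reconstruction loss sums a softmax negative log-likelihood over each block.

```python
import numpy as np

# Illustrative attribute alphabets (RAVEN-like: 5 types, 6 sizes, 10 colors).
ATTR_SIZES = {"type": 5, "size": 6, "color": 10}

def multihot(values):
    """Concatenate one-hot blocks, one per attribute, in a fixed order."""
    blocks = []
    for name, n in ATTR_SIZES.items():
        one_hot = np.zeros(n)
        one_hot[values[name]] = 1.0
        blocks.append(one_hot)
    return np.concatenate(blocks)

def reconstruction_nll(logits, target):
    """Sum of per-attribute NLLs: a softmax over each one-hot block."""
    loss, start = 0.0, 0
    for n in ATTR_SIZES.values():
        block_logits = logits[start:start + n]
        log_probs = block_logits - np.log(np.sum(np.exp(block_logits)))
        loss += -np.dot(target[start:start + n], log_probs)
        start += n
    return loss

s = multihot({"type": 2, "size": 0, "color": 7})
perfect = np.where(s == 1, 50.0, -50.0)   # near-perfect decoder logits
assert s.sum() == 3 and len(s) == 5 + 6 + 10
assert reconstruction_nll(perfect, s) < 1e-6
```

In the paper this loss is applied to the decoder output D_S(E_S(s)); here the logits are supplied directly to isolate the per-block NLL structure.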
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Training Rule Identification Neural Networks</title>
        <p>Next, for every attribute, and for each applicable rule for that attribute, we train a Rule
Identification neural network classifier to predict whether the rule (pattern) holds for the example set. As
mentioned in Section 2, we know the rules that can hold over attributes, giving us a definite set of
networks to be trained. We refer to any rule identification network N using the (attribute, rule-type)
pair for which it is trained. The latent representations obtained after encoding the symbolic
representations of each of the samples in the example set are concatenated and sent as input to the
rule identification networks. While training a neural net for an (attribute, rule-type) pair, we
categorize each example set with the specific label for that particular rule and attribute, labeling it
with 0 if the rule-type is not followed, 1 if the rule-type is followed, or a rule-value indicating the
level of the rule-type when it is followed. As each of these rule-types is deterministic, we can
obtain the rule-type and value for each row from their symbolic representations. The overview
can be seen in component (ii) of Fig. 2, which shows the input for these elementary neural
networks and the expected output determined for the (attribute, rule-type) pair. We see in Fig.
2 that type (shape) stays constant in row R1, hence the expected target for N(Type, Constant) should
be 1; in row R2 the type changes, so we expect N(Type, Constant)
to predict 0, indicating the rule is not obeyed. With these labels, each network is optimized
using cross-entropy loss. For parameterized rules we train additional networks to predict the
parameters (examples are provided in appendix section C.2).</p>
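Because the rule-types are deterministic on symbolic inputs, the training labels can be computed directly from a row's attribute values. A small sketch of that labelling, under the rule semantics given in appendix B.2 (function names are ours):

```python
def progression_label(row):
    """Label for N(attribute, Progression): the increment (+/-1 or +/-2)
    if the three panels follow Progression, else 0."""
    a, b, c = row
    step = b - a
    if step == c - b and step in (-2, -1, 1, 2):
        return step
    return 0

def constant_label(row):
    """Binary label for N(attribute, Constant): 1 if the value never changes."""
    return int(row[0] == row[1] == row[2])

assert progression_label([1, 3, 5]) == 2
assert progression_label([4, 4, 4]) == 0   # Constant, not Progression
assert constant_label([4, 4, 4]) == 1
```

At training time these labels pair with the concatenated latent representations of the row's three panels, as described above.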
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Sample representation</title>
        <p>As our objective is to apply our approach to samples in an unstructured (image, text, speech)
format, we want to develop a representation for the samples that resembles the latent
representation of symbolic inputs. For this we train a neural network encoder E_X over the rich input
space X which encodes each sample to a latent space. We want to minimize the disparity between the
latent representation of a sample x and that of its corresponding symbolic representation s. For
this we use the encoder E_S trained in 3.1. We minimize the mean squared error over all pairs of symbolic and
input representations (equation 3 in the appendix). This enables us to use the previously learned
neural networks for rule inference on our image data.</p>
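A toy sketch of this alignment objective, with both encoders stood in by fixed random linear maps (the dimensions and the linear form are illustrative assumptions, not the paper's architectures):

```python
import numpy as np

rng = np.random.default_rng(0)
W_s = rng.normal(size=(8, 21))   # frozen symbolic encoder E_S (stand-in)
W_x = rng.normal(size=(8, 64))   # image encoder E_X being trained (stand-in)

def alignment_loss(images, symbols):
    """Mean squared error between E_X(x) and E_S(s) over paired batches."""
    diffs = images @ W_x.T - symbols @ W_s.T   # E_X(x) - E_S(s), row-wise
    return float(np.mean(np.sum(diffs ** 2, axis=1)))

xs = rng.normal(size=(4, 64))   # batch of (flattened) image inputs
ss = rng.normal(size=(4, 21))   # paired symbolic multi-hots
assert alignment_loss(xs, ss) >= 0.0
```

In training, E_S stays frozen and only E_X's parameters receive gradients, so the image latents are pulled toward the already-learned symbolic latent space.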
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Combining the Elements</title>
        <p>As shown in component (iv) of Fig. 2, we first use E_X(x) as inputs to find the underlying rules
using the neural networks trained in Section 3.2. Once we obtain the set of (rule-type, value)
pairs for each attribute, we apply these neural networks to each of the answer choices by
adding them to the test set. For each attribute, we obtain the output probability score for that
rule-type and value. The final score is the sum of these probability scores over all the inferred
rules (the complete pseudo-algorithm and scoring function are given in the appendix). The choice with the highest score is returned as the answer.</p>
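The selection step above can be sketched as follows; `rule_prob` stands in for the trained rule-identification networks, and the dict-based rows and `toy_prob` scorer are illustrative assumptions:

```python
def select_answer(inferred_rules, rule_prob, third_row_prefix, options):
    """inferred_rules: list of (attribute, rule_type, value) triples inferred
    from the example rows. rule_prob(attr, rule, value, row) -> probability.
    Returns the index of the option with the highest summed rule score."""
    best, best_score = None, float("-inf")
    for idx, option in enumerate(options):
        row = third_row_prefix + [option]   # complete the test row
        score = sum(rule_prob(a, r, v, row) for a, r, v in inferred_rules)
        if score > best_score:
            best, best_score = idx, score
    return best

# Toy scorer: rewards rows whose attribute values are all equal (Constant).
def toy_prob(attr, rule, value, row):
    return 1.0 if len(set(x[attr] for x in row)) == 1 else 0.0

rules = [("size", "Constant", None)]
prefix = [{"size": 3}, {"size": 3}]
assert select_answer(rules, toy_prob, prefix, [{"size": 1}, {"size": 3}]) == 1
```

In the actual system, `rule_prob` is the softmax output of the relevant N(attribute, rule-type) network applied to concatenated latent representations of the completed row.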
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Empirical Evaluation</title>
      <p>
        Raven’s Progressive Matrices (RPM) is a widely used visual reasoning puzzle for testing
human intelligence [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The idea is to infer multiple patterns in order to find the answer
from the options; an example from the RAVEN[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] dataset is seen in Fig 1. In this
dataset, for every problem, each independent attribute follows an algorithmic rule. The task
is to deduce the underlying rules applied over each of the first two rows, followed by
selecting the option that validates all the inferred rules when aligned with the images in the last
row. As seen in the first two rows of Fig 1, the attributes Number, Position and
Type stay constant across the rows; Size follows an increasing progression; and a fixed set
of 3 colors is permuted within each row, indicating Distribute Three. Hence option 3 is the only
solution that satisfies all the rules. For our experiments, we use the “Relational and Analogical
Visual rEasoNing” dataset (RAVEN)[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], which was introduced to drive reasoning ability in
current models. RAVEN consists of 10,000 problems for each of the 7 different configurations
of the RPM problem shown in Fig 3. Each problem has 16 images (8 in the problem matrix and 8
answer options). The complete pseudo-algorithm with the scoring function, a dataset overview,
and the rule and attribute definitions for the RAVEN problems are provided in the appendix.
      </p>
      <sec id="sec-4-1">
        <title>4.1. Experimental Details</title>
        <p>Each image in a RAVEN problem can be represented symbolically in terms of the entities
(shapes) involved and their attributes: Type, Size, and Color; images with multiple entities
additionally have Number and Position attributes. These attributes are rule-governing in that
rules based on them can be applied to each row, and the combination of rules from all rows is
used to solve a given problem. For example, for each entity the multi-hot representation s is of size
|T| + |S| + |C|, where T, S and C are the sets of shapes, possible sizes and possible colors respectively.
The multi-hot vector is made up of 3 concatenated one-hot vectors, one each for type, size, and
color. In the case of multiple components, e.g. Left-Right, we concatenate the multi-hots of both
entities. For our auto-encoder architecture, we train simple MLPs with a single hidden layer for
both E_S and D_S (using loss functions 1 and 2 in the appendix). The dimensions of the layers and latent
representations are chosen based on the RPM configuration.</p>
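The representation just described can be sketched concretely (the sizes below match RAVEN's alphabets as listed in appendix B.1; the function names are ours):

```python
import numpy as np

SIZES = (5, 6, 10)   # |T| types, |S| sizes, |C| colors

def entity_multihot(t, s, c):
    """A |T|+|S|+|C| = 21-dim multi-hot: one one-hot block per attribute."""
    v = np.zeros(sum(SIZES))
    offs = 0
    for idx, n in zip((t, s, c), SIZES):
        v[offs + idx] = 1.0
        offs += n
    return v

def config_multihot(entities):
    """Multi-component configurations (e.g. Left-Right) concatenate the
    per-entity multi-hots."""
    return np.concatenate([entity_multihot(*e) for e in entities])

left_right = config_multihot([(0, 2, 4), (3, 1, 9)])
assert left_right.shape == (42,) and left_right.sum() == 6
```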
        <p>
          Following [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]’s description of RPM, there are four types of rules: Constant, Distribute Three,
Arithmetic, and Progression. In a given problem, one rule is applied to each rule-governing
attribute across all rows, and the answer image is chosen so that this holds. We aim
to find the set of rules obeyed by both of the first two rows.
        
        <p>So for every attribute and each of its rule-types, we train an elementary neural network classifier
N(attribute, rule-type) to verify whether the rule is satisfied in a given row, or pair of rows. The rules of
Progression and Arithmetic are further associated with a value (e.g., Progression could have
increments or decrements of 1 or 2). For the rule-types Constant and Distribute Three we train a
binary classifier, and for the rule-types Arithmetic and Progression we train a multi-class classifier
that also predicts the value associated with the rule. An example is described in the appendix
along with further details on the neural networks used.</p>
        <p>We train a CNN-based image encoder E_X over the image space X which encodes each image
of the problem to a latent space, and minimize the disparity with the corresponding symbolic
latent representation as described in Section 3.3. Finally, as shown in component (iv) of Fig. 2, we find
the underlying rules using the neural networks trained in section 3.2. Once we obtain the set
of (rule-type, value) pairs for each attribute, we apply these neural networks to each of the 8
options by placing them in the last row. We obtain the output probability score for each attribute,
rule-type and value, and sum the probability scores over all the inferred rules; the image
with the highest score is returned as the answer. (The complete pseudo-algorithm with the scoring function is given in the appendix.)</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Results</title>
        <p>We use the test set provided by RAVEN to evaluate the rule classification networks and the
final accuracy. Table 2 lists the F1-score of each N(attribute, rule-type) classification network across all
configurations. We observe that 85% of the neural networks have an F1-score ≥ 0.90. This is
consistent with the idea that these networks, trained on latent representations of symbolic
data to perform elementary functions, do well on specialized reasoning components.</p>
        <p>Table 1 shows end-to-end accuracy for different RAVEN problem configurations. In our
proposed neural reasoning approach (A), image input is encoded by E_X(x) and
neural reasoning takes place in the latent space, i.e. steps (i)-(iv) in Section 3. We also show results for
an alternative (B), i.e. (v) in Section 3, where image inputs are decoded to symbolic space via
D_S(E_X(x)), followed by purely symbolic reasoning (algorithmic solving). Results using neural
reasoning in the latent space but with the correct symbolic inputs mapped via E_S(s) are also
shown as (C), to highlight the loss in accuracy incurred when encoding images using E_X(x).</p>
        <p>
          We use ResNet+DRT from RAVEN as our baseline, human performance (provided in [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] )
as a reference, and other SOTA methods for comparison. We note that the RAVEN baseline
is bested by A (neural reasoning on image inputs) for 4 out of the 7 configurations, and by B
(symbolic reasoning on image inputs) for one of the more difficult cases (2x2). At the same time
we observe that approach B is better than A except for the difficult case of the 3x3 grid, where the
encoder-decoder combination D_S(E_X(x)) produces too many errors, adversely affecting purely
symbolic reasoning.
        
        </p>
        <p>The accuracy of neural reasoning from symbolic inputs, i.e. (C), consistently exceeds that of approach A,
which can be attributed to the closer relation of the latent space to the algorithmic symbolic
space. We also observe lower performance for the configurations 2x2 Grid, 3x3 Grid, and Out-In
Grid. Upon analysis, we find that the performance of E_X for these configurations is relatively
lower, as each of these components has multiple entities, and the task of transforming the image
into the latent space and identifying rules becomes difficult.</p>
        <p>While more recent purely neural-network based approaches remain SOTA, we note that for
the simpler configurations our neuro-symbolic approaches are competitive. We speculate that,
because of the complex nature and difficulty of these configurations, using more powerful neural
architectures (such as transformers) for self-supervised learning of the symbolic latent space
as well as for learning predicates can be useful. More generally, our results provide evidence
that a neuro-symbolic search using neural-network based elementary predicates, trained on
a symbolic latent space, may be a promising approach for learning complex reasoning tasks,
especially where domain knowledge is available.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Discussion</title>
      <p>
        While the results presented in this paper pertain to visual analogical reasoning problems, it
should be noted that the procedure presented in Section 3 is agnostic to the input modality.
Figure 4 illustrates analogical reasoning problems in speech and text respectively; the first task
involves analogical reasoning in speech, where the input corresponds to a speech sample in a
male voice and the output samples correspond to the same utterance in a female voice: The task
is to infer that this is the transformation involved and analogously generate the output speech
signal for the target query. Possible attributes for rules on a speech signal can include discrete
values of pitch, tone, volume or others. In the second, text-based example, inputs correspond to
positive reflections of an input passage, and the outputs correspond to negative reflections of
the same passage. Attributes for text rule identification can similarly include textual aspects
like language, sentiment and style. Note that both these examples require generation of the
missing target output which is a harder task than classification from a set of possible choices.
However, given the recent progress in conditional generation for images [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] and text[
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], this
seems entirely feasible.
      </p>
    </sec>
    <sec id="sec-6">
      <title>6. Related Work</title>
      <p>
        The ‘neural algorithmic reasoning’[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] approach presents a procedure for building neural
networks that can mimic algorithms. It includes training processor networks that can operate over
high-dimensional latent spaces to align with fundamental computations. This improves
generalization and reasoning in neural networks. RAVEN[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] combines both visual understanding
and structural reasoning using a Dynamic Residual Tree (DRT) graph developed from the structural
representation and aggregates latent features in a bottom-up manner. This provides a direction
suggesting that augmenting networks with domain knowledge performs better than black-box
neural networks. The Scattering Compositional Learner (SCL)[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] presents an approach where the
model learns a compositional representation: independent networks encode object and
attribute representations, relationship networks infer rules, and
their composition is used to make a prediction. Our work bears similarity with this approach as both
utilize background knowledge in composing a larger mechanism from elementary networks.
CoPINet[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] presents the Contrastive Perceptual Inference network which is built on the idea of
contrastive learning, i.e. to teach concepts by comparing cases. The Dual-Contrast Network
(DCNet)[
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] works on similar lines as it uses 2 contrasting modules: rule contrast and choice
contrast for its training. We draw inspiration from [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] which also presents a variation of the
Neural Algorithmic Reasoning approach applied to visual reasoning.
      </p>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion</title>
      <p>In this work, we have proposed a novel neuro-symbolic reasoning approach in which we learn
neural-network based predicates operating on a ‘symbolically-derived latent space’ and use
these in a symbolic search procedure to solve complex visual reasoning tasks, such as Raven’s
Progressive Matrices. Our experimental results (though preliminary, in that our predicates are
compositions of simple MLPs) indicate that the approach points to a promising direction for
neuro-symbolic reasoning research.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>This work is supported by “The DataLab” agreement between BITS Pilani, K.K. Birla Goa
Campus and TCS Research, India.</p>
    </sec>
    <sec id="sec-9">
      <title>A. Overview of RAVEN dataset generation</title>
      <p>To give an overview of how the RAVEN dataset was generated: the authors used an A-SIG
(Attributed Stochastic Image Grammar) to generate the structural representation of each RPM. Each RPM is
a parse tree that instantiates from this A-SIG. After this, rules and the initial attributes for that
structure are sampled. The rules are applied to produce a valid row. This process is repeated 3
times to generate a valid problem matrix. The answer options are generated by breaking some
set of rules. This structured representation is then used to generate images.</p>
      <p>The RAVEN dataset provides a structural representation that is semantically linked with the
image representation. The structural representation of the image space available in RAVEN
makes it generalizable. As each image in a configuration follows a fixed structure, we use
this knowledge to generate the corresponding symbolic representations. RAVEN has 10,000
problems for each configuration, split into 6,000 train, 2,000 validation and 2,000 test. We use the same
split for training and validation and report results on the test set.</p>
    </sec>
    <sec id="sec-10">
      <title>B. Rule and Attribute definitions</title>
      <sec id="sec-10-1">
        <title>B.1. Attributes</title>
        <p>Number: The number of entities in a given layout; it can take integer values in [1, 9].
Position: Possible slots for each object in the layout. Each entity can occupy one slot.
Type: Entity types can be triangle, square, pentagon, hexagon, and circle.
Size: 6 scaling factors uniformly distributed in [0.4, 0.9].</p>
        <p>Color: 10 grey-scale colors</p>
      </sec>
      <sec id="sec-10-2">
        <title>B.2. Rules</title>
        <p>4 different rules can be applied over rule-governing attributes.</p>
        <p>Constant: Attributes governed by this rule would not change in the row. If it is applied on
Number or Position, attribute values would not change across the three panels. If it is applied
on Entity level attributes, then we leave “as is” the attribute in each object across the three
panels.</p>
        <p>Progression: Attribute values monotonically increase or decrease in a row. The increment or
decrement could be either 1 or 2, resulting in 4 instances in this rule.</p>
        <p>Arithmetic: There are 2 instantiations in this rule, resulting in either a rule of summation or
one of subtraction. Arithmetic derives the value of the attribute in the third panel from the first
2 panels. For Position, this rule is implemented as set arithmetic.</p>
        <p>Distribute Three: This rule first samples 3 values of an attribute in a problem instance and
permutes the values across the different rows.</p>
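The four rule definitions above can be expressed as deterministic checkers on a row's attribute values (a sketch under the stated semantics; function names are ours):

```python
def constant(row):
    return row[0] == row[1] == row[2]

def progression(row):
    # monotonic increase/decrease with a fixed step of +/-1 or +/-2
    step = row[1] - row[0]
    return step == row[2] - row[1] and step in (-2, -1, 1, 2)

def arithmetic(row):
    # summation or subtraction: third value derived from the first two
    return row[2] == row[0] + row[1] or row[2] == row[0] - row[1]

def distribute_three(rows):
    # the same 3 sampled values are permuted across the rows
    return all(set(r) == set(rows[0]) and len(set(r)) == 3 for r in rows)

assert progression([2, 4, 6]) and not constant([2, 4, 6])
assert arithmetic([5, 2, 3]) and arithmetic([2, 3, 5])
assert distribute_three([[1, 2, 3], [3, 1, 2]])
```

These deterministic checkers are what make the supervised labelling of the rule-identification networks in Section 3.2 possible without manual annotation.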
      </sec>
    </sec>
    <sec id="sec-11">
      <title>C. Autoencoder, Neural Predicates and Image Encoder</title>
      <sec id="sec-11-1">
        <title>C.1. Autoencoder</title>
        <p>The symbolic encoder E_S is trained using the following losses, as described in Section 3. As we
want to reconstruct the multi-hot representation, a sum of negative log-likelihoods is computed,
one for each one-hot encoding present in the multi-hot representation. Here ŝ_a denotes the output
nodes from the decoder corresponding to the a-th attribute and s_a denotes the one-hot input for
the same attribute. In equations 1 and 2 we use the example from RAVEN where the attributes
are type, size, color, etc.</p>
        <p>L_S(s, ŝ) = Σ_{a ∈ {type, size, color, …}} −log( exp(ŝ_a · s_a) / Σ_j exp(ŝ_{a,j}) )   (1)</p>
        <p>θ*, φ* = argmin_{θ, φ} Σ_{s ∈ S} L_S(s, ŝ), where ŝ = D_S(E_S(s))   (2)</p>
      </sec>
      <sec id="sec-11-2">
        <title>C.2. Neural Predicates for Rule Classification</title>
        <p>For every attribute, and for each of its rule-types, we train an elementary neural network classifier
to verify whether the rule is satisfied in the row; this acts as our Rule Identification network. In this
work, we refer to any rule identification network N using the (attribute, rule-type) pair for which
it is trained. For the rule-types Constant and Distribute Three we train a binary classifier. The rules
of Progression and Arithmetic are also associated with a value (e.g., Progression could have
increments or decrements of 1 or 2), hence for the rule-types Arithmetic and Progression we train
a multi-class classifier to also predict the value associated with the rule. Example: a neural
network for Center, N(Type, Constant), is a binary classifier trained to identify whether a row from the
configuration Center has constant type (shape) across the 3 panels. Similarly, a neural network
for Left, N(Size, Progression), is a five-class classifier trained to classify whether there is a progression in size
in the Left component; it predicts 0 if there is no progression and predicts the progression
value, increment or decrement (-2, -1, 1, 2), otherwise.</p>
        <p>The latent representations E_S(s) obtained after encoding the symbolic representations of each
of the three panels in the row are concatenated and sent as input to the neural networks. While
training a neural net for an (attribute, rule-type) pair, we categorize each row with the specific
label for that particular rule and attribute, labeling it with 0 if the rule-type is not followed
and with 1 or the rule-value indicating the level of the rule-type when it is followed. As each of
these rule-types is deterministic, we can obtain the rule-type and value for each row using their
symbolic representations. The overview can be seen in component (ii) of Fig. 2 where we see
the input for these elementary neural networks and the expected output determined for the
(attribute, rule-type)</p>
        <p>pair. With these labels, each network is optimized using cross-entropy
loss. Each network is a shallow MLP classifier with 1 or 2 hidden layers whose dimensions are
chosen depending on the configuration and validation set. These classifiers are trained using
symbolic representations for each component across the various configurations and we provide
the results in Table. 2.</p>
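        <p>A minimal sketch of one such rule-identification network (forward pass only, with hypothetical dimensions; not the authors' implementation): the three panel latents are concatenated and passed through a shallow ReLU MLP producing class logits:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16  # hypothetical latent dimension of one panel

class RuleMLP:
    """One rule-identification network N(attribute, rule-type):
    a shallow MLP over the concatenated latents of a row's three panels."""

    def __init__(self, in_dim, hidden, n_classes):
        self.W1 = rng.normal(0.0, 0.1, (in_dim, hidden))
        self.b1 = np.zeros(hidden)
        self.W2 = rng.normal(0.0, 0.1, (hidden, n_classes))
        self.b2 = np.zeros(n_classes)

    def forward(self, z1, z2, z3):
        x = np.concatenate([z1, z2, z3])          # row representation
        h = np.maximum(x @ self.W1 + self.b1, 0)  # ReLU hidden layer
        return h @ self.W2 + self.b2              # class logits

# e.g. N(size, Progression): five classes (0, -2, -1, 1, 2)
net = RuleMLP(3 * D, 32, 5)
logits = net.forward(*(rng.normal(size=D) for _ in range(3)))
```

<p>A binary head (two classes) serves Constant and Distribute Three; a five-class head serves Progression and Arithmetic.</p>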
      </sec>
      <sec id="sec-11-3">
        <title>C.3. Image Encoder</title>
        <p>To learn the latent representation of the unstructured data such that it mimics the symbolic latent
space, we minimize the mean squared error over all pairs of symbolic and image representations
obtained from E_S(x) and E_X(i) respectively. This enables us to use the previously learned
neural networks for rule inference on our image data.</p>
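        <p>The alignment objective just described amounts to a mean squared error between paired latents; a minimal sketch (names are illustrative):</p>

```python
import numpy as np

def alignment_loss(z_img, z_sym):
    """Mean squared error aligning paired image latents E_X(i) with
    symbolic latents E_S(x); both arguments are (N, d) arrays."""
    return np.mean(np.sum((z_img - z_sym) ** 2, axis=1))
```

<p>The symbolic encoder is held fixed while the image encoder is trained, so the image latents converge onto the symbolic latent space rather than the reverse.</p>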
        <p>L = (1/N) Σ_{(x, i)} ( E_X(i) − E_S(x) )² (3)</p>
        <p>[Answer-selection pseudocode, garbled in extraction; recoverable outline: for each candidate answer i ∈ {1, 2, …, 8}, the candidate panel completes the third row; the trained rule-identification networks for the (attribute, rule-type) pairs, over the rule-types Constant, Distribute Three, Progression and Arithmetic, are evaluated on the completed row; and the candidate whose predicted rule labels match the rules satisfied by the first two rows is selected.]</p>
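        <p>A hedged reconstruction of this answer-selection search in outline (hypothetical helper names, not the authors' exact pseudocode): each candidate completes the third row, the trained predicates are evaluated on it, and the candidate agreeing with the rules of the first two rows is chosen:</p>

```python
def select_answer(row3_panels, candidates, rules_from_rows12, predicates, encode):
    """Score each candidate as the missing third panel; return the index of
    the candidate whose predicted rule labels best match the rules satisfied
    by the first two rows.

    `predicates[(attr, rule)]` is a trained classifier over the encoded row;
    `rules_from_rows12[(attr, rule)]` is the label observed in rows 1-2.
    """
    best, best_score = None, -1
    for i, cand in enumerate(candidates):      # candidates 1..8 in RAVEN
        row = [encode(p) for p in row3_panels] + [encode(cand)]
        score = sum(
            1
            for pair, label in rules_from_rows12.items()
            if predicates[pair](row) == label
        )
        if score > best_score:
            best, best_score = i, score
    return best
```

<p>In the full system the predicates are the trained MLPs applied to concatenated latents; the toy version above works with any callables of the encoded row.</p>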
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>P.</given-names>
            <surname>Veličković</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Blundell</surname>
          </string-name>
          ,
          <source>Neural algorithmic reasoning, Patterns</source>
          <volume>2</volume>
          (
          <year>2021</year>
          )
          <fpage>100273</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>P.</given-names>
            <surname>Carpenter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Just</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Shell</surname>
          </string-name>
          ,
          <article-title>What one intelligence test measures: A theoretical account of the processing in the raven progressive matrices test</article-title>
          ,
          <source>Psychological review 97</source>
          (
          <year>1990</year>
          )
          <fpage>404</fpage>
          -
          <lpage>31</lpage>
          .
          <source>doi:10.1037/0033-295X.97.3.404.</source>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>F.</given-names>
            <surname>Chollet</surname>
          </string-name>
          ,
          <article-title>On the measure of intelligence</article-title>
          , arXiv preprint arXiv:1911.01547 (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Jia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhu</surname>
          </string-name>
          , S.-C. Zhu,
          <article-title>Raven: A dataset for relational and analogical visual reasoning</article-title>
          ,
          <source>in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Jia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.-C.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <article-title>Learning perceptual inference by contrasting</article-title>
          ,
          <source>in: Advances in Neural Information Processing Systems (NeurIPS)</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Grosse</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ba</surname>
          </string-name>
          ,
          <article-title>The scattering compositional learner: Discovering objects, attributes, relationships in analogical reasoning</article-title>
          ,
          <year>2020</year>
          .
          <article-title>arXiv:2007.04212</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>T.</given-names>
            <surname>Zhuo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kankanhalli</surname>
          </string-name>
          ,
          <article-title>Effective abstract reasoning with dual-contrast network</article-title>
          ,
          <source>in: International Conference on Learning Representations</source>
          ,
          <year>2021</year>
          . URL: https://openreview.net/forum?id=ldxlzGYWDmW.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Ramesh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Pavlov</surname>
          </string-name>
          , G. Goh,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gray</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Voss</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Sutskever</surname>
          </string-name>
          ,
          <article-title>Zero-shot text-to-image generation</article-title>
          ,
          <year>2021</year>
          . URL: https://arxiv.org/abs/2102.12092.
          <source>doi:10.48550/ARXIV.2102.12092</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S.</given-names>
            <surname>Dathathri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Madotto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hung</surname>
          </string-name>
          , E. Frank,
          <string-name>
            <given-names>P.</given-names>
            <surname>Molino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yosinski</surname>
          </string-name>
          , R. Liu,
          <article-title>Plug and play language models: A simple approach to controlled text generation</article-title>
          ,
          <year>2019</year>
          . URL: https://arxiv.org/abs/1912.02164.
          <source>doi:10.48550/ARXIV.1912.02164</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Sonwane</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Shroff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Vig</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Srinivasan</surname>
          </string-name>
          , T. Dash,
          <article-title>Solving visual analogies using neural algorithmic reasoning</article-title>
          ,
          <source>CoRR abs/2111.10361</source>
          (
          <year>2021</year>
          ). URL: https://arxiv.org/abs/2111.10361.
          <article-title>arXiv:2111.10361</article-title>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>