<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Evaluating Understanding on Conceptual Abstraction Benchmarks</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Victor Vikram Odouard</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Melanie Mitchell</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Santa Fe Institute</institution>
          ,
          <addr-line>1399 Hyde Park Road, Santa Fe, NM 87501</addr-line>
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
<p>A long-held objective in AI is to build systems that understand concepts in a humanlike way. Setting aside the difficulty of building such a system, even evaluating one is a challenge, due to present-day AI's relative opacity and its proclivity for finding shortcut solutions. This difficulty is exacerbated by humans' tendency to anthropomorphize: we assume that a system that can recognize one instance of a concept must also understand other instances, as a human would. In this paper, we argue that understanding a concept requires the ability to use it in varied contexts. Accordingly, we propose systematic evaluations centered around concepts, probing a system's ability to use a given concept in many different instantiations. We present case studies of such evaluations on two domains—RAVEN (inspired by Raven's Progressive Matrices) and the Abstraction and Reasoning Corpus (ARC)—that have been used to develop and assess abstraction abilities in AI systems. Our concept-based approach to evaluation reveals information about AI systems that conventional test sets would have left hidden.</p>
      </abstract>
      <kwd-group>
        <kwd>abstraction</kwd>
        <kwd>analogy</kwd>
        <kwd>concepts</kwd>
        <kwd>machine learning</kwd>
        <kwd>evaluation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        What unites chain-link fences, high prices, entrance exams, and import tariffs? They are all different kinds of barriers. Your understanding of physical barriers may have helped you quickly intuit how chess pieces move (and the fundamental difference between the knight and the other pieces) from very few examples. It may have helped you relate to a friend struggling with credit card debt, even when your obstacles are very different. It may have helped you describe how being jet-lagged sometimes feels like “hitting a wall.” These examples illustrate the importance of abstract concepts in few-shot learning, generalization, emotional intelligence, and communication. Such examples display the intuition behind Barsalou’s definition of a concept: “a competence or disposition for generating infinite conceptualizations of a category” [
        <xref ref-type="bibr" rid="ref11">1</xref>
        ]. In short, understanding the world entails being able to recognize and generate concepts in both concrete and abstract forms.
      </p>
      <p>
        Early pioneers suggested that their AI summer project might lead to blueprints for machines that could “form abstractions and concepts” [
        <xref ref-type="bibr" rid="ref12">2</xref>
        ]. More than six decades later, AI systems are still extremely limited in this regard: they have yet to surmount the “barrier” of understanding [
        <xref ref-type="bibr" rid="ref13">3</xref>
        ].
      </p>
      <p>
        Evaluating a system’s understanding of concepts and abstractions is challenging. AI systems are known to be susceptible to shortcut learning, such as recognizing pictures of animals by looking for blurry backgrounds [
        <xref ref-type="bibr" rid="ref14">4</xref>
        ] or pictures of cows by looking at surrounding landscapes [
        <xref ref-type="bibr" rid="ref1">5</xref>
        ]. More insidiously, certain image classifiers can be fooled into classifying, say, school buses as ostriches by changing the picture in ways indiscernible to human viewers [
        <xref ref-type="bibr" rid="ref2">6</xref>
        ].
      </p>
      <p>
        In this paper, we propose systematic assessments centered around concepts—a concept-based approach—to evaluate understanding in AI systems. This approach involves (1) identifying a set of concepts a system should know and (2) designing sets of questions that probe for a grasp of these concepts using a variety of instantiations of each concept.
      </p>
      <p>
        One of the important pillars of the traditional train/test paradigm in machine learning—that the training and test sets be independent and identically distributed (IID)—is violated by our concept-based evaluation method. In order to probe understanding by creating varied concept instantiations, the examples used for evaluation may not be drawn from the same “distribution” as the training set. Furthermore, the examples in the evaluation set will likely not be independent in any sense, since they are created by varying specific concepts. In two case studies, we find that our evaluation method reveals important information about a system’s ability to understand concepts that might be hidden using a conventional IID test set.
      </p>
      <p>
        We created concept-based evaluations for two domains that have been used to develop and assess conceptual abstraction abilities in AI systems: RAVEN [
        <xref ref-type="bibr" rid="ref3">7</xref>
        ] (inspired by Raven’s Progressive Matrices (RPMs) [
        <xref ref-type="bibr" rid="ref4">8</xref>
        ]) and the Abstraction and Reasoning Corpus (ARC) [
        <xref ref-type="bibr" rid="ref5">9</xref>
        ]. Figure 1 shows a sample problem in the RAVEN domain. Each such problem consists of a three-by-three matrix (Figure 1, left) in which each of eight matrix components is a figure involving geometric shapes, with some relationship between the figures in the rows and columns. The ninth component is missing, and the task is to fill in the missing component with one of a set of eight candidate answers (Figure 1, right).
      </p>
      <p>
        ARC problems (termed “tasks” in [
        <xref ref-type="bibr" rid="ref5">9</xref>
        ]) present a number of “demonstration” pairs of grids that are related via a transformation rule, asking the solver to “do the same thing” (i.e., apply the same transformation) to a new “test” input grid. Figure 2 shows a sample task in the ARC domain. The solver’s challenge is to generate a new grid that transforms the test input grid analogously to the transformations in the demonstration grids. The concepts used in the ARC domain were inspired by Spelke’s proposals for core knowledge systems [
        <xref ref-type="bibr" rid="ref6">10</xref>
        ], such as spatio-temporal relations (inside, above, next-to), object attributes (shape, size, color, boundary), transformations (rotate, shift, extend), and more general relations (progression, sameness, part-whole). Notably, ARC tasks require the solver to generate an answer, rather than choose among given candidate answers as in RAVEN, providing the potential for more insight into the understanding of the solver [
        <xref ref-type="bibr" rid="ref5">9</xref>
        ].
      </p>
      <p>
        EBeM’22: AI Evaluation Beyond Metrics, July 24, 2022, Vienna, Austria. Contact: vo47@cornell.edu (V. V. Odouard); mm@santafe.edu (M. Mitchell). © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org).
      </p>
    </sec>
    <sec id="sec-1-1">
      <title>2. Prior Results on RAVEN</title>
      <p>
        The RAVEN domain was inspired by Raven’s Progressive Matrices (RPMs), a kind of IQ test that has been used to measure “fluid intelligence” in humans for many decades [
        <xref ref-type="bibr" rid="ref4">8</xref>
        ]. There have been numerous efforts to apply AI and machine learning methods to RPM-like problems (e.g., [
        <xref ref-type="bibr" rid="ref10 ref3 ref7 ref8 ref9">7, 11, 12, 13, 14, 15, 16, 17</xref>
        ], among many others). Recently many groups have applied deep neural networks (DNNs) to such problems, but given that DNNs need large numbers of training examples, these efforts require methods for procedural generation of these examples. The creators of the RAVEN dataset [
        <xref ref-type="bibr" rid="ref3">7</xref>
        ] developed one such method (another method was used to generate the PGM dataset [
        <xref ref-type="bibr" rid="ref7">11</xref>
        ]). To generate a RAVEN problem, the system sampled from a hierarchical stochastic image grammar [
        <xref ref-type="bibr" rid="ref3">7</xref>
        ], which offered different possible layouts for the matrix components (e.g., center, inside/outside, grid); within each layout it offered a choice of shapes (e.g., circle, square, triangle, pentagon) with different attributes to be chosen (e.g., color, size, angle), where each attribute is constrained to be one of a small number of values. The grammar also enforced one of a choice of relationships between matrix elements in a row (e.g., constant, progression, arithmetic); see [
        <xref ref-type="bibr" rid="ref3">7</xref>
        ] for details. The authors generated 70,000 problems total, splitting RAVEN into 42,000 training, 14,000 validation, and 14,000 test examples.
      </p>
      <p>
        In the paper detailing the RAVEN dataset, Zhang et al. [
        <xref ref-type="bibr" rid="ref3">7</xref>
        ] reported human performance on RAVEN’s test set at 84% accuracy on average. Several subsequent papers reported deep learning methods that surpassed human performance on this dataset (e.g., [
        <xref ref-type="bibr" rid="ref15">18, 19</xref>
        ]).
      </p>
      <p>
        The original RAVEN dataset, however, had a bias in its answer-generation method: answer choices were generated by taking the correct answer and modifying an attribute, allowing solvers to take the majority vote for each attribute to get the correct answer. In fact, networks trained solely on the answer choices could attain over 90% accuracy [
        <xref ref-type="bibr" rid="ref9">13</xref>
        ]. To remedy this shortcoming, other groups generated modified versions of the answer choices in RAVEN using methods that seem to be less exploitable. The new versions of RAVEN included RAVEN-FAIR [
        <xref ref-type="bibr" rid="ref8">12</xref>
        ] and I-RAVEN [
        <xref ref-type="bibr" rid="ref9">13</xref>
        ]. Several groups have since reported test-set accuracies on these new versions that significantly surpass the human performance benchmark of 84% (e.g., [
        <xref ref-type="bibr" rid="ref16 ref8">12, 17, 20</xref>
        ]).
      </p>
    </sec>
    <sec id="sec-1-2">
      <title>3. Concept-Based Evaluations for RAVEN</title>
      <p>When a program (e.g., a DNN) exhibits high accuracy on the RAVEN dataset, does the program understand the concepts expressed in the problems it solved, as a human would? And when a program for solving ARC problems correctly solves a task, to what extent is the program capturing the abstract reasoning abilities the dataset’s name implies?</p>
      <p>As we have argued above, the way to answer these questions is to evaluate these programs on systematic variations of the concepts that they purport to understand. Neither the RAVEN nor ARC datasets (nor any other abstraction datasets that we are aware of) provides this kind of evaluation. In this section we demonstrate how such an evaluation can be carried out on programs that score high on the RAVEN test set.</p>
      <p>
        We first selected two high-performing models from the RAVEN literature: the Multi-scale Relation Network (MRNet, [
        <xref ref-type="bibr" rid="ref8">12</xref>
        ]) and the Scattering Compositional Learner (SCL, [17]). For both of these systems, the authors made the code publicly available. We then trained both systems on 30,000 RAVEN training examples—ones that used five of the seven layouts available (Center, 2×2Grid, 3×3Grid, Out-InCenter, and Out-InGrid).¹ We then evaluated the trained systems on 10,000 RAVEN test examples that used these layouts.² The resulting accuracies on these test examples were 73% for MRNet and 89% for SCL.
      </p>
      <p>
        We then chose two concepts that are present in RAVEN problems: Sameness and Progression. Both MRNet and SCL were trained on problems involving some version of these concepts, and both were correct on some instances of these concepts in the RAVEN test set. In order to probe the degree to which these systems grasp these two concepts, we manually created new problems that systematically vary these concepts, by instantiating them using different attributes.
      </p>
      <p>
        In all Sameness problems, the relevant relationship in each row is that one or more attributes remain constant. In the RAVEN domain, the possible attributes include shape, size, color (i.e., gray scale), position, row, column, number, angle, and whether one object is inside or outside another object. Figure 3 shows four sample Sameness problems from our evaluation set.
      </p>
      <p>
        In all Progression problems, the relevant relationship in each row is an increase (or decrease) in the value of one or more attributes. Figure 4 shows four sample Progression problems from our evaluation set.
      </p>
      <p>
        These samples give a flavor of the problem variations we created around each concept. Our evaluation consisted of 210 Sameness and 80 Progression problems, designed to instantiate the concepts in ways that we believe would be relatively easy for humans to understand.³ The evaluation results are given in Table 1. For both MRNet and SCL, the accuracies on our concept variations are substantially lower than the programs’ RAVEN test-set accuracies would predict, indicating that their grasp of these general abstract concepts is lacking.
      </p>
      <p>
        ¹Because these two models scored each answer individually, without any comparison between answers, they were not affected by the answer-generation bias of the original RAVEN dataset described above; thus we used the original version to train and evaluate them. ²For the sake of time and simplicity, we omitted the Left-Right and Up-Down layouts, which split each matrix component into two. ³Our Sameness and Progression problems can be downloaded from https://melaniemitchell.me/EBeM2022/RavenVariations.zip.
      </p>
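      <p>
        As a concrete sketch of the answer-generation bias discussed above (note 1), the following toy Python example shows how a context-blind majority vote over the candidate answers can recover the correct choice without ever looking at the matrix. The attribute encoding here is hypothetical and purely illustrative, not the actual RAVEN representation:
      </p>

```python
from collections import Counter

def majority_vote_guess(candidates):
    """Pick the candidate whose attribute values best agree with the
    per-attribute majority over all candidates (matrix never consulted)."""
    attrs = candidates[0].keys()
    # Most common value of each attribute across the answer set.
    majority = {a: Counter(c[a] for c in candidates).most_common(1)[0][0]
                for a in attrs}
    def agreement(c):
        return sum(1 for a in attrs if c[a] == majority[a])
    return max(range(len(candidates)), key=lambda i: agreement(candidates[i]))

# Toy answer set in the original RAVEN style: each of the seven distractors
# is the correct answer with a single attribute perturbed, so the correct
# values dominate every attribute column.
correct = {"shape": "triangle", "size": 2, "color": 5}
candidates = [correct] + [
    {**correct, "shape": "square"},
    {**correct, "size": 3},
    {**correct, "color": 1},
    {**correct, "shape": "circle"},
    {**correct, "size": 1},
    {**correct, "color": 7},
    {**correct, "shape": "pentagon"},
]
assert majority_vote_guess(candidates) == 0  # the correct answer, matrix unseen
```

      <p>
        Because each distractor perturbs only one attribute, the correct value wins every per-attribute vote; RAVEN-FAIR and I-RAVEN regenerate the answer choices precisely to weaken this kind of shortcut signal.
      </p>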
    </sec>
    <sec id="sec-2">
      <title>4. Prior Results on ARC</title>
      <p>
        Deep learning systems such as MRNet and SCL typically lack transparency. Given their large numbers of parameters and their training on large IID datasets, they are susceptible to shortcut learning—that is, learning subtle statistical correlations between their inputs and the correct answers that don’t require actual concept understanding [
        <xref ref-type="bibr" rid="ref1">5</xref>
        ]. Such shortcuts are more likely when a system solving problems is allowed to choose from a set of candidate answers, rather than having to generate its own answer. Moreover, the procedural generation of examples—essential for creating sufficiently large training sets—can be susceptible to overt and subtle biases.
      </p>
      <p>
        Chollet’s ARC dataset [
        <xref ref-type="bibr" rid="ref5">9</xref>
        ] was created to avoid these pitfalls of deep learning approaches and to be a better method of assessing true abstraction abilities. Unlike RAVEN and related abstraction datasets, ARC focuses on few-shot learning. As shown in Figure 2, each ARC task can be considered a few-shot-learning task: given a small number of demonstrations, the solver needs to figure out the relevant concept and apply it to the test input grid. In particular, the solver must generate the answer rather than choose from given candidate answers. Moreover, rather than relying on procedurally generated problems, Chollet hand-designed 1,000 tasks, which were used for a competition on the Kaggle website [
        <xref ref-type="bibr" rid="ref17">21</xref>
        ]. Four hundred of the tasks were assigned to a “training set,” whose purpose is to give the solver a general idea of what kinds of concepts can be used. Four hundred additional tasks were assigned to an evaluation set for solvers to assess their abilities, and the 200 remaining tasks make up an unreleased (hidden) test set. The tasks were carefully designed to capture “core knowledge” [
        <xref ref-type="bibr" rid="ref6">10</xref>
        ] and to assess it in a few-shot, generative framework.
      </p>
      <p>The Kaggle ARC competition allowed each competing program to generate three answers for each task. If one of the answers is correct, the program gets credit for solving that task. Using this metric, the top scorer in the competition was correct on about 21% of the hidden test cases; the second-place scorer was correct on about 19%.</p>
    </sec>
    <sec id="sec-3">
      <title>5. Concept-Based Evaluations for ARC</title>
      <p>As a second illustration of our concept-based evaluation approach, we created new ARC tasks to evaluate the Kaggle competition’s second-place winner [22] (whose code was made publicly available). Here we will call this program ARC-Kaggle2. To probe this program’s understanding of concepts in the ARC domain, we selected a number of ARC training tasks that it answered correctly, and identified the concepts a human might have used to solve them.</p>
      <p>Here we focus on two concepts that appear in the original ARC evaluation set. The first concept involves spatial notions of “top” and “bottom” (or “above” and “below”). The second concept involves the notion of “boundary.” Figure 5(a) shows a task from the original ARC evaluation set that focuses on the “top/bottom” concept: the transformation rule is something like “Select the color of the topmost stripe.” ARC-Kaggle2 answered this task correctly. Figure 6(a) shows a task from the original ARC evaluation set that focuses on the “boundary” concept: the transformation rule is something like “Move all objects to the red boundary.” ARC-Kaggle2 also answered this task correctly.</p>
      <p>
        To probe ARC-Kaggle2’s grasp of these two concepts, we created a set of variations on “top/bottom” and 12 variations on “boundary.” To give a flavor of these variations, Figures 5(b) and (c) show two of our variants on the “top/bottom” concept, and Figures 6(b) and (c) show two of our variations on the “boundary” concept.⁴ Table 2 gives the accuracy (given three guesses per task) of ARC-Kaggle2 on our concept variations. It can be seen that while the program’s accuracy on the original ARC test set was 19%, it appears somewhat better on the “top/bottom” concept, at 29% correct, and significantly worse on the “boundary” concept, at 8% correct. Given the small number of variations we evaluated the system on, we give these results only as an illustration of our concept-evaluation method; a more thorough evaluation would require many more variations.
      </p>
      <p>
        ⁴Our ARC task variations can be downloaded from https://melaniemitchell.me/EBeM2022/ARCVariations.zip.
      </p>
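      <p>
        The three-guess scoring rule used in the Kaggle competition (and for the accuracies reported in Table 2) can be stated compactly. The following is our own minimal reimplementation sketch, with grids represented as nested lists of color codes; it is not the competition’s actual scoring code:
      </p>

```python
def arc_score(predictions_per_task, solutions):
    """Kaggle-style ARC metric: a task counts as solved if any of (up to)
    three predicted output grids exactly matches the true output grid."""
    solved = 0
    for preds, truth in zip(predictions_per_task, solutions):
        if any(p == truth for p in preds[:3]):  # only first three guesses count
            solved += 1
    return solved / len(solutions)

# Grids are nested lists of color codes; one guess list per task.
truth = [[0, 1], [1, 0]]
preds = [
    [[[1, 1], [1, 0]], [[0, 1], [1, 0]]],  # task 1: second guess matches
    [[[0, 0], [0, 0]]],                    # task 2: no guess matches
]
assert arc_score(preds, [truth, truth]) == 0.5
```

      <p>
        Note that exact grid equality is required; there is no partial credit for a grid that is almost correct, which is part of what makes generative evaluation more stringent than multiple choice.
      </p>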
    </sec>
    <sec id="sec-4">
      <title>6. Conclusions and Future Work</title>
      <p>We have argued for assessing AI abstraction programs using systematic concept-based evaluations rather than random training/test splits or IID test sets. We demonstrated our proposed concept-based evaluation method on existing programs designed to solve problems in the RAVEN and ARC datasets. Our results indicate that evaluation based on accuracy on an IID test set can be uninformative in predicting more generalized performance for a given concept. In particular, even for concepts present in problems on which a system did well, its performance on concept variations—meant to probe the system’s degree of conceptual understanding—can be poor.</p>
      <p>
        The results in this paper are meant as an illustration of the method rather than a thorough evaluation; a more complete evaluation would require assessing the systems on many additional concepts, each explored via numerous problem variations. In the future we plan to develop more thorough concept-based evaluation problem suites in not only the RAVEN and ARC domains but in other idealized abstraction and analogy domains for AI systems (e.g., Bongard problems [
        <xref ref-type="bibr" rid="ref19">23</xref>
        ] and letter-string analogies [
        <xref ref-type="bibr" rid="ref20">24</xref>
        ]). We also plan to perform human benchmarking studies on these evaluation suites so we can compare human performance with that of machines.
      </p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>This material is based upon work supported by the National Science Foundation under Grant No. 2139983. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation. This work was also supported by the Santa Fe Institute.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>R.</given-names>
            <surname>Geirhos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-H.</given-names>
            <surname>Jacobsen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Michaelis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zemel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Brendel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bethge</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. A.</given-names>
            <surname>Wichmann</surname>
          </string-name>
          ,
          <article-title>Shortcut learning in deep neural networks</article-title>
          ,
          <source>Nature Machine Intelligence</source>
          <volume>2</volume>
          (
          <year>2020</year>
          )
          <fpage>665</fpage>
          -
          <lpage>673</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>C.</given-names>
            <surname>Szegedy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zaremba</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Sutskever</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bruna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Erhan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Goodfellow</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Fergus</surname>
          </string-name>
          ,
          <article-title>Intriguing properties of neural networks</article-title>
          ,
          <source>arXiv</source>
          (
          <year>2013</year>
          )
          arXiv:1312.6199.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Jia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.-C.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <article-title>RAVEN: A dataset for relational and analogical visual reasoning</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>5317</fpage>
          -
          <lpage>5327</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J. C.</given-names>
            <surname>Raven</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. H.</given-names>
            <surname>Court</surname>
          </string-name>
          ,
          <article-title>Raven's progressive matrices</article-title>
          ,
          <source>Western Psychological Services</source>
          ,
          <year>1938</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>F.</given-names>
            <surname>Chollet</surname>
          </string-name>
          ,
          <article-title>On the measure of intelligence</article-title>
          , arXiv (
          <year>2019</year>
          )
          arXiv:1911.01547.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>E. S.</given-names>
            <surname>Spelke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. D.</given-names>
            <surname>Kinzler</surname>
          </string-name>
          ,
          <article-title>Core knowledge</article-title>
          ,
          <source>Developmental Science</source>
          <volume>10</volume>
          (
          <year>2007</year>
          )
          <fpage>89</fpage>
          -
          <lpage>96</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>D. G. T.</given-names>
            <surname>Barrett</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Hill</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Santoro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. S.</given-names>
            <surname>Morcos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lillicrap</surname>
          </string-name>
          ,
          <article-title>Measuring abstract reasoning in neural networks</article-title>
          ,
          <source>in: Proceedings of the International Conference on Machine Learning</source>
          , ICML,
          <year>2018</year>
          , pp.
          <fpage>4477</fpage>
          -
          <lpage>4486</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Benny</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Pekar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wolf</surname>
          </string-name>
          ,
          <article-title>Scale-localized abstract reasoning</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>12557</fpage>
          -
          <lpage>12565</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>S.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bai</surname>
          </string-name>
          ,
          <article-title>Stratified rule-aware network for abstract visual reasoning</article-title>
          ,
          <source>in: Proceedings of the AAAI Conference on Artificial Intelligence</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>1567</fpage>
          -
          <lpage>1574</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>A.</given-names>
            <surname>Lovett</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. D.</given-names>
            <surname>Forbus</surname>
          </string-name>
          ,
          <article-title>Modeling visual problem solving as analogical reasoning</article-title>
          ,
          <source>Psychological Review</source>
          <volume>124</volume>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>L. W.</given-names>
            <surname>Barsalou</surname>
          </string-name>
          ,
          <article-title>Challenges and opportunities for grounding cognition</article-title>
          ,
          <source>Journal of Cognition</source>
          <volume>3</volume>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>McCarthy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. L.</given-names>
            <surname>Minsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Rochester</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. E.</given-names>
            <surname>Shannon</surname>
          </string-name>
          ,
          <article-title>A proposal for the Dartmouth summer research project on artificial intelligence (First published August 31, 1955)</article-title>
          ,
          <source>AI Magazine</source>
          <volume>27</volume>
          (
          <year>2006</year>
          )
          <fpage>12</fpage>
          -
          <lpage>12</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15a">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>S.</given-names>
            <surname>Spratley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Ehinger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <article-title>A closer look at generalisation in RAVEN</article-title>
          ,
          <source>in: European Conference on Computer Vision</source>
          , Springer,
          <year>2020</year>
          , pp.
          <fpage>601</fpage>
          -
          <lpage>616</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16a">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>K.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <article-title>Automatic generation of Raven's progressive matrices</article-title>
          ,
          <source>in: Proceedings of the International Joint Conference on Artificial Intelligence, IJCAI</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Mitchell</surname>
          </string-name>
          ,
          <source>Artificial intelligence hits the barrier national Joint Conference on Artificial Intelligence, of meaning, Information</source>
          <volume>10</volume>
          (
          <year>2019</year>
          )
          <article-title>51</article-title>
          .
          <string-name>
            <surname>IJCAI</surname>
          </string-name>
          ,
          <year>2015</year>
          , pp.
          <fpage>903</fpage>
          -
          <lpage>909</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>W.</given-names>
            <surname>Landecker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. D.</given-names>
            <surname>Thomure</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. M. A.</given-names>
            <surname>Bettencourt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mitchell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. T.</given-names>
            <surname>Kenyon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. P.</given-names>
            <surname>Brumby</surname>
          </string-name>
          ,
          <article-title>Interpreting individual classifications of hierarchical networks</article-title>
          ,
          <source>in: 2013 IEEE Symposium on Computational Intelligence and Data Mining (CIDM)</source>
          , IEEE,
          <year>2013</year>
          , pp.
          <fpage>32</fpage>
          -
          <lpage>38</lpage>
          .
          [17]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Grosse</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ba</surname>
          </string-name>
          ,
          <article-title>The scattering compositional learner: Discovering objects, attributes, relationships in analogical reasoning</article-title>
          , arXiv:
          <year>2007</year>
          .
          <volume>04212</volume>
          (
          <year>2020</year>
          ).
          [18]
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Jia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.-C.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <article-title>Learning perceptual inference by contrasting</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>32</volume>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>T.</given-names>
            <surname>Zhuo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kankanhalli</surname>
          </string-name>
          ,
          <article-title>Solving Raven's progressive matrices with neural networks</article-title>
          , arXiv:
          <year>2002</year>
          .
          <volume>01646</volume>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>M.</given-names>
            <surname>Małkiński</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Mańdziuk</surname>
          </string-name>
          ,
          <article-title>Multi-label contrastive learning for abstract visual reasoning</article-title>
          , arXiv preprint arXiv:
          <year>2012</year>
          .
          <volume>01944</volume>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>F.</given-names>
            <surname>Chollet</surname>
          </string-name>
          ,
          <source>Abstraction and reasoning challenge</source>
          ,
          <year>2020</year>
          . URL: https://www.kaggle.com/c/abstraction-and-reasoning-challenge.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>A.</given-names>
            <surname>de Miquel Bleier</surname>
          </string-name>
          ,
          <article-title>Finishing 2nd in Kaggle's abstraction and reasoning challenge</article-title>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>M. M.</given-names>
            <surname>Bongard</surname>
          </string-name>
          ,
          <source>Pattern Recognition</source>
          , Spartan Books,
          <year>1970</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>D. R.</given-names>
            <surname>Hofstadter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mitchell</surname>
          </string-name>
          ,
          <article-title>The Copycat project: A model of mental fluidity and analogy-making</article-title>
          , in:
          <string-name>
            <given-names>K. J.</given-names>
            <surname>Holyoak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Barnden</surname>
          </string-name>
          (Eds.),
          <source>Advances in Connectionist and Neural Computation Theory</source>
          , volume
          <volume>2</volume>
          , Ablex Publishing Corporation,
          <year>1994</year>
          , pp.
          <fpage>31</fpage>
          -
          <lpage>112</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>