<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Evaluating Understanding on Conceptual Abstraction Benchmarks</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Victor Vikram Odouard</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Melanie Mitchell</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Santa Fe Institute</institution>
          ,
          <addr-line>1399 Hyde Park Road, Santa Fe, NM 87501</addr-line>
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
<p>A long-held objective in AI is to build systems that understand concepts in a humanlike way. Setting aside the difficulty of building such a system, even evaluating one is a challenge, due to present-day AI's relative opacity and its proclivity for finding shortcut solutions. This difficulty is exacerbated by humans' tendency to anthropomorphize: we assume that a system that can recognize one instance of a concept must also understand other instances, as a human would. In this paper, we argue that understanding a concept requires the ability to use it in varied contexts. Accordingly, we propose systematic evaluations centered around concepts, probing a system's ability to use a given concept in many different instantiations. We present case studies of such evaluations on two domains—RAVEN (inspired by Raven's Progressive Matrices) and the Abstraction and Reasoning Corpus (ARC)—that have been used to develop and assess abstraction abilities in AI systems. Our concept-based approach to evaluation reveals information about AI systems that conventional test sets would have left hidden.</p>
      </abstract>
      <kwd-group>
        <kwd>abstraction</kwd>
        <kwd>analogy</kwd>
        <kwd>concepts</kwd>
        <kwd>machine learning</kwd>
        <kwd>evaluation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        What unites chain-link fences, high prices, entrance exams, and import tariffs? They are all different kinds of barriers. Your understanding of physical barriers may have helped you quickly intuit how chess pieces move (and the fundamental difference between the knight and the other pieces) from very few examples. It may have helped you relate to a friend struggling with credit card debt, even when your obstacles are very different. It may have helped you describe how being jet-lagged sometimes feels like “hitting a wall.” These examples illustrate the importance of abstract concepts in few-shot learning, generalization, emotional intelligence, and communication. Such examples display the intuition behind Barsalou’s definition of a concept: “a competence or disposition for generating infinite conceptualizations of a category” [
        <xref ref-type="bibr" rid="ref11">1</xref>
        ]. In short, understanding the world entails being able to recognize and generate concepts in both concrete and abstract forms.
      </p>
      <p>
        Early pioneers suggested that their AI summer project might lead to blueprints for machines that could “form abstractions and concepts” [
        <xref ref-type="bibr" rid="ref12">2</xref>
        ]. More than six decades later, AI systems are still extremely limited in this regard: they have yet to surmount the “barrier” of understanding [
        <xref ref-type="bibr" rid="ref13">3</xref>
        ].
      </p>
      <p>
        Evaluating a system’s understanding of concepts and abstractions is challenging. AI systems are known to be susceptible to shortcut learning, such as recognizing pictures of animals by looking for blurry backgrounds [
        <xref ref-type="bibr" rid="ref14">4</xref>
        ] or pictures of cows by looking at surrounding landscapes [
        <xref ref-type="bibr" rid="ref1">5</xref>
        ]. More insidiously, certain image classifiers can be fooled into classifying, say, school buses as ostriches by changing the picture in ways indiscernible to human viewers [
        <xref ref-type="bibr" rid="ref2">6</xref>
        ].
      </p>
      <p>
        In this paper, we propose systematic assessments centered around concepts—a concept-based approach—to evaluate understanding in AI systems. This approach involves (1) identifying a set of concepts a system should know and (2) designing sets of questions that probe for a grasp of these concepts using a variety of instantiations of each concept.
      </p>
      <p>
        One of the important pillars of the traditional train/test paradigm in machine learning—that the training and test sets be independent and identically distributed (IID)—is violated by our concept-based evaluation method. In order to probe understanding by creating varied concept instantiations, the examples used for evaluation may not be drawn from the same “distribution” as the training set. Furthermore, the examples in the evaluation set will likely not be independent in any sense, since they are created by varying specific concepts. In two case studies, we find that our evaluation method reveals important information about a system’s ability to understand concepts that might be hidden using a conventional IID test set.
      </p>
      <p>
        We created concept-based evaluations for two domains that have been used to develop and assess conceptual abstraction abilities in AI systems: RAVEN [
        <xref ref-type="bibr" rid="ref3">7</xref>
        ] (inspired by Raven’s Progressive Matrices (RPMs) [
        <xref ref-type="bibr" rid="ref4">8</xref>
        ]) and the Abstraction and Reasoning Corpus (ARC) [
        <xref ref-type="bibr" rid="ref5">9</xref>
        ]. Figure 1 shows a sample problem in the RAVEN domain. Each such problem consists of a three-by-three matrix (Figure 1, left) in which each of eight matrix components is a figure involving geometric shapes, with some relationship between the figures in the rows and columns. The ninth component is missing, and the task is to fill in the missing component with one of a set of eight candidate answers (Figure 1, right).
      </p>
      <p>
        ARC problems (termed “tasks” in [
        <xref ref-type="bibr" rid="ref5">9</xref>
        ]) present a number of “demonstration” pairs of grids that are related via a transformation rule, asking the solver to “do the same thing” (i.e., apply the same transformation) to a new “test” input grid. Figure 2 shows a sample task in the ARC domain. The solver’s challenge is to generate a new grid that transforms the test input grid analogously to the transformations in the demonstration grids. The concepts used in the ARC domain were inspired by Spelke’s proposals for core knowledge systems [
        <xref ref-type="bibr" rid="ref6">10</xref>
        ], such as spatio-temporal relations (inside, above, next-to), object attributes (shape, size, color, boundary), transformations (rotate, shift, extend), and more general relations (progression, sameness, part-whole). Notably, ARC tasks require the solver to generate an answer, rather than choose among given candidate answers as in RAVEN, providing the potential for more insight into the understanding of the solver [
        <xref ref-type="bibr" rid="ref5">9</xref>
        ].
      </p>
      <p>
        EBeM’22: AI Evaluation Beyond Metrics, July 24, 2022, Vienna, Austria. Contact: vo47@cornell.edu (V. V. Odouard); mm@santafe.edu (M. Mitchell). © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org).
      </p>
    </sec>
    <sec id="sec-1-1">
      <title>2. Prior Results on RAVEN</title>
      <p>
        The RAVEN domain was inspired by Raven’s Progressive Matrices (RPMs), a kind of IQ test that has been used to measure “fluid intelligence” in humans for many decades [
        <xref ref-type="bibr" rid="ref4">8</xref>
        ]. There have been numerous efforts to apply AI and machine learning methods to RPM-like problems (e.g., [
        <xref ref-type="bibr" rid="ref10 ref3 ref7 ref8 ref9">7, 11, 12, 13, 14, 15, 16, 17</xref>
        ], among many others). Recently many groups have applied deep neural networks (DNNs) to such problems, but given that DNNs need large numbers of training examples, these efforts require methods for procedural generation of these examples. The creators of the RAVEN dataset [
        <xref ref-type="bibr" rid="ref3">7</xref>
        ] developed one such method (another method was used to generate the PGM dataset [
        <xref ref-type="bibr" rid="ref7">11</xref>
        ]). To generate a RAVEN problem, the system sampled from a hierarchical stochastic image grammar [
        <xref ref-type="bibr" rid="ref3">7</xref>
        ], which offered different possible layouts for the matrix components (e.g., center, inside/outside, grid); within each layout it offered a choice of shapes (e.g., circle, square, triangle, pentagon) with different attributes to be chosen (e.g., color, size, angle), where each attribute is constrained to be one of a small number of values. The grammar also enforced one of a choice of relationships between matrix elements in a row (e.g., constant, progression, arithmetic); see [
        <xref ref-type="bibr" rid="ref3">7</xref>
        ] for details. The authors generated 70,000 problems total, splitting RAVEN into 42,000 training, 14,000 validation, and 14,000 test examples.
      </p>
      <p>
        In the paper detailing the RAVEN dataset, Zhang et al. [
        <xref ref-type="bibr" rid="ref3">7</xref>
        ] reported human performance on RAVEN’s test set at 84% accuracy on average. Several subsequent papers reported deep learning methods that surpassed human performance on this dataset (e.g., [
        <xref ref-type="bibr" rid="ref15">18, 19</xref>
        ]).
      </p>
      <p>
        The original RAVEN dataset, however, had a bias in its answer-generation method: answer choices were generated by taking the correct answer and modifying an attribute, allowing solvers to take the majority vote for each attribute to get the correct answer. In fact, networks trained solely on the answer choices could attain over 90% accuracy [
        <xref ref-type="bibr" rid="ref9">13</xref>
        ]. To remedy this shortcoming, other groups generated modified versions of the answer choices in RAVEN using methods that seem to be less exploitable. The new versions of RAVEN included RAVEN-FAIR [
        <xref ref-type="bibr" rid="ref8">12</xref>
        ] and I-RAVEN [
        <xref ref-type="bibr" rid="ref9">13</xref>
        ]. Several groups have since reported test-set accuracies on these new versions that significantly surpass the human performance benchmark of 84% (e.g., [
        <xref ref-type="bibr" rid="ref16 ref8">12, 17, 20</xref>
        ]).
      </p>
    </sec>
    <sec id="sec-1-2">
      <title>3. Concept-Based Evaluations for RAVEN</title>
      <p>When a program (e.g., a DNN) exhibits high accuracy on the RAVEN dataset, does the program understand the concepts expressed in the problems it solved, as a human would? And when a program for solving ARC problems correctly solves a task, to what extent is the program capturing the abstract reasoning abilities the dataset’s name implies?</p>
      <p>As we have argued above, the way to answer these questions is to evaluate these programs on systematic variations of the concepts that they purport to understand. Neither the RAVEN nor ARC datasets (nor any other abstraction datasets that we are aware of) provides this kind of evaluation. In this section we demonstrate how such an evaluation can be carried out on programs that score high on the RAVEN test set.</p>
      <p>
        We first selected two high-performing models from the RAVEN literature: the Multi-scale Relation Network (MRNet, [
        <xref ref-type="bibr" rid="ref8">12</xref>
        ]) and the Scattering Compositional Learner (SCL, [17]). For both of these systems, the authors made the code publicly available. We then trained both systems on 30,000 RAVEN training examples—ones that used five of the seven layouts available (Center, 2×2Grid, 3×3Grid, Out-InCenter, and Out-InGrid).¹ We then evaluated the trained systems on 10,000 RAVEN test examples that used these layouts.² The resulting accuracies on these test examples were 73% for MRNet and 89% for SCL.
      </p>
      <p>
        We then chose two concepts that are present in RAVEN problems: Sameness and Progression. Both MRNet and SCL were trained on problems involving some version of these concepts, and both were correct on some instances of these concepts in the RAVEN test set. In order to probe the degree to which these systems grasp these two concepts, we manually created new problems that systematically vary these concepts, by instantiating them using different attributes.
      </p>
      <p>
        In all Sameness problems, the relevant relationship in each row is that one or more attributes remain constant. In the RAVEN domain, the possible attributes include shape, size, color (i.e., gray scale), position, row, column, number, angle, and whether one object is inside or outside another object. Figure 3 shows four sample Sameness problems from our evaluation set.
      </p>
      <p>
        In all Progression problems, the relevant relationship in each row is an increase (or decrease) in the value of one or more attributes. Figure 4 shows four sample Progression problems from our evaluation set.
      </p>
      <p>
        These samples give a flavor of the problem variations we created around each concept. Our evaluation consisted of 210 Sameness and 80 Progression problems, designed to instantiate the concepts in ways that we believe would be relatively easy for humans to understand.³ The evaluation results are given in Table 1. For both MRNet and SCL, the accuracies on our concept variations are substantially lower than the programs’ RAVEN test-set accuracies would predict, indicating that their grasp of these general abstract concepts is lacking.
      </p>
      <p>
        ¹Because these two models scored each answer individually, without any comparison between answers, they were not affected by the answer-generation bias of the original RAVEN dataset described above; thus we used the original version to train and evaluate them. ²For the sake of time and simplicity, we omitted the Left-Right and Up-Down layouts, which split each matrix component into two. ³Our Sameness and Progression problems can be downloaded from https://melaniemitchell.me/EBeM2022/RavenVariations.zip.
      </p>
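      <p>
        As a concrete sketch of the answer-generation bias discussed above (note 1), the following toy Python example shows how a context-blind majority vote over the candidate answers can recover the correct choice without ever looking at the matrix. The attribute encoding here is hypothetical and purely illustrative, not the actual RAVEN representation:
      </p>

```python
from collections import Counter

def majority_vote_guess(candidates):
    """Pick the candidate whose attribute values best agree with the
    per-attribute majority over all candidates (matrix never consulted)."""
    attrs = candidates[0].keys()
    # Most common value of each attribute across the answer set.
    majority = {a: Counter(c[a] for c in candidates).most_common(1)[0][0]
                for a in attrs}
    def agreement(c):
        return sum(1 for a in attrs if c[a] == majority[a])
    return max(range(len(candidates)), key=lambda i: agreement(candidates[i]))

# Toy answer set in the original RAVEN style: each of the seven distractors
# is the correct answer with a single attribute perturbed, so the correct
# values dominate every attribute column.
correct = {"shape": "triangle", "size": 2, "color": 5}
candidates = [correct] + [
    {**correct, "shape": "square"},
    {**correct, "size": 3},
    {**correct, "color": 1},
    {**correct, "shape": "circle"},
    {**correct, "size": 1},
    {**correct, "color": 7},
    {**correct, "shape": "pentagon"},
]
assert majority_vote_guess(candidates) == 0  # the correct answer, matrix unseen
```

      <p>
        Because each distractor perturbs only one attribute, the correct value wins every per-attribute vote; RAVEN-FAIR and I-RAVEN regenerate the answer choices precisely to weaken this kind of shortcut signal.
      </p>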
    </sec>
    <sec id="sec-2">
      <title>4. Prior Results on ARC</title>
      <p>
        Deep learning systems such as MRNet and SCL typically lack transparency. Given their large numbers of parameters and their training on large IID datasets, they are susceptible to shortcut learning—that is, learning subtle statistical correlations between their inputs and the correct answers that don’t require actual concept understanding [
        <xref ref-type="bibr" rid="ref1">5</xref>
        ]. Such shortcuts are more likely when a system solving problems is allowed to choose from a set of candidate answers, rather than having to generate its own answer. Moreover, the procedural generation of examples—essential for creating sufficiently large training sets—can be susceptible to overt and subtle biases.
      </p>
      <p>
        Chollet’s ARC dataset [
        <xref ref-type="bibr" rid="ref5">9</xref>
        ] was created to avoid these pitfalls of deep learning approaches and to be a better method of assessing true abstraction abilities. Unlike RAVEN and related abstraction datasets, ARC focuses on few-shot learning. As shown in Figure 2, each ARC task can be considered a few-shot-learning task: given a small number of demonstrations, the solver needs to figure out the relevant concept and apply it to the test input grid. In particular, the solver must generate the answer rather than choose from given candidate answers. Moreover, rather than relying on procedurally generated problems, Chollet hand-designed 1,000 tasks, which were used for a competition on the Kaggle website [
        <xref ref-type="bibr" rid="ref17">21</xref>
        ]. Four hundred of the tasks were assigned to a “training set,” whose purpose is to give the solver a general idea of what kinds of concepts can be used. Four hundred additional tasks were assigned to an evaluation set for solvers to assess their abilities, and the 200 remaining tasks make up an unreleased (hidden) test set. The tasks were carefully designed to capture “core knowledge” [
        <xref ref-type="bibr" rid="ref6">10</xref>
        ] and to assess it in a few-shot, generative framework.
      </p>
      <p>The Kaggle ARC competition allowed each competing program to generate three answers for each task. If one of the answers is correct, the program gets credit for solving that task. Using this metric, the top scorer in the competition was correct on about 21% of the hidden test cases; the second-place scorer was correct on about 19%.</p>
    </sec>
    <sec id="sec-3">
      <title>5. Concept-Based Evaluations for ARC</title>
      <p>As a second illustration of our concept-based evaluation approach, we created new ARC tasks to evaluate the Kaggle competition’s second-place winner [22] (whose code was made publicly available). Here we will call this program ARC-Kaggle2. To probe this program’s understanding of concepts in the ARC domain, we selected a number of ARC training tasks that it answered correctly, and identified the concepts a human might have used to solve them.</p>
      <p>Here we focus on two concepts that appear in the original ARC evaluation set. The first concept involves spatial notions of “top” and “bottom” (or “above” and “below”). The second concept involves the notion of “boundary.” Figure 5(a) shows a task from the original ARC evaluation set that focuses on the “top/bottom” concept: the transformation rule is something like “Select the color of the topmost stripe.” ARC-Kaggle2 answered this task correctly. Figure 6(a) shows a task from the original ARC evaluation set that focuses on the “boundary” concept: the transformation rule is something like “Move all objects to the red boundary.” ARC-Kaggle2 also answered this task correctly.</p>
      <p>
        To probe ARC-Kaggle2’s grasp of these two concepts, we created a set of variations on “top/bottom” and 12 variations on “boundary.” To give a flavor of these variations, Figures 5(b) and (c) show two of our variants on the “top/bottom” concept, and Figures 6(b) and (c) show two of our variations on the “boundary” concept.⁴ Table 2 gives the accuracy (given three guesses per task) of ARC-Kaggle2 on our concept variations. It can be seen that while the program’s accuracy on the original ARC test set was 19%, it appears somewhat better on the “top/bottom” concept, at 29% correct, and significantly worse on the “boundary” concept, at 8% correct. Given the small number of variations we evaluated the system on, we give these results only as an illustration of our concept-evaluation method; a more thorough evaluation would require many more variations.
      </p>
      <p>
        ⁴Our ARC task variations can be downloaded from https://melaniemitchell.me/EBeM2022/ARCVariations.zip.
      </p>
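      <p>
        The three-guess scoring rule used in the Kaggle competition (and for the accuracies reported in Table 2) can be stated compactly. The following is our own minimal reimplementation sketch, with grids represented as nested lists of color codes; it is not the competition’s actual scoring code:
      </p>

```python
def arc_score(predictions_per_task, solutions):
    """Kaggle-style ARC metric: a task counts as solved if any of (up to)
    three predicted output grids exactly matches the true output grid."""
    solved = 0
    for preds, truth in zip(predictions_per_task, solutions):
        if any(p == truth for p in preds[:3]):  # only first three guesses count
            solved += 1
    return solved / len(solutions)

# Grids are nested lists of color codes; one guess list per task.
truth = [[0, 1], [1, 0]]
preds = [
    [[[1, 1], [1, 0]], [[0, 1], [1, 0]]],  # task 1: second guess matches
    [[[0, 0], [0, 0]]],                    # task 2: no guess matches
]
assert arc_score(preds, [truth, truth]) == 0.5
```

      <p>
        Note that exact grid equality is required; there is no partial credit for a grid that is almost correct, which is part of what makes generative evaluation more stringent than multiple choice.
      </p>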
    </sec>
    <sec id="sec-4">
      <title>6. Conclusions and Future Work</title>
      <p>We have argued for assessing AI abstraction programs using systematic concept-based evaluations rather than random training/test splits or IID test sets. We demonstrated our proposed concept-based evaluation method on existing programs designed to solve problems in the RAVEN and ARC datasets. Our results indicate that evaluation based on accuracy on an IID test set can be uninformative in predicting more generalized performance for a given concept. In particular, even for concepts present in problems on which a system did well, its performance on concept variations—meant to probe the system’s degree of conceptual understanding—can be poor.</p>
      <p>
        The results in this paper are meant as an illustration of the method rather than a thorough evaluation; a more complete evaluation would require assessing the systems on many additional concepts, each explored via numerous problem variations. In the future we plan to develop more thorough concept-based evaluation problem suites in not only the RAVEN and ARC domains but in other idealized abstraction and analogy domains for AI systems (e.g., Bongard problems [
        <xref ref-type="bibr" rid="ref19">23</xref>
        ] and letter-string analogies [
        <xref ref-type="bibr" rid="ref20">24</xref>
        ]). We also plan to perform human benchmarking studies on these evaluation suites so we can compare human performance with that of machines.
      </p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>This material is based upon work supported by the National Science Foundation under Grant No. 2139983. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation. This work was also supported by the Santa Fe Institute.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>R.</given-names>
            <surname>Geirhos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-H.</given-names>
            <surname>Jacobsen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Michaelis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zemel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Brendel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bethge</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. A.</given-names>
            <surname>Wichmann</surname>
          </string-name>
          ,
          <article-title>Shortcut learning in deep neural networks</article-title>
          ,
          <source>Nature Machine Intelligence</source>
          <volume>2</volume>
          (
          <year>2020</year>
          )
          <fpage>665</fpage>
          -
          <lpage>673</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>C.</given-names>
            <surname>Szegedy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zaremba</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Sutskever</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bruna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Erhan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Goodfellow</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Fergus</surname>
          </string-name>
          ,
          <article-title>Intriguing properties of neural networks</article-title>
          ,
          <source>arXiv</source>
          (
          <year>2013</year>
          )
          arXiv:1312.6199.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Jia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.-C.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <article-title>RAVEN: A dataset for relational and analogical visual reasoning</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>5317</fpage>
          -
          <lpage>5327</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J. C.</given-names>
            <surname>Raven</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. H.</given-names>
            <surname>Court</surname>
          </string-name>
          ,
          <article-title>Raven's progressive matrices</article-title>
          ,
          <source>Western Psychological Services</source>
          ,
          <year>1938</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>F.</given-names>
            <surname>Chollet</surname>
          </string-name>
          ,
          <article-title>On the measure of intelligence</article-title>
          , arXiv (
          <year>2019</year>
          )
          arXiv:1911.01547.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>E. S.</given-names>
            <surname>Spelke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. D.</given-names>
            <surname>Kinzler</surname>
          </string-name>
          ,
          <article-title>Core knowledge</article-title>
          ,
          <source>Developmental Science</source>
          <volume>10</volume>
          (
          <year>2007</year>
          )
          <fpage>89</fpage>
          -
          <lpage>96</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>D. G. T.</given-names>
            <surname>Barrett</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Hill</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Santoro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. S.</given-names>
            <surname>Morcos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lillicrap</surname>
          </string-name>
          ,
          <article-title>Measuring abstract reasoning in neural networks</article-title>
          ,
          <source>in: Proceedings of the International Conference on Machine Learning</source>
          , ICML,
          <year>2018</year>
          , pp.
          <fpage>4477</fpage>
          -
          <lpage>4486</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Benny</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Pekar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wolf</surname>
          </string-name>
          ,
          <article-title>Scale-localized abstract reasoning</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>12557</fpage>
          -
          <lpage>12565</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>S.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bai</surname>
          </string-name>
          ,
          <article-title>Stratified rule-aware network for abstract visual reasoning</article-title>
          ,
          <source>in: Proceedings of the AAAI Conference on Artificial Intelligence</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>1567</fpage>
          -
          <lpage>1574</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>A.</given-names>
            <surname>Lovett</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. D.</given-names>
            <surname>Forbus</surname>
          </string-name>
          ,
          <article-title>Modeling visual problem solving as analogical reasoning</article-title>
          ,
          <source>Psychological Review</source>
          <volume>124</volume>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>L. W.</given-names>
            <surname>Barsalou</surname>
          </string-name>
          ,
          <article-title>Challenges and opportunities for grounding cognition</article-title>
          ,
          <source>Journal of Cognition</source>
          <volume>3</volume>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>McCarthy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. L.</given-names>
            <surname>Minsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Rochester</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. E.</given-names>
            <surname>Shannon</surname>
          </string-name>
          ,
          <article-title>A proposal for the Dartmouth summer research project on artificial intelligence (First published August 31, 1955)</article-title>
          ,
          <source>AI Magazine</source>
          <volume>27</volume>
          (
          <year>2006</year>
          )
          <fpage>12</fpage>
          -
          <lpage>12</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15a">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>S.</given-names>
            <surname>Spratley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Ehinger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <article-title>A closer look at generalisation in RAVEN</article-title>
          ,
          <source>in: European Conference on Computer Vision</source>
          , Springer,
          <year>2020</year>
          , pp.
          <fpage>601</fpage>
          -
          <lpage>616</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16a">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>K.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <article-title>Automatic generation of Raven's progressive matrices</article-title>
          ,
          <source>in: Proceedings of the International Joint Conference on Artificial Intelligence, IJCAI</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Mitchell</surname>
          </string-name>
          ,
          <source>Artificial intelligence hits the barrier national Joint Conference on Artificial Intelligence, of meaning, Information</source>
          <volume>10</volume>
          (
          <year>2019</year>
          )
          <article-title>51</article-title>
          .
          <string-name>
            <surname>IJCAI</surname>
          </string-name>
          ,
          <year>2015</year>
          , pp.
          <fpage>903</fpage>
          -
          <lpage>909</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>W.</given-names>
            <surname>Landecker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. D.</given-names>
            <surname>Thomure</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. M. A.</given-names>
            <surname>Bettencourt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mitchell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. T.</given-names>
            <surname>Kenyon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. P.</given-names>
            <surname>Brumby</surname>
          </string-name>
          ,
          <article-title>Interpreting individual classifications of hierarchical networks</article-title>
          ,
          <source>in: 2013 IEEE Symposium on Computational Intelligence and Data Mining (CIDM)</source>
          , IEEE,
          <year>2013</year>
          , pp.
          <fpage>32</fpage>
          -
          <lpage>38</lpage>
          .
          [17]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Grosse</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ba</surname>
          </string-name>
          ,
          <article-title>The scattering compositional learner: Discovering objects, attributes, relationships in analogical reasoning</article-title>
          , arXiv:
          <year>2007</year>
          .
          <volume>04212</volume>
          (
          <year>2020</year>
          ).
          [18]
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Jia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.-C.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <article-title>Learning perceptual inference by contrasting</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>32</volume>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>T.</given-names>
            <surname>Zhuo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kankanhalli</surname>
          </string-name>
          ,
          <article-title>Solving Raven's progressive matrices with neural networks</article-title>
          , arXiv:
          <year>2002</year>
          .
          <volume>01646</volume>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>M.</given-names>
            <surname>Małkiński</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Mańdziuk</surname>
          </string-name>
          ,
          <article-title>Multi-label contrastive learning for abstract visual reasoning</article-title>
          , arXiv preprint arXiv:
          <year>2012</year>
          .
          <volume>01944</volume>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>F.</given-names>
            <surname>Chollet</surname>
          </string-name>
          ,
          <source>Abstraction and reasoning challenge</source>
          ,
          <year>2020</year>
          . URL: https://www.kaggle.com/c/abstraction-and-reasoning-challenge.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>A.</given-names>
            <surname>de Miquel Bleier</surname>
          </string-name>
          ,
          <article-title>Finishing 2nd in Kaggle's abstraction and reasoning challenge</article-title>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>M. M.</given-names>
            <surname>Bongard</surname>
          </string-name>
          ,
          <source>Pattern Recognition</source>
          , Spartan Books,
          <year>1970</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>D. R.</given-names>
            <surname>Hofstadter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mitchell</surname>
          </string-name>
          ,
          <article-title>The Copycat project: A model of mental fluidity and analogy-making</article-title>
          , in:
          <string-name>
            <given-names>K. J.</given-names>
            <surname>Holyoak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Barnden</surname>
          </string-name>
          (Eds.),
          <source>Advances in Connectionist and Neural Computation Theory</source>
          , volume
          <volume>2</volume>
          , Ablex Publishing Corporation,
          <year>1994</year>
          , pp.
          <fpage>31</fpage>
          -
          <lpage>112</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>