<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Solving Abstract Reasoning Tasks with Grammatical Evolution</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Raphael Fischer</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Matthias Jakobs</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sascha Mücke</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Katharina Morik</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>TU Dortmund, AI Group</institution>
          ,
          <addr-line>Dortmund</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
<p>The Abstraction and Reasoning Corpus (ARC), comprising image-based logical reasoning tasks, is intended to serve as a benchmark for measuring intelligence. Solving these tasks is very difficult for off-the-shelf ML methods due to their diversity and the low amount of training data. We here present our approach, which solves tasks via grammatical evolution on a domain-specific language for image transformations. With this approach, we successfully participated in an online challenge, scoring among the top 4% out of 900 participants.</p>
      </abstract>
      <kwd-group>
        <kwd>Machine Learning</kwd>
        <kwd>Reasoning</kwd>
        <kwd>Grammatical Evolution</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
<p>The ARC challenge requires participants to develop a model that is able to solve
tasks D_i from the set D = {D_1, ..., D_M}. Each D_i comprises an image-based
reasoning task with m_i training pairs {(x_k, y_k)} and n_i test pairs {(x̂_ℓ, ŷ_ℓ)},
where m_i &gt; 0 is typically around 3 and n_i &gt; 0 is mostly 1. The images feature
up to 10 discrete colors C = {c_1, ..., c_10}, and image sizes range between 1
and 30 pixels per dimension. The implied function f that maps inputs to the
expected output images is vastly different for every task, often based on abstract
features (e.g. connected shapes) or pattern continuation (see Figure 1). For each
task D_i, the method should derive f̃_i from {(x_k, y_k)} ∈ D_i which correctly maps
all input images to the expected output: ∀(x̂_ℓ, ŷ_ℓ) ∈ D_i : f̃_i(x̂_ℓ) = ŷ_ℓ.</p>
      <p>[Figure 1: example ARC tasks — (a) Training task 48, (b) Training task 100, (c) Training task 397]</p>
      <p>Copyright © 2020 by the paper’s authors. Use permitted under Creative Commons
License Attribution 4.0 International (CC BY 4.0).
1 https://www.kaggle.com/c/abstraction-and-reasoning-challenge/
2 ls8-arc team on the official ARC challenge leaderboard.</p>
<p>The challenge participants have access to M = 400 training tasks, which
show some logic concepts and come with labels ŷ_ℓ. The final scoring however is
based on an undisclosed test set, whose tasks are seen only by the model.</p>
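ARC tasks are distributed as JSON files with "train" and "test" lists of input/output grids; the following minimal sketch reads one such task into the (x_k, y_k) / (x̂_ℓ, ŷ_ℓ) pairs used above. The inline example task is a hypothetical toy, not an actual ARC task.

```python
import json

# Hypothetical toy task in the ARC JSON format: "train" and "test" lists of
# {"input", "output"} grids, where each grid is a list of rows of colors 0..9.
task_json = """
{
  "train": [
    {"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]},
    {"input": [[2, 2], [0, 2]], "output": [[0, 0], [2, 0]]}
  ],
  "test": [
    {"input": [[3, 0], [3, 3]], "output": [[0, 3], [0, 0]]}
  ]
}
"""

task = json.loads(task_json)
train_pairs = [(ex["input"], ex["output"]) for ex in task["train"]]  # m_i pairs
test_pairs = [(ex["input"], ex["output"]) for ex in task["test"]]    # n_i pairs

print(len(train_pairs))  # m_i = 2
print(len(test_pairs))   # n_i = 1
```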
    </sec>
    <sec id="sec-2">
      <title>Reasoning Approach</title>
      <p>For our approach, we assume that f can be broken down into a sequence of
basic image transformations. We developed a custom domain-specific language
(DSL) specifying such sequences, whose space of expressions is explored using
an evolutionary algorithm (EA). We first explain our language in more detail
and then show how we use an EA to create ARC task solvers from our DSL.
Domain-Specific Language In a first step, we manually implemented solvers
for approximately 20 random training tasks and identified recurring image
operations, which became the basis of our DSL.</p>
<p>
        Let X = ⋃_{u,v ≥ 1} C^{u×v} denote the set of rectangular images with colors in C,
and X* ordered lists of images, which we call layers. Our DSL is a context-free
grammar in Backus-Naur Form (BNF) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]; the non-terminal symbols represent
function sets whose members (i) modify images (T = X^X), (ii) decompose an
image into layers (S = (X*)^X), (iii) combine layers into a single image (J = X^(X*)),
or (iv) modify a layer object (L = (X*)^(X*)). The terminal productions of these
symbols are either concrete functions of the corresponding type or more complex
function compositions. Figure 2 depicts a visualization of our DSL function types.
      </p>
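The four function types above can be written down as plain type aliases; this is a minimal sketch of ours (names and the composition helper are illustrative, not the paper's implementation):

```python
from typing import Callable, List

# Images are grids of color indices, layers are ordered lists of images.
Image = List[List[int]]
Layers = List[Image]

T = Callable[[Image], Image]    # image  -> image   (T = X^X)
S = Callable[[Image], Layers]   # image  -> layers  (S = (X*)^X)
J = Callable[[Layers], Image]   # layers -> image   (J = X^(X*))
L = Callable[[Layers], Layers]  # layers -> layers  (L = (X*)^(X*))

# Any grammar expression composes to an overall T-type function,
# e.g. split an image into layers, modify them, join them back:
def compose_t(split: S, modify: L, join: J) -> T:
    return lambda img: join(modify(split(img)))

# Trivial instance: single-layer split, identity modify, take first layer.
identity: T = compose_t(lambda img: [img], lambda ls: ls, lambda ls: ls[0])
print(identity([[1, 2], [3, 4]]))  # [[1, 2], [3, 4]]
```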
<p>Our atomic functions comprise basic operations such as translation, rotation
and cropping, as well as layer-specific operations like extracting a layer, sorting
by different criteria, splitting images into layers, and merging them together. For
additional flexibility, we also allow higher-order functions like map, which lifts
the T-type to an L-type function by applying it to each layer. The grammar
structure ensures that all possible expressions have an overall T-type logic, i.e.
they transform a single input image into an output image. This allows finding
solvers for some tasks; an example is given in Figure 3.</p>
      <p>[Figure 2: the DSL function types — T: crop, rotate, …; L: sort, filter, …; S: split_colors, duplicate, …; J: union, top, …]</p>
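The higher-order map described above can be sketched as follows; the two atomic T-type functions are our own minimal examples, not the paper's exact implementations:

```python
from typing import Callable, List

Image = List[List[int]]
Layers = List[Image]

# Two example atomic T-type (image -> image) functions:
def rotate90(img: Image) -> Image:
    """Rotate an image 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]

def flip_h(img: Image) -> Image:
    """Mirror an image horizontally."""
    return [row[::-1] for row in img]

# `map` lifts a T-type function to an L-type (layers -> layers) function
# by applying it to every layer independently:
def lift(t: Callable[[Image], Image]) -> Callable[[Layers], Layers]:
    return lambda layers: [t(layer) for layer in layers]

layers = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
print(lift(rotate90)(layers))  # [[[3, 1], [4, 2]], [[7, 5], [8, 6]]]
```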
<p>
        Grammatical Evolution We use Grammatical Evolution (GE) [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] to generate
expressions in our DSL, and ultimately find solvers for a given task. We use the
standard modulo-based mapping from codons to syntax trees of our DSL [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] to
obtain image functions f̃. Uniform mutation and 1-point crossover are used to
produce offspring [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. To prevent running out of codons, we limit the maximum
tree depth by preemptively excluding rules at every tree node. We choose the
next generation’s parents via tournament selection on the combined parent and
offspring population. For this we assess the loss value of each function, given by
the distance of its outputs f̃(x) to the ground-truth output images y, averaged
over all m_i training pairs,
      </p>
<p>L(f̃ | D_i) = (1/m_i) Σ_{k=1}^{m_i} d_img(f̃(x_k), y_k)   (1)</p>
      <p>Here we define d_img as (i) the proportion of incorrectly colored pixels (if images
have equal size) or (ii) the Euclidean distance between the color histograms:</p>
      <p>d_img(x, y) = Σ_i 1{x_i ≠ y_i} / (u_x · v_x) if u_x = u_y and v_x = v_y,
and d_img(x, y) = 1 + ‖Ψ(x) − Ψ(y)‖_2 else,   (2)</p>
      <p>where u, v denote the image width and height, 1{P} = 1 if P holds and 0 otherwise, and
Ψ(x) = (Σ_{x_i ∈ x} 1{x_i = c})_{c ∈ C} is the (unnormalized) color histogram of x. f̃ is an
optimal solution if (and only if) all predictions are equal to the expected output,
i.e. L(f̃ | D_i) = 0, thus we can also use Equation 1 as an early-stopping criterion.</p>
      <p>[Figure 3: example solver expression — strip_c1 → split_colors → sort_{Area,Desc} → top → crop]</p>
      <p>Following the evaluation procedure of the challenge, our accuracy is the
proportion of correctly solved tasks: ACC = (1/M) Σ_{i=1}^{M} Π_{ℓ=1}^{n_i} 1{f_i(x̂_ℓ) = ŷ_ℓ}. Here, f_i is
the solution produced by our method after training on task D_i. A grid search
was performed to obtain good hyperparameters for population size, mutation
rate and mutation strength. As GE is randomized, we ran the experiments with
40 different seeds and discuss the averaged results with standard deviation.</p>
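The loss from Equations 1 and 2 can be sketched in a few lines; this is our reconstruction under the stated definitions, with helper names of our own choosing:

```python
import math
from collections import Counter
from typing import Callable, List, Tuple

Image = List[List[int]]
COLORS = range(10)  # the 10 discrete ARC colors

def histogram(img: Image) -> List[int]:
    """Unnormalized color histogram Psi(x) over the 10 colors."""
    counts = Counter(c for row in img for c in row)
    return [counts[c] for c in COLORS]

def d_img(x: Image, y: Image) -> float:
    """Equation 2: pixel error rate for equal-size images,
    1 + Euclidean histogram distance otherwise."""
    if len(x) == len(y) and len(x[0]) == len(y[0]):
        wrong = sum(xi != yi for rx, ry in zip(x, y) for xi, yi in zip(rx, ry))
        return wrong / (len(x) * len(x[0]))
    hx, hy = histogram(x), histogram(y)
    return 1.0 + math.sqrt(sum((a - b) ** 2 for a, b in zip(hx, hy)))

def loss(f: Callable[[Image], Image], pairs: List[Tuple[Image, Image]]) -> float:
    """Equation 1: average distance of f's outputs to the ground truth."""
    return sum(d_img(f(x), y) for x, y in pairs) / len(pairs)

# A perfect solver reaches loss 0, the early-stopping criterion:
identity_pairs = [([[1, 1], [2, 2]], [[1, 1], [2, 2]])]
print(loss(lambda x: x, identity_pairs))  # 0.0
```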
<p>We first evaluate our method on the 400 training tasks of ARC, from which
we solved 7.68 (±0.61)%. The challenge leaderboard evaluation however is based
on a secret data set of 100 tasks with slightly higher logic complexity. Here, we
were able to correctly solve ACC = 3% of the tasks. Despite this seemingly low
value, we scored among the top 30 of over 900 participants, which illustrates
the challenge’s non-triviality. The small number of tasks makes it hard to assess
the usefulness of atomic functions in the DSL. Moreover, there is no information
about the overlap of required logic operations between training and test tasks.</p>
<p>Our results also made us question the effectiveness of GE considering the
tremendous search space complexity. Small mutations to the current best
individual, such as swapping out a single atomic function, can significantly change
its behavior. As a result, the algorithm operates in a highly non-convex search
space. We therefore compared our EA approach to simply generating random
individuals from our DSL. This random search baseline solves an average of
6.17 (±0.13)% of the training tasks, indicating that the EA is able to traverse
the search space at least somewhat more efficiently. This is even more significant
on the test tasks, where random search is not able to solve even a single task.</p>
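For illustration, the standard modulo-based codon-to-expression mapping used in our GE setup can be sketched with a hypothetical toy grammar in place of the full DSL (the depth-limit convention of forcing the last rule is a simplification of the rule exclusion described above):

```python
# Hypothetical toy grammar: each non-terminal maps to a list of productions.
GRAMMAR = {
    "<expr>": [["<func>", "(", "<expr>", ")"], ["x"]],
    "<func>": [["crop"], ["rotate"], ["flip"]],
}

def derive(codons, symbol="<expr>", max_depth=8):
    """Leftmost derivation: consume one codon per rule choice, picking the
    production via codon modulo the number of productions."""
    if symbol not in GRAMMAR:           # terminal symbol: emit as-is
        return symbol
    choices = GRAMMAR[symbol]
    if max_depth == 0 or not codons:    # depth limit / out of codons:
        rule = choices[-1]              # fall back to the last (terminating) rule
    else:
        rule = choices[codons.pop(0) % len(choices)]
    return "".join(derive(codons, s, max_depth - 1) for s in rule)

print(derive([0, 1, 0, 2, 1]))  # rotate(flip(x))
```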
    </sec>
    <sec id="sec-3">
      <title>Conclusion</title>
      <p>We showed that our DSL+GE approach is viable for solving reasoning tasks.
By designing a language for image-based logic and learning corresponding
expressions from the training instances, we were able to score well in the ARC
challenge. Ensuring a high expressiveness of the DSL, while at the same time
limiting the complexity to allow tractable function evolution, appears to be the
key problem. Finding the optimal set of functional atoms for the given domain
may thus be subject to further research.</p>
<p>The ARC challenge’s difficulty illustrates the long way still to go until ML
methods reach abstract reasoning capabilities comparable to the human mind’s.
Acknowledgment This research has been funded by the Federal Ministry of
Education and Research of Germany as part of the competence center for
machine learning ML2R (01|18038A).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Chollet</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>The measure of intelligence</article-title>
          . arXiv preprint arXiv:1911.01547 (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Gulwani</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Automating string processing in spreadsheets using input-output examples</article-title>
          .
          <source>ACM Sigplan Notices</source>
          <volume>46</volume>
          (
          <issue>1</issue>
          ),
          <fpage>317</fpage>
          -
          <lpage>330</lpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Hernández-Orallo</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Martínez-Plumed</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schmid</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Siebers</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dowe</surname>
            ,
            <given-names>D.L.</given-names>
          </string-name>
          :
          <article-title>Computer models solving intelligence test problems: Progress and implications</article-title>
          .
          <source>Artificial Intelligence</source>
          <volume>230</volume>
          ,
          <fpage>74</fpage>
          -
          <lpage>107</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Koza</surname>
            ,
            <given-names>J.R.</given-names>
          </string-name>
          :
          <article-title>Genetic programming as a means for programming computers by natural selection</article-title>
          .
          <source>Statistics and Computing</source>
          <volume>4</volume>
          ,
          <fpage>87</fpage>
          --
          <lpage>112</lpage>
          (
          <year>1994</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Legg</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hutter</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>A collection of definitions of intelligence</article-title>
          .
          <source>Advances in Artificial General Intelligence: Concepts</source>
          ,
          <source>Architectures and Algorithms</source>
          <volume>157</volume>
          (07
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>O'Neill</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ryan</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Grammatical evolution</article-title>
          .
          <source>IEEE Transactions on Evolutionary Computation</source>
          <volume>5</volume>
          (
          <issue>4</issue>
          ),
          <fpage>349</fpage>
          -
          <lpage>358</lpage>
          (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Ryan</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Collins</surname>
            ,
            <given-names>J.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>O'Neill</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Grammatical evolution: Evolving programs for an arbitrary language</article-title>
          .
          <source>In: European Conference on Genetic Programming</source>
          . pp.
          <fpage>83</fpage>
          -
          <lpage>96</lpage>
          . Springer (
          <year>1998</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>