Solving Abstract Reasoning Tasks with Grammatical Evolution

Raphael Fischer, Matthias Jakobs, Sascha Mücke, and Katharina Morik
TU Dortmund, AI Group, Dortmund, Germany
http://www-ai.cs.tu-dortmund.de

Abstract. The Abstraction and Reasoning Corpus (ARC), comprising image-based logical reasoning tasks, is intended to serve as a benchmark for measuring intelligence. Solving these tasks is very difficult for off-the-shelf ML methods due to their diversity and the small amount of training data. We present our approach, which solves tasks via grammatical evolution on a domain-specific language for image transformations. With this approach, we successfully participated in an online challenge, scoring among the top 4% of over 900 participants.

Keywords: Machine Learning · Reasoning · Grammatical Evolution

1 Introduction

Despite the fast advancements of AI research, the question of how to rigorously define and measure intelligence is still open [5,3]. It is taken up by the recently published Abstraction and Reasoning Corpus (ARC) [1] and its corresponding Kaggle challenge¹. The corpus features hundreds of image-based logic tasks (some examples are given in Figure 1), which are expected to be solved by reasoning AIs without any human aid. Finding solutions requires learning the inherent logic of a task from very few examples, which is easy for humans but proves to be very hard for machines. Learning logic from few examples has already been explored (e.g., for text data [2]); however, ARC's space of image logic is much larger.

We present our approach to solving abstract reasoning tasks based on a domain-specific language (DSL), whose expressions are generated via grammatical evolution (GE). With our method, we reached the 28th place out of over 900 participating teams in the ARC challenge².

Copyright © 2020 by the paper's authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
¹ https://www.kaggle.com/c/abstraction-and-reasoning-challenge/
² ls8-arc team on the official ARC challenge leaderboard.

Fig. 1: Exemplary ARC tasks: (a) Training task 48: crop to the smallest unicolor rectangle; (b) Training task 100: draw lines over the image in directions indicated by pixels colored in red; (c) Training task 397: match pattern, upscale and/or recolor if necessary.

2 Problem Statement

The ARC challenge requires participants to develop a model that is able to solve tasks $D_i$ from the set $D = \{D_1, \ldots, D_M\}$. Each $D_i$ comprises an image-based reasoning task with $m_i$ training pairs $\{(x^k, y^k)\}$ and $n_i$ test pairs $\{(\hat{x}^\ell, \hat{y}^\ell)\}$, where $m_i > 0$ is typically around 3 and $n_i > 0$ is mostly 1. The images feature up to 10 discrete colors $C = \{c_1, \ldots, c_{10}\}$, and image sizes range between 1 and 30 pixels per dimension. The implied function $f$ that maps inputs to the expected output images is vastly different for every task, often based on abstract features (e.g., connected shapes) or pattern continuation (see Figure 1). For each task $D_i$, the method should derive $\tilde{f}_i$ from $\{(x^k, y^k)\} \in D_i$ that correctly maps all input images to the expected outputs: $\forall (\hat{x}^\ell, \hat{y}^\ell) \in D_i : \tilde{f}_i(\hat{x}^\ell) = \hat{y}^\ell$. The challenge participants have access to $M = 400$ training tasks, which illustrate some of the logic concepts and come with labels $\hat{y}^\ell$. The final scoring, however, is based on an undisclosed test set, whose tasks are only seen by the model.
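For concreteness, the following sketch shows how a single task of this form can be represented and loaded in Python. It assumes the public Kaggle data layout (one JSON file per task, with `train` and `test` lists of `input`/`output` grids); the file path and helper name are illustrative, not part of our method.

```python
import json
from typing import List, Tuple

Grid = List[List[int]]  # rectangular image; colors encoded as ints 0..9

def load_task(path: str) -> Tuple[list, list]:
    """Load one ARC task from its JSON file (assumed Kaggle layout)."""
    with open(path) as f:
        task = json.load(f)
    # m_i training pairs (typically around 3) ...
    train_pairs = [(p["input"], p["output"]) for p in task["train"]]
    # ... and n_i test inputs (mostly 1), whose outputs must be predicted
    test_inputs = [p["input"] for p in task["test"]]
    return train_pairs, test_inputs

# Usage (the file name is hypothetical):
# train_pairs, test_inputs = load_task("training/task_048.json")
```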
3 Reasoning Approach

For our approach, we assume that $f$ can be broken down into a sequence of basic image transformations. We developed a custom domain-specific language (DSL) specifying such sequences, whose space of expressions is explored using an evolutionary algorithm (EA). We first explain our language in more detail and then show how we use an EA to create ARC task solvers from our DSL.

Domain-Specific Language. In a first step, we manually implemented solvers for approximately 20 random training tasks and identified recurring image operations, which became the basis of our DSL. Let $X = \bigcup_{u,v \geq 1} C^{u \times v}$ denote the set of rectangular images with colors in $C$, and $X^*$ the set of ordered lists of images, which we call layers. Our DSL is a context-free grammar in Backus-Naur form (BNF) [6]; the non-terminal symbols represent function sets whose members (i) modify images ($T = X^X$), (ii) decompose an image into layers ($S = (X^*)^X$), (iii) combine layers into a single image ($J = X^{X^*}$), or (iv) modify a layer object ($L = (X^*)^{X^*}$). The terminal productions of these symbols are either concrete functions of the corresponding type or more complex function compositions. Figure 2 depicts a visualization of our DSL function types.

Fig. 2: Function types of our custom reasoning DSL, displayed as a state automaton over $X$ and $X^*$, with some example functions for each type: $T$ (crop, rotate, ...), $S$ (split_colors, duplicate, ...), $L$ (sort, filter, ...), and $J$ (union, top, ...).

Our atomic functions comprise basic operations such as translation, rotation and cropping, as well as layer-specific operations like extracting a layer, sorting by different criteria, splitting images into layers, and merging them together. For additional flexibility, we also allow higher-order functions like map, which lifts a $T$-type to an $L$-type function by applying it to each layer. The grammar structure ensures that all possible expressions have an overall $T$-type logic, i.e., they transform a single input image into an output image. This makes it possible to find solvers for some tasks; an example is given in Figure 3.

Grammatical Evolution. We use Grammatical Evolution (GE) [7] to generate expressions in our DSL and ultimately find solvers for a given task. We use the standard modulo-based mapping from codons to syntax trees of our DSL [6] to obtain image functions $\tilde{f}$. Uniform mutation and 1-point crossover are used to produce offspring [4]. To prevent running out of codons, we limit the maximum tree depth by preemptively excluding rules at every tree node. We choose the next generation's parents via tournament selection on the combined parent and offspring population. For this, we assess the loss value of each function, given by the distance of its outputs $\tilde{f}(x)$ to the ground-truth output images $y$, averaged over all $m_i$ training pairs:

$$L(f \mid D_i) = \frac{1}{m_i} \sum_{k=1}^{m_i} d_{\mathrm{img}}(f(x^k), y^k). \qquad (1)$$

Here we define $d_{\mathrm{img}}$ as (i) the proportion of incorrectly colored pixels (if the images have equal size) or (ii) the Euclidean distance between the color histograms:

$$d_{\mathrm{img}}(x, y) = \begin{cases} \sum_i \mathbb{1}_{\{x_i \neq y_i\}} / (u_x v_x) & \text{if } u_x = u_y \text{ and } v_x = v_y, \\ 1 + \lVert \phi(x) - \phi(y) \rVert_2 & \text{else,} \end{cases} \qquad (2)$$

where $u_x, v_x$ denote image width and height, $\mathbb{1}_{\{P\}} = 1$ if $P$ holds and $0$ otherwise, and $\phi(x) = \sum_{x_i \in x} (\mathbb{1}_{\{x_i = c\}})_{c \in C}^{\top}$ is the (unnormalized) color histogram of $x$. $\tilde{f}$ is an optimal solution if (and only if) all predictions are equal to the expected output, i.e., $L(\tilde{f} \mid D_i) = 0$; thus we can also use Equation 1 as an early-stopping criterion.

Fig. 3: Solver for the task displayed in Figure 1a: $\mathrm{strip}_{c_1} \to \mathrm{split\_colors} \to \mathrm{sort}_{\mathrm{Area,Desc}} \to \mathrm{top} \to \mathrm{crop}$. It first strips away black ($c_1$) pixels, then splits the image into layers according to the remaining colors, and finally crops to the layer with the smallest surface area.
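To make the codon-to-expression mapping concrete, the following sketch decodes an integer genome against a toy grammar. The grammar and function names only gesture at our DSL (which is not reproduced in full here), and the expansion cap is a crude stand-in for the depth-limited rule exclusion described above.

```python
import random

# Toy stand-in for the DSL grammar: non-terminal -> list of productions.
GRAMMAR = {
    "<T>": [["crop"], ["rotate"],
            ["<T>", "<T>"],                  # composition of two T-functions
            ["<S>", "<L>", "<J>"]],          # split, modify layers, join
    "<S>": [["split_colors"], ["duplicate"]],
    "<L>": [["sort"], ["filter"], ["map", "<T>"]],  # map lifts T to L
    "<J>": [["union"], ["top"]],
}

def decode(codons, start="<T>", max_expansions=50):
    """Standard GE mapping: each codon selects a production via modulo.

    Codons wrap around when exhausted; past max_expansions, only
    non-recursive rules may be chosen so that decoding terminates
    (a simplification of the per-node rule exclusion in the paper)."""
    out, stack, i, n = [], [start], 0, 0
    while stack:
        sym = stack.pop(0)                   # leftmost derivation
        if sym not in GRAMMAR:               # terminal: emit function name
            out.append(sym)
            continue
        n += 1
        rules = GRAMMAR[sym]
        if n > max_expansions:
            rules = [r for r in rules if all(s not in GRAMMAR for s in r)]
        rule = rules[codons[i % len(codons)] % len(rules)]
        i += 1
        stack = list(rule) + stack
    return out

genome = [random.randrange(256) for _ in range(20)]
print(decode(genome))  # e.g. ['split_colors', 'sort', 'top']
```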
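Equations 1 and 2 translate directly into code. Below is a minimal sketch, assuming images are NumPy integer arrays of color indices; the helper names are ours and merely illustrative.

```python
import numpy as np

NUM_COLORS = 10

def histogram(img: np.ndarray) -> np.ndarray:
    """Unnormalized color histogram phi(x) over the ten ARC colors."""
    return np.bincount(img.ravel(), minlength=NUM_COLORS).astype(float)

def d_img(x: np.ndarray, y: np.ndarray) -> float:
    """Image distance of Equation 2."""
    if x.shape == y.shape:
        return float(np.mean(x != y))    # fraction of differing pixels
    return 1.0 + float(np.linalg.norm(histogram(x) - histogram(y)))

def loss(f, train_pairs) -> float:
    """Average distance over all training pairs (Equation 1)."""
    return sum(d_img(f(x), y) for x, y in train_pairs) / len(train_pairs)

# loss(f, train_pairs) == 0 holds exactly when f solves every training
# pair, which doubles as the early-stopping criterion.
```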
4 Experimental Results

Following the evaluation procedure of the challenge, our accuracy is the proportion of correctly solved tasks: $\mathrm{ACC} = M^{-1} \sum_{i=1}^{M} \prod_{\ell=1}^{n_i} \mathbb{1}_{\{f_i(\hat{x}^\ell) = \hat{y}^\ell\}}$. Here, $f_i$ is the solution produced by our method after training on task $D_i$. A grid search was performed to obtain good hyperparameters for population size, mutation rate and mutation strength. As GE is randomized, we ran the experiments with 40 different seeds and discuss the averaged results with standard deviation.

We first evaluated our method on the 400 training tasks of ARC, of which we solved 7.68 (±0.61)%. The challenge leaderboard evaluation, however, is based on a secret data set of 100 tasks with slightly higher logic complexity. Here, we were able to correctly solve ACC = 3% of the tasks. Despite this seemingly low value, we scored among the top 30 of over 900 participants, which illustrates the challenge's non-triviality. The small number of tasks makes it hard to assess the usefulness of individual atomic functions in the DSL. Moreover, there is no information about the overlap of required logic operations between training and test tasks.

Our results also made us question the effectiveness of GE, considering the tremendous search space complexity. Small mutations to the current best individual, such as swapping out a single atomic function, can significantly change its behavior. As a result, the algorithm operates in a highly non-convex search space. We therefore compared our EA approach to simply generating random individuals from our DSL. This random search baseline solves an average of 6.17 (±0.13)% of the training tasks, indicating that the EA is able to traverse the search space at least somewhat more efficiently. The difference is even more pronounced on the test tasks, where random search is not able to solve even a single task.
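For reference, the task-level accuracy used above can be sketched as follows, where grids are compared by exact equality; the container layout and names are illustrative, not prescribed by the challenge code.

```python
def accuracy(solvers, tasks) -> float:
    """ACC from above: the fraction of tasks whose every test pair is
    predicted exactly. solvers[i] is the function found for task i;
    tasks[i] is a list of (test_input, test_output) grid pairs,
    with grids given as lists of lists of ints."""
    solved = sum(
        all(f(x) == y for x, y in test_pairs)
        for f, test_pairs in zip(solvers, tasks)
    )
    return solved / len(tasks)
```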
5 Conclusion

We showed that our DSL+GE approach is viable for solving reasoning tasks. By designing a language for image-based logic and learning corresponding expressions from the training instances, we were able to score well in the ARC challenge. Ensuring high expressiveness of the DSL while at the same time limiting its complexity to allow tractable function evolution appears to be the key problem. Finding the optimal set of functional atoms for the given domain may thus be subject to further research. The ARC challenge's difficulty illustrates how far ML methods still are from abstract reasoning capabilities comparable to the human mind's.

Acknowledgment. This research has been funded by the Federal Ministry of Education and Research of Germany as part of the competence center for machine learning ML2R (01|18038A).

References

1. Chollet, F.: The measure of intelligence. arXiv preprint arXiv:1911.01547 (2019)
2. Gulwani, S.: Automating string processing in spreadsheets using input-output examples. ACM SIGPLAN Notices 46(1), 317–330 (2011)
3. Hernández-Orallo, J., Martínez-Plumed, F., Schmid, U., Siebers, M., Dowe, D.L.: Computer models solving intelligence test problems: Progress and implications. Artificial Intelligence 230, 74–107 (2016)
4. Koza, J.R.: Genetic programming as a means for programming computers by natural selection. Statistics and Computing 4, 87–112 (1994)
5. Legg, S., Hutter, M.: A collection of definitions of intelligence. Advances in Artificial General Intelligence: Concepts, Architectures and Algorithms 157 (2007)
6. O'Neill, M., Ryan, C.: Grammatical evolution. IEEE Transactions on Evolutionary Computation 5(4), 349–358 (2001)
7. Ryan, C., Collins, J.J., O'Neill, M.: Grammatical evolution: Evolving programs for an arbitrary language. In: European Conference on Genetic Programming. pp. 83–96. Springer (1998)