        Towards an Algebraic Approach to Theory and
                    Concept Evaluation

                                          Pietro Galliani

                                 Free University of Bozen-Bolzano
                                       pgallian@gmail.com



              Abstract. I introduce a theoretical framework for reasoning about the
              values of theories (understood as sets of elements of random conceptual
              algebras) and about the values of concepts with respect to theories. Then
              I define theory formation games and argue that they provide an adequate
              theoretical framework for the evaluation of strategies and algorithms
              for computational creativity; and, finally, I briefly discuss two possible
              practical applications of this framework, namely the automatic search
              for Natural Deduction rule systems for logics and the construction of
              falling-rule-list-plus-definitions classification models.


       1    Introduction
       One problem common to computational models of creativity, in mathematics
       or in other areas, is that it is often less than clear how such models, or their
       products, are to be evaluated [17, 19, 14, 18, 10]. There exist models of concept
       formation that have been shown to be, in principle, capable of reproducing non-
       trivial conceptual leaps, of the kind that – if produced spontaneously by a human
       being – would surely be regarded as creative: for instance, in [3] it is shown how
       the notion of prime ideal can arise from the blending of two conceptual spaces
       representing commutative rings with unity and integer numbers respectively.
       However, the same systems are equally able to generate, by the same techniques,
       concepts that a human would instantly recognize as spurious and of little value.
       It is not currently clear which criteria humans use for such an evaluation: while
       some conditions that, let us say, a conceptual blending operation [9] should satisfy
       in order for its result to be meaningful have been proposed (e.g. [7, 6]), it is fair
       to say that, at the moment, models of creativity and concept formation are
       relatively well suited for the formal explanation of creative concept formation
       but not yet as usable as tools for the assisted (let alone autonomous) creation of
       novel and useful concepts in mathematics or in other disciplines.
           In this work, I present a simple, abstract, algebraic formalization of the no-
       tion of the value of a theory T (or of a concept c given the theory T). This
       formalization does not involve the modelling of researchers as individual agents
       with utilities and possible actions, as is often done in the area of social
       simulation (see e.g. [1, 11, 21]): while such approaches can often be the source of
       very valuable insights, the high number of possible choices involved in their
       specification poses serious difficulties to the analysis of their implications. Finally,
       I briefly discuss two potential applications of this framework to the problems
       of building Natural Deduction systems and of improving the interpretability of
       falling rule list classification models [2].

Copyright © 2018 for this paper by its authors. Copying permitted for private and academic purposes.


2   Basic definitions

Concepts, mathematical or non-mathematical, do not live in isolation; and, to
a large degree, creative activity can be understood as the generation of novel
concepts through the combination of known ones. However, as remarked in the
introduction, not all such combinations are feasible, let alone fruitful. This jus-
tifies the following definition:

Definition 1 (Conceptual Algebra). A conceptual algebra A is a pair (A, ·),
where

 – A is a set of concepts;
 – · is a partial operation over A.1

    In the above definition, the algebra A is not required to be commutative. In
this way, we can distinguish between two possibly distinct roles (active/passive)
of concepts in a concept combination: a concept can either be applied to another
concept to modify it, or be modified by the application of another concept. To
make an example, if we take the concept of “prime ideal” as a combination of
the concepts “prime number” and “ideal”, it is clear that it is the notion of
primality that is being applied to the notion of ideal and not vice versa; and
indeed, the resulting concept belongs to the conceptual domain of ideals rather
than to the one of numbers. Quite arbitrarily, in a concept combination a · b the
left operand a is assumed to take the active role and the right one b is assumed
to take the passive one.
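To fix intuitions, a finite conceptual algebra can be sketched in code as a set of concept labels together with a dictionary encoding the partial operation. The labels and the single combination below are illustrative assumptions of this sketch, not part of the formal development.

```python
# A toy conceptual algebra (A, .): concepts are strings, and the partial
# operation is encoded as a dictionary from ordered pairs to results.
A = {"prime", "ideal", "prime ideal"}

# Non-commutativity matters: the active concept "prime" applied to the
# passive concept "ideal" yields "prime ideal"; the reverse is undefined.
op = {("prime", "ideal"): "prime ideal"}

def combine(a, b):
    """Return a . b, or None if the combination is undefined."""
    return op.get((a, b))
```

Encoding the operation as a dictionary makes partiality immediate: any pair absent from the dictionary is simply undefined.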
    A theory T is then defined as a set of concepts in a conceptual algebra.
It is important to emphasize here, for the following analysis to be meaningful,
that a theory is not understood as a set of axioms: instead, it is taken to be
the set of all concepts - be they statements, rules, heuristics, or proofs - that
are part of the “received knowledge” about a subject. For instance, the Riesz
Representation Theorem is certainly part of the theory of modern functional
analysis, even though it cannot meaningfully be considered one of its axioms.2

Definition 2 (Theory). A theory over a conceptual algebra A = (A, ·) is a
finite subset T ⊆ A.
1
  That is, it is a subset ·A of A × A × A such that (a, b, c), (a, b, c′) ∈ ·A implies
  c = c′. As usual, we write a · b = c for (a, b, c) ∈ ·A.
2
  In fact, it is questionable whether a notion such as “the axioms of functional analysis”
  has a meaningful and unique denotation to begin with: axiomatic systems can be
  very useful tools to systematize mathematical knowledge, but they are not necessarily
  epistemically prior to it.
   How valuable would such a theory be? In our framework, we do not have any
access whatsoever to properties of individual concepts such as their “simplicity”
or “elegance”: “internal” properties affecting the values of individual concepts
could be added to our framework easily enough, but for now it is preferable to
avoid such complications and treat concepts as individually indistinguishable.
The only difference between concepts in this framework, thus, lies in which fur-
ther concepts can be obtained (or “derived”) from them through applications of
the concept combination operator ·.

     The following analysis of theory value is based on three guiding principles:

Guiding Principle 1: Storing and using theories involves costs (e.g. paper,
  disk space, effort required for researchers to achieve proficiency with them...).
  These costs grow approximately linearly with the number of concepts con-
  tained in the theory.
Guiding Principle 2: The more concepts can be derived from a theory, the
  more valuable it is.
Guiding Principle 3: Concepts that can be derived from a theory in few steps
  contribute more to its value than concepts that can be derived from it only
  in many steps.

   In accordance with the above principles, I now propose a definition for the
value v(T) of a theory T in a conceptual algebra A. First, however, I need a
couple of auxiliary concepts:
Definition 3 (n-th level span and edge of a theory). Let A = (A, ·) be a
conceptual algebra and let T ⊆ A be a theory in it. Then for all n ∈ N, its n-th
level span ⟨T⟩n and its n-th level edge [T]n are defined recursively as follows:
 – ⟨T⟩0 = T;
 – For all n ∈ N, [T]n = {c ∈ A\⟨T⟩n : ∃a ∈ ⟨T⟩i , b ∈ ⟨T⟩j s.t. a · b =
   c and i + j = n};
 – For all n ∈ N, ⟨T⟩n+1 = ⟨T⟩n ∪ [T]n .
In other words, ⟨T⟩n is the set of all concepts that can be derived from those in
T within n concept combination steps,3 while [T]n is the set of all those that
cannot, but could be derived in one extra step.
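Under an illustrative encoding of the partial operation as a Python dictionary (an assumption of this sketch, not of the formal development), the recursion above can be written as:

```python
def spans_and_edges(T, op, depth):
    """Compute the n-th level spans <T>_n and edges [T]_n for n < depth.

    T: iterable of concepts; op: dict mapping ordered pairs (a, b) to a . b.
    Returns (spans, edges) with spans[n] = <T>_n and edges[n] = [T]_n.
    """
    spans = [set(T)]
    edges = []
    for n in range(depth):
        edge = set()
        # a in <T>_i, b in <T>_j with i + j = n, and c not already in <T>_n
        for i in range(n + 1):
            for a in spans[i]:
                for b in spans[n - i]:
                    c = op.get((a, b))
                    if c is not None and c not in spans[n]:
                        edge.add(c)
        edges.append(edge)
        spans.append(spans[n] | edge)   # <T>_{n+1} = <T>_n ∪ [T]_n
    return spans, edges
```

For instance, with op = {("a", "b"): "c", ("c", "c"): "d"} and T = {"a", "b"}, the concept "c" appears in [T]0 and "d" (which needs both operands to be one step away) in [T]2.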
    As per Guiding Principle 2, all concepts in ⟨T⟩ = T ∪ ⋃k∈N [T]k contribute
to the value v(T) of T. Furthermore, as per Guiding Principle 3, the concepts
in T —which are already in the theory as given— contribute individually more to
v(T) than those in [T]0 , which require one concept formation step to be obtained;
and, in general, the individual contributions of concepts in [T]i are greater than
those of concepts in [T]j whenever i < j. As no further information regarding
3
    It may be worth pointing out, once more, that these are not meant to be deduction
    steps in a specific formal system. The concepts contained in T may correspond to
    statements; but they may also correspond to definitions, heuristics or so on.
the contributions of concepts to the value of a theory is available, a conservative
choice is to let —by choice of unit— the concepts in T individually provide a
unit contribution to the value of the theory,4 and to let the concepts in [T]i
contribute for vi ∈ [0, 1] each, where 1 > v0 > v1 > . . . > vn > . . . Additionally, it is
reasonable to require vn to tend to zero as n → ∞: indeed, the contribution of a
concept to the value of a theory becomes negligible as the length of the derivation
necessary to obtain it from concepts in the theory grows extremely large. Thus,
and by analogy with the notion of discounted reward in reinforcement learning
[22], a simple (and by no means unique, but reasonable enough until and unless
evidence to the contrary is found) choice would be to let vk = λk+1 for some
fixed discount rate λ ∈ [0, 1].
    Finally, as per Guiding Principle 1, working with a theory T carries costs
which grow linearly with its size |T|: thus, v(T) also contains a term −C|T|,
where C ∈ R≥0 .
    We thus obtain the following definition, which is the central definition of
this work:

Definition 4 (Value of a theory in a conceptual algebra). Let C ∈ R≥0
be the cost per concept and let λ ∈ [0, 1] be the concept derivation discount rate.
Then for every conceptual algebra A = (A, ·) and every theory T ⊆ A, the value
vC,λ (T) is given by
                        vC,λ(T) = (1 − C)|T| + Σ∞k=0 λk+1 |[T]k|.

   The value of a concept given a theory can then be defined simply as the effect
that adding (or removing) it would have on the value of that theory. More precisely:


Definition 5 (Value of a concept given a theory). Let T be a theory in a
conceptual algebra A = (A, ·), let c ∈ A be a concept of A, let C ∈ R≥0 and let
λ ∈ [0, 1]. Then

                     vC,λ (c|T) = vC,λ (T ∪ {c}) − vC,λ (T\{c}).

    Note that this definition makes sense regardless of whether the concept c is
or is not part of T already; and that, moreover, it follows easily from it that if
T = {c1 . . . cn } then vC,λ(T) = Σnt=1 vC,λ(ct | {ci : i < t}).
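Assuming again the dictionary encoding of the partial operation, Definitions 4 and 5 can be approximated by truncating the infinite sum at a fixed depth. The helper re-inlines the span/edge recursion of Definition 3 so that the sketch is self-contained; the truncation depth is an illustrative choice.

```python
def value(T, op, C, lam, depth=10):
    """Approximate v_{C,lam}(T), truncating the sum over edges at `depth`."""
    spans, edges = [set(T)], []
    for n in range(depth):
        edge = set()
        for i in range(n + 1):          # a in <T>_i, b in <T>_{n-i}
            for a in spans[i]:
                for b in spans[n - i]:
                    c = op.get((a, b))
                    if c is not None and c not in spans[n]:
                        edge.add(c)
        edges.append(edge)
        spans.append(spans[n] | edge)
    return (1 - C) * len(T) + sum(lam ** (k + 1) * len(edges[k])
                                  for k in range(depth))

def concept_value(c, T, op, C, lam, depth=10):
    """Definition 5: v(T with c added) minus v(T with c removed)."""
    T = set(T)
    return value(T | {c}, op, C, lam, depth) - value(T - {c}, op, C, lam, depth)
```

For op = {("a", "b"): "c"} and T = {"a", "b"} with C = 0.1 and λ = 0.5, each stored concept contributes 0.9 and the one-step derivable "c" contributes 0.5, so v ≈ 2.3; the telescoping identity stated after Definition 5 can be checked numerically in the same way.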


3     Towards a Formal Model of Creativity
The simple model presented above can be used as a component of a formal,
abstract framework through which to evaluate the creativity of an algorithm.
4
    In other words, the unit of value corresponds to the amount of value that a single
    concept would contribute to the theory if added directly to it, before considering
    costs and possible derivations.
The idea, in brief, is that a procedure is creative insofar as it can generate
valuable theories in an unknown (but – at a cost – explorable) conceptual algebra.
In what follows, I present a decision-theoretical formalization of this intuition.

Definition 6 (Theory Formation Games). Let A = (A, ·) be a conceptual
algebra. Furthermore, let β0 ∈ R≥0 be a budget, let ρ, η ∈ R≥0 be the costs
associated with combination and random exploration respectively, and – as usual
– let C and λ be the memorization cost and the concept derivation discount rate
respectively. Then a theory formation game position is a tuple p = (K, F, β),
where

1. K ⊆ A is a set of known concepts;
2. F is a set of known facts, of the form a · b = c or of the form a · b = ↓ (the
   latter recording that a · b is undefined), for a, b and (if applicable) c in K;
3. β ∈ R≥0 is the remaining budget.

The initial position of the game is (∅, ∅, β0 ),5 and the possible moves given a
given position (K, F, β) are defined as follows:

 – If β ≥ ρ then, for all a, b ∈ K, a · b = ? is a possible move. If a · b is defined
   in A and is some c ∈ A, then the successor position is (K ∪ {c}, F ∪ {a · b =
   c}, β − ρ); otherwise, it is (K, F ∪ {a · b =↓}, β − ρ).
 – If β ≥ η then ?? is a possible move. The successor positions are then of the
   form (K ∪ {c}, F, β − η), where the concept c is an arbitrary concept in A.
 – For all T ⊆ K, Choose(T) is always a possible move which ends the game.
   T is then the output theory.

A strategy for such a game is a function σ from positions (K, F, β) to legal
moves. A complete play of such a strategy is then a sequence p0 p1 . . . pn , where
p0 is the starting position (∅, ∅, β0 ), for each i = 0 . . . n − 1 the position pi+1
is a possible successor of position pi under the move σ(pi ), and σ(pn ) is of the
form Choose(T). The value of such a play is then simply the value vC,λ (T) of
the output theory T.6
    The value of a strategy σ is then defined simply as the expectation of the
value of its complete plays, when the elements c added to K during ?? moves
are chosen randomly from the uniform distribution7 over the set A of all possible
concepts.
5
  Of course, this can be changed if one wishes to consider games starting from
  nonempty sets of known concepts and facts.
6
  It is easy to see that no infinite-length plays are possible in theory formation games,
  as a · b = ? and ?? moves decrease the available budget β until only play-ending
  moves Choose(T) remain available.
7
  Or from another suitable distribution. Note that the selected c may be in K already:
  this reflects the fact that, if most of the concepts relevant to a certain subject are
  known already, it is rarely productive to go fishing for the others in a random,
  undirected way.
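The game dynamics can be simulated directly. In the sketch below, the tagged-tuple encoding of moves, the dictionary representation of the hidden algebra, and the naive strategy are all illustrative assumptions:

```python
import random

def play(op, universe, beta, rho, eta, strategy, rng):
    """Play one theory formation game and return the output theory.

    op: dictionary encoding the (hidden) conceptual algebra;
    universe: list of all concepts, sampled on '??' moves;
    strategy: maps (K, F, beta) to ("combine", a, b), ("explore",)
    or ("choose", T).
    """
    K, F = set(), set()
    while True:
        move = strategy(K, F, beta)
        if move[0] == "combine":            # a . b = ?, costs rho
            _, a, b = move
            c = op.get((a, b))
            F.add((a, b, c))                # c is None if a . b is undefined
            if c is not None:
                K.add(c)
            beta -= rho
        elif move[0] == "explore":          # ??, costs eta
            K.add(rng.choice(universe))     # may already be in K
            beta -= eta
        else:                               # ("choose", T) ends the game
            return move[1]

# A naive strategy: explore while the budget allows it, then output K.
def naive(K, F, beta, eta=1.0):
    return ("explore",) if beta >= eta else ("choose", set(K))
```

Estimating the value of a strategy then amounts to averaging vC,λ over the output theories of many such plays, with the algebra itself resampled when a distribution P(A) is considered.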
    In the above definition, theory formation games are specified with respect
to a fixed conceptual algebra A. However, it is not usually the case that the
structure of a conceptual algebra is known in advance; and indeed, there would
be no reason for the positions of theory formation games to contain a set
F of known facts if only one conceptual algebra were considered possible. But as
we will now see, it is easy to extend theory formation games to distributions of
algebras:
Definition 7 (Theory Formation Games: Values of Strategies with
Distributions). Let P(A) be a probability distribution over conceptual algebras
(A, ·) all having the same domain A.8 Moreover, let β, ρ, η, C and λ be as
before and let σ be a strategy. Then the value of σ with respect to the distribution
P(A) is the expectation of the value of σ with respect to a random algebra A
chosen according to the distribution P(A).
    In brief, the value of a strategy represents its ability to find valuable theories
in an unknown conceptual algebra, when exploring said algebras – either by
computing the combination of two concepts through a · b = ? moves or by
attempting to find possibly novel concepts through ?? moves – carries a cost.
This framework is rather reminiscent of bandit algorithms [23]: but whereas the
main problem that bandit algorithms need to solve is to find the optimal balance
between the exploitation of known options and the exploration of novel ones, in
the case of theory formation games the chief difficulties that a strategy must
address lie first in balancing the search for novel concepts with the investigation
of the properties of the known ones, and then in selecting – on the basis of the
information gathered – a set of concepts which is expected to be of optimal value.
    I leave to future work the discussion and comparison of various possible –
and computationally feasible – strategies in theory formation games; here, I
merely observe that the problems that such strategies must overcome to be
successful appear to mirror very closely the ones that humans face when involved
in activities such as mathematical research, and that therefore this framework is
a promising abstract testbed for the study of creatively advantageous strategies.


4     Two Examples
Until now, the framework introduced in this work has been discussed from a very
abstract perspective. In this section, I will instead show two practical examples of
it, related to the problem of axiomatizing a logic and to the problem of learning
falling rule lists and definitions from a given dataset.
     Let L be a logic with Modus Ponens and Contraction – for instance, proposi-
tional logic – and let the (infinite) algebra A be the set of all Natural Deduction
rules
                                       P1 . . . Pn
                                           Q
which are valid in L, plus the following other types of objects:
8
    For simplicity, we can assume that A = {1 . . . N } for some integer N .
 – For every i ∈ N, a premise selection operator Seli ;
 – For every i ∈ N and any rule r ∈ A with at least i premises, an object

                                                   P1 . . . P̂i . . . Pn
                                    Seli (r) =
                                                            Q
   which differs from r only in that its i-th premise is highlighted;
 – For every i, j ∈ N with i < j, an object Eqi,j ;
 – For every atomic expression A and (possibly complex) expression S, a sub-
   stitution operator Sub(A, S);
The concept combination rule is defined as follows:
 – For all i ∈ N, if r ∈ A has at least i premises then Seli ·r = Seli (r);
 – For all i ∈ N, if r, Seli (s) ∈ A and the conclusion Q of r is the same as
   the highlighted premise P̂i of Seli (s), then r · Seli (s) is the rule obtained by
   replacing that marked premise with the premises of r. In other words,

             S1 . . . Sk   P1 . . . P̂i . . . Pn   P1 . . . Pi−1 S1 . . . Sk Pi+1 . . . Pn
                         ·                       =                                        ;
                Pi                  Q                                 Q
 – For all i, j ∈ N with i < j and all rules r ∈ A with at least j premises
   whose i-th and j-th premises are equal, Eqi,j · r is the rule obtained
   from r by removing its j-th premise. In other words,
                       P1 . . . Pi . . . Pj . . . Pn   P1 . . . Pi . . . Pj−1 Pj+1 . . . Pn
             Eqi,j ·                                 =
                                    Q                                     Q
   whenever Pi = Pj ;
 – For all substitution operators Sub(A, S) and all rules r, Sub(A, S) · r is the
   result of replacing all occurrences of A in the premises and conclusion of r
   with the expression S;
 – For all a, b ∈ A, if none of the above cases are applicable then a·b is undefined.
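For concreteness, a rule P1 . . . Pn / Q can be represented as a pair (premises, conclusion); the combination cases involving Seli and Eqi,j then correspond to the following illustrative helpers (the representation, names and 1-based indexing are assumptions of this sketch):

```python
# A rule P1 ... Pn / Q is a pair (premises, conclusion), with premises a
# tuple indexed 1-based, matching the Sel_i and Eq_{i,j} operators.

def graft(r, i, s):
    """r . Sel_i(s): replace the i-th premise of s (which must equal the
    conclusion of r) with the premises of r."""
    (r_prem, r_concl), (s_prem, s_concl) = r, s
    assert r_concl == s_prem[i - 1], "conclusion must match marked premise"
    return (s_prem[:i - 1] + r_prem + s_prem[i:], s_concl)

def merge(i, j, r):
    """Eq_{i,j} . r: drop the j-th premise of r when it equals the i-th."""
    prem, concl = r
    assert i < j and prem[i - 1] == prem[j - 1]
    return (prem[:j - 1] + prem[j:], concl)
```

For instance, grafting modus ponens (A, A → B / B) into the first premise of (B, C / D) yields the derived rule (A, A → B, C / D).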
    Now let T be a theory in our algebra – that is to say, a finite set of rules
r, premise highlighting operators Seli , rules with highlighted premises Seli (r),
premise merging operators Eqi,j and substitution operators Sub(A, S). Then, for
a fixed choice of cost C and discount rate λ, the value
                          vC,λ(T) = (1 − C)|T| + Σ∞k=0 λk+1 |[T]k|

describes the tradeoff between the cost of memorizing such a set of rules and
operators and their usefulness as tools to generate further rules; and the corre-
sponding theory formation game describes the process of seeking such a set of
rules when performing deductions or sampling random rules carries a cost.
    This framework can be easily modified in various ways: for instance, one
could modify the definition of [T]n by making the operators Seli , Eqi,j
and Sub(A, S) always available as left operands, regardless of whether they are in
any [T]i , and furthermore require T to consist only of rules r. Then, it would also
be advisable to replace the expressions |[T]n | in the definition of v(T) with a
finer-grained notion of the contribution of [T]n to the value of a theory, one
which would likewise count only rules and – for instance – count them only up
to substitution.

    A more complex variation on this theme could be to adapt the framework
for theory evaluation described in this work to falling rule lists ([24]).
    Falling rule lists are classification models which can be represented as
cascading sequences of if ... then ... else ... rules, to be applied in order
to a given input until a match is found. Despite their apparent simplicity, in
many cases their performance is comparable to that of less easily interpretable
methods, and it was recently shown in [2] that certifiably optimal falling rule
lists for categorical data9 can be derived efficiently.


    if (age = 23–25) ∧ (priors = 2–3) then YES
    else if (age = 18–20) then YES
    else if (sex = male) ∧ (age = 21–22) then YES
    else if (priors > 3) then YES
    else NO
    end if

Fig. 1. A simple falling rule list for predicting recidivism, found by the CORELS
algorithm [2]


    One of the main advantages of falling rule lists lies in their ease of inter-
pretability: given a falling rule list and an input, it is possible not only to extract
a prediction but also to straightforwardly present a justification for it, that is,
to produce a list of reasons why a given rule is applicable and no previous one
is.
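A falling rule list over categorical features can be sketched as a list of (condition, label) pairs, with each condition a set of feature values; the encoding and the justification string format below are illustrative choices, not the representation used by CORELS:

```python
def classify(rules, default, instance):
    """Apply a falling rule list to an instance (a set of feature values).

    Returns (label, justification): the label of the first applicable rule,
    together with the reasons why it fires and earlier rules do not.
    """
    for k, (cond, label) in enumerate(rules):
        if cond <= instance:                # the rule's condition holds
            return label, ("rule %d applies (%s holds); rules 1..%d do not"
                           % (k + 1, " and ".join(sorted(cond)), k))
    return default, "no rule applies"
```

The justification comes for free: the prediction is explained by the first matching condition together with the list of earlier, non-matching ones.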
    However, the interpretability of a falling rule list degrades rapidly with the
increase of the number of features: while the falling rule list of Figure 1 is per-
fectly readable, an analogous falling rule list involving even just fifty different
input features – a comparatively modest input dimensionality, in many modern
applications – would be markedly less so.
    A natural way to overcome this difficulty could consist in introducing complex
concepts, defined in terms of combinations of input concepts which occur more
than once in the left hand side of rules, and rewriting the falling rule list in
order to make use of them, as in Figure 2: in many cases, this would be expected
not only to improve the readability and interpretability of the model, but also
to provide us with definitions for complex concepts which are relevant to the
classification problem under examination.
9
    That is, data in which input features and outputs take values in finite, fixed domains.
                                                  T2 ← A ∧ G
                                                  T1 ← T2 ∧ C
     if (A ∧ B ∧ C ∧ E ∧ F ∧ G) then YES          if (T1 ∧ B ∧ E ∧ F ) then YES
     else if (A ∧ C ∧ D ∧ G) then YES             else if (T1 ∧ D) then YES
     else if (A ∧ C ∧ G ∧ H) then YES             else if (T1 ∧ H) then YES
     else if (A ∧ F ∧ H) then YES                 else if (A ∧ F ∧ H) then YES
     else if (A ∧ G) then YES                     else if (T2 ) then YES
     else NO                                      else NO
     end if                                       end if
Fig. 2. Rewriting a falling rule list by introducing complex concepts: first we define
T1 as A ∧ C ∧ G and rewrite rules 1–3 accordingly, then we define T2 as A ∧ G and
use it in the definition of T1 and rule 5. Does the rewrite improve readability? Are T1
and T2 valuable concepts? As in the framework discussed in this work, it depends on
size–derivation length tradeoffs!
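The rewriting illustrated in Figure 2 can be sketched as a simple substitution over rule conditions, encoded as sets of concept labels (the encoding and the function name are assumptions of this sketch):

```python
def introduce(name, cond, rules):
    """Rewrite a rule list by replacing the conjunction `cond`, wherever it
    occurs within a rule's condition, with the defined concept `name`."""
    return [((c - cond) | {name}, label) if cond <= c else (c, label)
            for c, label in rules]
```

Applying introduce("T1", {"A", "C", "G"}, ...) to the left-hand list of Figure 2 rewrites its first three rules; the tradeoff is that each introduced definition must itself be stored and unfolded when explaining a classification.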



    When generating definitions and falling rule lists, one must consider tradeoffs
between the total size of the model, its predictive power, and the number of
steps required to classify instances.10 This is reminiscent of the tradeoff between
representation size and derivation length which has been previously discussed in
this work; and, as will now be sketched, it can be reduced to it.
    Very briefly, the elements of the algebra A in this case are definitions X ←
Condition, dataset instances d with their attributes, and falling rule lists in-
dexed by dataset instances d : R. The combination operator · can be used to
combine two definitions, applying the former inside of the latter; or to combine
a concept definition with a dataset instance which satisfies it, adding the defined
concept to the instance; or to combine a dataset instance d with a rule list d : R,
causing the rule list to yield a success d⊤ (if its first rule is applicable to d and
its output is correct), a failure d⊥ (if its first rule is applicable to d but its
output is incorrect), or to return the rule list obtained by removing its first rule
(if its first rule is not applicable to d). Figure 3 contains examples illustrating
the behaviour of these rules.
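The combination cases involving definitions, dataset instances and indexed rule lists can be sketched as follows; the representations (an instance as a set of feature values plus a separate target label, a rule list as a list of (condition, label) pairs) are assumptions of this sketch:

```python
def apply_definition(name, cond, instance):
    """Combine a definition X <- Cond with an instance satisfying Cond."""
    assert cond <= instance
    return instance | {name}

def step(instance, target, rule_list):
    """Combine a labelled instance d with an indexed rule list d : R.

    Returns "success" or "failure" if the first rule applies (with correct
    or incorrect output), or the rule list with its first rule removed."""
    (cond, out), rest = rule_list[0], rule_list[1:]
    if cond <= instance:
        return "success" if out == target else "failure"
    return rest
```

Iterating step (interleaved with apply_definition) until "success" or "failure" is produced traces exactly the kind of derivation shown in Figure 3.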
    It is useful to remark that, according to the above-mentioned rules, the
generation of a success (or a failure) for a dataset instance corresponds in a
natural way to the explanation of its classification in terms of the given rule list
and definitions. For instance, consider the rule list and definitions of Figure 2
(right), and the dataset instance d : A, C, D, G, H, YES. Then a possible way to
generate d⊤ could be given by first applying d to the rule list, thus removing the
first rule as inapplicable; then applying the two definitions to d, thus obtaining
d : A, C, D, G, H, T2 , T1 , YES; and finally applying the first rule of the derived
rule list (that is, the second rule of the original rule list): if (T1 ∧ D) then YES.

10
   The motivation for attempting to minimize this number does not arise from
   computational costs as much as from explainability issues: the more steps – that is,
   definition applications or discardings of inapplicable rules – are necessary to
   classify an instance, the less readable the explanation of its classification will be.
  T2 ← A ∧ G      ·   T1 ← T2 ∧ C              =    T1 ← A ∧ G ∧ C




  T2 ← A ∧ G      ·   d : A, G, H              =    d : A, G, H, T2




                          if (T2 ∧ C ∧ H) then
                              YES                            if (A ∧ F ∧ H) then
                          else if (A∧F ∧H) then                  YES
d : A, F, H, YES · d :        YES               = d:         else if (T2 ) then
                          else if (T2 ) then                     YES
                              YES                            else NO
                          else NO                            end if
                          end if



                          if (A ∧ F ∧ H) then
                              YES
d : A, F, H, YES · d :    else if (T2 ) then       = d⊤
                              YES
                          else NO
                          end if



                          if (A ∧ F ∧ H) then
                              YES
d : A, F, H, NO   · d:    else if (T2 ) then       = d⊥
                              YES
                          else NO
                          end if
Fig. 3. Concept combination in the algebra of falling rule lists, definitions and dataset
instances. From top to bottom: combining two definitions, combining a definition and
a dataset instance, combining a dataset instance with a rule list whose first rule is not
applicable, combining a dataset instance with a rule list whose first rule is applica-
ble (correct output), combining a dataset instance with a rule list whose first rule is
applicable (incorrect output).
     The length of such a derivation, therefore, can be used as a measure of the
complexity of the explanation.
     In general, a theory T can be defined as a rule list, indexed by all available
dataset instances, and a set of definitions.11 In order to measure its cost more
accurately, we may weigh every rule list or definition in proportion to its size;
and then, we may modify the computation of the [T]n by making all dataset
instances available as left operands when computing it (even if they are not in T).
Finally, we change the contribution |[T]n | of [T]n to the value of T by counting
the number of successes and failures in [T]n (and weighing them appropriately),
instead of counting the number of elements in [T]n (in other words, we assign
positive value to the successes, negative to the failures, and zero to the definitions
and rule lists in [T]n ). Multiplying the contribution of [T]n to the value of
T by the factor λn+1 then corresponds to valuing successes more highly (and
penalizing failures more heavily) if they can be obtained from the initial rule
list and definitions in a small number of steps; thus, maximizing the value of a
falling rule list plus definitions does not simply correspond to searching for a rule
list which classifies accurately as many dataset instances as possible, but rather
to seeking one such that, additionally, the instances which can be successfully
classified admit as short an explanation (which is to say, in the above framework,
a derivation) as possible. The choice of the cost and discount rate parameters, as
well as the choice of the value of successes d⊤ and failures d⊥ , defines the tradeoff
between the objective of classifying correctly as many instances as possible in as
few steps as possible and the objective of having an acceptably compact set of
initial rule lists and definitions.
     This is only a brief sketch, and a more detailed presentation and analysis of
this adaptation of the framework to the case of falling rule lists is left for future
work. Here, it serves to illustrate a possible practical application of the framework,
one which in my opinion might be worth exploring further.

References
 [1] Ahrweiler, P.: Modelling theory communities in science. Journal of Artificial So-
     cieties and Social Simulation 14(4), 8 (2011)
 [2] Angelino, E., Larus-Stone, N., Alabi, D., Seltzer, M., Rudin, C.: Learning certifi-
     ably optimal rule lists for categorical data. stat 1050, 6 (2017)
 [3] Bou, F., Schorlemmer, M., Corneli, J., Gómez-Ramírez, D., Maclean, E., Smaill,
      A., Pease, A.: The role of blending in mathematical invention. In: Proceedings
      of the Sixth International Conference on Computational Creativity. vol. 55
      (2015)
 [4] Colton, S., Bundy, A., Walsh, T.: Agent based cooperative theory formation in
     pure mathematics. In: Proceedings of AISB 2000 symposium on creative and
     cultural aspects and applications of AI and cognitive science. pp. 11–18 (2000)
 [5] Colton, S., Bundy, A., Walsh, T.: On the notion of interestingness in automated
     mathematical discovery. International Journal of Human-Computer Studies 53(3),
     351–375 (2000)
11
     Nothing prevents us, in principle, from allowing multiple rule lists inside T; but for
     simplicity, that possibility is not considered here.
 [6] Confalonieri, R., Kutz, O., Troquard, N., Galliani, P., Porello, D., Peñaloza, R.,
     Schorlemmer, M.: Coherence, similarity, and concept generalisation (2017), under
     review.
 [7] Confalonieri, R., Plaza, E., Schorlemmer, M.: A process model for concept inven-
     tion. In: Proc. of the 7th International Conference on Computational Creativity,
     ICCC16 (2016)
 [8] Dyson, F.: A meeting with Enrico Fermi. Nature 427(6972), 297–297 (2004)
 [9] Fauconnier, G., Turner, M.: The way we think: Conceptual blending and the
     mind’s hidden complexities. Basic Books (2008)
[10] Galanter, P.: Computational aesthetic evaluation: Past and future. In: Computers
     and Creativity, pp. 255–293. Springer (2012)
[11] Gilbert, N.: A simulation of the structure of academic science. Sociological Re-
     search Online 2(2) (1997)
[12] Inglis, M., Aberdein, A.: Beauty is not simplicity: an analysis of mathematicians’
     proof appraisals. Philosophia Mathematica 23(1), 87–109 (2015)
[13] Inglis, M., Aberdein, A.: Diversity in proof appraisal. In: Mathematical Cultures,
     pp. 163–179. Springer (2016)
[14] Jordanous, A.: Evaluating evaluation: Assessing progress in computational cre-
     ativity research. In: Proceedings of the second international conference on com-
     putational creativity (ICCC-11). Mexico City, Mexico. pp. 102–107 (2011)
[15] Martin, U., Pease, A.: Mathematical practice, crowdsourcing, and social machines.
     In: International Conference on Intelligent Computer Mathematics. pp. 98–119.
     Springer (2013)
[16] Mayer, J., Khairy, K., Howard, J.: Drawing an elephant with four complex pa-
     rameters. American Journal of Physics 78(6), 648–649 (2010)
[17] McCormack, J.: Open problems in evolutionary music and art. In: Workshops on
     Applications of Evolutionary Computation. pp. 428–436. Springer (2005)
[18] McCormack, J., d’Inverno, M.: Computers and creativity: The road ahead. In:
     Computers and creativity, pp. 421–424. Springer (2012)
[19] Pease, A., Winterstein, D., Colton, S.: Evaluating machine creativity. In: Work-
     shop on creative systems, 4th international conference on case based reasoning.
     pp. 129–137 (2001)
[20] Raman-Sundström, M.: The notion of fit as a mathematical value. In: Mathemat-
     ical Cultures, pp. 271–285. Springer (2016)
[21] Righi, S., Takács, K.: The miracle of peer review and development in science: an
     agent-based model. Scientometrics pp. 1–21 (2017)
[22] Sutton, R.S., Barto, A.G.: Introduction to reinforcement learning, vol. 135. MIT
     Press Cambridge (1998)
[23] Vermorel, J., Mohri, M.: Multi-armed bandit algorithms and empirical evaluation.
     In: European conference on machine learning. pp. 437–448. Springer (2005)
[24] Wang, F., Rudin, C.: Falling rule lists. In: Artificial Intelligence and Statistics.
     pp. 1013–1022 (2015)