Introduction

Towards an Algebraic Approach to Theory and Concept Evaluation

Pietro Galliani

0 0 Free University of Bozen-Bolzano

I introduce a theoretical framework for reasoning about the values of theories (understood as sets of elements of random conceptual algebras) and about the values of concepts with respect to theories. Then I de ne theory formation games and argue that they provide an adequate theoretical framework for the evaluation of strategies and algorithms for computational creativity; and, nally, I brie y discuss two possible practical applications of this framework, namely the automatic search for Natural Deduction rule systems for logics and the construction of falling-rule-list-plus-de nitions classi cation models.

Introduction

One problem common to computational models of creativity, in mathematics or in other areas, is that it is often less than clear how such models, or their products, are to be evaluated [ 17, 19, 14, 18, 10 ]. There exist models of concept formation that have been shown to be, in principle, capable of reproducing nontrivial conceptual leaps, of the kind that { if produced spontaneously by a human being { would surely be regarded as creative: for instance, in [ 3 ] it is shown how the notion of prime ideal can arise from the blending of two conceptual spaces representing commutative rings with unity and integer numbers respectively. However, the same systems are equally able to generate, by the same techniques, concepts that a human would instantly recognize as spurious and of little value. It is not currently clear which criteria humans use for such an evaluation: while some conditions that, let us say, a conceptual blending operation [ 9 ] should satisfy in order for its result to be meaningful have been proposed (e.g. [ 7, 6 ]), it is fair to say that, at the moment, models of creativity and concept formation are relatively well suited for the formal explanation of creative concept formation but not as useable for tools for the assisted (let alone autonomous) creation of novel and useful concepts in mathematics or in other disciplines.

In this work, I present a simple, abstract, algebraic formalization of the notion of the value of a theory T (or of a concept c given the theory T). This formalization does not involve the modelling of researchers as individual agents with utilities and possible actions, as it is often done in the area of social simulation (see e.g. [ 1, 11, 21 ]): while such approaches can often be the source of very valuable insights, the high number of possible choices involved in their speci cation poses serious di culties to the analysis of their implications. Finally, Copyright © 2018 for this paper by its authors. Copying permitted for private and academic purposes.

I brie y discuss two potential applications of this framework to the problems of building Natural Deduction systems and of improving the interpretability of falling rule list classi cation models [ 2 ]. 2

Basic de nitions

Concepts, mathematical or non-mathematical, do not live in isolation; and, to a large degree, creative activity can be understood as the generation of novel concepts through the combination of known ones. However, as remarked in the introduction, not all such combinations are feasible, let alone fruitful. This justi es the following de nition: De nition 1 (Conceptual Algebra). A conceptual algebra A is a pair (A; ), where { A is a set of concepts; { is a partial operation over A.1

In the above de nition, the algebra A is not required to be commutative. In this way, we can distinguish between two possibly distinct roles (active/passive) of concepts in a concept combination: a concept can either be applied to another concept to modify it, or be modi ed by the application of another concept. To make an example, if we take the concept of \prime ideal" as a combination of the concepts \prime number" and \ideal", it is clear that it is the notion of primality that is being applied to the notion of ideal and not vice versa; and indeed, the resulting concept belongs to the conceptual domain of ideals rather than to the one of numbers. Quite arbitrarily, in a concept combination a b the left operand a is assumed to take the active role and the right one b is assumed to take the passive one.

A theory T is then de ned as a set of concepts in a conceptual algebra. It is important to emphasize here, for the following analysis to be meaningful, that a theory is not understood as a set of axioms : instead, it is taken to be the set of all concepts - be them statements, rules, heuristics, or proofs - that are part of the \received knowledge" about a subject. For instance, the Riesz Representation Theorem is certainly part of the theory of modern functional analysis, even though it cannot meaningfully be considered one of its axioms.2 De nition 2 (Theory). A theory over a conceptual algebra A = (A; ) is a nite subset T A. 1 That is, it is a subset A of A A A such that (a; b; c); (a; b; c0) 2 A implies c = c0. As usual, we write a b = c for (a; b; c) 2 A. 2 In fact, it is questionable whether a notion such as \the axioms of functional analysis" has a meaningful and unique denotation to begin with: axiomatic systems can be very useful tools to systematize mathematical knowledge, but they are not necessarily epistemically prior to it.

How valuable would such a theory be? In our framework, we do not have any access whatsoever to properties of individual concepts such as their \simplicity" or \elegance": \internal" properties a ecting the values of individual concepts could be added to our framework easily enough, but for now it is preferable to avoid such complications and treat concepts as individually indistinguishable. The only di erence between concepts in this framework, thus, lies in which further concepts can be obtained (or \derived") from them through applications of the concept combination operator .

The following analysis of theory value is based on three guiding principles: Guiding Principle 1: Storing and using theories involves costs (e.g. paper, disk space, e ort required for researchers to achieve pro ciency with them...). These costs grow approximately linearly with the number of concepts contained in the theory.

Guiding Principle 2: The more concepts can be derived from a theory, the more valuable it is.

Guiding Principle 3: Concepts that can be derived from a theory in few steps contribute more to its value than concepts that can be derived from it only in many steps.

In accordance with the above principles, I now propose a de nition for the value v(T) of a theory T in a conceptual algebra A. First, however, I need a couple of auxiliary concepts: De nition 3 (n-th level span and edge of a theory). Let A = (A; ) be a conceptual space and let T A be a theory in it. Then for all n 2 N, its n-th level span hTin and its n-th level edge [T]n are de ned recursively as follows: { hTi0 = T; { For all n 2 N, [T]n = fc 2 AnhTin : 9a 2 hTii; b 2 hTij s.t. a b = c and i + j = ng; { For all n 2 N, hTin+1 = hTin [ [T]n.

In other words, hTin is the set of all concepts that can be derived from those in T within n concept combination steps,3, while [T]n is the set of all those that cannot, but could be derived in one extra step.

As per Guiding Principle 2, all concepts in hTi = T[Sk1=0[T]k contribute to the value v(T) of T. Furthermore, as per Guiding Principle 3, the concepts in T |which are already in the theory as given| contribute individually more to v(T) than those in [T]0, which require one concept formation step to be obtained; and, in general, the individual contributions of concepts in [T]i are greater than those of concepts in [T]j whenever i < j. As no further information regarding 3 It may be worth pointing out, once more, that these are not meant to be deduction steps in a speci c formal system. The concepts contained in T may correspond to statements; but they may also correspond to de nitions, heuristics or so on. the contributions of concepts to the value of theory is available, a conservative choice is to let |by choice of unit| the concepts in T individually provide a unit contribution to the value of the theory,4 and to let the concepts in [T]i contribute for vi 2 [0; 1] each, where 1 > v0 > v1 : : : > vn : : : Additionally, it is reasonable to require vn to tend to zero for n ! 1: indeed, the contribution of a concept to the value of theory becomes negligible as the length of the derivation necessary to obtain it from concepts in the theory grows extremely large. Thus, and by analogy with the notion of discounted reward in reinforcement learning [ 22 ], a simple (and by no means unique, but reasonable enough until and unless evidence to the contrary is found) choice would be to let vk = k+1 for some xed discount rate 2 [0; 1].

Finally, as per Guiding Principle 1, working with a theory T carries costs which grow linearly with its size jTj: thus, v(T) also contains a term CjTj, where C 2 R 0.

We thus obtained the following de nition, which is the central de nition of this work: De nition 4 (Value of a theory in a conceptual algebra). Let C 2 R 0 be the cost per concept and let 2 [0; 1] be the concept derivation discount rate. Then for every conceptual algebra A = (A; ) and every theory T A, the value vC; (T) is given by vC; (cjT) = vC; (T [ fcg) vC; (Tnfcg):

Note that this de nition makes sense regardless of whether the concept c is or is not part of T already; and that, moreover, it follows easily from it that if T = fc1 : : : cng then vC; (T) = Pn

t=1 vC; (ctjfci : i < tg). 3

Towards a Formal Model of Creativity

The simple model presented above can be used as a component of a formal, abstract framework through which to evaluate the creativity of an algorithm. 4 In other words, the unit of value corresponds to the amount of value that a single concept would contribute to the theory if added directly to it, before considering costs and possible derivations.

The idea, in brief, is that a procedure is creative insomuch as it can generate valuable theories in an unknown (but { at a cost { explorable) conceptual algebra. In what follows, I present a decision-theoretical formalization of this intuition. De nition 6 (Theory Formation Games). Let A = (A; ) be a conceptual algebra. Furthermore, let 0 2 R 0 be a budget, let ; 2 R 0 be the costs associated to combination and random exploration respectively, and { as usual { let C and be the memorization cost and the concept derivation discount rate respectively. Then a theory formation game position is a tuple p = (K; F ; ), where

1. K A is a set of known concepts;

2. F is a set of known facts, of the form a b = c or of the form a b =# for a, b and (if applicable) c in K; 3. 2 R 0 is the remaining budget.

The initial position of the game is (;; ;; 0),5 and the possible moves given a given position (K; F ; ) are de ned as follows: { If then, for all a; b 2 K, a b = ? is a possible move. If a b is de ned in A and is some c 2 A, then the successor position is (K [ fcg; F [ fa b = cg; ); otherwise, it is (K; F [ fa b =#g; ). { If then ?? is a possible move. The successor positions are then of the form (K [ fcg; F ; ), where the concept c is an arbitrary concept in A. { For all T K, Choose(T) is always a possible move which ends the game.

T is then the output theory.

A strategy for such a game is a function from positions (K; F ; ) to legal moves. A complete play of such a strategy is then a sequence p0p1 : : : pn, where p0 is the starting position (;; ;; ), for each i = 1 : : : n it is the case that pi+1 is a possible successor of position pi under the move (pi), and (pn) is of the form Choose(T). The value of such a play is then simply the value vC; (T) of the output theory T.6

The value of a strategy is then de ned simply as the expectation of the value of its complete plays, when the elements c added to K during ?? moves are chosen randomly from the uniform distribution7 over the set A of all possible concepts. 5 Of course, this can be changed if one wishes to consider games starting from nonempty sets of known concepts and facts. 6 It is easy to see that no in nite-length plays are possible in theory formation games, as a b = ? and ?? moves decrease the available budget until only play-ending moves Choose(T) remain available. 7 Or from another suitable distribution. Note that the selected c may be in K already: this re ects the fact that, if most of the concepts relevant to a certain subject are known already, it is rarely productive to go shing for the others in a random, undirected way.

In the above de nition, theory formation games are speci ed with respect to a xed conceptual algebra A. However, it is not usually the case that the structure of a conceptual algebra is known in advance; and indeed, there would not be any reason for the positions of concept formation games to contain a set F of known facts if only one conceptual algebra was considered possible. But as we will now see, it is easy to extend theory formation games to distributions of algebras: De nition 7 (Theory Formation Games: Values of Strategies with Distributions). Let P(A) be a probability distribution over conceptual algebras (A; ) all having the same domain A,8. Moreover, let , , , C and be as before and let be a strategy. Then the value of with respect to the distribution P(A) is the expectation of the value of with respect to a random algebra A chosen according to the distribution P(A).

In brief, the value of a strategy represents its ability to nd valuable theories in an unknown conceptual algebra, when exploring said algebras { either by computing the combination of two concepts through a b = ? moves or by attempting to nd possibly novel concepts through ?? moves { carries a cost. This framework is rather reminiscent of bandit algorithms [ 23 ]: but whereas the main problem that bandit algorithms need to solve is to nd the optimal balance between the exploitation of known options and the exploration of novel ones, in the case of theory formation games the chief di culties that a strategy must address lie rst in balancing the search for novel concepts with the investigation of the properties of the known ones, and then in selecting { on the basis of the information gathered { a set of concepts which is expected to be of optimal value.

I leave to future work the discussion and comparison of various possible { and computationally feasible { strategies in concept formation games; here, I merely observe that the problems that such strategies must overcome to be successful appear to mirror very closely the ones that humans face when involved in activities such as mathematical research, and that therefore this framework is a promising abstract testbed for the study of creatively advantageous strategies. 4

Two Examples

Until now, the framework introduced in this work has been discussed from a very abstract perspective. In this section, I will instead show two practical examples of it, related to the problem of axiomatizing a logic and to the problem of learning falling rule list and de nitions from a given dataset.

Let L be a logic with Modus Ponens and Contraction { for instance, propositional logic { and let the (in nite) algebra A be the set of all Natural Deduction rules

P1 : : : Pn

Q which are valid in L, plus the following other types of objects: 8 For simplicity, we can assume that A = f1 : : : N g for some integer N . { For every i 2 N, a premise selection operator Seli; { For every i 2 N and any rule r 2 A with at least i premises, an object Seli(r) =

P1 : : : Pbi : : : Pn

Q which di ers from r only in that its i-th premise is highlighted; { For every i; j 2 N with i < j, an object Eqi;j ; { For every atomic expression A and (possibly complex) expression S, a substitution operator Sub(A; S);

The concept combination rule is de ned as follows:

{ For all i 2 N, if r 2 A has at least i premises then Seli r = Seli(r); { For all i 2 N, if r; Seli(s) 2 A and the conclusion Q of r is the same as the highlighted premise Pbi of Seli(s) then r Seli(s) is the rule obtained by replacing its marked premise with r. In other words,

S1 : : : Sk

P1 : : : Pbi : : : Pn = P1 : : : Pn 1S1 : : : SkPi+1 : : : Pn

Q Q ; { For all i; j 2 N with i < j and rules r 2 A with at least j premises and such that its i-th and j-th premises are equal, then Eqi;j r is the rule obtained from r by removing its j-th premise. In other words,

Eqi;j

P1 : : : Pi : : : Pj : : : Pn = P1 : : : Pi : : : Pj 1Pj+1 : : : Pn

Q Q whenever Pi = Pj ; { For all substitution operators Sub(A; S) and all rules r, Sub(A; S) r is the result of replacing all occurrences of A in the premises and conclusion of r with the expression S; { For all a; b 2 A, if none of the above cases are applicable then a b is unde ned.

Now let T be a theory in our algebra { that is to say, a nite set of rules r, premise highlighting operators Seli, rules with highlighted premises Seli(r), premise merging operators Eqi;j and substitution operators Sub(A; S). Then, for a xed choice of cost C and discount rate , the value describes the tradeo between the cost of memorizing such a set of rules and operators and their usefulness as tools to generate further rules; and the corresponding theory formation game describes the process of seeking such a set of rules when performing deductions or sampling random rules carries a cost.

This framework can be easily modi ed in various ways: for instance, one could easily modify the de nition of [T]n by making the operators Seli, Eqi;j and Sub(A; S) always available as left operands, regardless if they are in any [T]i, and furthermore require T to consist only of rules r. Then, it would also be advisable to replace the expressions j[T]nj in the de nition of v(T) with a ner-grained notion of the contribution of [T]n to the value of a theory, one which would likewise count only rules and { for instance { count them only up to substitution.

A more complex variation on this theme could be to adapt the framework for theory evaluation described in this work to falling rule lists ([ 24 ]).

Falling rule lists are classi cation models which can be represented as cascading sequences of if ...then ...else ... rules, to be applied in order to a given input until a match is found. Despite their apparent simplicity, in many cases their performances are comparable to these of less easily interpretable methods, and it was recently shown in [ 2 ] that certi ably optimal falling rule lists for categorical data9 can be derived e ciently.

if (age = 23{25) ^ (priors = 2{3) then YES else if (age = 18{20) then YES else if (sex = male) ^ (age = 21{22) then YES else if (priors > 3) then YES else NO end if

One of the main advantages of falling rule lists lies in their ease of interpretability: given a falling rule list and an input, it is possible not only to extract a prediction but also to straightforwardly present a justi cation for it, that is, to produce a list of reasons why a given rule is applicable and no previous one is.

However, the interpretability of a falling rule list degrades rapidly with the increase of the number of features: while the falling rule list of Figure 1 is perfectly readable, an analogous falling rule list involving even just fty di erent input features { a comparatively modest input dimensionality, in many modern applications { would be markedly less so.

A natural way to surpass this di culty could consist in introducing complex concepts, de ned in terms of combinations of input concepts which occur more than once in the left hand side of rules, and rewriting the falling rule list in order to make use of them, as in Figure 2: in many cases, this would be expected not only to improve the readability and interpretability of the model, but also to provide us with de nitions for complex concepts which are relevant to the classi cation problem in exam. 9 That is, data in which input features and outputs take values in nite, xed domains. if (A ^ B ^ C ^ E ^ F ^ G) then YES else if (A ^ C ^ D ^ G) then YES else if (A ^ C ^ G ^ H) then YES else if (A ^ F ^ H) then YES else if (A ^ G) then YES else NO end if if (T1 ^ B ^ E ^ F ) then YES else if (T1 ^ D) then YES else if (T1 ^ H) then YES else if (A ^ F ^ H) then YES else if (T2) then YES else NO end if

When generating de nitions and falling rule lists, one must consider tradeo s between the total size of the model, its predictive power, and the number of steps required to classify instances.10 This is reminiscent of the tradeo between representation size and derivation length which has been previously discussed in this work; and, as it will now be sketched, it can be reduced to it.

Very brie y, the elements of the algebra A in this case are de nitions X Condition, dataset instances d with their attributes, and falling rule lists indexed by dataset instances d : R. The combination operator can be used to combine two de nitions, applying the former inside of the latter; or to combine a concept de nition with a dataset instance which satis es it, adding the de ned concept to the instance; or to combine a dataset instance d with a rule list d : R, causing the rule list to yield a success d> (if its rst rule is applicable to d and its output is correct), a failure d? (if its rst rule is applicable to d but its output is incorrect), or returning the rule obtained by removing its rst premise (if its rst rule is not applicable to d). Figure 3 contains examples illustrating the behaviour of these rules.

It is useful to remark that, according to the above mentioned rules, the generation of a success (or a failure) for a dataset instance corresponds in a natural way to the explanation of its classi cation in terms of the given rule list and de nitions. For instance, consider the rule set and de nitions of Figure 2 right, and the dataset instance d : A; C; D; G; H; YES. Then a possible way to generate d> could be given by rst applying d to the rule list, thus removing the rst rule as inapplicable; then applying the two de nitions to d, thus obtaining d : A; C; D; G; H; T2; T1; YES; and nally applying the rst rule of the derived rule list (that is, the second rule of the original rule list) if (T1 ^ D) then YES. 10 The motivation for attempting to minimize this number does not rise from computational costs as much as from explainability issues: the more steps { that is, de nition applications or discardings of inapplicable premises { are necessary to classify an instance, the less readable the explanation of its classi cation will be. T2

A ^ G A ^ G d : A; G; H = = d : A; F; H; YES d : = d : if (A ^ F ^ H) then

YES else if (T2) then

YES else NO end if if (T2 ^ C ^ H) then

YES else if (A^F ^H) then

YES else if (T2) then

YES else NO end if if (A ^ F ^ H) then

YES else if (T2) then

YES else NO end if if (A ^ F ^ H) then

YES else if (T2) then

YES else NO end if = d> = d? d : A; F; H; YES

d : d : A; F; H; NO

d : Fig. 3. Concept combination in the algebra of falling rule lists, de nitions and dataset instances. From top to bottom: combining two de nitions, combining a de nition and a dataset instance, combining a dataset instance with a rule list whose rst rule is not applicable, combining a dataset instance with a rule list whose rst rule is applicable (correct output), combining a dataset instance with a rule list whose rst rule is applicable (incorrect output).

The length of such a derivation, therefore, can be used as a measure for the complexity of the explanation.

In general, a theory T can be de ned as a rule list, indexed by all available dataset instances, and a set of de nitions.11 In order to measure its cost more accurately, we may weigh every rule list or de nition in proportion to its size; and then, we may modify the computation of the [T]n by making all dataset instances available as left operands when computing it (even if they are not in T). Finally, we change the contribution j[T]nj of [T]n to the value of T by counting the number of successes and failures in [T]n (and weighing them appropriately), instead of counting the number of elements in [T]n (in other words, we assign positive value to the successes, negative to the failures, and zero to the de nitions and rule lists in [T]n). Multiplying the contribution of [T]n to the value of T by the factor n+1 then corresponds to valuing more highly successes (and penalizing more highly failures) if they can be obtained from the initial rule list and de nitions in a small number of steps; thus, maximizing the value of a falling rule list plus de nitions does not simply correspond to searching for a rule list which classi es accurately as many dataset instances as possible, but rather to seeking one such that, additionally, the instances which can be successfully classi ed admit as short an explanation (which is to say, in the above framework, a derivation) as possible. The choice of the cost and discount rate parameters, as well as the choice of the value of successes d> and failures d?, de nes the tradeo between the objective of classifying correctly as many instances as possible in as few steps as possible with the objective of having an acceptably compact set of initial rule lists and de nitions.

This is only a brief sketch, and more detailed presentation and analysis of this adaptation of the framework to the case of falling rule lists is left for further work. Here, it serves to illustrate a possible practical application of the framework presented here, one which in my opinion might be worth exploring further.

[1] Ahrweiler , P. : Modelling theory communities in science . Journal of Arti cial Societies and Social Simulation 14 ( 4 ), 8 ( 2011 )

[2] Angelino , E. , Larus-Stone , N. , Alabi , D. , Seltzer , M. , Rudin , C. : Learning certi - ably optimal rule lists for categorical data . stat 1050 , 6 ( 2017 )

[3] Bou , F. , Schorlemmer , M. , Corneli , J. , Gomez-Ram rez , D., Maclean , E. , Smaill , A. , Pease , A. : The role of blending in mathematical invention . In: Proceedings of the Sixth International Conference on Computational Creativity June . vol. 55 ( 2015 )

[4] Colton , S. , Bundy , A. , Walsh , T. : Agent based cooperative theory formation in pure mathematics . In: Proceedings of AISB 2000 symposium on creative and cultural aspects and applications of AI and cognitive science . pp. 11 { 18 ( 2000 )

[5] Colton , S. , Bundy , A. , Walsh , T. : On the notion of interestingness in automated mathematical discovery . International Journal of Human-Computer Studies 53 ( 3 ), 351 { 375 ( 2000 )

11 Nothing prevents us, in principle, from allowing multiple rule lists inside T; but for simplicity, that possibility is not considered here .

[6] Confalonieri , R. , Kutz , O. , Troquard , N. , Galliani , P. , Porello , D. , Pen~aloza, R., Schorlemmer , M. : Coherence, similarity, and concept generalisation (2017), under review .

[7] Confalonieri , R. , Plaza , E. , Schorlemmer , M.: A process model for concept invention . In: Proc. of the 7th International Conference on Computational Creativity , ICCC16 ( 2016 )

[8] Dyson , F. : A meeting with Enrico Fermi . Nature 427 ( 6972 ), 297 { 297 ( 2004 )

[9] Fauconnier , G. , Turner , M.: The way we think: Conceptual blending and the mind's hidden complexities . Basic Books ( 2008 )

[10] Galanter , P. : Computational aesthetic evaluation: Past and future . In: Computers and Creativity , pp. 255 { 293 . Springer ( 2012 )

[11] Gilbert , N.: A simulation of the structure of academic science . Sociological Research Online 2 ( 2 ) ( 1997 )

[12] Inglis , M. , Aberdein , A. : Beauty is not simplicity: an analysis of mathematicians' proof appraisals . Philosophia Mathematica 23 ( 1 ), 87 { 109 ( 2015 )

[13] Inglis , M. , Aberdein , A. : Diversity in proof appraisal . In: Mathematical Cultures , pp. 163 { 179 . Springer ( 2016 )

[14] Jordanous , A. : Evaluating evaluation: Assessing progress in computational creativity research . In: Proceedings of the second international conference on computational creativity (ICCC-11) . Mexico City, Mexico. pp. 102 { 107 ( 2011 )

[15] Martin , U. , Pease , A. : Mathematical practice, crowdsourcing, and social machines . In: International Conference on Intelligent Computer Mathematics . pp. 98 { 119 . Springer ( 2013 )

[16] Mayer , J. , Khairy , K. , Howard , J.: Drawing an elephant with four complex parameters . American Journal of Physics 78 ( 6 ), 648 { 649 ( 2010 )

[17] McCormack , J. : Open problems in evolutionary music and art . In: Workshops on Applications of Evolutionary Computation . pp. 428 { 436 . Springer ( 2005 )

[18] McCormack , J. , d'Inverno , M. : Computers and creativity: The road ahead . In: Computers and creativity , pp. 421 { 424 . Springer ( 2012 )

[19] Pease , A. , Winterstein , D. , Colton , S.: Evaluating machine creativity . In: Workshop on creative systems, 4th international conference on case based reasoning . pp. 129 { 137 ( 2001 )

[20] Raman-Sundstro m, M.: The notion of t as a mathematical value . In: Mathematical Cultures , pp. 271 { 285 . Springer ( 2016 )

[21] Righi , S. , Takacs , K. : The miracle of peer review and development in science: an agent-based model . Scientometrics pp. 1 { 21 ( 2017 )

[22] Sutton , R.S. , Barto , A.G. : Introduction to reinforcement learning , vol. 135 . MIT Press Cambridge ( 1998 )

[23] Vermorel , J. , Mohri , M.: Multi-armed bandit algorithms and empirical evaluation . In: European conference on machine learning . pp. 437 { 448 . Springer ( 2005 )

[24] Wang , F. , Rudin , C. : Falling rule lists . In: Arti cial Intelligence and Statistics . pp. 1013 { 1022 ( 2015 )