A Human Like Incremental Decision Tree Algorithm Combining Rule Learning, Pattern Induction, and Storing Examples

Christina Zeller and Ute Schmid
Cognitive Systems Group, University of Bamberg
An der Weberei 5, 96045 Bamberg, Germany
{Christina.Zeller,Ute.Schmid}@uni-bamberg.de

Abstract. Early machine learning research was strongly interrelated with research on human category learning, while later on the focus shifted to the development of algorithms with high performance. Only recently has there been a renewed interest in cognitive aspects of learning. Machine learning approaches might be able to model and explain human category learning, while cognitive models might inspire new, more human-like approaches to machine learning. In cognitive science research there exist different theories of category learning, especially rule-based approaches, prototypes, and exemplar-based theories. To take account of the flexibility of human learning and categorization, we propose a human-like learning algorithm that combines incremental decision tree learning, least general generalization, and the storage of examples for similarity-based categorization. In this paper we present first ideas for this algorithm.

Keywords: (human) supervised category learning, cognitive modeling, incremental decision trees, least general generalizations

Copyright © 2017 by the paper's authors. Copying permitted only for private and academic purposes. In: M. Leyer (Ed.): Proceedings of the LWDA 2017 Workshops: KDML, FGWM, IR, and FGDB. Rostock, Germany, 11.-13. September 2017, published at http://ceur-ws.org

1 Introduction

Early machine learning research was strongly related to human category learning (cf. Michalski, Carbonell, & Mitchell, 1983). For example, decision tree algorithms were inspired by research on human category learning by Bruner, Goodnow, and Austin (1956). They investigated how humans learn conjunctive or disjunctive categorization rules for a fixed set of artificial stimuli. It could be shown that most humans learn such rules in an incremental way, using, for example, the wholist strategy: they generate an initial rule which is sequentially modified for new examples that do not conform to the current rule. Based on these findings, Hunt, Marin, and Stone (1966) developed the first decision tree algorithms. Subsequently, Unger and Wysotzki (1981) introduced the incremental decision tree algorithm Cal 2 as an ideal model for human category learning and Quinlan (1986) developed the well-known ID3.

Starting in the 1990s, machine learning research focused on the development of new algorithms with high performance rather than on cognitive plausibility. As a consequence, machine learning and cognitive modeling research separated (Langley, 2016). For example, a main characteristic of human learning is life-long learning, that is, incremental and cumulative learning, while in the field of machine learning many batch learning algorithms were developed (e.g., ID3, neural network approaches, Bayes classifiers; Mitchell, 1997). The lack of research on incremental, cumulative, and therefore life-long machine learning has been addressed with motivation from human learning, but without a focus on the cognitive plausibility of the algorithms (Thrun, 1998).

Another main characteristic of human learning is the flexible use of different strategies while learning and while using the learned knowledge in categorization tasks.
That is, in human category learning there is evidence for rule-based learning, prototype learning, and exemplar-based learning, which are combined in hybrid theories of human category learning (Kruschke, 2008). In machine learning, multistrategy learning was a focus of research in the 1980s and 1990s (Michalski, 1993; Langley, 2016), and Langley (2016) has suggested that research should continue along this line. Multistrategy learning shares characteristics with ensemble learning. However, while ensemble learning focuses on combining different models to improve classification performance, multistrategy learning focuses on achieving human-level flexibility in the application of different knowledge structures in different contexts (Dietterich, 2002).

Besides, Langley (2016) has recommended that researchers should work in an interdisciplinary way because cognitive science and machine learning can mutually profit from each other. Machine learning approaches might be able to model and explain human category learning, while cognitive models might inspire new, more human-like approaches to machine learning. Following these thoughts, we are currently developing a human-inspired learning approach based on Cal 2. Consequently, we focus on learning in supervised classification settings or, in other words, on category learning from labeled examples.

Therefore, in the following section we describe Cal 2 as the incremental rule-based basis of our algorithm and ID3, which offers a connection to exemplar-based human category learning. Then we introduce least general generalizations (Mitchell, 1977), which inspire a prototypical view of learning. The algorithm combining these ideas is presented and applied to a small example in Section 3. In the last section we discuss further steps such as possible extensions of our learning algorithm and the evaluation of the proposed approach.

2 Basic Symbolic Approaches to Category Learning

In this section we first describe relevant decision tree learning algorithms and how they connect to theories of human category learning. However, not all aspects of human category learning can be explained with rules as generated by decision trees. Therefore, in the second part of this section we take a closer look at least general generalizations, which relate to prototype theories of human category learning.

2.1 Decision Tree Learning, Rules, and Exemplars

Algorithm 1 shows the decision tree learner Cal 2 (Unger & Wysotzki, 1981). The algorithm handles examples in an incremental way: the information of previous examples is stored implicitly in the tree structure, but no previous or later examples are available explicitly at any step. The algorithm terminates successfully if all examples are classified correctly. Successful termination is guaranteed for linearly separable examples, that is, disjunctive categories. A typical reason for non-disjunctiveness is that the given set of features is not sufficient to discriminate the training examples. If the examples are not linearly separable, the algorithm fails in Line 9.

The algorithm does not provide a strategy for feature selection. However, the choice of the next feature (cf. Line 9) can follow a predefined selection strategy. In the simplest case the algorithm chooses a feature randomly or uses a predefined feature order.
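To make the procedure concrete, the following is a minimal sketch of the Cal 2 update cycle in Python, assuming binary feature vectors (such as those in Table 1 below) and a fixed feature order as the simplest selection strategy. The tree encoding and all names are illustrative assumptions, not the original implementation.

```python
# Minimal sketch of the Cal 2 update cycle (cf. Algorithm 1 below), assuming binary
# feature vectors and a fixed feature order as the simplest selection strategy.
# The tree encoding and all names are illustrative, not the original implementation.

STAR = "*"  # marker for an undecided leaf


def classify(tree, example):
    """Walk the tree; return the predicted label (or STAR) and the features used."""
    path = []
    while isinstance(tree, dict):                 # decision node
        feature = tree["feature"]
        path.append(feature)
        tree = tree["branches"][example[feature]]
    return tree, path


def replace_leaf(tree, example, new_subtree):
    """Replace the leaf reached by `example` with `new_subtree`."""
    if not isinstance(tree, dict):
        return new_subtree
    feature = tree["feature"]
    branches = dict(tree["branches"])
    branches[example[feature]] = replace_leaf(branches[example[feature]],
                                              example, new_subtree)
    return {"feature": feature, "branches": branches}


def cal2(examples, labels, feature_order):
    tree = STAR                                   # start with a single * node
    changed = True
    while changed:                                # repeat until no example changes the tree
        changed = False
        for x, y in zip(examples, labels):
            prediction, path = classify(tree, x)
            if prediction == y:                   # classified correctly: do nothing
                continue
            changed = True
            if prediction == STAR:                # undecided leaf: adopt the example's class
                tree = replace_leaf(tree, x, y)
            else:                                 # wrong class: split on the next unused feature
                unused = [f for f in feature_order if f not in path]
                if not unused:                    # not separable with the given features
                    raise ValueError("examples are not separable with the given features")
                f = unused[0]
                node = {"feature": f,
                        "branches": {v: (y if v == x[f] else STAR) for v in (0, 1)}}
                tree = replace_leaf(tree, x, node)
    return tree
```

The feature_order parameter stands in for the pluggable selection strategy mentioned above; an incremental criterion such as the igain measure discussed next could take its place.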
The well-known information gain as proposed in ID3 (Quinlan, 1986) is not a feasible feature selection criterion for Cal 2, because calculating the information gain of a specific feature requires all examples to be known in advance, so the learning would no longer be incremental. However, we are currently investigating incremental variants of information gain, for example, the igain measure (Zeller & Schmid, 2016). Calculating an incremental variant of the information gain implies that the already processed examples are stored explicitly and not only implicitly in the tree structure. Storing examples explicitly connects to the exemplar-based theories in cognitive psychology (Nosofsky & Palmeri, 1997; Jäkel, Schölkopf, & Wichmann, 2008).

Algorithm 1 Incremental Decision Tree Algorithm Cal 2 (Unger & Wysotzki, 1981)
 1: procedure Cal 2(examples)
 2:   start with a tree containing only * as node
 3:   while at least one of the examples changes the tree structure do
 4:     if the current example is classified correctly then
 5:       do nothing
 6:     if the current example is classified as * then
 7:       replace * with the class of the current example
 8:     if the current example is classified wrongly then
 9:       add the next feature to the tree
10:       set the branch with the feature value of the current example to the class of the current example and set * for all other branches

The combination of Cal 2 and igain was used to model human categorization behavior obtained in an experiment with stimuli in the form of lamps (Lafond, Lacouture, & Cohen, 2009). The lamps were described by four features with two discrete feature values each. The features describe the base, upright, shade, and top of each lamp and are denoted F1, ..., F4. The categorization task in the experiment is to decide whether a specific feature combination belongs to Category A or B. The knowledge structure for categorization has to be induced by the participants from nine training examples. The category distribution follows the 5-4 structure often used in psychology (see Table 1; Medin & Schaffer, 1978). The igain of the features helped to explain differences between decision trees when the presentation order of the input examples differed between learners (Zeller & Schmid, 2016). While ordering effects in incremental learning were addressed in early machine learning (cf. Fisher, 1993; Langley, 1995), they have only recently come into focus in the cognitive modeling of human category learning (cf. Carvalho & Goldstone, 2015; Mathy & Feldman, 2016).

Table 1. The 5-4 category structure (cf. Medin & Schaffer, 1978) and the input example order for the example in Figure 1.

Example  F1  F2  F3  F4  Class
e1        1   1   1   0   A
e2        1   0   1   0   A
e3        1   0   1   1   A
e4        1   1   0   1   A
e5        0   1   1   1   A
e6        1   1   0   0   B
e7        0   1   1   0   B
e8        0   0   0   1   B
e9        0   0   0   0   B

Order: e8, e2, e5, e6, e7, e1, e4, e9, e3

2.2 Least General Generalizations and Prototypes

In human category learning there is strong evidence that, especially for natural categories, humans do not learn feature combinations (rules) but prototypes (Rosch & Mervis, 1975). Categories over entities with discrete features are formed by family resemblance, that is, a prototype is given by the most frequent feature value for each feature. It could be shown that humans often assess similarity by (implicitly) counting the occurrences of feature values within a category and comparing them with the occurrences of feature values in the other categories.

A similar idea is described as least general generalization in machine learning, where a pattern for a set of examples of a class is generated (cf. Mitchell, 1977). For each feature, this pattern contains either a concrete feature value that matches all examples or a wildcard if more than one feature value occurs for this feature in the set of examples. Least general generalizations can be formed for symbolic features, but also for structural features, terms, and graphs (cf. Schmid, Hofmann, Bader, Häberle, & Schneider, 2010; Siebers, Schmid, Seuß, Kunz, & Lautenbacher, 2016).
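As an illustration, the following is a minimal sketch of the least general generalization over discrete feature vectors as just described. The wildcard symbol "?" follows the notation used in Figure 1 below; the function name and implementation are illustrative assumptions.

```python
# Minimal sketch of the least general generalization (most specific pattern) over
# discrete feature vectors: keep a value shared by all examples, otherwise insert
# the wildcard "?". The function name is illustrative.

WILDCARD = "?"


def most_specific_pattern(examples):
    """Least general generalization of a non-empty list of feature vectors."""
    pattern = list(examples[0])
    for example in examples[1:]:
        for i, value in enumerate(example):
            if pattern[i] != value:
                pattern[i] = WILDCARD
    return tuple(pattern)


# Example: the Category A examples e1 = (1, 1, 1, 0) and e2 = (1, 0, 1, 0) from
# Table 1 generalize to (1, "?", 1, 0), i.e. the pattern <1, ?, 1, 0> annotated at
# the corresponding leaf in Figure 1 below.
print(most_specific_pattern([(1, 1, 1, 0), (1, 0, 1, 0)]))  # (1, '?', 1, 0)
```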
3 An Incremental Decision Tree Algorithm Including Most Specific Patterns and Examples

Algorithm 2 shows our human-inspired decision tree learning approach, which realizes incremental learning of rules. Furthermore, to take account of the human flexibility of the categorization process, it integrates the storage of generalized patterns and of examples. That is, depending on the context of a categorization task, the learned knowledge structure allows a new object to be categorized by applying a rule, by pattern matching, or by similarity to known examples.

The core of our algorithm is Cal 2, which can be seen in the procedure include (cf. Line 5), where a new example is incorporated into a given tree. To take account of the flexibility of human categorization, the decision tree is enriched with least general generalizations at each decision node and at the leaves (cf. the most specific pattern, for example, in Line 23). Besides, the examples are stored at the leaves (cf., for example, Line 24). In contrast to Cal 2, our algorithm considers every input example only once and stores it in the matching current leaf. For this first version of the algorithm, which is restricted to disjunctive categories, the termination conditions are inherited from Cal 2.

Algorithm 2 Incremental Decision Tree Algorithm Including Most Specific Patterns and Examples
 1: procedure incrDTmsp(examples)
 2:   tree ← empty tree
 3:   for each example e ∈ examples do
 4:     tree ← include(e, tree)
      return tree
 5: procedure include(e, tree)
 6:   new_tree ← empty tree
 7:   if tree is empty then
 8:     new_tree ← (most specific pattern of e, label of e, e)
 9:   else if tree is a leaf with the same class as e then
10:     all_examples ← all examples of the leaf and e
11:     new_tree ← same(all_examples)
12:   else if tree is a leaf with a different class than e then
13:     all_examples ← all examples of the leaf and e
14:     new_tree ← different(all_examples)
15:   else
16:     branch ← the branch of tree matching e
17:     node_msp ← update the msp of the root node of branch with e
18:     new_branch ← (node_msp, include(e, branch))
19:     new_tree ← substitute branch with new_branch in tree
20:   return new_tree
21: procedure same(all_examples)
22:   branch ← empty tree
23:   msp ← most specific pattern of all_examples
24:   branch ← (msp, label of all_examples, all_examples)
25:   return branch
26: procedure different(all_examples)
27:   branch ← branch all_examples by a feature not yet used
28:   for each attribute in branch do
29:     if attribute has no examples then
30:       node ← empty leaf
31:     else if all examples in attribute have the same class then
32:       branched_examples ← examples of attribute
33:       node ← same(branched_examples)
34:     else
35:       branched_examples ← examples of attribute
36:       node_msp ← most specific pattern of branched_examples
37:       node ← (node_msp, different(branched_examples))
38:     branch ← add node for the attribute to branch
39:   return branch

We have realized this algorithm in Prolog and are currently investigating its ability to model reported findings on human category learning. The result of the algorithm depends on the category structure, the order of the input examples, and the feature selection criterion. Figure 1 shows the tree generated for the 5-4 category structure introduced above and the input example order given in Table 1 when igain is used for feature selection. In the figure, branches for feature value 1 are given before branches for feature value 0. The least general generalizations list the feature values in the order <F1, F2, F3, F4>; ? stands for the wildcard.

F4 = 1: F3 = 1: A {e3, e5}
        F3 = 0: F1 = 1: A <1, 1, 0, 1> {e4}
                F1 = 0: B <0, 0, 0, 1> {e8}
F4 = 0: F3 = 1: F1 = 1: A <1, ?, 1, 0> {e1, e2}
                F1 = 0: B <0, 1, 1, 0> {e7}
        F3 = 0: B {e6, e9}

Fig. 1. Generated tree for the 5-4 category structure and the input example order stated in Table 1, rendered as an indented text tree. Branches with feature value 1 are listed first, branches with feature value 0 second. The least general generalizations list the feature values in the order <F1, F2, F3, F4>; ? stands for the wildcard.
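To illustrate how such an enriched tree can be used flexibly, the following sketch shows the three categorization modes mentioned above: by rule, by pattern matching, and by similarity to stored examples. The tree encoding mirrors the annotations of Figure 1 (most specific pattern, label, and examples at the leaves); the Hamming-style similarity and all names are illustrative assumptions, since the paper leaves the concrete similarity measure open, and empty leaves are omitted for brevity.

```python
# Minimal sketch of the three categorization modes over the enriched tree of Section 3.
# Leaf:  {"label": class, "msp": pattern, "examples": [feature vectors]}
# Node:  {"feature": index, "branches": {value: subtree}, "msp": pattern}
# Encoding, names, and the Hamming-style similarity are illustrative assumptions.

WILDCARD = "?"


def categorize_by_rule(tree, x):
    """Follow the branches of the decision tree (rule application)."""
    while "feature" in tree:
        tree = tree["branches"][x[tree["feature"]]]
    return tree["label"]


def matches(pattern, x):
    return all(p == WILDCARD or p == v for p, v in zip(pattern, x))


def categorize_by_pattern(tree, x):
    """Return the label of a leaf whose most specific pattern matches x, if any."""
    if "feature" not in tree:
        return tree["label"] if matches(tree["msp"], x) else None
    for subtree in tree["branches"].values():
        label = categorize_by_pattern(subtree, x)
        if label is not None:
            return label
    return None


def categorize_by_similarity(tree, x):
    """Return the label of the stored example most similar to x (feature overlap)."""
    def leaves(t):
        if "feature" not in t:
            yield t
        else:
            for subtree in t["branches"].values():
                yield from leaves(subtree)

    candidates = [(sum(a == b for a, b in zip(e, x)), leaf["label"])
                  for leaf in leaves(tree) for e in leaf["examples"]]
    return max(candidates)[1]
```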
4 Future Work

The proposed algorithm is work in progress at an early stage and further aspects need to be considered: extensions of the learning algorithm and the evaluation of the proposed approach.

The current algorithm fails if a new example is categorized erroneously and there is no feature available to extend the tree such that correct categorization becomes possible. Such a category structure can, for example, result from non-disjunctive categories. Learning scenarios involving overlapping categories are often used in psychological studies to demonstrate that humans do not learn rules (Kruschke, 2008). We propose to keep the rule-based structure but to sample examples at the leaf nodes and to postpone branching decisions until a certain amount of evidence has been reached. That is, we introduce "delayed branching", which can reduce but not solve the problem of failing.

Delayed branching typically leads to decisions based on stronger evidence, because a larger sample of examples offers a better basis for pattern generalization as well as for similarity-based categorization. Besides, delayed branching might help to reduce overfitting. While in the tree given in Figure 1 most leaves are associated with patterns that are just single feature vectors and with single examples, trees constructed with delayed branching will specialize in this way less often. A simple approach to sampling is to define a threshold for the number of examples needed before a new branch is introduced. This idea is realized in Cal 3 (Unger & Wysotzki, 1981), which uses a parameter for the sample size. We plan to investigate more sophisticated criteria such as prior probabilities (Frermann & Lapata, 2016) or similarity-based coherence of examples (Michalski & Stepp, 1983).

Besides, the least general generalizations do not reflect the prototype approach in detail: prototypes are built from the most frequent feature values, while least general generalizations represent the feature values that are common among all examples. We therefore plan a variant of the algorithm where each node is annotated with the most frequent feature values instead of the least general generalization.
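As a sketch of this planned variant, a prototype in the sense of family resemblance can be computed as the most frequent value per feature; the function name and the tie-breaking behavior inherited from Counter are illustrative assumptions.

```python
# Minimal sketch of the planned prototype-style annotation: the most frequent value
# per feature over the examples at a node (family resemblance), in contrast to the
# least general generalization. Function name and tie breaking are illustrative.

from collections import Counter


def modal_prototype(examples):
    """Most frequent value per feature over a non-empty list of feature vectors."""
    return tuple(Counter(values).most_common(1)[0][0] for values in zip(*examples))


# For the Category A examples e1-e5 of Table 1 the prototype is (1, 1, 1, 1),
# whereas their least general generalization degenerates to ("?", "?", "?", "?").
category_a = [(1, 1, 1, 0), (1, 0, 1, 0), (1, 0, 1, 1), (1, 1, 0, 1), (0, 1, 1, 1)]
print(modal_prototype(category_a))  # (1, 1, 1, 1)
```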
Additionally, it might be interesting to make restructuring of the tree and forgetting of examples possible. In human category learning there is evidence that humans completely reject partially learned rules and start with a new rule (Unger & Wysotzki, 1981). If we assume that wrong category information, that is, noise, is rare, we could handle noise by forgetting rarely seen examples and restructuring the tree. This procedure reflects representational shifts (Johansen & Palmeri, 2002).

Currently, the categorization of a new example, both while learning and while using the tree, is strictly guided by the branching of the decision tree. That is, patterns and exemplars have no influence on learning and categorization. Hybrid theories of human category learning take into account that humans are flexible in using different strategies in different situations (Kruschke, 2008; Nosofsky, Palmeri, & McKinley, 1994; Rosch, 1983). The factors that determine which information in the tree is used need to be investigated and incorporated into the current algorithm.

Since our goal is to develop a human-like machine learning algorithm as well as a machine-learning-inspired cognitive model of category learning, the evaluation of the algorithm has to address performance characteristics as well as validity as a cognitive model. We plan to take a closer look at efficiency (time, number of training examples) and accuracy in comparison with other machine learning approaches, such as (other) decision tree learners, random forests, and inductive logic programming (Schmid, Zeller, Besold, Tamaddoni-Nezhad, & Muggleton, 2017).

Furthermore, we want to compare the learning steps as well as the resulting knowledge structure of our algorithm with human behavior, and we plan to predict human behavior with our algorithm. For this aim we want to select several learning domains which have been introduced in the cognitive science literature. We are especially interested in domains where it has been shown that humans make use of different strategies for learning and categorization. Among these domains is the lamp domain introduced above, but also artificial domains of letter strings and natural categories such as fruit (Rosch & Mervis, 1975).

References

Bruner, J. S., Goodnow, J. J., & Austin, G. A. (1956). A study of thinking: With an appendix on language by Roger W. Brown. New York, NY: Wiley.

Carvalho, P. F., & Goldstone, R. L. (2015). What you learn is more than what you see: What can sequencing effects tell us about inductive category learning? Frontiers in Psychology, 6.

Dietterich, T. G. (2002). Ensemble learning. The handbook of brain theory and neural networks, 2, 110–125.

Fisher, D. (1993). Ordering effects in incremental learning. In Training Issues in Incremental Learning: AAAI Spring Symposium at Stanford University (pp. 35–42). Menlo Park, CA: AAAI Press.

Frermann, L., & Lapata, M. (2016). Incremental Bayesian category learning from natural language. Cognitive Science, 40, 1333–1381.

Hunt, E. B., Marin, J., & Stone, P. J. (1966). Experiments in induction. New York, NY: Academic Press.

Jäkel, F., Schölkopf, B., & Wichmann, F. A. (2008). Generalization and similarity in exemplar models of categorization: Insights from machine learning. Psychonomic Bulletin & Review, 15(2), 256–271.

Johansen, M. K., & Palmeri, T. J. (2002). Are there representational shifts during category learning? Cognitive Psychology, 45, 482–553.
Kruschke, J. K. (2008). Models of categorization. In R. Sun (Ed.), Computational psychology (pp. 267–301). Cambridge University Press.

Lafond, D., Lacouture, Y., & Cohen, A. L. (2009). Decision-tree models of categorization response times, choice proportions, and typicality judgments. Psychological Review, 116, 833–855.

Langley, P. W. (1995). Order effects in incremental learning. In P. Reimann & H. Spada (Eds.), Learning in humans and machines: Towards an interdisciplinary learning science (pp. 154–167). Oxford: Elsevier.

Langley, P. W. (2016). The central role of cognition in learning. Advances in Cognitive Systems, 4, 3–12.

Mathy, F., & Feldman, J. (2016). The influence of presentation order on category transfer. Experimental Psychology, 63(1), 59–69.

Medin, D. L., & Schaffer, M. M. (1978). Context theory of classification learning. Psychological Review, 85, 207–238.

Michalski, R. S. (1993). Multistrategy learning. Boston: Kluwer Academic Publishers.

Michalski, R. S., Carbonell, J. G., & Mitchell, T. M. (1983). Machine learning: An artificial intelligence approach. Springer.

Michalski, R. S., & Stepp, R. (1983). Automated construction of classifications: Conceptual clustering versus numerical taxonomy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 5, 219–243.

Mitchell, T. M. (1977). Version spaces: A candidate elimination approach to rule learning. In Fifth International Joint Conference on AI (pp. 305–310). Cambridge, MA: MIT Press.

Mitchell, T. M. (1997). Machine learning. McGraw-Hill.

Nosofsky, R. M., & Palmeri, T. J. (1997). An exemplar-based random walk model of speeded classification. Psychological Review, 104(2), 266–300.

Nosofsky, R. M., Palmeri, T. J., & McKinley, S. C. (1994). Rule-plus-exception model of classification learning. Psychological Review, 101(1), 53–79.

Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1(1), 81–106.

Rosch, E. (1983). Prototype classification and logical classification: The two systems. In E. K. Scholnick (Ed.), New trends in conceptual representation: Challenges to Piaget's theory? (pp. 73–85). Hillsdale, NJ: Erlbaum.

Rosch, E., & Mervis, C. B. (1975). Family resemblances: Studies in the internal structure of categories. Cognitive Psychology, 7, 573–605.

Schmid, U., Hofmann, M., Bader, F., Häberle, T., & Schneider, T. (2010). Incident mining using structural prototypes. In N. García-Pedrajas, F. Herrera, C. Fyfe, J. M. Benítez, & M. Ali (Eds.), Trends in Applied Intelligent Systems: The Twenty Third International Conference on Industrial, Engineering & Other Applications of Applied Intelligent Systems, IEA-AIE 2010, Cordoba, Spain, 01.–04. June 2010 (Vol. 6097, pp. 327–336). Springer.

Schmid, U., Zeller, C., Besold, T., Tamaddoni-Nezhad, A., & Muggleton, S. (2017). How does predicate invention affect human comprehensibility? In J. Cussens & A. Russo (Eds.), Inductive Logic Programming – 26th International Conference, ILP 2016, London, UK, 04.–06. September 2016, Revised Selected Papers (Vol. 10326, pp. 52–67). Springer.

Siebers, M., Schmid, U., Seuß, D., Kunz, M., & Lautenbacher, S. (2016). Characterizing facial expression by grammars of action unit sequences: A first investigation using ABL. Information Sciences, 329, 866–875.

Thrun, S. (1998). Lifelong learning algorithms. In S. Thrun & L. Pratt (Eds.), Learning to learn (pp. 181–209). Boston, MA: Springer US.
Unger, S., & Wysotzki, F. (1981). Lernfähige Klassifizierungssysteme (Classification Systems Being Able to Learn). Berlin, Germany: Akademie-Verlag.

Zeller, C., & Schmid, U. (2016). Rule learning from incremental presentation of training examples: Reanalysis of a categorization experiment. In 13th Biannual Conference of the German Cognitive Science Society, Bremen, Germany, 26.–30. September 2016 (pp. 39–42).