The 18th European Conference on Artificial Intelligence


                      Proceedings


    1st International Workshop on
Combinations of Intelligent Methods and
       Applications (CIMA 2008)
                 Tuesday July 22, 2008
                    Patras, Greece


  Ioannis Hatzilygeroudis, Constantinos Koutsojannis
                   and Vasile Palade
Copyright © 2008 for the individual papers by the papers’ authors. Copying is
permitted for private and academic purposes. Re-publication of material from this
volume requires permission by the copyright owners.
                                       Table of Contents
Workshop Organization ……………………………………………………………………….. ii
Preface …………………………………………………………………………………………… iii
                                                    Papers
ANN for prognosis of abdominal pain in childhood: use of fuzzy modelling for
convergence estimation
  George C. Anastassopoulos and Lazaros S. Iliadis ……………………………………… 1

Using Genetic Programming to Learn Models Containing Temporal Relations from
Spatio-Temporal Data
  Andrew Bennett and Derek Magee ………………………………………………………… 7

Combining Intelligent Methods for Learner Modelling in Exploratory Learning
Environments
  Mihaela Cocea and George D. Magoulas ……………………………………………….. 13

Belief Propagation in Fuzzy Bayesian Networks
  Christopher Fogelberg, Vasile Palade and Phil Assheton ……………………………... 19

Combining Goal Inference and Natural-Language Dialogue for Human-Robot Joint
Action
  Mary Ellen Foster, Manuel Giuliani, Thomas Muller, Markus Rickert, Alois Knoll,
  Wolfram Erlhagen, Estela Bicho, Nzoji Hipolito and Luis Louro ………………………. 25

A Tool for Evolving Artificial Neural Networks
  Efstratios F. Georgopoulos, Adam V. Adamopoulos and Spiridon D. Likothanassis .. 31

Intelligently Raising Academic Performance Alerts
   Dimitris Kalles, Christos Pierrakeas and Michalis Xenos ………………………………. 37

Recognizing predictive patterns in chaotic maps
  Nicos G. Pavlidis, Adam Adamopoulos and Michael N. Vrahatis ……………………... 43

Improving the Accuracy of Neuro-Symbolic Rules with Case-Based Reasoning
  Jim Prentzas, Ioannis Hatzilygeroudis and Othon Michail …………………….……….. 49

Combinations of Case-Based Reasoning with Other Intelligent Methods (short paper)
  Jim Prentzas and Ioannis Hatzilygeroudis ……………………………………………..... 55

Combining Argumentation and Hybrid Evolutionary Systems in a Portfolio
Construction Application
  Nikolaos Spanoudakis and Konstantina Pendaraki and Grigorios Beligiannis ………. 59

An Architecture for Multiple Heterogeneous Case-Based Reasoning Employing
Agent Technologies (short paper)
  Elena I. Teodorescu and Miltos Petridis ...................................................................... 65
       Workshop Organization
             Chairs-Organizers
            Ioannis Hatzilygeroudis
            University of Patras, Greece

          Constantinos Koutsojannis
               TEI of Patras, Greece

                  Vasile Palade
               Oxford University, UK


            Program Committee
Ajaith Abraham, IITA, South Korea
Ao Sio Iong, Oxford University, UK
Plamen Agelov, Lancaster University, UK
Emilio Corchado, University of Burgos, Spain
George Dounias, University of the Aegean, Greece
Artur S. d’Avila Garcez, City University, UK
Melanie Hilario, CUI - University of Geneva, Switzerland
Elpida Keravnou-Papailiou, University of Cyprus, Cyprus
Rudolf Kruse, University of Magdeburg, Germany
George Magoulas, Birkbeck College, Univ. of London, UK
Vasilis Megalooikonomou, University of Patras, Greece
Toni Moreno, University Rovira i Virgili, Spain
Amedeo Napoli, CNRS-INRIA-University of Nancy, France
Ciprian-Daniel Neagu, University of Bradford, UK
Jim Prentzas, TEI of Lamia, Greece
Han Reichgelt, Southern Polytechnic State Univ., GA, USA
David Sanchez, University Rovira i Virgili, Spain
Douglas Vieira, University of Minas Gerais, Brazil


                Contact Chair
            Ioannis Hatzilygeroudis
   Dept. of Computer Engineering & Informatics
            University of Patras, Greece
           Email: ihatz@ceid.upatras.gr


                         ii
                                          Preface

The combination of different intelligent methods is a very active research area in Artificial
Intelligence (AI). The aim is to create integrated or hybrid methods that benefit from each
of their components. It is generally believed that complex problems can be easier solved
with such integrated or hybrid methods.
   Some of the existing efforts combine what are called soft computing methods (fuzzy
logic, neural networks and genetic algorithms) either among themselves or with more
traditional AI methods such as logic and rules. Another stream of efforts integrates case-
based reasoning or machine learning with soft-computing or traditional AI methods. Some
of the combinations have been quite important and more extensively used, like neuro-
symbolic methods, neuro-fuzzy methods and methods combining rule-based and case-
based reasoning. However, there are other combinations that are still under investigation.
In some cases, combinations are based on first principles, whereas in other cases they
are created in the context of specific applications.
   The Workshop is intended to become a forum for exchanging experience and ideas
among researchers and practitioners who are dealing with combining intelligent methods
either based on first principles or in the context of specific applications.
   There were totally 20 papers submitted to the Workshop. Each paper was reviewed by
at least two members of the PC. We finally accepted 12 papers (10 full and 2 short).
Revised versions of the accepted papers (based on the comments of the reviewers) are
included in these proceedings in alphabetic order (based on first author).
   Five of the accepted papers deal with combinations of Genetic Programming or
Genetic Algorithms with either non-symbolic methods, like Neural Networks (NNs) and/or
Kalman Filters (Georgopoulos etal, Spanoudakis etal), or symbolic ones, like Decision
Trees (Kalles etal) and Temporal Logic (Bennett and Magee). Another four papers deal
with combinations of Case-Based Reasoning (CBR). One of them presents a short survey
of CBR combinations (Prentzas and Hatzilygeroudis) and another one a combination with
Agents (Teodorescu and Petridis). The rest two of them present CBR combinations with a
Neuro-Fuzzy (Cocea and Magoulas) and a Neuro-Symbolic (Prentzas etal) approach
respectively, leading to multi-combinations. Also, another two papers concern
combinations of Fuzzy Logic with either NNs (Anastassopoulos and Iliadis) or Bayesian


                                               iii
Nets (Fogelberg etal). Finally, one of the papers combines a NN-based approach with a
Natural Language Processing one (Foster etal).
   Four of the above papers present combinations developed in the context of an
application. Applications involve Medicine (Anastassopoulos and Iliadis), Education
(Cocea and Magoulas, Kalles etal) and Economy (Spanoudakis etal).
   We hope that this collection of papers will be useful to both researchers and
developers.
   Given the success of this first Workshop on combinations of intelligent methods, we
intend to continue our effort in the coming years.


Ioannis Hatzilygeroudis
Constantinos Koutsojannis
Vasile Palade


                                             iv
                         ANN for prognosis of abdominal pain
                         in childhood: use of fuzzy modelling
                              for convergence estimation
                                         George C. Anastassopoulos, Lazaros S. Iliadis

Abstract. This paper focuses in two parallel objectives. First it            decision rules can predict which children are at risk for
aims in presenting a series of Artificial Neural Network models              appendicitis (appendicitis is the most common surgical condition
that are capable of performing prognosis of abdominal pain in                of the abdomen). One such numerically based system is based on
childhood. Clinical medical data records have been gathered and              a 6-part scoring system: nausea (6 point), history of local RLQ
used towards this direction. Its second target is the presentation and       pain (2 point), migration of pain (1 point), difficulty walking (1
application of an innovative fuzzy algebraic model capable of                point), rebound tenderness / pain with percussion (2 point), and
evaluating Artificial Neural Networks’ performance [1]. This                 absolute neutrophil count of >6.75 x 10`3/μL (6 point). A score <5
model offers a flexible approach that uses fuzzy numbers, fuzzy              had a sensitivity of 96.3% with a negative predictive value of
sets and various fuzzy intensification and dilution techniques to            95.6% for AA.
perform assessment of neural models under different perspectives.               To date, all efforts to find clinical features or laboratory tests,
It also produces partial and overall evaluation indices. The                 either alone or in combination, that are able to diagnose
produced ANN models have proven to perform the classification                appendicitis with 100% sensitivity or specificity have proven
with significant success in the testing phase with first time seen           futile. Also, there is only one research work [4] in bibliography
data.                                                                        based on ANN that deals with the abdominal pain prognosis in
                                                                             childhood.
                                                                                The incidence of Acute Appendicitis (AA) is 4 cases per 1000
1 INTRODUCTION                                                               children. However appendicitis despite pediatric surgeons’ best
                                                                             efforts remains the most commonly misdiagnosed surgical
    The wide range of problems in which Artificial Neural
                                                                             condition. Although diagnosis and treatment have improved,
Networks can be used with promising results, is the reason of their
                                                                             appendicitis continues to cause significant morbidity and still
growth [2, 3]. Some of the fields that ANNs are used are: medical
                                                                             remains, although rarely, a cause of death. Appendicitis has a
systems [4-6], robotics [7], industry [8 – 11], image processing
                                                                             male-to-female ratio of 3:2 with a peak incidence between ages 12
[12], applied mathematics [13], financial analysis [14],
                                                                             and 18 years. The mean age in the pediatric population is 6-10
environmental risk modelling [15] and others.
                                                                             years. The lifetime risk is 8.6% for boys and 6.7% for girls.
    Prognosis is a medical term denoting an attempt of physician to
                                                                                The 15 factors that are used in the routine clinical practice for
accurately estimate how a patient's disease will progress, and
                                                                             the assessment of AA in childhood are: Sex, Age, Religion,
whether there is chance of recovery, based on an objective set of
                                                                             Demographic data, Duration of Pain, Vomitus, Diarrhea, Anorexia,
factors that represent that situation. The inference about prognosis
                                                                             Tenderness, Rebound, Leucocytosis, Neutrophilia, Urinalysis,
of a patient when presented with complex clinical and prognostic
                                                                             Temperature, Constipation. The sex (males), the age (peak of
information is a common problem, in clinical medicine. The
                                                                             appearance of A.A in children aged 9 to 13 years), and the religion
diagnosis of a disease is the outcome of combination of clinical
                                                                             (hygiene condition, feeding attitudes, genetic predisposition) were
and laboratorial examinations through medical techniques.
                                                                             in relation with a higher frequency for AA. Anorexia, vomitus,
    In this paper various ANN architectures using different learning
                                                                             diarrhea or constipation and a slight elevation of the temperature
rules, transfer functions and optimization algorithms have been
                                                                             (370 C - 380 C) were common manifestation of AA. Additionally,
tried. This research effort was motivated form the fact that reliable
                                                                             abdominal tenderness principally in the RLQ of the abdomen and
and seasonable detection of abdomen pain constitute attainments in
                                                                             the existence of the rebound sign, are strongly related with AA.
effective treatment of disease and avoidance of relapses. That is
                                                                             Leucocytosis (>10.800 K/μl) with neutrophilia (neutrophil count >
why the development of such an intelligent model that can
                                                                             75%) is considered to be a significant clue for AA. Urinalysis is
collaborate with the doctors will be very useful towards successful
                                                                             useful for detecting urinary tract disease, normal findings on
treatment of potential patients.
                                                                             urinalysis are of limited diagnostic value for appendicitis.
                                                                                The role of race, ethnicity, health insurance, education, access to
                                                                             healthcare, and economic status on the development and treatment
2 DIAGNOSTIC FACTORS OF ABDOMINAL                                            of appendicitis are widely debated. Cogent arguments have been
PAIN                                                                         made on both sides for and against the significance of each
Several reports have described clinical scoring systems                      socioeconomic or racial condition. A genetic predisposition
incorporating specific elements of the history, physical                     appears operative in some cases, particularly in children in whom
examination, and laboratory studies designed to improve diagnostic           appendicitis develops before age 6 years. Although the disorder is
accuracy of abdominal pain [16]. Nothing is guaranteed, but                  uncommon in infants and elderly, these groups have a
Democritus University of Thrace, Hellenic Open University                    disproportionate number of compilations because of delays in
anasta@med.duth.gr, liliadis@fmenr.duth.gr                                   diagnosis and the presence of comorbid conditions.


                                                                         1
   As diagnosis, there are four stages of appendicitis, including           (TanH) transfer function the input data were normalized (divided
acute focal appendicitis, acute supurative appendicitis, gangrenous         properly) in order to be included in the acceptable range of [-3, 3]
appendicitis and perforated appendicitis. These distinctions are            to avoid problems such as saturation, where an element’s
vague, and only the clinically relevant distinction of perforated           summation value (the sum of the inputs times the weights) exceeds
(gangrenous appendicitis includes into this entity as dead intestine        the acceptable network range [17]. Standard back-propagation
functionally acts as a perforation) versus non-perforated                   optimization algorithms using TanH, or Sigmoid or Digital Neural
appendicitis (acute focal and supurative appendicitis) should be            Network Architecture (DNNA) transfer functions, combined with
made.                                                                       the Extended Delta Bar Delta (ExtDBD) or with the Quick Prop
   The present study is based on data set that is obtained from the         learning rules [18, 19] were employed. The ExtDBD is a heuristic
Pediatric Surgery Clinical Information System of the University             technique reinforcing good general trends and damping oscillations
Hospital of Alexandroupolis, Greece. It consisted of 516 children’s         [20].
medical records. Some of these children had different stages of                Modular and radial basis function (RBF) ANN applying the
appendicitis and, therefore, underwent operative treatment. This            ExtDBD learning rule and the TanH transfer function were also
data set was divided into a set of 422 records and another set of 94        used in an effort to determine the optimal networks. RBFs have an
records. The former was used for training of the ANN, while the             internal representation of hidden neurons which are radially
latter for testing. A small number of data records were used as a           symmetric, and the hidden layer consists of pattern units fully
validation set during training to avoid overfitting. Table 1                connected to a linear output layer [21, 22].
represents the stages of appendicitis as well as the corresponding
cases for each one. The 3rd column of Table 1 depicts the coding
of possible diagnosis, as they used for ANN training and testing            3.2 ANN evaluation metrics applied
stages.
                                                                            Traditional ANN evaluation measures like the Root Mean Square
           Table 1. Possible diagnosis and corresponding cases.             Error (RMS error), R2 and the confusion matrix were used to
                    Diagnosis                    Coding      Cases          validate the ensuing neural network models. It is well known that
                  Discharge                         -2         236
                                                                            the RMS error adds up the squares of the errors for each neuron in
       Normal                                                               the output layer, divides by the number of neurons in the output
                  Observation                       -1         186
                  No findings                       0           15
                                                                            layer to obtain an average, and then takes the square root of that
                                                                            average. The confusion matrix is a graphical way of measuring the
         Operative


                  Focal appendicitis                 1          34
         treatment


                  Phlegmonous              or                               network’s performance during the “training” and “testing” phases.
                                                    2           29          It also facilitates the correlation of the network output to the actual
                 Supurative appendicitis
                  Gangrenous appendicitis            3           8          observed values that belong to the testing set in a visual display
                  Peritonitis                        4           8          [17], and therefore provides a visual indication of the network’s
                                                                            performance. A network with the optimal configuration should
                                                                            have the “bins” (the cells in each matrix) on the diagonal from the
3 NEURAL NETWORK DESIGN                                                     lower left to the upper right of the output. An important aspect of
                                                                            the matrix is that the value of the vertical axis in the generated
Data were divided into two groups, the training cases (TRAC) and            histogram is the Common Mean Correlation (CMC) coefficient of
the testing cases (TESC). The TRAC consisted of 417 concrete                the desired (d), and the actual (predicted) output (y) across the
medical data records and the TESC consisted of 101. Each input              Epoch.
record was organised in a format of fifteen fields, namely sex, age,           Finally, the FUSETRESYS (Fuzzy Set Transformer Evaluation
religion, area of residence, pain time period, vomit symptoms,              System) that constitutes an innovative ANN evaluation system has
diarrhoea, anorexia, located sensitivity, rebound, wbc, poly,               been applied offering a more flexible approach [1].
general analysis of urine, body temperature, constipation. The
output record contained a single field which corresponded to the
potential outcome of each case.                                             3.3 Technical description of the FUSETRESYS
    The determination if the TRAC and TESC data sets was
performed in a rather random manner. The training and testing
                                                                            ANN evaluation model
sample size which would be sufficient for a good generalization             Fuzzy logic enables the performance of calculations with
was determined by using the Widrow’s rule of thumb for the LMS              mathematically defined words called “Linguistics” [1, 23-25].
algorithm which is a distribution free, worst case formula [2] and it       FUSETRESYS faces each training/testing example as a Fuzzy Set.
is shown in the following equation 1. W is the total number of free         It applies triangular or trapezoidal membership functions in order
parameters in the network (synaptic weights and biases) and ε               to determine the partial degree of convergence (PADECOV) of the
denotes the fraction of the classification errors permitted during          ANN for each training/testing example separately. The following
testing. The O notation shows the order of quantity enclosed within         equations 2 and 3 represent a triangular and a trapezoidal
[2].                        ⎛W ⎞                       (1)                  membership functions respectively [1].
                       N = O⎜ ⎟
                            ⎝ε ⎠                                                  μs(x;a,b,c)=max{min{
                                                                                                          x − a c − x },0} a<b<c         (2)
                                                                                                               ,
   In the case examined here with 417 training examples used, the                                         b−a c−b
classification error that could be tolerated would be about 4%.
                                                                              μs(x;a,b,c,d)= max{min{ x − a ,1, d − x },0}a<b<c<d              (3)
                                                                                                         b−a    d −c
3.1 Description of the experiments performed                                The model can produce various overall degrees of convergence
   During experimentations, numerous ANN architectures,                     (OVDECOV) for all of the training examples by applying either
learning algorithms and transfer functions were combined in an              fuzzy T-Norm or fuzzy S-Norm conjunction operations, depending
effort to obtain the optimal network. For the Tangent Hyperbolic            on the optimistic or pessimistic point of view of the developer. T-

                                                                        2
Norms tend to produce lower aggregation indices so in the case of                                          performing weights are kept: one copy for training and another one
ANN evaluation they can be considered as a pessimistic approach,                                           of the best performing weights thus far.
whereas the opposite happens with S-Norms [26]. In fact, each
distinct Norm evaluates the performance of an ANN under a                                                           Table 2. Structure of the four most successful ML ANN
different perspective. For example the drastic product assigns the                                                                       Hidden         Second
ANN a high OVDECOV only if it does not have extreme                                                          Optimization      Input       sub-         Hidden
                                                                                                                                                                       Learning
deviations between the desired and the produced classifications                                                                                                      Rule/Transfer
                                                                                                              algorithm        Layer      layer           sub-
                                                                                                                                                                       Function
during the training/testing process [1] whereas the Einstein T-                                                                          neurons         layer
Norm acts in a more average mode. The following equations 4 and
5 present the drastic product and the Einstein product T-Norms.                                              ANN ML#1.
                                                                                                                                                                        Genetic
More details on fuzzy conjunction operators can be found in [26-                                             Reinforcement
                                                                                                                                   15        7             7           Algorithm
                                                                                                              ANN using
28].                                                                                                                                                                    /TanH
                                                                                                            BackPropagation
μ ⎛~ ~⎞
         = Min {μ ~ (Χ), μ ~ ( Χ)} if Max {μ ~ (Χ), μ ~ ( Χ)} = 1 else
  ⎜⎜ A ∩ B ⎟⎟           A           B                             A         B                                ANN ML#2.                                                  Norm-
   ⎝        ⎠
                                                                                                              Multilayer           15        9             0          Cum_Delta/
                                                                                                            Backpropagation                                             TanH
μ⎛              = 0 (4) ⎛⎜      ~
                                        ⎞                  μ {X )μ ( X )                         (5)
                        μ ⎜ Α ∩ Β ⎟⎟ =
     ~ ~⎞                           ~                         ~        ~
                                                              A        B
  ⎜⎜ A∩B ⎟⎟                                                                                                  ANN ML#3.                                                  Norm-
   ⎝      ⎠                 ⎜           ⎟   2 − [ μ ~ { X ) + μ ~ ( X ) − μ ~ { X ) μ ~ ( X )]
                            ⎝           ⎠           A             B         A         B                       Multilayer           15        9             9          Cum_Delta/
                                                                                                            Backpropagation                                             TanH
   The fact that the FUSETRESYS evaluates each training/testing
example separately, offers a more clear view of the ANN’s                                                    ANN ML#4.
                                                                                                                                                                       ExtDBD/
performance. In this way the developers know if the network                                                   Multilayer           15        7             0
                                                                                                                                                                        TanH
operates extremely bad or well in specific cases.                                                           Backpropagation
   Also when there are several neurons in the output layer, the
traditional approaches produce separate evaluation results for each                                        Table 3 shows the architecture and structure of the four most
one whereas the FUSETRESYS can produce an additive                                                         successful radial basis function (RBF) ANN.
performance index (ADPERI) of the ANN. This could be done
under different perspectives and under different degrees of
                                                                                                                   Table 3. Structure of the two most successful RBF ANN
optimism [1].
                                                                                                                                               Number
   Finally the application of fuzzy set hedges offers the “dilution”                                                                               of                  Learning
and the “intensification” options. In this way by using the dilution                                        Optimization Input                             Output
                                                                                                                                      Proto neurons                  Rule/Transfer
approach the developer softens the membership function over the                                              algorithm      Layer                           Layer
                                                                                                                                                Hidden                 function
fuzzy set and weakens the membership constraints so that a point                                                                                 layer
of the Universe of discourse is “truer” than it would be before [1,
27]. On the contrary the intensification hardens the MF over the FS                                           ANN 5.                                                        Norm
                                                                                                            ANN6 Radial                                                   Cum_Delta /
and strengthens the membership constraints so that a point on the                                                             15        30          2            1
domain is “less true” than it used to be [1, 27]. The following                                                Basis
                                                                                                             Function                                                     Sigmoid or
equations 6 and 7 correspond to the intensification and dilution                                                                                                             TanH
functions respectively.
μint ensify ( A) ( X i ) = μ An ( X i ) (6) μ dilute ( A ) ( X i ) = μ An ( X i )
                                                                                1
                                                                                           (7)                The following Table 4 presents the training and testing results
                                                                                                           for the most successful ML and RBF ANN using R2. Also for the
  In this way the ANN can be evaluated strictly by using a “very                                           three most successful networks, namely 2,3,4 the FUSETRESYS
well fit” evaluation option, or in a more relaxed way by using the                                         model was applied to determine the average degree of
“somewhat fit” option. Of course it is in the developer’s hand to                                          convergence. According to the results, the most suitable network
decide the potential type of the ANN’s evaluation and the degree                                           was ML#3. The structure and the architecture of this successful
of dilution or intensification. For a more detailed description of                                         network have been described in the above Table 2.
FUSETRESYS please see [1].                                                                                      Table 4. Evaluation of the most successful ML and RBF ANN
                                                                                                                                                                 Average Degree
                                                                                                                                                                  of success in
4 RESULTS AND DISCUSSION                                                                                     ANN                                  R2 in           Testing using
                                                                                                                       R2 in Training
4.1 ANN analysis                                                                                             code                                Testing         FUSETRESYS
                                                                                                                                                                  for the three
Several experiments were performed. The following table 2                                                                                                           best ANN
presents the structure of the four most effective Back Propagation                                              1           0.8258               0.8247
(BP) multilayer (ML) neural networks. In all cases of ANN                                                       2           0.9615               0.9471                0.9699
models, the classical approach for overcoming the overfitting                                                   3           0.9721               0.9489                0.9716
problem has been followed. More specifically, a set of validation                                               4           0.9352               0.9588                0.9799
data have been provided to the algorithm in addition to the training                                            5           0.9114               0.9000
data. The algorithm has monitored the error with respect to this                                                6           0.9346               0.9400
validation set, while using the training set to drive the gradient
descent search. The number of weight tuning iterations performed                                             The following figure 1 is a graphical representation of the all the
by the system, were determined in each case based on the criterion                                         PADECOV of the ML#2, for each testing example. The absolute
of lowest error over the validation set. Two copies of the best                                            degree of convergence has the value of one. A serious effort was

                                                                                                       3
made towards the development of modular ANN (MODANN) for
the classification problem solution. The term MODANN refers to
the “adaptive” mixtures of local experts (LOCEXP) as proposed
by [29].
  They consist of a group of BP ANN referred to as local experts
competing to learn different aspects of a problem. A “gating ANN”                                                                       Figure 2. Gating Probabilities of the MODANN with code #1Ref.
controls the competition and learns to assign different parts of the
data space to different networks.                                                                                                               Table 6. Small sample of the PADECOV indices
                                                                                                                                                        PADECOV by FUSETRESYS
    Fuzzy set "ANN output equal


                                  1.2
                                                                                                                                                        ML#2            ML#3        #1REF
     Membership values to the


                                   1
         to the right class"


                                  0.8
                                                                                                                                                       0.83333         0.83333         1
                                  0.6
                                  0.4                                                                                                                  0.83333         0.83333       0.833
                                  0.2
                                   0
                                                                                                                                                          1               1            1
                                        1   7   13   19   25   31   37   43   49    55   61   67       73   79   85   91   97
                                                                                                                                                       0.83333         0.83333         1
                                                               Code number for each evaluated record
                                                                                                                                                          1               1            1
                                                                                                                                                       0.83333         0.83333         1
  Figure 1. Representation of all the PADECOV of the network ML# 2
                                                                                                                                                      0.833333        0.833333         1
The LOCEXP have the same architecture but they can apply
distinct learning rules or transfer functions. Also the number of the                                                                                         OVDECOV Einstein
output processing elements of the gating network is equal to the
number of LOCEXP used. The number of the neurons in the                                                                                                0.98299         0.97784       0.971
hidden layer of the gating network should be larger than the
number of the output processing elements [17].                                                                                      The above Table 6 presents a small sample of the 101 distinct
                                                                                                                                    PADECOV values produced by the FUSTRESYS.
  Table 5. Refereed Back Propagation using Gating Networks with two                                                                   Also the Einstein T-Norm was applied for the determination of
                       Competing local experts                                                                                      the overall degree of convergence of the ANN. The ML#2 ANN
                                                                                                                                    had a very high OVEDECOV index with a value of 0.98299
 Refereed #1 REF ANN using Gating ANN with 2 Local Experts
                                                                                                                                    whereas the other ML#3 ANN and the MODANN #REF1 had
 Learning                                                                                                                           OVEDECOV indices as high as 0.97. The Drastic Product T-Norm
   rule                                 Transfer                 Error                   Output                   Noise             was not applied in this research effort because it was proven
                                                Local Expert's #1 functions                                                         unnecessary from the data in table 5 where there were no serious
                                                                                                                                    indications of extreme bad ANN performance in any of the testing
  ExtDBD                                    TanH               standard                  Direct                  Uniform            examples.
                                                Local Expert's #2 functions
   Norm-                                                                                                                            Table 7. OVDECOV values when intensification and dilution of first order
   Cum                                      TanH               standard                  Direct                  Uniform             was applied using Einstein product and Triangular membership function
   Delta                                                                                                                                              OVDECOV of          OVDECOV of       OVDECOV of
                                                                                                                                                           ML #2             #1REF             ML # 3
                                                Local Expert's Architecture                                                             Dilution
                                                                                                                                                          0.99972            0.99934           0.99957
   Input                                                       Hidden                                            Output               “Partly fit”
  neurons                                                      neurons                                           neurons            Intensification
                                                                                                                                                          0.75932            0.64887           0.71033
     15                                                           6                                                1                “Very well fit”

                                                     Gating ANN functions
 Learning                                                                                                                           5 CONCLUSIONS
   rule                                 Transfer                 Error                   Output                   Noise
                                                                                                                                    The above research has obtained six ANNs with good level of
  ExtDBD                                    Linear             Standard                  SoftMax                 Uniform            convergence and it has proven that there exist at least four ANNs
                                                                                                                                    that have high performance indices, in the case of abdominal pain
The above table 5 presents the structure and the architecture of the                                                                classification. Namely the best ANNs are two ML BP ANN, a RBF
optimal MODANN that was developed for the medical                                                                                   ANN and a MODANN using a referee gating network and two
classification problem examined here. The performance of the                                                                        local experts. All of them have been described in the previous
developed modular network is very satisfying, having an R2 value                                                                    sections.
of 0.9434 and a FUSETRESYS produced average PADECOV                                                                                    A very interesting part of the whole research effort is the
equal to 0.9733 (using the Triangular membership function) in the                                                                   application of an innovative ANN evaluation model called
testing process using the first time seen testing data set.                                                                         FUSETRESYS that uses fuzzy logic and fuzzy algebra proposed in
   The following figure 2 depicts the gating probabilities for the                                                                  [11].
optimal MODANN..                                                                                                                       The new evaluation scheme has performed individual
                                                                                                                                    convergence indices namely PADECOV, for the output of each
                                                                                                                                    single data record used in the testing phase. The worst PADECOV
                                                                                                                                    value equals to 0.6666 which actually is the degree of membership
                                                                                                                                    of each data record to the FS “Actual output value equal to the
                                                                                                                                    desired value”. This worst case appears three times exactly in the

                                                                                                                                4
same cases of data records, for the ML#2, ML#3, #1REF ANN and                        [7]  V. Rankovic and I. Nikolic, ‘Control of industrial Robot using neural
it shows that the classification capacity of the developed networks                       network compensator’, Theoretical Applications of Mech., 32, No. 2,
is not bad even in the worst cases. This conclusion becomes                               147-163, (2005).
                                                                                     [8] S. Avramidis and L. Iliadis, ‘Wood-water sorption isotherm
stronger by considering the fact that the second worst PADECOV
                                                                                          prediction with artificial neural networks: a preliminary study’,
index has a value of 0.833.                                                               Holzforschung, 59, 336 – 341, (2005).
   If an overall ANN validation is performed the traditional                         [9] S. Mansfield, L. Iliadis, S. Avramidis, ‘Neural Network Prediction of
evaluation instruments agree with the FUSETRESYS that the most                            Bending Strength and Stiffness in Western Hemlock’,
suitable ANN is the ML BP with code# 4 whereas all of the other                           HOLZFORSCHUNG, 61, Issue 6, 707-716 Walter De Gruyter & Co
developed ANN have almost an equally good performance. The                                Berlin, New York, 2007.
Einstein T-Norm produces a higher “good performance index” for                       [10] B. Cannas, A. Fanni, A. Montisci, G. Murgia and P. Sonato,
the MODANN than the traditional methods.                                                  ‘Dynamic Neural Networks for Prediction of Disruptions in
                                                                                          Tokamaks’, Proceedings of the 10th International Conference on
   As it can be seen in table 7, the OVDECOV indices have very
                                                                                          Engineering Applications of Neural Networks (EANN), Thessaloniki,
high values for ML#2 and for REF#1 and ML#3 networks when a                               Greece. 2007.
“Partly fit” validation is performed. There is significant                           [11] L. Iliadis, S. Spartalis and S. Tachos, ‘A Fuzzy intelligent Artificial
differentiation when a very strict evaluation is done under the                           Neural Network evaluation System: Application in Industry’,
linguistic “Very well fit”. The OVDECOV indices fall from 0.99                            Proceedings of the 10th International Conference Engineering
to 0.75 for ML#2, from 0.99 to 0.65 for #REF and from 0.99 to                             Applications of Neural Networks, 320-326, Thessaloniki, Greece,
0.71 for ML#3 respectively. This is a very useful approach and it                         2007.
shows the actual power of FUSETRESYS due to the fact that it                         [12] R. Kuhn, R. Bordas, B. Wunderlich, B. Michaelis and D. Thevenin,
                                                                                          ‘Colour class identification of tracers using artificial neural networks’,
shows the differentiation of the average convergence degree of the
                                                                                          Proceedings of the 10th International Conference on Engineering
three ANN when more strict validation methods are applied. So                             Applications of Neural Networks (EANN), Thessaloniki, Greece,
ANN fed with the same data records in testing and appearing to                            2007.
have more or less the same performance, they are very seriously                      [13] L. Iliadis and S. Spartalis, ‘Artificial Neural Networks equivalent to
differentiated when more strict convergence validation methods are                        Fuzzy Algebra T_Norm conjunction operators’, Proceedings (Book of
performed.                                                                                extended abstracts) of the ICCMSE 2007 Published by the AIP
   The proposed ANN architecture faces the appendicitis prediction                        (American Institute of Physics), USA, 2007.
quite satisfactory, based on both the above presented results, and                   [14] P. Hajek and V. Olej, ‘Municipal creditworthiness Modelling by
                                                                                          clustering methods’, Proceedings of the 10th International
the pediatric surgeon’s opinion that used these ANNs in their
                                                                                          Conference on Engineering Applications of Neural Networks (EANN),
everyday routine clinical practice.                                                       Thessaloniki, Greece, 2007.
   The innovative ANN evaluation model that was applied                              [15] L. Iliadis, ‘A decision support system applying an integrated Fuzzy
successfully in this research effort will be used extensively in the                      model for long - term forest fire risk estimation’, Environmental
future, in an integrated effort to check its validity under various                       Modelling and Software, 20, Issue 5, 613-621, (2005).
perspectives.                                                                        [16] M. Blazadonakis, V. Moustakis, and G. Charissis, ‘Deep Assessment
                                                                                          of Machine Learning Techniques Using Patient Treatment in Acute
                                                                                          Abdominal Pain in Children’, Artificial Intelligence in Medicine, 8,
ACKNOWLEDGEMENTS                                                                          527- 542, (1996).
                                                                                     [17] Neuralware, Getting started. A tutorial for Neuralworks Professional
We would like to thank the pediatric surgeons of the Pediatric                            II/PLUS, Carnegie, PA,USA, 2001.
Surgeon Department of Medical School of Democritus University                        [18] R.A. Jacobs, ‘Increased rates of convergence through learning rate
                                                                                          adaption’, Neural Networks 1, 295-307, (1988).
of Thrace, for their contribution in the concession of the medical
                                                                                     [19] A.A. Minai, R.D. Wiliams, ‘Acceleration of back-propagation
records.                                                                                  through learning rate and momentum adaption’, International Joint
                                                                                          Conference on Neural Networks, I, 676-679, (1990).
                                                                                     [20] D.E. Rummelhart, G.E. Hinton and R.J. Wiliams, Learning internal
REFERENCES                                                                                representations by error propagation. Institute for Cognitive Science
                                                                                          Report 8506. San Diego, University of California (1985).
[1]   L. Iliadis, ‘An intelligent Artificial Neural Network evaluation system        [21] J. Platt, “A resource allocating network for function interpolation”,
      using Fuzzy Set Hedges: Application in wood industry’, Proceedings                  Neural Computation, 3, 213-225, (1991).
      of the 19th IEEE ICTAI The Annual IEEE International Conference                [22] J. Moody, C.J. Darken, ‘Fast learning in networks of locally tuned
      on Tools with Artificial Intelligence. IEEE Volume II, 366-370, Los                 processing units’, Neural Computation, 1, 281-294, (1989).
      Alamitos California, 2007.                                                     [23] R. Callan, The Essence of Neural Networks. Prentice Hall, UK, 1999.
[2]   S. Haykin, Neural Networks: A comprehensive foundation,. McMillan              [24] W. Pedrycz, ‘Structural interpolation and approximation with fuzzy
      College Publishing Company, New York, 1999.                                         relations: A study in knowledge reuse’, Journal Studies in Fuzziness
[3]   P. Picton, Neural Networks, (2nd edition) Palgrave, New York, USA,                  and Soft Computing, 215, 65-77, (2007).
      2000.                                                                          [25] B. J. Park, W. Pedrycz, S.K. Oh, ‘Fuzzy polynomial neural network:
[4]   D. Mantzaris, G. Anastassopoulos, A. Adamopoulos, I. Stephanakis,                   hybrid architectures of fuzzy modelling’, IEEE Trans. Fuzzy Systems.
      K. Kambouri and S. Gardikis, ‘Abdominal Pain Estimation in                          10 (5), 607-621, (2002).
      Childhood based on Artificial Neural Network Classification’, Proc.            [26] V. Kecman, Learning and Soft Computing, MIT Press. London
      of the 10th International Conference on Engineering Applications of                 England, 2001.
      Neural Networks (EANN 2007), 129-134, August, 2007.                            [27] E. Cox, Fuzzy Modeling and Genetic Algorithms for Data Mining and
[5]   G.P.K. Economou, D. Lymberopoulos, E. Karavatselou and C.                           Exploration, Elsevier Science, USA, 2005.
      Chassomeris, ‘A new concept toward computer-aided medical                      [28] T. Calvo, G. Mayor and R. Mesira, Aggregation Operators: New
      diagnosis - A prototype implementation addressing pulmonary                         Trends and Applications, (Studies in Fuzziness and Soft Computing).
      diseases’, IEEE Transactions on Information Technology in                           Physica-Verlag, Heildeberg, 2002.
      Biomedicine, 5 (1): 55-66, (2001).                                             [29] R.A. Jacobs, M.I. Jordan, S.J. Nowlan and G.E. Hinton, ‘Adaptive
[6]   J. Shieh, S. Fan and W. Shi, ‘The intelligent model of a patient using              mixtures of local experts’, Neural computation, 3, 79-87, 1991.
      artificial neural networks for inhalational anaesthesia’, J. Chin. Inst.
      Chem. Engrs., 33, No. 6, 609-620, (2002).


                                                                                 5
6
Using Genetic Programming to Learn Models Containing
    Temporal Relations from Spatio-Temporal Data
                                                  Andrew Bennett and Derek Magee 1


Abstract. In this paper we describe a novel technique for learning           and uno datasets, along with learning from video the structure of a
predictive models from non-deterministic spatio-temporal data. Our           set of mock CCTV cameras.
technique learns a set of sub-models that model different, typically            There has been much previous work on learning from spatio-
independent, aspects of the data. By using temporal relations, and           temporal domains. Traditional methods usually require a fixed di-
implicit feature selection, based on the use of 1st order logic expres-      mensionality vector, existing with canonical ordering / constant
sions, we make the sub-models general, and robust to irrelevant vari-        meaning, to represent the world. To construct this vector often re-
ations in the data. We use Allen’s intervals [1], plus a set of four novel   quires knowledge of the domain, making these methods hard to use
temporal state relations, which relate temporal intervals to the current     in a problem domain where the structure of the domain is variable,
time. These are added to the system as background knowledge in the           and not known a priori. One approach to modelling data of variable
form of functions. To combine the sub-models into a single model a           dimensionality is to take statistics of a variable size set [8]. This pro-
context chooser is used. This probabilistically picks the most appro-        duces a fixed set description, however spatial relationship informa-
priate set of sub-models to predict in a certain context, and allows         tion is lost in this process. If this information is important within a
the system to predict in non-deterministic situations. The models are        domain this leads to a poor model. Feature selection can be used to
learnt using an evolutionary technique called Genetic Programming.           find the most relevant subset of the data, which then allows for a more
The method has been applied to learning the rules of snap, and uno           general model to be built. However, the relevant subset may change
by observation; and predicting a person’s course through a network           from one context to another.
of CCTV cameras.                                                                Temporal modelling approaches such as Markov chains, Hid-
                                                                             den Markov Models (HMMs) and Variable Length Markov Mod-
                                                                             els (VLMMs) [7] use a description based on graphs to model state
1 Introduction
                                                                             transitions. These methods also usually need a fixed dimensional-
Learning predictive models from spatial-temporal data is, in general,        ity vector with canonical ordering for each observation. There does
a hard problem. Events and activities can have variations in their spa-      not have to be a fixed dimensionality for every observation vector,
tial, and temporal scope; include multiple (variable numbers of) ob-         as theoretically each observation vector can have a different number
jects; can overlap temporally with other events, and activities; and         of dimensions. It is possible to optimise their structure by using local
happen in a non-deterministic manner. A model for predicting spatio-         optimisation approaches based on information theory [3]. In VLMMs
temporal events must support this complexity. Our novel technique            this optimisation acts as kind of temporal feature selection, but as the
learns a set of sub-models that model different, typically indepen-          input variables stay in the same fixed order spatial feature selection
dent, aspects of data. The sub-models can, in addition to object prop-       is not performed.
erties, use temporal relations to describe the scene, and implicit fea-         Bayesian networks are a generalisation of probabilistic graph
ture selection, based on the use of 1st order logic expressions, to          based reasoning methods like HMMs and VLMMs. Again these net-
make them robust to irrelevant variations in the data. To combine the        works require a fixed input vector, but again their relational struc-
sub-models into a single model a context chooser is used. This picks         ture can be optimised by local search [12], genetic algorithms [5], or
the most appropriate set of sub-models to predict in a certain con-          MCMC [6] usually based on information theoretic criteria.
text, and allows the system to predict in non-deterministic situations.         An alternative to using graph based methods is to use (1st order)
Using the combination of sub-models and the context chooser also             logical expressions. Feature selection is implicit in the formalism
reduces the complexity of the model search space, and allows the             of these expressions. Logical expressions also make no assumptions
system to learn a global sub-model that matches most of the dataset,         about the ordering of variables, so there is no need to have a have
and then learn simple sub-models to cover the cases where the global         them in a fixed ordering. Progol [14] and HR [4] are Inductive Logic
sub-model does not work.                                                     Programming (ILP) methods. In general ILP takes data and generates
   This approach extends our previous work [2], by allowing a quali-         a set of logical expressions describing the structure of the data. Pro-
tative, as well as a markovian representation of time. This is done by       gol does this by iterative subsumption using a deterministic search
replacing the step-wise markovian view with temporal relations like          with the goal of data compression. HR does this by using a stochastic
Allen’s intervals [1], and a set of four additional relations to relate      search using a number of specialist operators. This is similar to Ge-
the temporal state of objects to the current time. We use Genetic Pro-       netic Programming which is described below. These approaches suf-
gramming to learn the models, and present an improved fitness func-          fer from a number of disadvantages. Firstly, logical expressions are
tion. The system has been successfully tested on handcrafted snap,           deterministic, so it is hard for then to model non-deterministic situa-
1 University of Leeds, UK, email: {andrewb,drm}@comp.leeds.ac.uk
                                                                             tions. However, there has been much work on combining (1st order)


                                                                             7
logic and probability to solve this problem [16] and [9]. Secondly                    An architecture to represent a model of spatio-temporal data, along
Progol’s search is depth bounded, which limits the size of problems                with associated learning methods is described in our previous work
it can work on, as explained in [15]. Thirdly Progol’s fitness function            [2]. We use this architecture as shown in Figure 1. It is broken down
is only based on how well the model compresses the data, and not                   into two parts: the sub-models, and the context chooser. The sub-
how well the model predicts the data. This can cause incorrect, or                 models each model a separate part of the underlying process gener-
invalid models to be produced.                                                     ating the data. Each sub-model contains two sections: a search sec-
    Genetic Programming (GP) [10] is a evolutionary method, similar                tion, and an output section. The search section looks for a particular
to genetic algorithms, for creating a program that model a dataset. In             pattern in the dataset. A query language, created by ourselves, hav-
a similar way to HR, it takes a dataset data, a set of terminals, and a            ing some similarity to SQL and Prolog, is used to describe the actual
set of functions; and using a set of operators generates a binary tree             search, and a binary tree is used to represent it. The output section
that models the data.                                                              describes what is implied if the search returns true. This will be a set
    Qualitative representations can be used to describe spatio-                    of entities and relations, and their properties the sub-model predicts.
temporal data in an abstract manner. [1] describes a set of seven tem-             Figure 2 shows an example of a sub-model.
poral relations to represent temporal interactions between objects.
    There has been previous work in learning of spatio-temporal mod-                                           Search                              Output
els from video by [15] who produced a system that could learn basic                                              &
card games. It had three parts: an attention mechanism, unsupervised
low-level learning, and high-level protocol learning. The attention                                  =                     =                               =
mechanism uses a generic blob tracker, that locates the position of
the moving objects. From this a set of features including: colour, po-                 Light1.colour(t−1) C1    Light2.colour(t−1) C0   Light3.colour(t)       C2
sition and texture are extracted. The data is clustered into groups. Us-
ing these clusters new input data is assigned its closest cluster proto-
type. A symbolic data stream is then created by combining together                 Figure 2. This shows a sub-model matching a traffic light with colour c1,
the clustered data, with time information. The symbolic stream is                     and a different light having a colour c0 both at current time - 1. If the
                                                                                   expression evaluates true it will output a new light which has a colour c2, at
passed to Progol, which builds a model of the data. Once the model                                               the current time.
has been learnt it can be applied to new data. This allows the system
to interact in the world.
    [17] looked at learning event definitions from video. A raw video                 The context chooser is used to decide how to combine the sub-
of a scene is converted into a polygon representation. This is then                models in different situations. It takes as its input a boolean vector de-
transformed into a force-dynamic model which shows how the ob-                     scribing which sub-models have evaluated true, and returned outputs,
jects in the scene are in contact with one another. Using this data and-           and using a probability distribution decides which ones will form the
meets-and (AMA) logic formulae describing the events are learnt us-                overall output. A context Sn is defined as a set of sub-models M
ing a specific-to-general ILP approach. Work in the area of learning               producing an output in a given context, for example Sn = M1 , M2
from spatial-temporal data, such as the previous two approaches have               represents that M1 , and M2 have search sections that have evalu-
inspired our work.                                                                 ated true at the same time. For each context a probability distribu-
    The reminder of this paper will take the following form. The sec-              tion over the possible combinations of model outputs for that con-
ond section looks at previous work about the architecture for the
                                                                                   P is defined, for example Pn (M1 ), Pn (M2 ), Pn (M1 , M2 ), where
                                                                                   text
models. The subsequent section looks at an extension to this work to                    P (j) = 1. This distribution is formed from the frequency of
                                                                                      j n
incorporate temporal relations into the sub-models. The subsequent                 occurrence of each situation in the training data in the given context.
section describes how these models are learnt by Genetic Program-                  This can be implemented as a sparse hash table.
ming. The subsequent section presents an evaluation of our system,
and the final section shows the conclusions of the work and the fur-
ther work.                                                                         3 Incorporating Temporal Relations into
                                                                                     Sub-models
                                                                                   To evaluate the sub-models history data from the world is required.
2 Architecture for Models of Spatio-Temporal                                       The search section of the sub-model uses data pointers to reference
  Data                                                                             particular data items in the history. The search section of the sub-
                                                                                   model is then evaluated with respect to this data. If the search sec-
                                                                                   tion evaluates true, then the output section is implied. In our previous
                                                                                   work [2] each data pointer could only reference fixed quantified time
                                                                                   points in the history, as shown in Figure 2. The use of this qualitative
                                                   Context chooser
                                                                                   markovian representation of time implies an exact ordering of the
                      ?
                                                                                   events. When multiple independent events are happening simultane-
                                                                                   ously this representation will fail, and an alternative method of repre-
                                                                                   senting temporal ordering is necessary. In order to quantify temporal
             Data                     Sub−models Output     Overall output         ordering in the data we use a combination of Allen’s intervals [1], and
                                                                                   four novel temporal state relations. Allen’s intervals describe tempo-
                                                                                   ral relations between objects. There are seven relations which are:
Figure 1. This figure shows the architecture of our model. It has two parts:
    a set of sub-models, and a context chooser to decide how to use the
                                                                                   meets, starts, finishes, during, before, overlaps, and equal to. Along
                    sub-models in different situations.                            with describing temporal relations between objects in the history, we
                                                                                   need to describe how the objects relate to the current time. An object

                                                                               2


                                                                                   8
goes through a series of temporal states, based on how its start and                 it, and replaces it with a randomly generated sub-tree. Reproduction
end time relates to current time, these are described Figure 3. Firstly              copies a program exactly as it is into the new generation. The pro-
the object is entering the world, its end time is unknown, but its start             grams in the new generation are then scored based on how well data
time is the same as the current time. Secondly the object exists in the              is predicted, and the process is repeated. The GP system will stop
world, again the end time is unknown, but its start time is less than                when a certain fitness score is reached, or a certain number of gener-
the current time. Thirdly the object is leaving the world and its end                ations has passed.
time is equal to the current time. Finally the object has left the world,                In our implementation of GP we assume that a program is a
where both its start, and end times are less than the current time.                  model containing a context chooser, and a set of sub-models. To ini-
                                                                                     tialise the population we generate a set of models just containing
                                                            Current Time             one randomly generated sub-model. The sub-model is produced us-
                                                                                     ing Koza’s ramped half and half method [10]. We apply a hierarchical
Entering                                                                             structure to our sub-models in a similar manner to [13], to try and cut
Current_time = start
                                                                                     down the search space, and to make finding a solution more efficient.
Existing                                                                                 A set of operators is then used to evolve the population. There are
Current_time > start                                                                 two kinds of operators. Firstly there are operators that try to optimise
Leaving                                                                              sub-models which are used in the model, and secondly there are op-
Current_time = end AND Current_time > start                                          erators that optimise the sub-models themselves. A technique called
Left                                                                                 tournament selection [10] is used to pick a model from the popula-
Current_time > end AND Current_time > start                                          tion. Tournament selection picks n models at random from the popu-
                                                                                     lation, and returns the one with the lowest score, for our experiments
                                                                                     we set n to be 5. The operators used to optimise sub-models which
                                                                                     are used in the model are shown below:
Figure 3. This shows the four temporal states, with respect to current time,
  an object can be in: entering, existing, leaving, and left. The dotted lines       Reproduction A set number of models are picked via tournament
     represent that we don’t know when the object will leave the world.                selection and copied directly into the new population.
                                                                                     Adding in a sub-model from another model Two models are
                                                                                       picked by tournament selection. A sub-model from the first
   Both the Allen’s intervals, and our additional temporal state re-                   picked model is randomly selected, and added to the second
lations, are represented in the system as functions of the data, that                  chosen model.
appear in the search section of the sub-models. These relations do                   Replacing a sub-model Again two models are picked by tourna-
not appear in the data; only the temporal range of individual objects                  ment selection, and a sub-model from the first chosen model is
occurs in the data. As the data pointers can be used over the entire                   then replaced by a sub-model randomly selected from the second
history, it is quite likely that a sub-model will evaluate on many dif-                chosen model.
ferent parts of the history. To resolve this issue we just use the result            Removing a sub-model A sub-model is picked by tournament se-
which includes the most recent data. The justification for this is the                 lection, and a randomly selected sub-model is removed.
sub-model will have already output this information at a previous
time in other situations.                                                               The only operator used to optimise the sub-models themselves is
                                                                                     crossover. In crossover two models are picked using tournament se-
                                                                                     lection. A sub-model from each model is then randomly selected, and
4 Learning the Models from Data                                                      standard crossover [10] is performed on these sub-models.
Previously in our previous work [2] it has been shown that it was in-                   To score the models a fixed length window is randomly moved
tractable to find the set of optimal sub-models by exhaustive search,                over the dataset. At each generation two random locations are picked:
for all but the simplest problems. The search space is complex, so a                 one for training, and one for testing. In the training phase the prob-
stochastic search method was chosen as an alternative. We use Ge-                    ability distribution used in the context chooser is calculated. In the
netic Programming [10], which has already been successfully used                     testing phase the fitness of a model (m) is evaluated over a win-
for pattern recognition tasks [11].                                                  dowed section of the dataset (w). For each position in the window the
   Genetic Programming (GP) [10] evolves a population of programs                    model is given a set of history data (h), calculated from the window,
until a program with the desired behaviour is found. It is a type of                 and is queried to produce a prediction. This produces a set of pos-
genetic algorithm, but the programs are stored as binary trees, and                  sible corresponding outputs (o), and a set of possible corresponding
not as fixed length strings. Functions are used for the nodes, and ter-              output likelihoods (ol). The similarity (C) of each output with the ac-
minals (for example constants, and variables) are used for the leaf                  tual output (r), is computed using the F indBestM atch function, as
nodes. In order for the population to evolve a fitness function (in our              shown in Equation 1. This function takes the set of actual output, and
case a predictive accuracy score) must be defined. This score will                   the set of model output, and firstly pads out them out with blank data
be used by the GP system to decide which programs in the current                     so that they are the same size. Then for each item in the actual output
generation to use to produce the next generation, and which ones                     set, a unique match in the model output set is found. For each of the
to throw away. To initialise the system, a set of randomly generated                 matches a comparison is done between the two objects. The compari-
programs must be created. Each then receive a score using the fitness                son looks at how similar each of the properties in the two objects are.
function. Algorithms including crossover, mutation and reproduction                  Each of the comparisons are summed together to produce a score that
use the programs from the current generation to create a new gener-                  shows how good that set of matches is. An exhaustive search is then
ation. Crossover takes two programs and randomly picks a sub-tree                    performed over all the possible combination of matches to find the
on each program, these two trees are swapped over, creating two new                  best (maximal) matching score. The result is then multiplied by its
programs. Mutation takes one program, randomly picks a sub-tree on                   output likelihood. From this the best (maximal) output is found. This

                                                                                 3


                                                                                     9
is then repeated over the rest of window, and the results summed and          5.3 Uno
then normalised to produce (S), as shown in Equation 2. This fitness
function is an improved version to the one described in our previous          The handcrafted uno dataset has a similar sequence to the snap
work [2], as it can be applied to non-deterministic datasets.                 dataset. Again the computer will initially see a blank scene. Then
                                                                              play will be heard. Next two cards, each one having one of three
               C(o, r) = F indBestM atch(o, r)                     (1)        possible coloured shapes on them, will be placed down either at the
                      1    X                                                  same time, or one by one. If the two card have the same coloured
          S(m, w) =      ∗    M axn (oln ∗ C(on , ri ))            (2)        shape on them the “same” is heard; or if they have shapes of the
                     |w|
                               i                                              same colour then “colour” is heard; or if they have the same shapes
   The system runs in two stages, and will stop running once it ex-           on then “shape” is heard; or if the cards are different then “nothing”
ceeds a maximum number of generations. Firstly the system is ini-             is heard. The cards are then removed either together, or one by one.
tialised in the manner described above, and then for five generations         Three datasets were created: a non-noisy training set, a noisy training
it works out the best set of sub-models to use in the models. To do           set, and a non-noisy test set. Each one contained around 50 rounds
this the system uses reproduction (10%), removing (10%), adding               of uno. Again noisy data was prepared by adding 10% of noisy data
(40%), and replacement (40%). Next the system will optimise these             to the non-noisy training data. The noise took the same form as the
models to find the best solution. It uses crossover (60%), reproduc-          noisy snap data.
tion (10%), removing (10%), adding (10%), and replacement (10%).

5 Evaluation                                                                                                                            2
Our method was evaluated on three different datasets, which were:                                                            3
handcrafted uno data, handcrafted snap data, and data from people
walking through a network of mock CCTV cameras. More detail                                                                                   0
about these datasets is presented in the following section.

5.1 CCTV Data of a Path
                                                                                                                                                      1
A 10 minute video of people walking along a path containing a junc-
tion was filmed. This was then used to mock up a network of CCTV                   Figure 4. This figure firstly shows a frame of the video with a person
cameras. Figure 4 shows a frame from the video. Virtual motion                     taking a decision at the junction point, and secondly it shows where the
                                                                                                      virtual detectors are on the video.
detectors, representing CCTV cameras, were hand placed over the
video has shown in Figure 4. Using frame differencing, and morpho-
logical operations, the video was processed to determine the location
of the motion. If the number of moved pixels in a region exceeded
a fixed threshold then the virtual detector outputted that motion had                                AND

occurred at that location. Hysteresis on the motion detection is im-
                                                                                         AND                   EQUAL
plemented as a 2 state, state machine (where the states are motion/no
motion). The state machine requires a numbers of frames (normally                                                                                      EQUAL
                                                                                                             GET    GET
10) of stability to change state. The data produced is then placed in               BEFORE     ENTERING    COLOUR COLOUR
a datafile with a motion event recorded per state change going from
no motion to motion. This was used to create a training datafile con-
taining 84 state changes and a test file containing 46 state changes.          CARD3 CARD2      CARD1      CARD2    CARD1


5.2 Snap
                                                                               Figure 5. This shows one of the sub-model results for the snap dataset. It
The snap dataset was handcrafted, but the format of it was similar             is predicting the equal state, by using the properties of three cards. If card3
to the snap dataset used in the work of [15]. The snap sequence is             occurs before card2; and card1 has just entered the world; and the colour of
the following: initially the computer will see a blank scene, then it            both card1, and card2 is the same then the sub-model evaluates true, and
                                                                                                           Equal will be returned.
will hear the word play, next two coloured cards will be seen. Either
they will be both put down at the same time, or put down one by
one. If they are the same then the word “equals” will be heard, oth-
erwise “different” will be heard. Then the cards are removed, again
either one by one, or at the same time. We ask the computer to only           6 Results
learn the sections where a human is speaking, as it would be im-
possible to accurately predict the next two cards because they are            To test the system five runs were allocated to each possible combi-
essentially random. Again three datasets were prepared: a non-noisy,          nation of dataset. For each run a different random number seed was
and noisy training set, and a non-noisy test set. All the datasets con-       used to initialise the system. The tests were run on a 2GHz machine
tained around 50 rounds of snap. The noisy data was generated by              having 8GB memory.
adding 10% noise to the non-noisy training set. The noise took the              To evaluate how well the models have been learnt they were tested
form of removing cards, removing the play state, and changing the             on a separate test set. Two metrics were used to evaluate the results:
output state, for example making the output not equal when it should          coverage, and prediction accuracy. Coverage (C) scores if the sys-
be equal.                                                                     tem can correctly predict the dataset (ie. the probability of correct

                                                                          4


                                                                              10
prediction is greater than 0%) and is the number of correct predic-                           Training Dataset    Number of runs      C(%)     A(%)
tions (pc) divided by the dataset size (d) as shown in Equation 3.                             Snap No Noise           1              99.9     99.9
                                                                                                                       4              99.8     99.8
Prediction accuracy (A) scores with what probability the correct pre-
                                                                                                Snap Noise             2              99.8     99.8
diction is made, and is the sum of the likelihoods of each correct                                                     1              99.8     96.6
prediction (pl) divided by the dataset size, as shown in Equation 4.                                                   1              96.0     94.8
In non-deterministic scenarios this will not be 100%.                                                                  1              96.0     93.3
                                                                                               Uno No Noise            1              99.7     99.7
                                    pc                                                                                 1              97.2     97.2
                              C=                                  (3)
                                     d                                                                                 1              94.0     92.0
                                                                                                                       1              91.5     89.7
                                                                                                                       1              88.8     88.8
                                         pl                                                      Uno Noise             2              96.8     88.4
                                 A=                                      (4)
                                          d                                                                            1              95.1     95.1
    Both the snap datasets were tested on a population size of 4000,                                                   1              94.3     90.4
                                                                                                                       1              88.5     87.1
and the system was run for 65 generations, taking around 5 hours                               Path No Noise           1              97.9     92.1
to do each run. All the runs using the non-noisy datasets were suc-                                                    1              97.9     89.3
cessful. However the models did not get 100% coverage because they                                                     1              96.8     88.8
failed to produce any output at the start of the test dataset as there was                                             1              95.8     90.0
                                                                                                                       1              94.7     88.5
insufficient items in the history. Figure 5 shows an example of this,                            Path Noise            1              93.6     85.8
as it will only evaluate once there are three cards in the history. Four                                               1              92.6     85.2
of the results did not predict the first two items in the test dataset,                                                1              91.5     82.4
and one of the results only failed to predict the first item. Two out of                                               1              90.8     83.3
                                                                                                                       1              90.1     80.7
the five runs using the noisy snap dataset got an exact solution. The
noise effected the models causing the sub-models to model incorrect
parts of the dataset. This was because some of the noise added to the               Figure 6. This figure shows the results for the snap, uno and path datasets.
                                                                                    The number of runs column shows for each training set how many runs got
noisy training set changed the outcomes for some rounds of snap,                                    the same coverage, and accuracy scores.
this then causes the system to model this noise, and to incorrectly
predict the outcomes in the test set. Again, like in the non-noisy snap
models there was problems predicting the start of the test dataset.                as a markovian representation of time. This technique is important
The models themselves made use of both the Allen’s intervals, and                  for a number of reasons. Firstly it produces models that are robust to
the temporal state relations. Figure 5 shows one of the sub-models                 irrelevant variations in data. Secondly, it allows the system to learn
produced from the non-noisy snap training set. It shows the use of                 from a dataset containing single actions, and then be able to predict
Allens intervals (the before relation), and the temporal state relations           from a dataset containing multiple overlapping actions.
(the enter relation). Most of the models contained four sub-models in                 In future work will be looking into using spatial, as well as tempo-
them.                                                                              ral relations in the system. We are also looking into trying out quanti-
    The uno datasets were run on a population of size 6000, and for                tative relations, so that a relation will not work on objects that are ei-
65 generations, taking around 7 hours to do each run. One out of five              ther too close, or too far away. We will also be looking into changing
runs on the non-noisy dataset managed to get the correct solution, but             the output from a sub-model based on what data the search section
it did not get 100% coverage because it did not have enough history at             has evaluated on. Finally we will be looking at speed improvements
the start of the test set to predict the initial items. The rest of the non-       to the system so that the run time can be reduced.
noisy results were very close to the solution, and probably needed
more generations to find the exact solution. The models themselves
were very similar to the models produced for the snap datasets. Both               REFERENCES
Allen’s intervals, and the temporal state relations were used. None
of the runs for the noisy dataset managed to produce an exact result,               [1] James Allen, ‘Maintaining knowledge about temporal intervals’, Com-
with the noise causing the sub-models to model incorrect parts of the                   munications of the ACM, 26, 832–843, (1983).
                                                                                    [2] A. Bennett and D. Magee, ‘Learning sets of sub-models for spatio-
dataset.
                                                                                        temporal prediction’, in Proceedings of AI-2007, the Twenty-seventh
    The runs using the path dataset used a population size of 2000, and                 SGAI International Conference on Innovate Techniques and Applica-
the system was run for 65 generations, taking around 3 hours to do                      tions of Artificial Intelligence, pp. 123–136, (2007).
each run. All the runs using the non-noisy dataset predicted well in                [3] Matthew Brand, ‘Pattern discovery via entropy minimization’, in Arti-
the main section of the test dataset, but failed to predict well at the                 ficial Intelligence and Statistics, (1998).
                                                                                    [4] Simon Colton, Alan Bundy, and Toby Walsh, ‘Automatic identification
start of the test dataset, due to lack of history. Some of the runs also                of mathematical concepts’, in International Conference on Machine
failed to predict infrequently occurring actions in the test set. In the                Learning, (2000).
runs using the noisy training set all the models learnt the frequently              [5] R. Etxeberria, P. Larranaga, and J.M. Picaza, ‘Analysis of the behaviour
occurring actions, but they all started to learn some of the noise in                   of genetic algorithms when learning bayesian network structure from
                                                                                        data’, Pattern Recognition Letters, 13, 1269–1273, (1997).
the dataset, and this effected their scores on the test dataset. Both the
                                                                                    [6] Nir Friedman and Daphne Koller, ‘Being bayesian about network struc-
non-noisy and noisy models used Allen’s intervals, and the temporal                     ture’, Machine Learning, 50, 95–126, (2003).
state relations.                                                                    [7] Aphrodite Galata, Neil Johnson, and David Hogg, ‘Learning behaviour
                                                                                        models of human activities’, in British Machine Vision Conference
                                                                                        (BMVC), (1999).
7 Conclusions                                                                       [8] Kristen Grauman and Trevor Darrell, ‘The pyramid match ker-
                                                                                        nel:discriminative classication with sets of image features’, in Inter-
We have extended the previous work of [2] and shown that that it is                     national Conference on Computer Vision, (2005).
possible, by the use of temporal relations, to use a qualitative, as well           [9] R Haenni, ‘Towards a unifying theory of logical and probabilistic rea-


                                                                               5


                                                                                   11
       soning’, in International Symposium on Imprecise Probabilities and
       Their Applications, pp. 193–202, (2005).
[10]   John Koza, Genetic Programming, MIT Press, 1992.
[11]   John Koza, Genetic Programming II, MIT Press, 1994.
[12]   Philippe Leray and Olivier Francios, ‘Bayesian network structural
       learning and incomplete data’, in Adaptive Knowledge Representation
       and Reasoning, (2005).
[13]   David Montana, ‘Strongly typed genetic programming’, in Evolution-
       ary Computation, (1995).
[14]   S.H. Muggleton and J. Firth, ‘CProgol4.4: a tutorial introduction’, in
       Relational Data Mining, 160–188, Springer-Verlag, (2001).
[15]   Chris Needham, Paulo Santos, Derek Magee, Vincent Devin, David
       Hogg, and Anthony Cohn, ‘Protocols from perceptual observations’,
       Artificial Intelligence, 167, 103–136, (2005).
[16]   N. J. Nilsson, ‘Probabilistic logic’, Artificial Intelligence, 28, 71–87,
       (1986).
[17]   Jeffrey Mark Siskind, ‘Grounding the lexical semantics of verbs in vi-
       sual perception using force dynamics and event logic’, Articial Intelli-
       gence Research, 15, 31–90, (2000).


                                                                                   6


                                                                                       12
      Combining Intelligent Methods for Learner Modelling
            in Exploratory Learning Environments
                                              Mihaela Cocea and George D. Magoulas 1


Abstract. Most of the existing learning environments work in well-             subsequent section briefly introduces the application domain, namely
structured domains by making use of or combining AI techniques in              mathematical generalisation, and the ELE used, called ShapeBuilder,
order to create and update a learner model, provide individual and/or          and discusses the challenges involved in performing learner mod-
collaboration support and perform learner diagnosis. In this paper we          elling. Section 3 presents a conceptual framework for the learner
present an approach that exploits the synergy of case-base reasoning           modelling process and describes the case-based formulation. Section
and soft-computing for learner modelling in an ill-structured domain           4 illustrates the process with an example, while Section 5 concludes
for exploratory learning. We present the architecture of the learner           the paper and outlines future work.
model, the knowledge formulation in terms of cases and illustrate its
application in an exploratory learning environment for mathematical
generalisation.
                                                                               2     EXPLORATORY LEARNING FOR
                                                                                     MATHEMATICAL GENERALISATION
                                                                               Mathematical generalisation (MG) is associated with algebra, as “al-
1     INTRODUCTION
                                                                               gebra is, in one sense, the language of generalisation of quantity. It
Several AI techniques have been proposed in intelligent learning en-           provides experience of, and a language for, expressing generality,
vironments, such as case-based reasoning [27], [10], bayesian net-             manipulating generality, and reasoning about generality” [20].
works [4], [6], neural networks [2], genetic and evolutionary algo-                However, students do not associate algebra with generalisation as
rithms [24], neuro–fuzzy systems [26], as well as synergistic ap-              the algebraic language is perceived as been separate from what it rep-
proaches, such as genetic algorithms and case-based reasoning [13],            resents [15]. To address this problem the ShapeBuilder [8] system,
hybrid rules integrating symbolic rules with neurocomputing [11],              which is an ELE under development in the context of the MiGen
and expert systems with genetic algorithms [18].                               project 2 , aims to facilitate the correspondence between the mod-
   Exploratory Learning Environments (ELEs) belong to a particular             els, patterns and structures (visual representations) that the learners
class of learning environments built on the principles of construc-            build, on one hand, and their numeric, iconic and symbolic repre-
tivism paradigm for teaching and learning. ELEs place the emphasis             sentations, on the other hand. ShapeBuilder allows the construction
on the opportunity to learn through free exploration and discovery             of different shapes [9], e.g. rectangles, L-shapes, T-shapes and sup-
rather than guided tutoring. This approach has proved to be bene-              ports the three types of representations aforementioned: (a) numeric
ficial for learners in terms of acquiring deep conceptual and struc-           representations that include numbers (constants or variables) and ex-
tural knowledge. However, discovery learning without guidance and              pressions with numbers; (b) iconic representations which correspond
support appears to be less effective than step-by-step guiding learn-          to icon variables; (c) symbolic representations that are names or sym-
ing environments [16]. To this end, an understanding of learner’s be-          bols given by users to variables or expressions. An icon variable has
haviour and knowledge construction is needed [22].                             the value of a dimension of a shape (e.g. width, height) and can be
   Most existing ELEs use simulations as a way of actively involving           obtained by double-clicking on the corresponding edge of the shape.
learners in the learning process (e.g. [28], [14]) and exploit cog-            It is represented as an icon of the shape with the corresponding edge
nitive tools [29] to support their learning. Few such systems model            highlighted (see Figure 1a).
learner’s knowledge/skills; for example [4] and [6] use bayesian net-              Constants, variables and numeric expressions lead to specific con-
works and [26] combines neural networks with fuzzy representation              structions/models, while icon variables and expressions using them
of knowledge. Another category of ELEs is closer to the construc-              lead to general ones. Through the use of icon variables, ShapeBuilder
tivist approach by allowing the learner to construct their own models          encourages structured algebra thinking, connecting the visual with
rather than explore a “predefined” one. Compared to conventional               the abstract (algebraic) representation, as “each expression of gen-
learning environments (even environments that use simulations), this           erality expresses a way of seeing” [20] (see Figure 1b). It also uses
type of ELE requires approaches to learner modelling that would be             the “messing up” metaphor [12] that consists of asking the learner to
able to capture and model the useful interactions that take place as           resize a construction and observe the consequences; the model will
learners construct their models.                                               “mess up” only if it is not general (see Figures 1c and d), indicating
   In this paper, we present an approach to learner modelling in ELEs          learner’s lack of generalisation ability.
(suitable for both exploring simulations and constructing models)                  When attempting to model the learner in an ELE for such a wide
that combines case-based reasoning with other AI techniques. The               domain as MG, several challenges arise. The main and widely ac-
1 The authors are with the London Knowledge Lab, Birkbeck College, Uni-        2 Funded by ESRC, UK, under TLRP e-Learning Phase-II (RES-139-25-
    versity of London, UK; email: {mihaela;gmagoulas}@dcs.bbk.ac.uk                0381); http://www.tlrp.org/proj/tel/tel_noss.html.


                                                                          13
                                                                                     learning process. As case-based reasoning offers flexibility of infor-
                                                                                     mation representation and soft computing techniques handle uncer-
                                                                                     tainty, a combination of the two is used. Moreover, previous research
                                                                                     has proved the benefits of combining case-based reasoning with neu-
                                                                                     ral networks [23] and fuzzy quantifiers [30]. In the following sub-
                                                                                     sections, the architecture of the system, the AI components and their
                                                                                     role are described.

                                                                                     3.1    The Architecture
                                                                                     The architecture of the “Intelligent” ShapeBuilder is represented in
                                                                                     Figure 2. As the learner interacts with the system through the inter-
                                                                                     face, the actions of the learner are stored in the Learner Model (LM)
                                                                                     and they are passed to the Interactive Behaviour Analysis Module
                                                                                     (IBAM) where they are processed in cooperation with the Knowl-
                                                                                     edge Base (KB); the results are fed into the LM. The Feedback Mod-
                                                                                     ule (FM) is informed by the LM and the KB and feeds back to the
 Figure 1. (a) A rectangular shape and its icon variable; (b) an expression          learner through the interface.
  using icon variables; (c) “messing up”; (d) general solution that does not
                                 “mess up”.


knowledged challenge is to balance freedom with control: learners
should be given enough freedom so that they can actively engage in
activities but they should be offered enough guidance in order to as-
sure that the whole process reflects constructivist learning and leads
to useful knowledge [21]. This and some other challenges are illus-
trated in Table 1 with examples from the domain of MG.

     Table 1.  Applying learner modelling in ELEs for mathematical
                             generalisation.
 Challenge        Example                                                               Figure 2. Schematic of an intelligent architecture for ShapeBuilder.
 Balance     be- When a learner is trying to produce a general represen-
 tween freedom tation, for how long should he be left alone to explore
 and control      and when does guidance become necessary?
 What should be Besides learner’s knowledge of MG concepts (e.g.                        The KB includes two components (see Figure 2): a domain and
 modelled?        use of variables, consistency between representations,             a task model. The domain model includes high level learning out-
                  etc.), other aspects need to be modelled in order to
                  support the learner during exploration: shapes con-                comes related to the domain (e.g. using variables, structural reason-
                  structed, relations between shapes, etc.                           ing, consistency, etc.) and considers that each learning outcome can
 Do both correct In exploratory learning it is difficult to categorise ac-           be achieved by exploring several tasks. The task model includes dif-
 and incorrect tions or learner’s explorations into “correct” and “in-               ferent types of information: (a) strategies of approaching the task
 actions or be- correct”. Moreover, actions that might lead to incor-                which could be correct, incorrect or partially correct; (b) outcomes
 haviours have rect outcomes such as resizing can be more valuable
 value?           for constructivist learning than “correct” actions.                of the exploratory process and solutions to specific questions associ-
 Reasoning        Can consistency be inferred from the fact that a learner           ated with each (sub)task; (c) landmarks, i.e. relevant aspects or criti-
 about abstract is checking the correspondence between various forms                 cal events occurring during the exploratory process; (d) contexts, i.e.
 knowledge        of representations? If so, is that always true? Are there          reference to particular (sub)tasks.
                  any exceptions to this rule?
                                                                                        The IBAM component combines case-based reasoning with soft
 Underlying       As it is neither realistic nor feasible to include all pos-
 strategies       sible outcomes (correct or incorrect) to model the do-             computing in order to identify what learners are doing and be able
                  main of MG, only key information with educational                  to provide feedback as they explore a (sub)task. More specifically, as
                  value could be stored, such as strategies in solving               they are working in a specific subtask, which specifies a certain con-
                  a task. The challenge is how to represent and detect               text, their actions are preprocessed, current cases are identified and
                  them.
                                                                                     matched to the cases from the Task Model (the case base). Prior to
                                                                                     matching, local feature weighting [23] is applied in order to reflect
                                                                                     the importance of the attributes in the current context.
                                                                                        In the FM component, multicriteria decision making [7] will be
3    A CONCEPTUAL FRAMEWORK FOR                                                      used to obtain priorities between several aspects that require feed-
     LEARNER MODELLING                                                               back depending on the context.

Given the challenges mentioned in Table 1 a conventional learner
                                                                                     3.2    Case-based Knowledge Representation
modelling approach does not fit the purposes of ELEs. Due to the ex-
ploratory nature of the activities and the diversity of possible trajec-             In case-based reasoning (CBR) [17] the knowledge is stored as cases,
tories, flexibility in the representation of information and handling of             typically including the description of a problem and the correspond-
uncertainty are two important aspects for effectively supporting the                 ing solution. When a new problem is encountered, similar cases are


                                                                                14
searched and the solution is adapted from one or more of the most                   attributes of a generic case for ShapeBuilder is presented in Table 2.
similar cases.                                                                      The first v attributes (αij , j = 1, v) are variables, the ones from
   Although CBR has been used successfully in applications for do-                  v + 1 to w are numeric (αij , j = v + 1, w) and the rest are binary
mains like legal reasoning [1], stock market prediction [5], recom-                 (αij , j = w + 1, N ).
mender systems [19], and other areas, there is little research on using                The set of relations between attributes of the current case and at-
CBR for e-Learning environments. For example, [10] use CBR in                       tributes of other cases (as well as attributes of the same case) is
the learner modelling process and call this approach case-based stu-                represented as RAi = {RAi1 , RAi2 , . . . , RAiM }, where at least
dent modelling; while [13] use CBR and genetic algorithms to con-                   one of the attributes in each relation RAim , ∀m = 1, M , is from
struct an optimal learning path for each learner. CBR is used also in               the set of attributes of the current case Fi . Two types of binary
[27] within a case-based instruction scenario rather than a method                  relations are used: (a) a dependency relation (Dis ) is defined as
for learner modelling. We have not found any references in the liter-               (αik , αjl ) ∈ Dis ⇔ αik = DEP (αjl ), where DEP : αik → αjl
ature to ELEs that use CBR or CBR combined with other intelligent                   for attributes αik and αjl that are variables of cases i and j (where
methods.                                                                            i = j or i 6= j), and means that αik depends on (is built upon)
   The advantage of CBR for learning environments and especially                    αjl (if i = j, k 6= l is a condition as to avoid circular dependen-
for ELEs is that the system does not rely only on explicit representa-              cies) (e.g. the width type of a case is built upon the height type of
tion of general knowledge about a domain, but it can also use specific              the same case; the width type of a case is built upon the width type
knowledge previously experienced [10]. It also seems promising for                  of another case, an so on); (b) a value relation (Vis ) is defined as
improving the effectiveness of complex and unstructured decision                    (αik , αjl ) ∈ Vis ⇔ αik = f (αjl ), where αik and αjl are numeric
making [13] in combination with other computing methods.                            attributes and f is a function and could have different forms depend-
   In our research, CBR is used in the learner modelling process.                   ing on context (e.g. the height of a shape is two times its width; the
The cases contain information describing models that learners should                width of a shape is three times the height of another shape, etc.). The
construct using ShapeBuilder. Different strategies in approaching a                 set of relations between attributes is presented in Table 3.
problem (i.e. constructing a model to meet a particular learning ob-
jective) are represented as a series of cases that reflect possible ex-
                                                                                          Table 3.   The set of relations between attributes (RAi ) of cases.
ploratory trajectories of learners as they construct models during the
various (sub-)tasks.                                                                Relation                     Label                       Example
                                                                                                                                              
                                                                                    Dependency relation       Di1 (RAi1 )           αik , αjl ; k, l = 2, v; ∀j
               Table 2.   The set of attributes (Fi ) of a case.                                                   ..                             ..
                                                                                                                    .                          .
  Category        Name             Label            Possible Values                                           Dit (RAit )           αik , αjl ; k, l = 2, v; ∀j
  Shape         Shape type          αi 1      Rectangle(/L-Shape/T-Shape)           Value relation           Vi1 (RAit+1 )        αik , αjl ; k, l = v + 1, w; ∀j
  Dimensions    Width type          αi 2        constant (c)/variable (v)/                                          ..                             ..
  of shape                                         icon variable (iv)/                                               .                             .
                                               numeric expression (n exp)/                                   Viz (RAiM )          αik , αjl ; k, l = v + 1, w; ∀j
                                              expression with iv(s) (iv exp)
               Height type         αi 3          c /v /iv /n exp /iv exp
                     ..              ..                       ..
                      .               .                        .                       The set of relations between cases is represented as RCi =
             Thickness type        αi v          c /v /iv /n exp /iv exp            {RCi1 , RCi2 , . . . , RCiP }, where one of the cases in each relation
              Width value         αiv+1              numeric value                  RCij , ∀j = 1, P is the current case (Ci ). Two relations about or-
              Height value        αiv+2              numeric value                  der in time are defined: (a) P rev relation indicates the previous case
                      ..              ..                       ..
                       .               .                        .                   with respect to the current case: (Ci , Cj ) ∈ P rev if t (Cj ) < t (Ci )
             Thickness value       αi w          c /v /iv /n exp /iv exp            and (b) N ext relation indicates the next case with respect to the cur-
  Part of     P artOf S1          αiw+1                      1                      rent case: (Ci , Ck ) ∈ N ext if t (Ci ) < t (Ck ). Each case includes
  Strategy    P artOf S2          αiw+2                      0                      at most one of each of these two relations (p ≤ 2).
                    ..              ..                       ..                        A strategy is defined as Su = {Nu (C), Nu (RA), Nu (RC)},
                     .               .                        .
              P artOf Sr           αiN                       0                      u = 1, r , where Nu (Ci ) is a set of cases, Nu (RAi ) is a set of re-
                                                                                    lation between attributes of cases and Nu (RCi ) is a set of relations
                                                                                    between cases.
    A case is defined as Ci = {Fi , RAi , RCi }, where Ci represents
the case and Fi is a set of attributes. RAi is a set of relations between
attributes and RCi is a set of relations between Ci and other cases
                                                                                    3.3     Comparing Cases, Exploiting Context and
respectively.
                                                                                            Modelling Learning Trajectories
    The set of attributes is represented as Fi = {αi1 , αi2 , . . . , αiN }.        In this section we present three distinctive features of the proposed
It includes three types of attributes: (a) numeric, (b) variables and (c)           framework: comparing cases, exploiting context and modelling of
binary. Variables refer to different string values that an attribute can            learning trajectories.
take, and binary attributes indicate whether a case can be considered
in formulating a particular strategy or not. This could be represented
as a “part of strategy” function: P artOf Su : Ci → {0, 1},                         Comparing cases. The most common definition of similarity is a
                                                                                   weighted sum of similarities of attributes of cases [17]:
                                      1 if Ci ∈ Su
                  P artOf Su =
                                      0 if Ci ∈  / Su ,                                                          PN
                                                                                                                          × sim(fiI , fiR )
                                                                                                                    i=1 oiP
                                                                                                       SIR =                n               ,
where Su represents a strategy and is defined further on. The set of                                                        i=1 oi


                                                                               15
where oi represents the weight of each attribute, sim is a similarity            Learning trajectories. A string of cases connected with relations
function, and I and R stand for input and retrieved cases, respec-              in time yields a knowledge structure that represents learner’s explo-
tively. In our case, four similarity measures are defined for compar-           rations/learning trajectory in the ELE during a task or sub-task. Such
ing cases:                                                                      a learning trajectory is constructed by successively applying P rev
                                                                                and N ext relations to Ci in order to get cases previous in time to
1. Euclidean
           qdistance is used for comparing numeric attributes:                  Ci and cases following Ci , respectively. Comparing trajectories in
             Pw
                                                                                the KB to the current trajectory (this is useful to provide support and
               j=v+1 oj × (αIj − αRj )
   DIR =                                2

2. The following metric is used for attributes that are variables:              decide on scaffolding techniques) is done in two stages: comparing
          P  v
             j=1 g(αIj ,αRj )
                                                                                the past and evaluating the future.
   VIR =           v
                                , where g is defined as:                           Comparison of the past with respect to a reference point (e.g. a
                                                                               selected case) depends on the depth of the evaluation in terms of
                                        1   if αIj = αRj                        samples taken into account and rules than concern comparisons of
                g(αIj , αRj ) =
                                        0   if αIj 6= αRj ,                     the past, e.g. IF the actual trajectory is similar to a trajectory in the
                                                                                KB, indicated by a reference case representing a starting point in the
3. In a similar way to [25], we define the following metric for com-            past, THEN this trajectory is a past-matching trajectory.
                                                  |RAI ∩RAR |
   paring relations between attributes: PIR = |RA     I ∪RAR |
                                                               . PIR is            When it comes to evaluating the future of a trajectory, comparison
   the number of relations between attributes that the input and re-            is based on the similarity between the future of a trajectory in the KB
   trieved case have in common divided by the total number of rela-             with a desired future for the current trajectory. This is expressed by
   tions between attributes of the two cases.                                   rules of the general form: IF a piece of the future trajectory of a past-
4. Similarity in terms of relations between cases is defined by TIR =           matching trajectory resembles the reference starting from a selected
   |RCI ∩RCR |
   |RCI ∪RCR |
                , where TIR is the number of relations between cases            case, THEN the reference can be met by applying certain strategies.
   that the input and retrieved case have in common divided by the                 As it is not possible to represent all learning trajectories in the
   the total number of relations between cases of I and R.                      KB of an ELE, similarity is measured in terms of convex fuzzy sets,
                                                                                whose width might change depending on the context and the amount
   In order to identify the closest strategy to the one employed by a           of information available, i.e. the current trajectory can be interpreted
learner, cumulative similarity measures are used for each of the four           in more vague way by increasing the width of the fuzzy set. Also if
types of similarity:                                                            the distance between past and future is large for certain tasks, it does
                                                                                not make sense to evaluate the future carefully. Nevertheless, if the
                       ( zi=1 DIi Ri )/z.
                        P
1. Numeric attributes:
               Pz                                                               distance to a reference (desired outcome) is small, the future needs to
2. Variables: ( i=1 VIi Ri )/z. P                                               be evaluated accurately. So the depth of the evaluation is measured
                                   z
                             Pz( i=1 PIi Ri )/z.
3. Relations between attributes:                                                by a fuzzy time distance set to evaluate both short and long time
4. Relations between cases. ( i=1 TIi Ri )/z.                                   distances.

where z represents the minimum number of cases among the two
compared strategies. The strength of similarity between the current             4   AN ILLUSTRATIVE EXAMPLE
strategy and the various stored strategies is defined as the maximum
combined similarity of these four measures among the various strate-            To illustrate the combination of intelligent methods for learner mod-
gies compared.                                                                  elling we use an example from the mathematical generalisation do-
                                                                                main, and a task called “pond tiling”, which is common in the En-
                                                                                glish school curriculum and expects learners to produce a general
 Exploiting context. Attributes and relations stored in cases have
                                                                                expression for finding out how many tiles are required for surround-
different relevance depending on the context, which in ShapeBuilder
                                                                                ing any rectangular pond [8]. The high level learning objective in the
corresponds to different stages of the constructivist learning process
                                                                                Domain Model is to acquire the ability to perform structural reason-
that learners go through as they explore the various sub-tasks within a
                                                                                ing [9]. In order to achieve this, sub-tasks can be explored in Shape-
learning activity. Typically, a task includes several sub-tasks, and the
                                                                                Builder, e.g. construct a pond of fixed dimensions, surround the pond
activity is sequenced within the system so as to know at any time the
                                                                                with tiles and determine how many are required; generalise the struc-
current context. As the environment allows the learners to explore,
                                                                                ture using icon variables.
they may “jump” to different stages in the activity sequence.
    Context dependence can be taken into account by having differ-
ent weights for attributes and relations depending on the stage of the           Knowledge representation. The Task Model for pond tiling in-
learning process within a task or activity. The weights could be ob-            cludes: (a) strategies identified in pilot studies [9], e.g. thinking in
tained through an approach called local feature weighting [23] that             terms of areas (see Figure 3a) or in terms of width and height (see
uses Neural Networks (NNs). The principle of the training algorithm             Figures 3b, c, d, e and f); (b) outcomes, e.g. model built, number
is to reduce the distance between cases of the same class and increase          of tiles for surrounding a particular pond, and solution, i.e. the gen-
the distance between cases of different classes [23], where the var-            eral expression (see Figure 3 for the solutions corresponding to each
ious classes in ShapeBuilder correspond to types of context (stages             strategy; for the “area strategy” the solution with icon variables is
of the learning process) of the various (sub-)tasks. Thus, a neural             displayed in Figure 1b); (c) landmarks, e.g. for the area strategy: cre-
network is trained in order to identify the context and several net-            ating a rectangle with height and width greater by 2 than the pond;
works (one for each context) are used to provide the context-specific           for the width and height strategies: using rows/column of tiles; slips:
weights. This approach appears to be more robust than other weight-             several correct actions followed by an incorrect one (e.g. correct sur-
ing schemes due to the generalisation capacities of the NNs that can            rounding of the pond, partially correct expression, but missing a 2 in
produce weights even in imprecise situations [23].                              the formula); (d) the context of each (sub-)task.


                                                                           16
                                                                                         Table 5.   The set of attributes (Fi ) for the cases in the “I strategy”.

                                                                                       Label      C1              C2             C3             C4             C5
                                                                                       αi 1    Rectangle      Rectangle      Rectangle      Rectangle      Rectangle
                                                                                       αi 2    c/v/n exp      iv /iv exp     iv /iv exp     c/v/n exp      c/v/n exp
                                                                                       αi 3    c/v/n exp      c/v/n exp      c/v/n exp      iv /iv exp     iv /iv exp
                                                                                       αi 4        5               7              7              1              1
                                                                                       αi 5        3               1              1              3              3
                                                                                       αi 6        1               0              0              0              0
                                                                                       αi 7        1               1              1              1              1
                                                                                       ..           ..              ..             ..             ..             ..
                                                                                        .            .               .              .              .              .
                                                                                        αi 8       1               0              0              1              1


                                                                                      use this type of approach [8, 9]. Some pupils surround the pond in a
                                                                                      non–systematic manner and with variable degrees of symmetry. Such
                                                                                      examples are illustrated in Figure 4.

                                                                                       Comparing cases. To illustrate the operation of similarity mea-
Figure 3. (a) “Area strategy”; (b) “H strategy”; (c) “I strategy”; (d) “Spiral        sures we use two non–symmetrical examples of surrounding the
 strategy”; (e) “+4 strategy”; (f) “−4 strategy”; (g) Steps and relations of          pond, displayed in Figure 4. The similarity measures are the ones
           “area strategy”; (h) Steps and relations of “I strategy”.                  presented in Section 3.3.
                                                                                         The first example (Figure 4a), has 4 cases in common with two
   The six strategies and their associated solutions (the general ex-                 strategies: the “I strategy” (C1 , C3 , C4 , C5 ) and the “+4 strategy”
pressions for surrounding any rectangular pond) are displayed in Fig-                 (C1 , C4 , C5 , C6 ). When comparing it with the “I strategy” z √= 5
ures 3(a–f). Two strategies are presented in detail: the “area strategy”              (minimum between 6 and 5) and the combined similarity is: 51 +
(S1 ) and the “I strategy” (S3 ). The attributes of cases that are part of            5
                                                                                      5
                                                                                         + 7/4
                                                                                             5
                                                                                                 + 10/4
                                                                                                      5
                                                                                                           = 2.05. When comparing with the “+4” strategy,
                                                                                                                                                        √
these two strategies are presented in Table 4 and Table 5, respectively.              z = 6 (minimum between 6 and 9) the combined similarity is: 65 +
The steps and the sets of relations between attributes and between                    5+2/3
                                                                                         6
                                                                                              + 6/4
                                                                                                  6
                                                                                                      + 10/4+1/3
                                                                                                             6
                                                                                                                  = 2.04. Thus, in this case the learner will
cases are displayed in Figure 3g and Figure 3h, respectively.
                                                                                      be guided towards the “I strategy”.
   A particular order between cases is presented for the “I strategy” in
                                                                                         The second example (Figure 4b), has 3 cases in common with
Figure 3h. For the same strategy, the surrounding of the pond could
                                                                                      two strategies: the “spiral strategy” (C1 , C3 , C4 ) and the “H strat-
be done in several other different orders; there are 4! = 24 such
                                                                                      egy” (C1 , C2 , C5 ). When comparing it with the “spiral strategy” as
possibilities (the pond is always first).
                                                                                      well as the “H strategy”, z √ = 5 (minimum between 5 and 5), and
                                                                                      the combined similarity is: 52 + 4+2/3  5
                                                                                                                                   + 8/45
                                                                                                                                          + 10/4
                                                                                                                                               5
                                                                                                                                                  = 2.12. In
  Table 4.   The set of attributes (Fi ) for the cases in the “area strategy”.
                                                                                      this situation, when the learner’s construction is equally similar to
             Name              Label        C1             C2                         two strategies, the following options could be offered: (a) present
             Shape type         αi1      Rectangle      Rectangle                     the learner with the two options and let him/her choose one of the
             Width type         αi2      c/v/n exp      iv/iv exp                     two (an approach that appears more suitable for advanced learners
             Height type        αi3      c/v/n exp      iv/iv exp                     than novices); (b) automatically suggest one of the two in a system-
             Width value        αi4          5               7                        atic way, e.g. present the one that occurs more/less often with other
             Height value       αi5          3               5
                                                                                      learners; (c) inform the teacher about the learner’s trajectory and the
             P artOf S1         αi6          1               1
             ..                  ..           ..             ..                       frequency of strategies and let him/her decide between the two.
              .                   .            .              .
              P artOf S2        αi7          1               0
              P artOf S6        αi8          1               0


   There are two types of strategies depending on the degree of gen-
erality: specific and general. Specific cases refer to surroundings that
cannot be generalised and include value relations, but no dependency
relations; the general cases refer to surroundings that can be gener-
alised and are distinguished by the presence of the dependency re-                      Figure 4. Non-symmetrical strategies: (a) combination of ‘I’ and ‘+4’
lations and by the fact that the dimension type of at least one of the                         strategies; (b) combination of ‘spiral’ and ‘H’ strategies.
dimensions of the case is an icon variable or an expression using icon
variable(s). The presence or absence of the abovementioned aspects
apply to all cases that form the composite case with the exception of
the first case representing the pond. The “area” and the “I strategy”                  Exploiting context. In the pond tiling task, when the learner is
presented previously fall into the category of general strategies.                    constructing a specific (as opposed to general) tiling of the pond, the
   The strategies displayed in Figure 3 are correct symmetrical “ele-                 value relation attribute is more relevant, while when dealing with a
gant” solutions, but trials with pupils have shown that not all of them               general tiling, the dependency relation attribute is more important.


                                                                                 17
Local feature weighting in this case involves two trained neural net-               [8] S. Gutiérrez, M. Mavrikis, and D. Pierce, ‘A Learning Environment for
works for each context and applying the weights delivered by the                        Promoting Structured Algebraic Thinking in Children’, in Proceedings
                                                                                        of the 8th IEEE International Conference on Advanced Learning Tech-
NNs before the matching process.                                                        nologies, (2008). forthcoming.
                                                                                    [9] S. Gutiérrez, D. Pierce, E. Geraniou, and M. Mavrikis, ‘Supporting
 Learning trajectories. Lets consider the example in Figure 4b                          Reasoning and Problem-Solving in Mathematical Generalisation with
and a comparison after C3 . The current trajectory includes the cre-                    Dependency Graphs’, in Proceedings of the 5th International Confer-
                                                                                        ence on the Theory and Application of Diagrams, (2008). forthcoming.
ation of 3 rectangles corresponding to C1 , C2 and C3 ; this trajectory            [10] S-G. Han, S-G. Lee, and S. Jo, ‘Case-based tutoring systems for proce-
is considered to be far from the desired outcome (surrounding the                       dural problem solving on the www’, Expert Systems with Applications,
pond), and thus, the future does not need to be evaluated accurately.                   29(3), 573–582, (2005).
At this point with respect to the past, two strategies partially match             [11] I. Hatzilygeroudis and J. Prentzas, ‘Using a hybrid rule-based approach
                                                                                        in developing an intelligent tutoring system with knowledge acquisition
the learner’s current trajectory: “I” (C1 , C2 ) and “spiral” (C1 , C3 )
                                                                                        and update capabilities’, Expert Systems with Applications, 26(4), 477–
strategy; the learner could be left to continue with his/her model con-                 492, (2004).
struction without intervention. With respect to the future, the desired            [12] L. Healy, Hoelzl R., Hoyles C., and R. Noss, ‘Messing Up’, Micromath,
outcome can be obtained by following one of the two strategies pre-                     10(1), 14–16, (1994).
viously identified.                                                                [13] M-J. Huang, H-S. Huang, and M-Y. Chen, ‘Constructing a personalized
                                                                                        e-learning system based on genetic algorithm and case-based reasoning
   If the comparison takes place after C5 , the trajectory would in-                    approach’, Expert Systems with Applications, 33(3), 551–564, (2007).
clude the creation of 5 rectangles (C1 to C5 ) and thus it can be con-             [14] C.D. Hulshof, T.H.S. Eysink, S. Loyens, and T. de Jong, ‘Zaps: Us-
cluded that the learner has reached the desired outcome of surround-                    ing interactive programs for learning psychology’, Interactive Learning
ing the pond. However, in this process the learner did not use any                      Environments, 13, 39–53, (2005).
                                                                                   [15] J. Kaput, Technology and Mathematics education, 515–556, D. Grouws
of the desirable strategies, i.e. any of the six strategies presented in
                                                                                        (ed.) Handbook of Research on Mathematics Teaching and Learning,
Figure 3. At this point in time two trajectories match the past and in-                 New York: Macmillan, 1992.
dicate the future, as before, but now it might be considered pedagogi-             [16] P. Kirschner, J. Sweller, and R.E. Clark, ‘Why minimal guidance dur-
cally important to intervene and guide the learner towards a trajectory                 ing instruction does not work: An analysis of the failure of construc-
that corresponds to one of the two identified desirable strategies.                     tivist, discovery, problem-based, experiential and inquiry-based teach-
                                                                                        ing’, Educational Psychologist, 41(2), 75–86, (2006).
                                                                                   [17] J.L. Kolodner, Case-Based Reasoning, Morgan Kaufmann Publishers,
5   CONCLUSIONS                                                                         Inc., 2nd edn., 1993.
                                                                                   [18] C. Koutsojannis, G. Beligiannis, I. Hatzilygeroudis, C. Papavlasopou-
In this paper a learner modelling process involving a combination                       los, and J. Prentzas, ‘Using a hybrid ai approach for exercise difficulty
of intelligent methods was presented for the domain of mathemati-                       level adaptation’, International Journal of Continuing Engineering Ed-
cal generalisation. Case-based reasoning is used in combination with                    ucation and Life-Long Learning, 17(4-5), 256–272, (2007).
                                                                                   [19] P. Kumar, S. Gopalan, and V. Sridhar, ‘Context enabled multi-CBR
soft computing (fuzzy sets and neural networks) in order to process                     based recommendation engine for e-commerce’, in Proceedings of the
the models that the learners construct and thus be able to provide                      IEEE International Conference on e-Business Engineering (ICEBE),
feedback while learners work on the task.                                               pp. 237 – 244. IEEE Press, (2005).
   Further work includes expanding the conceptual framework by                     [20] J. Mason, Generalisation and algebra: Exploiting children’s powers,
                                                                                        105–120, L. Haggarty, Aspects of Teaching Secondary Mathematics:
defining a strength as the maximum combined similarity measure                          Perspectives on Practice, Routledge Falmer and the Open University,
(similarity of the past and similarity of the future at a particular dis-               2002.
tance) for various evaluated trajectories and a reliability index that             [21] R. Mayer, ‘Should There Be a Three-Strikes Rule Against Pure Discov-
will reflect the extent to which the similarities can be relied upon to                 ery Learning? The Case for Guided Methods of Instruction’, American
provide the right support.                                                              Psychologist, 59(1), 14–19, (2004).
                                                                                   [22] R. Morales Gamboa, Exploring Participative Learner Modelling and Its
                                                                                        Effects on Learner Behaviour, PhD thesis, Univ. of Edinburgh, 2000.
REFERENCES                                                                         [23] J.H. Park, K.H. Im, C-K. Shin, and S-C. Park, ‘MBNR: Case-Based
                                                                                        Reasoning with Local Feature Weighting by Neural Network’, Applied
[1] V. Aleven, ‘Using background knowledge in case-based legal reason-                  Intelligence, 21(3), 265–276, (2004).
    ing: A computational model and an intelligent learning environment’,           [24] C. Romero, S. Ventura, P. de Bra, and C. de Castro, ‘Discovering Pre-
    Artificial Intelligence, 150(1-2), 183–237, (2003).                                 diction Rules in AHA! Courses’, In Brusilovsky et al. [3], pp. 25–34.
[2] J.E. Beck and B.P. Woolf, ‘Using a Learning Agent with a Student               [25] Y. Seo, D. Sheen, and T. Kim, ‘Block assembly planning in shipbuilding
    Model’, in Intelligent Tutoring Systems, eds., B.P. Goettl, H.M. Halff,             using case-based reasoning’, Expert Systems with Applications, 32(1),
    C.L. Redfield, and V.J. Shute, volume 1452 of Lecture Notes in Com-                 245–253, (2007).
    puter Science, pp. 6–15. Springer, (1998).                                     [26] R. Stathacopoulou, M. Grigoriadou, G.D. Magoulas, and D. Mitropou-
[3] P. Brusilovsky, A.T. Corbett, and F. de Rosis, eds. User Modeling 2003,             los, ‘A Neuro-fuzzy Approach in Student Modeling’, In Brusilovsky
    9th International Conference, UM 2003, Johnstown, PA, USA, June 22-                 et al. [3], pp. 337–341.
    26, 2003, Proceedings, volume 2702 of Lecture Notes in Computer Sci-           [27] R.H. Stottler and S. Ramachandran, ‘A Case-Based Reasoning Ap-
    ence. Springer, 2003.                                                               proach to Internet Intelligent Tutoring Systems (ITS) and ITS Au-
[4] A. Bunt and C. Conati, ‘Probabilistic Student Modelling to Improve Ex-              thoring’, in Proceedings of the Twelfth International Florida Artificial
    ploratory Behaviour’, User Modelling and User-Adaptive Interaction,                 Intelligence Research Society Conference, pp. 181–186. AAAI Press,
    13(3), 269–309, (2003).                                                             (1999).
[5] S-H. Chun and Y-J. Park, ‘Dynamic adaptive ensemble case-based rea-            [28] J. Swaak, W.R. van Joolingen, and T. de Jong, ‘Supporting simulation-
    soning: application to stock market prediction’, Expert Systems with                based learning; the effects of model progression and assignments on
    Application, 28(3), 435–443, (2005).                                                definitional and intuitive knowledge’, Learning and Instruction, 8, 235–
[6] C. Conati, A.S. Gertner, and K. VanLehn, ‘Using Bayesian Networks to                253, (1998).
    Manage Uncertainty in Student Modeling’, User Modelling and User-              [29] W.R. van Joolingen, ‘Cognitive tools for discovery learning’, Interna-
    Adaptive Interaction, 12(4), 371–417, (2002).                                       tional Journal of Artificial Intelligence and Education, 10, 385–397,
[7] G. Ghinea, G.D. Magoulas, and C. Siamitros, ‘Multicriteria decision                 (1999).
    making for enhanced perception-based multimedia communication’,                [30] R.R. Yager, ‘Soft Aggregation Methods in Case Based Reasoning’, Ap-
    IEEE Transactions on Systems, Man, and Cybernetics, Part A, 35(6),                  plied Intelligence, 21(3), 277–288, (2004).
    855–866, (2005).


                                                                              18
     Belief Propagation in Fuzzy Bayesian Networks
                     Christopher Fogelberg1 and Vasile Palade and Phil Assheton2


Abstract. Fuzzy Bayesian networks are a generalisation             presents belief propagation for variables with one parent (sub-
of classic Bayesian networks to networks with fuzzy variable       section 3.3) and for variables with multiple parents (subsec-
state. This paper describes our formalisation and outlines how     tion 3.4). Section 4 analyses the algorithmic efficiency of FBNs
belief propagation can be conducted. Fuzzy techniques can          and how it can be controlled. Section 5 outlines an important
lead to more robust inference. A key advantage of our formal-      bioinformatic domain where FBNs may be especially useful,
isation is that it can take advantage of all existing network      and section 6 concludes the paper.
inference and Bayesian network algorithms. Another key ad-
vantage is that we have developed several techniques to con-       2     A Fuzzy Bayesian Network
trol the algorithmic complexity. When these techniques can
be applied it means that fuzzy Bayesian networks are only          The structure of the FBN that is used as an example in this
a small linear factor less efficient than classic Bayesian net-    paper is shown in figure 1. Call this FBN G = hη, θi, where η
works. With appropriate pre-processing they may be substan-        denotes the structure of G and θ its parameters.
tially more efficient.                                                For clarity of presentation, G is a multinomial (discrete)
                                                                   BN. However, the formalisation generalises easily and trans-
                                                                   parently to continuously-valued FBNs and hybrid FBNs.
1     Introduction                                                    The relevant conditional distributions of G are shown in
Modern machine learning research frequently uses Bayesian          figure 2. D’s distribution is not shown and we will later assume
networks (BNs)[6; 7; 15; 16]. However, BN inference is NP-         a state for D with no loss of generality.
complete due to cycles in the undirected graph[4], and belief         Because we have restricted the differences between BNs and
propagation is exponential in the tree-width of the network.       FBNs to belief propagation, the specification of a FBN and a
This makes them difficult to use for large problems.               BN are identical.
   Fuzzy[3] and hybrid fuzzy systems[11; 13] are also fre-
quently used. In a fuzzy system, a variable’s state is repre-      3     Belief Propagation
sented by a set of fuzzy values (FVs). Because fuzzy systems       Belief propagation in a Bayesian network involves calculat-
do not force a model to artificially discretise a continuous un-   ing the updated probability distributions of variables in the
derlying state they are often more robust in the face of noise.    network, given θ and the observed states of other variables.
   To date there has been very little research into BNs with
fuzzy variable states. What there is has centred around the        3.1    Some Notation
use of fuzzy approximations to perform inference and belief
propagation in a hybrid BN [1; 12]. A hybrid BN is one where       The terminology and notation is as follows. A variable has a
the parameters are a mix of continuous and multinomial dis-        state, either a fuzzy state (FS) or a discrete state (DS).
tributions.                                                           A fuzzy state is made up of one or more components, and
   This paper’s key contribution is a formal generalisation of     each component is annotated with the variable’s degree of
classic BNs to fuzzy Bayesian networks (FBNs). In a FBN            membership (µ) in that component. For example, equation 1
the variables can have fuzzy states. The paper also describes      is an example of a variable (S) with two components. It has
tractable belief propagation over FBNs. An important ad-           membership 0.7 in the component hi and membership 0.3
vantage of the presented formalisation is that existing infer-     in the component mid. hi and mid are examples of values
ence algorithms (e.g. MCMC, simulated annealing) can be            that the variable can take. When annotated with µ they are
used without modification. Furthermore, FBNs may be only           referred to as fuzzy values (FV). The set of all possible values
a small linear constant less efficient than classic BNs of the     (fuzzy values) that a variable can take is the range of that
same size, and appropriate pre-processing may make some            variable, e.g. hi, mid, lo.
problems which were intractable for classic BNs tractable for
FBNs.                                                                                    S = [hi0.7 , mid0.3 ]                 (1)
   The paper is structured as follows. Section 2 presents a
                                                                      In general, the components of a variable’s state are enclosed
fuzzy Bayesian network which will be used as an example.
                                                                   in square brackets. A discrete state is just a special case of
Section 3 introduces some notation (subsection 3.1), and
                                                                   a fuzzy state. Discrete states have just one component with
1 Supported by the ACU and CSCUK                                   µ = 1, and the square brackets andP µ subscript can be omitted
2   Computing Laboratory, University of Oxford; Contact email:     in this situation. We assume that c∈C µc = 1 for a FS with
    christopher.fogelberg@comlab.ox.ac.uk                          C components and have not considered other situations.


                                                                   19
                                     A                       B                   C

                                                                                                        E


                                                                                 D


                               Figure 1.     The fuzzy Bayesian network G, used as an example in this paper.


             →A       A = lo      A = mid       A = hi                   fuzzy probability distribution (FPD). Samples are drawn from
                       0.7          0.1          0.2                     a FPD in the same way that they are drawn from a PD.
                                                                         However, a variable with membership µ in a FPD can only
                 (a) θA , A’s prior distribution                         have µ proportion of its state determined by that FPD; a
                                                                         sample from a FPD will have the same µ as the FPD does.
            A→B          B = lo     B = mid        B = hi                For example, a sample from {0.2, 0.1, 0.7}0.2 will be one of
                                                                         lo0.2 , mid0.2 or hi0.2 , and each of these components will be
             A = lo       0.6         0.2           0.2
                                                                         drawn with probability 0.2, 0.1 and 0.7 respectively.
            A = mid       0.1         0.1           0.8
                                                                            This means that the state of a sample from the uncertain
             A = hi       0.1         0.2           0.7
                                                                         variable T (equation 2 will be some member from the set
              (b) θB , B’s conditional distribution                      [lo[0..0.8] , mid[0.2..1] , hi[0..0.8] ] and the distribution over mem-
                                                                         bers in this set is determined by the two FPD and one FV
                                                                         which make up the fuzzy state of T .
            B→C          C = lo      C = mid       C = hi
             B = lo       0.1          0.1          0.8                  3.2     Assumptions
            B = mid       0.1          0.8          0.1
                                                                         The full and general analysis of FBNs would also consider un-
             B = hi       0.7          0.2          0.1
                                                                         restrictedPinteractions amongst components in a fuzzy state,
              (c) θC , C’s conditional distribution                      allowing     µ 6= 1 and so forth. In this article we make a num-
                                                                         ber of linearising assumptions which make FBN belief propa-
       C, D → E                 E = lo      E = mid       E = hi         gation cheap, relative to the cost of full general propagation.
                                                                         They also greatly aid the clarity of the presentation in the
    C = lo   D = lo              0.6          0.2          0.2           space available. Furthermore, these assumptions are reason-
    C = lo  D = mid              0.1          0.1          0.8           able and do not restrict the general utility of FBNs. However
    C = lo   D = hi              0.1          0.1          0.8           it is important to make them explicit. The assumptions (and
   C = mid   D = lo              0.6          0.2          0.2           consequently the nature of full general propagation) will be
   C = mid D = mid               0.1          0.6          0.3           briefly summarised in this section; a more general discussion
   C = mid   D = hi              0.1          0.6          0.3           is forthcoming.
    C = hi   D = lo              0.1          0.2          0.7
    C = hi  D = mid              0.1          0.2          0.7
                                                                         3.2.1    Assumption: Total Membership
    C = hi   D = hi              0.8          0.1          0.1
                                                                                                                                  P
                                                                         As noted above and in subsection 3.1, we assume that       µ=
              (d) θE , E’s conditional distribution                      1. This is our first linearising assumption, and it can be con-
                                                                         ceptualised as follows. A variable’s degree of membership in
Figure 2.    θ for G. The conditional distributions of A, B, C and       each of its |C| components forms a |C|-dimensional fuzzy state
                                E.                                       space. If the variable has no uncertainty (no component is an
                                                                         FPD) then |C| will be the same as the variable’s range. Even
  Just as a component can be a value from the range of                   if a FS has 0 membership in some of its range those
                                                                                                                           P values are
a variable, e.g. hi, a component can also be a probability               still part of the state’s FSS. By assuming that c∈C µc = 1,
distribution (PD). PD are denoted with curly brackets, e.g.              we restrict our attention to a smaller |C| − 1 dimensional sub-
{hi0.3 , mid0.2 , lo0.5 }. Because the value associated with each        space. This subspace constrains the degrees of membership
subscripted probability is implicit in the tuple order the value         in each component in a cyclically conditional way on the de-
names can be omitted: {0.3, 0.2, 0.5}.                                   grees of membership in the other components. This assump-
  An example of a fuzzy state which mixes values and prob-               tion simplifies the combination of components of fuzzy states
ability distributions is shown in equation 2.                            in subsection 3.4.
                                                                            For example, if a FS has membership 0.5 in the value hi
        T = [{0.2, 0.1, 0.7}0.2 , {0.1, 0.8, 0.1}0.6 , mid0.2 ]    (2)   its membership in the values lo and mid in the FSS are con-
                                                                         strained to be in the range [0, 0.5]. Furthermore, its member-
  A PD which is annotated (subscripted) with µ is called a               ship in lo and mid mutually constrain (in this case define, as


                                                                         20
there are no other values in the range) each other. Figure 3           This definition of a FPD could be generalised so that µ
illustrates the impact of the first assumption on the FSS for       could vary independently but was fixed for each of the r pos-
a fuzzy state with two components.                                  sible samples that could be drawn from the FPD. Call this
                                                                    a slightly general FPD (SGFPD) and call the FPD defined
                    1                                               in subsection 3.1 a standard FPD. An SGFPD would specify
                                                                    a single point in an r + r dimensional space, where r of the
                                                                    dimensions are the probability of each value and the other r
                Membership in A
                                                                    dimensions specify the µ of a sample of each value. Just as the
                                                                    first r dimensions specify a p-space, the second r dimensions
                                                                    specify a µ-space. An example of such a space is given dia-
                                                                    grammatically in figure 4, and it is used to contrast a SGFPD
                                                                    with a standard FPD.
                                                                       An example of an SGFPD might be “there is a 0.2 prob-
                                                                    ability of drawing a sample of hi, and any sample of hi will
                           0                        1               have µ = 0.3, and there is a 0.3 probability of drawing a sam-
                                  Membership in B                   ple of mid, and any sample of mid will  P have µ = 0.5, and. . . ”
                                                                    and so forth. The assumption that c∈C µc = 1 for a state
 Figure 3. Imagine a fuzzy state with two components, A and         could be relaxed if SGFPD were used.
   B. Such a state would have a two-dimensional FSS, as in this
                                                                       After considering figure 4 it will be clear that SGFPD could
 figure. Without restriction, the state’s degree of membership in
   each component could P  be specified by any point in the FSS.    be further generalised so that the µ of any sample also varied
However, we assume that c∈C µc = 1. Therefore its membership        probabilistically, conditional on the value (hi, mid, etc.) of
 in components A and B must be specified by some point on the       the sample. Such a general FPD (GFPD) would be an r +
                            dashed line.                            r dimensional probability distribution over the joint µ and
                                                                    range of the variable. We believe that this represents the most
                                                                    general kind of inference and belief propagation in a FBN.
3.2.2   Assumption: Component Independence                          Such inference is intractable and we do not consider it in this
                                                                    paper.
We also make two further assumptions about FBN during
                                                                       In summary, the assumptions which we have made substan-
belief propagation. The first is that components are indepen-
                                                                    tially reduce the dimensionality of belief propagation and are
dent. This means that when a variable has only one parent
                                                                    necessary for it to be tractable. However, more general FBNs
then its state will have one component for each component
                                                                    with GFPD do not have these restrictions; their utility will
its parent has, and these components will have the same µ
                                                                    be considered in a forthcoming publication.
as the corresponding parent’s component. For example, the
children which have S (equation 1) as their only parent will        3.3       Single-Parent Belief Propagation
have two components in their FS, one with µ = 0.7 and one
with µ = 0.3.                                                       Assume that observations indicate A = [mid0.2 , hi0.8 ] in G.
   When a variable has more than one parent then the com-           With this information we can calculate the updated distribu-
ponents of the parents will be mixed and combined before            tions on B and C.
propagation. This is described in subsection 3.4. Because we           Because A has an observed (certain) FS and is B’s only
assume independence, the child’s fuzzy state will have one          parent the components of B’s updated FS can be read from
component for each component in the mixed and combined              θB . This shows that:
parent set, and each of the child’s component will have the
same µ as the corresponding component in the parent set.                           B = [{0.1, 0.1, 0.8}0.2 , {0.1, 0.2, 0.7}0.8 ]      (3)
   It will also be clear that we assume independence when we           The FS over C is calculated similarly. Just as each of the
describe how the parents’ components are mixed and com-             fuzzy values in A lead to a weighted FPD in the FS of B
bined in subsection 3.4.                                            the same occurs for C, and C = [α0.2 , β0.8 ]. The weighted
                                                                    distributions α and β are calculated using standard BN belief
3.2.3   Assumption: FPD Samples                                     propagation, based on the conditional distribution of B. This
                                                                    is shown in equations 4, 5 and 6.
The third assumption implicit in this model is the assumption
that a sample from an FPD with µ = x will be a single fuzzy
component with µ = x. Although natural and intuitive we                                 p(C|B = lo)       =   {0.1, 0.1, 0.8}
do not believe that this is automatically entailed, thus we                          p(C|B = mid)         =   {0.1, 0.8, 0.1}
explicitly assume it.                                                                  p(C|B = hi)        =   {0.7, 0.2, 0.1}          (4)
   Consider an uncertain variable with a range of r in a stan-
dard (discrete) Bayesian network. Its state will be a probabil-
ity distribution which specified a single point in the r dimen-           α    =     {0.1, 0.1, 0.8} × 0.1 + {0.1, 0.8, 0.1} × 0.1 +
sional probability space (p-space).                                                          {0.7, 0.2, 0.1} × 0.8
   Now consider this variable in a FBN. Any uncertainty in
                                                                               =     {0.01, 0.01, 0.08} + {0.01, 0.08, 0.01} +
its state will be represented by an FPD component with some
µ. As described in subsection 3.1, a FPD is just PD with a                                   {0.56, 0.16, 0.08}
fixed (0 dimensional) µ associated with it.                                    =     {0.58, 0.25, 0.17}                                (5)


                                                                    21
                                                                                           Membership in B
                                         1


                                    p(B)
                                                                                                             Membership in A


                                         0                p(A)                1

    Figure 4. Assume a variable with a range (r) of 2 (A and B). Each point in the 2 dimensional p-space (left hand side)P could be
considered an index into a 2 dimensional µ-space (right hand side), as diagrammed. All proper probability distributions ( p = 1) fall on
   the dotted dashed line in the p-space. The µ-space that is indexed by some point in the p-space could be unique to that point. In a
standard FPD, the µ-space is reduced to a single point on the dashed line and that point on the line in the µ-space is specified precisely
  by the µ of the FPD. Because any sample from a standard FPD, regardless of its value, will have the same µ, r of the dimensions are
  eliminated. In addition, we assume that an FPD has a proper probability distribution. In total, these reduces the dimensionality from
r + r to r − 1. In a SGFPD the µ-space would also be reduced to a single point, but any point in the µ-space would be a valid reduction,
                          so an SGFPD with a proper distribution over the values still has r + r − 1 dimensions.

                                                                                     In the naive approach to belief propagation the Cartesian
      β       =     {0.1, 0.1, 0.8} × 0.1 + {0.1, 0.8, 0.1} × 0.2 +               product of the parents’ FS is used to find all possible combi-
                                                                                  nations of components. µ for each one of these combinations
                           {0.7, 0.2, 0.1} × 0.7
                                                                                  is calculated using the product t-norm[2]. Any other fuzzy
              =     {0.01, 0.01, 0.08} + {0.02, 0.16, 0.02} +                     conjunction (normalising µ where    Pnecessary) could also be
              =            {0.49, 0.14, 0.07}                                     used. Because we assumed that          µ = 1 holds for each of
              =     {0.52, 0.31, 0.17}                                  (6)       the parents
                                                                                        P      though,  using  this fuzzy conjunction guarantees
                                                                                  that    µ over the child’s components will also equal 1 and no
  The calculated FS for C is shown in equation 7.                                 normalisation is necessary.
                                                                                     For example, if we use the first components of C and
              C = [{0.58, 0.25, 0.17}0.2 , {0.52, 0.31, 0.17}0.8 ]      (7)       D ({0.58, 0.25, 0.17}0.2 and {0.45, 0.30, 0.25}0.3 , respectively,
                                                                                  equation 8) then standard Bayesian propagation and using
3.4       Multi-Parent Belief Propagation                                         the product t-norm to calculate µ shows that one member of
                                                                                  E’s updated FS is:
Subsection 3.3 illustrated belief propagation in a FBN when
a variable has only one parent. This subsection shows naive
                                                                                                              α = {0.3165, 0.2189, 0.4647}0.06        (9)
FBN belief propagation in the case of a variable with multiple
parents. Section 4 outlines several more nuanced approaches                         The full FPD for E will have four members, one for each
which address the problems with naive propagation.                                member of C × D (equation 10, below). For clarity, the cal-
   Take the calculated value of C, and assume a fuzzy state                       culated α from equation 9 has not been substituted into this
for D (equation 8). What is the updated fuzzy state of E?                         equation.

                                                                                                              E = [α0.06 , β0.14 , γ0.24 , δ0.56 ]   (10)
          C     =    [{0.58, 0.25, 0.17}0.2 , {0.52, 0.31, 0.17}0.8 ]
      D         =    [{0.45, 0.30, 0.25}0.3 , {0.1, 0.8, 0.1}0.7 ]      (8)          In general a variable with k parents that each have an FS
                                                                                  with m components will have an updated FS of size mk . As-
   Any combination of components, one from each parent, can                       suming all variables have k parents, the grand-children will
be used to calculate an updated probability distribution for a                                                 k
                                                                                  have updated FS of size mk , and so forth. This is the fuzzy
variable. However, this raises the question of how to combine                     state size explosion (FSSE), and it makes naive belief propa-
and weight each combination of component distributions in                         gation in a FBN intractable.
the parent FSs to calculate an updated FS for the child.
   Because the parents are conditionally independent given
the variable being updated3 , any particular combination of                       4    Dealing with Complexity
PD and observations can be summed over, as one was in each                        There are several ways that the explosion in the complex-
of equations 5 and 6. The summed over combinations became                         ity can be controlled by approximating the FS. This sec-
components of C’s updated distribution.                                           tion discusses four kinds of control. The bimodal fuzzy state
3 And also given their updated state and the acyclic nature of the                X = [{0.9α , 0.1β }0.5 , {0.1α , 0.9β }0.5 ] is used as an example in
 graph                                                                            several places in this section. Such a variable could represent


                                                                                  22
a committee of two in which the committee members (compo-                 an accurate reflection of the full FS, thus φ-dynamic selection
nents) hold diametrically opposite beliefs about the outcome              may be more appropriate in some cases.
of some future event.
                                                                          4.3      Clustering the Fuzziness
4.1    Linear Collapse
                                                                          Another way of controlling the FSSE is to calculate the full
A first approximation that addresses the FSSE is to linearly              FS of each variable during belief propagation. However, be-
collapse a FS that is made up only of FPDs, immediately                   fore using the full FS to update the state of its children, its
after they are calculated. Each component can be weighted by              components could be clustered so that FPD which specified
its fuzzy membership and they can be summed to calculate                  similar distributions were combined together.
a single, discrete, PD. For example, B (equation 3) can be                   For example, the FS [. . . , {0.7, 0.2, 0.1}0.3 , {0.6, 0.3, 0.1}0.2 , . . .]
collapsed as shown in equation 11. Collapsed FS are denoted               might cluster to [. . . , {0.66, 0.24, 0.1}0.5 , . . .].
with a prime.                                                                Because the clustering problem would only have as many
                                                                          dimensions as the range of each FPD, we speculate that a
                                                                          simple fixed-k clustering algorithm like k-means would work
           B     =    [{0.1, 0.1, 0.8}0.2 , {0.1, 0.2, 0.7}0.8 ]
             ′
                                                                          very well.
        ∴B       =    {0.1, 0.18, 0.72}                            (11)      Although this approach is more complex than selection or
                                                                          linear collapse, the total increase in complexity in belief prop-
   However, this approximation is unsatisfactory: it conflates
                                                                          agation would be related to and bound by the maximum in-
probability with fuzziness and may change the expected value
                                                                          degree and range of a variable.
of the variable. Although it may be approximately correct in
some circumstances, a simple thought experiment will show
why it is insufficient.
                                                                          4.4      Expected Values
   Consider the bimodal FS X. The expected sample from X                  A fourth kind of control is inspired by particle filtering and
is X ′ = [α0.5 , β0.5 ]. Although this sample does not reflect any        the Condensation algorithm[9]. The general sequential Monte
of the uncertainty in X it does reflect the bi-modality (inde-            Carlo (SMC) method will be outlined first. Although this
cision) of the variable (committee) as a whole. Subsection 4.4            approach is not as efficient as others it is applicable in all
returns to this approach.                                                 cases and is strictly correct.
   If X is linearly collapsed though then X ′ = {0.5, 0.5}. No               Consider X. An infinite sequence of independent samples
sample drawn from this PD can be half α and half β. Im-                   drawn from this uncertain fuzzy state will take something like
portant information in X has been lost. Although further be-              the form [α0.5 , β0.5 ], [α0.5 , β0.5 ], . . . , [α1 ], [α0.5 , β0.5 ]. . . and so
lief propagation will not be biased if this variable is summed            forth.
over4 , there is no way to compare the linearly collapsed value              The properties of this sequence are identical to those of
X ′ with any observed value for X when trying to evaluate the             the fuzzy state, and a long-enough finite sequence will be a
quality of an inferred network.                                           good approximation to it. For example, 100 samples could
   Other approximations to the naive approach have been de-               be drawn from X. Each of these samples could then be used
veloped. They are discussed in the next three subsections.                to propagate the uncertain state of X to X’s children. The
                                                                          relative efficiency of this technique compared to clustering
4.2    Strict and Dynamic Top Fuzzy                                       depends on the range and kmax of the variables, but in certain
       Combinations                                                       situations it may also be better.
Consider again the full (naive) FS of E, reproduced in equa-                 As noted in subsection 4.1, the expected value of a variable
tion 12.                                                                  is easily calculated analytically. For example, the expected
                                                                          value of X is [α0.5×0.9+0.5×0.1 , β0.5×0.1+0.5×0.9 ] = [α0.5 , β0.5 ].
                 E = [α0.06 , β0.14 , γ0.24 , δ0.56 ]              (12)      Doing this expectation calculation is analogous to summing
                                                                          over or numerically integrating a probability distribution, and
   Some of the components barely contribute to the overall                we call it fuzzy integration. Like clustering the impact on ef-
state and will not have a substantial influence on any children           ficiency of fuzzy integration depends on the range and kmax .
either. Such components could be ignored, and the remaining                  This approach is very similar to linear collapse but it has
components could have their µ normalised. For example, if                 a number of key advantages. Firstly, like linear collapse, it
just the top three components of E were used then the up-                 does not bias any further belief propagation. This is untrue
dated FS would take the form:                                             of selection and clustering. Secondly, the expected value X ′
                                                                          which is the result of this fuzzy integration can be meaning-
                     E = [β0.149 , γ0.255 , δ0.596 ]               (13)   fully compared with observed values of X when performing
   The number of components retained could be either k-                   network inference. In many cases, users are only interested in
component strict selected or φ-dynamically selected. In the               the expected (integrated) value. In these cases the expected
former case, the top k components would be selected. In the               value of a FS is ideal.
latter, the
         P |C| components with greatest µ would be selected
so that c∈C µc > φ. Strict selection would mean that FBNs                 5     A Bioinformatic Domain
were only a small linear factor less efficient than classic BNs
of the same size. However, the top k components may not be                Inference of large genetic regulatory networks (GRN) is a cen-
                                                                          tral problem in modern bioinformatics. However, algorith-
4 Due to the use of the product t-norm.
                                                                          mic complexity has limited detailed inference using BNs[6;


                                                                          23
15; 16] to N / 100 genes[5]. Approaches which can be ap-             [2] F. Bobillo and U. Straccia, ‘A fuzzy description logic with
plied to larger numbers of genes include modern clustering               product t-norm’, in Proceedings of the 16th IEEE In-
methods[10] and the inference of graphical Gaussian models               ternational Conference on Fuzzy Systems (FUZZ-IEEE
over clustered gene expression data[8; 14].                              2007), pp. 652–657, London (United Kingtom), (July
   FBNs suggest a novel approach to detailed exploration of              2007).
large GRN. Such a methodology generalises to the inference           [3] Y. Cao, P. Wang, and A. Tokuta, ‘Reverse engineering
of other large causal networks as well.                                  of NK boolean network and its extensions — fuzzy logic
   If the data is pre-processed by using a fuzzy cover algo-             network (FLN)’, New Mathematics and Natural Compu-
rithm the dimensionality of the problem may be reduced by                tation, 3(1), 68–87, (2007).
an order of magnitude or more. This could lead to an expo-           [4] David M. Chickering, ‘Learning Bayesian networks is
nential reduction in the algorithmic complexity which would              NP-Complete’, in Learning from Data: Artificial Intel-
more than offset any increase caused by the fuzzy state size             ligence and Statistics V, eds., D. Fisher and H. J. Lenz,
explosion and its collapse.                                              121–130, Springer-Verlag, (1996).
   A fuzzy cover is a clustering algorithm which covers the          [5] Christopher Fogelberg and Vasile Palade, ‘Machine
data, rather                                                             learning and genetic regulatory networks: A review and
          P than clusters it. In a fuzzy cover, a variable (gene)
can have c∈C µc > 1, where C is the set of covers that the               a roadmap’, Technical Report CS-RR-08-04, Computing
algorithm finds.                                                         Laboratory, Oxford University, Wolfson Building, Parks
   Inference over the covers is performed using a standard al-           Road, Oxford, OX1-3QD, (April 2008).
gorithm to find a virtual GRN. Using the retained µc for each        [6] A. J. Hartemink, D. K. Gifford, T. S. Jaakkola, and R. A.
n ∈ N and c ∈ C, most of the original fidelity can be recovered          Young, ‘Combining location and expression data for prin-
after the inference has been performed by linearly devolving             cipled discovery of genetic regulatory network models.’,
and normalising the network of covers back down to a net-                Pacific Symposium on Biocomputing, 437–449, (2002).
work of genes. The synergistic use of dimension-reduction and        [7] David Heckerman, ‘A tutorial on learning with Bayesian
FBNs are what we believe will be most useful.                            networks’, Technical report, Microsoft Research, Red-
   The authors are using this approach (fuzzy covering, FBN              mond, Washington, (1995).
inference, FBN devolution) to infer and explore large genetic        [8] Katsuhisa Horimoto and Hiroyuki Toh, ‘Statistical esti-
regulatory networks. With fuzzy clustering and FBNs we ex-               mation of cluster boundaries in gene expression profile
pect to be able to perform more detailed exploratory inference           data.’, Bioinformatics, 17(12), 1143–1151, (2001).
for N ≈ 1000.                                                        [9] Michael Isard and Andrew Blake. Condensation – con-
                                                                         ditional density propagation for visual tracking, 1998.
6   Contributions and Future Work                                   [10] Sara C. Madeira and Arlindo L. Oliveira, ‘Bicluster-
                                                                         ing algorithms for biological data analysis: a survey’,
This paper has presented a new formalisation which combines              IEEE/ACM Transactions on Computational Biology and
fuzzy theory and Bayesian networks. Because of the way that              Bioinformatics, 1(1), 24–45, (2004).
it extends classic BNs, all existing algorithms, tools and ma-      [11] Daniel Neagu and Vasile Palade, ‘A neuro-fuzzy ap-
chine learning techniques for classic BNs can be used imme-              proach for functional genomics data interpretation and
diately with FBNs.                                                       analysis’, Neural Computing and Applications, 12(3-4),
   Several techniques for tractably propagating fuzzy beliefs            153–159, (2003).
across a FBN are also described. Using these techniques, pre-       [12] Heping Pan and Lin Liu, ‘Fuzzy Bayesian networks - a
viously used BNs can be assigned fuzzy variable states and               general formalism for representation, inference and learn-
updated accordingly. This means that existing networks, of-              ing with hybrid Bayesian networks’, IJPRAI, 14(7), 941–
ten learnt only after substantial effort, can be easily reused.          962, (2000).
   Furthermore, the difference in BN and FBN efficiency with        [13] Romesh Ranawana and Vasile Palade, ‘Multi-classifier
sensible fuzziness collapse may be as little as a small linear           systems: Review and a roadmap for developers’, Interna-
constant in some circumstances. This means that there are                tional Journal of Hybrid Intelligent Systems, 3(1), 35–61,
few disadvantages to using FBNs instead of BNs.                          (2006).
   The possibility of integrating FBNs into a machine learn-        [14] Hiroyuki Toh and Katsuhisa Horimoto, ‘Inference of a
ing pipeline which involves dimension-reduction and network              genetic network by a combined approach of cluster anal-
devolution also suggests that the inference of larger causal             ysis and graphical Gaussian modeling.’, Bioinformatics,
networks will be possible using FBNs.                                    18(2), 287–297, (2002).
   Future research may uncover more efficient methods for in-       [15] Jing Yu, V. Anne Smith, Paul P. Wang, Alexander J.
tegrating, clustering or otherwise collapsing a FS. In addition,         Hartemink, and Erich D. Jarvis, ‘Advances to Bayesian
the authors plan to present an even more generalised formali-            network inference for generating causal networks from
sation which relaxes the assumptions made in subsection 3.2.             observational biological data.’, Bioinformatics, 20(18),
                                                                         3594–3603, (2004).
                                                                    [16] Yu Zhang, Zhingdong Deng, Hongshan Jiang, and Peifa
REFERENCES                                                               Jia, ‘Dynamic Bayesian network (DBN) with structure
 [1] Jim F. Baldwin and Enza Di Tomaso, ‘Inference and                   expectation maximization (SEM) for modeling of gene
     learning in fuzzy Bayesian networks’, in FUZZ’03: The               network from time series gene expression data.’, in BIO-
     12th IEEE International Conference on Fuzzy Systems,                COMP, eds., Hamid R. Arabnia and Homayoun Valafar,
     volume 1, pp. 630–635, (May 2003).                                  pp. 41–47. CSREA Press, (2006).


                                                                    24
          Combining Goal Inference and Natural-Language
             Dialogue for Human-Robot Joint Action
    Mary Ellen Foster1 and Manuel Giuliani1 and Thomas Müller1 and Markus Rickert1 and Alois Knoll1
                 Wolfram Erlhagen2 and Estela Bicho2 and Nzoji Hipólito2 and Luis Louro2


Abstract. We demonstrate how combining the reasoning compo-                     planner, using connectionist kernel perceptron learning to learn the
nents from two existing systems designed for human-robot joint ac-              effects of different domain actions. Integration among the different
tion produces an integrated system with greater capabilities than ei-           components of this system is achieved through a common represen-
ther of the individual systems. One of the systems supports primarily           tation of actions and their effects. Such hybrid architectures are also
non-verbal interaction and uses dynamic neural fields to infer the              particularly common when the robot is designed to cooperate with a
user’s goals and to suggest appropriate system responses; the other             human partner; recent examples include [13, 17, 32].
emphasises natural-language interaction and uses a dialogue man-                   In this paper, we present two robot systems designed to cooperate
ager to process user input and select appropriate system responses.             with humans on mutual tasks and then show how combining rea-
Combining these two methods of reasoning results in a robot that is             soning components from these systems results in a more powerful
able to coordinate its actions with those of the user while employing           integrated system. Both of the robot systems have been developed in
a wide range of verbal and non-verbal communicative actions.                    the context of the JAST3 project (“Joint Action Science and Technol-
                                                                                ogy”). The two main goals of this project are to investigate the cogni-
                                                                                tive, neural, and communicative aspects of jointly-acting agents, both
1    INTRODUCTION AND MOTIVATION
                                                                                human and artificial, and to build jointly-acting autonomous systems
As robot systems become increasingly sophisticated, their role is               that communicate and work intelligently on mutual tasks. The com-
moving from one where the robot is essentially an intelligent tool              mon task across the project is joint construction—that is, multiple
to one where the robot is able to participate as a full team member             agents working together to assemble objects from their components.
in collaborative tasks. Supporting this type of human-robot coopera-               The two JAST human-robot systems support intelligent coopera-
tion requires that the robot system be able to produce and understand           tion with humans on this joint construction task. Although both sys-
a wide range of natural communicative cues in order to allow hu-                tems address the same basic task and incorporate similar input- and
mans to cooperate with it easily. For example, [15] experimentally              output-processing components, the reasoning components are imple-
demonstrated the contribution of anticipatory action to the fluency             mented using very different techniques and they support very differ-
of human-robot interaction; similarly, natural-language dialogue has            ent styles of interaction. The goal inference system is implemented
been shown to be an effective means of coordinating actions between             using dynamic neural fields and concentrates on inferring the user’s
a human and a robot [7].                                                        intended domain actions based on their non-verbal behaviours and
   A number of previous systems have also addressed the task of                 on selecting appropriate domain actions for the system to perform in
human-robot cooperation, using a variety of communicative styles.               response. For example, if the user picks up a bolt in a way that in-
The Leonardo robot [2], for example, is able to learn simple action             dicates that they intend to use it themselves, the system might pick
sequences and to execute them jointly with the user. The Ripley sys-            up the corresponding wheel and hold it out to the user. The dialogue
tem [24] is able to manipulate objects in response to spoken requests           system, on the other hand, concentrates on understanding and gener-
from a human partner; a more recent robot from the same group [16]              ating multimodal natural-language utterances to support cooperation
increases the responsiveness of the system and allows the action plan-          between the human and the robot, using a dialogue manager. Sec-
ner to adapt flexibly to a rapidly-changing world. The BARTHOC                  tions 2–3 present the details of these two systems and show a typical
[26] and ARMAR [27] humanoid robots both support multimodal                     interaction with each.
dialogue to interact with a human user in a variety of settings and                Since the two JAST human-robot systems address the same task,
domains. The experiments described in [15] demonstrated that un-                using complementary forms of reasoning, it is possible to combine
derstanding and anticipating the user’s actions produces a robot that           the two forms of reasoning into a single system. This integrated sys-
can cooperate more smoothly with a human user.                                  tem is able both to intelligently infer the user’s actions and suggest
   Since an intelligent robot system must both process continuous               appropriate responses, and also to engage in dialogue with the user
sensor data and reason about discrete concepts such as plans, actions,          to support coordination and to discuss situations when the system is
and dialogue moves, this type of system is often made up of compo-              unable to infer the user’s goal. In Section 4, we present this integrated
nents drawing from an assortment of representation and reasoning                system and show a sample of the interactions that it can support that
paradigms. The robot system described in [22], for example, com-                are not possible with either of the individual systems; this section
bines low-level robot control and vision systems with a high-level              also gives some technical details of how the components of the two
1 Technische Universität München, Germany, contact: foster@in.tum.de
2 University of Minho, Portugal, contact: wolfram.erlhagen@mct.uminho.pt        3 http://www.euprojects-jast.net/


                                                                           25
systems are combined in practice. Finally, in Section 5, we compare
the integrated system with other similar systems and summarise the
contributions of the system and the areas for future work.


2   GOAL INFERENCE BASED ON DYNAMIC
    NEURAL FIELDS
The first of the JAST human-robot systems concentrates on giving
the robot the ability to predict the consequences of observed actions,
using an implementation inspired by neurocognitive mechanisms un-
derlying this capacity in humans and other social species. Many con-
temporary theories of intention understanding in familiar tasks rely             Figure 1. The JAST goal-inference robot together with the toy vehicle that
on the notion that an observer uses their own motor repertoire to sim-           the human and the robot jointly construct.
ulate an observed action and its effect ([4], for a review see [25]). The
selection of an appropriate complementary behaviour in a joint action            based pattern recognizers were used [30]. As a software development
task depends not only on the inferred goal of the partner, but also on           platform we haven chosen YARP [19]. This open-source project sup-
the integration of additional information sources such as shared task            ports inter-process communication, image processing and a class hi-
knowledge (e.g., a construction plan) and contextual cues.                       erarchy to ease code reuse across different hardware platforms.
   The cognitive control architecture for action coordination in the                Figure 2 illustrates a typical example of the goal inference and ac-
joint construction scenario is formalized by a coupled system of                 tion selection capacities in this domain. In the top image, the human
dynamic fields representing a distributed network of local but con-              reaches his open hand towards the robot teammate. By activating the
nected neural populations [3]. Different pools of neurons encode                 respective action chain in its repertoire, the robot interprets this ges-
task-relevant information about action means, action goals, and con-             ture as a request for a bolt to fix the wheel. Since the human has al-
text in the form of activation patterns that are self-sustained through          ready mounted the wheel on his side of the construction, this inferred
recurrent interactions.                                                          goal describes a currently active subtask. A logical complementary
   The motor simulation idea is implemented by the propagation of                action sequence would be that the robot grasps a bolt to place it in
activity through interconnected neural populations that constitute a             the teammate’s hand. However, the human has seemingly overlooked
learned chain of motor primitives directed towards a specific goal               a bolt in his own working area. In this situation, the representation of
[6]. Typical examples in the context of the construction scenario are            the inferred goal together with the representation of the bolt in the
reaching-grasping-placing/plugging sequences. The chains are auto-               work space of the human trigger the decision to make a pointing ges-
matically triggered by an observed motor act (e.g., reaching or grasp-           ture directed towards the object. In addition, the robot uses speech to
ing) whenever additional input from connected dynamic fields (e.g.,              explain the type of error the human is making.
representing the currently available subgoals) pre-activates the neu-
ral populations. As a consequence of the motor simulation, the robot
                                                                                 3   DIALOGUE-BASED HUMAN-ROBOT
is able to react to the partner’s action sequences well ahead of their
                                                                                     INTERACTION
completion. This anticipation capacity has been shown to be crucial
for a fluent team performance [1, 15].                                           Like the system described in the preceding section, the JAST human-
   In the layer of the control architecture linked to motor execution,           robot dialogue system [23] also supports multimodal human-robot
neural populations represent the decision about the most appropri-               collaboration on a joint construction task. In this case, the user and
ate complementary behaviour. The behaviour is selected as a con-                 the robot work together to assemble wooden construction toys on a
sequence of a competition process between all response alternatives              common workspace, coordinating their actions through speech, ges-
getting input from connected layers (for details see [1]).                       tures, and facial displays. The robot (Figure 3) consists of a pair
   A system based on this dynamic field architecture was imple-                  of Mitsubishi manipulator arms with grippers, mounted in a po-
mented to support human-robot cooperation on the JAST joint con-                 sition to resemble human arms, and an animatronic talking head
struction task. This system constructs a toy vehicle (Figure 1) with             [29] capable of producing facial expressions, rigid head motion, and
the user. The vehicle is composed of several components which are                lip-synchronised synthesised speech. The input channels consist of
initially distributed in the separated working areas of the two team-            speech recognition, object recognition, robot sensors, and face track-
mates; this ensures that neither of the agents is able to reach all of           ing; the outputs include synthesised speech, head motions, and robot
the required components on its own and must rely on the partner to               actions. The components of the system communicate with each other
retrieve them, making joint action essential to a successful interac-            using the Ice distributed object middleware system [14].
tion. The robotics platform we are currently using consists of a torus              The robot is able to manipulate objects in the workspace and to
on which are mounted a 7 DOFs AMTEC arm (Schunk GmbH &                           perform simple assembly tasks. The primary form of interaction with
Co.KG) with a 3-fingered BARRET hand (Barrett Technology Inc.)                   the current version of the system is one in which the robot instructs
and a stereo vision system. The system uses synthesised speech to                the user on building a particular compound object, explaining the
communicate its reasoning process to the human partner.                          necessary assembly steps and retrieving pieces as required, with the
   To control the arm-hand system, we applied a global planning                  user performing the actual assembly actions. As with the dynamic-
method in posture space that facilitates the integration of optimiza-            field system, the workspace is divided into two areas—one belonging
tion principles derived from experiments with humans [5]. For the                to the robot and one to the human—in order to make joint action
object recognition as well as for the classification of object-directed          necessary for success in the overall task.
hand postures and communicative gestures such as pointing or de-                    Input on each of the channels is processed using a dedicated mod-
manding an object, a combination of feature- and correspondence-                 ule for that channel. To process the speech, we use a Java Speech


                                                                            26
                                                                                    API interface to the commercial Dragon NaturallySpeaking speech
                                                                                    recogniser [21]. A camera mounted above the common work area
                                                                                    provides two-dimensional images of the contents of the workspace.
                                                                                    The information from this camera is pre-processed to extract regions
                                                                                    of interest (ROIs). The extracted ROIs are then processed in parallel
                                                                                    by a template-based object-recognition module [20] and a module
                                                                                    that performs static hand-gesture recognition [33].
                                                                                       Input received on all of the input sensors is continuously processed
                                                                                    by the corresponding modules and broadcast through Ice, using the
                                                                                    built-in IceStorm publish-subscribe mechanism. All of the input mes-
                                                                                    sages are received by a multimodal fusion component [11, 12], which
                                                                                    parses the recognized speech into logical forms using the OpenCCG
                                                                                    grammar formalism [31] and combines it with the recognised non-
                                                                                    verbal behaviour to produce multimodal hypotheses representing
                                                                                    user requests. The fusion hypotheses are then sent to the dialogue
                                                                                    manager, which selects an appropriate response.
                                                                                       The dialogue manager is implemented using the TrindiKit dia-
                                                                                    logue management toolkit [18]. This toolkit uses the well-known
                                                                                    information-state update approach to dialogue management [28],
                                                                                    which allows a dialogue to be modelled declaratively. When the dia-
                                                                                    logue manager receives a new set of fusion hypotheses, it selects the
                                                                                    appropriate system response using information from three sources:
                                                                                    the inventory of objects in the world, a representation of the cur-
                                                                                    rent assembly state, and the history of the dialogue. When the sys-
                                                                                    tem is jointly following an assembly plan with the user, the dialogue
                                                                                    manager is able to select from different strategies for traversing the
Figure 2. Example of the goal inference (top) and action selection (bot-            plan: it may use a postorder strategy, in which it proceeds directly
tom) capacities which are implemented by the dynamic field architecture. The        to describing the concrete assembly actions, or it may use a preorder
robot uses speech to communicate the results of its reasoning about the be-         strategy, in which the structure of the plan is described before giving
haviour of the teammate.                                                            specific assembly actions. More details on the dialogue manager and
                                                                                    on the description strategies are given in [10].
                                                                                       Once the dialogue manager has selected a response to the user’s
                                                                                    multimodal utterance, it sends the specification of the response to
                                                                                    the output planner. This module in turn sends commands to select
                                                                                    appropriate output on each of the individual channels to meet the
                                                                                    specification: linguistic content including appropriate multimodal re-
                                                                                    ferring expressions [9], facial expressions and gaze behaviours of the
                                                                                    talking head [8], and actions of the robot manipulators. The user then
                                                                                    responds to the system utterance by speaking or performing actions
                                                                                    in the world, and the interaction continues until the target object has
                                                                                    been assembled.
                                                                                       An excerpt from a typical interaction between a user and the JAST
                                                                                    dialogue system is shown in Figure 4. In this excerpt, the robot knows
                                                                                    the full plan for building the target object: a “railway signal”, which
                                                                                    has sub-components called a “snowman” and a “flower”. The assem-
                                                                                    bled object is shown in Figure 4(a). In the excerpt, the robot instructs
                                                                                    the user on how to build the target object, using a preorder strat-
                                                                                    egy, and the user learns to make particular sub-components along the
                                                                                    way. We are currently carrying out a system evaluation based on this
                                                                                    robot-as-instructor scenario. The evaluation is designed to compare
                                                                                    the two description strategies in terms both of user satisfaction and
                                                                                    in success in the overall joint-construction task.
                                                                                       We will shortly extend the dialogue system to handle scenarios
                                                                                    where the user also knows the assembly plan. In such situations, the
                                                                                    main goal of the interaction is no longer instruction, but rather—as
                                                                                    with the goal-inference system described previously—coordination
                                                                                    between the partners, and the user will be able to take much more
                                                                                    initiative in the dialogue than is currently possible. We will return to
Figure 3. The JAST dialogue robot with a selection of wooden                        the details of this extended scenario in the following section.
construction-toy components.


                                                                               27
                                                                                 formation along with the processed messages from the fusion sys-
                                                                                 tem into the (extended) information state of the integrated system.
                           • SYSTEM: To build a railway signal, we               The dialogue manager then uses enhanced update rules to select an
                             need to make a snowman and a flower.                appropriate system response to the input. Finally, just as in the indi-
                           • USER: How do you make a snowman?                    vidual systems, the selected response is sent to the output system for
                           • SYSTEM: [Picking up a red cube and                  realisation on the output channels.
                             holding it out] To make a snowman, in-
                             sert the green bolt into your blue cube
                             and fasten it with this cube.
                           • USER: [picks up bolt and cube; takes                                        Object                Goal
                             other cube; assembles “snowman”]                                          Inventory               State
                             Okay.
                           • SYSTEM: Well done, you have made                     Multimodal                                                         Multimodal
                             a snowman. Now we need to make a                       Input                                                             Output
                             flower. To make a flower, [. . . ]
                                                                                                          Goal               Dialogue
                                                                                                       Inference             Manager
    (a) “Railway signal”               (b) Dialogue excerpt

Figure 4. A sample object and an excerpt from an interaction where the                     Figure 5.    The architecture of the integrated system.
robot instructs the user on how to construct this object.
                                                                                    This integrated system supports interaction patterns that would not
4     INTEGRATING GOAL INFERENCE AND                                             be possible with either of the individual systems. Most importantly,
      NATURAL-LANGUAGE DIALOGUE                                                  it is able both to detect unexpected actions from the user (i.e., ac-
                                                                                 tions that do not meet what it believes to be the current subgoals)
There are a number of similarities between the two human-robot                   and to engage the user in dialogue to discuss how to deal with the
systems described above. Both support the same basic task—joint                  unexpected action. When both forms of reasoning work together, the
construction—and view the goals and subgoals of this task in a simi-             system is able to detect such user actions and to produce a variety of
lar way. Also, the input and output channels used by the two systems             responses, including correcting the user, asking for clarification as to
are very similar: both include object and gesture recognition in the             the user’s intentions, or attempting to silently revise its representa-
input and produce speech and robot-manipulator actions as part of                tion of the goal state. Varying the system’s response to this situation
their output. On the other hand, the reasoning processes used by the             is able to produce systems with different interactive “personalities”,
two systems are very different: the former uses dynamic neural fields            ranging from one that always makes the user follow the plan selected
to perform goal inference and action selection based entirely on non-            by the system to one where the user has full control over the course
verbal input, while the latter uses techniques from issue-based dia-             of the interaction.
logue management to engage in natural-language conversation with                    Figure 6 shows a sample interaction between a user and the inte-
some multimodal components. The strengths of the two systems are                 grated system, where the role of each of the reasoning components
also complementary: the dynamic-field system is good at detecting                is shown throughout. In this interaction, the user and the robot are
and reasoning about the user’s non-verbal actions, but uses language             jointly building the “railway signal” object (Figure 4(a)). At the start,
only for a limited form of canned output; the dialogue system sup-               the robot system has assumed that the user is building the “snow-
ports advanced linguistic interaction, but has no mechanism to infer             man” sub-component. When the user grasps a medium slat, which
the user’s intention from their actions in the world.                            is not needed for that subgoal, the goal inference system detects this
   Motivated by the above similarities and complementary features,               (just as in the sample interaction described at the end of Section 2)
we have combined components from the two individual human-robot                  and sends a message to the dialogue manager that the user’s action
systems into a single, integrated architecture. The hardware platform            cannot be integrated into the current plan.
for the integrated system is the robot from the dialogue system (Fig-               At this point, the system has several options to deal with the mis-
ure 3), while the scenario is an extended version of the scenarios               match between its beliefs about the current subgoals and the recent
used by each of the individual systems. As in the dynamic-field sce-             action of the user. It might silently revise its view of the current sub-
nario, the user and the robot are both assumed to know the assembly              goals, for example, or it might—as in Figure 2—correct the user’s
plan for the target object and are able to infer the partner’s intentions        apparent “error”. In the example, the system uses a third strategy,
based on their behaviour, and the main goal of the interaction is for            and one that is only available because of the integration of the di-
the two participants to coordinate their actions. As in the dialogue             alogue components: it asks the user in words to clarify their inten-
system, this coordination is accomplished through natural-language               tions. After the user provides the needed clarification, also verbally,
dialogue incorporating both verbal and non-verbal communication.                 the dialogue manager updates the system’s subgoals and informs the
   Figure 5 shows the high-level architecture of the integrated sys-             goal-inference system of the change. The goal-inference system then
tem. Messages on all of the multimodal input channels (speech, ges-              anticipates that, to meet this new subgoal, the user will need the nut
tures, and recognised objects) are sent to both of the input-processing          that is lying on the robot’s side of the table. The system therefore
components, each of which—just as in the individual systems—                     picks up the nut and offers it to the user without being asked.
reasons about the meaning of the user’s actions in the current con-                 As can be seen by the right-hand columns in Figure 6, this type
text, each drawing information from the same set of state modules                of interaction would not be possible with either of the individual sys-
(plan state, object inventory, interaction history). The inferred goals          tems. The dialogue system does not have the necessary mechanism to
and suggested system responses from the goal-inference system are                infer the user’s goals from their actions, while the goal-inference sys-
then passed to the the dialogue manager, which incorporates this in-             tem would only have been able to respond to the user’s unexpected


                                                                            28
          Actions                                                                 Dialogue Manager                   Goal Inference
          User grasps a medium slat
                                                                                                                     Notices that action does not
                                                                                                                     meet current subgoal
                                                                                  Tells output planner to ask
                                                                                  for clarification
          SYSTEM: “We don’t need a medium slat for the snowman”
          USER: “Yes, but I want to build the flower now”
                                                                                  Interprets response and up-
                                                                                  dates subgoals
                                                                                                                     Suggests system response
                                                                                  Sends message to output
                                                                                  planner
          Robot picks up a nut and holds it out
          SYSTEM: “Then you’ll need this nut”
    Figure 6.   A sample interaction with the integrated system, showing the role of each individual reasoning component in the decision-making process.

action by treating it as an error rather than discussing the user’s goals         a system integrating the reasoning components of the two individual
as in the example. Only when these two components are combined is                 systems is able to take advantage of the complementary strengths of
this rich interaction made possible.                                              each to support interactions that neither system is able to support on
                                                                                  its own. In particular, this integrated system is able both to detect
                                                                                  the user’s intentions and anticipate their needs, and to use natural-
4.1    Technical Details                                                          language dialogue to manage the joint activity. The integration of
The two individual systems use the same basic information in their                these two systems is made possible through well-defined interfaces
reasoning (task goals and subgoals, object inventory, input events);              that allow the two sets of reasoning components to share information
however, due to the different implementations, they represent this                about world state, task goals, and input events.
information quite differently. Also, at the implementation level, the                In contrast to the other systems mentioned in the introduction,
components of the dynamic-field system use YARP to communicate                    the integrated JAST system is unique in that it combines methods
with one another, while the dialogue system uses Ice as an integration            and techniques taken from two separate, fully-implemented, existing
platform. A specific goal of the integration has been to make as few              systems—a neuro-inspired perception-action system and a symbolic,
changes as possible to the individual systems. An important aspect                multimodal-dialogue system—to produce an integrated robot system
of creating the integrated system has therefore been coming up with               that is able both to communicate with its human partner using lan-
a common representation for all of the relevant information, where                guage and to intelligently understand and anticipate the partner’s in-
the representation is compatible with both of the systems and both of             tentions. As demonstrated by the example in the preceding section,
the integration platforms.                                                        the integrated system is able to go beyond the capabilities of either
   To support the integration, we have defined generic interfaces to              of the individual systems to support intelligent human-robot cooper-
represent recognised gestures and objects, as well as inferred and                ation on the joint construction task.
proposed domain actions. These representations include the follow-                   The integrated system is currently under development: the neces-
ing information:                                                                  sary interfaces have been specified as described in Section 4.1, and
                                                                                  the reasoning modules from the two systems are being adapted to
• The Gestures representation includes the type of gesture recog-                 use the common interfaces. When the is completed, we will run a
  nised (pointing, grasping, holding-out, unknown) and if necessary,              user evaluation of the full system similar to that currently under way
  the object indicated.                                                           for the dialogue system to demonstrate the contribution of both forms
• The Objects representation includes the classification of the ob-               of reasoning to natural human-robot joint action.
  ject, a 3D position and a flag indicating whether the object can be
  reached by the robot.
                                                                                  ACKNOWLEDGEMENTS
• The Action representation consists of the type of action (grasp-
  and-give, demand-and-receive, speak, undefined) and a string con-               This work was supported by the EU FP6 IST Cognitive Sys-
  taining further specifications (e.g. the object-id for grasp-and-give           tems Integrated Project “JAST” (FP6-003747-IP), http://www.
  or the sentence to speak out loud).                                             euprojects-jast.net/. Thanks to the CIMA workshop re-
                                                                                  viewers for their useful comments and suggestions.
   Internal communication between YARP and Ice is implemented
via a connector module that translates Ice messages to YARP mes-
sages and vice versa.
                                                                                  REFERENCES
                                                                                   [1] E. Bicho, L. Louro, N. Hipólito, and W. Erlhagen, ‘A dynamic neu-
                                                                                       ral field architecture for flexible and fluent human-robot interaction.’,
5     DISCUSSION                                                                       in Proceedings of the 2008 International Conference on Cognitive Sys-
                                                                                       tems, pp. 179–185, (2008).
We have presented two human-robot systems, each of which is de-                    [2] C. Breazeal, A. Brooks, J. Gray, G. Hoffman, C. Kidd, H. Lee, J. Lieber-
signed to support the same joint construction task. One system uses                    man, A. Lockerd, and D. Chilongo, ‘Tutelage and collaboration for
                                                                                       humanoid robots’, International Journal of Humanoid Robotics, 1(2),
dynamic neural fields to perform non-verbal goal inference and ac-                     315–348, (2004).
tion selection, while the other uses a dialogue manager to support                 [3] W. Erlhagen and E. Bicho, ‘The dynamic neural field approach to cog-
multimodal natural-language interaction. We have then shown how                        nitive robotics’, Journal of Neural Enginering, 3, R36–R54, (2006).


                                                                             29
 [4] W. Erlhagen, A. Mukovskiy, and E. Bicho, ‘A dynamic model for ac-                     form’, International Journal of Advanced Robotics Systems, 3(1), 43–
     tion understanding and goal-directed imitation.’, Brain Research, 1083,               48, (2006).
     174–188, (2006).                                                               [20]   T. Müller, P. Ziaie, and A. Knoll, ‘A wait-free realtime system for op-
 [5] W. Erlhagen, A. Mukovskiy, E. Bicho, G. Panin, C. Kiss, A. Knoll,                     timal distribution of vision tasks on multicore architectures’, in Pro-
     H. van Schie, and H. Bekkering, ‘Goal-directed imitation for robots:                  ceedings of the 5th International Conference on Informatics in Control,
     a bio-inspired approach to action understanding and skill learning’,                  Automation and Robotics, (May 2008).
     Robotics and Autonomous Systems, 54, 353–360, (2006).                          [21]   Nuance Communications. Dragon NaturallySpeaking 9. http://
 [6] W. Erlhagen, A. Mukovskiy, F. Chersi, and E. Bicho, ‘On the develop-                  www.nuance.com/naturallyspeaking/.
     ment of intention understanding for joint action tasks’, in Proceedings        [22]   R. Petrick, D. Kraft, K. Mourão, C. Geib, N. Pugeault, N. Krüger, and
     of the 6th IEEE International Conference on Development and Learn-                    M. Steedman, ‘Representation and integration: Combining robot con-
     ing, (July 2007).                                                                     trol, high-level planning, and action learning’, in Proceedings of the
 [7] T. Fong, C. Thorpe, and C. Baur, ‘Collaboration, dialogue, and human-                 International Cognitive Robotics Workshop (CogRob 2008) at ECAI
     robot interaction’, in Robotics Research, volume 6 of Springer Tracts                 2008, (July 2008).
     in Advanced Robotics, 255–266, Springer, (2003).                               [23]   M. Rickert, M. E. Foster, M. Giuliani, T. By, G. Panin, and A. Knoll,
 [8] M. E. Foster, ‘Roles of a talking head in a cooperative human-robot                   ‘Integrating language, vision and action for human robot dialog sys-
     dialogue system’, in Proceedings of the 7th International Conference                  tems’, in Proceedings of HCI International 2007, (July 2007).
     on Intelligent Virtual Agents (IVA 2007), (September 2007).                    [24]   D. Roy, K.-Y. Hsiao, and N. Mavridis, ‘Mental imagery for a conver-
 [9] M. E. Foster, E. G. Bard, R. L. Hill, M. Guhe, J. Oberlander, and                     sational robot’, IEEE Transactions on Systems, Man, and Cybernetics,
     A. Knoll, ‘The roles of haptic-ostensive referring expressions in co-                 Part B, 34(3), 1374–1383, (June 2004).
     operative, task-based human-robot dialogue’, in Proceedings of the             [25]   N. Sebanz, H. Bekkering, and G. Knoblich, ‘Joint action: bodies and
     3rd ACM/IEEE International Conference on Human Robot Interaction                      minds moving together’, Trends in Cognitive Sciences, 10, 70–76,
     (HRI 2008), (March 2008).                                                             (2006).
[10] M. E. Foster and C. Matheson, ‘Following assembly plans in coop-               [26]   T. Spexard, M. Hanheide, and G. Sagerer, ‘Human-oriented interaction
     erative, task-based human-robot dialogue’, in Proceedings of Londial                  with an anthropomorphic robot’, IEEE Transactions on Robotics, 23(5),
     2008, (June 2008).                                                                    852–862, (October 2007).
[11] M. Giuliani and A. Knoll, ‘Integrating multimodal cues using grammar           [27]   R. Stiefelhagen, H. Ekenel, C. Fugen, P. Gieselmann, H. Holzapfel,
     based models’, in Proceedings of HCI International 2007, (July 2007).                 F. Kraft, K. Nickel, M. Voit, and A. Waibel, ‘Enabling multimodal
[12] M. Giuliani and A. Knoll. MultiML – A general purpose representation                  human-robot interaction for the Karlsruhe humanoid robot’, IEEE
     language for multimodal human utterances, 2008. Submitted.                            Transactions on Robotics, 23(5), 840–851, (October 2007).
[13] N. Hawes, A. Sloman, J. Wyatt, M. Zillich, H. Jacobsson, G.-J. Kruijff,        [28]   D. Traum and S. Larsson, ‘The information state approach to dialogue
     M. Brenner, G. Berginc, and D. Skočaj, ‘Towards an integrated robot                  management’, in Current and New Directions in Discourse and Dia-
     with multiple cognitive functions’, in Proceedings of the 22nd Confer-                logue, eds., J. C. J. Van Kuppevelt and R. W. Smith, 325–353, Kluwer
     ence on Artificial Intelligence (AAAI 2007), (2007).                                  Academic Publishers, (2003).
[14] M. Henning, ‘A new approach to object-oriented middleware’, IEEE               [29]   A. J. N. van Breemen, ‘iCat: Experimenting with animabotics’, in Pro-
     Internet Computing, 8(1), 66–75, (Jan–Feb 2004).                                      ceedings of the AISB 2005 Creative Robotics Symposium, (2005).
[15] G. Hoffman and C. Breazeal, ‘Effects of anticipatory action on human-          [30]   G. Westphal, C. von der Malsburg, and R. Würtz, ‘Feature-driven emer-
     robot teamwork efficiency, fluency, and perception of team’, in Pro-                  gence of model graphs for object recognition and categorization’, in
     ceedings of the 2nd ACM/IEEE International Conference on Human                        Applied Pattern Recognition, eds., A. Kandel, H. Bunke, and M. Last,
     Robot Interaction (HRI 2007), (2007).                                                 155–199, Springer Verlag, (2008).
[16] K.-y. Hsiao, S. Vosoughi, S. Tellex, R. Kubat, and D. Roy, ‘Object             [31]   M. White, ‘Efficient realization of coordinate structures in Combinatory
     schemas for responsive robotic language use’, in Proceedings of the                   Categorial Grammar’, Research on Language and Computation, 4(1),
     3rd ACM/IEEE International Conference on Human Robot Interaction                      39–75, (2006).
     (HRI 2008), (2008).                                                            [32]   H. Zender, P. Jensfelt, Ó. Martinez Mozos, G.-J. M. Kruijff, and W. Bur-
[17] W. G. Kennedy, M. D. Bugajska, M. Marge, W. Adams, B. R. Fransen,                     gard, ‘An integrated robotic system for spatial understanding and situ-
     D. Perzanowski, A. C. Schultz, and J. G. Trafton, ‘Spatial representa-                ated interaction in indoor environments’, in Proceedings of the 22nd
     tion and reasoning for human-robot collaboration’, in Proceedings of                  Conference on Artificial Intelligence (AAAI 2007), (July 2007).
     the 22nd Conference on Artificial Intelligence (AAAI 2007), (2007).            [33]   P. Ziaie, T. Müller, M. E. Foster, and A. Knoll, ‘Using a naı̈ve Bayes
[18] S. Larsson and D. R. Traum, ‘Information state and dialogue manage-                   classifier based on k-nearest neighbors with distance weighting for
     ment in the TRINDI dialogue move engine toolkit’, Natural Language                    static hand-gesture recognition in a human-robot dialog system’, in
     Engineering, 6, 323–340, (2000).                                                      Proceedings of CSICC 2008, (March 2008).
[19] G. Metta, P. Fitzpatrick, and L. Natale, ‘YARP: Yet another robot plat-


                                                                               30
                 A Tool for Evolving Artificial Neural Networks
              Efstratios F. Georgopoulos 1, 3, Adam V. Adamopoulos 2, 3 and Spiridon D. Likothanassis 3

Abstract. A hybrid evolutionary algorithm that combines genetic                  architectures forms a surface in the space. The optimal architecture
programming philosophy, with localized Extended Kalman Filter                    design equals to finding the optimum point on this surface.
(EKF) training method is presented here. This algorithm is used for                  The first attempts, described in [10], [23], [25] and [27], focused
the topological evolution and training of Multi-Layered Neural                   mainly on the problem of training the networks and not in the
Networks. It is implemented as a visual software tool in C++                     topology design. They used neural networks with fixed architecture
programming language. The proposed hybrid evolutionary                           and genetic algorithms in order to search the weight space for some
                                                                                 near optimum weight vector that solves the problem of network
algorithm is applied on two bio-signal modeling tasks: the Magneto
                                                                                 training. That is, they used genetic algorithms instead of some
Encephalogram (MEG) of epileptic patients and the Magneto
                                                                                 classical training algorithm. Soon the main research interest moved
Cardiogram (MCG) of normal subjects, exhibiting very satisfactory                from the training, to the search for the optimal architectural (or
results.                                                                         topological) design of a neural network. Some first works used
                                                                                 genetic algorithms in order to imitate the pruning algorithms. They
                                                                                 start with a network larger than necessary for the task and then use
1 INTRODUCTION                                                                   a specially designed genetic algorithm to define which combination
                                                                                 of connections is sufficient to, quickly and accurately, learn to
One of the main problems that are faced when Artificial Neural                   perform the target task, using back propagation. Miller et al. [21]
Networks (ANN) and especially Multilayer Perceptrons, are applied                did that for some small nets. The same problem, but for larger
on some tasks, is finding the network architecture or topology that              networks, was faced by Whitley and Bogard in [26]. Bornholdt and
is best suited for the task at hand. A small network for the problem             Graudenz in [9], used a modified GA in order to evolve a simplified
might causes poor learning ability, while a large one might cause                model of a biological neural network and then applied the algorithm
poor generalization. Until now the common method to determine                    to some toy Boolean functions. A different approach to the design
the architecture of a neural network is by trial and error. However,             and evolution of modular neural network architectures is presented
in the last years there have been many attempts, in the direction of             in [13]. Billings and Zheng in [8] used a GA for the architectural
designing the architecture of a neural network automatically, that               evolution of radial basis function (RBF) networks. The most recent
have led to a variety of methods.                                                approach and maybe the most successful one, to the problem of
   Two subcategories of such methods are a) the constructive and                 finding the near optimum architecture is presented in [28]. There,
b) pruning (destructive) algorithms, [28], [29]. Roughly speaking, a             Yao and Liu propose a new evolutionary system, the EPNet, for
constructive algorithm starts with a minimal network, that is an                 evolving artificial neural networks’ behavior.
ANN with a minimal number of hidden layers, hidden neurons and                       The last couple of years, there is an increasing interest in the use
connections, and adds new layers, neurons or connections, if it is               of multi-objective optimization methods and especially
necessary, during the training phase. On the opposite, a pruning                 evolutionary multi-objective techniques for neural network training
(destructive) algorithm does the opposite, starts with a maximal                 and structure optimization. Two very interesting approaches are
network and deletes the unnecessary layers, nodes and connections                presented in [31] and [32].
during training.                                                                     The present work is the sequence of a series of efforts
   Another approach to this problem is by using Genetic                          concerning the application of evolutionary algorithms for the
Algorithms. Genetic Algorithms are a class of optimization                       optimization of neural networks. In [17] a neural network model
algorithms, which are good in exploring a large and complex space                with binary neurons was evolved by a modified genetic algorithm in
in an intelligent way in order to find values close to the global                order to learn some Boolean functions. In [1], [2], [3], [4], [5], [6],
optimum (see [12], [15], [20] and [22] for details). The design of a             [7], [11], [18] and [19] genetically evolved artificial neural
near optimal topology can be formulated as a search problem in the               networks were successfully used for a variety of problems.
architecture space, where each point in the space represents network                 In this paper we present a hybrid evolutionary method that looks
architecture. The training can be formulated as a search problem in              like more to a genetic programming technique for the evolution of a
the weight space. Since the end of the last decade, there have been              population of feed-forward Multi Layered Perceptrons [14]. This
several attempts to combine the technology of neural networks with               hybrid algorithm combines a genetic programming technique (for
that of genetic algorithms. Given some performance criteria, for                 details see [16]) for the evolution of the architecture of a neural
example error, generalization ability, learning time, architectural              network, with a training method based on the localized Extended
complexity etc, for the architecture, the performance level of all

1
  Technological Educational Institute of Kalamata, Greece, e-mail: sfg@teikal.gr
2
  Dept. of Medicine, Democritus University of Thrace, Greece, e-mail: adam@med.duth.gr
3
  Dept. of Computer Engineering & Informatics, University of Patras, Greece, e-mail: likothan@cti.gr


                                                                            31
Kalman Filter (EKF), known as Multiple Extended Kalman                                                                1
                                                                                                                        ∑
                                                                                                                                                     2
Algorithm (MEKA). The MEKA is described in detail in [24]. The                                            E ( n) =          ⎡⎣ d j ( n) − y j (n) ⎤⎦                                   (6)
                                                                                                                      2 j∈C
novelty of this effort depends on, apart from the combination of
evolution techniques with MEKA, the capability of the proposed
method to search, not only for the optimal number of hidden units,                The differentiation in equation (5) corresponds to the back-
but also, for the number of inputs needed for the problem at hand;             propagation of the global error to the output of neuron i. The
of course this stands only for time series prediction problems where           activation function ϕ(•) is responsible for the non-linearity in the
the number of needed past values, which represent the network’s                neuron. The weight vector wi of the optimum model for neuron i is
inputs, is unknown. This hybrid algorithm is an evolved and heavily            to be “estimated” through training with examples. The activation
enriched version of an older algorithm that was developed by the               function is assumed to be differentiable. Accordingly, we can use
authors and presented in [4], [7] and [19]. Furthermore this                   Taylor series to expand equation (3) about the current estimate of
evolutionary neural network system has been implemented as a                   the weight vector and thereby linearize the equation as follows [14]:
visual tool in C++ with a graphical user interface. In order to test
the ability of this algorithm to produce networks that perform well,
                                                                                                                               ⎡                                               ⎤
                                                                               φ ( xiT (n)wi (n) ) ≅ qiT (n)wi (n) + ⎢φ ⎛⎜ xiT (n) wi (n) ⎞⎟ − qiT (n) wi (n)⎥
                                                                                                                                               ∧                       ∧
we apply the system on two biosignals, namely the Magneto                                                                                                                              (7)
Encephalogram (MEG) recordings of epileptic patients and                                                                       ⎣ ⎝                      ⎠                      ⎦
Magneto Cardiogram (MCG) of normal subjects. The algorithm
produces networks with small sizes that perform well.                             where
   The rest of the paper is organized as follows. Section 2 describes
the hybrid evolutionary algorithm, while the numerical experiments                         ⎡ ∂φ ( xiT ( n ) wi ( n ) ) ⎤                        ∧
                                                                                                                                                          ⎡ ∧            ⎤
are presented in section 3. Finally, section 4 discusses the                   qi ( n ) = ⎢                             ⎥                     = y i ( n ) ⎢1 − y i ( n ) ⎥ xi ( n )    (8)
concluding remarks.                                                                       ⎢⎣       ∂  w  i(n)          ⎥⎦               ∧                 ⎣              ⎦
                                                                                                                          wi ( n ) = wi ( n )


                                                                                    ∧
2 THE HYBRID EVOLUTIONARY                                                          y i (n) is the output of neuron i that results from the use of the
ALGORITHM                                                                      weight estimate. In equation (8) we have assumed the use of the
                                                                               logistic function; other sigmoid functions, like the hyperbolic
2.1     THE MULTIPLE EXTENDED KALMAN                                           tangent, can be used as well. The first term of the right hand side of
        ALGORITHM - MEKA                                                       equation (7) is the desired linear term while the remaining term
                                                                               represents a modeling error. Thus substituting equation (7) and (4)
Consider a network characterized by a weight vector w. The                     in (3) and ignoring the modeling error we obtain:
average cost function that should be minimized during the training
phase is defined in terms of N input-output patterns as follows:                                           d i ( n ) = qiT ( n ) wi ( n ) + ei ( n )                                   (9)

                         1 N
                            ∑∑                                                    Where ei ( n ) and qi ( n ) are defined in equations (5) and (8)
                                                               2
          Eav ( w) =                 ⎡⎣ d j ( n) − y j ( n) ⎤⎦     (1)
                        2 N n =1 j∈C                                           respectively.
                                                                                  Equations (2) and (9) describe the linearized behavior of neuron
   Where d j ( n ) is the desired response and y j ( n ) the actual            i. Given the pair of equations (2) and (9), we can make use of the
response of output neuron j when input pattern n is presented, while           standard Recursive Least Squares (RLS) algorithm equations [14],
the set C includes all the output neurons of the network. The cost             which is a special case of the Kalman filter, to make an estimate of
function Eav ( w) depends on the weight vector w due to the fact that          the weight vector wi ( n ) of neuron i. The resulting solution is
                                                                               defined by the following system of recursive equations [14] that
y j ( n ) itself depends on w.                                                 describe the Multiple Extended Kalman Algorithm (MEKA) [24]:
   Concentrating on an arbitrary neuron i, which might be located
anywhere in the network, its behavior during the training phase may                                    ri ( n ) = ( λ − 1) ⋅ Pi ( n − 1) ⋅ qi ( n )                                   (10)
be viewed as a non-linear dynamic system, which in the context of
Kalman filter theory may be described by the following state-                                      ki ( n ) = ri ( n ) ⋅ (1 + r ( n ) ⋅ qi ( n ) ) − 1
                                                                                                                                   i
                                                                                                                                    T
                                                                                                                                                                                      (11)
measurement equations [14], [24]:
                                                                                                      wi ( n + 1) = wi ( n ) + ei ( n ) ⋅ ki ( n )                                    (12)
                         wi ( n + 1) = wi ( n )                    (2)                          Pi ( n + 1) = ( λ − 1) ⋅ Pi ( n ) − ki ( n ) ⋅ r ( n )  i
                                                                                                                                                         T
                                                                                                                                                                                      (13)
                       d i ( n ) = yi ( n ) + ei ( n )             (3)
                                                                                 Where, n=1,…,N is the iteration number and N is the total
                    yi ( n ) = ϕ ( x ( n ) , wi ( n ) )
                                      T
                                      i                            (4)         number of examples.
                                                                                 The vector qi ( n ) represents the linearized neuron activation
   Where the iteration n corresponds to the presentation of the nth            function given in equation (6), Pi ( n ) is the current estimate of the
input pattern, xi ( n ) and yi ( n ) are the input and output vector of
                                                                               inverse of the covariance matrix of qi ( n ) and ki ( n ) is the Kalman
neuron i respectively and ei ( n ) is the measurement error at the
                                                                               gain. The parameter λ is a forgetting factor which takes values in
output of neuron i, the instantaneous estimate of which is given by:           the range (0,1], and ei ( n ) is the localized measure of the global

                                          ∂ E (n )                             error. Equation (13) is called the Riccatti difference equation.
                    ei (n ) = −                                    (5)            Each neuron in the network perceives its own effective input
                                          ∂ yi(n )
                                                                               qi ( n ) , hence it has to maintain its own copy of Pi ( n ) even in the


                                                                                                                                                                                        2
                                                                          32
case in which it may share some of its inputs with other neurons in                      needed to predict future values is not usually known a
the network.                                                                             priory.
                                                                                        Hidden mutation: it selects randomly a neural network from
2.2      THE EVOLUTIONARY ALGORITHM                                                      the population and changes the structure of its hidden
                                                                                         region by adding or deleting a random number (selected
The proposed evolutionary algorithm is an improved version of a
                                                                                         uniformly from a given interval) of hidden neurons.
modified genetic algorithm that was used aforetime by the authors.
                                                                                        Non Uniform Weight mutation: it is responsible for the fine
It maintains the basic working philosophy of evolutionary
                                                                                         tuning capabilities of the system. It selects randomly a
algorithms and resembles genetic programming (see [16] for
                                                                                         number of connection weights and changes their values to
details) since it evolves complicated structures like linked lists and
                                                                                         new ones as follows: Let suppose that w is the old weight
not simple bit strings as genetic algorithms do.
                                                                                         value then the new one is given by the formula:
   The algorithm evolves, using a number of genetic operators, a
population of artificial neural networks (multilayered perceptrons)
that are represented as linked lists of network layers and neurons;                          w ( n + 1) = w ( n ) ± Δw ( t , ub − w ( n ) )        (15)
thus it is used the direct encoding scheme. The basic steps of the
algorithm are as follow:                                                                 Where lb and ub are the lower and upper bounds of the
1. Initialization: An initial population of neural networks (called                      weight values, t is the generation number, and Δ(t,y) is a
     individuals) is created. Every individual has a random number                       function that returns a value in the range [0,y], such that the
     of neurons (or nodes) and connections (synapses). The
     connection weights are initialized to some random values                            probability of Δ(t,y) being close to 0 increases as t
     within a specific range.                                                            increases. This property causes this operator to search the
                                                                                         solution space initially uniformly (while t is small) and very
2.    Training: Every individual (neural network) in the population                      locally at the later stages. In our experiments the following
      is trained using MEKA for a small number of training epochs.                       function, [20] was used:
      For populations other than initial, training occurs only for
                                                                                                                       ⎛     (1− t ) ⎞
                                                                                                                                   b
      those networks that have been changed by the application of                                    Δ ( t , y ) = y ⋅ ⎜1 − r T ⎟                  (16)
      genetic operators.                                                                                               ⎝             ⎠

3.    Fitness Evaluation: As fitness function it is used a function that                 Where r is a random number on [0,1], T is the maximal
      combines the performance of the network in the training                            generation number (a parameter of the algorithm), and b is a
      and/or validation set with the size of the network. The                            system parameter determining the degree of non-uniformity.
      performance is evaluated using the Mean Squared Error (MSE)
      or the Mean Relative Error (MRE). While, the size is the                          Gaussian weight mutation: it works like the Non Uniform
      number of neurons and/or the number of active synapses in the                      Weight mutation operator with the difference that the new
      network. So the fitness function for the case of MRE has a                         weight value is calculated by the formula:
      formula of the type:
                                                                                                         w ( n + 1) = w ( n ) + Δw ( n )           (17)
                                           1
       Fitness ( i ) =                                               (14)
                       1 + MRE ( i ) + sp ⋅ MRE ( i ) ⋅ SIZE ( i )
                                                                                         Where, Δw is a small random number following Gaussian
                                                                                         distribution.
      Where sp is a parameter that controls the weight of the                           Uniform weight mutation: it works like the Gaussian
      network size in the evaluation of fitness, MRE(i) is the value                     mutation operator with the difference that, Δw is a small
      of MRE of individual i, SIZE(i) is the size of individual i                        random number following Uniform distribution.
      which can be calculated as the number of active connections
      or the number of neurons and i is an index taking values in the            6.   Crossover: It selects two parents (neural networks) and
      range 1 to population size.                                                     generates one or two offspring by recombining parts of them.
                                                                                      The offspring take the place of their parents in the new
4.    Selection: Selection operator is been used in order to create a                 population. In the presented algorithm crossover recombines
      new, intermediate, population from the old one, by selecting                    whole neurons with their incoming connections. But since we
      individuals based on their fitness. This can be done using any                  have to deal with networks with different structures, the new
      of the following three different selection schemes that have                    connections that might have to be produced are initialized with
      been implemented, namely:                                                       random weight values as in the initialization phase. Herein
        The Elitism Roulette Wheel Selection Operator, with                          crossover works more like a mutation operator, like in most
         variable elitist pressure (for more details see [12], [16], [20]             genetic programming systems, than as the recombination
         and [22]).                                                                   operator of genetic algorithms
        The Rank Based Selection (for more details see [12], [16],
         [20] and [22]).                                                            Therefore the presented hybrid evolutionary algorithm works in
        The Tournament Selection with variable tournament size                  brief as follows: it starts with a population of randomly constructed
         (for more details see [12], [16], [20] and [22]).                       Neural Networks (step 1). Networks undergo some training for a
                                                                                 couple of epochs with MEKA, using the training set (step 2).
5.    Mutation: It works on the members of Three different                       Performance is measured with the fitness function (step 3) using the
      mutation operators are implemented:                                        validation set, in order to improve generalization. Then a new,
       Input Mutation: it selects randomly a neural network from                intermediate, population is created, by selecting the more fit
        the population and changes its number of inputs. This                    individuals according to their fitness (step 4) using any of the three
        operator works only on time series modeling and prediction               selection schemes. Some members of this intermediate population
        problems, where the number of past values (network inputs)               undergo transformations by means of genetic operators to form the
                                                                                 members (individuals) of the new population: mutation (step 5) and


                                                                                                                                                      3
                                                                            33
crossover (step 6) operators. The new population that is created is
trained again (step 2); new members are trained for a couple of
epochs, while the members that have survived and passed from the
old population may be trained with MEKA for some more epochs,
or may not be trained at all. This is the new generation. This whole
process continues until a predefined termination condition is
fulfilled; the termination condition might be a maximum number of
generation or a minimum error (MSE or MRE) value. Once
terminated the algorithm is expected to have reached a near-
optimum solution, i.e. a trained network with near optimum
architecture.

2.3     THE TOOL
   This hybrid evolutionary algorithm has been implemented as a
visual tool in C++ programming language, having a graphical user
interface (GUI). Specifically, it was used the Borland C++ version
6.0 IDE for Windows. Figure 1 and 2 depict two of the basic forms
of the program, for the two main categories of problems that it can
be used for, classification and time series prediction. Figure 3 is the                                Figure 3.   The “statistics” form
“statistics” form that illustrates the evolutionary process and prints
useful information about it.
                                                                                   3 NUMERICAL EXPERIMENTS
                                                                                   In order to examine the ability of the algorithm to produce networks
                                                                                   that learn and generalize well we have tested it on two real world
                                                                                   problems: the modeling of the MEG recordings of epileptic patients
                                                                                   and the modeling MCG recordings of normal subjects.
                                                                                      Brain dynamics can be evaluated by recording the changes of the
                                                                                   neuronal electric voltage, either by the electroencephalogram
                                                                                   (EEG), or by the MEG. The EEG recordings represent the time
                                                                                   series that match up to neurological activity as a function of time.
                                                                                   On the other hand the MEG is generated due to the time varying
                                                                                   nature of the neuronal electric activity, since time-varying electric
                                                                                   currents generate magnetic fields. EEG and MEG are considered to
                                                                                   be complementary, each one carrying a part but not the whole of the
                                                                                   information related to the underlying neural activity. Thus, it has
                                                                                   been suggested that the EEG is mostly related to the inter-neural
                                                                                   electric activity, whereas the MEG is mostly related to the intra-
                                                                                   neural activity. The MEG recordings of epileptic patients were
                                                                                   obtained using a Super-conductive QUantum Interference Device
 Figure 1.    The main form for classification problems of the evolutionary
                                                                                   (SQUID) and were digitized with a sampling frequency of 256Hz
                          neural network system
                                                                                   using a 12-bit A/D Converter. SQUID is a very sensitive
                                                                                   magnetometer, capable to detect and record the bio-magnetic fields
                                                                                   produced in the human brain due to the generation of electrical
                                                                                   micro-currents at neural cellular level [30].
                                                                                      The same stands for the MCG recordings which are magnetic
                                                                                   recordings of the heart operation of normal subjects. MEG and
                                                                                   MCG data were provided by the Laboratory of Medical Physics of
                                                                                   the Democritus University of Thrace, Greece, where a one-channel
                                                                                   DC SQUID is operable. Both biosignal data were normalized in the
                                                                                   interval [0,1] in order to be processed by the neural networks.
                                                                                      In all the experiments we used, for comparison reasons, the same
                                                                                   parameter values, which are depicted in table 1. For the case of the
                                                                                   MEG modeling, as training set where used 1024 data samples
                                                                                   (corresponding to a four seconds epoch of the MEG) while for the
                                                                                   testing was used 512 data samples (corresponding to a two seconds
                                                                                   epoch of the MEG). For the case of the MCG modeling, as training
                                                                                   set where used 1024 data samples and for the test set was used 1024
  Figure 2.    The main form for prediction problems of the evolutionary           data samples The algorithm was left to run over 1000 generations.
                          neural network system                                       In order to evaluate the forecasting capability of the produced
                                                                                   networks we used three well-known error measures, the Normalized
   The user can select between this two problem categories. Then                   Root Mean Squared Error (NRMSE), the Correlation Coefficient
he/she can insert the values of the various genetic parameters, the                (CC) and the Mean Relative Error (MRE). The performance of the
training, validation and test files, as well as the output log files. The          hybrid algorithm for the case of MEG modeling is depicted in
user can observe the evolutionary process using some real time                     tables 2 and 3 and in figure 4, while for the case of MCG in tables 4
graphical display of the error, the performance of the best ever
                                                                                   and 5 and in figure 5.
network and other parameters.


                                                                                                                                                      4
                                                                              34
Table 1.   Parameters used for the cases of MEG and MCG modeling                      Table 5.     MCG forecasting - Errors on the Test Set
                    Parameter                           Value                 Architecture         NRMSE               C.C.               MRE
 Population                                        50                           12-10-1             0.2419            0.9704              1.4222
 Max number of Generations                         1000                         9-20-1              0.2081            0.9781              1.1874
 Crossover Probability                             0,1                          10-11-1             0.2214            0.9752              1.3642
 Input Mutation Probability                        0,1                          13-10-1             0.1858            0.9827              0.9617
 Weight Mutation Probability                       0,1                          13-12-1             0.2163            0.9774              0.8677
 Uniform Mutation Probability                      0,1                          10-9-1              0.2193            0.9760              0.9882
 nonUniform Mutation Probability                   0,1
                                                                                 4-4-1              0.3586            0.9445              0.9817
 Gaussian Mutation Probability                     0,1
 MeanGaussian                                      0,1                           3-1-1              0.6497            0.8241              0.9915
 StDev Gauss                                       0,25
 Predicting Horizon                                1                                                     Actual(-) vs Predicted(:)
                                                                                  0.8


     Table 2.     MEG forecasting - Errors on the Training Set                    0.6

Architecture        NRMSE               C.C.               MRE
   4-9-1             0.2540            0.9674             0.0350                  0.4

   3-3-1             0.2913            0.9581             0.0386
                                                                                  0.2
   4-5-1             0.2564            0.9672             0.0357
   3-2-1             0.2967            0.9557             0.0411
                                                                                   0
   3-3-1             0.2705            0.9656             0.0369
   3-4-1             0.2655            0.9645             0.0361
                                                                                 -0.2

       Table 3.        MEG forecasting - Errors on the Test Set                  -0.4
Architecture           NRMSE               C.C.              MRE
   4-9-1                0.1971            0.9805             0.0403              -0.6
                                                                                        0        200       400             600       800         1000
   3-3-1                0.2189            0.9757             0.0438
   4-5-1                0.2063            0.9786             0.0434
                                                                                    Figure 5.    MCG forecasting, performance on the test set.
   3-2-1                0.2309            0.9733             0.0466
   3-3-1                0.2177            0.9765             0.0446
   3-4-1                0.2111            0.9775             0.0429         4 CONCLUSIONS
                                                                            In the current paper it was presented a hybrid biological inspired
                              Actual(-) vs Predicted(:)
                                                                            evolutionary algorithm that combines a genetic programming
      0.8
                                                                            technique with a training method based on the Multiple Extended
                                                                            Kalman Algorithm. This hybrid algorithm is implemented in C++
      0.7                                                                   as a software system with a graphical user interface.
                                                                               The main novelties of the proposed hybrid algorithm are the
                                                                            combination of genetic programming technique with MEKA, the
      0.6
                                                                            use of a fitness function that combines performance with network
                                                                            size, the ability to evolve not only the structure of the hidden layers
      0.5                                                                   but the number of inputs as well, and the large number of different
                                                                            genetic operators and especially mutation operators that have been
                                                                            implemented. Another novelty is the representation used for neural
      0.4
                                                                            networks. As said before, every network in the population is
                                                                            represented as a link list of layers and neurons, using the direct
      0.3                                                                   encoding scheme. The use of link lists has some certain advantages
                                                                            that have to do mainly with the memory management; you use only
                                                                            the memory that is needed every time and you don’t have to
      0.2                                                                   allocate a maximum memory size, for maximum network size like
                                                                            other representation schemes. Moreover link lists are dynamic data
      0.1
                                                                            structures, which it means that the neural network architecture can
         0       100        200         300          400   500   600        change dynamically during run time in contrast with other data
                                                                            structures like matrices that in C++ can not change during run time.
     Figure 4.    MEG forecasting, performance on the test set.                This hybrid algorithm was used for the modeling of two
                                                                            biological time series, namely the Magneto Encephalogram (MEG)
      Table 4.    MCG forecasting - Errors on the Training Set              recordings of epileptic patients and Magneto Cardiogram (MCG) of
Architecture       NRMSE                C.C.               MRE              normal subjects. All the reported cases refer to predictions on
  12-10-1           0.1918             0.9815              0.8421           recordings of the dynamics of nonlinear systems. In all the
  9-20-1            0.1840             0.9829              0.8073           performed experiments the algorithm was able to find a near
  10-11-1           0.1790             0.9839              0.6790           optimum network architecture that gave small prediction errors.
  13-10-1           0.1869             0.9824              0.7875           Therefore we can conclude that the algorithm is able to produce
  13-12-1           0.2213             0.9761              0.8983           small and compact networks that learn and generalize well.
  10-9-1            0.2274             0.9740              0.9702              The algorithm has only tested on time series prediction problems
   4-4-1            0.3593             0.9441              0.9822           and it is in our intention to test it on some difficult classification
   3-1-1            0.6481             0.8262              0.9061           problems as well.


                                                                                                                                                        5
                                                                       35
   One of the main drawbacks of this kind of algorithms, namely                  [12] Goldberg, D. Genetic Algorithms in Search Optimization & Machine
the evolutionary algorithms, hybrid or not, is that they are                           Learning, Addison-Wesley 1989.
computational expensive in terms of computer memory and CPU                      [13] Happel, B., et al. Design and evolution of modular neural network
                                                                                       architectures. Neural Networks, Vol. 7, pp. 985 – 1004, 1994.
time. Even though the proposed algorithm belongs to this category,
                                                                                 [14] Haykin, S., “Neural Networks - A Comprehensive Foundation”,
the use of MEKA for just a couple of epochs for the training phase                     McMillan College Publishing Company, ch. 6, p.213, New York,
of the neural networks and the representation where each member                        1994.
of the population is a network represented as a link list so that there          [15] Holland, J. Adaptation in Natural and Artificial Systems, MIT press
is no need to use encoding and decoding functions for the                              1992.
calculation of network’s performance, makes the algorithm less                   [16] Koza J.R., Genetic programming: on the programming of computers
computational expensive than other approaches to the same                              by means of natural selection, MIT Press, Cambridge, MA, 1992.
problem of neural networks evolution.                                            [17] Likothanassis S. Georgopoulos E. and Fotakis D. (1997). Optimizing
                                                                                      the Structure of Neural Networks Using Evolution Techniques. 5th
   The algorithm could be further improved by adding some more
                                                                                      International Conference on Applications of High - Performance
genetic operators for better and faster local search both to the                      Computers in Engineering, pp. 157-168, Santiago de Compostela,
architecture and weight space and this is going to be one of our                      Spain, July.
future research targets. Furthermore, in the integrated software                 [18] Likothanassis, S. D., Georgopoulos, E. F., Manioudakis, G. D. and
system there are already implemented a large number of genetic                         Adamopoulos, A.V., “Currency Forecasting Using Genetically
operators whose influence to the performance of the hybrid                             Optimized Neural Networks”, HERCMA Athens September 1998.
algorithm needs to be appraised; we need to see which of the three               [19] Likothanassis, S. D., Georgopoulos, E. F. “A Novel Method for the
selection schemes, or the many mutation operators give better                          Evolution of Neural Networks”, 3rd IMACS/IEEE International
                                                                                       Conference on Circuits Systems and Computers (CSC’99), July 1999.
results. Another future research direction will be the combination
                                                                                 [20] Michalewicz, Z., “Genetic Algorithms + Data Structures = Evolution
of MEKA with other evolutionary techniques like Particle Swarm                         Programs”, Springer-Verlag, 1996.
Optimization and Differential Evolution for neural network                       [21] Miller, G., et al. Designing neural networks using genetic algorithms.
evolution.                                                                             Proceedings of the 3rd International Conference on Genetic
                                                                                       Algorithms, Morgan Kaufmann 1989.
                                                                                 [22] Mitchell M. An Introduction to Genetic Algorithms, MIT Press 1996.
REFERENCES                                                                       [23] Montana, D. and Davis, L. Training feedforward neural networks
                                                                                       using genetic algorithms. BBN Systems and Technologies, Cambridge,
[1]  Adamopoulos, A., Andreou, A., Georgopoulos, E., Ioannou, N. and                   MA 1989.
     Likothanassis, S., “Currency Forecasting Using Recurrently RBF              [24] Shah, S., Palmieri, F. and Datum, M., “Optimal Filtering Algorithms
     Networks Optimized by Genetic Algorithms”, Computational Finance                  for Fast Learning in Feed-Forward Neural Networks”, Neural
     1997 (CF’97), London Business School, 1997.                                       Networks, Vol. 5, pp. 779-787, 1992.
[2] Adamopoulos, A., Anninos, P., Likothanassis, S., and Georgopoulos,           [25] Whitley, D. Applying genetic algorithms to neural network problems,
     E. “On the Predictability of MEG of Epileptic Patients Using RBF                  International Neural Networks Society, p.230 1988.
     Networks Evolved with Genetic Algorithms”, BIOSIGNAL’98, Brno,              [26] Whitley, D., and Bogart, C. The evolution of connectivity: Pruning
     Czech Republic, June 23-25, 1998a.                                                neural networks using genetic algorithms. International Joint
[3] Adamopoulos, A., Georgopoulos E., Manioudakis, G. and                              Conference on Neural Networks, Washington D.C., 1. Hillsdale, NJ:
     Likothanassis, S. “An Evolutionary Method for System Structure                    Lawpence Erlbaum, pp. 134-137, 1990.
     Identification Using Neural Networks” Neural Computation ’98,               [27] Whitley, D., and Hanson, T. Optimizing neural networks using faster,
     1998b.                                                                            more accurate genetic search. 3rd Intern. Conference on Genetic
[4] Adamopoulos, A., G. Georgopoulos, S. Likothanassis and P. Anninos,                 Algorithms, Washington D.C., Morgan Kaufmann, pp. 391-396, 1989.
     “Forecasting the MagnetoEngephaloGram (MEG) of Epileptic Patient            [28] Yao, X. & Liu, Y. A New Evolutionary System for Evolving Artificial
     Using Genetically Optimized Neural Networks”, Genetic and                         Neural Networks, IEEE Transactions on Neural Networks, vol. 8, no.
     Evolutionary Computation Conference (GECCO’99), Orlando, Florida                 3, 1997.
     USA, July 14-17, 1999                                                       [29] Yao, X. “Evolving Artificial Neural Networks”, Proceedings of the
[5] Andreou, A., Georgopoulos, E., and Likothanassis, S., and                          IEEE, 87(9):1423:1447, September 1999.
     Polidoropoulos, P., “Is the Greek foreign exchange-rate market              [30] Anninos, P. Jacobson, J. Tsagas, N. Adamopoulos, A. “Spatiotemporal
     predictable? A comparative study using chaotic dynamics and neural                Stationarity of Epileptic Focal Activity Evaluated by Analyzing
     networks”, Proceedings of the Fourth International Conference on                  Magneto Encephalographic (MEG) data and the Theoretical
     Forecasting Financial Markets, Banque Nationale de Paris and                      Implications”. Panminerva Med. 39, 189-201, 1997.
     Imperial College, London, 1997.                                             [31] Graning, L.; Yaochu Jin; Sendhoff, B.: Generalization Improvement in
[6] Andreou, A., Georgopoulos, E., Zombanakis, G. and Likothanassis, S.,               Multi-Objective Learning, Neural Networks, IJCNN apos;06.
     “Testing Currency Predictability Using An Evolutionary Neural                     International Joint Conference on Volume , Issue , 0-0 0 Page(s):4839
     Network Model”, Proceedings of the fifth International Conference on              – 4846, 2006.
     Forecasting Financial Markets, Banque Nationale de Paris and                [32] Vieira, D. A. G. and J. A. Vasconcelos and W. M. Caminhas:
     Imperial College, London, 1998.                                                   Controlling the parallel layer perceptron complexity using a
[7] Andreou A., Georgopoulos E. and Likothanassis, S. “Exchange Rates                  multiobjective learning algorithm, Neural Computing and
     Forecasting: A Hybrid Algorithm Based On Genetically Optimized                    Applications, vol. 16, n. 4, p.p. 317—325, 2007.
     Adaptive Neural Networks”, Computational Economics Journal,
     Kluwer Academic Publishers, vol. 20, issue 3, pp. 191 – 210,
     December 2002
[8] Billings, S. A., and Zheng, G. L. Radial basis function network
     configuration using genetic algorithms. Neural Networks, Vol. 8, pp.
     877-890, 1995.
[9] Bornholdt S. and Graudenz, D. General asymmetric neural networks
     and structure design by genetic algorithms. Neural Networks, Vol. 5,
     pp327 – 334, 1992.
[10] Davis, L. Mapping classifier systems into neural networks.
     Proceedings of the 1988 Conference on Neural Information
     Processing Systems, Morgan Kaufmann, 1988.
[11] Georgopoulos E., Likothanassis S. and Adamopoulos A., “Evolving
     Artificial Neural Networks Using Genetic Algorithms”, Neural
     Network World, 4/00, pp. 565 – 574, 2000.


                                                                                                                                                          6
                                                                            36
         Intelligently Raising Academic Performance Alerts

                                  Dimitris Kalles1, Christos Pierrakeas and Michalis Xenos
Abstract. We use decision trees and genetic algorithms to
analyze the academic performance of students and the
homogeneity of tutoring teams in the undergraduate program on                    2     THE EDUCATIONAL BACKGROUND
Informatics at the Hellenic Open University (HOU). Based on
                                                                                 A module is the basic educational unit at HOU. It runs for
the accuracy of the generated rules, we examine the applicability
                                                                                 about ten months and is the equivalent of about 3-4
of the techniques at large and reflect on how one can deploy
                                                                                 conventional university semester courses. A student may
such techniques in academic performance alert systems.
                                                                                 register with up to three modules per year. For each module,
                                                                                 a student is expected to attend five plenary class meetings
                                                                                 throughout the academic year. A typical class contains about
1         INTRODUCTION
                                                                                 thirty students and is assigned to a tutor (tutors of classes of
Student success is a natural performance indicator in                            the same module collaborate on various course aspects).
universities. However, if that success is used as a criterion for                Class face-to-face meetings are about four hours long and are
tutor assessment (and subsequent possible contract renewal), and                 structured along tutor presentations, group-work and review
if students must evaluate their own teachers, then tutors may tend               of homework. Furthermore, each student must turn in some
to lax their standards. This paper is about dealing with this issue              written assignments (typically four or six), which contribute
in the context of the Hellenic Open University (HOU); we focus                   towards the final grade, before sitting a written exam. That
on the undergraduate Informatics program (about 2,500                            exam is delivered in two stages: you only need sit the second
students). We ask whether we can detect regularities in distance                 if you fail or miss the first.
tutoring, then, we try to associate them with measures of                            Students fail a module and may not sit the written exam if
students’ success in an objective way and, subsequently, reflect                 they do not achieve a pass grade in the assignments they turn
on how to effectively disseminate this information to all                        in; these students must repeat that module afresh. A student
interested parties.                                                              who only fails the written exam may sit it on the following
    The measurement strategy we have developed to-date in HOU                    academic year (without having to turn in assignments); such
has been progressively refined to deal with two closely linked                   “virtual” students are also assigned to student groups but the
problems: that of predicting student success in the final exams                  tutor is only responsible for marking their exam papers.
and that of analyzing whether some specific tutoring practices
have any effect on the performance of students. Each problem
gives rise to the emergence of a different type of user model. A                 3     GENETIC    ALGORITHMS      AND
student model allows us, in principle, to explain and maybe                            DECISION TREES FOR PREDICTION
predict why some students fail in the exams while others
succeed. A tutor model allows us to infer the extent to which a                  In our work we have relied on decision trees to produce
group of tutors diffuses its collective capacity effectively into the            performance models. Decision trees can be considered as rule
student population they advise. However, both types of models                    representations that, besides being accurate, can produce
can be subsequently interpreted in terms of the effectiveness of                 comprehensible output, which can be also evaluated from a
the educational system that the university implements.                           qualitative point of view [1, 2]. In a decision tree nodes
    The rest of this paper is organised in five sections. The next               contain test attributes and leaves contain class descriptors.
section presents the educational background. Section 3 then                         A decision tree for the (student) exam success analysis
reviews the fundamental features of the AI techniques that we                    problem could look like the one in Figure 1 and tells us that a
have used. Following that we report the experimental results for                 mediocre grade at the second assignment (root) is an
the undergraduate programme that we have analysed, as well as a                  indicator of possible failure (left branch) at the exams,
short evaluation of the individual module results that seem to                   whereas a non-mediocre grade refers the alert to the fourth
signify an interesting deviation. Section 5 presents a discussion                (last) assignment.
from the point of view of how one can generalise our approach                       Decision trees are usually produced by analyzing the
as well as how one can substitute other intelligent techniques for               structure of examples (training instances), which are given in
data analysis; finally we conclude and describe directions for                   a tabular form. An excerpt of a training set that could have
future development.                                                              produced such a tree is shown in Table 1. Note that the three
                                                                                 examples shown are consistent with the decision tree. As this
                                                                                 may not always be the case, there rises the need to measure
                                                                                 accuracy, even on the training set, in order to compare the
                                                                                 quality of two decision trees which offer competing
                                                                                 explanations for the same data set.

    1
        All authors are with Hellenic Open University, www.eap.gr. Contact address is dkalles@acm.org.


                                                                            37
                                                                                             Of course, GATREE was first used [3] to confirm the
                                                                                             qualitative validity of the original findings experiments [4],
                                                                                             also serving as result replication, before advancing to more
                           Assgn2 in [3..6]
                                                                                             elaborate experiments [7, 8, 9].
                                                                                                 GATREE [6] evolves populations of trees according to a
                                                                                             fitness function that allows for fine-tuning decision tree size
                                                                                             vs. accuracy on the training set. At each generation, a certain
              FAIL                                     Assgn4 < 3
                                                                                             population of decision trees is generated and sorted
                                                                                             according to fitness. Based on that ordering, certain genetic
                                                                                             operators are performed on some members of the population
                                                                                             to produce a new population. For example, a mutation may
                                          FAIL                            PASS
                                                                                             modify the test attribute at a node or the class label at a leaf,
                                                                                             while a cross-over may exchange parts between decision
                                                                                             trees.
Figure 1. A sample decision tree [3].                                                            The fitness function is fitnessi=Correcti2*x/(sizei2+x), for
                                                                                             tree i. The first part of the product is the actual number of
   Note that the sample decision tree does not utilize data neither                          training instances that i classifies correctly. The second part
on the first nor the third assignments, but such data is shown in                            of the product (the size factor) includes a factor x which
the associated table. Such dimensionality reduction information                              regulates the relative contribution of the tree size into the
is typical of why decision trees are useful; if we consistently                              overall fitness; thus, the payoff is greater for smaller trees
derive trees on some problem that seem to not use some data                                      When using GATREE, we used the default settings for the
column, we feel quite safe to not collect measurements for that                              genetic algorithm operations and set cross-over probability at
data column. Of course, simple correlation could also deliver                                0.99 and mutation probability at 0.01. Moreover, all but the
such information, however it is the visual representation                                    simplest experiments (explicitly so identified in the following
advantages of decision trees that have rendered them as very                                 sections) were carried out using 10-fold cross-validation, on
popular data analysis tools.                                                                 which all averages are based (i.e. one-tenth of the training set
                                                                                             was reserved for testing purposes and the model was built by
    Table 1. A sample decision tree training set (adapted from [3]).
                                                                                             training on the remaining nine-tenths; furthermore, ten such
                                                                                             stages were carried out by rotating the testing one-tenth.
  Assgn1         Assgn2                       Assgn3         Assgn4              Exam
    ...              ...                        ...                 ...           ...
    4.6              7.1                       3.8              9.1              PASS        4     DATA ANALYSIS AT A PROGRAMME
    9.1              5.1                       4.6              3.8              FAIL              LEVEL
    7.6              7.1                       5.8              6.1              PASS        Before advancing, we first review some aggregate statistics of
                                                                                             the undergraduate informatics programme at HOU.
   Analyzing the performance of high-risk students is a goal                                    First, Table 2 presents the success rates for the modules
towards achieving tutoring excellence. It is, thus, reasonable to                            that we have analysed.
assert that predicting a student’s performance can enable a tutor
to take early remedial measures by providing more focused                                              Table 2. Success (percentage) rates of modules.
coaching, especially in issues such as priority setting and time                                                   2004-5         2005-6         2006-7
management.
                                                                                                    INF10               35%            38%               33%
   Initial experimentation at HOU [4] consisted of using several
machine learning techniques to predict student performance with                                     INF11               55%            52%               55%
reference to the final examination. The scope of the                                                INF12               39%            34%               35%
experimentation was to investigate the effectiveness and                                            INF20               56%            44%               44%
efficiency of machine learning techniques in such a context. The
                                                                                                    INF21               37%            44%               37%
WEKA toolkit [5] was used because it supports a diverse
collection of techniques. The key result was that learning                                          INF22               71%            61%               55%
algorithms could enable tutors to predict student performance                                       INF23           N/A                83%               97%
with satisfying accuracy long before final examination. The key                                     INF24               70%            64%               58%
finding that lead to that result was that success in the initial
                                                                                                    INF30               81%            85%               84%
written assignments is a strong indicator of success in the
examination. Furthermore, our tutoring experience corroborates                                      INF31               93%            92%               85%
that finding.                                                                                       INF35           N/A                98%               93%
   We then employed the GATREE system [6] as the tool of                                            INF37           N/A                98%               98%
choice for our experiments, to progressively set and test
                                                                                                    INF42           N/A            N/A               100%
hypotheses of increasing complexity based on the data sets that
were available from the university registry. The formation and
development of these tests is the core content of this chapter and                              Next, Table 3 presents the enrolment numbers for these
is presented and discussed in detail in the following sections.                              modules. Note that, as we advance from junior to senior
GATREE is a decision tree builder that employs genetic                                       years, the overall enrolment is dramatically reduced and the
algorithms to evolve populations of decision trees; it was                                   success rates increase.
eventually used because it produces short comprehensible trees.


                                                                                        38
              Table 3. Enrollment numbers at modules.                                Table 4. Model accuracies omitting tutor data.
                       2004-5        2005-6        2006-7                                      2004-5         2005-6          2006-7
         INF10              987         1.247           1.353                                  E         F    E         F     E        F
         INF11              492             517          642                      INF10        83        84   84         82   83       82
         INF12              717             818          925                      INF11        75        76   76         78   75       80
         INF20              362             389          420                      INF12        74        76   86         74   78       74
         INF21              322             363          383                      INF20        76        70   76         59   87       60
         INF22              321             291          321                      INF21        83        78   76         72   77       73
         INF23          N/A                  52           73                      INF22        68        80   68         76   63       70
         INF24              157             167          221                      INF23            N/A        46         78   89       99
         INF30              156             198          199                      INF24        67        67   68         66   69       70
         INF31              149             200          144                      INF30        77        82   64         85   71       94
         INF35          N/A                 101           58                      INF31        65        95   86         93   68       91
         INF37          N/A                 106          132                      INF35            N/A        72         97   80       92
         INF42          N/A           N/A                109                      INF37            N/A        95        100   95       98
                                                                                  INF42            N/A            N/A         96      100
   The above statistics are all drawn from the university registry
and none is subject to any further processing. However, all                   It is straightforward to attribute the increase in senior year
results presented from now on, refer to experiments carried out            modules to the fact that, eventually, students have to focus on
totally using the GATREE system, with the occasional help of               their exam and pass the test, regardless of how well they did
some post-processing automation scripts.                                   along the year. The large discrepancy, however, suggests that
                                                                           the exercises do not serve well their goal, which is to keep the
                                                                           students engaged in the learning process. One could say that
4.1     Detecting a shift in exam grades                                   exercises are less of learning opportunities and more of
There is a straightforward way to attempt to answer this                   necessary evils.
question. One can build a model that attempts to answer the                   The dramatic decrease in the 2006-7 year results of the
success question for the first stage of the final exam. Then, one          INF20 module are quite interesting. They reflect, basically, a
can build a model that attempts to answer the success question             huge fail rate in the first stage of the exam, which is well
for the overall student grade. A gross disparity in these numbers          served by a small model that predicts failure all around.
should be indicative of an issue that merits investigation.                   When seen from that viewpoint, however, the relatively
   The simplest data to consider as input for this problem                 narrow margins of the junior year modules seem quite
consists of exercise and exam grades, as in Table 1, omitting any          impressive, since they are also associated with low overall
other information (for example, which tutor was responsible for            pass rates. The difference, however, is that the junior modules
a student). The results reported are based on re-classification (we        also report significant dropout rates which skews
reserve a cross-validation like mechanism for the more detailed            pessimistically the rates reported in Table 2.
experiments later on) and are shown in Table 4.
   What does a difference signify? To answer that, one can take
a step backwards and try to answer a simpler question: what does           4.2     Detecting tutor influence
a large difference signify? We have elected to brand a difference          If we take the data sets that were used in section 4.1 and put
as large when the re-classification accuracy of the same module            back in the information on which tutor was responsible for
for the same year differs by at least 20 percentage points when            each student group, we can run the same experiments and try
we compare the model predicting the pass/fail result of the first          to see whether the tutor attribute will surface in some models
stage of the final exam and the corresponding model after a                (sample data are shown in Table 5).
possible second stage (which is the actual pass/fail grade for the            In principle, observing models where the tutor attribute
module). In Table 4 such differences are shown in bold.                    appears near the decision tree root would not be a good thing,
   There are two issues that become apparent when one views                suggesting that a crucial factor in student success is not the
Table 4. The first is that whenever we observe an increase in the          educational system itself but the tutor. As a matter of fact we
model accuracy when switching from the first exam (E) to the               can opt to not look for this information at all in the resulting
final grade (F), this is associated with senior modules where              trees; comparing the accuracies to the ones reported in Table
eventual success rates (see Table 2) are substantial. The only             4 should suffice. These results are now shown in Table 6.
decrease is observed in a junior year module where success rates
are considerably reduced compared to senior year modules.


                                                                      39
            Table 5. An expanded sample training set (see Group).                            Table 7. Lesion study model accuracies including tutor data.
    Assgn1       Assgn2        Assgn3      Assgn4          Group          Exam                                2004-5         2005-6          2006-7
      ...           ...          ...         ...              ...             ...                            E         F    E          F    E       F
     4.6            7.1         3.8         9.1           Athens-1        PASS                  INF10       78         78   75         75   77     77
     9.1            5.1         4.6         3.8            Patras-1       FAIL                  INF11       70         74   72         75   71     74
     7.6            7.1         5.8         6.1           Athens-2        PASS                  INF12       71         68   77         69   75     71
                                                                                                INF20       70         65   72         60   82     61
               Table 6. Model accuracies including tutor data.                                  INF21       79         68   69         65   69     64
                           2004-5           2005-6                   2006-7                     INF22       57         74   61         68   60     65
                          E            F    E         F             E         F                 INF23            N/A        65         74   83     98
            INF10         82         83    80         79            82    81                    INF24       62         70   66         69   63     66
            INF11         75         77    76         78            75    80                    INF30       70         82   64         84   65     89
            INF12         75         77    81         72            80    72                    INF31       63         91   79         91   59     83
            INF20         76         72    76         62            87    61                    INF35            N/A        66         97   72     91
            INF21         84         77    74         74            75    72                    INF37            N/A        95         98   91     95
            INF22         66         80    68         74            62    75                    INF42            N/A             N/A        93     100
            INF23              N/A         52         82            90    99
            INF24         63         69    69         69            66    74                Furthermore, we tried to summarise the results from a
            INF30         75         82    60         88            75    94             further point of view: that of consistency between the results
                                                                                         reported for the E and F columns of both tables. Essentially
            INF31         67         94    85         93            89    91
                                                                                         we computed the quantity (F5-E5)-(F6-E6) for each module
            INF35              N/A         72         98            76    90             for each year, where the subscript indicates which table that
            INF37              N/A         96       100             94    98             particular number was drawn from. Not surprisingly, the two
            INF42              N/A              N/A                 96    100            singularities observed were module INF23 for year 2005-6
                                                                                         (with a value of about 20%) and module INF31 for year
                                                                                         2006-7 (with a value of about -22%).
    This time we observe that the relative difference between the
models which utilise the tutor attribute and the ones that do not
are quite small. There are some very interesting cases, however.
                                                                                         4.3     Observing the accuracy-size trade-off
    For example, the INF11 module demonstrates near zero
differences throughout. It is interesting to note that this module                       It is interesting to investigate whether the conventional
utilizes a plenary exam marking session, which means that tutors                         wisdom on model characteristics is valid. In particular, we
get to mark exam papers drawn from all groups at random. This                            analysed the results in Table 6 and in Table 7 with respect to
places only marginal administrative overhead and, when viewed                            whether an increase (or decrease, accordingly) in model
from the point of model consistency, seems to be well worth it.                          accuracy for a particular module for a year was associated
    Another example is the INF31 module (shown in bold),                                 with a reduction in model size. We say that the model
which demonstrated a year where the tutor attribute seemed to be                         accuracy increases if the accuracy for the E column of that
of paramount importance. In that year, the gap between the first                         year is less that the corresponding number in the F column.
exam stage and the final grade seems to be influenced by the                             For the 68 pairs of numbers reported in Table 6 and in Table
tutors. It is now very narrow (89 to 91) while it was quite wide                         7 we observed that only in 4 of them did we see the same
(68 to 91). This could suggest a relative gap in tutor                                   direction in model accuracy and model size. So, conventional
homogeneity.                                                                             wisdom was confirmed in nearly 95% of the cases.
    There is one other way to view the importance of the tutor
attribute. One can derive a model for one module group and then
attempt to use that model as a predictor of performance for the                          5     DISCUSSION
other module groups (within the same module). This approach,                             HOU has been the first university in Greece to operate, from
while suppressing the tutor attribute, essentially tests its                             its very first year, a comprehensive assessment scheme (on
importance by specifically segmenting the module data set along                          tutoring and administrative services). Despite a rather hostile
groups. The overall accuracy is then averaged over all individual                        political environment (at least in Greece), quite a few
tests. This is the lesion comparison; its results are shown in                           academic departments have lately been moving along the
Table 7.                                                                                 direction of introducing such schemes, though the practice
    We highlight (in bold) the main difference from the results in                       has yet to be adapted at a university level. Still, however,
Table 4, where it now seems that the gap has been shortened a                            there is quite a mentality shift required when considering the
while. Surprisingly, it suggests an erratic intra-group                                  subtle differences between “measuring” and “assessing”.
consistency. Note also, that this particular result in Table 4 was                           The act of measuring introduces some error in what is
the only one not to pass the binary choice (50%) level, which it                         being measured. If indices are interpreted as assessment
only just did in Table 6.                                                                indices, then people (actually, any “assessed” subject where
                                                                                         people are involved – groups of people, for example) will
                                                                                         gradually skew their behaviour towards achieving good
                                                                                         measurements. Such behaviour is quite predictably human, of


                                                                                    40
course; the problem is that it simply educates people in the ropes          6     CONCLUSION
of the measurement system while sidelining the real issue of
                                                                            We have shown how we have used a combination of genetic
improving the educational service.
                                                                            algorithms and decision trees in the context of experimenting
    By shifting measurement to quantities that are difficult to
                                                                            with how one might setup a quality control system in an
“tweak”, one hopes that people whose performance is assessed
                                                                            educational context.
will gradually shift form fine-tuning their short-term behaviour
                                                                                Quality control should be a core aspect of any educational
toward achieving longer-term goals. Indeed, if people find out
                                                                            system but setting up a system for quality control entails
that the marginal gains from fine-tuning their behaviour are too
                                                                            managerial and administrative decisions that may also have to
small for the effort expended to achieve them, it will be easier to
                                                                            deal with political side-effects. Deciding how to best and as
convince them to improve more fundamental attitudes towards
                                                                            early as possible defuse the potential stand-offs that a quality
tutoring (as far as tutors are concerned) or studying (as far as
                                                                            measurement message might trigger calls for the employment
students are concerned).
                                                                            of techniques that not only ensure a basic technical soundness
    In our application, this is demonstrated two-fold.
                                                                            in the actual measurement but also cater to the way the results
    First, by disseminating tutor group homogeneity indices, one
                                                                            are conveyed and subsequently exploited. This is particularly
hopes that, regardless how we call these indices, these tutor
                                                                            so when the application context for the large scale suggests
groups will be motivated by peer pressure to consider their
                                                                            that data and models will freely flow amongst thousands of
performance vis-a-vis other tutor groups. Even if that may not be
                                                                            tutors and tens of thousands of students.
really required, that introspection itself will quite likely improve
                                                                                We have earlier [9] expressed the view that our approach
how hat particular tutor group co-operates; at least it will focus
                                                                            is applicable to any educational setting where performance
their decisions with respect to why such decisions might
                                                                            measurement can be cast in terms of test (and exam)
influence their overall ranking.
                                                                            performance. In the proposed paper we have scaled up our
    For students, a similar argument applies. Realising that one
                                                                            analysis to cover several modules and years and still believe
fits a model which predicts likely failure, even if one knows that
                                                                            that taking the sting out of individual performance evaluation
the particular model is known to err quite some times, is
                                                                            but still being able to convey the full message is a key
something that will most likely motivate that person to take a
                                                                            component of tutoring self-improvement. Scaling our
more decisive approach to studying. For adult students, such a
                                                                            approach to other programmes, other institutions and, even,
decisive approach might even mean to drop a course of studying
                                                                            obtaining the approval of our own university for official and
or defer studying. This is not necessarily negative, however;
                                                                            consistent reporting of such indices is, however, less of a
knowing how to better utilise one’s resources is a key skill in life
long learning.                                                              technical nature and more of a political exercise. After all we
                                                                            need to persuade people that some innovations are less of a
    We have selected decision trees because we want to generate
                                                                            threat and more of an opportunity.
models that can be effectively communicated to tutors and
students alike. We have also selected genetic algorithms to
induce the decision trees because we have shown [7] that, for the
                                                                            ACKNOWLEDGEMENTS
particular application domain, we can derive small and easy to
communicate yet accurate trees. We thus need a hybrid                       Thanassis Hadzilacos (now at the Open University of Cyprus)
approach: rule-based output to be comprehensible and grounded               has contributed to this line of research while at Hellenic Open
and evolutionary computing to derive this output.                           University.
    Which other techniques should one utilise to develop the                   Anonymized data can be available on request for research
models? We cannot fail to note that conventional statistics can             purposes only, on a case-by-case basis.
be cumbersome to disseminate to people with a background on                    We acknowledge the advice, from an anonymous reviewer
humanities or arts, and this could have an adverse impact on the            of the CIMA-ECAI08 workshop, on how to improve the
user acceptance of such systems. In that sense, the decision of             presentation of this work to reflect the combination of AI
whether the models are computed centrally or in a decentralized             techniques used.
fashion (by devolving responsibility to the tutors, for example) is
a key factor. In any case, deploying our measurement scheme in
an organization-wide context would also lend support to our                 REFERENCES
initial preference for short models. At the same time, the
                                                                            [1] Mitchell, T. (1997). Machine Learning. McGraw Hill.
possibility of a decentralized scheme also suggests that we                 [2] Quinlan, J.R (1993). C4.5: Programs for machine learning. San
should strive to use tools that do not demand a steep learning                  Mateo, CA: Morgan Kaufmann.
curve on the part of the tutors.                                            [3] Kalles, D., & Ch. Pierrakeas (2006). Analyzing student
    Of course, one can take an alternative course and drop the                  performance in distance learning with genetic algorithms and
requirement that a model has to be communicated. If we only                     decision trees. Applied Artificial Intelligence, 20(8), pp. 655-
focus on the indices then any technique can be used, from neural                674.
networks to more conventional ones, such as naive Bayes or                  [4] Kotsiantis, S., Pierrakeas, C., & Pintelas, P. (2004). Predicting
logistic regression [4]. As in all data mining application                      students’ performance in distance learning using Machine
                                                                                Learning techniques. Applied Artificial Intelligence, 18:5, 411-
contexts, it is the application that must drive the techniques to
                                                                                426.
use; for our problem, suffice to note that the comparisons                  [5] Witten, I., & Frank, E. (2000). Data mining: practical machine
reported (Table 4, Table 6 and Table 7) are essentially technique               learning tools and techniques with Java implementations. San
independent, yet the GATREE approach has proven to-date to be                   Mateo, CA: Morgan Kaufmann.
the best method for prototyping the measurement exercise that               [6] Papagelis, A., & D. Kalles (2001). Breeding decision trees using
we are developing.                                                              evolutionary techniques. Proceedings of the International
                                                                                Conference      on     Machine      Learning,     Williamstown,
                                                                                Massachusetts, pp. 393-400, Morgan Kaufmann.


                                                                       41
[7] Kalles, D., & Ch. Pierrakeas. (2006). Using Genetic Algorithms and
     Decision Trees for a posteriori Analysis and Evaluation of Tutoring
     Practices based on Student Failure Models. Proceedings of the 3rd
     IFIP conference on Artificial Intelligence Applications and
     Innovations, Athens, Greece, pp. 9-18, Springer.
[8] Hadzilacos, Th., Kalles, D. Pierrakeas, Ch., & M. Xenos (2006). On
     Small Data Sets Revealing Big Differences. Proceedings of the 4th
     Panhellenic conference on Artificial Intelligence, Heraklion, Greece,
     Springer LNCS 3955, pp. 512-515.
[9] Hadzilacos, Th., & D. Kalles (2006). On the Software Engineering
     Aspects of Educational Intelligence. Proceedings of the 10th
     International Conference on Knowledge-Based Intelligent
     Information & Engineering Systems, Bournemouth, UK, Springer
     LNCS 4252, pp. 1136-1143.
[10] Xenos, M., Pierrakeas, C., & Pintelas, P. (2002). A survey on
     student dropout rates and dropout causes concerning the students in
     the Course of Informatics of the Hellenic Open University.
     Computers & Education, 39, 361-377.


                                                                             42
    Recognizing predictive patterns in chaotic maps
                   Nicos G. Pavlidis1, Adam Adamopoulos2 and Michael N. Vrahatis3


Abstract. We investigate the existence of rules (in the form             where r is a control parameter that assumes values in the
of binary patterns) that allow the short-term prediction of              interval [0, 2]. We consider a discrete process generated by:
highly complex binary sequences. We also study the extent to
which these rules retain their predictive power when the se-              xn+1 = fr (xn ) = fr (fr (. . .)) = fr(n+1) (x0 ),      n = 0, 1, . . . ,
quence is contaminated with noise. Complex binary sequences
                                                                                            |    {z       }
                                                                                                 (n+1) times
are derived by applying two binary transformations on real-                                                                    (2)
valued sequences generated by the well known tent map. To                        (n)
                                                                         where fr denotes the nth iterate of fr . The Lyapunov ex-
identify short-term predictors we employ Genetic Algorithms.
                                                                         ponent is given by:
The dynamics of the tent map depend strongly on the value
of the control parameter, r. The experimental results suggest                                         ˛
                                                                                                  1 ˛˛ d (n) ˛˛
                                                                                                                 ˛
that the same is true for the number of predictors. Despite                          λr (x) = lim  ln ˛ fr (x)˛ = ln r,
                                                                                             n→∞ n      dx
the chaotic nature of the tent map and the complexity of the
derived binary sequences, the results reported suggest that                                                                           (n)
                                                                         everywhere in [0, 1]. For r ∈ (0, 1), the orbit, fr (x0 ), for
there exist settings in which an unexpectedly large number
                                                                         any x0 ∈ [0, 1] converges to the unique fixed point 0, as n
of predictive rules exists. Furthermore, rules that permit the
                                                                         increases. For r = 1, every point x ∈ [0, 1/2] is a fixed point.
risk free prediction of the value of the next bit are detected in
                                                                         The chaotic region is 1 < r 6 2, in which λr > 0 [5]. For
a wide range of parameter settings. By incorporating noise in
                                                                         r > 1 the map has two unstable fixed points, one at 0 and the
the data generating process, the rules that allow the risk free
                                                                         other at x∗ (r) = r/(r + 1). Using the notation in [5], we write,
prediction of the next bit are eliminated. However, for small                         (n)
                                                                         xn (r) ≡ fr (1/2). Then x1 (r) = r/2 and x2 (r) = r(1−r/2).
values of the variance of the Gaussian noise term there exist
                                                                         The intervals, (0, x2 (r)) and (x1 (r), 1) are transient  `√for f˜r , and
rules that retain much of their predictive power.
                                                                         we have fr A = A for A = [x2 (r), x1 (r)]. If r ∈            2, 2 , then
                                                                                                               √
                                                                         A is an attractor. At r = 2, the attractor A splits into         ` √  two˜
1   Introduction                                                         bands, A0 and A1 , at the position x = x∗ (r). For r ∈ 1, 2
                                                                                                                                               √
                                                                         we have fr (A0 ) = A1 and fr (A1 ) = A0 . Similarly, at r2 = 2,
In this paper we consider the problem of identifying rules, in           each of the two bands splits into two bands, Aij (i, j = 0, 1). In
the form of binary patterns, that are perfect, or in the worst           this manner, as r decreases, band splitting occurs successively
                                                                                                                             m
case good, short-term predictors of complex binary sequences.            at r = r1 , r2 , . . . , rm , . . ., where rm = 21/2 , and m = 1, 2, . . ..
A binary pattern of length L is defined as perfect short-term            By setting r0 = 2, then, for rm+1 < r < rm , there exist 2m
predictor if its presence in any place of the binary sequence is         disjoint intervals Ai1 ,i2 ,...,im , ik = (0, 1) in which the invariant
declarative of the value of the next bit. By definition, perfect         density is positive (the 2m -band regime). Defining, l = 1+i1 +
predictors, enable the risk-free prediction of the next bit. Sim-        2i2 +· · ·+2m−1 im , and Jl ≡ Ai1 ,i2 ,...,im , it is shown in [5] that
ilarly, good short-term predictors, are binary patterns whose            fr (Jl ) = Jl+1 for 1 6 l 6 2m − 1, and fr (J            ) = ˜J1 , where
                                                                                                                               `M √
appearance in any position of the binary sequence renders the            M = 2m . Therefore, if r lies in the interval 1, 2 , fr maps
value of the next bit highly predictable.                                a set of intervals between √           r − r2 /2 and r/2 to themselves.
   Complex binary sequences are derived through the applica-             If, on the other hand, r > 2 these intervals merge. This is
tion of binary transformations on real–valued data sequences             illustrated in the bifurcation diagram of Fig. 1.
obtained from the tent map. The tent map is a piecewise-                    Real-world time series are frequently contaminated by
linear, continuous map on the unit interval [0, 1] into itself:          noise. To this end, we investigate the resilience of the predic-
                                                                        tors to the presence of noise in the data generating process.
                            rx,         x ∈ [0, 1/2]                     We include an additive Gaussian noise term with zero mean,
             fr (x) =                                ,       (1)
                            r(1 − x),   x ∈ (1/2, 1]                     to Eq. (1), and study the extent to which the predictors de-
                                                                         tected in the original sequences retain their predictive power
1 Institute for Mathematical Sciences, Imperial College London,          for different values of the variance of the distribution.
  South Kensington Campus, London SW7 2AZ, United Kingdom,
  email: n.pavlidis@imperial.ac.uk
2 Medical Physics Laboratory, Department of Medicine, Democri-
  tus University of Thrace, Alexandroupolis GR-68100, Greece,
                                                                         2    Methods
  email: adam@med.duth.gr
3 Department of Mathematics, University of Patras, Patras GR-            The tent map, described in Eq. (1), was employed to gener-
  26110, Greece, email:vrahatis@math.upatras.gr                          ate raw data sequences xn (x0 , r). To generate the raw data


                                                                    43
                                                                         3     Presentation of Results
                                                                         3.1     Fixed threshold
                                                                         In the following, we present indicative results for binary se-
                                                                         quences of length 106 , obtained by applying the transforma-
                                                                         tion of Eq. (3). In Fig. 2 the distribution of bits with value ‘1’
                                                                         and ‘0’ for different values of r is plotted. Evidently, an equal
                                                                         distribution of the two occurs only as r tends to 2.


Figure 1.    Bifurcation diagram of the steady states of the tent
                     map with respect to r.


from the tent map the GNU Multiple Precision Arithmetic Li-
brary (GMP) [1] was utilized to generate floating point num-
bers with precision of at least 5000 bits. Subsequently, binary
data sequences bn (x0 , r) were produced by applying the sim-
ple, threshold, binary transformation originally proposed for
the logistic equation in [4]:                                                Figure 2. Distribution of ones (dashed) and zeros (solid)
                                                                             according to the transformation of Eq. (3) for bn (0.1, r) and
                                                                                            r ∈ [1, 2] with stepsize 10−3 .
                              
                                   0,    if xn 6 0.5,
              bn (x0 , r) =                                   (3)
                                   1,    if xn > 0.5.

   Eq. (3) produces a bit with value ‘1’ when the value of                  The number of distinct patterns of length L that appear for
the tent map is greater that 0.5 and a bit with value ‘0’                different values of r, is reported in Table 1. In detail, the first
otherwise. To avoid transient phenomena, the first 104 iter-             column of Table 1 reports the value of r; the second column
ations of the map were discarded. A number of real-valued                indicates the length of the binary patterns L; the third column
sequences xn (x0 , r) were generated through Eq. (1) for differ-         corresponds to the number of different patterns of length L
ent values of the control parameter, r, and starting points, x0 .        identified in each binary sequence (#f); and finally the fourth
Binary sequences, bn (x0 , r), of 106 bits were produced by ap-          column reports the ratio of the number of patterns of length L
plying Eq. (3) on the raw data, xn (x0 , r).                             found (#f) to the number of possible binary patterns of this
   A second binary transformation, also proposed in [4] for the          length (2L ). The lower the ratio shown in the last column of
logistic equation, was applied on the raw data. This transfor-           the table the fewer the patterns that appear in the binary
mation is also a simple, linear, threshold binary transforma-            sequence and hence the higher the predictability.
tion, but with a variable threshold. The threshold value is the             An inspection of Table 1 suggests that increasing the value
previous value of the raw data of the tent map. Hence, the               of r, gradually increases the number of patterns that are en-
second transformation is formulated as:                                  countered for each value of L and hence degrades predictabil-
                                                                         ity. This effect becomes clear by comparing the results for
                                                                         r = 1.44 and r = 1.999. For r = 1.44 and L = 2, already the
                             
                                  0,    if xn 6 xn−1 ,
             bn (x0 , r) =                                    (4)        ratio of appearing to all possible patterns is 0.75 suggesting
                                  1,    if xn > xn−1 .
                                                                         that one out of the four possible patterns is absent. This ratio
  The number of all possible patterns of length L, 2L , in-              decreases as L increases to reach 0.091 for L = 9 indicating
creases exponentially with respect to L. For large values of L,          that less than 10% of all possible patterns of this length are
therefore, it is infeasible to perform exhaustive search, and            present in the sequence b106 (0.1, 1.44). On the contrary, for
more efficient search methods, such as Genetic Algorithms                r = 1.999 all possible patterns appear for all the different val-
(GAs), are required [2, 3]. To this end, a simple GA with bi-            ues of L up to and including L = 9. It should be noted that
nary representation was implemented and utilized. The GA                 for r = 1.999 and L = 10 the ratio of column four becomes
population consisted of L–bit patterns. The fitness of a pat-            less than unity, but still its value is very close to that, 0.999,
tern p, was the number of times p was encountered in the                 suggesting that even in this case increasing L reduces the ratio
binary sequence bn (x0 , r). The selection process used was              but this effect takes place very slowly.
roulette wheel selection. As crossover operator the well–known              Next, the impact of introducing noise to the data generating
one–point crossover operator was employed. Finally, the mu-              process is investigated. A normally distributed, ε ∼ N (0, σ 2 ),
tation operator utilized was the flip bit mutation operator .            additive noise term was included in the tent map equation,
GAs were applied for several values of L and a number of                 yielding xn+1 = fr (xn ) + ε, where fr (xn ) is given by Eq. (1).
binary sequences, bn (x0 , r). Consequently, patterns that can           It should be noted that we enforced the resulting raw data se-
account as perfect, or good, predictors can be identified by             ries to lie in the interval [0, 1] by rejecting realizations of the
comparing the obtained results for L–bit and (L + 1)–bit pat-            noise term that would result in xn+1 ∈      / [0, 1]. The obtained
terns.                                                                   experimental results for the most predictable binary sequence


                                                                    44
 Table 1. Number of patterns in b106 (0.1, r) obtained through              Table 2. Patterns in b106 (0.1, 1.44) obtained through the
     the transformation of Eq. (3) for different values of r.                  transformation of Eq. (3) and different values of σ 2 .
                                                                           L    patterns   σ 2 = 0.0   σ 2 = 0.01   σ 2 = 0.1   σ 2 = 0.5
    r     L    #f   #f/2L             r     L     #f    #f/2L              1       0         230307        246757     434808      552264
          2     3    0.750                  6     36     0.562                     1         769693        753243     565192      447736
          3     5    0.625           1.7    7     61     0.476                     00              0           17     190252      308874
          4     7    0.437                  8    105     0.410             2       01        231742        246799     243209      244133
          5    11    0.343                  9    179     0.349                     10        231742        246799     243209      244133
   1.44   6    15    0.234                  2      4     1.000                     11        536514        506383     323328      202858
          7    23    0.179                  3      7     0.875                    000              0            0       93237     173043
          8    31    0.121                  4     13     0.812                    001              0           17       97015     135831
          9    47    0.091                  5     24     0.750                    010        112119        119901     108060      133112
          2     3    0.750           1.8    6     43     0.671                    011        119623        126897     135149      111021
          3     5    0.625                  7     78     0.609             3      100              0           17       97015     135831
          4     7    0.437                  8    141     0.550                    101        231742        246782     146194      108301
          5    11    0.343                  9    253     0.494                    110        119623        126898     135148      111021
   1.5    6    16    0.250                  2      4     1.000                    111        416890        379485     188179        91837
          7    25    0.195                  3      8     1.000                    0000             0            0       50612       96905
          8    37    0.144                  4     15     0.937                    0001             0            0       42625       76138
          9    57    0.111                  5     29     0.906                    0010             0           17       43068       74254
          2     3    0.750           1.9    6     55     0.859                    0011             0            0       53947       61577
          3     5    0.625                  7    105     0.820                    0100             0            9       44101       74116
          4     8    0.500                  8    199     0.777                    0101       112119        119892       63959       58995
          5    13    0.406                  9    379     0.740                    0110             0         2915       55812       60753
   1.6    6    21    0.328                  2      4     1.000             4      0111       119622        123982       79336       50268
          7    34    0.265                  3      8     1.000                    1000             0            0       42625       76138
          8    55    0.214                  4     16     1.000                    1001             0           17       54390       59693
          9    88    0.171                  5     32     1.000                    1010       112119        119884       64992       58858
          2     4    1.000          1.999   6     64     1.000                    1011       119623        126897       81202       49443
          3     7    0.875                  7    128     1.000                    1100             0            8       52914       61715
   1.7    4    12    0.750                  8    256     1.000                    1101       119623        126890       82234       49306
          5    21    0.656                  9    512     1.000                    1110       119623        123983       79336       50268
                                                                                  1111       297267        255502     108843        41569
when no noise is included, b106 (0.1, 1.44), are summarised in
Table 2. The first column of the table corresponds to the pat-
tern length L; the second lists all the possible binary patterns
of length L (due to space limitations, only patterns of length
up to four are included); while columns three to six report the
number of occurrences of each pattern for different values of
the variance, σ 2 , starting with the case of no noise (σ 2 = 0).
   Starting from the case of no noise, we observe that more
than three quarters of the binary sequence consists of bits
with value ‘1’. Furthermore, from the patterns with length
two, the pattern ‘00’ is missing, indicating that a ‘0’ is always
followed by a ‘1’. This fact renders the unit length pattern
‘0’ (and consequently all patterns of any length ending with a
‘0’) a perfect predictor, and hence approximately 23% of the
sequence is perfectly predictable. The inclusion of the additive
noise term distorts these findings gradually as the variance in-
creases. For σ 2 = 0.01 findings are marginally altered as the
length two pattern ‘00’ appears only 17 times in the length              Figure 3. Predictive power of the unit length binary pattern ‘0’
                                                                          in the sequences obtained through the transformation of Eq (3)
106 binary sequence. Thus, the probability of a ‘1’ following                    with respect to the variance σ 2 of the noise term.
a bit with value ‘0’ is 0.99993. For σ 2 = 0.1 and σ 2 = 0.5
this probability becomes 0.56109 and 0.44146 respectively. In
the case of σ 2 = 0.5, therefore, the impact of noise is so large        3.2    Variable threshold
that the original finding is reversed and a ‘0’ is more likely to
be followed by a ‘0’. The fact that increasing the variance of           In this subsection we present results from the analysis of
the noise term deteriorates the predictability of the binary se-         binary sequences derived by applying the transformation of
quence is also evident from the fact that patterns that did not          Eq. (4), according to which the threshold is equal to the previ-
appear in the not contaminated with noise sequence, appear               ous value of the tent map. The distribution of bits with value
frequently in the contaminated series. The predictive power              ‘1’ and ‘0’ for different values of the control parameter r is
of the binary pattern ‘0’ (perfect predictors in the noise-free          illustrated in Fig. 4. Comparing Figs. 2 and 4 it is evident
binary sequence) with respect to the value of the variance of            that the two transformations yield substantially different dis-
the additive noise term, σ 2 is illustrated in Fig. 3. To gen-           tributions of ones and zeros. For the second transformation,
erate Fig. 3, σ 2 assumed values in the interval [0, 0.5] with           the proportion of ones exceeds that of zeros for r marginally
stepsize 10−3 .                                                          larger than 1. As shown in Fig. 4 the two proportions are


                                                                    45
                                                                             Table 3. Number of patterns in b106 (0.1, r) obtained through
                                                                                 the transformation of Eq. (4) for different values of r.
                                                                                r     L    #f   #f/2L             r     L    #f    #f/2L
                                                                                      2     3    0.750                  6     9     0.140
                                                                                      3     4    0.500           1.7    7    12     0.093
                                                                                      4     5    0.312                  8    16     0.062
                                                                                      5     6    0.187                  9    21     0.041
                                                                               1.44   6     7    0.109                  2     3     0.750
                                                                                      7     8    0.062                  3     5     0.625
                                                                                      8     9    0.035                  4     7     0.437
                                                                                      9    10    0.019                  5    10     0.312
                                                                                      2     3    0.750           1.8    6    14     0.218
                                                                                      3     4    0.500                  7    19     0.148
                                                                                      4     5    0.312                  8    27     0.105
                                                                                      5     6    0.187                  9    38     0.074
                                                                               1.5    6     7    0.109                  2     3     0.750
                                                                                      7     8    0.062                  3     5     0.625
   Figure 4. Distribution of ones (dashed) and zeros (solid)                          8     9    0.035                  4     8     0.500
    according to the transformation of Eq. (4) for bn (0.1, r)                        9    11    0.021                  5    12     0.375
               and r ∈ [1, 2] with stepsize 10−3 .                                    2     3    0.750           1.9    6    18     0.281
                                                                                      3     4    0.500                  7    27     0.210
                                                                                      4     5    0.312                  8    40     0.156
                                                                                      5     7    0.218                  9    59     0.115
                                  √                                            1.6    6     9    0.140                  2     3     0.750
equal until r becomes equal to 2. This finding is attributed                          7    12    0.093                  3     5     0.625
to the band splitting phenomenon, √      briefly described in Sec-                    8    15    0.058                  4     8     0.500
tion 1, that occurs for r ∈ (1, 2] [5]. From that point and                           9    19    0.037                  5    13     0.406
onward their difference increases.                                                    2     3    0.750          1.999   6    21     0.328
                                                                               1.7    3     4    0.500                  7    34     0.265
   The number of patterns of different length L that appear                           4     5    0.312                  8    55     0.214
in the binary sequences of length 106 , are reported in Table 3                       5     7    0.218                  9    89     0.173
for different values of the control parameter r. More specifi-
cally, the first column of Table 3 reports the value of r; the
second column corresponds to the length L of the binary pat-
terns; the third column reports the number of different pat-                of the variance of the additive noise term ε ∼ N (0, σ 2 ). Note
terns of length L that were identified in the sequence (#f);                that as in the previous case, the resulting raw data sequence
                                                                                   6
and lastly, column four depicts the proportion of the patterns              {xn }10
                                                                                 n=0 was restrained in the interval [0, 1] by rejecting re-
encountered (#f) to the number of possible binary patterns                  alizations of the noise term that would result in xn ∈  / [0, 1].
of length L (2L ).
   As in the case of the fixed threshold binary transforma-                    Starting from the case of no noise, we observe that zeros
tion, increasing the value of the control parameter r increases             and ones are approximately equally distributed in the binary
the number of patterns that appear in the derived binary                    sequence. As in the case of the first transformation, pattern
sequences. However, this effect is more pronounced for the                  ‘00’ is missing from the patterns of length two, a finding which
fixed threshold transformation of Eq. (3) than for the vari-                implies that a ‘0’ is always followed by a ‘1’, and hence all the
able threshold transformation of Eq. (4). Even for r = 1.999,               binary patterns of any length that end with ‘0’ are perfect pre-
Table 3 reports that the number of binary patterns of length                dictors. Furthermore, the pattern ‘111’ was not encountered,
two is three, suggesting that one pattern of length two does                implying that ‘11’ is always followed by a ‘0’. From the inspec-
not appear, and hence a unit length perfect binary predictor                tion of the findings for patterns of length three we also obtain
exists. In contrast, for the fixed threshold binary transforma-             a good predictor of length two, namely the pattern ‘01’, for
tion, Table 1, all four length two binary patterns are present              which the probability of appearance of ‘0’ immediately after
in the sequences that are generated with r > 1.7. Moreover,                 this pattern is 0.96919. Comparing the aforementioned find-
the ratio of the patterns of length L found to the number of                ings for the case of no noise, with the corresponding ones
possible patterns of this length decreases more rapidly in the              obtained by the first, fixed threshold, transformation we con-
sequences generated by the variable threshold transformation.               clude that the binary sequence obtained through the variable
For instance, for r = 1.44, the number of patterns of length                threshold transformation is more predictable.
nine is 47 for the fixed threshold transformation, while for the               As expected, the introduction of noise eliminates all the
variable threshold transformation this number is 10.                        perfect predictors identified in the original binary sequence.
   The impact of introducing noise on the short-term predic-                For a low value of the variance, σ 2 = 0.01, the findings are
tors is studied next. Table 4 reports the patterns of length                marginally distorted. The previously not encountered pattern
two to four that were encountered in the binary sequence                    ‘00’ appears 5979 times yielding a probability of encounter-
b106 (0.1, 1.44) that was obtained through the second transfor-             ing a bit with the value ‘1’ following a bit with a value of
mation, for different values of σ 2 . In detail, the first column of        ‘0’ equal to 0.987688. This probability is marginally lower
Table 4 corresponds to the length L of the patterns; the second             than the corresponding probability for the first transforma-
column lists all possible binary patterns of this length; while             tion. On the other hand, when the variance of the noise term
columns three to six report the number of occurrences of each               increases to σ 2 = 0.1 and σ 2 = 0.5 this probability becomes
pattern in the binary sequences obtained for different values               0.867348 and 0.698495, respectively. Both these probabilities


                                                                       46
        Table 4. Patterns in bn (0.1, 1.44) obtained through the
          transformation of Eq. (4) and different values of σ 2 .

 L        patterns   σ 2 = 0.0   σ 2 = 0.01   σ 2 = 0.1   σ 2 = 0.5
 1           0         492207        485787     399771      471981
             1         507793        514213     600229      528019
             00              0         5979       52973     142274
    2        01        492415        479647     346368      329606
             10        492414        479647     346368      329606
             11          15169        34725     254289      198512
            000              0          628        7257       32975
            001              0         5351       45716     109299
            010        477246        447216     186610      187262
            011          15169        32430     159758      142343
    3       100              0         5351       45716     109299
            101        492414        474296     300652      220307
            110          15168        32430     159758      142343
            111              0         2295       94530       56169        Figure 5. Predictive power of binary patterns identified in the
            0000             0           71         979        6175         sequences obtained through the transformation of Eq (4) with
            0001             0          557        6278       26800                 respect to the variance σ 2 of the noise term.
            0010             0         4840       24665       57459
            0011             0          511       21051       51840
            0100             0         3964       24580       59905
            0101       477246        443252     162030      127357
                                                                           suggesting that there is no perfect predictor. On the contrary,
            0110         15168        30378       98071       98715        for the sequences generated through the second transforma-
    4       0111             0         2052       61686       43628        tion with the same value of r, only three out of the four pos-
            1000             0          557        6278       26800        sible patterns of length two are encountered, suggesting that
            1001             0         4794       39438       82499        there is a perfect short-term predictor of length one. The in-
            1010       477245        442376     161945      129803
            1011         15169        31919     138707        90503        clusion of an additive Gaussian noise term with zero mean in
            1100             0         1387       21136       49394        the tent map equation eliminated all perfect predictors. How-
            1101         15168        31043     138622        92949        ever, for small values of the variance of the Gaussian noise
            1110             0         2052       61686       43628        binary patterns with high predictive power were identified.
            1111             0          243       32844       12541
                                                                              Future work on the subject will include the investigation of
                                                                           multiplicative noise, as well as, the application of this method-
are higher than the corresponding ones for the case of the
                                                                           ology to real–world time series and in particular financial time
transformation of Eq. (3). Moreover, note that for the case of
                                                                           series. It is worth noting that the second binary transforma-
the first transformation and σ 2 = 0.5 a bit with value ‘0’ is
                                                                           tion is particularly meaningful in the study of financial time
more likely to be followed by a bit with the same value (prob-
                                                                           series as it corresponds to the direction of change of the next
ability equal to 0.55854); a phenomenon that does not occur
                                                                           value relative to the present one.
at present. For the pattern ‘11’ the probability of encounter-
ing a zero immediately after it becomes 0.933909, 0.628256,
and 0.717049, for σ 2 equal to 0.01, 0.1, and 0.5, respectively.           Acknowledgments
Finally, for the pattern ‘01’ the probability of zero after its
                                                                           This work was partially supported by the Hellenic Ministry of
appearance is 0.932387, 0.538762, and 0.568140 for σ 2 equal
                                                                           Education and the European Union under Research Program
to 0.01, 0.1, and 0.5, respectively. The predictive power of the
                                                                           PYTHAGORAS-89203.
binary patterns, ‘0’, ‘11’, (perfect predictors in the noise-free
binary sequence) and ‘01’ (good predictor in the noise-free
binary sequence), with respect to the value of the variance of             REFERENCES
the additive noise term, σ 2 is illustrated in Fig. 5. To gener-           [1] Free Software Foundation Inc., GNU Multiple Precision Arith-
ate Fig. 5, σ 2 assumed values in the interval [0, 0.5] with a                 metic Library ver. 4.1.4.
stepsize of 10−3 .                                                         [2] J. H. Holland, Adaptation in Natural and Artificial Systems,
                                                                               University of Michigan Press, 1975.
                                                                           [3] M. Mitchell, Introduction to Genetic Algorithms, MIT Press,
4       Conclusions                                                            1996.
                                                                           [4] N. G. Packard, ‘A genetic learning algorithm for the analysis
                                                                               of complex data’, Complex Systems, 4(5), 543–572, (1990).
Despite the chaotic nature of the tent map and the resulting               [5] T. Yoshida, H. Mori, and H. Shigematsu, ‘Analytic study of
complexity of the binary sequences that were derived after the                 chaos of the tent map: Band structures, power spectra, and
application of two threshold, binary, transformations a large                  critical behaviors’, Journal of Statistical Physics, 31(2), 279–
number of short-term predictors was detected. The reported                     308, (1983).
experimental results indicate that the binary sequences gen-
erated through the variable threshold binary transformation
are more predictable than those obtained through the fixed
threshold transformation. This finding is clearer for values of
the control parameter, r, close to its upper bound, 2. Indeed
for r = 1.999 all the patterns of length up to nine appear in the
binary sequences obtained through the first transformation,


                                                                      47
48
     Improving the Accuracy of Neuro-Symbolic Rules with
                    Case-Based Reasoning
                                      Jim Prentzas1, Ioannis Hatzilygeroudis2 and Othon Michail2

Abstract. In this paper, we present an improved approach                        combination. The bulk of the approaches combining rule-based
integrating rules, neural networks and cases, compared to a                     and case-based reasoning follow the coupling models [17]. In
previous one. The main approach integrates neurules and cases.                  these models, the problem-solving (or reasoning) process is
Neurules are a kind of integrated rules that combine a symbolic                 decomposed into tasks (or stages) for which different
(production rules) and a connectionist (adaline unit)                           representation formalisms (i.e., rules or cases) are applied.
representation. Each neurule is represented as an adaline unit.                    However, a more interesting approach is one integrating
The main characteristics of neurules are that they improve the                  more than two reasoning methods towards the same objective.
performance of symbolic rules and, in contrast to other hybrid                  In [16] and [10], such an approach integrating three reasoning
neuro-symbolic approaches, retain the modularity of production                  schemes, namely rules, neurocomputing and case-based
rules and their naturalness in a large degree. In the improved                  reasoning in an effective way is introduced. To this end,
approach, various types of indices are assigned to cases                        neurules and cases are combined. Neurules are a type of hybrid
according to different roles they play in neurule-based                         rules integrating symbolic rules with neurocomputing in a
reasoning, instead of one. Thus, an enhanced knowledge                          seamless way. Their main characteristic is that they retain the
representation scheme is derived resulting in accuracy                          modularity of production rules and also their naturalness in a
improvement.      Experimental     results     demonstrate  its                 large degree. In that approach, on the one hand, cases are used
effectiveness.                                                                  as exceptions to neurules, filling their gaps in representing
                                                                                domain knowledge and, on the other hand, neurules perform
                                                                                indexing of the cases facilitating their retrieval. Finally, it
1 INTRODUCTION                                                                  results in accuracy improvement.
In contrast to rule-based systems that solve problems from                         In this paper, we enhance the above approach by employing
scratch, case-based systems use pre-stored situations (i.e.,                    different types of indices for the cases according to different
cases) to deal with similar new situations. Case-based reasoning                roles they play in neurule-based reasoning. In this way, an
offers some advantages compared to symbolic rules and other                     improved knowledge representation scheme is derived as
knowledge representation formalisms. Cases represent specific                   various types of neurules’ gaps in representing domain
knowledge of the domain, are natural and usually easy to obtain                 knowledge are filled in by indexed cases. Experimental results
[11], [12]. Incremental learning comes natural to case-based                    demonstrate the effectiveness of the presented approach
reasoning. New cases can be inserted into a knowledge base                      compared to our previous one.
without making changes to the preexisting knowledge. The                           The rest of the paper is organized as follows. Section 2
more cases are available, the better the domain knowledge is                    presents neurules, whereas Section 3 presents methods for
represented. Therefore, the accuracy of a case-based system can                 constructing the indexing scheme of the case library. Section 4
be enhanced throughout its operation, as new cases become                       describes the hybrid inference mechanism. Section 5 presents
available. A negative aspect of cases compared to symbolic                      experimental results regarding accuracy of the inference
rules is that they do not provide concise representations of the                process. Section 6 discusses related work. Finally, Section 7
incorporated knowledge. Also it is not possible to represent                    concludes.
heuristic knowledge. Furthermore, the time-performance of the
retrieval operations is not always the desirable.
                                                                                2 NEURULES
   Approaches integrating rule-based and case-based reasoning
have given interesting and effective knowledge representation                   Neurules are a type of hybrid rules integrating symbolic rules
schemes and are becoming more and more popular in various                       with neurocomputing giving pre-eminence to the symbolic
fields [3], [13], [14], [15], [17], [18], [19]. The objective of                component. Neurocomputing is used within the symbolic
these efforts is to derive hybrid representations that augment the              framework to improve the performance of symbolic rules [7],
positive aspects of the integrated formalisms and                               [10]. In contrast to other hybrid approaches (e.g. [4], [5]), the
simultaneously minimize their negative aspects. The                             constructed knowledge base retains the modularity of
complementary advantages and disadvantages of rule-based and                    production rules, since it consists of autonomous units
case-based reasoning are a good justification for their possible                (neurules), and also retains their naturalness in a large degree,


1 Technological Educational Institute of Lamia, Department of Informatics and Computer Technology, 35100 Lamia, Greece, email: dprentzas@teilam.gr.
2 University of Patras, Dept of Computer Engineering & Informatics, 26500 Patras, Greece, email: {ihatz, michailo}@ceid.upatras.gr.


                                                                           49
since neurules look much like symbolic rules [7], [8]. Also, the             conclusion(s). Table 1 (Section 3) presents two example
inference mechanism is a tightly integrated process, which                   neurules, from a medical diagnosis domain.
results in more efficient inferences than those of symbolic rules               Neurules can be constructed either from symbolic rules, thus
[7], [10]. Explanations in the form of if-then rules can be                  exploiting existing symbolic rule bases, or from empirical data
produced [9], [10].                                                          (i.e., training examples) (see [7] and [8] respectively). An
                                                                             adaline unit is initially assigned to each possible conclusion.
                                                                             Each unit is individually trained via the Least Mean Square
2.1 Syntax and Semantics                                                     (LMS) algorithm. When the training set is inseparable, special
The form of a neurule is depicted in Fig.1a. Each condition Ci is            techniques are used. In that case, more than one neurule having
assigned a number sfi, called its significance factor. Moreover,             the same conclusion are produced.
each rule itself is assigned a number sf0, called its bias factor.
Internally, each neurule is considered as an adaline unit                                              Table 1. Example neurules
(Fig.1b). The inputs Ci (i=1,...,n) of the unit are the conditions           NR1: (-23.9)                             NR2: (-13.4)
of the rule. The weights of the unit are the significance factors            if patient-class is human0-20 (10.6),   if patient-class is human21-35 (6.9),
of the neurule and its bias is the bias factor of the neurule. Each             pain is continuous (10.5),              pain is continuous (3.2),
input takes a value from the following set of discrete values: [1                fever is high (8.8),                   joints-pain is yes (3.1),
(true), 0 (false), 0.5 (unknown)]. This gives the opportunity to                 fever is medium (8.4),                 fever is low (1.5),
easily distinguish between the falsity and the absence of a                      patient-class is human21-35 (6.2),     fever is no-fever (1.5)
condition in contrast to symbolic rules. The output D, which                     fever is no-fever (2.7),            then disease-type is chronic-
                                                                                 ant-reaction is medium (1.1)       inflammation
represents the conclusion (decision) of the rule, is calculated via
                                                                             then disease-type is inflammation
the standard formulas:
                                              n
               D = f(a) ,      a = sf 0 + ∑ sf i C i                         2.2 The Neurule-Based Inference Engine
                                             i=1
                                                                             The neurule-based inference engine performs a task of
                               ⎧1        if a ≥ 0                            classification: based on the values of the condition variables
                      f (a ) = ⎨                                             and the weighted sums of the conditions, conclusions are
                               ⎩ -1     otherwise
                                                                             reached. It gives pre-eminence to symbolic reasoning, based on
                                                                             a backward chaining strategy [7], [10]. As soon as the initial
where a is the activation value and f(x) the activation function,            input data is given and put in the working memory, the output
a threshold function. Hence, the output can take one of two                  neurules are considered for evaluation. One of them is selected
values (‘-1’, ‘1’) representing failure and success of the rule              for evaluation. Selection is based on textual order. A neurule
respectively.                                                                fires if the output of the corresponding adaline unit is computed
                                                                             to be ‘1’ after evaluation of its conditions. A neurule is said to
                                                                             be ‘blocked’ if the output of the corresponding adaline unit is
                                                                             computed to be ‘-1’ after evaluation of its conditions.
                                                                                A condition evaluates to ‘true’ (‘1’), if it matches a fact in
                                                                             the working memory, that is there is a fact with the same
                                                                             variable, predicate and value. A condition evaluates to
                                                                             ‘unknown’, if there is a fact with the same variable, predicate
                                                                             and ‘unknown’ as its value. A condition cannot be evaluated if
                                                                             there is no fact in the working memory with the same variable.
                                                                             In this case, either a question is made to the user to provide data
                                                                             for the variable, in case of an input variable, or an intermediate
       Fig. 1. (a) Form of a neurule (b) a neurule as an adaline unit
                                                                             neurule with a conclusion containing the variable is examined,
                                                                             in case of an intermediate variable. A condition with an input
   The general syntax of a condition Ci and the conclusion D is:             variable evaluates to ‘false’ (‘0’), if there is a fact in the
<condition>::= <variable> <l-predicate> <value>                              working memory with the same variable, predicate and
<conclusion>::= <variable> <r-predicate> <value>                             different value. A condition with an intermediate variable
where <variable> denotes a variable, that is a symbol                        evaluates to ‘false’ if additionally to the latter there is no
representing a concept in the domain, e.g. ‘sex’, ‘pain’ etc, in a           unevaluated intermediate neurule that has a conclusion with the
medical domain. <l-predicate> denotes a symbolic or a numeric                same variable. Inference stops either when one or more output
predicate. The symbolic predicates are {is, isnot} whereas the               neurules are fired (success) or there is no further action
numeric predicates are {<, >, =}. <r-predicate> can only be a                (failure).
symbolic predicate. <value> denotes a value. It can be a symbol                 During inference, a conclusion is rejected (or not drawn)
or a number. The significance factor of a condition represents               when none of the neurules containing it fires. This happens
the significance (weight) of the condition in drawing the                    when: (i) all neurules containing the conclusion have been
                                                                             examined and are blocked or/and (ii) a neurule containing an


                                                                        50
alternative conclusion for the specific variable fires instead. For                   conclusions and not with specific neurules because there
instance, if all neurules containing the conclusion ‘disease-type                     may be more than one neurule with the same conclusion
is inflammation’ have been examined and are blocked, then this                        in the neurule base.
conclusion is rejected (or not drawn). If a neurule containing                  The indexing process may take as input the following types
e.g. the alternative conclusion ‘disease-type is primary-                    of knowledge:
malignant’ fires, then conclusion ‘disease-type is inflammation’                 (a) Available neurules and non-indexed cases.
is rejected (or not drawn), no matter whether all neurules                       (b) Available symbolic rules and indexed cases. This type of
containing as conclusion ‘disease-type is inflammation’ have                          knowledge concerns an available formalism of symbolic
been examined (and are blocked) or not.                                               rules and indexed exception cases as the one presented in
                                                                                      [6].
                                                                                The availability of data determines which type of knowledge
3 INDEXING                                                                   is provided as input to the indexing module. If an available
Indexing concerns the organization of the available cases so                 formalism of symbolic rules and indexed cases is presented as
that combined neurule-based and case-based reasoning can be                  input, the symbolic rules are converted to neurules using the
performed. Indexed cases fill in gaps in the domain knowledge                ‘rules to neurules’ module. The produced neurules are
representation by neurules and during inference may assist in                associated with the exception cases of the corresponding
reaching the right conclusion. To be more specific, cases may                symbolic rules [10]. Exception cases are indexed as ‘false
enhance neurule-based reasoning to avoid reasoning errors by                 positives’ by neurules. Furthermore, for each case ‘true
handling the following situations:                                           positive’ and ‘false negative’ indices may be acquired using the
  (a) Examining whether a neurule misfires. If sufficient                    same process as in type (a).
      conditions of the neurule are satisfied so that it can fire, it           When available neurules and non-indexed cases are given as
      should be examined whether the neurule misfires for the                input to the indexing process, cases must be associated with
      specific facts, thus producing an incorrect conclusion.                neurules and neurule base conclusions. For each case, this
  (b) Examining whether a specific conclusion was erroneously                information can be easily acquired as following:
      rejected (or not drawn).                                                  Until all intermediate and output attribute values of the case
   In the approach in [10], the neurules contained in the neurule            have been considered:
base were used to index cases representing their exceptions. A               1. Perform neurule-based reasoning for the neurules based on
case constitutes an exception to a neurule if its attribute values               the attribute values of the case.
satisfy sufficient conditions of the neurule (so that it can fire)           2. If a neurule fires, check whether the value of its conclusion
but the neurule's conclusion contradicts the corresponding                       variable matches the corresponding attribute value of the
attribute value of the case. In this approach, various types of                  case. If it does (doesn't), associate the case as a ‘true
indices are assigned to cases. More specifically, indices are                    positive’ (‘false positive’) with this neurule.
assigned to cases according to different roles they play in                  3. Check all intermediate and final conclusions. Associate the
neurule-based reasoning and assist in filling in different types                 case as a ‘false negative’ with each rejected (or not drawn)
of gaps in the knowledge representation by neurules. Assigning                   conclusion that ought to have been drawn based on the
different types of indices to cases can produce an effective                     attribute values of the case.
approach combining symbolic rule-based with case-based                          To illustrate how the indexing process works, we present the
reasoning [1].                                                               following example. Suppose that we have a neurule base
   In this new approach, a case may be indexed by neurules and               containing the two neurules in Table 1 and the example cases
by neurule base conclusions as well. In particular, a case may               shown in Table 2 (only the most important attributes of the
be indexed as:                                                               cases are shown). The cases however, also possess other
    (a) False positive (FP), by a neurule whose conclusion is                attributes (not shown in Table 2).
        contradicting. Such cases, as in our previous approach,                 ‘disease-type’ is the output attribute that corresponds to the
        represent exceptions to neurules and may assist in                   neurules’ conclusion variable. Table 3 shows the types of
        avoiding neurule misfirings.                                         indices associated with each case in Table 2 at the end of the
    (b) True positive (TP), by a neurule whose conclusion is                 indexing process.
        endorsing. The attribute values of such a case satisfy                  To acquire indexing information, the input values
        sufficient conditions of the neurule (so that it can fire)           corresponding to the attribute values of the cases are presented
        and the neurule's conclusion agrees with the                         to the example neurules. Recall that when a neurule condition
        corresponding attribute value of the case. Such cases                evaluates to ‘true’ it gets the value ‘1’, whereas when it is false
        may assist in endorsing correct neurule firings.                     gets ‘0’.
    (c) False negative (FN), by a conclusion erroneously                        For example, given the input case C2, the final weighted sum
        rejected (or not drawn) by neurules. Such cases may                  of neurule NR1 is: -23.9 + 10.6 + 10.5 + 8.8 = 6>0. Note that
        assist in reaching conclusions that ought to have been               the first three conditions of NR1 evaluate to ‘true’ whereas the
        drawn by neurules (and were not drawn). If neurules                  remaining four (i.e., ‘fever is medium’, ‘fever is no-fever’,
        with alternative conclusions containing this variable                ‘patient-class is human21-35’ and ‘ant-reaction is medium’) to
        were fired instead, it may also assist in avoiding neurule           ‘false’ (not contributing to the weighted sum).
        misfirings. ‘False negative’ indices are associated with


                                                                        51
                                                                    Table 2. Example cases
                         Case                                                          ant-      joints-
                                  patient-class         pain            fever                                  disease-type
                          ID                                                         reaction     pain
                                                                                                                  chronic-
                          C1      human21-35        continuous           low           none        yes
                                                                                                               inflammation
                          C2       human0-20        continuous         high            none        no          inflammation
                          C3       human0-20          night            high            none        no          inflammation
                          C4       human0-20        continuous        medium           none        no          inflammation
                                                                                                                  chronic-
                          C5      human21-35        continuous        no-fever       medium        yes
                                                                                                               inflammation
                                                                                                                  chronic-
                          C6       human0-20        continuous           low           none        no
                                                                                                               inflammation

The fact that the final weighted sum is positive means that                             In case (a), firing of the neurule is suspended and case-based
sufficient conditions of NR1 are satisfied so that it can fire.                     reasoning is performed for cases indexed as ‘false positives’
Furthermore, the corresponding output attribute value of the                        and ‘true positives’ by the neurule and cases indexed as ‘false
case matches the conclusion of NR1 and therefore C2 is                              negatives’ by alternative conclusions containing the neurule’s
associated as ‘true positive’ with NR1.                                             conclusion variable. Cases indexed as ‘true positives’ by the
                                                                                    neurule endorse its firing whereas the other two sets of cases
        Table 3. Indices assigned to the example cases in Table 2                   considered (i.e., ‘false positives’ and ‘false negatives’) prevent
 Case       Type of index                      Indexed by                           its firing. The results produced by case-based reasoning are
  ID                                                                                evaluated in order to assess whether the neurule will fire or
  C1        ‘True positive’                 Neurule NR2                             whether an alternative conclusion proposed by the retrieved
  C2        ‘True positive’                 Neurule NR1                             case will be considered valid instead.
  C3       ‘False negative’         Conclusion ‘disease-type is                         In case (b), the case-based module will focus on cases
                                           inflammation’                            indexed as ‘false negatives’ by conclusions containing the
  C4        ‘True positive’                 Neurule NR1                             specific (intermediate or output) variable.
  C5       ‘False positive’                 Neurule NR1                                 The basic steps of the inference process are the following:
  C5        ‘True positive’                 Neurule NR2                             1. Perform neurule-based reasoning for the neurules.
  C6       ‘False negative’      Conclusion ‘disease-type is chronic-               2. If sufficient conditions of a neurule are fulfilled so that it can
                                           inflammation’                            fire, then
                                                                                      2.1. Perform case-based reasoning for the ‘false positive’
    Similarly, when the input values corresponding to the                                    and ‘true positive’ cases indexed by the neurule and the
attribute values of cases C1 and C4 are given as input to the                                ‘false negative’ cases associated with alternative
neurule base, sufficient conditions of neurules NR2 and NR1                                  conclusions containing the neurule’s conclusion
respectively are satisfied so that they can fire and the                                     variable.
corresponding output attribute case values match their                                2.2. If none case is retrieved or the best matching case is
conclusions. Furthermore, when the input values corresponding                                indexed as ‘true positive’, the neurule fires and its
to the attribute values of case C5 are given as input to the                                 conclusion is inserted into the working memory.
neurule base, sufficient conditions of both neurules NR1 and                          2.3. If the best matching case is indexed as ‘false positive’ or
NR2 are satisfied so that they can fire. However, the                                        ‘false negative’, insert the conclusion supported by the
corresponding output attribute case values match the conclusion                              case into the working memory and mark the neurule as
of NR2 and contradict the conclusion of NR1. In addition,                                    'blocked'.
conclusion ‘disease-type is inflammation’ cannot be drawn                           3. If all intermediate neurules with a specific conclusion
when the input values corresponding to the attribute values of                      variable are blocked, then
case C3 are given as input because the only neurule with the                          3.1. Examine all cases indexed as ‘false negatives’ by the
corresponding conclusion (i.e., NR1) is blocked. A similar                                   corresponding intermediate conclusions, retrieve the
situation happens for case C6.                                                               best matching one and insert the conclusion supported
                                                                                             by the retrieved case into the working memory.
                                                                                    4. If all output neurules with a specific conclusion variable are
4 THE HYBRID INFERENCE MECHANISM                                                    blocked, then
The inference mechanism combines neurule-based with case-                             4.1. Examine all cases indexed as ‘false negatives’ by the
based reasoning. The combined inference process mainly                                       corresponding final conclusions, retrieve the best
focuses on the neurules. The indexed cases are considered                                    matching one and insert the conclusion supported by the
when: (a) sufficient conditions of a neurule are fulfilled so that                           retrieved case into the working memory.
it can fire, (b) all output or intermediate neurules with a specific                    The similarity measure between two cases ck and cl is
conclusion variable are blocked and thus no final or                                calculated via a distance metric [1]. The best-matching case to
intermediate conclusion containing this variable is drawn.                          the problem at hand is the one having the maximum similarity


                                                                               52
with (minimum distance from) the input case. If multiple stored            according to [10] which will be referred to as
cases have a similarity equal to the maximum one, a simple                 NBRCBR_PREV.
heuristic is used.                                                            Inferences were run for both NBRCBR and
   Let present now two simple inference examples concerning                NBRCBR_PREV using the testing sets. Inferences from
the combined neurule base (Table 1) and the indexed example                NBRCBR_Prev were performed using the inference mechanism
cases (Tables 2 and 3). Suppose that during inference sufficient           combining neurule-based and CBR as described in [10].
conditions of neurule NR1 are satisfied so that it can fire. Firing        Inferences from NBRCBR were performed according to the
of NR1 is suspended and the case-based reasoning process                   inference mechanism described in this paper. No test case was
focuses on the cases contained in the union of the following sets          stored in the case libraries.
of indexed cases:                                                             Table 4 presents such experimental results regarding
        • the set of cases indexed as ‘true positives’ by NR1:             inferences from NBRCBR and NBRCBR_PREV. It presents
              {C2, C4},                                                    results regarding classification accuracy of the integrated
        • the set of cases indexed as ‘false positives’ by                 approaches and the percentage of test cases resulting in neurule-
              NR1: {C5} and                                                based reasoning errors that were successfully handled by case-
        • the set of cases indexed as ‘false negatives’ by                 based reasoning. Column ‘% FPs handled’ refers to the
              alternative conclusions containing variable                  percentage of test cases resulting in neurule misfirings (i.e.,
              ‘disease-type’ (i.e., ‘disease-type is chronic               ‘false positives’) that were successfully handled by case-based
              inflammation’): {C6}.                                        reasoning. Column ‘% FNs handled’ refers to the percentage of
                                                                           test cases resulting in having all output neurules blocked (i.e.,
So, in this example the case-based reasoning process focuses on            ‘false negatives’) that were successfully handled by case-based
the following set of indexed cases: {C2, C4} ∪ {C5} ∪ {C6} =               reasoning. ‘False negative’ test cases are handled in
{C2, C4, C5, C6}.                                                          NBRCBR_PREV by retrieving the best-matching case from the
   Suppose now that during inference both output neurules in               whole library of indexed cases.
the example neurule base are blocked. The case-based
reasoning process will focus on the cases contained in the union                                         Table 4. Experimental results
set of the following sets of indexed cases:                                                              NBRCBR                      NBRCBR_PREV
         • the set of cases indexed as ‘false negatives’ by
                                                                                        Classification


                                                                                                                               Classification
               conclusion ‘disease-type is inflammation’: {C3}.
                                                                                          Accuracy


                                                                                                                                 Accuracy
                                                                                                           Handled


                                                                                                                     Handled


                                                                                                                                                 Handled


                                                                                                                                                           Handled
                                                                                                                     % FNs


                                                                                                                                                           % FNs
                                                                                                            % FPs


                                                                                                                                                  % FPs
         • the set of cases indexed as ‘false negatives’ by                Dataset
               conclusion ‘disease-type is chronic-inflammation’:
               {C6}.
Therefore, in this example the case-based reasoning process                   Car      96.04%             52.81%     64.07%    92.49%           15.51%     20.36%
focuses on the following set of indexed cases: {C3} ∪ {C6} =                 (1728
{C3, C6}.                                                                  patterns)
                                                                           Nursery     98.92%             58.68%     52.94%    97.68%           6.60%      18.82%
                                                                            (12960
5 EXPERIMENTAL RESULTS                                                     patterns)

In this section, we present experimental results using datasets
                                                                              As can be seen from the table, the presented approach results
acquired from [2]. Note that there are no intermediate
                                                                           in improved classification accuracy. Furthermore, in inferences
conclusions in these datasets. The experimental results involve
                                                                           from NBRCBR the percentages of both ‘false positive’ and
evaluation of the presented approach combining neurule-based
                                                                           ‘false negative’ test cases successfully handled are greater than
and case-based reasoning and comparison with our previous
                                                                           the corresponding percentages in inferences from
approach [10]. 75% and 25% of each dataset were used as
                                                                           NBRCBR_PREV. Results also show that there is still room for
training and testing sets respectively. Each initial training set
                                                                           improvement.
was used to create a combined neurule base and indexed case
                                                                              We also tested a nearest neighbor approach working alone in
library. For this purpose, each initial training set was randomly
                                                                           these two datasets (75% of the dataset used as case library and
split into two disjoint subsets, one used to create neurules and
                                                                           25% of the dataset used as testing set). We used the similarity
one used to create an indexed case library. More specifically,
                                                                           measure presented in Section 5. The approach classified the
2/3 of each initial training set was used to create neurules by
                                                                           input case to the conclusion supported by the best-matching
employing the ‘patterns to neurules’ module [8] whereas the
                                                                           case retrieved from the case library. Classification accuracy for
remaining 1/3 of each initial training set constituted non-
                                                                           car and nursery dataset is 90.45% and 96.67% respectively. So,
indexed cases. Both types of knowledge (i.e., neurules and non-
                                                                           both integrated approaches perform better. This is due to the
indexed cases) were given as input to the indexing construction
                                                                           fact that the indexing schemes assist in focusing on specific
module presented in this paper producing a combined neurule
                                                                           parts of the case library.
base and an indexed case library which will be referred to as
NBRCBR. Neurules and non-indexed cases were also used to
produce a combined neurule base and an indexed case library


                                                                      53
7 CONCLUSIONS                                                                      [12] D.B. Leake (ed.), Case-Based Reasoning: Experiences, Lessons &
                                                                                        Future Directions, AAAI Press/MIT Press, 1996.
In this paper, we present an approach integrating neurule-based                    [13] M.R. Lee, ‘An Exception Handling of Rule-Based Reasoning Using
and case-based reasoning that improves a previous hybrid                               Case-Based Reasoning’, Journal of Intelligent and Robotic Systems,
approach [10]. Neurules are a type of hybrid rules integrating                         35, 327-338, (2002).
symbolic rules with neurocomputing. In contrast to other neuro-                    [14] C.R. Marling, M. Sqalli, E. Rissland, H. Munoz-Avila, D. Aha,
symbolic approaches, neurules retain the naturalness and                               ‘Case-Based Reasoning Integrations’, AI Magazine, 23, 69-86,
modularity of symbolic rules. Integration of neurules and cases                        (2002).
                                                                                   [15] S. Montani, R. Bellazzi, ‘Supporting Decisions in Medical
is done in order to improve the accuracy of the inference
                                                                                       Applications: the Knowledge Management Perspective’, International
mechanism. Cases are indexed according to the roles they can                           Journal of Medical Informatics, 68, 79-90, (2002).
play during neurule-based inference. More specifically, they are                   [16] J. Prentzas, I. Hatzilygeroudis, ‘Integrating Hybrid Rule-Based with
associated as ‘true positives’ and ‘false positives’ with neurules                     Case-Based Reasoning’, In S. Craw and A. Preece (Eds), Advances in
and as ‘false negatives’ with neurule base conclusions.                                Case-Based Reasoning, Proceedings of the European Conference on
   The presented approach integrates three types of knowledge                          Case-Based Reasoning, ECCBR-2002, Lecture Notes in Artificial
representation schemes: symbolic rules, neural networks and                            Intelligence, Vol. 2416, Springer-Verlag, 336-349, 2002.
case-based reasoning. Most hybrid intelligent systems                              [17] J. Prentzas, I. Hatzilygeroudis, ‘Categorizing Approaches Combining
implemented in the past usually integrate two intelligent                              Rule-Based and Case-Based Reasoning’, Expert Systems, 24, 97-122,
                                                                                       (2007).
technologies e.g. neural networks and expert systems, neural
                                                                                   [18] E.L. Rissland, D.B. Skalak, ‘CABARET: Rule Interpretation in a
and fuzzy logic, genetic algorithms and neural networks, etc. A
                                                                                       Hybrid Architecture’, International Journal of Man-Machine Studies,
new development that should receive interest in the future is the                      34, 839-887, (1991).
integration of more than two intelligent technologies,                             [19] D. Rossille, J.-F. Laurent, A. Burgun, ‘Modeling a Decision Support
facilitating the solution of complex problems and exploiting                           System for Oncology using Rule-Based and Case-Based Reasoning
multiple types of data sources.                                                        Methodologies’, International Journal of Medical Informatics, 74,
                                                                                       299-306, (2005).
                                                                                   [20] H. Vafaie, C. Cecere, ‘CORMS AI: Decision Support System for
References                                                                             Monitoring US Maritime Environment’, Proceedings of the 17th
                                                                                       Innovative Applications of Artificial Intelligence Conference (IAAI),
[1] G. Agre, ‘KBS Maintenance as Learning Two-Tiered Domain
                                                                                       AAAI Press, 1499-1507, (2005).
    Representation’, In M.M. Veloso, A. Aamodt, (Eds.): Case-Based
    Reasoning Research and Development, First International Conference,
    ICCBR-95, Proceedings, Lecture Notes in Computer Science, Vol.
    1010, Springer-Verlag, 108-120, 1995.
[2] A. Asuncion, D.J. Newman, ‘UCI Repository of Machine Learning
    Databases’      [http://www.ics.uci.edu/~mlearn/MLRepository.html].
    Irvine, CA, University of California, School of Information and
    Computer Science (2007).
[3] N. Cercone, A. An, C. Chan, ‘Rule-Induction and Case-Based
    Reasoning: Hybrid Architectures Appear Advantageous’, IEEE
    Transactions on Knowledge and Data Engineering, 11, 164-174,
    (1999).
[4] S. I. Gallant, Neural Network Learning and Expert Systems, MIT
    Press, 1993.
[5] A.Z. Ghalwash, ‘A Recency Inference Engine for Connectionist
    Knowledge Bases’, Applied Intelligence, 9, 201-215, (1998).
[6] A.R. Golding, P.S. Rosenbloom, ‘Improving accuracy by combining
    rule-based and case-based reasoning’, Artificial Intelligence, 87, 215-
    254, (1996).
[7] I. Hatzilygeroudis, J. Prentzas, ‘Neurules: Improving the Performance
    of Symbolic Rules’, International Journal on AI Tools, 9, 113-130,
    (2000).
[8] I. Hatzilygeroudis, J. Prentzas, ‘Constructing Modular Hybrid Rule
    Bases for Expert Systems’, International Journal on AI Tools, 10, 87-
    105, (2001).
[9] I. Hatzilygeroudis, J. Prentzas, ‘An Efficient Hybrid Rule-Based
    Inference Engine with Explanation Capability’, Proceedings of the
    14th International FLAIRS Conference, AAAI Press, 227-231,
    (2001).
[10] I. Hatzilygeroudis, J. Prentzas, ‘Integrating (Rules, Neural Networks)
    and Cases for Knowledge Representation and Reasoning in Expert
    Systems’, Expert Systems with Applications, 27, 63-75, (2004).
[11] J. Kolodner, Case-Based Reasoning, Morgan Kaufmann Publishers,
    San Mateo, CA, 1993.


                                                                              54
          Combinations of Case-Based Reasoning with Other
                        Intelligent Methods
                                                  Jim Prentzas1 and Ioannis Hatzilygeroudis2


Abstract. Case-based reasoning is a popular approach used in                    other intelligent systems to create an improved overall
intelligent systems. Whenever a new case has to be dealt with,                  system.
the most similar cases are retrieved from the case base and their                  Popular CBR combinations involve combinations with rule-
encompassed knowledge is exploited in the current situation.                    based reasoning (RBR), model-based reasoning (MBR) and soft
Combinations of case-based reasoning with other intelligent                     computing methods. CBR has also been combined with other
methods have been explored deriving effective knowledge                         intelligent methods (e.g. ontologies). In certain CBR
representation schemes. Although some types of combinations                     combinations both combination trends have been followed. In
have been mostly explored, other types have not been                            other combinations one of the two trends is mostly explored.
thoroughly investigated. In this paper, we briefly outline                         In this paper, we briefly discuss aspects involving CBR
popular case-based reasoning combinations. More specifically,                   combinations. We focus on intelligent methods with which
we focus on combinations of case-based reasoning with rule-                     CBR is usually combined. Our purpose is not to present an
based reasoning, soft computing and ontologies. We illustrate                   extensive survey of developed CBR combinations but to
basic types of such combinations and discuss future directions.                 present their key aspects.


1 INTRODUCTION                                                                  3 COMBINATIONS OF CBR
Case-based representations store a large set of previous cases                  Combinations of CBR with other intelligent methods have been
with their solutions in the case base using them whenever a                     explored for more effective knowledge representation and
similar new case has to be dealt with [19], [22]. Whenever, a                   problem solving. CBR can be combined with various intelligent
new input case comes in, a case-based system performs                           methods. However, CBR is usually combined with RBR, MBR
inference in four phases known as the case-based reasoning                      and soft computing methods.
(CBR) cycle [1]: (i) retrieve, (ii) reuse, (iii) revise and (iv)                   To categorize CBR combinations one could use Medsker’s
retain. The retrieval phase retrieves from the case base the                    general categorization scheme for integrated intelligent systems
most relevant stored case(s) to the new case. Indexing                          [26]. Medsker distinguishes five main combination models:
schemes and similarity metrics are used for this purpose. In                    standalone, transformational, loose coupling, tight coupling and
the reuse phase, a solution for the new case is created based                   fully integrated models. Distinction between those models is
on the retrieved most relevant case(s). The revise phase                        based on the degree of coupling between the integrated
validates the correctness of the proposed solution, perhaps                     components. Underlying categories for some of these models
with the intervention of the user. Finally, the retain phase                    are also defined. Main types of underlying categories for loose
decides whether the knowledge learned from the solution of                      and tight coupling models involve pre-processing, post-
the new case is important enough to be incorporated into the                    processing and co-processing models as well as embedded
system.                                                                         processing (for tight coupling models only). Not all of these
   CBR can be effectively combined with other intelligent                       combination models and/or their underlying categories have
methods [25], [31]. Two main trends for CBR combinations                        been thoroughly explored in the case of CBR combinations.
can be discerned. The first trend involves embedded                             The types of combination models that have been applied to
approaches in which the primary intelligent method (usually                     CBR combinations depend on the nature of the other intelligent
CBR) embeds one or more other intelligent methods to                            methods combined with CBR. Some combination models are
assist its internal online and offline tasks. The second                        difficult to apply in certain CBR combinations. For instance, it
combination trend involves approaches in which the                              is difficult to apply the fully integrated model in combinations
problem solving process can be decomposed into tasks for                        of RBR with CBR. Obviously, the standalone model can be
which different representation formalisms are required or                       applied to combinations of CBR with any other method.
available. In such situations, a CBR system as a whole (with                       Generally speaking, coupling models are the most usual
its possible internal modules) is integrated ‘externally’ with                  CBR combination models. More specifically, embedded
                                                                                coupling approaches constitute perhaps the most popular trend.

1 Technological Educational Institute of Lamia, Department of Informatics and Computer Technology, 35100 Lamia, Greece, email: dprentzas@teilam.gr.
2 University of Patras, Dept of Computer Engineering & Informatics, 26500 Patras, Greece, email: ihatz@ceid.upatras.gr.


                                                                           55
Most of the combinations following this trend use other                    applied in case retrieval. In addition, fuzzy adaptation rules can
intelligent methods to assist various CBR tasks. CBR is a                  be employed in case adaptation.
generic methodology for building knowledge-based systems                     The works concerning combination of RBR with CBR [31]
and its internal reasoning tasks can be implemented using a                could potentially be improved with use of fuzzy rules.
number of techniques as long as the guiding CBR principles are             Investigation of coupling approaches in combinations of CBR
followed [36]. The reverse approach that is, embedding case-               with fuzzy systems besides embedded ones could be fruitful.
based modules into intelligent systems employing other
representations to assist in their internal tasks does not seem to
be popular with the exception of combinations with genetic                 3.3 Combinations of CBR with Neural Networks
algorithms. In combinations of CBR with RBR and MBR,                       Neural networks are usually employed by CBR to perform
various coupling approaches have also been investigated                    tasks such as indexing, retrieval and adaptation. In this way,
besides embedded approaches [31]. In coupling combinations                 appealing characteristics of neural networks such as
of CBR with soft computing methods, embedded approaches                    parallelism, robustness, adaptability, generalization and ability
seem to be the most thoroughly investigated.                               to cope with incomplete input data are exploited [10], [35]. Due
   In the following, we discuss main issues involving                      to the fact that different types of neural networks have been
combinations of CBR with RBR, fuzzy logic, neural networks,                developed (e.g. back propagation neural networks, radial basis
genetic algorithms and ontologies.                                         function networks, Self-Organizing Map networks, ART
                                                                           network), different types of neural capabilities for classification
                                                                           and clustering can be exploited. Certain CBR approaches have
3.1 Combinations of CBR with RBR                                           employed different types of neural networks for the various
                                                                           internal CBR tasks (e.g. [12], [34]). Knowledge extracted from
Various types of coupling models involving combinations of
                                                                           neural networks could also be exploited by CBR [10], [35]. An
CBR and RBR have been investigated i.e., sequential
                                                                           interesting direction could involve non-embedded coupling
processing, co-processing and embedded processing [31].
                                                                           approaches combining CBR with neural networks.
   In sequential processing, information (produced by
reasoning) necessarily passes sequentially through some or all
of the combined modules to produce the final result [33], [11].            3.4 Combinations of CBR with Genetic Algorithms
   In co-processing approaches, the combined modules closely
interact in producing the final result. Such systems can be                Usual combinations of CBR with genetic algorithms (GAs)
discerned into two types: cooperation-oriented, which give                 involve use of GAs to optimize (one or more) aspects of a CBR
emphasis on cooperation, and reconciliation-oriented, which                system. On the other hand, CBR can be exploited to enhance
give emphasis on reconciliation. In the former type, the                   GAs. Other types of combinations of CBR with GAs can be
combined components cooperate with each other (usually by                  also implemented.
interleaving their reasoning steps) [27], [32]. In the latter, each           GAs can be used within CBR to enhance indexing and
component produces its own conclusion, possibly differing                  retrieval. GAs have been used to assign case feature weights
from the conclusion of the other component, and thus a                     enhancing similarity assessment [39], [8], to perform feature
reconciliation process is necessary [14].                                  selection [18] and generally to select relevant indices for
   In embedded processing, CBR systems employ one or more                  evolving environments. GAs have also been used to retrieve
RBR modules to perform tasks of their CBR cycle (e.g.                      multiple similar cases [38]. If k nearest neighbor retrieval is
retrieval and adaptation). Such approaches are quite common in             applied, genetic algorithms can be used to find the optimal k
CBR especially for adaptation. RBR systems embedding CBR                   parameter in order to improve the retrieval accuracy [2].
modules do not seem to exist.                                              Furthermore, GAs can be used to perform instance selection
                                                                           i.e., finding the representative cases in a case base and
                                                                           determining a reduced subset of a case base. In this way, time
                                                                           performance is improved by reducing search space and
3.2 Combinations of CBR with Fuzzy Logic                                   accuracy can be improved through elimination of noisy and
CBR can be combined with fuzzy logic in fruitful ways in order             useless cases [2].
to handle imprecision. A usual approach is the incorporation of               Additionally, GAs have been used to enhance case
fuzzy logic into a CBR system in order to improve CBR aspects              adaptation [16], [17]. Genetic algorithms can also optimize case
[4], [29], [35], [9]. Such combinations have been vastly                   representation, e.g. by performing case feature discretization
explored as imprecision and uncertainty are inherent in various            [18] and removing irrelevant features. Such optimizations
CBR tasks. Fuzzy terms may be used in case representation                  improve accuracy, search time and storage requirements. It is
enabling a flexible encoding of case features that encompasses             also quite usual to simultaneously optimize more than one CBR
imprecise and uncertain information. Fuzzy logic may be also               aspect with GAs (e.g. [2], [18]).
proved very useful in indexing and retrieval. Fuzzy indexing                  On the other hand CBR can be employed to enhance GAs.
enables multiple indexing of a case on a single feature with               CBR can be applied to GAs by creating cases to track the
different degrees of membership [35]. Fuzzy similarity                     history of a search. This case base can contribute in the
assessment and matching methods can produce more accurate                  understanding of how a solution was reached, why a solution
results. Fuzzy clustering and classification methods can also be           works, and what the search space looks like. It could thus be


                                                                      56
used to design highly tailored search strategies for future use          of earlier systems combining CBR with only one of the other
[23]. Such an approach could therefore be used to explain the            two intelligent methods (e.g. RBR or MBR alone). Multi-
results of the genetic algorithm and for knowledge extraction.           integrated CBR approaches, besides those involving
Moreover, similar stored cases can be also incorporated into a           RBR/MBR, could be developed. For instance, ontologies could
genetic algorithm to reduce convergence time and improve                 constitute an interesting candidate method that could be
solution accuracy. GAs randomly initialize their starting                combined with CBR and another intelligent method in order to
population. Instead, relevant stored cases can be used as part of        facilitate knowledge sharing and reuse among the integrated
the initial population (solution) of GAs. Additionally, relevant         system components themselves [5] and among integrated
stored cases can be periodically injected into the pool of               systems. Such a combination could be useful in Web-based
chromosomes while the genetic algorithm runs [24], [7]. In               systems that need to share knowledge. Fruitful such approaches
certain approaches, CBR is exploited by GAs for both                     could involve combinations of CBR, ontologies and
knowledge extraction and case injection [30].                            RBR/MBR. For instance in [6] an approach combining CBR,
                                                                         RBR and an ontology is presented.
                                                                            Multi-integrated paradigms could also be considered systems
3.5 Combinations of CBR with Ontologies                                  combining CBR with certain types of neuro-symbolic or neuro-
Ontologies facilitate knowledge sharing and reuse. They can              fuzzy approaches in which the neuro-symbolic (neuro-fuzzy)
provide an explicit conceptualization describing data semantics          module fully integrates the neural and symbolic (fuzzy)
and a shared and common understanding of the domain                      approach. Such modules could be used within CBR instead of
knowledge that can be communicated among agents and                      plain neural or fuzzy components. Non-embedded coupling
application systems [6]. Ontologies play a crucial role in               approaches can be applied as well. For instance, in [15] a
enabling the processing and sharing of knowledge between                 neuro-symbolic method is combined with CBR according to the
programs on the Web [21]. Intelligent Decision Support                   reconciliation coupling approach.
Systems in the semantic Web framework should be able to
handle, integrate with and reason from distributed data and
                                                                         4 CONCLUSIONS
information on the Web [3].
   Therefore ontologies can be combined with CBR in various              In this paper, we discuss key aspects involving combinations of
ways. Ontologies can be used by a CBR system to represent the            CBR with other intelligent methods. Such combinations are
input problem [20], to enhance similarity assessment [13], case          becoming increasingly popular due to the fact that in many
representation, case abstraction and case adaptation [3].                application domains a vast amount of case data is available.
Ontologies may perform all such CBR tasks [37].                          Such combined approaches have managed to solve problems in
                                                                         application domains where a case-based module needs the
                                                                         assistance and/or completion of other intelligent modules in
3.6 Combinations of CBR with Multiple Intelligent                        order to produce effective results. This trend is very likely to
Methods                                                                  carry on in the following years.
The previous sections focused on combinations of CBR with                   Future directions in combinations of CBR with other
one other individual intelligent method. However, intelligent            intelligent methods could involve a number of aspects. Main
systems have been developed that combine CBR with multiple               such aspects involve: (a) combinations of CBR with soft
other intelligent methods. Such multi-integrated paradigms               computing methods, (b) combinations of CBR with fuzzy rules,
usually follow a coupling model.                                         (c) combinations of CBR with ontologies and (d) combinations
   Obviously, a CBR system may employ multiple intelligent               of CBR with neuro-symbolic and neuro-fuzzy approaches.
methods (e.g. rules and various soft computing methods) to                  Combinations of CBR with soft computing methods not
perform its internal tasks [36]. Typical examples of approaches          following an embedded coupling approach could be an
employing multiple soft computing methods within the CBR                 interesting future research direction. At present there seems to
cycle are presented in [12] and [34]. In [12] all of the four            be a lack of great interest in pursuing this direction since the
phases of the CBR cycle employ soft computing methods.                   main interest has been focused on employing soft computing
Employed soft computing methods are a self-organizing neural             methods within CBR. A non-embedded direction in the
network for retrieval, a radial basis neural network for reuse,          combinations of CBR with soft computing could be pursued as
fuzzy systems for revise and all soft computing methods for              thoroughly as in the case of combinations of CBR with
retain. In [34] fuzzy logic, supervised and unsupervised neural          RBR/MBR. A further step towards this direction could involve
networks and a genetic algorithm are employed for case                   non-embedded approaches combining CBR with multiple soft
representation, indexing, retrieval and adaptation.                      computing methods or combinations of CBR, soft computing
   More interesting approaches concern multi-integrated                  and other intelligent methods (e.g. RBR, MBR or ontologies).
systems not following the embedded approach. Typical such                   Combinations of CBR with fuzzy rule-based systems could
multi-integrated approaches involve combinations of CBR,                 be based on work combining CBR with RBR that is,
RBR and MBR (e.g. [28]). Such approaches seem to be quite                investigation of various coupling approaches.
effective, because combinations of CBR with RBR and MBR                     The increasing interest in Web-based intelligent systems and
individually have been thoroughly investigated. Quite often              future advances in the Semantic Web is likely to provide an
such systems have been implemented to deal with deficiencies             impetus to approaches combining CBR with ontologies. This


                                                                    57
trend is likely to involve multi-integrated approaches                                 genetic algorithm’, Expert Systems with Applications, 31, 83-93,
combining CBR, ontologies and other intelligent methods.                               (2006).
   Finally, a direction that may be useful to pursue involves                      [18] K.-J. Kim, ‘Toward Global Optimization of Case-Based Reasoning
non-embedded coupling approaches combining CBR with                                    Systems for Financial Forecasting’, Applied Intelligence, 21, 239-249,
                                                                                       (2004).
neuro-symbolic and neuro-fuzzy modules. Few such
                                                                                   [19] J.L. Kolodner, Case-Based Reasoning, Morgan Kaufmann, 1993.
approaches have been developed.                                                    [20] O. Byung Kwon, ‘Meta Web Service: Building Web-based Open
                                                                                       Decision Support System based on Web Services’, Expert Systems
                                                                                       with Applications, 24, 375-389, (2003).
REFERENCES                                                                         [21] O. Kwon, M. Kim, ‘MyMessage: Case-Based Reasoning and
[1] A. Aamodt, E. Plaza, ‘Case-Based Reasoning: Foundational Issues,                   Multicriteria Decision Making Techniques for Intelligent Context-
    Methodological Variations, and System Approaches’, AI                              Aware Message Filtering’, Expert Systems with Applications, 27, 467-
    Communications, 7, 39-59, (2004).                                                  480, (2004).
[2] H. Ahn, K. Kim, ‘Global optimization of case-based reasoning for               [22] D.B. Leake (Ed.), Case-Based Reasoning: Experiences, Lessons, and
    breast cytology diagnosis’, Expert Systems with Applications (to                   Future Directions, AAAI Press, 1996.
    appear).                                                                       [23] S.J. Louis, G. McGraw, R.O. Wyckoff, ‘Case-based reasoning
[3] I. Bischindaritz, ‘Memoire: A Framework for Semantic                               assisted explanation of genetic algorithm results’, Journal of
    Interoperability of Case-Based Reasoning Systems in Biology and                    Experimental & Theoretical Artificial Intelligence, 5, 21-37, (1993).
    Medicine’, Artificial Intelligence in Medicine, 36, 177-192, (2006).           [24] S. J. Louis, C. Miles, ‘Playing to Learn: Case-Injected Genetic
[4] P.P. Bonissone, R. Lopez de Mantaras, ‘Fuzzy Case-Based Reasoning                  Algorithms for Learning to Play Computer Games’, IEEE
    Systems’, Handbook on Fuzzy Computing, vol. F 4.3, Oxford                          Transactions on Evolutionary Computation, 9, 669-681, (2005).
    University Press, 1998.                                                        [25] C. Marling, M. Sqalli, E. Rissland, H. Munoz-Avila, D. Aha, ‘Case-
[5] L. Castillo, E. Armengol, E. Onaindia, L. Sebastia, J. Gonzalez-                   Based Reasoning Integrations’, AI Magazine, 23, 69-86, (2002).
    Boticario, A. Rodriguez, S. Fernandez, J.D. Arias, D. Borrajo,                 [26] L.R. Medsker, Hybrid Intelligent Systems, Kluwer Academic
    ‘SAMAP: An User Oriented Adaptive System for Planning Tourist                      Publishers, Second Printing, 1998.
    Visits’, Expert Systems with Applications, 34, 1318-1332, (2008).              [27] S. Montani, R. Bellazzi, ‘Supporting Decisions in Medical
[6] L. Ceccaroni, U. Cortes, M. Sanchez-Marre, ‘OntoWEDSS :                            Applications: the Knowledge Management Perspective’, International
    Augmenting Environmental Decision-Support Systems with                             Journal of Medical Informatics, 68, 79-90, (2002).
    Ontologies’, Environmental Modelling & Software, 19, 785-797,                  [28] S. Montani, P. Magni, R. Bellazzi, C. Larizza, A.V. Roudsari, E.R.
    (2004).                                                                            Carson, ‘Integrating model-based decision support in a multi-modal
[7] P.-C. Chang, J.-C. Hsieh, C.-H. Liu, ‘A Case-Injected Genetic                      reasoning system for managing type 1 diabetic patients’, Artificial
    Algorithm for Single Machine Scheduling Problems with Release                      Intelligence in Medicine, 29, 131-151, (2003).
    Time’, International Journal on Production Economics, 103, 551-                [29] S. K. Pal, S.C.K. Shiu, Foundations of Soft Case-Based Reasoning,
    564, (2006).                                                                       John Wiley, 2004.
[8] P.-C. Chang, C.-Y. Lai, K. R. Lai, ‘A Hybrid System by Evolving                [30] E.I. Perez, C.A. Coello Coello, A. Hernandez-Aguirre, ‘Extraction
    Case-Based Reasoning with Genetic Algorithm in Wholesaler’s                        and Reuse of Design Patterns from Genetic Algorithms using Case-
    Returning Book Forecasting’, Decision Support Systems, 42, 1715-                   Based Reasoning’, Soft Computing, 9, 44-53, (2005).
    1729, (2006).                                                                  [31] J. Prentzas, I. Hatzilygeroudis, ‘Categorizing Approaches Combining
[9] W. Cheetam, S.C.K. Shiu, R.O. Weber, ‘Soft Case-Based Reasoning’,                  Rule-Based and Case-Based Reasoning’, Expert Systems, 24, 97-122,
    Knowledge Engineering Review, 20, 267-269, (2006).                                 (2007).
[10] D. Chen, P. Burrell, ‘Case-Based Reasoning System and Artificial              [32] E.L. Rissland, D.B. Skalak, ‘CABARET: Rule Interpretation in a
    Neural Networks: A Review’, Neural Computing & Applications, 10,                   Hybrid Architecture’, International Journal of Man-Machine Studies,
    264-276, (2001).                                                                   34, 839-887, (1991).
[11] R.-J. Dzeng, H.-Y. Lee, ‘Critiquing Contractors’ Scheduling by                [33] D. Rossille, J.-F. Laurent, A. Burgun, ‘Modeling a Decision Support
    Integrating Rule-Based and Case-Based Reasoning’, Automation in                    System for Oncology using Rule-Based and Case-Based Reasoning
    Construction, 13, 665-678, (2004).                                                 Methodologies’, International Journal of Medical Informatics, 74,
[12] F. Fdez-Riverola, J.M. Corchado, ‘FSfRT: Forecasting System for                   299-306, (2005).
    Red Tides’, Applied Intelligence, 21, 251-264, (2004).                         [34] K.M. Saridakis, A.J. Dentsoras, ‘Case-DeSC: A System for Case-
[13] P. Gervas, B. Diaz-Agudo, F. Peinado, R. Hervas, ‘Story Plot                      Based Design with Soft Computing Techniques’, Expert Systems with
    Generation Based on CBR’, Knowledge-Based Systems, 18, 235-242,                    Applications, 32, 641-657, (2007).
    (2005).                                                                        [35] S.C.K. Shiu, S.K. Pal, ‘Case-Based Reasoning: Concepts, Features
[14] A.R. Golding, P.S. Rosenbloom, ‘Improving Accuracy by                             and Soft Computing’, Applied Intelligence, 21, 233-238, (2004).
    Combining Rule-Based and Case-Based Reasoning’, Artificial                     [36] I. Watson, ‘Case-Based Reasoning is a Methodology not a
    Intelligence, 87, 215-254, (1996).                                                 Technology’, Knowledge-Based Systems, 12, 303-308, (1997)
[15] I. Hatzilygeroudis, J. Prentzas, ‘Integrating (Rules, Neural Networks)        [37] P. Wriggers, M. Siplivaya, I. Joukova, R. Slivin, ‘Intelligent Support
    and Cases for Knowledge Representation and Reasoning in Expert                     of Engineering Analysis using Ontology and Case-Based Reasoning’,
    Systems’, Expert Systems with Applications, 27, 63-75, (2004).                     Engineering Applications of Artificial Intelligence, 20, 709-720,
[16] B. W. Huang, M. L. Shih, N.-H. Chiu, W. Y. Hu, C. Chiu, ‘Price                    (2007).
    information evaluation and prediction for broiler using adapted case-          [38] H.-L. Yang, C.-S. Wang, ‘Two stages of case-based reasoning –
    based reasoning approach’, Expert Systems with Applications (to                    Integrating genetic algorithm with data mining mechanism’, Expert
    appear).                                                                           Systems with Applications (to appear).
[17] Y.-K. Juan, S.-G. Shih, Y.-H. Perng, ‘Decision Support for Housing            [39] F.-C. Yuan, C. Chiu, ‘A Hierarchical Design of Case-Based
    Customization: A hybrid approach using case-based reasoning and                    Reasoning in the Balanced Scorecard Application’, Expert Systems
                                                                                       with Applications (to appear).


                                                                              58
        Combining Argumentation and Hybrid Evolutionary
          Systems in a Portfolio Construction Application
                     Nikolaos Spanoudakis1 and Konstantina Pendaraki2 and Grigorios Beligiannis2

Abstract. In this paper we present an application for the                        accurately predicting the evolution of stock values in the Greek
construction of mutual fund portfolios. It is based on a                         market (its application on economic data is presented in [2]).
combination of Intelligent Methods, namely an argumentation                      Moreover, there is a lot of work on hybrid evolutionary algorithms
based decision making framework and a forecasting algorithm                      and their application on many difficult problems has shown very
combining Genetic Algorithms (GA), MultiModel Partitioning                       promising results [4]. The problem of predicting the behavior of
(MMP) theory and Extended Kalman Filters (EKF). The                              the financial market is an open problem and many solutions have
argumentation framework is employed in order to develop mutual                   been proposed. However, there isn't any known algorithm able to
funds performance models and to select a small set of mutual                     identify effectively all kinds of behaviors. Also, many traditional
funds, which will compose the final portfolio. The forecasting                   methods have been applied to the same problem and the results
algorithm is employed in order to forecast the market status                     obtained were not very satisfactory. There are two main
(inflating or deflating) for the next investment period. The                     difficulties in this problem, firstly the search space is huge and,
knowledge engineering approach and application development                       secondly, it comprises of many local optima.
steps are also discussed.12                                                         In this contribution, we present the whole application resulting
                                                                                 from the combination of argumentation with hybrid evolutionary
                                                                                 systems along with the respective results.
1 INTRODUCTION                                                                      The rest of the paper is organized as follows: Section two
                                                                                 presents an overview of the concepts and application domain
Portfolio management [8] is concerned with constructing a
                                                                                 knowledge. Section three outlines the main features of the
portfolio of securities (e.g., stock, bonds, mutual funds [13], etc.)
                                                                                 proposed argumentation based decision-making framework and
that maximizes the investor’s utility. In a previous study [14], we
                                                                                 the developed argumentation theory. The forecasting hybrid
constructed mutual fund (MF) portfolios using an argumentation
                                                                                 evolutionary system is presented in section four, followed by
based decision making framework. We developed rules that
                                                                                 section five, which presents the developed application and
characterize the market and different investor types policies using
                                                                                 discusses the obtained empirical results. Finally, section six
evaluation criteria of fund performance and risk. We also defined
                                                                                 summarizes the main findings of this research.
strategies for resolving conflicts over these rules. Furthermore, the
developed application can be used for a set of different investment
policy scenarios and supports the investor/portfolio manager in                  2 DOMAIN KNOWLEDGE
composing efficient MF portfolios that meet his investment
preferences. The traditional portfolio theories ([8], [11], [12])                    This section describes the criteria (or variables) used for
were based on unidimensional approaches that did not fit to the                  creating portfolios and the knowledge on how to use these criteria
multidimensional nature of risk ([3]), and they did not capture the              in order to construct a portfolio.
complexity presented in the data set. In [14], this troublesome                      The data used in this study is provided from the Association of
situation was resolved by the high level of adaptability in the                  Greek Institutional Investors and consists of daily data of domestic
decisions of the portfolio manager or investor when his                          equity mutual funds (MFs) over the period January 2000 to
environment is changing and the characteristics of the funds are                 December 2005.
multidimensional that was demonstrated by the use of                                 The proposed framework is based on five fundamental
argumentation.                                                                   variables. The return of the funds is the actual value of return of
    Our study showed that when taking into account the market                    an investment defined by the difference between the nominal
context, the results were better if we could forecast the status of              return and the rate of inflation. This variable is based on the net
the market of the following investment period. In order to achieve               price of a fund. At this point, it is very important to mention that
this goal we employed a hybrid system that combines Genetic                      transaction costs such as management commission are included in
Algorithms (GA), MultiModel Partitioning (MMP) theory and the                    the net price. Frond-end commission and redemption commission
Extended Kalman Filter (EKF). A general description of this                      fluctuate depending on the MF class and in most cases are very
algorithm and its application in linear and non-linear data is                   low. The standard deviation is used to measure the variability of
discussed in [2], while the specific version used in this                        the fund’s daily returns, thus representing the total risk of the
contribution is presented in [1], where its successful application to            fund. The beta coefficient (β) is a measure of fund’s risk in
non-linear data is also presented. This algorithm captured our                   relation to the capital risk. The Sharpe index [13] is a useful
attention because it had been successfully used in the past for                  measure of performance and is used to measure the expected
                                                                                 return of a fund per unit of risk, defined by the standard deviation.
1
                                                                                 The Treynor index [15] is similar to the Sharpe index except that
    Technical University of Crete, Greece, email: nikos@science.tuc.gr
2
    University of Ioannina, Greece, email: {dpendara, gbeligia}@cc.uoi.gr


                                                                            59
performance is measured as the risk premium per unit of                       communication. A single agent may use argumentation techniques
systematic (beta coefficient) and not of total risk.                          to perform its individual reasoning because it needs to make
    On the basis of the argumentation framework for the selection             decisions under complex preferences policies, in a highly dynamic
of a small set of MF, which will compose the final multi-                     environment (see e.g. [6]). This is the case used in this research.
portfolios, the examined funds are clustered in three groups for              In the following paragraphs we describe the theoretical framework
each criterion for each year. For example, we have funds with                 that we adopted:
high, medium and low performance (return), the same for the                       Definition 1. A theory is a pair (T, P) whose sentences are
other criteria.                                                               formulae in the background monotonic logic (L, ⊢ ) of the form
    The aforementioned performance and risk variables visualize               L←L1,…,Ln, where L, L1, …, Ln are positive or negative ground
the characteristics of the capital market (bull or bear) and the type         literals. For rules in P the head L refers to an (irreflexive) higher
of the investor according to his investment policy (aggressive or             priority relation, i.e. L has the general form L = h_p(rule1, rule2).
moderate). Further information is represented through variables               The derivability relation, ⊢ , of the background logic is given by
that describe the general conditions of the market and the investor           the simple inference rule of modus ponens.
policy (selection of portfolios with high performance per unit of                 An argument for a literal L in a theory (T, P) is any subset, T,
risk).                                                                        of this theory that derives L, T ⊢ L, under the background logic. A
    The general conditions of the market are characterized through
                                                                              part of the theory T0 ⊂ T, is the background theory that is
the development of funds which have high performance levels
                                                                              considered as a non defeasible part (the indisputable facts).
(high return). Regarding the market context, in a bull market,
                                                                                  An argument attacks (or is a counter argument to) another
funds are selected if they have high systematic or total risk. On the
                                                                              when they derive a contrary conclusion. These are conflicting
other hand, in a bear market, we select funds with low systematic
                                                                              arguments. A conflicting argument (from T) is admissible if it
and total risk. An aggressive investor is placing his capital upon
                                                                              counter-attacks all the arguments that attack it. It counter-attacks
funds with high performance and high systematic risk.
                                                                              an argument if it takes along priority arguments (from P) and
Accordingly, a moderate investor selects funds with high
performance and low or medium systematic risk. Some types of                  makes itself at least as strong as the counter-argument (we omit
investors select portfolios with high performance per unit of risk.           the relevant definitions from [6] due to limited space).
Such portfolios are characterized by high Sharpe ratio and high                   Definition 2. An agent’s argumentative policy theory is a
Treynor ratio.                                                                theory T = ((T, T0), PR, PC) where T contains the argument rules in
                                                                              the form of definite Horn logic rules, PR contains priority rules
                                                                              which are also definite Horn rules with head h_p(r1, r2) s.t. r1, r2
3 ARGUMENTATION-BASED DECISION                                                ∈ T and all rules in PC are also priority rules with head h_p(R1,
MAKING                                                                        R2) s.t. R1, R2 ∈ PR ∪ PC. T0 contains auxiliary rules of the
                                                                              agent’s background knowledge.
In this section we firstly present the argumentation framework that               Thus, in defining the decision maker’s theory we specify three
we used and then we describe the domain knowledge modeling                    levels. The first level (T) defines the (background theory) rules
based on the argumentation framework.                                         that refer directly to the subject domain, called the Object-level
                                                                              Decision Rules. In the second level we have the rules that define
                                                                              priorities over the first level rules for each role that the agent can
3.1        The Argumentation Framework                                        assume or context that he can be in (including a default context).
Autonomous agents, be they artificial or human, need to make                  Finally, the third level rules define priorities over the rules of the
decisions under complex preference policies that take into account            previous level (which context is more important) but also over the
different factors. In general, these policies have a dynamic nature           rules of this level in order to define specific contexts, where
and are influenced by the particular state of the environment in              priorities change again.
which the agent finds himself. The agent's decision process needs
to be able to synthesize together different aspects of his preference
policy and to adapt to new input from the current environment.                3.2    The Decision Maker’s Argumentation
Such agents are the mutual fund managers.                                     Theory
   In order to address requirements like the above, Kakas and
                                                                              Using the presented argumentation framework, we transformed
Moraitis ([6]) proposed an argumentation based framework to
                                                                              the criteria for all MFs and experts knowledge (§2) to background
support an agent's self deliberation process for drawing
                                                                              theory (facts) and rules of the first and second level. Then, we
conclusions under a given policy.
                                                                              defined the strategies (or specific contexts) in the third level rules.
   Argumentation can be abstractly defined as the principled
                                                                                 The goal of the knowledge base is to select some MFs in order
interaction of different, potentially conflicting arguments, for the
                                                                              to construct our portfolio. Therefore our rules have as their head
sake of arriving at a consistent conclusion (see e.g. [10]). The
                                                                              the predicate selectFund/1 and its negation. We write rules
nature of the “conclusion” can be anything, ranging from a
                                                                              supporting it or its negation and use argumentation for resolving
proposition to believe, to a goal to try to achieve, to a value to try
                                                                              conflicts. We introduce the hasInvestPolicy/2, preference/1 and
to promote. Perhaps the most crucial aspect of argumentation is
                                                                              market/1 predicates for defining the different contexts and roles.
the interaction between arguments. This means that argumentation
                                                                              For example, John, an aggressive investor is expressed with the
can give us means for allowing an agent to reconcile conflicting
                                                                              predicate hasInvestPolicy(john, aggressive).
information within itself, for reconciling its informational state
                                                                                 The knowledge base facts are the performance and risk
with new perceptions from the environment, and for reconciling
                                                                              variables values for each MF, the thresholds for each group of
conflicting information between multiple agents through


                                                                         60
values for each year and the above mentioned predicates                           The problem with the above rules is that the facts market(bear)
characterizing the investor and the market. The following rules are            or (exclusive) market(bull) could not be safely determined for the
an example of the object-level rules (level 1 rules of the                     next investment period. In the application version presented in
framework - T):                                                                [14] it was just assumed to remain the same as at the time of the
                                                                               investment. This strategy, however produced quite poor results for
r1(Fund): selectFund(Fund) ← highR(Fund)                                       this context if it should change in the next period.

r2(Fund): ¬selectFund(Fund) ← highB(Fund)
                                                                               4 FORECASTING THE STATUS OF THE
   The highR predicate denotes the classification of the MF as a               FINANCIAL MARKET
high return fund and the highB predicate denotes the classification
of the MF as a high risk fund. Thus, the r1 rule states that a high            One of the most prominent issues in the field of signal processing
performance fund should be selected, while the r2 rule states that a           is the adaptive filtering problem, with unknown time-invariant or
high risk fund should not be selected. Such rules are created for              time-varying parameters. Selecting the correct order and
the three groups of our performance and risk criteria.                         estimating the parameters of a system model is a fundamental
   Then, in the second level we assign priorities over the object              issue in linear and nonlinear prediction and system identification.
level rules. The PR are the default context rules or level 2 rules.            The problem of fitting an AutoRegressive Moving Aaverage
These rules are added by experts and express their preferences in              model with eXogenous input (ARMAX) or a Nonlinear
the form of priorities between the object level rules that should              AutoRegressive Moving Aaverage model with eXogenous input
take place within defined contexts and roles. For example, the                 (NARMAX) to a given time series has attracted much attention
level 1 rules with signatures r1 and r2 are conflicting. In the                because it arises in a large variety of applications, such as time
default context the first one has priority, while the bear market              series prediction in economic and biomedical data, adaptive
context reverses this priority:                                                control, speech analysis and synthesis, neural networks, radar and
                                                                               sonar, fuzzy systems, and wavelets [5].
R1: h_p(r1(Fund),r2(Fund)) ← true                                                  The forecasting algorithm used in this contribution is a generic
                                                                               applied evolutionary hybrid technique, which combines the
R2: h_p(r2(Fund),r1(Fund)) ← market(bear)                                      effectiveness of adaptive multimodel partitioning filters and GAs’
                                                                               robustness [1]. This method has been first presented in [7].
   Rule R1 defines the priorities set for the default context, i.e. an         Specifically, the a posteriori probability that a specific model, of a
investor selects a fund that has high return on investment (RoI)               bank of the conditional models, is the true model, can be used as
even if it has high risk. Rule R2 defines the default context for the          fitness function for the GA. In this way, the algorithm identifies
bear market context (within which, the fund selection process is               the true model even in the case where it is not included in the
cautious and does not select a high RoI fund if it has high risk).             filters’ bank. It is clear that the filter’s performance is
   Finally, in PC (level 3 rules) the decision maker defines his               considerably improved through the evolution of the population of
strategy and policy for integrating the different roles and contexts           the filters’ bank, since the algorithm can search the whole
rules. When combining the Aggressive investor role and bear                    parameter space. The proposed hybrid evolutionary algorithm can
market context, for example, the final portfolio is their union                be applied to linear and nonlinear data; is not restricted to the
except that the aggressive investor now would accept to select                 Gaussian case; does not require any knowledge of the model
high and medium risk MFs (instead of only high). The decision                  switching law; is practically implementable, computationally
maker’s strategy sets preference rules between the rules of the                efficient and applicable to online/adaptive operation; and exhibits
previous level but also between rules at this level. Relating to the           very satisfactory performance as indicated by simulation
level 2 priorities, the bear market context’s priority of not buying           experiments [2]. The structure of the hybrid evolutionary system
a high risk MF, even if it has a high return, is set at higher priority        used is depicted in Figure 1.
than that of the general context. Then, the specific context of an                 The representation used for the genomes of the population of
aggressive investor in a bear market defines that the bear market              the GA is the following. We use a mapping that transforms a fixed
context preference is inverted. See the relevant priority rules:               dimensional internal representation to variable dimensional
                                                                               problem instances. Each genome consists of a vector x of real
C1: h_p(R2, R1) ← true                                                         values xi∈ ℜ , i = 1, ..., k, and a bit string b of binary digits
                                                                               bi∈{0,1}, i = 1, ..., k. Real values are summed up as long as the
C2: h_p(R1, R2) ← hasInvestPolicy(Investor, aggressive).                       corresponding bits are equal. Obviously, k is an upper bound for
                                                                               the dimension of the resulting parameter vector. We use the first
C3: h_p(C2, C1) ← true                                                         k/3 real values for the autoreggressive part, the second k/3 real
                                                                               values for the moving average part, and the last k/3 real values for
   Thus, an aggressive investor in a bear market context would                 the exogenous input part. An example of this mapping is
continue selecting high risk funds. In the latter case, the argument           presented in Figure 2. For a more detailed description of this
r1 takes along the priority arguments R1, C2 and C3 and becomes                mapping refer to [2].
stronger (is the only admissible one) than the conflicting r2                      At first, an initial population of m genomes is created at
argument that can only take along the R2 and C1 priority                       random (each genome consists of a vector of real values and a bit
arguments. Thus, the selectFund(Fund) predicate is true and the                string). As stated before, each vector of real values represents a
fund is inserted in the portfolio.                                             possible value of the NARMAX model order and its parameters.
                                                                               For each such population we apply an MMAF with EKFs and


                                                                          61
have as result the model-conditional probability density function          (one) which is the maximum value it is able to have as a
(pdf) of each candidate model. This pdf is the fitness of each             probability For a more detailed description of this hybrid
candidate model, namely the fitness of each genome of the                  evolutionary system refer to [2].
population (Figure 3).


                                                                             Figure 3: The fitness of each candidate model is the model
                                                                             conditional pdf (m is the number of the extended Kalman
                                                                                      filters in the multimodel adaptive filter)
                                                                               In this contribution we apply a slightly different approach
                                                                           compared to the one presented in [2]. In [2], at the algorithm’s
                                                                           step where the value of the estimation (output) x of each filter is
                                                                           calculated, the past values of x that are used in order to estimate
                                                                           the next value of x are always taken from the estimation file (the
                                                                           file of all past values of x that have been estimated by the
                                                                           algorithm till this point). All these values are used in each
                                                                           generation in order to estimate the next value of the estimation
                                                                           (output) vector x. The method presented in this contribution uses a
                                                                           different approach in order to estimate x. At the algorithm’s step
                                                                           where the value of x for each filter is calculated, the past values of
                                                                           x that are used in order to estimate the next value of x are smaller
                                                                           than the total length of the time series that has been estimated till
                                                                           this point. The length of past values used in each generation in
                                                                           order to estimate the next value of x equals to n/2, where n is the
                                                                           total length of the time series to be estimated. Every new value of
Figure 1: The structure of the hybrid evolutionary system used             x, estimated by the algorithm, is added to this time series of length
                        for forecasting                                    n/2 and the oldest one is removed in order this time series to
                                                                           sustain a length of n/2. The value of n/2 was not selected
                                                                           arbitrarily. We have conducted exhaustive experiments using
                                                                           many different values. The value of n/2, that has been finally
                                                                           selected, was the most effective one, that is, the one that resulted
                                                                           in the best prediction results.
                                                                               Thus, the hybrid evolutionary system presented in Figure 1 is
                                                                           used in order to forecast the behavior of the financial market in
                                                                           relation to its current status. The market is characterized as bull
                                                                           market if it is forecasted to rise in the next semester, or as bear
                                                                           market if it is forecasted to fall. We used the return values of the
                                                                           Greek market index for each semester starting from year 1985 to
     Figure 2: Mapping from a fixed dimensional internal
                                                                           the years of our sample data (2000 to 2005). The algorithm
   representation to a variable length NARMAX parameter
                                                                           performed very well considering that it could forecast the next
      vector. The resulting order is n(p, q, r) = (4, 3, 2).
                                                                           semester market behavior with a success rate of 85.17% (12 out of
    The reproduction operator we decided to use is the classic             14 right predictions).
biased roulette wheel selection according to the fitness function
value of each possible model order [9]. As far as crossover is
concerned, we use the one-point crossover operator for the binary          5 THE PORTFOLIO CONSTRUCTION
strings and the uniform crossover operator for the real values [9].        APPLICATION
Finally, we use the flip mutation operator for the binary strings
and the Gaussian mutation operator for the real values [9]. Every          In this section we firstly present the system architecture, i.e. the
new generation of possible solutions iterates the same process as          combination method for the argumentation decision making sub-
the old ones and all this process may be repeated as many                  system and the hybrid forecasting sub-system that resulted in a
generations as we desire or till the fitness function has value 1


                                                                      62
coherent application. Then we present the results of this                     to different investment choices and leads to the selection of
combination.                                                                  different number and combinations of MFs.


5.1        System Architecture
The portfolio generation application is a Java program creating a
human-machine interface and managing its modules, namely the
decision making module, which is a prolog rule base (executed in
SWI-prolog1) using the Gorgias2 framework, and the forecasting
module, which is a Matlab3 implementation of the forecasting
hybrid system (see Figure 4).
   The application connects to the SWI-Prolog module using the
provided Java interface (JPL) that allows for inserting facts to an
existing rule-base and running it for reaching goals. The goals can
be captured and returned to the Java program. The application
connects to Matlab by executing it in a system shell. The matlab
program writes the results of the algorithm to a MySQL4 database               Figure 5: A screenshot for portfolio generation for a scenario
using SQL (Structured Query Language). The application first                          of a moderate investor in a bull market context
executes the forecasting module, then updates the database, using
JDBC (Java DataBase Connectivity interface) technology, with                      In Table 1 the reader can inspect the average return on
the investor profile (selected roles) and, finally, queries the               investment (RoI) for the six years for all different contexts. The
decision making module setting as goal the funds to select for                reader should notice that the table contains two RoI columns, the
participation in the final portfolio. Thus, after the execution of the        first (“Previous RoI”) depicts the results before changing the
forecasting module the predicate market/1 is determined as bull or            system as they appeared in [14]. The second presents the results of
bear and inserted as a fact in the rule base before the decision              upgrading the application by combining it with the hybrid
making process is launched. The reader can see in Figure 5 a                  evolutionary forecasting sub-system and by fixing the selected
screenshot of the integrated system.                                          funds participation to the final portfolio. The latter modification is
                                                                              out of the scope of this paper but the reader can clearly see that it
                                                                              has greatly influenced the performance of all scenarios.
                                                                                  Table 1, however, shows the added value of this contribution as
                                                                              the market context has become the most profitable in the “New
                                                                              RoI” column (8.17% RoI), while in the “Previous RoI” column it
                                                                              was one of the worst cases (3.72% RoI). Consequently the specific
                                                                              contexts containing the market context have better results.
                                                                                 Table 1: Average RoI for six years. The New RoI column
                                                                                 shows the gains after the evolutionary hybrid forecasting
                                                                                                   system’s integration
                                                                                                                            Previous
                  Figure 4: System Architecture                                   Context type       Context                             New RoI (%)
                                                                                                                             RoI (%)
                                                                                      simple         general                  3.53            6.86
                                                                                        role         aggressive               2.65            7.38
5.2        System Evaluation                                                            role         moderate                 4.02            6.09
                                                                                     context         market                   3.72            8.17
For evaluating our results we defined scenarios for all years for                       role         high performance         4.98            7.16
which we had available data (2000-2005) and for all combinations                  specific context   aggressive – market      3.56            7.92
of contexts. That resulted to the two investor types combined with
                                                                                  specific context   moderate – market        4.72            6.08
the market status, plus the two investor types combined with the
                                                                                  specific context   aggressive - high p.     4.88            7.46
high performance option, plus the market status combined with
                                                                                  specific context   moderate - high p.       4.98            7.16
the high performance option, all together five different scenarios
                                                                                  specific context   Market - high perf.      4.59            7.23
run for six years each. Each one of the examined scenarios refers
                                                                                      ASE-GI         RASE                              6.75


1
  SWI-Prolog offers a comprehensive Free Software Prolog environment,            Moreover, Table 1 also shows the added value of our approach
   http://www.swi-prolog.org                                                  as the reader can compare our results with the return on
2
  Gorgias is an open source general argumentation framework that              investment (RASE) of the General Index of the Athens Stock
   combines the ideas of preference reasoning and abduction,                  Exchange (ASE-GI). According to the results of this table, the
   http://www.cs.ucy.ac.cy/~nkd/gorgias/                                      average return of the constructed portfolios for all contexts, except
3
  MATLAB® is a high-level language and interactive environment for            two, achieves higher return than the market index. The two cases
   performing computationally intensive tasks, http://www.mathworks.          where the constructed portfolios did not beat the market index are
   com/products/matlab
4                                                                             the moderate simple context and moderate-market specific
   MySQL is an open source database, http://www.mysql.com


                                                                         63
context. This is, maybe, due to the fact that in these two contexts             The developed application allows a decision maker (fund
we have an investor who wishes to earn more without taking into              manager) to construct multi-portfolios of MFs under different,
account any amount of risk in relation to the variability which              possibly conflicting contexts. Moreover, for medium to long term
characterizes the conditions of the market during the examined               investments, the returns on investment of the constructed
period. This fact makes it very difficult to implement investment            portfolios are better than those of the General Index of the Athens
strategies that can help a fund manager outperform a passive                 Stock Exchange, while the best results are those that involve the
investment policy.                                                           forecasting of the financial market.
   Furthermore, we notice that in some specific contexts the                    Our future work will be to develop a new rule base for the
results are more satisfying than the results obtained by simple              problem of determining when to construct a new portfolio for a
contexts, while in others there is little or no difference. This             specific investor. We will also make the application web-based so
means that by using effective strategies in the third preference             that it can get on-line financial data available from the internet for
rules layer the decision maker can optimize the combined                     computing the decision variables and for allowing the investors to
contexts. Specifically, the aggressive-high performance specific             insert their profiles by filling on-line forms. Finally, we will
context provides better results than both the simple contexts                continue evaluating our application as new data become available
aggressive and high performance (the ones that it combines) and              for years after 2005. Our aim is to be able to guarantee a better
the general context. The moderate-high performance specific                  RoI than that of the ASE.
context’s returns on investment are equal to the higher simple
context’s returns (high performance) while the aggressive-market
specific context returns are closer to the higher simple context’s           7 REFERENCES
returns (market).                                                            [1]  A. V. Adamopoulos, Anninos, P. A., Likothanassis, S .D.,
   Finally, in Figure 6, we present the RoI of all contexts                       Beligiannis, G. N., Skarlas, L. V., Demiris E. N. and Papadopoulos,
separately for each year. This view is also useful, as it shows that              P., 2002. Evolutionary Self-adaptive Multimodel Prediction
for two years, 2003 and 2004, RASE was greater than all our                       Algorithms of the Fetal Magnetocardiogram, 14th Int. Conf. on
contexts RoI performance. This shows that our application, for the                Digital Signal Processing (DSP 2002), Vol. II, 1-3 July, Santorini,
time being, performs better for medium term to long term                          Greece, pp. 1149-1152.
investments, i.e. those that range over five years.                          [2] G. N. Beligiannis, L. V. Skarlas and S. D. Likothanassis. “A Generic
                                                                                  Applied Evolutionary Hybrid Technique for Adaptive System
                                                                                  Modeling and Information Mining”, IEEE Signal Processing
                                                                                  Magazine, 21(3), pp. 28-38, 2004
                                                                             [3] G. Colson, and M. Zeleny, “Uncertain prospects ranking and
                                                                                  portfolio analysis under the condition of partial information” in
                                                                                  Mathematical Systems in Economics 44, 1979.
                                                                             [4] G. Crina, A. Ajith, I. Hisao (Eds.), “Hybrid Evolutionary
                                                                                  Algorithms”, Studies in Computational Intelligence, Vol. 75, 2007
                                                                             [5] S. Haykin, Adaptive Filter Theory. Englewood Cliffs, NJ: Prentice-
                                                                                  Hall Int., 1991
                                                                             [6] A. Kakas, and P. Moraitis, “Argumentation based decision making
                                                                                  for autonomous agents”, Proc. of the second Int. Conf. on
                                                                                  Autonomous Agents and Multi-Agent Systems (AAMAS03), July
                                                                                  14-18, Australia, 2003.
                                                                             [7] S.K. Katsikas, S.D. Likothanassis, G.N. Beligiannis, K.G. Berketis
                                                                                  and D.A. Fotakis, “Evolutionary multimodel partitioning filters: A
                                                                                  unified framework,” IEEE Trans. Signal Processing, vol. 49, no. 10,
                                                                                  pp. 2253–2261, 2001
                                                                             [8] H. Markowitz, Portfolio Selection: Efficient Diversification of
                                                                                  Investments, Wiley, New York, 1959.
                                                                             [9] Z. Michalewicz, Genetic Algorithms + Data Structures = Evolution
   Figure 6: Comparative RoIs of all contexts for each year.                      Programs, 3rd ed. New York: Springer-Verlag, 1996.
                                                                             [10] I. Rahwan, P. Moraitis and C. Reed (Eds.) “Argumentation in Multi-
                                                                                  Agent Systems”, Lecture Notes in Artificial Intelligence, 3366,
                                                                                  Springer-Verlag, Berlin, Germany, 2005
6 CONCLUSION                                                                 [11] S. Ross, “The Arbitrage Theory of Capital Asset Pricing”, Journal of
                                                                                  Economic Theory, 6, 1976, pp. 341-360.
The objective of this paper was to present an artificial intelligence
                                                                             [12] W.F. Sharpe, “Capital asset prices: A theory of market equilibrium
based application for the MF portfolio generation problem that                    under conditions of risk”, Journal of Finance, 19, 1964, pp. 425-442.
combines two different intelligent methods, argumentation based              [13] W.F., Sharpe, “Mutual fund performance”, Journal of Business, 39,
decision making and a hybrid system that combines Genetic                         1966, pp. 119-138.
Algorithms (GA), MultiModel Partitioning (MMP) theory and the                [14] N. Spanoudakis and K. Pendaraki, “A Tool for Portfolio Generation
Extended Kalman Filter (EKF).                                                     Using an Argumentation Based Decision Making Framework” , in:
   We described in detail how we developed our argumentation                      Proceedings of the annual IEEE International Conference on Tools
theory and how we combined it with the hybrid system to                           with Artificial Intelligence (ICTAI 2007), Patras, Greece, October
determine an important fact for the decision making process, i.e.                 29-31, 2007
                                                                             [15] J.L. Treynor, “How to rate management of investment funds”,
the status of the financial market in the next investment period.
                                                                                  Harvard Business Review, 43, 1965, pp. 63-75.


                                                                        64
    An Architecture for Multiple Heterogeneous Case-Based
         Reasoning Employing Agent Technologies.
                                              Elena I Teodorescu and Miltos Petridis1

Abstract. This paper presents an investigation into applying                 petence and the contextualisation of the cases. Past research at
Case-Based Reasoning to Multiple Heterogeneous Case                          Greenwich [2][3] has shown the need to combine knowledge en-
Bases using agents. The adaptive CBR process and the ar-                     coded in cases from various heterogeneous sources to achieve a
chitecture of the system are presented. A case study is pre-                 competent, seamless CBR system.
                                                                                 Ontanon and Plaza [7] looked at a way to “improve the overall
sented to illustrate and evaluate the approach. The process of               performance of the multiple case systems and of the individual
creating and maintaining the dynamic data structures is                      CBR agents without compromising the agent’s autonomy”. They
discussed. The similarity metrics employed by the system                     present [8] a framework for collaboration among agents that use
are used to support the process of optimisation of the col-                  CBR and strategies for case bartering (case trading by CBR
laboration between the agents which is based on the use of a                 agents). Nevertheless, they do not focus at the possibility of cases
blackboard architecture. The blackboard architecture is                      having different structures and what impact this will have on ap-
shown to support the efficient collaboration between the                     plying CBR to heterogeneous case bases. Leake [5] states that “An
agents to achieve an efficient overall CBR solution, while                   important issue beyond the scope of this paper is how to establish
using case-based reasoning methods to allow the overall                      correspondences between case representations, if the representa-
                                                                             tions used by different case-bases differ.”
system to adapt and “learn” new collaborative strategies for                     Given several case bases as the search domain, it is very likely
achieving the aims of the overall CBR problem solving                        that they have different structures. Ideally, accessing Multiple Case
process.                                                                     Bases should not require a change to their data structures. In order
                                                                             for an MCBR system to effectively use case-bases that may have
                                                                             been developed in different ways, for different tasks or task envi-
                                                                             ronments, methods are needed to adjust retrieved cases for local
1 Introduction1                                                              needs.
                                                                                 Leake and Sooriamurthi [4] proposed a theoretical “cross-case-
Case-based reasoning (CBR) is now an established artificial intel-           base adaptation” which would adapt suggested solutions from one
ligence paradigm. Given a case-base of prior experiences, a CBR              case base to apply to the needs of another. They are currently
system solves new problems by retrieving cases from the case-                exploring sampling methods for comparing case-base characteris-
base, and adapting their solutions to comply the new require-                tics in order to select appropriate cross-case-base adaptation strate-
ments[1].                                                                    gies.
    Multiple Case Based Reasoning (MCBR) is used to retrieve so-
lutions for a new problem from more than one case-base. Methods
for managing sharing of standardized case bases have been studied
in research on distributed CBR (e.g. [13]), as have methods for              2 Adaptive CBR
facilitating large-scale case distribution [10]. Leake and Sooria-
muthhi propose a new strategy for MCBR - an agent selectively                   In order to enable effective solution retrieval across autonomous
supplements its own case-base as needed, by dispatching problems             case bases with differing structures, it is essential to have access
to external case-bases and using cross-case-base adaptation to               and a good understanding of each of the different case base struc-
adjust their solutions for inter-case-base differences [4, 5, 6,13 ].        tures involved. This would make it possible to identify the com-
    In many problems in modern organisations, the knowledge en-              monalities, equivalences and specific characteristics of every case
capsulated by cases is contained in multiple case bases reflecting           base associated with the system.
the fragmented way with which organisations capture and organise
knowledge. The traditional approach is to merge all case bases into
a central case base that can be used for the CBR process. However,
this approach brings with it three challenges:
                                                                             2.1 The process of adaptive CBR
     •     Moving cases into a central case base potentially sepa-
           rates from its context and makes maintenance more diffi-          Instead of trying to adapt the suggested solutions from one case
           cult.                                                             base to the needs of another, the approach investigated in this study
                                                                             will be to create a “dynamic structure” of a general case. This
     •     Various case bases can use different semantics. There is
                                                                             dynamic structure would be modified every time a new case base
           therefore a need to maintain various ontologies and map-
                                                                             with a new structure is added.
           pings across the case bases.
                                                                                The process of adaptive CBR, within the architecture of the
     •     The knowledge content “value” of individual cases can
                                                                             HMCBR System (Figure 1), will incorporate a number of steps.
           be related to its origination. This can be lost when merg-
                                                                                 Firstly, in order for the system to work with a particular case
           ing into a central case base.
                                                                             base, it will need to know the structure of that case base. Every
    Keeping the cases distributed in the form of a Heterogeneous
                                                                             newly added case base will therefore have to publish its structure to
Multiple Case Based Reasoning system (HMCBR) may have a
                                                                             a Registry System. The published structures are required to have
number of advantages such as increased maintainability and com-
                                                                             their own data dictionaries attached to enable the creation of a
                                                                             dynamic Data Dictionary.
  1 Department of Computing Science, University of Greenwich, Park

Row, London SE10 9LS email:{ E.I.Teodorescu , M.Petridis}@gre.ac.uk


                                                                        65
                                                                              3 Creating the Dynamic Structure

                                                                              Creating and maintaining a dynamic structure makes the self-
                                                                              adaptive multi case base reasoning system possible. By adding a
                                                                              new case base to the existing ones, new attributes are added to a
                                                                              global dynamic structure and new relations linked to these attrib-
                                                                              utes are established.
                                                                                     CBS1.        Apartment        Studio       Detached
                                                                                     type                                       house
                                                                              DCBS
                                                                              name
                                                                                 House            0                0            1
                                                                                 Flat             1                0.8          0
                                                                              Fig. 3. Data Dictionary includes relations between some of the
                                                                              attributes.
                                                                                  A data dictionary is required to keep all the metadata for the dy-
                                                                              namic structure. This data dictionary would have multiple func-
                                                                              tions: It records the location and the name of every attribute from
                                                                              the Case Base Structures (CBS) and how these are translated into
                                                                              the Dynamic Case Base Structure (DCBS). It also stores the type
Fig. 1. The Architecture of the HMCBR System
                                                                              and any default value for every single attribute.
                                                                                  The Data Dictionary will reflect any relationships between the
    The published structure will be retrieved by the Dynamic CB               Dynamic Case Base Structure attributes. These relationships can be
System and used to adapt the local dynamic structure to accommo-              mathematical relationships or look-up tables (figure 3).
date any new elements and map existing ones.                                      We will use the presented case study to show how a dynamic
    When the dynamic structure reflects all participating case bases,         structure is created and how it is continuously changed by adding
a case query can be submitted. The system would then reformulate              new case bases to the search domain.
the target case structure into each provider’s case base structure.               Let us suppose that our general structure (the initial state of the
The target case structure will be a subset of the dynamic structure.          Dynamic Structure containing few main attributes of a property) is
    The reformulated cases are submitted to each provider and solu-           already built (see figure 4). The structure has attached a basic Data
tion cases are retrieved using KNN techniques [1]. The structures             Dictionary mainly containing the data types of the existing attrib-
of these solutions will be translated into the dynamic structure, thus        utes.
creating a dynamic case base. Finally, the system will apply the                  We will show how this initial structure will be dynamically
classical CBR process to the dynamic case base.                               changed by consecutively adding the three agents to the search
    The whole process is intended to provide a transparent view of            domain.
the CBR process across the heterogeneous system.                                  Adding the Case Base Structure 1 to the system implies map-
                                                                              ping of the attributes ParkingSpace, Area and Type into the Dy-
                                                                              namic Structure (these attributes are already existing in the initial
2.2 Case Study                                                                structure) and also adding more attributes to it (i.e. NoOfRooms,
                                                                              NoOfBathrooms, GardenLength, GardenWidth)
This case study requires searching for a property from three estate
agencies without amalgamating their case bases structures.
    Let us suppose that the estate agencies have different case base
structures (figure 2).
    A possible buyer should be able to search for a property and get
all the suitable solutions from all three agencies. A search should
retrieve the best matches from all case bases as if it was dealing
with a single case base in a way transparent to the buyer.


                                                                                     Data Dictionary
                                                                                     Size: Double
                                                                                     NoOfBedrooms: Integer
                                                                                     Location: String
                                                                                     ParkingSpace: double
                                                                                         Name:        house         flat
                                                                                         house          1            0
                                                                                          flat          0            1
            Case Bases Structures 1(CBS1)
                                                                              Fig. 4. Initial state of the Dynamic Structure and Data Dictionary

                                                                                  The Data Dictionary will reflect the mapping of attributes:
                                                                                        CBS1.ParkingSpace = DCBS.ParkingSpace;
            Case Bases Structures 2(CBS2)                                               CBS1.Area = DCBS.Location
                                                                                        CBS1.type= DCBS.name
                                                                                  The following attributes will be added to the dynamic data dic-
                                                                              tionary:
                                                                                        NoOfRooms: integer;
                                                                                        GardenLength: double; GardenWidth: double
            Case Bases Structures 3(CBS3)
                                                                                  Any other relevant relationships such as look-up tables for de-
                                                                              fining mappings between the values of attribute Type of CBS1 and
   Fig. 2. Three different Case Base Structures


                                                                         66
the values of the attribute Name of the dynamic structure will be           4 Optimising the agent collaboration process
captured.
    Case Base Structure 2 will add another attribute, GardenSize, to
the Dynamic Structure and the data dictionary will record mapping           In order to optimise the process of collaboration between the
of attributes:                                                              agents to achieve an efficient solution from the overall CBR proc-
           CBS2.Name = DCBS.Name,                                           ess when applied across the heterogeneous case bases, an overall
           CBS2.Location = DCBS.Location ,                                  similarity metric is required. Additionally, an overall process to
           CBS2.NoOfBedrooms = DCBS. NoOfBedrooms;                          enable collaboration between the agents is necessary based on a
    The mathematical relationships are recorded:                            flexible architecture to enable this collaboration.
DCBS.GardenSize = DCBS.GardenLength * DCBS.GardenWidth,
     Functions can be applied, for example to keep the same metric
system:
                                                                            4.1 Defining an overall similarity metric
    DCBS.GardenSize= CBS2.GardenSizeInFeet/(3.281)2
    The Data Dictionary would also include a look-up table show-            The overall similarity metric between a target and a source Case
ing the conversion of values of CBS2.Name to values of                      can be defined as:
DCBS.Name.
    Attention has to be paid to the meanings of the names of the at-                      ,                ,                           1
tributes. For example, if the attribute “Type” in CBS1 and the                    where:
attribute “Name” in CBS2 have the same meaning (they would be                     σ: overall similarity
translated as “Name” in DCBS, with values found in a look-up                      σCBy: similarity from case base provider CBy
table), the attribute “Name” from CBS3 has not the same meaning                   CT: target case
as the one from CBS2. It is actually translated into DCBS.Location                CS: source case
(similar to CBS2.Location)                                                                   : weighting for a case base provider y for case CT
                                                                                  To allow for defining locally optimised similarity metrics for
                                                                            different providers, the following metric can be defined:
                                                                                            ,                               ,   ,           2

                                                                                  where:
                                                                                            : the weighting from case base provider CBy for at-
                                                                                tribute x
                                                                                             , , : the local similarity metric for provider CBy
                                                                                   for attribute x.
                                                                                This extended similarity metric takes into account the level of
                                                                            trust that the HMCBR system attributes to the competence of each
                                                                            case base provider. The level of trust is determined by applying
                                                                            CBR to the case-base of the history of queries. Additionally it
                                                                            allows to adjust the trust to particular providers to different “re-
                                                                            gions” in the case base allowing for case base providers to be
   Fig. 5. Adapted Dynamic Structure after CBS3 was added                   “specialised” on particular types of domain knowledge. Finally, the
                                                                            extended metric allows for different ways of defining similarity
   By adding the third estate agent case base to the search domain,
                                                                            based on possible particularities pertaining to individual case base
the dynamic structure will grow even more (see figure 5) and the
                                                                            providers.
Data dictionary will reflect it by adding the attributes
                                                                                Let us assume that in our case study the third estate agent is
DSBS.Garage and DSBS.View.
                                                                            specialised in city apartments. After a few searches for country side
   The following attributes are mapped:
                                                                            houses with gardens, reasoning can be applied to the History case-
          CBS3.Name = DCBS.Location
                                                                            base. Results will show that, for this particular query, the estate
          CBS3.Description = DCBS.Name
                                                                            agent’s level of trust is not high, i.e. there will be less solutions for
          CBS3.GardenSizeInMeters = DCBS.GardenSize
                                                                            this particular case base added to the Dynamic case-base.
   Another look-up table can be created and added to the Data Dic-
                                                                                A global level of trust of a provider’s case-base can be calculat-
tionary to record the relationship between the Garage and Parking-
                                                                            ing taking in consideration the results of all the previous enquiries
Space. Figure 6 shows the state of the Dynamic data Dictionary
                                                                            for that provider.
after CBS1, CBS2 and CBS3 are added.
   Dynamic Data Dictionary
   CBS1.Area = DCBS.Location                                                4.2 An architecture and process to support effective
   NoOfRooms: integer                                                       collaboration between case base agents
   CBS1.type= DCBS.name
   DCBS.GardenSize: double                                                  The architecture of the HMCBR system shown in figure 1 contains
   DCBS.GardenSize = CBS2.GardenSizeInFeet                                  the dynamic CB system, which incorporates a blackboard architec-
   DCBS.GardenSize = DCBS.GardenLenght *                                    ture. Blackboards have been used very effectively in the past for
      DCBS.GardenWidth ...                                                  the construction of hybrid and agent based AI systems [11], [12].
   CBS3.Name = DCBS.Location                                                    The dynamic CB system is where the process for agent collabo-
   CBS3.GardenSizeInMetres = DCBS.GardenSize                                ration is controlled. It is based on a blackboard architecture incor-
                   Garage     ParkingSpace                                  porating the blackboard containing the target and retrieved cases
   Garage              1             0.7                                    from various providers together with similarity calculations and
   ParkingSpace       0.7             1                                     rankings. The blackboard also contains a log of the solution proc-
                                                                            ess and the reconciliation strategy followed, thus representing the
   Fig. 6. Adapted Dynamic Data Dictionary after CBS1, CBS2                 state of the overall CBR solution process at any point in this proc-
and CBS3 are added                                                          ess. Figure 7 shows the structure of the dynamic CB module incor-
                                                                            porating the blackboard architecture.


                                                                       67
                                                                             use the translated query to match it to its local cases and retrieve
       AgentCB1        AgentCB2           AgentCB3                           the best matches.
                                                                                 A Data Dictionary is created in order to manage the Dynamic
                                                                             Structure. This contains the metadata for the Dynamic Structure,
                                                                             such as mapping details of the case base provider’s structures to the
                                                                             Dynamic Structure, type information and relationships between
                                                          Query              attributes of the dynamic structure.
         BB                    Blackboard
                                                          History                The dynamic case base system manages the overall process, in-
         Manager                                                             cluding controlling the agents, reconciling and optimising the
                                                                             retrieved cases and feeding back into its strategy by continuously
                                                                             adjusting weights representing confidence levels on individual case
                                                                             base providers. A prototype system to evaluate the efficiency of
                                                                             using a heterogeneous Multiple Case Based Reasoning system is
                                                                             currently being evaluated. Preliminary findings are encouraging.
        Dynamic              Dynamic Data                                        Further work will concentrate into optimising the process of
        Structure            Dictionary                                      collaboration between the agents and methods and strategies for the
                                                                             reconciliation of retrieved cases.
   Fig. 7. The Dynamic CB system incorporating the blackboard
architecture                                                                 References
    The blackboard manager manages the overall solution process,
communicates with and keeps track of the CB agents, selects and              [1] Kolodner, J.: Case-based Reasoning, Morgan Kaufmann, 1993
implements a solution strategy and monitors and evaluates the                [2] Knight B, Petridis M, Mileman T: Maintenance of a Case-Base
solutions achieved. Given a new target case, the blackboard man-                for the Retrieval of Rotationally Symmetric Shapes for the De-
ager decides on strategy for finding similar cases from the CB                  sign of Metal Castings LNAI 1998: Springer Verlag. Advances
providers. The blackboard system decides which CB providers to                  in Case-Based Reasoning, 5th European Workshop, Trento, It-
use and the number of cases to retrieve from each one and other                 aly, Sept. 6-9.2000 pp 418-430
requirements, such as the requirement for diversity, similarity              [3] Knight B, Petridis M, Mejasson P, Norman P: A Intelligent
thresholds etc. The system then initialises the agents and assigns to           design assistant (IDA): a CBR System for Materials Design
them a mission. On return, the results (cases) are mapped using the             Journal of Materials and Design 22 pp 163-170, 2001
dynamic data dictionary and written to the blackboard. A “global”            [4] David Leake & Raja Sooriamurthi: “Automatically Selecting
CBR process is used to decide on the retrieved cases. The system                Strategies for Multi-Case-Base Reasoning” - Proceedings of the
then selects and presents the shortlisted cases after the reconcilia-           Sixth European Conference on Case-Based (ECCBR) , Aber-
tion process and provides these to the user, together with links to             deen, Scotland, September, 2002
their original forms for the user to explore. Finally, the system            [5] David Leake & Raja Sooriamurthi : “Managing Multiple Case
“reflects” on the process by updating the query history and confi-              Bases: Dimensions and Issues” (Proceedings of the Fifteenth
dence weights for each provider.                                                Florida Artificial Intelligence Research Society (FLAIRS) Con-
    The system described here has been implemented and tested on                ference , Pensacola, Florida, May 2002),
a set of case bases from three different estate agent case bases, all        [6] David Leake & Raja Sooriamurthi: “When Two Case Bases
using different structures. Experiments with the system have shown              are Better Than One: Exploiting Multiple Case Bases” (Pro-
that the system can retrieve useful cases combining cases from all              ceedings of the Fourth International Conference on Case-Based
case bases to provide a more efficient overall solution when com-               Reasoning (ICCBR) , Vancouver, Canada, July, 2001).
pared to using the case bases separately or mapping them to one              [7] Enric Plaza, Santiago Ontañón: Cooperative Multiagent Learn-
central case base. Additionally, the system has shown that it can               ing. Adaptive Agents and Multi-Agents Systems 2002
provide a more diverse retrieved case population in both cases. A            [8] Santiago Ontañón, Enric Plaza: A bartering approach to im-
full scale evaluation of the system, including using a different                prove multiagent learning. AAMAS 2002
application domain is under way.                                             [9] Francisco J. Martín, Enric Plaza, Josep Lluís Arcos: Knowledge
                                                                                and Experience Reuse Through Communication Among Com-
                                                                                petent (Peer) Agents. International Journal of Software Engi-
5     Conclusion                                                                neering and Knowledge Engineering 1999
                                                                             [10] Hayes, C., Doyle, M., Cunningham, P.: Distributed CBR
                                                                                Using XML, Proceedings of the Fourth European Conference
At a time of increasing web-based communication and sharing of                  on Case-Based (ECCBR) , Dublin, Ireland, June 1998
knowledge between organisations and organisational units within              [11]R Engelmore, T Morgan : Blackboard Systems, Addison-
enterprises, heterogeneous CBR applied to Multiple Case Bases                   Wesley, 1988
seems to be the natural progression in this area of research.                [12] Petridis M, Knight B, (2001): “A blackboard architecture for a
   The paper investigates an approach based on agents operating                 hybrid CBR system for scientific software” in Proceedings of
on different structures/views of the problem domain in a transpar-              the Workshop Program at the 4th International Conference in
ent and autonomous way. In this approach all data is kept locally               Case-based Reasoning ICCBR01, Vancouver 2001, Eds R. We-
by each case base provider in its native form. Agents can be dy-                ber, G. von Wagenheim, pp 190-195, NCARAI, Naval Research
namically added to the system, thus increasing the search domain                Laboratory, Code 5510 Washington DC.
and potentially the competence and vocabulary of the system.                 [13] Case Dispatching versus Case-Base Merging: when MCBR
   This research proposes a new architecture for a self-adaptive                matters:     David       Leake     and     Raja    Sooriamurthi.
MCBR system which involves the use of a dynamic structure based                 International Journal on Artificial Intelligence Tools: Architec-
on the blackboard architecture. The Dynamic Structure reflects all              tures, Languages and Algorithms (IJAIT). Special issue on re-
participating case base provider structures. As new agents are                  cent advances in techniques for intelligent systems. Vol 13, No
added to the system, their case base structure is published and is              1, 2004
used to adapt the Dynamic Structure accordingly.
   The Dynamic Structure is used at runtime to translate search
queries into the local structures of each agent. Each agent can then


                                                                        68