=Paper=
{{Paper
|id=Vol-375/paper-1
|storemode=property
|title=Here you can download the complete proceedings as a single PDF-file
|pdfUrl=https://ceur-ws.org/Vol-375/cima08-proceedings.pdf
|volume=Vol-375
}}
==Here you can download the complete proceedings as a single PDF-file==
The 18th European Conference on Artificial Intelligence
Proceedings
1st International Workshop on
Combinations of Intelligent Methods and
Applications (CIMA 2008)
Tuesday July 22, 2008
Patras, Greece
Ioannis Hatzilygeroudis, Constantinos Koutsojannis
and Vasile Palade
Copyright © 2008 for the individual papers by the papers’ authors. Copying is
permitted for private and academic purposes. Re-publication of material from this
volume requires permission by the copyright owners.
Table of Contents

Workshop Organization … ii
Preface … iii

Papers

ANN for prognosis of abdominal pain in childhood: use of fuzzy modelling for convergence estimation
George C. Anastassopoulos and Lazaros S. Iliadis … 1

Using Genetic Programming to Learn Models Containing Temporal Relations from Spatio-Temporal Data
Andrew Bennett and Derek Magee … 7

Combining Intelligent Methods for Learner Modelling in Exploratory Learning Environments
Mihaela Cocea and George D. Magoulas … 13

Belief Propagation in Fuzzy Bayesian Networks
Christopher Fogelberg, Vasile Palade and Phil Assheton … 19

Combining Goal Inference and Natural-Language Dialogue for Human-Robot Joint Action
Mary Ellen Foster, Manuel Giuliani, Thomas Muller, Markus Rickert, Alois Knoll, Wolfram Erlhagen, Estela Bicho, Nzoji Hipolito and Luis Louro … 25

A Tool for Evolving Artificial Neural Networks
Efstratios F. Georgopoulos, Adam V. Adamopoulos and Spiridon D. Likothanassis … 31

Intelligently Raising Academic Performance Alerts
Dimitris Kalles, Christos Pierrakeas and Michalis Xenos … 37

Recognizing predictive patterns in chaotic maps
Nicos G. Pavlidis, Adam Adamopoulos and Michael N. Vrahatis … 43

Improving the Accuracy of Neuro-Symbolic Rules with Case-Based Reasoning
Jim Prentzas, Ioannis Hatzilygeroudis and Othon Michail … 49

Combinations of Case-Based Reasoning with Other Intelligent Methods (short paper)
Jim Prentzas and Ioannis Hatzilygeroudis … 55

Combining Argumentation and Hybrid Evolutionary Systems in a Portfolio Construction Application
Nikolaos Spanoudakis, Konstantina Pendaraki and Grigorios Beligiannis … 59

An Architecture for Multiple Heterogeneous Case-Based Reasoning Employing Agent Technologies (short paper)
Elena I. Teodorescu and Miltos Petridis … 65
Workshop Organization
Chairs-Organizers
Ioannis Hatzilygeroudis
University of Patras, Greece
Constantinos Koutsojannis
TEI of Patras, Greece
Vasile Palade
Oxford University, UK
Program Committee
Ajith Abraham, IITA, South Korea
Ao Sio Iong, Oxford University, UK
Plamen Angelov, Lancaster University, UK
Emilio Corchado, University of Burgos, Spain
George Dounias, University of the Aegean, Greece
Artur S. d’Avila Garcez, City University, UK
Melanie Hilario, CUI - University of Geneva, Switzerland
Elpida Keravnou-Papailiou, University of Cyprus, Cyprus
Rudolf Kruse, University of Magdeburg, Germany
George Magoulas, Birkbeck College, Univ. of London, UK
Vasilis Megalooikonomou, University of Patras, Greece
Toni Moreno, University Rovira i Virgili, Spain
Amedeo Napoli, CNRS-INRIA-University of Nancy, France
Ciprian-Daniel Neagu, University of Bradford, UK
Jim Prentzas, TEI of Lamia, Greece
Han Reichgelt, Southern Polytechnic State Univ., GA, USA
David Sanchez, University Rovira i Virgili, Spain
Douglas Vieira, University of Minas Gerais, Brazil
Contact Chair
Ioannis Hatzilygeroudis
Dept. of Computer Engineering & Informatics
University of Patras, Greece
Email: ihatz@ceid.upatras.gr
Preface
The combination of different intelligent methods is a very active research area in Artificial Intelligence (AI). The aim is to create integrated or hybrid methods that benefit from each of their components. It is generally believed that complex problems can be solved more easily with such integrated or hybrid methods.
Some of the existing efforts combine what are called soft computing methods (fuzzy
logic, neural networks and genetic algorithms) either among themselves or with more
traditional AI methods such as logic and rules. Another stream of efforts integrates case-
based reasoning or machine learning with soft-computing or traditional AI methods. Some
of the combinations have been quite important and more extensively used, like neuro-
symbolic methods, neuro-fuzzy methods and methods combining rule-based and case-
based reasoning. However, there are other combinations that are still under investigation.
In some cases, combinations are based on first principles, whereas in other cases they
are created in the context of specific applications.
The Workshop is intended to become a forum for exchanging experience and ideas
among researchers and practitioners who are dealing with combining intelligent methods
either based on first principles or in the context of specific applications.
A total of 20 papers were submitted to the Workshop. Each paper was reviewed by
at least two members of the PC. We finally accepted 12 papers (10 full and 2 short).
Revised versions of the accepted papers (based on the comments of the reviewers) are
included in these proceedings in alphabetical order (based on first author).
Five of the accepted papers deal with combinations of Genetic Programming or
Genetic Algorithms with either non-symbolic methods, like Neural Networks (NNs) and/or
Kalman Filters (Georgopoulos et al., Spanoudakis et al.), or symbolic ones, like Decision
Trees (Kalles et al.) and Temporal Logic (Bennett and Magee). Another four papers deal
with combinations of Case-Based Reasoning (CBR). One of them presents a short survey
of CBR combinations (Prentzas and Hatzilygeroudis) and another one a combination with
Agents (Teodorescu and Petridis). The remaining two present CBR combinations with a
Neuro-Fuzzy (Cocea and Magoulas) and a Neuro-Symbolic (Prentzas et al.) approach
respectively, leading to multi-combinations. Also, another two papers concern
combinations of Fuzzy Logic with either NNs (Anastassopoulos and Iliadis) or Bayesian
Nets (Fogelberg et al.). Finally, one of the papers combines a NN-based approach with a
Natural Language Processing one (Foster et al.).
Four of the above papers present combinations developed in the context of an
application. The applications involve Medicine (Anastassopoulos and Iliadis), Education
(Cocea and Magoulas, Kalles et al.) and Economics (Spanoudakis et al.).
We hope that this collection of papers will be useful to both researchers and
developers.
Given the success of this first Workshop on combinations of intelligent methods, we
intend to continue our effort in the coming years.
Ioannis Hatzilygeroudis
Constantinos Koutsojannis
Vasile Palade
ANN for prognosis of abdominal pain
in childhood: use of fuzzy modelling
for convergence estimation
George C. Anastassopoulos, Lazaros S. Iliadis
Democritus University of Thrace, Hellenic Open University
anasta@med.duth.gr, liliadis@fmenr.duth.gr

Abstract. This paper focuses on two parallel objectives. First, it aims at presenting a series of Artificial Neural Network (ANN) models that are capable of performing prognosis of abdominal pain in childhood. Clinical medical data records have been gathered and used towards this direction. Its second target is the presentation and application of an innovative fuzzy algebraic model capable of evaluating Artificial Neural Networks' performance [1]. This model offers a flexible approach that uses fuzzy numbers, fuzzy sets and various fuzzy intensification and dilution techniques to perform assessment of neural models under different perspectives. It also produces partial and overall evaluation indices. The produced ANN models have proven to perform the classification with significant success in the testing phase with first-time-seen data.

1 INTRODUCTION

The wide range of problems in which Artificial Neural Networks can be used with promising results is the reason for their growth [2, 3]. Some of the fields in which ANNs are used are: medical systems [4-6], robotics [7], industry [8-11], image processing [12], applied mathematics [13], financial analysis [14], environmental risk modelling [15] and others.

Prognosis is a medical term denoting an attempt of a physician to accurately estimate how a patient's disease will progress, and whether there is a chance of recovery, based on an objective set of factors that represent that situation. The inference about the prognosis of a patient when presented with complex clinical and prognostic information is a common problem in clinical medicine. The diagnosis of a disease is the outcome of a combination of clinical and laboratory examinations through medical techniques.

In this paper various ANN architectures using different learning rules, transfer functions and optimization algorithms have been tried. This research effort was motivated by the fact that reliable and timely detection of abdominal pain is an attainment in the effective treatment of the disease and the avoidance of relapses. That is why the development of such an intelligent model that can collaborate with the doctors will be very useful towards the successful treatment of potential patients.

2 DIAGNOSTIC FACTORS OF ABDOMINAL PAIN

Several reports have described clinical scoring systems incorporating specific elements of the history, physical examination, and laboratory studies designed to improve the diagnostic accuracy of abdominal pain [16]. Nothing is guaranteed, but decision rules can predict which children are at risk for appendicitis (appendicitis is the most common surgical condition of the abdomen). One such numerically based system is based on a 6-part scoring system: nausea (6 points), history of local RLQ pain (2 points), migration of pain (1 point), difficulty walking (1 point), rebound tenderness / pain with percussion (2 points), and absolute neutrophil count of >6.75 x 10^3/μL (6 points). A score <5 had a sensitivity of 96.3% with a negative predictive value of 95.6% for AA.

To date, all efforts to find clinical features or laboratory tests, either alone or in combination, that are able to diagnose appendicitis with 100% sensitivity or specificity have proven futile. Also, there is only one research work [4] in the bibliography based on ANNs that deals with abdominal pain prognosis in childhood.

The incidence of Acute Appendicitis (AA) is 4 cases per 1000 children. However, despite pediatric surgeons' best efforts, appendicitis remains the most commonly misdiagnosed surgical condition. Although diagnosis and treatment have improved, appendicitis continues to cause significant morbidity and still remains, although rarely, a cause of death. Appendicitis has a male-to-female ratio of 3:2 with a peak incidence between ages 12 and 18 years. The mean age in the pediatric population is 6-10 years. The lifetime risk is 8.6% for boys and 6.7% for girls.

The 15 factors that are used in routine clinical practice for the assessment of AA in childhood are: Sex, Age, Religion, Demographic data, Duration of Pain, Vomitus, Diarrhea, Anorexia, Tenderness, Rebound, Leucocytosis, Neutrophilia, Urinalysis, Temperature and Constipation. The sex (males), the age (peak of appearance of AA in children aged 9 to 13 years), and the religion (hygiene conditions, feeding attitudes, genetic predisposition) were related to a higher frequency of AA. Anorexia, vomitus, diarrhea or constipation and a slight elevation of the temperature (37° C - 38° C) were common manifestations of AA. Additionally, abdominal tenderness principally in the RLQ of the abdomen and the existence of the rebound sign are strongly related to AA. Leucocytosis (>10.800 K/μl) with neutrophilia (neutrophil count > 75%) is considered to be a significant clue for AA. Urinalysis is useful for detecting urinary tract disease; normal findings on urinalysis are of limited diagnostic value for appendicitis.

The roles of race, ethnicity, health insurance, education, access to healthcare, and economic status in the development and treatment of appendicitis are widely debated. Cogent arguments have been made on both sides for and against the significance of each socioeconomic or racial condition. A genetic predisposition appears operative in some cases, particularly in children in whom appendicitis develops before age 6 years. Although the disorder is uncommon in infants and the elderly, these groups have a disproportionate number of complications because of delays in diagnosis and the presence of comorbid conditions.
In terms of diagnosis, there are four stages of appendicitis: acute focal appendicitis, acute suppurative appendicitis, gangrenous appendicitis and perforated appendicitis. These distinctions are vague, and only the clinically relevant distinction of perforated (gangrenous appendicitis is included in this entity, since dead intestine functionally acts as a perforation) versus non-perforated appendicitis (acute focal and suppurative appendicitis) should be made.

The present study is based on a data set obtained from the Pediatric Surgery Clinical Information System of the University Hospital of Alexandroupolis, Greece. It consisted of 516 children's medical records. Some of these children had different stages of appendicitis and, therefore, underwent operative treatment. This data set was divided into a set of 422 records and another set of 94 records. The former was used for training of the ANN, while the latter was used for testing. A small number of data records were used as a validation set during training to avoid overfitting. Table 1 presents the stages of appendicitis as well as the corresponding cases for each one. The third column of Table 1 depicts the coding of the possible diagnoses, as used for the ANN training and testing stages.

Table 1. Possible diagnoses and corresponding cases.

Outcome | Diagnosis | Coding | Cases
Discharge | Normal | -2 | 236
Discharge | Observation | -1 | 186
Discharge | No findings | 0 | 15
Operative treatment | Focal appendicitis | 1 | 34
Operative treatment | Phlegmonous or suppurative appendicitis | 2 | 29
Operative treatment | Gangrenous appendicitis | 3 | 8
Operative treatment | Peritonitis | 4 | 8

3 NEURAL NETWORK DESIGN

The data were divided into two groups, the training cases (TRAC) and the testing cases (TESC). The TRAC consisted of 417 concrete medical data records and the TESC consisted of 101. Each input record was organised in a format of fifteen fields, namely sex, age, religion, area of residence, pain time period, vomit symptoms, diarrhoea, anorexia, located sensitivity, rebound, wbc, poly, general analysis of urine, body temperature and constipation. The output record contained a single field which corresponded to the potential outcome of each case.

The division into the TRAC and TESC data sets was performed in a rather random manner. The training and testing sample size that would be sufficient for a good generalization was determined by using Widrow's rule of thumb for the LMS algorithm, which is a distribution-free, worst-case formula [2], shown in equation 1. W is the total number of free parameters in the network (synaptic weights and biases) and ε denotes the fraction of the classification errors permitted during testing. The O notation denotes the order of the quantity enclosed within [2].

N = O(W / ε)    (1)
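Read as N ≈ W/ε, the rule of thumb can be sketched as a small helper in Python. This is a minimal sketch for illustration only; the function name and rounding are my own, not from the paper:

```python
def widrow_sample_size(num_weights: int, error_fraction: float) -> int:
    """Estimate training-set size N = O(W / eps) per Widrow's rule of
    thumb for the LMS algorithm, where W is the number of free
    parameters (synaptic weights and biases) and eps is the fraction
    of classification errors permitted during testing."""
    if not 0.0 < error_fraction < 1.0:
        raise ValueError("error_fraction must lie in (0, 1)")
    return round(num_weights / error_fraction)

# Example: a network with 300 free parameters and a 10% permitted
# error fraction should see on the order of 3000 training examples.
print(widrow_sample_size(300, 0.10))  # -> 3000
```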
Because of the hyperbolic tangent (TanH) transfer function, the input data were normalized (divided properly) in order to fall within the acceptable range of [-3, 3] and to avoid problems such as saturation, where an element's summation value (the sum of the inputs times the weights) exceeds the acceptable network range [17]. Standard back-propagation optimization algorithms using TanH, Sigmoid or Digital Neural Network Architecture (DNNA) transfer functions, combined with the Extended Delta Bar Delta (ExtDBD) or the Quick Prop learning rules [18, 19], were employed. The ExtDBD is a heuristic technique reinforcing good general trends and damping oscillations [20].

Modular and radial basis function (RBF) ANNs applying the ExtDBD learning rule and the TanH transfer function were also used in an effort to determine the optimal networks. RBFs have an internal representation of hidden neurons which are radially symmetric, and the hidden layer consists of pattern units fully connected to a linear output layer [21, 22].

3.2 ANN evaluation metrics applied

Traditional ANN evaluation measures like the Root Mean Square error (RMS error), R² and the confusion matrix were used to validate the ensuing neural network models. The RMS error adds up the squares of the errors for each neuron in the output layer, divides by the number of neurons in the output layer to obtain an average, and then takes the square root of that average. The confusion matrix is a graphical way of measuring the network's performance during the "training" and "testing" phases. It also facilitates the correlation of the network output with the actual observed values that belong to the testing set in a visual display [17], and therefore provides a visual indication of the network's performance. A network with the optimal configuration should have the "bins" (the cells in each matrix) on the diagonal from the lower left to the upper right of the output. An important aspect of the matrix is that the value of the vertical axis in the generated histogram is the Common Mean Correlation (CMC) coefficient of the desired (d) and the actual (predicted) output (y) across the Epoch.
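The RMS error as described above translates directly into code. A minimal sketch (the paper relies on an existing tool; this hypothetical helper only mirrors the verbal definition):

```python
import math

def rms_error(desired: list[float], actual: list[float]) -> float:
    """Root-mean-square error over the output layer: sum the squared
    errors for each output neuron, divide by the number of output
    neurons, then take the square root of that average."""
    if len(desired) != len(actual):
        raise ValueError("output vectors must have the same length")
    mean_sq = sum((d - y) ** 2 for d, y in zip(desired, actual)) / len(desired)
    return math.sqrt(mean_sq)

print(rms_error([1.0, -1.0], [0.8, -0.6]))  # ~0.316
```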
Finally, FUSETRESYS (Fuzzy Set Transformer Evaluation System), which constitutes an innovative ANN evaluation system, has been applied, offering a more flexible approach [1].

3.3 Technical description of the FUSETRESYS ANN evaluation model

Fuzzy logic enables the performance of calculations with mathematically defined words called "Linguistics" [1, 23-25]. FUSETRESYS treats each training/testing example as a Fuzzy Set. It applies triangular or trapezoidal membership functions in order to determine the partial degree of convergence (PADECOV) of the ANN for each training/testing example separately. Equations 2 and 3 represent a triangular and a trapezoidal membership function respectively [1]; the triangular form is:

μs(x; a, b, c) = max{ min{ (x − a)/(b − a), (c − x)/(c − b) }, 0 }    (2)
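A sketch of both membership functions in Python. Only the triangular formula above comes from the paper; the trapezoidal variant below is the standard textbook form (equation 3 is cut off in this extraction), and the function names are mine:

```python
def triangular(x: float, a: float, b: float, c: float) -> float:
    """Triangular membership: rises on [a, b], falls on [b, c]."""
    return max(min((x - a) / (b - a), (c - x) / (c - b)), 0.0)

def trapezoidal(x: float, a: float, b: float, c: float, d: float) -> float:
    """Trapezoidal membership (standard form, assumed here): rises on
    [a, b], is flat at 1 on [b, c], and falls on [c, d]."""
    return max(min((x - a) / (b - a), 1.0, (d - x) / (d - c)), 0.0)

print(triangular(2.0, 0.0, 2.0, 4.0))        # 1.0 at the peak
print(trapezoidal(2.5, 0.0, 1.0, 3.0, 4.0))  # 1.0 on the plateau
```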
Figure 3. This shows the four temporal states, with respect to current time, an object can be in: entering, existing, leaving, and left (from the figure, e.g. leaving: Current_time = end AND Current_time > start; left: Current_time > end AND Current_time > start). The dotted lines represent that we don't know when the object will leave the world.

Both the Allen's intervals and our additional temporal state relations are represented in the system as functions of the data that appear in the search section of the sub-models. These relations do not appear in the data; only the temporal range of individual objects occurs in the data. As the data pointers can be used over the entire history, it is quite likely that a sub-model will evaluate on many different parts of the history. To resolve this issue we just use the result which includes the most recent data. The justification for this is that the sub-model will have already output this information at a previous time in other situations.

4 Learning the Models from Data

In our previous work [2] it has been shown that it is intractable to find the set of optimal sub-models by exhaustive search, for all but the simplest problems. The search space is complex, so a stochastic search method was chosen as an alternative. We use Genetic Programming [10], which has already been successfully used for pattern recognition tasks [11].

Genetic Programming (GP) [10] evolves a population of programs until a program with the desired behaviour is found. It is a type of genetic algorithm, but the programs are stored as binary trees, and not as fixed-length strings. Functions are used for the nodes, and terminals (for example constants and variables) are used for the leaf nodes. In order for the population to evolve, a fitness function (in our case a predictive accuracy score) must be defined. This score will be used by the GP system to decide which programs in the current generation to use to produce the next generation, and which ones to throw away. To initialise the system, a set of randomly generated programs must be created. Each then receives a score using the fitness function. Operators including crossover, mutation and reproduction use the programs from the current generation to create a new generation. Crossover takes two programs and randomly picks a sub-tree on each program; these two sub-trees are swapped over, creating two new programs. Mutation takes one program and randomly picks a sub-tree on it, replacing that sub-tree with a newly generated one.
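A toy illustration of sub-tree crossover on binary trees, mirroring the description above. None of this is the authors' code; the Node class and swap mechanics are my own minimal sketch:

```python
import copy
import random

class Node:
    """A binary-tree program node: functions at internal nodes,
    terminals at the leaves."""
    def __init__(self, label, left=None, right=None):
        self.label, self.left, self.right = label, left, right

def nodes(tree):
    """Collect every node of a tree into a flat list."""
    out = [tree]
    for child in (tree.left, tree.right):
        if child is not None:
            out.extend(nodes(child))
    return out

def crossover(parent_a, parent_b, rng=random):
    """Standard GP crossover: pick a random sub-tree in each parent
    and swap them, producing two new programs."""
    child_a, child_b = copy.deepcopy(parent_a), copy.deepcopy(parent_b)
    sub_a, sub_b = rng.choice(nodes(child_a)), rng.choice(nodes(child_b))
    # Swapping the nodes' contents swaps the whole sub-trees in place.
    sub_a.__dict__, sub_b.__dict__ = sub_b.__dict__, sub_a.__dict__
    return child_a, child_b

a = Node("AND", Node("BEFORE"), Node("ENTERING"))
b = Node("EQUAL", Node("COLOUR"), Node("SHAPE"))
child1, child2 = crossover(a, b)
```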
To optimise the models, two kinds of operators are used: firstly, operators that try to optimise which sub-models are used in the model, and secondly, operators that optimise the sub-models themselves. A technique called tournament selection [10] is used to pick a model from the population. Tournament selection picks n models at random from the population and returns the one with the lowest score; for our experiments we set n to be 5 (a code sketch follows the list below). The operators used to optimise the sub-models which are used in the model are:

Reproduction: A set number of models are picked via tournament selection and copied directly into the new population.

Adding in a sub-model from another model: Two models are picked by tournament selection. A sub-model from the first picked model is randomly selected and added to the second chosen model.

Replacing a sub-model: Again two models are picked by tournament selection, and a sub-model from the first chosen model is then replaced by a sub-model randomly selected from the second chosen model.

Removing a sub-model: A model is picked by tournament selection, and a randomly selected sub-model is removed from it.

The only operator used to optimise the sub-models themselves is crossover. In crossover two models are picked using tournament selection. A sub-model from each model is then randomly selected, and standard crossover [10] is performed on these sub-models.
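Tournament selection as described (lower score is better, n = 5 in the paper's experiments) in a minimal sketch. Whether the n models are drawn with or without replacement is not stated; sampling without replacement is assumed here:

```python
import random

def tournament_select(population, score, n=5, rng=random):
    """Pick n models at random from the population and return the
    one with the lowest score."""
    contestants = rng.sample(population, n)
    return min(contestants, key=score)

# Example with toy "models" scored by their distance from 10:
pop = list(range(100))
best = tournament_select(pop, score=lambda m: abs(m - 10))
```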
To score the models, a fixed-length window is randomly moved over the dataset. At each generation two random locations are picked: one for training, and one for testing. In the training phase the probability distribution used in the context chooser is calculated. In the testing phase the fitness of a model (m) is evaluated over a windowed section of the dataset (w). For each position in the window the model is given a set of history data (h), calculated from the window, and is queried to produce a prediction. This produces a set of possible corresponding outputs (o) and a set of possible corresponding output likelihoods (ol). The similarity (C) of each output with the actual output (r) is computed using the FindBestMatch function, as shown in Equation 1. This function takes the set of actual outputs and the set of model outputs, and firstly pads them out with blank data so that they are the same size. Then for each item in the actual output set, a unique match in the model output set is found. For each of the matches a comparison is done between the two objects. The comparison looks at how similar each of the properties in the two objects is. The comparisons are summed together to produce a score that shows how good that set of matches is. An exhaustive search is then performed over all the possible combinations of matches to find the best (maximal) matching score. The result is then multiplied by its output likelihood. From this the best (maximal) output is found. This is then repeated over the rest of the window, and the results are summed and then normalised to produce (S), as shown in Equation 2. This fitness function is an improved version of the one described in our previous work [2], as it can be applied to non-deterministic datasets.

C(o, r) = FindBestMatch(o, r)    (1)

S(m, w) = (1/|w|) · Σ_i max_n ( ol_n · C(o_n, r_i) )    (2)
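Read literally, Equations 1 and 2 say: at each window position, keep the candidate output whose likelihood-weighted similarity to the real output is maximal, then average over the window. A sketch under that reading (FindBestMatch is passed in as a callable, since its exhaustive matching is only described in prose; predict is a hypothetical stand-in for querying the model):

```python
def score_model(window, predict, find_best_match):
    """Fitness S(m, w) = (1/|w|) * sum_i max_n( ol_n * C(o_n, r_i) ).
    predict(i) returns (output, likelihood) pairs for window position
    i; find_best_match(o, r) is the similarity C of Equation 1."""
    total = 0.0
    for i, actual in enumerate(window):
        candidates = predict(i)  # list of (output, likelihood) pairs
        total += max(lik * find_best_match(out, actual)
                     for out, lik in candidates)
    return total / len(window)  # normalise by the window length |w|
```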
The system runs in two stages, and will stop running once it exceeds a maximum number of generations. Firstly the system is initialised in the manner described above, and then for five generations it works out the best set of sub-models to use in the models. To do this the system uses reproduction (10%), removing (10%), adding (40%), and replacement (40%). Next the system optimises these models to find the best solution. It uses crossover (60%), reproduction (10%), removing (10%), adding (10%), and replacement (10%).

5 Evaluation

Our method was evaluated on three different datasets: handcrafted uno data, handcrafted snap data, and data from people walking through a network of mock CCTV cameras. More detail about these datasets is presented in the following section.

5.1 CCTV Data of a Path

A 10 minute video of people walking along a path containing a junction was filmed. This was then used to mock up a network of CCTV cameras. Figure 4 shows a frame from the video. Virtual motion detectors, representing CCTV cameras, were hand placed over the video as shown in Figure 4. Using frame differencing and morphological operations, the video was processed to determine the location of the motion. If the number of moved pixels in a region exceeded a fixed threshold then the virtual detector outputted that motion had occurred at that location. Hysteresis on the motion detection is implemented as a 2-state state machine (where the states are motion/no motion). The state machine requires a number of frames (normally 10) of stability to change state. The data produced is then placed in a datafile with a motion event recorded per state change going from no motion to motion. This was used to create a training datafile containing 84 state changes and a test file containing 46 state changes.

Figure 4. This figure firstly shows a frame of the video with a person taking a decision at the junction point, and secondly it shows where the virtual detectors are on the video.
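The motion detector's hysteresis, as described in Section 5.1, is a two-state machine that only switches after a run of stable frames (normally 10). A minimal sketch under that reading; the class and its field names are assumptions, not the authors' implementation:

```python
class MotionHysteresis:
    """Two-state (motion / no-motion) machine that changes state only
    after `stability` consecutive frames disagree with the current
    state (the paper normally uses 10 frames)."""

    def __init__(self, stability: int = 10):
        self.state = False        # False = no motion, True = motion
        self.stability = stability
        self.run = 0              # length of the disagreeing run

    def update(self, motion_now: bool) -> bool:
        if motion_now == self.state:
            self.run = 0          # observation agrees: reset the run
        else:
            self.run += 1
            if self.run >= self.stability:
                self.state = motion_now  # stable long enough: switch
                self.run = 0
        return self.state
```

A motion event would then be recorded on each transition from no motion to motion, as the paper describes.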
5.2 Snap

The snap dataset was handcrafted, but its format was similar to the snap dataset used in the work of [15]. The snap sequence is the following: initially the computer will see a blank scene, then it will hear the word "play", next two coloured cards will be seen. Either they will both be put down at the same time, or put down one by one. If they are the same then the word "equals" will be heard, otherwise "different" will be heard. Then the cards are removed, again either one by one, or at the same time. We ask the computer to only learn the sections where a human is speaking, as it would be impossible to accurately predict the next two cards because they are essentially random. Again three datasets were prepared: a non-noisy and a noisy training set, and a non-noisy test set. All the datasets contained around 50 rounds of snap. The noisy data was generated by adding 10% noise to the non-noisy training set. The noise took the form of removing cards, removing the play state, and changing the output state, for example making the output not equal when it should be equal.

5.3 Uno

The handcrafted uno dataset has a similar sequence to the snap dataset. Again the computer will initially see a blank scene. Then "play" will be heard. Next two cards, each one having one of three possible coloured shapes on them, will be placed down either at the same time, or one by one. If the two cards have the same coloured shape on them then "same" is heard; or if they have shapes of the same colour then "colour" is heard; or if they have the same shapes on them then "shape" is heard; or if the cards are different then "nothing" is heard. The cards are then removed either together, or one by one. Three datasets were created: a non-noisy training set, a noisy training set, and a non-noisy test set. Each one contained around 50 rounds of uno. Again noisy data was prepared by adding 10% of noisy data to the non-noisy training data. The noise took the same form as the noisy snap data.

Figure 5. This shows one of the sub-model results for the snap dataset (as a tree: AND( AND( BEFORE(CARD3, CARD2), ENTERING(CARD1) ), EQUAL( GET COLOUR(CARD2), GET COLOUR(CARD1) ) )). It is predicting the equal state, by using the properties of three cards. If card3 occurs before card2, card1 has just entered the world, and the colour of both card1 and card2 is the same, then the sub-model evaluates true, and Equal will be returned.
6 Results

To test the system, five runs were allocated to each possible combination of dataset. For each run a different random number seed was used to initialise the system. The tests were run on a 2GHz machine having 8GB memory.

To evaluate how well the models had been learnt, they were tested on a separate test set. Two metrics were used to evaluate the results: coverage and prediction accuracy. Coverage (C) scores whether the system can correctly predict the dataset (i.e. the probability of a correct prediction is greater than 0%) and is the number of correct predictions (pc) divided by the dataset size (d), as shown in Equation 3. Prediction accuracy (A) scores with what probability the correct prediction is made, and is the sum of the likelihoods of each correct prediction (pl) divided by the dataset size, as shown in Equation 4. In non-deterministic scenarios this will not be 100%.

C = pc / d    (3)

A = pl / d    (4)
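Equations 3 and 4 in code form. This is a direct transcription with hypothetical names:

```python
def coverage(num_correct: int, dataset_size: int) -> float:
    """C = pc / d: fraction of items for which the correct prediction
    has probability greater than 0%."""
    return num_correct / dataset_size

def accuracy(correct_likelihoods: list[float], dataset_size: int) -> float:
    """A = pl / d: sum of the likelihoods of each correct prediction,
    divided by the dataset size; below 100% on non-deterministic data."""
    return sum(correct_likelihoods) / dataset_size
```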
Training Dataset | Number of runs | C(%) | A(%)
Snap No Noise | 1 | 99.9 | 99.9
Snap No Noise | 4 | 99.8 | 99.8
Snap Noise | 2 | 99.8 | 99.8
Snap Noise | 1 | 99.8 | 96.6
Snap Noise | 1 | 96.0 | 94.8
Snap Noise | 1 | 96.0 | 93.3
Uno No Noise | 1 | 99.7 | 99.7
Uno No Noise | 1 | 97.2 | 97.2
Uno No Noise | 1 | 94.0 | 92.0
Uno No Noise | 1 | 91.5 | 89.7
Uno No Noise | 1 | 88.8 | 88.8
Uno Noise | 2 | 96.8 | 88.4
Uno Noise | 1 | 95.1 | 95.1
Uno Noise | 1 | 94.3 | 90.4
Uno Noise | 1 | 88.5 | 87.1
Path No Noise | 1 | 97.9 | 92.1
Path No Noise | 1 | 97.9 | 89.3
Path No Noise | 1 | 96.8 | 88.8
Path No Noise | 1 | 95.8 | 90.0
Path No Noise | 1 | 94.7 | 88.5
Path Noise | 1 | 93.6 | 85.8
Path Noise | 1 | 92.6 | 85.2
Path Noise | 1 | 91.5 | 82.4
Path Noise | 1 | 90.8 | 83.3
Path Noise | 1 | 90.1 | 80.7

Figure 6. This figure shows the results for the snap, uno and path datasets. The "Number of runs" column shows, for each training set, how many runs got the same coverage and accuracy scores.

Both the snap datasets were tested with a population size of 4000, and the system was run for 65 generations, taking around 5 hours for each run. All the runs using the non-noisy datasets were successful. However, the models did not get 100% coverage because they failed to produce any output at the start of the test dataset, as there were insufficient items in the history. Figure 5 shows an example of this, as it will only evaluate once there are three cards in the history. Four of the results did not predict the first two items in the test dataset, and one of the results only failed to predict the first item. Two out of the five runs using the noisy snap dataset got an exact solution. The noise affected the models, causing the sub-models to model incorrect parts of the dataset. This was because some of the noise added to the noisy training set changed the outcomes for some rounds of snap; this then caused the system to model the noise, and to incorrectly predict the outcomes in the test set. Again, as with the non-noisy snap models, there were problems predicting the start of the test dataset. The models themselves made use of both the Allen's intervals and the temporal state relations. Figure 5 shows one of the sub-models produced from the non-noisy snap training set. It shows the use of Allen's intervals (the before relation) and the temporal state relations (the enter relation). Most of the models contained four sub-models.

The uno datasets were run with a population of size 6000, and for 65 generations, taking around 7 hours for each run. One out of five runs on the non-noisy dataset managed to get the correct solution, but it did not get 100% coverage because it did not have enough history at the start of the test set to predict the initial items. The rest of the non-noisy results were very close to the solution, and probably needed more generations to find the exact solution. The models themselves were very similar to the models produced for the snap datasets. Both Allen's intervals and the temporal state relations were used. None of the runs for the noisy dataset managed to produce an exact result, with the noise causing the sub-models to model incorrect parts of the dataset.

The runs using the path dataset used a population size of 2000, and the system was run for 65 generations, taking around 3 hours for each run. All the runs using the non-noisy dataset predicted well in the main section of the test dataset, but failed to predict well at the start of the test dataset, due to lack of history. Some of the runs also failed to predict infrequently occurring actions in the test set. In the runs using the noisy training set, all the models learnt the frequently occurring actions, but they all started to learn some of the noise in the dataset, and this affected their scores on the test dataset. Both the non-noisy and noisy models used Allen's intervals and the temporal state relations.

7 Conclusions

We have extended the previous work of [2] and shown that it is possible, by the use of temporal relations, to use a qualitative, as well as a Markovian, representation of time. This technique is important for a number of reasons. Firstly it produces models that are robust to irrelevant variations in data. Secondly, it allows the system to learn from a dataset containing single actions, and then be able to predict from a dataset containing multiple overlapping actions.

In future work we will be looking into using spatial, as well as temporal, relations in the system. We are also looking into trying out quantitative relations, so that a relation will not work on objects that are either too close or too far away. We will also be looking into changing the output from a sub-model based on what data the search section has evaluated on. Finally we will be looking at speed improvements to the system so that the run time can be reduced.

REFERENCES

[1] James Allen, ‘Maintaining knowledge about temporal intervals’, Communications of the ACM, 26, 832–843, (1983).
[2] A. Bennett and D. Magee, ‘Learning sets of sub-models for spatio-temporal prediction’, in Proceedings of AI-2007, the Twenty-seventh SGAI International Conference on Innovative Techniques and Applications of Artificial Intelligence, pp. 123–136, (2007).
[3] Matthew Brand, ‘Pattern discovery via entropy minimization’, in Artificial Intelligence and Statistics, (1998).
[4] Simon Colton, Alan Bundy, and Toby Walsh, ‘Automatic identification of mathematical concepts’, in International Conference on Machine Learning, (2000).
[5] R. Etxeberria, P. Larranaga, and J.M. Picaza, ‘Analysis of the behaviour of genetic algorithms when learning bayesian network structure from data’, Pattern Recognition Letters, 13, 1269–1273, (1997).
[6] Nir Friedman and Daphne Koller, ‘Being bayesian about network structure’, Machine Learning, 50, 95–126, (2003).
[7] Aphrodite Galata, Neil Johnson, and David Hogg, ‘Learning behaviour models of human activities’, in British Machine Vision Conference (BMVC), (1999).
[8] Kristen Grauman and Trevor Darrell, ‘The pyramid match kernel: discriminative classification with sets of image features’, in International Conference on Computer Vision, (2005).
[9] R. Haenni, ‘Towards a unifying theory of logical and probabilistic reasoning’, in International Symposium on Imprecise Probabilities and Their Applications, pp. 193–202, (2005).
[10] John Koza, Genetic Programming, MIT Press, 1992.
[11] John Koza, Genetic Programming II, MIT Press, 1994.
[12] Philippe Leray and Olivier Francios, ‘Bayesian network structural
learning and incomplete data’, in Adaptive Knowledge Representation
and Reasoning, (2005).
[13] David Montana, ‘Strongly typed genetic programming’, in Evolution-
ary Computation, (1995).
[14] S.H. Muggleton and J. Firth, ‘CProgol4.4: a tutorial introduction’, in
Relational Data Mining, 160–188, Springer-Verlag, (2001).
[15] Chris Needham, Paulo Santos, Derek Magee, Vincent Devin, David
Hogg, and Anthony Cohn, ‘Protocols from perceptual observations’,
Artificial Intelligence, 167, 103–136, (2005).
[16] N. J. Nilsson, ‘Probabilistic logic’, Artificial Intelligence, 28, 71–87,
(1986).
[17] Jeffrey Mark Siskind, ‘Grounding the lexical semantics of verbs in visual perception using force dynamics and event logic’, Artificial Intelligence Research, 15, 31–90, (2000).
Combining Intelligent Methods for Learner Modelling
in Exploratory Learning Environments
Mihaela Cocea and George D. Magoulas
London Knowledge Lab, Birkbeck College, University of London, UK; email: {mihaela;gmagoulas}@dcs.bbk.ac.uk

Abstract. Most of the existing learning environments work in well-structured domains by making use of or combining AI techniques in order to create and update a learner model, provide individual and/or collaboration support and perform learner diagnosis. In this paper we present an approach that exploits the synergy of case-based reasoning and soft computing for learner modelling in an ill-structured domain for exploratory learning. We present the architecture of the learner model and the knowledge formulation in terms of cases, and illustrate its application in an exploratory learning environment for mathematical generalisation.

1 INTRODUCTION

Several AI techniques have been proposed in intelligent learning environments, such as case-based reasoning [27], [10], bayesian networks [4], [6], neural networks [2], genetic and evolutionary algorithms [24], neuro-fuzzy systems [26], as well as synergistic approaches, such as genetic algorithms and case-based reasoning [13], hybrid rules integrating symbolic rules with neurocomputing [11], and expert systems with genetic algorithms [18].

Exploratory Learning Environments (ELEs) belong to a particular class of learning environments built on the principles of the constructivism paradigm for teaching and learning. ELEs place the emphasis on the opportunity to learn through free exploration and discovery rather than guided tutoring. This approach has proved to be beneficial for learners in terms of acquiring deep conceptual and structural knowledge. However, discovery learning without guidance and support appears to be less effective than step-by-step guided learning environments [16]. To this end, an understanding of the learner's behaviour and knowledge construction is needed [22].

Most existing ELEs use simulations as a way of actively involving learners in the learning process (e.g. [28], [14]) and exploit cognitive tools [29] to support their learning. Few such systems model the learner's knowledge/skills; for example [4] and [6] use bayesian networks and [26] combines neural networks with a fuzzy representation of knowledge. Another category of ELEs is closer to the constructivist approach, allowing learners to construct their own models rather than explore a "predefined" one. Compared to conventional learning environments (even environments that use simulations), this type of ELE requires approaches to learner modelling that would be able to capture and model the useful interactions that take place as learners construct their models.

In this paper, we present an approach to learner modelling in ELEs (suitable for both exploring simulations and constructing models) that combines case-based reasoning with other AI techniques. The subsequent section briefly introduces the application domain, namely mathematical generalisation, and the ELE used, called ShapeBuilder, and discusses the challenges involved in performing learner modelling. Section 3 presents a conceptual framework for the learner modelling process and describes the case-based formulation. Section 4 illustrates the process with an example, while Section 5 concludes the paper and outlines future work.

2 EXPLORATORY LEARNING FOR MATHEMATICAL GENERALISATION

Mathematical generalisation (MG) is associated with algebra, as "algebra is, in one sense, the language of generalisation of quantity. It provides experience of, and a language for, expressing generality, manipulating generality, and reasoning about generality" [20].

However, students do not associate algebra with generalisation, as the algebraic language is perceived as being separate from what it represents [15]. To address this problem the ShapeBuilder [8] system, which is an ELE under development in the context of the MiGen project (funded by ESRC, UK, under TLRP e-Learning Phase-II (RES-139-25-0381); http://www.tlrp.org/proj/tel/tel_noss.html), aims to facilitate the correspondence between the models, patterns and structures (visual representations) that the learners build, on the one hand, and their numeric, iconic and symbolic representations, on the other hand. ShapeBuilder allows the construction of different shapes [9], e.g. rectangles, L-shapes and T-shapes, and supports the three types of representations aforementioned: (a) numeric representations that include numbers (constants or variables) and expressions with numbers; (b) iconic representations which correspond to icon variables; (c) symbolic representations that are names or symbols given by users to variables or expressions. An icon variable has the value of a dimension of a shape (e.g. width, height) and can be obtained by double-clicking on the corresponding edge of the shape. It is represented as an icon of the shape with the corresponding edge highlighted (see Figure 1a).

Constants, variables and numeric expressions lead to specific constructions/models, while icon variables and expressions using them lead to general ones. Through the use of icon variables, ShapeBuilder encourages structured algebraic thinking, connecting the visual with the abstract (algebraic) representation, as "each expression of generality expresses a way of seeing" [20] (see Figure 1b). It also uses the "messing up" metaphor [12] that consists of asking the learner to resize a construction and observe the consequences; the model will "mess up" only if it is not general (see Figures 1c and d), indicating the learner's lack of generalisation ability.

When attempting to model the learner in an ELE for such a wide domain as MG, several challenges arise. The main and widely
acknowledged challenge is to balance freedom with control: learners should be given enough freedom so that they can actively engage in activities, but they should be offered enough guidance to ensure that the whole process reflects constructivist learning and leads to useful knowledge [21]. This and some other challenges are illustrated in Table 1 with examples from the domain of MG.

Figure 1. (a) A rectangular shape and its icon variable; (b) an expression using icon variables; (c) "messing up"; (d) general solution that does not "mess up".

Table 1. Applying learner modelling in ELEs for mathematical generalisation.

Challenge | Example
Balance between freedom and control | When a learner is trying to produce a general representation, for how long should he be left alone to explore, and when does guidance become necessary?
What should be modelled? | Besides the learner's knowledge of MG concepts (e.g. use of variables, consistency between representations, etc.), other aspects need to be modelled in order to support the learner during exploration: shapes constructed, relations between shapes, etc.
Do both correct and incorrect actions or behaviours have value? | In exploratory learning it is difficult to categorise actions or the learner's explorations into "correct" and "incorrect". Moreover, actions that might lead to incorrect outcomes, such as resizing, can be more valuable for constructivist learning than "correct" actions.
Reasoning about abstract knowledge | Can consistency be inferred from the fact that a learner is checking the correspondence between various forms of representations? If so, is that always true? Are there any exceptions to this rule?
Underlying strategies | As it is neither realistic nor feasible to include all possible outcomes (correct or incorrect) to model the domain of MG, only key information with educational value could be stored, such as strategies in solving a task. The challenge is how to represent and detect them.

3 A CONCEPTUAL FRAMEWORK FOR LEARNER MODELLING

Given the challenges mentioned in Table 1, a conventional learner modelling approach does not fit the purposes of ELEs. Due to the exploratory nature of the activities and the diversity of possible trajectories, flexibility in the representation of information and handling of uncertainty are two important aspects for effectively supporting the learning process. As case-based reasoning offers flexibility of information representation and soft computing techniques handle uncertainty, a combination of the two is used. Moreover, previous research has proved the benefits of combining case-based reasoning with neural networks [23] and fuzzy quantifiers [30]. In the following subsections, the architecture of the system, the AI components and their role are described.

3.1 The Architecture

The architecture of the "Intelligent" ShapeBuilder is represented in Figure 2. As the learner interacts with the system through the interface, the actions of the learner are stored in the Learner Model (LM) and are passed to the Interactive Behaviour Analysis Module (IBAM), where they are processed in cooperation with the Knowledge Base (KB); the results are fed into the LM. The Feedback Module (FM) is informed by the LM and the KB and feeds back to the learner through the interface.

Figure 2. Schematic of an intelligent architecture for ShapeBuilder.

The KB includes two components (see Figure 2): a domain and a task model. The domain model includes high-level learning outcomes related to the domain (e.g. using variables, structural reasoning, consistency, etc.) and considers that each learning outcome can be achieved by exploring several tasks. The task model includes different types of information: (a) strategies for approaching the task, which could be correct, incorrect or partially correct; (b) outcomes of the exploratory process and solutions to specific questions associated with each (sub)task; (c) landmarks, i.e. relevant aspects or critical events occurring during the exploratory process; (d) contexts, i.e. references to particular (sub)tasks.

The IBAM component combines case-based reasoning with soft computing in order to identify what learners are doing and be able to provide feedback as they explore a (sub)task. More specifically, as they are working on a specific subtask, which specifies a certain context, their actions are preprocessed, current cases are identified and matched to the cases from the Task Model (the case base). Prior to matching, local feature weighting [23] is applied in order to reflect the importance of the attributes in the current context.

In the FM component, multicriteria decision making [7] will be used to obtain priorities between several aspects that require feedback, depending on the context.
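One way to picture the LM/IBAM/KB/FM wiring described above is as a simple event loop. This is purely illustrative; the class and module interfaces below are hypothetical, not the MiGen code:

```python
class ShapeBuilderLoop:
    """Illustrative dataflow: interface actions go to the Learner
    Model (LM), are analysed by the IBAM in cooperation with the
    Knowledge Base (KB), and the Feedback Module (FM) responds."""

    def __init__(self, lm, ibam, kb, fm):
        self.lm, self.ibam, self.kb, self.fm = lm, ibam, kb, fm

    def on_action(self, action):
        self.lm.store(action)                        # actions stored in the LM
        result = self.ibam.analyse(action, self.kb)  # processed with the KB
        self.lm.store(result)                        # results fed into the LM
        return self.fm.respond(self.lm, self.kb)     # feedback to the learner
```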
3.2 Case-based Knowledge Representation

In case-based reasoning (CBR) [17] the knowledge is stored as cases, typically including the description of a problem and the corresponding solution. When a new problem is encountered, similar cases are
searched and the solution is adapted from one or more of the most similar cases.

Although CBR has been used successfully in applications for domains like legal reasoning [1], stock market prediction [5], recommender systems [19], and other areas, there is little research on using CBR for e-Learning environments. For example, [10] use CBR in the learner modelling process and call this approach case-based student modelling, while [13] use CBR and genetic algorithms to construct an optimal learning path for each learner. CBR is also used in [27], within a case-based instruction scenario rather than as a method for learner modelling. We have not found any references in the literature to ELEs that use CBR or CBR combined with other intelligent methods.

The advantage of CBR for learning environments, and especially for ELEs, is that the system does not rely only on an explicit representation of general knowledge about a domain, but can also use specific knowledge previously experienced [10]. It also seems promising for improving the effectiveness of complex and unstructured decision making [13] in combination with other computing methods.

In our research, CBR is used in the learner modelling process. The cases contain information describing models that learners should construct using ShapeBuilder. Different strategies for approaching a problem (i.e. constructing a model to meet a particular learning objective) are represented as a series of cases that reflect possible exploratory trajectories of learners as they construct models during the various (sub-)tasks.

A case is defined as C_i = {F_i, RA_i, RC_i}, where C_i represents the case, F_i is a set of attributes, RA_i is a set of relations between attributes, and RC_i is a set of relations between C_i and other cases, respectively.

The set of attributes is represented as F_i = {α_i1, α_i2, ..., α_iN}. It includes three types of attributes: (a) numeric, (b) variables and (c) binary. Variables refer to different string values that an attribute can take, and binary attributes indicate whether a case can be considered in formulating a particular strategy or not. This can be represented as a "part of strategy" function PartOf_Su : C_i → {0, 1}:

PartOf_Su(C_i) = 1 if C_i ∈ S_u, and 0 if C_i ∉ S_u,

where S_u represents a strategy and is defined further on. The set of attributes of a generic case for ShapeBuilder is presented in Table 2. The first v attributes (α_ij, j = 1, ..., v) are variables, the ones from v+1 to w are numeric (α_ij, j = v+1, ..., w) and the rest are binary (α_ij, j = w+1, ..., N).

Table 2. The set of attributes (F_i) of a case.

Category | Name | Label | Possible Values
Shape | Shape type | α_i1 | Rectangle / L-Shape / T-Shape
Dimensions of shape | Width type | α_i2 | constant (c) / variable (v) / icon variable (iv) / numeric expression (n_exp) / expression with iv(s) (iv_exp)
Dimensions of shape | Height type | α_i3 | c / v / iv / n_exp / iv_exp
Dimensions of shape | ... | ... | ...
Dimensions of shape | Thickness type | α_iv | c / v / iv / n_exp / iv_exp
Dimensions of shape | Width value | α_i(v+1) | numeric value
Dimensions of shape | Height value | α_i(v+2) | numeric value
Dimensions of shape | ... | ... | ...
Dimensions of shape | Thickness value | α_iw | numeric value
Part of Strategy | PartOfS1 | α_i(w+1) | 1
Part of Strategy | PartOfS2 | α_i(w+2) | 0
Part of Strategy | ... | ... | ...
Part of Strategy | PartOfSr | α_iN | 0
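A sketch of the case structure C_i = {F_i, RA_i, RC_i} and the PartOf function as Python data types. The field names and the tuple encoding of relations are my own assumptions for illustration:

```python
from dataclasses import dataclass, field

@dataclass
class Case:
    """C_i = {F_i, RA_i, RC_i}: attributes, relations between
    attributes, and relations to other cases."""
    attributes: dict = field(default_factory=dict)    # F_i, e.g. {"shape_type": "Rectangle", "width_value": 3}
    attr_relations: set = field(default_factory=set)  # RA_i, e.g. {("DEP", "width_type", "height_type")}
    case_relations: set = field(default_factory=set)  # RC_i, e.g. {("Prev", "case_7")}

def part_of(case_id: str, strategy_cases: set[str]) -> int:
    """PartOf_Su(C_i): 1 if C_i belongs to strategy S_u, 0 otherwise."""
    return 1 if case_id in strategy_cases else 0
```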
lation between attributes of cases and Nu (RCi ) is a set of relations
between cases.
A case is defined as Ci = {Fi , RAi , RCi }, where Ci represents
the case and Fi is a set of attributes. RAi is a set of relations between
attributes and RCi is a set of relations between Ci and other cases
3.3 Comparing Cases, Exploiting Context and
respectively.
Modelling Learning Trajectories
The set of attributes is represented as Fi = {αi1 , αi2 , . . . , αiN }. In this section we present three distinctive features of the proposed
It includes three types of attributes: (a) numeric, (b) variables and (c) framework: comparing cases, exploiting context and modelling of
binary. Variables refer to different string values that an attribute can learning trajectories.
take, and binary attributes indicate whether a case can be considered
in formulating a particular strategy or not. This could be represented
as a “part of strategy” function: P artOf Su : Ci → {0, 1}, Comparing cases. The most common definition of similarity is a
weighted sum of similarities of attributes of cases [17]:
1 if Ci ∈ Su
P artOf Su =
0 if Ci ∈ / Su , PN
× sim(fiI , fiR )
i=1 oiP
SIR = n ,
where Su represents a strategy and is defined further on. The set of i=1 oi
15
where oi represents the weight of each attribute, sim is a similarity Learning trajectories. A string of cases connected with relations
function, and I and R stand for input and retrieved cases, respec- in time yields a knowledge structure that represents learner’s explo-
tively. In our case, four similarity measures are defined for compar- rations/learning trajectory in the ELE during a task or sub-task. Such
ing cases: a learning trajectory is constructed by successively applying P rev
and N ext relations to Ci in order to get cases previous in time to
1. Euclidean
qdistance is used for comparing numeric attributes: Ci and cases following Ci , respectively. Comparing trajectories in
Pw
the KB to the current trajectory (this is useful to provide support and
j=v+1 oj × (αIj − αRj )
DIR = 2
2. The following metric is used for attributes that are variables: decide on scaffolding techniques) is done in two stages: comparing
P v
j=1 g(αIj ,αRj )
the past and evaluating the future.
VIR = v
, where g is defined as: Comparison of the past with respect to a reference point (e.g. a
selected case) depends on the depth of the evaluation in terms of
1 if αIj = αRj samples taken into account and rules than concern comparisons of
g(αIj , αRj ) =
0 if αIj 6= αRj , the past, e.g. IF the actual trajectory is similar to a trajectory in the
KB, indicated by a reference case representing a starting point in the
3. In a similar way to [25], we define the following metric for com- past, THEN this trajectory is a past-matching trajectory.
|RAI ∩RAR |
paring relations between attributes: PIR = |RA I ∪RAR |
. PIR is When it comes to evaluating the future of a trajectory, comparison
the number of relations between attributes that the input and re- is based on the similarity between the future of a trajectory in the KB
trieved case have in common divided by the total number of rela- with a desired future for the current trajectory. This is expressed by
tions between attributes of the two cases. rules of the general form: IF a piece of the future trajectory of a past-
4. Similarity in terms of relations between cases is defined by TIR = matching trajectory resembles the reference starting from a selected
|RCI ∩RCR |
|RCI ∪RCR |
, where TIR is the number of relations between cases case, THEN the reference can be met by applying certain strategies.
that the input and retrieved case have in common divided by the As it is not possible to represent all learning trajectories in the
the total number of relations between cases of I and R. KB of an ELE, similarity is measured in terms of convex fuzzy sets,
whose width might change depending on the context and the amount
In order to identify the closest strategy to the one employed by a of information available, i.e. the current trajectory can be interpreted
learner, cumulative similarity measures are used for each of the four in more vague way by increasing the width of the fuzzy set. Also if
types of similarity: the distance between past and future is large for certain tasks, it does
not make sense to evaluate the future carefully. Nevertheless, if the
( zi=1 DIi Ri )/z.
P
1. Numeric attributes:
Pz distance to a reference (desired outcome) is small, the future needs to
2. Variables: ( i=1 VIi Ri )/z. P be evaluated accurately. So the depth of the evaluation is measured
z
Pz( i=1 PIi Ri )/z.
3. Relations between attributes: by a fuzzy time distance set to evaluate both short and long time
4. Relations between cases. ( i=1 TIi Ri )/z. distances.
Exploiting context. Attributes and relations stored in cases have different relevance depending on the context, which in ShapeBuilder corresponds to different stages of the constructivist learning process that learners go through as they explore the various sub-tasks within a learning activity. Typically, a task includes several sub-tasks, and the activity is sequenced within the system so as to know at any time the current context. As the environment allows the learners to explore, they may "jump" to different stages in the activity sequence.

Context dependence can be taken into account by having different weights for attributes and relations depending on the stage of the learning process within a task or activity. The weights could be obtained through an approach called local feature weighting [23] that uses Neural Networks (NNs). The principle of the training algorithm is to reduce the distance between cases of the same class and increase the distance between cases of different classes [23], where the various classes in ShapeBuilder correspond to types of context (stages of the learning process) of the various (sub-)tasks. Thus, a neural network is trained in order to identify the context, and several networks (one for each context) are used to provide the context-specific weights. This approach appears to be more robust than other weighting schemes due to the generalisation capacities of the NNs, which can produce weights even in imprecise situations [23].

Learning trajectories. A string of cases connected with relations in time yields a knowledge structure that represents the learner's explorations/learning trajectory in the ELE during a task or sub-task. Such a learning trajectory is constructed by successively applying Prev and Next relations to Ci in order to get the cases previous in time to Ci and the cases following Ci, respectively. Comparing trajectories in the KB to the current trajectory (this is useful to provide support and decide on scaffolding techniques) is done in two stages: comparing the past and evaluating the future.

Comparison of the past with respect to a reference point (e.g. a selected case) depends on the depth of the evaluation in terms of samples taken into account and rules that concern comparisons of the past, e.g. IF the actual trajectory is similar to a trajectory in the KB, indicated by a reference case representing a starting point in the past, THEN this trajectory is a past-matching trajectory.

When it comes to evaluating the future of a trajectory, comparison is based on the similarity between the future of a trajectory in the KB and a desired future for the current trajectory. This is expressed by rules of the general form: IF a piece of the future trajectory of a past-matching trajectory resembles the reference starting from a selected case, THEN the reference can be met by applying certain strategies.

As it is not possible to represent all learning trajectories in the KB of an ELE, similarity is measured in terms of convex fuzzy sets, whose width might change depending on the context and the amount of information available, i.e. the current trajectory can be interpreted in a more vague way by increasing the width of the fuzzy set. Also, if the distance between past and future is large for certain tasks, it does not make sense to evaluate the future carefully. Nevertheless, if the distance to a reference (desired outcome) is small, the future needs to be evaluated accurately. So the depth of the evaluation is measured by a fuzzy time distance set to evaluate both short and long time distances.

4 AN ILLUSTRATIVE EXAMPLE

To illustrate the combination of intelligent methods for learner modelling we use an example from the mathematical generalisation domain, and a task called "pond tiling", which is common in the English school curriculum and expects learners to produce a general expression for finding out how many tiles are required for surrounding any rectangular pond [8]. The high level learning objective in the Domain Model is to acquire the ability to perform structural reasoning [9]. In order to achieve this, sub-tasks can be explored in ShapeBuilder, e.g. construct a pond of fixed dimensions, surround the pond with tiles and determine how many are required; generalise the structure using icon variables.

Knowledge representation. The Task Model for pond tiling includes: (a) strategies identified in pilot studies [9], e.g. thinking in terms of areas (see Figure 3a) or in terms of width and height (see Figures 3b, c, d, e and f); (b) outcomes, e.g. model built, number of tiles for surrounding a particular pond, and solution, i.e. the general expression (see Figure 3 for the solutions corresponding to each strategy; for the "area strategy" the solution with icon variables is displayed in Figure 1b); (c) landmarks, e.g. for the area strategy: creating a rectangle with height and width greater by 2 than the pond; for the width and height strategies: using rows/columns of tiles; slips: several correct actions followed by an incorrect one (e.g. correct surrounding of the pond, partially correct expression, but missing a 2 in the formula); (d) the context of each (sub-)task.
The six strategies and their associated solutions (the general expressions for surrounding any rectangular pond) are displayed in Figures 3(a–f). Two strategies are presented in detail: the "area strategy" (S1) and the "I strategy" (S3). The attributes of cases that are part of these two strategies are presented in Table 4 and Table 5, respectively. The steps and the sets of relations between attributes and between cases are displayed in Figure 3g and Figure 3h, respectively.

Table 4. The set of attributes (Fi) for the cases in the "area strategy".

Name          Label  C1         C2
Shape type    αi1    Rectangle  Rectangle
Width type    αi2    c/v/n exp  iv/iv exp
Height type   αi3    c/v/n exp  iv/iv exp
Width value   αi4    5          7
Height value  αi5    3          5
PartOf S1     αi6    1          1
...           ...    ...        ...
PartOf S2     αi7    1          0
PartOf S6     αi8    1          0

Table 5. The set of attributes (Fi) for the cases in the "I strategy".

Label  C1         C2         C3         C4         C5
αi1    Rectangle  Rectangle  Rectangle  Rectangle  Rectangle
αi2    c/v/n exp  iv/iv exp  iv/iv exp  c/v/n exp  c/v/n exp
αi3    c/v/n exp  c/v/n exp  c/v/n exp  iv/iv exp  iv/iv exp
αi4    5          7          7          1          1
αi5    3          1          1          3          3
αi6    1          0          0          0          0
αi7    1          1          1          1          1
...    ...        ...        ...        ...        ...
αi8    1          0          0          1          1

A particular order between cases is presented for the "I strategy" in Figure 3h. For the same strategy, the surrounding of the pond could be done in several other different orders; there are 4! = 24 such possibilities (the pond is always first).

There are two types of strategies depending on the degree of generality: specific and general. Specific cases refer to surroundings that cannot be generalised and include value relations, but no dependency relations; the general cases refer to surroundings that can be generalised and are distinguished by the presence of the dependency relations and by the fact that the dimension type of at least one of the dimensions of the case is an icon variable or an expression using icon variable(s). The presence or absence of the abovementioned aspects applies to all cases that form the composite case, with the exception of the first case representing the pond. The "area" and the "I strategy" presented previously fall into the category of general strategies.

The strategies displayed in Figure 3 are correct symmetrical "elegant" solutions, but trials with pupils have shown that not all of them use this type of approach [8, 9]. Some pupils surround the pond in a non-systematic manner and with variable degrees of symmetry. Such examples are illustrated in Figure 4.

Figure 3. (a) "Area strategy"; (b) "H strategy"; (c) "I strategy"; (d) "Spiral strategy"; (e) "+4 strategy"; (f) "−4 strategy"; (g) Steps and relations of "area strategy"; (h) Steps and relations of "I strategy".

Comparing cases. To illustrate the operation of similarity measures we use two non-symmetrical examples of surrounding the pond, displayed in Figure 4. The similarity measures are the ones presented in Section 3.3.

The first example (Figure 4a) has 4 cases in common with two strategies: the "I strategy" (C1, C3, C4, C5) and the "+4 strategy" (C1, C4, C5, C6). When comparing it with the "I strategy", z = 5 (minimum between 6 and 5) and the combined similarity is: √1/5 + 5/5 + (7/4)/5 + (10/4)/5 = 2.05. When comparing with the "+4 strategy", z = 6 (minimum between 6 and 9) and the combined similarity is: √5/6 + (5+2/3)/6 + (6/4)/6 + (10/4+1/3)/6 = 2.04. Thus, in this case the learner will be guided towards the "I strategy".

The second example (Figure 4b) has 3 cases in common with two strategies: the "spiral strategy" (C1, C3, C4) and the "H strategy" (C1, C2, C5). When comparing it with the "spiral strategy" as well as the "H strategy", z = 5 (minimum between 5 and 5), and the combined similarity is: √2/5 + (4+2/3)/5 + (8/4)/5 + (10/4)/5 = 2.12. In this situation, when the learner's construction is equally similar to two strategies, the following options could be offered: (a) present the learner with the two options and let him/her choose one of the two (an approach that appears more suitable for advanced learners than novices); (b) automatically suggest one of the two in a systematic way, e.g. present the one that occurs more/less often with other learners; (c) inform the teacher about the learner's trajectory and the frequency of strategies and let him/her decide between the two.
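The arithmetic above is easy to check mechanically. The snippet below recomputes the three combined scores; the per-measure sums are read directly off the worked examples in the text, and the tuple layout is our own.

```python
import math

# (sum of squared numeric distances, sum of V, sum of P, sum of T, z)
comparisons = {
    "example 1 vs I strategy":  (1, 5,         7 / 4, 10 / 4,         5),
    "example 1 vs +4 strategy": (5, 5 + 2 / 3, 6 / 4, 10 / 4 + 1 / 3, 6),
    "example 2 vs spiral/H":    (2, 4 + 2 / 3, 8 / 4, 10 / 4,         5),
}

for name, (d2, v, p, t, z) in comparisons.items():
    combined = (math.sqrt(d2) + v + p + t) / z
    print(f"{name}: {combined:.2f}")
# -> 2.05, 2.04 and 2.12, matching the values quoted in the text.
```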
Figure 4. Non-symmetrical strategies: (a) combination of "I" and "+4" strategies; (b) combination of "spiral" and "H" strategies.

Exploiting context. In the pond tiling task, when the learner is constructing a specific (as opposed to general) tiling of the pond, the value relation attribute is more relevant, while when dealing with a general tiling, the dependency relation attribute is more important. Local feature weighting in this case involves two trained neural networks, one for each context, and applying the weights delivered by the NNs before the matching process.
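A minimal sketch of this step, under the assumption that each context's trained network has already been reduced to a weight vector; the two-context lookup, the numeric values and all names are illustrative only.

```python
# Hypothetical context-specific weights, e.g. as delivered by the trained
# per-context networks of [23]; the values here are invented.
CONTEXT_WEIGHTS = {
    "specific": {"value_rel": 0.8, "dependency_rel": 0.2},
    "general":  {"value_rel": 0.2, "dependency_rel": 0.8},
}

def weighted_match(inp, ret, context, feature_similarity):
    """Apply the current context's weights before matching two cases."""
    weights = CONTEXT_WEIGHTS[context]
    return sum(w * feature_similarity(inp, ret, feature)
               for feature, w in weights.items())
```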
Learning trajectories. Let's consider the example in Figure 4b and a comparison after C3. The current trajectory includes the creation of 3 rectangles corresponding to C1, C2 and C3; this trajectory is considered to be far from the desired outcome (surrounding the pond), and thus the future does not need to be evaluated accurately. At this point, with respect to the past, two strategies partially match the learner's current trajectory: the "I" (C1, C2) and "spiral" (C1, C3) strategies; the learner could be left to continue with his/her model construction without intervention. With respect to the future, the desired outcome can be obtained by following one of the two strategies previously identified.

If the comparison takes place after C5, the trajectory would include the creation of 5 rectangles (C1 to C5) and thus it can be concluded that the learner has reached the desired outcome of surrounding the pond. However, in this process the learner did not use any of the desirable strategies, i.e. any of the six strategies presented in Figure 3. At this point in time two trajectories match the past and indicate the future, as before, but now it might be considered pedagogically important to intervene and guide the learner towards a trajectory that corresponds to one of the two identified desirable strategies.

5 CONCLUSIONS

In this paper a learner modelling process involving a combination of intelligent methods was presented for the domain of mathematical generalisation. Case-based reasoning is used in combination with soft computing (fuzzy sets and neural networks) in order to process the models that the learners construct and thus be able to provide feedback while learners work on the task.

Further work includes expanding the conceptual framework by defining a strength as the maximum combined similarity measure (similarity of the past and similarity of the future at a particular distance) for various evaluated trajectories, and a reliability index that will reflect the extent to which the similarities can be relied upon to provide the right support.
REFERENCES

[1] V. Aleven, 'Using background knowledge in case-based legal reasoning: A computational model and an intelligent learning environment', Artificial Intelligence, 150(1-2), 183–237, (2003).
[2] J.E. Beck and B.P. Woolf, 'Using a Learning Agent with a Student Model', in Intelligent Tutoring Systems, eds., B.P. Goettl, H.M. Halff, C.L. Redfield, and V.J. Shute, volume 1452 of Lecture Notes in Computer Science, pp. 6–15. Springer, (1998).
[3] P. Brusilovsky, A.T. Corbett, and F. de Rosis, eds., User Modeling 2003, 9th International Conference, UM 2003, Johnstown, PA, USA, June 22-26, 2003, Proceedings, volume 2702 of Lecture Notes in Computer Science. Springer, 2003.
[4] A. Bunt and C. Conati, 'Probabilistic Student Modelling to Improve Exploratory Behaviour', User Modelling and User-Adaptive Interaction, 13(3), 269–309, (2003).
[5] S-H. Chun and Y-J. Park, 'Dynamic adaptive ensemble case-based reasoning: application to stock market prediction', Expert Systems with Applications, 28(3), 435–443, (2005).
[6] C. Conati, A.S. Gertner, and K. VanLehn, 'Using Bayesian Networks to Manage Uncertainty in Student Modeling', User Modelling and User-Adaptive Interaction, 12(4), 371–417, (2002).
[7] G. Ghinea, G.D. Magoulas, and C. Siamitros, 'Multicriteria decision making for enhanced perception-based multimedia communication', IEEE Transactions on Systems, Man, and Cybernetics, Part A, 35(6), 855–866, (2005).
[8] S. Gutiérrez, M. Mavrikis, and D. Pierce, 'A Learning Environment for Promoting Structured Algebraic Thinking in Children', in Proceedings of the 8th IEEE International Conference on Advanced Learning Technologies, (2008). Forthcoming.
[9] S. Gutiérrez, D. Pierce, E. Geraniou, and M. Mavrikis, 'Supporting Reasoning and Problem-Solving in Mathematical Generalisation with Dependency Graphs', in Proceedings of the 5th International Conference on the Theory and Application of Diagrams, (2008). Forthcoming.
[10] S-G. Han, S-G. Lee, and S. Jo, 'Case-based tutoring systems for procedural problem solving on the www', Expert Systems with Applications, 29(3), 573–582, (2005).
[11] I. Hatzilygeroudis and J. Prentzas, 'Using a hybrid rule-based approach in developing an intelligent tutoring system with knowledge acquisition and update capabilities', Expert Systems with Applications, 26(4), 477–492, (2004).
[12] L. Healy, R. Hoelzl, C. Hoyles, and R. Noss, 'Messing Up', Micromath, 10(1), 14–16, (1994).
[13] M-J. Huang, H-S. Huang, and M-Y. Chen, 'Constructing a personalized e-learning system based on genetic algorithm and case-based reasoning approach', Expert Systems with Applications, 33(3), 551–564, (2007).
[14] C.D. Hulshof, T.H.S. Eysink, S. Loyens, and T. de Jong, 'Zaps: Using interactive programs for learning psychology', Interactive Learning Environments, 13, 39–53, (2005).
[15] J. Kaput, Technology and Mathematics education, 515–556, D. Grouws (ed.), Handbook of Research on Mathematics Teaching and Learning, New York: Macmillan, 1992.
[16] P. Kirschner, J. Sweller, and R.E. Clark, 'Why minimal guidance during instruction does not work: An analysis of the failure of constructivist, discovery, problem-based, experiential and inquiry-based teaching', Educational Psychologist, 41(2), 75–86, (2006).
[17] J.L. Kolodner, Case-Based Reasoning, Morgan Kaufmann Publishers, Inc., 2nd edn., 1993.
[18] C. Koutsojannis, G. Beligiannis, I. Hatzilygeroudis, C. Papavlasopoulos, and J. Prentzas, 'Using a hybrid ai approach for exercise difficulty level adaptation', International Journal of Continuing Engineering Education and Life-Long Learning, 17(4-5), 256–272, (2007).
[19] P. Kumar, S. Gopalan, and V. Sridhar, 'Context enabled multi-CBR based recommendation engine for e-commerce', in Proceedings of the IEEE International Conference on e-Business Engineering (ICEBE), pp. 237–244. IEEE Press, (2005).
[20] J. Mason, Generalisation and algebra: Exploiting children's powers, 105–120, L. Haggarty (ed.), Aspects of Teaching Secondary Mathematics: Perspectives on Practice, Routledge Falmer and the Open University, 2002.
[21] R. Mayer, 'Should There Be a Three-Strikes Rule Against Pure Discovery Learning? The Case for Guided Methods of Instruction', American Psychologist, 59(1), 14–19, (2004).
[22] R. Morales Gamboa, Exploring Participative Learner Modelling and Its Effects on Learner Behaviour, PhD thesis, Univ. of Edinburgh, 2000.
[23] J.H. Park, K.H. Im, C-K. Shin, and S-C. Park, 'MBNR: Case-Based Reasoning with Local Feature Weighting by Neural Network', Applied Intelligence, 21(3), 265–276, (2004).
[24] C. Romero, S. Ventura, P. de Bra, and C. de Castro, 'Discovering Prediction Rules in AHA! Courses', in Brusilovsky et al. [3], pp. 25–34.
[25] Y. Seo, D. Sheen, and T. Kim, 'Block assembly planning in shipbuilding using case-based reasoning', Expert Systems with Applications, 32(1), 245–253, (2007).
[26] R. Stathacopoulou, M. Grigoriadou, G.D. Magoulas, and D. Mitropoulos, 'A Neuro-fuzzy Approach in Student Modeling', in Brusilovsky et al. [3], pp. 337–341.
[27] R.H. Stottler and S. Ramachandran, 'A Case-Based Reasoning Approach to Internet Intelligent Tutoring Systems (ITS) and ITS Authoring', in Proceedings of the Twelfth International Florida Artificial Intelligence Research Society Conference, pp. 181–186. AAAI Press, (1999).
[28] J. Swaak, W.R. van Joolingen, and T. de Jong, 'Supporting simulation-based learning; the effects of model progression and assignments on definitional and intuitive knowledge', Learning and Instruction, 8, 235–253, (1998).
[29] W.R. van Joolingen, 'Cognitive tools for discovery learning', International Journal of Artificial Intelligence and Education, 10, 385–397, (1999).
[30] R.R. Yager, 'Soft Aggregation Methods in Case Based Reasoning', Applied Intelligence, 21(3), 277–288, (2004).
Belief Propagation in Fuzzy Bayesian Networks
Christopher Fogelberg1 and Vasile Palade and Phil Assheton2
Abstract. Fuzzy Bayesian networks are a generalisation of classic Bayesian networks to networks with fuzzy variable state. This paper describes our formalisation and outlines how belief propagation can be conducted. Fuzzy techniques can lead to more robust inference. A key advantage of our formalisation is that it can take advantage of all existing network inference and Bayesian network algorithms. Another key advantage is that we have developed several techniques to control the algorithmic complexity. When these techniques can be applied it means that fuzzy Bayesian networks are only a small linear factor less efficient than classic Bayesian networks. With appropriate pre-processing they may be substantially more efficient.

1 Introduction

Modern machine learning research frequently uses Bayesian networks (BNs) [6; 7; 15; 16]. However, BN inference is NP-complete due to cycles in the undirected graph [4], and belief propagation is exponential in the tree-width of the network. This makes them difficult to use for large problems.

Fuzzy [3] and hybrid fuzzy systems [11; 13] are also frequently used. In a fuzzy system, a variable's state is represented by a set of fuzzy values (FVs). Because fuzzy systems do not force a model to artificially discretise a continuous underlying state they are often more robust in the face of noise.

To date there has been very little research into BNs with fuzzy variable states. What there is has centred around the use of fuzzy approximations to perform inference and belief propagation in a hybrid BN [1; 12]. A hybrid BN is one where the parameters are a mix of continuous and multinomial distributions.

This paper's key contribution is a formal generalisation of classic BNs to fuzzy Bayesian networks (FBNs). In a FBN the variables can have fuzzy states. The paper also describes tractable belief propagation over FBNs. An important advantage of the presented formalisation is that existing inference algorithms (e.g. MCMC, simulated annealing) can be used without modification. Furthermore, FBNs may be only a small linear constant less efficient than classic BNs of the same size, and appropriate pre-processing may make some problems which were intractable for classic BNs tractable for FBNs.

The paper is structured as follows. Section 2 presents a fuzzy Bayesian network which will be used as an example. Section 3 introduces some notation (subsection 3.1), and presents belief propagation for variables with one parent (subsection 3.3) and for variables with multiple parents (subsection 3.4). Section 4 analyses the algorithmic efficiency of FBNs and how it can be controlled. Section 5 outlines an important bioinformatic domain where FBNs may be especially useful, and section 6 concludes the paper.

¹ Supported by the ACU and CSCUK
² Computing Laboratory, University of Oxford; Contact email: christopher.fogelberg@comlab.ox.ac.uk

2 A Fuzzy Bayesian Network

The structure of the FBN that is used as an example in this paper is shown in figure 1. Call this FBN G = ⟨η, θ⟩, where η denotes the structure of G and θ its parameters.

For clarity of presentation, G is a multinomial (discrete) BN. However, the formalisation generalises easily and transparently to continuously-valued FBNs and hybrid FBNs.

The relevant conditional distributions of G are shown in figure 2. D's distribution is not shown and we will later assume a state for D with no loss of generality.

Because we have restricted the differences between BNs and FBNs to belief propagation, the specifications of a FBN and a BN are identical.

3 Belief Propagation

Belief propagation in a Bayesian network involves calculating the updated probability distributions of variables in the network, given θ and the observed states of other variables.

3.1 Some Notation

The terminology and notation are as follows. A variable has a state, either a fuzzy state (FS) or a discrete state (DS).

A fuzzy state is made up of one or more components, and each component is annotated with the variable's degree of membership (µ) in that component. For example, equation 1 is an example of a variable (S) with two components. It has membership 0.7 in the component hi and membership 0.3 in the component mid. hi and mid are examples of values that the variable can take. When annotated with µ they are referred to as fuzzy values (FV). The set of all possible values (fuzzy values) that a variable can take is the range of that variable, e.g. hi, mid, lo.

S = [hi_0.7, mid_0.3]   (1)

In general, the components of a variable's state are enclosed in square brackets. A discrete state is just a special case of a fuzzy state. Discrete states have just one component with µ = 1, and the square brackets and µ subscript can be omitted in this situation. We assume that Σ_{c∈C} µ_c = 1 for a FS with C components and have not considered other situations.
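A minimal Python rendering of this notation may help to fix ideas; the class name and layout are our own, not part of the formalisation.

```python
from dataclasses import dataclass

@dataclass
class Component:
    mu: float      # degree of membership of the state in this component
    value: object  # either a crisp value such as "hi", or a dict mapping
                   # each value in the range to a probability (an FPD,
                   # introduced below)

# S = [hi_0.7, mid_0.3]: a fuzzy state with two fuzzy-value components.
S = [Component(0.7, "hi"), Component(0.3, "mid")]

# A discrete state is the special case of a single component with mu = 1.
D = [Component(1.0, "hi")]

assert abs(sum(c.mu for c in S) - 1.0) < 1e-9  # the total-membership assumption
```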
Figure 1. The fuzzy Bayesian network G, used as an example in this paper. (G contains the variables A, B, C, D and E; A is the parent of B, B is the parent of C, and C and D are the joint parents of E.)

Figure 2. θ for G: the conditional distributions of A, B, C and E.

(a) θ_A, A's prior distribution:

          A = lo   A = mid  A = hi
          0.7      0.1      0.2

(b) θ_B, B's conditional distribution:

A → B     B = lo   B = mid  B = hi
A = lo    0.6      0.2      0.2
A = mid   0.1      0.1      0.8
A = hi    0.1      0.2      0.7

(c) θ_C, C's conditional distribution:

B → C     C = lo   C = mid  C = hi
B = lo    0.1      0.1      0.8
B = mid   0.1      0.8      0.1
B = hi    0.7      0.2      0.1

(d) θ_E, E's conditional distribution:

C, D → E           E = lo   E = mid  E = hi
C = lo,  D = lo    0.6      0.2      0.2
C = lo,  D = mid   0.1      0.1      0.8
C = lo,  D = hi    0.1      0.1      0.8
C = mid, D = lo    0.6      0.2      0.2
C = mid, D = mid   0.1      0.6      0.3
C = mid, D = hi    0.1      0.6      0.3
C = hi,  D = lo    0.1      0.2      0.7
C = hi,  D = mid   0.1      0.2      0.7
C = hi,  D = hi    0.8      0.1      0.1

Just as a component can be a value from the range of a variable, e.g. hi, a component can also be a probability distribution (PD). PDs are denoted with curly brackets, e.g. {hi_0.3, mid_0.2, lo_0.5}. Because the value associated with each subscripted probability is implicit in the tuple order, the value names can be omitted: {0.3, 0.2, 0.5}. An example of a fuzzy state which mixes values and probability distributions is shown in equation 2.

T = [{0.2, 0.1, 0.7}_0.2, {0.1, 0.8, 0.1}_0.6, mid_0.2]   (2)

A PD which is annotated (subscripted) with µ is called a fuzzy probability distribution (FPD). Samples are drawn from a FPD in the same way that they are drawn from a PD. However, a variable with membership µ in a FPD can only have µ proportion of its state determined by that FPD; a sample from a FPD will have the same µ as the FPD does. For example, a sample from {0.2, 0.1, 0.7}_0.2 will be one of lo_0.2, mid_0.2 or hi_0.2, and each of these components will be drawn with probability 0.2, 0.1 and 0.7 respectively.

This means that the state of a sample from the uncertain variable T (equation 2) will be some member of the set [lo_[0..0.8], mid_[0.2..1], hi_[0..0.8]], and the distribution over members of this set is determined by the two FPDs and one FV which make up the fuzzy state of T.
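The sampling behaviour just described is straightforward to simulate. The sketch below is our own illustration; it additionally merges equal-valued samples, which matches the membership ranges quoted for T.

```python
import random

RANGE = ["lo", "mid", "hi"]
# T = [{0.2, 0.1, 0.7}_0.2, {0.1, 0.8, 0.1}_0.6, mid_0.2] (equation 2)
T = [({"lo": 0.2, "mid": 0.1, "hi": 0.7}, 0.2),
     ({"lo": 0.1, "mid": 0.8, "hi": 0.1}, 0.6),
     ("mid", 0.2)]

def sample_state(state):
    """Draw one sample: each FPD contributes a fuzzy value carrying the
    FPD's own mu; memberships of equal values are then accumulated."""
    totals = {}
    for component, mu in state:
        if isinstance(component, dict):   # an FPD: draw one value from it
            value = random.choices(RANGE, weights=[component[v] for v in RANGE])[0]
        else:                             # already a crisp fuzzy value
            value = component
        totals[value] = totals.get(value, 0.0) + mu
    return totals

print(sample_state(T))  # e.g. {'hi': 0.2, 'mid': 0.8}; mid always gets >= 0.2
```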
3.2 Assumptions

The full and general analysis of FBNs would also consider unrestricted interactions amongst components in a fuzzy state, allowing Σµ ≠ 1 and so forth. In this article we make a number of linearising assumptions which make FBN belief propagation cheap, relative to the cost of full general propagation. They also greatly aid the clarity of the presentation in the space available. Furthermore, these assumptions are reasonable and do not restrict the general utility of FBNs. However, it is important to make them explicit. The assumptions (and consequently the nature of full general propagation) will be briefly summarised in this section; a more general discussion is forthcoming.

3.2.1 Assumption: Total Membership

As noted above and in subsection 3.1, we assume that Σµ = 1. This is our first linearising assumption, and it can be conceptualised as follows. A variable's degree of membership in each of its |C| components forms a |C|-dimensional fuzzy state space (FSS). If the variable has no uncertainty (no component is an FPD) then |C| will be the same as the variable's range. Even if a FS has 0 membership in some of its range, those values are still part of the state's FSS. By assuming that Σ_{c∈C} µ_c = 1, we restrict our attention to a smaller (|C| − 1)-dimensional subspace. This subspace constrains the degrees of membership in each component in a cyclically conditional way on the degrees of membership in the other components. This assumption simplifies the combination of components of fuzzy states in subsection 3.4.

For example, if a FS has membership 0.5 in the value hi, its memberships in the values lo and mid in the FSS are constrained to be in the range [0, 0.5]. Furthermore, its memberships in lo and mid mutually constrain (in this case define, as
there are no other values in the range) each other. Figure 3 illustrates the impact of the first assumption on the FSS for a fuzzy state with two components.

Figure 3. Imagine a fuzzy state with two components, A and B. Such a state would have a two-dimensional FSS, as in this figure (membership in B on the horizontal axis, membership in A on the vertical axis). Without restriction, the state's degree of membership in each component could be specified by any point in the FSS. However, we assume that Σ_{c∈C} µ_c = 1. Therefore its membership in components A and B must be specified by some point on the dashed line.

3.2.2 Assumption: Component Independence

We also make two further assumptions about FBNs during belief propagation. The first is that components are independent. This means that when a variable has only one parent then its state will have one component for each component its parent has, and these components will have the same µ as the corresponding parent's component. For example, the children which have S (equation 1) as their only parent will have two components in their FS, one with µ = 0.7 and one with µ = 0.3.

When a variable has more than one parent then the components of the parents will be mixed and combined before propagation. This is described in subsection 3.4. Because we assume independence, the child's fuzzy state will have one component for each component in the mixed and combined parent set, and each of the child's components will have the same µ as the corresponding component in the parent set. It will also be clear that we assume independence when we describe how the parents' components are mixed and combined in subsection 3.4.

3.2.3 Assumption: FPD Samples

The third assumption implicit in this model is the assumption that a sample from an FPD with µ = x will be a single fuzzy component with µ = x. Although natural and intuitive, we do not believe that this is automatically entailed, thus we explicitly assume it.

Consider an uncertain variable with a range of r in a standard (discrete) Bayesian network. Its state will be a probability distribution which specifies a single point in the r-dimensional probability space (p-space).

Now consider this variable in a FBN. Any uncertainty in its state will be represented by an FPD component with some µ. As described in subsection 3.1, a FPD is just a PD with a fixed (0-dimensional) µ associated with it.

This definition of a FPD could be generalised so that µ could vary independently but was fixed for each of the r possible samples that could be drawn from the FPD. Call this a slightly general FPD (SGFPD) and call the FPD defined in subsection 3.1 a standard FPD. An SGFPD would specify a single point in an r + r dimensional space, where r of the dimensions are the probability of each value and the other r dimensions specify the µ of a sample of each value. Just as the first r dimensions specify a p-space, the second r dimensions specify a µ-space. An example of such a space is given diagrammatically in figure 4, and it is used to contrast a SGFPD with a standard FPD.

An example of an SGFPD might be "there is a 0.2 probability of drawing a sample of hi, and any sample of hi will have µ = 0.3, and there is a 0.3 probability of drawing a sample of mid, and any sample of mid will have µ = 0.5, and ..." and so forth. The assumption that Σ_{c∈C} µ_c = 1 for a state could be relaxed if SGFPDs were used.

After considering figure 4 it will be clear that SGFPDs could be further generalised so that the µ of any sample also varied probabilistically, conditional on the value (hi, mid, etc.) of the sample. Such a general FPD (GFPD) would be an r + r dimensional probability distribution over the joint µ and range of the variable. We believe that this represents the most general kind of inference and belief propagation in a FBN. Such inference is intractable and we do not consider it in this paper.

Figure 4. Assume a variable with a range (r) of 2 (A and B). Each point in the 2-dimensional p-space (left hand side) could be considered an index into a 2-dimensional µ-space (right hand side), as diagrammed. All proper probability distributions (Σp = 1) fall on the dot-dashed line in the p-space. The µ-space that is indexed by some point in the p-space could be unique to that point. In a standard FPD, the µ-space is reduced to a single point on the dashed line and that point on the line in the µ-space is specified precisely by the µ of the FPD. Because any sample from a standard FPD, regardless of its value, will have the same µ, r of the dimensions are eliminated. In addition, we assume that an FPD has a proper probability distribution. In total, these reduce the dimensionality from r + r to r − 1. In a SGFPD the µ-space would also be reduced to a single point, but any point in the µ-space would be a valid reduction, so an SGFPD with a proper distribution over the values still has r + r − 1 dimensions.

In summary, the assumptions which we have made substantially reduce the dimensionality of belief propagation and are necessary for it to be tractable. However, more general FBNs with GFPDs do not have these restrictions; their utility will be considered in a forthcoming publication.

3.3 Single-Parent Belief Propagation

Assume that observations indicate A = [mid_0.2, hi_0.8] in G. With this information we can calculate the updated distributions on B and C.

Because A has an observed (certain) FS and is B's only parent, the components of B's updated FS can be read from θ_B. This shows that:

B = [{0.1, 0.1, 0.8}_0.2, {0.1, 0.2, 0.7}_0.8]   (3)

The FS over C is calculated similarly. Just as each of the fuzzy values in A leads to a weighted FPD in the FS of B, the same occurs for C, and C = [α_0.2, β_0.8]. The weighted distributions α and β are calculated using standard BN belief propagation, based on the conditional distribution of B. This is shown in equations 4, 5 and 6.

p(C|B = lo)  = {0.1, 0.1, 0.8}
p(C|B = mid) = {0.1, 0.8, 0.1}
p(C|B = hi)  = {0.7, 0.2, 0.1}   (4)

α = {0.1, 0.1, 0.8} × 0.1 + {0.1, 0.8, 0.1} × 0.1 + {0.7, 0.2, 0.1} × 0.8
  = {0.01, 0.01, 0.08} + {0.01, 0.08, 0.01} + {0.56, 0.16, 0.08}
  = {0.58, 0.25, 0.17}   (5)
β = {0.1, 0.1, 0.8} × 0.1 + {0.1, 0.8, 0.1} × 0.2 + {0.7, 0.2, 0.1} × 0.7
  = {0.01, 0.01, 0.08} + {0.02, 0.16, 0.02} + {0.49, 0.14, 0.07}
  = {0.52, 0.31, 0.17}   (6)

The calculated FS for C is shown in equation 7.

C = [{0.58, 0.25, 0.17}_0.2, {0.52, 0.31, 0.17}_0.8]   (7)
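Equations 3–7 can be reproduced with a few lines of code. The sketch below is our own illustration of the single-parent update, with θ_B and θ_C hard-coded from figure 2.

```python
THETA_B = {"lo": [0.6, 0.2, 0.2], "mid": [0.1, 0.1, 0.8], "hi": [0.1, 0.2, 0.7]}
THETA_C = {"lo": [0.1, 0.1, 0.8], "mid": [0.1, 0.8, 0.1], "hi": [0.7, 0.2, 0.1]}
VALUES = ["lo", "mid", "hi"]

def mix(cpt, dist):
    """Standard BN update: sum the parent-conditional rows of `cpt`,
    weighted by the parent distribution (as in equations 5 and 6)."""
    return [sum(dist[i] * cpt[v][k] for i, v in enumerate(VALUES))
            for k in range(3)]

# A = [mid_0.2, hi_0.8]: each observed fuzzy value of A selects a row of
# theta_B and keeps its mu (equation 3).
B = [(THETA_B["mid"], 0.2), (THETA_B["hi"], 0.8)]

# Each FPD component of B is pushed through theta_C with the same mu.
C = [(mix(THETA_C, dist), mu) for dist, mu in B]
print(C)  # [([0.58, 0.25, 0.17], 0.2), ([0.52, 0.31, 0.17], 0.8)],
          # up to floating-point rounding -- equations 5-7.
```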
3.4 Multi-Parent Belief Propagation

Subsection 3.3 illustrated belief propagation in a FBN when a variable has only one parent. This subsection shows naive FBN belief propagation in the case of a variable with multiple parents. Section 4 outlines several more nuanced approaches which address the problems with naive propagation.

Take the calculated value of C, and assume a fuzzy state for D (equation 8). What is the updated fuzzy state of E?

C = [{0.58, 0.25, 0.17}_0.2, {0.52, 0.31, 0.17}_0.8]
D = [{0.45, 0.30, 0.25}_0.3, {0.1, 0.8, 0.1}_0.7]   (8)

Any combination of components, one from each parent, can be used to calculate an updated probability distribution for a variable. However, this raises the question of how to combine and weight each combination of component distributions in the parent FSs to calculate an updated FS for the child.
Because the parents are conditionally independent given
the variable being updated3 , any particular combination of 4 Dealing with Complexity
PD and observations can be summed over, as one was in each There are several ways that the explosion in the complex-
of equations 5 and 6. The summed over combinations became ity can be controlled by approximating the FS. This sec-
components of C’s updated distribution. tion discusses four kinds of control. The bimodal fuzzy state
3 And also given their updated state and the acyclic nature of the X = [{0.9α , 0.1β }0.5 , {0.1α , 0.9β }0.5 ] is used as an example in
graph several places in this section. Such a variable could represent
22
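Continuing the running example, naive multi-parent propagation is a Cartesian product over parent components. This sketch (ours, with θ_E hard-coded from figure 2) reproduces α from equation 9 and the four µ values of equation 10.

```python
from itertools import product

# theta_E rows indexed by (C value, D value), from figure 2(d).
THETA_E = {
    ("lo", "lo"):   [0.6, 0.2, 0.2], ("lo", "mid"):  [0.1, 0.1, 0.8],
    ("lo", "hi"):   [0.1, 0.1, 0.8], ("mid", "lo"):  [0.6, 0.2, 0.2],
    ("mid", "mid"): [0.1, 0.6, 0.3], ("mid", "hi"):  [0.1, 0.6, 0.3],
    ("hi", "lo"):   [0.1, 0.2, 0.7], ("hi", "mid"):  [0.1, 0.2, 0.7],
    ("hi", "hi"):   [0.8, 0.1, 0.1],
}
VALUES = ["lo", "mid", "hi"]

C = [([0.58, 0.25, 0.17], 0.2), ([0.52, 0.31, 0.17], 0.8)]
D = [([0.45, 0.30, 0.25], 0.3), ([0.10, 0.80, 0.10], 0.7)]

E = []
for (pc, mu_c), (pd_, mu_d) in product(C, D):
    # Sum theta_E over both parents, weighted by their distributions.
    dist = [sum(pc[i] * pd_[j] * THETA_E[(ci, dj)][k]
                for i, ci in enumerate(VALUES)
                for j, dj in enumerate(VALUES))
            for k in range(3)]
    # Product t-norm: the combination's mu is the product of the parents' mus.
    E.append((dist, mu_c * mu_d))

for dist, mu in E:
    print([round(p, 4) for p in dist], round(mu, 2))
# First line: [0.3165, 0.2189, 0.4647] 0.06, as in equation 9; the four
# mus are 0.06, 0.14, 0.24 and 0.56, as in equation 10.
```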
4 Dealing with Complexity

There are several ways that the explosion in the complexity can be controlled by approximating the FS. This section discusses four kinds of control. The bimodal fuzzy state X = [{0.9_α, 0.1_β}_0.5, {0.1_α, 0.9_β}_0.5] is used as an example in several places in this section. Such a variable could represent a committee of two in which the committee members (components) hold diametrically opposite beliefs about the outcome of some future event.
4.1 Linear Collapse

A first approximation that addresses the FSSE is to linearly collapse a FS that is made up only of FPDs, immediately after they are calculated. Each component can be weighted by its fuzzy membership and they can be summed to calculate a single, discrete, PD. For example, B (equation 3) can be collapsed as shown in equation 11. Collapsed FSs are denoted with a prime.

B = [{0.1, 0.1, 0.8}_0.2, {0.1, 0.2, 0.7}_0.8]
∴ B′ = {0.1, 0.18, 0.72}   (11)

However, this approximation is unsatisfactory: it conflates probability with fuzziness and may change the expected value of the variable. Although it may be approximately correct in some circumstances, a simple thought experiment will show why it is insufficient.

Consider the bimodal FS X. The expected sample from X is X′ = [α_0.5, β_0.5]. Although this sample does not reflect any of the uncertainty in X, it does reflect the bi-modality (indecision) of the variable (committee) as a whole. Subsection 4.4 returns to this approach.

If X is linearly collapsed though then X′ = {0.5, 0.5}. No sample drawn from this PD can be half α and half β. Important information in X has been lost. Although further belief propagation will not be biased if this variable is summed over (due to the use of the product t-norm), there is no way to compare the linearly collapsed value X′ with any observed value for X when trying to evaluate the quality of an inferred network.

Other approximations to the naive approach have been developed. They are discussed in the next three subsections.
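Linear collapse itself is a one-line weighted sum; the snippet below (our illustration) reproduces equation 11 and also shows the information loss on the bimodal state X discussed above.

```python
def linear_collapse(state):
    """Collapse a FS made up of FPDs into one discrete PD by weighting
    each component distribution by its membership and summing."""
    n = len(state[0][0])
    return [sum(mu * dist[k] for dist, mu in state) for k in range(n)]

B = [([0.1, 0.1, 0.8], 0.2), ([0.1, 0.2, 0.7], 0.8)]
print(linear_collapse(B))   # [0.1, 0.18, 0.72], as in equation 11

# The bimodal committee X: collapsing hides the bi-modality entirely.
X = [([0.9, 0.1], 0.5), ([0.1, 0.9], 0.5)]
print(linear_collapse(X))   # [0.5, 0.5] -- no sample can be half a, half b
```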
4.2 Strict and Dynamic Top Fuzzy Combinations

Consider again the full (naive) FS of E, reproduced in equation 12.

E = [α_0.06, β_0.14, γ_0.24, δ_0.56]   (12)

Some of the components barely contribute to the overall state and will not have a substantial influence on any children either. Such components could be ignored, and the remaining components could have their µ normalised. For example, if just the top three components of E were used then the updated FS would take the form:

E = [β_0.149, γ_0.255, δ_0.596]   (13)
The number of components retained could be either k-component strict selected or φ-dynamically selected. In the former case, the top k components would be selected. In the latter, the |C| components with greatest µ would be selected so that Σ_{c∈C} µ_c > φ. Strict selection would mean that FBNs were only a small linear factor less efficient than classic BNs of the same size. However, the top k components may not be an accurate reflection of the full FS, thus φ-dynamic selection may be more appropriate in some cases.
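Both selection schemes are easy to state in code. This sketch (ours) reproduces equation 13 for k = 3 and shows the φ-dynamic variant.

```python
def strict_select(state, k):
    """Keep the k components with greatest mu and renormalise."""
    top = sorted(state, key=lambda c: c[1], reverse=True)[:k]
    total = sum(mu for _, mu in top)
    return [(dist, mu / total) for dist, mu in top]

def dynamic_select(state, phi):
    """Keep the fewest top components whose memberships sum past phi,
    then renormalise."""
    kept, total = [], 0.0
    for dist, mu in sorted(state, key=lambda c: c[1], reverse=True):
        kept.append((dist, mu))
        total += mu
        if total > phi:
            break
    return [(dist, mu / total) for dist, mu in kept]

E = [("alpha", 0.06), ("beta", 0.14), ("gamma", 0.24), ("delta", 0.56)]
print(strict_select(E, 3))    # beta/gamma/delta with mu 0.149, 0.255, 0.596
print(dynamic_select(E, 0.9)) # same three components for phi = 0.9
```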
4.3 Clustering the Fuzziness

Another way of controlling the FSSE is to calculate the full FS of each variable during belief propagation. However, before using the full FS to update the state of its children, its components could be clustered so that FPDs which specified similar distributions were combined together. For example, the FS [..., {0.7, 0.2, 0.1}_0.3, {0.6, 0.3, 0.1}_0.2, ...] might cluster to [..., {0.66, 0.24, 0.1}_0.5, ...].

Because the clustering problem would only have as many dimensions as the range of each FPD, we speculate that a simple fixed-k clustering algorithm like k-means would work very well.

Although this approach is more complex than selection or linear collapse, the total increase in complexity in belief propagation would be related to and bound by the maximum in-degree and range of a variable.
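The merge rule implied by the example above is a µ-weighted average of the member distributions, with the memberships summed. A minimal sketch of that merge (the clustering driver itself, e.g. k-means over the distributions, is left out):

```python
def merge_components(components):
    """Combine similar FPDs: mu-weighted mean distribution, summed mu."""
    total_mu = sum(mu for _, mu in components)
    n = len(components[0][0])
    dist = [sum(mu * d[k] for d, mu in components) / total_mu for k in range(n)]
    return dist, total_mu

pair = [([0.7, 0.2, 0.1], 0.3), ([0.6, 0.3, 0.1], 0.2)]
print(merge_components(pair))   # ([0.66, 0.24, 0.1], 0.5), as in the text
```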
4.4 Expected Values

A fourth kind of control is inspired by particle filtering and the Condensation algorithm [9]. The general sequential Monte Carlo (SMC) method will be outlined first. Although this approach is not as efficient as others, it is applicable in all cases and is strictly correct.

Consider X. An infinite sequence of independent samples drawn from this uncertain fuzzy state will take something like the form [α_0.5, β_0.5], [α_0.5, β_0.5], ..., [α_1], [α_0.5, β_0.5], ... and so forth.

The properties of this sequence are identical to those of the fuzzy state, and a long-enough finite sequence will be a good approximation to it. For example, 100 samples could be drawn from X. Each of these samples could then be used to propagate the uncertain state of X to X's children. The relative efficiency of this technique compared to clustering depends on the range and k_max of the variables, but in certain situations it may also be better.

As noted in subsection 4.1, the expected value of a variable is easily calculated analytically. For example, the expected value of X is [α_{0.5×0.9+0.5×0.1}, β_{0.5×0.1+0.5×0.9}] = [α_0.5, β_0.5]. Doing this expectation calculation is analogous to summing over or numerically integrating a probability distribution, and we call it fuzzy integration. Like clustering, the impact on efficiency of fuzzy integration depends on the range and k_max.

This approach is very similar to linear collapse but it has a number of key advantages. Firstly, like linear collapse, it does not bias any further belief propagation. This is untrue of selection and clustering. Secondly, the expected value X′ which is the result of this fuzzy integration can be meaningfully compared with observed values of X when performing network inference. In many cases, users are only interested in the expected (integrated) value. In these cases the expected value of a FS is ideal.
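Fuzzy integration on X is a direct expectation over the component FPDs, and the sampling route approximates the same quantity. A small sketch of both (our illustration, merging equal-valued samples as in subsection 3.1):

```python
import random
from collections import Counter

# X = [{0.9_a, 0.1_b}_0.5, {0.1_a, 0.9_b}_0.5]
X = [({"a": 0.9, "b": 0.1}, 0.5), ({"a": 0.1, "b": 0.9}, 0.5)]

def fuzzy_integrate(state):
    """Expected membership of each value: sum over FPDs of mu * p(value)."""
    out = {}
    for dist, mu in state:
        for value, p in dist.items():
            out[value] = out.get(value, 0.0) + mu * p
    return out

def sample(state):
    """One SMC-style sample: each FPD emits one value with the FPD's mu."""
    out = Counter()
    for dist, mu in state:
        value = random.choices(list(dist), weights=list(dist.values()))[0]
        out[value] += mu
    return dict(out)

print(fuzzy_integrate(X))             # {'a': 0.5, 'b': 0.5}
print([sample(X) for _ in range(3)])  # e.g. [{'a': 0.5, 'b': 0.5}, {'a': 1.0}, ...]
```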
5 A Bioinformatic Domain

Inference of large genetic regulatory networks (GRN) is a central problem in modern bioinformatics. However, algorithmic complexity has limited detailed inference using BNs [6; 15; 16] to N ≲ 100 genes [5]. Approaches which can be applied to larger numbers of genes include modern clustering methods [10] and the inference of graphical Gaussian models over clustered gene expression data [8; 14].

FBNs suggest a novel approach to detailed exploration of large GRN. Such a methodology generalises to the inference of other large causal networks as well.

If the data is pre-processed by using a fuzzy cover algorithm, the dimensionality of the problem may be reduced by an order of magnitude or more. This could lead to an exponential reduction in the algorithmic complexity which would more than offset any increase caused by the fuzzy state size explosion and its collapse.

A fuzzy cover is a clustering algorithm which covers the data, rather than clusters it. In a fuzzy cover, a variable (gene) can have Σ_{c∈C} µ_c > 1, where C is the set of covers that the algorithm finds.

Inference over the covers is performed using a standard algorithm to find a virtual GRN. Using the retained µ_c for each n ∈ N and c ∈ C, most of the original fidelity can be recovered after the inference has been performed by linearly devolving and normalising the network of covers back down to a network of genes. The synergistic use of dimension-reduction and FBNs is what we believe will be most useful.

The authors are using this approach (fuzzy covering, FBN inference, FBN devolution) to infer and explore large genetic regulatory networks. With fuzzy clustering and FBNs we expect to be able to perform more detailed exploratory inference for N ≈ 1000.
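As a rough illustration of the devolution step only (the cover-finding algorithm and the inference itself are out of scope here), one might map an inferred cover-level state back to member genes in proportion to the retained memberships and renormalise. Everything in this sketch, including the normalisation choice and all names and values, is an assumption on our part rather than the authors' algorithm.

```python
# Hypothetical retained memberships: gene -> {cover: mu}; the sums may
# exceed 1 because covers overlap.
MEMBERSHIP = {"geneA": {"cov1": 0.9, "cov2": 0.4},
              "geneB": {"cov2": 1.0}}

def devolve(cover_states, memberships):
    """Linearly devolve cover-level PDs to gene-level PDs and normalise."""
    genes = {}
    n = len(next(iter(cover_states.values())))
    for gene, mus in memberships.items():
        total = sum(mus.values())
        genes[gene] = [sum(mu * cover_states[c][k] for c, mu in mus.items()) / total
                       for k in range(n)]
    return genes

cover_states = {"cov1": [0.7, 0.2, 0.1], "cov2": [0.2, 0.5, 0.3]}
print(devolve(cover_states, MEMBERSHIP))
```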
6 Contributions and Future Work

This paper has presented a new formalisation which combines fuzzy theory and Bayesian networks. Because of the way that it extends classic BNs, all existing algorithms, tools and machine learning techniques for classic BNs can be used immediately with FBNs.

Several techniques for tractably propagating fuzzy beliefs across a FBN are also described. Using these techniques, previously used BNs can be assigned fuzzy variable states and updated accordingly. This means that existing networks, often learnt only after substantial effort, can be easily reused. Furthermore, the difference in BN and FBN efficiency with sensible fuzziness collapse may be as little as a small linear constant in some circumstances. This means that there are few disadvantages to using FBNs instead of BNs.

The possibility of integrating FBNs into a machine learning pipeline which involves dimension-reduction and network devolution also suggests that the inference of larger causal networks will be possible using FBNs.

Future research may uncover more efficient methods for integrating, clustering or otherwise collapsing a FS. In addition, the authors plan to present an even more generalised formalisation which relaxes the assumptions made in subsection 3.2.

REFERENCES

[1] Jim F. Baldwin and Enza Di Tomaso, 'Inference and learning in fuzzy Bayesian networks', in FUZZ'03: The 12th IEEE International Conference on Fuzzy Systems, volume 1, pp. 630–635, (May 2003).
[2] F. Bobillo and U. Straccia, 'A fuzzy description logic with product t-norm', in Proceedings of the 16th IEEE International Conference on Fuzzy Systems (FUZZ-IEEE 2007), pp. 652–657, London (United Kingdom), (July 2007).
[3] Y. Cao, P. Wang, and A. Tokuta, 'Reverse engineering of NK boolean network and its extensions — fuzzy logic network (FLN)', New Mathematics and Natural Computation, 3(1), 68–87, (2007).
[4] David M. Chickering, 'Learning Bayesian networks is NP-Complete', in Learning from Data: Artificial Intelligence and Statistics V, eds., D. Fisher and H. J. Lenz, 121–130, Springer-Verlag, (1996).
[5] Christopher Fogelberg and Vasile Palade, 'Machine learning and genetic regulatory networks: A review and a roadmap', Technical Report CS-RR-08-04, Computing Laboratory, Oxford University, Wolfson Building, Parks Road, Oxford, OX1-3QD, (April 2008).
[6] A. J. Hartemink, D. K. Gifford, T. S. Jaakkola, and R. A. Young, 'Combining location and expression data for principled discovery of genetic regulatory network models', Pacific Symposium on Biocomputing, 437–449, (2002).
[7] David Heckerman, 'A tutorial on learning with Bayesian networks', Technical report, Microsoft Research, Redmond, Washington, (1995).
[8] Katsuhisa Horimoto and Hiroyuki Toh, 'Statistical estimation of cluster boundaries in gene expression profile data', Bioinformatics, 17(12), 1143–1151, (2001).
[9] Michael Isard and Andrew Blake, 'Condensation — conditional density propagation for visual tracking', (1998).
[10] Sara C. Madeira and Arlindo L. Oliveira, 'Biclustering algorithms for biological data analysis: a survey', IEEE/ACM Transactions on Computational Biology and Bioinformatics, 1(1), 24–45, (2004).
[11] Daniel Neagu and Vasile Palade, 'A neuro-fuzzy approach for functional genomics data interpretation and analysis', Neural Computing and Applications, 12(3-4), 153–159, (2003).
[12] Heping Pan and Lin Liu, 'Fuzzy Bayesian networks — a general formalism for representation, inference and learning with hybrid Bayesian networks', IJPRAI, 14(7), 941–962, (2000).
[13] Romesh Ranawana and Vasile Palade, 'Multi-classifier systems: Review and a roadmap for developers', International Journal of Hybrid Intelligent Systems, 3(1), 35–61, (2006).
[14] Hiroyuki Toh and Katsuhisa Horimoto, 'Inference of a genetic network by a combined approach of cluster analysis and graphical Gaussian modeling', Bioinformatics, 18(2), 287–297, (2002).
[15] Jing Yu, V. Anne Smith, Paul P. Wang, Alexander J. Hartemink, and Erich D. Jarvis, 'Advances to Bayesian network inference for generating causal networks from observational biological data', Bioinformatics, 20(18), 3594–3603, (2004).
[16] Yu Zhang, Zhingdong Deng, Hongshan Jiang, and Peifa Jia, 'Dynamic Bayesian network (DBN) with structure expectation maximization (SEM) for modeling of gene network from time series gene expression data', in BIOCOMP, eds., Hamid R. Arabnia and Homayoun Valafar, pp. 41–47. CSREA Press, (2006).
Combining Goal Inference and Natural-Language
Dialogue for Human-Robot Joint Action
Mary Ellen Foster1 and Manuel Giuliani1 and Thomas Müller1 and Markus Rickert1 and Alois Knoll1
Wolfram Erlhagen2 and Estela Bicho2 and Nzoji Hipólito2 and Luis Louro2
Abstract. We demonstrate how combining the reasoning components from two existing systems designed for human-robot joint action produces an integrated system with greater capabilities than either of the individual systems. One of the systems supports primarily non-verbal interaction and uses dynamic neural fields to infer the user's goals and to suggest appropriate system responses; the other emphasises natural-language interaction and uses a dialogue manager to process user input and select appropriate system responses. Combining these two methods of reasoning results in a robot that is able to coordinate its actions with those of the user while employing a wide range of verbal and non-verbal communicative actions.

1 INTRODUCTION AND MOTIVATION

As robot systems become increasingly sophisticated, their role is moving from one where the robot is essentially an intelligent tool to one where the robot is able to participate as a full team member in collaborative tasks. Supporting this type of human-robot cooperation requires that the robot system be able to produce and understand a wide range of natural communicative cues in order to allow humans to cooperate with it easily. For example, [15] experimentally demonstrated the contribution of anticipatory action to the fluency of human-robot interaction; similarly, natural-language dialogue has been shown to be an effective means of coordinating actions between a human and a robot [7].

A number of previous systems have also addressed the task of human-robot cooperation, using a variety of communicative styles. The Leonardo robot [2], for example, is able to learn simple action sequences and to execute them jointly with the user. The Ripley system [24] is able to manipulate objects in response to spoken requests from a human partner; a more recent robot from the same group [16] increases the responsiveness of the system and allows the action planner to adapt flexibly to a rapidly-changing world. The BARTHOC [26] and ARMAR [27] humanoid robots both support multimodal dialogue to interact with a human user in a variety of settings and domains. The experiments described in [15] demonstrated that understanding and anticipating the user's actions produces a robot that can cooperate more smoothly with a human user.

Since an intelligent robot system must both process continuous sensor data and reason about discrete concepts such as plans, actions, and dialogue moves, this type of system is often made up of components drawing from an assortment of representation and reasoning paradigms. The robot system described in [22], for example, combines low-level robot control and vision systems with a high-level planner, using connectionist kernel perceptron learning to learn the effects of different domain actions. Integration among the different components of this system is achieved through a common representation of actions and their effects. Such hybrid architectures are also particularly common when the robot is designed to cooperate with a human partner; recent examples include [13, 17, 32].

In this paper, we present two robot systems designed to cooperate with humans on mutual tasks and then show how combining reasoning components from these systems results in a more powerful integrated system. Both of the robot systems have been developed in the context of the JAST³ project ("Joint Action Science and Technology"). The two main goals of this project are to investigate the cognitive, neural, and communicative aspects of jointly-acting agents, both human and artificial, and to build jointly-acting autonomous systems that communicate and work intelligently on mutual tasks. The common task across the project is joint construction—that is, multiple agents working together to assemble objects from their components.

The two JAST human-robot systems support intelligent cooperation with humans on this joint construction task. Although both systems address the same basic task and incorporate similar input- and output-processing components, the reasoning components are implemented using very different techniques and they support very different styles of interaction. The goal inference system is implemented using dynamic neural fields and concentrates on inferring the user's intended domain actions based on their non-verbal behaviours and on selecting appropriate domain actions for the system to perform in response. For example, if the user picks up a bolt in a way that indicates that they intend to use it themselves, the system might pick up the corresponding wheel and hold it out to the user. The dialogue system, on the other hand, concentrates on understanding and generating multimodal natural-language utterances to support cooperation between the human and the robot, using a dialogue manager. Sections 2–3 present the details of these two systems and show a typical interaction with each.

Since the two JAST human-robot systems address the same task, using complementary forms of reasoning, it is possible to combine the two forms of reasoning into a single system. This integrated system is able both to intelligently infer the user's actions and suggest appropriate responses, and also to engage in dialogue with the user to support coordination and to discuss situations when the system is unable to infer the user's goal. In Section 4, we present this integrated system and show a sample of the interactions that it can support that are not possible with either of the individual systems; this section also gives some technical details of how the components of the two systems are combined in practice. Finally, in Section 5, we compare the integrated system with other similar systems and summarise the contributions of the system and the areas for future work.

¹ Technische Universität München, Germany, contact: foster@in.tum.de
² University of Minho, Portugal, contact: wolfram.erlhagen@mct.uminho.pt
³ http://www.euprojects-jast.net/
2 GOAL INFERENCE BASED ON DYNAMIC NEURAL FIELDS

The first of the JAST human-robot systems concentrates on giving the robot the ability to predict the consequences of observed actions, using an implementation inspired by neurocognitive mechanisms underlying this capacity in humans and other social species. Many contemporary theories of intention understanding in familiar tasks rely on the notion that an observer uses their own motor repertoire to simulate an observed action and its effect ([4]; for a review see [25]). The selection of an appropriate complementary behaviour in a joint action task depends not only on the inferred goal of the partner, but also on the integration of additional information sources such as shared task knowledge (e.g., a construction plan) and contextual cues.

The cognitive control architecture for action coordination in the joint construction scenario is formalized by a coupled system of dynamic fields representing a distributed network of local but connected neural populations [3]. Different pools of neurons encode task-relevant information about action means, action goals, and context in the form of activation patterns that are self-sustained through recurrent interactions.
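The field dynamics themselves are not reproduced in this section, but dynamic neural field architectures of this kind are conventionally built on the Amari field equation. A generic form is given below as orientation; it is our addition, and the exact coupled-field formulation used here is the one detailed in [1, 3].

```latex
\tau \,\frac{\partial u(x,t)}{\partial t}
  = -u(x,t) + h + S(x,t)
    + \int w(x - x')\, f\!\left(u(x',t)\right)\, dx'
```

Here u(x, t) is the activation at field site x, h a resting level, S(x, t) the summed input from sensors and connected fields, w a lateral interaction kernel (local excitation, surround inhibition), and f a sigmoidal rate function; the self-sustained activation patterns mentioned above correspond to the stable localised solutions of this equation.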
The motor simulation idea is implemented by the propagation of activity through interconnected neural populations that constitute a learned chain of motor primitives directed towards a specific goal [6]. Typical examples in the context of the construction scenario are reaching-grasping-placing/plugging sequences. The chains are automatically triggered by an observed motor act (e.g., reaching or grasping) whenever additional input from connected dynamic fields (e.g., representing the currently available subgoals) pre-activates the neural populations. As a consequence of the motor simulation, the robot is able to react to the partner's action sequences well ahead of their completion. This anticipation capacity has been shown to be crucial for a fluent team performance [1, 15].
In the layer of the control architecture linked to motor execution, neural populations represent the decision about the most appropriate complementary behaviour. The behaviour is selected as a consequence of a competition process between all response alternatives getting input from connected layers (for details see [1]).
A system based on this dynamic field architecture was implemented to support human-robot cooperation on the JAST joint construction task. This system constructs a toy vehicle (Figure 1) with the user. The vehicle is composed of several components which are initially distributed in the separated working areas of the two teammates; this ensures that neither of the agents is able to reach all of the required components on its own and must rely on the partner to retrieve them, making joint action essential to a successful interaction. The robotics platform we are currently using consists of a torus on which are mounted a 7-DOF AMTEC arm (Schunk GmbH & Co. KG) with a 3-fingered BARRETT hand (Barrett Technology Inc.) and a stereo vision system. The system uses synthesised speech to communicate its reasoning process to the human partner.
To control the arm-hand system, we applied a global planning method in posture space that facilitates the integration of optimization principles derived from experiments with humans [5]. For the object recognition, as well as for the classification of object-directed hand postures and communicative gestures such as pointing or demanding an object, a combination of feature- and correspondence-based […] platform we have chosen YARP [19]. This open-source project supports inter-process communication, image processing and a class hierarchy to ease code reuse across different hardware platforms.
Figure 2 illustrates a typical example of the goal inference and action selection capacities in this domain. In the top image, the human reaches his open hand towards the robot teammate. By activating the respective action chain in its repertoire, the robot interprets this gesture as a request for a bolt to fix the wheel. Since the human has already mounted the wheel on his side of the construction, this inferred goal describes a currently active subtask. A logical complementary action sequence would be that the robot grasps a bolt to place it in the teammate's hand. However, the human has seemingly overlooked a bolt in his own working area. In this situation, the representation of the inferred goal together with the representation of the bolt in the work space of the human trigger the decision to make a pointing gesture directed towards the object. In addition, the robot uses speech to explain the type of error the human is making.

3 DIALOGUE-BASED HUMAN-ROBOT INTERACTION

Like the system described in the preceding section, the JAST human-robot dialogue system [23] also supports multimodal human-robot collaboration on a joint construction task. In this case, the user and the robot work together to assemble wooden construction toys on a common workspace, coordinating their actions through speech, gestures, and facial displays. The robot (Figure 3) consists of a pair of Mitsubishi manipulator arms with grippers, mounted in a position to resemble human arms, and an animatronic talking head [29] capable of producing facial expressions, rigid head motion, and lip-synchronised synthesised speech. The input channels consist of speech recognition, object recognition, robot sensors, and face tracking; the outputs include synthesised speech, head motions, and robot actions. The components of the system communicate with each other using the Ice distributed object middleware system [14].
The robot is able to manipulate objects in the workspace and to perform simple assembly tasks. The primary form of interaction with the current version of the system is one in which the robot instructs the user on building a particular compound object, explaining the necessary assembly steps and retrieving pieces as required, with the user performing the actual assembly actions. As with the dynamic-field system, the workspace is divided into two areas—one belonging to the robot and one to the human—in order to make joint action necessary for success in the overall task.
Input on each of the channels is processed using a dedicated module for that channel. To process the speech, we use a Java Speech
API interface to the commercial Dragon NaturallySpeaking speech
recogniser [21]. A camera mounted above the common work area
provides two-dimensional images of the contents of the workspace.
The information from this camera is pre-processed to extract regions
of interest (ROIs). The extracted ROIs are then processed in parallel
by a template-based object-recognition module [20] and a module
that performs static hand-gesture recognition [33].
Input received on all of the input sensors is continuously processed
by the corresponding modules and broadcast through Ice, using the
built-in IceStorm publish-subscribe mechanism. All of the input mes-
sages are received by a multimodal fusion component [11, 12], which
parses the recognized speech into logical forms using the OpenCCG
grammar formalism [31] and combines it with the recognised non-
verbal behaviour to produce multimodal hypotheses representing
user requests. The fusion hypotheses are then sent to the dialogue
manager, which selects an appropriate response.
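To make the shape of this processing pipeline concrete, the following C++ sketch mirrors the fusion step described above. Every type and function name is a hypothetical illustration, not the actual interface of the fusion component [11, 12]:

#include <string>
#include <vector>

// Hypothetical message types standing in for the real Ice-distributed ones.
struct SpeechResult  { std::string transcript; };          // from the speech recogniser
struct GestureResult { std::string type; int objectId; };  // from gesture recognition
struct LogicalForm   { std::string predicate; std::vector<std::string> args; };

// Stub: the real system parses the transcript with an OpenCCG grammar [31].
LogicalForm parseSpeech(const SpeechResult& speech) {
    return LogicalForm{"request", {speech.transcript}};
}

// A multimodal hypothesis pairs the parsed speech with co-occurring
// non-verbal behaviour, yielding a candidate representation of a user request.
struct MultimodalHypothesis {
    LogicalForm lf;
    GestureResult gesture;
    double score;  // confidence assigned by the fusion component
};

// Combine one speech result with the gestures observed in the same time window.
std::vector<MultimodalHypothesis>
fuse(const SpeechResult& speech, const std::vector<GestureResult>& gestures) {
    std::vector<MultimodalHypothesis> hypotheses;
    const LogicalForm lf = parseSpeech(speech);
    for (const GestureResult& g : gestures) {
        // The scoring policy here is invented purely for illustration.
        double score = (g.type == "pointing") ? 1.0 : 0.5;
        hypotheses.push_back({lf, g, score});
    }
    return hypotheses;  // handed on to the dialogue manager
}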
The dialogue manager is implemented using the TrindiKit dia-
logue management toolkit [18]. This toolkit uses the well-known
information-state update approach to dialogue management [28],
which allows a dialogue to be modelled declaratively. When the dia-
logue manager receives a new set of fusion hypotheses, it selects the
appropriate system response using information from three sources:
the inventory of objects in the world, a representation of the cur-
rent assembly state, and the history of the dialogue. When the sys-
tem is jointly following an assembly plan with the user, the dialogue
manager is able to select from different strategies for traversing the
plan: it may use a postorder strategy, in which it proceeds directly to describing the concrete assembly actions, or it may use a preorder strategy, in which the structure of the plan is described before giving specific assembly actions. More details on the dialogue manager and on the description strategies are given in [10].
Figure 2. Example of the goal inference (top) and action selection (bottom) capacities which are implemented by the dynamic field architecture. The robot uses speech to communicate the results of its reasoning about the behaviour of the teammate.
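Schematically, the information-state update approach that drives this response selection can be rendered in code. TrindiKit itself is a Prolog-based toolkit, so the C++ sketch below is purely illustrative and all names are our assumptions; the three information sources are the ones listed above:

#include <optional>
#include <string>
#include <vector>

// The three information sources the dialogue manager consults.
struct ObjectInventory { /* objects currently in the world */ };
struct AssemblyState   { /* which plan steps are complete  */ };
struct DialogueHistory { std::vector<std::string> moves; };

struct InformationState {
    ObjectInventory inventory;
    AssemblyState   assembly;
    DialogueHistory history;
};

struct FusionHypothesis { std::string userRequest; };
struct SystemResponse   { std::string content; };

// An update rule fires when its precondition holds, then mutates the state
// and proposes a response -- the declarative core of the information-state
// approach: the dialogue is modelled as a set of such rules.
struct UpdateRule {
    bool (*precondition)(const InformationState&, const FusionHypothesis&);
    SystemResponse (*effect)(InformationState&, const FusionHypothesis&);
};

// Apply the first applicable rule, mirroring rule selection in the toolkit.
std::optional<SystemResponse>
update(InformationState& is, const FusionHypothesis& hyp,
       const std::vector<UpdateRule>& rules) {
    for (const UpdateRule& r : rules)
        if (r.precondition(is, hyp))
            return r.effect(is, hyp);
    return std::nullopt;  // no rule applied; e.g. ask for clarification
}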
Once the dialogue manager has selected a response to the user’s
multimodal utterance, it sends the specification of the response to
the output planner. This module in turn sends commands to select
appropriate output on each of the individual channels to meet the
specification: linguistic content including appropriate multimodal re-
ferring expressions [9], facial expressions and gaze behaviours of the
talking head [8], and actions of the robot manipulators. The user then
responds to the system utterance by speaking or performing actions
in the world, and the interaction continues until the target object has
been assembled.
An excerpt from a typical interaction between a user and the JAST
dialogue system is shown in Figure 4. In this excerpt, the robot knows
the full plan for building the target object: a “railway signal”, which
has sub-components called a “snowman” and a “flower”. The assem-
bled object is shown in Figure 4(a). In the excerpt, the robot instructs
the user on how to build the target object, using a preorder strat-
egy, and the user learns to make particular sub-components along the
way. We are currently carrying out a system evaluation based on this
robot-as-instructor scenario. The evaluation is designed to compare
the two description strategies in terms both of user satisfaction and of success in the overall joint-construction task.
We will shortly extend the dialogue system to handle scenarios
where the user also knows the assembly plan. In such situations, the
main goal of the interaction is no longer instruction, but rather—as
with the goal-inference system described previously—coordination
between the partners, and the user will be able to take much more
initiative in the dialogue than is currently possible. We will return to
the details of this extended scenario in the following section.
Figure 3. The JAST dialogue robot with a selection of wooden construction-toy components.
• SYSTEM: To build a railway signal, we need to make a snowman and a flower.
• USER: How do you make a snowman?
• SYSTEM: [Picking up a red cube and holding it out] To make a snowman, insert the green bolt into your blue cube and fasten it with this cube.
• USER: [picks up bolt and cube; takes other cube; assembles "snowman"] Okay.
• SYSTEM: Well done, you have made a snowman. Now we need to make a flower. To make a flower, [. . . ]
(a) "Railway signal" (b) Dialogue excerpt
Figure 4. A sample object and an excerpt from an interaction where the robot instructs the user on how to construct this object.
4 INTEGRATING GOAL INFERENCE AND NATURAL-LANGUAGE DIALOGUE

There are a number of similarities between the two human-robot systems described above. Both support the same basic task—joint construction—and view the goals and subgoals of this task in a similar way. Also, the input and output channels used by the two systems are very similar: both include object and gesture recognition in the input and produce speech and robot-manipulator actions as part of their output. On the other hand, the reasoning processes used by the two systems are very different: the former uses dynamic neural fields to perform goal inference and action selection based entirely on non-verbal input, while the latter uses techniques from issue-based dialogue management to engage in natural-language conversation with some multimodal components. The strengths of the two systems are also complementary: the dynamic-field system is good at detecting and reasoning about the user's non-verbal actions, but uses language only for a limited form of canned output; the dialogue system supports advanced linguistic interaction, but has no mechanism to infer the user's intention from their actions in the world.
Motivated by the above similarities and complementary features, we have combined components from the two individual human-robot systems into a single, integrated architecture. The hardware platform for the integrated system is the robot from the dialogue system (Figure 3), while the scenario is an extended version of the scenarios used by each of the individual systems. As in the dynamic-field scenario, the user and the robot are both assumed to know the assembly plan for the target object and are able to infer the partner's intentions based on their behaviour, and the main goal of the interaction is for the two participants to coordinate their actions. As in the dialogue system, this coordination is accomplished through natural-language dialogue incorporating both verbal and non-verbal communication.
Figure 5 shows the high-level architecture of the integrated system. Messages on all of the multimodal input channels (speech, gestures, and recognised objects) are sent to both of the input-processing components, each of which—just as in the individual systems—reasons about the meaning of the user's actions in the current context, each drawing information from the same set of state modules (plan state, object inventory, interaction history). The inferred goals and suggested system responses from the goal-inference system are then passed to the dialogue manager, which incorporates this information along with the processed messages from the fusion system into the (extended) information state of the integrated system. The dialogue manager then uses enhanced update rules to select an appropriate system response to the input. Finally, just as in the individual systems, the selected response is sent to the output system for realisation on the output channels.
Figure 5. The architecture of the integrated system (components in the diagram: Multimodal Input, Goal Inference, Dialogue Manager, Object Inventory, Goal State, Multimodal Output).
This integrated system supports interaction patterns that would not be possible with either of the individual systems. Most importantly, it is able both to detect unexpected actions from the user (i.e., actions that do not meet what it believes to be the current subgoals) and to engage the user in dialogue to discuss how to deal with the unexpected action. When both forms of reasoning work together, the system is able to detect such user actions and to produce a variety of responses, including correcting the user, asking for clarification as to the user's intentions, or attempting to silently revise its representation of the goal state. Varying the system's response to this situation produces systems with different interactive "personalities", ranging from one that always makes the user follow the plan selected by the system to one where the user has full control over the course of the interaction.
Figure 6 shows a sample interaction between a user and the integrated system, where the role of each of the reasoning components is shown throughout. In this interaction, the user and the robot are jointly building the "railway signal" object (Figure 4(a)). At the start, the robot system has assumed that the user is building the "snowman" sub-component. When the user grasps a medium slat, which is not needed for that subgoal, the goal inference system detects this (just as in the sample interaction described at the end of Section 2) and sends a message to the dialogue manager that the user's action cannot be integrated into the current plan.
At this point, the system has several options to deal with the mismatch between its beliefs about the current subgoals and the recent action of the user. It might silently revise its view of the current subgoals, for example, or it might—as in Figure 2—correct the user's apparent "error". In the example, the system uses a third strategy, and one that is only available because of the integration of the dialogue components: it asks the user in words to clarify their intentions. After the user provides the needed clarification, also verbally, the dialogue manager updates the system's subgoals and informs the goal-inference system of the change. The goal-inference system then anticipates that, to meet this new subgoal, the user will need the nut that is lying on the robot's side of the table. The system therefore picks up the nut and offers it to the user without being asked.
Figure 6. A sample interaction with the integrated system, showing the role of each individual reasoning component (Dialogue Manager and Goal Inference) in the decision-making process:
• Action: User grasps a medium slat.
• Goal Inference: notices that the action does not meet the current subgoal.
• Dialogue Manager: tells the output planner to ask for clarification.
• SYSTEM: "We don't need a medium slat for the snowman"
• USER: "Yes, but I want to build the flower now"
• Dialogue Manager: interprets the response and updates the subgoals.
• Goal Inference: suggests a system response.
• Dialogue Manager: sends a message to the output planner.
• Action: Robot picks up a nut and holds it out.
• SYSTEM: "Then you'll need this nut"
As can be seen by the right-hand columns in Figure 6, this type of interaction would not be possible with either of the individual systems. The dialogue system does not have the necessary mechanism to infer the user's goals from their actions, while the goal-inference system would only have been able to respond to the user's unexpected action by treating it as an error rather than discussing the user's goals as in the example. Only when these two components are combined is this rich interaction made possible.

4.1 Technical Details

The two individual systems use the same basic information in their reasoning (task goals and subgoals, object inventory, input events); however, due to the different implementations, they represent this information quite differently. Also, at the implementation level, the components of the dynamic-field system use YARP to communicate with one another, while the dialogue system uses Ice as an integration platform. A specific goal of the integration has been to make as few changes as possible to the individual systems. An important aspect of creating the integrated system has therefore been coming up with a common representation for all of the relevant information, where the representation is compatible with both of the systems and both of the integration platforms.
To support the integration, we have defined generic interfaces to represent recognised gestures and objects, as well as inferred and proposed domain actions. These representations include the following information (a code sketch follows the list):
• The Gestures representation includes the type of gesture recognised (pointing, grasping, holding-out, unknown) and, if necessary, the object indicated.
• The Objects representation includes the classification of the object, a 3D position and a flag indicating whether the object can be reached by the robot.
• The Action representation consists of the type of action (grasp-and-give, demand-and-receive, speak, undefined) and a string containing further specifications (e.g. the object-id for grasp-and-give or the sentence to speak out loud).
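A minimal C++ rendering of these three representations might look as follows; the field names, types and enumerators are our guesses from the descriptions above, since the actual interface definitions are not reproduced in the paper:

#include <string>

// Gesture types listed in the Gestures representation above.
enum class GestureType { Pointing, Grasping, HoldingOut, Unknown };

struct Gesture {
    GestureType type;
    int objectId;      // object indicated, if any (hypothetical encoding)
};

struct Object {
    std::string classification;  // e.g. "bolt", "cube"
    double x, y, z;              // 3D position
    bool reachableByRobot;       // can the robot reach it?
};

// Action types listed in the Action representation above.
enum class ActionType { GraspAndGive, DemandAndReceive, Speak, Undefined };

struct Action {
    ActionType type;
    std::string specification;   // object-id for grasp-and-give, or sentence to speak
};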
Internal communication between YARP and Ice is implemented via a connector module that translates Ice messages to YARP messages and vice versa.
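To illustrate what the connector module has to do, the sketch below forwards one gesture message onto a YARP port as a Bottle (YARP's generic record type). The Gesture struct and the onGestureFromIce entry point are invented for illustration; the paper does not describe the connector's actual interfaces:

#include <yarp/os/Bottle.h>
#include <yarp/os/BufferedPort.h>
#include <yarp/os/Network.h>

// Hypothetical struct mirroring the Gestures interface from Section 4.1.
struct Gesture { int type; int objectId; };

yarp::os::BufferedPort<yarp::os::Bottle> gesturePort;

// Invented callback: assume the Ice side hands us a decoded Gesture.
void onGestureFromIce(const Gesture& g) {
    yarp::os::Bottle& msg = gesturePort.prepare();
    msg.clear();
    msg.addInt(g.type);      // flatten the struct into a YARP tuple
    msg.addInt(g.objectId);
    gesturePort.write();     // broadcast to the dynamic-field components
}

int main() {
    yarp::os::Network yarp;              // initialise YARP networking
    gesturePort.open("/connector/gesture:o");
    onGestureFromIce({0, 42});           // e.g. pointing at object 42
    return 0;
}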
5 DISCUSSION

We have presented two human-robot systems, each of which is designed to support the same joint construction task. One system uses dynamic neural fields to perform non-verbal goal inference and action selection, while the other uses a dialogue manager to support multimodal natural-language interaction. We have then shown how a system integrating the reasoning components of the two individual systems is able to take advantage of the complementary strengths of each to support interactions that neither system is able to support on its own. In particular, this integrated system is able both to detect the user's intentions and anticipate their needs, and to use natural-language dialogue to manage the joint activity. The integration of these two systems is made possible through well-defined interfaces that allow the two sets of reasoning components to share information about world state, task goals, and input events.
In contrast to the other systems mentioned in the introduction, the integrated JAST system is unique in that it combines methods and techniques taken from two separate, fully-implemented, existing systems—a neuro-inspired perception-action system and a symbolic, multimodal-dialogue system—to produce an integrated robot system that is able both to communicate with its human partner using language and to intelligently understand and anticipate the partner's intentions. As demonstrated by the example in the preceding section, the integrated system is able to go beyond the capabilities of either of the individual systems to support intelligent human-robot cooperation on the joint construction task.
The integrated system is currently under development: the necessary interfaces have been specified as described in Section 4.1, and the reasoning modules from the two systems are being adapted to use the common interfaces. When this is completed, we will run a user evaluation of the full system similar to that currently under way for the dialogue system, to demonstrate the contribution of both forms of reasoning to natural human-robot joint action.

ACKNOWLEDGEMENTS

This work was supported by the EU FP6 IST Cognitive Systems Integrated Project "JAST" (FP6-003747-IP), http://www.euprojects-jast.net/. Thanks to the CIMA workshop reviewers for their useful comments and suggestions.

REFERENCES

[1] E. Bicho, L. Louro, N. Hipólito, and W. Erlhagen, 'A dynamic neural field architecture for flexible and fluent human-robot interaction', in Proceedings of the 2008 International Conference on Cognitive Systems, pp. 179–185, (2008).
[2] C. Breazeal, A. Brooks, J. Gray, G. Hoffman, C. Kidd, H. Lee, J. Lieberman, A. Lockerd, and D. Chilongo, 'Tutelage and collaboration for humanoid robots', International Journal of Humanoid Robotics, 1(2), 315–348, (2004).
[3] W. Erlhagen and E. Bicho, 'The dynamic neural field approach to cognitive robotics', Journal of Neural Engineering, 3, R36–R54, (2006).
[4] W. Erlhagen, A. Mukovskiy, and E. Bicho, 'A dynamic model for action understanding and goal-directed imitation', Brain Research, 1083, 174–188, (2006).
[5] W. Erlhagen, A. Mukovskiy, E. Bicho, G. Panin, C. Kiss, A. Knoll, H. van Schie, and H. Bekkering, 'Goal-directed imitation for robots: a bio-inspired approach to action understanding and skill learning', Robotics and Autonomous Systems, 54, 353–360, (2006).
[6] W. Erlhagen, A. Mukovskiy, F. Chersi, and E. Bicho, 'On the development of intention understanding for joint action tasks', in Proceedings of the 6th IEEE International Conference on Development and Learning, (July 2007).
[7] T. Fong, C. Thorpe, and C. Baur, 'Collaboration, dialogue, and human-robot interaction', in Robotics Research, volume 6 of Springer Tracts in Advanced Robotics, 255–266, Springer, (2003).
[8] M. E. Foster, 'Roles of a talking head in a cooperative human-robot dialogue system', in Proceedings of the 7th International Conference on Intelligent Virtual Agents (IVA 2007), (September 2007).
[9] M. E. Foster, E. G. Bard, R. L. Hill, M. Guhe, J. Oberlander, and A. Knoll, 'The roles of haptic-ostensive referring expressions in cooperative, task-based human-robot dialogue', in Proceedings of the 3rd ACM/IEEE International Conference on Human Robot Interaction (HRI 2008), (March 2008).
[10] M. E. Foster and C. Matheson, 'Following assembly plans in cooperative, task-based human-robot dialogue', in Proceedings of Londial 2008, (June 2008).
[11] M. Giuliani and A. Knoll, 'Integrating multimodal cues using grammar based models', in Proceedings of HCI International 2007, (July 2007).
[12] M. Giuliani and A. Knoll, 'MultiML – A general purpose representation language for multimodal human utterances', 2008. Submitted.
[13] N. Hawes, A. Sloman, J. Wyatt, M. Zillich, H. Jacobsson, G.-J. Kruijff, M. Brenner, G. Berginc, and D. Skočaj, 'Towards an integrated robot with multiple cognitive functions', in Proceedings of the 22nd Conference on Artificial Intelligence (AAAI 2007), (2007).
[14] M. Henning, 'A new approach to object-oriented middleware', IEEE Internet Computing, 8(1), 66–75, (Jan–Feb 2004).
[15] G. Hoffman and C. Breazeal, 'Effects of anticipatory action on human-robot teamwork efficiency, fluency, and perception of team', in Proceedings of the 2nd ACM/IEEE International Conference on Human Robot Interaction (HRI 2007), (2007).
[16] K.-y. Hsiao, S. Vosoughi, S. Tellex, R. Kubat, and D. Roy, 'Object schemas for responsive robotic language use', in Proceedings of the 3rd ACM/IEEE International Conference on Human Robot Interaction (HRI 2008), (2008).
[17] W. G. Kennedy, M. D. Bugajska, M. Marge, W. Adams, B. R. Fransen, D. Perzanowski, A. C. Schultz, and J. G. Trafton, 'Spatial representation and reasoning for human-robot collaboration', in Proceedings of the 22nd Conference on Artificial Intelligence (AAAI 2007), (2007).
[18] S. Larsson and D. R. Traum, 'Information state and dialogue management in the TRINDI dialogue move engine toolkit', Natural Language Engineering, 6, 323–340, (2000).
[19] G. Metta, P. Fitzpatrick, and L. Natale, 'YARP: Yet another robot platform', International Journal of Advanced Robotics Systems, 3(1), 43–48, (2006).
[20] T. Müller, P. Ziaie, and A. Knoll, 'A wait-free realtime system for optimal distribution of vision tasks on multicore architectures', in Proceedings of the 5th International Conference on Informatics in Control, Automation and Robotics, (May 2008).
[21] Nuance Communications. Dragon NaturallySpeaking 9. http://www.nuance.com/naturallyspeaking/.
[22] R. Petrick, D. Kraft, K. Mourão, C. Geib, N. Pugeault, N. Krüger, and M. Steedman, 'Representation and integration: Combining robot control, high-level planning, and action learning', in Proceedings of the International Cognitive Robotics Workshop (CogRob 2008) at ECAI 2008, (July 2008).
[23] M. Rickert, M. E. Foster, M. Giuliani, T. By, G. Panin, and A. Knoll, 'Integrating language, vision and action for human robot dialog systems', in Proceedings of HCI International 2007, (July 2007).
[24] D. Roy, K.-Y. Hsiao, and N. Mavridis, 'Mental imagery for a conversational robot', IEEE Transactions on Systems, Man, and Cybernetics, Part B, 34(3), 1374–1383, (June 2004).
[25] N. Sebanz, H. Bekkering, and G. Knoblich, 'Joint action: bodies and minds moving together', Trends in Cognitive Sciences, 10, 70–76, (2006).
[26] T. Spexard, M. Hanheide, and G. Sagerer, 'Human-oriented interaction with an anthropomorphic robot', IEEE Transactions on Robotics, 23(5), 852–862, (October 2007).
[27] R. Stiefelhagen, H. Ekenel, C. Fugen, P. Gieselmann, H. Holzapfel, F. Kraft, K. Nickel, M. Voit, and A. Waibel, 'Enabling multimodal human-robot interaction for the Karlsruhe humanoid robot', IEEE Transactions on Robotics, 23(5), 840–851, (October 2007).
[28] D. Traum and S. Larsson, 'The information state approach to dialogue management', in Current and New Directions in Discourse and Dialogue, eds., J. C. J. Van Kuppevelt and R. W. Smith, 325–353, Kluwer Academic Publishers, (2003).
[29] A. J. N. van Breemen, 'iCat: Experimenting with animabotics', in Proceedings of the AISB 2005 Creative Robotics Symposium, (2005).
[30] G. Westphal, C. von der Malsburg, and R. Würtz, 'Feature-driven emergence of model graphs for object recognition and categorization', in Applied Pattern Recognition, eds., A. Kandel, H. Bunke, and M. Last, 155–199, Springer Verlag, (2008).
[31] M. White, 'Efficient realization of coordinate structures in Combinatory Categorial Grammar', Research on Language and Computation, 4(1), 39–75, (2006).
[32] H. Zender, P. Jensfelt, Ó. Martinez Mozos, G.-J. M. Kruijff, and W. Burgard, 'An integrated robotic system for spatial understanding and situated interaction in indoor environments', in Proceedings of the 22nd Conference on Artificial Intelligence (AAAI 2007), (July 2007).
[33] P. Ziaie, T. Müller, M. E. Foster, and A. Knoll, 'Using a naïve Bayes classifier based on k-nearest neighbors with distance weighting for static hand-gesture recognition in a human-robot dialog system', in Proceedings of CSICC 2008, (March 2008).
A Tool for Evolving Artificial Neural Networks
Efstratios F. Georgopoulos 1, 3, Adam V. Adamopoulos 2, 3 and Spiridon D. Likothanassis 3
1 Technological Educational Institute of Kalamata, Greece, e-mail: sfg@teikal.gr
2 Dept. of Medicine, Democritus University of Thrace, Greece, e-mail: adam@med.duth.gr
3 Dept. of Computer Engineering & Informatics, University of Patras, Greece, e-mail: likothan@cti.gr

Abstract. A hybrid evolutionary algorithm is presented that combines the genetic programming philosophy with the localized Extended Kalman Filter (EKF) training method. The algorithm is used for the topological evolution and training of Multi-Layered Neural Networks, and it is implemented as a visual software tool in the C++ programming language. The proposed hybrid evolutionary algorithm is applied on two bio-signal modeling tasks, the Magneto Encephalogram (MEG) of epileptic patients and the Magneto Cardiogram (MCG) of normal subjects, exhibiting very satisfactory results.

1 INTRODUCTION

One of the main problems faced when Artificial Neural Networks (ANN), and especially Multilayer Perceptrons, are applied to some task is finding the network architecture, or topology, that is best suited for the task at hand. A network too small for the problem might cause poor learning ability, while a large one might cause poor generalization. Until now the common method to determine the architecture of a neural network has been trial and error. However, in recent years there have been many attempts at designing the architecture of a neural network automatically, and these have led to a variety of methods.
Two subcategories of such methods are a) constructive and b) pruning (destructive) algorithms [28], [29]. Roughly speaking, a constructive algorithm starts with a minimal network, that is, an ANN with a minimal number of hidden layers, hidden neurons and connections, and adds new layers, neurons or connections during the training phase, if necessary. A pruning (destructive) algorithm does the opposite: it starts with a maximal network and deletes the unnecessary layers, nodes and connections during training.
Another approach to this problem uses Genetic Algorithms. Genetic Algorithms are a class of optimization algorithms that are good at exploring a large and complex space in an intelligent way in order to find values close to the global optimum (see [12], [15], [20] and [22] for details). The design of a near-optimal topology can be formulated as a search problem in the architecture space, where each point in the space represents a network architecture. Training can be formulated as a search problem in the weight space. Since the end of the last decade, there have been several attempts to combine the technology of neural networks with that of genetic algorithms. Given some performance criteria for the architecture, for example error, generalization ability, learning time, architectural complexity etc., the performance level of all architectures forms a surface in the space. Optimal architecture design then equals finding the optimum point on this surface.
The first attempts, described in [10], [23], [25] and [27], focused mainly on the problem of training the networks and not on topology design. They used neural networks with fixed architecture and genetic algorithms in order to search the weight space for some near-optimum weight vector that solves the problem of network training. That is, they used genetic algorithms instead of some classical training algorithm. Soon the main research interest moved from training to the search for the optimal architectural (or topological) design of a neural network. Some first works used genetic algorithms in order to imitate pruning algorithms. They start with a network larger than necessary for the task and then use a specially designed genetic algorithm to define which combination of connections is sufficient to, quickly and accurately, learn to perform the target task using back propagation. Miller et al. [21] did that for some small nets. The same problem, but for larger networks, was faced by Whitley and Bogart in [26]. Bornholdt and Graudenz in [9] used a modified GA in order to evolve a simplified model of a biological neural network and then applied the algorithm to some toy Boolean functions. A different approach to the design and evolution of modular neural network architectures is presented in [13]. Billings and Zheng in [8] used a GA for the architectural evolution of radial basis function (RBF) networks. The most recent, and maybe the most successful, approach to the problem of finding the near-optimum architecture is presented in [28]. There, Yao and Liu propose a new evolutionary system, the EPNet, for evolving artificial neural networks' behavior.
In the last couple of years, there has been increasing interest in the use of multi-objective optimization methods, and especially evolutionary multi-objective techniques, for neural network training and structure optimization. Two very interesting approaches are presented in [31] and [32].
The present work is the continuation of a series of efforts concerning the application of evolutionary algorithms to the optimization of neural networks. In [17] a neural network model with binary neurons was evolved by a modified genetic algorithm in order to learn some Boolean functions. In [1], [2], [3], [4], [5], [6], [7], [11], [18] and [19] genetically evolved artificial neural networks were successfully used for a variety of problems.
In this paper we present a hybrid evolutionary method, resembling a genetic programming technique, for the evolution of a population of feed-forward Multi-Layered Perceptrons [14]. This hybrid algorithm combines a genetic programming technique (for details see [16]) for the evolution of the architecture of a neural network with a training method based on the localized Extended
Kalman Filter (EKF), known as the Multiple Extended Kalman Algorithm (MEKA). The MEKA is described in detail in [24]. The novelty of this effort lies, apart from the combination of evolution techniques with MEKA, in the capability of the proposed method to search not only for the optimal number of hidden units, but also for the number of inputs needed for the problem at hand; of course this holds only for time series prediction problems, where the number of needed past values, which represent the network's inputs, is unknown. This hybrid algorithm is an evolved and heavily enriched version of an older algorithm that was developed by the authors and presented in [4], [7] and [19]. Furthermore, this evolutionary neural network system has been implemented as a visual tool in C++ with a graphical user interface. In order to test the ability of the algorithm to produce networks that perform well, we apply the system on two biosignals, namely the Magneto Encephalogram (MEG) recordings of epileptic patients and the Magneto Cardiogram (MCG) recordings of normal subjects. The algorithm produces networks with small sizes that perform well.
The rest of the paper is organized as follows. Section 2 describes the hybrid evolutionary algorithm, while the numerical experiments are presented in section 3. Finally, section 4 discusses the concluding remarks.

2 THE HYBRID EVOLUTIONARY ALGORITHM

2.1 THE MULTIPLE EXTENDED KALMAN ALGORITHM - MEKA

Consider a network characterized by a weight vector w. The average cost function that should be minimized during the training phase is defined in terms of N input-output patterns as follows:

$$E_{av}(w) = \frac{1}{2N} \sum_{n=1}^{N} \sum_{j \in C} \left[ d_j(n) - y_j(n) \right]^2 \quad (1)$$

where $d_j(n)$ is the desired response and $y_j(n)$ the actual response of output neuron j when input pattern n is presented, while the set C includes all the output neurons of the network. The cost function $E_{av}(w)$ depends on the weight vector w due to the fact that $y_j(n)$ itself depends on w.
Concentrating on an arbitrary neuron i, which might be located anywhere in the network, its behavior during the training phase may be viewed as a non-linear dynamic system, which in the context of Kalman filter theory may be described by the following state-measurement equations [14], [24]:

$$w_i(n+1) = w_i(n) \quad (2)$$
$$d_i(n) = y_i(n) + e_i(n) \quad (3)$$
$$y_i(n) = \varphi\left( x_i^T(n)\, w_i(n) \right) \quad (4)$$

where the iteration n corresponds to the presentation of the nth input pattern, $x_i(n)$ and $y_i(n)$ are the input and output vector of neuron i respectively, and $e_i(n)$ is the measurement error at the output of neuron i, the instantaneous estimate of which is given by:

$$e_i(n) = -\frac{\partial E(n)}{\partial y_i(n)} \quad (5)$$

$$E(n) = \frac{1}{2} \sum_{j \in C} \left[ d_j(n) - y_j(n) \right]^2 \quad (6)$$

The differentiation in equation (5) corresponds to the back-propagation of the global error to the output of neuron i. The activation function $\varphi(\cdot)$ is responsible for the non-linearity in the neuron. The weight vector $w_i$ of the optimum model for neuron i is to be "estimated" through training with examples. The activation function is assumed to be differentiable. Accordingly, we can use a Taylor series to expand equation (3) about the current estimate $\hat{w}_i(n)$ of the weight vector and thereby linearize the equation as follows [14]:

$$\varphi\left( x_i^T(n)\, w_i(n) \right) \cong q_i^T(n)\, w_i(n) + \left[ \varphi\left( x_i^T(n)\, \hat{w}_i(n) \right) - q_i^T(n)\, \hat{w}_i(n) \right] \quad (7)$$

where

$$q_i(n) = \left. \frac{\partial \varphi\left( x_i^T(n)\, w_i(n) \right)}{\partial w_i(n)} \right|_{w_i(n)=\hat{w}_i(n)} = \hat{y}_i(n)\left[ 1 - \hat{y}_i(n) \right] x_i(n) \quad (8)$$

$\hat{y}_i(n)$ is the output of neuron i that results from the use of the weight estimate. In equation (8) we have assumed the use of the logistic function; other sigmoid functions, like the hyperbolic tangent, can be used as well. The first term on the right hand side of equation (7) is the desired linear term, while the remaining term represents a modeling error. Thus, substituting equations (7) and (4) in (3) and ignoring the modeling error, we obtain:

$$d_i(n) = q_i^T(n)\, w_i(n) + e_i(n) \quad (9)$$

where $e_i(n)$ and $q_i(n)$ are defined in equations (5) and (8) respectively.
Equations (2) and (9) describe the linearized behavior of neuron i. Given this pair of equations, we can make use of the standard Recursive Least Squares (RLS) algorithm equations [14], a special case of the Kalman filter, to estimate the weight vector $w_i(n)$ of neuron i. The resulting solution is defined by the following system of recursive equations [14] that describe the Multiple Extended Kalman Algorithm (MEKA) [24]:

$$r_i(n) = \lambda^{-1} P_i(n-1)\, q_i(n) \quad (10)$$
$$k_i(n) = r_i(n) \left[ 1 + r_i^T(n)\, q_i(n) \right]^{-1} \quad (11)$$
$$w_i(n+1) = w_i(n) + e_i(n)\, k_i(n) \quad (12)$$
$$P_i(n+1) = \lambda^{-1} P_i(n) - k_i(n)\, r_i^T(n) \quad (13)$$

where n = 1, …, N is the iteration number and N is the total number of examples.
The vector $q_i(n)$ represents the linearized neuron activation function given in equation (8), $P_i(n)$ is the current estimate of the inverse of the covariance matrix of $q_i(n)$, and $k_i(n)$ is the Kalman gain. The parameter λ is a forgetting factor which takes values in the range (0,1], and $e_i(n)$ is the localized measure of the global error. Equation (13) is called the Riccati difference equation.
Each neuron in the network perceives its own effective input $q_i(n)$, hence it has to maintain its own copy of $P_i(n)$, even in the case in which it may share some of its inputs with other neurons in the network.
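For concreteness, the following compact C++ sketch performs one MEKA update for a single neuron, following equations (10)–(13). It is our illustrative reading of the algorithm, not code from the authors' tool; the logistic activation (equation (8)) and the data structures are assumptions:

#include <cstddef>
#include <vector>

// Per-neuron state for MEKA: weight vector w_i and matrix P_i, the
// estimate of the inverse covariance of the linearized input q_i.
struct NeuronFilter {
    std::vector<double> w;               // weights
    std::vector<std::vector<double>> P;  // square matrix, dim = w.size()
};

// One recursive update, equations (10)-(13), for neuron i.
// x: effective input, y_hat: neuron output from the weight estimate,
// e: localized error from eq. (5), lambda: forgetting factor in (0,1].
void mekaUpdate(NeuronFilter& f, const std::vector<double>& x,
                double y_hat, double e, double lambda) {
    const std::size_t d = f.w.size();

    // q_i(n) = y_hat * (1 - y_hat) * x   (eq. 8, logistic activation assumed)
    std::vector<double> q(d);
    for (std::size_t j = 0; j < d; ++j) q[j] = y_hat * (1.0 - y_hat) * x[j];

    // r_i(n) = lambda^{-1} * P * q       (eq. 10)
    std::vector<double> r(d, 0.0);
    for (std::size_t j = 0; j < d; ++j)
        for (std::size_t k = 0; k < d; ++k) r[j] += f.P[j][k] * q[k] / lambda;

    // k_i(n) = r / (1 + r^T q)           (eq. 11)
    double denom = 1.0;
    for (std::size_t j = 0; j < d; ++j) denom += r[j] * q[j];
    std::vector<double> kGain(d);
    for (std::size_t j = 0; j < d; ++j) kGain[j] = r[j] / denom;

    // w_i(n+1) = w_i(n) + e * k          (eq. 12)
    for (std::size_t j = 0; j < d; ++j) f.w[j] += e * kGain[j];

    // P_i(n+1) = lambda^{-1} P - k r^T   (eq. 13)
    for (std::size_t j = 0; j < d; ++j)
        for (std::size_t k = 0; k < d; ++k)
            f.P[j][k] = f.P[j][k] / lambda - kGain[j] * r[k];
}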
2.2 THE EVOLUTIONARY ALGORITHM

The proposed evolutionary algorithm is an improved version of a modified genetic algorithm that was used previously by the authors. It maintains the basic working philosophy of evolutionary algorithms and resembles genetic programming (see [16] for details), since it evolves complicated structures like linked lists and not simple bit strings as genetic algorithms do.
The algorithm evolves, using a number of genetic operators, a population of artificial neural networks (multilayered perceptrons) that are represented as linked lists of network layers and neurons; thus the direct encoding scheme is used. The basic steps of the algorithm are as follows:
1. Initialization: An initial population of neural networks (called individuals) is created. Every individual has a random number of neurons (or nodes) and connections (synapses). The connection weights are initialized to some random values within a specific range.
2. Training: Every individual (neural network) in the population is trained using MEKA for a small number of training epochs. For populations other than the initial one, training occurs only for those networks that have been changed by the application of genetic operators.
3. Fitness Evaluation: As fitness function, a function is used that combines the performance of the network on the training and/or validation set with the size of the network. The performance is evaluated using the Mean Squared Error (MSE) or the Mean Relative Error (MRE), while the size is the number of neurons and/or the number of active synapses in the network. So the fitness function for the case of MRE has a formula of the type:

$$\mathrm{Fitness}(i) = \frac{1}{1 + \mathrm{MRE}(i) + sp \cdot \mathrm{MRE}(i) \cdot \mathrm{SIZE}(i)} \quad (14)$$

where sp is a parameter that controls the weight of the network size in the evaluation of fitness, MRE(i) is the value of MRE of individual i, SIZE(i) is the size of individual i, which can be calculated as the number of active connections or the number of neurons, and i is an index taking values in the range 1 to the population size.
4. Selection: A selection operator is used in order to create a new, intermediate population from the old one, by selecting individuals based on their fitness. This can be done using any of the following three different selection schemes that have been implemented, namely:
- The Elitism Roulette Wheel Selection Operator, with variable elitist pressure (for more details see [12], [16], [20] and [22]).
- The Rank Based Selection (for more details see [12], [16], [20] and [22]).
- The Tournament Selection with variable tournament size (for more details see [12], [16], [20] and [22]).
5. Mutation: It works on the members of the intermediate population. Five different mutation operators are implemented:
- Input Mutation: it selects randomly a neural network from the population and changes its number of inputs. This operator works only on time series modeling and prediction problems, where the number of past values (network inputs) needed to predict future values is not usually known a priori.
- Hidden Mutation: it selects randomly a neural network from the population and changes the structure of its hidden region by adding or deleting a random number (selected uniformly from a given interval) of hidden neurons.
- Non-Uniform Weight Mutation: it is responsible for the fine-tuning capabilities of the system. It selects randomly a number of connection weights and changes their values to new ones as follows. Suppose that w is the old weight value; then the new one is given by the formula:

$$w(n+1) = w(n) \pm \Delta\left( t,\ ub - w(n) \right) \quad (15)$$

where lb and ub are the lower and upper bounds of the weight values, t is the generation number, and Δ(t, y) is a function that returns a value in the range [0, y], such that the probability of Δ(t, y) being close to 0 increases as t increases. This property causes the operator to search the solution space uniformly at first (while t is small) and very locally at the later stages. In our experiments the following function [20] was used:

$$\Delta(t, y) = y \cdot \left( 1 - r^{\left(1 - \frac{t}{T}\right)^{b}} \right) \quad (16)$$

where r is a random number in [0,1], T is the maximal generation number (a parameter of the algorithm), and b is a parameter determining the degree of non-uniformity (a code sketch of this operator is given after step 6).
- Gaussian Weight Mutation: it works like the Non-Uniform Weight Mutation operator, with the difference that the new weight value is calculated by the formula:

$$w(n+1) = w(n) + \Delta w(n) \quad (17)$$

where Δw is a small random number following a Gaussian distribution.
- Uniform Weight Mutation: it works like the Gaussian mutation operator, with the difference that Δw is a small random number following a Uniform distribution.
6. Crossover: It selects two parents (neural networks) and generates one or two offspring by recombining parts of them. The offspring take the place of their parents in the new population. In the presented algorithm, crossover recombines whole neurons with their incoming connections. But since we have to deal with networks of different structures, the new connections that might have to be produced are initialized with random weight values, as in the initialization phase. Herein crossover works more like a mutation operator, as in most genetic programming systems, than as the recombination operator of genetic algorithms.
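As promised above, here is a possible C++ implementation of the non-uniform weight mutation of equations (15) and (16). The random-number plumbing and the symmetric treatment of the lower bound are our assumptions:

#include <cmath>
#include <random>

// Delta(t, y) from equation (16): returns a value in [0, y] that tends
// towards 0 as generation t approaches the maximal generation T.
double delta(double t, double y, double T, double b, std::mt19937& rng) {
    std::uniform_real_distribution<double> unif(0.0, 1.0);
    double r = unif(rng);
    return y * (1.0 - std::pow(r, std::pow(1.0 - t / T, b)));
}

// Equation (15): push the weight towards the upper or lower bound,
// choosing the direction at random.
double nonUniformMutate(double w, double t, double lb, double ub,
                        double T, double b, std::mt19937& rng) {
    std::bernoulli_distribution coin(0.5);
    if (coin(rng))
        return w + delta(t, ub - w, T, b, rng);  // towards upper bound
    else
        return w - delta(t, w - lb, T, b, rng);  // towards lower bound
}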
Therefore, the presented hybrid evolutionary algorithm works in brief as follows: it starts with a population of randomly constructed Neural Networks (step 1). Networks undergo some training for a couple of epochs with MEKA, using the training set (step 2). Performance is measured with the fitness function (step 3) using the validation set, in order to improve generalization. Then a new, intermediate population is created by selecting the more fit individuals according to their fitness (step 4), using any of the three selection schemes. Some members of this intermediate population undergo transformations by means of genetic operators to form the members (individuals) of the new population: the mutation (step 5) and
crossover (step 6) operators. The new population that is created is
trained again (step 2); new members are trained for a couple of
epochs, while the members that have survived and passed from the
old population may be trained with MEKA for some more epochs,
or may not be trained at all. This is the new generation. This whole
process continues until a predefined termination condition is
fulfilled; the termination condition might be a maximum number of
generations or a minimum error (MSE or MRE) value. Once
terminated the algorithm is expected to have reached a near-
optimum solution, i.e. a trained network with near optimum
architecture.
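To summarise the control flow, the sketch below is a schematic C++ reconstruction of the generational loop just described, with the fitness of equation (14) written out explicitly. All type and function names are our stand-ins, not the tool's published code:

#include <cstddef>
#include <utility>
#include <vector>

struct Network {
    double mre = 0.0;                   // Mean Relative Error on the validation set
    std::size_t activeConnections = 0;  // used as SIZE(i) in equation (14)
    double fitness = 0.0;
    bool changed = true;                // retrain only modified individuals
};
using Population = std::vector<Network>;

// Assumed stand-ins for the components described in steps 1-6.
Population initialize(int popSize);          // step 1
void trainWithMEKA(Network& n, int epochs);  // step 2
Population select(const Population& p);      // step 4: one of the three schemes
void mutate(Population& p);                  // step 5
void crossover(Population& p);               // step 6

// Step 3, equation (14): both error and size lower the fitness.
double evaluateFitness(const Network& n, double sp) {
    return 1.0 / (1.0 + n.mre + sp * n.mre * n.activeConnections);
}

Population evolve(int popSize, int maxGenerations, int epochs, double sp) {
    Population pop = initialize(popSize);
    for (int gen = 0; gen < maxGenerations; ++gen) {
        for (Network& n : pop)
            if (n.changed) { trainWithMEKA(n, epochs); n.changed = false; }
        for (Network& n : pop) n.fitness = evaluateFitness(n, sp);
        Population next = select(pop);  // intermediate population
        mutate(next);                   // operators mark the individuals they touch
        crossover(next);
        pop = std::move(next);
        // a minimum-error termination test could also break the loop here
    }
    return pop;
}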
2.3 THE TOOL

This hybrid evolutionary algorithm has been implemented as a visual tool in the C++ programming language, with a graphical user interface (GUI). Specifically, the Borland C++ 6.0 IDE for Windows was used. Figures 1 and 2 depict two of the basic forms of the program, for the two main categories of problems it can be used for: classification and time series prediction. Figure 3 is the "statistics" form that illustrates the evolutionary process and prints useful information about it.
The user can select between these two problem categories. Then he/she can insert the values of the various genetic parameters, the training, validation and test files, as well as the output log files. The user can observe the evolutionary process using a real-time graphical display of the error, the performance of the best-ever network and other parameters.
Figure 1. The main form for classification problems of the evolutionary neural network system.
Figure 2. The main form for prediction problems of the evolutionary neural network system.
Figure 3. The "statistics" form.

3 NUMERICAL EXPERIMENTS

In order to examine the ability of the algorithm to produce networks that learn and generalize well, we have tested it on two real-world problems: the modeling of the MEG recordings of epileptic patients and the modeling of the MCG recordings of normal subjects.
Brain dynamics can be evaluated by recording the changes of the neuronal electric voltage, either by the electroencephalogram (EEG) or by the MEG. The EEG recordings represent the time series that correspond to neurological activity as a function of time. On the other hand, the MEG is generated due to the time-varying nature of the neuronal electric activity, since time-varying electric currents generate magnetic fields. EEG and MEG are considered to be complementary, each one carrying a part but not the whole of the information related to the underlying neural activity. Thus, it has been suggested that the EEG is mostly related to the inter-neural electric activity, whereas the MEG is mostly related to the intra-neural activity. The MEG recordings of epileptic patients were obtained using a Super-conductive QUantum Interference Device (SQUID) and were digitized with a sampling frequency of 256 Hz using a 12-bit A/D converter. The SQUID is a very sensitive magnetometer, capable of detecting and recording the bio-magnetic fields produced in the human brain due to the generation of electrical micro-currents at the neural cellular level [30].
The same stands for the MCG recordings, which are magnetic recordings of the heart operation of normal subjects. MEG and MCG data were provided by the Laboratory of Medical Physics of the Democritus University of Thrace, Greece, where a one-channel DC SQUID is operable. Both biosignal data sets were normalized to the interval [0,1] in order to be processed by the neural networks.
In all the experiments we used, for comparison reasons, the same parameter values, which are depicted in Table 1. For the case of MEG modeling, the training set consisted of 1024 data samples (corresponding to a four-second epoch of the MEG), while 512 data samples were used for testing (corresponding to a two-second epoch of the MEG). For the case of MCG modeling, the training set consisted of 1024 data samples and the test set of 1024 data samples. The algorithm was left to run over 1000 generations.
In order to evaluate the forecasting capability of the produced networks we used three well-known error measures: the Normalized Root Mean Squared Error (NRMSE), the Correlation Coefficient (CC) and the Mean Relative Error (MRE). The performance of the hybrid algorithm for the case of MEG modeling is depicted in Tables 2 and 3 and in Figure 4, while for the case of MCG it is depicted in Tables 4 and 5 and in Figure 5.
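For reference, the standard definitions of these measures are given below; the paper does not spell them out, so we assume these are the variants intended:

$$\mathrm{NRMSE} = \sqrt{\frac{\sum_{n=1}^{N}\left[x(n)-\hat{x}(n)\right]^{2}}{\sum_{n=1}^{N}\left[x(n)-\bar{x}\right]^{2}}}, \qquad \mathrm{MRE} = \frac{1}{N}\sum_{n=1}^{N}\frac{\left|x(n)-\hat{x}(n)\right|}{\left|x(n)\right|}$$

where $x(n)$ is the actual value, $\hat{x}(n)$ the predicted value, and $\bar{x}$ the mean of the actual series; CC is the Pearson correlation coefficient between the actual and the predicted series. An NRMSE of 0 corresponds to perfect prediction, while a value of 1 corresponds to merely predicting the series mean.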
Table 1. Parameters used for the cases of MEG and MCG modeling

  Parameter                          Value
  Population                         50
  Max number of Generations          1000
  Crossover Probability              0.1
  Input Mutation Probability         0.1
  Weight Mutation Probability        0.1
  Uniform Mutation Probability       0.1
  Non-Uniform Mutation Probability   0.1
  Gaussian Mutation Probability      0.1
  Mean (Gaussian)                    0.1
  StDev (Gaussian)                   0.25
  Predicting Horizon                 1

Table 2. MEG forecasting - Errors on the Training Set

  Architecture   NRMSE    C.C.     MRE
  4-9-1          0.2540   0.9674   0.0350
  3-3-1          0.2913   0.9581   0.0386
  4-5-1          0.2564   0.9672   0.0357
  3-2-1          0.2967   0.9557   0.0411
  3-3-1          0.2705   0.9656   0.0369
  3-4-1          0.2655   0.9645   0.0361

Table 3. MEG forecasting - Errors on the Test Set

  Architecture   NRMSE    C.C.     MRE
  4-9-1          0.1971   0.9805   0.0403
  3-3-1          0.2189   0.9757   0.0438
  4-5-1          0.2063   0.9786   0.0434
  3-2-1          0.2309   0.9733   0.0466
  3-3-1          0.2177   0.9765   0.0446
  3-4-1          0.2111   0.9775   0.0429

Figure 4. MEG forecasting, performance on the test set (plot of actual (-) vs. predicted (:) values).

Table 4. MCG forecasting - Errors on the Training Set

  Architecture   NRMSE    C.C.     MRE
  12-10-1        0.1918   0.9815   0.8421
  9-20-1         0.1840   0.9829   0.8073
  10-11-1        0.1790   0.9839   0.6790
  13-10-1        0.1869   0.9824   0.7875
  13-12-1        0.2213   0.9761   0.8983
  10-9-1         0.2274   0.9740   0.9702
  4-4-1          0.3593   0.9441   0.9822
  3-1-1          0.6481   0.8262   0.9061

Table 5. MCG forecasting - Errors on the Test Set

  Architecture   NRMSE    C.C.     MRE
  12-10-1        0.2419   0.9704   1.4222
  9-20-1         0.2081   0.9781   1.1874
  10-11-1        0.2214   0.9752   1.3642
  13-10-1        0.1858   0.9827   0.9617
  13-12-1        0.2163   0.9774   0.8677
  10-9-1         0.2193   0.9760   0.9882
  4-4-1          0.3586   0.9445   0.9817
  3-1-1          0.6497   0.8241   0.9915

Figure 5. MCG forecasting, performance on the test set (plot of actual (-) vs. predicted (:) values).

4 CONCLUSIONS

In the current paper a hybrid, biologically inspired evolutionary algorithm was presented that combines a genetic programming technique with a training method based on the Multiple Extended Kalman Algorithm. This hybrid algorithm is implemented in C++ as a software system with a graphical user interface.
The main novelties of the proposed hybrid algorithm are the combination of the genetic programming technique with MEKA, the use of a fitness function that combines performance with network size, the ability to evolve not only the structure of the hidden layers but the number of inputs as well, and the large number of different genetic operators, especially mutation operators, that have been implemented. Another novelty is the representation used for the neural networks. As said before, every network in the population is represented as a linked list of layers and neurons, using the direct encoding scheme. The use of linked lists has certain advantages that have to do mainly with memory management: only the memory that is actually needed is used at any time, and there is no need to allocate a maximum memory size for a maximum network size, as in other representation schemes. Moreover, linked lists are dynamic data structures, which means that the neural network architecture can change dynamically during run time, in contrast with other data structures, like matrices, that in C++ cannot change during run time.
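A minimal sketch of what such a linked-list encoding could look like in C++ is shown below; the concrete class design of the tool is not published, so the names and fields are our assumptions:

#include <memory>
#include <vector>

// One incoming connection of a neuron.
struct Synapse {
    double weight;
    int sourceNeuron;  // index of the source neuron in the previous layer
};

// A neuron owns its incoming synapses; neurons form a singly linked list,
// so hidden neurons can be added or deleted at run time.
struct Neuron {
    std::vector<Synapse> incoming;
    std::unique_ptr<Neuron> next;
};

// Layers also form a singly linked list, so whole layers (and, for time
// series problems, network inputs) can grow or shrink during evolution
// without reallocating a fixed-size matrix.
struct Layer {
    std::unique_ptr<Neuron> neurons;
    std::unique_ptr<Layer> next;
};

struct Network {
    int numInputs;                  // evolvable by the Input Mutation operator
    std::unique_ptr<Layer> layers;
};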
This hybrid algorithm was used for the modeling of two biological time series, namely the Magneto Encephalogram (MEG) recordings of epileptic patients and the Magneto Cardiogram (MCG) recordings of normal subjects. All the reported cases refer to predictions on recordings of the dynamics of nonlinear systems. In all the performed experiments the algorithm was able to find a near-optimum network architecture that gave small prediction errors. Therefore we can conclude that the algorithm is able to produce small and compact networks that learn and generalize well.
The algorithm has only been tested on time series prediction problems, and it is our intention to test it on some difficult classification problems as well.
One of the main drawbacks of this kind of algorithms, namely evolutionary algorithms, hybrid or not, is that they are computationally expensive in terms of computer memory and CPU time. Even though the proposed algorithm belongs to this category, two choices make it less computationally expensive than other approaches to the same problem of neural network evolution: the use of MEKA for just a couple of epochs in the training phase, and a representation in which each member of the population is a network stored directly as a linked list, so that there is no need to use encoding and decoding functions for the calculation of the network's performance.
The algorithm could be further improved by adding some more genetic operators for better and faster local search in both the architecture and the weight space, and this is going to be one of our future research targets. Furthermore, the integrated software system already implements a large number of genetic operators whose influence on the performance of the hybrid algorithm needs to be appraised; we need to see which of the three selection schemes, or which of the many mutation operators, give better results. Another future research direction will be the combination of MEKA with other evolutionary techniques, like Particle Swarm Optimization and Differential Evolution, for neural network evolution.

REFERENCES

[1] Adamopoulos, A., Andreou, A., Georgopoulos, E., Ioannou, N. and Likothanassis, S., "Currency Forecasting Using Recurrently RBF Networks Optimized by Genetic Algorithms", Computational Finance 1997 (CF'97), London Business School, 1997.
[2] Adamopoulos, A., Anninos, P., Likothanassis, S. and Georgopoulos, E., "On the Predictability of MEG of Epileptic Patients Using RBF Networks Evolved with Genetic Algorithms", BIOSIGNAL'98, Brno, Czech Republic, June 23-25, 1998a.
[3] Adamopoulos, A., Georgopoulos, E., Manioudakis, G. and Likothanassis, S., "An Evolutionary Method for System Structure Identification Using Neural Networks", Neural Computation '98, 1998b.
[4] Adamopoulos, A., Georgopoulos, G., Likothanassis, S. and Anninos, P., "Forecasting the MagnetoEncephaloGram (MEG) of Epileptic Patients Using Genetically Optimized Neural Networks", Genetic and Evolutionary Computation Conference (GECCO'99), Orlando, Florida, USA, July 14-17, 1999.
[5] Andreou, A., Georgopoulos, E., Likothanassis, S. and Polidoropoulos, P., "Is the Greek foreign exchange-rate market predictable? A comparative study using chaotic dynamics and neural networks", Proceedings of the Fourth International Conference on Forecasting Financial Markets, Banque Nationale de Paris and Imperial College, London, 1997.
[6] Andreou, A., Georgopoulos, E., Zombanakis, G. and Likothanassis, S., "Testing Currency Predictability Using An Evolutionary Neural Network Model", Proceedings of the Fifth International Conference on Forecasting Financial Markets, Banque Nationale de Paris and Imperial College, London, 1998.
[7] Andreou, A., Georgopoulos, E. and Likothanassis, S., "Exchange Rates Forecasting: A Hybrid Algorithm Based On Genetically Optimized Adaptive Neural Networks", Computational Economics Journal, Kluwer Academic Publishers, vol. 20, issue 3, pp. 191-210, December 2002.
[8] Billings, S. A. and Zheng, G. L., "Radial basis function network configuration using genetic algorithms", Neural Networks, Vol. 8, pp. 877-890, 1995.
[9] Bornholdt, S. and Graudenz, D., "General asymmetric neural networks and structure design by genetic algorithms", Neural Networks, Vol. 5, pp. 327-334, 1992.
[10] Davis, L., "Mapping classifier systems into neural networks", Proceedings of the 1988 Conference on Neural Information Processing Systems, Morgan Kaufmann, 1988.
[11] Georgopoulos, E., Likothanassis, S. and Adamopoulos, A., "Evolving Artificial Neural Networks Using Genetic Algorithms", Neural Network World, 4/00, pp. 565-574, 2000.
[12] Goldberg, D., Genetic Algorithms in Search, Optimization & Machine Learning, Addison-Wesley, 1989.
[13] Happel, B., et al., "Design and evolution of modular neural network architectures", Neural Networks, Vol. 7, pp. 985-1004, 1994.
[14] Haykin, S., Neural Networks - A Comprehensive Foundation, McMillan College Publishing Company, ch. 6, p. 213, New York, 1994.
[15] Holland, J., Adaptation in Natural and Artificial Systems, MIT Press, 1992.
[16] Koza, J. R., Genetic Programming: On the Programming of Computers by Means of Natural Selection, MIT Press, Cambridge, MA, 1992.
[17] Likothanassis, S., Georgopoulos, E. and Fotakis, D., "Optimizing the Structure of Neural Networks Using Evolution Techniques", 5th International Conference on Applications of High-Performance Computers in Engineering, pp. 157-168, Santiago de Compostela, Spain, July 1997.
[18] Likothanassis, S. D., Georgopoulos, E. F., Manioudakis, G. D. and Adamopoulos, A. V., "Currency Forecasting Using Genetically Optimized Neural Networks", HERCMA, Athens, September 1998.
[19] Likothanassis, S. D. and Georgopoulos, E. F., "A Novel Method for the Evolution of Neural Networks", 3rd IMACS/IEEE International Conference on Circuits, Systems and Computers (CSC'99), July 1999.
[20] Michalewicz, Z., Genetic Algorithms + Data Structures = Evolution Programs, Springer-Verlag, 1996.
[21] Miller, G., et al., "Designing neural networks using genetic algorithms", Proceedings of the 3rd International Conference on Genetic Algorithms, Morgan Kaufmann, 1989.
[22] Mitchell, M., An Introduction to Genetic Algorithms, MIT Press, 1996.
[23] Montana, D. and Davis, L., "Training feedforward neural networks using genetic algorithms", BBN Systems and Technologies, Cambridge, MA, 1989.
[24] Shah, S., Palmieri, F. and Datum, M., "Optimal Filtering Algorithms for Fast Learning in Feed-Forward Neural Networks", Neural Networks, Vol. 5, pp. 779-787, 1992.
[25] Whitley, D., "Applying genetic algorithms to neural network problems", International Neural Networks Society, p. 230, 1988.
[26] Whitley, D. and Bogart, C., "The evolution of connectivity: Pruning neural networks using genetic algorithms", International Joint Conference on Neural Networks, Washington D.C., 1. Hillsdale, NJ: Lawrence Erlbaum, pp. 134-137, 1990.
[27] Whitley, D. and Hanson, T., "Optimizing neural networks using faster, more accurate genetic search", 3rd International Conference on Genetic Algorithms, Washington D.C., Morgan Kaufmann, pp. 391-396, 1989.
[28] Yao, X. and Liu, Y., "A New Evolutionary System for Evolving Artificial Neural Networks", IEEE Transactions on Neural Networks, vol. 8, no. 3, 1997.
[29] Yao, X., "Evolving Artificial Neural Networks", Proceedings of the IEEE, 87(9):1423-1447, September 1999.
[30] Anninos, P., Jacobson, J., Tsagas, N. and Adamopoulos, A., "Spatiotemporal Stationarity of Epileptic Focal Activity Evaluated by Analyzing Magneto Encephalographic (MEG) Data and the Theoretical Implications", Panminerva Med. 39, 189-201, 1997.
[31] Gräning, L., Yaochu Jin and Sendhoff, B., "Generalization Improvement in Multi-Objective Learning", International Joint Conference on Neural Networks (IJCNN '06), pp. 4839-4846, 2006.
[32] Vieira, D. A. G., Vasconcelos, J. A. and Caminhas, W. M., "Controlling the parallel layer perceptron complexity using a multiobjective learning algorithm", Neural Computing and Applications, vol. 16, n. 4, pp. 317-325, 2007.
Intelligently Raising Academic Performance Alerts
Dimitris Kalles1, Christos Pierrakeas and Michalis Xenos
Abstract. We use decision trees and genetic algorithms to analyze the academic performance of students and the homogeneity of tutoring teams in the undergraduate program on Informatics at the Hellenic Open University (HOU). Based on the accuracy of the generated rules, we examine the applicability of the techniques at large and reflect on how one can deploy such techniques in academic performance alert systems.

1 All authors are with Hellenic Open University, www.eap.gr. Contact address is dkalles@acm.org.

1 INTRODUCTION

Student success is a natural performance indicator in universities. However, if that success is used as a criterion for tutor assessment (and subsequent possible contract renewal), and if students must evaluate their own teachers, then tutors may tend to relax their standards. This paper is about dealing with this issue in the context of the Hellenic Open University (HOU); we focus on the undergraduate Informatics program (about 2,500 students). We ask whether we can detect regularities in distance tutoring; then we try to associate them with measures of students' success in an objective way and, subsequently, reflect on how to effectively disseminate this information to all interested parties.

The measurement strategy we have developed to date at HOU has been progressively refined to deal with two closely linked problems: that of predicting student success in the final exams and that of analyzing whether some specific tutoring practices have any effect on the performance of students. Each problem gives rise to the emergence of a different type of user model. A student model allows us, in principle, to explain and maybe predict why some students fail in the exams while others succeed. A tutor model allows us to infer the extent to which a group of tutors diffuses its collective capacity effectively into the student population they advise. However, both types of models can be subsequently interpreted in terms of the effectiveness of the educational system that the university implements.

The rest of this paper is organised in five sections. The next section presents the educational background. Section 3 then reviews the fundamental features of the AI techniques that we have used. Following that we report the experimental results for the undergraduate programme that we have analysed, as well as a short evaluation of the individual module results that seem to signify an interesting deviation. Section 5 presents a discussion from the point of view of how one can generalise our approach as well as how one can substitute other intelligent techniques for data analysis; finally we conclude and describe directions for future development.

2 THE EDUCATIONAL BACKGROUND

A module is the basic educational unit at HOU. It runs for about ten months and is the equivalent of about 3-4 conventional university semester courses. A student may register with up to three modules per year. For each module, a student is expected to attend five plenary class meetings throughout the academic year. A typical class contains about thirty students and is assigned to a tutor (tutors of classes of the same module collaborate on various course aspects). Class face-to-face meetings are about four hours long and are structured along tutor presentations, group-work and review of homework. Furthermore, each student must turn in some written assignments (typically four or six), which contribute towards the final grade, before sitting a written exam. That exam is delivered in two stages: you only need to sit the second if you fail or miss the first.

Students fail a module and may not sit the written exam if they do not achieve a pass grade in the assignments they turn in; these students must repeat that module afresh. A student who only fails the written exam may sit it in the following academic year (without having to turn in assignments); such "virtual" students are also assigned to student groups but the tutor is only responsible for marking their exam papers.

3 GENETIC ALGORITHMS AND DECISION TREES FOR PREDICTION

In our work we have relied on decision trees to produce performance models. Decision trees can be considered as rule representations that, besides being accurate, can produce comprehensible output, which can also be evaluated from a qualitative point of view [1, 2]. In a decision tree, nodes contain test attributes and leaves contain class descriptors.

A decision tree for the (student) exam success analysis problem could look like the one in Figure 1 and tells us that a mediocre grade at the second assignment (root) is an indicator of possible failure (left branch) at the exams, whereas a non-mediocre grade refers the alert to the fourth (last) assignment.

Decision trees are usually produced by analyzing the structure of examples (training instances), which are given in a tabular form. An excerpt of a training set that could have produced such a tree is shown in Table 1. Note that the three examples shown are consistent with the decision tree. As this may not always be the case, there arises the need to measure accuracy, even on the training set, in order to compare the quality of two decision trees which offer competing explanations for the same data set.
Figure 1. A sample decision tree [3]. (Root test: Assgn2 in [3..6]; the left branch leads to FAIL, while the right branch tests Assgn4 < 3, leading to FAIL and PASS respectively.)

Note that the sample decision tree does not utilize data on either the first or the third assignment, but such data is shown in the associated table. Such dimensionality reduction information is typical of why decision trees are useful; if we consistently derive trees on some problem that seem to not use some data column, we feel quite safe to not collect measurements for that data column. Of course, simple correlation could also deliver such information; however, it is the visual representation advantages of decision trees that have rendered them very popular data analysis tools.

Table 1. A sample decision tree training set (adapted from [3]).

Assgn1  Assgn2  Assgn3  Assgn4  Exam
...     ...     ...     ...     ...
4.6     7.1     3.8     9.1     PASS
9.1     5.1     4.6     3.8     FAIL
7.6     7.1     5.8     6.1     PASS

Analyzing the performance of high-risk students is a goal towards achieving tutoring excellence. It is, thus, reasonable to assert that predicting a student's performance can enable a tutor to take early remedial measures by providing more focused coaching, especially in issues such as priority setting and time management.

Initial experimentation at HOU [4] consisted of using several machine learning techniques to predict student performance with reference to the final examination. The scope of the experimentation was to investigate the effectiveness and efficiency of machine learning techniques in such a context. The WEKA toolkit [5] was used because it supports a diverse collection of techniques. The key result was that learning algorithms could enable tutors to predict student performance with satisfying accuracy long before the final examination. The key finding that led to that result was that success in the initial written assignments is a strong indicator of success in the examination. Furthermore, our tutoring experience corroborates that finding.

We then employed the GATREE system [6] as the tool of choice for our experiments, to progressively set and test hypotheses of increasing complexity based on the data sets that were available from the university registry. The formation and development of these tests is the core content of this paper and is presented and discussed in detail in the following sections.

GATREE is a decision tree builder that employs genetic algorithms to evolve populations of decision trees; it was eventually used because it produces short comprehensible trees. Of course, GATREE was first used [3] to confirm the qualitative validity of the original experimental findings [4], also serving as result replication, before advancing to more elaborate experiments [7, 8, 9].

GATREE [6] evolves populations of trees according to a fitness function that allows for fine-tuning decision tree size vs. accuracy on the training set. At each generation, a certain population of decision trees is generated and sorted according to fitness. Based on that ordering, certain genetic operators are performed on some members of the population to produce a new population. For example, a mutation may modify the test attribute at a node or the class label at a leaf, while a cross-over may exchange parts between decision trees.

The fitness function is fitness_i = Correct_i² · x / (size_i² + x), for tree i. The first part of the product is the actual number of training instances that tree i classifies correctly. The second part of the product (the size factor) includes a factor x which regulates the relative contribution of the tree size to the overall fitness; thus, the payoff is greater for smaller trees.

When using GATREE, we used the default settings for the genetic algorithm operations and set the cross-over probability at 0.99 and the mutation probability at 0.01. Moreover, all but the simplest experiments (explicitly so identified in the following sections) were carried out using 10-fold cross-validation, on which all averages are based (i.e., one-tenth of the training set was reserved for testing purposes and the model was built by training on the remaining nine-tenths; furthermore, ten such stages were carried out by rotating the testing one-tenth).
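To make the size-accuracy trade-off of this fitness function concrete, the following minimal Python sketch (our illustration, not code from the GATREE distribution) scores two hypothetical trees of equal accuracy but different size:

    # A minimal sketch of the GATREE-style fitness (not the actual GATREE code);
    # 'correct' is the number of training instances a tree classifies correctly,
    # 'size' is its number of nodes, and x regulates the size penalty.
    def gatree_fitness(correct: int, size: int, x: float) -> float:
        # fitness_i = Correct_i^2 * x / (size_i^2 + x): payoff grows with
        # accuracy and, for fixed accuracy, is greater for smaller trees.
        return (correct ** 2) * x / (size ** 2 + x)

    # Example: two trees that classify 90 of 100 instances correctly;
    # the smaller one receives the higher fitness.
    print(gatree_fitness(correct=90, size=5, x=1000))   # ~7902.4
    print(gatree_fitness(correct=90, size=25, x=1000))  # ~4984.6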
4 DATA ANALYSIS AT A PROGRAMME LEVEL

Before advancing, we first review some aggregate statistics of the undergraduate informatics programme at HOU. First, Table 2 presents the success rates for the modules that we have analysed.

Table 2. Success (percentage) rates of modules.

        2004-5  2005-6  2006-7
INF10   35%     38%     33%
INF11   55%     52%     55%
INF12   39%     34%     35%
INF20   56%     44%     44%
INF21   37%     44%     37%
INF22   71%     61%     55%
INF23   N/A     83%     97%
INF24   70%     64%     58%
INF30   81%     85%     84%
INF31   93%     92%     85%
INF35   N/A     98%     93%
INF37   N/A     98%     98%
INF42   N/A     N/A     100%

Next, Table 3 presents the enrolment numbers for these modules. Note that, as we advance from junior to senior years, the overall enrolment is dramatically reduced and the success rates increase.
Table 3. Enrollment numbers at modules.

        2004-5  2005-6  2006-7
INF10   987     1,247   1,353
INF11   492     517     642
INF12   717     818     925
INF20   362     389     420
INF21   322     363     383
INF22   321     291     321
INF23   N/A     52      73
INF24   157     167     221
INF30   156     198     199
INF31   149     200     144
INF35   N/A     101     58
INF37   N/A     106     132
INF42   N/A     N/A     109

The above statistics are all drawn from the university registry and none is subject to any further processing. However, all results presented from now on refer to experiments carried out totally using the GATREE system, with the occasional help of some post-processing automation scripts.

4.1 Detecting a shift in exam grades

There is a straightforward way to attempt to answer this question. One can build a model that attempts to answer the success question for the first stage of the final exam. Then, one can build a model that attempts to answer the success question for the overall student grade. A gross disparity in these numbers should be indicative of an issue that merits investigation.

The simplest data to consider as input for this problem consists of exercise and exam grades, as in Table 1, omitting any other information (for example, which tutor was responsible for a student). The results reported are based on re-classification (we reserve a cross-validation-like mechanism for the more detailed experiments later on) and are shown in Table 4.

What does a difference signify? To answer that, one can take a step backwards and try to answer a simpler question: what does a large difference signify? We have elected to brand a difference as large when the re-classification accuracy of the same module for the same year differs by at least 20 percentage points when we compare the model predicting the pass/fail result of the first stage of the final exam and the corresponding model after a possible second stage (which is the actual pass/fail grade for the module). In Table 4 such differences are shown in bold.

There are two issues that become apparent when one views Table 4. The first is that whenever we observe an increase in the model accuracy when switching from the first exam (E) to the final grade (F), this is associated with senior modules where eventual success rates (see Table 2) are substantial. The only decrease is observed in a junior year module where success rates are considerably reduced compared to senior year modules.

Table 4. Model accuracies omitting tutor data.

        2004-5      2005-6      2006-7
        E     F     E     F     E     F
INF10   83    84    84    82    83    82
INF11   75    76    76    78    75    80
INF12   74    76    86    74    78    74
INF20   76    70    76    59    87    60
INF21   83    78    76    72    77    73
INF22   68    80    68    76    63    70
INF23   N/A   N/A   46    78    89    99
INF24   67    67    68    66    69    70
INF30   77    82    64    85    71    94
INF31   65    95    86    93    68    91
INF35   N/A   N/A   72    97    80    92
INF37   N/A   N/A   95    100   95    98
INF42   N/A   N/A   N/A   N/A   96    100

It is straightforward to attribute the increase in senior year modules to the fact that, eventually, students have to focus on their exam and pass the test, regardless of how well they did along the year. The large discrepancy, however, suggests that the exercises do not serve their goal well, which is to keep the students engaged in the learning process. One could say that exercises are less of a learning opportunity and more of a necessary evil.

The dramatic decrease in the 2006-7 year results of the INF20 module is quite interesting. It reflects, basically, a huge fail rate in the first stage of the exam, which is well served by a small model that predicts failure all around. When seen from that viewpoint, however, the relatively narrow margins of the junior year modules seem quite impressive, since they are also associated with low overall pass rates. The difference, however, is that the junior modules also report significant dropout rates, which skew pessimistically the rates reported in Table 2.
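Mechanically, the "at least 20 percentage points" rule can be applied as in the following Python sketch (ours; the data layout is hypothetical and the sample cells are copied from Table 4):

    # Sketch: flag module/year cells where the accuracies of the first-stage
    # model (E) and the final-grade model (F) differ by at least 20
    # percentage points, as in Table 4.
    accuracies = {  # module -> {year: (E, F)}
        "INF20": {"2005-6": (76, 59)},   # difference -17: not flagged
        "INF23": {"2005-6": (46, 78)},   # difference +32: flagged
        "INF30": {"2006-7": (71, 94)},   # difference +23: flagged
    }

    def large_differences(acc, threshold=20):
        for module, years in acc.items():
            for year, (e, f) in years.items():
                if abs(f - e) >= threshold:
                    yield module, year, f - e

    for module, year, diff in large_differences(accuracies):
        print(module, year, diff)   # INF23 2005-6 32, then INF30 2006-7 23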
4.2 Detecting tutor influence

If we take the data sets that were used in section 4.1 and put back in the information on which tutor was responsible for each student group, we can run the same experiments and try to see whether the tutor attribute will surface in some models (sample data are shown in Table 5).

In principle, observing models where the tutor attribute appears near the decision tree root would not be a good thing, suggesting that a crucial factor in student success is not the educational system itself but the tutor. As a matter of fact we can opt to not look for this information at all in the resulting trees; comparing the accuracies to the ones reported in Table 4 should suffice. These results are now shown in Table 6.
Table 5. An expanded sample training set (see Group).

Assgn1  Assgn2  Assgn3  Assgn4  Group     Exam
...     ...     ...     ...     ...       ...
4.6     7.1     3.8     9.1     Athens-1  PASS
9.1     5.1     4.6     3.8     Patras-1  FAIL
7.6     7.1     5.8     6.1     Athens-2  PASS

Table 6. Model accuracies including tutor data.

        2004-5      2005-6      2006-7
        E     F     E     F     E     F
INF10   82    83    80    79    82    81
INF11   75    77    76    78    75    80
INF12   75    77    81    72    80    72
INF20   76    72    76    62    87    61
INF21   84    77    74    74    75    72
INF22   66    80    68    74    62    75
INF23   N/A   N/A   52    82    90    99
INF24   63    69    69    69    66    74
INF30   75    82    60    88    75    94
INF31   67    94    85    93    89    91
INF35   N/A   N/A   72    98    76    90
INF37   N/A   N/A   96    100   94    98
INF42   N/A   N/A   N/A   N/A   96    100

This time we observe that the relative differences between the models which utilise the tutor attribute and the ones that do not are quite small. There are some very interesting cases, however. For example, the INF11 module demonstrates near zero differences throughout. It is interesting to note that this module utilizes a plenary exam marking session, which means that tutors get to mark exam papers drawn from all groups at random. This places only marginal administrative overhead and, when viewed from the point of model consistency, seems to be well worth it.

Another example is the INF31 module (shown in bold), which demonstrated a year where the tutor attribute seemed to be of paramount importance. In that year, the gap between the first exam stage and the final grade seems to be influenced by the tutors. It is now very narrow (89 to 91) while it was quite wide (68 to 91). This could suggest a relative gap in tutor homogeneity.

There is one other way to view the importance of the tutor attribute. One can derive a model for one module group and then attempt to use that model as a predictor of performance for the other module groups (within the same module). This approach, while suppressing the tutor attribute, essentially tests its importance by specifically segmenting the module data set along groups. The overall accuracy is then averaged over all individual tests. This is the lesion comparison; its results are shown in Table 7.

Table 7. Lesion study model accuracies including tutor data.

        2004-5      2005-6      2006-7
        E     F     E     F     E     F
INF10   78    78    75    75    77    77
INF11   70    74    72    75    71    74
INF12   71    68    77    69    75    71
INF20   70    65    72    60    82    61
INF21   79    68    69    65    69    64
INF22   57    74    61    68    60    65
INF23   N/A   N/A   65    74    83    98
INF24   62    70    66    69    63    66
INF30   70    82    64    84    65    89
INF31   63    91    79    91    59    83
INF35   N/A   N/A   66    97    72    91
INF37   N/A   N/A   95    98    91    95
INF42   N/A   N/A   N/A   N/A   93    100

We highlight (in bold) the main difference from the results in Table 4, where it now seems that the gap has been shortened somewhat. Surprisingly, it suggests an erratic intra-group consistency. Note also that this particular result in Table 4 was the only one not to pass the binary choice (50%) level, which it only just did in Table 6.

Furthermore, we tried to summarise the results from a further point of view: that of consistency between the results reported for the E and F columns of both tables. Essentially we computed the quantity (F6 − E6) − (F7 − E7) for each module for each year, where the subscript indicates which table that particular number was drawn from. Not surprisingly, the two singularities observed were module INF23 for year 2005-6 (with a value of about 20%: (82 − 52) − (74 − 65) = 21) and module INF31 for year 2006-7 (with a value of about −22%: (91 − 89) − (83 − 59) = −22).

4.3 Observing the accuracy-size trade-off

It is interesting to investigate whether the conventional wisdom on model characteristics is valid. In particular, we analysed the results in Table 6 and in Table 7 with respect to whether an increase (or decrease, accordingly) in model accuracy for a particular module for a year was associated with a reduction in model size. We say that the model accuracy increases if the accuracy for the E column of that year is less than the corresponding number in the F column. For the 68 pairs of numbers reported in Table 6 and in Table 7 we observed that only in 4 of them did we see the same direction in model accuracy and model size. So, conventional wisdom was confirmed in nearly 95% of the cases.

5 DISCUSSION

HOU has been the first university in Greece to operate, from its very first year, a comprehensive assessment scheme (on tutoring and administrative services). Despite a rather hostile political environment (at least in Greece), quite a few academic departments have lately been moving along the direction of introducing such schemes, though the practice has yet to be adopted at a university level. Still, however, there is quite a mentality shift required when considering the subtle differences between "measuring" and "assessing".

The act of measuring introduces some error in what is being measured. If indices are interpreted as assessment indices, then people (actually, any "assessed" subject where people are involved; groups of people, for example) will gradually skew their behaviour towards achieving good measurements. Such behaviour is quite predictably human, of
course; the problem is that it simply educates people in the ropes of the measurement system while sidelining the real issue of improving the educational service.

By shifting measurement to quantities that are difficult to "tweak", one hopes that people whose performance is assessed will gradually shift from fine-tuning their short-term behaviour toward achieving longer-term goals. Indeed, if people find out that the marginal gains from fine-tuning their behaviour are too small for the effort expended to achieve them, it will be easier to convince them to improve more fundamental attitudes towards tutoring (as far as tutors are concerned) or studying (as far as students are concerned).

In our application, this is demonstrated two-fold. First, by disseminating tutor group homogeneity indices, one hopes that, regardless of how we call these indices, these tutor groups will be motivated by peer pressure to consider their performance vis-a-vis other tutor groups. Even if that may not be really required, that introspection itself will quite likely improve how that particular tutor group co-operates; at least it will focus their decisions with respect to why such decisions might influence their overall ranking.

For students, a similar argument applies. Realising that one fits a model which predicts likely failure, even if one knows that the particular model errs quite often, is something that will most likely motivate that person to take a more decisive approach to studying. For adult students, such a decisive approach might even mean to drop a course of studying or to defer studying. This is not necessarily negative, however; knowing how to better utilise one's resources is a key skill in lifelong learning.

We have selected decision trees because we want to generate models that can be effectively communicated to tutors and students alike. We have also selected genetic algorithms to induce the decision trees because we have shown [7] that, for the particular application domain, we can derive small and easy to communicate yet accurate trees. We thus need a hybrid approach: rule-based output to be comprehensible and grounded, and evolutionary computing to derive this output.

Which other techniques should one utilise to develop the models? We cannot fail to note that conventional statistics can be cumbersome to disseminate to people with a background in humanities or arts, and this could have an adverse impact on the user acceptance of such systems. In that sense, the decision of whether the models are computed centrally or in a decentralized fashion (by devolving responsibility to the tutors, for example) is a key factor. In any case, deploying our measurement scheme in an organization-wide context would also lend support to our initial preference for short models. At the same time, the possibility of a decentralized scheme also suggests that we should strive to use tools that do not demand a steep learning curve on the part of the tutors.

Of course, one can take an alternative course and drop the requirement that a model has to be communicated. If we only focus on the indices then any technique can be used, from neural networks to more conventional ones, such as naive Bayes or logistic regression [4]. As in all data mining application contexts, it is the application that must drive the techniques to use; for our problem, suffice it to note that the comparisons reported (Table 4, Table 6 and Table 7) are essentially technique independent, yet the GATREE approach has proven to date to be the best method for prototyping the measurement exercise that we are developing.

6 CONCLUSION

We have shown how we have used a combination of genetic algorithms and decision trees in the context of experimenting with how one might set up a quality control system in an educational context.

Quality control should be a core aspect of any educational system, but setting up a system for quality control entails managerial and administrative decisions that may also have to deal with political side-effects. Deciding how to best, and as early as possible, defuse the potential stand-offs that a quality measurement message might trigger calls for the employment of techniques that not only ensure a basic technical soundness in the actual measurement but also cater to the way the results are conveyed and subsequently exploited. This is particularly so when the large-scale application context suggests that data and models will freely flow amongst thousands of tutors and tens of thousands of students.

We have earlier [9] expressed the view that our approach is applicable to any educational setting where performance measurement can be cast in terms of test (and exam) performance. In the present paper we have scaled up our analysis to cover several modules and years and still believe that taking the sting out of individual performance evaluation while still being able to convey the full message is a key component of tutoring self-improvement. Scaling our approach to other programmes and other institutions, and even obtaining the approval of our own university for official and consistent reporting of such indices, is, however, less of a technical nature and more of a political exercise. After all, we need to persuade people that some innovations are less of a threat and more of an opportunity.

ACKNOWLEDGEMENTS

Thanassis Hadzilacos (now at the Open University of Cyprus) has contributed to this line of research while at Hellenic Open University.

Anonymized data can be made available on request for research purposes only, on a case-by-case basis.

We acknowledge the advice, from an anonymous reviewer of the CIMA-ECAI08 workshop, on how to improve the presentation of this work to reflect the combination of AI techniques used.

REFERENCES

[1] Mitchell, T. (1997). Machine Learning. McGraw Hill.
[2] Quinlan, J.R. (1993). C4.5: Programs for machine learning. San Mateo, CA: Morgan Kaufmann.
[3] Kalles, D., & Pierrakeas, Ch. (2006). Analyzing student performance in distance learning with genetic algorithms and decision trees. Applied Artificial Intelligence, 20(8), pp. 655-674.
[4] Kotsiantis, S., Pierrakeas, C., & Pintelas, P. (2004). Predicting students' performance in distance learning using Machine Learning techniques. Applied Artificial Intelligence, 18(5), 411-426.
[5] Witten, I., & Frank, E. (2000). Data mining: practical machine learning tools and techniques with Java implementations. San Mateo, CA: Morgan Kaufmann.
[6] Papagelis, A., & Kalles, D. (2001). Breeding decision trees using evolutionary techniques. Proceedings of the International Conference on Machine Learning, Williamstown, Massachusetts, pp. 393-400, Morgan Kaufmann.
[7] Kalles, D., & Pierrakeas, Ch. (2006). Using Genetic Algorithms and Decision Trees for a posteriori Analysis and Evaluation of Tutoring Practices based on Student Failure Models. Proceedings of the 3rd IFIP Conference on Artificial Intelligence Applications and Innovations, Athens, Greece, pp. 9-18, Springer.
[8] Hadzilacos, Th., Kalles, D., Pierrakeas, Ch., & Xenos, M. (2006). On Small Data Sets Revealing Big Differences. Proceedings of the 4th Panhellenic Conference on Artificial Intelligence, Heraklion, Greece, Springer LNCS 3955, pp. 512-515.
[9] Hadzilacos, Th., & Kalles, D. (2006). On the Software Engineering Aspects of Educational Intelligence. Proceedings of the 10th International Conference on Knowledge-Based Intelligent Information & Engineering Systems, Bournemouth, UK, Springer LNCS 4252, pp. 1136-1143.
[10] Xenos, M., Pierrakeas, C., & Pintelas, P. (2002). A survey on student dropout rates and dropout causes concerning the students in the Course of Informatics of the Hellenic Open University. Computers & Education, 39, 361-377.
Recognizing predictive patterns in chaotic maps
Nicos G. Pavlidis1, Adam Adamopoulos2 and Michael N. Vrahatis3
Abstract. We investigate the existence of rules (in the form of binary patterns) that allow the short-term prediction of highly complex binary sequences. We also study the extent to which these rules retain their predictive power when the sequence is contaminated with noise. Complex binary sequences are derived by applying two binary transformations on real-valued sequences generated by the well known tent map. To identify short-term predictors we employ Genetic Algorithms. The dynamics of the tent map depend strongly on the value of the control parameter, r. The experimental results suggest that the same is true for the number of predictors. Despite the chaotic nature of the tent map and the complexity of the derived binary sequences, the results reported suggest that there exist settings in which an unexpectedly large number of predictive rules exists. Furthermore, rules that permit the risk-free prediction of the value of the next bit are detected in a wide range of parameter settings. By incorporating noise in the data generating process, the rules that allow the risk-free prediction of the next bit are eliminated. However, for small values of the variance of the Gaussian noise term there exist rules that retain much of their predictive power.

1 Institute for Mathematical Sciences, Imperial College London, South Kensington Campus, London SW7 2AZ, United Kingdom, email: n.pavlidis@imperial.ac.uk
2 Medical Physics Laboratory, Department of Medicine, Democritus University of Thrace, Alexandroupolis GR-68100, Greece, email: adam@med.duth.gr
3 Department of Mathematics, University of Patras, Patras GR-26110, Greece, email: vrahatis@math.upatras.gr

1 Introduction

In this paper we consider the problem of identifying rules, in the form of binary patterns, that are perfect, or in the worst case good, short-term predictors of complex binary sequences. A binary pattern of length L is defined as a perfect short-term predictor if its presence in any place of the binary sequence is declarative of the value of the next bit. By definition, perfect predictors enable the risk-free prediction of the next bit. Similarly, good short-term predictors are binary patterns whose appearance in any position of the binary sequence renders the value of the next bit highly predictable.

Complex binary sequences are derived through the application of binary transformations on real-valued data sequences obtained from the tent map. The tent map is a piecewise-linear, continuous map of the unit interval [0, 1] into itself:

    f_r(x) = r x,         x ∈ [0, 1/2],
    f_r(x) = r (1 − x),   x ∈ (1/2, 1],                                (1)

where r is a control parameter that assumes values in the interval [0, 2]. We consider a discrete process generated by:

    x_{n+1} = f_r(x_n) = f_r(f_r(. . .)) = f_r^{(n+1)}(x_0),   n = 0, 1, . . . ,    (2)

where f_r^{(n)} denotes the nth iterate of f_r. The Lyapunov exponent is given by:

    λ_r(x) = lim_{n→∞} (1/n) ln |(d/dx) f_r^{(n)}(x)| = ln r,

everywhere in [0, 1]. For r ∈ (0, 1), the orbit f_r^{(n)}(x_0), for any x_0 ∈ [0, 1], converges to the unique fixed point 0 as n increases. For r = 1, every point x ∈ [0, 1/2] is a fixed point. The chaotic region is 1 < r ≤ 2, in which λ_r > 0 [5]. For r > 1 the map has two unstable fixed points, one at 0 and the other at x*(r) = r/(r + 1). Using the notation in [5], we write x_n(r) ≡ f_r^{(n)}(1/2). Then x_1(r) = r/2 and x_2(r) = r(1 − r/2). The intervals (0, x_2(r)) and (x_1(r), 1) are transient for f_r, and we have f_r(A) = A for A = [x_2(r), x_1(r)]. If r ∈ (√2, 2], then A is an attractor. At r = √2, the attractor A splits into two bands, A_0 and A_1, at the position x = x*(r). For r ∈ (1, √2] we have f_r(A_0) = A_1 and f_r(A_1) = A_0. Similarly, at r_2 = 2^{1/4}, each of the two bands splits into two bands, A_{ij} (i, j = 0, 1). In this manner, as r decreases, band splitting occurs successively at r = r_1, r_2, . . . , r_m, . . ., where r_m = 2^{1/2^m} and m = 1, 2, . . .. Setting r_0 = 2, then, for r_{m+1} < r < r_m, there exist 2^m disjoint intervals A_{i_1,i_2,...,i_m}, i_k ∈ {0, 1}, in which the invariant density is positive (the 2^m-band regime). Defining l = 1 + i_1 + 2 i_2 + · · · + 2^{m−1} i_m and J_l ≡ A_{i_1,i_2,...,i_m}, it is shown in [5] that f_r(J_l) = J_{l+1} for 1 ≤ l ≤ 2^m − 1, and f_r(J_M) = J_1, where M = 2^m. Therefore, if r lies in the interval (1, √2], f_r maps a set of intervals between r − r²/2 and r/2 to themselves. If, on the other hand, r > √2 these intervals merge. This is illustrated in the bifurcation diagram of Fig. 1.

Real-world time series are frequently contaminated by noise. To this end, we investigate the resilience of the predictors to the presence of noise in the data generating process. We include an additive Gaussian noise term with zero mean in Eq. (1), and study the extent to which the predictors detected in the original sequences retain their predictive power for different values of the variance of the distribution.

2 Methods

The tent map, described in Eq. (1), was employed to generate raw data sequences x_n(x_0, r).
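As a quick numerical illustration (a sketch of ours in plain double precision, whereas the actual experiments use at least 5000-bit GMP floats, as described below), one can iterate Eq. (1) and check that, once transients are discarded, the orbit lies in the invariant interval A = [x_2(r), x_1(r)] discussed in Section 1:

    # Sketch: iterate the tent map of Eq. (1) and verify that the orbit stays
    # in A = [x_2(r), x_1(r)] = [r - r^2/2, r/2] once transients are discarded.
    # Double precision only; the paper's experiments use >= 5000-bit GMP floats.
    def tent(x: float, r: float) -> float:
        return r * x if x <= 0.5 else r * (1.0 - x)

    r, x = 1.7, 0.1
    for _ in range(10_000):                      # discard transient iterates
        x = tent(x, r)
    lo, hi = r - r * r / 2.0, r / 2.0            # x_2(r) and x_1(r)
    orbit = [x := tent(x, r) for _ in range(100_000)]
    print(min(orbit) >= lo - 1e-12, max(orbit) <= hi + 1e-12)  # True True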
To generate the raw data from the tent map, the GNU Multiple Precision Arithmetic Library (GMP) [1] was utilized to generate floating point numbers with a precision of at least 5000 bits. Subsequently, binary data sequences b_n(x_0, r) were produced by applying the simple, threshold, binary transformation originally proposed for the logistic equation in [4]:

    b_n(x_0, r) = 0, if x_n ≤ 0.5;   b_n(x_0, r) = 1, if x_n > 0.5.    (3)

Eq. (3) produces a bit with value '1' when the value of the tent map is greater than 0.5 and a bit with value '0' otherwise. To avoid transient phenomena, the first 10^4 iterations of the map were discarded. A number of real-valued sequences x_n(x_0, r) were generated through Eq. (1) for different values of the control parameter, r, and starting points, x_0. Binary sequences, b_n(x_0, r), of 10^6 bits were produced by applying Eq. (3) on the raw data, x_n(x_0, r).

A second binary transformation, also proposed in [4] for the logistic equation, was applied on the raw data. This transformation is also a simple, linear, threshold binary transformation, but with a variable threshold. The threshold value is the previous value of the raw data of the tent map. Hence, the second transformation is formulated as:

    b_n(x_0, r) = 0, if x_n ≤ x_{n−1};   b_n(x_0, r) = 1, if x_n > x_{n−1}.    (4)

The number of all possible patterns of length L, 2^L, increases exponentially with respect to L. For large values of L, therefore, it is infeasible to perform exhaustive search, and more efficient search methods, such as Genetic Algorithms (GAs), are required [2, 3]. To this end, a simple GA with binary representation was implemented and utilized. The GA population consisted of L-bit patterns. The fitness of a pattern p was the number of times p was encountered in the binary sequence b_n(x_0, r). The selection process used was roulette wheel selection. As crossover operator the well-known one-point crossover operator was employed. Finally, the mutation operator utilized was the flip bit mutation operator. GAs were applied for several values of L and a number of binary sequences, b_n(x_0, r). Consequently, patterns that can account as perfect, or good, predictors can be identified by comparing the obtained results for L-bit and (L + 1)-bit patterns.

Figure 1. Bifurcation diagram of the steady states of the tent map with respect to r.

3 Presentation of Results

3.1 Fixed threshold

In the following, we present indicative results for binary sequences of length 10^6, obtained by applying the transformation of Eq. (3). In Fig. 2 the distribution of bits with value '1' and '0' for different values of r is plotted. Evidently, an equal distribution of the two occurs only as r tends to 2.

Figure 2. Distribution of ones (dashed) and zeros (solid) according to the transformation of Eq. (3) for b_n(0.1, r) and r ∈ [1, 2] with stepsize 10^{-3}.

The number of distinct patterns of length L that appear for different values of r is reported in Table 1. In detail, the first column of Table 1 reports the value of r; the second column indicates the length of the binary patterns L; the third column corresponds to the number of different patterns of length L identified in each binary sequence (#f); and finally the fourth column reports the ratio of the number of patterns of length L found (#f) to the number of possible binary patterns of this length (2^L). The lower the ratio shown in the last column of the table, the fewer the patterns that appear in the binary sequence and hence the higher the predictability.

An inspection of Table 1 suggests that increasing the value of r gradually increases the number of patterns that are encountered for each value of L and hence degrades predictability. This effect becomes clear by comparing the results for r = 1.44 and r = 1.999. For r = 1.44 and L = 2, already the ratio of appearing to all possible patterns is 0.75, suggesting that one out of the four possible patterns is absent. This ratio decreases as L increases to reach 0.091 for L = 9, indicating that less than 10% of all possible patterns of this length are present in the sequence b_{10^6}(0.1, 1.44). On the contrary, for r = 1.999 all possible patterns appear for all the different values of L up to and including L = 9. It should be noted that for r = 1.999 and L = 10 the ratio of column four becomes less than unity, but still its value is very close to that, 0.999, suggesting that even in this case increasing L reduces the ratio but this effect takes place very slowly.
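The following Python sketch (ours, not the authors' implementation, and in double precision rather than GMP arithmetic) derives both binary sequences and computes the occurrence-count fitness that the GA maximises:

    # Sketch: the fixed-threshold transformation of Eq. (3), the variable-
    # threshold transformation of Eq. (4), and the GA fitness of a pattern,
    # i.e. its number of occurrences in the binary sequence.
    def tent(x: float, r: float) -> float:
        return r * x if x <= 0.5 else r * (1.0 - x)

    def raw_sequence(x0: float, r: float, n: int, transient: int = 10_000):
        x = x0
        for _ in range(transient):               # discard transient iterates
            x = tent(x, r)
        seq = []
        for _ in range(n):
            x = tent(x, r)
            seq.append(x)
        return seq

    xs = raw_sequence(0.1, 1.44, 100_000)
    b3 = [1 if x > 0.5 else 0 for x in xs]                         # Eq. (3)
    b4 = [1 if xs[i] > xs[i-1] else 0 for i in range(1, len(xs))]  # Eq. (4)

    def occurrences(pattern, bits):              # GA fitness of 'pattern'
        L = len(pattern)
        return sum(bits[i:i+L] == pattern for i in range(len(bits) - L + 1))

    # '00' should be absent for r = 1.44 (cf. Table 2), making the
    # unit-length pattern '0' a perfect predictor of a following '1'.
    print(occurrences([0, 0], b3))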
Next, the impact of introducing noise to the data generating process is investigated. A normally distributed, ε ∼ N(0, σ²), additive noise term was included in the tent map equation, yielding x_{n+1} = f_r(x_n) + ε, where f_r(x_n) is given by Eq. (1). It should be noted that we enforced the resulting raw data series to lie in the interval [0, 1] by rejecting realizations of the noise term that would result in x_{n+1} ∉ [0, 1]. The obtained experimental results for the most predictable binary sequence when no noise is included, b_{10^6}(0.1, 1.44), are summarised in Table 2. The first column of the table corresponds to the pattern length L; the second lists all the possible binary patterns of length L (due to space limitations, only patterns of length up to four are included); while columns three to six report the number of occurrences of each pattern for different values of the variance, σ², starting with the case of no noise (σ² = 0).

Table 1. Number of patterns in b_{10^6}(0.1, r) obtained through the transformation of Eq. (3) for different values of r.

r      L  #f   #f/2^L     r      L  #f   #f/2^L
1.44   2  3    0.750      1.7    6  36   0.562
       3  5    0.625             7  61   0.476
       4  7    0.437             8  105  0.410
       5  11   0.343             9  179  0.349
       6  15   0.234      1.8    2  4    1.000
       7  23   0.179             3  7    0.875
       8  31   0.121             4  13   0.812
       9  47   0.091             5  24   0.750
1.5    2  3    0.750             6  43   0.671
       3  5    0.625             7  78   0.609
       4  7    0.437             8  141  0.550
       5  11   0.343             9  253  0.494
       6  16   0.250      1.9    2  4    1.000
       7  25   0.195             3  8    1.000
       8  37   0.144             4  15   0.937
       9  57   0.111             5  29   0.906
1.6    2  3    0.750             6  55   0.859
       3  5    0.625             7  105  0.820
       4  8    0.500             8  199  0.777
       5  13   0.406             9  379  0.740
       6  21   0.328      1.999  2  4    1.000
       7  34   0.265             3  8    1.000
       8  55   0.214             4  16   1.000
       9  88   0.171             5  32   1.000
1.7    2  4    1.000             6  64   1.000
       3  7    0.875             7  128  1.000
       4  12   0.750             8  256  1.000
       5  21   0.656             9  512  1.000

Table 2. Patterns in b_{10^6}(0.1, 1.44) obtained through the transformation of Eq. (3) and different values of σ².

L  pattern  σ²=0.0   σ²=0.01  σ²=0.1   σ²=0.5
1  0        230307   246757   434808   552264
   1        769693   753243   565192   447736
2  00       0        17       190252   308874
   01       231742   246799   243209   244133
   10       231742   246799   243209   244133
   11       536514   506383   323328   202858
3  000      0        0        93237    173043
   001      0        17       97015    135831
   010      112119   119901   108060   133112
   011      119623   126897   135149   111021
   100      0        17       97015    135831
   101      231742   246782   146194   108301
   110      119623   126898   135148   111021
   111      416890   379485   188179   91837
4  0000     0        0        50612    96905
   0001     0        0        42625    76138
   0010     0        17       43068    74254
   0011     0        0        53947    61577
   0100     0        9        44101    74116
   0101     112119   119892   63959    58995
   0110     0        2915     55812    60753
   0111     119622   123982   79336    50268
   1000     0        0        42625    76138
   1001     0        17       54390    59693
   1010     112119   119884   64992    58858
   1011     119623   126897   81202    49443
   1100     0        8        52914    61715
   1101     119623   126890   82234    49306
   1110     119623   123983   79336    50268
   1111     297267   255502   108843   41569

Starting from the case of no noise, we observe that more than three quarters of the binary sequence consists of bits with value '1'. Furthermore, from the patterns with length two, the pattern '00' is missing, indicating that a '0' is always followed by a '1'. This fact renders the unit length pattern '0' (and consequently all patterns of any length ending with a '0') a perfect predictor, and hence approximately 23% of the sequence is perfectly predictable. The inclusion of the additive noise term distorts these findings gradually as the variance increases. For σ² = 0.01 findings are marginally altered, as the length two pattern '00' appears only 17 times in the length 10^6 binary sequence. Thus, the probability of a '1' following a bit with value '0' is 0.99993. For σ² = 0.1 and σ² = 0.5 this probability becomes 0.56109 and 0.44146 respectively. In the case of σ² = 0.5, therefore, the impact of noise is so large that the original finding is reversed and a '0' is more likely to be followed by a '0'. The fact that increasing the variance of the noise term deteriorates the predictability of the binary sequence is also evident from the fact that patterns that did not appear in the noise-free sequence appear frequently in the contaminated series. The predictive power of the binary pattern '0' (a perfect predictor in the noise-free binary sequence) with respect to the value of the variance of the additive noise term, σ², is illustrated in Fig. 3. To generate Fig. 3, σ² assumed values in the interval [0, 0.5] with stepsize 10^{-3}.

Figure 3. Predictive power of the unit length binary pattern '0' in the sequences obtained through the transformation of Eq. (3) with respect to the variance σ² of the noise term.
3.2 Variable threshold

In this subsection we present results from the analysis of binary sequences derived by applying the transformation of Eq. (4), according to which the threshold is equal to the previous value of the tent map. The distribution of bits with value '1' and '0' for different values of the control parameter r is illustrated in Fig. 4. Comparing Figs. 2 and 4 it is evident that the two transformations yield substantially different distributions of ones and zeros. For the second transformation, the proportion of ones exceeds that of zeros for r marginally larger than 1. As shown in Fig. 4, the two proportions are equal until r becomes equal to √2. This finding is attributed to the band splitting phenomenon, briefly described in Section 1, that occurs for r ∈ (1, √2] [5]. From that point onward their difference increases.

Figure 4. Distribution of ones (dashed) and zeros (solid) according to the transformation of Eq. (4) for b_n(0.1, r) and r ∈ [1, 2] with stepsize 10^{-3}.
Table 3. Number of patterns in b_{10^6}(0.1, r) obtained through the transformation of Eq. (4) for different values of r.

r      L  #f  #f/2^L     r      L  #f  #f/2^L
1.44   2  3   0.750      1.7    6  9   0.140
       3  4   0.500             7  12  0.093
       4  5   0.312             8  16  0.062
       5  6   0.187             9  21  0.041
       6  7   0.109      1.8    2  3   0.750
       7  8   0.062             3  5   0.625
       8  9   0.035             4  7   0.437
       9  10  0.019             5  10  0.312
1.5    2  3   0.750             6  14  0.218
       3  4   0.500             7  19  0.148
       4  5   0.312             8  27  0.105
       5  6   0.187             9  38  0.074
       6  7   0.109      1.9    2  3   0.750
       7  8   0.062             3  5   0.625
       8  9   0.035             4  8   0.500
       9  11  0.021             5  12  0.375
1.6    2  3   0.750             6  18  0.281
       3  4   0.500             7  27  0.210
       4  5   0.312             8  40  0.156
       5  7   0.218             9  59  0.115
       6  9   0.140      1.999  2  3   0.750
       7  12  0.093             3  5   0.625
       8  15  0.058             4  8   0.500
       9  19  0.037             5  13  0.406
1.7    2  3   0.750             6  21  0.328
       3  4   0.500             7  34  0.265
       4  5   0.312             8  55  0.214
       5  7   0.218             9  89  0.173

The number of patterns of different length L that appear in the binary sequences of length 10^6 are reported in Table 3 for different values of the control parameter r. More specifically, the first column of Table 3 reports the value of r; the second column corresponds to the length L of the binary patterns; the third column reports the number of different patterns of length L that were identified in the sequence (#f); and lastly, column four depicts the proportion of the patterns encountered (#f) to the number of possible binary patterns of length L (2^L).

As in the case of the fixed threshold binary transformation, increasing the value of the control parameter r increases the number of patterns that appear in the derived binary sequences. However, this effect is more pronounced for the fixed threshold transformation of Eq. (3) than for the variable threshold transformation of Eq. (4). Even for r = 1.999, Table 3 reports that the number of binary patterns of length two is three, suggesting that one pattern of length two does not appear, and hence a unit length perfect binary predictor exists. In contrast, for the fixed threshold binary transformation, Table 1, all four length two binary patterns are present in the sequences that are generated with r > 1.7. Moreover, the ratio of the patterns of length L found to the number of possible patterns of this length decreases more rapidly in the sequences generated by the variable threshold transformation. For instance, for r = 1.44, the number of patterns of length nine is 47 for the fixed threshold transformation, while for the variable threshold transformation this number is 10.

The impact of introducing noise on the short-term predictors is studied next. Table 4 reports the patterns of length two to four that were encountered in the binary sequence b_{10^6}(0.1, 1.44) that was obtained through the second transformation, for different values of σ². In detail, the first column of Table 4 corresponds to the length L of the patterns; the second column lists all possible binary patterns of this length; while columns three to six report the number of occurrences of each pattern in the binary sequences obtained for different values of the variance of the additive noise term ε ∼ N(0, σ²). Note that, as in the previous case, the resulting raw data sequence {x_n}_{n=0}^{10^6} was restrained in the interval [0, 1] by rejecting realizations of the noise term that would result in x_n ∉ [0, 1].

Starting from the case of no noise, we observe that zeros and ones are approximately equally distributed in the binary sequence. As in the case of the first transformation, the pattern '00' is missing from the patterns of length two, a finding which implies that a '0' is always followed by a '1', and hence all the binary patterns of any length that end with '0' are perfect predictors. Furthermore, the pattern '111' was not encountered, implying that '11' is always followed by a '0'. From the inspection of the findings for patterns of length three we also obtain a good predictor of length two, namely the pattern '01', for which the probability of appearance of '0' immediately after this pattern is 0.96919. Comparing the aforementioned findings for the case of no noise with the corresponding ones obtained by the first, fixed threshold, transformation, we conclude that the binary sequence obtained through the variable threshold transformation is more predictable.

As expected, the introduction of noise eliminates all the perfect predictors identified in the original binary sequence. For a low value of the variance, σ² = 0.01, the findings are marginally distorted. The previously not encountered pattern '00' appears 5979 times, yielding a probability of encountering a bit with the value '1' following a bit with a value of '0' equal to 0.987688. This probability is marginally lower than the corresponding probability for the first transformation. On the other hand, when the variance of the noise term increases to σ² = 0.1 and σ² = 0.5 this probability becomes 0.867348 and 0.698495, respectively. Both these probabilities
are higher than the corresponding ones for the case of the transformation of Eq. (3). Moreover, note that for the case of the first transformation and σ² = 0.5 a bit with value '0' is more likely to be followed by a bit with the same value (probability equal to 0.55854); a phenomenon that does not occur at present. For the pattern '11' the probability of encountering a zero immediately after it becomes 0.933909, 0.628256, and 0.717049, for σ² equal to 0.01, 0.1, and 0.5, respectively. Finally, for the pattern '01' the probability of zero after its appearance is 0.932387, 0.538762, and 0.568140 for σ² equal to 0.01, 0.1, and 0.5, respectively. The predictive power of the binary patterns '0' and '11' (perfect predictors in the noise-free binary sequence) and '01' (a good predictor in the noise-free binary sequence), with respect to the value of the variance of the additive noise term, σ², is illustrated in Fig. 5. To generate Fig. 5, σ² assumed values in the interval [0, 0.5] with a stepsize of 10^{-3}.

Table 4. Patterns in b_{10^6}(0.1, 1.44) obtained through the transformation of Eq. (4) and different values of σ².

L  pattern  σ²=0.0   σ²=0.01  σ²=0.1   σ²=0.5
1  0        492207   485787   399771   471981
   1        507793   514213   600229   528019
2  00       0        5979     52973    142274
   01       492415   479647   346368   329606
   10       492414   479647   346368   329606
   11       15169    34725    254289   198512
3  000      0        628      7257     32975
   001      0        5351     45716    109299
   010      477246   447216   186610   187262
   011      15169    32430    159758   142343
   100      0        5351     45716    109299
   101      492414   474296   300652   220307
   110      15168    32430    159758   142343
   111      0        2295     94530    56169
4  0000     0        71       979      6175
   0001     0        557      6278     26800
   0010     0        4840     24665    57459
   0011     0        511      21051    51840
   0100     0        3964     24580    59905
   0101     477246   443252   162030   127357
   0110     15168    30378    98071    98715
   0111     0        2052     61686    43628
   1000     0        557      6278     26800
   1001     0        4794     39438    82499
   1010     477245   442376   161945   129803
   1011     15169    31919    138707   90503
   1100     0        1387     21136    49394
   1101     15168    31043    138622   92949
   1110     0        2052     61686    43628
   1111     0        243      32844    12541

Figure 5. Predictive power of binary patterns identified in the sequences obtained through the transformation of Eq. (4) with respect to the variance σ² of the noise term.

4 Conclusions

Despite the chaotic nature of the tent map and the resulting complexity of the binary sequences that were derived after the application of two threshold binary transformations, a large number of short-term predictors was detected. The reported experimental results indicate that the binary sequences generated through the variable threshold binary transformation are more predictable than those obtained through the fixed threshold transformation. This finding is clearer for values of the control parameter, r, close to its upper bound, 2. Indeed, for r = 1.999 all the patterns of length up to nine appear in the binary sequences obtained through the first transformation, suggesting that there is no perfect predictor. On the contrary, for the sequences generated through the second transformation with the same value of r, only three out of the four possible patterns of length two are encountered, suggesting that there is a perfect short-term predictor of length one. The inclusion of an additive Gaussian noise term with zero mean in the tent map equation eliminated all perfect predictors. However, for small values of the variance of the Gaussian noise, binary patterns with high predictive power were identified.

Future work on the subject will include the investigation of multiplicative noise, as well as the application of this methodology to real-world time series and in particular financial time series. It is worth noting that the second binary transformation is particularly meaningful in the study of financial time series as it corresponds to the direction of change of the next value relative to the present one.

Acknowledgments

This work was partially supported by the Hellenic Ministry of Education and the European Union under Research Program PYTHAGORAS-89203.

REFERENCES

[1] Free Software Foundation Inc., GNU Multiple Precision Arithmetic Library ver. 4.1.4.
[2] J. H. Holland, Adaptation in Natural and Artificial Systems, University of Michigan Press, 1975.
[3] M. Mitchell, Introduction to Genetic Algorithms, MIT Press, 1996.
[4] N. G. Packard, 'A genetic learning algorithm for the analysis of complex data', Complex Systems, 4(5), 543–572, (1990).
[5] T. Yoshida, H. Mori, and H. Shigematsu, 'Analytic study of chaos of the tent map: Band structures, power spectra, and critical behaviors', Journal of Statistical Physics, 31(2), 279–308, (1983).
Improving the Accuracy of Neuro-Symbolic Rules with
Case-Based Reasoning
Jim Prentzas1, Ioannis Hatzilygeroudis2 and Othon Michail2
Abstract. In this paper, we present an improved approach integrating rules, neural networks and cases, compared to a previous one. The main approach integrates neurules and cases. Neurules are a kind of integrated rules that combine a symbolic (production rules) and a connectionist (adaline unit) representation. Each neurule is represented as an adaline unit. The main characteristics of neurules are that they improve the performance of symbolic rules and, in contrast to other hybrid neuro-symbolic approaches, retain the modularity of production rules and their naturalness to a large degree. In the improved approach, various types of indices are assigned to cases according to the different roles they play in neurule-based reasoning, instead of one. Thus, an enhanced knowledge representation scheme is derived, resulting in accuracy improvement. Experimental results demonstrate its effectiveness.

1 Technological Educational Institute of Lamia, Department of Informatics and Computer Technology, 35100 Lamia, Greece, email: dprentzas@teilam.gr.
2 University of Patras, Dept of Computer Engineering & Informatics, 26500 Patras, Greece, email: {ihatz, michailo}@ceid.upatras.gr.

1 INTRODUCTION

In contrast to rule-based systems that solve problems from scratch, case-based systems use pre-stored situations (i.e., cases) to deal with similar new situations. Case-based reasoning offers some advantages compared to symbolic rules and other knowledge representation formalisms. Cases represent specific knowledge of the domain, and are natural and usually easy to obtain [11], [12]. Incremental learning comes naturally to case-based reasoning. New cases can be inserted into a knowledge base without making changes to the preexisting knowledge. The more cases are available, the better the domain knowledge is represented. Therefore, the accuracy of a case-based system can be enhanced throughout its operation, as new cases become available. A negative aspect of cases compared to symbolic rules is that they do not provide concise representations of the incorporated knowledge. Also, it is not possible to represent heuristic knowledge. Furthermore, the time-performance of the retrieval operations is not always the desirable one.

Approaches integrating rule-based and case-based reasoning have given interesting and effective knowledge representation schemes and are becoming more and more popular in various fields [3], [13], [14], [15], [17], [18], [19]. The objective of these efforts is to derive hybrid representations that augment the positive aspects of the integrated formalisms and simultaneously minimize their negative aspects. The complementary advantages and disadvantages of rule-based and case-based reasoning are a good justification for their possible combination. The bulk of the approaches combining rule-based and case-based reasoning follow the coupling models [17]. In these models, the problem-solving (or reasoning) process is decomposed into tasks (or stages) for which different representation formalisms (i.e., rules or cases) are applied.

However, a more interesting approach is one integrating more than two reasoning methods towards the same objective. In [16] and [10], such an approach integrating three reasoning schemes, namely rules, neurocomputing and case-based reasoning, in an effective way is introduced. To this end, neurules and cases are combined. Neurules are a type of hybrid rules integrating symbolic rules with neurocomputing in a seamless way. Their main characteristic is that they retain the modularity of production rules and also their naturalness to a large degree. In that approach, on the one hand, cases are used as exceptions to neurules, filling their gaps in representing domain knowledge and, on the other hand, neurules perform indexing of the cases, facilitating their retrieval. Finally, it results in accuracy improvement.

In this paper, we enhance the above approach by employing different types of indices for the cases according to the different roles they play in neurule-based reasoning. In this way, an improved knowledge representation scheme is derived, as various types of neurules' gaps in representing domain knowledge are filled in by indexed cases. Experimental results demonstrate the effectiveness of the presented approach compared to our previous one.

The rest of the paper is organized as follows. Section 2 presents neurules, whereas Section 3 presents methods for constructing the indexing scheme of the case library. Section 4 describes the hybrid inference mechanism. Section 5 presents experimental results regarding the accuracy of the inference process. Section 6 discusses related work. Finally, Section 7 concludes.

2 NEURULES

Neurules are a type of hybrid rules integrating symbolic rules with neurocomputing, giving pre-eminence to the symbolic component. Neurocomputing is used within the symbolic framework to improve the performance of symbolic rules [7], [10]. In contrast to other hybrid approaches (e.g. [4], [5]), the constructed knowledge base retains the modularity of production rules, since it consists of autonomous units (neurules), and also retains their naturalness to a large degree,
since neurules look much like symbolic rules [7], [8]. Also, the inference mechanism is a tightly integrated process, which results in more efficient inferences than those of symbolic rules [7], [10]. Explanations in the form of if-then rules can be produced [9], [10].

2.1 Syntax and Semantics

The form of a neurule is depicted in Fig. 1a. Each condition Ci is assigned a number sfi, called its significance factor. Moreover, each rule itself is assigned a number sf0, called its bias factor. Internally, each neurule is considered as an adaline unit (Fig. 1b). The inputs Ci (i = 1, ..., n) of the unit are the conditions of the rule. The weights of the unit are the significance factors of the neurule and its bias is the bias factor of the neurule. Each input takes a value from the following set of discrete values: [1 (true), 0 (false), 0.5 (unknown)]. This gives the opportunity to easily distinguish between the falsity and the absence of a condition, in contrast to symbolic rules. The output D, which represents the conclusion (decision) of the rule, is calculated via the standard formulas:

$$D = f(a), \qquad a = sf_0 + \sum_{i=1}^{n} sf_i C_i$$

$$f(a) = \begin{cases} 1 & \text{if } a \ge 0 \\ -1 & \text{otherwise} \end{cases}$$

where a is the activation value and f(x) the activation function, a threshold function. Hence, the output can take one of two values ('-1', '1') representing failure and success of the rule respectively.

Fig. 1. (a) Form of a neurule (b) a neurule as an adaline unit

The general syntax of a condition Ci and the conclusion D is:

<condition> ::= <variable> <predicate> <value>
<conclusion> ::= <variable> <predicate> <value>

where <variable> denotes a variable, that is a symbol representing a concept in the domain, e.g. 'sex', 'pain' etc, in a medical domain. <predicate> denotes a symbolic or a numeric predicate. The symbolic predicates are {is, isnot}, whereas the numeric predicates are {<, >, =}. The predicate of a conclusion can only be a symbolic predicate. <value> denotes a value; it can be a symbol or a number. The significance factor of a condition represents the significance (weight) of the condition in drawing the conclusion(s). Table 1 (Section 3) presents two example neurules from a medical diagnosis domain.

Neurules can be constructed either from symbolic rules, thus exploiting existing symbolic rule bases, or from empirical data (i.e., training examples) (see [7] and [8] respectively). An adaline unit is initially assigned to each possible conclusion. Each unit is individually trained via the Least Mean Square (LMS) algorithm. When the training set is inseparable, special techniques are used. In that case, more than one neurule having the same conclusion is produced.

Table 1. Example neurules

NR1: (-23.9)
if patient-class is human0-20 (10.6),
   pain is continuous (10.5),
   fever is high (8.8),
   fever is medium (8.4),
   patient-class is human21-35 (6.2),
   fever is no-fever (2.7),
   ant-reaction is medium (1.1)
then disease-type is inflammation

NR2: (-13.4)
if patient-class is human21-35 (6.9),
   pain is continuous (3.2),
   joints-pain is yes (3.1),
   fever is low (1.5),
   fever is no-fever (1.5)
then disease-type is chronic-inflammation
‘unknown’, if there is a fact with the same variable, predicate
and ‘unknown’ as its value. A condition cannot be evaluated if
there is no fact in the working memory with the same variable.
In this case, either a question is made to the user to provide data
for the variable, in case of an input variable, or an intermediate
Fig. 1. (a) Form of a neurule (b) a neurule as an adaline unit
neurule with a conclusion containing the variable is examined,
in case of an intermediate variable. A condition with an input
The general syntax of a condition Ci and the conclusion D is: variable evaluates to ‘false’ (‘0’), if there is a fact in the
::= working memory with the same variable, predicate and
::= different value. A condition with an intermediate variable
where denotes a variable, that is a symbol evaluates to ‘false’ if additionally to the latter there is no
representing a concept in the domain, e.g. ‘sex’, ‘pain’ etc, in a unevaluated intermediate neurule that has a conclusion with the
medical domain. denotes a symbolic or a numeric same variable. Inference stops either when one or more output
predicate. The symbolic predicates are {is, isnot} whereas the neurules are fired (success) or there is no further action
numeric predicates are {<, >, =}. can only be a (failure).
symbolic predicate. denotes a value. It can be a symbol During inference, a conclusion is rejected (or not drawn)
or a number. The significance factor of a condition represents when none of the neurules containing it fires. This happens
the significance (weight) of the condition in drawing the when: (i) all neurules containing the conclusion have been
examined and are blocked or/and (ii) a neurule containing an
alternative conclusion for the specific variable fires instead. For instance, if all neurules containing the conclusion 'disease-type is inflammation' have been examined and are blocked, then this conclusion is rejected (or not drawn). If a neurule containing, e.g., the alternative conclusion 'disease-type is primary-malignant' fires, then the conclusion 'disease-type is inflammation' is rejected (or not drawn), no matter whether all neurules containing the conclusion 'disease-type is inflammation' have been examined (and are blocked) or not.

3 INDEXING

Indexing concerns the organization of the available cases so that combined neurule-based and case-based reasoning can be performed. Indexed cases fill in gaps in the domain knowledge representation by neurules and during inference may assist in reaching the right conclusion. To be more specific, cases may enhance neurule-based reasoning to avoid reasoning errors by handling the following situations:
(a) Examining whether a neurule misfires. If sufficient conditions of the neurule are satisfied so that it can fire, it should be examined whether the neurule misfires for the specific facts, thus producing an incorrect conclusion.
(b) Examining whether a specific conclusion was erroneously rejected (or not drawn).

In the approach in [10], the neurules contained in the neurule base were used to index cases representing their exceptions. A case constitutes an exception to a neurule if its attribute values satisfy sufficient conditions of the neurule (so that it can fire) but the neurule's conclusion contradicts the corresponding attribute value of the case. In the present approach, various types of indices are assigned to cases. More specifically, indices are assigned to cases according to the different roles they play in neurule-based reasoning and assist in filling in different types of gaps in the knowledge representation by neurules. Assigning different types of indices to cases can produce an effective approach combining symbolic rule-based with case-based reasoning [1].

In this new approach, a case may be indexed by neurules and by neurule base conclusions as well. In particular, a case may be indexed as:
(a) False positive (FP), by a neurule whose conclusion is contradicting. Such cases, as in our previous approach, represent exceptions to neurules and may assist in avoiding neurule misfirings.
(b) True positive (TP), by a neurule whose conclusion is endorsing. The attribute values of such a case satisfy sufficient conditions of the neurule (so that it can fire) and the neurule's conclusion agrees with the corresponding attribute value of the case. Such cases may assist in endorsing correct neurule firings.
(c) False negative (FN), by a conclusion erroneously rejected (or not drawn) by neurules. Such cases may assist in reaching conclusions that ought to have been drawn by neurules (and were not drawn). If neurules with alternative conclusions containing this variable were fired instead, they may also assist in avoiding neurule misfirings. 'False negative' indices are associated with conclusions and not with specific neurules, because there may be more than one neurule with the same conclusion in the neurule base.

The indexing process may take as input the following types of knowledge:
(a) Available neurules and non-indexed cases.
(b) Available symbolic rules and indexed cases. This type of knowledge concerns an available formalism of symbolic rules and indexed exception cases such as the one presented in [6].

The availability of data determines which type of knowledge is provided as input to the indexing module. If an available formalism of symbolic rules and indexed cases is presented as input, the symbolic rules are converted to neurules using the 'rules to neurules' module. The produced neurules are associated with the exception cases of the corresponding symbolic rules [10]. Exception cases are indexed as 'false positives' by neurules. Furthermore, for each case, 'true positive' and 'false negative' indices may be acquired using the same process as in type (a).

When available neurules and non-indexed cases are given as input to the indexing process, cases must be associated with neurules and neurule base conclusions. For each case, this information can be easily acquired as follows, until all intermediate and output attribute values of the case have been considered (see also the sketch below):
1. Perform neurule-based reasoning for the neurules based on the attribute values of the case.
2. If a neurule fires, check whether the value of its conclusion variable matches the corresponding attribute value of the case. If it does (doesn't), associate the case as a 'true positive' ('false positive') with this neurule.
3. Check all intermediate and final conclusions. Associate the case as a 'false negative' with each rejected (or not drawn) conclusion that ought to have been drawn based on the attribute values of the case.
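Purely as an illustration of steps 1-3, here is a sketch of our own (reusing evaluate_neurule from the earlier sketch): it simply evaluates every neurule, ignoring backward chaining and intermediate variables, and collects the index associations for one non-indexed case.

```python
# Sketch of index acquisition for a single non-indexed case (illustrative).
# `neurules` maps a name to (bias, conditions, conclusion); a conclusion is
# a triple such as ('disease-type', 'is', 'inflammation').

def index_case(case_facts, case_conclusions, neurules):
    """Return (neurule, 'TP'/'FP') and (conclusion, 'FN') associations."""
    indices, drawn = [], set()
    for name, (bias, conds, conclusion) in neurules.items():
        # step 1: neurule-based reasoning on the case's attribute values
        if evaluate_neurule(bias, conds, case_facts) == 1:
            drawn.add(conclusion)
            # step 2: endorsing conclusion -> 'TP', contradicting -> 'FP'
            indices.append((name, 'TP' if conclusion in case_conclusions
                            else 'FP'))
    # step 3: each conclusion of the case that was never drawn -> 'FN'
    indices.extend((c, 'FN') for c in case_conclusions if c not in drawn)
    return indices
```

For instance, applied to case C2 with the two neurules of Table 1, such a procedure associates C2 as a 'true positive' with NR1, matching Table 3 below.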
To illustrate how the indexing process works, we present the following example. Suppose that we have a neurule base containing the two neurules in Table 1 and the example cases shown in Table 2 (only the most important attributes of the cases are shown; the cases also possess other attributes, not shown in Table 2). 'disease-type' is the output attribute that corresponds to the neurules' conclusion variable. Table 3 shows the types of indices associated with each case in Table 2 at the end of the indexing process.

To acquire the indexing information, the input values corresponding to the attribute values of the cases are presented to the example neurules. Recall that when a neurule condition evaluates to 'true' it gets the value '1', whereas when it is false it gets '0'.

For example, given the input case C2, the final weighted sum of neurule NR1 is: -23.9 + 10.6 + 10.5 + 8.8 = 6 > 0. Note that the first three conditions of NR1 evaluate to 'true', whereas the remaining four (i.e., 'fever is medium', 'fever is no-fever', 'patient-class is human21-35' and 'ant-reaction is medium') evaluate to 'false' (not contributing to the weighted sum).
Table 2. Example cases

Case ID | patient-class | pain | fever | ant-reaction | joints-pain | disease-type
C1 | human21-35 | continuous | low | none | yes | chronic-inflammation
C2 | human0-20 | continuous | high | none | no | inflammation
C3 | human0-20 | night | high | none | no | inflammation
C4 | human0-20 | continuous | medium | none | no | inflammation
C5 | human21-35 | continuous | no-fever | medium | yes | chronic-inflammation
C6 | human0-20 | continuous | low | none | no | chronic-inflammation

The fact that the final weighted sum is positive means that sufficient conditions of NR1 are satisfied so that it can fire. Furthermore, the corresponding output attribute value of the case matches the conclusion of NR1 and therefore C2 is associated as a 'true positive' with NR1.

Table 3. Indices assigned to the example cases in Table 2

Case ID | Type of index | Indexed by
C1 | 'True positive' | Neurule NR2
C2 | 'True positive' | Neurule NR1
C3 | 'False negative' | Conclusion 'disease-type is inflammation'
C4 | 'True positive' | Neurule NR1
C5 | 'False positive' | Neurule NR1
C5 | 'True positive' | Neurule NR2
C6 | 'False negative' | Conclusion 'disease-type is chronic-inflammation'

Similarly, when the input values corresponding to the attribute values of cases C1 and C4 are given as input to the neurule base, sufficient conditions of neurules NR2 and NR1 respectively are satisfied so that they can fire, and the corresponding output attribute values of the cases match their conclusions. Furthermore, when the input values corresponding to the attribute values of case C5 are given as input to the neurule base, sufficient conditions of both neurules NR1 and NR2 are satisfied so that they can fire. However, the corresponding output attribute values of the case match the conclusion of NR2 and contradict the conclusion of NR1. In addition, the conclusion 'disease-type is inflammation' cannot be drawn when the input values corresponding to the attribute values of case C3 are given as input, because the only neurule with the corresponding conclusion (i.e., NR1) is blocked. A similar situation happens for case C6.

4 THE HYBRID INFERENCE MECHANISM

The inference mechanism combines neurule-based with case-based reasoning. The combined inference process mainly focuses on the neurules. The indexed cases are considered when: (a) sufficient conditions of a neurule are fulfilled so that it can fire, (b) all output or intermediate neurules with a specific conclusion variable are blocked and thus no final or intermediate conclusion containing this variable is drawn.

In case (a), firing of the neurule is suspended and case-based reasoning is performed for the cases indexed as 'false positives' and 'true positives' by the neurule and the cases indexed as 'false negatives' by alternative conclusions containing the neurule's conclusion variable. Cases indexed as 'true positives' by the neurule endorse its firing, whereas the other two sets of cases considered (i.e., 'false positives' and 'false negatives') prevent its firing. The results produced by case-based reasoning are evaluated in order to assess whether the neurule will fire or whether an alternative conclusion proposed by the retrieved case will be considered valid instead.

In case (b), the case-based module will focus on cases indexed as 'false negatives' by conclusions containing the specific (intermediate or output) variable.

The basic steps of the inference process are the following (see also the sketch below):
1. Perform neurule-based reasoning for the neurules.
2. If sufficient conditions of a neurule are fulfilled so that it can fire, then
   2.1. Perform case-based reasoning for the 'false positive' and 'true positive' cases indexed by the neurule and the 'false negative' cases associated with alternative conclusions containing the neurule's conclusion variable.
   2.2. If no case is retrieved or the best matching case is indexed as 'true positive', the neurule fires and its conclusion is inserted into the working memory.
   2.3. If the best matching case is indexed as 'false positive' or 'false negative', insert the conclusion supported by the case into the working memory and mark the neurule as 'blocked'.
3. If all intermediate neurules with a specific conclusion variable are blocked, then
   3.1. Examine all cases indexed as 'false negatives' by the corresponding intermediate conclusions, retrieve the best matching one and insert the conclusion supported by the retrieved case into the working memory.
4. If all output neurules with a specific conclusion variable are blocked, then
   4.1. Examine all cases indexed as 'false negatives' by the corresponding final conclusions, retrieve the best matching one and insert the conclusion supported by the retrieved case into the working memory.
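The decision in steps 2.1-2.3 can be sketched as follows (again our own illustration, using the data structures of the earlier sketches; the similarity function is passed in as a parameter, and an assumed concrete stand-in for it is given after the similarity discussion below):

```python
# Sketch of the firing decision of step 2 (illustrative). `candidates` holds
# the indexed cases gathered in step 2.1; each has .facts, .index_type
# ('TP', 'FP' or 'FN') and .conclusion attributes.

def decide_firing(neurule, conclusion, candidates, input_facts, similarity,
                  memory, blocked):
    best = max(candidates, default=None,
               key=lambda c: similarity(c.facts, input_facts))
    if best is None or best.index_type == 'TP':
        memory.add(conclusion)        # step 2.2: the neurule fires
    else:                             # step 2.3: best case is 'FP' or 'FN'
        memory.add(best.conclusion)   # conclusion supported by the case
        blocked.add(neurule)          # mark the neurule as 'blocked'
```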
The similarity measure between two cases ck and cl is calculated via a distance metric [1]. The best-matching case to the problem at hand is the one having the maximum similarity
with (minimum distance from) the input case. If multiple cases have a similarity equal to the maximum one, a simple heuristic is used.
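The paper does not spell out the metric or the tie-breaking heuristic (it cites [1] for the former); purely as an assumed stand-in usable with the decide_firing sketch above, a simple attribute-matching similarity with a first-found tie-break could be:

```python
# Assumed stand-in for the distance metric of [1] (illustrative only).

def similarity(case_facts, input_facts):
    """Fraction of attributes on which the two cases agree; maximum
    similarity corresponds to minimum distance."""
    keys = set(case_facts) | set(input_facts)
    if not keys:
        return 0.0
    return sum(case_facts.get(k) == input_facts.get(k) for k in keys) / len(keys)

def retrieve_best(cases, input_facts):
    """Best-matching case; ties broken by taking the first one found,
    which is one possible simple heuristic."""
    return max(cases, key=lambda c: similarity(c.facts, input_facts),
               default=None)
```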
Let us now present two simple inference examples concerning the combined neurule base (Table 1) and the indexed example cases (Tables 2 and 3). Suppose that during inference sufficient conditions of neurule NR1 are satisfied so that it can fire. Firing of NR1 is suspended and the case-based reasoning process focuses on the cases contained in the union of the following sets of indexed cases:
• the set of cases indexed as 'true positives' by NR1: {C2, C4},
• the set of cases indexed as 'false positives' by NR1: {C5} and
• the set of cases indexed as 'false negatives' by alternative conclusions containing the variable 'disease-type' (i.e., 'disease-type is chronic-inflammation'): {C6}.

So, in this example the case-based reasoning process focuses on the following set of indexed cases: {C2, C4} ∪ {C5} ∪ {C6} = {C2, C4, C5, C6}.

Suppose now that during inference both output neurules in the example neurule base are blocked. The case-based reasoning process will focus on the cases contained in the union of the following sets of indexed cases:
• the set of cases indexed as 'false negatives' by the conclusion 'disease-type is inflammation': {C3},
• the set of cases indexed as 'false negatives' by the conclusion 'disease-type is chronic-inflammation': {C6}.

Therefore, in this example the case-based reasoning process focuses on the following set of indexed cases: {C3} ∪ {C6} = {C3, C6}.

5 EXPERIMENTAL RESULTS

In this section, we present experimental results using datasets acquired from [2]. Note that there are no intermediate conclusions in these datasets. The experimental results involve evaluation of the presented approach combining neurule-based and case-based reasoning and comparison with our previous approach [10]. 75% and 25% of each dataset were used as training and testing sets respectively. Each initial training set was used to create a combined neurule base and an indexed case library. For this purpose, each initial training set was randomly split into two disjoint subsets, one used to create neurules and one used to create an indexed case library. More specifically, 2/3 of each initial training set was used to create neurules by employing the 'patterns to neurules' module [8], whereas the remaining 1/3 of each initial training set constituted the non-indexed cases. Both types of knowledge (i.e., neurules and non-indexed cases) were given as input to the indexing construction module presented in this paper, producing a combined neurule base and an indexed case library, which will be referred to as NBRCBR. Neurules and non-indexed cases were also used to produce a combined neurule base and an indexed case library stored according to [10], which will be referred to as NBRCBR_PREV.
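Schematically, the construction of the experimental knowledge bases follows this split (a sketch under the stated 75/25 and 2/3-1/3 proportions; build_neurules and build_indices are placeholders standing in for the 'patterns to neurules' module [8] and the indexing module of Section 3):

```python
import random

def build_knowledge(dataset, build_neurules, build_indices, seed=0):
    """Split a dataset as described: 75% training / 25% testing; the
    training part is split 2/3 (neurule creation) / 1/3 (cases to index)."""
    data = list(dataset)
    random.Random(seed).shuffle(data)
    cut = int(0.75 * len(data))
    training, testing = data[:cut], data[cut:]
    cut2 = (2 * len(training)) // 3
    neurules = build_neurules(training[:cut2])          # 'patterns to neurules'
    library = build_indices(neurules, training[cut2:])  # indexing module
    return neurules, library, testing
```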
Inferences were run for both NBRCBR and NBRCBR_PREV using the testing sets. Inferences from NBRCBR_PREV were performed using the inference mechanism combining neurule-based and case-based reasoning as described in [10]. Inferences from NBRCBR were performed according to the inference mechanism described in this paper. No test case was stored in the case libraries.

Table 4 presents the experimental results regarding inferences from NBRCBR and NBRCBR_PREV. It presents results regarding the classification accuracy of the integrated approaches and the percentage of test cases resulting in neurule-based reasoning errors that were successfully handled by case-based reasoning. Column '% FPs handled' refers to the percentage of test cases resulting in neurule misfirings (i.e., 'false positives') that were successfully handled by case-based reasoning. Column '% FNs handled' refers to the percentage of test cases resulting in having all output neurules blocked (i.e., 'false negatives') that were successfully handled by case-based reasoning. 'False negative' test cases are handled in NBRCBR_PREV by retrieving the best-matching case from the whole library of indexed cases.

Table 4. Experimental results

Dataset | NBRCBR Classification Accuracy | NBRCBR % FPs Handled | NBRCBR % FNs Handled | NBRCBR_PREV Classification Accuracy | NBRCBR_PREV % FPs Handled | NBRCBR_PREV % FNs Handled
Car (1728 patterns) | 96.04% | 52.81% | 64.07% | 92.49% | 15.51% | 20.36%
Nursery (12960 patterns) | 98.92% | 58.68% | 52.94% | 97.68% | 6.60% | 18.82%

As can be seen from the table, the presented approach results in improved classification accuracy. Furthermore, in inferences from NBRCBR the percentages of both 'false positive' and 'false negative' test cases successfully handled are greater than the corresponding percentages in inferences from NBRCBR_PREV. The results also show that there is still room for improvement.

We also tested a nearest neighbor approach working alone on these two datasets (75% of the dataset used as case library and 25% of the dataset used as testing set), using the similarity measure presented in Section 4. This approach classified the input case to the conclusion supported by the best-matching case retrieved from the case library. Its classification accuracy for the car and nursery datasets is 90.45% and 96.67% respectively. So, both integrated approaches perform better. This is due to the fact that the indexing schemes assist in focusing on specific parts of the case library.
7 CONCLUSIONS

In this paper, we present an approach integrating neurule-based and case-based reasoning that improves a previous hybrid approach [10]. Neurules are a type of hybrid rules integrating symbolic rules with neurocomputing. In contrast to other neuro-symbolic approaches, neurules retain the naturalness and modularity of symbolic rules. The integration of neurules and cases is done in order to improve the accuracy of the inference mechanism. Cases are indexed according to the roles they can play during neurule-based inference. More specifically, they are associated as 'true positives' and 'false positives' with neurules and as 'false negatives' with neurule base conclusions.

The presented approach integrates three types of knowledge representation schemes: symbolic rules, neural networks and case-based reasoning. Most hybrid intelligent systems implemented in the past usually integrate two intelligent technologies, e.g. neural networks and expert systems, neural networks and fuzzy logic, or genetic algorithms and neural networks. A new development that should receive interest in the future is the integration of more than two intelligent technologies, facilitating the solution of complex problems and exploiting multiple types of data sources.

References

[1] G. Agre, 'KBS Maintenance as Learning Two-Tiered Domain Representation', in M.M. Veloso, A. Aamodt (Eds.), Case-Based Reasoning Research and Development, First International Conference, ICCBR-95, Proceedings, Lecture Notes in Computer Science, Vol. 1010, Springer-Verlag, 108-120, 1995.
[2] A. Asuncion, D.J. Newman, 'UCI Repository of Machine Learning Databases' [http://www.ics.uci.edu/~mlearn/MLRepository.html], Irvine, CA, University of California, School of Information and Computer Science, (2007).
[3] N. Cercone, A. An, C. Chan, 'Rule-Induction and Case-Based Reasoning: Hybrid Architectures Appear Advantageous', IEEE Transactions on Knowledge and Data Engineering, 11, 164-174, (1999).
[4] S.I. Gallant, Neural Network Learning and Expert Systems, MIT Press, 1993.
[5] A.Z. Ghalwash, 'A Recency Inference Engine for Connectionist Knowledge Bases', Applied Intelligence, 9, 201-215, (1998).
[6] A.R. Golding, P.S. Rosenbloom, 'Improving Accuracy by Combining Rule-Based and Case-Based Reasoning', Artificial Intelligence, 87, 215-254, (1996).
[7] I. Hatzilygeroudis, J. Prentzas, 'Neurules: Improving the Performance of Symbolic Rules', International Journal on AI Tools, 9, 113-130, (2000).
[8] I. Hatzilygeroudis, J. Prentzas, 'Constructing Modular Hybrid Rule Bases for Expert Systems', International Journal on AI Tools, 10, 87-105, (2001).
[9] I. Hatzilygeroudis, J. Prentzas, 'An Efficient Hybrid Rule-Based Inference Engine with Explanation Capability', Proceedings of the 14th International FLAIRS Conference, AAAI Press, 227-231, (2001).
[10] I. Hatzilygeroudis, J. Prentzas, 'Integrating (Rules, Neural Networks) and Cases for Knowledge Representation and Reasoning in Expert Systems', Expert Systems with Applications, 27, 63-75, (2004).
[11] J. Kolodner, Case-Based Reasoning, Morgan Kaufmann Publishers, San Mateo, CA, 1993.
[12] D.B. Leake (ed.), Case-Based Reasoning: Experiences, Lessons & Future Directions, AAAI Press/MIT Press, 1996.
[13] M.R. Lee, 'An Exception Handling of Rule-Based Reasoning Using Case-Based Reasoning', Journal of Intelligent and Robotic Systems, 35, 327-338, (2002).
[14] C.R. Marling, M. Sqalli, E. Rissland, H. Munoz-Avila, D. Aha, 'Case-Based Reasoning Integrations', AI Magazine, 23, 69-86, (2002).
[15] S. Montani, R. Bellazzi, 'Supporting Decisions in Medical Applications: the Knowledge Management Perspective', International Journal of Medical Informatics, 68, 79-90, (2002).
[16] J. Prentzas, I. Hatzilygeroudis, 'Integrating Hybrid Rule-Based with Case-Based Reasoning', in S. Craw and A. Preece (Eds.), Advances in Case-Based Reasoning, Proceedings of the European Conference on Case-Based Reasoning, ECCBR-2002, Lecture Notes in Artificial Intelligence, Vol. 2416, Springer-Verlag, 336-349, 2002.
[17] J. Prentzas, I. Hatzilygeroudis, 'Categorizing Approaches Combining Rule-Based and Case-Based Reasoning', Expert Systems, 24, 97-122, (2007).
[18] E.L. Rissland, D.B. Skalak, 'CABARET: Rule Interpretation in a Hybrid Architecture', International Journal of Man-Machine Studies, 34, 839-887, (1991).
[19] D. Rossille, J.-F. Laurent, A. Burgun, 'Modeling a Decision Support System for Oncology using Rule-Based and Case-Based Reasoning Methodologies', International Journal of Medical Informatics, 74, 299-306, (2005).
[20] H. Vafaie, C. Cecere, 'CORMS AI: Decision Support System for Monitoring US Maritime Environment', Proceedings of the 17th Innovative Applications of Artificial Intelligence Conference (IAAI), AAAI Press, 1499-1507, (2005).
Combinations of Case-Based Reasoning with Other
Intelligent Methods
Jim Prentzas1 and Ioannis Hatzilygeroudis2
Abstract. Case-based reasoning is a popular approach used in intelligent systems. Whenever a new case has to be dealt with, the most similar cases are retrieved from the case base and their encompassed knowledge is exploited in the current situation. Combinations of case-based reasoning with other intelligent methods have been explored, deriving effective knowledge representation schemes. Although some types of combinations have been explored extensively, other types have not been thoroughly investigated. In this paper, we briefly outline popular case-based reasoning combinations. More specifically, we focus on combinations of case-based reasoning with rule-based reasoning, soft computing and ontologies. We illustrate basic types of such combinations and discuss future directions.

1 Technological Educational Institute of Lamia, Department of Informatics and Computer Technology, 35100 Lamia, Greece, email: dprentzas@teilam.gr.
2 University of Patras, Dept of Computer Engineering & Informatics, 26500 Patras, Greece, email: ihatz@ceid.upatras.gr.

1 INTRODUCTION

Case-based representations store a large set of previous cases with their solutions in the case base, using them whenever a similar new case has to be dealt with [19], [22]. Whenever a new input case comes in, a case-based system performs inference in four phases known as the case-based reasoning (CBR) cycle [1]: (i) retrieve, (ii) reuse, (iii) revise and (iv) retain. The retrieval phase retrieves from the case base the stored case(s) most relevant to the new case; indexing schemes and similarity metrics are used for this purpose. In the reuse phase, a solution for the new case is created based on the retrieved most relevant case(s). The revise phase validates the correctness of the proposed solution, perhaps with the intervention of the user. Finally, the retain phase decides whether the knowledge learned from the solution of the new case is important enough to be incorporated into the system.
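As a schematic summary of the cycle, the following is a minimal sketch of our own (the four phase policies are supplied by the caller rather than fixed by CBR itself):

```python
def cbr_cycle(new_case, case_base, retrieve, reuse, revise, worth_retaining):
    """One pass of the CBR cycle over a list-based case base."""
    similar = retrieve(new_case, case_base)      # (i) retrieve relevant cases
    proposed = reuse(new_case, similar)          # (ii) reuse their solutions
    solution = revise(new_case, proposed)        # (iii) revise, e.g. by a user
    if worth_retaining(new_case, solution):      # (iv) retain if important
        case_base.append((new_case, solution))
    return solution
```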
CBR can be effectively combined with other intelligent methods [25], [31]. Two main trends for CBR combinations can be discerned. The first trend involves embedded approaches, in which the primary intelligent method (usually CBR) embeds one or more other intelligent methods to assist its internal online and offline tasks. The second combination trend involves approaches in which the problem solving process can be decomposed into tasks for which different representation formalisms are required or available. In such situations, a CBR system as a whole (with its possible internal modules) is integrated 'externally' with other intelligent systems to create an improved overall system.

Popular CBR combinations involve combinations with rule-based reasoning (RBR), model-based reasoning (MBR) and soft computing methods. CBR has also been combined with other intelligent methods (e.g. ontologies). In certain CBR combinations both combination trends have been followed; in other combinations one of the two trends is mostly explored.

In this paper, we briefly discuss aspects involving CBR combinations. We focus on the intelligent methods with which CBR is usually combined. Our purpose is not to present an extensive survey of developed CBR combinations but to present their key aspects.

3 COMBINATIONS OF CBR

Combinations of CBR with other intelligent methods have been explored for more effective knowledge representation and problem solving. CBR can be combined with various intelligent methods. However, CBR is usually combined with RBR, MBR and soft computing methods.

To categorize CBR combinations one could use Medsker's general categorization scheme for integrated intelligent systems [26]. Medsker distinguishes five main combination models: standalone, transformational, loose coupling, tight coupling and fully integrated models. The distinction between those models is based on the degree of coupling between the integrated components. Underlying categories for some of these models are also defined. The main types of underlying categories for loose and tight coupling models involve pre-processing, post-processing and co-processing models, as well as embedded processing (for tight coupling models only). Not all of these combination models and/or their underlying categories have been thoroughly explored in the case of CBR combinations.

The types of combination models that have been applied to CBR combinations depend on the nature of the other intelligent methods combined with CBR. Some combination models are difficult to apply in certain CBR combinations. For instance, it is difficult to apply the fully integrated model in combinations of RBR with CBR. Obviously, the standalone model can be applied to combinations of CBR with any other method.

Generally speaking, coupling models are the most usual CBR combination models. More specifically, embedded coupling approaches constitute perhaps the most popular trend.
Most of the combinations following this trend use other intelligent methods to assist various CBR tasks. CBR is a generic methodology for building knowledge-based systems, and its internal reasoning tasks can be implemented using a number of techniques as long as the guiding CBR principles are followed [36]. The reverse approach, that is, embedding case-based modules into intelligent systems employing other representations to assist in their internal tasks, does not seem to be popular, with the exception of combinations with genetic algorithms. In combinations of CBR with RBR and MBR, various coupling approaches have also been investigated besides embedded approaches [31]. In coupling combinations of CBR with soft computing methods, embedded approaches seem to be the most thoroughly investigated.

In the following, we discuss the main issues involving combinations of CBR with RBR, fuzzy logic, neural networks, genetic algorithms and ontologies.

3.1 Combinations of CBR with RBR

Various types of coupling models involving combinations of CBR and RBR have been investigated, i.e., sequential processing, co-processing and embedded processing [31].

In sequential processing, information (produced by reasoning) necessarily passes sequentially through some or all of the combined modules to produce the final result [33], [11]. In co-processing approaches, the combined modules closely interact in producing the final result. Such systems can be discerned into two types: cooperation-oriented, which give emphasis to cooperation, and reconciliation-oriented, which give emphasis to reconciliation. In the former type, the combined components cooperate with each other (usually by interleaving their reasoning steps) [27], [32]. In the latter, each component produces its own conclusion, possibly differing from the conclusion of the other component, and thus a reconciliation process is necessary [14].

In embedded processing, CBR systems employ one or more RBR modules to perform tasks of their CBR cycle (e.g. retrieval and adaptation). Such approaches are quite common in CBR, especially for adaptation. RBR systems embedding CBR modules do not seem to exist.

3.2 Combinations of CBR with Fuzzy Logic

CBR can be combined with fuzzy logic in fruitful ways in order to handle imprecision. A usual approach is the incorporation of fuzzy logic into a CBR system in order to improve CBR aspects [4], [29], [35], [9]. Such combinations have been vastly explored, as imprecision and uncertainty are inherent in various CBR tasks. Fuzzy terms may be used in case representation, enabling a flexible encoding of case features that encompasses imprecise and uncertain information. Fuzzy logic may also prove very useful in indexing and retrieval. Fuzzy indexing enables multiple indexing of a case on a single feature with different degrees of membership [35]. Fuzzy similarity assessment and matching methods can produce more accurate results. Fuzzy clustering and classification methods can also be applied in case retrieval. In addition, fuzzy adaptation rules can be employed in case adaptation.

The works concerning the combination of RBR with CBR [31] could potentially be improved with the use of fuzzy rules. Investigation of coupling approaches in combinations of CBR with fuzzy systems, besides embedded ones, could be fruitful.

3.3 Combinations of CBR with Neural Networks

Neural networks are usually employed by CBR to perform tasks such as indexing, retrieval and adaptation. In this way, appealing characteristics of neural networks such as parallelism, robustness, adaptability, generalization and the ability to cope with incomplete input data are exploited [10], [35]. Due to the fact that different types of neural networks have been developed (e.g. back propagation neural networks, radial basis function networks, Self-Organizing Map networks, ART networks), different types of neural capabilities for classification and clustering can be exploited. Certain CBR approaches have employed different types of neural networks for the various internal CBR tasks (e.g. [12], [34]). Knowledge extracted from neural networks could also be exploited by CBR [10], [35]. An interesting direction could involve non-embedded coupling approaches combining CBR with neural networks.

3.4 Combinations of CBR with Genetic Algorithms

Usual combinations of CBR with genetic algorithms (GAs) involve the use of GAs to optimize (one or more) aspects of a CBR system. On the other hand, CBR can be exploited to enhance GAs. Other types of combinations of CBR with GAs can also be implemented.

GAs can be used within CBR to enhance indexing and retrieval. GAs have been used to assign case feature weights enhancing similarity assessment [39], [8], to perform feature selection [18] and, generally, to select relevant indices for evolving environments. GAs have also been used to retrieve multiple similar cases [38]. If k-nearest-neighbor retrieval is applied, genetic algorithms can be used to find the optimal k parameter in order to improve the retrieval accuracy [2]. Furthermore, GAs can be used to perform instance selection, i.e., finding the representative cases in a case base and determining a reduced subset of a case base. In this way, time performance is improved by reducing the search space and accuracy can be improved through the elimination of noisy and useless cases [2]. A sketch of the feature-weighting use follows.
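As a toy illustration of GA-based feature weighting for retrieval (our own sketch, not any of the cited systems), a tiny GA can evolve weights that maximize leave-one-out nearest-neighbor accuracy over labelled cases:

```python
import random

def weighted_similarity(weights, a, b):
    """Weighted matching similarity between two feature dictionaries."""
    return sum(weights[k] for k in weights if a.get(k) == b.get(k))

def loo_accuracy(weights, cases):
    """Leave-one-out 1-NN accuracy: the GA's fitness function."""
    hits = 0
    for i, (feats, label) in enumerate(cases):
        rest = cases[:i] + cases[i + 1:]
        _, predicted = max(rest,
                           key=lambda c: weighted_similarity(weights, feats, c[0]))
        hits += (predicted == label)
    return hits / len(cases)

def evolve_weights(cases, keys, pop_size=20, generations=30, seed=0):
    """Tiny GA over feature-weight vectors (truncation selection,
    uniform crossover, Gaussian mutation). `keys` is a list of features."""
    rng = random.Random(seed)
    population = [{k: rng.random() for k in keys} for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=lambda w: loo_accuracy(w, cases), reverse=True)
        parents = population[:pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            p, q = rng.sample(parents, 2)
            child = {k: rng.choice((p[k], q[k])) for k in keys}  # crossover
            m = rng.choice(keys)                                 # mutation
            child[m] = min(1.0, max(0.0, child[m] + rng.gauss(0, 0.1)))
            children.append(child)
        population = parents + children
    return max(population, key=lambda w: loo_accuracy(w, cases))
```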
Additionally, GAs have been used to enhance case adaptation [16], [17]. Genetic algorithms can also optimize case representation, e.g. by performing case feature discretization [18] and removing irrelevant features. Such optimizations improve accuracy, search time and storage requirements. It is also quite usual to simultaneously optimize more than one CBR aspect with GAs (e.g. [2], [18]).

On the other hand, CBR can be employed to enhance GAs. CBR can be applied to GAs by creating cases to track the history of a search. This case base can contribute to the understanding of how a solution was reached, why a solution works, and what the search space looks like. It could thus be
used to design highly tailored search strategies for future use [23]. Such an approach could therefore be used to explain the results of the genetic algorithm and for knowledge extraction. Moreover, similar stored cases can also be incorporated into a genetic algorithm to reduce convergence time and improve solution accuracy. GAs randomly initialize their starting population. Instead, relevant stored cases can be used as part of the initial population (solution) of the GA. Additionally, relevant stored cases can be periodically injected into the pool of chromosomes while the genetic algorithm runs [24], [7]. In certain approaches, CBR is exploited by GAs for both knowledge extraction and case injection [30].

3.5 Combinations of CBR with Ontologies

Ontologies facilitate knowledge sharing and reuse. They can provide an explicit conceptualization describing data semantics and a shared and common understanding of the domain knowledge that can be communicated among agents and application systems [6]. Ontologies play a crucial role in enabling the processing and sharing of knowledge between programs on the Web [21]. Intelligent Decision Support Systems in the Semantic Web framework should be able to handle, integrate with and reason from distributed data and information on the Web [3].

Therefore, ontologies can be combined with CBR in various ways. Ontologies can be used by a CBR system to represent the input problem [20], and to enhance similarity assessment [13], case representation, case abstraction and case adaptation [3]. Ontologies may serve all such CBR tasks [37].

3.6 Combinations of CBR with Multiple Intelligent Methods

The previous sections focused on combinations of CBR with one other individual intelligent method. However, intelligent systems have been developed that combine CBR with multiple other intelligent methods. Such multi-integrated paradigms usually follow a coupling model.

Obviously, a CBR system may employ multiple intelligent methods (e.g. rules and various soft computing methods) to perform its internal tasks [36]. Typical examples of approaches employing multiple soft computing methods within the CBR cycle are presented in [12] and [34]. In [12] all four phases of the CBR cycle employ soft computing methods: a self-organizing neural network for retrieval, a radial basis function neural network for reuse, fuzzy systems for revision, and all of the soft computing methods for retention. In [34] fuzzy logic, supervised and unsupervised neural networks and a genetic algorithm are employed for case representation, indexing, retrieval and adaptation.

More interesting approaches concern multi-integrated systems not following the embedded approach. Typical such multi-integrated approaches involve combinations of CBR, RBR and MBR (e.g. [28]). Such approaches seem to be quite effective, because combinations of CBR with RBR and MBR individually have been thoroughly investigated. Quite often such systems have been implemented to deal with deficiencies of earlier systems combining CBR with only one of the other two intelligent methods (e.g. RBR or MBR alone). Multi-integrated CBR approaches besides those involving RBR/MBR could be developed. For instance, ontologies could constitute an interesting candidate method that could be combined with CBR and another intelligent method in order to facilitate knowledge sharing and reuse among the integrated system components themselves [5] and among integrated systems. Such a combination could be useful in Web-based systems that need to share knowledge. Fruitful such approaches could involve combinations of CBR, ontologies and RBR/MBR. For instance, in [6] an approach combining CBR, RBR and an ontology is presented.

Multi-integrated paradigms could also be considered to include systems combining CBR with certain types of neuro-symbolic or neuro-fuzzy approaches in which the neuro-symbolic (neuro-fuzzy) module fully integrates the neural and symbolic (fuzzy) approach. Such modules could be used within CBR instead of plain neural or fuzzy components. Non-embedded coupling approaches can be applied as well. For instance, in [15] a neuro-symbolic method is combined with CBR according to the reconciliation coupling approach.

4 CONCLUSIONS

In this paper, we discuss key aspects involving combinations of CBR with other intelligent methods. Such combinations are becoming increasingly popular due to the fact that in many application domains a vast amount of case data is available. Such combined approaches have managed to solve problems in application domains where a case-based module needs the assistance and/or completion of other intelligent modules in order to produce effective results. This trend is very likely to carry on in the following years.

Future directions in combinations of CBR with other intelligent methods could involve a number of aspects. The main such aspects involve: (a) combinations of CBR with soft computing methods, (b) combinations of CBR with fuzzy rules, (c) combinations of CBR with ontologies and (d) combinations of CBR with neuro-symbolic and neuro-fuzzy approaches.

Combinations of CBR with soft computing methods not following an embedded coupling approach could be an interesting future research direction. At present there seems to be a lack of great interest in pursuing this direction, since the main interest has been focused on employing soft computing methods within CBR. A non-embedded direction in the combinations of CBR with soft computing could be pursued as thoroughly as in the case of combinations of CBR with RBR/MBR. A further step in this direction could involve non-embedded approaches combining CBR with multiple soft computing methods, or combinations of CBR, soft computing and other intelligent methods (e.g. RBR, MBR or ontologies).

Combinations of CBR with fuzzy rule-based systems could be based on the work combining CBR with RBR, that is, on the investigation of various coupling approaches.

The increasing interest in Web-based intelligent systems and future advances in the Semantic Web are likely to provide an impetus to approaches combining CBR with ontologies. This
trend is likely to involve multi-integrated approaches combining CBR, ontologies and other intelligent methods.

Finally, a direction that may be useful to pursue involves non-embedded coupling approaches combining CBR with neuro-symbolic and neuro-fuzzy modules. Few such approaches have been developed.

REFERENCES

[1] A. Aamodt, E. Plaza, 'Case-Based Reasoning: Foundational Issues, Methodological Variations, and System Approaches', AI Communications, 7, 39-59, (1994).
[2] H. Ahn, K. Kim, 'Global optimization of case-based reasoning for breast cytology diagnosis', Expert Systems with Applications (to appear).
[3] I. Bischindaritz, 'Memoire: A Framework for Semantic Interoperability of Case-Based Reasoning Systems in Biology and Medicine', Artificial Intelligence in Medicine, 36, 177-192, (2006).
[4] P.P. Bonissone, R. Lopez de Mantaras, 'Fuzzy Case-Based Reasoning Systems', Handbook on Fuzzy Computing, vol. F 4.3, Oxford University Press, 1998.
[5] L. Castillo, E. Armengol, E. Onaindia, L. Sebastia, J. Gonzalez-Boticario, A. Rodriguez, S. Fernandez, J.D. Arias, D. Borrajo, 'SAMAP: A User Oriented Adaptive System for Planning Tourist Visits', Expert Systems with Applications, 34, 1318-1332, (2008).
[6] L. Ceccaroni, U. Cortes, M. Sanchez-Marre, 'OntoWEDSS: Augmenting Environmental Decision-Support Systems with Ontologies', Environmental Modelling & Software, 19, 785-797, (2004).
[7] P.-C. Chang, J.-C. Hsieh, C.-H. Liu, 'A Case-Injected Genetic Algorithm for Single Machine Scheduling Problems with Release Time', International Journal on Production Economics, 103, 551-564, (2006).
[8] P.-C. Chang, C.-Y. Lai, K.R. Lai, 'A Hybrid System by Evolving Case-Based Reasoning with Genetic Algorithm in Wholesaler's Returning Book Forecasting', Decision Support Systems, 42, 1715-1729, (2006).
[9] W. Cheetam, S.C.K. Shiu, R.O. Weber, 'Soft Case-Based Reasoning', Knowledge Engineering Review, 20, 267-269, (2006).
[10] D. Chen, P. Burrell, 'Case-Based Reasoning System and Artificial Neural Networks: A Review', Neural Computing & Applications, 10, 264-276, (2001).
[11] R.-J. Dzeng, H.-Y. Lee, 'Critiquing Contractors' Scheduling by Integrating Rule-Based and Case-Based Reasoning', Automation in Construction, 13, 665-678, (2004).
[12] F. Fdez-Riverola, J.M. Corchado, 'FSfRT: Forecasting System for Red Tides', Applied Intelligence, 21, 251-264, (2004).
[13] P. Gervas, B. Diaz-Agudo, F. Peinado, R. Hervas, 'Story Plot Generation Based on CBR', Knowledge-Based Systems, 18, 235-242, (2005).
[14] A.R. Golding, P.S. Rosenbloom, 'Improving Accuracy by Combining Rule-Based and Case-Based Reasoning', Artificial Intelligence, 87, 215-254, (1996).
[15] I. Hatzilygeroudis, J. Prentzas, 'Integrating (Rules, Neural Networks) and Cases for Knowledge Representation and Reasoning in Expert Systems', Expert Systems with Applications, 27, 63-75, (2004).
[16] B.W. Huang, M.L. Shih, N.-H. Chiu, W.Y. Hu, C. Chiu, 'Price information evaluation and prediction for broiler using adapted case-based reasoning approach', Expert Systems with Applications (to appear).
[17] Y.-K. Juan, S.-G. Shih, Y.-H. Perng, 'Decision Support for Housing Customization: A hybrid approach using case-based reasoning and genetic algorithm', Expert Systems with Applications, 31, 83-93, (2006).
[18] K.-J. Kim, 'Toward Global Optimization of Case-Based Reasoning Systems for Financial Forecasting', Applied Intelligence, 21, 239-249, (2004).
[19] J.L. Kolodner, Case-Based Reasoning, Morgan Kaufmann, 1993.
[20] O. Byung Kwon, 'Meta Web Service: Building Web-based Open Decision Support System based on Web Services', Expert Systems with Applications, 24, 375-389, (2003).
[21] O. Kwon, M. Kim, 'MyMessage: Case-Based Reasoning and Multicriteria Decision Making Techniques for Intelligent Context-Aware Message Filtering', Expert Systems with Applications, 27, 467-480, (2004).
[22] D.B. Leake (Ed.), Case-Based Reasoning: Experiences, Lessons, and Future Directions, AAAI Press, 1996.
[23] S.J. Louis, G. McGraw, R.O. Wyckoff, 'Case-based reasoning assisted explanation of genetic algorithm results', Journal of Experimental & Theoretical Artificial Intelligence, 5, 21-37, (1993).
[24] S.J. Louis, C. Miles, 'Playing to Learn: Case-Injected Genetic Algorithms for Learning to Play Computer Games', IEEE Transactions on Evolutionary Computation, 9, 669-681, (2005).
[25] C. Marling, M. Sqalli, E. Rissland, H. Munoz-Avila, D. Aha, 'Case-Based Reasoning Integrations', AI Magazine, 23, 69-86, (2002).
[26] L.R. Medsker, Hybrid Intelligent Systems, Kluwer Academic Publishers, Second Printing, 1998.
[27] S. Montani, R. Bellazzi, 'Supporting Decisions in Medical Applications: the Knowledge Management Perspective', International Journal of Medical Informatics, 68, 79-90, (2002).
[28] S. Montani, P. Magni, R. Bellazzi, C. Larizza, A.V. Roudsari, E.R. Carson, 'Integrating model-based decision support in a multi-modal reasoning system for managing type 1 diabetic patients', Artificial Intelligence in Medicine, 29, 131-151, (2003).
[29] S.K. Pal, S.C.K. Shiu, Foundations of Soft Case-Based Reasoning, John Wiley, 2004.
[30] E.I. Perez, C.A. Coello Coello, A. Hernandez-Aguirre, 'Extraction and Reuse of Design Patterns from Genetic Algorithms using Case-Based Reasoning', Soft Computing, 9, 44-53, (2005).
[31] J. Prentzas, I. Hatzilygeroudis, 'Categorizing Approaches Combining Rule-Based and Case-Based Reasoning', Expert Systems, 24, 97-122, (2007).
[32] E.L. Rissland, D.B. Skalak, 'CABARET: Rule Interpretation in a Hybrid Architecture', International Journal of Man-Machine Studies, 34, 839-887, (1991).
[33] D. Rossille, J.-F. Laurent, A. Burgun, 'Modeling a Decision Support System for Oncology using Rule-Based and Case-Based Reasoning Methodologies', International Journal of Medical Informatics, 74, 299-306, (2005).
[34] K.M. Saridakis, A.J. Dentsoras, 'Case-DeSC: A System for Case-Based Design with Soft Computing Techniques', Expert Systems with Applications, 32, 641-657, (2007).
[35] S.C.K. Shiu, S.K. Pal, 'Case-Based Reasoning: Concepts, Features and Soft Computing', Applied Intelligence, 21, 233-238, (2004).
[36] I. Watson, 'Case-Based Reasoning is a Methodology not a Technology', Knowledge-Based Systems, 12, 303-308, (1999).
[37] P. Wriggers, M. Siplivaya, I. Joukova, R. Slivin, 'Intelligent Support of Engineering Analysis using Ontology and Case-Based Reasoning', Engineering Applications of Artificial Intelligence, 20, 709-720, (2007).
[38] H.-L. Yang, C.-S. Wang, 'Two stages of case-based reasoning – Integrating genetic algorithm with data mining mechanism', Expert Systems with Applications (to appear).
[39] F.-C. Yuan, C. Chiu, 'A Hierarchical Design of Case-Based Reasoning in the Balanced Scorecard Application', Expert Systems with Applications (to appear).
Combining Argumentation and Hybrid Evolutionary
Systems in a Portfolio Construction Application
Nikolaos Spanoudakis1 and Konstantina Pendaraki2 and Grigorios Beligiannis2
Abstract. In this paper we present an application for the construction of mutual fund portfolios. It is based on a combination of Intelligent Methods, namely an argumentation based decision making framework and a forecasting algorithm combining Genetic Algorithms (GA), MultiModel Partitioning (MMP) theory and Extended Kalman Filters (EKF). The argumentation framework is employed in order to develop mutual fund performance models and to select a small set of mutual funds, which will compose the final portfolio. The forecasting algorithm is employed in order to forecast the market status (inflating or deflating) for the next investment period. The knowledge engineering approach and application development steps are also discussed.

1 Technical University of Crete, Greece, email: nikos@science.tuc.gr
2 University of Ioannina, Greece, email: {dpendara, gbeligia}@cc.uoi.gr

1 INTRODUCTION

Portfolio management [8] is concerned with constructing a portfolio of securities (e.g., stocks, bonds, mutual funds [13], etc.) that maximizes the investor's utility. In a previous study [14], we constructed mutual fund (MF) portfolios using an argumentation based decision making framework. We developed rules that characterize the market and the policies of different investor types using evaluation criteria of fund performance and risk. We also defined strategies for resolving conflicts over these rules. Furthermore, the developed application can be used for a set of different investment policy scenarios and supports the investor/portfolio manager in composing efficient MF portfolios that meet his investment preferences. The traditional portfolio theories ([8], [11], [12]) were based on unidimensional approaches that did not fit the multidimensional nature of risk ([3]), and they did not capture the complexity present in the data set. In [14], this troublesome situation was resolved by the high level of adaptability in the decisions of the portfolio manager or investor when his environment is changing and the characteristics of the funds are multidimensional, which was demonstrated by the use of argumentation.

Our study showed that, when taking into account the market context, the results were better if we could forecast the status of the market for the following investment period. In order to achieve this goal we employed a hybrid system that combines Genetic Algorithms (GA), MultiModel Partitioning (MMP) theory and the Extended Kalman Filter (EKF). A general description of this algorithm and its application to linear and non-linear data is discussed in [2], while the specific version used in this contribution is presented in [1], where its successful application to non-linear data is also presented. This algorithm captured our attention because it had been successfully used in the past for accurately predicting the evolution of stock values in the Greek market (its application on economic data is presented in [2]). Moreover, there is a lot of work on hybrid evolutionary algorithms, and their application to many difficult problems has shown very promising results [4]. The problem of predicting the behavior of the financial market is an open problem and many solutions have been proposed. However, there is no known algorithm able to identify effectively all kinds of behaviors. Also, many traditional methods have been applied to the same problem and the results obtained were not very satisfactory. There are two main difficulties in this problem: firstly, the search space is huge and, secondly, it comprises many local optima.

In this contribution, we present the whole application resulting from the combination of argumentation with hybrid evolutionary systems, along with the respective results.

The rest of the paper is organized as follows: Section two presents an overview of the concepts and application domain knowledge. Section three outlines the main features of the proposed argumentation based decision-making framework and the developed argumentation theory. The forecasting hybrid evolutionary system is presented in section four, followed by section five, which presents the developed application and discusses the obtained empirical results. Finally, section six summarizes the main findings of this research.

2 DOMAIN KNOWLEDGE

This section describes the criteria (or variables) used for creating portfolios and the knowledge on how to use these criteria in order to construct a portfolio.

The data used in this study is provided by the Association of Greek Institutional Investors and consists of daily data of domestic equity mutual funds (MFs) over the period January 2000 to December 2005.

The proposed framework is based on five fundamental variables. The return of the funds is the actual value of return of an investment, defined by the difference between the nominal return and the rate of inflation. This variable is based on the net price of a fund. At this point, it is very important to mention that transaction costs such as the management commission are included in the net price. Front-end commission and redemption commission fluctuate depending on the MF class and in most cases are very low. The standard deviation is used to measure the variability of the fund's daily returns, thus representing the total risk of the fund. The beta coefficient (β) is a measure of the fund's risk in relation to the capital risk. The Sharpe index [13] is a useful measure of performance and is used to measure the expected return of a fund per unit of risk, defined by the standard deviation. The Treynor index [15] is similar to the Sharpe index, except that
59
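To make the five criteria concrete, here is a minimal sketch of how they could be computed from a fund's daily net prices and a market index series. The function name, the annualisation over 252 trading days, and the inflation and risk-free inputs are our own assumptions, not part of the paper.

import numpy as np

def fund_criteria(net_prices, index_prices, inflation_rate, risk_free_rate):
    """Compute the five criteria of Section 2 for one fund.

    net_prices and index_prices are daily price series of equal length;
    inflation_rate and risk_free_rate are annual rates for the same period.
    """
    net_prices = np.asarray(net_prices, dtype=float)
    index_prices = np.asarray(index_prices, dtype=float)
    fund_ret = np.diff(net_prices) / net_prices[:-1]    # daily simple returns
    mkt_ret = np.diff(index_prices) / index_prices[:-1]

    nominal = (1 + fund_ret).prod() ** (252 / len(fund_ret)) - 1
    real_return = nominal - inflation_rate              # return net of inflation

    total_risk = fund_ret.std(ddof=1) * np.sqrt(252)    # annualised std deviation
    beta = np.cov(fund_ret, mkt_ret)[0, 1] / mkt_ret.var(ddof=1)

    excess = nominal - risk_free_rate                   # risk premium
    sharpe = excess / total_risk                        # reward per unit of total risk
    treynor = excess / beta                             # reward per unit of systematic risk
    return real_return, total_risk, beta, sharpe, treynor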
On the basis of the argumentation framework for the selection of a small set of MFs, which will compose the final multi-portfolios, the examined funds are clustered in three groups for each criterion for each year. For example, we have funds with high, medium and low performance (return), and the same for the other criteria.

The aforementioned performance and risk variables visualize the characteristics of the capital market (bull or bear) and the type of the investor according to his investment policy (aggressive or moderate). Further information is represented through variables that describe the general conditions of the market and the investor policy (selection of portfolios with high performance per unit of risk).

The general conditions of the market are characterized through the development of funds which have high performance levels (high return). Regarding the market context, in a bull market, funds are selected if they have high systematic or total risk. On the other hand, in a bear market, we select funds with low systematic and total risk. An aggressive investor places his capital in funds with high performance and high systematic risk. Accordingly, a moderate investor selects funds with high performance and low or medium systematic risk. Some types of investors select portfolios with high performance per unit of risk. Such portfolios are characterized by a high Sharpe ratio and a high Treynor ratio.

3 ARGUMENTATION-BASED DECISION MAKING

In this section we firstly present the argumentation framework that we used and then we describe the domain knowledge modeling based on the argumentation framework.

3.1 The Argumentation Framework

Autonomous agents, be they artificial or human, need to make decisions under complex preference policies that take into account different factors. In general, these policies have a dynamic nature and are influenced by the particular state of the environment in which the agent finds himself. The agent's decision process needs to be able to synthesize together different aspects of his preference policy and to adapt to new input from the current environment. Such agents are the mutual fund managers.

In order to address requirements like the above, Kakas and Moraitis ([6]) proposed an argumentation based framework to support an agent's self-deliberation process for drawing conclusions under a given policy. Argumentation can be abstractly defined as the principled interaction of different, potentially conflicting arguments, for the sake of arriving at a consistent conclusion (see e.g. [10]). The nature of the "conclusion" can be anything, ranging from a proposition to believe, to a goal to try to achieve, to a value to try to promote. Perhaps the most crucial aspect of argumentation is the interaction between arguments. This means that argumentation can give us means for allowing an agent to reconcile conflicting information within itself, for reconciling its informational state with new perceptions from the environment, and for reconciling conflicting information between multiple agents through communication. A single agent may use argumentation techniques to perform its individual reasoning because it needs to make decisions under complex preference policies, in a highly dynamic environment (see e.g. [6]). This is the case used in this research. In the following paragraphs we describe the theoretical framework that we adopted.

Definition 1. A theory is a pair (T, P) whose sentences are formulae in the background monotonic logic (L, ⊢) of the form L ← L1, …, Ln, where L, L1, …, Ln are positive or negative ground literals. For rules in P the head L refers to an (irreflexive) higher priority relation, i.e. L has the general form L = h_p(rule1, rule2). The derivability relation, ⊢, of the background logic is given by the simple inference rule of modus ponens.

An argument for a literal L in a theory (T, P) is any subset, T, of this theory that derives L, T ⊢ L, under the background logic. A part of the theory, T0 ⊂ T, is the background theory that is considered as a non-defeasible part (the indisputable facts). An argument attacks (or is a counter-argument to) another when they derive a contrary conclusion. These are conflicting arguments. A conflicting argument (from T) is admissible if it counter-attacks all the arguments that attack it. It counter-attacks an argument if it takes along priority arguments (from P) and makes itself at least as strong as the counter-argument (we omit the relevant definitions from [6] due to limited space).

Definition 2. An agent's argumentative policy theory is a theory T = ((T, T0), PR, PC) where T contains the argument rules in the form of definite Horn logic rules, PR contains priority rules which are also definite Horn rules with head h_p(r1, r2) s.t. r1, r2 ∈ T, and all rules in PC are also priority rules with head h_p(R1, R2) s.t. R1, R2 ∈ PR ∪ PC. T0 contains auxiliary rules of the agent's background knowledge.

Thus, in defining the decision maker's theory we specify three levels. The first level (T) defines the (background theory) rules that refer directly to the subject domain, called the Object-level Decision Rules. In the second level we have the rules that define priorities over the first level rules for each role that the agent can assume or context that he can be in (including a default context). Finally, the third level rules define priorities over the rules of the previous level (which context is more important) but also over the rules of this level in order to define specific contexts, where priorities change again.

3.2 The Decision Maker's Argumentation Theory

Using the presented argumentation framework, we transformed the criteria for all MFs and the experts' knowledge (§2) into background theory (facts) and rules of the first and second level. Then, we defined the strategies (or specific contexts) in the third level rules.

The goal of the knowledge base is to select some MFs in order to construct our portfolio. Therefore our rules have as their head the predicate selectFund/1 and its negation. We write rules supporting it or its negation and use argumentation for resolving conflicts. We introduce the hasInvestPolicy/2, preference/1 and market/1 predicates for defining the different contexts and roles. For example, John, an aggressive investor, is expressed with the predicate hasInvestPolicy(john, aggressive).

The knowledge base facts are the performance and risk variables' values for each MF, the thresholds for each group of values for each year, and the above mentioned predicates characterizing the investor and the market.
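As an illustration of how such facts might be produced, the following sketch clusters funds into high, medium and low groups for one criterion in one year using tertile thresholds; the threshold choice, the fact names and the example values are our assumptions (the paper only states that funds are clustered into three groups per criterion per year).

import numpy as np

def group_facts(values_by_fund, criterion):
    """Cluster funds into high/medium/low groups for one criterion in one
    year; tertile thresholds and fact names are illustrative assumptions."""
    vals = np.array(list(values_by_fund.values()))
    lo, hi = np.quantile(vals, [1 / 3, 2 / 3])        # the group thresholds
    facts = []
    for fund, v in values_by_fund.items():
        group = "high" if v > hi else ("medium" if v > lo else "low")
        facts.append(f"{group}{criterion}({fund}).")  # e.g. "highR(f1)."
    return facts

# Example: returns of four hypothetical funds for one year
print(group_facts({"f1": 0.12, "f2": 0.03, "f3": 0.08, "f4": -0.02}, "R"))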
The following rules are an example of the object-level rules (level 1 rules of the framework, T):

r1(Fund): selectFund(Fund) ← highR(Fund)
r2(Fund): ¬selectFund(Fund) ← highB(Fund)

The highR predicate denotes the classification of the MF as a high return fund and the highB predicate denotes the classification of the MF as a high risk fund. Thus, the r1 rule states that a high performance fund should be selected, while the r2 rule states that a high risk fund should not be selected. Such rules are created for the three groups of our performance and risk criteria.

Then, in the second level we assign priorities over the object level rules. The PR are the default context rules or level 2 rules. These rules are added by experts and express their preferences in the form of priorities between the object level rules that should take place within defined contexts and roles. For example, the level 1 rules with signatures r1 and r2 are conflicting. In the default context the first one has priority, while the bear market context reverses this priority:

R1: h_p(r1(Fund), r2(Fund)) ← true
R2: h_p(r2(Fund), r1(Fund)) ← market(bear)

Rule R1 defines the priorities set for the default context, i.e. an investor selects a fund that has a high return on investment (RoI) even if it has high risk. Rule R2 defines the default context for the bear market context (within which the fund selection process is cautious and does not select a high RoI fund if it has high risk).

Finally, in PC (level 3 rules) the decision maker defines his strategy and policy for integrating the different roles and contexts rules. When combining the aggressive investor role and the bear market context, for example, the final portfolio is their union, except that the aggressive investor now would accept to select high and medium risk MFs (instead of only high). The decision maker's strategy sets preference rules between the rules of the previous level but also between rules at this level. Relating to the level 2 priorities, the bear market context's priority of not buying a high risk MF, even if it has a high return, is set at a higher priority than that of the general context. Then, the specific context of an aggressive investor in a bear market defines that the bear market context preference is inverted. See the relevant priority rules:

C1: h_p(R2, R1) ← true
C2: h_p(R1, R2) ← hasInvestPolicy(Investor, aggressive)
C3: h_p(C2, C1) ← true

Thus, an aggressive investor in a bear market context would continue selecting high risk funds. In the latter case, the argument r1 takes along the priority arguments R1, C2 and C3 and becomes stronger (it is the only admissible one) than the conflicting r2 argument, which can only take along the R2 and C1 priority arguments. Thus, the selectFund(Fund) predicate is true and the fund is inserted in the portfolio.

The problem with the above rules is that the facts market(bear) or (exclusively) market(bull) could not be safely determined for the next investment period. In the application version presented in [14] the market status was just assumed to remain the same as at the time of the investment. This strategy, however, produced quite poor results for this context if it should change in the next period.
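Purely as a schematic illustration of how the layered priorities settle the r1/r2 conflict for an aggressive investor in a bear market (this is not the Gorgias machinery the application actually uses; the data layout, function names and the example fund f are our own):

FACTS = {"highR(f)", "highB(f)", "market(bear)",
         "hasInvestPolicy(john, aggressive)"}

# Object-level rules: name -> (conclusion, condition)
LEVEL1 = {"r1": ("selectFund(f)", "highR(f)"),
          "r2": ("not selectFund(f)", "highB(f)")}

# Priority rules: name -> (preferred, weaker, condition); None = always holds
LEVEL2 = {"R1": ("r1", "r2", None),                  # default context
          "R2": ("r2", "r1", "market(bear)")}        # bear-market context
LEVEL3 = {"C1": ("R2", "R1", None),
          "C2": ("R1", "R2", "hasInvestPolicy(john, aggressive)"),
          "C3": ("C2", "C1", None)}                  # specific context: C2 > C1

def fires(rule):
    return rule[2] is None or rule[2] in FACTS

def decide():
    # Both r1 and r2 fire on the facts, so the conflict moves up a level.
    l2_live = [n for n, r in LEVEL2.items() if fires(r)]     # ['R1', 'R2']
    l3_live = {n: r for n, r in LEVEL3.items() if fires(r)}  # C1, C2 and C3

    # C3 orders the level-3 rules themselves: it prefers C2 over C1.
    top = "C2" if ("C3" in l3_live and "C2" in l3_live) else "C1"

    winning_l2 = l3_live[top][0]      # the level-2 rule C2 promotes: 'R1'
    assert winning_l2 in l2_live
    winner = LEVEL2[winning_l2][0]    # the object-level rule R1 promotes: 'r1'
    return LEVEL1[winner][0]

print(decide())   # -> selectFund(f): the aggressive investor keeps the fund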
4 FORECASTING THE STATUS OF THE FINANCIAL MARKET

One of the most prominent issues in the field of signal processing is the adaptive filtering problem, with unknown time-invariant or time-varying parameters. Selecting the correct order and estimating the parameters of a system model is a fundamental issue in linear and nonlinear prediction and system identification. The problem of fitting an AutoRegressive Moving Average model with eXogenous input (ARMAX) or a Nonlinear AutoRegressive Moving Average model with eXogenous input (NARMAX) to a given time series has attracted much attention because it arises in a large variety of applications, such as time series prediction in economic and biomedical data, adaptive control, speech analysis and synthesis, neural networks, radar and sonar, fuzzy systems, and wavelets [5].

The forecasting algorithm used in this contribution is a generic applied evolutionary hybrid technique, which combines the effectiveness of adaptive multimodel partitioning filters and the robustness of GAs [1]. This method was first presented in [7]. Specifically, the a posteriori probability that a specific model, of a bank of the conditional models, is the true model can be used as the fitness function for the GA. In this way, the algorithm identifies the true model even in the case where it is not included in the filters' bank. It is clear that the filter's performance is considerably improved through the evolution of the population of the filters' bank, since the algorithm can search the whole parameter space. The proposed hybrid evolutionary algorithm can be applied to linear and nonlinear data; is not restricted to the Gaussian case; does not require any knowledge of the model switching law; is practically implementable, computationally efficient and applicable to online/adaptive operation; and exhibits very satisfactory performance as indicated by simulation experiments [2]. The structure of the hybrid evolutionary system used is depicted in Figure 1.

Figure 1: The structure of the hybrid evolutionary system used for forecasting

The representation used for the genomes of the population of the GA is the following. We use a mapping that transforms a fixed dimensional internal representation to variable dimensional problem instances. Each genome consists of a vector x of real values xi ∈ ℝ, i = 1, …, k, and a bit string b of binary digits bi ∈ {0, 1}, i = 1, …, k. Real values are summed up as long as the corresponding bits are equal. Obviously, k is an upper bound for the dimension of the resulting parameter vector. We use the first k/3 real values for the autoregressive part, the second k/3 real values for the moving average part, and the last k/3 real values for the exogenous input part. An example of this mapping is presented in Figure 2. For a more detailed description of this mapping refer to [2].

Figure 2: Mapping from a fixed dimensional internal representation to a variable length NARMAX parameter vector. The resulting order is n(p, q, r) = (4, 3, 2).

At first, an initial population of m genomes is created at random (each genome consists of a vector of real values and a bit string). As stated before, each vector of real values represents a possible value of the NARMAX model order and its parameters. For each such population we apply an MMAF with EKFs and have as a result the model-conditional probability density function (pdf) of each candidate model. This pdf is the fitness of each candidate model, namely the fitness of each genome of the population (Figure 3).

Figure 3: The fitness of each candidate model is the model-conditional pdf (m is the number of the extended Kalman filters in the multimodel adaptive filter)

The reproduction operator we decided to use is the classic biased roulette wheel selection according to the fitness function value of each possible model order [9]. As far as crossover is concerned, we use the one-point crossover operator for the binary strings and the uniform crossover operator for the real values [9]. Finally, we use the flip mutation operator for the binary strings and the Gaussian mutation operator for the real values [9]. Every new generation of possible solutions iterates the same process as the old ones, and all this process may be repeated for as many generations as we desire or till the fitness function reaches the value 1 (one), which is the maximum value it is able to have as a probability. For a more detailed description of this hybrid evolutionary system refer to [2].

In this contribution we apply a slightly different approach compared to the one presented in [2]. In [2], at the algorithm's step where the value of the estimation (output) x of each filter is calculated, the past values of x that are used in order to estimate the next value of x are always taken from the estimation file (the file of all past values of x that have been estimated by the algorithm till this point). All these values are used in each generation in order to estimate the next value of the estimation (output) vector x. The method presented in this contribution uses a different approach in order to estimate x. At the algorithm's step where the value of x for each filter is calculated, the number of past values of x used to estimate the next value of x is smaller than the total length of the time series that has been estimated till this point. The length of past values used in each generation in order to estimate the next value of x equals n/2, where n is the total length of the time series to be estimated. Every new value of x, estimated by the algorithm, is added to this time series of length n/2 and the oldest one is removed, in order for this time series to sustain a length of n/2. The value of n/2 was not selected arbitrarily. We conducted exhaustive experiments using many different values. The value of n/2 that was finally selected was the most effective one, that is, the one that resulted in the best prediction results.

Thus, the hybrid evolutionary system presented in Figure 1 is used in order to forecast the behavior of the financial market in relation to its current status. The market is characterized as a bull market if it is forecasted to rise in the next semester, or as a bear market if it is forecasted to fall. We used the return values of the Greek market index for each semester starting from year 1985 to the years of our sample data (2000 to 2005). The algorithm performed very well considering that it could forecast the next semester's market behavior with a success rate of 85.71% (12 out of 14 right predictions).
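To make the genome mapping concrete, here is a minimal sketch of decoding a fixed-length genome into a variable-dimension NARMAX parameter vector, under our reading that consecutive real values are summed while the corresponding bits form a run of equal values; the names and the example genome are our own, not from [2].

from itertools import groupby

def decode(x, b):
    """Collapse a fixed-length genome (reals x, bits b) into a variable-length
    NARMAX parameter vector: consecutive reals whose bits are equal are summed
    into a single parameter; the three thirds give the AR, MA and X parts."""
    assert len(x) == len(b) and len(x) % 3 == 0
    k = len(x) // 3
    pairs = list(zip(x, b))
    parts = []
    for i in range(0, 3 * k, k):                      # AR / MA / X third
        runs = groupby(pairs[i:i + k], key=lambda vb: vb[1])
        parts.append([sum(v for v, _ in run) for _, run in runs])
    return parts                                      # orders p, q, r = lengths

ar, ma, exo = decode([.1, .2, .3, .4, .5, .6, .7, .8, .9],
                     [0, 0, 1, 1, 0, 1, 0, 0, 0])
print(len(ar), len(ma), len(exo))                     # -> 2 3 1, i.e. n(p, q, r) = (2, 3, 1)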
5 THE PORTFOLIO CONSTRUCTION APPLICATION

In this section we firstly present the system architecture, i.e. the combination method for the argumentation decision making sub-system and the hybrid forecasting sub-system that resulted in a coherent application. Then we present the results of this combination.

5.1 System Architecture

The portfolio generation application is a Java program creating a human-machine interface and managing its modules, namely the decision making module, which is a Prolog rule base (executed in SWI-Prolog¹) using the Gorgias² framework, and the forecasting module, which is a Matlab³ implementation of the forecasting hybrid system (see Figure 4).

The application connects to the SWI-Prolog module using the provided Java interface (JPL) that allows for inserting facts into an existing rule base and running it for reaching goals. The goals can be captured and returned to the Java program. The application connects to Matlab by executing it in a system shell. The Matlab program writes the results of the algorithm to a MySQL⁴ database using SQL (Structured Query Language). The application first executes the forecasting module, then updates the database, using JDBC (Java DataBase Connectivity interface) technology, with the investor profile (selected roles) and, finally, queries the decision making module setting as goal the funds to select for participation in the final portfolio. Thus, after the execution of the forecasting module the predicate market/1 is determined as bull or bear and inserted as a fact in the rule base before the decision making process is launched. The reader can see in Figure 5 a screenshot of the integrated system.

Figure 4: System Architecture

1 SWI-Prolog offers a comprehensive Free Software Prolog environment, http://www.swi-prolog.org
2 Gorgias is an open source general argumentation framework that combines the ideas of preference reasoning and abduction, http://www.cs.ucy.ac.cy/~nkd/gorgias/
3 MATLAB® is a high-level language and interactive environment for performing computationally intensive tasks, http://www.mathworks.com/products/matlab
4 MySQL is an open source database, http://www.mysql.com
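Purely to illustrate the control flow of this integration (the real application is a Java program using JPL, Matlab and MySQL; every function below is a hypothetical stand-in, written in Python only for compactness, and only market/1, hasInvestPolicy/2 and selectFund/1 come from the paper):

def forecast_market():
    """Stand-in for the forecasting module; the real system shells out to
    Matlab, which writes its result to a MySQL database."""
    return "bear"                      # would be read back from the database

def query_rule_base(facts, goal):
    """Stand-in for the JPL bridge that asserts facts in the SWI-Prolog /
    Gorgias rule base and collects the bindings of the goal."""
    print("asserting:", facts, "| proving:", goal)
    return ["fund_a", "fund_b"]        # dummy goal bindings

def run_pipeline(investor, policy):
    market = forecast_market()                          # 1. forecast first
    facts = [f"market({market}).",                      # 2. context facts
             f"hasInvestPolicy({investor}, {policy})."]
    return query_rule_base(facts, "selectFund(Fund)")   # 3. portfolio goal

print(run_pipeline("john", "aggressive"))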
5.2 System Evaluation

For evaluating our results we defined scenarios for all years for which we had available data (2000-2005) and for all combinations of contexts. That resulted in the two investor types combined with the market status, plus the two investor types combined with the high performance option, plus the market status combined with the high performance option, all together five different scenarios run for six years each. Each one of the examined scenarios refers to different investment choices and leads to the selection of different numbers and combinations of MFs.

Figure 5: A screenshot for portfolio generation for a scenario of a moderate investor in a bull market context

In Table 1 the reader can inspect the average return on investment (RoI) for the six years for all different contexts. The reader should notice that the table contains two RoI columns; the first ("Previous RoI") depicts the results before changing the system, as they appeared in [14]. The second presents the results of upgrading the application by combining it with the hybrid evolutionary forecasting sub-system and by fixing the selected funds' participation in the final portfolio. The latter modification is out of the scope of this paper, but the reader can clearly see that it has greatly influenced the performance of all scenarios.

Table 1, however, shows the added value of this contribution, as the market context has become the most profitable in the "New RoI" column (8.17% RoI), while in the "Previous RoI" column it was one of the worst cases (3.72% RoI). Consequently, the specific contexts containing the market context have better results.

Table 1: Average RoI for six years. The New RoI column shows the gains after the evolutionary hybrid forecasting system's integration

Context type     | Context              | Previous RoI (%) | New RoI (%)
simple           | general              | 3.53             | 6.86
role             | aggressive           | 2.65             | 7.38
role             | moderate             | 4.02             | 6.09
context          | market               | 3.72             | 8.17
role             | high performance     | 4.98             | 7.16
specific context | aggressive - market  | 3.56             | 7.92
specific context | moderate - market    | 4.72             | 6.08
specific context | aggressive - high p. | 4.88             | 7.46
specific context | moderate - high p.   | 4.98             | 7.16
specific context | market - high perf.  | 4.59             | 7.23
                 | ASE-GI (RASE)        |                  | 6.75

Moreover, Table 1 also shows the added value of our approach, as the reader can compare our results with the return on investment (RASE) of the General Index of the Athens Stock Exchange (ASE-GI). According to the results of this table, the average return of the constructed portfolios for all contexts, except two, achieves a higher return than the market index. The two cases where the constructed portfolios did not beat the market index are the moderate simple context and the moderate-market specific context.
This is maybe due to the fact that in these two contexts we have an investor who wishes to earn more without taking into account any amount of risk in relation to the variability which characterizes the conditions of the market during the examined period. This fact makes it very difficult to implement investment strategies that can help a fund manager outperform a passive investment policy.

Furthermore, we notice that in some specific contexts the results are more satisfying than the results obtained by simple contexts, while in others there is little or no difference. This means that by using effective strategies in the third preference rules layer the decision maker can optimize the combined contexts. Specifically, the aggressive-high performance specific context provides better results than both the simple contexts aggressive and high performance (the ones that it combines) and the general context. The moderate-high performance specific context's returns on investment are equal to the higher simple context's returns (high performance), while the aggressive-market specific context returns are closer to the higher simple context's returns (market).

Finally, in Figure 6, we present the RoI of all contexts separately for each year. This view is also useful, as it shows that for two years, 2003 and 2004, RASE was greater than the RoI performance of all our contexts. This shows that our application, for the time being, performs better for medium term to long term investments, i.e. those that range over five years.

Figure 6: Comparative RoIs of all contexts for each year

6 CONCLUSION

The objective of this paper was to present an artificial intelligence based application for the MF portfolio generation problem that combines two different intelligent methods: argumentation based decision making and a hybrid system that combines Genetic Algorithms (GA), MultiModel Partitioning (MMP) theory and the Extended Kalman Filter (EKF).

We described in detail how we developed our argumentation theory and how we combined it with the hybrid system to determine an important fact for the decision making process, i.e. the status of the financial market in the next investment period. The developed application allows a decision maker (fund manager) to construct multi-portfolios of MFs under different, possibly conflicting contexts. Moreover, for medium to long term investments, the returns on investment of the constructed portfolios are better than those of the General Index of the Athens Stock Exchange, while the best results are those that involve the forecasting of the financial market.

Our future work will be to develop a new rule base for the problem of determining when to construct a new portfolio for a specific investor. We will also make the application web-based so that it can get on-line financial data available from the internet for computing the decision variables and for allowing the investors to insert their profiles by filling in on-line forms. Finally, we will continue evaluating our application as new data become available for years after 2005. Our aim is to be able to guarantee a better RoI than that of the ASE.

7 REFERENCES

[1] A. V. Adamopoulos, P. A. Anninos, S. D. Likothanassis, G. N. Beligiannis, L. V. Skarlas, E. N. Demiris and P. Papadopoulos, "Evolutionary Self-adaptive Multimodel Prediction Algorithms of the Fetal Magnetocardiogram", 14th Int. Conf. on Digital Signal Processing (DSP 2002), Vol. II, 1-3 July, Santorini, Greece, pp. 1149-1152, 2002.
[2] G. N. Beligiannis, L. V. Skarlas and S. D. Likothanassis, "A Generic Applied Evolutionary Hybrid Technique for Adaptive System Modeling and Information Mining", IEEE Signal Processing Magazine, 21(3), pp. 28-38, 2004.
[3] G. Colson and M. Zeleny, "Uncertain prospects ranking and portfolio analysis under the condition of partial information", in Mathematical Systems in Economics 44, 1979.
[4] G. Crina, A. Ajith and I. Hisao (Eds.), "Hybrid Evolutionary Algorithms", Studies in Computational Intelligence, Vol. 75, 2007.
[5] S. Haykin, Adaptive Filter Theory, Englewood Cliffs, NJ: Prentice-Hall Int., 1991.
[6] A. Kakas and P. Moraitis, "Argumentation based decision making for autonomous agents", Proc. of the Second Int. Conf. on Autonomous Agents and Multi-Agent Systems (AAMAS03), July 14-18, Australia, 2003.
[7] S. K. Katsikas, S. D. Likothanassis, G. N. Beligiannis, K. G. Berketis and D. A. Fotakis, "Evolutionary multimodel partitioning filters: A unified framework", IEEE Trans. Signal Processing, vol. 49, no. 10, pp. 2253-2261, 2001.
[8] H. Markowitz, Portfolio Selection: Efficient Diversification of Investments, Wiley, New York, 1959.
[9] Z. Michalewicz, Genetic Algorithms + Data Structures = Evolution Programs, 3rd ed., New York: Springer-Verlag, 1996.
[10] I. Rahwan, P. Moraitis and C. Reed (Eds.), "Argumentation in Multi-Agent Systems", Lecture Notes in Artificial Intelligence, 3366, Springer-Verlag, Berlin, Germany, 2005.
[11] S. Ross, "The Arbitrage Theory of Capital Asset Pricing", Journal of Economic Theory, 6, pp. 341-360, 1976.
[12] W. F. Sharpe, "Capital asset prices: A theory of market equilibrium under conditions of risk", Journal of Finance, 19, pp. 425-442, 1964.
[13] W. F. Sharpe, "Mutual fund performance", Journal of Business, 39, pp. 119-138, 1966.
[14] N. Spanoudakis and K. Pendaraki, "A Tool for Portfolio Generation Using an Argumentation Based Decision Making Framework", in: Proceedings of the Annual IEEE International Conference on Tools with Artificial Intelligence (ICTAI 2007), Patras, Greece, October 29-31, 2007.
[15] J. L. Treynor, "How to rate management of investment funds", Harvard Business Review, 43, pp. 63-75, 1965.
An Architecture for Multiple Heterogeneous Case-Based
Reasoning Employing Agent Technologies.
Elena I. Teodorescu and Miltos Petridis1
Abstract. This paper presents an investigation into applying Case-Based Reasoning to Multiple Heterogeneous Case Bases using agents. The adaptive CBR process and the architecture of the system are presented. A case study is presented to illustrate and evaluate the approach. The process of creating and maintaining the dynamic data structures is discussed. The similarity metrics employed by the system are used to support the process of optimisation of the collaboration between the agents, which is based on the use of a blackboard architecture. The blackboard architecture is shown to support the efficient collaboration between the agents to achieve an efficient overall CBR solution, while using case-based reasoning methods to allow the overall system to adapt and "learn" new collaborative strategies for achieving the aims of the overall CBR problem solving process.

1 Introduction

Case-based reasoning (CBR) is now an established artificial intelligence paradigm. Given a case-base of prior experiences, a CBR system solves new problems by retrieving cases from the case-base and adapting their solutions to comply with the new requirements [1].

Multiple Case Based Reasoning (MCBR) is used to retrieve solutions for a new problem from more than one case-base. Methods for managing the sharing of standardized case bases have been studied in research on distributed CBR (e.g. [13]), as have methods for facilitating large-scale case distribution [10]. Leake and Sooriamurthi propose a new strategy for MCBR: an agent selectively supplements its own case-base as needed, by dispatching problems to external case-bases and using cross-case-base adaptation to adjust their solutions for inter-case-base differences [4, 5, 6, 13].

In many problems in modern organisations, the knowledge encapsulated by cases is contained in multiple case bases, reflecting the fragmented way with which organisations capture and organise knowledge. The traditional approach is to merge all case bases into a central case base that can be used for the CBR process. However, this approach brings with it three challenges:

• Moving cases into a central case base potentially separates them from their context and makes maintenance more difficult.
• Various case bases can use different semantics. There is therefore a need to maintain various ontologies and mappings across the case bases.
• The knowledge content "value" of individual cases can be related to its origination. This can be lost when merging into a central case base.

Keeping the cases distributed in the form of a Heterogeneous Multiple Case Based Reasoning system (HMCBR) may have a number of advantages, such as increased maintainability and competence and the contextualisation of the cases. Past research at Greenwich [2][3] has shown the need to combine knowledge encoded in cases from various heterogeneous sources to achieve a competent, seamless CBR system.

Ontanon and Plaza [7] looked at a way to "improve the overall performance of the multiple case systems and of the individual CBR agents without compromising the agent's autonomy". They present [8] a framework for collaboration among agents that use CBR and strategies for case bartering (case trading by CBR agents). Nevertheless, they do not focus on the possibility of cases having different structures and what impact this will have on applying CBR to heterogeneous case bases. Leake [5] states that "An important issue beyond the scope of this paper is how to establish correspondences between case representations, if the representations used by different case-bases differ."

Given several case bases as the search domain, it is very likely that they have different structures. Ideally, accessing Multiple Case Bases should not require a change to their data structures. In order for an MCBR system to effectively use case-bases that may have been developed in different ways, for different tasks or task environments, methods are needed to adjust retrieved cases for local needs.

Leake and Sooriamurthi [4] proposed a theoretical "cross-case-base adaptation" which would adapt suggested solutions from one case base to apply to the needs of another. They are currently exploring sampling methods for comparing case-base characteristics in order to select appropriate cross-case-base adaptation strategies.

2 Adaptive CBR

In order to enable effective solution retrieval across autonomous case bases with differing structures, it is essential to have access to, and a good understanding of, each of the different case base structures involved. This would make it possible to identify the commonalities, equivalences and specific characteristics of every case base associated with the system.

2.1 The process of adaptive CBR

Instead of trying to adapt the suggested solutions from one case base to the needs of another, the approach investigated in this study will be to create a "dynamic structure" of a general case. This dynamic structure would be modified every time a new case base with a new structure is added.

The process of adaptive CBR, within the architecture of the HMCBR System (Figure 1), will incorporate a number of steps. Firstly, in order for the system to work with a particular case base, it will need to know the structure of that case base. Every newly added case base will therefore have to publish its structure to a Registry System. The published structures are required to have their own data dictionaries attached to enable the creation of a dynamic Data Dictionary.

1 Department of Computing Science, University of Greenwich, Park Row, London SE10 9LS, email: {E.I.Teodorescu, M.Petridis}@gre.ac.uk
Fig. 1. The Architecture of the HMCBR System

The published structure will be retrieved by the Dynamic CB System and used to adapt the local dynamic structure to accommodate any new elements and map existing ones.

When the dynamic structure reflects all participating case bases, a case query can be submitted. The system would then reformulate the target case structure into each provider's case base structure. The target case structure will be a subset of the dynamic structure. The reformulated cases are submitted to each provider and solution cases are retrieved using KNN techniques [1]. The structures of these solutions will be translated into the dynamic structure, thus creating a dynamic case base. Finally, the system will apply the classical CBR process to the dynamic case base.

The whole process is intended to provide a transparent view of the CBR process across the heterogeneous system.
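A rough sketch of this round trip, under our own assumptions about how mappings and cases are represented (none of these names come from the paper):

def reformulate(target, mapping):
    """Translate a target case (dynamic-structure attributes) into one
    provider's local attribute names, dropping attributes it lacks."""
    return {local: target[glob] for glob, local in mapping.items()
            if glob in target}

def knn(cases, query, k=3):
    """Toy nearest-neighbour retrieval over shared numeric attributes."""
    def dist(case):
        return sum((case[a] - query[a]) ** 2 for a in query if a in case)
    return sorted(cases, key=dist)[:k]

def retrieve(target, providers):
    """One query round: reformulate, retrieve locally, translate back."""
    dynamic_case_base = []
    for mapping, cases in providers:
        local_query = reformulate(target, mapping)
        for hit in knn(cases, local_query):
            back = {glob: hit[local] for glob, local in mapping.items()
                    if local in hit}
            dynamic_case_base.append(back)   # solutions in dynamic structure
    return dynamic_case_base                 # classical CBR is applied to this

# Two hypothetical providers with different local attribute names
p1 = ({"NoOfBedrooms": "beds", "Size": "area"},
      [{"beds": 3, "area": 95.0}, {"beds": 2, "area": 60.0}])
p2 = ({"NoOfBedrooms": "rooms"}, [{"rooms": 4}, {"rooms": 2}])
print(retrieve({"NoOfBedrooms": 3, "Size": 80.0}, [p1, p2]))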
2.2 Case Study

This case study requires searching for a property from three estate agencies without amalgamating their case base structures. Let us suppose that the estate agencies have different case base structures (figure 2). A possible buyer should be able to search for a property and get all the suitable solutions from all three agencies. A search should retrieve the best matches from all case bases as if it was dealing with a single case base, in a way transparent to the buyer.

Fig. 2. Three different Case Base Structures (CBS1, CBS2 and CBS3)

3 Creating the Dynamic Structure

Creating and maintaining a dynamic structure makes the self-adaptive multi case base reasoning system possible. By adding a new case base to the existing ones, new attributes are added to a global dynamic structure and new relations linked to these attributes are established.

A data dictionary is required to keep all the metadata for the dynamic structure. This data dictionary would have multiple functions: it records the location and the name of every attribute from the Case Base Structures (CBS) and how these are translated into the Dynamic Case Base Structure (DCBS). It also stores the type and any default value for every single attribute. The Data Dictionary will reflect any relationships between the Dynamic Case Base Structure attributes. These relationships can be mathematical relationships or look-up tables (figure 3).

DCBS.name \ CBS1.type | Apartment | Studio | Detached house
House                 | 0         | 0      | 1
Flat                  | 1         | 0.8    | 0

Fig. 3. Data Dictionary includes relations between some of the attributes

We will use the presented case study to show how a dynamic structure is created and how it is continuously changed by adding new case bases to the search domain.

Let us suppose that our general structure (the initial state of the Dynamic Structure containing a few main attributes of a property) is already built (see figure 4). The structure has attached a basic Data Dictionary mainly containing the data types of the existing attributes. We will show how this initial structure will be dynamically changed by consecutively adding the three agents to the search domain.

Data Dictionary:
Size: Double
NoOfBedrooms: Integer
Location: String
ParkingSpace: Double
Name, with look-up table: | house | flat
house                     | 1     | 0
flat                      | 0     | 1

Fig. 4. Initial state of the Dynamic Structure and Data Dictionary

Adding Case Base Structure 1 to the system implies the mapping of the attributes ParkingSpace, Area and Type into the Dynamic Structure (these attributes already exist in the initial structure) and also adding more attributes to it (i.e. NoOfRooms, NoOfBathrooms, GardenLength, GardenWidth).

The Data Dictionary will reflect the mapping of attributes:

CBS1.ParkingSpace = DCBS.ParkingSpace;
CBS1.Area = DCBS.Location;
CBS1.type = DCBS.name

The following attributes will be added to the dynamic data dictionary:

NoOfRooms: integer;
GardenLength: double; GardenWidth: double

Any other relevant relationships, such as look-up tables for defining mappings between the values of the attribute Type of CBS1 and the values of the attribute Name of the dynamic structure, will be captured.
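A minimal sketch of this growth step, assuming a plain-dictionary layout for the dynamic structure and data dictionary (the layout and function names are ours; the CBS1 mapping and the figure 3 look-up values are taken from the text):

# Dynamic structure and data dictionary as plain dicts (our own layout)
dcbs = {"Size": "double", "NoOfBedrooms": "integer",
        "Location": "string", "ParkingSpace": "double", "Name": "string"}
data_dictionary = {"mappings": {}, "lookups": {}}

def add_case_base(name, attributes, mapping, lookups=None):
    """Merge a provider's structure: map known attributes, add new ones.

    attributes: local attribute -> type; mapping: local -> DCBS attribute
    for the attributes that already exist in the dynamic structure."""
    for local, typ in attributes.items():
        if local in mapping:                       # e.g. CBS1.Area -> Location
            data_dictionary["mappings"][(name, local)] = mapping[local]
        else:                                      # brand-new attribute
            dcbs[local] = typ
            data_dictionary["mappings"][(name, local)] = local
    data_dictionary["lookups"].update(lookups or {})

add_case_base(
    "CBS1",
    {"ParkingSpace": "double", "Area": "string", "type": "string",
     "NoOfRooms": "integer", "NoOfBathrooms": "integer",
     "GardenLength": "double", "GardenWidth": "double"},
    {"ParkingSpace": "ParkingSpace", "Area": "Location", "type": "Name"},
    # figure 3: degrees of match between CBS1.type and DCBS.Name values
    {("CBS1.type", "DCBS.Name"): {("Detached house", "House"): 1,
                                  ("Apartment", "Flat"): 1,
                                  ("Studio", "Flat"): 0.8}},
)
print(sorted(dcbs))   # the original attributes plus the four new ones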
Case Base Structure 2 will add another attribute, GardenSize, to the Dynamic Structure, and the data dictionary will record the mapping of attributes:

CBS2.Name = DCBS.Name,
CBS2.Location = DCBS.Location,
CBS2.NoOfBedrooms = DCBS.NoOfBedrooms;

The mathematical relationships are recorded:

DCBS.GardenSize = DCBS.GardenLength * DCBS.GardenWidth

Functions can be applied, for example to keep the same metric system:

DCBS.GardenSize = CBS2.GardenSizeInFeet / (3.281)²

The Data Dictionary would also include a look-up table showing the conversion of values of CBS2.Name to values of DCBS.Name.

Attention has to be paid to the meanings of the names of the attributes. For example, while the attribute "Type" in CBS1 and the attribute "Name" in CBS2 have the same meaning (they would be translated as "Name" in DCBS, with values found in a look-up table), the attribute "Name" from CBS3 does not have the same meaning as the one from CBS2. It is actually translated into DCBS.Location (similar to CBS2.Location).

By adding the third estate agent case base to the search domain, the dynamic structure will grow even more (see figure 5) and the Data Dictionary will reflect it by adding the attributes DCBS.Garage and DCBS.View. The following attributes are mapped:

CBS3.Name = DCBS.Location
CBS3.Description = DCBS.Name
CBS3.GardenSizeInMeters = DCBS.GardenSize

Fig. 5. Adapted Dynamic Structure after CBS3 was added

Another look-up table can be created and added to the Data Dictionary to record the relationship between Garage and ParkingSpace. Figure 6 shows the state of the Dynamic Data Dictionary after CBS1, CBS2 and CBS3 are added.

Dynamic Data Dictionary:
CBS1.Area = DCBS.Location
NoOfRooms: integer
CBS1.type = DCBS.name
DCBS.GardenSize: double
DCBS.GardenSize = CBS2.GardenSizeInFeet
DCBS.GardenSize = DCBS.GardenLength * DCBS.GardenWidth
...
CBS3.Name = DCBS.Location
CBS3.GardenSizeInMetres = DCBS.GardenSize

Look-up table: | Garage | ParkingSpace
Garage         | 1      | 0.7
ParkingSpace   | 0.7    | 1

Fig. 6. Adapted Dynamic Data Dictionary after CBS1, CBS2 and CBS3 are added

4 Optimising the agent collaboration process

In order to optimise the process of collaboration between the agents, to achieve an efficient solution from the overall CBR process when applied across the heterogeneous case bases, an overall similarity metric is required. Additionally, an overall process to enable collaboration between the agents is necessary, based on a flexible architecture to enable this collaboration.

4.1 Defining an overall similarity metric

The overall similarity metric between a target and a source case can be defined as:

$\sigma(C_T, C_S) = \sum_{y} w_{CB_y}(C_T)\,\sigma_{CB_y}(C_T, C_S)$    (1)

where:
$\sigma$: overall similarity
$\sigma_{CB_y}$: similarity from case base provider $CB_y$
$C_T$: target case
$C_S$: source case
$w_{CB_y}(C_T)$: weighting for a case base provider y for case $C_T$

To allow for defining locally optimised similarity metrics for different providers, the following metric can be defined:

$\sigma_{CB_y}(C_T, C_S) = \sum_{x} w_{CB_y,x}\,\sigma_{CB_y,x}(C_T, C_S)$    (2)

where:
$w_{CB_y,x}$: the weighting from case base provider $CB_y$ for attribute x
$\sigma_{CB_y,x}(C_T, C_S)$: the local similarity metric for provider $CB_y$ for attribute x.

This extended similarity metric takes into account the level of trust that the HMCBR system attributes to the competence of each case base provider. The level of trust is determined by applying CBR to the case-base of the history of queries. Additionally, it allows adjusting the trust in particular providers for different "regions" of the case base, allowing case base providers to be "specialised" in particular types of domain knowledge. Finally, the extended metric allows for different ways of defining similarity based on possible particularities pertaining to individual case base providers.

Let us assume that in our case study the third estate agent is specialised in city apartments. After a few searches for countryside houses with gardens, reasoning can be applied to the History case-base. Results will show that, for this particular query, the estate agent's level of trust is not high, i.e. there will be fewer solutions from this particular case base added to the Dynamic case-base. A global level of trust of a provider's case-base can be calculated taking into consideration the results of all the previous enquiries for that provider.
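Metrics (1) and (2) transcribe directly into code; the per-attribute similarity function and the trust weighting below are illustrative choices of ours, not values from the paper:

def local_similarity(target, source, attr_weights, attr_sims):
    """Equation (2): provider-level similarity as a weighted sum of
    per-attribute similarity metrics (one function per attribute)."""
    return sum(w * attr_sims[a](target.get(a), source.get(a))
               for a, w in attr_weights.items())

def overall_similarity(target, source, providers):
    """Equation (1): trust-weighted combination over all providers y."""
    return sum(p["trust"](target) *
               local_similarity(target, source, p["weights"], p["sims"])
               for p in providers)

# One hypothetical provider, trusted less for large-garden queries
num = lambda a, b: 0.0 if a is None or b is None else 1 / (1 + abs(a - b))
provider = {
    "trust": lambda t: 0.3 if t.get("GardenSize", 0) > 100 else 0.9,
    "weights": {"NoOfBedrooms": 0.6, "GardenSize": 0.4},
    "sims": {"NoOfBedrooms": num, "GardenSize": num},
}
t = {"NoOfBedrooms": 3, "GardenSize": 20}
s = {"NoOfBedrooms": 3, "GardenSize": 35}
print(round(overall_similarity(t, s, [provider]), 3))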
4.2 An architecture and process to support effective collaboration between case base agents

The architecture of the HMCBR system shown in figure 1 contains the dynamic CB system, which incorporates a blackboard architecture. Blackboards have been used very effectively in the past for the construction of hybrid and agent based AI systems [11], [12].

The dynamic CB system is where the process for agent collaboration is controlled. It is based on a blackboard architecture incorporating the blackboard containing the target and retrieved cases from various providers together with similarity calculations and rankings. The blackboard also contains a log of the solution process and the reconciliation strategy followed, thus representing the state of the overall CBR solution process at any point in this process. Figure 7 shows the structure of the dynamic CB module incorporating the blackboard architecture.
Fig. 7. The Dynamic CB system incorporating the blackboard architecture (the figure shows the agents AgentCB1, AgentCB2 and AgentCB3 around the Blackboard, together with the Query, BB Manager, History, Dynamic Structure and Dynamic Data Dictionary components)

The blackboard manager manages the overall solution process, communicates with and keeps track of the CB agents, selects and implements a solution strategy, and monitors and evaluates the solutions achieved. Given a new target case, the blackboard manager decides on a strategy for finding similar cases from the CB providers. The blackboard system decides which CB providers to use, the number of cases to retrieve from each one, and other requirements, such as the requirement for diversity, similarity thresholds etc. The system then initialises the agents and assigns them a mission. On return, the results (cases) are mapped using the dynamic data dictionary and written to the blackboard. A "global" CBR process is used to decide on the retrieved cases. The system then selects and presents the shortlisted cases after the reconciliation process and provides these to the user, together with links to their original forms for the user to explore. Finally, the system "reflects" on the process by updating the query history and the confidence weights for each provider.

The system described here has been implemented and tested on a set of case bases from three different estate agents, all using different structures. Experiments with the system have shown that the system can retrieve useful cases combining cases from all case bases to provide a more efficient overall solution when compared to using the case bases separately or mapping them to one central case base. Additionally, the system has shown that it can provide a more diverse retrieved case population in both cases. A full scale evaluation of the system, including using a different application domain, is under way.

5 Conclusion

At a time of increasing web-based communication and sharing of knowledge between organisations and organisational units within enterprises, heterogeneous CBR applied to Multiple Case Bases seems to be the natural progression in this area of research.

The paper investigates an approach based on agents operating on different structures/views of the problem domain in a transparent and autonomous way. In this approach all data is kept locally by each case base provider in its native form. Agents can be dynamically added to the system, thus increasing the search domain and potentially the competence and vocabulary of the system.

This research proposes a new architecture for a self-adaptive MCBR system which involves the use of a dynamic structure based on the blackboard architecture. The Dynamic Structure reflects all participating case base provider structures. As new agents are added to the system, their case base structure is published and is used to adapt the Dynamic Structure accordingly. The Dynamic Structure is used at runtime to translate search queries into the local structures of each agent. Each agent can then use the translated query to match it to its local cases and retrieve the best matches.

A Data Dictionary is created in order to manage the Dynamic Structure. This contains the metadata for the Dynamic Structure, such as mapping details of the case base providers' structures to the Dynamic Structure, type information and relationships between attributes of the dynamic structure.

The dynamic case base system manages the overall process, including controlling the agents, reconciling and optimising the retrieved cases and feeding back into its strategy by continuously adjusting weights representing confidence levels for individual case base providers. A prototype system to evaluate the efficiency of using a heterogeneous Multiple Case Based Reasoning system is currently being evaluated. Preliminary findings are encouraging. Further work will concentrate on optimising the process of collaboration between the agents and on methods and strategies for the reconciliation of retrieved cases.

References

[1] Kolodner, J.: Case-based Reasoning, Morgan Kaufmann, 1993.
[2] Knight, B., Petridis, M., Mileman, T.: Maintenance of a Case-Base for the Retrieval of Rotationally Symmetric Shapes for the Design of Metal Castings, LNAI, Springer Verlag, Advances in Case-Based Reasoning, 5th European Workshop, Trento, Italy, Sept. 6-9, 2000, pp. 418-430.
[3] Knight, B., Petridis, M., Mejasson, P., Norman, P.: An Intelligent design assistant (IDA): a CBR System for Materials Design, Journal of Materials and Design, 22, pp. 163-170, 2001.
[4] Leake, D., Sooriamurthi, R.: "Automatically Selecting Strategies for Multi-Case-Base Reasoning", Proceedings of the Sixth European Conference on Case-Based Reasoning (ECCBR), Aberdeen, Scotland, September 2002.
[5] Leake, D., Sooriamurthi, R.: "Managing Multiple Case Bases: Dimensions and Issues", Proceedings of the Fifteenth Florida Artificial Intelligence Research Society (FLAIRS) Conference, Pensacola, Florida, May 2002.
[6] Leake, D., Sooriamurthi, R.: "When Two Case Bases are Better Than One: Exploiting Multiple Case Bases", Proceedings of the Fourth International Conference on Case-Based Reasoning (ICCBR), Vancouver, Canada, July 2001.
[7] Plaza, E., Ontañón, S.: Cooperative Multiagent Learning, Adaptive Agents and Multi-Agents Systems, 2002.
[8] Ontañón, S., Plaza, E.: A bartering approach to improve multiagent learning, AAMAS 2002.
[9] Martín, F. J., Plaza, E., Arcos, J. L.: Knowledge and Experience Reuse Through Communication Among Competent (Peer) Agents, International Journal of Software Engineering and Knowledge Engineering, 1999.
[10] Hayes, C., Doyle, M., Cunningham, P.: Distributed CBR Using XML, Proceedings of the Fourth European Conference on Case-Based Reasoning (ECCBR), Dublin, Ireland, June 1998.
[11] Engelmore, R., Morgan, T.: Blackboard Systems, Addison-Wesley, 1988.
[12] Petridis, M., Knight, B.: "A blackboard architecture for a hybrid CBR system for scientific software", in Proceedings of the Workshop Program at the 4th International Conference on Case-Based Reasoning ICCBR01, Vancouver, 2001, Eds. R. Weber, G. von Wagenheim, pp. 190-195, NCARAI, Naval Research Laboratory, Code 5510, Washington DC.
[13] Leake, D., Sooriamurthi, R.: Case Dispatching versus Case-Base Merging: when MCBR matters, International Journal on Artificial Intelligence Tools: Architectures, Languages and Algorithms (IJAIT), Special issue on recent advances in techniques for intelligent systems, Vol. 13, No. 1, 2004.