The Second Australian Workshop
                 on
 Artificial Intelligence in Health
             AIH 2012

                     held in conjunction with the

25th Australasian Joint Conference on Artificial Intelligence (AI 2012)


          Tuesday, 4th December 2012
Sydney Harbour Marriott Hotel, Sydney, Australia




   WORKSHOP
  PROCEEDINGS
        Editors: Sankalp Khanna, Abdul Sattar, David Hansen

                             © AIH 2012
                                   ACKNOWLEDGEMENTS


Program Chairs

      Abdul Sattar (Griffith University, Australia)
      David Hansen (CSIRO Australian e-Health Research Centre, Australia)

Workshop Chair

      Sankalp Khanna (CSIRO Australian e-Health Research Centre, Australia)

Senior Program Committee

      Aditya Ghose (University of Newcastle, Australia)
      Anthony Maeder (University of Western Sydney, Australia)
      Wayne Wobcke (University of New South Wales, Australia)
      Mehmet Orgun (Macquarie University, Australia)
      Yogesan (Yogi) Kanagasingam (CSIRO Australian e-Health Research Centre, Australia)

Program Committee

      Simon McBride (CSIRO Australian e-Health Research Centre)
      Adam Dunn (University of New South Wales)
      Stephen Anthony (University of New South Wales)
      Lawrence Cavedon (Royal Melbourne Institute of Technology / NICTA)
      Diego Mollá Aliod (Macquarie University)
      Michael Lawley (CSIRO Australian e-Health Research Centre)
      Anthony Nguyen (CSIRO Australian e-Health Research Centre)
      Amol Wagholikar (CSIRO Australian e-Health Research Centre)
      Bevan Koopman (CSIRO Australian e-Health Research Centre)
      Kewen Wang (Griffith University)
      Vladimir Estivill-Castro (Griffith University)
      John Thornton (Griffith University)
      Bela Stantic (Griffith University)
      Byeong-Ho Kang (University of Tasmania)
      Justin Boyle (CSIRO Australian e-Health Research Centre)
      Guido Zuccon (CSIRO Australian e-Health Research Centre)
      Hugo Leroux (CSIRO Australian e-Health Research Centre)
      Alejandro Metke (CSIRO Australian e-Health Research Centre)

Key Sponsors

      CSIRO Australian e-Health Research Centre
      Institute for Integrated and Intelligent Systems, Griffith University

Supporting Organisations

      The Australasian College of Health Informatics
      The Australasian Medical Journal
      The Australasian Telehealth Society



                                              PROGRAM

8:30 am – 9:00 am     Registration and Welcome

9:00 am – 10:30 am    Session 1                                      Chair: Abdul Sattar
                      Keynote Address
                      Technology in Healthcare: Myths and Realities
                      Dr. Jia-Yee Lee, National ICT Australia (NICTA)

                      Keynote Address
                      Driving Digital Productivity in Australian Health Services
                      Dr. Sankalp Khanna, CSIRO Australian e-Health Research Centre

10:30 am – 11:00 am   Morning Tea

11:00 am – 12:30 pm   Session 2                                      Chair: Sadananda Ramakoti
                      An investigation into the types of drug related problems that can and
                      cannot be identified by commercial medication review software
                      Colin Curtain, Ivan Bindoff, Juanita Westbury and Gregory Peterson

                      FS-XCS vs. GRD-XCS: An analysis using high-dimensional DNA microarray
                      gene expression data sets
                      Mani Abedini, Michael Kirley and Raymond Chiong

                      Reliable Epileptic Seizure Detection Using an Improved Wavelet Neural
                      Network
                      Zarita Zainuddin, Pauline Ong and Kee Huong Lai

                      Clinician-Driven Automated Classification of Limb Fractures from
                      Free-Text Radiology Reports
                      Amol Wagholikar, Guido Zuccon, Anthony Nguyen, Kevin Chu, Shane Martin,
                      Kim Lai and Jaimi Greenslade

                      Using Prediction to Improve Elective Surgery Scheduling
                      Zahra Shahabi Kargar, Sankalp Khanna and Abdul Sattar

12:30 pm – 2:00 pm    Lunch (and Poster Session)

2:00 pm – 3:30 pm     Session 3                                      Chair: Wayne Wobcke
                      Acute Ischemic Stroke Prediction from Physiological Time Series Patterns
                      Qing Zhang, Yang Xie, Pengjie Ye and Chaoyi Pang

                      Comparing Data Mining with Ensemble Classification of Breast Cancer
                      Masses in Digital Mammograms
                      Shima Ghassem Pour, Peter Mc Leod, Brijesh Verma and Anthony Maeder

                      Automatic Classification of Cancer Notifiable Death Certificates
                      Luke Butt, Guido Zuccon, Anthony Nguyen, Anton Bergheim and Narelle Grayson

                      If you fire together, you wire together; Hebb's Law revisited
                      Prajni Sadananda and Sadananda Ramakoti

3:30 pm – 4:00 pm     Afternoon Tea

4:00 pm – 5:30 pm     Session 4                                      Chair: Sankalp Khanna
                      Keynote Address
                      Smart Analytics in Health
                      Dr. Christian Guttmann, IBM Research Australia

                      Panel Discussion
                      AI in Health: the 3 Big Challenges
                      Panel Chair: Professor Abdul Sattar
                      Panelists: Dr. Jia-Yee Lee, Dr. Christian Guttmann, Prof. Wayne Wobcke,
                      Prof. Sadananda Ramakoti

5:30 pm               Announcement of Best Paper Award
                      Workshop Close




                                     TABLE OF CONTENTS


                                     PREFACE                                          1




                                 KEYNOTE ADDRESSES

Technology in Healthcare: Myths and Realities                                      5
     Jia-Yee Lee

Driving Digital Productivity in Australian Health Services                         7
     Sankalp Khanna

Smart Analytics in Health                                                          9
     Christian Guttmann


                                    FULL PAPERS

An investigation into the types of drug related problems that can and cannot
be identified by commercial medication review software                           11
     Colin Curtain, Ivan Bindoff, Juanita Westbury and Gregory Peterson

FS-XCS vs. GRD-XCS: An analysis using high-dimensional DNA microarray gene
expression data sets                                                              21
     Mani Abedini, Michael Kirley and Raymond Chiong

Reliable Epileptic Seizure Detection Using an Improved Wavelet Neural Network     33
     Zarita Zainuddin, Pauline Ong and Kee Huong Lai

Acute Ischemic Stroke Prediction from Physiological Time Series Patterns          45
     Qing Zhang, Yang Xie, Pengjie Ye and Chaoyi Pang

Comparing Data Mining with Ensemble Classification of Breast Cancer Masses in
Digital Mammograms                                                                55
     Shima Ghassem Pour, Peter Mc Leod, Brijesh Verma and Anthony Maeder

Automatic Classification of Cancer Notifiable Death Certificates                  65
     Luke Butt, Guido Zuccon, Anthony Nguyen, Anton Bergheim and Narelle Grayson


                                   SHORT PAPERS

Clinician-Driven Automated Classification of Limb Fractures from Free-Text
Radiology Reports                                                                 77
     Amol Wagholikar, Guido Zuccon, Anthony Nguyen, Kevin Chu, Shane Martin,
     Kim Lai and Jaimi Greenslade

Using Prediction to Improve Elective Surgery Scheduling                           83
     Zahra Shahabi Kargar, Sankalp Khanna and Abdul Sattar

If you fire together, you wire together; Hebb's Law revisited                     89
     Prajni Sadananda and Sadananda Ramakoti




    Second Australian Workshop on Artificial Intelligence in
                      Health (AIH 2012)
                                           PREFACE
                     Sankalp Khanna1, 2, Abdul Sattar2, David Hansen1
              1
                  The Australian e-Health Research Centre, RBWH, Herston, Australia
             {Sankalp.Khanna, David.Hansen}@csiro.au
       2
           Institute for Integrated and Intelligent Systems, Griffith University, Australia
                             A.Sattar@griffith.edu.au

1     Motivation behind the workshop series

The business of health service delivery is a complex one. Employing over
850,000 people, and delivering services to 21.3 million residents, the Austra-
lian healthcare system is currently struggling to deal with increasing demand
for services, and an acute shortage of skilled professionals. The National e-
Health Strategy drives a nationwide agenda to provide the infrastructure and
tools required to support the planning, management and delivery of health
care services. National initiatives such as the National Health Reform Pro-
gram, the National Broadband Network, and the Personally Controlled Elec-
tronic Health Record are accelerating the use of information and communi-
cation technologies in delivering healthcare services. The Australasian Joint
Conferences on Artificial Intelligence (AI) provide an excellent opportunity to
bring together artificial intelligence researchers who are working in health
research.
    Driven by a senior program committee comprising distinguished faculty
from several Australian universities including Griffith University, the Univer-
sity of New South Wales, University of Newcastle, University of Western
Sydney, and Macquarie University, and specialist health research organisa-
tions including the CSIRO Australian e-Health Research Centre, the Artificial
Intelligence in Health workshop series was created in 2011 to bring these
researchers together as part of Australia’s premier Artificial Intelligence con-
ference.




           2     AIH 2011 – The First Australian Workshop on Artificial
                 Intelligence in Health

           Held for the first time in December 2011, the workshop was the first of its
           kind to bring together scholars and practitioners nationally in the field of
           Artificial Intelligence driven Health Informatics to present and discuss their
           research, share their knowledge and experiences, define key research chal-
           lenges and explore possible collaborations to advance e-Health development
           nationally and internationally. The workshop was co-located with the 24th
           Australasian Joint Conference on Artificial Intelligence and was attended by
           25 delegates.
               Of the 16 submissions received, 6 were accepted as Full Papers and 5 as
            Short Papers accompanied by posters. The authors of all papers presented at
            the AIH 2011 workshop were also invited to revise and submit their manuscripts
            for inclusion in a special issue of the Australasian Medical Journal. Of these,
            seven papers and a letter to the editor were published in the special issue in
            September 2012.

            3     AIH 2012 – The Second Australian Workshop on Artificial
                  Intelligence in Health

            The Second Australian Workshop on Artificial Intelligence in Health (AIH 2012) is being
           held in conjunction with the 25th Australasian Joint Conference on Artificial
           Intelligence (AI 2012) in Sydney, Australia, on the 4th of December, 2012.
           The Call for Papers received an excellent response this year. All submitted
           papers went through a rigorous review process. Of these, 6 full papers and 3
           short papers have been accepted for presentation in the workshop and for
           publication in these CEUR proceedings. The workshop will also feature three
           keynote addresses and a panel discussion on the topic “AI in Health: the 3
           Big Challenges”.
               This year again, the workshop is offering 4 travel scholarships of $250
           each to students who were first authors of accepted papers. A best paper
           prize of $250 will also be awarded on the workshop day. Both prizes have
           been sponsored by the CSIRO Australian e-Health Research Centre.
            The authors of all accepted full and short papers will also be invited to extend
            and reformat their papers for publication in a special issue of the Australasian
            Medical Journal (www.amj.net.au). The journal is indexed in the following
            databases: DOAJ, EBSCO, Genamics JournalSeek, ProQuest, Index Copernicus,
            Open J-Gate, Intute, Global Health, CAB Abstracts, MedWorm, Scopus, Socolar,
            PMC and PubMed.

4     Workshop Organisation

4.1   Program Chairs

Abdul Sattar (Griffith University, Australia)
David Hansen (CSIRO Australian e-Health Research Centre, Australia)

4.2   Workshop Chair

Sankalp Khanna (CSIRO Australian e-Health Research Centre, Australia)

4.3   Senior Program Committee

Aditya Ghose (University of Newcastle, Australia)
Anthony Maeder (University of Western Sydney, Australia)
Wayne Wobcke (University of New South Wales, Australia)
Mehmet Orgun (Macquarie University, Australia)
Yogesan (Yogi) Kanagasingam (CSIRO Australian e-Health Research Centre,
Australia)

4.4   Program Committee

Simon McBride (CSIRO Australian e-Health Research Centre)
Adam Dunn (University of New South Wales)
Stephen Anthony (University of New South Wales)
Lawrence Cavedon (Royal Melbourne Institute of Technology / NICTA)
Diego Mollá Aliod (Macquarie University)
Michael Lawley (CSIRO Australian e-Health Research Centre)
Anthony Nguyen (CSIRO Australian e-Health Research Centre)
Amol Wagholikar (CSIRO Australian e-Health Research Centre)
Bevan Koopman (CSIRO Australian e-Health Research Centre)
Kewen Wang (Griffith University)
Vladimir Estivill-Castro (Griffith University)
John Thornton (Griffith University)
Bela Stantic (Griffith University)
           Byeong-Ho Kang (University of Tasmania)
           Justin Boyle (CSIRO Australian e-Health Research Centre)
           Guido Zuccon (CSIRO Australian e-Health Research Centre)
           Hugo Leroux (CSIRO Australian e-Health Research Centre)
           Alejandro Metke (CSIRO Australian e-Health Research Centre)

           4.5   Key Sponsors

           CSIRO Australian e-Health Research Centre
           Institute for Integrated and Intelligent Systems, Griffith University

           4.6   Supporting Organisations

           The Australasian College of Health Informatics
           The Australasian Medical Journal
           The Australasian Telehealth Society



           5     Acknowledgements

           We are especially thankful to the organising committee of the 25th Austral-
           asian Joint Conference on Artificial Intelligence (AI 2012). This workshop se-
           ries would not have been possible without their support. We would also like to
           thank the Workshop Chair of AI 2012, Hans Guesgen, for organising the
           workshops and championing these CEUR workshop proceedings.




        Technology in Healthcare: Myths and Realities
                               Keynote Address
                                   Dr. Jia-Yee Lee
   National Information and Communications Technology Australia Ltd (NICTA), Australia
                       jia-yee.lee@nicta.com.au


Speaker Profile

    Dr Jia-Yee Lee is the Director of the Health
and Life Science Business Team at National
ICT Australia Ltd. She manages the business,
commercial and research activities of the
NICTA groups in Diagnostic and Computa-
tional Genomics, Biomedical Informatics,
Portable Motion Analytics and Bio-Imaging
Analytics. Prior to joining NICTA, Jia-Yee
spent 10 years in the management consulting
sector providing leadership in developing and
implementing strategies and operational
plans that improved business outcomes for
clients in government, ICT, and healthcare
sectors. Her business plans have led to international and national invest-
ments into Australian-based start-ups. Jia-Yee has extensive experience as a
project manager working on complex multi-disciplinary and multi-million
dollar programs funded by State and Commonwealth governments. Her e-
health experience includes stakeholder engagement with clinicians and lead-
ing technical teams to implement a range of commercial web-based systems
for the healthcare and medical research sectors. With more than 20 years in
medical research, Jia-Yee has led programs at MacFarlane Burnet Centre
(now "Burnet Institute") and the Victorian Infectious Diseases Reference
Laboratory, Melbourne Health. Her research into hepatitis B virus and rubella
virus was funded by the National Health and Medical Research Council of
Australia. Jia-Yee’s research skills include molecular and diagnostic virology,
and electron and confocal microscopy. Jia-Yee has a PhD from the University
of Melbourne and an MBA from Melbourne Business School.




  Driving Digital Productivity in Australian Health Services
                              Keynote Address
                                  Sankalp Khanna
           The Australian e-Health Research Centre, RBWH, Herston, Australia
                       Sankalp.Khanna@csiro.au


Speaker Profile

    Sankalp is a Postdoctoral Fellow at the Aus-
tralian e-Health Research Centre, the leading
national research facility applying information
and communication technology to improve
health services and clinical treatment for Aus-
tralians. As a member of the Forecasting and
Scheduling team, he is actively engaged in pro-
jects in the areas of planning and optimization,
patient flow analytics, prediction and forecast-
ing, and predictive scheduling, all aimed at em-
ploying artificial intelligence to improve the
efficiency of the health system.

   His research interests include Applied Artificial Intelligence, Prediction
and Forecasting, Planning and Scheduling, Multi Agent Systems, Distributed
Constraint Reasoning, and Decision Making and Learning under Uncertainty.

    Sankalp completed a PhD in 2010 looking at intelligent techniques to
model and optimise the complex, dynamic and distributed processes of Elec-
tive Surgery Scheduling. He was the recipient of a state award for out-
standing student achievement in 2006. He has co-authored several journal
and conference papers and editorials, and served on the program and organ-
ising committees of numerous national and international conferences and
workshops. He is a member of the ACS, HISA, IEEE and AAAI societies.

   Sankalp was also the founding workshop chair of this AI in Health work-
shop series.




                      Smart Analytics in Health
                           Keynote Address
                              Christian Guttmann
                             IBM Research, Australia
                Christian.guttmann@au1.ibm.com

Speaker Profile

   Dr. Guttmann leads and defines projects
around health care at the newly established IBM
Research lab in Melbourne – the 11th lab of IBM
Research worldwide. One focus of Guttmann’s
work is to build smarter analytics that enable
health care entities (doctors, nurses, hospitals,
pharmacies, etc.) to collaborate more efficiently
in complex environments. His work addresses
the information and communication challenges
faced by tomorrow’s world of health care: how can we create and apply
smarter collaborative health care technologies that cope with the tsunami of
chronic diseases?

   Prior to IBM, Dr. Guttmann led the research theme on health care and
disaster at the Etisalat British Telecom Innovation Centre (EBTIC). The theme
partnered with major stakeholders, including governmental health
authorities and ministries. He has been a research fellow at the Faculty of
Medicine, Nursing and Health Sciences at Monash University, where he
researched how intelligent systems can improve collaborative care (in
collaboration with primary health care providers). He has also worked on
industrial projects with HP and Ericsson.

   Dr. Guttmann holds a PhD from Monash University, two Master’s degrees
from Paderborn University (Germany) and the Royal Institute of
Technology (Sweden), and a psychology degree from Stockholm University
(Sweden). He has organised major conferences and workshops, edited two books
on intelligent agent technologies, and co-authored over 30 articles in leading
conferences and journals.




    An investigation into the types of drug related problems
       that can and cannot be identified by commercial
                  medication review software

         Colin Curtain, Ivan Bindoff, Juanita Westbury and Gregory Peterson

                 Unit for Medication Outcomes Research and Education
                                   School of Pharmacy
                                  University of Tasmania
         {Colin.Curtain, Ivan.Bindoff, Juanita.Westbury, G.Peterson}@utas.edu.au



       Abstract.
          A commercially used expert system using multiple-classification ripple-
       down rules applied to the domain of pharmacist-conducted home medicines re-
       view was examined. The system was capable of detecting a wide range of po-
       tential drug-related problems. The system identified the same problems as
       pharmacists in many of the cases. Problems identified by pharmacists but not by
       the system may be related to missing information or information outside the
       domain model. Problems identified by the system but not by pharmacists may
       be associated with system consistency and perhaps human oversight or human
       selective prioritization. Problems identified by the system were considered rele-
       vant even though the system identified a larger number of problems than its human
       counterparts.

       Keywords: Clinical decision support system, multiple-classification ripple-
       down rules, expert system, pharmacy practice


1      Introduction

   A drug-related problem (DRP) can be broadly defined as “…an event or circum-
stance involving drug therapy that actually or potentially interferes with desired health
outcomes” [1]. DRPs comprise a spectrum of problems including over- or under-
dosage, drug-drug or drug-disease interactions, untreated disease and drug toxicity.
Patient health education and compliance with therapy may be sub-standard and sub-
sequently also be considered as drug-related problems. DRPs can be dangerous; for
instance, a marginally high daily dose of warfarin has the potential to cause fatal
bleeding.
   Home medicines review (HMR) is a Commonwealth Government funded service
conducted by accredited pharmacists to identify and address DRPs among eligible
patients [2]. The main aims of the service are to enhance patient knowledge, quality
use of medicines, reconcile health professional awareness of actual medication use
and, ultimately, improve patient quality of life. The HMR service is a collaborative
           activity between health professionals, typically accredited pharmacists, general practi-
           tioners (GPs), and patients. Since its inception in 2001 the service has steadily grown
           with nearly 80,000 HMRs funded in the 2011/2012 period [3].
              An HMR is initiated for eligible consenting patients by a GP. Patients are eligible
           if, among other criteria, they regularly take 5 or more medications [2]. An HMR-
           accredited pharmacist then obtains medical information from the GP, covering
           medical history, current medications and pathology.
              A core component of an HMR is an interview between the pharmacist and the pa-
           tient, typically conducted in the patient’s home. The interview elicits additional
           information such as: actual medication use, additional non-prescribed medications,
           an understanding of the patient’s motivation behind actual rather than directed
           medication use, and the patient’s health and medication knowledge [4]. This
           process allows for a deeper understanding of the patient’s situation and gives the
           pharmacist insight into cultural or language barriers, physical and economic limita-
           tions and family support.
              The amassed information is reviewed by the pharmacist to identify actual and po-
           tential DRPs. The pharmacist writes a report of findings for the patient’s GP, which
           includes recommendations to resolve any actual or potential problems. Consultation
           between the GP and the patient culminates in an actionable medication management
           plan designed to trial changes to existing therapy, and ideally, lead to improved medi-
           cation use and improved patient health outcomes [4].
              An important component is the professional skill of the pharmacist to be able to
           identify clinically relevant DRPs from the available information. This requires a wide
           scope of knowledge, not only of medications, but of evidence-based guidelines and
           contemporary management of a variety of medical conditions.
              Evidence-based guidelines can be difficult to implement due to their apparent
           complexity. An example is provided from Basger et al.’s prescribing indicators in
           elderly Australians: “Patient at high risk of a cardiovascular event (b) is taking an
           HMG-CoA reductase inhibitor (statin)” [5]. If a patient did not meet this criterion, this
           would be considered a DRP. It can be reasonably expected that pharmacists would be
           aware of the statin medications currently available in Australia; in October 2012 these
           were atorvastatin, fluvastatin, pravastatin, rosuvastatin, and simvastatin. Note (b)
           specifies those patients at high risk of a cardiovascular event: “age>75 years, sympto-
           matic cardiovascular disease (angina, MI [myocardial infarction], previous coronary
           revascularization procedure, heart failure, stroke, TIA [transient ischemic attack],
           PVD [peripheral vascular disease]), genetic lipid disorder, diabetes and evidence of
           renal disease (microalbuminuria and/or proteinuria and/or GFR [glomerular filtration
           rate] <60 ml/min)”. Determining which patients are at high risk of cardiovascular
           events is more problematic and requires sufficient additional information to make
           such a determination. One obvious problem is the amount of information that needs
           to be screened, both within the guideline text and the patient data, to identify
           appropriate patients.
              A commercial product developed by Medscope, Medication Review Mentor
           (MRM) [6], incorporates a clinical decision support system (CDSS) to assist with the
           detection of DRPs. MRM utilizes a knowledge-based system to detect DRPs and pro-
           vide recommendations for their resolution. This knowledge-based system uses the
multiple classification ripple-down rules (MCRDR) method and was based on the
work of Bindoff et al. who applied this approach to the knowledge domain of medica-
tion reviews [7, 8]. The ripple-down rules method was considered appropriate as
knowledge could be gradually added to the knowledge base, broadening the scope and
refining existing knowledge as the system was being used [7, 9]. Bindoff et al. sug-
gested intelligent decision support software developed for this knowledge domain
may improve the quality and consistency of medication reviews.
   No prior research had been undertaken to determine the clinical decision support
capacity of this commercial software, apart from concurrent research by the authors,
which assessed the opinions of pharmacology experts and determined that MRM is
capable of identifying clinically relevant DRPs [10-12].
   This evaluation attempts to shed light on the scope of DRPs that can be identi-
fied by this software by presenting summary counts and examples of the types of
problems that were identified by MRM and by pharmacists. The paper compares
pharmacist findings and MRM findings in largely qualitative terms, highlighting
common findings and extremes of difference, and discussing the possible advantages
and limitations of the software as well as areas for potential improvement.


2      How MRM works

   The decision support component of MRM is a knowledge-based system which uses
MCRDR as its inference engine. MCRDR provides the knowledge engineer with a way
to incrementally improve the quality of the knowledge base through the addition of
either new rules, which are added when the system fails to identify a DRP, or refine-
ments to existing rules, which are added when the system incorrectly identifies a DRP.
The system’s knowledge base is managed by medication review experts, who regularly
review cases, examining the findings of the system for each case, and then adding or
refining rules until the system produces a wholly correct set of findings for that case
[8]. The validity of new rules is always ensured, as the system identifies any conflicts
which may arise from the addition of a new rule, and prompts the pharmacist to refine
their rule until no further conflicts arise.
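   To make this refinement cycle more concrete, the following minimal sketch (in Python;
illustrative only, with hypothetical class and field names, and not drawn from the MRM
implementation) shows an MCRDR-style rule tree: each rule stores a condition, a conclusion
and the cornerstone case that motivated it; several top-level rules may fire for one case
(multiple classification); and an incorrect finding is corrected by attaching an exception rule
whose condition is true for the new case but false for the cornerstone case.

# Minimal MCRDR-style sketch (hypothetical names; not the MRM code base).

class Rule:
    def __init__(self, condition, conclusion, cornerstone=None):
        self.condition = condition      # predicate over a case (a dict of patient data)
        self.conclusion = conclusion    # e.g. the text of a DRP finding
        self.cornerstone = cornerstone  # the case that motivated this rule
        self.exceptions = []            # refinement rules, evaluated only if this rule fires

    def fire(self, case):
        """Return the conclusions this rule (or its refinements) gives for a case."""
        if not self.condition(case):
            return []
        refined = [c for ex in self.exceptions for c in ex.fire(case)]
        return refined if refined else [self.conclusion]


class MCRDRKnowledgeBase:
    def __init__(self):
        self.rules = []                 # multiple top-level rules may fire for one case

    def classify(self, case):
        return [c for rule in self.rules for c in rule.fire(case)]

    def add_rule(self, condition, conclusion, case):
        """Add a new top-level rule when the system failed to identify a DRP."""
        self.rules.append(Rule(condition, conclusion, cornerstone=case))

    def refine(self, parent, condition, conclusion, case):
        """Attach an exception when the parent rule fired incorrectly for this case.
        The condition must not hold for the parent's cornerstone case, otherwise the
        stored (correct) behaviour would change."""
        if parent.cornerstone is not None and condition(parent.cornerstone):
            raise ValueError("condition does not distinguish the cornerstone case")
        parent.exceptions.append(Rule(condition, conclusion, cornerstone=case))


# Usage sketch: a statin rule refined for a known statin-fibrate interaction.
kb = MCRDRKnowledgeBase()
kb.add_rule(lambda c: "simvastatin" in c["drugs"],
            "Review ongoing need for statin therapy",
            case={"drugs": ["simvastatin"]})
kb.refine(kb.rules[0],
          lambda c: "gemfibrozil" in c["drugs"],
          "Simvastatin-gemfibrozil combination: increased myopathy risk",
          case={"drugs": ["simvastatin", "gemfibrozil"]})
print(kb.classify({"drugs": ["simvastatin", "gemfibrozil"]}))

The cornerstone check mirrors the conflict detection described above: the expert is forced
to pick a distinguishing condition, so previously validated cases keep their conclusions.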


3      Methods

   Australia-wide data collected during 2008 for a previous project, examining the
economic value of HMRs, was used for this study [13]. The data contained patient
demographics, medications, diagnoses and pathology results for 570 community-
dwelling patients aged 65 years old and older. The 570 HMRs were obtained from
148 different pharmacists. Supplementing this data were the original reviewing phar-
macists’ findings, detailing pharmacist-identified DRPs and recommendations.
   The HMR data were entered into MRM and DRPs identified by MRM were recorded.
MRM utilized a wide range of information including basic patient demographics
such as age and gender, medication type including strength, directions
           and daily dose. MRM could calculate daily dose from strength and directions in many
           cases. Duration of use of medication could be entered, which included options of less
           than 3 months and more than 12 months. Medications were assigned Anatomical
           Therapeutic Chemical (ATC) classifications [14]. ATC is a five-tier hierarchical classifi-
           cation system allowing medications with similar properties to be grouped together in
           chemical classes, which are then grouped into therapeutic categories.
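   To illustrate this hierarchy, the short sketch below (Python; illustrative only, and the helper
names are hypothetical) splits an ATC code into its five cumulative levels using the standard
level widths of 1, 3, 4, 5 and 7 characters; C10AA05 (atorvastatin) and C10AA01 (simvastatin),
for example, share the chemical class C10AA.

# Decompose an ATC code into its five hierarchical levels (standard ATC level widths).

ATC_LEVEL_WIDTHS = [1, 3, 4, 5, 7]   # cumulative code lengths for ATC levels 1-5

def atc_levels(code):
    """Return the cumulative prefixes for levels 1-5, e.g. C, C10, C10A, C10AA, C10AA05."""
    return [code[:w] for w in ATC_LEVEL_WIDTHS if len(code) >= w]

def same_chemical_class(code_a, code_b):
    """Two medications share a chemical class if their codes agree down to level 4."""
    return code_a[:5] == code_b[:5]

print(atc_levels("C10AA05"))                      # ['C', 'C10', 'C10A', 'C10AA', 'C10AA05']
print(same_chemical_class("C10AA05", "C10AA01"))  # True: both are statins (level 4 = C10AA)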
              Diagnoses could be entered and were based on the ICPC2 classifications [15]. The
           ICPC2 classification system was also hierarchical, grouping diagnoses under similar
           categories. Diagnoses could be assigned temporal context as recent, ongoing or past
           history. Medication allergies and general observations including height, weight and
           blood pressure could be entered. A wide range of pathology readings could be en-
           tered, including biochemical and hematological data.
              At the time of data entry and collection of results, August 2011, MRM con-
           tained approximately 1800 rules [16]. Rule development was undertaken by a phar-
           macist with expertise in both clinical pharmacology and HMRs [6].
              Direct comparison of the DRPs identified by MRM and those identified by the
           original pharmacists was not possible due to the individual textual nature of each
           DRP. Each DRP identified by either the pharmacist or MRM was mapped to a con-
           cept (defined here as a theme) that described the DRP in sufficient detail to allow
           comparisons of similarity and difference between pharmacists and MRM. The themes
           often described the type of drug or disease and other relevant factors involved. The
           development of a list of themes and the mapping of DRPs to themes was performed
           manually by the author, a qualified pharmacist.
              Examples of the text of two DRPs identified by a pharmacist and by MRM in the
           same patient are shown in Table 1. These DRPs were assigned the theme Hyper-
           lipidemia under/untreated, which captured the basic problem identified within the text
           of each DRP.

                                          Table 1. Example DRP text

           MRM:         Patient has elevated triglycerides and is only taking a statin.
                        Additional treatment, such as a fibrate, may be worth considering.
           Pharmacist:  Patient’s cholesterol and triglycerides remain elevated despite
                        Lipitor [statin]. This may be due to poor compliance or an
                        inadequate dose.

              These themes provided a common language for comparison of the DRPs found by
           the original pharmacist reviewer and MRM. The initial themes were created where at
           least two of three published prescribing guidelines for the elderly [5, 17, 18] were in
           agreement concerning the same types of DRPs. DRPs from MRM and pharmacists
           were mapped to this table of themes. Further themes were added if both pharmacist
           and MRM DRPs could be mapped to any remaining ‘non-agreement’ prescribing
           guideline DRPs. New themes were developed for remaining pharmacist and MRM
           DRPs where concepts were clearly similar but were not contained within prescribing
           guidelines. These new themes were very broad such as Vitamin, no indication, and
may have included the DOCUMENT DRP classification text such as Therapeutic
dose too high [19]. The remaining DRPs were unique to either pharmacists or MRM
and themes were provided where possible, such as Skin disease (un)dertreated –
pharmacist only DRP. Lastly, miscellaneous otherwise unclassifiable DRPs were as-
signed Other DRP pharmacist and Other DRP MRM.
   A list of 129 themes was developed. Many themes described disease states and/or
drug classes, capturing the identified DRPs in general terms. A descriptive analysis of
the themes was performed.
   The number of unique themes found in each patient was considered more im-
portant than the raw number of themes found in each patient. That is, where two DRPs
matched the same theme in the same patient, that theme was counted once. The rea-
son behind this decision was to compare the number of different types of conceptual
problems that could be identified across patients, rather than raw numbers across pa-
tients.
   Each theme identified in each patient was allocated into one of three categories: 1.
Identified by pharmacists only, 2. Identified by MRM only, or 3. Identified by both.
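   A sketch of this per-patient comparison (Python; illustrative only, and the theme strings
shown are stand-ins for entries from the list of 129 themes) is given below. It counts each
theme once per patient, allocates it to one of the three categories, and also computes the
per-patient Jaccard coefficient reported in the Results.

# Per-patient comparison of pharmacist and MRM findings after mapping DRPs to themes.

def compare_patient(pharmacist_themes, mrm_themes):
    """Deduplicate themes within a patient and allocate them to the three categories."""
    p, m = set(pharmacist_themes), set(mrm_themes)
    both, either = p & m, p | m
    return {
        "pharmacist_only": p - m,
        "mrm_only": m - p,
        "both": both,
        # Jaccard: themes in common / different themes found by either source.
        "jaccard": len(both) / len(either) if either else 0.0,
    }

# Hypothetical single patient.
result = compare_patient(
    pharmacist_themes=["Hyperlipidemia under/untreated", "Vitamin, no indication"],
    mrm_themes=["Hyperlipidemia under/untreated", "Calcium channel blocker and reflux"],
)
print(sorted(result["both"]), round(result["jaccard"], 3))   # one theme in common, 0.333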


4      Results

The patient cohort was predominantly female, with an average age of 80 and an aver-
age of 12 medications and 9 diagnoses, as described in Table 2.

                             Table 2. Patient Demographics

Patient (N = 570)                           Demographics
Age (years)                                 79.6 ± 6.7
Gender                                      Male 234 : Female 336
Number of medications                       12.0 ± 4.4
Number of diagnoses                         9.1 ± 5.2

Pharmacists identified a total of 2020 DRPs, an average of 3.5±1.8 per patient, with a
range of 0 to 13 DRPs. MRM identified 3209 DRPs, of which 256 were excluded due
to duplicated findings, leaving 2953 MRM DRPs, and an average of 5.2±2.8 per pa-
tient, ranging from 0 to 16 DRPs.
   The 2953 MRM DRPs were able to be assigned to 100 different themes that de-
scribed in general terms the central issue of each of the DRPs. Similarly, the 2020
pharmacist DRPs were able to be assigned to 119 different themes. Ninety of the
themes identified by pharmacists could also be identified by MRM. Within these 90
themes, the software identified the same issues as the pharmacists in one or more of
the same patients for 68 particular themes.
   The number of different themes identified by MRM or by pharmacists per patient
was considered more important than the raw totals. The 2953 MRM DRPs were ag-
gregated into 2854 themes. Pharmacist DRPs which were clearly identifiable as com-
pliance or non-classifiable cost-related problems and outside the scope of MRM’s




                                                                                               15
AIH 2012




           ability to identify were excluded, leaving 1726 pharmacist DRPs which were aggre-
           gated into 1680 themes.
               MRM was able to identify the same themes as identified by pharmacists in the
           same patients 389 times, a 23% (389/1680) overlap of pharmacist findings by theme
           and patient. This then left 1291 themes identified by pharmacists only and 2465
           themes identified by MRM only. For each patient a Jaccard coefficient was calculated
           as the number of themes in common divided by the number of different themes found
           by either MRM or pharmacists. For the 570 patients Jaccard coefficients ranged from
           a minimum of 0 to a maximum of 1, with a mean of 0.092 ± 0.117.
               The top five themes by number of patients in common are shown in Table 3. Not
           surprisingly several of the most common themes found align with common health
           conditions in this cohort, namely hyperlipidemia and osteoporosis.
               Some of the problems that can be identified by the software are shown in Tables 3
           and 4. Table 3 shows there is some overlap of the ability of MRM to find the same
           kind of problems as pharmacists in the same patients. However, both pharmacists and
           MRM find many instances of the same problem in different patients. Table 4 shows
           examples of some of the themes at the extremes of overlap. The two example themes
           calcium-channel blocker and reflux and anti-lipidemic drug, no indication were iden-
           tified in many patients by MRM but only once each by pharmacists. Similarly, the
           two example themes vitamin, no indication and combine medications into combina-
           tion product illustrate that pharmacists identified many patients with particular prob-
           lems that MRM could not identify.

                         Table 3. Top five themes by patients in common

           Theme                                            Patients  Patients    Patients  Total patients
                                                            MRM       pharmacist  in        (pharmacists
                                                            found     found       common    + MRM)
           Osteoporosis (or risk) may require calcium
           and/or vitamin D                                 137       117         49        205
           Renal impairment and using (or check dose
           for) renally excreted drugs                      122        48         24        146
           Hyperlipidemia under/untreated                    83        31         20         94
           Sedatives long-acting or sedative long term       55        31         18         68
           NSAID not recommended (heart disease/risk
           of bleed/other)                                    59        28         17         70




                 Table 4. Themes skewed in favour of MRM or pharmacists

Theme                                            Patients  Patients    Patients  Total patients
                                                 MRM       pharmacist  in        (pharmacists
                                                 found     found       common    + MRM)
Calcium channel blocker and reflux               120         1          1        120
Anti-lipidemic drug, no indication                56         1          1         56
Vitamin, no indication                             1         6          1          6
Combine medications into combination product       3        10          1         12



5      Discussion

   The majority of the unique pharmacist themes involved non-classifiable problems,
mostly relating to drug cost and compliance. These pharmacist-only themes were not cap-
tured in the knowledge domain model. Although the majority of unique MRM themes
could have been identified by pharmacists, they were not. This was not due to a lack of
information on the part of pharmacists; more likely, pharmacists had additional
knowledge that rendered these issues moot. It is also possible that
pharmacists were not aware of or simply missed these particular issues. Alternatively,
the software may have produced erroneous findings.
   The wide variety of variables, including temporal context, encapsulated in the model
was manifested in the broad scope of problems that could be identified by the soft-
ware. For 68 themes (out of the 100 themes identified by MRM) the software showed the
ability to identify the same issues that pharmacists could find in the same patients. In
some circumstances half to all instances of a theme identified by pharmacists were also
identified by MRM; most of the themes shown in Table 3 are examples of this.
   The broad scope of themes and similarity of identification of themes in the same
patients as pharmacists is encouraging; however, there were many patients who had
particular problems identified by either MRM or pharmacists but not by both. Further,
twenty-two themes were identified by MRM and by pharmacists without any patients
in common. Several explanations are posited to account for these differences.
   The first and main point is knowledge not captured and subsequently not able to be
utilized by the software. Extending this point, knowledge may have been available but
not entered into the software because it was not recorded anywhere by either the pa-
tient’s GP or the reviewing pharmacist. Several themes stated some drugs had no
indication for use because no suitable diagnosis was assigned to those patients. An
example in Table 4, anti-lipidemic drug, no indication, shows MRM found many
instances of this potential problem but pharmacists did not identify this as an issue.
Does this mean pharmacists were aware of the indication for the drug? Or does it
suggest pharmacists missed the opportunity to identify unnecessary medication?




              Overall MRM found more problems than pharmacists. It is not unreasonable to
           suggest pharmacists may lack consistency in identifying DRPs. Correspondingly, it is
           not unreasonable to suggest MRM exemplifies consistency, as it is after all computer
           software. Several studies examining clinical decision support, including two proto-
           types on which MRM was based, have identified that humans lack consistency or lack
           the capacity to identify all relevant problems in contrast with the software [7, 8, 20].
           Additionally, pharmacists may have focused on more important DRPs through priori-
           tizing more pertinent DRP findings and ignoring lesser issues.
              MRM did find substantially more problems than pharmacists, which raises some
           concerns about potential alert fatigue, a known limitation of many clinical decision
           support systems, wherein the system identifies so many irrelevant problems that the
           user simply ignores it entirely. It should be noted that a portion of MRM’s findings
           were duplications (256 of 3209 DRPs). The central requirement and unfortunately concomi-
           tant problem of clinical decision support is the need to have sufficient information to
           present findings in context of the patient’s current clinical situation. The application
           of MCRDR attempts to address the problem of context through incorporation of an
           extensive array of variables integrated with a knowledge base of many patient cases
           and inference rules.
              However, it appears that MRM may not suffer from alert fatigue, as separate re-
           search that we have conducted, concerning the clinical relevance of the DRP findings
           of MRM and of pharmacists, was recently completed [11]. In that study experts in the
           field were of the opinion that both MRM and pharmacists identified clinically rele-
           vant DRPs [11]. That study supports the position that MRM may be more consistent
           than pharmacists by identifying a greater number of issues that pharmacists did not
           identify. Secondly, and importantly, despite the larger number of issues identified by
           MRM, lack of clinical relevance did not appear to be a factor.
              A specific advantage of this implementation of MCRDR was the use of case-based
           reasoning, allowing the knowledge domain expert to readily add new rules and refine
           existing rules. This method incrementally increases the precision of rules in context of
           the uniquely varied situations encountered through amassing knowledge of individual
           patients. This is an important point, as the development of new medications, or new
           applications of existing medications, and ever-expanding medical knowledge need to
           be incorporated into such software on an ongoing basis to maintain the rele-
           vance of the knowledge base.
              Due to the ability to easily add to and refine the rules and knowledge base, a follow-
           up study may produce different, and likely improved, results. A subsequent investigation
           applying the same patient cases to the software and comparing the differences may be
           performed to determine whether DRP identification can be further enhanced over
           time.
              MRM appears to work well in the HMR domain, but improvements may include a
           greater range of variables, such as compliance or cost-related concepts, to widen prob-
           lem detection scope as well as to increase the accuracy of problem identification. Rule
           refinement to reduce the occurrence of duplicated DRPs is warranted. Another poten-
           tial issue involves medication classification which was based on the ATC classifica-
           tion system. The ATC classification system included codes for combination products.
There may be limitations when attempting to create rules based on individual ingredi-
ents within combination products as each individual ingredient is not uniquely identi-
fied. Additionally, with the impending implementation of national electronic health
record standards, data entry limitations such as transcription errors or missed data
entry may be minimized.


6      Conclusion

   The use of ripple-down rules in this software performed well in the complex and
detailed HMR knowledge domain. It showed a reasonable degree of similarity with
the human experts in both the range of problem types that could be identified
within its scope of knowledge and the frequency of problems found. MRM could not
find some of the problems that pharmacists could find; some things will always be
missed because of incomplete data.
   The truly interesting aspect is the software’s capacity to identify more problems
than pharmacists. This capacity did not appear to come at the cost of relevance, and it
is likely to be a strong indication of the consistent, methodical ability of the machine
to identify problems. This finding alone justifies the use of such
a tool. MRM cannot replace pharmacists but may help pharmacists make good deci-
sions and avoid missing important problems.


7      Competing interests

   The author Gregory Peterson is an investor in Medscope Pty Ltd which developed
MRM. The MRM software was based on the work of author Ivan Bindoff. Gregory
Peterson was involved with the work of Ivan Bindoff as researcher and supervisor.
Peter Tenni, a researcher previously involved with Ivan Bindoff’s work, is currently
the manager of the clinical division of Medscope Pty Ltd.


8      References
 1. Pharmaceutical Care Network Europe, www.pcne.org/sig/drp/drug-related-problems.php
 2. Home Medicines Review (HMR), www.medicareaustralia.gov.au/provider/pbs/fifth-
    agreement/home-medicines-review.jsp
 3. Medicare Australia – Statistics – Item Reports,
    www.medicareaustralia.gov.au/statistics/mbs_item.shtml
 4. Pharmaceutical Society of Australia, Guidelines for pharmacists providing home medi-
    cines review (HMR) services. Pharmaceutical Society of Australia (2011)
 5. Basger, B.J., T.F. Chen, and R.J. Moles, Inappropriate medication use and prescribing in-
    dicators in elderly Australians: Development of a prescribing indicators tool. Drugs Aging.
    25(9), 777-793 (2008)
 6. Medscope Medication Review Mentor (MRM), www.medscope.com.au
            7. Bindoff, I., Stafford, A., Peterson, G., Kang, B.H., Tenni, P.: The potential for intelligent
               decision support systems to improve the quality and consistency of medication reviews. J
               Clin Pharm Ther. 37(4), 452-458 (2011)
            8. Bindoff, I.K., Tenni, P.C., Peterson, G.M., Kang, B.H., Jackson, S.L.: Development of an
               intelligent decision support system for medication review. J Clin Pharm Ther. 32(1), 81-88
               (2007)
            9. Compton, P., Peters, L., Edwards, G., Lavers, T.G.: Experience with Ripple-Down Rules.
               Knowledge-Based Systems. 19(5), 356-362 (2006)
           10. Curtain, C., Westbury, J., Bindoff, I., Peterson, G.: Validation of home medicines review
               decision support software. In Graduate research - Sharing excellence in research confer-
               ence proceedings, p. 23 Hobart (2012)
           11. Curtain, C., Bindoff, I., Westbury, J., Peterson, G.: Validation of decision support software
               for identification of drug-related problems. In 11th National conference of Emerging Re-
               searchers in Ageing, In Press, Brisbane (2012)
           12. Curtain, C., Bindoff, I., Westbury, J., Peterson, G.: Can software assist the home medi-
               cines review process by identifying clinically relevant drug-related problems? In ASCEPT-
               APSA 2012 conference. In Press. Sydney (2012)
           13. Stafford, A., Tenni, P., Peterson, G., Doran, C., Kelly, W.: IIG-021 - VALMER (the Eco-
               nomic Value of Home Medicines Reviews), Pharmacy Guild of Australia
           14. WHO Collaborating Centre for Drug Statistics Methodology Norwegian Institute of Public
               Health. International language for drug utilization research ATC / DDD, www.whocc.no
           15. Jamoulle, M. ICPC2, the international classification of primary care,
               www.ulb.ac.be/esp/wicc/icpc2.html#C2
           16. Tenni, P.: Manager, Clinical Division, Medscope, Hobart (2012)
           17. Fick, D.M., Cooper, J.W., Wade, W.E., Waller, J.L., Maclean, J.R., Beers, M.H.: Updating
               the Beers criteria for potentially inappropriate medication use in older adults: results of a
               US consensus panel of experts. Arch Intern Med. 163, 2716-2724 (2003)
           18. Gallagher, P., Ryan, C., Byrne, S., Kennedy, J., O’Mahony, D.: STOPP (Screening Tool of
               Older Person's Prescriptions) and START (Screening Tool to Alert doctors to Right
               Treatment). Consensus validation. Int J Clin Pharmacol Ther. 46(2), 72-83 (2008)
           19. Williams, M., Peterson, G.M., Tenni, P.C., Bindoff, I.K., Stafford, A.C.: DOCUMENT: a
               system for classifying drug-related problems in community pharmacy. Int J Clin Pharm.
               34(1), 43-52 (2011)
           20. Martins, S.B., Lai, S., Tu, S., Shankar, R., Hastings, S.N., Hoffman, B.B., Dipilla, N.,
               Goldstein, M.K.: Offline testing of the ATHENA Hypertension decision support system
               knowledge base to improve the accuracy of recommendations. AMIA Annu Symp Proc,
               539-43 (2006)




                FS-XCS vs. GRD-XCS:
        An analysis using high-dimensional DNA
         microarray gene expression data sets

            Mani Abedini1 , Michael Kirley1 , and Raymond Chiong1,2
                1
                 Department of Computing and Information Systems,
                The University of Melbourne, Victoria 3010, Australia
                {mabedini,mkirley,rchiong}@csse.unimelb.edu.au
                      2
                        Faculty of Higher Education Lilydale,
             Swinburne University of Technology, Victoria 3140, Australia
                               rchiong@swin.edu.au


        Abstract. XCS, a Genetic Based Machine Learning model that com-
        bines reinforcement learning with evolutionary algorithms to evolve a
        population of classifiers in the form of condition-action rules, has been
        used successfully for many classification tasks. However, like many other
        machine learning algorithms, XCS becomes less effective when it is ap-
        plied to high-dimensional data sets. In this paper, we present an anal-
        ysis of two XCS extensions – FS-XCS and GRD-XCS – in an attempt
        to overcome the dimensionality issue. FS-XCS is a standard combina-
        tion of a feature selection method and XCS. As for GRD-XCS, we use
        feature quality information to bias the evolutionary operators without
        removing any features from the data sets. Comprehensive numerical sim-
        ulation experiments show that both approaches can effectively enhance
        the learning performance of XCS. While GRD-XCS has obtained signif-
        icantly more accurate classification results than FS-XCS, the latter has
        required considerably less execution time.


1     Introduction
Classification tasks arise in many areas of science and engineering. One such ex-
ample is disease classification based on gene expression profiles in bioinformatics.
Gene expression profiles provide important insights into, and further our under-
standing of, biological processes. They are key tools used in medical diagnosis,
treatment, and drug design [21]. From a clinical perspective, the classification
of gene expression data is an important problem and a very active research area
(see [3] for a review). DNA microarray technology has advanced a great deal
in recent years. It is possible to simultaneously measure the expression levels
of thousands of genes under particular experimental environments and condi-
tions [22]. However, the number of samples tends to be much smaller than the
number of genes (features)1 . Consequently, the high dimensionality of a given
data set poses many statistical and analytical challenges, which often degrade
the performance of classification methods used.
1
    Generally speaking, the number of samples must be larger than the number of fea-
    tures for good classification performance.

               XCS – the eXtended Classifier System – is a Genetic Based Machine Learning
           (GBML) method that has been successfully used for a wide variety of classifi-
           cation applications, including medical data mining. XCS can learn from sample
data in multiple iterative cycles. This is a valuable characteristic, but XCS also ex-
hibits two pitfalls common to most classification methods: sensitivity to
data noise and the “curse of dimensionality” [22]. Both issues can easily jeopar-
           dise the learning process. A well-known solution is to use a cleansing stage. For
           example, feature selection/ranking techniques can remove unnecessary features
           from the data set. Reducing the dimensionality and removing noisy features can
improve learning performance. Nevertheless, there exist data sets with highly
co-expressed features, such as those used to study epistasis phenomena, that do
not allow effective feature reduction; examples include protein structure predic-
tion and protein-protein interaction.
               In this paper, we study two extensions of XCS inspired by feature selection
           techniques commonly used in machine learning: FS-XCS with effective feature
           reduction in place and GRD-XCS [1] that does not remove any features. The pro-
           posed model uses some prior knowledge, provided by a feature ranking method,
           to bias the discovery operators of XCS. A series of comprehensive numerical
           experiments on high-dimensional medical data sets has been conducted. The re-
           sults of these simulation experiments suggest that both extensions can effectively
enhance the learning performance of XCS. While GRD-XCS has performed sig-
nificantly more accurately than FS-XCS, the latter is shown to require much less
execution time.
               The remainder of this paper is organised as follows: Section 2 briefly describes
           some related work on XCS. In Section 3, we present the details of our proposed
           model. Section 4 discusses the experimental settings and results. Finally, we draw
conclusions and highlight future possibilities in Section 5.


           2   Related Work

           GBML concerns applying evolutionary algorithms (EAs) to machine learning.
           EAs belong to the family of nature-inspired optimisation algorithms [9, 10]. As
           a manifestation of population-based, stochastic search algorithms that mimic
           natural evolution, EAs use genetic operators such as crossover and mutation for
           the search process to generate new solutions through a repeated application of
           variation and selection [11].
               It is well documented in the evolutionary computation literature that the im-
           plementation of EA’s genetic operators can influence the trajectory of the evolv-
           ing population. However, there has been a paucity of studies focused specifically
           on the impact of selected evolutionary operator implementations in Learning
           Classifier Systems (LCSs), a type of GBML algorithm for rule induction. Here,
           we briefly describe some of the key studies related to LCSs in general and XCS
           – a Michigan-style LCS – in particular.




    In one of the first studies focused on the rule discovery component specifically
for XCS, Butz et al. [7] have shown that uniform crossover can ensure success-
ful learning in many tasks. In subsequent work, Butz et al. [6] introduced an
informed crossover operator, which extended the usual uniform operator such
that exchanges of effective building blocks occurred. This approach helped to
avoid the over-generalisation phenomena inherent in XCS [14]. In other work,
Bacardit et al. [4] customised the GAssist crossover operator to switch between
the standard crossover and a new simple crossover, SX. The SX operator uses a
heuristic selection approach to take the minimum number of rules from the par-
ents (more than two) that can obtain maximum accuracy. Morales-Ortigosa et
al. [16] have also proposed a new XCS crossover operator, BLX, which allowed
for the creation of multiple offspring with a diversity parameter to control differ-
ences between offspring and parents. In a more comprehensive overview paper,
Morales-Ortigosa et al. [17] presented a systematic experimental analysis of the
rule discovery component in LCSs. Subsequently, they developed crossover op-
erators to enhance the discovery component based on evolution strategies with
significant performance improvements.
    Other work focused on biased evolutionary operators in LCSs includes that
of San Jose-Revuelta [18], who introduced a hybridised Genetic Algorithm-Tabu
Search (GA-TS) method that employed modified mutation and crossover oper-
ators. Here, the operator probabilities were tuned by analysing all the fitness
values of individuals during the evolution process. Wang et al. [20] used Infor-
mation Gain as part of the fitness function in an EA. They reported improved
results when comparing their model to other machine learning algorithms. Re-
cently, Huerta et al. [5] combined linear discriminant analysis with a GA to
evaluate the fitness of individuals and associated discriminant coefficients for
crossover and mutation operators. Moore et al. [15] argued that biasing the
initial population, based on expert knowledge preprocessing, would lead to im-
proved performance of the evolutionary based model. In their approach, a statis-
tical method, Tuned ReliefF, was used to determine the dependencies between
features to seed the initial population. A modified fitness function and a new
guided mutation operator based on feature dependencies were also introduced,
leading to significantly improved performance.


3   The Model

We have designed and developed two extensions of XCS, both inspired by fea-
ture selection techniques commonly used in machine learning. The first exten-
sion, which we call FS-XCS, is a combination of a Feature Selection method and
the original XCS. The second extension, which we call GRD-XCS, incorporates
a probabilistically Guided Rule Discovery mechanism for XCS. The moti-
vation behind both extensions was to improve classifier performance (in terms
of accuracy and execution time), especially for high-dimensional classification
problems.







           Fig. 1. Here, Information Gain has been used to rank the features. The top Ω features
           (in this example Ω = 5) are selected and allocated relatively large probability values
           ∈ [γ, 1]. The RDP vector maintains these values. The probability value of the highest
           ranked feature is set to 1.0. Other features receive smaller probability values relative to
           their rank (in this example γ =0.5). Features that are not selected based on Information
           Gain are assigned a very small probability value (in this example ξ = 0.1).


               FS-XCS uses feature ranking methods to reduce the dimension of a given
           data set before XCS starts to process the data set. It is a fairly straightfor-
           ward hybrid approach. However, in GRD-XCS information gathered from feature
           ranking methods is used to build a probability model that biases the evolution-
           ary operators of XCS. The feature ranking probability distribution values are
           recorded in a Rule Discovery Probability (RDP ) vector. Each value of the RDP
           vector (∈ [0, 1.0]) is associated with a corresponding feature. The RDP vector
           is then used to bias the feature-wise uniform crossover, mutation, and don’t care
           operators, which are part of the XCS rule discovery component.
               The actual values in the RDP vector are calculated based on the rank of the
           corresponding feature as described below:
                 RDP_i = \begin{cases} \frac{1-\gamma}{\Omega} \times (\Omega - i) + \gamma & \text{if } i \le \Omega \\ \xi & \text{otherwise} \end{cases}                    (1)
           where i represents the rank index in ascending order for the selected top ranked
           features Ω. The probability values associated with the top ranked features would
           be some relatively large values (∈ [γ, 1]) depending on the feature rank; for the
           others a very low probability value ξ is given. Thus, all features have a chance
           to participate in the rule discovery process. However, the Ω-top ranked features
           have a greater chance of being selected (see Figure 1 for an example).
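    As an illustration only (not the authors' C++ implementation), the following Python sketch builds an RDP vector from a feature ranking according to Equation 1; the 0-based rank index is an assumption chosen so that the highest ranked feature receives a probability of 1.0, as in Figure 1.

import numpy as np

def build_rdp(num_features, ranked_features, omega=5, gamma=0.5, xi=0.1):
    # Build a Rule Discovery Probability (RDP) vector as in Equation 1.
    # ranked_features lists feature indices from most to least informative.
    # The omega top-ranked features receive probabilities in [gamma, 1.0]
    # that decrease linearly with rank; every other feature receives xi.
    rdp = np.full(num_features, xi)
    for rank, feat in enumerate(ranked_features[:omega]):   # rank 0 = best feature
        rdp[feat] = (1.0 - gamma) / omega * (omega - rank) + gamma
    return rdp

# Hypothetical example with 10 features: feature 3 is ranked highest and
# therefore receives probability 1.0, feature 7 the next largest value, and so on.
print(build_rdp(10, ranked_features=[3, 7, 0, 5, 9]))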
               GRD-XCS uses the probability values recorded in the RDP vector in the pre-
           processing phase to bias the evolutionary operators used in the rule discovery
           phase of XCS. The modified algorithms describing the crossover, mutation and
           don’t care operators in GRD-XCS are very similar to standard XCS operators:

               – GRD-XCS crossover operator: This is a hybrid uniform/n-point function. An
                 additional check of each feature is carried out before the exchange of genetic
                 material. If Random[0, 1) < RDP [i] then feature i is swapped between the
                 selected parents (Algorithm 1).




Algorithm 1 Guided Uniform Crossover algorithm
Require: Individuals: Cl1 ,Cl2 ∈[A], Probability Vector: RDP , Crossover Probability:
 χ
 if Random[0,1) < χ then
    for i = 1 To SizeOf(Features) do
      if Random[0,1) < RDP [i] then
         SWAP(Cl1 [i],Cl2 [i])
      end if
    end for
 end if

Algorithm 2 Guided Mutation algorithm
Require: Individual: Cl ∈[A], Probability Vector: RDP , Mutation Probability: μ
 for i = 1 To SizeOf(Features) do
   if Random[0,1) < RDP [i] ×μ then
      Mutate(Cl[i])
   end if
 end for

Algorithm 3 Guided Don’t Care algorithm
Require: Individual: Cl ∈[A], Probability Vector: RDP , Don’t Care Probability: P#
 for i = 1 To SizeOf(Features) do
   if Random[0,1) < (1 − RDP [i]) ×P# then
      Cl[i] ← #
   end if
 end for


 – GRD-XCS mutation operator: It uses the RDP vector to determine if feature
   i is to undergo mutation; the base-line mutation probability is multiplied by
   RDP for each feature. Therefore, the mutation probability is not a uniform
   distribution anymore. The more informative features have a better chance of
   being selected for mutation (Algorithm 2).
 – GRD-XCS don’t care operator: In this special mutation operator, the values
   in the RDP vector are used in the reverse order. That is, if feature i has
   been selected to be mutated and Random[0, 1) < (1 − RDP [i]), then feature
   i is changed to # (“don’t care”) (see Algorithm 3).
    The application of the RDP vector reduces the crossover and mutation prob-
abilities for “uninformative” features. However, it increases the “don’t care” op-
erator probability for the same feature. Therefore, the more informative features
should appear in rules more often than the “uninformative” ones.
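    A minimal Python sketch of the three guided operators (mirroring Algorithms 1-3) is given below. It is illustrative only and not the authors' C++ implementation: classifiers are represented as plain lists of condition alleles, the per-allele mutate function is a caller-supplied placeholder, and the default probabilities are typical XCS settings rather than values confirmed by this paper.

import random

DONT_CARE = "#"

def guided_crossover(cl1, cl2, rdp, chi=0.8):
    # Guided uniform crossover (cf. Algorithm 1): allele i is swapped between
    # the two parents only if a random draw falls below RDP[i].
    if random.random() < chi:
        for i in range(len(rdp)):
            if random.random() < rdp[i]:
                cl1[i], cl2[i] = cl2[i], cl1[i]

def guided_mutation(cl, rdp, mutate, mu=0.04):
    # Guided mutation (cf. Algorithm 2): the base-line rate mu is scaled by
    # RDP[i], so informative features are mutated more often. `mutate` is a
    # placeholder for whatever per-allele change the underlying XCS applies.
    for i in range(len(rdp)):
        if random.random() < rdp[i] * mu:
            cl[i] = mutate(cl[i])

def guided_dont_care(cl, rdp, p_hash=0.33):
    # Guided don't-care (cf. Algorithm 3): features with low RDP values are
    # generalised to '#' with higher probability.
    for i in range(len(rdp)):
        if random.random() < (1.0 - rdp[i]) * p_hash:
            cl[i] = DONT_CARE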


4   Experiments and Results
We have conducted a series of independent experiments to compare the perfor-
mance of FS-XCS and GRD-XCS. A suite of feature selection techniques has




                                        Table 1. Data set details

                  Data Set       #Instances #Features Cross Validation Reference

                    High-dimensional data sets (Microarray DNA gene expression)
                  Breast cancer   22        3226       3               [13]
                  Colon cancer    62        2000       10              [2]
                  Leukemia cancer 72        7129       10              [12]
                  Prostate cancer 136       12600      10              [19]



           been tested: Correlation based Feature Selection (CFS), Gain Ratio, Informa-
           tion Gain, One Rule, ReliefF and Support Vector Machine (SVM). Four DNA
           microarray gene expression data sets have been used in the experiments. The
           details of these data sets are reported in Table 1.
    Our algorithms were implemented in C++, based on Butz’s XCS code2 .
           The WEKA package (version 3.6.1)3 was used for feature ranking. All exper-
           iments were performed on the VPAC 4 Tango Cluster server. Tango has 111
           computing nodes. Each node is equipped with two 2.3 GHz AMD based quad
           core Opteron processors, 32GB of RAM and four 320GB hard drives. Tango’s
           operating system is the Linux distribution CentOS (version 5).
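    The feature ranking itself was produced with WEKA; purely as a hedged stand-in, the sketch below uses scikit-learn's mutual information score (related to, but not identical to, WEKA's Information Gain) to obtain ranked feature indices. FS-XCS would keep only these columns of the data set, while GRD-XCS would pass the full ranking to Equation 1.

import numpy as np
from sklearn.feature_selection import mutual_info_classif

def rank_features(X, y, top=20):
    # Return the indices of the `top` best features, best first, ranked by a
    # mutual-information score (an illustrative stand-in for Information Gain).
    scores = mutual_info_classif(X, y, random_state=0)
    return np.argsort(scores)[::-1][:top]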

           4.1   Parameter settings
           Default parameter values as recommended in [8] have largely been used to config-
           ure the underlying XCS model. For parameters specific to our proposed model,
           we have carried out a detailed analysis to determine the optimal settings. In
           particular, we have tested a range of Ω values Ω = 10, 20, 32, 64, 128, 256 and
           population sizes pop size = 500, 1000, 2000, 5000. The analysis suggested that
           Ω = 20 with a population size of 2000 can provide an acceptable accuracy level
           within reasonable execution time for FS-XCS. As for GRD-XCS, the setting of
           Ω = 128 and pop size = 500 was found to have produced the best results. As
           such, these parameter values were used for the results presented in Section 4.3.
              The limits used in probability value calculations in Equation 1 were set to
           γ = 0.5 and ξ = 0.1. In all experiments, the number of iterations was capped at
           5000.

           4.2   Evaluation
           For each scenario (parameter value–data set combination), we performed N -fold
           cross validation experiments over 100 trials (see Table 1). The average accuracy
           2
             The source code is available at the Illinois Genetic Algorithms Laboratory (IlliGAL)
             site http://www.illigal.org/
           3
             Weka 3 is an open source data mining tool (in Java), with a collection of ma-
             chine learning algorithms developed by the Machine Learning Group at University
             of Waikato – http://www.cs.waikato.ac.nz/ml/weka/
           4
             Victorian Partnership for Advanced Computing: www.vpac.org




Table 2. Average accuracy (measured by AUC values) of the base-line XCS, FS-XCS
and GRD-XCS on all selected microarray gene expression data sets.

                     base-line XCS     FS-XCS      GRD-XCS
                             0.77        0.88       0.98



values for specific parameter combinations have been reported using the Area
Under the ROC Curve – the AUC value. The ROC curve is a graphical way to
depict the tradeoff between the True Positive rate (TPR) on the Y axis and the
False Positive rate (FPR) on the X axis. The AUC values obtained from the
ROC graphs allow for easy comparison between two or more plots. Larger AUC
values represent higher overall accuracy.
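    As a small illustration of the metric only (not of the evaluation pipeline used here), an AUC value can be computed from class labels and classifier scores as follows; the numbers shown are hypothetical.

from sklearn.metrics import roc_auc_score

y_true  = [0, 0, 1, 1, 1]               # hypothetical ground-truth class labels
y_score = [0.2, 0.4, 0.35, 0.8, 0.9]    # hypothetical classifier outputs
print(roc_auc_score(y_true, y_score))   # area under the ROC curve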
    Appropriate statistical analyses using paired t-tests were conducted to deter-
mine whether there were statistically significant differences between particular
scenarios in terms of both accuracy and execution time. Scatter plots of the
observed and fitted values and Q-Q plots were used to verify normality assump-
tions. These statistical analyses were performed using the IBM SPSS Statistics
(version 19) software.


4.3   FS-XCS vs. GRD-XCS

To begin with, we have compared the average accuracy of FS-XCS and GRD-
XCS with the base-line XCS (without feature selection) using all the aforemen-
tioned feature ranking methods on the microarray gene expression data sets
listed in Table 1. The results, as shown in Table 2, indicate that GRD-XCS has
an overall better accuracy than FS-XCS: the average FS-XCS accuracy using
various feature selection techniques is 0.88 while the average accuracy of GRD-
XCS using the same feature ranking methods is 0.98. Meanwhile, both FS-XCS
and GRD-XCS are better than the base-line XCS – the latter has managed only
an average accuracy of 0.77. For the rest of this section, we will focus on a
detailed comparison between FS-XCS and GRD-XCS.
     Figure 2 shows the AUC values of FS-XCS and GRD-XCS when different
feature ranking methods are used. From the figure, it is clear that GRD-XCS is
significantly more accurate than FS-XCS. For every feature ranking method,
except Information Gain on the Breast Cancer data set, the difference in accuracy
between FS-XCS and GRD-XCS is statistically significant (p < 0.001).
     In Figure 3, FS-XCS is shown to be significantly faster than GRD-XCS (p <
0.001) in terms of execution time. This is to be expected, since FS-XCS works
with only a fraction of the original data set (i.e., 20 features) while GRD-XCS
still accepts the entire data set with thousands of features. The only exception
is when Gain Ratio is applied to the Breast Cancer data set – in this case there
is no statistically significant difference between the average execution times of
FS-XCS and GRD-XCS (p = 0.94).
    Figures 4 and 5 provide some general insight into the population diversity. In
the majority of cases, GRD-XCS has less diversity.





Fig. 2. The accuracy (AUC) of FS-XCS vs. GRD-XCS when various feature ranking
methods are applied. Panels: (a) Breast Cancer, (b) Prostate Cancer, (c) Leukemia, (d) Colon Cancer.


    The average length of each classifier in GRD-XCS is almost always signifi-
cantly smaller than in FS-XCS (p < 0.05). The exceptions, where the lengths are
not significantly different, are Gain Ratio (p = 0.80) and ReliefF (p = 0.26) on
the Prostate Cancer data set.
               The average number of macro classifiers in GRD-XCS is significantly smaller
than the average number of macro classifiers in FS-XCS. As can be seen in Fig-
ures 5(b) and (d), the difference becomes more pronounced as the dimensionality
increases (for Prostate Cancer and Colon Cancer). The Breast Cancer data set is
an exception, where the average number of macro classifiers in the GRD-XCS
population is larger than in FS-XCS. It is fair to conclude that GRD-XCS explores
the solution space in a more focused manner than FS-XCS. In other words, the
guided rule discovery approach forces the learning process to generate less diverse
test hypotheses; however, this behaviour can evolve more accurate classifiers.


           5   Conclusion and Future Work
           In this paper, we have analysed the performance of FS-XCS and GRD-XCS
           based on some high-dimensional classification problems. Comprehensive numer-






Fig. 3. The execution time (in seconds) of FS-XCS vs. GRD-XCS when various feature
ranking methods are applied. Panels: (a) Breast Cancer, (b) Prostate Cancer, (c) Leukemia, (d) Colon Cancer.




ical simulations have established that GRD-XCS is significantly more accurate
than FS-XCS in terms of classification results. On the other hand, FS-XCS is
significantly faster than GRD-XCS in terms of execution time. The results of
FS-XCS suggest that normally 20 top-ranked features would be enough to build
a good classifier, although this classifier is significantly less accurate than the
equivalent GRD-XCS model. Nevertheless, both models have performed better
than the base-line XCS.
    To sum up, using feature selection to highlight the more informative features
and using them to guide the XCS rule discovery process is better than applying
feature reduction approaches. This is mainly due to the fact that GRD-XCS can
transform poor classifiers (created from the uninformative features) into highly
accurate classifiers. From the empirical analysis presented it is clear that the
performance of different feature selection techniques inevitably varies depending
on the data set characteristics. Future work will therefore attempt to address
this through ensemble learning. That is, we can build an ensemble
classifier from multiple XCS-based models (be it FS-XCS or GRD-XCS).
Each of these XCS cores can use a distinct feature selection method. The






Fig. 4. The proportion of macro classifiers to the population size of FS-XCS vs. GRD-
XCS when various feature ranking methods are applied. Panels: (a) Breast Cancer, (b) Prostate Cancer, (c) Leukemia, (d) Colon Cancer.


           results of all XCS cores are then combined to form the ensemble result – for
           instance by using a majority voting technique.


           References
            1. M. Abedini and M. Kirley. An enhanced XCS rule discovery module using
               feature ranking. International Journal of Machine Learning and Cybernetics,
               10.1007/s13042-012-0085-9, 2012.
            2. U. Alon, N. Barkai, D. A. Notterman, K. Gishdagger, S. Ybarradagger,
               D. Mackdagger, and A. J. Levine. Broad patterns of gene expression revealed
               by clustering analysis of tumor and normal colon tissues probed by oligonucleotide
               arrays. Proc. of the National Academy of Sciences of the USA, 96:6745–6750, 1999.
            3. M. H. Asyali, D. Colak, O. Demirkaya, and M. S. Inan. Gene expression profile
               classification: A review. Current Bioinformatics, 1(1):55–73, 2006.
            4. J. Bacardit and N. Krasnogor. Smart crossover operator with multiple parents
               for a Pittsburgh learning classifier system. In Proceedings of the Genetic and
               Evolutionary Computation Conference (GECCO), pages 1441–1448. ACM Press,
               2006.







Fig. 5. The average length of macro classifiers (rules) of FS-XCS vs. GRD-XCS when
various feature ranking methods are applied. Panels: (a) Breast Cancer, (b) Prostate Cancer, (c) Leukemia, (d) Colon Cancer.



 5. E. Bonilla Huerta, J. C. Hernandez Hernandez, and L. A. Hernandez Montiel. A
    new combined filter-wrapper framework for gene subset selection with specialized
    genetic operators. In Advances in Pattern Recognition, volume 6256 of Lecture
    Notes in Computer Science, pages 250–259. Springer, 2010.
 6. M. Butz, M. Pelikan, X. Llorà, and D. E. Goldberg. Automated global struc-
    ture extraction for effective local building block processing in XCS. Evolutionary
    Computation, 14(3):345–380, 2006.
 7. M. V. Butz, D. E. Goldberg, and K. Tharakunnel. Analysis and improvement of
    fitness exploitation in XCS: Bounding models, tournament selection, and bilateral
    accuracy. Evolutionary Computation, 11(3):239–277, 2003.
 8. M. V. Butz and S. W. Wilson. An algorithmic description of XCS. Soft Computing,
    6(3–4):144–153, 2002.
 9. R. Chiong, editor. Nature-Inspired Algorithms for Optimisation. Springer, 2009.
10. R. Chiong, F. Neri, and R. I. McKay. Nature that breeds solutions. In R. Chiong,
    editor, Nature-Inspired Informatics for Intelligent Applications and Knowledge Dis-
    covery: Implications in Business, Science and Engineering, chapter 1, pages 1–24.
    Information Science Reference, Hershey, PA, 2009.
11. R. Chiong, T. Weise, and Z. Michalewicz, editors. Variants of Evolutionary Algo-
    rithms for Real-World Applications. Springer, 2012.




           12. T. R. Golub, D. K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. P. Mesirov,
               H. Coller, M. L. Loh, J. R. Downing, M. A. Caligiuri, and C. D. Bloomfield.
               Molecular classification of cancer: Class discovery and class prediction by gene
               expression monitoring. Science, 286:531–537, 1999.
           13. I. Hedenfalk, D. Duggan, Y. Chen, M. Radmacher, M. Bittner, R. Simon,
               P. Meltzer, B. Gusterson, M. Esteller, O. P. Kallioniemi, B. Wilfond, A. Borg,
               and J. Trent. Gene-expression profiles in hereditary breast cancer. The New Eng-
               land Journal of Medicine, 344(8):539–548, 2001.
           14. P. L. Lanzi. A study of the generalization capabilities of XCS. In Thomas Bäck,
               editor, Proceedings of the 7th International Conference on Genetic Algorithms,
               pages 418–425. Morgan Kaufmann, 1997.
           15. J. H. Moore and B. C. White. Exploiting expert knowledge in genetic programming
               for genome-wide genetic analysis. In PPSN, volume 4193 of Lecture Notes in
               Computer Science, pages 969–977. Springer, 2006.
           16. S. Morales-Ortigosa, A. Orriols-Puig, and E. Bernadó-Mansilla. New crossover
               operator for evolutionary rule discovery in XCS. In Proceedings of the 8th Interna-
               tional Conference on Hybrid Intelligent Systems, pages 867–872. IEEE Computer
               Society, 2008.
           17. S. Morales-Ortigosa, A. Orriols-Puig, and E. Bernadó-Mansilla. Analysis and im-
               provement of the genetic discovery component of XCS. International Journal of
               Hybrid Intelligent Systems, 6(2):81–95, 2009.
           18. L. M. San Jose-Revuelta. A Hybrid GA-TS Technique with Dynamic Operators
               and its Application to Channel Equalization and Fiber Tracking. I-Tech Education
               and Publishing, 2008.
           19. D. Singh, P. G. Febbo, K. Ross, D. G. Jackson, J. Manola, C. Ladd, P. Tamayo,
               and A. A. Renshaw. Gene expression correlates of clinical prostate cancer behavior.
               Cancer Cell, 1:203–209, 2002.
           20. P. Wang, T. Weise, and R. Chiong. Novel evolutionary algorithms for supervised
               classification problems: An experimental study. Evolutionary Intelligence, 4(1):3–
               16, 2011.
           21. F.-X. Wu, W. J. Zhang, and A. J. Kusalik. On determination of minimum sample
               size for discovery of temporal gene expression patterns. In Proceedings of the First
               International Multi-Symposiums on Computer and Computational Sciences, pages
               96–103, 2006.
           22. Y. Zhang and J. C. Rajapakse. Machine Learning in Bioinformatics. Wiley Series
    in Bioinformatics. 1st edition, 2008.




    Reliable Epileptic Seizure Detection Using an Improved
                   Wavelet Neural Network

                Zarita Zainuddin1,*, Lai Kee Huong1, and Ong Pauline1
1
 School of Mathematical Sciences, Universiti Sains Malaysia, 11800 USM, Penang, Malaysia.
             zarita@cs.usm.my, laikeehuong1986@yahoo.com,
                            ong.pauline@hotmail.com



        Abstract. Electroencephalogram (EEG) signal analysis is indispensable in epi-
        lepsy diagnosis as it offers valuable insights for locating the abnormal distor-
        tions in the brain waves. However, visual interpretation of the massive amount
        of EEG signals is time-consuming, and there is often inconsistent judgment
        among experts. Thus, a reliable seizure detection system is highly sought af-
       ter. A novel approach for epileptic seizure detection is proposed in this paper,
       where the statistical features extracted from the discrete wavelet transform are
       used in conjunction with an improved wavelet neural network in order to identi-
       fy the occurrence of seizures. Experimental simulations were carried out on a
       well-known publicly available dataset, which was kindly provided by Ralph
       Andrzejak from the Epilepsy center in Bonn, Germany. The obtained high pre-
       diction accuracy, sensitivity and specificity demonstrated the feasibility of the
       proposed seizure detection scheme.


       Keywords: Epileptic seizure detection, fuzzy C-means clustering, K-means
       clustering, type-2 fuzzy C-means clustering, wavelet neural networks.


1      Introduction

Since their inception, first reported by the German neuropsychiatrist Hans Berger in
1924, electroencephalogram (EEG) signals, which record the electrical activity of the
brain, have emerged as an essential alternative for diagnosing neurological disorders.
By analyzing the EEG recordings, inherent information about the different physiolog-
ical states of the brain can be extracted, which is crucial for epileptic seizure detection
since the occurrence of a seizure exhibits clear transient abnormalities in the EEG
signals. Upon detecting an impending seizure attack, a warning signal can thus be
initiated in time to avoid seizure-related accidents and injuries.
   While vital as a ubiquitous tool that supports the general diagnosis of epilepsy, the
clinical implementation of EEG is constrained by two challenges: (i) Available
therapies require long-term continuous monitoring of EEG signals. The resulting
massive amounts of EEG recordings have to be painstakingly scanned and analyzed
visually by neurophysiologists, which is a tedious and time-consuming task. (ii) There




           often is disagreement among different physicians during the analysis of ictal signals
           [19]. Undoubtedly, an automated diagnostic system that is capable of distinguishing
           the transient patterns of epileptiform activity from the EEG signals with reliable pre-
           cision is of great significance.
    Various efforts have been devoted to this problem in the literature. Generally speak-
ing, a typical epileptic seizure detection process consists of two stages wherein the
inherent information that characterizes the different states of brain electrical activity
is first derived from the EEG recordings using some feature extraction technique,
and subsequently, a chosen expert system is trained based on the obtained features.
           The discrete wavelet transform (DWT) has gained practical interest in extracting the
           valuable information embedded on the EEG signals due to its ability in capturing
           precise frequency information at low frequency bands and time information at high
           frequency bands [4], [9], [22], [25]. EEG signals are non-stationary in nature, and
           they contain high frequency information with short time period and low frequency
           information with long time period [18]. Therefore, by analyzing the biomedical sig-
           nals at different time and frequency resolutions, DWT is able to preprocess the bio-
           medical signals efficiently in the feature extraction stage.
               In the second stage of the seizure detection scheme, a great deal of different artifi-
           cial neural networks (ANNs) based expert systems have been utilized extensively in
           the emerging field of epilepsy diagnosis. For instance, the multilayer perceptrons,
           radial basis function neural networks, support vector machines, probabilistic neural
           networks, and recurrent neural networks are some of the models that have been pre-
viously reported in the literature [5], [12], [14], [19], [22]. ANNs are powerful mathe-
matical models inspired by their biological counterparts - the biological neural
networks, which concern how the interconnected neurons process a massive
amount of information at any given time. The utilization of ANNs in seizure de-
tection studies is appropriate due to their capability of finding the underlying
relationships between rapid variations in the EEG recordings, in addition to having the
characteristics of fault tolerance, massive parallel processing ability, and adaptive
learning capability.
               The objective of this paper is to present a novel scheme based on an improved
           WNNs for the optimal classification of epileptic seizures in EEG recordings. The
           normal as well as the epileptic EEG signals were first pre-processed using the DWT
           wherein, the signals were decomposed into several frequency subbands. Subsequent-
ly, a set of statistical features was extracted from each frequency subband and
used as a feature set to train a wavelet neural networks (WNNs) based classifier. It is
           worth mentioning that the feature selection of EEG signals using DWT and epileptic
           seizure detection with ANNs are well-accepted methodologies by medical experts [6-
           7].
               The paper is organized as follows. In Section 2, the clinical data used in this study
           is first presented, followed by the feature extraction method based on the DWT. The
           implementation of the improved WNNs is next described in Section 3. In Section 4,
           the effectiveness of the proposed WNNs in epileptic seizure detection is presented
           and finally, conclusions are drawn in Section 5.




2      Materials and Methods

The flow of the methodology used in this study is depicted in the block diagram in
Fig. 1, which will be discussed in detail in the following sections.

2.1    Clinical data selection
The EEG signals used in this study were acquired from a publicly available bench-
mark dataset [2]. The dataset is divided into five sets, labeled A to E. Each set
consists of 100 segments, with each segment being a time series with 4097
data points. Each segment was recorded for 23.6 s at a sampling rate of 173.61 Hz.
Each of the five sets was recorded under different circumstances. Sets A and B
were recorded from healthy subjects, with set A recorded with their eyes open whe-
reas set B with their eyes closed. On the other hand, sets C to E were obtained from
epileptic patients. Sets C and D were recorded during seizure-free periods, where set C
was recorded from the hippocampal formation of the opposite hemisphere of the
brain, whereas set D was obtained from within the epileptogenic zone. The last data
set, set E, contains ictal data that were recorded while the patients were experiencing
seizures. In other words, the first four sets of data, sets A to D, are normal EEG
signals, while set E represents epileptic EEG signals.


2.2    Discrete wavelet transform for feature extraction
DWT offers a more flexible time-frequency window function, which narrows when
observing high frequency information and widens when analyzing low frequency
information. It is implemented by decomposing the signal into coarse approximation
and detail information by using successive low-pass and high-pass filtering, which is
illustrated in Fig. 2.
    As shown in this figure, a sample signal x(n), is passed through the low-pass filter
G0 and high-pass filter H0 simultaneously until the desired level of decomposition is
reached. The low-pass filter produces coarse approximation coefficients a(n), whereas
the high-pass filter outputs the detail coefficients d(n). The size of the approximation
coefficients and detail coefficients decreases by a factor of 2 at each successive de-
composition.




              Fig. 1. Block diagram for the proposed seizure detection scheme.




                                Fig. 2. A three-level wavelet decomposition tree.

   Selecting the appropriate number of decomposition levels is important for DWT.
For EEG signal analysis, the number of decomposition levels can be determined
directly, based on the dominant frequency components. The number of levels is
chosen in such a way that those parts of the signals which correlate well with the fre-
quencies required for the classification of EEG signals are retained in the wavelet
coefficients [17]. Since the clinical data used were sampled at 173.61 Hz, the DWT
using the Daubechies wavelet of order 4 (db4), with four decomposition levels, is
chosen, as suggested in [21]. The db4 is suitable because wavelets of lower order are
too coarse to represent the EEG signals, while wavelets of higher order oscillate too
wildly [1]. The four-level wavelet decomposition process yields a total of five groups
of wavelet coefficients, each corresponding to its respective frequency range. They are
d1 (43.4-86.8 Hz), d2 (21.7-43.4 Hz), d3 (10.8-21.7 Hz), d4 (5.4-10.8 Hz), and a4 (0-5.4 Hz),
which correlate with the EEG spectrum falling within the four frequency bands: delta
(1-4 Hz), theta (4-8 Hz), alpha (8-13 Hz) and beta (13-22 Hz).
              Subsequently, the statistical features of these decomposition coefficients are ex-
           tracted, which are:
           1. The 90th percentile of the absolute values of the wavelet coefficients
           2. The 10th percentile of the absolute values of the wavelet coefficients
           3. The mean of the absolute values of the wavelet coefficients
           4. The standard deviation of the wavelet coefficients.
              It is worth mentioning that instead of the usual extrema (maximum and minimum
           of the wavelet coefficient), the percentiles are selected in this case in order to elimi-
           nate the possible outliers [11]. At the end of the feature extraction stage, a feature
           vector of length 20 is formed for each EEG signal.
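   The feature extraction stage can be sketched in Python with the PyWavelets package (the simulations in this study were carried out in MATLAB; this is an illustrative stand-in only). A four-level db4 decomposition of a 4097-sample segment yields the five subbands a4, d4, d3, d2 and d1, and the four statistics above are computed per subband, giving the 20-element feature vector.

import numpy as np
import pywt

def extract_features(segment):
    # Four-level db4 DWT followed by the four statistics listed above
    # (90th and 10th percentiles of |coeff|, mean of |coeff|, std of coeff)
    # for each of the five subbands, giving 20 features per EEG segment.
    coeffs = pywt.wavedec(segment, "db4", level=4)   # [a4, d4, d3, d2, d1]
    features = []
    for c in coeffs:
        abs_c = np.abs(c)
        features.extend([np.percentile(abs_c, 90),
                         np.percentile(abs_c, 10),
                         abs_c.mean(),
                         c.std()])
    return np.array(features)

# Example on a stand-in segment (4097 samples, as in the benchmark data set)
print(extract_features(np.random.randn(4097)).shape)   # (20,)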


3      Classification using an improved wavelet neural network

           WNNs are feedforward neural networks with three layers – the input layer, the hidden
           layer, and the output layer [26] . As the name suggests, the input layer receives input
           values and transmits them to the single hidden layer. The hidden nodes consist of




continuous wavelet functions, such as Gaussian wavelet, Mexican Hat wavelet, or
Morlet wavelet, which perform the nonlinear mapping. The product from this hidden
layer will then be sent to the final output layer.
   Mathematically, a typical WNN is modeled by the following equation:

                 y(\mathbf{x}) = \sum_{i=1}^{p} w_{ij}\, \psi\!\left( \frac{\mathbf{x} - \mathbf{t}_i}{d} \right) + b,                (1)

where y is the desired output, x ∈ R^m is the input vector, p is the number of hidden
neurons, w_{ij} is the weight matrix whose values will be adjusted iteratively during the
training phase in order to minimize the error goal, ψ is the wavelet activation func-
tion, t_i is the translation vector, d is the dilation parameter, and b is the column matrix
that contains the bias terms. The network structure is illustrated in Fig. 3.
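   A minimal NumPy sketch of this forward pass is given below (illustrative only, not the MATLAB implementation used in this study). A real-valued Morlet wavelet is used because that is the activation reported in Section 4, and the radial form ψ(‖x − t_i‖ / d_i) is one common way of applying a scalar mother wavelet to vector inputs; the exact construction in the paper may differ.

import numpy as np

def morlet(z):
    # A commonly used real-valued Morlet mother wavelet.
    return np.cos(1.75 * z) * np.exp(-0.5 * z ** 2)

def wnn_output(x, t, d, w, b):
    # Single-output WNN forward pass in the spirit of Eq. 1.
    #   x : (m,)   input vector          t : (p, m) translation vectors
    #   d : (p,)   dilation parameters   w : (p,)   hidden-to-output weights
    #   b : float  bias term
    z = np.linalg.norm(x - t, axis=1) / d   # wavelet argument per hidden node
    return float(w @ morlet(z) + b)

# Example with the 20-dimensional feature vectors of this study and 5 hidden nodes
rng = np.random.default_rng(0)
y = wnn_output(rng.standard_normal(20), rng.standard_normal((5, 20)),
               np.ones(5), rng.standard_normal(5), 0.0)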
   The WNNs are distinct from those of other ANNs in the sense that [26]:

 WNNs show relatively faster learning speed owing to the constitution of the fast-
  decaying localized wavelet activation functions in the hidden layer.
 WNNs preserve the universal approximation property, and they are guaranteed to
  converge with sufficient training.
 WNNs establish an explicit link between the neural network coefficients and the
  wavelet transform.
 WNNs achieve the same quality of approximation with a network of reduced size.
   Designing a WNN requires the researchers to focus particular attention on several
areas. First, a suitable learning algorithm is vital in adjusting the weights between the
hidden and output layers so that the network does not converge to the undesirable
local minima. Second, a proper choice of activation functions in the hidden nodes is
crucial as it has been shown that some functions yield significant better result for
certain problems [23]. Third, an appropriate initialization of the translation and dila-
tion parameters is essential because this will lead to simpler network architecture and
higher accuracy [24].




           Fig. 3. WNNs with d input nodes, m hidden nodes, and L output nodes.




              The selection of the translation vectors for WNNs is of paramount importance. An
appropriate initialization of the translation vectors should reflect the
essential attributes of the input space, in such a way that the WNNs begin their learning
from good starting points that can lead to the optimal solution. Among the notable
           proposed approaches are the ones given by the pioneers of WNNs themselves, where
           the translation vectors are chosen from the points located on the interval of the do-
           main of the function [26]. In [10], a dyadic selection scheme realized using the K-
           means clustering algorithm was employed. In [13], the translation vectors were ob-
           tained from the new input data. An explicit formula was derived to compute the trans-
           lation vectors to be used for the proposed composite function WNNs [3]. In [24], an
           enhanced fuzzy C-means clustering algorithm, termed modified point symmetry-
           distance fuzzy C-means (MPSDFCM) algorithm, was proposed to initialize the trans-
           lation vectors. By incorporating the idea of symmetry similarity measure into the
computation, the MPSDFCM algorithm was able to find a set of fewer yet effective
translation vectors for the WNNs, which eventually led to superb generalization abili-
ty in a microarray study. In short, the utilization of different novel clustering algorithms
in WNNs aims at lower algorithmic complexity and higher classification accuracy
from the WNNs.
              In this study, the type-2 fuzzy C-means (T2FCM) clustering algorithm [16] was
           proposed to initialize the translation vectors of WNNs. Its clustering effectiveness as
well as its robustness to noise motivated the investigation of the feasibility of
T2FCM in selecting the translation vectors of the WNNs. For comparison purposes,
the use of the K-means (KM) and the conventional type-1 fuzzy C-means (FCM-1)
algorithms in initializing the WNN translation vectors was also considered.

           3.1    Type-2 Fuzzy C-Means Clustering Algorithm
           Rhee and Hwang [16] proposed an extension to the conventional FCM-1 clustering
           algorithm by assigning membership grades to type-1 membership values. They
           pointed out that the conventional FCM-1 clustering may result in undesirable cluster-
           ing when noise exists in the input data. This is because all the data, including the
           noise, will be assigned to all the available clusters with a membership value. As such,
           a triangular membership function is proposed, as shown in the following equation:

                                                          1  uij 
                                            aij  u ij           ,                            (2)
                                                          2 

           where uij and aij represent the type-1 and type-2 membership values for input j and
           cluster center i, respectively. The proposed membership function aims to handle the
possible noise that might be present in the input data. From Eq. 2, the new membership
value, aij, is defined as the difference between the old membership value, uij, and the
area of the membership function, where the length of the base of each triangu-
lar function is taken as 1 minus the corresponding membership value obtained from
           FCM-1.




    By introducing a second layer of fuzziness, the T2FCM algorithm’s concept still
conforms to the conventional FCM-1 method in representing the membership values.
To illustrate, it can be noted from Eq. 2 that a larger FCM-1 membership value (closer
to 1) will yield a larger T2FCM membership value as well.
    Since the proposed T2FCM algorithm is built upon the conventional FCM-1 algo-
rithm, the formula used to find the cluster centers, ci, can now be obtained from the
following equation, which has been modified accordingly:
                         c_i = \frac{\sum_{j=1}^{N} a_{ij}^{m}\, x_j}{\sum_{j=1}^{N} a_{ij}^{m}},                    (3)

where m is the fuzzifier, which is commonly set to a value of 2.
  The algorithm for T2FCM is similar to the conventional FCM-1, which aims to
minimize the following objective function:
                           J_m(U, V) = \sum_{i=1}^{C} \sum_{j=1}^{N} u_{ij}^{m}\, \| x_j - c_i \|^2,                    (4)


but it differs in the additional membership function introduced and in the modified
equation used to update the cluster centers. In general, the algorithm proceeds as
follows:
1. Fix the number of cluster centers, C.
2. Initialize the location of the centers, ci, i = 1, 2, … , C, randomly.
3. Compute the membership values using the following equation:

                                                         2
                                                                
                                                                1

                                           C  x c     
                                                        m 1
                                                                
                         U  [uij ]                                              (5)
                                         k 1  x  c    .
                                                  j   i


                                        j k   
                                                                  

4. Calculate the new membership value, aij from the values of uij using Eq. 2.
5. Update the cluster centers using Eq. 3.
6. Repeat steps 3-5 until the locations of the centers stabilize.
  The algorithm for T2FCM is summarized in the flowchart shown in Fig. 4.
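   In addition to the flowchart, a compact NumPy sketch of the T2FCM iteration (steps 1-6, Eqs. 2, 3 and 5) is shown below. It is an illustration under stated assumptions rather than the MATLAB code used in this study; in particular, clipping the type-2 memberships to [0, 1] is an added safeguard not specified above. The returned centers would serve as the translation vectors of the WNNs.

import numpy as np

def t2fcm(X, C, m=2.0, max_iter=100, tol=1e-5, seed=0):
    # Type-2 fuzzy C-means clustering following steps 1-6 above.
    #   X : (N, D) data matrix, C : number of clusters, m : fuzzifier.
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=C, replace=False)]           # steps 1-2
    for _ in range(max_iter):
        # Step 3: type-1 FCM memberships (Eq. 5)
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        ratio = (dist[:, :, None] / dist[:, None, :]) ** (2.0 / (m - 1.0))
        u = 1.0 / ratio.sum(axis=2)                                   # shape (N, C)
        # Step 4: type-2 memberships (Eq. 2); clipping is an added safeguard
        a = np.clip(u - (1.0 - u) / 2.0, 0.0, 1.0)
        # Step 5: update the cluster centers (Eq. 3)
        weights = (a ** m).T                                          # shape (C, N)
        new_centers = weights @ X / weights.sum(axis=1, keepdims=True)
        # Step 6: stop when the centers stabilize
        if np.linalg.norm(new_centers - centers) < tol:
            return new_centers
        centers = new_centers
    return centers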


3.2    K-fold Cross Validation
In statistical analysis, k-fold cross validation is used to estimate the generalization
performance of classifiers. Excessive training will force the classifiers to memorize
the input vectors, while insufficient training will result in poor generalization when a




           new input is presented to it. In order to avoid these problems, k-fold cross validation
           is performed.
              To implement the k-fold cross validation, the samples are first randomly parti-
           tioned into k>1 distinct groups of equal (or approximately equal) size. The first group
           of samples is selected as the testing data initially, while the remaining groups serve as
           training data. A performance metric, for instance, the classification accuracy, is then
measured. The process is repeated k times, and thus, k-fold cross validation has
the advantage that each sample is used for both training and testing.
           The average of the performance metric from the k iterations is then reported. In this
           study, k is chosen as 10.
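   A brief Python sketch of the procedure is given below (illustrative only; the experiments in this study were run in MATLAB, and fit_predict is an arbitrary placeholder for a classifier rather than the proposed WNNs).

import numpy as np
from sklearn.model_selection import KFold

def kfold_accuracy(X, y, fit_predict, k=10, seed=0):
    # Average classification accuracy over k folds: every sample is used once
    # for testing and k-1 times for training. fit_predict(X_tr, y_tr, X_te)
    # is a caller-supplied function returning predictions for X_te.
    kf = KFold(n_splits=k, shuffle=True, random_state=seed)
    scores = []
    for train_idx, test_idx in kf.split(X):
        y_pred = fit_predict(X[train_idx], y[train_idx], X[test_idx])
        scores.append(np.mean(y_pred == y[test_idx]))
    return float(np.mean(scores))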




Fig. 4. Flowchart of the T2FCM algorithm: fix C, initialize the cluster centers randomly, compute
the memberships uij, compute the new memberships aij, find the new centers ci, and repeat until
the centers stabilize.




4      Results and Discussion

The binary classification task between normal subjects and epileptic patients was
realized using the WNN models. The activation function used in the hidden nodes is
the Morlet wavelet function. During the training process, a normal EEG signal was
indicated by a single value of 0, while an epileptic EEG signal was labeled with a
value of 1. During the testing stage, a threshold value of 0.5 was used, that is, any
output from the WNNs that is equal to or greater than 0.5 was reassigned a value of
1; otherwise, it was reassigned a value of 0. The simulation was carried out using
the mathematical software MATLAB® version 7.10 (R2010a). The performance of
the proposed WNNs was evaluated using the statistical measures of classification
accuracy, sensitivity and specificity. The corresponding classification results for the
normal and epileptic EEG signals obtained by the WNN-based classifier with
different initialization approaches are listed in Table 1.
   In terms of the classification accuracy, the translation vectors generated by the
conventional KM clustering algorithm gave the poorest result, where an overall accu-
racy of 94.8% was obtained. The WNNs that used the conventional FCM-1 clustering
algorithm reported an overall accuracy of 97.15%. The best performance was ob-
tained by the classifier that employed the T2FCM algorithm, which yielded an overall
classification accuracy of 98.87%.
   As shown in Table 1, a steady increase in the classification accuracy was noticed
when the KM clustering algorithm was substituted with the FCM algorithm, and
subsequently the T2FCM algorithm. FCM outperformed the primitive KM algorithm
because the soft clustering it employs can assign one particular datum to more than
one cluster. In contrast, the KM algorithm, which uses hard or crisp clustering,
assigns one datum to one center only, and this greatly degrades the classification
accuracy. While FCM relies on a single fuzzifier, T2FCM adds a second layer of
fuzziness by assigning a membership function to the type-1 FCM membership
values.
   In the field of medical diagnosis, the unwanted noise and outliers present in the
signals or images need to be handled carefully, as they will skew the subsequent
results and analysis. In this regard, the concept of fuzziness can be incorporated to
deal with these uncertainties. Outliers and noise can be handled more efficiently,
and higher classification accuracy can be obtained, via the introduction of the
membership function. The noise in the biomedical signals used in this work has
thus been handled via two different approaches. The first treatment is in the


          Table 1. The performance metrics for the binary classification problem.

                                                         Performance metric
       Initialization methods
                                          Sensitivity       Specificity        Accuracy
                KM                        85.00             97.30              94.80
                FCM                       93.82             97.92              97.15
               T2FCM                      94.96             99.43              98.87








           Table 2. Performance comparison of classification accuracy obtained by the proposed WNNs
                                 and other approaches reported in the literature

           Feature Selection Method            Classifier                      Accuracy       References
           Time Frequency Analysis             ANNs                             97.73            [20]
           DWT with KM                         MLPs                             99.60            [15]
           DWT                                 MLPs                             97.77             [8]
           Approximate Entropy                 ANNs                             98.27             [8]
           This Work                                                            98.87



feature selection stage, where the 10th and 90th percentiles of the absolute values of the
wavelet coefficients were used instead of the minimum and maximum values. The second
treatment is via the T2FCM clustering algorithm used when initializing the translation
parameters of the hidden nodes of the WNNs. The clustering achieved by T2FCM
results in more desirable center locations than the conventional KM and
FCM-1 methods, as reflected in the higher overall classification accuracy.
   Numerous epileptic detection approaches using the same benchmark dataset as in
this study have been reported in the literature. For performance assessment, a
comparison of the results with other state-of-the-art methods is presented in Table 2.
As the table shows, the proposed WNNs with the T2FCM initialization approach
generally outperformed the others. However, the achieved classification accuracy of
98.87% was inferior to that of the multilayer perceptron (MLP)-based classifier
described in [15], which might be attributed to their feature extraction method.
Instead of using basic statistical features, the authors used the KM clustering
algorithm to find the similarities among the wavelet coefficients, and the probability
distribution obtained from KM was used as the input of the MLP-based classifier. A
better set of deterministic features might be obtained from this approach, which will
be an interesting topic to pursue in future work. However, it is pertinent to note that
MLP-based classifiers suffer from slow learning and are easily trapped in local
minima.
   In order to evaluate the statistical significance of the obtained results, a statistical
test on the difference of the population means of the overall classification accuracy
was performed using the t distribution. The experiment was run 10 times to obtain
the summary statistics, namely the mean and the standard deviation of the samples.
A significance level of 1% (α = 0.01) was used to check whether there is a significant
difference between the two population means. Two comparisons were made, namely
between KM and T2FCM, and between FCM and T2FCM. The formula for the test
statistic is given by:

                    t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{s_{\bar{x}_1 - \bar{x}_2}},                 (6)

where \bar{x}_1 and \bar{x}_2 are the sample means, \mu_1 and \mu_2 are the population means, and
s_{\bar{x}_1 - \bar{x}_2} is the estimated standard error of the difference between the sample means.
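
   For illustration, the test of Eq. (6) can be computed directly from per-run summary statistics; the standard deviations and run counts below are placeholders, not the study's values.

   from scipy import stats

   # Mean accuracies taken from Table 1; standard deviations over the 10 runs are placeholders
   mean_t2fcm, std_t2fcm, n_t2fcm = 98.87, 0.40, 10
   mean_km,    std_km,    n_km    = 94.80, 0.55, 10

   # Two-sample t-test of H0: mu_1 = mu_2 against a two-sided alternative
   t_stat, p_value = stats.ttest_ind_from_stats(mean_t2fcm, std_t2fcm, n_t2fcm,
                                                mean_km, std_km, n_km)
   print(f"t = {t_stat:.2f}, p = {p_value:.4f}, reject H0 at alpha = 0.01: {p_value < 0.01}")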








   For both cases, the values of the test statistic fall in the rejection region. The null
hypothesis is therefore rejected, and it is concluded that there is a significant
difference between the classification accuracies obtained using the different
initialization methods, that is, the performance of T2FCM is superior to those of
KM and FCM.


 5       Conclusions

 In this paper, a novel seizure detection scheme using improved WNNs with the
 T2FCM initialization approach was proposed. Based on the overall classification
 accuracy obtained on the real-world problem of epileptic seizure detection, the
 proposed model was found to outperform those using the other conventional
 clustering algorithms, achieving an overall accuracy of 98.87%, sensitivity of
 94.96% and specificity of 99.43%. The initialization accomplished via T2FCM has
 shown that the algorithm can handle the uncertainty and noise in the EEG signals
 better than the conventional KM and FCM-1 algorithms. This again suggests the
 prospective use of the proposed method in developing a real-time automated
 epileptic diagnostic system with fast and accurate response that could assist
 neurologists in their decision-making process.

 Acknowledgements. The authors gratefully acknowledge the generous financial sup-
 port provided by Universiti Sains Malaysia under the USM Fellowship Scheme.


 References
1. Adeli, H., Zhou, Z., Dadmehr, N.: Analysis of EEG records in an epileptic patient using
   wavelet transform. J Neurosci Meth 123, 69-87 (2003)
2. Andrzejak, R. G., Lehnertz, K., Mormann, F., Rieke, C., David, P., Elger, C. E.: Indications
   of nonlinear deterministic and finite-dimensional structures in time series of brain electrical
   activity: Dependence on recording region and brain state. Phys Rev E 64, (2001)
3. Cao, J. W., Lin, Z. P., Huang, G. B.: Composite function wavelet neural networks with
   extreme learning machine. Neurocomputing 73, 1405-1416 (2010)
4. Ebrahimpour, R., Babakhani, K., Arani, S. A. A. A., Masoudnia, S.: Epileptic Seizure
   Detection Using a Neural Network Ensemble Method and Wavelet Transform. Neural Netw
   World 22, 291-310 (2012)
5. Gandhi, T. K., Chakraborty, P., Roy, G. G., Panigrahi, B. K.: Discrete harmony search based
   expert model for epileptic seizure detection in electroencephalography. Expert Syst Appl 39,
   4055-4062 (2012)
6. Ghosh-Dastidar, S., Adeli, H., Dadmehr, N.: Mixed-band wavelet-chaos-neural network
   methodology for epilepsy and epileptic seizure detection. IEEE transactions on bio-medical
   engineering 54, 1545-1551 (2007)
7. Ghosh-Dastidar, S., Adeli, H., Dadmehr, N.: Principal component analysis-enhanced cosine
   radial basis function neural network for robust epilepsy and seizure detection. IEEE
   transactions on bio-medical engineering 55, 512-518 (2008)








            8. Guo, L., Rivero, D., Dorado, J., Rabunal, J. R., Pazos, A.: Automatic epileptic seizure
               detection in EEGs based on line length feature and artificial neural networks. J Neurosci Meth
               191, 101-109 (2010)
            9. Guo, L., Rivero, D., Pazos, A.: Epileptic seizure detection using multiwavelet transform based
               approximate entropy and artificial neural networks. J Neurosci Meth 193, 156-163 (2010)
           10. Hwang, K., Mandayam, S., Udpa, S. S., Udpa, L., Lord, W., Atzal, M.: Characterization of
               gas pipeline inspection signals using wavelet basis function neural networks. NDT and E Int
               33, 531-545 (2000)
           11. Kandaswamy, A., Kumar, C. S., Ramanathan, R. P., Jayaraman, S., Malmurugan, N.: Neural
               classification of lung sounds using wavelet coefficients. Comput Biol Med 34, 523-537
               (2004)
           12. Kumar, S. P., Sriraam, N., Benakop, P. G., Jinaga, B. C.: Entropies based detection of
               epileptic seizures with artificial neural network classifiers. Expert Syst Appl 37, 3284-3291
               (2010)
           13. Lin, C.-J.: Nonlinear systems control using self-constructing wavelet networks. Appl Soft
               Comput 9, 71-79 (2009)
           14. Naghsh-Nilchi, A. R., Aghashahi, M.: Epilepsy seizure detection using eigen-system spectral
               estimation and Multiple Layer Perceptron neural network. Biomed Signal Proces 5, 147-157
               (2010)
           15. Orhan, U., Hekim, M., Ozer, M.: EEG signals classification using the K-means clustering and
               a multilayer perceptron neural network model. Expert Syst Appl 38, 13475-13481 (2011)
           16. Rhee, F. C. H., Hwang, C. A type-2 fuzzy C-means clustering algorithm. In: Proceedings of
               the 20th IEEE FUZZ Conference, pp 1926-1929. IEEE Press, New York (2001)
           17. Subasi, A.: EEG signal classification using wavelet feature extraction and a mixture of expert
               model. Expert Syst Appl 32, 1084-1093 (2007)
           18. Subasi, A., Gursoy, M. I.: EEG signal classification using PCA, ICA, LDA and support vector
               machines. Expert Syst Appl 37, 8659-8666 (2010)
           19. Tang, Y., Durand, D. M.: A tunable support vector machine assembly classifier for epileptic
               seizure detection. Expert Syst Appl 39, 3925-3938 (2012)
20. Tzallas, A. T., Tsipouras, M. G., Fotiadis, D. I.: Automatic Seizure Detection Based on Time-
    Frequency Analysis and Artificial Neural Networks. Comput Intell Neurosci 2007, (2007)
           21. Ubeyli, E. D.: Wavelet/mixture of experts network structure for EEG signals classification.
               Expert Syst Appl 34, 1954-1962 (2008)
           22. Ubeyli, E. D.: Combined neural network model employing wavelet coefficients for EEG
               signals classification. Digit Signal Process 19, 297-308 (2009)
           23. Zainuddin, Z., Ong, P.: Modified wavelet neural network in function approximation and its
               application in prediction of time-series pollution data. Appl Soft Comput 11, 4866-4874
               (2011)
           24. Zainuddin, Z., Ong, P.: Reliable multiclass cancer classification of microarray gene
               expression profiles using an improved wavelet neural network. Expert Syst Appl 38, 13711-
               13722 (2011)
25. Zandi, A. S., Javidan, M., Dumont, G. A., Tafreshi, R.: Automated Real-Time Epileptic
    Seizure Detection in Scalp EEG Recordings Using an Algorithm Based on Wavelet Packet
    Transform. IEEE T Bio-Med Eng 57, 1639-1651 (2010)
26. Zhang, Q. G., Benveniste, A.: Wavelet Networks. IEEE T Neural Networ 3, 889-898 (1992)








        Acute Ischemic Stroke Prediction from
         Physiological Time Series Patterns

          Qing Zhang1,2 , Yang Xie2 , Pengjie Ye1,2 , and Chaoyi Pang2
             1
                 Australian e-Health Research Centre/CSIRO ICT Centre
                          2
                             The University of New South Wales
                  {qing.zhang,pengjie.ye,chaoyi.pang}@csiro.au
                                yang.xie@unsw.edu.au



      Abstract. Stroke is one of the major diseases that cause human
      deaths. However, despite the frequency and importance of stroke, only a
      limited number of evidence-based acute treatment options are currently
      available. Recent clinical research has indicated that early changes in
      common physiological variables represent a potential therapeutic target,
      and thus the manipulation of these variables may eventually yield an
      effective way to optimise stroke recovery. Nevertheless, the accuracy of
      prediction methods based on statistical characteristics of certain
      physiological variables, such as blood pressure and glucose, is still far
      from satisfactory, owing to a vague understanding of the effects and
      functional domain of those physiological determinants. Developing a
      relatively accurate prediction method for stroke outcome, based on
      justifiable determinants, is therefore increasingly important for deciding
      on medical treatment at the very beginning of the stroke. In this work,
      we utilize machine learning techniques to find correlations between
      physiological parameters of stroke patients during the 48 hours after
      stroke and their stroke outcomes after three months. Our prediction
      method not only incorporates statistical characteristics of physiological
      parameters, but also considers physiological time series patterns as key
      features. Experimental results on real stroke patients' data indicate that
      our method can greatly improve prediction accuracy, achieving a high
      precision rate of 94% and a high recall rate of 90%.

      Keywords: Stroke, Outcome Prediction, Time Series Data, Machine
      Learning


1   Introduction
Stroke is a common cause of human death and is a major cause of death after
ischemic heart disease [1]. The World Health Organisation (WHO) defines it as
”rapidly developing clinical signs of local (or global) disturbance of cerebral func-
tion, with symptoms lasting more than 24 hours or leading to death, and with no
apparent cause other than of vascular origin" [2]. Research in recent years reveals a
strong association between physiological homeostasis and outcomes of Acute Is-
chemic Stroke. Thus understanding determinants of physiological variables, such





           as blood pressure, temperature and blood glucose levels, may eventually yield
           an effective and potentially widely applicable range of therapies for optimis-
           ing stroke recovery, such as abbreviating the duration of ischaemia, preventing
           further stroke, or preventing deterioration due to post-stroke complications.
               The correlations between blood pressure and stroke outcomes have been
           widely studied in the literature. It is stated in current guidelines that a sig-
           nificant decrease of BP during the first hours after admission should be avoided,
           as it correlates with poor outcomes, measured by Canadian Stroke Scale or
           modified Rankin Score (mRS), at 3 months [10]. Extreme hypertension and
           hypotension on admission have also been associated with adverse outcome in
           acute stroke patients [11]. BP values, periodically monitored within the first
           72 hours after admission, demonstrate that extreme values still correlate with
unfavourable outcomes [9]. For example, a high baseline systolic BP is inversely
associated with favourable outcome assessed on the mRS at 90 days, with OR = 1.220
(95% CI: 1.01 to 1.49). Other periodically retrieved statistical properties of
BP within 24 hours of ictus, such as maximum, mean, variability etc., have also
been investigated. Yong et al. [12] report a strong independent association between
those properties and the outcome at 30 days after ischemic stroke. For example,
variability of systolic BP is inversely associated with favourable outcome, with
OR = 0.57 (95% CI: 0.35 to 0.92).
               Research also shows associations between other physiological variables and
           stroke outcomes. Abnormalities of blood glucose, heart rate variability, ECG and
           temperature may be predictors of 3-month stroke outcome.
               Most of the above analyses are based on periodically recorded physiological
           parameters, hourly or daily, up to 3 months. Whether continuous data patterns,
           such as data trends, have a similar predictive role is still uncertain. Although it is
clear that elevated blood pressure levels in the 24 hours after stroke predict a poor
outcome, few studies have investigated the predictive ability of more sophisticated
trends, e.g. combined trends of several physiological parameters. Yet this
could be an effective way to readily obtain important prognostic information for
acute ischemic stroke patients. Dawson et al. did pioneering work on associating
short (around 10-minute) beat-to-beat BP recordings with acute ischemic stroke
outcomes [8]. They conclude that a poor outcome, assessed by mRS, at 30 days
after ischemic stroke depends on stroke subtype, beat-to-beat diastolic BP and
mean arterial pressure, and their variability. However, in their study they still
use the average values of continuous recordings, instead of time series patterns,
as predictors. This motivates our research on mining physiological data patterns
as effective predictors of acute ischemic stroke outcome.
    Mining physiological data patterns can obviously be framed as time series data
classification, a traditional topic that has attracted intensive study. Although there
exist many sophisticated time series data mining techniques, we find that most of
them, if not all, are not applicable to our application scenario, owing to the
invariably incomplete, non-isometric physiological data collected from patients.
Therefore, in this paper, we incorporate a simple yet powerful time series pattern
analysis method, trend analysis, into





our prediction method. By utilising those trend features, together with values
of traditional physiological variables, we design an efficient algorithm that can
predict 3-month stroke outcome with high accuracy.
    In summary, we list our contributions in this paper:
 – We propose using trend patterns of physiological time series data as a new
   set of stroke outcome prediction features.
 – We design a novel prediction algorithm which can accurately predict 3-
   month stroke outcomes with high precision and recall, when tested against
   a real data set.
    The rest of this paper is organised as follows. Section 2 introduces work
related to stroke outcome prediction. Section 3 presents our prediction method.
Section 4 reports empirical study results. Section 5 concludes this paper with
possible future studies.


2   Related Work
The relationship between beat-to-beat blood pressure (BP) and the early out-
come after acute ischemic stroke was first described in [8].
    A further investigation of BP was carried out in [6], which examined the
detrimental effects of blood pressure reduction in the first 24 hours after acute
stroke onset. BP reduction is regarded as potentially worsening an already
compromised perfusion of the brain tissue, and therefore not lowering BP in the
early stage after stroke onset is suggested. However, that work lacks further
discussion of the relation between higher BP and outcome. Ritter et al. formulated
blood pressure variation by counting threshold violations. A significant difference
in the frequency of upper threshold violations was observed between different time
points after stroke [9]. Wong observed some temporal patterns in the changing
process of some physiological variables and also attempted to employ such
temporal patterns to explain and predict the early outcomes [5]. However, due to
the limited candidate feature sets considered in those studies, achieving an
accurate prediction is fairly unlikely in those scenarios.
    Relationships between other physiological variables and stroke outcome have
also been studied in the literature. Abnormalities of serum osmolarity, temperature,
blood glucose and SpO2 may be predictors of stroke outcomes. More specifically,
heart rate and ECG can be correlated with stroke outcomes at 3 months:
 – Heart Rate Variability: Gujjar et al. reported that heart rate variability is
   effective in predicting stroke outcome. Specifically, they studied continuous
   electrocardiograms of 25 patients with acute stroke and concluded that the
   eye-opening score of the Glasgow Coma Scale and low-frequency spectral
   power were independently predictive of mortality [16].
 – ECG: The relationship between ECG abnormalities and stroke outcomes
   was reported by Christensen et al. They analysed a large cohort of 692
   patients and found that ECG abnormalities are frequent in acute stroke
   and may predict 3-month mortality [17].





3     Stroke outcomes prediction

Our prediction method adopts statistical values of physiological parameters and
also incorporates the descriptive ability of physiological patterns as features to
predict 3-month stroke outcomes. In particular, we use the trend patterns of
time series data as new add-on features to form an initial feature set. We then
apply the logistic regression method to classify stroke patient outcomes into two
groups: good vs. bad. Note that different clinical criteria exist for defining
good/bad outcomes; we report empirical study results on all criteria in the next
section. Cross validation is also adopted to obtain an unbiased assessment of
classifier performance, by which the physiological determinants can be accurately
identified in the last stage. Finally, we select a subset of features that can most
accurately predict 3-month stroke outcomes. Figure 1 presents the logic flow of
our method. We use the Rankin Scale to represent outcomes at 3 months after
stroke (RS3) [18].




                               Fig. 1. Stroke outcomes prediction method




           3.1   Construct initial feature set

           Five physiological parameters are usually considered as influential factors on
           stroke patient outcomes, namely Blood Sugar Level, Diastolic Blood Pressure,
           Systolic Blood Pressure, Heart Rate and Body Temperature [6, 16, 17]. Exist-
           ing stroke outcome predictions always assume a certain parameter as the main





feature in their approaches. However, in our approach we include all five
parameters in the initial feature set.
    Moreover, for each physiological parameter, we compute trends by partitioning
the time series data into non-overlapping, continuous blocks. Although many trend
and shape detection methods exist in the literature, such as [3], in our application
we simply consider a bi-partition of the time series records from the first 48 hours
after stroke. The reasons are:
 1. most available physiological data records only cover the first 48 hours after
    stroke;
 2. clinical observation and our initial experiments both suggest that a granularity
    of only two partitions within the 48 hours represents the physiological time
    series pattern changes well.
    Accordingly, in each partition we generate 6 new features, as shown below,
to represent the trend pattern:
 1. yChange: the difference between the value at the end of a trend and the
    value at the start of a trend

                  yChange = y(end of trend) − y(start of trend)

 2. absYChange: the absolute value of the yChange
 3. slope: the slope of the trend
 4. sign: the direction of the trend
 5. NumofMeasure: the number of values in a partition
 6. FreqofMeasure: the average time interval between measurements, i.e.

                   FreqofMeasure = Trend Length / NumofMeasure
   The initial feature set comprises the physiological values and their trend
patterns. We apply the logistic regression method to classify the good/bad stroke
outcomes based on this initial feature set.
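
   A sketch of how the six trend features of one partition could be computed; the function and variable names are ours, with t the measurement times (in hours) and y the corresponding values.

   import numpy as np

   def trend_features(t, y):
       """Six trend features for one partition of a physiological time series."""
       t, y = np.asarray(t, float), np.asarray(y, float)
       y_change = y[-1] - y[0]                       # value at end minus value at start
       trend_length = t[-1] - t[0]
       return {
           "yChange": y_change,
           "absYChange": abs(y_change),
           "slope": y_change / trend_length if trend_length else 0.0,
           "sign": float(np.sign(y_change)),         # direction of the trend
           "NumofMeasure": len(y),                   # number of measurements in the partition
           "FreqofMeasure": trend_length / len(y),   # average interval between measurements
       }

   # Hypothetical systolic BP readings in the first 24-hour block
   features = trend_features([0, 4, 8, 12, 16, 20, 24], [160, 155, 150, 152, 148, 145, 142])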

3.2   Logistic Regression Classifier
In statistics, logistic regression is a type of regression analysis used for predicting
the outcome of a binary dependent variable (a variable that can take only two
possible values, e.g. “yes” vs. “no” or “success” vs. “failure”) based on one or
more predictor variables, which may be either continuous or categorical. Unlike
ordinary linear regression, logistic regression predicts binary rather than
continuous outcomes. Here, logistic regression is used to predict the outcome of
stroke (“good” vs. “bad”) based on the features in our initial feature set.
    To obtain an unbiased assessment of classifier performance, the leave-one-out
cross validation technique is adopted. With N subjects (folds), this





technique withholds one subject from the training set in each run, to be tested
later. Once a record has been withheld for testing, the classifier is trained using
the remaining N−1 subjects, and the withheld subject is then presented for
classification.
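
   A minimal sketch of the classifier and the leave-one-out procedure, using scikit-learn; X is the feature matrix and y the binary outcome (good = 0, bad = 1).

   from sklearn.linear_model import LogisticRegression
   from sklearn.model_selection import LeaveOneOut

   def loocv_accuracy(X, y):
       """Train on N-1 subjects, classify the withheld subject, repeat for all N."""
       correct = 0
       for train_idx, test_idx in LeaveOneOut().split(X):
           clf = LogisticRegression(max_iter=1000)
           clf.fit(X[train_idx], y[train_idx])
           correct += int(clf.predict(X[test_idx])[0] == y[test_idx][0])
       return correct / len(y)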


           3.3   Final feature set selection

We use two greedy search strategies to find the feature subset that achieves the
highest prediction accuracy. Specifically, we use backward search and forward
search:

Backward search: A greedy backward search is performed to identify a near-
optimum subset of features. Starting with all features, at each step the feature
whose removal improves prediction accuracy the most (or decreases it the least)
is removed from the current set, and the remaining set is retained as an
intermediate feature subset. This is repeated until all features have been removed.
The intermediate feature subset which provides the maximum performance,
compared with all other subsets evaluated, is selected as the final feature set.

Forward search: A sequential forward floating search algorithm is used for feature
selection, in an attempt to discover the optimal subset of features from the pool
of available candidate features. This strategy begins with a forward-selection
step, selecting the single feature from the pool of available features that improves
the prediction accuracy most. After this selection, removal of a feature from the
set of selected features is considered. The process of possible feature addition,
followed by possible feature removal, is iterated until the selected feature set
converges.
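
A sketch of the greedy backward elimination under the stated assumptions: score(features) returns the cross-validated prediction accuracy of a candidate subset (for instance, the leave-one-out accuracy above restricted to those feature columns). The search here stops when one feature remains, a slight simplification of the procedure described above.

   def backward_search(all_features, score):
       """Greedy backward elimination; returns the best intermediate feature subset."""
       current = list(all_features)
       best_subset, best_score = list(current), score(current)
       while len(current) > 1:
           # drop the feature whose removal improves accuracy most (or hurts it least)
           cand_score, to_drop = max(
               (score([f for f in current if f != g]), g) for g in current)
           current.remove(to_drop)
           if cand_score >= best_score:
               best_subset, best_score = list(current), cand_score
       return best_subset, best_score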


           4     Empirical Study

In this section, we report experiment results from testing our prediction
method on a real data set of stroke patients. First, we introduce the physiological
data set of stroke patients and the good/bad criteria used in our study. Then we
report prediction accuracy based on various combinations of feature sets. Our
study was approved by an ethics committee of the related institution.


           4.1   Experimental data sets

           A cohort of 157 patients with acute ischaemic stroke were recruited. Patients
           presenting to the Emergency Department of the Royal Brisbane and Women’s
           Hospital, an Australian tertiary referral teaching hospital, within 48 hours of
           stroke or existing inpatients with an intercurrent stroke were enrolled prospec-
           tively. Important physiological parameters, such as blood pressure, were recorded
           at least every 4 hours from the time of admission until 48 hours after the stroke.





These values were used as the outcome variable in the analyses. The measure-
ments from patients who died during these first 48 hours were also included in
the analyses. Furthermore, some demographic and other stroke-related data were
also collected such as the age and gender. The age range of these 157 patients
was 16 to 92 years, with a median age of 75 years. The patient distribution based on
different values of RS3 is shown in Figure 2.




                   Fig. 2. Patient distributions on values of RS3




4.2   Classification criteria

As shown in Figure 2, the RS3 score varies between 0 and 6. RS3 = 6 means the
subject died within three months, and RS3 = 0 means the subject recovered well
after three months. Based on RS3 values, patient outcomes can be divided into
good/bad groups using different grouping criteria. Figure 3 illustrates patient
distributions under three types of grouping criteria.


4.3   Prediction accuracy comparisons

Applying the techniques described in Section 3, we ran experiments on various
grouping criteria to test our stroke outcome prediction algorithm. We consistently
observed that backward search generates more accurate prediction results, and it
is therefore used as our default feature set search strategy. Figure 4 shows
prediction accuracy comparisons under all three types of grouping criteria. In
Figure 5, we also evaluate the effectiveness of including trend patterns as prediction








                          Fig. 3. Good vs Bad outcomes under various criteria




                        Fig. 4. Prediction Accuracy on different grouping criteria


features. Experiments show that by adding those simple trend features, the
prediction accuracy on all three grouping types is uniformly boosted from 71%
to 89–91%.


           5   Conclusion

In this paper, we describe novel algorithms to predict three-month stroke
outcomes. We have quantified the substantial improvements brought by including
physiological data trend patterns as features of a classifier. We believe that these
trends play important roles in the three-month outcomes of stroke patients. The
efficiency and accuracy of our algorithm have also been demonstrated through
our experiments.







[Bar chart: prediction accuracy (vertical axis, 60%–100%) for grouping Types 1–3, comparing features (values only) with features (values and trends).]

           Fig. 5. Prediction accuracy improved by adding trend features


    In our future work, we will first try to locate the most important trend patterns
for stroke outcome prediction. We will then work with healthcare professionals to
find the clinical ground truth behind those physiological trend patterns of stroke
patients. This would greatly benefit clinical treatment of acute ischemic stroke.
We also plan to run clinical trials to validate our prediction methods on other
real data sets of stroke patients.


References
[1] Australian Institute of Health and Welfare: Australia's health 2006, the tenth bien-
    nial health report of the Australian Institute of Health and Welfare. ISBN 1 74024
    565 2. 2006
[2] The World Health Organization MONICA Project (monitoring trends and de-
    terminants in cardiovascular disease): a major international collaboration. WHO
    MONICA Project Principal Investigators. Journal of Clinical Epidemiology.
    1988;41(2):105-14.
[3] Ye, L., Keogh, E.: Time series shapelets: a new primitive for data mining. Proceed-
    ings of the 15th ACM SIGKDD international conference on Knowledge discovery
    and data mining, Vol. 22, ACM, Paris, France, pp. 947–956.
[4] Mueen, A., Keogh, E.,Young, N.: Logical-shapelets: an expressive primitive for
    time series classification. Proceedings of the 17th ACM SIGKDD international
    conference on Knowledge discovery and data mining, Vol. 22, ACM, San Diego,
    California, USA, pp. 1154–1162.
[5] Wong, A.: The Natural History and Determinants of Changes in Physiological Vari-
    ables after Ischaemic Stroke. Ph.D. Thesis, The University of Queensland, St.Lucia.
[6] Oliveira-Filho, J., Silva, S.C.S., Trabuco, C.C., Pedreira, B.B., Sousa, E.U., Bacel-
    lar, A.: Detrimental effect of blood pressure reduction in the first 24 hours of acute
    stroke onset. Neurology. 61(8), 1047–1051.
[7] Marti-Fabregas, J., Belvis, R., Guardia, E., Cocho, D., Munoz, J., Marruecos, L.,
    Marti-Vilalta, J.-L.: Prognostic value of Pulsatility Index in Acute Intracerebral
    Hemorrhage. Neurology. 61(8), 1051–1056.




                                                                                                  53
AIH 2012




           10      Acute Ischemic Stroke Prediction

           [8] Dawson, S.L., Manktelow, B.N., Robinson, T.G., Panerai, R.B., Potter, J.F.: Which
               Parameters of Beat-to-Beat Blood Pressure and Variability Best Predict Early
               Outcome After Acute Ischemic Stroke. Stroke. 2000(31), 463–468.
           [9] Ritter, M.A., Kimmeyer, P., Heuschmann, P.U., Dziewas, R. Dittrich, R., Nabavi,
               D.G., Ringelstein, E.B.: Blood Pressure Threshold Violations in the First 24 Hours
               After Admission for Acute Stroke: Frequency, Timing, Predictors, and Impact on
               Clinical Outcome. Stroke. 2009(40), 462–468.
           [10] Castillo, J., et al., Blood pressure decrease during the acute phase of ischemic
               stroke is associated with brain injury and poor stroke outcome. Stroke, 2004. 35(2):
               p.520-6
[11] Ahmed, N., P. Nasman, and N.G. Wahlgren, Effect of intravenous nimodipine on
    blood pressure and outcome after acute stroke. Stroke, 2000. 31(6): p. 1250-5.
[12] Yong, M. and M. Kaste, Association of characteristics of blood pressure profiles
    and stroke outcomes in the ECASS II trial. Stroke, 2008. 39(2): p. 366-72.
[13] Wong AA, Schluter PJ, Henderson RD, O'Sullivan JD, Read SJ. The natural
    history of blood glucose within the first 48 hours after ischemic stroke. Neurology
    2008;70:1036-41.
[14] Christensen, H., A. Fogh Christensen, and G. Boysen, Abnormalities on ECG
    and telemetry predict stroke outcome at 3 months. J Neurol Sci, 2005. 234(1-2): p.
    99-103.
[15] Boysen, G. and H. Christensen, Stroke severity determines body temperature in
    acute stroke. Stroke, 2001. 32(2): p. 413-7.
           [16] Gujjar AR, Sathyaprabha TN, Nagaraja D, Thennarasu K and Pradhan N, Heart
               rate variability and outcome in acute severe stroke: role of power spectral analysis.
               Neurocrit Care, 2004. 1(3): p. 347-53.
           [17] Christensen, H., A. Fogh Christensen, and G. Boysen, Abnormalities on ECG and
               telemetry predict stroke outcome at 3 months. J Neurol Sci, 2005. 234(1-2): p.
               99-103.
           [18] Rankin J (May 1957). Cerebral vascular accidents in patients over the age of 60.
               II. Prognosis. Scott Med J 2 (5): 200-15.








    Comparing Data Mining with Ensemble
Classification of Breast Cancer Masses in Digital
                  Mammograms

Shima Ghassem Pour1 , Peter Mc Leod2 , Brijesh Verma2 , and Anthony Maeder1
              1
                School of Computing, Engineering and Mathematics,
                           University of Western Sydney
                   Campbelltown, New South Wales, Australia
             2
               School of Information and Communication Technology,
                          Central Queensland University
                      Rockhampton, Queensland, Australia
                 {shima.ghassempour,mcleod.ptr}@gamil.com,
                   b.verma@cqu.edu.au,a.maeder@uws.edu.au



      Abstract. Medical diagnosis sometimes involves detecting subtle indi-
      cations of a disease or condition amongst a background of diverse healthy
      individuals. The amount of information that is available for discover-
      ing such indications for mammography is large and has been growing
      at an exponential rate, due to population wide screening programmes.
      In order to analyse this information data mining techniques have been
      utilised by various researchers. A question that arises is: do flexible data
      mining techniques have comparable accuracy to dedicated classification
      techniques for medical diagnostic processes? This research compares a
      model-based data mining technique with a neural network classification
      technique and the improvements possible using an ensemble approach. A
      publicly available breast cancer benchmark database is used to determine
      the utility of the techniques and compare the accuracies obtained.

      Keywords: latent class analysis, digital mammography, breast cancer,
      clustering, classification, neural network.


1   Introduction
Medical diagnosis is an active area of pattern recognition with different tech-
niques being employed [17, 19, 12]. The expansion of digital information for
different cohorts [15] has allowed researchers to examine relationships that
previously went undiscovered, owing to the limited nature of the available
information as well as a lack of techniques for the analysis of large data sets.
Flexible data mining techniques have the capacity to predict disease and reveal
previously unknown trends.
    The question that arises is whether the relationships revealed by those
techniques are as accurate as, or comparable with, those from techniques
specifically developed for a single purpose, such as a diagnostic system for a particular





           disease or condition. This research aims at contrasting the cluster analysis tech-
           nique (Latent Class Analysis) of Ghassem Pour, Maeder and Jorm [4] against a
           baseline neural network classifier, and then considers the effects of applying an
           ensemble technique to improve the accuracies obtained.
    The paper is organised as follows: section two provides background on
the approaches that have been utilised for breast cancer diagnosis, sections three
and four detail the proposed techniques for comparison, section five outlines the
experimental results obtained, and conclusions are presented in section six.


           2   Background

           Medical diagnosis is a problematic paradigm in that complex relationships can
           exist in the diagnostic features that are utilised to map to a resultant diagnosis
           about the disease state. In different cases the state of the disease condition itself
           can be marked by stages where the diagnostic symptoms or signs can be subtle
           or different to other stages of the disease. This means that there is often not a
           clean mapping between the diagnostic features and the diagnosis [13, 14].
               Breast cancer screening using mammography provides an exemplar of this
           situation. Early detection and treatment have been the most effective way of
           reducing mortality [2] however Christoyianni et al. [1] noted that 10-30% of
           breast cancers remain undetected while 15-30% of biopsies are cancerous. Tay-
           lor and Potts [22] made similar observations in their research. There are many
           reasons why various cancers can remain undetected. These include the obfus-
           cation of anomalies by surrounding breast tissue, the asymmetry of the breast,
           prior surgery, natural differences in breast appearance on mammograms, the low
           contrast nature of the mammogram itself, distortion from the mammographic
           process and even talc or powder on the outside of the breast making it hard to
           identify and discriminate anomalies. Even if an anomaly is detected, a high rate
of false positives exists [17, 18].
               Clustering has provided a widely used mechanism for organising data into
           similar groupings. The usage of clustering has also been extended to classifiers
           and detection systems in order to improve detection and provide greater classi-
           fication accuracy. Kim et al. [9] developed a classifier based on Adaptive Res-
           onance Theory (ART2) where micro-calcifications were grouped into different
           classes with a three-layered back propagation network performing the classifica-
           tion. The system achieved 90% sensitivity (Az of 0.997) with a low false positive
           rate of 0.67 per cropped image.
    Other researchers such as Mohanty, Senapati and Lenka [16] explored the
application of data mining techniques to breast cancer diagnosis. They indicated
that data mining of medical images would allow effective models, rules and
patterns to be collected, and abnormalities to be revealed from large datasets.
Their approach was to use a hybrid feature selection technique with a decision
tree classifier to classify breast cancer, utilising 300 images from the MIAS
database. They achieved a classification accuracy of 97.7%; however, their dataset
images contained microcalcifications as well as mass anomalies.





3   Latent Class Analysis and Data Mining

Latent Class Analysis (LCA) has been proposed as a mechanism for improved
clustering of data over traditional clustering algorithms like k-means [11]. LCA
classifies subjects into one of K unobserved classes based on the observed data,
where K is a constant and known parameter. These latent or potential classes
are then refined based upon their statistical relationships with the observed vari-
ables.
     LCA is a probabilistic clustering approach: although each object is assumed
to belong to one cluster, there is uncertainty about an object’s membership of
a cluster [11, 10]. This type of approach offers some advantages in dealing with
noisy data or data with complex relationships between variables, although as an
iterative method there is always some chance that it will be susceptible to noise
and in some cases fail to converge.
     An advantage of using a statistical model is that the choice of the clus-
ter criterion is less arbitrary. Nevertheless, the log-likelihood functions corre-
sponding to LC cluster models may be similar to the criteria used by certain
non-hierarchical cluster techniques [18]. Another advantage of the model-based
clustering approach is that no decisions have to be made about the scaling of the
observed variables: for instance, when working with normal distributions with
unknown variances, the results will be the same irrespective of whether the vari-
ables are normalized or not.
     Other advantages are that it is relatively easy to deal with variables of mixed
measurement levels (different scale types) and that there are more formal cri-
teria to make decisions about the number of clusters and other model features
[3]. We have successfully applied LCA for cases in health data mining when the
anomalous range of variables results in more clusters than have been expected
from a causal or hypothesis based approach [5]. This implies that in some cases
LCA may be used to reveal associations between variables that are more subtle
and complex.
     Unsupervised clustering requires prior specification of the number of clusters
K to be constructed, implying that a model for the data is necessary which pro-
vides K. The binary nature of the diagnosis problem implies that K=2 should
be used in ideal circumstances, but the possibility exists that allowing more
clusters would give a better solution (e.g. by allowing several different classes
within the positive or negative groups). Consequently a figure of merit is needed
to establish that the chosen K value is optimal. In this research the Bayesian
Information Criterion (BIC) is determined for the mass dataset in order to gauge
the best number of clusters.
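
    As an illustration of choosing the number of clusters by an information criterion, the sketch below uses a Gaussian mixture model from scikit-learn as a stand-in for the LCA model fitted in Latent Gold; X is assumed to be a numeric feature matrix.

   from sklearn.mixture import GaussianMixture

   def best_k_by_bic(X, k_values=(2, 3, 4, 5), seed=0):
       """Fit one model per candidate K and keep the K with the lowest BIC."""
       bics = {k: GaussianMixture(n_components=k, random_state=seed).fit(X).bic(X)
               for k in k_values}
       return min(bics, key=bics.get), bics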
     Repeated application of the clustering approach can also lead to different so-
lutions due to randomness in starting conditions. In this work we used multiple
applications of the clustering calculations to allow improvement in the results,
in an ensemble-like approach. Our improvement strategy was based on selection
of the most frequent membership of classes per element, over different numbers
of clustering repetitions.





           4   Neural Network and Ensemble Methods

           Neural networks have been advocated for breast cancer detection by many re-
           searchers. Various efforts to refine classification performance have been made,
           using a number of strategies involving some means of choice between alternatives.
           Ensembles have been proposed as a mechanism for improving the classification
accuracy of existing classifiers [6], provided that the constituents are diverse.
               Zhang et al. [23] partitioned their mass dataset obtained from the DDSM
           into several subsets based on mass shape and age. Several classifiers were then
           tested and the best performing classifier on each subset was chosen. They used
           SVM, k-nearest neighbour and Decision Tree (DT) classifiers in their ensemble
           and achieved a combined classification accuracy of 72% that was better than
           any individual classifier.
    Surrendiran and Vadivel [21] proposed a technique to determine which features
have the most influence on classification accuracy, achieving 87.3% classification
accuracy. They did this by using ANOVA DA, Principal Component Analysis and
stepwise ANOVA analysis to determine the relationship between the input features
and classification accuracy.
    Mc Leod and Verma [14] utilised a clustered ensemble technique that relied
on the notion that some patterns can be readily identified through clustering
(atomic), while other patterns that are not so easily separable (non-atomic) are
classified by a neural network. The classification process involved an initial
lookup to determine whether a pattern was associated with an atomic class; for
non-atomic classes, a neural network ensemble created through an iterative
clustering mechanism (to introduce diversity into the ensemble) was employed.
The advantage of this technique is that the ensemble was not adversely affected
by outliers (atomic clusters). This technique was applied to the same mass dataset
as utilised in this research and achieved a classification accuracy of 91%.
    The ensemble utilised in this research was created by fusing together, using
the majority vote algorithm, constituent neural networks obtained by varying the
number of neurons in the hidden layer, so as to introduce diversity into the
ensemble classifier.
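
    A minimal sketch of this majority-vote fusion, with diversity introduced by varying the hidden-layer size (the sizes shown are those of the six-network row in Table 5); scikit-learn's MLPClassifier stands in for the MATLAB networks used here, and binary labels 0/1 are assumed.

   import numpy as np
   from sklearn.neural_network import MLPClassifier

   def majority_vote_ensemble(X_train, y_train, X_test,
                              hidden_sizes=(24, 5, 15, 32, 31, 43), seed=0):
       """Train one MLP per hidden-layer size and fuse predictions by majority vote."""
       votes = []
       for h in hidden_sizes:
           net = MLPClassifier(hidden_layer_sizes=(h,), max_iter=3000, random_state=seed)
           net.fit(X_train, y_train)
           votes.append(net.predict(X_test))
       votes = np.stack(votes)                          # shape: (n_networks, n_test_samples)
       return (votes.mean(axis=0) >= 0.5).astype(int)   # majority vote per test sample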


           5   Experimental Results

           The experiments were conducted for LCA and neural network techniques and
           the related ensemble approaches using mass type anomalies from the Digital
           Database of Screening Mammography (DDSM) [7]. The features used for classi-
           fication purposes coincided with the Breast Imaging Reporting and Data System
           (BI-RADS) as this is how radiologists classify breast cancer. The BI-RADS fea-
           tures of density, mass shape, mass margin and abnormality assessment rank are
           used as they have been proven to provide good classification accuracy [20]. These
           features are then combined with patient age and a subtlety value [7].
               Experiments were performed utilising the clustering technique of Ghassem





Pour, Maeder and Jorm [4] on this dataset. This was achieved using the Latent
Gold® software package. The first step was to use the analysis feature of Latent
Gold® to calculate the BIC value and the classification error rate. This information
appears in Table 1 below, with Npar designating the resulting parameter value
associated with the LCA.

         Table 1. LCA Cluster optimisation based on Classification Error.

                     Clusters BIC Npar Classification Error
                        2    1238.8 30        0.0303
                        3    1240.6 38        0.0403
                        4    1241.8 46        0.0446
                        5    1254.1 54        0.0470



    Minimisation of the BIC and the classification error determines the best number
of clusters for the LCA analysis in terms of classification accuracy, and this was
found to be 2 clusters. Nevertheless, it might be expected that some further
complexity could be identified with higher numbers of clusters, where multiple
clusters may exist for either positive or negative classes. The results obtained
when solutions of more than 2 clusters were merged to form the dominant positive
and negative classes are detailed in Table 2. These results show the instability


                 Table 2. LCA Classification Technique Accuracy.

                               Clusters Accuracy %
                                  2        87.2
                                  3        56.7
                                  4        43.2
                                  5        32.8



of LCA classification for this dataset at higher numbers of clusters: for example,
the 2-cluster solution gives better accuracy than the 3-cluster solution (merged
into 2 clusters), and so forth. From this we conclude that the natural 2-cluster
solution is indeed optimal.
    In order to provide a comparison, further experiments were performed using
a neural network and then applying an ensemble classifier. The neural network
and ensemble techniques were implemented in MATLAB® utilising the neural
network toolbox. The parameters utilised are detailed in Table 3 below.
Experiments were first performed with a neural network classifier alone, in order
to provide a baseline for measuring the classification accuracy on the selected
dataset. The results obtained are detailed in Table 4 below. Further experiments
were then performed utilising an ensemble technique, with a summary of the
neural network test results using ten-fold cross validation detailed in Table 5
below.





                          Table 3. Neural network configuration parameters.

                                    Parameter             Value
                                    Hidden Layers            1
                                    Transfer Function     Tansig
                                    Learning Rate          0.05
                                    Momentum                0.7
                                    Maximum Epochs         3000
                                    Root Mean Square Goal 0.001

                       Table 4. Neural network classification technique accuracy.

                                    Hidden Neurons Accuracy (%)
                                          13            80
                                          25            80
                                          52            90
                                         111            79

                        Table 5. NN-ensemble classification technique accuracy.

                  Networks         Hidden Neurons in Ensemble            Accuracy (%)
                      6                   24,5,15,32,31,43                    94
                     10             24,5,15,32,31,43,50,75,38,59             96.5
                     13        24,5,15,32,31,43,50,75,38,59,68,79,116         98
                     15    24,5,15,32,31,43,50,75,38,59,68,79,116,146,14      96



               Experiments were also performed using an ensemble-like combination of results
            from the LCA technique. It is difficult to match this process directly with the
            complexity used for the NN-ensemble experiments, so the number of repetitions
            was chosen as a plausible setting for a dataset of 100 cases. The results for
            these experiments are shown in Table 6 below.


                       Table 6. LCA-ensemble classification technique accuracy.

                                       Repetitions Accuracy (%)
                                           10           87
                                           20           89
                                           40           91
                                           70           94




           6   Discussion and Conclusions

           Examination of the results from Tables 1 to 6 demonstrates that the accuracy
           obtained with the LCA technique is below that of the baseline classification









performed with the neural network. However, an ensemble-oriented approach
improved the results of both techniques.
   In order to examine the results more closely, the sensitivity, specificity and
positive predictive value have been calculated for the best performing results of
each of the trialled techniques, as shown in Table 7 below.
   Sensitivity is the number of True Positives divided by the sum of the True
Positive and False Negative counts. Sensitivity can be thought of as the probability
of detecting cancer when it exists.
   Specificity is the number of True Negatives divided by the sum of the True
Negative and False Positive counts. Specificity can be thought of as the probability
of being correctly diagnosed as not having cancer.
   Positive Predictive Value (PPV) is the number of True Positives divided by
the sum of the True Positive and False Positive counts. PPV is the accuracy of
identifying malignant abnormalities.
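
Expressed over the confusion-matrix counts, the three measures can be computed as in the short sketch below; the counts shown are hypothetical and are not taken from the paper.

# Sensitivity, specificity and PPV from confusion-matrix counts.
def sensitivity(tp, fn):
    return tp / (tp + fn)        # probability of detecting cancer when present

def specificity(tn, fp):
    return tn / (tn + fp)        # probability of a correct "no cancer" decision

def ppv(tp, fp):
    return tp / (tp + fp)        # accuracy of the positive (malignant) calls

tp, fn, fp, tn = 90, 10, 5, 95   # hypothetical counts for illustration only
print(sensitivity(tp, fn), specificity(tn, fp), round(ppv(tp, fp), 3))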


            Table 7. Performance results for the proposed techniques.

                 Technique                                 Performance(%)
                                 Sensitivity Specificity        PPV
           Latent Class Analysis    80.5       93.9              95.0
              LCA-ensemble          82.7       95.2              96.0
             Neural Network         91.6       88.4              90.0
               NN-ensemble          97.0       97.9              99.0



The latent class analysis technique was not as sensitive as the neural network
but had better specificity and a higher positive predictive value than the neural
network. Both ensemble approaches resulted in substantially better performance,
which of course must be traded off against the increased computational cost.
The NN-ensemble technique performed best, with good sensitivity, specificity
and a high positive predictive value.
    The flexibility of clustering techniques such as LCA provides a mechanism for
gaining insight from large data repositories. However, once patterns in the data
become evident, other less flexible but more specialised techniques could be
utilised to analyse the data in question at a finer level of granularity.
    A summary of the overall performance of the techniques employed in this
paper is presented in Figure 1. The optimal LCA-ensemble result, while lower
than the optimal NN-ensemble result, is obtained with somewhat less processing
effort and complexity, and further improvement may be possible.
    Future work could extend the comparison of LCA with other data mining
algorithms to determine their applicability. Breast cancer represents only one
problem domain, and applying these methods to other datasets would be a logical
extension. Our future research will include more experiments with Latent GOLD®
on other breast cancer datasets to determine how different numbers of clusters
produce different classification results, allowing a more detailed analysis.












                               Fig. 1. Comparative Classification Accuracies.


           References

           1. Christoyianni, I., Koutras, A., Dermatas, E., Kokkinakis, G.: Computer Aided Diag-
              nosis of Breast Cancer in Digitized Mammograms. Computerized Medical Imaging
              and Graphics 26(5), 309-319 (2002)
           2. DeSantis, C., Siegel, R., Bandi, P., Jemal, A.:Breast Cancer Statistics, 2011. CA: A
              Cancer Journal for Clinicians 61(6), 408-418 (2011)
           3. Fraley, C., Raftery, A.: Model-based Clustering, Discriminant Analysis, and Density
              Estimation. Journal of the American Statistical Association 97(458), 611-631(2002)
           4. Ghassem Pour, S., Maeder, A., Jorm, L.: Constructing a Synthetic Longitudinal
              Health Dataset for Data Mining. DBKDA 2012, The Fourth International Confer-
              ence on Advances in Databases, Knowledge, and Data Applications.86-90 (2012)
           5. Ghassem Pour, S., Maeder, A., Jorm, L.: Validating Synthetic Health Datasets for
              Longitudinal Clustering. The Australasian Workshop on Health Informatics and
              Knowledge Management (HIKM 2013) 142, to appear (2013)
           6. Gou, S., Yang, H., Jiao, L., Zhuang, X.: Algorithm of Partition Based Network
              Boosting for Imbalanced Data Classification. The International Joint Conference
              on Neural Networks (IJCNN).1-6. IEEE (2010)
           7. Heath, M., Bowyer, K., Kopans, D., Moore, R., Kegelmeyer, P.: The Digital
              Database for Screening Mammography. Proceedings of the 5th International Work-
              shop on Digital Mammography.212-218 (2000)
           8. Hofvind, S., Ponti, A., Patnick, J., Ascunce, N., Njor, S., Broeders, M., Giordano,
              L., Frigerio, A., Tornberg, S.: False-positive Results in Mammographic Screening
              for Breast Cancer in Europe: a literature review and survey of service screening
              programmes. Journal of Medical Screening 19(1), 57-66 (2012)
9. Kim, J., Park, J., Song, K., Park, H.: Detection of Clustered Microcalcifications on
              Mammograms Using Surrounding Region Dependence Method and Artificial Neural
              Network. The Journal of VLSI Signal Processing 18(3),251-262 (1998)









10. Lanza, S., Flaherty, B., Collins, L.: Latent Class and Latent Transition Analysis.
   Handbook of Psychology. 663-685 (2003)
11. Magidson, J., Vermunt, J.: Latent Class Models for Clustering: A Comparison with
   k-means. Canadian Journal of Marketing Research 20(1), 36-43 (2002)
12. Malich, A., Schmidt, S., Fischer, D., Facius, M., Kaiser, W.: The Performance of
   Computer-aided Detection when Analyzing Prior Mammograms of Newly Detected
   Breast Cancers with Special Focus on the Time Interval from Initial Imaging to
   Detection. European Journal of Radiology 69(3),574-578 (2009)
13. Mannila, H.: Data mining: Machine learning, Statistics, and Databases. Pro-
   ceedings of Eighth International Conference on Scientific and Statistical Database
   Systems.2-9 IEEE (1996)
14. McLeod, P., Verma, B.: Clustered Ensemble Neural Network for Breast Mass Clas-
   sification in Digital Mammography. In: The International Joint Conference on Neu-
   ral Networks (IJCNN). 1266-1271 (2012)
15. Mealing, N., Banks, E., Jorm, L., Steel, D., Clements, M., Rogers, K.: Investiga-
   tion of Relative Risk Estimates from Studies of the Same Population with Contrast-
   ing Response rates and Designs. BMC Medical Research Methodology 10(1), 10-26
   (2010)
16. Mohanty, A., Senapati, M., Lenka, S.: A Novel Image Mining Technique for Clas-
   sification of Mammograms Using Hybrid Feature Selection. Neural Computing &
   Applications. 1-11 (2012)
17. Nishikawa, R., Kallergi, M., Orton, C., et al.: Computer-aided Detection, in its
   present form, is not an Effective aid for Screening Mammography. Medical Physics
   33(4), 811-814 (2006)
18. Nylund, K., Asparouhov, T., Muthen, B.: Deciding on the Number of Classes in
   Latent Class Analysis and Growth Mixture Modeling: A Monte Carlo Simulation
   Study. Structural Equation Modeling 14(4), 535-569 (2007)
19. Oh, S., Lee, M., Zhang, B.: Ensemble Learning with Active Example Selection for
   Imbalanced Biomedical Data Classification. IEEE/ACM Transactions on Compu-
   tational Biology and Bioinformatics 8(2), 316-325 (2011)
20. Sampat, M., Bovik, A., Markey, M.: Classification of Mammographic lesions into
   BIRADS Shape Categories Using the Beamlet Transform. In: Proceedings of SPIE,
   Medical Imaging: Image Processing. 16-25. SPIE(2005)
21. Surrendiran, B., Vadivel, A.: Feature Selection Using Stepwise ANOVA, Discrimi-
   nant Analysis for Mammogram Mass Classification. International Journal of Recent
   Trends in Engineering and Technology 3, 55-57 (2010)
22. Taylor, P., Potts, H.: Computer Aids and Human Second Reading as Interventions
   in Screening Mammography: two systematic reviews to compare effects on cancer
   detection and recall rate. European Journal of Cancer 44(6), 798-807 (2008)
23. Zhang, Y., Tomuro, N., Furst, J., Raicu, D.: Building an Ensemble System for
   Diagnosing Masses in Mammograms. International Journal of Computer Assisted
   Radiology and Surgery 7(2), 323-329 (2012)












    Automatic Classification of Cancer Notifiable
                Death Certificates

                   Luke Butt1, Guido Zuccon1, Anthony Nguyen1,
                        Anton Bergheim2, Narelle Grayson2
        1 The Australian e-Health Research Centre, Brisbane, Queensland, Australia;
        2 Cancer Institute NSW, Alexandria, New South Wales, Australia.
               {luke.butt, guido.zuccon, anthony.nguyen}@csiro.au
            {anton.bergheim, narelle.grayson}@cancerinstitute.org.au



         Abstract. The timely notification of cancer cases is crucial for can-
         cer monitoring and prevention. However, the abstraction and classifica-
         tion of cancer from the free-text of pathology reports and other relevant
         documents, such as death certificates, are complex and time-consuming
         activities. In this paper we investigate approaches for the automatic de-
         tection of cases where the cause of death is a notifiable cancer from
         free-text death certificates supplied to Cancer Registries. A number of
         machine learning classifiers were investigated. A large set of features
         were also extracted using natural language techniques and the Medtex
         toolkit; features include stemmed words, bi-grams, and concepts from
         the SNOMED CT medical terminology. The investigated approaches
         were found to be very effective in identifying death certificates where the
         cause of death was a notifiable cancer. Best performance was achieved by
         a Support Vector Machine (SVM) classifier with an overall F-measure of
         0.9647 when evaluated on a set of 5,000 free-text death certificates. This
         classifier considers as features stemmed token bigrams and information
         from SNOMED CT concepts filtered by morphological abnormalities and
         disorders. However, our analysis shows that it is the selection of features
         that most influences the performance of the classifiers rather than the
         type of classifier or the feature weighting schema. Specifically, we found
         that stemmed token bigrams, with or without SNOMED CT concepts,
         are the most effective features. In addition, the combination of token bi-
         grams and SNOMED CT information was found to yield the best overall
         performance.

         Keywords: death certificates, Cancer Registry, cancer monitoring and
         reporting, machine learning, natural language processing, SNOMED CT


1       Introduction

Cancer notification and reporting is an important and fundamental process for
providing an accurate picture of the impact, nature and extent of cancer, and for
directing research efforts towards its cure. Cancer Registries collect and interpret
data from a large number of sources, helping to improve cancer








           prevention and control, as well as treatments and survival rates for patients with
           cancer.
                The manual coding of documents, such as pathology reports and death cer-
           tificates, with respect to notifiable cancers and corresponding synoptic factors
           (such as primary site, morphology, etc.) is a laborious and time consuming pro-
           cess. Cancer Registries strive to provide timely and accurate information on
           cancer incidence and mortality in the community. They receive large quantities
            of data from a range of sources, including hospitals, pathology laboratories and
            Registries of Births, Deaths and Marriages (which release death certificates). It
            is estimated that incident cases within Cancer Registries that have death
            certificate only notifications amount to about 1-5% of the total cases; delays in
            the processing of this data may cause underestimation of the incidence of cancer.
            Computational methods for the automatic abstraction of relevant information
            have the potential to enhance a Cancer Registry’s workflow, providing time and
            cost savings as well as timely cancer incidence and mortality information. This
            automatic process is, however, challenging, both because of the complex nature
            of the language used in the reports and because of the high level of recall and
            accuracy required.
                Previous works have attempted to provide automatic cancer coding from
           free-text pathology reports collected by Cancer Registries. For example, Nguyen
           et al. [1] used natural language processing techniques and a rule-based system
           to automatically extract relevant synoptic factors from electronic pathology re-
           ports. Similarly, Zuccon et al. [2] showed how these techniques could cope with
           character recognition errors generated by scanning free-text pathology reports
           stored in paper form. Machine learning approaches have also been considered; for
           instance, D’Avolio et al. [3] have tested approaches based on supervised machine
            learning (Conditional Random Fields and Maximum Entropy) and have shown
            their effectiveness for the classification of pathology reports that were consistent
            with cancer in the domains of colorectal, prostate, and lung cancer.
                Cancer Registries have access to a number of data sources beyond pathology
           reports. One such data source is death certificates. Death certificates are a rich
           source of data that can support cancer surveillance, monitoring and reporting.
           These certificates contain free-text sections that report the cause of the death of
           an individual. An example of the free-text content of a death certificate where
           the cause of death is a notifiable cancer is given in Figure 1, while Figure 2 is
           an example of a non-notifiable death certificate.
                 Limited work has focused on computational methods for automatically
            classifying death certificates with respect to the cause of death. The Super-
           MICAR system and its related tools1 provide a semi-automatic coding of the
           cause of death in death certificates. The system identifies keywords and expres-
           sions from the free-text documents that indicate possible causes of death; this
           is done through the use of a standard set of expressions encoded in a predefined
           vocabulary. Extracted free-text expressions are then converted to one or more
           1
               Consult http://www.cdc.gov/nchs/nvss/mmds/super_micar.htm (last visited 19th
               November 2012) for further details.








         (I)A) MAXILLARY TUMOR, 2 YEARS B) PULMONARY OEDEMA, 1 WEEK
         (II) CEREBROVASCULAR ACCIDENT/DYSPLASIA, 20 YEARS ASTHMA

Fig. 1. A de-identified death certificate where the cause of death is a notifiable cancer.

I(A) CEREBROVASCULAR ACCIDENT 48 HOURS (B) CEREBRAL ARTERIOSCLEROSIS YEARS
(C) HYPERTENSION YEARS II CHRONIC ALCOHOLISM YEARS

Fig. 2. A de-identified death certificate where the cause of death is not a notifiable
cancer.


ICD-10 codes which are then aggregated into a single ICD-10 underlying cause
of death through the use of a rule-base. While doctor reported death certificates
can be fed directly into the system, Coroner reported ones require additional
pre-processing. A considerable proportion (between 15 and 20 percent according to a
US study [4]) of death certificates cannot be coded through SuperMICAR and
related tools, and thus require manual coding. A recent work has successfully
classified death certificates related to pneumonia and influenza using a natural
language processing pipeline and rule-based system [5]. However, to the best
of our knowledge, no previous research has been conducted to investigate fully
automatic methods that go beyond keyword spotting of standard cause of death
expressions to classifying death certificates, in particular focusing on certificates
where the main cause of death is cancer. Furthermore, while Australian Can-
cer Registries can acquire free-text death certificates on a fortnightly basis from
the Registry of Births Deaths and Marriages, coded causes of death produced
by SuperMICAR (and related products) are released by the Australian Bureau
of Statistics on a yearly basis. Computational methods able to tackle the fast
identification of death certificates where the cause of death is a notifiable can-
cer would enhance the cancer reporting and monitoring capabilities of Cancer
Registries.
    In this paper, we focus on the problem of automatically identifying death
certificates where the main cause of death is cancer. This problem is cast into
a binary classification problem, i.e. death certificates are classified as containing
a cause of death related to cancer or, conversely, as not containing one. Several
machine learning classifiers were investigated for this task. These include support
vector machines, Naive Bayes, decision trees, and boosting algorithms. A
state-of-the-art information extraction tool (Medtex [6]) is used to create
different sets of features that are used to train the classifiers; dif-
ferent feature weighting schemas were also considered. Features include stemmed
tokens, n-grams, as well as SNOMED CT concept ids and tokens from fully spec-
ified names of SNOMED CT concepts, among others. SNOMED CT is a medical
terminology which formally describes in detail the coverage and knowledge of
topics and terminology used in the medical domain.
    Our approaches are tested on 5,000 de-identified death certificates acquired
from an Australian Cancer Registry, using 10-fold cross validation for allow-
ing robust training and testing. Our experimental results demonstrate that the








            choice of classifier and weighting schema, although important, is not crit-
           ical for achieving high classification effectiveness. Instead, the choice of features
           used to represent content of death certificates is the determining factor for high
           classification effectiveness. Specifically, stemmed token bigrams are found to be
           the single most important features among those extracted. Furthermore, we
           found that SNOMED CT features provide consistent increments in classification
           effectiveness if used along with stemmed token bigrams; although not providing a
           large increment, the combined use of stemmed token bigrams and SNOMED CT
            morphology provides the best classification effectiveness in our experiments.
               Next, we detail the approaches adopted in this paper. Then, in Section 3 we
           outline our empirical evaluation methodology; classification results obtained by
           the investigated approaches are reported in Section 4. An analysis of the results
           is developed in Section 4.1. The paper concludes in Section 5 summarising our
           main contribution and directions for future work.


           2     Approaches for Automatic Classification of Death
                 Certificates
           In this paper we investigate supervised machine learning approaches for the
           detection of death certificates where the cause of death is a notifiable cancer.
           These approaches are characterised by three main variables: (1) the features
           extracted from the documents (Section 2.1), (2) the weighting schemas applied
           to the features to represent documents (Section 2.2), and (3) the specific binary
           classifier used to individuate certificates where the cause of death is a notifiable
           cancer (Section 2.3).

           2.1     Automatic Feature Extraction
           Machine learning algorithms require data to be represented by features, such as
           the words that occur in a text document. We used the information extraction
           capabilities of the Medtex system2 for obtaining a set of meaningful features
           from the free-text of the death certificates.
              The feature sets investigated in this paper are:
            stem: a token stem, i.e. the stemmed version of a word contained in a certificate
            stemBigram: the bi-gram formed by two token stems, i.e. a pair of adjacent
               stemmed words as found in a certificate
           concept: SNOMED CT concepts as found in the free-text of the certificates
              using the Medtex system
           conceptFull: the tokens of the fully specified name of the extracted SNOMED CT
              concepts
           2
               Medtex comprises both information extraction capabilities (extracting both low level
               information such as word tokens and stems, punctuation, etc., and higher level se-
               mantic information such as UMLS and SNOMED CT concepts [1]) and classification
               capabilities integrated via its rule-based engine.








concFullMorph: the tokens of the fully specified name of extracted SNOMED CT
   concepts that are morphologic abnormalities or disorders
concBigram: the bigram formed by two adjacent SNOMED CT concept ids
concFullBigram: the bigram formed by two adjacent tokens in the fully specified
   name of concepts extracted from SNOMED CT

    While features like stem and stemBigram are commonly used for classifying
free-text documents, features based on SNOMED CT concepts and their properties,
such as tokens from the fully specified name, have not been exploited by previous
works that attempted to classify free-text death certificates. SNOMED CT
provides a standard clinical terminology used to map various descriptions of a
clinical concept to a single standard clinical concept. In this work, the SNOMED
CT ontology was used as an underlying mechanism to classify free-text using se-
mantically matching SNOMED CT concepts.
    In addition, we also considered pair-wise combinations of features. In this
paper we shall report the results obtained by all features used individually, and
by the combinations concept + stem, concept + stemBigram, concFullMorph +
stemBigram, and concBigram + stemBigram, which showed promise in preliminary
investigations.
    Next, we consider the example death certificates given in Figure 1 and Fig-
ure 2 to describe how a feature set is constructed. To build the feature represen-
tations, we examine each death certificate and for each occurring instance of a
feature in the certificate we assign a value of 1, while the absence of a feature is
marked by a zero entry value. Note that these values are subsequently modified
according to the feature weighting functions, as we shall describe in Section 2.2.
After all certificates have been processed in this manner, we add a final feature
cancerNotifiable, whose value is obtained from ground truth judgements supplied
with the data. Table 1 shows an extract of the feature data constructed for the
two example death certificates. The task of the machine learning classifiers is to
predict the value of the cancerNotifiable feature, given the learning data supplied.
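
The sketch below illustrates this construction for the two example certificates. It is a simplified stand-in for the Medtex pipeline: NLTK's Porter stemmer is assumed for the stemming step, the certificate texts are reduced to plain word sequences, and the SNOMED CT concept features are omitted because they require a terminology service.

# Simplified stem / stemBigram feature construction for the two examples.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def feature_set(text):
    stems = [stemmer.stem(tok) for tok in text.lower().split()]
    bigrams = [" ".join(pair) for pair in zip(stems, stems[1:])]
    return set(stems) | set(bigrams)                 # presence/absence only

cert_fig1 = ("MAXILLARY TUMOR 2 YEARS PULMONARY OEDEMA 1 WEEK "
             "CEREBROVASCULAR ACCIDENT DYSPLASIA 20 YEARS ASTHMA")
cert_fig2 = ("CEREBROVASCULAR ACCIDENT 48 HOURS CEREBRAL ARTERIOSCLEROSIS "
             "YEARS HYPERTENSION YEARS CHRONIC ALCOHOLISM YEARS")

docs = {"Figure 1": feature_set(cert_fig1), "Figure 2": feature_set(cert_fig2)}
vocabulary = sorted(set().union(*docs.values()))

# One binary row per certificate; the cancerNotifiable label would be appended
# from the ground-truth judgements supplied with the data.
rows = {name: [int(f in feats) for f in vocabulary] for name, feats in docs.items()}
print(len(vocabulary), rows["Figure 1"][:10])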


                 Feature (type)                            Figure 1  Figure 2
                 ACCID (stem)                                  1         1
                 TUMOR (stem)                                  1         0
                 DYSPLASIA (stem)                              1         0
                 ASTHMA (stem)                                 1         0
                 WEEK (stem)                                   1         0
                 YEAR (stem)                                   1         1
                 ALCOHOL (stem)                                0         1
                 20 YEAR (stemBigram)                          1         0
                 ACCID 48 (stemBigram)                         0         1
                 Cerebrovascular accident (conceptFull)        1         1
                 Neoplasm of maxilla (conceptFull)             1         0
                 Cerebral arteriosclerosis (conceptFull)       0         1
                 ...                                          ...       ...
                 cancerNotifiable                              1         0

          Table 1. Feature data built from two example death certificates.








              Note that no further processing is applied to the text, for example, for remov-
           ing punctuation, identifying section or list labels, or for removing or correcting
           typographical errors present in the free-text. While adequate text pre-processing
           may enhance the quality of the text itself and thus of the extracted features,
            we left this for future work and instead focused on investigating weighting
            schemas for the selected features and binary classifiers.


           2.2   Feature Weighting

           A number of weighting schemes for capturing the local importance of a feature
           in a report were tested.
                Binary coefficients were used to encode the presence or absence of a feature.
           We refer to this schema as binary.
                The weighting schema composed by the feature frequency f (F) of feature F
           was used to capture the number of times a specific feature appeared within a
           document. We shall refer to this weighting schema as frequency.
                 Variations of the frequency weighting schema were also experimented with. In
            the frequency schema, feature frequencies are directly translated into weights,
            i.e. weights are linearly derived from frequencies; the variations instead consider
            non-linear functions of the frequency of a feature.
                A first variation was to scale the appearance of feature F in a free-text
           death certificate by the function 1 + log(f (F)) if f (F) ≥ 1, and 0 if the feature
           was absent. This function would capture the fact that little importance is given
            to subsequent appearances of a feature F in a document: the logarithm grows
            only slowly for arguments greater than one. In the following, we shall refer to this
           weighting schema as LogF, i.e. logarithm of the frequency.
                A second variation was to assign increasing weights to features that appear
           with high frequencies within the death certificate. To this aim, the appearance of
            feature F was weighted according to the function e^f(F), while a zero value was
           assigned to absent features. It is suggested that, given the short length of the
           considered death certificates, the unexpected multiple occurrence of a feature
           would provide strong evidence that that feature is important for the document.
           Using the exponential function to weight occurrences of a feature would assign
           dominating scores to features that occur frequently in a document. We shall refer
           to this weighting function as expF.
                Note that only local weighting functions were used to assign scores to fea-
            tures, that is, weights were computed only by taking into account the frequencies
           of appearance of a feature in a text, thus ignoring the distribution of that feature
           on a global level, i.e. across the dataset. The incorporation of global occurrence
           statistics within the weighting schemas is left to future work.
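
For concreteness, the four local weighting functions described above can be written as in the following sketch, where f is the raw frequency of a feature in a certificate.

# The four local weighting schemas applied to a raw feature frequency f.
import math

def binary(f):
    return 1 if f >= 1 else 0

def frequency(f):
    return f

def logF(f):
    return 1 + math.log(f) if f >= 1 else 0   # dampens repeated occurrences

def expF(f):
    return math.exp(f) if f >= 1 else 0       # amplifies repeated occurrences

for f in (0, 1, 2, 3):
    print(f, binary(f), frequency(f), round(logF(f), 2), round(expF(f), 2))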


           2.3   Automatic Classification Methodology

           A number of common classifiers were evaluated. These comprised statistical mod-
           els (Naive Bayes), support vector machines (SPegasos), decision trees (C4.5), and








boosting algorithms (AdaBoost). We considered the implementations of these
algorithms provided in the Weka toolkit [7].
     The multinomial Naive Bayes classifier determines the class of a death cer-
tificate according to the features that occur in the text and their weights. The
SPegasos classifier uses a stochastic gradient descent algorithm and a hinge loss
function to produce the separation hyperplane used by the linear support vector
machine. In the C4.5 classifier, information gain is used for choosing at each
level of the decision tree the most effective feature able to split the data into
the two binary classes considered here (i.e. death certificates related to cancers
and those not related to cancer). AdaBoost minimises a convex loss function
built from the predictions of a weak base classifier. A simple binary decision tree
classifier that constructs one-level trees was used as base classifier for Adaboost.
     Parameters of all classifiers were set to the default values described in Witten
et al. [7].
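
The experiments themselves were run with the Weka implementations; the sketch below only lists rough scikit-learn counterparts (an assumption, not the authors' code) to make the classifier line-up concrete: multinomial Naive Bayes, a hinge-loss SGD linear SVM in the spirit of SPegasos, a CART tree with an entropy criterion standing in for C4.5, and AdaBoost whose default base estimator is a one-level tree (decision stump).

# Approximate scikit-learn counterparts of the Weka classifiers used in the paper.
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier

classifiers = {
    "Naive Bayes": MultinomialNB(),
    "SVM (SGD, hinge loss)": SGDClassifier(loss="hinge", random_state=0),
    "Decision tree (entropy)": DecisionTreeClassifier(criterion="entropy",
                                                      random_state=0),
    "AdaBoost (stumps)": AdaBoostClassifier(random_state=0),  # default base: depth-1 tree
}

Each of these exposes the usual fit/predict interface, so any of them can be dropped into the cross-validation protocol described in Section 3.2.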


3     Experimental Methodology
3.1    Data
A set of 5,000 free-text death certificates was acquired from Cancer Institute
NSW, the institutional entity responsible for maintaining the Central Cancer
Registry in New South Wales. Ethics approval was granted by the NSW Popu-
lation & Health Services Research Ethics Committee for this study, including the
use of the de-identified data. The free-text documents were short in length, con-
taining on average 13.08 words; the (unstemmed) vocabulary contained 3,751
unique words (including section headings and labels).
    Cause of death classifications based on ICD-10 codes accompanied the re-
ports. This coding set was acquired from the Australian Bureau of Statistics,
which releases coded data yearly. ICD-10 codings were used to determine the
class each death certificate belonged to. A list of ICD-10 codes that are cancer
notifiable was provided by Cancer Institute NSW.
    The 5,000 death certificates were extracted from Cancer Institute NSW
archives so that documents were uniformly split across the two classes, i.e. 2,500
certificates were coded with ICD-10 codes that are for notifiable cancers accord-
ing to the business rules of Cancer Institute NSW, while the remaining 2,500
were not cancer notifiable. The causes of death of the 2,500 death certificates
for notifiable cancers span a total of 367 unique ICD-10 codes.

3.2    Evaluation
A 10-fold cross validation methodology was used to train and test the classifica-
tion algorithms. In this methodology, the dataset was randomly divided into 10
stratified3 folds of equal dimensions. A model for each classifier was then learnt
3
    Folds were automatically stratified with respect to the two target classes, not the
    ICD-10 codes.








           on nine of these folds, leaving one fold out for testing. The process was repeated
           by selecting a new fold for testing, while a new model was learnt from the re-
           maining folds. Classification effectiveness was then averaged across the folds left
           out for testing in each iteration.
               F-Measure (F-m) was used as primary metric to evaluate the efficacy of the
           implemented classifiers; accuracy, recall (sensitivity, Rec) and precision (posi-
           tive predictive value, Prec) were also recorded, along with the number of true
            positive (TP), false positive (FP), true negative (TN), and false negative (FN)
           classifications.
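
A minimal sketch of this protocol, assuming a scikit-learn classifier and a synthetic stand-in for the death-certificate feature matrix, is given below.

# 10-fold stratified cross-validation with F-measure as the primary metric.
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

X, y = make_classification(n_samples=5000, weights=[0.5, 0.5], random_state=0)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_validate(SGDClassifier(loss="hinge", random_state=0), X, y, cv=cv,
                        scoring=("f1", "accuracy", "recall", "precision"))
print("mean F-measure:", scores["test_f1"].mean())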


           4   Results and Discussion

           The combination of 10 features, 4 weighting schemas, and 4 classifiers requires
           the evaluation of a total of 160 classifier settings (referred to as runs in the
           following) on the dataset consisting of 5,000 death certificates. While we eval-
           uated all combinations of features, weighting schema and classifiers, given the
           large number of combinations, it is not feasible to report the individual results
           for each of the runs. Thus, we report only the settings of the 40 most effective
            runs in terms of F-measure, our primary evaluation metric (Table 2), with the
           F-measure of each classifier over all experimented settings graphically shown in
           Figure 3. Later in the paper we shall consider a summary evaluation of the vari-
           ability of results provided by features, weighting schemas, and classifiers. This
            analysis will comprise the results from all runs.
            [Figure 3: boxplots of F-measure (approximately 0.70 to 0.95) for the Naive
            Bayes, Support Vector Machine, C4.5 and AdaBoost classifiers.]

           Fig. 3. Boxplot summarising the F-measure performance of the investigated classifiers
           over all considered settings.


               The results reported in Table 2 suggest that the tested approaches are highly
           effective in discriminating between those death certificates that contain a cancer
           notifiable cause of death and those that do not.








    Classifier         Feature           Weight Prec Rec      F-m   TP FN FP TN
     SPegasos concFullMorph + stemBigram frequency .9794 .9504 .9647 2376 124 50 2450
     SPegasos concFullMorph + stemBigram      logF   .9786 .9500 .9641 2375 125 52 2448
     SPegasos      concept + stemBigram       logF   .9770 .9508 .9637 2377 123 56 2444
     SPegasos concFullMorph + stemBigram binary .9770 .9504 .9635 2376 124 56 2444
     SPegasos      concept + stemBigram      binary .9766 .9504 .9633 2376 124 57 2443
     SPegasos      concept + stemBigram    frequency .9766 .9504 .9633 2376 124 57 2443
     SPegasos           stemBigram           binary .9761 .9488 .9623 2372 128 58 2442
     SPegasos concFullMorph + stemBigram expF .9773 .9476 .9622 2369 131 55 2445
     SPegasos           stemBigram            logF   .9753 .9476 .9612 2369 131 60 2440
     SPegasos           stemBigram            expF .9785 .9444 .9611 2361 139 52 2448
     SPegasos           stemBigram         frequency .9764 .9452 .9606 2363 137 57 2443
     SPegasos      concept + stemBigram       expF .9741 .9460 .9598 2365 135 63 2437
       C4.5        concept + stemBigram       logF   .9800 .9392 .9592 2348 152 48 2452
       C4.5        concept + stemBigram       expF .9800 .9392 .9592 2348 152 48 2452
       C4.5        concept + stemBigram    frequency .9800 .9392 .9592 2348 152 48 2452
       C4.5        concept + stemBigram      binary .9799 .9384 .9587 2346 154 48 2452
       C4.5     concFullMorph + stemBigram    logF   .9856 .9324 .9583 2331 169 34 2466
       C4.5     concFullMorph + stemBigram expF .9856 .9324 .9583 2331 169 34 2466
       C4.5     concFullMorph + stemBigram frequency .9856 .9324 .9583 2331 169 34 2466
       C4.5             stemBigram            logF   .9848 .9320 .9577 2330 170 36 2464
       C4.5             stemBigram            expF .9848 .9320 .9577 2330 170 36 2464
       C4.5             stemBigram         frequency .9848 .9320 .9577 2330 170 36 2464
       C4.5     concFullMorph + stemBigram binary .9848 .9320 .9577 2330 170 36 2464
       C4.5             stemBigram           binary .9848 .9308 .9570 2327 173 36 2464
     AdaBoost      concept + stemBigram      binary    1   .8816 .9371 2204 296 0 2500
     AdaBoost      concept + stemBigram       logF     1   .8816 .9371 2204 296 0 2500
     AdaBoost      concept + stemBigram       expF     1   .8816 .9371 2204 296 0 2500
     AdaBoost      concept + stemBigram    frequency 1     .8816 .9371 2204 296 0 2500
     AdaBoost concFullMorph + stemBigram binary        1   .8816 .9371 2204 296 0 2500
     AdaBoost concFullMorph + stemBigram      logF     1   .8816 .9371 2204 296 0 2500
     AdaBoost concFullMorph + stemBigram expF          1   .8816 .9371 2204 296 0 2500
     AdaBoost concFullMorph + stemBigram frequency 1       .8816 .9371 2204 296 0 2500
     AdaBoost           stemBigram           binary    1   .8784 .9353 2196 304 0 2500
     AdaBoost           stemBigram            logF     1   .8784 .9353 2196 304 0 2500
     AdaBoost           stemBigram            expF     1   .8784 .9353 2196 304 0 2500
     AdaBoost           stemBigram         frequency 1     .8784 .9353 2196 304 0 2500
     SPegasos              stem               logF   .9588 .9120 .9348 2280 220 98 2402
     SPegasos              stem            frequency .9611 .9096 .9346 2274 226 92 2408
    Naive Bayes         stemBigram           binary .9658 .9036 .9337 2259 241 80 2420
    Naive Bayes    concept + stemBigram      binary .9606 .9076 .9334 2269 231 93 2407


         Table 2. Top 40 results, ordered by decreasing F-measure (F-m).




    Overall, the best classifier is the support vector machine implementation pro-
vided by SPegasos when used on concFullMorph + stemBigram features, i.e. stemmed
token bigrams together with the fully specified names of concepts associated with
morphological abnormalities and disorders as encoded in SNOMED CT, weighted
using raw frequencies. SPegasos is also found to be very effective when other
combinations of weighting schemas








           and features are considered. In addition, this support vector machine classifier
           shows the smallest variance across all considered settings (Figure 3).
                Among the best performing classifiers, AdaBoost used in conjunction with
            stemmed bigram features achieved perfect precision (Prec = 1), at the expense
            of recall. Although these results are remarkable, high precision may be considered
            less important than high recall in such a task. In fact, in a Cancer Registry setting,
            it is preferable to have high recall and to review death certificates that are
            incorrectly reported as containing a cancer notifiable cause of death, than to have
            missed cancer cases. This becomes particularly important if the missed cancer
           cases refer to rare cancers. AdaBoost also exhibits the highest variance across
           experiment settings among the considered classifiers (see Figure 3).

           4.1   The Impact of Classifiers, Weighting Schemas, and Features
           To better understand the role of specific features, weighting schema, and classi-
           fiers on the effectiveness of the tested approaches, an analysis of the empirical
           results where each of the three key characteristics were treated as the controlled
           variable is performed.
               We start by examining the impact of each classification model on the overall
           effectiveness of the approaches. Table 3 reports maximum (Max(F-m)), mini-
           mum (Min(F-m)), difference (∆), and variance of F-measure over all runs of
           each classifier model. SPegasos is found to be the classifier achieving the high-
           est maximum and minimum F-measure values, thus extending the observations
            made on this classifier when examining the results of Table 2. In contrast, while the
            Naive Bayes classifier was not found to be amongst the most effective classifica-
            tion models in our experiments, its robustness is second only to that of SPegasos,
            with performance ranging between 0.7428 and 0.9337 in F-measure. While models
            such as C4.5 and AdaBoost achieve higher values of F-measure than Naive Bayes,
            their minimum performances are lower than that recorded for Naive Bayes.


                          Classifier Max(F-m) Min(F-m)            ∆     Variance
                           SPegasos      0.9647      0.7767    0.1880 5.10 · 10−3
                         Naive Bayes     0.9337       0.7428    0.1909 5.10 · 10−3
                             C4.5        0.9592       0.7355    0.2237 7.35 · 10−3
                         AdaBoostM1      0.9371       0.6954    0.2417 7.88 · 10−3

           Table 3. Classification effectiveness across the four classifiers ordered by increasing
           max-min F-measure range (∆).



               We continue by analysing the influence of weighting schemas on the classifi-
           cation results of the approaches investigated in this work. Simple raw frequency
           weighting, i.e. frequency, is found to be the most effective weighting schema. How-
           ever, no weighting schema appears to be significantly better than another: while








                Weight Max(F-m) Min(F-m)              ∆      Variance
                 binary       0.9635      0.6954    0.2681 6.81 · 10−3
                frequency     0.9647      0.6954    0.2693 6.74 · 10−3
                  logF        0.9641      0.6954    0.2687 6.80 · 10−3
                  expF        0.9622      0.6954   0.2668 6.53 · 10−3

Table 4. Classification effectiveness across the four weighting schema ordered by in-
creasing max-min F-measure range (∆).



frequency achieves the best performance with an F-measure of 0.9647, the highest
F-measure of the worst performing schema is 0.9622 (expF), just 0.0025 lower in
absolute F-measure than frequency. Furthermore, all weighting schemas exhibit the same effectiveness
when considering the worst performing settings. Thus the range of performance
differences and their variance do not significantly differ across weighting schema.
This may be due to the fact that death certificates are in general short docu-
ments, where features occur uniformly.


                Feature                Max(F-m) Min(F-m)        ∆     Variance
               stemBigram               0.9623     0.9275    0.0348 2.02 · 10−4
          concept + stemBigram           0.9637      0.9267    0.0370 2.16 · 10−4
      concFullMorph + stemBigram        0.9647      0.9255    0.0392 2.33 · 10−4
       concBigram + stemBigram          0.8443      0.7677    0.0766 8.01 · 10−4
               concBigram               0.8443      0.7677    0.0766 8.01 · 10−4
             concFullBigram             0.7768      0.6954    0.0814 8.93 · 10−4
               conceptFull               0.809      0.7177    0.0913 1.17 · 10−3
             concept + stem              0.9302      0.838     0.0922 8.39 · 10−4
                 concept                0.8743      0.7792    0.0951 1.13 · 10−3
                  stem                  0.9348      0.8131    0.1217 1.36 · 10−3

Table 5. Classification effectiveness across the ten features ordered by increasing max-
min F-measure range (∆).


   Feature is the final variable of our analysis, and the one with the greatest im-
pact on classification results. The use of the concFullMorph + stemBigram feature
provides the highest F-measure (0.9647), while concFullBigram yields the lowest
maximal F-measure (0.7768): a significant relative difference of 19.48%. The smallest
variance was demonstrated by stemBigram (2.02 · 10−4), making it the most
robust feature in our experiments; in addition, this feature yielded a maximal F-
measure only 0.0024 lower (in absolute terms) than the best value recorded in our
experiments. The minimal F-measure yielded by the stemBigram feature was also
greater than the greatest F-measure values obtained when using half of the features investi-








            gated in our study. These results provide a strong indication that, of the variables
            analysed, the choice of feature makes the greatest contribution to classification
            effectiveness.
           5    Conclusions
           Timely processing of cancer notifications is critical for timely reporting of cancer
           incidence and mortality. Death certificates are a rich source of data on cancer
           mortality. Cancer registries acquire free-text death certificates on a regular (e.g.
           fortnightly) basis. However, the cause of death information needs to be classified
           to facilitate reporting of cancer mortality. Cause of death information classified
           using ICD-10 codes is only available on an annual basis. In this paper we inves-
            tigated the automatic classification of death certificates to individuate cancer
            notifiable causes of death. The investigated approaches achieved overall strong
            classification effectiveness, with a support vector machine classifier trained with
            token bigram features and information from the SNOMED CT medical ontology,
            weighted by their frequency in the documents, yielding an F-measure
            of 0.9647. The choice of features, rather than that of classifiers or weighting
           schema, was found to be the determining factor for high effectiveness.
                Future efforts will be directed towards an in-depth error analysis, in particular
           examining the distance between the prediction produced by a classifier and the
           decision threshold. We also plan to extend the investigation to predict the actual
            ICD-10 codes associated with causes of death related to cancer, so as to further assist
           clinical coders in processing cancer notifications.

           References
           1. Nguyen, A., Moore, J., Lawley, M., Hansen, D., Colquist, S.: Automatic extraction
              of cancer characteristics from free-text pathology reports for cancer notifications.
              In: Health Informatics Conference. (2011) 117–124
           2. Zuccon, G., Nguyen, A., Bergheim, A., Wickman, S., Grayson, N.: The impact of
              OCR accuracy on automated cancer classification of pathology reports. Studies in
              health technology and informatics 178 (2012) 250
           3. D’Avolio, L., Nguyen, T., Farwell, W., Chen, Y., Fitzmeyer, F., Harris, O., Fiore,
              L.: Evaluation of a generalizable approach to clinical information retrieval using the
              automated retrieval console (ARC). Journal of the American Medical Informatics
              Association 17(4) (2010) 375–382
           4. Harris, K.: Selected data editing procedures in an automated multiple cause of death
              coding system. In: Proceedings of the Conference of European Statistics. (1999)
           5. Davis, K., Staes, C., Duncan, J., Igo, S., Facelli, J.: Identification of pneumonia and
              influenza deaths using the death certificate pipeline. BMC Medical Informatics and
              Decision Making 12(1) (2012) 37
           6. Nguyen, A.N., Lawley, M.J., Hansen, D.P., Bowman, R.V., Clarke, B.E., Duhig,
              E.E., Colquist, S.: Symbolic rule-based classification of lung cancer stages from free-
              text pathology reports. Journal of the American Medical Informatics Association
              17(4) (2010) 440–445
            7. Witten, I., Frank, E., Hall, M.: Data Mining: Practical Machine Learning Tools and
               Techniques. Morgan Kaufmann (2011)








     Clinician-Driven Automated Classification of Limb
        Fractures from Free-Text Radiology Reports

               Amol Wagholikar1, Guido Zuccon1, Anthony Nguyen1,
               Kevin Chu2, Shane Martin2, Kim Lai2, Jaimi Greenslade2
                1
                The Australian e-Health Research Centre, Brisbane, CSIRO
       {amol.wagholikar,guido.zuccon,anthony.nguyen}@csiro.au
        2
          Department of Emergency Medicine, RBWH, Brisbane, Queensland Health
               {kevin_chu,shane_martin}@health.qld.gov.au
             {kim_lai,jaimi_greenslade}@health.qld.gov.au




       Abstract. The aim of this research is to report initial experimental results and
       evaluation of a clinician-driven automated method that can address the issue of
       misdiagnosis from unstructured radiology reports. Timely diagnosis and report-
       ing of patient symptoms in hospital emergency departments (ED) is a critical
       component of health services delivery. However, due to dispersed information
       resources and vast amounts of manual processing of unstructured information, an
       accurate point-of-care diagnosis is often difficult. A rule-based method that
       considers the occurrence of clinician specified keywords related to radiological
       findings was developed to identify limb abnormalities, such as fractures. A da-
       taset containing 99 narrative reports of radiological findings was sourced from a
       tertiary hospital. The rule-based method achieved an F-measure of 0.80 and an
       accuracy of 0.80. While our method achieves promising performance, a number
       of avenues for improvement using advanced natural language processing (NLP)
       techniques were identified.

       Keywords: limb fractures, emergency department, radiology reports, classifica-
       tion, rule-based method, machine learning.


1      Introduction

The analysis of x-rays is an essential step in the diagnostic work-up of many condi-
tions including fractures in injured Emergency Department (ED) patients. X-rays are
initially interpreted by the treating ED doctor, and if necessary patients are appropri-
ately treated. X-rays are eventually reported on by the specialist in radiology and
these findings are relayed to the treating doctor in a formal written report. The ED,
however, may not receive the report until after the patient was discharged home. This
is not an uncommon event because the reporting did not occur in real-time. As a re-
sult, there are potential delays in the diagnosis of subtle fractures missed by the treat-
ing doctor until the receipt of the radiologist’s report. The review of x-ray reports is a
necessary practice to ensure fractures and other conditions identified by the radiolo-








           gist were not missed by the treating doctor. The review requires the reading of the
            free-text report. Large “batches” of x-rays are often reviewed days after the patient’s
            ED presentation. This is a labour-intensive process which adds to the diagnostic delay.
           The process may be streamlined if it can be automated with clinical text processing
           solutions. These solutions will minimise delays in diagnosis and prevent complica-
           tions arising from diagnostic errors [1-2]. This research aims to address these issues
           through the application of a gazetteer rule-based approach where keywords that may
           suggest the presence or absence of an abnormality were provided by expert ED clini-
           cians. Rule-based methods are commonly used in Artificial Intelligence [3-5]. Studies
           have shown that rule-based methods can be applied for identifying clinical conditions
           from radiology reports such as acute cholecystitis, acute pulmonary embolism and
           other conditions [6]. The purpose of these methods is to simulate human reasoning for
           any given information processing task to achieve full or partial automation.



           2      Related Work

              Previous studies that focused on the problem of identification of subtle limb frac-
           tures during the diagnosis of ED patients showed that about 2.1% of all fractures were
           not identified during initial presentation to the Emergency Department [7]. A similar
            study of radiological evidence for fractures reports that 1.5% of all x-rays had ab-
           normalities that were not identified in the Emergency Department records [8]. Further
           research also reported that 5% and 2% of the x-rays of the hand/fingers and ankle/foot
           from a pediatric Emergency Department had fractures missed by the treating ED doc-
           tor [9]. These small percentages of incidences may have significant impact on the
           overall patient healthcare as these missed fractures may develop into more complex
           conditions. Timely recognition of fractures is therefore important. There have been
           efforts to automatically detect fractures and other abnormalities from free-text radiol-
           ogy reports using support vector machine (SVM) and machine learning tech-
            niques [10-11]. Even though the results of machine learning based classifiers show
           high effectiveness, their applicability in clinical settings may be limited. Machine
           learning methods are data–driven, and as a result, if the training sample is not a repre-
           sentative selection of the problem domain, then the resulting model will not general-
           ise. In addition, machine learning approaches are required to be retrained on new
           corpora and tasks and collating training data to build new classifier models can be a
           timely and labour intensive process. These issues provide the motivation for the in-
           vestigation of rule-based methods which have the ability to model expert knowledge
           as easily implementable rules.


           3      Methods
A set of 99 de-identified free-text descriptions of patients' limb x-rays reported by radiologists was extracted from a tertiary hospital's picture archiving and communication system (PACS).








Ethics approval to use these data was granted by the Human Research Ethics Committee at Queensland Health. The average length of the free-text reports is about 52 words, with a total of 930 unique words in the vocabulary. Some reports are semi-structured, with section headings such as "History", "Clinical Details" and "Findings" appearing in the text.


3.1    Ground Truth Development
One ED visiting medical officer and one ED Registrar were engaged as assessors to manually classify the patient findings. Findings were assigned to one of two classes: (1) "Normal", meaning no fractures or dislocations were identified, and (2) "Abnormal", meaning a reportable abnormality such as a fracture, dislocation or displacement was identified that requires further follow-up. To gather ground truth labels for the data, an in-house annotation tool was developed. This tool allowed the assessors to manually annotate and classify the free-text reports into one of the two target categories. The two assessors initially agreed on the annotations of 77 of the 99 reports and disagreed on the remaining 22. The disagreements were resolved and the labels validated by a senior Staff Specialist in Emergency Medicine, who acted as a third assessor.


3.2    Rule-based classifier

   A rule-based classifier was developed in which the rules are a set of keywords drawn from the x-ray report assessment criteria documented by the clinicians prior to the ground truth annotation task. The classifier assigns each report to the "Normal" or "Abnormal" category using the keywords shown in Table 1; a minimal sketch of this matching logic is given after the table.
Table 1. Keywords used for building the rule-base.

             Keywords                        Suggested Classification
             no + fracture                           Normal
             old + fracture                         Abnormal
             Fracture                               Abnormal
             x ray + follow up                      Abnormal
             Dislocation                            Abnormal
             FB                                     Abnormal
             Osteomyelitis                          Abnormal
             Osteoly                                Abnormal
             Displacement                           Abnormal
             intraarticular extension               Abnormal
             foreign body                           Abnormal
             articular effusion                     Abnormal
             Avulsion                               Abnormal
             septic arthritis                       Abnormal
             Subluxation                            Abnormal
             Osteotomy                              Abnormal
             Callus                                 Abnormal
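
The paper does not spell out the exact matching or precedence logic, so the following is only a minimal sketch of a gazetteer classifier over the Table 1 keywords. It assumes case-insensitive word matching, that a "+" in Table 1 means all listed words must occur somewhere in the report, that the "no + fracture" negation rule takes precedence, and that a report matching no rule defaults to "Normal" (as described in Section 4).

import re

# Keywords from Table 1 that suggest an "Abnormal" classification.
ABNORMAL_TERMS = [
    ("old", "fracture"), ("fracture",), ("x", "ray", "follow", "up"),
    ("dislocation",), ("fb",), ("osteomyelitis",), ("osteoly",),
    ("displacement",), ("intraarticular", "extension"), ("foreign", "body"),
    ("articular", "effusion"), ("avulsion",), ("septic", "arthritis"),
    ("subluxation",), ("osteotomy",), ("callus",),
]

def classify_report(text: str) -> str:
    """Return 'Normal' or 'Abnormal' for a free-text x-ray report."""
    words = set(re.findall(r"[a-z]+", text.lower()))

    def matches(terms):
        return all(w in words for w in terms)

    if matches(("no", "fracture")):                  # negation rule fires first
        return "Normal"
    if any(matches(terms) for terms in ABNORMAL_TERMS):
        return "Abnormal"
    return "Normal"                                  # default when no rule fires

print(classify_report("Findings: no fracture or dislocation seen."))            # Normal
print(classify_report("Avulsion fracture at the base of the fifth metatarsal."))  # Abnormal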








           4      Results and Discussion

Results obtained by our gazetteer rule-based approach on the dataset of 99 radiology reports are presented in Table 2, along with the performance of a Naïve Bayes classifier that was applied to the same dataset [12]. The Naïve Bayes classifier was trained and evaluated using a 10-fold cross validation approach, in which 90% of the reports were used for training and the remaining 10% for evaluation within each fold; the average of the evaluation results across the 10 folds is reported as the classifier's performance. A set of stemmed tokens, combined with higher-order semantic features such as SNOMED CT concepts related to morphological abnormalities and disorders generated by the Medtex system [13], was used to represent the reports. Classification results were evaluated in terms of F-measure and accuracy (see Table 2). The numbers of true positive (TP), true negative (TN), false positive (FP) and false negative (FN) instances are also reported; a quick check of the headline figures from these counts is sketched after Table 2.

           Table 2. Classification results obtained by rule-based and NB classification

               Method               F-measure       Accuracy     TP        TN        FP   FN
               Rule-based           0.80            0.80         39        40        11   9
               Naive Bayes          0.92            0.92         44        47        4    4
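
As a sanity check, the F-measure and accuracy in Table 2 can be reproduced from the reported confusion counts. The paper does not state which class is treated as positive; the snippet below simply assumes the standard definitions computed from TP, FP and FN.

def metrics(tp, tn, fp, fn):
    """Recompute F-measure and accuracy from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return round(f_measure, 2), round(accuracy, 2)

print(metrics(39, 40, 11, 9))   # rule-based:  (0.8, 0.8)
print(metrics(44, 47, 4, 4))    # Naive Bayes: (0.92, 0.92)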

The rule-based system classified 49 reports as "Normal". Thirty-three of these were classified as "Normal" because the "no + fracture" rule fired, while the remaining 16 reports matched no rule and were therefore assigned the default "Normal" class. The high false negative count of the rule-based system suggests that the keywords used by the clinicians to characterise "Abnormal" cases were not sufficient to capture all possible abnormalities. Although the proposed keyword rule-based approach is simplistic, it shows promise, and advanced Natural Language Processing techniques, such as those adopted in Medtex [14], could be used to further improve classification performance. Additional keywords could also be learnt using computational linguistic methods, such as the Basilisk bootstrapping algorithm [15].
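
Basilisk itself scores candidate words through extraction patterns, which is beyond the scope of this sketch; the hypothetical snippet below only illustrates the general bootstrapping idea of growing the keyword lexicon from seed terms by co-occurrence, with any promoted candidates still requiring clinician review before use.

from collections import Counter

STOPWORDS = {"the", "of", "a", "an", "and", "or", "no", "with", "is", "seen", "at"}

def bootstrap_keywords(reports, seeds, rounds=2, per_round=2):
    """Grow a keyword lexicon from seed terms by co-occurrence in matching reports."""
    lexicon = set(seeds)
    for _ in range(rounds):
        scores = Counter()
        for text in reports:
            words = set(text.lower().split()) - STOPWORDS
            if lexicon & words:                      # report matched by the current lexicon
                for candidate in words - lexicon:
                    scores[candidate] += 1           # candidate co-occurs with a keyword
        lexicon.update(word for word, _ in scores.most_common(per_round))
    return lexicon

reports = [
    "comminuted fracture with dorsal displacement",
    "fracture dislocation of the ankle with displacement",
    "no fracture or dislocation seen",
]
print(bootstrap_keywords(reports, seeds={"fracture"}))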


           5      Conclusion and Future Research

This work has described an initial investigation of a clinician-driven rule-based method for the automatic classification of free-text limb fracture x-ray findings. We described a simple keyword spotting approach in which the keywords were derived from classification criteria provided by clinicians. The rule-based classification method achieved promising results, with an F-measure of 0.80 and an accuracy of 0.80. Future work will aim to complement the proposed rule-based classification method with more advanced clinical text processing techniques. The possible integration of our method into the real-life workflow of hospital emergency departments will also be considered.








Acknowledgements. The authors are grateful to Bevan Koopman for feedback on an earlier draft of this paper. This research was supported by Queensland Emergency Medicine Research Foundation grant EMPJ-11-158-Chu-Radiology.


References
 1. James M. R., Bracegirdle A. and Yates D. W. X-ray reporting in accident and emergency
    departments – an area for improvements in efficiency. Arch Emerg Med, 8:266–270, 1991.
 2. Siegel E., Groleau G., Reiner B. and Stair T. Computerized follow-up of discrepancies in
    image interpretation between emergency and radiology departments. J Digit Imaging,
    11:18–20, 1998.
 3. Long W. J., et al. Reasoning requirements for diagnosis of heart disease. Artificial
    Intelligence in Medicine, 10(1):5–24, 1997.
 4. Harleen K. and Siri Krishan W. Empirical study on applications of data mining techniques
    in healthcare. Journal of Computer Science, 2(2):194–200, 2006.
 5. Subhash Chandra N., Uppalaiah B., Charles Babu G., Naresh Kumar K. and Raja Shekar P.
    General approach to classification: various methods can be used to classify x-ray images.
    IJCSET, 2(3):933–937, March 2012.
 6. Lakhani P., Kim W. and Langlotz C. P. Automated detection of critical results in radiology
    reports. J Digit Imaging, 25(1):30–36, 2012.
 7. Cameron M. G. Missed fractures in the emergency department. Emerg Med (Fremantle),
    6:3, 1994.
 8. Sprivulis P. and Frazer A. Same-day x-ray reporting is not needed in well supervised
    emergency departments. Emerg Med (Fremantle), 13:194–197, 2001.
 9. Mounts J., Clingenpeel J., Byers E., McGuire E. and Kireeva Y. Most frequently missed
    fractures in the emergency department. Clin Pediatr (Phila), 50:183–186, 2011.
10. De Bruijn B., Cranney A., O'Donnell S., Martin J. D. and Forster A. J. Identifying wrist
    fracture patients with high accuracy by automatic categorization of x-ray reports. Journal
    of the American Medical Informatics Association (JAMIA), 13(6):696–698, 2006.
11. Thomas B. J., Ouellette H., Halpern E. F. and Rosenthal D. I. Automated computer-assisted
    categorization of radiology reports. American Journal of Roentgenology, 184(2):687–690,
    2005.
12. Zuccon G., Wagholikar A., Nguyen A., Chu K., Martin S. and Greenslade J. Identifying
    limb fractures from free-text radiology reports using machine learning. Technical Report,
    CSIRO, 2012.
13. Nguyen A. N., Lawley M. J., Hansen D. P., et al. A simple pipeline application for
    identifying and negating SNOMED clinical terminology in free text. Proceedings of the
    Health Informatics Conference, Canberra, Australia, August 2009, 188–193.
14. Nguyen A., Lawley M., Hansen D., Bowman R., Clarke B., Duhig E. and Colquist S.
    Symbolic rule-based classification of lung cancer stages from free-text pathology reports.
    Journal of the American Medical Informatics Association (JAMIA), 17(4):440–445,
    July/August 2010.
15. Thelen M. and Riloff E. A bootstrapping method for learning semantic lexicons using
    extraction pattern contexts. Proceedings of the ACL-02 Conference on Empirical Methods
    in Natural Language Processing, 214–221, July 2002.












    Using Prediction to Improve Elective Surgery Scheduling

Zahra Shahabi Kargar1,2, Sankalp Khanna1,2, Abdul Sattar1

1 Institute for Integrated and Intelligent Systems, Griffith University, Australia
{Zahra.Shahabikargar, A.Sattar}@griffith.edu.au
2 The Australian e-Health Research Centre, RBWH, Herston, Australia
{Sankalp.Khanna}@csiro.au


       Abstract. Stochastic activity durations, uncertainty in the arrival process of pa-
       tients, and coordination of multiple activities are some key features of surgery
       planning and scheduling. In this paper we provide an overview of challenges
       around elective surgery scheduling and propose a predictive model for elective
       surgery scheduling to be evaluated in a major tertiary hospital in Queensland.
       The proposed model employs waiting lists, peri-operative information, work-
       load predictions, and improved procedure time estimation models, to optimise
       surgery scheduling. It is expected that the resulting improvement in scheduling
       processes will lead to more efficient use of surgical suites, higher productivity,
       and lower labour costs, and ultimately improve patient outcomes.


       Keywords: Surgery scheduling, Predictive optimisation, Waiting list


1      Introduction

An ageing population and higher rates of chronic disease are increasing the demand
on health services. The Australian Institute of Health and Welfare reports a
3.6% per year increase in total elective surgery admissions over the past four
years [1]. These factors stress the need for efficiency and necessitate the
development of adequate planning and scheduling systems in hospitals.
Since operating rooms (ORs) are typically a hospital's largest cost and revenue
centre and have a major impact on its performance, OR scheduling has been
studied by many researchers.
The surgery scheduling problem deals with the allocation of ORs under uncertain
demand in a complex and dynamic hospital environment in order to optimise the
use of resources. Different techniques, such as mathematical programming [2-4],
simulation [5, 6], meta-heuristics [5, 7] and Distributed Constraint
Optimization [8], have been proposed to address this problem. However, most
current efforts either make simplifying assumptions (e.g. considering only one
department or type of surgery [4]) or employ theoretical data [3, 5], which
makes them difficult to use in hospitals.








In this paper, we propose a prediction-based methodology for surgery scheduling that addresses the above limitations. Using predicted workload information and a retrospective analysis of waiting lists and theatre utilization, we predict a theatre template representing an optimal case mix. The proposed model also employs accurate estimation of procedure times and predicted workload information to drive optimal elective surgery scheduling, and to help hospitals fulfil the National Elective Surgery Targets (NEST) [1].

2     Elective Surgery Scheduling at the Evaluation Hospital

Long waiting lists for elective surgery in Australian hospitals in recent years have driven a nationwide research agenda to improve the planning, management and delivery of health care services. This work is to be evaluated at a major tertiary hospital that has a total of 15 operating theatres performing 124 elective operating sessions and 23 emergency sessions per week. Currently, the allocation of available elective operating sessions at the hospital is broken down across different specialties and teams of surgeons based on static case mix planning. This static allocation of available sessions between emergency and elective patients, and among different departments, results in underutilization or cancellations due to demand fluctuations. Moreover, the allocation of patients to theatres is carried out without considering uncertainty and the changes that might occur. Procedure times are estimated using generic data or recommended by the relevant surgeons, rather than being based on individual patient and surgery characteristics. Patients are booked into schedules in a joint process between surgeons and the booking department. Due to the dynamic environment and rapid changes, these schedules need to be updated quickly, and department managers usually hold regular meetings to make any changes needed. Department managers try to optimise their own departments' goals locally, but since there is no global objective, these solutions are usually not globally optimal.

3     An Optimal Surgery Scheduling Model

Although the surgery scheduling problem has been well addressed in the literature, it remains an open problem in Operations Research and Artificial Intelligence. Despite the dynamic nature of the hospital environment, the majority of previous studies ignore the underlying uncertainty. This leads to simplistic models that are not applicable in real-world situations.








3.1   Current State of the Art
Cardoen et al. present a comprehensive literature review on operating room scheduling covering features such as performance measures, patient classes, solution techniques and uncertainty [9]. One of the major issues in developing accurate operating room schedules or capacity planning strategies is the uncertainty inherent to surgical services. Uncertainty and variability in the frequency and distribution of patient arrivals, patient conditions and procedure durations, as well as "add-on" cases, are some instances of uncertainty in surgery scheduling [10]. Among these, stochastic arrivals and stochastic procedure durations are two types of uncertainty studied by many researchers. Procedure duration depends on several factors such as the experience of the surgeon, the supporting staff, the type of anaesthesia and the pre-condition of the patient. Devi et al. estimate surgery times using Adaptive Neuro-Fuzzy Inference Systems, Artificial Neural Networks and Multiple Linear Regression Analysis [2], but they focus on only one department and use a very limited sample to build and validate their model. Lamiri et al. developed a stochastic model for planning elective surgeries under uncertain demand for emergency surgery [3]. Lamiri et al. also address elective surgery planning under uncertainty in surgery times and emergency surgery demand by combining Monte Carlo simulation with a column generation approach [5]. Although their method addresses uncertainty, it is based on theoretical data and has not been tested on real data. What is needed is a whole-of-theatre approach that provides better prediction of surgery times, incorporates predicted workload in planning the weekly surgery template, and uses target-guided optimization to ensure optimal allocation of resources.

3.2   Proposed Method
To improve the planning and optimization tasks underlying the process, we propose a two-stage methodology for elective surgery scheduling. In the first stage, predicted workload information (drawn from the Patient Admission Prediction Tool [11] currently used at the evaluation hospital), current waiting list information and historic utilization information are used to manage theatre allocation and case mix distribution for each week (see Figure 1). This allows prediction-based sharing of theatres between elective and emergency surgery and allocation of theatre time to surgical teams/departments, and results in a theatre schedule template that works better than a static allocation model (as demonstrated by Khanna et al. [8]). A simplified sketch of this allocation step is given below.
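
As a purely illustrative, hypothetical sketch of the first stage (the actual model proposed here is richer; the department names, weights and demand figures below are made up, apart from the 124 elective and 23 emergency weekly sessions mentioned above), weekly elective sessions could be shared out in proportion to predicted demand and waiting-list pressure:

def build_weekly_template(total_sessions, predicted_emergency_sessions,
                          predicted_demand, waiting_list, backlog_weight=0.5):
    """Return a dict mapping department -> elective sessions allocated this week."""
    elective_sessions = total_sessions - predicted_emergency_sessions
    # Combine forecast demand with waiting-list pressure into a single score.
    scores = {d: predicted_demand[d] + backlog_weight * waiting_list[d]
              for d in predicted_demand}
    total = sum(scores.values())
    exact = {d: elective_sessions * s / total for d, s in scores.items()}
    alloc = {d: int(x) for d, x in exact.items()}
    # Hand any remaining sessions to the largest fractional remainders.
    leftover = elective_sessions - sum(alloc.values())
    for d in sorted(exact, key=lambda d: exact[d] - alloc[d], reverse=True)[:leftover]:
        alloc[d] += 1
    return alloc

template = build_weekly_template(
    total_sessions=147, predicted_emergency_sessions=23,
    predicted_demand={"Orthopaedics": 40, "General": 35, "ENT": 20},
    waiting_list={"Orthopaedics": 120, "General": 80, "ENT": 30},
)
print(template)   # e.g. {'Orthopaedics': 59, 'General': 44, 'ENT': 21}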








                      Figure 1. Proposed Methodology for Improving Surgery Scheduling

In the second stage of the process, the allocation of patients to the weekly theatre schedule is guided by an improved prediction algorithm that estimates surgery duration. The algorithm takes into account current patient, surgery and surgeon information, together with related historic peri-operative information, to forecast the planned procedure time. Incorporating NEST compliance into the optimization function, along with improved resource estimation, delivers further improvements to the scheduling process and helps produce a more robust and optimal schedule (Figure 1). We are currently working towards collecting over five years of surgery scheduling, waiting list and peri-operative information for the evaluation hospital from its corporate information systems. These data will be used to model and independently validate the prediction algorithms and to build historic resource utilization knowledge banks that guide the other stages of the scheduling process. A small sketch of such a procedure-time estimator follows.
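
As a hypothetical sketch only (the column names, toy data and choice of a plain linear model are illustrative; the paper specifies only that patient, surgery, surgeon and historic peri-operative data drive the estimate), a procedure-time estimator could be framed as a regression over historic cases:

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy historic peri-operative data (illustrative only).
history = pd.DataFrame({
    "procedure":     ["knee_arthroscopy", "hip_replacement", "knee_arthroscopy", "hernia_repair"],
    "surgeon":       ["A", "B", "A", "C"],
    "patient_age":   [45, 72, 38, 60],
    "asa_score":     [1, 3, 1, 2],
    "duration_mins": [55, 150, 50, 80],    # observed procedure times
})

features = ["procedure", "surgeon", "patient_age", "asa_score"]
model = Pipeline([
    ("encode", ColumnTransformer(
        [("cat", OneHotEncoder(handle_unknown="ignore"), ["procedure", "surgeon"])],
        remainder="passthrough")),
    ("regress", LinearRegression()),
])
model.fit(history[features], history["duration_mins"])

# Estimate the planned time for a new booking.
new_case = pd.DataFrame([{"procedure": "knee_arthroscopy", "surgeon": "A",
                          "patient_age": 50, "asa_score": 2}])
print(model.predict(new_case)[0])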

4     Conclusion

The proposed model has the potential to improve elective surgery scheduling by providing more accurate procedure time estimation and by predicting the arrival demand of elective and emergency patients.








References

1.    Department of Health, Expert Panel Review of Elective Surgery and Emergency
      Access Targets Under the National Partnership Agreement on
      Improving Public Hospital Services. 2011.
2.    Devi, S.P., K.S. Rao, and S.S. Sangeetha, Prediction of surgery times
      and scheduling of operation theaters in ophthalmology department. J
      Med Syst, 2012. 36(2): p. 415-30.
3.    Lamiri, M., Xiaolan Xie, and Shuguang Zhang, Column Generation
      Approach to Operating Theater Planning with Elective and
      Emergency Patients. IIE Transactions, 2008. 40(9): p. 838–852.
4.    Pérez Gladish, B., et al., Management of surgical waiting lists
      through a Possibilistic Linear Multiobjective Programming problem.
      Applied Mathematics and Computation, 2005. 167(1): p. 477-495.
5.    Lamiri, M., J. Dreo, and Xiaolan Xie. Operating Room Planning with
      Random Surgery Times. in IEEE International Conference On
      Automation Science and Engineering. 2007. Scottsdale, AZ, USA.
6.    Ballard, S.M. and M.E. Kuhl. The use of simulation to determine maximum
      capacity in the surgical suite operating room. in Proceedings of the
      2006 Winter Simulation Conference. 2006.
7.    Fei, H., Nadine Meskens, and Chengbin Chu. An Operating Theatre
      Planning and Scheduling Problem in the Case of a ‘Block Scheduling’
      Strategy. in International Conference on Service Systems and Service
      Management. 2006.
8.    Khanna, S., Abdul Sattar, Justin Boyle, David Hansen, and Bela
      Stantic. An Intelligent Approach to Surgery Scheduling. in
      Proceedings of the 13th International Conference on Principles and
      Practice of Multi-Agent Systems. 2012. Berlin.
9.    Cardoen, B., Erik Demeulemeester, and Jeroen Beliën, Operating
      Room Planning and Scheduling: A Literature Review. European
      Journal of Operational Research, 2010. 201(3): p. 921–932.
10.   May, J.H., William E. Spangler, David P. Strum, and Luis G. Vargas.,
      The Surgical Scheduling Problem: Current Research and Future
      Opportunities. Production and Operations Management, 2011. 20(3):
      p. 392–405.
11.   Boyle, J., M. Jessup, J. Crilly, D. Green, J. Lind, M. Wallis, P. Miller,
      and G. Fitzgerald, Predicting Emergency Department Admissions.
      Emergency Medicine Journal 2011. 29(5): p. 358–365.












If you fire together, you wire together

Prajni Sadananda1, Ramakoti Sadananda2,3

1 Department of Anatomy and Neuroscience, University of Melbourne, Australia
prajni.sadananda@unimelb.edu.au
2 Institute for Integrated and Intelligent Systems, Griffith University, Australia
3 NICTA, Sydney, Australia
rsadananda@griffith.edu.au
The intention of this paper is to stimulate discussion on Hebb's Law and its pedagogic implications.

At a basic cellular level, Hebb's Law states that if Cell A and Cell B persistently fire, the connection between them strengthens. Figure 1 illustrates the interactions. This is a cellular-level process, suggesting that brain processes that occur repeatedly tend to become grafted together [1].
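
A minimal computational sketch of this rule, under the usual rate-based reading (the weight between two units grows in proportion to the product of their activities; the learning rate below is illustrative), might look as follows:

import numpy as np

def hebbian_update(w, pre, post, eta=0.1):
    """Strengthen connection weights when pre- and post-synaptic units co-fire."""
    return w + eta * np.outer(post, pre)

w = np.zeros((1, 1))                     # initial A -> B connection strength
for _ in range(10):                      # repeated co-activation of Cell A and Cell B
    w = hebbian_update(w, pre=np.array([1.0]), post=np.array([1.0]))
print(w)                                 # the connection has strengthened to 1.0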
  	
  
       	
  




                                                                                                                                                      	
  
Fig. 1. Hebb's Law. Repeated stimulation results in a stronger signal.




This scientific theory explains the adaptation of neurons during the learning process. Importantly, this type of plasticity does not involve increasing the number of cells, but rather strengthening the existing cells' connectivity. Understanding such biological phenomena opens up new paradigms and laws that AI can utilise. The question is whether Hebb's Law would stand at a higher level of abstraction. There are suggestive, but not conclusive, indications. For example, a friendship is considered stronger with time, indicating a strengthening of the wiring between the friends.

Models based on "firing together to wire together" have been suggested in health and therapy [2]. For example, if a patient presents with a mental trauma that causes extreme anger, the therapist introduces a countering, positive stimulus that occurs whenever the anger occurs. Both the anger and the positive stimulus are repeated over and over again, thus following Hebb's Law and adding strength to the connection between the two stimuli, resulting in relief for the patient.

This also implies causal and temporal conjectures based on causality. The causality is in the firing sequence: if A fires first and then B fires, A is the cause. If B fires before A, a reverse interpretation is possible that may decrease the strength between them. There seems to be evidence to suggest that the "firing" and "wiring" may be a sequential process.

Causality has been a subject of intense philosophical interest since ancient times. Most causal models are rule-based systems. They demand descriptions of the world at two points in time – a before and an after. Two problems arise here: the practical computational compulsions make these rules crudely simplistic, and it is challenging to incorporate temporal effects within the framework of rule-based systems. Hebb's Law, while suggesting causality, does not provide any quantification. Thus, it is unlikely that an alternative formulation of causation would emerge from Hebb's Law alone. We may look for another, additional neural network perspective on causation here.

Nevertheless, causality as implied by Hebb's Law has been used in scientific research and therapeutics to a large extent. For example, doctors often complain that their patients are unable to make minor, incremental changes to their daily routines (such as exercise). Understanding Hebb's Law will open new insights into why this might be so. It is possible that the patient is not yet "wired" for this activity and requires more "firing" before these changes can be established. An avenue for AI research is to aid in the development of tools to help such people to "re-wire".

Indeed, such tools exist to some extent to treat spinal cord-injured patients who have lost motor control of their limbs. In a non-injured situation, the brain delivers pulses to the lower limbs in a rhythmic, patterned fashion to allow a walking action. Once a spinal injury occurs, the connectivity from the brain to the limbs is lost, leaving the patient immobile. Stimulators are often placed below the level of the injury, delivering patterned pulses in a manner similar to what the brain was previously doing. Over a period of time, a spinal pattern generator emerges, which allows some motion of the lower limbs [3]. This area of research is as yet in its infancy and calls for better, more intelligent systems to aid these patients.

Conclusions:

Artificial intelligence in health opens up great opportunities and exciting challenges. The logical calculus articulated by McCulloch and Pitts [4] forms the initial basis for both Symbolic and Connectionist AI. Since then, a number of paradigms have emerged across all aspects of AI relating to health and health care. The convergence of computing and communication provides us with boundless opportunities to exploit these paradigms and to discover new ones.
References:

1. Hebb, D. O.: Organization of Behavior: a Neuropsychological Theory. John Wiley, New York (1949).
2. Atkinson, B., Atkinson, L., Kutz, P., Lata, L., Lata, K.W., Szekely, J., Weiss, P.: Rewiring Neural States in Couples Therapy: Advances from Affective Neuroscience. In: Journal of Systemic Therapies 24, 3-13 (2005).
3. Edgerton, V.R., Roy, R.R.: A new age for rehabilitation. Eur J Phys Rehabil Med 48, 99-109 (2012).
4. McCulloch, W.S., Pitts, W.: A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics 5, 115-137 (1943).



