<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>International Journal of Advanced Computer Science and Applications</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.14569/IJACSA.2022.0130516</article-id>
      <title-group>
        <article-title>Leveraging Bio-Inspired Optimization Algorithms for Advanced Feature Selection in Chronic Disease Datasets</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Abeer Dyoub</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ivan Letteri</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Computer Science Department, University of Bari</institution>, <addr-line>Bari</addr-line>, <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Life, Health and Environmental Sciences, University of L'Aquila</institution>, <addr-line>L'Aquila</addr-line>, <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>2597</volume>
      <fpage>29</fpage>
      <lpage>30</lpage>
      <abstract>
        <p>In this study, we investigated the application of bio-inspired optimization algorithms for feature selection in chronic disease prediction. The primary goal was to enhance models' predictive accuracy, streamline data dimensionality, and make predictions more interpretable and actionable. The research encompassed a comparative analysis of three bio-inspired categories: evolutionary-based, swarm-intelligence-based, and ecology-based. For the feature selection method, we selected one algorithm from each category: Genetic Algorithms, Flower Pollination Optimization, and Particle Swarm Optimization, applying them across diverse chronic diseases including cancer, kidney, and cardiovascular diseases. The results demonstrate, in some cases, that the bio-inspired optimization algorithms effectively reduce the number of features required for accurate classification and, consequently, the convergence time. The findings underscore this work's potential impact on early intervention, precision medicine, and improved patient outcomes, providing new avenues for delivering healthcare services tailored to individual needs.</p>
      </abstract>
      <kwd-group>
        <kwd>Chronic Diseases Prediction</kwd>
        <kwd>Bio-Inspired Feature Selection</kwd>
        <kwd>Genetic Algorithms</kwd>
        <kwd>Flower Pollination Optimization</kwd>
        <kwd>Particle Swarm Optimization</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Chronic diseases pose a significant global health challenge, impacting morbidity and mortality rates. Early detection is crucial for prevention and personalised healthcare. Advanced analytics and AI offer the potential for revolutionising prediction in many fields such as finance [1] [2], cybersecurity [3], and, in particular, disease prediction.</p>
      <p>Supervised learning in various fields relies heavily on feature selection (FS) to reduce input dimensionality. Maintaining target class integrity amidst irrelevant characteristics is essential for accurate classification in the medical domain.</p>
      <p>Bio-inspired optimisation emulates behaviours found in various natural creatures such as fish, insects, bird swarms, terrestrial animals, reptiles, and humans, as well as other natural phenomena. These methods have been used for supervised feature selection (see [4]). The same source categorises bio-inspired optimisation algorithms into three groups based on their source of inspiration: swarm intelligence algorithms, evolutionary-based algorithms, and ecology-based algorithms. For robustness and diversity, we selected Genetic Algorithms (GA), Particle Swarm Optimisation (PSO), and Flower Pollination Optimisation (FPO), one from each category.</p>
      <p>We refine feature subsets from medical datasets encompassing cancer, kidney, and cardiovascular diseases to enhance model accuracy and simplify data dimensionality. The aim is to improve interpretability and practicality in chronic disease prediction.</p>
      <p>Investigating chronic diseases presents significant challenges in the healthcare domain. This study aims to improve predictive accuracy for chronic diseases by employing machine learning (ML) and feature selection (FS) techniques, which involve data collection, preprocessing, and performance assessment.</p>
      <p>The paper proceeds with an outline of the methodology in Section 2. Section 3 presents experimental findings, followed by a discussion in Section 4. Finally, Section 5 summarises key findings, limitations, and future directions.</p>
    </sec>
    <sec id="sec-2-0">
      <title>2. Methodology</title>
      <sec id="sec-1-1">
        <title>2.1. The Datasets</title>
      </sec>
      <sec id="sec-1-2">
        <title>2.4. Performance Evaluation Method</title>
      </sec>
      <sec id="sec-1-3">
        <title>2.2. Datasets Pre-processing</title>
        <p>Breast Cancer dataset: From the University of Wisconsin, this dataset involves cytological examinations to distinguish between benign and malignant tumours. It contains 569 samples and 31 features.</p>
        <p>Kidney Disease: Medical information on chronic kidney disease, collected over two months in India, is included in this dataset, available on Kaggle or UCI. It consists of 400 samples and 25 features.</p>
        <p>For each dataset detailed in section 2.1, every machine learning model is trained using 70% of the data and tested using the remaining 30%, employing all features as well as the features filtered by the PSO, FPO, and GA algorithms. This process is iterated 100 times, with each iteration involving shuffling the dataset. For each iteration, the dataset is split into training and testing sets to evaluate measures such as Accuracy, Recall, Precision, and F1-score.</p>
        <p>Heart failure dataset: Comprising medical records of heart failure patients during follow-up, this dataset contains 299 samples and 13 features.</p>
        <p>Each dataset has the “diagnosis” column with binary values used as targets for supervised learning of classifiers, where 0 denotes a negative outcome and 1 a positive one.</p>
        <p>Missing Values Imputation. Unaddressed missing data poses risks of performance degradation and biased results. We used the K-Nearest Neighbors (KNN) algorithm, known for its adaptability to diverse data types, to impute the missing values in the datasets.</p>
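        <p>KNN-based imputation of this kind can be sketched as below. This is a minimal illustration, assuming scikit-learn's KNNImputer; the toy matrix and the choice of two neighbours are our own, not values from the paper.</p>
        <preformat>
```python
import numpy as np
from sklearn.impute import KNNImputer

# toy matrix with one missing entry (np.nan)
X = np.array([[1.0, 2.0],
              [3.0, np.nan],
              [5.0, 6.0],
              [3.0, 4.0]])

# each missing value is replaced by the mean of that feature
# over the k nearest complete rows
X_filled = KNNImputer(n_neighbors=2).fit_transform(X)
print(X_filled)
```
        </preformat>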
        <p>Data Balancing. Balancing the datasets is a critical concern: classifiers struggle when faced with disparate class distributions, which leads to biased models. To mitigate this issue, we used the Synthetic Minority Over-sampling Technique combined with Edited Nearest Neighbors (SMOTEENN) [6], which addresses imbalanced datasets by oversampling the minority class with SMOTE and cleaning the majority class with the Edited Nearest Neighbors (ENN) method.</p>
        <p>Min-max Normalization. We applied this scaling method to normalize the datasets to a predefined range, as follows: x′ = (x − x_min) / (x_max − x_min), where x′ represents the normalized value of the feature, x is the original value of the feature, and x_min and x_max denote the minimum and maximum values of the feature, respectively.</p>
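        <p>The min-max formula above can be applied per feature (column) as in this minimal sketch; the toy matrix is our own illustration.</p>
        <preformat>
```python
import numpy as np

def min_max_normalize(x):
    """Apply x' = (x - x_min) / (x_max - x_min) to each column."""
    x = np.asarray(x, dtype=float)
    x_min = x.min(axis=0)
    x_max = x.max(axis=0)
    return (x - x_min) / (x_max - x_min)

X = np.array([[2.0, 100.0],
              [4.0, 300.0],
              [6.0, 500.0]])
print(min_max_normalize(X))  # every column now spans [0, 1]
```
        </preformat>
        <p>In practice scikit-learn's MinMaxScaler implements the same transformation and also supports scaling to ranges other than [0, 1].</p>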
      </sec>
      <sec id="sec-1-4">
        <title>2.3. Bio-inspired Feature Selection</title>
        <sec id="sec-1-4-1">
          <p>Following the data preparation stage, we applied the three aforementioned bio-inspired feature selection algorithms to each of the three datasets (see section 2.1). All algorithms employ the same fitness function, with the weighting value set to 0.99 to prioritize classification accuracy.</p>
          <p>For assessing fitness, we utilized the K-Nearest Neighbors (KNN) classifier, known for its efficiency and adopted by [7], as it does not necessitate a lengthy training phase. A neighbour count of k = 10 was used. The feature selection algorithms were configured with 20 agents (individuals) and 100 generations.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3-0">
      <title>3. Experiments and Results</title>
      <sec id="sec-1-5">
        <title>3.1. Breast Cancer Dataset</title>
        <sec id="sec-1-5-1">
          <title>The final features selected by the various algorithms are:</title>
          <p>We note that GA achieved a better fitness than FPO, even though both achieved the same percentage reduction in dimensionality on the breast cancer dataset. The two algorithms selected different sets of features. Regarding training time, we can observe that training times globally decreased, by up to a maximum of 54.5% for GA with LR on this dataset. With KNN, the time increased in all cases. Further metrics can be seen in figures 4, 5, 6.</p>
        </sec>
      </sec>
      <sec id="sec-1-6">
        <title>3.2. Heart Failure Dataset</title>
        <p>The final features selected by the three algorithms are:
• FPO: [’anaemia’, ’diabetes’, ’smoking’]
• GA: [’platelets’, ’serum sodium’, ’time’]
• PSO: [’platelets’, ’serum creatinine’]
The genetic algorithm achieved significantly higher fitness than FPO and PSO, even though the dimensionality reduction is almost the same. RF, DT, SVM and KNN achieved better performance on this dataset when combined with the GA algorithm. In general, however, the training times all decreased, with the maximum decrease of 57% by the LR model with both GA and PSO. See further metrics in figures 7, 8, 9.</p>
      </sec>
      <sec id="sec-1-7">
        <title>3.3. Kidney Disease Dataset</title>
        <sec id="sec-1-7-1">
          <title>The final features selected by the various algorithms are:</title>
          <p>• FPO: [’su’, ’rbc’, ’pcc’, ’pe’, ’ane’]
• GA: [’rbc’, ’bgr’, ’sod’, ’hemo’, ’pcv’, ’dm’, ’cad’]
• PSO: [’age’, ’su’, ’rbc’, ’pc’, ’bgr’, ’sod’, ’pot’, ’hemo’, ’pcv’, ’rc’, ’cad’]</p>
          <p>On this dataset, high performance was achieved with most models combined with PSO and GA, while with FPO there was a significant decrease in performance. There was a 55% decrease in processing time without loss of performance with the LR model combined with PSO, and a reduction in processing time of up to 57% with LR combined with GA, with a very slight reduction in performance. The highest fitness was achieved by GA with 7 features (a 70% reduction in dimensionality), while the lowest fitness was achieved by FPO, which had the highest feature reduction. See further metrics in figures 10, 11, 12.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>4. Discussion</title>
      <p>From Table 2, it is evident that GA emerged as the most effective FS technique in terms of performance. It consistently improved accuracy across various ML models and datasets, or maintained accuracy levels compared to pre-FS values with other techniques. The accuracy enhancement with GA ranged from 0.1% to 7%. Following GA, PSO ranked second in terms of accuracy performance among the three bio-inspired algorithms. While PSO did not notably enhance accuracy, it also did not lead to significant decreases. FPO exhibited diverse outcomes across different ML models and datasets. While accuracy decreases were marginal (less than 2.5%) for most ML models on the breast cancer dataset, there were more pronounced decreases on the heart failure and kidney disease datasets.</p>
      <p>In terms of training times, the impact was particularly
notable for DT and LR, as evidenced in Table 3.
Generally, training times decreased across all models when
employing feature selection (FS), except for K-Nearest
Neighbours (KNN) with breast cancer and kidney
disease datasets, where a significant increase of up to 21%
was observed. Minor fluctuations within ±2% in training
times were considered insignificant, likely due to
variable hardware conditions and software factors. Overall,
machine learning (ML) models exhibited reduced
training times with FS, especially DT and LR models with
GA and PSO. The most substantial reduction in
training time, up to 67%, was achieved by the LR model with
FPO on the Kidney disease dataset. Although FS did not
significantly improve ML model performance in most
cases, and even led to a decrease in performance in some
instances, the noteworthy decrease in processing times
without significant loss in accuracy represents a
significant achievement.</p>
      <p>The experimental findings indicate that the GA
outperformed other FS algorithms in terms of precision, recall,
and F1-measure. GA demonstrated superior performance
when paired with nearly all ML models compared to FPO
and PSO across all datasets. However, the PSO algorithm,
when combined with the LR model, exhibited slightly
higher recall and F1 scores for breast cancer and kidney
disease datasets, as well as marginally improved recall for
heart failure dataset. Conversely, FPO generally
exhibited the poorest performance when paired with various
ML models. Although FPO achieved the highest recall
when combined with the LR model on the heart failure
dataset, its overall performance was inferior. In terms of
fitness trends, GA displayed the most favourable results,
with PSO closely trailing behind, while FPO yielded
significantly lower fitness levels compared to GA and PSO.
Further experiments are planned to investigate the
behaviour of these FS algorithms with varying parameters,
datasets, and ML models.</p>
    </sec>
    <sec id="sec-3">
      <title>5. Conclusion</title>
      <p>Our experiments have highlighted the importance of
feature selection (FS) in improving the performance of
machine learning (ML) models. The impact of FS varies
depending on factors such as the chosen FS algorithm and
dataset characteristics [8]. FS holds the potential to
significantly enhance ML outcomes, especially for datasets
with a large number of features. For example, in the
breast cancer dataset, reducing features from 30 to 12
or 8 resulted in up to a 50% reduction in training time,
while maintaining the same performance across various
ML models. However, the effect of FS on training time
may vary. While FS could improve training efficiency for
some datasets, it may require more training cycles for
others. Additionally, we have highlighted the limitations of
the Flower Pollination Optimization (FPO) algorithm and
emphasised the importance of considering multiple
evaluation metrics beyond accuracy alone. Finally, we note
that this work forms part of our broader research project
on healthcare assistant agents, encompassing various
aspects, including ethical considerations [9], [10, 11].</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>