=Paper= {{Paper |id=Vol-3762/515 |storemode=property |title=Leveraging Bio-Inspired Optimization Algorithms for Advanced Feature Selection in Chronic Disease Datasets |pdfUrl=https://ceur-ws.org/Vol-3762/515.pdf |volume=Vol-3762 |authors=Ivan Letteri,Abeer Dyoub |dblpUrl=https://dblp.org/rec/conf/ital-ia/LetteriD24 }} ==Leveraging Bio-Inspired Optimization Algorithms for Advanced Feature Selection in Chronic Disease Datasets== https://ceur-ws.org/Vol-3762/515.pdf
Leveraging Bio-Inspired Optimization Algorithms for Advanced Feature Selection in Chronic Disease Datasets

Abeer Dyoub1,*,†, Ivan Letteri2,†

1 Computer Science Department, University of Bari, Bari, Italy
2 Department of Life, Health and Environmental Sciences, University of L'Aquila, L'Aquila, Italy


Abstract
In this study, we investigated the application of bio-inspired optimization algorithms for feature selection in chronic disease prediction. The primary goal was to enhance models' predictive accuracy, streamline data dimensionality, and make predictions more interpretable and actionable. The research encompassed a comparative analysis of the three bio-inspired categories: evolutionary-based, swarm-intelligence, and ecology-based. For the feature selection method, we selected one algorithm from each category: Genetic Algorithms, Flower Pollination Optimization, and Particle Swarm Optimization, applying them across diverse chronic diseases including cancer, kidney, and cardiovascular diseases. The results demonstrate that, in some cases, the bio-inspired optimization algorithms effectively reduce the number of features required for accurate classification and, consequently, the convergence time. The findings underscore this work's potential impact on early intervention, precision medicine, and improved patient outcomes, providing new avenues for delivering healthcare services tailored to individual needs.

Keywords
Chronic Diseases Prediction, Bio-Inspired Feature Selection, Genetic Algorithms, Flower Pollination Optimization, Particle Swarm Optimization



1. Introduction

Chronic diseases pose a significant global health challenge, impacting morbidity and mortality rates. Early detection is crucial for prevention and personalised healthcare. Advanced analytics and AI offer the potential for revolutionising prediction in many fields such as finance [1][2], cybersecurity [3], and, in particular, disease.

Supervised learning in various fields relies heavily on feature selection (FS) to reduce input dimensionality. Maintaining target class integrity amidst irrelevant characteristics is essential for accurate classification in the medical domain.

Bio-inspired optimisation emulates behaviours found in various natural creatures, such as fish, insects, bird swarms, terrestrial animals, reptiles, and humans, and in other natural phenomena. These methods have been used for supervised feature selection (see [4]). The same source categorises bio-inspired optimisation algorithms into three groups based on their source of inspiration: swarm intelligence algorithms, evolutionary-based algorithms, and ecology-based algorithms. For robustness and diversity, we selected Genetic Algorithms (GA), Particle Swarm Optimisation (PSO), and Flower Pollination Optimisation (FPO), one from each category.

We refine feature subsets from medical datasets encompassing cancer, kidney, and cardiovascular diseases to enhance model accuracy and simplify data dimensionality. The aim is to improve interpretability and practicality in chronic disease prediction.

Investigating chronic diseases presents significant challenges in the healthcare domain. This study aims to improve the predictive accuracy of chronic disease models by employing machine learning (ML) and feature selection (FS) techniques, involving data collection, preprocessing, and performance assessment.

The paper proceeds with an outline of the methodology in Section 2. Section 3 presents experimental findings, followed by a discussion in Section 4. Finally, Section 5 summarises key findings, limitations, and future directions.

Ital-IA 2024: 4th National Conference on Artificial Intelligence, organized by CINI, May 29-30, 2024, Naples, Italy
* Corresponding author: Abeer Dyoub.
† These authors contributed equally.
abeer.dyoub@uniba.it (A. Dyoub); ivan.letteri@univaq.it (I. Letteri)
ORCID: 0000-0003-0329-2419 (A. Dyoub); 0000-0002-3843-386X (I. Letteri)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

2. Methodology

Preprocessing techniques, including transformation, cleaning, imputation, balancing, and normalization, were applied to ensure data quality [5]. Subsequently, feature selection was performed by the GA, PSO, and FPO algorithms. The selected features were then used for classification with Decision Trees (DT), Random Forest (RF), Logistic Regression (LR), Support Vector Machines (SVM), and K-Nearest Neighbours (KNN). Finally, we evaluated the performance of these models using various metrics.
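As an illustration, the workflow just described (preprocessing, feature selection, classification, evaluation) can be sketched in a few lines of Python. This is a minimal stdlib-only sketch of our own; the function names and the pluggable callables are illustrative assumptions, not the authors' code.

```python
# Illustrative sketch of the pipeline: preprocess, select features,
# classify, evaluate. All names are hypothetical, not the paper's code.

def min_max_normalize(column):
    """Scale a list of numbers to [0, 1] via (x - min) / (max - min)."""
    lo, hi = min(column), max(column)
    if hi == lo:  # constant column: map everything to 0.0
        return [0.0] * len(column)
    return [(x - lo) / (hi - lo) for x in column]

def run_pipeline(X, y, select_features, train, evaluate):
    # 1. Preprocessing: min-max normalize each feature column.
    columns = list(zip(*X))
    X_norm = [list(row) for row in zip(*(min_max_normalize(c) for c in columns))]
    # 2. Feature selection: a bio-inspired optimizer returns a boolean mask.
    mask = select_features(X_norm, y)
    X_sel = [[v for v, keep in zip(row, mask) if keep] for row in X_norm]
    # 3. Classification and 4. evaluation on the reduced data.
    model = train(X_sel, y)
    return evaluate(model, X_sel, y)
```

A concrete run would plug in, for example, a GA-based `select_features` and any of the five classifiers above for `train`.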
2.1. The Datasets

Breast Cancer dataset: from the University of Wisconsin, this dataset involves cytological examinations to distinguish between benign and malignant tumours. It contains 569 samples and 31 features.

Kidney Disease dataset: medical information on chronic kidney disease, collected over two months in India, is included in this dataset, available on Kaggle and the UCI repository. It consists of 400 samples and 25 features.

Heart Failure dataset: comprising medical records of heart failure patients during follow-up, this dataset contains 299 samples and 13 features.

Each dataset has a "diagnosis" column with binary values used as the target for supervised learning of the classifiers, where 0 denotes a negative outcome and 1 a positive one.

2.2. Datasets Pre-processing

Missing Values Imputation. Missing data poses risks of performance degradation and biased results. We used the K-Nearest Neighbours (KNN) algorithm, known for its adaptability to diverse data types, to fill the gaps in the datasets.

Data Balancing. Balancing the datasets is a critical concern because classifiers struggle when faced with disparate class distributions, leading to biased models. To mitigate this issue, we used the combined Synthetic Minority Over-sampling Technique and Edited Nearest Neighbours method (SMOTEENN) [6], which addresses imbalanced datasets by oversampling the minority class with SMOTE and cleaning the majority class with Edited Nearest Neighbours (ENN).

Min-max Normalization. We applied this scaling method to normalize the datasets to a predefined range, as follows:

X_norm = (X - X_min) / (X_max - X_min),

where X_norm represents the normalized value of the feature, X is the original value of the feature, and X_min and X_max denote the minimum and maximum values, respectively.

2.3. Bio-inspired Feature Selection

Following the data preparation stage, we applied the three aforementioned bio-inspired feature selection algorithms to each of the three datasets (see Section 2.1). All algorithms employ the same fitness function, with the α value set to 0.99 to prioritize classification accuracy. For assessing fitness, we utilized the K-Nearest Neighbours (KNN) classifier, known for its efficiency and adopted by [7], as it does not necessitate a lengthy training phase. A neighbour count of K = 10 was used. The feature selection algorithms were configured with 20 agents (individuals) and 100 generations.

2.4. Performance Evaluation Method

For each dataset detailed in Section 2.1, every machine learning model is trained on 70% of the data and tested on the remaining 30%, using both all features and the features filtered by the PSO, FPO, and GA algorithms. This process is iterated 100 times; in each iteration, the dataset is shuffled and split into training and testing sets to evaluate measures such as Accuracy, Recall, Precision, and F1-score.

3. Experiments and Results

Figures 1, 2, 3 show the fitness trends of the FS algorithms, and Table 1 summarizes the performance of these FS algorithms in terms of feature reduction. In Table 2, we report the accuracies of the classifiers with the selected features, whereas in Table 3 we report the percentage variations of the training time before and after FS.

Figure 1: Fitness trends on breast cancer dataset

Table 1
Performance of Dimensional Reduction in the different Feature Selection Algorithms.

    Dataset          Algorithm   Fitness    #Features   Reduction
    Breast Cancer    GA          ≈ 0.992     8          73.3%
                     PSO         ≈ 0.985    12          60%
                     FPO         ≈ 0.9092    8          73.3%
    Heart Failure    GA          ≈ 0.91      3          75%
                     PSO         ≈ 0.79      2          83.3%
                     FPO         ≈ 0.581     3          75%
    Kidney Disease   GA          ≈ 0.998     7          70%
                     PSO         ≈ 0.995    11          54%
                     FPO         ≈ 0.8454    5          80%
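The fitness values in Table 1 are produced by the wrapper objective of Section 2.3: a KNN classifier scores each candidate feature subset, with α = 0.99 weighting accuracy against subset size. The paper does not print the formula, so the sketch below assumes the common α-weighted form (used, e.g., by [7]), with a leave-one-out 1-NN as a small, self-contained stand-in for the K = 10 classifier.

```python
import math

ALPHA = 0.99  # weight on classification accuracy (Section 2.3)

def knn_loo_accuracy(X, y, k=1):
    """Leave-one-out accuracy of a tiny k-NN classifier (illustrative
    stand-in for the paper's K = 10 KNN fitness evaluator)."""
    correct = 0
    for i, xi in enumerate(X):
        dists = sorted((math.dist(xi, xj), y[j])
                       for j, xj in enumerate(X) if j != i)
        votes = [label for _, label in dists[:k]]
        pred = max(set(votes), key=votes.count)
        correct += pred == y[i]
    return correct / len(X)

def fitness(mask, X, y, alpha=ALPHA):
    """Assumed wrapper objective: alpha * accuracy
    + (1 - alpha) * (fraction of features dropped)."""
    kept = [j for j, keep in enumerate(mask) if keep]
    if not kept:  # an empty subset cannot classify anything
        return 0.0
    X_sub = [[row[j] for j in kept] for row in X]
    reduction = 1 - len(kept) / len(mask)
    return alpha * knn_loo_accuracy(X_sub, y) + (1 - alpha) * reduction
```

Each bio-inspired optimizer then searches over binary masks to maximize this score; with α = 0.99, accuracy dominates and feature reduction only breaks near-ties.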




Figure 2: Fitness trends on heart failure dataset

Figure 3: Fitness trends on kidney disease dataset

3.1. Breast Cancer Dataset

The final features selected by the various algorithms are:

     • FPO: ['symmetry mean', 'fractal dimension mean', 'radius se', 'area se', 'compactness se', 'symmetry se', 'fractal dimension se', 'concavity worst']
     • GA: ['texture mean', 'concavity mean', 'area se', 'compactness se', 'concave points se', 'fractal dimension se', 'radius worst', 'compactness worst']
     • PSO: ['radius mean', 'area mean', 'smoothness mean', 'compactness mean', 'fractal dimension mean', 'radius se', 'texture se', 'area se', 'smoothness se', 'compactness se', 'texture worst', 'concavity worst']

We note that GA achieved a better fitness than FPO, even though both achieved the same percentage reduction in dimensionality on the breast cancer dataset; the two algorithms selected different sets of features. Regarding training time, the training times globally decreased, by up to a maximum of 54.5% for GA with LR on this dataset; with KNN, however, the time increased in all cases. Further metrics can be seen in Figures 4, 5, 6.

Figure 4: Precision on dataset Breast Cancer.

Figure 5: Recall on dataset Breast Cancer.

Figure 6: F1-score on dataset Breast Cancer.

3.2. Heart Failure Dataset

The final features selected by the three algorithms are:

     • FPO: ['anaemia', 'diabetes', 'smoking']
     • GA: ['platelets', 'serum sodium', 'time']
     • PSO: ['platelets', 'serum creatinine']

The genetic algorithm achieved significantly higher fitness than FPO and PSO, even though the dimensionality reduction is almost the same. RF, DT, SVM and KNN achieved better performance on this dataset when combined with the GA algorithm. In general, however, the training times all decreased, with a maximum decrease of 57% by the LR model with both GA and PSO. See further metrics in Figures 7, 8, 9.

Figure 9: F1-score on dataset Heart Failure.
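For reference, the precision, recall, and F1 values plotted in the figures follow the standard binary definitions; a minimal sketch (our own illustration, not the authors' evaluation code):

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Standard binary metrics computed from true and predicted labels."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

In the experiments these three scores are averaged over the 100 shuffled train/test splits of Section 2.4.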

Figure 7: Precision on dataset Heart Failure.

Figure 8: Recall on dataset Heart Failure.

3.3. Kidney Disease Dataset

The final features selected by the various algorithms are:

     • FPO: ['su', 'rbc', 'pcc', 'pe', 'ane']
     • GA: ['rbc', 'bgr', 'sod', 'hemo', 'pcv', 'dm', 'cad']
     • PSO: ['age', 'su', 'rbc', 'pc', 'bgr', 'sod', 'pot', 'hemo', 'pcv', 'rc', 'cad']

In this dataset, high performance was achieved with most models combined with PSO and GA, while with FPO there was a significant decrease in performance. There was a 55% decrease in processing time without loss of performance for the LR model combined with PSO, and a reduction in processing time of up to 57% for LR combined with GA, with a very slight reduction in performance. The highest fitness was achieved by GA with 7 features (a 70% reduction in dimensionality), while the lowest fitness was achieved by FPO, which had the highest feature reduction. See further metrics in Figures 10, 11, 12.

Figure 10: Precision on dataset Kidney Disease.

Figure 11: Recall on dataset Kidney Disease.

Figure 12: F1-score on dataset Kidney Disease.

4. Discussion

From Table 2, it is evident that GA emerged as the most effective FS technique in terms of performance. It consistently improved accuracy across various ML models and datasets, or maintained accuracy close to pre-FS levels, more so than the other techniques. The accuracy enhancement with GA ranged from 0.1% to 7%. Following GA, PSO ranked second in accuracy performance among the three bio-inspired algorithms. While PSO did not notably enhance accuracy, it also did not lead to significant decreases. FPO exhibited diverse outcomes across different ML models and datasets: while accuracy decreases were marginal (less than 2.5%) for most ML models on the breast cancer dataset, there were more pronounced decreases on the heart failure and kidney disease datasets.

In terms of training times, the impact was particularly notable for DT and LR, as evidenced in Table 3. Generally, training times decreased across all models when employing feature selection (FS), except for K-Nearest Neighbours (KNN) on the breast cancer and kidney disease datasets, where a significant increase of up to 21% was observed. Minor fluctuations within ±2% in training times were considered insignificant, likely due to variable hardware conditions and software factors. Overall, machine learning (ML) models exhibited reduced training times with FS, especially the DT and LR models with GA and PSO. The most substantial reduction in training time, up to 67%, was achieved by the LR model with FPO on the kidney disease dataset. Although FS did not significantly improve ML model performance in most cases, and even led to a decrease in performance in some instances, the noteworthy decrease in processing times without significant loss in accuracy represents a significant achievement.

The experimental findings indicate that GA outperformed the other FS algorithms in terms of precision, recall, and F1-measure. GA demonstrated superior performance when paired with nearly all ML models compared to FPO and PSO across all datasets. However, the PSO algorithm, when combined with the LR model, exhibited slightly higher recall and F1 scores for the breast cancer and kidney disease datasets, as well as marginally improved recall for the heart failure dataset. Conversely, FPO generally exhibited the poorest performance when paired with the various ML models. Although FPO achieved the highest recall when combined with the LR model on the heart failure dataset, its overall performance was inferior. In terms of fitness trends, GA displayed the most favourable results, with PSO closely trailing behind, while FPO yielded significantly lower fitness levels compared to GA and PSO. Further experiments are planned to investigate the behaviour of these FS algorithms with varying parameters, datasets, and ML models.

Table 2
Models Accuracy.

                  Breast Cancer   Heart Failure   Kidney Disease
    RF    No FS   98.4%           85.5%           98.7%
          FPO     96.2%           61.2%           57%
          PSO     98%             83.6%           97.5%
          GA      98.5%           89.2%           96.8%
    DT    No FS   97.3%           79.2%           96.9%
          FPO     94.9%           61.1%           57%
          PSO     96.7%           83%             96.8%
          GA      97.3%           83.6%           96.8%
    SVM   No FS   99.5%           77.7%           99.4%
          FPO     96.9%           56.8%           57%
          PSO     99.5%           68.2%           98.7%
          GA      99.6%           78.3%           97.5%
    LR    No FS   99.5%           79.4%           97.7%
          FPO     96.1%           59.1%           57%
          PSO     98.3%           59.9%           97.6%
          GA      98.3%           77.8%           94.7%
    KNN   No FS   98.7%           69.6%           96.1%
          FPO     96.1%           52.9%           52.5%
          PSO     98.5%           64.6%           98.4%
          GA      98.4%           77.3%           97.1%

5. Conclusion

Our experiments have highlighted the importance of feature selection (FS) in improving the performance of machine learning (ML) models. The impact of FS varies depending on factors such as the chosen FS algorithm and dataset characteristics [8]. FS holds the potential to significantly enhance ML outcomes, especially for datasets with a large number of features. For example, in the breast cancer dataset, reducing the features from 30 to 12 or 8 resulted in up to a 50% reduction in training time, while maintaining the same performance across various ML models. However, the effect of FS on training time
may vary. While FS could improve training efficiency for some datasets, it may require more training cycles for others. Additionally, we have highlighted the limitations of the Flower Pollination Optimization (FPO) algorithm and emphasised the importance of considering multiple evaluation metrics beyond accuracy alone. Finally, we note that this work forms part of our broader research project on healthcare assistant agents, encompassing various aspects, including ethical considerations [9], [10, 11].

Table 3
Models Processing Time (percentage variation in training time after FS).

                  Breast Cancer   Heart Failure   Kidney Disease
    RF    FPO     -7%             +2%             -2%
          PSO     -3.5%           +1%             -2%
          GA      -5%             +1%             -4%
    DT    FPO     -40%            -8%             -25%
          PSO     -40%            -6%             -11%
          GA      -50%            -8%             -18%
    SVM   FPO     -4%             -6%             -10%
          PSO     -10%            +1%             -0.6%
          GA      -16%            -8.5%           -4%
    LR    FPO     -20.5%          -17%            -67%
          PSO     -54%            -57%            -55%
          GA      -54.5%          -57%            -57%
    KNN   FPO     +11%            -2%             +3.6%
          PSO     +21%            -4%             +4%
          GA      +10%            -7%             -0.009%

References

[1] I. Letteri, AITA: A new framework for trading for-
    http://dx.doi.org/10.14569/IJACSA.2022.0130516. doi:10.14569/IJACSA.2022.0130516.
[5] I. Letteri, G. D. Penna, L. D. Vita, M. T. Grifa, MTA-KDD'19: A dataset for malware traffic detection, in: Proceedings of the Fourth Italian Conference on Cyber Security, Ancona, Italy, February 4th to 7th, 2020, volume 2597 of CEUR Workshop Proceedings, CEUR-WS.org, 2020, pp. 153–165. URL: https://ceur-ws.org/Vol-2597/paper-14.pdf.
[6] M. Khushi, K. Shaukat, T. M. Alam, I. A. Hameed, S. Uddin, S. Luo, X. Yang, M. C. Reyes, A comparative performance analysis of data resampling methods on imbalance medical data, IEEE Access 9 (2021) 109960–109975. doi:10.1109/ACCESS.2021.3102399.
[7] M. Sharawi, H. M. Zawbaa, E. Emary, Feature selection approach based on whale optimization algorithm, in: 2017 Ninth International Conference on Advanced Computational Intelligence (ICACI), IEEE, 2017, pp. 163–168.
[8] I. Letteri, G. D. Penna, P. Caianiello, Feature selection strategies for HTTP botnet traffic detection, in: 2019 IEEE European Symposium on Security and Privacy Workshops, EuroS&P Workshops 2019, Stockholm, Sweden, June 17-19, 2019, IEEE, 2019, pp. 202–210. doi:10.1109/EUROSPW.2019.00029.
[9] A. Dyoub, S. Costantini, F. A. Lisi, Learning domain ethical principles from interactions with users,
     ward testing with an artificial intelligence engine,              Digit. Soc. 1 (2022). doi:10.1007/S44206-022-0
     in: F. Falchi, F. Giannotti, A. Monreale, C. Boldrini,            0026-Y.
     S. Rinzivillo, S. Colantonio (Eds.), Proceedings of          [10] A. Dyoub, S. Costantini, I. Letteri, Care robots learn-
     the Italia Intelligenza Artificiale - Thematic Work-              ing rules of ethical behavior under the supervision
     shops co-located with the 3rd CINI National Lab                   of an ethical teacher (short paper), in: P. Bruno,
     AIIS Conference on Artificial Intelligence (Ital IA               F. Calimeri, F. Cauteruccio, M. Maratea, G. Ter-
     2023), Pisa, Italy, May 29-30, 2023, volume 3486 of               racina, M. Vallati (Eds.), Joint Proceedings of the
     CEUR Workshop Proceedings, 2023, pp. 506–511.                     1st International Workshop on HYbrid Models for
 [2] I. Letteri, G. D. Penna, G. D. Gasperis, A. Dy-                   Coupling Deductive and Inductive ReAsoning (HY-
     oub, Trading strategy validation using forwardtest-               DRA 2022) and the 29th RCRA Workshop on Exper-
     ing with deep neural networks, in: Proceedings                    imental Evaluation of Algorithms for Solving Prob-
     of the 5th International Conference on Finance,                   lems with Combinatorial Explosion (RCRA 2022)
     Economics, Management and IT Business, FEMIB                      co-located with the 16th International Conference
     2023, Prague, Czech Republic, April 23-24, 2023,                  on Logic Programming and Non-monotonic Reason-
     SCITEPRESS, 2023, pp. 15–25. doi:10.5220/0011                     ing (LPNMR 2022), Genova Nervi, Italy, September
     715300003494.                                                     5, 2022, volume 3281 of CEUR Workshop Proceedings,
 [3] I. Letteri, G. D. Penna, G. D. Gasperis, Security in              2022, pp. 1–8.
     the internet of things: botnet detection in software-        [11] A. Dyoub, S. Costantini, F. A. Lisi, I. Letteri, Logic-
     defined networks by deep learning techniques, Int.                based machine learning for transparent ethical
     J. High Perform. Comput. Netw. 15 (2019) 170–182.                 agents, in: F. Calimeri, S. Perri, E. Zumpano (Eds.),
     doi:10.1504/IJHPCN.2019.106095.                                   Proceedings of the 35th Italian Conference on Com-
 [4] M. Petwan, K. R. Ku-Mahamud, A review on bio-                     putational Logic - CILC 2020, Rende, Italy, October
     inspired optimization method for supervised fea-                  13-15, 2020, volume 2710 of CEUR Workshop Pro-
     ture selection, International Journal of Advanced                 ceedings, CEUR-WS.org, 2020, pp. 169–183. URL:
     Computer Science and Applications 13 (2022). URL:                 https://ceur-ws.org/Vol-2710/paper11.pdf.