Leveraging Bio-Inspired Optimization Algorithms for Advanced Feature Selection in Chronic Disease Datasets

Abeer Dyoub1,*,†, Ivan Letteri2,†

1 Computer Science Department, University of Bari, Bari, Italy
2 Department of Life, Health and Environmental Sciences, University of L'Aquila, L'Aquila, Italy

Abstract
In this study, we investigated the application of bio-inspired optimization algorithms for feature selection in chronic disease prediction. The primary goal was to enhance models' predictive accuracy, streamline data dimensionality, and make predictions more interpretable and actionable. The research encompassed a comparative analysis of the three bio-inspired categories: evolutionary-based, swarm-intelligence, and ecology-based algorithms. For the feature selection method, we selected one algorithm from each category: Genetic Algorithms, Flower Pollination Optimization, and Particle Swarm Optimization, applying them across diverse chronic diseases including cancer, kidney, and cardiovascular diseases. The results demonstrate that, in some cases, the bio-inspired optimization algorithms effectively reduce the number of features required for accurate classification and, consequently, the convergence time. The findings underscore this work's potential impact on early intervention, precision medicine, and improved patient outcomes, providing new avenues for delivering healthcare services tailored to individual needs.

Keywords
Chronic Diseases Prediction, Bio-Inspired Feature Selection, Genetic Algorithms, Flower Pollination Optimization, Particle Swarm Optimization

1. Introduction

Chronic diseases pose a significant global health challenge, impacting morbidity and mortality rates. Early detection is crucial for prevention and personalised healthcare. Advanced analytics and AI offer the potential for revolutionising prediction in many fields such as finance [1][2] and cybersecurity [3], and in particular disease prediction.

Supervised learning in various fields relies heavily on feature selection (FS) to reduce input dimensionality. Maintaining target class integrity amidst irrelevant characteristics is essential for accurate classification in the medical domain.

Bio-inspired optimisation emulates behaviours found in various natural creatures, such as fish, insects, bird swarms, terrestrial animals, reptiles, and humans, as well as other natural phenomena. These methods have been used for supervised feature selection (see [4]). The same source categorises bio-inspired optimisation algorithms into three groups based on their source of inspiration: swarm intelligence algorithms, evolutionary-based algorithms, and ecology-based algorithms. For robustness and diversity, we selected Genetic Algorithms (GA), Particle Swarm Optimisation (PSO), and Flower Pollination Optimisation (FPO), one from each category.

We refine feature subsets from medical datasets encompassing cancer, kidney, and cardiovascular diseases to enhance model accuracy and simplify data dimensionality. The aim is to improve interpretability and practicality in chronic disease prediction.

Investigating chronic diseases presents significant challenges in the healthcare domain. This study aims to improve the predictive accuracy of chronic disease models by employing machine learning (ML) and feature selection (FS) techniques, involving data collection, preprocessing, and performance assessment.

The paper proceeds with an outline of the methodology in Section 2. Section 3 presents experimental findings, followed by a discussion in Section 4. Finally, Section 5 summarises key findings, limitations, and future directions.
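The pipeline just outlined (preprocessing, feature selection, classification, performance assessment) is detailed in the following sections. As a rough, self-contained sketch of its evaluation backbone — min-max normalisation, a shuffled 70/30 hold-out split, and a K-nearest-neighbour classifier — consider the toy example below; the synthetic two-feature, two-class data and all parameter choices are illustrative assumptions, not the paper's datasets or settings.

```python
import random

# Toy sketch of the evaluation backbone described later (Sections 2.2 and 2.4):
# min-max normalisation, a shuffled 70/30 hold-out split, and a KNN classifier.
# The synthetic two-feature, two-class data below is invented for illustration.

def min_max(column):
    lo, hi = min(column), max(column)
    return [(x - lo) / (hi - lo) for x in column]

def knn_predict(train_X, train_y, x, k=3):
    # Rank training points by squared Euclidean distance, then majority-vote.
    order = sorted(range(len(train_X)),
                   key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)))
    votes = [train_y[i] for i in order[:k]]
    return max(set(votes), key=votes.count)

random.seed(0)
# Hypothetical binary "diagnosis" targets: 0 = negative, 1 = positive.
X = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(70)] \
  + [[random.gauss(4, 1), random.gauss(4, 1)] for _ in range(70)]
y = [0] * 70 + [1] * 70

# Normalise every feature to [0, 1]: (x - min) / (max - min).
X = list(map(list, zip(*[min_max(list(col)) for col in zip(*X)])))

# Shuffle, then split 70% train / 30% test.
idx = list(range(len(X)))
random.shuffle(idx)
cut = int(0.7 * len(idx))
train, test = idx[:cut], idx[cut:]
train_X = [X[i] for i in train]
train_y = [y[i] for i in train]

correct = sum(knn_predict(train_X, train_y, X[i]) == y[i] for i in test)
accuracy = correct / len(test)
print(f"hold-out accuracy: {accuracy:.2f}")
```

With the well-separated toy classes above, the hold-out accuracy comes out high; on real medical data the same protocol is simply repeated many times with reshuffling, as described later in the methodology.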
Ital-IA 2024: 4th National Conference on Artificial Intelligence, organized by CINI, May 29-30, 2024, Naples, Italy
* Corresponding author: Abeer Dyoub.
† These authors contributed equally.
Email: abeer.dyoub@uniba.it (A. Dyoub); ivan.letteri@univaq.it (I. Letteri)
ORCID: 0000-0003-0329-2419 (A. Dyoub); 0000-0002-3843-386X (I. Letteri)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073, http://ceur-ws.org

2. Methodology

Preprocessing techniques, including transformation, cleaning, imputation, balancing, and normalization, were applied to ensure data quality [5]. Subsequently, feature selection was performed with the GA, PSO, and FPO algorithms. The selected features were then used for classification with Decision Trees (DT), Random Forest (RF), Logistic Regression (LR), Support Vector Machines (SVM), and K-Nearest Neighbour (KNN). Finally, we evaluated the performance of these models using various metrics.

2.1. The Datasets

Breast Cancer dataset: From the University of Wisconsin, this dataset involves cytological examinations to distinguish between benign and malignant tumours. It contains 569 samples and 31 features.

Kidney Disease dataset: Medical information on chronic kidney disease, collected over two months in India and available on Kaggle or UCI. It consists of 400 samples and 25 features.

Heart Failure dataset: Comprising medical records of heart failure patients during follow-up, this dataset contains 299 samples and 13 features.

Each dataset has a "diagnosis" column with binary values used as the target for supervised learning of the classifiers, where 0 denotes a negative outcome and 1 a positive one.

2.2. Datasets Pre-processing

Missing Values Imputation. Missing data poses risks of performance degradation and biased results. We used the K-Nearest Neighbors (KNN) algorithm, known for its adaptability to diverse data types, to fill the gaps in the datasets.

Data Balancing. Balancing the datasets is a critical concern, because classifiers struggle when faced with disparate class distributions, leading to biased models. To mitigate this issue, we used the Synthetic Minority Over-sampling Technique combined with Edited Nearest Neighbours (SMOTEENN) [6], which addresses imbalanced datasets by oversampling the minority class with SMOTE and cleaning the majority class with the Edited Nearest Neighbours (ENN) method.

Min-max Normalization. We applied this scaling method to normalize the datasets to a predefined range, as follows: X_norm = (X − X_min) / (X_max − X_min), where X_norm represents the normalized value of the feature, X is the original value, and X_min and X_max denote the minimum and maximum values, respectively.

2.3. Bio-inspired Feature Selection

Following the data preparation stage, we applied the three aforementioned bio-inspired feature selection algorithms to each of the three datasets (see Section 2.1). All algorithms employ the same fitness function, with the α value set to 0.99 to prioritize classification accuracy. For assessing fitness, we utilized the K-Nearest Neighbors (KNN) classifier, known for its efficiency and adopted by [7], as it does not necessitate a lengthy training phase. A neighbour count of K = 10 was used. The feature selection algorithms were configured with 20 agents (individuals) and 100 generations.

2.4. Performance Evaluation Method

For each dataset detailed in Section 2.1, every machine learning model is trained using 70% of the data and tested using the remaining 30%, employing all features as well as the features filtered by the PSO, FPO, and GA algorithms. This process is iterated 100 times, with each iteration involving shuffling the dataset. For each iteration, the dataset is split into training and testing sets to evaluate measures such as Accuracy, Recall, Precision, and F1-score.

3. Experiments and Results

Figures 1, 2, and 3 show the fitness trends of the FS algorithms, and Table 1 summarizes the performance of these FS algorithms in terms of feature reduction. In Table 2, we report the accuracies of the classifiers with the selected features, whereas in Table 3 we report the percentage variations of the training time before and after FS.

Figure 1: Fitness trends on breast cancer dataset
Figure 2: Fitness trends on heart failure dataset
Figure 3: Fitness trends on kidney disease dataset

Table 1: Performance of dimensionality reduction for the different feature selection algorithms.

| Dataset        | Algorithm | Fitness  | #Features | Reduction |
| Breast Cancer  | GA        | ≈ 0.992  | 8         | 73.3%     |
|                | PSO       | ≈ 0.985  | 12        | 60%       |
|                | FPO       | ≈ 0.9092 | 8         | 73.3%     |
| Heart Failure  | GA        | ≈ 0.91   | 3         | 75%       |
|                | PSO       | ≈ 0.79   | 2         | 83.3%     |
|                | FPO       | ≈ 0.581  | 3         | 75%       |
| Kidney Disease | GA        | ≈ 0.998  | 7         | 70%       |
|                | PSO       | ≈ 0.995  | 11        | 54%       |
|                | FPO       | ≈ 0.8454 | 5         | 80%       |

3.1. Breast Cancer Dataset

The final features selected by the various algorithms are:

• FPO: ['symmetry mean', 'fractal dimension mean', 'radius se', 'area se', 'compactness se', 'symmetry se', 'fractal dimension se', 'concavity worst']
• GA: ['texture mean', 'concavity mean', 'area se', 'compactness se', 'concave points se', 'fractal dimension se', 'radius worst', 'compactness worst']
• PSO: ['radius mean', 'area mean', 'smoothness mean', 'compactness mean', 'fractal dimension mean', 'radius se', 'texture se', 'area se', 'smoothness se', 'compactness se', 'texture worst', 'concavity worst']

We note that GA achieved a better fitness than FPO, even though both achieved the same dimensionality reduction percentage on the breast cancer dataset. The two algorithms selected different sets of features. Further metrics can be seen in Figures 4, 5, and 6.

Figure 4: Precision on dataset Breast Cancer.
Figure 5: Recall on dataset Breast Cancer.
Figure 6: F1-score on dataset Breast Cancer.
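The fitness values compared here come from the shared fitness function of Section 2.3 (α = 0.99, K = 10 KNN surrogate). The paper does not spell out the exact formula, so the sketch below assumes a common wrapper-style formulation that trades surrogate accuracy against subset size; the synthetic data and all names are illustrative.

```python
import random

# A common wrapper-style fitness of the kind Section 2.3 describes:
#   fitness(S) = alpha * accuracy(S) + (1 - alpha) * (1 - |S| / N)
# With alpha = 0.99, accuracy dominates and the parsimony term only breaks
# near-ties between subsets. NOTE: the paper does not state its exact formula,
# so this weighting is an assumption; the K = 10 KNN surrogate matches the
# setup stated in the text.

ALPHA = 0.99

def knn_accuracy(X, y, subset, k=10):
    """Leave-one-out accuracy of a KNN restricted to the features in `subset`."""
    hits = 0
    for i, xi in enumerate(X):
        others = sorted((j for j in range(len(X)) if j != i),
                        key=lambda j: sum((X[j][f] - xi[f]) ** 2 for f in subset))
        votes = [y[j] for j in others[:k]]
        hits += max(set(votes), key=votes.count) == y[i]
    return hits / len(X)

def fitness(X, y, subset, n_features):
    if not subset:
        return 0.0
    return ALPHA * knn_accuracy(X, y, subset) \
        + (1 - ALPHA) * (1 - len(subset) / n_features)

# Tiny synthetic check: feature 0 carries the class signal, feature 1 is noise,
# so a subset keeping only feature 0 should score much higher.
random.seed(1)
X = [[random.gauss(c * 4, 1), random.gauss(0, 1)] for c in (0, 1) for _ in range(30)]
y = [0] * 30 + [1] * 30

f_informative = fitness(X, y, [0], 2)
f_noise = fitness(X, y, [1], 2)
print(f"informative: {f_informative:.3f}  noise-only: {f_noise:.3f}")
```

Because α = 0.99, a subset that loses even a little accuracy is penalised far more than one that merely keeps an extra feature, matching the paper's statement that classification accuracy is prioritised.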
Regarding training time, we observe that the training times globally decreased, by up to a maximum of 54.5% for GA with LR on this dataset. With KNN, the training time increased in all cases.

3.2. Heart Failure Dataset

The final features selected by the three algorithms are:

• FPO: ['anaemia', 'diabetes', 'smoking']
• GA: ['platelets', 'serum sodium', 'time']
• PSO: ['platelets', 'serum creatinine']

The genetic algorithm achieved significantly higher fitness than FPO and PSO, even though the dimensionality reduction is almost the same. RF, DT, SVM, and KNN achieved better performance on this dataset when combined with the GA algorithm. In general, the training times all decreased, with a maximum decrease of 57% for the LR model with both GA and PSO. Further metrics can be seen in Figures 7, 8, and 9.

Figure 7: Precision on dataset Heart Failure.
Figure 8: Recall on dataset Heart Failure.
Figure 9: F1-score on dataset Heart Failure.

3.3. Kidney Disease Dataset

The final features selected by the various algorithms are:

• FPO: ['su', 'rbc', 'pcc', 'pe', 'ane']
• GA: ['rbc', 'bgr', 'sod', 'hemo', 'pcv', 'dm', 'cad']
• PSO: ['age', 'su', 'rbc', 'pc', 'bgr', 'sod', 'pot', 'hemo', 'pcv', 'rc', 'cad']

On this dataset, high performance was achieved by most models combined with PSO and GA, while with FPO there was a significant decrease in performance. There was a 55% decrease in processing time without loss of performance for the LR model combined with PSO, and a reduction in processing time of up to 57% for LR combined with GA, with a very slight reduction in performance. The highest fitness was achieved by GA with 7 features (a 70% reduction in dimensionality), while the lowest fitness was achieved by FPO, which had the highest feature reduction. Further metrics can be seen in Figures 10, 11, and 12.

Figure 10: Precision on dataset Kidney Disease.
Figure 11: Recall on dataset Kidney Disease.
Figure 12: F1-score on dataset Kidney Disease.

4. Discussion

From Table 2, it is evident that GA emerged as the most effective FS technique in terms of performance. It consistently improved accuracy across various ML models and datasets, or at least maintained pre-FS accuracy levels, compared with the other techniques. The accuracy enhancement with GA ranged from 0.1% to 7%. Following GA, PSO ranked second in accuracy performance among the three bio-inspired algorithms. While PSO did not notably enhance accuracy, it also did not lead to significant decreases. FPO exhibited diverse outcomes across different ML models and datasets: while accuracy decreases were marginal (less than 2.5%) for most ML models on the breast cancer dataset, there were more pronounced decreases on the heart failure and kidney disease datasets.

Table 2: Models accuracy.

| Model | FS    | Breast Cancer | Heart Failure | Kidney Disease |
| RF    | No FS | 98.4%         | 85.5%         | 98.7%          |
|       | FPO   | 96.2%         | 61.2%         | 57%            |
|       | PSO   | 98%           | 83.6%         | 97.5%          |
|       | GA    | 98.5%         | 89.2%         | 96.8%          |
| DT    | No FS | 97.3%         | 79.2%         | 96.9%          |
|       | FPO   | 94.9%         | 61.1%         | 57%            |
|       | PSO   | 96.7%         | 83%           | 96.8%          |
|       | GA    | 97.3%         | 83.6%         | 96.8%          |
| SVM   | No FS | 99.5%         | 77.7%         | 99.4%          |
|       | FPO   | 96.9%         | 56.8%         | 57%            |
|       | PSO   | 99.5%         | 68.2%         | 98.7%          |
|       | GA    | 99.6%         | 78.3%         | 97.5%          |
| LR    | No FS | 99.5%         | 79.4%         | 97.7%          |
|       | FPO   | 96.1%         | 59.1%         | 57%            |
|       | PSO   | 98.3%         | 59.9%         | 97.6%          |
|       | GA    | 98.3%         | 77.8%         | 94.7%          |
| KNN   | No FS | 98.7%         | 69.6%         | 96.1%          |
|       | FPO   | 96.1%         | 52.9%         | 52.5%          |
|       | PSO   | 98.5%         | 64.6%         | 98.4%          |
|       | GA    | 98.4%         | 77.3%         | 97.1%          |

In terms of training times, the impact was particularly notable for DT and LR, as evidenced in Table 3. Generally, training times decreased across all models when employing feature selection, except for KNN on the breast cancer and kidney disease datasets, where an increase of up to 21% was observed. Minor fluctuations within ±2% in training times were considered insignificant, likely due to variable hardware conditions and software factors. Overall, ML models exhibited reduced training times with FS, especially the DT and LR models with GA and PSO. The most substantial reduction in training time, up to 67%, was achieved by the LR model with FPO on the kidney disease dataset. Although FS did not significantly improve ML model performance in most cases, and even led to a decrease in performance in some instances, the noteworthy decrease in processing times without significant loss in accuracy represents a significant achievement.

The experimental findings indicate that GA outperformed the other FS algorithms in terms of precision, recall, and F1-measure. GA demonstrated superior performance when paired with nearly all ML models compared to FPO and PSO across all datasets. However, the PSO algorithm, when combined with the LR model, exhibited slightly higher recall and F1 scores for the breast cancer and kidney disease datasets, as well as marginally improved recall for the heart failure dataset. Conversely, FPO generally exhibited the poorest performance when paired with the various ML models. Although FPO achieved the highest recall when combined with the LR model on the heart failure dataset, its overall performance was inferior. In terms of fitness trends, GA displayed the most favourable results, with PSO closely trailing, while FPO yielded significantly lower fitness levels than GA and PSO. Further experiments are planned to investigate the behaviour of these FS algorithms with varying parameters, datasets, and ML models.

5. Conclusion

Our experiments have highlighted the importance of feature selection (FS) in improving the performance of machine learning (ML) models. The impact of FS varies depending on factors such as the chosen FS algorithm and dataset characteristics [8]. FS holds the potential to significantly enhance ML outcomes, especially for datasets with a large number of features.
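The kind of measurement behind the training-time comparisons of Table 3 can be sketched as follows: train the same learner on all features and on a reduced subset, then compare wall-clock time. The hand-rolled batch-gradient-descent logistic regression and the synthetic 30-feature data below are illustrative stand-ins, not the paper's models or datasets.

```python
import math
import random
import time

# Sketch of a training-time comparison: the same learner is timed on the
# full feature set and on a reduced subset. Learner and data are invented
# stand-ins; training cost grows with the number of features.

def train_logreg(X, y, iters=100, lr=0.1):
    """Plain batch gradient descent for logistic regression."""
    d = len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(iters):
        gw, gb = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, xi))
            z = max(-30.0, min(30.0, z))          # clamp to avoid overflow
            err = 1.0 / (1.0 + math.exp(-z)) - yi
            gb += err
            for j in range(d):
                gw[j] += err * xi[j]
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, gw)]
        b -= lr * gb / len(X)
    return w, b

random.seed(2)
n, d = 200, 30
y = [i % 2 for i in range(n)]
# Only the first two features are informative; the other 28 are noise.
X = [[random.gauss(2 * yi, 1) if j < 2 else random.gauss(0, 1)
      for j in range(d)] for yi in y]

t0 = time.perf_counter()
train_logreg(X, y)
t_full = time.perf_counter() - t0

subset = [0, 1]                                   # a hypothetical FS result
X_sub = [[row[j] for j in subset] for row in X]
t0 = time.perf_counter()
train_logreg(X_sub, y)
t_sub = time.perf_counter() - t0

print(f"full: {t_full:.3f}s  subset: {t_sub:.3f}s  "
      f"change: {100 * (t_sub / t_full - 1):.0f}%")
```

With 30 versus 2 features, the reduced model trains markedly faster here; Table 3 reports the analogous percentage changes for the paper's actual models and datasets.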
For example, in the breast cancer dataset, reducing the features from 30 to 12 or 8 resulted in up to a 50% reduction in training time, while maintaining the same performance across various ML models. However, the effect of FS on training time may vary: while FS could improve training efficiency for some datasets, it may require more training cycles for others. Additionally, we have highlighted the limitations of the Flower Pollination Optimization (FPO) algorithm and emphasised the importance of considering multiple evaluation metrics beyond accuracy alone. Finally, we note that this work forms part of our broader research project on healthcare assistant agents, encompassing various aspects, including ethical considerations [9], [10, 11].

Table 3: Models processing time (percentage change after FS).

| Model | FS  | Breast Cancer | Heart Failure | Kidney Disease |
| RF    | FPO | -7%           | +2%           | -2%            |
|       | PSO | -3.5%         | +1%           | -2%            |
|       | GA  | -5%           | +1%           | -4%            |
| DT    | FPO | -40%          | -8%           | -25%           |
|       | PSO | -40%          | -6%           | -11%           |
|       | GA  | -50%          | -8%           | -18%           |
| SVM   | FPO | -4%           | -6%           | -10%           |
|       | PSO | -10%          | +1%           | -0.6%          |
|       | GA  | -16%          | -8.5%         | -4%            |
| LR    | FPO | -20.5%        | -17%          | -67%           |
|       | PSO | -54%          | -57%          | -55%           |
|       | GA  | -54.5%        | -57%          | -57%           |
| KNN   | FPO | +11%          | -2%           | +3.6%          |
|       | PSO | +21%          | -4%           | +4%            |
|       | GA  | +10%          | -7%           | -0.009%        |

References

[1] I. Letteri, AITA: A new framework for trading forward testing with an artificial intelligence engine, in: F. Falchi, F. Giannotti, A. Monreale, C. Boldrini, S. Rinzivillo, S. Colantonio (Eds.), Proceedings of the Italia Intelligenza Artificiale - Thematic Workshops co-located with the 3rd CINI National Lab AIIS Conference on Artificial Intelligence (Ital-IA 2023), Pisa, Italy, May 29-30, 2023, volume 3486 of CEUR Workshop Proceedings, 2023, pp. 506–511.
[2] I. Letteri, G. D. Penna, G. D. Gasperis, A. Dyoub, Trading strategy validation using forward-testing with deep neural networks, in: Proceedings of the 5th International Conference on Finance, Economics, Management and IT Business, FEMIB 2023, Prague, Czech Republic, April 23-24, 2023, SCITEPRESS, 2023, pp. 15–25. doi:10.5220/0011715300003494.
[3] I. Letteri, G. D. Penna, G. D. Gasperis, Security in the internet of things: botnet detection in software-defined networks by deep learning techniques, Int. J. High Perform. Comput. Netw. 15 (2019) 170–182. doi:10.1504/IJHPCN.2019.106095.
[4] M. Petwan, K. R. Ku-Mahamud, A review on bio-inspired optimization method for supervised feature selection, International Journal of Advanced Computer Science and Applications 13 (2022). doi:10.14569/IJACSA.2022.0130516.
[5] I. Letteri, G. D. Penna, L. D. Vita, M. T. Grifa, MTA-KDD'19: A dataset for malware traffic detection, in: Proceedings of the Fourth Italian Conference on Cyber Security, Ancona, Italy, February 4th to 7th, 2020, volume 2597 of CEUR Workshop Proceedings, CEUR-WS.org, 2020, pp. 153–165. URL: https://ceur-ws.org/Vol-2597/paper-14.pdf.
[6] M. Khushi, K. Shaukat, T. M. Alam, I. A. Hameed, S. Uddin, S. Luo, X. Yang, M. C. Reyes, A comparative performance analysis of data resampling methods on imbalance medical data, IEEE Access 9 (2021) 109960–109975. doi:10.1109/ACCESS.2021.3102399.
[7] M. Sharawi, H. M. Zawbaa, E. Emary, Feature selection approach based on whale optimization algorithm, in: 2017 Ninth International Conference on Advanced Computational Intelligence (ICACI), IEEE, 2017, pp. 163–168.
[8] I. Letteri, G. D. Penna, P. Caianiello, Feature selection strategies for HTTP botnet traffic detection, in: 2019 IEEE European Symposium on Security and Privacy Workshops, EuroS&P Workshops 2019, Stockholm, Sweden, June 17-19, 2019, IEEE, 2019, pp. 202–210. doi:10.1109/EUROSPW.2019.00029.
[9] A. Dyoub, S. Costantini, F. A. Lisi, Learning domain ethical principles from interactions with users, Digit. Soc. 1 (2022). doi:10.1007/S44206-022-00026-Y.
[10] A. Dyoub, S. Costantini, I. Letteri, Care robots learning rules of ethical behavior under the supervision of an ethical teacher (short paper), in: P. Bruno, F. Calimeri, F. Cauteruccio, M. Maratea, G. Terracina, M. Vallati (Eds.), Joint Proceedings of the 1st International Workshop on HYbrid Models for Coupling Deductive and Inductive ReAsoning (HYDRA 2022) and the 29th RCRA Workshop on Experimental Evaluation of Algorithms for Solving Problems with Combinatorial Explosion (RCRA 2022) co-located with the 16th International Conference on Logic Programming and Non-monotonic Reasoning (LPNMR 2022), Genova Nervi, Italy, September 5, 2022, volume 3281 of CEUR Workshop Proceedings, 2022, pp. 1–8.
[11] A. Dyoub, S. Costantini, F. A. Lisi, I. Letteri, Logic-based machine learning for transparent ethical agents, in: F. Calimeri, S. Perri, E. Zumpano (Eds.), Proceedings of the 35th Italian Conference on Computational Logic - CILC 2020, Rende, Italy, October 13-15, 2020, volume 2710 of CEUR Workshop Proceedings, CEUR-WS.org, 2020, pp. 169–183. URL: https://ceur-ws.org/Vol-2710/paper11.pdf.