<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>International Journal of Advanced Computer Science and Applications</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.14569/IJACSA.2022.0130516</article-id>
      <title-group>
        <article-title>Leveraging Bio-Inspired Optimization Algorithms for Advanced Feature Selection in Chronic Disease Datasets</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Abeer Dyoub</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ivan Letteri</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Computer Science Department, University of Bari</institution>, <addr-line>Bari</addr-line>, <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Life, Health and Environmental Sciences, University of L'Aquila</institution>, <addr-line>L'Aquila</addr-line>, <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>2597</volume>
      <fpage>29</fpage>
      <lpage>30</lpage>
      <abstract>
        <p>In this study, we investigated the application of bio-inspired optimization algorithms for feature selection in chronic disease prediction. The primary goal was to enhance models' predictive accuracy, streamline data dimensionality, and make predictions more interpretable and actionable. The research encompassed a comparative analysis of three bio-inspired categories: evolutionary-based, swarm-intelligence-based, and ecology-based. For the feature selection method, we selected one algorithm from each category: Genetic Algorithms, Flower Pollination Optimization, and Particle Swarm Optimization, applying them across diverse chronic diseases including cancer, kidney, and cardiovascular diseases. The results demonstrate, in some cases, that the bio-inspired optimization algorithms effectively reduce the number of features required for accurate classification and, consequently, the convergence time. The findings underscore this work's potential impact on early intervention, precision medicine, and improved patient outcomes, providing new avenues for delivering healthcare services tailored to individual needs.</p>
      </abstract>
      <kwd-group>
        <kwd>Chronic Diseases Prediction</kwd>
        <kwd>Bio-Inspired Feature Selection</kwd>
        <kwd>Genetic Algorithms</kwd>
        <kwd>Flower Pollination Optimization</kwd>
        <kwd>Particle Swarm Optimization</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Chronic diseases pose a significant global health challenge, impacting morbidity and mortality rates. Early detection is crucial for prevention and personalised healthcare. Advanced analytics and AI offer the potential for revolutionising prediction in many fields such as finance [1] [2], cybersecurity [3], and, in particular, disease prediction.</p>
      <p>Supervised learning in various fields relies heavily on feature selection (FS) to reduce input dimensionality. Maintaining target class integrity amidst irrelevant characteristics is essential for accurate classification in the medical domain.</p>
      <p>Bio-inspired optimisation emulates behaviours found in various natural creatures such as fish, insects, bird swarms, terrestrial animals, reptiles, and humans, as well as other natural phenomena. These methods have been used for supervised feature selection (see [4]). The same source categorises bio-inspired optimisation algorithms into three groups based on their source of inspiration: swarm intelligence algorithms, evolutionary-based algorithms, and ecology-based algorithms. For robustness and diversity, we selected Genetic Algorithms (GA), Particle Swarm Optimisation (PSO), and Flower Pollination Optimisation (FPO), one from each category.</p>
      <p>We refine feature subsets from medical datasets encompassing cancer, kidney, and cardiovascular diseases to enhance model accuracy and simplify data dimensionality. The aim is to improve interpretability and practicality in chronic disease prediction.</p>
      <p>Investigating chronic diseases presents significant challenges in the healthcare domain. This study aims to improve predictive accuracy for chronic diseases by employing machine learning (ML) and feature selection (FS) techniques, which involve data collection, preprocessing, and performance assessment.</p>
      <p>The paper proceeds with an outline of the methodology in Section 2. Section 3 presents experimental findings, followed by a discussion in Section 4. Finally, Section 5 summarises key findings, limitations, and future directions.</p>
    </sec>
    <sec id="sec-2-0">
      <title>2. Methodology</title>
      <sec id="sec-1-1">
        <title>2.1. The Datasets</title>
      </sec>
      <sec id="sec-1-2">
        <title>2.4. Performance Evaluation Method</title>
      </sec>
      <sec id="sec-1-3">
        <title>2.2. Datasets Pre-processing</title>
        <p>Breast Cancer dataset: From the University of Wisconsin, this dataset involves cytological examinations to distinguish between benign and malignant tumours. It contains 569 samples and 31 features.</p>
        <p>Kidney Disease: Medical information on chronic kidney disease, collected over two months in India, is included in this dataset, available on Kaggle or UCI. It consists of 400 samples and 25 features.</p>
        <p>For each dataset detailed in section 2.1, every machine learning model is trained using 70% of the data and tested using the remaining 30%, employing all features as well as the features filtered by the PSO, FPO, and GA algorithms. This process is iterated 100 times, with each iteration involving shuffling the dataset. For each iteration, the dataset is split into training and testing sets to evaluate measures such as Accuracy, Recall, Precision, and F1-score.</p>
        <p>Heart failure dataset: Comprising medical records of heart failure patients during follow-up, this dataset contains 299 samples and 13 features.</p>
        <p>Each dataset has the “diagnosis” column with binary values used as targets for supervised learning of classifiers, where 0 denotes a negative outcome and 1 a positive one.</p>
        <p>Missing Values Imputation. Unaddressed missing data poses risks of performance degradation and biased results. We used the K-Nearest Neighbors (KNN) algorithm, known for its adaptability to diverse data types, to impute the missing values in the datasets.</p>
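        <p>KNN-based imputation of this kind can be sketched as below. This is a minimal illustration, assuming scikit-learn's KNNImputer; the toy matrix and the choice of two neighbours are our own, not values from the paper.</p>
        <preformat>
```python
import numpy as np
from sklearn.impute import KNNImputer

# toy matrix with one missing entry (np.nan)
X = np.array([[1.0, 2.0],
              [3.0, np.nan],
              [5.0, 6.0],
              [3.0, 4.0]])

# each missing value is replaced by the mean of that feature
# over the k nearest complete rows
X_filled = KNNImputer(n_neighbors=2).fit_transform(X)
print(X_filled)
```
        </preformat>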
        <p>Data Balancing. Balancing the datasets is a critical concern: classifiers struggle when faced with disparate class distributions, which leads to biased models. To mitigate this issue, we used the Synthetic Minority Over-sampling Technique combined with Edited Nearest Neighbors (SMOTEENN) [6], which addresses imbalanced datasets by oversampling the minority class with SMOTE and cleaning the majority class with the Edited Nearest Neighbors (ENN) method.</p>
        <p>Min-max Normalization. We applied this scaling method to normalize the datasets to a predefined range, as follows: x′ = (x − x_min) / (x_max − x_min), where x′ represents the normalized value of the feature, x is the original value of the feature, and x_min and x_max denote the minimum and maximum values of the feature, respectively.</p>
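        <p>The min-max formula above can be applied per feature (column) as in this minimal sketch; the toy matrix is our own illustration.</p>
        <preformat>
```python
import numpy as np

def min_max_normalize(x):
    """Apply x' = (x - x_min) / (x_max - x_min) to each column."""
    x = np.asarray(x, dtype=float)
    x_min = x.min(axis=0)
    x_max = x.max(axis=0)
    return (x - x_min) / (x_max - x_min)

X = np.array([[2.0, 100.0],
              [4.0, 300.0],
              [6.0, 500.0]])
print(min_max_normalize(X))  # every column now spans [0, 1]
```
        </preformat>
        <p>In practice scikit-learn's MinMaxScaler implements the same transformation and also supports scaling to ranges other than [0, 1].</p>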
      </sec>
      <sec id="sec-1-4">
        <title>2.3. Bio-inspired Feature Selection</title>
        <sec id="sec-1-4-1">
          <p>Following the data preparation stage, we applied the three aforementioned bio-inspired feature selection algorithms to each of the three datasets (see section 2.1). All algorithms employ the same fitness function, with the weighting value set to 0.99 to prioritize classification accuracy.</p>
          <p>For assessing fitness, we utilized the K-Nearest Neighbors (KNN) classifier, known for its efficiency and adopted by [7], as it does not necessitate a lengthy training phase. A neighbour count of k = 10 was used. The feature selection algorithms were configured with 20 agents (individuals) and 100 generations.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3-0">
      <title>3. Experiments and Results</title>
      <sec id="sec-1-5">
        <title>3.1. Breast Cancer Dataset</title>
        <sec id="sec-1-5-1">
          <title>The final features selected by the various algorithms are:</title>
          <p>We note that GA achieved a better fitness than FPO, even though both achieved the same percentage reduction in dimensionality on the breast cancer dataset. The two algorithms selected different sets of features. Regarding training time, we can observe that training times globally decreased, by up to a maximum of 54.5% for GA with LR on this dataset. With KNN, the time increased in all cases. Further metrics can be seen in figures 4, 5, 6.</p>
        </sec>
      </sec>
      <sec id="sec-1-6">
        <title>3.2. Heart Failure Dataset</title>
        <p>The final features selected by the three algorithms are:
• FPO: [’anaemia’, ’diabetes’, ’smoking’]
• GA: [’platelets’, ’serum sodium’, ’time’]
• PSO: [’platelets’, ’serum creatinine’]
The genetic algorithm achieved significantly higher fitness than FPO and PSO, even though the dimensionality reduction is almost the same. RF, DT, SVM and KNN achieved better performance on this dataset when combined with the GA algorithm. In general, however, the training times all decreased, with the maximum decrease of 57% by the LR model with both GA and PSO. See further metrics in figures 7, 8, 9.</p>
      </sec>
      <sec id="sec-1-7">
        <title>3.3. Kidney Disease Dataset</title>
        <sec id="sec-1-7-1">
          <title>The final features selected by the various algorithms are:</title>
          <p>• FPO: [’su’, ’rbc’, ’pcc’, ’pe’, ’ane’]
• GA: [’rbc’, ’bgr’, ’sod’, ’hemo’, ’pcv’, ’dm’, ’cad’]
• PSO: [’age’, ’su’, ’rbc’, ’pc’, ’bgr’, ’sod’, ’pot’, ’hemo’, ’pcv’, ’rc’, ’cad’]</p>
          <p>On this dataset, high performance was achieved with most models combined with PSO and GA, while with FPO there was a significant decrease in performance. There was a 55% decrease in processing time without loss of performance with the LR model combined with PSO, and a reduction in processing time of up to 57% with LR combined with GA, with a very slight reduction in performance. The highest fitness was achieved by GA with 7 features (a 70% reduction in dimensionality), while the lowest fitness was achieved by FPO, which had the highest feature reduction. See further metrics in figures 10, 11, 12.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>4. Discussion</title>
      <p>From Table 2, it is evident that GA emerged as the most effective FS technique in terms of performance. It consistently improved accuracy across various ML models and datasets, or maintained accuracy levels compared to pre-FS values with other techniques. The accuracy enhancement with GA ranged from 0.1% to 7%. Following GA, PSO ranked second in terms of accuracy performance among the three bio-inspired algorithms. While PSO did not notably enhance accuracy, it also did not lead to significant decreases. FPO exhibited diverse outcomes across different ML models and datasets. While accuracy decreases were marginal (less than 2.5%) for most ML models on the breast cancer dataset, there were more pronounced decreases on the heart failure and kidney disease datasets.</p>
      <p>In terms of training times, the impact was particularly
notable for DT and LR, as evidenced in Table 3.
Generally, training times decreased across all models when
employing feature selection (FS), except for K-Nearest
Neighbours (KNN) with breast cancer and kidney
disease datasets, where a significant increase of up to 21%
was observed. Minor fluctuations within ±2% in training
times were considered insignificant, likely due to
variable hardware conditions and software factors. Overall,
machine learning (ML) models exhibited reduced
training times with FS, especially DT and LR models with
GA and PSO. The most substantial reduction in
training time, up to 67%, was achieved by the LR model with
FPO on the Kidney disease dataset. Although FS did not
significantly improve ML model performance in most
cases, and even led to a decrease in performance in some
instances, the noteworthy decrease in processing times
without significant loss in accuracy represents a
significant achievement.</p>
      <p>The experimental findings indicate that the GA
outperformed other FS algorithms in terms of precision, recall,
and F1-measure. GA demonstrated superior performance
when paired with nearly all ML models compared to FPO
and PSO across all datasets. However, the PSO algorithm,
when combined with the LR model, exhibited slightly
higher recall and F1 scores for breast cancer and kidney
disease datasets, as well as marginally improved recall for
heart failure dataset. Conversely, FPO generally
exhibited the poorest performance when paired with various
ML models. Although FPO achieved the highest recall
when combined with the LR model on the heart failure
dataset, its overall performance was inferior. In terms of
fitness trends, GA displayed the most favourable results,
with PSO closely trailing behind, while FPO yielded
significantly lower fitness levels compared to GA and PSO.
Further experiments are planned to investigate the
behaviour of these FS algorithms with varying parameters,
datasets, and ML models.</p>
    </sec>
    <sec id="sec-3">
      <title>5. Conclusion</title>
      <p>Our experiments have highlighted the importance of
feature selection (FS) in improving the performance of
machine learning (ML) models. The impact of FS varies
depending on factors such as the chosen FS algorithm and
dataset characteristics [8]. FS holds the potential to
significantly enhance ML outcomes, especially for datasets
with a large number of features. For example, in the
breast cancer dataset, reducing features from 30 to 12
or 8 resulted in up to a 50% reduction in training time,
while maintaining the same performance across various
ML models. However, the effect of FS on training time
may vary. While FS could improve training efficiency for
some datasets, it may require more training cycles for
others. Additionally, we have highlighted the limitations of
the Flower Pollination Optimization (FPO) algorithm and
emphasised the importance of considering multiple
evaluation metrics beyond accuracy alone. Finally, we note
that this work forms part of our broader research project
on healthcare assistant agents, encompassing various
aspects, including ethical considerations [9], [10, 11].</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>