CatBoost model with self-explanatory capabilities for predicting SLE in OMAN population

Hamza Zidoum1*, Ali AlShareedah1, Aliya Al-Ansari2, Batool Al Lawati3, S. Al-Sawafi1

1 Department of Computer Science, Sultan Qaboos University, Muscat (Oman)
2 Department of Biology, Sultan Qaboos University, Muscat (Oman)
3 Department of Medicine, Sultan Qaboos University, Muscat (Oman)

Abstract

Systemic lupus erythematosus (SLE) is an autoimmune condition influenced by both genetic and environmental factors, with a diverse range of clinical symptoms and often unpredictable disease flares. Despite advancements in classification methods, the timely diagnosis of SLE remains a challenge for many patients. This research introduces an interpretable disease classification model that combines the predictive capabilities of CatBoost with the transparent interpretation tools offered by SHapley Additive exPlanations (SHAP). Trained on a local cohort of 219 Omani patients, comprising patients diagnosed with SLE and individuals with other control diseases, the CatBoost model demonstrates high performance. Moreover, the SHAP library enables the generation of individualized explanations for the model's decisions, highlighting key clinical features such as alopecia, renal disorders, cutaneous lupus, and hemolytic anemia, alongside patient age, which contribute significantly to the prediction process. The model achieved an AUC score of 0.945 and an F1-score of 0.92, underscoring its efficacy in SLE prediction.

Keywords

Systemic lupus erythematosus; CatBoost; Feature selection; Model interpretation; SHAP

1. Introduction

Systemic lupus erythematosus (SLE) is a chronic autoimmune disease affecting multiple systems of the body. Its clinical presentation varies across races, genders, and age groups, which makes diagnosis challenging [1].
Despite significant strides in SLE treatment strategies, diagnostic and therapeutic hurdles persist [2]. Early diagnosis remains particularly problematic because SLE symptoms emerge gradually over years and other conditions, including infectious and hematologic diseases, can mimic its manifestations [3]. Data analysis underscores the importance of diagnosing SLE within a narrow window, as delayed diagnosis correlates with increased flare rates, hospitalizations, and the risk of progressive organ damage, ultimately elevating mortality rates [4]. In Oman, where the mortality rate stands at 5% and the mean prevalence at 38 per 100,000 individuals, limited research exists on the specific clinical and serologic characteristics of the Omani population [5], [6].

This study contributes on three fronts. First, it identifies clinical manifestation patterns specific to the Omani population, filling a notable gap in the literature. Second, it introduces the CatBoost algorithm, known for its rapid computation, strong generalization, and high predictive accuracy, alongside the SHAP algorithm, RFECV-based feature selection, and GridSearchCV-based hyperparameter optimization. Third, by integrating the model's predictions with interpretability algorithms, it promotes self-explanatory models that let physicians cross-reference model outputs with their own expertise, enhancing diagnostic accuracy and fostering the adoption of machine learning in healthcare.

Late-breaking work, Demos and Doctoral Consortium, colocated with The 2nd World Conference on eXplainable Artificial Intelligence: July 17–19, 2024, Valletta, Malta
*Corresponding author
zidoum@squ.edu.com (H. Zidoum), alansari@squ.edu.om (A. Al-Ansari), sksawafi@squ.edu.om (S. Al-Sawafi)
https://orcid.org/0000-0003-0365-650X (H. Zidoum)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073

2. Method

DATASET

The dataset used in this study originates from the Rheumatology clinic at Sultan Qaboos University Hospital. Approval for the study was granted by the Ethics Committee of the College of Medicine and Health Science at Sultan Qaboos University (MERC # 1418 and 1650). Data extraction involved both structured and unstructured sources, including the hospital's Electronic Medical Record (EMR) system, TrakCare, which stores patients' demographic information, medical states, and histories. Demographic data were obtained directly from TrakCare, whereas clinical data were unstructured and retrieved from patients' medical histories in the form of clinical notes from each hospital visit.

The dataset comprised records of Omani patients from 2006 to 2019 who met the entry criteria outlined by EULAR/ACR, which require a positive antinuclear antibodies (ANA) test followed by the application of additional classification criteria. Non-Omani patients and those with insufficient data were excluded. The dataset encompassed 219 Omani patient records: 138 patients diagnosed with SLE, confirmed by rheumatologist assessment on a case-by-case basis, and 81 patients with other control diseases. Analysis revealed a female predominance of 92% and a mean age of 38. The Al Batinah Governorate accounted for the highest proportion of patients (37.9%), followed by Muscat (23.7%).

FEATURE SELECTION

The initial data comprised 20 clinical and demographic variables (referred to as "features" in machine learning), with no missing values detected. Categorical features were encoded using ordinal encoding, and Min-Max normalization was applied because the features vary in range.
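The two preprocessing steps just described can be illustrated with a minimal, standard-library-only sketch. It is not the authors' pipeline: the helper names `ordinal_encode` and `min_max_scale`, the category orderings, and the toy records are all assumptions made for illustration.

```python
def ordinal_encode(values, categories):
    """Map each categorical value to its index in the given category order."""
    index = {cat: i for i, cat in enumerate(categories)}
    return [index[v] for v in values]


def min_max_scale(values):
    """Rescale numeric values linearly to the interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no range; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical toy records (not taken from the study's dataset).
ages = [25, 40, 56, 33]
fever = ["No", "Yes", "No", "No"]
nephritis_class = ["None", "Class IV", "None", "Class II"]

# Assumed category orders; the paper does not state the orders it used.
fever_codes = ordinal_encode(fever, ["No", "Yes"])
nephritis_codes = ordinal_encode(
    nephritis_class, ["None", "Class II", "Class III", "Class IV", "Class V"]
)

# After scaling, every feature lies in [0, 1] regardless of its original range.
scaled_ages = min_max_scale(ages)
scaled_nephritis = min_max_scale(nephritis_codes)
```

Scaling matters mainly for the range-sensitive models in the comparison (SVM and MLP); tree-based models such as Random Forest and CatBoost are largely insensitive to monotone rescaling of a feature.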
To enhance the signal-to-noise ratio and select the most informative features, recursive feature elimination (RFE) based on Random Forest (RF) with ten-fold cross-validation (CV) was employed. RFE repeatedly fits a model, ranks the features by importance, removes the least important one, and refits on the remaining features until all features have been traversed; cross-validation then identifies the subset size with the best score.

Table 1
Characteristics of patients.

| Feature | Category | Occurrence (No., %, N=219) |
|---|---|---|
| Fever | Yes/No | 48 (21.9%) |
| Acute cutaneous lupus (ACL) | Yes/No | 70 (31.9%) |
| Chronic cutaneous lupus | Yes/No | 5 (2.28%) |
| Oral ulcers | Yes/No | 29 (13.2%) |
| Alopecia | Yes/No | 61 (27.8%) |
| Joint Involvement | Yes/No | 202 (92.2%) |
| Serositis | Yes/No | 9 (4%) |
| Renal Manifestation | Yes/No | 62 (28.3%) |
| Lupus Nephritis class | None | 112 (51%) |
| | Class II | 1 (0.4%) |
| | Class III | 4 (1.8%) |
| | Class IV | 16 (7.3%) |
| | Class V | 5 (2%) |
| Proteinuria | Yes/No | 51 (23%) |
| Vasculitis | Yes/No | 12 (5.4%) |
| Neurologic Disorder | None | 121 (55.2%) |
| | Psychosis | |
| | Seizure | 5 (2.2%) |
| Hemolytic Anemia | Yes/No | 12 (5.4%) |
| Leukopenia | Yes/No | 53 (24%) |
| Thrombocytopenia | Yes/No | 19 (8.6%) |
| | | 11 (5%) |

We developed four ML models to predict the presence or absence of SLE (Figure 1). In addition to three common ML models, namely multi-layer perceptron (MLP), support vector machine (SVM), and Random Forest, we introduced CatBoost [11].

Figure 1: Flowchart of the development and evaluation process

CatBoost is an ensemble learning algorithm in the gradient boosting family with some unique features. Its implementation of ordered boosting helps it handle categorical variables more effectively, which is particularly useful in real-world datasets where categorical features are common. The oblivious tree structure and random permutations are additional techniques that contribute to its robustness and efficiency, especially when dealing with categorical data. Ordered boosting calculates leaf values during the selection of the tree structure to reduce overfitting.
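The "ordered" principle behind CatBoost's handling of categorical features can be illustrated with a simplified ordered target statistic: examples are visited in a random permutation, and each categorical value is encoded using only the targets of examples visited earlier. The sketch below is an illustration of that idea, not CatBoost's actual implementation; the function name, the `prior` and `weight` smoothing parameters, and the toy data are assumptions.

```python
import random


def ordered_target_statistic(categories, targets, prior=0.5, weight=1.0, seed=0):
    """Encode a categorical column using only 'earlier' examples.

    Examples are visited in a random permutation. Each one is encoded from the
    targets of previously visited examples with the same category, smoothed
    toward `prior`. An example's own target never enters its own encoding.
    """
    order = list(range(len(categories)))
    random.Random(seed).shuffle(order)

    sums = {}    # running sum of targets seen so far, per category
    counts = {}  # running number of examples seen so far, per category
    encoded = [0.0] * len(categories)

    for i in order:
        cat = categories[i]
        s = sums.get(cat, 0.0)
        n = counts.get(cat, 0)
        encoded[i] = (s + weight * prior) / (n + weight)
        # Only now reveal this example's target to later examples.
        sums[cat] = s + targets[i]
        counts[cat] = n + 1
    return encoded


# Hypothetical toy data: province of residence and a binary label.
province = ["Muscat", "Batinah", "Muscat", "Batinah", "Muscat"]
label = [1, 0, 1, 1, 0]
encoded = ordered_target_statistic(province, label)
```

Because no example is encoded with its own label, the encoding does not leak the target into the feature, which is the kind of leakage the random permutations are meant to counter.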
The oblivious tree structure is used to construct CatBoost's model ensembles: all leaves are at the same level, and the same splitting criterion is applied to all intermediate nodes within the same level of the tree. This structure greatly improves speed and efficiency. Random permutations of the training examples are also applied to counter the prediction shift caused by a special kind of target leakage present in all existing implementations of gradient boosting algorithms [12].

To train and validate our model, the dataset was divided in a 70:30 ratio (i.e. 70% of the dataset was used for training and 30% for testing). Additionally, a subset of the training set was used for cross-validation to protect the models from overfitting and to optimize the models' parameters. Because the dataset is imbalanced, several metrics are used to evaluate classification performance: Recall, Specificity, F1-score, and AUC (Area Under the ROC Curve). The problem with classifying an imbalanced dataset is that the user is most interested in performance on the cases that are poorly represented in the data [13]. Standard evaluation criteria tend to focus on the most frequent cases and, if applied, could lead to sub-optimal classification models. Each model undergoes hyperparameter optimization through grid search with five-fold cross-validation. Finally, to avoid reporting biased results and to limit overfitting, we report the average of 10 repetitions for each model.

MODEL INTERPRETATION

In clinical applications, the ability to justify a prediction is as important as the prediction score itself, because in the highly sensitive medical environment a misclassification could have devastating consequences. It is therefore challenging to trust complex ML models for a number of reasons.
First, the models are often designed and rigorously trained on specific diseases in a narrow environment. Second, trust depends on the user's technical knowledge of statistics and ML. Third, how the data is labeled affects the results produced by the model [14]. For these reasons and more, interpretable ML has emerged as an area of research that aims to design transparent and explainable models by developing means to transform a black-box ML model into a white-box one. With transparent predictions, domain experts can interpret the results meaningfully.

In 2017, Lundberg and Lee proposed a unified framework to interpret ML predictions. SHapley Additive exPlanations (SHAP) is derived from Shapley values, a concept commonly used in cooperative game theory to determine the payout for each player within a cooperative coalition [14]. Casting this concept onto prediction models, the payout is mapped to the final prediction and the players are mapped to the model's features. The contribution of a feature to the final prediction is given by the magnitude and sign of its Shapley value: the magnitude represents the importance of the feature relative to the payout (prediction), and the sign indicates the direction in which it pushes the prediction. More importantly, this framework provides local and global interpretability simultaneously.

3. Results

Applying the RFE feature selection algorithm resulted in 13 optimal features (Figure 2). In Table 2, the selected features are indicated with a True value. Overall, three demographic features and 10 clinical features were selected.

Figure 2: Visualizing RFE's optimal number of features with 10-fold CV

Table 2
Comparing feature sets obtained from different feature selection algorithms.
| Feature Name | Selected by RFE |
|---|---|
| AGE | True |
| Disease Duration | True |
| PROV | True |
| Fever | True |
| ACL | True |
| Oral_Ul | True |
| Alopecia | True |
| Serositis | True |
| Renal | True |
| Proteinuria | True |
| Neurologic | True |
| Hemolytic_Anemia | True |
| Leukopenia | True |
| Age onset | False |
| Thrombocytopenia | False |

Comparing the performance of the different classifiers (Table 3), CatBoost had the highest AUC score, 0.956, giving it a slight edge in performance (Figure 3). This superiority was also reported in benchmarks against other recent classifiers (e.g. XGBoost and LightGBM) on a set of popular publicly available datasets [12]. In the training phase, the decision trees are built consecutively, with each successive tree focusing on reducing the loss left by the previous trees.

Figure 3: ROC plots for the 4 classification models considered in this work using the features produced by the RFE algorithm

Table 3
Comparing between the different classifiers

| Model | Precision | Recall | F1-score | AUC |
|---|---|---|---|---|
| SVM | 0.85 | 0.83 | 0.85 | 0.91 |
| Random Forest | 0.85 | 0.96 | 0.85 | 0.93 |
| CatBoost | 0.90 | 0.95 | 0.90 | 0.956 |
| MLP | 0.86 | 0.86 | 0.86 | 0.89 |

With the help of the SHAP algorithm, we can break down each prediction individually. As a demonstration, we took two individuals from the testing set: a 40-year-old patient who was predicted to have the disease and a 56-year-old patient who was not. Table 4 shows the non-normalized features of the two patients.

Table 4
Non-normalized values for test patients 1 and 2

| Feature Name | Patient 1 (positive for SLE) | Patient 2 (negative for SLE) |
|---|---|---|
| AGE | 40 | 56 |
| Disease_Duration | 21 | 13 |
| PROV | Dakhiliyah | Muscat |
| Fever | N | N |
| ACL | N | N |
| Oral_Ul | N | N |
| Alopecia | N | N |
| Serositis | N | N |
| Renal | Nephritis | N |
| Proteinuria | Y | N |
| Neurologic | N | N |
| Hemolytic_Anemia | N | N |
| Leukopenia | N | N |

The force plot attributes the positive prediction for patient 1 to renal disorders and the patient's age (Figure 4.a).
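To make concrete what such per-feature attributions mean, the following sketch computes exact Shapley values for a small, made-up scoring function by enumerating every coalition of features. It is not the paper's CatBoost model and does not use the SHAP library: `toy_score`, its coefficients, and the feature names are hypothetical, chosen only to echo the clinical features discussed here.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, features):
    """Exact Shapley values by enumerating all coalitions of features.

    value_fn(frozenset_of_features) returns the model output when only those
    features are 'present'. Cost grows exponentially with len(features).
    """
    n = len(features)
    phi = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for k in range(n):
            for subset in combinations(others, k):
                s = frozenset(subset)
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                # Marginal contribution of f when it joins coalition s.
                total += weight * (value_fn(s | {f}) - value_fn(s))
        phi[f] = total
    return phi


# Hypothetical scoring function (NOT the paper's model): a base score,
# additive effects, and one interaction between renal involvement and
# proteinuria.
def toy_score(present):
    score = 0.30  # base value with no features present
    if "renal" in present:
        score += 0.30
    if "age_40" in present:
        score += 0.10
    if "no_alopecia" in present:
        score -= 0.05
    if "renal" in present and "proteinuria" in present:
        score += 0.10
    return score


features = ["renal", "age_40", "no_alopecia", "proteinuria"]
phi = shapley_values(toy_score, features)
base = toy_score(frozenset())
full = toy_score(frozenset(features))
```

In this toy setting the attributions add up exactly to the gap between the base value and the full prediction, and the interaction term is split equally between the two features involved. Force and waterfall plots visualize this same decomposition, with positive values pushing the prediction up and negative values pushing it down.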
Since the values in Figure 4 are normalized, we cross-reference them with Table 4 and find that the patient is 40, which falls within the age group in which SLE is most active. Additionally, the patient has been diagnosed with lupus nephritis, a disease commonly caused by an autoimmune disorder. In contrast, patient 2 (Figure 4.b) displays no autoimmune manifestation and a long disease duration, and at 56 falls outside the age group in which SLE is most active.

Figure 4: Force plot of CatBoost model prediction for patient 1 (values are normalized).

In the waterfall plot in Figure 5.a, the feature with the highest SHAP value for patient 1 is renal disorders, by a large margin. The presence of a renal disorder therefore made the greatest contribution to the positive prediction of SLE for patient 1, followed by the age and province features. Overall, there were four blue features pushing the prediction probability lower, toward class 0. The absence of alopecia, hemolytic anemia, and ACL in patient 1's profile (Table 4) resulted in negative SHAP values. The remaining features had minimal impact on the prediction probability, as shown by their low SHAP values. The waterfall plot for patient 2 (Figure 5.b) indicates that age is the largest contribution toward class 0, followed by the absence of any renal disorders.

Figure 5: Waterfall plot of CatBoost model for patient 1. The waterfall plot displays SHAP values representing feature contribution toward a positive prediction.

Ranking Features.

In Figure 6, the older the patient, the less likely the patient is to have SLE, as shown by the red dots on the negative side of the SHAP value scale. The same can be said for disease duration: long disease durations without autoimmune manifestation correlated with the absence of SLE.
Experts point out, however, that SLE intensity rises and falls at intervals that differ from patient to patient, so on rare occasions clinical symptoms might not manifest until late phases of the disease [15]. Our result indicates that the higher the patient's age and disease duration, the less likely it is that SLE is the cause. Renal disorders rank highest in contribution, followed by alopecia, acute cutaneous lupus (ACL), and hemolytic anemia. The lowest contributing features are serositis, proteinuria, and leukopenia.

Figure 6: Summary plot of CatBoost model. The summary plot combines feature importance with feature effects.

4. Conclusions

In this study, the first SLE prediction model has been developed with our proposed self-explainable framework, which aims at establishing trust in ML prediction. The SHAP interpretation tool was implemented to explain and justify individual predictions and thereby help reduce the risk of misclassification. Additionally, a minimum set of 13 early predictors achieved the highest scores of 0.95 AUC and 0.92 F1-score. The dataset features comprise demographic and clinical symptoms available to physicians at early stages. By interpreting CatBoost predictions, we found that four clinical features had the highest influence on the prediction in addition to the patient's age: alopecia, renal disorders, cutaneous lupus, and hemolytic anemia. All are considered indicators of lupus activity at varying rates. Combined with the patient's age and age at onset, they allowed the model to establish a profile of the disease relative to the Omani population. With such scores, our model can predict with reasonable certainty the presence or absence of SLE. This can alert physicians to investigate further with the help of immunological tests such as the antinuclear antibodies test and the anti-dsDNA test.
Overall, our framework and its application can support a more practical introduction of machine learning and interpretation tools to medical diagnosis, thereby increasing the efficiency of medical testing and improving the chances of disease mitigation and management. This is expected to reduce the cost of medical care and decrease the number of unmitigated severe cases of SLE.

References

[1] Nisengard R. Diagnosis of systemic lupus erythematosus. Importance of antinuclear antibody titers and peripheral staining patterns. Archives of Dermatology. 1975;111(10):1298-1300.
[2] Felten R, Lipsker D, Sibilia J, Chasset F, Arnaud L. The history of lupus throughout the ages. Journal of the American Academy of Dermatology. 2020;
[3] Piga M, Arnaud L. The Main Challenges in Systemic Lupus Erythematosus: Where Do We Stand?. Journal of Clinical Medicine. 2021;10(2):243.
[4] Murimi-Worstell I, Lin D, Nab H, Kan H, Onasanya O, Tierce J et al. Association between organ damage and mortality in systemic lupus erythematosus: a systematic review and meta-analysis. BMJ Open. 2020;10(5):e031850.
[5] Al‐Adhoubi N, Al‐Balushi F, Al Salmi I, Ali M, Al Lawati T, Al Lawati B et al. A multicenter longitudinal study of the prevalence and mortality rate of systemic lupus erythematosus patients in Oman: Oman Lupus Study. International Journal of Rheumatic Diseases. 2021;24(6):847-854.
[6] Al Rasbi A, Abdalla E, Sultan R, Abdullah N, Al Kaabi J, Al-Zakwani I et al. Spectrum of systemic lupus erythematosus in Oman: from childhood to adulthood. Rheumatology International. 2018;38(9):1691-1698.
[7] Hancock J, Khoshgoftaar T. CatBoost for big data: an interdisciplinary review. Journal of Big Data. 2020;7(1).
[8] Anghel A, Papandreou N, Parnell T, et al. Benchmarking and Optimization of Gradient Boosting Decision Tree Algorithms. arXiv:1809.04559
[9] A. M, Brinks R, Dörner T, Daikh D, Mosca M, et al. European League Against Rheumatism (EULAR)/American College of Rheumatology (ACR) SLE classification criteria item performance. Annals of the Rheumatic Diseases. 2021;80(6):775-781.
[10] Eye A, Clogg C. Categorical variables in developmental research. San Diego: Academic Press; 1996.
[11] Prokhorenkova L, Gusev G, Vorobev A, et al. CatBoost: unbiased boosting with categorical features. arXiv:1706.09516
[12] Dorogush A, Ershov V, Gulin A. CatBoost: gradient boosting with categorical features support. arXiv:1810.11363
[13] Branco P, Torgo L, Ribeiro R. A survey of predictive modelling under imbalanced distributions. arXiv:1505.01658. 2015.
[14] Lundberg S, Lee S-I. A Unified Approach to Interpreting Model Predictions. arXiv:1705.07874
[15] Lalani S, Pope J, de Leon F, Peschken C. Clinical Features and Prognosis of Late-onset Systemic Lupus Erythematosus: Results from the 1000 Faces of Lupus Study. The Journal of Rheumatology. 2009;37(1):38-44.
[16] Binder A, Ellis S. When to order an antinuclear antibody test. BMJ. 2013;347(aug21 2):f5060-f5060.
[17] Wichainun R. Sensitivity and specificity of ANA and anti-dsDNA in the diagnosis of systemic lupus erythematosus: A comparison using control sera obtained from healthy individuals and patients with multiple medical problems. Asian Pacific Journal of Allergy and Immunology. 2013;31(4).
[18] Arnaud L, Mathian A, Boddaert J, Amoura Z. Late-Onset Systemic Lupus Erythematosus. Drugs & Aging. 2012;29(3):181-189.
[19] Beckwith H, Lightstone L. Rituximab in Systemic Lupus Erythematosus and Lupus Nephritis. Nephron Clinical Practice. 2014;128(3-4):250-254.
[20] Mahajan A, Amelio J, Gairy K, Kaur G, Levy R, Roth D et al. Systemic lupus erythematosus, lupus nephritis and end-stage renal disease: a pragmatic review mapping disease severity and progression. Lupus. 2020;29(9):1011-1020.
[21] Yu H, Nagafuchi Y, Fujio K. Clinical and Immunological Biomarkers for Systemic Lupus Erythematosus. Biomolecules. 2021;11(7):928.
[22] Giannouli S. Anaemia in systemic lupus erythematosus: from pathophysiology to clinical assessment. Annals of the Rheumatic Diseases. 2006;65(2):144-148.
[23] Nisengard R. Diagnosis of systemic lupus erythematosus. Importance of antinuclear antibody titers and peripheral staining patterns. Archives of Dermatology. 1975;111(10):1298-1300.