Invariant Feature Selection for Battery State of Health Estimation in Heterogeneous Hybrid Electric Bus Fleets

Yuantao Fan1,*, Mohammed Ghaith Altarabichi1, Sepideh Pashami1,2, Peyman Sheikholharam Mashhadi1 and Sławomir Nowaczyk1

1 Center for Applied Intelligent Systems Research (CAISR), Halmstad University, Kristian IV:s väg 3, 301 18 Halmstad, Sweden
2 Research Institutes of Sweden (RISE), Isafjordsgatan 28 A, 164 40 Kista, Sweden

Abstract
Batteries are a safety-critical and the most expensive component of electric buses (EBs). Monitoring their condition, or state of health (SoH), is crucial for ensuring the reliability of EB operation. However, EBs come in many models and variants, including different mechanical configurations, and are deployed to operate under various conditions. Developing new degradation models for each combination of settings and faults quickly becomes challenging due to the unavailability of data for novel conditions and the low evidence for less popular vehicle populations. Therefore, building machine learning models that can generalize to new and unseen settings becomes a vital challenge for practical deployment. This study aims to develop and evaluate feature selection methods for robust machine learning models that allow estimating the SoH of batteries across various settings of EB configuration and usage. Building on our previous work, we propose two approaches, a genetic algorithm for domain invariant features (GADIF) and causal discovery for selecting invariant features (CDIF). Both aim to select features that are invariant across multiple domains. While GADIF utilizes a specific fitness function encompassing both task performance and domain shift, CDIF identifies pairwise causal relations between features and selects the common causes of the target variable across domains. Experimental results confirm that selecting only invariant features leads to a better generalization of machine learning models to unseen domains. The contribution of this work comprises the two novel invariant feature selection methods, their evaluation on real-world EB data, and a comparison against state-of-the-art invariant feature selection methods. Moreover, we analyze how the selected features vary under different settings.

Keywords
State of Health Estimation, Invariant Feature Selection, Transfer Learning, Causal Discovery, Genetic Algorithm

HAII5.0: Embracing Human-Aware AI in Industry 5.0, at ECAI 2024, 19 October 2024, Santiago de Compostela, Spain.
* Corresponding author.
yuantao.fan@hh.se (Y. Fan); mohammed_ghaith.altarabichi@hh.se (M. G. Altarabichi); sepideh.pashami@hh.se (S. Pashami); peyman.mashhadi@hh.se (P. S. Mashhadi); slawomir.nowaczyk@hh.se (S. Nowaczyk)
ORCID: 0000-0002-3034-6630 (Y. Fan); 0000-0002-6040-2269 (M. G. Altarabichi); 0000-0003-3272-4145 (S. Pashami); 0000-0002-0051-0954 (P. S. Mashhadi); 0000-0002-7796-5201 (S. Nowaczyk)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction
The transition from fossil fuel to electromobility in the transportation ecosystem is well underway and accelerating as an increasing number of hybrid and full-electric vehicles are manufactured and commissioned. As the market size of electric buses grows worldwide, they have become a crucial part of a sustainable and environmentally friendly transportation solution [1].
The level of electrification in buses depends mainly on the powertrain configuration, which includes Battery Electric, Fuel Cell Electric, Hybrid Electric, etc. Hybrid electrification is a stepping stone towards full electrification; so far, both have a similar market share in the EU. The most common battery type powering the electric drive-line is the lithium-ion battery [2], since the technology matured for broad mass adoption more than a decade ago [3]. As major manufacturers, like Volvo, gradually electrify their products, new hybrid electric buses are regularly launched for public transportation [4, 5]. These hybrid electric buses (HEBs) are equipped with liquid-cooled packs of prismatic or cylindrical lithium-ion cells. However, the introduction of new electrical components means entering new, less well-known territory. For example, operating conditions, e.g., types of operation or temperature, can have a considerable impact on accelerated component wear in the electric drive-line. This raises questions about the robustness of the system.

Most conventional on-board diagnostic and monitoring methods [6, 7, 8, 9, 10, 11, 12] for batteries mainly utilize the internal parameters of the battery unit, based on experiments in a laboratory environment. Notably, the degradation process of the battery heavily depends on the usage pattern of the buses it powers. Often, the batteries and their management unit are deployed to various bus fleets with different mechanical configurations (e.g., single vs. double deck) and operations (e.g., urban vs. inter-city transportation). Therefore, developing robust battery deterioration models that generalize well from established to novel settings provides great value to the industry.

Drive batteries of HEBs are a good example of this challenge. Optimizing the deployment, usage, and maintenance of these components requires a better understanding of the component's health under diverse operational settings. Such batteries are expensive, and prolonging their effective life is crucial to achieving a sustainable transportation system, low-cost operation, and market satisfaction. This requires insights into which factors are relevant to battery deterioration; analyzing the importance of the available onboard signals for SoH prediction is of great help. The reliability of this expensive component becomes even more important as electromobility is implemented at scale. Continuous changes and improvements in battery technologies challenge the experts and statistical models currently in use. For example, different generations of batteries have different definitions of health indicators, and a model of one generation might not transfer to the next. Besides, the lack of historical data, due to the relatively recent deployment of the first electric drive-lines in the real world, leads to high model uncertainty.

The State of Health (SoH) is defined as the ratio between the nominal capacity at time t and the initial capacity [8]. It is a commonly used health indicator for evaluating the aging and deterioration level of the battery. Accurate estimation of the SoH can be considered a proxy for estimating the remaining useful life (RUL) and predicting the end-of-life (EOL) of batteries for maintenance planning.
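Written out, this definition takes the following standard form (the symbols C_nom(t) and C_init are our shorthand for the nominal capacity at time t and the initial capacity; quoting the ratio as a percentage is a common convention rather than something prescribed above):

```latex
% State of Health as the ratio of the nominal capacity at time t
% to the initial capacity, following [8]; often quoted as a percentage.
\mathrm{SoH}(t) = \frac{C_{\mathrm{nom}}(t)}{C_{\mathrm{init}}} \times 100\%
```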
Conventionally, the SoH is computed by a model developed by the manufacturer of the battery; such models are usually based on the internal parameters of the battery measured online [6, 13], e.g., the current, voltage, and temperature of the single cells [14]. Although the on-board SoH estimation via the battery management unit (BMU) was developed and specialized for the target energy system, in reality it may not always function as intended, e.g., due to faults or a mismatch of newly replaced sub-component versions. An alternative is to employ a data-driven approach, using machine learning models, e.g., [15, 16], based on sensor data collected onboard.

In dynamic environments, the behavior of the systems is expected to change over time. In such cases, the conditions under which a machine learning model is trained are likely to differ from the conditions under which it is tested. Thus, the performance of machine learning models deteriorates unless domain transfer is explicitly addressed. Moreover, the change in conditions between domains is often associated with a change in the joint distribution of input and output between source and target domains. These problems are commonly known as forms of dataset shift [17] and are addressed by transfer learning or domain adaptation methods [18]. An example approach is to select only a subset of features that generalize well to a new domain [19]. More concretely, in our experiments with modeling Li-ion drive batteries, the degradation behavior of hybrid buses varies considerably across different bus configurations and operating conditions. This means that the features that are important for model training might not be relevant for the different configurations on which the model will be tested. Even worse, they could have detrimental effects, leading to negative transfer. Our objective is to select features useful not only for the task in the source domain(s) but also ones that transfer well across different domains, including unseen ones.

Modern electric buses are complex machines equipped with many sub-systems controlled by electronic computing units and monitored by hundreds of sensors. However, not all sensor data collected are helpful in estimating the SoH of the batteries. Selecting a relevant subset of features that generalizes to various domains, e.g., different EB models and operating conditions, is crucial for building an effective SoH prediction model. In this paper, we propose two invariant feature selection methods, i.e., GADIF [20] and CDIF, and evaluate their performance for selecting a set of invariant features across different settings to estimate the SoH of batteries. We compare the two proposed approaches with a causal learning-based domain adaptation method that produces invariant models, presented by Rojas et al. in [21], and several classical feature selection methods under transfer learning settings, where the goal is to migrate to unseen domains. The experimental results show that using invariant features selected by the proposed approaches leads to a better generalization of the resulting models.

2. Related Work
The vast majority of studies on data-driven methods for estimating the SoH of batteries (reviews available in [6, 7, 8, 9]) assume that the target population is the same as the population available at training time. However, the deterioration characteristics of the battery might vary depending on the usage pattern of the equipment it powers.
Vehicles may be deployed to a new area, i.e., an unseen domain, where the terrain and traffic situation are very different from those of their peers in the previous deployment. The same type of battery and management unit may be installed in vehicles with different physical configurations. Under such circumstances, differences between vehicle populations, i.e., domains, need to be considered for SoH prediction. Recently, more work on SoH prediction has been conducted from the perspective of transfer learning, where the source domain is different from the target domain. The most popular approach is to utilize temporal information and sequence modeling, e.g., long short-term memory (LSTM) and gated recurrent unit (GRU) networks. For the SoH prediction task, this is done under an inductive transfer learning setting, i.e., a small amount of labeled data from the target domain is available. In their work [16], Tan et al. presented a model-based transfer approach using an LSTM network for SoH prediction. The base model was trained for generalization performance, and the last two fully connected layers were fine-tuned with a small subset of the target data. Che et al., in their work [22], adopted gated recurrent neural networks for SoH prediction under a transfer learning setting. The model was first trained using a relevant battery and fine-tuned with early degradation cycles from the testing battery. To address the variability in the length of the cycle sequences, Zhou et al. presented an interpretable transfer learning method [23] that uses cycle synchronization for data alignment and explored a few variations of GRUs to estimate the SoH of batteries in the target domain.

Genetic algorithms (GA) are widely used for feature selection [24, 25, 26] and have proven very promising in many applications [27, 28, 29, 30]. However, few works have considered scenarios in which heterogeneity exists between the training and testing samples. One needs to account for the possibility that the features useful for SoH regression in the source domain differ from those in the target domain. Causal feature learning [31, 32] is another promising approach for selecting informative features. It exploits the causal relationships learned from the observations, since these relationships imply the underlying mechanism of the variables. In particular, causal relationships that do not vary over different source domains are of great interest and are likely to be robust and able to generalize to future domains. Rojas et al. have proposed, in their work [21], an approach based on causal transfer learning to produce invariant models (IMCTL) that deal with covariate shift in a subset of features. Similarly, Persello et al. presented a kernel-based approach [33] for selecting features that minimize the dataset shift between the training and testing data. Magliacane et al. proposed an approach [34] that selects the subset of features leading to the best predictions in the source domains while satisfying an invariance condition, determined by the joint causal inference framework [35] of Mooij et al. Another perspective is to learn a common set of features for multiple tasks [36, 37], presented by Argyriou et al. The work [38] presented by Taghiyarrenani et al. focused on constructing a new feature space that aligns multiple domains with conditional shifts in the joint distribution, using a novel loss function that considers pairwise similarities between samples from different domains.
In their dedicated effort to study capacity degradation in batteries [39], Sahoo et al. proposed a two-layer artificial neural network model, with the second layer fine-tuned using a small portion of the target data, together with a novel feature that yields a low generalization error on batteries of different types and working conditions. Few works on causality-based feature selection have considered settings where the source domains are different from the target domains [32]. Neither the work proposed by Magliacane et al. [34] nor that of Rojas et al. [21] has considered domains that consist of time series data collected from multiple systems. In this study, we emphasize the specific challenges of the EB setting, which motivate the need for the proposed approaches beyond the state of the art. The first challenge, inspiring GADIF, is that the GA fitness function must take into account the heterogeneity in the available sample population, since extracting features that are invariant with respect to samples from different domains is of interest. The second, inspiring CDIF, is that causal graphs need to be learned from each system individually, using data with time dependencies, and invariant features shall be selected based on the presence of causal relationships across multiple different domains.

Figure 1: An illustration of the proposed CDIF method.

3. Method
This study targets an unsupervised domain adaptation problem, to be specific, a multi-source domain generalization problem, i.e., deploying machine learning models trained using labeled pairs of sample observations from one or more source domains D_S = {D_{S_1}, D_{S_2}, ..., D_{S_m}} to an unseen target domain D_T, without any labels. Given a source domain D_{S_i} and its entire vehicle population V_{S_i} = {v_1, v_2, ..., v_M}, the observations (labeled pairs) collected from vehicle v_i are denoted by X_{v_i} = {(\vec{x}_{(i,1)}, y_{(i,1)}), ..., (\vec{x}_{(i,n)}, y_{(i,n)})}. In this case, M is the number of vehicles, and X_{v_i} has dimensions (n × k), where n is the number of samples and k is the number of features. A sample instance \vec{x}_{(\cdot)} is a k-dimensional real-valued vector, \vec{x}_{(\cdot)} \in \mathbb{R}^k. The data set of a source domain D_{S_i} is simply the union of the samples from all vehicles V_{S_i} in that domain, i.e., D_{S_i} = \bigcup_{i=1}^{M} X_{v_i}. In contrast, sample observations in the target domain D_T are not labeled, i.e., D_T = \bigcup_{v_i \in V_T} \{\vec{x}_{(i,1)}, ..., \vec{x}_{(i,n)}\}. Nevertheless, the feature space \chi of the sample instances in the source and the target domains is the same.

In this study, we propose two approaches, i.e., using a genetic algorithm for domain invariant features (GADIF) and causal discovery for selecting invariant features (CDIF), to select invariant features that are expected to generalize based on several source domains {D_{S_1}, D_{S_2}, ..., D_{S_m}}, where m is the number of source domains for which labeled data is available. Similar to [21], our work relies on the assumption that if the conditional distribution of the target label given a feature subset is invariant across several source domains, then the same is likely to hold in the target domain. In other words, there exists a subset of features that is invariant in predicting the SoH values across all domains. One way to verify how well this assumption holds is to evaluate whether the presence of a causal relation from any feature to the target variable is consistent across various domains (an illustration is available in Figure 2).
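To make the notation concrete, the snippet below shows one plausible way of organizing such data in code; the variable names, the NumPy layout, and the random placeholder values are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

# Labeled source domains: each domain maps vehicles to (X, y) pairs, where X
# has shape (n, k) -- n samples, k features -- and y holds the SoH labels.
rng = np.random.default_rng(0)
source_domains = {
    "moderate": {"v1": (rng.random((500, 10)), rng.random(500)),
                 "v2": (rng.random((480, 10)), rng.random(480))},
    "fast":     {"v3": (rng.random((520, 10)), rng.random(520))},
}

# Unlabeled target domain: same feature space, but no SoH labels.
target_domain = {"v4": rng.random((510, 10))}

# The data set of a source domain is the union of its vehicles' samples.
X_moderate = np.vstack([X for X, _ in source_domains["moderate"].values()])
y_moderate = np.concatenate([y for _, y in source_domains["moderate"].values()])
```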
3.1. Causal discovery for invariant feature selection
We propose to use a causal discovery algorithm to learn causal relations in each vehicle and to select the set of features that are common causes of the target variable y among vehicles across all source domains. More specifically, we consider the probabilistic causation between two variables and compute the probability that a causal relation, e.g., that feature u_i causes y, is present in each source population, using it as a scoring criterion for selecting invariant features. An illustration of the CDIF method is presented in Figure 1; invariant features are extracted from two sets of causal graphs learned from individual vehicles in every source domain. For each vehicle v, a causal graph G_v = (U_v, E_v) is generated using a causal discovery method with additive noise models, proposed by Hoyer et al. in their work [40], where U_v denotes a set of nodes, U_v = {u_1, u_2, ..., u_k}, corresponding to the k features in the training instances of vehicle v, and E_v denotes a set of edges e^v_{ij}; e^v_{ij} = 1 indicates a causal relation directed from feature u_i to feature u_j in vehicle v, and, correspondingly, e^v_{ij} = 0 indicates no direct causal relation from u_i to u_j. Given a source domain D_{s_m} and its corresponding vehicle population V_{s_m}, card(V_{s_m}) causal graphs are learned (card(·) denoting the cardinality of a set, i.e., one graph per vehicle), and the number of occurrences of any causal relation in this vehicle population is computed as \sum_{v=1}^{card(V_{s_m})} e^v_{ij}. The probability of any causal relation given a specific source domain D_{s_m}, with a vehicle population V_{s_m}, is then:

P(e^v_{ij} = 1 \mid v \in V_{s_m}) = \frac{1}{card(V_{s_m})} \sum_{v \in V_{s_m}} e^v_{ij},    (1)

where e^v_{ij} indicates the causal relation from feature u_i to u_j in vehicle v. In this work, we only consider the target variable y as u_j. Given a number of source domains {D_{S_1}, D_{S_2}, ..., D_{S_m}} and their corresponding vehicle populations {V_{S_1}, V_{S_2}, ..., V_{S_m}}, we propose two criteria for selecting invariant features:

U_\theta = \bigcap_{S \in \{V_{S_1}, ..., V_{S_m}\}} \left\{ u_i \;\middle|\; \frac{1}{card(S)} \sum_{v \in S} e^v_{ij} \geq \theta, \; i \in \{1, 2, ..., k\} \right\}    (2)

U_\epsilon = \bigcup_{\substack{S_1, S_2 \in \{V_{S_1}, ..., V_{S_m}\} \\ S_1 \neq S_2}} \left\{ u_i \;\middle|\; \left| \frac{1}{card(S_1)} \sum_{v \in S_1} e^v_{ij} - \frac{1}{card(S_2)} \sum_{v \in S_2} e^v_{ij} \right| \leq \epsilon, \; i \in \{1, 2, ..., k\} \right\}    (3)

where θ and ε are the thresholds for selecting features u_i in U_θ and U_ε. Invariant features included in U_θ are the ones with a high presence of the causal relation, determined by θ, across all source domains; features included in U_ε are the ones with a similar presence, determined by ε, in at least a pair of the source domains.
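A minimal sketch of the two criteria is given below. It assumes that the per-vehicle causal discovery step (e.g., the additive-noise-model approach of [40]) has already been run, so that each vehicle is represented by a binary vector indicating whether an edge from each feature to the target y was found; the function names and data layout are illustrative, not the authors' code.

```python
import numpy as np

def edge_probabilities(edges_per_vehicle):
    """Eq. (1): fraction of vehicles in a population whose causal graph
    contains an edge from feature u_i to the target y."""
    E = np.asarray(edges_per_vehicle)          # shape (n_vehicles, k), entries in {0, 1}
    return E.mean(axis=0)                      # shape (k,)

def select_u_theta(populations, theta):
    """Eq. (2): features whose edge probability is at least theta
    in *every* source domain (intersection over domains)."""
    probs = [edge_probabilities(E) for E in populations]
    mask = np.all(np.stack(probs) >= theta, axis=0)
    return np.flatnonzero(mask)

def select_u_epsilon(populations, eps):
    """Eq. (3): features whose edge probability differs by at most eps
    between at least one pair of distinct source domains (union over pairs)."""
    probs = [edge_probabilities(E) for E in populations]
    selected = np.zeros(len(probs[0]), dtype=bool)
    for a in range(len(probs)):
        for b in range(a + 1, len(probs)):
            selected |= np.abs(probs[a] - probs[b]) <= eps
    return np.flatnonzero(selected)

# Toy example: two source domains with 3 and 2 vehicles, k = 4 features.
slow = [[1, 0, 1, 0], [1, 0, 0, 0], [1, 1, 1, 0]]
fast = [[1, 0, 1, 1], [1, 0, 1, 0]]
print(select_u_theta([slow, fast], theta=0.45))    # features causal in both domains
print(select_u_epsilon([slow, fast], eps=0.18))    # features with similar presence
```

With only two source domains, the pairwise criterion of Eq. (3) reduces to a single comparison; with more domains, a feature qualifies as soon as one pair of domains agrees to within ε.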
3.2. Genetic algorithm for domain invariant features
The proposed genetic algorithm (GA) [20] identifies good predictors of the target label in D_T based on their generalization performance across a number of source domains {D_{S_1}, D_{S_2}, ..., D_{S_m}} in a leave-one-domain-out cross-validation setting. The fitness function of GADIF is designed to evaluate a feature subset U by testing its performance across the source domains {D_{S_1}, D_{S_2}, ..., D_{S_m}}. Our algorithm GADIF is initiated with a population of individuals encoding feature subsets as chromosomes of binary strings, with 1 indicating the inclusion of the feature at the corresponding index. The algorithm proceeds from one generation to the next by applying crossover and mutation operators to the selected individuals in order to produce offspring. The process resembles natural selection by choosing the individuals with the best fitness as the most likely to propagate to the following generation. GADIF uses a variant of GA called CHC [41] to carry out the evolutionary search; it is essentially a wrapper method for feature selection and can be computationally expensive. Our feature selection method evaluates feature subsets according to their performance across all available labeled source domains. The optimization search of GADIF maximizes a fitness function that extends the equation provided by [42]:

f(U) = \alpha P + (1 - \alpha)\left(1 - \frac{N_f}{N_t}\right)    (4)

where P is the leave-one-domain-out cross-validation performance (e.g., based on the mean absolute error) of a machine learning model M trained in a wrapper setting using U, and α is a hyperparameter for the tradeoff between the model performance P across several source domains and the reduction in the number of selected features N_f with respect to the total number of features N_t.
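The sketch below illustrates how such a fitness value could be computed for one candidate feature subset, using the linear-regression/MAE setup described in Section 4; it assumes two or more labeled source domains and converts the leave-one-domain-out MAE into a "higher is better" score by negation, which is one possible choice since the exact normalization of P is not spelled out above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

def loo_domain_mae(domains, feature_mask):
    """Leave-one-domain-out CV: train on all source domains but one,
    evaluate MAE on the held-out domain, and average over hold-outs.
    `domains` maps a domain name to {"X": (n, k) array, "y": (n,) array}."""
    maes = []
    for held_out in domains:
        X_tr = np.vstack([d["X"] for name, d in domains.items() if name != held_out])
        y_tr = np.concatenate([d["y"] for name, d in domains.items() if name != held_out])
        X_te, y_te = domains[held_out]["X"], domains[held_out]["y"]
        model = LinearRegression().fit(X_tr[:, feature_mask], y_tr)
        maes.append(mean_absolute_error(y_te, model.predict(X_te[:, feature_mask])))
    return float(np.mean(maes))

def fitness(domains, feature_mask, alpha=0.5):
    """Eq. (4): alpha * P + (1 - alpha) * (1 - N_f / N_t).
    P must increase with quality for maximization; the negated average MAE
    is used here as one possible choice (our assumption)."""
    n_total, n_selected = feature_mask.size, int(feature_mask.sum())
    if n_selected == 0:
        return -np.inf                      # empty feature subsets are invalid
    P = -loo_domain_mae(domains, feature_mask)
    return alpha * P + (1 - alpha) * (1 - n_selected / n_total)
```

A binary chromosome such as np.array([1, 0, 1, ...], dtype=bool) can then be scored with fitness(), and the CHC-style search keeps the best-scoring chromosomes from one generation to the next.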
4. Experimental Results
The application scenario we focus on in this study is to build robust machine learning models using domain invariant features, i.e., features that generalize well from multiple source domains to an unseen target domain. For example, the target can be a new population with novel operating conditions that were not available at training time. Therefore, we have chosen a dataset from heterogeneous fleets of commercial buses consisting of approximately 1500 hybrid energy vehicles. They have different physical configurations (e.g., Double-Decker vs. Single-Decker) and were deployed for different usage (e.g., long-haul vs. city operations) in various locations. All of them are equipped with lithium-ion batteries. Although both numerical and categorical features are present in the dataset, we only focused on a set of numerical features (10 in total) to model SoH degradation, as these features are of the best data quality and availability. The SoH values are collected from the BMU, which is a component that does not necessarily last longer than the batteries. The age of the energy storage system (ESS) was post-processed (using repair records of the battery and BMU) and added to the feature set.

4.1. Experiment settings
Two sets of three experiments (six scenarios in total), listed in Table 1, were proposed to address the need for domain adaptation in the presence of a discrepancy between the source and the target domain. The first set of experiments (i.e., 1 to 3) focuses on finding domain invariant features that could generalize over different operating conditions. The second set of experiments (i.e., 4 to 6) focuses on generalizing over different physical configurations. They correspond to real-world scenarios in applications with heterogeneous groups of vehicles, where a freshly deployed fleet has a different configuration or operates under different conditions and thus requires transfer learning.

Table 1: Experiment scenarios

Scenario #   Source Domain D_S              Target Domain D_T
1            Moderate, Fast                 Slow
2            Slow, Fast                     Moderate
3            Slow, Moderate                 Fast
4            Double-decker, Articulated     Single-decker
5            Single-decker, Articulated     Double-decker
6            Single-decker, Double-decker   Articulated

The first three scenarios focus on the transfer learning setting of fleets with different transport cycles (operating conditions), with respect to the average speed. There are three different vehicle populations/domains, i.e., the "slow", the "moderate", and the "fast". The first domain of vehicles belongs to the transport cycle "slow", which corresponds to buses operating in city centers. Due to heavier traffic, frequent stops, and speed limits, this group of vehicles has an average speed under 12.5 kph. The second transport cycle domain, "moderate", includes buses traveling at a higher speed range (12.5 to 17.5 kph) and operating within the outer circle of cities. The group of "fast" vehicles travels at an average speed greater than 17.5 kph, which corresponds to buses operating between cities, traveling long-haul routes and highways. The second set of scenarios (i.e., 4 to 6) focuses on the transfer learning setting of vehicles with different physical configurations. The first domain in the second set is the single-decker buses; the second domain is the double-decker buses; the third domain is articulated buses, which are longer compared to the vehicles in the other two domains.

Figure 2 provides an overview of the presence of causal relations from each feature u_i to the target variable y in the different populations. The figure shows that the presence of features in the "moderate" and the "fast" populations is quite similar, as it is for the "double-decker" and the "articulated" vehicle populations.

Figure 2: Probability of causation of each feature to the target variable y in trained causal graphs; (a) experiment scenarios 1-3, (b) experiment scenarios 4-6.

For each experiment scenario, several well-known filter- and wrapper-based feature selection methods were tested to compare with the proposed approaches, including the Pearson index; feature importance measured by random forest (RF), extreme gradient boosting (XGB), and the coefficients of a linear regression (LR) model; and sequential feature selection (SFS) [43] with linear regression.
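As a concrete illustration of how these baselines can yield a fixed-size feature set (Section 4.2 reports results for the best 5 features), the sketch below ranks features by absolute Pearson correlation with the SoH target and by random-forest importance; it is a plausible reading of the setup, not the exact code used in the experiments.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def top_k_pearson(X, y, k=5):
    """Rank features by absolute Pearson correlation with the SoH target."""
    corr = pd.DataFrame(X).corrwith(pd.Series(y)).abs()
    return corr.sort_values(ascending=False).index[:k].tolist()

def top_k_rf(X, y, k=5, seed=0):
    """Rank features by random-forest impurity-based importance."""
    rf = RandomForestRegressor(n_estimators=200, random_state=seed).fit(X, y)
    return list(np.argsort(rf.feature_importances_)[::-1][:k])
```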
In addition, a recently proposed approach based on invariant models for causal transfer learning (IMCTL) [21] by Rojas et al., along with its variation utilizing greedy search (IMCTL-G), is tested and compared to the proposed approach based on causal discovery, i.e., causal discovery for invariant features (CDIF U_θ and U_ε). Moreover, we compared the genetic algorithm (GA) using labeled data from the entire dataset (including all domains) with the proposed GA variation GADIF, which uses the special fitness function described above. The GA trained with data from all domains serves as a reference and is expected to achieve the best performance. To evaluate the goodness of the features selected by the different methods under a fair setting, a linear regression model was chosen for the SoH regression task, using the Python implementation in scikit-learn [44]. The mean absolute error (MAE) was adopted as the evaluation metric for the SoH prediction task, and 4-fold cross-validation was performed on the vehicle level to measure the uncertainty.

The result section is organized as follows. We first present a comparison (in Table 2) of the proposed approaches against well-known methods across all six scenarios. Next, the uncertainty in the feature selection over different folds of the experiments is analyzed (Figures 3 and 4). We also study how the performance of different conventional feature selection methods depends on the design parameter k, the number of features to be selected. Finally, we show the performance of CDIF across different thresholds and numbers of selected features in Figure 7.

Figure 3: Features selected as invariant across different folds under experimental scenarios 1-3.

4.2. Methods performance comparison

Table 2: Performance (MAE) of SoH predictions for different feature selection methods across all six experiment scenarios (note that GA* uses both the source and target data for feature selection).

Method        Scenario 1    Scenario 2    Scenario 3    Scenario 4    Scenario 5    Scenario 6
GA*           1.50±0.01     1.59±0.01     1.88±0.01     2.05±0.00     1.53±0.01     1.48±0.01
Pearson       1.54±0.06     1.74±0.05     2.02±0.01     2.16±0.12     1.54±0.03     2.08±0.16
RF            1.62±0.07     1.72±0.07     1.99±0.04     2.19±0.14     1.54±0.03     1.78±0.10
LR            1.75±0.06     1.81±0.07     2.12±0.02     2.29±0.09     1.67±0.01     2.07±0.41
SFS           1.54±0.07     1.73±0.07     2.03±0.09     2.19±0.12     1.60±0.03     2.10±0.15
XGB           1.54±0.07     1.74±0.05     1.99±0.11     2.16±0.12     1.57±0.02     2.08±0.16
GADIF         1.54±0.02     1.65±0.01     1.96±0.02     2.27±0.03     1.53±0.01     1.67±0.03
IMCTL         2.08±0.68     1.97±0.43     5.55±4.98     2.70±0.71     3.47±0.99     4.33±3.56
IMCTL-G       10.75±5.11    4.01±4.02     6.80±4.85     11.99±0.66    10.70±1.01    10.25±1.75
CDIF(U_θ)     1.53±0.07     3.93±1.93     4.85±0.93     3.17±2.19     1.68±0.06     3.13±2.34
CDIF(U_ε)     1.53±0.08     1.68±0.13     2.10±0.14     2.22±0.23     1.65±0.03     2.96±1.86
All Features  1.53±0.07     1.72±0.09     2.06±0.14     2.19±0.21     1.68±0.05     2.07±0.18

Table 2 presents the MAE of modeling the SoH using the feature sets suggested by the different methods under the six scenarios. For the methods based on the Pearson index, RF, LR, SFS, and XGB, the best 5 features were selected. GA* corresponds to the approach using the genetic algorithm with the entire population from both the source and the target domain. GA* serves as a reference and is expected to achieve the best performance compared to the other methods, which only use the source population for feature selection. The two variations of CDIF, i.e., CDIF U_θ and CDIF U_ε, employed a threshold θ of 0.45 and a threshold ε of 0.18, respectively. The selection of the thresholds, i.e., θ and ε, is a trade-off between regression performance and the number of selected features. The proposed approach GADIF, with an α of 0.5, outperforms the other well-known methods under scenarios 2, 3, 5, and 6 (except for GA*, which is expected since it utilizes data from the target domain while GADIF does not). CDIF U_θ outperformed the other methods under scenario 1. For the rest of the scenarios, CDIF outperformed only IMCTL and IMCTL-G, which were also based on causal learning; however, it was not competitive against the classical methods. GADIF performs better under scenarios 5 and 6 compared to scenario 4.
The source population under scenarios 5 and 6 contains vehicles of both a simpler and a more complex/heavier structure, i.e., single-deckers together with one of the other two types. Intuitively, with the designed fitness function, GADIF can extract domain invariant features from a population with larger variation, compared to scenario 4, in which only vehicles with a complex/heavy structure are included in the source domains while the target domain contains only vehicles with a simpler/lighter structure. In addition, the standard deviation of the MAE values over 4-fold cross-validation of the GA-based methods is rather small (around 0.01), which indicates that the difference in the prediction performance of the GA-based methods between different scenarios is statistically significant.

4.3. Uncertainties in SoH modeling performance
The uncertainty (e.g., the standard deviation computed over the 4 folds) in the MAE performance measure, presented in Table 2, stems from the variations in the sample population. Therefore, the features selected by different methods might vary as well. Figures 3 and 4 illustrate the features selected by the different methods: the feature sets (top 5 features) suggested by the Pearson index under scenarios 1-6 remain the same over all four folds; the feature sets suggested by CDIF U_ε remain the same across all four folds under most scenarios (except for the fifth run); the feature sets suggested by RF, LR, XGB, and CDIF U_θ are quite stable, with fewer than seven mismatches. The deeper and larger the filled squares are, the more important the corresponding feature is; unfilled squares correspond to features selected by GA, GADIF, IMCTL, IMCTL-G, and CDIF; these methods are binary and do not entail a degree of importance. Neither GA nor GADIF requires one to specify the number of features to select, and the features selected by both methods vary similarly over the four folds; on the other hand, the features selected by IMCTL and IMCTL-G were quite different over the four folds for all scenarios.

Figure 4: Features selected as invariant across different folds under experimental scenarios 4-6.

The stability of the feature selection methods was estimated by computing the distance between the selected features from experiments of different cross-validation folds over the six scenarios. A stable and consistent feature selection method would select identical feature sets, e.g., across different cross-validation folds, yielding a zero sum of all pairwise distances. We computed the sum of all pairwise distances between the feature sets selected from the 4 folds using the Jaccard index and compared the different methods using a boxplot of the summed distances from all six scenarios. The results are presented in Figure 5.
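A minimal sketch of this stability computation is shown below; it interprets the pairwise distance as one minus the Jaccard index, so that identical fold-wise feature sets sum to zero, matching the "lower value indicates greater stability" reading of Figure 5. This is our assumption about the exact form used.

```python
from itertools import combinations

def jaccard_distance(a, b):
    """1 - Jaccard index; 0 when the two feature sets are identical."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

def stability(feature_sets_per_fold):
    """Sum of pairwise Jaccard distances over the folds of one scenario;
    a perfectly stable selector yields 0."""
    return sum(jaccard_distance(a, b)
               for a, b in combinations(feature_sets_per_fold, 2))

# Example: a method selecting the same 5 features in all 4 folds scores 0.
folds = [{"ess_age", "charge_time", "hybrid_time", "diesel_time", "total_time"}] * 4
print(stability(folds))   # 0.0
```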
The results show that the stability of the proposed approaches GADIF and CDIF U_θ was not significantly different from the majority of the methods (i.e., RF, LR, SFS, and GA); the feature sets selected by the Pearson index and by the proposed approach CDIF U_ε were the most stable across the six scenarios, while XGB appears to slightly outperform most of the methods; the feature sets selected using IMCTL and IMCTL-G were highly unstable.

Figure 5: Stability measured with the Jaccard index of the feature sets selected by different methods (a lower value indicates greater stability).

4.4. Performance based on design parameter k
Figure 6 illustrates the performance of the feature sets selected by different methods over the feature set size k, under experiment scenario 2. The gray horizontal line corresponds to the mean MAE of GADIF, and the margin corresponds to the standard deviation over 4-fold cross-validation. For most methods, it can be clearly seen that the performance starts to converge from k equal to 5. GADIF does not require specifying the number of features to select and outperforms the other methods regardless of the chosen k.

Figure 6: Performance (MAE) comparison of features selected by different methods for varying values of k, the feature set size, under experiment scenario 2.

Subfigures 7(a), 7(c), 7(e), and 7(g) illustrate the performance of CDIF U_θ and CDIF U_ε for different values of the thresholds θ and ε. Correspondingly, Figures 7(b), 7(d), 7(f), and 7(h) show the number of features selected versus the threshold. For experiment scenario 1 with the feature set U_θ, the MAE for the "slow" target population was significantly lower compared to scenarios 2 and 3. This is due to the presence of many features that cause the target variable y; therefore, more features were selected, which yielded better overall prediction power. In contrast, for scenarios 2 and 3, the presence of most features that cause the target variable y is quite low in the "slow" population compared to the other two populations. Thus, only a few features were selected for the regression task. For all six scenarios, with both feature sets U_θ and U_ε, the MAE decreases as the number of selected features increases; hence, in general, the more features selected, the better the performance, up to a saturation point beyond which there is no significant improvement, as is shown in Figure 6.

Figure 7: Performance (MAE) of CDIF over the threshold and the number of selected features k. Panels (a)/(b): MAE and k vs. the threshold of U_θ, scenarios 1-3; (c)/(d): MAE and k vs. the threshold of U_θ, scenarios 4-6; (e)/(f): MAE and k vs. the threshold of U_ε, scenarios 1-3; (g)/(h): MAE and k vs. the threshold of U_ε, scenarios 4-6.
5. Conclusions
This paper aims to develop and evaluate methods for selecting domain-invariant features. Such features are expected to lead to robust machine learning models with better generalization performance on unseen target populations. Two approaches, one based on causal discovery (CDIF) and the other based on an evolutionary algorithm (GADIF), were proposed to identify invariant features across several source domains in a multi-source domain generalization setting. The performance of the proposed approaches was compared with well-known feature selection methods in predicting the SoH of batteries on real-world data from heterogeneous fleets of HEBs, operating under different conditions and with different physical configurations. The main contribution of the paper is the comparison of well-known methods with the proposed approaches on their ability to choose invariant features for training SoH regression models that generalize well to unseen target domains. The experimental results show that GADIF outperforms the other methods by selecting features that generalize better to unseen domains in 4 out of 6 testing scenarios, while CDIF outperforms the other methods in one scenario. Future efforts for the further development of CDIF include exploring alternatives for generating the causal graph of each vehicle; criteria for suggesting thresholds under different settings; extending the experiments to settings involving larger numbers of source domains; and a more holistic evaluation metric with balanced weighting of both prediction accuracy and the stability of the selected domain-invariant features.

Acknowledgments
The work was carried out with support from the Knowledge Foundation and Vinnova (Sweden's innovation agency) through the Vehicle Strategic Research and Innovation Programme FFI.

References
[1] J.-Q. Li, Battery-electric transit bus developments and operations: A review, International Journal of Sustainable Transportation 10 (2016) 157–169.
[2] M. Mahmoud, R. Garnett, M. Ferguson, P. Kanaroglou, Electric buses: A review of alternative powertrains, Renewable and Sustainable Energy Reviews 62 (2016) 673–684.
[3] P. Pichler, M. Kapaun, Lithium ion batteries for hybrid busses and hybrid commercial vehicles - being ready for the broad mass? (Li-Ion Batterien fuer Hybridbusse und Hybrid-Nutzfahrzeuge. Startklar fuer die breite Masse?), 2009.
[4] S. Borén, Electric buses' sustainability effects, noise, energy use, and costs, International Journal of Sustainable Transportation 14 (2020) 956–971.
[5] Volvo group invests in optibus to speed up on bus electrification and digitalization, https://www.sustainable-bus.com/its/optibus-volvo-investment/, Accessed: 2021-11-12.
[6] M. H. Lipu, M. Hannan, A. Hussain, M. Hoque, P. J. Ker, M. H. M. Saad, A. Ayob, A review of state of health and remaining useful life estimation methods for lithium-ion battery in electric vehicles: Challenges and recommendations, Journal of Cleaner Production 205 (2018) 115–133.
[7] A. Fotouhi, D. J. Auger, K. Propp, S. Longo, M. Wild, A review on electric vehicle battery modelling: From lithium-ion toward lithium–sulphur, Renewable and Sustainable Energy Reviews 56 (2016) 1008–1021.
[8] A. Barré, B. Deguilhem, S. Grolleau, M. Gérard, F. Suard, D. Riu, A review on lithium-ion battery ageing mechanisms and estimations for automotive applications, Journal of Power Sources 241 (2013) 680–689.
[9] M. Berecibar, I. Gandiaga, I. Villarreal, N. Omar, J. V. Mierlo, P. V. den Bossche, Critical review of state of health estimation methods of li-ion batteries for real applications, Renewable and Sustainable Energy Reviews 56 (2016) 572–587.
[10] J. Du, X. Zhang, T. Wang, Z. Song, X. Yang, H. Wang, M. Ouyang, X. Wu, Battery degradation minimization oriented energy management strategy for plug-in hybrid electric bus with multi-energy storage system, Energy 165 (2018) 153–163.
[11] J.-H. Lee, I.-S. Lee, Estimation of online state of charge and state of health based on neural network model banks using lithium batteries, Sensors 22 (2022) 5536.
[12] N. Xu, Y. Xie, Q. Liu, F. Yue, D. Zhao, A data-driven approach to state of health estimation and prediction for a lithium-ion battery pack of electric buses based on real-world data, Sensors 22 (2022) 5762.
[13] B. Pattipati, K. Pattipati, J. P. Christopherson, S. M. Namburu, D. V. Prokhorov, L. Qiao, Automotive battery management systems, IEEE, 2008.
[14] S. Rothgang, M. Rogge, J. Becker, D. U. Sauer, Battery design for successful electrification in public transport, Energies 8 (2015) 6715–6737.
[15] D. Roman, S. Saxena, V. Robu, M. Pecht, D. Flynn, Machine learning pipeline for battery state-of-health estimation, Nature Machine Intelligence 3 (2021) 447–456.
[16] Y. Tan, G. Zhao, Transfer learning with long short-term memory network for state-of-health prediction of lithium-ion batteries, IEEE Transactions on Industrial Electronics 67 (2020) 8723–8731.
[17] J. Quiñonero-Candela, M. Sugiyama, N. D. Lawrence, A. Schwaighofer, Dataset shift in machine learning, MIT Press, 2009.
[18] S. J. Pan, Q. Yang, A survey on transfer learning, IEEE Transactions on Knowledge and Data Engineering 22 (2009) 1345–1359.
[19] S. Uguroglu, J. Carbonell, Feature selection for transfer learning, in: Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Springer, 2011, pp. 430–442.
[20] M. G. Altarabichi, Y. Fan, S. Pashami, P. S. Mashhadi, S. Nowaczyk, Extracting invariant features for predicting state of health of batteries in hybrid energy buses, in: 2021 IEEE 8th International Conference on Data Science and Advanced Analytics (DSAA), IEEE, 2021, pp. 1–6.
[21] M. Rojas-Carulla, B. Schölkopf, R. Turner, J. Peters, Invariant models for causal transfer learning, The Journal of Machine Learning Research 19 (2018) 1309–1342.
[22] Y. Che, Z. Deng, X. Lin, L. Hu, X. Hu, Predictive battery health management with transfer learning and online model correction, IEEE Transactions on Vehicular Technology 70 (2021) 1269–1277.
[23] K. Q. Zhou, Y. Qin, C. Yuen, Transfer-learning-based state-of-health estimation for lithium-ion battery with cycle synchronization, IEEE/ASME Transactions on Mechatronics (2022).
[24] S. Sindhiya, S. Gunasundari, A survey on genetic algorithm based feature selection for disease diagnosis system, in: Proceedings of IEEE International Conference on Computer Communication and Systems ICCCS14, IEEE, 2014, pp. 164–169.
[25] M. Rostami, K. Berahmand, S. Forouzandeh, A novel community detection based genetic algorithm for feature selection, Journal of Big Data 8 (2021) 1–27.
[26] P. Dhal, C. Azad, A comprehensive survey on feature selection in the various fields of machine learning, Applied Intelligence (2021) 1–39.
[27] Z. Ullah, S. R. Naqvi, W. Farooq, H. Yang, S. Wang, D.-V. N. Vo, et al., A comparative study of machine learning methods for bio-oil yield prediction – a genetic algorithm-based features selection, Bioresource Technology 335 (2021) 125292.
[28] A. A. Ewees, M. A. Al-qaness, L. Abualigah, D. Oliva, Z. Y. Algamal, A. M. Anter, R. Ali Ibrahim, R. M. Ghoniem, M. Abd Elaziz, Boosting arithmetic optimization algorithm with genetic algorithm operators for feature selection: case study on cox proportional hazards model, Mathematics 9 (2021) 2321.
[29] N. Maleki, Y. Zeinali, S. T. A. Niaki, A k-nn method for lung cancer prognosis with the use of a genetic algorithm for feature selection, Expert Systems with Applications 164 (2021) 113981.
[30] P. Z. Lappas, A. N. Yannacopoulos, A machine learning approach combining expert knowledge with genetic algorithms in feature selection for credit risk assessment, Applied Soft Computing 107 (2021) 107391.
[31] K. Chalupka, F. Eberhardt, P. Perona, Causal feature learning: an overview, Behaviormetrika 44 (2017) 137–164.
[32] K. Yu, X. Guo, L. Liu, J. Li, H. Wang, Z. Ling, X. Wu, Causality-based feature selection: Methods and evaluations, ACM Computing Surveys (CSUR) 53 (2020) 1–36.
[33] C. Persello, L. Bruzzone, Kernel-based domain-invariant feature selection in hyperspectral images for transfer learning, IEEE Transactions on Geoscience and Remote Sensing 54 (2015) 2615–2626.
[34] S. Magliacane, T. Van Ommen, T. Claassen, S. Bongers, P. Versteeg, J. M. Mooij, Domain adaptation by using causal inference to predict invariant conditional distributions, Advances in Neural Information Processing Systems 31 (2018).
[35] J. M. Mooij, S. Magliacane, T. Claassen, Joint causal inference from multiple contexts (2020).
[36] A. Argyriou, T. Evgeniou, M. Pontil, Multi-task feature learning, in: Advances in Neural Information Processing Systems 19, 2007.
[37] A. Argyriou, M. Pontil, Y. Ying, C. Micchelli, A spectral regularization framework for multi-task structure learning, Advances in Neural Information Processing Systems 20 (2007).
[38] Z. Taghiyarrenani, S. Nowaczyk, S. Pashami, M.-R. Bouguelia, Multi-domain adaptation for regression under conditional distribution shift, Available at SSRN 4197949 (2022).
[39] S. Sahoo, K. S. Hariharan, S. Agarwal, S. B. Swernath, R. Bharti, S. Han, S. Lee, Transfer learning based generalized framework for state of health estimation of li-ion cells, Scientific Reports 12 (2022) 1–12.
[40] P. Hoyer, D. Janzing, J. M. Mooij, J. Peters, B. Schölkopf, Nonlinear causal discovery with additive noise models, Advances in Neural Information Processing Systems 21 (2008).
[41] L. J. Eshelman, The CHC adaptive search algorithm: How to have safe search when engaging in nontraditional genetic recombination, in: Foundations of Genetic Algorithms, volume 1, Elsevier, 1991, pp. 265–283.
[42] S. M. Vieira, L. F. Mendonça, G. J. Farinha, J. M. Sousa, Modified binary PSO for feature selection using SVM applied to mortality prediction of septic patients, Applied Soft Computing 13 (2013) 3494–3504.
[43] S. Raschka, Mlxtend: Providing machine learning and data science utilities and extensions to Python's scientific computing stack, The Journal of Open Source Software 3 (2018).
URL: http://joss.theoj.org/papers/10.21105/joss.00638. doi:10.21105/joss.00638. [44] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay, Scikit-learn: Machine learning in Python, Journal of Machine Learning Research 12 (2011) 2825–2830.