A Few Models to Rule Them All: Aggregating Machine Learning Models

Florian Siepe1,2, Phillip Wenig3 and Thorsten Papenbrock1
1 Philipps University of Marburg, Marburg, Germany
2 Viessmann IT Service GmbH, Allendorf (Eder), Germany
3 Hasso Plattner Institute, University of Potsdam, Potsdam, Germany
Contact: sifr@viessmann.com (F. Siepe); phillip.wenig@hpi.de (P. Wenig); papenbrock@informatik.uni-marburg.de (T. Papenbrock)
ORCID: 0009-0008-5911-5327 (F. Siepe); 0000-0002-8942-4322 (P. Wenig); 0000-0002-4019-8221 (T. Papenbrock)
LWDA'23: Lernen, Wissen, Daten, Analysen. October 09–11, 2023, Marburg, Germany

Abstract
Many manufacturers of electrical installations in smart home environments have developed and now offer AI solutions that record and analyze the sensor data from their products. Their goal is to monitor and forecast runtime parameters, such as the energy consumption of heat generators or the cooling performance of air conditioning systems, for predictive maintenance and to optimize the carbon footprint. The training and deployment of such AI models can, though, be costly, necessitating intelligent techniques to consolidate, i.e., aggregate models of individual installations into fewer, but larger models. The aggregation of AI models, however, poses a challenging task due to the complexity of the systems and the variability of (hidden) factors that influence the forecasts. To solve the aggregation challenge, improve the forecasting accuracies and ultimately also reduce the AI deployment costs, this paper explores the concept of consolidating similar machine learning models with a novel clustering approach. We introduce CAML, a novel technique for (C)lustering and (A)ggregating (M)achine (L)earning models with shared characteristics. The clusters effectively capture the unique features of the contained models and can be combined into fewer AI models. Our evaluation shows that the hidden parameters learned by the baseline models are key factors in achieving accurate performance, underlining the importance of these models in the clustering process. Moreover, we demonstrate that by choosing the right model architecture, cluster models offer a higher prediction certainty while exhibiting an only slightly higher average error compared to baseline models. Our experimental results show that CAML outperforms alternative clustering techniques in terms of prediction error and variance across multiple cluster configurations.

Keywords
Machine Learning, Clustering, Model Aggregation, Energy Consumption, Heat Generator

1. Consolidation of AI Model Deployments

In today's digital age, service providers frequently utilize AI and machine learning technologies to analyze the vast swaths of collectable sensor data, offering valuable insights to their customers. A common application is to train individual prediction models for each customer on their specific data. However, this results in a multitude of models that need to be trained, deployed, and maintained, especially in cloud environments, ultimately costing the company a lot of resources and money. In this paper, we investigate this challenge in the context of heat generator installations and their energy forecasting models.
Accurate energy consumption forecasts are critical for various stakeholders, including utility companies, building managers, and residents, because they enable optimizations in energy usage, predictive maintenance and ultimately cost reductions. The models employed for this purpose use historical sensor data to predict future energy consumption curves. Through careful feature engineering, they are also able to reflect parameter changes (e.g. adjusting the indoor temperature) in their predictions. However, accurate forecasts depend not only on features that are captured by sensors, such as temperatures and power consumptions, but also on various hidden parameters specific to the installation environment, such as the size, age, and insulation type of the building that houses the heating system.

Heating systems traditionally keep the supply temperature at a constant level, heating water to a specific temperature and circulating it to provide warmth. This approach may, however, not be the most energy-efficient method, particularly during milder weather. Modern heating systems, therefore, use a dynamic approach, adjusting the supply temperature based on the outside temperature measured by an outdoor temperature sensor. This relationship is graphically represented by a heating curve, a crucial part of a heating system's control strategy. These curves can be tailored to the residents' preferences, which is an important variable for the accuracy of energy consumption forecasts. Additional factors include the desired indoor temperature, which is set via a thermostat, and the application of a night setback concept. The latter lowers the temperature during unoccupied periods like nighttime to save energy. These strategies usually make modern heating systems more efficient, but the interplay of these novel features combined with the hidden housing parameters of the installations makes it difficult for manufacturers to develop generalizable AI models, such as the global forecasting model shown in Figure 1a. Instead, the current practice is to train, deploy and maintain a forecasting model (with potentially different hyper-parameters and model architectures) for every installation, as shown in Figure 1b. Although the predictions of these models are decently accurate, their deployment and maintenance is expensive.

Despite the complex feature interactions and the numerous hidden parameters, many installations and, hence, their AI models are sufficiently similar, such that, as depicted in Figure 1c, a smaller set of AI models could manage their forecasts. To find these partially generalizable models, we propose to cluster similar heat generator models and consolidate every cluster into only one cluster model. These cluster models are expected to encapsulate the unique features and encoded knowledge of each cluster. In our setting, this encoded knowledge refers to constant parameters that are relevant in the learning process but not directly visible within the data set, such as the size or insulation of buildings. By consolidating 𝑛 individual models into 𝑚 cluster models with an effective clustering, the number of models necessary for deployment and serving can be reduced significantly.
To find these clusterings, we evaluate three different approaches: a) clustering the models by their underlying time series training data; b) clustering the models by their output data, which is their predictions; and c) clustering the models by their cross performance, which is their performance on each other's test data. We demonstrate that approach c) delivers the most promising results, making it our suggested solution.

Figure 1: Three types of deployment architectures: the theoretically ideal global model approach (1a), the current individual models approach (1b), and the target cluster model approach (1c). (a) Ideal: one global forecasting model for all heat generators. (b) Current: each heat generator corresponds to one individual model. (c) Target: each of the 𝑛 heat generators is assigned to one of 𝑚 < 𝑛 different cluster models. Heat generator installations send their sensor recordings into the cloud and receive temperature and energy forecasts.

More specifically, we introduce CAML, a novel technique for (C)lustering and (A)ggregating (M)achine (L)earning models of any type. CAML uses a custom cross performance similarity function and hierarchical clustering to find clusters of AI models with similar hidden features; it then trains new cluster models on the clusters' training data to consolidate the member models into one representative model. Our evaluation demonstrates that cross performance is more effective than input- and output-based model similarities; it also shows that the cluster models are significantly more accurate than a global model and almost as accurate as the many individual models. Hence, our key contributions are as follows:

1. Cross performance: We propose an effective similarity measure for model clustering that compares two models by their pair-wise test accuracies.
2. Hierarchical clustering: We introduce an unsupervised, hierarchical clustering approach with Ward-linkage based on the cross performance similarity measure that grants the ability to configure the number of target clusters, i.e., the trade-off between deployment costs and forecasting accuracy.
3. CAML: We present an algorithm that automates the entire model consolidation pipeline, including preprocessing, matrix computation, clustering, and subsequent retraining.

2. Related Work

Clustering is an unsupervised machine learning technique that partitions objects into (mutually exclusive) groups based on their similarity w.r.t. a specific similarity measure. Clustering algorithms can be categorized into several types. These include partitioning methods that divide the data set into a pre-set number of clusters (e.g. K-Means [1]); hierarchical methods that aim to create a cluster linkage tree that models the relationships in the data best (e.g. Bottom-Up [2]); and density-based methods that recognize clusters as high-density regions of objects separated by regions of low density (e.g. Meanshift [3] or DBSCAN [4]). Most of these standard clustering algorithms either rely on the Euclidean distance as a fixed distance metric (e.g., K-Means), which makes them rather inflexible, or the number of desired clusters cannot be directly specified (e.g., Meanshift, DBSCAN). In our work, we therefore focus on hierarchical clustering, which can handle any given distance metric and desired number of clusters.
Time Series Clustering algorithms can be used to cluster AI models by their input and/or output data. Bandara et al., for example, use selected features extracted from time series data, such as trend or seasonality, to apply standard clustering algorithms [5]. A similar approach is employed by Räsänen and Kolehmainen, who focus specifically on energy consumption data [6]. Algorithms for time series clustering in general rely on time series distance measures, such as Dynamic Time Warping (DTW) [7], Move-Split-Merge (MSM) [8], or Time-Warp-Edit (TWE) [9]. Paparrizos and Gravano presented k-Shape, an algorithm for clustering univariate time series, which has proven to be effective in several works [10, 11, 12]. We therefore chose the shape-based distance (SBD) in combination with hierarchical clustering for our reference clustering approaches that are based on the input time series similarities (see approach a)).

Model Consolidation describes the process of combining several individual models into a single model. Bakker and Heskes demonstrated that in many cases a smaller set of representative models can adequately summarize an ensemble of neural network models [13]. In their approach, they leveraged the outputs of these models on a static data set for cluster assignment, followed by an optimization step that selects as cluster representative the model that minimizes the average distance between the models of the cluster and its center. This approach is similar to our baseline approach b), which also relies on the models' outputs but instead uses the outputs as a direct embedding within the clustering procedure. A common technique for aggregating models is ensemble learning. Sarkar et al. discuss the use of different ensemble learning models for short-term electric load forecasting. The authors used real-time load data and meteorological parameters for data analysis [14]. Khan et al. present a spatial and temporal forecasting model ensemble of LSTM and GRU deep neural networks for short-term electric consumption forecasting [15]. Another common technique is response-based model distillation as formulated by Hinton et al., which summarizes a large but precise teacher model within a smaller student model [16]. The proposed single-teacher distillation can be extended to a multi-teacher distillation for ensemble learning by either aggregating the teachers' responses or by weighting the responses [17]. In summary, most of the mentioned works imply training large, global models. Our experiments will show that, in our specific domain, training an accurate, global teacher model is not possible, which eliminates model distillation as an option for our model consolidation use case. Using model distillation on multiple global models is possible but typically impractical and expensive due to high computational and time demands. This paper mainly discusses the clustering and aggregation of pre-trained models to form new consolidated models, a strategy distinct from federated learning, which centers on training a single model across many decentralized nodes.

3. CAML - Clustering and Aggregating Machine Learning Models

Figure 2: Architecture of CAML with its four major steps: preprocessing, distance matrix computation, hierarchical clustering, and model aggregation.

This section introduces CAML, our proposal for a (C)lustering and (A)ggregation technique for (M)achine (L)earning models.
As visualized in Figure 2, CAML comprises four steps, which are data preprocessing, distance matrix computation, hierarchical clustering, and model aggregation. To execute these steps, CAML requires only one user-defined parameter: the number of clusters 𝑘 that controls the granularity of the clustering. The selection of 𝑘 strongly depends on the desired use case. Most commonly, 𝑘 represents the number of cluster models the user wishes to deploy. However, an exhaustive search over 𝑘 can be performed to meet certain prediction accuracy goals of the resulting cluster models compared to the individual models. Because CAML employs hierarchical clustering, which produces a linkage matrix, multiple different clusterings can be extracted without recomputing the actual clustering procedure. We now introduce the four steps of CAML in more detail.

Preprocessing: Time series that contain anomalous data points (e.g., due to faulty sensor readings) are harmful to any clustering attempt. The removal of outliers is an essential preprocessing step, as it helps to reduce the influence of anomalous data points on the subsequent steps of clustering and model aggregation. These outliers, if not addressed, can distort overall patterns and relationships within the data, leading to suboptimal clustering results. Our CAML algorithm, therefore, initially preprocesses the data on which the individual models have been trained and tested, replacing the outlier data points in the time series with the value of the next, non-anomalous data point in the series. In this context, we define an outlier as a data point where at least one of its feature values deviates by more than 𝑣 standard deviations from the mean of that feature. By default, we choose 𝑣 to be 3, because we expect each feature 𝑥 to follow approximately a Gaussian distribution, i.e., 𝑃(𝜇 − 3𝜎 ≤ 𝑥 ≤ 𝜇 + 3𝜎) ≈ 0.9973. Thus, only the extreme outliers are removed, and most of the original data remains. Next, the time series of each model is split into training and test sets. We denote the training set of model 𝑚𝑖 as 𝑇𝑖 = (𝑋𝑖, 𝑌𝑖) and the test set as 𝑡𝑖 = (𝑥𝑖, 𝑦𝑖), with 𝑋𝑖 and 𝑥𝑖 as features and 𝑌𝑖 and 𝑦𝑖 as labels of the model, respectively. It is important to note that these splits need to be the same as those used in the training procedure of the baseline models. The test sets are used in our proposed distance function for computing a complete distance matrix of the models.
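To make the preprocessing step concrete, the following minimal sketch (our own illustration, not CAML's actual code) implements the 3𝜎 outlier replacement and a chronological train/test split on a pandas DataFrame; the function names, the column layout, and the split fraction are assumptions.

```python
import numpy as np
import pandas as pd


def replace_outliers(ts: pd.DataFrame, v: float = 3.0) -> pd.DataFrame:
    """A data point is an outlier if at least one feature deviates by more than
    v standard deviations from that feature's mean (default v = 3); outliers are
    replaced by the next non-anomalous data point in the series."""
    z_scores = (ts - ts.mean()) / ts.std()
    is_outlier = (z_scores.abs() > v).any(axis=1)
    cleaned = ts.copy()
    cleaned[is_outlier] = np.nan   # mask the anomalous rows ...
    return cleaned.bfill()         # ... and backfill with the next valid data point


def chronological_split(ts: pd.DataFrame, label: str, test_fraction: float = 0.2):
    """Split one installation's time series into a training set T_i = (X_i, Y_i)
    and a test set t_i = (x_i, y_i); the split point is purely chronological."""
    cut = int(len(ts) * (1 - test_fraction))
    train, test = ts.iloc[:cut], ts.iloc[cut:]
    return (train.drop(columns=[label]), train[label]), \
           (test.drop(columns=[label]), test[label])
```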
Distance Matrix Computation: The hierarchical clustering step of CAML is based on a distance matrix that stores all pair-wise distances between baseline models. These pair-wise distances should measure the dissimilarity between two machine learning models 𝑚𝑖 and 𝑚𝑗 and are computed with a custom distance function, which is based on the pair-wise cross-validation of the models on each other's test sets. In other words, the distance between models 𝑚𝑖 and 𝑚𝑗 is the mean of 𝑚𝑖's loss on 𝑚𝑗's test data and 𝑚𝑗's loss on 𝑚𝑖's test data. We calculate this loss as the Mean Absolute Error (MAE) on the respective test sets. The intuition behind this distance function is to measure how well 𝑚𝑖 performs in the specific environment of 𝑚𝑗 and vice versa. If the loss of 𝑚𝑖 in the setting of 𝑚𝑗 is close to 𝑚𝑗's loss in its own environment, 𝑚𝑗 can be replaced with 𝑚𝑖. By performing this calculation bidirectionally, we measure the extent to which the two models can replace each other.

Formally, let 𝑚 ∶ ℝ^{𝑢×𝑣} → ℝ^𝑣 be a prediction model, which maps 𝑣 data points with 𝑢 different features to 𝑣 predictions, let 𝑙 ∶ ℝ^𝑣 × ℝ^𝑣 → ℝ⁺₀ be a non-negative loss function (e.g., MAE), and let 𝑡𝑖 = (𝑥𝑖, 𝑦𝑖) be the test set with the features 𝑥𝑖 and the labels 𝑦𝑖 of a model 𝑚𝑖. The distance between any two models 𝑚𝑖 and 𝑚𝑗 can be computed as:

d(m_i, m_j) = \frac{1}{2} \left( l(m_i(x_j), y_j) + l(m_j(x_i), y_i) \right)    (1)

Here, 𝑚𝑖(𝑥𝑗) denotes the predicted output of model 𝑚𝑖 on the test set of model 𝑚𝑗, and 𝑙(𝑚𝑖(𝑥𝑗), 𝑦𝑗) is the loss between the predicted output 𝑚𝑖(𝑥𝑗) and the true label 𝑦𝑗. The distance function is symmetric, i.e., 𝑑(𝑚𝑖, 𝑚𝑗) = 𝑑(𝑚𝑗, 𝑚𝑖), and it is non-negative, i.e., 𝑑(𝑚𝑖, 𝑚𝑗) ≥ 0.
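Equation 1 translates into a few lines of Python, assuming (as CAML requires) that every baseline model exposes the Scikit-learn regressor interface; the function names and the representation of test sets as (features, labels) tuples are our own illustrative choices, not CAML's API.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error


def cross_performance_distance(m_i, m_j, t_i, t_j) -> float:
    """Equation 1: mean of m_i's MAE on m_j's test set and m_j's MAE on m_i's test set."""
    (x_i, y_i), (x_j, y_j) = t_i, t_j
    return 0.5 * (mean_absolute_error(y_j, m_i.predict(x_j))
                  + mean_absolute_error(y_i, m_j.predict(x_i)))


def cross_performance_matrix(models, test_sets) -> np.ndarray:
    """Symmetric, non-negative n x n matrix of all pair-wise model distances."""
    n = len(models)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dist[i, j] = dist[j, i] = cross_performance_distance(
                models[i], models[j], test_sets[i], test_sets[j])
    return dist
```

Because every pair of models is evaluated on each other's test sets, building the full matrix requires a quadratic number of model evaluations, which is also the scalability bottleneck discussed in Section 6.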
Hierarchical clustering: For the actual clustering of the AI models, CAML uses a hierarchical clustering approach [2] that can create any pre-specified number of 𝑘 clusters with sophisticated linkage methods. Hierarchical clustering successively merges clusters of objects into ever larger clusters with increasing merge distance, which effectively creates a cluster tree. Given the distance matrix from the previous step, the hierarchical clustering algorithm proceeds as follows:

1. Initialize: Each object, i.e., AI model, is considered as one initial, separate cluster.
2. Merge: The two clusters with the smallest distance to each other with respect to the linkage method and distance function are merged into a new cluster.
3. Iterate: Unless all clusters have been merged into one cluster, re-iterate with step 2.

The linkage method defines the distance of any pair of clusters based on the pair-wise object distances. The distances then determine the most similar pair of clusters in each iteration. Because the choice of the linkage method can significantly influence the results of the hierarchical clustering, we considered various popular methods, including single linkage, complete linkage, and average linkage, which correspond to the minimum, maximum, and average distances of objects in the two compared clusters, and found Ward linkage to be the most effective method in this step. The Ward linkage method [18] uses the within-cluster variance, which is the variance of all pair-wise object distances within a cluster, for the distance calculation: the distance of two clusters is the increase in total within-cluster variance of the merged cluster w.r.t. its two base clusters. We choose Ward, because the pair-wise distances between models within a cluster should be as low as possible, such that the models can replace each other. In mathematical terms, if we denote 𝑐𝑖 and 𝑐𝑗 as two clusters and 𝑐𝑖𝑗 as the resulting cluster after merging 𝑐𝑖 and 𝑐𝑗, then the total increase in squared distance due to merging, known as Ward's criterion, is:

\Delta\sigma^2 = \sum_{x \in c_{ij}} d(x, \mu(c_{ij}))^2 - \sum_{x \in c_i} d(x, \mu(c_i))^2 - \sum_{x \in c_j} d(x, \mu(c_j))^2 \quad \text{with} \quad \mu(c) = \arg\min_{i \in c} \sum_{j \in c} d(i, j)    (2)

where 𝜇(𝑐) is the centroid of the observations in cluster 𝑐, 𝑑 is a distance function based on the distance matrix, and 𝑥 are the data points.

Model aggregation: The model aggregation takes as input the clustering hierarchy and produces as output a set of 𝑘 cluster models. The cluster hierarchy is stored as a linkage matrix and can be visualized as a dendrogram. At first, the aggregation process simply cuts the dendrogram at the specific depth that creates 𝑘 disjoint clusters. Then, it consolidates every cluster into a single cluster model by training a new model on the combined training data of all individual models of that cluster. More specifically, let 𝐶 = {𝐶1, ..., 𝐶𝑘} be any clustering of baseline models. The cluster model 𝑀𝑖 for a given cluster 𝐶𝑖 with size 𝑛 = |𝐶𝑖|, the models 𝑚𝑖,1, ..., 𝑚𝑖,𝑛 and their respective test sets 𝑡𝑖,1, ..., 𝑡𝑖,𝑛, where 𝑡𝑖,𝑗 = (𝑥𝑖,𝑗, 𝑦𝑖,𝑗), and training sets 𝑇𝑖,1, ..., 𝑇𝑖,𝑛, where 𝑇𝑖,𝑗 = (𝑋𝑖,𝑗, 𝑌𝑖,𝑗), along with a fitting function 𝑓, is computed as in Equation 3 by merging all data of the cluster:

M_i = f\left( \bigcup_{j=1}^{n} X_{i,j}, \; \bigcup_{j=1}^{n} Y_{i,j} \right)    (3)

The AI models within a cluster are similar in their behavior, but they might use very different model types (e.g. linear regression, random forest regression, LSTMs etc.) and hyperparameter settings. Our clustering results in fact show that model architectures have little impact on the clustering results. The consolidation, therefore, needs to find a model architecture with a similar or higher capacity than the baseline models to capture all individual properties. Because the best model for consolidation depends on the concrete baseline models, the model aggregation needs to test a set of diverse model types to find the most effective model architecture for the cluster models. Within our implementation, we test five models, which are the ExtraTreesRegressor [19], LightGBM [20], XGBoost [21], N-Beats [22], and randomly selecting one baseline model to serve as the cluster model. Comparing the model performances can easily be done on the validation data sets of the respective clusters. We present more in-depth results in Section 5.
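Both steps map directly onto SciPy and Scikit-learn, which CAML builds on. The sketch below is a simplified illustration under assumptions: it cuts the Ward linkage tree into 𝑘 clusters and retrains one ExtraTreesRegressor per cluster on the pooled training data (Equation 3), whereas CAML additionally compares several candidate architectures per cluster; all helper and variable names are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from sklearn.ensemble import ExtraTreesRegressor


def cluster_baseline_models(dist: np.ndarray, k: int) -> np.ndarray:
    """Ward-linkage hierarchical clustering on the precomputed distance matrix;
    returns one cluster label (1..k) per baseline model."""
    condensed = squareform(dist, checks=False)        # SciPy expects the condensed form
    tree = linkage(condensed, method="ward")          # linkage matrix, i.e., the cluster tree
    return fcluster(tree, t=k, criterion="maxclust")  # cut the dendrogram into k clusters


def train_cluster_model(train_sets, labels, cluster_id):
    """Equation 3: fit one model on the union of the cluster members' training data.
    ExtraTreesRegressor stands in for CAML's search over candidate architectures."""
    members = np.flatnonzero(labels == cluster_id)
    X = pd.concat([train_sets[i][0] for i in members])
    Y = pd.concat([train_sets[i][1] for i in members])
    return ExtraTreesRegressor(n_estimators=200, random_state=0).fit(X, Y)


# Example usage (names as assumed above):
# labels = cluster_baseline_models(dist, k=55)
# cluster_models = {c: train_cluster_model(train_sets, labels, c) for c in np.unique(labels)}
```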
4. Metrics for Evaluating the Model Aggregation Effectiveness

In this section, we describe metrics for measuring the effectiveness of our clustering-based model aggregation. With CAML, we trained new cluster models that should replace their baseline models. To measure how well a cluster model generalizes its baseline models, we need a metric with which we can compare the effectiveness of different cluster model sets. To evaluate the effectiveness, we measure the accuracy of each cluster model and compare it to the unconsolidated accuracies of the respective baseline models. The performance of a cluster model is then the average difference in loss of the trained cluster model compared to all of its baseline models. Both the cluster model and its baseline models are tested with the respective test sets of the corresponding baseline models. First of all, we measure the accuracy of a set of predictions 𝑦̂ w.r.t. the correct test labels 𝑦 as the mean absolute error (MAE) loss:

\mathrm{MAE}(\hat{y}, y) = \frac{1}{|y|} \sum_{i=1}^{|y|} |y_i - \hat{y}_i|    (4)

Given a cluster model 𝑀𝑖 for cluster 𝐶𝑖 with 1 ≤ 𝑖 ≤ 𝑘 and cluster size |𝐶𝑖|, we calculate the effectiveness score 𝜇𝑐(𝑖) of 𝑀𝑖 on cluster 𝑖 as the mean MAE loss over all test sets 𝑡𝑖,𝑗 = (𝑥𝑖,𝑗, 𝑦𝑖,𝑗) whose model 𝑚𝑖,𝑗 belongs to cluster 𝑖 (Equation 5). Additionally, we compute an effectiveness score 𝜇𝑏(𝑖) of all baseline models 𝑚𝑖,𝑗 of cluster 𝑖 for comparison. Here, we compute the mean MAE loss over all baseline models 𝑚𝑖,𝑗 in 𝐶𝑖 on their own test sets (Equation 6).

\mu_c(i) = \frac{1}{|C_i|} \sum_{j=1}^{|C_i|} \mathrm{MAE}(M_i(x_{i,j}), y_{i,j})    (5)

\mu_b(i) = \frac{1}{|C_i|} \sum_{j=1}^{|C_i|} \mathrm{MAE}(m_{i,j}(x_{i,j}), y_{i,j})    (6)

Let 𝑁 = \sum_{i=1}^{k} |C_i| be the number of all baseline models. We can now calculate the overall accuracy of all cluster models and all baseline models over all clusters by aggregating the individual scores 𝜇𝑐(𝑖) and 𝜇𝑏(𝑖) into total scores 𝜇𝐶 and 𝜇𝐵, respectively. For this, we apply the weighted mean with the cluster size |𝐶𝑖| as weight (Equations 7 and 8).

\mu_C = \frac{1}{N} \sum_{i=1}^{k} |C_i| \cdot \mu_c(i)    (7)

\mu_B = \frac{1}{N} \sum_{i=1}^{k} |C_i| \cdot \mu_b(i)    (8)

To judge the overall effectiveness of the clustering-based model aggregation, we can simply compare 𝜇𝐶 to 𝜇𝐵 or to some 𝜇𝐶′. If 𝜇𝐶 < 𝜇𝐵 for some clustering 𝐶, the cluster models performed better than the baseline models. The value of 𝜇𝐶 − 𝜇𝐵 can be interpreted as the additional loss one encounters when aggregating the baseline models. We also consider the spread of the clustering scores as the variances 𝜎𝐶² of the 𝜇𝑐(𝑖) and 𝜎𝐵² of the 𝜇𝑏(𝑖) (Equations 9 and 10). For 𝜎𝐵², we consider the variance of the baseline models' 𝑚𝑗 prediction errors on their own test sets 𝑡𝑗 = (𝑥𝑗, 𝑦𝑗).

\sigma_C^2 = \frac{1}{N} \sum_{i=1}^{k} |C_i| \cdot (\mu_c(i) - \mu_C)^2    (9)

\sigma_B^2 = \frac{1}{N} \sum_{j=1}^{N} (\mathrm{MAE}(m_j(x_j), y_j) - \mu_B)^2    (10)

Because the model consolidation effectiveness depends on the clustering effectiveness and the aggregation effectiveness, we evaluate both aspects in Section 5.
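Under the same assumptions as the previous sketches (test sets as (features, labels) tuples, one cluster label per baseline model, and Scikit-learn-style models), the scores of Equations 5 to 8 can be computed as follows; the helper names are again hypothetical.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error


def aggregation_scores(cluster_models, baseline_models, test_sets, labels):
    """Per-cluster scores mu_c(i) and mu_b(i) (Equations 5 and 6) and their
    cluster-size-weighted means mu_C and mu_B (Equations 7 and 8)."""
    mu_c, mu_b, sizes = [], [], []
    for cluster_id, M in cluster_models.items():
        members = np.flatnonzero(labels == cluster_id)
        # Equation 5: the cluster model is tested on every member's test set.
        mu_c.append(np.mean([mean_absolute_error(test_sets[i][1], M.predict(test_sets[i][0]))
                             for i in members]))
        # Equation 6: every baseline model is tested on its own test set.
        mu_b.append(np.mean([mean_absolute_error(test_sets[i][1],
                                                 baseline_models[i].predict(test_sets[i][0]))
                             for i in members]))
        sizes.append(len(members))
    mu_C = np.average(mu_c, weights=sizes)   # Equation 7
    mu_B = np.average(mu_b, weights=sizes)   # Equation 8
    return mu_C, mu_B
```

The difference 𝜇𝐶 − 𝜇𝐵 returned by such a routine is exactly the additional loss described above.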
5. Experiments

In this section, we first outline our experimental setup (Section 5.1) and then discuss our experimental results with CAML (Section 5.2) to demonstrate the algorithm's effectiveness.

5.1. Experimental Setup

Measuring Quality: To assess the quality of our clustering-based aggregation technique, we leverage 𝜇𝐶 of the cluster models and 𝜇𝐵 of the baseline models. The baseline models act as a lower bound for the MAE, because they fit the individual setups and their (hidden) hyperparameters best. As an upper bound for the MAE, we train a global model (with different architectures) using the entire training data set without any clustering. To evaluate the proposed aggregation technique, we also test a clustering-based approach that randomly selects one of the baseline models in every cluster as a representative.

Benchmark Approaches: To evaluate the proposed cross performance-based clustering, we benchmark two additional clustering approaches: training data-based and output data-based. The training data-based approach clusters the data that the models have been trained on. Here, we use hierarchical clustering with Ward linkage and the Shape-Based Distance (SBD) derived from the k-Shape algorithm [10, 11]. Because SBD is designed for univariate time series, we apply it to each feature of our multivariate time series individually and sum the distances. The output data-based approach employs the models and their outputs. We consider all training data as one big set, randomly select 𝑝 individual measurements from this set, and add Gaussian noise to them to prevent overfitting and increase robustness. These data points then serve as input for the models to compute a signature for each model, represented as a 𝑝-dimensional vector of the model's outputs. Then, hierarchical clustering, this time using the Euclidean distance, is performed on these signatures.

Technology: CAML is written in Python 3.10 and can handle any model that implements the Scikit-learn [23] regressor interface. For hierarchical clustering with Ward [18], we use the implementation from SciPy [24]. The cluster models are built using the time series forecasting library Darts [25]. In particular, we evaluate the ExtraTreesRegressor [19], LightGBM [20], XGBoost [21], and N-Beats [22]. The hyper-parameters of both the cluster and the global models have been tuned using Optuna [26]. The source code for CAML can be found on GitHub¹.

Data set: To assess CAML's performance, we ran the algorithm on a real-world data set from our industry partner. The data set comprises the models and time series of 370 heat pumps. The multivariate time series vary in length, ranging from 200 to 607 time steps with daily recordings of the energy consumption of these heat pumps, along with their associated measurements (e.g., outdoor temperature, supply temperature, date, etc.), which serve as input features for the models. The baseline models are built using Scikit-learn [23] and its implementations of the ExtraTreesRegressor [19] and GradientBoostingRegressor [27].

¹ https://github.com/floriansiepe/CAML

Figure 3: Evaluation of the number of target clusters 𝑘. The clustering scores 𝜇𝐶 and 𝜎𝐶² over the share of retained models in percent, with a selected clustering of 𝑘 = 55. In total, the data set consists of 370 models.

5.2. Experimental Results

Because CAML creates a hierarchical clustering, the algorithm needs to choose a number of clusters 𝑘, which is the number of to-be-created clusters. Our first experiment, therefore, evaluates the influence of 𝑘 on the clustering scores. The results are shown in Figure 3. The 𝑥-axis shows the share of retained models in percent, while the 𝑦-axis shows the scores 𝜇𝐶 and 𝜎𝐶²; the more clusters we use, the better each aggregated model can specialize on the specific installations. The goal is to create as few clusters as possible, while keeping the clusters' mean MAE 𝜇𝐶 acceptably small. The depicted curve is an effective tool to tune 𝑘 for a specific application. For our application with 370 models, we chose 𝑘 = 55 (see the vertical line) with the elbow method [28], in combination with 𝜇𝐶 as a supervised metric of the cluster models' prediction error instead of the usually employed sum of squared errors. Practical constraints from our industry partner on the number of deployable models also influenced the choice of 𝑘.
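Because the linkage matrix encodes the complete hierarchy, all candidate values of 𝑘 can be evaluated from a single clustering run by cutting the dendrogram at different depths, which is how a curve like the one in Figure 3 can be produced. The snippet below is a self-contained sketch that uses a random stand-in distance matrix; in practice, the cross-performance matrix and the retraining plus scoring of the cluster models (as sketched earlier) would supply the 𝜇𝐶 values whose elbow suggests 𝑘.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Stand-in for the 370 x 370 cross-performance distance matrix.
rng = np.random.default_rng(0)
dist = np.abs(rng.normal(size=(370, 370)))
dist = (dist + dist.T) / 2.0
np.fill_diagonal(dist, 0.0)

# The cluster tree is computed once ...
tree = linkage(squareform(dist, checks=False), method="ward")

# ... and cut at different depths to obtain a clustering for every candidate k.
clusterings = {k: fcluster(tree, t=k, criterion="maxclust") for k in range(5, 105, 5)}

# For each candidate k, retrain the cluster models and record mu_C; plotting mu_C
# over the retained share of models yields the elbow curve shown in Figure 3.
```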
In a second experiment, we evaluate the three clustering approaches based on (a) training data, (b) output data, and (c) cross performance (CAML) by their MAE scores. For this experiment, the cluster models are built using the original model architectures, which are either ExtraTreesRegressor or GradientBoostingRegressor, depending on which yields the better accuracy; we evaluate the model selection for the cluster models in the next experiment. Figure 4 plots the measured MAE scores for the three clustering approaches. Each subfigure plots the average MAE 𝜇𝑏(𝑖) of the baseline models on their respective test sets in each cluster (𝑥-axis) against the average MAE 𝜇𝑐(𝑖) of the cluster models on all test sets in their cluster (𝑦-axis). If a measurement point is above the diagonal line, the baseline models performed better in this cluster; if a point is below the diagonal line, the cluster model performed better. The size of the marker indicates the cluster size. For every subfigure, we also provide the scores of the entire clustering (see Section 4) for the baseline models (𝜇𝐵 and 𝜎𝐵²) and the cluster models (𝜇𝐶 and 𝜎𝐶²).

Figure 4: Evaluation of clustering approaches. The MAEs of the clustering approaches based on (a) training data (SBD), (b) output data (Euclidean), and (c) cross performance (MAE; CAML), measured as 𝜇𝐶 and compared to the MAE of the baseline models 𝜇𝐵. The marker size corresponds to the size of each cluster. Measured scores: (a) 𝜇𝐶 = 7.01, 𝜎𝐶² = 9.40; (b) 𝜇𝐶 = 2.33, 𝜎𝐶² = 3.10; (c) 𝜇𝐶 = 2.15, 𝜎𝐶² = 2.18; baseline 𝜇𝐵 = 0.80, 𝜎𝐵² = 0.97 in all three subfigures.

The measurements in Figure 4 show that cross performance-based clustering (with CAML) results in a lower prediction error and variance than training data- or output data-based clustering. The training data-based clustering performs worst, because the hidden features are not represented in the input data and, therefore, do not influence this clustering; if they were represented, a globally aggregated model could learn from them as well and the clustering would not have been necessary to begin with. The output data- and cross performance-based clustering approaches, however, both work well for the model clustering task, because they both capture the hidden features. The cross performance-based approach, however, has a slightly lower error and a denser clustering with less variance than the output data-based approach. For this reason, we propose the former with CAML.

In a third experiment, we evaluate CAML's model aggregation. More specifically, given a specific clustering, we evaluate different model architectures and how models within a cluster can be aggregated. For this, we choose the clustering of CAML with 𝑘 = 55 (Figure 4) and implement multiple model architectures, both as cluster models and as global models. Figure 5 shows the distribution of the MAEs of these model architectures in comparison to the baseline models.

Figure 5: Evaluation of cluster models and overall success. The MAE distribution of the baseline models and of the different model architectures (ExtraTrees, XGBoost, N-Beats, LightGBM, Random) used as cluster models and as global models.

Considering only the mean performances, we see that the individual models, i.e., the baseline models, perform best; the cluster models sacrifice a little bit of precision for having a much smaller overall number of models; and training only one global model provides the worst performance. The fact that the cluster models' mean MAE is very close to the baselines' mean MAE demonstrates the effectiveness of CAML's clustering. The ExtraTrees and LightGBM cluster models even outperform the baseline models in terms of MAE variance, i.e., prediction certainty. Another observation that highlights the effectiveness of the proposed clustering is that a randomly selected baseline model from each cluster, used as the cluster model, offers a similar (although slightly worse) performance than all other cluster models, which retrain on the clusters' data sets. Considering the different model architectures, every global model performs worse than its respective cluster model counterpart. This also shows that the clustering effectively captures some hidden features. The model architectures, however, profit differently from the clustering. Most notably, the neural network-based N-Beats model hardly improves with the clustering, because it tends to overfit the training data and, hence, has the same problems on clustered and non-clustered inputs; both the cluster and the global N-Beats model perform worse than all other cluster models. In summary, ExtraTrees, LightGBM, and XGBoost all performed very well as cluster models for the aggregation step.

6. Conclusion

In this paper, we introduced CAML, a novel clustering-based aggregation technique for machine learning models that have been trained in specific environments with certain hidden but constant features.
Due to these hidden features, no global model can replace all individual models. The clusters of models that CAML creates effectively capture these hidden features and serve to consolidate their models into much fewer models with still very good precision. With the proposed hierarchical clustering approach, data scientists can tune the trade-off between the number of to-be-deployed models and the models' precision. Our experimental results demonstrate that CAML outperforms all globally aggregated models as well as our benchmark approaches, which cluster the models using their training data and output data, in terms of prediction error and variance. The consolidated cluster models tend to have a slightly higher average error than the baseline models, but the proposed model architectures have an overall lower variance, i.e., a higher prediction certainty, than the baseline models. In the context of the investigated application, CAML offers a solution for the aggregation of energy consumption models, enhancing the practical value of these models for utility companies, building managers, and consumers, while simultaneously reducing operational costs.

Future Work: We evaluated CAML on a data set with only 370 models. Due to the positive results obtained with the proposed consolidation, the next step is to deploy CAML on a data set with, at the time of writing, several hundred thousand models and their respective data. To make this possible, further research on enhancing CAML's performance and/or scalability is needed, given that CAML requires the computation of a distance matrix with an expensive distance function and a runtime complexity of 𝒪(𝑛²).

Acknowledgments: This work was primarily undertaken for Viessmann IT Service GmbH. We gratefully acknowledge the management of Viessmann IT Service GmbH for their support and for allowing the presentation of the results.

References

[1] J. MacQueen, Some methods for classification and analysis of multivariate observations, in: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, 1967, pp. 281–297.
[2] B. Everitt, S. Landau, M. Leese, D. Stahl, Cluster Analysis, 5 ed., John Wiley & Sons, 2011.
[3] D. Comaniciu, P. Meer, Mean shift: a robust approach toward feature space analysis, IEEE Transactions on Pattern Analysis and Machine Intelligence 24 (2002) 603–619. doi:10.1109/34.1000236.
[4] M. Ester, H.-P. Kriegel, J. Sander, X. Xu, A density-based algorithm for discovering clusters in large spatial databases with noise, in: Proceedings of the International Conference on Knowledge Discovery and Data Mining, KDD'96, AAAI Press, Portland, Oregon, 1996, pp. 226–231.
[5] K. Bandara, C. Bergmeir, S. Smyl, Forecasting across time series databases using long short-term memory networks on groups of similar series, CoRR abs/1710.03222 (2017). URL: http://arxiv.org/abs/1710.03222. arXiv:1710.03222.
[6] T. Räsänen, M. Kolehmainen, Feature-based clustering for electricity use time series data, in: M. Kolehmainen, P. J. Toivanen, B. Beliczynski (Eds.), Adaptive and Natural Computing Algorithms, 9th International Conference, ICANNGA 2009, Kuopio, Finland, April 23-25, 2009, Revised Selected Papers, volume 5495 of Lecture Notes in Computer Science, Springer, 2009, pp. 401–412. doi:10.1007/978-3-642-04921-7_41.
[7] H. Sakoe, S. Chiba, Dynamic programming algorithm optimization for spoken word recognition, IEEE Transactions on Acoustics, Speech, and Signal Processing (1978). doi:10.1016/b978-0-08-051584-7.50016-4.
[8] A. Stefan, V. Athitsos, G. Das, The move-split-merge metric for time series, IEEE Transactions on Knowledge and Data Engineering (TKDE) 25 (2013) 1425–1438. doi:10.1109/tkde.2012.88.
[9] P.-F. Marteau, Time warp edit distance with stiffness adjustment for time series matching, IEEE Transactions on Pattern Analysis and Machine Intelligence 31 (2009) 306–318. doi:10.1109/tpami.2008.76. arXiv:cs/0703033.
[10] J. Paparrizos, L. Gravano, k-Shape: Efficient and accurate clustering of time series, in: Proceedings of the International Conference on Management of Data (SIGMOD), 2015. doi:10.1145/2949741.2949758.
[11] J. Paparrizos, L. Gravano, Fast and accurate time-series clustering, ACM Transactions on Database Systems (TODS) 42 (2017). doi:10.1145/3044711.
[12] J. Yang, C. Ning, C. Deb, F. Zhang, D. Cheong, S. E. Lee, C. Sekhar, K. W. Tham, k-shape clustering algorithm for building energy usage patterns analysis and forecasting model accuracy improvement, Energy and Buildings 146 (2017) 27–37. doi:10.1016/j.enbuild.2017.03.071.
[13] B. Bakker, T. Heskes, Clustering ensembles of neural network models, Neural Networks 16 (2003) 261–269. doi:10.1016/S0893-6080(02)00187-9.
[14] D. Sarkar, T. Ao, S. K. Gunturi, Bootstrap aggregating approach to short-term load forecasting using meteorological parameters for demand side management in the north-eastern region of India, Theoretical and Applied Climatology 148 (2022) 1111–1125. doi:10.1007/s00704-022-03933-9.
[15] A.-N. Khan, N. Iqbal, A. Rizwan, R. Ahmad, D.-H. Kim, An ensemble energy consumption forecasting model based on spatial-temporal clustering analysis in residential buildings, Energies 14 (2021) 3020. doi:10.3390/en14113020.
[16] G. Hinton, O. Vinyals, J. Dean, Distilling the knowledge in a neural network, arXiv (2015). doi:10.48550/arxiv.1503.02531. arXiv:1503.02531.
[17] H. Zhang, D. Chen, C. Wang, Confidence-aware multi-teacher knowledge distillation, in: Proceedings of the International Conference on Acoustics, Speech and Signal Processing, 2022, pp. 4498–4502. doi:10.1109/icassp43922.2022.9747534.
[18] J. H. Ward, Hierarchical grouping to optimize an objective function, Journal of the American Statistical Association 58 (1963) 236–244. URL: http://www.jstor.org/stable/2282967.
[19] P. Geurts, D. Ernst, L. Wehenkel, Extremely randomized trees, Machine Learning 63 (2006) 3–42.
[20] G. Ke, Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, Q. Ye, T.-Y. Liu, LightGBM: A highly efficient gradient boosting decision tree, in: I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, R. Garnett (Eds.), Advances in Neural Information Processing Systems, volume 30, Curran Associates, Inc., 2017. URL: https://proceedings.neurips.cc/paper_files/paper/2017/file/6449f44a102fde848669bdd9eb6b76fa-Paper.pdf.
[21] T. Chen, C. Guestrin, XGBoost: A scalable tree boosting system, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '16, Association for Computing Machinery, New York, NY, USA, 2016, pp. 785–794. doi:10.1145/2939672.2939785. arXiv:1603.02754.
[22] B. N. Oreshkin, D. Carpov, N. Chapados, Y. Bengio, N-BEATS: Neural basis expansion analysis for interpretable time series forecasting, in: International Conference on Learning Representations, 2020. URL: https://openreview.net/forum?id=r1ecqn4YwB.
[23] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay, Scikit-learn: Machine learning in Python, Journal of Machine Learning Research 12 (2011) 2825–2830.
[24] P. Virtanen, R. Gommers, T. E. Oliphant, M. Haberland, T. Reddy, D. Cournapeau, E. Burovski, P. Peterson, W. Weckesser, J. Bright, S. J. van der Walt, M. Brett, J. Wilson, K. J. Millman, N. Mayorov, A. R. J. Nelson, E. Jones, R. Kern, E. Larson, C. J. Carey, İ. Polat, Y. Feng, E. W. Moore, J. VanderPlas, D. Laxalde, J. Perktold, R. Cimrman, I. Henriksen, E. A. Quintero, C. R. Harris, A. M. Archibald, A. H. Ribeiro, F. Pedregosa, P. van Mulbregt, SciPy 1.0 Contributors, SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python, Nature Methods 17 (2020) 261–272. doi:10.1038/s41592-019-0686-2.
[25] J. Herzen, F. Lässig, S. G. Piazzetta, T. Neuer, L. Tafti, G. Raille, T. Van Pottelbergh, M. Pasieka, A. Skrodzki, N. Huguenin, M. Dumonal, J. Kościsz, D. Bader, F. Gusset, M. Benheddi, C. Williamson, M. Kosinski, M. Petrik, G. Grosch, Darts: User-friendly modern machine learning for time series, Journal of Machine Learning Research 23 (2022) 1–6. URL: http://jmlr.org/papers/v23/21-1177.html.
[26] T. Akiba, S. Sano, T. Yanase, T. Ohta, M. Koyama, Optuna: A next-generation hyperparameter optimization framework, in: Proceedings of the International Conference on Knowledge Discovery and Data Mining (SIGKDD), 2019.
[27] J. H. Friedman, Greedy function approximation: A gradient boosting machine, The Annals of Statistics 29 (2001) 1189–1232. doi:10.1214/aos/1013203451.
[28] R. L. Thorndike, Who belongs in the family?, Psychometrika 18 (1953) 267–276. doi:10.1007/bf02289263.