Hoeffding Regression Trees for Forecasting Quality of Experience in B5G/6G Networks

José Luis Corcuera Bárcena, Pietro Ducange, Francesco Marcelloni, Alessandro Renda and Fabrizio Ruffini
1 Department of Information Engineering, University of Pisa, Largo Lucio Lazzarino 1, 56122 Pisa, Italy

Abstract
Online data stream analysis is becoming increasingly relevant, as the focus shifts from offline processing to real-time acquisition and modeling of massive data from remote devices. In this paper, we focus on the domain of telecommunications, in particular on video streaming services for moving devices (e.g., a passenger enjoying a movie during a car trip). Since the streaming service must provide a satisfactory level of quality of experience to the user, it is important to predict upcoming degradations of video quality. We adopt the well-known Hoeffding Decision Tree (HDT) for streaming data, tailored to regression problems, and compare its performance with that of standard Regression Trees (RTs), in order to evaluate the potential of HDTs to forecast the quality of experience in terms of accuracy, learning time, and memory usage. Results show that, during the online learning process, the standard RT outperforms the HDT in terms of accuracy, but suffers in terms of training time and memory occupancy when applied to potentially massive data streams.

Keywords
Data Stream Mining, Regression Tree, QoE forecasting, Explainable AI, Hoeffding Decision Tree

1. Introduction
Quality of Experience (QoE) is a measure of end-user satisfaction in enjoying a service and is typically used in the context of telecommunications [1]. The fulfillment of QoE metrics is a primary goal in current (i.e., fourth and fifth generation) and in future mobile networks.
Beyond 5G (B5G) and 6G networks are indeed currently under development, as pointed out, for instance, by the commitment of institutions, industry and academia in the framework of international projects such as Hexa-X¹. Such next-generation wireless networks are expected to be much more complex than current ones and will support innovative functionalities such as holographic communication, high-precision manufacturing, and smart automotive applications [2]. Notably, the capability to play high-definition videos in real time may represent a key enabler of such new functionalities. Thus, being able to forecast the perceived quality of video experience may be fundamental to avoid the degradation of end-users' satisfaction or to determine whether a specific functionality should be provided or not.

OLUD 2022: First Workshop on Online Learning from Uncertain Data Streams, July 18, 2022, Padua, Italy
joseluis.corcuera@phd.unipi.it (J. Corcuera Bárcena); pietro.ducange@unipi.it (P. Ducange); francesco.marcelloni@unipi.it (F. Marcelloni); alessandro.renda@unipi.it (A. Renda); fabrizio.ruffini@ing.unipi.it (F. Ruffini)

¹ https://hexa-x.eu/, accessed June 2022
In the context of video streaming services, QoE metrics include startup delay, rebuffering events and video quality [1], which clearly depend on contextual factors and typically vary over time. Several works [1, 3, 4] have recently addressed the QoE prediction task by exploiting Machine Learning (ML) techniques and leveraging Quality of Service (QoS) metrics, i.e., quantitative measures that characterize the service offered by the network, such as packet loss and channel quality. Interestingly, only one of these works [4] has framed QoE prediction as a timeseries forecasting problem, yet disregarding important challenges of data stream mining: the whole dataset is typically not available for offline processing, and the distribution of data may change over time due to a phenomenon known as concept drift [5], making it essential to adapt the model to avoid performance degradation. In the last decades, various approaches for the incremental learning of ML models have been proposed; here, we focus on the field of eXplainable Artificial Intelligence (XAI) and specifically on a class of inherently interpretable models, capable of explaining, by design, how decisions are taken. Indeed, transparency (i.e., the capability of understanding the structure of the model itself) represents a key requirement towards trustworthy AI [6], which in turn is deemed a major pillar in the design of next generation wireless networks. In this framework, the Hoeffding Decision Tree (HDT) [7] represents a reference approach: it has been widely exploited for both classification and regression tasks. In the context of classification tasks, HDT has also recently been extended with fuzziness to handle vague and noisy data and enhance interpretability [8].
In this paper, we present a preliminary experimental evaluation of the Hoeffding Regression Tree (HRT) for a QoE forecasting task in the frame of next generation wireless networks. Specifically, we rely on a recently published QoS-QoE forecasting dataset and compare the performance of HRT and of the classical Regression Tree (RT) from different perspectives: modelling capability, training time and memory required. The rest of the paper is organized as follows: in Section 2 we summarize the key aspects of the HRT model; in Section 3 we describe the experimental setup, highlighting the different learning schemes being compared and the evaluation strategies. Section 4 reports the experimental results, whereas in Section 5 we analyze the robustness of the HRT model to the hyperparameter configuration. Finally, Section 6 draws some conclusions.

2. Hoeffding Regression Tree: background
HDT, also known as "Very Fast Decision Tree" [7], is a reference model for solving classification problems over an input data stream. In a nutshell, it grows a binary decision tree incrementally: a leaf is considered for a split only if it contains a minimum number of samples and a condition based on Hoeffding's theorem is met. The theorem guarantees, within a certain level of confidence, that the selected attribute would have been the same in the case of an infinite number of available samples. In the case of classification, the condition is met when the difference between the two highest values of the information gain computed for the attributes available at the leaf node is higher than a bound, dubbed the Hoeffding bound. Although the adoption of Hoeffding's theorem in relation to the splitting criterion has received some criticism [9], HDT generally provides satisfactory results and can be regarded as a valid heuristic method. HRT is an adaptation of HDT for solving regression problems over an input data stream.
Unlike its classification counterpart, HRT relies on the reduction of variance of the target variable to decide among the splitting candidates. Let $\Delta\mathrm{Var}(a)$ and $\Delta\mathrm{Var}(b)$ be the reduction of variance associated with the best and the second best splitting attribute, respectively. The Hoeffding condition, for a leaf node L, is defined as follows:

$$\frac{\Delta\mathrm{Var}(b)}{\Delta\mathrm{Var}(a)} < 1 - \varepsilon_L \qquad (1)$$

and the term $\varepsilon_L$, i.e., the Hoeffding bound for the leaf node L, is evaluated according to the following equation:

$$\varepsilon_L = \sqrt{\frac{\ln(1/\delta)}{2 N_L}} \qquad (2)$$

where $\delta$ (split confidence) is equal to 1 minus the desired probability of choosing the correct attribute, and $N_L$ is the number of samples in node L. The value assigned to a leaf node is the average of the target values of the training samples it contains and, given an incoming input sample, is used to predict the output at inference time. Like any tree-based model, HRT features a high level of interpretability, which is a crucial requirement in many applications, including those within next generation wireless networks. Thus, we adopt HRT for tackling our QoE forecasting problem, leveraging the implementation available in the scikit-multiflow library [10].

3. Experimental analysis
In this section, we first introduce the problem and the dataset; then, we describe the models and learning schemes involved in the experimental comparison. Finally, we provide details about the experimental setup.

3.1. Problem description: the QoE forecasting dataset
As the scenario of our investigation, we consider the publicly available QoS-QoE forecasting dataset², introduced in one of our previous works [4]. A client-server video-streaming application is simulated within Simu5G [11], a dedicated open-source model library for realistic 5G network simulations: while experiencing the video, each of the 15 simulated clients, also referred to as user equipment (UE), measures or collects a set of time-tagged QoS and QoE metrics.
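For concreteness, the split test of Eqs. (1) and (2) can be sketched in a few lines of Python. This is an illustrative sketch with function names of our own choosing, not the scikit-multiflow implementation:

```python
import math

def hoeffding_bound(delta: float, n_samples: int) -> float:
    """Hoeffding bound epsilon_L for a leaf observed on n_samples instances (Eq. 2)."""
    return math.sqrt(math.log(1.0 / delta) / (2.0 * n_samples))

def should_split(dvar_best: float, dvar_second: float,
                 delta: float, n_samples: int) -> bool:
    """Split condition of Eq. 1: the ratio between the second-best and the
    best variance reduction must fall below 1 - epsilon_L."""
    return dvar_second / dvar_best < 1.0 - hoeffding_bound(delta, n_samples)
```

With the default split confidence δ = 10⁻⁷ and 200 samples in the leaf, the bound is roughly 0.2, so a split is accepted only if the runner-up attribute achieves less than about 80% of the best variance reduction.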
We formulate the QoE prediction task as in [4], replicating the preprocessing and feature extraction steps. Specifically, a simulation lasts approximately 120 seconds: for each user, during such a time frame, we collect the timeseries related to 12 metrics (QoS, QoE and contextual). Then, we obtain each tuple of the preprocessed dataset as follows: for a timestamp t, the input variables consist of 11 statistics (i.e., mean, median, max, min, variance, standard deviation, kurtosis, skewness, Q1 and Q3, number of samples) measured for each metric in the time window [t − W, t] (with W = 10 s), whereas the output variable consists of the mean of the target QoE metric over a time horizon of one second (i.e., in [t, t + H], with H = 1 s). As the target QoE metric, we consider the average percentage of frames arrived at the time of their display. The subsequent tuple is obtained by sliding the two windows W and H with a step of 1 second. To summarize, each instance in the dataset is represented in R^132, resulting from the 11 statistics evaluated over a window of size W on the 12 timeseries. The 120-second video-streaming simulation is repeated 24 times. We consider the following setting: we aim to learn the mapping between QoS and QoE in order to tackle the QoE forecasting problem. We assume that the data generated by different UEs within a simulation can be gathered for training the model; however, the data from the various simulations are not immediately available but arrive in chunks, each corresponding to one of the 24 simulations. Basically, one can think of the 24 simulations as representing temporally consecutive scenarios in which each of the various UEs experiences, from time to time, different situations. The overall dataset consists of 28758 samples, with a chunk size ranging from 972 to 1466 samples (the variability is induced by the removal of missing values).

² http://www.iet.unipi.it/g.nardini/ai6g_qoe_dataset.html, accessed June 2022
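The tuple-construction step described above can be sketched as follows. This is an illustrative reconstruction with function and column names of our own; the actual pipeline of [4] may differ in details (e.g., it treats the horizon [t, t + H] as half-open here for simplicity):

```python
import pandas as pd

W, H = 10, 1  # past window and forecasting horizon, in seconds

def window_stats(x: pd.Series) -> dict:
    """The 11 per-metric statistics described in the text."""
    return {
        "mean": x.mean(), "median": x.median(), "max": x.max(),
        "min": x.min(), "var": x.var(), "std": x.std(),
        "kurtosis": x.kurtosis(), "skewness": x.skew(),
        "q1": x.quantile(0.25), "q3": x.quantile(0.75),
        "n_samples": len(x),
    }

def make_tuples(df: pd.DataFrame, target: str) -> pd.DataFrame:
    """df: DataFrame of the metric timeseries, indexed by time in seconds.
    Builds one row per timestamp t: features from [t-W, t), target mean
    of `target` over [t, t+H)."""
    rows = []
    for t in range(W, int(df.index.max()) - H + 1):
        past = df[(df.index >= t - W) & (df.index < t)]
        future = df[(df.index >= t) & (df.index < t + H)]
        feats = {f"{m}_{k}": v
                 for m in df.columns
                 for k, v in window_stats(past[m]).items()}
        feats["y"] = future[target].mean()
        rows.append(feats)
    return pd.DataFrame(rows)
```

With the 12 metrics of the dataset, each row would carry 12 × 11 = 132 input features plus the target, matching the R^132 representation above.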
Such a setting demands ad hoc strategies for incremental model training: in the following, we describe two learning schemes, based on the classical RT and on HRT respectively, along with the evaluation strategies adopted for assessing the performance of the models.

3.2. Learning schemes and evaluation strategies
Let chunk_i indicate the chunk of data of the i-th simulation, with i = 1, 2, ..., 24. Each chunk_i contains the samples of all the UEs from the i-th simulation. We compare two learning schemes using two evaluation strategies.

Learning schemes. HRT supports an incremental learning scheme, which consists in updating the model at each incoming chunk. In other words, at each step i the model is updated considering only the current chunk_i. Conversely, the classical RT does not support an incremental learning scheme: the model is retrained from scratch at each newly collected chunk of data. At each step i the previous model is replaced with a new one trained on $\bigcup_{j=1}^{i}$ chunk_j, i.e., the union of the chunks collected so far.

Evaluation strategies. Both learning schemes are evaluated using two approaches, widely adopted in data stream applications. Prequential evaluation, or interleaved-test-then-train, can be formalized as follows: once a new chunk_i is collected (with i = 2, ..., 24), we first assess the performance of the current model on chunk_i and then exploit it to train/update the model. For example, the first evaluation step consists in using the first chunk (chunk_1) for training and chunk_2 for testing. Hold-out evaluation consists in assessing the performance of the model, after updating it with each chunk_i, on a fixed test set. To carry out this experiment, we assume that 4 chunks are immediately available as test set (specifically: chunk_21, chunk_22, chunk_23, and chunk_24). At each step of the analysis, the updated model is always tested on the same data.
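The retraining scheme with prequential evaluation can be sketched as follows (illustrative code with names of our own; the incremental HRT scheme would replace the refit with a single `partial_fit` update on the current chunk only, using scikit-multiflow's `HoeffdingTreeRegressor`):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

def prequential_rt(chunks, max_depth=10):
    """Retraining scheme, prequential evaluation: each new chunk is first
    used to test the current model, then the RT is refit from scratch on
    the union of all chunks collected so far."""
    maes, Xs, ys = [], [], []
    model = None
    for X, y in chunks:
        if model is not None:
            maes.append(mean_absolute_error(y, model.predict(X)))  # test-then-train
        Xs.append(X)
        ys.append(y)
        model = DecisionTreeRegressor(max_depth=max_depth).fit(
            np.vstack(Xs), np.concatenate(ys))
    return maes

# Incremental HRT sketch: instead of refitting on the union, the update step
# would be a single call such as model.partial_fit(X, y) on the current chunk.
```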
To summarize, in our experimental campaign, we refer to the various approaches using the following notation:
• HRT-preq indicates the HRT model, i.e., the incremental learning scheme, evaluated using the prequential strategy.
• HRT-hold-out indicates the HRT model, i.e., the incremental learning scheme, evaluated using the hold-out strategy.
• RT-preq indicates the RT model, i.e., the retraining learning scheme, evaluated using the prequential strategy.
• RT-hold-out indicates the RT model, i.e., the retraining learning scheme, evaluated using the hold-out strategy.

3.3. Experimental setup
Both HRT and the classical RT have publicly available Python implementations: HRT is available in scikit-multiflow³, whereas the classical RT is implemented in scikit-learn⁴. Tables 1 and 2 report the values of the main configuration parameters for the HRT and RT models. For the former, we adopt the default parameter configuration, whereas for the latter we set the parameters coherently with our previous study [4], pursuing a fair comparison of the results. We executed our experiments on a computer featuring an x86_64 architecture, 16 cores, Intel Xeon Processor (Cascadelake) - 2.194 GHz and 32 GB RAM.

Table 1
HRT configuration parameters.
Parameter          Value
Split confidence   10^-7
Tie threshold      0.05
Grace period       200

Table 2
RT configuration parameters.
Parameter          Value
Max. depth         10
Split criterion    MSE
Min_samples_split  0.01
Min_samples_leaf   0.001

4. Experimental Results
In this section, we report the results of our experimental analysis from a threefold perspective: regression metrics, memory used, and time for learning/updating the model.

³ https://scikit-multiflow.readthedocs.io/en/latest/api/generated/skmultiflow.trees.HoeffdingTreeRegressor.html, accessed June 2022
⁴ https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html, accessed June 2022

4.1.
Regression metrics and model complexity
In the following, we compare the HRT models with their RT counterparts, considering the prequential and the hold-out evaluation strategies independently. Figure 1 shows the trends of the Mean Absolute Error (MAE) along the sequence of processed chunks considering the two evaluation strategies, namely hold-out (Fig. 1a) and prequential (Fig. 1b). In the hold-out setting, two "offline" versions of the decision tree described in [4] are considered as reference baselines. These models are generated considering a global training dataset composed of all training chunks (i.e., chunk_1 to chunk_20). The results of these "offline" decision trees (RT-offline-5 and RT-offline-10, induced by setting the maximum depth to 5 and 10, respectively) are obtained by evaluating the models on the same hold-out test set made up of the last 4 chunks of the dataset. In general, we can observe that the RT models outperform their HRT counterparts along the whole model updating process in streaming. As regards the HRT-hold-out model (Fig. 1a), after an initial "start-up" phase corresponding more or less to the first five chunks, it reaches a performance plateau on the hold-out test set, approaching, but unfortunately not reaching, the performance of the baseline models. As for the prequential evaluation strategy (Fig. 1b), HRT-prequential closely trails the performance of RT-prequential: again, however, re-training the traditional model leads to consistently superior performance compared to the incremental training of HRT.

Figure 1: MAE metrics measured on the test set. (a) Hold-out evaluation strategy; (b) prequential evaluation strategy.

In the following, we discuss in detail the trend of the complexity of both the HRT and RT models. As the training stage is analogous between the hold-out and prequential settings (at least for the first 20 chunks), we just consider the former, but the same considerations apply to the latter.
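As a side note, the complexity measures we track (nodes, leaves, depth and number of selected features, reported in Fig. 2 and in Tables 3 and 4) can be extracted from a fitted scikit-learn tree with a small helper; the helper name is ours:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def tree_complexity(model: DecisionTreeRegressor) -> dict:
    """Nodes, leaves, depth and number of distinct features actually used
    by a fitted scikit-learn decision tree."""
    t = model.tree_
    n_leaves = int(np.sum(t.children_left == -1))  # leaves have no children
    used = np.unique(t.feature[t.feature >= 0])    # internal nodes only
    return {"nodes": int(t.node_count), "leaves": n_leaves,
            "max_depth": int(model.get_depth()),
            "features_selected": len(used)}
```

Since scikit-learn trees are binary, the number of nodes always equals twice the number of leaves minus one, a relation visible in Tables 3 and 4.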
Figure 2 shows the trends of the complexity of HRT-hold-out and RT-hold-out along the sequence of processed chunks. As regards the number of nodes (Fig. 2a) and the number of leaves (Fig. 2b), it is worth noticing that RT-hold-out is always more complex than HRT-hold-out, up to one order of magnitude. We can observe that the RT models entail a large number of nodes even at the first chunks, while the number of nodes in the HRT models keeps increasing almost linearly with the number of chunks. The depth of the tree (Fig. 2c) is relevant as well, since it is associated with the maximum number of conditions in the antecedent of the rules that can be extracted from the trees: we can observe that HRT-hold-out reaches the same depth as RT-hold-out only at the end of the stream of chunks. We recall that, to obtain an easier comparison, we constrained the RT models to a maximum depth equal to 10, as per the best model found in our previous study [4].

Figure 2: Hold-out models complexity. (a) Number of nodes; (b) number of leaves; (c) tree depth.

Tables 3 and 4 report the performance of the models for the hold-out and prequential evaluation strategies, respectively, after training the models up to the final available chunk (from chunk_1 to chunk_20 in the case of hold-out, and from chunk_1 to chunk_23 in the case of prequential). The performance of the models is measured in terms of Mean Squared Error (MSE), MAE, and coefficient of determination (R²). Furthermore, we report the complexity of each model measured in terms of number of nodes, leaves, maximum depth, and number of features selected by the induced tree.

Table 3
Global results and model complexity for the HRT-hold-out and RT-hold-out approaches. Regression is evaluated on the test sets. The different baselines refer to models where we fixed the maximum depth to 5 (i.e., RT-offline-5) and 10 (i.e., RT-offline-10, the best result from [4]).
model          | Regression metrics      | Model complexity
               | MSE    MAE    R²        | Nodes  Leaves  Features selected  Max depth
HRT-hold-out   | 0.120  0.285  0.300     | 67     34      22                 10
RT-hold-out    | 0.102  0.242  0.407     | 303    152     65                 10
RT-offline-10  | 0.102  0.242  0.407     | 303    152     65                 10
RT-offline-5   | 0.111  0.266  0.357     | 57     29      18                 5

Obviously, the results obtained with the RT-hold-out learning scheme after processing the final chunk (i.e., chunk_20) are equivalent to those obtained with the more complex of the two baselines, namely RT-offline-10: in fact, the last step of the RT-hold-out strategy is essentially the same scenario as the baseline, where the whole dataset (from chunk_1 to chunk_20) is used for training. In general, the results confirm that the HRT-hold-out and HRT-preq strategies perform worse in terms of MAE, MSE, and R² than their RT counterparts. However, the HRT models feature the lowest complexity in terms of number of nodes, number of leaves, number of selected features and maximum depth of the trees, thus ensuring a higher level of interpretability than the RT models.

Table 4
Global results and model complexity for the HRT and RT prequential (preq) approaches. Regression metrics are evaluated on the last testing chunk (i.e., chunk_24).

model     | Regression metrics      | Model complexity
          | MSE    MAE    R²        | Nodes  Leaves  Features selected  Max depth
HRT-preq  | 0.122  0.283  0.264     | 77     39      22                 9
RT-preq   | 0.111  0.256  0.332     | 305    153     56                 10

Figures 3 and 4 report examples of QoE test timeseries for different UEs, overlapping the ground truth with the values predicted by the different models on the test datasets, after processing the last available chunk of training data. The visual analysis suggests that the different models provide reasonable predictions under different conditions; in particular, the HRT-based models show a worse predictive performance than their RT counterparts, possibly due to their lower complexity.
Figure 3: Real and predicted values of QoE for an example UE of the test set: hold-out evaluation strategy. (a) HRT-hold-out; (b) RT-hold-out.

To summarize, the regression performance of the HRT models is 10–26% worse (depending on whether the MAE, MSE or R² metric is considered) than that of the corresponding RT models. However, this decrease in performance is counterbalanced by the learning time and memory usage, aspects of utmost importance in a streaming scenario; for this reason, they are detailed in the following.

4.2. Memory occupancy
Figure 5 shows the training set sizes (i.e., the number of samples) used for updating the RT and HRT models when processing a new chunk of data. We just discuss the prequential setting: in the hold-out strategy, the learning phase is analogous and the same considerations apply. As expected, the memory occupancy of the RT model rapidly exceeds that of the HRT model: in a real case with a massive input data stream, this would lead to very large training sets, thus making the retraining learning scheme an impractical and computationally intensive approach.

Figure 4: Real and predicted values of QoE for one example UE of the test set: prequential evaluation strategy. (a) HRT-prequential; (b) RT-prequential.

Figure 5: Training set sizes during the model updating process in streaming.

4.3. Time for model updating
Figure 6 reports the trends of the updating times for the RT and HRT models. Also in this case, we just discuss the prequential setting. The plot shows how, after about 22 chunks, the time for RT learning exceeds the time for HRT learning. This is relevant because, for the HRT models, we need to reduce as much as possible the dependency of the training time on the number of chunks, with the aim of ensuring minimum latency in an operative real-time application.
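The growing update cost of the retraining scheme can be reproduced with a simple timing harness like the following (an illustrative sketch, not the code behind Fig. 6; chunk sizes and names are ours):

```python
import time

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def retraining_times(chunks, max_depth=10):
    """Wall-clock time of each RT update: at step i the model is refit from
    scratch on the union of chunks 1..i, so the fitting cost grows with i."""
    times, Xs, ys = [], [], []
    for X, y in chunks:
        Xs.append(X)
        ys.append(y)
        t0 = time.perf_counter()
        DecisionTreeRegressor(max_depth=max_depth).fit(
            np.vstack(Xs), np.concatenate(ys))
        times.append(time.perf_counter() - t0)
    return times
```

For the incremental scheme, by contrast, the per-step cost depends only on the size of the current chunk, which is why the HRT curve in Fig. 6 stays roughly flat.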
In addition, for the RT model we can observe an almost linear relationship between the number of chunks and the learning time: in fact, a simple linear fitting of the trend related to the RT-preq model yields R² = 0.99 and p-value = 3.88e−23. On the other hand, the HRT models do not show a strongly increasing trend as a function of the chunk number; their update time depends instead on the size, namely the number of samples, of each single chunk.

Figure 6: Training times of RT-prequential and HRT-prequential.

5. HRT sensitivity analysis
In this section we analyze the sensitivity of the HRT models with respect to two aspects: the parameter setting and the order of the chunks in the streaming process.

5.1. Sensitivity with respect to parameter configuration
We analyzed the suitability of the default configuration of the HRT training in terms of model parameters, reported in Table 1. In particular, we compared the MAE values for different values of the grace period and of the tie threshold, after the whole dataset has been incrementally processed. We aim to analyze whether the default values of the model parameters are a "robust" choice, ensuring good performance. We recall that the grace period defines a threshold on the number of instances contained in a leaf node before considering it for a split, whereas the tie threshold is a threshold below which a split is forced in order to break ties. Intuitively, lower values of the grace period and higher values of the tie threshold foster easier node splitting, thus leading to more complex trees. Notably, hyperparameter tuning for finding the optimal parameter configuration is not viable in an operative scenario, where the model cannot rely on a static dataset but rather learns from an incoming data stream. Figures 7 and 8 show the MAE values on the test set for the HRT models.
It is worth highlighting that the value of the metric measured at the end of the training conveys only partial insight into the behaviour of the model, but can still be considered a proxy for the quality of the parameter configuration. From the heatmaps, we can observe a slight indication of the presence of a better-performing area, in the bottom right of the plot, where the grace period and the tie threshold take values greater than 300 and 0.08, respectively. The boxplots show how, for the default configuration (grace period = 200 and tie threshold = 0.05), the MAE score, even if not optimal, lies below the median values for both evaluation strategies.

Figure 7: MAE results on the test set for different choices of model parameters for the HRT models. (a) HRT-hold-out; (b) HRT-prequential.

Figure 8: Boxplot of MAE values on the test set for different choices of model parameters for the HRT models. The value obtained with the default configuration is marked with a blue dot. (a) HRT-hold-out; (b) HRT-prequential.

5.2. Sensitivity with respect to chunk order
In HRT, the initial structure of the model (e.g., the root) is determined based on the initial chunks and cannot be reassessed subsequently. As a consequence, the order of the chunks may impact the performance of the model throughout the whole data stream. To quantify the performance variation, we performed ten tests in which we randomly shuffled the order of the input chunks. Figure 9 shows the MAE values for the last test dataset (i.e., after processing chunk_23 for the prequential strategy and chunk_20 for the hold-out strategy), suggesting that the order of the input chunks does not significantly affect the resulting performance: the maximum (minimum) MAE values differ by about 12% (5%) from the median value of the distribution.
Such variability is not negligible in absolute terms, but it is still comparable to the variations of the MAE values we observe, for instance, during model training in the prequential case (see Fig. 1b), after the first five "start-up" chunks.

Figure 9: Boxplot of MAE values for ten repetitions of the experiment with different random shufflings of the chunks. Values obtained with the default order (analyzed in the rest of the paper) are marked with a blue dot. (a) HRT-prequential; (b) HRT-hold-out.

6. Conclusion
In this paper, we have discussed an application of streaming methods to a realistic 5G network simulation for QoE forecasting. We applied a Hoeffding Decision Tree for data stream regression to predict the incoming QoE, and we compared the results with those of standard regression trees. From the results, we observed that the HRT models prove to be better in terms of memory usage and learning time, at the cost of a worse accuracy than the RT models. However, we experimentally highlighted how the HRT models, after an initial start-up phase in which the model complexity increases, approach the performance of standard RT models of comparable complexity. This can be explained by the kind of strategy used by the Hoeffding Decision Tree: by construction, the structure of the tree is strongly affected by the initial input data, and the resulting tree is typically shallower than "traditional" decision trees. These considerations represent the initial steps for future works, where ad hoc methods could be designed to take the discussed shortcomings into account. In particular, a further study will aim to shed some light on the relationship between complexity and performance in the streaming approaches, by refining the tree updating strategy and investigating techniques to select appropriate parameter configurations. Furthermore, we plan to assess whether concepts from fuzzy set theory can help improve the performance of HRT models in this kind of application.
Acknowledgments
This work has been partly funded by the Italian Ministry of University and Research (MIUR), in the framework of the Cross-Lab project (Departments of Excellence) and of PON 2014-2021 "Research and Innovation", DM MUR 1062/2021, Project title: "Progettazione e sperimentazione di algoritmi di federated learning per data stream mining" (design and experimentation of federated learning algorithms for data stream mining), and by the EU Commission through the H2020 project Hexa-X (Grant no. 101015956).

References
[1] V. Vasilev, J. Leguay, S. Paris, L. Maggi, M. Debbah, Predicting QoE Factors with Machine Learning, in: 2018 IEEE Int'l Conf. on Communications (ICC), 2018, pp. 1–6. doi:10.1109/ICC.2018.8422609.
[2] K. Sheth, K. Patel, H. Shah, S. Tanwar, R. Gupta, N. Kumar, A taxonomy of AI techniques for 6G communication networks, COMPUT COMMUN 161 (2020) 279–303. doi:10.1016/j.comcom.2020.07.035.
[3] A. Renda, P. Ducange, G. Gallo, F. Marcelloni, XAI Models for Quality of Experience Prediction in Wireless Networks, in: 2021 IEEE Int'l Conf. on Fuzzy Systems (FUZZ-IEEE), 2021, pp. 1–6. doi:10.1109/FUZZ45933.2021.9494509.
[4] J. L. Corcuera Bárcena, P. Ducange, F. Marcelloni, G. Nardini, A. Noferi, A. Renda, G. Stea, A. Virdis, Towards Trustworthy AI for QoE prediction in B5G/6G Networks, in: First Int'l Workshop on Artificial Intelligence in Beyond 5G and 6G Wireless Networks (AI6G 2022), (accepted).
[5] J. Gama, I. Žliobaitė, A. Bifet, M. Pechenizkiy, A. Bouchachia, A Survey on Concept Drift Adaptation, ACM Comput. Surv. 46 (2014). doi:10.1145/2523813.
[6] High-Level Expert Group on AI, Ethics Guidelines for Trustworthy AI, Technical Report, European Commission, 2019. https://ec.europa.eu/digital-single-market/en/news/ethics-guidelines-trustworthy-ai.
[7] P. Domingos, G. Hulten, Mining high-speed data streams, in: Proc. of the Sixth ACM SIGKDD Int'l Conf. on Knowledge Discovery and Data Mining, 2000, pp. 71–80.
[8] P. Ducange, F. Marcelloni, R.
Pecori, Fuzzy Hoeffding Decision Tree for Data Stream Classification, INT J COMPUT INT SYS 14 (2021) 946–964. doi:10.2991/ijcis.d.210212.001.
[9] L. Rutkowski, L. Pietruczuk, P. Duda, M. Jaworski, Decision Trees for Mining Data Streams Based on the McDiarmid's Bound, IEEE T KNOWL DATA EN 25 (2013) 1272–1279. doi:10.1109/TKDE.2012.66.
[10] J. Montiel, J. Read, A. Bifet, T. Abdessalem, Scikit-multiflow: A multi-output streaming framework, J MACH LEARN RES 19 (2018) 2915–2914.
[11] G. Nardini, D. Sabella, G. Stea, P. Thakkar, A. Virdis, Simu5G–An OMNeT++ Library for End-to-End Performance Evaluation of 5G Networks, IEEE Access 8 (2020) 181176–181191. doi:10.1109/ACCESS.2020.3028550.