
First Workshop on Online Learning from Uncertain Data Streams, July

Hoeffding Regression Trees for Forecasting Quality of Experience in B5G/6G Networks

José Luis Corcuera Bárcena

joseluis.corcuera@phd.unipi.it

Pietro Ducange

pietro.ducange@unipi.it

Francesco Marcelloni

francesco.marcelloni@unipi.it

Alessandro Renda

alessandro.renda@unipi.it

Fabrizio Ruffini

fabrizio.ruffini@ing.unipi.it

Department of Information Engineering, University of Pisa, Largo Lucio Lazzarino 1, 56122 Pisa, Italy

2022


Online data stream analysis is becoming increasingly relevant, as the focus of everyday data analysis shifts from offline processing to real-time acquisition and modeling of massive data from remote devices. In this paper, we focus on the domain of telecommunications, in particular on video streaming services for moving devices (e.g., a passenger enjoying a movie during a car trip). Since the streaming service must provide a satisfactory level of quality of experience to the user, it is important to predict upcoming problems in video quality. We used the well-known Hoeffding Decision Tree (HDT) for streaming data, tailored to regression problems, and compared it with standard Regression Trees (RTs) to evaluate the potential of HDTs to forecast the quality of experience in terms of accuracy, learning time, and memory used. Results show that, during the online learning process, the standard RT outperforms HDT in terms of accuracy, but is prone to under-performance in terms of time and memory when applied to potentially massive data streaming scenarios.

Keywords: Data Stream Mining, Regression Tree, QoE forecasting, Explainable AI, Hoeffding Decision Tree

1. Introduction

Quality of Experience (QoE) is a measure of end-user satisfaction in enjoying a service and is typically used in the context of telecommunications [1]. The fulfillment of QoE metrics is a primary goal in current, i.e., fourth and fifth generations, and future mobile networks. Beyond 5G (B5G) and 6G networks are indeed currently under development as pointed out, for instance, by the commitment of institutions, industry and academia in the framework of international projects such as Hexa-X. Such next generation wireless networks are expected to be much more complex than current ones and will support innovative functionalities such as holographic communication, high precision manufacturing, and smart automotive applications [2]. Notably, the capability to play high-definition videos in real-time may represent a key enabler toward such new functionalities. Thus, being able to forecast the perceived quality of video experience may be fundamental to avoid the degradation of end-users’ satisfaction or to determine whether a specific functionality should be provided or not.

In the context of video streaming services, QoE metrics include startup delay, rebuffering events and video quality [1], which clearly depend on contextual factors and typically vary over time; several works [1, 3, 4] have recently addressed the QoE prediction task by exploiting Machine Learning (ML) techniques and leveraging Quality of Service (QoS) metrics, i.e., quantitative measures that characterize the service offered by the network, such as packet loss and channel quality. Interestingly, only one of these works [4] has framed the issue of QoE prediction as a timeseries forecasting problem, yet disregarding important challenges of data stream mining: the whole dataset is typically not available for offline processing and the distribution of data may change over time due to a phenomenon known as concept drift [5], making it essential to adapt the model to avoid performance degradation.

In the last decades, various approaches for incremental learning of ML models have been proposed; here, we focus on the field of eXplainable Artificial Intelligence (XAI) and specifically on a class of inherently interpretable models, capable of explaining, by design, how decisions have been taken. Indeed, transparency (i.e., the capability of understanding the structure of the model itself) represents a key requirement towards trustworthy AI [6], which in turn is deemed a major pillar in the design of next generation wireless networks. In this framework, the Hoeffding Decision Tree (HDT) [7] represents a reference approach: it has been widely exploited for both classification and regression tasks. In the context of classification tasks, HDT has also recently been extended with fuzziness to handle vague and noisy data and enhance interpretability [8].

In this paper, we present a preliminary experimental evaluation of the Hoeffding Regression Tree (HRT) for a QoE forecasting task in the frame of next generation wireless networks: specifically, we rely on a recently published QoS-QoE forecasting dataset and compare the performance of HRT and the classical Regression Tree (RT) from different perspectives: modelling capability, training time and memory required.

The rest of the paper is organized as follows: in Section 2 we summarize the key aspects of the HRT model; in Section 3 we describe the experimental setup, highlighting the different learning schemes being compared and the evaluation strategies. Section 4 reports the experimental results, whereas in Section 5 we analyze the robustness of the HRT model to hyperparameter configuration. Finally, Section 6 draws some conclusions.

2. Hoeffding Regression Tree: background

HDT, also known as “Very Fast Decision Tree” [7], is a reference model to solve classification problems over an input data stream. In a nutshell, it allows growing a binary decision tree incrementally: a leaf is considered for a split only if it contains a minimum number of samples and a condition based on Hoeffding’s theorem is met. The theorem guarantees, within a certain level of confidence, that the selected attribute would have been the same in the case of an infinite number of available samples. In the case of classification, the condition is met when the difference between the two highest values of the information gains computed for the attributes available at the leaf node is higher than a bound, dubbed the Hoeffding bound. Although the adoption of Hoeffding’s theorem in relation to the splitting criterion has received some criticism [9], HDT generally provides satisfactory results and can be regarded as a valid heuristic method.

HRT represents an adaptation of HDT to solve regression problems given an input data stream. Unlike its classification counterpart, HRT relies on calculating the reduction of variance of the target variable to decide among the splitting candidates. Let $\Delta\mathrm{Var}(A_1)$ and $\Delta\mathrm{Var}(A_2)$ be the reduction of variance associated with the best and the second best splitting attribute, respectively. The Hoeffding condition, for a leaf node $L$, is defined as follows:

$$\frac{\Delta\mathrm{Var}(A_2)}{\Delta\mathrm{Var}(A_1)} < 1 - \epsilon_L \qquad (1)$$

and the term $\epsilon_L$, i.e., the Hoeffding bound for the leaf node $L$, is evaluated according to the following equation:

$$\epsilon_L = \sqrt{\frac{\ln(1/\delta)}{2 n_L}} \qquad (2)$$

where $\delta$ (split confidence) is equal to 1 minus the desired probability of choosing the correct attribute, and $n_L$ is the number of samples in node $L$.
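As an illustration of the split test just described, the following minimal Python sketch computes a Hoeffding bound of the form sqrt(ln(1/delta) / (2 n)) and checks whether the ratio between the second best and the best variance reduction falls below 1 minus that bound. The function names and the numeric values in the usage note are illustrative assumptions and are not taken from the scikit-multiflow implementation.

```python
import math


def hoeffding_bound(delta: float, n_samples: int) -> float:
    """Hoeffding bound: sqrt(ln(1/delta) / (2 * n_samples))."""
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie in (0, 1)")
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    return math.sqrt(math.log(1.0 / delta) / (2.0 * n_samples))


def hoeffding_condition(best_gain: float, second_gain: float,
                        delta: float, n_samples: int) -> bool:
    """True when second_gain / best_gain < 1 - bound.

    best_gain and second_gain are the variance reductions of the best and
    second best splitting attributes observed at the leaf.
    """
    if best_gain <= 0.0:
        # No attribute reduces the variance: nothing to split on.
        return False
    ratio = second_gain / best_gain
    return ratio < 1.0 - hoeffding_bound(delta, n_samples)
```

With the illustrative values delta = 1e-7 and 200 samples in the leaf, the bound is about 0.20, so a second best attribute achieving 70% of the best variance reduction would allow a split, whereas one achieving 90% would not.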

The value assigned to a leaf node is the average of the target values of the training samples contained in the leaf node, and, given an incoming input sample, is used to predict the output at inference time. Like any tree-based model, HRT features a high level of interpretability, which is a crucial requirement in many applications, including those within next generation wireless networks. Thus, we adopt HRT for tackling our QoE forecasting problem, leveraging an implementation available in the scikit-multiflow library [10].

3. Experimental analysis

In this section, we first introduce the problem and the dataset; then, we describe the models and learning schemes involved in the experimental comparison. Finally, we provide details about the experimental setup.

3.1. Problem description: the QoE forecasting dataset

As the scenario of our investigation, we consider the publicly available QoS-QoE forecasting dataset², introduced in one of our previous works [4]. A client-server video-streaming application is simulated within Simu5G [11], a dedicated open-source model library for realistic 5G network simulations: while experiencing the video, each of the 15 simulated clients, also referred to as user equipment (UE), measures or collects a set of time-tagged QoS and QoE metrics. We formulate the QoE prediction task as in [4], replicating the preprocessing and feature extraction steps. Specifically, a simulation lasts approximately 120 seconds: for each user, during such time frame, we collect the timeseries related to 12 metrics (QoS, QoE and contextual). Then, we obtain any tuple of the preprocessed dataset as follows: for a timestamp $t$, the input variables consist of 11 statistics (i.e., mean, median, max, min, variance, standard deviation, kurtosis, skewness, Q1 and Q3, number of samples) measured for each metric in the time window $[t - W, t]$ (with $W = 10$), whereas the output variable consists of the mean of the target QoE metric over the time horizon of one second (i.e., in $[t, t + H]$, with $H = 1$). As the target QoE metric, we consider the average percentage of frames arrived at the time of their display. The subsequent tuple is obtained by sliding the two windows $W$ and $H$ with a step of 1 second. To summarize, each instance in the dataset is represented in $\mathbb{R}^{132}$, resulting from 11 statistics evaluated over a window of size $W$ on 12 timeseries. The 120-second video-streaming simulation is repeated 24 times.

² http://www.iet.unipi.it/g.nardini/ai6g_qoe_dataset.html, accessed June 2022
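The windowing procedure described above can be sketched in Python as follows. This is only an illustration under simplifying assumptions that are not stated in the paper: each metric is taken to be regularly sampled once per second, windows are implemented as half-open slices, variance and higher moments are population estimates, and kurtosis is reported as excess kurtosis. The function and argument names are hypothetical.

```python
import math
import statistics


def window_statistics(values):
    """Return 11 statistics for one metric over one window, in the order:
    mean, median, max, min, variance, std, kurtosis, skewness, Q1, Q3, count."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two samples are needed in a window")
    mean = statistics.fmean(values)
    var = statistics.pvariance(values, mu=mean)
    std = math.sqrt(var)
    if std > 0.0:
        # Standardized third and fourth central moments (assumed definitions).
        skew = sum((v - mean) ** 3 for v in values) / (n * std ** 3)
        kurt = sum((v - mean) ** 4 for v in values) / (n * std ** 4) - 3.0
    else:
        # Constant window: higher moments are set to zero by convention here.
        skew = kurt = 0.0
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return [mean, statistics.median(values), max(values), min(values),
            var, std, kurt, skew, q1, q3, float(n)]


def build_tuples(metrics, target, window=10, horizon=1, step=1):
    """Slide a feature window and a target horizon over per-second series.

    metrics: dict mapping a metric name to its list of per-second values.
    target:  list of per-second values of the target QoE metric.
    Returns a list of (features, label) pairs.
    """
    rows = []
    for t in range(window, len(target) - horizon + 1, step):
        features = []
        for name in sorted(metrics):
            features.extend(window_statistics(metrics[name][t - window:t]))
        label = statistics.fmean(target[t:t + horizon])
        rows.append((features, label))
    return rows
```

With 12 metrics, each feature vector has 11 x 12 = 132 entries, matching the dimensionality reported above.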

We consider the following setting: we aim to learn the mapping between QoS and QoE in order to tackle the QoE forecasting problem. We assume that the data generated by different UEs within a simulation can be gathered for training the model; however, the data from the various simulations are not immediately available but arrive in chunks, each corresponding to one of the 24 simulations. Basically, one can think of the 24 simulations as representing temporally consecutive scenarios in which each of the various UEs experiences, from time to time, different situations. The overall dataset consists of 28758 samples, with a chunk size ranging from 972 to 1466 samples (the variability is induced by the removal of missing values). Such a setting calls for ad-hoc strategies for incremental model training: in the following, we describe two learning schemes based on classical RT and HRT models, respectively, along with the evaluation strategies adopted for assessing the performance of the models.

3.2. Learning schemes and evaluation strategies

Let $\mathit{chunk}_i$ indicate the chunk of data of the $i$-th simulation, with $i = 1, 2, \ldots, 24$. Each chunk contains the samples of all the UEs from the $i$-th simulation. We compare two learning schemes using two evaluation strategies.

Learning schemes. HRT supports an incremental learning scheme: it consists in updating the model at each incoming chunk. In other words, at each step $i$ the model is updated considering only the current chunk $\mathit{chunk}_i$. Conversely, the classical RT does not support an incremental learning scheme: the model is retrained from scratch at each newly collected chunk of data. At each step $i$ the previous model is replaced with a new one trained on $\bigcup_{j=1}^{i} \mathit{chunk}_j$, i.e., the union of the chunks collected so far.
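The difference between the two learning schemes can be made concrete by looking at the data each one consumes at every step. The sketch below is purely illustrative (the helper names are hypothetical): the incremental scheme only sees the current chunk, whereas the retraining scheme sees the union of all chunks collected so far.

```python
def incremental_training_sets(chunks):
    """Yield the data used at each step by the incremental scheme:
    only the chunk that has just arrived."""
    for chunk in chunks:
        yield list(chunk)


def retraining_training_sets(chunks):
    """Yield the data used at each step by the retraining scheme:
    the union of all chunks collected so far."""
    seen = []
    for chunk in chunks:
        seen.extend(chunk)
        yield list(seen)


def training_set_sizes(chunks):
    """Return the per-step training set sizes of both schemes."""
    incremental = [len(s) for s in incremental_training_sets(chunks)]
    retraining = [len(s) for s in retraining_training_sets(chunks)]
    return incremental, retraining
```

For chunks of roughly constant size, the training set of the retraining scheme grows linearly with the number of processed chunks, while that of the incremental scheme stays bounded by the chunk size.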

Evaluation strategies. Both learning schemes are evaluated using two approaches, widely adopted in data stream applications. Prequential evaluation, or interleaved-test-then-train, can be formalized as follows: once a new chunk $\mathit{chunk}_i$ is collected (with $i = 2, \ldots, 24$), we first assess the performance of the current model on $\mathit{chunk}_i$ and then exploit it to train/update the model. For example, the first evaluation step consists in using the first chunk ($\mathit{chunk}_1$) for training and $\mathit{chunk}_2$ for testing. Hold-out evaluation consists in assessing the performance of the model on a fixed test set after each update with a new chunk. To carry out this experiment, we assume that 4 chunks are immediately available as test set (specifically: $\mathit{chunk}_{21}$, $\mathit{chunk}_{22}$, $\mathit{chunk}_{23}$, and $\mathit{chunk}_{24}$). At each step of the analysis the updated model is always tested on the same data.
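The two evaluation strategies can be sketched as follows for an incremental model. The toy `RunningMeanRegressor` is a hypothetical stand-in for HRT, introduced here only so that the example is self-contained; the `partial_fit`/`predict` method names are an assumption about the model interface.

```python
class RunningMeanRegressor:
    """Toy incremental model (hypothetical): predicts the running mean
    of all target values seen so far."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def partial_fit(self, X, y):
        self.total += sum(y)
        self.count += len(y)
        return self

    def predict(self, X):
        mean = self.total / self.count if self.count else 0.0
        return [mean for _ in X]


def mean_absolute_error(y_true, y_pred):
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)


def prequential(model, chunks):
    """Interleaved test-then-train: train on the first chunk, then test on
    each later chunk before using it to update the model."""
    first_X, first_y = chunks[0]
    model.partial_fit(first_X, first_y)
    scores = []
    for X, y in chunks[1:]:
        scores.append(mean_absolute_error(y, model.predict(X)))
        model.partial_fit(X, y)
    return scores


def hold_out(model, train_chunks, test_chunks):
    """After each update, score the model on the same fixed test set."""
    test_X = [x for X, _ in test_chunks for x in X]
    test_y = [t for _, y in test_chunks for t in y]
    scores = []
    for X, y in train_chunks:
        model.partial_fit(X, y)
        scores.append(mean_absolute_error(test_y, model.predict(test_X)))
    return scores
```

With 24 chunks, the prequential loop yields 23 scores (chunks 2 to 24), while the hold-out loop yields 20 scores, one per training chunk, all measured on chunks 21 to 24.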

To summarize, in our experimental campaign, we refer to the various approaches using the following notation:
• HRT-preq indicates the HRT model, i.e., incremental learning scheme, evaluated using the prequential strategy.
• HRT-hold-out indicates the HRT model, i.e., incremental learning scheme, evaluated using the hold-out strategy.
• RT-preq indicates the RT model, i.e., retraining learning scheme, evaluated using the prequential strategy.
• RT-hold-out indicates the RT model, i.e., retraining learning scheme, evaluated using the hold-out strategy.

3.3. Experimental setup

Both HRT and classical RT have publicly available Python implementations: HRT is available in scikit-multiflow³, whereas the classical RT is implemented in scikit-learn⁴. Tables 1 and 2 report the values of the main configuration parameters for HRT and RT models. For the former, we adopt the default parameter configuration, whereas for the latter we set the parameters consistently with our previous study [4], pursuing a fair comparison of the results.

We executed our experiments on a computer featuring an x86_64 architecture, 16 cores, Intel Xeon Processor (Cascadelake) - 2.194GHz and 32GB RAM.

4. Experimental Results

In this section, we report the results of our experimental analysis from a threefold perspective: regression metrics, memory used, and time for learning/updating the model.

³ https://scikit-multiflow.readthedocs.io/en/latest/api/generated/skmultiflow.trees.HoeffdingTreeRegressor.html, accessed June 2022

⁴ https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html, accessed June 2022

4.1. Regression metrics and model complexity

In the following, we compare the HRT models with their RT counterparts, considering the prequential and the hold-out evaluation strategies independently.

Figure 1 shows the trends of the Mean Absolute Error (MAE) along the sequence of processed chunks considering the two evaluation strategies, namely hold-out (Fig. 1a) and prequential (Fig. 1b). In the hold-out setting, two “offline” versions of the decision tree described in [4] are considered as reference baselines. These models are generated considering a global training dataset composed of all training chunks (i.e., chunk 1 to chunk 20). The results of these “offline” decision trees (RT-offline-5 and RT-offline-10, induced by setting the maximum depth to 5 and 10, respectively) are obtained by evaluating the models on the same hold-out test set made up of the last 4 chunks of the dataset.

In general, we can observe that the RT models outperform their HRT counterparts along the whole model updating process in streaming. As regards the HRT-hold-out model (Fig. 1a), after an initial “start-up” phase corresponding roughly to the first five chunks, it reaches a plateau in performance on the hold-out test set, approaching, but not reaching, the performance of the baseline models.

As for the prequential evaluation strategy (Fig. 1b), HRT-preq closely trails the performance of RT-preq: again, however, retraining the traditional model leads to consistently superior performance compared to incremental training of HRT.

Figure 1: (a) Hold-out evaluation strategy; (b) Prequential evaluation strategy.

In the following, we discuss in detail the trend of the complexity for both the HRT and RT models. As the training stage is analogous in the hold-out and prequential settings (at least for the first 20 chunks), we consider only the former, but the same considerations apply to the latter. Figure 2 shows the trends of the complexity of HRT-hold-out and RT-hold-out along the sequence of processed chunks. As regards the number of nodes (Fig. 2a) and the number of leaves (Fig. 2b), it is worth noticing that RT-hold-out is always more complex than HRT-hold-out, by up to one order of magnitude. We can observe that the RT models entail a large number of nodes even at the first chunks, while the number of nodes in the HRT models keeps increasing steadily, almost linearly with the number of chunks. The depth of the tree (Fig. 2c) is relevant as well, since it is associated with the maximum number of conditions in the antecedent of the rules that can be extracted from the trees: we can observe that HRT-hold-out reaches the same depth as RT-hold-out only at the end of the stream of chunks. We recall that, to ease the comparison, we constrained the RT models to a maximum depth of 10, as per the best model found in the previous study reported in [4].

Figure 2: (a) Number of nodes; (b) Number of leaves; (c) Tree depth.

Tables 3 and 4 report the performance of the models for the hold-out and prequential evaluation strategy, respectively, after training the models up to the final available chunk (from chunk 1 to chunk 20, in the case of hold-out, and from chunk 1 to chunk 23 in the case of prequential). The performance of the models is measured in terms of Mean Squared Error (MSE), MAE, and coefficient of determination (R²). Furthermore, we report the complexity of the model measured in terms of number of nodes, leaves, maximum depth, and number of features selected by the induced tree.

Table 3 (hold-out evaluation strategy). Columns: model; regression metrics (MSE, MAE, R²). Rows: HRT-hold-out, RT-hold-out, RT-offline-10, RT-offline-5.
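For reference, the three regression metrics used in this section (MSE, MAE, and the coefficient of determination) can be computed as in the following minimal sketch. The function names are illustrative and the input values in the usage note are synthetic, not results from the paper.

```python
def mse(y_true, y_pred):
    """Mean Squared Error."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean Absolute Error."""
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((a - b) ** 2 for a, b in zip(y_true, y_pred))
    ss_tot = sum((a - mean_true) ** 2 for a in y_true)
    if ss_tot == 0.0:
        raise ValueError("R2 is undefined for a constant target")
    return 1.0 - ss_res / ss_tot
```

For example, with the synthetic targets [1, 2, 3, 4] and predictions [1, 2, 3, 6], the MSE is 1.0, the MAE is 0.5 and the coefficient of determination is 0.2. Lower is better for MSE and MAE, whereas higher is better for the coefficient of determination.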

Obviously, the results obtained with the RT-hold-out learning scheme after processing the final chunk (i.e., chunk 20) are equivalent to those obtained with the more complex of the two baselines, namely RT-offline-10: in fact, the last step of the RT-hold-out strategy is essentially the same scenario as the baseline strategy where the whole dataset (from chunk 1 to chunk 20) is used for training.

In general, results confirm that the HRT-hold-out and HRT-preq strategies are characterized by a worse performance in terms of MAE, MSE, and R² than their RT counterparts. However, HRT models are characterized by the lowest levels of complexity, in terms of number of nodes, number of leaves, number of selected features and maximum depth of the trees, thus ensuring a higher level of interpretability than RT models.

Figures 3 and 4 report examples of QoE test timeseries for different UEs, overlapping the ground-truth with the predicted values obtained by the different models in the test datasets, after processing the last available chunk of training data. The visual analysis suggests that the different models provide reasonable predictions in different conditions; in particular, the HRT-based models show a worse predictive performance than their RT counterparts, possibly due to their lower complexity.

Figure panels: (a) HRT-hold-out; (b) RT-hold-out.

To summarize, the accuracy of the HRT models is 10-26% lower (depending on whether MAE, MSE, or R² is considered) than that of the corresponding RT models. However, this decrease in performance is counterbalanced by the learning time and memory usage, which are aspects of utmost importance in a streaming scenario; for this reason, they are detailed in the following.

4.2. Memory occupancy

Figure 5 shows the training set sizes (i.e., the number of samples) used for updating both the RT and HRT models when processing a new chunk of data. We discuss only the prequential setting: in the hold-out strategy, the learning phase is analogous and the same considerations apply. As expected, the memory occupancy of the RT model rapidly exceeds that of the HRT model: in a real case with a massive input data stream, this would lead to large training set sizes, thus making the retraining learning scheme an impractical and very computationally intensive approach.

4.3. Time for model updating

Figure 6 reports the trends of the updating times for the RT and HRT models. Also in this case, we discuss only the prequential setting. The plot shows that, after about 22 chunks, the time for RT learning exceeds the time for HRT learning. This is important because, for the HRT models, we need to reduce as much as possible the dependency of the training time on the number of chunks, with the aim of ensuring minimum latency in the operative real-time application. In addition, for the RT model we can observe an almost linear relationship between the number of chunks and the learning time: in fact, a simple linear fit on the trend related to the RT-preq model yields R² = 0.99 and p-value = $3.88 \times 10^{-23}$. On the other hand, the learning time of the HRT models does not show a strong increasing trend as a function of the chunk number, although it can depend on the size, namely the number of samples, of each single chunk.
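The kind of linear fit mentioned above (learning time against number of processed chunks) can be reproduced with an ordinary least-squares sketch such as the following. The function name is hypothetical, the p-value of the fit is not computed here, and no timing data from the paper is reproduced.

```python
def linear_fit(x, y):
    """Ordinary least squares for y = slope * x + intercept.

    Returns (slope, intercept, r2), where r2 is the coefficient of
    determination of the fitted line.
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_tot = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0.0 or ss_tot == 0.0:
        raise ValueError("x and y must not be constant")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((b - (slope * a + intercept)) ** 2 for a, b in zip(x, y))
    return slope, intercept, 1.0 - ss_res / ss_tot
```

Here `x` would hold the chunk indices and `y` the measured learning times; a value of the coefficient of determination close to 1 indicates that the learning time grows almost linearly with the number of chunks.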

5. HRT sensitivity analysis

In this section we analyze the sensitivity of HRT models with respect to two aspects: parameter setting and order of chunks in the streaming process.

5.1. Sensitivity with respect to parameter configuration

We analyzed the suitability of the default configuration of the HRT training in terms of model parameters, reported in Table 1. In particular, we compared the MAE values for different values of the grace period and of the tie threshold, after the whole dataset has been incrementally processed. We aim to analyse whether the default values of the model parameters are a “robust” choice and ensure good performance. We recall that the grace period defines a threshold on the number of instances contained in a leaf node before considering it for a split, whereas the tie threshold is a threshold below which a split will be forced to break ties. Intuitively, lower values of grace period and higher values of tie threshold will foster easier node splitting, thus leading to more complex trees. Notably, hyperparameter tuning for finding an optimal parameter configuration is not viable in an operative scenario, where the model cannot rely on a static dataset but rather learns from an incoming data stream.
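The interplay between grace period and tie threshold can be illustrated with the following sketch of a split decision. It is a simplified reading of the description above, not the scikit-multiflow code: the leaf is evaluated only after `grace_period` new samples, and a split is taken either when the variance-reduction ratio satisfies the Hoeffding condition or when the bound has shrunk below `tie_threshold`. The default `delta` is an illustrative assumption; the other defaults are the values quoted in this section for the default configuration.

```python
import math


def should_split(n_seen, n_at_last_check, best_gain, second_gain,
                 delta=1e-7, grace_period=200, tie_threshold=0.05):
    """Decide whether a leaf should be split (simplified sketch).

    n_seen:          samples observed at the leaf so far.
    n_at_last_check: samples observed when the leaf was last evaluated.
    """
    # Grace period: skip the evaluation until enough new samples arrived.
    if n_seen - n_at_last_check < grace_period:
        return False
    if best_gain <= 0.0:
        return False
    epsilon = math.sqrt(math.log(1.0 / delta) / (2.0 * n_seen))
    clear_winner = second_gain / best_gain < 1.0 - epsilon
    # Tie break: the two candidates are too close to separate, but the
    # bound is already small, so a split is forced anyway.
    tie_break = epsilon < tie_threshold
    return clear_winner or tie_break
```

In this sketch, lowering `grace_period` lets a leaf be evaluated earlier, and raising `tie_threshold` lets the tie-break fire with fewer samples, which is consistent with the intuition that both changes lead to more complex trees.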

Figures 7 and 8 show the MAE values on the test set for the HRT models. It is worth highlighting that the value of the metric measured at the end of the training conveys only a partial insight into the behaviour of the model, but can still be considered a proxy for the quality of the parameter configuration. From the heatmaps, we can observe a slight indication of the presence of a better-performing area, in the bottom right of the plot, where the grace period and the tie threshold have values greater than 300 and 0.08, respectively. The boxplots show that, for the default configuration (grace period = 200 and tie threshold = 0.05), the value of the MAE score, even if not optimal, lies below the median values for both evaluation strategies.

5.2. Sensitivity with respect to chunk order

In HRT the initial structure of the model (e.g., the root) is determined based on the initial chunks and cannot be reassessed subsequently. As a consequence, the order of the chunks may impact the performance of the model throughout the whole data stream. To quantify the performance variation, we performed ten tests where we randomly shuffled the order of the input chunks. Figure 9 shows the MAE values for the last test dataset (i.e., after processing chunk 23 for the prequential strategy and chunk 20 for the hold-out strategy), suggesting that the order of input chunks does not significantly affect the resulting performance: the maximum and minimum MAE values differ from the median value of the distribution by about 12% and 5%, respectively. Such variability is not negligible in absolute terms, but it is still comparable to the variations of MAE values we observe, for instance, during model training in the prequential case (see Fig. 1b), after the first five “start-up” chunks.

Figure 9: (a) HRT-prequential; (b) HRT-hold-out.
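The chunk-order experiment can be sketched as follows: a set of random permutations of the chunk indices is generated with a fixed seed, and the spread of the resulting final MAE values is expressed relative to their median. The helper names and the seed are hypothetical, and the exact shuffling protocol used in the experiments (e.g., whether the hold-out test chunks are kept fixed) is an assumption not detailed here.

```python
import random
import statistics


def shuffled_orders(n_chunks, n_repeats, seed=0):
    """Return n_repeats random permutations of the chunk indices 1..n_chunks."""
    rng = random.Random(seed)
    orders = []
    for _ in range(n_repeats):
        order = list(range(1, n_chunks + 1))
        rng.shuffle(order)
        orders.append(order)
    return orders


def relative_spread(values):
    """Return the relative distance of the maximum and of the minimum
    from the median, as (max - median) / median and (median - min) / median."""
    median = statistics.median(values)
    if median == 0:
        raise ValueError("median must be non-zero")
    return (max(values) - median) / median, (median - min(values)) / median
```

Each permutation would drive one full training run, and `relative_spread` applied to the ten final MAE values gives figures directly comparable with the percentages quoted above.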

6. Conclusion

In this paper, we have discussed an application of streaming methods to a realistic 5G network simulation for QoE forecasting. We applied a Hoeffding Decision Tree for data stream regression to predict incoming QoE, and we compared the results with standard regression trees. From the results, we observed that HRT models are better strategies regarding memory usage and learning time, at the cost of a lower accuracy than RT models. However, we experimentally highlighted how HRT models, after an initial start-up phase where the model complexity increases, approach the performance of the standard RT models with comparable complexity. This can be explained by the kind of strategy used by the Hoeffding Decision Tree: by construction, the structure of the tree is strongly affected by the initial data input, and the resulting tree is typically shallower than “traditional” decision trees. These considerations represent the initial steps for future works, where ad-hoc methods could be designed to take the discussed shortcomings into account. In particular, a further study will aim to shed some light on the relationship between complexity and performance in the streaming approaches, by refining the tree updating strategy and investigating techniques to select appropriate parameter configurations. Furthermore, we plan to assess whether concepts from fuzzy set theory can help improve the performance of HRT models in this kind of application.

Acknowledgments

This work has been partly funded by the Italian Ministry of University and Research (MIUR), in the framework of the Cross-Lab project (Departments of Excellence) and PON 2014-2021 “Research and Innovation”, DM MUR 1062/2021, Project title: “Progettazione e sperimentazione di algoritmi di federated learning per data stream mining” and by the EU Commission through the H2020 projects Hexa-X (Grant no. 101015956).

References

[1] Vasilev, Leguay, S. Paris, L. Maggi, Debbah, Predicting QoE Factors with Machine Learning, in: 2018 IEEE Int'l Conf. on Communications (ICC), 2018, pp. 1-6. doi:10.1109/ICC.2018.8422609.

[2] Sheth, Patel, Shah, Tanwar, Gupta, Kumar, A taxonomy of AI techniques for 6G communication networks, COMPUT COMMUN 161 (2020) 279-303. doi:10.1016/j.comcom.2020.07.035.

[3] Renda, Ducange, Gallo, Marcelloni, XAI Models for Quality of Experience Prediction in Wireless Networks, in: 2021 IEEE Int'l Conf. on Fuzzy Systems (FUZZ-IEEE), 2021, pp. 1-6. doi:10.1109/FUZZ45933.2021.9494509.

[4] J. L. Corcuera Bárcena, Ducange, Marcelloni, Nardini, Noferi, Renda, Stea, Virdis, Towards Trustworthy AI for QoE prediction in B5G/6G Networks, in: First Int'l Workshop on Artificial Intelligence in Beyond 5G and 6G Wireless Networks (AI6G 2022), (accepted).

[5] J. Gama, I. Žliobaitė, Bifet, Pechenizkiy, Bouchachia, A Survey on Concept Drift Adaptation, ACM Comput. Surv. 46 (2014). doi:10.1145/2523813.

[6] European Commission, High Level Expert Group on AI, Ethics Guidelines for Trustworthy AI, Technical Report, 2019. https://ec.europa.eu/digital-single-market/en/news/ethics-guidelines-trustworthy-ai.

[7] Domingos, G. Hulten, Mining high-speed data streams, in: Proc. of the sixth ACM SIGKDD Int'l Conf. on Knowledge discovery and data mining, 2000, pp. 71-80.

[8] Ducange, Marcelloni, Pecori, Fuzzy Hoeffding Decision Tree for Data Stream Classification, INT J COMPUT INT SYS 14 (2021) 946-964. doi:10.2991/ijcis.d.210212.001.

[9] Rutkowski, Pietruczuk, Duda, M. Jaworski, Decision Trees for Mining Data Streams Based on the McDiarmid's Bound, IEEE T KNOWL DATA EN 25 (2013) 1272-1279. doi:10.1109/TKDE.2012.66.

[10] Montiel, Read, Bifet, T. Abdessalem, Scikit-multiflow: A multi-output streaming framework, J MACH LEARN RES 19 (2018) 2915 - 2914.

[11] Nardini, Sabella, G. Stea, Thakkar, Virdis, Simu5G - An OMNeT++ Library for End-to-End Performance Evaluation of 5G Networks, IEEE Access 8 (2020) 181176-181191. doi:10.1109/ACCESS.2020.3028550.