<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Data Technol. Appl. 55 (2021) 558-585. URL: https://doi.org/10.1371/journal.pone.0237724.
URL: https://doi.org/10.1108/DTA-12-2020-0298. doi:10.1371/journal.pone.0237724.
doi:10.1108/DTA</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>About the Effects of Data Imputation Techniques on ML Uncertainty</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Cinzia Cappiello</string-name>
          <email>cinzia.cappiello@polimi.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Federico Cerutti</string-name>
          <email>federico.cerutti@unibs.it</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Camilla Sancricca</string-name>
          <email>camilla.sancricca@polimi.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Riccardo Zanelli</string-name>
          <email>riccardo.zanelli@mail.polimi.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff1">
          <label>1</label>
          <institution>Politecnico di Milano</institution>
          ,
          <addr-line>Milan</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>University of Brescia</institution>
          ,
          <addr-line>Brescia</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <abstract>
        <p>The data-driven culture is based on the importance of data analysis in supporting decision-making. In particular, machine learning technologies and tools are evolving quickly and becoming increasingly popular as an effective means to gain insights from raw data. However, it should be considered that Machine Learning (ML) models often generate uncertain results, due mainly to their imperfect and statistical nature. In this paper, we focus on the fact that data preparation techniques can introduce additional uncertainty: errors, missing values, and inconsistencies are frequently addressed using techniques that correct data using estimates and thus add further uncertainty. Focusing on the specific problem of incomplete data, this paper (i) investigates the effect of imputation techniques on the results' uncertainty, and (ii) identifies the techniques that minimize the uncertainty affecting companies' decisions.</p>
      </abstract>
      <kwd-group>
        <kwd>Data Quality</kwd>
        <kwd>Uncertainty</kwd>
        <kwd>Data Imputation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>In the modern era of the data-driven culture, data analysis is critical in providing useful information to support decision-making [...] variability. Data Quality plays a crucial role in managing this uncertainty: reliable and consistent data helps identify and quantify the inherent variability and randomness [...].</p>
      <p>[Missing values must be handled] in some form, and once the data are complete, feed them to the model. The available imputation methods are various: they go from traditional techniques in which null values are substituted with statistical information (e.g., mean, median, mode) to more complex processes based on ML (e.g., clustering and distance-based algorithms [7]). All these methods are imputing estimates, and therefore they add epistemic uncertainty.</p>
      <p>This work investigates (i) the effect of this portion of uncertainty introduced by the data preparation process on the data analysis results and (ii) if the goal of mitigating uncertainty can be exploited to find the best preparation action within a specific context (i.e., data and ML model characteristics).</p>
      <p>The paper is organized as follows: Section 2 explores similar literature contributions and highlights the novel aspects of the presented paper. Section 3 describes the method that we used to investigate the impact of data preparation on the uncertainty of ML results; Section 4 presents the conducted experiments and discusses the obtained results, while Section 5 concludes the paper and presents future work.</p>
      <p>Joint Workshops at the 49th International Conference on Very Large Data Bases (VLDBW'23): the 12th International Workshop on Quality in Databases (QDB'23), August 28 - September 1, 2023, Vancouver. © 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>The problem of missing data has been increasingly spread in a variety of domains. For this reason, a lot of research contributions aim to define methods for efficiently performing data imputation and replacing the missing data with values that are as accurate as possible [7, 8].</p>
      <p>Several papers propose implementing accurate and efficient data imputation methods by exploiting ML techniques. For example, the method presented in [9] proposes a novel k-Nearest-Neighbors (kNN) imputation method that iteratively imputes missing data selecting the kNN via calculating the Gray distance, i.e., a technique used in the Gray system theory, rather than traditional distance metrics. Such a distance metric can deal with both numerical and categorical attributes.</p>
      <p>Other methods [10, 11] make use of neural networks. The work presented in [10] builds a deep latent variable model to impute missing-at-random data. This model is based on autoencoders and has been proven to provide accurate single imputations, being competitive with other state-of-the-art methods. Finally, the method shown in [11] adapts the Generative Adversarial Networks (GAN) framework to impute the missing data. This method has been tested on various datasets and outperforms some state-of-the-art methods. Some of the data imputation methods described above were also considered in the experiments of this paper.</p>
      <p>Moreover, some studies have tried to put together several imputation methods. For example, the work in [12] proposes an adaptive iterative imputation framework that automatically finds, for each dataset column, the best data imputation model and configures it with the appropriate hyperparameters. The best single-column imputation method is computed by trying several methods until an imputation-stopping criterion, based on the incremental change in imputation quality, is met.</p>
      <p>Within the same domain, several contributions have conducted comparisons between the different imputation methods present nowadays in the literature. For example, [13] depicts a comprehensive benchmark on six different methods involving standard, classical ML, and novel deep learning approaches to perform data imputation. The experiments were done on a huge set of real-world datasets, including three missingness patterns, i.e., missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). In [14], the authors show a comparison between multiple existing data imputation techniques that are based on deep learning; moreover, they propose a set of improvements for each analyzed method. In these cases, the data imputation methods are evaluated on their imputation quality, without considering the uncertainty that they can introduce in the performance of an ML model that will be executed on them.</p>
      <p>A recent approach [15] focuses on studying the impact of data preparation on the ML model performance. This study investigates the impact of data cleaning actions on ML classification models. The authors consider different data cleaning methods for correcting outliers, duplicates, inconsistencies, mislabels, and missing values. The goal was to assign, for a specific setting (error type, data cleaning action, and ML application), a P (positive), N (negative), or S (insignificant) flag indicating the impact of the data cleaning on the ML performance. Also in this case, the impact of data cleaning methods is evaluated on the basis of the final ML model performance, without considering the ML uncertainty.</p>
      <p>Some contributions focused on creating their data imputation methods for particular contexts and then tried to validate them from the point of view of the introduced uncertainty [16, 17]. In particular, the work proposed in [16] aims to provide a tool to predict hospital readmission among Heart Failure patients and develops a new methodological framework to address the missing data using a Gaussian process latent variable model. In contrast, the method shown in [17] focuses on well logs, commonly used in geoscience, and proposes an approach to customize the hyper-parameters of a random forest model to predict the missing values.</p>
      <p>However, none of the cited works considered using uncertainty to select the best data imputation method to apply in a given analysis context. Our work aims to explore this open issue.</p>
      <p>Finally, a paper that implements a similar approach w.r.t. our method is [18]. However, the authors focus on a totally different purpose: they systematically inject errors, e.g., missing values and encoding errors, into the input data to estimate the prediction quality of an ML model. Their goal was to estimate the output quality of ML models on unseen, unlabeled serving data, in order to automate the validation of black boxes.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Measuring the Impact of Data Preparation on the Decision Uncertainty</title>
      <p>This section presents the pipeline — illustrated in Figure 1 — implemented to investigate the impact of data preparation, whose application introduces approximate data, on the uncertainty of ML outcomes.</p>
      <p>[Figure 1: The experiment pipeline, from the Data Source through Missing Values Injection, Data Imputation, and Data Analysis on the Training Set and the Test Set.]</p>
      <p>In this work, we focus on the Completeness DQ dimension, i.e., the degree to which a given data collection includes the data describing the corresponding set of real-world objects [6]. It is affected by missing values and can be improved by applying data imputation techniques. Note that the considered input dataset is free of DQ problems. For this reason, we have to inject missing values to perform the data imputation techniques.</p>
      <p><bold>The Experiment Pipeline</bold> As Figure 1 depicts, the input of the pipeline is a Data Source, which is split into two datasets: the Training Set and the Test Set. Each dataset is the input of the Data Analysis phase. This splitting is done before injecting the DQ errors: in this way, dirty instances of the Training Set are created, an ML model is trained on them, and it is finally evaluated on the same original instance of the Test Set.</p>
      <p>The Missing Values Injection phase generates five instances of the Training Set at different levels of quality by injecting a different percentage of missing values (from 50% to 10%, with a decreasing step of 10%) uniformly. The targeted class is excluded from the injection and is not corrupted.</p>
      <p>Following this procedure, the injected missing values are Missing Completely At Random (MCAR), i.e., the probability of a data point being missed is independent of the observed and unobserved data. An injection above 50% of DQ errors has not been performed in our experiments since the variance of the model performance, trained with so many mistakes, was too high and was no longer considered reliable.</p>
      <p>The obtained five dirty datasets are the input of the Data Imputation phase, in which a data imputation technique is applied to fill the missing values. In this phase, several imputation methods have been compared.</p>
      <p>The five cleaned datasets obtained as the output of the Data Imputation are fed to the Data Analysis phase, where an ML model is trained on them. The resulting five ML models are finally evaluated on the same Test Set, computing their prediction performance and related epistemic uncertainty. Two sets of scores are the output of this phase: five scores (each one related to the ML model executed on one of the five cleaned datasets) related to the model performance and another set of scores for the uncertainty. The method is repeated for all the selected data source/ML algorithm/data imputation method combinations.</p>
      <p><bold>The Pipeline with Feature Selection</bold> The same pipeline is also performed with an additional step of feature selection. In this case, the input dataset is first analyzed through a feature selection method. The output is a subset of the original dataset that keeps only the four most relevant features. The resulting dataset is the input Data Source of this set of experiments.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments &amp; Results</title>
      <p>This section describes the setup used to run the experiments and the results obtained following the method proposed in Section 3.</p>
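<p>To make the injection step concrete, the following minimal sketch (our own illustration, not the authors' code; all function and variable names are hypothetical) corrupts a tabular training set uniformly at random while excluding the target column, mirroring the MCAR injection described in Section 3:</p>

```python
import random

def inject_mcar(rows, target, rate, seed=42):
    """Return a copy of rows (a list of dicts) where each non-target cell
    is set to None with probability `rate`, independently of any observed
    or unobserved value (MCAR)."""
    rng = random.Random(seed)
    dirty = []
    for row in rows:
        new_row = dict(row)
        for col in row:
            # the target class is excluded from the injection
            if col != target and rate > rng.random():
                new_row[col] = None
        dirty.append(new_row)
    return dirty

def make_dirty_instances(rows, target):
    """Five Training Set instances at decreasing quality (50% ... 10%)."""
    return {p: inject_mcar(rows, target, p / 100.0, seed=p)
            for p in (50, 40, 30, 20, 10)}
```

<p>Because the split into Training and Test Set precedes this step, the Test Set instance remains clean, as in the pipeline of Section 3.</p>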
      <sec id="sec-4-1">
        <title>4.1. Experimental Setup</title>
        <p>Different data sources have been selected to run the experiments: Boston, Wine, California, House, and Concrete (all retrieved from Kaggle; the sources are listed at the end of this section). Table 1 lists their main characteristics. All these datasets have a numeric target label, and regression ML models were adopted to perform the Data Analysis phase (see Section 3). For this reason, the CatBoost algorithm from the catboost Python library [19] and the Gaussian Process regressor from the scikit-learn Python library [20] have been selected as ML analysis algorithms. In addition, the Boruta [21] method for feature selection has been adopted. It is an ML-based method that evaluates each feature's importance in a dataset and returns the most relevant ones.</p>
        <p>[Table 1: dataset sizes (tuples): boston 506; california 1,000; house 1,460; str 1,030; wine 1,599.]</p>
        <p>In order to include a diversified set of data imputation techniques, we consider seven types of them, divided into four macro-categories. For each category, we select one or more representative methods, even though it is known that some are less effective than others. The considered methods are the following.</p>
        <p>(1) Single-column imputation with aggregated values computes an aggregated value like the mean, the median, or the most frequent to substitute the missing ones.</p>
        <p>ML-based imputation exploits ML algorithms, such as: (2) k-Nearest Neighbours (KNN) [9] estimates each sample's missing value with the mean value of its nearest neighbours; (3) Generative Adversarial Imputation Nets (GAIN) [11] uses generative adversarial networks (GANs) for estimating missing values by training a GAN, which consists of two neural networks: a generator network, which generates the missing data, and a discriminator one, to distinguish between the real data and the ones that were just generated; (4) MIWAE [10] uses an autoencoder, a neural network trained to encode the observed data into a lower-dimensional space. This allows the autoencoder to learn a compact representation of the data, which can be used to predict the missing values.</p>
        <p>Multiple imputation creates copies of the original data and estimates the missing values through an iterative process. We consider the (5) Multiple Imputation by Chained Equations (MICE) [22] technique: (i) random imputation is applied to each missing column; (ii) the missing values are set back one feature at a time; (iii) an ML model is fitted to impute the values using the rest of the data as training set; (iv) the training set is updated with the predicted column. For the experiments, the selected ML model is KNNRegressor from the scikit-learn Python library [20].</p>
        <p>Statistics-based imputation considers Matrix Factorization (MF) techniques. We select two of them: (6) basic MF and (7) Singular Value Decomposition (SVD) [8]. These processes assume that input data are noisy observations produced by a linear combination of a small set of principal components. They estimate the missing data by splitting them into two or more low-dimensional matrices and reconstructing the original one based on a linear combination.</p>
        <p>Dataset sources (all accessed on 29th May 2023): 1. https://www.kaggle.com/datasets/avish5787/boston-data-set; 2. https://www.kaggle.com/datasets/shelvigarg/wine-quality-dataset; 3. https://www.kaggle.com/datasets/dhirajnirne/california-housing-data; 4. https://www.kaggle.com/datasets/lespin/house-prices-dataset; 5. https://www.kaggle.com/datasets/rithikkotha/concrete-dataset.</p>
      </sec>
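<p>As a concrete illustration of the ML-based category, the sketch below implements a toy version of kNN imputation (method (2)): each missing entry is filled with the mean of that column over the k nearest donor rows, with distances computed on the jointly observed features. This is our own simplified example, not the scikit-learn implementation used in the experiments:</p>

```python
import math

def knn_impute(rows, k=3):
    """Fill each None in a list of equal-length numeric rows with the mean
    of that column over the k nearest rows that observe it."""
    def distance(a, b):
        # Euclidean distance on the features observed in both rows,
        # normalized by the number of shared features.
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is None:
                # rank only donor rows that actually observe column j
                donors = sorted(
                    (distance(row, other), other[j])
                    for idx, other in enumerate(rows)
                    if idx != i and other[j] is not None
                )
                nearest = [v for _, v in donors[:k]]
                if nearest:
                    filled[i][j] = sum(nearest) / len(nearest)
    return filled
```

<p>The same skeleton, with a regressor in place of the neighbour mean, is essentially one step of the MICE loop described above.</p>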
      <sec id="sec-4-3">
        <title>4.2. Results Evaluation</title>
        <p>To evaluate the accuracy and the uncertainty of the results, we used the following evaluation metrics.</p>
        <p>The Root Mean Squared Error (RMSE), i.e., a measure of the average difference between the predicted and actual values, has been used to evaluate the prediction accuracy.</p>
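<p>As a reminder of the metric, a minimal sketch of the standard formula (ours, independent of the experimental code):</p>

```python
import math

def rmse(y_true, y_pred):
    """Root Mean Squared Error: the square root of the mean squared residual."""
    assert len(y_true) == len(y_pred)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))
```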
        <p>Moreover, one common approach for estimating the
(epistemic) uncertainty of ML models is to use the
standard deviation of the algorithm prediction, i.e., a measure
of the variation of the predicted values with respect to
their average. A high standard deviation indicates that
the predicted values are more variable and, therefore, less
reliable.</p>
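<p>Under this reading, the uncertainty score reduces to the standard deviation of the predicted values; a minimal sketch (ours, not the evaluation code used in the experiments, which obtains the value directly from the CatBoost and Gaussian Process implementations):</p>

```python
import statistics

def prediction_uncertainty(predictions):
    """Population standard deviation of a model's predicted values:
    the higher it is, the less reliable the predictions are taken to be."""
    return statistics.pstdev(predictions)
```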
        <p>The standard deviation of the results was estimated
directly by the CatBoost and Gaussian Process algorithms.</p>
        <p>For example, for CatBoost, the value of the uncertainty
was extracted from the model evaluation function, which,
in this case, was set to RMSEWithUncertainty — an
evaluation metric provided by the catboost Python library
[19].</p>
        <p>The method presented in Section 3 has been executed 16 times with different random seeds for each combination of data source/ML algorithm/data imputation method.</p>
        <p>[Figure 2: (a) model performance (RMSE) and (b) uncertainty, varying the completeness level (10-50%), for the five imputation methods KNN, MICE, MIWAE, MF, and SVD.]</p>
        <p>This section shows the preliminary results we obtained applying the method described in Section 3. Experiments have been conducted for the data sources, ML algorithms, and data imputation techniques listed in Section 4.1.</p>
        <p>From the experiments' results, the role of uncertainty introduced by data preparation arises: it can be used as a support in identifying the best data preparation method to apply in a specific analysis context, i.e., a combination of the data source and the ML algorithm selected for its analysis. When applying two data imputation methods leads to equivalent analysis results (in terms of performance), the best one can be identified by evaluating their uncertainty.</p>
        <p>Figure 2 depicts an example of the aggregated results obtained for the combination CatBoost/house dataset. In particular, Figure 2a plots the model performance (RMSE) and Figure 2b the uncertainty distribution for the five imputation methods that give the best analysis results, varying the completeness. The y-axes represent the values assumed by the RMSE and the uncertainty, respectively, while the x-axis pictures the Completeness level.</p>
        <p>From visual inspection of Figure 2, it emerges that applying k-Nearest Neighbours (KNN), Multiple Imputation (MICE), Matrix Factorization (MF), and Singular Value Decomposition (SVD) yields ML model performances (RMSEs) that are very similar to each other, and it becomes difficult to determine which one is better. However, by analyzing the uncertainty, one can argue that MF outperforms the others since, on average, it leads to lower values.</p>
        <p>Moreover, we rank the data imputation methods based on the analysis performance and the uncertainty they introduced. For each analysis context, we compute the ML model performance and related uncertainty for the selected ML algorithms using the original (cleaned) dataset, and we use this value as a baseline. Then, for each combination of dataset/ML algorithm/data imputation technique, we (i) run our method (see Section 3) several times, (ii) aggregate the results by the median, and (iii) compute the median distance between the five extracted scores (both for RMSE and uncertainty) and the baseline.</p>
        <p>Data imputation methods were sorted in ascending order of their median distance from the baseline to extract the rankings. The closer the score is to the original values, the more reliable the data imputation method is. Table 2 lists the extracted rankings and their related distances.</p>
        <p>We performed a Kruskal-Wallis [23] nonparametric test to determine if there are statistically significant differences between the methods in each ranking. White cells in Table 2 are statistically significant (p &lt; 0.01) results according to the Kruskal-Wallis test. Among the non-statistically significant ones, in grey, the following couples are statistically significant (p &lt; 0.01) according to a pairwise analysis performed using the Mann-Whitney test [24]: (*) MICE ≠ SVD; (†) MICE ≠ {MIWAE, KNN}; (‡) KNN ≠ {SVD, MEDIAN}; (#) KNN ≠ {MEAN, SVD}.</p>
        <p>From Table 2, we can appreciate, again, that uncertainty can be used to discriminate between different imputation methods whose distances from the baseline are, in absolute value, very close considering the ML model performance achieved. From these tables, it is evident that whenever two imputation methods have median distances from the baseline that are very close, the uncertainty they introduce is always different and can be sorted accordingly.</p>
        <p>For example, for the combination CatBoost algorithm/concrete dataset, the first two imputation methods, i.e., MF and MICE, are very close to each other; however, the uncertainty introduced by MF is much smaller than the other one. We can conclude that the first method is better than the second one. The above statement applies to all the tested analysis contexts.</p>
        <p>We also aggregate the ranking results in the following manner: (i) aggregating all results together; (ii) aggregating, for each dataset, the results obtained applying the two algorithms with and without feature selection; (iii) aggregating, for the four combinations of CatBoost-Gaussian Process/with-without feature selection, all dataset-related results. For each aggregation, we sum the median distances reported in Table 2 and sort the imputation methods in ascending order of that sum, creating aggregated rankings.</p>
        <p>From the aggregated results, we can state that: (i) considering all the results together, the best-4 methods turned out to be MICE, MF, SVD, and KNN. Moreover, their aggregated distance values are very close to each other both for RMSE and uncertainty. SVD imputation has slightly higher uncertainty than the others. (ii) The best-4 methods found in (i), in general, appear in the first four positions of the rankings obtained for each dataset aggregation. There may be variations in the third and fourth positions of the aggregated rankings, where other imputation methods can appear. However, the uncertainty of the latter methods is always higher than that of the best-4 methods. (iii) The best-4 methods are coherent for all algorithms/with-without feature selection aggregations. However, the position of these methods changes based on the considered combination. We can notice that CatBoost and Gaussian Process algorithms have very similar RMSE-related rankings (1) with feature selection, in which the first 3 positions are the same, and (2) without feature selection.</p>
        <p>From a more general perspective, it is also possible to state that neural network-based imputation techniques, in some cases, are the best ones. However, they have very high uncertainty and are less reliable. This is especially the case of data sources with low dimensionality, i.e., the number of tuples and features, as neural networks need much more data to build a reliable ML model.</p>
        <p>As regards the single-column imputation with aggregated values techniques, it is possible to highlight that the uncertainty introduced by these methods is higher for lower completeness values. This happens since substituting an aggregated value introduces a higher approximation with respect to the other methods.</p>
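<p>The ranking procedure described above (median aggregation, then ascending distance from the clean-data baseline) can be sketched as follows; the numbers are made up for illustration and do not come from Table 2:</p>

```python
import statistics

def rank_methods(scores, baseline):
    """scores: {method: list of scores from repeated runs};
    baseline: score obtained on the original (cleaned) dataset.
    Methods are sorted in ascending order of the distance between
    their median-aggregated score and the baseline."""
    distances = {
        method: abs(statistics.median(values) - baseline)
        for method, values in scores.items()
    }
    return sorted(distances, key=distances.get)

# hypothetical RMSE runs per imputation method, baseline RMSE 0.30
runs = {
    "MF":   [0.32, 0.31, 0.33],
    "MICE": [0.33, 0.32, 0.34],
    "SVD":  [0.40, 0.38, 0.41],
}
```

<p>The same function can be applied to the uncertainty scores to break ties between methods whose RMSE distances are nearly identical.</p>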
      </sec>
    </sec>
    <sec id="sec-4-4">
      <title>5. Conclusions and Future Work</title>
      <p>It is possible to draw some conclusions from the conducted experiments. First of all, there is no absolute "best" imputation method that fits all situations: identifying the imputation method to prefer depends on the analysis context. However, it is possible to observe that the imputation methods that outperform the others are KNN, MICE, and MF.</p>
      <p>The paper presents a set of experiments to evaluate the effects of data imputation techniques on ML-based analysis uncertainty. The obtained results highlight that, besides performance, uncertainty can be an additional metric to consider for defining the data preparation method to prefer. Future work will focus on extending the experiments considering the other DQ dimensions. Our vision is to exploit these experimental results and the experience already gained in similar contexts to design a self-service environment that supports data scientists in finding and recommending data preparation techniques to maximize the results' accuracy while minimizing uncertainty.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>This research was supported by EU Horizon Framework grant agreement 101069543 (CS-AWARE-NEXT).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S. C.</given-names>
            <surname>Hora</surname>
          </string-name>
          ,
          <article-title>Aleatory and epistemic uncertainty in probability elicitation with an example from hazardous waste management</article-title>
          ,
          <source>Reliability Engineering &amp; System Safety</source>
          <volume>54</volume>
          (
          <year>1996</year>
          )
          <fpage>217</fpage>
          -
          <lpage>223</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name><given-names>E.</given-names> <surname>Hüllermeier</surname></string-name>,
          <string-name><given-names>W.</given-names> <surname>Waegeman</surname></string-name>,
          <article-title>Aleatoric and Epistemic Uncertainty in Machine Learning: An Introduction to Concepts and Methods</article-title>,
          <source>Machine Learning</source>
          <volume>110</volume>
          (<year>2021</year>)
          <fpage>457</fpage>-<lpage>506</lpage>.
          doi:10.1007/s10994-021-05946-3. arXiv:1910.09457.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>F.</given-names>
            <surname>Cerutti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. M.</given-names>
            <surname>Kaplan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sensoy</surname>
          </string-name>
          ,
          <article-title>Evidential reasoning and learning: a survey</article-title>
          , in: L. De Raedt (Ed.),
          <source>Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI 2022</source>,
          Vienna, Austria, 23-29 July 2022, ijcai.org, <year>2022</year>, pp. <fpage>5418</fpage>-<lpage>5425</lpage>. doi:10.24963/ijcai.2022/760.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>