=Paper=
{{Paper
|id=Vol-3318/short13
|storemode=property
|title=Developing a noise-aware AI system for change risk assessment with minimal human intervention
|pdfUrl=https://ceur-ws.org/Vol-3318/short13.pdf
|volume=Vol-3318
|authors=Subhadip Paul,Anirban Chatterjee,Binay Gupta,Kunal Banerjee
|dblpUrl=https://dblp.org/rec/conf/cikm/PaulCG022
}}
==Developing a noise-aware AI system for change risk assessment with minimal human intervention==
Subhadip Paul, Anirban Chatterjee, Binay Gupta and Kunal Banerjee (Walmart Global Tech, Bengaluru, Karnataka, India)

Abstract

Introducing changes to a system in production may sometimes result in failures, and eventual revenue loss, for any industry. Therefore, it is important to monitor the "risk" that each such change request may present. Change risk assessment is a sub-field of operations management that deals with this problem in a systematic manner. However, a manual or even a human-centered AI system may find it challenging to meet the scaling demands of a large industry; accordingly, an automated system for change risk assessment is highly desired. A few commercial solutions are available for this problem, but they lack the ability to deal with highly noisy data, which is a real possibility for such systems. Prior literature has proposed methods that integrate the feedback of domain experts into the training process of a machine learning model to deal with noisy data. Even though some of these methods produce decent risk prediction accuracy, such an arrangement of continuously collecting feedback from domain experts faces practical challenges due to limitations in the experts' bandwidth and availability. Therefore, as part of this work, we explore a way to transition from a human-centered AI system to a near-autonomous AI system, which minimizes the need for intervention by domain experts without compromising the prediction accuracy of the model. Initial experiments with the proposed AI system exhibit a 10% improvement in risk prediction accuracy in comparison with a baseline that was trained by integrating the feedback of domain experts into the training process.

Keywords: change management, risk assessment, human-centered decision making

1. Introduction

Launching a new business or expanding the repertoire of features of an existing business is a common phenomenon in modern technology-driven industries. All such upgrades require a series of software changes to a base system that is already in production. However, one needs to be cautious before pushing in these changes, because each one of them can potentially cause a failure in the system. In the current era of agile development, a large volume of change requests often arrives right before sprint deadlines. At times, a tight delivery schedule severely restricts the scope for thorough inspection and review before deployment.
Moreover, from our experience with manual change risk assessment, when the risk associated with a change is marked as "low" by the change requester (which, in reality, need not be so; this may happen if the developer is new or less skilled and has applied poor judgement), that request is often completely disregarded by the domain experts during review, which may eventually manifest as a critical issue later in the pipeline. Reducing the number of such failures in a production system is one of the key challenges for an industry that wants to provide seamless service to its customers.

There are a few commercial solutions, such as the one provided by [1], which address the problem of automated change risk assessment. In [2], the authors addressed a few of the limitations of the currently available commercial solutions, such as handling concept drift in the data and seeking feedback from domain experts based on the estimated uncertainty of the model, among others. However, in practice, the problem of predicting the risk associated with a change request can be further exacerbated by the presence of label noise in the data. Such label noise can be primarily attributed to inaccuracies that creep in during the imputation of missing values, and to remedial interventions by the change management team that prevent some change requests from failing in production. We need frequent and elaborate feedback from experts on several data samples to ensure high reliability and generalization accuracy of a model trained on change data with a high degree of label noise. However, such frequent and elaborate feedback may not always be practically possible due to limitations in the bandwidth and availability of the domain experts. This motivates us to transition from a human-centered AI system to a near-autonomous AI system for predicting the risk of change requests, in order to minimize the required intervention by domain experts.

In this paper, we present our experience of exploring the following questions while building an automated change risk assessment system:

• How can label noise in the data affect the generalization accuracy of the risk prediction model?
• Can we have an automated process that removes the label noise in the data and trains a model simultaneously?

The remainder of the paper is organized as follows. Section 2 covers the background and motivation of our work. Section 3 briefly explains our methodology. Section 4 provides the dataset description and the experimental results. Lastly, Section 5 describes some future work along with the concluding remarks.

2. Background & Motivation

In the course of explaining our motivation for the transition from a human-centered AI system to an autonomous AI system, we organize our discussion around the following questions, one by one:

• Question 1. How can label noise get introduced into the change data?
• Question 2. How can label noise impact the generalization error of the risk prediction model?

2.1. Analysis of Question 1

There are multiple ways in which label noise may get introduced into the data. In the context of our change data, we discuss two primary sources of label noise.

2.1.1. Feature Sparsity in Data

Some of the features of our data exhibit a high degree of sparsity. We impute the missing values, but some error is always introduced by the imputation process. Let us try to understand why the error originating from missing value imputation leads to label noise. Consider a toy example where a data instance has two features (refer to Figure 1a) and originally belongs to 'class 1'. Now consider a situation where the same data point, as depicted in Figure 1a, has the value of feature f1 missing and eventually imputed (refer to Figure 1b). Notice that, after imputation, the data instance has moved leftward and is now located in the region of 'class 2'. However, despite this new location, it is still labelled as 'class 1', since that was the original label of the data instance. This introduces label noise into the dataset. Notice also that data instances located close to the class boundary are more prone to producing label noise when the missing values of some of their features are imputed.

Figure 1: Data with and without imputed missing values. (a) Data without missing values. (b) Data with imputed missing values.
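To make the toy example concrete, the sketch below is an illustration of the same effect on synthetic data; the one-dimensional class layout, the mean-imputation step, and the simple boundary at f1 = 0 are our own assumptions for illustration, not the production change data or imputer.

```python
# Illustrative sketch only: synthetic data standing in for the change data.
# The uninformative second feature f2 is omitted, since it does not affect the class region.
import numpy as np

rng = np.random.default_rng(0)

# Feature f1 separates the two classes; 'class 2' is the larger class here.
f1_class1 = rng.normal(loc=3.0, scale=0.5, size=100)    # 'class 1' instances
f1_class2 = rng.normal(loc=-3.0, scale=0.5, size=300)   # 'class 2' instances
f1_all = np.concatenate([f1_class1, f1_class2])

def true_region(f1):
    """Stand-in for the true class boundary: 'class 1' lives at f1 > 0."""
    return 1 if f1 > 0 else 2

# A genuine 'class 1' instance whose f1 value goes missing and is mean-imputed.
f1_original = 2.5
f1_imputed = f1_all.mean()   # the global mean is pulled towards the larger class (about -1.5)

print("region before imputation:", true_region(f1_original))  # 1
print("region after  imputation:", true_region(f1_imputed))   # 2
# The stored label is still 'class 1', so the imputed instance now sits in the
# 'class 2' region with a 'class 1' label -- the label noise described in Section 2.1.1.
```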
2.1.2. Change Management Process of the Organization

Another major source of label noise lies in the change management process of the organization itself. Consider a situation where a change request (CRQ) is raised that has a high likelihood of causing a failure in production, and the change manager, along with the change-requesting team, takes some mitigatory action against this CRQ to prevent it from causing a failure in production. Due to such manual intervention by the change management team, this CRQ may not end up causing a failure. When this CRQ becomes part of the historical dataset for training the risk prediction model, it creates the illusion that the CRQ belongs to the 'normal' or 'non-risky' class, since it did not lead to any failure in production; ideally, however, it should have been considered otherwise, because the CRQ had a high potential to cause a failure. Therefore, such manual intervention in the change management process, which is not reflected in the change data, causes label noise in the change data.

2.2. Analysis of Question 2

In our work, we model the change risk predictor as a binary classifier. Therefore, to understand the impact of label noise on the accuracy of the change risk prediction model, we need to understand how label noise impacts a classifier in a generalized setting of supervised learning. Consider the following notation for the generalized setting of supervised classification. A training dataset S = {(x_1, y_1), ..., (x_n, y_n)} is available. In each pair (x_i, y_i), x_i represents the feature vector and y_i the associated label. X and Y denote the spaces of x and y, respectively. Jointly, (x, y) is drawn from an unknown distribution P over X × Y. In other words, x is drawn from a distribution D, and the true label y for x is given by a function f: X → Y drawn from a distribution F. The learner's algorithm A is a function that takes the training data S as input and returns a distribution of classifiers h: X → Y. We define err_P(A, S) := E_{h∼A(S)}[err_P(h)] to represent the generalization error, where err_P(h) := E_P[1(h(x) ≠ y)] and 1(·) is the indicator function. We also assume |X| = n and |Y| = m. We follow the notation below to characterize the training dataset S:

• Consider π = {π_1, ..., π_n} to represent the priors for each x ∈ X.
• For each x ∈ X, sample a quantity p_x independently and uniformly from the set π.
• The probability mass function of x is given by D(x) = p_x / Σ_{x∈X} p_x.

When a model becomes sufficiently complex, it often ends up memorizing the labels of some of the instances in the training dataset. Theorem 6 of [3] shows how memorizing noisy labels for data instances of frequency l leads to a sharper decline in the generalization power of a supervised classifier.

Theorem 1 (Theorem 6 of [3]). For x ∈ X_{S=l} with true label y, a classifier h memorizing its l noisy labels incurs an individual excessive generalization error of order

$\Omega\left(\frac{l^2}{n^2}\cdot \mathrm{weight}\left(\pi,\left[\frac{2}{3}\cdot\frac{l-1}{n-1},\ \frac{3}{4}\cdot\frac{l}{n}\right]\right)\right)\cdot \sum_{k\neq y}\mathbb{P}[\tilde{y}=k\mid x]$,

where

$\mathrm{weight}(\pi,[\beta_1,\beta_2]) = \mathbb{E}\left[\sum_{x\in X} D(x)\cdot \mathbb{1}\left(D(x)\in[\beta_1,\beta_2]\right)\right]$

and n is the total number of data instances in the training dataset.

Note that the higher the value of l, the higher the lower bound of the generalization error of the model. When dealing with a tabular dataset of moderate to high dimension, such as ours, repetition of data instances may seem unlikely, yet it is still approximately possible. An intuitive explanation is that, in the context of supervised learning, a data instance is approximately represented by the set of its significant features with respect to the classification model, even though the instance may carry a large number of insignificant or redundant features. In that sense, two data instances are seen by the model as a repetition if the values of their significant features are the same.

Therefore, our primary motivation for taking up this problem is to do away with the label noise arising from these inherent noise generation processes and with its adverse impact on the model's accuracy.
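The quantity weight(π, [β1, β2]) used in Theorem 1 can be read as the expected probability mass that falls inside the interval [β1, β2] under the sampling scheme above. The short sketch below estimates it by Monte Carlo, directly following the three-step construction of D(x); the particular prior set π, the interval, and the sample sizes are arbitrary choices for illustration, not values from our data.

```python
# Monte Carlo estimate of weight(pi, [beta1, beta2]) as defined in Section 2.2.
# The prior set pi and the interval below are illustrative placeholders.
import numpy as np

rng = np.random.default_rng(0)

def weight(pi, beta1, beta2, n_instances, n_trials=5000):
    """E[ sum_x D(x) * 1(D(x) in [beta1, beta2]) ], where each p_x is drawn
    independently and uniformly from the set pi and D(x) = p_x / sum(p_x)."""
    total = 0.0
    for _ in range(n_trials):
        p = rng.choice(pi, size=n_instances, replace=True)  # p_x ~ Uniform(pi)
        d = p / p.sum()                                      # probability mass D(x)
        mask = (d >= beta1) & (d <= beta2)
        total += d[mask].sum()
    return total / n_trials

pi = np.array([0.001, 0.01, 0.1, 1.0])   # example prior set
print(weight(pi, beta1=0.005, beta2=0.05, n_instances=100))
```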
3. Methodology

As proposed in [4], we employ the Progressive Label Correction (PLC) method to iteratively correct the labels and train the binary classification model. We first train the XGBoost model with the original noisy data for the first few iterations, which we call the warm-up period. Once the warm-up period is over, we start correcting the labels. We only correct those labels on which the classifier f exhibits high confidence. The idea is based on the intuition that there exists a region of the data in which the noisy classifier f produces highly confident predictions and is consistent with the clean Bayes optimal classifier; within that region, the algorithm therefore produces clean labels. More formally, within the specified data region, if f predicts a label different from the observed label ỹ with confidence above a threshold θ, i.e. |f(x) − 1/2| > θ, we flip the label ỹ to the prediction of f. We continue this process until we reach a stage where no label can be corrected any further. We choose the value of θ empirically.
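A compact sketch of this correction loop is given below. It follows the PLC recipe of [4] only at a high level: warm-up training on the noisy labels, followed by repeated rounds of flipping only those labels that the current model contradicts with confidence above θ. The function name, the loop structure, and hyper-parameters such as the warm-up length and θ are our own placeholders, not the exact values or implementation used in our experiments.

```python
# High-level sketch of Progressive Label Correction (PLC) [4] around an XGBoost
# binary classifier. Names and hyper-parameters are illustrative placeholders.
import numpy as np
from xgboost import XGBClassifier

def progressive_label_correction(X, y_noisy, theta=0.3, warmup_rounds=30, max_rounds=200):
    """Iteratively retrain the classifier and flip labels it contradicts with
    confidence |f(x) - 1/2| > theta, until no more labels change."""
    y = y_noisy.copy()                                # observed (possibly noisy) 0/1 labels
    model = XGBClassifier(n_estimators=100, eval_metric="logloss")

    for round_idx in range(max_rounds):
        model.fit(X, y)

        if round_idx < warmup_rounds:
            continue                                  # warm-up: keep training on uncorrected labels

        proba = model.predict_proba(X)[:, 1]          # f(x): predicted probability of class 1
        confident = np.abs(proba - 0.5) > theta       # high-confidence region
        predicted = (proba > 0.5).astype(int)
        flip = confident & (predicted != y)           # model confidently disagrees with y_tilde

        if not flip.any():
            break                                     # no label can be corrected any more
        y[flip] = predicted[flip]                     # flip y_tilde to the model's prediction

    return model, y
```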
4. Experimental Setup & Results

4.1. Dataset Description

We have collected change data for 3 months, comprising ∼27K data samples that are labelled either as "risky" (class 1), i.e., potentially leading to a failure in the production system, or as "normal" (class 0). Out of the ∼27K data samples, only 65 instances belong to the "risky" (positive) class.

• Feature Description: Each instance in the data consists of 20 features; out of these, 2 features are continuous while the rest are categorical.
• Sparsity: Many features have missing values; some features have almost 30% of their values missing.

We create 3 separate datasets, one per month, from the overall data and perform two experiments:

• Experiment 1: The model is trained with the change data of Month 1, and the change data of Month 2 is used for validation.
• Experiment 2: The model is trained with the change data of Month 2, and the change data of Month 3 is used for validation.

We use a gradient-boosted decision tree (XGBoost) [5] to generate the probability with which a new change request may cause a failure in production, and we consider this probability as the estimate of the risk of the change. This is our baseline model. Note that we had explored other models as well; however, the XGBoost model produced the best results, as recorded in our prior work [2].

4.2. Experimental Results

Next, we use the PLC algorithm [4] to remove the label noise in the dataset and re-train the XGBoost model iteratively with the label-corrected dataset, as described in the previous section. Detailed comparisons between the baseline model (trained on the original data) and the model trained following the PLC method for Experiment 1 and Experiment 2 are shown in Table 1 and Table 2, respectively. Note that the metric balanced accuracy is useful when the classes are imbalanced and is defined as (sensitivity + specificity)/2; we believe the rest of the metrics used in these tables are standard and need no definition. As can be seen from Table 1 and Table 2, the model trained with the PLC method outperformed the baseline across all the metrics. Figure 2 shows how the F1 score varies with iterations while training the model with the PLC method. Note that we used a warm-up period of 30 iterations, which is why a sharp jump is noticed once label correction is applied from the 31st iteration onward.

Figure 2: Change in F1 score with iterations while training the model with the PLC method (with a warm-up period of 30 iterations). (a) Experiment 1. (b) Experiment 2.

Table 1: Comparison of results of two models for Experiment 1

Metric                     | Baseline | After PLC
True Positive Rate         | 0.62     | 0.89
False Positive Rate        | 0.19     | 0.05
True Negative Rate         | 0.81     | 0.95
False Negative Rate        | 0.38     | 0.11
Precision                  | 0.06     | 0.26
Positive Likelihood Ratio  | 3.35     | 17.57
F1 Score                   | 0.11     | 0.40
Balanced Accuracy          | 0.72     | 0.92

Table 2: Comparison of results of two models for Experiment 2

Metric                     | Baseline | After PLC
True Positive Rate         | 0.51     | 0.74
False Positive Rate        | 0.06     | 0.02
True Negative Rate         | 0.94     | 0.98
False Negative Rate        | 0.49     | 0.26
Precision                  | 0.05     | 0.20
Positive Likelihood Ratio  | 8.19     | 42.44
F1 Score                   | 0.09     | 0.32
Balanced Accuracy          | 0.72     | 0.86
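For completeness, all the metrics reported in Table 1 and Table 2 can be derived from the confusion-matrix counts. The helper below is our own illustration with arbitrary example counts (not the counts behind the reported tables); it spells out the definitions we use, including balanced accuracy as (sensitivity + specificity)/2 and the positive likelihood ratio as TPR/FPR.

```python
# Metric definitions used in Tables 1 and 2, computed from confusion-matrix counts.
# The example counts passed at the bottom are arbitrary illustrative values.
def report(tp, fp, tn, fn):
    tpr = tp / (tp + fn)                  # true positive rate (sensitivity, recall)
    fpr = fp / (fp + tn)                  # false positive rate
    tnr = tn / (tn + fp)                  # true negative rate (specificity)
    fnr = fn / (fn + tp)                  # false negative rate
    precision = tp / (tp + fp)
    plr = tpr / fpr                       # positive likelihood ratio
    f1 = 2 * precision * tpr / (precision + tpr)
    balanced_accuracy = (tpr + tnr) / 2   # robust summary for highly imbalanced change data
    return {
        "TPR": tpr, "FPR": fpr, "TNR": tnr, "FNR": fnr,
        "Precision": precision, "Positive Likelihood Ratio": plr,
        "F1 Score": f1, "Balanced Accuracy": balanced_accuracy,
    }

print(report(tp=40, fp=120, tn=8000, fn=10))
```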
5. Conclusion & Future Work

In this paper, we have shown how we made the transition from a human-centred AI system to a near-autonomous AI system by employing the progressive label correction method to get rid of the inherent label noise in the data. We now seek labels from the domain experts for only a handful of samples, namely those whose class the model is extremely uncertain about. Experimental results exhibit significant improvement in the model's performance with respect to various metrics.

As part of future work, we aim to build not just a more accurate model but also a more trustworthy one, since earning the trust of the end users of an ML model is key to driving business value with ML, especially in 'operations' in a large-scale organization. Therefore, we are in the process of building an enhanced label-noise removal method based on the intuition that, in noisy data, there exists a 'data region' in which the noisy classifier f produces highly confident and trustworthy predictions consistent with the clean 'Bayes optimal classifier'. A standard approach to quantifying a classifier's trustworthiness is to use its own estimated confidence or score, such as the probabilities from the softmax layer of a neural network, the distance to the separating hyperplane in support vector classification, or the mean class probabilities over the trees in a random forest. However, recent research shows that a higher confidence score from the model does not necessarily assure a higher probability of correctness of the classifier. The fact that a classifier's own confidence score may not be the best judge of its own trustworthiness makes our ongoing work all the more challenging, but also interesting.

References

[1] Digital.ai, Change risk prediction, 2019. URL: https://digital.ai/change-risk-prediction.
[2] B. Gupta, A. Chatterjee, S. Paul, H. Matha, L. Parsai, K. Banerjee, V. Agneeswaran, Look before you leap! Designing a human-centered AI system for change risk assessment, in: ICAART, 2022, pp. 655–662.
[3] Y. Liu, Understanding instance-level label noise: Disparate impacts and treatments, in: ICML, volume 139 of Proceedings of Machine Learning Research, PMLR, 2021, pp. 6725–6735.
[4] Y. Zhang, S. Zheng, P. Wu, M. Goswami, C. Chen, Learning with feature-dependent label noise: A progressive approach, in: ICLR, 2021. URL: https://openreview.net/forum?id=ZPa2SyGcbwh.
[5] T. Chen, C. Guestrin, XGBoost: A scalable tree boosting system, in: KDD, 2016, pp. 785–794.