<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Developing a noise-aware AI system for change risk assessment with minimal human intervention</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Subhadip Paul</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anirban Chatterjee</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Binay Gupta</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kunal Banerjee</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Walmart Global Tech. Bengaluru</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Karnataka</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>India</string-name>
        </contrib>
      </contrib-group>
      <pub-date>
        <year>2022</year>
      </pub-date>
      <abstract>
        <p>Introducing changes to a system in production may sometimes result in failures, and eventual revenue loss, for any industry. It is therefore important to monitor the “risk” that each such change request presents. Change risk assessment is a sub-field of operations management that deals with this problem in a systematic manner. However, a manual or even a human-centered AI system may find it challenging to meet the scaling demands of a large industry; accordingly, an automated system for change risk assessment is highly desired. A few commercial solutions are available to address this problem, but they lack the ability to deal with highly noisy data, which such systems are quite likely to encounter. Prior work has proposed methods to integrate the feedback of domain experts into the training process of a machine learning model to deal with noisy data. Although some of these methods achieve decent risk prediction accuracy, continuously collecting feedback from domain experts poses practical challenges due to the limited bandwidth and availability of the experts. Therefore, as part of this work, we explore a transition from a human-centered AI system to a near-autonomous AI system, which minimizes the need for intervention by domain experts without compromising the prediction accuracy of the model. Initial experiments with the proposed AI system exhibit a 10% improvement in risk prediction accuracy in comparison with a baseline that was trained by integrating the feedback of domain experts into the training process.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Keywords</title>
      <p>change management, risk assessment, human-centered decision making</p>
    </sec>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <sec id="sec-2-1">
        <p>Launching a new business or expanding the repertoire of features for an existing business is a common phenomenon in the modern technology-driven industries. All such upgrades require a series of software changes to a base system that is already in production. However, one needs to be cautious prior to pushing in these changes, because each one of them can potentially cause a failure in the system. In the current era of agile development, a large volume of requests often comes in right before the sprint deadlines. At times, a tight delivery schedule severely restricts the scope for thorough inspection and review before the deployment. Moreover, from our experience, in the case of manual change risk assessment, when the risk associated with a change is marked as “low” by the change requester (which, in reality, need not be so – this may happen if the developer is new or less skilled, and hence may have applied poor judgement), that request is often completely disregarded by the domain experts while reviewing, which eventually may manifest as a critical issue later in the pipeline. Reducing the number of such failures motivates us to transition from a human-centered AI system to a near-autonomous AI system that predicts the risk of change requests, in order to minimize the requirement of intervention by the domain experts.</p>
      </sec>
      <sec id="sec-2-2">
        <p>In this paper, we present our experience of exploring the following questions while building an automated change risk assessment system:
• How can the label noise in the data affect the generalization accuracy of the risk prediction model?
• Can we have an automated process to remove the label noise in the data and train a model simultaneously?</p>
        <p>[Figure: (a) Data without missing values. (b) Data with imputed missing values.]</p>
        <sec id="sec-2-2-1">
          <title>2.1.1. Feature Sparsity in Data.</title>
        </sec>
      </sec>
      <sec id="sec-2-3">
        <title>Some of the features of our data exhibit high degree of</title>
        <p>sparsity. We impute the missing values but some error
always gets introduced by the process of imputation. Let
us try to understand why the error originating from the
process of missing value imputation leads to label noise.</p>
      </sec>
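      <sec>
        <p>To make the imputation-induced label noise concrete, here is a minimal numeric sketch of the same idea, assuming a hypothetical one-dimensional decision boundary on feature 1; this setup is illustrative and not taken from the paper:</p>
        <preformat>
```python
import numpy as np

# Hypothetical decision boundary for the toy example: an instance is
# 'class 1' if feature 1 exceeds 0.6, else 'class 2' (illustrative only).
def boundary(x):
    return 1 if x[0] > 0.6 else 2

# Four fully observed instances: columns are (feature 1, feature 2).
X = np.array([
    [0.9, 0.2],   # class 1
    [0.8, 0.7],   # class 1
    [0.1, 0.4],   # class 2
    [0.2, 0.6],   # class 2
])

point = np.array([np.nan, 0.3])   # a 'class 1' instance, feature 1 missing
label = 1                         # its original (true) label

imputed = point.copy()
imputed[0] = X[:, 0].mean()       # mean imputation: roughly 0.5

# The imputed point has moved across the boundary into the 'class 2'
# region, yet it still carries its original 'class 1' label.
noisy = boundary(imputed) != label
```
        </preformat>
        <p>The imputed point now sits on the ‘class 2’ side of the boundary while still labelled ‘class 1’ – exactly the mechanism by which imputation injects label noise near the class boundary.</p>
      </sec>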
      <sec id="sec-2-4">
        <p>The remainder of the paper is organized as follows. Section 2 covers the background and motivation of our work. Section 3 briefly explains our methodology. Section 4 provides the dataset description and the experimental results. Lastly, Section 5 describes some future work along with the concluding remarks.</p>
        <p>2. Background &amp; Motivation</p>
        <p>In the course of explaining our motivation for the transition from a human-centered AI system to an autonomous AI system, we revolve our discussion around the following questions one by one:
• Question 1. How can label noise get introduced into the change data?
• Question 2. How can label noise impact the generalization error of the risk prediction model?</p>
        <p>Consider a toy example where a data instance has two features (refer to Figure ??) and originally belongs to ‘class 1’. Now consider a situation where the same data point, as depicted in Figure ??, has the value of feature 1 missing, and it is eventually imputed (refer to Figure ??). Notice that, after imputation, the data instance has moved leftward and is now located in the region of ‘class 2’. However, in spite of its new location in the region of ‘class 2’ after imputation, it is still labelled as ‘class 1’, as that was the original label of the data instance. This eventually introduces label noise into the dataset. Notice also that data instances located close to the class boundary are more prone to producing label noise when the missing values of some of their features are imputed.</p>
        <sec id="sec-2-4-1">
          <title>2.1.2. Change management process of the organization.</title>
          <p>2.1. Analysis of Question 1. There are multiple ways in which label noise may get introduced into the data. In the context of our change data, let us introduce two primary reasons for label noise.</p>
          <p>Another major source of label noise lies in the change management process itself of the organization. Consider a situation where a change request (CRQ) is raised which has a high likelihood of causing failure in production, and the change manager, along with the change-requesting team, takes some mitigatory action against this CRQ to prevent it from causing failure in production. Due to such manual intervention by the change management team, this CRQ may not end up causing failure in production. When this CRQ becomes part of the historical dataset for training the risk prediction model, it will create the illusion that this CRQ belongs to the ‘normal’ or ‘non-risky’ class, as it did not lead to any failure in production; ideally, it should have been considered otherwise, as this CRQ had high potential to cause failure in production. Therefore, such manual intervention in the change management process, which is not reflected in the change data, causes label noise in the change data.</p>
          <p>2.2. Analysis of Question 2. In our work, we model the change risk predictor as a binary classifier. Therefore, to understand the impact of label noise on the accuracy of the change risk prediction model, we need to understand how label noise impacts a classifier in a generalized setting of supervised learning. Consider the following notation for the generalized setting of supervised classification. A training dataset D = {(x_1, y_1), …, (x_N, y_N)} is available. In each pair (x_i, y_i), x_i represents the feature vector and y_i the associated label. X and Y denote the spaces of x and y, respectively. Jointly, (X, Y) are drawn from an unknown distribution 𝒟 over X × Y. In other words, x is drawn from a distribution P, and the true label y for x is given by a function f : X → Y drawn from a distribution ℱ.</p>
          <p>The learner’s algorithm 𝒜 represents a function which takes the training data D as input and returns a distribution of classifiers h : X → Y. We define err(𝒜, D) := E_{h∼𝒜(D)}[err(h)] to represent the generalization error function, where err(h) := E[1(h(x) ≠ y)] and 1(⋅) is the indicator function. We also assume |X| = M and |Y| = K. We follow the notation below to characterize the training dataset D:
• Consider p = (p_1, ⋯, p_K) to represent the priors for each y ∈ Y.
• For each x ∈ X, sample a quantity q_x independently and uniformly from the set Q.
• The probability mass function of x is given by P(x) = q_x / Σ_{x′∈X} q_{x′}.</p>
          <p>
            Theorem 1. (Theorem 6 of [
            <xref ref-type="bibr" rid="ref3">3</xref>
            ]) For x ∈ X with true label y, h memorizing its noisy labels leads to the following order of individual excessive generalization error: Ω( (p_y²/K²) ⋅ mass(q, [2(N−1)/3, 3(N−1)/4]) ⋅ Σ_{y′≠y} ℙ[ỹ = y′ | y] ), where mass(q, [c_1, c_2]) := E[Σ_{x∈X} P(x) ⋅ 1(q_x ∈ [c_1, c_2])] and N is the total number of data instances in the training dataset.</p>
          <p>Note that the higher the value of this mass term, the higher is the lower bound of the generalization error of the model. When it comes to dealing with a tabular dataset of moderate to high dimension, such as ours, repetition of data instances in the dataset may apparently seem unlikely, but it is still approximately possible. An intuitive explanation could be that, in the context of supervised learning, a data instance is approximately represented by the set of its significant features with respect to the classification model, even though there can be a high number of insignificant or redundant features of that data instance. In that way, two data instances are seen by the model as a repetition if the values of their significant features are the same.</p>
        </sec>
      </sec>
      <sec id="sec-2-5">
        <p>
          When a model becomes sufficiently complex, it often ends up memorizing the labels of some of the instances in the training dataset. Theorem 6 of [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] shows how memorizing noisy labels for data instances of frequency q leads to a sharper decline in the generalization power of a supervised classifier.
        </p>
      </sec>
      <sec id="sec-2-6">
        <p>Therefore, our primary motivation to take up this problem is to do away with the label noise produced by some inherent noise-generation process, and with its adverse impact on the model’s accuracy.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>
        As proposed by [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], we employ the Progressive Label Correction (PLC) method to iteratively correct the labels and train the binary classification model. We first train the XGBoost model with the original noisy data for the first few iterations; we call this the warm-up period. Once the warm-up period is over, we start correcting labels. We only correct those labels on which the classifier f exhibits high confidence. The idea is based on the intuition that there exists a region in the data in which the noisy classifier f produces highly confident predictions and exhibits consistency with the clean Bayes optimal classifier. Thus, within this data region, the algorithm produces clean labels. More formally, within the specified data region, if f predicts a different label than the observed label ỹ with confidence above a threshold θ, i.e., |f(x) − 1/2| &gt; θ, we flip the label ỹ to the prediction of f. We continue this process until we reach a stage where no label can be corrected. We choose the value of θ empirically.
      </p>
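      <p>The correct-and-retrain loop described above can be sketched as follows. This is a simplified, illustrative implementation, not the paper’s exact configuration: it retrains a fresh model each round, uses scikit-learn’s GradientBoostingClassifier as a stand-in for XGBoost, and the threshold, round limit, and model settings are assumptions.</p>
      <preformat>
```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def progressive_label_correction(X, y_noisy, theta=0.35, max_rounds=10):
    """Sketch of PLC: iteratively flip labels that the current classifier
    contradicts with high confidence (|f(x) - 1/2| > theta), then retrain."""
    y = y_noisy.copy()
    clf = None
    for _ in range(max_rounds):
        # Shallow boosted stumps stand in for the warm-up-trained XGBoost model.
        clf = GradientBoostingClassifier(n_estimators=30, max_depth=1)
        clf.fit(X, y)
        p1 = clf.predict_proba(X)[:, 1]        # estimated P(class 1 | x)
        pred = (p1 >= 0.5).astype(int)
        confident = np.abs(p1 - 0.5) > theta   # the PLC confidence test
        flip = np.logical_and(confident, pred != y)
        if not flip.any():                     # stop: no label can be corrected
            break
        y[flip] = pred[flip]                   # flip the observed label to f's prediction
    return clf, y
```
      </preformat>
      <p>In the paper’s setup, the warm-up corresponds to the first boosting iterations during which no label is touched; the stopping rule here is the same as described above – the loop ends once no label can be corrected.</p>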
    </sec>
    <sec id="sec-4">
      <title>4. Experimental Setup &amp; Results</title>
      <p>4.1. Dataset Description</p>
      <p>We have collected change data for 3 months, which comprises ∼27K data samples labelled either as “risky” (class 1), i.e., potentially leading to a failure in the production system, or as “normal” (class 0). Out of the ∼27K data samples, there are only 65 instances which belong to the class “risky”, i.e., the positive class.</p>
      <p>• Feature Description: Each instance in the data consists of 20 features; out of these, 2 features are continuous while the rest are categorical.
• Sparsity: There are many features that have missing values; some of the features even have almost 30% of their values missing.</p>
      <p>We create 3 separate datasets, one for each month, from the overall data and perform two experiments.</p>
      <p>[Figure: (a) Experiment 1. (b) Experiment 2.]</p>
      <p>4.2. Experimental Results</p>
      <p>
        We use a gradient-boosted decision tree (XGBoost) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]
to generate the probability with which a new change
request may cause failure in production. We consider this
probability as the estimation of the risk for a change. This
is our baseline model. Note that we had explored other
models as well; however, the XGBoost model produced
the best results as recorded in our prior work [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>
        Next, we use the PLC algorithm [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] to remove the label noise in the dataset, and then re-train the XGBoost model on the label-corrected dataset iteratively, as described in the previous section. Detailed comparisons between the baseline model (trained on the original data) and the model trained following the PLC method for Experiment 1 and Experiment 2 are shown in Table 1 and Table 2, respectively. Note that the metric balanced accuracy is useful when the classes are imbalanced and is defined as (sensitivity + specificity)/2; we believe that the rest of the metrics used in these tables are standard and need no definition. As can be seen from Table 1 and Table 2, the model trained with the PLC method outperformed the baseline across all the metrics. Figure 2 shows how the F1 score varies with iterations while training the model with the PLC method. Note that we used a warm-up period of 30 iterations, which is why a sharp jump is noticed once label correction is applied from the 31st iteration onward.
      </p>
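      <p>For reference, the balanced-accuracy metric used in the tables can be computed directly from its definition; a small sketch follows (the function name and inputs are our own, not from the paper, and both classes are assumed to be present):</p>
      <preformat>
```python
import numpy as np

def balanced_accuracy(y_true, y_pred):
    """(sensitivity + specificity) / 2 for binary labels in {0, 1}."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) * (y_pred == 1))
    fn = np.sum((y_true == 1) * (y_pred == 0))
    tn = np.sum((y_true == 0) * (y_pred == 0))
    fp = np.sum((y_true == 0) * (y_pred == 1))
    sensitivity = tp / (tp + fn)   # recall on the "risky" (positive) class
    specificity = tn / (tn + fp)   # recall on the "normal" (negative) class
    return (sensitivity + specificity) / 2
```
      </preformat>
      <p>For binary labels this matches scikit-learn’s balanced_accuracy_score, and, unlike plain accuracy, it is not dominated by the majority class – which matters here given only 65 positive samples out of ∼27K.</p>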
      <sec id="sec-4-1">
        <title>5. Conclusion &amp; Future Work</title>
        <p>In this paper, we have shown how we made the transition from a human-centred AI system to a near-autonomous AI system by employing the progressive label correction method in order to get rid of the inherent label noise in the data. We now seek labels for a handful of samples from the domain experts only when the model is extremely uncertain about their class. Experimental results exhibit significant improvement in the model’s performance with respect to various metrics.</p>
        <p>As part of the future work, we aim to build not just a more accurate model but a more accurate and trustworthy model, as earning the trust of the end users for the ML model is the key to success in driving business value with ML, especially in ‘operations’ in a large-scale organization. Therefore, we are in the process of building an enhanced label-noise removal method which is based on the intuition that, in noisy data, there exists a ‘data region’ in which the noisy classifier f produces highly confident and trustworthy predictions consistent with the clean ‘Bayes optimal classifier’. A standard approach to quantify a classifier’s trustworthiness is to use its own estimated confidence or score, such as probabilities from the softmax layer of a neural network, the distance to the separating hyper-plane in support vector classification, or the mean class probabilities of the trees in a random forest.</p>
      </sec>
      <sec id="sec-4-2">
        <p>However, the latest research shows that a higher confidence score from the model does not necessarily assure a higher probability of correctness of the classifier. Therefore, the fact that a classifier’s own confidence score may not be the best judge of its own trustworthiness makes our ongoing work all the more challenging, but interesting.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <article-title>[1] Digital.ai, Change risk prediction (</article-title>
          <year>2019</year>
          ). URL: https: //digital.ai
          <article-title>/change-risk-prediction.</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>B.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chatterjee</surname>
          </string-name>
          , S. Paul,
          <string-name>
            <given-names>H.</given-names>
            <surname>Matha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Parsai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Banerjee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Agneeswaran</surname>
          </string-name>
          ,
          <article-title>Look before you leap! designing a human-centered AI system for change risk assessment</article-title>
          ,
          <source>in: ICAART</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>655</fpage>
          -
          <lpage>662</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>Understanding instance-level label noise: Disparate impacts and treatments</article-title>
          , in: ICML, volume
          <volume>139</volume>
          <source>of Proceedings of Machine Learning Research, PMLR</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>6725</fpage>
          -
          <lpage>6735</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , S. Zheng,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Goswami</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>Learning with feature-dependent label noise: A progressive approach</article-title>
          , in: ICLR,
          <year>2021</year>
          . URL: https://openreview.net/forum?id=ZPa2SyGcbwh.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>T.</given-names>
            <surname>Chen</surname>
          </string-name>
          , C. Guestrin,
          <article-title>XGBoost: A scalable tree boosting system</article-title>
          ,
          <source>in: KDD</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>785</fpage>
          -
          <lpage>794</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>