=Paper=
{{Paper
|id=Vol-3318/short13
|storemode=property
|title=Developing a noise-aware AI system for change risk assessment with minimal human intervention
|pdfUrl=https://ceur-ws.org/Vol-3318/short13.pdf
|volume=Vol-3318
|authors=Subhadip Paul,Anirban Chatterjee,Binay Gupta,Kunal Banerjee
|dblpUrl=https://dblp.org/rec/conf/cikm/PaulCG022
}}
==Developing a noise-aware AI system for change risk assessment with minimal human intervention==
Subhadip Paul, Anirban Chatterjee, Binay Gupta and Kunal Banerjee (Walmart Global Tech, Bengaluru, Karnataka, India)

Abstract

Introducing changes to a system in production may sometimes result in failures, and eventual revenue loss, for any industry. Therefore, it is important to monitor the "risk" that each such change request may present. Change risk assessment is a sub-field of operations management that deals with this problem in a systematic manner. However, a manual or even a human-centered AI system may find it challenging to meet the scaling demands of a large industry; accordingly, an automated system for change risk assessment is highly desired. A few commercial solutions are available for this problem, but they lack the ability to deal with highly noisy data, which is a real possibility for such systems. Prior literature has proposed methods that integrate the feedback of domain experts into the training process of a machine learning model to deal with noisy data. Even though some of these methods produce decent risk prediction accuracy, such an arrangement of continuously collecting feedback from domain experts faces practical challenges due to limitations in the experts' bandwidth and availability. Therefore, as part of this work, we explore a way to transition from a human-centered AI system to a near-autonomous AI system, which minimizes the need for intervention by domain experts without compromising the prediction accuracy of the model. Initial experiments with the proposed AI system exhibit a 10% improvement in risk prediction accuracy in comparison with a baseline that was trained by integrating the feedback of domain experts into the training process.

Keywords: change management, risk assessment, human-centered decision making

1. Introduction

Launching a new business or expanding the repertoire of features of an existing business is a common phenomenon in modern technology-driven industries. All such upgrades require a series of software changes to a base system that is already in production. However, one needs to be cautious before pushing in these changes, because each one of them can potentially cause a failure in the system. In the current era of agile development, a large volume of change requests often arrives right before sprint deadlines. At times, a tight delivery schedule severely restricts the scope for thorough inspection and review before deployment.
Moreover, from our experience with manual change risk assessment, when the risk associated with a change is marked as "low" by the change requester (which, in reality, need not be so; this may happen if the developer is new or less skilled and has applied poor judgement), that request is often completely disregarded by the domain experts during review, which may eventually manifest as a critical issue later in the pipeline. Reducing the number of such failures in a production system is one of the key challenges for an industry that wants to provide seamless service to its customers.

There are a few commercial solutions, such as the one provided by [1], which address the problem of automated change risk assessment. In [2], the authors addressed a few of the limitations of the currently available commercial solutions, such as handling concept drift in the data and seeking feedback from domain experts based on the estimated uncertainty of the model, among others. However, in practice, the problem of predicting the risk associated with a change request can be further exacerbated by the presence of label noise in the data. Such label noise can be primarily attributed to inaccuracies that creep in during the imputation of missing values, and to remedial interventions by the change management team that prevent some change requests from failing in production. We need frequent and elaborate feedback from experts on several data samples to ensure high reliability and generalization accuracy of a model trained on change data with a high degree of label noise. However, such frequent and elaborate feedback may not always be practically possible due to limitations in the bandwidth and availability of the domain experts. This motivates us to transition from a human-centered AI system to a near-autonomous AI system for predicting the risk of change requests, in order to minimize the required intervention by domain experts.

In this paper, we present our experience of exploring the following questions while building an automated change risk assessment system:

• How can label noise in the data affect the generalization accuracy of the risk prediction model?
• Can we have an automated process that removes the label noise in the data and trains a model simultaneously?

The remainder of the paper is organized as follows. Section 2 covers the background and motivation of our work. Section 3 briefly explains our methodology. Section 4 provides the dataset description and the experimental results. Lastly, Section 5 describes some future work along with the concluding remarks.

2. Background & Motivation

In the course of explaining our motivation for the transition from a human-centered AI system to an autonomous AI system, we organize our discussion around the following questions, one by one:

• Question 1. How can label noise get introduced into the change data?
• Question 2. How can label noise impact the generalization error of the risk prediction model?

2.1. Analysis of Question 1

There are multiple ways in which label noise may get introduced into the data. In the context of our change data, we discuss two primary sources of label noise.

2.1.1. Feature Sparsity in Data

Some of the features of our data exhibit a high degree of sparsity. We impute the missing values, but some error is always introduced by the imputation process. Let us try to understand why the error originating from missing value imputation leads to label noise. Consider a toy example where a data instance has two features (refer to Figure 1a) and originally belongs to 'class 1'. Now consider a situation where the same data point, as depicted in Figure 1a, has the value of feature f1 missing and eventually imputed (refer to Figure 1b). Notice that, after imputation, the data instance has moved leftward and is now located in the region of 'class 2'. However, despite this new location, it is still labelled as 'class 1', since that was the original label of the data instance. This introduces label noise into the dataset. Notice also that data instances located close to the class boundary are more prone to producing label noise when the missing values of some of their features are imputed.

Figure 1: Data with and without imputed missing values. (a) Data without missing values. (b) Data with imputed missing values.
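To make the toy example concrete, the sketch below is an illustration of the same effect on synthetic data; the one-dimensional class layout, the mean-imputation step, and the simple boundary at f1 = 0 are our own assumptions for illustration, not the production change data or imputer.

```python
# Illustrative sketch only: synthetic data standing in for the change data.
# The uninformative second feature f2 is omitted, since it does not affect the class region.
import numpy as np

rng = np.random.default_rng(0)

# Feature f1 separates the two classes; 'class 2' is the larger class here.
f1_class1 = rng.normal(loc=3.0, scale=0.5, size=100)    # 'class 1' instances
f1_class2 = rng.normal(loc=-3.0, scale=0.5, size=300)   # 'class 2' instances
f1_all = np.concatenate([f1_class1, f1_class2])

def true_region(f1):
    """Stand-in for the true class boundary: 'class 1' lives at f1 > 0."""
    return 1 if f1 > 0 else 2

# A genuine 'class 1' instance whose f1 value goes missing and is mean-imputed.
f1_original = 2.5
f1_imputed = f1_all.mean()   # the global mean is pulled towards the larger class (about -1.5)

print("region before imputation:", true_region(f1_original))  # 1
print("region after  imputation:", true_region(f1_imputed))   # 2
# The stored label is still 'class 1', so the imputed instance now sits in the
# 'class 2' region with a 'class 1' label -- the label noise described in Section 2.1.1.
```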
2.1.2. Change Management Process of the Organization

Another major source of label noise lies in the change management process of the organization itself. Consider a situation where a change request (CRQ) is raised that has a high likelihood of causing a failure in production, and the change manager, along with the change-requesting team, takes some mitigatory action against this CRQ to prevent it from causing a failure in production. Due to such manual intervention by the change management team, this CRQ may not end up causing a failure. When this CRQ becomes part of the historical dataset for training the risk prediction model, it creates the illusion that the CRQ belongs to the 'normal' or 'non-risky' class, since it did not lead to any failure in production; ideally, however, it should have been considered otherwise, because the CRQ had a high potential to cause a failure. Therefore, such manual intervention in the change management process, which is not reflected in the change data, causes label noise in the change data.

2.2. Analysis of Question 2

In our work, we model the change risk predictor as a binary classifier. Therefore, to understand the impact of label noise on the accuracy of the change risk prediction model, we need to understand how label noise impacts a classifier in a generalized setting of supervised learning. Consider the following notation for the generalized setting of supervised classification. A training dataset S = {(x_1, y_1), ..., (x_n, y_n)} is available. In each pair (x_i, y_i), x_i represents the feature vector and y_i the associated label. X and Y denote the spaces of x and y, respectively. Jointly, (x, y) is drawn from an unknown distribution P over X × Y. In other words, x is drawn from a distribution D, and the true label y for x is given by a function f: X → Y drawn from a distribution F. The learner's algorithm A is a function that takes the training data S as input and returns a distribution of classifiers h: X → Y. We define err_P(A, S) := E_{h∼A(S)}[err_P(h)] to represent the generalization error, where err_P(h) := E_P[1(h(x) ≠ y)] and 1(·) is the indicator function. We also assume |X| = n and |Y| = m. We follow the notation below to characterize the training dataset S:

• Consider π = {π_1, ..., π_n} to represent the priors for each x ∈ X.
• For each x ∈ X, sample a quantity p_x independently and uniformly from the set π.
• The probability mass function of x is given by D(x) = p_x / Σ_{x∈X} p_x.

When a model becomes sufficiently complex, it often ends up memorizing the labels of some of the instances in the training dataset. Theorem 6 of [3] shows how memorizing noisy labels for data instances of frequency l leads to a sharper decline in the generalization power of a supervised classifier.

Theorem 1 (Theorem 6 of [3]). For x ∈ X_{S=l} with true label y, a classifier h memorizing its l noisy labels incurs an individual excessive generalization error of order

$\Omega\left(\frac{l^2}{n^2}\cdot \mathrm{weight}\left(\pi,\left[\frac{2}{3}\cdot\frac{l-1}{n-1},\ \frac{3}{4}\cdot\frac{l}{n}\right]\right)\right)\cdot \sum_{k\neq y}\mathbb{P}[\tilde{y}=k\mid x]$,

where

$\mathrm{weight}(\pi,[\beta_1,\beta_2]) = \mathbb{E}\left[\sum_{x\in X} D(x)\cdot \mathbb{1}\left(D(x)\in[\beta_1,\beta_2]\right)\right]$

and n is the total number of data instances in the training dataset.

Note that the higher the value of l, the higher the lower bound of the generalization error of the model. When dealing with a tabular dataset of moderate to high dimension, such as ours, repetition of data instances may seem unlikely, yet it is still approximately possible. An intuitive explanation is that, in the context of supervised learning, a data instance is approximately represented by the set of its significant features with respect to the classification model, even though the instance may carry a large number of insignificant or redundant features. In that sense, two data instances are seen by the model as a repetition if the values of their significant features are the same.

Therefore, our primary motivation for taking up this problem is to do away with the label noise arising from these inherent noise generation processes and with its adverse impact on the model's accuracy.
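The quantity weight(π, [β1, β2]) used in Theorem 1 can be read as the expected probability mass that falls inside the interval [β1, β2] under the sampling scheme above. The short sketch below estimates it by Monte Carlo, directly following the three-step construction of D(x); the particular prior set π, the interval, and the sample sizes are arbitrary choices for illustration, not values from our data.

```python
# Monte Carlo estimate of weight(pi, [beta1, beta2]) as defined in Section 2.2.
# The prior set pi and the interval below are illustrative placeholders.
import numpy as np

rng = np.random.default_rng(0)

def weight(pi, beta1, beta2, n_instances, n_trials=5000):
    """E[ sum_x D(x) * 1(D(x) in [beta1, beta2]) ], where each p_x is drawn
    independently and uniformly from the set pi and D(x) = p_x / sum(p_x)."""
    total = 0.0
    for _ in range(n_trials):
        p = rng.choice(pi, size=n_instances, replace=True)  # p_x ~ Uniform(pi)
        d = p / p.sum()                                      # probability mass D(x)
        mask = (d >= beta1) & (d <= beta2)
        total += d[mask].sum()
    return total / n_trials

pi = np.array([0.001, 0.01, 0.1, 1.0])   # example prior set
print(weight(pi, beta1=0.005, beta2=0.05, n_instances=100))
```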
3. Methodology

As proposed in [4], we employ the Progressive Label Correction (PLC) method to iteratively correct the labels and train the binary classification model. We first train the XGBoost model with the original noisy data for the first few iterations, which we call the warm-up period. Once the warm-up period is over, we start correcting the labels. We only correct those labels on which the classifier f exhibits high confidence. The idea is based on the intuition that there exists a region of the data in which the noisy classifier f produces highly confident predictions and is consistent with the clean Bayes optimal classifier; within that region, the algorithm therefore produces clean labels. More formally, within the specified data region, if f predicts a label different from the observed label ỹ with confidence above a threshold θ, i.e. |f(x) − 1/2| > θ, we flip the label ỹ to the prediction of f. We continue this process until we reach a stage where no label can be corrected any further. We choose the value of θ empirically.
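A compact sketch of this correction loop is given below. It follows the PLC recipe of [4] only at a high level: warm-up training on the noisy labels, followed by repeated rounds of flipping only those labels that the current model contradicts with confidence above θ. The function name, the loop structure, and hyper-parameters such as the warm-up length and θ are our own placeholders, not the exact values or implementation used in our experiments.

```python
# High-level sketch of Progressive Label Correction (PLC) [4] around an XGBoost
# binary classifier. Names and hyper-parameters are illustrative placeholders.
import numpy as np
from xgboost import XGBClassifier

def progressive_label_correction(X, y_noisy, theta=0.3, warmup_rounds=30, max_rounds=200):
    """Iteratively retrain the classifier and flip labels it contradicts with
    confidence |f(x) - 1/2| > theta, until no more labels change."""
    y = y_noisy.copy()                                # observed (possibly noisy) 0/1 labels
    model = XGBClassifier(n_estimators=100, eval_metric="logloss")

    for round_idx in range(max_rounds):
        model.fit(X, y)

        if round_idx < warmup_rounds:
            continue                                  # warm-up: keep training on uncorrected labels

        proba = model.predict_proba(X)[:, 1]          # f(x): predicted probability of class 1
        confident = np.abs(proba - 0.5) > theta       # high-confidence region
        predicted = (proba > 0.5).astype(int)
        flip = confident & (predicted != y)           # model confidently disagrees with y_tilde

        if not flip.any():
            break                                     # no label can be corrected any more
        y[flip] = predicted[flip]                     # flip y_tilde to the model's prediction

    return model, y
```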
4. Experimental Setup & Results

4.1. Dataset Description

We have collected change data for 3 months, comprising ∼27K data samples that are labelled either as "risky" (class 1), i.e., potentially leading to a failure in the production system, or as "normal" (class 0). Out of the ∼27K data samples, only 65 instances belong to the "risky" (positive) class.

• Feature Description: Each instance in the data consists of 20 features; out of these, 2 features are continuous while the rest are categorical.
• Sparsity: Many features have missing values; some features have almost 30% of their values missing.

We create 3 separate datasets, one per month, from the overall data and perform two experiments:

• Experiment 1: The model is trained with the change data of Month 1, and the change data of Month 2 is used for validation.
• Experiment 2: The model is trained with the change data of Month 2, and the change data of Month 3 is used for validation.

We use a gradient-boosted decision tree (XGBoost) [5] to generate the probability with which a new change request may cause a failure in production, and we consider this probability as the estimate of the risk of the change. This is our baseline model. Note that we had explored other models as well; however, the XGBoost model produced the best results, as recorded in our prior work [2].

4.2. Experimental Results

Next, we use the PLC algorithm [4] to remove the label noise in the dataset and re-train the XGBoost model iteratively with the label-corrected dataset, as described in the previous section. Detailed comparisons between the baseline model (trained on the original data) and the model trained following the PLC method for Experiment 1 and Experiment 2 are shown in Table 1 and Table 2, respectively. Note that the metric balanced accuracy is useful when the classes are imbalanced and is defined as (sensitivity + specificity)/2; we believe the rest of the metrics used in these tables are standard and need no definition. As can be seen from Table 1 and Table 2, the model trained with the PLC method outperformed the baseline across all the metrics. Figure 2 shows how the F1 score varies with iterations while training the model with the PLC method. Note that we used a warm-up period of 30 iterations, which is why a sharp jump is noticed once label correction is applied from the 31st iteration onward.

Figure 2: Change in F1 score with iterations while training the model with the PLC method (with a warm-up period of 30 iterations). (a) Experiment 1. (b) Experiment 2.

Table 1: Comparison of results of two models for Experiment 1

Metric                     | Baseline | After PLC
True Positive Rate         | 0.62     | 0.89
False Positive Rate        | 0.19     | 0.05
True Negative Rate         | 0.81     | 0.95
False Negative Rate        | 0.38     | 0.11
Precision                  | 0.06     | 0.26
Positive Likelihood Ratio  | 3.35     | 17.57
F1 Score                   | 0.11     | 0.40
Balanced Accuracy          | 0.72     | 0.92

Table 2: Comparison of results of two models for Experiment 2

Metric                     | Baseline | After PLC
True Positive Rate         | 0.51     | 0.74
False Positive Rate        | 0.06     | 0.02
True Negative Rate         | 0.94     | 0.98
False Negative Rate        | 0.49     | 0.26
Precision                  | 0.05     | 0.20
Positive Likelihood Ratio  | 8.19     | 42.44
F1 Score                   | 0.09     | 0.32
Balanced Accuracy          | 0.72     | 0.86
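For completeness, all the metrics reported in Table 1 and Table 2 can be derived from the confusion-matrix counts. The helper below is our own illustration with arbitrary example counts (not the counts behind the reported tables); it spells out the definitions we use, including balanced accuracy as (sensitivity + specificity)/2 and the positive likelihood ratio as TPR/FPR.

```python
# Metric definitions used in Tables 1 and 2, computed from confusion-matrix counts.
# The example counts passed at the bottom are arbitrary illustrative values.
def report(tp, fp, tn, fn):
    tpr = tp / (tp + fn)                  # true positive rate (sensitivity, recall)
    fpr = fp / (fp + tn)                  # false positive rate
    tnr = tn / (tn + fp)                  # true negative rate (specificity)
    fnr = fn / (fn + tp)                  # false negative rate
    precision = tp / (tp + fp)
    plr = tpr / fpr                       # positive likelihood ratio
    f1 = 2 * precision * tpr / (precision + tpr)
    balanced_accuracy = (tpr + tnr) / 2   # robust summary for highly imbalanced change data
    return {
        "TPR": tpr, "FPR": fpr, "TNR": tnr, "FNR": fnr,
        "Precision": precision, "Positive Likelihood Ratio": plr,
        "F1 Score": f1, "Balanced Accuracy": balanced_accuracy,
    }

print(report(tp=40, fp=120, tn=8000, fn=10))
```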
5. Conclusion & Future Work

In this paper, we have shown how we made the transition from a human-centred AI system to a near-autonomous AI system by employing the progressive label correction method to get rid of the inherent label noise in the data. We now seek labels from the domain experts for only a handful of samples, namely those whose class the model is extremely uncertain about. Experimental results exhibit significant improvement in the model's performance with respect to various metrics.

As part of future work, we aim to build not just a more accurate model but also a more trustworthy one, since earning the trust of the end users of an ML model is key to driving business value with ML, especially in 'operations' in a large-scale organization. Therefore, we are in the process of building an enhanced label-noise removal method based on the intuition that, in noisy data, there exists a 'data region' in which the noisy classifier f produces highly confident and trustworthy predictions consistent with the clean 'Bayes optimal classifier'. A standard approach to quantifying a classifier's trustworthiness is to use its own estimated confidence or score, such as the probabilities from the softmax layer of a neural network, the distance to the separating hyperplane in support vector classification, or the mean class probabilities over the trees in a random forest. However, recent research shows that a higher confidence score from the model does not necessarily assure a higher probability of correctness of the classifier. The fact that a classifier's own confidence score may not be the best judge of its own trustworthiness makes our ongoing work all the more challenging, but also interesting.

References

[1] Digital.ai, Change risk prediction, 2019. URL: https://digital.ai/change-risk-prediction.
[2] B. Gupta, A. Chatterjee, S. Paul, H. Matha, L. Parsai, K. Banerjee, V. Agneeswaran, Look before you leap! Designing a human-centered AI system for change risk assessment, in: ICAART, 2022, pp. 655–662.
[3] Y. Liu, Understanding instance-level label noise: Disparate impacts and treatments, in: ICML, volume 139 of Proceedings of Machine Learning Research, PMLR, 2021, pp. 6725–6735.
[4] Y. Zhang, S. Zheng, P. Wu, M. Goswami, C. Chen, Learning with feature-dependent label noise: A progressive approach, in: ICLR, 2021. URL: https://openreview.net/forum?id=ZPa2SyGcbwh.
[5] T. Chen, C. Guestrin, XGBoost: A scalable tree boosting system, in: KDD, 2016, pp. 785–794.