<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Developing a noise-aware AI system for change risk assessment with minimal human intervention</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Subhadip Paul</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anirban Chatterjee</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Binay Gupta</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kunal Banerjee</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Walmart Global Tech. Bengaluru</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Karnataka</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>India</string-name>
        </contrib>
      </contrib-group>
      <pub-date>
        <year>2022</year>
      </pub-date>
      <abstract>
        <p>Introducing changes to a system in production may sometimes result in failures, and eventual revenue loss, for any industry. It is therefore important to monitor the “risk” that each such change request presents. Change risk assessment is a sub-field of operations management that deals with this problem in a systematic manner. However, a manual or even a human-centered AI system may find it challenging to meet the scaling demands of a large industry; accordingly, an automated system for change risk assessment is highly desired. A few commercial solutions are available to address this problem, but they lack the ability to deal with highly noisy data, which such systems are quite likely to encounter. Prior work has proposed methods to integrate the feedback of domain experts into the training process of a machine learning model to deal with noisy data. Although some of these methods achieve decent risk prediction accuracy, continuously collecting feedback from domain experts poses practical challenges due to the limited bandwidth and availability of the experts. Therefore, as part of this work, we explore a transition from a human-centered AI system to a near-autonomous AI system, which minimizes the need for intervention by domain experts without compromising the prediction accuracy of the model. Initial experiments with the proposed AI system exhibit a 10% improvement in risk prediction accuracy in comparison with a baseline that was trained by integrating the feedback of domain experts into the training process.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Keywords</title>
      <p>change management, risk assessment, human-centered decision making</p>
    </sec>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <sec id="sec-2-1">
        <p>Launching a new business or expanding the repertoire of features for an existing business is a common phenomenon in the modern technology-driven industries. All such upgrades require a series of software changes to a base system that is already in production. However, one needs to be cautious prior to pushing in these changes, because each one of them can potentially cause a failure in the system. In the current era of agile development, a large volume of requests often comes in right before the sprint deadlines. At times, a tight delivery schedule severely restricts the scope for thorough inspection and review before the deployment. Moreover, from our experience, in the case of manual change risk assessment, when the risk associated with a change is marked as “low” by the change requester (which, in reality, need not be so – this may happen if the developer is new or less skilled, and hence may have applied poor judgement), that request is often completely disregarded by the domain experts while reviewing, which eventually may manifest as a critical issue later in the pipeline. Reducing the number of such failures motivates us to transition from a human-centered AI system to a near-autonomous AI system that predicts the risk of change requests, in order to minimize the requirement of intervention by the domain experts.</p>
      </sec>
      <sec id="sec-2-2">
        <p>In this paper, we present our experience of exploring the following questions while building an automated change risk assessment system:
• How can the label noise in the data affect the generalization accuracy of the risk prediction model?
• Can we have an automated process to remove the label noise in the data and train a model simultaneously?</p>
        <p>[Figure: (a) Data without missing values. (b) Data with imputed missing values.]</p>
        <sec id="sec-2-2-1">
          <title>2.1.1. Feature Sparsity in Data.</title>
        </sec>
      </sec>
      <sec id="sec-2-3">
        <title>Some of the features of our data exhibit high degree of</title>
        <p>sparsity. We impute the missing values but some error
always gets introduced by the process of imputation. Let
us try to understand why the error originating from the
process of missing value imputation leads to label noise.</p>
      </sec>
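      <sec>
        <p>To make the imputation-induced label noise concrete, here is a minimal numeric sketch of the same idea, assuming a hypothetical one-dimensional decision boundary on feature 1; this setup is illustrative and not taken from the paper:</p>
        <preformat>
```python
import numpy as np

# Hypothetical decision boundary for the toy example: an instance is
# 'class 1' if feature 1 exceeds 0.6, else 'class 2' (illustrative only).
def boundary(x):
    return 1 if x[0] > 0.6 else 2

# Four fully observed instances: columns are (feature 1, feature 2).
X = np.array([
    [0.9, 0.2],   # class 1
    [0.8, 0.7],   # class 1
    [0.1, 0.4],   # class 2
    [0.2, 0.6],   # class 2
])

point = np.array([np.nan, 0.3])   # a 'class 1' instance, feature 1 missing
label = 1                         # its original (true) label

imputed = point.copy()
imputed[0] = X[:, 0].mean()       # mean imputation: roughly 0.5

# The imputed point has moved across the boundary into the 'class 2'
# region, yet it still carries its original 'class 1' label.
noisy = boundary(imputed) != label
```
        </preformat>
        <p>The imputed point now sits on the ‘class 2’ side of the boundary while still labelled ‘class 1’ – exactly the mechanism by which imputation injects label noise near the class boundary.</p>
      </sec>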
      <sec id="sec-2-4">
        <p>The remainder of the paper is organized as follows. Section 2 covers the background and motivation of our work. Section 3 briefly explains our methodology. Section 4 provides the dataset description and the experimental results. Lastly, Section 5 describes some future work along with the concluding remarks.</p>
        <p>2. Background &amp; Motivation</p>
        <p>In the course of explaining our motivation for the transition from a human-centered AI system to an autonomous AI system, we revolve our discussion around the following questions one by one:
• Question 1. How can label noise get introduced into the change data?
• Question 2. How can label noise impact the generalization error of the risk prediction model?</p>
        <p>Consider a toy example where a data instance has two features (refer to Figure ??) and originally belongs to ‘class 1’. Now consider a situation where the same data point, as depicted in Figure ??, has the value of feature 1 missing, and it is eventually imputed (refer to Figure ??). Notice that, after imputation, the data instance has moved leftward and is now located in the region of ‘class 2’. However, in spite of its new location in the region of ‘class 2’ after imputation, it is still labelled as ‘class 1’, as that was the original label of the data instance. This eventually introduces label noise into the dataset. Notice also that data instances located close to the class boundary are more prone to producing label noise when the missing values of some of their features are imputed.</p>
        <sec id="sec-2-4-1">
          <title>2.1.2. Change management process of the organization.</title>
          <p>2.1. Analysis of Question 1. There are multiple ways in which label noise may get introduced into the data. In the context of our change data, let us introduce two primary reasons for label noise.</p>
          <p>Another major source of label noise lies in the change management process itself of the organization. Consider a situation where a change request (CRQ) is raised which has a high likelihood of causing failure in production, and the change manager, along with the change-requesting team, takes some mitigatory action against this CRQ to prevent it from causing failure in production. Due to such manual intervention by the change management team, this CRQ may not end up causing failure in production. When this CRQ becomes part of the historical dataset for training the risk prediction model, it will create the illusion that this CRQ belongs to the ‘normal’ or ‘non-risky’ class, as it did not lead to any failure in production; ideally, it should have been considered otherwise, as this CRQ had high potential to cause failure in production. Therefore, such manual intervention in the change management process, which is not reflected in the change data, causes label noise in the change data.</p>
          <p>2.2. Analysis of Question 2. In our work, we model the change risk predictor as a binary classifier. Therefore, to understand the impact of label noise on the accuracy of the change risk prediction model, we need to understand how label noise impacts a classifier in a generalized setting of supervised learning. Consider the following notation for the generalized setting of supervised classification. A training dataset D = {(x_1, y_1), …, (x_N, y_N)} is available. In each pair (x_i, y_i), x_i represents the feature vector and y_i the associated label. X and Y denote the spaces of x and y, respectively. Jointly, (X, Y) are drawn from an unknown distribution 𝒟 over X × Y. In other words, x is drawn from a distribution P, and the true label y for x is given by a function f : X → Y drawn from a distribution ℱ.</p>
          <p>The learner’s algorithm 𝒜 represents a function which takes the training data D as input and returns a distribution of classifiers h : X → Y. We define err(𝒜, D) := E_{h∼𝒜(D)}[err(h)] to represent the generalization error function, where err(h) := E[1(h(x) ≠ y)] and 1(⋅) is the indicator function. We also assume |X| = M and |Y| = K. We follow the notation below to characterize the training dataset D:
• Consider p = (p_1, ⋯, p_K) to represent the priors for each y ∈ Y.
• For each x ∈ X, sample a quantity q_x independently and uniformly from the set Q.
• The probability mass function of x is given by P(x) = q_x / Σ_{x′∈X} q_{x′}.</p>
          <p>
            Theorem 1. (Theorem 6 of [
            <xref ref-type="bibr" rid="ref3">3</xref>
            ]) For x ∈ X with true label y, h memorizing its noisy labels leads to the following order of individual excessive generalization error: Ω( (p_y²/K²) ⋅ mass(q, [2(N−1)/3, 3(N−1)/4]) ⋅ Σ_{y′≠y} ℙ[ỹ = y′ | y] ), where mass(q, [c_1, c_2]) := E[Σ_{x∈X} P(x) ⋅ 1(q_x ∈ [c_1, c_2])] and N is the total number of data instances in the training dataset.</p>
          <p>Note that the higher the value of this mass term, the higher is the lower bound of the generalization error of the model. When it comes to dealing with a tabular dataset of moderate to high dimension, such as ours, repetition of data instances in the dataset may apparently seem unlikely, but it is still approximately possible. An intuitive explanation could be that, in the context of supervised learning, a data instance is approximately represented by the set of its significant features with respect to the classification model, even though there can be a high number of insignificant or redundant features of that data instance. In that way, two data instances are seen by the model as a repetition if the values of their significant features are the same.</p>
        </sec>
      </sec>
      <sec id="sec-2-5">
        <p>
          When a model becomes sufficiently complex, it often ends up memorizing the labels of some of the instances in the training dataset. Theorem 6 of [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] shows how memorizing noisy labels for data instances of frequency q leads to a sharper decline in the generalization power of a supervised classifier.
        </p>
      </sec>
      <sec id="sec-2-6">
        <p>Therefore, our primary motivation to take up this problem is to do away with the label noise produced by some inherent noise-generation process, and with its adverse impact on the model’s accuracy.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>
        As proposed by [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], we employ the Progressive Label Correction (PLC) method to iteratively correct the labels and train the binary classification model. We first train the XGBoost model with the original noisy data for the first few iterations; we call this the warm-up period. Once the warm-up period is over, we start correcting labels. We only correct those labels on which the classifier f exhibits high confidence. The idea is based on the intuition that there exists a region in the data in which the noisy classifier f produces highly confident predictions and exhibits consistency with the clean Bayes optimal classifier. Thus, within this data region, the algorithm produces clean labels. More formally, within the specified data region, if f predicts a different label than the observed label ỹ with confidence above a threshold θ, i.e., |f(x) − 1/2| &gt; θ, we flip the label ỹ to the prediction of f. We continue this process until we reach a stage where no label can be corrected. We choose the value of θ empirically.
      </p>
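      <p>The correct-and-retrain loop described above can be sketched as follows. This is a simplified, illustrative implementation, not the paper’s exact configuration: it retrains a fresh model each round, uses scikit-learn’s GradientBoostingClassifier as a stand-in for XGBoost, and the threshold, round limit, and model settings are assumptions.</p>
      <preformat>
```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def progressive_label_correction(X, y_noisy, theta=0.35, max_rounds=10):
    """Sketch of PLC: iteratively flip labels that the current classifier
    contradicts with high confidence (|f(x) - 1/2| > theta), then retrain."""
    y = y_noisy.copy()
    clf = None
    for _ in range(max_rounds):
        # Shallow boosted stumps stand in for the warm-up-trained XGBoost model.
        clf = GradientBoostingClassifier(n_estimators=30, max_depth=1)
        clf.fit(X, y)
        p1 = clf.predict_proba(X)[:, 1]        # estimated P(class 1 | x)
        pred = (p1 >= 0.5).astype(int)
        confident = np.abs(p1 - 0.5) > theta   # the PLC confidence test
        flip = np.logical_and(confident, pred != y)
        if not flip.any():                     # stop: no label can be corrected
            break
        y[flip] = pred[flip]                   # flip the observed label to f's prediction
    return clf, y
```
      </preformat>
      <p>In the paper’s setup, the warm-up corresponds to the first boosting iterations during which no label is touched; the stopping rule here is the same as described above – the loop ends once no label can be corrected.</p>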
    </sec>
    <sec id="sec-4">
      <title>4. Experimental Setup &amp; Results</title>
      <p>4.1. Dataset Description</p>
      <p>We have collected change data for 3 months, which comprises ∼27K data samples labelled either as “risky” (class 1), i.e., potentially leading to a failure in the production system, or as “normal” (class 0). Out of the ∼27K data samples, there are only 65 instances which belong to the class “risky”, i.e., the positive class.</p>
      <p>• Feature Description: Each instance in the data consists of 20 features; out of these, 2 features are continuous while the rest are categorical.
• Sparsity: There are many features that have missing values; some of the features even have almost 30% of their values missing.</p>
      <p>We create 3 separate datasets, one for each month, from the overall data and perform two experiments.</p>
      <p>[Figure: (a) Experiment 1. (b) Experiment 2.]</p>
      <p>4.2. Experimental Results</p>
      <p>
        We use a gradient-boosted decision tree (XGBoost) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]
to generate the probability with which a new change
request may cause failure in production. We consider this
probability as the estimation of the risk for a change. This
is our baseline model. Note that we had explored other
models as well; however, the XGBoost model produced
the best results as recorded in our prior work [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>
        Next, we use the PLC algorithm [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] to remove the label noise in the dataset, and then re-train the XGBoost model on the label-corrected dataset iteratively, as described in the previous section. Detailed comparisons between the baseline model (trained on the original data) and the model trained following the PLC method for Experiment 1 and Experiment 2 are shown in Table 1 and Table 2, respectively. Note that the metric balanced accuracy is useful when the classes are imbalanced and is defined as (sensitivity + specificity)/2; we believe that the rest of the metrics used in these tables are standard and need no definition. As can be seen from Table 1 and Table 2, the model trained with the PLC method outperformed the baseline across all the metrics. Figure 2 shows how the F1 score varies with iterations while training the model with the PLC method. Note that we used a warm-up period of 30 iterations, which is why a sharp jump is noticed once label correction is applied from the 31st iteration onward.
      </p>
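      <p>For reference, the balanced-accuracy metric used in the tables can be computed directly from its definition; a small sketch follows (the function name and inputs are our own, not from the paper, and both classes are assumed to be present):</p>
      <preformat>
```python
import numpy as np

def balanced_accuracy(y_true, y_pred):
    """(sensitivity + specificity) / 2 for binary labels in {0, 1}."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) * (y_pred == 1))
    fn = np.sum((y_true == 1) * (y_pred == 0))
    tn = np.sum((y_true == 0) * (y_pred == 0))
    fp = np.sum((y_true == 0) * (y_pred == 1))
    sensitivity = tp / (tp + fn)   # recall on the "risky" (positive) class
    specificity = tn / (tn + fp)   # recall on the "normal" (negative) class
    return (sensitivity + specificity) / 2
```
      </preformat>
      <p>For binary labels this matches scikit-learn’s balanced_accuracy_score, and, unlike plain accuracy, it is not dominated by the majority class – which matters here given only 65 positive samples out of ∼27K.</p>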
      <sec id="sec-4-1">
        <title>5. Conclusion &amp; Future Work</title>
        <p>In this paper, we have shown how we made the transition from a human-centred AI system to a near-autonomous AI system by employing the progressive label correction method in order to get rid of the inherent label noise in the data. We now seek labels for a handful of samples from the domain experts only when the model is extremely uncertain about their class. Experimental results exhibit significant improvement in the model’s performance with respect to various metrics.</p>
        <p>As part of the future work, we aim to build not just a more accurate model but a more accurate and trustworthy model, as earning the trust of the end users for the ML model is the key to success in driving business value with ML, especially in ‘operations’ in a large-scale organization. Therefore, we are in the process of building an enhanced label-noise removal method which is based on the intuition that, in noisy data, there exists a ‘data region’ in which the noisy classifier f produces highly confident and trustworthy predictions consistent with the clean ‘Bayes optimal classifier’. A standard approach to quantify a classifier’s trustworthiness is to use its own estimated confidence or score, such as probabilities from the softmax layer of a neural network, the distance to the separating hyper-plane in support vector classification, or the mean class probabilities of the trees in a random forest.</p>
      </sec>
      <sec id="sec-4-2">
        <p>However, the latest research shows that a higher confidence score from the model does not necessarily assure a higher probability of correctness of the classifier. Therefore, the fact that a classifier’s own confidence score may not be the best judge of its own trustworthiness makes our ongoing work all the more challenging, but interesting.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <article-title>[1] Digital.ai, Change risk prediction (</article-title>
          <year>2019</year>
          ). URL: https: //digital.ai
          <article-title>/change-risk-prediction.</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>B.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chatterjee</surname>
          </string-name>
          , S. Paul,
          <string-name>
            <given-names>H.</given-names>
            <surname>Matha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Parsai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Banerjee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Agneeswaran</surname>
          </string-name>
          ,
          <article-title>Look before you leap! designing a human-centered AI system for change risk assessment</article-title>
          ,
          <source>in: ICAART</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>655</fpage>
          -
          <lpage>662</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>Understanding instance-level label noise: Disparate impacts and treatments</article-title>
          , in: ICML, volume
          <volume>139</volume>
          <source>of Proceedings of Machine Learning Research, PMLR</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>6725</fpage>
          -
          <lpage>6735</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , S. Zheng,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Goswami</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>Learning with feature-dependent label noise: A progressive approach</article-title>
          , in: ICLR,
          <year>2021</year>
          . URL: https://openreview.net/forum?id=ZPa2SyGcbwh.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>T.</given-names>
            <surname>Chen</surname>
          </string-name>
          , C. Guestrin,
          <article-title>XGBoost: A scalable tree boosting system</article-title>
          ,
          <source>in: KDD</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>785</fpage>
          -
          <lpage>794</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>