<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Reliability in Recommendation Systems: Beyond point estimations to monitor population stability</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Yingshi Chen</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mohit Jain</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vaibhav Sawhney</string-name>
          <email>vsawhney@indeed.com</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Liyasi Wu</string-name>
          <email>lwu@indeed.com</email>
        </contrib>
        <aff>Indeed Inc. <email>yolandac@indeed.com</email></aff>
      </contrib-group>
      <abstract>
        <p>Ensuring reliable recommendations is essential for a company's success and user trust. Indeed has traditionally used point estimation to maintain consistent model predictions during model refinement and retraining. However, despite extensive research on robustness, there has been less focus on reliability and population monitoring. This study introduces the Cumulative Population Stability Index (CPSI), derived from the Population Stability Index (PSI), to monitor distribution stability. CPSI assesses the stability of a model's population and allows for targeted adjustments. Our implementation of CPSI proved effective in identifying significant instabilities during model transitions, demonstrating its versatility across various model types and calibration methods.</p>
      </abstract>
      <kwd-group>
        <kwd>Recommender Systems</kwd>
        <kwd>Model Stability</kwd>
        <kwd>Production Monitoring</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The #1 job site in the world, Indeed is committed to
offering job seekers high-quality opportunities through
advanced recommendation systems. To maintain the accuracy
of recommendations, we regularly retrain and enhance our
models to effectively accommodate shifts in the job market
and job seeker behavior. However, the process of retraining
might produce diverse outcomes, occasionally resulting in
unforeseen variations in scores, thereby compromising user
confidence and product excellence [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>
        Most research on recommender systems has generally
concentrated on accuracy, business metrics, and diversity,
often overlooking the crucial aspect of stability. Stability
measures how recommendations change with updates and their
consistency over time [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Traditional definitions
emphasize strict alignment with prior predictions [
        <xref ref-type="bibr" rid="ref1 ref3">1, 3</xref>
        ], whereas
more recent studies acknowledge the possibility of some
deviations to accommodate new information [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. We
define stability as the ability to provide reliable and
consistent recommendations while effectively adapting to changes
without significant interruptions.
      </p>
      <p>
        Point estimation methods, such as the mean or median
of prediction scores, are insufficient to assess the
stability of recommender systems [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. We utilize the Population
Stability Index (PSI), a risk modeling metric that measures
consistency between two probability distributions based on
the Kullback-Leibler divergence [
        <xref ref-type="bibr" rid="ref6 ref7 ref8 ref9">6, 7, 8, 9</xref>
        ].
      </p>
      <p>
        Limited research has been conducted to understand the properties of PSI. There is a general rule of thumb for interpreting
PSI values [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]: if PSI is less than 10%, there is no change
in the population; if PSI is between 10% and 25%, the
population has changed slightly, and investigation is needed;
and if PSI exceeds 25%, there are significant changes in the
population, and the models should be retrained [
        <xref ref-type="bibr" rid="ref6 ref7 ref10">6, 7, 10</xref>
        ]. Later research has discussed the arbitrary nature of this general 'rule of thumb' and explored the statistical properties of PSI [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
      <p>
        This paper presents the Cumulative Population Stability
Index (CPSI), an improved version of PSI. CPSI efficiently identifies alterations in distribution patterns while remaining robust to noise. We demonstrate the effectiveness of
CPSI through simulations and real-life examples.
      </p>
      <p>RecSys in HR'24: The 4th Workshop on Recommender Systems for Human
Resources, in conjunction with the 18th ACM Conference on Recommender
Systems. CEUR Workshop Proceedings, ISSN 1613-0073.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <sec id="sec-3-1">
        <title>2.1. Measurement of Stability: PSI</title>
        <p>PSI is a metric for assessing population stability between two samples. It classifies scores into predefined bins or categories, evaluating the difference between a given probability distribution and a reference distribution.</p>
        <p>Let N be the sample size for the reference population and
M be the sample size for the target population, each being
divided into B bins. Then PSI can be defined as:</p>
        <p>PSI = ∑_{i=1}^{B} (p̂_i − q̂_i) × (ln p̂_i − ln q̂_i)   (1)</p>
        <p>where X_i and Y_i are the counts in the i-th bin, ∑ X_i = N,
∑ Y_i = M, p̂_i = X_i / N, and q̂_i = Y_i / M. ln denotes the natural logarithm.</p>
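        <p>As a concrete illustration, Equation (1) can be computed as follows. This is a minimal sketch: the function name, the quantile-based binning, and the epsilon smoothing for empty bins are our choices, not prescribed by the paper.</p>
        <preformat>
```python
import numpy as np

def psi(ref_scores, tgt_scores, bins=10):
    """Population Stability Index (Equation 1) between two samples.

    Bin edges are placed at quantiles of the reference sample, and a tiny
    epsilon guards against empty bins; both are illustrative choices."""
    edges = np.quantile(ref_scores, np.linspace(0.0, 1.0, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf  # cover the whole real line
    x, _ = np.histogram(ref_scores, bins=edges)  # counts X_i, summing to N
    y, _ = np.histogram(tgt_scores, bins=edges)  # counts Y_i, summing to M
    eps = 1e-12
    p = np.maximum(x / x.sum(), eps)  # p_hat_i = X_i / N
    q = np.maximum(y / y.sum(), eps)  # q_hat_i = Y_i / M
    return float(np.sum((p - q) * (np.log(p) - np.log(q))))
```
        </preformat>
        <p>Two samples drawn from the same distribution yield a PSI near zero, while a shifted sample yields a larger value.</p>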
      </sec>
      <sec id="sec-3-2">
        <title>2.2. Limitations of PSI</title>
        <p>PSI focuses on local bin proportions, ignoring cumulative distribution patterns, which can result in false positives during cumulative score shifts. Additionally, fixed bin boundaries based on percentiles may not accurately capture skewed distributions.</p>
        <sec id="sec-3-2-1">
          <title>2.2.1. Local Comparisons</title>
          <p>
            PSI focuses on local comparisons of bin proportions and does not account for cumulative or global distribution patterns.
According to Figure 1, the PSI value is 28.4%. Based on the
general rule of thumb, a PSI exceeding 25% strongly
suggests that the model needs to be recalibrated. However, the
Kolmogorov-Smirnov (KS) goodness-of-fit test [
            <xref ref-type="bibr" rid="ref11">11</xref>
            ] yielded
a p-value of 0.173, indicating that there is no significant drift
in the predictions. Additionally, the global comparison
plots demonstrate that the cumulative distribution
functions (CDFs) of the two distributions remain closely
aligned.
          </p>
        </sec>
        <sec id="sec-3-2-2">
          <title>2.2.2. Fixed Bin Boundaries</title>
          <p>Fixed bin boundaries are typically placed at percentiles
of the anticipated distribution. Although this method can
effectively partition the data into equal-sized groups, it may
not accurately capture the actual distribution's structure,
especially for distributions that are skewed or have several
modes.</p>
          <p>The fixed bin boundaries intersect many modes of the
distribution, which may result in an inaccurate
representation of the discrepancies. This can result in bins containing
many peaks or valleys, leading to less accurate stability
assessments.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>3. Proposed Method</title>
      <p>In this section, we propose the Cumulative Population
Stability Index (CPSI) to provide a comprehensive view of
distribution changes. CPSI provides a detailed assessment of
distributional changes by computing localized cumulative
sums, allowing tailored analysis and evaluation within
sliding windows of bins.</p>
      <sec id="sec-4-0">
        <title>3.1. Definition</title>
        <p>The CPSI is defined as:</p>
        <p>CPSI = ∑_{i=1}^{B} (p̃_{i−K,i+K} − q̃_{i−K,i+K}) × (ln p̃_{i−K,i+K} − ln q̃_{i−K,i+K})   (2)</p>
        <p>where p̃_{i−K,i+K} = ∑_{j=max(1,i−K)}^{min(B,i+K)} p_j, q̃_{i−K,i+K} is defined analogously, and:</p>
        <p>• p_i and q_i represent the proportions in the reference
and current (prediction) distributions, respectively,
for bin i.</p>
        <p>• B is the total number of bins.</p>
        <p>• K is the number of bins included in the cumulative
sum on either side of bin i.</p>
        <p>• N and M are the sample sizes of the reference and
current (prediction) distributions, respectively.</p>
        <p>• ln denotes the natural logarithm.</p>
        <p>CPSI can be viewed as a variation of PSI, with the windowed proportions p̃_{i−K,i+K} and q̃_{i−K,i+K} substituted for the bin proportions p̂_i = X_i/N and q̂_i = Y_i/M in Equation (1).</p>
      </sec>
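      <p>The windowed computation in Equation (2) can be sketched as follows. This is a minimal illustration: the function name, the equal-width binning, and the assumption that scores lie in [0, 1] are ours.</p>
      <preformat>
```python
import numpy as np

def cpsi(ref_scores, tgt_scores, bins=1000, k=1):
    """Cumulative Population Stability Index with a window of k bins on
    either side of each bin i (Equation 2). Assumes scores in [0, 1]."""
    edges = np.linspace(0.0, 1.0, bins + 1)
    p, _ = np.histogram(ref_scores, bins=edges)
    q, _ = np.histogram(tgt_scores, bins=edges)
    p = p / p.sum()  # bin proportions p_i
    q = q / q.sum()  # bin proportions q_i
    eps = 1e-12
    total = 0.0
    for i in range(bins):
        lo = max(0, i - k)         # window start (max(1, i-K) in 1-based terms)
        hi = min(bins, i + k + 1)  # window end (min(B, i+K) in 1-based terms)
        pw = max(p[lo:hi].sum(), eps)  # windowed proportion for the reference
        qw = max(q[lo:hi].sum(), eps)  # windowed proportion for the target
        total += (pw - qw) * (np.log(pw) - np.log(qw))
    return float(total)
```
      </preformat>
      <p>With k = 0 the window collapses to a single bin and the computation reduces to PSI.</p>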
      <sec id="sec-4-1">
        <title>3.2. Statistical Properties of CPSI</title>
        <sec id="sec-4-1-1">
          <title>3.2.1. Expectation of CPSI</title>
          <p>As proved by Yurdakul and Naranjo [10], the expectation of PSI is:</p>
          <p>E(PSI) = ∑_{i=1}^{B} (p_i − q_i)(ln p_i − ln q_i) + (1/N + 1/M)(B − 1)   (3)</p>
          <p>Since p̂_i in PSI corresponds to p̃_{i−K,i+K} in CPSI, and q̂_i corresponds to q̃_{i−K,i+K}, the expected value E(CPSI) can be expressed analogously by substituting p̃_{i−K,i+K} and q̃_{i−K,i+K} for p̂_i and q̂_i:</p>
          <p>E(CPSI) = ∑_{i=1}^{B} (p̃_{i−K,i+K} − q̃_{i−K,i+K})(ln p̃_{i−K,i+K} − ln q̃_{i−K,i+K}) + (1/N + 1/M)(B − 1)</p>
          <p>Under the null hypothesis H0: p_i = q_i, i = 1, …, B, the divergence term vanishes, leaving:</p>
          <p>E(CPSI) = (1/N + 1/M)(B − 1)   (4)</p>
        </sec>
        <sec id="sec-4-1-2">
          <title>3.2.2. Theorem and Variance Calculation</title>
          <p>Yurdakul and Naranjo [10] use the following theorem to
identify the variance of PSI.</p>
          <p>Theorem 1. Let E(x) = µ and Cov(x) = Σ. Then:</p>
          <p>Var(x′Ax) = 2 Tr(AΣAΣ) + 4 µ′AΣAµ.</p>
          <p>
            The proof of this theorem can be found in
Searle [
            <xref ref-type="bibr" rid="ref12">12</xref>
            ]. Given that µ = E(x) = 0, Theorem 1 implies:
          </p>
          <p>Var(x′Ax) = 2 Tr(AΣAΣ).</p>
          <p>For PSI, assuming that the null hypothesis H0: p_i = q_i, i = 1, …, B is true, they prove that µ = 0 and compute Tr(AΣAΣ), finally leading to Var(PSI) = 2 (1/N + 1/M)² (B − 1). Analogously, for CPSI we can say that µ = 0 and, with the windowed proportions p̃_{i−K,i+K}, i = 1, …, B:</p>
          <p>Tr(AΣ̃AΣ̃) = (1/N + 1/M)² × ∑_{i=1}^{B} (1 − p̃_{i−K,i+K})</p>
        </sec>
        <sec id="sec-4-1-3">
          <title>3.2.3. Robustness and Invariance of CPSI</title>
          <p>The distribution of CPSI, controlled by B, N, and M
(the total number of bins, and the sample sizes of the reference and target
populations, respectively), is unaffected by the underlying
variable distributions, ensuring it remains a reliable and
robust measure of divergence between model predictions.</p>
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>3.3. Parameter Selection</title>
        <sec id="sec-4-2-1">
          <title>3.3.1. Determining Sample Sizes N and M</title>
          <p>The robustness and dependability of the Cumulative Population Stability Index (CPSI) depend critically on the sample
sizes of the target population M and the reference
population N. The sensitivity and stability of the index can be
greatly affected by the choice of N and M.</p>
          <p>To make sure that N and M are big enough to find
significant differences, we performed a power analysis.
Furthermore, the selection of these parameters can be guided
by preliminary exploratory data analysis, and model
performance can be optimized through iterative refinement. A
final decision on N and M should take practical limits,
stability, and sensitivity into account.</p>
        </sec>
        <sec id="sec-4-2-2">
          <title>3.3.2. Determining the Optimal Number of Bins</title>
          <p>Determining the number of bins is crucial to accurately capture the distributional characteristics of the data. An optimal
value for the number of bins must balance both bias and
variance. Several studies have attempted to
determine an optimal number of bins, each offering different
advantages based on the size and distribution of the data.</p>
          <p>Square-Root Choice: B = ⌈√n⌉</p>
          <p>The square-root choice recommends using the square root of the number of data points to determine the number of bins, providing a balanced approach for moderate-sized datasets [13].</p>
          <p>Sturges' Formula: B = ⌈log2 n + 1⌉</p>
          <p>
            Sturges' formula, commonly used for smaller datasets, assumes that the data follow an approximately normal
distribution. The objective is to ascertain an appropriate number of
bins that properly reflects the distribution of the data points,
while avoiding unnecessary complexity in the model [
            <xref ref-type="bibr" rid="ref14">14</xref>
            ].
          </p>
          <p>Rice Rule: B = ⌈2 × n^{1/3}⌉</p>
          <p>The Rice Rule proposes determining the appropriate number of bins by achieving a balance between granularity and simplicity. This method is particularly effective for larger datasets [15, 16].</p>
          <p>Each of these methods provides a pragmatic way to select
bins, depending on the specific attributes of the dataset and
the objectives of the research. We considered several of these
approaches to obtain candidate values for the number of bins.
We then analyzed the trade-off between bias
and variance as a function of the parameter B, with N and
M kept constant. The parameter B was selected to minimize
the overall error.</p>
        </sec>
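        <p>For reference, the three binning rules can be computed as follows (a small sketch; the function names are ours):</p>
        <preformat>
```python
import math

def bins_sqrt(n):
    """Square-root choice: B = ceil(sqrt(n))."""
    return math.ceil(math.sqrt(n))

def bins_sturges(n):
    """Sturges' formula: B = ceil(log2(n) + 1)."""
    return math.ceil(math.log2(n) + 1)

def bins_rice(n):
    """Rice rule: B = ceil(2 * n^(1/3))."""
    return math.ceil(2.0 * n ** (1.0 / 3.0))
```
        </preformat>
        <p>For n = 10,000 data points these give 100, 15, and 44 bins, respectively, illustrating how strongly the choice of rule affects granularity.</p>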
        <sec id="sec-4-2-3">
          <title>3.3.3. Determining the Number of Shift Bins K</title>
          <p>A key factor in balancing the sensitivity and robustness
of the Cumulative Population Stability Index (CPSI) is the
selection of K, which regulates the number of bins included
in the cumulative sum on either side of a bin i.</p>
          <p>We determined the optimal value of K using a combination of domain-specific knowledge and exploratory data
analysis. We deliberately chose to maximize the
noticeable variability in CPSI values during calibration while
minimizing the penalty for small shifts between adjoining bins.
This approach aligns with our objective of reducing penalization for modest distribution shifts caused by calibration, so that predictions remain reasonably close to the true likelihood.</p>
        </sec>
      </sec>
      <sec id="sec-4-3">
        <title>3.4. Rule of Thumb</title>
        <p>By comparing subsequent model version distributions
using the Cumulative Population Stability Index (CPSI), we
can quantify changes and establish a benchmark for
stability between iterations. To efectively apply CPSI, we
require two population samples: the base sample, which
represents the score distribution from a previous model
version, and the test sample, representing the predicted score
distribution from the current model version. We propose
the following 'rule of thumb' values, derived from
empirical data using the 90th and 99th percentiles
(details in Appendix 8.1), to monitor system stability across
model retraining versions by assessing the CPSI measure
over different historical time frames.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4. Results: Evaluating the Effectiveness of CPSI</title>
      <p>We conducted a simulation study to assess CPSI performance using the normal approximation for critical values.
Based on the statistical properties of CPSI, we can construct the following test:</p>
      <p>CPSI &gt; (1/N + 1/M)(B − 1) + z_{0.95} (1/N + 1/M) × √(2(B − 1))</p>
      <p>where the right-hand side (RHS) is the critical value,
defined as the 95th percentile of the CPSI normal
approximation.</p>
      <p>We created a right-skewed baseline using a Beta
distribution and introduced small shifts and noise to simulate
real-world conditions. PSI and CPSI values were calculated
for the expected and new distributions. We sampled 10,000
values from the baseline and challenger distributions,
conducting 30 simulations. The CPSI results were computed
with the number of bins (B) set to 1,000 and K set to 1. We
compared these results with the critical value of 0.0215, as
suggested by the normal approximation.</p>
      <p>The results table (Table 1), along with the plot (Fig. 3),
shows that PSI identified small shifts as unstable,
indicating a high sensitivity to local changes. In contrast, CPSI
smooths out local variations and focuses on cumulative
proportions, proving robust against noise while effectively
detecting distribution shifts. Score transitions
in the test model were markedly smoother. Although point
estimates alone are insufficient to confirm stability, they do
offer some directional confidence.</p>
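      <p>The stability test above can be sketched as follows (a minimal illustration; the helper names are ours, and z_{0.95} ≈ 1.645 is the standard normal 95th percentile):</p>
      <preformat>
```python
import math

def cpsi_critical_value(n, m, b):
    """95th-percentile critical value of CPSI under the normal approximation:
    (1/N + 1/M)(B - 1) + z_0.95 * (1/N + 1/M) * sqrt(2(B - 1))."""
    z95 = 1.6448536269514722  # standard normal 95th percentile
    scale = 1.0 / n + 1.0 / m
    return scale * (b - 1) + z95 * scale * math.sqrt(2.0 * (b - 1))

def is_unstable(cpsi_value, n, m, b):
    """Flag instability when the observed CPSI exceeds the critical value."""
    return cpsi_value > cpsi_critical_value(n, m, b)
```
      </preformat>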
      <p>
        This experiment provides a valuable case study for evaluating the effectiveness of the Cumulative Population
Stability Index (CPSI) and comparing it with other population
stability metrics. These metrics include the Population
Stability Index (PSI), the recently introduced Population
Accuracy Index (PAI) [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ], and the Kolmogorov-Smirnov (KS)
goodness-of-fit test [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
      <p>The Population Accuracy Index (PAI) offers an alternative approach to scorecard stability testing by measuring
the change in the variance of the estimated mean response
since development. PAI interpretation: 0 ≤ PAI &lt; 1.1
indicates no substantial change, 1.1 ≤ PAI &lt; 1.5 suggests a
small change, and PAI ≥ 1.5 indicates a substantial change.</p>
      <p>The Kolmogorov-Smirnov (KS) test statistic quantifies the maximum difference between two empirical distribution functions, providing a measure of the discrepancy between observed and expected distributions.</p>
      <p>
        The KS test is overly sensitive to small changes when the
sample size is large, often labeling any model change as
instability [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]. Even minor distributional changes can lead
to the rejection of population stability at nominal
significance levels, potentially misrepresenting true instability.
Similarly, PSI is prone to detecting small local changes, frequently marking all movements as unstable.
      </p>
      <p>However, the experimental observations are at odds with
the PAI results, which suggest little change between model
transitions in both the test and control groups.</p>
      <p>This study demonstrates how CPSI outperforms other well-established techniques in determining population stability, owing to its resilience and comprehensiveness.</p>
    </sec>
    <sec id="sec-7">
      <title>6. Monitoring System Implementation with CPSI</title>
      <p>In this section, we present the implementation of an online
recommender monitoring system that incorporates CPSI
metrics.</p>
      <p>We designed a testing infrastructure to leverage the
requests coming to the incumbent model in production to
test against a challenger model that was trained
using a different set of data. The 'incumbent' model refers
to the machine learning model that is currently deployed
in production and actively handling real-world requests or
tasks. It is the established model that new or 'challenger'
models are compared against to determine whether an upgrade
or replacement is warranted. The architecture of this
infrastructure contains two modules: one to poll for a newly
trained challenger model, called the Model Score Verification
Initiator, and another to test it against the incumbent model, called the Model Score Verifier. Testing proceeds step by step as follows:</p>
      <p>• Gathering data: Collect a set of sampled requests
from the past 14 days. These requests should include
either a list of multiple jobs being matched to one job
seeker or a list of multiple job seekers being matched
to one job. They were originally inferred using the
model being tested.
• Preparing testing infrastructure: Set up the
necessary testing environment, including the Model Score
Verification Initiator, databases, and any required
software or tools.
• Triggering the test: Initiate the test by triggering an
instance of the Model Score Verifier for the models
being tested.
• Loading the right models: The Model Score Verifier
ensures that the correct models are loaded and
active in the application responsible for inferring the
requests.
• Sending and inferring requests: Forward the
gathered requests to the loaded models for
inference.
• Logging responses: Record the responses
generated by the models for later analysis, each containing
one score attached to multiple unique job-job seeker
pairs, respectively, that were part of the request.
• Deciding to promote or drop: On gathering 100k
unique pairs with their relevance scores, we calculate
CPSI. The process of gathering 100k unique pairs
with their scores is repeated 30 times, and a mean
CPSI score is calculated. We evaluate the results to decide
whether the tested model should be promoted to
production or discarded using the mean CPSI score.
Using algorithms such as jackknife resampling, we can
find the standard error of CPSI across the 30
calculations. This can be used to compute a
confidence interval for CPSI. Here, the number 30 is chosen to
ensure a statistically sound conclusion.
• Triggering alerts: If the test identifies any critical
issues or anomalies, automatically trigger alerts to
notify the relevant stakeholders.</p>
      <p>Figure 5 provides a simplified overview of the monitoring
system. This system enables proactive monitoring,
investigation, and improvement of recommendation system
performance in production. By integrating this system into our
workflow, we can detect major issues early without
depending on manual monitoring, preventing negative impacts on
customer trust and reducing churn.</p>
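      <p>The promote-or-drop step can be sketched as follows. This is our own illustration: cpsi_samples stands for the 30 repeated CPSI measurements, and the 1.96 normal quantile for a 95% confidence interval is our choice.</p>
      <preformat>
```python
import math

def jackknife_se(values):
    """Jackknife standard error of the mean of repeated CPSI measurements."""
    n = len(values)
    total = sum(values)
    loo = [(total - v) / (n - 1) for v in values]  # leave-one-out means
    loo_mean = sum(loo) / n
    var = (n - 1) / n * sum((x - loo_mean) ** 2 for x in loo)
    return math.sqrt(var)

def promote(cpsi_samples, critical_value):
    """Promote the challenger only if the upper bound of a 95% confidence
    interval for the mean CPSI stays within the critical value."""
    mean = sum(cpsi_samples) / len(cpsi_samples)
    upper = mean + 1.96 * jackknife_se(cpsi_samples)
    return critical_value >= upper
```
      </preformat>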
    </sec>
    <sec id="sec-9">
      <title>7. Conclusion and Future Work</title>
      <p>In this paper, we introduced the Cumulative Population
Stability Index (CPSI) as a tool for monitoring large-scale
recommender systems. We demonstrated CPSI's effectiveness in
detecting significant instabilities during model transitions
and its robustness against prediction variations through
simulations, real-world implementations, and monitoring
systems. CPSI has proven to be a reliable metric for
evaluating the stability of recommender systems, both offline and
online, especially for DNN-based recommendation systems.
We believe CPSI has potential applications in other domains
as well.</p>
      <p>In the future, we aim to extend the application of the
proposed stability monitoring methods to a broader range
of scenarios, including various model types such as
reinforcement learning models. In addition, we plan to evaluate
the effectiveness of these methods in different domains,
explore their scalability in large-scale systems, and assess their
adaptability to real-time monitoring environments.</p>
    </sec>
    <sec id="sec-10">
      <title>8. Appendix</title>
      <p>
        There are two approaches for determining the critical
values for CPSI. The first and most straightforward approach
involves utilizing the normal approximation. Instead of
relying on predetermined critical values, it is more
advantageous to utilize the theoretical percentiles of the normal
approximation distribution [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. As seen in the preceding
section, the distribution of CPSI is affected by the
parameters B, N, and M. Using the normal approximation, we can
determine the desired percentiles to establish the critical
values.
      </p>
      <p>The second method involves using the empirical
distribution of CPSI values collected during production. This
method does not depend on hypothetical estimations, but
rather utilizes actual distribution data. Through the
implementation of offline simulations using historical prediction
data, we can collect results to determine the critical values
of CPSI. We found that using the empirical distribution of
CPSI values to set critical values is more promising. The
critical value should reflect the system's tolerance for
instability. We have identified inherent instability resulting
from variability in training neural network models with
medium-sized datasets. Solely depending on the normal
approximation may lead to false alerts, as it might misinterpret
natural score fluctuations as significant deviations. Hence,
using empirical critical values from actual system
performance data provides a more accurate and reliable stability
assessment.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Adomavicius</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <year>2010</year>
          .
          <article-title>On the stability of recommendation algorithms</article-title>
          .
          <source>In Proceedings of the Fourth ACM Conference on Recommender Systems</source>
          , RecSys '
          <volume>10</volume>
          ,
          <fpage>47</fpage>
          -
          <lpage>54</lpage>
          . New York, NY, USA: ACM.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>O'Mahony</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hurley</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kushmerick</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Silvestre</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          <year>2004</year>
          .
          <article-title>Collaborative recommendation: A robustness analysis</article-title>
          .
          <source>ACM Trans. Internet Technol</source>
          .
          <volume>4</volume>
          (
          <issue>4</issue>
          ):
          <fpage>344</fpage>
          -
          <lpage>377</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Adomavicius</surname>
          </string-name>
          , Gediminas and Zhang, Jingjing.
          <year>2012</year>
          .
          <article-title>Stability of recommendation algorithms</article-title>
          .
          <source>ACM Transactions on Information Systems 30</source>
          ,
          <issue>4</issue>
          (Nov.
          <year>2012</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Shriver</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Elbaum</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dwyer</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Rosenblum</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <year>2019</year>
          .
          <article-title>Evaluating Recommender System Stability with Influence-Guided Fuzzing</article-title>
          .
          <source>Proceedings of the AAAI Conference on Artificial Intelligence</source>
          <volume>33</volume>
          :
          <fpage>4934</fpage>
          -
          <lpage>4942</lpage>
          . https://doi.org/10.1609/aaai.v33i01.33014934.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Ekstrand</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Carterette</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Diaz</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <year>2023</year>
          .
          <article-title>Distributionally-Informed Recommender System Evaluation</article-title>
          .
          <source>ACM Transactions on Recommender Systems</source>
          <volume>2</volume>
          ,
          <issue>1</issue>
          :
          <fpage>1</fpage>
          -
          <lpage>27</lpage>
          . Online publication date: 31-Mar-2024.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Thomas</surname>
            ,
            <given-names>L. C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Edelman</surname>
            ,
            <given-names>D. B.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Crook</surname>
            ,
            <given-names>J. N.</given-names>
          </string-name>
          <year>2002</year>
          .
          <article-title>Credit Scoring and its Applications</article-title>
          . SIAM Monographs on Mathematical Modeling and Computation. Philadelphia: SIAM.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Siddiqi</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <year>2017</year>
          .
          <article-title>Intelligent Credit Scoring: Building and Implementing Better Credit Risk Scorecards</article-title>
          . John Wiley &amp; Sons.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Lewis</surname>
            ,
            <given-names>E. M.</given-names>
          </string-name>
          <year>1994</year>
          .
          <article-title>Introduction to Credit Scoring</article-title>
          . The Athena Press.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Kullback</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Leibler</surname>
            ,
            <given-names>R. A.</given-names>
          </string-name>
          <year>1951</year>
          .
          <article-title>On Information and Sufficiency</article-title>
          .
          <source>The Annals of Mathematical Statistics</source>
          <volume>22</volume>
          ,
          <issue>1</issue>
          :
          <fpage>79</fpage>
          -
          <lpage>86</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Yurdakul</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Naranjo</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <year>2020</year>
          .
          <article-title>Statistical Properties of the Population Stability Index</article-title>
          .
          <source>Journal of Risk Model Validation</source>
          <volume>14</volume>
          ,
          <issue>3</issue>
          :
          <fpage>89</fpage>
          -
          <lpage>100</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Kolmogoroff</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <year>1933</year>
          .
          <article-title>Sulla determinazione empirica di una legge di distribuzione</article-title>
          .
          <source>G. Ist. Ital. Attuari</source>
          ,
          <volume>4</volume>
          :
          <fpage>83</fpage>
          -
          <lpage>91</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Searle</surname>
            ,
            <given-names>S.R.</given-names>
          </string-name>
          <year>1971</year>
          .
          <source>Linear Models</source>
          . Wiley, New York. 560 pages.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Lohaka</surname>
            ,
            <given-names>H.O.</given-names>
          </string-name>
          <year>2007</year>
          .
          <article-title>Making a grouped-data frequency table: development and examination of the iteration algorithm</article-title>
          .
          <source>Doctoral dissertation</source>
          , Ohio University. p.
          <fpage>87</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Sturges</surname>
            ,
            <given-names>H.A.</given-names>
          </string-name>
          <year>1926</year>
          .
          <article-title>The choice of a class interval</article-title>
          .
          <source>Journal of the American Statistical Association</source>
          ,
          <volume>21</volume>
          (
          <issue>153</issue>
          ):
          <fpage>65</fpage>
          -
          <lpage>66</lpage>
          . doi:10.1080/01621459.1926.10502161.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>Lane</surname>
            ,
            <given-names>D.M.</given-names>
          </string-name>
          <year>2015</year>
          .
          <article-title>Guidelines for Making Graphs Easy to Perceive, Easy to Understand, and Information Rich</article-title>
          . In:
          <string-name>
            <surname>McCrudden</surname>
            ,
            <given-names>M.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schraw</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Buckendahl</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          (Eds.),
          <source>Use of Visual Displays in Research and Testing: Coding, Interpreting, and Reporting Data</source>
          . Information Age Publishing, Charlotte, pp.
          <fpage>47</fpage>
          -
          <lpage>81</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <surname>Lane</surname>
            ,
            <given-names>D.M.</given-names>
          </string-name>
          <year>2015</year>
          .
          <article-title>Histograms</article-title>
          . Rice University, Houston.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shivanna</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cheng</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jain</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hong</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Chi</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          <year>2021</year>
          .
          <article-title>DCN V2: Improved Deep &amp; Cross Network and Practical Lessons for Web-Scale Learning to Rank Systems</article-title>
          .
          <source>In Proceedings of the Web Conference 2021</source>
          ,
          <fpage>1785</fpage>
          -
          <lpage>1797</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cheng</surname>
            ,
            <given-names>D. Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hong</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chi</surname>
            ,
            <given-names>E. H.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Cui</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <year>2021</year>
          .
          <article-title>Beyond Point Estimate: Inferring Ensemble Prediction Variation from Neuron Activation Strength in Recommender Systems</article-title>
          .
          <source>In Proceedings of the 14th ACM International Conference on Web Search and Data Mining (WSDM '21)</source>
          ,
          <fpage>76</fpage>
          -
          <lpage>84</lpage>
          . Association for Computing Machinery, New York, NY, USA. https://doi.org/10.1145/3437963.3441770.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <surname>Taplin</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Hunt</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <year>2019</year>
          .
          <article-title>The Population Accuracy Index: A New Measure of Population Stability for Model Monitoring</article-title>
          .
          <source>Risks</source>
          ,
          <volume>7</volume>
          (
          <issue>2</issue>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <surname>du Pisanie</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <year>2023</year>
          .
          <article-title>A Critical Review of Existing and New Population Stability Testing Procedures in Credit Risk Scoring</article-title>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>