<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
<article-title>Optimal bin number for the histogram binning method to calibrate binary probabilities</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Tetyana Honcharenko</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Olga Solovei</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Kyiv National University of Construction and Architecture</institution>
          ,
          <addr-line>Povitroflots'kyi Ave, 31, Kyiv, 03037</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Receiver operating characteristic (ROC) curve analysis is an important instrument for selecting the best classification algorithm for a given dataset by comparing the area under the ROC curve. It is applied in medicine, finance and e-commerce, information retrieval, and quality control. However, the accuracy of the area under the ROC curve relies on imposing a threshold of 0.5 on predicted probabilities; when predicted probabilities are computed with a different logic, the corresponding area under the ROC curve is affected and the results of ROC curve analysis are misleading. To guarantee the accuracy of the area under the ROC curve, predicted probabilities must be calibrated. The subject matter of this article is the “fixed-width binning” method, which is used to calibrate binary predicted probabilities of the machine learning algorithms Naive Bayes Classifier and Random Forest Classifier. The focus is put on the “fixed-width binning” method, whose algorithm is based on a constant number of bins. The goal of the work is to increase calibration scores by proposing a method to select the bin number depending on simple statistics of the uncalibrated binary predicted probabilities. To meet this goal, the research evaluated the feasibility of two different approaches to identifying the optimal number of bins: a “rule-based” approach and an “estimators-based” approach. The conducted experiments identified that the often-used 10 bins for the “fixed-width binning” method is not optimal. Our proposal is to identify the bin number dynamically according to the “estimators-based” approach, whose algorithm is described in the paper.</p>
        <p>Keywords: histogram binning, bin number, Brier score, expected calibration error, calibration curve.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>ROC graphs in machine learning are used to select the best classification algorithm for a given dataset
by comparing the area under the ROC curve and to model classifier predictions depending on the chosen
false positive rate [1]. ROC curve analysis is applied in medicine to evaluate the performance of diagnostic
tests, since it supports decision-making about test accuracy; in finance and e-commerce, to assess the
effectiveness of fraud detection systems; and in information retrieval systems, to optimize the trade-off
between relevant and non-relevant results. ROC graphs measure the ability of a classifier to produce relative
instance scores: numeric values that represent the degree to which an instance is a member of a class. Work
[2] states that “a classifier need not produce accurate, calibrated probability estimates; it needs only produce
relative accurate scores that serve to discriminate positive and negative instances.” However, the same work
underlines that the accuracy of the area under the ROC curve relies on imposing a threshold of 0.5, so this
metric is not appropriate when a classifier does not produce calibrated scores. The 0.5 threshold is directly
applicable only to probabilities predicted by a Logistic Regression Classifier; other commonly used binary
classification algorithms compute probabilities differently. A Random Forest Classifier (RFC) predicts
probabilities (further “scores”) as fractions of samples in a given class within the set of decision trees in the
forest; a Naive Bayes Classifier (GaussianNB) computes the probability that a data point belongs to a
particular class based on a Gaussian distribution of the features; a Support Vector Machine Classifier (SVC)
does not provide probabilities at all and instead produces, for each data point, its distance to the separating
hyperplane. Therefore, it is recommended to calibrate the scores predicted by these learners before executing
ROC curve analysis with the goal of selecting the best binary classifier for the given dataset.</p>
      <p>2020 Copyright for this paper by its authors. CEUR Workshop Proceedings (ceur-ws.org).</p>
      <p>The benchmarking method for calibrating binary scores is the “fixed-width binning” method proposed
for the RFC and GaussianNB algorithms in research [3]. The method partitions the interval [0,1] into bins,
and the number of bins is recommended to be 10. Its computational simplicity and the ability to measure a
calibration error have made “fixed-width binning” widely used [4-6].</p>
      <p>However, study [7] concludes that a binning method is effective only with a properly selected bin
width: depending on dataset characteristics, predicted scores are distributed differently across bins, and too
small or too large a number of bins can make the calibrated scores either “over-detailed” or
“over-smoothed”.</p>
      <p>The objectives of the current research are to empirically study whether simple statistics of the
uncalibrated predicted binary scores can be used to choose the optimal number of bins for the “fixed-width
binning” method, and to propose a solution approach. To achieve these objectives, we consider the simple
statistics of the predicted scores (the range, the standard deviation and the interquartile range) and the bin
size estimators of the David W. Scott rule and the Freedman-Diaconis rule, whose results are directly
proportional to the standard deviation and the interquartile range correspondingly.</p>
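For illustration, the simple statistics named above can be computed directly. A minimal Python sketch (the helper name `score_statistics` and the linear quantile interpolation are our assumptions, not part of the paper):

```python
import statistics

def score_statistics(scores):
    """Range, standard deviation and interquartile range (IQR) of
    uncalibrated predicted binary scores."""
    s = sorted(scores)
    n = len(s)

    def quantile(q):
        # Linear interpolation between closest ranks (numpy's default scheme).
        pos = q * (n - 1)
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        return s[lo] + (pos - lo) * (s[hi] - s[lo])

    return {
        "range": s[-1] - s[0],
        "std": statistics.pstdev(s),             # population standard deviation
        "iqr": quantile(0.75) - quantile(0.25),  # interquartile range
    }

stats = score_statistics([0.1, 0.2, 0.25, 0.3, 0.35, 0.6])
```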
    </sec>
    <sec id="sec-2">
      <title>2. Study Research</title>
    </sec>
    <sec id="sec-3">
      <title>2.1. Related literature review</title>
      <p>To improve “fixed-width binning” method results in work [8] was proposed a scaling binning
method – the algorithm divides data into two subsets - the 1st subset is calibrated by the other continuous
calibration method such as “Platt calibration”; the 2nd subset is used to choose the bins so that an equal
number of points are landed in each bin. The scaling binning method addresses two issues:
 Reduces a calibration error.
 Calculates bin width which is adopted to already calibrated scores.</p>
      <p>However, “Platt calibration” method has a few problems [9]:
 It is the most efficient when distortion of predicted scores is sigmoid-shaped.
 It is computationally intensive as is solving a convex optimization problem to find sigmoid
function parameters.</p>
      <p>In research [10] to identify bin size to construct a histogram for dataset’s distribution is proposed to
use Freedman-Diaconis rule. However, in statistic theory depending on actual data distribution it is
recommended for the identification of bin size and bin numbers the estimators: David W. Scott,
Freedman-Diaconis, Sturges rule, Doane formula, Rice Rule and others. For this reason, applying only
Freedman-Diaconis rule may not result that the optimal bin size is found.</p>
      <p>The related works [11-12] do not recommend a logic for defining the number of bins for the
“fixed-width binning” method to calibrate predicted probabilities, so the problem is relevant to study.</p>
    </sec>
    <sec id="sec-4">
      <title>2.2. Methodology</title>
      <p>To achieve the study’s goals, we evaluate the feasibility of using two different approaches for the
identification of the optimal number of bins: 1) “rule-based”; 2) “estimators-based”.</p>
      <p>The “rule-based” approach suggests having a set of rules which, depending on simple statistics of the
predicted uncalibrated scores, propose an optimal number of bins to be used with the “fixed-width binning”
method.</p>
      <p>The “estimators-based” approach suggests identifying the number of bins by using different estimators
and selecting the best number of bins to be used with the “fixed-width binning” method as a result of
evaluating the calibration error.</p>
      <p>The feasibility study for the “rule-based” approach includes the steps: 1) set rules whose input
parameters are simple statistics of the predicted uncalibrated scores; 2) specify the expected results;
3) execute the rules and record the actual results; 4) compare the actual and expected results: if they are
the same, then recommend the “rule-based” approach.</p>
      <p>The feasibility study for the “estimators-based” approach includes the steps: 1) calculate the number
of bins using the selected estimators; 2) calibrate the predicted scores with the “fixed-width binning” method
and the number of bins received from step 1; 3) compare the actual and expected results: if the minimum
calibration error does not correspond to a bin number of 10, then propose the “estimators-based”
approach.</p>
    </sec>
    <sec id="sec-5">
      <title>2.3. Materials</title>
      <p>The following rules are included in the feasibility study for the “rule-based” approach:
1. The 1st rule’s clause: the given predicted scores have a low standard deviation (less than 0.1) and the
scores’ variability is small (&lt;0.5); the 2nd rule’s clause: the given predicted scores have a low standard
deviation (less than 0.1) and the scores’ variability shows a degree of dispersion (from 0.5 to 0.7). The
expected results: when the David W. Scott rule calculates the required number of bins for the 1st and the
2nd rule’s clauses, the results are close, as the David W. Scott rule is not expected to account for the
scores’ range.</p>
      <p>2. The given predicted scores have a low standard deviation and a low interquartile range (less than
0.1) and the scores’ variability shows a degree of dispersion (from 0.5 to 0.7). The expected results: when the
David W. Scott rule and the Freedman-Diaconis rule calculate the required number of bins, the results are
close, as neither estimator is expected to account for the scores’ range.</p>
      <p>If the actual results from the execution of rules 1-2 look the same as the expected ones, our
recommendation will be to develop the “rule-based” approach, as stable rules can then be specified based on
simple statistics of the predicted binary scores.</p>
      <p>The following formulas and algorithms are included in the feasibility study for the “estimators-based”
approach.</p>
      <p>
        The predicted binary scores are considered well calibrated when their values are close or equal to the
actual value of the target class y ∈ {0,1}. The Brier score estimates the calibration error as the mean squared
error between the actual target class yi and the predicted binary score si, equation (1) [13]. The lower the
Brier score, the better the calibration results.
      </p>
      <p>
        BS = (1/n) · Σ_{i=0..n-1} (yi - si)² (1)
      </p>
      <p>where n is the number of observations in the dataset.</p>
      <p>
        A lower Brier score does not always mean a better calibration; the reason is the bias-variance
decomposition of the mean squared error [14]. Another approach to measuring calibration is to calculate the
expected calibration error (ECE), equation (2) [15]:
      </p>
      <p>
        ECE = (1/n) · Σ_{i=1..B} hi · |ȳi - s̄i| (2)
      </p>
      <p>
        where ȳi = (1/hi) · Σ_{yj in bin i} yj is the mean of the actual target class values belonging to bin i;
s̄i = (1/hi) · Σ_{sj in bin i} sj is the mean of the calibrated scores belonging to bin i; hi is the number of
predicted scores that fall into bin i; and B is the number of bins.
      </p>
      <p>A calibration curve plots the calibration results as the relationship between the mean predicted
binary score in each bin, placed on the x-axis, and the fraction of actual positive values of the target class in
each bin, placed on the y-axis. The closer a calibration curve is to the diagonal line, the better the
calibration [16].</p>
      <p>
        When the bin width (h) is identified using the David W. Scott (further Scott) estimator, equation (3),
the bin width is proportional to the dataset’s standard deviation (σ) and inversely proportional to the cube
root of the number of observations in the dataset (n) [17]:
      </p>
      <p>
        h = 3.49 · σ · n^(-1/3) (3)
      </p>
      <p>
        When the bin width (h) is identified using the Freedman-Diaconis estimator, equation (4), the bin
width is proportional to the interquartile range (IQR) and inversely proportional to the cube root of the
number of observations in the dataset (n), so the calculated width is optimal when the dataset is normally
distributed but contains outliers [18]:
      </p>
      <p>
        h = 2 · IQR · n^(-1/3) (4)
      </p>
      <p>
        The required number of bins (b) for the calculated bin width is derived as the ratio of the scores’
range to the bin width (h), according to equation (5):
      </p>
      <p>
        b = (max(scores) - min(scores)) / h (5)
      </p>
      <p>
        Algorithm 1 specifies the “fixed-width binning” method to calibrate predicted binary scores for a given
constant number of bins, denoted “n_bins”. In lines 6-8 it calculates the bins’ edges. In lines 9-14, for each
uncalibrated score received as an input parameter, the index of its bin is calculated and denoted “score_inx”.
In lines 16-18 the algorithm iterates through the bin indices, finds the scores whose indices coincide with the
current bin’s index (those score indices are denoted “mask”), and calculates each calibrated score as the
average of the scores inside the bin.
      </p>
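The paper’s own listing of Algorithm 1 is not reproduced in the source; the description above can be sketched in Python as follows (the handling of a score of exactly 1.0 is our assumption):

```python
import bisect

def fixed_width_binning(scores, n_bins):
    """Calibrate predicted binary scores with fixed-width binning:
    every score is replaced by the mean of the scores sharing its bin."""
    # Bin edges over [0, 1] (lines 6-8 of Algorithm 1).
    edges = [b / n_bins for b in range(n_bins + 1)]
    # Bin index "score_inx" of each uncalibrated score (lines 9-14);
    # a score of exactly 1.0 is placed into the last bin.
    score_inx = [min(bisect.bisect_right(edges, s) - 1, n_bins - 1) for s in scores]
    calibrated = list(scores)
    # For each bin, average the scores whose indices coincide with it (lines 16-18).
    for b in range(n_bins):
        mask = [i for i, inx in enumerate(score_inx) if inx == b]
        if mask:
            mean = sum(scores[i] for i in mask) / len(mask)
            for i in mask:
                calibrated[i] = mean
    return calibrated
```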
      <p>Algorithm 2 specifies the logic to identify the optimal number of bins to be used as an input parameter
for the “fixed-width binning” method. In line 6 the bin width is calculated by an estimator, in line 7 the
number of bins is derived, and in line 8 the predicted scores are calibrated by Algorithm 1 called with that
number of bins. In lines 9-10 the calibration results are evaluated by Brier score and expected calibration
error, and the marks are saved in arrays. Lines 6-10 are repeated for each estimator. In lines 12-16 the
optimal number of bins is selected as the one with the lower metric values. If the metrics disagree, the
default number of 10 bins is returned.</p>
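A self-contained sketch of the selection logic of Algorithm 2 (candidate bin numbers are passed in, e.g. those produced by equations (3)-(5); the inline helpers compress Algorithm 1 and the two metrics and are our phrasing, not the authors’ code):

```python
def _calibrate(scores, n_bins):
    # Fixed-width binning (Algorithm 1): replace each score by its bin's mean.
    inx = [min(int(s * n_bins), n_bins - 1) for s in scores]
    out = list(scores)
    for b in range(n_bins):
        mask = [i for i, k in enumerate(inx) if k == b]
        if mask:
            m = sum(scores[i] for i in mask) / len(mask)
            for i in mask:
                out[i] = m
    return out

def _brier(y, s):
    # Equation (1).
    return sum((a - b) ** 2 for a, b in zip(y, s)) / len(y)

def _ece(y, s, n_bins):
    # Equation (2), with h_i taken as the per-bin score count.
    total = 0.0
    for b in range(n_bins):
        mask = [i for i in range(len(s)) if min(int(s[i] * n_bins), n_bins - 1) == b]
        if mask:
            ym = sum(y[i] for i in mask) / len(mask)
            sm = sum(s[i] for i in mask) / len(mask)
            total += len(mask) * abs(ym - sm)
    return total / len(y)

def optimal_bin_number(y, scores, candidates, default=10):
    """Calibrate with every candidate bin number, score each calibration with
    Brier score and ECE, and return the candidate both metrics prefer; if the
    metrics disagree, fall back to the default of 10 bins (lines 12-16)."""
    briers = []
    eces = []
    for n_bins in candidates:
        cal = _calibrate(scores, n_bins)
        briers.append(_brier(y, cal))
        eces.append(_ece(y, cal, n_bins))
    best_brier = candidates[briers.index(min(briers))]
    best_ece = candidates[eces.index(min(eces))]
    return best_brier if best_brier == best_ece else default
```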
      <p>If the minimum calibration error received from the execution of Algorithm 2 corresponds to a
calibration with a number of bins different from 10, our recommendation will be to develop the
“estimators-based” approach.</p>
      <p>
        The execution of the feasibility study for the “rule-based” approach consists of two steps.
Step 1. Calculate the bin width using estimators (3)-(4) and the number of bins according to (5) with the
input parameters:
      </p>
      <p>Rule 1: σ ≤ 0.1 and n = 250 and scores range ≤ 0.5.</p>
      <p>Rule 2: σ ≤ 0.1 and n = 250 and 0.5 &lt; scores range ≤ 0.7.</p>
      <p>Rule 3: IQR ≤ 0.1 and n = 250 and 0.5 &lt; scores range ≤ 0.7.</p>
      <p>Step 2. Compare the actual results with the expected results (specified in sec. “Materials”) and make
the recommendations.</p>
      <p>The execution of the feasibility study for the “estimators-based” approach consists of three steps.
Step 1. Generate two synthetic datasets for a classification problem: the 1st dataset is from a skewed
Gaussian distribution; the 2nd is from a normal distribution. The size of each dataset is two features and
1000 observations; the target class values are 1 and 0 for the positive and negative class correspondingly.
The machine learning algorithms included in the experiments are RandomForestClassifier and GaussianNB,
used with the default values of their hyperparameters.</p>
      <p>Step 2. Split each dataset into train and test subsets in the proportion 80/20; train a learner and receive
the predicted binary scores for the test subset.</p>
      <p>Step 3. Execute Algorithm 2, compare the actual results with the expected results (specified in sec.
“Materials”) and make the recommendations.</p>
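Steps 1-2 can be sketched with the standard library alone. All names here, and the way the skew is produced, are our assumptions; the paper does not specify its generator:

```python
import random

def make_dataset(n=1000, skewed=False, seed=0):
    """Two-feature binary classification data: Gaussian features whose mean
    depends on the class label; optionally skewed by a signed-squaring transform."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        label = rng.randint(0, 1)
        f1 = rng.gauss(float(label), 1.0)
        f2 = rng.gauss(float(label), 1.0)
        if skewed:
            # Signed squaring pushes each marginal away from normality.
            f1, f2 = f1 * abs(f1), f2 * abs(f2)
        X.append((f1, f2))
        y.append(label)
    return X, y

def train_test_split(X, y, test_frac=0.2, seed=0):
    """80/20 split into train and test subsets."""
    rng = random.Random(seed)
    idx = list(range(len(X)))
    rng.shuffle(idx)
    cut = int(len(X) * (1 - test_frac))
    tr, te = idx[:cut], idx[cut:]
    return [X[i] for i in tr], [y[i] for i in tr], [X[i] for i in te], [y[i] for i in te]

X, y = make_dataset(n=1000, skewed=True)
X_train, y_train, X_test, y_test = train_test_split(X, y)
```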
    </sec>
    <sec id="sec-6">
      <title>2.5. Results and discussions</title>
      <p>
        The actual results from the execution of rules 1-3 to study the feasibility of the “rule-based”
approach are:
1. when σ ≤ 0.1 and n = 250 and scores range ≤ 0.5, estimator (3) makes the bin width less
than 0.0349 and the number of bins is 29;
2. when σ ≤ 0.1 and n = 250 and 0.5 &lt; scores range ≤ 0.7, estimator (3) makes the bin width
less than 0.08 and the number of bins is 5;
3. when IQR ≤ 0.1 and n = 250 and 0.5 &lt; scores range ≤ 0.7, estimator (4) makes the bin width
less than 0.01 and the number of bins is 60.
      </p>
      <p>
        The actual results obtained from the execution of rules 1-2 show that small changes in the scores’
variability may impact the David W. Scott rule’s result, so that the calculated numbers of bins differ
approximately by a factor of 6, which is not expected, as the standard deviation and the number of
observations are similarly low in rules 1-2. The actual results obtained from the execution of rules 2-3 show
that the two estimators (3)-(4) calculate numbers of bins which differ by a factor of 12, which is not
expected, as we kept the standard deviation and the IQR low and so expected similar results from both
estimators.
      </p>
      <p>To summarize the results, we do not recommend the “rule-based” approach, as stable rules cannot be
defined based on the selected simple statistics of the predicted binary scores.</p>
      <p>The results of the execution of the feasibility study for the “estimators-based” approach are recorded
in Tables 1-6 and illustrated in Figures 1-2.</p>
      <p>Table 1 records the simple statistics of the uncalibrated binary scores predicted by GaussianNB and
RandomForestClassifier in lines 1 and 2 correspondingly. The learning algorithms had been trained on the
skewed dataset from a Gaussian distribution. Line 1 records that the standard deviation and the IQR of the
uncalibrated predicted scores are less than 0.1; as specified in scenario 1, Table 2 shows that the number of
bins tends to be bigger than 10 (it is 60 and 14 bins). Line 2 records increased spreads, and Table 3 shows
that the number of bins tends to be smaller than 10 (it is 9 and 8).</p>
      <p>The calibration results for the scores whose statistics are described in Table 1 are presented in Tables
2-3 in the columns “Brier score” and “ECE” and visualized with calibration curves in Figure 1. For the
calibrated GaussianNB scores, the lower values of the metrics are captured for 60 bins, and the calibration
curve in picture (b) of Figure 1, line 1, shows that half of the curve’s points are very close to the diagonal. In
picture (c) of Figure 1, line 1, it can also be seen that half of the points are close to the diagonal and the
Brier score is almost the same; however, according to the ECE metric, the better calibration is achieved with
60 bins. For the calibrated RandomForestClassifier scores, the lower value of the Brier score is captured for
binning with 8 bins. The difference in the ECE metric between 8 and 9 bins is less than 10^-4; however, the
calibration curve in picture (c) of Figure 1, line 2, shows more of the curve’s points close to the diagonal line
compared to the curve in picture (b), which indicates the better calibration with 8 bins.</p>
      <sec id="sec-6-5">
        <title>Tables 1-3 (fragments)</title>
        <p>Recoverable fragments of Tables 1-3: predicted score counts per bin: 2, 2, 4, 7, 18, 53, 132, 19, 5, 8
and 8, 12, 32, 28, 43, 40, 30, 26, 22, 9; scores range: 0.66 and 0.89; standard deviation: 0.086 and 0.2; IQR:
0.035 and 0.3; bin widths: 0.1, 0.011 and 0.047.</p>
        <p>Table 4 records the simple statistics of the uncalibrated binary scores predicted by GaussianNB and
RandomForestClassifier in lines 1 and 2 correspondingly. The learning algorithms had been trained on the
dataset from a normal distribution. The calibration results for the scores from Table 4 are presented in
Tables 5-6.</p>
        <p>Lines 1 and 2 of Table 4 show an increased spread of the uncalibrated predicted scores from the mean
and median compared to Table 1, so Table 5 records a number of bins close to 10 (it is 12 and 8). In Table 6,
due to the further increased spread, 6 bins are needed.</p>
        <p>For the calibrated GaussianNB scores, the lower value of the Brier score is captured for the default 10
bins; however, the difference in the ECE metric is less than 10^-4, and the calibration curve in picture (c) of
Figure 2, line 1, shows 5 of the 8 curve points on the diagonal line, so 8 bins could be considered optimal as
well. For the calibrated RandomForestClassifier scores, the lower values of the metrics are captured for 6
bins, and the calibration curves in pictures (b)-(c) of Figure 2, line 2, indicate better calibration compared to
the curve with 10 bins in picture (a) of Figure 2, line 2.</p>
      </sec>
      <sec id="sec-6-6">
        <title>Tables 4-6 (fragments)</title>
        <p>Recoverable fragments of Tables 4-6: predicted score counts per bin: 115, 49, 20, 18, 13, 9, 10, 3, 8, 5
and 25, 30, 20, 32, 26, 18, 19, 19, 20, 41; estimators: not applied (default), Freedman-Diaconis, Scott; bin
widths: 0.1, 0.05 and 0.081; and 0.1, 0.165 and 0.165.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>3. Conclusions</title>
      <p>The current study considered two different approaches for the identification of the optimal number of
bins to be used with the “fixed-width binning” method. The “rule-based” approach, according to which a set
of rules depending on the values of the uncalibrated scores’ range, standard deviation and interquartile range
proposes an optimal bin number, is not recommended for further development: the actual results of the
rules’ execution compared to the expected results showed that small changes in the scores’ variability may
impact the David W. Scott rule’s results, and the David W. Scott and Freedman-Diaconis rules calculated
numbers of bins which differ by a factor of 12, which was not expected, as we fixed the standard deviation
and the interquartile range to be low. Our conclusion for the “rule-based” approach is that stable rules cannot
be defined based on the selected simple statistics of the predicted binary scores.</p>
      <p>The evaluation of the effectiveness of the “estimators-based” approach, according to which the number
of bins is calculated by different estimators and the optimal bin number is selected as a result of the
evaluation of a calibration error, revealed the following: 10 bins for the “fixed-width binning” method to
calibrate predicted probabilities is not optimal for all datasets. When the uncalibrated scores’ range, standard
deviation and interquartile range are low, 60 bins can be optimal according to the Freedman-Diaconis rule;
at the same time, when the scores’ spread is low but the standard deviation and the interquartile range are
increasing, 8 bins can be optimal according to the David W. Scott rule. A further increase of the spread
decreases the optimal number of bins to 6 according to both estimators.</p>
      <p>Our proposal is to identify the number of bins dynamically according to the “estimators-based”
approach described in Algorithm 2. The proposed approach improves the calibration of binary predicted
probabilities based on the ECE and Brier score metrics, as visible from the calibration curves in Figure 1 and
Figure 2, so that the accuracy of the area under the ROC curve is good enough to conduct ROC curve
analysis.</p>
      <p>Further work will extend the proposed “estimators-based” approach to calculate the optimal bin
number with the estimators Sturges’ formula, the Rice rule and Doane’s formula, as those estimators derive
the bin number from the range of the data.</p>
      <sec id="sec-8">
        <title>4. References</title>
        <p>[17] D.W. Scott, Sturges' rule. Wiley Interdisciplinary Reviews: Computational Statistics. 2009
Nov;1(3):303-6.</p>
        <p>[18] D. Freedman, P. Diaconis, On the histogram as a density estimator: L2 theory. Zeitschrift für
Wahrscheinlichkeitstheorie und verwandte Gebiete. 1981 Dec;57(4):453-76.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>T.</given-names>
            <surname>Fawcett</surname>
          </string-name>
          ,
          <article-title>An introduction to ROC analysis</article-title>
          .
          <source>Pattern recognition letters</source>
          .
          <source>2006 Jun</source>
          <volume>1</volume>
          ;
          <issue>27</issue>
          (
          <issue>8</issue>
          ):
          <fpage>861</fpage>
          -
          <lpage>74</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>T.</given-names>
            <surname>Fawcett</surname>
          </string-name>
          ,
          <article-title>ROC graphs: Notes and practical considerations for researchers</article-title>
          .
          <source>Machine learning</source>
          .
          <source>2004 Mar</source>
          <volume>16</volume>
          ;
          <issue>31</issue>
          (
          <issue>1</issue>
          ):
          <fpage>1</fpage>
          -
          <lpage>38</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>B.</given-names>
            <surname>Zadrozny</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Elkan</surname>
          </string-name>
          ,
          <article-title>Obtaining calibrated probability estimates from decision trees and naive bayesian classifiers</article-title>
          .
          <source>InIcml 2001 Jun</source>
          <volume>28</volume>
          (Vol.
          <volume>1</volume>
          , pp.
          <fpage>609</fpage>
          -
          <lpage>616</lpage>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Naeini</surname>
          </string-name>
          , G. Cooper,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hauskrecht</surname>
          </string-name>
          .
          <article-title>Obtaining well calibrated probabilities using Bayesian binning</article-title>
          .
          <source>In AAAI Conference on Artificial Intelligence</source>
          ,
          <fpage>20</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>C.</given-names>
            <surname>Guo</surname>
          </string-name>
          , G. Pleiss,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sun</surname>
          </string-name>
          , KQ. Weinberger.
          <article-title>On calibration of modern neural networks</article-title>
          .
          <source>In International conference on machine learning 2017 Jul</source>
          <volume>17</volume>
          (pp.
          <fpage>1321</fpage>
          -
          <lpage>1330</lpage>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>R.</given-names>
            <surname>Roelofs</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Cain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Shlens</surname>
          </string-name>
          , M. Mozer.
          <article-title>Mitigating bias in calibration error estimation</article-title>
          .
          <source>arXiv preprint arXiv:2012.08668</source>
          ,
          <year>2020</year>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>C.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ramdas</surname>
          </string-name>
          ,
          <article-title>Distribution-free calibration guarantees for histogram binning without sample splitting</article-title>
          .
          <source>In International Conference on Machine Learning 2021 Jul</source>
          <volume>1</volume>
          (pp.
          <fpage>3942</fpage>
          -
          <lpage>3952</lpage>
          ). PMLR.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Tengyu</surname>
          </string-name>
          ,
          <article-title>Verified uncertainty calibration</article-title>
          .
          <source>Advances in Neural Information Processing Systems</source>
          .
          <year>2019</year>
          ;
          <volume>32</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J.</given-names>
            <surname>Platt</surname>
          </string-name>
          ,
          <article-title>Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods</article-title>
          .
          <source>Advances in large margin classifiers</source>
          .
          <source>1999 Mar 26</source>
          ;
          <volume>10</volume>
          (
          <issue>3</issue>
          ):
          <fpage>61</fpage>
          -
          <lpage>74</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Jiang</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Luo</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>He</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <year>2023</year>
          .
          <article-title>A new bin size index method for statistical analysis of multimodal datasets from materials characterization</article-title>
          .
          <source>Scientific Reports</source>
          ,
          <volume>13</volume>
          (
          <issue>1</issue>
          ), p.
          <fpage>10915</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>K.H.</given-names>
            <surname>Knuth</surname>
          </string-name>
          ,
          <article-title>Optimal data-based binning for histograms and histogram-based probability density models</article-title>
          .
          <source>Digital Signal Processing</source>
          .
          <source>2019 Dec 1</source>
          ;
          <volume>95</volume>
          :
          <fpage>102581</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>W.K.</given-names>
            <surname>Leow</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>The analysis and applications of adaptive-binning color histograms</article-title>
          .
          <source>Computer Vision and Image Understanding</source>
          .
          <source>2004 Apr 1</source>
          ;
          <volume>94</volume>
          (
          <issue>1-3</issue>
          ):
          <fpage>67</fpage>
          -
          <lpage>91</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>G.</given-names>
            <surname>Brier</surname>
          </string-name>
          ,
          <article-title>Verification of forecasts expressed in terms of probability</article-title>
          .
          <source>Monthly Weather Review</source>
          <volume>78</volume>
          (
          <year>1950</year>
          )
          <fpage>1</fpage>
          -
          <lpage>3</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Ferri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hernández-Orallo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.J.</given-names>
            <surname>Ramírez-Quintana</surname>
          </string-name>
          ,
          <article-title>Calibration of machine learning models</article-title>
          .
          <source>In Handbook of Research on Machine Learning Applications and Trends: Algorithms, Methods, and Techniques</source>
          <year>2010</year>
          (pp.
          <fpage>128</fpage>
          -
          <lpage>146</lpage>
          ).
          <source>IGI Global.</source>
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>M.P.</given-names>
            <surname>Naeini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Cooper</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Hauskrecht</surname>
          </string-name>
          ,
          <article-title>Obtaining well calibrated probabilities using bayesian binning</article-title>
          .
          <source>In: Proceedings of the AAAI Conference on Artificial Intelligence</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>D.S.</given-names>
            <surname>Wilks</surname>
          </string-name>
          ,
          <article-title>On the combination of forecast probabilities for consecutive precipitation periods</article-title>
          .
          <source>Weather and forecasting</source>
          .
          <source>1990 Dec</source>
          ;
          <volume>5</volume>
          (
          <issue>4</issue>
          ):
          <fpage>640</fpage>
          -
          <lpage>50</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>