<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>On the Labeling Correctness in Computer Vision Datasets</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mohammed Al-Rawi</string-name>
          <email>al-rawi@cvc.uab.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dimosthenis Karatzas</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Computer Vision Center, Universidad Autonoma de Barcelona</institution>
          ,
          <addr-line>Bellaterra</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Image datasets have been heavily used to build computer vision systems. These datasets are either manually or automatically labeled, which is a problem as both labeling methods are prone to errors. To investigate this problem, we use a majority voting ensemble that combines the results from several Convolutional Neural Networks (CNNs). Majority voting ensembles not only enhance the overall performance, but can also be used to estimate the confidence level of each sample. We also examined Softmax as another way to estimate the posterior probability. We have designed various experiments with a range of different ensembles built from one type of CNN, from different types of CNNs, or from temporal (snapshot) CNNs, which have been trained multiple times stochastically. We analyzed the CIFAR10, CIFAR100, EMNIST, and SVHN datasets and found quite a few incorrect labels, both in the training and testing sets. We also present a detailed confidence analysis on these datasets and found that the ensemble is better than Softmax when used to estimate the per-sample confidence. This work thus proposes an approach that can be used to scrutinize and verify the labeling of computer vision datasets, which can later be applied to weakly/semi-supervised learning. We propose a measure, based on the Odds Ratio, to quantify how many of the incorrectly classified labels are actually incorrectly labeled and how many of them are confusing. The proposed methods are easily scalable to larger datasets, like ImageNet, LSUN and SUN, as each CNN instance is trained for 60 epochs; or even faster, by implementing a temporal (snapshot) ensemble.</p>
      </abstract>
      <kwd-group>
        <kwd>Data annotation and labeling</kwd>
        <kwd>ensembles</kwd>
        <kwd>convolutional neural networks</kwd>
        <kwd>semi-supervised learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Recent developments in deep neural network approaches have greatly advanced the
performance of visual recognition systems. Most research and development is based
on standard computer vision datasets that have been annotated manually or
automatically. Moreover, the computer vision community is devoted to building larger datasets
containing tens, or even hundreds, of millions of samples, for example the JFT-300M
dataset [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Dataset annotation and/or labeling is a difficult, confusing and time-consuming
task; and even after labeling, it is difficult to assess a dataset for label correctness,
whether manually or automatically. One way, however, to verify the labeling is to
have a system that returns a confidence level for each sample in the dataset, rather than an
overall system/classifier confidence. We illustrate the implementation of our ideas in
Fig. 1.
      </p>
      <p>
        Although state-of-the-art deep learning architectures can produce posterior
probabilities, these probabilities may not be adequate to estimate the per-sample
confidence-level value [
        <xref ref-type="bibr" rid="ref2 ref29">2</xref>
        ]. However, one promising approach to measuring the
per-sample confidence level is to use ensemble classification methods. In ensemble
learning, multiple classifiers are combined to solve a specific classification task, and
they can enhance the classification performance by compensating for the low
performance of a poor classifier. Other important outcomes of ensemble learning
include assigning a confidence level, and/or posterior probability, to each sample in the
testing set. Neural network ensembles were investigated long before
deep learning [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ]. After the deep learning boom in 2012, there have been quite a few
works on ensembles built with deep nets deploying Convolutional Neural Networks
(CNNs) [
        <xref ref-type="bibr" rid="ref5 ref6 ref7">5-7</xref>
        ]. Ensembles, in fact, integrate well with deep learning
frameworks and are currently being used in many areas of research and development,
including challenges and competitions [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
    </sec>
    <sec id="sec-12">
      <p>Fig. 1. Data annotation that is normally used (left) and the proposed probabilistic analysis (right). Box labels in the diagram: Data Collection, Annotation/Labeling, Probabilistic Analysis, Correct labels, Incorrect labels, Corrected labels, Data ready for usage.</p>
      <p>
        Research on ensembles, however, has focused on improving the classification
performance over that of a single learning model, in various applications and not only
image understanding [
        <xref ref-type="bibr" rid="ref10 ref11 ref5 ref6 ref9">5, 6, 9-11</xref>
        ], and quite a few ensembles have won computer vision
challenges; see, for example, [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. While the ensemble performance-improvement
hypothesis is effective and even supported by theoretical material, confidence analysis
has not received its expected share of attention in the literature. Moreover, in contrast to
works that tackle the overall confidence level of the classifier, confidence
should be assessed at the level of the individual sample, similar to a human's ability to be confident in
the classification/decision for each sample/image. Such confidence analysis would
be highly useful in weakly supervised learning, which was the goal of the work in [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ],
where the authors successfully implemented temporal ensembling. However, the
authors of [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] did not examine the per-sample confidence levels to see how they
could be useful in data cleaning, i.e. in a semi-supervised fashion. Other works
differentiate between expert and novice annotators, and between strongly and weakly
annotated data, in so-called active learning [
        <xref ref-type="bibr" rid="ref13 ref14 ref15">13-15</xref>
        ]. These works usually focus on
uncertainty-based methods that ignore incorrectly labeled samples, and thus are
sensitive to outliers [
        <xref ref-type="bibr" rid="ref16 ref17">16, 17</xref>
        ]. Furthermore, these works have not incorporated deep
CNNs into active learning. The major aim of this work is therefore to use deep CNNs
to investigate the per-sample confidence level, to compare it to the Softmax posterior
probability, and to examine the possibility of using it to verify the labeling in computer
vision datasets. The proposed approach can also find important applications in
weakly supervised / active learning scenarios.
      </p>
    </sec>
    <sec id="sec-13">
      <p>We also aim to scrutinize the possibility of building an ensemble classifier from one type of CNN and to compare the result with using different types of CNNs, including temporal ensembles, which will allow us to study the independence between randomly trained CNNs of the same structure.</p>
      <sec id="sec-13-1">
        <title>Methods</title>
        <p>
          We tried different types of CNNs that have been pre-trained on ImageNet (aka
“Pre-Trained Models”). Generally speaking, a pre-trained model can learn the features from
images faster than a model that starts from scratch (i.e. by randomly initializing its
weights) [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. In fact, some pre-trained models can reach an accuracy of 80% in three
epochs on CIFAR10. For the ensemble classifier, we implemented voting schemes
based on the predicted labels of the classifiers used. It has been proven that majority
voting combination will always lead to a performance improvement for a sufficiently
large number of classifiers, provided that the classifier outputs are independent [
          <xref ref-type="bibr" rid="ref18">18</xref>
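          ]. To
illustrate this further, consider a binary classification task and assume that each classifier has
a probability <inline-formula><tex-math>p</tex-math></inline-formula> of making a correct decision. The ensemble's probability
<inline-formula><tex-math>P_{ens}</tex-math></inline-formula> of
making a correct decision then follows from the binomial distribution [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ]:
          <disp-formula><tex-math>P_{ens} = \sum_{k=\lfloor T/2 \rfloor + 1}^{T} \binom{T}{k} \, p^{k} (1 - p)^{T-k}</tex-math></disp-formula>
where <inline-formula><tex-math>T</tex-math></inline-formula> is the number of classifiers used to build the ensemble. From the above, if
<inline-formula><tex-math>p &gt; 0.5</tex-math></inline-formula>, then
<inline-formula><tex-math>P_{ens} \to 1</tex-math></inline-formula> as
<inline-formula><tex-math>T \to \infty</tex-math></inline-formula>. Note that <inline-formula><tex-math>p &gt; 0.5</tex-math></inline-formula> (above chance level) holds
for almost all successfully trained binary classifiers. A similar argument can
be conjectured for multiclass ensembles, as combining binary classifiers for multi-class
classification is a very familiar approach [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ]. The vital issue of concern
here is the independence of the outputs of the different classifiers.
        </p>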
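        <p>To make the binomial argument concrete, the following minimal Python sketch evaluates the probability that a strict majority of independent classifiers is correct. The function name majority_vote_accuracy and the parameter names p (per-classifier accuracy) and n_classifiers (ensemble size) are illustrative assumptions, not part of the original experiments, and the sketch relies on the independence assumption discussed above.</p>
        <preformat><![CDATA[
```python
from math import comb


def majority_vote_accuracy(p: float, n_classifiers: int) -> float:
    """Probability that a strict majority of independent classifiers is correct.

    Assumes every classifier is correct with the same probability p and that
    the classifier outputs are independent (an idealisation).
    """
    first_majority = n_classifiers // 2 + 1  # smallest number of votes that wins
    return sum(
        comb(n_classifiers, k) * p**k * (1 - p) ** (n_classifiers - k)
        for k in range(first_majority, n_classifiers + 1)
    )


# Hypothetical per-classifier accuracy of 0.6 for growing ensemble sizes.
for size in (1, 11, 101):
    print(size, round(majority_vote_accuracy(0.6, size), 4))
```
]]></preformat>
        <p>With a per-classifier accuracy above 0.5 the printed values grow towards 1 as the ensemble gets larger, while an accuracy below 0.5 would push them towards 0. For an even number of classifiers this sketch counts a tie as an error.</p>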
        <sec id="sec-13-1-1">
          <title>A measure to quantify the classified labels</title>
          <p>
            In this work, we used a majority voting ensemble based on the classifiers' output
labels; the ensemble chooses the category/class that receives the largest total vote.
The more votes a sample gets, the higher the confidence, and the fewer the votes,
the lower the confidence. We then used the highest confidence as a key indicator to find
any incorrectly labeled samples; this is the per-sample confidence level reached when the
ensemble votes are equal to the number of classifiers used to build it. To make some
inferences from the high confidence of the ensemble we make use of 1) <inline-formula><tex-math>N_{I}</tex-math></inline-formula>, which is
the number of incorrect samples that have been classified with high confidence (these
are the false positives) with probability <inline-formula><tex-math>P_{I} = N_{I}/N</tex-math></inline-formula>, and 2) <inline-formula><tex-math>N_{C}</tex-math></inline-formula>, which is the
number of correct samples that have been classified with high confidence with
probability <inline-formula><tex-math>P_{C} = N_{C}/N</tex-math></inline-formula>, where <inline-formula><tex-math>N</tex-math></inline-formula> is the number of testing samples. The value of <inline-formula><tex-math>N_{I}</tex-math></inline-formula>
is of most interest as it indicates that all classifiers of the ensemble agreed (with high
confidence) to incorrectly classify a sample. The high-confidence incorrectly classified
samples will be investigated further to verify their labels. It is also possible that these
incorrectly classified samples contain some difficult / confusing details that
deceived the ensemble, or that the classifiers used to build the ensemble were not
independent. To compare the performance of different ensembles, we calculate the
Odds Ratio (OR) using the formula [
            <xref ref-type="bibr" rid="ref21">21</xref>
            ]:
            <disp-formula><label>(2)</label><tex-math>OR = \frac{P_{I} (1 - P_{C})}{P_{C} (1 - P_{I})}</tex-math></disp-formula>
          </p>
          <p>The value of OR will be used to estimate the likelihood that the ensemble
produces false positives with high confidence (on the assumption that all samples are
correctly labeled/annotated), i.e. how likely it is that incorrect samples will be classified as
correct ones with high confidence; hence, the lower the OR the better. An OR equal
to one indicates that the classification of correct and incorrect samples with high
confidence is equally likely to occur.</p>
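          <p>The voting confidence and the OR can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names vote_confidence and odds_ratio are hypothetical, and the OR is read here as P_I (1 - P_C) / (P_C (1 - P_I)), with P_I = N_I / N and P_C = N_C / N the fractions of high-confidence incorrect and high-confidence correct samples among N testing samples. The counts in the example are the SVHN testing-set figures reported in the Results.</p>
          <preformat><![CDATA[
```python
from collections import Counter


def vote_confidence(predictions):
    """Return (majority label, fraction of classifiers that voted for it)."""
    label, votes = Counter(predictions).most_common(1)[0]
    return label, votes / len(predictions)


def odds_ratio(n_incorrect_high, n_correct_high, n_samples):
    """OR = P_I (1 - P_C) / (P_C (1 - P_I)) from high-confidence counts."""
    p_i = n_incorrect_high / n_samples  # high-confidence but incorrect
    p_c = n_correct_high / n_samples    # high-confidence and correct
    return p_i * (1 - p_c) / (p_c * (1 - p_i))


# Toy votes from five hypothetical classifiers for a single sample.
label, confidence = vote_confidence(["cat", "cat", "frog", "cat", "cat"])
print(label, confidence)

# SVHN testing set as reported in the text: N_I = 129, N_C = 20887, N = 26032.
print(round(odds_ratio(129, 20887, 26032), 4))
```
]]></preformat>
          <p>A sample reaches high confidence in the sense used here only when the vote fraction equals 1, i.e. when every classifier agrees. With the SVHN counts the sketch gives an OR of about 0.0012, in line with the value reported for that dataset.</p>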
          <p>
            We used CIFAR10 and CIFAR100 [
            <xref ref-type="bibr" rid="ref22">22</xref>
            ] in all confidence analysis experiments.
CIFAR10 is a well-known dataset that has been heavily investigated in the computer
vision literature. It has 50K 32×32 RGB image samples for training and 10K
32×32 RGB image samples for testing, where each image belongs to one of ten classes:
airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck. Each class thus
has 5000 images in the training set and 1000 images in the testing set. CIFAR100, on
the other hand, has a similar image structure but has 100 classes with 600
samples each: 500 samples in the training set and 100 samples in the testing set. To
implement our algorithms, we used PyTorch [
            <xref ref-type="bibr" rid="ref23">23</xref>
            ] as our main deep learning framework.
Further details on the methods used and the experimental setting can be found in the
supplemental material.
          </p>
          <p>
            We chose the VGG CNN family [
            <xref ref-type="bibr" rid="ref24">24</xref>
            ] (we will refer to the VGG Ensemble as ‘VGG-E’)
as these networks require less training time than other CNNs, and they can reach higher accuracy
than other CNNs when trained for up to 60 epochs. We chose 60 epochs for the following
reasons: 1) to see how fast and how well the ensemble classifier can learn with
confidence, 2) to reduce the execution time of ensembles, 3) to follow Schapire's idea
on the strength of weak learnability [
            <xref ref-type="bibr" rid="ref25">25</xref>
            ], and 4) for the proposed methodology to be
efficient and scalable when used on larger datasets. Some VGGs have Batch
Normalization (BN) and others do not; those that do are postfixed with ‘BN’, so VGG11BN
denotes VGG11 with batch normalization. In most analyses, we used eight VGG CNNs:
VGG11, VGG11BN, VGG13, VGG13BN, VGG16, VGG16BN, VGG19, and
VGG19BN.
          </p>
        </sec>
      </sec>
      <sec id="sec-13-2">
        <title>Results</title>
        <sec id="sec-13-2-1">
          <title>The Softmax Posterior Probability</title>
          <p>
            It is widely known that CNNs, and neural networks in general, yield Posterior
Probabilities (PPs) as their outputs when Softmax is used. It is not known, however, whether these
posterior probabilities can be used to estimate the per-sample confidence to an accurate
degree. To investigate the confidence distribution of the correctly and incorrectly
classified labels via Softmax outputs, we used CIFAR100 to train VGG19BN with the
previously mentioned settings, except that we increased the number of training epochs to
600. We will refer to the condition where the Softmax posterior probability equals one as
high confidence. The typical situation for the PP of the incorrectly classified samples is
to have an exponential distribution or, in the worst case, a normal distribution. The
results of the training and testing are given in Table 1 and show that the Softmax
posterior probabilities of a single VGG have high OR values, and thus may not be used
as good estimates of the per-sample confidence level. This deduction is clearly depicted
in Fig. 2, which shows the posterior probability distributions, where the incorrectly
classified samples have a peak at PP=1 (PP=1 denotes high confidence as the probability is
100%). We also see from Fig. 2 that the distribution of the PP values is
right-skewed for the incorrect labels, which means that using these PP values for the per-sample
confidence level is not reliable. The presented Softmax results, in fact, agree with the view of
neural network posterior probabilities as over-confident estimates, as has been
detailed in [
            <xref ref-type="bibr" rid="ref2 ref29">2</xref>
            ]. Our probability analysis provides further evidence of why adversarial
attacks [
            <xref ref-type="bibr" rid="ref26">26</xref>
            ] are possible when using Softmax to state the confidence of the classified
object/image.
          </p>
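          <p>For reference, the Softmax posterior probability used as a confidence value in this subsection can be sketched as follows. The helper names softmax and softmax_confidence and the example logits are hypothetical and serve only to show how a large logit margin drives the reported posterior towards one, regardless of whether the predicted class is the true one.</p>
          <preformat><![CDATA[
```python
from math import exp


def softmax(logits):
    """Numerically stable softmax over a list of raw class scores."""
    peak = max(logits)
    exps = [exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_confidence(logits):
    """Return (predicted class index, its softmax posterior probability)."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs[best]


# Hypothetical logits for one sample: the margin alone decides how close
# the posterior gets to 1; correctness of the prediction plays no role.
print(softmax_confidence([12.0, 1.5, -3.0]))
```
]]></preformat>
          <p>A single network can therefore report a posterior near one for a wrong prediction, which is the behaviour the OR analysis in this subsection quantifies.</p>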
        </sec>
        <sec id="sec-13-2-2">
          <title>Ensembles Built with Different VGG Types</title>
          <p>Using CIFAR10, each VGG type was trained up to 16 times; VGG-E thus has a
total of 118 VGGs (skipping chance-level local minima sometimes resulted in fewer than
the planned 16×8 = 128 VGGs). The results of the discovered images with incorrect
labels in the testing set are presented in Table 2. Our tests showed that there are 9
incorrectly classified samples with high confidence (i.e., the vote count is equal to the number of VGGs used
to build the ensemble). Investigating the 9 false positives, we found that most of them
have been incorrectly labeled in the testing set of CIFAR10. We also present in the
supplementary material a few samples that have high per-sample confidence level
values but were incorrectly classified. After examination, however, these samples appear
to be confusing.
Swapping the two sets, i.e. training with the 10K testing set and testing with the 50K training set,
is more challenging than the previous experiment, and could be useful in weakly
supervised learning. The results of the discovered incorrect labels are presented in Table 2.
By investigating the 81 false positives, we found that some have been incorrectly
labeled in the training set of CIFAR10, but some images have confusing content. A few
samples discovered by the VGG-E are confusing not only to CNNs, but also to the
human observer; see the supplementary material. We repeated the above
experiments on CIFAR100, which resulted in a VGG-E with 128 VGGs. The ensemble
enhanced the accuracy by ~9% compared to the average of the VGGs. A few incorrect labels
in CIFAR100's testing set, as well as in the training set, are shown in the supplemental
material. The analysis of these ensembles is summarized in Table 3, and the
per-sample confidence distributions are shown in Fig. 3.</p>
          <p>
            We can see from Table 2 that the frog (which has index 2405 in the data) is
labeled as a cat in the CIFAR10 testing set, but the VGG-E managed to predict the correct
label (more results are shown in the supplemental material). The number of incorrect
labels in CIFAR100 is higher; for example, a bottle (which has index 7762) is
labeled as a cup, and other images with incorrect labeling also exist. Nonetheless, by
inspecting these images, one can admire the work that CNNs achieve in classifying
these CIFAR images, as most of the time the details are not clear even to the human
observer, due to the use of so-called tiny images (each image has a size of 32×32).
Thus, using CNN ensembles would assist in inspecting and verifying the labeling, as
proposed in this work.
          </p>
          <p>
The same experimental strategy used for CIFAR10 and CIFAR100 has been applied to the
EMNIST dataset [
            <xref ref-type="bibr" rid="ref27">27</xref>
            ]. The EMNIST dataset, which is derived from the NIST Special
Database, has been compiled from a set of handwritten English characters and Arabic
digits and has been suggested as a more challenging replacement for the MNIST dataset.
The EMNIST ‘By Class’ split has 814,255 images distributed over 62 unbalanced
classes. The image format and dataset structure directly match those of the MNIST dataset;
each image is 28×28 gray-level. EMNIST is extremely challenging, at both the labeling and
testing levels, as it has upper- and lower-case confusion, in addition to the numeral
one (1) versus the lower-case letter L (l), O versus 0 (zero), 9 versus q, etc. In fact, our
analysis shows that this confusion has been present at the labeling/annotation stage.
          </p>
          <p>As for the results, the ensemble gave a classification accuracy of 0.87 when trained
using the training set. Furthermore, in the testing set, the number of incorrect samples
classified by the ensemble with high confidence is 2,837, while the number of correct
samples classified by the ensemble with high confidence is 83,467, and the
quantitative measure OR is 0.0098. To inspect the labeling of the training set, we trained
the ensemble with the testing set, which gave a classification accuracy of 0.85. The number
of incorrect samples classified by the ensemble with high confidence is 9,154, the number of
correct samples classified by the ensemble with high confidence is 408,588, and
the quantitative measure OR is 0.0094. The confidence distributions are illustrated in
Fig. 4. Due to space limitations, incorrect/confusing EMNIST images are shown
in the supplemental material.</p>
          <p>Fig. 4. Per-sample confidence distribution for the training set (top row) and testing set
(bottom row); correctly classified labels (left) and incorrectly classified labels (right) in the
EMNIST dataset. Horizontal axes: confidence, from low to high.</p>
        </sec>
        <sec id="sec-13-2-3">
          <title>Experiments with the SVHN dataset</title>
          <p>
            SVHN [
            <xref ref-type="bibr" rid="ref28">28</xref>
            ] is a real-world image digit dataset that has been inspired by the MNIST
structure (e.g., the images are of small cropped digits) but comes from a significantly harder,
unsolved, real-world problem (recognizing digits and numbers in natural scene images).
SVHN, which contains 73257 digits for training and 26032 digits for testing, has been
obtained from house numbers in Google Street View images. Using the training set for
training, the classification accuracy of the ensemble is 95.14%. The number of incorrect
samples classified with high confidence by all the CNNs is 129 and the number of correct
samples classified with high confidence by the ensemble is 20887, yielding OR = 0.0012. Inspecting
the label correctness in the training set showed that the number of incorrect samples
classified by the ensemble with high confidence is 267, with OR = 0.0018
(the classification accuracy of the ensemble is 93.69%). The confidence distributions are illustrated
in Fig. 5. Due to space limitations, selected incorrect/confusing SVHN images that have
been detected with our approach are shown in the supplemental material.
          </p>
        </sec>
      </sec>
      <sec id="sec-13-3">
        <title>Conclusion</title>
        <p>It is of high interest in computer vision to have a system that can conjecture with
confidence what is wrong and what is right, i.e., to confidently guess which labels are
correctly and/or incorrectly classified. This work is a step in that direction. This paper
presents the use of CNN ensembles to detect incorrect labels in image classification
datasets. Essentially, if the ensemble is confident about a result that is incorrect, either
the sample is indeed visually confusing or it was incorrectly labeled. Probabilistic
confidence analyses showed that some images with incorrect labeling and confusing
content exist. Fig. 6 summarizes the results for CIFAR10 and CIFAR100 and illustrates that the OR
values of the Softmax posterior probabilities are higher than the OR values of the ensemble
posterior probabilities; the lower the OR values the better. Hence, the posterior
probability of a CNN, measured with Softmax, cannot be used to accurately estimate the
per-sample confidence level. Furthermore, the proposed OR analysis provided novel
evidence that batch normalization increases the ensemble confidence, and thus could be
related to improved generalization.</p>
        <p>Fig. 6 (plot not reproduced). Vertical axis: OR, from 0 to 0.1; the horizontal axis begins at "Low".</p>
        <p>Our analyses also agree with previous ensemble works, as the overall accuracy
increased by around 5%, 9%, 2%, and 5% for CIFAR10, CIFAR100, SVHN, and
EMNIST respectively. Based on the proposed probabilistic methods and by making use
of the snapshot ensemble (supplemental material), we are currently building a labeling
verification tool to be implemented in the PyTorch framework. This tool will be useful not
only in labeling verification, but also in semi-supervised and active learning
applications. Evaluations on other datasets are left for future work.</p>
        <p>Supplemental Material</p>
      </sec>
      <sec id="sec-13-4">
        <title>Experimental Setting</title>
        <p>The following parameters have been used in all experiments: dropout probability is 0.2;
maximum number of epochs is 60; learning rate is 0.01 (the learning rate is set to
decrease by half at the milestones {8, 20, 48}, unless mentioned
otherwise); momentum=0.95; the seed was randomly pulled from the time function;
weight-decay=0.0005; Nesterov momentum was used; SGD optimizer; 100 mini-batches,
random shuffling enabled, and Cross Entropy Loss. The run/training, however,
was skipped if the CNN got stuck at a chance-level local minimum (~10% for CIFAR10
and ~1% for CIFAR100), and a new training instance was launched with a new random
seed. To demonstrate the possible variations in training each CNN of the ensemble, we
present the training progress of various VGGs in Fig-Sup. 1.</p>
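        <p>The learning-rate schedule described above can be written as a small step function. This is a sketch under an assumption the text does not state: the milestones are taken as 0-indexed epoch counts from which the halved rate applies, which is the convention of step schedulers such as PyTorch's MultiStepLR. The function name learning_rate and its parameter names are illustrative.</p>
        <preformat><![CDATA[
```python
def learning_rate(epoch, base_lr=0.01, milestones=(8, 20, 48), factor=0.5):
    """Step schedule: multiply the rate by `factor` once per milestone reached."""
    reached = sum(1 for m in milestones if epoch >= m)
    return base_lr * factor**reached


# Learning rate at a few epochs of a 60-epoch run (epochs counted from 0).
for epoch in (0, 8, 20, 48, 59):
    print(epoch, learning_rate(epoch))
```
]]></preformat>
        <p>Under this reading the rate ends at one eighth of its initial value for the last twelve epochs of a 60-epoch run.</p>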
        <p>Fig-Sup. 1. Variations of the training progress of the different ensembles built from CIFAR10 and CIFAR100 data. Training with 50K and testing with 10K (left column) and training with 10K and testing with 50K (right column). Horizontal axes: epochs.</p>
      </sec>
    </sec>
    <sec id="sec-14">
      <sec id="sec-14-2">
        <p>In our preliminary analysis, we built ensembles using different CNN architectures,
including different types of ResNets*, VGGs*, DualPathNets* (DPNs), DenseNets*,
NasNetLarge, etc. However, we chose to build the ensembles from the VGG net family
as VGGs require less training time than the other CNN types, give similar
performance to the ensemble built from different CNN architectures, and yield much
higher accuracy than other CNNs when trained for up to 60 epochs. To give an example,
NasNetLarge requires 9 times the training time of VGG11 and 5 times that of
VGG19BN. The classification accuracy of NasNetLarge reaches 75%, compared to
above 85% for all VGG types, when trained for up to 60 epochs. To clarify further, for a
maximum of 60 epochs, VGG11 reaches 85% accuracy in less than 6 minutes, while
ResNet18 reaches 78% accuracy in 7 minutes, and NasNetLarge reaches 75% accuracy
in 71 minutes. In general, the DPN, SqueezeNet, and ResNet (including ResNeXt*)
families are slower than VGGs and/or reach less than 80% accuracy in 60 epochs.</p>
        <sec id="sec-14-2-1">
          <title>Ensembles Built with a Single VGG Type</title>
          <p>In this experiment, we used 16 classifier instances to see how they perform
compared to using different VGG classifiers. The training has been performed with the 10K
testing set, and the testing has been performed using the 50K training set, as this is more
challenging than using the sets the other way around. Table 4 summarizes
the results. From Table 4 we notice that batch normalization always leads to better
confidence when the same CNN is used, as the OR is lower when using BN; that is, it is
less likely to have false positives with high confidence when BN is used.</p>
          <p>Table 4 (values only; the row labels and column headers were not recovered). First column: 424, 446, 466, 556, 607, 651, 706, 834. Second column: 33073, 33892, 33559, 31506, 33657, 34120, 34156, 32379.</p>
          <p>From Table 4 we see that VGG11 has the weakest performance compared to the other
VGGs; thus, we took this experiment further and built one VGG-E using 128 VGG11s
and another using 128 VGG13BNs, using the CIFAR10 testing set for training. The VGG-E
increased the performance by ~5%, but the confidence levels, as shown by the OR
values, are better when using different VGG models than when using only one VGG model, as
shown in Table 5. Thus, judged by the OR values, a VGG-E built from 128 instances of a single
VGG type is not as good as a VGG-E built from different types of VGGs.</p>
        </sec>
        <sec id="sec-14-2-2">
          <title>Temporal (snapshot) Ensemble (VGG-ET)</title>
          <p>We used VGG19BN to build a temporal ensemble for CIFAR100, training with 50K
and testing with 10K. In this case, each epoch resulted in a classifier. We used 150 epochs
and discarded the results of the first ten epochs, as we wanted the training to reach a
state of stability. Similar to VGG-E, VGG-ET was able to find quite a few
incorrect labels, and to produce decent per-sample confidence values. The VGG-ET
reached an accuracy of 76.8% (slightly lower than VGG-E), and an OR (at high
confidence) of 0.004. Thus, this snapshot/temporal ensemble could be used instead of an
ensemble built from different CNN architectures to build a fast
and efficient labeling verification tool, which we are pursuing as future work. The
confidence distributions are shown in Fig-Sup. 2.</p>
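          <p>The snapshot voting described above can be sketched as follows, assuming that the label predicted for a sample after each epoch has been stored in a list. The function name snapshot_vote, the parameter warmup_epochs, and the toy labels are hypothetical; only the discarding of the first ten epochs and the use of 150 epochs come from the text.</p>
          <preformat><![CDATA[
```python
from collections import Counter


def snapshot_vote(per_epoch_predictions, warmup_epochs=10):
    """Majority vote over the snapshots kept after the warm-up epochs.

    Returns the winning label and the fraction of kept snapshots voting for it.
    """
    kept = per_epoch_predictions[warmup_epochs:]
    label, votes = Counter(kept).most_common(1)[0]
    return label, votes / len(kept)


# Toy history for one sample over 150 epochs: unstable early predictions
# followed by 140 agreeing snapshots.
history = ["otter"] * 10 + ["seal"] * 140
print(snapshot_vote(history))
```
]]></preformat>
          <p>A vote fraction of 1 over the kept snapshots corresponds to the high-confidence condition used for the OR, with each epoch playing the role of one classifier.</p>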
        </sec>
      </sec>
      <sec id="sec-14-3">
        <title>Fig-Sup. 2. Per-sample confidence distribution, incorrectly classified labels (left) and correctly classified labels (right), using temporal (snapshot) ensemble VGG-ET (VGG19BN).</title>
        <sec id="sec-14-3-1">
          <title>Extended Results</title>
          <p>In the tables below, we show a few samples selected from
the incorrectly labeled ones detected by the probability analysis. The label of each
sample and the corresponding predicted label, along with the corresponding image, are
shown. To allow third parties to double-check the incorrectness, readers of this article
may use the index of the sample to examine it in the dataset, i.e. by loading the image
and the corresponding label. The tables contain data from the CIFAR10, CIFAR100,
SVHN, and EMNIST datasets.</p>
          <p>* We found the same image in the training set with index 24083, but with the label 55 (otter). So not only was the same image
included in both the training and testing sets, but it was also given an incorrect/opposite label.</p>
          <table-wrap>
            <table>
              <thead>
                <tr><th>Set</th><th>Index</th><th>Original label (class)</th><th>Predicted label (class)</th></tr>
              </thead>
              <tbody>
                <tr><td>Testing set</td><td>7762</td><td>28 (cup)</td><td>9 (bottle)</td></tr>
                <tr><td>Testing set</td><td>1557</td><td>10 (bowl)</td><td>28 (cup)</td></tr>
              </tbody>
            </table>
          </table-wrap>
        </sec>
      </sec>
      <sec id="sec-14-8">
        <title>Confusing Images Detected in CIFAR10</title>
      </sec>
      <sec id="sec-14-9">
        <title>Image</title>
      </sec>
      <sec id="sec-14-10">
        <title>Remarks</title>
      </sec>
    </sec>
    <sec id="sec-15">
      <title>Hard to tell what this is!</title>
    </sec>
    <sec id="sec-16">
      <title>This is a minivan, probably looks more like a car than a truck</title>
    </sec>
    <sec id="sec-17">
      <title>Probably the label is correct, but the tail is dominating the photo</title>
    </sec>
    <sec id="sec-18">
      <title>Difficult to infer which one is deer and which one is bird, even for the human observer</title>
    </sec>
    <sec id="sec-19">
      <title>Difficult to infer which one is deer and which one is bird, even for the human observer</title>
    </sec>
    <sec id="sec-20">
      <title>Difficult to infer which one is deer and which one is bird, even for the human observer</title>
      <p>2 (bird)
36788
6 (frog)</p>
    </sec>
    <sec id="sec-21">
      <title>Bird and cat in one picture! Classified as dog</title>
    </sec>
    <sec id="sec-22">
      <title>Boat on top of a pull cart with wheels identified as car</title>
    </sec>
    <sec id="sec-23">
      <title>The image is not clear, probably of cat category, but classified as frog</title>
    </sec>
    <sec id="sec-24">
      <title>The image should be of category plane, yet it is not clear; classified as deer</title>
    </sec>
    <sec id="sec-25">
      <title>A truck with a ladder is hard to identify as a truck</title>
    </sec>
    <sec id="sec-26">
      <title>A frog that is hard for a human observer to recognize</title>
    </sec>
    <sec id="sec-27">
      <title>Something that does not look very much like a bird has been identified as deer</title>
      <sec id="sec-27-1">
        <title>Experiments with SVHN</title>
      </sec>
      <sec id="sec-27-2">
        <title>Original label (class)</title>
      </sec>
    </sec>
    <sec id="sec-28">
      <title>This image has two digits, although there should only be one. It was labeled with zero, but the ensemble got the other digit (7) correctly with high confidence.</title>
    </sec>
    <sec id="sec-29">
      <title>As each image should only have one digit, this image has been incorrectly segmented. It was labeled with 3, while the ensemble labeled it as 1 with high confidence.</title>
    </sec>
    <sec id="sec-30">
      <title>The image is incorrectly segmented, as it should have only one digit.</title>
      <p>Interestingly, the attention of the CNN is brought to the center.</p>
    </sec>
    <sec id="sec-31">
      <title>Partially occluded with 5, but the ensemble got it correctly as 6. The original label was incorrect, with a value of 1.</title>
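      <table-wrap>
        <label>Table 13</label>
        <caption>
          <p>Selected incorrectly labeled images detected in EMNIST testing set; predicted label (with high confidence).</p>
        </caption>
        <table>
          <thead>
            <tr><th>Original label (class)</th><th>Predicted label (class)</th></tr>
          </thead>
          <tbody>
            <tr><td>b</td><td>B</td></tr>
            <tr><td>n</td><td>N</td></tr>
            <tr><td>g</td><td>9</td></tr>
            <tr><td>B</td><td>D</td></tr>
            <tr><td>b</td><td>h</td></tr>
            <tr><td>6</td><td>b</td></tr>
            <tr><td>r</td><td>P</td></tr>
          </tbody>
        </table>
        <table-wrap-foot>
          <p>(lower case L; l)</p>
        </table-wrap-foot>
      </table-wrap>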
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Sun</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          , et al.
          <article-title>Revisiting Unreasonable Effectiveness of Data in Deep Learning Era</article-title>
          .
          <source>in 16th IEEE International Conference on Computer Vision</source>
          (ICCV).
          <year>2017</year>
          . Venice, ITALY.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Ju</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bibaut</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.J.</given-names>
            <surname>van der Laan</surname>
          </string-name>
          ,
          <article-title>The Relative Performance of Ensemble Methods with Deep Convolutional Neural Networks for Image Classification</article-title>
          .
          <year>2017</year>
          , http://arxiv.org/abs/1704.01664.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Hansen</surname>
            ,
            <given-names>L.K.</given-names>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>Salamon</surname>
          </string-name>
          ,
          <article-title>Neural Network Ensembles</article-title>
          .
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          ,
          <year>1990</year>
          .
          <volume>12</volume>
          (
          <issue>10</issue>
          ): p.
          <fpage>993</fpage>
          -
          <lpage>1001</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>LeCun</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Hinton</surname>
          </string-name>
          ,
          <article-title>Deep learning</article-title>
          .
          <source>Nature</source>
          ,
          <year>2015</year>
          .
          <volume>521</volume>
          (
          <issue>7553</issue>
          ): p.
          <fpage>436</fpage>
          -
          <lpage>444</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>J.L.</given-names>
          </string-name>
          , et al.
          <article-title>An Ensemble of Convolutional Neural Networks for Image Classification Based on LSTM</article-title>
          .
          <source>in 2017 International Conference on Green Informatics (ICGI)</source>
          .
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Ding</surname>
            ,
            <given-names>C.X.</given-names>
          </string-name>
          and
          <string-name>
            <given-names>D.C.</given-names>
            <surname>Tao</surname>
          </string-name>
          ,
          <article-title>Trunk-Branch Ensemble Convolutional Neural Networks for Video-Based Face Recognition</article-title>
          .
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          ,
          <year>2018</year>
          .
          <volume>40</volume>
          (
          <issue>4</issue>
          ): p.
          <fpage>1002</fpage>
          -
          <lpage>1014</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Schmidhuber</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <article-title>Deep learning in neural networks: An overview</article-title>
          .
          <source>Neural Networks</source>
          ,
          <year>2015</year>
          .
          <volume>61</volume>
          : p.
          <fpage>85</fpage>
          -
          <lpage>117</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Minetto</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Pamplona Segundo</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Sarkar</surname>
          </string-name>
          .
          <article-title>Hydra: an Ensemble of Convolutional Neural Networks for Geospatial Land Classification</article-title>
          .
          <year>2018</year>
          , https://arxiv.org/abs/1802.03518.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>G.B.</given-names>
          </string-name>
          , et al.
          <article-title>Ensemble Application of Convolutional and Recurrent Neural Networks for Multi-label Text Categorization</article-title>
          . in
          <source>International Joint Conference on Neural Networks (IJCNN)</source>
          .
          <year>2017</year>
          . Anchorage, AK.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Duan</surname>
            ,
            <given-names>M.X.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>K.L.</given-names>
            <surname>Li</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.Q.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>An Ensemble CNN2ELM for Age Estimation</article-title>
          .
          <source>IEEE Transactions on Information Forensics and Security</source>
          ,
          <year>2018</year>
          .
          <volume>13</volume>
          (
          <issue>3</issue>
          ): p.
          <fpage>758</fpage>
          -
          <lpage>772</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Diaz-Vico</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          , et al.,
          <article-title>Deep Neural Networks for Wind and Solar Energy Prediction</article-title>
          .
          <source>Neural Processing Letters</source>
          ,
          <year>2017</year>
          .
          <volume>46</volume>
          (
          <issue>3</issue>
          ): p.
          <fpage>829</fpage>
          -
          <lpage>844</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Laine</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          and
          <string-name>
            <given-names>T.</given-names>
            <surname>Aila</surname>
          </string-name>
          ,
          <article-title>Temporal Ensembling for Semi-Supervised Learning</article-title>
          ,
          <source>in International Conference on Learning Representations (ICLR)</source>
          .
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>Y.Z.</given-names>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Loog</surname>
          </string-name>
          ,
          <article-title>A variance maximization criterion for active learning</article-title>
          .
          <source>Pattern Recognition</source>
          ,
          <year>2018</year>
          .
          <volume>78</volume>
          : p.
          <fpage>358</fpage>
          -
          <lpage>370</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Reyes</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>A.H.</given-names>
            <surname>Altalhi</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Ventura</surname>
          </string-name>
          ,
          <article-title>Statistical comparisons of active learning strategies over multiple datasets</article-title>
          .
          <source>Knowledge-Based Systems</source>
          ,
          <year>2018</year>
          .
          <volume>145</volume>
          : p.
          <fpage>274</fpage>
          -
          <lpage>288</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>K.Z.</given-names>
          </string-name>
          , et al.,
          <article-title>Cost-Effective Active Learning for Deep Image Classification</article-title>
          .
          <source>IEEE Transactions on Circuits and Systems for Video Technology</source>
          ,
          <year>2017</year>
          .
          <volume>27</volume>
          (
          <issue>12</issue>
          ): p.
          <fpage>2591</fpage>
          -
          <lpage>2600</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Freund</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          , et al.,
          <article-title>Selective sampling using the query by committee algorithm</article-title>
          .
          <source>Machine Learning</source>
          ,
          <year>1997</year>
          .
          <volume>28</volume>
          (
          <issue>2-3</issue>
          ): p.
          <fpage>133</fpage>
          -
          <lpage>168</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Burbidge</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>J.J.</given-names>
            <surname>Rowland</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.D.</given-names>
            <surname>King</surname>
          </string-name>
          ,
          <article-title>Active learning for regression based on query by committee</article-title>
          .
          <source>Intelligent Data Engineering and Automated Learning - IDEAL 2007</source>
          ,
          <year>2007</year>
          .
          <volume>4881</volume>
          : p.
          <fpage>209</fpage>
          -
          <lpage>218</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Bishop</surname>
            ,
            <given-names>C.M.</given-names>
          </string-name>
          ,
          <source>Pattern Recognition and Machine Learning (Information Science and Statistics)</source>
          .
          <year>2006</year>
          : Springer-Verlag New York, Inc.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Monteith</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          , et al.
          <article-title>Turning Bayesian Model Averaging Into Bayesian Model Combination</article-title>
          . in
          <source>International Joint Conference on Neural Networks (IJCNN)</source>
          .
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Shiraishi</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          and
          <string-name>
            <given-names>K.</given-names>
            <surname>Fukumizu</surname>
          </string-name>
          ,
          <article-title>Statistical approaches to combining binary classifiers for multi-class classification</article-title>
          .
          <source>Neurocomputing</source>
          ,
          <year>2011</year>
          .
          <volume>74</volume>
          (
          <issue>5</issue>
          ): p.
          <fpage>680</fpage>
          -
          <lpage>688</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Rao</surname>
            ,
            <given-names>P.S.</given-names>
          </string-name>
          ,
          <article-title>Proportions, Odds Ratios and Relative Risks</article-title>
          ,
          <source>in Statistical Methodologies with Medical Applications</source>
          .
          <year>2017</year>
          , Wiley.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>Krizhevsky</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <source>Learning Multiple Layers of Features from Tiny Images</source>
          ,
          <source>Technical Report</source>
          .
          <year>2009</year>
          , http://www.cs.toronto.edu/~kriz/cifar.html: Canadian Institute for Advanced Research.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <surname>Paszke</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , et al.
          <article-title>Automatic differentiation in PyTorch</article-title>
          .
          <source>in NIPS 2017 Autodiff Workshop</source>
          .
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24.
          <string-name>
            <surname>Simonyan</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Zisserman</surname>
          </string-name>
          .
          <article-title>Very Deep Convolutional Networks for Large-Scale Image Recognition</article-title>
          .
          <source>in International Conference on Learning Representations</source>
          .
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          25.
          <string-name>
            <surname>Schapire</surname>
            ,
            <given-names>R.E.</given-names>
          </string-name>
          ,
          <source>The Strength of Weak Learnability. Machine Learning</source>
          ,
          <year>1990</year>
          .
          <volume>5</volume>
          (
          <issue>2</issue>
          ): p.
          <fpage>197</fpage>
          -
          <lpage>227</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          26.
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          and
          <string-name>
            <given-names>F.X.</given-names>
            <surname>Li</surname>
          </string-name>
          .
          <article-title>Adversarial Examples Detection in Deep Networks with Convolutional Filter Statistics</article-title>
          .
          <source>in 16th IEEE International Conference on Computer Vision</source>
          (ICCV).
          <year>2017</year>
          . Venice, ITALY.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          27.
          <string-name>
            <surname>Cohen</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          , et al.,
          <article-title>EMNIST: an extension of MNIST to handwritten letters</article-title>
          . (
          <year>2017</year>
          ): https://www.nist.gov/itl/iad/image-group/emnistdataset http://arxiv.org/abs/1702.05373.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          28.
          <string-name>
            <surname>Netzer</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          , et al.,
          <article-title>Reading Digits in Natural Images with Unsupervised Feature Learning</article-title>
          ,
          <source>in NIPS Workshop on Deep Learning and Unsupervised Feature Learning</source>
          .
          <year>2011</year>
          : http://ufldl.stanford.edu/housenumbers/.
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          <article-title>2 (bird) 4 (deer) 9503 2 (bird) 4 (deer) 9 (truck) 2 (bird) 4 (deer)</article-title>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>