=Paper= {{Paper |id=Vol-2787/paper6 |storemode=property |title=Curriculum Learning with Diversity for Supervised Computer Vision Tasks |pdfUrl=https://ceur-ws.org/Vol-2787/paper6.pdf |volume=Vol-2787 |authors=Petru Soviany |dblpUrl=https://dblp.org/rec/conf/ecai/Soviany20 }} ==Curriculum Learning with Diversity for Supervised Computer Vision Tasks== https://ceur-ws.org/Vol-2787/paper6.pdf
Eleventh International Workshop Modelling and Reasoning in Context (MRC) @ECAI 2020                                                               37




                          Curriculum Learning with Diversity
                         for Supervised Computer Vision Tasks
                                                                 Petru Soviany1


Abstract. Curriculum learning techniques are a viable solution for improving the accuracy of automatic models, by replacing the traditional random training with an easy-to-hard strategy. However, the standard curriculum methodology does not automatically provide improved results; it is constrained by multiple elements, such as the data distribution or the proposed model. In this paper, we introduce a novel curriculum sampling strategy which takes into consideration the diversity of the training data together with the difficulty of the inputs. We determine the difficulty using a state-of-the-art estimator based on the human time required for solving a visual search task. We consider this kind of difficulty metric to be better suited for solving general problems, as it is not based on certain task-dependent elements, but rather on the context of each image. We ensure diversity during training by giving higher priority to elements from less visited classes. We conduct object detection and instance segmentation experiments on the Pascal VOC 2007 and Cityscapes data sets, surpassing both the randomly-trained baseline and the standard curriculum approach. We show that our strategy is very efficient for unbalanced data sets, leading to faster convergence and more accurate results when other curriculum-based strategies fail.


1 Introduction

Although the accuracy of automatic models has increased greatly with the development of deep and very deep neural networks, an important and less studied key element for the overall performance is the training strategy. In this regard, Bengio et al. [2] introduced curriculum learning (CL), a set of learning strategies inspired by the way in which humans teach and learn. People learn the easiest concepts first, followed by more and more complex elements. Similarly, CL uses the difficulty context, feeding the automatic model with easier samples at the beginning of the training and gradually adding more difficult data as the training proceeds.

The idea is straightforward, but an important question is how to determine whether a sample is easy or hard. CL requires the existence of a predefined metric which can compute the difficulty of the input examples. Still, the difficulty of an image is strongly related to the context: a big car in the middle of an empty street should be easier to detect than a small car parked in the corner of an alley full of pedestrians. Instead of building hand-crafted models for retrieving contextual information, in this paper we use the image difficulty estimator from [12], which is based on the amount of time required by human annotators to assess whether a class is present in a certain image. We consider that people can understand the full context very accurately, and that a difficulty measure trained on this information can be useful in our setting.

The next challenge is building the curriculum schedule, i.e. the rate at which we can augment the training set with more complex information. To address this problem, we follow a sampling strategy similar to the one introduced in [28]. Based on the difficulty score, we sample according to a probability function which favors easier samples in the first iterations, but converges to give the same weight to all the examples in the later phases of the training. Still, the probability of sampling a harder example in the first iterations is not null, and the more difficult samples which are occasionally picked increase the diversity of the data and help training.

The above-mentioned methodology should work well for balanced data sets, as various curriculum sampling strategies have been successfully employed in the literature [19, 28, 34, 37], but it can fail when the data is unbalanced. Ionescu et al. [12] show that some classes may be more difficult than others. A simple motivation for this may be the context in which each class appears. For example, a potted plant or a bottle is rarely the focus of attention, usually being placed somewhere in the background. Other classes of objects, such as tables, are usually occluded, with the pictures focusing on the objects on the table rather than on the piece of furniture itself. This can make a standard curriculum sampling strategy neglect examples from certain classes and slow down training. The problem becomes even more serious in a context where the data is biased towards the easier classes. To solve these issues, we add a new term to our sampling function which takes into consideration the classes of the elements already sampled, in order to emphasize images from less-visited classes and ensure the diversity of the selected examples.

The importance of diversity can be easily explained by comparing our machine learning approach to real-life examples. For instance, when creating a new vaccine, researchers need to experiment on multiple variants of the virus, then test it on a diverse group of people. As a rule, in all sciences, before making any assumptions, researchers have to examine a diverse set of examples which are relevant to the actual data distribution. Similar to vaccines, which must be efficient for as many people as possible, we want our curriculum model to work well on all object classes. We argue that this is not possible in unbalanced curriculum scenarios, and it is slower in the traditional random training setup.

Since it is a sampling procedure, our CL approach can be applied to any supervised task in machine learning. In this paper, we focus on object detection and instance segmentation, two of the main tasks in computer vision, which require the model to identify the class and the location of objects in images. To test the validity of our approach, we experiment on two data sets: Pascal VOC 2007 [4] and Cityscapes [3], and compare our curriculum with diversity strategy

1 University of Bucharest, Department of Computer Science, Romania, email: petru.soviany@yahoo.com




Copyright c 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).



against the standard random training method, a curriculum sampling (without diversity) procedure and an inverse-curriculum approach, which selects images from hard to easy. We employ a state-of-the-art Faster R-CNN [24] detector with a Resnet-101 [11] backbone for the object detection experiments, and a Mask R-CNN [10] model based on Resnet-50 for instance segmentation.

Our main contributions can be summarized as follows:
1. We illustrate the necessity of adding diversity when using CL on unbalanced data sets;
2. We introduce a novel curriculum sampling function, which takes into consideration the class-diversity of the training samples and improves results when traditional curriculum approaches fail;
3. We prove our strategy by experimenting on two computer vision tasks, object detection and instance segmentation, using two data sets of high interest.

We organize the rest of this paper as follows: in Section 2, we present the most relevant related works and compare them with our approach. In Section 3, we explain in detail the methodology we follow. We present our results in Section 4, and draw our conclusion and discuss possible future work in the last section.


2 Related Work

Curriculum learning. Bengio et al. [2] introduced the idea of curriculum learning (CL) to train artificial intelligence, proving that the standard learning paradigm used in human educational systems can also be applied to automatic models. CL represents a class of easy-to-hard approaches which have successfully been employed in a wide range of machine learning applications, from natural language processing [8, 16, 19, 21, 31] to computer vision [6, 7, 9, 15, 18, 27, 35] and audio processing [1, 22].

One of the main limitations of CL is that it assumes the existence of a predefined metric which can rank the samples from easy to hard. These metrics are usually task-dependent, with various solutions being proposed for each task. For example, in text processing, the length of the sentence can be used to estimate the difficulty of the input (shorter sentences are easier) [21, 30], while the number and the size of objects in a certain sample can provide enough insight about difficulty in image processing tasks (images with few large objects are easier) [27, 29]. In our paper, we employ the image difficulty estimator of Ionescu et al. [12], which was trained considering the time required by human annotators to identify the presence of certain classes in images.

To alleviate the challenge of finding a predefined difficulty metric, Kumar et al. [17] introduce self-paced learning (SPL), a set of approaches in which the model ranks the samples from easy to hard during training, based on its current progress. For example, the inputs with a smaller loss at a certain time during training are easier than the samples with a higher loss. Many papers apply SPL successfully [26, 32, 33], and some methods combine prior knowledge with live training information, creating self-paced with curriculum techniques [14, 36]. Even so, SPL still has some limitations, requiring a methodology for how to select the samples and how much to emphasize easier examples. Our approach is on the borderline between CL and SPL, but we consider it to be pure curriculum, although we use training information to favor less visited classes. During training, we only count the labels of the training samples, which is a priori information, not the learning progress. A similar system could iteratively select examples from every class, but this would force our model to process the same number of examples from each class. Instead, by using the class-diversity as a term in our difficulty-based sampling probability function, we impose the selection of easy-to-hard diverse examples, without massively altering the actual class distribution of the data set.

The easy-to-hard idea behind CL can be implemented in multiple ways. One option is to start training on the easiest set of images, while gradually adding more difficult batches [2, 7, 16, 27, 30, 37]. Although most of the models keep the visited examples in the training set, Kocmi et al. [16] suggest reducing the size of each bin until combining it with the following one, in order to use each example only once during an epoch. In [19, 28] the authors propose a sampling strategy according to a probability function which favors easier examples in the first iterations. As the authors show, the easiness score from [28] could also be added as a new term to the loss function, to emphasize the easier examples at the beginning of the training. In this paper, we enhance their sampling strategy by adding a new diversity term to the probability function used to select training examples.

Figure 1. Number of instances from each class in the trainval split of the Pascal VOC 2007 data set.

Despite leading to good results in many related papers, the standard CL procedure is highly influenced by the task and the data distribution. Simple tasks may not gain much from using curriculum approaches, while employing CL on unbalanced data sets can lead to slower convergence. To address the second problem, Wang et al. [34] introduce a CL framework which adaptively adjusts the sampling strategy and loss weight in each batch, while other papers [13, 25] argue that a key element is diversity. Jiang et al. [13] introduce an SPL with diversity technique in which they regularize the model using both difficulty information and the variety of the samples. They suggest using clustering algorithms to split the data into diverse groups. Sachan et al. [25] measure diversity using the angle between the hyperplanes the samples induce in the feature space. They choose the examples that optimize a convex combination of the curriculum learning objective and the sum of angles between the candidate samples and the examples selected in previous steps. In our model, we define diversity based on the classes of our data. We combine our predefined difficulty metric with a score which favors images from less visited classes, in order to sample easy and diverse examples at the beginning of the training, then gradually add more complex elements. Our idea works well for supervised tasks, but it can be extended to unsupervised learning by replacing the ground-truth labels







with a clustering model, as suggested in [13]. Figure 1 presents the class distribution of the Pascal VOC 2007 data set [4], which is heavily biased towards class person.

Object detection is the task of predicting the location and the class of objects in certain images. As noted in [29], state-of-the-art object detectors can be split into two main categories: two-stage and single-stage models. The two-stage object detectors [10, 24] use a Region Proposal Network to generate regions of interest, which are then fed to another network for object localization and classification. The single-stage approaches [20, 23] take the whole image as input and solve the problem like a regular regression task. These methods are usually faster, but less accurate than the two-stage designs. Instance segmentation is similar to object detection, but more complex, requiring the generation of a mask instead of a bounding box for the objects in the test image. Our strategy can be implemented using any detection and segmentation models, but, in order to increase the relevance of our results, we experiment with high-quality Faster R-CNN [24] and Mask R-CNN [10] baselines.


3 Methodology

Training artificial intelligence using curriculum approaches, from easy to hard, can lead to improved results in a wide range of tasks [1, 6, 7, 8, 9, 15, 16, 18, 19, 21, 22, 27, 31, 35]. Still, it is not simple to determine which samples are easy or hard, and the available metrics are usually task-dependent. Another challenge of CL is finding the right curriculum schedule, i.e. how fast to add more difficult examples to the training, and how to introduce the right amount of harder samples at the right time to positively influence convergence. In this section, we present our approach for estimating difficulty and our curriculum sampling strategies.

3.1 Difficulty estimation

To estimate the difficulty of our training examples, we employ the method of Ionescu et al. [12], who defined image difficulty as the human time required for solving a visual search task. They collected annotations for the Pascal VOC 2012 [5] data set by asking annotators whether a class was present or not in a certain image. They recorded the time people required for answering these questions, which they normalized and fed as training data for a regression model. Their results correlate well with other difficulty metrics which take into consideration the number of objects, the size of the objects, or the occlusions. Because it is based on human annotations, this method takes into account the whole image context, not only certain features relevant for one problem (the number of objects, for example). This makes the model task-independent, and, as a result, it has been successfully employed in multiple vision problems [12, 28, 29]. To further prove the efficiency of the estimator for our task, we show that automatic models have a lower accuracy on difficult examples. We split the Pascal VOC 2007 [4] test set into three equal batches: easy, medium and hard, and run the baseline model on each of them. The results in Table 1 confirm that the AP lowers as the difficulty increases.

Table 1. Average Precision scores for object detection using the baseline Faster R-CNN, on easy, medium and hard splits of the Pascal VOC 2007 test set, as estimated using our approach.

    Difficulty    mAP (in %)
    Easy          72.93
    Medium        72.16
    Hard          67.03

We follow the strategy of Ionescu et al. as described in the original paper [12] to determine the difficulty scores of the images in our data sets. These scores have values around 3, with a larger score defining a more difficult sample. We translate the values to the interval [−1, 1] using Equation 1 to simplify the usage of the score in the next steps. Figure 2 shows some examples of easy and difficult images.

    Scale_min-max(x) = 2 · (x − min(x)) / (max(x) − min(x)) − 1    (1)

Figure 2. Easy and difficult images from Pascal VOC 2007 and Cityscapes according to our estimation.

3.2 Curriculum sampling

Soviany et al. [28] introduce a curriculum sampling strategy which favors easier examples in the first iterations and converges as the training progresses. It has the advantage of being a continuous method, removing the necessity of a curriculum schedule for enhancing the difficulty-based batches. Furthermore, the fact that it is a probabilistic sampling method does not constrain the model to only select easy examples in the first iterations, as batching does, but adds more diversity in data selection. We follow their approach in building our curriculum sampling strategy, with only a small change in the position of parameter k, in order to better emphasize the difficulty of the examples. We use the following function to assign weights to the input images during training:

    w(x_i, t) = (1 − diff(x_i) · e^(−γ·t))^k, ∀x_i ∈ X,    (2)

where x_i is a training example from the data set X, t is the current iteration, and diff(x_i) is the difficulty score associated with the selected sample. γ is a parameter which sets how fast the function converges to 1, while k sets how much to emphasize the easier examples. Our function differs from the one proposed in [28] in the position of the k parameter. We consider that we can take advantage of the properties of the power function, which increases faster for numbers greater than one. Since 1 − diff(x_i) · e^(−γ·t) ∈ [0, 2], and the result is > 1 for easier examples, our function will focus more on the easier samples in the first iterations. As the training advances, the function converges to 1, so all examples will have the same probability of being selected in the later phases of the training. We transform the weights into probabilities and sample accordingly.

3.3 Curriculum with diversity sampling

As [13, 25] note, applying a CL strategy does not guarantee improved quality, as the diversity of the selected samples has a great impact on the final results. A simple example is the case in which the data set is biased, having fewer samples of certain classes. Since some classes are more difficult than others [12], if the data set is not well balanced, the model will not visit the harder classes until the later stages of the training. Thus, the model will not perform well on classes it did not visit. This fact is generally valid in all kinds of applications, even in real-life reasoning: without seeing examples which match the whole data distribution, it is impossible to find the solution suited for all scenarios. Because of this, we enhance our sampling method by adding a new term, which is based on the diversity of the examples.

Our diversity scoring algorithm is simple, taking into consideration the classes of the selected samples. During training, we count the number of visited objects from each class (num_objects(c)). We subtract the mean of the values to determine how often each class was visited, as formally presented in Equation 3. We scale and translate the results to [−1, 1] using Equation 1 to get the score of each class; then, for every image, we compute the image-level diversity by averaging the class scores of the objects in its ground-truth labels (Equation 4).

    visited(c_i) = num_objects(c_i) − (Σ_{c_j ∈ C} num_objects(c_j)) / |C|, ∀c_i ∈ C.    (3)

    imgVisited(x_i) = (Σ_{obj ∈ objects(x_i)} visited(class(obj))) / |objects(x_i)|, ∀x_i ∈ X.    (4)
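To make the sampling procedure concrete, the quantities defined in Equations 1-4, together with the combined weight of Equation 5, can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the code used in our experiments; all function and variable names are our own, and we assume that the raw difficulty scores and the per-image ground-truth class labels are already available.

```python
import numpy as np

def scale_min_max(x):
    """Equation 1: rescale values linearly to the interval [-1, 1]."""
    x = np.asarray(x, dtype=np.float64)
    return 2.0 * (x - x.min()) / (x.max() - x.min()) - 1.0

def curriculum_weights(diff, t, gamma=6e-5, k=5):
    """Equation 2: w(x_i, t) = (1 - diff(x_i) * exp(-gamma * t))^k.

    diff lies in [-1, 1] (easy ~ -1, hard ~ +1), so the base lies in
    [0, 2]: easy samples get weights > 1 in the first iterations, and
    every weight converges to 1 as t grows.
    """
    return (1.0 - diff * np.exp(-gamma * t)) ** k

def class_scores(num_objects):
    """Equations 3 + 1: centered visit counts, rescaled to [-1, 1].

    num_objects[c] counts the objects of class c sampled so far, so
    rarely visited classes end up near -1, frequent ones near +1.
    """
    visited = num_objects - num_objects.mean()
    span = visited.max() - visited.min()
    if span == 0:                      # all classes equally visited
        return np.zeros_like(visited)
    return 2.0 * (visited - visited.min()) / span - 1.0

def img_visited(image_classes, scores):
    """Equation 4: average class score over an image's ground-truth objects."""
    return np.mean([scores[c] for c in image_classes])

def combined_weights(diff, img_div, t, alpha=0.5, gamma=6e-5, k=5):
    """Equation 5: blend difficulty and diversity; alpha sets the trade-off."""
    decay = np.exp(-gamma * t)
    return (1.0 - alpha * diff * decay - (1.0 - alpha) * img_div * decay) ** k

def sample_batch(weights, batch_size, rng):
    """Normalize the weights to probabilities and draw a batch of indices."""
    p = weights / weights.sum()
    return rng.choice(len(weights), size=batch_size, replace=False, p=p)
```

For instance, with visit counts [100, 10, 10] over three classes, class_scores returns [1, -1, -1], so an image containing only objects of the two rare classes receives img_visited = -1 and, through Equation 5, a boosted sampling weight in the early iterations.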


In our diversity algorithm we want to emphasize the images containing objects from less visited classes, i.e. with a small imgVisited value, closer to −1. We compute a scoring function similar to Equation 2, which also takes into consideration how often a class was visited, in order to add diversity:

    w(x_i, t) = [1 − α · (diff(x_i) · e^(−γ·t)) − (1 − α) · (imgVisited(x_i) · e^(−γ·t))]^k,    (5)

where α controls the impact of each component, the difficulty and the diversity, while the rest of the notation follows Equation 2. We transform the weights into probabilities by dividing them by their sum, and we sample accordingly.

Figure 3. Difficulty of classes in Pascal VOC 2007 according to our estimation. Best viewed in color.


4 Experiments

4.1 Data sets

In order to test the validity of our method, we experiment on two data sets: Pascal VOC 2007 [4] and Cityscapes [3]. We conduct detection experiments on 20 classes, training on the 5011 images from the Pascal VOC 2007 trainval split. We perform the evaluation on the test split, which contains 4952 images. For our instance segmentation experiments, we use the Cityscapes data set, which contains eight labeled object classes: person, rider, car, truck, bus, train, motorcycle, bicycle. We train on the training set of 2975 images and we evaluate on the validation split of 500 images.

Figure 4. Number of objects from each class sampled during our training on Pascal VOC 2007. The first row shows the curriculum sampling method and the second row the curriculum with diversity approach. We present the first 30000 iterations for each case, with histograms generated every 10k steps.

4.2 Baselines and configuration

We build our method on top of the Faster R-CNN [24] and Mask R-CNN [10] implementations available at: https://github.com/facebookresearch/maskrcnn-benchmark. For our detection experiments, we use Faster R-CNN with a Resnet-101 [11] backbone, while for segmentation we employ the Resnet-50 backbone on the Mask R-CNN model. We use the configurations available on the web site, with the learning rate adjusted for training with a batch size of 4. In our sampling procedure (Equation 5) we set α = 0.5, γ = 6 · 10^−5, and k = 5. We do not compare with other models, because the goal of our paper is not surpassing the state of the art, but improving the quality of our baseline model. We also present the results of a hard-to-easy sampling, in order to prove the efficiency of the easy-to-hard curriculum approaches inspired by human learning.

4.3 Evaluation metrics

We evaluate our results using the mean Average Precision (mAP). The AP score is given by the area under the precision-recall curve for the detected objects. The Pascal VOC 2007 [4] metric is the mean of the precision values at a set of 11 equally spaced recall levels, from 0 to 1, with a step size of 0.1. The Cityscapes [3] metric computes the average precision at the region level for each class and averages it across 10 different overlaps ranging from 0.5 to 0.95 in steps of 0.05. We also report results on Cityscapes using AP50% and AP75%, which correspond to overlap values of 50% and 75%, respectively. Since the exact evaluation protocol has some differences for each data set, we use the Pascal VOC 2007 [4] metric for the detection experiments and the Cityscapes [3] metric for the instance segmentation results. We use the evaluation code available at https://github.com/facebookresearch/maskrcnn-benchmark. More details about the evaluation metrics can be found in the original papers [3, 4].

Figure 5. Evolution of mAP during training on Pascal VOC 2007 for object detection. Best viewed in color.
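As an illustration of the Pascal VOC 2007 protocol described above, the 11-point interpolated AP can be sketched as follows. This is only a sketch for clarity (the reported numbers come from the maskrcnn-benchmark evaluation code), and the function name is ours:

```python
import numpy as np

def voc_ap_11_point(recall, precision):
    """11-point Pascal VOC 2007 AP: mean of the interpolated precision
    at the recall levels 0.0, 0.1, ..., 1.0.

    recall and precision are arrays over ranked detections; at each
    level r we take the maximum precision among points with recall >= r
    (0 when that recall level is never reached).
    """
    ap = 0.0
    for r in np.linspace(0.0, 1.0, 11):
        mask = recall >= r
        ap += precision[mask].max() / 11.0 if mask.any() else 0.0
    return ap
```

A detector with precision 1 at every recall level obtains an AP of 1, while one that never exceeds recall 0.1 can score at most 2/11 ≈ 0.18.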








  Figure 6. Difficulty of the images samples during our training on Pascal VOC 2007. On the left it is presented the curriculum sampling method and on the
right the curriculum with diversity approach. We present the first 40000 iterations for each case, with histograms generated from 10k to 10k steps. Best viewed
                                                                             in color.



4.4    Results and discussion

The class distribution of the objects in Pascal VOC 2007 clearly favors class person, with 4690 instances, while classes diningtable and bus only contain 215 and 229 instances, respectively. This would not be a problem if the difficulty of the classes was similar, because we can assume the test data set has a matching distribution, but this is not the case, as shown in Figure 3.
   Figure 4 presents how the two sampling methods behave during training on the Pascal VOC 2007 data set. In the first 10k iterations, curriculum sampling selects images with almost 20k objects from class person and only 283 instances from class diningtable. By adding diversity, we lower the gap between classes, reaching 10k objects of class person and 1000 instances of tables. This behaviour continues as the training progresses, with the differences between classes being smaller when adding diversity. It is important to note that we do not want to sample the exact same number of objects from each class, but to keep the class distribution of the actual data set, while feeding the model with enough details about every class. Figure 6 shows the difficulty of the examples sampled according to our strategies. We observe that adding diversity does not break our curriculum learning schedule: the examples are still selected from easy to hard.
   To further prove the efficiency of our method, we compute the AP on both the object detection and instance segmentation tasks. The results are presented in Tables 2 and 3.
   We repeat our object detection experiments five times and average the results, in order to ensure their relevance. The sampling with diversity approach provides an improvement of 0.69% over the standard curriculum method, and of 0.79% over the randomly-trained baseline. Although the improvement is not large, we can observe that by adding diversity we boost the accuracy where the standard method would fail, without much effort. Our experiments with an inverse curriculum approach, from hard to easy, lead to the worst results, showing the utility of presenting the training samples in a meaningful order, similar to the way people learn.
   Moreover, Figure 5 illustrates the evolution of the AP during training. The curriculum with diversity approach has superior results over the baseline from the beginning to the end of the training. As the figure shows, the difference between the two methods increases in the later stages of the training. A simple reason for this behaviour is the fact that the curriculum strategy is fed with new, more difficult examples as the training progresses, continuously improving the accuracy of the model. On the other hand, the standard random procedure receives all the information from the beginning, reaching a plateau early during training. The standard CL method starts from lower scores precisely because it does not visit enough samples from the more difficult classes in the early stages of the training. For instance, after 5000 iterations, the AP of the standard CL approach on class diningtable was 0. Thus, by adding diversity, our model converges faster than the traditional methods.

Table 2. Average Precision scores for object detection on the Pascal VOC 2007 data set.

    Model                                              mAP (in %)
    Faster R-CNN (baseline)                          72.28 ± 0.34
    Faster R-CNN with curriculum sampling            72.38 ± 0.32
    Faster R-CNN with inverse curriculum sampling    70.89 ± 0.53
    Faster R-CNN with diverse curriculum sampling    73.07 ± 0.28

Table 3. Average Precision scores for instance segmentation on the Cityscapes data set.

    Model                            AP      AP50%    AP75%
    Faster R-CNN (baseline)         38.72    69.15    34.95
    Curriculum sampling             38.47    69.88    35.01
    Inverse curriculum sampling     37.40    68.17    34.22
    Diverse curriculum sampling     39.12    69.86    35.40

   The instance segmentation results on the Cityscapes data set confirm the conclusion from our previous experiments. As Table 3 shows, the curriculum with diversity is again the optimal method,







surpassing the baseline by 0.4% in AP, 0.71% in AP50%, and 0.45% in AP75%. It is interesting to point out that, although the diverse curriculum approach has a better AP and AP75% than the standard CL method, the latter surpasses our method by 0.02% when evaluated using AP50%. The inverse curriculum approach has the worst scores again, strengthening our statements on the utility of curriculum learning and the importance of providing training examples in a meaningful order.


5    Conclusion and future work

In this paper, we presented a simple method of optimizing curriculum learning approaches on imbalanced data sets. We consider that the diversity of the selected examples is just as important as their difficulty, and neglecting this fact may slow down training for the more difficult classes. We introduced a novel sampling function, which uses the classes of the visited examples together with a difficulty score to ensure both the curriculum schedule and the diversity of the selection. Our object detection and instance segmentation experiments, conducted on two data sets of high interest, prove the superiority of our method over the randomly-trained baseline and over the standard CL approach. A benefit of our methodology is that it can be used on top of any deep learning model, for any supervised task. Diversity can be a key element in overcoming one of the shortcomings of CL, and may lead to the replacement of traditional random training and a wider adoption of meaningful sample selection. In future work, we plan to study more difficulty measures, in order to build an extensive view of how the chosen metric affects the performance of our system. Furthermore, we aim to conduct an ablation study on the parameter choice and find better ways of selecting the right parameter values. Another important aspect we are considering is extending the framework to unsupervised tasks, by introducing a novel method of computing the diversity of the examples.


REFERENCES

 [1] Dario Amodei et al., ‘Deep speech 2: End-to-end speech recognition in English and Mandarin’, in Proceedings of ICML, pp. 173–182, (2016).
 [2] Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston, ‘Curriculum learning’, in Proceedings of ICML, pp. 41–48, (2009).
 [3] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele, ‘The Cityscapes dataset for semantic urban scene understanding’, in Proceedings of CVPR, (2016).
 [4] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results. http://www.pascal-network.org/challenges/VOC/voc2007/workshop/index.html.
 [5] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results. http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html.
 [6] C. Gong, D. Tao, S. J. Maybank, W. Liu, G. Kang, and J. Yang, ‘Multi-modal curriculum learning for semi-supervised image classification’, IEEE Transactions on Image Processing, 25(7), 3249–3260, (2016).
 [7] L. Gui, T. Baltrušaitis, and L. Morency, ‘Curriculum learning for facial expression recognition’, in Proceedings of FG, pp. 505–511, (2017).
 [8] Junliang Guo, Xu Tan, Linli Xu, Tao Qin, Enhong Chen, and Tie-Yan Liu, ‘Fine-tuning by curriculum learning for non-autoregressive neural machine translation’, arXiv preprint arXiv:1911.08717, (2019).
 [9] Guy Hacohen and Daphna Weinshall, ‘On the power of curriculum learning in training deep networks’, in Proceedings of ICML, (2019).
[10] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick, ‘Mask R-CNN’, in Proceedings of ICCV, pp. 2961–2969, (2017).
[11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, ‘Deep residual learning for image recognition’, in Proceedings of CVPR, pp. 770–778, (2016).
[12] Radu Tudor Ionescu, Bogdan Alexe, Marius Leordeanu, Marius Popescu, Dim P. Papadopoulos, and Vittorio Ferrari, ‘How hard can it be? Estimating the difficulty of visual search in an image’, in Proceedings of CVPR, pp. 2157–2166, (2016).
[13] Lu Jiang, Deyu Meng, Shoou-I Yu, Zhenzhong Lan, Shiguang Shan, and Alexander Hauptmann, ‘Self-paced learning with diversity’, in Proceedings of NIPS, pp. 2078–2086, (2014).
[14] Lu Jiang, Deyu Meng, Qian Zhao, Shiguang Shan, and Alexander G. Hauptmann, ‘Self-paced curriculum learning’, in Proceedings of AAAI, (2015).
[15] Lu Jiang, Zhengyuan Zhou, Thomas Leung, Li-Jia Li, and Li Fei-Fei, ‘MentorNet: Learning data-driven curriculum for very deep neural networks on corrupted labels’, in Proceedings of ICML, pp. 2304–2313, (2018).
[16] Tom Kocmi and Ondřej Bojar, ‘Curriculum learning and minibatch bucketing in neural machine translation’, in Proceedings of RANLP, pp. 379–386, (2017).
[17] M. Pawan Kumar, Benjamin Packer, and Daphne Koller, ‘Self-paced learning for latent variable models’, in Proceedings of NIPS, pp. 1189–1197, (2010).
[18] Siyang Li, Xiangxin Zhu, Qin Huang, Hao Xu, and C.-C. Jay Kuo, ‘Multiple instance curriculum learning for weakly supervised object detection’, in Proceedings of BMVC. BMVA Press, (2017).
[19] Cao Liu, Shizhu He, Kang Liu, and Jun Zhao, ‘Curriculum learning for natural answer generation’, in Proceedings of IJCAI, pp. 4223–4229, (2018).
[20] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C. Berg, ‘SSD: Single shot multibox detector’, in Proceedings of ECCV, pp. 21–37. Springer, (2016).
[21] Emmanouil Antonios Platanios, Otilia Stretcu, Graham Neubig, Barnabas Poczos, and Tom Mitchell, ‘Competence-based curriculum learning for neural machine translation’, in Proceedings of NAACL, pp. 1162–1172, (2019).
[22] Shivesh Ranjan and John H. L. Hansen, ‘Curriculum learning based approaches for noise robust speaker recognition’, IEEE/ACM Transactions on Audio, Speech, and Language Processing, 26(1), 197–210, (2017).
[23] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, ‘You only look once: Unified, real-time object detection’, in Proceedings of CVPR, pp. 779–788, (2016).
[24] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun, ‘Faster R-CNN: Towards real-time object detection with region proposal networks’, in Proceedings of NIPS, pp. 91–99, (2015).
[25] Mrinmaya Sachan and Eric Xing, ‘Easy questions first? A case study on curriculum learning for question answering’, in Proceedings of ACL, pp. 453–463, (2016).
[26] Enver Sangineto, Moin Nabi, Dubravko Culibrk, and Nicu Sebe, ‘Self paced deep learning for weakly supervised object detection’, IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(3), 712–725, (2018).
[27] Miaojing Shi and Vittorio Ferrari, ‘Weakly supervised object localization using size estimates’, in Proceedings of ECCV, pp. 105–121. Springer, (2016).
[28] Petru Soviany, Claudiu Ardei, Radu Tudor Ionescu, and Marius Leordeanu, ‘Image difficulty curriculum for generative adversarial networks (CuGAN)’, in Proceedings of WACV, (2020).
[29] Petru Soviany and Radu Tudor Ionescu, ‘Frustratingly Easy Trade-off Optimization between Single-Stage and Two-Stage Deep Object Detectors’, in Proceedings of CEFRL Workshop of ECCV, pp. 366–378, (2018).
[30] Valentin I. Spitkovsky, Hiyan Alshawi, and Daniel Jurafsky, ‘Baby Steps: How “Less is More” in unsupervised dependency parsing’, in Proceedings of NIPS Workshop on Grammar Induction, Representation of Language and Language Learning, (2009).
[31] Sandeep Subramanian, Sai Rajeswar, Francis Dutil, Christopher Pal, and Aaron Courville, ‘Adversarial generation of natural language’, in Proceedings of the 2nd Workshop on Representation Learning for NLP, pp. 241–251, (2017).
[32] James S. Supancic and Deva Ramanan, ‘Self-paced learning for long-term tracking’, in Proceedings of CVPR, pp. 2379–2386, (2013).
[33] Kevin Tang, Vignesh Ramanathan, Li Fei-Fei, and Daphne Koller, ‘Shifting weights: Adapting object detectors from image to video’, in Proceedings of NIPS, pp. 638–646, (2012).
[34]   Yiru Wang, Weihao Gan, Jie Yang, Wei Wu, and Junjie Yan, ‘Dynamic
       curriculum learning for imbalanced data classification’, in Proceed-
       ings of the IEEE/CVF International Conference on Computer Vision
       (ICCV), (October 2019).
[35]   Daphna Weinshall and Gad Cohen, ‘Curriculum learning by transfer
       learning: Theory and experiments with deep networks’, in Proceedings
       of ICML, (2018).
[36]   Dingwen Zhang, Junwei Han, Long Zhao, and Deyu Meng, ‘Lever-
       aging prior-knowledge for weakly supervised object detection under a
       collaborative self-paced curriculum learning framework’, International
       Journal of Computer Vision, 127(4), 363–380, (2019).
[37]   Xuan Zhang, Gaurav Kumar, Huda Khayrallah, Kenton Murray, Jeremy
       Gwinnup, Marianna J Martindale, Paul McNamee, Kevin Duh, and Ma-
       rine Carpuat, ‘An empirical exploration of curriculum learning for neu-
       ral machine translation’, arXiv preprint arXiv:1811.00739, (2018).



