<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Curriculum Learning with Diversity for Supervised Computer Vision Tasks</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Petru Soviany</string-name>
          <email>petru.soviany@yahoo.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Bucharest, Department of Computer Science</institution>
          ,
          <country country="RO">Romania</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <fpage>37</fpage>
      <lpage>44</lpage>
      <abstract>
<p>Curriculum learning techniques are a viable solution for improving the accuracy of automatic models, by replacing the traditional random training with an easy-to-hard strategy. However, the standard curriculum methodology does not automatically provide improved results, being constrained by multiple elements such as the data distribution or the proposed model. In this paper, we introduce a novel curriculum sampling strategy which takes into consideration the diversity of the training data together with the difficulty of the inputs. We determine the difficulty using a state-of-the-art estimator based on the human time required for solving a visual search task. We consider this kind of difficulty metric to be better suited for solving general problems, as it is not based on certain task-dependent elements, but rather on the context of each image. We ensure diversity during training by giving higher priority to elements from less visited classes. We conduct object detection and instance segmentation experiments on the Pascal VOC 2007 and Cityscapes data sets, surpassing both the randomly-trained baseline and the standard curriculum approach. We show that our strategy is highly effective for unbalanced data sets, leading to faster convergence and more accurate results, when other curriculum-based strategies fail.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        Although the accuracy of automatic models has increased considerably with
the development of deep and very deep neural networks, an
important yet less studied factor in the overall performance is the
training strategy. In this regard, Bengio et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] introduced
curriculum learning (CL), a set of learning strategies inspired by the way in
which humans teach and learn. People learn the easiest concepts at
first, followed by more and more complex elements. Similarly, CL
uses the difficulty context, feeding the automatic model with easier
samples at the beginning of the training, and gradually adding more
difficult data as the training proceeds.
      </p>
      <p>
        The idea is straightforward, but an important question is how to
determine whether a sample is easy or hard. CL requires the
existence of a predefined metric which can compute the difficulty of the
input examples. Still, the difficulty of an image is strongly related
to the context: a big car in the middle of an empty street should be
easier to detect than a small car, parked in the corner of an alley full
of pedestrians. Instead of building hand-crafted models for
retrieving contextual information, in this paper, we use the image difficulty
estimator from [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] which is based on the amount of time required
by human annotators to assess if a class is present or not in a certain
image. We consider that people can understand the full context very
accurately, and that a difficulty measure trained on this information
can be useful in our setting.
      </p>
      <p>
        The next challenge is building the curriculum schedule, or the rate
at which we can augment the training set with more complex
information. To address this problem, we follow a sampling strategy
similar to the one introduced in [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ]. Based on the difficulty score, we
sample according to a probability function, which favors easier
samples in the first iterations, but converges to give the same weight to
all the examples in the later phases of the training. Still, the
probability of sampling a harder example in the first iterations is not zero, and
the more difficult samples which are occasionally picked increase the
diversity of the data and help training.
      </p>
      <p>
        The above-mentioned methodology should work well for balanced
data sets, as various curriculum sampling strategies have been
successfully employed in the literature [
        <xref ref-type="bibr" rid="ref19 ref28 ref34 ref37">19, 28, 34, 37</xref>
        ], but it can fail when
the data is unbalanced. Ionescu et al. [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] show that some classes may
be more difficult than others. A simple motivation for this may be the
context in which each class appears. For example, a potted plant or
a bottle are rarely the focus of attention, usually being placed
somewhere in the background. Other classes of objects, such as tables,
are usually occluded, with the pictures focusing on the objects on
the table rather than on the piece of furniture itself. This can make a
standard curriculum sampling strategy neglect examples from certain
classes and slow down training. The problem becomes even more
serious in a context where the data is biased towards the easier classes.
To solve these issues, we add a new term to our sampling function
which takes into consideration the classes of the elements already
sampled, in order to emphasize images from less-visited classes
and ensure the diversity of the selected examples.
      </p>
      <p>The importance of diversity can be easily explained when
comparing our machine learning approach to actual real-life examples. For
instance, when creating a new vaccine, researchers need to
experiment on multiple variants of the virus, then test it on a diverse group
of people. As a rule, in all sciences, before making any assumptions,
researchers have to examine a diverse set of examples which are
relevant to the actual data distribution. Similar to vaccines, which
must be effective for as many people as possible, we want our
curriculum model to work well on all object classes. We argue that this
is not possible in unbalanced curriculum scenarios, and it is slower
in the traditional random training setup.</p>
      <p>
        Since it is a sampling procedure, our CL approach can be applied
to any supervised task in machine learning. In this paper, we focus
on object detection and instance segmentation, two of the main tasks
in computer vision, which require the model to identify the class
and the location of objects in images. To test the validity of our
approach, we experiment on two data sets: Pascal VOC 2007 [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and
Cityscapes [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], and compare our curriculum with diversity strategy
against the standard random training method, a curriculum sampling
(without diversity) procedure and an inverse-curriculum approach,
which selects images from hard to easy. We employ a state-of-the-art
Faster R-CNN [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ] detector with a Resnet-101 [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] backbone for the
object detection experiments, and a Mask R-CNN [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] model based
on Resnet-50 for instance segmentation.
      </p>
      <p>Our main contributions can be summarized as follows:
1. We illustrate the necessity of adding diversity when using CL in
unbalanced data sets;
2. We introduce a novel curriculum sampling function, which takes
into consideration the class-diversity of the training samples and
improves results when traditional curriculum approaches fail;
3. We prove our strategy by experimenting on two computer vision
tasks: object detection and instance segmentation, using two data
sets of high interest.</p>
      <p>We organize the rest of this paper as follows: in Section 2, we
present the most relevant related works and compare them with our
approach. In Section 3, we explain in detail the methodology we
follow. We present our results in Section 4, and draw our conclusion
and discuss possible future work in the last section.
</p>
    </sec>
    <sec id="sec-2">
      <title>2 Related Work</title>
      <p>
        Curriculum learning. Bengio et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] introduced the idea of
curriculum learning (CL) to train artificial intelligence, proving that the
standard learning paradigm used in human educational systems could
also be applied to automatic models. CL represents a class of
easy-to-hard approaches, which have successfully been employed in a wide
range of machine learning applications, from natural language
processing [
        <xref ref-type="bibr" rid="ref16 ref19 ref21 ref31 ref8">8, 16, 19, 21, 31</xref>
        ], to computer vision [
        <xref ref-type="bibr" rid="ref15 ref18 ref27 ref35 ref6 ref7 ref9">6, 7, 9, 15, 18, 27, 35</xref>
        ],
or audio processing [
        <xref ref-type="bibr" rid="ref1 ref22">1, 22</xref>
        ].
      </p>
      <p>
        One of the main limitations of CL is that it assumes the existence
of a predefined metric which can rank the samples from easy to hard.
These metrics are usually task-dependent with various solutions
being proposed for each. For example, in text processing, the length of
the sentence can be used to estimate the difficulty of the input (shorter
sentences are easier) [
        <xref ref-type="bibr" rid="ref21 ref30">21, 30</xref>
        ], while the number and the size of
objects in a certain sample can provide enough insights about difficulty
in image processing tasks (images with few large objects are
easier) [
        <xref ref-type="bibr" rid="ref27 ref29">27, 29</xref>
        ]. In our paper, we employ the image difficulty estimator
of Ionescu et al. [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] which was trained considering the time required
by human annotators to identify the presence of certain classes in
images.
      </p>
      <p>
        To alleviate the challenge of finding a predefined difficulty
metric, Kumar et al. [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] introduce self-paced learning (SPL), a set of
approaches in which the model ranks the samples from easy to hard
during training, based on its current progress. For example, the
inputs with the smaller loss at a certain time during training are easier
than the samples with higher loss. Many papers apply SPL
successfully [
        <xref ref-type="bibr" rid="ref26 ref32 ref33">26, 32, 33</xref>
        ], and some methods combine prior knowledge with
live training information, creating self-paced with curriculum
techniques [
        <xref ref-type="bibr" rid="ref14 ref36">14, 36</xref>
        ]. Even so, SPL still has some limitations, requiring a
methodology on how to select the samples and how much to
emphasize easier examples. Our approach is on the borderline between CL
and SPL, but we consider it to be pure curriculum, although we use
training information to favor less visited classes. During
training, we only count the labels of the training samples, which is a priori
information, and not the learning progress. A similar system could
iteratively select examples from every class, but this would force our
model to process the same number of examples from each class.
Instead, by using the class-diversity as a term in our difficulty-based
sampling probability function, we impose the selection of
easy-to-hard diverse examples, without massively altering the actual class
distribution of the data set.
      </p>
      <p>
        The easy-to-hard idea behind CL can be implemented in
multiple ways. One option is to start training on the easiest set of images,
while gradually adding more difficult batches [
        <xref ref-type="bibr" rid="ref16 ref2 ref27 ref30 ref37 ref7">2, 7, 16, 27, 30, 37</xref>
        ].
Although most of the models keep the visited examples in the
training set, Kocmi et al. [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] suggest reducing the size of each bin until
combining it with the following one, in order to use each example
only once during an epoch. In [
        <xref ref-type="bibr" rid="ref19 ref28">19, 28</xref>
        ] the authors propose a
sampling strategy according to some probability function, which favors
easier examples in the first iterations. As the authors show, the
easiness score from [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ] could also be added as a new term to the loss
function to emphasize the easier examples in the beginning of the
training. In this paper, we enhance their sampling strategy by adding
a new diversity term to the probability function used to select training
examples.
      </p>
      <p>
        Despite leading to good results in many related papers, the
standard CL procedure is highly influenced by the task and the data
distribution. Simple tasks may not gain much from using curriculum
approaches, while employing CL in unbalanced data sets can lead to
slower convergence. To address the second problem, Wang et al. [
        <xref ref-type="bibr" rid="ref34">34</xref>
        ]
introduce a CL framework which adaptively adjusts the sampling
strategy and loss weight in each batch, while other papers [
        <xref ref-type="bibr" rid="ref13 ref25">13, 25</xref>
        ]
argue that a key element is diversity. Jiang et al. [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] introduce a SPL
with diversity technique in which they regularize the model using
both difficulty information and the variety of the samples. They
suggest using clustering algorithms to split the data into diverse groups.
Sachan et al. [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ] measure diversity using the angle between the
hyperplanes the samples induce in the feature space. They choose
the examples that optimize a convex combination of the curriculum
learning objective and the sum of angles between the candidate
samples and the examples selected in previous steps. In our model, we
define diversity based on the classes of our data. We combine our
predefined difficulty metric with a score which favors images from
less visited classes, in order to sample easy and diverse examples
at the beginning of the training, then gradually add more complex
elements. Our idea works well for supervised tasks, but it can be
extended to unsupervised learning by replacing the ground-truth labels
with a clustering model, as suggested in [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. Figure 1 presents the class distribution of the Pascal VOC 2007 data set [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], which is heavily biased towards the class person.
      </p>
      <p>
        Object detection is the task of predicting the location and the
class of objects in certain images. As noted in [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ], the state-of-the-art object detectors can be split into two main categories: two-stage and single-stage models. The two-stage object detectors [
        <xref ref-type="bibr" rid="ref10 ref24">10, 24</xref>
        ] use
a Region Proposal Network to generate regions of interest which are
then fed to another network for object localization and classification.
The single-stage approaches [
        <xref ref-type="bibr" rid="ref20 ref23">20, 23</xref>
        ] take the whole image as input
and solve the problem as a regular regression task. These
methods are usually faster, but less accurate than the two-stage designs.
Instance segmentation is similar to object detection, but more
complex, requiring the generation of a mask instead of a bounding box
for the objects in the test image. Our strategy can be implemented
using any detection and segmentation models, but, in order to increase
the relevance of our results, we experiment with high-quality Faster
R-CNN [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ] and Mask R-CNN [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] baselines.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3 Methodology</title>
      <p>
        Training artificial intelligence using curriculum approaches, from
easy to hard, can lead to improved results in a wide range of
tasks [
        <xref ref-type="bibr" rid="ref1 ref15 ref16 ref18 ref19 ref21 ref22 ref27 ref31 ref35 ref6 ref7 ref8 ref9">1, 6, 7, 8, 9, 15, 16, 18, 19, 21, 22, 27, 31, 35</xref>
        ]. Still, it is not
simple to determine which samples are easy or hard, and the
available metrics are usually task-dependent. Another challenge of CL is
finding the right curriculum schedule, i.e. how fast to add more
difficult examples to training, and how to introduce the right amount of
harder samples at the right time to positively influence convergence.
In this section, we present our approach for estimating difficulty and
our curriculum sampling strategies.
      </p>
    </sec>
    <sec id="sec-4">
      <title>3.1 Difficulty estimation</title>
      <p>
        To estimate the difficulty of our training examples, we employ the
method of Ionescu et al. [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] who defined image difficulty as the
human time required for solving a visual search task. They collected
annotations for the Pascal VOC 2012 [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] data set, by asking annotators
whether a class was present or not in a certain image. They collected
the time people required for answering these questions, which they
normalized and fed as training data for a regression model. Their
results correlate well with other difficulty metrics which take into
consideration the number of objects, the size of the objects, or the
occlusions. Because it is based on human annotations, this method
takes into account the whole image context, not only certain features
relevant for one problem (the number of objects, for example). This
makes the model task-independent and, as a result, it was
successfully employed in multiple vision problems [
        <xref ref-type="bibr" rid="ref12 ref28 ref29">12, 28, 29</xref>
        ]. To further
prove the efficiency of the estimator for our task, we show that
automatic models have a lower accuracy on difficult examples. We split
the Pascal VOC 2007 [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] test set into three equal batches: easy, medium
and hard, and run the baseline model on each of them. The results in
Table 1 confirm that the AP decreases as the difficulty increases.
      </p>
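      <p>
        For illustration, the following is a minimal sketch of this validation step, assuming that difficulty maps the test image identifiers to the estimated scores and that evaluate_ap is a placeholder for the AP evaluation routine; both names are illustrative.
      </p>
      <preformat>
import numpy as np

def evaluate_per_difficulty(model, difficulty, evaluate_ap, n_bins=3):
    # Sort the test images from easy to hard according to the difficulty estimator.
    image_ids = sorted(difficulty, key=difficulty.get)
    bins = np.array_split(image_ids, n_bins)  # easy, medium, hard
    # Run the same evaluation on each bin; AP is expected to drop on harder bins.
    return {name: evaluate_ap(model, list(ids))
            for name, ids in zip(["easy", "medium", "hard"], bins)}
      </preformat>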
      <p>
        We follow the strategy of Ionescu et al. as described in the
original paper [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] to determine the difficulty scores of the images in our
data sets. These scores take values around 3, with a larger score indicating a more difficult sample. We translate the values to the interval [−1, 1] using Equation 1 to simplify the usage of the score in the next steps.
Figure 2 shows some examples of easy and difficult images.
      </p>
      <p>
        Scale_min-max(x) = 2 · (x − min(x)) / (max(x) − min(x)) − 1. (1)
      </p>
      <p>
        Soviany et al. [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ] introduce a curriculum sampling strategy which favors easier examples in the first iterations and converges as the training progresses. It has the advantage of being a continuous method, removing the necessity of a curriculum schedule for augmenting the difficulty-based batches. Furthermore, the fact that it is a probabilistic sampling method does not constrain the model to only select easy examples in the first iterations, as batching does, but adds more diversity to the data selection. We follow their approach in building our curriculum sampling strategy, with only a small change in the position of the parameter k, in order to better emphasize the difficulty of the examples. We use the following function to assign weights to the input images during training:
      </p>
      <p>
        w(x_i, t) = (1 − diff(x_i) · e^(−γ·t))^k, ∀ x_i ∈ X, (2)
      </p>
      <p>
        where x_i is a training example from the data set X, t is the current iteration, and diff(x_i) is the difficulty score associated with the selected sample. γ is a parameter which sets how fast the function converges to 1, while k sets how much to emphasize the easier examples. Our function differs from the one proposed in [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ] by the position of the k parameter. We take advantage of the properties of the power function, which increases faster for numbers greater than one. Since 1 − diff(x_i) · e^(−γ·t) ∈ [0, 2], and the result is &gt; 1 for easier examples, our function focuses more on the easier samples in the first iterations. As the training advances, the function converges to 1, so all examples have the same probability of being selected in the later phases of the training. We transform the weights into probabilities and we sample accordingly.
      </p>
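      <p>
        To make the sampling procedure concrete, the following is a minimal sketch of Equations 1 and 2, assuming raw_difficulty holds the scores produced by the difficulty estimator; the function names are illustrative, and the default parameter values follow the configuration used in our experiments (Section 4.2).
      </p>
      <preformat>
import numpy as np

def scale_min_max(x):
    # Equation 1: translate the scores to [-1, 1]; negative values mean easier images.
    return 2.0 * (x - x.min()) / (x.max() - x.min()) - 1.0

def curriculum_weights(diff, t, gamma=6e-5, k=5):
    # Equation 2: easier samples (negative diff) receive weights above 1 early on;
    # as t grows, exp(-gamma * t) approaches 0 and all weights converge to 1.
    return (1.0 - diff * np.exp(-gamma * t)) ** k

def sample_batch(diff, t, batch_size, rng=None):
    rng = rng or np.random.default_rng()
    w = curriculum_weights(diff, t)
    p = w / w.sum()  # turn the weights into sampling probabilities
    return rng.choice(len(diff), size=batch_size, replace=False, p=p)

# diff = scale_min_max(np.asarray(raw_difficulty))   # hypothetical usage
# batch = sample_batch(diff, t=0, batch_size=4)
      </preformat>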
    </sec>
    <sec id="sec-5">
      <title>3.3 Curriculum with diversity sampling</title>
      <p>
        As [
        <xref ref-type="bibr" rid="ref13 ref25">13, 25</xref>
        ] note, applying a CL strategy does not guarantee improved results, as the diversity of the selected samples has a great impact on the final performance. A simple example is the case in which the data set is
biased, having fewer samples of certain classes. Since some classes
are more difficult than others [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], if the data set is not well-balanced,
the model will not visit the harder classes until the later stages of the
training. Thus, the model will not perform well on classes it did not
visit. This fact is generally valid in all kinds of applications, even in real-life reasoning: without seeing examples which match the whole
data distribution, it is impossible to find the solution suited for all
scenarios. Because of this, we enhance our sampling method by adding
a new term, which is based on the diversity of the examples.
      </p>
      <p>
        Our diversity scoring algorithm is simple, taking into
consideration the classes of the selected samples. During training, we count
the number of visited objects from each class (numobjects(c)). We
subtract the mean of the values to determine how often each class
was visited. This is formally presented in Equation 3. We scale and
translate the results to [−1, 1] using Equation 1 to get the score of each class; then, for every image, we compute the image-level diversity by averaging the class scores of the objects in its ground-truth labels (Equation 4).
      </p>
      <p>visited(c_i) = numobjects(c_i) − (Σ_{c_j ∈ C} numobjects(c_j)) / |C|, ∀ c_i ∈ C. (3)</p>
      <p>imgVisited(x_i) = (Σ_{obj ∈ objects(x_i)} visited(class(obj))) / |objects(x_i)|, ∀ x_i ∈ X. (4)</p>
      <p>In our diversity algorithm we want to emphasize the images containing objects from less visited classes, i.e., those with a small imgVisited value, closer to −1. We compute a scoring function similar to Equation 2, which also takes into consideration how often a class was visited, in order to add diversity:</p>
      <p>
        w(x_i, t) = [1 − α · (diff(x_i) · e^(−γ·t)) − (1 − α) · (imgVisited(x_i) · e^(−γ·t))]^k, (5)
      </p>
      <p>
        where α controls the impact of each component, the difficulty and the diversity, while the rest of the notation follows Equation 2. We transform the weights into probabilities by dividing them by their sum, and we sample accordingly.
      </p>
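      <p>
        To illustrate how the difficulty and diversity terms are combined, the following is a minimal sketch of Equations 3-5, assuming that labels[i] lists the ground-truth class identifiers of image i and that num_objects counts the objects visited so far from each class; all names are illustrative, and the resulting weights would be normalized into probabilities as described above.
      </p>
      <preformat>
import numpy as np

def class_visited_scores(num_objects):
    # Equation 3: how often each class was visited, relative to the mean count.
    counts = np.asarray(num_objects, dtype=float)
    visited = counts - counts.mean()
    span = visited.max() - visited.min()
    if span == 0:  # e.g. before any object has been visited
        return np.zeros_like(visited)
    # Rescale to [-1, 1] with Equation 1; -1 marks the least visited classes.
    return 2.0 * (visited - visited.min()) / span - 1.0

def img_visited(labels_i, visited):
    # Equation 4: average the class scores of the objects in the image.
    return float(np.mean([visited[c] for c in labels_i]))

def diversity_weights(diff, labels, num_objects, t, alpha=0.5, gamma=6e-5, k=5):
    # Equation 5: combine difficulty and diversity; images from less visited
    # classes (imgVisited close to -1) receive larger weights early in training.
    visited = class_visited_scores(num_objects)
    img_v = np.array([img_visited(lbl, visited) for lbl in labels])
    decay = np.exp(-gamma * t)
    return (1.0 - alpha * diff * decay - (1.0 - alpha) * img_v * decay) ** k
      </preformat>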
      <p>
        In order to test the validity of our method, we experiment on two data sets: Pascal VOC 2007 [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and Cityscapes [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. We conduct detection experiments on 20 classes, training on the 5011 images from the Pascal VOC 2007 trainval split. We perform the evaluation on the test split, which contains 4952 images. For our instance segmentation experiments, we use the Cityscapes data set, which contains eight labeled object classes: person, rider, car, truck, bus, train, motorcycle, bicycle. We train on the training set of 2975 images and we evaluate on the validation split of 500 images.
      </p>
    </sec>
    <sec id="sec-6">
      <title>4.2 Baselines and configuration</title>
      <p>
        We build our method on top of the Faster R-CNN [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ]
and Mask R-CNN [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] implementations available at:
https://github.com/facebookresearch/maskrcnn-benchmark. For our
detection experiments, we use Faster R-CNN with Resnet-101 [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]
backbone, while for segmentation we employ the Resnet-50
backbone on the Mask R-CNN model. We use the configurations
available on the website, with the learning rate adjusted for training with a batch size of 4. In our sampling procedure (Equation 5), we set α = 0.5, γ = 6 · 10^(−5), and k = 5. We do not compare with
other models, because the goal of our paper is not surpassing the
state of the art, but improving the quality of our baseline model. We
also present the results of a hard-to-easy sampling, in order to prove
the efficiency of the easy-to-hard curriculum approaches inspired by
human learning.
      </p>
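      <p>
        For reference, the sampling hyper-parameters listed above, grouped in a single, purely illustrative configuration:
      </p>
      <preformat>
# Hyper-parameters of the sampling procedure (Equation 5), as reported above;
# the dictionary itself is only an illustrative way of grouping them.
SAMPLER_CONFIG = {
    "alpha": 0.5,      # balance between the difficulty and diversity terms
    "gamma": 6e-5,     # how fast the sampling converges to uniform
    "k": 5,            # how much the easier examples are emphasized
    "batch_size": 4,   # training batch size used for both tasks
}
      </preformat>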
    </sec>
    <sec id="sec-7">
      <title>4.3 Evaluation metrics</title>
      <p>
        We evaluate our results using the mean Average Precision (AP). The
AP score is given by the area under the precision-recall curve for
the detected objects. The Pascal VOC 2007 [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] metric is the mean
of precision values at a set of 11 equally spaced recall levels, from
0 to 1, at a step size of 0.1. The Cityscapes [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] metric computes
the average precision on the region level for each class and
averages it across 10 different overlaps ranging from 0.5 to 0.95 in
steps of 0.05. We also report results on Cityscapes using AP50%
and AP75%, which correspond to overlap values of 50% and 75%,
respectively. Since the exact evaluation protocol has some
differences for each data set, we use the Pascal VOC 2007 [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] metric
for the detection experiments and the Cityscapes [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] metric for the
instance segmentation results. We use the evaluation code available
at https://github.com/facebookresearch/maskrcnn-benchmark. More
details about the evaluation metrics can be found in the original
papers [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ].
      </p>
      <p>The class distribution of the objects in Pascal VOC 2007 clearly favors the class person, with 4690 instances, while the classes diningtable and bus only contain 215 and 229 instances, respectively. This would not be a problem if the difficulty of the classes was similar, because we can assume the test data set has a matching distribution, but this is not the case, as shown in Figure 3.
      </p>
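      <p>
        For completeness, the following is a minimal sketch of the 11-point interpolated AP used in the Pascal VOC 2007 evaluation above, assuming that precision and recall describe the precision-recall curve of a single class; the names are illustrative.
      </p>
      <preformat>
import numpy as np

def voc2007_ap(precision, recall):
    # Average, over 11 equally spaced recall levels, the maximum precision
    # attainable at a recall greater than or equal to each level.
    precision, recall = np.asarray(precision), np.asarray(recall)
    ap = 0.0
    for r in np.linspace(0.0, 1.0, 11):
        mask = recall >= r
        ap += (precision[mask].max() if mask.any() else 0.0) / 11.0
    return ap
      </preformat>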
      <p>Figure 4 presents how the two sampling methods behave during
training on the Pascal VOC 2007 data set. In the first 10k
iterations, curriculum sampling selects images with almost 20k objects
from class person and only 283 instances from class diningtable. By
adding diversity, we reduce the gap between classes, reaching 10k person objects and 1000 diningtable instances. This behaviour
continues as the training progresses, with the differences between classes
being smaller when adding diversity. It is important to note that we
do not want to sample the same number of objects from each class,
but to keep the class distribution of the actual data set, while
feeding the model with enough details about every class. Figure 6 shows
the difficulty of the examples sampled according to our strategies.
We observe that by adding diversity we do not break our
curriculum learning schedule, the examples still being selected from easy to
hard.</p>
      <p>To further prove the efficiency of our method, we compute the AP
on both object detection and instance segmentation tasks. The results
are presented in Tables 2 and 3.</p>
      <p>We repeat our object detection experiments five times and
average the results, in order to ensure their relevance. The sampling with
diversity approach provides an improvement of 0.69% over the
standard curriculum method, and of 0.79% over the randomly-trained
baseline. Although the improvement is not large, we can observe
that by adding diversity we boost the accuracy where the standard
method would fail, without much effort. Our experiments with an
inverse curriculum approach, from hard to easy, lead to the worst
results, showing the utility of presenting the training samples in a
meaningful order, similar to the way people learn.</p>
      <p>Moreover, Figure 5 illustrates the evolution of the AP during
training. The curriculum with diversity approach has superior results over
the baseline from the beginning to the end of the training. As the
figure shows, the difference between the two methods increases in the
later stages of the training. A simple reason for this behaviour is the
fact that the curriculum strategy is fed with new, more difficult,
examples as the training progresses, continuously improving the
accuracy of the model. On the other hand, the standard random procedure
receives all information from the beginning, reaching a plateau early
during training. The standard CL method starts from lower scores,
exactly because it does not visit enough samples from more difficult
classes in the early stages of the training. For instance, after 5000
iterations, the AP of the standard CL approach on the class diningtable
was 0. Thus, by adding diversity, our model converges faster than the
traditional methods.</p>
      <p>Our instance segmentation experiments show a similar behaviour, with the curriculum with diversity approach surpassing the baseline by 0.4% in terms of AP, 0.71% in terms of AP50%, and 0.45% in terms of AP75%. It is interesting to point out that, although the diverse curriculum approach has a better AP and AP75% than the standard CL method, the latter surpasses our method by 0.02% when evaluated using AP50%. The inverse curriculum approach has the worst scores again, strengthening our statements on the utility of curriculum learning and the importance of providing training examples in a meaningful order.</p>
    </sec>
    <sec id="sec-8">
      <title>5 Conclusion and future work</title>
      <p>In this paper, we presented a simple method for optimizing curriculum
learning approaches on unbalanced data sets. We consider
that the diversity of the selected examples is just as important as their
difficulty, and neglecting this fact may slow down training for more
difficult classes. We introduced a novel sampling function, which
uses the classes of the visited examples together with a difficulty
score to ensure the curriculum schedule and the diversity of the
selection. Our object detection and instance segmentation experiments
conducted on two data sets of high interest prove the superiority of
our method over the randomly-trained baseline and over the standard
CL approach. A benefit of our methodology is that it can be used on
top of any deep learning model, for any supervised task. Diversity
can be a key element for overcoming one of the shortcomings of CL, which could lead to the replacement of the traditional random training and a wider adoption of meaningful sample selection. In future work, we plan to study more difficulty measures to build an
extensive view on how the chosen metric affects the performance of
our system. Furthermore, we aim to conduct an ablation study on the
parameter choice and find better ways to determine the right parameter
values. Another important aspect we are considering is extending the
framework to unsupervised tasks, by introducing a novel method of
computing the diversity of the examples.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Dario</given-names>
            <surname>Amodei</surname>
          </string-name>
          et al., '
          <article-title>Deep speech 2: End-to-end speech recognition in english and mandarin'</article-title>
          ,
          <source>in Proceedings of ICML</source>
          , p.
          <fpage>173</fpage>
          -
          <lpage>182</lpage>
          , (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Yoshua</given-names>
            <surname>Bengio</surname>
          </string-name>
          , Je´roˆme Louradour, Ronan Collobert, and Jason Weston, '
          <article-title>Curriculum learning'</article-title>
          ,
          <source>in Proceedings of ICML</source>
          , pp.
          <fpage>41</fpage>
          -
          <lpage>48</lpage>
          , (
          <year>2009</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Marius</given-names>
            <surname>Cordts</surname>
          </string-name>
          , Mohamed Omran,
          <string-name>
            <given-names>Sebastian</given-names>
            <surname>Ramos</surname>
          </string-name>
          , Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele, '
          <article-title>The cityscapes dataset for semantic urban scene understanding'</article-title>
          ,
          <source>in Proceedings of CVPR</source>
          , (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Everingham</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. Van</given-names>
            <surname>Gool</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. K. I.</given-names>
            <surname>Williams</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Winn</surname>
          </string-name>
          ,
          <article-title>and</article-title>
          <string-name>
            <given-names>A.</given-names>
            <surname>Zisserman</surname>
          </string-name>
          .
          <source>The PASCAL Visual Object Classes Challenge</source>
          <year>2007</year>
          (
          <article-title>VOC2007) Results</article-title>
          . http://www.pascalnetwork.org/challenges/VOC/voc2007/workshop/index.html.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.</given-names>
            <surname>Everingham</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. Van</given-names>
            <surname>Gool</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. K. I.</given-names>
            <surname>Williams</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Winn</surname>
          </string-name>
          ,
          <article-title>and</article-title>
          <string-name>
            <given-names>A.</given-names>
            <surname>Zisserman</surname>
          </string-name>
          .
          <source>The PASCAL Visual Object Classes Challenge</source>
          <year>2012</year>
          (
          <article-title>VOC2012) Results</article-title>
          . http://www.pascalnetwork.org/challenges/VOC/voc2012/workshop/index.html.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>C.</given-names>
            <surname>Gong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Tao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. J.</given-names>
            <surname>Maybank</surname>
          </string-name>
          , W. Liu, G. Kang, and
          <string-name>
            <given-names>J</given-names>
            .
            <surname>Yang</surname>
          </string-name>
          , '
          <article-title>Multimodal curriculum learning for semi-supervised image classification'</article-title>
          ,
          <source>IEEE Transactions on Image Processing</source>
          ,
          <volume>25</volume>
          (
          <issue>7</issue>
          ),
          <fpage>3249</fpage>
          -
          <lpage>3260</lpage>
          , (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>L.</given-names>
            <surname>Gui</surname>
          </string-name>
          , T. Baltrusˇaitis, and L. Morency, '
          <article-title>Curriculum learning for facial expression recognition'</article-title>
          ,
          <source>in Proceedings of FG</source>
          , pp.
          <fpage>505</fpage>
          -
          <lpage>511</lpage>
          , (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Junliang</given-names>
            <surname>Guo</surname>
          </string-name>
          , Xu Tan, Linli Xu, Tao Qin, Enhong Chen, and
          <string-name>
            <surname>Tie-Yan</surname>
            <given-names>Liu</given-names>
          </string-name>
          , '
          <article-title>Fine-tuning by curriculum learning for non-autoregressive neural machine translation'</article-title>
          , arXiv preprint arXiv:
          <year>1911</year>
          .
          <volume>08717</volume>
          , (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Guy</given-names>
            <surname>Hacohen</surname>
          </string-name>
          and Daphna Weinshall, '
          <article-title>On the power of curriculum learning in training deep networks'</article-title>
          ,
          <source>in Proceedings of ICML</source>
          , (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Kaiming</surname>
            <given-names>He</given-names>
          </string-name>
          , Georgia Gkioxari, Piotr Dolla´r, and Ross Girshick, '
          <string-name>
            <surname>Mask</surname>
          </string-name>
          r-cnn',
          <source>in Proceedings of ICCV</source>
          , pp.
          <fpage>2961</fpage>
          -
          <lpage>2969</lpage>
          , (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Kaiming</surname>
            <given-names>He</given-names>
          </string-name>
          , Xiangyu Zhang, Shaoqing Ren, and Jian Sun, '
          <article-title>Deep residual learning for image recognition'</article-title>
          ,
          <source>in Proceedings of CVPR</source>
          , pp.
          <fpage>770</fpage>
          -
          <lpage>778</lpage>
          , (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Radu</given-names>
            <surname>Tudor</surname>
          </string-name>
          <string-name>
            <surname>Ionescu</surname>
          </string-name>
          , Bogdan Alexe, Marius Leordeanu, Marius Popescu,
          <string-name>
            <given-names>Dim P.</given-names>
            <surname>Papadopoulos</surname>
          </string-name>
          , and Vittorio Ferrari, '
          <article-title>How hard can it be? estimating the difficulty of visual search in an image'</article-title>
          ,
          <source>in Proceedings of CVPR</source>
          , pp.
          <fpage>2157</fpage>
          -
          <lpage>2166</lpage>
          , (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Lu</surname>
            <given-names>Jiang</given-names>
          </string-name>
          , Deyu Meng,
          <string-name>
            <surname>Shoou-I Yu</surname>
          </string-name>
          , Zhenzhong Lan, Shiguang Shan, and Alexander Hauptmann, '
          <article-title>Self-paced learning with diversity'</article-title>
          ,
          <source>in Proceedings of NIPS</source>
          , pp.
          <fpage>2078</fpage>
          -
          <lpage>2086</lpage>
          , (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Lu</surname>
            <given-names>Jiang</given-names>
          </string-name>
          , Deyu Meng,
          <string-name>
            <given-names>Qian</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Shiguang</given-names>
            <surname>Shan</surname>
          </string-name>
          , and Alexander G Hauptmann,
          <article-title>'Self-paced curriculum learning'</article-title>
          ,
          <source>in Proceedings of AAAI</source>
          , (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>Lu</surname>
            <given-names>Jiang</given-names>
          </string-name>
          , Zhengyuan Zhou, Thomas Leung,
          <string-name>
            <surname>Li-Jia Li</surname>
          </string-name>
          , and
          <string-name>
            <surname>Li</surname>
          </string-name>
          Fei-Fei, 'Mentornet:
          <article-title>Learning data-driven curriculum for very deep neural networks on corrupted labels'</article-title>
          ,
          <source>in Proceedings of ICML</source>
          , pp.
          <fpage>2304</fpage>
          -
          <lpage>2313</lpage>
          , (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Tom</given-names>
            <surname>Kocmi and Ondˇrej Bojar</surname>
          </string-name>
          , '
          <article-title>Curriculum learning and minibatch bucketing in neural machine translation'</article-title>
          ,
          <source>in Proceedings of RANLP</source>
          , pp.
          <fpage>379</fpage>
          -
          <lpage>386</lpage>
          , (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>M</given-names>
            <surname>Pawan Kumar</surname>
          </string-name>
          , Benjamin Packer, and Daphne Koller, '
          <article-title>Self-paced learning for latent variable models'</article-title>
          ,
          <source>in Proceedings of NIPS</source>
          , pp.
          <fpage>1189</fpage>
          -
          <lpage>1197</lpage>
          , (
          <year>2010</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>Siyang</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Xiangxin</given-names>
            <surname>Zhu</surname>
          </string-name>
          , Qin Huang, Hao Xu, and
          <string-name>
            <surname>C.-C. Jay Kuo</surname>
          </string-name>
          , '
          <article-title>Multiple instance curriculum learning for weakly supervised object detection'</article-title>
          ,
          <source>in Proceedings of BMVC</source>
          . BMVA Press, (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <surname>Cao</surname>
            <given-names>Liu</given-names>
          </string-name>
          , Shizhu He, Kang Liu, and
          <string-name>
            <surname>Jun</surname>
            <given-names>Zhao,</given-names>
          </string-name>
          '
          <article-title>Curriculum learning for natural answer generation</article-title>
          .',
          <source>in Proceedings of IJCAI</source>
          , pp.
          <fpage>4223</fpage>
          -
          <lpage>4229</lpage>
          , (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <surname>Wei</surname>
            <given-names>Liu</given-names>
          </string-name>
          , Dragomir Anguelov, Dumitru Erhan,
          <string-name>
            <given-names>Christian</given-names>
            <surname>Szegedy</surname>
          </string-name>
          , Scott Reed, Cheng-Yang
          <string-name>
            <surname>Fu</surname>
          </string-name>
          , and Alexander C Berg, 'Ssd:
          <article-title>Single shot multibox detector'</article-title>
          ,
          <source>in Proceedings of ECCV</source>
          , pp.
          <fpage>21</fpage>
          -
          <lpage>37</lpage>
          . Springer, (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>Emmanouil</given-names>
            <surname>Antonios</surname>
          </string-name>
          <string-name>
            <surname>Platanios</surname>
          </string-name>
          , Otilia Stretcu, Graham Neubig, Barnabas Poczos, and Tom Mitchell, '
          <article-title>Competence-based curriculum learning for neural machine translation'</article-title>
          ,
          <source>in Proceedings of NAACL</source>
          , pp.
          <fpage>1162</fpage>
          -
          <lpage>1172</lpage>
          , (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>Shivesh</given-names>
            <surname>Ranjan</surname>
          </string-name>
          and John HL Hansen, '
          <article-title>Curriculum learning based approaches for noise robust speaker recognition'</article-title>
          ,
          <source>IEEE/ACM Transactions on Audio, Speech, and Language Processing</source>
          ,
          <volume>26</volume>
          (
          <issue>1</issue>
          ),
          <fpage>197</fpage>
          -
          <lpage>210</lpage>
          , (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>J.</given-names>
            <surname>Redmon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Divvala</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Girshick</surname>
          </string-name>
          ,
          <article-title>and</article-title>
          <string-name>
            <given-names>A.</given-names>
            <surname>Farhadi</surname>
          </string-name>
          , '
          <article-title>You only look once: Unified, real-time object detection'</article-title>
          ,
          <source>inProceedings of CVPR</source>
          , pp.
          <fpage>779</fpage>
          -
          <lpage>788</lpage>
          , (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <surname>Shaoqing</surname>
            <given-names>Ren</given-names>
          </string-name>
          , Kaiming He,
          <string-name>
            <surname>Ross Girshick</surname>
          </string-name>
          , and Jian Sun, '
          <article-title>Faster r-cnn: Towards real-time object detection with region proposal networks'</article-title>
          ,
          <source>in Proceedings of NIPS</source>
          , pp.
          <fpage>91</fpage>
          -
          <lpage>99</lpage>
          , (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>Mrinmaya</given-names>
            <surname>Sachan</surname>
          </string-name>
          and Eric Xing, '
          <article-title>Easy questions first? a case study on curriculum learning for question answering'</article-title>
          ,
          <source>in Proceedings of ACL</source>
          , pp.
          <fpage>453</fpage>
          -
          <lpage>463</lpage>
          , (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <surname>Enver</surname>
            <given-names>Sangineto</given-names>
          </string-name>
          , Moin Nabi, Dubravko Culibrk, and Nicu Sebe, '
          <article-title>Self paced deep learning for weakly supervised object detection'</article-title>
          ,
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          ,
          <volume>41</volume>
          (
          <issue>3</issue>
          ),
          <fpage>712</fpage>
          -
          <lpage>725</lpage>
          , (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>Miaojing</given-names>
            <surname>Shi</surname>
          </string-name>
          and Vittorio Ferrari, '
          <article-title>Weakly supervised object localization using size estimates'</article-title>
          ,
          <source>in Proceedings of ECCV</source>
          , pp.
          <fpage>105</fpage>
          -
          <lpage>121</lpage>
          . Springer, (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <surname>Petru</surname>
            <given-names>Soviany</given-names>
          </string-name>
          , Claudiu Ardei, Radu Tudor Ionescu, and Marius Leordeanu, '
          <article-title>Image difficulty curriculum for generative adversarial networks (cugan)'</article-title>
          ,
          <source>in Proceedings of WACV</source>
          , (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>Petru</given-names>
            <surname>Soviany</surname>
          </string-name>
          and Radu Tudor Ionescu, '
          <article-title>Frustratingly Easy Trade-off Optimization between Single-Stage and Two-Stage Deep Object Detectors'</article-title>
          ,
          <source>in Proceedings of CEFRL Workshop of ECCV</source>
          , pp.
          <fpage>366</fpage>
          -
          <lpage>378</lpage>
          , (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <surname>Valentin</surname>
            <given-names>I. Spitkovsky</given-names>
          </string-name>
          , Hiyan Alshawi, and Daniel Jurafsky, 'Baby Steps:
          <article-title>How “Less is More” in unsupervised dependency parsing'</article-title>
          ,
          <source>in Proceedings of NIPS Workshop on Grammar Induction, Representation of Language and Language Learning</source>
          , (
          <year>2009</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <surname>Sandeep</surname>
            <given-names>Subramanian</given-names>
          </string-name>
          , Sai Rajeswar, Francis Dutil, Christopher Pal, and Aaron Courville, '
          <article-title>Adversarial generation of natural language'</article-title>
          ,
          <source>in Proceedings of the 2nd Workshop on Representation Learning for NLP</source>
          , pp.
          <fpage>241</fpage>
          -
          <lpage>251</lpage>
          , (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <surname>James</surname>
            <given-names>S</given-names>
          </string-name>
          <string-name>
            <surname>Supancic and Deva Ramanan</surname>
          </string-name>
          , '
          <article-title>Self-paced learning for longterm tracking'</article-title>
          ,
          <source>in Proceedings of CVPR</source>
          , pp.
          <fpage>2379</fpage>
          -
          <lpage>2386</lpage>
          , (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <surname>Kevin</surname>
            <given-names>Tang</given-names>
          </string-name>
          , Vignesh Ramanathan,
          <string-name>
            <surname>Li</surname>
          </string-name>
          Fei-Fei, and Daphne Koller, '
          <article-title>Shifting weights: Adapting object detectors from image to video'</article-title>
          ,
          <source>in Proceedings of NIPS</source>
          , pp.
          <fpage>638</fpage>
          -
          <lpage>646</lpage>
          , (
          <year>2012</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34]
          <string-name>
            <surname>Yiru</surname>
            <given-names>Wang</given-names>
          </string-name>
          , Weihao Gan, Jie Yang,
          <string-name>
            <surname>Wei Wu</surname>
          </string-name>
          , and Junjie Yan, '
          <article-title>Dynamic curriculum learning for imbalanced data classification'</article-title>
          ,
          <source>in Proceedings of the IEEE/CVF International Conference on Computer Vision</source>
          (ICCV),
          <source>(October</source>
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>Daphna</given-names>
            <surname>Weinshall</surname>
          </string-name>
          and Gad Cohen, '
          <article-title>Curriculum learning by transfer learning: Theory and experiments with deep networks'</article-title>
          ,
          <source>in Proceedings of ICML</source>
          , (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          [36]
          <string-name>
            <surname>Dingwen</surname>
            <given-names>Zhang</given-names>
          </string-name>
          , Junwei Han,
          <string-name>
            <given-names>Long</given-names>
            <surname>Zhao</surname>
          </string-name>
          , and Deyu Meng, '
          <article-title>Leveraging prior-knowledge for weakly supervised object detection under a collaborative self-paced curriculum learning framework'</article-title>
          ,
          <source>International Journal of Computer Vision</source>
          ,
          <volume>127</volume>
          (
          <issue>4</issue>
          ),
          <fpage>363</fpage>
          -
          <lpage>380</lpage>
          , (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          [37]
          <string-name>
            <surname>Xuan</surname>
            <given-names>Zhang</given-names>
          </string-name>
          , Gaurav Kumar, Huda Khayrallah, Kenton Murray, Jeremy Gwinnup,
          <string-name>
            <surname>Marianna J Martindale</surname>
            ,
            <given-names>Paul McNamee</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Kevin</given-names>
            <surname>Duh</surname>
          </string-name>
          , and Marine Carpuat, '
          <article-title>An empirical exploration of curriculum learning for neural machine translation'</article-title>
          , arXiv preprint arXiv:
          <year>1811</year>
          .
          <volume>00739</volume>
          , (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>