<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>An Empirical Comparison of Syllabuses for Curriculum Learning</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mark Collier</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Joeran Beel</string-name>
          <email>joeran.beelg@tcd.ie</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Trinity College Dublin</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Syllabuses for curriculum learning have been developed on an ad-hoc, per-task basis and little is known about the relative performance of different syllabuses. We identify a number of syllabuses used in the literature. We compare the identified syllabuses based on their effect on the speed of learning and generalization ability of a LSTM network on three sequential learning tasks. We find that the choice of syllabus has limited effect on the generalization ability of a trained network. In terms of speed of learning our results demonstrate that the best syllabus is task dependent, but that a recently proposed automated curriculum learning approach, Prediction Gain, performs very competitively against all identified hand-crafted syllabuses. The best performing hand-crafted syllabus, which we term Look Back and Forward, combines a syllabus which steps through tasks in the order of their difficulty with a uniform distribution over all tasks. Our experimental results provide an empirical basis for the choice of syllabus on a new problem that could benefit from curriculum learning. Additionally, insights derived from our results shed light on how to successfully design new syllabuses.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Curriculum learning is an approach to training neural networks inspired by
human learning. Humans learn skills such as maths and languages by tackling
increasingly difficult tasks. It has long been proposed [
        <xref ref-type="bibr" rid="ref17 ref5">5, 17</xref>
        ] that neural networks
could benefit from a similar approach to learning. Curriculum learning involves
presenting examples in some order during training so as to aid learning. On
some tasks, curriculum learning has been shown to be necessary for learning
[
        <xref ref-type="bibr" rid="ref10 ref18 ref25">10, 25, 18</xref>
        ] and to improve learning speed [
        <xref ref-type="bibr" rid="ref24 ref6">6, 24</xref>
        ] and generalization ability [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] on
other tasks.
      </p>
      <p>
        For any particular application of curriculum learning the two key components
of this approach to training neural networks are the division of the training data
into tasks of varying difficulty and the definition of a syllabus. We follow [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] in
defining a syllabus to be a "time-varying sequence of distributions over tasks",
where a task is defined to be a subset of all training examples. At any instant
during training a batch, called an example, is drawn from some task and presented
to the network. For example, in machine translation we could split the training
data of source and target language pairs into tasks defined by the length of the
source sentence. We could then define a syllabus in which the network was first
trained on short sentences, gradually moving on to longer, potentially more
difficult sentences.
      </p>
      <p>
        In this work we focus on the effect of syllabus design on curriculum
learning. Syllabuses fall into two main categories. Hand-crafted syllabuses [
        <xref ref-type="bibr" rid="ref10 ref24 ref25 ref3">24, 25, 10, 3</xref>
        ]
define an ordering of tasks and a level of success to be attained on each task
before moving on to the next point in the syllabus. The level of success on a task
is typically evaluated periodically on a validation set.
      </p>
      <p>
        Automated syllabuses [
        <xref ref-type="bibr" rid="ref19 ref6">6, 19</xref>
        ] attempt to solve the problem of having to
hand-design a syllabus and choose the level of success that determines progression from
task to task when using a hand-crafted syllabus. Automated syllabuses
only require the engineer to define a set of tasks (potentially ordered); examples
are then chosen by some automatic mechanism, e.g. in proportion to the error
rate on each task.
      </p>
      <p>
        The choice of syllabus has a significant effect on the efficacy of curriculum
learning for a particular problem [
        <xref ref-type="bibr" rid="ref24 ref6">6, 24</xref>
        ]. Syllabuses have been primarily
developed on an ad-hoc basis for individual problems and only limited empirical
comparison of different syllabuses has been conducted [
        <xref ref-type="bibr" rid="ref24 ref6">6, 24</xref>
        ]. Additionally,
explanations for why curriculum learning works have mostly been grounded in
learning theory [
        <xref ref-type="bibr" rid="ref14 ref3 ref5">5, 14, 3</xref>
        ] rather than empirical results.
      </p>
      <p>
        We consider curriculum learning in two settings [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. In the multi-task setting,
after training we are concerned with the performance of the network on all tasks,
whereas in the target-task setting, after training we are concerned only with the
performance of the network on a single "target" task.
      </p>
      <p>
        In this work we evaluate how the choice of syllabus affects the learning speed
and generalization ability of neural networks in the multi-task and target-task
settings. We train LSTM [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] networks on three synthetic sequence learning tasks
[
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Sequence learning tasks are naturally suited to curriculum learning as it is
often assumed that the difficulty of the task increases with the length of the input
sequence [
        ]. LSTMs have achieved state of the art performance in
many sequence learning tasks [
        <xref ref-type="bibr" rid="ref23 ref7 ref8">7, 23, 8</xref>
        ]. It is common to use LSTMs trained on
synthetic tasks for experimental results in curriculum learning [
        <xref ref-type="bibr" rid="ref14 ref21 ref24 ref4 ref6">24, 6, 4, 21, 14</xref>
        ].
      </p>
      <p>Our results provide an extensive empirical comparison of the effect of
curriculum learning syllabuses on speed of learning and generalization ability. Our
results reveal insights into why curriculum learning outperforms standard
training approaches and into the key components of successful syllabuses. Thus our work
provides evidence for the choice of syllabus in new applications of curriculum
learning and provides a basis for future work in syllabus design.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        Bengio, et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] examined curriculum learning under hand-crafted syllabuses.
They found that a Perceptron trained on a toy task with a syllabus which steps
through tasks in their order of difficulty learns faster than a Perceptron trained
on randomly sampled examples. Their work demonstrated the potential utility of
curriculum learning, but the division between curriculum learning and transfer
learning was unclear in their experiments and only a basic hand-crafted syllabus
was considered.
      </p>
      <p>
        Zaremba and Sutskever [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ] showed that curriculum learning was necessary
for a LSTM network to learn to "execute" simple Python programs. The authors
provide an empirical comparison of two hand-crafted syllabuses. Their problem
had two dimensions of difficulty: the length of the program and the level of
nesting. They defined a syllabus, which they call the Combined Strategy, that
combines the basic approach of stepping through the tasks one at a time with
a uniform distribution over all tasks. The Combined Strategy syllabus led to
faster learning and better final performance on the target task than baseline
strategies.
      </p>
      <p>
        The authors hypothesize that by including examples from tasks more difficult
than the current task, the LSTM network has an incentive to use only a portion
of its memory to solve the current task in the syllabus. The authors argue that
syllabuses which do not provide tasks from future points on the syllabus incur
a significant time penalty at each point of progression, as the network must
undergo substantial retraining of its weights to adapt to the new task.
In a separate work, the same authors proposed a slightly more complex syllabus
with the same goal of sampling tasks based on a syllabus with "non-negligible
mass over the hardest difficulty levels", which enabled reinforcement learning of
a discrete Neural Turing Machine [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ], where a network trained directly on the
target task failed to learn.
      </p>
      <p>
        Zaremba and Sutskever only considered syllabuses with a linear progression
through the tasks in order of difficulty. Yet it has been argued that linear
progression through the tasks in a syllabus can result in quadratic training time in
the length of the sequences for a given problem [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. By following a simple syllabus
of drawing tasks from a uniform distribution over all tasks up to the current
task and exponentially increasing the level of difficulty of the problem at each
progression point, it is possible to scale a variant of the Neural Turing Machine
[
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] to sequences of length 4,000 [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ].
      </p>
      <p>
        Graves, et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] propose an automated syllabus by treating the selection of
the task from which to draw the next example as a stochastic adversarial
multi-armed bandit problem [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Their paper examines six definitions of a reward
function to promote learning. The reward functions split into two classes: those
driven by the improvement in the loss of the network as a result of being trained
on an example from a particular task, and those driven by an increase in network
complexity, motivated by the Minimum Description Length principle [
        ]. The
authors show that the best reward function from each class results in non-uniform
syllabuses that on some tasks converge to an optimal solution twice as fast as
a baseline uniform sampling syllabus. Across all tasks, however, the authors find
that the uniform sampling syllabus provides a strong baseline. Other automated
syllabuses have been used successfully [
        <xref ref-type="bibr" rid="ref15 ref19">19, 15</xref>
        ].
      </p>
      <p>
        Curriculum learning is not widely applied in practical machine learning and
has not always been successfully applied in research settings either [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
"Catastrophic forgetting" of previously learned tasks may be responsible for poor
generalization performance after training [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. Developing syllabuses robust to
catastrophic forgetting may be critical to achieving wider application of curriculum
learning.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Methodology</title>
      <sec id="sec-3-1">
        <title>Syllabuses</title>
        <p>
          We considered six syllabuses that capture the main features of those identified
in the literature. The six syllabuses include three hand-crafted, one automated
and two benchmark syllabuses. Below, we follow the notation that there are
T tasks on the syllabus, ordered in difficulty from 1 to T, and that
the learner's current task is denoted C. A full distribution over tasks for each
syllabus is given in table 1.
Under the Naive Strategy proposed by Zaremba and Sutskever [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ],
examples at each timestep are sampled solely from the current point C on the
syllabus. Once the learner reaches the defined level of success for progression
on the current task, as measured on a validation set, the learner moves on to the
next task on the syllabus and all examples are then drawn from that task. We
call this syllabus Naive.
        </p>
        <p>
          It was observed that while the Naive syllabus increased the rate of learning
on certain tasks, the learner rapidly forgot previously learned tasks [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ]. This is
undesirable in the multi-task setting, and it was proposed [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] that drawing some
examples from previous tasks on the syllabus may prevent this catastrophic
forgetting. We define a syllabus which we call Look Back, in which a fixed percentage
of examples is drawn from a uniform distribution over all previous tasks on
the syllabus and the remainder are drawn from the current task on the syllabus,
as per the Naive syllabus. In practice, in our experiments we chose to draw 10%
of examples from previous points on the syllabus, with the remaining 90% coming
from the current task.
        </p>
        <p>
          While the Look Back syllabus addresses the issue of catastrophic forgetting,
it was further hypothesized that by only drawing examples from the current
and past tasks in the syllabus, considerable retraining of the network weights
would be required as the learner moved forward through the syllabus [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ]. A
syllabus which we call Look Back and Forward addresses this issue. Look Back
and Forward corresponds to the syllabus which Zaremba and Sutskever call the
Combined Strategy [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ]. With the Look Back and Forward syllabus, a fixed
percentage of examples is drawn from a uniform distribution over all tasks on
the syllabus, with the remaining examples drawn from the current task. Thus,
on the early tasks in the syllabus almost all examples drawn from the
uniform distribution will come from future tasks on the syllabus; once the
learner approaches the target task, almost all such examples will come from
previously learned tasks. In this way the Look Back and Forward syllabus seeks
to address both the issue of catastrophic forgetting and the potential retraining
required when the learner is not given examples from upcoming tasks on the
syllabus. In our experiments, we drew 20% of examples from the uniform
distribution over all tasks, with the remaining 80% coming from the current task
on the syllabus. We note that for the Look Back and Look Back and Forward
syllabuses, we chose the percentage splits through experimentation.
        </p>
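<p>The sampling rules for the three hand-crafted syllabuses can be sketched as follows. This is a minimal illustration: the function name and task-index convention are ours, and the mix fractions correspond to the 10% and 20% splits used in our experiments.</p>

```python
import random

def sample_task(syllabus, current, num_tasks, mix):
    """Return the task (1..num_tasks) from which to draw the next training batch.

    `current` is the learner's current task C on the syllabus; `mix` is the
    fraction of batches drawn away from the current task (0.1 for Look Back,
    0.2 for Look Back and Forward in the experiments above).
    """
    if syllabus == "naive":
        # Naive: every batch comes from the current task only.
        return current
    if syllabus == "look_back":
        # Look Back: mix% uniformly over *previous* tasks, rest from current.
        if current > 1 and random.random() < mix:
            return random.randint(1, current - 1)
        return current
    if syllabus == "look_back_and_forward":
        # Look Back and Forward: mix% uniformly over *all* tasks, rest from current.
        if random.random() < mix:
            return random.randint(1, num_tasks)
        return current
    raise ValueError(f"unknown syllabus: {syllabus}")
```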
        <p>
          We adopt the best performing automated syllabus consistent with maximum
likelihood training from recent work on automated curriculum learning [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], which
the authors call Prediction Gain. The authors follow their general approach of
selecting the next task based on the Exp3.S algorithm [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] for stochastic
adversarial multi-armed bandits. The reward function for Prediction Gain is defined
to be the scaled reduction in loss L on the same example x after training on that
example, i.e. the reward to the bandit is defined to be L(x, θ) − L(x, θ′), rescaled
to the [−1, 1] range by a simple procedure [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], where θ and θ′ are the weights
parameterizing the network before and after training on x respectively.
        </p>
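<p>The reward computation can be illustrated as follows. Note this is a sketch: the rescaling here uses a running min/max of observed rewards, a simplification of the quantile-based procedure in the cited work, and the function name is ours.</p>

```python
def prediction_gain_reward(loss_before, loss_after, history):
    """Raw Prediction Gain reward: the reduction in loss on example x from
    training on x, i.e. L(x, theta) - L(x, theta'). Rescaling to [-1, 1]
    here uses the running min/max of observed rewards (an illustrative
    simplification, not the exact procedure of the original paper)."""
    raw = loss_before - loss_after
    history.append(raw)
    lo, hi = min(history), max(history)
    if hi == lo:
        return 0.0  # no spread observed yet
    # Linearly map raw into [-1, 1] relative to the observed reward range.
    return 2.0 * (raw - lo) / (hi - lo) - 1.0
```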
        <p>None is our benchmark syllabus in the target-task setting, for which we draw
all examples from the target task in the syllabus.</p>
        <p>
          As our benchmark syllabus in the multi-task setting we draw examples from a
Uniform distribution over all tasks. Unsurprisingly, we call this syllabus Uniform.
This syllabus can also be seen as a simple syllabus in the target-task setting and
as mentioned above has been found to be a strong benchmark in this setting [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ].
        </p>
        <p>
          In practice, for the above hand-crafted syllabuses we found, as other authors
have [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ], that learning was slow if the learner progressed through the syllabus
one task at a time. To alleviate this, for each of these syllabuses, after meeting
the defined success metric on a task, instead of moving on to the next task in the
syllabus we followed an exponential strategy of doubling the current point on the
syllabus along one dimension of difficulty of the problem. For the Repeat Copy
problem (described below), which has two dimensions of difficulty, we alternated
which dimension of difficulty to double.
        </p>
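<p>The exponential progression strategy amounts to the following rule (a minimal sketch; function and argument names are ours):</p>

```python
def advance(current_difficulty, validation_error, threshold, target_difficulty):
    """Exponential progression through the syllabus: once validation error on
    the current task meets the progression threshold, double the current
    difficulty (e.g. sequence length) rather than incrementing it, capped at
    the target task's difficulty."""
    if validation_error <= threshold:
        return min(2 * current_difficulty, target_difficulty)
    return current_difficulty
```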
      </sec>
      <sec id="sec-3-2">
        <title>Benchmark Problems</title>
        <p>
          As noted above, curriculum learning has primarily been applied to sequence
learning problems, where the sequence length is typically used to divide the data into
tasks ordered by difficulty. We follow this approach by adopting three synthetic
sequence learning problems that have been shown to be difficult for LSTMs to
solve [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ].
        </p>
        <p>Copy - for the Copy problem, the network is fed a sequence of random bit
vectors followed by an end of sequence marker. The network must then output
the input sequence. This requires the network to store the input sequence and
then read it back from memory. In our experiments we trained on input sequences
up to 32 in length with 8-dimensional random bit vectors.</p>
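<p>A Copy-problem example might be generated along the following lines. This is a sketch under our own encoding assumptions: one extra input channel carries the end-of-sequence marker, and the input is zero-padded while the network writes its output.</p>

```python
import random

def copy_example(seq_len, vec_dim=8):
    """One Copy example: the network sees seq_len random bit vectors plus an
    end-of-sequence marker, then must reproduce the sequence from memory."""
    seq = [[random.randint(0, 1) for _ in range(vec_dim)] for _ in range(seq_len)]
    eos = [0] * vec_dim + [1]                        # marker on an extra channel
    pad = [0] * (vec_dim + 1)
    inputs = [v + [0] for v in seq] + [eos] + [pad] * seq_len
    targets = [[0] * vec_dim] * (seq_len + 1) + seq  # output begins after the marker
    return inputs, targets
```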
        <p>Repeat Copy - similarly to the Copy problem, with Repeat Copy the
network is fed an input sequence of random bit vectors. Unlike the Copy problem,
this is followed by a scalar that indicates how many times the network should
repeat the input sequence in its output sequence. In our experiments we trained
on input sequences up to 13 in length with the maximum number of repeats set to
13, again with 8-dimensional random bit vectors. This means the target task for
Repeat Copy required an output sequence of length 169.</p>
        <p>Associative Recall - Associative Recall is also a sequence learning problem,
with sequences consisting of random bit vectors. In this case the inputs are
divided into items, with each item consisting of 3 x 6-dimensional vectors. After
being fed a sequence of items and an end of sequence marker, the network is
then fed a query item which is an item from the input sequence. The correct
output is the next item in the input sequence after the query item. We trained
on sequences of up to 12 items, thus our target task contained input sequences
of 36 6-dimensional random bit vectors followed by the query item.
</p>
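<p>An Associative Recall example can be generated along these lines (a sketch; the flattening of items into network inputs with delimiter channels is omitted, and the function name is ours):</p>

```python
import random

def recall_example(num_items, vecs_per_item=3, vec_dim=6):
    """One Associative Recall example: a sequence of items (each consisting of
    3 random 6-dimensional bit vectors), a query item drawn from the sequence,
    and the item that follows the query as the target output."""
    items = [[[random.randint(0, 1) for _ in range(vec_dim)]
              for _ in range(vecs_per_item)] for _ in range(num_items)]
    q = random.randrange(num_items - 1)  # query any item except the last
    return items, items[q], items[q + 1]
```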
      </sec>
      <sec id="sec-3-3">
        <title>Experiments</title>
        <p>For all of the above problems we ran experiments to measure the effect of
syllabus choice on speed of learning in both the target-task and multi-task settings
and on the generalization ability of the trained network. For each problem-syllabus
pair we ran training from 10 different random initializations.</p>
        <p>In order to measure the speed of learning we measured the performance
during training every 200 steps on two held-out validation sets, one for the
target-task and one for the multi-task setting. The validation set for the
target-task setting consisted solely of examples from the target task on that problem,
whereas the validation set for the multi-task setting consisted of examples
uniformly distributed over all tasks in the syllabus. The target-task and multi-task
validation sets contained 512 and 1024 examples respectively for all
experiments.</p>
        <p>To test the effect of syllabus choice on the generalization ability of our
networks, for each problem we created test sets of 384 examples which gradually
increased the difficulty of the problem. For the Copy problem, which was trained
with a target task of sequences of length 32, the test set comprised sequences
of length 40.
1 Source code to replicate our experiments can be found here: https://github.com/MarkPKCollier/CurriculumLearningFYP</p>
        <p>For Repeat Copy we wished to test the generalization ability of the trained
networks along both dimensions of difficulty of the problem: sequence length
and number of repeats. We created two test sets, one comprising sequences of
length 16 with the number of repeats fixed at 13, the other comprising
sequences of length 13 with the number of repeats set to 16. Our test set for
Associative Recall consisted of sequences of 16 items, the networks having been trained with
a target task of 12 items. We measured the performance of each network on the
test set at the point where the network's error on the validation set was lowest.</p>
        <p>
          In all our experiments we used a stacked 3 x 256 unit LSTM network with
a cross-entropy loss function. For all networks and syllabuses we used the Adam
optimizer [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] with an initial learning rate of 0.01. We used a batch size of 32 for
all experiments, with each batch containing only examples from a single task.
We defined the maximum error permitted on a task before progression to the next
point on the syllabus to be 2 bit errors per sequence for the Copy problem and
1 bit error per sequence on the Repeat Copy and Associative Recall problems.
        </p>
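<p>The progression criterion above can be made concrete with a small helper. This is a sketch: the function names and the nested-list batch representation are ours; the thresholds are the per-problem values stated above.</p>

```python
def bit_errors_per_sequence(batch_preds, batch_targets):
    """Mean number of incorrectly predicted bits per sequence, where each
    batch element is a sequence of bit vectors."""
    errors = 0
    for pred_seq, tgt_seq in zip(batch_preds, batch_targets):
        for p_vec, t_vec in zip(pred_seq, tgt_seq):
            errors += sum(p != t for p, t in zip(p_vec, t_vec))
    return errors / len(batch_targets)

# Maximum validation error allowed before progressing on the syllabus.
PROGRESSION_THRESHOLDS = {"copy": 2.0, "repeat_copy": 1.0, "associative_recall": 1.0}

def may_progress(problem, validation_error):
    return validation_error <= PROGRESSION_THRESHOLDS[problem]
```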
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Results</title>
      <sec id="sec-4-1">
        <title>Copy</title>
        <p>As expected, in the multi-task setting training directly on the target task
performs very poorly when evaluated on all tasks. Despite making some progress
towards the target task, the Naive syllabus shows almost no improvement on
random performance in the multi-task setting. This demonstrates that the
network rapidly forgets previously learned tasks if not shown further examples of
them. Prediction Gain, Uniform and Look Back and Forward all converge to
near zero error in the multi-task setting at similar rates (fig. 1).</p>
        <p>Figure 2 shows the range of error achieved by the 10 trained networks for each
syllabus when the networks are asked to generalize to sequences of length 40 on
the Copy problem. Despite the networks of three syllabuses converging to near
zero error on the target task with sequences of length 32 none of the networks
succeed in generalizing to sequences of length 40. On sequences of length 40, the
Prediction Gain and the Uniform syllabus demonstrate similar performance and
have approximately 1.35-1.59 times lower median error than the Look Back and
Forward, None and Look Back Syllabuses. There is substantial overlap in the
range of generalization error for all syllabuses, so we cannot say that any one
syllabus clearly outperforms the others in terms of improving generalization on
the Copy problem.
Figure 3 shows the median learning curves for each syllabus on the Repeat Copy
problem for the target-task and multi-task setting. Unlike for the Copy problem
In the target-task setting for the Repeat Copy problem, the network fails to
learn a solution in the time provided by training directly on the target task, g.
3. Interestingly the three hand-crafted syllabuses converge to near zero error at
approximately the same time and 3.5 times faster than the Uniform syllabus
which is the next fastest. This is a clear win for the hand-crafted syllabuses over
the benchmark and automated syllabuses.</p>
        <p>
          The Uniform syllabus converges to near zero error twice as fast as Prediction
Gain. It is unclear why Prediction Gain converges more slowly, although we
posit a potential explanation: scaling the rewards by the length of the input
sequence, as per the specification [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] may bias the bandit towards tasks with high
repeats, as such tasks incur no penalty for their added difficulty. This highlights
that, despite the automated nature of Prediction Gain's syllabus generation, it
still relies on several tunable parameters.
        </p>
        <p>Despite the Uniform syllabus' slow convergence on the target task, in the
multi-task setting training on the same distribution as the test distribution is
beneficial, as would be expected in non curriculum learning settings (fig. 3). The
Uniform syllabus reaches near zero error 1.35 times faster than Look Back and
Forward, the next fastest syllabus.</p>
        <p>
          All the syllabuses except Prediction Gain and None consistently generalize with
near zero error to sequences of length 16 on Repeat Copy (fig. 4). Prediction
Gain's wide generalization error range is explained by the fact that training
with Prediction Gain on the Repeat Copy problem is unstable: 4 out of the
10 training runs failed to converge to near zero error. When we instead attempt
to increase the number of repeats, all the syllabuses fail to generalize successfully
(fig. 4); this is consistent with previous results [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ].
(Fig. 4: (a) generalization to input sequences of length 16, with the number of repeats fixed at 13; (b) generalization to 16 repeats, with the input sequence length fixed at 13.)
Only Prediction Gain learns a near zero error solution to the Associative Recall problem in either setting.
The Uniform and Look Back and Forward syllabuses perform similarly in both
settings and make some progress towards a low error solution. The Naive syllabus
makes some progress towards learning the target task and perhaps given enough
time would learn a solution to the task. The Look Back and None syllabuses
make very limited progress from random initialization in either setting.
        </p>
        <p>
          Despite converging to a near zero error solution in both the target-task and
multi-task settings, figure 6 shows that Prediction Gain fails to generalize
to sequences of 16 items. As expected, the other syllabuses, which
do not reach near zero error on the target task, also exhibit similarly high
generalization error.
The above experiments demonstrate that no single syllabus in our comparison
has an advantage on all problems (Copy, Repeat Copy, Associative Recall) or in all
settings (target-task, multi-task). Over the three problems, the Look Back and
Forward syllabus learned consistently faster than the other two hand-crafted
syllabuses, Look Back and Naive. Prediction Gain was the fastest of the non-benchmark
syllabuses to converge to near zero error solutions on the Copy and Associative
Recall problems, but the opposite was true on Repeat Copy (where training with
Prediction Gain was unstable). We found that the Uniform syllabus provided
a strong baseline in the target-task setting, which agrees with previous results
[
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. In the multi-task setting, uniformly sampling from all tasks led to the
fastest learning on two of the three problems. No syllabus consistently improved
generalization on any of the above problems.
        </p>
        <p>Including examples from previously learned tasks is vital to prevent
catastrophic forgetting of those tasks. For example, the Naive syllabus, which makes
progress towards the target task on each benchmark problem, is among the two
slowest learners in the multi-task setting on all problems. This demonstrates
that the network rapidly forgets previously learned tasks when it is no longer
presented with them.</p>
        <p>Including examples from future tasks on the syllabus provided substantial
increases in speed of learning. In particular, the Look Back and Forward syllabus
converged to a lower error solution in both settings on all problems than the Look
Back syllabus, which does not include examples from future tasks on the syllabus
but is otherwise identical.</p>
        <p>Despite being compared to stronger syllabuses than in the original paper,
Prediction Gain performed competitively. We conclude that automated approaches
to syllabus design in curriculum learning may be a fruitful future area of
development but that further work is required on how to set the hyperparameters
governing the existing automated approaches.</p>
        <p>Our empirical results provide a basis for the choice of syllabus in new
applications of curriculum learning. In the multi-task setting we recommend
using the non curriculum learning syllabus of uniformly sampling from all tasks.
Other syllabuses may provide marginal gains on some problems in the
multi-task setting, but this is not reliable across all problems and requires additional
hyperparameter tuning. We recommend that when applying curriculum learning
to problems in the target-task setting, practitioners use either Prediction Gain
or the Look Back and Forward syllabus.</p>
        <p>Acknowledgements This publication emanated from research conducted with
the financial support of Science Foundation Ireland (SFI) under Grant Number
13/RC/2106.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Auer</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cesa-Bianchi</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Freund</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schapire</surname>
            ,
            <given-names>R.E.</given-names>
          </string-name>
          :
          <article-title>The nonstochastic multiarmed bandit problem</article-title>
          .
          <source>SIAM journal on computing 32(1)</source>
          ,
          <volume>48</volume>
          -
          <fpage>77</fpage>
          (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Avramova</surname>
          </string-name>
          , V.:
          <article-title>Curriculum learning with deep convolutional neural networks (</article-title>
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>3. <string-name><surname>Bengio</surname>, <given-names>Y.</given-names></string-name>, <string-name><surname>Louradour</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Collobert</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Weston</surname>, <given-names>J.</given-names></string-name>: <article-title>Curriculum learning</article-title>. <source>In: Proceedings of the 26th Annual International Conference on Machine Learning</source>. pp. <fpage>41</fpage>–<lpage>48</lpage>. ACM (<year>2009</year>)</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Cirik</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hovy</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Morency</surname>
            ,
            <given-names>L.P.</given-names>
          </string-name>
          :
          <article-title>Visualizing and understanding curriculum learning for long short-term memory networks</article-title>
          .
          <source>arXiv preprint arXiv:1611.06204</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Elman</surname>
            ,
            <given-names>J.L.</given-names>
          </string-name>
          :
          <article-title>Learning and development in neural networks: The importance of starting small</article-title>
          .
          <source>Cognition</source> <volume>48</volume>(<issue>1</issue>), <fpage>71</fpage>–<lpage>99</lpage> (<year>1993</year>)
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Graves</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bellemare</surname>
            ,
            <given-names>M.G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Menick</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Munos</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kavukcuoglu</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Automated curriculum learning for neural networks</article-title>
          .
          <source>arXiv preprint arXiv:1704.03003</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Graves</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liwicki</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fernández</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bertolami</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bunke</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name><surname>Schmidhuber</surname>, <given-names>J.</given-names></string-name>: <article-title>A novel connectionist system for unconstrained handwriting recognition</article-title>. <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source> <volume>31</volume>(<issue>5</issue>), <fpage>855</fpage>–<lpage>868</lpage> (<year>2009</year>)
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Graves</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name><surname>Mohamed</surname>, <given-names>A.r.</given-names></string-name>, <string-name><surname>Hinton</surname>, <given-names>G.</given-names></string-name>: <article-title>Speech recognition with deep recurrent neural networks</article-title>. <source>In: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source>. pp. <fpage>6645</fpage>–<lpage>6649</lpage>. IEEE (<year>2013</year>)
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Graves</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wayne</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Danihelka</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Neural turing machines</article-title>
          .
          <source>arXiv preprint arXiv:1410.5401</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Graves</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wayne</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Reynolds</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Harley</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Danihelka</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grabska-Barwińska</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Colmenarejo</surname>
            ,
            <given-names>S.G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grefenstette</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ramalho</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name><surname>Agapiou</surname>, <given-names>J.</given-names></string-name>: <article-title>Hybrid computing using a neural network with dynamic external memory</article-title>. <source>Nature</source> <volume>538</volume>(<issue>7626</issue>), <fpage>471</fpage> (<year>2016</year>)
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11. Grunwald, P.D.:
          <article-title>The minimum description length principle</article-title>
          . MIT press (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Hochreiter</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name><surname>Schmidhuber</surname>, <given-names>J.</given-names></string-name>: <article-title>Long short-term memory</article-title>. <source>Neural Computation</source> <volume>9</volume>(<issue>8</issue>), <fpage>1735</fpage>–<lpage>1780</lpage> (<year>1997</year>)
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Kingma</surname>
            ,
            <given-names>D.P.</given-names>
          </string-name>
          ,
          <string-name><surname>Ba</surname>, <given-names>J.</given-names></string-name>:
          <article-title>Adam: A method for stochastic optimization</article-title>
          .
          <source>arXiv preprint arXiv:1412.6980</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Krueger</surname>
            ,
            <given-names>K.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dayan</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Flexible shaping: How learning in small steps helps</article-title>
          .
          <source>Cognition</source> <volume>110</volume>(<issue>3</issue>), <fpage>380</fpage>–<lpage>394</lpage> (<year>2009</year>)
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Matiisen</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Oliver</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cohen</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name><surname>Schulman</surname>, <given-names>J.</given-names></string-name>:
          <article-title>Teacher-student curriculum learning</article-title>
          .
          <source>arXiv preprint arXiv:1707.00183</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>McCloskey</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name><surname>Cohen</surname>, <given-names>N.J.</given-names></string-name>: <article-title>Catastrophic interference in connectionist networks: The sequential learning problem</article-title>. In: <source>Psychology of Learning and Motivation</source>, vol. <volume>24</volume>, pp. <fpage>109</fpage>–<lpage>165</lpage>. Elsevier (<year>1989</year>)
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17. <string-name><surname>Mitchell</surname>, <given-names>T.M.</given-names></string-name>: <article-title>The need for biases in learning generalizations</article-title>. Department of Computer Science, Laboratory for Computer Science Research, <source>Rutgers Univ., New Jersey</source> (<year>1980</year>)
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Rae</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hunt</surname>
            ,
            <given-names>J.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Danihelka</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Harley</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Senior</surname>
            ,
            <given-names>A.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wayne</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Graves</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lillicrap</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Scaling memory-augmented neural networks with sparse reads and writes</article-title>
          .
          <source>In: Advances in Neural Information Processing Systems</source>
          . pp. <fpage>3621</fpage>–<lpage>3629</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19. <string-name><surname>Reed</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>De Freitas</surname>, <given-names>N.</given-names></string-name>: <article-title>Neural programmer-interpreters</article-title>. <source>arXiv preprint arXiv:1511.06279</source>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20. <string-name><surname>Rissanen</surname>, <given-names>J.</given-names></string-name>: <article-title>Stochastic complexity and modeling</article-title>. <source>The Annals of Statistics</source> pp. <fpage>1080</fpage>–<lpage>1100</lpage> (<year>1986</year>)
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Spitkovsky</surname>
            ,
            <given-names>V.I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Alshawi</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jurafsky</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>From baby steps to leapfrog: How less is more in unsupervised dependency parsing</article-title>. <source>In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics</source>. pp. <fpage>751</fpage>–<lpage>759</lpage>. Association for Computational Linguistics (<year>2010</year>)
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name><surname>Williams</surname>, <given-names>R.J.</given-names></string-name>: <article-title>Simple statistical gradient-following algorithms for connectionist reinforcement learning</article-title>, pp. <fpage>5</fpage>–<lpage>32</lpage>. Springer (<year>1992</year>)
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schuster</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Le</surname>
            ,
            <given-names>Q.V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Norouzi</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Macherey</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Krikun</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cao</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gao</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Macherey</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Google's neural machine translation system: Bridging the gap between human and machine translation</article-title>
          .
          <source>arXiv preprint arXiv:1609.08144</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24.
          <string-name>
            <surname>Zaremba</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sutskever</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Learning to execute</article-title>
          .
          <source>arXiv preprint arXiv:1410.4615</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          25.
          <string-name>
            <surname>Zaremba</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sutskever</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Reinforcement learning neural turing machines-revised</article-title>
          .
          <source>arXiv preprint arXiv:1505.00521</source>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>