<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>RAL - Improving Stream-Based Active Learning by Reinforcement Learning</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sarah Wassermann</string-name>
          <email>sarah@wassermann.lu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Thibaut Cuvelier</string-name>
          <email>thibaut.cuvelier@supelec.fr</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pedro Casas</string-name>
          <email>pedro.casas@ait.ac.at</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>AIT Austrian Institute of Technology - Vienna</institution>
          ,
          <country country="AT">Austria</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>CentraleSup ́elec - Gif-sur-Yvette</institution>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <fpage>32</fpage>
      <lpage>47</lpage>
      <abstract>
        <p>One of the main challenges associated with supervised learning under dynamic scenarios is that of periodically getting access to labels of fresh, previously unseen samples. Labeling new data is usually an expensive and cumbersome process, while not all data points are equally valuable. Active learning aims at labeling only the most informative samples to reduce cost. In this paradigm, a learner can choose from which new samples it wants to learn, and can obtain the ground truth by asking an oracle for the corresponding labels. We introduce RAL - Reinforced stream-based Active Learning -, a new active-learning approach, coupling stream-based active learning with reinforcement-learning concepts. In particular, we model active learning as a contextual-bandit problem, in which rewards are based on the usefulness of the system's querying behavior. Empirical evaluations on multiple datasets confirm that RAL outperforms the state of the art, both by improving learning accuracy and by reducing the number of requested labels. As an additional contribution, we release RAL as an open-source Python package to the machine-learning community.</p>
      </abstract>
      <kwd-group>
        <kwd>Stream-based active learning</kwd>
        <kwd>Reinforcement learning</kwd>
        <kwd>Bandits</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>A wide range of popular end-user applications such as Skype or Facebook
Messenger request users’ feedback about their experience at the end of a session.
This user feedback is paramount for such services, as they may use the massively
collected information to build prediction models shedding light on service
performance, user engagement, preferences, etc. A major challenge when querying
users is the fact that asking too often is annoying and might have negative
consequences on engagement. In a general supervised-learning scenario, labeled
data is extremely important, but labeling data is an expensive and tedious task,
especially if based on human effort.</p>
      <p>
        Active learning [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] offers a solution to the data exploration and exploitation
trade-off, allowing a learner to interactively query the label of only the most
informative samples. Active learning is generally used in an offline setup, to limit
the number of labels required from a predefined set of unlabeled samples.
c 2019 for this paper by its authors. Use permitted under CC BY 4.0.
      </p>
    </sec>
    <sec id="sec-2">
      <title>R2AL - RSe.inWfoarscseerdmsatnrneaemt-abl.ased Active Learning</title>
      <p>The learning tasks we tackle in this paper have two specific characteristics
which are not covered by offline active learning: first, the learning data is
time-andsize unbounded; second, it comes in the form of an online, non-periodic, sequential
stream of samples. An appropriate learning approach should be able to track the
evolution of the oracle feedback and the underlying concept drifts. In addition, it
should do so by smartly and dynamically deciding at which specific times it is
better to query the oracle. This would constantly improve the model-prediction
capabilities and avoid unnecessarily querying the oracle.</p>
      <p>
        Stream-based active-learning schemes have been proposed in the past, notably
in [
        <xref ref-type="bibr" rid="ref16 ref17">16,17</xref>
        ]. However, as in traditional active learning, the decision on whether to
request for a label or not is solely based on the the model’s prediction uncertainty,
and not on the informativeness of such a request. The question that is raised is:
can we improve the performance of stream-based active learning by considering
how informative the requested label was? We propose RAL – a novel Reinforced
stream-based Active-Learning approach – combining both traditional
streambased active-learning techniques with reinforcement learning. While reinforcement
learning has already been used to solve pool-based active-learning problems, the
combination with stream-based active-learning algorithms is a mostly unexplored
topic. RAL is specifically designed for stream-based, time-unbounded data, and,
as we show next, it manages to improve the accuracy of the underlying
supervisedlearning model, while also significantly reducing the number of requested labels,
compared to the state of the art. RAL is available as an open-source Python
package3.
2
      </p>
      <sec id="sec-2-1">
        <title>Related Work</title>
        <p>
          Many research efforts have already been undertaken in the field of active learning.
For example, [
          <xref ref-type="bibr" rid="ref16 ref17">16,17</xref>
          ] present three simple approaches for this learning paradigm.
Their proposed Randomized Variable Uncertainty approach tackles the problem
of stream-based active learning using the model’s prediction uncertainty to
decide whether to query and trying to detect concept drifts by randomizing the
certainty threshold used for querying decisions. [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] develops an active-learning
algorithm with two different classifiers: one “reactive” and one “stable”. The
stable classifier is trained on all available labeled instances, while the reactive
one learns on a window of recent instances. In [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], the authors present an
activelearning technique based on clustering and prediction uncertainty. [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] conceives
an approach relying on a modification of the Na¨ıve Bayes classifier to update
the different learners through the queried samples. In particular, the author uses
one-versus-one classifiers to tackle multi-class problems and update the weights
of the different classifiers by comparing their predictions to the ground truth.
Krawczyk’s technique behaves similarly to ours. However, the major difference is
that he is using information about the classifiers’ prediction certainty (without
considering the corresponding weights) in order to adapt the minimum threshold
3 https://github.com/SAWassermann/RAL
to query the oracle, while we rely on the usefulness of the decisions taken by
RAL to tune the system according to the data stream.
        </p>
        <p>
          Extending active learning through reinforcement learning is currently a very
active research area. Active learning alone can easily converge on a policy of
actions that have worked well in the past, but are sub-optimal. Reinforcement
learning helps to improve the exploration-exploitation trade-off by letting the
learner take risks with an uncertain outcome. However, most proposals do not
consider the stream scenario, and operate on the basis of pool-based approaches.
In [
          <xref ref-type="bibr" rid="ref2 ref5">2,5</xref>
          ], authors rely on the multi-armed bandit paradigm. [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] develops ALBL,
which uses a modified version of EXP4 [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], a weight-updating rule, to attribute
adaptive weights to different learners based on rewards; the learner to use is
then determined through these weights; the chosen learner selects the samples in
the pool to hand to the oracle based on its uncertainty measure. The approach
described in [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] is similar to the one in [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], except for the reward-computation
scheme. Some other papers in the field of pool-based active learning are [
          <xref ref-type="bibr" rid="ref20 ref22">20,22</xref>
          ].
The algorithm presented in [
          <xref ref-type="bibr" rid="ref10 ref11">10,11</xref>
          ] relies on the same principles as the system
we are proposing, but tackles a different problem: Song’s goal is to introduce an
active-learning component into a contextual-bandit problem, while we are aiming
at solving an active-learning problem by using contextual bandits.
        </p>
        <p>
          Other recent papers dealing with active learning and reinforcement learning
include [
          <xref ref-type="bibr" rid="ref12 ref18 ref21 ref23">12,18,21,23</xref>
          ]. However, most of them consider only one of the perspectives
addressed by RAL, namely the enhancement of pool-based active learning through
reinforcement learning [
          <xref ref-type="bibr" rid="ref18 ref21 ref23">18,21,23</xref>
          ], or the application of active learning to the
streaming setup [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. Combining active learning with reinforcement learning in a
streaming, adaptive learning context is the most important contribution of RAL,
a very timely yet vaguely addressed problem in the literature. [
          <xref ref-type="bibr" rid="ref19 ref24">19,24</xref>
          ] use the
idea of learning to active learn, i.e. data-driven active learning. [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ] proposes this
view on pool-based active learning: the querying decision for a sample is based
on an estimation of the accuracy improvement. [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ] uses reinforcement learning
in stream-based active one-shot learning, but this work is different from RAL on
multiple aspects: (i) it tackles a different learning task, as it aims at detecting
new classes instead of improving overall classification accuracy, (ii) their scheme
relies on reinforcement learning only during the training phase and not once
deployed, while RAL continuously adapts its querying policy during the whole
incoming stream, and (iii) the system heavily relies on deep recurrent networks,
too cumbersome to use in real-time resource-constrained scenarios, unlike RAL.
3
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>Introducing RAL</title>
        <p>RAL relies on prediction uncertainty and reinforcement-learning principles, using
rewards and bandit algorithms. The overall idea is summarized in Figure 1.</p>
        <p>The intuition behind the different reward values is that we attribute a positive
reward in case our system behaves as expected, and a negative one otherwise,
to penalize it. RAL obtains rewards/penalties only when it is asking for ground
truth. In a nutshell, it earns a positive reward ρ+ in case it queries the oracle</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>R4AL - RSe.inWfoarscseerdmsatnrneaemt-abl.ased Active Learning</title>
      <p>x</p>
      <p>Committee
learners</p>
      <p>Single
learners ŷ</p>
      <p>y
ε-greedy? no</p>
      <p>query?
ρ±
reward / penalty</p>
      <p>Oracle
yes
and would have predicted the wrong label otherwise (the system made the right
decision to ask for the ground truth) and a penalty ρ− (i.e. a negative reward)
when it asks the oracle even though the underlying classification model would
have predicted the correct label (querying was unnecessary). The rationale for
using reinforcement learning is that RAL learns not only based on the queried
samples themselves, but also from the usefulness of its decisions. The objective
function to maximize is the total reward Pn
i=1 rn, where rn is the nth reward
(either ρ+ or ρ−) obtained by RAL.</p>
      <p>The conceived system additionally makes use of the prediction certainty
of the classification models. It is defined as the highest posterior classification
probability among all possible labels for sample x. More formally, the certainty of
a model is equal to maxyˆ P (yˆ|x) with yˆ being one of all the possible labels for x.</p>
      <p>The underlying assumption for designing RAL is based on the rationale that
the model’s uncertainty is an appropriate proxy for assessing the usefulness of a
data point. Indeed, if the learner is uncertain about its prediction, this sample
likely represents a region which has not been explored enough. Adding that
sample with its true label to the training set would improve the overall prediction
accuracy; the alternative is that the uncertainty is due to noise or to concept drift,
these two points being especially challenging in a streaming setting. In RAL, we
combine the aforementioned reward mechanism with the model’s uncertainty to
tune the sample-informativeness heuristic to better guide the query decisions.</p>
      <p>
        Additionally, we implement an ε-greedy policy (also inspired by the bandit
literature [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]) for the sake of data-space exploration: with probability ε, the
system queries the oracle, even if the committee agreed not to query; we call this
the ε-scenario. This ensures that we have a good chance of detecting potential
concept drifts: without this policy, the system could end up being too confident
about its predictions (and thus never ask the oracle again) even though its
estimations are erroneous.
      </p>
      <p>
        In the next two sections, we provide details about the single-classifier and the
committee versions of RAL. We first devise the committee version, of which the
single-classifier alternative is a simple adaptation.
Algorithm 1 RAL algorithm.
1: procedure RAL(x, E, α, θ, ε, η)
2: x: sample to consider
3: E: set of learners, members of the committee
4: α: vector of decision powers of learners in E
5: θ: certainty/querying threshold
6: ε: threshold for ε-greedy
7: η: learning rate for updating decision powers and the querying threshold
8: decisions ← {} . will contain decisions of learners
9: for e ∈ E do
10: decisions[e] ← e.askCertainty(x) &lt; θ . decisions[e] ∈ {0, 1}
11: committeeDecision ← round Pe∈E α[e] · decisions[e]
12: p ← U[
        <xref ref-type="bibr" rid="ref1">0,1</xref>
        ] . random number drawn from a uniform distribution
13: if p &lt; ε or committeeDecision = 1 then . ε-scenario or not?
14: y ← acquireLabel(x)
15: if committeeDecision = 1 then
16: r ← getReward(x, y)
17: α ← updateDecisionPowers(r, E, decisions, committeeDecision, α, η)
18: θ ← min nθ 1 + η × 1 − 2 ρr− , 1o
19:
20: function updateDecisionPowers(r, E, decisions, committeeDecision, α, η)
21: for e ∈ E do
22: if decisions[e] = committeeDecision then
23: α[e] ← α[e] × exp(η × r)
24: α ← α/ Pe∈E α[e]
25: return α
26:
27: function getReward(x, y)
28: return (ρ− if yˆ(x) = y else ρ+)
. EXP4
. normalize each value of α
machine-learning models), referred to as a committee. Each expert gives its
advice for the sample to consider: should the system ask the oracle for feedback
or is the expert confident enough about its prediction? To assess a model’s
prediction certainty, we rely on a certainty threshold θ: if the model’s certainty
is below θ, the model is too uncertain about the prediction to make and thus it
advises that RAL asks the ground truth. The query decision of the committee
takes into account the opinions of the experts, but also their decision power: if
the weighted majority of the experts votes for not querying, RAL will rely on
the label prediction provided by the committee, used in the form of a voting
classifier. The decision power of each expert gets updated such that the experts
which agree with the entire committee are obtaining more power in case that
particular decision is rewarding, i.e. informative (otherwise, these experts get
penalized). These weights are updated through the EXP4 rule [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], with a learning
rate η. RAL does not update the decision powers of the different learners in
the ε-scenario: the committee did not take the querying decision and thus the
weights of the models should not be impacted by this querying action.
      </p>
    </sec>
    <sec id="sec-4">
      <title>R6AL - RSe.inWfoarscseerdmsatnrneaemt-abl.ased Active Learning</title>
      <p>The computation of the reward is carried out every time the committee
decided to query (i.e. not in the ε-scenario). RAL therefore gets rewarded with
ρ+ when it queried the oracle and asking was rewarding (i.e. the voting classifier
would have predicted the wrong label). Conversely, RAL is penalized with ρ− if
the system used the oracle because the committee decided to do so, even though
the underlying classifier would have predicted the correct class.</p>
      <p>As an additional step, to ensure that RAL adapts as best as possible to the
data stream, we do not only tune the weights of the committee members based
on rewards, but also the uncertainty threshold θ, denoted in the remainder of
this section as θn to stress that it is influenced by the n − 1 samples observed so
far. Again, as for the decision powers, θn is not updated in the ε-scenario.</p>
      <p>The update rule of θn we implemented for our tool is written as follows:
n
θn ← min θn−1 ×
1 + η ×
(1)
(2)
(3)
Design of update rule. In this subsection, we detail the reasoning behind our
choice of the update policy. We are looking for a rule of this form:
θn ← min {θn−1 (1 + f (rn)) ,
1} ,
where f (rn) = 1 − exp (a × rn). The design goals of this update policy are that
the threshold increases slightly when the reward is positive, conversely when the
reward is negative. Our update policy should satisfy the following properties:
1. θn should decrease rapidly in case rn is negative, as this indicates that
the system queries too often and thus is performing poorly. Therefore, θn
should be adapted fast to improve its performance.
2. θn should slightly increase when rn is positive, so that the system does
not keep decreasing the threshold. The model was right to ask for more
samples, and thus the threshold should be increased. Nevertheless, the system
is doing well: the threshold should not be too reactive to the queries.
3. f must have two extrema: a minimum at ρ− &lt; 0 and a maximum at
ρ+ &gt; 0.
4. θn represents a probability. θn = 0 is not acceptable due to the product
form of the update policy, thus the values of θn must be in the interval (0, 1].
5. f (rn) must be in the interval (−1, 1] to ensure that θn takes values
corresponding to a probability. We exclude −1 from the allowed range of
values to avoid that θn drops to 0.</p>
      <p>The Properties 1 and 2 lead us to choose the family of functions fa : r 7→
1 − exp (a × r) parameterized by a. Property 5 can be translated into an equation
to determine this parameter:
lim f (r) = 1 − exp a × ρ− = −1
r→ρ−
After solving Equation 3, we get a = lρn−2 . As f is strictly increasing, and because
a is nonpositive, f will have a maximum when rn = ρ+. To satisfy Property 5,
ρ+ must be chosen such that f (ρ+) ≤ 1.</p>
      <p>As a final step, we introduce an additional hyperparameter to the update
rule, namely the learning rate η. This rate aims at smoothing the evolution of
the threshold θn, i.e. avoiding that θn changes too dramatically with a single
query. We thus have the following update rule:</p>
      <p>n
θn ← min θn−1 1 + η ×
(4)
The values of η are restricted to the range (0, 1). Indeed, we still must satisfy
Property 5 (a value of 1 would violate this one) and η = 0 would lead to an
nonreactive system, as the threshold would never adapt. Note that this version of
RAL uses the same value of η for updating both θn and the decision powers in α.
Choice of hyperparameters. We acknowledge that RAL includes a
nonnegligible number of hyperparameters which should be well chosen in order to
obtain the best results. While we do not have any rule of thumb on how to define
exact values, the following guidelines help RAL learn from the streaming data:
– the initial value of θ should be high when the number of possible labels is
low, to avoid that the model is always too certain about its prediction for
the encountered samples
– ε should be higher when dealing with more dynamic datasets, to increase the
probability to accurately grasp concept drifts; in general, we would advise
using values in the range of 1 to 5%
– η should be small to avoid changing the decision powers of the different
learners (α) and θn too abruptly; we would advise values below 0.1
– there is no specific range of values for ρ± which works better than others
and these values should be picked considering the situation in which RAL is
used: if unnecessary queries are a major issue, one should set ρ− such that
its absolute value is much higher than the one of ρ+
3.2</p>
      <sec id="sec-4-1">
        <title>Learning with a Single Classifier</title>
        <p>RAL can also be used with a single classifier instead of a committee of learners. In
that case, RAL becomes very lightweight and the only element of the system that
allows it to efficiently adapt to and learn from the data stream is the variation of
the uncertainty threshold θ by relying on the rewards rn.
4</p>
        <sec id="sec-4-1-1">
          <title>Expected Reinforcement Reward Analysis</title>
          <p>As the main novelty of RAL lies in the introduction of a reinforcement learning
loop to improve querying effectiveness and the data exploration-exploitation
trade-off, we devote this section to the study of the reward properties in RAL.
We rely on concepts from the bandit theory to understand its expected behavior.
In the general case of a multi-class classification problem, under the assumptions</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>R8AL - RSe.inWfoarscseerdmsatnrneaemt-abl.ased Active Learning</title>
      <p>that ρ+ ± ρ− ≥ 0, we prove the following bounds for RAL, fn (α, γ) being a
nonnegative function described next:</p>
      <p>T ρ+ − ρ− [α fn (α, γ) − 1] ≤ E
Let us analyze the expected total reward obtained by using RAL, i.e. E nPTn=1 rno,
where T denotes the number of samples in the considered data stream and rn
indicates the reward obtained for the nth sample.</p>
      <p>
        In the following developments, we use these notations:
– yˆn – nth predicted value
– pˆn – certainty of the model for the nth prediction
– ρ± – reward and penalty obtained by RAL respectively; the reward must be
nonnegative and the penalty nonpositive
– VC – VC dimension [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] of the learner
– θn – uncertainty threshold before having observed the nth sample
– errn – error rate of our classifier before having observed the nth sample
– errn – training error of model before having observed the nth sample
The expected total reward writes E nPTn=1 rno = PTn=1 E {rn}.
      </p>
      <p>
        Based on the classical result of [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], we have the following bound:
fn(α) = errn +
s 1
      </p>
      <p>Nn</p>
      <p>VC log 2VNCn + 1 − log α4</p>
      <p>
        P (errn ≤ fn(α)) = 1 − α
where Nn denotes the training-set size for the underlying classifier at the nth
round and α is a confidence level whose value lies in the interval [
        <xref ref-type="bibr" rid="ref1">0, 1</xref>
        ]. We
therefore can write this probabilistic bound as:
      </p>
      <p>PhP(yˆn 6= yn) ≤ fn (α) i = 1 − α
This means that the probability of making a mistake can be written as:
P [yˆn 6= yn] = P [yˆn 6= yn|P(yˆn 6= yn) ≤ fn (α)] × P [P(yˆn 6= yn) ≤ fn (α)]
(6)
(7)
(8)
(9)
+ P [yˆn 6= yn|P(yˆn 6= yn) &gt; fn (α)] × P [P(yˆn 6= yn) &gt; fn (α)]
= P [yˆn 6= yn|P(yˆn 6= yn) ≤ fn (α)] (1 − α) + P [yˆn 6= yn|P(yˆn 6= yn) &gt; fn (α)] α
Its upper and lower bounds are thus:</p>
      <p>0 + α fn (α) ≤ P [yˆn 6= yn] ≤ (1 − α) fn (α) + α × 1</p>
      <p>For the next proofs, we will require bounds on the probability of the certainty
of the model being less than a threshold. Unfortunately, to the best of our
knowledge, no generic result exists for the probability distribution of these
certainties, which leads to very lose bounds:</p>
      <p>0 ≤ P [pˆn ≤ θn] ≤ 1</p>
      <p>For the following steps, we rely on classical results in probability theory,
namely the union bound and Fr´echet’s inequality. For two probabilistic events A
and B, be they independent or not, the following bounds hold:</p>
      <p>P(A ∧ B) ≤ P(A) + P(B)</p>
      <p>P(A ∧ B) ≥ max {0, P(A) + P(B) − 1}</p>
      <p>We have that E {rn} = Pr∈R r × P(rn = r) with R = {ρ+, ρ−} being the
set of all possible reward values. As RAL does not obtain any reward in the
ε-scenario, it can be ignored. Therefore, we have the following decomposition of
the expectation and a generic upper bound:</p>
      <p>E {rn} = ρ+ P [pˆn ≤ θn ∧ yˆn 6= yn] + ρ− P [pˆn ≤ θn ∧ yˆn = yn]
|≥{z0} |≤(P[pˆn≤θn{]z+P[yˆn6=yn])} |≤{z0} ≥|(P[pˆn≤θn]+{zP[yˆn=yn]−1})
≤ P [pˆn ≤ θn] ρ+ + ρ− + P [yˆn 6= yn] ρ+ + [1 − P [yˆn 6= yn]] ρ− − ρ−
by factoring out the probabilities and using the opposite events
≤ P [pˆn ≤ θn] ρ+ + ρ− + P [yˆn 6= yn] ρ+ − ρ−
(14)
Finally, the upper bound on the expected total reward, under the assumption
that both (ρ+ + ρ−) and (ρ+ − ρ−) are nonnegative, is:
(10)
(11)
(12)
(13)
E
≤ T ρ+ + ρ− + T ρ+ − ρ− [(1 − α) fn (α) + α]
(15)</p>
      <p>If these two assumptions do not hold, a similar bound can still be achieved:
– First, suppose that ρ+ + ρ− ≥ 0 and ρ+ − ρ− ≤ 0. In this case, the only
solution is to have ρ+ = ρ− = 0, thus trivially E{PTn=1 rn} = 0
– Second, suppose that, conversely, ρ+ + ρ− ≤ 0 and ρ+ − ρ− ≥ 0. These
assumptions lead to:</p>
      <p>E
≤ T ρ+ − ρ− [(1 − α) fn (α) + α]
(16)
– Third, the case where both ρ+ + ρ− ≤ 0 and ρ+ − ρ− ≤ 0 should not be
studied further, because that would imply that ρ+ ≤ 0, which violates the
defined range of allowed values for ρ+ (in case ρ+ = 0, we must have ρ− = 0)</p>
    </sec>
    <sec id="sec-6">
      <title>R1A0L - RSe.inWfoarscseerdmsatnrneaemt-abl.ased Active Learning</title>
      <p>As a next step, we derive a lower bound of the expected total reward, with a
very similar reasoning. First, the expected reward can be decomposed as:
E {rn} = ρ+ P [pˆn ≤ θn ∧ yˆn 6= yn] + ρ− P [pˆn ≤ θn ∧ yˆn = yn]
|≥{z0} ≥|(P[pˆn≤θn]+{zP[yˆn6=yn]−1}) |≤{z0} |≤(P[pˆn≤θn{]z+P[yˆn=yn])}
≥ P [pˆn ≤ θn] ρ+ + ρ− + (P [yˆn 6= yn] − 1) ρ+ − ρ−
by factoring out the probabilities and using the opposite events
(17)
Eventually, if ρ+ + ρ− ≥ 0 and ρ+ − ρ− ≥ 0, the expected total reward is at
least T (ρ+ − ρ−) [α fn (α) − 1]. Conversely, if ρ+ + ρ− ≤ 0 and ρ+ − ρ− ≥ 0, the
lower bound is T (ρ+ − ρ−) [α fn (α) − 2].
4.2</p>
      <sec id="sec-6-1">
        <title>Generalization to the multi-class case</title>
        <p>
          The VC dimension makes no more sense when the classification problem includes
multiple classes. There have been several generalizations thereof, for instance the
covering number N (p)(γ/4, Δγ G, 2 Nn) [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ], where Δγ G is the set of classification
margins obtained by any classifier of the family G in the known Nn data points
(if a margin is larger than γ, it is clipped to γ). errγ,n is the number of
misclassifications, where an element is misclassified if its margin is less than γ. With
a margin γ ∈ R0 , a real number Γ ∈ R0+ (γ ≤ Γ ), and the previously defined
+
notations, the following bound on the generalization error holds:
fn (α, γ) = errγ,n +
1
Nn
+
s 2
        </p>
        <p>Nn
log 2 N (p) γ4 , Δγ G, 2 Nn</p>
        <p>2 Γ
− log α γ</p>
        <p>
          P (errn ≤ fn (α, γ)) = 1 − α
Notation is taken directly as defined in [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ].
        </p>
        <p>The upper bound of the expected total reward can be computed as in the
binary-classification problem:
(18)
(19)
(20)
(21)
E
The mathematical developments for the committee version are very similar to
the single classifier ones. First of all, the committee is still a classifier, and thus
the same kind of bound applies on the probability of misclassifying. The only
difference is that we have to take the VC dimension of the stacked classifiers
instead of the one of the single classifier.</p>
        <p>RAL asks the oracle for a label (and obtains the corresponding reward) if the
weighted average of the decisions encourages it to query. We denote by di,n the
random variable indicating whether the ith classifier decides to query the oracle
or not, i.e. whether its certainty pˆi,n for the nth prediction is below the threshold
θn (in case of querying, di,n = 1; otherwise, di,n = 0). αi,n is the weight of the
ith classifier for the nth sample; we have previously imposed that the sum of the
weights must be one (PC
i=1 αi,n = 1 for each sample n). Thus, RAL asks when:</p>
        <p>C</p>
        <p>1
X αi,n di,n ≥ 2
i=1
(22)
(23)
(24)
(25)
(26)
For the upper bound, the previous developments still hold:
E {rn} ≤ P
" C</p>
        <p>1 #
X αi,n di,n ≥ 2
i=1</p>
        <p>ρ+ + ρ− + P [yˆn 6= yn] ρ+ − ρ−
Again, to the best of our knowledge, no generic result exists for a probability
distribution of the querying decisions; we therefore have to resort to a very broad
bound:
0 ≤ P
" C</p>
        <p>1 #
X αi,n di,n ≥ 2
i=1
≤ 1.</p>
        <p>Finally, the expected total reward is, if ρ+ + ρ− ≥ 0 and ρ+ − ρ− ≥ 0, at most:
E
This theoretical analysis of RAL yields bounds on the expected total reward
which could be tightened by stronger results on the probability distribution of
pˆn. Nevertheless, our results show that the expected total reward is significantly
higher than T ρ−, whatever the values of ρ±: RAL usually takes the appropriate
decision, and thus mostly queries when necessary. Conversely, the upper bound
is always nonnegative.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>R1A2L - RSe.inWfoarscseerdmsatnrneaemt-abl.ased Active Learning</title>
      <p>MAWI Woodcover
ε
η
ρ+
ρ−
initial θ</p>
      <p>Fig. 3. Concept-drift detection in MAWI.</p>
      <p>Changes are marked with dashed lines.</p>
      <p>It is more beneficial to choose the rewards such that ρ+ + ρ− ≥ 0. Indeed, in
this case, the upper bound is higher (we add a term T (ρ+ + ρ−) to the initial
bound) as well as the lower bound (this time, we add a term T (ρ+ − ρ−)). This
means that, for a promising behavior of RAL, good decisions should be more
rewarded than bad ones are penalized.</p>
      <p>By studying special cases of these formulae, we can obtain significant insight
into their properties. For instance, if the prediction algorithm is very inaccurate
(a situation that is expected for the first samples of the stream) and almost
constantly infers the wrong class, i.e. if αfn (α, γ) is close to 1 (equal to 1 − ζ),
the lower bound becomes T (ρ+ − ρ−) ζ. This means that the expected total
reward, in this case, is at least zero. This result is significantly stronger than the
trivial bound T ρ−: RAL’s decisions generate a positive total reward, on average.
5</p>
      <sec id="sec-7-1">
        <title>Evaluation</title>
        <p>
          To showcase the performance of RAL, we evaluate our tool and compare it to
a state-of-the-art algorithm for stream-based active learning, and to random
sampling (RS). In particular, we compare RAL to the Randomized Variable
Uncertainty (RVU) technique proposed in [
          <xref ref-type="bibr" rid="ref16 ref17">16,17</xref>
          ] and mentioned in Section 2, as
this approach also heavily relies on the uncertainty of the underlying
machinelearning model to take the querying decisions. We use three datasets for the
sake of generalization, namely two datasets extracted from MAWILab [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] and a
subset of the widely used Forest Covertype data4. The covertype dataset contains
samples labeled with forest cover type represented by cartographic variables. The
dataset provided by MAWILab is publicly available and consists of
15-minutenetwork-traffic traces captured every day on a backbone link between Japan and
the US since 2001 in a stream-based fashion. Traces are annotated with the attack
types observed during their corresponding measurement period. In this work, we
focus on two attack detections, namely the flood and the UDP-netscan attacks.
The MAWI data is subject to concept drifts. We relied on a commonly used
statistical test, namely, the Page-Hinkley test [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], for the detection of changes.
Figure 3 depicts the cumulative number of drifts observed in the dataset. The
test detects 14 abrupt changes during the total measurement time span.
4 https://archive.ics.uci.edu/ml/datasets/Covertype, accessed in April 2019
(a) Flood attack.
        </p>
        <p>(b) Netscan.</p>
        <p>(c) Wood covertype.</p>
        <p>Fig. 4. Prediction-accuracy evaluation for RAL, RVU, and RS. For each of the tested
datasets, RAL outperforms both techniques.</p>
        <p>(a) Flood attack (RAL).</p>
        <p>(b) Netscan (RAL).</p>
        <p>(c) Wood covertype (RAL).
(d) Flood attack (RVU).</p>
        <p>(e) Netscan (RVU).</p>
        <p>
          (f) Wood covertype (RVU).
For each benchmarked algorithm, we proceed as follows: first, we subdivide the
considered datasets into three consecutive, disjoint parts, i.e. the initial training
set, the streaming data, and the validation set. The validation set consists of the
last 30% of the dataset, the initial training set is a variable fraction of the first
samples (varying between the first 0.5%, 1%, 2%, 5%, 10%, and 15%), and the
streaming part includes all the remaining samples not belonging to the other two
subsets. We then train a model on the initial training set and check its accuracy
(referred to as the initial accuracy ) on the validation part. Next, we run the
benchmarked algorithm on the streaming part and let it pick the samples it
wants to learn from. We retrain the models after each queried label. Finally, we
R1A4L - RSe.inWfoarscseerdmsatnrneaemt-abl.ased Active Learning
evaluate the final model (i.e. trained on the initial training set and the chosen
samples) again on the validation set and report this model’s prediction accuracy
(referred to as the final accuracy ). In the context of this evaluation, we implement
for both RAL and RVU the budget mechanism presented in [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ], based on the
ratio between the number of queries and the total number of samples observed
so far; the system is allowed to issue queries to the oracle as long as this ratio is
below a certain threshold, i.e. the budget. For random sampling, we use a budget
indicating the exact number of samples to ask feedback for. For each dataset,
we set it to the highest average number of queried samples by either RAL or
RVU among all the tests with all the considered initial-training-set sizes. All
tests are repeated 10 times, and we report both the average accuracy and its
standard error. For RAL, we indicate the average number of queries performed
due to the uncertainty of the underlying model and the ones issued through the
ε-greedy mechanism. For RVU, we also report the average number of queries.
The hyperparameter values of RAL are chosen by grid search for our datasets on
the training set within the ranges prescribed in Section 3. The used values are
indicated in Figure 2. For RAL and RVU, the budget is set to 0.05 for all the
experiments and we test RVU with the hyperparameters recommended in [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ].
5.2
        </p>
        <sec id="sec-7-1-1">
          <title>Results</title>
          <p>The results are shown in Figures 4 and 5. The reported all-streaming accuracy
refers to the accuracy obtained by the model in case it queries all the samples
seen in the stream.</p>
          <p>For the evaluations based on the two MAWI datasets, we use the committee
version of RAL. More precisely, the committee is a voting classifier composed
of a k -NN model with k equal to 5, a decision tree, and a random forest with
10 trees. We also use the same model for RVU and RS. On Figures 4(a) and
(b), we can clearly note that RAL outperforms both RVU and RS on average. A
striking example are the results for the netscan detection, where RAL obtains
final accuracies which are 5 percent points higher than the ones of RVU and RS
for the two smallest initial-training-set sizes. It is also worth highlighting that
RAL is the only algorithm yielding higher final accuracies than the all-streaming
one as long as the initial training set contains less than 5% of the data. To our
surprise, RVU is often outperformed by RS. Finally, the flood-attack-detection
analysis shows that the three approaches often yield a final accuracy higher than
the all-streaming one, underlining that learning from the entire data stream does
not necessarily output the best possible accuracy. When it comes to the number
of queried samples, we see that RAL queries on average significantly less often
than RVU, and a non-negligible part of these queries are due to the model’s
uncertainty, suggesting that the samples picked by RAL for its learning purposes
are wisely chosen.</p>
          <p>For the evaluation on the Forest Covertype dataset, we rely on the
singleclassifier version of RAL, using a 10-tree random forest. As for MAWI, RVU
and RS use the same model as RAL. Again, RAL yields better final accuracies
than both RVU and RS, even though this prediction task is very challenging.
0.5 1 2 5 10 15</p>
          <p>Initial-training-set size (%)</p>
          <p>Fig. 7. Evolution of RAL’s uncertainty
threshold w.r.t. rewards – wood covertype
dataset. Red lines indicate penalties.</p>
          <p>Moreover, in this study, RVU performs at most as well as RS, but generally worse,
showing that the decision-making algorithm of RVU is not always appropriate for
complex tasks. Next, we analyze the ratio between the rewards obtained by RAL
versus its penalties, i.e. we check whether querying the oracle when uncertain is
necessary. Figure 6 shows the outcome of this analysis and it is very encouraging.
Indeed, for each test case, the fraction of useful queries is at least 60%. Lastly, we
analyze the reactivity of RAL’s certainty threshold θ with respect to the obtained
rewards and penalties. Figure 7 depicts the evolution of θ with respect to the
rewards while RAL is processing the data stream. θ reacts swiftly to penalties,
i.e. its value decreases rapidly once RAL gets penalized and, very often, the
next query is useful (reward equal to ρ+), underlining the fact that updating the
threshold based on rewards is very relevant.</p>
          <p>Note that the initial accuracy is constant for the two different MAWILAB
datasets. This is due to the fact that the first 15% of these datasets consist of
points with the same label (more precisely, they represent an attack). The first
parts of the woodcover dataset are much more dynamic.
6</p>
        </sec>
      </sec>
      <sec id="sec-7-2">
        <title>Conclusions</title>
        <p>We have introduced RAL, a novel Reinforced stream-based Active-Learning
approach, to tackle challenges of stream-based active learning, i.e. selecting the most
valuable sequentially incoming samples, using reinforcement-learning principles.
It does not only learn from the data stream, but also from the relevance of its
querying decisions. RAL provides a completely different exploration-exploitation
trade-off than existing algorithms, as it queries fewer samples of higher relevance.
The theoretical analysis underlines the encouraging behavior of RAL. We showed
on several datasets that RAL provides promising results, outperforming the state
of the art. We make RAL publicly available as an open-source Python package.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>P.</given-names>
            <surname>Auer</surname>
          </string-name>
          et al.:
          <article-title>The Nonstochastic Multiarmed Bandit Problem</article-title>
          .
          <source>SIAM Journal on Computing</source>
          , vol.
          <volume>32</volume>
          (
          <issue>1</issue>
          ), pp.
          <fpage>48</fpage>
          -
          <lpage>77</lpage>
          ,
          <year>2002</year>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>Y.</given-names>
            <surname>Baram</surname>
          </string-name>
          et al.:
          <article-title>Online Choice of Active Learning Algorithms</article-title>
          .
          <source>Journal of Machine Learning Research (JMLR)</source>
          ,
          <source>vol. 5</source>
          , pp.
          <fpage>255</fpage>
          -
          <lpage>291</lpage>
          ,
          <year>2004</year>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>R.</given-names>
            <surname>Fontugne</surname>
          </string-name>
          et al.:
          <source>MAWILab: Combining Diverse Anomaly Detectors for Automated Anomaly Labeling and Performance Benchmarking. ACM CoNEXT</source>
          ,
          <year>2010</year>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Y.</surname>
          </string-name>
          <article-title>Guermeur: VC Theory of Large Margin Multi-Category Classifiers</article-title>
          .
          <source>Journal of Machine Learning Research (JMLR)</source>
          ,
          <source>vol. 8</source>
          , pp.
          <fpage>2551</fpage>
          -
          <lpage>2594</lpage>
          ,
          <year>2007</year>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>W.N.</given-names>
            <surname>Hsu</surname>
          </string-name>
          et al.:
          <article-title>Active Learning by Learning</article-title>
          .
          <source>AAAI Conference on Artificial Intelligence (AAAI)</source>
          ,
          <year>2015</year>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>D.</given-names>
            <surname>Ienco</surname>
          </string-name>
          et al.:
          <article-title>Clustering Based Active Learning for Evolving Data Streams</article-title>
          .
          <source>Discovery Science</source>
          . pp.
          <fpage>79</fpage>
          -
          <lpage>93</lpage>
          ,
          <year>2013</year>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>B.</given-names>
            <surname>Krawczyk</surname>
          </string-name>
          :
          <article-title>Active and adaptive ensemble learning for online activity recognition from data streams</article-title>
          .
          <source>Knowledge-Based Systems</source>
          , vol.
          <volume>138</volume>
          , pp.
          <fpage>69</fpage>
          -
          <lpage>78</lpage>
          ,
          <year>2017</year>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>E.S.</surname>
          </string-name>
          <article-title>Page: Continuous Inspection Schemes</article-title>
          .
          <source>Biometrika</source>
          , vol.
          <volume>41</volume>
          , pp.
          <fpage>100</fpage>
          -
          <lpage>115</lpage>
          ,
          <year>1954</year>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>B.</given-names>
            <surname>Settles</surname>
          </string-name>
          :
          <article-title>Active Learning Literature Survey</article-title>
          .
          <source>Tech. rep.</source>
          , University of WisconsinMadison,
          <year>2010</year>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10. L. Song:
          <article-title>Stream-based Online Active Learning in a Contextual Multi-Armed Bandit Framework</article-title>
          .
          <source>arXiv:1607.03182</source>
          ,
          <year>2016</year>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11. L.
          <string-name>
            <surname>Song</surname>
          </string-name>
          et al.:
          <article-title>A Contextual Bandit Approach for Stream-Based Active Learning</article-title>
          .
          <source>arXiv:1701.06725</source>
          ,
          <year>2017</year>
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12. A.
          <string-name>
            <surname>Santoro</surname>
          </string-name>
          et al.:
          <article-title>One-shot Learning with Memory-Augmented Neural Networks</article-title>
          .
          <source>arXiv:1605.06065</source>
          ,
          <year>2016</year>
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13. V.
          <article-title>Vapnik: The Nature of Statistical Learning Theory, 2000</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>C.</surname>
          </string-name>
          <article-title>Watkins: Learning from Delayed Rewards</article-title>
          .
          <source>Ph.D. thesis, King's College</source>
          , Cambridge,
          <year>1989</year>
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15. W. Xu et al.:
          <article-title>Active Learning Over Evolving Data Streams Using Paired Ensemble Framework</article-title>
          .
          <source>International Conference on Advanced Computational Intelligence (ICACI)</source>
          ,
          <year>2016</year>
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16. I. Zˇliobaite˙ et al.:
          <article-title>Active Learning with Evolving Streaming Data</article-title>
          .
          <source>Machine Learning and Knowledge Discovery in Databases</source>
          . pp.
          <fpage>597</fpage>
          -
          <lpage>612</lpage>
          ,
          <year>2011</year>
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17. I. Zˇliobaite˙ et al.:
          <article-title>Active Learning With Drifting Streaming Data</article-title>
          .
          <source>IEEE Transactions on Neural Networks and Learning Systems</source>
          , vol.
          <volume>25</volume>
          (
          <issue>1</issue>
          ), pp.
          <fpage>27</fpage>
          -
          <lpage>39</lpage>
          ,
          <year>2014</year>
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18. P. Bachman et al.:
          <article-title>Learning Algorithms for Active Learning</article-title>
          .
          <source>International Conference on Machine Learning (ICML)</source>
          ,
          <year>2017</year>
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>K. Konyushkova</surname>
          </string-name>
          et al.:
          <article-title>Learning Active Learning from Real and Synthetic Data</article-title>
          .
          <source>Conference on Neural Information Processing Systems (NIPS)</source>
          ,
          <year>2017</year>
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20. G. Contardo et al.:
          <article-title>A Meta-Learning Approach to One-Step Active Learning</article-title>
          .
          <source>International Workshop on Automatic Selection, Configuration and Composition of Machine Learning Algorithms</source>
          ,
          <year>2017</year>
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>M. Fang</surname>
          </string-name>
          et al.:
          <article-title>Learning how to Active Learn: A Deep Reinforcement Learning Approach</article-title>
          .
          <source>Conference on Empirical Methods in Natural Language Processing (EMNLP)</source>
          ,
          <year>2017</year>
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22. S. Ravi et al.:
          <article-title>Meta-Learning for Batch Mode Active Learning</article-title>
          .
          <source>International Conference on Learning Representations (ICLR)</source>
          ,
          <source>Workshop Track</source>
          , 2018
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <surname>K. Pang</surname>
          </string-name>
          et al.:
          <article-title>Meta-Learning Transferable Active Learning Policies by Deep Reinforcement Learning</article-title>
          .
          <source>International Conference on Machine Learning (ICML)</source>
          ,
          <source>AutoML Workshop</source>
          ,
          <year>2018</year>
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24.
          <string-name>
            <surname>M. Woodward</surname>
          </string-name>
          et al.:
          <article-title>Active One-shot Learning</article-title>
          .
          <source>Conference on Neural Information Processing Systems (NIPS)</source>
          ,
          <source>Deep Reinforcement Learning Workshop</source>
          ,
          <year>2016</year>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>