<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Meta-Active Learning approach exploiting Instance Importance based on Learning Gradient Variation</article-title>
        <subtitle>(Discussion Paper)</subtitle>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sergio Flesca</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Domenico Mandaglio</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Francesco Scala</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrea Tagarelli</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>DIMES Dept., University of Calabria</institution>
          ,
          <addr-line>87036 Rende (CS)</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>A major challenge in active learning is to select the most informative instances to be labeled by an annotation oracle at each step. In this respect, one effective paradigm is to learn the active learning strategy that best suits the performance of a meta-learning model. This strategy first measures the quality of the instances selected in the previous steps and then trains a machine learning model that is used to predict the quality of instances to be labeled in the current step. In this paper, we discuss a new learning-to-active-learn approach that selects the instances to be labeled as the ones producing the maximum change to the current classifier. The key idea is to select such instances according to their importance, reflecting variations in the learning gradient of the classification model. Our approach can be instantiated with any classifier trainable via gradient descent optimization, and here we provide a formulation based on a deep neural network model, which has not been deeply investigated in existing learning-to-active-learn approaches. The experimental validation of our approach has shown promising results in scenarios characterized by relatively few initially labeled instances.</p>
      </abstract>
      <kwd-group>
        <kwd>meta-learning models</kwd>
        <kwd>model-change framework</kwd>
        <kwd>learning to active learn</kwd>
        <kwd>active learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Supervised machine learning methods typically require a large number of training data instances.
However, manually labeling training instances is a costly and time consuming process, especially
for specialized domains, where a deep expertise is required for correctly labeling data instances.
Active Learning aims at selecting the data instances to be labeled by an expert, or annotation
oracle, in order to train a machine learning model as quickly and effectively as possible. Several
strategies have been proposed in the literature [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], which select the instances to be provided
to the oracle for annotation using different heuristics; however, none of these heuristics has
shown to outperform the others in every scenario of interest. To overcome these limitations,
meta-active learning approaches have been proposed to automatically detect the best strategy
of selection of the instances to be annotated [
        <xref ref-type="bibr" rid="ref2 ref3 ref4">2, 3, 4</xref>
        ].
      </p>
      <p>
        In this paper, we discuss the main contributions from our earlier study [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], where we
introduced a new meta-active learning method whose instance selection step, modeled as a regression
problem, exploits the training gradient of a deep neural network model, and in general of any
machine learning model whose training is based on a gradient descent method. Experiments
conducted on CIFAR-10 image data, and including a comparison with some baselines, have
shown promising results by the proposed approach in terms of percentage increase in accuracy.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Active learning methods typically fall into one of the following categories: Uncertainty Sampling,
Query-By-Committee, Expected Model Change, Expected Error Reduction, Variance Reduction,
and Density-Weighted [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>
        Uncertainty sampling aims to improve the quality of the labeled dataset by selecting as
instances to be labeled those such that the trained classifier is most uncertain in assigning a
class label [
        <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
        ]. Among this class of methods, the most popular one is probably least confidence
sampling (LCS) [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] which uses as uncertainty measure for an instance the difference between
100% confidence and the confidence of the most confidently predicted label for the instance. Other approaches use
different multi-class uncertainty sampling variants, such as margin sampling [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] or entropy [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>
        The query-by-committee approach [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] maintains a set of prediction models, or committee,
that are used to predict the label of an instance. The instance over which there is the maximum
disagreement on the labels predicted by the models in the committee is regarded as the most
informative and hence selected for labeling. Several specializations of the approach have been
proposed using diferent models for the committee members [
        <xref ref-type="bibr" rid="ref11 ref12 ref13 ref14">11, 12, 13, 14</xref>
        ].
      </p>
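      <p>As an illustration, committee disagreement is often quantified via vote entropy (one common choice; the specializations cited above use a variety of measures). A minimal sketch in Python, where committee members are modeled as plain prediction functions:</p>

```python
import math
from collections import Counter

def vote_entropy(votes):
    """Entropy of the committee's label votes for one instance:
    0 when all members agree, maximal under full disagreement."""
    n = len(votes)
    return -sum((c / n) * math.log(c / n) for c in Counter(votes).values())

def most_informative(instances, committee):
    """Pick the instance with the greatest committee disagreement."""
    return max(instances, key=lambda x: vote_entropy([m(x) for m in committee]))
```

      <p>For example, an instance on which two members predict opposite labels has vote entropy log 2, and would be preferred over an instance on which they agree.</p>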
      <p>
        The expected model change framework [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] aims to define a strategy for selecting the instance
that would yield the greatest change to the current model if we knew its label. The strategy
computes the expected gradient length and uses it as a measure of the expected change to
the model that is associated to the labeling of an instance. The key idea is to prefer instances
that are likely to have the greatest influence in changing the model. Theoretical aspects of
this framework have been well studied for support vector machines and linear regression [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ],
although it can be computationally expensive for large feature spaces and label sets.
      </p>
      <p>
        Expected error reduction aims to select the instance x that yields the maximum reduction of
the model generalization error once the model is trained using the label of x too. However, since the
labels of some instances are not known, the error reduction is usually approximated using the expectation
over all possible labels under the current model. This framework has been successfully used
with a variety of models such as Naıve Bayes [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], logistic regression [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ], and SVM [19].
      </p>
      <p>Variance reduction methods reduce the generalization error indirectly by minimizing output
variance. The early method in [20] was proposed for active learning based on the reduction
of the estimated distribution of the model’s output for regression. Applications of variance
reduction include multi-class image classification [21].</p>
      <p>
        The key idea of density-weighted methods is that informative instances should not only be
the uncertain ones, but also those representative of the underlying distribution [
        <xref ref-type="bibr" rid="ref11 ref8">11, 22, 23, 8, 24</xref>
        ].
Hence, the instances are selected according to both a base selection measure (e.g., LCS) and a
density based measure (e.g., the average similarity of an instance w.r.t. the other instances).
      </p>
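      <p>The combination of a base selection measure with a density measure can be sketched as follows; the product form and the β weighting mirror common information-density formulations, while the similarity function is left abstract (both are illustrative assumptions, not a specific method from the literature above):</p>

```python
def density_weighted_score(x, unlabeled, base_score, sim, beta=1.0):
    """Weight a base informativeness measure (e.g., LCS uncertainty) by the
    average similarity of x to the other unlabeled instances."""
    others = [u for u in unlabeled if u != x]
    density = sum(sim(x, u) for u in others) / len(others)
    return base_score(x) * density ** beta
```

      <p>With β = 0 the density term vanishes and the score reduces to the base measure alone.</p>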
      <p>
        Meta-learning algorithms have recently been proposed for the active learning tasks. In [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ],
several active learning heuristics are combined using a bandit algorithm exploiting a maximum
entropy criterion that estimates classification performance without knowing the actual labels.
Rather than combining existing heuristics, the meta-learning approach to active learning in [25]
models the active learning task as a regression problem: given a trained classifier and its output
for a specific unlabeled instance, it predicts the reduction in generalization error that can be
expected by providing the actual label of the instance. Note that the regressor in [25] is required
to be trained on a specific set of instance-driven features, such as the variance of the classifier
output for the instance or the predicted probability distribution over possible labels for the
instance. Our approach does not have the same constraint, since we utilize the raw features
of the instances, yet we can in principle exploit instance-driven features. More importantly,
for each active learning epoch, [25] requires to perform several training steps of the classifier
while we perform just a single training step.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Proposed Approach</title>
      <p>A classification problem consists in associating every instance x taken from a predefined domain
X with a label selected from a fixed domain of labels ℒ. We assume the presence of a set of
instance-label pairs LS ⊆ X × ℒ and a set of unlabeled instances US ⊆ X, where for each pair
⟨x, y⟩ ∈ LS, x is an instance in X and y is the label associated with x.</p>
      <p>Algorithm 1 shows the general schema of the proposed approach, named Learning to Active
Learn by Instance Importance based Gradient Variation (LAL-IGradV). LAL-IGradV receives
in input a (small) set of labeled instances LS, a set of unlabeled instances US, a deep neural
network model DNN, a regressor model R, the number h of active learning epochs, and the
number NLI of unlabeled instances to select for oracle labeling at each active learning epoch.</p>
      <p>Our proposed approach is comprised of two phases: an initialization phase and an iterative phase. In
the initialization phase, the algorithm first trains DNN using LS (line 1), randomly selects NLI
unlabeled instances from US and asks the oracle to label them, thus obtaining the initial set
LS′ of oracle-labeled instances (lines 2-3). In each step of the iterative phase (lines 4-11), the
set LS′ of newly labeled instances is used to train the classifier together with the set LS (line 5).
When retraining the classifier, every instance x ∈ LS′ is associated with its importance score i_x.
The computation of the importance scores of the instances in LS′ is performed using one of the
techniques described in Section 3.1. Next, a regressor R is trained on the set {(x, i_x) | x ∈ LS′}
and the NLI instances are added to LS (lines 6-7). The regressor R is then applied to the instances
in US so that, given an instance x, it predicts its importance score î_x (line 8). Finally, the top-NLI
instances having the greatest importance score are selected for oracle labeling and, once labeled,
they replace the set LS′ so as to start the next active learning step (lines 9-11).</p>
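      <p>The schema above can be sketched in Python as follows; the DNN training routine, the importance-scoring function, the oracle, and the regressor interface are all stubs standing in for the components described in this section, not the actual implementation:</p>

```python
import random

def lal_igradv(LS, US, train_dnn, importance_scores, regressor, oracle, h, nli):
    """Sketch of the LAL-IGradV loop. LS: list of (instance, label) pairs;
    US: list of unlabeled instances; oracle(x) returns the true label;
    regressor exposes fit/predict. All components are placeholders."""
    train_dnn(LS)                                  # line 1: initial training on LS
    picked = random.sample(US, nli)                # line 2: random initial selection
    US = [x for x in US if x not in picked]
    LS1 = [(x, oracle(x)) for x in picked]         # line 3: oracle annotates them
    for _ in range(h):                             # line 4: active learning epochs
        scores = importance_scores(LS + LS1)       # line 5: retrain DNN, score instances
        regressor.fit([x for x, _ in LS1],         # line 6: fit regressor on (x, i_x)
                      [scores[x] for x, _ in LS1])
        LS = LS + LS1                              # line 7: newly labeled join LS
        pred = {x: regressor.predict(x) for x in US}    # line 8: predict scores on US
        top = sorted(US, key=pred.get, reverse=True)[:nli]   # line 9: top-NLI
        US = [x for x in US if x not in top]       # line 10: remove them from US
        LS1 = [(x, oracle(x)) for x in top]        # line 11: oracle annotates
    return LS + LS1
```

      <p>Note that the regressor is fit on the raw instances, matching the fact that LAL-IGradV does not require hand-crafted instance-driven features.</p>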
      <p>
        Following the model change framework [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], the importance score of an instance x measures
the impact of having x in the training set on the obtained classifier. That is, the importance
score of a (labeled) instance x w.r.t. a set of labeled instances LS is a measure of the difference
between the parameters of the classifier M trained over LS and the parameters of the classifier M̂
trained over LS ∪ {⟨x, y⟩}, where y is the label of x. Unfortunately, in the case of neural network
classifiers, for the most commonly used training methods, such as stochastic gradient descent, such a
difference between the parameters of the model (almost) does not exist. To overcome this issue,
we define different notions of importance score, as discussed next.
      </p>
      <p>Algorithm 1: LAL-IGradV</p>
      <p>Data: LS: set of labeled instances, US: set of unlabeled instances, DNN: deep neural network model, R: importance score regressor, h: maximum number of epochs, NLI: number of relevant instances to select
1 Train DNN on LS
2 LS′ ← Select NLI instances from US uniformly at random
3 The oracle annotates the instances in LS′
4 for t = 1 . . . h do
5     Train DNN on LS ∪ LS′ and compute importance score i_x for each x ∈ LS ∪ LS′
6     Train R on the set of pairs {⟨x, i_x⟩ | x ∈ LS′}
7     LS ← LS ∪ LS′
8     Apply R to US instances to predict importance scores î_x
9     LS′ ← Select top-NLI instances from US by importance score î
10    US ← US ∖ LS′
11    The oracle annotates the instances in LS′</p>
      <p>3.1. Importance scoring strategies</p>
      <p>Let F(x, θ) be the output of a DNN model F characterized by a vector of parameters θ for an
input x, and let T = {x_1, . . . , x_n} be a set of instances used for training F, where each sample
x ∈ T is associated with a label y_x. The training of the DNN F over T requires solving

arg min_θ ( ∑_{x ∈ T} ℓ(y_x, F(x, θ)) + Ω(θ) ),

where ℓ(y_x, F(x, θ)) is the loss of the model for instance x and Ω(θ) is the regularization of
the parameters. The training of F is done by iteratively updating the parameters θ through two
steps: (i) computing the change in the loss w.r.t. all parameters, i.e., the gradient, defined as

∇(T) = ∇_θ ( ∑_{x ∈ T} ℓ(y_x, F(x, θ)) + Ω(θ) ),

and (ii) updating θ using ∇(T), i.e., θ_{t+1} = θ_t − η × ∇(T), where η is the update step size.</p>
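      <p>The two-step update can be illustrated on a toy one-parameter least-squares model (the data and loss below are purely illustrative):</p>

```python
def train_step(theta, data, eta=0.05):
    """One gradient-descent iteration for the toy model f(x) = theta * x
    with squared loss: (i) compute the gradient, (ii) update theta."""
    grad = sum(-2 * x * (y - theta * x) for x, y in data)  # step (i)
    return theta - eta * grad                              # step (ii)

data = [(1.0, 2.0), (2.0, 4.0)]  # illustrative samples drawn from y = 2x
theta = 0.0
for _ in range(50):
    theta = train_step(theta, data)
# theta converges toward 2.0
```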
      <p>We define four strategies to associate each instance in LS′ with its importance score during
the training of the DNN classifier. The goal shared by the various techniques is to modify
the training of the neural network model by accounting for the importance of the instances in
LS′ involved in each training step. Each of the proposed techniques makes use of the gradient
corresponding to the instances currently in LS and LS′, i.e., ∇(LS ∪ LS′), hereinafter simply
denoted as ∇. The four proposed techniques differ in the way the importance of an instance
x in LS′ is calculated with respect to the single epoch. We will use the symbol ∇_x to denote the
value of the gradient ∇({x}), and ∇_¬x to denote the value of the gradient ∇(LS ∪ LS′ ∖ {x}).
In the following, we describe our proposed techniques for computing the importance scores.</p>
      <p>Direct similarity (DS) – given an instance x in LS′, this strategy compares the learning
gradient of the neural network at the current epoch, ∇, with the gradient calculated with respect
to x only, i.e., ∇_x. The importance score of x at the current epoch is defined as the cosine
similarity between ∇ and ∇_x, i.e., i_x = sim(∇, ∇_x). The rationale of this strategy is that an
instance x ∈ LS′ is likely to be more important for the training of the DNN at the current epoch
if there is a small difference between the directions of the gradients ∇ and ∇_x, as reflected by a
high value of the cosine similarity between the two gradients. That is, the more the learning
behavior of the neural network considering the whole training set is similar to the one of the
same neural network trained on x only, the higher the importance of x is.</p>
      <p>Ranked direct similarity (RDS) – this strategy first applies the DS technique, then the
importance scores of the instances in LS′ computed by DS are ordered and divided into three
bins, which correspond to the top quartile of the importance scores, the bottom quartile, and
the union of the second and third quartiles. The instances falling into the top quartile will be
associated with score 1, the ones falling into the bottom quartile with score 0, and the other
instances with score 0.5.</p>
      <p>Leave-one-out distance (LD) – given an instance x in LS′, this strategy compares ∇ with
the gradient calculated when leaving out x, i.e., ∇_¬x. The importance score of x at the current
epoch is defined as the complement of the cosine similarity (i.e., the cosine distance) between ∇
and ∇_¬x, i.e., i_x = 1 − sim(∇, ∇_¬x). The rationale of this strategy is that an instance x ∈ LS′ is
likely to be more important for the training of the DNN at the current epoch if leaving it out
leads to a large difference between the learning behavior of the neural network considering the
whole training set and the learning behavior of the same neural network trained without x, i.e.,
a large change in the direction of the gradient ∇_¬x w.r.t. the gradient ∇, as reflected by a high
value of the cosine distance between the two gradients.</p>
      <p>Ranked leave-one-out distance (RLD) – analogously to RDS w.r.t. DS, the RLD strategy
adds the same discretization step over the importance scores computed by LD.</p>
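      <p>For illustration, the four scoring strategies can be sketched over plain gradient vectors; the cosine similarity and the quartile binning follow the definitions above, while the simple index-based quartile computation is our own choice of discretization:</p>

```python
import math

def cos_sim(g1, g2):
    """Cosine similarity between two gradient vectors."""
    dot = sum(a * b for a, b in zip(g1, g2))
    n1 = math.sqrt(sum(a * a for a in g1))
    n2 = math.sqrt(sum(b * b for b in g2))
    return dot / (n1 * n2)

def ds_score(grad_full, grad_x):
    """Direct similarity: i_x = sim(grad, grad_x)."""
    return cos_sim(grad_full, grad_x)

def ld_score(grad_full, grad_not_x):
    """Leave-one-out distance: i_x = 1 - sim(grad, grad_not_x)."""
    return 1.0 - cos_sim(grad_full, grad_not_x)

def rank_scores(scores):
    """RDS/RLD-style discretization: top quartile to 1, bottom quartile to 0,
    the two middle quartiles to 0.5."""
    ordered = sorted(scores)
    q1 = ordered[len(ordered) // 4]
    q3 = ordered[(3 * len(ordered)) // 4]
    return [1.0 if s >= q3 else (0.0 if q1 >= s else 0.5) for s in scores]
```

      <p>A gradient aligned with the full-batch gradient gets a DS score near 1, while an instance whose removal barely changes the gradient direction gets an LD score near 0.</p>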
    </sec>
    <sec id="sec-4">
      <title>4. Experimental Evaluation</title>
      <p>Data. We used the well-known CIFAR-10 dataset [26], which consists of 60000 instances
representing 32x32 colour images, labeled using 10 mutually exclusive classes, with 6000 images
per class. The dataset is organized into 50000 instances as the training set and 10000 instances
as the test set. The latter contains exactly 1000 randomly-selected images from each class, while
the training set is comprised of five training batches, which together contain 5000 images from each
class. We divided the training set into two parts, the one corresponding to the set of labeled
instances (LS), and the other corresponding to the set of unlabeled instances (US).</p>
      <p>
        Baseline methods. We compare the performance of our methods with a Random baseline
and the LCS method [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. The Random baseline, hereinafter denoted as Rnd, simply selects NLI
instances to be annotated at each epoch uniformly at random from the set of unlabeled instances.
The LCS method follows an uncertainty sampling approach, therefore the unlabeled instance
selection is driven by the uncertainty of the instances. More precisely, given an instance x
and a classification model M, the LCS method measures the uncertainty of x w.r.t. M as
u(x) = (1 − P_M(y* | x)) × n/(n − 1), where P_M(y* | x) denotes the probability that the model M assigns
to the label y* for the instance x, y* is the label for which M yields the maximum probability
on x (i.e., y* = arg max_y P_M(y | x)), and n is the cardinality of the set of labels. Note that the
uncertainty function ranges within [0, 1], where 1 is the most uncertain score.
      </p>
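      <p>The normalized least-confidence measure can be transcribed directly; here probs is assumed to be the model's predicted probability distribution over the n labels for an instance:</p>

```python
def lcs_uncertainty(probs):
    """Normalized least-confidence: u(x) = (1 - P(y*|x)) * n/(n - 1).
    probs: predicted probability distribution over the n labels.
    Ranges in [0, 1]; 1 is the most uncertain score."""
    n = len(probs)
    return (1.0 - max(probs)) * n / (n - 1)
```

      <p>A uniform distribution yields the maximum score of 1, while a fully confident prediction yields 0.</p>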
      <p>Settings and assessment criteria. In our experimental evaluation, we used 6 Convolutional
Neural Network (CNN) 2D layers, with 3 input channels, kernel size 3, stride size 3, padding
size 1, ReLU activation function. The CNN module has on top a fully-connected network with
an input layer of size 4096, one hidden layer with input size 4096 and output size 1024, another
hidden layer with input size 1024 and output size 512, an output layer of size 10 (i.e., number of
classes), and a dropout layer with probability 0.1.</p>
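      <p>As a sanity check on the stated layer configuration, the standard output-size formula for a 2D convolution can be computed as follows (how the six convolutional layers compose into the 4096-unit input is not detailed above, so only the per-layer formula is shown):</p>

```python
def conv2d_out(size, kernel, stride, padding):
    """Output spatial size of one 2D-convolution dimension:
    floor((size + 2*padding - kernel) / stride) + 1."""
    return (size + 2 * padding - kernel) // stride + 1
```

      <p>For instance, with kernel size 3, stride 3, and padding 1, a 32x32 input yields an 11x11 feature map.</p>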
      <p>In our LAL-IGradV algorithm, the DNN model was trained using cross entropy as loss function
and Adam optimizer (with learning rate 1e-4 and weight decay 5e-4), a number of epochs equal
to 10 for both the initialization step of training (Line 1) and the training steps in the main loop
(Line 5). Also, the maximum number of iterations of the algorithm, i.e., number of epochs in
the active learning process (h) was set to 10. Unless otherwise specified, the number NLI of
instances to select from US was set to 500; the size of LS, resp. US, was experimentally varied.
As the regressor (R), we used two models: the Gradient Boosting Regressor, with least absolute
deviations (LAD) loss function and 200 estimators, for the DS and LD strategies, and the Random
Forest Classifier, with maximum depth 5, for the RDS and RLD strategies.</p>
      <p>To simulate the oracle for annotating the instances, we resorted to the availability of class
label information for the CIFAR-10 data: whenever an instance was used in the US set, we
masked its actual label during the learning process, and we unveiled the label only if the instance
was selected within the LS′ set of instances to annotate.</p>
      <p>To assess the performance of the methods, we considered the accuracy of the classifier
during the various training batches, in absolute terms as well as in terms of percentage increase
w.r.t. the early accuracy of the classifier itself or the accuracy of a reference method. More
precisely, we computed: the accuracy at the initial step of training of LAL-IGradV (line 1),
denoted as acc(0), and the accuracy at the end of the active learning process, denoted as acc; the
percentage increase in the accuracy of LAL-IGradV, which is defined as 100(acc − acc(0))/acc(0);
the percentage increase in the accuracy of LAL-IGradV w.r.t. Rnd, resp. LCS, which is defined
as Δ%Rnd = 100(acc − acc_Rnd)/acc_Rnd, resp. Δ%LCS = 100(acc − acc_LCS)/acc_LCS, where acc_Rnd and acc_LCS
denote the accuracy at the end of the active learning process for Rnd and LCS.</p>
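      <p>These assessment criteria amount to simple relative differences, e.g.:</p>

```python
def pct_increase(acc, acc_ref):
    """Percentage increase of an accuracy value w.r.t. a reference accuracy."""
    return 100.0 * (acc - acc_ref) / acc_ref
```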
      <p>Results. Table 1 reports on the performance of our LAL-IGradV variants corresponding
to the four importance scoring techniques, for varying percentages of the set of unlabeled
instances (US). As expected, the accuracy values (i.e., the columns corresponding to acc and acc(0))
tend to decrease as the percentage of unlabeled instances gets higher, since the LAL-IGradV
method is forced to handle progressively reduced sets of labeled instances in its initial training.
More interestingly, the percentage increase of each of the LAL-IGradV variants w.r.t. both Rnd
and LCS is always positive (up to 6.5% against Rnd and up to 3.2% against LCS), and it tends
to improve with higher percentages of unlabeled instances, with peaks around 70% against Rnd
and around 50-60% against LCS. As concerns the impact of the importance scoring technique,
we observe that all the LAL-IGradV variants are able to improve upon the accuracy at the initial
training step. Moreover, the direct similarity based techniques, i.e., DS and RDS, prove to be
more efficient as well as more accurate than the leave-one-out distance based techniques, for
each percentage of the unlabeled set. We tend to ascribe this to a higher sensitivity of the
approach in capturing the gradient direction change due to the individual contribution of an
instance rather than to the masking of a single instance in the training gradient, which would
result in a more diluted signal of variation of the training gradient.</p>
      <p>We analyzed the percentage increase in accuracy that each active learning method achieves
by varying the fraction of unlabeled instances. As expected, given the advantage of performing
an active learning task, the percentage increase values (results not shown) tend to improve for
higher fractions of unlabeled instances. The trends are steeper for our LAL-IGradV methods
(around 10% increase), particularly for DS and RDS, followed by LCS. Indeed, it is worth
emphasizing that our LAL-IGradV methods achieve the best performance gain against the two
baselines as the fraction of labeled instances becomes smaller.</p>
      <p>In Fig. 1, we delve into the trends of accuracy percentage-increase obtained by each
active learning method, for varying NLI, i.e., the number of unlabeled instances to be selected at each
epoch of the active learning process. At first glance, in each of the plots, we notice that the
curve of the percentage increase values over NLI is more likely to change for larger fractions of
the set of unlabeled instances, with the most evident changes corresponding to 90%.</p>
      <p>A few interesting remarks can be drawn from Figs. 1(a)-(d). When portions of US below 90%
are selected, we observe a relatively small range of variation of the percentage increase values
(approximately from 5% to 10%), with peaks around NLI = 500 for the DS and LD variants, and
around NLI = 900 for the RDS and RLD variants. This would hint at higher requirements (i.e.,
higher NLI) for the importance scoring strategies that compute discretized importance
scores. Another remark is on the curves corresponding to the use of 90% of the set of unlabeled
instances: compared to the cases with lower fractions of US, the percentage increase values
are higher on average, and the trends are quite different, especially for the DS variant where
we observe a minimum (rather than a maximum) for NLI = 500. Apart from this exception, it
is worth noticing that a better percentage increase of accuracy does not necessarily correspond
to a higher number NLI of selected instances. This might be explained by the fact that the more unlabeled
instances are selected for labeling, the less likely the method is to make a correct choice
for changing the current model the most, as the latter is being trained only on few instances,
thus lacking full knowledge of the class distribution of all the instances available for training.</p>
      <p>[Figure 1: trends of accuracy percentage-increase for varying NLI; in each plot, curves correspond to 60%, 70%, 80%, and 90% fractions of unlabeled instances. Experiments were carried out on an Intel Core i7 CPU @2.90GHz, 32GB RAM, with an NVIDIA GeForce RTX 2070 Super GPU.]</p>
      <p>Concerning the baseline methods, two different situations occur between the Rnd plot
(Fig. 1(e)) and the LCS plot (Fig. 1(f)). The former shows a decreasing trend until mid values of
 (i.e., around 500) followed by a rising trend, which sheds light on the divergent behavior of
a random selection of the unlabeled instances w.r.t. all the other instance selection methods.
Also, the LCS plot shows curves that tend to monotonically decrease, resp. remain substantially
unchanged, for larger, resp. smaller, fractions of US, which again highlights how our
LAL-IGradV variants behave differently from an uncertainty sampling approach like LCS.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions</title>
      <p>We proposed a learning-to-active-learn approach whose key novelty is twofold: the integration
of a regression-based meta-learning approach within a maximum model-change framework,
and the definition of policies for scoring the instance importance based on the amount of change
in the learning gradient of a deep neural network model. Our experimental evaluation has
shown that our proposed LAL-IGradV outperforms both a random baseline and the LCS method,
especially when the number of initially available labeled instances gets smaller. As a future
work, we plan to evaluate the impact of measuring the importance of an instance not only in
terms of its own contribution to the model change but also w.r.t. other instances according to
some instance locality principle.
</p>
      <p>20th International Joint Conference on Artificial Intelligence, 2007, pp. 823–829.
[19] R. Moskovitch, N. Nissim, D. Stopel, C. Feher, R. Englert, Y. Elovici, Improving the detection of unknown computer worms activity using active learning, in: Proc. of the 30th Annual German Conference on Artificial Intelligence, volume 4667 of Lecture Notes in Computer Science, Springer, 2007, pp. 489–493.
[20] D. A. Cohn, Neural network exploration using optimal experiment design, Neural Networks 9 (1996) 1071–1083.
[21] A. J. Joshi, F. Porikli, N. P. Papanikolopoulos, Scalable active learning for multiclass image classification, IEEE Trans. Pattern Anal. Mach. Intell. 34 (2012) 2259–2273.
[22] H. T. Nguyen, A. W. M. Smeulders, Active learning using pre-clustering, in: Proc. of the Twenty-first International Conference on Machine Learning, 2004.
[23] Z. Xu, R. Akella, Y. Zhang, Incorporating diversity and density in active learning for relevance feedback, in: Proc. of the 29th European Conference on Information Retrieval, volume 4425 of Lecture Notes in Computer Science, Springer, 2007, pp. 246–257.
[24] S. Huang, R. Jin, Z. Zhou, Active learning by querying informative and representative examples, IEEE Trans. Pattern Anal. Mach. Intell. 36 (2014) 1936–1949.
[25] K. Konyushkova, R. Sznitman, P. Fua, Learning active learning from data, in: Proc. of the Annual Conference on Neural Information Processing Systems, 2017, pp. 4225–4235.
[26] A. Krizhevsky, G. Hinton, Learning multiple layers of features from tiny images (2009).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>B.</given-names>
            <surname>Settles</surname>
          </string-name>
          ,
          <article-title>Active Learning Literature Survey</article-title>
          ,
          <source>Technical Report</source>
          , University of Wisconsin-Madison, Department of Computer Sciences,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Baram</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>El-Yaniv</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Luz</surname>
          </string-name>
          ,
          <article-title>Online choice of active learning algorithms</article-title>
          ,
          <source>J. Mach. Learn. Res</source>
          .
          <volume>5</volume>
          (
          <year>2004</year>
          )
          <fpage>255</fpage>
          -
          <lpage>291</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>W.</given-names>
            <surname>Hsu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <article-title>Active learning by learning</article-title>
          ,
          <source>in: Proc. of the Twenty-Ninth AAAI Conference on Artificial Intelligence</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>2659</fpage>
          -
          <lpage>2665</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ebert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fritz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Schiele</surname>
          </string-name>
          ,
          <article-title>RALF: A reinforced active learning formulation for object class recognition</article-title>
          ,
          <source>in: Proc. of the IEEE Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2012</year>
          , pp.
          <fpage>3626</fpage>
          -
          <lpage>3633</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Flesca</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Mandaglio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Scala</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Tagarelli</surname>
          </string-name>
          ,
          <article-title>Learning to active learn by gradient variation based on instance importance</article-title>
          ,
          <source>in: 2022 26th International Conference on Pattern Recognition (ICPR)</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>2224</fpage>
          -
          <lpage>2230</lpage>
          . doi:10.1109/ICPR56361.2022.9956039.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>D. D.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. A.</given-names>
            <surname>Gale</surname>
          </string-name>
          ,
          <article-title>A sequential algorithm for training text classifiers</article-title>
          ,
          <source>in: Proceedings of the 17th Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval</source>
          ,
          <year>1994</year>
          , pp.
          <fpage>3</fpage>
          -
          <lpage>12</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>D. D.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Catlett</surname>
          </string-name>
          ,
          <article-title>Heterogeneous uncertainty sampling for supervised learning</article-title>
          ,
          <source>in: Proc. of the Eleventh International Conference on Machine Learning</source>
          ,
          <year>1994</year>
          , pp.
          <fpage>148</fpage>
          -
          <lpage>156</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>B.</given-names>
            <surname>Settles</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Craven</surname>
          </string-name>
          ,
          <article-title>An analysis of active learning strategies for sequence labeling tasks</article-title>
          ,
          <source>in: Proc. of the 2008 Conference on Empirical Methods in Natural Language Processing</source>
          ,
          <year>2008</year>
          , pp.
          <fpage>1070</fpage>
          -
          <lpage>1079</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>T.</given-names>
            <surname>Scheffer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Decomain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wrobel</surname>
          </string-name>
          ,
          <article-title>Active hidden Markov models for information extraction</article-title>
          ,
          <source>in: Proc. of the 4th International Conference on Advances in Intelligent Data Analysis</source>
          , volume
          <volume>2189</volume>
          of Lecture Notes in Computer Science, Springer,
          <year>2001</year>
          , pp.
          <fpage>309</fpage>
          -
          <lpage>318</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>H. S.</given-names>
            <surname>Seung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Opper</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Sompolinsky</surname>
          </string-name>
          ,
          <article-title>Query by committee</article-title>
          ,
          <source>in: Proc. of the Fifth Annual ACM Conference on Computational Learning Theory</source>
          ,
          <year>1992</year>
          , pp.
          <fpage>287</fpage>
          -
          <lpage>294</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A.</given-names>
            <surname>McCallum</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Nigam</surname>
          </string-name>
          ,
          <article-title>Employing EM and pool-based active learning for text classification</article-title>
          ,
          <source>in: Proc. of the Fifteenth International Conference on Machine Learning</source>
          ,
          <year>1998</year>
          , pp.
          <fpage>350</fpage>
          -
          <lpage>358</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>I.</given-names>
            <surname>Dagan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. P.</given-names>
            <surname>Engelson</surname>
          </string-name>
          ,
          <article-title>Committee-based sampling for training probabilistic classifiers</article-title>
          ,
          <source>in: Proc. of the Twelfth International Conference on Machine Learning</source>
          ,
          <year>1995</year>
          , pp.
          <fpage>150</fpage>
          -
          <lpage>157</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>P.</given-names>
            <surname>Melville</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. J.</given-names>
            <surname>Mooney</surname>
          </string-name>
          ,
          <article-title>Diverse ensembles for active learning</article-title>
          ,
          <source>in: Proc. of the Twenty-first International Conference on Machine Learning</source>
          ,
          <year>2004</year>
          . doi:10.1145/1015330.1015385.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>R.</given-names>
            <surname>Gilad-Bachrach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Navot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Tishby</surname>
          </string-name>
          ,
          <article-title>Query by committee made real</article-title>
          ,
          <source>in: Proc. of the Neural Information Processing Systems</source>
          ,
          <year>2005</year>
          , pp.
          <fpage>443</fpage>
          -
          <lpage>450</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>B.</given-names>
            <surname>Settles</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Craven</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ray</surname>
          </string-name>
          ,
          <article-title>Multiple-instance active learning</article-title>
          ,
          <source>in: Proc. of the Twenty-First Annual Conference on Neural Information Processing Systems</source>
          ,
          <year>2007</year>
          , pp.
          <fpage>1289</fpage>
          -
          <lpage>1296</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>W.</given-names>
            <surname>Cai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Ding</surname>
          </string-name>
          ,
          <article-title>Active learning for classification with maximum model change</article-title>
          ,
          <source>ACM Trans. Inf. Syst</source>
          .
          <volume>36</volume>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>N.</given-names>
            <surname>Roy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>McCallum</surname>
          </string-name>
          ,
          <article-title>Toward optimal active learning through sampling estimation of error reduction</article-title>
          ,
          <source>in: Proc. of the Eighteenth International Conference on Machine Learning</source>
          ,
          <year>2001</year>
          , pp.
          <fpage>441</fpage>
          -
          <lpage>448</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Greiner</surname>
          </string-name>
          ,
          <article-title>Optimistic active-learning using mutual information</article-title>
          ,
          <source>in: Proc. of the</source>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>