Challenges of Reliable, Realistic and Comparable Active Learning Evaluation

Daniel Kottke1, Adrian Calma1, Denis Huseljic1, Georg Krempl2, and Bernhard Sick1

1) University of Kassel, Wilhelmshöher Allee 73, 34112 Kassel, Germany
{daniel.kottke,adrian.calma,bsick}@uni-kassel.de
2) Otto-von-Guericke University Magdeburg, Universitätsplatz 2, 39106 Magdeburg, Germany
georg.krempl@ovgu.de

Abstract. Active learning has the potential to save costs by the intelligent use of resources in the form of an expert's knowledge. Nevertheless, these methods are still not established in real-world applications as they cannot be evaluated properly in the specific scenario because evaluation data is missing. In this article, we provide a summary of different evaluation methodologies by discussing them in terms of being reproducible, comparable, and realistic. A pilot study which compares the results of different exhaustive evaluations suggests a lack of repetitions in many articles. Furthermore, we aim to start a discussion on a gold standard evaluation setup for active learning that ensures comparability without reimplementing algorithms.

Keywords: Evaluation, Active Learning, Classification, Semi-supervised Learning, Data Mining

1 Introduction

The field of active learning (AL) investigates how a learning algorithm can learn to solve problems (e.g., classification or regression problems) more effectively by exploiting interactions with humans (e.g., experts in a specific application field) or simulation systems, which are abstractly modeled as an oracle [1] (Fig. 1). In many application domains, it is unproblematic to collect unlabeled data, but gathering labels may be complicated, time-consuming, or costly [18]. Furthermore, AL is based on the assumption that by allowing the learner to be curious (i.e., it is allowed to choose the data from which it learns), it may learn faster [39].

Pool-based AL [29] usually starts with an initially empty or very sparsely labeled set of samples and a large pool of unlabeled samples (candidates), and iteratively queries for new labels from instances of the candidate pool by "asking the right questions". For example, in every learning cycle the oracle is asked to provide labels for the most "informative" samples based on a selection strategy. Thereby, it aims to improve the performance of the learning model as fast as possible. After the labels are added, the knowledge model is updated.

In this article, we focus on three critical aspects of AL evaluation which are underrepresented in current AL research:

– Reliable Evaluation: Reliable evaluation results require a robust and reproducible evaluation methodology. Hence, the methodology should be described in detail and should be robust to varying seeds or shuffled data.
– Realistic Evaluation: Evaluating an AL algorithm in a lab setting (where the lack of labels is merely simulated) is not realistic. Often, implications for the real world do not hold. Hence, AL methods are not very common in industrial applications. We will discuss the challenges of a real-world application.
– Comparable Evaluation: Current evaluation methodologies vary a lot regarding their evaluation type, performance measure, number of repetitions, etc. Ideally, presented results are directly comparable with others.
Hence, this article aims to initiate a discussion towards a gold standard for standardized AL evaluation.

The article starts with a general overview of the components taking part in an AL cycle (Sec. 2). Next, we discuss aspects of reliable evaluation (Sec. 3) and compare two methodologies in a pilot study (Sec. 4). In Sec. 5, we present unrealistic assumptions for real-world applications. Finally, we conclude the work and give an outlook on how comparable evaluation could be made possible.

2 Active Learning in Classification Tasks

The learning cycle of AL (see Fig. 1) consists of three main components: in pool-based AL for classification tasks, we have a selection strategy, an oracle, and a classifier. The selection strategy selects the instances from the candidate pool to be labeled by the oracle such that the classifier can learn a well-suited model. This procedure repeats until a stopping criterion is reached. In AL evaluation, we normally investigate the performance of the selection strategy. Using an omniscient oracle and a pre-trained classifier, we can ensure that performance differences are solely induced by the selection of training instances from the candidate pool. Changing the classifier (or the parameters of the classifier) within different AL systems might lead to falsified results because of the high interdependence between the three components.

Fig. 1. Pool-based active learning cycle [39]: the selection strategy chooses candidates from the pool, the oracle (expert) provides labels, the labeled instances are added to the training set, and the machine learning model (classifier) is updated.

When comparing multiple classifiers in combination with AL, the selection strategy should be fixed. When comparing both classifiers and selection strategies, one should run every combination. Unfortunately, some selection strategies only work with specific classifiers or classifier types. Hence, it is not possible to compare these selection strategies with their individual classifiers, as performance differences could be explained by the qualities of the classifiers and not by the selection strategy. To address this problem, we could train multiple classifiers on the selected samples. According to [42], this is subsumed under the term label reusability. The authors propose to use the specific classifier for the active selection (selector) and to train additional classifiers for prediction (consumers). Although the authors of [42] show that the suitability of selector-consumer pairings cannot be estimated independently of the AL problem, we propose to run each selector also as a consumer for evaluation.

3 Aspects of Reliable Evaluation

Reliable evaluation is robust and reproducible. Robustness in evaluation means that changing seeds or the order of data points does not affect the results. In this section, we point out different aspects and discuss what is done in the literature.

3.1 Repetitions and Hold-Out Evaluation

In AL, we are facing classification tasks with very few training instances. When classifiers try to generalize from only a few training samples, their performance might be very sensitive to small changes. Also, the performance probably varies a lot depending on the concrete choice of instances to be labeled. Hence, many repetitions are needed to get a reliable trend of the performance. In Fig. 2, we clarify the nomenclature of the different sets that might take part in AL.
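A minimal sketch of such a repeated evaluation (many repetitions, each on a fresh candidate/evaluation split) is given below. It assumes NumPy arrays X and y; the logistic regression classifier, the 2/3 vs. 1/3 split, and the placeholder random selection strategy are our illustrative choices and are not prescribed by any of the cited works. The sketch uses only a candidate pool and an evaluation set; initialization and tuning sets are omitted.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def random_selection(candidate_indices, rng):
    # Placeholder selection strategy: pick one candidate uniformly at random.
    # Any strategy with the same interface (indices in, one index out) fits here.
    return int(rng.choice(candidate_indices))


def repeated_al_experiment(X, y, n_repetitions=50, budget=60, seed=0):
    """Run AL on repeated candidate/evaluation splits.

    Returns an array of shape (n_repetitions, budget): one learning curve
    (hold-out accuracy per acquired label) per repetition.
    """
    curves = np.zeros((n_repetitions, budget))
    for rep in range(n_repetitions):
        rng = np.random.default_rng(seed + rep)
        # Disjoint candidate pool and evaluation set (here 2/3 vs. 1/3 of the data).
        X_cand, X_eval, y_cand, y_eval = train_test_split(
            X, y, test_size=1 / 3, random_state=seed + rep, stratify=y)
        labeled, candidates = [], list(range(len(X_cand)))
        for t in range(budget):
            idx = random_selection(candidates, rng)  # query the simulated oracle
            labeled.append(idx)
            candidates.remove(idx)
            if len(np.unique(y_cand[labeled])) > 1:  # classifier needs >= 2 classes
                clf = LogisticRegression(max_iter=1000)
                clf.fit(X_cand[labeled], y_cand[labeled])
                acc = accuracy_score(y_eval, clf.predict(X_eval))
            else:
                acc = np.mean(y_eval == y_cand[labeled[0]])  # predict the only seen class
            curves[rep, t] = acc
    return curves
```

An actual comparison would plug each selection strategy under study into the same loop, keeping the classifier, the splits, and the seeds identical across strategies.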
In recent active learning articles, the number of repetitions varies from a single training-evaluation set [49] to 100 different partitionings [26]. Furthermore, some authors use a k-fold cross validation [2, 5, 31] with only one execution [31, 38] or multiple ones [2, 5]. Executing a k-fold cross validation multiple times requires different seeds among the repetitions. Others [8, 21, 30, 46] use a simple split with a fixed percentage (varying between 50% and 67%) for the candidate pool and the rest, respectively, for the evaluation set. To get rid of random effects, this is repeated multiple times. In Sec. 4, we present a pre-study that shows the drawbacks of a single k-fold cross validation and the importance of multiple repetitions.

Fig. 2. Different sets used in the literature for active learning (initialization set, labeled training set, tuning set, candidate pool, and evaluation set).

3.2 Performance Measures

Active learning is a dynamic process which improves its model by successively adding labels to instances from the candidate pool. The aim of AL algorithms is to achieve a high performance which improves as fast as possible. Hence, we have two objectives [27, 39]:

1. achieve a high performance level (learn a good classifier) and
2. learn as fast as possible (save costs induced by annotations).

Applying Common Performance Measures to AL: Depending on the learning problem, several performance measures [36] have been used. Usually, accuracy or error [2, 6] are used for problems with balanced misclassification costs and class priors. For unbalanced data, measures like cost, F1-score, G-mean, the area under the Receiver Operating Characteristic curve (AUROC) [17, 20] (see [21, 22, 30, 48]), or the H-measure [19] are more sophisticated. Usually, these performance measures are then plotted over time (resp. the number of acquired labels), which is called a learning curve (e.g., see Fig. 3). As mentioned in the previous subsection, the results from multiple executions should be included in the evaluation by plotting standard deviations or, ideally, quartiles. An evaluation of means could also include the mean standard error or mean quartiles, which can be determined using bootstrapping [15]. Note that quartiles are more exact, as the distribution of performances given the number of acquired labels is unlikely to be normally distributed because these random variables are bounded (most of the time between 0 and 1).

The comparison of learning curves remains difficult, as it is unclear how to combine the two objectives from above. The easiest option is to present the results for different points in time (e.g., early stage, mid stage, saturated stage) [26, 37]. Having fixed these time points, one can use comparison methods like in usual classification tasks. Note that most often, these time points and the total number of label acquisitions (when to stop learning) are chosen by the authors, which could bias the results. We recommend not to stop learning before most of the AL algorithms have converged, and, if possible, to also include the performance of a classifier learned on all instances as a baseline.
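Assuming per-repetition learning curves such as those collected by the sketch in Sec. 3.1 (an array of shape (n_repetitions, n_acquisitions)), a hedged sketch of such a learning-curve plot with quartile bands, plus a bootstrapped interval for the mean at a single time point [15], could look as follows; the function names are ours:

```python
import matplotlib.pyplot as plt
import numpy as np


def plot_learning_curve(curves, label):
    """curves: array of shape (n_repetitions, n_acquisitions) for one strategy."""
    x = np.arange(1, curves.shape[1] + 1)
    q25, q50, q75 = np.percentile(curves, [25, 50, 75], axis=0)
    plt.plot(x, q50, label=label)             # median learning curve
    plt.fill_between(x, q25, q75, alpha=0.2)  # interquartile band
    plt.xlabel("number of acquired labels")
    plt.ylabel("accuracy")
    plt.legend()


def bootstrap_mean_interval(values, n_boot=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap interval for the mean performance at one time point."""
    rng = np.random.default_rng(seed)
    means = [rng.choice(values, size=len(values), replace=True).mean()
             for _ in range(n_boot)]
    return np.percentile(means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
```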
In reliable evaluation, statistical testing plays an essential role. Nevertheless, one should be reminded that statistical tests only show whether the results may also be explained by random artifacts [33]; they do not show the real superiority of one's method. Nuzzo [33] claims that results should not only be reported with their statistical significance but also with their effect size. Typically, statistical tests (like the t-test or the Wilcoxon signed-rank test [47]) assume i.i.d. random variables. Hence, the compared performance values should be drawn from the different training-evaluation combinations and not from different time points, because these performance values are highly correlated and therefore not independent. One could also argue that even the performances across the repetitions are not independent because training and/or evaluation sets might overlap. Many use a t-test for comparing the tendencies of the mean between two algorithms [8, 21]. Due to the assumption of the mean being normally distributed, it might be better to use a parameter-free test like the Wilcoxon signed-rank test [8, 22, 26, 41]. To test if an algorithm is significantly better across datasets, the Wilcoxon signed-rank test might also be a good choice. An alternative to statistical testing is to present the number of won/lost trials using a simple pairwise comparison between the performances of two algorithms [26].

Active Learning Specific Performance Measures: There also exist approaches to summarize the shape of the performance curve. The easiest approach sums up the performance values over all time points. This is often called the area under the learning curve [38] (also denoted as AUC, an abbreviation we do not recommend because it can be mixed up with AUROC). This measure is proportional to the mean and hence dependent on the length of the AL process (i.e., the number of acquisitions, which is often chosen manually). More convenient is the deficiency score proposed by Yanik et al. [50]. It is determined by calculating the area between the maximal performance line and the actual learning curve, which they call α for algorithm A and β for algorithm B. The deficiency of A with respect to B is then calculated using the following equation:

deficiency(A, B) = α / (α + β)    (1)

Another measure to calculate how fast the AL algorithm learns (the second objective) is the Data Utilization Rate (DUR) by Reitmaier et al. [38]. They first compute the target accuracy, defined as the mean performance of the random strategy between 80% and 100% of the total number of acquired labels. The DUR is then the minimum number of samples needed by a strategy to reach this target accuracy, divided by the number of samples needed by random sampling.
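Both summary measures can be computed directly from mean learning curves. The sketch below follows our reading of Eq. (1) and of the DUR definition above (the 80%–100% window stems from [38]); treating the maximal performance line as the best mean performance of either strategy, and falling back to the full budget when the target accuracy is never reached, are our assumptions rather than details fixed by [50] or [38]:

```python
import numpy as np


def deficiency(curve_a, curve_b):
    """Deficiency of A with respect to B (Eq. 1) from two mean learning curves.

    Assumption: the 'maximal performance line' is the highest mean performance
    observed for either strategy; other readings of [50] are possible.
    """
    max_perf = max(curve_a.max(), curve_b.max())
    alpha = np.sum(max_perf - curve_a)  # area above the learning curve of A
    beta = np.sum(max_perf - curve_b)   # area above the learning curve of B
    return alpha / (alpha + beta)


def data_utilization_rate(curve, random_curve):
    """DUR: labels a strategy needs to reach the target accuracy of the random
    strategy (mean over its last 20% of acquisitions), relative to random."""
    target = np.mean(random_curve[int(0.8 * len(random_curve)):])

    def labels_until_target(c):
        hits = np.nonzero(c >= target)[0]
        return hits[0] + 1 if hits.size else len(c)  # assumption: fall back to full budget

    return labels_until_target(curve) / labels_until_target(random_curve)
```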
3.3 Initialization of Active Learning

Some papers propose to initialize their AL cycle with some labels, either to be compatible with state-of-the-art implementations or as an essential part of their algorithm. The number of initialization labels varies between no label at all and 10% of the data [30]. This choice is highly dependent on the dataset and the proposed algorithm. Unfortunately, it is often not described how the specific values have been determined (or tuned), although this is essential for the method to succeed or fail.

The number of initial labels is relatively small when initialization is done due to compatibility issues [7, 13, 25, 37]. In some SVM implementations, the classifier needs one instance per class to predict labels. Hence, some authors added a fixed number of instances per class [37, 43, 49, 50], although this is not possible in real applications as the class labels are unknown in advance. This is even more relevant in datasets with unequal class priors, as finding an instance of the minority class is especially difficult [16].

In [30, 48], the initialization step is used to obtain a representative sample of the dataset in order to find a broad decision boundary. Later, an uncertainty-based method is used to refine the boundary and improve the performance. In this case, the number of samples used for initialization is critical for the active learning process. Especially when the number of initial samples is varied across datasets [30], one should mention how this number has been tuned.

For a transparent evaluation of the selection strategy, we propose that algorithms with an initialization phase should be seen as a two-step selection strategy. In the first step, labeling candidates are chosen according to an initialization strategy (e.g., random) which is stopped by a comprehensible stopping criterion. Then, the actual active learning method proceeds. As this initialization phase is now part of the active learning algorithm, it should also be evaluated (e.g., regarding robustness) and included in the learning curves [30, 37].

3.4 Parameter Tuning

Tuning parameters for classifiers is very difficult with only a few labels available. Unfortunately, these tuning procedures are often not described in great detail. Yanik et al. [50] used a grid search in a 5-fold cross validation after each label acquisition to tune the parameters of the SVM. Similarly, Tuia et al. [43] tune the parameters of their SVM. Both do not describe on which data this is executed. Using a hold-out tuning set [13, 27] is not valid in AL unless these additional labels are comprehensibly selected and included in the evaluation (i.e., considering them in the number of acquired labels in the learning curve). As in passive classification tasks, it is strictly forbidden to tune the parameters using the evaluation instances.

One could also argue that parameters should be adapted during learning, as the number of training instances is increased by AL, which affects the capability of generalization. This means we either use a pre-trained, mediocre classifier because the parameters were tuned for one specific labeling situation, or we re-calibrate the parameters during learning, which means that the classifiers become different across selection methods, which again biases the results. Another way is to use standard parameters with normalized features (e.g., z-normalized) [25].
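One defensible way (not the only one) to respect these constraints is to tune on the currently labeled instances only, for example with a small cross-validated grid search after each acquisition. The sketch below illustrates this under our own assumptions; the SVM, the parameter grid, and the fold count are illustrative choices and not taken from the cited works:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC


def tune_on_labeled_data(X_labeled, y_labeled, seed=0):
    """Grid search restricted to the acquired labels; neither evaluation data
    nor a separate tuning set is touched. Assumes at least two classes among
    the labels (cf. the initialization discussion in Sec. 3.3)."""
    _, counts = np.unique(y_labeled, return_counts=True)
    n_splits = int(min(5, counts.min()))
    if n_splits < 2:
        # Too few labels per class to cross-validate: keep default parameters.
        return SVC(C=1.0, gamma="scale").fit(X_labeled, y_labeled)
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    grid = {"C": [0.1, 1.0, 10.0], "gamma": ["scale", 0.1, 1.0]}
    search = GridSearchCV(SVC(), grid, cv=cv, scoring="accuracy")
    search.fit(X_labeled, y_labeled)
    return search.best_estimator_
```

Whether such re-tuning after every acquisition is desirable at all is exactly the trade-off discussed above; the sketch only fixes which data the tuning is allowed to see.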
3.5 Proposing an AL Evaluation Methodology

In order to achieve reliable results across selection strategies, we propose the following methodology for AL evaluation:

– Use exactly the same robust classifier for every AL method in the comparison and try to sync the parameters of these classifiers.
– Capture the effect of the different AL methods on multiple datasets using at least 50 repetitions.
– Start with an initially unlabeled set. If you need initial training instances, sample randomly and explain how the number of samples was determined.
– Use either a clearly defined stopping criterion or enough label acquisitions (sample until convergence).
– Show learning curves (incl. quartiles) with reasonable performance measures.
– Present pairwise differences in terms of significance and effect size (Wilcoxon signed-rank test); a minimal sketch follows this list.
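For the last point, the sketch below assumes one performance value per repetition and strategy (e.g., the final accuracy or an area under the learning curve), paired by repetition. The signed win rate used as effect size is our simple stand-in, not a measure prescribed by [33] or [47]; it also yields the won/lost counts mentioned in Sec. 3.2.

```python
import numpy as np
from scipy.stats import wilcoxon


def compare_strategies(perf_a, perf_b):
    """Paired comparison of two strategies from per-repetition performances.

    perf_a[i] and perf_b[i] must stem from the same training/evaluation split.
    Note: wilcoxon() raises an error if all paired differences are zero.
    """
    perf_a, perf_b = np.asarray(perf_a), np.asarray(perf_b)
    _, p_value = wilcoxon(perf_a, perf_b)      # parameter-free, paired test [47]
    diff = perf_a - perf_b
    wins, losses = int(np.sum(diff > 0)), int(np.sum(diff < 0))
    effect_size = (wins - losses) / len(diff)  # signed win rate in [-1, 1]
    return {"p_value": float(p_value), "wins": wins,
            "losses": losses, "effect_size": effect_size}
```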
4 Pilot Study: Influence of the Number of Repetitions

The major challenge of AL evaluation is to measure the effect of an improvement although the variance of the results might be high: especially in the early learning stages (1%–10% of the data are labeled), the classification performance varies a lot. This is where the differences across AL methods are highest. Hence, experiments have to be repeated multiple times to yield reliable results, as mentioned before. In this section, we provide an exemplary evaluation methodology using a 5-fold cross validation.

For these experiments, we solely used one dataset from the UCI machine learning repository, named Mammographic Mass [3]. We chose this dataset as it is a typical representative of an AL dataset regarding the number of instances and features. For classification, we decided to use a robust classifier based on Gaussian kernel density estimation, namely a Parzen Window Classifier (PWC). Here, we only have one parameter: the bandwidth. In a pre-processing step, all categorical data has been dichotomized and all features are linearly transformed into [0, 1] space. Hence, we use a standard bandwidth of 0.2 for the Gaussian kernel of the PWC, as this seems to be reasonable. The AL algorithms are: Optimized Probabilistic Active Learning (OPAL) [26], uncertainty sampling (Unc) [29], an optimized version of expected error reduction by Chapelle (EER) [11], and random sampling (Rand).

In 5-fold cross validation, we split the dataset D into 5 separate subsets (D = D1 ∪ ... ∪ D5, Di ∩ Dj = ∅, i ≠ j) to build disjoint candidate and evaluation sets (Ti, Ei). Here, we applied AL five times on four of the subsets and evaluated the trained classifier on the left-out subset.

Performing solely one complete 5-fold cross validation, as shown in Fig. 3, the performances might vary a lot. Furthermore, the ranking of the final performance (after 60 labels have been acquired) changes completely. The left evaluation shows OPAL being the best, followed by Expected Error Reduction, Random, and Uncertainty Sampling. Using another seed (right plot), the ranking is different: first OPAL, then Random, Uncertainty Sampling, and Expected Error Reduction. This clearly shows that a 5-fold cross validation for these AL methods on this dataset using a PWC is not sufficient. Similar experiments (not shown due to space restrictions) indicate that this is also true for other datasets and other classifiers. Repeating this 5-fold cross validation 10 times, as shown in Fig. 4, provides much more stable results that are also comparable to the ones from the following experiment.

Fig. 3. Results of a 5-fold cross validation: two executions with different seeds of a complete 5-fold cross validation (accuracy over the number of acquired labels for OPAL, Unc, EER, and Rand).

Fig. 4. Mean results of 10 times repeated 5-fold cross validations (accuracy over the number of acquired labels).

5 Challenges of Realistic Evaluation

Publications from companies such as Microsoft [24, 35], IBM [32], or Mitsubishi [23] show the growing interest in AL and its practical usefulness. AL has been successfully applied to problems such as on-road vehicle detection [40] or recommender systems [28]. Unfortunately, these systems are highly specialized and often cannot be easily used for related problems.

In contrast to lab experiments, real active learning approaches only have one shot to learn. Hence, it is not the mean performance over multiple repetitions that is of interest but the pairwise comparison of the different methods. Because of high variances, it is still difficult to ensure a certain improvement in performance of one selection algorithm over others. This is the reason why many researchers argue that random sampling is still a powerful baseline [10].

One of the main challenges of applying active learning in practice is to know when to stop querying for new label information. By now, in real-world applications, the AL process stops when a given "labeling budget" has been consumed. For example, in [40] the performance of the investigated AL approaches is assessed after a fixed number of queried samples. But this may be a waste of resources, both in terms of time and money. Thus, the active learner should be able to assess its own performance. Here, different problems occur: a) collecting a separate evaluation dataset by randomly sampling instances is expensive, and b) the collected data cannot be used for performance estimation due to the sampling bias [12]. Some research has been done on when to stop the AL process besides estimating the performance directly [14, 34, 45]. It has been shown that it is possible to identify when a learning process might be saturated, but none of these approaches provides information about the real classification performance.

In dedicated collaborative interactive learning (D-CIL) [9], different realistic applications for AL have been outlined. It addresses AL processes that are interactive – the information flows from humans to the active learner and vice versa, collaborative – multiple domain experts collaborate, and dedicated – a small number of benevolent domain experts interact with the active learner in order to support the selection process. Even though the oracles are impersonated by benevolent domain experts, they are still prone to error. Their labeling performance may depend on the labeler's experience, their form on the given day, or the complexity of the learning problem. In the case of an opportunistic active learner [4], the oracles are not necessarily embodied by benevolent domain experts. Similar smart systems, simulation systems, or the learning system's own sensors may, together or separately, constitute the oracle. Furthermore, there is high heterogeneity between these oracles, and their number is not fixed.
To summarize, AL research is mostly based on the following (limiting) assumptions [9]: a) the classification problem is well-defined (i.e., the number of classes and features is known in advance), b) labeled samples are available at the beginning of the learning process, c) labeling costs are uniform (i.e., identical for all samples), d) the oracle is omnipresent and omniscient, and e) there exists a ground truth based on which the performance of the active learner is evaluated. However, these assumptions often do not hold in real-world applications. Although a large variety of specialized solutions exists which solve single problems, further work is necessary to apply these methods in a real-world setting. Here, a central aspect is the lack of comparability across different approaches, which is a critical point for practitioners who want to apply AL in their specific domain.

6 Conclusion and Outlook

In this article, we summarized various challenges of AL evaluation with regard to being reliable, realistic, and comparable. Some of these appear naturally from the problem's definition, others are defined through the demands of real-world applications. We proposed an evaluation methodology to initiate a discussion on a gold standard for AL evaluation, which hopefully leads to comparable results without repeating whole experiments, and provided preliminary results in a pilot study that shows the importance of many repetitions in AL. Nevertheless, it is essential to report all details of the evaluation to be able to reproduce the results of a paper. Those details have been discussed in this paper.

As future work, we plan to extend this literature overview and refine our proposed methodology. Additionally, we aim at providing a large comparison of different methodologies showing the effect of each component for different selection strategies. In this paper, we excluded the discussion of online algorithms and methods for evolving data streams. Providing a valid evaluation framework for one-shot AL is one of the goals of future research.

Our vision is to develop an evaluation system enabling researchers and practitioners to collaborate. This system will provide a web-based user interface like OpenML [44], showing detailed information about different AL methods and their specific characteristics in relation to different tasks. In that way, we aim to standardize AL evaluation in order to simplify the steps towards practical solutions and fair comparison.

References

1. Aggarwal, C.C., Kong, X., Gu, Q., Han, J., Yu, P.S.: Active learning: A survey. In: Aggarwal, C.C. (ed.) Data Classification: Algorithms and Applications, pp. 571–606. CRC Press (2014)
2. Aldogan, D., Yaslan, Y.: A comparison study on ensemble strategies and feature sets for sentiment analysis. Lecture Notes in Electrical Engineering 363, 359–370 (2016)
3. Asuncion, A., Newman, D.J.: UCI machine learning repository (2015), http://archive.ics.uci.edu/ml/
4. Bahle, G., Calma, A., Leimeister, J.M., Lukowicz, P., Oeste-Reiß, S., Reitmaier, T., Schmidt, A., Sick, B., Stumme, G., Zweig, K.: Lifelong learning and collaboration of smart technical systems in open-ended environments – Opportunistic Collaborative Interactive Learning. In: International Conference on Autonomic Computing. IEEE, Würzburg, Germany (2017)
5. Bilgic, M., Getoor, L.: Active learning for networked data. Computer 411(29-30), 2712–2728 (2010)
6. Bouguelia, M.R., Belaïd, Y., Belaïd, A.: An adaptive streaming active learning strategy based on instance weighting. Pattern Recognition Letters 70, 38–44 (2016)
7. Brinker, K.: Incorporating diversity in active learning with support vector machines. In: Proceedings of the 20th International Conference on Machine Learning (ICML). pp. 59–66 (2003)
8. Cai, W., Zhang, Y., Zhou, S., Wang, W., Ding, C., Gu, X.: Active learning for support vector machines with maximum model change. In: Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases. vol. 8724 (2014)
9. Calma, A., Leimeister, J.M., Lukowicz, P., Oeste-Reiß, S., Reitmaier, T., Schmidt, A., Sick, B., Stumme, G., Zweig, A.K.: From active learning to dedicated collaborative interactive learning. In: Varbanescu, A.L. (ed.) 29th International Conference on Architecture of Computing Systems, Workshop Proceedings. pp. 1–8. VDI Verlag, Nuremberg, Germany (2016)
10. Cawley, G.C.: Baseline methods for active learning. In: Active Learning and Experimental Design Workshop in Conjunction with AISTATS 2010. pp. 47–57 (2011)
11. Chapelle, O.: Active learning for Parzen window classifier. In: Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics. pp. 49–56 (2005)
12. Dasgupta, S., Hsu, D.: Hierarchical sampling for active learning. In: Proceedings of the 25th International Conference on Machine Learning. pp. 208–215. ACM (2008)
13. Demir, B., Persello, C., Bruzzone, L.: Batch-mode active-learning methods for the interactive classification of remote sensing images. IEEE Transactions on Geoscience and Remote Sensing 49(3), 1014–1031 (2011)
14. Dimitrakakis, C., Savu-Krohn, C.: Cost-minimising strategies for data labelling: Optimal stopping and active learning. pp. 96–111. Springer Berlin Heidelberg, Berlin, Heidelberg (2008)
15. Efron, B.: Bootstrap methods: Another look at the jackknife. The Annals of Statistics, 1–26 (1979)
16. Ertekin, S., Huang, J., Bottou, L., Giles, L.: Learning on the border: Active learning in imbalanced data classification. In: Proceedings of the 16th ACM Conference on Information and Knowledge Management. pp. 127–136. CIKM '07, ACM, New York, NY, USA (2007)
17. Flach, P., Hernandez-Orallo, J., Ferri, C.: A coherent interpretation of AUC as a measure of aggregated classification performance. In: Getoor, L., Scheffer, T. (eds.) Proceedings of the 28th International Conference on Machine Learning, ICML 2011, Bellevue, Washington, USA. pp. 657–664. ACM, New York, NY, USA (2011)
18. Fu, Y., Zhu, X., Li, B.: A survey on instance selection for active learning. Knowledge and Information Systems 35(2), 249–283 (2013)
19. Hand, D.J.: Measuring classifier performance: A coherent alternative to the area under the ROC curve. Machine Learning 77(1), 103–123 (2009)
20. Hu, B.G., Dong, W.M.: A study on cost behaviors of binary classification measures in class-imbalanced problems. arXiv preprint arXiv:1403.7100 (2014)
21. Huang, K.H., Lin, H.T.: A novel uncertainty sampling algorithm for cost-sensitive multiclass active learning. In: 2016 IEEE 16th International Conference on Data Mining (ICDM) (2016)
22. Huang, S.J., Jin, R., Zhou, Z.H.: Active learning by querying informative and representative examples. In: NIPS'10 Proceedings of the 23rd International Conference on Neural Information Processing Systems. pp. 892–900 (2010)
23. Joshi, A.J., Porikli, F., Papanikolopoulos, N.P.: Scalable active learning for multiclass image classification. IEEE Transactions on Pattern Analysis and Machine Intelligence 34(11), 2259–2273 (2012)
24. Kapoor, A., Horvitz, E., Basu, S.: Selective supervision: Guiding supervised learning with decision-theoretic active learning. In: Veloso, M.M. (ed.) Proceedings of the 20th International Joint Conference on Artificial Intelligence. pp. 877–882. Morgan Kaufmann Publishers Inc. (2007)
25. Kottke, D., Krempl, G., Lang, D., Teschner, J., Spiliopoulou, M.: Multi-class probabilistic active learning. In: ECAI. Frontiers in Artificial Intelligence and Applications, vol. 285, pp. 586–594. IOS Press (2016)
26. Krempl, G., Kottke, D., Lemaire, V.: Optimised probabilistic active learning (OPAL) for fast, non-myopic, cost-sensitive active classification. Machine Learning pp. 1–28 (2015)
27. Krempl, G., Kottke, D., Spiliopoulou, M.: Probabilistic active learning: Towards combining versatility, optimality and efficiency. In: Proceedings of the 17th International Conference on Discovery Science (DS), Bled. Lecture Notes in Computer Science, Springer (2014)
28. Lamche, B., Trottmann, U., Wörndl, W.: Active learning strategies for exploratory mobile recommender systems. In: Proceedings of the Fourth Workshop on Context-Awareness in Retrieval and Recommendation. pp. 10–17. Amsterdam, Netherlands (2014)
29. Lewis, D., Gale, W.: A sequential algorithm for training text classifiers. In: Conference on Research and Development in Information Retrieval. pp. 3–12. ACM/Springer, New York, NY (1994)
30. Li, X., Guo, Y.: Active learning with multi-label SVM classification. In: Proceedings of the 23rd International Joint Conference on Artificial Intelligence (2013)
31. Longstaff, B., Reddy, S., Estrin, D.: Improving activity classification for health applications on mobile devices using active and semi-supervised learning. In: Proceedings of the 4th International ICST Conference on Pervasive Computing Technologies for Healthcare (2010)
32. Melville, P., Sindhwani, V.: Active dual supervision: Reducing the cost of annotating examples and features. In: Workshop on Active Learning for Natural Language Processing. pp. 49–57. Boulder, CO (2009)
33. Nuzzo, R.: Statistical errors. Nature 506(7487), 150 (2014)
34. Olsson, F., Tomanek, K.: An intrinsic stopping criterion for committee-based active learning. In: Conference on Computational Natural Language Learning. pp. 138–146. Boulder, CO (2009)
35. Paquet, U., Gael, J.V., Stern, D., Kasneci, G., Herbrich, R., Graepel, T.: Vuvuzelas & active learning for online classification. In: Workshop on Computational Social Science and the Wisdom of Crowds. pp. 1–5. Whistler, BC (2010)
36. Parker, C.: An analysis of performance measures for binary classifiers. In: Proceedings of the 11th IEEE International Conference on Data Mining (ICDM). pp. 517–526. IEEE (2011)
37. Pasolli, E., Melgani, F.: Active learning methods for electrocardiographic signal classification. IEEE Transactions on Information Technology in Biomedicine 14(6), 1405–1416 (2010)
38. Reitmaier, T., Sick, B.: Let us know your decision: Pool-based active training of a generative classifier with the selection strategy 4DS. Information Sciences 230, 106–131 (2013)
39. Settles, B.: Active learning literature survey. Computer Sciences Technical Report 1648, University of Wisconsin, Department of Computer Science (2009)
40. Sivaraman, S., Trivedi, M.M.: Active learning for on-road vehicle detection: A comparative study. Machine Vision and Applications pp. 1–13 (2011)
41. Son, Y., Lee, J.: Active learning using transductive sparse Bayesian regression. Information Sciences 374, 240–254 (2016)
42. Tomanek, K., Morik, K.: Inspecting sample reusability for active learning. In: Guyon, I., Cawley, G.C., Dror, G., Lemaire, V., Statnikov, A.R. (eds.) Workshop on Active Learning and Experimental Design. JMLR Proceedings, vol. 16, pp. 169–181 (2011)
43. Tuia, D., Volpi, M., Copa, L., Kanevski, M., Munoz-Mari, J.: A survey of active learning algorithms for supervised remote sensing image classification. IEEE Journal of Selected Topics in Signal Processing 5(3), 606–617 (2011)
44. Vanschoren, J., van Rijn, J.N., Bischl, B., Torgo, L.: OpenML: Networked science in machine learning. SIGKDD Explorations 15(2), 49–60 (2013)
45. Vlachos, A.: A stopping criterion for active learning. Computer Speech & Language 22(3), 295–312 (2008)
46. Wang, J., Park, E.: Active learning for penalized logistic regression via sequential experimental design. Neurocomputing 222, 183–190 (2017)
47. Wilcoxon, F.: Individual comparisons by ranking methods. Biometrics Bulletin 1(6), 80–83 (1945)
48. Yan, Y., Rosales, R., Fung, G., Dy, J.G.: Active learning from crowds. In: Proceedings of the 28th International Conference on Machine Learning. pp. 1161–1168 (2011)
49. Yang, Y., Ma, Z., Nie, F., Chang, X., Hauptmann, A.G.: Multi-class active learning by uncertainty sampling with diversity maximization. International Journal of Computer Vision 113(2), 113–127 (2014)
50. Yanik, E., Sezgin, T.M.: Active learning for sketch recognition. Computers and Graphics (Pergamon) 52, 93–105 (2015)