Simulation of Annotators for Active Learning: Uncertain Oracles

Adrian Calma and Bernhard Sick
Intelligent Embedded Systems, University of Kassel, Germany
{adrian.calma,bsick}@uni-kassel.de

Abstract. In real-world applications, the information for previously unknown categories (labels) may come from various sources, often but not always humans. Therefore, a new problem arises: The labels are subject to uncertainty. For example, the performance of human annotators depends on many factors, e.g., expertise/experience, concentration/distraction, or fatigue level. Furthermore, some samples are difficult for both experts and machines to label (e.g., samples near the decision boundary). Thus, one question arises: How can one make use of annotators that can be erroneous (uncertain oracles)? A first step towards answering this question is to create experiments with humans, which involves a high effort in terms of time and money. This article addresses the following challenge: How can the expertise of erroneous human annotators be simulated? First, we discuss situations in which humans are prone to error. Second, we present methods for conducting active learning experiments with simulated uncertain oracles that possess various degrees of expertise (e.g., local/global or class/region dependent).

Keywords: Active Learning, Uncertain Oracles

1 Introduction

Consider the following problem: We have access to a large set of unlabeled images and the possibility to buy labels for any data point. Our first goal is to train a classifier with the highest possible accuracy. A possible approach is to label all the images and then train the classifier on the labeled data set. Now, suppose we have a limited budget, which does not allow us to label all the images. Our second goal is to keep the costs to a minimum. Thus, we need a strategy to determine which images should be labeled. A naive strategy would be to select the images at random.
But we can do better than that if we use a selection strategy that selects the most informative images. Precisely at this point, active learning (AL) comes in, more specifically pool-based active learning (PAL). The learning cycle is presented in Figure 1: There is a large set of unlabeled data and our goal is to train a model (e.g., a classifier). Thus, we need to select the most informative data points based on a selection strategy and present them to an annotator (e.g., a domain expert), generally called an oracle, for labeling. The labeled samples are added to the training set (the set with labeled data), the classifier is updated and, depending on the chosen stopping criterion (e.g., is there still money in our budget?), we continue to ask for more labels or not.

Fig. 1. Pool-based active learning cycle.

At this point we can ask ourselves: Are the labels provided by the human annotators correct? Probably not, as we can assume that humans are prone to error (Section 2). Thus, a new question arises: How can we deal with uncertainty regarding the labels? A first step towards answering this question is to develop techniques for simulating human experts who are prone to error. As we assume that they are unsure regarding the classification decision, we call these annotators uncertain oracles. Thus, this article focuses on presenting:
– cases in which an uncertain oracle misclassifies data (see Section 3) and
– techniques for simulating uncertain oracles in AL (see Section 4).
In the remainder of this article, we first present possible causes for erroneous labels and explain what we mean by the term "uncertainty" (Section 2). Then, we present and categorize various types of expertise (Section 3). In Section 4, we introduce possible approaches for simulating error-prone oracles. Related work is summarized in Section 5.
Finally, Section 6 concludes the article.

2 Motivation – The Problem

Until now, we have assumed that the answers provided by the oracles are always right. But it is obvious that they are not. On the one hand, the performance of human annotators (human oracles) depends on multiple factors, such as expertise, experience, level of concentration, level of interest, or level of fatigue [1]. On the other hand, the labels may come from simulations or test stands. Once again, it is justifiable to assume that due to imperfect simulations, sensor noise, or transmission errors, the labels are subject to uncertainty.

Depending on the difficulty of the labeling task, the oracles might be right in the case of "easy" classification problems. The more difficult a classification task is, the likelier it is that the oracle has a higher degree of doubt (i.e., is more uncertain) about its answer. Thus, the label uncertainty depends on the difficulty of the classification task, that is, on the number of steps an annotator has to perform to determine the right class, the designated time, and the risk involved in misclassification. These factors come in addition to the previously presented sources of uncertainty, such as the knowledge required for problem understanding, experience with similar classification problems or labeling tasks, concentration, or tiredness.

What do we mean by "uncertainty"? When humans are asked to provide information about an actual situation, the confidence regarding the given answer depends on diverse factors, such as the difficulty/complexity of assessing that information, previous experience, or knowledge. Certainly, there are circumstances in which we cannot state our answer with absolute confidence. Thus, we tend to add additional information about the quality of our answer, i.e., to quantify and qualify our confidence [2].
On these grounds, we cannot assume that the oracles are omniscient; we have to soften the assumption of omniscience: An oracle may be wrong. In this context, the "uncertainty" is the degree of confidence for a given label. Consequently, we ask ourselves: How can we make use of uncertain oracles and, especially, how can we exploit an oracle's firm knowledge?

3 Human Expertise

When an expert has worked for a long period of time on a classification task, he possesses more "experience". That is, he has seen and labeled more data than an oracle that has just started to work on the labeling task. Therefore, such an oracle possesses global expertise about the classification problem. On the other hand, depending on how difficult the classification problem is or on the degree of expertise and experience, the oracle may bear only limited knowledge about the learning task, i.e., local expertise. At this point, we assume that the expertise of an oracle (its degree of uncertainty) is time invariant.

3.1 Global Expertise

The annotators have global expertise in the sense that their knowledge is not limited to a certain region of the input space or to a specific class. They "know" the problem in all its aspects. Still, they may possess different levels of expertise. Moreover, samples exist that are hard to label for both the learning system and the oracles. For example, samples that lie near the decision boundary of a classifier are data points that might be difficult to label for both the oracle and the active learner. From a practical point of view, we may ask the oracles to provide additional information when they provide labels for samples. This is required for assessing their certainty, or rather their uncertainty, regarding the provided answer. Such additional information may include asking for [1]:
1. a degree of confidence for one class,
2.
membership probabilities for each class,
3. a difficulty estimate, or
4. a relative difficulty estimate for two data points.

In the first case, a sample is presented to the oracle, for example an image. The oracle is asked to provide a class label for the sample and to estimate his degree of confidence. Further help regarding the degree of confidence may be provided, e.g., a graphical control element with which the oracle sets a certainty value by moving an indicator on a predefined scale (i.e., a slider). Thus, a possible answer may look like "I select class «cat» and I rate my certainty 3 on a scale from 1 to 4, where 4 is the highest score". Another possibility is to ask the oracle to provide class estimates for each of the possible categories. Given a 3-class classification problem, an answer may be "The self-estimated probability for the first class is 0%, for the second class 30%, and for the third class 70%". The last two cases address situations where the oracle has to estimate how difficult it was for him to label a specific data point. A possible answer may look like "I choose class «cat» and it was hard for me to determine it", if he was asked to label only one sample, or "It was easier for me to label the image depicting a «cat» than the one showing a «liger»", if asked to label two images simultaneously.

3.2 Local Expertise

The oracles possess local expertise in the sense that they do not have enough "experience": they can only recognize specific classes, or they are more reliable for specific regions of the input space or for certain features. That is, the human annotators are experts for:
1. different classes,
2. different regions of the input space, or
3. different dimensions of the input space (i.e., features, attributes).
We assume that, in some applications, the oracles not only have diverse degrees of experience and expertise, but also various levels of proficiency for different parts of the classification problem.
For example, the oracle may be more confident and adept in detecting certain classes. The quality of the given answers and his confidence may vary over the regions of the input space, or it may depend on the considered features (dimensions of the input space). It is not required to change the way the active learner queries new labels; the query approaches described in Section 3.1 can be adopted for this case, too.

3.3 Disparate Features

Up to this point, we assumed that the oracle and the active learner consider the same features for solving the classification problem. But this is not always the case. For example, complex processes happen in our brains when we examine an image. It is hard to say which "features" we consider when trying to recognize or evaluate the content of that specific image. The active learner "views" the same image, but it may consider additional features such as histograms, or apply filters (e.g., anisotropic diffusion [3] or median [4] filters) or transformations (e.g., Fourier [5] or Hough [6] transform) to the image. Obviously, we can provide this additional information to the oracles, but the active learner might not have access to all features that were "extracted" by the oracles. Once again, the answers expected from the oracles can be inferred from Section 3.1. You may ask yourself why we do not ask the oracle for additional information regarding the features it considers for its decision. As we focus on classification tasks, we do not consider this in this work, but it is definitely an interesting research topic, commonly referred to as active feature selection [7].

4 Simulating Error-Prone Annotators

A first step towards exploiting the knowledge of an uncertain oracle would be to analyze how the current AL paradigms perform in combination with multiple oracles.
But such experiments are costly in terms of both money and time. If we are able to successfully simulate uncertain oracles, then we can investigate the performance of the selection strategies and of the classifiers without generating additional costs in this research phase. Moreover, based on the knowledge gathered from investigating current active learning techniques in a dedicated collaborative interactive learning (D-CIL) context, we can develop new techniques that take the uncertainty into consideration. That brings us to the following question: How can we simulate error-prone annotators (uncertain oracles)? In the following, we describe different approaches for simulating uncertain oracles.

4.1 Omniscient Oracle

For the sake of completeness, we shortly describe how an omniscient oracle can be simulated and what we understand by experience in this context. Simulating this type of oracle is straightforward: It returns the true labels of the samples. That is, the labels are not manipulated in any way. How can we simulate the experience? We define the experience as the number of samples the uncertain oracle has already seen and labeled. Thus, when we consider the complete data set for training a classifier (i.e., supervised learning), we can simulate an uncertain oracle with maximal global experience. Global, in the sense that the expertise is not limited to a region of the input space or to a specific class.

4.2 Uncertain Oracle with Global Expertise

At first, we concentrate on how to simulate uncertain oracles with global expertise and the same degree of experience. We assume that samples near the decision boundary of the classifier are hard to classify for both the human expert (human oracle) and the classifier.
Thus, we can simulate an uncertain oracle by randomly altering (changing) the classes of the samples lying near the decision boundary. A legitimate question may arise: What is the "right" decision boundary? We do not know it, but we can estimate it. As one of the goals of active learning is to be as good as a learner trained in a supervised way, we can train a classifier in a supervised way (i.e., on the overall data set). The decision boundary resulting from this classifier can be used to determine the samples for which the labels are altered.

The next challenge is to simulate oracles that have different levels of experience. For example, an oracle may have just started labeling samples for this type of problem. Thus, it has labeled only a few samples and, of course, its experience is based on a small amount of data. One possible way to simulate its "experience" is to reduce the number of samples on which the classifier is trained. As the classifier is used as a model of the experience, by reducing the number of samples we increase the level of uncertainty. By doing so, we simulate an oracle that has little experience. Depending on the reduction factor, uncertain oracles with different levels of experience can be simulated. Moreover, we can split the data in such a way that the training set of the classifier used to simulate the uncertain oracle is larger than the pool of unlabeled data. Thus, the data from which the uncertain oracle gathered its experience is larger than the data from which the active learner can select samples for labeling, resulting in a simulated oracle with a higher degree of expertise.

Another possibility to simulate uncertain oracles with different levels of experience is to vary the parameter values of the classifiers. For example, we can simulate the expertise of an uncertain oracle with a classifier trained with default parameters.
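The two ideas of flipping labels near an estimated decision boundary and shrinking the training set to model little experience can be sketched as follows. This is a minimal sketch, not the article's concrete implementation: the synthetic data set, the logistic regression used as "experience" model, the margin threshold, and the flip probability are all illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=4, random_state=1)
rng = np.random.default_rng(0)

# Estimate the "right" decision boundary with a fully supervised classifier.
# Training on fewer samples (smaller n_experience) would simulate an oracle
# with less experience and, thus, a less reliable boundary estimate.
n_experience = len(X)
idx = rng.choice(len(X), size=n_experience, replace=False)
experience_model = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])

# Samples with a small gap between the two posterior estimates lie near the
# decision boundary; flip their labels with probability 0.5.
proba = experience_model.predict_proba(X)
margins = np.abs(proba[:, 0] - proba[:, 1])
near_boundary = margins < 0.3
flip = near_boundary & (rng.random(len(X)) < 0.5)
y_oracle = np.where(flip, 1 - y, y)   # labels returned by the uncertain oracle
```

Away from the boundary, the simulated oracle always answers correctly; only near-boundary samples are subject to random label noise, which matches the assumption that these samples are hard for humans as well.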
For a better expertise, we can apply heuristics (e.g., grid search) to find suitable parameters for the classifiers. Furthermore, the expertise can be simulated by different types of classifiers: we can use generative or discriminative classifiers for simulating the expertise of an expert. Last but not least, we can add noise to the feature values. Of course, this is not always possible, as it depends on the type of feature (i.e., nominal, continuous, ordinal, etc.) and on the value ranges. By doing so, we can simulate uncertain oracles whose experience is built on similar samples.

In a nutshell, we can simulate oracles with global and various degrees of expertise by
– modifying (altering) the classes of the samples lying near the decision boundary,
– training different classifier types for various uncertain oracles, and
– training a classifier
  • on training sets of different sizes (more or fewer samples than in the pool of unlabeled data),
  • using different parametrization strategies and parameter sets, or
  • adding noise to the feature values (if possible and if it makes sense).
Additionally, any combination of the previous simulation methods can be applied. For example, if we want to simulate an oracle with little global expertise based on similar samples, we can reduce the training set of the classifier and add noise to the feature values.

4.3 Uncertain Oracle with Local Expertise

The expertise of an oracle can be restricted to a certain class or to a specific region of the input space. Thus, to simulate better expertise with respect to one or more classes of our choice, we can change the labels of the samples belonging to the classes for which we would like to simulate little (or no) expertise.
It is also possible to exclude the samples belonging to one class, which translates to "the uncertain oracle has no expertise regarding this specific class". One possible approach is to train a generative classifier on these data. The resulting classifier estimates the processes that are supposed to generate the data, i.e., each process generates samples belonging to only one class. Therefore, we can artificially change the labels assigned to the estimated processes, which results in an erroneous classification of the samples that are assumed to be generated by those processes.

The expertise of the uncertain oracles may also be restricted to a specific region of the input space. Depending on the feature values, the labeling quality can suffer. For example, an uncertain oracle is more accurate for samples that lie in regions of the input space which have been previously seen or learned by the oracle. We propose two ways to simulate the local expertise: (1) by using various classifier types and (2) by deliberately altering the class affiliations of the samples lying in those regions. By using different classifier types, the regions of the input space are modeled in different ways and, thus, the result of the classification may vary. By modifying the classes of the samples lying in specific regions of the input space, the result of the classifier is modified. That is, for samples lying in these regions, the expertise of the uncertain oracle is diminished.

The difference between class-based experience and region-based experience is shown in Figure 2. Here, we have a region of the input space where two classes strongly overlap, green ◦'s and blue +'s. If we assume that a human expert has firm knowledge about the class green ◦, then he will probably label the samples that belong to the green class correctly and the others not (higher error rate for blue + and red △).
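The class-based degradation just described might be sketched as follows. The class counts, the error rate, and the set of well-known classes are assumptions chosen purely for illustration.

```python
import numpy as np

# Class-based local expertise: the simulated oracle knows class 0 well;
# labels for the remaining classes are flipped with an assumed error rate.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 3, size=200)   # three classes, as in Fig. 2

known_classes = {0}                     # firm knowledge only for class 0
error_rate = 0.4                        # assumed per-sample error probability

def class_based_oracle(label):
    """Return the true label for well-known classes, a noisy one otherwise."""
    if label in known_classes or rng.random() > error_rate:
        return int(label)
    # mislabel: pick one of the other classes at random
    return int(rng.choice([c for c in range(3) if c != label]))

y_oracle = np.array([class_based_oracle(c) for c in y_true])
```

All class-0 samples keep their correct labels, while the oracle's answers for the other two classes carry a higher error rate, mirroring the Figure 2 discussion.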
On the other hand, assuming that the oracle correctly labels samples in a given region of the input space leads us to the conclusion that it labels all the samples in the specified region correctly. For example, if the uncertain oracle has region-based expertise for samples whose feature values lie in [−1.5, 1.5], this will lead to correct class affiliations for samples lying in this region. In this concrete case, samples lying in the square defined by (−1.5, −1.5) and (1.5, 1.5) and belonging to either class are labeled correctly.

Fig. 2. Samples belonging to three classes (green ◦'s, blue +'s, and red △'s) depicted in the input space, whereby the processes generating samples belonging to green ◦'s and blue +'s strongly overlap.

An overview of the introduced simulation methods is presented in Figure 3. The core of the simulation techniques is the assumption regarding which features are considered. The described simulation methods can be applied in both cases: when the uncertain oracle considers the same features as the active learner and when it does not.

4.4 Motivating Example: Generative Classifier based Simulation

One possible way to simulate the expertise of an uncertain oracle is by means of a generative classifier, e.g., a classifier based on a probabilistic mixture modeling approach. That is, for a given D-dimensional input sample x′ we can compute the posterior distribution p(c|x′), i.e., the probabilities for membership in class c given the input x′. To minimize the risk of classification errors, we then select the class with the highest posterior probability (cf. the principle of winner-takes-all). Thus, the "uncertainty" can be computed as 1 − p(c′|x′), where c′ = argmax_c p(c|x′).
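A minimal sketch of this computation follows. The two Gaussian classes, the equal class priors, and the use of scikit-learn's GaussianMixture as the per-class generative model are illustrative assumptions, not the article's prescribed setup.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Two classes, each generated by one Gaussian process.
rng = np.random.default_rng(0)
X0 = rng.normal(loc=-2.0, size=(100, 2))   # samples of class 0
X1 = rng.normal(loc=+2.0, size=(100, 2))   # samples of class 1

# One generative model per class approximates the generating process.
models = [GaussianMixture(n_components=1, random_state=0).fit(Xc)
          for Xc in (X0, X1)]
priors = np.array([0.5, 0.5])              # assumed equal class priors

def oracle_answer(x):
    """Winner-takes-all label c' and the uncertainty 1 - p(c'|x')."""
    log_lik = np.array([m.score_samples(x.reshape(1, -1))[0] for m in models])
    posterior = np.exp(log_lik) * priors   # Bayes' rule (unnormalized)
    posterior /= posterior.sum()
    c = int(np.argmax(posterior))
    return c, 1.0 - posterior[c]

label, uncertainty = oracle_answer(np.array([1.8, 2.2]))   # clearly class 1
```

Samples deep inside one class's region yield an uncertainty close to 0, while samples between the two means yield values approaching 0.5, so the returned uncertainty can directly serve as the simulated oracle's self-assessed confidence.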
In the case of other classifier types (e.g., Support Vector Machines), Platt scaling [8] can be used to transform the outputs into probability distributions.

Fig. 3. Types of expertise and possible simulation practices.

5 Related Work

In [9], the authors simulate oracles with different types of accuracies: 10% of the samples are incorrect, 20% unknown, and 70% uncertain knowledge. k-means clustering is applied in [10] to generate the concepts and to assign the oracles to different clusters, in order to simulate the experience (there called "knowledge sets"). Clustering is also used in [11], where some clusters represent regions for which the oracles give "unsure" as feedback. Virtual oracles for binary classification with different labeling qualities, controlled by two parameters that represent the label accuracy regarding the two classes, are presented in [12]. In [13], a uniform distribution is applied to simulate various behaviors of the oracles. Randomly flipping labels with a specific probability [14] and ranges for the noise rate [15] have also been applied to simulate uncertain oracles. A Gaussian distribution [16] has also been used to simulate the expertise of oracles. Moreover, multiple oracles whose label quality does not vary have been simulated [17].
6 Conclusion

In this article, we addressed a challenge in the field of AL and, especially, in the field of D-CIL [1], where oracles might be wrong for various reasons. Thus, the queried labels are subject to uncertainty. The research regarding uncertain oracles is still in its infancy, so we proposed simulation methods for uncertain oracles in order to help the research advance. The simulation methods will help to investigate the performance of current AL techniques and to understand their advantages and disadvantages. Moreover, new questions for future research arise: How can we exploit the uncertain oracles? Is it necessary to re-query labels for already labeled samples? How can we learn (model) the expertise of an uncertain oracle? How do we decide whether the uncertain oracle is erroneous or the process to be learned is nondeterministic? How do we decide whom to ask next?

References

1. Calma, A., Leimeister, J.M., Lukowicz, P., Oeste-Reiß, S., Reitmaier, T., Schmidt, A., Sick, B., Stumme, G., Zweig, K.A.: From active learning to dedicated collaborative interactive learning. In: International Conference on Architecture of Computing Systems, Nuremberg, Germany (2016) 1–8
2. Motro, A., Smets, P., eds.: Uncertainty Management in Information Systems – From Needs to Solutions. Springer US (1997)
3. Weickert, J.: Anisotropic Diffusion in Image Processing. B.G. Teubner, Stuttgart (1998)
4. Zhu, Y., Huang, C.: An improved median filtering algorithm for image noise reduction. Physics Procedia 25 (2012) 609–616
5. Cochran, W., Cooley, J., Favin, D., Helms, H., Kaenel, R., Lang, W., Maling, G., Nelson, D., Rader, C., Welch, P.: What is the fast Fourier transform? Proceedings of the IEEE 55 (1967) 1664–1674
6. Nixon, M.S., Aguado, A.S.: Feature Extraction and Image Processing. Academic Press (2008)
7.
Liu, H., Motoda, H., Yu, L.: A selective sampling approach to active feature selection. Artificial Intelligence 159 (2004) 49–74
8. Platt, J.C.: Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in Large Margin Classifiers 10 (1999) 61–74
9. Fang, M., Zhu, X.: Active learning with uncertain labeling knowledge. Pattern Recognition Letters 43 (2013) 98–108
10. Fang, M., Zhu, X., Li, B., Ding, W., Wu, X.: Self-Taught Active Learning from Crowds. In: 2012 IEEE 12th International Conference on Data Mining (ICDM), Brussels, Belgium (2012) 1–6
11. Zhong, J., Tang, K., Zhou, Z.H.: Active Learning from Crowds with Unsure Option. In: 24th International Conference on Artificial Intelligence, AAAI Press (2015) 1061–1067
12. Zhang, J., Wu, X., Sheng, V.S.: Active Learning With Imbalanced Multiple Noisy Labeling. IEEE Transactions on Cybernetics 45 (2015) 1081–1093
13. Kumar, A., Lease, M.: Modeling Annotator Accuracies for Supervised Learning. In: WSDM 2011 Workshop on Crowdsourcing for Search and Data Mining (CSDM 11), Hong Kong, China (2011) 19–22
14. Yan, Y., Rosales, R.: Active learning from multiple knowledge sources. In: 15th International Conference on Artificial Intelligence and Statistics (AISTATS), Volume XX., La Palma, Canary Islands (2012)
15. Du, J., Ling, C.X.: Active learning with human-like noisy oracle. In: IEEE 10th International Conference on Data Mining, Sydney, Australia (2010) 797–802
16. Zhao, L.: An Active Learning Approach for Jointly Estimating Worker Performance and Annotation Reliability with Crowdsourced Data. ArXiv (2014) 1–18
17. Shu, Z., Sheng, V.S., Li, J.: Learning from crowds with active learning and self-healing. Neural Computing and Applications (2017) 1–12