=Paper=
{{Paper
|id=Vol-3259/ialatecml_paper3
|storemode=property
|title=Enhancing Active Learning with Weak Supervision and Transfer Learning by Leveraging Information and Knowledge Sources
|pdfUrl=https://ceur-ws.org/Vol-3259/ialatecml_paper3.pdf
|volume=Vol-3259
|authors=Lukas Rauch,Denis Huseljic,Bernhard Sick
|dblpUrl=https://dblp.org/rec/conf/pkdd/RauchHS22
}}
==Enhancing Active Learning with Weak Supervision and Transfer Learning by Leveraging Information and Knowledge Sources==
<pdf width="1500px">https://ceur-ws.org/Vol-3259/ialatecml_paper3.pdf</pdf>
<pre>
    Enhancing Active Learning with Weak
    Supervision and Transfer Learning by
Leveraging Information and Knowledge Sources

                  Lukas Rauch, Denis Huseljic, and Bernhard Sick

         University of Kassel, Wilhelmshöher Allee 73, 34121 Kassel, Germany
                  {lukas.rauch, dhuseljic, bsick}@uni-kassel.de


        Abstract. One of the major limitations of deploying a machine learning
        model is the availability of labeled training data and the resulting expen-
        sive annotation process. Although active learning (AL) methods may re-
        duce the annotation cost by actively selecting the most-useful instances,
        a costly human annotator usually provides the labels. Therefore, even
        with AL, we still consider the annotation process to be time-consuming
        and expensive. Besides human annotators, though, companies often have
        a vast amount of information and knowledge sources available that can
        generate low-cost labels (e.g., a black-box model) or improve the learn-
        ing process (e.g., a pre-trained model). We present a novel approach that
        enhances AL with weak supervision (WS) and transfer learning (TL) to
        reduce the annotation cost by leveraging these sources. Specifically, we
        consider a black-box model like a rule-based system as an error-prone
        and weakly-supervised annotator that inexpensively provides labels. We
        estimate its performance with an annotator model to decide whether a
        human annotation is required. Additionally, we utilize unlabeled internal
        and external data by transferring knowledge from a pre-trained model
        to the AL cycle. We sequentially investigate the impact of WS and TL
        on annotation cost and model performance in an AL cycle through a use
        case. Our evaluation shows that our approach can reduce annotation cost
        by 51% while achieving nearly identical model performance compared to
        a traditional AL approach.

        Keywords: Active Learning · Weak Supervision · Transfer Learning ·
        Information and Knowledge Sources.


1     Introduction
In recent years, there has been an increasing interest in machine learning ap-
plications across all industries [25]. In particular, (deep) neural networks (NNs)
have proven beneficial for unstructured data types such as image or text data.
However, one of the major real-world bottlenecks in deploying a NN is the need
for large labeled training data sets to reach peak performance [25,30]. To reduce
annotation cost for the training process, active learning (AL) [4,31] is a part
of human-in-the-loop learning [13] where we actively select the most-useful in-
stances. The goal is to reduce annotation cost while maximizing the performance

    © 2022 for this paper by its authors. Use permitted under CC BY 4.0.
2
28     L.Rauch
       L. Rauch,etD.al.Huseljic, B. Sick

of a model trained on an actively selected subset from an unlabeled data pool
[32,12]. However, since a human annotator (HA) usually provides the labels,
the annotation process may still be time-consuming and expensive [11]. Besides
HAs, companies usually have a wide range of information and knowledge sources
[10] available such as an established black-box model (BBM) like a rule-based
system [23] or external data and a pre-trained model from the Internet. These
sources can provide labels (information source) or contain beneficial knowledge
for training NNs (knowledge source). Nevertheless, they are often ignored or not
fully utilized in practice. This raises the question of how to efficiently leverage
and extract information and knowledge from available sources to further reduce
the annotation cost in AL.
    To address this question, research fields such as weak supervision (WS) [1,5]
and transfer learning (TL)[26] provide suitable methods. Specifically, WS meth-
ods generate noisy labels at low cost, e.g., with expert-defined rules or labeling
heuristics [2,30] and are typically applied after obtaining a high-quality labeled
data set. In TL [26], acquired knowledge of a pre-trained model is transferred
to a different but related downstream task. Combining AL-WS and AL-TL has
already shown promising results to further reduce the annotation cost in AL
[7,33]. However, to the best of our knowledge, there has not yet been a combi-
nation of all three fields in which multiple available information and knowledge
sources are exploited. Therefore, we investigate the following research questions
in this work:

Question 1. How can we enhance AL with WS so that we can leverage an avail-
able BBM as an information source to reduce the annotation cost with a com-
petitive model performance compared to a traditional AL approach?

Question 2. How far can the inclusion of TL to leverage unlabeled internal and
external data as knowledge sources empower the combination of AL-WS and,
thus, further reduce the annotation cost and improve the model performance?

To answer those research questions, we conduct experiments in a real-world use
case where we thematically classify banking transactions based on text data.
We extend an AL cycle with WS, training a classification and annotator model
simultaneously. Specifically, we consider an available BBM (a rule-based system
in our use case) as an error-prone and weakly-supervised annotator (WSA). The
annotator model allows us to decide whether annotations can be performed at
low cost by the WSA without a costly HA (a domain expert in our use case).
In addition, we further enhance the AL-WS cycle with TL. We fine-tune a pre-
trained model (a language model in our use case) from an external source on
unlabeled internal data for the downstream task with unsupervised learning.
This allows us to use labeled and unlabeled data to train our models in the AL
cycle. By doing so, we are the first to provide an approach to combine AL with
WS and TL by leveraging multiple available information and knowledge sources.
Based on the evaluation of our experiments, we summarize our contributions as
follows:
    Enhancing Active Learning with Weak Supervision and Transfer Learning        3
                                                                                29

 1. Enhancing AL with WS by leveraging a rule-based system as an information
    source through an annotator model leads to a reduction of the annotation
    cost by 43% with a nearly identical model performance compared to a tra-
    ditional AL approach. Our approach applies without any adjustments to a
    rule-based system and any BBM that provides class labels (e.g., a classifica-
    tion model).
 2. With the addition of TL, we leverage unlabeled internal data for the down-
    stream task and unlabeled external data through a pre-trained model as
    knowledge sources for the learning process. This enables us to reduce the
    annotation cost by 51% compared to a traditional AL approach and im-
    prove the model performance compared to the combination of AL-WS.
The remainder of this article is structured as follows. Section 2 presents related
approaches and illustrates the difference in our work. Subsequently, we propose
our approach in Section 3 and evaluate it in Section 4 within a use case. Finally,
we conclude our work and present future challenges in Section 5.

2    Related Work
Since AL is the backbone of our approach, we focus on related work regarding
combinations of AL-WS and AL-TL. To the best of our knowledge, there has
been no attempt yet to enhance AL with WS and TL.

Active Learning and Weak Supervision. Similar to our approach, [24] and [2]
combine AL and WS. However, in their approaches, human experts actively
select and annotate instances to improve a generative model that converts one-
hot-encoded into probabilistic labels. Moreover, the authors of [3] use this combi-
nation to improve the expert rules of a WS model with interactive user feedback.
In contrast to our approach, these methods primarily focus on WS and try to
improve it with AL techniques. Instead, we focus on an AL cycle and enhance it
with WS to reduce the annotation cost. Additionally, these works require labeling
functions that are created from scratch. We, on the contrary, can automatically
leverage information from any existing BBM that generates class labels without
necessarily designing labeling functions. This simplification saves the effort to
decompose an existing BBM for a generative model and enables us to treat it as
a WSA in an AL cycle.
    In comparison, [7] and [28] follow a similar objective as we do since they also
aim to enhance a traditional AL cycle with WS techniques to reduce human
interaction. The authors of [7] assign a pseudo label for a given instance in
a self-training setting if the classifier’s predicted probability exceeds a certain
threshold. Additionally, they automatically assign the majority class label of
similar instances to all unlabeled instances in a cluster. Moreover, instead of
annotating single instances, [28] use human labels to annotate a cluster of similar
instances to reduce human effort. However, these works do not consider a BBM
that generates class labels in a real-world setting. We automatically leverage this
existing knowledge source through an annotator model, reducing the annotation
cost in an AL cycle.
4
30      L.Rauch
        L. Rauch,etD.al.Huseljic, B. Sick

Active Learning and Transfer Learning. The authors of [27] combine AL and
TL but from a different perspective. While we aim to improve AL with TL, they
enhance TL by actively selecting the most-suitable instances for the source do-
main from the target domain. Furthermore, [14] actively fine-tune a pre-trained
model based on the contribution of an instance for the feature representation
and performance of a classification model on a target task to reduce the anno-
tation cost. In contrast, we do not actively select instances in the TL process
but enhance a classification and annotator model within an AL cycle with trans-
ferred knowledge. Additionally, [17] investigate how TL mitigates the random
initialization cold start and reduces label queries. The authors of [33] also lever-
age available unlabeled data but through unsupervised feature learning at the
beginning of an AL cycle and semi-supervised learning during the cycle. They
employ unsupervised pre-training by clustering the features and train a semi-
supervised model by generating pseudo-labels for unlabeled instances. This way,
they improve the model’s performance while requiring less labeled data [33].
In our approach, however, we apply unsupervised learning not only on existing
internal data but also propose to utilize external knowledge sources with TL.


3     Proposed Approach

In Section 3.1, we first give a formal definition of our problem setting. Conse-
quently, we describe our proposed approach in Section 3.2 as shown in Figure 1.
We design a modular approach so that we can selectively combine AL with WS
and TL. This enables us to compare the influence of the individual components
on the model performance and annotation cost.


3.1   Problem Setting

Problem. We consider a classification problem where we have a D-dimensional
instance that is described by a feature vector x ∈ X where X = RD describes the
feature space. An instance x is drawn independently from the same distribution
and belongs to a ground truth class label y ∈ Y where the set Y = {1, ..., C} de-
fines the space of all class labels and C is the number of classes. In a pool-based
AL scenario, we are given an unlabeled pool data set U(t) ⊆ X without class
labels. At each cycle iteration t ∈ N, we aggregate the most-useful instances
x∗ in a batch B(t) ⊂ U(t) with the size b ∈ N. These instances require labels
for the next cycle t+1 that annotators provide. Therefore, we define a set of
annotators A = {HA, WSA}, where we treat the HA as omniscient, providing
a costly ground truth class label and an available BBM as a WSA, providing
an error-prone class label at a low cost. Besides the class labels y to train the
classification model, we also add a binary agreement label z ∈ Z with the set
Z = {0, 1} to every instance in a batch to train the annotator model. We deter-
mine z based on the agreement between the labels provided by the HA and the
WSA. It represents which instances were correctly classified (1) or misclassified
(0) by the WSA. This means that we have to retrieve the WSA label at every
   Enhancing Active Learning with Weak Supervision and Transfer Learning         5
                                                                                31

selected instance. Thus, we denote the annotated batch as B ∗ (t) ∈ X × Y × Z
and the labeled data set as L(t) ⊆ X × Y × Z.

Model training. We express the classification model (e.g., a NN) through its
parameters at cycle iteration t as θ t . This model is trained on the labeled data
set L(t) where either the HA or the WSA provide the class label y. It maps an
instance to a vector of class probabilities with f θt : X → ∆C−1 , where ∆C−1 is
the C − 1 probability simplex spanned by C classes. Given an instance x ∈ X ,
the classification model predicts the probability vector p̂ = f θt (x). This vector
corresponds to an estimate of the categorical distribution of the classes made
by the model f θt . Additionally, we describe the annotator model through its
parameters ω t which result from training on the binary agreement label z of the
labeled data set L(t). With the function g ωt : X → [0, 1] the annotator model
maps an instance x ∈ X to a probability q̂ = g ωt (x). Its task is to estimate
the probability that the WSA can provide a true class label. Thus, both models
receive the same input instances from L(t) but are trained either on class or
binary agreement labels. Moreover, we denote the parameters extracted from a
pre-trained model as ϕ. Since the pre-trained model is only trained once, these
parameters are independent of the cycle iterations.

3.2   Proposed Cycle
Our proposed AL cycle is illustrated in Figure 1. In the following paragraphs,
we will give a detailed explanation of the steps in our approach.

Step 1 - Initialize Cycle. Before the cycle starts, we fine-tune a pre-trained
model on the unlabeled data U and all additional data that we do not consider
for AL with unsupervised learning. This model supplies initial parameters ϕ for
the classification and annotator model and provides feature representations that
are helpful for AL [33]. Thus, we do not randomly initialize the parameters of
a model at each cycle iteration. In our case, we utilize a pre-trained language
model to extract word embeddings for the downstream task. In the first step,
 1 at iteration t, the classification and annotator model are initially trained on
a small labeled data set L(t) where the instances x are drawn randomly from
the unlabeled pool data set U(t). Here, the HA provides the ground truth class
labels, and the WSA the error-prone class labels allowing us to compute the
binary agreement label, which is utilized for training the annotator model. After
the initialization step, we assume to have a trained classification model with the
parameters θ t and a trained annotator model with the parameters ω t .

Step 2 - Select Batch. The cycle continues in step 2 with the selection algorithm
of the AL module. We approximate the utility of all instances from the unlabeled
pool U(t) based on the entropy of the predicted probability of the classification
model f θt . Given a probability vector p̂, the entropy is defined as
                                          C
                                          X
                              H(p̂) = −         p̂c ln p̂c .                   (1)
                                          c=1
6
32        L.Rauch
          L. Rauch,etD.al.Huseljic, B. Sick

                                              Classiﬁcation Model
                               1
                       Initialize Cycle                                         Stop       6
            Pre-Trained Model                                                            Continue


                                                Annotator Model                 WS
          TL
                                                                                         Selection Algorithm
Unlabeled Pool
                         5
                  Retrain Models                                       Annotator
                                                                      Performance
                                                                                       AL

                                                     WSA
            Labeled                                                 Accept                  Select
            Data Set                                                                        Batch    2
                          Update
                         Data Sets                                              3
                                4                     HA                      Select
                                                                             Annotator
                                                                    Reject


       Fig. 1. A schematic illustration of the proposed AL cycle with WS and TL.

At cycle iteration t, we select the instance with maximum entropy according to
                                                    
                             x∗ = arg max H f θt (x) .                     (2)
                                          x∈U (t)

To aggregate a batch B(t) ⊂ U(t), we greedily select the most-useful instances x∗
until we reach the desired acquisition batch size b ∈ N. We refer to this sampling
strategy as max-entropy sampling.

Step 3 - Select Annotator. In step 3 with the WS module, we estimate the
annotator performance of the WSA to decide whether it should provide the
class labels for a specific instance. Therefore, we give each instance x∗ of the
selected batch B(t) to the annotator model g ωt which estimates the probabil-
ity q̂. Intuitively, we interpret q̂ as the probability that the WSA is capable of
providing the ground truth class label. This way, the annotator model assesses
the performance of the WSA. With the annotator performance estimation we
decide whether to reject an error-prone class label of the WSA. In our approach,
we investigate a simple reject function1 that is based on threshold α and the
estimated probability q̂ as given by
                                         (
                              ωt  ∗        1, if g ωt (x∗ ) ≥ α
                         rα (g (x )) =                                         (3)
                                           0, otherwise.
1
     It should be noted that more complex reject functions are available that could be
     the focus of future research.
    Enhancing Active Learning with Weak Supervision and Transfer Learning        7
                                                                                33

If a class label of the WSA is rejected, the HA has to provide the true class
label, enabling us to determine the binary agreement label z. However, suppose
we decide that the WSA can provide a ground truth class label. In that case, the
binary agreement label is set to 1 as a pseudo-label in the labeled pool. We refer
to this as a pseudo-label because no ground truth is available. This technique
can be considered semi-supervised learning [33].

Step 4 - Update Data Sets. In 4 , we update the unlabeled pool data set
U(t+1) = U(t)\B(t) with the instances from the aggregated batch. Additionally,
we update the labeled training set L(t+1) = L(t) ∪ B∗ (t) with the annotated
batch including the class and the binary agreement labels.

Step 5 - Retrain Models. In 5 , the classification and annotator model are re-
trained from scratch simultaneously. Before training, we initialize the models’
parameters with the parameters ϕ we obtain from the unsupervised pre-trained
model. This leads to an update of the model parameters θ t+1 and ω t+1 .

Step 6 - Continue/Stop Cycle. At the end of an iteration, we decide in 6
whether to continue or stop the AL cycle with a stopping criterion. AL strategies
in literature often use a simple pre-defined stopping criterion such as the desired
size of the labeled pool or the maximum number of cycle iterations [20,31]. As
this is not in the scope of this work, we choose the maximum number of instances
as our stopping criterion.


4     Experimental Evaluation

In Section 4.1, we summarize the experimental setup for our use case. We design
our experiments to enhance the AL cycle sequentially with the WS and TL
modules to investigate their impact on model performance and annotation cost.
The first experiments in Section 4.2 detail our findings where we enhance AL
with WS to leverage an available BBM as an information source to reduce the
annotation cost. Subsequently, Section 4.3 gives insights on how the addition of
TL further improves our approach by utilizing internal and external unlabeled
data with a pre-trained model as a knowledge source.


4.1   Experimental Setup

Use Case and Data. The data set in our use case consists of banking transactions.
The goal is to predict an appropriate thematic class (e.g., household or insurance)
based on short text descriptions of transactions with a NN. We do not have a
labeled data set available, but the following information and knowledge sources
are at our disposal:

 1. External Data: Besides internal in-domain data for the downstream task,
    a vast amount of general-domain text data is available on the Internet [29].
8
34        L.Rauch
          L. Rauch,etD.al.Huseljic, B. Sick

    As a pre-trained language model, we employ a fastText model [16] as a
    knowledge source. This model was trained on a general-domain corpus [8]
    and is available open-source2 . We do not employ a deep transformer model
    in this preliminary investigation to avoid the issues of deep AL.
 2. Internal Data: We leverage an extensive unlabeled data set with 7.7 million
    transactions to fine-tune the fastText model in an unsupervised manner with
    in-domain knowledge. To conduct the experiments efficiently, we randomly
    sample 9000 instances as the pool data set U and reserve 2000 instances with
    ground truth class labels for testing.
 3. Black-Box Model: A rule-based system that classifies transactions with
    hand-crafted labeling rules is available. It was developed iteratively over
    several years by domain experts, and we consider it a BBM since the labeling
    rules are unavailable. We treat the BBM as the WSA that generates error-
    prone class labels at low cost. Preliminary studies show that it achieves an
    accuracy of approximately 86% on the test set.
 4. Human Annotator: We assume a domain expert as an omniscient anno-
    tator that delivers ground truth class labels at a high cost. Specifically, the
    HA provides the class labels for the actively selected training instances when
    the label of the WSA is rejected and for the initialization step.

Models. The results are obtained by a classification model in our proposed AL
cycle. The classification model is a multi-layer perceptron with an embedding
layer to represent the text input with D = 300, a hidden layer with a ReLU
activation function and an output layer with C = 36 neurons for each class. The
annotator model is comprised of a similar structure, differing only in the output
layer with C = 1 neuron as the annotator model solves a binary classification
task. In each cycle iteration, we create a new vocabulary from the labeled pool
and adapt the input layer of both models. We employ the Adam optimizer [18] to
optimize the parameters, and the focal loss [22] as a loss criterion to address class
imbalance. Additionally, we add dropout with 20% probability to the hidden
neurons. We extract the static word embeddings from the pre-trained fastText
model as initial weights of the embedding layers. This process can be considered
as sequential TL [29].

Overall Experimental Design. To ensure comparability between our experiments,
we define basic AL parameter configurations for all experiments. The configu-
rations are generally based on results from preliminary studies in this use case.
Specific settings for the experiments are highlighted in the corresponding sec-
tions. The initial labeled data set consists of 250 randomly sampled instances
with ground truth labels provided by the HA. In preliminary work, this has
proven to be a sufficient initial quantity of instances to enable the models to
provide information to select the most-useful instances and suitable annotators.
We set the desired size of the labeled data pool to 5370 as a pre-defined stopping
criterion and the acquisition batch size b to 32 with 161 cycle iterations t. Our
2
     https://fasttext.cc/docs/en/crawl-vectors.html, accessed 2022-04-20
   Enhancing Active Learning with Weak Supervision and Transfer Learning        9
                                                                               35

previous studies have shown that this relatively small number of instances leads
to key results while enabling us to conduct experiments efficiently. We employ
random sampling as a baseline sampling strategy and compare it to max-entropy
sampling (Equation 2) for each experiment. Additionally, we decide between a
costly (HA) or low-cost class label (WSA) based on our proposed reject option
(Equation 3). Therefore, we define three different annotation scenarios to assess
the influence of the WSA and the resulting annotation costs:
1. full-human: The HA provides the class labels for all of the selected instances,
   and we reject the class labels of the WSA. We consider this scenario a conven-
   tional AL approach without WS that should achieve the highest performance
   but generate the greatest baseline annotation cost.
2. hybrid : We select the WSA and the HA to provide the class labels based on
   the assessment of the annotator performance. In preliminary studies, 0.85
   has proven to be a simple and promising reject threshold α, ensuring that we
   only accept labels of the WSA at high annotator performance estimations. At
   the same time, we ask the HA only for very uncertain instances to minimize
   annotation cost. Note that we must retrieve the class label of the WSA for
   every instance to determine the binary agreement label. This scenario reflects
   our approach combining AL with WS.
3. full-WSA: The WSA provides the class labels for all selected instances. This
   approach is the most inexpensive regarding the annotation cost, but we
   expect a deterioration of model performance. To ensure comparability, the
   HA still determines the ground truth class labels for the random initialization
   step.
As an exemplary cost scheme, we assign a cost of 1 to each annotation by the HA.
Since the maintenance of the rule-based system as the BBM and automatically
retrieving a class label also generates low cost, we assign 0.1 to an annotation
of the WSA. Additionally, each experiment is repeated five times with different
random seeds.

4.2   Experiments on AL with WS
This section shows the experimental results to answer research question 1. In
these experiments, we utilize the HA and the available rule-based system as
information sources with AL and WS.

Question 1. How can we enhance AL with WS so that we can leverage an avail-
able BBM as an information source to reduce the annotation cost with a com-
petitive model performance compared to a traditional AL approach?

Findings. In Figure 2, we show the test accuracy and annotation cost for the
aforementioned annotation scenarios and sampling strategies for each cycle it-
eration. Additionally, we report the final results in Table 1 after the AL cycle
reaches the stopping criterion. The savings metric represents the cost saved rel-
ative to the highest baseline cost with conventional AL. As Figure 2 shows on
10
36                L.Rauch
                  L. Rauch,etD.al.Huseljic, B. Sick

Table 1. Mean results (± standard error) of accuracy, annotation cost and savings of
the AL cycle with different sampling strategies and annotation scenarios.

                   Sampling    Scenario                  Accuracy(↑)             Cost(↓)       Savings(↑)
                               full-human                0.849±0.001               5370             0
                   random      hybrid                    0.842±0.001            1842±221          0.66
                               full-wsa                  0.823±0.004                762           0.86
                               full-human                0.873±0.001               5370             0
                   max-entropy hybrid                    0.872±0.002             3045±46          0.43
                               full-wsa                  0.842±0.002                762           0.86

           0.9
                                                                              full-human                          5000
                                                                              entropy hybrid
           0.8                                                                random hybrid
                                                            Annotation Cost
                                                                              full-WSA
                                                                                                                  4000
                                                                                                    full-human
Accuracy


           0.7
                                                                                                                  3000
                                    entropy full-human
           0.6                      entropy hybrid                                               entropy hybrid
                                                                                                                  2000
                                    entropy full-WSA
                                    random full-human
           0.5                                                                                   random hybrid    1000
                                    random hybrid
                                    random full-WSA                                                               250
           0.4                                                                                       full-WSA
                                                                                                                  0
            250          2000         4000                                       2000         4000
                       Size Labeled Pool                                       Size Labeled Pool

Fig. 2. Test accuracy and annotation cost with increasing size of the labeled data pool
in the AL cycle with different sampling strategies and annotation scenarios.


the right, the annotation costs for the annotation scenarios full-human (highest
baseline annotation cost and traditional AL) and full-wsa (lowest annotation
cost without HA) are constant and independent of the sampling strategy. The
former cost is identical to the size of the labeled pool since the HA provides
labels for each instance. For the latter cost, only the initial labels are provided
by the costly HA while the WSA generates the remaining labels at a low cost.
With our approach in the hybrid scenario, the annotation cost depends on a
mix of HA and WSA annotations. More WSA labels are generally rejected when
using max-entropy sampling compared to random sampling in our hybrid sce-
nario. The savings in Table 1 demonstrate that we can save annotation costs of
43 % with max-entropy sampling and 66 % with random sampling compared to
the baseline cost of 5370 in the full-human scenario. However, we can see that
random sampling degrades test accuracy. We attribute this to the fact that we
actively select instances where the classification model is most uncertain in each
batch. These also seem to be instances where the annotator model is uncertain
and, thus, we more frequently reject the error-prone WSA. Additionally, we can
observe a decreasing slope of the green cost curve with max-entropy sampling in
our hybrid scenario on the left side of Figure 2. This seems intuitive since the
   Enhancing Active Learning with Weak Supervision and Transfer Learning         11
                                                                                 37

high-entropy instances from the unlabeled pool also diminish with cycle itera-
tions. Therefore, a bigger labeled pool as the pre-defined stopping criterion could
lead to only a slight increase in annotation cost and more strongly emphasize
the benefits of our approach. The slope of the purple cost curve further high-
lights this assumption as it is monotonously increasing with random sampling,
where we draw instances without considering the uncertainty of the classification
model.
    When looking at the accuracy in Figure 2, we observe that the model perfor-
mance with max-entropy sampling is consistently superior to random sampling in
each annotation scenario. Table 1 supports this observation and shows a perfor-
mance increase of up to 3 % in accuracy with AL. Accordingly, the classification
model’s accuracy grows more rapidly in each cycle iteration, and it reaches the
highest test accuracy with max-entropy sampling in the hybrid and full-human
annotation scenarios. This demonstrates how AL techniques enable us to ob-
tain a better classification accuracy with the same number of labeled instances
compared to random sampling. The worst classification accuracy is obtained by
random sampling in the full-WSA scenario. Accordingly, the results deteriorate
for both selection strategies when only the error-prone WSA provides the class
labels. Even though we can obtain savings of 86 % in the full-WSA scenario, the
accuracy of the BBM (rule-based system) limits the achievable test accuracy of
the classification model. This emphasizes the importance of ground-truth class
labels from HAs and, thus, strengthens our combined approach in the hybrid sce-
nario. As we expect, the classification model provides the best accuracy in the
full-human scenario with max-entropy sampling as the traditional AL approach.
However, our approach in the hybrid scenario with max-entropy sampling deliv-
ers nearly identical test accuracy while reducing the annotation cost by 43%, as
seen by savings in Table 1. Our results show that while costly HAs are important,
we can also leverage a BBM as an additional information source. These obser-
vations let us conclude that our combination of AL and WS greatly reduces the
annotation cost with only a marginal performance loss compared to traditional
AL.

4.3   Experiments on AL with WS and TL
In this section, we conduct experiments with our complete proposed approach
to tackle the second research question. In addition to WS and AL, we leverage
all of the available unlabeled data to train a language model, which serves as a
sequential TL approach. We focus on the hybrid annotation scenario with and
without pre-training. So, we assess the influence of using all available information
and knowledge sources on the model performance and annotation cost.
Question 2. How far can the inclusion of TL to leverage unlabeled internal and
external data as knowledge sources empower the combination of AL-WS and,
thus, further reduce the annotation cost and improve the model performance?

Findings. Figure 3 shows the test accuracy and annotation cost for the aforemen-
tioned sampling strategies in the hybrid scenario with and without pre-training.
12
38                L.Rauch
                  L. Rauch,etD.al.Huseljic, B. Sick

Table 2. Mean results (± standard error) of accuracy, annotation cost, and savings of
the AL cycle in different annotation scenarios with and without TL

                   Sampling    Scenario               Accuracy(↑)             Cost(↓)      Savings(↑)
                               full-human             0.879±0.002               5370            0
                   random      hybrid                 0.876±0.002             1819±51         0.66
                               full-wsa               0.840±0.003                762          0.86
                               full-human             0.894±0.003               5370            0
                   max-entropy hybrid                 0.893±0.001             2652±40         0.51
                               full-wsa               0.847±0.002                762          0.86


           0.9                                                                                                    4000
                                                                            entropy hybrid pre-trained
                                                                            entropy hybrid
           0.8                                                              random hybrid pre-trained
                                                          Annotation Cost   random hybrid                         3000
                                                                                                 entropy hybrid
Accuracy


           0.7
                                                                                                 entropy hybrid
                                                                                                   pre-trained 2000
           0.6                                                                                   random hybrid
                          entropy hybrid pre-trained
                          entropy hybrid                                                                          1000
           0.5                                                                                   random hybrid
                          random hybrid pre-trained                                                pre-trained
                          random hybrid                                                                           250
           0.4                                                                                                    0
            250          2000         4000                                      2000         4000
                       Size Labeled Pool                                       Size Labeled Pool

Fig. 3. Test accuracy and annotation cost with increasing size of the labeled data pool
in the AL-WS cycle with and without TL.


In Table 2, we summarize the final results in all annotation scenarios with pre-
training. We can see in Figure 3 that utilizing pre-trained weights gives the
classification model a clear head start in performance. After initial training in
the hybrid scenario, the model already reaches an accuracy of 60% for random
(orange curve) and max-entropy (blue curve) sampling. This increase represents
a 20% improvement over the green and red curves without pre-training. The ad-
vantage generally decreases with more training data but remains fundamentally
intact and demonstrates the benefits of adding TL to our WS-AL approach, as
also demonstrated in Table 2. We obtain the best results with the fastest accu-
racy increase in each iteration with pre-training and maximum-entropy sampling
(blue curve). However, with the increasing size of the labeled data set, the ac-
curacy of max-entropy sampling without pre-training adjusts to the same level
of random sampling with pre-training. This means that max-entropy sampling
has the same effect on the final model accuracy as leveraging the knowledge
extractable from 7.7 million transactions and shows the general advantage of
AL as the backbone of our approach. Table 2 further highlights the increase in
accuracy in all annotation scenarios with TL compared to Table 1. Additionally,
    Enhancing Active Learning with Weak Supervision and Transfer Learning       13
                                                                                39

the curves’ trajectories with pre-training are more consistent with much less
performance variance across the experiments’ seeds.
    On the right sight in Figure 3, we can see that the annotation cost with pre-
training for max-entropy sampling is lower than without pre-training. Again,
random sampling leads to lower annotation costs and poorer accuracy and con-
firms the benefits of using AL from the results above. Table 2 also highlights
the improved savings of 51 % with the addition of TL compared to the baseline
cost of 5370 with an 8 % increase relative to AL-WS. Moreover, we assume that
the transferred knowledge improves the classification model’s and the annota-
tor model’s certainty estimations. This means that pre-trained weights enable
us to more efficiently select the most-useful instances and the low-cost annota-
tions of the WSA. The results demonstrate the benefits of enhancing our AL-WS
approach with TL by also leveraging available unlabeled data as a knowledge
source with a pre-trained model. With our AL-WS-TL approach, we can improve
the overall test accuracy of the classification model while further reducing the
annotation cost.


5    Conclusion and Future Work

This work presented a novel approach to extending AL with WS and TL to
reduce the annotation cost by leveraging multiple information and knowledge
sources. We treated an established BBM (e.g., a rule-based system) as a weakly-
supervised annotator that provides error-prone class labels inexpensively. This
assumption made it possible to estimate the performance of this information
source with an annotator model to decide whether a costly human annotation
in an AL cycle is required. In a use case, we have successfully shown that en-
hancing AL with WS reduces annotation cost by 43% and leads to an almost
identical model performance compared to traditional AL. Moreover, we lever-
aged unlabeled internal and external data as knowledge sources by fine-tuning a
pre-trained language model on all available unlabeled data in an unsupervised
manner. We then transferred this knowledge to expand our AL-WS cycle with
TL. This enabled us to reduce the annotation cost by 51 % and improve the
overall model performance compared to the AL-WS approach.
    Since we applied our proposed approach for a shallow NN, we plan to move
towards deep AL and the related problems in an application-oriented setting. To
provide an accurate probabilistic estimation for the selection of instances, we aim
to investigate the uncertainty estimates [15] of our classification and annotator
models and calibrate them with methods such as temperature scaling [9] or
scaling-binning [21]. Since we greedily acquired a batch of instances without
batch-awareness, we intend to use a more complex selection strategy, such as
BALD [6,19]. Moreover, we aim to enhance and further investigate the annotator
model to measure the label quality of other information sources in the annotation
process, such as the HA. Accordingly, we can move towards modern AL settings,
where we also consider the HA as error-prone and can determine a more complex
14
40      L.Rauch
        L. Rauch,etD.al.Huseljic, B. Sick

cost scheme [11]. This could also be done in a multi-task learning setting by
embedding the annotator model directly into the classification model.

Acknowledgments. This work results from the project INFINA, funded by
Wirtschafts- und Infrastrukturbank Hessen under the Operational Program for
the Promotion of Investments in Growth and Employment in Hessen which is
financed by the European Regional Development Fund (ERDF).

References
 1. Bach, S.H., Rodriguez, D., Liu, Y., Luo, C., Shao, H., Xia, C., Sen, S., Ratner,
    A., Hancock, B., Alborzi, H., Kuchhal, R., Ré, C., Malkin, R.: Snorkel DryBell: A
    Case Study in Deploying Weak Supervision at Industrial Scale. In: Proceedings of
    the 2019 International Conference on Management of Data. pp. 362–375 (2019).
    https://doi.org/10.1145/3299869.3314036
 2. Biegel, S., El-Khatib, R., Oliveira, L.O.V.B., Baak, M., Aben, N.: Ac-
    tive weasul: Improving weak supervision with active learning. CoRR (2021).
    https://doi.org/10.48550/arXiv.2104.14847
 3. Boecking, B., Neiswanger, W., Xing, E., Dubrawski, A.: Interactive weak supervi-
    sion: Learning useful heuristics for data labeling. ICLR (2021)
 4. Budd, S., Robinson, E.C., Kainz, B.: A Survey on Active Learning and Human-in-
    the-Loop Deep Learning for Medical Image Analysis. Medical Image Analysis 71,
    102062 (2021). https://doi.org/10.1016/j.media.2021.102062
 5. Dunnmon, J.A., Ratner, A.J., Saab, K., Khandwala, N., Markert, M., Sagreiya,
    H., Goldman, R., Lee-Messer, C., Lungren, M.P., Rubin, D.L., Ré, C.: Cross-
    Modal Data Programming Enables Rapid Medical Machine Learning. Patterns
    1(2), 100019 (2020). https://doi.org/10.1016/j.patter.2020.100019
 6. Gal, Y., Islam, R., Ghahramani, Z.: Deep bayesian active learning with image
    data. In: Proceedings of the 34th International Conference on Machine Learning -
    Volume 70. pp. 1183––1192. ICML (2017)
 7. Gonsior, J., Thiele, M., Lehner, W.: WeakAL: Combining Active Learning and
    Weak Supervision. In: Appice, A., Tsoumakas, G., Manolopoulos, Y., Matwin, S.
    (eds.) Discovery Science. pp. 34–49. Lecture Notes in Computer Science, Springer
    International Publishing, Cham (2020). https://doi.org/10.1007/978-3-030-61527-
    73
 8. Grave, E., Bojanowski, P., Gupta, P., Joulin, A., Mikolov, T.: Learning word vec-
    tors for 157 languages. In: Proceedings of the International Conference on Language
    Resources and Evaluation (LREC 2018) (2018)
 9. Guo, C., Pleiss, G., Sun, Y., Weinberger, K.Q.: On calibration of modern neu-
    ral networks. In: Proceedings of the 34th International Conference on Machine
    Learning. pp. 1321–1330. ICML (2017)
10. Hanika, T., Herde, M., Kuhn, J., Leimeister, J.M., Lukowicz, P., Oeste-Reiß, S.,
    Schmidt, A., Sick, B., Stumme, G., Tomforde, S., Zweig, K.A.: Collaborative Inter-
    active Learning – A clarification of terms and a differentiation from other research
    fields. CoRR (2019). https://doi.org/10.48550/arXiv.1905.07264
11. Herde, M., Huseljic, D., Sick, B., Calma, A.: A survey on cost
    types, interaction schemes, and annotator performance models in se-
    lection algorithms for active learning in classification. CoRR (2021).
    https://doi.org/10.48550/arXiv.2109.11301
   Enhancing Active Learning with Weak Supervision and Transfer Learning             15
                                                                                     41

12. Hino, H.: Active learning: Problem settings and recent developments. CoRR (2020).
    https://doi.org/10.48550/arXiv.2012.04225
13. Holzinger, A., Plass, M., Kickmeier-Rust, M., Holzinger, K., Crişan, G.C., Pintea,
    C.M., Palade, V.: Interactive machine learning: Experimental evidence for the hu-
    man in the algorithmic loop: A case study on Ant Colony Optimization. Applied
    Intelligence 49(7), 2401–2414 (2019). https://doi.org/10.1007/s10489-018-1361-5
14. Huang, S.J., Zhao, J.W., Liu, Z.Y.: Cost-Effective Training of Deep CNNs with
    Active Model Adaptation. In: Proceedings of the 24th ACM SIGKDD International
    Conference on Knowledge Discovery & Data Mining. pp. 1580–1588. ACM (2018).
    https://doi.org/10.1145/3219819.3220026
15. Huseljic, D., Sick, B., Herde, M., Kottke, D.: Separation of aleatoric and epis-
    temic uncertainty in deterministic deep neural networks. In: 2020 25th In-
    ternational Conference on Pattern Recognition (ICPR). pp. 9172–9179 (2021).
    https://doi.org/10.1109/ICPR48806.2021.9412616
16. Joulin, A., Grave, E., Bojanowski, P., Mikolov, T.: Bag of Tricks for Efficient Text
    Classification. CoRR (2016). https://doi.org/10.48550/arXiv.1607.01759
17. Kale, D., Liu, Y.: Accelerating Active Learning with Transfer Learning. In: 2013
    IEEE 13th International Conference on Data Mining. pp. 1085–1090 (2013).
    https://doi.org/10.1109/ICDM.2013.160
18. Kingma, D.P., Ba, J.: Adam: A Method for Stochastic Optimization. CoRR (2017).
    https://doi.org/10.48550/arXiv.1412.6980
19. Kirsch, A., van Amersfoort, J., Gal, Y.: BatchBALD: Efficient and diverse batch
    acquisition for deep bayesian active learning. In: Advances in Neural Information
    Processing Systems (2019)
20. Kottke, D., Schellinger, J., Huseljic, D., Sick, B.: Limitations of As-
    sessing Active Learning Performance at Runtime. CoRR (2019).
    https://doi.org/10.48550/arXiv.1901.10338
21. Kumar, A., Liang, P., Ma, T.: Verified uncertainty calibration. In: Advances in
    Neural Information Processing Systems (NeurIPS) (2019)
22. Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal Loss for Dense Object
    Detection. CoRR (2018). https://doi.org/10.48550/arXiv.1708.02002
23. Liu, H., Gegov, A., Cocea, M.: Rule-based systems: A granular computing perspec-
    tive. Granular Computing 1(4), 259–274 (2016). https://doi.org/10.1007/s41066-
    016-0021-6
24. Nashaat, M., Ghosh, A., Miller, J., Quader, S., Marston, C., Puget, J.F.: Hy-
    bridization of Active Learning and Data Programming for Labeling Large Indus-
    trial Datasets. In: 2018 IEEE International Conference on Big Data (Big Data).
    pp. 46–55 (2018). https://doi.org/10.1109/BigData.2018.8622459
25. Paleyes, A., Urma, R.G., Lawrence, N.D.: Challenges in Deploying Machine Learn-
    ing: A Survey of Case Studies (2021)
26. Pan, S.J., Yang, Q.: A Survey on Transfer Learning. IEEE Trans-
    actions on Knowledge and Data Engineering 22(10), 1345–1359 (2010).
    https://doi.org/10.1109/TKDE.2009.191
27. Peng, Z., Zhang, W., Han, N., Fang, X., Kang, P., Teng, L.: Active Transfer Learn-
    ing. IEEE Transactions on Circuits and Systems for Video Technology 30(4), 1022–
    1036 (2020). https://doi.org/10.1109/TCSVT.2019.2900467
28. Perez, F., Lebret, R., Aberer, K.: Weakly Supervised Active Learning with Cluster
    Annotation. CoRR (2019). https://doi.org/10.48550/arXiv.1812.11780
29. Qiu, X., Sun, T., Xu, Y., Shao, Y., Dai, N., Huang, X.: Pre-trained Models for Nat-
    ural Language Processing: A Survey. Science China Technological Sciences 63(10),
    1872–1897 (2020). https://doi.org/10.1007/s11431-020-1647-3
16
42      L.Rauch
        L. Rauch,etD.al.Huseljic, B. Sick

30. Ratner, A., Bach, S.H., Ehrenberg, H., Fries, J., Wu, S., Ré, C.: Snorkel: Rapid
    training data creation with weak supervision. The VLDB Journal 29(2), 709–730
    (2020). https://doi.org/10.1007/s00778-019-00552-1
31. Ren, P., Xiao, Y., Chang, X., Huang, P.Y., Li, Z., Gupta, B.B., Chen, X.,
    Wang, X.: A survey of deep active learning. ACM Comput. Surv. 54(9) (2021).
    https://doi.org/10.1145/3472291
32. Settles, B.: Active learning literature survey. Computer Sciences Technical Re-
    port 1648, University of Wisconsin–Madison (2010)
33. Siméoni, O., Budnik, M., Avrithis, Y., Gravier, G.: Rethinking deep
    active learning: Using unlabeled data at model training. In: Interna-
    tional Conference on Pattern Recognition (ICPR). pp. 1220–1227 (2021).
    https://doi.org/10.1109/ICPR48806.2021.9412716

</pre>