          Uncertainty and Utility Sampling with
                     Pre-Clustering

          Zhixin Huang, Yujiang He, Stephan Vogt, and Bernhard Sick

         Intelligent Embedded Systems, University of Kassel, Kassel, Germany
          {zhixin.huang,yujiang.he,stephan.vogt,bsick}@uni-kassel.de


        Abstract. Uncertainty sampling is one of the main approaches in deep
        active learning. In the early phase of uncertainty sampling, uninforma-
        tive instances are usually selected due to missing exploration of the data
space. This can result in a poor-quality model, leading to poorer acquisitions and, in turn, an even poorer model. Clustering algorithms
        can analyze large amounts of unlabeled data in an unsupervised way.
        A cluster center can be seen as the representative of its cluster and is
        often highly useful for querying the label from the oracle. Therefore, we
        propose an algorithm that enables the model to explore the data space
        at the initial stage using pre-clustering, and enhances the exploration
        of uncertainty sampling continually based on a combination of uncer-
        tainty and utility metrics. The preliminary experimental results show
that the proposed algorithm supports balanced and imbalanced data scenarios. Besides, our algorithm can achieve a higher classification accuracy than the baseline methods, even with fewer annotations.

        Keywords: Active Learning · Deep Active Learning · Bayesian Neural
        Network · Uncertainty Sampling · Clustering.


1     Introduction
Deep learning (DL) has a strong learning ability to process high-dimensional
data and extract features automatically [24], while DL is often very greedy for
data [11]. Active learning (AL) is concerned with effectively reducing annotation costs while ensuring a predetermined level of accuracy. However, a major challenge in AL is its lack of scalability to high-dimensional data [29]. Therefore, an approach
that combines DL and AL will significantly expand their application potential.
This combined approach, referred to as deep active learning (DAL), mainly con-
tains two parts: the AL query strategy on the unlabeled data set and the DL
model training [24]. In the pool-based AL scenario, the selection strategy chooses
the best sample based on the evaluation and ranking of the entire large data set.
The annotated samples are used to train the model and improve the data ac-
quisition for the next AL iteration. The uncertainty-based approach is one of
the most common pool-based methods in practice because it is simple in
form and has low computational complexity [24]. Many DAL [1, 10, 22, 23] meth-
ods use the uncertainty sampling (US) strategy directly. However, there are still
two challenges that have to be overcome:

    © 2021 for this paper by its authors. Use permitted under CC BY 4.0.

 – Unreliable uncertainty at the initial AL phase Uninformative instances are usually selected based on unreliable uncertainty estimates, since the model has an unclear sense of the data space at this early stage. This can result in a poor-quality model, leading to poorer acquisitions and, in turn, an even poorer model [3].
 – Uncertainty sampling lacks exploration In the DAL context, uncertainty sampling methods [6, 14, 12] utilize batch acquisition and query the top n instances with the highest scores. However, this is likely to select a set of information-rich but similar samples [33]. It leads to insufficient exploration, i.e., the knowledge regarding the data distribution is not fully utilized [24], which results in low DL model training efficiency and high annotation cost.

    To address the first challenge, it is crucial to find the most representative
instances from the large unlabeled data set at the initial AL phase. The general
method [7] is to use random selection (RS) at the beginning of the training
process for exploration. However, this method can fail for imbalanced data sets because the selected instances are less representative, and most of them are located in dense areas [30]. The model can learn the true data space in depth only when sufficiently many labels are available, which increases the annotation cost.
Unsupervised learning algorithms can analyze large amounts of unlabeled data.
For example, the K-Means [26] algorithm is one of the most common clustering
algorithms for knowledge discovery in data mining. The cluster information is
helpful for AL in two aspects: (1) instances located at the centers of clusters are more representative than the others and should be labeled first; (2) samples
in the same cluster are likely to have the same label [21].
    For the second challenge, a feasible solution is to use a hybrid query strat-
egy to enhance the exploration of US. One approach [21, 15] measures the similarity amongst instances by calculating the distances between their feature vectors. Similar to US, these algorithms
are often only good at exploitation, i.e., the learners tend to only focus on in-
stances near the current decision boundary [24]. But in the opposite direction, we
can also utilize the similarity to exclude similar samples. After sorting a batch of instances by their uncertainty through US, we can filter out similar instances to improve the exploration of the selection strategy.
    To overcome the challenges mentioned above, the two core ideas of our pro-
posed algorithm are: (1) At the initial phase, we label the instances closest to
cluster centers to train the model for estimating reliable uncertainty. (2) The se-
lection of the most informative instance depends on two strategies, uncertainty
and utility. The uncertainty metric evaluates the epistemic uncertainty of a Bayesian Neural Network (BNN) [6, 7] for an instance. The utility metric filters out instances
which are similar to the already labeled instances. Since US lacks exploration
in the data space, the utility metric helps the model discover some valuable
instances far away from the current decision boundary. Therefore, we propose
our algorithm Uncertainty and Utility sampling with Pre-Clustering (UUPC).
Compared to the baselines, our algorithm can achieve a higher classification
accuracy under fewer annotations.

   The remainder of this article starts with a summary of the related work. The
details of the algorithm and the experiments are introduced in Sections 3 and 4, respectively. This article closes with a conclusion and an outlook on our future
work in this field.


2   Related Work

The uncertainty-based query strategies (e.g., Margin Sampling and Entropy) are widely used in the DAL scenario [6, 14, 12] because they are convenient to combine
with the output of the DL model. Traditional DL requires a large amount of
labeled data to obtain reliable uncertainty estimation. In the DAL scenario with
large unlabeled data, epistemic uncertainty is particularly valuable because it
allows the model to assess its lack of knowledge. For this reason, methods that combine deep Bayesian neural networks with US have been proposed [7, 12, 22].
However, as analyzed in Section 1, US could select uninformative instances at
the initial AL phase and lack exploration. Therefore, some hybrid query strategies have been developed [32, 34] that take into account the uncertainty and diversity of
samples. Exploration-P [32] utilizes a deep neural network to obtain the uncer-
tainty and the similarity between samples. Besides, this method uses the RS strategy for exploration purposes in the early AL phase. The combination of AL
and K-means clustering has been researched in previous works [13, 21] to find the
most representative instances. DBAL [34] presents a hybrid query approach that
utilizes the K-means clustering algorithm to explore the diversity of instances in
each mini-batch. Contrary to [34], which performs clustering in each AL itera-
tion, our approach annotates labels based on cluster centers only at the initial
AL phase to pretrain the BNN model. Thus, it can avoid labeling samples re-
peatedly in the same cluster. Similar to selecting the most representative instances
by clustering, the core set approach is also a representative query strategy. The
basic idea is constructing a core set to represent the distribution of the feature
space of the entire original data set, thereby reducing the labeling cost of AL [27,
31]. However, the core-set approach requires building a large distance matrix on the unlabeled data set, and the search process is computationally expensive, especially on large data sets [2].


3   Problem Formulation and Algorithms

In the general classification setting, a sample is described by x ∈ X and its label, drawn from C classes, by y ∈ Y = {1, · · · , C}. The clustering informa-
tion can be described explicitly by introducing the cluster label k ∈ {1, · · · , K},
where K is the number of clusters in the data. In the pool-based AL, we define
U = {x1 , · · · , xN } as an unlabeled set with N samples. Labels are not available
at the beginning but can be annotated by the oracle. The query strategy selects
an instance x ∈ U and asks the oracle for the corresponding label y ∈ Y. The
newly labeled instance is removed from the unlabeled set U ← U\x. We add the
instance with its label to the labeled set L ← L ∪ (x, y), and train supervised
learning models such as SVM and DNN on L.
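
To make the pool-based setting concrete, the following minimal Python sketch illustrates the AL loop described above; the names model, oracle, and query_strategy are hypothetical placeholders for the components introduced in Sections 3.1 and 3.2, not part of our implementation:

# Minimal pool-based AL loop (a sketch; model, oracle, and
# query_strategy are hypothetical placeholders; X is a NumPy array).
def pool_based_al(X, model, oracle, query_strategy, num_iterations):
    unlabeled = set(range(len(X)))       # indices of the pool U
    labeled = {}                         # index -> annotated label y
    for _ in range(num_iterations):
        idx = query_strategy(model, X, unlabeled)  # select x from U
        labeled[idx] = oracle(idx)                 # query y from the oracle
        unlabeled.discard(idx)                     # U <- U \ x
        keys = sorted(labeled)                     # train on L
        model.fit(X[keys], [labeled[i] for i in keys])
    return labeled, model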

3.1   Pre-Clustering at the Initial AL Phase
Selecting the most representative instances from the unlabeled data by labeling
cluster centers is heavily dependent on the quality of clustering results. In K-
Means, the crucial parameter that affects the goodness of clustering results is the
number of clusters, which should be optimized. The evaluation without any labels
must be performed using the model itself. The elbow method [16] is the most
popular heuristic approach, which calculates the sum of squared distances (SD)
from each point to its assigned center. Unsupervised evaluation scores such as the Silhouette Coefficient (SC) [25], the Calinski-Harabasz Index (CHI) [4], and the Davies-Bouldin Index (DBI) [9] can also be applied within the elbow method. We will
calculate multiple cluster scores to determine the optimal number of clusters Ko .
To optimize SC and CHI, we have to maximize the scores, while lower SD and DBI indicate better-defined clusters, so they must be minimized.
We take the reciprocal of SC and CHI to unify the optimization direction. The
weighted score of pre-clustering (PC) is calculated as follows:

    Score_PC(K, U) = α1 SD(K, U) + α2 DBI(K, U) + α3 · 1/SC(K, U) + α4 · 1/CHI(K, U) + λK    (1)
The weights α1, · · · , α4 sum to 1. They can be selected by expert knowledge or, in the absence of detailed expert knowledge, set to the same value, as in the experiment in Section 4. In our definition, K must be equal to or greater than C. For example, MNIST [19] requires at least 10 clusters, one per class. Kmax indicates the maximum budget of annotations at the initial AL phase, and we expect that C < Kmax ≪ N.
Since the above four cluster evaluation scores have different scales, in practice,
we calculate a set for each type of score (SD, DBI and reciprocal of SC and CHI)
from C to Kmax and normalize each set to the 0-1 range. Then we add the four scores to obtain a set of Score_PC(K, U) values, where K ∈ {C, . . . , Kmax}. The larger K is, the smaller Score_PC becomes, which corresponds to a more refined clustering. However, the labeling cost must be considered because the oracle has to annotate the instance closest to the center of each cluster. Therefore, we append λK to Score_PC(K, U) as a regularization term, where λ is the regularization weight and is proportional to the cost of an annotation. Setting a proper value of λ depends on the application scenario and requires expert experience. The Bayesian information
criterion (BIC) and the Akaike information criterion (AIC) could determine the
appropriate number of clusters without tuning a regularization term [28, 8], but they can only be applied if we extend the clustering algorithm beyond K-Means to a Gaussian Mixture Model (GMM). Since this paper utilizes pre-clustering by K-Means, BIC and AIC will be researched in future work. The optimal number of
clusters Ko can be described as follows:

    Ko = argmin_{K ∈ {C, · · · , Kmax}} Score_PC(K, U)    (2)
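
As an illustration, the following Python sketch computes the normalized, weighted score of Eq. 1 over K ∈ {C, . . . , Kmax} and selects Ko according to Eq. 2 using scikit-learn; the function name optimal_k and the default values for the weights and λ are our own assumptions, not prescriptions from this paper:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (silhouette_score, calinski_harabasz_score,
                             davies_bouldin_score)

def optimal_k(X, C, K_max, alphas=(0.25, 0.25, 0.25, 0.25), lam=0.01):
    ks = list(range(C, K_max + 1))
    raw = []  # one row per K: [SD, DBI, 1/SC, 1/CHI]
    for k in ks:
        km = KMeans(n_clusters=k, n_init=10).fit(X)
        raw.append([km.inertia_,                                   # SD
                    davies_bouldin_score(X, km.labels_),           # DBI
                    1.0 / silhouette_score(X, km.labels_),         # 1/SC
                    1.0 / calinski_harabasz_score(X, km.labels_)]) # 1/CHI
    raw = np.asarray(raw)
    # normalize each score type to the 0-1 range, then apply Eq. 1
    norm = (raw - raw.min(axis=0)) / (np.ptp(raw, axis=0) + 1e-12)
    score_pc = norm @ np.asarray(alphas) + lam * np.asarray(ks)
    return ks[int(np.argmin(score_pc))]                            # Eq. 2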


    Assume that the information about the class label y is encoded in the cluster
k. The set of elements in cluster k is ck . Once the data probability distribution
of clusters p(x ∈ ck ) and the class yk of each cluster center xk are known, we can
infer the probability distribution of class p(y|x ∈ ck ) with respect to all samples
in ck [21]. However, using the cluster center to annotate all instances’ labels is
not reliable because the samples located at the intersection of clusters are easily
misclassified. In contrast with refining smaller clusters [21], our method only uses
the pre-clustering to pretrain the model. In detail, we only label the instances
closest to each cluster center via the oracle, p(yk |xk , k), and put them into the labeled data set L = {(xj , yj ) | j ∈ {1, · · · , Ko }}, where Ko is the optimal number of clusters. At the initial phase of our approach, the BNN learns from this initial labeled data set to obtain posterior parameters for reliable uncertainty estimation. Then the oracle labels the most informative instances based on
the combination of the following two selection functions: uncertainty and utility.


3.2   Uncertainty-Utility Selection Strategy in the AL Phase

The BNN can be defined as f(x, θ), with a prior p(θ) over the parameters θ ∈ Θ in the parameter space Θ. The likelihood p(y|x, θ) is determined by softmax(f(x, θ)). The goal
is to obtain the posterior distribution over θ from labeled training set L:

    p(θ|L) = p(L|θ) p(θ) / p(L)    (3)

The parameters θ1 , . . . , θT are sampled T times to obtain a Monte Carlo estimate of the predictive probability distribution over the label y for a new unlabeled instance x∗ ∈ U:
    p̂(y|x∗ , L) = (1/T) Σ_{t=1}^{T} p(y|x∗ , L, θt )    (4)

Equation 4 describes the general uncertainty estimation of BNN, and it includes
both the epistemic and aleatoric uncertainty of the prediction y. In our case, we
calculate the entropy over the predicted class probabilities of a new instance to
estimate the uncertainty score, as given in the numerator of Eq. 5. In each AL iteration, the scores of the instances in U are normalized into the 0-1 range, where 1 is the most uncertain score, indicating that annotation is likely to be very useful.
The function Uncr (x∗ ) can evaluate the uncertainty score for each instance in
U:
    Uncr(x∗ ) = ( − Σ_{c=1}^{C} p̂(y = c|x∗ , L) log2 p̂(y = c|x∗ , L) ) / log2 (C)    (5)
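
A minimal sketch of Eqs. 4 and 5, assuming the T stochastic forward passes of the BNN (i.e., the class probabilities under the sampled parameters θ1 , . . . , θT ) have already been collected into an array:

import numpy as np

def uncertainty_scores(prob_samples):
    # prob_samples: shape (T, N, C), i.e. T sampled forward passes
    # over N unlabeled instances with C classes
    p_hat = prob_samples.mean(axis=0)                  # Eq. 4: MC average
    num_classes = p_hat.shape[1]
    entropy = -(p_hat * np.log2(p_hat + 1e-12)).sum(axis=1)
    return entropy / np.log2(num_classes)              # Eq. 5: 0-1 range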

    As mentioned in Section 2, the uncertainty metric needs to be enhanced with exploration of the data space. Although we use pre-clustering in the initial stage to help the BNN obtain reliable uncertainty estimates quickly, some valuable instances remain far from the current decision boundary. Therefore, we define a utility metric to enhance the exploration of US continually. We define the Euclidean distance between two instances x1 and x2 as Dis(x1 , x2 ). The similarity between an instance x∗ and a class c is defined as the median of the distances from x∗ to all instances of class c in L. The formulation can be written as:

    Sim(x∗ , c) = median({Dis(x∗ , x) | (x, y) ∈ L and y = c})    (6)

The standard deviation of the similarities between the instance and each class indicates which class the instance tends to belong to. A higher standard deviation indicates that the instance is likely to be classified into one single class. When the standard deviation is lower, the instance is located at the intersection of multiple classes, and annotation by the oracle could be more beneficial. For a paired comparison with the uncertainty, we turn this score into a maximization problem. Since the uncertainty score lies in the 0-1 range, in practice we calculate the utility score of each instance in a batch and normalize the entire batch of utility scores to the same scale. Eq. 7 shows the calculation of the utility of a single instance x∗:
    Utility(x∗ ) = 1 / std({Sim(x∗ , c1 ), · · · , Sim(x∗ , cC )})    (7)

    The uncertainty-utility (UU) score is defined as follows:

    Score_UU(x∗ ) = γ1 Uncr(x∗ ) + γ2 Utility(x∗ )    (8)

where γ1 and γ2 lie in the 0-1 range and control the weights of the two selection metrics separately. The weights can be selected by expert knowledge; in the absence of detailed expert knowledge, γ1 and γ2 are each set equal to 1. The higher the score, the more worthy the instance is of being annotated.
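
A sketch combining Eqs. 6-8 for one candidate batch is given below; the helper name uu_scores, the NumPy-based interface, and the small constants added for numerical stability are our own assumptions:

import numpy as np

def uu_scores(X_batch, X_labeled, y_labeled, uncr, gamma1=1.0, gamma2=1.0):
    # uncr: normalized uncertainty scores of the batch (Eq. 5)
    classes = np.unique(y_labeled)
    sims = np.empty((len(X_batch), len(classes)))
    for j, c in enumerate(classes):
        Xc = X_labeled[y_labeled == c]
        dists = np.linalg.norm(X_batch[:, None, :] - Xc[None, :, :], axis=2)
        sims[:, j] = np.median(dists, axis=1)          # Eq. 6
    utility = 1.0 / (sims.std(axis=1) + 1e-12)         # Eq. 7
    # normalize the batch of utility scores to the 0-1 scale
    utility = (utility - utility.min()) / (np.ptp(utility) + 1e-12)
    return gamma1 * uncr + gamma2 * utility            # Eq. 8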


3.3     Batch-based UUPC Algorithm

With batch training, our method enables more efficient training on large data sets: (1) Clustering algorithms such as K-Means pass through the entire data set to obtain the centers; this training process is time-consuming, with a runtime proportional to the amount of data. Mini-batch-based K-Means [26] clusters large data sets batch by batch to reduce the computational cost. (2) Traditional uncertainty sampling requires an uncertainty estimate for every instance in U in each iteration. In the DAL scenario, we use batch-based sample querying to improve training efficiency [24].
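
As a sketch of the batch-based pre-clustering step, scikit-learn's MiniBatchKMeans can be updated incrementally; the convergence check on unchanged centers mirrors lines 6-7 of Algorithm 1, while the function name and the random batch sampling are our own assumptions:

import numpy as np
from sklearn.cluster import MiniBatchKMeans

def pre_cluster(X_unlabeled, K_o, B_pc, b, seed=0):
    rng = np.random.default_rng(seed)
    mbk = MiniBatchKMeans(n_clusters=K_o, batch_size=b, n_init=3)
    prev = None
    for _ in range(B_pc):                    # at most B_pc AL iterations
        idx = rng.choice(len(X_unlabeled), size=b, replace=False)
        mbk.partial_fit(X_unlabeled[idx])    # incremental center update
        if prev is not None and np.allclose(prev, mbk.cluster_centers_):
            break                            # centers no longer change
        prev = mbk.cluster_centers_.copy()
    return mbk.cluster_centers_              # query their nearest instances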
    At each acquisition step, we score a batch of candidate unlabeled samples B ⊆ U, where B = {x1 , x2 , · · · , xb } and b refers to the batch size.

Algorithm 1 UUPC Algorithm for Batch Training
Input: Unlabeled data set U ← X , initial labeled set L ← Ø; one batch B ⊆ U with b samples is selected randomly, and the batch sampling process is denoted BatchSampling(U, b); the maximum numbers of AL iterations are Bpc for the pre-clustering phase and Buu for the UU sampling phase; Nuu instances are annotated per batch.
Output: Optimized number of clusters Ko, labeled data set L, BNN model f(x, θ)
 1: Ko ← argmin_{K ∈ {C, · · · , Kmax}} Score_PC(K, U)
 2: iter ← 0
 3: while iter < Bpc do
 4:     B_iter ← BatchSampling(U, b)
 5:     {x1^iter, · · · , xk^iter} ← K-Means(B_iter, Ko)
 6:     if {x1^iter, · · · , xk^iter} == {x1^(iter-1), · · · , xk^(iter-1)} then
 7:         Break
 8:     iter ← iter + 1
 9: L ← {(x1, y1), · · · , (xk, yk)} ← Labeling({x1, · · · , xk})
10: {θ1, · · · , θT} ← Training(f(x, θ), y), where (x, y) ∈ L
11: iter ← 0
12: while iter < Buu do
13:     B_iter ← BatchSampling(U, b)
14:     S ← Ø
15:     for i ← 1 to Nuu do
16:         x∗i ← argmax Score_UU(x), where x ∈ B_iter \ S
17:         S ← S ∪ {x∗i}
18:     L ← L ∪ Labeling(S)
19:     U ← U \ S
20:     {θ1, · · · , θT} ← Training(f(x, θ), y), where (x, y) ∈ L
21:     if U == Ø then
22:         Break
23:     iter ← iter + 1



Based on Score_UU, we select the top n candidate instances with the highest scores, S = {x∗1 , · · · , x∗n }, where n ≤ b. This problem can be formulated as follows:

    x∗i = argmax_{x ∈ B \ {x∗j | j < i}} Score_UU(x)    (9)