K-best Viterbi Semi-supervized Active Learning in Sequence Labelling

Tomáš Šabata^1, Tomáš Borovička^1, and Martin Holeňa^2

^1 Faculty of Information Technology, Czech Technical University in Prague, Prague, The Czech Republic
^2 Institute of Computer Science, Czech Academy of Sciences, Prague, The Czech Republic

Abstract: In application domains where there exists a large amount of unlabelled data but obtaining labels is expensive, active learning is a useful way to select which data should be labelled. In addition to its traditional successful use in classification and regression tasks, active learning has also been applied to sequence labelling. According to the standard active learning approach, the sequences whose labelling would be the most informative should be labelled. However, labelling an entire sequence may be inefficient, since for some of its parts the labels can be predicted by a model, and labelling such parts brings only little new information. Therefore, in this paper we investigate a sequence labelling approach in which, within the sequence selected for labelling, the labels of most tokens are predicted by a model and only the tokens that the model cannot predict with sufficient confidence are labelled manually. Those tokens are identified using the k-best Viterbi algorithm.

1 Introduction

Hidden Markov models (HMMs) and conditional random fields (CRFs) are very popular models in sequence labelling tasks such as handwriting recognition, speech recognition, DNA analysis, video analysis, information extraction or natural language processing (NLP). They achieve good results if a high-quality and fully annotated dataset is available. Unfortunately, in these tasks, obtaining labels for data may be expensive. The annotation cost is a motivation for using active learning. Active learning usually begins with a small labelled set L; in each iteration, the most informative instance of an unlabelled set U is chosen, annotated by an oracle and added to the set L. The model is retrained using the extended set L and the whole process repeats until a stopping criterion is met. This approach is valuable in tasks where unlabelled data are easily available but obtaining their labels is expensive. In this case, it aims at achieving higher accuracy with minimal cost.

Nevertheless, labelling long sequences can be troublesome, in particular for a human annotator, who is prone to create labels of lower quality. To address this problem, we can combine active learning with semi-supervised learning. Semi-supervised active learning in sequence labelling means that a model labels those parts of a sequence that are easy to predict and lets the annotator focus only on the parts of the sequence that are the most uncertain.

In this paper, we propose a semi-supervised active learning approach that uses the k-best Viterbi algorithm to detect candidates for manual labelling. The proposed approach was experimentally evaluated on an NLP task, part-of-speech tagging.

In the second section, we provide an overview of related work in active and semi-supervised learning. The third section recalls some basics of hidden Markov models that are necessary for understanding the proposed approach, which is introduced in the fourth section. An experiment description, its results and their analysis are given in the fifth section. The paper is concluded by a discussion of the results and possible future work.
2 Related work

While active learning has been widely studied for classification and regression tasks [1], less attention has been given to the task of sequence labelling. Despite this, most of the algorithms developed for classification can also be adapted to sequence labelling [2].

Active learning can be applied in three different scenarios: pool-based sampling, stream-based selective sampling and membership query synthesis. The most commonly used scenario is pool-based sampling, originally proposed in [3]. It has been studied for many real-world problem domains, sequence labelling included, for example speech recognition [4], information retrieval [5] or named entity recognition [6]. The main idea of pool-based active learning is to use a query strategy framework to find the most informative sample (sequence) in the unlabelled set (pool) of samples. The selected sample is annotated and added to the labelled set, the model is retrained, and the whole process repeats. The second scenario, stream-based selective sampling, can also be used in sequence labelling [7], but it is used less commonly. The difference from pool-based sampling is that the samples arrive in a stream and the framework decides whether to annotate a sample or to discard it; discarded samples are never used in training later. The main idea of the third scenario, membership query synthesis, is that the learner can query any unlabelled instance, usually generated de novo.

Active learning can use one of six different query strategy frameworks [1]. The most commonly used frameworks are Uncertainty Sampling [8] and Query-by-Committee [9]. Uncertainty Sampling selects the sample in which the model is least confident. Query-by-Committee maintains a committee of predictors, and the sample on which the predictors disagree most regarding their predictions is considered to be the most informative. Other query strategies applicable to sequences are Expected Gradient Length, Information Density, Fisher Information and Future Error Reduction [2]. The Future Error Reduction framework is not commonly used due to its high computational complexity.

Semi-supervised learning methods were developed with the same motivation of a partly unlabelled dataset. Self-training is a commonly used technique in which the predictor is first trained on a small labelled dataset and then used to annotate data. The most confident labels are added to the training set, and the predictor is retrained. Self-training has found application in several tasks of natural language processing [10, 11, 12]. Another technique, co-training, is a multi-learner algorithm where the learners have independent, complementary features of the dataset and produce labelled examples separately [13]. Semi-supervised learning has also been applied to sequence modelling tasks [14, 15].

In tasks where a large amount of labelled data is required (for example, NLP tasks), semi-supervised learning does not perform well due to the propagation of many tagging errors into the learning dataset. The problem of data pollution was partially solved in [16], where a human was put into the training loop to correct labelled examples. However, the correction of labelled data can be time-consuming and is similar to labelling the data from scratch. To address this problem, a semi-supervised active learning approach that does not need any human inspection was proposed in [17]. The approach uses active learning to find the most informative sequences. The model labels the most informative sequences and uses the marginal probability of each sequence token to decide whether the prediction is confident. The method has two parameters, a delay before the semi-supervised approach is run and a confidence threshold. A proper setting of these parameters is necessary to achieve the desired results.

Inspired by the semi-supervised method in [17], we propose a method that does not need the confidence threshold parameter because it uses the k best Viterbi paths.

3 Preliminaries

In this paper, we focus on the task of part-of-speech tagging. For simplicity, our approach is presented by means of an HMM, but it can be extended to CRFs as well. In this section, the principles of an HMM are recalled.
3.1 Hidden Markov Models

With each HMM, a random process indexed by time is associated, which is assumed to be in exactly one of a set of N distinct states at any time. At regularly spaced discrete times, the system changes its state according to probabilities of transitions between states. The time steps associated with the state changes are denoted t = 1, 2, 3, ..., and the actual state at a time step t is denoted q_t.

The process itself is assumed to be a first-order Markov chain, which is described by a matrix of transition probabilities A = {a_{i,j}}, defined as

    a_{i,j} = P(q_t = y_j | q_{t-1} = y_i),  1 ≤ i, j ≤ N.  (1)

A simple observable Markov chain is too restrictive to describe reality. However, it can be extended. Denoting Y the variable recording the states of the Markov chain, an HMM is obtained by completing Y with a random variable X. In the context of that HMM, X is called the 'observable variable' or 'output variable', whereas Y is called the 'hidden variable'. The hidden variable Y takes values in the set {y_1, y_2, ..., y_N} and the observable variable X takes values in a set {x_1, x_2, ..., x_M}.

We assume to have an observation sequence O = o_1 o_2 ... o_T and a state sequence Q = q_1 q_2 ... q_T which corresponds to the observation sequence. An HMM can be characterised by three probability distributions:

1. A state transition probability distribution A = {a_{i,j}}.

2. A probability distribution of observable variables, B = {b_i(x_k)}, where b_i(x_k) is the probability of o_t assuming the value x_k if q_t is in the state y_i; it is defined as

    b_i(k) = P(o_t = x_k | q_t = y_i).  (2)

3. An initial state distribution π = {π_i}, defined by

    π_i = P(q_1 = y_i).

With these three elements, an HMM is fully defined and denoted θ = (A, B, π).

The parameters of an HMM can be learned either in a semi-supervised way with the Baum-Welch algorithm [18] or in a fully-supervised way with maximum-likelihood estimation (MLE). In the fully-supervised way, the values of both observable and hidden variables are known.

In the MLE, we assume a training set D = {(o^1, q^1), ..., (o^n, q^n)} of size n whose elements are independent. The MLE consists in taking those parameters θ* that maximize the probability of the training set:

    θ* = argmax_θ P(D|θ).  (3)

Due to (1) and (2), the probability in (3) turns into

    P(D|θ) = ∏_{i,j} a_{i,j}^{T_{i,j}} ∏_{i,k} [b_i(k)]^{E_i(k)},  with  ∑_j a_{i,j} = 1,  ∑_k b_i(k) = 1,

where T_{i,j} stands for the number of transitions from state y_i to state y_j in the training set and E_i(k) stands for the number of emissions of the value x_k in state y_i. The parameters A and B can then be obtained by the following formulas:

    a_{i,j} = (T_{i,j} + r_{i,j}) / ∑_{j'} (T_{i,j'} + r_{i,j'})   and   b_i(k) = (E_i(k) + r_i(k)) / ∑_{k'} (E_i(k') + r_i(k')),  (4)

where r_{i,j} and r_i(k) are our prior beliefs. The prior beliefs are used in the case of an insufficiently large dataset, where the estimate would otherwise lead to zero probabilities of events which never occurred in D.
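To make the counting behind the formulas in (4) concrete, the following minimal Python sketch estimates A, B and π from fully labelled sequences with additive pseudo-counts. It only illustrates the estimator described above; the function and variable names (estimate_hmm, prior) and the use of a single uniform pseudo-count value are our assumptions, not taken from the authors' implementation.

    # Sketch of the fully-supervised MLE of Section 3.1 with additive pseudo-counts.
    import numpy as np

    def estimate_hmm(sequences, n_states, n_symbols, prior=1.0):
        """sequences: list of (observations, states) pairs of equally long integer lists."""
        A = np.full((n_states, n_states), prior)    # transition pseudo-counts r_{i,j}
        B = np.full((n_states, n_symbols), prior)   # emission pseudo-counts r_i(k)
        pi = np.full(n_states, prior)               # initial-state pseudo-counts
        for obs, states in sequences:
            pi[states[0]] += 1
            for t in range(1, len(states)):
                A[states[t - 1], states[t]] += 1    # T_{i,j}: transitions y_i -> y_j
            for o, s in zip(obs, states):
                B[s, o] += 1                        # E_i(k): emissions of x_k in state y_i
        # Normalise rows so that sum_j a_{i,j} = 1 and sum_k b_i(k) = 1, as in (4).
        A /= A.sum(axis=1, keepdims=True)
        B /= B.sum(axis=1, keepdims=True)
        pi /= pi.sum()
        return A, B, pi

A uniform value of the pseudo-counts corresponds to the uniformly distributed prior beliefs used in the experimental setup in Section 5.1.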
To simplify the notation, we define the variables α and β as follows:

    α_t(i) = P(o_1, ..., o_t, q_t = y_i | θ),  (5)
    β_t(i) = P(o_{t+1}, ..., o_T | q_t = y_i, θ).  (6)

These variables are computed using the following forward-backward algorithm [18]:

    α_1(i) = π_i b_i(o_1),
    α_{t+1}(i) = ( ∑_{j=1}^{N} α_t(j) a_{j,i} ) b_i(o_{t+1}),

and, respectively,

    β_T(i) = 1,
    β_t(i) = ∑_{j=1}^{N} a_{i,j} b_j(o_{t+1}) β_{t+1}(j).

3.2 Marginal probability

Once the model is learned, it can be used for the prediction of a sequence of hidden states given an observable sequence. In the task of finding the most likely state sequence, it is possible to find the sequence that maximises the expected number of correctly assigned states. From (5) and (6) it follows that the marginal probability of being in a specific state y_i at a particular time t is

    γ_t(i) = α_t(i) β_t(i) / ∑_{j=1}^{N} α_t(j) β_t(j).  (7)

Maximising the expected number of correctly assigned states can then be achieved by applying q_t = arg max_{y_i ∈ Y} γ_t(i) at every position of the sequence. However, this approach can find a sequence with a very low or even zero probability in case the resulting sequence is not feasible.
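As an illustration of the recursions above, the computation of α, β and the marginals (7) can be sketched as follows. It is a direct, unscaled transcription of the formulas (the names are ours); a practical implementation would rescale α and β or work in log space to avoid numerical underflow on long sequences.

    # Sketch of the forward-backward recursions (5)-(6) and of the marginals (7).
    import numpy as np

    def forward_backward(obs, A, B, pi):
        T, N = len(obs), A.shape[0]
        alpha = np.zeros((T, N))
        beta = np.zeros((T, N))
        alpha[0] = pi * B[:, obs[0]]                        # alpha_1(i) = pi_i b_i(o_1)
        for t in range(1, T):
            alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]    # recursion for alpha_{t+1}(i)
        beta[T - 1] = 1.0                                   # beta_T(i) = 1
        for t in range(T - 2, -1, -1):
            beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])  # recursion for beta_t(i)
        gamma = alpha * beta
        gamma /= gamma.sum(axis=1, keepdims=True)           # marginals gamma_t(i) of (7)
        return alpha, beta, gamma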
3.3 Viterbi algorithm

The Viterbi algorithm is a dynamic programming algorithm that finds the most likely state sequence as a whole by maximising P(Q, O|θ). It gradually computes the maximal probability of the state chain from its beginning up to time t with the state q_t being y_i, represented by the variable

    δ_t(i) = max_{q_1,...,q_{t-1}} P(q_1, ..., q_{t-1}, q_t = y_i, o_1, ..., o_t).

The algorithm is initialized as

    δ_1(i) = π_i b_i(o_1),  (8)

and for each 2 ≤ t ≤ T and each y_i from Y, the algorithm calculates the variable δ_t(i):

    δ_t(i) = ( max_{1≤j≤N} δ_{t-1}(j) a_{j,i} ) b_i(o_t).  (9)

At each time t and for each state i, the algorithm stores a link to the predecessor state with which it forms the best path. These links are stored in an additional two-dimensional array ψ_t(i), where

    ψ_1(i) = 0,
    ψ_t(i) = arg max_{1≤j≤N} δ_{t-1}(j) a_{j,i}.

The probability of the most probable sequence is max_{1≤i≤N} δ_T(i), and the most probable state path Q* = (q*_1, q*_2, ..., q*_T) is found by backtracking:

    q*_T = arg max_{1≤i≤N} δ_T(i),
    q*_t = ψ_{t+1}(q*_{t+1}).

The Viterbi algorithm has a similar structure as the forward-backward algorithm, and both have complexity O(N²T).
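The recursion (8)-(9) together with the backpointer array ψ translates directly into code. The sketch below, written with the same (our own) conventions as the previous sketches, returns the most probable state path and the joint probability of this path with the observation sequence.

    # Sketch of the Viterbi recursion (8)-(9) with the backpointer array psi.
    import numpy as np

    def viterbi(obs, A, B, pi):
        T, N = len(obs), A.shape[0]
        delta = np.zeros((T, N))
        psi = np.zeros((T, N), dtype=int)
        delta[0] = pi * B[:, obs[0]]                      # delta_1(i) = pi_i b_i(o_1)
        for t in range(1, T):
            scores = delta[t - 1][:, None] * A            # delta_{t-1}(j) a_{j,i}
            psi[t] = scores.argmax(axis=0)                # best predecessor of each state
            delta[t] = scores.max(axis=0) * B[:, obs[t]]  # recursion (9)
        # Backtracking: q*_T = argmax_i delta_T(i), q*_t = psi_{t+1}(q*_{t+1}).
        path = [int(delta[T - 1].argmax())]
        for t in range(T - 1, 0, -1):
            path.append(int(psi[t, path[-1]]))
        return list(reversed(path)), float(delta[T - 1].max())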
4 Proposed approach

Our proposed approach is an adaptation of the semi-supervised active learning method (SeSAL) originally proposed in [17]. Both SeSAL and our adaptation are based on a standard fully-supervised active learning algorithm (FuSAL). The concept of the FuSAL algorithm is described by the pseudocode in Algorithm 1.

Algorithm 1 FuSAL algorithm
Given:
  L: set of labelled examples
  U: set of unlabelled examples
  φ_M: utility function
Algorithm:
  1: while stopping criterion is not met do
  2:   learn model M from L
  3:   for all x_i ∈ U: u_{x_i} ← φ_M(x_i)
  4:   select x* = arg max_{x_i} u_{x_i}
  5:   query an oracle for labels y of x*
  6:   remove x* from U
  7:   insert <x*, y> into L
  8: end while

The utility function φ_M(x) represents the informativeness of the sample x given the model M. In the algorithm, any utility function can be used to find the most informative sequence [2].

In SeSAL, the most informative instance is annotated by the model M and only the tokens whose predicted labels have a confidence smaller than a given threshold are given to a human annotator (oracle). Finding the optimal threshold value is an optimisation task minimising the dataset pollution and the number of queried labels. If the threshold is too high, a human annotates labels in which the model is well confident. On the other hand, if the threshold is too low, the algorithm accepts incorrectly labelled tokens, which may result in a polluted training set.

SeSAL also uses a parameter called delay that represents the number of iterations of FuSAL performed before the algorithm switches to SeSAL. This helps to avoid producing errors coming from the incorrect labels of an insufficiently converged model.

In our approach, the confidence of labels is replaced by calculating the k best Viterbi paths to find the tokens on which the predictions of the model differ among the k most likely sequences. The number of paths affects the behaviour of the algorithm; however, we assume this parameter to be less data dependent than the confidence threshold. We call the approach k-best Viterbi SeSAL. Its pseudocode is given in Algorithm 2.

Algorithm 2 k-best Viterbi SeSAL algorithm
Given:
  L: set of labelled examples
  U: set of unlabelled examples
  φ_M: utility function
  k: number of paths
Algorithm:
  1: while stopping criterion is not met do
  2:   learn model M from L
  3:   for all x_i ∈ U: u_{x_i} ← φ_M(x_i)
  4:   select x* = arg max_{x_i} u_{x_i}
  5:   find the k best Viterbi paths {v_1, ..., v_k}
  6:   for t in length(x*) do
  7:     if v_i(t) for all i = 1, ..., k are equal then
  8:       label x*(t) with y(t) = v_1(t)
  9:     else
 10:       query an oracle for a label y(t) of x*(t)
 11:     end if
 12:   end for
 13:   remove x* from U
 14:   insert <x*, y> into L
 15: end while

The proposed approach uses the FuSAL active learning framework to find the most informative instance (lines 2-4). Then, semi-supervised learning is applied in order to label the instance. The algorithm computes the k best Viterbi sequences, which are used to detect unlikely labels (line 5).

The Viterbi algorithm described in Section 3.3 provides only the single best sequence. To produce the k best sequences, it is not enough to store only one best label per node. The simplest way to modify the Viterbi algorithm is to store up to k best predecessors that can form the k best sequences. Unfortunately, with this modification, the algorithm has a computational complexity of O(kTN²). This computational overhead can be lowered by the iterative Viterbi A* algorithm, which has a complexity of O(T + kT) in the best case and O(TN² + kTN) in the worst case [19].

With the k best Viterbi paths found, the algorithm loops through the decoding (lines 6-12). A label is accepted only if all sequences produced it. Otherwise, a human annotator (oracle) is called to label the token.
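The decoding step of Algorithm 2 (lines 6-12) reduces to a simple agreement test over the k best label sequences. The sketch below illustrates this step under the assumption that some k-best decoder (for example the iterative Viterbi A* algorithm of [19]) has already produced the paths; label_instance and ask_oracle are hypothetical names introduced only for this illustration.

    # Sketch of lines 6-12 of Algorithm 2: a token keeps the model's label only
    # when all k best Viterbi sequences agree on it; otherwise the oracle is asked.
    def label_instance(x, k_best_paths, ask_oracle):
        labels, n_queried = [], 0
        for t in range(len(x)):
            candidates = {path[t] for path in k_best_paths}
            if len(candidates) == 1:           # all k paths produced the same label
                labels.append(k_best_paths[0][t])
            else:                              # the model is not sufficiently confident
                labels.append(ask_oracle(x, t))
                n_queried += 1
        return labels, n_queried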
5 Experiment and results

In this section, we describe the experiment used for the evaluation of the proposed method. The method is evaluated on an NLP task called part-of-speech tagging (POS). The input of POS tagging is a set of meaningful sentences. The output is a set of tag sequences, one tag for each word. Word classes (noun, verb, adjective, etc.) or their derivatives are the most often used tagsets. The number of tags is not limited.

POS tagging is a difficult task for two reasons. First, the number of possible words in the text can be very high, and the text may contain words that occur rarely. Second, some words can have several possible tags, and the context of the sentence is needed to find the correct tag. CRFs can take a wide context into account and are thus most commonly used for POS tagging. Although it is impossible to take a wide context into account in an HMM, it is a sufficiently well performing model for our experiment.

In our experiment, we used data from the Natural Language Toolkit [20], which provides data for many NLP tasks such as POS tagging, chunking, entity recognition, information extraction, etc. A few statistics of the employed benchmark datasets are provided in Table 1. Each dataset contains its proper tagset and a simplified tagset with 12 tags representing ten basic word classes, a dot and the rest.

Table 1: Benchmark datasets.
  Dataset      #sentences   #words   #tags
  Brown        57340        56057    472
  CoNLL2000    10948        21589    44
  Treebank     3914         12408    46

In order to compare the datasets, HMMs were trained using supervised learning on the full dataset with all labels available. Accuracy and the F1 score were used for the performance comparison. The performance was measured for both the original tagset (Acc 1 and F-score 1) and the simplified tagset (Acc 2 and F-score 2). The data was randomly split into training and testing sets in a 7:3 ratio. The performance of the supervised learning is shown in Table 2. Based on the results in the table, we consider the HMM to be sufficiently well performing for the experiment. The worse F-score in the case of the Brown dataset with all tags is caused by an approximately ten times higher number of possible hidden values.

Table 2: Prediction performance learned on the full dataset. Training and testing data were randomly split in a 7:3 ratio. Acc 1 and F-score 1 represent results based on all tags, whereas Acc 2 and F-score 2 represent results based on simplified tags.
  Dataset      Acc 1   Acc 2   F-score 1   F-score 2
  Brown        .9421   .9572   .4520       .9245
  Conll2000    .9508   .9546   .9080       .9408
  Treebank     .9189   .9307   .8205       .9291

5.1 Experimental setup

For most of the experiments we used the following settings. In the base model, an HMM, tags were considered to be the hidden state values and words were considered to be the observable variable values. The parameters of the model were estimated using MLE. To handle words that have not occurred in the training set, we added uniformly distributed pseudo-counts to both matrices A and B. Prior beliefs were set to be uniformly distributed; therefore, each word has the probability of 1/|words|.

In order to simulate a standard situation in active learning, the original dataset was randomly split into training and testing sets in a 7:3 ratio, and then the training set was randomly split into labelled and unlabelled sets in a 3:7 ratio.

In each iteration of the experiment, the most informative instance was selected, annotated and put into the labelled training set. As the most informative were considered the instances maximizing one of the following four uncertainty measures:

• least confidence
    φ_LC(x) = 1 − P(y*_1 | x; θ),

• margin
    φ_M(x) = −( P(y*_1 | x; θ) − P(y*_2 | x; θ) ),

• total token entropy
    φ_TE(x) = − ∑_{t=1}^{T} ∑_{n=1}^{N} P(y_t = n | x; θ) log P(y_t = n | x; θ),

• k-best sequence entropy
    φ_SE(x) = − ∑_{ŷ ∈ V} P(ŷ | x; θ) log P(ŷ | x; θ),

where V is the set of the k best Viterbi sequences and y*_k is the k-th most probable sequence of labels. The behaviour of the different uncertainty measures is investigated in the experiment in Section 5.2.

After finding the most informative sequence, semi-supervised learning was applied: the sequence was labelled according to Algorithm 2. The algorithm has one parameter, the number k of best sequences. The effect of this parameter on the performance of the proposed approach is described in the experiment in Section 5.3.
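For the least-confidence measure, P(y*_1 | x; θ) can be obtained from the quantities recalled in Section 3 as the ratio of the joint probability of the best Viterbi path to the probability of the observation sequence, P(x | θ) = ∑_i α_T(i). The following minimal sketch reuses the viterbi and forward_backward sketches given above and, like them, ignores the numerical scaling that long sentences require; it is an illustration of the measure, not the authors' implementation.

    # Sketch of the least-confidence utility phi_LC(x) = 1 - P(y*_1 | x; theta).
    def least_confidence(obs, A, B, pi):
        _, best_joint = viterbi(obs, A, B, pi)         # P(y*_1, x | theta)
        alpha, _, _ = forward_backward(obs, A, B, pi)
        evidence = alpha[-1].sum()                     # P(x | theta)
        return 1.0 - best_joint / evidence

The total token entropy measure can be computed analogously from the marginals γ_t(i) returned by forward_backward.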
5.2 Uncertainty measure

First, we study the effect of the uncertainty measures on the proposed method. The measures were evaluated on the Treebank dataset with 30% of the instances labelled. The parameter k was set to 100.

The experiment has shown that the computational complexity of the k-best sequence entropy measure and of the margin measure is too high for practical usage, due to the calculation of the k best Viterbi paths (the two best Viterbi paths, respectively) for each unlabelled instance. Moreover, active learning that uses the k-best sequence entropy as a measure had a tendency to choose short sentences; in that case, active learning had a lower accuracy than the random sampling method.

The computational complexity of the least confidence and total token entropy measures was reasonable even for datasets with a large number of unlabelled samples. The performance comparison is shown in Figures 1 and 2. According to the experiment results, FuSAL with the least confidence measure achieved higher accuracy after 50 iterations. However, the total token entropy measure achieved a certain level of accuracy with fewer queried tokens, which can be preferable for some tasks.

Figure 1: Comparison of the least confident measure (LC) and the total token entropy measure (TE) for different numbers of queries in connection with FuSAL and Viterbi SeSAL (accuracy over the number of queried sentences).

Figure 2: Comparison of the least confident measure (LC) and the total token entropy measure (TE) for different numbers of queries in connection with FuSAL and Viterbi SeSAL (accuracy over the number of queried tokens).

Taking into account the computational complexity of the methods, the least confidence measure is used in the rest of the experiment.

5.3 Parameter settings

In semi-supervised learning, a well performing model is crucial to produce labels of good quality. In the SeSAL algorithm, the parameter delay controls how many iterations of the FuSAL algorithm are performed before the semi-supervised approach is applied. The goal of this experiment was an analysis of the relationship between the parameter delay and the parameter k. Since the proposed method does not use the delay parameter, it was simulated using datasets with different numbers of labelled samples. The experiment was evaluated on the biggest dataset, Brown, with three initial settings: a) 10% of labelled samples, b) 30% of labelled samples, c) 60% of labelled samples.

It has been shown that the value of the parameter k is highly correlated with the number of labelled samples in the dataset. In the dataset with 10% of labelled samples, a high value of the parameter k has shown to be crucial to reduce the number of errors propagated to the training dataset (Figure 3). With an increasing number of labelled samples, a high value of the parameter k becomes less effective. In Figure 4, the difference between the settings k=100 and k=200 almost vanished. Moreover, regarding the number of queried labels, the setting k=100 becomes more efficient (Figure 5). The same trend was also observed in the case where 60% of the instances were labelled.

Figure 3: Accuracy over the number of queried sentences where 10% of the training set is labelled.

Figure 4: Accuracy over the number of queried sentences where 30% of the training set is labelled.

Figure 5: Accuracy over the number of queried tokens where 30% of the training set is labelled.

5.4 Number of queried tokens and error propagation

The parameter k affects the number of queried tokens and the number of errors propagated to the learning set. The optimal setting of the parameter minimises both. The experiment in this section analyses the relationship between queried tokens and errors.

One should take the number of labelled samples into account when setting the parameter k. In the case of fewer labelled samples, the parameter k should be set to a higher value to avoid the production of errors (Figure 6).

Figure 6: The number of queried tokens (solid line) and errors (dashed line) over the number of queried sentences where 10% of the training set is labelled.
After several iterations, when the base model is more accurate, higher values of the parameter k become less effective (Figure 7). However, even with an almost fully labelled dataset and the setting k = 200, we were not able to avoid errors in labelling: of all 2402 automatically annotated tokens, 57 were annotated wrongly. We consider the complicated control of an acceptable error rate to be one of the biggest disadvantages of the proposed method.

Figure 7: The number of queried tokens (solid line) and errors (dashed line) over the number of queried sentences where 60% of the training set is labelled.

5.5 Comparison with other methods

To evaluate the performance of the proposed method in comparison with other methods, accuracy was measured with respect to the number of queried sentences and the number of queried tokens. Furthermore, the number of errors propagated to the learning set was measured. All experiments were evaluated on the Brown dataset with the simplified tagset.

SeSAL with an uncertainty threshold and the proposed method can be compared only if the parameters are set such that the methods produce approximately the same number of errors. In the experiment, the confidence threshold was set to 0.48 and the number of paths k was set to 100.

As expected, the FuSAL method achieved the highest accuracy because all labels were annotated manually, thus correctly. With respect to the number of queried tokens, Viterbi SeSAL achieved a higher accuracy, but only at a higher number of queried tokens (Figure 8). The explanation can be seen in Figure 9, where the number of errors and the number of queried tokens were measured. In the given settings, the number of errors propagated to the learning set was lower for Viterbi SeSAL at the expense of the number of queried tokens. Nevertheless, after several iterations, the error rate of the proposed method was lower than that of the SeSAL method.

Figure 8: Achieved accuracy over the number of queried tokens.

Figure 9: The number of queried tokens (solid line) and the number of errors (dashed line) over the number of queries.

6 Conclusion and future work

We proposed a semi-supervised active learning method that is easy to set up for sequence labelling and performs sufficiently well in comparison with the semi-supervised active learning method that uses an uncertainty threshold and marginal probabilities. The proposed method uses the k best Viterbi paths to find the tokens in which the model is not sufficiently confident.

The number of errors, the number of queried tokens and the computational complexity are controlled by the parameter k. In order to reduce the number of errors propagated to the labelled set, the parameter k should be set as high as is reasonable in terms of computation time. The computational complexity of the k-best Viterbi path algorithm can be partially reduced using the iterative Viterbi A* algorithm. In addition to the high computational complexity, the complicated control of the number of propagated errors is a disadvantage of the proposed method.

An area for further research is the exploration of co-training in combination with the Query-by-Committee active learning framework, since both approaches consider several different views of the data. Furthermore, semi-supervised active learning methods that can be applied to both probabilistic and deterministic sequential models should be studied further to find a general solution for them.

Acknowledgements

The reported research was supported by the CTU grant nr. SGS17/210/OHK3/3T/18 and by the Czech Science Foundation grant nr. 17-01251.
References

[1] Burr Settles. Active learning literature survey. University of Wisconsin, Madison, 52(55-66):11, 2010.
[2] Burr Settles and Mark Craven. An analysis of active learning strategies for sequence labeling tasks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1070–1079. Association for Computational Linguistics, 2008.
[3] David D. Lewis and William A. Gale. A sequential algorithm for training text classifiers. In Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 3–12. Springer-Verlag New York, Inc., 1994.
[4] Gokhan Tur, Dilek Hakkani-Tür, and Robert E. Schapire. Combining active and semi-supervised learning for spoken language understanding. Speech Communication, 45(2):171–186, 2005.
[5] Cynthia A. Thompson, Mary Elaine Califf, and Raymond J. Mooney. Active learning for natural language parsing and information extraction. In ICML, pages 406–414, 1999.
[6] Lin Yao, Chengjie Sun, Shaofeng Li, Xiaolong Wang, and Xuan Wang. CRF-based active learning for Chinese named entity recognition. In Systems, Man and Cybernetics, 2009. SMC 2009. IEEE International Conference on, pages 1557–1561. IEEE, 2009.
[7] Ido Dagan and Sean P. Engelson. Committee-based sampling for training probabilistic classifiers. In Proceedings of the Twelfth International Conference on Machine Learning, pages 150–157. Morgan Kaufmann, San Francisco, CA, USA, 1995.
[8] David D. Lewis and Jason Catlett. Heterogeneous uncertainty sampling for supervised learning. In Proceedings of the Eleventh International Conference on Machine Learning, pages 148–156, 1994.
[9] H. Sebastian Seung, Manfred Opper, and Haim Sompolinsky. Query by committee. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, pages 287–294. ACM, 1992.
[10] David Yarowsky. Unsupervised word sense disambiguation rivaling supervised methods. In Proceedings of the 33rd Annual Meeting on Association for Computational Linguistics, pages 189–196. Association for Computational Linguistics, 1995.
[11] Ellen Riloff, Janyce Wiebe, and Theresa Wilson. Learning subjective nouns using extraction pattern bootstrapping. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003 - Volume 4, pages 25–32. Association for Computational Linguistics, 2003.
[12] Chuck Rosenberg, Martial Hebert, and Henry Schneiderman. Semi-supervised self-training of object detection models. 2005.
[13] Avrim Blum and Tom Mitchell. Combining labeled and unlabeled data with co-training. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, pages 92–100. ACM, 1998.
[14] Andrew M. Dai and Quoc V. Le. Semi-supervised sequence learning. CoRR, abs/1511.01432, 2015.
[15] Shi Zhong. Semi-supervised sequence classification with HMMs. International Journal of Pattern Recognition and Artificial Intelligence, 19(02):165–182, 2005.
[16] David Pierce and Claire Cardie. Limitations of co-training for natural language learning from large datasets. In Proceedings of the 2001 Conference on Empirical Methods in Natural Language Processing, pages 1–9, 2001.
[17] Katrin Tomanek and Udo Hahn. Semi-supervised active learning for sequence labeling. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2, pages 1039–1047. Association for Computational Linguistics, 2009.
[18] Lawrence R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286, 1989.
[19] Zhiheng Huang, Yi Chang, Bo Long, Jean-Francois Crespo, Anlei Dong, Sathiya Keerthi, and Su-Lin Wu. Iterative Viterbi A* algorithm for k-best sequential decoding. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1, pages 611–619. Association for Computational Linguistics, 2012.
[20] Steven Bird, Ewan Klein, and Edward Loper. Natural Language Processing with Python. O'Reilly Media, 2009.