K-best Viterbi Semi-supervized Active Learning in Sequence Labelling

Tomáš Šabata^1, Tomáš Borovička^1, and Martin Holeňa^2

^1 Faculty of Information Technology, Czech Technical University in Prague, Prague, The Czech Republic
^2 Institute of Computer Science, Czech Academy of Sciences, Prague, The Czech Republic

Abstract: In application domains where there exists a large amount of unlabelled data but obtaining labels is expensive, active learning is a useful way to select which data should be labelled. In addition to its traditional successful use in classification and regression tasks, active learning has also been applied to sequence labelling. According to the standard active learning approach, the sequences whose labelling would be the most informative should be labelled. However, labelling an entire sequence may be inefficient, since for some of its parts the labels can be predicted by a model, and labelling such parts brings only little new information. Therefore, in this paper we investigate a sequence labelling approach in which, within the sequence selected for labelling, the labels of most tokens are predicted by a model and only the tokens that the model cannot predict with sufficient confidence are labelled manually. Those tokens are identified using the k-best Viterbi algorithm.

1 Introduction

Hidden Markov models (HMMs) and conditional random fields (CRFs) are very popular models in sequence labelling tasks such as handwriting recognition, speech recognition, DNA analysis, video analysis, information extraction or natural language processing (NLP). They achieve good results if a high-quality and fully annotated dataset is available. Unfortunately, in these tasks, obtaining labels for data may be expensive. The annotation cost is a motivation for using active learning. Active learning usually begins with a small labelled set L; in each iteration, the most informative instance of an unlabelled set U is chosen, annotated by an oracle and added to the set L. The model is retrained using the extended set L and the whole process repeats until a stopping criterion is met. This approach is valuable in tasks where unlabelled data are easily available but obtaining their labels is expensive. In this case, it aims at achieving higher accuracy with minimal cost.

Nevertheless, labelling long sequences can be troublesome, in particular for a human annotator, who is prone to create labels of lower quality. To address this problem, we can combine active learning with semi-supervised learning. Semi-supervised active learning in sequence labelling means that a model labels those parts of a sequence that are easy to predict and lets the annotator focus only on the parts of the sequence that are the most uncertain.

In this paper, we propose a semi-supervised active learning approach that uses the k-best Viterbi algorithm to detect candidates for manual labelling. The proposed approach was experimentally evaluated on an NLP task, part-of-speech tagging.

In the second section, we provide an overview of related work in active and semi-supervised learning. The third section recalls some basics of hidden Markov models that are necessary for understanding the proposed approach, which is introduced in the fourth section. An experiment description, its results and their analysis are given in the fifth section. The paper is concluded by a discussion of the results and possible future work.
2 Related work

While active learning has been widely studied for classification and regression tasks [1], less attention has been given to the task of sequence labelling. Despite this, most of the algorithms developed for classification can also be adapted to sequence labelling [2].

Active learning can be applied in three different scenarios: pool-based sampling, stream-based selective sampling and membership query synthesis. The most commonly used scenario is pool-based sampling, originally proposed in [3]. It has been studied for many real-world problem domains, sequence labelling included, for example speech recognition [4], information retrieval [5] or named entity recognition [6]. The main idea of pool-based active learning is to use a query strategy framework to find the most informative sample (sequence) in the unlabelled set (pool) of samples. The selected sample is annotated and added to the labelled set, the model is retrained, and the whole process repeats. The second scenario, stream-based selective sampling, can also be used in sequence labelling [7], but it is used less commonly. The difference from pool-based sampling is that the samples arrive in a stream and the framework decides whether to annotate a sample or to discard it; discarded samples are never used in training later. The main idea of the third scenario, membership query synthesis, is that the learner can query any unlabelled instance, usually generated de novo.

Active learning can use one of six different query strategy frameworks [1]. The most commonly used frameworks are Uncertainty Sampling [8] and Query-by-Committee [9]. Uncertainty Sampling selects the sample in which the model is least confident. Query-by-Committee maintains a committee of predictors, and the sample on which the predictors disagree most regarding their predictions is considered to be the most informative. Other query strategies applicable to sequences are Expected Gradient Length, Information Density, Fisher Information and Future Error Reduction [2]. The Future Error Reduction framework is not commonly used due to its high computational complexity.

Semi-supervised learning methods were developed with the same motivation of a partly unlabelled dataset. Self-training is a commonly used technique in which the predictor is first trained on a small labelled dataset and then used to annotate data. The most confident labels are added to the training set, and the predictor is retrained. Self-training has found application in several tasks of natural language processing [10, 11, 12]. Another technique, co-training, is a multi-learner algorithm where the learners have independent, complementary features of the dataset and produce labelled examples separately [13]. Semi-supervised learning has also been applied to sequence modelling tasks [14, 15].

In tasks where a large amount of labelled data is required (for example, NLP tasks), semi-supervised learning does not perform well due to the propagation of many tagging errors into the learning dataset. The problem of data pollution was partially solved in [16], where a human was put into the training loop to correct labelled examples. However, the correction of labelled data can be time-consuming and is similar to labelling the data from scratch. To address this problem, a semi-supervised active learning approach that does not need any human inspection was proposed in [17]. The approach uses active learning to find the most informative sequences. The model labels the most informative sequences and uses the marginal probability of each sequence token to decide whether the prediction is confident. The method has two parameters, a delay before the semi-supervised approach is run and a confidence threshold. A proper setting of these parameters is necessary to achieve the desired results.

Inspired by the semi-supervised method in [17], we propose a method that does not need the confidence threshold parameter because it uses the k best Viterbi paths.

3 Preliminaries

In this paper, we focus on the task of part-of-speech tagging. For simplicity, our approach is presented by means of an HMM, but it can be extended to CRFs as well. In this section, the principles of an HMM are recalled.
3.1 Hidden Markov Models

With each HMM, a random process indexed by time is associated, which is assumed to be in exactly one of a set of N distinct states at any time. At regularly spaced discrete times, the system changes its state according to probabilities of transitions between states. The time steps associated with the state changes are denoted t = 1, 2, 3, ..., and the actual state at a time step t is denoted q_t.

The process itself is assumed to be a first-order Markov chain, which is described by a matrix of transition probabilities A = {a_{i,j}}, defined as

    a_{i,j} = P(q_t = y_j | q_{t-1} = y_i),  1 ≤ i, j ≤ N.  (1)

A simple observable Markov chain is too restrictive to describe reality. However, it can be extended. Denoting Y the variable recording the states of the Markov chain, an HMM is obtained by completing Y with a random variable X. In the context of that HMM, X is called the 'observable variable' or 'output variable', whereas Y is called the 'hidden variable'. The hidden variable Y takes values in the set {y_1, y_2, ..., y_N} and the observable variable X takes values in a set {x_1, x_2, ..., x_M}.

We assume to have an observation sequence O = o_1 o_2 ... o_T and a state sequence Q = q_1 q_2 ... q_T which corresponds to the observation sequence. An HMM can be characterised by three probability distributions:

1. A state transition probability distribution A = {a_{i,j}}.

2. A probability distribution of observable variables, B = {b_i(x_k)}, where b_i(x_k) is the probability of o_t assuming the value x_k if q_t is in the state y_i; it is defined as

    b_i(k) = P(o_t = x_k | q_t = y_i).  (2)

3. An initial state distribution π = {π_i}, defined by

    π_i = P(q_1 = y_i).

With these three elements, an HMM is fully defined and denoted θ = (A, B, π).

The parameters of an HMM can be learned either in a semi-supervised way with the Baum-Welch algorithm [18] or in a fully-supervised way with maximum-likelihood estimation (MLE). In the fully-supervised way, the values of both observable and hidden variables are known.

In the MLE, we assume a training set D = {(o^1, q^1), ..., (o^n, q^n)} of size n whose elements are independent. The MLE consists in taking those parameters θ* that maximize the probability of the training set:

    θ* = argmax_θ P(D|θ).  (3)

Due to (1) and (2), the probability in (3) turns into

    P(D|θ) = ∏_{i,j} a_{i,j}^{T_{i,j}} ∏_{i,k} [b_i(k)]^{E_i(k)},  with  ∑_j a_{i,j} = 1,  ∑_k b_i(k) = 1,

where T_{i,j} stands for the number of transitions from state y_i to state y_j in the training set and E_i(k) stands for the number of emissions of the value x_k in state y_i. The parameters A and B can then be obtained by the following formulas:

    a_{i,j} = (T_{i,j} + r_{i,j}) / ∑_{j'} (T_{i,j'} + r_{i,j'})   and   b_i(k) = (E_i(k) + r_i(k)) / ∑_{k'} (E_i(k') + r_i(k')),  (4)

where r_{i,j} and r_i(k) are our prior beliefs. The prior beliefs are used in the case of an insufficiently large dataset, where the estimate would otherwise lead to zero probabilities of events which never occurred in D.
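To make the counting behind the formulas in (4) concrete, the following minimal Python sketch estimates A, B and π from fully labelled sequences with additive pseudo-counts. It only illustrates the estimator described above; the function and variable names (estimate_hmm, prior) and the use of a single uniform pseudo-count value are our assumptions, not taken from the authors' implementation.

    # Sketch of the fully-supervised MLE of Section 3.1 with additive pseudo-counts.
    import numpy as np

    def estimate_hmm(sequences, n_states, n_symbols, prior=1.0):
        """sequences: list of (observations, states) pairs of equally long integer lists."""
        A = np.full((n_states, n_states), prior)    # transition pseudo-counts r_{i,j}
        B = np.full((n_states, n_symbols), prior)   # emission pseudo-counts r_i(k)
        pi = np.full(n_states, prior)               # initial-state pseudo-counts
        for obs, states in sequences:
            pi[states[0]] += 1
            for t in range(1, len(states)):
                A[states[t - 1], states[t]] += 1    # T_{i,j}: transitions y_i -> y_j
            for o, s in zip(obs, states):
                B[s, o] += 1                        # E_i(k): emissions of x_k in state y_i
        # Normalise rows so that sum_j a_{i,j} = 1 and sum_k b_i(k) = 1, as in (4).
        A /= A.sum(axis=1, keepdims=True)
        B /= B.sum(axis=1, keepdims=True)
        pi /= pi.sum()
        return A, B, pi

A uniform value of the pseudo-counts corresponds to the uniformly distributed prior beliefs used in the experimental setup in Section 5.1.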
To simplify the notation, we define the variables α and β as follows:

    α_t(i) = P(o_1, ..., o_t, q_t = y_i | θ),  (5)
    β_t(i) = P(o_{t+1}, ..., o_T | q_t = y_i, θ).  (6)

These variables are computed using the following forward-backward algorithm [18]:

    α_1(i) = π_i b_i(o_1),
    α_{t+1}(i) = ( ∑_{j=1}^{N} α_t(j) a_{j,i} ) b_i(o_{t+1}),

and, respectively,

    β_T(i) = 1,
    β_t(i) = ∑_{j=1}^{N} a_{i,j} b_j(o_{t+1}) β_{t+1}(j).

3.2 Marginal probability

Once the model is learned, it can be used for the prediction of a sequence of hidden states given an observable sequence. In the task of finding the most likely state sequence, it is possible to find the sequence that maximises the expected number of correctly assigned states. From (5) and (6) it follows that the marginal probability of being in a specific state y_i at a particular time t is

    γ_t(i) = α_t(i) β_t(i) / ∑_{j=1}^{N} α_t(j) β_t(j).  (7)

Maximising the expected number of correctly assigned states can then be achieved by applying q_t = arg max_{y_i ∈ Y} γ_t(i) at every position of the sequence. However, this approach can find a sequence with a very low or even zero probability in case the resulting sequence is not feasible.
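As an illustration of the recursions above, the computation of α, β and the marginals (7) can be sketched as follows. It is a direct, unscaled transcription of the formulas (the names are ours); a practical implementation would rescale α and β or work in log space to avoid numerical underflow on long sequences.

    # Sketch of the forward-backward recursions (5)-(6) and of the marginals (7).
    import numpy as np

    def forward_backward(obs, A, B, pi):
        T, N = len(obs), A.shape[0]
        alpha = np.zeros((T, N))
        beta = np.zeros((T, N))
        alpha[0] = pi * B[:, obs[0]]                        # alpha_1(i) = pi_i b_i(o_1)
        for t in range(1, T):
            alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]    # recursion for alpha_{t+1}(i)
        beta[T - 1] = 1.0                                   # beta_T(i) = 1
        for t in range(T - 2, -1, -1):
            beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])  # recursion for beta_t(i)
        gamma = alpha * beta
        gamma /= gamma.sum(axis=1, keepdims=True)           # marginals gamma_t(i) of (7)
        return alpha, beta, gamma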
3.3 Viterbi algorithm

The Viterbi algorithm is a dynamic programming algorithm that finds the most likely state sequence as a whole by maximising P(Q, O|θ). It gradually computes the maximal probability of the state chain from its beginning up to time t with the state q_t being y_i, represented by the variable

    δ_t(i) = max_{q_1,...,q_{t-1}} P(q_1, ..., q_{t-1}, q_t = y_i, o_1, ..., o_t).

The algorithm is initialized as

    δ_1(i) = π_i b_i(o_1),  (8)

and for each 2 ≤ t ≤ T and each y_i from Y, the algorithm calculates the variable δ_t(i):

    δ_t(i) = ( max_{1≤j≤N} δ_{t-1}(j) a_{j,i} ) b_i(o_t).  (9)

At each time t and for each state i, the algorithm stores a link to the predecessor state with which it forms the best path. These links are stored in an additional two-dimensional array ψ_t(i), where

    ψ_1(i) = 0,
    ψ_t(i) = arg max_{1≤j≤N} δ_{t-1}(j) a_{j,i}.

The probability of the most probable sequence is max_{1≤i≤N} δ_T(i), and the most probable state path Q* = (q*_1, q*_2, ..., q*_T) is found by backtracking:

    q*_T = arg max_{1≤i≤N} δ_T(i),
    q*_t = ψ_{t+1}(q*_{t+1}).

The Viterbi algorithm has a similar structure as the forward-backward algorithm, and both have complexity O(N²T).
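The recursion (8)-(9) together with the backpointer array ψ translates directly into code. The sketch below, written with the same (our own) conventions as the previous sketches, returns the most probable state path and the joint probability of this path with the observation sequence.

    # Sketch of the Viterbi recursion (8)-(9) with the backpointer array psi.
    import numpy as np

    def viterbi(obs, A, B, pi):
        T, N = len(obs), A.shape[0]
        delta = np.zeros((T, N))
        psi = np.zeros((T, N), dtype=int)
        delta[0] = pi * B[:, obs[0]]                      # delta_1(i) = pi_i b_i(o_1)
        for t in range(1, T):
            scores = delta[t - 1][:, None] * A            # delta_{t-1}(j) a_{j,i}
            psi[t] = scores.argmax(axis=0)                # best predecessor of each state
            delta[t] = scores.max(axis=0) * B[:, obs[t]]  # recursion (9)
        # Backtracking: q*_T = argmax_i delta_T(i), q*_t = psi_{t+1}(q*_{t+1}).
        path = [int(delta[T - 1].argmax())]
        for t in range(T - 1, 0, -1):
            path.append(int(psi[t, path[-1]]))
        return list(reversed(path)), float(delta[T - 1].max())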
4 Proposed approach

Our proposed approach is an adaptation of the semi-supervised active learning method (SeSAL) originally proposed in [17]. Both SeSAL and our adaptation are based on a standard fully-supervised active learning algorithm (FuSAL). The concept of the FuSAL algorithm is described by the pseudocode in Algorithm 1.

Algorithm 1 FuSAL algorithm
Given:
  L: set of labelled examples
  U: set of unlabelled examples
  φ_M: utility function
Algorithm:
  1: while stopping criterion is not met do
  2:   learn model M from L
  3:   for all x_i ∈ U: u_{x_i} ← φ_M(x_i)
  4:   select x* = arg max_{x_i} u_{x_i}
  5:   query an oracle for labels y of x*
  6:   remove x* from U
  7:   insert <x*, y> into L
  8: end while

The utility function φ_M(x) represents the informativeness of the sample x given the model M. In the algorithm, any utility function can be used to find the most informative sequence [2].

In SeSAL, the most informative instance is annotated by the model M and only the tokens whose predicted labels have a confidence smaller than a given threshold are given to a human annotator (oracle). Finding the optimal threshold value is an optimisation task minimising the dataset pollution and the number of queried labels. If the threshold is too high, a human annotates labels in which the model is well confident. On the other hand, if the threshold is too low, the algorithm accepts incorrectly labelled tokens, which may result in a polluted training set.

SeSAL also uses a parameter called delay that represents the number of iterations of FuSAL performed before the algorithm switches to SeSAL. This helps to avoid producing errors coming from the incorrect labels of an insufficiently converged model.

In our approach, the confidence of labels is replaced by calculating the k best Viterbi paths to find the tokens on which the predictions of the model differ among the k most likely sequences. The number of paths affects the behaviour of the algorithm; however, we assume this parameter to be less data dependent than the confidence threshold. We call the approach k-best Viterbi SeSAL. Its pseudocode is given in Algorithm 2.

Algorithm 2 k-best Viterbi SeSAL algorithm
Given:
  L: set of labelled examples
  U: set of unlabelled examples
  φ_M: utility function
  k: number of paths
Algorithm:
  1: while stopping criterion is not met do
  2:   learn model M from L
  3:   for all x_i ∈ U: u_{x_i} ← φ_M(x_i)
  4:   select x* = arg max_{x_i} u_{x_i}
  5:   find the k best Viterbi paths {v_1, ..., v_k}
  6:   for t in length(x*) do
  7:     if v_i(t) for all i = 1, ..., k are equal then
  8:       label x*(t) with y(t) = v_1(t)
  9:     else
 10:       query an oracle for a label y(t) of x*(t)
 11:     end if
 12:   end for
 13:   remove x* from U
 14:   insert <x*, y> into L
 15: end while

The proposed approach uses the FuSAL active learning framework to find the most informative instance (lines 2-4). Then, semi-supervised learning is applied in order to label the instance. The algorithm computes the k best Viterbi sequences, which are used to detect unlikely labels (line 5).

The Viterbi algorithm described in Section 3.3 provides only the single best sequence. To produce the k best sequences, it is not enough to store only one best label per node. The simplest way to modify the Viterbi algorithm is to store up to k best predecessors that can form the k best sequences. Unfortunately, with this modification, the algorithm has a computational complexity of O(kTN²). This computational overhead can be lowered by the iterative Viterbi A* algorithm, which has a complexity of O(T + kT) in the best case and O(TN² + kTN) in the worst case [19].

With the k best Viterbi paths found, the algorithm loops through the decoding (lines 6-12). A label is accepted only if all sequences produced it. Otherwise, a human annotator (oracle) is called to label the token.
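The decoding step of Algorithm 2 (lines 6-12) reduces to a simple agreement test over the k best label sequences. The sketch below illustrates this step under the assumption that some k-best decoder (for example the iterative Viterbi A* algorithm of [19]) has already produced the paths; label_instance and ask_oracle are hypothetical names introduced only for this illustration.

    # Sketch of lines 6-12 of Algorithm 2: a token keeps the model's label only
    # when all k best Viterbi sequences agree on it; otherwise the oracle is asked.
    def label_instance(x, k_best_paths, ask_oracle):
        labels, n_queried = [], 0
        for t in range(len(x)):
            candidates = {path[t] for path in k_best_paths}
            if len(candidates) == 1:           # all k paths produced the same label
                labels.append(k_best_paths[0][t])
            else:                              # the model is not sufficiently confident
                labels.append(ask_oracle(x, t))
                n_queried += 1
        return labels, n_queried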
5 Experiment and results

In this section, we describe the experiment used for the evaluation of the proposed method. The method is evaluated on an NLP task called part-of-speech tagging (POS). The input of POS tagging is a set of meaningful sentences. The output is a set of tag sequences, one tag for each word. Word classes (noun, verb, adjective, etc.) or their derivatives are the most often used tagsets. The number of tags is not limited.

POS tagging is a difficult task for two reasons. First, the number of possible words in the text can be very high, and the text may contain words that occur rarely. Second, some words can have several possible tags, and the context of the sentence is needed to find the correct tag. CRFs can take a wide context into account and are thus most commonly used for POS tagging. Although it is impossible to take a wide context into account in an HMM, it is a sufficiently well performing model for our experiment.

In our experiment, we used data from the Natural Language Toolkit [20], which provides data for many NLP tasks such as POS tagging, chunking, entity recognition, information extraction, etc. A few statistics of the employed benchmark datasets are provided in Table 1. Each dataset contains its proper tagset and a simplified tagset with 12 tags representing ten basic word classes, a dot and the rest.

Table 1: Benchmark datasets.
  Dataset      #sentences   #words   #tags
  Brown        57340        56057    472
  CoNLL2000    10948        21589    44
  Treebank     3914         12408    46

In order to compare the datasets, HMMs were trained using supervised learning on the full dataset with all labels available. Accuracy and the F1 score were used for the performance comparison. The performance was measured for both the original tagset (Acc 1 and F-score 1) and the simplified tagset (Acc 2 and F-score 2). The data was randomly split into training and testing sets in a 7:3 ratio. The performance of the supervised learning is shown in Table 2. Based on the results in the table, we consider the HMM to be sufficiently well performing for the experiment. The worse F-score in the case of the Brown dataset with all tags is caused by an approximately ten times higher number of possible hidden values.

Table 2: Prediction performance learned on the full dataset. Training and testing data were randomly split in a 7:3 ratio. Acc 1 and F-score 1 represent results based on all tags, whereas Acc 2 and F-score 2 represent results based on simplified tags.
  Dataset      Acc 1   Acc 2   F-score 1   F-score 2
  Brown        .9421   .9572   .4520       .9245
  Conll2000    .9508   .9546   .9080       .9408
  Treebank     .9189   .9307   .8205       .9291

5.1 Experimental setup

For most of the experiments we used the following settings. In the base model, an HMM, tags were considered to be the hidden state values and words were considered to be the observable variable values. The parameters of the model were estimated using MLE. To handle words that have not occurred in the training set, we added uniformly distributed pseudo-counts to both matrices A and B. Prior beliefs were set to be uniformly distributed; therefore, each word has the probability of 1/|words|.

In order to simulate a standard situation in active learning, the original dataset was randomly split into training and testing sets in a 7:3 ratio, and then the training set was randomly split into labelled and unlabelled sets in a 3:7 ratio.

In each iteration of the experiment, the most informative instance was selected, annotated and put into the labelled training set. As the most informative were considered the instances maximizing one of the following four uncertainty measures:

• least confidence
    φ_LC(x) = 1 − P(y*_1 | x; θ),

• margin
    φ_M(x) = −( P(y*_1 | x; θ) − P(y*_2 | x; θ) ),

• total token entropy
    φ_TE(x) = − ∑_{t=1}^{T} ∑_{n=1}^{N} P(y_t = n | x; θ) log P(y_t = n | x; θ),

• k-best sequence entropy
    φ_SE(x) = − ∑_{ŷ ∈ V} P(ŷ | x; θ) log P(ŷ | x; θ),

where V is the set of the k best Viterbi sequences and y*_k is the k-th most probable sequence of labels. The behaviour of the different uncertainty measures is investigated in the experiment in Section 5.2.

After finding the most informative sequence, semi-supervised learning was applied: the sequence was labelled according to Algorithm 2. The algorithm has one parameter, the number k of best sequences. The effect of this parameter on the performance of the proposed approach is described in the experiment in Section 5.3.
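For the least-confidence measure, P(y*_1 | x; θ) can be obtained from the quantities recalled in Section 3 as the ratio of the joint probability of the best Viterbi path to the probability of the observation sequence, P(x | θ) = ∑_i α_T(i). The following minimal sketch reuses the viterbi and forward_backward sketches given above and, like them, ignores the numerical scaling that long sentences require; it is an illustration of the measure, not the authors' implementation.

    # Sketch of the least-confidence utility phi_LC(x) = 1 - P(y*_1 | x; theta).
    def least_confidence(obs, A, B, pi):
        _, best_joint = viterbi(obs, A, B, pi)         # P(y*_1, x | theta)
        alpha, _, _ = forward_backward(obs, A, B, pi)
        evidence = alpha[-1].sum()                     # P(x | theta)
        return 1.0 - best_joint / evidence

The total token entropy measure can be computed analogously from the marginals γ_t(i) returned by forward_backward.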
5.2 Uncertainty measure

First, we study the effect of the uncertainty measures on the proposed method. The measures were evaluated on the Treebank dataset with 30% of the instances labelled. The parameter k was set to 100.

The experiment has shown that the computational complexity of the k-best sequence entropy measure and of the margin measure is too high for practical usage, due to the calculation of the k best Viterbi paths (the two best Viterbi paths, respectively) for each unlabelled instance. Moreover, active learning that uses the k-best sequence entropy as a measure had a tendency to choose short sentences; in that case, active learning had a lower accuracy than the random sampling method.

The computational complexity of the least confidence and total token entropy measures was reasonable even for datasets with a large number of unlabelled samples. The performance comparison is shown in Figures 1 and 2. According to the experiment results, FuSAL with the least confidence measure achieved higher accuracy after 50 iterations. However, the total token entropy measure achieved a certain level of accuracy with fewer queried tokens, which can be preferable for some tasks.

Figure 1: Comparison of the least confident measure (LC) and the total token entropy measure (TE) for different numbers of queries in connection with FuSAL and Viterbi SeSAL (accuracy over the number of queried sentences).

Figure 2: Comparison of the least confident measure (LC) and the total token entropy measure (TE) for different numbers of queries in connection with FuSAL and Viterbi SeSAL (accuracy over the number of queried tokens).

Taking into account the computational complexity of the methods, the least confidence measure is used in the rest of the experiment.

5.3 Parameter settings

In semi-supervised learning, a well performing model is crucial to produce labels of good quality. In the SeSAL algorithm, the parameter delay controls how many iterations of the FuSAL algorithm are performed before the semi-supervised approach is applied. The goal of this experiment was an analysis of the relationship between the parameter delay and the parameter k. Since the proposed method does not use the delay parameter, it was simulated using datasets with different numbers of labelled samples. The experiment was evaluated on the biggest dataset, Brown, with three initial settings: a) 10% of labelled samples, b) 30% of labelled samples, c) 60% of labelled samples.

It has been shown that the value of the parameter k is highly correlated with the number of labelled samples in the dataset. In the dataset with 10% of labelled samples, a high value of the parameter k has shown to be crucial to reduce the number of errors propagated to the training dataset (Figure 3). With an increasing number of labelled samples, a high value of the parameter k becomes less effective. In Figure 4, the difference between the settings k=100 and k=200 almost vanished. Moreover, regarding the number of queried labels, the setting k=100 becomes more efficient (Figure 5). The same trend was also observed in the case where 60% of the instances were labelled.

Figure 3: Accuracy over the number of queried sentences where 10% of the training set is labelled.

Figure 4: Accuracy over the number of queried sentences where 30% of the training set is labelled.

Figure 5: Accuracy over the number of queried tokens where 30% of the training set is labelled.

5.4 Number of queried tokens and error propagation

The parameter k affects the number of queried tokens and the number of errors propagated to the learning set. The optimal setting of the parameter minimises both. The experiment in this section analyses the relationship between queried tokens and errors.

One should take the number of labelled samples into account when setting the parameter k. In the case of fewer labelled samples, the parameter k should be set to a higher value to avoid the production of errors (Figure 6).

Figure 6: The number of queried tokens (solid line) and errors (dashed line) over the number of queried sentences where 10% of the training set is labelled.
After several iterations, when the base model is more accurate, higher values of the parameter k become less effective (Figure 7). However, even with an almost fully labelled dataset and the setting k = 200, we were not able to avoid errors in labelling: of all 2402 automatically annotated tokens, 57 were annotated wrongly. We consider the complicated control of an acceptable error rate to be one of the biggest disadvantages of the proposed method.

Figure 7: The number of queried tokens (solid line) and errors (dashed line) over the number of queried sentences where 60% of the training set is labelled.

5.5 Comparison with other methods

To evaluate the performance of the proposed method in comparison with other methods, accuracy was measured with respect to the number of queried sentences and the number of queried tokens. Furthermore, the number of errors propagated to the learning set was measured. All experiments were evaluated on the Brown dataset with the simplified tagset.

SeSAL with an uncertainty threshold and the proposed method can be compared only if the parameters are set such that the methods produce approximately the same number of errors. In the experiment, the confidence threshold was set to 0.48 and the number of paths k was set to 100.

As expected, the FuSAL method achieved the highest accuracy because all labels were annotated manually, thus correctly. With respect to the number of queried tokens, Viterbi SeSAL achieved a higher accuracy, but only at a higher number of queried tokens (Figure 8). The explanation can be seen in Figure 9, where the number of errors and the number of queried tokens were measured. In the given settings, the number of errors propagated to the learning set was lower for Viterbi SeSAL at the expense of the number of queried tokens. Nevertheless, after several iterations, the error rate of the proposed method was lower than that of the SeSAL method.

Figure 8: Achieved accuracy over the number of queried tokens.

Figure 9: The number of queried tokens (solid line) and the number of errors (dashed line) over the number of queries.

6 Conclusion and future work

We proposed a semi-supervised active learning method that is easy to set up for sequence labelling and performs sufficiently well in comparison with the semi-supervised active learning method that uses an uncertainty threshold and marginal probabilities. The proposed method uses the k best Viterbi paths to find the tokens in which the model is not sufficiently confident.

The number of errors, the number of queried tokens and the computational complexity are controlled by the parameter k. In order to reduce the number of errors propagated to the labelled set, the parameter k should be set as high as is reasonable in terms of computation time. The computational complexity of the k-best Viterbi path algorithm can be partially reduced using the iterative Viterbi A* algorithm. In addition to the high computational complexity, the complicated control of the number of propagated errors is a disadvantage of the proposed method.

An area for further research is the exploration of co-training in combination with the Query-by-Committee active learning framework, since both approaches consider several different views of the data. Furthermore, semi-supervised active learning methods that can be applied to both probabilistic and deterministic sequential models should be studied further to find a general solution for them.

Acknowledgements

The reported research was supported by the CTU grant nr. SGS17/210/OHK3/3T/18 and by the Czech Science Foundation grant nr. 17-01251.
References

[1] Burr Settles. Active learning literature survey. University of Wisconsin, Madison, 52(55-66):11, 2010.
[2] Burr Settles and Mark Craven. An analysis of active learning strategies for sequence labeling tasks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1070–1079. Association for Computational Linguistics, 2008.
[3] David D. Lewis and William A. Gale. A sequential algorithm for training text classifiers. In Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 3–12. Springer-Verlag New York, Inc., 1994.
[4] Gokhan Tur, Dilek Hakkani-Tür, and Robert E. Schapire. Combining active and semi-supervised learning for spoken language understanding. Speech Communication, 45(2):171–186, 2005.
[5] Cynthia A. Thompson, Mary Elaine Califf, and Raymond J. Mooney. Active learning for natural language parsing and information extraction. In ICML, pages 406–414, 1999.
[6] Lin Yao, Chengjie Sun, Shaofeng Li, Xiaolong Wang, and Xuan Wang. CRF-based active learning for Chinese named entity recognition. In Systems, Man and Cybernetics, 2009. SMC 2009. IEEE International Conference on, pages 1557–1561. IEEE, 2009.
[7] Ido Dagan and Sean P. Engelson. Committee-based sampling for training probabilistic classifiers. In Proceedings of the Twelfth International Conference on Machine Learning, pages 150–157. Morgan Kaufmann, San Francisco, CA, USA, 1995.
[8] David D. Lewis and Jason Catlett. Heterogeneous uncertainty sampling for supervised learning. In Proceedings of the Eleventh International Conference on Machine Learning, pages 148–156, 1994.
[9] H. Sebastian Seung, Manfred Opper, and Haim Sompolinsky. Query by committee. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, pages 287–294. ACM, 1992.
[10] David Yarowsky. Unsupervised word sense disambiguation rivaling supervised methods. In Proceedings of the 33rd Annual Meeting on Association for Computational Linguistics, pages 189–196. Association for Computational Linguistics, 1995.
[11] Ellen Riloff, Janyce Wiebe, and Theresa Wilson. Learning subjective nouns using extraction pattern bootstrapping. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003 - Volume 4, pages 25–32. Association for Computational Linguistics, 2003.
[12] Chuck Rosenberg, Martial Hebert, and Henry Schneiderman. Semi-supervised self-training of object detection models. 2005.
[13] Avrim Blum and Tom Mitchell. Combining labeled and unlabeled data with co-training. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, pages 92–100. ACM, 1998.
[14] Andrew M. Dai and Quoc V. Le. Semi-supervised sequence learning. CoRR, abs/1511.01432, 2015.
[15] Shi Zhong. Semi-supervised sequence classification with HMMs. International Journal of Pattern Recognition and Artificial Intelligence, 19(02):165–182, 2005.
[16] David Pierce and Claire Cardie. Limitations of co-training for natural language learning from large datasets. In Proceedings of the 2001 Conference on Empirical Methods in Natural Language Processing, pages 1–9, 2001.
[17] Katrin Tomanek and Udo Hahn. Semi-supervised active learning for sequence labeling. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2, pages 1039–1047. Association for Computational Linguistics, 2009.
[18] Lawrence R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286, 1989.
[19] Zhiheng Huang, Yi Chang, Bo Long, Jean-Francois Crespo, Anlei Dong, Sathiya Keerthi, and Su-Lin Wu. Iterative Viterbi A* algorithm for k-best sequential decoding. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1, pages 611–619. Association for Computational Linguistics, 2012.
[20] Steven Bird, Ewan Klein, and Edward Loper. Natural Language Processing with Python. O'Reilly Media, 2009.