<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>K-best Viterbi Semi-supervized Active Learning in Sequence Labelling</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Tomáš Šabata</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tomáš Borovička</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Martin Holeňa</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Faculty of Information Technology, Czech Technical University in Prague</institution>
          ,
          <addr-line>Prague, The Czech Republic</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Institute of Computer Science, Czech Academy of Sciences</institution>
          ,
          <addr-line>Prague</addr-line>
          ,
          <country>The Czech Republic</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2017</year>
      </pub-date>
      <volume>1885</volume>
      <fpage>144</fpage>
      <lpage>152</lpage>
      <abstract>
        <p>In application domains where a large amount of unlabelled data exists but obtaining labels is expensive, active learning is a useful way to select which data should be labelled. In addition to its traditional successful use in classification and regression tasks, active learning has also been applied to sequence labelling. According to the standard active learning approach, the sequences whose labelling would be the most informative should be labelled. However, labelling an entire sequence may be inefficient, since for some of its parts the labels can be predicted using a model, and labelling such parts brings only little new information. Therefore, in this paper we investigate a sequence labelling approach in which, in the sequence selected for labelling, the labels of most tokens are predicted by a model and only the tokens that the model cannot predict with sufficient confidence are labelled manually. Those tokens are identified using the k-best Viterbi algorithm.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>Hidden Markov models (HMMs) and conditional
random fields (CRFs) are very popular models in sequence
labelling tasks such as handwriting recognition, speech
recognition, DNA analysis, video analysis, information
extraction or natural language processing (NLP). They
achieve good results if a high quality and fully annotated
dataset is available. Unfortunately, in these tasks,
obtaining labels for data may be expensive. The annotation cost
is a motivation for using active learning. Active learning
usually begins with a small labelled set L and in each
iteration, the most informative instance of an unlabeled set
U is chosen, annotated by an oracle and added to the set
L. The model is retrained using the extended set L and the
whole process repeats until a stopping criterion is met. This
approach is valuable in tasks where unlabeled data are
easily available but obtaining their labels is expensive. In this
case, it aims at achieving higher accuracy with minimal
cost.</p>
      <p>Nevertheless, labelling long sequences can be
troublesome, in particular for a human annotator who is prone to
create labels of lower quality. To address the problem, we
can combine active learning with semi-supervised
learning. Semi-supervised active learning in sequence labelling
means that a model labels those parts of a sequence that
are easy to predict and lets the annotator focus only on
the parts of sequences that are the most uncertain.</p>
      <p>In this paper, we propose a semi-supervised active
learning approach that uses the k-best Viterbi algorithm
to detect candidates for manual labelling. The proposed
approach was experimentally evaluated on an NLP task,
part-of-speech tagging.</p>
      <p>In the second section, we provide an overview of related
work in active and semi-supervised learning. The third
section recalls some basics of hidden Markov models that
are necessary for understanding the proposed approach,
which is introduced in the fourth section. An experiment
description, its results and their analysis are given in the fifth
section. The paper is concluded by a discussion of the
results and possible future work.</p>
    </sec>
    <sec id="sec-2">
      <title>Related work</title>
      <p>
        While active learning has been studied for classification
and regression tasks [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], less attention has been given to
the task of sequence labelling. Despite this, most of
the algorithms developed for the task of classification can
also be adapted for the task of sequence labelling [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>
        Active learning can be applied in three different
scenarios: pool-based sampling, stream-based selective
sampling and membership query synthesis. The most
commonly used scenario is pool-based sampling, originally
proposed in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. It has been studied for many real-world
problem domains with sequence labelling included. For
example, speech recognition [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], information retrieval [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]
or named entity recognition [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. The main idea of
pool-based active learning is to use a query strategy framework
to find the most informative sample (sequence) in the
unlabelled set (pool) of samples. The selected sample is
annotated and added to the labelled set. The model is
retrained, and the whole process repeats. The second
scenario, stream-based selective sampling, is also possible to
use in sequence labeling [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] but it is used less commonly.
The difference from pool-based sampling is that samples
arrive in a stream and the framework decides whether to
annotate each sample or to discard it. Discarded samples
are never used in training later. The main idea of the third
scenario, membership query synthesis, is that the learner can
query any unlabelled instance, usually generated de novo.
      </p>
      <p>
        Active learning can use one of six different query
strategy frameworks [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. The most commonly used
frameworks are Uncertainty Sampling [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] and
Query-by-Committee [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Uncertainty Sampling selects the sample in
which the model is least confident. Query-by-Committee
maintains a committee of predictors, and the sample on
which the predictors disagree most regarding their
predictions is considered to be the most informative. Other query
strategies applicable to sequences are Expected Gradient
Length, Information Density, Fisher Information and
Future Error Reduction [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The Future Error Reduction
framework is not commonly used due to its high
computational complexity.
      </p>
      <p>
        Semi-supervised learning methods were developed with
the same motivation of a partly unlabeled dataset.
Self-training is a commonly used technique where the
predictor is firstly trained on a small labelled dataset and
then used to annotate data. The most confident labels are
added to the training set, and the predictor is retrained.
Self-training has found application in several tasks of
natural language processing [
        <xref ref-type="bibr" rid="ref10 ref11 ref12">10, 11, 12</xref>
        ]. Another technique,
Co-training, is a multi-learner algorithm where learners
have independent, complementary features of the
dataset and produce labelled examples separately [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].
Semi-supervised learning was also applied to sequence
modelling tasks [
        <xref ref-type="bibr" rid="ref14 ref15">14, 15</xref>
        ].
      </p>
      <p>
          In tasks where a large amount of labelled data is
required (for example, NLP tasks), semi-supervised
learning does not perform well due to the propagation of
many tagging errors through the learning dataset. The
problem of data pollution was partially solved in [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ],
where a human was put into the training loop to correct
labelled examples. However, correcting labelled data can
be time-consuming and is similar to labelling the data from
scratch. To address the problem, a semi-supervised
active learning method which does not need any human inspection
was proposed in [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. The approach uses active
learning to find the most informative sequences. The model
labels the most informative sequences and uses the marginal
probability of each sequence token to decide whether a
prediction is confident. The method has two parameters: a
delay before running the semi-supervised approach and a
confidence threshold. A proper setting of these parameters is
necessary to achieve the desired results.
      </p>
      <p>
        Inspired by the semi-supervised method in [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], we
propose a method that does not need the confidence
threshold parameter, thanks to using the k best Viterbi paths.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Preliminaries</title>
      <p>In this paper, we focus on the task of part-of-speech tagging. For simplicity, our approach is presented by means of an HMM, but it can be extended to a CRF as well. In this section, the principles of an HMM will be recalled.</p>
      <sec id="sec-3-1">
        <title>Hidden Markov Models</title>
        <p>With each HMM, a random process indexed by time is associated, which is assumed to be in exactly one of a set of N distinct states at any time. At regularly spaced discrete times, the system changes its state according to the probabilities of transitions between states. The time steps associated with state changes are denoted t = 1, 2, 3, ..., and the actual state at a time step t is denoted q_t.</p>
        <p>The process itself is assumed to be a first-order Markov chain, described by a matrix of transition probabilities A = {a_{i,j}}, defined as
a_{i,j} = P(q_t = y_j | q_{t-1} = y_i), 1 ≤ i, j ≤ N.  (1)</p>
        <p>A simple observable Markov chain is too restrictive to describe reality. However, it can be extended. Denoting by Y the variable recording the states of the Markov chain, an HMM is obtained by completing Y with a random variable X. In the context of that HMM, X is called the 'observable variable' or 'output variable', whereas Y is called the 'hidden variable'. The hidden variable Y takes values in the set {y_1, y_2, ..., y_N} and the observable variable X takes values in a set {x_1, x_2, ..., x_M}.</p>
        <p>We assume an observation sequence O = o_1 o_2 ... o_T and a state sequence Q = q_1 q_2 ... q_T corresponding to the observation sequence. An HMM can be characterised by three probability distributions:
1. A state transition probability distribution A = {a_{i,j}}.
2. A probability distribution of observable variables B = {b_i(x_k)}, where b_i(k) is the probability of o_t assuming the value x_k if q_t is in the state y_i, defined as
b_i(k) = P(o_t = x_k | q_t = y_i).  (2)
3. An initial state distribution π = {π_i}, defined by π_i = P(q_1 = y_i).</p>
        <p>With these three elements, an HMM is fully defined and denoted θ = (A, B, π).</p>
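        <p>To make the notation concrete, the triple θ = (A, B, π) maps directly onto stochastic matrices. The following sketch uses hypothetical toy numbers (N = 2 hidden states, M = 3 observable values), which are our own illustration, not values from the paper.</p>
        <preformat>
```python
import numpy as np

# Toy HMM theta = (A, B, pi); all numbers are illustrative only.
A = np.array([[0.7, 0.3],       # A[i, j] = P(q_t = y_j | q_{t-1} = y_i)
              [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1],  # B[i, k] = b_i(k) = P(o_t = x_k | q_t = y_i)
              [0.1, 0.3, 0.6]])
pi = np.array([0.6, 0.4])       # pi[i] = P(q_1 = y_i)

# Every row of A and B, and pi itself, is a probability distribution.
assert np.allclose(A.sum(axis=1), 1.0)
assert np.allclose(B.sum(axis=1), 1.0)
assert np.isclose(pi.sum(), 1.0)
```
        </preformat>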
        <p>
          The parameters of an HMM can be learned either in a
semi-supervised way with the Baum-Welch algorithm [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ]
or in a fully-supervised way with the maximum-likelihood
estimation (MLE). In the fully-supervised way, values of
both observable and hidden variables are known.
        </p>
        <p>In the MLE, we assume a training set D = {(o^1, q^1), ..., (o^n, q^n)} of size n whose elements are independent. The MLE consists in taking the parameters θ* that maximise the probability of the training set:
θ* = argmax_θ P(D | θ).  (3)
Due to (1) and (2), the probability in (3) turns into
P(D | θ) = ∏_{i,j} a_{i,j}^{T_{i,j}} ∏_{i,k} [b_i(k)]^{E_i(k)},  with ∑_j a_{i,j} = 1 and ∑_k b_i(k) = 1,
where T_{i,j} stands for the number of transitions from state y_i to state y_j in the training set and E_i(k) stands for the number of emissions of the value x_k in state y_i. The parameters A and B can then be obtained by the following formulas:
a_{i,j} = (T_{i,j} + r_{i,j}) / ∑_{j'} (T_{i,j'} + r_{i,j'})  and  b_i(k) = (E_i(k) + r_i(k)) / ∑_{k'} (E_i(k') + r_i(k')),  (4)
where r_{i,j} and r_i(k) represent our prior beliefs. The prior beliefs are used in the case of an insufficiently large dataset, where the estimate would otherwise lead to zero probabilities of events which never occurred in D.</p>
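        <p>The counting estimate with pseudo-counts in (4) can be sketched as follows. The helper name mle_with_priors and the toy data are hypothetical; in a real POS setting, the integer ids would come from a tagged corpus.</p>
        <preformat>
```python
import numpy as np

def mle_with_priors(sequences, N, M, r_trans=1.0, r_emit=1.0):
    # Count transitions T[i, j] and emissions E[i, k] over labelled
    # (observations, states) pairs, then normalise with uniform
    # pseudo-counts r, as in equation (4).
    T = np.zeros((N, N))
    E = np.zeros((N, M))
    for obs, states in sequences:
        for t in range(len(states) - 1):
            T[states[t], states[t + 1]] += 1
        for o, q in zip(obs, states):
            E[q, o] += 1
    A = (T + r_trans) / (T + r_trans).sum(axis=1, keepdims=True)
    B = (E + r_emit) / (E + r_emit).sum(axis=1, keepdims=True)
    return A, B

# One labelled sequence; words and tags encoded as integer ids.
data = [([0, 1, 2, 1], [0, 0, 1, 1])]
A, B = mle_with_priors(data, N=2, M=3)
# The pseudo-counts keep every probability strictly positive.
assert (A > 0).all() and (B > 0).all()
```
        </preformat>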
        <p>To simplify the notation, we define the variables α and β as follows:
α_t(i) = P(o_1, ..., o_t, q_t = y_i | θ),
β_t(i) = P(o_{t+1}, ..., o_T | q_t = y_i, θ).  (5)</p>
        <p>
          These variables are computed using the forward and backward recursions of the forward-backward algorithm [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ], respectively:
α_1(i) = π_i b_i(o_1),
α_{t+1}(i) = (∑_{j=1}^{N} α_t(j) a_{j,i}) b_i(o_{t+1}),
β_T(i) = 1,
β_t(i) = ∑_{j=1}^{N} a_{i,j} b_j(o_{t+1}) β_{t+1}(j).  (6)
        </p>
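        <p>The forward and backward recursions translate directly into code. Below is a minimal unscaled sketch with hypothetical names and toy numbers; production implementations rescale per step or work in log space to avoid underflow on long sequences.</p>
        <preformat>
```python
import numpy as np

def forward_backward(A, B, pi, obs):
    # Unscaled alpha/beta recursions; illustrative only.
    N, T = A.shape[0], len(obs)
    alpha = np.zeros((T, N))
    beta = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                      # alpha_1(i) = pi_i b_i(o_1)
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]  # forward step
    beta[T - 1] = 1.0                                 # beta_T(i) = 1
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])  # backward step
    return alpha, beta

# Hypothetical toy model.
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
pi = np.array([0.6, 0.4])
alpha, beta = forward_backward(A, B, pi, [0, 1, 2])
# Both directions yield the same sequence likelihood P(O | theta).
assert np.isclose(alpha[-1].sum(), (alpha[0] * beta[0]).sum())
```
        </preformat>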
      </sec>
      <sec id="sec-3-2">
        <title>Marginal probability</title>
        <p>Once the model is learned, it can be used for the prediction of a sequence of hidden states given an observable sequence. In the task of finding the most likely state sequence, it is possible to find the sequence that maximises the expected number of correctly assigned states. From (5), it follows that the marginal probability of being in a specific state y_i at a particular time t is
γ_t(i) = α_t(i) β_t(i) / ∑_{j=1}^{N} α_t(j) β_t(j).  (7)
Maximising the expected number of correctly assigned states can then be achieved by applying q_t = argmax_{y_i ∈ Y} γ_t(i) to the whole sequence. However, this approach can find a sequence with a very low or even zero probability in case the sequence is not feasible.</p>
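        <p>Given α and β, the marginal γ_t(i) and the token-wise decoding are one line each in vectorised form. The α and β values below are made up for illustration (T = 2, N = 2), not taken from any real model.</p>
        <preformat>
```python
import numpy as np

# alpha and beta as would be computed by the forward-backward
# recursions; these particular values are purely illustrative.
alpha = np.array([[0.30, 0.04],
                  [0.10, 0.05]])
beta = np.array([[0.2, 0.3],
                 [1.0, 1.0]])

# gamma_t(i) = alpha_t(i) beta_t(i) / sum_j alpha_t(j) beta_t(j)
ab = alpha * beta
gamma = ab / ab.sum(axis=1, keepdims=True)
assert np.allclose(gamma.sum(axis=1), 1.0)

# Token-wise decoding: pick argmax_i gamma_t(i) at every position t.
q_hat = gamma.argmax(axis=1)
```
        </preformat>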
      </sec>
      <sec id="sec-3-3">
        <title>Viterbi algorithm</title>
        <p>The Viterbi algorithm is a dynamic programming algorithm that finds the most likely state sequence as a whole by maximising P(Q, O | θ). It gradually computes the maximal probability of the state chain from its beginning up to time t, with the state q_t being y_i, represented by the variable δ_t(i) = max_{q_1,...,q_{t-1}} P(q_1, ..., q_t = y_i, o_1, ..., o_t). The algorithm is initialised as follows:
δ_1(i) = π_i b_i(o_1),  (8)
and for each 2 ≤ t ≤ T and each y_i from Y, the algorithm calculates the variable δ_t(i):
δ_t(i) = (max_{1≤j≤N} δ_{t-1}(j) a_{j,i}) b_i(o_t).  (9)</p>
        <p>At each time t and for each node i, the algorithm stores a link to the predecessor node with which it forms the best path. These links are stored in an additional two-dimensional array ψ_t(i), where
ψ_1(i) = 0,
ψ_t(i) = argmax_{1≤j≤N} δ_{t-1}(j) a_{j,i}.
The probability of the most probable sequence can be found as max_{1≤i≤N} δ_T(i), and the most probable state path Q* = (q*_1, q*_2, ..., q*_T) can be found by backtracking:
q*_T = argmax_{1≤i≤N} δ_T(i),
q*_t = ψ_{t+1}(q*_{t+1}).</p>
        <p>The Viterbi algorithm has a similar structure to the forward-backward algorithm, and both have complexity O(N²T).</p>
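        <p>A compact sketch of the 1-best Viterbi recursion (8)-(9) with the ψ backpointers and the backtracking step; the function name and the toy model are hypothetical illustrations.</p>
        <preformat>
```python
import numpy as np

def viterbi(A, B, pi, obs):
    # 1-best decoding following (8)-(9) with psi backpointers; O(N^2 T).
    N, T = A.shape[0], len(obs)
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A       # scores[j, i] = delta_{t-1}(j) a_{j,i}
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * B[:, obs[t]]
    path = [int(delta[-1].argmax())]             # q*_T
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))       # q*_t = psi_{t+1}(q*_{t+1})
    return path[::-1], delta[-1].max()

# Hypothetical toy model.
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
pi = np.array([0.6, 0.4])
path, p = viterbi(A, B, pi, [0, 0, 2])
```
        </preformat>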
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Proposed approach</title>
      <p>
        Our proposed approach is an adaptation of the semi-supervised
active learning method (SeSAL), originally
proposed in [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. Both SeSAL and our adaptation are based
on a standard fully-supervised active learning algorithm
(FuSAL). The concept of the FuSAL algorithm is described by
the pseudocode in Algorithm 1.
      </p>
      <p>
        A utility function φ_M(x) represents the informativeness
of a sample x given the model M. In the algorithm, any
utility function can be used to find the most informative
sequence [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>In SeSAL, the most informative instance is annotated by the model M, and only the tokens whose predicted labels have a confidence smaller than a given threshold are given to a human annotator (oracle). Finding the optimal threshold value is an optimisation task minimising both the dataset pollution and the number of queried labels. If the threshold is too high, the human annotates labels in which the model is already well confident. On the other hand, if the threshold is too low, the algorithm accepts incorrectly labelled tokens, which may result in a polluted training set.</p>
      <p>SeSAL uses a parameter called delay that represents the number of iterations of FuSAL performed before the algorithm is switched to SeSAL. This helps to avoid errors coming from incorrect labels produced by an insufficiently converged model.</p>
      <p>Algorithm 1: FuSAL algorithm</p>
      <sec id="sec-4-1">
        <title>Given:</title>
        <p>L: set of labelled examples
U: set of unlabelled examples
φ_M: utility function
Algorithm:
1: while stopping criterion is not met do
2:   learn model M from L
3:   for all x_i ∈ U: u_{x_i} ← φ_M(x_i)
4:   select x* = argmax_{x_i} u_{x_i}
5:   query an oracle for labels y of x*
6:   remove x* from U
7:   insert &lt;x*, y&gt; into L
8: end while</p>
      </sec>
      <sec id="sec-4-2">
        <title>k-best Viterbi SeSAL</title>
        <p>In our approach, the confidence of labels is replaced by calculating the k best Viterbi paths to find the tokens where the predictions of the model differ among the k most likely sequences. The number of paths affects the behaviour of the algorithm; however, we assume this parameter to be less data-dependent than the confidence threshold. We call the approach k-best Viterbi SeSAL. Its pseudocode is described in Algorithm 2.</p>
        <p>Algorithm 2 k-best Viterbi SeSAL algorithm</p>
      </sec>
      <sec id="sec-4-3">
        <title>Given:</title>
        <p>
          L: set of labelled examples
U: set of unlabelled examples
φ_M: utility function
k: number of paths
Algorithm:
1: while stopping criterion is not met do
2:   learn model M from L
3:   for all x_i ∈ U: u_{x_i} ← φ_M(x_i)
4:   select x* = argmax_{x_i} u_{x_i}
5:   find the k best Viterbi paths {v_1, ..., v_k}
6:   for t in length(x*) do
7:     if v_i(t) for all i = 1, ..., k are equal then
8:       label x*(t) with y(t) = v_1(t)
9:     else
10:      query an oracle for a label y(t) of x*(t)
11:    end if
12:  end for
13:  remove x* from U
14:  insert &lt;x*, y&gt; into L
15: end while
The simplest way to modify the Viterbi algorithm is to store up to k best predecessors per node, which can form the k best sequences. Unfortunately, with this modification, the algorithm has a computational complexity of O(kTN²). This computational overhead can be lowered by the iterative Viterbi A* algorithm, which has a complexity of O(T + kT) in the best case and O(TN² + kTN) in the worst case [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ].
        </p>
        <p>With the k best Viterbi paths found, the algorithm loops through the decoding (lines 6-12). A label is accepted only if all k sequences agree on it. Otherwise, a human annotator (oracle) is asked to label the token.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Experiment and results</title>
      <p>In this section, we describe an experiment used for the
evaluation of the proposed method. The method is
evaluated on an NLP task called part-of-speech tagging (POS).
The input to the POS is a set of meaningful sentences. The
output is a set of tag sequences, one tag for each word.
Word classes (noun, verb, adjective, etc.) or their
derivatives are the most commonly used tagsets. The number of tags is
not limited.</p>
      <p>POS is a difficult task for two reasons. First, the
number of possible words in the text can be very high, and it
may contain words that occur rarely. Second, some words
can have several possible tags, and to find the correct tag,
the context of the sentence is needed. CRFs can take a
wide context into account and are thus most commonly
used for POS. However, though it is impossible to take
a wide context into account in an HMM, it is a sufficiently
well-performing model for our experiment.</p>
      <p>
        In our experiment, we used data from the Natural
language toolkit [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ], which provides data for many NLP
tasks such as POS, chunking, entity recognition,
information extraction, etc. A few statistics for the employed
benchmark datasets are provided in Table 1. Each
dataset contains its own tagset and a simplified tagset with
12 tags representing ten basic word classes, a dot and the
rest.
      </p>
      <p>The proposed approach uses the FuSAL active learning framework to find the most informative instance (lines 2-4). Then, semi-supervised learning is applied in order to label the instance. The algorithm computes the k best Viterbi sequences, which are used to detect unlikely labels (line 5).</p>
      <p>The Viterbi algorithm described in Section 3.3 provides only one best sequence. To produce the k best sequences, it is not enough to store only one best label per node.</p>
      <p>In order to compare the datasets, HMMs were trained
using supervised learning on the full dataset with all labels
available. Accuracy and the F1 score measures were used
for the performance comparison. The performance was
measured for both the original tagset (Acc 1 and F-score 1)
and the simplified tagset (Acc 2 and F-score 2). The data
was randomly split into training and testing sets in a 7:3
ratio. The performance of the supervised learning is shown
in Table 2. Due to the results in the table, we consider
HMM to be sufficiently well performing in the experiment.
The worse F-score in the case of the Brown dataset with all
tags is caused by an approximately ten times higher number
of possible hidden values.</p>
      <p>For most of the experiments, we used the following
settings. In the base model, HMM, tags were considered to
be hidden state values and words were considered to be
observable variable values. The parameters of the model
were estimated using MLE. To handle words that have not
occurred in the training set, we added uniformly
distributed pseudo-counts to both matrices A and B. Prior beliefs
were set to be uniformly distributed, therefore, each word
has the probability of 1/|words|.</p>
      <p>In order to simulate a standard situation in active
learning, the original dataset was randomly split into training
and testing sets in a 7:3 ratio and then, the training set was
randomly split into labelled and unlabeled sets in a 3:7
ratio.</p>
      <p>In each iteration of the experiment, the most informative instance was selected, annotated and put into the labelled training set. The most informative instances were those maximising one of the following four uncertainty measures:
• least confidence: φ_LC(x) = 1 − P(y*_1 | x; θ),
• margin: φ_M(x) = −(P(y*_1 | x; θ) − P(y*_2 | x; θ)),
• total token entropy: φ_TE(x) = − ∑_{t=1}^{T} ∑_{n=1}^{N} P(y_t = n | x; θ) log P(y_t = n | x; θ),
• k-best sequence entropy: φ_SE(x) = − ∑_{ŷ ∈ V} P(ŷ | x; θ) log P(ŷ | x; θ),
where V is the set of the k best Viterbi sequences and y*_k is the k-th most probable sequence of labels. The behaviour of the different uncertainty measures is investigated in the experiment in Section 5.2.</p>
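      <p>As an illustration of the least confidence measure, φ_LC(x) can be computed from the Viterbi maximum P(Q*, O | θ) and the forward probability P(O | θ), assuming P(y*_1 | x; θ) = P(Q*, O | θ) / P(O | θ). The helper names and the toy model below are hypothetical.</p>
      <preformat>
```python
import numpy as np

def sequence_prob(A, B, pi, obs):
    # P(O | theta) via the forward recursion.
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
    return alpha.sum()

def least_confidence(A, B, pi, obs):
    # phi_LC(x) = 1 - P(y*_1 | x; theta); the numerator of
    # P(y*_1 | x; theta) is the Viterbi maximum P(Q*, O | theta).
    delta = pi * B[:, obs[0]]
    for o in obs[1:]:
        delta = (delta[:, None] * A).max(axis=0) * B[:, o]
    return 1.0 - delta.max() / sequence_prob(A, B, pi, obs)

# Hypothetical toy model.
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
pi = np.array([0.6, 0.4])
lc = least_confidence(A, B, pi, [0, 1, 2])
assert 0.0 <= lc < 1.0
```
      </preformat>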
      <p>After finding the most informative sequence, semi-supervised learning was applied. The sequence was labelled according to Algorithm 2. The algorithm has one parameter, the number k of best sequences. The effect of this parameter on the performance of the proposed approach is described in the experiment in Section 5.3.</p>
      <sec id="sec-5-1">
        <title>Uncertainty measure</title>
        <p>First, we study the effects of the uncertainty measures on the proposed method. The measures were evaluated on the TreeBank dataset with 30% labelled instances. The parameter k was set to 100.</p>
        <p>The experiment has shown that the computational complexity of the k-best sequence entropy measure and the margin measure is too high for practical usage, due to the calculation of the k best Viterbi paths (the two best Viterbi paths, respectively) for each unlabelled instance. Moreover, active learning that uses the k-best sequence entropy as a measure had a tendency to choose short sentences. In that case, active learning had a lower accuracy than the random sampling method.</p>
        <p>The computational complexity of the least confidence and total token entropy measures was reasonable even for datasets with a large number of unlabelled samples. The performance comparison is shown in Figures 1 and 2. According to the experimental results, FuSAL with the least confidence measure achieved higher accuracy after 50 iterations. However, the total token entropy measure achieved a given level of accuracy with fewer queried tokens, which can be preferable for some tasks.</p>
        <p>Taking into account the computational complexity of the methods, the least confidence measure is used in the rest of the experiment.</p>
      </sec>
      <sec id="sec-5-2">
        <title>Parameter settings</title>
        <p>In semi-supervised learning, a well-performing model is crucial to produce good-quality labels. In the SeSAL algorithm, the parameter delay controls how many iterations of the FuSAL algorithm are used before the semi-supervised approach is applied. The goal of this experiment was to analyse the relationship between the parameter delay and the parameter k. Since the proposed method does not use the delay parameter, it has been simulated using datasets with different numbers of labelled samples. The experiment was evaluated on the biggest dataset, Brown, with three initial settings: a) 10% of labelled samples, b) 30% of labelled samples, c) 60% of labelled samples.</p>
        <p>It has been shown that the value of the parameter k is
highly correlated with the number of labelled samples in
the dataset. In the dataset with 10% of labelled samples,
the high value of the parameter k has shown to be crucial
to reduce the number of errors propagated to the training
dataset (Figure 3). With increasing number of labelled
samples, a high value of the parameter k becomes less
effective. In Figure 4, the difference between k = 100
and k = 200 almost vanished. Moreover, regarding
the number of queried labels, the setting k = 100 becomes
more efficient (Figure 5). The same trend was also
observed in the case where 60% of instances were labelled.
</p>
        <p>[Figures: "Least confident vs Total Token entropy", accuracy plotted against the number of queried sentences.]</p>
        <p>The parameter k affects the number of queried tokens and the number of errors propagated to the learning set. The optimal setting of the parameter minimises both. The experiment in this section analyses the relationship between the queried tokens and the propagated errors.</p>
        <p>One should consider the number of labelled samples when setting the parameter k. In the case of fewer labelled samples, the parameter k should be set to a higher number to avoid producing errors (Figure 6). After several iterations, when the base model is more accurate, higher values of the parameter k become less effective (Figure 7).</p>
        <p>However, even with an almost fully labelled dataset and the setting k = 200, we were not able to avoid errors in labelling. Of all 2402 annotated tokens, 57 were annotated wrongly. We consider the complicated control of an acceptable error rate to be one of the biggest disadvantages of the proposed method.</p>
      </sec>
      <sec id="sec-5-3">
        <title>Comparison with other methods</title>
        <p>To evaluate the performance of the proposed method in comparison with other methods, accuracy was measured with respect to the number of queried sentences and the number of queried tokens. Furthermore, the number of errors propagated to the learning set was measured. All experiments were evaluated on the Brown dataset with the simplified tagset.</p>
        <p>The SeSAL with an uncertainty threshold and the proposed method can be compared only if the parameters are set such that the methods produce approximately the same number of errors. In the experiment, the confidence threshold was set to 0.48 and the number of paths k was set to 100.</p>
      <p>As expected, the FuSAL method achieved the highest accuracy because all labels were annotated manually, and thus correctly. Regarding the number of queried tokens, Viterbi SeSAL achieved higher accuracy at the cost of more queried tokens (Figure 8). The explanation can be seen in Figure 9, where the number of errors and the number of queried tokens were measured. In the given settings, the number of errors propagated to the learning set was lower in Viterbi SeSAL at the expense of the number of queried tokens. However, after several iterations, the error rate of the proposed method became lower than that of the SeSAL method.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Conclusion and future work</title>
      <p>We proposed a semi-supervised active learning method that is easy to set up for sequence labelling and performs sufficiently well in comparison with the semi-supervised active learning method that uses an uncertainty threshold and a marginal probability. The proposed method uses the k best Viterbi paths to find the tokens in which the model is not sufficiently confident.</p>
      <p>[Figure: accuracy versus the number of queries for SeSAL (threshold 0.48) and k-best Viterbi SeSAL (k = 100).]</p>
      <p>The number of errors, the number of queried tokens and the computational complexity are controlled by the parameter k. In order to reduce the number of errors propagated to the learning set, the parameter k should be set as high as is reasonable in terms of computational time. The computational complexity of the k-best Viterbi path algorithm can be partially reduced using the iterative Viterbi A* algorithm. In addition to the high computational complexity, the complicated control of the number of propagated errors is a disadvantage of the proposed method.</p>
      <p>An area for further research is the exploration of Co-training in combination with the Query-by-Committee active learning framework, as both approaches consider several different views of the data. Furthermore, semi-supervised active learning methods that can be applied to both probabilistic and deterministic sequential models should be studied further to find a general solution for them.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgements</title>
      <p>The reported research was supported by the CTU grant
nr. SGS17/210/OHK3/3T/18 and by the Czech Science
Foundation grant nr. 17-01251.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Burr</given-names>
            <surname>Settles</surname>
          </string-name>
          .
          <article-title>Active learning literature survey</article-title>
          . Computer Sciences Technical Report 1648, University of Wisconsin-Madison,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Burr</given-names>
            <surname>Settles</surname>
          </string-name>
          and
          <string-name>
            <given-names>Mark</given-names>
            <surname>Craven</surname>
          </string-name>
          .
          <article-title>An analysis of active learning strategies for sequence labeling tasks</article-title>
          .
          <source>In Proceedings of the conference on empirical methods in natural language processing</source>
          , pages
          <fpage>1070</fpage>
          -
          <lpage>1079</lpage>
          . Association for Computational Linguistics,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>David D</given-names>
            <surname>Lewis</surname>
          </string-name>
          and
          <string-name>
            <given-names>William A</given-names>
            <surname>Gale</surname>
          </string-name>
          .
          <article-title>A sequential algorithm for training text classifiers</article-title>
          .
          <source>In Proceedings of the 17th annual international ACM SIGIR conference on Research and development in information retrieval</source>
          , pages
          <fpage>3</fpage>
          -
          <lpage>12</lpage>
          . Springer-Verlag New York, Inc.,
          <year>1994</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Gokhan</given-names>
            <surname>Tur</surname>
          </string-name>
          , Dilek Hakkani-Tür, and
          <string-name>
            <given-names>Robert E</given-names>
            <surname>Schapire</surname>
          </string-name>
          .
          <article-title>Combining active and semi-supervised learning for spoken language understanding</article-title>
          .
          <source>Speech Communication</source>
          ,
          <volume>45</volume>
          (
          <issue>2</issue>
          ):
          <fpage>171</fpage>
          -
          <lpage>186</lpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Cynthia A</given-names>
            <surname>Thompson</surname>
          </string-name>
          , Mary Elaine Califf, and
          <string-name>
            <given-names>Raymond J</given-names>
            <surname>Mooney</surname>
          </string-name>
          .
          <article-title>Active learning for natural language parsing and information extraction</article-title>
          .
          <source>In ICML</source>
          , pages
          <fpage>406</fpage>
          -
          <lpage>414</lpage>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Lin</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Chengjie</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Shaofeng</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Xiaolong</given-names>
            <surname>Wang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Xuan</given-names>
            <surname>Wang</surname>
          </string-name>
          .
          <article-title>CRF-based active learning for Chinese named entity recognition</article-title>
          .
          <source>In Systems, Man and Cybernetics, 2009. SMC 2009. IEEE International Conference on</source>
          , pages
          <fpage>1557</fpage>
          -
          <lpage>1561</lpage>
          . IEEE,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Ido</given-names>
            <surname>Dagan</surname>
          </string-name>
          and
          <string-name>
            <given-names>Sean P</given-names>
            <surname>Engelson</surname>
          </string-name>
          .
          <article-title>Committee-based sampling for training probabilistic classifiers</article-title>
          .
          <source>In Proceedings of the Twelfth International Conference on Machine Learning</source>
          , pages
          <fpage>150</fpage>
          -
          <lpage>157</lpage>
          . The Morgan Kaufmann series in machine learning,(San Francisco, CA, USA),
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>David D</given-names>
            <surname>Lewis</surname>
          </string-name>
          and
          <string-name>
            <given-names>Jason</given-names>
            <surname>Catlett</surname>
          </string-name>
          .
          <article-title>Heterogeneous uncertainty sampling for supervised learning</article-title>
          .
          <source>In Proceedings of the eleventh international conference on machine learning</source>
          , pages
          <fpage>148</fpage>
          -
          <lpage>156</lpage>
          ,
          <year>1994</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>H Sebastian</given-names>
            <surname>Seung</surname>
          </string-name>
          , Manfred Opper, and
          <string-name>
            <given-names>Haim</given-names>
            <surname>Sompolinsky</surname>
          </string-name>
          .
          <article-title>Query by committee</article-title>
          .
          <source>In Proceedings of the fifth annual workshop on Computational learning theory</source>
          , pages
          <fpage>287</fpage>
          -
          <lpage>294</lpage>
          . ACM,
          <year>1992</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>David</given-names>
            <surname>Yarowsky</surname>
          </string-name>
          .
          <article-title>Unsupervised word sense disambiguation rivaling supervised methods</article-title>
          .
          <source>In Proceedings of the 33rd annual meeting on Association for Computational Linguistics</source>
          , pages
          <fpage>189</fpage>
          -
          <lpage>196</lpage>
          . Association for Computational Linguistics,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Ellen</given-names>
            <surname>Riloff</surname>
          </string-name>
          , Janyce Wiebe, and Theresa Wilson.
          <article-title>Learning subjective nouns using extraction pattern bootstrapping</article-title>
          .
          <source>In Proceedings of the seventh conference on Natural language learning at HLT-NAACL 2003-Volume</source>
          <volume>4</volume>
          , pages
          <fpage>25</fpage>
          -
          <lpage>32</lpage>
          . Association for Computational Linguistics,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Chuck</given-names>
            <surname>Rosenberg</surname>
          </string-name>
          , Martial Hebert, and
          <string-name>
            <given-names>Henry</given-names>
            <surname>Schneiderman</surname>
          </string-name>
          .
          <article-title>Semi-supervised self-training of object detection models</article-title>
          .
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Avrim</given-names>
            <surname>Blum</surname>
          </string-name>
          and Tom Mitchell.
          <article-title>Combining labeled and unlabeled data with co-training</article-title>
          .
          <source>In Proceedings of the eleventh annual conference on Computational learning theory</source>
          , pages
          <fpage>92</fpage>
          -
          <lpage>100</lpage>
          . ACM,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Andrew M.</given-names>
            <surname>Dai</surname>
          </string-name>
          and
          <string-name>
            <given-names>Quoc V.</given-names>
            <surname>Le</surname>
          </string-name>
          .
          <article-title>Semi-supervised sequence learning</article-title>
          .
          <source>CoRR, abs/1511.01432</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Shi</given-names>
            <surname>Zhong</surname>
          </string-name>
          .
          <article-title>Semi-supervised sequence classification with HMMs</article-title>
          .
          <source>International Journal of Pattern Recognition and Artificial Intelligence</source>
          ,
          <volume>19</volume>
          (
          <issue>02</issue>
          ):
          <fpage>165</fpage>
          -
          <lpage>182</lpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>David</given-names>
            <surname>Pierce</surname>
          </string-name>
          and
          <string-name>
            <given-names>Claire</given-names>
            <surname>Cardie</surname>
          </string-name>
          .
          <article-title>Limitations of co-training for natural language learning from large datasets</article-title>
          .
          <source>In Proceedings of the 2001 Conference on Empirical Methods in Natural Language Processing</source>
          , pages
          <fpage>1</fpage>
          -
          <lpage>9</lpage>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Katrin</given-names>
            <surname>Tomanek</surname>
          </string-name>
          and
          <string-name>
            <given-names>Udo</given-names>
            <surname>Hahn</surname>
          </string-name>
          .
          <article-title>Semi-supervised active learning for sequence labeling</article-title>
          .
          <source>In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2-</source>
          Volume
          <volume>2</volume>
          , pages
          <fpage>1039</fpage>
          -
          <lpage>1047</lpage>
          . Association for Computational Linguistics,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>Lawrence R</given-names>
            <surname>Rabiner</surname>
          </string-name>
          .
          <article-title>A tutorial on hidden Markov models and selected applications in speech recognition</article-title>
          .
          <source>Proceedings of the IEEE</source>
          ,
          <volume>77</volume>
          (
          <issue>2</issue>
          ):
          <fpage>257</fpage>
          -
          <lpage>286</lpage>
          ,
          <year>1989</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Zhiheng</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Yi</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Bo</given-names>
            <surname>Long</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Jean-Francois</given-names>
            <surname>Crespo</surname>
          </string-name>
          , Anlei Dong, Sathiya Keerthi, and
          <string-name>
            <given-names>Su-Lin</given-names>
            <surname>Wu</surname>
          </string-name>
          .
          <article-title>Iterative Viterbi A* algorithm for k-best sequential decoding</article-title>
          .
          <source>In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers-Volume</source>
          <volume>1</volume>
          , pages
          <fpage>611</fpage>
          -
          <lpage>619</lpage>
          . Association for Computational Linguistics,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>Steven</given-names>
            <surname>Bird</surname>
          </string-name>
          , Ewan Klein, and
          <string-name>
            <given-names>Edward</given-names>
            <surname>Loper</surname>
          </string-name>
          .
          <source>Natural Language Processing with Python</source>
          . O'Reilly Media,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>