<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Tensor Decomposition-Based Training Method for High-Order Hidden Markov Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Matej Cibul'a</string-name>
          <email>matej.cibula@fel.cvut.cz</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Radek Marˇík</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Telecommunication Engineering, Faculty of Electrical Engineering, Czech Technical University in Prague</institution>
          ,
          <addr-line>166 27 Prague</addr-line>
          ,
          <country country="CZ">Czech Republic</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Hidden Markov models (HMMs) are one of the most widely used unsupervised-learning algorithms for modeling discrete sequential data. Traditionally, most of the applications of HMMs have utilized only models of order 1 because higher-order models are computationally hard to train. We reformulate HMMs using tensor decomposition to efficiently build higher-order models with the use of stochastic gradient descent. Based on this, we propose a new modified version of a training algorithm for HMMs, especially suitable for high-order HMMs. Further, we show its capabilities and convergence on synthetic data.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>
        Hidden Markov models (HMMs) are a popular member
of the group of unsupervised-learning algorithms. HMMs
have been extensively used in a variety of fields, including
bioinformatics[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], speech recognition[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], image
processing [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], etc. Nevertheless, most of the works on the topic
of HMMs considered only models of an order 1.
      </p>
      <p>
        Hidden Markov Models of orders higher than 1 have
been used very scarcely even though it has been
demonstrated that higher-order HMMs provide substantial
improvements in the areas of bioinformatics [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], speech
recognition [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]) or image processing [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Higher-order
HMMs can be replaced with a model of an order 1 with
with the same amount of free parameters but with
considerably higher amount of hidden states [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. However,
increased amount of hidden states obstructs their
interpretability. The complexity of parameter estimation of
higher-order models has been the main reason for their
scarce use.
      </p>
      <p>
        Traditionally, HMMs are trained using the Baum-Welch
algorithm [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] which itself is a particular case of the
Expectation-Maximization algorithm. The algorithm
requires non-trivial modifications to be applicable to
higherorder HMMs and still be able to converge consistently to
optima [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Another approach is to reformulate
higherorder HMM into an HMM of an order 1 [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
Furthermore, the number of HMM parameters that need to be
estimated grows exponentially with the model’s order.
      </p>
      <p>
        Another approach to estimate parameters of an HMM
utilizes gradient methods [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. Traditionally,
deterministic gradient descent has been used for training purposes
because of the added advantage of online training [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].
Contrary to other training methods, training algorithms
utilizing gradient descent can be provided data
sequentially, one-by-one, and the complete dataset does not need
to be available during the whole training process. With the
improvement of stochastic gradient descent (SGD)
techniques [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], those have been applied to HMM parameter
estimation as well [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. The SGD approach provides better
convergence guarantees and is preferable to its
deterministic counterpart.
      </p>
      <p>
        The least commonly used approach involves Markov
Chain Monte Carlo (MCMC) sampling [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. Its main
advantage is in providing a complex distribution of HMM
parameters instead of only computing their point estimates.
In practice, other techniques are preferred for their lower
computational time.
      </p>
      <p>
        Lastly, the spectral algorithm is the newest one for
training HMMs [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. As opposed to the previously discussed
methods, this algorithm optimizes observation-operator
representation of HMMs instead of directly modifying
HMM parameters. The most significant advantage of this
approach is its provable convergence to global optima
(under mild assumptions on the model parameters). Another
factor is the fact that this is the only non-iterative training
algorithm. However, this learning method has only been
proposed for HMMs of order 1.
      </p>
      <p>Higher-order models involve the use of tensors instead
of matrices. When it comes to use of tensors in practice
in general, we often want to find a tensor of fixed rank
that approximates the original one. Such approximations
have generally advantageous properties when it comes to
noise reduction in the original data. From a certain point of
view, low-rank tensor approximation is similar to a
dimensionality reduction using singular value decomposition for
matrices.</p>
      <p>
        The algorithm that we propose combines the modern
stochastic gradient descent optimization with tensor
decomposition methods. We utilize the Canonical Polyadic
(CP) decomposition [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ], hence, we refer to the
algorithm as the CP-Decomposition-Stochastic Gradient
Descent (CPD-SGD) algorithm.
      </p>
      <p>In the next section, we introduce a notation and provide
basic definitions and background to our work. Following,
in Section 3, we propose a new method and point out its
most essential ideas. We experimentally evaluate our
algorithm and discuss the results in Section 4.
1.1</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        The idea of using the stochastic gradient descent for HMM
optimization was explored before, for example, in [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], or
[
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. However, it has been only applied to HMMs of an
order 1, while this work focuses on higher-order models.
Furthermore, both works used the traditional
representation of an HMM instead of the observation-operator
representation utilized in this paper.
      </p>
      <p>
        The application of tensor decomposition methods to
HMMs has been partially explored in the works [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]
and [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]. In [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ], authors combine the Tensor-Train
decomposition [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] with the spectral algorithm for
HMMs [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], as opposed to our work where we combine
CP-decomposition and SGD. Additionally, the authors do
not consider higher-order models, leaving any possible
extension in this direction unmentioned.
      </p>
      <p>
        In [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], the author combine tensor decomposition with
Markov chain Monte Carlo methods. However, their
model is considerably more complicated and, again,
considers only the standard HMM representation. On the
other hand, it can straightforwardly provide confidence
intervals for estimated parameters instead of just a point
estimate.
1.2
      </p>
    </sec>
    <sec id="sec-3">
      <title>Our Contribution</title>
      <p>
        We propose a new method of training hidden Markov
models. It is based on recent advancements in the fields of
tensor decomposition and stochastic gradient methods for
optimization. The most important advantage of the approach
is its applicability to the training of HMMs of higher
orders. The applicability has been achieved by
introducing multiple tricks into the original gradient descent
algorithm [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] for the training of HMMs.
      </p>
      <p>Firstly, we introduce a permutation decomposition for
transition-probability tensor, which provides possibly even
an exponential improvement in the memory required to
store those probabilities. We achieve this by clever reuse
of vectors representing individual states.</p>
      <p>
        This work focuses solely on the CP-decomposition,
although there have been proposed multiple other
decompositions [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ], such as Tucker decomposition [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ], and
tensor train (TT) decomposition [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ].
      </p>
      <p>
        Secondly, we modify the operator representation of the
HMM (similar to [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]) to allow for the optimization of
higher-order models. We use this representation during
the stochastic gradient descent instead of the "raw" model
that has been commonly used before.
      </p>
      <p>
        Thirdly, we modify the computation of the
observation probabilities given a fixed state by the application of
Bayes’s theorem and use of stationary distribution of the
Markov chains [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ]. This modification provides
performance improvements, especially in cases where the
number of possible different observations is far higher than the
number of hidden states.
2
2.1
      </p>
      <sec id="sec-3-1">
        <title>Preliminaries</title>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Notation</title>
      <p>
        In this work, we follow the generally used notation, based
on [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ] in a case of tensors, and [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] for Hidden
Markov Models. Bold capital letters (A) denote matrices,
bold lowercase letters (a) denote vectors and normal
lowercase letters (a) denote scalars. Tensors are denoted by
uppercase calligraphy letters, such as A . In ambiguous
cases, we add an arrow over vectors, for example, !p.
2.2
      </p>
    </sec>
    <sec id="sec-5">
      <title>Tensor Decomposition</title>
      <p>Tensors are multidimensional arrays. They can be
considered to be a generalization of vectors and matrices into
higher dimensions, where vector is a 1st-order tensor and
matrix is a 2nd-order tensor, respectively.</p>
      <p>The Kronecker product of A 2 Rk l and B 2 Rm n is
denoted as A B. It applies that A B 2 Rkm ln and it is
defined as</p>
      <p>A
0a1;1B</p>
      <p>Ba2;1B
B = BB@ ...</p>
      <p>ak;1B
a1;2B
a2;2B
.
.</p>
      <p>.
ak;2B
. . .</p>
      <p>a1;l B1
a2;l BC
. C :
. C
. A
ak;l B</p>
      <p>The Khatri-Rao product is the column-wise Kronecker
product. Given A 2 Rk m and B 2 Rl m, their Khatri-Rao
product is denoted A J B, A B 2 Rkl m, and defined as
A</p>
      <p>B = [a1
b1
: : :
am
bm] :</p>
      <p>
        For decomposition algorithms, the operation of tensor
matricization is commonly used [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ]. Tensor
matricization, often referred to as reshaping or flattening, is a
process of reordering elements of a higher-order (&gt; 2) tensor
into a matrix. All elements of a tensor must be reused;
hence, the amount of memory required to store a tensor
is the same as to store its matricized form. The standard
forms of matricized tensors are mode-n unfoldings.
      </p>
      <p>Let X 2 RI1 Ik . If an element in the tensor has
indices (i1; : : : ; ik), then it will have indices (in; j) in its
mode-n unfolding, where j is defined as</p>
      <p>k
j = 1 + å (il
l=1
l6=n
1) Jl ;</p>
      <p>l 1
where Jl = Õ Im:
m=1
m6=n
We denote a mode-n unfolding of a tensor X as Xn.</p>
      <p>The tensor rank of X , denoted rank(X ), is defined as
Lastly, let p 2 Rn n be of order 1 less than the tensor
T . It defines an initial hidden state distribution by
(
rank(X ) = min r 2 N
9a11 : : : ark</p>
      <p>r
X = å ai1
i=1
aik
!)
;
where the symbol represents the vector outer product.</p>
      <p>The above definition of the tensor rank gives rise to the
tensor rank decomposition. Given a tensor X 2 RI1 Ik
of a rank q, it can be decomposed as</p>
      <p>q
X = å liai1
i=1
qˆ
Xˆ = å liai1</p>
      <p>i=1
jjX</p>
      <p>ˆ</p>
      <p>X jjF :</p>
      <p>Here, li 2 R are normalizing coefficients, ensuring that
all components are of the same scale while at the same
time making the convergence during computation quicker.</p>
      <p>
        Trivially, each tensor has a finite rank. However,
estimating the tensor rank is an NP-hard problem [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ]. In
applications, we often want to find a tensor of fixed rank
that approximates the original one.
      </p>
      <p>The rank-qˆ canonical polyadic (CP) decomposition of a
tensor X of an order k is a problem of finding such vectors
aij that
and Xˆ minimizes
aik:
aik;
2.3</p>
    </sec>
    <sec id="sec-6">
      <title>Hidden Markov Models</title>
      <p>The Hidden Markov Model (HMM) defines a
probability distribution over sequences from a finite alphabet.
Additionally, it associates each sequence of observations
o1 : : : ot with a sequence of hidden states of the same length
s1 : : : st . There is a finite amount of different states. We
denote the set of all possible hidden states as [n] = f1; : : : ; ng,
and the set of all possible observations, a.k.a. the finite
alphabet of sequences, as [m] = f1; : : : ; mg.</p>
      <p>It is always assumed that the HMM satisfies 2
conditional independence properties. Firstly, one states that
hidden state st depends only on a finite amount of previous
states, typically only 1, and nothing else. Secondly, one
requires that the observation ot at a time t is only
dependent on the hidden state st at that given moment t and no
other element of the sequence itself or other hidden states.</p>
      <p>Formally, the HMM of the order r 1 can be described
by 3 matrices and/or tensors. Let T 2 Rn n be a tensor
of state transition probabilities where</p>
      <p>[T ]i1;:::;ir = Pr [st = irjst 1 = ir 1; : : : ; st r+1 = i1] :
Similarly, let O 2 Rn m be a matrix of observation
probabilities with
[O]i; j = Pr [ot = jjst = i] :
[p]i1;:::;ir 1 = Pr [sr 1 = ir 1; : : : ; s1 = i1] :</p>
      <p>In the most common case in the literature, the hidden
state at the time t is assumed to only depend on its
immediate predecessor. This means that the transition between
states can be modeled as a Markov chain of order 1.
Consequently, the previously defined parameters are reduced
to: T 2 Rn n, O 2 Rn m and !p 2 Rn.</p>
      <p>
        Authors in [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ], [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ] showed that such HMM of the
order 1 can be represented by the means of observation
operators. Those operators are a sufficient and computationally
efficient way to evaluate observation-sequence
probabilities. The probability of a particular sequence o1; : : : ; ot is
given by
      </p>
      <p>
        Pr [o1; : : : ; ot ] = !1 mAot : : : Ao1 !p;
where the observation operators Ai are defined [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] as
      </p>
      <p>Ai = Tdiag [O]1;i ; : : : ; [O]n;i :</p>
      <p>The standard strategy to estimate the HMM parameters
is to maximize the provided samples’ likelihood. As the
hidden states are not directly observable, algorithms often
resort to heuristics or other similar methods without any
convergence guarantees.
3
3.1</p>
      <sec id="sec-6-1">
        <title>CPD-SGD Algorithm</title>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>Permutation Decomposition</title>
      <p>We assume that we have a HMM of an order r 1
described by parameters T ; O; p. The transition-probability
tensor T has a finite tensor rank, let’s denote it as q1.
Hence, it has a CP-decomposition</p>
      <p>q1
T = å liti1</p>
      <p>i=1
T = åq1 ti1
i=1
tir:
tir:
Further, as we can incorporate the normalizing coefficients
li into the individual components, we will omit them in the
following, writing only
This omission is based on two facts. Firstly, all of the tij
vectors will be inferred during the training of the model.
Secondly, we do not need to constrain decomposition
components into a certain form, as we do not work with
them directly and the tensor approximation is invariant to
the way they are incorporated into decomposition
components. Hence, omitting of li coefficients results in a lower
number of parameters to be estimated.</p>
      <p>Every hidden state is now represented by r vectors of
length q1. Each one of them is for one position relative to
the current state, a.k.a. the first one defines the influence
of the given state on the state observed r 1 steps in the
future from it. The next to last vector defines the
influence of the given state on the following one, and the last
vector defines the behavior of the current state based on
history. For a fixed hidden state i and for j 2 f1; : : : ; rg,
those vectors have a form</p>
      <p>t j i t2j i : : : htq1 i iT :
h 1
j i</p>
      <p>For the purposes of interpretability, we would like to
represent each state by only one vector. Trivial solution
would be to just concatenate them. Another option is to
replace all of those r vectors with just one. If we did this, our
model would be able to recognize previous states,
however, it would not be able to distinguish their combinations
(order in time). For example, state 1 followed by state 2
would be represented the same way as state 2 followed by
state 1.</p>
      <p>Based on the thoughts from the previous paragraph, we
propose a solution in the following form: we represent
each hidden states by a vector of a fixed length, and
depending on their relative position in time we permute those
vectors. More precisely, we shift their elements by one for
each time-step into the past, eg. a vector [1 2 3]T turns
into [2 3 1]T and then into [3 1 2]T.</p>
      <p>Such a replacement vector may need to have higher
dimension, worst-case it needs to be of length rq1, although
we will be interested only in a low-rank approximation of
the tensor T . Hence, we introduce a new parameter for
our algorithm, q2, which denotes the length of those
vectors (we will refer to them as state vectors). The q2 acts as
a hyperparameter that needs to be set at the beginning of
an experiment. Its choice is beyond this publication,
however, it should be considerably higher than q1. With those
modifications, we can rewrite the decomposition as
T = åq2 ti+r 1
i=1
ti+1 ti;
where tk = t j if j = k mod q2.</p>
      <p>We apply the same trick to the initial-state-probability
tensor p and the observation-probability matrix O,
O = åq2 ti+r ui;</p>
      <p>i=1
where ui represent observations.
3.2</p>
    </sec>
    <sec id="sec-8">
      <title>Matricized Representation</title>
      <p>The observation-operator representation of HMM is
advantageous during training, as it involves only
matrixvector products. Unfortunately, this representation only
works for HMMs of order 1, whose transition
probabilities can be represented as a matrix. To alleviate this
problem, we introduce matricized tensor using the
representation from the previous subsection.</p>
      <p>Plainly, instead of using transition probability tensor
T 2 Rn n, we use its matricized form T 2 Rnr 1 n.</p>
      <p>Using the previous notation, let a matrix H 2 Rn q2 be
defined as</p>
      <p>H = t1
: : : tq2 ;
where ti are individual columns of the matrix.</p>
      <p>This matrix represents behaviour of states at the time of
current observation. To get the matrix for a time step -1,
we multiply it by a permutation matrix P 2 Rq2 q2 ,
00 1 0 01</p>
      <p>B0 0 1 0C
P = BBB ... ... ... . . . ... CCC :</p>
      <p>B@0 0 0 1CA</p>
      <p>1 0 0 0
Furthermore, we define a matrix S 2 Rnr 1 q2 as
S = HPr 1</p>
      <p>HP2</p>
      <p>HP:
It is obvious, that we get the mode-r unfolding of the
tensor T by multiplying it with HT,</p>
      <p>T = SHT:</p>
      <p>For the direct replacement in the operator
representation, a square matrix is necessary. We do this by imputing
zeros into matrix T, creating a sparse matrix. This is a
consequence of past hidden states being already given, hence,
reducing the degree of freedom of the system. By
exploiting the structure created by Khatri-Rao product, each row
in the imputed matrix needs to be padded to the left. The
i-th row is padded by m(i%r) to the left, creating a matrix
T0.
3.3</p>
    </sec>
    <sec id="sec-9">
      <title>Unconstrained Optimization</title>
      <p>To ensure that probabilities will not be negative and that
they will add up to 1, we would need to solve a constrained
optimization problem. To avoid this complication, we
instead apply a row-wise softmax function to T0. For a
general vector x, the softmax function is defined as
exi
[softmax (x)]i =
å mj=r 0 ex j :
Thanks to this, we can optimize this problem as an
unconstrained one.</p>
      <p>The same trick can be applied to the matrix O. The
estimation of O, however, requires the application of the
softmax function to each row, where rows are typically
of considerably higher dimension than columns. In those
cases where there are tens of thousands or even more
different possible observations, this computation would
be prohibitively computationally expensive. Instead, we
would like to estimate only columns. A solution to this
problem is sketched in the following subsection.
To compute the probabilities [O]i; j = Pr [ot = jjst = i], we
would need to use our decomposed representation and
compute a whole row [O]i. However, this row contains
m elements, usually far more than the amount of hidden
states n. Instead, we use the Bayes’ theorem to compute
the following formula
Here, the Pr [ot = j] is independent of hidden states and
cannot be optimized, which means we can remove it from
the expression during training as it does not provide any
valuable gradients.</p>
      <p>The estimation of Pr [st = ijot = j] requires the
estimation of m probabilities, one for each hidden state, which
is computationally more feasible than previous approach.
We will denote this modified matrix as OT.</p>
      <p>Lastly, the expression Pr [st = i] is from a stationary
distribution of the Markov chain represented by the
hiddenstate transitions T0. It can be calculated by solving a linear
system of equations which is a completely differentiable
process. We denote its solution as z,</p>
      <p>[z]i = Pr [st = i] :
Furthermore, the estimation of z needs to be done only
once during each batch which further improves
performance.
3.5</p>
    </sec>
    <sec id="sec-10">
      <title>The CPD-SGD algorithm</title>
      <p>The pseudocode of the complete algorithm is presented in
Algorithm 1. We formulate the algorithm using the
automatic differentiation framework. Hence, the exact update
rules for individual parameters are determined based on
the chosen optimizer.</p>
      <p>
        The optimizer can be chosen based on the concrete task
or other requirements. It can be an SGD with or without
adaptive learning rate. We choose the later option in this
work. Another option is to utilize one of the
momentumbased methods, such as Adam [
        <xref ref-type="bibr" rid="ref30">30</xref>
        ]. In any case, the update
of the optimizer parameters is done within the body of the
function update_params_based_on() in the Algorithm
1.
      </p>
      <p>The computation is organized into epochs that repeat
until the model has converged. We assume that the !p is a
uniform distribution, hence we replaced it in the algorithm
with the vector of ones. The algorithm evaluates the
loglikelihood of the given sequence or sequences during each
epoch and computes their respective gradients. Hence, the
algorithm can be used in both online and batch settings.
Afterward, the parameters are updated based on the
chosen optimizer. The use of log-likelihood is necessary as
otherwise, tiny probabilities may cause computational
errors/underflows.</p>
      <p>Algorithm 1: The CPD-SGD Algorithm</p>
    </sec>
    <sec id="sec-11">
      <title>Result: H; OT</title>
      <p>initialization
ln_prob 0
while not_converged() do</p>
      <p>S HPr 1 HP2 HP
T softmax(SHT)
z solve(T)
T Tdiag(z)
T0 add_zeros_and_shift(T)
OT eval_cols(xk : : : x1)
forall sequences xt : : : x1 in current batch do
ln_prob
ln_prob ln !1 T : : : T0diag([OT] ;x1 )!1
end
update_params_based_on(ln_prob)
ln_prob 0
end
4</p>
      <sec id="sec-11-1">
        <title>Experiments</title>
        <p>We experimentally evaluated the proposed CPD-SGD
training algorithm that was presented in Section 3. To
do this, we generated a synthetic dataset. The synthetic
dataset was generated to resemble natural language
properties and to emphasize the advantages of higher-order
HMMs. To evaluate and compare trained models, we
compute the likelihood of the test sequences for each model.
The focus of this work is to provide a proof of concept
of the proposed algorithm, hence, further experiments and
applications to disjoint datasets are relegated to the future
work.
4.1</p>
      </sec>
    </sec>
    <sec id="sec-12">
      <title>Synthetic Dataset</title>
      <p>We generate the synthetic dataset as an output from an
HMM model of an order 5 with 20 hidden states. The
amount of possible distinct observations is equal to 10 000.
This model was created in such a way as to be close to but
not completely deterministic. The generated sequences are
all of length 50, as this is long enough to alleviate the
influence of the initial-states distribution and short enough not
to cause an arithmetic underflow. The initial state
probabilities of the model were chosen as a stationary
distribution of its underlying Markov chain.
4.2</p>
    </sec>
    <sec id="sec-13">
      <title>Convergence</title>
      <p>To evaluate the convergence of the CPD-SGD algorithm,
we divide the dataset into two disjoint sets: training data
and test data. While training data is used to estimate
parameters of the HMM, test data is used to estimate the
model’s fit.</p>
      <p>In the first experiment, we hypothesize an HMM model
of an order 3, represented by state vectors of length 50.</p>
      <p>The number of hidden states was chosen to be 20. Figure
1 shows the evolution of the model’s fit after each epoch
(an epoch consisted of 100 sequences). It can be seen that
the parameters of the model are being optimized.</p>
      <p>
        We follow by assessing the convergence of our proposed
algorithm in comparison to the Baum-Welch algorithm [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]
for models of order 1. We use a set of sequences to
estimate parameters of HMMs, and afterward, we compute
each model’s fit using a different set of test sequences.
We repeat this process 500 times, each time with
differently reorganized sequences into train and test sets.
Furthermore, we always use 10 different initializations for the
Baum-Welch algorithm, train all of them, and choose the
model with the best fit.
      </p>
      <p>The distribution of differences between fits of both of
those algorithms can be seen in Figure 2. As apparent
from the figure, our algorithm attains results comparable
to traditional approaches with low differences in absolute
and relative likelihood differences.
In the final experiment, we compare the expressive power
of the higher-order models. We train models of orders
from 1 up to 5, using the same train set. After they
converge, we compute their fit on the test sequences. All
models were trained using the CPD-SGD algorithm, with the
number of hidden states equal to 20, and with state vectors
of length 50. They were randomly initialized.
In this work, we build upon one of the well-known
unsupervised learning methods - hidden Markov models. We
explored the possibilities to improve the training of HMMs
of higher orders. We followed up on previous works in the
area and proposed a new algorithm for HMM training,
denoted as the CPD-SGD algorithm.</p>
      <p>Our algorithm is based on multiple ideas that stem from
- among others - an emerging area of tensor decomposition
methods. We utilize permutation decomposition which
assigns only one vector to each state while it reuses all of its
values for each time step. Further, we used matricized
representation of the probability tensor to extend the operator
representation of HMMs to higher-order models. By
applying the softmax function, we avoided the problematic
constrained optimization. Finally, we reformulated
computation of observation probabilities, making them more
efficient to compute in case of models with observation
space far larger than state space.</p>
      <p>We experimentally showed that higher-order HMMs
could be efficiently represented in memory as decomposed
tensors while being suitable for accelerated training using
stochastic gradient descent optimization.</p>
      <p>The focus of this paper has been on constructing an
algorithm for training higher-order HMMs in an efficient
way. We have experimentally verified that our method
works and is able to train an HMM model on a synthetic
dataset accurately.</p>
      <p>In future work, we plan to apply this method to
realworld problems in natural language processing, which
requires models with large state spaces and observation
spaces.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Byung-Jun Yoon</surname>
          </string-name>
          .
          <article-title>Hidden markov models and their applications in biological sequence analysis</article-title>
          .
          <source>Current Genomics</source>
          ,
          <volume>10</volume>
          :
          <fpage>402</fpage>
          -
          <lpage>415</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Gales</surname>
          </string-name>
          and
          <string-name>
            <surname>S. Young.</surname>
          </string-name>
          <article-title>The application of hidden markov models in speech recognition</article-title>
          .
          <source>Found. Trends Signal Process.</source>
          ,
          <volume>1</volume>
          :
          <fpage>195</fpage>
          -
          <lpage>304</lpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Bobulski</surname>
          </string-name>
          and
          <string-name>
            <given-names>Lukasz</given-names>
            <surname>Adrjanowicz</surname>
          </string-name>
          .
          <article-title>Two-dimensional hidden markov models for pattern recognition</article-title>
          .
          <source>In ICAISC</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Seifert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Abou-El-Ardat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Betty</given-names>
            <surname>Friedrich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Klink</surname>
          </string-name>
          ,
          <article-title>and</article-title>
          <string-name>
            <given-names>A.</given-names>
            <surname>Deutsch</surname>
          </string-name>
          .
          <article-title>Autoregressive higher-order hidden markov models: Exploiting local chromosomal dependencies in the analysis of tumor expression profiles</article-title>
          .
          <source>PLoS ONE</source>
          ,
          <volume>9</volume>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Lee-Min Lee</surname>
          </string-name>
          and
          <string-name>
            <surname>Jia-Chien Lee</surname>
          </string-name>
          .
          <article-title>A study on high-order hidden markov models and applications to speech recognition</article-title>
          .
          <source>In IEA/AIE</source>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>L.</given-names>
            <surname>Benyoussef</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Carincotte</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Derrode</surname>
          </string-name>
          .
          <article-title>Extension of higher-order hmc modeling with application to image segmentation</article-title>
          .
          <source>Digit</source>
          . Signal Process.,
          <volume>18</volume>
          :
          <fpage>849</fpage>
          -
          <lpage>860</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>W.</given-names>
            <surname>Zucchini</surname>
          </string-name>
          and
          <string-name>
            <surname>I. Macdonald.</surname>
          </string-name>
          <article-title>Hidden markov models for time series: An introduction using r</article-title>
          .
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>L.</given-names>
            <surname>Rabiner</surname>
          </string-name>
          .
          <article-title>A tutorial on hidden markov models and selected applications</article-title>
          .
          <source>Proceedings of the IEEE</source>
          ,
          <year>1989</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Abhra</given-names>
            <surname>Sarkar</surname>
          </string-name>
          and
          <string-name>
            <given-names>D.</given-names>
            <surname>Dunson</surname>
          </string-name>
          .
          <article-title>Bayesian nonparametric higher order hidden markov models</article-title>
          .
          <source>arXiv: Methodology</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>C.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Di</given-names>
            <surname>Yang</surname>
          </string-name>
          , and
          <string-name>
            <surname>L. Zhang.</surname>
          </string-name>
          <article-title>A high-order hidden markov model and its applications for dynamic car ownership analysis</article-title>
          .
          <source>Transp. Sci.</source>
          ,
          <volume>52</volume>
          :
          <fpage>1365</fpage>
          -
          <lpage>1375</lpage>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>U.</given-names>
            <surname>Hadar</surname>
          </string-name>
          and
          <string-name>
            <given-names>H.</given-names>
            <surname>Messer</surname>
          </string-name>
          .
          <article-title>High-order hidden markov models - estimation and implementation</article-title>
          .
          <source>2009 IEEE/SP 15th Workshop on Statistical Signal Processing</source>
          , pages
          <fpage>249</fpage>
          -
          <lpage>252</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>P.</given-names>
            <surname>Bagos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Liakopoulos</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Hamodrakas</surname>
          </string-name>
          .
          <article-title>Faster gradient descent training of hidden markov models, using individual learning rate adaptation</article-title>
          .
          <source>In ICGI</source>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>P.</given-names>
            <surname>Baldi</surname>
          </string-name>
          and
          <string-name>
            <surname>Y. Chauvin.</surname>
          </string-name>
          <article-title>Smooth on-line learning algorithms for hidden markov models</article-title>
          .
          <source>Neural Computation</source>
          ,
          <volume>6</volume>
          :
          <fpage>307</fpage>
          -
          <lpage>318</lpage>
          ,
          <year>1994</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Y.</given-names>
            <surname>LeCun</surname>
          </string-name>
          , L. Bottou, G. Orr, and
          <string-name>
            <given-names>K.</given-names>
            <surname>Müller</surname>
          </string-name>
          .
          <article-title>Efficient backprop</article-title>
          .
          <source>In Neural Networks: Tricks of the Trade</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>Ahmed</surname>
            <given-names>Saaudi</given-names>
          </string-name>
          , Yan Tong, and
          <string-name>
            <given-names>Csilla</given-names>
            <surname>Farkas</surname>
          </string-name>
          .
          <article-title>Probabilistic graphical model on detecting insiders: Modeling with sgdhmm</article-title>
          .
          <source>In ICISSP</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Md</given-names>
            <surname>Pavel</surname>
          </string-name>
          Mahmud and
          <string-name>
            <given-names>Alexander</given-names>
            <surname>Schliep</surname>
          </string-name>
          .
          <article-title>Fast mcmc sampling for hidden markov models to determine copy number variations</article-title>
          .
          <source>BMC Bioinformatics</source>
          ,
          <volume>12</volume>
          :
          <fpage>428</fpage>
          -
          <lpage>428</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <surname>Daniel</surname>
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Hsu</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Kakade</surname>
            ,
            <given-names>and Tong</given-names>
          </string-name>
          <string-name>
            <surname>Zhang</surname>
          </string-name>
          .
          <article-title>A spectral algorithm for learning hidden markov models</article-title>
          .
          <source>J. Comput. Syst. Sci.</source>
          ,
          <volume>78</volume>
          :
          <fpage>1460</fpage>
          -
          <lpage>1480</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>R.</given-names>
            <surname>Harshman</surname>
          </string-name>
          .
          <article-title>Foundations of the parafac procedure: Models and conditions for an "explanatory" multi-model factor analysis</article-title>
          .
          <year>1970</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Yian</given-names>
            <surname>Ma</surname>
          </string-name>
          , Y. Ma,
          <string-name>
            <given-names>N.</given-names>
            <surname>Foti</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.</given-names>
            <surname>Fox</surname>
          </string-name>
          .
          <article-title>Stochastic gradient mcmc methods for hidden markov models</article-title>
          .
          <source>In ICML</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <surname>Maxim</surname>
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Kuznetsov</surname>
            and
            <given-names>I. Oseledets.</given-names>
          </string-name>
          <article-title>Tensor train spectral method for learning of hidden markov models (hmm)</article-title>
          .
          <source>Computational Methods in Applied Mathematics</source>
          ,
          <volume>19</volume>
          :
          <fpage>93</fpage>
          -
          <lpage>99</lpage>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>I. V.</given-names>
            <surname>Oseledets</surname>
          </string-name>
          .
          <article-title>Tensor-train decomposition</article-title>
          .
          <source>SIAM Journal on Scientific Computing</source>
          ,
          <volume>33</volume>
          (
          <issue>5</issue>
          ):
          <fpage>2295</fpage>
          -
          <lpage>2317</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <surname>Majid</surname>
            <given-names>Janzamin</given-names>
          </string-name>
          , Rong Ge, Jean Kossaifi, and
          <string-name>
            <given-names>Anima</given-names>
            <surname>Anandkumar</surname>
          </string-name>
          .
          <article-title>Spectral learning on matrices and tensors</article-title>
          .
          <source>Found. Trends Mach. Learn.</source>
          ,
          <volume>12</volume>
          :
          <fpage>393</fpage>
          -
          <lpage>536</lpage>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>LR</given-names>
            <surname>Tucker</surname>
          </string-name>
          .
          <article-title>Some mathematical notes on three-mode factor analysis</article-title>
          .
          <source>Psychometrika</source>
          ,
          <volume>31</volume>
          (
          <issue>3</issue>
          ):
          <fpage>279</fpage>
          -
          <lpage>311</lpage>
          ,
          <year>September 1966</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>P.</given-names>
            <surname>Brémaud</surname>
          </string-name>
          .
          <article-title>Non-homogeneous markov chains</article-title>
          .
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <surname>Tamara</surname>
            <given-names>G.</given-names>
          </string-name>
          <string-name>
            <surname>Kolda</surname>
            and
            <given-names>Brett W.</given-names>
          </string-name>
          <string-name>
            <surname>Bader</surname>
          </string-name>
          .
          <article-title>Tensor decompositions and applications</article-title>
          .
          <source>SIAM Review</source>
          ,
          <volume>51</volume>
          (
          <issue>3</issue>
          ):
          <fpage>455</fpage>
          -
          <lpage>500</lpage>
          ,
          <year>September 2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <surname>Stephan</surname>
            <given-names>Rabanser</given-names>
          </string-name>
          , Oleksandr Shchur, and
          <string-name>
            <given-names>Stephan</given-names>
            <surname>Günnemann</surname>
          </string-name>
          .
          <article-title>Introduction to tensor decompositions and their applications in machine learning</article-title>
          .
          <source>ArXiv, abs/1711.10781</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <surname>Christopher</surname>
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Hillar</surname>
          </string-name>
          and
          <string-name>
            <surname>Lek-Heng Lim</surname>
          </string-name>
          .
          <article-title>Most tensor problems are np-hard</article-title>
          .
          <source>J. ACM</source>
          ,
          <volume>60</volume>
          (
          <issue>6</issue>
          ),
          <year>November 2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>Eyal</given-names>
            <surname>Even-Dar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kakade</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Mansour</surname>
          </string-name>
          .
          <article-title>The value of observation for monitoring dynamic systems</article-title>
          .
          <source>In IJCAI</source>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>H.</given-names>
            <surname>Jaeger</surname>
          </string-name>
          .
          <article-title>Observable operator models for discrete stochastic time series</article-title>
          .
          <source>Neural Computation</source>
          ,
          <volume>12</volume>
          :
          <fpage>1371</fpage>
          -
          <lpage>1398</lpage>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <surname>Diederik</surname>
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Kingma</surname>
            and
            <given-names>Jimmy</given-names>
          </string-name>
          <string-name>
            <surname>Ba</surname>
          </string-name>
          .
          <article-title>Adam: A method for stochastic optimization</article-title>
          .
          <source>CoRR, abs/1412.6980</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>