Tensor Decomposition-Based Training Method for High-Order Hidden Markov Models

Matej Cibuľa, Radek Mařík

Department of Telecommunication Engineering,
Faculty of Electrical Engineering,
Czech Technical University in Prague,
166 27 Prague, Czech Republic
matej.cibula@fel.cvut.cz
marikr@fel.cvut.cz

Abstract: Hidden Markov models (HMMs) are one of the most widely used unsupervised-learning algorithms for modeling discrete sequential data. Traditionally, most applications of HMMs have used only models of order 1, because higher-order models are computationally hard to train. We reformulate HMMs using tensor decomposition to efficiently build higher-order models trained with stochastic gradient descent. Based on this, we propose a new, modified training algorithm for HMMs that is especially suitable for high-order HMMs. Further, we show its capabilities and convergence on synthetic data.

Copyright ©2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

Hidden Markov models (HMMs) are a popular member of the family of unsupervised-learning algorithms. HMMs have been used extensively in a variety of fields, including bioinformatics [1], speech recognition [2], and image processing [3]. Nevertheless, most works on HMMs consider only models of order 1.

Hidden Markov models of orders higher than 1 have been used very scarcely, even though higher-order HMMs have been demonstrated to provide substantial improvements in bioinformatics [4], speech recognition [5], and image processing [6]. A higher-order HMM can be replaced by a model of order 1 with the same number of free parameters but with a considerably larger number of hidden states [7]. However, the increased number of hidden states obstructs interpretability. The complexity of parameter estimation for higher-order models has been the main reason for their scarce use.

Traditionally, HMMs are trained using the Baum-Welch algorithm [8], which is itself a particular case of the Expectation-Maximization algorithm. The algorithm requires non-trivial modifications to be applicable to higher-order HMMs and still converge consistently to optima [9]. Another approach is to reformulate a higher-order HMM as an HMM of order 1 [10], [11]. Furthermore, the number of HMM parameters that need to be estimated grows exponentially with the model's order.

Another approach to estimating the parameters of an HMM utilizes gradient methods [12]. Traditionally, deterministic gradient descent has been used for training because of the added advantage of online training [13]. Contrary to other training methods, algorithms based on gradient descent can be fed data sequentially, one sequence at a time, and the complete dataset does not need to be available during the whole training process. With the improvement of stochastic gradient descent (SGD) techniques [14], these have been applied to HMM parameter estimation as well [15]. The SGD approach provides better convergence guarantees and is preferable to its deterministic counterpart.

The least commonly used approach involves Markov chain Monte Carlo (MCMC) sampling [16]. Its main advantage is that it provides a full distribution over the HMM parameters instead of only point estimates. In practice, other techniques are preferred for their lower computational cost.

Lastly, the spectral algorithm is the newest method for training HMMs [17]. As opposed to the previously discussed methods, this algorithm optimizes the observation-operator representation of an HMM instead of directly modifying the HMM parameters. The most significant advantage of this approach is its provable convergence to a global optimum (under mild assumptions on the model parameters). Another factor is that it is the only non-iterative training algorithm. However, this learning method has only been proposed for HMMs of order 1.

Higher-order models involve the use of tensors instead of matrices. When working with tensors in practice, we often want to find a tensor of fixed rank that approximates the original one. Such approximations generally have advantageous properties when it comes to noise reduction in the original data. From a certain point of view, low-rank tensor approximation is similar to dimensionality reduction of matrices using the singular value decomposition.
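To make the analogy concrete, here is a minimal NumPy sketch (ours, not part of the original paper) of the matrix case: the best rank-q approximation obtained by truncating the singular value decomposition. The matrix size and the rank q are arbitrary illustrative choices.

    import numpy as np

    def truncated_svd_approx(M, q):
        # Best rank-q approximation of M in the Frobenius norm
        # (Eckart-Young theorem); the CP decomposition plays an
        # analogous role for higher-order tensors, though without
        # the same optimality guarantee.
        U, s, Vt = np.linalg.svd(M, full_matrices=False)
        return (U[:, :q] * s[:q]) @ Vt[:q, :]

    M = np.random.rand(20, 30)
    M_hat = truncated_svd_approx(M, q=5)
    print(np.linalg.norm(M - M_hat))  # Frobenius-norm approximation error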
The algorithm that we propose combines modern stochastic gradient descent optimization with tensor decomposition methods. We utilize the Canonical Polyadic (CP) decomposition [18]; hence, we refer to the algorithm as the CP-Decomposition Stochastic Gradient Descent (CPD-SGD) algorithm.

In the next section, we introduce the notation and provide basic definitions and background for our work. In Section 3, we propose the new method and point out its most essential ideas. We experimentally evaluate our algorithm and discuss the results in Section 4.

1.1 Related Work

The idea of using stochastic gradient descent for HMM optimization has been explored before, for example in [15] and [19]. However, it has only been applied to HMMs of order 1, while this work focuses on higher-order models. Furthermore, both works used the traditional representation of an HMM instead of the observation-operator representation utilized in this paper.

The application of tensor decomposition methods to HMMs has been partially explored in [9] and [20]. In [20], the authors combine the Tensor-Train decomposition [21] with the spectral algorithm for HMMs [17], as opposed to our work, where we combine the CP decomposition with SGD. Additionally, those authors do not consider higher-order models, leaving any possible extension in this direction unmentioned.

In [9], the authors combine tensor decomposition with Markov chain Monte Carlo methods. However, their model is considerably more complicated and, again, considers only the standard HMM representation. On the other hand, it can straightforwardly provide confidence intervals for the estimated parameters instead of just point estimates.

1.2 Our Contribution

We propose a new method for training hidden Markov models. It is based on recent advancements in the fields of tensor decomposition and stochastic gradient optimization. The most important advantage of the approach is its applicability to the training of HMMs of higher orders. This applicability is achieved by introducing several modifications into the original gradient descent algorithm [12] for training HMMs.

Firstly, we introduce a permutation decomposition of the transition-probability tensor, which can provide even an exponential improvement in the memory required to store those probabilities. We achieve this by a careful reuse of the vectors representing individual states. This work focuses solely on the CP decomposition, although multiple other decompositions have been proposed [22], such as the Tucker decomposition [23] and the tensor-train (TT) decomposition [21].

Secondly, we modify the operator representation of the HMM (similarly to [10]) to allow for the optimization of higher-order models. We use this representation during stochastic gradient descent instead of the "raw" model that has been commonly used before.

Thirdly, we modify the computation of the observation probabilities given a fixed state by applying Bayes' theorem and the stationary distribution of the Markov chain [24]. This modification provides performance improvements, especially in cases where the number of possible distinct observations is far higher than the number of hidden states.

2 Preliminaries

2.1 Notation

In this work, we follow the generally used notation, based on [25] for tensors and on [8], [17] for hidden Markov models. Bold capital letters (A) denote matrices, bold lowercase letters (a) denote vectors, and normal lowercase letters (a) denote scalars. Tensors are denoted by uppercase calligraphic letters, such as A. In ambiguous cases, we add an arrow over vectors, for example →π.

2.2 Tensor Decomposition

Tensors are multidimensional arrays. They can be considered a generalization of vectors and matrices to higher dimensions, where a vector is a 1st-order tensor and a matrix is a 2nd-order tensor.

The Kronecker product of A ∈ R^(k×l) and B ∈ R^(m×n) is denoted A ⊗ B. It holds that A ⊗ B ∈ R^(km×ln), and it is defined as

    A ⊗ B = [ a_{1,1}B  a_{1,2}B  ···  a_{1,l}B
              a_{2,1}B  a_{2,2}B  ···  a_{2,l}B
                 ⋮          ⋮      ⋱      ⋮
              a_{k,1}B  a_{k,2}B  ···  a_{k,l}B ].

The Khatri-Rao product is the column-wise Kronecker product. Given A ∈ R^(k×m) and B ∈ R^(l×m), their Khatri-Rao product is denoted A ⊙ B, with A ⊙ B ∈ R^(kl×m), and is defined as

    A ⊙ B = [ a_1 ⊗ b_1   ...   a_m ⊗ b_m ].

For decomposition algorithms, the operation of tensor matricization is commonly used [26]. Tensor matricization, often referred to as reshaping or flattening, is a process of reordering the elements of a higher-order (> 2) tensor into a matrix. All elements of the tensor are reused; hence, the amount of memory required to store a tensor is the same as that required to store its matricized form. The standard forms of matricized tensors are mode-n unfoldings.

Let X ∈ R^(I_1×···×I_k). If an element of the tensor has indices (i_1, ..., i_k), then it has indices (i_n, j) in the mode-n unfolding, where j is defined as

    j = 1 + Σ_{l=1, l≠n}^{k} (i_l − 1) J_l,   where   J_l = Π_{m=1, m≠n}^{l−1} I_m.

We denote the mode-n unfolding of a tensor X as X_n.
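As a concrete illustration (our sketch, not part of the original text), the mode-n unfolding and the Khatri-Rao product defined above can be written in a few lines of NumPy; the column-major ('F') ordering matches the index formula for j.

    import numpy as np

    def unfold(X, n):
        # Mode-n unfolding: move mode n to the front and flatten the
        # remaining modes in column-major order, matching the index
        # formula above.
        return np.reshape(np.moveaxis(X, n, 0), (X.shape[n], -1), order='F')

    def khatri_rao(A, B):
        # Column-wise Kronecker product of A (k x m) and B (l x m).
        k, m = A.shape
        l, _ = B.shape
        return np.einsum('ir,jr->ijr', A, B).reshape(k * l, m)

    X = np.random.rand(3, 4, 5)
    print(unfold(X, 1).shape)                                   # (4, 15)
    print(khatri_rao(np.ones((2, 3)), np.ones((4, 3))).shape)   # (8, 3)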
The tensor rank of X, denoted rank(X), is defined as

    rank(X) = min{ r ∈ N : ∃ a_1^1, ..., a_r^k  such that  X = Σ_{i=1}^{r} a_i^1 ∘ ··· ∘ a_i^k },

where the symbol ∘ represents the vector outer product.

The above definition of the tensor rank gives rise to the tensor rank decomposition. Given a tensor X ∈ R^(I_1×···×I_k) of rank q, it can be decomposed as

    X = Σ_{i=1}^{q} λ_i a_i^1 ∘ ··· ∘ a_i^k.

Here, λ_i ∈ R are normalizing coefficients, ensuring that all components are of the same scale, which at the same time makes the convergence during computation quicker.

Trivially, each tensor has a finite rank. However, estimating the tensor rank is an NP-hard problem [27]. In applications, we therefore often look for a tensor of fixed rank that approximates the original one.

The rank-q̂ canonical polyadic (CP) decomposition of a tensor X of order k is the problem of finding vectors a_i^j such that

    X̂ = Σ_{i=1}^{q̂} λ_i a_i^1 ∘ ··· ∘ a_i^k

and X̂ minimizes ||X − X̂||_F.

2.3 Hidden Markov Models

A hidden Markov model (HMM) defines a probability distribution over sequences from a finite alphabet. Additionally, it associates each sequence of observations o_1 ... o_t with a sequence of hidden states s_1 ... s_t of the same length. There is a finite number of distinct states. We denote the set of all possible hidden states as [n] = {1, ..., n}, and the set of all possible observations, i.e., the finite alphabet of the sequences, as [m] = {1, ..., m}.

It is always assumed that the HMM satisfies two conditional independence properties. Firstly, the hidden state s_t depends only on a finite number of previous states, typically only one, and on nothing else. Secondly, the observation o_t at time t depends only on the hidden state s_t at that moment t and on no other element of the sequence or any other hidden state.

Formally, an HMM of order r − 1 can be described by three matrices and/or tensors. Let T ∈ R^(n×···×n) be a tensor of state-transition probabilities with

    [T]_{i_1,...,i_r} = Pr[s_t = i_r | s_{t−1} = i_{r−1}, ..., s_{t−r+1} = i_1].

Similarly, let O ∈ R^(n×m) be a matrix of observation probabilities with

    [O]_{i,j} = Pr[o_t = j | s_t = i].

Lastly, let π ∈ R^(n×···×n) be of order 1 less than the tensor T. It defines the initial hidden-state distribution by

    [π]_{i_1,...,i_{r−1}} = Pr[s_{r−1} = i_{r−1}, ..., s_1 = i_1].

In the most common case in the literature, the hidden state at time t is assumed to depend only on its immediate predecessor. This means that the transitions between states can be modeled as a Markov chain of order 1. Consequently, the previously defined parameters reduce to T ∈ R^(n×n), O ∈ R^(n×m), and →π ∈ R^n.

The authors of [28], [29] showed that such an HMM of order 1 can be represented by means of observation operators. These operators are a sufficient and computationally efficient way to evaluate observation-sequence probabilities. The probability of a particular sequence o_1, ..., o_t is given by

    Pr[o_1, ..., o_t] = →1^T A_{o_t} ··· A_{o_1} →π,

where the observation operators A_i are defined [17] as

    A_i = T diag([O]_{1,i}, ..., [O]_{n,i}).

The standard strategy for estimating the HMM parameters is to maximize the likelihood of the provided samples. As the hidden states are not directly observable, algorithms often resort to heuristics or other similar methods without any convergence guarantees.

3 CPD-SGD Algorithm

3.1 Permutation Decomposition

We assume that we have an HMM of order r − 1 described by the parameters T, O, and π. The transition-probability tensor T has a finite tensor rank, which we denote q_1. Hence, it has a CP decomposition

    T = Σ_{i=1}^{q_1} λ_i t_i^1 ∘ ··· ∘ t_i^r.

Further, as we can incorporate the normalizing coefficients λ_i into the individual components, we omit them in the following and write only

    T = Σ_{i=1}^{q_1} t_i^1 ∘ ··· ∘ t_i^r.

This omission is based on two facts. Firstly, all of the vectors t_i^j are inferred during the training of the model. Secondly, we do not need to constrain the decomposition components to a particular form, as we do not work with them directly, and the tensor approximation is invariant to the way the coefficients are incorporated into the decomposition components. Hence, omitting the λ_i coefficients results in a lower number of parameters to be estimated.
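As an aside (our illustration, not the paper's code), the CP form above can be materialized in NumPy by summing outer products of the factor vectors; the dimensions and the random factor values are placeholders only.

    import numpy as np

    def cp_to_tensor(factors):
        # factors: list of q1 components, each a list of r vectors
        # [t_i^1, ..., t_i^r] of length n; returns the sum of their
        # outer products, i.e. an order-r tensor of shape n x ... x n.
        T = None
        for component in factors:
            outer = component[0]
            for v in component[1:]:
                outer = np.multiply.outer(outer, v)
            T = outer if T is None else T + outer
        return T

    n, r, q1 = 4, 3, 5
    factors = [[np.random.rand(n) for _ in range(r)] for _ in range(q1)]
    print(cp_to_tensor(factors).shape)  # (4, 4, 4)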
Every hidden state is now represented by r vectors of length q_1. Each of them corresponds to one position relative to the current state: the first one describes the influence of the given state on the state observed r − 1 steps into the future, the next-to-last vector describes the influence of the given state on the state immediately following it, and the last vector describes the behavior of the current state based on its history. For a fixed hidden state i and for j ∈ {1, ..., r}, those vectors have the form

    [ [t_1^j]_i  [t_2^j]_i  ...  [t_{q_1}^j]_i ]^T.

For the purposes of interpretability, we would like to represent each state by only one vector. A trivial solution would be to simply concatenate them. Another option is to replace all of those r vectors with a single one. If we did this, our model would be able to recognize previous states; however, it would not be able to distinguish their combinations (their order in time). For example, state 1 followed by state 2 would be represented in the same way as state 2 followed by state 1.

Based on these considerations, we propose the following solution: we represent each hidden state by a vector of a fixed length and, depending on its relative position in time, we permute this vector. More precisely, we shift its elements by one for each time step into the past; e.g., a vector [1 2 3]^T turns into [2 3 1]^T and then into [3 1 2]^T.

Such a replacement vector may need to have a higher dimension; in the worst case it needs to be of length r q_1, although we will be interested only in a low-rank approximation of the tensor T. Hence, we introduce a new parameter of our algorithm, q_2, which denotes the length of those vectors (we refer to them as state vectors). The parameter q_2 acts as a hyperparameter that needs to be set at the beginning of an experiment. Its choice is beyond the scope of this publication; however, it should be considerably higher than q_1. With those modifications, we can rewrite the decomposition as

    T = Σ_{i=1}^{q_2} t_{i+r−1} ∘ ··· ∘ t_{i+1} ∘ t_i,

where t_k = t_j if j = k mod q_2.

We apply the same trick to the initial-state-probability tensor π and to the observation-probability matrix O,

    O = Σ_{i=1}^{q_2} t_{i+r} ∘ u_i,

where the vectors u_i represent the observations.

3.2 Matricized Representation

The observation-operator representation of an HMM is advantageous during training, as it involves only matrix-vector products. Unfortunately, this representation only works for HMMs of order 1, whose transition probabilities can be represented as a matrix. To alleviate this problem, we introduce a matricized tensor using the representation from the previous subsection.

Plainly, instead of using the transition-probability tensor T ∈ R^(n×···×n), we use its matricized form T ∈ R^(n^(r−1)×n).

Using the previous notation, let a matrix H ∈ R^(n×q_2) be defined as

    H = [ t_1 ... t_{q_2} ],

where the t_i are the individual columns of the matrix. This matrix represents the behaviour of the states at the time of the current observation. To obtain the matrix for time step −1, we multiply it by the permutation matrix P ∈ R^(q_2×q_2),

    P = [ 0 1 0 ··· 0
          0 0 1 ··· 0
          ⋮ ⋮ ⋮ ⋱  ⋮
          0 0 0 ··· 1
          1 0 0 ··· 0 ].

Furthermore, we define a matrix S ∈ R^(n^(r−1)×q_2) as

    S = HP^(r−1) ⊙ ··· ⊙ HP^2 ⊙ HP.

It is obvious that we obtain the mode-r unfolding of the tensor T by multiplying it with H^T,

    T = S H^T.

For a direct replacement in the operator representation, a square matrix is necessary. We achieve this by imputing zeros into the matrix T, creating a sparse matrix. This is a consequence of the past hidden states being already given, which reduces the degrees of freedom of the system. By exploiting the structure created by the Khatri-Rao product, each row of the imputed matrix needs to be padded to the left; the i-th row is padded with m(i % r) zeros, creating a matrix T0.

3.3 Unconstrained Optimization

To ensure that the probabilities are non-negative and sum to 1, we would need to solve a constrained optimization problem. To avoid this complication, we instead apply a row-wise softmax function to T0. For a general vector x, the softmax function is defined as

    [softmax(x)]_i = e^{x_i} / Σ_j e^{x_j},

where the sum runs over all components of x. Thanks to this, we can optimize the problem as an unconstrained one.

The same trick can be applied to the matrix O. The estimation of O, however, would require applying the softmax function to each row, and the rows are typically of a considerably higher dimension than the columns. In cases where there are tens of thousands or even more distinct possible observations, this computation would be prohibitively expensive. Instead, we would like to estimate only the columns. A solution to this problem is sketched in the following subsection.
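Before moving on, a minimal NumPy sketch (ours, with random placeholder values for H) of the constructions from Sections 3.2 and 3.3: the cyclic permutation matrix P, the Khatri-Rao build-up of S, and the row-wise softmax applied to S H^T. The zero-padding step that yields T0 is omitted here.

    import numpy as np

    def khatri_rao(A, B):
        # Column-wise Kronecker product.
        k, m = A.shape
        l, _ = B.shape
        return np.einsum('ir,jr->ijr', A, B).reshape(k * l, m)

    def row_softmax(M):
        # Numerically stable row-wise softmax; every row sums to 1.
        M = M - M.max(axis=1, keepdims=True)
        E = np.exp(M)
        return E / E.sum(axis=1, keepdims=True)

    n, q2, r = 4, 6, 3                    # hidden states, state-vector length, model of order r - 1
    H = np.random.randn(n, q2)            # unconstrained state-vector matrix
    P = np.roll(np.eye(q2), 1, axis=1)    # cyclic shift by one position

    # S = H P^(r-1) (Khatri-Rao) ... (Khatri-Rao) H P^2 (Khatri-Rao) H P
    S = H @ np.linalg.matrix_power(P, r - 1)
    for p in range(r - 2, 0, -1):
        S = khatri_rao(S, H @ np.linalg.matrix_power(P, p))

    T = row_softmax(S @ H.T)              # shape (n**(r-1), n); each row is a transition distribution
    print(S.shape, T.shape)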
3.4 Probability Replacement

To compute the probabilities [O]_{i,j} = Pr[o_t = j | s_t = i], we would need to use our decomposed representation and compute a whole row [O]_i. However, this row contains m elements, usually far more than the number of hidden states n. Instead, we use Bayes' theorem and compute

    Pr[o_t = j | s_t = i] = Pr[s_t = i | o_t = j] Pr[o_t = j] / Pr[s_t = i].

Here, Pr[o_t = j] is independent of the hidden states and cannot be optimized, which means we can remove it from the expression during training, as it does not provide any valuable gradients.

The estimation of Pr[s_t = i | o_t = j] requires estimating n probabilities, one for each hidden state, which is computationally more feasible than the previous approach. We denote this modified matrix by O_T.

Lastly, the probability Pr[s_t = i] comes from the stationary distribution of the Markov chain represented by the hidden-state transition matrix T0. It can be calculated by solving a linear system of equations, which is a completely differentiable process. We denote its solution by z,

    [z]_i = Pr[s_t = i].

Furthermore, the estimation of z needs to be done only once per batch, which further improves performance.

3.5 The CPD-SGD Algorithm

The pseudocode of the complete algorithm is presented in Algorithm 1. We formulate the algorithm within an automatic differentiation framework. Hence, the exact update rules for the individual parameters are determined by the chosen optimizer.

The optimizer can be chosen based on the concrete task or other requirements. It can be SGD with or without an adaptive learning rate; we choose the latter option in this work. Another option is to utilize one of the momentum-based methods, such as Adam [30]. In any case, the update of the optimizer parameters is done within the body of the function update_params_based_on() in Algorithm 1.

The computation is organized into epochs that repeat until the model has converged. We assume that →π is a uniform distribution; hence, we replace it in the algorithm with a vector of ones. The algorithm evaluates the log-likelihood of the given sequence or sequences during each epoch and computes the respective gradients. Hence, the algorithm can be used in both online and batch settings. Afterward, the parameters are updated based on the chosen optimizer. The use of the log-likelihood is necessary because otherwise tiny probabilities may cause computational errors/underflows.

Algorithm 1: The CPD-SGD Algorithm
    Result: H, O_T
    initialization
    ln_prob ← 0
    while not_converged() do
        S ← HP^(r−1) ⊙ ··· ⊙ HP^2 ⊙ HP
        T ← softmax(S H^T)
        z ← solve(T)
        T ← T diag(z)
        T0 ← add_zeros_and_shift(T)
        O_T ← eval_cols(x_k ... x_1)
        forall sequences x_t ... x_1 in the current batch do
            ln_prob ← ln_prob − ln( →1^T T0 diag([O_T]_{•,x_t}) ··· T0 diag([O_T]_{•,x_1}) →1 )
        end
        update_params_based_on(ln_prob)
        ln_prob ← 0
    end

4 Experiments

We experimentally evaluated the proposed CPD-SGD training algorithm presented in Section 3. To do this, we generated a synthetic dataset. The synthetic dataset was generated to resemble natural-language properties and to emphasize the advantages of higher-order HMMs. To evaluate and compare the trained models, we compute the likelihood of the test sequences under each model. The focus of this work is to provide a proof of concept of the proposed algorithm; hence, further experiments and applications to other datasets are left for future work.

4.1 Synthetic Dataset

We generate the synthetic dataset as the output of an HMM of order 5 with 20 hidden states. The number of possible distinct observations is 10 000. The model was created so as to be close to, but not completely, deterministic. The generated sequences are all of length 50, as this is long enough to alleviate the influence of the initial-state distribution and short enough not to cause arithmetic underflow. The initial state probabilities of the model were chosen as the stationary distribution of its underlying Markov chain.

4.2 Convergence

To evaluate the convergence of the CPD-SGD algorithm, we divide the dataset into two disjoint sets: training data and test data. While the training data is used to estimate the parameters of the HMM, the test data is used to estimate the model's fit.
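A small sketch (ours, with hypothetical callables) of this evaluation protocol: the sequences are shuffled, split into disjoint train and test sets, and the mean per-sequence log-likelihood on the held-out set is reported; train_fn and loglik_fn stand in for the CPD-SGD training loop of Algorithm 1 and the operator-based likelihood evaluation.

    import numpy as np

    def evaluate_fit(sequences, train_fn, loglik_fn, test_fraction=0.2, seed=0):
        # Shuffle and split the sequences into disjoint train/test sets,
        # fit a model on the training part and report the mean
        # log-likelihood on the held-out part.
        rng = np.random.default_rng(seed)
        idx = rng.permutation(len(sequences))
        n_test = int(len(sequences) * test_fraction)
        test = [sequences[i] for i in idx[:n_test]]
        train = [sequences[i] for i in idx[n_test:]]
        model = train_fn(train)
        return np.mean([loglik_fn(model, seq) for seq in test])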
In the first experiment, we hypothesize an HMM of order 3, represented by state vectors of length 50. The number of hidden states was chosen to be 20. Figure 1 shows the evolution of the model's fit after each epoch (an epoch consisted of 100 sequences). It can be seen that the parameters of the model are being optimized.

Figure 1: Continuous convergence of the CPD-SGD algorithm.

We follow by assessing the convergence of our proposed algorithm in comparison with the Baum-Welch algorithm [8] for models of order 1. We use a set of sequences to estimate the parameters of the HMMs, and afterward we compute each model's fit using a different set of test sequences. We repeat this process 500 times, each time with the sequences reshuffled into train and test sets. Furthermore, we always use 10 different initializations for the Baum-Welch algorithm, train all of them, and choose the model with the best fit.

Figure 2: Distribution of the differences between the log-likelihoods of models trained by the Baum-Welch and CPD-SGD algorithms.

The distribution of the differences between the fits of the two algorithms can be seen in Figure 2. As apparent from the figure, our algorithm attains results comparable to the traditional approach, with small absolute and relative differences in log-likelihood.

4.3 Model Comparisons

In the final experiment, we compare the expressive power of higher-order models. We train models of orders 1 up to 5, using the same training set. After they converge, we compute their fit on the test sequences. All models were trained using the CPD-SGD algorithm, with the number of hidden states equal to 20 and with state vectors of length 50. They were randomly initialized.

Figure 3: Maximal attained log-likelihoods of models of a given order.

Figure 3 shows the individual fits of those models. As expected, the higher-order models perform better.

5 Conclusion

In this work, we build upon one of the well-known unsupervised learning methods - hidden Markov models. We explored possibilities for improving the training of HMMs of higher orders. We followed up on previous works in the area and proposed a new algorithm for HMM training, denoted the CPD-SGD algorithm.

Our algorithm is based on multiple ideas that stem from - among others - the emerging area of tensor decomposition methods. We utilize a permutation decomposition which assigns only one vector to each state while reusing all of its values for each time step. Further, we used a matricized representation of the probability tensor to extend the operator representation of HMMs to higher-order models. By applying the softmax function, we avoided a problematic constrained optimization. Finally, we reformulated the computation of the observation probabilities, making them more efficient to compute for models whose observation space is far larger than their state space.

We experimentally showed that higher-order HMMs can be efficiently represented in memory as decomposed tensors while being suitable for accelerated training using stochastic gradient descent optimization.

The focus of this paper has been on constructing an algorithm for training higher-order HMMs in an efficient way.
We have experimentally verified that our method works and is able to train an HMM model on a synthetic dataset accurately.

In future work, we plan to apply this method to real-world problems in natural language processing, which require models with large state spaces and observation spaces.

References

[1] Byung-Jun Yoon. Hidden Markov models and their applications in biological sequence analysis. Current Genomics, 10:402–415, 2009.
[2] M. Gales and S. Young. The application of hidden Markov models in speech recognition. Found. Trends Signal Process., 1:195–304, 2007.
[3] J. Bobulski and Lukasz Adrjanowicz. Two-dimensional hidden Markov models for pattern recognition. In ICAISC, 2013.
[4] M. Seifert, K. Abou-El-Ardat, Betty Friedrich, B. Klink, and A. Deutsch. Autoregressive higher-order hidden Markov models: Exploiting local chromosomal dependencies in the analysis of tumor expression profiles. PLoS ONE, 9, 2014.
[5] Lee-Min Lee and Jia-Chien Lee. A study on high-order hidden Markov models and applications to speech recognition. In IEA/AIE, 2006.
[6] L. Benyoussef, C. Carincotte, and S. Derrode. Extension of higher-order HMC modeling with application to image segmentation. Digit. Signal Process., 18:849–860, 2008.
[7] W. Zucchini and I. Macdonald. Hidden Markov Models for Time Series: An Introduction Using R. 2009.
[8] L. Rabiner. A tutorial on hidden Markov models and selected applications. Proceedings of the IEEE, 1989.
[9] Abhra Sarkar and D. Dunson. Bayesian nonparametric higher order hidden Markov models. arXiv: Methodology, 2018.
[10] C. Xiong, Di Yang, and L. Zhang. A high-order hidden Markov model and its applications for dynamic car ownership analysis. Transp. Sci., 52:1365–1375, 2018.
[11] U. Hadar and H. Messer. High-order hidden Markov models - estimation and implementation. In 2009 IEEE/SP 15th Workshop on Statistical Signal Processing, pages 249–252, 2009.
[12] P. Bagos, T. Liakopoulos, and S. Hamodrakas. Faster gradient descent training of hidden Markov models, using individual learning rate adaptation. In ICGI, 2004.
[13] P. Baldi and Y. Chauvin. Smooth on-line learning algorithms for hidden Markov models. Neural Computation, 6:307–318, 1994.
[14] Y. LeCun, L. Bottou, G. Orr, and K. Müller. Efficient backprop. In Neural Networks: Tricks of the Trade, 2012.
[15] Ahmed Saaudi, Yan Tong, and Csilla Farkas. Probabilistic graphical model on detecting insiders: Modeling with SGD-HMM. In ICISSP, 2019.
[16] Md Pavel Mahmud and Alexander Schliep. Fast MCMC sampling for hidden Markov models to determine copy number variations. BMC Bioinformatics, 12:428, 2011.
[17] Daniel J. Hsu, S. Kakade, and Tong Zhang. A spectral algorithm for learning hidden Markov models. J. Comput. Syst. Sci., 78:1460–1480, 2009.
[18] R. Harshman. Foundations of the PARAFAC procedure: Models and conditions for an "explanatory" multi-modal factor analysis. 1970.
[19] Yian Ma, N. Foti, and E. Fox. Stochastic gradient MCMC methods for hidden Markov models. In ICML, 2017.
[20] Maxim A. Kuznetsov and I. Oseledets. Tensor train spectral method for learning of hidden Markov models (HMM). Computational Methods in Applied Mathematics, 19:93–99, 2019.
[21] I. V. Oseledets. Tensor-train decomposition. SIAM Journal on Scientific Computing, 33(5):2295–2317, 2011.
[22] Majid Janzamin, Rong Ge, Jean Kossaifi, and Anima Anandkumar. Spectral learning on matrices and tensors. Found. Trends Mach. Learn., 12:393–536, 2019.
[23] L. R. Tucker. Some mathematical notes on three-mode factor analysis. Psychometrika, 31(3):279–311, September 1966.
[24] P. Brémaud. Non-homogeneous Markov chains. 2020.
[25] Tamara G. Kolda and Brett W. Bader. Tensor decompositions and applications. SIAM Review, 51(3):455–500, September 2009.
[26] Stephan Rabanser, Oleksandr Shchur, and Stephan Günnemann. Introduction to tensor decompositions and their applications in machine learning. ArXiv, abs/1711.10781, 2017.
[27] Christopher J. Hillar and Lek-Heng Lim. Most tensor problems are NP-hard. J. ACM, 60(6), November 2013.
[28] Eyal Even-Dar, S. Kakade, and Y. Mansour. The value of observation for monitoring dynamic systems. In IJCAI, 2007.
[29] H. Jaeger. Observable operator models for discrete stochastic time series. Neural Computation, 12:1371–1398, 2000.
[30] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2015.