Tensor Decomposition-Based Training Method for High-Order Hidden Markov Models

Matej Cibuľa, Radek Mařík
Department of Telecommunication Engineering, Faculty of Electrical Engineering, Czech Technical University in Prague, 166 27 Prague, Czech Republic
matej.cibula@fel.cvut.cz, marikr@fel.cvut.cz

Copyright ©2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract: Hidden Markov models (HMMs) are among the most widely used unsupervised-learning algorithms for modeling discrete sequential data. Traditionally, most applications of HMMs have used only models of order 1, because higher-order models are computationally hard to train. We reformulate HMMs using tensor decomposition so that higher-order models can be built efficiently with stochastic gradient descent. Based on this reformulation, we propose a modified training algorithm for HMMs that is especially suitable for high-order models, and we show its capabilities and convergence on synthetic data.

1 Introduction

Hidden Markov models (HMMs) are a popular member of the family of unsupervised-learning algorithms. They have been used extensively in a variety of fields, including bioinformatics [1], speech recognition [2], and image processing [3]. Nevertheless, most work on HMMs considers only models of order 1.

Hidden Markov models of order higher than 1 have been used only rarely, even though higher-order HMMs have been shown to provide substantial improvements in bioinformatics [4], speech recognition [5], and image processing [6]. A higher-order HMM can be replaced by a model of order 1 with the same number of free parameters but with a considerably larger number of hidden states [7]; however, the increased number of hidden states obstructs interpretability. The complexity of parameter estimation for higher-order models has been the main reason for their scarce use.

Traditionally, HMMs are trained with the Baum-Welch algorithm [8], itself a particular case of the Expectation-Maximization algorithm. The algorithm requires non-trivial modifications to be applicable to higher-order HMMs while still converging consistently to optima [9]. Another approach is to reformulate a higher-order HMM as an HMM of order 1 [10], [11]. Furthermore, the number of HMM parameters that need to be estimated grows exponentially with the model's order.

Another approach to estimating the parameters of an HMM uses gradient methods [12]. Traditionally, deterministic gradient descent has been used for training because of the added advantage of online training [13]. Contrary to other training methods, algorithms based on gradient descent can be fed data sequentially, one sample at a time, so the complete dataset does not need to be available during the whole training process. With the improvement of stochastic gradient descent (SGD) techniques [14], these have been applied to HMM parameter estimation as well [15]. The SGD approach provides better convergence guarantees and is preferable to its deterministic counterpart.

The least commonly used approach involves Markov chain Monte Carlo (MCMC) sampling [16]. Its main advantage is that it provides a full distribution over the HMM parameters instead of only point estimates. In practice, other techniques are preferred for their lower computational cost.

Lastly, the spectral algorithm is the newest training method for HMMs [17]. As opposed to the previously discussed methods, it optimizes an observation-operator representation of the HMM instead of directly modifying the HMM parameters. Its most significant advantage is provable convergence to the global optimum (under mild assumptions on the model parameters); it is also the only non-iterative training algorithm. However, this learning method has only been proposed for HMMs of order 1.
Higher-order models involve tensors instead of matrices. When working with tensors in practice, we often want to find a tensor of fixed rank that approximates the original one. Such approximations generally have advantageous properties with respect to noise reduction in the original data. From a certain point of view, low-rank tensor approximation is analogous to dimensionality reduction with the singular value decomposition for matrices.

The algorithm we propose combines modern stochastic gradient descent optimization with tensor decomposition methods. We utilize the Canonical Polyadic (CP) decomposition [18]; hence, we refer to the algorithm as the CP-Decomposition Stochastic Gradient Descent (CPD-SGD) algorithm.

In the next section, we introduce notation and provide the basic definitions and background for our work. In Section 3, we propose the new method and point out its most essential ideas. We experimentally evaluate the algorithm and discuss the results in Section 4.

1.1 Related Work

The idea of using stochastic gradient descent for HMM optimization has been explored before, for example in [15] and [19]. However, it has only been applied to HMMs of order 1, while this work focuses on higher-order models. Furthermore, both works used the traditional representation of an HMM instead of the observation-operator representation utilized in this paper.

The application of tensor decomposition methods to HMMs has been partially explored in [9] and [20]. In [20], the authors combine the tensor-train decomposition [21] with the spectral algorithm for HMMs [17], whereas we combine the CP decomposition with SGD. Additionally, those authors do not consider higher-order models and leave any possible extension in this direction unmentioned.

In [9], the authors combine tensor decomposition with Markov chain Monte Carlo methods. However, their model is considerably more complicated and, again, considers only the standard HMM representation. On the other hand, it can straightforwardly provide confidence intervals for the estimated parameters instead of just point estimates.
1.2 Our Contribution

We propose a new method for training hidden Markov models, based on recent advancements in tensor decomposition and stochastic gradient optimization. Its most important advantage is its applicability to the training of HMMs of higher orders. This applicability is achieved by introducing several modifications into the original gradient descent algorithm [12] for HMM training.

Firstly, we introduce a permutation decomposition of the transition-probability tensor, which yields a possibly exponential reduction in the memory required to store the transition probabilities. We achieve this by reusing the vectors that represent individual states. This work focuses solely on the CP decomposition, although multiple other decompositions have been proposed [22], such as the Tucker decomposition [23] and the tensor-train (TT) decomposition [21].

Secondly, we modify the operator representation of the HMM (similarly to [10]) to allow the optimization of higher-order models. We use this representation during the stochastic gradient descent instead of the "raw" model that has commonly been used before.

Thirdly, we modify the computation of the observation probabilities given a fixed state by applying Bayes' theorem and using the stationary distribution of the Markov chain [24]. This modification improves performance, especially when the number of possible distinct observations is far higher than the number of hidden states.

2 Preliminaries

2.1 Notation

We follow the generally used notation, based on [25] for tensors and on [8], [17] for hidden Markov models. Bold capital letters ($\mathbf{A}$) denote matrices, bold lowercase letters ($\mathbf{a}$) denote vectors, and normal lowercase letters ($a$) denote scalars. Tensors are denoted by uppercase calligraphic letters, such as $\mathcal{A}$. In ambiguous cases, we add an arrow over vectors, for example $\vec{\pi}$.

2.2 Tensor Decomposition

Tensors are multidimensional arrays. They can be considered a generalization of vectors and matrices to higher dimensions: a vector is a 1st-order tensor and a matrix is a 2nd-order tensor.

The Kronecker product of $\mathbf{A} \in \mathbb{R}^{k \times l}$ and $\mathbf{B} \in \mathbb{R}^{m \times n}$ is denoted $\mathbf{A} \otimes \mathbf{B} \in \mathbb{R}^{km \times ln}$ and is defined as
\[
\mathbf{A} \otimes \mathbf{B} =
\begin{bmatrix}
a_{1,1}\mathbf{B} & a_{1,2}\mathbf{B} & \cdots & a_{1,l}\mathbf{B} \\
a_{2,1}\mathbf{B} & a_{2,2}\mathbf{B} & \cdots & a_{2,l}\mathbf{B} \\
\vdots & \vdots & \ddots & \vdots \\
a_{k,1}\mathbf{B} & a_{k,2}\mathbf{B} & \cdots & a_{k,l}\mathbf{B}
\end{bmatrix}.
\]

The Khatri-Rao product is the column-wise Kronecker product. Given $\mathbf{A} \in \mathbb{R}^{k \times m}$ and $\mathbf{B} \in \mathbb{R}^{l \times m}$, their Khatri-Rao product is denoted $\mathbf{A} \odot \mathbf{B} \in \mathbb{R}^{kl \times m}$ and defined as
\[
\mathbf{A} \odot \mathbf{B} = \left[\, \mathbf{a}_1 \otimes \mathbf{b}_1 \;\; \dots \;\; \mathbf{a}_m \otimes \mathbf{b}_m \,\right].
\]

Decomposition algorithms commonly use the operation of tensor matricization [26]. Tensor matricization, often referred to as reshaping or flattening, is the process of reordering the elements of a higher-order (> 2) tensor into a matrix. All elements of the tensor are reused; hence, the amount of memory required to store a tensor is the same as for its matricized form. The standard forms of matricized tensors are the mode-$n$ unfoldings.

Let $\mathcal{X} \in \mathbb{R}^{I_1 \times \dots \times I_k}$. If an element of the tensor has indices $(i_1, \dots, i_k)$, then it has indices $(i_n, j)$ in the mode-$n$ unfolding, where
\[
j = 1 + \sum_{\substack{l=1 \\ l \neq n}}^{k} (i_l - 1) J_l,
\qquad
J_l = \prod_{\substack{m=1 \\ m \neq n}}^{l-1} I_m .
\]
We denote the mode-$n$ unfolding of a tensor $\mathcal{X}$ as $\mathbf{X}_n$.

The tensor rank of $\mathcal{X}$, denoted $\mathrm{rank}(\mathcal{X})$, is defined as
\[
\mathrm{rank}(\mathcal{X}) = \min \left\{ r \in \mathbb{N} \;\middle|\; \exists\, \mathbf{a}_1^1, \dots, \mathbf{a}_r^k : \; \mathcal{X} = \sum_{i=1}^{r} \mathbf{a}_i^1 \circ \dots \circ \mathbf{a}_i^k \right\},
\]
where the symbol $\circ$ denotes the vector outer product. This definition gives rise to the tensor rank decomposition: a tensor $\mathcal{X} \in \mathbb{R}^{I_1 \times \dots \times I_k}$ of rank $q$ can be decomposed as
\[
\mathcal{X} = \sum_{i=1}^{q} \lambda_i \, \mathbf{a}_i^1 \circ \dots \circ \mathbf{a}_i^k .
\]
Here, $\lambda_i \in \mathbb{R}$ are normalizing coefficients that keep all components at the same scale and, at the same time, make the convergence during computation quicker.

Trivially, every tensor has a finite rank. However, estimating the tensor rank is an NP-hard problem [27]. In applications, we therefore often look for a tensor of fixed rank that approximates the original one. The rank-$\hat{q}$ canonical polyadic (CP) decomposition of a tensor $\mathcal{X}$ of order $k$ is the problem of finding vectors $\mathbf{a}_i^j$ such that
\[
\hat{\mathcal{X}} = \sum_{i=1}^{\hat{q}} \lambda_i \, \mathbf{a}_i^1 \circ \dots \circ \mathbf{a}_i^k
\]
minimizes $\lVert \mathcal{X} - \hat{\mathcal{X}} \rVert_F$.
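The operations above are straightforward to realize in code. The following minimal NumPy sketch (our own illustration, not part of the original paper) implements the Khatri-Rao product, one common mode-$n$ unfolding convention, and the reconstruction of a third-order tensor from rank-$q$ CP factors; the column ordering of the unfolding may differ from the index formula above by a permutation.

    import numpy as np

    def khatri_rao(A, B):
        """Column-wise Kronecker product of A (k x m) and B (l x m) -> (k*l x m)."""
        k, m = A.shape
        l, m2 = B.shape
        assert m == m2, "factor matrices must share the number of columns"
        # outer product of matching columns, then flatten the first two axes
        return np.einsum('ir,jr->ijr', A, B).reshape(k * l, m)

    def mode_n_unfolding(X, n):
        """Matricize tensor X along mode n (one common column-ordering convention)."""
        return np.moveaxis(X, n, 0).reshape(X.shape[n], -1)

    def cp_reconstruct(factors, weights=None):
        """Rebuild a 3rd-order tensor from rank-q CP factor matrices (dim x q each)."""
        A, B, C = factors
        q = A.shape[1]
        w = np.ones(q) if weights is None else weights
        return np.einsum('r,ir,jr,kr->ijk', w, A, B, C)

    # toy check: a random rank-2 tensor, its mode-0 unfolding, and a Khatri-Rao product
    A, B, C = (np.random.rand(4, 2), np.random.rand(5, 2), np.random.rand(3, 2))
    X = cp_reconstruct((A, B, C))
    print(X.shape, mode_n_unfolding(X, 0).shape, khatri_rao(B, C).shape)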
2.3 Hidden Markov Models

A Hidden Markov Model (HMM) defines a probability distribution over sequences from a finite alphabet. Additionally, it associates each sequence of observations $o_1 \dots o_t$ with a sequence of hidden states $s_1 \dots s_t$ of the same length. There is a finite number of distinct states. We denote the set of all possible hidden states as $[n] = \{1, \dots, n\}$ and the set of all possible observations, i.e., the finite alphabet of the sequences, as $[m] = \{1, \dots, m\}$.

An HMM is always assumed to satisfy two conditional independence properties. Firstly, the hidden state $s_t$ depends only on a finite number of previous states, typically just one, and on nothing else. Secondly, the observation $o_t$ at time $t$ depends only on the hidden state $s_t$ at that moment and on no other element of the sequence or other hidden state.

Formally, an HMM of order $r - 1$ is described by three matrices and/or tensors. Let $\mathcal{T} \in \mathbb{R}^{n \times \dots \times n}$ be the tensor of state-transition probabilities with
\[
[\mathcal{T}]_{i_1, \dots, i_r} = \Pr\left[ s_t = i_r \mid s_{t-1} = i_{r-1}, \dots, s_{t-r+1} = i_1 \right].
\]
Similarly, let $\mathbf{O} \in \mathbb{R}^{n \times m}$ be the matrix of observation probabilities with
\[
[\mathbf{O}]_{i,j} = \Pr\left[ o_t = j \mid s_t = i \right].
\]
Lastly, let $\pi \in \mathbb{R}^{n \times \dots \times n}$ be of order one less than the tensor $\mathcal{T}$. It defines the initial hidden-state distribution by
\[
[\pi]_{i_1, \dots, i_{r-1}} = \Pr\left[ s_{r-1} = i_{r-1}, \dots, s_1 = i_1 \right].
\]

In the most common case in the literature, the hidden state at time $t$ is assumed to depend only on its immediate predecessor. The transitions between states can then be modeled as a Markov chain of order 1, and the previously defined parameters reduce to $\mathbf{T} \in \mathbb{R}^{n \times n}$, $\mathbf{O} \in \mathbb{R}^{n \times m}$, and $\vec{\pi} \in \mathbb{R}^{n}$.

The authors of [28], [29] showed that such an HMM of order 1 can be represented by means of observation operators. These operators are a sufficient and computationally efficient way to evaluate observation-sequence probabilities. The probability of a particular sequence $o_1, \dots, o_t$ is given by
\[
\Pr\left[ o_1, \dots, o_t \right] = \vec{1}^{\,T} \mathbf{A}_{o_t} \cdots \mathbf{A}_{o_1} \vec{\pi},
\]
where the observation operators $\mathbf{A}_i$ are defined [17] as
\[
\mathbf{A}_i = \mathbf{T}\,\mathrm{diag}\!\left( [\mathbf{O}]_{1,i}, \dots, [\mathbf{O}]_{n,i} \right).
\]

The standard strategy for estimating the HMM parameters is to maximize the likelihood of the provided samples. As the hidden states are not directly observable, algorithms often resort to heuristics or other similar methods without convergence guarantees.
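As an illustration of the observation-operator representation, the following NumPy sketch evaluates a sequence probability for an order-1 HMM. One convention caveat: we build the operators from a column-stochastic transition matrix T_col with [T_col]_{i,j} = Pr[s_t = i | s_{t-1} = j], as in the spectral-algorithm literature [17]; with the row-oriented transition matrix defined above, its transpose would be used instead. The function names are ours.

    import numpy as np

    def observation_operators(T_col, O):
        """A_i = T_col @ diag(O[:, i]); T_col[i, j] = Pr[s_t = i | s_{t-1} = j]."""
        n, m = O.shape
        return [T_col @ np.diag(O[:, i]) for i in range(m)]

    def sequence_probability(obs, ops, pi):
        """Pr[o_1, ..., o_t] = 1^T A_{o_t} ... A_{o_1} pi."""
        v = pi.copy()
        for o in obs:                 # apply the operators in chronological order
            v = ops[o] @ v
        return v.sum()

    # toy order-1 HMM with n = 2 hidden states and m = 3 observations
    rng = np.random.default_rng(0)
    T_col = rng.random((2, 2)); T_col /= T_col.sum(axis=0, keepdims=True)  # columns sum to 1
    O = rng.random((2, 3));     O /= O.sum(axis=1, keepdims=True)          # rows sum to 1
    pi = np.array([0.5, 0.5])
    print(sequence_probability([2, 0, 1], observation_operators(T_col, O), pi))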
3 CPD-SGD Algorithm

3.1 Permutation Decomposition

Assume we have an HMM of order $r - 1$ described by parameters $\mathcal{T}$, $\mathbf{O}$, $\pi$. The transition-probability tensor $\mathcal{T}$ has a finite tensor rank, which we denote $q_1$. Hence, it has a CP decomposition
\[
\mathcal{T} = \sum_{i=1}^{q_1} \lambda_i \, \mathbf{t}_i^1 \circ \dots \circ \mathbf{t}_i^r .
\]
Since the normalizing coefficients $\lambda_i$ can be absorbed into the individual components, we omit them in the following and write only
\[
\mathcal{T} = \sum_{i=1}^{q_1} \mathbf{t}_i^1 \circ \dots \circ \mathbf{t}_i^r .
\]
This omission rests on two facts. Firstly, all of the vectors $\mathbf{t}_i^j$ are inferred during training of the model. Secondly, we do not need to constrain the decomposition components to any particular form, as we never work with them directly and the tensor approximation is invariant to the way the coefficients are absorbed into the components. Omitting the $\lambda_i$ coefficients therefore results in a lower number of parameters to be estimated.

Every hidden state is now represented by $r$ vectors of length $q_1$, one for each position relative to the current state: the first defines the influence of the given state on the state observed $r - 1$ steps after it, the next-to-last defines its influence on the immediately following state, and the last defines the behavior of the current state based on its history. For a fixed hidden state $i$ and for $j \in \{1, \dots, r\}$, these vectors have the form
\[
\left[ \left[\mathbf{t}_1^j\right]_i \;\; \left[\mathbf{t}_2^j\right]_i \;\; \dots \;\; \left[\mathbf{t}_{q_1}^j\right]_i \right]^T .
\]

For the purposes of interpretability, we would like to represent each state by only one vector. A trivial solution would be to concatenate the $r$ vectors. Another option is to replace them with a single vector; however, the model would then be able to recognize previous states but not to distinguish their combinations (their order in time). For example, state 1 followed by state 2 would be represented in the same way as state 2 followed by state 1.

Based on these considerations, we propose the following solution: we represent each hidden state by a vector of fixed length and, depending on its relative position in time, we permute that vector. More precisely, we shift its elements by one for each time step into the past, e.g., a vector $[1\;2\;3]^T$ turns into $[2\;3\;1]^T$ and then into $[3\;1\;2]^T$.

Such a replacement vector may need a higher dimension; in the worst case it needs to be of length $r q_1$, although we are only interested in a low-rank approximation of the tensor $\mathcal{T}$. Hence, we introduce a new parameter of the algorithm, $q_2$, which denotes the length of these vectors (we refer to them as state vectors). The value $q_2$ acts as a hyperparameter that must be set at the beginning of an experiment. Its choice is beyond the scope of this publication; however, it should be considerably higher than $q_1$. With these modifications, we can rewrite the decomposition as
\[
\mathcal{T} = \sum_{i=1}^{q_2} \mathbf{t}_{i+r-1} \circ \dots \circ \mathbf{t}_{i+1} \circ \mathbf{t}_i ,
\]
where $\mathbf{t}_k = \mathbf{t}_j$ if $j \equiv k \pmod{q_2}$. We apply the same trick to the initial-state-probability tensor $\pi$ and to the observation-probability matrix $\mathbf{O}$,
\[
\mathbf{O} = \sum_{i=1}^{q_2} \mathbf{t}_{i+r} \circ \mathbf{u}_i ,
\]
where the vectors $\mathbf{u}_i$ represent observations.

3.2 Matricized Representation

The observation-operator representation of an HMM is advantageous during training, as it involves only matrix-vector products. Unfortunately, this representation only works for HMMs of order 1, whose transition probabilities can be represented as a matrix. To alleviate this problem, we introduce a matricized tensor using the representation from the previous subsection.

Plainly, instead of using the transition-probability tensor $\mathcal{T} \in \mathbb{R}^{n \times \dots \times n}$, we use its matricized form $\mathbf{T} \in \mathbb{R}^{n^{r-1} \times n}$. Using the previous notation, let the matrix $\mathbf{H} \in \mathbb{R}^{n \times q_2}$ be defined as
\[
\mathbf{H} = \left[ \mathbf{t}_1 \;\; \dots \;\; \mathbf{t}_{q_2} \right],
\]
where the $\mathbf{t}_i$ are its individual columns. This matrix represents the behavior of the states at the time of the current observation. To obtain the matrix for the time step $-1$, we multiply it by the permutation matrix $\mathbf{P} \in \mathbb{R}^{q_2 \times q_2}$,
\[
\mathbf{P} =
\begin{bmatrix}
0 & 1 & 0 & \cdots & 0 \\
0 & 0 & 1 & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & \cdots & 1 \\
1 & 0 & 0 & \cdots & 0
\end{bmatrix}.
\]
Furthermore, we define a matrix $\mathbf{S} \in \mathbb{R}^{n^{r-1} \times q_2}$ as
\[
\mathbf{S} = \mathbf{H}\mathbf{P}^{r-1} \odot \dots \odot \mathbf{H}\mathbf{P}^{2} \odot \mathbf{H}\mathbf{P} .
\]
The mode-$r$ unfolding of the tensor $\mathcal{T}$ is then obtained by multiplying $\mathbf{S}$ with $\mathbf{H}^T$,
\[
\mathbf{T} = \mathbf{S}\mathbf{H}^T .
\]

For a direct replacement in the operator representation, a square matrix is necessary. We obtain one by imputing zeros into the matrix $\mathbf{T}$, creating a sparse matrix. This is a consequence of the past hidden states already being given, which reduces the degrees of freedom of the system. Exploiting the structure created by the Khatri-Rao product, each row of the imputed matrix needs to be padded to the left; the $i$-th row is padded by $m(i \bmod r)$ to the left, creating a matrix $\mathbf{T}_0$.
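A small NumPy sketch of the construction above (our own illustration): the permutation $\mathbf{P}$ is realized as a cyclic shift of the columns of $\mathbf{H}$, $\mathbf{S}$ is built as a chain of Khatri-Rao products, and the mode-$r$ unfolding is obtained as $\mathbf{S}\mathbf{H}^T$.

    import numpy as np

    def khatri_rao(A, B):
        """Column-wise Kronecker product (see the sketch in Section 2.2)."""
        return np.einsum('ir,jr->ijr', A, B).reshape(A.shape[0] * B.shape[0], -1)

    def cyclic_shift(H, k):
        """H @ P^k for the cyclic permutation matrix P defined above."""
        return np.roll(H, k, axis=1)

    def matricized_transitions(H, r):
        """Build S = HP^{r-1} (*) ... (*) HP and the mode-r unfolding T = S H^T."""
        S = cyclic_shift(H, r - 1)
        for k in range(r - 2, 0, -1):       # remaining factors HP^{r-2}, ..., HP
            S = khatri_rao(S, cyclic_shift(H, k))
        return S, S @ H.T                   # S: (n^{r-1} x q2), T: (n^{r-1} x n)

    # toy example: n = 3 states, state vectors of length q2 = 8, model order r - 1 = 2
    H = np.random.default_rng(1).random((3, 8))
    S, T = matricized_transitions(H, r=3)
    print(S.shape, T.shape)   # (9, 8) (9, 3)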
3.3 Unconstrained Optimization

To ensure that the probabilities are non-negative and sum to 1, we would need to solve a constrained optimization problem. To avoid this complication, we instead apply a row-wise softmax function to $\mathbf{T}_0$. For a general vector $\mathbf{x}$, the softmax function is defined as
\[
\left[\mathrm{softmax}(\mathbf{x})\right]_i = \frac{e^{x_i}}{\sum_{j} e^{x_j}} .
\]
Thanks to this, we can optimize the problem as an unconstrained one.

The same trick can be applied to the matrix $\mathbf{O}$. Estimating $\mathbf{O}$, however, requires applying the softmax function to each row, and the rows are typically of considerably higher dimension than the columns. In cases with tens of thousands or more distinct possible observations, this computation would be prohibitively expensive. Instead, we would like to estimate only columns. A solution to this problem is sketched in the following subsection.

3.4 Probability Replacement

To compute the probabilities $[\mathbf{O}]_{i,j} = \Pr[o_t = j \mid s_t = i]$, we would need to use our decomposed representation and compute a whole row $[\mathbf{O}]_i$. However, this row contains $m$ elements, usually far more than the number of hidden states $n$. Instead, we use Bayes' theorem and compute
\[
\Pr\left[ o_t = j \mid s_t = i \right] = \frac{\Pr\left[ s_t = i \mid o_t = j \right] \Pr\left[ o_t = j \right]}{\Pr\left[ s_t = i \right]} .
\]
Here, $\Pr[o_t = j]$ is independent of the hidden states and cannot be optimized, so we can remove it from the expression during training: it does not provide any valuable gradients.

Estimating $\Pr[s_t = i \mid o_t = j]$ requires estimating $n$ probabilities, one for each hidden state, which is computationally more feasible than the previous approach. We denote the resulting modified matrix as $\mathbf{O}_T$.

Lastly, the term $\Pr[s_t = i]$ comes from the stationary distribution of the Markov chain of hidden-state transitions represented by $\mathbf{T}_0$. It can be calculated by solving a linear system of equations, which is a completely differentiable process. We denote its solution as $\mathbf{z}$,
\[
[\mathbf{z}]_i = \Pr\left[ s_t = i \right].
\]
Furthermore, $\mathbf{z}$ needs to be estimated only once per batch, which further improves performance.

3.5 The CPD-SGD Algorithm

The pseudocode of the complete algorithm is presented in Algorithm 1. We formulate the algorithm within an automatic differentiation framework; hence, the exact update rules for the individual parameters are determined by the chosen optimizer.

The optimizer can be chosen based on the concrete task or other requirements. It can be SGD with or without an adaptive learning rate; we choose the latter option in this work. Another option is to utilize one of the momentum-based methods, such as Adam [30]. In any case, the update of the optimizer parameters is done within the body of the function update_params_based_on() in Algorithm 1.

The computation is organized into epochs that repeat until the model has converged. We assume that $\vec{\pi}$ is a uniform distribution; hence we replace it in the algorithm with the vector of ones. The algorithm evaluates the log-likelihood of the given sequence or sequences during each epoch and computes the respective gradients, so it can be used in both online and batch settings. Afterward, the parameters are updated based on the chosen optimizer. The use of the log-likelihood is necessary because otherwise tiny probabilities may cause computational errors or underflows.

Algorithm 1: The CPD-SGD algorithm
    Result: H, O_T
    initialization
    ln_prob ← 0
    while not_converged() do
        S ← H P^{r−1} ⊙ ··· ⊙ H P^2 ⊙ H P
        T ← softmax(S Hᵀ)
        z ← solve(T)
        T ← T diag(z)
        T_0 ← add_zeros_and_shift(T)
        O_T ← eval_cols(x_k ... x_1)
        forall sequences x_t ... x_1 in the current batch do
            ln_prob ← ln_prob − ln( 1ᵀ T_0 diag([O_T]_{•,x_t}) ··· T_0 diag([O_T]_{•,x_1}) 1 )
        end
        update_params_based_on(ln_prob)
        ln_prob ← 0
    end
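For concreteness, the following PyTorch sketch implements one optimization step in the spirit of Algorithm 1. It is a simplified illustration under several of our own assumptions: the composite-state indexing used to embed T into the square operator, the parameterization of Pr[s | o] through a factor matrix U, and the handling of the dropped Pr[o_t] term are our choices and need not match the authors' implementation exactly.

    import torch

    def khatri_rao(A, B):
        # column-wise Kronecker product, as in Section 2.2
        return torch.einsum('ir,jr->ijr', A, B).reshape(A.shape[0] * B.shape[0], -1)

    class CPDSGDSketch(torch.nn.Module):
        """Illustrative model for one Algorithm 1 step (simplifying assumptions apply)."""

        def __init__(self, n, m, q2, r):
            super().__init__()
            self.n, self.m, self.r = n, m, r
            self.H = torch.nn.Parameter(0.1 * torch.randn(n, q2))   # state vectors
            self.U = torch.nn.Parameter(0.1 * torch.randn(q2, m))   # observation vectors

        def operator(self):
            # S = HP^{r-1} (*) ... (*) HP, with P realized as a cyclic column shift
            S = torch.roll(self.H, self.r - 1, dims=1)
            for k in range(self.r - 2, 0, -1):
                S = khatri_rao(S, torch.roll(self.H, k, dims=1))
            T = torch.softmax(S @ self.H.T, dim=1)                  # (n^{r-1}, n), rows sum to 1
            # embed into a square operator over composite states; our assumption is that
            # composite state i may only move to composites sharing its recent history
            N = self.n ** (self.r - 1)
            cols = (torch.arange(N) % self.n ** (self.r - 2)).unsqueeze(1) * self.n \
                + torch.arange(self.n).unsqueeze(0)
            return torch.zeros(N, N).scatter(1, cols, T)

        def log_likelihood(self, sequences):
            T0 = self.operator()
            N = T0.shape[0]
            # stationary distribution of the composite chain via a differentiable solve
            A = torch.cat([(T0.T - torch.eye(N))[:-1], torch.ones(1, N)])
            z = torch.linalg.solve(A, torch.eye(N)[-1])
            cur = torch.arange(N) % self.n                          # current state of each composite
            z_state = torch.zeros(self.n).scatter_add(0, cur, z)    # Pr[s_t = i]
            total = torch.zeros(())
            for seq in sequences:
                # Bayes replacement: only the columns Pr[s | o] needed by this sequence
                p_s_given_o = torch.softmax(self.H @ self.U[:, seq], dim=0)       # (n, len(seq))
                diag = p_s_given_o[cur] / z_state[cur].clamp(min=1e-12).unsqueeze(1)
                v = torch.ones(N)
                for t in range(seq.shape[0]):                       # 1^T A_{o_t} ... A_{o_1} 1
                    v = T0 @ (diag[:, t] * v)
                total = total + torch.log(v.sum().clamp(min=1e-30))
            return total

    # one optimization step on a toy batch (order-2 model, i.e. r = 3)
    model = CPDSGDSketch(n=4, m=50, q2=16, r=3)
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    batch = [torch.randint(0, 50, (20,)) for _ in range(8)]
    opt.zero_grad()
    loss = -model.log_likelihood(batch)
    loss.backward()
    opt.step()
    print(float(loss))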
4 Experiments

We experimentally evaluated the CPD-SGD training algorithm presented in Section 3. To do so, we generated a synthetic dataset designed to resemble natural-language properties and to emphasize the advantages of higher-order HMMs. To evaluate and compare the trained models, we compute the likelihood of the test sequences under each model.

The focus of this work is to provide a proof of concept of the proposed algorithm; further experiments and applications to additional datasets are therefore left for future work.

4.1 Synthetic Dataset

We generate the synthetic dataset as the output of an HMM of order 5 with 20 hidden states. The number of possible distinct observations is 10 000. The model was created so as to be close to, but not completely, deterministic. The generated sequences are all of length 50, which is long enough to alleviate the influence of the initial-state distribution and short enough not to cause arithmetic underflow. The initial state probabilities of the model were chosen as the stationary distribution of its underlying Markov chain.
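For readers who want to reproduce a (much smaller) dataset of this kind, the sketch below samples observation sequences from a hand-built higher-order HMM. It is our own illustration: the paper's generator additionally draws the initial states from the stationary distribution and uses an order-5, 20-state, 10 000-observation model, whereas this toy version starts from a uniformly random history.

    import numpy as np

    def sample_sequences(T, O, length, count, rng):
        """Draw observation sequences from an order-(r-1) HMM.

        T: transition tensor of shape (n,)*r with T[s_{t-r+1}, ..., s_{t-1}, s_t],
        O: (n, m) emission matrix; both are assumed to be valid probability tables."""
        r = T.ndim
        n, m = O.shape
        sequences = np.empty((count, length), dtype=np.int64)
        for c in range(count):
            # crude stand-in for the initial distribution; a burn-in period could be
            # added so that the history approximates the stationary distribution
            history = list(rng.integers(0, n, size=r - 1))
            for t in range(length):
                probs = T[tuple(history)]                 # Pr[s_t | previous r-1 states]
                state = rng.choice(n, p=probs)
                sequences[c, t] = rng.choice(m, p=O[state])
                history = history[1:] + [state]
        return sequences

    # toy example far smaller than the paper's setup
    rng = np.random.default_rng(42)
    T = rng.random((4, 4, 4)); T /= T.sum(axis=-1, keepdims=True)   # order-2 transitions
    O = rng.random((4, 30));   O /= O.sum(axis=-1, keepdims=True)
    print(sample_sequences(T, O, length=50, count=3, rng=rng).shape)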
4.2 Convergence

To evaluate the convergence of the CPD-SGD algorithm, we divide the dataset into two disjoint sets: training data and test data. The training data is used to estimate the parameters of the HMM, while the test data is used to estimate the model's fit.

In the first experiment, we train an HMM of order 3, represented by state vectors of length 50; the number of hidden states was chosen to be 20. Figure 1 shows the evolution of the model's fit after each epoch (an epoch consisted of 100 sequences). It can be seen that the parameters of the model are being optimized.

Figure 1: Continuous convergence of the CPD-SGD algorithm.

Next, we assess the convergence of the proposed algorithm in comparison with the Baum-Welch algorithm [8] for models of order 1. We use one set of sequences to estimate the parameters of the HMMs and afterward compute each model's fit on a different set of test sequences. We repeat this process 500 times, each time with the sequences reshuffled into train and test sets. Furthermore, we always use 10 different initializations for the Baum-Welch algorithm, train all of them, and choose the model with the best fit.

Figure 2: Distribution of differences between the log-likelihoods of models trained by the Baum-Welch and CPD-SGD algorithms.

The distribution of the differences between the fits of the two algorithms is shown in Figure 2. As apparent from the figure, our algorithm attains results comparable to the traditional approach, with small absolute and relative differences in likelihood.

4.3 Model Comparisons

In the final experiment, we compare the expressive power of higher-order models. We train models of orders 1 up to 5 on the same training set and, after convergence, compute their fit on the test sequences. All models were trained with the CPD-SGD algorithm, with 20 hidden states and state vectors of length 50, and were randomly initialized.

Figure 3: Maximal attained log-likelihoods of models of the given order.

Figure 3 shows the individual fits of these models. As expected, the higher-order models perform better.

5 Conclusion

In this work, we build upon one of the well-known unsupervised learning methods, hidden Markov models, and explore ways to improve the training of HMMs of higher orders. Following up on previous work in the area, we propose a new training algorithm, denoted the CPD-SGD algorithm.

Our algorithm is based on several ideas stemming from, among other areas, the emerging field of tensor decomposition methods. We utilize a permutation decomposition that assigns only one vector to each state while reusing all of its values for each time step. Further, we use a matricized representation of the probability tensor to extend the operator representation of HMMs to higher-order models. By applying the softmax function, we avoid a problematic constrained optimization. Finally, we reformulate the computation of the observation probabilities, making them more efficient to compute for models whose observation space is far larger than their state space.

We experimentally showed that higher-order HMMs can be represented efficiently in memory as decomposed tensors while remaining suitable for accelerated training with stochastic gradient descent. The focus of this paper has been on constructing an efficient training algorithm for higher-order HMMs, and we have verified experimentally that the method can accurately train an HMM on a synthetic dataset.

In future work, we plan to apply the method to real-world problems in natural language processing, which require models with large state and observation spaces.

References
[1] Byung-Jun Yoon. Hidden markov models and their applications in biological sequence analysis. Current Genomics, 10:402–415, 2009.
[2] M. Gales and S. Young. The application of hidden markov models in speech recognition. Found. Trends Signal Process., 1:195–304, 2007.
[3] J. Bobulski and Lukasz Adrjanowicz. Two-dimensional hidden markov models for pattern recognition. In ICAISC, 2013.
[4] M. Seifert, K. Abou-El-Ardat, Betty Friedrich, B. Klink, and A. Deutsch. Autoregressive higher-order hidden markov models: Exploiting local chromosomal dependencies in the analysis of tumor expression profiles. PLoS ONE, 9, 2014.
[5] Lee-Min Lee and Jia-Chien Lee. A study on high-order hidden markov models and applications to speech recognition. In IEA/AIE, 2006.
[6] L. Benyoussef, C. Carincotte, and S. Derrode. Extension of higher-order HMC modeling with application to image segmentation. Digit. Signal Process., 18:849–860, 2008.
[7] W. Zucchini and I. Macdonald. Hidden markov models for time series: An introduction using R. 2009.
[8] L. Rabiner. A tutorial on hidden markov models and selected applications. Proceedings of the IEEE, 1989.
[9] Abhra Sarkar and D. Dunson. Bayesian nonparametric higher order hidden markov models. arXiv: Methodology, 2018.
[10] C. Xiong, Di Yang, and L. Zhang. A high-order hidden markov model and its applications for dynamic car ownership analysis. Transp. Sci., 52:1365–1375, 2018.
[11] U. Hadar and H. Messer. High-order hidden markov models - estimation and implementation. 2009 IEEE/SP 15th Workshop on Statistical Signal Processing, pages 249–252, 2009.
[12] P. Bagos, T. Liakopoulos, and S. Hamodrakas. Faster gradient descent training of hidden markov models, using individual learning rate adaptation. In ICGI, 2004.
[13] P. Baldi and Y. Chauvin. Smooth on-line learning algorithms for hidden markov models. Neural Computation, 6:307–318, 1994.
[14] Y. LeCun, L. Bottou, G. Orr, and K. Müller. Efficient backprop. In Neural Networks: Tricks of the Trade, 2012.
[15] Ahmed Saaudi, Yan Tong, and Csilla Farkas. Probabilistic graphical model on detecting insiders: Modeling with SGD-HMM. In ICISSP, 2019.
[16] Md Pavel Mahmud and Alexander Schliep. Fast MCMC sampling for hidden markov models to determine copy number variations. BMC Bioinformatics, 12:428, 2011.
[17] Daniel J. Hsu, S. Kakade, and Tong Zhang. A spectral algorithm for learning hidden markov models. J. Comput. Syst. Sci., 78:1460–1480, 2009.
[18] R. Harshman. Foundations of the PARAFAC procedure: Models and conditions for an "explanatory" multi-modal factor analysis. 1970.
[19] Yian Ma, Y. Ma, N. Foti, and E. Fox. Stochastic gradient MCMC methods for hidden markov models. In ICML, 2017.
[20] Maxim A. Kuznetsov and I. Oseledets. Tensor train spectral method for learning of hidden markov models (HMM). Computational Methods in Applied Mathematics, 19:93–99, 2019.
[21] I. V. Oseledets. Tensor-train decomposition. SIAM Journal on Scientific Computing, 33(5):2295–2317, 2011.
[22] Majid Janzamin, Rong Ge, Jean Kossaifi, and Anima Anandkumar. Spectral learning on matrices and tensors. Found. Trends Mach. Learn., 12:393–536, 2019.
[23] L. R. Tucker. Some mathematical notes on three-mode factor analysis. Psychometrika, 31(3):279–311, September 1966.
[24] P. Brémaud. Non-homogeneous markov chains. 2020.
[25] Tamara G. Kolda and Brett W. Bader. Tensor decompositions and applications. SIAM Review, 51(3):455–500, September 2009.
[26] Stephan Rabanser, Oleksandr Shchur, and Stephan Günnemann. Introduction to tensor decompositions and their applications in machine learning. ArXiv, abs/1711.10781, 2017.
[27] Christopher J. Hillar and Lek-Heng Lim. Most tensor problems are NP-hard. J. ACM, 60(6), November 2013.
[28] Eyal Even-Dar, S. Kakade, and Y. Mansour. The value of observation for monitoring dynamic systems. In IJCAI, 2007.
[29] H. Jaeger. Observable operator models for discrete stochastic time series. Neural Computation, 12:1371–1398, 2000.
[30] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2015.