=Paper=
{{Paper
|id=Vol-1649/172
|storemode=property
|title=Modeling and Clustering the Behavior of Animals Using Hidden Markov Models
|pdfUrl=https://ceur-ws.org/Vol-1649/172.pdf
|volume=Vol-1649
|authors=Tomáš Šabata, Tomáš Borovička, Martin Holeňa
|dblpUrl=https://dblp.org/rec/conf/itat/SabataBH16
}}
==Modeling and Clustering the Behavior of Animals Using Hidden Markov Models==
ITAT 2016 Proceedings, CEUR Workshop Proceedings Vol. 1649, pp. 172–178, http://ceur-ws.org/Vol-1649, Series ISSN 1613-0073, © 2016 T. Šabata, T. Borovička, M. Holeňa

Tomáš Šabata¹, Tomáš Borovička¹, and Martin Holeňa²

¹ Faculty of Information Technology, Czech Technical University in Prague, Prague, Czech Republic
² Institute of Computer Science, Czech Academy of Sciences, Prague, Czech Republic

Abstract: The objectives of this article are to model the behavior of individual animals and to cluster the resulting models in order to group animals with similar behavior patterns. Hidden Markov models are considered suitable for clustering purposes. Their clustering is well studied, however, only if the observable variables can be assumed to be Gaussian mixtures, which is not valid in our case. Therefore, we use the Kullback-Leibler divergence to cluster hidden Markov models with observable variables that have an arbitrary distribution. Hierarchical and spectral clustering are applied. To evaluate the modeling approach, an experiment was performed and an accuracy of 83.86% was reached in predicting behavioral sequences of individual animals. The results of clustering were evaluated by means of statistical descriptors of the animals and by a domain expert; both methods confirm that the results of clustering are meaningful.

1 Introduction

A mathematical model that describes behavior is called a behavioral model. In particular, we assume that patterns of animals' behavior are reflected in their behavioral models and that the differences can be identified and analyzed on the level of individual animals.

For each animal, a model that represents its behavior in a certain time period is created. A comparison of models of one animal from different periods of time can show changes in its behavior over time. Rapid changes in the behavior can indicate an important event or a disorder. Moreover, different animals can be compared based on their behavioral models. The behavioral model, as a descriptor of behavioral patterns, can be used to group and classify the animals. Furthermore, characteristics of each group can be extracted and changes of behavioral patterns can be tracked.

2 Related work

The field of behavioral analysis is very wide [2]. This paper, however, focuses on a specific kind of behavioral analysis: the analysis of the behavior of animals described as a sequence of states.

To create a sufficiently accurate behavioral model, an abstraction and simplification of the behavior is used. The most common abstraction of an animal's behavior is its description as a sequence of states, which is a commonly used simplification in behavioral modeling [6, 13, 14] and allows creating a generalised and sufficiently accurate model. Each state of the sequence corresponds to an action, and the actions are organized in a finite sequence [6, 13, 14, 17, 20]. The model is expected to be more accurate if the actions are easily separable. Thus, these actions should be mutually disjoint and should cover all activities that an animal can do, which means that the animal is doing exactly one action at a time.

An abstraction of an animal living in a closed environment is much simpler than, for example, an abstraction of the behavior of a human. Such animals can perform only a limited number of actions. These actions are reactions to the internal state of the animal (hunger, thirst, ...), to other animals' behavior, or to the environment.

The methods most commonly used to model sequences are hidden Markov models [14], dynamic Bayesian networks [16], conditional random fields [12], neural networks [15], linear regression models [4] or structured support vector machines [22]. All of these are supervised learning methods, although hidden Markov models can also be estimated in a semi-supervised way [23]. There are also approaches that model behavior with unsupervised learning methods; fuzzy Q-state learning is an example of an unsupervised method that was used to model human behavior, with labels created by the Iterative Bayesian Fuzzy Clustering algorithm [13].
3 Preliminaries

3.1 Hidden Markov Models

Influenced by [6, 14, 20], we have decided to use hidden Markov models for behavioral modeling. In this subsection, the principles of a hidden Markov model (HMM) are recalled.

With each HMM, a random process indexed by time is connected, which is assumed to be in exactly one of a set of N distinct states at any time. At regularly spaced discrete times, the system changes its state according to probabilities of transitions between states. The time steps are denoted t = 1, 2, 3, ..., and the actual state at time step t is denoted q_t.

The process itself is assumed to be a Markov chain, usually a first-order Markov chain, although a Markov chain of any order can be used. It is usually assumed that the Markov chain is homogeneous. This assumption allows the Markov chain to be described by a matrix of transition probabilities A = {a_ij}, formally defined in Equation (1):

a_{ij} = P(q_t = y_j | q_{t−1} = y_i), 1 ≤ i, j ≤ N.   (1)

The simple observable Markov chain is too restrictive to describe reality, but it can be extended. Denoting Y the variable recording the states of the Markov chain, an HMM is obtained by completing Y with a multivariate random variable X. In the context of that HMM, X is called the 'observation variable' or 'output variable', whereas Y is called the 'hidden variable'. The hidden variable takes values in the set {y_1, y_2, ..., y_N} and the observable variable X takes values in the set {x_1, x_2, ..., x_M}.

We assume to have an observation sequence O = o_1 o_2 ... o_T and a state sequence Q = q_1 q_2 ... q_T which corresponds to the observation sequence. An HMM can be characterized using three probability distributions:

1. A state transition probability distribution A = {a_ij}, formally defined by Equation (1).

2. A probability distribution of the observation variables, B = {b_{i,j}}, where b_{i,j} is the probability of the j-th observation value in state y_i, formally defined by Equation (2):

b_{i,j} = P(o_t = x_j | q_t = y_i).   (2)

For a discrete observation variable, B can be described by an N × M stochastic matrix, where M denotes the number of possible values of X.

3. An initial state distribution π = {π_i}, defined by

π_i = P(q_1 = y_i).   (3)

When the initial state distribution is assumed to be stationary, it is usually computed as the stationary distribution of the Markov chain described by the matrix A, i.e., by solving Equation (4):

πA = π, equivalently π_j = ∑_i π_i a_{ij}.   (4)

With these three elements, the HMM is fully defined. The model is denoted λ = (A, B, π). An HMM can be graphically depicted by a trellis diagram, shown in Figure 1.

[Figure 1: Trellis diagram of a HMM, with hidden states q_1, q_2, q_3, ..., q_T emitting observations o_1, o_2, o_3, ..., o_T.]

The joint probability of O and Q, which corresponds to the trellis diagram, is described by Equation (5); by means of the above notation, it can be rewritten as Equation (6):

P(o_1, ..., o_T, q_1, ..., q_T) = P(q_1) P(o_1|q_1) ∏_{k=2}^{T} P(q_k|q_{k−1}) P(o_k|q_k),   (5)

P(o_1, ..., o_T, q_1, ..., q_T) = π_{q_1} b_{q_1,o_1} ∏_{k=2}^{T} a_{q_{k−1},q_k} b_{q_k,o_k}.   (6)
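To make the notation concrete, the following minimal sketch (our own illustration in Python with NumPy, not part of the paper; the toy matrices and dimensions are assumptions) evaluates the joint probability of Equation (6) in log space and the stationary distribution of Equation (4) as the left eigenvector of A for eigenvalue 1.

```python
import numpy as np

def log_joint_probability(A, B, pi, states, observations):
    """Log of Equation (6): P(o_1..o_T, q_1..q_T) for a discrete HMM lambda = (A, B, pi)."""
    q, o = states, observations
    logp = np.log(pi[q[0]]) + np.log(B[q[0], o[0]])
    for k in range(1, len(q)):
        logp += np.log(A[q[k - 1], q[k]]) + np.log(B[q[k], o[k]])
    return logp

def stationary_distribution(A):
    """Solve pi A = pi with sum(pi) = 1 via the left eigenvector of A for eigenvalue 1."""
    eigvals, eigvecs = np.linalg.eig(A.T)
    pi = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
    return pi / pi.sum()

# Toy example with N = 2 hidden states and M = 3 observation symbols.
A = np.array([[0.9, 0.1],
              [0.2, 0.8]])
B = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.3, 0.6]])
pi = stationary_distribution(A)
print(pi)
print(log_joint_probability(A, B, pi, states=[0, 0, 1], observations=[0, 1, 2]))
```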
3.2 Clustering approaches

Hidden Markov models are often used in the cluster analysis of sequences [1, 9, 18]. Clustering sequences is a more complex task than clustering feature vectors, since infinite sequences do not lie in a Euclidean space and sufficiently long finite sequences lie in high-dimensional Euclidean spaces. Although there exist ways to measure the distance between two sequences (Levenshtein distance, Hamming distance, longest common subsequence, etc.), they may not always be meaningful. It may be more reasonable to learn an HMM for each sequence and to cluster the HMMs, the HMM being the random process that produces the sequence.

Because the parameters of HMMs lie on a non-linear manifold, a direct application of the k-means algorithm would not succeed [3]. In addition, the parameters of a particular generative model are not unique; a permutation of states may correspond to the same model [3], which means that different sequences may be generated by the same HMM.

There are already a few approaches to cluster analysis by means of HMMs. One of them uses spectral clustering with the Bhattacharyya divergence [9]. Another dissimilarity measure commonly used in spectral clustering of HMMs is the Kullback-Leibler divergence [10, 21, 24]. The approach called variational hierarchical expectation maximization (VHEM) [3] is also based on spectral clustering; it directly uses the probability distributions of the HMMs instead of constructing an initial embedding and, besides clustering, it generates new HMMs, which serve as representatives of the clusters.

A disadvantage of all the mentioned methods is the assumption that the observations are normally distributed and can be described using a Gaussian mixture model. A distance measure between two HMMs with normally distributed observations can be computed in polynomial time; however, that assumption can be too restrictive for many real-life applications.

Dissimilarity measures of HMMs. The definition of a distance between two HMMs is challenging since an HMM consists of a Markov chain and a joint distribution of observations. In [5], the authors considered two distance measures applicable to HMMs λ and λ′. The first of them is a normalized Euclidean distance of the corresponding emission matrices B and B′,

d_ec(λ, λ′) = √( (1/N) ∑_{i=1}^{N} ∑_{k=1}^{M} ||b_{ik} − b′_{ik}||² ),   (7)

where N denotes the number of states and M denotes the number of observation values. The number of states of both models has to be identical, N = N′. The second distance proposed in [5] is defined by

d_mec(λ, λ′) = √( (1/N) ∑_{i=1}^{N} min_j ∑_{k=1}^{M} ||b_{ik} − b′_{jk}||² ).   (8)

A disadvantage of these distances is that they ignore the Markov chain. Therefore, there exist HMMs λ and λ′ whose distance is zero although the generated probability distributions P_λ and P_λ′ are different. Consequently, both distance measures are not metrics, but only pseudometrics [11]. Although the distances ignore the Markov chain, it is agreed that the observation probability matrix B is, in most cases, a more sensitive set of parameters with respect to the closeness of HMMs than the initial distribution vector π or the state transition matrix A [10].
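Equations (7) and (8) translate directly into a few lines of code. The following sketch (ours, NumPy-based; the example emission matrices are assumptions chosen only for illustration) computes both emission-matrix distances for two HMMs with the same number of states.

```python
import numpy as np

def d_ec(B1, B2):
    """Equation (7): normalized Euclidean distance of emission matrices (same state order)."""
    N = B1.shape[0]
    return np.sqrt(np.sum((B1 - B2) ** 2) / N)

def d_mec(B1, B2):
    """Equation (8): for each state of the first model take the closest state of the second."""
    N = B1.shape[0]
    # Squared distances between every pair of state-conditional observation distributions.
    pairwise = ((B1[:, None, :] - B2[None, :, :]) ** 2).sum(axis=2)
    return np.sqrt(pairwise.min(axis=1).sum() / N)

B1 = np.array([[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]])
B2 = np.array([[0.1, 0.3, 0.6], [0.6, 0.3, 0.1]])   # states permuted and perturbed
print(d_ec(B1, B2), d_mec(B1, B2))
```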
Another distance between HMMs λ and λ′ is the Kullback-Leibler divergence (KL divergence),

d_KL(λ, λ′) = ∫_O (1/G(O)) log( P(O|λ) / P(O|λ′) ) P(O|λ) dO,   (9)

where G(O) is a function weighting the length of the sequence O. Two kinds of weighting functions are commonly used: G(O) equals the length of the sequence if we measure the divergence between two HMMs that generate equally long sequences, and it equals the expected length of the sequences if we measure the divergence between two HMMs that generate sequences of various lengths.

An analytic solution of the integral in Equation (9) exists for HMMs with Gaussian distributed observations [8]; otherwise it has to be calculated numerically, and the time complexity of the numerical computation grows exponentially with the length of the sequence. Since we want to use the divergence for long sequences, we use the approximation described in [5]. We assume to have an ordered set of output sequences F = {O_1, ..., O_L}; any sequence O can become the l-th member of the set with a probability proportional to P(O|λ). Using the Viterbi algorithm, the most likely sequence of states Q_opt can be computed as P(Q_opt, O|λ) = max_Q P(Q, O|λ). This leads to the Viterbi-based KL divergence

d_Vit(λ, λ′) = ∫_O (1/G(O)) log( P(Q_opt, O|λ) / P(Q′_opt, O|λ′) ) P(O|λ) dO.   (10)

We assume that:

• The most probable state sequences of both hidden Markov models are equal. This assumption is reasonable if the two HMMs are not too dissimilar.

• The Markov chain is ergodic. A Markov chain is called ergodic if there is a nonzero probability of getting from any state q to any other state q′ (not necessarily in one move). A Markov chain with an ergodic subset (in particular, an ergodic Markov chain) generates a sufficiently long sequence.

With these assumptions, the KL divergence can be computed using the following equation,

d_Vit(λ, λ′) = (1/G(O)) log( P(Q_l, O|λ) / P(Q_l, O|λ′) ) + ε,   (11)

where ε is an approximation error [5]. According to Subsection 3.1, the equation can be rewritten using the notation of HMMs as shown in Equation (12):

d_Vit(λ, λ′) − ε = (1/G(O)) ∑_{t=1}^{T−1} ( log a_{y_t,y_{t+1}} − log a′_{y_t,y_{t+1}} ) + (1/G(O)) ∑_{t=1}^{T} ( log b_{y_t,x_t} − log b′_{y_t,x_t} ).   (12)

If the sequence O is long enough, the law of large numbers allows us to approximate Equation (12) as (13) [5]:

d_Vit(λ, λ′) ≈ δ̃_Vit = ∑_{i,j} a_{ij} π_i ( log a_{ij} − log a′_{ij} ) + ∑_{i,k} b_{ik} π_i ( log b_{ik} − log b′_{ik} ).   (13)

A disadvantage of the KL divergence is that it is not symmetric.

The Bhattacharyya divergence is also commonly used for HMMs. For two probability distributions f and g, it is defined by Equation (14). Similarly to the KL divergence, it can be easily computed if the HMM has normally distributed observations [7]; in contrast to the KL divergence, the Bhattacharyya divergence is symmetric:

d_B(f, g) = − ln ∫ √( f(x) g(x) ) dx.   (14)

Using a chosen dissimilarity measure, a dissimilarity matrix can be created and used for clustering, in particular in hierarchical and spectral clustering algorithms.
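The approximation in Equation (13) depends only on the model parameters, so it can be evaluated without generating any sequences. A minimal sketch follows (ours, NumPy-based; taking π as the stationary distribution of the first model's Markov chain and adding a small constant inside the logarithms to guard against zero entries are our assumptions, not prescriptions from the paper).

```python
import numpy as np

def kl_viterbi_approx(A1, B1, A2, B2, eps=1e-12):
    """Approximate KL divergence of Equation (13) between two discrete HMMs."""
    # Stationary distribution of the first model's Markov chain (ergodicity assumed).
    eigvals, eigvecs = np.linalg.eig(A1.T)
    pi = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
    pi = pi / pi.sum()

    term_A = np.sum(A1 * pi[:, None] * (np.log(A1 + eps) - np.log(A2 + eps)))
    term_B = np.sum(B1 * pi[:, None] * (np.log(B1 + eps) - np.log(B2 + eps)))
    return term_A + term_B

A1 = np.array([[0.9, 0.1], [0.2, 0.8]])
B1 = np.array([[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]])
A2 = np.array([[0.8, 0.2], [0.3, 0.7]])
B2 = np.array([[0.6, 0.3, 0.1], [0.2, 0.3, 0.5]])
print(kl_viterbi_approx(A1, B1, A2, B2))   # not symmetric in general
```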
Spectral clustering uses similarities between objects represented as a weighted graph. There are three ways to create such a graph:

1. The ε-neighborhood graph. All pairs of points with distances smaller than ε are connected. The graph is often considered unweighted, because its edges are based on the dichotomous property of having a distance of at most ε.

2. The k-nearest neighbor graph. Vertex v_i is connected with vertex v_j if v_j is among the k nearest neighbors of v_i. This construction is usually used in image segmentation, where each pixel has its neighbourhood defined by 4 or 8 pixels.

3. The fully connected graph. All points whose pairwise similarity is positive are connected. This construction is usually chosen if the similarity function itself already encodes mainly local neighborhoods. An example of such a similarity function is the Gaussian similarity function s(x_i, x_j) = exp(−||x_i − x_j||² / (2σ²)), where the parameter σ controls the width of the neighbourhood.

Spectral clustering can be performed by three different algorithms [19]:

1. unnormalized spectral clustering,

2. normalized spectral clustering according to Shi and Malik,

3. normalized spectral clustering according to Ng, Jordan, and Weiss.
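As an illustration of this pipeline, the sketch below (ours; it relies on scikit-learn's SpectralClustering with a precomputed affinity, which implements normalized spectral clustering but is not necessarily the exact Shi-Malik variant used by the authors) turns a non-symmetric divergence matrix into a fully connected Gaussian similarity graph and clusters it. Symmetrizing the divergence by averaging is our assumption.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def spectral_clusters_from_divergence(D, n_clusters=4, sigma=0.65):
    """Cluster models from a (possibly non-symmetric) divergence matrix D."""
    D_sym = 0.5 * (D + D.T)                        # the KL divergence is not symmetric
    W = np.exp(-D_sym ** 2 / (2.0 * sigma ** 2))   # Gaussian similarity, fully connected graph
    model = SpectralClustering(n_clusters=n_clusters, affinity="precomputed",
                               random_state=0)
    return model.fit_predict(W)

# Usage with a toy 6 x 6 divergence matrix (zeros on the diagonal).
rng = np.random.default_rng(0)
D = rng.uniform(0.5, 2.0, size=(6, 6))
np.fill_diagonal(D, 0.0)
print(spectral_clusters_from_divergence(D, n_clusters=2))
```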
4 Proposed approach

This article focuses on behavioral modeling and analysis of animals that live in a closed environment. This simplifies the abstraction of the behavior, since the number of possible actions of animals in a closed environment is limited. The proposed approach is illustrated on a herd of cows containing one hundred individuals.

The transition and emission matrices of the hidden Markov model are estimated from a sequence of states and a sequence of observable values; therefore, a set of possible states and a set of possible values of the observable variables have to be predefined. Each element of the sequence describes one minute of the animal's behavior.

Five possible states representing actions that an animal can perform are defined, denoted S1, S2, S3, S4 and S5. The states S1, S2, S3 and S4 correspond to eating, drinking, resting and being milked; an animal is considered to be in the state S5 if it is not in any of the states S1-S4.

Two observable variables are calculated for each state except the state S5:

1. the duration of the last occurrence of the state,

2. the time elapsed since the last occurrence of the state.

These eight variables are discretized by equal-frequency binning into four or five bins. The ninth observable variable represents the time of day, which is discretized into six blocks of four hours. After discretization, the Cartesian product of the value sets of the observable variables contains 1,500,000 combinations of values.

To avoid zero probabilities in the emission and transition matrices, empirical transition and emission matrices averaged over all models of the respective animal are used as initial estimates. Zero probabilities in those initial estimates are replaced with small probabilities estimated as one over the number of elements in the matrix.

Since the assumption of normally distributed observations is not valid in our case, we use the approximation of the Kullback-Leibler divergence, recalled in Subsection 3.2, as the dissimilarity measure for HMM clustering. Using this approximation, a distance matrix D for the set of HMMs constructed for the individual animals is computed, i.e., D_{i,j} is the KL divergence of the j-th from the i-th animal's model, D_{i,j} = d_Vit(λ_i, λ_j). The properties of the KL divergence imply that the matrix is real valued, non-symmetric, and has zeros on its diagonal.

Hierarchical and spectral clustering were applied to cluster the models. A parameter of hierarchical clustering is the kind of linkage; since HMMs are not points in a Euclidean space, the single, average or complete linkage can be used (a sketch follows at the end of this section). An optimal number of clusters can be determined by the domain expert from the dendrogram of the hierarchical clustering.

For the spectral clustering, a fully connected graph based on the Gaussian similarity function is used as the similarity graph. The estimation of the parameter σ of the similarity function is based on optimizing the balance of the clusters.

The results of clustering were subsequently analysed using descriptive characteristics (e.g., the age and weight of the animal) that were not among the observable variables.
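For the hierarchical alternative, the following sketch (ours, SciPy-based; symmetrizing the divergence matrix before linkage is our assumption, since the KL approximation is not symmetric) builds a complete-linkage clustering directly from the divergence matrix; the dendrogram can be drawn from the returned linkage matrix.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def hierarchical_clusters_from_divergence(D, n_clusters=4, method="complete"):
    """Complete-linkage hierarchical clustering of models from a divergence matrix D."""
    D_sym = 0.5 * (D + D.T)                        # make the matrix symmetric
    np.fill_diagonal(D_sym, 0.0)
    condensed = squareform(D_sym, checks=False)    # condensed distance vector for SciPy
    Z = linkage(condensed, method=method)          # dendrogram can be plotted from Z
    return fcluster(Z, t=n_clusters, criterion="maxclust")

rng = np.random.default_rng(1)
D = rng.uniform(0.5, 2.0, size=(6, 6))
np.fill_diagonal(D, 0.0)
print(hierarchical_clusters_from_divergence(D, n_clusters=3))
```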
5 Experimental results

The experiment is divided into two parts: behavioral modeling and clustering of the models. Section 5.1 describes the results of the modeling using HMMs and Section 5.2 presents the results of clustering.

5.1 Hidden Markov modeling

The dataset, which consists of ten consecutive days, was split into a train and a test set in the ratio 9:1. The parameters of a model were estimated from the data of nine consecutive days and the model was evaluated on the data of the tenth day. For each animal, the transition and emission matrices were estimated using the Baum–Welch algorithm.

Hidden Markov models were evaluated in two different ways: visually by a domain expert and using the Viterbi sequence. The domain expert checks whether the real behavior of the animals is consistent with the stationary distribution of the resulting Markov chain, which describes the behavior to which the Markov chain converges. Consequently, the stationary distribution describes how much time an animal spends doing some activity in comparison with all activities.

Stationary distributions were computed for each animal separately. A histogram of the probabilities of the individual states in the stationary distributions is shown in Figure 2.

[Figure 2: For each animal, the stationary distribution of its Markov chain is calculated from a sequence of 14400 measurements; the values of these distributions are visualized separately for each state.]

The probabilities of the stationary distributions averaged over all animals are given in Table 1.

Table 1: Average stationary distribution.

State | Average probability
S1 | 8.5%
S2 | 0.4%
S3 | 44.8%
S4 | 1.8%
S5 | 44.5%

The second way of evaluation uses the Viterbi sequence. The Viterbi sequence determines the most likely sequence of states given a sequence of observations and is denoted v_1 v_2 ... v_T. To obtain the accuracy of the prediction of a sequence of states, the real sequences of states are compared element-wise with the Viterbi sequences. The accuracy is calculated as the number of elements that do not differ divided by the length of the sequence; it was calculated separately for each state as well as for the whole sequence. The total accuracy was calculated as

Acc = |{i : v_i = q_i}| / T   (15)

and the accuracy of a state s was calculated as

Acc_s = |{i : v_i = q_i, q_i = s}| / |{j : q_j = s}|.   (16)

The resulting models are able to predict animals' actions with an average accuracy of 83.86%. The state S5 has the worst accuracy of all states, which is caused by a delay: a model can relatively precisely detect that an action has started, but the detection of the termination of the action is delayed. Figure 3 visualizes the results of the evaluation of the Viterbi sequences.

[Figure 3: Accuracy of the Viterbi sequence for the states S1-S5.]
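Equations (15) and (16) translate directly into code. The sketch below (ours, NumPy-based; the toy sequences are assumptions for illustration only) compares a Viterbi sequence with the true state sequence.

```python
import numpy as np

def total_accuracy(viterbi, truth):
    """Equation (15): fraction of positions where the Viterbi state equals the true state."""
    viterbi, truth = np.asarray(viterbi), np.asarray(truth)
    return np.mean(viterbi == truth)

def state_accuracy(viterbi, truth, state):
    """Equation (16): accuracy restricted to positions whose true state is `state`."""
    viterbi, truth = np.asarray(viterbi), np.asarray(truth)
    mask = truth == state
    return np.mean(viterbi[mask] == truth[mask]) if mask.any() else np.nan

truth   = np.array([0, 0, 2, 2, 4, 4, 4, 3])
viterbi = np.array([0, 0, 2, 4, 4, 4, 3, 3])
print(total_accuracy(viterbi, truth))
print({int(s): state_accuracy(viterbi, truth, s) for s in np.unique(truth)})
```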
5.2 Clustering

In this subsection, we discuss the results and the evaluation of the cluster analysis. The visualisation of the distance matrix is shown in Figure 4; the color of the point (i, j) represents the distance between the i-th animal's model and the j-th animal's model, and the lighter the color, the less similar the models are. The distance matrix is used by both clustering methods.

[Figure 4: Kullback-Leibler divergence based distance matrix between the estimated models.]

Linkages of hierarchical clustering can be visualized by dendrograms; a dendrogram is used to visually estimate an optimal number of clusters. The dendrogram of the complete linkage is shown in Figure 5. According to it, the animals can be assigned into several approximately equally sized clusters.

[Figure 5: The dendrogram for the complete linkage.]

The evaluation of clustering results is a difficult task since there are no references for validation. However, clusters can be validated by a domain expert based on descriptive characteristics related to the animals (e.g., age, weight). These characteristics were not used for modeling, thus they cannot influence the clustering. Figures 6, 7 and 8 show the results of the spectral clustering according to Shi and Malik with a fully connected similarity graph and σ equal to 0.65. The figures show that there are differences in the descriptive statistics of such a characteristic between the individual clusters; for example, animals assigned to cluster 1 have significantly higher values of the characteristic than animals of the other clusters. This indicates that the clustering is reasonable and meaningful.

[Figure 6: Proportions of the four clusters created using spectral clustering according to Shi and Malik.]

[Figure 7: Boxplots of an unobserved descriptive characteristic showing differences between the four clusters.]

[Figure 8: Distribution of an unobserved descriptive characteristic showing differences between the four clusters.]

6 Conclusion

Hidden Markov models proved to be applicable to the modeling of animal behavior represented as a sequence of states. According to the evaluation, the models are able to predict animals' actions with an average accuracy of 83.86%. Furthermore, the stationary distributions of the resulting models were validated by a domain expert. Cluster analysis was performed by classical clustering algorithms using the approximation of the KL divergence. The clustering produces meaningful results and groups the animals into interpretable clusters.

Prediction can be further improved with a better definition of states; in particular, the state S5 can be divided into more disjoint states, which may also positively influence the results of the clustering. Moreover, different modeling approaches commonly used for modeling sequences, such as dynamic Bayesian networks and conditional random fields, are intended to be researched, as well as their clustering possibilities.

References

[1] Jonathan Alon, Stan Sclaroff, George Kollios, and Vladimir Pavlovic. Discovering clusters in motion time-series data. In Computer Vision and Pattern Recognition, 2003. Proceedings. 2003 IEEE Computer Society Conference on, volume 1, pages I-375. IEEE, 2003.

[2] Longbing Cao and Philip S Yu. Behavior computing. Springer, London, 2012.

[3] Emanuele Coviello, Gert R Lanckriet, and Antoni B Chan. The variational hierarchical EM algorithm for clustering hidden Markov models. In Advances in Neural Information Processing Systems, pages 404–412, 2012.

[4] Qi Dai, Xiao-Qing Liu, Tian-Ming Wang, and Damir Vukicevic. Linear regression model of DNA sequences and its application. Journal of Computational Chemistry, 28(8):1434–1445, 2007.

[5] Markus Falkhausen, Herbert Reininger, and Dietrich Wolf. Calculation of distance measures between hidden Markov models. In Proc. Eurospeech, pages 1487–1490, 1995.

[6] Y. Guo, G. Poulton, P. Corke, G. J. Bishop-Hurley, T. Wark, and D. L. Swain. Using accelerometer, high sample rate GPS and magnetometer data to develop a cattle movement and behaviour model. Ecological Modelling, 220(17):2068–2075, 2009.

[7] John R Hershey and Peder A Olsen. Variational Bhattacharyya divergence for hidden Markov models. In Acoustics, Speech and Signal Processing, 2008. ICASSP 2008. IEEE International Conference on, pages 4557–4560. IEEE, 2008.

[8] John R Hershey, Peder A Olsen, and Steven J Rennie. Variational Kullback-Leibler divergence for hidden Markov models. In Automatic Speech Recognition & Understanding, 2007. ASRU. IEEE Workshop on, pages 323–328. IEEE, 2007.

[9] Tony Jebara, Yingbo Song, and Kapil Thadani. Spectral clustering and embedding with hidden Markov models. In Machine Learning: ECML 2007, pages 164–175. Springer, 2007.

[10] Biing-Hwang Fred Juang and Lawrence R Rabiner. A probabilistic distance measure for hidden Markov models. AT&T Technical Journal, 64(2):391–408, 1985.

[11] John L. Kelley. General Topology (Graduate Texts in Mathematics). Springer, 1975.

[12] John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. pages 282–289, 2001.
[13] Sang Wan Lee, Yong Soo Kim, and Zeungnam Bien. A nonsupervised learning framework of human behavior patterns based on sequential actions. IEEE Transactions on Knowledge and Data Engineering, 22(4):479–492, 2010.

[14] Maja Matetić, Slobodan Ribarić, and Ivo Ipšić. Qualitative modelling and analysis of animal behaviour. Applied Intelligence, 21(1):25–44, 2004.

[15] Bill O'Brien, John Dooley, and Thomas J Brazil. RF power amplifier behavioral modeling using a globally recurrent neural network. In Microwave Symposium Digest, 2006. IEEE MTT-S International, pages 1089–1092. IEEE, 2006.

[16] Vladimir Pavlović, James M Rehg, Tat-Jen Cham, and Kevin P Murphy. A dynamic Bayesian network approach to figure tracking using learned dynamic models. In Computer Vision, 1999. The Proceedings of the Seventh IEEE International Conference on, volume 1, pages 94–101. IEEE, 1999.

[17] Amy L. Sliva. Scalable techniques for behavioral analysis and forecasting. 2011.

[18] Padhraic Smyth et al. Clustering sequences with hidden Markov models. Advances in Neural Information Processing Systems, pages 648–654, 1997.

[19] Ulrike von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4):395–416, 2007.

[20] Nelleke d. Weerd, Frank v. Langevelde, Herman v. Oeveren, Bart A. Nolet, Andrea Kölzsch, Herbert H. T. Prins, and W. F. Boer. Deriving animal behaviour from high-frequency GPS: Tracking cows in open and forested habitat. PLoS One, 10(6), 2015.

[21] Jie Yin and Qiang Yang. Integrating hidden Markov models and spectral analysis for sensory time series clustering. In Data Mining, Fifth IEEE International Conference on, pages 8 pp. IEEE, 2005.

[22] S.-X. Zhang. Structured Support Vector Machines for Speech Recognition. PhD thesis, Cambridge University, March 2014.

[23] Shi Zhong. Semi-supervised sequence classification with HMMs. In Valerie Barr and Zdravko Markov, editors, FLAIRS Conference, pages 568–574. AAAI Press, 2004.

[24] Shi Zhong and Joydeep Ghosh. A unified framework for model-based clustering. The Journal of Machine Learning Research, 4:1001–1037, 2003.