ITAT 2016 Proceedings, CEUR Workshop Proceedings Vol. 1649, pp. 172–178
http://ceur-ws.org/Vol-1649, Series ISSN 1613-0073, © 2016 T. Šabata, T. Borovička, M. Holeňa



                               Modeling and Clustering the Behavior of Animals
                                       Using Hidden Markov Models

Tomáš Šabata1, Tomáš Borovička1, and Martin Holeňa2

1 Faculty of Information Technology, Czech Technical University in Prague, Prague, Czech Republic
2 Institute of Computer Science, Czech Academy of Sciences, Prague, Czech Republic

Abstract: The objectives of this article are to model the behavior of individual animals and to cluster the resulting models in order to group animals with similar behavior patterns. Hidden Markov models are considered suitable for clustering purposes. Their clustering is well studied, however, only if the observable variables can be assumed to be Gaussian mixtures, which is not valid in our case. Therefore, we use the Kullback-Leibler divergence to cluster hidden Markov models with observable variables that have an arbitrary distribution. Hierarchical and spectral clustering is applied. To evaluate the modeling approach, an experiment was performed and an accuracy of 83.86% was reached in predicting behavioral sequences of individual animals. Results of clustering were evaluated by means of statistical descriptors of the animals and by a domain expert; both methods confirm that the results of clustering are meaningful.

1   Introduction

A mathematical model that describes behavior is called a behavioral model. In particular, we assume that patterns of animals' behavior are reflected in their behavioral models and that the differences can be identified and analyzed on the individual level.

For each animal, a model that represents its behavior in a certain time period is created. A comparison of models of one animal from different periods of time can show changes in its behavior over time. Rapid changes in the behavior can indicate an important event or disorder.

Moreover, different animals can be compared based on their behavioral models. The behavioral model, as a descriptor of the behavioral patterns, can be used to group and classify the animals. Furthermore, characteristics of each group can be extracted and changes of behavioral patterns can be tracked.

2   Related work

The field of behavioral analysis is very wide [2]. However, this paper focuses on a specific kind of behavioral analysis: behavioral analysis of animals as a sequence of states.

To create a sufficiently accurate behavioral model, an abstraction and simplification of the behavior is used. The description of behavior as a sequence of states is the most common such abstraction in modeling behavior [6, 13, 14], and it allows creating a generalised yet sufficiently accurate behavioral model. Each state of the sequence corresponds to an action, and the actions are organized in a finite sequence [6, 13, 14, 17, 20]. The model is expected to be more accurate if the actions are easily separable. Thus, these actions should be mutually disjoint and cover all activities that an animal can do, which means that the animal is doing exactly one action at a time.

An abstraction of an animal living in a closed environment is much simpler than, for example, an abstraction of the behavior of a human. Such animals can do only a limited number of actions. These actions are reactions to the internal state of the animal (hunger, thirst, ...), reactions to other animals' behavior, or reactions to the environment.

The most commonly used methods to model sequences are hidden Markov models [14], dynamic Bayesian networks [16], conditional random fields [12], neural networks [15], linear regression models [4] or structured support vector machines [22]. All these methods belong to supervised learning, although hidden Markov models can also be estimated in a semi-supervised way [23]. There are also approaches that utilize unsupervised learning methods to model behavior. Fuzzy Q-state learning is an example of an unsupervised method that was used to model human behavior; in this approach, labels are created by the iterative Bayesian fuzzy clustering algorithm [13].

3   Preliminaries

3.1   Hidden Markov Models

Influenced by [6, 14, 20], we have decided to use hidden Markov models for behavioral modeling. In this subsection, the principles of a hidden Markov model (HMM) will be recalled.

With each HMM, a random process indexed by time is connected, which is assumed to be in exactly one of a set of N distinct states at any time. At regularly spaced discrete times, the system changes its state according to probabilities of transitions between states. The time steps associated with these changes are denoted t = 1, 2, 3, .... The actual state at a time step t is denoted q_t.

The process itself is assumed to be a Markov chain, usually a first-order Markov chain, although a Markov chain of any order can be used. It is usually assumed that the Markov chain is homogeneous. This assumption allows the Markov chain to be described by a matrix of transition probabilities A = {a_{ij}}, which is formally defined in Equation (1).

   a_{ij} = P(q_t = y_j | q_{t−1} = y_i),   1 ≤ i, j ≤ N     (1)

The simple observable Markov chain is too restrictive to describe reality; however, it can be extended. Denoting Y the variable recording the states of the Markov chain, a HMM is obtained through completing Y with a multivariate random variable X. In the context of that HMM, X is called the 'observation variable' or 'output variable', whereas Y is called the 'hidden variable'. The hidden variable takes values in the set {y_1, y_2, ..., y_N} and the observable variable X takes values in the set {x_1, x_2, ..., x_M}.

We assume to have an observation sequence O = o_1 o_2 ... o_T and a state sequence Q = q_1 q_2 ... q_T which corresponds to the observation sequence. A HMM can be characterized using three probability distributions:

1. A state transition probability distribution A = {a_{ij}}, which is formally defined by Equation (1).

2. A probability distribution of observation variables, B = {b_{i,j}}, where b_{i,j} is the probability of observing x_j in state y_i; it is formally defined as (2).

      b_{i,j} = P(o_t = x_j | q_t = y_i)     (2)

   The matrix B of a discrete observable variable can be described as an N × M stochastic matrix, where M denotes the number of possible values of X.

3. An initial state distribution π = {π_i} defined by

      π_i = P(q_1 = y_i)     (3)

   When the initial state distribution is assumed to be stationary, it is usually computed as the stationary distribution of the Markov chain described by the matrix A, i.e., by solving Equation (4).

      πA = π,   i.e.,   ∑_i π_i a_{ij} = π_j     (4)

Figure 1: Trellis diagram of a HMM.

With these three elements, the HMM is fully defined. The model is denoted λ = (A, B, π). A HMM can be graphically depicted by a trellis diagram, which is shown in Figure 1. The joint probability of O and Q, which corresponds to the trellis diagram, is described by Equation (5). By means of the above notation, it can be formally written as in Equation (6).

   P(o_1, ..., o_T, q_1, ..., q_T) = P(q_1) P(o_1|q_1) ∏_{k=2}^{T} P(q_k|q_{k−1}) P(o_k|q_k)     (5)

   P(o_1, ..., o_T, q_1, ..., q_T) = π_{q_1} b_{q_1,o_1} ∏_{k=2}^{T} a_{q_{k−1},q_k} b_{q_k,o_k}     (6)
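To make the notation concrete, the following short sketch (our own illustration in Python, not code from the paper) evaluates the joint log-probability of Equation (6) for a discrete HMM λ = (A, B, π); the toy matrices are hypothetical.

    import numpy as np

    def joint_log_prob(A, B, pi, states, obs):
        """log P(O, Q | lambda) for a discrete HMM, following Equation (6)."""
        logp = np.log(pi[states[0]]) + np.log(B[states[0], obs[0]])
        for k in range(1, len(states)):
            logp += np.log(A[states[k - 1], states[k]])   # transition term a_{q_{k-1}, q_k}
            logp += np.log(B[states[k], obs[k]])          # emission term b_{q_k, o_k}
        return logp

    # Toy model with N = 2 hidden states and M = 3 observation symbols.
    A = np.array([[0.9, 0.1],
                  [0.2, 0.8]])                    # transition matrix
    B = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.3, 0.6]])               # emission matrix
    pi = np.array([0.6, 0.4])                     # initial state distribution
    print(joint_log_prob(A, B, pi, states=[0, 0, 1], obs=[0, 1, 2]))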
3.2   Clustering approaches

Hidden Markov models are often used in cluster analysis of sequences [1, 9, 18]. Clustering sequences is a more complex task than clustering feature vectors, since infinite sequences are not in a Euclidean space and sufficiently long finite sequences are in a high-dimensional Euclidean space. Although there exist approaches to measuring the distance between two sequences (Levenshtein distance, Hamming distance, longest common subsequence, etc.), they may not always be meaningful. It may be more reasonable to learn a HMM for each sequence and to cluster the HMMs; a HMM is the random process that produces the sequence.

Due to the fact that HMMs' parameters lie on a non-linear manifold, application of the k-means algorithm would not succeed [3]. In addition, parameters of a particular generative model are not unique; therefore, a permutation of states may correspond to the same model [3]. It means that different sequences may be generated by the same HMM.

There are already a few approaches to cluster analysis by means of HMMs. One of them uses spectral clustering with the Bhattacharyya divergence [9]. Another commonly used dissimilarity measure in spectral clustering of HMMs is the Kullback-Leibler divergence [10, 21, 24].

The approach called variational hierarchical expectation maximization (VHEM) [3] is also based on spectral clustering. It directly uses the probability distributions of the HMMs instead of constructing an initial embedding and, besides clustering, it generates new HMMs. The newly generated HMMs are representatives of each cluster.

A disadvantage of all the mentioned methods is the assumption that observations are normally distributed and can be described using a Gaussian mixture model.

A distance measure between two HMMs with normally distributed observations can be computed in polynomial time. However, that assumption can be too restrictive for many real-life applications.

Dissimilarity measures of HMMs   The definition of a distance between two HMMs is challenging, since a HMM consists of a Markov chain and a joint distribution of observations. In [5], the authors considered two distance measures applicable to HMMs λ and λ′. The first of them is a normalized Euclidean distance of the corresponding matrices B, B′,

   d_ec(λ, λ′) = √( (1/N) ∑_{i=1}^{N} ∑_{k=1}^{M} ||b_{ik} − b′_{ik}||² ),     (7)

where N denotes the number of states and M denotes the number of observations. The number of states of both models has to be identical, N = N′. The second distance proposed in [5] is defined by

   d_mec(λ, λ′) = √( (1/N) ∑_{i=1}^{N} min_j ∑_{k=1}^{M} ||b_{ik} − b′_{jk}||² ).     (8)

A disadvantage of the considered distances is that they ignore the Markov chain. Therefore, there exist HMMs λ and λ′ whose distance is zero although the generated probability distributions P_λ and P_λ′ are different. Consequently, both distance measures are not metrics, but only pseudometrics [11]. Although the distances ignore the Markov chain, it is agreed that the observation probability distribution matrix B is, in most cases, a more sensitive set of parameters related to the closeness of HMMs than the initial distribution vector π or the state transition matrix A [10].
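For illustration, a minimal Python sketch of the two emission-matrix distances in Equations (7) and (8); the function names are ours, and both assume that the compared models have the same numbers of states and observation symbols.

    import numpy as np

    def d_ec(B1, B2):
        """Equation (7): normalized Euclidean distance of the emission matrices (N x M)."""
        N = B1.shape[0]
        return np.sqrt(np.sum((B1 - B2) ** 2) / N)

    def d_mec(B1, B2):
        """Equation (8): for each state of the first model, take the closest state of the second."""
        N = B1.shape[0]
        # sq_dists[i, j] = sum_k (B1[i, k] - B2[j, k])^2
        sq_dists = ((B1[:, None, :] - B2[None, :, :]) ** 2).sum(axis=2)
        return np.sqrt(sq_dists.min(axis=1).sum() / N)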
Another distance between HMMs λ, λ′ is the Kullback-Leibler divergence (KL divergence)

   d_KL(λ, λ′) = ∫_O (1/G(O)) log( P(O|λ) / P(O|λ′) ) P(O|λ) dO,     (9)

where G(O) is a function weighting the length of the sequence O. Two kinds of such weighting functions are commonly used. The function G(O) is equal to the length of the sequence if we measure the divergence between two HMMs that generate equally long sequences. It is equal to the expected value of the lengths of the sequences if we measure the divergence between two HMMs which generate variously long sequences.

An analytic solution of the integral in Equation (9) exists for HMMs with Gaussian distributed observations [8]; otherwise it has to be calculated numerically. The time complexity of the numerical computation grows exponentially with the length of the sequence. Since we want to use it for long sequences, we use an approximation described in [5]. We assume to have an ordered set of output sequences F = {O_1, ..., O_L}. Any sequence O could become the l-th member of the set with a probability proportional to P(O|λ). Using the Viterbi algorithm, the most likely sequence Q_opt of states can be computed by P(Q_opt, O|λ) = max_Q P(Q, O|λ). This leads to the KL divergence

   d_Vit(λ, λ′) = ∫_O (1/G(O)) log( P(Q_opt, O|λ) / P(Q′_opt, O|λ′) ) P(O|λ) dO.     (10)

We assume that:

   • The most probable sequences of states are equal for both HMMs. This assumption is reasonable if the two HMMs are not too dissimilar.

   • The Markov chain is ergodic. A Markov chain is called ergodic if there is a nonzero probability to get from any state q to any other state q′ (not necessarily in one move). A Markov chain with an ergodic subset (in particular, an ergodic Markov chain) generates a sufficiently long sequence.

With these assumptions, the KL divergence can be computed using the following equation

   d_Vit(λ, λ′) = (1/G(O)) log( P(Q_l, O|λ) / P(Q_l, O|λ′) ) P(O|λ) + ε,     (11)

where ε is an approximation error [5]. According to Subsection 3.1, the equation can be rewritten using the notation of HMMs as shown in Equation (12).

   d_Vit(λ, λ′) − ε = (1/G(O)) ∑_{t=1}^{T−1} ( log a_{y_t,y_{t+1}} − log a′_{y_t,y_{t+1}} ) + (1/G(O)) ∑_{t=1}^{T} ( log b_{y_t,x_t} − log b′_{y_t,x_t} )     (12)

If a sequence O is long enough, the law of large numbers allows us to approximate Equation (12) as (13) [5].

   d_Vit(λ, λ′) ≈ δ̃_Vit = ∑_{i,j} a_{ij} π_i ( log a_{ij} − log a′_{ij} ) + ∑_{i,k} b_{ik} π_i ( log b_{ik} − log b′_{ik} )     (13)

A disadvantage of the KL divergence is that it is not symmetric.
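A minimal sketch of the approximation in Equation (13), assuming both HMMs are given by their transition and emission matrices and that π is the stationary distribution of the first model's Markov chain; the smoothing constant is our own guard against log(0) and is not part of the original formula.

    import numpy as np

    def kl_vit_approx(A1, B1, pi1, A2, B2, eps=1e-12):
        """Approximate KL divergence d_Vit(lambda, lambda') following Equation (13).

        A1, A2: N x N transition matrices; B1, B2: N x M emission matrices;
        pi1: stationary distribution of the first model's Markov chain.
        """
        A1, A2 = A1 + eps, A2 + eps   # avoid log(0) at zero probabilities
        B1, B2 = B1 + eps, B2 + eps
        trans = np.sum(pi1[:, None] * A1 * (np.log(A1) - np.log(A2)))
        emis = np.sum(pi1[:, None] * B1 * (np.log(B1) - np.log(B2)))
        return trans + emis

Because the measure is not symmetric, a symmetric dissimilarity, if needed, can be obtained, for instance, by averaging kl_vit_approx evaluated in both directions.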
The Bhattacharyya divergence is also a commonly used metric for HMMs. For two probability distributions f and g, it is defined by Equation (14). Similarly to the KL divergence, it can be easily computed if a HMM has normally distributed observations [7]. Differently from the KL divergence, the Bhattacharyya divergence is symmetric.

   d_B(f, g) = − ln ∫ √( f(x) g(x) ) dx     (14)

Using a chosen dissimilarity measure, a dissimilarity matrix can be created and used for clustering, in particular, in hierarchical and spectral clustering algorithms.

Spectral clustering uses similarities between objects represented as a weighted graph. There are three ways how to create such a graph:

1. The ε-neighborhood graph. In this way, all pairs of points with distances smaller than ε are connected. The graph is often considered as unweighted because its edges are based on the dichotomous property of distances being at most ε.

2. The k-nearest neighbor graph. In this way, the goal is to connect vertex v_i with vertex v_j if v_j is among the k nearest neighbors of v_i. This way is usually used in image segmentation, where each pixel has its neighbourhood defined by 4 or 8 pixels.

3. The fully connected graph. This way means connecting all points the pairwise similarity of which is positive. This construction is usually chosen if the similarity function itself already encodes mainly local neighborhoods. An example of such a similarity function is the Gaussian similarity function s(x_i, x_j) = exp(−||x_i − x_j||² / (2σ²)). The parameter σ controls the width of the neighbourhood.

Spectral clustering can be performed by three different algorithms [19]:

1. unnormalized spectral clustering,

2. normalized spectral clustering according to Shi and Malik,

3. normalized spectral clustering according to Ng, Jordan, and Weiss.

4   Proposed approach

This article focuses on behavioral modeling and analysis of animals that live in a closed environment. This simplifies the abstraction of the behavior, since the number of possible actions of animals in a closed environment is limited. The proposed approach has been illustrated on a herd of cows containing one hundred individuals.

The transition and emission matrices of the hidden Markov model are estimated using a sequence of states and a sequence of observable values; therefore, a set of possible states and a set of possible observable variable values have to be predefined. Each element of the sequence describes one minute of an animal's behavior.

Five possible states that represent actions which an animal can do are defined. These states are denoted S1, S2, S3, S4, S5. An animal is considered to be in the state S5 if it is not in any of the states S1-S4. States S1, S2, S3 and S4 correspond to eating, drinking, resting and being milked, respectively.

Two observable variables are calculated for each state except the state S5:

1. the duration of the last occurrence of the state,

2. the time elapsed since the last occurrence of the state.

These eight variables are discretized using binning by equal frequency into four or five bins. The ninth observable variable represents the daytime, which is discretized into six blocks of four hours. The Cartesian product of the value sets of the observable variables contains, after discretization, 1,500,000 combinations of values.
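As an illustration of this discretization step, a sketch with hypothetical data; the use of pandas and the concrete bin counts below are our assumptions, not prescribed by the paper.

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)
    durations = pd.Series(rng.exponential(scale=10.0, size=1000))   # e.g. duration of the last occurrence

    # Equal-frequency binning into four bins: each bin receives roughly the same number of samples.
    binned = pd.qcut(durations, q=4, labels=False, duplicates="drop")

    # Daytime (minute of the day) discretized into six blocks of four hours (240 minutes).
    minute_of_day = pd.Series(rng.integers(0, 1440, size=1000))
    daytime_block = minute_of_day // 240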
To avoid zero probabilities in the emission and transition matrices, empirical transition and emission matrices averaged over all models of the respective animal are used as initial estimates. Zero probabilities in those initial estimates are replaced with small probabilities estimated as one over the number of elements in the matrix.
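A sketch of how such initial estimates can be obtained by counting transitions and emissions in labeled sequences and replacing zeros as described above; the data layout (lists of integer-coded state and observation sequences) is our assumption.

    import numpy as np

    def empirical_matrices(state_seqs, obs_seqs, n_states, n_obs):
        """Count-based transition/emission estimates with zero-probability smoothing.

        Assumes, for simplicity, that every state occurs at least once in the data.
        """
        A = np.zeros((n_states, n_states))
        B = np.zeros((n_states, n_obs))
        for q, o in zip(state_seqs, obs_seqs):
            for t in range(1, len(q)):
                A[q[t - 1], q[t]] += 1            # transition counts
            for t in range(len(q)):
                B[q[t], o[t]] += 1                # emission counts
        A /= A.sum(axis=1, keepdims=True)
        B /= B.sum(axis=1, keepdims=True)
        # Replace zero probabilities by one over the number of matrix elements and renormalize.
        A[A == 0] = 1.0 / A.size
        B[B == 0] = 1.0 / B.size
        A /= A.sum(axis=1, keepdims=True)
        B /= B.sum(axis=1, keepdims=True)
        return A, B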
Since the assumption of normally distributed observations is not valid in our case, we use the approximation of the Kullback-Leibler divergence, recalled in Subsection 3.2, as the dissimilarity measure for HMM clustering. Using this approximation, a distance matrix D for the set of HMMs constructed for the individual animals is computed, where D_{i,j} represents the KL divergence of the j-th from the i-th model, based on the most likely sequences (D_{i,j} = d_Vit(λ_i, λ_j)). The properties of the KL divergence imply that the matrix is real valued, non-symmetric and its diagonal has zero values.

Hierarchical and spectral clustering were applied to cluster the models. A parameter of hierarchical clustering is the kind of linkage. Since HMMs are not points in a Euclidean space, the single, average or complete linkage can be used. An optimal number of clusters can be determined by the domain expert from the dendrogram of the hierarchical clustering.

For the spectral clustering, a fully connected graph based on the Gaussian similarity function is used as the similarity graph. Estimation of the parameter σ of the similarity function is based on the optimization of the balance of clusters.
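A sketch of this step under our assumptions: the asymmetric KL-based distance matrix is first symmetrized by averaging the two directions (a choice of ours, not stated in the paper), turned into a fully connected Gaussian similarity graph, and clustered with scikit-learn's SpectralClustering on the precomputed affinity as a stand-in for the normalized spectral clustering algorithms listed in Subsection 3.2; the random matrix merely replaces the real divergences.

    import numpy as np
    from sklearn.cluster import SpectralClustering

    rng = np.random.default_rng(0)
    n = 30                                             # hypothetical number of animals / HMMs
    D = rng.random((n, n))
    np.fill_diagonal(D, 0.0)                           # stand-in for the KL-based distance matrix

    D_sym = 0.5 * (D + D.T)                            # symmetrize the asymmetric divergence
    sigma = 0.65                                       # value reported in Section 5.2
    W = np.exp(-D_sym ** 2 / (2 * sigma ** 2))         # fully connected Gaussian similarity graph

    labels = SpectralClustering(n_clusters=4, affinity="precomputed",
                                assign_labels="discretize",
                                random_state=0).fit_predict(W)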
The results of clustering were subsequently analysed using descriptive characteristics (e.g., age and weight of the animal) that were not among the observable variables.

5   Experimental results

The experiment is divided into two parts, behavioral modeling and clustering of the models. Section 5.1 describes the results of the modeling using HMMs and Section 5.2 presents the results of clustering.

5.1   Hidden Markov modeling

The dataset, which consists of ten consecutive days, was split into train and test sets in a ratio of 9:1. Parameters of a model were estimated with the data of nine consecutive days and the model was evaluated with the data of the tenth day. For each animal, the transition and emission matrices were estimated using the Baum–Welch algorithm.
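One possible way to carry out this estimation step (a sketch under our assumptions, not the authors' code) is the hmmlearn library, whose CategoricalHMM class in recent versions implements Baum–Welch for discrete observations; the uniform initial matrices and the 20-symbol observation alphabet below are hypothetical placeholders for the empirical initial estimates and the discretized observations described in Section 4.

    import numpy as np
    from hmmlearn import hmm

    rng = np.random.default_rng(0)
    obs = rng.integers(0, 20, size=1000).reshape(-1, 1)     # one integer-coded observation per minute

    model = hmm.CategoricalHMM(n_components=5, n_iter=50, init_params="")   # keep our own initialization
    model.startprob_ = np.full(5, 1.0 / 5)
    model.transmat_ = np.full((5, 5), 1.0 / 5)               # the paper uses empirical averaged matrices here
    model.emissionprob_ = np.full((5, 20), 1.0 / 20)
    model.fit(obs)                                           # Baum-Welch re-estimation
    logprob, viterbi_states = model.decode(obs)              # most likely state sequence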




Hidden Markov models were evaluated in two different ways: visually by a domain expert and using the Viterbi sequence. The domain expert checks whether the real animals' behavior is consistent with the stationary distribution of the resulting Markov chain, which describes the behavior to which the Markov chain converges. Consequently, the stationary distribution describes how much time an animal spends doing some activity in comparison with all activities.

Stationary distributions were computed for each animal separately. A histogram of the probabilities of the individual states in the stationary distributions is shown in Figure 2.

Figure 2: For each animal a stationary distribution of its Markov chain is calculated from a sequence of 14400 measurements. Values of these distributions are visualized for each state separately in a histogram.

Probabilities of the stationary distributions averaged over all animals are given in Table 1.

   State   Average probability
   S1      8.5%
   S2      0.4%
   S3      44.8%
   S4      1.8%
   S5      44.5%

Table 1: Average stationary distribution.
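The stationary distribution used in this check can be obtained from an estimated transition matrix by solving πA = π from Equation (4); a sketch using the left eigenvector of A for eigenvalue 1, with a hypothetical matrix:

    import numpy as np

    def stationary_distribution(A):
        """Solve pi A = pi with sum(pi) = 1 via the left eigenvector of A for eigenvalue 1."""
        eigvals, eigvecs = np.linalg.eig(A.T)      # left eigenvectors of A = right eigenvectors of A^T
        i = np.argmin(np.abs(eigvals - 1.0))
        pi = np.real(eigvecs[:, i])
        return pi / pi.sum()

    A = np.array([[0.80, 0.15, 0.05],
                  [0.10, 0.85, 0.05],
                  [0.25, 0.25, 0.50]])             # hypothetical transition matrix
    print(stationary_distribution(A))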
The second way of evaluation uses the Viterbi sequence. The Viterbi sequence determines the most likely sequence of states given a sequence of observations and is denoted v_1 v_2 ... v_T. To get an accuracy of the prediction of a sequence of states, the real sequences of states are element-wise compared with the Viterbi sequences. The accuracy is calculated as the number of elements that do not differ divided by the length of the sequence. It was calculated separately for each state as well as for the whole sequence. The total accuracy was calculated as

   Acc = |{i : v_i = q_i}| / T     (15)

and the accuracy of a state s was calculated as

   Acc_s = |{i : v_i = q_i, q_i = s}| / |{j : q_j = s}|.     (16)
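A compact sketch of this evaluation under our assumptions: a plain NumPy Viterbi decoder produces v_1 ... v_T, and the accuracies of Equations (15) and (16) are obtained by element-wise comparison with the true state sequence.

    import numpy as np

    def viterbi(A, B, pi, obs):
        """Most likely state sequence v_1 ... v_T for an observation sequence (log domain)."""
        T, N = len(obs), len(pi)
        logA, logB, logpi = np.log(A), np.log(B), np.log(pi)
        delta = np.zeros((T, N))
        psi = np.zeros((T, N), dtype=int)
        delta[0] = logpi + logB[:, obs[0]]
        for t in range(1, T):
            scores = delta[t - 1][:, None] + logA     # scores[i, j]: best path ending in i, then i -> j
            psi[t] = scores.argmax(axis=0)
            delta[t] = scores.max(axis=0) + logB[:, obs[t]]
        path = np.zeros(T, dtype=int)
        path[-1] = delta[-1].argmax()
        for t in range(T - 2, -1, -1):
            path[t] = psi[t + 1, path[t + 1]]
        return path

    def accuracies(v, q, n_states):
        """Equation (15): overall accuracy; Equation (16): accuracy of each state."""
        v, q = np.asarray(v), np.asarray(q)
        total = np.mean(v == q)
        per_state = {s: np.mean(v[q == s] == s) for s in range(n_states) if np.any(q == s)}
        return total, per_state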
The resulting models are able to predict animals' actions with an average accuracy of 83.86%. It can be seen that the state S5 has the worst accuracy of all states. It is caused by a delay: a model can relatively precisely detect that an action has started, but the detection of the termination of the action is delayed, which causes the state S5 to have a worse accuracy than the other states. Figure 3 visualizes the results of the evaluation of the Viterbi sequence.

Figure 3: Accuracy of the Viterbi sequence for each state S1–S5.

5.2   Clustering

In this subsection, we discuss the results and evaluation of the cluster analysis. The visualisation of the distance matrix is shown in Figure 4. The color of the point (i, j) represents the distance between the i-th animal's model and the j-th animal's model. The lighter the color is, the less similar the models are. The distance matrix is used by both clustering methods.

Linkages of hierarchical clustering can be visualized by dendrograms. A dendrogram is used to visually estimate an optimal number of clusters. The dendrogram of the complete linkage is shown in Figure 5. According to it, the animals can be assigned into several approximately equally sized clusters.
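A sketch of this dendrogram construction under our assumptions (a symmetrized stand-in distance matrix, SciPy's hierarchy module, and the complete linkage mentioned in Section 4):

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.spatial.distance import squareform
    from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

    rng = np.random.default_rng(0)
    n = 30
    D = rng.random((n, n))
    np.fill_diagonal(D, 0.0)                            # stand-in for the KL-based distance matrix
    D_sym = 0.5 * (D + D.T)                             # symmetrize before hierarchical clustering

    Z = linkage(squareform(D_sym, checks=False), method="complete")   # condensed distances, complete linkage
    dendrogram(Z)                                       # visual estimate of the number of clusters
    plt.show()
    labels = fcluster(Z, t=4, criterion="maxclust")     # e.g. cut the tree into four clusters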
The evaluation of clustering results is a difficult task since there are no references for validation. However, clusters can be validated by a domain expert based on descriptive characteristics related to the animals (e.g., age, weight). These characteristics were not used for modeling; thus, they cannot influence the clustering.




Figure 4: Kullback-Leibler divergence based distance matrix between the estimated models. The color of the point (i, j) represents the distance between the i-th animal's model and the j-th animal's model.

Figure 5: The dendrogram for a complete linkage.

Figures 6, 7 and 8 show the results of the spectral clustering according to Shi and Malik with a fully connected similarity graph and σ equal to 0.65. In the figures it can be seen that there are differences in the descriptive statistics of such a characteristic between the individual clusters. For example, animals assigned to cluster 1 have significantly higher values of the characteristic than animals of the other clusters. It indicates that the clustering is reasonable and meaningful.

Figure 6: Proportions of four clusters, which were created using spectral clustering according to Shi and Malik.

Figure 7: Boxplots of an unobserved descriptive characteristic showing differences between four clusters.

Figure 8: Distribution of an unobserved descriptive characteristic (Characteristic 1, relative frequency) showing differences between four clusters.

6   Conclusion

Hidden Markov models proved to be applicable to the modeling of animal behavior represented as a sequence of states. According to their evaluation, the model is able to predict animals' actions with an average accuracy of 83.86%. Furthermore, the stationary distributions of the resulting models were validated by a domain expert. Cluster analysis was performed by classical clustering algorithms using the approximation of the KL divergence. The clustering produces meaningful results and clusters the animals into interpretable groups.

Prediction can be further improved with a better definition of states. In particular, the state S5 can be divided into more disjoint states. This may positively influence the results of the clustering. Moreover, different modeling approaches commonly used for modeling sequences, such as dynamic Bayesian networks and conditional random fields, are intended to be researched, as well as their clustering possibilities.

References

[1] Jonathan Alon, Stan Sclaroff, George Kollios, and Vladimir Pavlovic. Discovering clusters in motion time-series data. In Computer Vision and Pattern Recognition, 2003. Proceedings. 2003 IEEE Computer Society Conference on, volume 1, pages I-375. IEEE, 2003.

[2] Longbing Cao and Philip S. Yu. Behavior computing. Springer, London, 2012.

[3] Emanuele Coviello, Gert R. Lanckriet, and Antoni B. Chan. The variational hierarchical EM algorithm for clustering hidden Markov models. In Advances in Neural Information Processing Systems, pages 404–412, 2012.

[4] Qi Dai, Xiao-Qing Liu, Tian-Ming Wang, and Damir Vukicevic. Linear regression model of DNA sequences and its application. Journal of Computational Chemistry, 28(8):1434–1445, 2007.

[5] Markus Falkhausen, Herbert Reininger, and Dietrich Wolf. Calculation of distance measures between hidden Markov models. In Proc. Eurospeech, pages 1487–1490, 1995.

[6] Y. Guo, G. Poulton, P. Corke, G. J. Bishop-Hurley, T. Wark, and D. L. Swain. Using accelerometer, high sample rate GPS and magnetometer data to develop a cattle movement and behaviour model. Ecological Modelling, 220(17):2068–2075, 2009.

[7] John R. Hershey and Peder A. Olsen. Variational Bhattacharyya divergence for hidden Markov models. In Acoustics, Speech and Signal Processing, 2008. ICASSP 2008. IEEE International Conference on, pages 4557–4560. IEEE, 2008.

[8] John R. Hershey, Peder A. Olsen, and Steven J. Rennie. Variational Kullback-Leibler divergence for hidden Markov models. In Automatic Speech Recognition & Understanding, 2007. ASRU. IEEE Workshop on, pages 323–328. IEEE, 2007.

[9] Tony Jebara, Yingbo Song, and Kapil Thadani. Spectral clustering and embedding with hidden Markov models. In Machine Learning: ECML 2007, pages 164–175. Springer, 2007.

[10] Biing-Hwang Fred Juang and Lawrence R. Rabiner. A probabilistic distance measure for hidden Markov models. AT&T Technical Journal, 64(2):391–408, 1985.

[11] John L. Kelley. General Topology (Graduate Texts in Mathematics). Springer, 1975.

[12] John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. pages 282–289, 2001.

[13] Sang Wan Lee, Yong Soo Kim, and Zeungnam Bien. A nonsupervised learning framework of human behavior patterns based on sequential actions. IEEE Transactions on Knowledge and Data Engineering, 22(4):479–492, 2010.

[14] Maja Matetić, Slobodan Ribarić, and Ivo Ipšić. Qualitative modelling and analysis of animal behaviour. Applied Intelligence, 21(1):25–44, 2004.

[15] Bill O'Brien, John Dooley, and Thomas J. Brazil. RF power amplifier behavioral modeling using a globally recurrent neural network. In Microwave Symposium Digest, 2006. IEEE MTT-S International, pages 1089–1092. IEEE, 2006.

[16] Vladimir Pavlović, James M. Rehg, Tat-Jen Cham, and Kevin P. Murphy. A dynamic Bayesian network approach to figure tracking using learned dynamic models. In Computer Vision, 1999. The Proceedings of the Seventh IEEE International Conference on, volume 1, pages 94–101. IEEE, 1999.

[17] Amy L. Sliva. Scalable techniques for behavioral analysis and forecasting. 2011.

[18] Padhraic Smyth et al. Clustering sequences with hidden Markov models. Advances in Neural Information Processing Systems, pages 648–654, 1997.

[19] Ulrike von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4):395–416, 2007.

[20] Nelleke d. Weerd, Frank v. Langevelde, Herman v. Oeveren, Bart A. Nolet, Andrea Kölzsch, Herbert H. T. Prins, and W. F. Boer. Deriving animal behaviour from high-frequency GPS: Tracking cows in open and forested habitat. PLoS One, 10(6), 2015.

[21] Jie Yin and Qiang Yang. Integrating hidden Markov models and spectral analysis for sensory time series clustering. In Data Mining, Fifth IEEE International Conference on, pages 8 pp. IEEE, 2005.

[22] S.-X. Zhang. Structured Support Vector Machines for Speech Recognition. PhD thesis, Cambridge University, March 2014.

[23] Shi Zhong. Semi-supervised sequence classification with HMMs. In Valerie Barr and Zdravko Markov, editors, FLAIRS Conference, pages 568–574. AAAI Press, 2004.

[24] Shi Zhong and Joydeep Ghosh. A unified framework for model-based clustering. Journal of Machine Learning Research, 4:1001–1037, 2003.