Viterbi Algorithm Specialization Using Linear Algebra
Ivan Tyulyandin
Mathematics and Mechanics Faculty
Saint Petersburg State University
Saint Petersburg, Russia
i.tyulyandin@2015.spbu.ru

Daniil Berezun
Department of Mathematics and Computer Science
Saint Petersburg State University
Saint Petersburg, Russia
d.berezun@2009.spbu.ru

Semyon Grigorev
Mathematics and Mechanics Faculty
Saint Petersburg State University
Saint Petersburg, Russia
s.v.grigoriev@spbu.ru

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
Abstract—Algorithms based on linear algebra are widely used in various areas. Often programs of interest have some input data that are independent of the dataset being processed, and thus they can be optimized with respect to this input. In this paper, we study how partial evaluation affects the Viterbi algorithm as a step towards researching the application of partial evaluation to linear algebra-based algorithms. We evaluate the specialized multi-threaded Viterbi algorithm against the existing GPU-based CUDAMPF. The results show that the presented algorithm is slower than, but comparable to, CUDAMPF.

I. INTRODUCTION

Algorithms based on linear algebra are widely used in various areas such as machine learning [1], computer vision [2], statistics [3], analysis of logic programs [4], graph theory [5], etc. A typical case is querying a huge database or processing a large graph, which can take days or weeks, making even a simple constant-factor optimization crucial. One way to optimize such data processing is to use hardware capabilities such as different kinds of parallelism, or to invent a more efficient algorithm or some tricky data representation. We focus on an alternative way to optimize programs, based on the following observation. Often programs of interest have some input data that are independent of the dataset being processed, and thus they can be optimized with respect to this input. Partial evaluation, a.k.a. program specialization, is a well-known program transformation technique aiming to perform such an optimization [6].

In this paper, we study how partial evaluation affects the Viterbi algorithm [7] as a step towards researching the application of partial evaluation to linear algebra-based algorithms. First, the Viterbi algorithm is used in bioinformatics [8], speech recognition [9], and financial computations [10]. Second, it can be expressed in terms of linear algebra [11]. Third, the algorithm has two parameters: a hidden Markov model (HMM) [12] and an observation sequence. Its goal is to compute the probability for the sequence to be emitted by the given HMM. Next, sequential application of the algorithm with a fixed HMM to a big bunch of observation sequences is usual. Finally, the main part of the Viterbi algorithm heavily depends on the HMM. All the above makes the algorithm a good candidate for researching the application of partial evaluation.

The rest of the paper is organized as follows. Section II describes the background. In Section III the Viterbi algorithm specialization is explained. Section IV reports benchmark results. Related work is reviewed in Section V. Section VI discusses future work, and Section VII concludes the paper.

II. BACKGROUND

In this section, we review specialization, hidden Markov models (HMM), and the Viterbi algorithm in terms of linear algebra.

A. Specialization

Specialization [6], or partial evaluation, is a well-known program transformation technique widely used when some of the input data is already known at compile time. A typical case is serial data processing where one of the input parameters is fixed while the others vary. Fixed parameters are called static while the other parameters are called dynamic. The idea behind specialization is that optimizing a program with respect to the static parameters and then executing the optimized program on a set of dynamic parameters may be more efficient than repeated execution of the initial program on both static and dynamic parameters.

The classical specialization example is the exponentiation function f(x, n) = x^n where n is static. A simple implementation is shown below.

  function f(x, n)
    if n == 0 then 1
    elif even(n) then f(x, n/2) ^ 2
    else x * f(x, (n-1)/2) ^ 2

All recursive calls are static, i.e. are controlled by the static parameter only, and thus can be reduced. Given a fixed n, say 5, a typical specialized version is

  function f_spec(x) = x * (x ^ 2) ^ 2

Note that sometimes specialization is useless. For example, consider the exponentiation function with a fixed base x but dynamic power n. Of course, one may use arithmetic tricks for some x, but in general there is no recipe for effective specialization. Moreover, since optimal specialization is obviously undecidable, there are heuristics to ensure specialization termination. As a result, in some cases specialization worsens program execution; for example, it is well known that specialization can negatively affect program performance through code expansion [6].
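To make the example concrete, below is a minimal sketch of such a specializer in Python. It is ours for illustration only (names such as specialize_power are hypothetical, not part of any real partial evaluator), and its unfolding strategy is slightly simplified relative to the function above: given a static n, it reduces all static recursive calls and emits a residual program over the dynamic x.

  def specialize_power(n: int) -> str:
      """Unfold the static recursion and return the residual expression for x^n."""
      if n == 0:
          return "1"
      if n == 1:
          return "x"
      if n % 2 == 0:
          return f"({specialize_power(n // 2)} ** 2)"
      return f"(x * {specialize_power(n - 1)})"

  def compile_power(n: int):
      """Turn the residual expression into a callable specialized function."""
      return eval(f"lambda x: {specialize_power(n)}")

  pow5 = compile_power(5)   # behaves like f_spec: x * (x ^ 2) ^ 2
  assert pow5(3) == 3 ** 5

For n = 5 the residual expression is (x * ((x ** 2) ** 2)), i.e. exactly the f_spec shown above: all tests and recursive calls on the static n have been reduced away.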
B. Hidden Markov model

A hidden Markov model is a probabilistic automaton [12]. It has the following parameters:
• S_{1..N} — N states of the automaton;
• O_{1..K} — K possible observations;
• B_{1..N} — probabilities for each state from S_{1..N} to be a start one;
• T_{1..N,1..N} — state transition matrix, where T_{i,j} is the probability to go from state S_i to state S_j;
• E_{1..N,1..K} — emission matrix, where E_{i,j} defines the probability to emit observation O_j at state S_i.

With a given observation sequence, it is possible to calculate a maximum likelihood to be at a concrete state of the HMM, i.e. to reveal hidden states, according to the sequence. The HMM makes a transition between states for each observation from the given sequence.
C. Viterbi algorithm

Let's fix the observation sequence as Obs, where the length of Obs is lo. The Viterbi algorithm [11] handles the sequence Obs for an HMM. Its result is the maximum probability to reach each state of the HMM after handling Obs.

First of all, HMM probabilities are transformed into negative binary logarithms. We define such probabilities as transformed probabilities t(p), where p is some probability from the HMM definition:

  t(p) = −log2(p)  if p > 0;   t(p) = +∞  if p = 0.   (1)

E.g. probability 0.5 will be expressed as −log2(0.5) = 1. This is done to reduce the loss of precision.

The key idea is to use a special algebraic structure, the semiring Min_plus. Elements of this semiring are floats. The semantics of addition is the minimum of two floats, while the multiplication symbol means addition of floats. The neutral elements are +∞ and 0 respectively. This is an example of matrix multiplication in the Min_plus semiring:

  ( 0    1 )   ( 3 )   ( min(0 + 3, 1 + 4)  )   ( 3 )
  ( +∞   2 ) × ( 4 ) = ( min(+∞ + 3, 2 + 4) ) = ( 6 )

The expression t_{i,j} denotes the transformed probability to get observation j at state i, i.e. t(E_{i,j}). For each observation j we create a diagonal matrix P(j) with data from E: P(j) has t_{1,j}, ..., t_{N,j} on the diagonal and +∞ everywhere else.

The initial step of the Viterbi algorithm is to set up data according to the first observation from the observation sequence Obs:

  Probs_1 = P(Obs_1) × B

The symbol × stands for matrix multiplication in the Min_plus semiring. The column B defines the probability distribution for states to be a start one.

The next step is to handle the rest of the sequence, where t changes from 2 up to lo:

  Probs_t = P(Obs_t) × T^⊤ × Probs_{t−1}

As a result, the column Probs_lo will contain the transformed probabilities to be in a certain state of the HMM after the observation sequence Obs is handled.
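To make this formulation concrete, here is a minimal sketch of the initial algorithm on dense numpy arrays. It is our illustration only (the actual implementation uses SuiteSparse:GraphBLAS, see Section III-B) and reuses the HMM container sketched in Section II-B.

  import numpy as np

  def t(p: np.ndarray) -> np.ndarray:
      """Transformed probabilities from Eq. (1): -log2(p), and +inf for p = 0."""
      out = np.full(p.shape, np.inf)
      mask = p > 0
      out[mask] = -np.log2(p[mask])
      return out

  def minplus_matmul(A: np.ndarray, B: np.ndarray) -> np.ndarray:
      """Matrix product in the Min_plus semiring:
      result[i, j] = min_k (A[i, k] + B[k, j])."""
      return np.min(A[:, :, None] + B[None, :, :], axis=1)

  def P(hmm: HMM, j: int) -> np.ndarray:
      """Diagonal matrix P(j): t(E[i, j]) on the diagonal, +inf elsewhere."""
      M = np.full((hmm.N, hmm.N), np.inf)
      np.fill_diagonal(M, t(hmm.E[:, j]))
      return M

  def viterbi(hmm: HMM, obs: list) -> np.ndarray:
      """Initial Viterbi algorithm: Probs_1 = P(Obs_1) x B, then
      Probs_t = P(Obs_t) x T^T x Probs_{t-1}."""
      probs = minplus_matmul(P(hmm, obs[0]), t(hmm.B).reshape(-1, 1))
      Tt = t(hmm.T).T
      for o in obs[1:]:
          probs = minplus_matmul(P(hmm, o), minplus_matmul(Tt, probs))
      return probs  # column of transformed probabilities, one per state

Note that min plays the role of semiring addition and + the role of semiring multiplication, so the usual matmul "sum of products" becomes a "minimum of sums".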
III. SPECIALIZED VITERBI ALGORITHM

Here we describe the specialization of the Viterbi algorithm in terms of linear algebra. We fix the HMM as a static parameter.

To the best of our knowledge, there is no stable partial evaluator that supports parallel program transformation and provides the expected results. In order to achieve the specialization effect, we have made an ad-hoc generating extension, i.e. we provide an effective procedure to perform static data transformation and propagation, together with a handwritten specialized version of the Viterbi algorithm itself.

A. Theory

The goal is to embed data from the given HMM into the program and simplify expressions. These static data are S, O, B, T, and E. Given a fixed HMM, for all possible observations o ∈ O the following matrices and matrix multiplications can be precalculated during the specialization phase, according to the Viterbi algorithm definition given in the previous section:
• P(o),
• P(o) × B, denoted later as PB(o),
• P(o) × T^⊤, denoted later as PT(o).

We can precalculate these operations and memoize the results for further use by the specialized algorithm. The pseudocode of the precalculation procedure is shown in Listing 1, function spec_Viterbi.

The specialized Viterbi algorithm is shown in Listing 1, function Viterbi, and works as follows. The initial step can be expressed as

  Probs_1 = PB(Obs_1).

The rest of the sequence Obs is handled with the multiplication

  Probs_t = PT(Obs_t) × Probs_{t−1}.   (2)

Comparing the specialized version with the initial one, there are fewer matrix multiplication operations in the Min_plus semiring. For the first step, there is one matrix assignment instead of a multiplication. For each remaining observation, only one matrix multiplication has to be performed instead of two. Thus, the initial Viterbi algorithm in terms of linear algebra requires 1 + 2 * (lo − 1) matrix multiplications, where lo is the length of Obs, while the specialized version requires only lo − 1 multiplications, at the cost of additional memory to keep the precalculated matrices.

Since matrix multiplication in the Min_plus semiring is associative, one can handle two observations at once:

  Probs_t = PT(Obs_t) × Probs_{t−1}
          = PT(Obs_t) × (PT(Obs_{t−1}) × Probs_{t−2})
          = (PT(Obs_t) × PT(Obs_{t−1})) × Probs_{t−2}   (3)

Since we know all PT(o), we can precalculate these multiplications, i.e. compute K × K product matrices, one for each pair of observations, and use them as needed. This method can be extended to handle more observations at once, e.g. for three observations:

  Probs_t = PT(Obs_t) × PT(Obs_{t−1}) × PT(Obs_{t−2}) × Probs_{t−3}   (4)

For three observations there are K × K × K evaluated matrices accordingly. We call Equation 2 1-level specialization, since the handling of only one observation is precalculated. By analogy, we call Equations 3 and 4 the second and the third specialization levels, and so on. The N-level can be computed from PT and the (N − 1)-level as follows: for every o, multiply PT(o) by all matrices at the previous level.
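This level construction can be sketched as follows, building on the minplus_matmul helper above; PT is assumed to be a dictionary from observation to matrix, and the dictionary layout of the handler table is our illustration.

  def build_level_handlers(PT: dict, level: int) -> dict:
      """Map each observation tuple (o_1, ..., o_level), in sequence order,
      to its handler matrix PT(o_level) x ... x PT(o_1), building the
      N-level from the (N-1)-level as described above."""
      handlers = {(o,): M for o, M in PT.items()}  # 1-level is PT itself
      for _ in range(level - 1):
          handlers = {
              seq + (o,): minplus_matmul(PT[o], M)  # append the next observation
              for seq, M in handlers.items()
              for o in PT
          }
      return handlers

For a window Obs[i : i + level] of the observation sequence, the run loop then performs a single multiplication Probs = handlers[window] × Probs, as in Equations 3 and 4.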
// Static data: the HMM and the specialization level
HMM
PB[HMM.K]
PT[HMM.K]
level
// obs_lvl_handlers is a mapping from lists that
// contain level observations to a handler matrix
obs_lvl_handlers[HMM.K^level]

function spec_Viterbi()
  for i = 1..HMM.K
    PB[i] = P(HMM.O[i]) × HMM.B
    PT[i] = P(HMM.O[i]) × (HMM.T)^⊤
  // To handle a sequence of length level, the
  // appropriate level matrices from PT should be
  // multiplied. Put handlers for all possible
  // sequences of length level into obs_lvl_handlers.
  get_combinations(obs_lvl_handlers, level, PT)

function Viterbi(Obs)
  // First observation handling
  Probs = PB[Obs[1]]

  lo = length(Obs)
  i = 2

  // While there are more observations left than level
  while (lo - i) >= level
    // Find the matrix that handles the next level observations
    handler = obs_lvl_handlers.find(Obs[i : i + level])
    Probs = handler × Probs
    i = i + level

  // The rest of the sequence
  for (; i <= lo; i = i + 1)
    Probs = PT[Obs[i]] × Probs

  return Probs

Listing 1: The specialized Viterbi algorithm with levels
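Under the same assumptions as the previous sketches (dense numpy matrices, 0-based indexing; build_level_handlers, minplus_matmul, P, and t as defined above), Listing 1 can be rendered in Python as follows:

  def spec_viterbi(hmm: HMM, level: int):
      """Specialization phase: precompute PB, PT and the level handlers."""
      Bt = t(hmm.B).reshape(-1, 1)  # transformed start column
      Tt = t(hmm.T).T               # transformed, transposed transition matrix
      PB = {o: minplus_matmul(P(hmm, o), Bt) for o in range(hmm.K)}
      PT = {o: minplus_matmul(P(hmm, o), Tt) for o in range(hmm.K)}
      return PB, PT, build_level_handlers(PT, level)

  def run_viterbi(PB, PT, handlers, level, obs):
      """Dynamic phase: Listing 1's function Viterbi."""
      probs = PB[obs[0]]                    # first observation handling
      i = 1
      while len(obs) - i >= level:          # whole windows of `level` observations
          probs = minplus_matmul(handlers[tuple(obs[i:i + level])], probs)
          i += level
      for o in obs[i:]:                     # the rest of the sequence
          probs = minplus_matmul(PT[o], probs)
      return probs

The split into spec_viterbi and run_viterbi mirrors the two phases of partial evaluation: the first depends only on the static HMM and can be executed once, while the second is executed per observation sequence.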
If specialization with N-level is applied, (lo − 1)/N + ((lo − 1) mod N) matrix multiplications are required to perform the partially evaluated Viterbi algorithm; for example, for lo = 7000 and N = 2 this gives 3499 + 1 = 3500 multiplications instead of the 6999 needed at the first level. Memory consumption rapidly increases with higher N, since K^N precalculated matrices have to be saved in memory.

B. Some implementation details

To perform matrix operations, we used SuiteSparse:GraphBLAS [5]. It is a high-performance implementation of the GraphBLAS [13] standard, which is intended to handle graphs, e.g. hidden Markov models. It also defines various linear algebra primitives, such as the Min_plus semiring. Our implementation uses custom formats to define HMMs and observation sequences, which simplifies data parsing. The full source code is available online [14].

IV. EVALUATION

In this section, we compare the specialized Viterbi algorithm against the initial one and CUDAMPF [15].

CUDAMPF is a GPU implementation of the Viterbi algorithm. Since the Viterbi algorithm can be effectively parallelized, a GPU is a suitable choice. CUDAMPF works with hidden Markov models from bioinformatics. A model describes a protein family. An observation sequence specifies a protein and contains amino acids. If the probability to be in some special state of the HMM is higher than a threshold, then the protein belongs to the protein family.

We took 24 HMMs from the CUDAMPF repository (https://github.com/Super-Hippo/CUDAMPF, date: 2021-12-02). All HMMs have a different number of states but the same structure. Since these HMMs have a slightly different definition, we implemented a converter into our custom format. We evaluate our solution on three different datasets. Two datasets are randomly generated; each contains three sequences of 3500 and 7000 observations respectively. The third dataset consists of 16 real-world proteins taken from the Pfam [16] database; their length varies from 38 to 7096 observations. The number of possible observations is K = 20 for all datasets.

We ran the experiments on Ubuntu 20.04, an Intel Core i7-6700 3.40 GHz, 64 GB RAM, and an NVIDIA GeForce GTX 1070. Each implementation with concrete parameters was run 10 times, and the median was taken as the result. We evaluate only the first and second specialization levels of the presented algorithm, since the memory used for memoization grows exponentially with the level; for the third level an out-of-memory exception was thrown.

Dataset      CUDAMPF   Initial   1-level   2-level
3 x 3500        4854     10765      8062    215329
3 x 7000        9209     21062     16152    387464
Real-world      8796     15864     12036    298269

TABLE I: Run time (both specialization and the Viterbi algorithm), ms
Fig. 1: Evaluation results, run time (ms) against the number of HMM states (lower is better): (a) 3 sequences, each 3500 observations; (b) 3 sequences, each 7000 observations; (c) real-world dataset; (d) specialization algorithm run time.

The results (see Figures 1a, 1b, 1c and Table I) show that the first level specialized version of the Viterbi algorithm is, as expected, faster than the initial one. Unexpectedly, the second level implementation is significantly slower than both the initial and the first level implementations, and it is therefore not shown in some figures. One of the reasons for such a slowdown is the increased memory consumption. Nevertheless, even on a workstation with 8 CPU threads, the CPU-based first level specialization outperforms GPU-based CUDAMPF on the small HMMs used in the CUDAMPF benchmarks (for the evaluation we used the exact set of HMMs from the CUDAMPF benchmarks). Overall, these results are comparable with those of CUDAMPF. It is also worth noting that the parallel Viterbi algorithm expressed in terms of linear algebra is easier to implement than the CUDAMPF dynamic programming version, since the abstraction level is higher. All the above shows that specialization of the Viterbi algorithm is applicable in practice.

V. RELATED WORK

There are many works where specialization was successfully applied.

In [17] it was shown that the result of specializing the naïve pattern matching algorithm with respect to a fixed pattern can be behaviourally equivalent to the Knuth-Morris-Pratt algorithm. This result is often used as a strength test for partial evaluators to be "good enough". Since we consider the specialization of one concrete algorithm only, this test is not applicable in our case.

Partial evaluation was applied to ray tracing [18] by P. H. Andersen. The author optimizes a ray tracer, gaining speedups from 1.8 to 3.0 depending on the meta-parameters and the compiler. The main performance improvement was achieved with constant propagation and loop unrolling. This can be done by an optimizing compiler, but the partial evaluator is more aggressive, which means that a specialized algorithm can sometimes have an enormous code size, leading to a performance regression. The static data was written directly into the source code, while our solution can run without any file modifications.
A. Tyurin, D. Berezun and S. Grigorev have applied specialization to the naïve pattern matching string search algorithm implemented as a GPU program [19]. They got a performance improvement of up to 8 times in some cases. A GPU has a lot of simple arithmetic logic units, and all of them need data to work with, which makes the GPU data cache a bottleneck. Using specialization, static data was moved to the code cache. Such a transformation makes data cache misses less likely. One may call it a "hardware specialization".

C. Sakama et al. used linear algebra as a representation of logic programs [4]. The authors introduce partial evaluation as a part of an algorithm that finds a logic model of a program. When specialization is used, the run time decreases by 10 times.

VI. DISCUSSION

There are several possible research directions and future work. The Viterbi algorithm specialization is the first step to find out whether partial evaluation can be effectively applied to linear algebra-based algorithms.

First of all, the next step is to run the benchmarks on a GPU. SuiteSparse:GraphBLAS [5] is the reference CPU implementation of the GraphBLAS [13] standard. There are some GPU implementations, such as GraphBLAST [20] and GBTL [21], but to our knowledge they are unstable.

One can try to apply partial evaluation to other algorithms expressed in terms of linear algebra. These experiments would reveal the limits of specialization for such algorithms. There is a high chance that the results of such experiments can be successfully used in production.

Since partial evaluation can lead to a performance increase, it can be useful to implement a linear algebra library with specialization primitives. It would allow developing more effective applications with linear algebra algorithms in less time.

Another approach is hardware partial evaluation, e.g. building an FPGA where the specialized program with its static data is embedded as a circuit.

VII. CONCLUSION

In this paper, we study the application of partial evaluation to linear algebra-based algorithms on a particular example: the Viterbi algorithm with the HMM being fixed, i.e. a static parameter. The specialized version of the Viterbi algorithm is presented. Our experiments show that on real benchmarks the presented algorithm can be comparable to the existing GPU-based Viterbi algorithm implementation CUDAMPF. Thus, the proposed approach is applicable in practice, and further application of partial evaluation to linear algebra-based algorithms is a promising research direction.

REFERENCES

[1] C. Aggarwal, Linear Algebra and Optimization for Machine Learning: A Textbook, 2020.
[2] R. Szeliski, Computer Vision: Algorithms and Applications, 1st ed. Berlin, Heidelberg: Springer-Verlag, 2010.
[3] J. Gentle, Numerical Linear Algebra for Applications in Statistics. Springer-Verlag, 1998.
[4] C. Sakama, H. Nguyen, T. Sato, and K. Inoue, "Partial evaluation of logic programs in vector spaces," EasyChair Preprint no. 172, EasyChair, 2018.
[5] T. A. Davis, "Algorithm 1000: SuiteSparse:GraphBLAS: graph algorithms in the language of sparse linear algebra," ACM Transactions on Mathematical Software, vol. 45, no. 4, 2019. URL: https://doi.org/10.1145/3322125
[6] N. Jones, C. Gomard, and P. Sestoft, Partial Evaluation and Automatic Program Generation, 1993.
[7] A. Viterbi, "Error bounds for convolutional codes and an asymptotically optimum decoding algorithm," IEEE Transactions on Information Theory, vol. 13, no. 2, pp. 260–269, 1967.
[8] S. R. Eddy, "Accelerated profile HMM searches," PLoS Computational Biology, vol. 7, no. 10, p. e1002195, 2011.
[9] L. R. Rabiner, A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1990, pp. 267–296.
[10] A. Petropoulos, S. P. Chatzis, and S. Xanthopoulos, "A novel corporate credit rating system based on Student's-t hidden Markov models," Expert Systems with Applications, vol. 53, pp. 87–105, 2016.
[11] E. Theodosis and P. Maragos, "Analysis of the Viterbi algorithm using tropical algebra and geometry," in 2018 IEEE 19th International Workshop on Signal Processing Advances in Wireless Communications (SPAWC), 2018, pp. 1–5.
[12] S. R. Eddy, "What is a hidden Markov model?" Nature Biotechnology, vol. 22, pp. 1315–1316, 2004. URL: https://doi.org/10.1038/nbt1004-1315
[13] A. Buluç, T. Mattson, S. McMillan, J. E. Moreira, and C. Yang, "Design of the GraphBLAS API for C," in IPDPS Workshops, 2017, pp. 643–652.
[14] Implementation repository. URL: https://github.com/IvanTyulyandin/Lin_alg_Viterbi (date: 2021-13-02).
[15] H. Jiang and N. Ganesan, "CUDAMPF: a multi-tiered parallel framework for accelerating protein sequence search in HMMER on CUDA-enabled GPU," BMC Bioinformatics, 2016.
[16] J. Mistry, S. Chuguransky, L. Williams, M. Qureshi, G. A. Salazar, E. L. L. Sonnhammer, S. C. E. Tosatto, L. Paladin, S. Raj, L. J. Richardson, R. D. Finn, and A. Bateman, "Pfam: the protein families database in 2021," Nucleic Acids Research, vol. 49, no. D1, pp. D412–D419, 2020.
[17] C. Consel and O. Danvy, "Partial evaluation of pattern matching in strings," Information Processing Letters, vol. 30, no. 2, pp. 79–86, 1989.
[18] P. H. Andersen, "Partial evaluation applied to ray tracing," 1996.
[19] A. Tyurin, D. Berezun, and S. Grigorev, "Optimizing GPU programs by partial evaluation," pp. 431–432, 2020.
[20] C. Yang, A. Buluç, and J. D. Owens, "GraphBLAST: a high-performance linear algebra-based graph framework on the GPU," 2020.
[21] GraphBLAS Template Library (GBTL), v. 3.0. URL: https://github.com/cmu-sei/gbtl (date: 2021-12-02).