=Paper=
{{Paper
|id=Vol-2953/SEIM_2021_paper_24
|storemode=property
|title=Viterbi Algorithm Specialization Using Linear Algebra
|pdfUrl=https://ceur-ws.org/Vol-2953/SEIM_2021_paper_24.pdf
|volume=Vol-2953
|authors=Ivan Tyulyandin,Daniil Berezun,Semyon Grigorev
}}
==Viterbi Algorithm Specialization Using Linear Algebra==
Ivan Tyulyandin
Mathematics and Mechanics Faculty, Saint Petersburg State University, Saint Petersburg, Russia
i.tyulyandin@2015.spbu.ru

Daniil Berezun
Department of Mathematics and Computer Science, Saint Petersburg State University, Saint Petersburg, Russia
d.berezun@2009.spbu.ru

Semyon Grigorev
Mathematics and Mechanics Faculty, Saint Petersburg State University, Saint Petersburg, Russia
s.v.grigoriev@spbu.ru

Abstract: Algorithms based on linear algebra are widely used in various areas. Often programs of interest have some input data that are independent of the dataset being processed, and thus they can be optimized with respect to this input. In this paper, we study how partial evaluation affects the Viterbi algorithm, as a step towards research on applying partial evaluation to linear algebra-based algorithms. We evaluate the specialized multi-threaded Viterbi algorithm against the existing GPU-based CUDAMPF. The results show that the presented algorithm is slower than, but comparable to, CUDAMPF.

I. INTRODUCTION

Algorithms based on linear algebra are widely used in various areas such as machine learning [1], computer vision [2], statistics [3], analysis of logic programs [4], graph theory [5], etc. A typical case is querying a huge database or processing a large graph, which can take days or weeks, making even a simple constant-factor optimization crucial. One way to optimize such data processing is to exploit hardware capabilities such as different kinds of parallelism, or to invent a more efficient algorithm or a clever data representation. We focus on an alternative way to optimize programs, based on the following observation: often programs of interest have some input data that are independent of the dataset being processed, and thus they can be optimized with respect to this input. Partial evaluation, a.k.a. program specialization, is a well-known program transformation technique aiming to perform such an optimization [6].

In this paper, we study how partial evaluation affects the Viterbi algorithm [7], as a step towards research on applying partial evaluation to linear algebra-based algorithms. First, the Viterbi algorithm is used in bioinformatics [8], speech recognition [9], and financial computations [10]. Second, it can be expressed in terms of linear algebra [11]. Third, the algorithm has two parameters, a hidden Markov model (HMM) [12] and an observation sequence, and its goal is to compute the probability for the sequence to be emitted by the given HMM. Next, sequentially applying the algorithm with a fixed HMM to a big batch of observation sequences is a usual workload. Finally, the main part of the Viterbi algorithm heavily depends on the HMM. All of the above makes the algorithm a good candidate for studying the application of partial evaluation.

The rest of the paper is organized as follows. Section II describes the background. In Section III the Viterbi algorithm specialization is explained. Section IV reports benchmark results. Related work is reviewed in Section V. Section VI discusses future work, and Section VII concludes the paper.

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

II. BACKGROUND

In this section, we review specialization, hidden Markov models (HMM), and the Viterbi algorithm in terms of linear algebra.

A. Specialization

Specialization [6], or partial evaluation, is a well-known program transformation technique widely used when some of the input data is already known at compile time. A typical case is serial data processing where one of the input parameters is fixed while the others vary. Fixed parameters are called static, while the other parameters are called dynamic. The idea behind specialization is that optimizing a program with respect to its static parameters and then executing the optimized program on a set of dynamic parameters may be more efficient than repeatedly executing the initial program on both static and dynamic parameters.

The classical specialization example is the exponentiation function f(x, n) = x^n where n is static. A simple implementation is shown below.

    function f(x, n) =
        if n == 0 then 1
        elif even(n) then f(x, n/2) ^ 2
        else x * f(x, (n-1)/2) ^ 2

All recursive calls are static, i.e. they are controlled by the static parameter only, and thus can be reduced. Given a fixed n, say 5, a typical specialized version is

    function f_spec(x) = x * (x ^ 2) ^ 2

Note that sometimes specialization is useless. For example, consider the exponentiation function with a fixed base x but a dynamic power n. One may use arithmetic tricks for some particular x, but in general there is no recipe for effective specialization. Moreover, since optimal specialization is undecidable, heuristics are used to ensure that specialization terminates. As a result, in some cases specialization worsens program execution; for example, it is well known that specialization can negatively affect performance because of code expansion [6].
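To make the "generating extension" idea used later in the paper concrete, here is a minimal sketch (ours, not from the paper) of such an extension for the exponentiation example: instead of specializing f directly, it emits the text of the residual Python function for a given static n.

    # Toy generating extension for f(x, n) = x^n: given the static n,
    # unfold all recursion on n and emit a residual Python function.
    # Illustrative sketch only; all names here are ours.
    def gen_power(n):
        def body(n):
            if n == 0:
                return "1"
            if n % 2 == 0:
                return "({}) ** 2".format(body(n // 2))
            return "x * ({}) ** 2".format(body((n - 1) // 2))
        return "def f_spec(x): return " + body(n)

    # gen_power(5) emits "def f_spec(x): return x * ((x * (1) ** 2) ** 2) ** 2";
    # a real partial evaluator would also simplify the unit (1) ** 2 away,
    # yielding x * (x ** 2) ** 2 as in the example above.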
B. Hidden Markov model

A hidden Markov model is a probabilistic automaton [12]. It has the following parameters:

• S_{1..N}: the N states of the automaton;
• O_{1..K}: the K possible observations;
• B_{1..N}: a column of probabilities for each state from S_{1..N} to be the start one;
• T_{1..N,1..N}: the state transition matrix, where T_{i,j} is the probability to go from state S_i to state S_j;
• E_{1..N,1..K}: the emission matrix, where E_{i,j} is the probability to emit observation O_j at state S_i.

Given an observation sequence, it is possible to calculate the maximum likelihood of being in a concrete state of the HMM, i.e. to reveal the hidden states, according to the sequence. The HMM makes a transition between states for each observation from the given sequence.

C. Viterbi algorithm

Let us fix an observation sequence Obs of length lo. The Viterbi algorithm [11] handles the sequence Obs for an HMM. Its result is the maximum probability of reaching each state of the HMM after handling Obs.

First of all, the HMM probabilities are transformed into negative binary logarithms. We define such transformed probabilities as t(p), where p is some probability from the HMM definition:

    t(p) = −log₂(p)   if p > 0
    t(p) = +∞         if p = 0                                  (1)

E.g., the probability 0.5 is expressed as −log₂(0.5) = 1. This is done to reduce the loss of precision.

The key idea is to use a special algebraic structure named the Min_plus semiring. Its elements are floats: semiring addition is defined as the minimum of two floats, semiring multiplication is float addition, and the neutral elements are +∞ and 0 respectively. This is an example of matrix multiplication over the Min_plus semiring:

    ( 0   1 )   ( 3 )   ( min(0 + 3, 1 + 4)  )   ( 3 )
    ( +∞  2 ) × ( 4 ) = ( min(+∞ + 3, 2 + 4) ) = ( 6 )
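The semiring is easy to check with a few lines of dense NumPy; the sketch below is ours (the paper's actual implementation uses SuiteSparse:GraphBLAS instead) and reproduces the example:

    import numpy as np

    def min_plus_mxm(a, b):
        # (a x b)[i, j] = min over k of (a[i, k] + b[k, j]):
        # semiring addition is min, semiring multiplication is +.
        return np.min(a[:, :, None] + b[None, :, :], axis=1)

    a = np.array([[0.0, 1.0], [np.inf, 2.0]])
    b = np.array([[3.0], [4.0]])
    print(min_plus_mxm(a, b))  # [[3.] [6.]], matching the example above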
The expression t_{i,j} denotes the transformed probability of getting observation j at state i, i.e. t(E_{i,j}). For each observation j we build a diagonal matrix P(j) from the data of E, with t_{1,j}, ..., t_{N,j} on the diagonal and +∞ elsewhere:

    P(j) = ( t_{1,j}  ...    +∞     )
           (  ...     ...    ...    )
           (  +∞      ...  t_{N,j}  )

The initial step of the Viterbi algorithm sets up the data according to the first observation of the sequence Obs. Below, the symbol × stands for matrix multiplication over the Min_plus semiring, and the column B defines the probability distribution over start states:

    Probs_1 = P(Obs_1) × B

The next step is to handle the rest of the sequence, where t ranges from 2 up to lo:

    Probs_t = P(Obs_t) × T^T × Probs_{t−1}

As a result, the column Probs_lo contains the transformed probabilities of being in each state of the HMM once the observation sequence Obs has been handled.
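For reference, the initial (unspecialized) algorithm fits in a few lines under the same assumptions as the previous sketch (ours: dense NumPy instead of GraphBLAS, 0-based observation indices, and B_t, T_t, E_t already transformed with t(p)):

    import numpy as np

    def min_plus_mxm(a, b):
        # Min_plus semiring product, as in the earlier sketch.
        return np.min(a[:, :, None] + b[None, :, :], axis=1)

    def P(E_t, j):
        # Diagonal matrix P(j): t(E[i, j]) on the diagonal, +inf elsewhere.
        m = np.full((E_t.shape[0],) * 2, np.inf)
        np.fill_diagonal(m, E_t[:, j])
        return m

    def viterbi(obs, B_t, T_t, E_t):
        # Probs_1 = P(Obs_1) x B; Probs_t = P(Obs_t) x T^T x Probs_{t-1}.
        probs = min_plus_mxm(P(E_t, obs[0]), B_t)  # B_t is an (N, 1) column
        for o in obs[1:]:
            probs = min_plus_mxm(P(E_t, o), min_plus_mxm(T_t.T, probs))
        return probs  # transformed probabilities per state, shape (N, 1)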
III. SPECIALIZED VITERBI ALGORITHM

Here we describe the specialization of the Viterbi algorithm in terms of linear algebra. We fix the HMM as the static parameter. To the best of our knowledge, there is no stable partial evaluator that supports parallel program transformation and provides the expected results. In order to achieve the specialization effect we made an ad-hoc generating extension, i.e. we provide an effective procedure that performs static data transformation and propagation, together with a handwritten specialized version of the Viterbi algorithm itself.

A. Theory

The goal is to embed the data of the given HMM into the program and simplify expressions. These static data are N, K, B, T, and E. Given a fixed HMM, for every possible observation o ∈ O the following matrices and matrix multiplications can be precalculated during the specialization phase, according to the Viterbi algorithm definition given in the previous section:

• P(o),
• P(o) × B, denoted below as PB(o),
• P(o) × T^T, denoted below as PT(o).

We precalculate these operations and memoize the results for further use by the specialized algorithm. Pseudocode of the precalculation procedure is shown in Listing 1, function spec_Viterbi.

The specialized Viterbi algorithm is shown in Listing 1, function run_Viterbi, and works as follows. The initial step can be expressed as

    Probs_1 = PB(Obs_1).

The rest of the sequence Obs is handled with the multiplication

    Probs_t = PT(Obs_t) × Probs_{t−1}.                          (2)

Compared to the initial version, the specialized one performs fewer matrix multiplications over the Min_plus semiring. The first step becomes one matrix assignment instead of a multiplication, and every remaining observation needs only one matrix multiplication instead of two. Thus, the initial Viterbi algorithm in terms of linear algebra performs 1 + 2·(lo − 1) matrix multiplications, where lo is the length of Obs, while the specialized version performs only lo − 1 multiplications, at the cost of additional memory to keep the precalculated matrices.

Since matrix multiplication over the Min_plus semiring is associative, one can handle two observations at once:

    Probs_t = PT(Obs_t) × Probs_{t−1}
            = PT(Obs_t) × (PT(Obs_{t−1}) × Probs_{t−2})
            = (PT(Obs_t) × PT(Obs_{t−1})) × Probs_{t−2}         (3)

Since we know all PT(o), we can precalculate these multiplications, i.e. compute K² matrices (one per pair of observations), and use them as needed. This method extends to handling more observations at once, e.g. for three observations:

    Probs_t = PT(Obs_t) × PT(Obs_{t−1}) × PT(Obs_{t−2}) × Probs_{t−3}        (4)

For three observations there are accordingly K³ precalculated matrices. We call equation (2) 1-level specialization, since the handling of only one observation is precalculated. By analogy, we call equations (3) and (4) the second and third specialization levels, and so on. The N-level matrices can be computed from PT and the (N−1)-level matrices as follows: for every observation o, multiply PT(o) by each matrix of the previous level.

If N-level specialization is applied, ⌊(lo − 1)/N⌋ + (lo − 1) mod N matrix multiplications are required to run the partially evaluated Viterbi algorithm. Memory consumption rapidly increases with higher N, since K^N precalculated matrices have to be kept in memory.

    HMM
    PB[HMM.K]
    PT[HMM.K]
    level

    // obs_lvl_handlers is a mapping from lists of
    // `level` observations to a handler matrix
    obs_lvl_handlers[HMM.K ^ level]

    function spec_Viterbi()
        for i = 1..HMM.K
            PB[i] = P(HMM.O[i]) × HMM.B
            PT[i] = P(HMM.O[i]) × (HMM.T)^T
        // To handle a sequence of length `level`, the
        // appropriate matrices from PT should be multiplied.
        // Put handlers for all possible sequences of
        // length `level` into obs_lvl_handlers.
        get_combinations(obs_lvl_handlers, level, PT)

    function run_Viterbi(Obs)
        // First observation handling
        Probs = PB[Obs[1]]

        lo = length(Obs)
        i = 2

        // While there are at least `level` observations left
        while (lo - i) >= level
            // Find the matrix handling the next `level` observations
            handler = obs_lvl_handlers.find(Obs[i : i + level])
            Probs = handler × Probs
            i = i + level

        // The rest of the sequence
        for (; i <= lo; i = i + 1)
            Probs = PT[Obs[i]] × Probs

        return Probs

    Listing 1: The specialized Viterbi algorithm with levels
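A dense Python counterpart of Listing 1 (ours, with the same NumPy assumptions as the earlier sketches and 0-based indices) may help clarify the level mechanics; precalculate plays the role of spec_Viterbi and run_viterbi that of run_Viterbi:

    from itertools import product
    import numpy as np

    def min_plus_mxm(a, b):
        # Min_plus semiring product, as in the earlier sketches.
        return np.min(a[:, :, None] + b[None, :, :], axis=1)

    def diag_P(E_t, j):
        # Diagonal matrix P(j) built from column j of the transformed E.
        m = np.full((E_t.shape[0],) * 2, np.inf)
        np.fill_diagonal(m, E_t[:, j])
        return m

    def precalculate(B_t, T_t, E_t, level):
        # PB(o) = P(o) x B, PT(o) = P(o) x T^T, and one handler
        # matrix per tuple of `level` consecutive observations.
        K = E_t.shape[1]
        PB = [min_plus_mxm(diag_P(E_t, o), B_t) for o in range(K)]
        PT = [min_plus_mxm(diag_P(E_t, o), T_t.T) for o in range(K)]
        handlers = {}
        for seq in product(range(K), repeat=level):  # K^level tuples
            h = PT[seq[0]]
            for o in seq[1:]:          # later observations act on the left
                h = min_plus_mxm(PT[o], h)
            handlers[seq] = h
        return PB, PT, handlers

    def run_viterbi(obs, PB, PT, handlers, level):
        probs = PB[obs[0]]             # first observation: assignment only
        i = 1
        while len(obs) - i >= level:   # blocks of `level` observations
            probs = min_plus_mxm(handlers[tuple(obs[i:i + level])], probs)
            i += level
        for o in obs[i:]:              # the remaining tail, one by one
            probs = min_plus_mxm(PT[o], probs)
        return probs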
B. Some implementation details

To perform matrix operations we used SuiteSparse:GraphBLAS [5]. It is a high-performance implementation of the GraphBLAS [13] standard, which is intended for graph processing, and a hidden Markov model is naturally represented as a graph. It also provides various linear algebra primitives, such as the Min_plus semiring. Our implementation uses custom formats to define the HMM and the observation sequences, which simplifies data parsing. The full source code is available online [14].

IV. EVALUATION

In this section, we compare the specialized Viterbi algorithm against the initial one and against CUDAMPF [15].

CUDAMPF is a GPU implementation of the Viterbi algorithm. Since the Viterbi algorithm can be parallelized effectively, a GPU is a suitable choice. CUDAMPF works with hidden Markov models from bioinformatics: a model describes a protein family, and an observation sequence specifies a protein as a sequence of amino acids. If the probability of reaching some special state of the HMM is higher than a threshold, then the protein belongs to the protein family.

We took 24 HMMs from the CUDAMPF repository (https://github.com/Super-Hippo/CUDAMPF, date: 2021-12-02). All HMMs have a different number of states but the same structure. Since these HMMs have a slightly different definition, we implemented a converter into our custom format. We evaluated our solution on three datasets. Two datasets are randomly generated, each containing three sequences of 3500 and 7000 observations respectively. The third dataset consists of 16 real-world proteins taken from the Pfam [16] database; the length of the proteins varies from 38 to 7096 observations. The number of possible observations is K = 20 for all datasets.

We ran the experiments on Ubuntu 20.04, an Intel Core i7-6700 3.40 GHz, 64 GB RAM, and an NVIDIA GeForce GTX 1070. Each implementation with concrete parameters was run 10 times, and the median was taken as the result. We evaluated only the first- and second-level specializations of the presented algorithm, since the memory used for memoization grows exponentially with the level; for the third level an out-of-memory exception was thrown.

    Dataset    | CUDAMPF | Initial | 1-level | 2-level
    3 x 3500   |    4854 |   10765 |    8062 |  215329
    3 x 7000   |    9209 |   21062 |   16152 |  387464
    Real-world |    8796 |   15864 |   12036 |  298269

    TABLE I: Run time (both specialization and the Viterbi algorithm), ms

[Fig. 1: Evaluation results, run time in ms against the number of HMM states (lower is better): (a) 3 sequences of 3500 observations each; (b) 3 sequences of 7000 observations each; (c) the real-world dataset; (d) specialization algorithm run time.]

The results (see Figures 1a, 1b, 1c and Table I) show that the first-level specialized version of the Viterbi algorithm is, as expected, faster than the initial one. Unexpectedly, the second-level implementation is significantly slower than both the initial and the first-level implementations, and it is omitted from some figures. One of the reasons for such a slowdown is the increased memory consumption. Nevertheless, even on a workstation with 8 CPU threads, the CPU-based first-level specialization outperforms the GPU-based CUDAMPF on the small HMMs used in the CUDAMPF benchmarks (for the evaluation we used the exact set of HMMs from those benchmarks). Overall, these results are comparable with those of CUDAMPF. It is also worth noting that the parallel Viterbi algorithm expressed in terms of linear algebra is easier to implement than the dynamic programming version of CUDAMPF, since the abstraction level is higher. All of the above shows that the specialization of the Viterbi algorithm is applicable in practice.
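The out-of-memory failure at the third level is consistent with a back-of-the-envelope estimate; the sketch below (ours, assuming dense double-precision N × N handler matrices, while SuiteSparse:GraphBLAS can store them more compactly) counts the memory needed for the K^level handlers:

    def handlers_gib(n_states, k=20, level=2):
        # Memory for K^level dense N x N float64 handler matrices, in GiB.
        return k ** level * n_states ** 2 * 8 / 2 ** 30

    # For an HMM with 1000 states: ~3 GiB at level 2,
    # but ~60 GiB at level 3, close to the 64 GB of the test machine.
    print(handlers_gib(1000, level=2), handlers_gib(1000, level=3))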
V. RELATED WORK

There are many works where specialization has been applied successfully.

In [17] it was shown that the result of specializing the naïve pattern matching algorithm to a fixed pattern can be behaviourally equivalent to the Knuth-Morris-Pratt algorithm. This result is often used as a strength test for partial evaluators to be "good enough". Since we consider the specialization of one concrete algorithm only, the test is not applicable in our case.

Partial evaluation was applied to ray tracing [18] by P. H. Andersen. The author optimizes a ray tracer, gaining speedups from 1.8 to 3.0 depending on the meta-parameters and the compiler. The main performance improvement came from constant propagation and loop unrolling. This could be done by an optimizing compiler, but the partial evaluator is more aggressive, which means that a specialized algorithm can sometimes have an enormous code size and suffer a performance regression. There, the static data was written directly into the source code, while our solution can run without any modification of source files.

A. Tyurin, D. Berezun and S. Grigorev applied specialization to the naïve pattern matching string search algorithm implemented as a GPU program [19]. They obtained performance improvements of up to 8 times in some cases. A GPU has many simple arithmetic logic units, and all of them need data to work with, so the GPU data cache is a bottleneck. Using specialization, static data was moved into the code cache, which makes data cache misses less likely. One may call this "hardware specialization".

C. Sakama et al. used linear algebra as a representation of logic programs [4]. The authors introduce partial evaluation as a part of an algorithm that finds a logic model of a program. With specialization, the run time decreases by a factor of 10.

VI. DISCUSSION

There are several possible research directions and items of future work. The Viterbi algorithm specialization is the first step in finding out whether partial evaluation can be effectively applied to linear algebra-based algorithms.

First of all, the next step is to run the benchmarks on a GPU. SuiteSparse:GraphBLAS [5] is the reference CPU implementation of the GraphBLAS [13] standard. There are GPU implementations, such as GraphBLAST [20] and GBTL [21], but to our knowledge they are unstable.

One can also try to apply partial evaluation to other algorithms expressed in terms of linear algebra. Such experiments would reveal the limits of specialization for this class of algorithms, and there is a high chance that their results could be used in production.

Since partial evaluation can lead to a performance increase, it could be useful to implement a linear algebra library with specialization primitives, allowing more efficient linear algebra-based applications to be developed in less time. Another approach is hardware partial evaluation, e.g. an FPGA design in which the specialized program with its embedded static data is realized as a circuit.
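Returning to the idea of a library with specialization primitives: one hypothetical shape for such a primitive (ours, not an existing API) is a staged, lazily memoized handler cache. Unlike the eager precalculation above, it trades the exponential up-front memory cost for memoizing only the observation tuples that actually occur:

    def make_handler_cache(PT, min_plus_mxm):
        # Hypothetical "specialization primitive": memoize Min_plus
        # products of PT matrices on demand, so only tuples seen in the
        # input pay the precomputation cost, not all K^level of them.
        cache = {}
        def handler(seq):
            seq = tuple(seq)
            if seq not in cache:
                h = PT[seq[0]]
                for o in seq[1:]:
                    h = min_plus_mxm(PT[o], h)
                cache[seq] = h
            return cache[seq]
        return handler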
VII. CONCLUSION

In this paper, we study the application of partial evaluation to linear algebra-based algorithms on a particular example: the Viterbi algorithm with the HMM fixed as a static parameter. A specialized version of the Viterbi algorithm is presented. Our experiments show that on real benchmarks the presented algorithm can be comparable to CUDAMPF, an existing GPU-based implementation of the Viterbi algorithm. Thus, the proposed approach is applicable in practice, and further application of partial evaluation to linear algebra-based algorithms is a promising research direction.

REFERENCES

[1] C. Aggarwal, Linear Algebra and Optimization for Machine Learning: A Textbook, 2020.
[2] R. Szeliski, Computer Vision: Algorithms and Applications, 1st ed. Berlin, Heidelberg: Springer-Verlag, 2010.
[3] J. Gentle, Numerical Linear Algebra for Applications in Statistics. Springer-Verlag, 1998.
[4] C. Sakama, H. Nguyen, T. Sato, and K. Inoue, "Partial evaluation of logic programs in vector spaces," EasyChair Preprint no. 172, EasyChair, 2018.
[5] T. A. Davis, "Algorithm 1000: SuiteSparse:GraphBLAS: graph algorithms in the language of sparse linear algebra," ACM Transactions on Mathematical Software, vol. 45, 2019. URL: https://doi.org/10.1145/3322125
[6] N. Jones, C. Gomard, and P. Sestoft, Partial Evaluation and Automatic Program Generation. Prentice Hall, 1993.
[7] A. Viterbi, "Error bounds for convolutional codes and an asymptotically optimum decoding algorithm," IEEE Transactions on Information Theory, vol. 13, no. 2, pp. 260–269, 1967.
[8] S. R. Eddy, "Accelerated profile HMM searches," PLoS Computational Biology, vol. 7, no. 10, p. e1002195, 2011.
[9] L. R. Rabiner, A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1990, pp. 267–296.
[10] A. Petropoulos, S. P. Chatzis, and S. Xanthopoulos, "A novel corporate credit rating system based on Student's-t hidden Markov models," Expert Systems with Applications, vol. 53, pp. 87–105, 2016.
[11] E. Theodosis and P. Maragos, "Analysis of the Viterbi algorithm using tropical algebra and geometry," in 2018 IEEE 19th International Workshop on Signal Processing Advances in Wireless Communications (SPAWC), 2018, pp. 1–5.
[12] S. R. Eddy, "What is a hidden Markov model?" Nature Biotechnology, vol. 22, pp. 1315–1316, 2004. URL: https://doi.org/10.1038/nbt1004-1315
[13] A. Buluç, T. Mattson, S. McMillan, J. E. Moreira, and C. Yang, "Design of the GraphBLAS API for C," in IPDPS Workshops, 2017, pp. 643–652.
[14] Implementation repository. URL: https://github.com/IvanTyulyandin/Lin_alg_Viterbi (date: 2021-13-02).
[15] H. Jiang and N. Ganesan, "CUDAMPF: a multi-tiered parallel framework for accelerating protein sequence search in HMMER on CUDA-enabled GPU," BMC Bioinformatics, 2016.
[16] J. Mistry, S. Chuguransky, L. Williams, M. Qureshi, G. A. Salazar, E. L. L. Sonnhammer, S. C. E. Tosatto, L. Paladin, S. Raj, L. J. Richardson, R. D. Finn, and A. Bateman, "Pfam: The protein families database in 2021," Nucleic Acids Research, vol. 49, no. D1, pp. D412–D419, 2020.
[17] C. Consel and O. Danvy, "Partial evaluation of pattern matching in strings," Information Processing Letters, vol. 30, no. 2, pp. 79–86, 1989.
[18] P. H. Andersen, "Partial evaluation applied to ray tracing," 1996.
[19] A. Tyurin, D. Berezun, and S. Grigorev, "Optimizing GPU programs by partial evaluation," pp. 431–432, 2020.
[20] C. Yang, A. Buluç, and J. D. Owens, "GraphBLAST: A high-performance linear algebra-based graph framework on the GPU," 2020.
[21] GraphBLAS Template Library (GBTL), v. 3.0. URL: https://github.com/cmu-sei/gbtl (date: 2021-12-02).