=Paper=
{{Paper
|id=Vol-2953/SEIM_2021_paper_24
|storemode=property
|title=Viterbi Algorithm Specialization Using Linear Algebra
|pdfUrl=https://ceur-ws.org/Vol-2953/SEIM_2021_paper_24.pdf
|volume=Vol-2953
|authors=Ivan Tyulyandin,Daniil Berezun,Semyon Grigorev
}}
==Viterbi Algorithm Specialization Using Linear Algebra==
Ivan Tyulyandin
Mathematics and Mechanics Faculty, Saint Petersburg State University, Saint Petersburg, Russia
i.tyulyandin@2015.spbu.ru

Daniil Berezun
Department of Mathematics and Computer Science, Saint Petersburg State University, Saint Petersburg, Russia
d.berezun@2009.spbu.ru

Semyon Grigorev
Mathematics and Mechanics Faculty, Saint Petersburg State University, Saint Petersburg, Russia
s.v.grigoriev@spbu.ru

Abstract: Algorithms based on linear algebra are widely used in various areas. Often programs of interest have some input data that are independent of the dataset being processed, and thus they can be optimized with respect to this input. In this paper, we study how partial evaluation affects the Viterbi algorithm, as a step towards research on applying partial evaluation to linear algebra-based algorithms. We evaluate the specialized multi-threaded Viterbi algorithm against the existing GPU-based CUDAMPF. The results show that the presented algorithm is slower than, but comparable to, CUDAMPF.

I. INTRODUCTION

Algorithms based on linear algebra are widely used in various areas such as machine learning [1], computer vision [2], statistics [3], analysis of logic programs [4], graph theory [5], etc. A typical case is querying a huge database or processing a large graph, which can take days or weeks, making even a simple constant-factor optimization crucial. One way to optimize such data processing is to exploit hardware capabilities such as different kinds of parallelism, or to invent a more efficient algorithm or a clever data representation. We focus on an alternative way to optimize programs, based on the following observation: often programs of interest have some input data that are independent of the dataset being processed, and thus they can be optimized with respect to this input. Partial evaluation, a.k.a. program specialization, is a well-known program transformation technique aiming to perform such an optimization [6].

In this paper, we study how partial evaluation affects the Viterbi algorithm [7], as a step towards research on applying partial evaluation to linear algebra-based algorithms. First, the Viterbi algorithm is used in bioinformatics [8], speech recognition [9], and financial computations [10]. Second, it can be expressed in terms of linear algebra [11]. Third, the algorithm has two parameters, a hidden Markov model (HMM) [12] and an observation sequence, and its goal is to compute the probability for the sequence to be emitted by the given HMM. Next, sequentially applying the algorithm with a fixed HMM to a big batch of observation sequences is a usual workload. Finally, the main part of the Viterbi algorithm heavily depends on the HMM. All of the above makes the algorithm a good candidate for studying the application of partial evaluation.

The rest of the paper is organized as follows. Section II describes the background. In Section III the Viterbi algorithm specialization is explained. Section IV reports benchmark results. Related work is reviewed in Section V. Section VI discusses future work, and Section VII concludes the paper.

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

II. BACKGROUND

In this section, we review specialization, hidden Markov models (HMM), and the Viterbi algorithm in terms of linear algebra.

A. Specialization

Specialization [6], or partial evaluation, is a well-known program transformation technique widely used when some of the input data is already known at compile time. A typical case is serial data processing where one of the input parameters is fixed while the others vary. Fixed parameters are called static, while the other parameters are called dynamic. The idea behind specialization is that optimizing a program with respect to its static parameters and then executing the optimized program on a set of dynamic parameters may be more efficient than repeatedly executing the initial program on both static and dynamic parameters.

The classical specialization example is the exponentiation function f(x, n) = x^n where n is static. A simple implementation is shown below.

    function f(x, n) =
        if n == 0 then 1
        elif even(n) then f(x, n/2) ^ 2
        else x * f(x, (n-1)/2) ^ 2

All recursive calls are static, i.e. they are controlled by the static parameter only, and thus can be reduced. Given a fixed n, say 5, a typical specialized version is

    function f_spec(x) = x * (x ^ 2) ^ 2

Note that sometimes specialization is useless. For example, consider the exponentiation function with a fixed base x but a dynamic power n. One may use arithmetic tricks for some particular x, but in general there is no recipe for effective specialization. Moreover, since optimal specialization is undecidable, heuristics are used to ensure that specialization terminates. As a result, in some cases specialization worsens program execution; for example, it is well known that specialization can negatively affect performance because of code expansion [6].
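To make the "generating extension" idea used later in the paper concrete, here is a minimal sketch (ours, not from the paper) of such an extension for the exponentiation example: instead of specializing f directly, it emits the text of the residual Python function for a given static n.

    # Toy generating extension for f(x, n) = x^n: given the static n,
    # unfold all recursion on n and emit a residual Python function.
    # Illustrative sketch only; all names here are ours.
    def gen_power(n):
        def body(n):
            if n == 0:
                return "1"
            if n % 2 == 0:
                return "({}) ** 2".format(body(n // 2))
            return "x * ({}) ** 2".format(body((n - 1) // 2))
        return "def f_spec(x): return " + body(n)

    # gen_power(5) emits "def f_spec(x): return x * ((x * (1) ** 2) ** 2) ** 2";
    # a real partial evaluator would also simplify the unit (1) ** 2 away,
    # yielding x * (x ** 2) ** 2 as in the example above.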
B. Hidden Markov model

A hidden Markov model is a probabilistic automaton [12]. It has the following parameters:

• S_{1..N}: the N states of the automaton;
• O_{1..K}: the K possible observations;
• B_{1..N}: a column of probabilities for each state from S_{1..N} to be the start one;
• T_{1..N,1..N}: the state transition matrix, where T_{i,j} is the probability to go from state S_i to state S_j;
• E_{1..N,1..K}: the emission matrix, where E_{i,j} is the probability to emit observation O_j at state S_i.

Given an observation sequence, it is possible to calculate the maximum likelihood of being in a concrete state of the HMM, i.e. to reveal the hidden states, according to the sequence. The HMM makes a transition between states for each observation from the given sequence.

C. Viterbi algorithm

Let us fix an observation sequence Obs of length lo. The Viterbi algorithm [11] handles the sequence Obs for an HMM. Its result is the maximum probability of reaching each state of the HMM after handling Obs.

First of all, the HMM probabilities are transformed into negative binary logarithms. We define such transformed probabilities as t(p), where p is some probability from the HMM definition:

    t(p) = −log₂(p)   if p > 0
    t(p) = +∞         if p = 0                                  (1)

E.g., the probability 0.5 is expressed as −log₂(0.5) = 1. This is done to reduce the loss of precision.

The key idea is to use a special algebraic structure named the Min_plus semiring. Its elements are floats: semiring addition is defined as the minimum of two floats, semiring multiplication is float addition, and the neutral elements are +∞ and 0 respectively. This is an example of matrix multiplication over the Min_plus semiring:

    ( 0   1 )   ( 3 )   ( min(0 + 3, 1 + 4)  )   ( 3 )
    ( +∞  2 ) × ( 4 ) = ( min(+∞ + 3, 2 + 4) ) = ( 6 )
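The semiring is easy to check with a few lines of dense NumPy; the sketch below is ours (the paper's actual implementation uses SuiteSparse:GraphBLAS instead) and reproduces the example:

    import numpy as np

    def min_plus_mxm(a, b):
        # (a x b)[i, j] = min over k of (a[i, k] + b[k, j]):
        # semiring addition is min, semiring multiplication is +.
        return np.min(a[:, :, None] + b[None, :, :], axis=1)

    a = np.array([[0.0, 1.0], [np.inf, 2.0]])
    b = np.array([[3.0], [4.0]])
    print(min_plus_mxm(a, b))  # [[3.] [6.]], matching the example above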
The expression t_{i,j} denotes the transformed probability of getting observation j at state i, i.e. t(E_{i,j}). For each observation j we build a diagonal matrix P(j) from the data of E, with t_{1,j}, ..., t_{N,j} on the diagonal and +∞ elsewhere:

    P(j) = ( t_{1,j}  ...    +∞     )
           (  ...     ...    ...    )
           (  +∞      ...  t_{N,j}  )

The initial step of the Viterbi algorithm sets up the data according to the first observation of the sequence Obs. Below, the symbol × stands for matrix multiplication over the Min_plus semiring, and the column B defines the probability distribution over start states:

    Probs_1 = P(Obs_1) × B

The next step is to handle the rest of the sequence, where t ranges from 2 up to lo:

    Probs_t = P(Obs_t) × T^T × Probs_{t−1}

As a result, the column Probs_lo contains the transformed probabilities of being in each state of the HMM once the observation sequence Obs has been handled.
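For reference, the initial (unspecialized) algorithm fits in a few lines under the same assumptions as the previous sketch (ours: dense NumPy instead of GraphBLAS, 0-based observation indices, and B_t, T_t, E_t already transformed with t(p)):

    import numpy as np

    def min_plus_mxm(a, b):
        # Min_plus semiring product, as in the earlier sketch.
        return np.min(a[:, :, None] + b[None, :, :], axis=1)

    def P(E_t, j):
        # Diagonal matrix P(j): t(E[i, j]) on the diagonal, +inf elsewhere.
        m = np.full((E_t.shape[0],) * 2, np.inf)
        np.fill_diagonal(m, E_t[:, j])
        return m

    def viterbi(obs, B_t, T_t, E_t):
        # Probs_1 = P(Obs_1) x B; Probs_t = P(Obs_t) x T^T x Probs_{t-1}.
        probs = min_plus_mxm(P(E_t, obs[0]), B_t)  # B_t is an (N, 1) column
        for o in obs[1:]:
            probs = min_plus_mxm(P(E_t, o), min_plus_mxm(T_t.T, probs))
        return probs  # transformed probabilities per state, shape (N, 1)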
III. SPECIALIZED VITERBI ALGORITHM

Here we describe the specialization of the Viterbi algorithm in terms of linear algebra. We fix the HMM as the static parameter. To the best of our knowledge, there is no stable partial evaluator that supports parallel program transformation and provides the expected results. In order to achieve the specialization effect we made an ad-hoc generating extension, i.e. we provide an effective procedure that performs static data transformation and propagation, together with a handwritten specialized version of the Viterbi algorithm itself.

A. Theory

The goal is to embed the data of the given HMM into the program and simplify expressions. These static data are N, K, B, T, and E. Given a fixed HMM, for every possible observation o ∈ O the following matrices and matrix multiplications can be precalculated during the specialization phase, according to the Viterbi algorithm definition given in the previous section:

• P(o),
• P(o) × B, denoted below as PB(o),
• P(o) × T^T, denoted below as PT(o).

We precalculate these operations and memoize the results for further use by the specialized algorithm. Pseudocode of the precalculation procedure is shown in Listing 1, function spec_Viterbi.

The specialized Viterbi algorithm is shown in Listing 1, function run_Viterbi, and works as follows. The initial step can be expressed as

    Probs_1 = PB(Obs_1).

The rest of the sequence Obs is handled with the multiplication

    Probs_t = PT(Obs_t) × Probs_{t−1}.                          (2)

Compared to the initial version, the specialized one performs fewer matrix multiplications over the Min_plus semiring. The first step becomes one matrix assignment instead of a multiplication, and every remaining observation needs only one matrix multiplication instead of two. Thus, the initial Viterbi algorithm in terms of linear algebra performs 1 + 2·(lo − 1) matrix multiplications, where lo is the length of Obs, while the specialized version performs only lo − 1 multiplications, at the cost of additional memory to keep the precalculated matrices.

Since matrix multiplication over the Min_plus semiring is associative, one can handle two observations at once:

    Probs_t = PT(Obs_t) × Probs_{t−1}
            = PT(Obs_t) × (PT(Obs_{t−1}) × Probs_{t−2})
            = (PT(Obs_t) × PT(Obs_{t−1})) × Probs_{t−2}         (3)

Since we know all PT(o), we can precalculate these multiplications, i.e. compute K² matrices (one per pair of observations), and use them as needed. This method extends to handling more observations at once, e.g. for three observations:

    Probs_t = PT(Obs_t) × PT(Obs_{t−1}) × PT(Obs_{t−2}) × Probs_{t−3}        (4)

For three observations there are accordingly K³ precalculated matrices. We call equation (2) 1-level specialization, since the handling of only one observation is precalculated. By analogy, we call equations (3) and (4) the second and third specialization levels, and so on. The N-level matrices can be computed from PT and the (N−1)-level matrices as follows: for every observation o, multiply PT(o) by each matrix of the previous level.

If N-level specialization is applied, ⌊(lo − 1)/N⌋ + (lo − 1) mod N matrix multiplications are required to run the partially evaluated Viterbi algorithm. Memory consumption rapidly increases with higher N, since K^N precalculated matrices have to be kept in memory.

    HMM
    PB[HMM.K]
    PT[HMM.K]
    level

    // obs_lvl_handlers is a mapping from lists of
    // `level` observations to a handler matrix
    obs_lvl_handlers[HMM.K ^ level]

    function spec_Viterbi()
        for i = 1..HMM.K
            PB[i] = P(HMM.O[i]) × HMM.B
            PT[i] = P(HMM.O[i]) × (HMM.T)^T
        // To handle a sequence of length `level`, the
        // appropriate matrices from PT should be multiplied.
        // Put handlers for all possible sequences of
        // length `level` into obs_lvl_handlers.
        get_combinations(obs_lvl_handlers, level, PT)

    function run_Viterbi(Obs)
        // First observation handling
        Probs = PB[Obs[1]]

        lo = length(Obs)
        i = 2

        // While there are at least `level` observations left
        while (lo - i) >= level
            // Find the matrix handling the next `level` observations
            handler = obs_lvl_handlers.find(Obs[i : i + level])
            Probs = handler × Probs
            i = i + level

        // The rest of the sequence
        for (; i <= lo; i = i + 1)
            Probs = PT[Obs[i]] × Probs

        return Probs

    Listing 1: The specialized Viterbi algorithm with levels
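A dense Python counterpart of Listing 1 (ours, with the same NumPy assumptions as the earlier sketches and 0-based indices) may help clarify the level mechanics; precalculate plays the role of spec_Viterbi and run_viterbi that of run_Viterbi:

    from itertools import product
    import numpy as np

    def min_plus_mxm(a, b):
        # Min_plus semiring product, as in the earlier sketches.
        return np.min(a[:, :, None] + b[None, :, :], axis=1)

    def diag_P(E_t, j):
        # Diagonal matrix P(j) built from column j of the transformed E.
        m = np.full((E_t.shape[0],) * 2, np.inf)
        np.fill_diagonal(m, E_t[:, j])
        return m

    def precalculate(B_t, T_t, E_t, level):
        # PB(o) = P(o) x B, PT(o) = P(o) x T^T, and one handler
        # matrix per tuple of `level` consecutive observations.
        K = E_t.shape[1]
        PB = [min_plus_mxm(diag_P(E_t, o), B_t) for o in range(K)]
        PT = [min_plus_mxm(diag_P(E_t, o), T_t.T) for o in range(K)]
        handlers = {}
        for seq in product(range(K), repeat=level):  # K^level tuples
            h = PT[seq[0]]
            for o in seq[1:]:          # later observations act on the left
                h = min_plus_mxm(PT[o], h)
            handlers[seq] = h
        return PB, PT, handlers

    def run_viterbi(obs, PB, PT, handlers, level):
        probs = PB[obs[0]]             # first observation: assignment only
        i = 1
        while len(obs) - i >= level:   # blocks of `level` observations
            probs = min_plus_mxm(handlers[tuple(obs[i:i + level])], probs)
            i += level
        for o in obs[i:]:              # the remaining tail, one by one
            probs = min_plus_mxm(PT[o], probs)
        return probs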
B. Some implementation details

To perform matrix operations we used SuiteSparse:GraphBLAS [5]. It is a high-performance implementation of the GraphBLAS [13] standard, which is intended for graph processing, and a hidden Markov model is naturally represented as a graph. It also provides various linear algebra primitives, such as the Min_plus semiring. Our implementation uses custom formats to define the HMM and the observation sequences, which simplifies data parsing. The full source code is available online [14].

IV. EVALUATION

In this section, we compare the specialized Viterbi algorithm against the initial one and against CUDAMPF [15].

CUDAMPF is a GPU implementation of the Viterbi algorithm. Since the Viterbi algorithm can be parallelized effectively, a GPU is a suitable choice. CUDAMPF works with hidden Markov models from bioinformatics: a model describes a protein family, and an observation sequence specifies a protein as a sequence of amino acids. If the probability of reaching some special state of the HMM is higher than a threshold, then the protein belongs to the protein family.

We took 24 HMMs from the CUDAMPF repository (https://github.com/Super-Hippo/CUDAMPF, date: 2021-12-02). All HMMs have a different number of states but the same structure. Since these HMMs have a slightly different definition, we implemented a converter into our custom format. We evaluated our solution on three datasets. Two datasets are randomly generated, each containing three sequences of 3500 and 7000 observations respectively. The third dataset consists of 16 real-world proteins taken from the Pfam [16] database; the length of the proteins varies from 38 to 7096 observations. The number of possible observations is K = 20 for all datasets.

We ran the experiments on Ubuntu 20.04, an Intel Core i7-6700 3.40 GHz, 64 GB RAM, and an NVIDIA GeForce GTX 1070. Each implementation with concrete parameters was run 10 times, and the median was taken as the result. We evaluated only the first- and second-level specializations of the presented algorithm, since the memory used for memoization grows exponentially with the level; for the third level an out-of-memory exception was thrown.

    Dataset    | CUDAMPF | Initial | 1-level | 2-level
    3 x 3500   |    4854 |   10765 |    8062 |  215329
    3 x 7000   |    9209 |   21062 |   16152 |  387464
    Real-world |    8796 |   15864 |   12036 |  298269

    TABLE I: Run time (both specialization and the Viterbi algorithm), ms

[Fig. 1: Evaluation results, run time in ms against the number of HMM states (lower is better): (a) 3 sequences of 3500 observations each; (b) 3 sequences of 7000 observations each; (c) the real-world dataset; (d) specialization algorithm run time.]

The results (see Figures 1a, 1b, 1c and Table I) show that the first-level specialized version of the Viterbi algorithm is, as expected, faster than the initial one. Unexpectedly, the second-level implementation is significantly slower than both the initial and the first-level implementations, and it is omitted from some figures. One of the reasons for such a slowdown is the increased memory consumption. Nevertheless, even on a workstation with 8 CPU threads, the CPU-based first-level specialization outperforms the GPU-based CUDAMPF on the small HMMs used in the CUDAMPF benchmarks (for the evaluation we used the exact set of HMMs from those benchmarks). Overall, these results are comparable with those of CUDAMPF. It is also worth noting that the parallel Viterbi algorithm expressed in terms of linear algebra is easier to implement than the dynamic programming version of CUDAMPF, since the abstraction level is higher. All of the above shows that the specialization of the Viterbi algorithm is applicable in practice.
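The out-of-memory failure at the third level is consistent with a back-of-the-envelope estimate; the sketch below (ours, assuming dense double-precision N × N handler matrices, while SuiteSparse:GraphBLAS can store them more compactly) counts the memory needed for the K^level handlers:

    def handlers_gib(n_states, k=20, level=2):
        # Memory for K^level dense N x N float64 handler matrices, in GiB.
        return k ** level * n_states ** 2 * 8 / 2 ** 30

    # For an HMM with 1000 states: ~3 GiB at level 2,
    # but ~60 GiB at level 3, close to the 64 GB of the test machine.
    print(handlers_gib(1000, level=2), handlers_gib(1000, level=3))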
V. RELATED WORK

There are many works where specialization has been applied successfully.

In [17] it was shown that the result of specializing the naïve pattern matching algorithm to a fixed pattern can be behaviourally equivalent to the Knuth-Morris-Pratt algorithm. This result is often used as a strength test for partial evaluators to be "good enough". Since we consider the specialization of one concrete algorithm only, the test is not applicable in our case.

Partial evaluation was applied to ray tracing [18] by P. H. Andersen. The author optimizes a ray tracer, gaining speedups from 1.8 to 3.0 depending on the meta-parameters and the compiler. The main performance improvement came from constant propagation and loop unrolling. This could be done by an optimizing compiler, but the partial evaluator is more aggressive, which means that a specialized algorithm can sometimes have an enormous code size and suffer a performance regression. There, the static data was written directly into the source code, while our solution can run without any modification of source files.

A. Tyurin, D. Berezun and S. Grigorev applied specialization to the naïve pattern matching string search algorithm implemented as a GPU program [19]. They obtained performance improvements of up to 8 times in some cases. A GPU has many simple arithmetic logic units, and all of them need data to work with, so the GPU data cache is a bottleneck. Using specialization, static data was moved into the code cache, which makes data cache misses less likely. One may call this "hardware specialization".

C. Sakama et al. used linear algebra as a representation of logic programs [4]. The authors introduce partial evaluation as a part of an algorithm that finds a logic model of a program. With specialization, the run time decreases by a factor of 10.

VI. DISCUSSION

There are several possible research directions and items of future work. The Viterbi algorithm specialization is the first step in finding out whether partial evaluation can be effectively applied to linear algebra-based algorithms.

First of all, the next step is to run the benchmarks on a GPU. SuiteSparse:GraphBLAS [5] is the reference CPU implementation of the GraphBLAS [13] standard. There are GPU implementations, such as GraphBLAST [20] and GBTL [21], but to our knowledge they are unstable.

One can also try to apply partial evaluation to other algorithms expressed in terms of linear algebra. Such experiments would reveal the limits of specialization for this class of algorithms, and there is a high chance that their results could be used in production.

Since partial evaluation can lead to a performance increase, it could be useful to implement a linear algebra library with specialization primitives, allowing more efficient linear algebra-based applications to be developed in less time. Another approach is hardware partial evaluation, e.g. an FPGA design in which the specialized program with its embedded static data is realized as a circuit.
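Returning to the idea of a library with specialization primitives: one hypothetical shape for such a primitive (ours, not an existing API) is a staged, lazily memoized handler cache. Unlike the eager precalculation above, it trades the exponential up-front memory cost for memoizing only the observation tuples that actually occur:

    def make_handler_cache(PT, min_plus_mxm):
        # Hypothetical "specialization primitive": memoize Min_plus
        # products of PT matrices on demand, so only tuples seen in the
        # input pay the precomputation cost, not all K^level of them.
        cache = {}
        def handler(seq):
            seq = tuple(seq)
            if seq not in cache:
                h = PT[seq[0]]
                for o in seq[1:]:
                    h = min_plus_mxm(PT[o], h)
                cache[seq] = h
            return cache[seq]
        return handler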
VII. CONCLUSION

In this paper, we study the application of partial evaluation to linear algebra-based algorithms on a particular example: the Viterbi algorithm with the HMM fixed as a static parameter. A specialized version of the Viterbi algorithm is presented. Our experiments show that on real benchmarks the presented algorithm can be comparable to CUDAMPF, an existing GPU-based implementation of the Viterbi algorithm. Thus, the proposed approach is applicable in practice, and further application of partial evaluation to linear algebra-based algorithms is a promising research direction.

REFERENCES

[1] C. Aggarwal, Linear Algebra and Optimization for Machine Learning: A Textbook, 2020.
[2] R. Szeliski, Computer Vision: Algorithms and Applications, 1st ed. Berlin, Heidelberg: Springer-Verlag, 2010.
[3] J. Gentle, Numerical Linear Algebra for Applications in Statistics. Springer-Verlag, 1998.
[4] C. Sakama, H. Nguyen, T. Sato, and K. Inoue, "Partial evaluation of logic programs in vector spaces," EasyChair Preprint no. 172, EasyChair, 2018.
[5] T. A. Davis, "Algorithm 1000: SuiteSparse:GraphBLAS: graph algorithms in the language of sparse linear algebra," ACM Transactions on Mathematical Software, vol. 45, 2019. URL: https://doi.org/10.1145/3322125
[6] N. Jones, C. Gomard, and P. Sestoft, Partial Evaluation and Automatic Program Generation. Prentice Hall, 1993.
[7] A. Viterbi, "Error bounds for convolutional codes and an asymptotically optimum decoding algorithm," IEEE Transactions on Information Theory, vol. 13, no. 2, pp. 260–269, 1967.
[8] S. R. Eddy, "Accelerated profile HMM searches," PLoS Computational Biology, vol. 7, no. 10, p. e1002195, 2011.
[9] L. R. Rabiner, A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1990, pp. 267–296.
[10] A. Petropoulos, S. P. Chatzis, and S. Xanthopoulos, "A novel corporate credit rating system based on Student's-t hidden Markov models," Expert Systems with Applications, vol. 53, pp. 87–105, 2016.
[11] E. Theodosis and P. Maragos, "Analysis of the Viterbi algorithm using tropical algebra and geometry," in 2018 IEEE 19th International Workshop on Signal Processing Advances in Wireless Communications (SPAWC), 2018, pp. 1–5.
[12] S. R. Eddy, "What is a hidden Markov model?" Nature Biotechnology, vol. 22, pp. 1315–1316, 2004. URL: https://doi.org/10.1038/nbt1004-1315
[13] A. Buluç, T. Mattson, S. McMillan, J. E. Moreira, and C. Yang, "Design of the GraphBLAS API for C," in IPDPS Workshops, 2017, pp. 643–652.
[14] Implementation repository. URL: https://github.com/IvanTyulyandin/Lin_alg_Viterbi (date: 2021-13-02).
[15] H. Jiang and N. Ganesan, "CUDAMPF: a multi-tiered parallel framework for accelerating protein sequence search in HMMER on CUDA-enabled GPU," BMC Bioinformatics, 2016.
[16] J. Mistry, S. Chuguransky, L. Williams, M. Qureshi, G. A. Salazar, E. L. L. Sonnhammer, S. C. E. Tosatto, L. Paladin, S. Raj, L. J. Richardson, R. D. Finn, and A. Bateman, "Pfam: The protein families database in 2021," Nucleic Acids Research, vol. 49, no. D1, pp. D412–D419, 2020.
[17] C. Consel and O. Danvy, "Partial evaluation of pattern matching in strings," Information Processing Letters, vol. 30, no. 2, pp. 79–86, 1989.
[18] P. H. Andersen, "Partial evaluation applied to ray tracing," 1996.
[19] A. Tyurin, D. Berezun, and S. Grigorev, "Optimizing GPU programs by partial evaluation," pp. 431–432, 2020.
[20] C. Yang, A. Buluç, and J. D. Owens, "GraphBLAST: A high-performance linear algebra-based graph framework on the GPU," 2020.
[21] GraphBLAS Template Library (GBTL), v. 3.0. URL: https://github.com/cmu-sei/gbtl (date: 2021-12-02).