Attacking the problem of continuous speech segmentation into basic units

I.A. Andreev1, A.I. Armer1, N.A. Krasheninnikova2, V.S. Moshkin1
1 Ulyanovsk State Technical University, Severny Venetz St., 32, 432027, Ulyanovsk, Russia
2 Ulyanovsk State University, Lev Tolstoy St., 42, 432017, Ulyanovsk, Russia

Abstract

The paper considers an algorithm for segmenting continuous speech into basic units, namely phonemes, certain combinations of phonemes, and pauses. The algorithm is based on transforming the speech signal into a two-dimensional image, an autocorrelation portrait. To determine the boundaries of speech units, the portraits of the analyzed signal are aligned with the model portraits of each speech unit. The authors apply dynamic programming to find the optimal distance between portraits.

Keywords: speech signal; segmentation; autocorrelation portrait; speech units; discrete dynamic programming

1. Introduction

At present, algorithms for segmenting continuous speech into verbal units - phonemes, their combinations and pauses - are in considerable demand. This problem arises, for example, when creating systems for speech research, processing, modeling and automatic speech recognition. For such systems to be usable under different acoustic conditions, they must meet strict requirements for robustness to acoustic noise and speech signal distortion. The article presents a method for determining the boundaries of speech pauses and of the speech units that correspond to SAMPA+ for the Russian language [1], [2]. The speech signal transformation and processing algorithms used in the suggested method meet these strict requirements for robustness to acoustic noise and speech signal distortion.

2. The subject of investigation

The problem of segmenting a speech signal into its basic units is extremely complicated and challenging, and at present there is no simple solution for the general case. It is noted in [3] that there are cases for which exact segmentation is problematic. Different methods [3], [4], [5] are used for continuous speech signal segmentation; among them are methods based on spectral analysis, trajectories of signal energy, the energy logarithm, the number of zero crossings, and statistical parameters of speech units. These methods give good results under favorable acoustic conditions, but the results deteriorate in the presence of noise. Moreover, the time length of a speech signal varies from one pronunciation to another, which also complicates its segmentation into basic units. The authors suggest using the autocorrelation transformation [6], [7] of the speech signal into a two-dimensional image, together with particular ways of aligning such images, in order to improve noise stability when determining speech unit boundaries. The autocorrelation transformation has a number of characteristics that make it somewhat noise-resistant [8]; thus, the proposed segmentation method can be expected to be less dependent on the acoustic conditions under which the speech was pronounced. Using discrete dynamic programming [9] when aligning the two-dimensional speech signal images increases the stability of the method to changes in the time length of speech units.
3. Algorithm for determining speech unit boundaries

3.1. General algorithm

The algorithm for determining speech unit boundaries is as follows. A speech signal containing the fragment of continuous speech to be analyzed for speech unit boundaries is represented as digital readouts. The models of each speech unit are also represented as digital readouts: for benchmarking, each example of a speech unit corresponding to SAMPA+ is pronounced by the speaker, its boundaries are defined by ear, and the resulting segment becomes a model. By means of the autocorrelation transformation, the digital readouts of the analyzed continuous speech segment and the readouts of every model speech unit are transformed into particular two-dimensional images called autocorrelation portraits (ACPs). For further alignment, the portraits of the analyzed speech segment and of every model speech unit have the same line length. Next, the portrait of the analyzed speech segment is aligned with all portraits of the model speech units to determine the speech unit boundaries. For this purpose, the distance [10], [11] is calculated in a sliding window whose size equals the number of lines in the corresponding speech unit portrait. During the calculation, the distance between the windows is optimized using discrete dynamic programming. For each speech unit, an array of distances along the portrait of the analyzed speech segment is obtained. The distances corresponding to the same fragments of the analyzed speech segment portrait are compared with each other, and the speech unit portraits with the smallest distances define the desired boundaries. If the smallest distance is obtained for portraits of identical speech units that follow one another, they are combined into the boundaries of one speech unit.

3.2. Autocorrelation portraits of speech signals

Since autocorrelation links are rather informative, i.e. they reflect the features of a speech signal, ACPs are unique for each speech unit. This provides good results in determining the speech unit boundaries in continuous speech. In [12] ACPs are modeled in the following way. Let s(i) be the i-th readout of a digital speech signal, and s(i + k) a readout spaced k readouts apart from s(i). The dependency between these readouts is expressed by the sample correlation coefficient:

$$R_s(k) = R[s(i), s(i+k)] = \frac{\operatorname{cov}[s(i), s(i+k)]}{\sqrt{\frac{1}{N}\sum_{i=1}^{N} s^2(i) - m_{s(i)}^2}\ \sqrt{\frac{1}{N}\sum_{i=1}^{N} s^2(i+k) - m_{s(i+k)}^2}},$$

$$\operatorname{cov}[s(i), s(i+k)] = \frac{1}{N}\sum_{i=1}^{N} s(i)\,s(i+k) - \left[\frac{1}{N}\sum_{i=1}^{N} s(i)\right]\frac{1}{N}\sum_{i=1}^{N} s(i+k), \qquad (1)$$

where N is the number of readouts in the interval in which the dependency is sought; cov[s(i), s(i+k)] is the sample covariance of s(i) and s(i+k) for i = 1..N; m_{s(i)} is the sample mean of s(i) for i = 1..N; and m_{s(i+k)} is the sample mean of s(i+k) for i = 1..N. The function determined by the sample correlation coefficient (1) is the autocorrelation function (ACF) of the signal.
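For illustration only, the sample correlation coefficient in (1) might be computed along the following lines (a minimal NumPy sketch; the function names `sample_correlation` and `acf_row`, the 0-based indexing and the guards near the signal end are our own assumptions, not part of the original method):

```python
import numpy as np

def sample_correlation(s, start, N, k):
    """Sample correlation coefficient R[s(i), s(i+k)] of equation (1): sample
    covariance normalized by the two sample standard deviations, estimated over
    a window of N readouts beginning at index `start` (0-based)."""
    x = np.asarray(s[start:start + N], dtype=float)          # s(i), i = 1..N
    y = np.asarray(s[start + k:start + k + N], dtype=float)  # s(i+k), i = 1..N
    n = min(len(x), len(y))                                   # guard against the signal end
    if n == 0:
        return 0.0
    x, y = x[:n], y[:n]
    cov = np.mean(x * y) - np.mean(x) * np.mean(y)            # sample covariance
    sx = np.sqrt(max(np.mean(x ** 2) - np.mean(x) ** 2, 0.0)) # sample std of s(i)
    sy = np.sqrt(max(np.mean(y ** 2) - np.mean(y) ** 2, 0.0)) # sample std of s(i+k)
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0

def acf_row(s, start, N):
    """One ACF line: R_s(k) for lags k = 1..N, computed from readout `start`."""
    return np.array([sample_correlation(s, start, N, k) for k in range(1, N + 1)])
```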
While calculating the ACF, the readouts s(i), i = 1..M, of the speech signal (SS), where M is the number of readouts in the speech signal, are transformed into a two-dimensional image. For this purpose, s(i) is divided into intervals of N < M readouts, and in each j-th interval (j = 1, N, 2N, ..., M - 2N) the position i_m^j of the local maximum of |s| is found. Let us assume that M is evenly divisible by N; otherwise the remaining final SS readouts are omitted. Then, using equation (1), the elements of the corresponding ACP line are calculated beginning with i_m^j, and the ACP lines are generated:

$$X(j, k) = R[s(i_m^j), s(i_m^j + k)], \qquad k = 1..N, \quad j = 1, N, 2N, \ldots, M - 2N. \qquad (2)$$

The two-dimensional image X(j, k) obtained from (2), where j is the line number and k is the column number, is the ACP of the speech signal s(i); it is dimensioned N × (M/N − 2) and generated using the SS local maxima. Note that ACPs generated using local maxima are unique for each speech unit and, due to their link with the SS local maxima, are less subject to the geometric distortions associated with speech variability. Figure 1 represents the ACPs of the speech units ["a], [o], [n`:] and [f] (SAMPA+).

Fig. 1. ACPs of speech units: a) ["a], b) [o], c) [n`:], d) [f].
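One possible reading of the ACP construction in (2) is sketched below, reusing the hypothetical `acf_row` helper from the previous sketch; the 0-based interval handling at the end of the signal is our own assumption:

```python
import numpy as np

def acp_from_signal(s, N):
    """Build the ACP X(j, k) of equation (2): split the signal into intervals of
    N readouts, locate the maximum of |s| in each interval, and compute one ACF
    line (lags k = 1..N) starting from that local maximum."""
    s = np.asarray(s, dtype=float)
    M = len(s) - len(s) % N            # drop trailing readouts if M is not a multiple of N
    rows = []
    # 0-based interval starts 0, N, 2N, ...; the last two intervals are skipped
    # so that all lags k = 1..N stay inside the signal (M/N - 2 lines in total)
    for j in range(0, M - 2 * N, N):
        i_m = j + int(np.argmax(np.abs(s[j:j + N])))   # local maximum of |s| in the interval
        rows.append(acf_row(s, i_m, N))                # ACP line computed from that readout
    return np.vstack(rows)
```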
3.3. Alignment of autocorrelation portraits using discrete dynamic programming

Due to the high degree of speech signal variability, autocorrelation portraits of one speech unit pronounced at different times differ from each other. Figure 2 shows two ACPs of the speech unit "unstressed [a]": one of them (a) was obtained from the pronunciation of the word «Вера» / "Vera", and the other (b) from the word «сопутствующие» / "soputstvujushhie". The portraits obviously differ in the number of lines; nevertheless, several lines of portrait (a) can correspond to one line of portrait (b). The distance between corresponding ACP lines, namely the i-th line of portrait X and the j-th line of portrait Y, is determined by the formula

$$\rho_{i,j} = \sum_{k=1}^{N} \bigl(X(i, k) - Y(j, k)\bigr)^2. \qquad (3)$$

Fig. 2. ACPs of the speech unit "unstressed [a]": a) model, b) as a part of the word «сопутствующие» / "soputstvujushhie".

To determine the measure of ACP concordance, discrete dynamic programming [9] is applied. It minimizes the functional $\rho = \min_{\Omega} \sqrt{\sum \rho_{i,j}}$, which characterizes the identity of the ACPs. The set Ω predetermines the permitted correspondences of the portrait lines and is obtained on the basis of the following rules.
1. The numbers of lines in the ACPs can differ.
2. A line of one ACP cannot correspond to a line of the other ACP spaced more than c lines from the previous corresponding line.
3. The order of line correspondence is preserved, i.e. if the i-th line of one ACP corresponds to the j-th line of the other, then the (i + 1)-th line cannot correspond to the (j - l)-th line, l = 1, 2, ....
4. The total distance between the ACPs of pronunciations of the same speech unit, formed from the distances (3) between the corresponding lines, should be minimal subject to rules 1-3.

To determine the measure of correspondence between the speech signal ACP (in a two-dimensional sliding window) and a speech unit ACP, the following algorithm is used. A matrix D of m × m elements is created, where m is the number of ACP lines in the sliding window X; the number of lines in the speech unit ACP Y is the same. For example, let c = 3. At first, the distances between Y(1) and X(1), X(2), X(3) are found and stored in D:

$$D_{1,i} = \rho(Y(1), X(i)), \qquad i = 1..3. \qquad (4)$$

Then the distances between Y(2) and X(1), X(2), X(3), X(4), X(5) are found. The position of line Y(1) is taken into account: if Y(1) corresponds to X(2), then Y(2) can be compared only with X(2), X(3), X(4). Each time the line number of portrait X must be remembered, and the matrix is filled in as $D_{2,j} = D_{1,i} + \rho(Y(2), X(j))$, $j = i..i+2$. Moreover, due to the intersection of possible line positions, each element of D can be filled in several times; in this case the minimum value is preserved (Figure 3):

$$D_{k,j} = \min\bigl[D_{k,j},\ D_{k-1,i} + \rho(Y(k), X(j))\bigr], \qquad j = i..i+2. \qquad (5)$$

During the next stages, all remaining elements of matrix D are found using formula (5); at each stage, i changes from 1 to I + 2, where I is the maximum value of i at the previous stage (for the first stage, I = 1). The algorithm stops when matrix D is completely filled. The minimal element in the m-th line and the m-th column of the matrix corresponds to the minimal distance between X and Y.

Fig. 3. Distribution of compared ACP lines. ρ(i-j) is the distance between the i-th line of one ACP and the j-th line of the other. The mark "min" indicates that, of all coinciding comparisons at different stages of the programming, the one with the minimal distance is kept.
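A compact sketch of this alignment step is given below, under our reading of equations (3)-(5); the function names, the final square root, and taking the minimum over the last line and last column of D are assumptions based on the text above, not the authors' reference implementation:

```python
import numpy as np

def line_distance(x_row, y_row):
    """Equation (3): squared Euclidean distance between two ACP lines."""
    return float(np.sum((x_row - y_row) ** 2))

def acp_distance(X, Y, c=3):
    """Minimal cumulative distance between the window portrait X and the model
    portrait Y, filling matrix D forward as in equations (4)-(5) under rules 1-3."""
    m = Y.shape[0]                      # both portraits are assumed to have m lines
    D = np.full((m, m), np.inf)
    # Stage 1, equation (4): Y(1) may correspond to X(1)..X(c).
    for j in range(min(c, m)):
        D[0, j] = line_distance(X[j], Y[0])
    # Later stages, equation (5): if Y(k-1) matched X(i), then Y(k) may match
    # X(i)..X(i+c-1); overlapping fills keep the minimum value.
    for k in range(1, m):
        for i in range(m):
            if np.isinf(D[k - 1, i]):
                continue
            for j in range(i, min(i + c, m)):
                cand = D[k - 1, i] + line_distance(X[j], Y[k])
                if cand < D[k, j]:
                    D[k, j] = cand
    # The minimal element of the last line / last column gives the distance;
    # the functional rho applies a square root to the accumulated sum.
    return float(np.sqrt(min(D[m - 1, :].min(), D[:, m - 1].min())))
```

Sliding such a window along the portrait of the analyzed speech segment and repeating the computation for every model unit would yield the distance arrays that are compared in Section 3.1.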
4. Experiments

The suggested algorithm for determining speech unit boundaries in continuous speech was tested experimentally. Figure 4 shows the speech unit boundaries in an utterance containing the pronunciation of the word «основного» / "osnovnogo". For example, the pronunciation interval of the speech unit [a], which starts the word «основного» / "osnovnogo", was correctly determined within the range from 800 to 4800 digital readouts of the speech signal; the speech unit [s] in the range from 2400 to 5600 readouts; [n] from 5600 to 9200 readouts; [a] from 9600 to 11200 readouts; [v] from 11200 to 16000 readouts; [n] from 16000 to 17200 readouts; ["o] from 17200 to 26400 readouts; [v] from 26400 to 28000 readouts; and the last speech unit of the analyzed signal, [a], from 28000 readouts up to the end of the signal. Comparison with expert-labeled boundaries was not made; however, visual comparison of the determined boundaries with the real ones shows their closeness. The experiments demonstrate the practical applicability of the algorithm for determining speech unit boundaries in continuous speech.

Fig. 4. Speech unit boundaries in continuous speech containing the pronunciation of the word «основного» / "osnovnogo".

5. Conclusion

The determined speech unit boundaries are to be used for a more detailed analysis of the speech signal in order to identify the speech units. To solve this problem, the authors also intend to transform speech signals into ACPs; however, the parameters of the transformation into ACPs and the method of portrait alignment will be different.

Acknowledgements

The work was supported by grants 16-48-732046 and 16-48-730305 from the Russian Foundation for Basic Research.

References

[1] Galounov VI, Heuvel H, Kochanina JL, Ostroukhov AV, Tropf H, Vorontsova AV. Speech database for the Russian language. Proceedings of the International Workshop SPECOM, 1998.
[2] Michael P, Rasanen O, Thiollière R, Dupoux E. Improving phoneme segmentation with recurrent neural networks. Computation and Language, 2016; preprint arXiv:1608.00508.
[3] Rabiner LR, Schafer RV. Digital processing of speech signals. Edited by M.V. Nazarov and Yu.N. Prokhorov. Moscow: Radio i svyaz', 1981; 496 p. (in Russian)
[4] Goldenthal W. Statistical trajectory models for phonetic recognition. PhD thesis. M.I.T., 1994; 170 p.
[5] Ostendorf M, Roukos SA. A stochastic segment model for phoneme-based continuous speech recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing 1989; 37(12): 1857-1869.
[6] Therrien C, Tummala M. Probability and Random Processes for Electrical and Computer Engineers. CRC Press, 2012; 287 p.
[7] Amirgaliyev Y, Mussabayev T. The speech signal segmentation algorithm using pitch synchronous analysis. Open Computer Science 2017; 7: 1-8.
[8] Krasheninnikov VR, Armer AI, Krasheninnikova NA, Kuznetsov VV, Khvostov AV. Some problems connected with speech command recognition on the background of intense noise. Infokommunikatsionnye tekhnologii (Samara) 2008; 1: 72-75. (in Russian)
[9] Bellman R. Dynamic Programming. Moscow: IL, 1960; 400 p. (in Russian)
[10] Krasheninnikov VR, Armer AI, Kuznetsov VV. Autocorrelated images and search for distance between them in speech commands recognition. Pattern Recognition and Image Analysis 2008; 18(4): 663-666.
[11] Greibus M. Rule based speech signal segmentation. Journal of Telecommunications and Information Technology 2010; 4: 37-43.
[12] Krasheninnikov VR, Armer AI, Krasheninnikova NA, Khvostov AV. Speech command recognition on the background of intense noise using autocorrelated portraits. Naukojomkie tehnologii 2007; 8(9): 65-76. (in Russian)