Entropy-based detection of word boundaries in continuous speech

Andrey S. Karpov, Information Systems Technologies Dept., Stavropol, NCFU (andrey revol125@mail.ru)
Victoria I. Drozdova, Information Systems Technologies Dept., Stavropol, NCFU (victoria drozdova@rambler.ru)
Galina V. Shagrova, Information Systems Technologies Dept., Stavropol, NCFU (g shagrova@mail.ru)
Aleksey V. Shevchenko, Information Systems Technologies Dept., Stavropol, NCFU (luckyleo769@mail.ru)

Abstract

An algorithm for finding word boundaries in continuous speech is proposed, based on a method that uses the entropy of the speech signal. The proposed algorithm differs from the known ones in that the entropy of the speech signal is compared with the entropy threshold in two stages. The performance of the known and proposed algorithms is compared.

Copyright (c) by the paper's authors. Copying permitted for private and academic purposes.
In: Marco Schaerf, Massimo Mecella, Drozdova Viktoria Igorevna, Kalmykov Igor Anatolievich (eds.): Proceedings of REMS 2018 – Russian Federation & Europe Multidisciplinary Symposium on Computer Science and ICT, Stavropol – Dombay, Russia, 15–20 October 2018, published at http://ceur-ws.org

1 Introduction

Automatic speech recognition, especially in noisy environments, is a complex task. The most important step in automatic speech recognition is the correct determination of word boundaries in the speech stream. Even a slight improvement at the stage of delineating word boundaries significantly affects the performance of the entire speech recognition system. For the recognition of isolated words, this problem reduces to determining the boundaries of a single word correctly. For continuous speech the task is much more difficult, since the speech signal is a continuous stream without reliable pauses between words. The most promising approach uses the entropy of the speech signal to search for word boundaries [Koc15, Naz15]. The main feature of the entropy-based method is its low sensitivity to changes in the amplitude of the speech signal, which preserves more of the detailed information contained in the speech stream. For a speech recognition system to be effective, it must work satisfactorily when the incoming voice signal is noisy. Even in the presence of some noise in the analyzed speech signal, the entropy-based approach determines word boundaries with high accuracy.

2 Description of the object and methods of research

Determining the boundaries of words in a speech signal is a key aspect of the human speech recognition task, since at this stage speech data is separated from unwanted noise and speech artifacts (coughs, speech harmonics, microphone echo, etc.). The method based on the entropy of the speech signal gives good results in detecting word boundaries in the task of recognizing isolated commands [Alu14, Alu16, Boz11, Wah02]. The essence of the method is as follows. The input voice data is first processed with a band-pass filter. This filter removes the constant and low-frequency components of the background, as well as high-frequency noise and speech harmonics arising from the spectral properties of the vocal tract. The pre-processed speech is normalized so that the amplitude values of the signal lie in the range from -1 to 1. Then the normalized signal is divided into frames of approximately 25 milliseconds of speech; to avoid loss of information, adjacent frames overlap by 25-50%.
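A minimal sketch of this preprocessing step is given below (assumed Python with NumPy/SciPy; this is not the authors' implementation). The filter band edges, filter order, and overlap value are illustrative assumptions, since the paper does not specify them.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def preprocess(signal, fs, low_hz=100.0, high_hz=4000.0, frame_ms=25.0, overlap=0.5):
    """Band-pass filter, normalize to [-1, 1], split into overlapping ~25 ms frames."""
    # Band-pass filter: removes the DC/low-frequency background component and
    # high-frequency noise, as described in the text (band edges are assumed).
    b, a = butter(4, [low_hz / (fs / 2), high_hz / (fs / 2)], btype="band")
    filtered = filtfilt(b, a, signal)

    # Normalize so that amplitude values lie in the range from -1 to 1.
    normalized = filtered / np.max(np.abs(filtered))

    # Split into overlapping frames of approximately 25 ms.
    frame_len = int(fs * frame_ms / 1000.0)
    step = int(frame_len * (1.0 - overlap))
    frames = [normalized[i:i + frame_len]
              for i in range(0, len(normalized) - frame_len + 1, step)]
    return np.array(frames)
```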
The entropy of each frame of the analyzed sound sequence is then calculated:

H_j = -\sum_{i=1}^{N} p_i \ln(p_i)   (1)

where H_j (j = 1, 2, ..., m) is the entropy of the j-th frame, m is the number of frames, p_i is the probability of the i-th signal sample in the j-th frame, and N is the number of samples within the frame. As a result, an entropy profile ξ is obtained for the incoming speech signal, a histogram of the entropy values of all frames of the recognizable speech fragment:

\xi = [H_1, H_2, \ldots, H_m]   (2)

In the case of recognizing isolated commands, the entropy profile of the signal is used to calculate the entropy threshold γ:

\gamma = \frac{\max(\xi) - \min(\xi)}{2} + \mu \min(\xi), \quad \mu > 0   (3)

where µ is a noise ratio selected experimentally [Boz11]. In [Alu14, Alu16], however, the entropy threshold is not calculated but taken equal to a constant value, γ = 0.1; this approach does not give good results for noisy signals. After the threshold is determined, the entropy of each frame H_j is compared with the entropy threshold γ. Any value equal to or greater than the entropy threshold is considered speech, and anything less is silence or noise:

\xi_j = \begin{cases} H_j, & H_j \geq \gamma \\ 0, & H_j < \gamma \end{cases}   (4)

However, due to the vocal characteristics of the speech signal, the entropy may be too small in a region of the signal that actually carries speech information. Or, conversely, because of a momentary noise burst, a segment that carries no speech data may be recognized as speech [Naz15]. To avoid erroneous determination of word boundaries in the speech signal, the concepts of the minimum word length (k) and the minimum distance between words (δ) are introduced [Alu14, Alu16, Obi12]. Both quantities are measured in numbers of frames.

Figure 1: Relationship between speech segments

The first criterion is that each recognized speech segment (λ_i, λ_j) must have a certain minimum length, specified as a constant. That is, if λ_i < k and d_ij > δ, the i-th segment is discarded as a segment that does not contain voice information; likewise, if λ_j < k and d_ij > δ, the j-th segment is discarded. The second criterion is based on the minimum distance between words: two segments of the analyzed speech signal that are both classified as speech are combined into one if the distance between them (d_ij) is less than the specified number of frames. That is, if (λ_i or λ_j) > k and d_ij < δ, the two segments are combined into one. This approach detects the boundaries of isolated words well. To use this approach for the recognition of continuous speech, an algorithm is proposed in which the entropy threshold is determined by formula (5):

\gamma = \min(\xi) + (\max(\xi) - \min(\xi)) \cdot k   (5)

where k is a coefficient selected experimentally, and the word boundaries are determined in two stages. At each stage, the minimum distance between words (d_ij) is used.
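The following sketch (assumed Python/NumPy, not the authors' implementation) illustrates formulas (1), (4), and (5) together with the two segment criteria. The probabilities p_i are estimated here from an amplitude histogram of the frame, which is one common choice; the paper does not specify the estimator, so this, like the function names and parameter values, is an assumption.

```python
import numpy as np

def frame_entropy(frame, n_bins=50):
    # Estimate p_i from an amplitude histogram of the frame's samples
    # (assumed estimator), then apply formula (1).
    hist, _ = np.histogram(frame, bins=n_bins)
    p = hist / hist.sum()
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def entropy_threshold(profile, k=0.9):
    # Formula (5): gamma = min(xi) + (max(xi) - min(xi)) * k
    return profile.min() + (profile.max() - profile.min()) * k

def segment(profile, gamma, min_len, min_gap):
    # Mark frames whose entropy reaches the threshold as speech (formula (4)),
    # merge segments separated by fewer than min_gap frames (minimum distance
    # between words), and discard segments shorter than min_len frames
    # (minimum word length). Segments are (start, end) frame indices.
    speech = profile >= gamma
    segments, start = [], None
    for i, is_speech in enumerate(speech):
        if is_speech and start is None:
            start = i
        elif not is_speech and start is not None:
            segments.append((start, i))
            start = None
    if start is not None:
        segments.append((start, len(speech)))

    merged = []
    for seg in segments:
        if merged and seg[0] - merged[-1][1] < min_gap:
            merged[-1] = (merged[-1][0], seg[1])
        else:
            merged.append(seg)

    return [seg for seg in merged if seg[1] - seg[0] >= min_len]
```

Given frames from the preprocessing step, the profile is obtained as np.array([frame_entropy(f) for f in frames]). In the two-stage scheme described below, segment() would first be applied to the whole profile with the first-stage parameters, and then re-applied, with the second-stage parameters, to the entropy profile of each frame group obtained at the first stage.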
The result of the proposed algorithm is illustrated by the example of separating word boundaries in continuous speech: a speech signal containing the phrase "Dear passengers, please keep calm, the train will soon leave", pronounced in a woman's voice. The analyzed phrase, consisting of eight words, was recorded with a sampling frequency of 22 kHz, 2 channels (stereo), and 16 bits per sample. The duration of the speech signal was 6.583 seconds. The boundaries of the words of this phrase were determined in two stages. The minimum distance between words in the first stage was 12 frames, and k = 0.9. This means that all frame groups defined as speech but shorter than 12 frames are discarded as non-verbal data. The results of the first stage of the algorithm are shown in Figure 2.

Figure 2: Boundaries of frames with voice information, obtained at the first stage of the algorithm

As shown in Figure 2, the first stage of the algorithm produced three large groups of frames carrying voice information. At the second stage, only the frame groups formed after the first stage are considered. For them, the minimum distance between words was 3 frames, and k = 0.75. Let us consider the work of the second stage of the algorithm for the frame groups formed after the first stage (Figure 3). As can be seen from Figure 3, in the second stage the algorithm divided the first large group of frames into two smaller ones, which are separate words. Similarly, the second and third large groups were each divided into three smaller ones. As a result, the boundaries of all eight words were found.

An example comparing the work of the known and proposed algorithms is given for a speech signal containing the phrase "Today is good weather". The analyzed phrase was pronounced by a man and recorded with a sampling frequency of 16 kHz, 1 channel (mono), and 8 bits per sample. The duration of this phrase was 2.535 seconds. The results are shown in Figure 4. As can be seen from Figure 4, the known algorithm defines the entire phrase as one group of frames carrying information, whereas the proposed algorithm determines the boundaries of all three words quite accurately.

3 Summary

A new algorithm for determining word boundaries in continuous speech is proposed. It differs from the known algorithms based on calculating the entropy of the speech signal in that the separation of word boundaries is performed in two stages. At the first stage, a rough selection of large groups of frames containing verbal information is carried out. At the second stage, a more detailed segmentation of the speech fragments obtained at the first stage is performed. By using the method based on the entropy of the speech signal in speech recognition systems, much higher rates of word boundary detection can be achieved for both isolated and continuous speech.

References

[Alu14] D.Yu. Alunov. On Methods for Estimation of the Signal Parameters. Current Problems of Science and Education, No. 6, 2014.

[Alu16] D.Yu. Alunov, E.S. Sergeev, P.V. Pigachev, A.N. Mytnikov. Implementation of the algorithm for processing and recognizing speech. Modern High Technology, No. 3-2, pp. 225-230, 2016.

[Boz11] A.S. Bozhdai, P.A. Gudkov, A.A. Gudkov. Embedded identification system by voice biometric indicators. Open Education, No. 2-2, 2011.

[Koc15] A.V. Kochetkov, P.V. Fedotov. About various meanings of the concept "entropy". Internet-journal Naukovedenie, Vol. 6, 2015.

[Naz15] A.V. Nazarov, V.L. Yakimov, V.F. Avdeev. The algorithm for maximizing the entropy of the training sample and its use in the synthesis of forecast models for discrete states of nonlinear dynamical systems. Scientific Journal "Information Control Systems", Issue 2, St. Petersburg, 2015.

[Obi12] N. Obin, M. Liuni. On the generalization of Shannon entropy for speech recognition. IEEE Workshop on Spoken Language Technology, United States, 2012.
[Wah02] K. Waheed, K. Weaver, F.M. Salam. A robust algorithm for detecting speech segments using an entropic contrast. Midwest Symposium on Circuits and Systems, Vol. 3, 2002.

Figure 3: The result of the second stage of the algorithm for the first (a), second (b), and third (c) frame groups formed after the first stage of the algorithm

Figure 4: The result of the known (a) and proposed (b) algorithms for the phrase "Today is good weather"