Ways to increase the probability of correct recognition of noisy speech commands by their cross-correlation portraits

Ekaterina Galitskaya
Applied mathematics and computer science
Ulyanovsk State Technical University
Ulyanovsk, Russia
katrisa@yandex.ru

Viktor Krasheninnikov
Applied mathematics and computer science
Ulyanovsk State Technical University
Ulyanovsk, Russia
kvrulstu@mail.ru

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0)
Image Processing and Earth Remote Sensing
VI International Conference on "Information Technology and Nanotechnology" (ITNT-2020)

Abstract: The range of applications of voice information-control systems is expanding rapidly, and these systems require recognition of speech commands (SCs). Such recognition is very difficult in the presence of intense acoustic noise. We consider a method for recognizing noisy SCs by their cross-correlation portraits (CCPs), which is used for speaker-dependent recognition from a limited vocabulary of commands. In this method, SCs are converted into CCPs, which are special images. The probability of correct recognition depends directly on the choice of command standards. The standards should accurately represent the entire class of commands, so the library of standards is optimized. The standards are stored as CCPs. The SC to be recognized is converted into a CCP, and the closest portrait is found in the set of standard portraits. This requires sufficiently accurate matching of the portraits of the standard and of the recognized SC. Two methods are proposed for this purpose: phonemic alignment and variation of the boundaries of the SC, since its boundaries may be estimated too early or too late. Experiments showed that the proposed modification of the algorithm significantly increases the probability of correct recognition.

Keywords: speech command, recognition, standard, cross-correlation portrait

I. INTRODUCTION

At present, many technical systems cannot be controlled without a human operator, despite the significant progress of robotization. It is therefore desirable to ease the operator's work with a voice information management system (VIMS), which makes it possible to obtain information about the state of the system and to control it by SCs. This requires SC recognition. Many speech recognition systems have been developed to date for entering information into a computer, controlling robots, etc. [1-7]. However, most of these systems do not work in the presence of noise.

At the same time, there is a need for VIMS that operate under very strong acoustic noise, for example, in aviation, noisy production, etc. Installing a VIMS in the cockpit can help reduce the workload of the pilot. Honeywell has tested a VIMS on its Embraer 170 aircraft (the recognition accuracy of this VIMS is 90%) [8]. There are examples of VIMS use in the Eurofighter Typhoon military aircraft. Lockheed Martin also developed the F-35 cockpit with speech recognition. Airbus Defense and Space considered adding a cockpit assistance system with voice recognition technology to its recently developed Sferion helicopter [9]. The TOUCH-FLIGHT 2 project is exploring the use of voice control as an alternative mode of interaction between pilots and cockpit avionics [10]. These programs are developed for English [11-17], which makes them impossible to use at Russian facilities, where the pilots communicate in Russian. There are no open data on the use of Russian-language VIMS on aircraft. Thus, the problem of creating methods and algorithms for recognizing SCs in the presence of strong interference remains relevant.

Various features of speech signals and various techniques are used in recognition algorithms: spectral analysis, wavelets, hidden Markov chains, cepstral analysis, artificial neural networks, etc. Typically, SC recognition systems working under severe interference use a limited dictionary. Standards of the SCs from this dictionary are constructed, and the recognized SC is assigned to the closest of these standards. In this paper, we consider a method for recognizing highly noisy SCs by their cross-correlation portraits (CCPs). In this method, SCs are converted into CCPs, which are special images that reflect the acoustic features of the SCs [18-20]. This makes it possible to apply image processing methods to SC recognition. There is an extensive literature on image processing, for example, [21-25].

Standard SCs are stored in the computer memory in the form of CCPs (the standard CCPs). The recognized SC is also converted into a CCP, and the nearest standard CCP is found. The probability of correct recognition depends substantially on the choice of standard SCs. Therefore, the library of standards has to be optimized. Finding the nearest standard CCP requires sufficiently accurate matching of the portraits of the standard and of the recognized SC. Two methods of refining the alignment are proposed for this purpose: phonemic alignment and variation of the boundaries of the SC, since its boundaries may be estimated too early or too late. The experiments showed that the proposed modification of the method significantly increases the probability of correct recognition.

II. RECOGNITION OF SPEECH COMMANDS BY THEIR AUTOCORRELATION AND CROSS-CORRELATION PORTRAITS

The use of autocorrelation portraits (ACPs) for recognizing SCs against a background of strong noise was proposed in [18,19]. Let $X = \{x_1, x_2, \ldots, x_N\}$ be an SC consisting of $N$ values. The ACP of $X$ is a two-dimensional array (image) $R$. We divide $X$ into $M+1$ segments of length $L = [N/(M+1)]$, where $[u]$ is the integer part of the number $u$. Each row of $R$ is a sequence of sample correlation coefficients $r(t,k)$ of the segment $X_t = (x_{(t-1)L+1}, \ldots, x_{tL})$ and the segments $X_{t,k} = (x_{(t-1)L+1+k}, \ldots, x_{tL+k})$ shifted by $k$ samples relative to $X_t$:
$$r(t,k) = \frac{1}{L\,\sigma_t\,\sigma_{t+k}} \sum_{j=0}^{L-1} \left( x_{(t-1)L+j}\, x_{(t-1)L+j+k} - \mu_t\,\mu_{t+k} \right), \tag{1}$$

where $t = 1, \ldots, M$, $k = 1, \ldots, K$, $\mu_t$ and $\mu_{t+k}$ are the sample means, and $\sigma_t^2$ and $\sigma_{t+k}^2$ are the sample variances of $X_t$ and $X_{t,k}$. Thus, the ACP is an $M \times K$ array (image) of the sample autocorrelation coefficients.

Fig. 1 shows the ACPs of single pronunciations of the SCs "Cab" (Cabina) and "Engine" (Dvigatel) and of two pronunciations of "Air Conditioning" (Conditsioner) made at different times. Note that in this paper all SCs are spoken in Russian. There are $M = 100$ rows and $K = 50$ columns (i.e. shifts) in these ACPs. The range of the correlation coefficient $[-1; 1]$ is converted into the brightness range $[0; 255]$ in Fig. 1. An image row reflects the change of the correlation between the values of the speech signal at shifts of $k = 1, \ldots, K$ samples, that is, local correlations. The sequence of rows reflects how the correlations change over time; for example, it characterizes the sequence of phonemes.

Fig. 1. Examples of autocorrelation portraits of speech commands: (a) “Cab”; (b) “Engine”; (c) “Air Conditioning 1”; (d) “Air Conditioning 2”.

It turned out that the ACPs are individual, resistant to noise and weakly sensitive to the pronunciation volume. The main advantage of ACPs for SC recognition is the strong correlation between rows, which makes it possible to use image processing methods for filtering, recognition, etc. The standard SCs are stored in the computer's memory as ACPs. The recognized SC is also converted into an ACP. This ACP is assigned to the nearest of the standard ACPs according to some metric. The distance between two ACPs is defined as the sum of the distances between the corresponding rows. Any metric that defines the distance between two rows as vectors can be used: Euclidean, squared, angle between the vectors, etc.

When constructing the ACP, the SC is divided into $M+1$ segments. Each segment contains some part of the SC. Due to the variability of the pronunciation rate, the same phonemes of an SC can have different row numbers in the ACPs of the standard and of the recognized SC. As a result, the distance between these portraits will be distorted. Therefore, the rows of the portraits have to be matched. A dynamic programming algorithm was used for this matching, with the minimum distance between the matched portraits as the criterion.

However, there is a significant drawback: an ACP reflects the features of only one pronunciation of an SC. This is noticeable in the two portraits "Air Conditioning 1" and "Air Conditioning 2" in Fig. 1, built from pronunciations obtained at different times. During this time, the voice timbre of the speaker, his health status, etc. could change significantly. The standards appeared to be “aging”, so the ACPs of the standard SCs and the portraits of the same recognized SC could differ significantly, which reduced the quality of recognition. Therefore, the standards need to be updated from time to time.

The properties of SCs are represented more completely by their CCPs, which are built from two pronunciations [20]. Let $X$ and $Y$ be two pronunciations of the same SC by one speaker at different times. They are divided into the same number $M+1$ of segments with lengths $L_X$ and $L_Y$, respectively. Each CCP row is a sequence of sample correlation coefficients $r(t,k)$ of the segment $X_t = (x_{(t-1)L_X+1}, \ldots, x_{tL_X})$ of the SC $X$ with the segments $Y_{t,k} = (y_{(t-1)L_Y+1+k}, \ldots, y_{tL_Y+k})$ of the SC $Y$:

$$r(t,k) = \frac{1}{L_X\,\sigma_{X,t}\,\sigma_{Y,t+k}} \sum_{j=0}^{L_X-1} \left( x_{(t-1)L_X+j}\, y_{(t-1)L_Y+j+k} - \mu_{X,t}\,\mu_{Y,t+k} \right), \tag{2}$$

where $t = 1, \ldots, M$, $k = 1, \ldots, K$, $\mu_{X,t}$ and $\mu_{Y,t+k}$ are the sample means, and $\sigma_{X,t}^2$ and $\sigma_{Y,t+k}^2$ are the corresponding sample variances. Thus, the CCP is an $M \times K$ array (image) of sample cross-correlation coefficients of the two SCs $X$ and $Y$. If $Y = X$, then the CCP coincides with the ACP.

Fig. 2 shows the CCPs of SCs built from two of their pronunciations, with $M = 100$ segments (i.e. rows) and $K = 50$ shifts (i.e. columns). It is noticeable that the CCPs of the various SCs are individual, which makes them a good basis for recognition. At the same time, they reflect the variability of pronunciation to a greater extent, as they are built from two pronunciations, which should preferably be recorded at different times. It is noticeable that the CCPs "Air Conditioning 1 + Air Conditioning 3" and "Air Conditioning 2 + Air Conditioning 3" in Fig. 2 differ less than the ACPs "Air Conditioning 1" and "Air Conditioning 2" in Fig. 1.

The standard SCs are stored in the computer's memory as CCPs. The recognized SC is also converted into a CCP, paired with some previously recorded pronunciation, for example, one from the standards. This CCP is assigned to the nearest of the standard CCPs according to some metric. The distance between two CCPs is defined as the sum of the distances between the corresponding rows, as in the ACP case.
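The portrait construction of Eq. (2) and the row-wise distance between portraits can be illustrated with a minimal sketch. It is not the authors' implementation: the function names are hypothetical, the Euclidean metric is just one of the row metrics the text allows, treating a constant segment as uncorrelated is an assumption, and the dynamic programming row matching is omitted.

```python
import math


def correlation(a, b):
    """Sample correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b)) / n
    var_a = sum((u - mean_a) ** 2 for u in a) / n
    var_b = sum((v - mean_b) ** 2 for v in b) / n
    if var_a == 0.0 or var_b == 0.0:
        return 0.0  # assumption: a constant segment is treated as uncorrelated
    return cov / math.sqrt(var_a * var_b)


def ccp(x, y, M, K):
    """M x K cross-correlation portrait of x and y; ccp(x, x, M, K) is the ACP."""
    Lx = len(x) // (M + 1)  # segment length L_X
    Ly = len(y) // (M + 1)  # segment length L_Y
    if Lx < 2 or (M - 1) * Ly + K + Lx > len(y):
        raise ValueError("signals are too short for the requested M and K")
    portrait = []
    for t in range(M):  # 0-based row index, row t + 1 in the text
        seg_x = x[t * Lx:(t + 1) * Lx]
        row = []
        for k in range(1, K + 1):
            start = t * Ly + k  # segment of y shifted by k samples
            row.append(correlation(seg_x, y[start:start + Lx]))
        portrait.append(row)
    return portrait


def portrait_distance(p, q):
    """Sum of Euclidean distances between corresponding rows."""
    return sum(
        math.sqrt(sum((u - v) ** 2 for u, v in zip(row_p, row_q)))
        for row_p, row_q in zip(p, q)
    )


def nearest_standard(portrait, standards):
    """Label of the standard portrait closest to the given portrait."""
    return min(standards, key=lambda name: portrait_distance(portrait, standards[name]))


# Toy usage with synthetic tones standing in for recorded commands.
tone_a = [math.sin(0.3 * i) for i in range(1100)]
tone_b = [math.sin(0.9 * i) for i in range(1100)]
library = {"a": ccp(tone_a, tone_a, 10, 20), "b": ccp(tone_b, tone_b, 10, 20)}
print(nearest_standard(ccp(tone_a, tone_a, 10, 20), library))
```

Passing the same signal twice yields the ACP, which mirrors the remark that the CCP coincides with the ACP when $Y = X$.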
III. METHODS TO INCREASE THE PROBABILITY OF CORRECT COMMANDS RECOGNITION

The described recognition method gives almost error-free recognition in the absence of noise. The presence of strong noise significantly reduces the recognition quality for a number of reasons. Let us consider some of the interfering factors and methods to reduce their influence. Some of these methods were applied to improve the recognition of SCs by their ACPs [18,19,26,27].

Fig. 2. Examples of cross-correlation portraits of speech commands: (a) “Cab 1 + Cab 2”; (b) “Engine 1 + Engine 2”; (c) “Air Conditioning 1 + Air Conditioning 2”; (d) “Air Conditioning 2 + Air Conditioning 3”.

The varying of the recognized SC boundaries. To compare the standards with the recognized command $P$, it is first necessary to determine its beginning (start) and ending (end). Because of strong noise, errors are inevitable: a boundary may be detected too early or too late. It is especially difficult to find the ending of an SC, as it is usually pronounced more quietly than the beginning. To mitigate the influence of these errors, trial additions and deletions of $t$ samples of the signal at the estimated boundaries were applied. The value of the parameter $t$ was chosen empirically, taking into account that too large a value can change the command itself. In the process of recognition, the command $P(\mathrm{start}, \mathrm{end})$ is converted into 9 commands: $P(\mathrm{start}, \mathrm{end})$, $P(\mathrm{start}-t, \mathrm{end})$, $P(\mathrm{start}+t, \mathrm{end})$, $P(\mathrm{start}, \mathrm{end}-t)$, $P(\mathrm{start}-t, \mathrm{end}-t)$, $P(\mathrm{start}+t, \mathrm{end}-t)$, $P(\mathrm{start}, \mathrm{end}+t)$, $P(\mathrm{start}-t, \mathrm{end}+t)$, $P(\mathrm{start}+t, \mathrm{end}+t)$, where “start” and “end” are the estimated bounds of the command. For each of the 9 variants of the command, its own CCP is built. The variant that has the smallest distance to the standard CCPs is taken as the true CCP of the recognized SC.

The CCPs width optimization. The width $K$ of the CCP (the number of columns in the portrait) is chosen empirically. However, practice has shown that the optimal value of the parameter $K$ depends on the length of the SC. Therefore, all dictionary commands were divided into groups of approximately the same length, and each group used its own value of this parameter.

The phonemes matching. When building CCPs, the SCs are divided into $M+1$ segments. Each segment contains some part of a phoneme. Due to the variability of the pronunciation rate, the corresponding segments of the two SCs can begin with different phonemes, so the correlation coefficient can have a “false” value and the CCP will be distorted. To avoid this distortion, a dynamic phoneme matching algorithm was used. As a result, the beginning of the segment of one SC is shifted so that this segment is maximally correlated with the segment of the second SC.

Optimization of the standards library. The quality of recognition depends directly on how well the standard CCPs represent the features of command pronunciation. This raises the additional problem of choosing the "best" standards. To solve it, several standards of each command are first built, and a directed search is then applied to obtain the best library of standard CCPs [26]. This operation benefits from a large number of pronunciations of each SC, which costs the speakers a great deal of time. In [25,28], methods are described for obtaining realizations of quasiperiodic processes in the form of autoregressive models of cylindrical images. Phonemes of speech signals are also quasiperiodic processes, which made it possible to simulate many variants of pronouncing an SC even from a single real pronunciation by a speaker.

The noise adding to the standards. The standards are usually built in advance from pronunciations recorded in the absence of noise. The recognized SC contains significant noise; therefore, its CCP inevitably differs from the standard CCPs. As a result, the distances between the CCPs are distorted and the quality of recognition is reduced. To correct the distances, noise was added to the standard SCs before their conversion into CCPs. The noise for the standards came from an additional microphone placed far from the operator’s mouth and was recorded while the recognized SC was being pronounced, which ensured the similarity of the noise characteristics in the compared CCPs. The disadvantage of this method is that all noisy standards have to be recalculated for each incoming recognized SC.
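The variation of the command boundaries described in this section can be sketched as follows. This is only an illustration under stated assumptions: `boundary_variants`, `best_variant` and the `distance_to_standards` callback are hypothetical names, the callback stands in for building the CCP of a trimmed signal and measuring its distance to the standard CCPs, and clamping the boundaries to the recording is an added assumption.

```python
def boundary_variants(start, end, t):
    """All 9 (start, end) pairs with each boundary moved by 0, -t or +t samples."""
    return [(start + ds, end + de) for de in (0, -t, t) for ds in (0, -t, t)]


def best_variant(signal, start, end, t, distance_to_standards):
    """Return (distance, start, end) of the variant closest to the standards.

    distance_to_standards is a hypothetical callback: it receives the trimmed
    signal and returns its smallest distance to the standard portraits.
    """
    best = None
    for s, e in boundary_variants(start, end, t):
        s = max(0, s)            # assumption: clamp to the recorded signal
        e = min(len(signal), e)
        if e <= s:
            continue             # skip variants that leave no samples
        d = distance_to_standards(signal[s:e])
        if best is None or d < best[0]:
            best = (d, s, e)
    return best


# Toy usage: the command truly occupies samples 100..209, but the detector
# reported start=110 (too late) and end=200 (too early).
recording = list(range(300))
toy_distance = lambda seg: abs(seg[0] - 100) + abs(len(seg) - 110)
print(best_variant(recording, 110, 200, 10, toy_distance))
```

In a real recognizer the callback would be the expensive part, since every incoming command requires nine portraits instead of one.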
IV. THE RESULTS OF THE EXPERIMENTS

The following experiment was conducted to assess how much the considered methods increase the probability of correct recognition. The dictionary consisted of 41 SCs on aviation topics. It was divided into 4 groups, containing 10, 5, 8, and 19 SCs, respectively. Each SC was pronounced 30 times (in total, 1230 SCs participated in recognition). The SCs were additively corrupted by the noise of an aircraft engine with a signal-to-noise ratio of 4. When constructing the CCPs, the first two pronunciations were chosen as the standard ones. Without the methods described above, 158 SCs were not recognized. With the methods described above, 67 of these unrecognized SCs were recognized. At the same time, the SCs recognized correctly in the first case were also recognized by the improved method. As a result, the probability of correct recognition increased from 87% (1072 of 1230 SCs) to 93% (1139 of 1230 SCs); significance was tested by Student's criterion with a significance level of 0.05.

V. CONCLUSIONS

The paper proposes converting SCs into CCPs for command recognition against a background of strong noise. The CCP of two SCs is a two-dimensional image whose rows consist of cross-correlation coefficients between these SCs. The use of two pronunciations in the CCP makes it possible to take into account the variability of pronunciations. A standard CCP is constructed for each SC. Recognition is carried out by comparing the CCP of the recognized SC with the standard CCPs. The experiments showed that the use of several modifications of this method significantly increases the probability of correct recognition.
ACKNOWLEDGMENT

The reported study was funded by the RFBR, project number 20-01-00613.

REFERENCES

[1] A. Zhdanov, “Speech input as an alternative to keyboard input,” 2020 [Online]. URL: https://compress.ru/article.aspx?id=11907.
[2] M.V. Mikhaylyuk, “Ergonomic voice control interface for anthropomorphic robot,” 2020 [Online]. URL: https://cyberleninka.ru/article/n/ergonomichnyy-golosovoy-interfeys-upravleniya-antropomorfnym-robotom/viewer.
[3] A. Gerasimov, “Smart home from Apple, Google and Yandex - voice control,” 2020 [Online]. URL: https://voiceapp.ru/articles/smarthome.
[4] SpeechKit - Yandex speech technology, 2020 [Online]. URL: https://yandex.ru/company/technologies/speech_technologies/.
[5] D. Geer, “5 impacts of speech recognition system in various fields,” 2020 [Online]. URL: https://thenextweb.com/contributors/2017/09/05/5-impacts-speech-recognition-system-various-fields/.
[6] S. Rustamov, E. Gasimov, R. Hasanov, S. Jahangirli, E. Mustafayev and D. Usikov, “Speech recognition in flight simulator,” 2020 [Online]. URL: https://www.researchgate.net/publication/329485063_Speech_recognition_in_flight_simulator.
[7] 8 Innovative Ways to Use Speech Recognition for Business, 2020 [Online]. URL: https://www.transcribeme.com/blog/8-innovative-ways-to-use-speech-recognition-for-business.
[8] L. Savvides, “Hey Siri, take off! Get ready for more-advanced planes,” 2020 [Online]. URL: https://www.cnet.com/news/honeywell-tests-gear-for-even-more-high-tech-planes/.
[9] Woodrow Bellamy III. Rockwell Collins Rapidly Advancing Cockpit Voice Recognition Technology, 2020 [Online]. URL: https://www.aviationtoday.com/2014/11/13/rockwell-collins-rapidly-advancing-cockpit-voice-recognition-technology/.
[10] J. Gauci, “Aircraft control through the use voice commands,” 2020 [Online]. URL: https://www.um.edu.mt/newspoint/news/features/2019/07/aircraftcontrolthroughtheuseofvoicecommands.
[11] R. Crist, “Talk to your house with these voice-activated smart-home systems,” 2020 [Online]. URL: https://www.cnet.com/news/talk-to-your-house-with-these-voice-activated-smart-home-systems/.
[12] Talking to Your With Telligence Voice Control, 2020 [Online]. URL: https://www.transcribeme.com/blog/8-innovative-ways-to-use-speech-recognition-for-business.
[13] Speech Recognition Interfaces Improve Flight Safety, 2020 [Online]. URL: https://spinoff.nasa.gov/Spinoff2012/t_4.html.
[14] Pilot Speech Recognition, 2020 [Online]. URL: http://www.voiceflight.com/.
[15] N. McKeegan, “Speech recognition technology allows voice control of aircraft systems,” 2020 [Online]. URL: https://newatlas.com/speech-recognition-technology-allows-voice-control-of-aircraft-systems/7484/.
[16] M. Peck, “Fly by Voice,” 2020 [Online]. URL: https://aerospaceamerica.aiaa.org/departments/fly-by-voice/.
[17] Speech recognition technology for air traffic controllers, 2020 [Online]. URL: https://www.internationalairportreview.com/news/75900/voice-recognition-air-traffic/.
[18] V.R. Krasheninnikov, A.I. Armer, N.A. Krasheninnikova and A.V. Hvostov, “Recognition of noisy speech command by autocorrelation portraits,” Naukoemkie tekhnologii, vol. 9, pp. 65-74, 2007.
[19] V.R. Krasheninnikov, A.I. Armer, V.V. Kuznetsov and E.Yu. Lebedeva, “Cross-Correlation Portraits of Voice Signals in the Problem of Recognizing Voice Commands According to Patterns,” Pattern Recognition and Image Analysis, vol. 21, no. 2, pp. 185-187, 2011.
[20] V.A. Soifer, S.B. Popov, V.V. Mysnikov and V.V. Sergeev, “Computer image processing. Part I: Basic concepts and theory,” VDM Verlag, Dr. Muller, 2009.
[21] R.C. Gonzalez and R.E. Woods, “Digital image processing,” Pearson, Prentice-Hall, New York, 2017.
[22] R.G. Magdeev and A.G. Tashlinskii, “Efficiency of object identification for binary images,” Computer Optics, vol. 43, no. 2, pp. 277-281, 2019. DOI: 10.18287/2412-6179-2019-43-2-277-281.
[23] V.V. Myasnikov, “Description of images using a configuration equivalence relation,” Computer Optics, vol. 42, no. 6, pp. 998-1007, 2018. DOI: 10.18287/2412-6179-2018-42-6-998-1007.
[24] V.R. Krasheninnikov and K.K. Vasil’ev, “Multidimensional Image Models and Processing,” Computer Vision in Control Systems-3. Intelligent Systems Reference Library 135, Springer International Publishing, pp. 11-64, 2018.
[25] V.R. Krasheninnikov, N.A. Krasheninnikova, V.V. Kuznetsov and E.Yu. Lebedeva, “Optimization of dictionary and model library for recognition of speech commands,” Pattern Recognition and Image Analysis, vol. 21, no. 3, pp. 505-507, 2011.
[26] V.R. Krasheninnikov, A.V. Khvostov and A.I. Armer, “Preparation of Templates in Speech Command Recognition by Single- and Double-Channel Scheme in Background Noise,” Pattern Recognition and Image Analysis, vol. 18, no. 4, pp. 580-583, 2008.
[27] V.R. Krasheninnikov, A.I. Armer, N.A. Krasheninnikova, V.R. Derevyankin, V.I. Kozhevnikov and N.N. Makarov, “Autoregressive Models of Speech Signal Variability in the Speech Commands Statistical Distinction,” International Conference on Computational Science and its Applications, Springer-Verlag: Berlin Heidelberg, pp. 974-982, 2006.