Method of Remote Biometric Identification of a Person by Voice Based on Wavelet Packet Transform

Oleksandr Lavrynenko, Bohdan Chumachenko, Maksym Zaliskyi, Serhii Chumachenko, and Denys Bakhtiiarov
National Aviation University, 1 Lubomyr Huzar ave., Kyiv, 03058, Ukraine

Abstract
In this research, the task of extracting speech signal recognition features for remote voice identification of a person was solved. The remote setting imposes several restrictions, namely: (1) minimum processing time of the speech signal realization, since the required recognition reliability is achieved through statistical processing of the results; (2) reduced dimensionality of the recognition features, since feature extraction and classification take place on the transmitting side of the communication channel, which in turn imposes constraints of computing power and noise in the communication channel. After analyzing these conditions of the voice identification system, the question arose of developing a method for extracting speech signal recognition features that would provide more informative spectral characteristics of the speech signal and thereby improve the efficiency of their subsequent classification under the influence of noise. In this paper, we consider the possibility of applying the theory of time-scale analysis to this problem, namely, the development of a method for extracting recognition features based on the wavelet packet transform using the orthogonal Meyer basis wavelet function and subsequent averaging of the wavelet coefficients that fall in the frequency band of the corresponding wavelet packet.
Experimental studies have shown the ability of the developed method to generate speech signal recognition features with a close frequency-temporal structure based on wavelet packets in the Meyer basis. In particular, it was found that at a signal-to-noise ratio of 10 dB the features obtained with the developed method give a very acceptable result, being 1.6–2 times more robust to noise than the features obtained from the traditional Fourier spectrum, for which the total root-mean-square deviation of the features is already unacceptable at a signal-to-noise ratio of 20 dB.

Keywords
speech signal, recognition features, wavelet transform, Meyer wavelet function, spectral analysis, voice identification, biometric authentication

CPITS-2024: Cybersecurity Providing in Information and Telecommunication Systems, February 28, 2024, Kyiv, Ukraine
EMAIL: oleksandrlavrynenko@gmail.com (O. Lavrynenko); bohdan.chumachenko@npp.nau.edu.ua (B. Chumachenko); maksym.zaliskyi@npp.nau.edu.ua (M. Zaliskyi); serhii.chumachenko@npp.nau.edu.ua (S. Chumachenko); bakhtiiaroff@tks.nau.edu.ua (D. Bakhtiiarov)
ORCID: 0000-0002-7738-161X (O. Lavrynenko); 0000-0002-0354-2206 (B. Chumachenko); 0000-0002-1535-4384 (M. Zaliskyi); 0009-0003-8755-5286 (S. Chumachenko); 0000-0003-3298-4641 (D. Bakhtiiarov)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org, ISSN 1613-0073)

1. Introduction

The development of new methods and means of ensuring information security is intended primarily to prevent threats of access to information resources by unauthorized persons. To solve this problem, it is necessary to have identifiers and to create identification procedures for all users. Modern identification and authentication include various systems and methods of biometric identification [1]. The development of identification systems based on biometric measurements brings a whole range of advantages: such systems are more reliable because biometric indicators are more difficult to fake; modern microprocessor technology makes biometric methods more convenient than conventional identification methods; and, finally, biometric measurements are much easier to automate [2–6].

One of the most common biometric characteristics of a person is his or her voice, which has a set of individual characteristics that are relatively easy to measure (for example, the frequency spectrum of the speech signal). The advantages of voice identification also include ease of application and use and the fairly low cost of the devices used for identification (e.g., microphones) [7].

Voice identification covers a very wide range of tasks, which distinguishes it from other biometric systems. First of all, voice identification has long been widely used in various systems for differentiating access to physical objects and information resources. Its new application in remote voice identification systems, where a person is identified through a telecommunications channel, seems promising. For example, in mobile communications, voice can be used to manage services, and the introduction of voice identification helps protect against fraud [8].

Voice identification is of particular importance in the investigation of crimes, in particular in the field of computer information, and in the formation of the evidence base for such an investigation. In these cases, it is often necessary to identify an unknown voice recording. Voice identification is an important practical task when searching for a suspect based on a voice recording in telecommunication channels. Determining such characteristics of the speaker's voice as gender, age, nationality, dialect, and emotional coloring of speech is also important in forensics and anti-terrorism. The identification results matter in phonoscopic examinations and in expert forensic research based on the theory of forensic identification [9].

Voice identification in real-world environments faces the following serious challenges. First, such identification is subject to all kinds of hardware distortions and noise caused by the peculiarities of the equipment and devices used for recording, processing, and storing information. Second, external acoustic noise inevitably superimposes itself on the speech signal, which can significantly distort individual informative characteristics. Given this, identification systems that have demonstrated fairly high efficiency in laboratory conditions may show much lower reliability when analyzing speech information contaminated by external noise. Finally, in several tasks, identification has to be performed under the very difficult conditions of overlapping voices of several speakers, in particular with similar acoustic characteristics. It should be noted that there have been virtually no studies of voice identification capabilities for this most difficult case [10].

Voice identification involves a set of technical, algorithmic, and mathematical methods that cover all stages, from voice recording to voice data classification. The difficulties and shortcomings discussed above lead to the conclusion that further development of voice identification systems requires new approaches aimed at processing large arrays of experimental speech signals, their effective analysis, and reliable classification. This indicates the relevance of research on new mathematical methods for processing, analyzing, and classifying voice data that would ensure the reliability and accuracy of person identification [11].

Traditionally, the methods of practical interest for speech signal recognition are those that provide the required level of classification reliability under the given conditions. Until recently, the dominant approach to the construction of biometric voice identification devices was to impose no restrictions on the processing time of the speech signal, since the required recognition reliability was achieved by statistical processing of the obtained results as well as by increasing the dimensionality of the recognition features; as a rule, the extraction of recognition features and their classification took place on the transmitting side of the communication channel.

However, in the case of remote voice identification in modern mobile radio communication systems, it is difficult to ensure these conditions, since the identification of a person is carried out on the receiving side, which in turn imposes constraints of computing power and the influence of noise in the communication channel. An additional requirement is often the need to make a classification decision in a time-sensitive environment [12].

In this case, it is necessary to move to other methods that can provide the necessary contrast of the speech signal in the formed recognition features under the specified conditions, namely, to ensure the quality of speech signal feature extraction under the influence of noise in the communication channel. This would allow voice identification technologies to be used in a remote mode over modern mobile radio communication systems, significantly expanding the scope of this type of technology. In this paper, we consider the possibility of applying the theory of time-scale analysis to solve this problem [13].

2. Literature Analysis and Problem Statement

In general, recognition is the process of assigning the object under study, in this case a speech signal represented by a set of observations, to one of several alternative classes. The assignment of an object to a class is based on the existing differences in some ordered set of recognition features [14]. Traditionally, these features are formed from such parameters of the speech signal as the duration of the modulating function elements, the number of signal envelope extrema, statistical characteristics of the number of zero-level crossings, and the higher-order moments of the spectrum shape obtained from the observations. The set of observations is then represented in the form of a matrix

X_{pn} = [ x_{11} x_{12} … x_{1i} … x_{1n}
           x_{21} x_{22} … x_{2i} … x_{2n}
           …      …      …  …     …  …
           x_{p1} x_{p2} … x_{pi} … x_{pn} ],

where n is the number of observations used for recognition, and each column X_i = (x_{1i}, x_{2i}, …, x_{pi})^T, i = 1, 2, …, n, of the matrix X_{pn} is a p-dimensional vector of observed values of the p features X_1, X_2, …, X_p that reflect the properties of the objects most important for recognition. The set of p features is, as a rule, the same for all recognition classes s_1, s_2, …, s_k [15].

Thus, we consider the task of deciding to which of a finite number of classes s_1, s_2, …, s_k, described by the common set of features X_1, X_2, …, X_p, the object under study belongs. Differences between classes will be manifested only in differences in the characteristics of the features of different objects. Then, for any set of features X_1, X_2, …, X_p, rules can be set according to which any two classes s_l and s_r are assigned a vector

D_{lr} = (d_{1lr}, …, d_{qlr})^T,

which consists of q parameters called interclass distances that express the degree of difference in the characteristics of the recognition features [16].

An integral part of the speech signal recognition process is the definition of the feature set X_1, X_2, …, X_p, i.e., the formation of recognition features in such a way as to ensure the required classification reliability with the minimum possible dimension p. In the considered approach to the problem of speech signal recognition, an important point is the choice of a method for forming the recognition features. The use of approaches based on traditional Fourier spectral-time analysis for this purpose involves certain difficulties: first, high requirements on the input speech signal stream in terms of signal-to-noise ratio; second, insufficient classification reliability for multicomponent and weakly stationary signals such as speech; and third, the need for a significant number of realizations. The desire to overcome these limitations within the framework of classical spectral signal processing leads to hard-to-implement variants of speech signal recognition devices and to solutions that are unacceptable under the conditions considered [17].

Thus, we formulate the research objective: to develop a method that allows the formation of contrasting recognition features for automatic remote identification of a person by voice under restrictions on the duration of the processed realization, at a signal-to-noise ratio of less than 20 dB, and under partial or complete a priori uncertainty about the signal structure.
3. Proposed Method

Currently, methods of processing and analyzing speech signals based on wavelet transforms are widely used. The essence of these transformations is to decompose the input signal over a system of basis wavelets, functions each of which is a shifted and scaled copy of a single generating (mother) wavelet. A characteristic property of wavelet functions (hereinafter, wavelets) is finite energy together with localization in both the frequency and time domains.

Thus, any sequence of discrete samples of the speech signal S(t_i) can be represented as an ordered set of coefficients of its decomposition over a system of scaling functions and wavelet functions:

S(t_i) = Σ_{k=1}^{2^(N−M)} V_{M,k} φ_{M,k}(t_i) + Σ_{m=1}^{M} Σ_{k=1}^{2^(N−m)} W_{m,k} ψ_{m,k}(t_i),

where M is the number of decomposition levels, and V_{m,k} and W_{m,k} are the approximating and detailing wavelet decomposition coefficients [18].

The scaling functions and wavelet functions are defined by the theory of multiple-scale analysis:

φ_{m,k}(t) = √(2^m) φ(2^m t − k),   (1)

ψ_{m,k}(t) = √(2^m) ψ(2^m t − k).   (2)

Here, in (1) and (2), √(2^m) is the normalizing factor, k = 0, ±1, ±2, …, and m ∈ Z.

In practice, to quickly calculate the values of the wavelet coefficients V_{m,k} and W_{m,k}, a sequential separation scheme called the pyramid, or Mallat, algorithm is used, which is interpreted as sequential two-band filtering of the input speech signal by cascaded low-pass (H) and high-pass (G) filter blocks (Fig. 1) [19].

Figure 1: Scheme of signal sequence decomposition according to the Mallat algorithm

In Fig. 1, for the wavelet coefficients V_{m,k} and W_{m,k}, the first index m corresponds to the number of the decomposition level, and the second index k = 0, 1, …, 2^m − 1 corresponds to the ordinal number of the wavelet coefficient at decomposition level m. According to the theory of multiple-scale analysis, the values of V_{m,k} and W_{m,k} can be obtained from the coefficients calculated at the previous stages of the speech signal decomposition:

V_{m,k} = (1/√2) Σ_n V_{m−1,n} h_{n+2k},

W_{m,k} = (1/√2) Σ_n V_{m−1,n} g_{n+2k},

where h_m and g_m are the sequences that define the characteristics of the filters H and G at level m of the wavelet decomposition [20].
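A single level of the pyramid filtering just described can be sketched as follows. The Haar filter pair is used only as a stand-in for whatever basis is actually chosen (an assumption for brevity), and circular indexing replaces proper boundary handling:

```python
import numpy as np

def mallat_step(v_prev, h, g):
    """One level of the pyramid (Mallat) algorithm: correlate the previous
    approximation with the low-pass (h) and high-pass (g) filters and keep
    every second output (circular indexing used for simplicity)."""
    n = len(v_prev)
    V = np.array([sum(v_prev[(t + 2 * k) % n] * h[t] for t in range(len(h)))
                  for k in range(n // 2)])
    W = np.array([sum(v_prev[(t + 2 * k) % n] * g[t] for t in range(len(g)))
                  for k in range(n // 2)])
    return V, W

# Haar filter pair; the 1/sqrt(2) factor from the recursion is folded in.
h = np.array([1.0, 1.0]) / np.sqrt(2.0)
g = np.array([1.0, -1.0]) / np.sqrt(2.0)

s = np.array([4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0])
V1, W1 = mallat_step(s, h, g)
# Orthogonality of the filter pair preserves energy:
# sum(V1**2) + sum(W1**2) equals sum(s**2).
```

Cascading this step on the approximation branch V alone reproduces the pyramid of Fig. 1; the energy-preservation check is a convenient sanity test for any orthogonal filter pair substituted for Haar.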
The number of multiplication operations required to calculate all the coefficients of the discrete wavelet transform for a data set of N samples and filter vectors h and g of length L is 2LN. The same number of operations is required to reconstruct, i.e., to calculate, all the spectral components, so the analysis of a speech signal on a wavelet basis requires 4LN operations. The number of complex multiplication operations for the fast Fourier transform is N log₂ N, which is comparable to, or even greater than, the cost of the discrete wavelet transform [21].

The interpretation of the coefficients of the discrete wavelet transform is somewhat more complicated than that of the Fourier coefficients. If the analyzed speech signal is sampled at 8 kHz and consists of 256 samples, the top frequency of the signal is 4 kHz. The coefficients of the first decomposition level (128 of them) then occupy the frequency band [2.0, 4.0] kHz, and the second-level wavelet coefficients (64) are responsible for the [1.0, 2.0] kHz band; they are placed before the first-level wavelet coefficients. The procedure is repeated until one wavelet coefficient and one scaling coefficient remain at level 8. The total number of coefficients is (1+1+2+4+8+16+32+64+128) = 256, i.e., the number of coefficients equals the number of samples in the input speech signal. If the main energy of the signal is concentrated near 1.0 kHz, the second-level wavelet coefficients will be the more informative ones, and the first-level wavelet coefficients can be neglected [22].

As a continuation of the development of the theory of multiple-scale analysis, it is proposed to improve the Mallat algorithm by additionally processing the high-frequency components of the pyramid of the analyzed speech signal. In the improved algorithm, recursive filtering is thus also applied to the coefficients W_{m,k}. This full decomposition algorithm is called wavelet packet decomposition, which gave the method its name. In general, each level of the hierarchy can use its own specific basis. In contrast to the Mallat algorithm, the use of wavelet packets makes it possible to take the subtle structure of the analyzed speech signal into account more comprehensively. Indeed, the absolute values of the coefficients in the wavelet packet decomposition are smaller than those of the Mallat algorithm, so it can be argued that the approximation with wavelet packets has a much smaller error [23]. The decomposition scheme based on wavelet packets is shown in Fig. 2.

Figure 2: Signal sequence decomposition scheme based on the wavelet packet algorithm

Since the wavelet basis is a complete decomposition basis, the wavelet coefficients contain individual characteristics of the input speech signal, determined by the properties of the basis functions, to the same extent as the spectral components of the Fourier series. Thus, any wavelet transform, including one based on wavelet packets, uniquely represents a speech signal by an ordered set of its wavelet coefficients. It is therefore natural to use these coefficients as recognition features and to base the proposed method on their calculation with wavelet packets.

The method of forming speech signal recognition features based on wavelet packets is defined as follows. In the wavelet spectrum formed on the basis of wavelet packets, the power of the calculated wavelet coefficients within each decomposition subband is averaged. The averaged coefficients are normalized and, according to their place in the overall pyramid of wavelet packets, from left to right and from top to bottom, converted into a vector of recognition features. The specific values of the average power of the wavelet coefficients in each decomposition subband thus serve as the primary features of speech signal recognition. It should be noted that, in general, the features obtained in this way will be correlated, so it is advisable to apply an additional decorrelating transformation to the vector, which will also reduce the size of the secondary recognition feature space [24].

For the wavelet coefficients Z_{m,n}(i) (Fig. 2), the index m corresponds to the number of the decomposition level, the index n corresponds to the number of the subband at level m, and i = 0, 1, …, N/2^m − 1 numbers the wavelet coefficients within a subband at level m. In wavelet packets, several decomposition bases, nested within one another, are used for the complete decomposition.

Consider the sequence of stages of the proposed method (Fig. 3).

Figure 3: Scheme of speech signal recognition features selection for biometric identification of a person

Initially, the input sequence of discrete samples of the speech signal S(t_i) of length N, a power of 2, with i = 0, 1, 2, …, (N − 1), is decomposed into K ≤ log₂(N) levels by applying the wavelet packet algorithm.
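The full wavelet packet recursion, in which both the low-pass and the high-pass branches are split at every level, can be sketched as follows; the Haar filter pair again stands in for the actual basis (an assumption), and circular indexing replaces boundary handling:

```python
import numpy as np

def wp_decompose(s, h, g, K):
    """Build the packet tree {(m, n): Z_{m,n}} by splitting every subband,
    low-pass and high-pass alike, down to level K."""
    tree = {(0, 0): np.asarray(s, dtype=float)}
    for m in range(1, K + 1):
        for n in range(2 ** (m - 1)):
            parent = tree[(m - 1, n)]
            L = len(parent)
            lo = np.array([sum(parent[(t + 2 * k) % L] * h[t] for t in range(len(h)))
                           for k in range(L // 2)])
            hi = np.array([sum(parent[(t + 2 * k) % L] * g[t] for t in range(len(g)))
                           for k in range(L // 2)])
            tree[(m, 2 * n)], tree[(m, 2 * n + 1)] = lo, hi
    return tree

h = np.array([1.0, 1.0]) / np.sqrt(2.0)   # Haar stand-in for the chosen basis
g = np.array([1.0, -1.0]) / np.sqrt(2.0)

tree = wp_decompose(np.arange(16.0), h, g, K=2)
# The tree holds 2*2**K - 1 = 7 sequences, including the input one;
# level m has 2**m subbands of length N / 2**m.
```

At level m the tree holds 2^m subbands of length N/2^m, which matches the counting of sequences given for the method below.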
At the first level, the input array S(t_i) is decomposed into two sets Z_{1,0}(i) and Z_{1,1}(i) by convolving S(t_i) with the sequences {h} and {g}, which are determined by the characteristics of the low-pass H and high-pass G filters. At the second level, the same convolution procedures are repeated with each of the obtained subsets Z_{1,0}(i) and Z_{1,1}(i). The process of full decomposition, called wavelet packetization, consists of K steps similar to the first one [25].

These procedures can be represented in general by the following expressions:

Z_{m,2n}(i) = Σ_{t=0}^{N−1} Z_{m−1,n}(i) h_{m,n}(i),

Z_{m,2n+1}(i) = Σ_{t=0}^{N−1} Z_{m−1,n}(i) g_{m,n}(i),

where 1 ≤ m ≤ K and 0 ≤ n ≤ 2^(m−1) − 1. At the first level of decomposition, the samples of the speech signal S(t_i) themselves are used as Z_{0,0}(i). The values of the elements of the sequences {h} and {g} depend on the chosen type of scaling function φ(x) and wavelet function ψ(x) and, according to (1) and (2), are calculated as follows:
h_{m,n}(i) = 2^(−m/2) φ(2^(−m) i − n),

g_{m,n}(i) = 2^(−m/2) ψ(2^(−m) i − n).

As a result of the transformations performed during the decomposition, the sequence of samples of the speech signal S(t_i) is decomposed into R = 2·2^K − 1 sequences (including the input one) of length N/2^m, each of which represents one of the frequency subbands of the input speech signal [26].

Different realizations of speech signals will have different energy distributions over the frequency subbands, since their Fourier spectra will also differ. If the average power of the wavelet coefficients is calculated in each subband, the set of obtained values will reflect the wavelet content of the speech signal subbands, similar to the frequency representation. Moreover, the transition to the average power allows relatively short input realizations to be used for recognition, which is an important point for rapid analysis systems. The band of frequencies falling into each subband narrows as the decomposition level grows, which follows from the wavelet packet scheme (Fig. 2). The average powers of the wavelet coefficients in each subband, which are used as speech recognition features, are calculated according to the expression

P̄_{m,n} = [ Σ_{i = n·N/2^m}^{((n+1)·N/2^m) − 1} (Z_{m,n}(i))² ] / (N/2^m).   (3)

To eliminate the sensitivity of the features to changes in the average power of the speech signal realization, the values of P̄_{m,n} obtained by (3) are normalized relative to the average power P̄_{0,0} of the input speech signal realization S(t_i) [27].

Finally, the feature vector Y = {y_r}_R, consisting of an ordered sequence of averaged powers of wavelet coefficients, is formed by sequentially recording, for all m and n, the calculated normalized values of P̄_{m,n} from left to right and from top to bottom. The number r of a feature is determined by the expression r = 2^m − 1 + n and corresponds to the ordinal number of the component of the vector Y = {y_r}_R.

An important point in implementing the method is the choice of the scaling function φ(x) and the wavelet ψ(x). First, the size of the time-frequency window should be taken into account; second, the smoothness and symmetry of the underlying wavelet; and third, the order of approximation. Correct selection of the wavelet basis for the speech signal significantly reduces the number of non-zero wavelet coefficients Z_{m,n}(i), which substantially reduces the size of the recognition features and makes them much more informative [28].

4. Results and Discussion

Practical experiments were conducted to investigate the contrast of the speech recognition feature vectors formed by the proposed method. In particular, Figs. 4–5 show the feature vectors of a speech signal calculated in different wavelet decomposition bases. In the first case (Fig. 4), a wavelet packet based on the Haar basis was used to obtain the wavelet coefficients; it provides a relatively coarse approximation of the speech signal, which accordingly affects the informativeness of the recognition features. In the second case (Fig. 5), the speech signal recognition features are calculated with the smoother Meyer function, which makes the features more informative.

Figure 4: Components of the speech recognition feature vector based on the Haar basis

Figure 5: Components of the speech recognition feature vector based on the Meyer basis

A comparative analysis of the results in Figs. 4–5 shows that, when a smoother basis function is chosen, the number of y_r values close to zero in the feature vector Y = {y_r}_R increases and the informativeness of the decomposition grows, unlike with the Haar function, which yields less informative recognition features. Thus, the use of basis wavelet functions consistent in smoothness with the studied speech signal reduces the size of the recognition features and increases their informativeness.

To confirm the hypothesis that it is expedient to build speech signal recognition systems on wavelet packets, using the values of Y = {y_r}_R obtained by expression (3) as recognition features, the developed method of forming recognition features was studied in comparison with the approach proposed in [15], which is based on the spectral components of the classical harmonic Fourier transform (Fig. 6).

Figure 6: Components of the Fourier-based speech recognition feature vector

The experiment used realizations of speech signals with a duration of N = 512 samples, and the decomposition was performed to m = 5 levels. This yielded a feature vector Y = {y_r}_R of length R = 32, with 16 wavelet coefficients averaged in each subband. For the recognition features based on the Fourier transform, the spectrum was divided into 32 bands of 16 coefficients each [29].
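The feature construction described above, expression (3) with the P̄_{0,0} normalization and the r = 2^m − 1 + n ordering, can be sketched as follows; the tiny hand-built Haar packet tree for K = 1 is only a stand-in for a real decomposition of a recorded phrase:

```python
import numpy as np

def wp_features(tree, K):
    """Feature vector Y = {y_r}: average power of each subband (expression (3)),
    normalized by the average power of the input realization, ordered top to
    bottom and left to right with r = 2**m - 1 + n."""
    P00 = np.mean(tree[(0, 0)] ** 2)          # average power of S(t_i)
    Y = np.zeros(2 * 2 ** K - 1)
    for m in range(K + 1):
        for n in range(2 ** m):
            Y[2 ** m - 1 + n] = np.mean(tree[(m, n)] ** 2) / P00
    return Y

# Hand-built Haar packet tree for K = 1 (illustration only).
s = np.array([1.0, 3.0, 2.0, 4.0])
tree = {
    (0, 0): s,
    (1, 0): np.array([s[0] + s[1], s[2] + s[3]]) / np.sqrt(2.0),
    (1, 1): np.array([s[0] - s[1], s[2] - s[3]]) / np.sqrt(2.0),
}
Y = wp_features(tree, K=1)
# Y[0] == 1 by construction; for an orthogonal basis Y[1] + Y[2] == 2 * Y[0],
# since the subbands are half as long as the input.
```

In the experiment described above only the deepest level is used (K = 5 gives 32 subbands of 16 coefficients for N = 512); the same loop restricted to m = K reproduces that 32-component vector.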
To illustrate the effectiveness of the proposed method more clearly (Fig. 7), an experiment was conducted with 30 pre-recorded audio recordings of the same semantic constructions by two different speakers: each speaker pronounced the words "1", "2", "3", "4", "5" 30 times each. The average value of the Root Mean Square Error (RMSE) serves as an objective indicator of the effectiveness of the developed method:

σ = (1/N_c) Σ_{i=1}^{N_c} √( Σ_{t=1}^{n} (Y(t) − Ŷ(t))² / Σ_{t=1}^{n} Y(t)² ) → min,

computed over all N_c = 30 realizations for each speaker, so the method that shows the lowest RMSE is the best. RMSE is one of many metrics used to evaluate model performance: the detected errors are squared and averaged [30]. This experimental study is needed to obtain an objective measure of the interclass distance of the recognition features, i.e., the scatter of the features when comparing different realizations of speech signals [31].

Figure 7: Thirty realizations of speech signal recognition features using the bases: (a) Meyer, (b) Haar, (c) Fourier

The results of the pairwise comparison of the features of the test speech signals obtained with the Haar and Meyer wavelet-based methods and with the Fourier spectral-coefficient method are presented in Table 1.

Table 1
Comparative analysis of the existing and proposed methods

Phrases   Haar, σ   Meyer, σ   Fourier, σ
"1"       0.116     0.053      0.219
"2"       0.153     0.075      0.244
"3"       0.143     0.069      0.231
"4"       0.178     0.081      0.276
"5"       0.162     0.067      0.248

The analysis of the obtained results shows that the contrast of the recognition features of the test speech signals generated by the developed method without the influence of noise is on average 3.8 times higher than that of the method using the Fourier spectral coefficients.
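The error measure above can be sketched as follows; the reference vector and the perturbed realizations are synthetic stand-ins (assumptions for illustration), since the recorded phrases themselves are not reproduced here:

```python
import numpy as np

def normalized_rmse(Y_ref, Y_real):
    """Normalized RMSE between a reference feature vector and one
    realization, as in the sigma measure above."""
    return np.sqrt(np.sum((Y_ref - Y_real) ** 2) / np.sum(Y_ref ** 2))

# Synthetic stand-ins: a 32-component reference vector and N_c = 30 noisy copies.
rng = np.random.default_rng(1)
Y_ref = np.linspace(1.0, 0.1, 32)
realizations = [Y_ref + rng.normal(0.0, 0.02, 32) for _ in range(30)]

# Average over all realizations; the method with the smallest sigma wins.
sigma = np.mean([normalized_rmse(Y_ref, Y) for Y in realizations])
```

Because the numerator and denominator share the same scale, sigma is insensitive to a common gain applied to both vectors, which is consistent with the power normalization of the features.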
To investigate the effect of noise on the robustness of the feature vectors formed from wavelet packets in the Meyer and Haar bases and from the Fourier energy spectrum, several experiments were conducted in which white noise at a signal-to-noise ratio of 10 dB was added to the speech signal (the noise power was measured in the analysis band) [32]. Fig. 8 shows all three feature vectors at the same noise power.

Figure 8: Thirty realizations of speech recognition features obtained at a signal-to-noise ratio of 10 dB based on the bases: (a) Meyer, (b) Haar, and (c) Fourier

Table 2 shows the results of a comparative analysis of the stability of the speech signal recognition features obtained from wavelet packets in the Meyer basis and from the Fourier energy spectrum. At signal-to-noise ratios of 10, 20, and 30 dB, the total deviation σ of the obtained features from their reference values was calculated; the values were then normalized relative to the maximum.

Table 2
Comparative analysis of the existing and proposed methods under the influence of noise of different power

Phrases   Meyer + 10 dB, σ   Meyer + 20 dB, σ   Meyer + 30 dB, σ   Fourier + 10 dB, σ   Fourier + 20 dB, σ   Fourier + 30 dB, σ
"1"       0.183              0.119              0.074              0.352                0.281                0.239
"2"       0.231              0.158              0.095              0.382                0.314                0.254
"3"       0.227              0.149              0.087              0.381                0.325                0.261
"4"       0.246              0.161              0.097              0.403                0.347                0.286
"5"       0.214              0.122              0.076              0.367                0.302                0.258

Thus, it was established that at a signal-to-noise ratio of 10 dB the features obtained by the developed method give a very acceptable result, namely a 1.6–2-fold increase in stability compared to the features obtained from the traditional Fourier spectrum, for which the total deviation σ of the features is already unacceptable at a signal-to-noise ratio of 20 dB.

5. Conclusions and Future Research

In this research, the task of extracting speech signal recognition features for remote voice identification of a person was solved. The remote setting imposes several restrictions, namely: (1) minimum processing time of the speech signal realization, since the required recognition reliability is achieved by statistical processing of the obtained results; (2) reduced dimensionality of the recognition features, since feature extraction and classification take place on the transmitting side of the communication channel, which in turn imposes constraints of computing power and the influence of noise in the communication channel.

The studies have shown the ability of the developed method to form recognition features based on wavelet packets in the Meyer basis. The most important indicator of the effectiveness of the experiment is the increase in the contrast of the recognition features, i.e., the increase in the interclass distance in the formed feature system for speech signals with a similar frequency-temporal structure. Even a visual analysis of the obtained values Y = {y_r}_R (Figs. 7–8) reveals significant differences in the structure of the feature vectors formed from relatively short realizations, which supports the use of the presented method for speech signal recognition in rapid analysis systems. Since the recognition features are distributed according to the normal law, the subsequent procedure for deciding whether speech signal realizations belong to a particular class is greatly simplified.

After analyzing the given conditions of the voice identification system, a method for extracting speech signal recognition features was developed that provides more informative spectral characteristics of the speech signal, improving the efficiency of their further classification under the influence of noise. This paper considered the possibility of applying the theory of time-scale analysis to this problem, namely, the development of a method for extracting recognition features based on the wavelet packet transform using the orthogonal Meyer basis wavelet function and subsequent averaging of the wavelet coefficients that fall in the frequency band of the corresponding wavelet packet.

Experimental studies have shown the ability of the developed method to generate speech signal recognition features with a close frequency-temporal structure based on wavelet packets in the Meyer basis: at a signal-to-noise ratio of 10 dB, the obtained features are 1.6–2 times more robust to noise than the features obtained from the traditional Fourier spectrum, for which the total root-mean-square deviation of the features is unacceptable already at a signal-to-noise ratio of 20 dB. The analysis of the results also shows that the contrast of the recognition features of test speech signals generated by the developed method without the influence of noise is on average 3.8 times higher than that of the method using Fourier spectral coefficients.

The authors see the further direction of research in identifying the potential capabilities of the developed method of speech signal recognition for person identification under the very difficult conditions of overlapping voices of several speakers, in particular with similar acoustic characteristics, as well as in selecting and justifying the criterion for implementing the recognition procedures. It should be noted that there have been virtually no studies of voice identification capabilities for this most difficult case.

References
[1] J. Anand Babu, et al., Secure Data Retrieval System using Biometric Identification, in: International Conference on Data Science and Information System (ICDSIS) (2022) 1–4. doi: 10.1109/ICDSIS55133.2022.9915968.
[2] O. Romanovskyi, et al., Prototyping Methodology of End-to-End Speech Analytics Software, in: 4th International Workshop on Modern Machine Learning Technologies and Data Science, vol. 3312 (2022) 76–86.
[3] I. Iosifov, et al., Transferability Evaluation of Speech Emotion Recognition Between Different Languages, Advances in Computer Science for Engineering and Education 134 (2022) 413–426. doi: 10.1007/978-3-031-04812-8_35.
[4] O. Iosifova, et al., Analysis of Automatic Speech Recognition Methods, in: Workshop on Cybersecurity Providing in Information and Telecommunication Systems, vol. 2923 (2021) 252–257.
[5] O. Romanovskyi, et al., Automated Pipeline for Training Dataset Creation from Unlabeled Audios for Automatic Speech Recognition, Advances in Computer Science for Engineering and Education IV, vol. 83 (2021) 25–36. doi: 10.1007/978-3-030-80472-5_3.
[6] O. Iosifova, et al., Techniques Comparison for Natural Language Processing, in: 2nd International Workshop on Modern Machine Learning Technologies and Data Science, vol. 2631, no. I (2020) 57–67.
[7] H. Monday, et al., Shared Weighted Continuous Wavelet Capsule Network for Electrocardiogram Biometric Identification, in: 18th International Computer Conference on Wavelet Active Media Technology and Information Processing (ICCWAMTIP) (2021) 419–425. doi: 10.1109/ICCWAMTIP53232.2021.9674078.
[8] L. Zhu, et al., An Efficient and Privacy-Preserving Biometric Identification Scheme in Cloud Computing, IEEE Access 6 (2018) 19025–19033. doi: 10.1109/ACCESS.2018.2819166.
[9] J. Upadhyay, et al., Biometric Identification using Gait Analysis by Deep Learning, in: Pune Section International Conference (PuneCon) (2020) 152–156. doi: 10.1109/PuneCon50868.2020.9362402.
[10] C. Liu, et al., An Efficient Biometric Identification in Cloud Computing with Enhanced Privacy Security, IEEE Access 7 (2019) 105363–105375. doi: 10.1109/ACCESS.2019.2931881.
[11] O. Attallah, Multi-tasks Biometric System for Personal Identification, International …
… for Dialect Identification, IEEE Access 8 (2020) 174871–174879. doi: 10.1109/ACCESS.2020.3020506.
[14] Y. Dong, X. Yang, Affect-Salient Event Sequences Modelling for Continuous Speech Emotion Recognition Using Connectionist Temporal Classification, in: 5th International Conference on Signal and Image Processing (ICSIP) (2020) 773–778. doi: 10.1109/ICSIP49896.2020.9339383.
[15] R. Hidayat, A. Winursito, Analysis of Amplitude Threshold on Speech Recognition System, in: International Seminar on Application for Technology of Information and Communication (iSemantic) (2020) 449–453. doi: 10.1109/iSemantic50169.2020.9234214.
[16] Z. Qing, W. Zhong, W. Peng, Research on Speech Emotion Recognition Technology Based on Machine Learning, in: 7th International Conference on Information Science and Control Engineering (ICISCE) (2020) 1220–1223. doi: 10.1109/ICISCE50968.2020.00247.
[17] B. Kashyap, et al., Machine Learning-Based Scoring System to Predict the Risk and Severity of Ataxic Speech Using Different Speech Tasks, IEEE Transactions on Neural Systems and Rehabilitation Engineering 31 (2023) 4839–4850. doi: 10.1109/TNSRE.2023.3334718.
[18] H. Park, Y. Chung, J.-H. Kim, Deep Neural Networks-based Classification Methodologies of Speech, Audio and Music, and its Integration for Audio Metadata Tagging, J. Web Eng. 22(1) (2023) 1–26. doi: 10.13052/jwe1540-9589.2211.
Conference on Computational Science [19] O. Lavrynenko, et al., Method of Semantic and Engineering (CSE) and International Coding of Speech Signals based on Conference on Embedded and Empirical Wavelet Transform, 4th Ubiquitous Computing (EUC) (2019) International Conference on Advanced 110–114. doi: 10.1109/CSE/EUC.2019. Information and Communication 00030. Technologies (AICT) (2021) 18–22. doi: [12] M. Aliaskar et al., Human Voice 10.1109/AICT52120.2021.9628985. Identification Based on the Detection of [20] A. Dutt, P. Gader, Wavelet Fundamental Harmonics, 7th Multiresolution Analysis Based Speech International Energy Conference Emotion Recognition System Using 1D (ENERGYCON) (2022) 1–4. doi: CNN LSTM Networks, Transactions on 10.1109/energycon53164.2022.9830471. Audio, Speech, and Language Proces. 31 [13] R. Kethireddy, et al., Mel-Weighted (2023) 2043–2054. doi: Single Frequency Filtering Spectrogram 10.1109/TASLP.2023.3277291. 161 [21] C. Zhang, et al., Research on Extracting Transformation Optimized by Convo- Algorithm of Speech Eigenvalue Based lutional Autoencoders, Transact. Neural on Wavelet Packet Transform and Netw. Learn. Syst. 34(3) (2023) 1395– Gammatone Filter, 3rd Information 1405. doi: 10.1109/TNNLS.2021.31053 Technology, Networking, Electronic and 67. Automation Control Conference (ITNEC) [30] O. Lavrynenko, et al., Remote Voice User (2019) 165–169. doi: 10.1109/ITNEC. Verification System for Access to IoT 2019.8729292. Services Based on 5G Technologies, 12th [22] O. Lavrynenko, et al., A Method for International Conference on Intelligent Extracting the Semantic Features of Data Acquisition and Advanced Speech Signal Recognition Based on Computing Systems: Technology and Empirical Wavelet Transform, Applications (2023) 1042–1048. doi: Radioelectron. Comput. Syst. 107(3) 10.1109/IDAACS58523.2023.10348955. (2023) 101–124. doi: 10.32620/reks. [31] O. Veselska, et al., A Wavelet-Based 2023.3.09. Steganographic Method for Text Hiding [23] G. 
Frusque, O. Fink, Learnable Wavelet in an Audio Signal, Sensors 22(15) Packet Transform for Data-Adapted (2022) 1–25. doi: 10.3390/s22155832. Spectrograms, International Conference [32] V. Kuzmin, et al., Empirical Data on Acoustics, Speech and Signal Approximation Using Three-Dimen- Processing (2022) 3119–3123. doi: sional Two-Segmented Regression, 3rd 10.1109/ICASSP43922.2022.9747491. KhPI Week on Advanced Technology [24] B. Zhao, et al., A Spectrum Adaptive (2022) 1–6. doi: 10.1109/KhPIWeek Segmentation Empirical Wavelet 57572.2022.9916335. Transform for Noisy and Nonstationary Signal Processing, IEEE Access 9 (2021) 106375–106386. doi: 10.1109/ACCESS. 2021.3099500. [25] R. Odarchenko, et al., Empirical Wavelet Transform in Speech Signal Compression Problems, 8 International th Conference on Problems of Infocommunications, Science and Technology (2021) 599–602. doi: 10.1109/PICST54195.2021.9772156. [26] T. Zhang, et al., Multiple Vowels Repair Based on Pitch Extraction and Line Spectrum Pair Feature for Voice Disorder, J. Biomedical Health Inform. 24(7) (2020) 1940–1951. doi: 10.1109/JBHI.2020.2978103. [27] F. Costa, et al., Wavelet-Based Harmonic Magnitude Measurement in the Presence of Interharmonics, Transactions on Power Delivery 38(3) (2023) 2072– 2087. doi: 10.1109/TPWRD.2022. 3233583. [28] X. Zheng, Y. Tang, J. Zhou, A Framework of Adaptive Multiscale Wavelet Decomposition for Signals on Undirected Graphs, Transactions on Signal Proces. 67(7) (2019) 1696–1711. doi: 10.1109/TSP.2019.2896246. [29] B. Wang, J. Saniie, Massive Ultrasonic Data Compression Using Wavelet Packet 162
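The pipeline summarized in the conclusions — a full wavelet packet decomposition, averaging of the coefficients inside each packet's frequency band to form the feature vector, and a maximum-likelihood decision justified by the features' normality — can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the two-tap Haar filter pair stands in for the discrete Meyer wavelet (whose filters are designed in the frequency domain), and all function names are illustrative.

```python
import math

def haar_split(x):
    """One two-band analysis step with decimation.
    Haar is used here only as a self-contained stand-in for the
    Meyer wavelet filters of the paper's method."""
    n = len(x) // 2
    approx = [(x[2 * k] + x[2 * k + 1]) / math.sqrt(2) for k in range(n)]
    detail = [(x[2 * k] - x[2 * k + 1]) / math.sqrt(2) for k in range(n)]
    return approx, detail

def wavelet_packet_leaves(x, levels):
    """Full wavelet packet tree: both the approximation and the detail
    branch are split at every level, yielding 2**levels sub-bands."""
    bands = [list(x)]
    for _ in range(levels):
        bands = [half for band in bands for half in haar_split(band)]
    return bands

def band_features(x, levels):
    """Recognition features: the average absolute wavelet coefficient
    inside each packet's frequency band."""
    return [sum(abs(c) for c in band) / len(band)
            for band in wavelet_packet_leaves(x, levels)]

def gaussian_decide(feature, class_stats):
    """Maximum-likelihood class decision for one scalar feature under the
    normal law -- the simplification the features' normality allows.
    class_stats maps a label to its estimated (mean, std)."""
    best, best_ll = None, -math.inf
    for label, (mu, sigma) in class_stats.items():
        ll = -math.log(sigma) - (feature - mu) ** 2 / (2 * sigma ** 2)
        if ll > best_ll:
            best, best_ll = label, ll
    return best
```

For a constant signal, all energy falls into the all-lowpass band, so `band_features([1.0] * 8, 2)` gives a nonzero first feature and zero detail-band features; in actual use the 2^levels-dimensional vectors would be averaged over many frames of the speech realization before the per-class decision.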