The Method of User Identification by Speech Signal

Vyacheslav Nykytyuk, Vasil Dozorskyi, Oksana Dozorska, Andrii Karnaukhov and Liubomyr Matiichuk

Ternopil Ivan Puluj National Technical University, Ruska str., 56, Ternopil, 46001, Ukraine

Abstract
The paper proposes a method of four-parameter identification of users by speech signal, based on a mathematical model of speech signals in the form of a piecewise stationary random process. The method evaluates three individual parameters of the speech signal: the value of the main tone frequency, the time durations of the speech signal segments that correspond to vowel and consonant sounds, and the order in which these segments alternate in the spoken password. The value of the threshold function is used as the fourth parameter: the segment durations are determined from this function, and its value can be tied by a chosen dependence to the main tone frequency and the duration of the password recording. This makes false identification by a third party practically impossible and increases the reliability of identification itself.

Keywords
Identification, speech signal, processing method, sliding window, piecewise stationary random process

ITTAP’2022: 2nd International Workshop on Information Technologies: Theoretical and Applied Problems, November 22–24, 2022, Ternopil, Ukraine
EMAIL: slavikvv89@gmail.com (A. 1); vasildozorskij1985@gmail.com (A. 2); oksana4elka@gmail.com (A. 3); angryfallenangel@gmail.com (A. 4); mlpstat@gmail.com (A. 5)
ORCID: 0000-0003-1547-8042 (A. 1); 0000-0001-6744-3015 (A. 2); 0000-0001-7053-863X (A. 3); 0000-0003-3042-3066 (A. 4); 0000-0001-6701-4683 (A. 5)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)

1. Introduction

The paper considers a method of biometric identification of the user by speech signal. The task of such identification is particularly relevant today in the IT field [1], in particular for granting access to information resources, databases or individual services only to certain users. Traditional identification methods, such as username and password, knowledge-based identification and SMS-based two-factor identification, have disadvantages due to security vulnerabilities ranging from account hijacking and phishing to social engineering. Accordingly, IT departments today are researching and developing more reliable identification systems that reduce the likelihood of theft and fraud [1, 2].
Among the analyzed methods, biometric identification methods are considered the most reliable, since the identifier is an individual characteristic of the user that third parties practically cannot forge or use without the user's knowledge. Biometric identification refers to security processes that confirm a user's identity using unique biological traits such as the retina, iris, voice, facial features and fingerprints [3-7]. Biometric identification systems store these biometric data to verify a user's identity when that user accesses their account. Because these data are unique to individual users, biometric identification is generally more secure than traditional forms of multifactor identification.
Common biometric identification methods were also compared, in particular by the probability of false identification and by the cost of the technical implementation of the method. Summary data are shown in Table 1.
Analyzing the data in Table 1, we can conclude that identification based on the characteristics and features of speech is optimal, as it does not require expensive equipment and has a low probability of false identification [8, 9].

Table 1
Basic characteristics of common biometric identification methods

Method of obtaining biometric parameters | Probability of false identification, % | Cost of technical implementation, USD
Geometric structure of the hand          | 0.2...1   | 600-3000
Fingerprints                             | 0.0001    | 60-600
Peculiarities of the retina pattern      | 6...10    | 4000
The iris of the eye                      | 0.0001    | 500-6000
Face portrait                            | -         | 55000
Handwriting                              | 0.5...5   | -
Keyboard and computer handwriting        | 3...9     | -
Speech characteristics and features      | 0.5...5   | 1-60

However, with the development of computer programs for speech signal generation, common speech identification algorithms become vulnerable. This calls for better, more efficient and more reliable speech identification algorithms based on the analysis of several different parameters of speech signals. For this, it is necessary to carry out mathematical modeling of speech signals, which consists in choosing a mathematical model of such signals that is adequate for the task of biometric identification of the user and in developing methods of their processing.

2. Choice of speech signals mathematical model

The mathematical model of speech signals should take into account the nature of their formation and open up new opportunities in the field of user identification by enabling new methods of speech signal processing and the extraction of new informative features. It is known [10-13] that the speech signal is a complex non-stationary process, but on short intervals lasting from a few to several tens of main tone periods it can be taken as a stationary random process. Thus, in this research the class of piecewise stationary random processes [14-16] was used as the mathematical model of speech signals.
Let the stationary random process ξ1(t) be realized on the interval [0, t1), the stationary random process ξ2(t) on the next interval [t1, t2), and so on, up to the stationary random process ξn(t) on the interval [tn-1, tn). In general, such a process can be written as

ξ(t) = {ξ1(t), ξ2(t), ..., ξn(t)},

where ξ(t) is the random process, i.e. the speech signal.
For such a process, the concept of disorder is introduced, which characterizes the transitions between areas where the stationary random processes ξ1(t), ξ2(t), ..., ξn(t) are present [14-16]. Such transitions can be detected [16], which makes it possible to segment the password into sections corresponding to individual sounds. Taking into account that speech sounds can be divided into vowels, vocalized consonants and noised consonants, it becomes possible to identify areas in the structure of the speech signal that correspond to such sounds. The durations of these areas and their order in the words used for identification can serve as additional informative features that increase the reliability and accuracy of user identification.
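To make the piecewise stationary model concrete, the sketch below (an illustration only, not the authors' implementation) treats a signal as a concatenation of stationary segments and flags a disorder point wherever a probability characteristic computed in adjacent windows changes abruptly. The windowed power is used here as that characteristic, and the 20 ms window and jump ratio of 3 are assumed values.

```python
# A minimal sketch of disorder detection in a piecewise stationary signal.
# The windowed power stands in for the "probability characteristics" whose
# jump-like change marks a transition between stationary segments.
import numpy as np

def detect_disorder(signal, fs, win_s=0.02, jump_ratio=3.0):
    """Return sample indices where the power of neighbouring windows
    differs by more than jump_ratio (assumed threshold)."""
    win = int(win_s * fs)
    n_win = len(signal) // win
    power = np.array([np.mean(signal[i * win:(i + 1) * win] ** 2)
                      for i in range(n_win)])
    eps = 1e-12                                  # guards against silent windows
    ratio = (power[1:] + eps) / (power[:-1] + eps)
    jumps = np.where((ratio > jump_ratio) | (ratio < 1.0 / jump_ratio))[0]
    return (jumps + 1) * win                     # segment boundaries, in samples

# Synthetic example: two "stationary" segments with different variance.
fs = 8000
x = np.concatenate([np.random.normal(0.0, 0.1, fs),
                    np.random.normal(0.0, 1.0, fs)])
print(detect_disorder(x, fs))                    # boundary near sample 8000
```

In the speech case the tracked characteristic is the presence of the main tone in the spectrum, as described in the following sections.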
On the basis of the selected mathematical model of the speech signal, a processing method can be applied whose essence is as follows. The speech signal should be processed on short time intervals equal to a few or several tens of main tone periods, that is, within a sliding window [16]. By shifting the window in time and tracking jump-like changes of the probability characteristics calculated within each window, it becomes possible to detect transitions between individual sounds and to select the areas corresponding to them. In this way, users can be identified on the basis of several individual parameters of the speech signal.

3. The method of speech signals processing for the task of user identification

A four-parameter identification method is proposed, in which identification is carried out on the basis of four parameters of the speech signal, three of which are individual biometric characteristics and, in combination, are practically impossible to forge. The identification process is carried out in two stages. At the initial stage, when a new user is registered, his speech signal is recorded: a password spoken by the person, which will be used to identify the person at each subsequent request. Both the password word itself and the individual biometric parameters of the person's speech are unique identifiers. In particular, the areas of the speech signal that correspond to vowels and vocalized consonant sounds are selected, and the value of the main tone frequency is estimated. At the subsequent stages, the value of this frequency and the durations of the areas of vowels and vocalized consonant sounds, which are the biometric parameters of the user, are used for identification.
When registering with a certain service or database, the user needs to pronounce a sequence of vowel sounds into the microphone. These sounds are used by the system to determine the approximate value of the main tone frequency, which is the first individual biometric parameter of the user. Next, the user needs to come up with and say into the microphone a certain test word, a password known only to the user. Based on the previously obtained value of the main tone frequency, the system processes the recording of this word and selects the areas in which this frequency is present; these areas correspond to vowels and vocalized consonant sounds. The value of the main tone frequency, the calculated time durations of such areas in the test word and their sequence within the word are used at the subsequent identification of the user.
To determine the main tone frequency, the method of formant analysis has proven itself well: the frequency at which the first maximum of the vowel sound spectrum is located corresponds to the main tone frequency [13]. At the first stage, after recording the vowel sound or sounds, the system calculates the power spectral density distribution of the signal and determines the main tone frequency. Since the speech signal itself is a random process, the value of this frequency also varies. However, as stated in [13], on short time intervals (less than 0.1 s) the speech signal can be taken as a stationary random process. Thus, it is possible to select sections of the speech signal with a duration of about 0.1 s, calculate the power spectral density distribution for each section, determine the main tone frequency from each such spectrum, and calculate the average value of this frequency.
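The averaging procedure just described can be sketched as follows. This is a minimal sketch assuming a Python environment with numpy and scipy; the 0.1 s frame length follows the text, the 85-450 Hz search band is taken from the next section, and using the largest PSD maximum inside that band as the per-frame estimate is a simplifying assumption.

```python
# A minimal sketch of main tone (fundamental) frequency estimation:
# split the vowel recording into ~0.1 s frames, compute a PSD per frame,
# take the frequency of the PSD maximum inside the search band as the
# per-frame estimate, and average the estimates.
import numpy as np
from scipy.signal import periodogram

def estimate_main_tone(signal, fs, frame_s=0.1, f_lo=85.0, f_hi=450.0):
    frame = int(frame_s * fs)
    estimates = []
    for start in range(0, len(signal) - frame + 1, frame):
        f, psd = periodogram(signal[start:start + frame], fs)
        band = (f >= f_lo) & (f <= f_hi)
        if band.any() and psd[band].max() > 0:
            estimates.append(f[band][np.argmax(psd[band])])
    estimates = np.array(estimates)
    return estimates.mean(), estimates.std()     # mean value and sigma

# Example on a synthetic 206 Hz "vowel" (real speech gives varying estimates).
fs = 8000
t = np.arange(0, 1.0, 1.0 / fs)
vowel = np.sin(2 * np.pi * 206 * t) + 0.1 * np.random.randn(len(t))
f0_mean, f0_std = estimate_main_tone(vowel, fs)
print(round(f0_mean, 1), round(f0_std, 1))
```

The returned standard deviation is what the three-sigma rule described below would use to bound the admissible range of the main tone frequency.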
It is in this way that the main tone frequency is determined in the proposed identification method. Such an analysis of the speech signal corresponds to its presentation as a piecewise stationary random process, in particular when analyzing not only individual vowel sounds but also sequences of different sounds in the spoken password, for which individual stationary areas correspond to individual sounds.
The processing method itself is based on a sliding window that is moved along the recorded speech signal. Within each window, the presence of the signs of the main tone is assessed using formant analysis methods. In particular, within each window an estimate of the power spectral density distribution is calculated and the presence of maxima is detected in the frequency range in which the main tone frequency can lie; this range is approximately from 85 to 450 Hz [12, 13]. The frequency at which the first maximum is located corresponds to the main tone frequency. Knowing the approximate value of the main tone frequency, we can determine the range of its variation according to the three-sigma rule [14]: the main tone frequency can deviate from the mean value by -3σ to +3σ, where σ is the root mean square deviation found from the set of main tone frequency values calculated for samples of the speech signal.
Next, the system analyzes the recording of the password. Within each sliding window, the power spectral density distribution is calculated and the presence of a maximum within the previously calculated range of main tone frequency variation is determined. By plotting the values of these maxima along the time axis according to the position of each window, a curve can be obtained that displays, for a given audio signal, the areas corresponding to vowels and vocalized consonant sounds. At the same time, it becomes possible to determine the durations of such areas.
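A sketch of this sliding-window analysis is given below. It is illustrative only: the 0.1 s window width and the ±3σ band follow the text, while the 10 ms window step and the normalisation of the curve to its maximum are assumptions.

```python
# A minimal sketch of the sliding-window analysis of the password record:
# for each window position the PSD is computed and the height of its maximum
# inside [f0 - 3*sigma, f0 + 3*sigma] is stored, giving the maxima-presence
# curve discussed for the figures below.
import numpy as np
from scipy.signal import periodogram

def maxima_presence_curve(signal, fs, f0_mean, f0_sigma,
                          win_s=0.1, hop_s=0.01):
    f_lo, f_hi = f0_mean - 3 * f0_sigma, f0_mean + 3 * f0_sigma
    win, hop = int(win_s * fs), int(hop_s * fs)
    times, values = [], []
    for start in range(0, len(signal) - win + 1, hop):
        f, psd = periodogram(signal[start:start + win], fs)
        band = (f >= f_lo) & (f <= f_hi)
        times.append((start + win / 2) / fs)         # window centre, in seconds
        values.append(psd[band].max() if band.any() else 0.0)
    values = np.array(values)
    if values.max() > 0:
        values = values / values.max()               # assumed normalisation to 1
    return np.array(times), values
```

Applied to a password recording together with the estimates returned by estimate_main_tone above, this yields the kind of curve discussed for Figures 4-7.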
Figure 1 shows the recording of a speech signal that is a sequence of vowel sounds [a] (the first 7 sounds) and vocalized consonant sounds [l] (the last 3 sounds).

Figure 1: Recording of the speech signal: a sequence of vowel sounds [a] (first 7 sounds) and vocalized consonant sounds [l] (last 3 sounds)

For each individual sound, the power spectral density distributions were calculated and individual values of the main tone frequency were determined; some of them are shown as an example in Figure 2.

Figure 2: Individual sounds and their power spectral density distributions

Next, the main tone frequency was calculated for each sound within a sliding window with a width of 0.1 s. Based on these values, the average value of the main tone frequency and the range of its variation were calculated. The average value of the main tone frequency was 206 Hz, and the range of variation according to the three-sigma rule was from 166 to 246 Hz.
At the next stage, the time durations of the areas corresponding to vowel sounds were analyzed. For this, the recording of the sequence of vowel sounds [a] shown in Figure 3 was used.

Figure 3: Recording of the sequence of vowel sounds [a]

Knowing the value of the main tone frequency and the range of its variation, this signal was processed by the proposed method in the following way. A sliding window with a width of 0.1 s was formed and moved along the signal. Within each sliding window, the power spectral density distribution was calculated and its maximum was found within the range of main tone frequency variation. The value of this maximum was plotted along the time axis according to the placement of the sliding window on the speech signal recording. The resulting graph is shown in Figure 4.

Figure 4: Graph of the presence of maxima in the power spectral density (PSD) distribution within the corresponding sliding window

Comparing the graphs in Figure 3 and Figure 4, we can conclude that this processing method makes it possible to determine the areas corresponding to vowel sounds by the value of the main tone frequency. However, the fronts of such areas are sloped, so a method for establishing the duration of these areas had to be developed. For this, a threshold function was used that takes two values: 0 if the value of the signal in Figure 4 is less than or equal to a certain threshold, and 1 if the signal value is greater than the threshold. In this way, the graph in Figure 4 turns into a sequence of rectangular pulses whose widths correspond to the durations of the corresponding speech sounds. The threshold value can be selected from a relationship between the main tone frequency and, for example, the duration of the password; thus, this value will also be individual for each user.
The speech signal shown in Figure 1 was used for the analysis. A graph of the presence of maxima in the power spectral density distribution within the corresponding sliding window and a graph of the threshold function were constructed for it; they are shown in Figure 5. The threshold value was chosen a priori at the level of 0.1. From Figure 5, the durations of the areas that correspond to vowels and vocalized consonants, as well as the sequence of these durations in the password, can already be determined with sufficient accuracy.

Figure 5: Graph of the presence of maxima in the power spectral density distribution within the corresponding sliding window (a) and graph of the threshold function (b)

However, an important question is how difficult it would be to deceive this method when, for example, the test word is spoken by another person. To check this, the range of values of the main tone frequency was changed. The corresponding graphs of the presence of maxima in the power spectral density distribution within the corresponding sliding window and of the threshold function when this range is reduced to 120-194 Hz are shown in Figure 6.

Figure 6: Graphs of the presence of maxima in the power spectral density distribution within the corresponding sliding window (a) and of the threshold function (b) for the main tone frequency range 120-194 Hz

Figure 6 shows that when the assumed range of the main tone frequency does not correspond to the actual one, the graphs of the presence of maxima in the power spectral density distribution and of the threshold function are significantly distorted: gaps appear that correspond to undetected sounds, and the durations of the corresponding intervals of the threshold function change significantly or the intervals are lost altogether.
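The threshold function and the extraction of segment durations can be sketched as follows. This is a minimal sketch: the relative threshold of 0.1 mirrors the a priori value used above, while representing the user template as a tuple of parameters at the end is an assumption.

```python
# A minimal sketch of the threshold function: the normalised maxima-presence
# curve is binarised at the threshold, and the widths of the resulting
# rectangular pulses give the durations of the vowel / vocalized-consonant
# areas and their order in the password.
import numpy as np

def segment_durations(times, values, threshold=0.1):
    """Return (start_s, duration_s) pairs for runs where the curve exceeds
    the threshold."""
    active = values > threshold
    segments, start = [], None
    for i, flag in enumerate(active):
        if flag and start is None:
            start = times[i]
        elif not flag and start is not None:
            segments.append((start, times[i] - start))
            start = None
    if start is not None:                        # an active run reaches the end
        segments.append((start, times[-1] - start))
    return segments
```

A stored user template could then consist of the mean main tone frequency, the ordered list of segment durations and the threshold value itself, which together correspond to the four parameters used for identification.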
Figure 7 shows the corresponding graphs of the presence of maxima in the power spectral density distribution within the corresponding sliding window and of the threshold function when the main tone frequency range is increased to 220-294 Hz.

Figure 7: Graphs of the presence of maxima in the power spectral density distribution within the corresponding sliding window (a) and of the threshold function (b) for the main tone frequency range 220-294 Hz

As in the previous case, the graphs are distorted, and the intervals of the threshold function are significantly shortened and bifurcated. From the analysis of the graphs in Figure 5, Figure 6 and Figure 7, it can be stated that the developed method is efficient and sensitive and makes it possible to identify a person by such individual biometric parameters of speech as the main tone frequency, the durations of the vowels and consonants contained in the password, and the sequence of these durations in the password.
Additionally, Figure 8 shows the recording of the word "sofa", which includes two vowel sounds, [o] and [a], and two noised consonants, [s] and [f]. The threshold function is shown at the top of this figure.

Figure 8: View of the threshold function (top) for the speech signal of the word "sofa"

As expected, the method distinguishes the areas of vowels and vocalized consonant sounds and is not sensitive to noised sounds.

4. Conclusion

The described method of user identification includes the evaluation of three parameters of the speech signal that are individual informative features of the user, namely: the value of the main tone frequency, the durations of the areas of the speech signal that correspond to vowels and vocalized consonant sounds, and their order in the password (individual for each person). The fourth parameter is the value of the threshold function, which is set in a certain way during the registration of a new user and on which the calculated durations of vowels and vocalized consonant sounds depend. The algorithms implementing the method at its various stages are simple, can be integrated into various services, can be implemented in various software environments and are able to work in real time; the calculations used in the developed method are not complicated. In the future, the method can be improved by expanding the number of informative features used to identify the user, in particular by using parameters of the speech signal in the time and spectral domains. However, this would require an additional signal centering and normalization procedure and would make the method more sensitive to external noises and interferences.

5. References

[1] What is Authentication? Definition of Authentication, Authentication Meaning. The Economic Times. Retrieved 2020-11-15.
[2] J. Tardo, K. Alagappan. SPX: Global Authentication Using Public Key Certificates. California, 1991, pp. 232-244.
[3] R. Manjula Devi, P. Keerthika, P. Suresh, P. P. Sarangi, M. Sangeetha, C. Sagana, K. Devendran. Retina biometrics for personal authentication. In: Machine Learning for Biometrics: Concepts, Algorithms and Applications. Cognitive Data Science in Sustainable Computing, 2022, pp. 87-104.
[4] Biometrics: definition, use cases, latest news, 2022. URL: https://www.thalesgroup.com/en/markets/digital-identity-and-security/government/inspired/biometrics
[5] D. Nicolls. What is Biometric Authentication? 2019. URL: https://www.jumio.com/what-is-biometric-authentication/
[6] 5 common biometric techniques compared, 2022.
URL: https://www.recogtech.com/en/knowledge-base/5-common-biometric-techniques-compared
[7] P. B. Lowry, J. Stephens, A. Moyes, S. Wilson, M. Mitchell. Biometrics, a critical consideration in information security management. In: M. Pagani (ed.), Encyclopedia of Multimedia Technology and Networks. Idea Group Inc., 2005, pp. 69-75.
[8] A. Poddar, Md Sahidullah, G. Saha. Speaker Verification with Short Utterances: A Review of Challenges, Trends and Opportunities. IET Biometrics, 7 (2), March 2018, pp. 91-101.
[9] Pollack, Pickett, Sumby. Experimental phonetics. MSS Information Corporation, 1974, pp. 251-258.
[10] G. Fant. Acoustic theory of speech production. The Hague: Mouton, 1970.
[11] B. Jafek, A. Stark. ENT secrets. Philadelphia, PA: Hanley & Belfus, 1995.
[12] S. Furui. Digital speech: processing, synthesis and recognition. Tokyo: Tokyo Institute of Technology, 2000.
[13] L. R. Rabiner, R. W. Schafer. Digital processing of speech signals. New Jersey: Prentice-Hall, 1978.
[14] J. Bendat, A. Piersol. Applied analysis of random data. 1989, 540 p.
[15] N. B. Marchenko, V. V. Nechiporuk, O. P. Nechiporuk, Yu. V. Pepa. Methodology of accuracy of information and information-viral systems of diagnostics. NAU, 2014, 377 p.
[16] V. Nykytyuk, V. Dozorskyi, O. Dozorska. Detection of biomedical signals disruption using a sliding window. Scientific journal of the Ternopil National Technical University, 2018, Vol. 91, No. 3, pp. 125-133.