Method of Remote Biometric Identification of a Person by Voice Based on Wavelet Packet Transform

Oleksandr Lavrynenko, Bohdan Chumachenko, Maksym Zaliskyi, Serhii Chumachenko, and Denys Bakhtiiarov
National Aviation University, 1 Lubomyr Huzar ave., Kyiv, 03058, Ukraine

Abstract
In this research, the task of extracting speech signal recognition features for remote voice identification of a person was solved. The remote setting imposes several restrictions, namely: (1) minimum processing time of the speech signal realization, since the required recognition reliability is achieved through statistical processing of the results; (2) reduced dimensionality of the recognition features, since feature extraction and classification take place on the transmitting side of the communication channel, which in turn imposes constraints of computing power and noise in the communication channel. After analyzing these conditions of the voice identification system, the question arose of developing a method for extracting speech signal recognition features that would provide more informative spectral characteristics of the speech signal and thereby improve the efficiency of their subsequent classification under the influence of noise. In this paper, we consider the possibility of applying the theory of time-scale analysis to this problem, namely, the development of a method for extracting recognition features based on the wavelet packet transform using the orthogonal Meyer basis wavelet function and subsequent averaging of the wavelet coefficients that fall in the frequency band of the corresponding wavelet packet.
Experimental studies have shown the ability of the developed method to generate speech signal recognition features with a close frequency-temporal structure based on wavelet packets in the Meyer basis. In particular, it was found that at a signal-to-noise ratio of 10 dB the features obtained with the developed method give a very acceptable result, being 1.6–2 times more robust to noise than the features obtained from the traditional Fourier spectrum, for which the total root-mean-square deviation of the features is already unacceptable at a signal-to-noise ratio of 20 dB.

Keywords
speech signal, recognition features, wavelet transform, Meyer wavelet function, spectral analysis, voice identification, biometric authentication

CPITS-2024: Cybersecurity Providing in Information and Telecommunication Systems, February 28, 2024, Kyiv, Ukraine
EMAIL: oleksandrlavrynenko@gmail.com (O. Lavrynenko); bohdan.chumachenko@npp.nau.edu.ua (B. Chumachenko); maksym.zaliskyi@npp.nau.edu.ua (M. Zaliskyi); serhii.chumachenko@npp.nau.edu.ua (S. Chumachenko); bakhtiiaroff@tks.nau.edu.ua (D. Bakhtiiarov)
ORCID: 0000-0002-7738-161X (O. Lavrynenko); 0000-0002-0354-2206 (B. Chumachenko); 0000-0002-1535-4384 (M. Zaliskyi); 0009-0003-8755-5286 (S. Chumachenko); 0000-0003-3298-4641 (D. Bakhtiiarov)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org, ISSN 1613-0073)

1. Introduction

The development of new methods and means of ensuring information security is intended primarily to prevent threats of access to information resources by unauthorized persons. To solve this problem, it is necessary to have identifiers and to create identification procedures for all users. Modern identification and authentication include various systems and methods of biometric identification [1]. The development of identification systems based on biometric measurements brings a whole range of advantages: such systems are more reliable because biometric indicators are more difficult to fake; modern microprocessor technology makes biometric methods more convenient than conventional identification methods; and, finally, biometric measurements are much easier to automate [2–6].

One of the most common biometric characteristics of a person is his or her voice, which has a set of individual characteristics that are relatively easy to measure (for example, the frequency spectrum of the speech signal). The advantages of voice identification also include ease of application and use and the fairly low cost of the devices used for identification (e.g., microphones) [7].

Voice identification covers a very wide range of tasks, which distinguishes it from other biometric systems. First of all, voice identification has long been widely used in various systems for differentiating access to physical objects and information resources. Its new application in remote voice identification systems, where a person is identified through a telecommunications channel, seems promising. For example, in mobile communications, voice can be used to manage services, and the introduction of voice identification helps protect against fraud [8].

Voice identification is of particular importance in the investigation of crimes, in particular in the field of computer information, and in the formation of the evidence base for such an investigation. In these cases, it is often necessary to identify an unknown voice recording. Voice identification is an important practical task when searching for a suspect based on a voice recording in telecommunication channels. Determining such characteristics of the speaker's voice as gender, age, nationality, dialect, and emotional coloring of speech is also important in forensics and anti-terrorism. The identification results matter in phonoscopic examinations and in expert forensic research based on the theory of forensic identification [9].

Voice identification in real-world environments faces the following serious challenges. First, such identification is subject to all kinds of hardware distortions and noise caused by the peculiarities of the equipment and devices used for recording, processing, and storing information. Second, external acoustic noise inevitably superimposes itself on the speech signal, which can significantly distort individual informative characteristics. Given this, identification systems that have demonstrated fairly high efficiency in laboratory conditions may show much lower reliability when analyzing speech information contaminated by external noise. Finally, in several tasks, identification has to be performed under the very difficult conditions of overlapping voices of several speakers, in particular with similar acoustic characteristics. It should be noted that there have been virtually no studies of voice identification capabilities for this most difficult case [10].

Voice identification involves a set of technical, algorithmic, and mathematical methods that cover all stages, from voice recording to voice data classification. The difficulties and shortcomings discussed above lead to the conclusion that further development of voice identification systems requires new approaches aimed at processing large arrays of experimental speech signals, their effective analysis, and reliable classification. This indicates the relevance of research on new mathematical methods for processing, analyzing, and classifying voice data that would ensure the reliability and accuracy of person identification [11].

Traditionally, the methods of practical interest for speech signal recognition are those that provide the required level of classification reliability under the given conditions. Until recently, the dominant approach to the construction of biometric voice identification devices was to impose no restrictions on the processing time of the speech signal, since the required recognition reliability was achieved by statistical processing of the obtained results as well as by increasing the dimensionality of the recognition features; as a rule, the extraction of recognition features and their classification took place on the transmitting side of the communication channel.

However, in the case of remote voice identification in modern mobile radio communication systems, it is difficult to ensure these conditions, since the identification of a person is carried out on the receiving side, which in turn imposes constraints of computing power and the influence of noise in the communication channel. An additional requirement is often the need to make a classification decision in a time-sensitive environment [12].

In this case, it is necessary to move to other methods that can provide the necessary contrast of the speech signal in the formed recognition features under the specified conditions, namely, to ensure the quality of speech signal feature extraction under the influence of noise in the communication channel. This would allow voice identification technologies to be used in a remote mode over modern mobile radio communication systems, significantly expanding the scope of this type of technology. In this paper, we consider the possibility of applying the theory of time-scale analysis to solve this problem [13].

2. Literature Analysis and Problem Statement

In general, recognition is the process of assigning the object under study, in this case a speech signal represented by a set of observations, to one of several alternative classes. The assignment of an object to a class is based on the existing differences in some ordered set of recognition features [14]. Traditionally, these features are formed from such parameters of the speech signal as the duration of the modulating function elements, the number of signal envelope extrema, statistical characteristics of the number of zero-level crossings, and the higher-order moments of the spectrum shape obtained from the observations. The set of observations is then represented in the form of a matrix

X_{pn} = [ x_{11} x_{12} … x_{1i} … x_{1n}
           x_{21} x_{22} … x_{2i} … x_{2n}
           …      …      …  …     …  …
           x_{p1} x_{p2} … x_{pi} … x_{pn} ],

where n is the number of observations used for recognition, and each column X_i = (x_{1i}, x_{2i}, …, x_{pi})^T, i = 1, 2, …, n, of the matrix X_{pn} is a p-dimensional vector of observed values of the p features X_1, X_2, …, X_p that reflect the properties of the objects most important for recognition. The set of p features is, as a rule, the same for all recognition classes s_1, s_2, …, s_k [15].

Thus, we consider the task of deciding to which of a finite number of classes s_1, s_2, …, s_k, described by the common set of features X_1, X_2, …, X_p, the object under study belongs. Differences between classes will be manifested only in differences in the characteristics of the features of different objects. Then, for any set of features X_1, X_2, …, X_p, rules can be set according to which any two classes s_l and s_r are assigned a vector

D_{lr} = (d_{1lr}, …, d_{qlr})^T,

which consists of q parameters called interclass distances that express the degree of difference in the characteristics of the recognition features [16].

An integral part of the speech signal recognition process is the definition of the feature set X_1, X_2, …, X_p, i.e., the formation of recognition features in such a way as to ensure the required classification reliability with the minimum possible dimension p. In the considered approach to the problem of speech signal recognition, an important point is the choice of a method for forming the recognition features. The use of approaches based on traditional Fourier spectral-time analysis for this purpose involves certain difficulties: first, high requirements on the input speech signal stream in terms of signal-to-noise ratio; second, insufficient classification reliability for multicomponent and weakly stationary signals such as speech; and third, the need for a significant number of realizations. The desire to overcome these limitations within the framework of classical spectral signal processing leads to hard-to-implement variants of speech signal recognition devices and to solutions that are unacceptable under the conditions considered [17].

Thus, we formulate the research objective: to develop a method that allows the formation of contrasting recognition features for automatic remote identification of a person by voice under restrictions on the duration of the processed realization, at a signal-to-noise ratio of less than 20 dB, and under partial or complete a priori uncertainty about the signal structure.
3. Proposed Method

Currently, methods of processing and analyzing speech signals based on wavelet transforms are widely used. The essence of these transformations is to decompose the input signal over a system of basis wavelets, functions each of which is a shifted and scaled copy of a single generating (mother) wavelet. A characteristic property of wavelet functions (hereinafter, wavelets) is finite energy together with localization in both the frequency and time domains.

Thus, any sequence of discrete samples of the speech signal S(t_i) can be represented as an ordered set of coefficients of its decomposition over a system of scaling functions and wavelet functions:

S(t_i) = Σ_{k=1}^{2^(N−M)} V_{M,k} φ_{M,k}(t_i) + Σ_{m=1}^{M} Σ_{k=1}^{2^(N−m)} W_{m,k} ψ_{m,k}(t_i),

where M is the number of decomposition levels, and V_{m,k} and W_{m,k} are the approximating and detailing wavelet decomposition coefficients [18].

The scaling functions and wavelet functions are defined by the theory of multiple-scale analysis:

φ_{m,k}(t) = √(2^m) φ(2^m t − k),   (1)

ψ_{m,k}(t) = √(2^m) ψ(2^m t − k).   (2)

Here, in (1) and (2), √(2^m) is the normalizing factor, k = 0, ±1, ±2, …, and m ∈ Z.

In practice, to quickly calculate the values of the wavelet coefficients V_{m,k} and W_{m,k}, a sequential separation scheme called the pyramid, or Mallat, algorithm is used, which is interpreted as sequential two-band filtering of the input speech signal by cascaded low-pass (H) and high-pass (G) filter blocks (Fig. 1) [19].

Figure 1: Scheme of signal sequence decomposition according to the Mallat algorithm

In Fig. 1, for the wavelet coefficients V_{m,k} and W_{m,k}, the first index m corresponds to the number of the decomposition level, and the second index k = 0, 1, …, 2^m − 1 corresponds to the ordinal number of the wavelet coefficient at decomposition level m. According to the theory of multiple-scale analysis, the values of V_{m,k} and W_{m,k} can be obtained from the coefficients calculated at the previous stages of the speech signal decomposition:

V_{m,k} = (1/√2) Σ_n V_{m−1,n} h_{n+2k},

W_{m,k} = (1/√2) Σ_n V_{m−1,n} g_{n+2k},

where h_m and g_m are the sequences that define the characteristics of the filters H and G at level m of the wavelet decomposition [20].
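A single level of the pyramid filtering just described can be sketched as follows. The Haar filter pair is used only as a stand-in for whatever basis is actually chosen (an assumption for brevity), and circular indexing replaces proper boundary handling:

```python
import numpy as np

def mallat_step(v_prev, h, g):
    """One level of the pyramid (Mallat) algorithm: correlate the previous
    approximation with the low-pass (h) and high-pass (g) filters and keep
    every second output (circular indexing used for simplicity)."""
    n = len(v_prev)
    V = np.array([sum(v_prev[(t + 2 * k) % n] * h[t] for t in range(len(h)))
                  for k in range(n // 2)])
    W = np.array([sum(v_prev[(t + 2 * k) % n] * g[t] for t in range(len(g)))
                  for k in range(n // 2)])
    return V, W

# Haar filter pair; the 1/sqrt(2) factor from the recursion is folded in.
h = np.array([1.0, 1.0]) / np.sqrt(2.0)
g = np.array([1.0, -1.0]) / np.sqrt(2.0)

s = np.array([4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0])
V1, W1 = mallat_step(s, h, g)
# Orthogonality of the filter pair preserves energy:
# sum(V1**2) + sum(W1**2) equals sum(s**2).
```

Cascading this step on the approximation branch V alone reproduces the pyramid of Fig. 1; the energy-preservation check is a convenient sanity test for any orthogonal filter pair substituted for Haar.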
The number of multiplication operations required to calculate all the coefficients of the discrete wavelet transform for a data set of N samples and filter vectors h and g of length L is 2LN. The same number of operations is required to reconstruct, i.e., to calculate, all the spectral components, so the analysis of a speech signal on a wavelet basis requires 4LN operations. The number of complex multiplication operations for the fast Fourier transform is N log₂ N, which is comparable to, or even greater than, the cost of the discrete wavelet transform [21].

The interpretation of the coefficients of the discrete wavelet transform is somewhat more complicated than that of the Fourier coefficients. If the analyzed speech signal is sampled at 8 kHz and consists of 256 samples, the top frequency of the signal is 4 kHz. The coefficients of the first decomposition level (128 of them) then occupy the frequency band [2.0, 4.0] kHz, and the second-level wavelet coefficients (64) are responsible for the [1.0, 2.0] kHz band; they are placed before the first-level wavelet coefficients. The procedure is repeated until one wavelet coefficient and one scaling coefficient remain at level 8. The total number of coefficients is (1+1+2+4+8+16+32+64+128) = 256, i.e., the number of coefficients equals the number of samples in the input speech signal. If the main energy of the signal is concentrated near 1.0 kHz, the second-level wavelet coefficients will be the more informative ones, and the first-level wavelet coefficients can be neglected [22].

As a continuation of the development of the theory of multiple-scale analysis, it is proposed to improve the Mallat algorithm by additionally processing the high-frequency components of the pyramid of the analyzed speech signal. In the improved algorithm, recursive filtering is thus also applied to the coefficients W_{m,k}. This full decomposition algorithm is called wavelet packet decomposition, which gave the method its name. In general, each level of the hierarchy can use its own specific basis. In contrast to the Mallat algorithm, the use of wavelet packets makes it possible to take the subtle structure of the analyzed speech signal into account more comprehensively. Indeed, the absolute values of the coefficients in the wavelet packet decomposition are smaller than those of the Mallat algorithm, so it can be argued that the approximation with wavelet packets has a much smaller error [23]. The decomposition scheme based on wavelet packets is shown in Fig. 2.

Figure 2: Signal sequence decomposition scheme based on the wavelet packet algorithm

Since the wavelet basis is a complete decomposition basis, the wavelet coefficients contain individual characteristics of the input speech signal, determined by the properties of the basis functions, to the same extent as the spectral components of the Fourier series. Thus, any wavelet transform, including one based on wavelet packets, uniquely represents a speech signal by an ordered set of its wavelet coefficients. It is therefore natural to use these coefficients as recognition features and to base the proposed method on their calculation with wavelet packets.

The method of forming speech signal recognition features based on wavelet packets is defined as follows. In the wavelet spectrum formed on the basis of wavelet packets, the power of the calculated wavelet coefficients within each decomposition subband is averaged. The averaged coefficients are normalized and, according to their place in the overall pyramid of wavelet packets, from left to right and from top to bottom, converted into a vector of recognition features. The specific values of the average power of the wavelet coefficients in each decomposition subband thus serve as the primary features of speech signal recognition. It should be noted that, in general, the features obtained in this way will be correlated, so it is advisable to apply an additional decorrelating transformation to the vector, which will also reduce the size of the secondary recognition feature space [24].

For the wavelet coefficients Z_{m,n}(i) (Fig. 2), the index m corresponds to the number of the decomposition level, the index n corresponds to the number of the subband at level m, and i = 0, 1, …, N/2^m − 1 numbers the wavelet coefficients within a subband at level m. In wavelet packets, several decomposition bases, nested within one another, are used for the complete decomposition.

Consider the sequence of stages of the proposed method (Fig. 3).

Figure 3: Scheme of speech signal recognition features selection for biometric identification of a person

Initially, the input sequence of discrete samples of the speech signal S(t_i) of length N, a power of 2, with i = 0, 1, 2, …, (N − 1), is decomposed into K ≤ log₂(N) levels by applying the wavelet packet algorithm.
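The full wavelet packet recursion, in which both the low-pass and the high-pass branches are split at every level, can be sketched as follows; the Haar filter pair again stands in for the actual basis (an assumption), and circular indexing replaces boundary handling:

```python
import numpy as np

def wp_decompose(s, h, g, K):
    """Build the packet tree {(m, n): Z_{m,n}} by splitting every subband,
    low-pass and high-pass alike, down to level K."""
    tree = {(0, 0): np.asarray(s, dtype=float)}
    for m in range(1, K + 1):
        for n in range(2 ** (m - 1)):
            parent = tree[(m - 1, n)]
            L = len(parent)
            lo = np.array([sum(parent[(t + 2 * k) % L] * h[t] for t in range(len(h)))
                           for k in range(L // 2)])
            hi = np.array([sum(parent[(t + 2 * k) % L] * g[t] for t in range(len(g)))
                           for k in range(L // 2)])
            tree[(m, 2 * n)], tree[(m, 2 * n + 1)] = lo, hi
    return tree

h = np.array([1.0, 1.0]) / np.sqrt(2.0)   # Haar stand-in for the chosen basis
g = np.array([1.0, -1.0]) / np.sqrt(2.0)

tree = wp_decompose(np.arange(16.0), h, g, K=2)
# The tree holds 2*2**K - 1 = 7 sequences, including the input one;
# level m has 2**m subbands of length N / 2**m.
```

At level m the tree holds 2^m subbands of length N/2^m, which matches the counting of sequences given for the method below.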
At the first level, the input array S(t_i) is decomposed into two sets Z_{1,0}(i) and Z_{1,1}(i) by convolving S(t_i) with the sequences {h} and {g}, which are determined by the characteristics of the low-pass H and high-pass G filters. At the second level, the same convolution procedures are repeated with each of the obtained subsets Z_{1,0}(i) and Z_{1,1}(i). The process of full decomposition, called wavelet packetization, consists of K steps similar to the first one [25].

These procedures can be represented in general by the following expressions:

Z_{m,2n}(i) = Σ_{t=0}^{N−1} Z_{m−1,n}(i) h_{m,n}(i),

Z_{m,2n+1}(i) = Σ_{t=0}^{N−1} Z_{m−1,n}(i) g_{m,n}(i),

where 1 ≤ m ≤ K and 0 ≤ n ≤ 2^(m−1) − 1. At the first level of decomposition, the samples of the speech signal S(t_i) themselves are used as Z_{0,0}(i). The values of the elements of the sequences {h} and {g} depend on the chosen type of scaling function φ(x) and wavelet function ψ(x) and, according to (1) and (2), are calculated as follows:
h_{m,n}(i) = 2^(−m/2) φ(2^(−m) i − n),

g_{m,n}(i) = 2^(−m/2) ψ(2^(−m) i − n).

As a result of the transformations performed during the decomposition, the sequence of samples of the speech signal S(t_i) is decomposed into R = 2·2^K − 1 sequences (including the input one) of length N/2^m, each of which represents one of the frequency subbands of the input speech signal [26].

Different realizations of speech signals will have different energy distributions over the frequency subbands, since their Fourier spectra will also differ. If the average power of the wavelet coefficients is calculated in each subband, the set of obtained values will reflect the wavelet content of the speech signal subbands, similar to the frequency representation. Moreover, the transition to the average power allows relatively short input realizations to be used for recognition, which is an important point for rapid analysis systems. The band of frequencies falling into each subband narrows as the decomposition level grows, which follows from the wavelet packet scheme (Fig. 2). The average powers of the wavelet coefficients in each subband, which are used as speech recognition features, are calculated according to the expression

P̄_{m,n} = [ Σ_{i = n·N/2^m}^{((n+1)·N/2^m) − 1} (Z_{m,n}(i))² ] / (N/2^m).   (3)

To eliminate the sensitivity of the features to changes in the average power of the speech signal realization, the values of P̄_{m,n} obtained by (3) are normalized relative to the average power P̄_{0,0} of the input speech signal realization S(t_i) [27].

Finally, the feature vector Y = {y_r}_R, consisting of an ordered sequence of averaged powers of wavelet coefficients, is formed by sequentially recording, for all m and n, the calculated normalized values of P̄_{m,n} from left to right and from top to bottom. The number r of a feature is determined by the expression r = 2^m − 1 + n and corresponds to the ordinal number of the component of the vector Y = {y_r}_R.

An important point in implementing the method is the choice of the scaling function φ(x) and the wavelet ψ(x). First, the size of the time-frequency window should be taken into account; second, the smoothness and symmetry of the underlying wavelet; and third, the order of approximation. Correct selection of the wavelet basis for the speech signal significantly reduces the number of non-zero wavelet coefficients Z_{m,n}(i), which substantially reduces the size of the recognition features and makes them much more informative [28].

4. Results and Discussion

Practical experiments were conducted to investigate the contrast of the speech recognition feature vectors formed by the proposed method. In particular, Figs. 4–5 show the feature vectors of a speech signal calculated in different wavelet decomposition bases. In the first case (Fig. 4), a wavelet packet based on the Haar basis was used to obtain the wavelet coefficients; it provides a relatively coarse approximation of the speech signal, which accordingly affects the informativeness of the recognition features. In the second case (Fig. 5), the speech signal recognition features are calculated with the smoother Meyer function, which makes the features more informative.

Figure 4: Components of the speech recognition feature vector based on the Haar basis

Figure 5: Components of the speech recognition feature vector based on the Meyer basis

A comparative analysis of the results in Figs. 4–5 shows that, when a smoother basis function is chosen, the number of y_r values close to zero in the feature vector Y = {y_r}_R increases and the informativeness of the decomposition grows, unlike with the Haar function, which yields less informative recognition features. Thus, the use of basis wavelet functions consistent in smoothness with the studied speech signal reduces the size of the recognition features and increases their informativeness.

To confirm the hypothesis that it is expedient to build speech signal recognition systems on wavelet packets, using the values of Y = {y_r}_R obtained by expression (3) as recognition features, the developed method of forming recognition features was studied in comparison with the approach proposed in [15], which is based on the spectral components of the classical harmonic Fourier transform (Fig. 6).

Figure 6: Components of the Fourier-based speech recognition feature vector

The experiment used realizations of speech signals with a duration of N = 512 samples, and the decomposition was performed to m = 5 levels. This yielded a feature vector Y = {y_r}_R of length R = 32, with 16 wavelet coefficients averaged in each subband. For the recognition features based on the Fourier transform, the spectrum was divided into 32 bands of 16 coefficients each [29].
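The feature construction described above, expression (3) with the P̄_{0,0} normalization and the r = 2^m − 1 + n ordering, can be sketched as follows; the tiny hand-built Haar packet tree for K = 1 is only a stand-in for a real decomposition of a recorded phrase:

```python
import numpy as np

def wp_features(tree, K):
    """Feature vector Y = {y_r}: average power of each subband (expression (3)),
    normalized by the average power of the input realization, ordered top to
    bottom and left to right with r = 2**m - 1 + n."""
    P00 = np.mean(tree[(0, 0)] ** 2)          # average power of S(t_i)
    Y = np.zeros(2 * 2 ** K - 1)
    for m in range(K + 1):
        for n in range(2 ** m):
            Y[2 ** m - 1 + n] = np.mean(tree[(m, n)] ** 2) / P00
    return Y

# Hand-built Haar packet tree for K = 1 (illustration only).
s = np.array([1.0, 3.0, 2.0, 4.0])
tree = {
    (0, 0): s,
    (1, 0): np.array([s[0] + s[1], s[2] + s[3]]) / np.sqrt(2.0),
    (1, 1): np.array([s[0] - s[1], s[2] - s[3]]) / np.sqrt(2.0),
}
Y = wp_features(tree, K=1)
# Y[0] == 1 by construction; for an orthogonal basis Y[1] + Y[2] == 2 * Y[0],
# since the subbands are half as long as the input.
```

In the experiment described above only the deepest level is used (K = 5 gives 32 subbands of 16 coefficients for N = 512); the same loop restricted to m = K reproduces that 32-component vector.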
To illustrate the effectiveness of the proposed method more clearly (Fig. 7), an experiment was conducted with 30 pre-recorded audio recordings of the same semantic constructions by two different speakers: each speaker pronounced the words "1", "2", "3", "4", "5" 30 times each. The average value of the Root Mean Square Error (RMSE) serves as an objective indicator of the effectiveness of the developed method:

σ = (1/N_c) Σ_{i=1}^{N_c} √( Σ_{t=1}^{n} (Y(t) − Ŷ(t))² / Σ_{t=1}^{n} Y(t)² ) → min,

computed over all N_c = 30 realizations for each speaker, so the method that shows the lowest RMSE is the best. RMSE is one of many metrics used to evaluate model performance: the detected errors are squared and averaged [30]. This experimental study is needed to obtain an objective measure of the interclass distance of the recognition features, i.e., the scatter of the features when comparing different realizations of speech signals [31].

Figure 7: Thirty realizations of speech signal recognition features using the bases: (a) Meyer, (b) Haar, (c) Fourier

The results of the pairwise comparison of the features of the test speech signals obtained with the Haar and Meyer wavelet-based methods and with the Fourier spectral-coefficient method are presented in Table 1.

Table 1
Comparative analysis of the existing and proposed methods

Phrases   Haar, σ   Meyer, σ   Fourier, σ
"1"       0.116     0.053      0.219
"2"       0.153     0.075      0.244
"3"       0.143     0.069      0.231
"4"       0.178     0.081      0.276
"5"       0.162     0.067      0.248

The analysis of the obtained results shows that the contrast of the recognition features of the test speech signals generated by the developed method without the influence of noise is on average 3.8 times higher than that of the method using the Fourier spectral coefficients.
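The error measure above can be sketched as follows; the reference vector and the perturbed realizations are synthetic stand-ins (assumptions for illustration), since the recorded phrases themselves are not reproduced here:

```python
import numpy as np

def normalized_rmse(Y_ref, Y_real):
    """Normalized RMSE between a reference feature vector and one
    realization, as in the sigma measure above."""
    return np.sqrt(np.sum((Y_ref - Y_real) ** 2) / np.sum(Y_ref ** 2))

# Synthetic stand-ins: a 32-component reference vector and N_c = 30 noisy copies.
rng = np.random.default_rng(1)
Y_ref = np.linspace(1.0, 0.1, 32)
realizations = [Y_ref + rng.normal(0.0, 0.02, 32) for _ in range(30)]

# Average over all realizations; the method with the smallest sigma wins.
sigma = np.mean([normalized_rmse(Y_ref, Y) for Y in realizations])
```

Because the numerator and denominator share the same scale, sigma is insensitive to a common gain applied to both vectors, which is consistent with the power normalization of the features.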
To investigate the effect of noise on the robustness of the feature vectors formed from wavelet packets in the Meyer and Haar bases and from the Fourier energy spectrum, several experiments were conducted in which white noise at a signal-to-noise ratio of 10 dB was added to the speech signal (the noise power was measured in the analysis band) [32]. Fig. 8 shows all three feature vectors at the same noise power.

Figure 8: Thirty realizations of speech recognition features obtained at a signal-to-noise ratio of 10 dB based on the bases: (a) Meyer, (b) Haar, and (c) Fourier

Table 2 shows the results of a comparative analysis of the stability of the speech signal recognition features obtained from wavelet packets in the Meyer basis and from the Fourier energy spectrum. At signal-to-noise ratios of 10, 20, and 30 dB, the total deviation σ of the obtained features from their reference values was calculated; the values were then normalized relative to the maximum.

Table 2
Comparative analysis of the existing and proposed methods under the influence of noise of different power

Phrases   Meyer + 10 dB, σ   Meyer + 20 dB, σ   Meyer + 30 dB, σ   Fourier + 10 dB, σ   Fourier + 20 dB, σ   Fourier + 30 dB, σ
"1"       0.183              0.119              0.074              0.352                0.281                0.239
"2"       0.231              0.158              0.095              0.382                0.314                0.254
"3"       0.227              0.149              0.087              0.381                0.325                0.261
"4"       0.246              0.161              0.097              0.403                0.347                0.286
"5"       0.214              0.122              0.076              0.367                0.302                0.258

Thus, it was established that at a signal-to-noise ratio of 10 dB the features obtained by the developed method give a very acceptable result, namely a 1.6–2-fold increase in stability compared to the features obtained from the traditional Fourier spectrum, for which the total deviation σ of the features is already unacceptable at a signal-to-noise ratio of 20 dB.

5. Conclusions and Future Research

In this research, the task of extracting speech signal recognition features for remote voice identification of a person was solved. The remote setting imposes several restrictions, namely: (1) minimum processing time of the speech signal realization, since the required recognition reliability is achieved by statistical processing of the obtained results; (2) reduced dimensionality of the recognition features, since feature extraction and classification take place on the transmitting side of the communication channel, which in turn imposes constraints of computing power and the influence of noise in the communication channel.

The studies have shown the ability of the developed method to form recognition features based on wavelet packets in the Meyer basis. The most important indicator of the effectiveness of the experiment is the increase in the contrast of the recognition features, i.e., the increase in the interclass distance in the formed feature system for speech signals with a similar frequency-temporal structure. Even a visual analysis of the obtained values Y = {y_r}_R (Figs. 7–8) reveals significant differences in the structure of the feature vectors formed from relatively short realizations, which supports the use of the presented method for speech signal recognition in rapid analysis systems. Since the recognition features are distributed according to the normal law, the subsequent procedure for deciding whether speech signal realizations belong to a particular class is greatly simplified.

After analyzing the given conditions of the voice identification system, a method for extracting speech signal recognition features was developed that provides more informative spectral characteristics of the speech signal, improving the efficiency of their further classification under the influence of noise. This paper considered the possibility of applying the theory of time-scale analysis to this problem, namely, the development of a method for extracting recognition features based on the wavelet packet transform using the orthogonal Meyer basis wavelet function and subsequent averaging of the wavelet coefficients that fall in the frequency band of the corresponding wavelet packet.

Experimental studies have shown the ability of the developed method to generate speech signal recognition features with a close frequency-temporal structure based on wavelet packets in the Meyer basis: at a signal-to-noise ratio of 10 dB, the obtained features are 1.6–2 times more robust to noise than the features obtained from the traditional Fourier spectrum, for which the total root-mean-square deviation of the features is unacceptable already at a signal-to-noise ratio of 20 dB. The analysis of the results also shows that the contrast of the recognition features of test speech signals generated by the developed method without the influence of noise is on average 3.8 times higher than that of the method using Fourier spectral coefficients.

The authors see the further direction of research in identifying the potential capabilities of the developed method of speech signal recognition for person identification under the very difficult conditions of overlapping voices of several speakers, in particular with similar acoustic characteristics, as well as in selecting and justifying the criterion for implementing the recognition procedures. It should be noted that there have been virtually no studies of voice identification capabilities for this most difficult case.

References
[1] J. Anand Babu, et al., Secure Data Retrieval System using Biometric Identification, in: International Conference on Data Science and Information System (ICDSIS) (2022) 1–4. doi: 10.1109/ICDSIS55133.2022.9915968.
[2] O. Romanovskyi, et al., Prototyping Methodology of End-to-End Speech Analytics Software, in: 4th International Workshop on Modern Machine Learning Technologies and Data Science, vol. 3312 (2022) 76–86.
[3] I. Iosifov, et al., Transferability Evaluation of Speech Emotion Recognition Between Different Languages, Advances in Computer Science for Engineering and Education 134 (2022) 413–426. doi: 10.1007/978-3-031-04812-8_35.
[4] O. Iosifova, et al., Analysis of Automatic Speech Recognition Methods, in: Workshop on Cybersecurity Providing in Information and Telecommunication Systems, vol. 2923 (2021) 252–257.
[5] O. Romanovskyi, et al., Automated Pipeline for Training Dataset Creation from Unlabeled Audios for Automatic Speech Recognition, Advances in Computer Science for Engineering and Education IV, vol. 83 (2021) 25–36. doi: 10.1007/978-3-030-80472-5_3.
[6] O. Iosifova, et al., Techniques Comparison for Natural Language Processing, in: 2nd International Workshop on Modern Machine Learning Technologies and Data Science, vol. 2631, no. I (2020) 57–67.
[7] H. Monday, et al., Shared Weighted Continuous Wavelet Capsule Network for Electrocardiogram Biometric Identification, in: 18th International Computer Conference on Wavelet Active Media Technology and Information Processing (ICCWAMTIP) (2021) 419–425. doi: 10.1109/ICCWAMTIP53232.2021.9674078.
[8] L. Zhu, et al., An Efficient and Privacy-Preserving Biometric Identification Scheme in Cloud Computing, IEEE Access 6 (2018) 19025–19033. doi: 10.1109/ACCESS.2018.2819166.
[9] J. Upadhyay, et al., Biometric Identification using Gait Analysis by Deep Learning, in: Pune Section International Conference (PuneCon) (2020) 152–156. doi: 10.1109/PuneCon50868.2020.9362402.
[10] C. Liu, et al., An Efficient Biometric Identification in Cloud Computing with Enhanced Privacy Security, IEEE Access 7 (2019) 105363–105375. doi: 10.1109/ACCESS.2019.2931881.
[11] O. Attallah, Multi-tasks Biometric System for Personal Identification, International …
… for Dialect Identification, IEEE Access 8 (2020) 174871–174879. doi: 10.1109/ACCESS.2020.3020506.
[14] Y. Dong, X. Yang, Affect-Salient Event Sequences Modelling for Continuous Speech Emotion Recognition Using Connectionist Temporal Classification, in: 5th International Conference on Signal and Image Processing (ICSIP) (2020) 773–778. doi: 10.1109/ICSIP49896.2020.9339383.
[15] R. Hidayat, A. Winursito, Analysis of Amplitude Threshold on Speech Recognition System, in: International Seminar on Application for Technology of Information and Communication (iSemantic) (2020) 449–453. doi: 10.1109/iSemantic50169.2020.9234214.
[16] Z. Qing, W. Zhong, W. Peng, Research on Speech Emotion Recognition Technology Based on Machine Learning, in: 7th International Conference on Information Science and Control Engineering (ICISCE) (2020) 1220–1223. doi: 10.1109/ICISCE50968.2020.00247.
[17] B. Kashyap, et al., Machine Learning-Based Scoring System to Predict the Risk and Severity of Ataxic Speech Using Different Speech Tasks, IEEE Transactions on Neural Systems and Rehabilitation Engineering 31 (2023) 4839–4850. doi: 10.1109/TNSRE.2023.3334718.
[18] H. Park, Y. Chung, J.-H. Kim, Deep Neural Networks-based Classification Methodologies of Speech, Audio and Music, and its Integration for Audio Metadata Tagging, J. Web Eng. 22(1) (2023) 1–26. doi: 10.13052/jwe1540-9589.2211.
Conference on Computational Science [19] O. Lavrynenko, et al., Method of Semantic and Engineering (CSE) and International Coding of Speech Signals based on Conference on Embedded and Empirical Wavelet Transform, 4th Ubiquitous Computing (EUC) (2019) International Conference on Advanced 110–114. doi: 10.1109/CSE/EUC.2019. Information and Communication 00030. Technologies (AICT) (2021) 18–22. doi: [12] M. Aliaskar et al., Human Voice 10.1109/AICT52120.2021.9628985. Identification Based on the Detection of [20] A. Dutt, P. Gader, Wavelet Fundamental Harmonics, 7th Multiresolution Analysis Based Speech International Energy Conference Emotion Recognition System Using 1D (ENERGYCON) (2022) 1–4. doi: CNN LSTM Networks, Transactions on 10.1109/energycon53164.2022.9830471. Audio, Speech, and Language Proces. 31 [13] R. Kethireddy, et al., Mel-Weighted (2023) 2043–2054. doi: Single Frequency Filtering Spectrogram 10.1109/TASLP.2023.3277291. 161 [21] C. Zhang, et al., Research on Extracting Transformation Optimized by Convo- Algorithm of Speech Eigenvalue Based lutional Autoencoders, Transact. Neural on Wavelet Packet Transform and Netw. Learn. Syst. 34(3) (2023) 1395– Gammatone Filter, 3rd Information 1405. doi: 10.1109/TNNLS.2021.31053 Technology, Networking, Electronic and 67. Automation Control Conference (ITNEC) [30] O. Lavrynenko, et al., Remote Voice User (2019) 165–169. doi: 10.1109/ITNEC. Verification System for Access to IoT 2019.8729292. Services Based on 5G Technologies, 12th [22] O. Lavrynenko, et al., A Method for International Conference on Intelligent Extracting the Semantic Features of Data Acquisition and Advanced Speech Signal Recognition Based on Computing Systems: Technology and Empirical Wavelet Transform, Applications (2023) 1042–1048. doi: Radioelectron. Comput. Syst. 107(3) 10.1109/IDAACS58523.2023.10348955. (2023) 101–124. doi: 10.32620/reks. [31] O. Veselska, et al., A Wavelet-Based 2023.3.09. Steganographic Method for Text Hiding [23] G. 
Frusque, O. Fink, Learnable Wavelet in an Audio Signal, Sensors 22(15) Packet Transform for Data-Adapted (2022) 1–25. doi: 10.3390/s22155832. Spectrograms, International Conference [32] V. Kuzmin, et al., Empirical Data on Acoustics, Speech and Signal Approximation Using Three-Dimen- Processing (2022) 3119–3123. doi: sional Two-Segmented Regression, 3rd 10.1109/ICASSP43922.2022.9747491. KhPI Week on Advanced Technology [24] B. Zhao, et al., A Spectrum Adaptive (2022) 1–6. doi: 10.1109/KhPIWeek Segmentation Empirical Wavelet 57572.2022.9916335. Transform for Noisy and Nonstationary Signal Processing, IEEE Access 9 (2021) 106375–106386. doi: 10.1109/ACCESS. 2021.3099500. [25] R. Odarchenko, et al., Empirical Wavelet Transform in Speech Signal Compression Problems, 8 International th Conference on Problems of Infocommunications, Science and Technology (2021) 599–602. doi: 10.1109/PICST54195.2021.9772156. [26] T. Zhang, et al., Multiple Vowels Repair Based on Pitch Extraction and Line Spectrum Pair Feature for Voice Disorder, J. Biomedical Health Inform. 24(7) (2020) 1940–1951. doi: 10.1109/JBHI.2020.2978103. [27] F. Costa, et al., Wavelet-Based Harmonic Magnitude Measurement in the Presence of Interharmonics, Transactions on Power Delivery 38(3) (2023) 2072– 2087. doi: 10.1109/TPWRD.2022. 3233583. [28] X. Zheng, Y. Tang, J. Zhou, A Framework of Adaptive Multiscale Wavelet Decomposition for Signals on Undirected Graphs, Transactions on Signal Proces. 67(7) (2019) 1696–1711. doi: 10.1109/TSP.2019.2896246. [29] B. Wang, J. Saniie, Massive Ultrasonic Data Compression Using Wavelet Packet 162
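The pipeline summarized in the conclusions — a full wavelet packet decomposition, averaging of the coefficients inside each packet's frequency band to form the feature vector, and a maximum-likelihood decision justified by the features' normality — can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the two-tap Haar filter pair stands in for the discrete Meyer wavelet (whose filters are designed in the frequency domain), and all function names are illustrative.

```python
import math

def haar_split(x):
    """One two-band analysis step with decimation.
    Haar is used here only as a self-contained stand-in for the
    Meyer wavelet filters of the paper's method."""
    n = len(x) // 2
    approx = [(x[2 * k] + x[2 * k + 1]) / math.sqrt(2) for k in range(n)]
    detail = [(x[2 * k] - x[2 * k + 1]) / math.sqrt(2) for k in range(n)]
    return approx, detail

def wavelet_packet_leaves(x, levels):
    """Full wavelet packet tree: both the approximation and the detail
    branch are split at every level, yielding 2**levels sub-bands."""
    bands = [list(x)]
    for _ in range(levels):
        bands = [half for band in bands for half in haar_split(band)]
    return bands

def band_features(x, levels):
    """Recognition features: the average absolute wavelet coefficient
    inside each packet's frequency band."""
    return [sum(abs(c) for c in band) / len(band)
            for band in wavelet_packet_leaves(x, levels)]

def gaussian_decide(feature, class_stats):
    """Maximum-likelihood class decision for one scalar feature under the
    normal law -- the simplification the features' normality allows.
    class_stats maps a label to its estimated (mean, std)."""
    best, best_ll = None, -math.inf
    for label, (mu, sigma) in class_stats.items():
        ll = -math.log(sigma) - (feature - mu) ** 2 / (2 * sigma ** 2)
        if ll > best_ll:
            best, best_ll = label, ll
    return best
```

For a constant signal, all energy falls into the all-lowpass band, so `band_features([1.0] * 8, 2)` gives a nonzero first feature and zero detail-band features; in actual use the 2^levels-dimensional vectors would be averaged over many frames of the speech realization before the per-class decision.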