Precision automated phonetic analysis of speech signals for information technology of text-dependent authentication of a person by voice

Oleg Bisikalo a, Olesia Boivan b, Nina Khairova c, Oksana Kovtun b and Viacheslav Kovtun a

a Vinnytsia National Technical University, Khmelnitske Shose str., 95, Vinnytsia, 21000, Ukraine
b Vasyl’ Stus Donetsk National University, 600-richchya str., 21, Vinnytsia, 21000, Ukraine
c National Technical University “Kharkiv Polytechnic Institute”, 2, Kyrpychova str., Kharkiv, 61002, Ukraine

Abstract
A model of the process of phonetic analysis of speech signals in the frequency and temporal spaces is presented in the article for the first time. In contrast to existing models, the generalization of the spectral characteristics of the studied speech signals is formalized in the presented model as an optimization task of minimizing the functional of relative entropy. The obtained mathematical apparatus made it possible to formulate a metric for quantitative estimation of the quality of phonetic analysis results and to propose an adaptive method of automated phonetic analysis with an integrated mechanism for counteracting the influence of Gaussian noise, present in the studied speech signal, on the final result. The adequacy and functionality of the proposed model and method have been proved empirically. The analysis of the experimental results also showed that the suitability of the studied speech material for the task of authenticating a person by voice or for speech recognition can be assessed from the value of the coefficient of variability, which is part of the metric proposed by the authors and is determined for the studied database of phonograms with recordings of voiced syllables of speech. The values of this coefficient determined for the studied phonemes can also be used to estimate the degree of their vocalization.
Keywords
Automated phonetic analysis, clustering of language units, computational linguistics, information technology, authentication of a person by voice.

IntelITSIS’2021: 2nd International Workshop on Intelligent Information Technologies and Systems of Information Security, March 24–26, 2021, Khmelnytskyi, Ukraine
EMAIL: obisikalo@gmail.com (O. Bisikalo); olesiaboivan@gmail.com (O. Boivan); khairova@kpi.kharkov.ua (N. Khairova); o.kovtun@donnu.edu.ua (O. Kovtun); kovtun_v_v@vntu.edu.ua (V. Kovtun)
ORCID: 0000-0002-7607-1943 (O. Bisikalo); 0000-0002-3512-0315 (O. Boivan); 0000-0002-9826-0286 (N. Khairova); 0000-0002-9139-8987 (O. Kovtun); 0000-0002-7624-7072 (V. Kovtun)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)

1. Introduction
Modern methods of computational linguistics [1-5] are created with a focus on technologies that process speech material automatically, without constant human control. This requires raising computer speech technologies to a fundamentally new level, which can be achieved only by complete automation of the process of phonetic analysis of speech signals. Phonemes form the basic level of language description and determine its informational and communicative characteristics. This is confirmed, in particular, by the way speech corpora are formed: phonograms of speech signals are accompanied by their transcription, which is nothing more than a sequence of phonemes. However, like any physiological process, speech is characterized by considerable variability, so there are a great number of variants of phoneme sounding. This circumstance explains the fact that no theoretical and software complex for effective automatic phonetic analysis of coherent speech has been created yet, although the need for it is extremely urgent.
So the object of the research presented in the article is the process of clustering the centers-phonemes in the spectral representation of the studied speech signals.

2. State-of-the-Art
The mechanism of automated phonetic analysis is an obligatory structural unit of information technology focused on solving such fundamental and applied phonetic problems as, for example, sound recognition, the study of suprasegmental linguistic characteristics, speech synthesis, etc. As noted, a final solution to the problem of computer phonetic analysis has not been found yet, but its relevance encourages research teams to creative search. There are a number of studies [6-9] based on the mathematical apparatus of digital signal processing. In them, speech signals are analyzed directly, without taking into account their physiologically determined phonetic structure. The speech signal is interpreted as a non-stationary multi-frequency signal and processed in order to determine the transfer function that generates it. After its determination, the analysis passes from the temporal space to the frequency space, where phonemes are determined, as a rule, by summarizing the results of the analysis of the signal energy. For the transition from the temporal to the frequency space, variants of the Fourier transform, linear perceptual coefficients or wavelet transforms are typically used; these signal processing methods are listed in order of increasing information content and computational complexity. The advantage of such studies is their strict mathematical adequacy, but they completely ignore the physiological mechanism of speech signal generation, so their automatic application to phonetic analysis yields results of only average quality. The second direction of research assumes the presence of a priori information about the transfer function of the articulatory tract.
In this direction, let us single out the methods [10, 11] for representing a speech signal as a vector of states of an a priori given dynamic system (the articulatory tract) with the help of a recursive filter, which allows smoothing out spikes at the formant frequencies over time. In this context, an improved version of the method of phonetic-formant analysis of the structure of a speech signal is also proposed [12], using linear perceptual coefficients with additional smoothing by the Newton-Raphson algorithm. The studies [13, 14] describe the PRAAT algorithm, which automatically finds the smoothest formant trajectory for short segments of the speech signal. The method is based on a variational polynomial approximation of short-time fragments of speech signals with the subsequent selection of the smoothest of them using the appropriate criterion. The method is fully automatic. However, the effectiveness of all these methods is mainly determined by the reliability of the applied a priori information. There are many well-known methods based on the acoustic-frequency model of the vocal tract, created as a result of studies in the acoustic-physiological direction [15-17]. However, this model was created as a tool for synthesizing speech signals, so its application to their analysis did not show outstanding results in terms of quality. There is also a known direction [18] of research based on the analysis of speech signals with interpolation of the appearance of phonemes from the energy peaks of the formants of the studied signal in the passband. To increase the effectiveness of these methods, a corpus of regional dialects of North America has been created, which contains 134,000 formants identified by human experts. However, the effectiveness of these methods is largely determined by the availability of such specialized corpora.
A fundamentally new direction of research is the study of the representation of phonemes by neurolinguistic structures of the human brain [18, 19]. In this context, phonetic analysis can also be viewed as a new tool for studying psychological and physiological phenomena. The methods that implement this concept are based on machine learning algorithms. Their effectiveness is completely determined by the representativeness of the training information used, the generalization of which has just begun. So, let us try to take into account the strengths and weaknesses of the above-mentioned methods by defining the subject of study as the methods of the acoustic theory of speech formation and information theory, the results of which will be generalized using the methods of probability theory and mathematical statistics.

3. Materials and Methods
3.1 Statement of research
According to the provisions of the complex theory of phonation, the quantum of oral speech is the phoneme, the number of which is finite and differs between languages. It is the combination of phonemes that forms the semantic quantum of speech, the morpheme. Neurolinguistic research suggests that, despite the psycho-physiologically determined variability of phoneme pronunciation, in the human mind in the process of learning the appropriate language each $r$-th phoneme forms a cluster $X_r = \{x_{rj}\}$, $j = \overline{1, J_r}$, of its speech pattern with center $x_r^* \in X_r$, $r = \overline{1, R}$, where $x_{rj}$ is the $j$-th allophone of the $r$-th phoneme, $J_r$ is the power of the studied set of allophones of the $r$-th phoneme, and $R$ is the total number of phonemes in the studied language. Accordingly, in the process of perception of a speech signal $X(t)$ by a person in discrete time, the signal is represented by a set of characteristic vectors $x(t)$ extracted from the corresponding segments of the original speech signal of duration $\Delta t$.
The value of $\Delta t$ is chosen so that the fragment of the speech signal limited in this way can be considered quasi-stationary, with a duration $\tau \in [10, 20]$ ms greater than the average duration of a phonemic utterance in the studied language. Then the phonetic analysis of the speech signal $X(t)$, represented by the set $x = (x_1, x_2, \ldots, x_L)$ of characteristic vectors $x(t)$, $t = 1, 2, \ldots, L$, will be understood as the task of assigning each segment $x(t)$ to one of the classes from the set $X_r$: $x(t) \equiv x_v^* \in X_v$, where $X_v$ is a subset of the set $X_r$ determined as a result of classification, $v \le R$. This classification task can be solved using machine learning methods or methods of information theory and mathematical statistics, for example, based on the functional of relative entropy [20]. Let the probability density $P_x$ of the multidimensional matrix $x$ belong to some set of alternative multidimensional probability densities $\{P_r\}$ defined on the finite set $\{X_r\}$ of phonemes of the studied language: $P_x \in \{P_r\}$. Therefore, the task of phonetic analysis of the speech signal represented by the set $x$ can be reduced to finding such a probability density from the set of alternatives $\{P_r\}$ whose difference from $P_x$ is minimal according to the selected metric $\mu$. If we consider the distribution law of the appearance of each phoneme to be normal with zero mathematical expectation and autocorrelation matrix $K_r$ of dimension $n \times n$, $n \le L$: $P_r = \mathrm{Norm}(0, K_r)$, then a necessary condition for solving the above task is to calculate the set of differences $\mu(P_x \| P_r)$ between the empirical distribution $P_x$ and each alternative from the set $\{P_r\}$.
When studying speech signals in frequency space, a characteristic parameter for estimating the desired differences for the set of alternatives $\{P_r\}$ is the set of values of power spectral densities $\{G_r(f)\}$, $\forall f \in \{1, F\}$: $G_r(f) > 0$, where $F$ is the upper limit of the frequency range of the empirical speech signal. We take this into account by defining the required estimates of differences as a functional of relative entropy:

$$\mu(P_x \| P_r) = \frac{1}{2F} \int_{-F}^{F} \left[ \frac{G_x(f)}{G_r(f)} - \ln \frac{G_x(f)}{G_r(f)} - 1 \right] df, \quad (1)$$

where $r = \overline{1, R}$ and $G_x(f)$ is the estimate of the power spectral density of the empirical speech signal represented by the set $x$. It is possible to identify a stationary stochastic process $P_r$ in metric (1) by means of spectral analysis of the data $\{x_{rj}\}$, $r = \overline{1, R}$, $j = \overline{1, J}$:

$$G_r(f) = \frac{1}{J} \sum_{j=1}^{J} G_{rj}(f), \quad r = \overline{1, R}. \quad (2)$$

According to the Wiener–Khinchin theorem, the estimate of the power spectral density (2) is related to the autocorrelation matrix $K_r$ of the empirical speech signal $x$ by the discrete Fourier transform. However, such an estimate is possible only under the condition $n \gg 1$, while the estimation of the power spectral density $G_r(f)$ for a small amount of experimental data, $J \ll \infty$, is of practical interest. So, the aim of the study is the analytical formalization of the process of phonetic analysis of a speech signal in frequency and temporal spaces based on the functional of relative entropy, oriented towards use in information technology of text-dependent voice authentication.
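To make the spectral criterion concrete, the following minimal pure-Python sketch evaluates a discretized version of the relative-entropy functional (1) and the averaged PSD estimate (2). It assumes both PSDs are sampled on the same uniform frequency grid covering $[-F, F]$; the function names are ours, not taken from the paper.

```python
import math

def relative_entropy(Gx, Gr, F):
    """Discretized form of the functional (1):
    mu = (1/2F) * integral over [-F, F] of [Gx/Gr - ln(Gx/Gr) - 1] df,
    with both PSDs sampled on the same uniform frequency grid."""
    df = 2.0 * F / len(Gx)
    total = 0.0
    for gx, gr in zip(Gx, Gr):
        ratio = gx / gr
        total += (ratio - math.log(ratio) - 1.0) * df
    return total / (2.0 * F)

def average_psd(psds):
    """Estimate (2): per-frequency arithmetic mean over the J allophone PSDs."""
    J = len(psds)
    return [sum(col) / J for col in zip(*psds)]

# The functional vanishes for identical spectra and is positive otherwise.
G = [1.0, 2.0, 3.0, 2.0]
print(relative_entropy(G, G, F=4000.0))                  # 0.0
print(relative_entropy([g * 2 for g in G], G, F=4000.0) > 0.0)  # True
```

The integrand $u - \ln u - 1 \ge 0$ with equality only at $u = 1$, which is what makes (1) usable as a non-negative spectral dissimilarity measure.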
The objectives of the research are:
- creation of a model of the process of phonetic analysis of a speech signal in frequency and temporal spaces;
- formalization of the metric for estimation of the quality of the results of the process of phonetic analysis based on the functional of relative entropy;
- formalization of the adaptive method of automated phonetic analysis focused on achieving optimal results, assessed in the created metric;
- empirical proof of the adequacy of the created mathematical model and analysis of the functionality of the created method.

3.2 Model of phonetic analysis of speech signals in temporal and frequency spaces
A fundamental issue for the analytical formalization of the process of phonetic analysis of speech signals is to determine the centers of clusters $x_r = x_{r v_r}$, $v_r \in \{1, J\}$, for empirical realizations $\{x_{rj}\}$, $j \in \{1, J\}$, in the metric $\mu(x)$:

$$v_r = \arg\min_{i \in [1, J]} \sum_{j=1}^{J} \mu_i(x_{rj}), \quad (3)$$

where the characteristic

$$\mu_i(x_{rj}) = \frac{1}{2F} \int_{-F}^{F} \left[ \frac{G_{rj}(f)}{G_{ri}(f)} - \ln \frac{G_{rj}(f)}{G_{ri}(f)} - 1 \right] df, \quad i, j \in \{1, J\}, \quad (4)$$

is a functional of relative entropy between the $i$-th and $j$-th parametric interpretations of allophones of the $r$-th phoneme in frequency space. Semantic generalization of expressions (1) and (4) allows us to determine the frequency representation of the center of the cluster of the $r$-th phoneme as

$$G_r^{(J)}(f) = G_{rv}(f), \quad v \le J, \quad r = \overline{1, R}. \quad (5)$$

However, with the identified autocorrelation matrix $K_r$ it is possible to analytically formalize, for the analysis of the studied speech signals in temporal space, an analog of criterion (3), in which the classification decision is made on the basis of a set of values of the statistic defined by the expression

$$\rho_r(x) = \frac{1}{2n} \left[ \operatorname{tr}\!\left( \hat{K} K_r^{-1} \right) - \ln \frac{|\hat{K}|}{|K_r|} - n \right], \quad r = \overline{1, R}, \quad (6)$$

where $\hat{K}$ is a sample estimate of the autocorrelation matrix of the studied empirical speech signal $x = x(t)$, $t = 1, 2, \ldots, L$, and $\operatorname{tr}(A)$ is the trace of the matrix $A$.
The sample estimate $\hat{K}$ is determined based on the following considerations. Let the speech pattern $X_r$ of the $r$-th phoneme be determined on the basis of the analysis of the set of its utterances $x_{rj}$, $j = \overline{1, J_r}$: $X_r = \{x_{rj}\}$. In this case, each utterance $x_{rj}$ is formed by a sequence of $L$ samples $\{x_{rj}(t)\}$ obtained with the periodicity $T = (2F)^{-1}$. Divide this sequence into frames of duration $n$ samples, $n \ll L$, grouping them into a set of data vectors $\{x_{rji}\}$ of dimension $n \times (L - n)$. Then the sample estimate of the hypothetical normal distribution is defined as the arithmetic mean in the form

$$\hat{K}_{rj} = \frac{1}{L - n} \sum_{i=1}^{L - n} x_{rji} x_{rji}^{T}, \quad j = \overline{1, J_r}, \quad (7)$$

where $T$ symbolizes the transposition operation. Substituting the value of the sample estimate (7) into expression (6), we obtain for the pattern $X_r$ a matrix of statistics of dimension $J_r \times J_r$:

$$\rho_{rjk} = \frac{1}{2n} \left[ \operatorname{tr}\!\left( \hat{K}_{rj} K_{rk}^{-1} \right) - \ln \frac{|\hat{K}_{rj}|}{|K_{rk}|} - n \right], \quad k, j = \overline{1, J_r}. \quad (8)$$

We find the sums of the values of the columns of the matrix (8), $\sum_{j=1}^{J_r} \rho_{rjk} = \rho_{rk}$, $k = \overline{1, J_r}$, and analytically formalize, oriented on the description of the studied speech signal in the temporal space, the analog of criterion (3):

$$v_r = x_r^* = x_{r\theta}, \quad \theta = \arg\min_{k} \rho_{rk}, \quad r = \overline{1, R}. \quad (9)$$

Determined according to expression (7) for $j = \theta$, the sample autocorrelation matrix $\hat{K}_{r\theta}$ for the center of the cluster $x_r^*$ will determine the optimal decisive statistic when substituted into (6). After analyzing expressions (4) and (6), we can conclude that the entropy of the estimate of the center of the cluster of a phoneme will decrease with increasing value of $J$. Therefore, with the center of the cluster for the $r$-th phoneme determined by expression (3) or (9), it is possible to determine the optimal estimate of the power spectral density $G_r^{(J)}$ for this phoneme on the basis of expression (5).
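The temporal-space computations above can be sketched in a few dozen lines of pure Python. The sketch below implements the sample autocorrelation estimate (7) from overlapping frames and the Gaussian relative-entropy statistic (6) using plain Gauss-Jordan elimination for the small matrix sizes involved. This is our illustrative reconstruction, not the authors' implementation; in it `Kr` stands for a phoneme's model autocorrelation matrix.

```python
import math

def frames_outer_mean(x, n):
    """Sample autocorrelation estimate (7): mean of the outer products
    x_i x_i^T over the L - n overlapping length-n frames of utterance x."""
    m = len(x) - n
    K = [[0.0] * n for _ in range(n)]
    for i in range(m):
        f = x[i:i + n]
        for a in range(n):
            for b in range(n):
                K[a][b] += f[a] * f[b] / m
    return K

def gauss_stat(Khat, Kr):
    """Statistic (6): (1/2n) * [tr(Khat Kr^{-1}) - ln(|Khat|/|Kr|) - n],
    i.e. the relative entropy between two zero-mean Gaussians."""
    n = len(Khat)

    def det_and_inv(M):
        # Gauss-Jordan elimination with partial pivoting; adequate for small n.
        a = [list(M[i]) + [1.0 if i == j else 0.0 for j in range(n)]
             for i in range(n)]
        det = 1.0
        for c in range(n):
            p = max(range(c, n), key=lambda r: abs(a[r][c]))
            if p != c:
                a[c], a[p] = a[p], a[c]
                det = -det
            piv = a[c][c]
            det *= piv
            a[c] = [v / piv for v in a[c]]
            for r in range(n):
                if r != c and a[r][c] != 0.0:
                    k = a[r][c]
                    a[r] = [vr - k * vc for vr, vc in zip(a[r], a[c])]
        return det, [row[n:] for row in a]

    det_hat, _ = det_and_inv(Khat)
    det_r, inv_r = det_and_inv(Kr)
    tr = sum(Khat[i][k] * inv_r[k][i] for i in range(n) for k in range(n))
    return (tr - math.log(det_hat / det_r) - n) / (2.0 * n)

# The statistic is zero when the matrices coincide and positive otherwise.
I2 = [[1.0, 0.0], [0.0, 1.0]]
print(gauss_stat(I2, I2))   # 0.0
```

Filling the $J_r \times J_r$ matrix (8) then amounts to calling `gauss_stat` for every pair of utterances and summing columns to pick the minimizer of (9).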
If such actions are implemented for all $R$ phonemes of the studied language, then we obtain a phonetic-acoustic database, the universality of which will increase with the increasing number of phonemes pronounced during the formation of the model, i.e. the parameter $J$.

3.3 Formalization of metrics for qualitative evaluation of the results of phonetic analysis of the studied language in the paradigm of the proposed model
Based on the provisions of the acoustic theory of speech formation, we present a model of the $j$-th utterance of the $r$-th phoneme by the autoregression function of the form

$$x_{rj}(l) = \sum_{s=1}^{S} a_{rj}(s)\, x_{rj}(l - s) + \eta_{rj}(l), \quad l = 1, 2, \ldots, \quad j = \overline{1, J}, \quad r = \overline{1, R}, \quad (10)$$

which is uniquely determined by the set of coefficients $\{a_{rj}(s), s = \overline{1, S}\}$ of order $S \le n$ and the variance $\sigma_{rj}^2$ of the generating Gaussian process $\{\eta_{rj}(l), l = 1, 2, \ldots\}$. A property of such a representation that is relevant for us is that the estimate of the power spectral density of the studied signal obtained on the basis of the autoregression model with a finite set of coefficients $\{a_{rj}(s), s = \overline{1, S}\}$ will always satisfy the regularity condition $G_r(f) > 0$. Moreover, given the functional of relative entropy used by us for phonetic analysis, especially relevant for us is the possibility of normalizing the speech signals described by the autoregression model of the form (10) to the value of their specific entropy $h(x_{rj}) = 0.5 \ln \sigma_{rj}^2$, which allows the desired level to be achieved by setting the variance $\sigma_{rj}^2 = \sigma_0^2 = \mathrm{const}$ of the generating process $\eta_{rj}$. Accordingly, if we take into account the fact that the variance $\sigma_0^2$ does not change when a person utters not just phonemes but words or even short phrases, then expression (4) can be simplified to the form

$$\mu_i(x_{rj}) = \frac{1}{2F} \int_{-F}^{F} \left[ \frac{G_{rj}(f)}{G_{ri}(f)} - 1 \right] df = \mu_{rji}, \quad i, j = \overline{1, J}.$$
(11)

When analyzing the speech signal in the temporal space, the analog of expression (11) will be the target adaptation of expression (6), namely:

$$\rho_i(x_{rj}) = \frac{\sigma_i^2(x_{rj})}{\sigma_0^2} - 1, \quad (12)$$

where the variance $\sigma_i^2(x_{rj})$ is determined by the expression

$$\sigma_i^2(x_{rj}) = \frac{1}{L - S} \sum_{l = S + 1}^{L} \left( y_{rj}^{(i)}(l) \right)^2, \quad (13)$$

in which the parameter $y_{rj}^{(i)}(l)$ characterizes the change of the studied speech signal $x_{rj} = \{x_{rj}(l)\}$, $l \le L$, after its passage through the $i$-th whitening filter

$$y_{rj}^{(i)}(l) = x_{rj}(l) - \sum_{s=1}^{S} a_{ri}(s)\, x_{rj}(l - s). \quad (14)$$

Expressions (11) and (12), through criteria (3) and (9) respectively, allow us to identify the optimal patterns $x_r^*$ for all studied phonemes $r \in R$. An additional positive point is that when applying criterion (9), the calculation of statistic (12) uses the target whitening filter (14), which allows effectively reducing the sensitivity of the result of phonetic analysis to the potential presence of Gaussian noise in the empirical speech signal. Let us modify the analytical form of criterion (3) taking into account expression (11), resulting in

$$\mu_{rv}^{RE} = \frac{1}{J} \sum_{j=1}^{J} \left[ \frac{1}{2F} \int_{-F}^{F} \left( \frac{G_{rj}(f)}{G_{rv}(f)} - 1 \right) df \right] = \frac{1}{J} \sum_{j=1}^{J} \mu_{rjv}. \quad (15)$$

The value of the functional of relative entropy (15) substituted into expression (5) allows us to identify the optimal estimate of the power spectral density of the $r$-th phoneme, which is potentially more reliable than the maximum likelihood estimate calculated by expression (2).
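The whitening filter (14), the residual variance (13) and the temporal-space statistic (12) can be sketched directly in pure Python. This is a minimal illustration under our own naming conventions; `a` holds the AR coefficients $a_{ri}(s)$ of the candidate cluster center.

```python
def whiten(x, a):
    """Whitening filter (14): y(l) = x(l) - sum_{s=1..S} a(s) * x(l - s), l > S."""
    S = len(a)
    return [x[l] - sum(a[s] * x[l - 1 - s] for s in range(S))
            for l in range(S, len(x))]

def residual_variance(x, a):
    """Variance (13) of the whitened residual over l = S+1 .. L."""
    y = whiten(x, a)
    return sum(v * v for v in y) / len(y)

def temporal_stat(x, a, sigma0_sq):
    """Statistic (12): sigma_i^2(x) / sigma_0^2 - 1."""
    return residual_variance(x, a) / sigma0_sq - 1.0

# A signal that exactly follows the AR model leaves a zero residual,
# illustrating how (14) suppresses the part of the signal the model explains.
x = [1.0, 0.5, 0.25, 0.125]
print(whiten(x, [0.5]))   # [0.0, 0.0, 0.0]
```

For a noisy signal the residual is the part the AR model of the cluster center cannot explain, which is why statistic (12) grows with the mismatch between the utterance and the candidate center.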
This thesis can be rationally substantiated by the fact that the reliability of the estimate (2) depends only on the representativeness of the empirical data, whereas when calculating the estimate (5) based on the functional (15), firstly, according to expression (14), the empirical data are rid of the Gaussian noise potentially present in the studied speech signal, and secondly, the empirical data are further generalized by autoregressive models of the form (10), the reliability of which can be increased by increasing the order $s$ of the models in the range from 1 to $J$ inclusive. In fact, the estimate (2) only allows us to determine the vicinity of a center of the cluster for the power spectral density of the $r$-th phoneme, while the estimate (5), taking into account expression (15), allows changing the value of the order $s$ to find the center of the cluster of the $r$-th phoneme as a result of solving an optimization task. Let us generalize the just stated concept by modifying expression (15) taking into account expressions (12)-(14). We obtain

$$\mu_{rv}^{RE} = \frac{1}{J} \sum_{j=1}^{J} \left[ \frac{\sigma_v^2(x_{rj})}{\sigma_0^2} - 1 \right] = \frac{1}{J} \sum_{j=1}^{J} \left[ \frac{1}{\sigma_0^2 (L - S)} \sum_{l = S + 1}^{L} \left( y_{rj}^{(v)}(l) \right)^2 - 1 \right] = \frac{1}{J} \sum_{j=1}^{J} \left[ \frac{1}{M} \chi_{rjM}^2 (1 + \mu_{rjv}) - 1 \right], \quad (16)$$

where $\chi_{rjM}^2$ is a $\chi^2$-distributed stochastic quantity with $M = L - S$ degrees of freedom. The greatest influence on the value of $\mu_{rv}^{RE}$ calculated by expression (16) is caused by the variability of the characteristics of allophones of the $r$-th phoneme, which is generalized by the coefficient of variability $1 + \mu_{rjv}$. The value of this coefficient depends on the individual speech characteristics of the persons whose speech signals are being studied and can vary widely.
To some extent, the asymptotic properties of the $\chi^2$-distribution will smooth out these fluctuations:

$$\mu_{rv}^{RE} = \frac{1}{J} \sum_{j=1}^{J} \left[ \frac{1}{M} \chi_{rjM}^2 (1 + \mu_{rjv}) - 1 \right] \underset{M \gg 1}{\approx} \frac{1}{J} \sum_{j=1}^{J} \left[ \frac{1}{M} (1 + \mu_{rjv})\, \mathrm{Norm}_j(M, 2M) - 1 \right]$$
$$= \frac{1}{J} \sum_{j=1}^{J} \mathrm{Norm}_j\!\left( \mu_{rjv}, \frac{2}{M} (1 + \mu_{rjv})^2 \right) = \mathrm{Norm}\!\left( \frac{1}{J} \sum_{j=1}^{J} \mu_{rjv}, \frac{2}{MJ} \cdot \frac{1}{J} \sum_{j=1}^{J} (1 + \mu_{rjv})^2 \right). \quad (17)$$

Let us denote the standard deviation of the coefficient of variability $\mu_r$ of the characteristics of allophones of the cluster $X_r$ of the corresponding phoneme as

$$SD[\mu_r] = \sqrt{ \frac{1}{J} \sum_{j=1}^{J} (1 + \mu_{rjv})^2 }. \quad (18)$$

Expression (17) is in fact a Gaussian model for stochastic estimation of the coefficient of variability $\mu_{rv}^{RE}$. The mathematical expectation of the characteristic of variability $\mu_r$ of the values (11) within the cluster of the $r$-th phoneme is determined by the expression

$$M[\mu_r] = \frac{1}{J} \sum_{j=1}^{J} \mu_{rjv}. \quad (19)$$

The standard deviation of the characteristic of variability of the values (11) within the cluster of the $r$-th phoneme is determined by the expression

$$\sigma_r = \sqrt{ \frac{2\, SD^2[\mu_r]}{MJ} }. \quad (20)$$

The statistical meaning of the parameter $\sigma_r$ is as follows: the larger the value of $\sigma_r$, the lower the density of the cluster of the $r$-th phoneme. Using expressions (18), (19) and (20), we define the confidence interval $\Delta_r$ of the spectral estimation of the $r$-th phoneme as

$$\Delta_r = 2 z_p \sigma_r = z_p \sqrt{ \frac{8}{MJ} }\, SD[\mu_r], \quad (21)$$

where $z_p$ is the coefficient of proportionality and $p$ is the confidence probability. For example, for the Gaussian distribution with $p = 0.95$ the tabular value of the coefficient $z_p$ is equal to 1.96. The confidence interval (21) determines the reliability of the estimate (5) and, accordingly, of criterion (1) for the spectral representation of the phonemes of the studied language.
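Expressions (18) and (21), as reconstructed above, reduce to a few lines of arithmetic. The following sketch (our names, not the paper's) computes $SD[\mu_r]$ from the per-allophone values $\mu_{rjv}$ and the resulting confidence interval $\Delta_r$ for a 95% confidence level.

```python
import math

def sd_variability(mu_devs):
    """SD[mu_r] from (18) over the per-allophone values mu_rjv."""
    return math.sqrt(sum((1.0 + m) ** 2 for m in mu_devs) / len(mu_devs))

def confidence_interval(mu_devs, M, J, z_p=1.96):
    """Confidence interval (21): Delta_r = z_p * sqrt(8 / (M * J)) * SD[mu_r]."""
    return z_p * math.sqrt(8.0 / (M * J)) * sd_variability(mu_devs)

# The interval tightens as frames lengthen (larger M = L - S) or as more
# allophones are pooled (larger J).
mu = [0.10, 0.20, 0.15]
print(confidence_interval(mu, M=150, J=10) > confidence_interval(mu, M=150, J=40))  # True
```

The $1/\sqrt{MJ}$ behaviour makes explicit why pooling more allophones (and using longer frames) improves the reliability of the spectral estimate (5).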
It is obvious that the variability of the spectral characteristics of the phonemes of the studied language will decrease with:
- increasing homogeneity of the studied speech material;
- reducing the level of non-Gaussian noise present in the studied speech material;
- increasing the number of persons-donors of speech material.

Note that the influence of the first factor is analytically taken into account by expression (21) and the influence of the second factor by expression (14), while taking into account the influence of the third factor on the variability of the spectral characteristics of the phonemes of the studied language requires additional analytical explanations. We introduce a modifier for the confidence interval described by expression (21), which will take into account the number of donors of the speech material studied when generalizing the spectral characteristics of the corresponding phonemes: $\Delta_r^{(I)}$, where $I = 1, 2, \ldots$ is the index of the donor of speech material. Accordingly, expression (21) was obtained for $\Delta_r^{(1)}$. For the cases $I = 2, 3, \ldots$, based on expression (21) we obtain:

$$\delta_r^{(I)} = \frac{\Delta_r^{(I)}}{\Delta_r^{(1)}} = \frac{\mu_r^{(I)}}{\mu_r^{(1)}} \ge 1. \quad (22)$$

Naturally, as $I$ increases, the value of the coefficient $\delta_r$ will increase nonlinearly, eventually reaching saturation at $I = I^*$. The value of $I^*$ will depend on the factors just mentioned, but the analytical formalization of this dependence requires additional theoretical research. Therefore, the theoretical material presented in Section 3.2 has found its application in the analytical formalization of the metric $\{\mu_r, \sigma_r, \Delta_r, \delta_r\}$, focused on the qualitative evaluation of the results of phonetic analysis.
The use of autoregressive analysis in deriving the functional of relative entropy $\mu_r$ allows determining the spectral characteristics of the center of the cluster of the $r$-th phoneme as a result of solving an optimization task by changing the order $s$ of the regression models created to describe the studied speech signals. We can quantify the quality of the phonetic analysis performed in this way by calculating the value of the coefficient of variability $\delta_r$.

4. Results
A group of 20 students from the Department of the Theory and Practice of Translation, Faculty of Foreign Languages, Vasyl’ Stus Donetsk National University (Ukraine, Vinnytsia) was formed to conduct experiments. At the initial stage of the experiments, each student, using a microphone connected to a computer, recorded phonograms with the long English phonemes [i:], [a:], [u:], [ɔ:], [ɜ:] pronounced sequentially, repeatedly (ten times) and at the same tempo (one phoneme per phonogram). An AKG P420 microphone without an amplifier, connected to a Creative Audigy Rx sound card installed in a computer, was used for the experiments. Sound recording was supported by Sound Forge Pro for Windows. Phonograms were recorded with a sampling rate of 8000 Hz ($F$ = 4000 Hz), 16-bit quantization, mono, and stored in .wav format. Subsequently, the phonograms were programmatically processed in order to form clusters for the corresponding phonemes $\{X_r\}$, $r = 1, 2, \ldots, R$, $R = 5$, and to determine the centers of the clusters based on the model (3)-(11). Preliminary phonetic analysis of the phonograms was performed using the Praat program developed at the Institute of Phonetic Sciences of the University of Amsterdam. To apply the mathematical apparatus proposed in the article, phonograms were processed in frames with a duration of $L = 160$ samples (≈ 20 ms).
For spectral analysis of the phonograms with speech signals using the autoregression model, Burg's algorithm was used [21], known for its high resolution in the analysis of short-term signals and the guaranteed stability of the calculated forming filter. To enable the comparison of power spectral densities according to criterion (1), the source speech material was represented in the mel-space of a bank of corresponding filters with triangular averaging functions. As a result, the frequency characteristic parameters of the studied speech signals were obtained in the form of weighted sums of power spectral densities in uniform intervals of 55 mels (a total of 31 bands covering the frequency range [200, 3400] Hz). As a result, 20 personalized phonetic databases $\{X_r\}$ of the same volume $R = 5$ were formed, and also more than 1000 integrated phonetic databases $\{X_r\}_I$ were formed as a result of joint processing of the phonograms of two, three, etc. students. For the primary set of phonetic data $\{\{X_r\}, \{X_r\}_I\}$, $r = \overline{1, R}$, $R = 5$, $I = \overline{2, 10}$, obtained as a result of the described actions, the values of the coefficient of variability were calculated by expression (15) and by expression (22) with a change in the order of the autoregression models used. Empirical dependences of the value of the coefficient of reliability of the results of phonetic analysis $\delta_r$ on the number of students $I$ whose speech material was used to form the corresponding primary integrated phonetic bases $\{X_r\}_I$ are presented in Figure 1 for the phonemes [i:] and [a:]. Potentially, the experiment was oriented to arrange the phonemes of the studied language according to the level of their informativeness for the task of authentication of a person by voice. The second experiment aimed to recognize isolated syllables of English words formed by as many of the studied phonemes as possible. The authors formed a working dictionary of 200 selected English words.
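A triangular mel filterbank of the kind described above can be sketched as follows. The paper does not specify which Hz-to-mel formula was used, so this sketch assumes the common HTK-style formula; with a 55-mel spacing over [200, 3400] Hz the exact band count depends on how the edges are handled, so the sketch should not be expected to reproduce the paper's 31 bands precisely.

```python
import math

def hz_to_mel(f):
    # HTK-style mel scale (an assumption; the paper does not name its formula).
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def triangular_mel_centers(f_lo=200.0, f_hi=3400.0, step_mel=55.0):
    """Centers (in Hz) of triangular filters spaced every step_mel mel over
    [f_lo, f_hi]; each triangle spans two steps, so neighbours overlap by half."""
    m = hz_to_mel(f_lo) + step_mel
    m_hi = hz_to_mel(f_hi)
    centers = []
    while m + step_mel <= m_hi:
        centers.append(mel_to_hz(m))
        m += step_mel
    return centers

centers = triangular_mel_centers()
print(len(centers), round(centers[0], 1), round(centers[-1], 1))
```

Weighting the Burg AR power spectral density with each triangle and summing per band yields the 31-dimensional frequency characteristic vectors used in the experiments.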
Each of the students read the words from the working dictionary for recording in a phonogram. The word order for all recording procedures was the same. Each speaker-student repeated the reading-recording procedure five times. Prerequisites for the reading process were:
- clear diction;
- stable pronunciation rate;
- division of words into syllables with a clear fixation of pauses between them.

Subsequently, the content of the individual bases of phonograms was processed by the variability-reduction procedure (21). Accordingly, individual databases of original and processed phonograms were formed for each student. An integrated database of original phonograms was also formed by averaging the spectral characteristics of the content of the individual databases of original phonograms. Subsequently, the content of the integrated database of original phonograms was processed by the method generalized by expression (22), resulting in an integrated database of processed phonograms.

Figure 1: Dependence of the value of the coefficient of reliability of the results of phonetic analysis $\delta_r$ on the number of students for: a) the phoneme [i:]; b) the phoneme [a:]

Next, an iterative process was performed to add to the content of all databases of phonograms Gaussian noise of such power as to obtain variants of all databases of phonograms with signal-to-noise ratios of 5, 10, 15, ..., 30 dB, respectively. As a result, seven sets of personalized and integrated databases of original and processed phonograms were obtained. Speech recognition on the created sets of phonogram databases was carried out using the currently most popular professional APIs: Cloud Speech from Google and Microsoft Speech from Microsoft. The results of the experiments in the metric $\varepsilon(SNR)$ are shown in Figure 2.
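The noise-injection step described above, i.e. adding white Gaussian noise scaled to hit a target signal-to-noise ratio, can be sketched in pure Python as follows (a minimal illustration under our own naming; the paper does not publish its noise-generation code).

```python
import math
import random

def add_noise_at_snr(signal, snr_db, seed=0):
    """Add zero-mean white Gaussian noise scaled so that the resulting
    signal-to-noise ratio equals snr_db:
    noise power = signal power / 10^(snr_db / 10)."""
    rng = random.Random(seed)
    p_sig = sum(s * s for s in signal) / len(signal)
    sigma = math.sqrt(p_sig / (10.0 ** (snr_db / 10.0)))
    return [s + rng.gauss(0.0, sigma) for s in signal]

# Check: the empirical SNR of the corrupted signal is close to the target.
sig = [math.sin(0.01 * i) for i in range(20000)]
noisy = add_noise_at_snr(sig, 10.0)
p_sig = sum(s * s for s in sig) / len(sig)
p_noise = sum((n - s) ** 2 for n, s in zip(noisy, sig)) / len(sig)
print(round(10.0 * math.log10(p_sig / p_noise), 1))  # ≈ 10.0
```

Applying this with targets of 5, 10, 15, ..., 30 dB to every phonogram produces the noise-corrupted variants of the databases used in the second experiment.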
Figure 2: Dependence of the relative error of recognition of isolated syllables on the signal-to-noise ratio in the corresponding set of databases of phonograms

The relative error of recognition of isolated syllables for all sets of databases of phonograms (Set 1: personalized databases of original phonograms; Set 2: personalized databases of processed phonograms; Set 3: database of integrated original phonograms; Set 4: database of integrated processed phonograms) was calculated as the ratio of the absolute value of the difference between the number of correctly and incorrectly recognized syllables to the total number of syllables in the corresponding sets of databases of phonograms.

5. Discussion
From the empirical results of estimating the dependence of the value of the reliability coefficient of the results of phonetic analysis $\delta_r$ on the number of speaking students $I$, shown in Figure 1, it is seen that the hypothesis formulated in the theoretical part of the article about the existence of a phonetic data saturation threshold has been empirically confirmed. It is seen that the characteristic $\delta_r(I) \to \sup \delta_r = \delta_r^*$ is limited from above by the value $\delta_r^*$, which is essentially phoneme-dependent. Based on this fact, we can conclude that, by changing the volume and method of forming an integrated database of phonograms, phonemes can be profiled for their intended use: either for the task of authentication of a person by voice (low value of $\delta_r$) or for the task of semantic analysis of text (high value of $\delta_r$).
The received estimation will be not only qualitative but also quantitative, which is especially relevant for the information technology of text-dependent authentication of a person by voice, focused on application in the structure of an information system for critical use with voice authentication of the person-user [22-26].
Note the values of the coefficients δr[i:] and δr[a:] shown in Figures 1a and 1b, respectively, for the same values of I. It is seen that the values of δr[i:] are many times larger than the values of δr[a:], and this tendency only strengthens with increasing I. Given that the value of the coefficient δr for the target phoneme characterizes the density of its cluster with respect to the volume and source of the speech material, it can be argued that phonemes with a relatively high value of the coefficient δr (for the set {δr[i:], δr[a:]} it is δr[i:]) carry more information about the individuality of the speaker's voice. This should be taken into account when creating a representative dictionary of passphrases for the information technology of text-dependent authentication of a person by voice.
Finally, we pay attention to the values of the confidence intervals for the coefficients δr[i:] and δr[a:] shown in Figures 1a and 1b, respectively. Recall that the values of the confidence intervals calculated by expressions (21), (22) depend on the order of the autoregression model (10) used to describe the studied speech signal and on the degree of compensation of the influence of the Gaussian noise present in the studied speech signal (14). Accordingly, the low variability of the confidence interval for the phoneme [i:] indicates the high density of its cluster despite the different origins of the studied speech material and the potentially non-Gaussian form of the distribution function of the corresponding signal.
In the context of the acoustic theory of speech formation, this is, in fact, a quantitative estimate of the degree of vocalization of this phoneme. At the same time, the high variability of the confidence interval for the phoneme [a:] (with I increasing from 2 to 10, the width of the confidence interval decreased by almost 20 times) can potentially indicate a significantly lower vocalization of this phoneme. Accordingly, the proposed mathematical apparatus provides a potential opportunity to order the set of phonemes of the studied language by quantifying the degree of their vocalization.
Let us analyze the experimental results shown in Figure 2. It becomes obvious that the generalization of phonetic information, whether by simple averaging of spectral characteristics in certain frequency ranges or by the generalization analytically substantiated in Section 3 of the article on the basis of the coefficient of reliability of phonetic analysis δr, has a positive effect on solving the task of automated recognition of syllables by the most modern specialized information systems. However, it is seen that the effect of noise leads to a rapid nonlinear increase in the relative error of recognition of isolated syllables ε. Depending on the studied set of databases of phonograms, the value of the relative error ε at the limit value of 5 dB of the studied range of the signal-to-noise ratio increased by 2.5-7 times. These results allow us to recommend applying the approach to filtering Gaussian noise in the speech signal, generalized by expression (14), to empirical signals whose estimated signal-to-noise ratio is in the range [40, 20] dB. It is also obvious that it is necessary to continue the search for more efficient methods of filtering or compensating for noise when processing empirical speech signals with a low signal-to-noise ratio, SNR ∈ [15, 5] dB.
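The idea of ordering phonemes by the density of their clusters can be illustrated with a simple coefficient of variation computed over per-speaker spectral feature vectors. This is only an illustrative proxy under assumed data, not the authors' δr metric from expressions (21), (22); the function name and the synthetic clusters are hypothetical.

```python
import numpy as np

def variation_coefficient(features):
    """Coefficient of variation of a phoneme cluster: mean per-dimension
    standard deviation divided by the mean magnitude of the cluster centroid.
    A dense cluster (a stable, strongly vocalized phoneme) scores low."""
    features = np.asarray(features, dtype=float)
    centroid = features.mean(axis=0)
    spread = features.std(axis=0)
    return float(spread.mean() / (np.abs(centroid).mean() + 1e-12))

rng = np.random.default_rng(42)
center = np.full(12, 5.0)  # hypothetical 12-dimensional spectral features
tight = center + 0.1 * rng.standard_normal((10, 12))  # dense cluster (cf. [i:])
loose = center + 2.0 * rng.standard_normal((10, 12))  # diffuse cluster (cf. [a:])
print(variation_coefficient(tight), variation_coefficient(loose))
```

Under this proxy, the dense cluster yields the smaller coefficient, mirroring the qualitative ordering of [i:] and [a:] discussed above.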
Finally, the lowest values of the relative error of recognition of isolated syllables ε were obtained when working with the content of the integrated database of processed phonograms, which was obtained using the procedure proposed by the authors and generalized by expression (22). This result is an empirical proof of the adequacy of the mathematical apparatus presented in the article.
6. Conclusions
Without exaggeration, phonetic analysis is the "cornerstone" of modern human-machine information technologies focused on the target interpretation of speech signals. In particular, automated phonetic analysis underlies approaches to solving such tasks as authentication of a person by voice, speech recognition, determination of the speaker's emotional state, semantic interpretation of text, etc. However, the quality of the results demonstrated by modern automated systems of phonetic analysis depends directly on the amount of training information available to them. Thus, the task of improving the quality of phonetic analysis under conditions of limited training information is relevant.
A model of the process of phonetic analysis of speech signals in the frequency and temporal spaces is presented in the article for the first time. In contrast to existing models, the generalization of the spectral characteristics of the studied speech signals is formalized in the presented model as an optimization task of minimizing the functional of relative entropy. The obtained mathematical apparatus made it possible to formulate a metric for quantitative estimation of the quality of the phonetic analysis results and to propose an adaptive method of automated phonetic analysis with an integrated mechanism for counteracting the influence, on the final result, of Gaussian-type noise found in the studied speech signal. The adequacy and functionality of the proposed model and method have been proved empirically.
The analysis of the experimental results also showed that it is possible to assess the suitability of the studied speech materials for the task of authenticating a person by voice or for speech recognition by focusing on the value of the coefficient of variability, which is included in the metric proposed by the authors and determined for the studied database of phonograms with recordings of voiced syllables of speech. Also, the values of this coefficient determined for the studied phonemes can be used to estimate the degree of their vocalization. Further research is planned to focus on the task of text-dependent authentication of a person by voice and on the phonetic analysis of the most common Germanic, Romance and Slavic languages with the help of the created theoretical and applied complex.
7. Acknowledgments
The authors note that the research results presented in the article were obtained while working on the research topic "Synchronic and Diachronic Studies of Language Units on Different Levels" of the Department of the Theory and Practice of Translation, Vasyl' Stus Donetsk National University (Vinnytsia, Ukraine). The authors are grateful to the staff of this department for facilitating the research. The authors also thank the staff of the Department of Automation and Intelligent Information Technologies and the Department of Computer Control Systems of Vinnytsia National Technical University (Vinnytsia, Ukraine) for consulting on theoretical and applied aspects of the study.
8. References
[1] S. L. Kryvyi, N. P. Darchuk, A. I. Provotar, Ontological similar systems for analysis of texts of natural language, in Problems in Programming, Iss. 2-3, 2018, pp. 132-139, doi: 10.15407/pp2018.02.132.
[2] O. Orobinska, J.-H. Chauchat, N.
Sharonova, Methods and models of automatic ontology construction for specialized domains (case of the Radiation Security), in 1st International Conference Computational Linguistics and Intelligent Systems (COLINS), Kharkiv, Ukraine, 2017, pp. 95-99.
[3] Y. Burov, V. Lytvyn, V. Vysotska and I. Shakleina, The Basic Ontology Development Process Automation Based on Text Resources Analysis, in 15th International Conference on Computer Sciences and Information Technologies (CSIT), Zbarazh, Ukraine, 2020, pp. 280-284, doi: 10.1109/CSIT49958.2020.9321910.
[4] A. Adala, N. Tabbane and S. Tabbane, A novel semantic approach for Web service discovery using computational linguistics techniques, in Fourth International Conference on Communications and Networking, ComNet-2014, Hammamet, 2014, pp. 1-6, doi: 10.1109/ComNet.2014.6840909.
[5] Y. Wang and R. C. Berwick, On formal models for cognitive linguistics, in 11th International Conference on Cognitive Informatics and Cognitive Computing, Kyoto, 2012, pp. 7-17, doi: 10.1109/ICCI-CC.2012.6311169.
[6] S. Hara and H. Nishizaki, Acoustic modeling with a shared phoneme set for multilingual speech recognition without code-switching, in Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Kuala Lumpur, 2017, pp. 1617-1620, doi: 10.1109/APSIPA.2017.8282284.
[7] C. Zhao, H. Wang, S. Hyon, J. Wei and J. Dang, Efficient feature extraction of speaker identification using phoneme mean F-ratio for Chinese, in 8th International Symposium on Chinese Spoken Language Processing, Kowloon, 2012, pp. 345-348, doi: 10.1109/ISCSLP.2012.6423485.
[8] J. M. McQueen and M. A. Pitt, Transitional probability and phoneme monitoring, in Proceedings of the Fourth International Conference on Spoken Language Processing, ICSLP '96, Philadelphia, PA, USA, 1996, pp. 2502-2505 vol. 4, doi: 10.1109/ICSLP.1996.607321.
[9] S. Chen, B. Song, L. Fan, X. Du and M.
Guizani, Multi-Modal Data Semantic Localization With Relationship Dependencies for Efficient Signal Processing in EH CRNs, in IEEE Transactions on Cognitive Communications and Networking, vol. 5, no. 2, 2019, pp. 347-357, doi: 10.1109/TCCN.2019.2893360.
[10] K. S. Sai Vineeth, V. Phaneendhra and S. Prince, Identification of Vowel Phonemes for Speech Correction Using PRAAT Scripting and SPPAS, in International Conference on Communication and Signal Processing (ICCSP), Chennai, 2018, pp. 0850-0853, doi: 10.1109/ICCSP.2018.8524273.
[11] B. Wang and C.-C. J. Kuo, SBERT-WK: A Sentence Embedding Method by Dissecting BERT-Based Word Models, in IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, 2020, pp. 2146-2157, doi: 10.1109/TASLP.2020.3008390.
[12] S. Chakrasali, U. Bilembagi and K. Indira, Formants and LPC Analysis of Kannada Vowel Speech Signals, in 3rd IEEE International Conference on Recent Trends in Electronics, Information & Communication Technology (RTEICT), Bangalore, India, 2018, pp. 945-948, doi: 10.1109/RTEICT42901.2018.9012641.
[13] M. Pleva, J. Juhár and A. S. Thiessen, Automatic Acoustic Speech segmentation in Praat using cloud based ASR, in 25th International Conference Radioelektronika (RADIOELEKTRONIKA), Pardubice, 2015, pp. 172-175, doi: 10.1109/RADIOELEK.2015.7129000.
[14] M. A. Kutlugün and Y. Şirin, Turkish meaningful text generation with class based n-gram model, in 26th Signal Processing and Communications Applications Conference (SIU), Izmir, Turkey, 2018, pp. 1-4, doi: 10.1109/SIU.2018.8404801.
[15] B. Bharathi, S. Kavitha and S. Sugapriya, Bilingual Speech Recognition System for Isolated Words Using Deep Neural Network, in International Conference on Computer, Communication, and Signal Processing (ICCCSP), Chennai, India, 2018, pp. 1-4, doi: 10.1109/ICCCSP.2018.8452832.
[16] L. Chen, H. Yang and H.
Wang, Research on Dungan speech synthesis based on Deep Neural Network, in 11th International Symposium on Chinese Spoken Language Processing (ISCSLP), Taipei City, Taiwan, 2018, pp. 46-50, doi: 10.1109/ISCSLP.2018.8706713.
[17] M. Fry, Modeling the Acquisition of Intonation: A First Step, in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, 2018, pp. 5124-5128, doi: 10.1109/ICASSP.2018.8462541.
[18] Z. Zhou, X. Song, R. Botros and L. Zhao, A Neural Network Based Ranking Framework to Improve ASR with NLU Related Knowledge Deployed, in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, United Kingdom, 2019, pp. 6450-6454, doi: 10.1109/ICASSP.2019.8682727.
[19] Y. N. Seitkulov, S. N. Boranbayev, B. B. Yergaliyeva, S. K. Atanov, H. V. Davydau and A. V. Patapovich, The base of speech structural units of Kazakh language for the synthesis of speech-like signals, in IEEE 12th International Conference on Application of Information and Communication Technologies (AICT), Almaty, Kazakhstan, 2018, pp. 1-4, doi: 10.1109/ICAICT.2018.8747120.
[20] W. Zhu, J. Dai, J. Li, J. Wang and F. Hou, Analysis of α Wave in Normal and Epileptic EEG Signals Based on Symbol-Relative Entropy, in 11th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI), Beijing, China, 2018, pp. 1-9, doi: 10.1109/CISP-BMEI.2018.8633179.
[21] A. P. Berg and W. B. Mikhael, An efficient structure and algorithm for the mixed transform representation of signals, in Conference Record of the Twenty-Ninth Asilomar Conference on Signals, Systems and Computers, Pacific Grove, CA, USA, 1995, pp. 1056-1060 vol. 2, doi: 10.1109/ACSSC.1995.540861.
[22] M. M. Bykov, V. V. Kovtun, I. D. Ivasyuk, A. Kotyra and A. Mussabekova, The automated speaker recognition system of critical use, in International Society for Optical Engineering, Vol. 10808, 2018, 108082V, doi: 10.1117/12.2501688.
[23] M. M. Bykov, V. V. Kovtun, A. Raimy, K. Gromaszek and S. Smailova, Neural network modelling by rank configurations, in International Society for Optical Engineering, Vol. 10808, 2018, 1080821, doi: 10.1117/12.2501521.
[24] O. V. Bisikalo, V. V. Kovtun, M. S. Yukhimchuk and I. F. Voytyuk, Analysis of the automated speaker recognition system of critical use operation results, in Radio Electronics, Computer Science, Control, Zaporizhzhia, Ukraine, No. 4, 2018, pp. 71-84, doi: 10.15588/1607-3274-2018-4-7.
[25] O. V. Bisikalo, V. V. Kovtun and M. S. Yukhimchuk, Modeling the security policy of the information system for critical use, in Radio Electronics, Computer Science, Control, Zaporizhzhia, Ukraine, No. 1, 2019, pp. 132-149, doi: 10.15588/1607-3274-2019-1-13.
[26] V. V. Kovtun, M. S. Yukhimchuk, P. Kisała, A. Abisheva and S. Rakhmetullina, Integration of hidden Markov models in the automated speaker recognition system for critical use, in Przeglad Elektrotechniczny, Wydawnictwo SIGMA, Poland, No. 1, 2019, pp. 178-182, doi: 10.15199/48.2019.04.32.