=Paper= {{Paper |id=Vol-2608/paper78 |storemode=property |title=Analysis of voice signal phase data informativity of authentication system user |pdfUrl=https://ceur-ws.org/Vol-2608/paper78.pdf |volume=Vol-2608 |authors=Mykola Pastushenko,Yana Krasnozheniuk,Oleksandr Lemeshko |dblpUrl=https://dblp.org/rec/conf/cmis/PastushenkoKL20 }} ==Analysis of voice signal phase data informativity of authentication system user== https://ceur-ws.org/Vol-2608/paper78.pdf
                 Analysis of Voice Signal Phase Data
             Informativity of Authentication System User

        Mykola Pastushenko1[0000-0003-2664-1167], Yana Krasnozheniuk1[0000-0001-9884-0275],
                         Oleksandr Lemeshko1[0000-0002-0609-6520]


    1
        Kharkiv National University of Radio Electronics, Kharkiv, 14 Nauky Ave., UKRAINE

                             mykola.pastushenko@nure.ua
                             yana.krasnozheniuk@nure.ua
                             oleksandr.lemeshko@nure.ua



          Abstract. Directions of improving the quality characteristics specific to voice
          authentication systems in various access systems are analyzed and explored in
          the article. One of the main directions for improving the quality characteristics
          of the user authentication systems is the use of phase information of a voice
          signal. The paper addresses the urgent scientific task of studying new
          procedures for refining the pitch frequency estimates obtained from
          amplitude-frequency spectrum analysis. The estimates were refined using
          phase data of the voice signal, as well as pitch frequency estimates
          obtained while computing cepstral coefficients. The results were obtained
          through statistical analysis of simulations using experimental voice data
          of an authentication system user. Phase data of a voice signal allow
          obtaining adequate and reliable estimates in the process of spectral
          analysis. However, when gross errors occur, for example, when the first or
          second formant is mistaken for the pitch frequency, preference should be
          given to the estimate obtained while calculating cepstral coefficients.
          The presented results can be used in voice authentication systems, in
          improving speech recognition systems, and in solving speaker
          identification problems.

          Keywords: amplitude, authentication, voice signal, information, spectrum,
          phase, pitch frequency.


1         Introduction

In recent decades, the achievements of science and the latest infocommunication
technologies more than ever determine the dynamics of economic growth, the level of
population’s well-being, the competitiveness of the state in the world community, the
degree of ensuring its national security and equitable integration into the global
economy.

  Copyright © 2020 for this paper by its authors. Use permitted under Creative
Commons License Attribution 4.0 International (CC BY 4.0).

   The rapid development and widespread use of modern information and
telecommunication systems marked the transition of mankind from the industrial
society to the information society, which rests on the latest communication systems,
whose reliability does not always meet the ever-growing requirements. The quantity,
technical level and accessibility of information systems, together with their
reliability and performance stability, already determine the degree of a country's
development and its status in the world community, and in the near future they will
undoubtedly become a decisive marker of this status.
    At the same time, the process of informatization of the world community generates
a complex of negative phenomena, first of all, theft of financial, informational and
computing resources. Indeed, the high complexity and vulnerability of all the systems
on which regional, national and global information spaces are based, as well as the
fundamental dependence of state infrastructures on their stability, lead to the
emergence of fundamentally new threats, some of which can be countered by improving
access systems.
    Due to the wide spreading of distributed systems in all spheres of human activity,
the task of ensuring information security in such systems is acute. One of the main
measures to protect financial resources, information data and computing resources is
to ensure reliable user authentication.
    Currently, there are many approaches to authentication and even more implemen-
tations of these approaches. However, not all classical solutions to the authentication
problem are suitable for implementation in distributed systems. In addition, various
types of systems present their unique requirements for authentication subsystems.
Moreover, the active development of computer technology makes it easy to crack
authentication algorithms that were considered reliable 10-15 years ago. For example,
in 2019, the total estimated income of fraudsters obtained using bank cards in Ukraine
increased from UAH 245.8 million to UAH 361.99 million (an increase of 47.3%), as
reported by the Deputy Director of the Ukrainian Interbank Payment Systems Mem-
ber Association “EMA”, Olesya Dalnichenko. This is largely caused by the insecurity
of the password protection of bank cards.
    In this regard, continuous work is ongoing in the field of research and develop-
ment of authentication methods. New algorithms are constantly appearing and exist-
ing ones are being improved to ensure secure user authentication. The problem of
authentication of users with access to public and personal information resources is
becoming increasingly relevant. This problem is especially important for open, mass
telecommunication and information systems. One of the most promising areas for
protecting such systems from unauthorized influence is biometric methods for identi-
fying users. However, despite all the attractiveness, this approach is fraught with a
number of serious problems.
    Initially, the development and implementation of biometric systems was associ-
ated with static biometric attributes of the user (face image, papillary finger pattern
and iris), which have proven themselves in forensics. However, these hopes have
since been dashed, primarily because such static attributes are easy to fake.
    Therefore, in recent years, a lot of research has been carried out in the field of ap-
plication of dynamic (behavioral) biometric authentication systems. Among these
biometric systems, voice authentication, being simple and convenient, occupies a
special place. However, like other biometric modalities, voice authentication has low quality
characteristics. In this regard, intensive research is being carried out in the field of
voice authentication, as evidenced by the works [1-4].
    In modern voice authentication systems (VASs), the amplitude information of a
polyharmonic non-stationary voice signal of a user is recorded. User authentication is
carried out mainly in the process of analyzing the amplitude-frequency spectrum of
registration materials [2]. The main efforts of researchers in this case are focused on
search for new or improvement of existing procedures for the formation (estimation)
of templates (a set of attributes – pitch frequency, formant data, cepstral coefficients,
mel-frequency cepstral coefficients, linear prediction coefficients and their dynamic
characteristics, etc.) of the user, as well as the development of decision rules. The
following decision-making procedures are the most popular among the latter – the
methods of Gaussian Mixture Model (GMM) and Support Vector Machine (SVM).
For these purposes, artificial neural networks and Hidden Markov Models (HMM) are
also used.
    The aim of this work is to study the influence of modern achievements in digital in-
formation processing on the accuracy of evaluating individual characteristics of the
analyzed voice signal in the process of forming a user template. The object of study is
the process of digital processing of voice signals.


2      General problem statement

In our opinion, improving the quality indicators of VASs is connected, first of all,
with a change in the paradigm of digital processing of registration materials:
supplementing amplitude-frequency spectrum analysis with modern advances in digital
information processing, including algorithms for recording phase data of voice
signals.
    Currently, there is another way to improve the quality of the VASs, which is based
primarily on the use of phase information of the user voice signal. It has long been
known [5] that the phase is a more informative parameter of the signal, however, it is
traditionally ignored in the VASs [2].
    This is because obtaining phase information requires additional computational
and algorithmic resources, which are not always available in these applications.
Note that earlier, in radar and radio communications, special bulky devices (phase
shifters) were used to obtain phase data, and these could not be applied to voice
signal processing. Currently, there are specialized microcircuits and digital signal
processors that are applicable to digital processing of voice signals.
    In addition, there are some features of the estimation, pre-processing and use of
phase data. It should be noted that at present there is no experience and practice of
using the signal phase with respect to voice authentication tasks.
    This is confirmed by the fact that there are only a limited number of known works
where phase data were used in the processing of speech signals. For example, in [6]
the relevance of using phase information in the processing of speech data was pointed
out, and in [7] a phase was used to clarify the frequency characteristics of the proc-
essed voice data. In [8], a comparative analysis of the procedures for estimating the
phase relationships between the vibrations of the pitch and harmonics of speech sig-
nals was performed, which the authors propose to use for solving problems of recog-
nizing speech sounds and identifying speakers.
    The above emphasizes the relevance of studies estimating the effect of phase data
on the quality characteristics of voice authentication procedures. Phase data in voice
authentication can be used in several ways that are practically important for digital
processing of voice signals:

─ increasing the signal-to-noise ratio of the registration materials (a known direction
  of using the phase in radar and radio communications);
─ improving the quality of the formation of attributes for traditionally used templates,
  for example, the pitch frequency, formant information, etc.;
─ development of new procedures for the formation of template elements based on
  phase data [9].


3      Related work analysis

Let us analyze the latest scientific works in the field of speech signal processing when
the issues of voice authentication in infocommunication systems have become par-
ticularly relevant. It is obvious that voice identification technologies have come to the
user authentication systems from forensics. The scientific basis for the use of voice
identification technology in forensics was investigated and discussed in detail in [10].
   The general conclusion is that voice identification differs from fingerprint
identification, where variations are very small: there is no absolutely reliable
method for determining whether speech signals belong to the same person. In
forensics, speaker recognition can only be probabilistic, i.e. indicating the likelihood that
two speech signals belong to the same person. Under conditions of an analog tele-
phone channel, even recognition of gender or age is sometimes complicated. Due to
the small sample of speech signals, the confidence interval for evaluating the likeli-
hood of two speech recordings belonging to the same speaker is so large that an un-
ambiguous solution is impossible.
   A closely related task is speaker segmentation. Segmenting speakers in a
conversation flow (audio indexing, diarization) is necessary when marking up sound
transcripts, newsgroups, radio and television shows, interviews, etc. However, as in
forensics, the quality of speaker extraction is low and unacceptable for solving
user authentication problems [11].
   As shown in [12], the individuality of the acoustic characteristics of the voice is
determined by three factors: the mechanics of the vocal folds vibrations, the anatomy
of the speech tract and the articulation control system. Naturally, the voice signal
propagation channel can also affect the acoustic characteristics (for example,
through external noise), an effect that modern systems eliminate by digital
processing procedures and organizational measures. Acoustically, speaking style is
realized in the contour of the pitch frequency, the durations of words and their
segments, the rhythm of stressed segments, the duration of pauses, and the volume
level [12].
   The attribute space, in which a decision is made on the identity of the speaker,
should be formed taking into account all factors of the speech formation mechanism:
the voice source, resonant frequencies of the speech path and their attenuation, as well
as the dynamics of articulation control. In particular, in [12, 13], the following pa-
rameters of the voice source are considered: the average pitch frequency, the pitch
frequency-period contour, fluctuations in the pitch frequency and the shape of the
excitation pulse. The spectral characteristics of the vocal tract are described by the
envelope of the spectrum and its average slope, formant frequencies and their bands,
long-term spectrum or cepstrum [13].
   It was shown in [14] that the most important factor in voice individuality is the
fundamental frequency (F0), followed by the formant frequencies, the size of
fluctuations of F0, and the slope of the spectrum. In [15], it was suggested that
the attributes associated with F0 provide the best separability of voices, followed
by the signal energy and the duration of the segments.
   In some works, the formant frequencies are considered the most important factor
[16, 17]. In particular, the fourth formant is practically independent of the type of
phoneme and characterizes the tract [17].
   Speaker recognition is dominated by the cepstral method of transforming the
spectrum of voice signals, which was first proposed in [18].
   Cepstrum describes the shape of the envelope of the signal spectrum, which inte-
grates the characteristics of the excitation sources (voice, turbulent and pulsed) and
the shape of the speech tract. In experiments on subjective speaker recognition, it was
found that the envelope of the spectrum strongly affects voice recognition [19].
Therefore, the use of a particular method of spectrum envelope analysis for speaker
recognition is justified.
   Instead of calculating the spectrum of the speech signal using the discrete Fourier
transform over a short time interval, the amplitude-frequency characteristic of the
signal found from the coefficients of linear speech prediction can also be used [20].
   In [21], three informative areas were found: 100-300 Hz (the influence of the
voice source), 4-5 kHz (the pear-shaped cavities) and 6.5-7.8 kHz (possibly the
effect of consonants). A smaller informative area lies in the region of 1 kHz.
   Due to the fact that the vast majority of speaker recognition systems use the same
attribute space, for example, in the form of cepstral coefficients, their first and second
differences, much attention is paid to the construction of decision rules, which were
discussed above.
   The development and application of the GMM method was considered in [22, 23].
The GMM method can be considered as an extension of the vector quantization meth-
od [23]. Vector quantization is the simplest model in speaker recognition systems,
regardless of context.
   The Support Vector Machine (SVM) method has been actively used in various
pattern recognition systems since the publication of the monograph [24]. This method
constructs a hyperplane in a multidimensional space that separates two classes, for
example, the parameters of the target speaker and the parameters of speakers from
the reference base. The hyperplane is calculated using not all parameter vectors,
but only specially selected ones, called support vectors. Since the dividing surface
in the initial parameter space does not necessarily correspond to a hyperplane, a
nonlinear transformation of the space of measured parameters into an attribute
space of higher dimension is performed. This nonlinear transformation must satisfy
the requirement of linear separability in the new attribute space. If this condition
is satisfied, the dividing surface is constructed as a hyperplane using the support
vector method. Obviously, the success of the method depends on how well the
nonlinear transformation is selected in each specific case of speaker recognition.
   For speaker verification, the SVM method is often used in combination with the
GMM or HMM methods.
   The method of Hidden Markov Models (HMM) is also applied to speaker recogni-
tion, which has proven itself in problems of automatic speech recognition [25, 26].
In particular, it is assumed that for short phrases of a few seconds' duration, in a
context-dependent approach it is best to use phoneme-dependent HMMs rather than
models based on transition probabilities between frames lasting 10 to 20 ms. The
HMM method can also be used together with the GMM method.
   The general conclusion from the analysis of well-known literature is that the tem-
plates for authentication (speaker recognition) are formed on the basis of digital proc-
essing of the amplitude-frequency spectrum of the user voice signal. At the same
time, a more informative parameter of the user voice data is ignored, namely, the
phase-frequency spectrum. This could be a promising area.


4      Methods and research results

We will analyze the experimental voice signal of the authentication system user, who
pronounced the word “one”. The sampling frequency is 64 kHz and the signal-to-
noise ratio is more than 20 dB. The analyzed voice signal is presented in Fig. 1.
    Further, as in the well-known modern VASs, we calculate the amplitude-
frequency spectrum from the experimental voice signal and perform its analysis. In
this case, as indicated above [21], we will focus on the low-frequency region where
the attributes of the authentication system user are located, focusing mainly on the
pitch frequency and the associated formant frequencies.
                          Fig. 1. Voice signal of the word “one”

    It is known that the value of the pitch frequency is an individual
characteristic of the speaker. It can vary depending on the emotional coloring of
speech, but within fairly narrow limits. In parametric speech coding, it is assumed
that the human pitch frequency lies in the range of 80-400 Hz, and most formant
frequencies are multiples of F0.
    The amplitude-frequency spectrum of the analyzed signal is presented in Fig. 2.
                  Fig. 2. A short-time amplitude spectrum of the voice signal “one”

    Spectral analysis of the amplitude-frequency spectrum of the user's real voice sig-
nal made it possible to obtain an estimate of the pitch frequency in the region of 243
Hz. In this case, three formant frequencies are clearly pronounced (see Table 1), and
the next ones have a low level of intensity.

                Table 1. Characteristics of the amplitude spectrum formants

       Formant                 1                       2                        3
      Level, dB              24.6                    18.6                     14.2
    Frequency, Hz             243                     486                      776
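For illustration, such a spectral-peak pitch estimate over the plausible 80-400 Hz band can be sketched as follows (the helper name and the synthetic 243 Hz test signal are assumptions for demonstration, not the authors' processing chain):

```python
import numpy as np

def pitch_from_amplitude_spectrum(x, fs, fmin=80.0, fmax=400.0):
    """Strongest peak of the amplitude spectrum inside the pitch band [fmin, fmax]."""
    n = len(x)
    spectrum = np.abs(np.fft.rfft(x * np.hamming(n)))
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    band = (freqs >= fmin) & (freqs <= fmax)
    # Index of the largest magnitude restricted to the admissible pitch range
    return freqs[np.argmax(np.where(band, spectrum, 0.0))]

# Synthetic vocalized fragment: 243 Hz fundamental plus two weaker harmonics
fs = 64000                                   # sampling rate used in the paper
t = np.arange(int(0.5 * fs)) / fs
x = (np.sin(2 * np.pi * 243 * t)
     + 0.5 * np.sin(2 * np.pi * 486 * t)
     + 0.3 * np.sin(2 * np.pi * 729 * t))
f0 = pitch_from_amplitude_spectrum(x, fs)    # close to 243 Hz
```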

    Now we examine the same characteristics with respect to the phase information of
the voice signal of the authentication system user. For this, phase data must be
generated, since they are not registered together with the voice signal. Phase data,
as a rule, are therefore computed programmatically and algorithmically. To do this,
it is necessary to restore the quadrature (imaginary) component of the voice signal
from the registration materials. These procedures are based on the Hilbert
transform [5]:
                                         
                        y(t) = (1/π) ∫_{-∞}^{+∞} x(τ) / (t − τ) dτ,

where x(t) is the recorded voice signal; y(t) is the quadrature (imaginary)
component of the analytic signal; t is an independent variable with the physical
meaning of time; τ is the integration variable. Next, we can calculate the phase of
the voice signal using the following relation:

                              φ(t) = arctg( y(t) / x(t) ).

   Unfortunately, the arctg function gives angle values only in the range from −π/2
to π/2. To determine the correct value of the phase angle, which for a voice signal
varies from 0 to 2π, the angle φ(t) must be adjusted accordingly, taking into
account the signs of the numerator and denominator in the argument of arctg.
Otherwise, the phase spectrum will be incorrect. After correction, we obtain a phase
angle that has the form of a sawtooth signal of unknown period.
    As the results of previous studies [27] showed, after forming the phase data it
is necessary to pre-process them. This is due to several factors, among which we
highlight the following:

─ the polyharmonic nature of the voice signal processed by the Hilbert transform,
  which is oriented to harmonic stationary data;
─ incorrect data when the component y(t) or x(t) in the arctg function equals zero;
─ small values of y(t) or x(t), which can be lost in rounding noise.

    These factors mean that both random errors and anomalous measurements can occur
in the sawtooth phase signal. This necessitates preliminary processing of both the
voice signal and the phase data. Pre-processing can be based on a priori data about
the nature of the phase change of the voice signal and will improve the quality of
attribute formation for both existing and prospective template components.
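One possible pre-processing step of this kind is a running-median guard that replaces anomalous samples while leaving regular samples untouched; this is a hypothetical sketch, and the window and threshold values are arbitrary illustrative choices:

```python
import numpy as np

def suppress_phase_outliers(phi, window=5, threshold=1.0):
    """Replace samples deviating from the local median by more than `threshold`.
    A crude guard against anomalous phase measurements; window and threshold
    are illustrative choices, not values from the paper."""
    half = window // 2
    cleaned = phi.copy()
    for i in range(len(phi)):
        lo, hi = max(0, i - half), min(len(phi), i + half + 1)
        local_median = np.median(phi[lo:hi])
        if abs(phi[i] - local_median) > threshold:
            cleaned[i] = local_median
    return cleaned

# Sawtooth ramp with one anomalous spike injected at index 500
phi = np.mod(0.01 * np.arange(1000), 2 * np.pi)
phi[500] += 3.0
cleaned = suppress_phase_outliers(phi)
```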
    We now turn to the phase spectrum of the signal under study. Fig. 3 shows the
phase spectrum of the corrected phase data, which we consider below.
                    Fig. 3. A short phase spectrum of the voice signal

    The results of processing the formant information of the phase spectrum are
presented in Table 2. In this spectrum, six formants can be distinguished, while the
seventh and eighth differ only slightly in energy. The pitch frequency, as in the
amplitude spectrum, is 243 Hz.
    The level of spectral density of the selected maxima is several times higher than
the level of the maxima of the amplitude spectrum, which greatly simplifies the pro-
cedure for their selection. The number of selected formants in the phase spectrum is
one and a half times greater. The aforementioned indicates a more informative phase
spectrum of the voice signal.

                   Table 2. Characteristics of phase spectrum formants

     Formant         1           2              3              4          5      6
    Level, dB       84.9        76.7           70.3           65          64     62
  Frequency, Hz     243         492            738           990         1217   1450

    Another way to obtain an approximate estimate of the pitch frequency can be asso-
ciated with the calculation of cepstral or mel-frequency cepstral coefficients (MFCC),
which, as a rule, are included in the user template as attributes.
   As is known, cepstral coefficients are determined according to the scheme
presented in Fig. 4. The following notation is used in this figure: FFT is the Fast
Fourier Transform block; LOG is the spectrum logarithm block; IFFT is the Inverse
Fast Fourier Transform block.




                     Fig. 4. General scheme of cepstral signal analysis

    Thus, the cepstral coefficients are the result of applying the inverse Fourier
transform to the logarithmic power spectrum. These coefficients are calculated over
signal frames whose duration is several tens of milliseconds. We propose to estimate
the pitch frequency in each frame after performing the inverse Fourier transform.
    In this case, the frames are selected with some overlap. Each inverse transform
yields an estimate of the pitch frequency in the corresponding frame. As a rule,
approximately 40 coefficient sets are calculated, which means we can obtain as many
pitch estimates. Averaging these results, we can form a more accurate estimate of
the pitch frequency.
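The frame-by-frame procedure above can be sketched as follows; the frame length, overlap, search band, and the synthetic 200 Hz test signal are illustrative assumptions rather than the authors' exact configuration:

```python
import numpy as np

def cepstral_pitch(x, fs, frame_len=1024, overlap=0.75, fmin=120.0, fmax=400.0):
    """Per overlapping frame: FFT -> log magnitude -> IFFT; the dominant quefrency
    inside [fs/fmax, fs/fmin] samples gives that frame's pitch estimate.
    The search band is an illustrative narrowing of the 80-400 Hz pitch range."""
    hop = max(1, int(frame_len * (1.0 - overlap)))
    qmin, qmax = int(fs / fmax), int(fs / fmin)
    estimates = []
    for start in range(0, len(x) - frame_len + 1, hop):
        frame = x[start:start + frame_len] * np.hamming(frame_len)
        log_spec = np.log(np.abs(np.fft.fft(frame)) + 1e-12)  # avoid log(0)
        cepstrum = np.fft.ifft(log_spec).real                 # real cepstrum
        q = qmin + np.argmax(cepstrum[qmin:qmax])             # dominant quefrency
        estimates.append(fs / q)                              # period (samples) -> Hz
    return float(np.mean(estimates)), float(np.std(estimates))

# Synthetic vocalized signal: 200 Hz fundamental with ten decaying harmonics
fs = 16000
t = np.arange(fs) / fs
x = sum(np.sin(2 * np.pi * 200 * k * t) / k for k in range(1, 11))
mean_f0, std_f0 = cepstral_pitch(x, fs)
```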
    The adequacy and reliability of the proposed alternative method for estimating
the pitch frequency can be verified in a model experiment. To do this, we perform
digital processing according to the scheme shown in Fig. 4.
    The amplitude and phase data of the voice signal analyzed above were subjected
to processing. When estimating the pitch frequency, the processed frames had an
overlap coefficient of 0.75, the number of points of the discrete Fourier transform
was 1024, and Hamming smoothing was applied before the discrete Fourier transform.
    A specific feature of the digital processing is that a frame may include
vocalized or non-vocalized sounds. As is known, the pitch frequency is estimated
from vocalized sounds, which were extracted by threshold processing of the spectral
power level of the selected maxima.
    As a result of processing the amplitude data, the following estimates were
obtained: the mathematical expectation is 247.5 Hz and the standard deviation is
15.7 Hz; for the phase data, 250.4 Hz and 17 Hz, respectively.
    Thus, the proposed method for estimating the pitch frequency makes it possible
to obtain adequate and reliable results. This method can be useful in the presence
of gross errors, for example, when the frequency maximum of the first or second
formant is taken as the estimate of the pitch frequency. In this case, preference
should be given to the estimate of F0 obtained in the process of calculating the
cepstral coefficients.

5      Conclusion

The problem of improving the quality characteristics of voice authentication systems
has been discussed in the article. As the main direction of solving this problem, it is
proposed to use phase data of the analyzed voice signal in the process of digital proc-
essing. The reliability of the solution proposed for this problem and the analysis of the
information content of the voice signal phase data are studied in the process of experi-
mental evaluation of the pitch frequency and formant information, which are included
in most user templates as required parameters. The cepstral or mel-frequency cepstral
coefficients and a number of other attributes are additionally included in the template.
The pitch frequency makes it possible to solve the following problems: emotion
recognition, gender determination, segmentation of audio with multiple voices, and
separation of speech into phrases.
    In this regard, the current scientific task of studying new procedures to refine the
estimates of the pitch frequency obtained on the basis of the amplitude-frequency spec-
trum analysis was considered in the work. The estimates were refined based on the use
of phase data of the speech signal, as well as the estimates of the pitch frequency in the
process of obtaining cepstral coefficients.
    Therefore, the scientific novelty of the obtained results lies in the fact that for the
first time a technique has been developed and experimental studies have been carried
out to form an estimate of the pitch frequency (as well as formant frequencies) based
on phase data of the voice signal. In addition, a new method has been developed for
estimating the pitch frequency in the process of calculating cepstral coefficients. It can
be performed during the analysis of both amplitude and phase data of the studied voice
signal.
    The results have been obtained in the process of statistical analysis of the simula-
tion results using experimental voice data of the authentication system user.
    Phase data of a voice signal allow obtaining adequate and reliable estimates in
the process of spectral analysis. However, when gross errors occur, for example,
when the frequency maximum of the first or second formant is taken as an estimate of
the pitch frequency, preference should be given to the estimate obtained in the
process of calculating the cepstral coefficients.
    The practical significance of the results is as follows:
─ a technique has been developed and features of phase information forming of the
  studied voice signal have been identified;
─ an example of estimating the pitch frequency has shown higher informativity of the
  phase data, which allows selecting a larger number of formant frequencies;
─ the developed method for estimating the pitch frequency eliminates gross errors in
  the formation of a template of the authentication system user.
    It is advisable to carry out further studies in the direction of estimating the quality
of the formation of attributes for traditionally used templates (for example, cepstral
coefficients, mel-frequency cepstral coefficients, linear prediction coefficients, etc.)
taking into account the phase of the voice signal, as well as the development of new
procedures for the formation of template elements based on phase data.
References
 1. Ramishvili, G.S.: Avtomaticheskoe opoznavanie govoriashego po golosu (Automatic
    speaker recognition over voice). Radio i sviaz, Moscow (1981).
 2. Beigi, H.: Fundamentals of speaker recognition, Springer, New York (2011).
 3. ISO/IEC 2382-37:2012 Information technology – Vocabulary – Part 37: Biometrics
    (2012).
 4. Boll, R.M., Connel, J.Kh., Pankati, Sh., Ratkha, N.K., Senior E.U.: Handbook on biome-
    try. Translated from English by Agapova N. Ye. Tekhnosphera, Moscow (2007).
 5. Oppenheim, A.V., Lim, J.S.: The importance of phase in signals. In: Proceedings of the
    IEEE, Vol. 69(5), pp. 529-541 (1981) doi:10.1109/PROC.1981.12022
 6. Paliwal, K.: Usefulness of phase in speech processing. In: Proc. IPSJ Spoken Language
    Processing Workshop, Gifu, Japan, pp. 1-6 (2003).
 7. Paliwal, K., Atal, B.: Frequency-related representation of speech. In: Proceedings of the
    European Conference on Speech Communication and Technology (EUROSPEECH-2003),
    pp. 65–68 (2003).
 8. Borisenko, S.Yu., Vorobiev, V.I., Davidov, A.G.: Sravnenie nekotorykh sposobov analiza
    fazovikh sootnoshenii mejdu kvazigarmonicheskimi sostavliaushimi rechevykh signalov
    (Comparison of some methods for analyzing phase relationships between the quasi-
    harmonic components of speech signals). In: Proceedings of the First All-Russian Acoustic
    Conference, pp. 2-7 (2004).
 9. Wu, Z., Kinnunen, T., Chng, E., Li, H., Ambikairajah, E.: A study on spoofing attack in
    state-of-the-art speaker verification: the telephone speech case. In: Proc. Asia-Pacific Sig-
    nal Information Processing Association Annual Summit and Conference (APSIPA ASC)
    (2012).
10. Broeders, Ton: Forensic Speech and Audio Analysis Forensic Linguistics. In: Proceedings
    13th INTERPOL Forensic Science Symposium, Lyon, France, 16-19 October 2001, D2,
    pp. 54-84 (2001) https://ssrn.com/abstract=2870568
11. Fergani, B., Davy, M., Houacine, A.: Speaker diarization using one-class support vector
    machines.        Speech      Communication,        vol.50,     pp.      355–365       (2008)
    doi:10.1016/j.specom.2007.11.006
12. Kuwabara, H., Sagisaka, Y.: Acoustic characteristics of speaker individuality: Control and
    Conversion. Speech Communication, vol.16, pp. 165-173 (1995) doi:10.1016/0167-
    6393(94)00053-D
13. Sorokin,V.N., Tsyplikhin, A.I.: Speaker verification using the spectral and time parameters
    of voice signal. Journal of Communications Technology and Electronics, v.55, N12, pp.
    1561-1574 (2010) doi:10.1134/S1064226910120302
14. Matsumoto, H., Hiki, S., Sone, T., Nimura, T.: Multidimensional representation of per-
    sonal quality of vowels and its acoustical correlates. In: IEEE Trans. AU, vol. AU- 21, pp.
    428-436 (1973) doi:10.1109/TAU.1973.1162507
15. Shriberg, E., Ferrer, L., Kajarekar, S., Venkataraman, A., Stolcke, A.: Modeling prosodic
    feature sequences for speaker recognition. Speech Communication, vol.46, N3–4, pp. 455–
    472 (2005) doi:10.1016/j.specom.2005.02.018
16. Lavner, Y., Gath, I., Rosenhouse, J.: The effects of acoustic modifications on the identifi-
    cation of familiar voices speaking isolated vowels. Speech Communication vol.30, 9-26
    (2000) doi:10.1016/S0167-6393(99)00028-X
17. Takemoto, H., Adachi, S., Kitamura, T., Mokhtari, P., Honda, K.: Acoustic roles of the la-
    ryngeal cavity in vocal tract resonance. J. Acoust. Soc. Am., vol.120, pp. 2228–2239
    (2006) doi:10.1121/1.2261270
18. Davis, S., Mermelstein, P.: Comparison of parametric representations for monosyllabic
    word recognition in continuously spoken sentences. In: IEEE Trans. Acoustics, Speech,
    Signal Process., vol.28, N4, pp. 357–366 (1980) doi:10.1109/TASSP.1980.1163420
19. Itoh, K.: Perceptual analysis of speaker identity. In: Speech Science and Technology, Saito
    S. (Ed.), IOS Press, pp. 133-145 (1992).
20. Huang, X., Acero, A., Hon, H.-W.: Spoken Language Processing: a Guide to Theory, Al-
    gorithm, and System Development. Prentice-Hall, New Jersey (2001).
21. Lu, X., Dang, J.: An investigation of dependencies between frequency components and
    speaker characteristics for text-independent speaker identification. Speech Communica-
    tion, vol.50, N4, pp. 312–322 (2007).
22. Reynolds, D.: Speaker identification and verification using Gaussian mixture speaker
    models. Speech Communication, vol.17, pp. 91–108 (1995).
23. Reynolds, D., Quatieri, T., Dunn, R.: Speaker verification using adapted gaussian mixture
    models. Digital Signal Process., vol.10, N1, pp. 19–41 (2000) doi:10.1006/dspr.1999.0361
24. Vapnik, V.N.: Statistical Learning Theory. Wiley, New York (1998).
25. BenZeghiba, M., Bourlard, H.: On the combination of speech and speaker recognition. In:
    Proc. Eighth European Conf. on Speech Communication and Technology (Eurospeech),
    pp. 1361–1364 (2003).
26. Bimbot, F., Blomberg, M., Boves, L., Genoud, D., Hutter, H.-P., Jaboulet, C., Koolwaaij,
    J., Lindberg, J., Pierrot, J.-B.: An overview of the CAVE project research activities in
    speaker verification. Speech Communication, vol. 31, pp. 155-180 (2000).
27. Pastushenko, M., Pastushenko, V., Pastushenko, O.: Specifics of receiving and processing
    phase information in voice authentication systems. In: 2019 International Scientific-
    Practical Conference Problems of Infocommunications. Science and Technology (PIC
    S&T), Kyiv, Ukraine, pp. 621-624 (2019).