On Voice Autentication Algorithm Development

                        Patryk Bąkowski[0000−0001−8325−1873]
                     and Dmitry Muromtsev[0000−0002−0644−9242]

            ITMO University, Saint Petersburg, 197101, Russian Federation
                          {baski, mouromtsev}@ifmo.ru
                                 https://en.itmo.ru/


        Abstract. This article is devoted to the development of voice authen-
        tication algorithm for access control in automated control systems. The
        existing methods of allocation of individual voice characteristics and con-
        struction of voice models are considered. The algorithm of voice authen-
        tication for voice control of the Internet of Things system in Russian on
        the basis of a neural network is offered. The peculiarity of the algorithm is
        the use of mel-frequency cepstral coefficients and the text independence
        of the voice message. Experiments aimed at identifying the optimal set
        of analysed parameters and evaluating the efficiency of the classifier and
        the authentication system as a whole are described.


1     Problem statement
In today’s world, it is necessary to protect multiple sources of sensitive data,
both in the industrial environment and in everyday life. Among the many means
of ensuring the security of such data, biometric voice authentication systems
have a number of advantages. In many situations where it is impossible to get
a high-quality image of the user’s face or get fingerprints, voice authentication
will successfully cope with its task. As a result, such systems are implemented
and used in many areas, such as forensics, finance, telecommunications.
    The use of voice for speaker recognition tasks has a great potential, in par-
ticular due to the fact that to solve such problems there is no need to purchase
complex and expensive equipment, it is enough to have a microphone. Such
identification and authentication systems can be easily implemented and used
both in ACMS (access control and management systems) and on telephone lines
and mobile devices. Voice interfaces are the most promising means of interaction
with "smart things" due to the naturalness and intuitiveness of this approach
for a person: often voice assistants become control centers of smart homes.
    At the moment two of the most relevant problems in the field of the Internet
of Things with voice interfaces can be formulated:
1. Management of personal data of users.
2. Management of complex systems, such as smart home, smart factory, etc.
    Copyright c 2019 for this paper by its authors. Use permitted under Creative Com-
    mons License Attribution 4.0 International (CC BY 4.0).
2       P. Bąkowski et al.

In accordance with this, it is necessary to develop an access control system that
includes authentication by voice characteristics without the use of code words
or phrases.


2    Development of voice authentication algorithm
The system of automatic user authentication by voice is being developed within
the voice control system of Internet of things by PICT faculty of the ITMO
University. The purpose of this work is to create the voice control system that
can, on the basis of voice commands and data obtained from ontological descrip-
tions of devices and indications of smart sensors, logically derive and generate
scenarios for the smart (automated) system, for example, the smart home or the
smart classroom. Voice control is based on local speech recognition.
    This approach unlike cloud-based speech recognition has a number of advan-
tages:
1. No issues related to availability, bandwidth and other factors that affect the
   speed of recognition inherent in cloud solutions.
2. Unlike cloud systems, there is an opportunity to configure the speech recog-
   nition system to solve a specific task. The quality of recognition depends
   on the language model used. In different application areas, different words
   have different probabilities. Cloud solutions use standard systems that use
   an average model of the language, or a model designed to solve the problems
   posed to the creators of the platform, which do not always coincide with the
   user tasks of the system.
3. Resource-efficient implementation of voice activation. To implement the ac-
   tivation function using a cloud system it is necessary to broadcast everything
   that the microphone records to the cloud in order to detect the passphrase.
   This leads to additional loading of the transmission channel and the cost of
   Internet traffic.
4. No additional financial costs. There are many open source (free) libraries
   and tools for local speech recognition, while cloud solutions are commercial
   and provide paid access to their services.
    The following steps are implemented to solve the research task:
1. Data collection.
2. Preprocessing of the voice recordings.
3. Extraction of vectors of individual vocal signs.
4. Building the voice model based on voice characteristics.
5. Decision-making and verification.
    The data is collected by the software for working with devices that record
audio signals (microphones).
    The VAD (Voice Activity Detection) algorithm based on the energy [8] is used
to pre-process recorded audio data, namely to remove pauses and non-vocalized
fragments. This algorithm splits the speech signal into frames of 40 MS, then
                            On Voice Autentication Algorithm Development         3

removes those frames which average energy is less than the set threshold: the
average energy of the entire record, multiplied by a factor k, that is selected
empirically.


             IF Ei < k ∗ E, where k < 1,                      Silence
             ELSE                                      Voice activity

    The k coefficient in this work was 0.25. Figure 1 shows the signal before and
after removing noise and pauses.


           Fig. 1. The signal before and after removing noise and pauses


    Features are extracted from the pre-processed voice records. Feature extrac-
tion is implemented using the bob.ap library.
    As a result 60-element vectors of mel-frequency kepstral coefficients (MFCC) [7]
are formed. The process of obtaining vectors is as follows:

 1. Splitting the signal into overlapping frames of length 20 ms with an inter-
    section of 10 ms.
 2. Obtaining the signal spectrum for each frame by applying Fourier transfor-
    mation.
 3. Decomposition of the spectrum on the mel-scale using triangular filters.
 4. Squaring of the obtained values and taking the logarithm.
 5. Application of the discrete cosine transformation.
4       P. Bąkowski et al.

    In feature vector-based recognition, the Gaussian mixture model (GMM) [7]
or machine learning, such as the SVM support vector method, are most com-
monly used. In this work, the multilayer neural network [4][5][6] with two hidden
layers was used to recognize the speaker by voice. The number of neurons in the
input layer i1 , i2 , . . . , in , is defined by the dimension of the feature vectors on
which learning occurs. In this paper, vectors of dimension n=60 are used. The
number of neurons in the output layer of the network o1 , o2 , . . . , ok corresponds
to the dimension k of the set of speakers G registered in the system. The ar-
chitecture of the used neural network is shown in figure 2. Tensorflow [2] and
Keras [1] libraries were used to work with neural networks.


                     Fig. 2. Architecture of the neural network


    Voxforge dataset [3] was used to train the neural network. It contains record-
ings of various lengths from 500 speakers. During the training of the network
and the analysis of its accuracy in solving the problem of speaker classification,
the outputs of the last layer of the network hθ (x(m)) is a K -dimensional vec-
tor, where K is the number of speakers, each element of which takes values in
the range from 0 to 1. The vector shows with what probability the speaker can
be attributed to each of the K classes. The prediction of the speaker class can
be carried out using the sum of the logarithms of the probability of M frames.
In this case the ID of the predicted speaker k* is the index of the maximum
probability value:
                                        M
                                                         !
                                        X
                         ∗                           m
                       k = arg max          log(hθ (x )k)
                               k∈[1,K]    m=1
                             On Voice Autentication Algorithm Development            5

    Further, in the verification and decision-making step, the user is authen-
ticated by comparing the received probability with the threshold value. The
threshold value was determined by conducting experiments with speakers who
did not take part of the speaker set G formation, known to the classifier (neg-
ative experiments), and speaker form the set G (positive experiments). From
the obtained values of identification probability for negative and positive ex-
periments, two Gaussian distributions - correct and erroneous identification -
were constructed for each user. The intersection point of these graphs of these
distributions is the threshold value for a particular user of the system (Fig.3).


Fig. 3. Determination of the threshold value by the intersection point of Gauss distri-
butions


    During the development of the voice authentication system, a number of ex-
periments were conducted to identify the value of the VAD threshold coefficient.
The choice of threshold factor is very important when working with energy-based
VAD: too high value can cause cutting off frames that contain the speaker’s voice
and, in turn, too low value can cause many non-vocalized or noisy fragments
would not be excluded from the set of frames. Figure 4 shows the dependence
of the classifier accuracy on the VAD threshold coefficient. Using VAD with the
right threshold value can improve system performance by about 10% compared
to raw data.
    When testing the developed algorithm of voice authentication results with the
following values (shown in figure 5): equal error rate EER = 7%, the coefficient
of accurate verification 87.1%.

3    Conclusion
This paper describes the analysis of methods of biometric authentication of the
speaker by voice. The existing algorithms and methods of biometric authentica-
6      P. Bąkowski et al.


     Fig. 4. The dependence of classification accuracy on VAD threshold value


     Fig. 5. The dependence of classification accuracy on VAD threshold value


tion by voice, including text-dependent and text-independent algorithms are
investigated. The analysis of the tools used in this field is also carried out,
and the neural network with two hidden layers and the distribution of neu-
rons 60:40:40:20 is selected. The result is a voice authentication algorithm for
access control in automated systems. In addition, specialized software for voice
biometric authentication system based on neural networks has been developed.
The number of experiments on the selection of the VAD threshold coefficient
was carried out. Such metrics as accuracy (87%) and EER (7.1%) were used
to evaluate the system. Possible directions of development of this research are
formulated.
                            On Voice Autentication Algorithm Development           7

References
1. Keras. the python deep learning library, https://keras.io/
2. Tensorflow. an end-to-end open source machine learning platform, https://www.
   tensorflow.org/
3. Voxforge - free speech corpus and acoustic model repository, http://www.voxforge.
   org/ru/Downloads
4. Buchneva, T., Kudryashov, M.Y.: Neural network in the task of speaker identifi-
   cation by voice. herald of tver state university. series. Applied Mathematics (2),
   119–126 (2015)
5. Ge, Z., Iyer, A.N., Cheluvaraja, S., Sundaram, R., Ganapathiraju, A.: Neural net-
   work based speaker classification and verification systems with enhanced features.
   In: 2017 Intelligent Systems Conference (IntelliSys). pp. 1089–1094. IEEE (2017)
6. McLaren, M., Lei, Y., Ferrer, L.: Advances in deep neural network approaches to
   speaker recognition. In: 2015 IEEE international conference on acoustics, speech
   and signal processing (ICASSP). pp. 4814–4818. IEEE (2015)
7. Rakhmanenko, I.A., Mescheriakov, R.V.: Identification features analysis in speech
   data using gmm-ubm speaker verification system. Trudy SPIIRAN 52, 32–50 (2017)
8. Verteletskaya, E., Sakhnov, K.: Voice activity detection for speech enhancement
   applications. Acta Polytechnica 50(4) (2010)