Development and Research of VAD-Based Speech Signal Segmentation Algorithms

Oleksandr Tymchenko1[0000-0001-6315-9375], Bohdana Havrysh2[0000-0003-3213-9747], Aneta Poniszewska-Maranda3[0000-0001-7596-0813], Bohdan Kovalskyi2[0000-0002-5519-0759], Oleksandr O. Tymchenko2[0000-0003-2774-2138] and Kateryna Havrysh4[0000-0003-4155-8759]

1 University of Warmia and Mazury, Olsztyn, Poland
2 Ukrainian Academy of Printing, Lviv, Ukraine
3 Technical University of Lodz, Lodz, Poland
4 IT Step University, Lviv, Ukraine
o_tymch@ukr.net, dana.havrysh@gmail.com, bkovalskyy@ukr.net, olexandr.tymch@gmail.com, aneta.poniszewska-maranda@p.lodz.pl, gavrysh.kateryna@gmail.com

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). IntelITSIS-2020

Abstract. The speech signal segmentation method developed in this work applies a VAD detector that, unlike other known examples, uses the power spectrum of fragments (packets) of the speech signal. A discrete Fourier transform with a small number of samples (at most 160) is used to calculate the spectrum. The developed method not only solves the traditional VAD problem of data rate reduction, but also performs the separation and segmentation of speech signals into individual fragments. Examples are given of such segmentation and of determining the boundaries of vocalized and non-vocalized areas of speech signals in the data network; these can be used to build phonemic vocoders in automated speech processing and recognition systems.

Keywords: segmentation, speech signal, communication channel, speech data.

1 Introduction

Packet data networks have taken and hold the leading position among telecommunication networks, facilitated by the development of computer networks and the Internet. One of the main types of packet traffic is multimedia traffic, in which speech occupies a significant place. Various encoders are used to provide high-quality voice traffic [1, 3]; they simultaneously compress the signals to reduce network congestion. An effective means of further enhancing the compression ratio is the use of Voice Activity Detector (VAD) technology in current speech codecs [2, 4]. An even greater increase in the degree of compression is achieved by methods of speech fragment separation and segmentation, i.e., by the transition to phonemic and semi-phonemic vocoders. Typically, no low-speed voice encoder implementation can do without VAD technology. Detecting the presence or absence of voice activity is not a new task, and different methods of its implementation have been and are still being used (e.g., GSM encoders, various speech recognition methods, etc.). A well-known problem of VAD synthesis in voice signal encoders for VoIP networks is the correct identification of speech pauses against a background of intense acoustic noise (office, street, car, etc.). However, the use of VAD can significantly save bandwidth [4] and therefore reduce the congestion of network channels.

2 Research of the capabilities of existing VAD technologies

VAD provides the ability to pre-process the speech signal before it is fed to the encoder. To a first approximation, the following types of speech fragments can be distinguished: vocalized, unvocalized, transitional, and pauses.
When speech is processed into digital form, i.e., into a sequence of numbers, each type of signal of the same duration and quality requires a different number of bits for encoding and transmission. Therefore, the transmission rate of different fragments of a speech signal may also differ. Thus, an important conclusion can be made here: the transmission of speech data in each direction of a duplex channel should be considered as the transmission of asynchronous, logically independent fragments of digital sequences. These sequences (transactions) contain batch (datagram) synchronization inside a transaction filled with packets of different lengths [5].

The VAD detector must be sensitive and responsive in order to avoid losing the beginnings of words when switching from a pause to an active speech fragment. At the same time, the VAD detector should not be triggered by background noise [4, 7]. Generally, the goal of a VAD is to estimate the value of a particular input parameter (e.g., level, power, etc.); if it exceeds a certain threshold, the packet is transmitted. This slightly increases the delay of speech signal processing in the encoder, but the delay can be minimized by creating coders that work with packets (datagrams) of samples.

In an encoder with output rate C_code (bit/s), the signal is divided into individual fragments (usually quasi-stationary sections) of duration T_fragm from 2 to 50 ms, so each input block carries an information frame of about V_{m,k} = T_fragm × C_code (bits).

Whatever the details of the implementation, the main criterion for evaluating an encoder is high quality of speech reproduction at a low output rate C_code, preferably with minimal requirements for digital signal processor resources and minimal delay [6]. VAD technology can be combined with a wide variety of speech encoders.

1. There is a method of detecting voice activity based on finding formants. Although formants carry the basic spectral information about the speech signal, in unvocalized areas their localization is unreliable and segmentation is ambiguous, because they are lost in the noise [3].

2. In a number of works, the spectral characteristic of the noise is estimated, and on its basis the speech signal is separated from the mixture of signal and noise. The GSM standard adopted a VAD circuit with frequency-domain processing [8, 16]. A block diagram of such a VAD system is shown in Fig. 1. Its operation is based on the differences between the spectral characteristics of speech and noise. Background noise is considered stationary over a relatively long period of time, and its spectrum changes slowly. Therefore, the VAD estimates the spectral deviations of the input sequence from the background noise spectrum. This operation is performed by an inverse filter whose coefficients change according to the input. When speech plus noise is present at the input, the inverse filter suppresses the noise components and, in general, reduces their intensity. The energy of the signal-plus-noise sum at the output of the inverse filter is compared with a threshold, which is variable and is estimated during the periods when only noise acts at the input. This threshold is higher than the energy level of the noise signal; exceeding it is the determining criterion for the presence of voice activity at the input. Because these parameters (coefficients and thresholds) are used by the VAD detector to detect speech, the VAD cannot make a final decision at this stage of the analysis, since the threshold may vary [18]. The decision is made by a secondary VAD based on a comparison of the spectrum envelopes in successive periods of time. If they are similar or close for a relatively long time, it is assumed that noise is applied at the detector input; the filter coefficients and the noise threshold can then be varied, i.e., adapted to the current level and spectral characteristics of the input noise [9, 17].

Fig. 1. Structural diagram of a VAD based on the spectral characteristics of noise (blocks: signal + noise input, inverse filter, threshold device, decision output, adaptive filter-adjustment and threshold-calculation schemes)
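As a rough illustration of the decision rule just described, the following Python sketch compares the short-term energy of each frame with an adaptively estimated noise floor. It is a minimal sketch, not the GSM algorithm itself: the inverse (noise-whitening) filter and the secondary spectral-similarity check are omitted, and the parameter names and default values (init_noise, margin, alpha) are illustrative assumptions.

    import numpy as np

    def energy_vad(frames, init_noise=1e-4, margin=4.0, alpha=0.95):
        """Simplified frame-energy VAD with an adaptive noise threshold.

        frames -- iterable of speech frames (1-D sample arrays)
        margin -- how far above the noise-floor energy a frame must rise
                  to be declared active (illustrative value)
        alpha  -- smoothing factor of the noise-floor update (illustrative)
        """
        noise_energy = init_noise
        flags = []
        for frame in frames:
            frame = np.asarray(frame, dtype=float)
            e = np.mean(frame ** 2)            # short-term energy of the frame
            if e > margin * noise_energy:      # well above the noise floor: speech
                flags.append(True)
            else:                              # noise-only frame: adapt the floor
                noise_energy = alpha * noise_energy + (1 - alpha) * e
                flags.append(False)
        return flags

In the full GSM scheme the energy would be measured after the inverse filter, and the adaptation of the coefficients and threshold would be gated by the secondary spectral-envelope comparison described above.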
A clear disadvantage of this VAD scheme is the "relatively long period of time" over which voice activity is decided [10, 12]. In addition, if the noise is non-stationary, it is almost impossible to segment the speech signal with such a scheme.

3 The proposed speech signal segmentation algorithm

The main idea behind the proposed VAD-based speech signal segmentation algorithm is linear processing of the speech fragments and rejection of fragments in which there is no voice activity (i.e., no useful information). The input parameters of the algorithm are the minimum length of speech data considered useful, Mframelength (the number of packets and their duration), and the maximum pause time within a word, Eframelength, i.e., the "error" of the VAD (obviously, this error can be zero if the VAD system responds to the lowest possible signal values).

The pseudocode of the speech segmentation algorithm using VAD is as follows:

    Mframelength = 5..X;  Eframelength = 0..Y;
    L = 0;                        // packet counter
    ArrayList s;                  // segment start indices
    ArrayList f;                  // segment finish indices
    int begin = 0;  int a = 0;
    while (L < Plength) {
        if (p[L] == true) {       // voice activity indicated in packet L
            if (a == 0) begin = L;
            a++;
        } else {
            if (a > Mframelength) {
                s.Add(begin);
                f.Add(L - 1);
            }
            a = 0;
        }
        L++;
    }
    L = 0;
    if (Eframelength > 0)
        while (L < s.length - 1) {
            if (s[L + 1] - f[L] < Eframelength) {  // pause shorter than the allowed "error"
                s.RemoveAt(L + 1);                 // merge the adjacent segments
                f.RemoveAt(L);
            } else L++;
        }
    return s, f;

After the selection, the speech fragments are detailed and processed according to conventional coding algorithms (per the ITU-T Recommendations).
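For concreteness, the same logic can be written as a short, runnable Python function. This is a sketch under stated assumptions: the per-packet activity flags p are taken to be the output of an arbitrary VAD, and the parameter names min_frames (for Mframelength) and max_gap (for Eframelength) are illustrative.

    def segment(p, min_frames=5, max_gap=0):
        """Merge per-packet voice-activity flags into speech segments.

        p          -- per-packet activity flags (True = voice activity detected)
        min_frames -- Mframelength: shortest run of active packets kept as useful
        max_gap    -- Eframelength: longest pause (in packets) allowed inside a word
        Returns two lists: segment start indices and segment end indices.
        """
        starts, ends = [], []
        run, begin = 0, 0
        for i, active in enumerate(p):
            if active:
                if run == 0:
                    begin = i          # a new active run starts here
                run += 1
            else:
                if run > min_frames:   # keep only sufficiently long runs
                    starts.append(begin)
                    ends.append(i - 1)
                run = 0
        if run > min_frames:           # close a run that lasts to the end of data
            starts.append(begin)
            ends.append(len(p) - 1)
        if max_gap > 0:                # absorb short pauses into the word
            k = 0
            while k < len(starts) - 1:
                if starts[k + 1] - ends[k] - 1 <= max_gap:
                    del starts[k + 1]  # merge segment k+1 into segment k
                    del ends[k]
                else:
                    k += 1
        return starts, ends

For example, segment([False]*3 + [True]*8 + [False]*2 + [True]*7 + [False]*4, min_frames=5, max_gap=3) returns ([3], [19]): the two active runs are first detected separately and then merged, because the two-packet pause between them is shorter than max_gap.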
4 Methods of experimental research

The proposed VAD is based on the discrete Fourier transform (DFT):

a_k = \frac{2}{N}\sum_{i=1}^{N} y_i \cos\left(\frac{2\pi k i}{N}\right), \quad b_k = \frac{2}{N}\sum_{i=1}^{N} y_i \sin\left(\frac{2\pi k i}{N}\right), \quad S_k = \sqrt{a_k^2 + b_k^2}    (1)

Depending on the selected packet length, we choose the number of spectral components (from 1-2 up to N/2, where N is the packet length). The band in the frequency spectrum, ΔS (by default [0-1], defined relative to the harmonic numbers), is selected as the main parameter of the VAD block.

To study the effectiveness of the proposed algorithm, a simulation of its operation using real speech signals was conducted. The scheme of the study is shown in Fig. 2.

Fig. 2. Scheme of the study of the speech segmentation algorithms (blocks: Speech Data → Packetization → VAD → Segmentation)

The scheme includes the following blocks:

• The "Speech Data" block supports loading an arbitrary speech signal file in Pulse Code Modulation (PCM) format. At the output of this block we obtain an array of speech signal samples in integer or floating-point representation (selected according to the required accuracy of data representation), whose number is determined by the sampling frequency of the input signal and the packet length: S_k, k = 0..n, is the output signal.

• The "Packetization" block splits the speech data into fragments. Usually the length of these fragments corresponds to the stationarity interval of the speech signal. Based on the data of [13, 14], we assume this value to be no more than 20 ms (this parameter P is specified in samples and can take values from 1 to the total number of input samples n). The packetization algorithm is described as follows: SP_{i,j} = S_k, i = 0..P, j = 0..k/P, where i is the sample index within packet j. After a cyclic change of i and j over the duration of the input data, we obtain a stream of packets of the given duration (at a sampling rate of 8000 Hz and a packet duration of 20 ms there are 160 samples in a packet).

• The "Voice Activity Detection" block may use an arbitrary algorithm. Level-based VAD and DFT-based VAD were used in the simulation [15].

The level-based VAD system replaces a packet with "zeros" (i.e., all samples at the block output are equal to 0) if 80% of the packet samples are less than the specified threshold. The selected threshold (delta) is given in quantization levels (steps). This value can be changed from 0 to 127 quantization levels, with a maximum signal amplitude of 255 (for eight-bit quantization).

The signal-level VAD algorithm:

    for each packet j = 0..k/P {
        L = 0;                            // counter of samples below the delta threshold
        for i = 0..P
            if (abs(SP[i,j]) < delta) L++;
        if (L > 0.8 * P) S_j = 0;
    }

That is, if the selected criterion of "informativeness" is not fulfilled, the packet must be zeroed.

When a DFT-based VAD system is used, the DFT samples of the speech fragment are calculated: SP_j[] is the amplitude spectrum of packet j = 0..n, and S_j[] is the corresponding packet in the input speech stream. One step of the algorithm is as follows:

    for all packets j = 0..n:
        if (the maximum of SP_j lies in ΔS) S_j[] = 0;

That is, if the selected criterion of "informativeness" is not fulfilled (the maximum of the amplitude spectrum lies in the given band), the packet is nullified, i.e., it is concluded that the packet carries no useful speech load. It is advisable to choose the band ΔS in the range from 0 to 100 Hz, since the pitch frequency of the speech signal is always above 200 Hz [1, 20, 21].

The "Segmentation" block combines packets, with preliminary screening-out of the areas where there is no speech activity, according to the above algorithm for speech segmentation using VAD.
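A compact Python sketch of the two packet-level detectors described above is given below. It assumes 8 kHz sampling and 160-sample (20 ms) packets, as in the experiments; the default value of delta, the handling of the band as a frequency interval in Hz, and the skipping of the DC component are illustrative assumptions, and np.fft.rfft is used in place of formula (1), which yields the same amplitude spectrum up to scaling.

    import numpy as np

    FS = 8000        # sampling rate, Hz
    P = 160          # packet length: 20 ms at 8 kHz

    def level_vad(packet, delta=8, share=0.8):
        """Zero the packet if `share` (80%) of its samples are below `delta`.

        delta is a threshold in quantization steps (0..127 for 8-bit samples
        with maximum amplitude 255); the default of 8 is illustrative."""
        packet = np.asarray(packet)
        if np.mean(np.abs(packet) < delta) >= share:
            return np.zeros_like(packet)
        return packet

    def dft_vad(packet, band_hz=(0.0, 100.0)):
        """Zero the packet if the maximum of its amplitude spectrum lies in
        the given band (0-100 Hz by default, i.e. below the pitch region)."""
        packet = np.asarray(packet, dtype=float)
        n = len(packet)
        # amplitude spectrum S_k; equivalent to formula (1) up to scaling
        spectrum = np.abs(np.fft.rfft(packet)) * 2.0 / n
        k_max = int(np.argmax(spectrum[1:])) + 1   # skip the DC component
        f_max = k_max * FS / n                     # frequency of the maximum, Hz
        if band_hz[0] <= f_max <= band_hz[1]:
            return np.zeros_like(packet)
        return packet

Each 160-sample packet of the input stream is passed through one of these detectors before the segmentation step, which then discards the runs of zeroed packets.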
5 Results of experimental studies

Two signals of up to 2 seconds in length, corresponding to speech fragments (file 1 and file 2), were selected to study the segmentation method. Segmentation was performed using level-based VAD with threshold values of 0.0625 (relative to 1) and 0.125, within the limits shown in Fig. 3 and 4 (file 1). The segmentation of the speech signal using DFT-based VAD (file 1) is shown in Fig. 5.

From file 1, a 1.25-second speech stream (10,000 samples) was selected for processing. As a result of applying level-based VAD with a threshold value of 0.0625, a fragment of 5760 samples (from the 2240th to the 8000th sample) was highlighted. The envelope of the input speech signal and the highlighted fragment are shown in Fig. 3. The result, highlighted with the blue lines, practically corresponds to the relevant speech information.

Fig. 3. Speech data segmentation (file 1), level-based VAD, threshold 0.0625

To study the operation of the segmentation algorithm with level-based VAD, the threshold value was gradually increased. Upon reaching a threshold value of 0.125, two segments of 1600 and 1280 samples, respectively, were obtained. The selected fragments correspond to the vocalized sounds "a", and at the beginning of the second fragment there is the unvoiced sound "b". The results are presented in Fig. 4, where the vertical lines delimit two sections: the first (a) in the range from the 4000th to the 5600th sample, and the second (b) from the 6560th to the 7840th sample.

Fig. 4. Speech data segmentation (file 1), level-based VAD, threshold 0.125

Applying the DFT-based VAD segmentation method to file 1, three segments were obtained (Fig. 5). For greater clarity they are presented separately: the first fragment in the range from the 2560th to the 3360th sample (Fig. 6a), the next in the range from the 3680th to the 6080th sample (Fig. 6b), and the last from the 6400th to the 8160th sample (Fig. 6c). Thus, the use of DFT-based VAD made it possible to distinguish a fragment with speech activity that the level-based VAD had missed.

Fig. 5. Speech data segmentation (file 1), DFT-based VAD

Fig. 6. Segmented data, file 1, DFT-based VAD (the scale on the abscissa axis differs between panels a-c)

In the same way, the segmentation of the speech data in file 2 was carried out. As a result of using DFT-based VAD, five speech fragments were isolated (Fig. 7a-e).

Fig. 7. Segmented data, file 2, DFT-based VAD (the scale on the abscissa axis differs between panels a-e)

6 Conclusion

The analysis of the research results showed that the developed segmentation algorithm using DFT-based VAD gives an almost error-free division of the speech flow into words and, depending on the speaker's intonation, even into syllables and letters. Raising the threshold of the level-based VAD also provides a virtually error-free selection of vocalized speech fragments.

The main drawback of the DFT-based VAD algorithm is its lack of sensitivity for signals in the [300..3400] Hz band, as a result of which segmentation into letters is rarely achieved, unlike for signals in the [0..3400] Hz band. However, the proposed VAD technique can be used effectively in speech recognition, since the first DFT harmonics provide additional information about formants, which can be used to detail individual letters or syllables.

A comparative analysis of the test signals using an objective quality assessment (PESQ) shows that the intelligibility of the speech signal remains practically at the same level (3.7-4.5). A score of 3.7 corresponds to the fragments of speech in which the low-power packets were zeroed.

With respect to the gain in compression and subsequent transmission of variable-rate signals encoded using VAD, a gain of 1.5-2 times (34/75 frames and 73/150 frames) can be obtained if the transmission of empty packets is stopped or replaced by a special short code sequence.

Acknowledgments. The authors are grateful to their colleagues for their support and helpful suggestions, which made it possible to improve the materials of the article.

References

1. J. Cai, "Noise estimation using an MVDR-like approach for acoustic signal enhancement," IET International Conference on Information and Communications Technologies (IETICT 2013), Beijing, China, 2013, pp. 192-200, doi: 10.1049/cp.2013.0053.
2. S. Ou, W.
Liu, S. Shen and Y. Gao, "Two methods for estimating noise amplitude spectral in non-stationary environments," 2016 9th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI), Datong, 2016, pp. 969-973, doi: 10.1109/CISP-BMEI.2016.7852852.
3. P. Ahmadi and M. Joneidi, "A new method for voice activity detection based on sparse representation," 2014 7th International Congress on Image and Signal Processing, Dalian, 2014, pp. 878-882, doi: 10.1109/CISP.2014.7003901.
4. T. Izawa, "Early days of VAD method," 2016 21st OptoElectronics and Communications Conference (OECC) held jointly with 2016 International Conference on Photonics in Switching (PS), Niigata, 2016, pp. 1-3.
5. R. Ahmad, S. P. Raza and H. Malik, "Unsupervised multimodal VAD using sequential hierarchy," 2013 IEEE Symposium on Computational Intelligence and Data Mining (CIDM), Singapore, 2013, pp. 174-177, doi: 10.1109/CIDM.2013.6597233.
6. H. Sahli, L. Tlig, A. Zaafouri and M. Sayadi, "A comparative study applied to dynamic textures segmentation," 2016 2nd International Conference on Advanced Technologies for Signal and Image Processing (ATSIP), Monastir, 2016, pp. 217-222.
7. M. Parada and I. Sanches, "Visual Voice Activity Detection Based on Motion Vectors of MPEG Encoded Video," 2017 European Modelling Symposium (EMS), Manchester, 2017, pp. 89-94, doi: 10.1109/EMS.2017.26.
8. J. Park, Y. G. Jin, S. Hwang and J. W. Shin, "Dual Microphone Voice Activity Detection Exploiting Interchannel Time and Level Differences," in IEEE Signal Processing Letters, vol. 23, no. 10, pp. 1335-1339, Oct. 2016, doi: 10.1109/LSP.2016.2597360.
9. A. Touazi and M. Debyeche, "A Case Study on Back-End Voice Activity Detection for Distributed Speech Recognition System Using Support Vector Machines," 2014 Tenth International Conference on Signal-Image Technology and Internet-Based Systems, Marrakech, 2014, pp. 21-26, doi: 10.1109/SITIS.2014.54.
10. B. Peng and T. Li, "A Probabilistic Measure for Quantitative Evaluation of Image Segmentation," in IEEE Signal Processing Letters, vol. 20, no. 7, pp. 689-692, July 2013.
11. O. Tymchenko, B. Havrysh, O. Khamula, B. Kovalskyi, S. Vasiuta and I. Lyakh, "Methods of Converting Weight Sequences in Digital Subtraction Filtration," 2019 IEEE 14th International Conference on Computer Sciences and Information Technologies (CSIT), Lviv, Ukraine, 2019, pp. 32-36.
12. V. A. Volchenkov and V. V. Vityazev, "Development and testing of the voice activity detector based on use of special pilot signal," 2016 5th Mediterranean Conference on Embedded Computing (MECO), Bar, 2016, pp. 108-111.
13. A. Sehgal, F. Saki and N. Kehtarnavaz, "Real-time implementation of voice activity detector on ARM embedded processor of smartphones," 2017 IEEE 26th International Symposium on Industrial Electronics (ISIE), Edinburgh, 2017, pp. 1285-1290.
14. S. Jelil, R. K. Das, S. R. M. Prasanna and R. Sinha, "Role of voice activity detection methods for the speakers in the wild challenge," 2017 Twenty-third National Conference on Communications (NCC), Chennai, 2017, pp. 1-6, doi: 10.1109/NCC.2017.8077146.
15. K. T. Sreekumar, K. K. George, K. Arunraj and C. S. Kumar, "Spectral matching based voice activity detector for improved speaker recognition," 2014 International Conference on Power Signals Control and Computations (EPSCICON), Thrissur, 2014, pp. 1-4.
16. M. Pandharipande, R. Chakraborty, A. Panda and S. K.
Kopparapu, "An Unsupervised frame Selection Technique for Robust Emotion Recognition in Noisy Speech," 2018 26th European Signal Processing Conference (EUSIPCO), Rome, 2018, pp. 2055-2059, doi: 10.23919/EUSIPCO.2018.8553202. 17. A. Moldovan, A. Stan and M. Giurgiu, "Improving sentence-level alignment of speech with imperfect transcripts using utterance concatenation and VAD," 2016 IEEE 12th Inter- national Conference on Intelligent Computer Communication and Processing (ICCP), Cluj-Napoca, 2016, pp. 171-174, doi: 10.1109/ICCP.2016.7737141. 18. H. Kanamori, "Fiber and fiber based technology after VAD development," 2016 21st Op- toElectronics and Communications Conference (OECC) held jointly with 2016 Interna- tional Conference on Photonics in Switching (PS), Niigata, 2016, pp. 1-3. 19. S. Tong, H. Gu and K. Yu, "A comparative study of robustness of deep learning approach- es for VAD," 2016 IEEE International Conference on Acoustics, Speech and Signal Pro- cessing (ICASSP), Shanghai, 2016, pp. 5695-5699, doi: 10.1109/ICASSP.2016.7472768. 20. J. Song et al., "Research on Digital Hearing Aid Speech Enhancement Algorithm," 2018 37th Chinese Control Conference (CCC), Wuhan, 2018, pp. 4316-4320, doi: 10.23919/ChiCC.2018.8482732. 21. D. Peleshko, M. Peleshko, N. Kustra and I. Izonin, "Analysis of invariant moments in tasks image processing," 2011 11th International Conference The Experience of Designing and Application of CAD Systems in Microelectronics (CADSM), Polyana-Svalyava, 2011, pp. 263-264. 22. Z. Fan, Z. Bai, X. Zhang, S. Rahardja and J. Chen, "AUC Optimization for Deep Learning Based Voice Activity Detection," ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, United Kingdom, 2019, pp. 6760-6764, doi: 10.1109/ICASSP.2019.8682803.