<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Development and Research of VAD-Based Speech Signal Segmentation Algorithms</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>IT Step University</institution>
          ,
          <addr-line>Lviv</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Technical University of Lodz</institution>
          ,
          <addr-line>Lodz</addr-line>
          ,
          <country country="PL">Poland</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Ukrainian Academy of Printing</institution>
          ,
          <addr-line>Lviv</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>University of Warmia and Mazury in Olsztyn</institution>
          ,
          <country country="PL">Poland</country>
        </aff>
      </contrib-group>
      <fpage>0000</fpage>
      <lpage>0001</lpage>
      <abstract>
        <p>Unlike other known approaches, the speech signal segmentation method developed in this work applies a VAD detector to the power spectrum of fragments (packets) of the speech signal. A discrete Fourier transform with a small number of samples (at most 160) is used to calculate the spectrum. The developed method not only solves the traditional VAD problem of data rate reduction, but also separates and segments speech signals into individual fragments. Examples are given of such segmentation and of determining the boundaries of voiced and unvoiced areas of speech signals in the data network; these can be used to build phonemic vocoders in automated speech processing and recognition systems.</p>
      </abstract>
      <kwd-group>
        <kwd>segmentation</kwd>
        <kwd>speech signal</kwd>
        <kwd>communication channel</kwd>
        <kwd>speech data</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        Packet data networks have taken and hold the leading position among
telecommunication networks, driven by the development of computer networks
and the Internet. One of the main types of packet traffic is multimedia traffic,
in which the speech signal occupies a significant share. Various encoders [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] are used to
provide high-quality voice traffic [
        <xref ref-type="bibr" rid="ref1 ref3">1, 3</xref>
        ]; they simultaneously compress signals to
reduce network congestion. An effective means of further increasing the compression
ratio is the use of Voice Activity Detector (VAD) technology in modern speech
codecs [
        <xref ref-type="bibr" rid="ref2 ref4">2, 4</xref>
        ]. An even greater increase in the compression degree is achieved by
methods of separating and segmenting speech fragments, i.e., by the
transition to phonemic and semi-phonemic vocoders. Typically, no low-speed
voice encoder implementation can do without VAD (Voice Activity
Detector) technology. Detecting the presence or absence of voice activity is not a new
task, and different methods of its implementation have been and are still being used (e.g.,
GSM encoders, various speech recognition methods, etc.). A well-known problem
in VAD synthesis for voice signal encoders in VoIP networks is correctly
identifying speech pauses against a background of intense acoustic noise (office, street,
car, etc.). However, the use of VAD can significantly save bandwidth [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and
therefore reduce the congestion of network channels.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2 Research of existing VAD technologies capabilities</title>
      <p>
        VAD provides the ability to pre-process the speech signal before it is fed to the
encoder. To a first approximation, the following types of speech fragments can be
distinguished: voiced, unvoiced, transitional, and pauses. When speech is
processed in digital form, i.e., as a sequence of numbers, each signal
type of the same duration and quality requires a different number of bits for
encoding and transmission. Therefore, the transmission rate of different fragments of
speech signals may also differ. Thus, an important conclusion can be made
here: speech data transmission in each direction of a duplex channel should be
considered as the transmission of asynchronous, logically independent fragments of
digital sequences. These sequences (transactions) contain packet (datagram)
synchronization inside a transaction filled with packets of different lengths [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>
        The VAD detector must be sensitive and fast in order to avoid losing
the beginnings of words when switching from a pause to an active speech fragment. At
the same time, the VAD detector should not be triggered by background noise [
        <xref ref-type="bibr" rid="ref4 ref7">4, 7</xref>
        ].
      </p>
      <p>Generally, the goal of VAD is to estimate the value of a particular input parameter
(e.g., level, power, etc.); if it exceeds a certain threshold, the packet is
transmitted. This slightly increases the delay of speech signal processing in the
encoder, but the delay can be minimized by designing coders that work with packets
(datagrams) of samples.</p>
      <p>In an encoder with a digital output rate Ccode (bit/s), the signal is
divided into individual fragments (usually quasi-stationary sections) of duration Tfragm
from 2 to 50 ms; each fragment is accumulated in an input block of N samples and
carries an information frame of about Vm.k = Tfragm × Ccode (bits).</p>
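      <p>As an illustrative calculation (the codec rate and fragment duration here are assumed
values, not data from this work): at Ccode = 8000 bit/s and Tfragm = 20 ms, one fragment
carries Vm.k = 0.02 × 8000 = 160 bits.</p>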
      <p>
        Whatever the details of the implementation, the main criterion for
evaluating an encoder is high quality of speech reproduction at a low digital output
rate Ccode, preferably with minimal requirements for
digital signal processor resources and minimal delay [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>
        VAD technology can be combined with a wide variety of speech encoders.
1. One method of detecting voice activity is based on finding formants.
Although formants carry the basic spectral information about the speech signal, in the
case of unvoiced areas their localization is unreliable and segmentation is
ambiguous, because the formants are lost in the noise [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
2. In a number of works, the spectral characteristic of the noise is estimated, and on its
basis the speech signal is separated from the signal-noise mixture. The
GSM standard adopted a VAD circuit with frequency-domain processing [
        <xref ref-type="bibr" rid="ref16 ref8">8, 16</xref>
        ]. A
block diagram of such a VAD system is shown in Fig. 1. Its operation
is based on the differences between the spectral characteristics of speech and noise.
Background noise is considered stationary over a relatively long period of
time, and its spectrum changes slowly over time. Therefore, the VAD estimates the
spectral deviations of the input sequence from the background noise spectrum. This
operation is performed by an inverse filter whose coefficients change according to
the input. When both a speech signal and noise are present at the input, the inverse filter
suppresses the noise components and, in general, reduces their intensity. The energy
of the signal + noise sum at the output of the inverse filter is compared with a
threshold, which is variable and is estimated during the periods when only
noise is present at the input. This threshold is higher than the noise signal energy level.
Exceeding this level is the determining criterion for the presence of voice activity at the
input. Since these parameters (coefficients and thresholds) are used by the VAD
detector to detect speech, the VAD does not make the final decision at this stage of the
analysis, as the threshold may vary [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ].
      </p>
      <p>
        The final decision is made by a secondary VAD based on a comparison of the spectral
envelopes in successive periods of time. If they are similar or close for a relatively long
time, it is assumed that only noise is applied to the detector input; the filter
coefficients and the noise threshold can then be varied, i.e., adapted to the current level and
spectral characteristics of the input noise [
        <xref ref-type="bibr" rid="ref17 ref9">9, 17</xref>
        ].
      </p>
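      <p>A deliberately simplified Python sketch of this adaptive-threshold idea (for illustration
only: it omits the inverse filter and the secondary spectral-envelope comparison, and the
smoothing factor alpha and the margin are assumed values, not parameters from the GSM
standard):</p>
      <p>def adaptive_threshold_vad(frame_energies, alpha=0.95, margin=3.0):
    """Energy-based VAD whose threshold tracks the background-noise level."""
    noise = frame_energies[0]             # initial noise-energy estimate
    decisions = []
    for e in frame_energies:
        active = e &gt; margin * noise       # the threshold lies above the noise level
        if not active:                    # adapt the estimate only during pauses
            noise = alpha * noise + (1 - alpha) * e
        decisions.append(active)
    return decisions

# Example: the three loud frames in the middle are flagged as voice activity.
print(adaptive_threshold_vad([1.0, 1.1, 0.9, 9.0, 8.5, 7.0, 1.0, 1.2]))</p>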
      <p>Fig. 1. Block diagram of the GSM VAD scheme: the signal + noise input, an inverse filter, a threshold device, the decision block, an adaptive scheme of filter adjustment, and a threshold calculation scheme.</p>
      <p>
        A clear disadvantage of this VAD scheme is the "relatively long period of time" over
which voice activity is decided [
        <xref ref-type="bibr" rid="ref10 ref12">10, 12</xref>
        ]. In addition, if the noise is non-stationary, it is
almost impossible to segment the speech signal with such a scheme.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3 The proposed speech signal segmentation algorithm</title>
      <p>The main idea behind the proposed VAD-based speech signal segmentation
algorithm is linear processing of the speech fragments and rejection of fragments
where there is no voice activity (i.e., no useful information).</p>
      <p>The input parameters of the algorithm are the minimum length of speech data
Mframelength considered useful (the number of packets and their duration) and the maximum
pause time within a word, Eframelength, i.e., the VAD "error" (obviously, this error
can be zero if the VAD system responds to the lowest possible signal values).</p>
      <p>The pseudocode of the speech segmentation algorithm using VAD is as
follows.</p>
      <p>Mframelength = 5..X; // minimum useful segment length, in packets
Eframelength = 0..Y; // maximum pause allowed inside a word, in packets
L = 0; // packet counter
ArrayList s; // segment start indices
ArrayList f; // segment end indices
int begin = 0;
int a = 0; // length of the current active run
while (L &lt; Plength) {
if (p[L] == true) { // voice activity indication for packet L
if (a == 0) begin = L;
a++;
} else {
if (a &gt; Mframelength) {
s.Add(begin);
f.Add(L - 1);
}
a = 0;
}
L++;
}
L = 0;
if (Eframelength &gt; 0)
while (L &lt; s.Count - 1)
if (s[L+1] - f[L] &lt; Eframelength) { // merge across a short pause
s.RemoveAt(L+1);
f.RemoveAt(L);
} else
L++;
return s, f;</p>
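      <p>The following is a minimal Python sketch of the same segmentation logic (an
illustration, not the authors' implementation; the names p, m_frame and e_frame mirror
p, Mframelength and Eframelength above, and the short-pause merge condition
s[L+1] - f[L] &lt; Eframelength is an assumption made to complete the pseudocode):</p>
      <p>def segment(p, m_frame, e_frame):
    """Return (starts, ends) of speech segments from a boolean VAD sequence p."""
    starts, ends = [], []
    begin, run = 0, 0
    for i, active in enumerate(p):
        if active:
            if run == 0:
                begin = i              # a new segment starts here
            run += 1
        else:
            if run &gt; m_frame:          # keep only sufficiently long segments
                starts.append(begin)
                ends.append(i - 1)
            run = 0
    if run &gt; m_frame:                  # close a segment reaching the end of p
        starts.append(begin)
        ends.append(len(p) - 1)
    i = 0                              # merge segments split by a short pause
    while e_frame &gt; 0 and i &lt; len(starts) - 1:
        if starts[i + 1] - ends[i] &lt; e_frame:
            del starts[i + 1]
            del ends[i]
        else:
            i += 1
    return starts, ends

# Example: two long runs separated by a 2-packet pause are merged into one segment.
p = [False]*3 + [True]*8 + [False]*2 + [True]*7 + [False]*6
print(segment(p, m_frame=5, e_frame=4))   # ([3], [19])</p>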
      <p>After selection, the speech fragments are further detailed and processed according to
conventional coding algorithms (according to the ITU-T Recommendations).</p>
      <sec id="sec-8-1">
        <title>4 Methods of experimental research</title>
        <p>The proposed VAD is based on a discrete Fourier transform (DFT):
a_k = (2/N) ∑ y_i cos(2πki/N), b_k = (2/N) ∑ y_i sin(2πki/N), S_k = √(a_k² + b_k²),
where the sums run over i = 1..N, y_i are the packet samples, and S_k is the amplitude spectrum.</p>
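        <p>A small Python sketch of this spectrum computation (an illustration under the
formulas above; the 8 kHz sampling rate and the 400 Hz test tone are assumptions, not
data from this work):</p>
        <p>import numpy as np

def packet_spectrum(y):
    """Amplitude spectrum S_k of one packet y (N &lt;= 160 samples) via the DFT."""
    n = len(y)
    i = np.arange(1, n + 1)
    k = np.arange(n // 2 + 1)[:, None]           # harmonics k = 0..N/2
    a = (2.0 / n) * np.sum(y * np.cos(2 * np.pi * k * i / n), axis=1)
    b = (2.0 / n) * np.sum(y * np.sin(2 * np.pi * k * i / n), axis=1)
    return np.sqrt(a**2 + b**2)                  # S_k = sqrt(a_k^2 + b_k^2)

# Example: a 160-sample packet of a 400 Hz tone sampled at 8 kHz.
fs = 8000
t = np.arange(160) / fs
s = packet_spectrum(np.sin(2 * np.pi * 400 * t))
print(np.argmax(s))                              # 8, i.e. 8 * fs / 160 = 400 Hz</p>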
        <p>Depending on the selected packet length, we choose the number of spectral
components (from 1, 2, ... up to N/2, where N is the packet length).</p>
        <p>
          The band in the frequency spectrum (∆S, which is [0..1] by default, determined
relative to the number of harmonics) is selected as the main parameter of the VAD block.
        </p>
        <p>To study the effectiveness of the proposed algorithm, a simulation of its operation
using real speech signals was conducted. The scheme of the study is shown in Fig. 2.</p>
        <p>Fig. 2. Scheme of the study: Speech Data → Packeting → VAD → Segmentation.</p>
        <p>The VAD system replaces a packet with "zero" at the signal level (i.e., all
samples at the block output are set to 0) if 80% of the packet samples are less than the
specified threshold. The selected threshold (delta) is given in quantization
levels (steps). This value can be changed from 0 to 127 quantization levels, with a
maximum signal amplitude of 255 (for eight-bit quantization).</p>
        <p>Signal-level VAD algorithm:
L = 0; // counter of samples below the delta threshold
for (j = 0..k/P) { // for all packets
L = 0;
for (i = 0..P) // for all samples of packet j
if (abs(S[i,j]) &lt; delta)
L++;
if (L &gt; 0.8*P)
S[j] = 0; // zero the whole packet
}</p>
        <p>That is, if the selected criterion of "informativeness" is not fulfilled, the packet
must be zeroed.</p>
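        <p>A compact Python sketch of this level criterion (an illustration; the eight-bit sample
values and the threshold of 16 quantization steps in the example are assumptions):</p>
        <p>import numpy as np

def vad_by_level(samples, packet_len, delta):
    """Zero each packet in which over 80% of samples lie below the delta threshold."""
    out = samples.copy()
    for start in range(0, len(samples) - packet_len + 1, packet_len):
        packet = samples[start:start + packet_len]
        below = np.count_nonzero(np.abs(packet) &lt; delta)
        if below &gt; 0.8 * packet_len:             # the packet carries no useful load
            out[start:start + packet_len] = 0
    return out

# Example: the first (quiet) packet is zeroed, the second (loud) one is kept.
x = np.array([2, -3, 1, 0, 90, -120, 75, 60], dtype=np.int16)
print(vad_by_level(x, packet_len=4, delta=16))   # [0 0 0 0 90 -120 75 60]</p>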
        <p>When using a DFT-based VAD system, the DFT samples of the speech
fragment are calculated:</p>
        <p>SPj[] – the amplitude spectrum of packet j, where j = 0..n;
Sj[] – the corresponding packet in the input speech stream.</p>
        <p>
          One step of the algorithm is as follows:
for (j) – for all packets j = 0..n:
if (max(SPj) ∈ ∆S) Sj[] = 0;
that is, if the selected criterion of "informativeness" is not fulfilled (the maximum of
the amplitude spectrum lies in the given band), the packet is nullified, i.e., it is
concluded that the packet does not carry any useful speech load. It is advisable to
choose the band ∆S in the range from 0 to 100 Hz, since the pitch frequency of
the speech signal is always above 200 Hz [
          <xref ref-type="bibr" rid="ref1 ref20 ref21">1, 20-21</xref>
          ].
        </p>
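        <p>A Python sketch of this spectral criterion, reusing packet_spectrum from the sketch
above (the harmonic index of the spectral maximum is mapped to a frequency through
the sampling rate; the 8 kHz rate and the test packets are assumptions):</p>
        <p>import numpy as np

def vad_by_spectrum(packets, fs=8000, band_hz=100.0):
    """Zero packets whose amplitude-spectrum maximum lies in the 0..band_hz band."""
    out = []
    for y in packets:
        s = packet_spectrum(y)                # S_k of this packet (see sketch above)
        f_max = np.argmax(s) * fs / len(y)    # frequency of the spectral maximum
        out.append(np.zeros_like(y) if f_max &lt;= band_hz else y)
    return out

# Example: a 400 Hz tone packet is kept; a near-DC packet is zeroed.
t = np.arange(160) / 8000
res = vad_by_spectrum([np.sin(2 * np.pi * 400 * t), 0.3 * np.ones(160)])
print([bool(p.any()) for p in res])           # [True, False]</p>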
        <p>The "Segmentation" block performs the operation of combining packets with
prescreening of areas where there is no linguistic activity, according to the above
algorithm for segmentation of language using VAD.
5</p>
      </sec>
      <sec id="sec-8-2">
        <title>Results of experimental studies</title>
        <p>Two signals of up to 2 seconds in length, corresponding to speech fragments
(file 1 and file 2), were selected to study the segmentation method. Segmentation
was performed using level-based VAD with threshold values of 0.0625 (relative to 1)
and 0.125, within the limits shown in Fig. 3 and 4 (file 1). The segmentation of the
speech signal using DFT-based VAD (file 1) is shown in Fig. 5.</p>
        <p>File 1, a 1.25-second speech stream (10,000 samples), was selected for
processing. A 5760-sample fragment (from the 2240th to the 8000th sample) was
highlighted as a result of applying level-based VAD with a threshold value of 0.0625. The
envelope of the input speech signal and the highlighted fragment are shown in Fig. 3.
The research result is highlighted with the blue lines, which practically corresponds to
the relevant speech information.</p>
        <p>To study the operation of the speech segmentation algorithm with level-based VAD,
the threshold value was increased gradually. As a result, upon reaching
the threshold value of 0.125, two segments of 1600 and 1220 samples were obtained,
respectively. The selected fragments correspond to the voiced sounds "a", and at the
beginning of the second fragment there is an unvoiced sound "b". The results are presented in
Fig. 4, where the vertical lines indicate two sections: the first (a) is in the range from
the 4000th to the 5600th sample, the second (b) from the 6560th to the
7840th sample.</p>
        <p>Applying the DFT-based VAD segmentation method to file 1, three segments were
obtained (Fig. 5). For greater clarity, they are presented separately: the first fragment in
the range from 2560 to 3360 (Fig. 6a), the next fragment in the range 3680 ‒ 6080
(Fig. 6b), and the last fragment from 6400 to 8160 samples (Fig. 6c). Thus, the use of
DFT-based VAD allowed a fragment with speech activity to be distinguished
that the level-based VAD missed.
Fig. 6. Segmented data - file 1 (the scale on the abscissa axis differs between panels); DFT-based
VAD was used.</p>
        <p>In the same way, the segmentation of the speech data presented in file 2 was
carried out. As a result of applying DFT-based VAD, five speech fragments
were isolated (Fig. 7 a-e).
Fig. 7. Segmented data - file 2 (the scale on the abscissa axis in Fig. 7 a-e differs); DFT-based
VAD was used.</p>
      </sec>
      <sec id="sec-6">
        <title>6 Conclusion</title>
        <p>The analysis of the research results showed that the developed segmentation
algorithm using DFT-based VAD gives almost error-free division of the speech flow into
words and, depending on the speaker's intonation, even into syllables and letters.
Also, raising the VAD threshold provides a virtually error-free selection
of voiced speech fragments. The main drawback of the DFT-based VAD algorithm
is its lack of sensitivity for signals in the [300..3400] Hz range, as a
result of which segmentation into letters is rarely achieved, unlike signals in the
[0..3400] Hz range. However, the proposed VAD technique can be effectively used in
speech recognition, since the first DFT harmonics provide additional information
about formants, which can be used to detail individual letters or syllables.</p>
        <p>Comparative analysis of the test signals using objective quality assessment (PESQ)
shows that the intelligibility of the speech signal remains practically at the same level
(3.7-4.5). A score of 3.7 corresponds to the speech fragments where the
low-power packets were zeroed.</p>
        <p>With respect to the gain in compression and subsequent transmission of
variable-rate encoded signals using VAD, a gain of 1.5-2 times (e.g., 34 of 75 frames and 73 of
150 frames transmitted) can be obtained if the transmission of empty packets is suppressed
or replaced by a special short code sequence.</p>
        <p>Acknowledgments. The authors are grateful to their colleagues for their support
and helpful suggestions, which made it possible to improve the article.</p>
      </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>J.</given-names>
            <surname>Cai</surname>
          </string-name>
          ,
          <article-title>"Noise estimation using an MVDR-like approach for acoustic signal enhancement,"</article-title>
          <source>IET International Conference on Information and Communications Technologies (IETICT</source>
          <year>2013</year>
          ), Beijing, China,
          <year>2013</year>
          , pp.
          <fpage>192</fpage>
          -
          <lpage>200</lpage>
          , doi: 10.1049/cp.
          <year>2013</year>
          .
          <volume>0053</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>S.</given-names>
            <surname>Ou</surname>
          </string-name>
          , W. Liu,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shen</surname>
          </string-name>
          and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <article-title>"Two methods for estimating noise amplitude spectral in non-stationary environments," 2016 9th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI)</article-title>
          ,
          <year>Datong</year>
          ,
          <year>2016</year>
          , pp.
          <fpage>969</fpage>
          -
          <lpage>973</lpage>
          , doi: 10.1109/CISP-BMEI.
          <year>2016</year>
          .
          <volume>7852852</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>P.</given-names>
            <surname>Ahmadi</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Joneidi</surname>
          </string-name>
          ,
          <article-title>"A new method for voice activity detection based on sparse representation," 2014 7th International Congress on Image and Signal Processing</article-title>
          , Dalian,
          <year>2014</year>
          , pp.
          <fpage>878</fpage>
          -
          <lpage>882</lpage>
          , doi: 10.1109/CISP.
          <year>2014</year>
          .
          <volume>7003901</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>T.</given-names>
            <surname>Izawa</surname>
          </string-name>
          ,
          <article-title>"Early days of VAD method," 2016 21st OptoElectronics and Communications Conference (OECC) held jointly with 2016 International Conference on Photonics in Switching (PS</article-title>
          ), Niigata,
          <year>2016</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>3</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>R.</given-names>
            <surname>Ahmad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. P.</given-names>
            <surname>Raza</surname>
          </string-name>
          and
          <string-name>
            <given-names>H.</given-names>
            <surname>Malik</surname>
          </string-name>
          ,
          <article-title>"Unsupervised multimodal VAD using sequential hierarchy," 2013 IEEE Symposium on Computational Intelligence and Data Mining (CIDM)</article-title>
          , Singapore,
          <year>2013</year>
          , pp.
          <fpage>174</fpage>
          -
          <lpage>177</lpage>
          , doi: 10.1109/CIDM.
          <year>2013</year>
          .
          <volume>6597233</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>H.</given-names>
            <surname>Sahli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Tlig</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zaafouri</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Sayadi</surname>
          </string-name>
          ,
          <article-title>"A comparative study applied to dynamic textures segmentation," 2016 2nd International Conference on Advanced Technologies for Signal and Image Processing (ATSIP</article-title>
          ),
          <year>Monastir</year>
          ,
          <year>2016</year>
          , pp.
          <fpage>217</fpage>
          -
          <lpage>222</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>M.</given-names>
            <surname>Parada</surname>
          </string-name>
          and
          <string-name>
            <surname>I. Sanches</surname>
          </string-name>
          ,
          <article-title>"</article-title>
          <source>Visual Voice Activity Detection Based on Motion Vectors of MPEG Encoded Video," 2017 European Modelling Symposium (EMS)</source>
          ,
          <year>Manchester</year>
          ,
          <year>2017</year>
          , pp.
          <fpage>89</fpage>
          -
          <lpage>94</lpage>
          , doi: 10.1109/EMS.
          <year>2017</year>
          .
          <volume>26</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>J.</given-names>
            <surname>Park</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. G.</given-names>
            <surname>Jin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hwang</surname>
          </string-name>
          and
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Shin</surname>
          </string-name>
          ,
          <article-title>"Dual Microphone Voice Activity Detection Exploiting Interchannel Time and Level Differences,"</article-title>
          <source>in IEEE Signal Processing Letters</source>
          , vol.
          <volume>23</volume>
          , no.
          <issue>10</issue>
          , pp.
          <fpage>1335</fpage>
          -
          <lpage>1339</lpage>
          , Oct.
          <year>2016</year>
          , doi: 10.1109/LSP.
          <year>2016</year>
          .
          <volume>2597360</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>A.</given-names>
            <surname>Touazi</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Debyeche</surname>
          </string-name>
          ,
          <article-title>"A Case Study on Back-End Voice Activity Detection for Distributed Specch Recognition System Using Support Vector Machines,"</article-title>
          <source>2014 Tenth International Conference on Signal-Image Technology and Internet-Based Systems, Marrakech</source>
          ,
          <year>2014</year>
          , pp.
          <fpage>21</fpage>
          -
          <lpage>26</lpage>
          , doi: 10.1109/SITIS.
          <year>2014</year>
          .
          <volume>54</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>B.</given-names>
            <surname>Peng</surname>
          </string-name>
          and
          <string-name>
            <given-names>T.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>"A Probabilistic Measure for Quantitative Evaluation of Image Segmentation,"</article-title>
          <source>in IEEE Signal Processing Letters</source>
          , vol.
          <volume>20</volume>
          , no.
          <issue>7</issue>
          , pp.
          <fpage>689</fpage>
          -
          <lpage>692</lpage>
          ,
          <year>July 2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>O.</given-names>
            <surname>Tymchenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Havrysh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Khamula</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Kovalskyi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Vasiuta</surname>
          </string-name>
          and
          <string-name>
            <surname>I. Lyakh</surname>
          </string-name>
          ,
          <article-title>"Methods of Converting Weight Sequences in Digital Subtraction Filtration,"</article-title>
          <source>2019 IEEE 14th International Conference on Computer Sciences and Information Technologies (CSIT)</source>
          , Lviv, Ukraine,
          <year>2019</year>
          , pp.
          <fpage>32</fpage>
          -
          <lpage>36</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <given-names>V. A.</given-names>
            <surname>Volchenkov</surname>
          </string-name>
          and
          <string-name>
            <given-names>V. V.</given-names>
            <surname>Vityazev</surname>
          </string-name>
          ,
          <article-title>"Development and testing of the voice activity detector based on use of special pilot signal,"</article-title>
          <source>2016 5th Mediterranean Conference on Embedded Computing (MECO)</source>
          ,
          <year>Bar</year>
          ,
          <year>2016</year>
          , pp.
          <fpage>108</fpage>
          -
          <lpage>111</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <given-names>A.</given-names>
            <surname>Sehgal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Saki</surname>
          </string-name>
          and
          <string-name>
            <given-names>N.</given-names>
            <surname>Kehtarnavaz</surname>
          </string-name>
          ,
          <article-title>"Real-time implementation of voice activity detector on ARM embedded processor of smartphones,"</article-title>
          <source>2017 IEEE 26th International Symposium on Industrial Electronics (ISIE)</source>
          ,
          <year>Edinburgh</year>
          ,
          <year>2017</year>
          , pp.
          <fpage>1285</fpage>
          -
          <lpage>1290</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <given-names>S.</given-names>
            <surname>Jelil</surname>
          </string-name>
          ,
          <string-name>
            <surname>R. K. Das</surname>
            ,
            <given-names>S. R. M.</given-names>
          </string-name>
          <string-name>
            <surname>Prasanna</surname>
            and
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Sinha</surname>
          </string-name>
          ,
          <article-title>"Role of voice activity detection methods for the speakers in the wild challenge,"</article-title>
          <source>2017 Twenty-third National Conference on Communications (NCC)</source>
          ,
          <year>Chennai</year>
          ,
          <year>2017</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          , doi: 10.1109/NCC.
          <year>2017</year>
          .
          <volume>8077146</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>K. T. Sreekumar</surname>
            ,
            <given-names>K. K.</given-names>
          </string-name>
          <string-name>
            <surname>George</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <string-name>
            <surname>Arunraj</surname>
            and
            <given-names>C. S.</given-names>
          </string-name>
          <string-name>
            <surname>Kumar</surname>
          </string-name>
          ,
          <article-title>"Spectral matching based voice activity detector for improved speaker recognition,"</article-title>
          <source>2014 International Conference on Power Signals Control and Computations (EPSCICON)</source>
          ,
          <year>Thrissur</year>
          ,
          <year>2014</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>4</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>M. Pandharipande</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Chakraborty</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Panda</surname>
            and
            <given-names>S. K.</given-names>
          </string-name>
          <string-name>
            <surname>Kopparapu</surname>
          </string-name>
          ,
          <article-title>"An Unsupervised frame Selection Technique for Robust Emotion Recognition in Noisy Speech,"</article-title>
          <source>2018 26th European Signal Processing Conference (EUSIPCO)</source>
          , Rome,
          <year>2018</year>
          , pp.
          <fpage>2055</fpage>
          -
          <lpage>2059</lpage>
          , doi: 10.23919/EUSIPCO.
          <year>2018</year>
          .
          <volume>8553202</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <given-names>A.</given-names>
            <surname>Moldovan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Stan</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Giurgiu</surname>
          </string-name>
          ,
          <article-title>"Improving sentence-level alignment of speech with imperfect transcripts using utterance concatenation and VAD,"</article-title>
          <source>2016 IEEE 12th International Conference on Intelligent Computer Communication and Processing (ICCP)</source>
          ,
          <source>Cluj-Napoca</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>171</fpage>
          -
          <lpage>174</lpage>
          , doi: 10.1109/ICCP.
          <year>2016</year>
          .
          <volume>7737141</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18. H.
          <string-name>
            <surname>Kanamori</surname>
          </string-name>
          ,
          <article-title>"Fiber and fiber based technology after VAD development," 2016 21st OptoElectronics and Communications Conference (OECC) held jointly with 2016 International Conference on Photonics in Switching (PS</article-title>
          ), Niigata,
          <year>2016</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>3</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <given-names>S.</given-names>
            <surname>Tong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Gu</surname>
          </string-name>
          and
          <string-name>
            <given-names>K.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <article-title>"A comparative study of robustness of deep learning approaches for VAD,"</article-title>
          2016 IEEE International Conference on Acoustics,
          <source>Speech and Signal Processing (ICASSP)</source>
          , Shanghai,
          <year>2016</year>
          , pp.
          <fpage>5695</fpage>
          -
          <lpage>5699</lpage>
          , doi: 10.1109/ICASSP.
          <year>2016</year>
          .
          <volume>7472768</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>J. Song</surname>
          </string-name>
          et al.,
          <source>"Research on Digital Hearing Aid Speech Enhancement Algorithm," 2018 37th Chinese Control Conference (CCC)</source>
          , Wuhan,
          <year>2018</year>
          , pp.
          <fpage>4316</fpage>
          -
          <lpage>4320</lpage>
          , doi: 10.23919/ChiCC.
          <year>2018</year>
          .
          <volume>8482732</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <given-names>D.</given-names>
            <surname>Peleshko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Peleshko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Kustra</surname>
          </string-name>
          and
          <string-name>
            <surname>I. Izonin</surname>
          </string-name>
          ,
          <article-title>"Analysis of invariant moments in tasks image processing," 2011 11th International Conference The Experience of Designing and Application of CAD Systems in Microelectronics (CADSM), Polyana-</article-title>
          <string-name>
            <surname>Svalyava</surname>
          </string-name>
          ,
          <year>2011</year>
          , pp.
          <fpage>263</fpage>
          -
          <lpage>264</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <given-names>Z.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Bai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Rahardja</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>"AUC Optimization for Deep Learning Based Voice Activity Detection,"</article-title>
          <source>ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source>
          , Brighton, United Kingdom,
          <year>2019</year>
          , pp.
          <fpage>6760</fpage>
          -
          <lpage>6764</lpage>
          , doi: 10.1109/ICASSP.
          <year>2019</year>
          .
          <volume>8682803</volume>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>