<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>Comparative analysis of CNN-BiGRU and CNN-BiLSTM architectures for voice activity detection under low signal-to-noise ratio conditions</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Aigul Nurlankyzy</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Aigul Kulakayeva</string-name>
          <email>a.kulakayeva@iitu.edu.kz</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>International Information Technology University</institution>
          ,
          <addr-line>Manas St. 34/1, Almaty, 050040</addr-line>
          ,
          <country country="KZ">Kazakhstan</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Satbayev University</institution>
          ,
          <addr-line>Satpayev St. 22, Almaty, 050013</addr-line>
          ,
          <country country="KZ">Kazakhstan</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2026</year>
      </pub-date>
      <volume>000</volume>
      <fpage>0</fpage>
      <lpage>0002</lpage>
      <abstract>
        <p>This study addresses the problem of Voice Activity Detection (VAD) under low signal-to-noise ratio (SNR) conditions, which is critical for automatic speech recognition systems and voice-controlled interfaces. This study focuses on a comparative analysis of two hybrid deep learning architectures, CNN-BiGRU and CNN-BiLSTM, which combine convolutional layers for spectral feature extraction with recurrent blocks for modeling the temporal dynamics of speech. As input features, MFCC matrices were computed from segmented speech fragments of the KSC2 corpus and contaminated with both synthetic and real noise at various SNR levels. The experimental results demonstrate that under moderate and high SNR conditions, both architectures achieve high classification accuracy (average F1-score exceeding 99%). However, in extremely low SNR scenarios (-10 dB), CNN-BiGRU exhibits a more robust performance than CNN-BiLSTM. Additionally, a computational efficiency analysis revealed that CNN-BiGRU outperforms CNN-BiLSTM in terms of training speed and parameter count, making it more suitable for deployment in resource-constrained environments. These findings support the use of GRU-based recurrent blocks in hybrid VAD models and indicate future research directions involving noise augmentation techniques, class imbalance handling, and inference optimization.</p>
      </abstract>
      <kwd-group>
        <kwd>speech detection</kwd>
        <kwd>CNN-BiGRU</kwd>
        <kwd>CNN-BiLSTM</kwd>
        <kwd>mel-frequency cepstral coefficients (MFCC)</kwd>
        <kwd>signal-to-noise ratio (SNR)</kwd>
        <kwd>deep learning</kwd>
        <kwd>neural networks</kwd>
        <kwd>VAD</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Voice Activity Detection (VAD) plays a crucial role in speech recognition systems, voice interfaces,
and telecommunications. The accuracy of this module directly affects the recognition quality,
system response speed, and robustness to noise. The most challenging scenario arises under low
signal-to-noise ratio (SNR) conditions, where background noise masks the speech. In such cases,
traditional methods based on signal energy or spectral features suffer a significant drop in
accuracy. Consequently, research aimed at developing more robust VAD models capable of
operating effectively, even at negative SNR levels, is particularly relevant today.</p>
      <p>In recent years, neural network-based methods have significantly advanced the field.
Convolutional neural networks (CNNs) are highly effective at extracting the time–frequency
features of speech, whereas recurrent architectures make it possible to capture sequential
dependencies and contextual information. Among recurrent networks, Long Short-Term Memory
(LSTM) and Gated Recurrent Unit (GRU) models are the most widely used. Their bidirectional
variants, BiLSTM and BiGRU, enable the analysis of both past and future contexts within the
signal, which is particularly important for accurate speech-segment detection under noisy
conditions.</p>
      <p>Research in VAD has shifted from simple threshold-based and statistical approaches to
more sophisticated yet compact neural network models. These models provide high noise
robustness and low response latency, which are particularly critical for real-time applications
and devices with limited computational resources. Current efforts focus on combining
architectural compactness, trainable front ends, optimized loss functions, and elements of
personalization.</p>
      <p>
        An illustrative example is the SincQDR-VAD framework [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], which employs a trainable
Sinc-filter-based front end and a ranking-aware loss function designed to optimize the ordering of
speech/non-speech frame classification. This approach yields a noticeable improvement in AUROC
and F-score while significantly reducing the number of parameters compared with heavier models.
In parallel, there is a growing interest in personalized VAD (PVAD) systems. A comparative study
by [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] demonstrated that incorporating temporal models, attention mechanisms, and compact
speaker embeddings substantially reduces false positives and improves accuracy in real-world
noisy environments without causing a significant increase in computational cost. Further
improvements were presented in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], which proposed a pretraining scheme based on Discriminative
Noise-Aware Predictive Coding (DN-APC), improving TS-VAD robustness in both seen and unseen
noise scenarios by approximately 2% in terms of accuracy. Additionally, methods for conditional
speaker representation, including FiLM-based modulation, were explored, leading to improved
noise robustness.
      </p>
      <p>
        Simultaneously, lightweight and energy-efficient models have been actively developed. The
sVAD model [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], based on spiking neural networks and a Sinc-based encoder, has demonstrated
strong robustness at very low power consumption, which is critical for edge devices and IoT
scenarios. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] highlighted the importance of selecting an appropriate training objective: using the
segmental Voice-to-Noise Ratio (VNR) as the target leads to more stable performance at low SNR
than binary speech labels. Complementing this finding, [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] proposed a multi-resolution MFCC
front end combined with convolutional layers and self-attention, which improved robustness on
datasets such as NoiseX-92 and other noise corpora.
      </p>
      <p>
        Recent studies have also proposed architectural improvements at both the front-end and
personalization levels. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] introduced a Sinc-Extractor module with a speaker-conditional block
that eliminates the need for bulky speaker embeddings while maintaining accuracy and reducing
inference time. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] demonstrated that the Audio-Inspired Masking Modulation Encoder with
Attention (AMME-CANet) outperforms conventional CNN-based approaches in terms of
robustness under complex noise conditions. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] proposed mVAD, a lightweight algorithm that
achieves high accuracy without requiring prior knowledge of noise, while preserving
computational efficiency. [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] further confirmed that PVAD remains effective even with a very
short reference (~0.3 s) when using a Dual-Path RNN architecture with real-time embedding
updates.
      </p>
      <p>
        The obtained results are consistent with the findings of [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], where CNN-BiLSTM and
CNN-BiGRU architectures were compared under different SNR levels with multiple noise augmentations.
Their study demonstrated that CNN-BiGRU provides an optimal trade-off between accuracy and
computational complexity.
      </p>
      <p>
        Finally, the evolution of next-generation communication systems directly influences the
requirements of VAD. [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] provide a detailed analysis of coverage recovery issues in 5G NR
RedCap networks, where reduced cell radius, increased latency, and uplink channel quality
degradation demand adaptive resource management and signal processing algorithms. The
proposed solutions include MIMO (2Rx), adaptive beamforming, carrier aggregation, and Kalman
filtering for channel parameter prediction and dynamic power control (DPC). These directions
highlight the need for VAD systems capable of maintaining a stable performance in challenging
radio environments typical of IoT and industrial applications.
      </p>
      <p>Thus, contemporary research converges on the view that the future of VAD lies in compact and
customizable neural architectures that leverage trainable spectral filters, multitask loss functions,
and adaptive speaker-conditional encoding, as well as integration with network-level mechanisms
to ensure a robust operation under varying channel conditions.</p>
      <p>Nevertheless, a direct comparison of the CNN-BiGRU and CNN-BiLSTM architectures under
extremely low SNR conditions has not been sufficiently investigated. While BiLSTM networks are
known for modeling complex temporal dependencies, they are computationally more demanding,
whereas BiGRU achieves comparable results at a lower computational cost. The question of which
architecture performs better for VAD tasks in low-resource languages remains unanswered.</p>
      <p>In this study, we focus on a comparative analysis of two architectures: CNN-BiGRU and
CNN-BiLSTM. For the analysis, we employed the Kazakh speech corpus KSC2, augmented with noise
samples from the ESC-50 dataset, and evaluated it across a wide SNR range of –10 dB to 30 dB. The
comparison was performed using both classical classification metrics (Accuracy, Precision, Recall,
F1-score) and computational parameters, including the number of trainable parameters and
training time.</p>
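      <p>The classification metrics listed above can be computed directly from binary speech/non-speech labels. The following NumPy sketch is illustrative only (the original experiments presumably used a standard library implementation); the function name and label convention (1 = speech) are our assumptions.</p>

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """Accuracy, Precision, Recall, and F1 for binary speech(1)/non-speech(0) labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))   # speech correctly detected
    fp = np.sum((y_true == 0) & (y_pred == 1))   # noise flagged as speech
    fn = np.sum((y_true == 1) & (y_pred == 0))   # missed speech
    acc = float(np.mean(y_true == y_pred))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return acc, prec, rec, f1
```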
      <p>Thus, the objective of this study is to identify the strengths and weaknesses of CNN-BiGRU and
CNN-BiLSTM and determine which of the two models is better suited for practical speech detection
systems that operate under severe noise conditions.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Materials and methods</title>
      <p>To analyze the effectiveness of the voice activity detection, we used the KSC2 corpus, which
includes recordings from 30 speakers, each pronouncing 75 phrases. The initial dataset comprised
2,250 audio files. Each recording was segmented using .wrd transcripts, which made it possible to
isolate individual speech and pause segments. To generate training samples, a sliding window
approach was applied: from each segmented track, fragments of 24 time steps were extracted with
a hop size of 5 ms, corresponding to approximately 115 ms of the audio. Consequently, the input
data for the models were represented as 24×24 MFCC matrices.</p>
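      <p>The slicing of an MFCC sequence into fixed 24×24 input matrices can be sketched as follows. This is a minimal illustration assuming a precomputed MFCC matrix of shape (n_mfcc, n_frames); the window stride over the MFCC sequence is not stated in the text, so non-overlapping windows are assumed here.</p>

```python
import numpy as np

def make_windows(mfcc, win_frames=24, step=24):
    """Slice an (n_mfcc, n_frames) MFCC matrix into fixed 24x24 model inputs.

    step=24 yields non-overlapping windows (an assumption; with a 5 ms MFCC
    hop, each 24-frame window covers roughly 115-120 ms of audio).
    """
    n_frames = mfcc.shape[1]
    starts = range(0, n_frames - win_frames + 1, step)
    return np.stack([mfcc[:, i:i + win_frames] for i in starts])
```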
      <p>To improve the noise robustness, noise signals were superimposed on each recording. We used
white Gaussian noise and four noise classes from the ESC-50 dataset (transportation, domestic,
natural, and speech-like) as noise sources. The signal-to-noise ratio (SNR) was varied from –10 to
30 dB. For training, a combined dataset was created that included versions of each sample with
different SNR levels, thereby enhancing the generalization capability of the models used.</p>
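      <p>Superimposing noise at a prescribed SNR reduces to scaling the noise so that the speech-to-noise power ratio matches the target. A minimal sketch, assuming 1-D waveforms and average-power SNR (the paper does not specify the exact mixing procedure):</p>

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech at a target SNR in dB.

    The noise is tiled/truncated to the speech length and scaled so that
    10*log10(P_speech / P_noise) equals snr_db.
    """
    noise = np.resize(noise, speech.shape)           # match lengths
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise
```

Sweeping `snr_db` from -10 to 30 over copies of each sample reproduces the multi-SNR training set described above.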
      <p>Two hybrid architectures were implemented: CNN-BiGRU and CNN-BiLSTM. The input to both
models consisted of MFCC matrices, which were first processed by convolutional layers for spectral
feature extraction and then by recurrent blocks for modeling the temporal dynamics. The output
layers consisted of fully connected neurons with a sigmoid activation function that enabled binary
classification.</p>
      <p>Training was performed using the Binary Cross-Entropy loss function, Adam optimizer
(learning rate of 0.001, batch size of 64), and an early stopping strategy. Standard metrics were used
to evaluate the performance. Testing was conducted on a speaker-disjoint set to ensure no overlap
of speakers between the training and test subsets.</p>
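      <p>The shared model structure and training setup can be sketched in Keras as follows. The convolutional and recurrent layer sizes are illustrative assumptions chosen only to show the CNN → bidirectional-RNN → sigmoid pipeline (the paper reports only total parameter counts); the early-stopping patience and epoch budget are likewise placeholders.</p>

```python
import numpy as np
import tensorflow as tf

def build_vad(recurrent="gru", n_units=16):
    """CNN front end + bidirectional recurrent block + sigmoid output.

    recurrent="gru" gives the CNN-BiGRU variant, "lstm" the CNN-BiLSTM one.
    Layer sizes here are assumptions for illustration only.
    """
    rnn = tf.keras.layers.GRU if recurrent == "gru" else tf.keras.layers.LSTM
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=(24, 24, 1)),        # 24x24 MFCC matrix
        tf.keras.layers.Conv2D(8, 3, padding="same", activation="relu"),
        tf.keras.layers.MaxPool2D(2),                    # subsampling layer
        tf.keras.layers.Reshape((12, 96)),               # time steps x features
        tf.keras.layers.Bidirectional(rnn(n_units)),
        tf.keras.layers.Dense(1, activation="sigmoid"),  # speech / non-speech
    ])

model = build_vad("gru")
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss="binary_crossentropy", metrics=["accuracy"])
early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss",
                                              restore_best_weights=True)

# Tiny random batch just to demonstrate the training call; the actual
# training used the noisy KSC2 windows with batch size 64.
x = np.random.randn(64, 24, 24, 1).astype("float32")
y = np.random.randint(0, 2, size=(64, 1)).astype("float32")
model.fit(x, y, batch_size=64, epochs=1,
          validation_split=0.25, callbacks=[early_stop], verbose=0)
```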
    </sec>
    <sec id="sec-3">
      <title>3. Results</title>
      <p>The CNN-BiGRU and CNN-BiLSTM architectures were designed following a unified scheme in
which the convolutional layers extracted the time–frequency features, the subsampling layers
reduced the dimensionality, and the fully connected layers performed the final classification. The
key difference lies in the recurrent block. CNN-BiGRU employs bidirectional GRU layers (Figure 1),
which makes the model more compact and resource-efficient, with a total parameter count of
11,106.</p>
      <p>In contrast, CNN-BiLSTM employs bidirectional LSTM layers (Figure 2), which are better at
capturing long-term dependencies but make the model more resource intensive, increasing the
total number of parameters to 13,538.</p>
      <p>A comparison of the architectures shows that the main difference lies in the type of recurrent
block used. This difference affects the model complexity and computational cost during training,
whereas the overall network structure remains the same.</p>
      <p>The training characteristics are listed in Table 1. The same training conditions were applied to
both the models.</p>
      <p>The final accuracy values for both the training and test sets matched at 96%, indicating proper
convergence without signs of overfitting. Similarly, the loss function values were close, ranging
from 9.8% to 9.9% for both the training and test sets, confirming the stability of the training
process.</p>
      <p>Furthermore, a comparison of the results presented in Table 1 shows that both architectures
exhibit the same level of accuracy and training stability, differing only in terms of the number of
parameters. This provides equal initial conditions for the subsequent comparison of the models in
more complex scenarios, involving speech detection in noisy acoustic environments. Therefore,
further analysis focused on how CNN-BiGRU and CNN-BiLSTM behave under varying
signal-to-noise ratio conditions, as illustrated in Figures 3–6.
      <p>Figure 3 presents the results of an experiment in which CNN-BiGRU and CNN-BiLSTM were
trained at an SNR of –10 dB. Both models achieved high accuracy, close to the training value,
confirming their ability to reproduce the conditions on which they were trained. However, as the
SNR level increased, a sharp drop in the classification accuracy was observed. Notably, the
CNN-BiLSTM model appears to be more sensitive to changes in the acoustic environment, with its
accuracy declining more rapidly as the noise level deviates from the training conditions.
CNN-BiGRU demonstrated greater robustness, although it also showed performance degradation in the
range of positive SNR values.</p>
      <p>This behavior indicates the limited generalization capability of both architectures when trained
under a fixed low-SNR condition. The models tend to "overfit" a specific noise scenario and
interpret deviations from it as anomalies. This result confirms the necessity of training across a
wide range of SNR levels, which can significantly improve the robustness of the models to
variations in acoustic conditions.</p>
      <p>Figure 4 shows the results of training the CNN-BiGRU and CNN-BiLSTM models at a fixed SNR
level of 0 dB and testing them across the full SNR range. Both architectures achieved high accuracy
at positive SNR values, exceeding 95%. In the range from –5 to 10 dB, the performance curves of the
two models nearly coincide, indicating a similar ability to detect speech in moderately noisy
conditions.</p>
      <p>However, as the SNR continues to increase, a divergence is observed: the accuracy of
CNN-BiGRU remains stable, whereas CNN-BiLSTM shows a decline in performance beyond 15 dB,
indicating a higher sensitivity of this model to acoustic variations. This result confirms that BiGRU
exhibits better robustness to changing noise conditions, whereas BiLSTM requires more careful
tuning to maintain stable accuracy at higher SNR levels.</p>
      <p>Figure 5 presents the results of training the CNN-BiGRU and CNN-BiLSTM models at a fixed
SNR level of 10 dB and testing them across the full SNR range. Unlike the previous cases (–10 dB
and 0 dB), both architectures here exhibit nearly identical results, achieving an accuracy above 99%
at positive SNR values.</p>
      <p>The differences between the models become apparent only at high SNR levels (above 20 dB),
where BiGRU maintains an accuracy of approximately 99%, whereas BiLSTM exhibits a slight drop
in performance. Simultaneously, in the range of negative SNR values (–10 dB and below), both
architectures behave similarly, showing a steady increase in accuracy as the SNR level improves.</p>
      <p>Thus, when trained at 10 dB, both models effectively generalize across a wide range of noise
conditions; however, CNN-BiGRU demonstrated higher stability at the upper end of the SNR range.</p>
      <p>Figure 6 presents the results of training the CNN-BiGRU and CNN-BiLSTM models at a fixed
SNR level of 20 dB and testing them across the entire SNR range. Under these conditions, both
models showed similar results at positive SNR levels, with an accuracy exceeding 99%.
However, the differences become more pronounced when moving into the low-SNR region. The
CNN-BiGRU model exhibited a more stable increase in accuracy at negative SNR values, reaching
approximately 33% at –10 dB, whereas CNN-BiLSTM remained at approximately 20% under the
same conditions. This indicates that BiGRU has a superior ability to adapt to extreme noise
conditions when trained at high SNR levels.</p>
      <p>Thus, training at 20 dB allows both models to achieve maximum accuracy in the high-SNR
range; however, BiGRU demonstrates a clear advantage under severe noise conditions, providing a
smoother performance degradation.</p>
      <p>The conducted experiments showed that both architectures successfully handled the task of
speech detection under positive SNR conditions, delivering comparable results in the presence of
mild noise. The main differences emerged at low and negative SNR levels. CNN-BiGRU maintained
higher robustness, whereas CNN-BiLSTM was more sensitive to acoustic distortions.</p>
      <p>For a more detailed analysis of the effectiveness of the developed architectures, the first stage
involved studying the training dynamics. Figures 7 and 8 show the loss and accuracy curves for the
training and validation sets of the CNN-BiLSTM and CNN-BiGRU models, respectively.</p>
      <p>As can be seen from the presented graphs, both architectures exhibit the behavior characteristic
of deep neural networks: a gradual decrease in the loss function values accompanied by an increase
in classification accuracy. Notably, the gap between the training and validation sets remained
minimal throughout all training epochs. This indicates the absence of overfitting and confirms the
models’ ability to generalize the features extracted from the acoustic data. Thus, both architectures
demonstrated stability during the training process and can be considered robust solutions for the
task of speech detection in noisy conditions.</p>
      <p>For clarity, Table 2 summarizes the accuracy and F1-score values obtained when training at
fixed SNR levels and using the multi-SNR strategy. These data enable the comparison of the
performance of the models under conditions of mismatch between the training and test SNR levels
and confirm the effectiveness of the multi-SNR approach.</p>
      <p>As shown in Table 2, training at a single fixed SNR level resulted in a significant drop in
performance at other SNR levels, particularly in the negative range. In contrast, multi-SNR training
ensures nearly complete retention of performance across the entire SNR spectrum, with
CNN-BiGRU exhibiting slightly higher robustness than CNN-BiLSTM. These results confirm the
feasibility of using multilevel noise training and highlight the practical value of the CNN-BiGRU
architecture for VAD systems operating in noisy environments.</p>
      <p>However, prediction accuracy alone is not sufficient for an objective assessment of the
applicability of these architectures in real-world scenarios. Computational efficiency, particularly
the training time and data processing cost, plays a crucial role. Figure 9 presents a comparison of
the training durations of the CNN-BiGRU and CNN-BiLSTM models under identical experimental
conditions (10 epochs, batch size of 1024, and the same computational platform).</p>
      <p>The comparative analysis showed that the longest training time was observed for the
CNN-BiLSTM architecture (558 s), whereas CNN-BiGRU completed the training faster (within 525 s).
This difference can be explained by the greater structural complexity of the LSTM blocks, which
include additional memory control elements (state cells and gates) compared to GRU blocks. Thus,
BiGRU proves to be less computationally demanding while maintaining a nearly identical
performance.</p>
      <p>Figure 10 presents the normalized confusion matrices for the CNN+BiGRU and CNN+BiLSTM
models at an SNR of –6 dB. Each matrix illustrates the ratio of correctly and incorrectly classified
segments for the "speech" and "noise" classes.</p>
      <p>[Figure 10 panels: (a) CNN+BiGRU; (b) CNN+BiLSTM]</p>
      <p>For the CNN+BiGRU model, the classification accuracies for speech and noise segments were
92.81% and 97.79%, respectively. The proportion of false positives was 2.21%, and that of false
negatives was 7.19%.</p>
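      <p>The per-class rates reported above follow from a row-normalized confusion matrix, which can be computed as in this short sketch (a NumPy illustration; the function name and the 1 = speech labeling convention are our assumptions):</p>

```python
import numpy as np

def normalized_confusion(y_true, y_pred):
    """Row-normalized 2x2 confusion matrix for noise(0)/speech(1) labels.

    Row i gives the fraction of class-i segments assigned to each class, so
    matrix[1, 0] is the false-negative rate (missed speech) and
    matrix[0, 1] is the false-positive rate (noise flagged as speech).
    """
    cm = np.zeros((2, 2))
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm / cm.sum(axis=1, keepdims=True)
```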
      <p>For the CNN+BiLSTM model, the recognition accuracy for speech segments was 92.62%, and for
noise segments, 97.84%. The false positive rate was 2.16%, and the false negative rate was 7.38%.</p>
      <p>Taken together, the results indicate that both tested architectures demonstrate high
classification accuracy and robustness under noisy conditions. At the same time, CNN-BiGRU
shows a slight advantage in terms of the “performance-to-cost” ratio, making it a more suitable
choice in scenarios where computational resources are limited or a high training speed is required.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Discussion</title>
      <p>It was initially hypothesized that CNN-BiGRU, owing to its compactness and smaller number of
parameters, would provide a performance comparable to or superior to that of CNN-BiLSTM while
requiring fewer computational resources. The experiments confirmed this hypothesis: BiGRU
indeed demonstrated higher efficiency in terms of F1-score and Precision, while maintaining
comparable Accuracy and Recall values, and also exhibited shorter training time.</p>
      <p>The conducted experiments demonstrated that both architectures achieved high performance in
speech detection in noisy environments. The differences in metrics were minimal; the Accuracy
and Recall values were nearly identical, whereas CNN-BiGRU showed a slight advantage in terms
of F1-score and Precision. The analysis across different SNR levels confirmed that both models
performed reliably at positive SNR values; however, CNN-BiGRU exhibited greater robustness
under extremely low SNR conditions. Additionally, BiGRU requires less training time, which is
attributed to its more compact architecture and smaller number of parameters.</p>
      <p>The results of this study have practical implications for the development of voice activation and
speech detection systems that operate under high levels of background noise (e.g., automotive
interfaces, smart home systems, and telecommunication services). The more compact CNN-BiGRU
model can be deployed in resource-constrained environments, including mobile devices and
embedded systems. From a theoretical perspective, the findings confirm the importance of selecting
the appropriate recurrent block architecture when designing hybrid models, which should be
considered in future research on signal processing and machine learning.</p>
      <p>Future research may focus on expanding the range of architectures by incorporating
transformers and performing multilingual evaluations of the models using speech corpora in other
languages. Another promising direction is the study of energy efficiency when deploying models
on mobile devices, where computational constraints are particularly critical. Furthermore, adapting
the architectures to real-world acoustic scenarios with unpredictable noise would enhance their
applicability in industrial and consumer environments.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>This study presents a comparative analysis of hybrid CNN-BiGRU and CNN-BiLSTM models
applied to voice activity detection in noisy environments. The results demonstrated that both
architectures achieved high performance metrics at positive and moderate SNR levels; however,
CNN-BiGRU exhibited more stable behavior under extremely low SNR conditions, maintaining
acceptable accuracy and class balance. An additional advantage of this model is its smaller number
of parameters and reduced training time, making it a preferred solution for practical deployment in
scenarios with limited computational resources.</p>
      <p>The scientific novelty of this work lies in the comprehensive comparison of two recurrent
architectures within hybrid models for VAD under low-SNR conditions and in identifying the
advantages of using GRU blocks, which provide an optimal balance between recognition accuracy
and computational efficiency of the model. The practical significance lies in the possibility of
applying the proposed approach in mobile and embedded systems, as well as in intelligent voice
interaction services, where the combination of noise robustness and resource efficiency is critical.</p>
      <p>At the same time, this study has several limitations, primarily related to the use of
predominantly synthetically noised data and a limited set of acoustic scenarios. Future work should
expand the scope of experiments by incorporating testing on real-world recordings and exploring
additional architectural optimization techniques aimed at reducing computational costs while
maintaining high accuracy.</p>
      <p>Thus, the conducted analysis confirmed the hypothesis regarding the advantages of the
CNN-BiGRU architecture and outlined promising directions for the development of efficient and
noise-robust voice-activation systems.</p>
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used the Paperpal editing tool for grammar and
spelling checking. After using this tool, the authors reviewed and edited the content as needed and
take full responsibility for the publication’s content.</p>
      <p>This research was funded by the Science Committee of the Ministry of Science and Higher
Education of the Republic of Kazakhstan (Grant No. AP22684173), “Development of a highly
efficient neural network method for detecting voice activity at a low signal-to-noise ratio”.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name><given-names>C.-C.</given-names> <surname>Wang</surname></string-name>,
          <string-name><given-names>E.-L.</given-names> <surname>Yu</surname></string-name>,
          <string-name><given-names>J.-W.</given-names> <surname>Hung</surname></string-name>,
          <string-name><given-names>S.-C.</given-names> <surname>Huang</surname></string-name>,
          <string-name><given-names>B.</given-names> <surname>Chen</surname></string-name>,
          SincQDR-VAD:
          <article-title>A noise-robust voice activity detection framework leveraging learnable filters and ranking-aware optimization</article-title>
          ,
          <source>arXiv preprint arXiv:2508.20885</source>
          ,
          <year>2025</year>
          . Available: https://arxiv.org/abs/2508.20885.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Buddi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kumar</surname>
          </string-name>
          , et al.,
          <article-title>Comparative analysis of personalized voice activity detection systems: Assessing real-world effectiveness</article-title>
          ,
          <source>Proc. Interspeech</source>
          , Kos, Greece,
          <year>2024</year>
          , pp.
          <fpage>2135</fpage>
          -
          <lpage>2139</lpage>
          . Available: https://www.isca-archive.org/interspeech_2024/buddi24_interspeech.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>H. S.</given-names>
            <surname>Bovbjerg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Christensen</surname>
          </string-name>
          , et al.,
          <article-title>Noise-robust target-speaker voice activity detection through self-supervised pretraining</article-title>
          ,
          <source>arXiv preprint arXiv:2501.03184</source>
          ,
          <year>2025</year>
          . Available: https://arxiv.org/abs/2501.03184.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Li</surname>
          </string-name>
          , et al.,
          <article-title>sVAD: A robust, low-power, and light-weight voice activity detection with spiking neural networks</article-title>
          ,
          <source>arXiv preprint arXiv:2403.05772</source>
          ,
          <year>2024</year>
          . Available: https://arxiv.org/abs/2403.05772.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>T.</given-names>
            <surname>Braun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Tashev</surname>
          </string-name>
          ,
          <article-title>On training targets for noise-robust voice activity detection</article-title>
          ,
          <source>Proc. IEEE ICASSP</source>
          , Toronto, Canada,
          <year>2021</year>
          , pp.
          <fpage>6803</fpage>
          -
          <lpage>6807</lpage>
          . doi: 10.1109/ICASSP39728.2021.9414915.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>K.</given-names>
            <surname>Aghajani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. R.</given-names>
            <surname>Abutalebi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ghanbari</surname>
          </string-name>
          ,
          <article-title>Deep learning approach for robust voice activity detection</article-title>
          ,
          <source>Journal of Applied Digital Sciences</source>
          <volume>4</volume>
          (
          <issue>2</issue>
          ) (
          <year>2024</year>
          )
          <fpage>55</fpage>
          -
          <lpage>66</lpage>
          . Available: https://jad.shahroodut.ac.ir/article_3335_8731f540e6516b844b0e3b64b8931881.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>E.-L.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.-H.</given-names>
            <surname>Ho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-W.</given-names>
            <surname>Hung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.-C.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>Speaker conditional sinc-extractor for personal VAD</article-title>
          ,
          <source>Proc. Interspeech</source>
          , Kos, Greece,
          <year>2024</year>
          , pp.
          <fpage>2210</fpage>
          -
          <lpage>2214</lpage>
          . Available: https://www.isca-archive.org/interspeech_2024/yu24_interspeech.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>N.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          , et al.,
          <article-title>Robust voice activity detection using an auditory-inspired masked modulation encoder based convolutional attention network (AMME-CANet)</article-title>
          ,
          <source>Speech Communication</source>
          <volume>158</volume>
          (
          <year>2024</year>
          )
          103103
          . doi: 10.1016/j.specom.2024.103103.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <article-title>A robust and lightweight voice activity detection algorithm without prior noise knowledge</article-title>
          ,
          <source>Digital Signal Processing</source>
          <volume>145</volume>
          (
          <year>2023</year>
          )
          104151
          . doi: 10.1016/j.dsp.2023.104151.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>L.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <article-title>Personal voice activity detection with ultra-short reference speech</article-title>
          ,
          <source>Proc. APSIPA ASC</source>
          , Macao, China, Dec.
          <year>2024</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>B.</given-names>
            <surname>Medetov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zhetpisbayeva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Akhmediyarova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Nurlankyzy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Namazbayev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kulakayeva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Albanbay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Turdalyuly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Yskak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Uristimbek</surname>
          </string-name>
          ,
          <article-title>Evaluating the effectiveness of a voice activity detector based on various neural networks</article-title>
          ,
          <source>Eastern-European Journal of Enterprise Technologies</source>
          <volume>1</volume>
          (
          <issue>5</issue>
          ) (133) (
          <year>2025</year>
          )
          <fpage>19</fpage>
          -
          <lpage>28</lpage>
          . doi: 10.15587/1729-4061.2025.321659.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>A.</given-names>
            <surname>Kulakayeva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Mektep</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Nurlankyzy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Jakanova</surname>
          </string-name>
          ,
          <article-title>Analysis and prospects for restoring coverage in 5G NR RedCap</article-title>
          ,
          <source>Proc. IEEE 5th Int. Conf. on Smart Information Systems and Technologies (SIST)</source>
          , Astana, Kazakhstan,
          <year>2025</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          . doi: 10.1109/SIST61657.2025.11139297.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>