<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Voice Activity Detection on Italian Language</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Shibingfeng Zhang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gloria Gagliardi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fabio Tamburini</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>FICLIT, Alma Mater Studiorum - University of Bologna</institution>
          ,
          <addr-line>via Zamboni, 32, Bologna</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Voice Activity Detection (VAD) refers to the task of identifying human voice activity in noisy settings, playing a crucial role in fields like speech recognition and audio surveillance. However, most VAD research focuses on English, leaving other languages, such as Italian, under-explored. This study aims to evaluate and enhance VAD systems for Italian speech, with the goal of finding a solution for the speech segmentation component of the Digital Linguistic Biomarkers (DLBs) extraction pipeline for early mental disorder diagnosis. We experimented with various VAD systems and proposed an ensemble VAD system. Our ensemble system shows improvements in speech event detection. This advancement lays a robust foundation for more accurate early detection of mental health issues using DLBs in Italian.</p>
      </abstract>
      <kwd-group>
        <kwd>Voice Activity Detection</kwd>
        <kwd>Digital Linguistic Biomarkers</kwd>
        <kwd>Speech Processing</kwd>
        <kwd>Speech Segmentation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Voice Activity Detection (VAD) refers to the task of identifying the presence of human voice activity in noisy speech, classifying utterance segments as “speech” or “non-speech”. Typically, it involves making binary decisions on each frame of a noisy signal [1]. VAD has a wide range of applications, serving as a crucial component in various fields such as telecommunications, speech recognition systems, and audio surveillance. Nevertheless, the great majority of current works focus on the application of VAD to English, while there are many aspects that can affect the performance of transferring a VAD system from one language to another, potentially leading to suboptimal results. For instance, voice onset time may vary significantly between languages, affecting the system’s ability to detect speech activity accurately [2]. Additionally, differences in phonetic structures can further complicate the system’s effectiveness across languages. Given these factors, conducting research to evaluate various VAD systems on Italian speech would be highly valuable.</p>
      <p>Digital Linguistic Biomarkers (DLBs) are linguistic features automatically extracted directly from patients’ verbal productions that provide insights into their medical state [3]. Gagliardi and Tamburini [3] proposed the first DLBs extraction pipeline for the early diagnosis of mental disorders in Italian. The extraction of acoustic and rhythmic features relies heavily on the preprocessing step, which consists of speech segmentation via VAD. The VAD system adopted by Gagliardi and Tamburini [3] is a statistical VAD system named “SSVAD v1.0” [4], which will be presented and compared to other VAD systems in Section 2.</p>
      <p>In this project, we focus on VAD for the Italian language, an area that remains largely unexplored, aiming to find a VAD system that performs better and is more reliable than the one adopted in the original pipeline. The outcomes of this project will serve as a fundamental component in the pipeline for extracting DLBs, replacing the current VAD system. Moreover, our efforts will provide a robust foundation for future work in this domain, facilitating more accurate and early detection of mental health issues using linguistic biomarkers.</p>
      <p>Our main contributions are as follows:</p>
      <list list-type="bullet">
        <list-item><p>Testing and evaluating various VAD systems on Italian speech.</p></list-item>
        <list-item><p>Proposing an ensemble VAD system that achieves superior results.</p></list-item>
      </list>
      <p>This paper is structured into five sections. Section 2 presents the data resources and VAD systems leveraged in this work. Section 3 details the experiments and resources for testing VAD systems. Section 4 presents and discusses the experimental results. Finally, Section 5 draws conclusions.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Background</title>
      <p>This section outlines the background, state-of-the-art developments, and architectures of VAD systems.</p>
      <p>The majority of Voice Activity Detection (VAD) systems approach the task as a binary classification for each frame of a noisy audio signal, with or without overlaps between frames. Based on their architecture, these systems can generally be divided into two categories: statistical VAD systems and deep neural network (DNN) VAD systems.</p>
      <p>Statistical VAD systems rely on probabilistic models and statistical signal processing techniques to distinguish between speech and non-speech segments. Common statistical methods include Gaussian Mixture Models (GMM), Hidden Markov Models (HMM), and Bayesian frameworks. For example, Sohn et al. [5] proposed a robust statistical VAD system that models the signal using a first-order two-state HMM. In this system, the VAD score of each frame is calculated from the likelihood ratio between the probability density functions conditioned on two hypotheses: speech absent and speech present. Additionally, the state-transition probability is determined using the likelihood ratio from the previous frame, which helps maintain temporal coherence and improves the accuracy of voice activity detection.</p>
      <p>VAD systems based on DNNs, on the other hand, leverage the power of deep learning. These systems use neural network architectures such as convolutional neural networks (CNNs), recurrent neural networks (RNNs), or more advanced structures with attention mechanisms [6].</p>
      <p>Below, we present the list of the VAD systems we experimented with in this project, along with a brief description of each system.</p>
      <p>SSVAD v1.0 (Baseline) [4] is a statistical VAD system designed to handle low signal-to-noise ratio (SNR), impulsive noise, and cross-talk in interview-style speech files. The system enhances speech segments as a pre-processing step to improve the SNR, thereby facilitating subsequent speech/non-speech decisions. SSVAD v1.0 was previously integrated into the older version of the DLBs extraction pipeline [7] for speech segmentation and serves as the baseline for comparison with the other systems in this study.</p>
      <p>rVAD [8] is an unsupervised model comprising two denoising steps followed by a final VAD stage. In the first denoising step, high-energy noise segments are identified and nullified. The second step applies a speech enhancement method to further denoise the signal.</p>
      <p>Silero [9] is a pre-trained CNN system with an encoder-decoder architecture. Detailed information about this VAD system is limited, as it is closed source and undocumented.</p>
      <p>WebRTC VAD is a system developed by Google for the WebRTC project (<ext-link ext-link-type="uri" xlink:href="https://webrtc.org/">https://webrtc.org/</ext-link>). Similar to the Silero VAD system, it is closed source, and detailed information about its architecture is not publicly available.</p>
      <p>GPVAD [10] is a 5-layer framework composed of CNN and RNN layers. The proposed model employs a data-driven teacher-student learning paradigm for VAD, where a teacher model is initially trained on a source dataset with weak labels to handle vast and noisy audio data. The trained teacher model then provides frame-level guidance to a student model trained on various unlabeled target datasets.</p>
      <p>Context-aware VAD [11] is a self-attentive VAD system based on the Transformer architecture [12]. The proposed self-attentive VAD model processes acoustic features extracted from the audio input, enhancing them with contextual information from surrounding frames.</p>
      <p>Pyannote [13] is a pre-trained open-source toolkit for audio processing that includes a VAD model. Similar to GPVAD and Silero, it is a DNN-based model with CNN and RNN components.</p>
    </sec>
    <sec id="sec-2-2">
      <title>3. Experiments</title>
      <p>This section provides an overview of the experiments we conducted, the evaluation metrics applied, and the resources adopted for the experiments.</p>
      <sec id="sec-2-2-1">
        <title>3.1. Evaluation Dataset</title>
        <p>In this work, the CLIPS dataset (Corpora e Lessici dell’Italiano Parlato e Scritto, Italian for Corpora and Lexicons of Spoken and Written Italian; <ext-link ext-link-type="uri" xlink:href="http://www.clips.unina.it/it/">http://www.clips.unina.it/it/</ext-link>) [14] is adopted to evaluate the different VAD systems.</p>
        <p>CLIPS comprises approximately 100 hours of speech data, equally distributed between male and female voices. It includes a diverse range of regional and situational speech samples to ensure a comprehensive representation of the Italian language across different contexts. The CLIPS dataset is organized into five subsets, with the “DIALOGICO” and “LETTO” subsets offering complete temporal alignments between audio and textual transcription, totaling approximately 7.5 hours of test data. The “DIALOGICO” subset includes dialogues between two interlocutors, while the “LETTO” subset consists of recordings where words are read aloud from lists.</p>
      </sec>
      <sec id="sec-2-2-2">
        <title>3.2. Experiment Settings &amp; Evaluation</title>
        <p>To thoroughly evaluate the performance of the various VAD systems, we used two sets of metrics: segment-level metrics and event-level metrics. Segment-level metrics treat each 10 ms segment of audio (a single frame) independently, calculating metrics such as F1 score, precision, recall, error rate, and accuracy. Event-level metrics, on the other hand, consider each speech segment as a unit. A prediction is deemed correct if its overlap with the ground truth exceeds 50%, and the same metrics are calculated accordingly.</p>
        <p>Experiments were conducted on the CLIPS dataset using the VAD systems outlined in Section 2. To achieve optimal results, all systems were tested at their default frame size. Furthermore, we combined the systems’ predictions through different ensemble methods to enhance performance further. More details on these ensemble methods are provided in Section 4.2.</p>
      </sec>
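      <p>The two levels of evaluation described above can be sketched in a few lines of Python. This is an illustrative re-implementation, not the project’s actual evaluation code: the function names and the toy labels are ours, and the remaining metrics (error rate, accuracy) are omitted for brevity.</p>

```python
def segment_f1(ref, hyp):
    """Segment-level F1: compare aligned 10 ms frames (0/1 labels) one by one."""
    tp = sum(1 for r, h in zip(ref, hyp) if r == 1 and h == 1)
    fp = sum(1 for r, h in zip(ref, hyp) if r == 0 and h == 1)
    fn = sum(1 for r, h in zip(ref, hyp) if r == 1 and h == 0)
    prec = tp / (tp + fp) if tp + fp > 0 else 0.0
    rec = tp / (tp + fn) if tp + fn > 0 else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec > 0 else 0.0

def overlap(a, b):
    """Length (in seconds) of the intersection of two (start, end) intervals."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def event_correct(pred, refs, min_ratio=0.5):
    """Event-level decision: a predicted speech segment counts as correct when
    its overlap with some ground-truth segment exceeds 50% of its duration."""
    dur = pred[1] - pred[0]
    return any(overlap(pred, r) > min_ratio * dur for r in refs)

# toy example: four of six frames agree
print(round(segment_f1([0, 0, 1, 1, 1, 0], [0, 1, 1, 1, 0, 0]), 3))  # 0.667
print(event_correct((0.0, 1.0), [(0.4, 2.0)]))  # True: 0.6 s overlap
```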
    </sec>
    <sec id="sec-3">
      <title>4. Results</title>
      <sec id="sec-3-1">
        <p>This section presents and analyses the experimental results of the different VAD systems.</p>
        <sec id="sec-3-1-1">
          <title>4.1. Single Systems Evaluation</title>
          <p>As can be seen from the experimental results, the majority of the tested systems
outperformed the baseline system SSVAD used in the
current DLB pipeline at the segment level. A notable pattern
from the experiment results is that DNN-based systems,
such as Silero, GPVAD, and Pyannote, tend to achieve
better results compared to traditional statistical systems
like rVAD and SSVAD. However, context-aware VAD is
an exception, with an F1 score of 60.4, which is lower
than the baseline SSVAD score of 62.2. As for event-level
results, similar to the segment-level results, almost all
systems outperformed the baseline. DNN-based systems
tend to perform better, with Context-aware VAD being
again an exception, as its F1 score is the lowest among all
systems. The poor performance of Context-aware VAD
could be attributed to the fact that, unlike GPVAD and
Pyannote, it is trained only on the TIMIT [15] dataset
with additional background noise. The TIMIT dataset
is a relatively small English speech dataset, containing
only 5 hours of audio, likely causing the system to overfit
on this dataset. Another possible reason for this
relatively poor performance could be that, while Pyannote
and GPVAD are trained on multilingual datasets like
DIHARD III [16] and Audioset [17], Context-aware VAD is
trained solely on English speech. When tested on Italian
speech, the system could suffer from a domain shift, resulting
in diminished performance.</p>
          <p>To gain a better understanding of the diferences in
system performance, a Kruskal-Wallis test was conducted.
The results indicate that both the diferences between
segment-level results and event-level results are
significant. A Dunn’s test was then performed for post-hoc
comparisons. The statistical analysis demonstrates that
systems GPVAD, rVAD, Silero, and Pyannote exhibit
similar performance at both the segment and event levels,
while SSVAD, WebRTC, and Context-aware VAD show
significantly lower performance at both levels.</p>
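          <p>For reference, the Kruskal-Wallis H statistic used in this analysis can be computed from rank sums alone. The sketch below is illustrative only (the helper is ours, the tie correction is omitted, and the scores are made-up numbers rather than the per-file results of this study); with three groups, H is compared against the chi-squared critical value for 2 degrees of freedom, 5.991 at a significance level of 0.05.</p>

```python
def kruskal_h(*groups):
    """Kruskal-Wallis H statistic for k groups of scores (no tie correction)."""
    pooled = sorted((x, gi) for gi, g in enumerate(groups) for x in g)
    n = len(pooled)
    rank_sum = [0.0] * len(groups)
    for rank, (_, gi) in enumerate(pooled, start=1):  # ranks are 1-based
        rank_sum[gi] += rank
    return 12.0 / (n * (n + 1)) * sum(
        rs * rs / len(g) for rs, g in zip(rank_sum, groups)
    ) - 3 * (n + 1)

# made-up per-file F1 scores for three systems (illustration only)
sys1 = [0.90, 0.92, 0.91, 0.93]
sys2 = [0.88, 0.87, 0.89, 0.86]
sys3 = [0.60, 0.62, 0.61, 0.63]
h = kruskal_h(sys1, sys2, sys3)
print(round(h, 3))  # 9.846, above 5.991, so the groups differ significantly
```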
          <p>After considering the performance at diferent levels,
we tested all combination of three systems to form an
ensemble prediction system to generate more accurate
VAD results. The architectures of these ensemble systems
and the corresponding experimental results are discussed
in the following section.</p>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>4.2. Ensemble Systems Evaluation</title>
        <p>This section details the ensemble methods that combine the predictions of the systems tested in Section 4.1. It subsequently presents the experimental results and analysis.</p>
        <p>Of the systems presented in Section 2, Silero, Pyannote, GPVAD, and Context-aware VAD assign a score to each frame, with a threshold used for making predictions. The other systems do not generate such scores, either due to differences in their architecture or because they are closed source. This score can be interpreted as the probability of the frame being speech or not. We attempted to ensemble the systems’ predictions using both the probability scores and their final predictions. The major challenge faced by these ensemble methods is that each system uses a different frame size, which complicates achieving alignment for the ensemble system.</p>
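        <p>As a concrete illustration of how such a score sequence is turned into predictions, the following sketch (our own helper, not code from any of the systems above) thresholds each frame’s speech probability and merges runs of consecutive speech frames into (start, end) segments:</p>

```python
def scores_to_segments(scores, frame_size, threshold=0.5):
    """Threshold per-frame speech probabilities and merge runs of consecutive
    speech frames into (start_sec, end_sec) segments."""
    segments = []
    start = None
    for i, p in enumerate(scores):
        if p >= threshold:
            if start is None:
                start = i * frame_size  # a speech run begins here
        elif start is not None:
            segments.append((start, i * frame_size))  # the run just ended
            start = None
    if start is not None:  # audio ends while speech is still active
        segments.append((start, len(scores) * frame_size))
    return segments

# five 200 ms frames: frames 1-2 and frame 4 are classified as speech
segs = scores_to_segments([0.1, 0.8, 0.9, 0.2, 0.7], 0.2)
print([(round(s, 1), round(e, 1)) for s, e in segs])  # [(0.2, 0.6), (0.8, 1.0)]
```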
        <p>We proposed and tested several ensemble strategies:</p>
        <list list-type="bullet">
          <list-item><p>Probability Voting (PV): This method involves summing and averaging the probability scores from the different predictions.</p></list-item>
          <list-item><p>Probability Voting with Frame (PV_f): In this approach, each audio is first segmented into frames. For each frame, we identify all overlapping frames from all predictions, average their probability scores, and use this average as the probability score for the frame. The frame size of PV_f is 200 ms.</p></list-item>
          <list-item><p>Simple Voting with Frame (SV_f): Similar to PV_f, this method segments the audio into frames. However, instead of averaging probability scores, it performs simple majority voting based on the predictions of overlapping frames. The frame size of SV_f is 200 ms.</p></list-item>
          <list-item><p>Probability Voting with Weight (PV_w): This method is akin to PV_f but with a twist: the probability scores of overlapping frames from the three predictions are weighted according to their overlap percentage. These weighted scores are then summed to determine the probability score for each frame.</p></list-item>
          <list-item><p>Probability Voting with Sampling (PV_s): For a given audio, this method samples timestamps. For each timestamp, it calculates the mean of the probability scores from the three systems, using this mean as the probability score for the timestamp. The sampling rate of PV_s is approximately 33.33 Hz, meaning that one point is sampled every 0.03 seconds.</p></list-item>
          <list-item><p>Probability Voting with Bézier curve modelling (PV_b): For each prediction from each system, a Bézier curve is generated using control points sampled from the prediction. This approach aims to use a smooth curve to model the prediction and address the alignment issues caused by the different frame sizes of the systems. Similar to PV_f, each audio segment is divided into frames, and the probability score for each frame is the average of the scores estimated by the Bézier curves. The sampling rate of the control points used to generate the Bézier curve in PV_b is 5 Hz (0.2 seconds).</p></list-item>
        </list>
        <p>We experimented with all possible system combinations using the SV_f ensemble method, as well as all possible combinations of Silero, Pyannote, GPVAD, and Context-aware VAD using the other, probability-based ensemble methods, as these are the only systems that generate probability scores. For all probability-based methods, the “speech/non-speech” prediction for each frame is determined by applying a threshold of 0.5 to the probability score.</p>
        <p>Table 2 presents the results of all possible combinations composing the ensemble system using the SV_f method. Table 3 presents the results of all possible combinations composing the ensemble systems using the probability-score-related methods. The evaluation results are derived using the methods presented in Section 3.2.</p>
        <p>As shown in Table 2, the ensembles created using the SV_f method did not yield better results than the individual systems at the segment level. The highest segment-level score of 91.5 was achieved by the combination of GPVAD, Silero, and Pyannote, which is still 0.6 lower than the best performance of the Silero system alone. However, at the event level, the same combination achieved the highest score among all ensemble systems, with an F1 score of 84.0, which is higher than the best score achieved by a single system. Meanwhile, all other combinations yielded scores lower than the best performance of the individual systems.</p>
        <p>As shown in Table 3, the ensemble systems based on probability scores did not achieve prominently better results than the single systems at the segment level either, with the PV_s and PV_b systems of the combination Pyannote, GPVAD, Silero being higher only by a small margin of 0.6 compared to Silero. However, at the event level, several evident improvements can be observed in the performance of the ensemble systems. The probability-based ensemble systems combining Pyannote, GPVAD, and Silero, except for PV_b and PV, outperformed the single systems at the event level, with PV_f achieving an F1 score of 85.9, which is 5.6 points higher than that of Pyannote. This result demonstrates that the ensemble approach can lead to substantial performance gains in detecting the temporal interval in which speech takes place. It is worth noticing that the ensemble system PV_b consistently shows a great disparity between its performance at the segment level and the event level across all combinations. Despite its good performance at the segment level, PV_b achieves a rather low F1 score at the event level, far lower than all other systems. The disparity in performance at the different levels is likely caused by the insufficient number of control points adopted for generating the Bézier curve. However, increasing the number of control points is infeasible due to the computational complexity of evaluating the curve, which is O(n<sup>2</sup>), with n being the number of control points.</p>
        <p>Given that the ensemble systems composed of GPVAD, Silero, and Pyannote consistently outperformed the other combinations across all ensemble methods, a Kruskal-Wallis test, followed by Dunn’s post-hoc test, was conducted to assess the differences in performance between the ensemble methods and the individual systems GPVAD, Silero, and Pyannote. At the segment level, the Kruskal-Wallis test indicates that the differences are not significant. However, at the event level, the results reveal that PV_b’s performance is significantly lower compared to the other systems.</p>
        <p>In summary, given the performance of the systems, we plan to adopt PV_f as the speech segmentation component of the DLBs extraction pipeline, leveraging the combined predictions of Pyannote, Silero, and GPVAD. While PV_f shows slightly lower segment-level performance compared to the top-performing individual system, it enhances the accuracy in identifying speech intervals. This trade-off is justified by the substantial improvement in speech event detection performance.</p>
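        <p>For illustration, the PV_f strategy adopted above can be sketched as follows. This is a simplified re-implementation under our own assumptions, not the project code: each system’s output is represented as a list of (start, end, probability) spans at its native frame size, and each 200 ms frame of the ensemble receives the average score of all overlapping spans, thresholded at 0.5.</p>

```python
def pv_f(predictions, total_dur, frame=0.2, threshold=0.5):
    """Probability Voting with Frame: for each 200 ms frame, average the
    probability scores of every prediction span overlapping the frame,
    then apply a 0.5 threshold to obtain the speech/non-speech label.
    Each system's prediction is a list of (start_sec, end_sec, prob) spans."""
    n_frames = round(total_dur / frame)
    labels = []
    for i in range(n_frames):
        f_start = round(i * frame, 9)        # round() guards against float drift
        f_end = round((i + 1) * frame, 9)
        scores = [
            p
            for pred in predictions
            for (s, e, p) in pred
            if min(e, f_end) > max(s, f_start)  # span overlaps this frame
        ]
        avg = sum(scores) / len(scores) if scores else 0.0
        labels.append(1 if avg >= threshold else 0)
    return labels

# two toy systems with different span boundaries over a 1 s audio clip
sys_a = [(0.0, 0.4, 0.9), (0.4, 1.0, 0.2)]
sys_b = [(0.0, 0.6, 0.8), (0.6, 1.0, 0.1)]
print(pv_f([sys_a, sys_b], total_dur=1.0))  # [1, 1, 1, 0, 0]
```

        <p>PV_w and PV_s differ from this sketch only in how the overlapping scores are weighted or sampled, so the same alignment logic applies.</p>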
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Conclusions</title>
      <p>In this study, we explored and enhanced Voice Activity Detection systems for the Italian language, a relatively under-explored area in speech processing. We experimented with various systems and integrated systems</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgements</title>
      <p>This study was funded by the European Union – NextGenerationEU programme through the Italian National Recovery and Resilience Plan – NRRP (Mission 4 – Education and research), as a part of the project ReMind: an ecological, cost-effective AI platform for early detection of prodromal stages of cognitive impairment (PRIN 2022, 2022YKJ8FP – CUP J53D23008380006).</p>
    </sec>
    <sec id="sec-6">
      <title>CRediT Author Statement</title>
      <p>SZ: Investigation, Software, Formal analysis, Visualization, Writing - Original Draft. GG: Writing - Review &amp; Editing, Project administration, Funding acquisition. FT: Conceptualization, Methodology, Supervision, Writing - Review &amp; Editing.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <title>References</title>
      <ref id="ref-1"><label>[1]</label><mixed-citation>S. Graf, T. Herbig, M. Buck, G. Schmidt, Features for voice activity detection: a comparative analysis, EURASIP Journal on Advances in Signal Processing 2015 (2015) 1–15.</mixed-citation></ref>
      <ref id="ref-2"><label>[2]</label><mixed-citation>T. Cho, D. H. Whalen, G. Docherty, Voice onset time and beyond: Exploring laryngeal contrast in 19 languages, Journal of Phonetics 72 (2019) 52–65.</mixed-citation></ref>
      <ref id="ref-3"><label>[3]</label><mixed-citation>G. Gagliardi, F. Tamburini, The automatic extraction of linguistic biomarkers as a viable solution for the early diagnosis of mental disorders, in: Proceedings of the Thirteenth Language Resources and Evaluation Conference, European Language Resources Association, 2022, pp. 5234–5242.</mixed-citation></ref>
      <ref id="ref-4"><label>[4]</label><mixed-citation>M.-W. Mak, H.-B. Yu, A study of voice activity detection techniques for NIST speaker recognition evaluations, Computer Speech &amp; Language 28 (2014) 295–313.</mixed-citation></ref>
      <ref id="ref-5"><label>[5]</label><mixed-citation>J. Sohn, N. S. Kim, W. Sung, A statistical model-based voice activity detection, IEEE Signal Processing Letters 6 (1999) 1–3.</mixed-citation></ref>
      <ref id="ref-6"><label>[6]</label><mixed-citation>A. Sehgal, N. Kehtarnavaz, A convolutional neural network smartphone app for real-time voice activity detection, IEEE Access 6 (2018) 9017–9026.</mixed-citation></ref>
      <ref id="ref-7"><label>[7]</label><mixed-citation>L. Calzà, G. Gagliardi, R. R. Favretti, F. Tamburini, Linguistic features and automatic classifiers for identifying mild cognitive impairment and dementia, Computer Speech &amp; Language 65 (2021) 101113.</mixed-citation></ref>
      <ref id="ref-8"><label>[8]</label><mixed-citation>Z.-H. Tan, N. Dehak, et al., rVAD: An unsupervised segment-based robust voice activity detection method, Computer Speech &amp; Language 59 (2020) 1–21.</mixed-citation></ref>
      <ref id="ref-9"><label>[9]</label><mixed-citation>Silero Team, Silero VAD: pre-trained enterprise-grade voice activity detector (VAD), number detector and language classifier, https://github.com/snakers4/silero-vad, 2021.</mixed-citation></ref>
      <ref id="ref-10"><label>[10]</label><mixed-citation>H. Dinkel, S. Wang, X. Xu, M. Wu, K. Yu, Voice activity detection in the wild: A data-driven approach using teacher-student training, IEEE/ACM Transactions on Audio, Speech, and Language Processing 29 (2021) 1542–1555.</mixed-citation></ref>
      <ref id="ref-11"><label>[11]</label><mixed-citation>Y. R. Jo, Y. K. Moon, W. I. Cho, G. S. Jo, Self-attentive VAD: Context-aware detection of voice from noise, in: ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2021, pp. 6808–6812.</mixed-citation></ref>
      <ref id="ref-12"><label>[12]</label><mixed-citation>A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, Advances in Neural Information Processing Systems 30 (2017).</mixed-citation></ref>
      <ref id="ref-13"><label>[13]</label><mixed-citation>H. Bredin, R. Yin, J. M. Coria, G. Gelly, P. Korshunov, M. Lavechin, D. Fustes, H. Titeux, W. Bouaziz, M.-P. Gill, pyannote.audio: neural building blocks for speaker diarization, in: ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2020, pp. 7124–7128.</mixed-citation></ref>
      <ref id="ref-14"><label>[14]</label><mixed-citation>F. A. Leoni, F. Cutugno, R. Savy, V. Caniparoli, L. D’Anna, E. Paone, R. Giordano, O. Manfrellotti, M. Petrillo, A. De Rosa, Corpora e lessici dell’italiano parlato e scritto, 2007.</mixed-citation></ref>
      <ref id="ref-15"><label>[15]</label><mixed-citation>J. S. Garofolo, TIMIT acoustic phonetic continuous speech corpus, Linguistic Data Consortium, 1993 (1993).</mixed-citation></ref>
      <ref id="ref-16"><label>[16]</label><mixed-citation>N. Ryant, P. Singh, V. Krishnamohan, R. Varma, K. Church, C. Cieri, J. Du, S. Ganapathy, M. Liberman, The third DIHARD diarization challenge, arXiv preprint arXiv:2012.01477 (2020).</mixed-citation></ref>
      <ref id="ref-17"><label>[17]</label><mixed-citation>J. F. Gemmeke, D. P. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, M. Ritter, Audio Set: An ontology and human-labeled dataset for audio events, in: Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2017, pp. 776–780.</mixed-citation></ref>
    </ref-list>
  </back>
</article>