=Paper=
{{Paper
|id=Vol-3878/111_main_long
|storemode=property
|title=Voice Activity Detection on Italian Language
|pdfUrl=https://ceur-ws.org/Vol-3878/111_main_long.pdf
|volume=Vol-3878
|authors=Shibingfeng Zhang,Gloria Gagliardi,Fabio Tamburini
|dblpUrl=https://dblp.org/rec/conf/clic-it/ZhangGT24
}}
==Voice Activity Detection on Italian Language==
Shibingfeng Zhang, Gloria Gagliardi and Fabio Tamburini
FICLIT, Alma Mater Studiorum - University of Bologna, via Zamboni, 32, Bologna, Italy
Abstract
Voice Activity Detection (VAD) refers to the task of identifying human voice activity in noisy settings, playing a crucial
role in fields like speech recognition and audio surveillance. However, most VAD research focuses on English, leaving other
languages, such as Italian, under-explored. This study aims to evaluate and enhance VAD systems for Italian speech, with the
goal of finding a solution for the speech segmentation component of the Digital Linguistic Biomarkers (DLBs) extraction
pipeline for early mental disorder diagnosis. We experimented with various VAD systems and proposed an ensemble VAD
system. Our ensemble system shows improvements in speech event detection. This advancement lays a robust foundation for
more accurate early detection of mental health issues using DLBs in Italian.
Keywords
Voice Activity Detection, Digital Linguistic Biomarkers, Speech Processing, Speech Segmentation
1. Introduction

Voice Activity Detection (VAD) refers to the task of identifying the presence of human voice activity in noisy speech, classifying utterance segments as “speech” or “non-speech”. Typically, it involves making binary decisions on each frame of a noisy signal [1]. VAD has a wide range of applications, serving as a crucial component in fields such as telecommunications, speech recognition systems, and audio surveillance. Nevertheless, the great majority of current work focuses on the application of VAD to English, while many factors can affect the performance of a VAD system transferred from one language to another, potentially leading to suboptimal results. For instance, voice onset time may vary significantly between languages, affecting a system's ability to detect speech activity accurately [2]. Additionally, differences in phonetic structure can further complicate a system's effectiveness across languages. Given these factors, research evaluating various VAD systems on Italian speech is highly valuable.

Digital Linguistic Biomarkers (DLBs) are linguistic features automatically extracted from patients' verbal productions that provide insights into their medical state [3]. Gagliardi and Tamburini [3] proposed the first DLBs extraction pipeline for the early diagnosis of mental disorders in Italian. The extraction of acoustic and rhythmic features relies heavily on the preprocessing step, which consists of speech segmentation via VAD. The VAD system adopted by Gagliardi and Tamburini [3] is a statistical system named “SSVAD v1.0” [4], which will be presented and compared to other VAD systems in Section 2.

In this project, we focus on VAD for the Italian language, an area that remains largely unexplored, aiming to find a VAD system that performs better and is more reliable than the one adopted in the original pipeline. The outcomes of this project will serve as a fundamental component in the DLBs extraction pipeline, replacing the current VAD system. Moreover, our efforts will provide a robust foundation for future work in this domain, facilitating more accurate and earlier detection of mental health issues using linguistic biomarkers.

Our main contributions are as follows:

• Testing and evaluating various VAD systems on Italian speech.
• Proposing an ensemble VAD system that achieves superior results.

This paper is structured into five sections. Section 2 presents the data resources and VAD systems leveraged in this work. Section 3 details the experiments and resources for testing VAD systems. Section 4 presents and discusses the experimental results. Finally, Section 5 draws conclusions.

CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04–06, 2024, Pisa, Italy
shibingfeng.zhang@unibo.it (S. Zhang); gloria.gagliardi@unibo.it (G. Gagliardi); fabio.tamburini@unibo.it (F. Tamburini)
ORCID: 0009-0005-7320-9088 (S. Zhang); 0000-0001-5257-1540 (G. Gagliardi); 0000-0001-7950-0347 (F. Tamburini)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

2. Background

This section outlines the background, state-of-the-art developments, and architectures of VAD systems.

The majority of Voice Activity Detection (VAD) systems approach the task as a binary classification for each frame of a noisy audio signal, with or without overlaps between frames. Based on their architecture, these systems
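To make the frame-wise binary-decision framing concrete, a deliberately minimal energy-threshold VAD can be sketched as follows. This is a toy illustration only, not one of the systems evaluated in this paper; the 16 kHz sampling rate, 10 ms frame, and threshold value are our own assumptions:

```python
import numpy as np

def energy_vad(signal, frame_len=160, threshold=0.01):
    """Label each frame 1 (speech) or 0 (non-speech) by mean squared energy.
    frame_len=160 samples corresponds to 10 ms at a 16 kHz sampling rate."""
    n_frames = len(signal) // frame_len
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)
    energies = (frames ** 2).mean(axis=1)
    return (energies > threshold).astype(int)

# Near-silence followed by a louder sinusoid: only the second half should be flagged.
sr = 16000
t = np.arange(sr) / sr
audio = np.concatenate([0.001 * np.random.randn(sr), 0.5 * np.sin(2 * np.pi * 220 * t)])
labels = energy_vad(audio)
print(labels[:5], labels[-5:])
```

Real systems replace the raw energy feature with statistical models or neural networks, but the per-frame decision structure is the same.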
can generally be divided into two categories: statistical VAD systems and deep neural network (DNN) VAD systems.

Statistical VAD systems rely on probabilistic models and statistical signal processing techniques to distinguish between speech and non-speech segments. Common statistical methods include Gaussian Mixture Models (GMM), Hidden Markov Models (HMM), and Bayesian frameworks. For example, Sohn et al. [5] proposed a robust statistical VAD system that models the signal using a first-order two-state HMM. In this system, the VAD score of each frame is calculated from the likelihood ratio between the probability density functions conditioned on two hypotheses: speech absent and speech present. Additionally, the state-transition probability is determined using the likelihood ratio from the previous frame, which helps maintain temporal coherence and improves the accuracy of voice activity detection.

On the other hand, VAD systems based on DNNs leverage the power of deep learning. These systems use neural network architectures such as convolutional neural networks (CNNs), recurrent neural networks (RNNs), or more advanced structures with attention mechanisms [6].

Below, we present the VAD systems we experimented with in this project, along with a brief description of each:

SSVAD v1.0 (Baseline) [4] is a statistical VAD system designed to handle low signal-to-noise ratio (SNR), impulsive noise, and cross-talk in interview-style speech files. The system enhances speech segments as a pre-processing step to improve SNR, thereby facilitating subsequent speech/non-speech decisions. SSVAD v1.0 was previously integrated into the older version of the DLBs extraction pipeline [7] for speech segmentation, and it serves as the baseline for comparison with the other systems in this study.

rVAD [8] is an unsupervised model comprising two denoising steps followed by a final VAD stage. In the first denoising step, high-energy noise segments are identified and nullified. The second step applies a speech enhancement method to further denoise the signal.

Silero [9] is a pre-trained CNN system with an encoder-decoder architecture. Detailed information about this VAD system is limited, as it is closed source and undocumented.

WebRTC VAD is a system developed by Google for the WebRTC project (https://webrtc.org/). Similar to the Silero VAD system, it is closed source, and detailed information about its architecture is not publicly available.

GPVAD [10] is a 5-layer framework composed of CNN and RNN layers. The model employs a data-driven teacher-student learning paradigm for VAD, where a teacher model is initially trained on a source dataset with weak labels to handle vast and noisy audio data. The trained teacher model then provides frame-level guidance to a student model trained on various unlabeled target datasets.

Context-aware VAD [11] is a self-attentive VAD system based on the Transformer architecture [12]. The self-attentive VAD model processes acoustic features extracted from the audio input, enriching them with contextual information from surrounding frames.

Pyannote [13] is a pre-trained open-source toolkit for audio processing that includes a VAD model. Similar to GPVAD and Silero, it is a DNN-based model with CNN and RNN components.

3. Experiments

This section provides an overview of the experiments we conducted, the evaluation metrics applied, and the resources adopted for the experiments.

3.1. Evaluation Dataset

In this work, the CLIPS dataset (Corpora e Lessici dell'Italiano Parlato e Scritto, Italian for Corpora and Lexicons of Spoken and Written Italian; http://www.clips.unina.it/it/) [14] is adopted to evaluate the different VAD systems.

CLIPS comprises approximately 100 hours of speech data, equally distributed between male and female voices. It includes a diverse range of regional and situational speech samples to ensure a comprehensive representation of the Italian language across different contexts. The CLIPS dataset is organized into five subsets, with the “DIALOGICO” and “LETTO” subsets offering complete temporal alignments between audio and textual transcription, totaling approximately 7.5 hours of test data. The “DIALOGICO” subset includes dialogues between two interlocutors, while the “LETTO” subset consists of recordings where words are read aloud from lists.

3.2. Experiment Settings & Evaluation

To thoroughly evaluate the performance of the various VAD systems, we used two sets of metrics: segment-level metrics and event-level metrics. Segment-level metrics treat each 10 ms segment of audio (a single frame) independently, calculating metrics such as F1 score, precision, recall, error rate, and accuracy. Event-level metrics, on the other hand, consider each speech segment as a unit: a prediction is deemed correct if its overlap with the ground truth exceeds 50%, and the same metrics are calculated accordingly.
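The event-level matching rule above can be sketched in a few lines. This is an illustrative reimplementation under our own assumptions, not the evaluation code used in the experiments; in particular, we take the 50% overlap relative to the matched event's own duration, and events are assumed to be (start, end) times in seconds:

```python
def overlap(a, b):
    """Duration of the intersection of two (start, end) intervals, in seconds."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def event_hits(events, others):
    """Count events whose overlap with some event in `others`
    exceeds 50% of the event's own duration."""
    return sum(
        1 for ev in events
        if any(overlap(ev, other) > 0.5 * (ev[1] - ev[0]) for other in others)
    )

def event_f1(predictions, references):
    """Event-level F1: matched predictions give precision, matched references give recall."""
    if not predictions or not references:
        return 0.0
    precision = event_hits(predictions, references) / len(predictions)
    recall = event_hits(references, predictions) / len(references)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# One exactly matched speech event plus one spurious prediction
pred = [(0.0, 1.0), (2.0, 2.2)]
ref = [(0.0, 1.0)]
print(round(event_f1(pred, ref), 2))
```

Segment-level scoring, by contrast, simply compares the per-10 ms speech/non-speech labels directly.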
Experiments were conducted on the CLIPS dataset using the VAD systems outlined in Section 2. To achieve optimal results, all systems were tested with their default frame size. Furthermore, we combined the systems' predictions through different ensemble methods to enhance performance further. More details on these ensemble methods are provided in Section 4.2.

4. Results

This section presents and analyses the experimental results of the different VAD systems.

4.1. Single Systems Evaluation

Table 1 shows the experimental results obtained from the systems described in Section 2. The evaluation results are derived using the methods presented in Section 3.2.

Table 1
Results of the VAD experiment on different systems. For segment-level results, each 10 ms is considered one segment. For event-level results, a prediction is considered correct if its overlap with the ground truth exceeds 50%. The evaluation metric used is the F1 score.

Method             Segment-level  Event-level
Context-aware VAD  60.4           12.1
SSVAD (Baseline)   62.2           23.1
WebRTC             64.6           27.0
rVAD               69.5           72.2
GPVAD              89.5           72.3
Pyannote           92.3           80.3
Silero             92.5           80.1

As can be seen, the majority of the tested systems outperformed the baseline system SSVAD, used in the current DLB pipeline, at the segment level. A notable pattern in the experimental results is that DNN-based systems, such as Silero, GPVAD, and Pyannote, tend to achieve better results than traditional statistical systems like rVAD and SSVAD. However, Context-aware VAD is an exception, with an F1 score of 60.4, which is lower than the baseline SSVAD score of 62.2. As for event-level results, similar to the segment-level results, almost all systems outperformed the baseline. DNN-based systems tend to perform better, with Context-aware VAD again being an exception, as its F1 score is the lowest among all systems. The poor performance of Context-aware VAD could be attributed to the fact that, unlike GPVAD and Pyannote, it is trained only on the TIMIT [15] dataset with additional background noise. TIMIT is a relatively small English speech dataset, containing only 5 hours of audio, likely causing the system to overfit on this dataset. Another possible reason for this relatively poor performance could be that, while Pyannote and GPVAD are trained on multilingual datasets like DIHARD III [16] and AudioSet [17], Context-aware VAD is trained solely on English speech. When tested on Italian speech, the system may suffer a domain shift, resulting in diminished performance.

To gain a better understanding of the differences in system performance, a Kruskal-Wallis test was conducted. The results indicate that the differences between both the segment-level results and the event-level results are significant. A Dunn's test was then performed for post-hoc comparisons. The statistical analysis demonstrates that GPVAD, rVAD, Silero, and Pyannote exhibit similar performance at both the segment and event levels, while SSVAD, WebRTC, and Context-aware VAD show significantly lower performance at both levels.

After considering the performance at the different levels, we tested all combinations of three systems to form ensemble prediction systems and generate more accurate VAD results. The architectures of these ensemble systems and the corresponding experimental results are discussed in the following section.

4.2. Ensemble Systems Evaluation

This section details the ensemble methods that combine the predictions of the systems tested in Section 4.1. It subsequently presents the experimental results and analysis.

Of the systems presented in Section 2, Silero, Pyannote, GPVAD, and Context-aware VAD assign a score to each frame, with a threshold used for making predictions; this score can be interpreted as the probability of the frame being speech. The other systems do not generate such scores, either due to differences in their architecture or because they are closed source. We attempted to ensemble the systems' predictions using both the probability scores and the final predictions. The major challenge faced by these ensemble methods is that each system uses a different frame size, which complicates alignment within the ensemble.

We proposed and tested several ensemble strategies:

• Probability Voting (PV): This method involves summing and averaging the probability scores from the different predictions.
• Probability Voting with Frame (PV_f): In this approach, each audio file is first segmented into frames. For each frame, we identify all overlapping frames from all predictions, average their probability scores, and use this average as the probability score for the frame. The frame size of PV_f is 200 ms.
• Simple Voting with Frame (SV_f): Similar to PV_f, this method segments the audio into frames. However, instead of averaging probability scores, it performs simple majority voting based on the
predictions of overlapping frames. The frame size of SV_f is 200 ms.
• Probability Voting with Weight (PV_w): This method is akin to PV_f, but with a twist: the probability scores of overlapping frames from the three predictions are weighted according to their overlap percentage. These weighted scores are then summed to determine the probability score for each frame.
• Probability Voting with Sampling (PV_s): For a given audio file, this method samples timestamps. For each timestamp, it calculates the mean of the probability scores from the three systems, using this mean as the probability score for the timestamp. The sampling rate of PV_s is approximately 33.33 Hz, meaning that one point is sampled every 0.03 seconds.
• Probability Voting with Bézier curve modelling (PV_b): For each prediction from each system, a Bézier curve is generated using control points sampled from the prediction. This approach aims to use a smooth curve to model the prediction and address the alignment issues caused by the systems' different frame sizes. Similar to PV_f, each audio file is divided into frames, and the probability score for each frame is the average of the scores estimated by the Bézier curves. The sampling rate of the control points used to generate the Bézier curves in PV_b is 5 Hz (one point every 0.2 seconds).

We experimented with all possible system combinations using the SV_f ensemble method, as well as all possible combinations of Silero, Pyannote, GPVAD, and Context-aware VAD using the other, probability-based ensemble methods, as these are the only systems that generate probability scores. For all probability-based methods, the "speech/non-speech" prediction for each frame is determined by applying a threshold of 0.5 to the probability score.

Table 2 presents the results of all possible combinations composing the ensemble system with the SV_f method. Table 3 presents the results of all possible combinations composing the ensemble systems with the probability-score-based methods. The evaluation results are derived using the methods presented in Section 3.2.

As shown in Table 2, the ensembles created using the SV_f method did not yield better results than the individual systems at the segment level. The highest segment-level score of 91.5 was achieved by the combination of GPVAD, Silero, and Pyannote, which is still 0.6 lower than the best performance of the Silero system alone. However, at the event level, the same combination achieved the highest score among all ensemble systems, with an F1 score of 84.0, which is higher than the best score achieved by a single system. Meanwhile, all other combinations yielded scores lower than the best performance of the individual systems.

As shown in Table 3, the probability-score-based ensemble systems did not achieve prominently better results than the single systems at the segment level either, with the PV_s and PV_b variants of the Pyannote, GPVAD, Silero combination only slightly higher, by a small margin of 0.6, than Silero. However, at the event level, several clear improvements can be observed in the performance of the ensemble systems. The probability-based ensembles combining Pyannote, GPVAD, and Silero, except for PV_b and PV, outperformed the single systems at the event level, with PV_f achieving an F1 score of 85.9, which is 5.6 points higher than that of Pyannote. This result demonstrates that the ensemble approach can lead to substantial performance gains in detecting the temporal interval in which speech takes place. It is worth noticing that the ensemble system PV_b consistently shows a great disparity between its performance at the segment level and at the event level across all combinations. Despite its good segment-level performance, PV_b achieves a rather low F1 score at the event level, far lower than all other systems. This disparity is likely caused by the insufficient number of control points adopted for generating the Bézier curve. However, increasing the number of control points is infeasible due to the computational complexity of the curve, which is O(n²), with n being the number of control points.

Given that the ensemble systems composed of GPVAD, Silero, and Pyannote consistently outperformed the other combinations across all ensemble methods, a Kruskal-Wallis test, followed by Dunn's post-hoc test, was conducted to assess the differences in performance between the ensemble methods and the individual systems GPVAD, Silero, and Pyannote. At the segment level, the Kruskal-Wallis test indicates that the differences are not significant. However, at the event level, the results reveal that PV_b's performance is significantly lower compared to the other systems.

In summary, given the performance of the systems, we plan to adopt PV_f as the speech segmentation component of the DLBs extraction pipeline, leveraging the combined predictions of Pyannote, Silero, and GPVAD. While PV_f shows slightly lower segment-level performance compared to the top-performing individual system, it enhances the accuracy of identifying speech intervals. This trade-off is justified by the substantial improvement in speech event detection performance.
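The frame-based probability voting (PV_f) idea can be sketched as follows. This is an illustrative reimplementation under our own assumptions, not the authors' code: each system's output is represented as a list of (start, end, probability) triples, and the 200 ms frame, averaging, and 0.5 threshold follow the description above:

```python
def pv_f(system_outputs, total_dur, frame=0.2, threshold=0.5):
    """Frame-based probability voting: for each 200 ms frame, average the
    probabilities of all predictions (from every system) that overlap the
    frame, then threshold the mean to get a speech/non-speech decision."""
    n_frames = round(total_dur / frame)  # round, not int(): avoids float truncation
    decisions = []
    for i in range(n_frames):
        f_start, f_end = i * frame, (i + 1) * frame
        probs = [p for output in system_outputs
                 for (s, e, p) in output
                 if min(e, f_end) > max(s, f_start)]  # prediction overlaps frame
        mean = sum(probs) / len(probs) if probs else 0.0
        decisions.append(1 if mean >= threshold else 0)
    return decisions

# Three hypothetical systems scoring a 0.6 s clip with different frame sizes
sys_a = [(0.0, 0.4, 0.9), (0.4, 0.6, 0.2)]
sys_b = [(0.0, 0.3, 0.8), (0.3, 0.6, 0.1)]
sys_c = [(0.0, 0.2, 0.7), (0.2, 0.6, 0.3)]
print(pv_f([sys_a, sys_b, sys_c], total_dur=0.6))  # [1, 1, 0]
```

Because every prediction overlapping a frame contributes to that frame's mean, the common 200 ms grid absorbs the mismatched frame sizes of the individual systems, which is exactly the alignment problem described above.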
Table 2
Results of the VAD experiments using the SV_f ensemble method. For comparison, results from the individual systems that achieved the best performance, Silero and Pyannote, are also included. S stands for segment-level result; E stands for event-level result; C-a stands for the Context-aware VAD system. For segment-level results, each 10 ms is considered one segment. For event-level results, a prediction is considered correct if its overlap with the ground truth exceeds 50%. The evaluation metric used is the F1 score.

Involved Systems           S     E
Silero                     92.5  80.1
Pyannote                   92.3  80.3
GPVAD, Silero, Pyannote    91.5  84.0
GPVAD, C-a, WebRTC         58.4  62.0
GPVAD, SSVAD, C-a          66.0  17.6
GPVAD, SSVAD, WebRTC       58.9  76.6
Pyannote, C-a, WebRTC      60.6  70.1
Pyannote, GPVAD, C-a       81.5  42.1
Pyannote, GPVAD, SSVAD     83.3  58.1
Pyannote, GPVAD, WebRTC    61.3  55.3
Pyannote, SSVAD, C-a       68.6  17.7
Pyannote, SSVAD, WebRTC    60.9  72.6
SSVAD, C-a, WebRTC         47.0  29.8
Silero, C-a, WebRTC        60.7  70.0
Silero, GPVAD, C-a         81.8  43.1
Silero, GPVAD, SSVAD       83.6  57.7
Silero, GPVAD, WebRTC      61.4  59.9
Silero, Pyannote, C-a      84.4  52.5
Silero, Pyannote, SSVAD    85.9  68.7
Silero, Pyannote, WebRTC   62.0  47.9
Silero, SSVAD, C-a         68.8  17.5
Silero, SSVAD, WebRTC      60.8  73.0
rVAD, C-a, WebRTC          52.2  41.4
rVAD, GPVAD, C-a           71.1  29.0
rVAD, GPVAD, SSVAD         74.3  42.5
rVAD, GPVAD, WebRTC        58.4  79.3
rVAD, Pyannote, C-a        73.4  27.5
rVAD, Pyannote, GPVAD      83.5  75.1
rVAD, Pyannote, SSVAD      76.7  43.2
rVAD, Pyannote, WebRTC     60.8  58.7
rVAD, SSVAD, C-a           56.8  18.1
rVAD, SSVAD, WebRTC        54.0  63.0
rVAD, Silero, C-a          73.5  27.1
rVAD, Silero, GPVAD        83.6  73.5
rVAD, Silero, Pyannote     86.3  82.4
rVAD, Silero, SSVAD        76.8  42.2
rVAD, Silero, WebRTC       61.0  63.3

Table 3
Results of the VAD experiments using the probability-score-based ensemble methods. For comparison, results from the individual systems that achieved the best performance, Silero and Pyannote, are also included. Method stands for the ensemble method adopted; S stands for segment-level result; E stands for event-level result; C-a stands for the Context-aware VAD system. For segment-level results, each 10 ms is considered one segment. For event-level results, a prediction is considered correct if its overlap with the ground truth exceeds 50%. The evaluation metric used is the F1 score.

Involved Systems          Method  S     E
Silero                    -       92.5  80.1
Pyannote                  -       92.3  80.3
Pyannote, GPVAD, Silero   PV      91.5  67.9
Pyannote, GPVAD, Silero   PV_f    91.9  85.9
Pyannote, GPVAD, Silero   PV_s    93.1  81.8
Pyannote, GPVAD, Silero   PV_w    91.8  85.6
Pyannote, GPVAD, Silero   PV_b    93.0  9.5
Pyannote, GPVAD, C-a      PV      87.2  60.4
Pyannote, GPVAD, C-a      PV_f    87.6  80.0
Pyannote, GPVAD, C-a      PV_s    89.3  79.4
Pyannote, GPVAD, C-a      PV_w    87.5  79.2
Pyannote, GPVAD, C-a      PV_b    89.2  10.5
Silero, GPVAD, C-a        PV      85.4  50.6
Silero, GPVAD, C-a        PV_f    85.7  72.7
Silero, GPVAD, C-a        PV_s    84.2  67.3
Silero, GPVAD, C-a        PV_w    85.6  71.6
Silero, GPVAD, C-a        PV_b    88.8  11.0
Silero, Pyannote, C-a     PV      89.4  70.4
Silero, Pyannote, C-a     PV_f    89.6  81.2
Silero, Pyannote, C-a     PV_s    89.5  77.7
Silero, Pyannote, C-a     PV_w    89.6  81.5
Silero, Pyannote, C-a     PV_b    89.6  9.3

5. Conclusions

In this study, we explored and enhanced Voice Activity Detection systems for the Italian language, a relatively under-explored area in speech processing. We experimented with various systems and integrated them into an ensemble to improve detection accuracy. Our findings indicate that combining predictions from multiple models can lead to better results in detecting speech temporal intervals. This effective ensemble method will be used as a component of a Digital Linguistic Biomarkers extraction pipeline.

By enhancing the accuracy of speech segmentation, this method provides a more reliable foundation for extracting meaningful linguistic features for the diagnosis of cognitive impairment. Future research could focus on refining the ensemble method by incorporating additional linguistic features into VAD systems and exploring their synergistic effects. Additionally, investigating the application of this approach to other languages and dialects could expand its utility.

Acknowledgements

This study was funded by the European Union – NextGenerationEU programme through the Italian National Recovery and Resilience Plan – NRRP (Mission 4 – Education and research), as part of the project ReMind: an ecological, cost-effective AI platform for early detection of prodromal stages of cognitive impairment (PRIN 2022, 2022YKJ8FP – CUP J53D23008380006).

CRediT Author Statement

SZ: Investigation, Software, Formal analysis, Visualization, Writing - Original Draft. GG: Writing - Review & Editing, Project administration, Funding acquisition. FT: Conceptualization, Methodology, Supervision, Writing - Review & Editing.

References

[1] S. Graf, T. Herbig, M. Buck, G. Schmidt, Features for voice activity detection: a comparative analysis, EURASIP Journal on Advances in Signal Processing 2015 (2015) 1–15.
[2] T. Cho, D. H. Whalen, G. Docherty, Voice onset time and beyond: Exploring laryngeal contrast in 19 languages, Journal of Phonetics 72 (2019) 52–65.
[3] G. Gagliardi, F. Tamburini, The automatic extraction of linguistic biomarkers as a viable solution for the early diagnosis of mental disorders, in: Proceedings of the Thirteenth Language Resources and Evaluation Conference, European Language Resources Association, 2022, pp. 5234–5242.
[4] M.-W. Mak, H.-B. Yu, A study of voice activity detection techniques for NIST speaker recognition evaluations, Computer Speech & Language 28 (2014) 295–313.
[5] J. Sohn, N. S. Kim, W. Sung, A statistical model-based voice activity detection, IEEE Signal Processing Letters 6 (1999) 1–3.
[6] A. Sehgal, N. Kehtarnavaz, A convolutional neural network smartphone app for real-time voice activity detection, IEEE Access 6 (2018) 9017–9026.
[7] L. Calzà, G. Gagliardi, R. R. Favretti, F. Tamburini, Linguistic features and automatic classifiers for identifying mild cognitive impairment and dementia, Computer Speech & Language 65 (2021) 101113.
[8] Z.-H. Tan, N. Dehak, et al., rVAD: An unsupervised segment-based robust voice activity detection method, Computer Speech & Language 59 (2020) 1–21.
[9] Silero Team, Silero VAD: pre-trained enterprise-grade voice activity detector (VAD), number detector and language classifier, https://github.com/snakers4/silero-vad, 2021.
[10] H. Dinkel, S. Wang, X. Xu, M. Wu, K. Yu, Voice activity detection in the wild: A data-driven approach using teacher-student training, IEEE/ACM Transactions on Audio, Speech, and Language Processing 29 (2021) 1542–1555.
[11] Y. R. Jo, Y. K. Moon, W. I. Cho, G. S. Jo, Self-attentive VAD: Context-aware detection of voice from noise, in: ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2021, pp. 6808–6812.
[12] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, Advances in Neural Information Processing Systems 30 (2017).
[13] H. Bredin, R. Yin, J. M. Coria, G. Gelly, P. Korshunov, M. Lavechin, D. Fustes, H. Titeux, W. Bouaziz, M.-P. Gill, pyannote.audio: neural building blocks for speaker diarization, in: ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2020, pp. 7124–7128.
[14] F. A. Leoni, F. Cutugno, R. Savy, V. Caniparoli, L. D'Anna, E. Paone, R. Giordano, O. Manfrellotti, M. Petrillo, A. De Rosa, Corpora e lessici dell'italiano parlato e scritto, 2007.
[15] J. S. Garofolo, TIMIT acoustic phonetic continuous speech corpus, Linguistic Data Consortium, 1993.
[16] N. Ryant, P. Singh, V. Krishnamohan, R. Varma, K. Church, C. Cieri, J. Du, S. Ganapathy, M. Liberman, The third DIHARD diarization challenge, arXiv preprint arXiv:2012.01477 (2020).
[17] J. F. Gemmeke, D. P. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, M. Ritter, Audio Set: An ontology and human-labeled dataset for audio events, in: Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2017, pp. 776–780.