CEUR-WS Vol-3878, paper 111_main_long. PDF: https://ceur-ws.org/Vol-3878/111_main_long.pdf. DBLP: https://dblp.org/rec/conf/clic-it/ZhangGT24
Voice Activity Detection on Italian Language

Shibingfeng Zhang¹, Gloria Gagliardi¹ and Fabio Tamburini¹

¹ FICLIT, Alma Mater Studiorum - University of Bologna, via Zamboni, 32, Bologna, Italy

Abstract
Voice Activity Detection (VAD) refers to the task of identifying human voice activity in noisy settings, playing a crucial role in fields like speech recognition and audio surveillance. However, most VAD research focuses on English, leaving other languages, such as Italian, under-explored. This study aims to evaluate and enhance VAD systems for Italian speech, with the goal of finding a solution for the speech segmentation component of the Digital Linguistic Biomarkers (DLBs) extraction pipeline for early mental disorder diagnosis. We experimented with various VAD systems and proposed an ensemble VAD system. Our ensemble system shows improvements in speech event detection. This advancement lays a robust foundation for more accurate early detection of mental health issues using DLBs in Italian.

Keywords
Voice Activity Detection, Digital Linguistic Biomarkers, Speech Processing, Speech Segmentation



CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04–06, 2024, Pisa, Italy
shibingfeng.zhang@unibo.it (S. Zhang); gloria.gagliardi@unibo.it (G. Gagliardi); fabio.tamburini@unibo.it (F. Tamburini)
https://www.unibo.it/sitoweb/shibingfeng.zhang (S. Zhang); https://www.unibo.it/sitoweb/gloria.gagliardi (G. Gagliardi); https://www.unibo.it/sitoweb/fabio.tamburini (F. Tamburini)
ORCID: 0009-0005-7320-9088 (S. Zhang); 0000-0001-5257-1540 (G. Gagliardi); 0000-0001-7950-0347 (F. Tamburini)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction

Voice Activity Detection (VAD) refers to the task of identifying the presence of human voice activity in noisy speech, classifying utterance segments as "speech" or "non-speech". Typically, it involves making binary decisions on each frame of a noisy signal [1]. VAD has a wide range of applications, serving as a crucial component in fields such as telecommunications, speech recognition, and audio surveillance. Nevertheless, the great majority of current work focuses on applying VAD to English, while many factors can affect the performance of a VAD system transferred from one language to another, potentially leading to suboptimal results. For instance, voice onset time may vary significantly between languages, affecting a system's ability to detect speech activity accurately [2]. Additionally, differences in phonetic structure can further complicate a system's effectiveness across languages. Given these factors, evaluating VAD systems on Italian speech is highly valuable.

Digital Linguistic Biomarkers (DLBs) are linguistic features automatically extracted from patients' verbal productions that provide insights into their medical state [3]. Gagliardi and Tamburini [3] proposed the first DLB extraction pipeline for the early diagnosis of mental disorders in Italian. In that pipeline, the extraction of acoustic and rhythmic features relies heavily on a preprocessing step consisting of speech segmentation via VAD. The VAD system adopted by Gagliardi and Tamburini [3] is a statistical system named "SSVAD v1.0" [4], which will be presented and compared to other VAD systems in Section 2.

In this project, we focus on VAD for the Italian language, an area that remains largely unexplored, aiming to find a VAD system that performs better and is more reliable than the one adopted in the original pipeline. The outcome of this project will serve as a fundamental component of the DLB extraction pipeline, replacing the current VAD system. Moreover, our efforts will provide a robust foundation for future work in this domain, facilitating more accurate and earlier detection of mental health issues using linguistic biomarkers.

Our main contributions are as follows:

   • Testing and evaluating various VAD systems on Italian speech.
   • Proposing an ensemble VAD system that achieves superior results.

This paper is structured into five sections. Section 2 presents the data resources and VAD systems leveraged in this work. Section 3 details the experiments and resources for testing VAD systems. Section 4 presents and discusses the experimental results. Finally, Section 5 draws conclusions.

2. Background

This section outlines the background, state-of-the-art developments, and architectures of VAD systems.

The majority of Voice Activity Detection (VAD) systems approach the task as a binary classification for each frame of a noisy audio signal, with or without overlaps between frames. Based on their architecture, these systems




can generally be divided into two categories: statistical VAD systems and deep neural network (DNN) VAD systems.

Statistical VAD systems rely on probabilistic models and statistical signal processing techniques to distinguish between speech and non-speech segments. Common statistical methods include Gaussian Mixture Models (GMMs), Hidden Markov Models (HMMs), and Bayesian frameworks. For example, Sohn et al. [5] proposed a robust statistical VAD system that models the signal using a first-order two-state HMM. In this system, the VAD score of each frame is calculated from the likelihood ratio between the probability density functions conditioned on two hypotheses: speech absent and speech present. Additionally, the state-transition probability is determined using the likelihood ratio from the previous frame, which helps maintain temporal coherence and improves the accuracy of voice activity detection.

On the other hand, VAD systems based on DNNs leverage the power of deep learning. These systems use neural network architectures such as convolutional neural networks (CNNs), recurrent neural networks (RNNs), or more advanced structures with attention mechanisms [6].

Below, we present the VAD systems we experimented with in this project, along with a brief description of each:

SSVAD v1.0 (Baseline) [4] is a statistical VAD system designed to handle low signal-to-noise ratio (SNR), impulsive noise, and cross-talk in interview-style speech files. The system enhances speech segments as a preprocessing step to improve the SNR, thereby facilitating subsequent speech/non-speech decisions. SSVAD v1.0 was integrated into the previous version of the DLB extraction pipeline [7] for speech segmentation and serves as the baseline for comparison with the other systems in this study.

rVAD [8] is an unsupervised model comprising two denoising steps followed by a final VAD stage. In the first denoising step, high-energy noise segments are identified and nullified. The second step applies a speech enhancement method to further denoise the signal.

Silero [9] is a pre-trained CNN system with an encoder-decoder architecture. Detailed information about this VAD system is limited, as it is closed-source and undocumented.

WebRTC VAD is a system developed by Google for the WebRTC project¹. Similar to the Silero VAD system, it is closed-source and detailed information about its architecture is not publicly available.

GPVAD [10] is a 5-layer framework composed of CNN and RNN layers. The model employs a data-driven teacher-student learning paradigm for VAD, where a teacher model is initially trained on a source dataset with weak labels to handle vast and noisy audio data. The trained teacher model then provides frame-level guidance to a student model trained on various unlabeled target datasets.

Context-aware VAD [11] is a self-attentive VAD system based on the Transformer architecture [12]. The model processes acoustic features extracted from the audio input, enhancing them with contextual information from surrounding frames.

Pyannote [13] is a pre-trained open-source toolkit for audio processing that includes a VAD model. Similar to GPVAD and Silero, it is a DNN-based model with CNN and RNN components.

3. Experiments

This section provides an overview of the experiments we conducted, the evaluation metrics applied, and the resources adopted for the experiments.

3.1. Evaluation Dataset

In this work, the CLIPS dataset (Corpora e Lessici dell'Italiano Parlato e Scritto, Italian for Corpora and Lexicons of Spoken and Written Italian)² [14] is adopted to evaluate the different VAD systems.

CLIPS comprises approximately 100 hours of speech data, equally distributed between male and female voices. It includes a diverse range of regional and situational speech samples to ensure a comprehensive representation of the Italian language across different contexts. The CLIPS dataset is organized into five subsets, with the "DIALOGICO" and "LETTO" subsets offering complete temporal alignments between audio and textual transcription, totaling approximately 7.5 hours of test data. The "DIALOGICO" subset includes dialogues between two interlocutors, while the "LETTO" subset consists of recordings where words are read aloud from lists.

3.2. Experiment Settings & Evaluation

To thoroughly evaluate the performance of the various VAD systems, we used two sets of metrics: segment-level metrics and event-level metrics. Segment-level metrics treat each 10 ms segment of audio (a single frame) independently, calculating metrics such as F1 score, precision, recall, error rate, and accuracy. Event-level metrics, on the other hand, consider each speech segment as a unit. A prediction is deemed correct if its overlap with the ground truth exceeds 50%, and the same metrics are calculated accordingly.

¹ https://webrtc.org/
² http://www.clips.unina.it/it/
Experiments were conducted on the CLIPS dataset using the VAD systems outlined in Section 2. To achieve optimal results, all systems were tested with their default frame size. Furthermore, we combined the systems' predictions through different ensemble methods to enhance performance further. More details on these ensemble methods are provided in Section 4.2.

4. Results

This section presents and analyses the experimental results of the different VAD systems.

4.1. Single Systems Evaluation

Table 1 shows the experimental results obtained from the systems described in Section 2. The evaluation results are derived using the methods presented in Section 3.2.

Table 1
Results of the VAD experiment on different systems. For segment-level results, each 10 ms is considered one segment. For event-level results, a prediction is considered correct if its overlap with the ground truth exceeds 50%. The evaluation metric used is the F1 score.

  Method               Segment-level   Event-level
  Context-aware VAD         60.4          12.1
  SSVAD (Baseline)          62.2          23.1
  WebRTC                    64.6          27.0
  rVAD                      69.5          72.2
  GPVAD                     89.5          72.3
  Pyannote                  92.3          80.3
  Silero                    92.5          80.1

As can be seen, the majority of the tested systems outperformed the baseline system SSVAD, used in the current DLB pipeline, at the segment level. A notable pattern in the experimental results is that DNN-based systems, such as Silero, GPVAD, and Pyannote, tend to achieve better results than traditional statistical systems like rVAD and SSVAD. However, Context-aware VAD is an exception, with an F1 score of 60.4, which is lower than the baseline SSVAD score of 62.2. As for the event-level results, similar to the segment level, almost all systems outperformed the baseline. DNN-based systems tend to perform better, with Context-aware VAD again being an exception, as its F1 score is the lowest among all systems. The poor performance of Context-aware VAD could be attributed to the fact that, unlike GPVAD and Pyannote, it is trained only on the TIMIT [15] dataset with additional background noise. TIMIT is a relatively small English speech dataset, containing only 5 hours of audio, likely causing the system to overfit on this dataset. Another possible reason for this relatively poor performance could be that, while Pyannote and GPVAD are trained on multilingual datasets like DIHARD III [16] and Audioset [17], Context-aware VAD is trained solely on English speech. When tested on Italian speech, the system could suffer a domain shift, resulting in diminished performance.

To gain a better understanding of the differences in system performance, a Kruskal-Wallis test was conducted. The results indicate that the differences are significant at both the segment level and the event level. A Dunn's test was then performed for post-hoc comparisons. The statistical analysis demonstrates that GPVAD, rVAD, Silero, and Pyannote exhibit similar performance at both the segment and event levels, while SSVAD, WebRTC, and Context-aware VAD show significantly lower performance at both levels.

After considering the performance at the different levels, we tested all combinations of three systems to form an ensemble prediction system that generates more accurate VAD results. The architectures of these ensemble systems and the corresponding experimental results are discussed in the following section.

4.2. Ensemble Systems Evaluation

This section details the ensemble methods that combine the predictions of the systems tested in Section 4.1. It subsequently presents the experimental results and analysis.

Of the systems presented in Section 2, Silero, Pyannote, GPVAD, and Context-aware VAD assign a score to each frame, with a threshold used for making predictions. The other systems do not generate such scores, either due to differences in their architecture or because they are closed-source. This score can be interpreted as the probability of the frame containing speech. We attempted to ensemble the systems' predictions using both the probability scores and the final predictions. The major challenge faced by these ensemble methods is that each system uses a different frame size, which complicates alignment across systems.

We proposed and tested several ensemble strategies:

   • Probability Voting (PV): This method involves summing and averaging the probability scores from the different predictions.
   • Probability Voting with Frame (PV_f): In this approach, each audio file is first segmented into frames. For each frame, we identify all overlapping frames from all predictions, average their probability scores, and use this average as the probability score for the frame. The frame size of PV_f is 200 ms.
   • Simple Voting with Frame (SV_f): Similar to PV_f, this method segments audio into frames. However, instead of averaging probability scores, it performs simple majority voting based on the
predictions of the overlapping frames. The frame size of SV_f is 200 ms.
   • Probability Voting with Weight (PV_w): This method is akin to PV_f, but with a twist: the probability scores of overlapping frames from the three predictions are weighted according to their overlap percentage. These weighted scores are then summed to determine the probability score for each frame.
   • Probability Voting with Sampling (PV_s): For a given audio file, this method samples timestamps. For each timestamp, it calculates the mean of the probability scores from the three systems, using this mean as the probability score for the timestamp. The sampling rate of PV_s is approximately 33.33 Hz, meaning one point is sampled every 0.03 seconds.
   • Probability Voting with Bézier curve modelling (PV_b): For each prediction from each system, a Bézier curve is generated using control points sampled from the prediction. This approach aims to use a smooth curve to model the prediction and address the alignment issues caused by the systems' different frame sizes. Similar to PV_f, each audio file is divided into frames, and the probability score for each frame is the average of the scores estimated by the Bézier curves. The sampling rate of the control points used to generate the Bézier curves in PV_b is 5 Hz (one point every 0.2 seconds).

We experimented with all possible system combinations using the SV_f ensemble method, as well as all possible combinations of Silero, Pyannote, GPVAD, and Context-aware VAD using the other, probability-based ensemble methods, as these are the only systems that generate probability scores. For all probability-based methods, the "speech/non-speech" prediction for each frame is determined by applying a threshold of 0.5 to the probability score.

Table 2 presents the results of all possible combinations composing the ensemble system using the SV_f method. Table 3 presents the results of all possible combinations composing the ensemble systems using the probability-score-based methods. The evaluation results are derived using the methods presented in Section 3.2.

As shown in Table 2, the ensembles created using the SV_f method did not yield better results than the individual systems at the segment level. The highest segment-level score of 91.5 was achieved by the combination of GPVAD, Silero, and Pyannote, which is still 0.6 lower than the best performance of the Silero system alone. However, at the event level, the same combination achieved the highest score among all ensemble systems, with an F1 score of 84.0, which is higher than the best score achieved by a single system. Meanwhile, all other combinations yielded scores lower than the best performance of the individual systems.

As shown in Table 3, the probability-score-based ensemble systems did not achieve markedly better results than the single systems at the segment level either, with the PV_s and PV_b systems of the combination Pyannote, GPVAD, and Silero being only slightly higher, by a small margin of 0.6, compared to Silero. However, at the event level, several evident improvements can be observed in the performance of the ensemble systems. The probability-based ensemble systems combining Pyannote, GPVAD, and Silero, except for PV_b and PV, outperformed the single systems at the event level, with PV_f achieving an F1 score of 85.9, which is 5.6 points higher than that of Pyannote. This result demonstrates that the ensemble approach can lead to substantial performance gains in detecting the temporal interval in which speech takes place. It is worth noticing that the ensemble system PV_b consistently shows a great disparity between its performance at segment level and event level across all combinations. Despite its good performance at segment level, PV_b achieves a far lower F1 score at event level than all other systems. This disparity in performance between the levels is likely caused by the insufficient number of control points adopted for generating the Bézier curve. However, increasing the number of control points is infeasible due to the computational complexity of the curve, which is O(n²), with n being the number of control points.

Given that the ensemble systems composed of GPVAD, Silero, and Pyannote consistently outperformed the other combinations across all ensemble methods, a Kruskal-Wallis test, followed by Dunn's post-hoc test, was conducted to assess the differences in performance between the ensemble methods and the individual systems GPVAD, Silero, and Pyannote. At the segment level, the Kruskal-Wallis test indicates that the differences are not significant. However, at the event level, the results reveal that PV_b's performance is significantly lower than that of the other systems.

In summary, given the performance of the systems, we plan to adopt PV_f as the speech segmentation component of the DLB extraction pipeline, leveraging the combined predictions of Pyannote, Silero, and GPVAD. While PV_f shows slightly lower segment-level performance compared to the top-performing individual system, it enhances the accuracy in identifying speech intervals. This trade-off is justified by the substantial improvement in speech event detection performance.
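As an illustration of the frame-based voting idea, the sketch below pools per-frame probability scores from systems with different frame sizes onto a common 200 ms grid and thresholds the mean at 0.5, in the spirit of PV_f. It is our own simplification, not the authors' implementation: overlapping input frames are averaged unweighted (PV_w would weight them by overlap percentage instead).

```python
def frames_overlapping(frame_size, scores, start, end):
    """Scores of a system's frames that overlap the window [start, end)."""
    out = []
    for i, s in enumerate(scores):
        f_start, f_end = i * frame_size, (i + 1) * frame_size
        if f_start < end and f_end > start:  # non-empty temporal overlap
            out.append(s)
    return out

def pv_f(systems, duration, frame=0.2, threshold=0.5):
    """PV_f-style sketch: frame-based probability voting.

    `systems` is a list of (frame_size_in_seconds, per-frame speech
    probabilities). For each 200 ms output frame, every overlapping
    input-frame score across all systems is averaged, and the mean is
    thresholded at 0.5 to yield a speech/non-speech decision.
    """
    n_out = int(round(duration / frame))
    decisions = []
    for k in range(n_out):
        start, end = k * frame, (k + 1) * frame
        pooled = []
        for frame_size, scores in systems:
            pooled.extend(frames_overlapping(frame_size, scores, start, end))
        mean = sum(pooled) / len(pooled) if pooled else 0.0
        decisions.append(mean >= threshold)
    return decisions
```

For example, a system with 100 ms frames and one with 200 ms frames contribute three scores to each 200 ms output frame; the common grid is what sidesteps the frame-size mismatch described above.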
Table 2
Results of the VAD experiments using the SV_f ensemble method. For comparison, the results of the individual systems that achieved the best performance, Silero and Pyannote, are also included. S stands for segment-level result, E for event-level result, and C-a for the Context-aware VAD system. For segment-level results, each 10 ms is considered one segment. For event-level results, a prediction is considered correct if its overlap with the ground truth exceeds 50%. The evaluation metric used is the F1 score.

  Involved Systems             S      E
  Silero                     92.5   80.1
  Pyannote                   92.3   80.3
  GPVAD, Silero, Pyannote    91.5   84.0
  GPVAD, C-a, WebRTC         58.4   62.0
  GPVAD, SSVAD, C-a          66.0   17.6
  GPVAD, SSVAD, WebRTC       58.9   76.6
  Pyannote, C-a, WebRTC      60.6   70.1
  Pyannote, GPVAD, C-a       81.5   42.1
  Pyannote, GPVAD, SSVAD     83.3   58.1
  Pyannote, GPVAD, WebRTC    61.3   55.3
  Pyannote, SSVAD, C-a       68.6   17.7
  Pyannote, SSVAD, WebRTC    60.9   72.6
  SSVAD, C-a, WebRTC         47.0   29.8
  Silero, C-a, WebRTC        60.7   70.0
  Silero, GPVAD, C-a         81.8   43.1
  Silero, GPVAD, SSVAD       83.6   57.7
  Silero, GPVAD, WebRTC      61.4   59.9
  Silero, Pyannote, C-a      84.4   52.5
  Silero, Pyannote, SSVAD    85.9   68.7
  Silero, Pyannote, WebRTC   62.0   47.9
  Silero, SSVAD, C-a         68.8   17.5
  Silero, SSVAD, WebRTC      60.8   73.0
  rVAD, C-a, WebRTC          52.2   41.4
  rVAD, GPVAD, C-a           71.1   29.0
  rVAD, GPVAD, SSVAD         74.3   42.5
  rVAD, GPVAD, WebRTC        58.4   79.3
  rVAD, Pyannote, C-a        73.4   27.5
  rVAD, Pyannote, GPVAD      83.5   75.1
  rVAD, Pyannote, SSVAD      76.7   43.2
  rVAD, Pyannote, WebRTC     60.8   58.7
  rVAD, SSVAD, C-a           56.8   18.1
  rVAD, SSVAD, WebRTC        54.0   63.0
  rVAD, Silero, C-a          73.5   27.1
  rVAD, Silero, GPVAD        83.6   73.5
  rVAD, Silero, Pyannote     86.3   82.4
  rVAD, Silero, SSVAD        76.8   42.2
  rVAD, Silero, WebRTC       61.0   63.3

Table 3
Results of the VAD experiments using the probability-score-based ensemble methods. For comparison, the results of the individual systems that achieved the best performance, Silero and Pyannote, are also included. Method stands for the ensemble method adopted, S for segment-level result, E for event-level result, and C-a for the Context-aware VAD system. For segment-level results, each 10 ms is considered one segment. For event-level results, a prediction is considered correct if its overlap with the ground truth exceeds 50%. The evaluation metric used is the F1 score.

  Involved Systems          Method     S      E
  Silero                      -      92.5   80.1
  Pyannote                    -      92.3   80.3
  Pyannote, GPVAD, Silero    PV      91.5   67.9
  Pyannote, GPVAD, Silero    PV_f    91.9   85.9
  Pyannote, GPVAD, Silero    PV_s    93.1   81.8
  Pyannote, GPVAD, Silero    PV_w    91.8   85.6
  Pyannote, GPVAD, Silero    PV_b    93.0    9.5
  Pyannote, GPVAD, C-a       PV      87.2   60.4
  Pyannote, GPVAD, C-a       PV_f    87.6   80.0
  Pyannote, GPVAD, C-a       PV_s    89.3   79.4
  Pyannote, GPVAD, C-a       PV_w    87.5   79.2
  Pyannote, GPVAD, C-a       PV_b    89.2   10.5
  Silero, GPVAD, C-a         PV      85.4   50.6
  Silero, GPVAD, C-a         PV_f    85.7   72.7
  Silero, GPVAD, C-a         PV_s    84.2   67.3
  Silero, GPVAD, C-a         PV_w    85.6   71.6
  Silero, GPVAD, C-a         PV_b    88.8   11.0
  Silero, Pyannote, C-a      PV      89.4   70.4
  Silero, Pyannote, C-a      PV_f    89.6   81.2
  Silero, Pyannote, C-a      PV_s    89.5   77.7
  Silero, Pyannote, C-a      PV_w    89.6   81.5
  Silero, Pyannote, C-a      PV_b    89.6    9.3

5. Conclusions

In this study, we explored and enhanced Voice Activity Detection systems for the Italian language, a relatively under-explored area in speech processing. We experimented with various systems and integrated them into an ensemble to improve detection accuracy. Our findings indicate that combining the predictions of multiple models can lead to better results in detecting speech temporal intervals. This effective ensemble method will be used as a component of a Digital Linguistic Biomarkers extraction pipeline.

By enhancing the accuracy of speech segmentation, this method provides a more reliable foundation for extracting meaningful linguistic features for the diagnosis of cognitive impairment. Future research could focus on refining the ensemble method by incorporating additional linguistic features into VAD systems and exploring their synergistic effects. Additionally, investigating the application of this approach to other languages and dialects could expand its utility.

Acknowledgements

This study was funded by the European Union – NextGenerationEU programme through the Italian National Re-
covery and Resilience Plan – NRRP (Mission 4 – Education and research), as a part of the project ReMind: an ecological, cost-effective AI platform for early detection of prodromal stages of cognitive impairment (PRIN 2022, 2022YKJ8FP – CUP J53D23008380006).

CRediT Author Statement

SZ: Investigation, Software, Formal analysis, Visualization, Writing - Original Draft. GG: Writing - Review & Editing, Project administration, Funding acquisition. FT: Conceptualization, Methodology, Supervision, Writing - Review & Editing.

References

 [1] S. Graf, T. Herbig, M. Buck, G. Schmidt, Features for voice activity detection: a comparative analysis, EURASIP Journal on Advances in Signal Processing 2015 (2015) 1–15.
 [2] T. Cho, D. H. Whalen, G. Docherty, Voice onset time and beyond: Exploring laryngeal contrast in 19 languages, Journal of Phonetics 72 (2019) 52–65.
 [3] G. Gagliardi, F. Tamburini, The automatic extraction of linguistic biomarkers as a viable solution for the early diagnosis of mental disorders, in: Proceedings of the Thirteenth Language Resources and Evaluation Conference, European Language Resources Association, 2022, pp. 5234–5242.
 [4] M.-W. Mak, H.-B. Yu, A study of voice activity detection techniques for NIST speaker recognition evaluations, Computer Speech & Language 28 (2014) 295–313.
 [5] J. Sohn, N. S. Kim, W. Sung, A statistical model-based voice activity detection, IEEE Signal Processing Letters 6 (1999) 1–3.
 [6] A. Sehgal, N. Kehtarnavaz, A convolutional neural network smartphone app for real-time voice activity detection, IEEE Access 6 (2018) 9017–9026.
 [7] L. Calzà, G. Gagliardi, R. R. Favretti, F. Tamburini, Linguistic features and automatic classifiers for identifying mild cognitive impairment and dementia, Computer Speech & Language 65 (2021) 101113.
 [8] Z.-H. Tan, N. Dehak, et al., rVAD: An unsupervised segment-based robust voice activity detection method, Computer Speech & Language 59 (2020) 1–21.
 [9] Silero Team, Silero VAD: pre-trained enterprise-grade voice activity detector (VAD), number detector and language classifier, https://github.com/snakers4/silero-vad, 2021.
[10] H. Dinkel, S. Wang, X. Xu, M. Wu, K. Yu, Voice activity detection in the wild: A data-driven approach using teacher-student training, IEEE/ACM Transactions on Audio, Speech, and Language Processing 29 (2021) 1542–1555.
[11] Y. R. Jo, Y. K. Moon, W. I. Cho, G. S. Jo, Self-attentive VAD: Context-aware detection of voice from noise, in: ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2021, pp. 6808–6812.
[12] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, Advances in Neural Information Processing Systems 30 (2017).
[13] H. Bredin, R. Yin, J. M. Coria, G. Gelly, P. Korshunov, M. Lavechin, D. Fustes, H. Titeux, W. Bouaziz, M.-P. Gill, pyannote.audio: neural building blocks for speaker diarization, in: ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2020, pp. 7124–7128.
[14] F. A. Leoni, F. Cutugno, R. Savy, V. Caniparoli, L. D'Anna, E. Paone, R. Giordano, O. Manfrellotti, M. Petrillo, A. De Rosa, Corpora e lessici dell'italiano parlato e scritto, 2007.
[15] J. S. Garofolo, TIMIT acoustic phonetic continuous speech corpus, Linguistic Data Consortium, 1993 (1993).
[16] N. Ryant, P. Singh, V. Krishnamohan, R. Varma, K. Church, C. Cieri, J. Du, S. Ganapathy, M. Liberman, The third DIHARD diarization challenge, arXiv preprint arXiv:2012.01477 (2020).
[17] J. F. Gemmeke, D. P. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, M. Ritter, Audio Set: An ontology and human-labeled dataset for audio events, in: Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2017, pp. 776–780.