<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Voice Activity Detection on Italian Language</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Shibingfeng Zhang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gloria Gagliardi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fabio Tamburini</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>FICLIT, Alma Mater Studiorum - University of Bologna</institution>
          ,
          <addr-line>via Zamboni, 32, Bologna</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Voice Activity Detection (VAD) refers to the task of identifying human voice activity in noisy settings, playing a crucial role in fields like speech recognition and audio surveillance. However, most VAD research focuses on English, leaving other languages, such as Italian, under-explored. This study aims to evaluate and enhance VAD systems for Italian speech, with the goal of finding a solution for the speech segmentation component of the Digital Linguistic Biomarkers (DLBs) extraction pipeline for early mental disorder diagnosis. We experimented with various VAD systems and proposed an ensemble VAD system. Our ensemble system shows improvements in speech event detection. This advancement lays a robust foundation for more accurate early detection of mental health issues using DLBs in Italian.</p>
      </abstract>
      <kwd-group>
        <kwd>Voice Activity Detection</kwd>
        <kwd>Digital Linguistic Biomarkers</kwd>
        <kwd>Speech Processing</kwd>
        <kwd>Speech Segmentation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Voice Activity Detection (VAD) refers to the task of identifying the presence of human voice activity in noisy speech, classifying utterance segments as “speech” or “non-speech”. Typically, it involves making binary decisions on each frame of a noisy signal [1]. VAD has a wide range of applications, serving as a crucial component in various fields such as telecommunications, speech recognition systems, and audio surveillance. Nevertheless, the great majority of current works focus on the application of VAD to English, while there are many aspects that can affect the performance of transferring a VAD system from one language to another, potentially leading to suboptimal results. For instance, voice onset time may vary significantly between languages, affecting the system’s ability to detect speech activity accurately [2]. Additionally, differences in phonetic structures can further complicate the system’s effectiveness across languages. Given these factors, conducting research to evaluate various VAD systems on Italian speech would be highly valuable.</p>
      <p>Digital Linguistic Biomarkers (DLBs) are linguistic features automatically extracted directly from patients’ verbal productions that provide insights into their medical state [3]. Gagliardi and Tamburini [3] proposed the first DLBs extraction pipeline for the early diagnosis of mental disorders in Italian. The extraction of acoustic and rhythmic features relies heavily on the preprocessing step, which consists of speech segmentation via VAD. The VAD system adopted by Gagliardi and Tamburini [3] is a statistical VAD system named “SSVAD v1.0” [4], which will be presented and compared to other VAD systems in Section 2.</p>
      <p>In this project, we focus on VAD for the Italian language, an area that remains largely unexplored, aiming to find a VAD system that performs better and is more reliable than the one adopted in the original pipeline. The outcomes of this project will serve as a fundamental component in the pipeline for extracting DLBs, replacing the current VAD system. Moreover, our efforts will provide a robust foundation for future work in this domain, facilitating more accurate and early detection of mental health issues using linguistic biomarkers.</p>
      <p>Our main contributions are as follows:</p>
      <list list-type="bullet">
        <list-item><p>Testing and evaluating various VAD systems on Italian speech.</p></list-item>
        <list-item><p>Proposing an ensemble VAD system that achieves superior results.</p></list-item>
      </list>
      <p>This paper is structured into five sections. Section 2 presents the data resources and VAD systems leveraged in this work. Section 3 details the experiments and resources for testing VAD systems. Section 4 presents and discusses the experimental results. Finally, Section 5 draws conclusions.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Background</title>
      <p>This section outlines the background, state-of-the-art developments, and architectures of VAD systems.</p>
      <p>The majority of Voice Activity Detection (VAD) systems approach the task as a binary classification for each frame of a noisy audio signal, with or without overlaps between frames. Based on their architecture, these systems can generally be divided into two categories: statistical VAD systems and deep neural network (DNN) VAD systems.</p>
      <p>Statistical VAD systems rely on probabilistic models and statistical signal processing techniques to distinguish between speech and non-speech segments. Common statistical methods include Gaussian Mixture Models (GMM), Hidden Markov Models (HMM), and Bayesian frameworks. For example, Sohn et al. [5] proposed a robust statistical VAD system that models the signal using a first-order two-state HMM. In this system, the VAD score of each frame is calculated from the likelihood ratio between the probability density functions conditioned on two hypotheses: speech absent and speech present. Additionally, the state-transition probability is determined using the likelihood ratio from the previous frame, which helps maintain temporal coherence and improves the accuracy of voice activity detection.</p>
      <p>VAD systems based on DNNs, on the other hand, leverage the power of deep learning. These systems use neural network architectures such as convolutional neural networks (CNNs), recurrent neural networks (RNNs), or more advanced structures with attention mechanisms [6].</p>
      <p>Below, we present the list of the VAD systems we experimented with in this project, along with a brief description of each system.</p>
      <p>SSVAD v1.0 (Baseline) [4] is a statistical VAD system designed to handle low signal-to-noise ratio (SNR), impulsive noise, and cross-talk in interview-style speech files. The system enhances speech segments as a pre-processing step to improve the SNR, thereby facilitating subsequent speech/non-speech decisions. SSVAD v1.0 was previously integrated into the older version of the DLBs extraction pipeline [7] for speech segmentation and serves as the baseline for comparison with the other systems in this study.</p>
      <p>rVAD [8] is an unsupervised model comprising two denoising steps followed by a final VAD stage. In the first denoising step, high-energy noise segments are identified and nullified. The second step applies a speech enhancement method to further denoise the signal.</p>
      <p>Silero [9] is a pre-trained CNN system with an encoder-decoder architecture. Detailed information about this VAD system is limited, as it is closed source and undocumented.</p>
      <p>WebRTC VAD is a system developed by Google for the WebRTC project (<ext-link ext-link-type="uri" xlink:href="https://webrtc.org/">https://webrtc.org/</ext-link>). Similar to the Silero VAD system, it is closed source, and detailed information about its architecture is not publicly available.</p>
      <p>GPVAD [10] is a 5-layer framework composed of CNN and RNN layers. The proposed model employs a data-driven teacher-student learning paradigm for VAD, where a teacher model is initially trained on a source dataset with weak labels to handle vast and noisy audio data. The trained teacher model then provides frame-level guidance to a student model trained on various unlabeled target datasets.</p>
      <p>Context-aware VAD [11] is a self-attentive VAD system based on the Transformer architecture [12]. The proposed self-attentive VAD model processes acoustic features extracted from the audio input, enhancing them with contextual information from surrounding frames.</p>
      <p>Pyannote [13] is a pre-trained open-source toolkit for audio processing that includes a VAD model. Similar to GPVAD and Silero, it is a DNN-based model with CNN and RNN components.</p>
    </sec>
    <sec id="sec-2-2">
      <title>3. Experiments</title>
      <p>This section provides an overview of the experiments we conducted, the evaluation metrics applied, and the resources adopted for the experiments.</p>
      <sec id="sec-2-2-1">
        <title>3.1. Evaluation Dataset</title>
        <p>In this work, the CLIPS dataset (Corpora e Lessici dell’Italiano Parlato e Scritto, Italian for Corpora and Lexicons of Spoken and Written Italian; <ext-link ext-link-type="uri" xlink:href="http://www.clips.unina.it/it/">http://www.clips.unina.it/it/</ext-link>) [14] is adopted to evaluate the different VAD systems.</p>
        <p>CLIPS comprises approximately 100 hours of speech data, equally distributed between male and female voices. It includes a diverse range of regional and situational speech samples to ensure a comprehensive representation of the Italian language across different contexts. The CLIPS dataset is organized into five subsets, with the “DIALOGICO” and “LETTO” subsets offering complete temporal alignments between audio and textual transcription, totaling approximately 7.5 hours of test data. The “DIALOGICO” subset includes dialogues between two interlocutors, while the “LETTO” subset consists of recordings where words are read aloud from lists.</p>
      </sec>
      <sec id="sec-2-2-2">
        <title>3.2. Experiment Settings &amp; Evaluation</title>
        <p>To thoroughly evaluate the performance of the various VAD systems, we used two sets of metrics: segment-level metrics and event-level metrics. Segment-level metrics treat each 10 ms segment of audio (a single frame) independently, calculating metrics such as F1 score, precision, recall, error rate, and accuracy. Event-level metrics, on the other hand, consider each speech segment as a unit. A prediction is deemed correct if its overlap with the ground truth exceeds 50%, and the same metrics are calculated accordingly.</p>
        <p>Experiments were conducted on the CLIPS dataset using the VAD systems outlined in Section 2. To achieve optimal results, all systems were tested at their default frame size. Furthermore, we combined the systems’ predictions through different ensemble methods to enhance performance further. More details on these ensemble methods are provided in Section 4.2.</p>
      </sec>
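      <p>The two levels of evaluation described above can be sketched in a few lines of Python. This is an illustrative re-implementation, not the project’s actual evaluation code: the function names and the toy labels are ours, and the remaining metrics (error rate, accuracy) are omitted for brevity.</p>

```python
def segment_f1(ref, hyp):
    """Segment-level F1: compare aligned 10 ms frames (0/1 labels) one by one."""
    tp = sum(1 for r, h in zip(ref, hyp) if r == 1 and h == 1)
    fp = sum(1 for r, h in zip(ref, hyp) if r == 0 and h == 1)
    fn = sum(1 for r, h in zip(ref, hyp) if r == 1 and h == 0)
    prec = tp / (tp + fp) if tp + fp > 0 else 0.0
    rec = tp / (tp + fn) if tp + fn > 0 else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec > 0 else 0.0

def overlap(a, b):
    """Length (in seconds) of the intersection of two (start, end) intervals."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def event_correct(pred, refs, min_ratio=0.5):
    """Event-level decision: a predicted speech segment counts as correct when
    its overlap with some ground-truth segment exceeds 50% of its duration."""
    dur = pred[1] - pred[0]
    return any(overlap(pred, r) > min_ratio * dur for r in refs)

# toy example: four of six frames agree
print(round(segment_f1([0, 0, 1, 1, 1, 0], [0, 1, 1, 1, 0, 0]), 3))  # 0.667
print(event_correct((0.0, 1.0), [(0.4, 2.0)]))  # True: 0.6 s overlap
```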
    </sec>
    <sec id="sec-3">
      <title>4. Results</title>
      <sec id="sec-3-1">
        <p>This section presents and analyses the experimental results of the different VAD systems.</p>
        <sec id="sec-3-1-1">
          <title>4.1. Single Systems Evaluation</title>
          <p>As can be seen from the experimental results, the majority of the tested systems
outperformed the baseline system SSVAD used in the
current DLB pipeline at the segment level. A notable pattern
from the experiment results is that DNN-based systems,
such as Silero, GPVAD, and Pyannote, tend to achieve
better results compared to traditional statistical systems
like rVAD and SSVAD. However, context-aware VAD is
an exception, with an F1 score of 60.4, which is lower
than the baseline SSVAD score of 62.2. As for event-level
results, similar to the segment-level results, almost all
systems outperformed the baseline. DNN-based systems
tend to perform better, with Context-aware VAD being
again an exception, as its F1 score is the lowest among all
systems. The poor performance of Context-aware VAD
could be attributed to the fact that, unlike GPVAD and
Pyannote, it is trained only on the TIMIT [15] dataset
with additional background noise. The TIMIT dataset
is a relatively small English speech dataset, containing
only 5 hours of audio, likely causing the system to overfit
on this dataset. Another possible reason for this
relatively poor performance could be that, while Pyannote
and GPVAD are trained on multilingual datasets like
DIHARD III [16] and Audioset [17], Context-aware VAD is
trained solely on English speech. When tested on Italian
speech, the system could suffer from a domain shift, resulting
in diminished performance.</p>
          <p>To gain a better understanding of the diferences in
system performance, a Kruskal-Wallis test was conducted.
The results indicate that both the diferences between
segment-level results and event-level results are
significant. A Dunn’s test was then performed for post-hoc
comparisons. The statistical analysis demonstrates that
systems GPVAD, rVAD, Silero, and Pyannote exhibit
similar performance at both the segment and event levels,
while SSVAD, WebRTC, and Context-aware VAD show
significantly lower performance at both levels.</p>
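          <p>For reference, the Kruskal-Wallis H statistic used in this analysis can be computed from rank sums alone. The sketch below is illustrative only (the helper is ours, the tie correction is omitted, and the scores are made-up numbers rather than the per-file results of this study); with three groups, H is compared against the chi-squared critical value for 2 degrees of freedom, 5.991 at a significance level of 0.05.</p>

```python
def kruskal_h(*groups):
    """Kruskal-Wallis H statistic for k groups of scores (no tie correction)."""
    pooled = sorted((x, gi) for gi, g in enumerate(groups) for x in g)
    n = len(pooled)
    rank_sum = [0.0] * len(groups)
    for rank, (_, gi) in enumerate(pooled, start=1):  # ranks are 1-based
        rank_sum[gi] += rank
    return 12.0 / (n * (n + 1)) * sum(
        rs * rs / len(g) for rs, g in zip(rank_sum, groups)
    ) - 3 * (n + 1)

# made-up per-file F1 scores for three systems (illustration only)
sys1 = [0.90, 0.92, 0.91, 0.93]
sys2 = [0.88, 0.87, 0.89, 0.86]
sys3 = [0.60, 0.62, 0.61, 0.63]
h = kruskal_h(sys1, sys2, sys3)
print(round(h, 3))  # 9.846, above 5.991, so the groups differ significantly
```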
          <p>After considering the performance at diferent levels,
we tested all combination of three systems to form an
ensemble prediction system to generate more accurate
VAD results. The architectures of these ensemble systems
and the corresponding experimental results are discussed
in the following section.</p>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>4.2. Ensemble Systems Evaluation</title>
        <p>This section details the ensemble methods that combine the predictions of the systems tested in Section 4.1. It subsequently presents the experimental results and analysis.</p>
        <p>Of the systems presented in Section 2, Silero, Pyannote, GPVAD, and Context-aware VAD assign a score to each frame, with a threshold used for making predictions. The other systems do not generate such scores, either due to differences in their architecture or because they are closed source. This score can be interpreted as the probability of the frame being speech or not. We attempted to ensemble the systems’ predictions using both the probability scores and their final predictions. The major challenge faced by these ensemble methods is that each system uses a different frame size, which complicates achieving alignment for the ensemble system.</p>
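        <p>As a concrete illustration of how such a score sequence is turned into predictions, the following sketch (our own helper, not code from any of the systems above) thresholds each frame’s speech probability and merges runs of consecutive speech frames into (start, end) segments:</p>

```python
def scores_to_segments(scores, frame_size, threshold=0.5):
    """Threshold per-frame speech probabilities and merge runs of consecutive
    speech frames into (start_sec, end_sec) segments."""
    segments = []
    start = None
    for i, p in enumerate(scores):
        if p >= threshold:
            if start is None:
                start = i * frame_size  # a speech run begins here
        elif start is not None:
            segments.append((start, i * frame_size))  # the run just ended
            start = None
    if start is not None:  # audio ends while speech is still active
        segments.append((start, len(scores) * frame_size))
    return segments

# five 200 ms frames: frames 1-2 and frame 4 are classified as speech
segs = scores_to_segments([0.1, 0.8, 0.9, 0.2, 0.7], 0.2)
print([(round(s, 1), round(e, 1)) for s, e in segs])  # [(0.2, 0.6), (0.8, 1.0)]
```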
        <p>We proposed and tested several ensemble strategies:</p>
        <list list-type="bullet">
          <list-item><p>Probability Voting (PV): This method involves summing and averaging the probability scores from the different predictions.</p></list-item>
          <list-item><p>Probability Voting with Frame (PV_f): In this approach, each audio is first segmented into frames. For each frame, we identify all overlapping frames from all predictions, average their probability scores, and use this average as the probability score for the frame. The frame size of PV_f is 200 ms.</p></list-item>
          <list-item><p>Simple Voting with Frame (SV_f): Similar to PV_f, this method segments the audio into frames. However, instead of averaging probability scores, it performs simple majority voting based on the predictions of overlapping frames. The frame size of SV_f is 200 ms.</p></list-item>
          <list-item><p>Probability Voting with Weight (PV_w): This method is akin to PV_f but with a twist: the probability scores of overlapping frames from the three predictions are weighted according to their overlap percentage. These weighted scores are then summed to determine the probability score for each frame.</p></list-item>
          <list-item><p>Probability Voting with Sampling (PV_s): For a given audio, this method samples timestamps. For each timestamp, it calculates the mean of the probability scores from the three systems, using this mean as the probability score for the timestamp. The sampling rate of PV_s is approximately 33.33 Hz, meaning that one point is sampled every 0.03 seconds.</p></list-item>
          <list-item><p>Probability Voting with Bézier curve modelling (PV_b): For each prediction from each system, a Bézier curve is generated using control points sampled from the prediction. This approach aims to use a smooth curve to model the prediction and address the alignment issues caused by the different frame sizes of the systems. Similar to PV_f, each audio segment is divided into frames, and the probability score for each frame is the average of the scores estimated by the Bézier curves. The sampling rate of the control points used to generate the Bézier curve in PV_b is 5 Hz (0.2 seconds).</p></list-item>
        </list>
        <p>We experimented with all possible system combinations using the SV_f ensemble method, as well as all possible combinations of Silero, Pyannote, GPVAD, and Context-aware VAD using the other, probability-based ensemble methods, as these are the only systems that generate probability scores. For all probability-based methods, the “speech/non-speech” prediction for each frame is determined by applying a threshold of 0.5 to the probability score.</p>
        <p>Table 2 presents the results of all possible combinations composing the ensemble system using the SV_f method. Table 3 presents the results of all possible combinations composing the ensemble systems using the probability-score-related methods. The evaluation results are derived using the methods presented in Section 3.2.</p>
        <p>As shown in Table 2, the ensembles created using the SV_f method did not yield better results than the individual systems at the segment level. The highest segment-level score of 91.5 was achieved by the combination of GPVAD, Silero, and Pyannote, which is still 0.6 lower than the best performance of the Silero system alone. However, at the event level, the same combination achieved the highest score among all ensemble systems, with an F1 score of 84.0, which is higher than the best score achieved by a single system. Meanwhile, all other combinations yielded scores lower than the best performance of the individual systems.</p>
        <p>As shown in Table 3, the ensemble systems based on probability scores did not achieve prominently better results than the single systems at the segment level either, with the PV_s and PV_b systems of the combination Pyannote, GPVAD, Silero being higher only by a small margin of 0.6 compared to Silero. However, at the event level, several evident improvements can be observed in the performance of the ensemble systems. The probability-based ensemble systems combining Pyannote, GPVAD, and Silero, except for PV_b and PV, outperformed the single systems at the event level, with PV_f achieving an F1 score of 85.9, which is 5.6 points higher than that of Pyannote. This result demonstrates that the ensemble approach can lead to substantial performance gains in detecting the temporal interval in which speech takes place. It is worth noticing that the ensemble system PV_b consistently shows a great disparity between its performance at the segment level and the event level across all combinations. Despite its good performance at the segment level, PV_b achieves a rather low F1 score at the event level, far lower than all other systems. The disparity in performance at the different levels is likely caused by the insufficient number of control points adopted for generating the Bézier curve. However, increasing the number of control points is infeasible due to the computational complexity of evaluating the curve, which is O(n<sup>2</sup>), with n being the number of control points.</p>
        <p>Given that the ensemble systems composed of GPVAD, Silero, and Pyannote consistently outperformed the other combinations across all ensemble methods, a Kruskal-Wallis test, followed by Dunn’s post-hoc test, was conducted to assess the differences in performance between the ensemble methods and the individual systems GPVAD, Silero, and Pyannote. At the segment level, the Kruskal-Wallis test indicates that the differences are not significant. However, at the event level, the results reveal that PV_b’s performance is significantly lower compared to the other systems.</p>
        <p>In summary, given the performance of the systems, we plan to adopt PV_f as the speech segmentation component of the DLBs extraction pipeline, leveraging the combined predictions of Pyannote, Silero, and GPVAD. While PV_f shows slightly lower segment-level performance compared to the top-performing individual system, it enhances the accuracy in identifying speech intervals. This trade-off is justified by the substantial improvement in speech event detection performance.</p>
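        <p>For illustration, the PV_f strategy adopted above can be sketched as follows. This is a simplified re-implementation under our own assumptions, not the project code: each system’s output is represented as a list of (start, end, probability) spans at its native frame size, and each 200 ms frame of the ensemble receives the average score of all overlapping spans, thresholded at 0.5.</p>

```python
def pv_f(predictions, total_dur, frame=0.2, threshold=0.5):
    """Probability Voting with Frame: for each 200 ms frame, average the
    probability scores of every prediction span overlapping the frame,
    then apply a 0.5 threshold to obtain the speech/non-speech label.
    Each system's prediction is a list of (start_sec, end_sec, prob) spans."""
    n_frames = round(total_dur / frame)
    labels = []
    for i in range(n_frames):
        f_start = round(i * frame, 9)        # round() guards against float drift
        f_end = round((i + 1) * frame, 9)
        scores = [
            p
            for pred in predictions
            for (s, e, p) in pred
            if min(e, f_end) > max(s, f_start)  # span overlaps this frame
        ]
        avg = sum(scores) / len(scores) if scores else 0.0
        labels.append(1 if avg >= threshold else 0)
    return labels

# two toy systems with different span boundaries over a 1 s audio clip
sys_a = [(0.0, 0.4, 0.9), (0.4, 1.0, 0.2)]
sys_b = [(0.0, 0.6, 0.8), (0.6, 1.0, 0.1)]
print(pv_f([sys_a, sys_b], total_dur=1.0))  # [1, 1, 1, 0, 0]
```

        <p>PV_w and PV_s differ from this sketch only in how the overlapping scores are weighted or sampled, so the same alignment logic applies.</p>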
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Conclusions</title>
      <p>In this study, we explored and enhanced Voice Activity Detection systems for the Italian language, a relatively under-explored area in speech processing. We experimented with various systems and integrated systems</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgements</title>
      <p>This study was funded by the European Union – NextGenerationEU programme through the Italian National Recovery and Resilience Plan – NRRP (Mission 4 – Education and research), as a part of the project ReMind: an ecological, cost-effective AI platform for early detection of prodromal stages of cognitive impairment (PRIN 2022, 2022YKJ8FP – CUP J53D23008380006).</p>
    </sec>
    <sec id="sec-6">
      <title>CRediT Author Statement</title>
      <p>SZ: Investigation, Software, Formal analysis, Visualization, Writing - Original Draft. GG: Writing - Review &amp; Editing, Project administration, Funding acquisition. FT: Conceptualization, Methodology, Supervision, Writing - Review &amp; Editing.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <title>References</title>
      <ref id="ref-1"><label>[1]</label><mixed-citation>S. Graf, T. Herbig, M. Buck, G. Schmidt, Features for voice activity detection: a comparative analysis, EURASIP Journal on Advances in Signal Processing 2015 (2015) 1–15.</mixed-citation></ref>
      <ref id="ref-2"><label>[2]</label><mixed-citation>T. Cho, D. H. Whalen, G. Docherty, Voice onset time and beyond: Exploring laryngeal contrast in 19 languages, Journal of Phonetics 72 (2019) 52–65.</mixed-citation></ref>
      <ref id="ref-3"><label>[3]</label><mixed-citation>G. Gagliardi, F. Tamburini, The automatic extraction of linguistic biomarkers as a viable solution for the early diagnosis of mental disorders, in: Proceedings of the Thirteenth Language Resources and Evaluation Conference, European Language Resources Association, 2022, pp. 5234–5242.</mixed-citation></ref>
      <ref id="ref-4"><label>[4]</label><mixed-citation>M.-W. Mak, H.-B. Yu, A study of voice activity detection techniques for NIST speaker recognition evaluations, Computer Speech &amp; Language 28 (2014) 295–313.</mixed-citation></ref>
      <ref id="ref-5"><label>[5]</label><mixed-citation>J. Sohn, N. S. Kim, W. Sung, A statistical model-based voice activity detection, IEEE Signal Processing Letters 6 (1999) 1–3.</mixed-citation></ref>
      <ref id="ref-6"><label>[6]</label><mixed-citation>A. Sehgal, N. Kehtarnavaz, A convolutional neural network smartphone app for real-time voice activity detection, IEEE Access 6 (2018) 9017–9026.</mixed-citation></ref>
      <ref id="ref-7"><label>[7]</label><mixed-citation>L. Calzà, G. Gagliardi, R. R. Favretti, F. Tamburini, Linguistic features and automatic classifiers for identifying mild cognitive impairment and dementia, Computer Speech &amp; Language 65 (2021) 101113.</mixed-citation></ref>
      <ref id="ref-8"><label>[8]</label><mixed-citation>Z.-H. Tan, N. Dehak, et al., rVAD: An unsupervised segment-based robust voice activity detection method, Computer Speech &amp; Language 59 (2020) 1–21.</mixed-citation></ref>
      <ref id="ref-9"><label>[9]</label><mixed-citation>Silero Team, Silero VAD: pre-trained enterprise-grade voice activity detector (VAD), number detector and language classifier, https://github.com/snakers4/silero-vad, 2021.</mixed-citation></ref>
      <ref id="ref-10"><label>[10]</label><mixed-citation>H. Dinkel, S. Wang, X. Xu, M. Wu, K. Yu, Voice activity detection in the wild: A data-driven approach using teacher-student training, IEEE/ACM Transactions on Audio, Speech, and Language Processing 29 (2021) 1542–1555.</mixed-citation></ref>
      <ref id="ref-11"><label>[11]</label><mixed-citation>Y. R. Jo, Y. K. Moon, W. I. Cho, G. S. Jo, Self-attentive VAD: Context-aware detection of voice from noise, in: ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2021, pp. 6808–6812.</mixed-citation></ref>
      <ref id="ref-12"><label>[12]</label><mixed-citation>A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, Advances in Neural Information Processing Systems 30 (2017).</mixed-citation></ref>
      <ref id="ref-13"><label>[13]</label><mixed-citation>H. Bredin, R. Yin, J. M. Coria, G. Gelly, P. Korshunov, M. Lavechin, D. Fustes, H. Titeux, W. Bouaziz, M.-P. Gill, pyannote.audio: neural building blocks for speaker diarization, in: ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2020, pp. 7124–7128.</mixed-citation></ref>
      <ref id="ref-14"><label>[14]</label><mixed-citation>F. A. Leoni, F. Cutugno, R. Savy, V. Caniparoli, L. D’Anna, E. Paone, R. Giordano, O. Manfrellotti, M. Petrillo, A. De Rosa, Corpora e lessici dell’italiano parlato e scritto, 2007.</mixed-citation></ref>
      <ref id="ref-15"><label>[15]</label><mixed-citation>J. S. Garofolo, TIMIT acoustic phonetic continuous speech corpus, Linguistic Data Consortium, 1993 (1993).</mixed-citation></ref>
      <ref id="ref-16"><label>[16]</label><mixed-citation>N. Ryant, P. Singh, V. Krishnamohan, R. Varma, K. Church, C. Cieri, J. Du, S. Ganapathy, M. Liberman, The third DIHARD diarization challenge, arXiv preprint arXiv:2012.01477 (2020).</mixed-citation></ref>
      <ref id="ref-17"><label>[17]</label><mixed-citation>J. F. Gemmeke, D. P. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, M. Ritter, Audio Set: An ontology and human-labeled dataset for audio events, in: Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2017, pp. 776–780.</mixed-citation></ref>
    </ref-list>
  </back>
</article>