<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Voice Activity Detection on Italian Language</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Shibingfeng</forename><surname>Zhang</surname></persName>
							<email>shibingfeng.zhang@unibo.it</email>
							<affiliation key="aff0">
								<orgName type="department">FICLIT</orgName>
								<orgName type="institution">Alma Mater Studiorum -University of Bologna</orgName>
								<address>
									<addrLine>via Zamboni, 32</addrLine>
									<settlement>Bologna</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Gloria</forename><surname>Gagliardi</surname></persName>
							<email>gloria.gagliardi@unibo.it</email>
							<affiliation key="aff0">
								<orgName type="department">FICLIT</orgName>
								<orgName type="institution">Alma Mater Studiorum -University of Bologna</orgName>
								<address>
									<addrLine>via Zamboni, 32</addrLine>
									<settlement>Bologna</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Fabio</forename><surname>Tamburini</surname></persName>
							<email>fabio.tamburini@unibo.it</email>
							<affiliation key="aff0">
								<orgName type="department">FICLIT</orgName>
								<orgName type="institution">Alma Mater Studiorum -University of Bologna</orgName>
								<address>
									<addrLine>via Zamboni, 32</addrLine>
									<settlement>Bologna</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff1">
								<orgName type="department">Tenth Italian Conference on Computational Linguistics</orgName>
								<address>
									<addrLine>Dec 04 -06</addrLine>
									<postCode>2024</postCode>
									<settlement>Pisa</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Voice Activity Detection on Italian Language</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">544355F285F241E81E0C6AD4A520C72F</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T17:33+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Voice Activity Detection</term>
					<term>Digital Linguistic Biomarkers</term>
					<term>Speech Processing</term>
					<term>Speech Segmentation</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Voice Activity Detection (VAD) refers to the task of identifying human voice activity in noisy settings, playing a crucial role in fields like speech recognition and audio surveillance. However, most VAD research focuses on English, leaving other languages, such as Italian, under-explored. This study aims to evaluate and enhance VAD systems for Italian speech, with the goal of finding a solution for the speech segmentation component of the Digital Linguistic Biomarkers (DLBs) extraction pipeline for early mental disorder diagnosis. We experimented with various VAD systems and proposed an ensemble VAD system. Our ensemble system shows improvements in speech event detection. This advancement lays a robust foundation for more accurate early detection of mental health issues using DLBs in Italian.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Voice Activity Detection (VAD) refers to the task of identifying the presence of human voice activity in noisy speech, classifying utterance segments as "speech" or "non-speech". Typically, it involves making binary decisions on each frame of a noisy signal <ref type="bibr" target="#b0">[1]</ref>. VAD has a wide range of applications, serving as a crucial component in fields such as telecommunications, speech recognition, and audio surveillance. Nevertheless, the great majority of current work focuses on applying VAD to English, even though many factors can affect the performance of a VAD system transferred from one language to another, potentially leading to suboptimal results. For instance, voice onset time may vary significantly between languages, affecting the system's ability to detect speech activity accurately <ref type="bibr" target="#b1">[2]</ref>. Additionally, differences in phonetic structure can further complicate a system's effectiveness across languages. Given these factors, research evaluating various VAD systems on Italian speech is highly valuable.</p><p>Digital Linguistic Biomarkers (DLBs) are linguistic features automatically extracted from patients' verbal productions that provide insights into their medical state <ref type="bibr" target="#b2">[3]</ref>. Gagliardi and Tamburini <ref type="bibr" target="#b2">[3]</ref> proposed the first DLBs extraction pipeline for the early diagnosis of mental disorders in Italian. The extraction of acoustic and rhythmic features relies heavily on the preprocessing step, which consists of speech segmentation via VAD. 
The VAD system adopted by Gagliardi and Tamburini <ref type="bibr" target="#b2">[3]</ref> is a statistical VAD system named "SSVAD v1.0" <ref type="bibr" target="#b3">[4]</ref>, which will be presented and compared to other VAD systems in Section 2.</p><p>In this project, we focus on VAD for the Italian language, an area that remains largely unexplored, aiming to find a VAD system that performs better and is more reliable than the one adopted in the original pipeline. The outcomes of this project will serve as a fundamental component in the pipeline for extracting DLBs and replacing the current VAD system. Moreover, our efforts will provide a robust foundation for future work in this domain, facilitating more accurate and early detection of mental health issues using linguistic biomarkers.</p><p>Our main contributions are as follows:</p><p>• Testing and evaluating various VAD systems on Italian speech. • Proposing an ensemble VAD system that achieves superior results.</p><p>This paper is structured into five sections. Section 2 presents the data resources and VAD systems leveraged in this work. Section 3 details the experiments and resources for testing VAD systems. Section 4 presents and discusses the experimental results. Finally, Section 5 draws conclusions.</p></div>
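To make the frame-wise formulation above concrete, here is a minimal sketch of per-frame binary VAD. It is illustrative only: it uses a fixed energy threshold rather than any of the systems evaluated in this paper, and the helper name `frame_energy_vad` is ours.

```python
def frame_energy_vad(samples, sample_rate=16000, frame_ms=10, threshold=0.01):
    """Label each frame of `samples` as speech (True) or non-speech (False)
    by comparing its mean squared energy against a fixed threshold.
    A toy illustration of frame-wise binary VAD, not a production system."""
    frame_len = int(sample_rate * frame_ms / 1000)  # samples per 10 ms frame
    labels = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(x * x for x in frame) / frame_len
        labels.append(energy > threshold)
    return labels


# 10 ms of silence followed by 10 ms of a loud signal at 16 kHz
labels = frame_energy_vad([0.0] * 160 + [0.5] * 160)
```

A real VAD replaces the energy comparison with a statistical or learned decision, but the per-frame binary output has the same shape.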
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Background</head><p>This section outlines the background, state-of-the-art developments, and architectures of VAD systems.</p><p>The majority of Voice Activity Detection (VAD) systems approach the task as a binary classification for each frame of a noisy audio signal, with or without overlaps between frames. Based on their architecture, these systems can generally be divided into two categories: statistical VAD systems and deep neural network (DNN) VAD systems.</p><p>Statistical VAD systems rely on probabilistic models and statistical signal processing techniques to distinguish between speech and non-speech segments. Common statistical methods include Gaussian Mixture Models (GMM), Hidden Markov Models (HMM), and Bayesian frameworks. For example, Sohn et al. <ref type="bibr" target="#b4">[5]</ref> proposed a robust statistical VAD system that models the signal using a first-order two-state HMM. In this system, the VAD score of each frame is calculated based on the likelihood ratio between the probability density functions conditioned on two hypotheses: speech absent and speech present. Additionally, the state-transition probability is determined using the likelihood ratio from the previous frame, which helps in maintaining temporal coherence and improving the accuracy of voice activity detection.</p><p>On the other hand, VAD systems based on DNNs leverage the power of deep learning. 
These systems use neural network architectures, such as convolutional neural networks (CNNs), recurrent neural networks (RNNs), or more advanced structures with attention mechanisms <ref type="bibr" target="#b5">[6]</ref>.</p><p>Below, we present the list of the VAD systems we experimented with in this project, along with a brief description of each system: SSVAD v1.0 (Baseline) <ref type="bibr" target="#b3">[4]</ref> is a statistical VAD system designed to handle low signal-to-noise ratio (SNR), impulsive noise, and cross-talk in interview-style speech files. The system enhances speech segments as a pre-processing step to improve the SNR, thereby facilitating subsequent speech/non-speech decisions. SSVAD v1.0 was previously integrated into the older version of the DLBs extraction pipeline <ref type="bibr" target="#b6">[7]</ref> for speech segmentation and serves as the baseline for comparison with the other systems in this study. rVAD <ref type="bibr" target="#b7">[8]</ref> is an unsupervised model comprising two denoising steps followed by a final VAD stage. In the first denoising step, high-energy noise segments are identified and nullified. The second step applies a speech enhancement method to further denoise the signal. Silero <ref type="bibr" target="#b8">[9]</ref> is a pre-trained CNN system with an encoder-decoder architecture. Detailed information about this VAD system is limited, as it is closed source and undocumented. WebRTC VAD is a system developed by Google for the WebRTC project (https://webrtc.org/). Similar to the Silero VAD system, it is closed source, and detailed information about its architecture is not publicly available. GPVAD <ref type="bibr" target="#b9">[10]</ref> is a 5-layer framework composed of CNN and RNN layers. The model employs a data-driven teacher-student learning paradigm for VAD, where a teacher model is initially trained on a source dataset with weak labels to handle vast and noisy audio data. 
The trained teacher model then provides frame-level guidance to a student model trained on various unlabeled target datasets. Context-aware VAD <ref type="bibr" target="#b10">[11]</ref> is a self-attentive VAD system based on the Transformer architecture <ref type="bibr" target="#b11">[12]</ref>. The model processes acoustic features extracted from the audio input, enhancing them with contextual information from surrounding frames. Pyannote <ref type="bibr" target="#b12">[13]</ref> is a pre-trained open-source toolkit for audio processing that includes a VAD model. Similar to GPVAD and Silero, it is a DNN-based model with CNN and RNN components.</p></div>
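The likelihood-ratio idea behind the statistical systems discussed above (e.g. Sohn et al. [5]) can be sketched as follows. This is a deliberately simplified, time-domain stand-in: it assumes zero-mean Gaussian models with known variances, whereas the actual system models spectral coefficients under an HMM and estimates noise statistics adaptively.

```python
import math


def likelihood_ratio(frame, noise_var, speech_var):
    """Average per-sample log-likelihood ratio between two zero-mean
    Gaussian hypotheses: H1 = speech present (variance noise_var + speech_var)
    versus H0 = speech absent (variance noise_var).
    The constant -0.5*log(2*pi) term cancels in the ratio and is omitted.
    Simplified stand-in for the spectral-domain model of Sohn et al.;
    the variances are assumed known here."""
    var0 = noise_var
    var1 = noise_var + speech_var
    llr = 0.0
    for x in frame:
        llr += (-0.5 * math.log(var1) - x * x / (2 * var1)) \
             - (-0.5 * math.log(var0) - x * x / (2 * var0))
    return llr / len(frame)


def is_speech(frame, noise_var=1e-4, speech_var=1e-2, threshold=0.0):
    """Decide speech/non-speech by thresholding the average LLR."""
    return likelihood_ratio(frame, noise_var, speech_var) > threshold
```

In the full system of Sohn et al., the decision additionally uses an HMM state-transition probability derived from the previous frame's likelihood ratio, which smooths the frame-by-frame decisions over time.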
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Experiments</head><p>This section provides an overview of the experiments we conducted, the evaluation metrics applied, and the resources adopted for the experiments.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Evaluation Dataset</head><p>In this work, the CLIPS dataset (Corpora e Lessici dell'Italiano Parlato e Scritto, Italian for Corpora and Lexicons of Spoken and Written Italian) <ref type="foot" target="#foot_0">2</ref> [14] is adopted to evaluate different VAD systems.</p><p>CLIPS comprises approximately 100 hours of speech data, equally distributed between male and female voices. It includes a diverse range of regional and situational speech samples to ensure a comprehensive representation of the Italian language across different contexts. The CLIPS dataset is organized into five subsets, with the "DIALOGICO" and "LETTO" subsets offering complete temporal alignments between audio and textual transcription, totaling approximately 7.5 hours of test data. The "DIALOGICO" subset includes dialogues between two interlocutors, while the "LETTO" subset consists of recordings where words are read aloud from lists.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Experiment Settings &amp; Evaluation</head><p>To thoroughly evaluate the performance of the various VAD systems, we used two sets of metrics: segment-level metrics and event-level metrics. Segment-level metrics treat each 10 ms segment of audio (a single frame) independently, calculating metrics such as F1 score, precision, recall, error rate, and accuracy. Event-level metrics, on the other hand, consider each speech segment as a unit. A prediction is deemed correct if its overlap with the ground truth exceeds 50%, and the same metrics are calculated accordingly.</p><p>Experiments were conducted on the CLIPS dataset using the VAD systems outlined in Section 2. To achieve optimal results, all systems were tested with their default frame sizes. Furthermore, we combined the systems' predictions through different ensemble methods to further enhance performance. More details on these ensemble methods are provided in Section 4.2.</p></div>
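Under the stated assumptions (10 ms frames; an event counted as correct when its overlap with the ground truth exceeds 50%, read here as 50% of the predicted event's duration), the two evaluation views can be sketched as follows. The helper names are ours, not from the paper.

```python
def segment_f1(pred, gold):
    """Segment-level F1 over per-frame boolean labels (each frame = 10 ms)."""
    tp = sum(p and g for p, g in zip(pred, gold))
    fp = sum(p and not g for p, g in zip(pred, gold))
    fn = sum(g and not p for p, g in zip(pred, gold))
    if tp == 0:
        return 0.0
    prec, rec = tp / (tp + fp), tp / (tp + fn)
    return 2 * prec * rec / (prec + rec)


def to_events(labels):
    """Collapse a frame-label sequence into (start, end) speech events,
    with end exclusive, in frame indices."""
    events, start = [], None
    for i, lab in enumerate(labels):
        if lab and start is None:
            start = i
        elif not lab and start is not None:
            events.append((start, i))
            start = None
    if start is not None:
        events.append((start, len(labels)))
    return events


def event_correct(pred_event, gold_events):
    """A predicted event counts as correct if it overlaps some gold event
    by more than 50% of the predicted event's duration."""
    s, e = pred_event
    for gs, ge in gold_events:
        overlap = max(0, min(e, ge) - max(s, gs))
        if overlap > 0.5 * (e - s):
            return True
    return False
```

Event-level scores follow by counting correct predicted events (precision) and matched gold events (recall) and combining them into F1 as above.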
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Results</head><p>This section presents and analyses the experimental results of different VAD systems.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Single Systems Evaluation</head><p>Table <ref type="table">1</ref> shows the experimental results obtained from the systems described in Section 2. The evaluation results are derived using the methods presented in Section 3.2.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 1</head><p>Results of the VAD experiments on different systems. For segment-level results, each 10 ms is considered one segment. For event-level results, a prediction is considered correct if its overlap with the ground truth exceeds 50%. The evaluation metric used is the F1 score. As can be seen, the majority of the tested systems outperformed the baseline system SSVAD used in the current DLB pipeline at the segment level. A notable pattern in the experimental results is that DNN-based systems, such as Silero, GPVAD, and Pyannote, tend to achieve better results than traditional statistical systems like rVAD and SSVAD. However, Context-aware VAD is an exception, with an F1 score of 60.4, which is lower than the baseline SSVAD score of 62.2. As for the event-level results, similar to the segment-level results, almost all systems outperformed the baseline. DNN-based systems tend to perform better, with Context-aware VAD again being an exception, as its F1 score is the lowest among all systems. The poor performance of Context-aware VAD could be attributed to the fact that, unlike GPVAD and Pyannote, it is trained only on the TIMIT <ref type="bibr" target="#b14">[15]</ref> dataset with additional background noise. The TIMIT dataset is a relatively small English speech dataset, containing only 5 hours of audio, likely causing the system to overfit this dataset. Another possible reason for this relatively poor performance could be that, while Pyannote and GPVAD are trained on multilingual datasets like DIHARD III <ref type="bibr" target="#b15">[16]</ref> and Audioset <ref type="bibr" target="#b16">[17]</ref>, Context-aware VAD is trained solely on English speech. When tested on Italian speech, the system could suffer from a domain shift, resulting in diminished performance.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>To gain a better understanding of the differences in system performance, a Kruskal-Wallis test was conducted. The results indicate that the differences between systems are significant at both the segment level and the event level. A Dunn's test was then performed for post-hoc comparisons. The statistical analysis shows that GPVAD, rVAD, Silero, and Pyannote exhibit similar performance at both the segment and event levels, while SSVAD, WebRTC, and Context-aware VAD show significantly lower performance at both levels.</p><p>After considering the performance at the different levels, we tested all combinations of three systems to form ensemble prediction systems that generate more accurate VAD results. The architectures of these ensemble systems and the corresponding experimental results are discussed in the following section.</p></div>
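For reference, the Kruskal-Wallis H statistic underlying this test can be computed as follows. This is a minimal version that omits the tie correction; in practice a library routine such as scipy.stats.kruskal would be used, and Dunn's post-hoc test is not shown.

```python
def kruskal_h(*groups):
    """Kruskal-Wallis H statistic (tie correction omitted) for comparing
    several groups of scores, e.g. per-file F1 scores of several VAD
    systems. H is compared against a chi-squared distribution with
    k - 1 degrees of freedom, where k is the number of groups."""
    # Rank all observations jointly, remembering which group each came from.
    all_vals = sorted((v, gi) for gi, g in enumerate(groups) for v in g)
    n = len(all_vals)
    rank_sums = [0.0] * len(groups)  # sum of ranks per group
    sizes = [len(g) for g in groups]
    for rank, (_, gi) in enumerate(all_vals, start=1):
        rank_sums[gi] += rank
    return (12.0 / (n * (n + 1))
            * sum(r * r / s for r, s in zip(rank_sums, sizes))
            - 3 * (n + 1))
```

A large H (relative to the chi-squared critical value) indicates that at least one system's score distribution differs from the others, which is what motivates the pairwise post-hoc comparison.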
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Ensemble Systems Evaluation</head><p>This section details the ensemble methods that combine the predictions of the systems tested in Section 4.1, then presents the experimental results and their analysis.</p><p>Of the systems presented in Section 2, Silero, Pyannote, GPVAD, and Context-aware VAD assign a score to each frame, with a threshold used for making predictions. This score can be interpreted as the probability of the frame containing speech. The other systems do not generate such scores, either due to differences in their architecture or because they are closed source. We attempted to ensemble the systems' predictions using both the probability scores and the final binary predictions. The major challenge faced by these ensemble methods is that each system uses a different frame size, which complicates alignment within the ensemble.</p><p>We proposed and tested several ensemble strategies:</p><p>• Probability Voting (PV): This method sums and averages the probability scores from the different predictions.</p><p>• Probability Voting with Frame (PV_f): In this approach, each audio is first segmented into frames. For each frame, we identify all overlapping frames from all predictions, average their probability scores, and use this average as the probability score for the frame. The frame size of PV_f is 200 ms.</p><p>• Simple Voting with Frame (SV_f): Similar to PV_f, this method segments audio into frames. However, instead of averaging probability scores, it performs simple majority voting based on the predictions of overlapping frames. The frame size of SV_f is 200 ms.</p><p>• Probability Voting with Weight (PV_w): This method is akin to PV_f but with a twist: probability scores of overlapping frames from the three predictions are weighted according to their overlap percentage. These weighted scores are then summed to determine the probability score for each frame.</p><p>• Probability Voting with Sampling (PV_s): For a given audio, this method samples timestamps. For each timestamp, it calculates the mean of the probability scores from the three systems, using this mean as the probability score for the timestamp. The sampling rate of PV_s is approximately 33.33 Hz, meaning that one point is sampled every 0.03 seconds.</p><p>• Probability Voting with Bézier curve modelling (PV_b): For each prediction from each system, a Bézier curve is generated using control points sampled from the prediction. This approach uses a smooth curve to model the prediction and to address the alignment issues caused by the systems' different frame sizes. Similar to PV_f, each audio segment is divided into frames, and the probability score for each frame is the average of the scores estimated by the Bézier curves. The sampling rate of the control points used to generate the Bézier curves in PV_b is 5 Hz (0.2 seconds).</p><p>We experimented with all possible system combinations using the SV_f ensemble method, as well as all possible combinations of Silero, Pyannote, GPVAD, and Context-aware VAD using the other, probability-based ensemble methods, as these are the only systems that generate probability scores. For all probability-based methods, the "speech/non-speech" prediction for each frame is determined by applying a threshold of 0.5 to the probability score.</p><p>Table <ref type="table">2</ref> presents the results of all possible combinations composing the ensemble system using the SV_f method. Table <ref type="table" target="#tab_3">3</ref> presents the results of all possible combinations composing the ensemble systems using the probability-score-based methods. The evaluation results are derived using the methods presented in Section 3.2.</p><p>As shown in Table <ref type="table">2</ref>, the ensembles created using the SV_f method did not yield better results than the individual systems at the segment level. The highest segment-level score of 91.5 was achieved by the combination of GPVAD, Silero, and Pyannote, which is still 0.6 lower than the best performance of the Silero system alone. However, at the event level, the same combination achieved the highest score among all ensemble systems, with an F1 score of 84.0, which is higher than the best score achieved by a single system. Meanwhile, all other combinations yielded scores lower than the best performance of the individual systems.</p><p>As shown in Table <ref type="table" target="#tab_3">3</ref>, the probability-score-based ensemble systems did not achieve prominently better results than the single systems at the segment level either, with the PV_s and PV_b systems of the combination Pyannote, GPVAD, and Silero being only slightly higher, by at most 0.6, than Silero. However, at the event level, several clear improvements can be observed in the performance of the ensemble systems. The probability-based ensembles combining Pyannote, GPVAD, and Silero, except for PV_b and PV, outperformed the single systems at the event level, with PV_f achieving an F1 score of 85.9, which is 5.6 points higher than that of Pyannote. This result demonstrates that the ensemble approach can lead to substantial performance gains in detecting the temporal intervals in which speech takes place. It is worth noting that the PV_b ensemble consistently shows a great disparity between its segment-level and event-level performance across all combinations: despite its good segment-level performance, PV_b achieves a rather low F1 score at the event level, far lower than all other systems. This disparity is likely caused by the insufficient number of control points adopted for generating the Bézier curve. However, increasing the number of control points is infeasible due to the computational complexity of evaluating the curve, which is O(n²), with n being the number of control points.</p><p>Given that the ensemble systems composed of GPVAD, Silero, and Pyannote consistently outperformed the other combinations across all ensemble methods, a Kruskal-Wallis test, followed by Dunn's post-hoc test, was conducted to assess the differences in performance between the ensemble methods and the individual systems GPVAD, Silero, and Pyannote. At the segment level, the Kruskal-Wallis test indicates that the differences are not significant. However, at the event level, the results reveal that PV_b's performance is significantly lower than that of the other systems.</p><p>In summary, given the performance of the systems, we plan to adopt PV_f as the speech segmentation component of the DLBs extraction pipeline, leveraging the combined predictions of Pyannote, Silero, and GPVAD. While PV_f shows slightly lower segment-level performance than the top-performing individual system, it improves the accuracy of identifying speech intervals. This trade-off is justified by the substantial improvement in speech event detection performance.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 2</head><p>Results of the VAD experiments using the SV_f ensemble method. For comparison, results from the individual systems that achieved the best performance, Silero and Pyannote, are also included. S stands for segment-level result; E stands for event-level result; C-a stands for the Context-aware VAD system. For segment-level results, each 10 ms is considered one segment. For event-level results, a prediction is considered correct if its overlap with the ground truth exceeds 50%. The evaluation metric used is the F1 score.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Conclusions</head><p>In this study, we explored and enhanced Voice Activity Detection systems for the Italian language, a relatively under-explored area in speech processing. We experimented with various systems and integrated them into an ensemble to improve detection accuracy. Our findings indicate that combining predictions from multiple models can lead to better results in detecting speech temporal intervals. This ensemble method will be used as a component of a Digital Linguistic Biomarkers extraction pipeline. By enhancing the accuracy of speech segmentation, it provides a more reliable foundation for extracting meaningful linguistic features for the diagnosis of cognitive impairment. Future research could focus on refining the ensemble method by incorporating additional linguistic features into VAD systems and exploring their synergistic effects. Additionally, investigating the application of this approach to other languages and dialects could expand its utility.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 3</head><label>3</label><figDesc>Results of the VAD experiments using the probability-score-based ensemble methods. For comparison, results from the individual systems that achieved the best performance, Silero and Pyannote, are also included. Method stands for the ensemble method adopted; S stands for segment-level result; E stands for event-level result; C-a stands for the Context-aware VAD system. For segment-level results, each 10 ms is considered one segment. For event-level results, a prediction is considered correct if its overlap with the ground truth exceeds 50%. The evaluation metric used is the F1 score.</figDesc><table><row><cell>Involved Systems</cell><cell>Method</cell><cell>S</cell><cell>E</cell></row><row><cell>Silero</cell><cell>-</cell><cell>92.5</cell><cell>80.1</cell></row><row><cell>Pyannote</cell><cell>-</cell><cell>92.3</cell><cell>80.3</cell></row><row><cell>Pyannote, GPVAD, Silero</cell><cell>PV</cell><cell>91.5</cell><cell>67.9</cell></row><row><cell>Pyannote, GPVAD, Silero</cell><cell>PV_f</cell><cell>91.9</cell><cell>85.9</cell></row><row><cell>Pyannote, GPVAD, Silero</cell><cell>PV_s</cell><cell>93.1</cell><cell>81.8</cell></row><row><cell>Pyannote, GPVAD, Silero</cell><cell>PV_w</cell><cell>91.8</cell><cell>85.6</cell></row><row><cell>Pyannote, GPVAD, Silero</cell><cell>PV_b</cell><cell>93.0</cell><cell>9.5</cell></row><row><cell>Pyannote, GPVAD, C-a</cell><cell>PV</cell><cell>87.2</cell><cell>60.4</cell></row><row><cell>Pyannote, GPVAD, C-a</cell><cell>PV_f</cell><cell>87.6</cell><cell>80.0</cell></row><row><cell>Pyannote, GPVAD, C-a</cell><cell>PV_s</cell><cell>89.3</cell><cell>79.4</cell></row><row><cell>Pyannote, GPVAD, C-a</cell><cell>PV_w</cell><cell>87.5</cell><cell>79.2</cell></row><row><cell>Pyannote, GPVAD, C-a</cell><cell>PV_b</cell><cell>89.2</cell><cell>10.5</cell></row><row><cell>Silero, GPVAD, C-a</cell><cell>PV</cell><cell>85.4</cell><cell>50.6</cell></row><row><cell>Silero, GPVAD, C-a</cell><cell>PV_f</cell><cell>85.7</cell><cell>72.7</cell></row><row><cell>Silero, GPVAD, C-a</cell><cell>PV_s</cell><cell>84.2</cell><cell>67.3</cell></row><row><cell>Silero, GPVAD, C-a</cell><cell>PV_w</cell><cell>85.6</cell><cell>71.6</cell></row><row><cell>Silero, GPVAD, C-a</cell><cell>PV_b</cell><cell>88.8</cell><cell>11.0</cell></row><row><cell>Silero, Pyannote, C-a</cell><cell>PV</cell><cell>89.4</cell><cell>70.4</cell></row><row><cell>Silero, Pyannote, C-a</cell><cell>PV_f</cell><cell>89.6</cell><cell>81.2</cell></row><row><cell>Silero, Pyannote, C-a</cell><cell>PV_s</cell><cell>89.5</cell><cell>77.7</cell></row><row><cell>Silero, Pyannote, C-a</cell><cell>PV_w</cell><cell>89.6</cell><cell>81.5</cell></row><row><cell>Silero, Pyannote, C-a</cell><cell>PV_b</cell><cell>89.6</cell><cell>9.3</cell></row></table></figure>
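The best-performing fusion strategies reduce, in essence, to per-frame probability averaging and majority voting. The following is a simplified sketch that assumes the systems' scores have already been resampled onto a common frame grid; the PV_f and SV_f methods described above additionally handle the systems' differing native frame sizes via 200 ms frames, which is not shown here, and the function names are ours.

```python
def pv_f(prob_tracks, threshold=0.5):
    """Probability-averaging fusion (in the spirit of PV_f): given per-frame
    speech probabilities from several systems on a common frame grid,
    average them per frame and apply a 0.5 threshold."""
    n_frames = min(len(t) for t in prob_tracks)
    return [sum(t[i] for t in prob_tracks) / len(prob_tracks) >= threshold
            for i in range(n_frames)]


def sv_f(pred_tracks):
    """Majority-voting fusion (in the spirit of SV_f): per-frame majority
    vote over the systems' binary predictions."""
    n_frames = min(len(t) for t in pred_tracks)
    return [sum(t[i] for t in pred_tracks) > len(pred_tracks) / 2
            for i in range(n_frames)]


# Three systems, two frames: frame 0 averages to 0.6 (speech),
# frame 1 averages to 0.4 (non-speech).
fused = pv_f([[0.9, 0.2], [0.8, 0.1], [0.1, 0.9]])
```

Averaging probabilities retains each system's confidence, which is why the probability-based variants can recover speech events that a hard majority vote over thresholded predictions would miss.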
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_0">http://www.clips.unina.it/it/</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgements</head><p>This study was funded by the European Union -NextGen-erationEU programme through the Italian National Re-covery and Resilience Plan -NRRP (Mission 4 -Education and research), as a part of the project ReMind: an ecological, costeffective AI platform for early detection of prodromal stages of cognitive impairment (PRIN 2022, 2022YKJ8FP -CUP J53D23008380006).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>CRediT Author Statement</head><p>SZ: Investigation, Software, Formal analysis, Visualization, Writing -Original Draft. GG: Writing -Review &amp; Editing, Project administration, Funding acquisition. FT: Conceptualization, Methodology, Supervision, Writing -Review &amp; Editing.</p></div>
			</div>


			<div type="funding">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>https://www.unibo.it/sitoweb/shibingfeng.zhang (S. Zhang); https://www.unibo.it/sitoweb/gloria.gagliardi (G. Gagliardi); https://www.unibo.it/sitoweb/fabio.tamburini (F. Tamburini). ORCID: 0009-0005-7320-9088 (S. Zhang); 0000-0001-5257-1540 (G. Gagliardi); 0000-0001-7950-0347 (F. Tamburini)</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Features for voice activity detection: a comparative analysis</title>
		<author>
			<persName><forename type="first">S</forename><surname>Graf</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Herbig</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Buck</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Schmidt</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">EURASIP Journal on Advances in Signal Processing</title>
		<imprint>
			<biblScope unit="volume">2015</biblScope>
			<biblScope unit="page" from="1" to="15" />
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Voice onset time and beyond: Exploring laryngeal contrast in 19 languages</title>
		<author>
			<persName><forename type="first">T</forename><surname>Cho</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">H</forename><surname>Whalen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Docherty</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Phonetics</title>
		<imprint>
			<biblScope unit="volume">72</biblScope>
			<biblScope unit="page" from="52" to="65" />
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">The automatic extraction of linguistic biomarkers as a viable solution for the early diagnosis of mental disorders</title>
		<author>
			<persName><forename type="first">G</forename><surname>Gagliardi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Tamburini</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Thirteenth Language Resources and Evaluation Conference</title>
				<meeting>the Thirteenth Language Resources and Evaluation Conference</meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="5234" to="5242" />
		</imprint>
	</monogr>
	<note>European Language Resources Association</note>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">A study of voice activity detection techniques for NIST speaker recognition evaluations</title>
		<author>
			<persName><forename type="first">M.-W</forename><surname>Mak</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H.-B</forename><surname>Yu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Computer Speech &amp; Language</title>
		<imprint>
			<biblScope unit="volume">28</biblScope>
			<biblScope unit="page" from="295" to="313" />
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">A statistical model-based voice activity detection</title>
		<author>
			<persName><forename type="first">J</forename><surname>Sohn</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">S</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Sung</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Signal Processing Letters</title>
		<imprint>
			<biblScope unit="volume">6</biblScope>
			<biblScope unit="page" from="1" to="3" />
			<date type="published" when="1999">1999</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">A convolutional neural network smartphone app for real-time voice activity detection</title>
		<author>
			<persName><forename type="first">A</forename><surname>Sehgal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Kehtarnavaz</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Access</title>
		<imprint>
			<biblScope unit="volume">6</biblScope>
			<biblScope unit="page" from="9017" to="9026" />
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Linguistic features and automatic classifiers for identifying mild cognitive impairment and dementia</title>
		<author>
			<persName><forename type="first">L</forename><surname>Calzà</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Gagliardi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">R</forename><surname>Favretti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Tamburini</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Computer Speech &amp; Language</title>
		<imprint>
			<biblScope unit="volume">65</biblScope>
			<biblScope unit="page">101113</biblScope>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">rVAD: An unsupervised segment-based robust voice activity detection method</title>
		<author>
			<persName><forename type="first">Z.-H</forename><surname>Tan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Dehak</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Computer Speech &amp; Language</title>
		<imprint>
			<biblScope unit="volume">59</biblScope>
			<biblScope unit="page" from="1" to="21" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<monogr>
		<author>
			<persName><forename type="first">Silero</forename><surname>Team</surname></persName>
		</author>
		<ptr target="https://github.com/snakers4/silero-vad" />
		<title level="m">Silero VAD: pre-trained enterprise-grade voice activity detector (VAD), number detector and language classifier</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Voice activity detection in the wild: A data-driven approach using teacher-student training</title>
		<author>
			<persName><forename type="first">H</forename><surname>Dinkel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Yu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE/ACM Transactions on Audio, Speech, and Language Processing</title>
		<imprint>
			<biblScope unit="volume">29</biblScope>
			<biblScope unit="page" from="1542" to="1555" />
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Self-attentive VAD: Context-aware detection of voice from noise</title>
		<author>
			<persName><forename type="first">Y</forename><forename type="middle">R</forename><surname>Jo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><forename type="middle">K</forename><surname>Moon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">I</forename><surname>Cho</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">S</forename><surname>Jo</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="6808" to="6812" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Attention is all you need</title>
		<author>
			<persName><forename type="first">A</forename><surname>Vaswani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Shazeer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Parmar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Uszkoreit</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Jones</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">N</forename><surname>Gomez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ł</forename><surname>Kaiser</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Polosukhin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in neural information processing systems</title>
		<imprint>
			<biblScope unit="volume">30</biblScope>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">pyannote.audio: neural building blocks for speaker diarization</title>
		<author>
			<persName><forename type="first">H</forename><surname>Bredin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Yin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">M</forename><surname>Coria</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Gelly</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Korshunov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Lavechin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Fustes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Titeux</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Bouaziz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M.-P</forename><surname>Gill</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="7124" to="7128" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<monogr>
		<title level="m" type="main">Corpora e lessici dell&apos;italiano parlato e scritto</title>
		<author>
			<persName><forename type="first">F</forename><forename type="middle">A</forename><surname>Leoni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Cutugno</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Savy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Caniparoli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>D'Anna</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Paone</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Giordano</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Manfrellotti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Petrillo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">De</forename><surname>Rosa</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2007">2007</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">TIMIT Acoustic-Phonetic Continuous Speech Corpus</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">S</forename><surname>Garofolo</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Linguistic Data Consortium</title>
				<imprint>
			<date type="published" when="1993">1993</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<monogr>
		<author>
			<persName><forename type="first">N</forename><surname>Ryant</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Singh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Krishnamohan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Varma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Church</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Cieri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Du</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ganapathy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Liberman</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2012.01477</idno>
		<title level="m">The Third DIHARD Diarization Challenge</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Audio set: An ontology and human-labeled dataset for audio events</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">F</forename><surname>Gemmeke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">P</forename><surname>Ellis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Freedman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Jansen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Lawrence</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">C</forename><surname>Moore</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Plakal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Ritter</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</title>
				<meeting>the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</meeting>
		<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="776" to="780" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
