Voice Activity Detection on Italian Language

Shibingfeng Zhang¹, Gloria Gagliardi¹ and Fabio Tamburini¹
¹ FICLIT, Alma Mater Studiorum - University of Bologna, via Zamboni, 32, Bologna, Italy

Abstract
Voice Activity Detection (VAD) refers to the task of identifying human voice activity in noisy settings, playing a crucial role in fields like speech recognition and audio surveillance. However, most VAD research focuses on English, leaving other languages, such as Italian, under-explored. This study aims to evaluate and enhance VAD systems for Italian speech, with the goal of finding a solution for the speech segmentation component of the Digital Linguistic Biomarkers (DLBs) extraction pipeline for early mental disorder diagnosis. We experimented with various VAD systems and proposed an ensemble VAD system. Our ensemble system shows improvements in speech event detection. This advancement lays a robust foundation for more accurate early detection of mental health issues using DLBs in Italian.

Keywords
Voice Activity Detection, Digital Linguistic Biomarkers, Speech Processing, Speech Segmentation

CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04-06, 2024, Pisa, Italy

1. Introduction

Voice Activity Detection (VAD) refers to the task of identifying the presence of human voice activity in noisy speech, classifying utterance segments as "speech" or "non-speech". Typically, it involves making binary decisions on each frame of a noisy signal [1]. VAD has a wide range of applications, serving as a crucial component in fields such as telecommunications, speech recognition systems, and audio surveillance. Nevertheless, the great majority of current works focus on the application of VAD to English, while many factors can affect the performance of a VAD system transferred from one language to another, potentially leading to suboptimal results. For instance, voice onset time may vary significantly between languages, affecting a system's ability to detect speech activity accurately [2]. Additionally, differences in phonetic structures can further complicate a system's effectiveness across languages. Given these factors, research evaluating various VAD systems on Italian speech is highly valuable.

Digital Linguistic Biomarkers (DLBs) are linguistic features automatically extracted from patients' verbal productions that provide insights into their medical state [3]. Gagliardi and Tamburini [3] proposed the first DLBs extraction pipeline for the early diagnosis of mental disorders in Italian. The extraction of acoustic and rhythmic features relies heavily on the preprocessing step, which consists of speech segmentation via VAD. The VAD system adopted by Gagliardi and Tamburini [3] is a statistical VAD system named "SSVAD v1.0" [4], which will be presented and compared to other VAD systems in Section 2.

In this project, we focus on VAD for the Italian language, an area that remains largely unexplored, aiming to find a VAD system that performs better and is more reliable than the one adopted in the original pipeline. The outcomes of this project will serve as a fundamental component of the DLBs extraction pipeline, replacing the current VAD system. Moreover, our efforts will provide a robust foundation for future work in this domain, facilitating more accurate and earlier detection of mental health issues using linguistic biomarkers.

Our main contributions are as follows:

• Testing and evaluating various VAD systems on Italian speech.
• Proposing an ensemble VAD system that achieves superior results.

This paper is structured into five sections. Section 2 presents the VAD systems leveraged in this work. Section 3 details the experiments, evaluation metrics, and data resources used for testing the VAD systems. Section 4 presents and discusses the experimental results. Finally, Section 5 draws conclusions.
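To make the frame-level decision task concrete, the sketch below implements a minimal energy-threshold detector that labels each 10 ms frame as speech or non-speech. This is purely illustrative and is not one of the systems evaluated in this paper; the frame size and the threshold value are arbitrary choices for the example.

```python
import numpy as np

def energy_vad(signal, sample_rate, frame_ms=10, threshold_db=-35.0):
    """Label each frame of `signal` as speech (1) or non-speech (0)
    by comparing its log-energy to a fixed threshold."""
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(signal) // frame_len
    labels = np.zeros(n_frames, dtype=int)
    for i in range(n_frames):
        frame = signal[i * frame_len:(i + 1) * frame_len]
        energy = np.mean(frame ** 2) + 1e-12   # avoid log(0) on silence
        if 10 * np.log10(energy) > threshold_db:
            labels[i] = 1
    return labels

# 1 s of silence followed by 1 s of a loud 440 Hz tone at 16 kHz
sr = 16000
sig = np.concatenate([np.zeros(sr),
                      0.5 * np.sin(2 * np.pi * 440 * np.arange(sr) / sr)])
labels = energy_vad(sig, sr)
print(labels[:5], labels[-5:])   # silent frames are 0, tone frames are 1
```

Real systems replace the single energy feature with statistical models or neural networks, but the output, one binary decision per frame, has the same shape.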
shibingfeng.zhang@unibo.it (S. Zhang); gloria.gagliardi@unibo.it (G. Gagliardi); fabio.tamburini@unibo.it (F. Tamburini)
https://www.unibo.it/sitoweb/shibingfeng.zhang (S. Zhang); https://www.unibo.it/sitoweb/gloria.gagliardi (G. Gagliardi); https://www.unibo.it/sitoweb/fabio.tamburini (F. Tamburini)
ORCID: 0009-0005-7320-9088 (S. Zhang); 0000-0001-5257-1540 (G. Gagliardi); 0000-0001-7950-0347 (F. Tamburini)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073.

2. Background

This section outlines the background, state-of-the-art developments, and architectures of VAD systems.

The majority of Voice Activity Detection (VAD) systems approach the task as a binary classification for each frame of a noisy audio signal, with or without overlaps between frames. Based on their architecture, these systems can generally be divided into two categories: statistical VAD systems and deep neural network (DNN) VAD systems.

Statistical VAD systems rely on probabilistic models and statistical signal processing techniques to distinguish between speech and non-speech segments. Common statistical methods include Gaussian Mixture Models (GMM), Hidden Markov Models (HMM), and Bayesian frameworks. For example, Sohn et al. [5] proposed a robust statistical VAD system that models the signal using a first-order two-state HMM. In this system, the VAD score of each frame is calculated from the likelihood ratio between the probability density functions conditioned on two hypotheses: speech absent and speech present. Additionally, the state-transition probability is determined using the likelihood ratio from the previous frame, which helps maintain temporal coherence and improves the accuracy of voice activity detection.

On the other hand, VAD systems based on DNNs leverage the power of deep learning. These systems use neural network architectures such as convolutional neural networks (CNNs), recurrent neural networks (RNNs), or more advanced structures with attention mechanisms [6].

Below, we present the VAD systems we experimented with in this project, along with a brief description of each:

SSVAD v1.0 (Baseline) [4] is a statistical VAD system designed to handle low signal-to-noise ratio (SNR), impulsive noise, and cross-talk in interview-style speech files. The system enhances speech segments as a pre-processing step to improve SNR, thereby facilitating the subsequent speech/non-speech decisions. SSVAD v1.0 was previously integrated into the older version of the DLBs extraction pipeline [7] for speech segmentation and serves as the baseline for comparison with the other systems in this study.

rVAD [8] is an unsupervised model comprising two denoising steps followed by a final VAD stage. In the first denoising step, high-energy noise segments are identified and nullified. The second step applies a speech enhancement method to further denoise the signal.

Silero [9] is a pre-trained CNN system with an encoder-decoder architecture. Detailed information about this VAD system is limited, as it is closed source and undocumented.

WebRTC VAD is a system developed by Google for the WebRTC project¹. Similar to the Silero VAD system, it is closed source, and detailed information about its architecture is not publicly available.

GPVAD [10] is a 5-layer framework composed of CNN and RNN layers. The model employs a data-driven teacher-student learning paradigm for VAD, where a teacher model is initially trained on a source dataset with weak labels to handle vast and noisy audio data. The trained teacher model then provides frame-level guidance to a student model trained on various unlabeled target datasets.

Context-aware VAD [11] is a self-attentive VAD system based on the Transformer architecture [12]. The self-attentive VAD model processes acoustic features extracted from the audio input, enhancing them with contextual information from the surrounding frames.

Pyannote [13] is a pre-trained open-source toolkit for audio processing that includes a VAD model. Similar to GPVAD and Silero, it is a DNN-based model with CNN and RNN components.

¹ https://webrtc.org/

3. Experiments

This section provides an overview of the experiments we conducted, the evaluation metrics applied, and the resources adopted for the experiments.

3.1. Evaluation Dataset

In this work, the CLIPS dataset (Corpora e Lessici dell'Italiano Parlato e Scritto, Italian for Corpora and Lexicons of Spoken and Written Italian)² [14] is adopted to evaluate the different VAD systems. CLIPS comprises approximately 100 hours of speech data, equally distributed between male and female voices. It includes a diverse range of regional and situational speech samples to ensure a comprehensive representation of the Italian language across different contexts. The CLIPS dataset is organized into five subsets, with the "DIALOGICO" and "LETTO" subsets offering complete temporal alignments between audio and textual transcription, totaling approximately 7.5 hours of test data. The "DIALOGICO" subset includes dialogues between two interlocutors, while the "LETTO" subset consists of recordings where words are read aloud from lists.

² http://www.clips.unina.it/it/

3.2. Experiment Settings & Evaluation

To thoroughly evaluate the performance of the various VAD systems, we used two sets of metrics: segment-level metrics and event-level metrics. Segment-level metrics treat each 10ms segment of audio (a single frame) independently, calculating metrics such as F1 score, precision, recall, error rate, and accuracy. Event-level metrics, on the other hand, consider each speech segment as a unit: a prediction is deemed correct if its overlap with the ground truth exceeds 50%, and the same metrics are calculated accordingly.

Experiments were conducted on the CLIPS dataset using the VAD systems outlined in Section 2. To achieve optimal results, all systems were tested with their default frame size. Furthermore, we combined the systems' predictions through different ensemble methods to enhance performance further. More details on these ensemble methods are provided in Section 4.2.
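The two evaluation views can be sketched as follows. Segment-level F1 compares per-frame binary labels; event-level correctness checks the 50% overlap criterion. The paper does not specify which event's duration the 50% is measured against; the sketch below assumes it is relative to the ground-truth event.

```python
import numpy as np

def segment_f1(pred, gold):
    """Segment-level F1: pred/gold are binary arrays, one entry per 10 ms frame."""
    tp = np.sum((pred == 1) & (gold == 1))
    fp = np.sum((pred == 1) & (gold == 0))
    fn = np.sum((pred == 0) & (gold == 1))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def event_correct(pred_event, gold_event):
    """A predicted (start, end) event counts as correct when its overlap with
    the gold event exceeds 50% of the gold event's duration (an assumption)."""
    start = max(pred_event[0], gold_event[0])
    end = min(pred_event[1], gold_event[1])
    overlap = max(0.0, end - start)
    return overlap / (gold_event[1] - gold_event[0]) > 0.5

gold = np.array([0, 0, 1, 1, 1, 1, 0, 0])
pred = np.array([0, 1, 1, 1, 0, 1, 0, 0])
print(round(segment_f1(pred, gold), 2))           # 0.75
print(event_correct((0.25, 0.55), (0.20, 0.60)))  # True: 0.30 s of 0.40 s overlap
```

The same counting scheme extends to precision, recall, error rate, and accuracy at both levels.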
4. Results

This section presents and analyses the experimental results of the different VAD systems.

4.1. Single Systems Evaluation

Table 1 shows the experimental results obtained from the systems described in Section 2. The evaluation results are derived using the methods presented in Section 3.2.

Table 1: Results of the VAD experiments on different systems. For segment-level results, each 10ms is considered one segment. For event-level results, a prediction is considered correct if its overlap with the ground truth exceeds 50%. The evaluation metric used is the F1 score.

Method               Segment-level   Event-level
Context-aware VAD         60.4           12.1
SSVAD (Baseline)          62.2           23.1
WebRTC                    64.6           27.0
rVAD                      69.5           72.2
GPVAD                     89.5           72.3
Pyannote                  92.3           80.3
Silero                    92.5           80.1

As can be seen, the majority of the tested systems outperformed the baseline system SSVAD, used in the current DLBs pipeline, at the segment level. A notable pattern in the experimental results is that DNN-based systems, such as Silero, GPVAD, and Pyannote, tend to achieve better results than traditional statistical systems like rVAD and SSVAD. However, Context-aware VAD is an exception, with an F1 score of 60.4, which is lower than the baseline SSVAD score of 62.2. The event-level results are similar to the segment-level results: almost all systems outperformed the baseline, and DNN-based systems tend to perform better, with Context-aware VAD again being an exception, as its F1 score is the lowest among all systems. The poor performance of Context-aware VAD could be attributed to the fact that, unlike GPVAD and Pyannote, it is trained only on the TIMIT [15] dataset with additional background noise. TIMIT is a relatively small English speech dataset, containing only 5 hours of audio, likely causing the system to overfit on this dataset. Another possible reason for this relatively poor performance could be that, while Pyannote and GPVAD are trained on large multi-domain datasets like DIHARD III [16] and AudioSet [17], Context-aware VAD is trained solely on English speech. When tested on Italian speech, the system could suffer a domain shift, resulting in diminished performance.

To gain a better understanding of the differences in system performance, a Kruskal-Wallis test was conducted. The results indicate that the differences are significant at both the segment level and the event level. A Dunn's test was then performed for post-hoc comparisons. The statistical analysis demonstrates that GPVAD, rVAD, Silero, and Pyannote exhibit similar performance at both the segment and event levels, while SSVAD, WebRTC, and Context-aware VAD show significantly lower performance at both levels.

After considering the performance at the different levels, we tested all combinations of three systems to form an ensemble prediction system and generate more accurate VAD results. The architectures of these ensemble systems and the corresponding experimental results are discussed in the following section.

4.2. Ensemble Systems Evaluation

This section details the ensemble methods that combine the predictions of the systems tested in Section 4.1. It subsequently presents the experimental results and analysis.

Of the systems presented in Section 2, Silero, Pyannote, GPVAD, and Context-aware VAD assign a score to each frame, with a threshold used for making predictions. The other systems do not generate such scores, either due to differences in their architecture or because they are closed source. This score can be interpreted as the probability of the frame being speech or not. We attempted to ensemble the systems' predictions using both the probability scores and the final predictions. The major challenge faced by these ensemble methods is that each system uses a different frame size, which complicates alignment within the ensemble system.

We proposed and tested several ensemble strategies:

• Probability Voting (PV): This method sums and averages the probability scores from the different predictions.
• Probability Voting with Frame (PV_f): In this approach, each audio file is first segmented into frames. For each frame, we identify all overlapping frames from all predictions, average their probability scores, and use this average as the probability score for the frame. The frame size of PV_f is 200 ms.
• Simple Voting with Frame (SV_f): Similar to PV_f, this method segments the audio into frames. However, instead of averaging probability scores, it performs simple majority voting based on the predictions of the overlapping frames. The frame size of SV_f is 200 ms.
• Probability Voting with Weight (PV_w): This method is akin to PV_f, but the probability scores of the overlapping frames from the three predictions are weighted according to their overlap percentage. These weighted scores are then summed to determine the probability score for each frame.
• Probability Voting with Sampling (PV_s): For a given audio file, this method samples timestamps. For each timestamp, it calculates the mean of the probability scores from the three systems, using this mean as the probability score for the timestamp. The sampling rate of PV_s is approximately 33.33 Hz, meaning that one point is sampled every 0.03 seconds.
• Probability Voting with Bézier curve modelling (PV_b): For each prediction from each system, a Bézier curve is generated using control points sampled from the prediction. This approach uses a smooth curve to model the prediction and address the alignment issues caused by the systems' different frame sizes. Similar to PV_f, each audio file is divided into frames, and the probability score for each frame is the average of the scores estimated by the Bézier curves. The sampling rate of the control points used to generate the Bézier curves in PV_b is 5 Hz (one point every 0.2 seconds).
We experimented with all possible system combinations using the SV_f ensemble method, as well as all possible combinations of Silero, Pyannote, GPVAD, and Context-aware VAD using the other, probability-based ensemble methods, as these are the only systems that generate probability scores. For all probability-based methods, the "speech/non-speech" prediction for each frame is determined by applying a threshold of 0.5 to the probability score.

Table 2 presents the results of all possible combinations composing the ensemble system with the SV_f method. Table 3 presents the results of all possible combinations composing the ensemble systems with the probability-score-related methods. The evaluation results are derived using the methods presented in Section 3.2.
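The sweep over three-system combinations with the 0.5 threshold can be sketched as below. The per-frame probability scores here are invented for illustration and are assumed to be already aligned on a common frame grid.

```python
from itertools import combinations
import numpy as np

# Hypothetical per-frame probability scores from the four score-producing
# systems, already aligned on a common frame grid (made-up numbers).
scores = {
    "Silero":   np.array([0.9, 0.8, 0.3, 0.1]),
    "Pyannote": np.array([0.8, 0.9, 0.4, 0.2]),
    "GPVAD":    np.array([0.7, 0.6, 0.6, 0.1]),
    "C-a":      np.array([0.4, 0.5, 0.5, 0.3]),
}

# Every 3-system combination, averaged and thresholded at 0.5
for combo in combinations(scores, 3):
    mean = np.mean([scores[name] for name in combo], axis=0)
    pred = (mean >= 0.5).astype(int)
    print(combo, pred)
```

With four score-producing systems this yields the four triples evaluated by the probability-based methods; the SV_f sweep over all seven systems is the same enumeration applied to the binary predictions instead.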
As shown in Table 2, the ensembles created using the SV_f method did not yield better results than the individual systems at the segment level. The highest segment-level score of 91.5 was achieved by the combination of GPVAD, Silero, and Pyannote, which is still 0.6 lower than the best performance of the Silero system alone. However, at the event level, the same combination achieved the highest score among all ensemble systems, with an F1 score of 84.0, which is higher than the best score achieved by a single system. Meanwhile, all other combinations yielded scores lower than the best performance of the individual systems.

As shown in Table 3, the ensemble systems based on probability scores did not achieve prominently better results than the single systems at the segment level either, with the PV_s and PV_b systems of the Pyannote, GPVAD, Silero combination being only slightly higher, by a small margin of 0.6, than Silero. However, at the event level, several evident improvements can be observed in the performance of the ensemble systems. The probability-based ensemble systems combining Pyannote, GPVAD, and Silero, except for PV_b and PV, outperformed the single systems at the event level, with PV_f achieving an F1 score of 85.9, which is 5.6 points higher than that of Pyannote. This result demonstrates that the ensemble approach can lead to substantial performance gains in detecting the temporal intervals in which speech takes place. It is worth noticing that the ensemble system PV_b consistently shows a great disparity between its performance at the segment level and at the event level across all combinations. Despite its good performance at the segment level, PV_b achieves a rather low F1 score at the event level, far lower than all other systems. This disparity across levels is likely caused by the insufficient number of control points adopted for generating the Bézier curve. However, increasing the number of control points is infeasible due to the computational complexity of evaluating the curve, which is O(n²), with n being the number of control points.

Given that the ensemble systems composed of GPVAD, Silero, and Pyannote consistently outperformed the other combinations across all ensemble methods, a Kruskal-Wallis test, followed by Dunn's post-hoc test, was conducted to assess the differences in performance between the ensemble methods and the individual systems GPVAD, Silero, and Pyannote. At the segment level, the Kruskal-Wallis test indicates that the differences are not significant. However, at the event level, the results reveal that PV_b's performance is significantly lower than that of the other systems.

In summary, given the performance of the systems, we plan to adopt PV_f as the speech segmentation component of the DLBs extraction pipeline, leveraging the combined predictions of Pyannote, Silero, and GPVAD. While PV_f shows slightly lower segment-level performance than the top-performing individual system, it enhances the accuracy of identifying speech intervals. This trade-off is justified by the substantial improvement in speech event detection performance.

Table 2: Results of the VAD experiments using the SV_f ensemble method. For comparison, the results of the individual systems that achieved the best performance, Silero and Pyannote, are also included. S stands for segment-level result; E stands for event-level result; C-a stands for the Context-aware VAD system. For segment-level results, each 10ms is considered one segment. For event-level results, a prediction is considered correct if its overlap with the ground truth exceeds 50%. The evaluation metric used is the F1 score.

Involved Systems              S      E
Silero                        92.5   80.1
Pyannote                      92.3   80.3
GPVAD, Silero, Pyannote       91.5   84.0
GPVAD, C-a, WebRTC            58.4   62.0
GPVAD, SSVAD, C-a             66.0   17.6
GPVAD, SSVAD, WebRTC          58.9   76.6
Pyannote, C-a, WebRTC         60.6   70.1
Pyannote, GPVAD, C-a          81.5   42.1
Pyannote, GPVAD, SSVAD        83.3   58.1
Pyannote, GPVAD, WebRTC       61.3   55.3
Pyannote, SSVAD, C-a          68.6   17.7
Pyannote, SSVAD, WebRTC       60.9   72.6
SSVAD, C-a, WebRTC            47.0   29.8
Silero, C-a, WebRTC           60.7   70.0
Silero, GPVAD, C-a            81.8   43.1
Silero, GPVAD, SSVAD          83.6   57.7
Silero, GPVAD, WebRTC         61.4   59.9
Silero, Pyannote, C-a         84.4   52.5
Silero, Pyannote, SSVAD       85.9   68.7
Silero, Pyannote, WebRTC      62.0   47.9
Silero, SSVAD, C-a            68.8   17.5
Silero, SSVAD, WebRTC         60.8   73.0
rVAD, C-a, WebRTC             52.2   41.4
rVAD, GPVAD, C-a              71.1   29.0
rVAD, GPVAD, SSVAD            74.3   42.5
rVAD, GPVAD, WebRTC           58.4   79.3
rVAD, Pyannote, C-a           73.4   27.5
rVAD, Pyannote, GPVAD         83.5   75.1
rVAD, Pyannote, SSVAD         76.7   43.2
rVAD, Pyannote, WebRTC        60.8   58.7
rVAD, SSVAD, C-a              56.8   18.1
rVAD, SSVAD, WebRTC           54.0   63.0
rVAD, Silero, C-a             73.5   27.1
rVAD, Silero, GPVAD           83.6   73.5
rVAD, Silero, Pyannote        86.3   82.4
rVAD, Silero, SSVAD           76.8   42.2
rVAD, Silero, WebRTC          61.0   63.3

Table 3: Results of the VAD experiments using the probability-score-related ensemble methods. For comparison, the results of the individual systems that achieved the best performance, Silero and Pyannote, are also included. Method stands for the ensemble method adopted; S stands for segment-level result; E stands for event-level result; C-a stands for the Context-aware VAD system. For segment-level results, each 10ms is considered one segment. For event-level results, a prediction is considered correct if its overlap with the ground truth exceeds 50%. The evaluation metric used is the F1 score.

Involved Systems          Method   S      E
Silero                    -        92.5   80.1
Pyannote                  -        92.3   80.3
Pyannote, GPVAD, Silero   PV       91.5   67.9
Pyannote, GPVAD, Silero   PV_f     91.9   85.9
Pyannote, GPVAD, Silero   PV_s     93.1   81.8
Pyannote, GPVAD, Silero   PV_w     91.8   85.6
Pyannote, GPVAD, Silero   PV_b     93.0    9.5
Pyannote, GPVAD, C-a      PV       87.2   60.4
Pyannote, GPVAD, C-a      PV_f     87.6   80.0
Pyannote, GPVAD, C-a      PV_s     89.3   79.4
Pyannote, GPVAD, C-a      PV_w     87.5   79.2
Pyannote, GPVAD, C-a      PV_b     89.2   10.5
Silero, GPVAD, C-a        PV       85.4   50.6
Silero, GPVAD, C-a        PV_f     85.7   72.7
Silero, GPVAD, C-a        PV_s     84.2   67.3
Silero, GPVAD, C-a        PV_w     85.6   71.6
Silero, GPVAD, C-a        PV_b     88.8   11.0
Silero, Pyannote, C-a     PV       89.4   70.4
Silero, Pyannote, C-a     PV_f     89.6   81.2
Silero, Pyannote, C-a     PV_s     89.5   77.7
Silero, Pyannote, C-a     PV_w     89.6   81.5
Silero, Pyannote, C-a     PV_b     89.6    9.3
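The Kruskal-Wallis comparison used above reduces to a rank-based statistic over the groups of scores. The sketch below computes the H statistic only, without tie correction or p-values, and the per-group scores are made-up numbers; the analysis in the paper was presumably run with a standard statistics package, which would also provide Dunn's post-hoc test.

```python
import numpy as np

def kruskal_h(*groups):
    """Kruskal-Wallis H statistic (simplified: assumes no tied values)."""
    all_vals = np.concatenate(groups)
    n = len(all_vals)
    # rank 1..n of each value (argsort of argsort gives 0-based ranks)
    ranks = all_vals.argsort().argsort() + 1
    h = 0.0
    start = 0
    for g in groups:
        r = ranks[start:start + len(g)]
        h += r.sum() ** 2 / len(g)
        start += len(g)
    return 12.0 / (n * (n + 1)) * h - 3 * (n + 1)

# Per-file F1 scores for three hypothetical systems (made-up numbers)
a = [80.1, 79.5, 81.2, 80.7]
b = [72.3, 71.9, 73.4, 72.8]
c = [60.4, 61.1, 59.8, 60.9]
print(round(kruskal_h(a, b, c), 3))   # 9.846: groups are fully separated
```

Comparing H against a chi-squared distribution with k-1 degrees of freedom gives the significance test; a library implementation should be preferred in practice because it handles ties and computes the p-value directly.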
5. Conclusions

In this study, we explored and enhanced Voice Activity Detection systems for the Italian language, a relatively under-explored area in speech processing. We experimented with various systems and integrated them into an ensemble to improve detection accuracy. Our findings indicate that combining the predictions of multiple models can lead to better results in detecting speech temporal intervals. This effective ensemble method will be used as a component of a Digital Linguistic Biomarkers extraction pipeline.

By enhancing the accuracy of speech segmentation, this method provides a more reliable foundation for extracting meaningful linguistic features for the diagnosis of cognitive impairment. Future research could focus on refining the ensemble method by incorporating additional linguistic features into VAD systems and exploring their synergistic effects. Additionally, investigating the application of this approach to other languages and dialects could expand its utility.

Acknowledgements

This study was funded by the European Union – NextGenerationEU programme through the Italian National Recovery and Resilience Plan – NRRP (Mission 4 – Education and research), as a part of the project ReMind: an ecological, cost-effective AI platform for early detection of prodromal stages of cognitive impairment (PRIN 2022, 2022YKJ8FP – CUP J53D23008380006).

CRediT Author Statement

SZ: Investigation, Software, Formal analysis, Visualization, Writing - Original Draft. GG: Writing - Review & Editing, Project administration, Funding acquisition. FT: Conceptualization, Methodology, Supervision, Writing - Review & Editing.

References

[1] S. Graf, T. Herbig, M. Buck, G. Schmidt, Features for voice activity detection: a comparative analysis, EURASIP Journal on Advances in Signal Processing 2015 (2015) 1–15.
[2] T. Cho, D. H. Whalen, G. Docherty, Voice onset time and beyond: Exploring laryngeal contrast in 19 languages, Journal of Phonetics 72 (2019) 52–65.
[3] G. Gagliardi, F. Tamburini, The automatic extraction of linguistic biomarkers as a viable solution for the early diagnosis of mental disorders, in: Proceedings of the Thirteenth Language Resources and Evaluation Conference, European Language Resources Association, 2022, pp. 5234–5242.
[4] M.-W. Mak, H.-B. Yu, A study of voice activity detection techniques for NIST speaker recognition evaluations, Computer Speech & Language 28 (2014) 295–313.
[5] J. Sohn, N. S. Kim, W. Sung, A statistical model-based voice activity detection, IEEE Signal Processing Letters 6 (1999) 1–3.
[6] A. Sehgal, N. Kehtarnavaz, A convolutional neural network smartphone app for real-time voice activity detection, IEEE Access 6 (2018) 9017–9026.
[7] L. Calzà, G. Gagliardi, R. R. Favretti, F. Tamburini, Linguistic features and automatic classifiers for identifying mild cognitive impairment and dementia, Computer Speech & Language 65 (2021) 101113.
[8] Z.-H. Tan, N. Dehak, et al., rVAD: An unsupervised segment-based robust voice activity detection method, Computer Speech & Language 59 (2020) 1–21.
[9] Silero Team, Silero VAD: pre-trained enterprise-grade voice activity detector (VAD), number detector and language classifier, https://github.com/snakers4/silero-vad, 2021.
[10] H. Dinkel, S. Wang, X. Xu, M. Wu, K. Yu, Voice activity detection in the wild: A data-driven approach using teacher-student training, IEEE/ACM Transactions on Audio, Speech, and Language Processing 29 (2021) 1542–1555.
[11] Y. R. Jo, Y. K. Moon, W. I. Cho, G. S. Jo, Self-attentive VAD: Context-aware detection of voice from noise, in: ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2021, pp. 6808–6812.
[12] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, Advances in Neural Information Processing Systems 30 (2017).
[13] H. Bredin, R. Yin, J. M. Coria, G. Gelly, P. Korshunov, M. Lavechin, D. Fustes, H. Titeux, W. Bouaziz, M.-P. Gill, pyannote.audio: neural building blocks for speaker diarization, in: ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2020, pp. 7124–7128.
[14] F. A. Leoni, F. Cutugno, R. Savy, V. Caniparoli, L. D'Anna, E. Paone, R. Giordano, O. Manfrellotti, M. Petrillo, A. De Rosa, Corpora e lessici dell'italiano parlato e scritto, 2007.
[15] J. S. Garofolo, TIMIT acoustic phonetic continuous speech corpus, Linguistic Data Consortium, 1993.
[16] N. Ryant, P. Singh, V. Krishnamohan, R. Varma, K. Church, C. Cieri, J. Du, S. Ganapathy, M. Liberman, The third DIHARD diarization challenge, arXiv preprint arXiv:2012.01477 (2020).
[17] J. F. Gemmeke, D. P. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, M. Ritter, Audio Set: An ontology and human-labeled dataset for audio events, in: Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2017, pp. 776–780.