<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Voice Activity Detection on Italian Language</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Shibingfeng</forename><surname>Zhang</surname></persName>
							<email>shibingfeng.zhang@unibo.it</email>
							<affiliation key="aff0">
								<orgName type="department">FICLIT</orgName>
								<orgName type="institution">Alma Mater Studiorum -University of Bologna</orgName>
								<address>
									<addrLine>via Zamboni, 32</addrLine>
									<settlement>Bologna</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Gloria</forename><surname>Gagliardi</surname></persName>
							<email>gloria.gagliardi@unibo.it</email>
							<affiliation key="aff0">
								<orgName type="department">FICLIT</orgName>
								<orgName type="institution">Alma Mater Studiorum -University of Bologna</orgName>
								<address>
									<addrLine>via Zamboni, 32</addrLine>
									<settlement>Bologna</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Fabio</forename><surname>Tamburini</surname></persName>
							<email>fabio.tamburini@unibo.it</email>
							<affiliation key="aff0">
								<orgName type="department">FICLIT</orgName>
								<orgName type="institution">Alma Mater Studiorum -University of Bologna</orgName>
								<address>
									<addrLine>via Zamboni, 32</addrLine>
									<settlement>Bologna</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff1">
								<orgName type="department">Tenth Italian Conference on Computational Linguistics</orgName>
								<address>
									<addrLine>Dec 04 -06</addrLine>
									<postCode>2024</postCode>
									<settlement>Pisa</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Voice Activity Detection on Italian Language</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">544355F285F241E81E0C6AD4A520C72F</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T17:33+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Voice Activity Detection</term>
					<term>Digital Linguistic Biomarkers</term>
					<term>Speech Processing</term>
					<term>Speech Segmentation</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Voice Activity Detection (VAD) refers to the task of identifying human voice activity in noisy settings, playing a crucial role in fields like speech recognition and audio surveillance. However, most VAD research focuses on English, leaving other languages, such as Italian, under-explored. This study aims to evaluate and enhance VAD systems for Italian speech, with the goal of finding a solution for the speech segmentation component of the Digital Linguistic Biomarkers (DLBs) extraction pipeline for early mental disorder diagnosis. We experimented with various VAD systems and proposed an ensemble VAD system. Our ensemble system shows improvements in speech event detection. This advancement lays a robust foundation for more accurate early detection of mental health issues using DLBs in Italian.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Voice Activity Detection (VAD) refers to the task of identifying the presence of human voice activity in noisy speech, classifying utterance segments as "speech" or "non-speech". Typically, it involves making binary decisions on each frame of a noisy signal <ref type="bibr" target="#b0">[1]</ref>. VAD has a wide range of applications, serving as a crucial component in fields such as telecommunications, speech recognition, and audio surveillance. Nevertheless, the great majority of current work focuses on applying VAD to English, even though many factors can affect the performance of a VAD system transferred from one language to another, potentially leading to suboptimal results. For instance, voice onset time may vary significantly between languages, affecting the system's ability to detect speech activity accurately <ref type="bibr" target="#b1">[2]</ref>. Additionally, differences in phonetic structure can further complicate a system's effectiveness across languages. Given these factors, research evaluating various VAD systems on Italian speech is highly valuable.</p><p>Digital Linguistic Biomarkers (DLBs) are linguistic features automatically extracted from patients' verbal productions that provide insights into their medical state <ref type="bibr" target="#b2">[3]</ref>. Gagliardi and Tamburini <ref type="bibr" target="#b2">[3]</ref> proposed the first DLBs extraction pipeline for the early diagnosis of mental disorders in Italian. The extraction of acoustic and rhythmic features relies heavily on the preprocessing step, which consists of speech segmentation via VAD. 
The VAD system adopted by Gagliardi and Tamburini <ref type="bibr" target="#b2">[3]</ref> is a statistical VAD system named "SSVAD v1.0" <ref type="bibr" target="#b3">[4]</ref>, which will be presented and compared to other VAD systems in Section 2.</p><p>In this project, we focus on VAD for the Italian language, an area that remains largely unexplored, aiming to find a VAD system that performs better and is more reliable than the one adopted in the original pipeline. The outcomes of this project will serve as a fundamental component in the pipeline for extracting DLBs and replacing the current VAD system. Moreover, our efforts will provide a robust foundation for future work in this domain, facilitating more accurate and early detection of mental health issues using linguistic biomarkers.</p><p>Our main contributions are as follows:</p><p>• Testing and evaluating various VAD systems on Italian speech. • Proposing an ensemble VAD system that achieves superior results.</p><p>This paper is structured into five sections. Section 2 presents the data resources and VAD systems leveraged in this work. Section 3 details the experiments and resources for testing VAD systems. Section 4 presents and discusses the experimental results. Finally, Section 5 draws conclusions.</p></div>
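To make the frame-wise formulation above concrete, here is a minimal sketch of per-frame binary VAD. It is illustrative only: it uses a fixed energy threshold rather than any of the systems evaluated in this paper, and the helper name `frame_energy_vad` is ours.

```python
def frame_energy_vad(samples, sample_rate=16000, frame_ms=10, threshold=0.01):
    """Label each frame of `samples` as speech (True) or non-speech (False)
    by comparing its mean squared energy against a fixed threshold.
    A toy illustration of frame-wise binary VAD, not a production system."""
    frame_len = int(sample_rate * frame_ms / 1000)  # samples per 10 ms frame
    labels = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(x * x for x in frame) / frame_len
        labels.append(energy > threshold)
    return labels


# 10 ms of silence followed by 10 ms of a loud signal at 16 kHz
labels = frame_energy_vad([0.0] * 160 + [0.5] * 160)
```

A real VAD replaces the energy comparison with a statistical or learned decision, but the per-frame binary output has the same shape.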
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Background</head><p>This section outlines the background, state-of-the-art developments, and architectures of VAD systems.</p><p>The majority of Voice Activity Detection (VAD) systems approach the task as a binary classification for each frame of a noisy audio signal, with or without overlaps between frames. Based on their architecture, these systems can generally be divided into two categories: statistical VAD systems and deep neural network (DNN) VAD systems.</p><p>Statistical VAD systems rely on probabilistic models and statistical signal processing techniques to distinguish between speech and non-speech segments. Common statistical methods include Gaussian Mixture Models (GMM), Hidden Markov Models (HMM), and Bayesian frameworks. For example, Sohn et al. <ref type="bibr" target="#b4">[5]</ref> proposed a robust statistical VAD system that models the signal using a first-order two-state HMM. In this system, the VAD score of each frame is calculated based on the likelihood ratio between the probability density functions conditioned on two hypotheses: speech absent and speech present. Additionally, the state-transition probability is determined using the likelihood ratio from the previous frame, which helps in maintaining temporal coherence and improving the accuracy of voice activity detection.</p><p>On the other hand, VAD systems based on DNNs leverage the power of deep learning. 
These systems use neural network architectures, such as convolutional neural networks (CNNs), recurrent neural networks (RNNs), or more advanced structures with attention mechanisms <ref type="bibr" target="#b5">[6]</ref>.</p><p>Below, we present the list of the VAD systems we experimented with in this project, along with a brief description of each system: SSVAD v1.0 (Baseline) <ref type="bibr" target="#b3">[4]</ref> is a statistical VAD system designed to handle low signal-to-noise ratio (SNR), impulsive noise, and cross-talk in interview-style speech files. The system enhances speech segments as a pre-processing step to improve the SNR, thereby facilitating subsequent speech/non-speech decisions. SSVAD v1.0 was previously integrated into the older version of the DLBs extraction pipeline <ref type="bibr" target="#b6">[7]</ref> for speech segmentation and serves as the baseline for comparison with the other systems in this study. rVAD <ref type="bibr" target="#b7">[8]</ref> is an unsupervised model comprising two denoising steps followed by a final VAD stage. In the first denoising step, high-energy noise segments are identified and nullified. The second step applies a speech enhancement method to further denoise the signal. Silero <ref type="bibr" target="#b8">[9]</ref> is a pre-trained CNN system with an encoder-decoder architecture. Detailed information about this VAD system is limited, as it is closed source and undocumented. WebRTC VAD is a system developed by Google for the WebRTC project (https://webrtc.org/). Similar to the Silero VAD system, it is closed source, and detailed information about its architecture is not publicly available. GPVAD <ref type="bibr" target="#b9">[10]</ref> is a 5-layer framework composed of CNN and RNN layers. The model employs a data-driven teacher-student learning paradigm for VAD, where a teacher model is initially trained on a source dataset with weak labels to handle vast and noisy audio data. 
The trained teacher model then provides frame-level guidance to a student model trained on various unlabeled target datasets. Context-aware VAD <ref type="bibr" target="#b10">[11]</ref> is a self-attentive VAD system based on the Transformer architecture <ref type="bibr" target="#b11">[12]</ref>. The model processes acoustic features extracted from the audio input, enhancing them with contextual information from surrounding frames. Pyannote <ref type="bibr" target="#b12">[13]</ref> is a pre-trained open-source toolkit for audio processing that includes a VAD model. Similar to GPVAD and Silero, it is a DNN-based model with CNN and RNN components.</p></div>
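The likelihood-ratio idea behind the statistical systems discussed above (e.g. Sohn et al. [5]) can be sketched as follows. This is a deliberately simplified, time-domain stand-in: it assumes zero-mean Gaussian models with known variances, whereas the actual system models spectral coefficients under an HMM and estimates noise statistics adaptively.

```python
import math


def likelihood_ratio(frame, noise_var, speech_var):
    """Average per-sample log-likelihood ratio between two zero-mean
    Gaussian hypotheses: H1 = speech present (variance noise_var + speech_var)
    versus H0 = speech absent (variance noise_var).
    The constant -0.5*log(2*pi) term cancels in the ratio and is omitted.
    Simplified stand-in for the spectral-domain model of Sohn et al.;
    the variances are assumed known here."""
    var0 = noise_var
    var1 = noise_var + speech_var
    llr = 0.0
    for x in frame:
        llr += (-0.5 * math.log(var1) - x * x / (2 * var1)) \
             - (-0.5 * math.log(var0) - x * x / (2 * var0))
    return llr / len(frame)


def is_speech(frame, noise_var=1e-4, speech_var=1e-2, threshold=0.0):
    """Decide speech/non-speech by thresholding the average LLR."""
    return likelihood_ratio(frame, noise_var, speech_var) > threshold
```

In the full system of Sohn et al., the decision additionally uses an HMM state-transition probability derived from the previous frame's likelihood ratio, which smooths the frame-by-frame decisions over time.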
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Experiments</head><p>This section provides an overview of the experiments we conducted, the evaluation metrics applied, and the resources adopted for the experiments.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Evaluation Dataset</head><p>In this work, the CLIPS dataset (Corpora e Lessici dell'Italiano Parlato e Scritto, Italian for Corpora and Lexicons of Spoken and Written Italian) <ref type="foot" target="#foot_0">2</ref> [14] is adopted to evaluate different VAD systems.</p><p>CLIPS comprises approximately 100 hours of speech data, equally distributed between male and female voices. It includes a diverse range of regional and situational speech samples to ensure a comprehensive representation of the Italian language across different contexts. The CLIPS dataset is organized into five subsets, with the "DIALOGICO" and "LETTO" subsets offering complete temporal alignments between audio and textual transcription, totaling approximately 7.5 hours of test data. The "DIALOGICO" subset includes dialogues between two interlocutors, while the "LETTO" subset consists of recordings where words are read aloud from lists.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Experiment Settings &amp; Evaluation</head><p>To thoroughly evaluate the performance of the various VAD systems, we used two sets of metrics: segment-level metrics and event-level metrics. Segment-level metrics treat each 10 ms segment of audio (a single frame) independently, calculating metrics such as F1 score, precision, recall, error rate, and accuracy. Event-level metrics, on the other hand, consider each speech segment as a unit. A prediction is deemed correct if its overlap with the ground truth exceeds 50%, and the same metrics are calculated accordingly.</p><p>Experiments were conducted on the CLIPS dataset using the VAD systems outlined in Section 2. To achieve optimal results, all systems were tested with their default frame sizes. Furthermore, we combined the systems' predictions through different ensemble methods to further enhance performance. More details on these ensemble methods are provided in Section 4.2.</p></div>
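Under the stated assumptions (10 ms frames; an event counted as correct when its overlap with the ground truth exceeds 50%, read here as 50% of the predicted event's duration), the two evaluation views can be sketched as follows. The helper names are ours, not from the paper.

```python
def segment_f1(pred, gold):
    """Segment-level F1 over per-frame boolean labels (each frame = 10 ms)."""
    tp = sum(p and g for p, g in zip(pred, gold))
    fp = sum(p and not g for p, g in zip(pred, gold))
    fn = sum(g and not p for p, g in zip(pred, gold))
    if tp == 0:
        return 0.0
    prec, rec = tp / (tp + fp), tp / (tp + fn)
    return 2 * prec * rec / (prec + rec)


def to_events(labels):
    """Collapse a frame-label sequence into (start, end) speech events,
    with end exclusive, in frame indices."""
    events, start = [], None
    for i, lab in enumerate(labels):
        if lab and start is None:
            start = i
        elif not lab and start is not None:
            events.append((start, i))
            start = None
    if start is not None:
        events.append((start, len(labels)))
    return events


def event_correct(pred_event, gold_events):
    """A predicted event counts as correct if it overlaps some gold event
    by more than 50% of the predicted event's duration."""
    s, e = pred_event
    for gs, ge in gold_events:
        overlap = max(0, min(e, ge) - max(s, gs))
        if overlap > 0.5 * (e - s):
            return True
    return False
```

Event-level scores follow by counting correct predicted events (precision) and matched gold events (recall) and combining them into F1 as above.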
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Results</head><p>This section presents and analyses the experimental results of different VAD systems.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Single Systems Evaluation</head><p>Table <ref type="table">1</ref> shows the experimental results obtained from the systems described in Section 2. The evaluation results are derived using the methods presented in Section 3.2.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 1</head><p>Results of the VAD experiments on different systems. For segment-level results, each 10 ms is considered one segment. For event-level results, a prediction is considered correct if its overlap with the ground truth exceeds 50%. The evaluation metric used is the F1 score. As can be seen, the majority of the tested systems outperformed the baseline system SSVAD used in the current DLB pipeline at the segment level. A notable pattern in the experimental results is that DNN-based systems, such as Silero, GPVAD, and Pyannote, tend to achieve better results than traditional statistical systems like rVAD and SSVAD. However, Context-aware VAD is an exception, with an F1 score of 60.4, which is lower than the baseline SSVAD score of 62.2. As for the event-level results, similar to the segment-level results, almost all systems outperformed the baseline. DNN-based systems tend to perform better, with Context-aware VAD again being an exception, as its F1 score is the lowest among all systems. The poor performance of Context-aware VAD could be attributed to the fact that, unlike GPVAD and Pyannote, it is trained only on the TIMIT <ref type="bibr" target="#b14">[15]</ref> dataset with additional background noise. The TIMIT dataset is a relatively small English speech dataset, containing only 5 hours of audio, likely causing the system to overfit this dataset. Another possible reason for this relatively poor performance could be that, while Pyannote and GPVAD are trained on multilingual datasets like DIHARD III <ref type="bibr" target="#b15">[16]</ref> and Audioset <ref type="bibr" target="#b16">[17]</ref>, Context-aware VAD is trained solely on English speech. When tested on Italian speech, the system could suffer from a domain shift, resulting in diminished performance.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>To gain a better understanding of the differences in system performance, a Kruskal-Wallis test was conducted. The results indicate that the differences between systems are significant at both the segment level and the event level. A Dunn's test was then performed for post-hoc comparisons. The statistical analysis shows that GPVAD, rVAD, Silero, and Pyannote exhibit similar performance at both the segment and event levels, while SSVAD, WebRTC, and Context-aware VAD show significantly lower performance at both levels.</p><p>After considering the performance at the different levels, we tested all combinations of three systems to form ensemble prediction systems that generate more accurate VAD results. The architectures of these ensemble systems and the corresponding experimental results are discussed in the following section.</p></div>
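For reference, the Kruskal-Wallis H statistic underlying this test can be computed as follows. This is a minimal version that omits the tie correction; in practice a library routine such as scipy.stats.kruskal would be used, and Dunn's post-hoc test is not shown.

```python
def kruskal_h(*groups):
    """Kruskal-Wallis H statistic (tie correction omitted) for comparing
    several groups of scores, e.g. per-file F1 scores of several VAD
    systems. H is compared against a chi-squared distribution with
    k - 1 degrees of freedom, where k is the number of groups."""
    # Rank all observations jointly, remembering which group each came from.
    all_vals = sorted((v, gi) for gi, g in enumerate(groups) for v in g)
    n = len(all_vals)
    rank_sums = [0.0] * len(groups)  # sum of ranks per group
    sizes = [len(g) for g in groups]
    for rank, (_, gi) in enumerate(all_vals, start=1):
        rank_sums[gi] += rank
    return (12.0 / (n * (n + 1))
            * sum(r * r / s for r, s in zip(rank_sums, sizes))
            - 3 * (n + 1))
```

A large H (relative to the chi-squared critical value) indicates that at least one system's score distribution differs from the others, which is what motivates the pairwise post-hoc comparison.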
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Ensemble Systems Evaluation</head><p>This section details the ensemble methods that combine the predictions of the systems tested in Section 4.1, then presents the experimental results and their analysis.</p><p>Of the systems presented in Section 2, Silero, Pyannote, GPVAD, and Context-aware VAD assign a score to each frame, with a threshold used for making predictions. This score can be interpreted as the probability of the frame containing speech. The other systems do not generate such scores, either due to differences in their architecture or because they are closed source. We attempted to ensemble the systems' predictions using both the probability scores and the final binary predictions. The major challenge faced by these ensemble methods is that each system uses a different frame size, which complicates alignment within the ensemble.</p><p>We proposed and tested several ensemble strategies:</p><p>• Probability Voting (PV): This method sums and averages the probability scores from the different predictions.</p><p>• Probability Voting with Frame (PV_f): In this approach, each audio is first segmented into frames. For each frame, we identify all overlapping frames from all predictions, average their probability scores, and use this average as the probability score for the frame. The frame size of PV_f is 200 ms.</p><p>• Simple Voting with Frame (SV_f): Similar to PV_f, this method segments audio into frames. However, instead of averaging probability scores, it performs simple majority voting based on the predictions of overlapping frames. The frame size of SV_f is 200 ms.</p><p>• Probability Voting with Weight (PV_w): This method is akin to PV_f but with a twist: probability scores of overlapping frames from the three predictions are weighted according to their overlap percentage. These weighted scores are then summed to determine the probability score for each frame.</p><p>• Probability Voting with Sampling (PV_s): For a given audio, this method samples timestamps. For each timestamp, it calculates the mean of the probability scores from the three systems, using this mean as the probability score for the timestamp. The sampling rate of PV_s is approximately 33.33 Hz, meaning that one point is sampled every 0.03 seconds.</p><p>• Probability Voting with Bézier curve modelling (PV_b): For each prediction from each system, a Bézier curve is generated using control points sampled from the prediction. This approach uses a smooth curve to model the prediction and to address the alignment issues caused by the systems' different frame sizes. Similar to PV_f, each audio segment is divided into frames, and the probability score for each frame is the average of the scores estimated by the Bézier curves. The sampling rate of the control points used to generate the Bézier curves in PV_b is 5 Hz (0.2 seconds).</p><p>We experimented with all possible system combinations using the SV_f ensemble method, as well as all possible combinations of Silero, Pyannote, GPVAD, and Context-aware VAD using the other, probability-based ensemble methods, as these are the only systems that generate probability scores. For all probability-based methods, the "speech/non-speech" prediction for each frame is determined by applying a threshold of 0.5 to the probability score.</p><p>Table <ref type="table">2</ref> presents the results of all possible combinations composing the ensemble system using the SV_f method. Table <ref type="table" target="#tab_3">3</ref> presents the results of all possible combinations composing the ensemble systems using the probability-score-based methods. The evaluation results are derived using the methods presented in Section 3.2.</p><p>As shown in Table <ref type="table">2</ref>, the ensembles created using the SV_f method did not yield better results than the individual systems at the segment level. The highest segment-level score of 91.5 was achieved by the combination of GPVAD, Silero, and Pyannote, which is still 0.6 lower than the best performance of the Silero system alone. However, at the event level, the same combination achieved the highest score among all ensemble systems, with an F1 score of 84.0, which is higher than the best score achieved by a single system. Meanwhile, all other combinations yielded scores lower than the best performance of the individual systems.</p><p>As shown in Table <ref type="table" target="#tab_3">3</ref>, the probability-score-based ensemble systems did not achieve prominently better results than the single systems at the segment level either, with the PV_s and PV_b systems of the combination Pyannote, GPVAD, and Silero being only slightly higher, by at most 0.6, than Silero. However, at the event level, several clear improvements can be observed in the performance of the ensemble systems. The probability-based ensembles combining Pyannote, GPVAD, and Silero, except for PV_b and PV, outperformed the single systems at the event level, with PV_f achieving an F1 score of 85.9, which is 5.6 points higher than that of Pyannote. This result demonstrates that the ensemble approach can lead to substantial performance gains in detecting the temporal intervals in which speech takes place. It is worth noting that the PV_b ensemble consistently shows a great disparity between its segment-level and event-level performance across all combinations: despite its good segment-level performance, PV_b achieves a rather low F1 score at the event level, far lower than all other systems. This disparity is likely caused by the insufficient number of control points adopted for generating the Bézier curve. However, increasing the number of control points is infeasible due to the computational complexity of evaluating the curve, which is O(n²), with n being the number of control points.</p><p>Given that the ensemble systems composed of GPVAD, Silero, and Pyannote consistently outperformed the other combinations across all ensemble methods, a Kruskal-Wallis test, followed by Dunn's post-hoc test, was conducted to assess the differences in performance between the ensemble methods and the individual systems GPVAD, Silero, and Pyannote. At the segment level, the Kruskal-Wallis test indicates that the differences are not significant. However, at the event level, the results reveal that PV_b's performance is significantly lower than that of the other systems.</p><p>In summary, given the performance of the systems, we plan to adopt PV_f as the speech segmentation component of the DLBs extraction pipeline, leveraging the combined predictions of Pyannote, Silero, and GPVAD. While PV_f shows slightly lower segment-level performance than the top-performing individual system, it improves the accuracy of identifying speech intervals. This trade-off is justified by the substantial improvement in speech event detection performance.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 2</head><p>Results of the VAD experiments using the SV_f ensemble method. For comparison, results from the individual systems that achieved the best performance, Silero and Pyannote, are also included. S stands for segment-level result; E stands for event-level result; C-a stands for the Context-aware VAD system. For segment-level results, each 10 ms is considered one segment. For event-level results, a prediction is considered correct if its overlap with the ground truth exceeds 50%. The evaluation metric used is the F1 score.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Conclusions</head><p>In this study, we explored and enhanced Voice Activity Detection systems for the Italian language, a relatively under-explored area in speech processing. We experimented with various systems and integrated them into an ensemble to improve detection accuracy. Our findings indicate that combining predictions from multiple models can lead to better results in detecting speech temporal intervals. This ensemble method will be used as a component of a Digital Linguistic Biomarkers extraction pipeline. By enhancing the accuracy of speech segmentation, it provides a more reliable foundation for extracting meaningful linguistic features for the diagnosis of cognitive impairment. Future research could focus on refining the ensemble method by incorporating additional linguistic features into VAD systems and exploring their synergistic effects. Additionally, investigating the application of this approach to other languages and dialects could expand its utility.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 3</head><label>3</label><figDesc>Results of the VAD experiments using the probability-score-based ensemble methods. For comparison, results from the individual systems that achieved the best performance, Silero and Pyannote, are also included. Method stands for the ensemble method adopted; S stands for segment-level result; E stands for event-level result; C-a stands for the Context-aware VAD system. For segment-level results, each 10 ms is considered one segment. For event-level results, a prediction is considered correct if its overlap with the ground truth exceeds 50%. The evaluation metric used is the F1 score.</figDesc><table><row><cell>Involved Systems</cell><cell>Method</cell><cell>S</cell><cell>E</cell></row><row><cell>Silero</cell><cell>-</cell><cell>92.5</cell><cell>80.1</cell></row><row><cell>Pyannote</cell><cell>-</cell><cell>92.3</cell><cell>80.3</cell></row><row><cell>Pyannote, GPVAD, Silero</cell><cell>PV</cell><cell>91.5</cell><cell>67.9</cell></row><row><cell>Pyannote, GPVAD, Silero</cell><cell>PV_f</cell><cell>91.9</cell><cell>85.9</cell></row><row><cell>Pyannote, GPVAD, Silero</cell><cell>PV_s</cell><cell>93.1</cell><cell>81.8</cell></row><row><cell>Pyannote, GPVAD, Silero</cell><cell>PV_w</cell><cell>91.8</cell><cell>85.6</cell></row><row><cell>Pyannote, GPVAD, Silero</cell><cell>PV_b</cell><cell>93.0</cell><cell>9.5</cell></row><row><cell>Pyannote, GPVAD, C-a</cell><cell>PV</cell><cell>87.2</cell><cell>60.4</cell></row><row><cell>Pyannote, GPVAD, C-a</cell><cell>PV_f</cell><cell>87.6</cell><cell>80.0</cell></row><row><cell>Pyannote, GPVAD, C-a</cell><cell>PV_s</cell><cell>89.3</cell><cell>79.4</cell></row><row><cell>Pyannote, GPVAD, C-a</cell><cell>PV_w</cell><cell>87.5</cell><cell>79.2</cell></row><row><cell>Pyannote, GPVAD, C-a</cell><cell>PV_b</cell><cell>89.2</cell><cell>10.5</cell></row><row><cell>Silero, GPVAD, C-a</cell><cell>PV</cell><cell>85.4</cell><cell>50.6</cell></row><row><cell>Silero, GPVAD, C-a</cell><cell>PV_f</cell><cell>85.7</cell><cell>72.7</cell></row><row><cell>Silero, GPVAD, C-a</cell><cell>PV_s</cell><cell>84.2</cell><cell>67.3</cell></row><row><cell>Silero, GPVAD, C-a</cell><cell>PV_w</cell><cell>85.6</cell><cell>71.6</cell></row><row><cell>Silero, GPVAD, C-a</cell><cell>PV_b</cell><cell>88.8</cell><cell>11.0</cell></row><row><cell>Silero, Pyannote, C-a</cell><cell>PV</cell><cell>89.4</cell><cell>70.4</cell></row><row><cell>Silero, Pyannote, C-a</cell><cell>PV_f</cell><cell>89.6</cell><cell>81.2</cell></row><row><cell>Silero, Pyannote, C-a</cell><cell>PV_s</cell><cell>89.5</cell><cell>77.7</cell></row><row><cell>Silero, Pyannote, C-a</cell><cell>PV_w</cell><cell>89.6</cell><cell>81.5</cell></row><row><cell>Silero, Pyannote, C-a</cell><cell>PV_b</cell><cell>89.6</cell><cell>9.3</cell></row></table></figure>
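The best-performing fusion strategies reduce, in essence, to per-frame probability averaging and majority voting. The following is a simplified sketch that assumes the systems' scores have already been resampled onto a common frame grid; the PV_f and SV_f methods described above additionally handle the systems' differing native frame sizes via 200 ms frames, which is not shown here, and the function names are ours.

```python
def pv_f(prob_tracks, threshold=0.5):
    """Probability-averaging fusion (in the spirit of PV_f): given per-frame
    speech probabilities from several systems on a common frame grid,
    average them per frame and apply a 0.5 threshold."""
    n_frames = min(len(t) for t in prob_tracks)
    return [sum(t[i] for t in prob_tracks) / len(prob_tracks) >= threshold
            for i in range(n_frames)]


def sv_f(pred_tracks):
    """Majority-voting fusion (in the spirit of SV_f): per-frame majority
    vote over the systems' binary predictions."""
    n_frames = min(len(t) for t in pred_tracks)
    return [sum(t[i] for t in pred_tracks) > len(pred_tracks) / 2
            for i in range(n_frames)]


# Three systems, two frames: frame 0 averages to 0.6 (speech),
# frame 1 averages to 0.4 (non-speech).
fused = pv_f([[0.9, 0.2], [0.8, 0.1], [0.1, 0.9]])
```

Averaging probabilities retains each system's confidence, which is why the probability-based variants can recover speech events that a hard majority vote over thresholded predictions would miss.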
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_0">http://www.clips.unina.it/it/</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgements</head><p>This study was funded by the European Union -NextGen-erationEU programme through the Italian National Re-covery and Resilience Plan -NRRP (Mission 4 -Education and research), as a part of the project ReMind: an ecological, costeffective AI platform for early detection of prodromal stages of cognitive impairment (PRIN 2022, 2022YKJ8FP -CUP J53D23008380006).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>CRediT Author Statement</head><p>SZ: Investigation, Software, Formal analysis, Visualization, Writing -Original Draft. GG: Writing -Review &amp; Editing, Project administration, Funding acquisition. FT: Conceptualization, Methodology, Supervision, Writing -Review &amp; Editing.</p></div>
			</div>


			<div type="funding">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>https://www.unibo.it/sitoweb/shibingfeng.zhang (S. Zhang); https://www.unibo.it/sitoweb/gloria.gagliardi (G. Gagliardi); https://www.unibo.it/sitoweb/fabio.tamburini (F. Tamburini). ORCID: 0009-0005-7320-9088 (S. Zhang); 0000-0001-5257-1540 (G. Gagliardi); 0000-0001-7950-0347 (F. Tamburini)</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Features for voice activity detection: a comparative analysis</title>
		<author>
			<persName><forename type="first">S</forename><surname>Graf</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Herbig</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Buck</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Schmidt</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">EURASIP Journal on Advances in Signal Processing</title>
		<imprint>
			<biblScope unit="volume">2015</biblScope>
			<biblScope unit="page" from="1" to="15" />
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Voice onset time and beyond: Exploring laryngeal contrast in 19 languages</title>
		<author>
			<persName><forename type="first">T</forename><surname>Cho</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">H</forename><surname>Whalen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Docherty</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Phonetics</title>
		<imprint>
			<biblScope unit="volume">72</biblScope>
			<biblScope unit="page" from="52" to="65" />
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">The automatic extraction of linguistic biomarkers as a viable solution for the early diagnosis of mental disorders</title>
		<author>
			<persName><forename type="first">G</forename><surname>Gagliardi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Tamburini</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Thirteenth Language Resources and Evaluation Conference</title>
				<meeting>the Thirteenth Language Resources and Evaluation Conference</meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="5234" to="5242" />
		</imprint>
	</monogr>
	<note>European Language Resources Association</note>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">A study of voice activity detection techniques for NIST speaker recognition evaluations</title>
		<author>
			<persName><forename type="first">M.-W</forename><surname>Mak</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H.-B</forename><surname>Yu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Computer Speech &amp; Language</title>
		<imprint>
			<biblScope unit="volume">28</biblScope>
			<biblScope unit="page" from="295" to="313" />
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">A statistical model-based voice activity detection</title>
		<author>
			<persName><forename type="first">J</forename><surname>Sohn</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">S</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Sung</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Signal Processing Letters</title>
		<imprint>
			<biblScope unit="volume">6</biblScope>
			<biblScope unit="page" from="1" to="3" />
			<date type="published" when="1999">1999</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">A convolutional neural network smartphone app for real-time voice activity detection</title>
		<author>
			<persName><forename type="first">A</forename><surname>Sehgal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Kehtarnavaz</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Access</title>
		<imprint>
			<biblScope unit="volume">6</biblScope>
			<biblScope unit="page" from="9017" to="9026" />
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Linguistic features and automatic classifiers for identifying mild cognitive impairment and dementia</title>
		<author>
			<persName><forename type="first">L</forename><surname>Calzà</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Gagliardi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">R</forename><surname>Favretti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Tamburini</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Computer Speech &amp; Language</title>
		<imprint>
			<biblScope unit="volume">65</biblScope>
			<biblScope unit="page">101113</biblScope>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">rVAD: An unsupervised segment-based robust voice activity detection method</title>
		<author>
			<persName><forename type="first">Z.-H</forename><surname>Tan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Dehak</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Computer Speech &amp; Language</title>
		<imprint>
			<biblScope unit="volume">59</biblScope>
			<biblScope unit="page" from="1" to="21" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<monogr>
		<author>
			<persName><forename type="first">Silero</forename><surname>Team</surname></persName>
		</author>
		<ptr target="https://github.com/snakers4/silero-vad" />
		<title level="m">Silero VAD: pre-trained enterprise-grade voice activity detector (VAD), number detector and language classifier</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Voice activity detection in the wild: A data-driven approach using teacher-student training</title>
		<author>
			<persName><forename type="first">H</forename><surname>Dinkel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Yu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE/ACM Transactions on Audio, Speech, and Language Processing</title>
		<imprint>
			<biblScope unit="volume">29</biblScope>
			<biblScope unit="page" from="1542" to="1555" />
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Self-attentive VAD: Context-aware detection of voice from noise</title>
		<author>
			<persName><forename type="first">Y</forename><forename type="middle">R</forename><surname>Jo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><forename type="middle">K</forename><surname>Moon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">I</forename><surname>Cho</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">S</forename><surname>Jo</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="6808" to="6812" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Attention is all you need</title>
		<author>
			<persName><forename type="first">A</forename><surname>Vaswani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Shazeer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Parmar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Uszkoreit</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Jones</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">N</forename><surname>Gomez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ł</forename><surname>Kaiser</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Polosukhin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in neural information processing systems</title>
		<imprint>
			<biblScope unit="volume">30</biblScope>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">pyannote.audio: neural building blocks for speaker diarization</title>
		<author>
			<persName><forename type="first">H</forename><surname>Bredin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Yin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">M</forename><surname>Coria</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Gelly</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Korshunov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Lavechin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Fustes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Titeux</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Bouaziz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M.-P</forename><surname>Gill</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="7124" to="7128" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<monogr>
		<title level="m" type="main">Corpora e lessici dell&apos;italiano parlato e scritto</title>
		<author>
			<persName><forename type="first">F</forename><forename type="middle">A</forename><surname>Leoni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Cutugno</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Savy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Caniparoli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>D'Anna</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Paone</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Giordano</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Manfrellotti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Petrillo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">De</forename><surname>Rosa</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2007">2007</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">TIMIT Acoustic-Phonetic Continuous Speech Corpus</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">S</forename><surname>Garofolo</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Linguistic Data Consortium</title>
				<imprint>
			<date type="published" when="1993">1993</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<monogr>
		<author>
			<persName><forename type="first">N</forename><surname>Ryant</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Singh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Krishnamohan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Varma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Church</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Cieri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Du</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ganapathy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Liberman</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2012.01477</idno>
		<title level="m">The Third DIHARD Diarization Challenge</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Audio set: An ontology and human-labeled dataset for audio events</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">F</forename><surname>Gemmeke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">P</forename><surname>Ellis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Freedman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Jansen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Lawrence</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">C</forename><surname>Moore</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Plakal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Ritter</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</title>
				<meeting>the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</meeting>
		<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="776" to="780" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
