<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">An investigation on voice mimicry attacks to a speaker recognition system</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Donato</forename><surname>Impedovo</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Department of Computer Science</orgName>
								<orgName type="institution">University of Bari Aldo Moro</orgName>
								<address>
									<addrLine>Via E. Orabona n.4</addrLine>
									<postCode>70125</postCode>
									<settlement>Bari</settlement>
									<country>IT</country>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="department">Digital Innovation srl</orgName>
								<address>
									<addrLine>Via E. Orabona n</addrLine>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Annalisa</forename><surname>Longo</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Department of Computer Science</orgName>
								<orgName type="institution">University of Bari Aldo Moro</orgName>
								<address>
									<addrLine>Via E. Orabona n.4</addrLine>
									<postCode>70125</postCode>
									<settlement>Bari</settlement>
									<country>IT</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Tonino</forename><surname>Palmisano</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Department of Computer Science</orgName>
								<orgName type="institution">University of Bari Aldo Moro</orgName>
								<address>
									<addrLine>Via E. Orabona n.4</addrLine>
									<postCode>70125</postCode>
									<settlement>Bari</settlement>
									<country>IT</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Lucia</forename><surname>Sarcinella</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Department of Computer Science</orgName>
								<orgName type="institution">University of Bari Aldo Moro</orgName>
								<address>
									<addrLine>Via E. Orabona n.4</addrLine>
									<postCode>70125</postCode>
									<settlement>Bari</settlement>
									<country>IT</country>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="department">Digital Innovation srl</orgName>
								<address>
									<addrLine>Via E. Orabona n</addrLine>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Davide</forename><surname>Veneto</surname></persName>
							<affiliation key="aff1">
								<orgName type="department">Digital Innovation srl</orgName>
								<address>
									<addrLine>Via E. Orabona n</addrLine>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">An investigation on voice mimicry attacks to a speaker recognition system</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">C2A38203F94CE8D6A8FDC82B1E5E3AFF</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-19T15:50+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Mimicry Attacks</term>
					<term>Voice Recognition</term>
					<term>Speaker Recognition</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Voice mimicry is the act in which imitators reproduce the vocal characteristics of another person. It can be considered an attack on a speaker recognition system. This work evaluates a speaker identification system under mimicry attacks: the goal is to point out how the accuracy of the system changes depending on the various real scenarios that could occur. For this purpose, a GMM-UBM model and an I-Vector model have been implemented and tested over a dataset of Italian-language imitations. Tests have been performed with different audio lengths and different use cases. The use cases also take some possible countermeasures into consideration.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Voice has always been one of the most widely used biometrics, being a distinctive and measurable feature for the recognition of users in biometric security <ref type="bibr" target="#b0">[1]</ref>. The study of speaker recognition focuses more on who is speaking than on what is said, and it can be divided into two macro-categories: speaker identification and speaker verification. In the first case, the goal is to establish the identity of the speaker; the system performs a 1:N match between the sample under analysis and all the known models (i.e., users) and then determines which of these is the most similar by issuing a score. In the second case, the task is to verify whether the speaker is whom he/she claims to be, so the system performs a 1:1 match between the sample and the claimed model and, depending on whether the score exceeds a certain threshold, issues a boolean value <ref type="bibr" target="#b1">[2]</ref>. A further distinction is between text-dependent and text-independent systems. The first adopts the same text/sentence during testing and training <ref type="bibr" target="#b2">[3]</ref>; the second refers to a process in which there is no constraint on the text to be pronounced <ref type="bibr" target="#b3">[4]</ref>.</p><p>An important topic in recent years is the security of biometric systems, since these systems can be prone to various attacks <ref type="bibr" target="#b4">[5]</ref>. In the case of speaker verification/identification systems, replay attacks, speech synthesis, voice conversion and mimicry can be considered <ref type="bibr" target="#b5">[6]</ref>. Mimicry is probably the simplest and most common approach: it consists in imitating the voice of another person to attack the system. The attacker tries to imitate the timbre and prosody of the voice without the use of special technologies <ref type="bibr" target="#b6">[7]</ref>. This problem has several implications and can occur in many different situations. In fact, it is also connected to the phenomena of scams and cyberbullying. In most cases a malicious user/bully can imitate the victim's voice with the aim of obtaining information from unsuspecting people or of mocking the imitated person by means of vocal recordings/calls. A speaker recognition system could potentially contribute to identifying these actions in multiple scenarios. This work is focused on the task of speaker identification with a text-independent approach. More specifically, this study focuses on voice mimicry attacks to analyze the vulnerability of speaker identification systems, depending on the various scenarios that may arise, whether the system is under attack or not.</p><p>The main contributions of this study are:</p><p>• Identifying and testing a set of real attack scenarios and possible countermeasures.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>•</head><p>Compare two state-of-the-art speaker identification systems: a Gaussian Mixture Model -Universal Background Model (GMM-UBM) recognizer and an I-vector recognizer.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>•</head><p>The creation of an Italian-language imitation dataset for the aim of the study.</p><p>The paper is organized as follows. Section 2 describes previous state-of-the-art research in the field of speaker identification and mimicry attacks. Section 3 describes the methods and the approach used in this work. Section 4 presents the dataset and the experiments that have been performed. Finally, Sections 5 and 6 discuss the results obtained during the testing phase and draw the conclusions.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related Work</head><p>The amount of related work is not extensive, which highlights that this is a relevant emerging topic. Yee Wah Lau et al. <ref type="bibr" target="#b7">[8]</ref> experimented with a simple mimicry attack on a speaker recognition system to assess its vulnerability: they conclude that if the voice of an impostor is remarkably similar to the voice of the target speaker, the authentication fails. This research has also shown that repeated attempts by an impostor to mimic a target speaker's voice can allow him to obtain a voice much closer to the target and contribute to degraded performance.</p><p>More recent studies have focused on two widely used speaker recognition systems, one based on a Gaussian Mixture Model - Universal Background Model (GMM-UBM) <ref type="bibr" target="#b8">[9]</ref> and the other on an I-Vector classifier <ref type="bibr" target="#b9">[10]</ref>. Hautamäki et al. involved professional imitators in their experiment and, most importantly for this work, the whole study was carried out on a Finnish-language imitation dataset created by the authors. They compare the performance of two state-of-the-art systems, a GMM-UBM recognizer and an I-Vector recognizer, testing both systems first on genuine voice, as a baseline, and then on mimicked voice. The results showed that the professional impersonator did not degrade the performance of the systems tested; there was only a slight increase of the false acceptance rate for the I-Vector system compared to the GMM-UBM <ref type="bibr" target="#b10">[11]</ref>.</p><p>Other interesting research <ref type="bibr" target="#b11">[12]</ref> compared the performance of three different systems, a GMM-UBM, an I-Vector with cosine similarity and an I-Vector with probabilistic linear discriminant analysis (PLDA), under mimicry attack. Similarly to the previous work, a Finnish-language dataset of imitators and imitations acquired from non-expert human listeners was used. The study showed that the GMM-UBM only slightly increased its EER under mimicry attacks, while the two I-Vector-based systems doubled their EER.</p><p>Vestman et al. proposed another type of work based on impersonation, slightly different from the previous ones. They used two ASV systems: one publicly available, based on I-Vector and PLDA, and one closed-source, based on x-Vector. The aim of this research was to perform a similarity search, in a speech corpus, between recruited attackers and potential target speakers with the first ASV system, and then to test the impersonators, together with the voices most similar to the target speakers found by the first system, on the second ASV system. The research highlights that the impersonators do not affect the performance of the ASV system, but that an ASV system used to attack another ASV system can be potentially dangerous <ref type="bibr" target="#b12">[13]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Methods</head><p>In this work two systems have been tested. The first is based on GMM-UBM models and the second on the I-Vector model.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Gaussian Mixture Model - Universal Background Model</head><p>Systems based on the Gaussian Mixture Model - Universal Background Model are widely used in the field of speaker recognition due to their easy implementation, their low computational cost compared to other techniques, and the excellent results that can be achieved <ref type="bibr" target="#b8">[9]</ref>.</p><p>A Gaussian Mixture Model (GMM) is a "parametric probability density function represented as a weighted sum of Gaussian component densities" <ref type="bibr" target="#b13">[14]</ref>. At the basis of the GMM-UBM system there is the Universal Background Model (UBM), which is a GMM estimated from a large speech dataset. The purpose of the UBM is to model the general feature-space distribution of speech. Then, with a Maximum a Posteriori (MAP) adaptation from the UBM, it is possible to obtain the target-speaker GMM models.</p><p>After the generation of the target-speaker GMM models, the system can output "target speaker", or "ubm" in the case of no target speaker, because the UBM model can act as an impostor hypothesis model.</p><p>Finally, the verification score of the system is the log-likelihood ratio between the likelihood of the test utterance under the speaker model and its likelihood under the UBM <ref type="bibr" target="#b8">[9]</ref>.</p></div>
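The log-likelihood-ratio score described above can be sketched in a few lines of numpy. This is a minimal illustration under stated assumptions (diagonal-covariance GMM; all function names are hypothetical), not the authors' implementation:

```python
import numpy as np

def gmm_log_likelihood(X, weights, means, variances):
    """Average per-frame log-likelihood of feature frames X (N x D)
    under a diagonal-covariance GMM with K components."""
    D = X.shape[1]
    comp = []
    for w, mu, var in zip(weights, means, variances):
        diff = X - mu
        # log of w_k * N(x; mu_k, diag(var_k)) for every frame
        ll = -0.5 * (np.sum(diff ** 2 / var, axis=1)
                     + np.sum(np.log(var)) + D * np.log(2.0 * np.pi))
        comp.append(np.log(w) + ll)
    comp = np.stack(comp, axis=1)            # shape (N, K)
    m = comp.max(axis=1, keepdims=True)      # stable log-sum-exp over K
    lse = m.squeeze(1) + np.log(np.sum(np.exp(comp - m), axis=1))
    return float(lse.mean())

def llr_score(X, speaker_gmm, ubm):
    """Verification score: log-likelihood ratio of the test utterance
    under the speaker model versus the UBM."""
    return gmm_log_likelihood(X, *speaker_gmm) - gmm_log_likelihood(X, *ubm)
```

A positive score indicates the test frames fit the speaker model better than the impostor (UBM) hypothesis.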
<div xmlns="http://www.tei-c.org/ns/1.0"><head>I-Vector</head><p>Another widely used approach is based on the identity vector (I-Vector). At the base of this approach there is the observation that a MAP adaptation performed only on the mean vectors results in a super-vector of concatenated means.</p><p>Let m be the super-vector of the UBM means, T a low-rank matrix that defines the total variability space, and ϕ a standard normally distributed vector. The super-vector M of the segment GMM can be calculated by adapting the means of the UBM, so M can be written as:</p><formula xml:id="formula_0">𝑀 = 𝑚 + 𝑇𝜙 (1)</formula><p>The vector ϕ is used as the extracted i-vector, while the T matrix is estimated with an expectation-maximization (EM) algorithm on a development dataset. A post-processing algorithm such as radial Gaussianization is usually applied to the i-vector, so that it better follows the Gaussian assumptions of the UBM model. Finally, the similarity between two utterances represented by their corresponding i-vectors can be measured with the cosine similarity <ref type="bibr" target="#b11">[12]</ref>.</p><p>These two systems have been selected because of their wide use in many off-the-shelf applications <ref type="bibr" target="#b1">[2]</ref>. In both cases, however, the same preliminary operations are carried out, as described in the following.</p></div>
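Eq. (1) and the cosine scoring step can be illustrated with a short numpy sketch (function names are hypothetical; this is not the authors' code):

```python
import numpy as np

def segment_supervector(m, T, phi):
    """Eq. (1): adapt the UBM mean super-vector m along the
    total variability space T by the latent vector phi."""
    return m + T @ phi

def cosine_score(w_enroll, w_test):
    """Cosine similarity between the enrollment i-vector (the model)
    and the test i-vector."""
    return float(np.dot(w_enroll, w_test) /
                 (np.linalg.norm(w_enroll) * np.linalg.norm(w_test)))
```

Two utterances of the same speaker should yield i-vectors with a cosine score close to 1, while unrelated speakers score much lower.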
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Pre-Processing and Feature Extraction</head><p>All the audio files have been converted to .wav format, re-sampled to 8 kHz and converted from stereo to mono <ref type="bibr" target="#b10">[11]</ref>. Feature extraction has been performed in 5 sub-steps:</p><p>1. First, a pre-emphasis filter has been applied to the signal to enhance the high frequencies of the spectrum, which are attenuated by the speech production process.</p><p>2. The signal has been divided into successive 25-millisecond frames with 10-millisecond overlaps by frame blocking and Hamming windowing. This process is intended to minimize frequency discontinuity. Since the speech signal varies slowly over time, i.e. it is a quasi-stationary signal, speech must be examined over a sufficiently short period <ref type="bibr" target="#b14">[15]</ref>.</p><p>3. A Voice Activity Detection (VAD) filter has been applied: this allowed the removal of all superfluous parts of the signal (in most cases silence and/or background noise), selecting only the discriminating components and allowing the correct speaker to be identified.</p><p>4. A RASTA (Relative Spectral) filter has been applied to eliminate frequencies that differ from the normal variation of the voice signal, such as frequencies affected by background noise recorded together with the voice.</p><p>5. Finally, the MFCC features have been extracted. In detail, for each audio file, a feature vector composed of 20 MFCC coefficients has been extracted with a filter bank composed of 26 triangular filters. The choice of 20 coefficients was made after some considerations: with more coefficients the performance worsens, because they would represent rapid changes in signal energy that are not representative of the individual's vocal characteristics; with fewer, there would not be enough information to represent the voice adequately. In addition, 20 MFCC delta-features were computed from the 20 MFCC coefficients, for a total vector size of 40 values. During the feature extraction phase, it was also chosen to replace the first value of the vector, the cepstral coefficient at position zero, with the log of the overall energy component <ref type="bibr" target="#b15">[16]</ref>.</p></div>
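Steps 1 and 2 above can be sketched as follows. This is an illustrative numpy fragment, assuming a common pre-emphasis coefficient of 0.97 and a 10 ms frame hop (the paper states "10-millisecond overlaps" without specifying the exact hop); function names are hypothetical:

```python
import numpy as np

def pre_emphasis(signal, alpha=0.97):
    """Step 1: boost high frequencies: y[n] = x[n] - alpha * x[n-1]."""
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])

def frame_signal(signal, sample_rate=8000, frame_ms=25, hop_ms=10):
    """Step 2: split the signal into 25 ms frames taken every 10 ms
    and apply a Hamming window to each frame."""
    frame_len = int(sample_rate * frame_ms / 1000)   # 200 samples at 8 kHz
    hop = int(sample_rate * hop_ms / 1000)           # 80 samples at 8 kHz
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    window = np.hamming(frame_len)
    return np.stack([signal[i * hop : i * hop + frame_len] * window
                     for i in range(n_frames)])
```

Each windowed frame would then go through the VAD, RASTA and MFCC stages of steps 3 to 5.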
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Training</head><p>Two different training approaches have been used, depending on the specific system.</p><p>For the GMM-UBM, model adaptation was chosen: a previously trained UBM of the Italian language is adapted to the speaker's feature vectors. The adaptation has been performed with the Maximum a Posteriori (MAP) algorithm which, as the name suggests, maximizes the a posteriori probability that, given a recording, the correct speaker is selected. It consists of a first phase in which the feature vectors are probabilistically assigned to the UBM mixtures (512 in this case). The mixtures are then adapted using the new data, according to the "relevance factor", which quantifies the amount of new data in a single mixture and balances its contribution to the adaptation. Finally, the testing phase matches the feature vector extracted by the system against the model of a speaker by computing the LLR (Log-Likelihood Ratio).</p><p>Concerning the I-Vector system, the UBM model has been used to estimate the Total Variability (TV) matrix, from which the i-vectors are extracted; the TV matrix has been computed with rank equal to 400 and the number of iterations fixed at 20. The next step collects the statistics for each type of audio, namely test or training. Once the statistics are obtained, the actual i-vector is extracted for each speaker and for each test segment, thus obtaining a unique and discriminating model. The resulting dimensionality-reduced supervectors are the i-vectors, representing the result of the mapping carried out in the first phases. Finally, the cosine similarity, i.e. the cosine of the angle between the two compared vectors, is computed between the vector representing the speaker enrollment (the model) and the vector representing the speaker test.</p></div>
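The relevance-factor MAP adaptation of the mixture means described above can be sketched as follows; a minimal numpy illustration (mean-only adaptation, hypothetical function name), not the authors' implementation:

```python
import numpy as np

def map_adapt_means(ubm_means, posteriors, X, relevance=16.0):
    """MAP adaptation of UBM component means (relevance-factor form).

    ubm_means:  (K, D) means of the UBM mixtures.
    posteriors: (N, K) responsibility of each mixture for each frame.
    X:          (N, D) feature frames of the target speaker.
    """
    n_k = posteriors.sum(axis=0)                          # soft counts per mixture
    # first-order statistics: expected feature value per mixture
    ex_k = (posteriors.T @ X) / np.maximum(n_k, 1e-10)[:, None]
    # alpha -> 1 when a mixture sees much data, -> 0 when it sees little
    alpha = (n_k / (n_k + relevance))[:, None]
    return alpha * ex_k + (1.0 - alpha) * ubm_means
```

Mixtures with many assigned frames move toward the speaker's data mean, while rarely used mixtures stay close to the UBM prior.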
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Data and Experiments</head><p>The speech material used in this work consists of 22 Italian celebrities from the world of politics and entertainment. The audios were extracted from interviews, shows, or performances available on online platforms. These 22 identities represent the set of genuine users who will be attacked by imitators. In addition, 17 imitators were chosen, and both their original voices and their imitations of the 22 genuine users were collected. In total, there are 24 impersonation audios representing attacks on the system. As can be guessed, the "imitation" relationship between famous individuals and imitators is n-to-n, since multiple imitators can imitate each famous individual, and one imitator can imitate multiple famous individuals. For each speaker, both genuine and impostor, one audio of 5 minutes has been extracted, while the duration of the imitation audios varies from 40 seconds to 5 minutes.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 1</head><p>Italian Imitation Dataset: Genuine, 22 users; Impostors (with their original voice), 17 users; Impersonation attacks, 24 audios.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>In addition to the dataset created, another dataset has been used for the UBM model training, as already described in the previous section. It is an extensive Italian repository, the Common Voice Corpus 6.1, consisting of 5729 voices of female, male and undeclared identity (54% male, 14% female and 23% undeclared) at different ages (from 18 to 79 years), 158 hours of speech and about 80,000 files <ref type="bibr" target="#b16">[17]</ref>.</p><p>The UBM model has been trained on 25% of the entire dataset, because its training phase takes a large amount of time (more than 10 hours).</p><p>For the testing phase, audio files were chunked into 1-second and 5-second files to analyze if, and how, the behavior of the system varies with the length of the audio.</p><p>Tests have been evaluated in terms of precision, recall, and F1 score. Different use cases have been considered, as reported in Table <ref type="table" target="#tab_0">2</ref>. Cases 1, 3, 6, and 7 represent situations in which the system is tested only on speakers it is aware of. Case 2 represents the situation in which an imitator (unknown to the system) performs a mimicry attack. Case 4 refers to the situation in which the imitator is known to the system as a genuine user, but he/she performs a mimicry attack on another genuine user. Case 5 refers to situations in which the imitator acts both as a genuine user and as an impostor.</p><p>The last two cases refer to the possibility of adding imitation models. </p></div>
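The evaluation metrics mentioned above can be computed per identity in a 1:N identification setting as follows; a small self-contained sketch (function name hypothetical):

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Per-class precision, recall and F1 for an identification task.

    y_true / y_pred are parallel lists of speaker labels;
    `positive` is the identity being scored."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

Averaging these per-identity values over all speakers gives the aggregate figures reported in the result tables.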
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Use case 6: pre-trained UBM; models known to the system: Genuine + Imitations; test: Genuine. The system is explicitly aware of imitations. Use case 7: pre-trained UBM; models known to the system: Genuine + Imitations; test: Genuine + Imitations. The system is explicitly aware of imitations, under mimicry attack.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Results and Discussion</head><p>Tables 3 and 4 report the results obtained for the GMM-UBM system, referred respectively to audio durations of 1 s and 5 s.</p><p>The system explicitly aware of imitations: 99.53%, 99.60%, 99.09%. The system explicitly aware of imitations under mimicry attack: 88.29%, 95.65%, 87.13%.</p><p>The baseline system under attack shows a performance degradation of 15% in both tests (1 s and 5 s) if the impostor is an outsider (use case 2). A performance degradation of 8% is observed if the impostor is an insider (use case 4 vs. use case 3). The situation in which the system is trained on some possible mimicry attacks (use case 5) does not differ significantly in performance from the previous case.</p><p>Considering the tests performed at 1 second and at 5 seconds, it can be seen that the length of the audio segment affects the performance of the system: in general, a longer duration strongly decreases the number of successful attacks. However, the general performance trend along the different use cases is independent of the audio duration. The performance worsens considerably when the system is attacked (cases 2, 5, 7) and does not know, or only partially knows, the imitations (cases 6 and 7) or the imitators (cases 3, 4, 5). Tables 5 and <ref type="table" target="#tab_4">6</ref> report the results obtained for the I-Vector model.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 5</head><p>Results of the I-Vector approach at 1 second. Considering the results achieved with the I-Vector system, it can be seen that the baseline system under attack shows a performance degradation of 30% in the 1 s test and 25% in the 5 s test if the impostor is an outsider (use case 2), and of 14% in both tests (1 s and 5 s) if the impostor is an insider (use case 4 vs. use case 3). The situation in which the system is trained on some possible mimicry attacks (use case 5) does not differ significantly in performance from the previous case.</p><p>Finally, comparing the I-Vector results, the same trend obtained with the GMM-UBM model across the various use cases is also reflected in this model. However, the I-Vector system suffers a slightly higher performance degradation under attack.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Conclusion</head><p>In this work, a study of the vulnerability of speaker identification systems against voice mimicry attacks is presented. The study introduces a new Italian mimicry dataset, on which two very widespread models are implemented and tested: the GMM-UBM and the I-Vector.</p><p>The results obtained with the GMM-UBM system show that the baseline system under attack has a considerable degradation of performance, especially if the system does not know, or only partially knows, the imitators or the imitations. The same trend has been observed for the I-Vector system, although with a slightly higher degradation with respect to the GMM-UBM.</p><p>Performance degradation also depends on whether the impostor is another genuine user known to the system or not.</p><p>Concerning the length of the audio, longer audio files yield higher performance than shorter ones; however, the general performance degradation trend along the different use cases is independent of the audio duration.</p><p>In future studies, it will be possible to extend this work with other state-of-the-art techniques such as artificial neural networks (ANNs), to extend the Italian mimicry dataset with additional users, and to compare the results between GMM-UBM, I-Vector and ANN.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 2</head><label>2</label><figDesc>Use cases.</figDesc><table><row><cell>Use case</cell><cell>UBM</cell><cell>Models known to the system</cell><cell>Test</cell><cell>Meaning</cell></row><row><cell>1</cell><cell>Pre-trained</cell><cell>Genuine</cell><cell>Genuine</cell><cell>The baseline system.</cell></row><row><cell>2</cell><cell>Pre-trained</cell><cell>Genuine</cell><cell>Genuine + Imitations</cell><cell>The baseline system under mimicry attack.</cell></row><row><cell>3</cell><cell>Pre-trained</cell><cell>Genuine + Imitators</cell><cell>Genuine + Imitators</cell><cell>The baseline system including imitators, which do not act as imitators at testing.</cell></row><row><cell>4</cell><cell>Pre-trained</cell><cell>Genuine + Imitators</cell><cell>Genuine + Imitations</cell><cell>The baseline system including imitators, under mimicry attack.</cell></row><row><cell>5</cell><cell>Pre-trained</cell><cell>Genuine + Imitators</cell><cell>Genuine + Imitations + Imitators</cell><cell>The baseline system including imitators, which act as imitators (mimicry attack) and genuine users at testing.</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 3</head><label>3</label><figDesc>Results on GMM-UBM approach at 1 second.</figDesc><table><row><cell>Test 1s</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 4</head><label>4</label><figDesc>Results on GMM-UBM approach at 5 second.</figDesc><table><row><cell>Test 5s</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_4"><head>Table 6</head><label>6</label><figDesc>Results on I-Vector approach at 5 second.</figDesc><table><row><cell>Test 5s</cell></row></table></figure>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.">Acknowledgements</head><p>This work has been supported by the Italian Ministry of Education, University and Research within the PRIN2017 -BullyBuster project -A framework for bullying and cyberbullying action detection by computer vision and artificial intelligence methods and algorithms.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">A Survey of Speaker Recognition: Fundamental Theories, Recognition Methods and Opportunities</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">M</forename><surname>Kabir</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">F</forename><surname>Mridha</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Shin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Jahan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">Q</forename><surname>Ohi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Access</title>
		<imprint>
			<biblScope unit="volume">9</biblScope>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">A review on speaker recognition: Technology and challenges</title>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">M</forename><surname>Hanifa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Isa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Mohamad</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Computers &amp; Electrical Engineering</title>
		<imprint>
			<biblScope unit="volume">90</biblScope>
			<biblScope unit="page">107005</biblScope>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Text-dependent and text-independent speaker recognition of reverberant speech based on CNN</title>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">A</forename><surname>El-Moneim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Sedik</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">A</forename><surname>Nassar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">S</forename><surname>El-Fishawy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">M</forename><surname>Sharshar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">E</forename><surname>Hassan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><forename type="middle">E</forename><surname>El-Samie</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">International Journal of Speech Technology</title>
		<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Text-Independent Speaker Verification Using 3D Convolutional Neural Networks</title>
		<author>
			<persName><forename type="first">A</forename><surname>Torfi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Dawson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">M</forename><surname>Nasrabadi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IEEE International Conference on Multimedia and Expo</title>
				<imprint>
			<publisher>ICME</publisher>
			<date type="published" when="2018">2018. 2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Automatic Signature Verification in the Mobile Cloud Scenario: Survey and Way Ahead</title>
		<author>
			<persName><forename type="first">D</forename><surname>Impedovo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Pirlo</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Emerging Topics in Computing</title>
		<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Spoofing and countermeasures for speaker verification: A survey</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Evans</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Kinnunen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Yamagishi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Alegre</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Speech Communication</title>
		<imprint>
			<biblScope unit="volume">66</biblScope>
			<biblScope unit="page" from="130" to="153" />
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Speaker recognition from the mimicked speech: A review</title>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">S</forename><surname>Desai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Pujara</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Wireless Communications, Signal Processing and Networking (WiSPNET)</title>
				<imprint>
			<date type="published" when="2016">2016. 2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Vulnerability of speaker verification to voice mimicking</title>
		<author>
			<persName><forename type="first">Y</forename><forename type="middle">W</forename><surname>Lau</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Wagner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Tran</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of 2004 International Symposium on Intelligent Multimedia, Video and Speech Processing</title>
				<meeting>2004 International Symposium on Intelligent Multimedia, Video and Speech Processing</meeting>
		<imprint>
			<date type="published" when="2004">2004</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Speaker Recognition System using Gaussian Mixture Model</title>
		<author>
			<persName><forename type="first">K</forename><surname>Saakshar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Pranathi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Gomathi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Sivasangari</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Communication and Signal Processing</title>
				<imprint>
			<publisher>ICCSP</publisher>
			<date type="published" when="2020">2020. 2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">I-vector Extraction for Speaker Recognition Based on Dimensionality Reduction</title>
		<author>
			<persName><forename type="first">N</forename><surname>Ibrahim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Ramli</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">22nd International Conference on Knowledge-Based and Intelligent Information &amp; Engineering Systems</title>
				<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">I-vectors meet imitators: on vulnerability of speaker verification systems against voice mimicry</title>
		<author>
			<persName><forename type="first">R</forename><surname>Hautamäki</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Kinnunen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Hautamäki</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A.-M</forename><surname>Laukkanen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Interspeech</title>
		<imprint>
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Automatic versus human speaker verification: The case of voice mimicry</title>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">G</forename><surname>Hautamaki</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Kinnunen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Hautamaki</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A.-M</forename><surname>Laukkanen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Speech Communication</title>
		<imprint>
			<biblScope unit="volume">72</biblScope>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Voice mimicry attacks assisted by automatic speaker verification</title>
		<author>
			<persName><forename type="first">V</forename><surname>Vestman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Kinnunen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">G</forename><surname>Hautamäki</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Sahidullah</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Computer Speech &amp; Language</title>
		<imprint>
			<biblScope unit="volume">59</biblScope>
			<biblScope unit="page" from="36" to="54" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Gaussian Mixture Models</title>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">Z</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Jain</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Encyclopedia of Biometrics</title>
		<imprint>
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<monogr>
		<title level="m" type="main">Springer Handbook of Speech Processing</title>
		<author>
			<persName><forename type="first">J</forename><surname>Benesty</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Sondhi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Greenberg</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2008">2008</date>
			<publisher>Springer</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">An MFCC-Based Speaker Identification System</title>
		<author>
			<persName><forename type="first">F.-Y</forename><surname>Leu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G.-L</forename><surname>Lin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IEEE 31st International Conference on Advanced Information Networking and Applications</title>
				<imprint>
			<publisher>AINA</publisher>
			<date type="published" when="2017">2017. 2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<monogr>
		<author>
			<persName><surname>Mozilla</surname></persName>
		</author>
		<ptr target="https://commonvoice.mozilla.org/en/datasets" />
		<title level="m">Datasets. Retrieved from Mozilla Common Voice</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
