Deepfake Audio Detection via Feature Engineering and Machine Learning

Farkhund Iqbal1, Ahmed Abbasi2, Abdul Rehman Javed2,*, Zunera Jalil2 and Jamal Al-Karaki1
1 College of Technological Innovation, Zayed University, UAE
2 Department of Cyber Security, Air University, Islamabad, Pakistan
* Corresponding author.

Abstract
With the advancement of technologies for synthetic speech generation, audio deepfakes are becoming the most common source of deception, and distinguishing between fake and real audio is becoming increasingly difficult. Several studies based on machine learning approaches have used the ASVspoof or AVspoof datasets to deal with these challenges. This study experiments on the recent Fake or Real (FoR) dataset, whose audio samples are generated using the best text-to-speech (TTS) models. The proposed approach is based on optimal feature engineering and the selection of the best machine learning models to detect fake or real audio. Feature engineering employs various methods for extracting features from audio, while the feature selection step retains the minimum set of best-performing features, which is then fed to machine learning classifiers. The experiments use six machine learning (ML) classifiers and three subsets of the FoR dataset. The experimental results show that the proposed approach can accurately detect real or fake audio and outperforms the baseline method by an accuracy gain of 26%.

Keywords
Deepfake, Audio classification, Machine learning, Feature extraction, Feature selection

1. Introduction
Deepfakes are fake data generated using deep learning; though they can be entertaining, there have been many examples of their misuse to spread disinformation. In the past couple of years, deepfakes have been intentionally used to spread fake news. With recent breakthroughs in voice conversion and text-to-speech algorithms [1, 2], synthesizing human speech has become much easier, opening the way for a future in which audio plays a role in deepfake detection equivalent to that of video [3, 4]. This research investigates the interaction between these two modalities, which might be essential in identifying audio-visual deepfakes. Recent work has concentrated on recognizing visual artifacts and 'fingerprints' left by various generating frameworks and on detecting local texture inconsistencies produced by face-swapping. Another area of research employs biometric signals, such as recognizing distinct facial motion patterns inherent in certain persons; nevertheless, such ID-specific approaches are restricted in their capacity to generalize to new identities [5, 6]. In the fields of voice forensics and artificial intelligence, various automatic detection algorithms for deepfakes have been created [7]. However, there is a shortage of experiments on human comprehension of manipulated audio, mainly controlled speech, compared to machine comprehension. Pictures and video data remain the center of human deepfake detection research and are the most studied.
Many researchers have tried to compare the detection abilities of humans and machines [1, 8]. Their study was based on a unique deepfake video and picture database classified into three quality levels: original, high-quality deepfake, and low-quality deepfake. They discovered that the AI detection method beats the human participants, particularly for low-quality picture deepfakes, which means that humans have trouble differentiating between actual and fraudulent pictures at lower resolutions [9]. In [10], the most comprehensive video deepfake dataset to date was created and examined. It consisted of three distinct online studies that drew in 15,016 people. Participants were either oblivious to AI detection predictions or were made aware of them. The authors discovered that although naive human participants and AI detection systems had equal detection accuracy, they were deceived by distinct characteristics. Humans made aware of the AI algorithm's prediction judgments performed better on the challenge; however, human accuracy also suffered when the AI model forecasts were incorrect [11, 12].

The capacity to recognize synthetically created sounds is known as audio deepfake detection. The authors of [13] utilized the ASVspoof dataset to determine whether human-based subjective judgments of spoofed speech can be predicted automatically. They conducted an inter-language investigation in which 68 native English and 206 native Japanese listeners classified samples as spoofed or benign. They observed that the general trends in both language groups were similar, with slight deviations. The study's authors made no distinction between people with and without IT technical experience. There is also a substantial body of related work [14, 15] that uses machine learning to detect abnormalities in audio waveforms that may be diagnostic or distinctive of deepfakes. Examples of such artifacts are noisy glitches, phase mismatches, reverberation, and unintelligibility [16]. There are additional abnormalities that humans do not generally perceive, such as various sorts of prolonged silence, even though such artifacts are critical for AI identification of deepfakes [17]. In this research, we investigate whether ML can recognize audio deepfakes (i.e., fake audio data generated using deep learning). We examine how successfully machine learning or deep learning can differentiate audio deepfakes from actual material and whether specific characteristics aid detection performance.

1.1. Motivation
The main motivations for this research are presented in this section. Audio deepfakes have become the most prevalent source of deceit as synthetic voice generation technology advances; hence, distinguishing between fake and real audio is becoming more challenging. This is why a system that can detect authentic or fraudulent audio in a short amount of time is so important. Previous research exists on this problem; however, the present methodologies are computationally intensive. Consequently, the proposed method comprises the best feature extraction methods, resulting in higher classification results in less time.

1.2. Contribution
This study proposes a method for detecting fake audio and distinguishing deepfake audio from non-synthetic or actual audio. The following are the specific contributions:
• Provides a unique method for distinguishing between real and counterfeit audio in FoR dataset subsets using feature engineering and machine learning classifiers.
• An optimal feature engineering strategy is employed, combining the best feature extraction and feature selection approaches to extract the most appropriate features from an audio source.
• Experimental results show that the proposed approach accurately detects real or fake audio and outperforms the baseline method by an accuracy gain of 26%.

1.3. Organization
A detailed description of previous studies on deepfake audio detection is presented in Section 2. The dataset used to detect fake or real audio is described in Section 3. The proposed approach for detecting and classifying real and fake audio is presented in Section 4. The experimental results are detailed in Section 5. Finally, the conclusion and future work are presented in Section 6.

2. Literature Review
Audio deepfakes are ML/DL-generated audio that appears to be real. To identify audio deepfakes, one must first understand how they are created. Detecting audio generated through deepfakes is critical, since audio deepfakes have been used in several illegal actions in recent years. Before generative networks, audio deepfake approaches fell into three subcategories: replay attacks, speech synthesis, and voice conversion. This section presents the most recent and relevant research and tools for each category.

2.1. Replay Attack
Replay attacks work by simply replaying a recording of a target speaker's voice. The main concern is detection, of which there are two types: far-field detection and cut-and-paste detection. In far-field replay attacks, the test segment is a far-field microphone recording of the target that has been replayed on a phone handset with a loudspeaker [18, 22]. If a recording is created by cutting and pasting short recordings to simulate the sentence required by a text-based system, it is referred to as a cut-and-paste attack [19, 23]. Text-dependent speaker verification can be used to guard against replay attacks [24]. Deep convolutional networks are a recent technology for detecting replay threats end to end [20]. Some replay attack detection techniques have also been proposed that focus on the network properties [21].

2.2. Speech Synthesis
Speech synthesis (SS) is the artificial generation of human speech with software or hardware programs. It may be used for various purposes, including text reading and serving as a personal AI assistant. Text-to-speech (TTS) is a form of SS that analyses input text and generates speech following the norms of the linguistic description of that text. Another advantage of speech synthesis is that it can provide multiple accents and voices instead of pre-recorded human voices.
Lyrebird, a significant voice synthesis company, employs deep learning models to synthesize 1,000 phrases in a second. Text-to-speech primarily relies on the quality of the speech corpus used to build the system; regrettably, creating speech corpora is expensive. Another drawback of SS systems is that they do not detect periods or special characters [25]. The most common problems are homographs, which occur when two words have distinct meanings yet are spelled the same way [26]. Char2Wav is an end-to-end framework for speech synthesis. PixelCNN is also the foundation of WaveNet [27], an SS framework. WaveGlow [28] focuses on the second phase of text-to-speech synthesis systems, which typically involve two phases (encoder and decoder); as a result, WaveGlow is concerned with converting time-aligned information, such as a mel-spectrogram acquired from an encoder, into audio samples. Tacotron was first proposed in 2017 [1]. Tacotron features CBHG, which comprises a 1-D convolution bank, a highway network, and a bidirectional GRU [29]. Tacotron 2 [30] is made up of two parts; the first component is an attention-based recurrent sequence-to-sequence feature prediction network whose output is a predicted series of mel spectrogram frames. Deep Voice 3, a TTS framework [31], includes three parts, an encoder, a decoder, and a converter, built as a CNN-based network. MelNet, a TTS framework [32], predicts a distribution element by element over the time and frequency dimensions of a spectrogram in an autoregressive manner. The network comprises several computational stacks that extract characteristics from different parts of the input, which are then summed to provide the overall context. The current-layer outputs of the time-delayed and centralized stacks are fed, together with its own previous-layer outputs, into the frequency-delayed stack, and the outputs of its last layer are used to compute the audio-generation parameters. Using neural network TTS synthesis, it is possible to create speech in the voices of numerous speakers, including speakers not seen during training, from only five seconds of reference audio [33]. Char2Wav, end-to-end speech synthesis with a reader and a neural vocoder, was the first model to synthesize audio straight from text [34]. Deep Voice 1 was the first to use deep neural networks for text-to-speech in real time, laying the groundwork for end-to-end neural speech synthesis [35], while Deep Voice 2 [36] was able to replicate many voices using the same technology. Furthermore, most neural network-based speech synthesis models are auto-regressive, which means they condition audio samples on prior samples for long-term modeling and are straightforward to train and deploy [28]. Because SS detection systems are also utilized for voice conversion detection, the detection methods for these two categories are examined together in the VC summary below.

2.3. Voice Conversion (VC) and Impersonation
The last type of audio deepfake is voice conversion, which takes the speech signal of a first speaker, the source, and alters it so that it sounds as if it were uttered by a second speaker, the target. Impersonation is a type of VC in which one pretends to be someone else. With improved technology, it is now faster to imitate a voice; one business, Overdub (https://www.descript.com/overdub), can produce an imitation of any voice from one minute of sample audio. GANs may also be utilized for voice mimicry [37].
The joint-density Gaussian mixture model with maximum-likelihood parameter trajectory generation considering global variance is one of the essential foundations of VC [38]. This approach also serves as the foundation for the open-source Festvox system, which served as the primary VC toolkit in "The Voice Conversion Challenge 2016." Other voice conversion methods, such as neural networks and speaker interpolation, can also be used. However, given their versatility and high-quality outputs, GANs have recently become increasingly popular for VC. In one neural network architecture built to imitate the voices of different genders, the Griffin-Lim approach was utilized to rebuild the time-domain signal; as a result, the model produced persuasive examples of impersonated speech. There are also systems for detecting audio spoofing: ResNet, which was initially employed for image recognition, serves as the foundation of one audio spoofing detection system.

3. Dataset Selection
We use the Fake or Real (FoR) dataset in this research [39]. It comprises over 195,000 audio samples, which were created by computer utilizing the most up-to-date speech synthesis technologies. The dataset was built by integrating many datasets from various studies. The primary purpose of developing this dataset is to train algorithms for identifying phony speech [40, 41, 42]. The FoR dataset is divided into four subsets: (i) for-orig, (ii) for-norm, (iii) for-2sec, and (iv) for-rerec. The experiments in this work are confined to three subsets: for-norm, for-2sec, and for-rerec.

3.1. For-Norm
The for-norm dataset is composed of 69,400 audio samples, including duplicate audio signals. We removed the duplicates from the original dataset and obtained 53,868 unique audio samples. The dataset contains audio of different genders and target classes (fake/real). The dataset can be preprocessed to remove duplicate values and to set the sampling rate, volume, and number of channels.

3.2. For-2sec
This dataset contains a training set of 17,720 audio samples and a testing set of 3,731 audio samples. Each audio clip is 2 seconds long, and the dataset is completely balanced in terms of target class and gender. The sampling rate of the audio samples in this dataset is 44,100 Hz.

3.3. For-rerec
This dataset also contains 2-second audio clips. It comprises 13,268 audio samples of various genders with real and fake classes. The audio has been trimmed to focus more on the signal and extract more insight from it. The sampling rate of the audio samples is the same as in the For-2sec dataset, 44,100 Hz.

4. Proposed Approach
The proposed approach of this study is presented in Figure 1. We use the Fake or Real (FoR) dataset, divided into three subsets, to perform the experiments in this study. In the first step, we analyze the dataset, perform exploratory analysis, and extract useful information for deepfake audio detection. Next, we convert the audio signal from the time domain to the frequency domain to analyze the audio graphically. Furthermore, the most important feature extraction techniques are utilized to extract relevant features from each audio signal. The audio features are normalized and their dimensionality is reduced so that only valuable features are selected. We split the dataset into two sets, training and testing: 80% of the complete dataset is used to train the models, and the remaining 20% is used to measure the performance of the models, as sketched below.
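To make the split concrete, the following minimal sketch (assumed, not the authors' released code) shows an 80/20 hold-out split with scikit-learn. Here X and y stand for the feature matrix and fake/real labels of one FoR subset, and the seed and stratification settings are our illustrative choices, since the paper does not specify them.

```python
# Hypothetical 80/20 hold-out split for one FoR subset (illustrative only).
from sklearn.model_selection import train_test_split

# X: (n_samples, n_features) feature matrix, y: fake/real labels (assumed prepared earlier).
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.20,      # 20% of the data held out for evaluation
    random_state=42,     # fixed seed for reproducibility (our choice, not stated in the paper)
    stratify=y,          # keep the fake/real ratio in both splits (assumption)
)
```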
We selected multiple machine learning (ML) models to detect fake or real audio from the datasets.

Figure 1: Proposed methodology to detect fake audio

4.1. Data Pre-Processing
Preprocessing is a necessary step in machine learning to obtain better results. The primary goal of audio data preprocessing is to transform the audio signal into a format that the machine learning model can easily interpret, which is why it is important to perform standard preprocessing steps before model implementation. Data framing is a preprocessing step that converts audio signals into a format a machine can understand. In this step, we remove duplicate audio from the dataset and normalize the audio features. Audio sampling extracts amplitude values from the audio signal at regular intervals; we set the sampling rate of each audio signal to 44,100 Hz. The sampling rate determines the number of frames of an audio signal, as defined by Equation 1.

Frames = sampling_rate × time    (1)

In this scenario, we set the sampling rate of each signal in the dataset using data framing. The frequency-domain visualization of an audio signal is shown in the next step. As shown in Figure 2, we plot a Mel power spectrogram, which indicates the audio strength at certain time intervals and shows distinct frequency levels; Figure 2 displays the Mel power spectrogram of a 2-second audio stream. The frequency is presented on the vertical axis of the spectrogram, with a range of 0–8 kHz, while the time of the signal is indicated on the horizontal axis. The color combination in the spectrogram represents the signal's amplitude, expressed in decibels (dB).

The standard scaler is used to normalize the features, scaling them into a particular range. The goal of the standard scaler is to normalize features so that they are fairly evenly distributed. As shown in Equation 2, it removes the mean of the data and reduces the variance to one, where Sc is the standardized version of a feature value i, μ is the mean, and σ is the standard deviation.

Sc = (i − μ) / σ    (2)

Figure 2: Mel power spectrogram of an audio signal

4.2. Feature Extraction
Each audio sample in the FoR dataset contains useful characteristics that we need to extract and feed to the ML models for better results. In this study, we applied various feature extraction approaches to obtain features from the audio signal. In the first step, we converted the audio signal into machine-readable form, and then we applied the MFCC, spectral_rolloff, spectral_centroid, spectral_contrast, spectral_bandwidth, and zero_crossing_rate feature extraction approaches to the signal. Each approach results in an array, from which the mean, median, and standard deviation are computed, yielding the final values of that feature for the specific audio signal. We found that not all 270 characteristics extracted from a single audio file are important, and using all of them increases the classification time of the ML models; thus, we used a feature reduction strategy. The 270 features from each audio sample are extracted using a sliding window technique. We employed Principal Component Analysis (PCA) to identify the most significant characteristics. We experimented with various numbers of components and, in the end, set the number of PCA components to 65, which gave superior results. To evaluate the usefulness of the selected features, we computed the explained variance ratio, which is 97%, showing that this approach selects very suitable features. A sketch of this feature extraction and reduction pipeline is given below.
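The sketch below illustrates, under stated assumptions, how such a pipeline can be assembled with librosa and scikit-learn: the six feature families are summarized by their mean, median, and standard deviation, then standardized (Equation 2) and reduced with PCA to 65 components. It is not the authors' code; in particular, the exact windowing configuration and MFCC count that yield the paper's 270 raw features are not specified, so the dimensions here are illustrative.

```python
# Illustrative feature-engineering pipeline (assumptions noted in comments).
import numpy as np
import librosa
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

SAMPLING_RATE = 44100  # Equation 1: frames = sampling_rate * time

def extract_features(path: str) -> np.ndarray:
    """Return a fixed-length feature vector for one audio file."""
    y, sr = librosa.load(path, sr=SAMPLING_RATE)      # resample to 44.1 kHz
    frame_features = [
        librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20),  # n_mfcc is our choice, not stated in the paper
        librosa.feature.spectral_rolloff(y=y, sr=sr),
        librosa.feature.spectral_centroid(y=y, sr=sr),
        librosa.feature.spectral_contrast(y=y, sr=sr),
        librosa.feature.spectral_bandwidth(y=y, sr=sr),
        librosa.feature.zero_crossing_rate(y),
    ]
    # Summarize each frame-level feature with its mean, median, and standard deviation.
    summary = []
    for f in frame_features:
        summary.extend([f.mean(axis=1), np.median(f, axis=1), f.std(axis=1)])
    return np.concatenate(summary)

def reduce_features(X: np.ndarray) -> np.ndarray:
    """Standardize the raw features (Equation 2) and keep 65 principal components."""
    X_scaled = StandardScaler().fit_transform(X)      # Sc = (i - mu) / sigma
    pca = PCA(n_components=65)
    X_reduced = pca.fit_transform(X_scaled)
    # The paper reports ~97% explained variance for the retained components.
    print("explained variance retained:", pca.explained_variance_ratio_.sum())
    return X_reduced
```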
Finally, 65 unique features out of the 270 are passed to the ML classifiers to identify real and fake audio.

4.3. Classification Models
This study performed experiments by extracting features from the audio signal and applying different ML classifiers to these audio features to detect real or fake audio. After pre-processing the audio samples, 270 features are retrieved using various feature extraction approaches. However, with just 2 seconds of audio and 270 features, the classification procedure takes a long time. We therefore reduced the feature vector length, and 65 feasible features were fed to the classification models. For the classification, six ML classifiers, SVM [43], MLP [44], DT [45], LR [46], NB [47], and XGB [48], are implemented with their default parameters, and their performance is compared. The output of these models determines whether a particular audio signal is authentic or fake.
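A minimal sketch of this classification and evaluation setup is shown below. It assumes scikit-learn and XGBoost defaults, GaussianNB as the NB variant (the paper does not name the variant), the PCA-reduced splits X_train/X_test and y_train/y_test from the previous steps, and labels already encoded as 0 (real) / 1 (fake); it is an illustration, not the authors' released code.

```python
# Train the six default-parameter classifiers and compare them by test accuracy.
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

classifiers = {
    "SVM": SVC(),
    "MLP": MLPClassifier(),
    "DT": DecisionTreeClassifier(),
    "LR": LogisticRegression(),
    "NB": GaussianNB(),        # Gaussian variant assumed for continuous PCA features
    "XGB": XGBClassifier(),
}

for name, clf in classifiers.items():
    clf.fit(X_train, y_train)                               # fit on the 80% training split
    accuracy = accuracy_score(y_test, clf.predict(X_test))  # evaluate on the 20% hold-out
    print(f"{name}: {accuracy:.4f}")
```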
5. Experimental Results and Comparative Analysis
This section briefly describes the results obtained from the various classifiers (SVM, MLP, DT, LR, XGB, and NB). We performed experiments using the Fake or Real (FoR) dataset and obtained promising results. Apart from the baseline technique [49], this dataset has not been used in other research. Existing research conducted experiments on the original dataset, but there is presently no research available that works on the multiple variants of this dataset, so we conducted experiments on three subsets of the FoR dataset. The experiments involve six classifiers and three datasets (For-Norm, For-2sec, For-rerec). The results of this study on the various datasets are presented in Table 1. We used the accuracy evaluation metric to evaluate the classification performance of the six classifiers on these datasets. The dataset was divided into two parts: training and testing. The first 80% of the whole dataset is utilized for training the models, while the remaining 20% is used to measure model performance. It is observed that all the ML models performed outstandingly on the given datasets. The SVM model outperformed the other ML models with 97.57% and 98.83% accuracy on the For-2sec and For-rerec datasets, while the XGB model obtained the highest accuracy score of 92.60% on the For-Norm dataset. The XGB model also performed very well on average across all datasets, with an average accuracy score of 93.50%.

We also conducted a broader comparison with state-of-the-art research. The baseline approach performed its experiments on the original dataset, while this study worked on various subsets of the FoR dataset. To the best of our knowledge, no prior work has been done on the subsets of the FoR dataset. The goal of generating these subsets from the original dataset is to improve classification while reducing computational costs. The results of the proposed and baseline approaches are presented in Table 2. The original dataset was divided into three subsets (For-Norm, For-2sec, For-rerec) for better classification. We highlight the best average result of the proposed approach across the three datasets, obtained with the XGB model. The accuracy score of the XGB model shows that the proposed approach outperforms the baseline approach with a 26% accuracy gain. The highest accuracy of the baseline approach is obtained with SVM, at 67%. It is concluded that the suggested strategy outperforms the baseline approach in detecting real or fake audio from the FoR dataset. The best features aid the classification process in accurately identifying real or fake audio in a short period.

Table 1
Machine learning models' accuracy (%) on each dataset

Models | for-2sec | for-norm | for-rerec
SVM    | 97.57    | 71.54    | 98.83
MLP    | 94.69    | 86.82    | 98.79
DT     | 87.13    | 62.16    | 88.28
LR     | 89.92    | 82.80    | 88.31
XGB    | 94.52    | 92.60    | 93.40
NB     | 88.20    | 81.80    | 81.91

Table 2
Comparative analysis of the proposed approach with the baseline approach

Reference                      | Algorithm | Test accuracy
Algorithms implemented in [49] | SVM       | 0.67
Proposed approach              | XGB       | 0.93

6. Conclusion and Future Work
Deepfakes have recently become a critical topic of discussion. Through manipulated audio-visual content, this type of data spreads hate and misinformation across the entire world. Therefore, it is very important to develop a system for the early detection of deepfakes to prevent the spread of misinformation. This research focuses primarily on the task of detecting deepfake audio. An optimal feature engineering technique combined with machine learning approaches is used to classify real and fake audio. We employed the Fake or Real (FoR) dataset and its various subsets in this study; the proposed approach is implemented on the subsets of the FoR dataset. This study shows that the SVM model has the highest test accuracy on the For-2sec and For-rerec datasets, while the XGB model exhibits the highest accuracy on the For-Norm dataset. The results show that the proposed approach obtains better results than the baseline approach. This work can be extended in the future by exploring the latest feature extraction methodologies, which will aid in obtaining better results with both machine learning and deep learning. Furthermore, the proposed system's architecture is simpler than the models used in [49] and achieves comparable accuracy. In addition, deep learning models may exhibit better performance than machine learning because of their feedforward and backpropagation strategy, and they can also be used to implement amplitude-based classification. One limitation of this work is that deep learning with audio features has not been utilized. Furthermore, we can expand this work by using the original dataset as well as the subsets and applying a deep learning approach with new feature extraction approaches.

Acknowledgment
The research is supported by Zayed University grant number R21096.

References
[1] Y. Wang, R. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio, et al., Tacotron: A fully end-to-end text-to-speech synthesis model, arXiv preprint arXiv:1703.10135 (2017). [2] S. Arik, J. Chen, K. Peng, W. Ping, Y. Zhou, Neural voice cloning with a few samples, Advances in Neural Information Processing Systems 31 (2018). [3] Y. Zhou, S.-N. Lim, Joint audio-visual deepfake detection, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 14800–14809. [4] A. Qais, A. Rastogi, A. Saxena, A. Rana, D. Sinha, Deepfake audio detection with neural networks using audio features, in: 2022 International Conference on Intelligent Controller and Computing for Smart Power (ICICCSP), IEEE, 2022, pp. 1–6. [5] S. Agarwal, H. Farid, T. El-Gaaly, S.-N. Lim, Detecting deep-fake videos from appearance and behavior, in: 2020 IEEE International Workshop on Information Forensics and Security (WIFS), IEEE, 2020, pp. 1–6. [6] G. Drakopoulos, I. Giannoukou, P. Mylonas, S.
Sioutas, A graph neural network for assessing the affective coherence of twitter graphs, in: 2020 IEEE International Conference on Big Data (Big Data), IEEE, 2020, pp. 3618–3627. [7] A. Rossler, D. Cozzolino, L. Verdoliva, C. Riess, J. Thies, M. Nießner, Faceforensics++: Learn- ing to detect manipulated facial images, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 1–11. [8] N. M. Müller, K. Markert, K. Böttinger, Human perception of audio deepfakes, arXiv preprint arXiv:2107.09667 (2021). [9] P. Korshunov, S. Marcel, Deepfake detection: humans vs. machines, arXiv preprint arXiv:2009.03155 (2020). [10] M. Groh, Z. Epstein, C. Firestone, R. Picard, Deepfake detection by human crowds, machines, and machine-informed crowds, Proceedings of the National Academy of Sciences 119 (2022). [11] A. R. Javed, F. Shahzad, S. ur Rehman, Y. B. Zikria, I. Razzak, Z. Jalil, G. Xu, Future smart cities requirements, emerging technologies, applications, challenges, and future aspects, Cities 129 (2022) 103794. [12] A. Abbasi, A. R. Javed, F. Iqbal, Z. Jalil, T. R. Gadekallu, N. Kryvinska, Authorship identifi- cation using ensemble learning, Scientific Reports 12 (2022) 1–16. [13] R. K. Das, T. Kinnunen, W.-C. Huang, Z. Ling, J. Yamagishi, Y. Zhao, X. Tian, T. Toda, Predictions of subjective ratings and spoofing assessments of voice conversion challenge 2020 submissions, arXiv preprint arXiv:2009.03554 (2020). [14] Z. Wang, S. Cui, X. Kang, W. Sun, Z. Li, Densely connected convolutional network for audio spoofing detection, in: 2020 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), IEEE, 2020, pp. 1352–1360. [15] B. Chettri, D. Stoller, V. Morfi, M. A. M. Ramírez, E. Benetos, B. L. Sturm, Ensemble models for spoofing detection in automatic speaker verification, Proc. Interspeech 2019 (2019) 1018–1022. [16] M. Sahidullah, T. Kinnunen, C. Hanilçi, A comparison of features for synthetic speech detection (2015). [17] N. M. Müller, F. Dieckmann, P. Czempin, R. Canals, K. Böttinger, J. Williams, Speech is silver, silence is golden: What do asvspoof-trained models really learn?, arXiv preprint arXiv:2106.12914 (2021). [18] S. Pradhan, W. Sun, G. Baig, L. Qiu, Combating replay attacks against voice assistants, Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 3 (2019) 1–26. [19] J. Villalba, E. Lleida, Preventing replay attacks on speaker verification systems, in: 2011 Carnahan Conference on Security Technology, IEEE, 2011, pp. 1–8. [20] F. Tom, M. Jain, P. Dey, End-to-end audio replay attack detection using deep convolutional networks with attention., in: Interspeech, 2018, pp. 681–685. [21] M. Witkowski, S. Kacprzak, P. Zelasko, K. Kowalczyk, J. Galka, Audio replay attack detection using high-frequency features., in: Interspeech, 2017, pp. 27–31. [22] J. Gonzalez-Rodriguez, A. Escudero, D. de Benito-Gorrón, B. Labrador, J. Franco-Pedroso, An audio fingerprinting approach to replay attack detection on asvspoof 2017 challenge data., in: Odyssey, 2018, pp. 304–311. [23] L. Huang, C.-M. Pun, Audio replay spoof attack detection by joint segment-based linear filter bank feature extraction and attention-enhanced densenet-bilstm network, IEEE/ACM Transactions on Audio, Speech, and Language Processing 28 (2020) 1813–1825. [24] L. Huang, C.-M. 
Pun, Audio replay spoof attack detection using segment-based hybrid feature and densenet-lstm network, in: ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2019, pp. 2567–2571. [25] K. Kuligowska, P. Kisielewicz, A. Włodarz, Speech synthesis systems: disadvantages and limitations, Int J Res Eng Technol (UAE) 7 (2018) 234–239. [26] G. Watson, Z. Khanjani, V. Janeja, Audio deepfake perceptions in college going populations, UMBC Faculty Collection (2021). [27] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, K. Kavukcuoglu, Wavenet: A generative model for raw audio, arXiv preprint arXiv:1609.03499 (2016). [28] R. Prenger, R. Valle, B. Catanzaro, Waveglow: A flow-based generative network for speech synthesis, in: ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2019, pp. 3617–3621. [29] J. Lee, K. Cho, T. Hofmann, Fully character-level neural machine translation without explicit segmentation, Transactions of the Association for Computational Linguistics 5 (2017) 365–378. [30] J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerry-Ryan, et al., Natural tts synthesis by conditioning wavenet on mel spectrogram predictions, in: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2018, pp. 4779–4783. [31] W. Ping, K. Peng, A. Gibiansky, S. Ö. Arik, A. Kannan, S. Narang, J. Raiman, J. Miller, Deep voice 3: Scaling text-to-speech with convolutional sequence learning, in: ICLR (Poster), 2018. [32] S. Vasquez, M. Lewis, Melnet: A generative model for audio in the frequency domain (2019). [33] Y. Jia, Y. Zhang, R. J. Weiss, Q. Wang, J. Shen, F. Ren, Z. Chen, P. Nguyen, R. Pang, I. Lopez Moreno, Y. Wu, Transfer learning from speaker verification to multispeaker text-to-speech synthesis (2018). [34] J. Sotelo, S. Mehri, K. Kumar, J. F. Santos, K. Kastner, A. Courville, Y. Bengio, Char2wav: End-to-end speech synthesis (2017). [35] S. Ö. Arık, M. Chrzanowski, A. Coates, G. Diamos, A. Gibiansky, Y. Kang, X. Li, J. Miller, A. Ng, J. Raiman, et al., Deep voice: Real-time neural text-to-speech, in: International Conference on Machine Learning, PMLR, 2017, pp. 195–204. [36] A. Gibiansky, S. Arik, G. Diamos, J. Miller, K. Peng, W. Ping, J. Raiman, Y. Zhou, Deep voice 2: Multi-speaker neural text-to-speech, Advances in Neural Information Processing Systems 30 (2017). [37] Y. Gao, R. Singh, B. Raj, Voice impersonation using generative adversarial networks, in: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2018, pp. 2506–2510. [38] T. Toda, A. W. Black, K. Tokuda, Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory, IEEE Transactions on Audio, Speech, and Language Processing 15 (2007) 2222–2235. [39] R. Reimao, V. Tzerpos, FoR: A dataset for synthetic speech detection, in: 2019 International Conference on Speech Technology and Human-Computer Dialogue (SpeD), IEEE, 2019, pp. 1–10. [40] K. Ito, L. Johnson, The LJ Speech dataset, https://keithito.com/LJ-Speech-Dataset/, 2017. [41] J. Kominek, A. W. Black, The CMU Arctic speech databases, in: Fifth ISCA Workshop on Speech Synthesis, 2004. [42] K. MacLean, VoxForge, [Online]. Available: http://www.voxforge.org/home [Accessed 2012] (2018). [43] T. Evgeniou, M.
Pontil, Support vector machines: Theory and applications, in: Advanced Course on Artificial Intelligence, Springer, 1999, pp. 249–257. [44] A. Abbasi, A. R. Javed, A. Yasin, Z. Jalil, N. Kryvinska, U. Tariq, A large-scale benchmark dataset for anomaly detection and rare event classification for audio forensics, IEEE Access (2022). [45] A. J. Myles, R. N. Feudale, Y. Liu, N. A. Woody, S. D. Brown, An introduction to decision tree modeling, Journal of Chemometrics: A Journal of the Chemometrics Society 18 (2004) 275–285. [46] A. Abbasi, A. R. Javed, C. Chakraborty, J. Nebhen, W. Zehra, Z. Jalil, Elstream: An ensemble learning approach for concept drift detection in dynamic social big data stream learning, IEEE Access 9 (2021) 66408–66419. [47] I. Rish, et al., An empirical study of the naive bayes classifier, in: IJCAI 2001 workshop on empirical methods in artificial intelligence, volume 3, 2001, pp. 41–46. [48] T. Chen, C. Guestrin, Xgboost: A scalable tree boosting system, in: Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining, 2016, pp. 785–794. [49] J. Khochare, C. Joshi, B. Yenarkar, S. Suratkar, F. Kazi, A deep learning framework for audio deepfake detection, Arabian Journal for Science and Engineering (2021) 1–12.