<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>An Efficient Approach for Audio Deepfake Detection</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Quan Trong The</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Lab Blockchain, Faculty of Information Security, Posts and Telecommunications Institute of Technology (PTIT)</institution>
          ,
          <addr-line>Hanoi</addr-line>
          ,
          <country country="VN">Vietnam</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2026</year>
      </pub-date>
      <abstract>
        <p>Nowadays, with the rapid development of multimedia technology to meet human requirements, deep learning techniques are used to generate and create synthetic media. This raises significant challenges in controlling audio-based content containing deepfake sound. Deepfakes make it harder to identify the original sound source and to authenticate speakers in almost all speech applications, such as voice-controlled devices, teleconference systems and biometric equipment. Much research has attempted to recognize and classify audio deepfakes by utilizing the ASVspoof datasets and numerous different machine learning (ML) algorithms, supported by numerical simulations. In this paper, the author proposes applying optimized Mel-frequency cepstral coefficients (MFCCs) to improve the robustness of an ML-based deepfake sound detection system in terms of measured performance metrics such as accuracy, precision, recall and F1-score. The numerical results confirm the effectiveness of the described method in comparison with traditional approaches. The author's suggested technique offers a promising framework for identifying, detecting and classifying deepfake voices to prevent the spread of misinformation in digital media.</p>
      </abstract>
      <kwd-group>
        <kwd>Audio deepfake</kwd>
        <kwd>mel-frequency cepstral coefficients</kwd>
        <kwd>detection</kwd>
        <kwd>deep learning</kwd>
        <kwd>voice</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>2. The optimized MFCCs</title>
      <p>Deepfake audio signals often exhibit features similar to the original human voice, and this similarity makes distinguishing them a difficult problem for deep learning approaches to deepfake detection and classification. Hence, well-chosen, discriminative features can significantly affect the model's predictive ability and effectiveness. In the frequency domain, audio signals can provide helpful characteristics for the detection and classification of deepfake audio. For this purpose, the author uses optimized Mel-frequency cepstral coefficients (MFCCs), which play an important role in speech recognition. Although MFCC extraction is more computationally demanding than zero-crossing rate or energy calculation, MFCCs remain feasible for embedded acoustic devices when integrated with various signal-processing algorithms.</p>
      <p>For each segment of audio, the short-term energy is calculated as in Eq. (1), where x[n] denotes the normalized audio sample and N is the length of the considered frame. The computed energy is then converted to decibels as in Eq. (2), where ε is a small additive constant that prevents logarithmic singularities. If the VAD ratio of Eq. (3) is greater than a determined threshold θ, the frame contains a speech component; otherwise it is treated as noise, as in Eq. (4).</p>
      <p>The MFCC feature extraction is computed as follows:
1) Apply a Hamming window and transform the frame with a 1024-point FFT to obtain magnitude and phase values. The number of FFT points can be increased to enhance detection accuracy; in this article, the author used 1024 points to achieve a good balance between signal-processing time and performance.</p>
      <p>2) Apply a mel filterbank precomputed for 44.1 kHz sampling and a 1024-point FFT to obtain an approximation of human auditory perception.</p>
      <p>3) Take the logarithm and the Discrete Cosine Transform (DCT), keeping the first 64 MFCC coefficients.</p>
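      <p>The three steps above can be sketched as follows. This is a minimal NumPy-only illustration, not the paper's implementation: the 128-band triangular filter bank and the HTK-style mel formula are assumptions.</p>
      <preformat>
```python
import numpy as np

SR, N_FFT, N_MELS, N_MFCC = 44100, 1024, 128, 64

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr=SR, n_fft=N_FFT, n_mels=N_MELS):
    # Triangular filters spaced evenly on the mel scale up to Nyquist.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        lo, ce, hi = bins[i - 1], bins[i], bins[i + 1]
        for k in range(lo, ce):
            fb[i - 1, k] = (k - lo) / max(ce - lo, 1)
        for k in range(ce, hi):
            fb[i - 1, k] = (hi - k) / max(hi - ce, 1)
    return fb

def dct_ortho(x):
    # Orthonormal DCT-II (same scaling as scipy.fft.dct(..., norm="ortho")).
    n = len(x)
    k = np.arange(n)
    basis = np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2.0 * n))
    out = 2.0 * basis @ x
    out[0] *= np.sqrt(1.0 / (4.0 * n))
    out[1:] *= np.sqrt(1.0 / (2.0 * n))
    return out

def mfcc_frame(frame):
    # 1) Hamming window + 1024-point FFT magnitude.
    spec = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), n=N_FFT))
    # 2) Precomputed mel filterbank for 44.1 kHz / 1024-point FFT.
    mel_energy = mel_filterbank() @ (spec ** 2)
    # 3) Log + DCT, keeping the first 64 MFCC coefficients.
    return dct_ortho(np.log(mel_energy + 1e-10))[:N_MFCC]
```
      </preformat>
      <p>In practice a library such as Librosa provides an equivalent pipeline in a single call.</p>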
      <p>The author's idea is to use an efficient Voice Activity Detection (VAD) to determine the frames with presence or absence of a speech component, so that the MFCCs are computed exactly on speech. On the assumption that the speech component lies mostly in the 0−3400 Hz range while noise is distributed over higher frequency bands, the author proposes computing the ratio of spectral energy between the 0−3400 Hz and 3400−22100 Hz bands.</p>
      <p>E = (1/N) ∑_{n=0}^{N−1} x[n]² (1)</p>
      <p>E_dB = 10 log₁₀(E + ε) (2)</p>
      <p>R = E_{0−3400} / E_{3400−22100} (3)</p>
      <p>speech if R > θ, noise if R ≤ θ (4)</p>
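      <p>A compact sketch of the frame decision in Eqs. (1)-(4); the threshold value θ = 1 and the ε constant are illustrative assumptions.</p>
      <preformat>
```python
import numpy as np

def frame_energy_db(frame, eps=1e-10):
    # Eqs. (1)-(2): mean squared normalized samples, converted to decibels.
    return 10.0 * np.log10(np.mean(frame ** 2) + eps)

def is_speech(frame, sr=44100, split_hz=3400.0, theta=1.0):
    # Eq. (3): ratio of spectral energy below and above 3400 Hz.
    spec = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    high_mask = freqs > split_hz
    high = spec[high_mask].sum() + 1e-10
    low = spec[~high_mask].sum()
    # Eq. (4): classify the frame as speech when the ratio exceeds theta.
    return bool(low / high > theta)
```
      </preformat>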
      <p>The scheme for determining speech/noise frames is illustrated in Figure 1. To avoid false VAD triggers when only ambient noise is present, a reasonable threshold must be estimated. The author's procedure can be expressed as:</p>
      <p>1) Select the first 5-10 frames at start-up, assuming that no speech component is present during this period.</p>
      <p>2) Compute acoustic features such as spectral energy and MFCCs.
3) Estimate the properties of the surrounding noise by averaging the above parameters.
4) Set adaptive thresholds based on the background noise, and save the results in a global VAD variable used during the entire signal processing.</p>
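      <p>Steps 1-4 above can be sketched as follows; the 3 dB margin over the estimated noise floor is an illustrative assumption.</p>
      <preformat>
```python
import numpy as np

def calibrate_vad_threshold(startup_frames, margin_db=3.0, eps=1e-10):
    # Steps 1-3: the first 5-10 frames are assumed noise-only; average
    # their dB energies to estimate the background-noise properties.
    energies = [10.0 * np.log10(np.mean(f ** 2) + eps) for f in startup_frames]
    noise_floor = float(np.mean(energies))
    # Step 4: derive an adaptive threshold, to be stored in a global
    # VAD variable and reused for the entire signal processing.
    return noise_floor + margin_db

rng = np.random.default_rng(1)
startup = [0.01 * rng.standard_normal(1024) for _ in range(8)]
VAD_THRESHOLD = calibrate_vad_threshold(startup)
```
      </preformat>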
      <p>During speech frames, the pre-emphasis step of the MFCC pipeline is performed according to the equation:</p>
      <p>y[n] = x[n] − α·x[n−1] (5)</p>
      <p>where y[n] is the output, x[n] is the input, x[n−1] is the previous input sample, and the pre-emphasis parameter is α = 0.9.</p>
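      <p>Equation (5) reduces to a one-line NumPy filter; keeping the first sample unchanged is a common convention and an assumption here.</p>
      <preformat>
```python
import numpy as np

def pre_emphasis(x, alpha=0.9):
    # Eq. (5): y[n] = x[n] - alpha * x[n-1], with y[0] = x[0].
    return np.append(x[0], x[1:] - alpha * x[:-1])
```
      </preformat>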
      <p>With this scheme, the MFCCs are computed precisely over the entire recording, which improves the accuracy of deepfake detection.</p>
    </sec>
    <sec id="sec-3">
      <title>3. The author’s proposed method and other classification models</title>
      <p>Convolutional Neural Networks (CNNs) have been commonly implemented for detection tasks due to their capability to process spatial dependencies in audio spectrograms. Researchers have designed CNN architectures for extracting discriminative features from observed audio signals.</p>
      <p>Long Short-Term Memory (LSTM) networks, an efficient technique for modelling temporal dependencies in sequential data, are well suited to analyzing and processing audio signals. LSTMs are adept at learning long-range dependencies and can be applied to various tasks such as speech recognition, source separation, VAD and deepfake audio detection.</p>
      <p>To address audio deepfake detection, Generative Adversarial Networks (GANs) have been used for training detection models by generating synthetic audio samples.</p>
      <p>The Multi-Layer Perceptron (MLP) is an efficient solution for classification problems. A multilayer perceptron can, through its layers, effectively extract the relevant features from data and tune the parameters of the model for optimal predictions. An MLP model has at least three levels: an input layer, a hidden layer of computation nodes and an output layer of processing nodes.</p>
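      <p>The three-level structure described above amounts to the following forward pass; the layer sizes and the ReLU/softmax choices are illustrative assumptions, not the paper's configuration.</p>
      <preformat>
```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = 0.1 * rng.standard_normal((64, 16)), np.zeros(16)  # input: 64 MFCCs
W2, b2 = 0.1 * rng.standard_normal((16, 2)), np.zeros(2)    # output: real/fake

def mlp_forward(x):
    h = np.maximum(x @ W1 + b1, 0.0)       # hidden layer of computation nodes
    logits = h @ W2 + b2                   # output layer of processing nodes
    e = np.exp(logits - logits.max())
    return e / e.sum()                     # class probabilities via softmax
```
      </preformat>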
      <p>The author's idea is to use the optimized MFCC features to increase the effectiveness of audio deepfake detection with the above ML algorithms. The final decision is made by majority vote over the obtained results. The author uses the Librosa library [9] to load real/fake audio files, and each audio recording is divided into small 1-second slices. In this stage, the MFCC coefficients are computed by applying the fast Fourier transform, logarithm and discrete cosine transform.</p>
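      <p>The slicing step can be sketched as below. Loading the file itself would use librosa.load, so only the plain-NumPy slicing is shown; discarding a trailing partial slice is an assumption.</p>
      <preformat>
```python
import numpy as np

def one_second_slices(y, sr):
    # Split a waveform into consecutive non-overlapping 1-second slices,
    # discarding any incomplete slice at the end of the recording.
    return [y[i:i + sr] for i in range(0, len(y) - sr + 1, sr)]
```
      </preformat>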
      <p>The proposed system (ProSys + MFCC) is presented in Figure 2.</p>
      <p>As an individual model operates on 1-second audio segments, the overall predicted probability for a given recording is determined by averaging the estimated probabilities over the entire audio file. Denote p(i) = [p₁(i), p₂(i), ..., p_C(i)] the probability vector of the i-th of the L one-second segments in one audio file, where C is the number of categories. The probability for the whole recording is obtained by averaging the per-segment classification probabilities:</p>
      <p>p̄ = (1/L) ∑_{i=1}^{L} p(i) (6)</p>
      <p>For the ensemble of results from different models, the author evaluated each of the K individual models, obtaining per-model predicted probabilities p̄₁, p̄₂, ..., p̄_K. The predicted probability after MEAN fusion is derived by:</p>
      <p>p̂ = (1/K) ∑_{k=1}^{K} p̄_k (7)</p>
      <p>Finally, the predicted label ŷ is the category with the highest fused probability:</p>
      <p>ŷ = argmax_c p̂_c (8)</p>
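      <p>The two-level averaging of Eqs. (6)-(8) can be sketched as follows; the two-class shapes in the usage are illustrative assumptions.</p>
      <preformat>
```python
import numpy as np

def file_probability(segment_probs):
    # Eq. (6): average the per-segment probabilities over the L slices.
    return np.asarray(segment_probs).mean(axis=0)

def mean_fusion(model_probs):
    # Eq. (7): MEAN fusion of the K individual models' file probabilities.
    return np.asarray(model_probs).mean(axis=0)

def predict_label(fused_probs):
    # Eq. (8): final label = category with the highest fused probability.
    return int(np.argmax(fused_probs))
```
      </preformat>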
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <p>In this experiment, the author evaluated the suggested models on the Logical Access dataset of the ASVspoof 2019 challenge, whose fake audio samples were generated by AI-based generative systems. Logical Access is divided into three subsets, ‘Train’ (22800/2580), ‘Develop’ (22296/2548) and ‘Evaluation’ (63882/7355) (fake samples/real samples). The author used the ‘Train’ subset for training the models, validated on the ‘Develop’ subset, and finally tested the models on the ‘Evaluation’ subset. The results were reported through Equal Error Rate (EER), Accuracy, F1 score and AUC score. The author's approach using the optimized MFCCs has the advantage of improving the capability of fake audio detection.</p>
      <p>The evaluation compares ProSys + optMFCC against each individual ML algorithm using conventionally computed MFCCs.</p>
      <p>The model ProSys + optMFCC, with optimized MFCC coefficients and an ensemble of all ML algorithms, has shown increased performance in Accuracy, F1 score, AUC and EER. With an appropriate VAD, the author's method can determine whether a frame contains a speech component and so compute the necessary MFCC coefficients exactly. With the individual ML algorithms, the input frame is not smoothed, so the MFCC features do not capture all the characteristics of the human voice, which play an important role in the decision of detecting fake audio samples. Consequently, ProSys + optMFCC gave better Accuracy, F1 score and AUC. Besides, the ensemble method improved the final result of ProSys + optMFCC, which achieved a promising EER of 0.08 in comparison with CNN (0.15), RNN (0.14) and GAN (0.17). These findings confirm that ensembling diverse features from multiple spectrogram-based ML algorithms substantially improves the overall evaluation compared to a single technique.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>This paper has described an efficient approach based on optimized calculation of the MFCC coefficients and on ensemble techniques over spectrogram features. The appealing property of the author's approach is utilizing VAD to select exactly the frames containing a speech component, smoothing them and calculating the first 64 coefficients. Besides, an effective ensemble technique based on spectral features achieved better accuracy and EER on the ASVspoof 2021 database. The numerical results have confirmed the ability of the suggested system to address many complex problems.</p>
      <p>Declaration on Generative AI</p>
      <p>The author(s) have not employed any Generative AI tools.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Abbasi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. R. R.</given-names>
            <surname>Javed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Yasin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Jalil</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Kryvinska</surname>
          </string-name>
          and
          <string-name>
            <given-names>U.</given-names>
            <surname>Tariq</surname>
          </string-name>
          ,
          <article-title>"A Large-Scale Benchmark Dataset for Anomaly Detection and Rare Event Classification for Audio Forensics,"</article-title>
          <source>in IEEE Access</source>
          , vol.
          <volume>10</volume>
          , pp.
          <fpage>38885</fpage>
          -
          <lpage>38894</lpage>
          ,
          <year>2022</year>
          , doi: 10.1109/ACCESS.2022.3166602.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A. R.</given-names>
            <surname>Javed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Ahmed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Alazab</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Jalil</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kifayat</surname>
          </string-name>
          and
          <string-name>
            <given-names>T. R.</given-names>
            <surname>Gadekallu</surname>
          </string-name>
          ,
          <article-title>"A Comprehensive Survey on Computer Forensics: State-of-the-Art, Tools, Techniques, Challenges, and Future Directions,"</article-title>
          <source>in IEEE Access</source>
          , vol.
          <volume>10</volume>
          , pp.
          <fpage>11065</fpage>
          -
          <lpage>11089</lpage>
          ,
          <year>2022</year>
          , doi: 10.1109/ACCESS.2022.3142508.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>F.</given-names>
            <surname>Tom</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Jain</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Dey</surname>
          </string-name>
          , “
          <article-title>End-to-end audio replay attack detection using deep convolutional networks with attention,”</article-title>
          <source>in Proc. Interspeech</source>
          , Hyderabad,
          <year>2018</year>
          , pp.
          <fpage>681</fpage>
          -
          <lpage>685</lpage>
          . DOI: 10.21437/Interspeech.2018-2279.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Pradhan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Baig</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Qiu</surname>
          </string-name>
          , “
          <article-title>Combating replay attacks against voice assistants,”</article-title>
          <source>Proc. ACM Interact., Mobile, Wearable Ubiquitous Technol.</source>
          , vol.
          <volume>3</volume>
          , no.
          <issue>3</issue>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>26</lpage>
          , Sep.
          <year>2019</year>
          . DOI: 10.1145/3351258.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Villalba</surname>
          </string-name>
          and
          <string-name>
            <given-names>E.</given-names>
            <surname>Lleida</surname>
          </string-name>
          ,
          <article-title>"Preventing replay attacks on speaker verification systems,"</article-title>
          <source>in Proc. 2011 Carnahan Conference on Security Technology</source>
          , Barcelona, Spain,
          <year>2011</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          , doi: 10.1109/CCST.2011.6095943.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] T. Kinnunen, M. Sahidullah, H. Delgado, M. Todisco, N. Evans, J. Yamagishi, and K. A. Lee, “The ASVspoof 2017 challenge: Assessing the limits of replay spoofing attack detection,” in Proc. 18th Annu. Conf. Int. Speech Commun. Assoc., 2017, pp. 2–6.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] J. Frank and L. Schönherr, “WaveFake: A data set to facilitate audio deepfake detection,” 2021, arXiv:2111.02813.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] M. Hassaballah, M. A. Hameed, and M. H. Alkinani, “Introduction to digital image steganography,” in Digital Media Steganography: Principles, Algorithms, and Advances, pp. 1-15, doi: 10.1016/B978-0-12-819438-6.00009-8.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>