<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>An Efficient Approach for Audio Deepfake Detection</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Quan Trong The</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Lab Blockchain, Faculty of Information Security, Posts and Telecommunications Institute of Technology (PTIT)</institution>
          ,
          <addr-line>Hanoi</addr-line>
          ,
          <country country="VN">Vietnam</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2026</year>
      </pub-date>
      <abstract>
        <p>Nowadays, with the rapid development of multimedia technology to meet human requirements, deep learning techniques are used to generate and create synthetic media. This raises significant challenges in controlling audio-based content containing deepfake sound. Deepfakes make it harder to identify the original sound source and to authenticate speakers in almost all speech applications, such as voice-controlled devices, teleconference systems and biometric equipment. Much research has attempted to recognize and classify audio deepfakes by utilizing the ASVspoof datasets and numerous different machine learning (ML) algorithms, supported by numerical simulations. In this paper, the author proposes applying optimized Mel-frequency cepstral coefficients (MFCCs) to improve the robustness of an ML-based deepfake sound detection system in terms of measured performance metrics such as accuracy, precision, recall and F1-score. The numerical results confirm the effectiveness of the described method in comparison with traditional approaches. The author's suggested technique offers a promising framework for identifying, detecting and classifying deepfake voices to prevent the spread of misinformation in digital media.</p>
      </abstract>
      <kwd-group>
        <kwd>Audio deepfake</kwd>
        <kwd>mel-frequency cepstral coefficients</kwd>
        <kwd>detection</kwd>
        <kwd>deep learning</kwd>
        <kwd>voice</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>2. The optimized MFCCs</title>
      <p>Deepfake audio signals often exhibit features similar to the original human voice, and this similarity makes distinguishing them a difficult problem for deep learning approaches to deepfake detection and classification. Hence, well-chosen, discriminative features can significantly affect the model's predictive ability and effectiveness. In the frequency domain, audio signals can provide helpful characteristics for the detection and classification of deepfake audio. For this purpose, the author uses optimized Mel-frequency cepstral coefficients (MFCCs), which play an important role in speech recognition. Although MFCC extraction is more computationally demanding than zero-crossing rate or energy calculation, MFCCs remain feasible for embedded acoustic devices when integrated with various signal-processing algorithms.</p>
      <p>For each segment of audio, the short-term energy is calculated as in Eq. (1), where x[n] denotes the normalized audio sample and N is the length of the considered frame. The computed energy is then converted to decibels as in Eq. (2), where ε is a small additive constant that prevents logarithmic singularities. If the VAD ratio of Eq. (3) is greater than a determined threshold θ, the frame contains a speech component; otherwise it is treated as noise, as in Eq. (4).</p>
      <p>The MFCC feature extraction is computed as follows:
1) Apply a Hamming window and transform the frame with a 1024-point FFT to obtain magnitude and phase values. The number of FFT points can be increased to enhance detection accuracy; in this article, the author used 1024 points to achieve a good balance between signal-processing time and performance.</p>
      <p>2) Apply a mel filterbank precomputed for 44.1 kHz sampling and a 1024-point FFT to obtain an approximation of human auditory perception.</p>
      <p>3) Take the logarithm and the Discrete Cosine Transform (DCT), keeping the first 64 MFCC coefficients.</p>
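      <p>The three steps above can be sketched as follows. This is a minimal NumPy-only illustration, not the paper's implementation: the 128-band triangular filter bank and the HTK-style mel formula are assumptions.</p>
      <preformat>
```python
import numpy as np

SR, N_FFT, N_MELS, N_MFCC = 44100, 1024, 128, 64

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr=SR, n_fft=N_FFT, n_mels=N_MELS):
    # Triangular filters spaced evenly on the mel scale up to Nyquist.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        lo, ce, hi = bins[i - 1], bins[i], bins[i + 1]
        for k in range(lo, ce):
            fb[i - 1, k] = (k - lo) / max(ce - lo, 1)
        for k in range(ce, hi):
            fb[i - 1, k] = (hi - k) / max(hi - ce, 1)
    return fb

def dct_ortho(x):
    # Orthonormal DCT-II (same scaling as scipy.fft.dct(..., norm="ortho")).
    n = len(x)
    k = np.arange(n)
    basis = np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2.0 * n))
    out = 2.0 * basis @ x
    out[0] *= np.sqrt(1.0 / (4.0 * n))
    out[1:] *= np.sqrt(1.0 / (2.0 * n))
    return out

def mfcc_frame(frame):
    # 1) Hamming window + 1024-point FFT magnitude.
    spec = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), n=N_FFT))
    # 2) Precomputed mel filterbank for 44.1 kHz / 1024-point FFT.
    mel_energy = mel_filterbank() @ (spec ** 2)
    # 3) Log + DCT, keeping the first 64 MFCC coefficients.
    return dct_ortho(np.log(mel_energy + 1e-10))[:N_MFCC]
```
      </preformat>
      <p>In practice a library such as Librosa provides an equivalent pipeline in a single call.</p>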
      <p>The author's idea is to use an efficient Voice Activity Detection (VAD) to determine the frames with presence or absence of a speech component, so that the MFCCs are computed exactly on speech. On the assumption that the speech component lies mostly in the 0−3400 Hz range while noise is distributed over higher frequency bands, the author proposes computing the ratio of spectral energy between the 0−3400 Hz and 3400−22100 Hz bands.</p>
      <p>E = (1/N) ∑_{n=0}^{N−1} x[n]² (1)</p>
      <p>E_dB = 10 log₁₀(E + ε) (2)</p>
      <p>R = E_{0−3400} / E_{3400−22100} (3)</p>
      <p>speech if R > θ, noise if R ≤ θ (4)</p>
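      <p>A compact sketch of the frame decision in Eqs. (1)-(4); the threshold value θ = 1 and the ε constant are illustrative assumptions.</p>
      <preformat>
```python
import numpy as np

def frame_energy_db(frame, eps=1e-10):
    # Eqs. (1)-(2): mean squared normalized samples, converted to decibels.
    return 10.0 * np.log10(np.mean(frame ** 2) + eps)

def is_speech(frame, sr=44100, split_hz=3400.0, theta=1.0):
    # Eq. (3): ratio of spectral energy below and above 3400 Hz.
    spec = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    high_mask = freqs > split_hz
    high = spec[high_mask].sum() + 1e-10
    low = spec[~high_mask].sum()
    # Eq. (4): classify the frame as speech when the ratio exceeds theta.
    return bool(low / high > theta)
```
      </preformat>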
      <p>The scheme for determining speech/noise frames is illustrated in Figure 1. To avoid false VAD triggers when only ambient noise is present, a reasonable threshold must be estimated. The author's procedure can be expressed as:</p>
      <p>1) Select the first 5-10 frames at start-up, assuming that no speech component is present during this period.</p>
      <p>2) Compute acoustic features such as spectral energy and MFCCs.
3) Estimate the properties of the surrounding noise by averaging the above parameters.
4) Set adaptive thresholds based on the background noise, and save the results in a global VAD variable used during the entire signal processing.</p>
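      <p>Steps 1-4 above can be sketched as follows; the 3 dB margin over the estimated noise floor is an illustrative assumption.</p>
      <preformat>
```python
import numpy as np

def calibrate_vad_threshold(startup_frames, margin_db=3.0, eps=1e-10):
    # Steps 1-3: the first 5-10 frames are assumed noise-only; average
    # their dB energies to estimate the background-noise properties.
    energies = [10.0 * np.log10(np.mean(f ** 2) + eps) for f in startup_frames]
    noise_floor = float(np.mean(energies))
    # Step 4: derive an adaptive threshold, to be stored in a global
    # VAD variable and reused for the entire signal processing.
    return noise_floor + margin_db

rng = np.random.default_rng(1)
startup = [0.01 * rng.standard_normal(1024) for _ in range(8)]
VAD_THRESHOLD = calibrate_vad_threshold(startup)
```
      </preformat>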
      <p>During speech frames, the pre-emphasis step of the MFCC pipeline is performed according to the equation:</p>
      <p>y[n] = x[n] − α·x[n−1] (5)</p>
      <p>where y[n] is the output, x[n] is the input, x[n−1] is the previous input sample, and the pre-emphasis parameter is α = 0.9.</p>
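      <p>Equation (5) reduces to a one-line NumPy filter; keeping the first sample unchanged is a common convention and an assumption here.</p>
      <preformat>
```python
import numpy as np

def pre_emphasis(x, alpha=0.9):
    # Eq. (5): y[n] = x[n] - alpha * x[n-1], with y[0] = x[0].
    return np.append(x[0], x[1:] - alpha * x[:-1])
```
      </preformat>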
      <p>With this scheme, the MFCCs are computed precisely over the entire recording, which improves the accuracy of deepfake detection.</p>
    </sec>
    <sec id="sec-3">
      <title>3. The author’s proposed method and other classification models</title>
      <p>Convolutional Neural Networks (CNNs) have been commonly implemented for detection tasks due to their capability to process spatial dependencies in audio spectrograms. Researchers have designed CNN architectures for extracting discriminative features from observed audio signals.</p>
      <p>Long Short-Term Memory (LSTM) networks, an efficient technique for modelling temporal dependencies in sequential data, are well suited to analyzing and processing audio signals. LSTMs are adept at learning long-range dependencies and can be applied to various tasks such as speech recognition, source separation, VAD and deepfake audio detection.</p>
      <p>To address audio deepfake detection, Generative Adversarial Networks (GANs) have been used for training detection models by generating synthetic audio samples.</p>
      <p>The Multi-Layer Perceptron (MLP) is an efficient solution for classification problems. A multilayer perceptron can, through its layers, effectively extract the relevant features from data and tune the parameters of the model for optimal predictions. An MLP model has at least three levels: an input layer, a hidden layer of computation nodes and an output layer of processing nodes.</p>
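      <p>The three-level structure described above amounts to the following forward pass; the layer sizes and the ReLU/softmax choices are illustrative assumptions, not the paper's configuration.</p>
      <preformat>
```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = 0.1 * rng.standard_normal((64, 16)), np.zeros(16)  # input: 64 MFCCs
W2, b2 = 0.1 * rng.standard_normal((16, 2)), np.zeros(2)    # output: real/fake

def mlp_forward(x):
    h = np.maximum(x @ W1 + b1, 0.0)       # hidden layer of computation nodes
    logits = h @ W2 + b2                   # output layer of processing nodes
    e = np.exp(logits - logits.max())
    return e / e.sum()                     # class probabilities via softmax
```
      </preformat>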
      <p>The author's idea is to use the optimized MFCC features to increase the effectiveness of audio deepfake detection with the above ML algorithms. The final decision is made by majority vote over the obtained results. The author uses the Librosa library [9] to load real/fake audio files, and each audio recording is divided into small 1-second slices. In this stage, the MFCC coefficients are computed by applying the fast Fourier transform, logarithm and discrete cosine transform.</p>
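      <p>The slicing step can be sketched as below. Loading the file itself would use librosa.load, so only the plain-NumPy slicing is shown; discarding a trailing partial slice is an assumption.</p>
      <preformat>
```python
import numpy as np

def one_second_slices(y, sr):
    # Split a waveform into consecutive non-overlapping 1-second slices,
    # discarding any incomplete slice at the end of the recording.
    return [y[i:i + sr] for i in range(0, len(y) - sr + 1, sr)]
```
      </preformat>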
      <p>The proposed system (ProSys + MFCC) is presented in Figure 2.</p>
      <p>As an individual model operates on 1-second audio segments, the overall predicted probability for a given recording is determined by averaging the estimated probabilities over the entire audio file. Denote p(i) = [p₁(i), p₂(i), ..., p_C(i)] the probability vector of the i-th of the L one-second segments in one audio file, where C is the number of categories. The probability for the whole recording is obtained by averaging the per-segment classification probabilities:</p>
      <p>p̄ = (1/L) ∑_{i=1}^{L} p(i) (6)</p>
      <p>For the ensemble of results from different models, the author evaluated each of the K individual models, obtaining per-model predicted probabilities p̄₁, p̄₂, ..., p̄_K. The predicted probability after MEAN fusion is derived by:</p>
      <p>p̂ = (1/K) ∑_{k=1}^{K} p̄_k (7)</p>
      <p>Finally, the predicted label ŷ is the category with the highest fused probability:</p>
      <p>ŷ = argmax_c p̂_c (8)</p>
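      <p>The two-level averaging of Eqs. (6)-(8) can be sketched as follows; the two-class shapes in the usage are illustrative assumptions.</p>
      <preformat>
```python
import numpy as np

def file_probability(segment_probs):
    # Eq. (6): average the per-segment probabilities over the L slices.
    return np.asarray(segment_probs).mean(axis=0)

def mean_fusion(model_probs):
    # Eq. (7): MEAN fusion of the K individual models' file probabilities.
    return np.asarray(model_probs).mean(axis=0)

def predict_label(fused_probs):
    # Eq. (8): final label = category with the highest fused probability.
    return int(np.argmax(fused_probs))
```
      </preformat>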
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <p>In this experiment, the author evaluated the suggested models on the Logical Access dataset of the ASVspoof 2019 challenge, whose fake audio samples were generated by AI-based generative systems. Logical Access is divided into three subsets, ‘Train’ (22800/2580), ‘Develop’ (22296/2548) and ‘Evaluation’ (63882/7355) (fake samples/real samples). The author used the ‘Train’ subset for training the models, validated on the ‘Develop’ subset, and finally tested the models on the ‘Evaluation’ subset. The results were reported through Equal Error Rate (EER), Accuracy, F1 score and AUC score. The author's approach using the optimized MFCCs has the advantage of improving the capability of fake audio detection.</p>
      <p>The evaluation compares ProSys + optMFCC against each individual ML algorithm using conventionally computed MFCCs.</p>
      <p>The model ProSys + optMFCC, with optimized MFCC coefficients and an ensemble of all ML algorithms, has shown increased performance in Accuracy, F1 score, AUC and EER. With an appropriate VAD, the author's method can determine whether a frame contains a speech component and so compute the necessary MFCC coefficients exactly. With the individual ML algorithms, the input frame is not smoothed, so the MFCC features do not capture all the characteristics of the human voice, which play an important role in the decision of detecting fake audio samples. Consequently, ProSys + optMFCC gave better Accuracy, F1 score and AUC. Besides, the ensemble method improved the final result of ProSys + optMFCC, which achieved a promising EER of 0.08 in comparison with CNN (0.15), RNN (0.14) and GAN (0.17). These findings confirm that ensembling diverse features from multiple spectrogram-based ML algorithms substantially improves the overall evaluation compared to a single technique.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>This paper has described an efficient approach based on optimized calculation of the MFCC coefficients and on ensemble techniques over spectrogram features. The appealing property of the author's approach is utilizing VAD to select exactly the frames containing a speech component, smoothing them and calculating the first 64 coefficients. Besides, an effective ensemble technique based on spectral features achieved better accuracy and EER on the ASVspoof 2021 database. The numerical results have confirmed the ability of the suggested system to address many complex problems.</p>
      <p>Declaration on Generative AI</p>
      <p>The author(s) have not employed any Generative AI tools.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Abbasi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. R. R.</given-names>
            <surname>Javed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Yasin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Jalil</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Kryvinska</surname>
          </string-name>
          and
          <string-name>
            <given-names>U.</given-names>
            <surname>Tariq</surname>
          </string-name>
          ,
          <article-title>"A Large-Scale Benchmark Dataset for Anomaly Detection and Rare Event Classification for Audio Forensics,"</article-title>
          <source>in IEEE Access</source>
          , vol.
          <volume>10</volume>
          , pp.
          <fpage>38885</fpage>
          -
          <lpage>38894</lpage>
          ,
          <year>2022</year>
          , doi: 10.1109/ACCESS.2022.3166602.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A. R.</given-names>
            <surname>Javed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Ahmed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Alazab</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Jalil</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kifayat</surname>
          </string-name>
          and
          <string-name>
            <given-names>T. R.</given-names>
            <surname>Gadekallu</surname>
          </string-name>
          ,
          <article-title>"A Comprehensive Survey on Computer Forensics: State-of-the-Art, Tools, Techniques, Challenges, and Future Directions,"</article-title>
          <source>in IEEE Access</source>
          , vol.
          <volume>10</volume>
          , pp.
          <fpage>11065</fpage>
          -
          <lpage>11089</lpage>
          ,
          <year>2022</year>
          , doi: 10.1109/ACCESS.2022.3142508.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>F.</given-names>
            <surname>Tom</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Jain</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Dey</surname>
          </string-name>
          , “
          <article-title>End-to-end audio replay attack detection using deep convolutional networks with attention,”</article-title>
          <source>in Proc. Interspeech</source>
          , Hyderabad,
          <year>2018</year>
          , pp.
          <fpage>681</fpage>
          -
          <lpage>685</lpage>
          . DOI: 10.21437/Interspeech.2018-2279.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Pradhan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Baig</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Qiu</surname>
          </string-name>
          , “
          <article-title>Combating replay attacks against voice assistants,”</article-title>
          <source>Proc. ACM Interact., Mobile, Wearable Ubiquitous Technol.</source>
          , vol.
          <volume>3</volume>
          , no.
          <issue>3</issue>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>26</lpage>
          , Sep.
          <year>2019</year>
          . DOI: 10.1145/3351258.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Villalba</surname>
          </string-name>
          and
          <string-name>
            <given-names>E.</given-names>
            <surname>Lleida</surname>
          </string-name>
          ,
          <article-title>"Preventing replay attacks on speaker verification systems,"</article-title>
          <source>in Proc. 2011 Carnahan Conference on Security Technology</source>
          , Barcelona, Spain,
          <year>2011</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          , doi: 10.1109/CCST.2011.6095943.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] T. Kinnunen, M. Sahidullah, H. Delgado, M. Todisco, N. Evans, J. Yamagishi, and K. A. Lee, “The ASVspoof 2017 challenge: Assessing the limits of replay spoofing attack detection,” in Proc. 18th Annu. Conf. Int. Speech Commun. Assoc., 2017, pp. 2–6.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] J. Frank and L. Schönherr, “WaveFake: A data set to facilitate audio deepfake detection,” 2021, arXiv:2111.02813.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] M. Hassaballah, M. A. Hameed, and M. H. Alkinani, “Introduction to digital image steganography,” in Digital Media Steganography: Principles, Algorithms, and Advances, pp. 1-15, doi: 10.1016/B978-0-12-819438-6.00009-8.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>