<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Interpretable Deepfake Voice Detection: A Hybrid Deep-Learning Model and Explanation Evaluation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jacob LaRock</string-name>
          <email>jacoblarock@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Md Shajalal</string-name>
          <email>md.shajalal@uni-siegen.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gunnar Stevens</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Bonn-Rhein-Sieg University of Applied Sciences</institution>
          ,
          <addr-line>Sankt Augustin</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Late-breaking work</institution>
          ,
          <addr-line>Demos and Doctoral Consortium, colocated with The 3rd World Conference on eXplainable Artificial Intelligence:</addr-line>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Siegen</institution>
          ,
          <addr-line>Siegen</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>With the unprecedented advancement of Generative Artificial Intelligence (GenAI), the threat of voice scams using synthetic voices has become a serious concern across various sectors. Recent efforts have focused on identifying fake voices through handcrafted features, deep learning models, and hybrid approaches. However, most existing methods lack explainability, rendering their predictions non-transparent to users. This paper proposes a novel, interpretable, and transparent method for fake voice identification by introducing a hybrid deep learning model that leverages multiple extracted features. The hybrid model consists of two main components: the first component addresses heterogeneous feature spaces by employing deep convolutional sub-models tailored to individual features, while the second component, the terminus model, utilizes the concatenated representations from the final layers of each sub-model as input. The terminus model follows a typical multi-layer perceptron architecture, enabling effective integration and classification of the diverse feature representations. To enhance interpretability, we decompose the model's decisions using Local Interpretable Model-agnostic Explanations (LIME), taking advantage of the identical feature representation before the concatenation layers to address challenges related to multidimensional feature representations. To evaluate the features and assess the quality of the generated explanations, we propose two metrics: importance and trust. Extensive experiments are conducted on the In-the-Wild dataset, which is designed to test the generalization capability of synthetic audio detection methods. The experimental results demonstrate that our approach achieves performance comparable to benchmark methods. Furthermore, the results based on our proposed metrics indicate that certain perceptible features demonstrate promise for generating explanations that are meaningful to general users. For reproducibility, the source code for these experiments is available in the following repository: https://github.com/jacoblarock/fake_voices_xai</p>
      </abstract>
      <kwd-group>
        <kwd>Explainable AI (XAI)</kwd>
        <kwd>Fake Voice Detection</kwd>
        <kwd>Hybrid Model</kwd>
        <kwd>DeepFake Detection</kwd>
        <kwd>Explanation Evaluation</kwd>
        <kwd>Metrics</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Since the use of generative methods for creating synthetic voices has become more widespread, the
need for reliable and usable detection methods to protect the security of individuals and businesses has
grown. In particular, the rise of deepfake technology has raised concerns about its potential misuse in
areas such as politics, entertainment, and national security. For instance, malicious actors could exploit
this technology to create fake audio recordings that appear to be genuine statements made by public
figures or to fabricate recordings of events that never occurred. Such misuse could have significant
consequences, including the spread of misinformation, defamation of individuals, and the escalation of
political tensions [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>
        There has been considerable attention on identifying audio deepfakes using classical and sophisticated
deep learning models [
        <xref ref-type="bibr" rid="ref2 ref3 ref4">2, 3, 4, 5</xref>
        ]. Methods for fake voice identification can be broadly categorized
into two classes: methods with handcrafted feature extraction and methods with end-to-end deepfake
detectors [6]. In the former, detection approaches first extract various features from the voices. These
high-dimensional feature values are then passed through complex deep learning models to determine
whether the voice is real or synthesized (i.e., fake). End-to-end fake voice identification methods
generally optimize the feature extraction process and classification task jointly. Both methods, however,
have shown promise in the field to produce reliable results.
      </p>
      <p>The field of Explainable AI (XAI) shows promise for producing useful and interpretable results from
such models. Explainable AI refers to the ability of artificial intelligence systems, such as machine
learning models and neural networks, to provide understandable and interpretable explanations for
their decisions or predictions. This means that XAI systems can articulate why they made a particular
decision, what factors influenced it, and how confident they are in their conclusion.</p>
      <p>In this work, we introduce a hybrid deep learning model leveraging perceptible features extracted
from voice signals in order to use those features in interpretable explanations. The dimensions of each
perceptible feature are not the same. Therefore, our hybrid model consists of a component sub-model for
each input feature to address the problem of varying dimensionality of the extracted features without the
drawbacks of costly transformations in the pre-processing phase. The concatenation of the sub-models’
outputs is followed by a terminus model. The terminus model, which uses the concatenated outputs of
the individual sub-models as inputs, distills its inputs into a singular output using a classic multi-layer
perceptron architecture, allowing for an effective combination of the features into a final classification
result.</p>
      <p>To introduce explainability into our method, we generate explanations using the output of each
submodel before the concatenation layers and then apply Local Interpretable Model-agnostic Explanations
(LIME) [7] on the terminus model. This allows us to assess the impact of each input feature on the
final result through local explanations. Due to the independence of the sub-models, the importance
of their outputs directly correlates with the importance of their inputs for the end classification made
by the terminus model. We then introduce two metrics: trust and importance. These metrics allow
for an aggregate evaluation of a large number of generated LIME explanations on a per-feature basis,
enabling us to assess the usefulness of the features within a given feature set for producing useful and
understandable explanations. We conducted a wide range of experiments to evaluate the performance
of our hybrid audio deepfake detection approach on the In-The-Wild dataset. The experimental results
demonstrate that our model performs effectively compared to state-of-the-art methods. The evaluation
of the generated explanations using our proposed metrics also identifies which features have the highest
impact on the model’s overall predictions. The contributions of this paper are twofold: i) we propose an
interpretable hybrid deep learning model to identify synthetic voices using perceptible features, and
ii) we propose two metrics to perform aggregate evaluations of many explanations from our
LIME-based method and to assess how useful the average explanations are.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        The most common approach in this domain involves using either learned or hand-crafted imperceptible
features. Examples include spectrographic features such as mel-spectrograms and their hand-crafted
derivatives, such as mel-frequency cepstral coefficients (MFCCs), both of which are widely used due to
their effectiveness. Anagha et al. [8] utilized mel-spectrograms in combination with a convolutional
neural network (CNN)-based architecture, achieving strong performance on the ASVSpoof2019 dataset [5].
Müller et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] introduced the In-the-Wild dataset to assess the generalizability of deepfake voice
detection methods. Their evaluation of various features and architectures revealed that existing methods
struggle with generalization. Yang et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] conducted a comparative analysis of multiple features
and a feature selection method aimed at improving model efficiency. Their experiments, conducted
on ASVSpoof2019 [5], ASVspoof2021 [9], and In-the-Wild [
        <xref ref-type="bibr" rid="ref3">3</xref>
] datasets, demonstrated the impact of different
features on model performance and emphasized the benefits of feature selection and classification fusion
techniques. Ranjan et al. [10] proposed a deep convolutional network designed for both spoof detection
and source identification. Their evaluation across ASVSpoof2019 [5], the FOR-Norm dataset [11], and
the In-the-Wild dataset [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] showed high accuracy in individual dataset evaluations. However,
cross-dataset evaluations revealed limitations in generalizability. Yi et al. [12] provided a comprehensive
comparison of features and model architectures for synthetic voice detection. Their study, conducted
on ASVSpoof2019 [5] and In-the-Wild [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] datasets, reinforced the observation that existing methods
struggle with generalization across datasets.
      </p>
      <p>Although less explored, several studies have investigated the use of perceptible features for detecting
synthetic voices. Some of these features were incorporated into our experiments. Barrington et
al. [13] examined the potential of perceptible features in deepfake audio classification, emphasizing
their role in improving explainability. However, while they implemented a classifier, they did not
develop an explainer. Their findings indicated a performance drop when using perceptible features
compared to imperceptible hand-crafted and deep-learning-based features. Chaiwongyen et al. [14, 15]
explored perceptible feature-based classification. Their initial perceptron model, trained and tested
on the ADD2022 Challenge dataset [16], showed limited performance in 2022. However, with an
expanded feature set in 2023, they achieved improved results. Li et al. [17] investigated a hybrid
approach combining perceptible and imperceptible (referred to as "physical" in their work) features.
Their experiments, conducted using various neural networks on the ADD2022 Challenge dataset [16],
demonstrated that integrating both feature types yielded the best performance, outperforming models
trained on only one type of feature.</p>
      <p>Efforts have also been made to integrate explainability into synthetic voice detection models. Ge
et al. [18] applied the SHAP (SHapley Additive exPlanations) method to analyze feature influence in
deepfake audio detection. Using log-scaled power spectrograms as input features, they trained and
tested their model on the ASV2019 dataset [5]. Their approach enabled graphical representation of
feature importance on spectrograms and a global summary of SHAP values. Haq et al. [19] proposed an
explainable approach by leveraging emotional state changes as input features. Their method visually
represented "unlikely" emotional shifts to enhance interpretability for end users. By combining fake
video and fake audio classifiers, they produced a final classification for video samples with audio. Their
model, tested on the Presidential Deepfake dataset, achieved superior results compared to existing
benchmarks. Also relevant are other hybrid deep-learning approaches. Concatenated sub-networks
have been explored in some works, such as concept-based models [20] and neural additive models [21].</p>
    </sec>
    <sec id="sec-3">
      <title>3. Method</title>
      <p>Our proposed fake voice identification method consists of three major components: feature
extractors, a hybrid detection model, and the generation of explanations with a modified LIME procedure.
After extracting the features, we pass them individually through sub-models to overcome the
multidimensionality problem. The concatenated outputs from the output layers of the sub-models
are then passed through the terminus model. For generating explanations, we make use of the feature
values returned from the output layers of the sub-models, which share the same representation for every feature.</p>
      <sec id="sec-3-1">
        <title>3.1. Feature Extraction</title>
        <p>The extracted features from audio samples can be categorized into two different classes: perceptible
and imperceptible features. Perceptible features can be perceived by the human ear; they are often vocal
qualities such as jitter, shimmer, or pitch fluctuation and have a wide range of uses beyond audio
classification, such as the diagnosis of disease [15]. Imperceptible features are typically outside the range
of human hearing and may not directly reflect a vocal quality; they are otherwise referred to as
“speaker-independent” features [22]. In order to increase the likelihood that the explanations are
useful and understandable to the end user, we focused on using multiple perceptible features as input
to the classifier. However, we also used two imperceptible features with the hypothesis, based on previous
research [13, 14, 15, 17], that they would increase model performance.</p>
        <p>The perceptible features we extracted are the following: harmonic-to-noise ratios (HNRs),
fundamental frequency lengths (f0 lengths), onset strengths, intensities, pitch fluctuations, jitter, and
shimmer. The imperceptible features we extracted for our analysis are the mel-spectrogram and
its derivative, the mel-frequency cepstral coefficients (MFCCs). The features have varying dimensionalities,
from vectors to matrices of varying size.</p>
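        <p>For illustration, a minimal sketch of this extraction step is given below, assuming librosa is used to compute a subset of the listed features; the exact routines, parameters, and sliding-window handling may differ, so the snippet should be read as an illustrative approximation rather than the exact pipeline.</p>
        <preformat>
# Hedged sketch: extracting a subset of the listed features with librosa.
# Parameters and feature names are illustrative, not the exact pipeline.
import librosa

def extract_features(path, sr=16000, n_mfcc=20):
    y, sr = librosa.load(path, sr=sr)
    return {
        # imperceptible features
        "mel_spectrogram": librosa.feature.melspectrogram(y=y, sr=sr),
        "mfcc": librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc),
        # perceptible features (simplified stand-ins)
        "onset_strength": librosa.onset.onset_strength(y=y, sr=sr),
        "f0": librosa.yin(y, fmin=65, fmax=400, sr=sr),
        "intensity": librosa.feature.rms(y=y)[0],
    }  # each entry has its own shape (vector or matrix)
        </preformat>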
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Hybrid Fake Voice Detection Model</title>
        <p>We propose an architecture for our hybrid fake voice detection model that allows for a combined
classification based on features of different dimensionality, or in other words, regardless of whether
the individual features have the same shape. To achieve this, we hypothesize that we can make use
of individual, separate models for each feature, which we further refer to as sub-models, in a way
resembling concept-based models. The main objective of having individual sub-models is to
obtain dimensionally similar representations of the features so that we can use them for generating
understandable explanations. The structure of each sub-model can differ based on the dimension of the
extracted feature. For example, the dimensions of the convolution layers and max pooling layers
differ depending on the dimension of the feature. Sub-models may also omit the pooling layers
altogether, in order to have a structure based purely on convolution. We hypothesize that the use
of convolution within these sub-models will increase the localized pattern detection capability of our
method. After the processing within the sub-models, we then need to further process and distill
the results into a singular output. To achieve this, following the concatenation of the outputs of all
sub-models, we propose a terminus model. With the goal of maximizing performance, we tried various
structures for the terminus model, including a single-layer perceptron, convolutional neural networks,
and a multi-layer perceptron.</p>
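        <p>A minimal sketch of this architecture in Keras is shown below; the feature shapes, layer counts, and kernel sizes are illustrative assumptions rather than the exact configuration used in the experiments.</p>
        <preformat>
# Hedged sketch: one convolutional sub-model per feature, concatenated into
# an MLP terminus. Shapes and layer sizes are illustrative assumptions.
from tensorflow.keras import layers, Model

def make_submodel(input_shape, name):
    inp = layers.Input(shape=input_shape, name=f"{name}_in")
    if len(input_shape) == 2:                      # matrix-shaped feature (e.g. MFCCs)
        x = layers.Reshape((*input_shape, 1))(inp)
        x = layers.Conv2D(16, 3, activation="relu")(x)
        x = layers.MaxPooling2D()(x)               # pooling may also be omitted
    else:                                          # vector-shaped feature (e.g. jitter)
        x = layers.Reshape((input_shape[0], 1))(inp)
        x = layers.Conv1D(16, 3, activation="relu")(x)
    x = layers.Flatten()(x)
    return inp, layers.Dense(8, activation="relu", name=f"{name}_out")(x)

feature_shapes = {"mfcc": (20, 100), "mel": (128, 100), "jitter": (50,)}  # illustrative
inputs, outputs = zip(*(make_submodel(s, n) for n, s in feature_shapes.items()))
x = layers.Concatenate()(list(outputs))
x = layers.Dense(64, activation="relu")(x)         # MLP terminus with hidden layers
x = layers.Dense(32, activation="relu")(x)
model = Model(list(inputs), layers.Dense(1, activation="sigmoid")(x))
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
        </preformat>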
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Explaining the prediction</title>
        <sec id="sec-3-3-1">
          <title>3.3.1. Generation of the Explanations</title>
          <p>We chose the LIME method for its ease of implementation as well as its proven performance
on tabular data, which is the most relevant setting for creating explanations for our method. Because
the sub-models are separate from one another and have no influence on each other before being
concatenated at the terminus, only the terminus part of the model is relevant for assessing the weights
of the features in a single evaluation. Therefore, we make use of the processed features’ values after the
output layers of the individual sub-models. This does, however, pose two challenges: i) there are multiple
input rows per sample because a sliding window is used across each audio sample, and ii) the input
features cannot be used directly for assessment at the terminus input layer.</p>
          <p>We solved the first problem by taking the mean, over all rows of a sample, of the weights returned
by the LIME explainer for each feature. This yields the average contribution of each feature to the
classification of the model; since the end classification of the surrogate is also a mean, the averaged
weights from the LIME explainer represent aggregated influences of the features. To address the second
problem, we compute, for each input feature, the output of the respective sub-model. This transforms
the input of every sub-model into the vector-shaped input of the terminus model for the given sample.
After the processing of the sub-models, we capture the representation before it is processed by the
terminus model, which allows the explainer to assign a weight to each input of the terminus model in
relation to the end prediction. We generate the sub-model results for a random sample of the training
set to use as a reference for the localized model approximations from the LIME explainer. We then
summarize the results of the LIME explainer on a per-feature basis and normalize them to produce a
decimal number between -1 and 1 for every feature. A negative value implies that the given feature
pushed the result of the model in the negative direction (i.e., not fake), while a positive value indicates
the opposite.</p>
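          <p>The sketch below outlines this procedure with the LIME tabular explainer, assuming the trained hybrid model has been split into per-feature sub-models and a terminus model; variable names and aggregation details are illustrative.</p>
          <preformat>
# Hedged sketch: LIME applied to the terminus model on concatenated
# sub-model outputs. `submodels` (dict: feature name to Keras sub-model),
# `terminus`, and `reference_sample` are assumed to exist; names are illustrative.
import numpy as np
from lime.lime_tabular import LimeTabularExplainer

def to_representation(sample_features):
    # concatenate each sub-model's output for every sliding-window row
    parts = [submodels[name].predict(sample_features[name]) for name in submodels]
    return np.concatenate(parts, axis=1)           # shape: (n_windows, rep_dim)

def terminus_proba(rep_rows):
    p = terminus.predict(rep_rows).reshape(-1)
    return np.stack([1.0 - p, p], axis=1)          # LIME expects class probabilities

background = to_representation(reference_sample)   # random training-set reference
explainer = LimeTabularExplainer(background, mode="classification")

def explain_sample(sample_features):
    rows = to_representation(sample_features)
    weights = np.zeros(rows.shape[1])
    for row in rows:                               # one local explanation per window
        exp = explainer.explain_instance(row, terminus_proba, num_features=rows.shape[1])
        for idx, w in exp.as_map()[1]:
            weights[idx] += w
    return weights / len(rows)                     # mean weight per terminus input
          </preformat>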
        </sec>
        <sec id="sec-3-3-2">
          <title>3.3.2. Evaluation of the Explanations</title>
          <p>In order to assess the potential of this method for generating explanations, we needed a
way to measure the influence of each feature on the average explanation generated using this method.
For context, we make the following definitions: $S$ is the set of samples, with $s \in S$ a sample; $F$ is
the set of features, with $f \in F$ a feature; $w_{s,f}$ is the weight of feature $f$ in sample $s$ as produced
by the explainer; $l_s \in \{0, 1\}$ is the correct label for sample $s$. The first metric we propose is the mean
of the absolute values of the weights of each feature within a set of explanations. We refer to this
metric as importance, and define it for every feature $f \in F$ extracted from
the samples in $S$ as $I(f, S) = \frac{1}{|S|} \sum_{s \in S} |w_{s,f}|$. The second metric is the average aggregate
correctness on a per-feature basis, which we refer to as trust. The trust of feature $f \in F$
extracted from the samples in $S$ is defined as $T(f, S) = \frac{1}{|S|} \sum_{s \in S} w_{s,f}\,(2 l_s - 1)$.</p>
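          <p>A minimal sketch of both metrics is given below, assuming the per-sample explainer weights have already been aggregated per feature; variable names are illustrative.</p>
          <preformat>
# Hedged sketch of the proposed metrics. `weights` is a list of dicts
# (one per sample, mapping feature name to aggregated LIME weight) and
# `labels` holds the correct binary labels (1 = fake); names are illustrative.
import numpy as np

def importance(weights, feature):
    # mean absolute weight of `feature` across all samples
    return np.mean([abs(w[feature]) for w in weights])

def trust(weights, labels, feature):
    # mean signed weight, aligned with the correct label (2*l - 1 maps {0, 1} to {-1, +1})
    return np.mean([w[feature] * (2 * l - 1) for w, l in zip(weights, labels)])
          </preformat>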
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experimental Results and Evaluation</title>
      <p>
Dataset. We trained and tested our model using the In-the-Wild dataset. This dataset focuses on
the generalization of audio deepfake detection models by collecting real-world data, in contrast
to other research that used more controlled laboratory conditions [
        <xref ref-type="bibr" rid="ref3">3</xref>
]. We chose it for these
experiments because of this focus on generalization, its relative recency, and the fact that it has been
used in previous experiments, which provides a useful point of comparison for our method.
      </p>
      <p>Experimental Results. We present experimental results for all variations of our proposed model
in Table 1. The table details the following information: the type of terminus model used, where
“perceptron” does not include hidden layers, MLP(3) is a multi-layer perceptron model with three hidden
layers, and CNN(3) is a convolutional neural network with three hidden layers; the feature set used,
where “standard” is the feature set as described above, including only one pitch-fluctuation feature,
and “expanded pitch fluctuations” combines the same feature set with additional pitch-fluctuation
features with different comparison distances; the training batch size and the number of epochs per
batch if there was more than one; followed by several performance metrics.</p>
      <p>We can see that the hybrid model with an MLP terminus trained on the standard feature set with the
first 100,000 samples achieved the highest accuracy (96.27%) and the best EER (0.03214). However, the model with
a perceptron terminus using the features with expanded pitch fluctuations performed
better than the other experimental settings in terms of AUC (0.92727). The other experimental settings
also performed comparably, except for those with terminus models based on convolutional neural networks.
Perceptron-based terminus models in all variations also performed effectively, with a 2-3% deviation
compared to the best-performing model. The performance comparison among state-of-the-art methods is
summarized in Table 2.</p>
      <p>Evaluating explanations generated by LIME. In order to obtain a comprehensive understanding
of the quality of the explanations generated by LIME, we selected the three best-performing models
and generated explanations for the first 500 samples of the testing data. These three models achieved
the best performance in terms of accuracy and Equal Error Rate (EER) or AUC. They include: a model
using the standard feature set with a terminus model containing three hidden layers, a model with
expanded pitch-fluctuation features and a perceptron terminus, which achieved the best AUC result,
and a model with expanded pitch-fluctuation features and a terminus containing three hidden layers. We
selected these models to evaluate the best-performing configurations and to observe how the inclusion
of expanded pitch-fluctuation features affects the evaluation of the explanations. We then aggregated
and summarized the explanations using the two previously introduced metrics: trust and importance. A
summary of the evaluation in terms of these metrics, grouped by feature, is presented in Table 3.</p>
      <p>According to both metrics, we can see in Table 3 that the MFCCs had the most positive and accurate
influence on the classification results, with the jitter features being a distant second in terms of trust
and importance. This indicates that the MFCCs not only had the greatest influence but also the most
reliable impact on the classification outcomes. In contrast, other features had little, no, or even negative
influence on the classification, depending on the experiment. However, the jitter and shimmer features
are notable as perceptible features with a positive trust value. It is possible that the consistent correctness
of the MFCC feature led to the other features having less influence over the course of training, thus
decreasing their importance metric even when the average contribution is positive.</p>
      <p>Viability and Potential of the Explanations. We hypothesized that explanations leveraging
perceptible features have high potential to enhance the interpretability of complex predictions. However,
as shown in Table 3, perceptible features contributed less to classification
performance than their imperceptible counterparts, despite their increased presence in the overall set
of input features. Two features—HNRs and mel-spectrograms—even consistently had a net-negative
impact on classification accuracy in our experiments, as measured by our trust metric. This leaves an
open question whether future experiments might perform better without these features. In the model
without expanded pitch fluctuations, four perceptible features—f0 lengths, onset strengths, intensities,
and pitch fluctuations—had an overall average influence of zero on classification across both metrics.
This suggests that the model did not learn a meaningful correlation between these features and the
authenticity of audio samples in the dataset. Although these features did have some impact on the
performance of the model with expanded pitch fluctuations, the effect remained minimal. For perceptible
features with a positive trust score, there remains potential for their use in generating understandable
explanations. Even though these features did not significantly influence the model’s predictions, they
still demonstrate a certain level of reliability. Nevertheless, we believe it is worth investigating whether
other types of perceptible features might exert a stronger influence on classification or whether
alternative imperceptible features—potentially replacing MFCCs—could achieve a better balance between
classification accuracy and proportional influence on the final result. Furthermore, we believe that this
model architecture, when paired with LIME and evaluated using the proposed metrics, has the potential
to serve as a foundation for future research.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>To conclude, we have presented a hybrid interpretable deep learning model that leverages a combination
of heterogeneous features, both perceptible and imperceptible. We hypothesized that such a model
could be used to generate explanations using LIME, which might be more useful for potential end
users. The experimental results demonstrated top-tier performance in accurately identifying fake voices
while mitigating the dimensionality problem in the input features. We observed that explanations
leveraging the representations just before the model’s final layer can generate technical insights. These
explanations might be useful for improving model performance by modifying important parameters
or as the basis for presenting an analysis of a sample to an end user, in the case where perceptible
features have an influence on the classification, so they can understand if audible flaws in the sample are
present. However, based on our proposed metrics, trust and importance, we did not observe a significant
influence or usefulness of perceptible features, although they did contribute to explanations to some
extent. A promising direction for future work would be integrating this method into different domains
where feature-based local explanations have the potential to be effective. In such application domains,
we could further validate both the accuracy and the usefulness of the generated explanations as well as
the selected features using our proposed metrics. Additionally, it would be valuable to compare findings
from a user study on the generated explanations with our metric-based evaluation.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This research has been funded by BMBF Germany through the AntiScam project (Defense against communication fraud), grant reference 16KIS2214.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>The author(s) have not employed any Generative AI tools.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>N.</given-names>
            <surname>Veerasamy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Pieterse</surname>
          </string-name>
          ,
          <article-title>Rising above misinformation and deepfakes</article-title>
          ,
          <source>in: International Conference on Cyber Warfare and Security</source>
          , volume
          <volume>17</volume>
          , Academic Conferences International Limited,
          <year>2022</year>
          , pp.
          <fpage>340</fpage>
          -
          <lpage>348</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>R.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Juefei-Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Xie</surname>
          </string-name>
          , L. Ma, Y. Liu,
          <article-title>DeepSonar: Towards effective and robust detection of AI-synthesized fake voices</article-title>
          ,
          <source>in: Proceedings of the 28th ACM international conference on multimedia</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>1207</fpage>
          -
          <lpage>1216</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>N. M.</given-names>
            <surname>Müller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Czempin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Dieckmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Froghyar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Böttinger</surname>
          </string-name>
          ,
          <article-title>Does audio deepfake detection generalize?</article-title>
          ,
          <source>arXiv preprint arXiv:2203.16263</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Qin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Guo</surname>
          </string-name>
          , K. Han,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>A robust audio deepfake detection system via multi-view feature</article-title>
          ,
          <source>in: ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source>
          , IEEE,
          <year>2024</year>
          , pp.
          <fpage>13131</fpage>
          -
          <lpage>13135</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] X. Wang, J. Yamagishi, M. Todisco, H. Delgado, A. Nautsch, N. Evans, M. Sahidullah, V. Vestman, T. Kinnunen, K. A. Lee, et al., ASVspoof 2019: A large-scale public database of synthesized, converted and replayed speech, Computer Speech &amp; Language 64 (2020) 101114.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] A. Dixit, N. Kaur, S. Kingra, Review of audio deepfake detection techniques: Issues and prospects, Expert Systems 40 (2023) e13322.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] M. T. Ribeiro, S. Singh, C. Guestrin, "Why should I trust you?" Explaining the predictions of any classifier, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 1135-1144.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] R. Anagha, A. Arya, V. H. Narayan, S. Abhishek, T. Anjali, Audio deepfake detection using deep learning, in: 2023 12th International Conference on System Modeling &amp; Advancement in Research Trends (SMART), IEEE, 2023, pp. 176-181.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] X. Liu, X. Wang, M. Sahidullah, J. Patino, H. Delgado, T. Kinnunen, M. Todisco, J. Yamagishi, N. Evans, A. Nautsch, et al., ASVspoof 2021: Towards spoofed and deepfake speech detection in the wild, IEEE/ACM Transactions on Audio, Speech, and Language Processing 31 (2023) 2507-2522.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] R. Ranjan, M. Vatsa, R. Singh, StatNet: Spectral and temporal features based multi-task network for audio spoofing detection, in: 2022 IEEE International Joint Conference on Biometrics (IJCB), IEEE, 2022, pp. 1-9.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] R. Reimao, V. Tzerpos, FoR: A dataset for synthetic speech detection, in: 2019 International Conference on Speech Technology and Human-Computer Dialogue (SpeD), IEEE, 2019, pp. 1-10.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] J. Yi, C. Wang, J. Tao, X. Zhang, C. Y. Zhang, Y. Zhao, Audio deepfake detection: A survey, arXiv preprint arXiv:2308.14970 (2023).</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] S. Barrington, R. Barua, G. Koorma, H. Farid, Single and multi-speaker cloned voice detection: From perceptual to learned features, in: 2023 IEEE International Workshop on Information Forensics and Security (WIFS), IEEE, 2023, pp. 1-6.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] A. Chaiwongyen, N. Songsriboonsit, S. Duangpummet, J. Karnjana, W. Kongprawechnon, M. Unoki, Contribution of timbre and shimmer features to deepfake speech detection, in: 2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), IEEE, 2022, pp. 97-103.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] A. Chaiwongyen, S. Duangpummet, J. Karnjana, W. Kongprawechnon, M. Unoki, Deepfake-speech detection with pathological features and multilayer perceptron neural network, in: 2023 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), IEEE, 2023, pp. 2182-2188.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] J. Yi, R. Fu, J. Tao, S. Nie, H. Ma, C. Wang, T. Wang, Z. Tian, Y. Bai, C. Fan, et al., ADD 2022: The first audio deep synthesis detection challenge, in: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2022, pp. 9216-9220.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] M. Li, Y. Ahmadiadli, X.-P. Zhang, A comparative study on physical and perceptual features for deepfake audio detection, in: Proceedings of the 1st International Workshop on Deepfake Detection for Audio Multimedia, 2022, pp. 35-41.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] W. Ge, J. Patino, M. Todisco, N. Evans, Explaining deep learning models for spoofing and deepfake detection with Shapley additive explanations, in: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2022, pp. 6387-6391.</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] I. U. Haq, K. M. Malik, K. Muhammad, Multimodal neurosymbolic approach for explainable deepfake detection, ACM Transactions on Multimedia Computing, Communications and Applications 20 (2024) 1-16.</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[20] P. W. Koh, T. Nguyen, Y. S. Tang, S. Mussmann, E. Pierson, B. Kim, P. Liang, Concept bottleneck models, in: H. D. III, A. Singh (Eds.), Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, PMLR, 2020, pp. 5338-5348.</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>[21] R. Agarwal, N. Frosst, X. Zhang, R. Caruana, G. E. Hinton, Neural additive models: Interpretable machine learning with neural nets, CoRR abs/2004.13912 (2020). arXiv:2004.13912.</mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>[22] X. Liu, Y. Tan, X. Hai, Q. Yu, Q. Zhou, Hidden-in-wave: A novel idea to camouflage AI-synthesized voices based on speaker-irrelative features, in: 2023 IEEE 34th International Symposium on Software Reliability Engineering (ISSRE), IEEE, 2023, pp. 786-794.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>