DeepAdversaryDefense: A Deep Model to Identify and Prevent Adversarial
Attacks against Medical Speech Recognition
Kirtee Panwar a, Akansha Singh a and Krishna Kant Singh b

a SCSET, Bennett University, Greater Noida, India
b ASET, Amity University, Noida, India

Abstract
Deep learning models have made significant progress in safety-critical environments such as healthcare systems, machine-learning based robots, Autonomous Intelligent Vehicles (AIV), aviation software, etc. Deep learning models learn from input data, but this very property has a drawback: the models can be easily affected by minor disturbances in input examples. Such inputs are generally crafted purposely by attackers and are known as adversarial examples. A small malicious change in the input can cause the model to generate incorrect output. The majority of works in the literature are directed towards understanding and generating adversarial attacks, yet most of these attacks do not effectively resist detection networks. On the other hand, adversarial example detectors have been evaluated inadequately. In this paper, a secure Medical Speech Recognition (MSR) system is proposed that can prevent malicious attacks: adversarial examples that pose security concerns are detected and filtered out. With the proposed model, malicious inputs designed to attack safety-critical applications are prevented even if the adversary has no access to the underlying model.

Keywords
Adversarial samples, Automatic Speech Recognition (ASR), Medical Speech Recognition (MSR), vulnerabilities, deep learning, security, adversarial loss.

1. Introduction

    The use of deep learning models has become part of our daily life, from organizing our searches to curating social media feeds. Recent advances in automatic speech recognition, machine learning and deep learning technologies have favored the advancement of speech-based conversational interfaces. This has further led to an increase in the interaction of such devices with various machine-critical applications. Machine learning-based speech recognition systems allow users to carry out essential and crucial activities, for either industrial processes or assisted living, using voice commands [1]. With the advancement of deep learning-based speech recognition systems and interfaces for essential applications, such as transcribing medical speech in healthcare, new attacks have also been developed, known as adversarial attacks.
    Deep models can achieve acceptable accuracy levels, but they have also been found to make mistakes surprisingly often. The literature shows that these models are vulnerable to well-designed input attacks known as adversarial examples. Such inputs make the model generate incorrect output with high confidence. For example, attackers feed adversarial inputs to automatic speech recognition models through sound sensors to obtain a desired target output, and the model produces unfavorable results on such malicious inputs. Various applications of automatic speech recognition, such as Microsoft Cortana, Amazon Alexa, and Apple Siri [2], demand security against such vulnerabilities. The adversarial inputs differ very slightly from the actual inputs drawn from a certain data distribution, yet they have the power to make the machine misclassify examples [3] irrespective of the model architecture and the datasets used for training, thus exposing blind spots of the training algorithms.

IDDM-2022: 5th International Conference on Informatics & Data-Driven Medicine, November 18–20, 2022, Lyon, France
EMAIL: Kirtee.panwar@bennett.edu.in (A. 1); akanshasing@gmail.com (A. 2); krishnaiitr2011@gmail.com (A. 3)
ORCID: 0000-0001-9842-1804 (A. 1); 0000-0002-5520-8066 (A. 2); 0000-0002-6510-6768 (A. 3)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)
    The primary cause of the vulnerability of machine learning models to adversarial attacks is their linear behavior in high-dimensional space [4]. This theory has further led to new methods of generating adversarial inputs for adversarial training, for security enhancement purposes. In domains such as traffic control, manufacturing and advanced automotive systems, the adversarial inputs have substantial dependencies on each other, which can be represented with features for the non-uniform disturbances generated at the output of the machine learning model during adversarial training [2].
    With adequate analysis of inputs, it is possible to classify certain input examples as adversarial by identifying the rules of the game between attacker and defender in practical scenarios [5].
    One of the possible ways to reduce the vulnerability of models is to enhance the scalability of the model against adversarial inputs [6]. In such cases, network-based detectors play a fundamental role in validating the security of the model. Designing attacks against distribution-based detectors, in order to validate the security of such detectors, is critical for security-related applications [7]. The design of adversarial samples [8] that can reduce the detection rate of distribution-based detection techniques helps in understanding the underlying problem of security against adversarial attacks.
    The majority of works in the literature are directed towards understanding and creating adversarial attacks. However, these attacks do not effectively resist detection networks [9]. On the other hand, adversarial example detectors have been evaluated inadequately [10]. There remains a research gap in understanding the construction of adversarial examples, which conflicts with the safety requirements of the ASR systems needed for safety-critical applications [11].
    Adversarial attacks can be untargeted, in which the objective of the attack is to degrade the network's performance, or targeted [12], where the aim is to make the model predict a target transcript. Adversarial attacks based on the CTC (Connectionist Temporal Classification) loss function [13] or on the task loss of the problem, e.g., the Houdini attack [14], degrade the model's performance. Adversarial attacks also include imperceptible attacks [15], in which the target transcripts are hidden, and the Fast Gradient Sign Method (FGSM) and Projected Gradient Descent (PGD) attacks [4], which generate adversarial examples by exploiting the optimization of the model's loss function.
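    For illustration, the following is a minimal sketch of an untargeted FGSM perturbation against a differentiable ASR model, written in PyTorch. The model, loss_fn and epsilon values are assumptions for the example and not the authors' implementation.

    # Minimal FGSM sketch (assumed PyTorch model returning logits); illustrative only.
    import torch

    def fgsm_attack(model, loss_fn, x, target, epsilon=0.01):
        """Perturb waveform x by epsilon * sign(grad of loss w.r.t. x)."""
        x_adv = x.clone().detach().requires_grad_(True)
        loss = loss_fn(model(x_adv), target)   # e.g. CTC loss on the logits
        loss.backward()
        # One signed-gradient step; the untargeted variant increases the loss.
        return (x_adv + epsilon * x_adv.grad.sign()).detach()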
    Pre-processing defenses or adversarial training alleviate the effect of adversarial attacks. Pre-processing defenses such as randomized smoothing, the WaveGAN vocoder, the variational auto-encoder (VAE), etc., eliminate the disturbances caused by adversarial examples before they enter the ASR system, but they are ineffective against adaptive attacks [16]. On the other hand, in defenses based on adversarial training, such as FGSM adversarial training and PGD adversarial training, the system's robustness against attacks is limited and requires critical tuning of parameters. The training time required for the model is another limiting factor [17].
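    As a rough illustration of the adversarial-training family of defenses mentioned above, one FGSM adversarial-training step could look as follows; this reuses the fgsm_attack sketch given earlier, and the optimizer setup is an assumption rather than the authors' procedure.

    # Sketch of one FGSM adversarial-training step (assumed PyTorch setup).
    def adversarial_training_step(model, loss_fn, optimizer, x, y, epsilon=0.01):
        x_adv = fgsm_attack(model, loss_fn, x, y, epsilon)   # craft adversarial batch first
        optimizer.zero_grad()
        # Train on clean and adversarial batches together.
        loss = loss_fn(model(x), y) + loss_fn(model(x_adv), y)
        loss.backward()
        optimizer.step()
        return loss.item()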
    In the proposed ASR model, the loss function incorporates frequency domain power spectrum verification and the distance between voice propagation angles as defenses to alleviate the effect of adversarial examples.
    The remainder of this paper is organized as follows. Section 2 describes the proposed methodology for developing a secure ASR system based on a deep learning approach. Section 3 presents experimental results and a comparison with other state-of-the-art techniques. The paper is finally concluded in Section 4 with some directions for future research.

2. Proposed Methodology

    The ASR module is used in the voice control system to enable humans to interact with machines. These modules are vulnerable to adversarial voice commands that cause the system to generate undefined output. For instance, a target transcription hidden by an attacker within an audio file below a certain threshold, imperceptible to humans but not to machines, can be interpreted as a command by the sensors. The medical ASR system accepts sounds from multiple sources such as mobile phones, passengers, etc. Malicious messages may come from any of these sources, introducing negligibly perceptible changes to obtain a desired target output from the ASR module. Another possible attack is the black-box attack, where the attacker has limited knowledge of the ASR model parameters. Such an attack is possible using the keyword recognition of the ASR system that is accessible. For instance, commercial medical ASR systems use keywords such as "I need urgent medical help", and modification of such keywords by attackers can lead to a desired target output.


[Figure 1 illustrates the attack: the utterance "I need urgent medical help" is transcribed correctly by the ASR system (ORIGINAL: "I need urgent medical help"), but after adversarial noise is added it is transcribed as "All is well no help needed" (AFTER ATTACK).]

Figure 1. Adversarial attack example.

   Medical ASR systems are used for critical applications and have a high demand for security
against adversarial attacks that can come from various sources such as a noisy environment,
loudspeakers, mobile phones, etc. For ASR systems, knowledge about adversarial perturbations is
limited and it is a challenge to defend medical ASR systems against such attacks. In this paper, a deep
learning model is proposed that can resist adversarial attacks. The proposed ASR system comprises a deep neural network and transcript generation.

2.1.       Deep Neural Network: DeepAdversaryDefense

       With the help of the proposed network, feature extraction and classification are performed simultaneously. The proposed deep neural network is designed to extract features in the manner of the human auditory system, as well as to classify malicious messages while passing original messages through. Classification is performed in the feature transformation network by incorporating frequency verification and speaker verification in the loss function of the network. The loss function incorporates:
            a) Frequency domain power spectrum verification.
            b) Distance between voice propagation angles.
   The proposed ASR system comprises a feature transformation network, a feature decoder network and a discriminator network. The architecture is similar to [19]. The block diagram of the proposed model is given in Fig. 2.




   Figure 2. Block Diagram of Proposed Model

    Feature transformation network. The input signal is represented in the form of a Mel spectrogram [20]; these features resemble a 2D image. This network consists of one convolution block followed by a ReLU activation function. Here, the frequency domain verification and the identification of the driver's voice are performed. The architecture of this network is given in Table 1. In the first layer, B feature blocks of the spectrogram image are obtained; these features are passed on only if the frequency check and the distance check are passed, otherwise no features are passed further. The block size is B = √N, where N × N is the size of the spectrogram image. These features are combined with a pixel shuffle layer [21] and reshaped. The pixel shuffle layer has been introduced for computational efficiency.

   Table 1. Architecture of feature transformation network

    Layer                    Kernel Size    Normalization      Activation Function
    Input                    B x B          Spectral, Batch    LReLU
    Pixel Shuffle Layer      -              -                  -
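   A minimal PyTorch sketch of this front-end is given below. It only illustrates the Mel-spectrogram input, the single convolution block of Table 1 and the pixel shuffle layer; the channel counts, kernel size B and upscale factor are assumptions for the example, not values from the paper.

    # Sketch of the feature transformation front-end (illustrative sizes only).
    import torch
    import torch.nn as nn
    import torchaudio

    mel = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_mels=128)

    class FeatureTransform(nn.Module):
        def __init__(self, b=5, upscale=2):
            super().__init__()
            # One convolution block with a B x B kernel (Table 1); 4*B output
            # channels so the pixel shuffle below can fold them into B maps.
            self.conv = nn.Conv2d(1, 4 * b, kernel_size=b, padding=b // 2)
            self.act = nn.LeakyReLU(0.2)           # LReLU as listed in Table 1
            # Pixel shuffle trades channels for spatial resolution cheaply [21].
            self.shuffle = nn.PixelShuffle(upscale)

        def forward(self, spec):                    # spec: (batch, 1, n_mels, time)
            return self.shuffle(self.act(self.conv(spec)))

    # Usage on one second of dummy audio.
    wave = torch.randn(1, 16000)
    spec = mel(wave).unsqueeze(1)                   # (batch, 1, n_mels, time)
    features = FeatureTransform()(spec)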

   Feature decoder network. The decoder converts the features to a transcript. We use one convolution block with an LReLU activation function for down-sampling of the features, followed by five ResNet blocks; the features from the ResNet blocks are concatenated and passed to a transposed convolution block with LReLU activation, followed by a convolution layer with tanh activation. Each ResNet block consists of two CNN layers, each followed by a normalisation technique. The architecture of the feature decoder network is given in Table 2.

   Table 2. Architecture of Feature Decoder Network

    Layer                     Kernel Size    Normalization      Activation Function
    Input                     3x3            Spectral, Batch    LReLU
    ResBlocks (5 times)       5x3            Spectral, Batch    LReLU
    TransposeConvolution      3x3            Spectral, Batch    LReLU
    Convolution               3x3            Spectral, Batch    Tanh
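   The following PyTorch sketch is one possible reading of Table 2; only the layer types, kernel shapes and the concatenation of the five ResNet block outputs come from the paper, while channel counts and the exact normalization placement are assumptions.

    # Sketch of the feature decoder network of Table 2 (assumed channel sizes).
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def sn_conv(cin, cout, k):
        # Convolution with spectral and batch normalization, LReLU (Table 2).
        pad = (k[0] // 2, k[1] // 2)
        return nn.Sequential(
            nn.utils.spectral_norm(nn.Conv2d(cin, cout, k, padding=pad)),
            nn.BatchNorm2d(cout),
            nn.LeakyReLU(0.2),
        )

    class ResBlock(nn.Module):
        def __init__(self, c):
            super().__init__()
            self.body = nn.Sequential(sn_conv(c, c, (5, 3)), sn_conv(c, c, (5, 3)))

        def forward(self, x):
            return x + self.body(x)

    class FeatureDecoder(nn.Module):
        def __init__(self, c=64):
            super().__init__()
            self.inp = sn_conv(1, c, (3, 3))
            self.blocks = nn.ModuleList(ResBlock(c) for _ in range(5))
            # Concatenated ResNet features -> transposed convolution -> tanh output.
            self.up = nn.ConvTranspose2d(5 * c, c, 3, padding=1)
            self.out = nn.Sequential(nn.Conv2d(c, 1, 3, padding=1), nn.Tanh())

        def forward(self, x):
            x = self.inp(x)
            feats = []
            for blk in self.blocks:
                x = blk(x)
                feats.append(x)
            return self.out(F.leaky_relu(self.up(torch.cat(feats, dim=1)), 0.2))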

   Discriminator network. The discriminator network classifies the signal as malicious or original based on the frequency domain power spectrum and the distance between voice propagation angles. It also guides the network to create realistic transcripts for a given input. The architecture of the discriminator consists of six convolution layers with spectral normalization, followed by a self-attention layer [22], another convolution layer, and a second self-attention layer. The output transcripts generated are meaningful due to the contextual information captured with the help of the self-attention layers. The final flattened features are passed to a linear layer with sigmoid activation. The architecture of the network is given in Table 3.
   Table 3. Architecture of Discriminator Network

    Layer                    Kernel Size    Normalization    Activation Function
    Input                    3x3            Spectral         LReLU
    Convolution              3x3            Spectral         LReLU
    Convolution              5x3            Spectral         LReLU
    Convolution              3x3            Spectral         LReLU
    Convolution              5x3            Spectral         LReLU
    Convolution              3x3            Spectral         LReLU
    Self Attention Layer     -              -                -
    Convolution              3x3            Spectral         LReLU
    Self Attention Layer     -              -                -
    Flatten                  -              -                -
    Linear                   -              -                -
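   A sketch of this discriminator is given below; the SAGAN-style formulation of the self-attention layer and the channel counts are assumptions, since the paper only specifies the layer ordering of Table 3 and the self-attention reference [22].

    # Sketch of the discriminator of Table 3 (assumed channel sizes).
    import torch
    import torch.nn as nn

    class SelfAttention2d(nn.Module):
        def __init__(self, c):
            super().__init__()
            self.q = nn.Conv2d(c, c // 8, 1)
            self.k = nn.Conv2d(c, c // 8, 1)
            self.v = nn.Conv2d(c, c, 1)
            self.gamma = nn.Parameter(torch.zeros(1))

        def forward(self, x):
            b, c, h, w = x.shape
            q = self.q(x).flatten(2).transpose(1, 2)        # (b, hw, c//8)
            k = self.k(x).flatten(2)                        # (b, c//8, hw)
            attn = torch.softmax(q @ k, dim=-1)             # (b, hw, hw)
            v = self.v(x).flatten(2)                        # (b, c, hw)
            out = (v @ attn.transpose(1, 2)).view(b, c, h, w)
            return x + self.gamma * out                     # residual attention

    def d_conv(cin, cout, k):
        pad = (k[0] // 2, k[1] // 2)
        return nn.Sequential(nn.utils.spectral_norm(nn.Conv2d(cin, cout, k, padding=pad)),
                             nn.LeakyReLU(0.2))

    class Discriminator(nn.Module):
        def __init__(self, c=32):
            super().__init__()
            self.features = nn.Sequential(
                d_conv(1, c, (3, 3)), d_conv(c, c, (3, 3)), d_conv(c, c, (5, 3)),
                d_conv(c, c, (3, 3)), d_conv(c, c, (5, 3)), d_conv(c, c, (3, 3)),
                SelfAttention2d(c), d_conv(c, c, (3, 3)), SelfAttention2d(c),
            )
            self.head = nn.LazyLinear(1)   # flatten -> linear; sigmoid applied in the loss

        def forward(self, x):
            return self.head(self.features(x).flatten(1))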

   Loss Function:
   We use the Connectionist Temporal Classification (CTC) loss for the feature transformation network to produce output sequences for unaligned input data. The input to the loss function is the output probability distribution y, and the objective is to maximize the probability of outputting the correct transcript, i.e., maximum likelihood training implemented by minimising the negative log-likelihood [15]. The network then converts the input into the highest-probability transcription. The loss function for spectrum detection is defined as

        L_1 = O_{ML}(S, N_w) = -\sum_{(x,z) \in S} \ln p(z \mid x)                        (1)

where S denotes the set of input samples drawn from a fixed distribution D_{X×Z}, z = (z_1, z_2, ..., z_U) is the target sequence whose length is smaller than or equal to that of the input sequence x = (x_1, x_2, ..., x_T), and N_w denotes the network outputs.
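   A minimal sketch of the CTC objective of Eq. (1), using PyTorch's built-in nn.CTCLoss, is shown below; the tensor shapes and the blank index are assumptions for illustration, not values from the paper.

    # Sketch of the CTC negative log-likelihood of Eq. (1).
    import torch
    import torch.nn as nn

    ctc = nn.CTCLoss(blank=0)                      # sums -ln p(z|x) over alignments

    T, B, C, U = 100, 4, 29, 20                    # time steps, batch, classes, target length
    log_probs = torch.randn(T, B, C).log_softmax(dim=-1)    # network outputs N_w(x)
    targets = torch.randint(1, C, (B, U))                    # target transcripts z
    input_lengths = torch.full((B,), T, dtype=torch.long)
    target_lengths = torch.full((B,), U, dtype=torch.long)

    loss_L1 = ctc(log_probs, targets, input_lengths, target_lengths)   # averaged over batch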
    The feature transformation network must not pass features on if the frequency domain power spectrum is not verified or if the distance between voice propagation angles is not within a certain range. Therefore, this constraint is added to the loss of the network: if a realistic output is generated for a malicious input, the network is penalized. The loss function L2, which performs the frequency domain verification, is defined as the ratio of low-frequency power to total power:

        L_2 = \frac{\sum_{f=85\,\mathrm{Hz}}^{2\,\mathrm{kHz}} A^2(f)}{\sum_{f} A^2(f)}                        (2)

    For an original message this ratio lies in a pre-defined range, the range of the human voice being 85 Hz to 4 kHz. If the ratio is within the acceptable range, realistic transcripts are generated. A spectrogram is like a 2D image of a signal with time on the x-axis and frequency on the y-axis; therefore, f represents a value on the y-axis.
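    The following sketch computes the power ratio of Eq. (2) from a spectrogram; the band edges and the accepted range used in the gate are assumptions for illustration, the paper only fixing 85 Hz as the lower edge.

    # Sketch of the frequency-domain power check of Eq. (2).
    import numpy as np
    import librosa

    def low_freq_power_ratio(wav, sr=16000, f_lo=85.0, f_hi=2000.0):
        spec = np.abs(librosa.stft(wav)) ** 2          # power spectrogram A^2(f, t)
        freqs = librosa.fft_frequencies(sr=sr)         # frequency of each STFT row
        band = (freqs >= f_lo) & (freqs <= f_hi)
        return spec[band].sum() / spec.sum()           # ratio L2 of Eq. (2)

    def frequency_check(wav, lo=0.5, hi=1.0):
        # Pass features only when the ratio falls in an accepted (assumed) range.
        r = low_freq_power_ratio(wav)
        return lo <= r <= hi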
    For the identification of the driver's voice, the angle of propagation of the sound signal is analysed; if the angle is within a predefined range, the features are passed on to the next layer. The loss function for identifying the driver's voice is defined as

        L_3 = \alpha = \arccos\!\left(\frac{\Delta N \cdot V_0}{D_0 \cdot f_s}\right)                        (3)

where \Delta N denotes the length of the segment (the sample delay between speakers), D_0 denotes the distance between the installed speakers, V_0 the propagation speed of the voice signal, and f_s the sampling frequency.
   The adversarial loss of the network is the min-max objective over L_1. The total loss of the network is adv_loss(L_1) + L_2 + L_3.
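   A sketch of the propagation-angle check behind Eq. (3) is given below, under a two-sensor time-difference-of-arrival reading of the formula; the sensor spacing, propagation speed and angular thresholds are assumptions for illustration.

    # Sketch of the propagation-angle check of Eq. (3) (assumed TDOA reading).
    import numpy as np

    def propagation_angle(delta_n, d0=0.5, v0=343.0, fs=16000):
        """Angle alpha from sample delay delta_n between two sensors d0 metres apart."""
        cos_a = np.clip(delta_n * v0 / (d0 * fs), -1.0, 1.0)
        return np.degrees(np.arccos(cos_a))            # L3 of Eq. (3), in degrees

    def driver_voice_check(delta_n, lo_deg=30.0, hi_deg=60.0):
        # Pass features only when the estimated angle lies in the predefined range.
        return lo_deg <= propagation_angle(delta_n) <= hi_deg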
3. Experimental Results

    The experiments are performed using standard dataset TIMIT for training the proposed model.
This dataset consists of audio aligned with each character as well as expected sentence transcription.
We check the effectiveness of the proposed model to resist fast gradient-sign method (FGSM) attack
[16]. Word error rate (WER(%)) is used to measure the speech to text accuracy and is calculated using
Levenshtein distance [19] to determine the resistance of the proposed model against attacks using
equation
                                    𝐷+𝐼+𝑆
                           𝑊𝐸𝑅 = 𝑁 ,                                   ( 4)
where D denotes deletion of words, I denotes insertion of words, S denotes substitution of words by
the model and N denotes total number of words. Results are compared with other state of art methods
and displayed in Table 4.
    Table 4. WER(%) of proposed model against FGSM attack.

    Method        FGSM     Average
    Proposed      33.5     33.5
    [23]          -        32.6
    [24]          -        40.5
    [25]          34.6     34.6

From Table 4, the WER of the proposed model is 33.5% on the TIMIT dataset against the FGSM attack. The proposed model generates valid transcriptions with a low WER as compared to [19]. A low WER value indicates good performance against attacks. The comparison is done on the basis of WER with [18], and the values are similar; the results show that the WER values of the proposed model are justified.
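For completeness, a straightforward word-level Levenshtein implementation of the WER of Eq. (4) is sketched below; it is a generic dynamic-programming routine, not the authors' evaluation code.

    # Sketch of the WER computation of Eq. (4) via word-level edit distance.
    def wer(reference: str, hypothesis: str) -> float:
        ref, hyp = reference.split(), hypothesis.split()
        # dp[i][j] = minimum edits to turn ref[:i] into hyp[:j]
        dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            dp[i][0] = i                      # deletions
        for j in range(len(hyp) + 1):
            dp[0][j] = j                      # insertions
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])   # substitution
                dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
        return dp[len(ref)][len(hyp)] / max(len(ref), 1)   # (D + I + S) / N

    print(wer("i need urgent medical help", "all is well no help needed"))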

4. Conclusion and Future Scope

   In this paper, we propose a secure medical speech recognition system to defeat adversarial voice command attacks in healthcare applications. We utilize the physical attributes of voices to distinguish the speaker's voice from other, adversarial voices in two steps. First, multi-source signals are filtered out according to frequency domain power spectrum verification. Second, the driver's voice is determined from its propagation direction using multiple microphones installed at different corners. The feature decoder network and the discriminator network then use the CTC loss function to transform the features and generate transcripts. Detection of adversarial examples during the ASR model's training enhances the model's scalability. The experimental results show an improved WER for the proposed ASR system, as compared to other systems, when subjected to the FGSM attack.
   As future work, the proposed model can be tested against various other attack models, and the classification network can accordingly be trained with appropriate loss functions. New adversarial attacks can also be defined for ASR systems for adversarial training.

5. References

[1] Huynh, N.D., Bouadjenek, M.R., Razzak, I., Lee, K., Arora, C., Hassani, A. and Zaslavsky, A.
    ``Adversarial Attacks on Speech Recognition Systems for Mission-Critical Applications: A
    Survey” (2022). arXiv preprint arXiv:2202.10594.
[2] Erdemir, E., Bickford, J., Melis, L. and Aydore, S. ``Adversarial robustness with non-uniform
    perturbations”. Advances in Neural Information Processing Systems, 34 (2021), pp.19147-19159.
[3] Szegedy, Christian, Zaremba, Wojciech, Sutskever, Ilya, Bruna, Joan, Erhan, Dumitru,
    Goodfellow, Ian J., and Fergus, Rob. ``Intriguing properties of neural networks”. ICLR,
    2014. URL: http://arxiv.org/abs/1312.6199.
[4] Goodfellow, I.J., Shlens, J. and Szegedy, C. ``Explaining and harnessing adversarial
    examples”, arXiv preprint arXiv:1412.6572 (2014).
[5] Gilmer, J., Adams, R.P., Goodfellow, I., Andersen, D. and Dahl, G.E. ``Motivating the rules of
    the game for adversarial example research”, arXiv preprint arXiv:1807.06732 (2018).
[6] Zhang, J. and Li, C. ``Adversarial examples: Opportunities and challenges”. IEEE Transactions on
     Neural Networks and Learning Systems, 31(7), 2019, pp.2578-2593.
[7] Yigitcan Kaya, Muhammad Bilal Zafar, Sergul Aydore, Nathalie Rauschmayr, Krishnaram
     Kenthapadi, “Generating Distributional Adversarial Examples to Evade Statistical Detectors”,
     Proceedings of the 39 th International Conference on Machine Learning, Baltimore, Maryland,
     USA, PMLR 162, 2022
[8] Wang, K., Yi, P., Zou, F. and Wu, Y. ``Generating Adversarial Samples With Constrained
     Wasserstein Distance”, IEEE Access, 7, 2019, pp.136812-136821.
[9] T. Tanay and L. Griffin, ‘‘A boundary tilting perspective on the phenomenon of adversarial
     examples,’’ arXiv:1608.07690, 2016, URL: https://arxiv.org/abs/1608.07690
[10] Yuan, X., He, P., Zhu, Q. and Li, X. ``Adversarial examples: Attacks and defenses for deep
     learning”, IEEE transactions on neural networks and learning systems, 30(9), 2019, pp.2805-
     2824.
[11] Serban, A., Poll, E. and Visser, J. ``Adversarial examples on object recognition: A
     comprehensive survey”. ACM Computing Surveys (CSUR), 53(3), 2020, pp.1-38.
[12]    Carlini N, Mishra P, Vaidya T, Zhang Y, Sherr M, Shields C, Wagner D, Zhou W. ``Hidden
    voice commands”. 25th USENIX security symposium (USENIX security 16) 2016 (pp. 513-530).
[13]    Amodei D, Ananthanarayanan S, Anubhai R, Bai J, Battenberg E, Case C, Casper J,
    Catanzaro B, Cheng Q, Chen G, Chen J. ``Deep speech 2: End-to-end speech recognition in
    english and mandarin”. International conference on machine learning, 2016 Jun 11 (pp. 173-
    182). PMLR.
[14]    Cisse M, Adi Y, Neverova N, Keshet J. ``Houdini: Fooling deep structured prediction
    models”. arXiv preprint arXiv:1707.05373. 2017 Jul 17.
[15]    Carlini N, Wagner D. ``Audio adversarial examples: Targeted attacks on speech-to-text”. In
    2018 IEEE security and privacy workshops (SPW) 2018 May 24 (pp. 1-7). IEEE.
[16]    Schönherr L, Kohls K, Zeiler S, Holz T, Kolossa D. ``Adversarial attacks against automatic
    speech recognition systems via psychoacoustic hiding”. arXiv preprint arXiv:1808.05665. 2018
    Aug 16.
[17]    Joshi S, Villalba J, Żelasko P, Moro-Velázquez L, Dehak N. ``Study of pre-processing
    defenses against adversarial attacks on state-of-the-art speaker recognition systems”. IEEE
    Transactions on Information Forensics and Security. 2021 Sep 29;16:4811-26.
[18]    Madry A, Makelov A, Schmidt L, Tsipras D, Vladu A. ``Towards deep learning models
    resistant to adversarial attacks”. arXiv preprint arXiv:1706.06083. 2017 Jun 19.
[19] Madono, K., Tanaka, M., Onishi, M. and Ogawa, T. ``Sia-gan: Scrambling inversion attack using
     generative adversarial network”, IEEE Access, 9, 2021, pp.129385-129393
[20] Graves, A., Fernández, S., Gomez, F. and Schmidhuber, J. ``Connectionist temporal
     classification: labelling unsegmented sequence data with recurrent neural networks”.
     In Proceedings of the 23rd international conference on Machine learning (pp. 369-376), 2006.
[21]    Shi, W., Caballero, J., Huszár, F., Totz, J., Aitken, A.P., Bishop, R., Rueckert, D. and Wang,
    Z. ``Real-time single image and video super-resolution using an efficient sub-pixel convolutional
    neural network”. In Proceedings of the IEEE conference on computer vision and pattern
    recognition (pp. 1874-1883), 2016.
[22]    Shaw, P., Uszkoreit, J. and Vaswani, A. ``Self-attention with relative position
    representations”. arXiv preprint arXiv:1803.02155, 2018.
[23]    Navarro, G. ``A guided tour to approximate string matching”. ACM Computing Surveys
    (CSUR), 33(1), 2001, pp.31-88.
[24]    Żelasko, P., Joshi, S., Shao, Y., Villalba, J., Trmal, J., Dehak, N. and Khudanpur, S.
    ``Adversarial attacks and defenses for speech recognition systems”. arXiv preprint
    arXiv:2103.17122, 2021.
[25]    Yoshioka, T. and Gales, M.J. ``Environmentally robust ASR front-end for deep neural
    network acoustic models”, Computer Speech & Language, 31(1), 2015, pp.65-86.