Multi-Spectrum Based Audio Adversarial Detection

Yunchen Li, Jian Ma, Da Luo

Dongguan University of Technology, Dongguan, China

Abstract

Audio adversarial examples are emerging as a threat to automatic speech recognition (ASR) systems. Existing studies on adversarial examples and defences have mostly been developed for image classification. Unlike attack methods, adversarial detection techniques cannot be directly transferred to ASR, because sound waveforms are sequence dependent and relatively few audio adversarial examples are available for training an adversarial detector. In this paper, we study the spectrum characteristics of audio adversarial examples and, accordingly, propose a multi-spectrum based learning scheme to address these problems for audio adversarial detection. We evaluate the method on an ASR dataset under two white-box attacks and one black-box attack. Compared with existing methods, our method significantly improves detection accuracy on short audio frames, especially under keyword modification attacks. In addition, ablation experiments show that the proposed multi-spectrum method is effective for audio adversarial detection.

Keywords

audio adversarial detection, automatic speech recognition, multi-spectrum

1. Introduction

Core to an automatic speech recognition (ASR) system is the speech-to-text procedure, which can be strongly influenced by adversarial examples [1]. Many modern ASR systems, such as DeepSpeech [2] and Lingvo [3], use deep neural networks (DNNs), which are vulnerable to input perturbations [4]. It is possible to inject adversarial perturbations into an audio segment to change the recognition result [5]. There are two main cases. The first is keyword modification, which changes some key words in the transcript. The second is sentence modification, in which the entire transcript may be replaced. For example:

◼ Original transcript: I have gave them to Alice.
◼ Keyword modification: I have gave them to Bob.
◼ Sentence modification: I did not give to anyone.

The adversarial perturbation is almost imperceptible and is a serious threat to the growing applications of ASR such as Google Home and Amazon Alexa. Therefore, it is important to address the detection problem of audio adversarial examples.

Several ASR attacks have been demonstrated in the literature [6–8]. Some of them adapt key techniques, such as gradient descent, from the domain of image classification to audio systems. Unlike attack methods, audio adversarial detection poses different challenges compared with its counterpart in the image domain. Firstly, it is much slower and more complex to generate audio adversarial examples on Recurrent Neural Networks (RNNs) [9] due to sequence dependency. This leaves far fewer adversarial samples for training a binary classifier as an audio adversarial detector. Moreover, audio input transformation is not effective against adversarial attacks [10]. This is mainly because speech data possess temporal dependency and do not have hierarchical object associations.

ICCEIC2022@3rd International Conference on Computer Engineering and Intelligent Control
EMAIL: *15361930042@163.com (Yunchen Li)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org)

To identify audio adversarial examples, Zeng et al. [11] compared the transcripts produced by different ASR systems. Yang et al. [10] exploited the inherent temporal dependency in audio data. Jayashankar et al. [12] introduced random dropout to an ASR network at test time to reduce the attack success of adversarial perturbations, but at the cost of significant system degradation on clean examples.
In this paper, we study the artefact characteristics of audio perturbations in the Mel-scale Frequency Cepstral Coefficients (MFCC) domain and make the following contributions:

– We find that the artefact characteristics of adversarial perturbations are hidden in all the frames of an audio sample, so we propose a multi-spectrum framework for adversarial detection.
– The proposed approach can significantly improve the audio adversarial detection rate, especially in the case of black-box attacks.

2. Audio Adversarial Attacks

In this paper, we consider two representative attack methods as well as one newly proposed attack method for creating audio adversarial examples: 1) the C&W attack, 2) the Taori attack and 3) the Time limit attack.

The C&W attack [6] is a white-box attack proposed by Carlini and Wagner to generate adversarial perturbations. This method accesses the parameters of the target ASR and minimizes its loss function to tamper with the result of the ASR.

The Taori attack [7] is a black-box attack against ASR systems that has no access to the internal information of the victim ASR. It employs a genetic algorithm and a gradient estimation method to tamper with the result of the ASR.

The Time limit attack is a white-box attack based on a time-domain limitation: it hides the noise of the audio adversarial sample in the speech part of the audio, so as to make the perturbation noise imperceptible.

Figure 1: The proposed MS-AlexNet framework.

An interesting question is where the artefact characteristics of audio adversarial perturbations exist. In modern ASR systems such as DeepSpeech, a bi-directional RNN (B-RNN) [13] is often used to obtain the semantic information of audio data by aligning speech to text. The input sample $x = \{x_t\}$ is a sequence of utterance frames and the output $y = \{y_t\}$ is the corresponding transcript at time step $t$, for $t = 0, 1, 2, \ldots, T$.
In the B-RNN, let the hidden layer activation function be $h(\cdot)$ and the output layer activation function be $g(\cdot)$. The forward and backward hidden states at time $t$ are denoted by $A_t$ and $A'_t$, respectively. They are updated by

$$A_t = h(U x_t + W A_{t-1}) \tag{1}$$

$$A'_t = h(U' x_t + W' A'_{t+1}) \tag{2}$$

while the model output is computed by

$$y_t = g(V A_t + V' A'_t) \tag{3}$$

where the B-RNN parameters $U$ and $U'$, $W$ and $W'$, $V$ and $V'$ are weights between different layers. It can be seen that the forward hidden state $A_t$ is determined by the utterance sequence $x_0, x_1, \ldots, x_t$, while the backward hidden state $A'_t$ is determined by $x_t, x_{t+1}, \ldots, x_T$. Accordingly, the output $y_t$ depends on all utterance frames $x_0, x_1, \ldots, x_T$ in the time series.

In the attack process, define the adversarial perturbation signal as the difference between the audio samples before and after speech modification, i.e., $\delta = \{\delta_t\}$ with $\delta_t = \tilde{x}_t - x_t$.

Proposition 1. To tamper with $y_t$, the entire utterance sequence $x_0, x_1, \ldots, x_t, \ldots, x_T$ in the time series must be altered, i.e., $\delta_t \neq 0$ for $t = 0, 1, \ldots, T$.

Proof. Without loss of generality, denote the tampered output by $\tilde{y}_t$. In $l_p$-normed attacks, the objective function with respect to $y_t$ is therefore given by (4) for $t = 0, 1, \ldots, T$. It is common to optimize (4) by gradient descent using equation (3). In the gradient-based attack procedure, each $x_t$ is updated along the gradient of the objective with respect to $x_t$. Since $y_t$ in (3) depends on every $x_0, x_1, \ldots, x_T$ through $A_t$ and $A'_t$, this gradient is in general nonzero at every time step, so we have $\delta_t \neq 0$ for $t = 0, 1, \ldots, T$. That is, the adversarial perturbation exists across the entire time series of the original audio sequence.

Proposition 1 indicates that the adversarial perturbations generated on a B-RNN are distributed throughout the audio time series data. Therefore, we treat all short-time (ST) frames $\{x_t\}$ as containing adversarial signals when $x$ is an adversarial audio sample.

3. Multi-Spectrum Based Detection

The adversarial perturbation analysis above suggests that we can discriminate adversarial characteristics based on the ST frames instead of the entire audio sequence, and for each ST frame we further utilize MFCC features.
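The slicing of an audio sample into ST frames followed by MFCC extraction can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the authors' pipeline: the function names `split_into_segments` and `mfcc`, the segment length (0.25 s), the hops, the FFT size, and the numbers of Mel filters and cepstral coefficients are all hypothetical, and the MFCC is a plain textbook version (Hamming window, power spectrum, triangular Mel filterbank, logarithm, unnormalised DCT-II) written with NumPy only.

```python
import numpy as np

def split_into_segments(wave, seg_len, hop):
    """Slice a 1-D waveform into overlapping short-time (ST) segments."""
    if len(wave) < seg_len:
        raise ValueError("waveform is shorter than one segment")
    n = 1 + (len(wave) - seg_len) // hop
    return np.stack([wave[i * hop : i * hop + seg_len] for i in range(n)])

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    """Triangular filters spaced uniformly on the Mel scale."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        lo, mid, hi = bins[m - 1], bins[m], bins[m + 1]
        for k in range(lo, mid):          # rising edge of the triangle
            fb[m - 1, k] = (k - lo) / (mid - lo)
        for k in range(mid, hi):          # falling edge of the triangle
            fb[m - 1, k] = (hi - k) / (hi - mid)
    return fb

def mfcc(segment, sr, n_fft=512, hop=160, n_mels=26, n_ceps=13):
    """Return an (n_frames, n_ceps) MFCC matrix for one ST segment."""
    n_frames = 1 + (len(segment) - n_fft) // hop
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = segment[idx] * np.hamming(n_fft)
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2 / n_fft
    log_mel = np.log(power @ mel_filterbank(n_mels, n_fft, sr).T + 1e-10)
    # Unnormalised DCT-II basis over the Mel axis.
    n = np.arange(n_mels)
    basis = np.cos(np.pi / n_mels * (n[None, :] + 0.5) * np.arange(n_ceps)[:, None])
    return log_mel @ basis.T

if __name__ == "__main__":
    sr = 16000
    wave = np.sin(2 * np.pi * 440.0 * np.arange(sr) / sr)   # 1 s test tone
    segments = split_into_segments(wave, seg_len=4000, hop=2000)
    features = np.stack([mfcc(s, sr) for s in segments])
    print(features.shape)   # (7, 22, 13): 7 ST segments, each a 22 x 13 MFCC map
```

Each ST segment yields its own MFCC map, so a single utterance contributes several training inputs to the detector, which is the point of working on a short-time basis.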
There are at least three benefits of designing adversarial detectors in the MFCC domain: 1) the artefacts of adversarial perturbation are more significant in the power spectra of ST frames; 2) spatio-temporal information is better exploited on a short-time basis to deal with the highly non-stationary perturbation signals, especially when the locations of such noise sources vary across time, as in our case; 3) slicing the speech segment into multiple ST frames yields more adversarial samples for training an effective detector, thus relieving the few-shot problem in audio adversarial detection.

In this paper, we propose MS-AlexNet, a detection method that combines the idea of multi-spectrum with AlexNet [14]. The framework is shown in Figure 1. Clean audio and adversarial audio examples are the two categories to be classified. Each audio sample is divided into ST frames, and MFCC is then applied to obtain multi-spectrum features, which are fed into AlexNet. Specifically, the AlexNet architecture consists of 8 layers (5 convolutional layers alternating with 3 max pooling layers). Finally, two fully connected layers are applied and a softmax binary classifier determines the classification result.

Figure 2: Detector ROC curves under (a-b) the white-box C&W attack, (c-d) the white-box Time limit attack and (e-f) the black-box attack, for keyword modification and sentence modification, respectively.

Table 1. Adversarial detection accuracy (TPR @5% FPR) under white-box and black-box attacks for keyword and sentence modification. The best result for each row is marked with an asterisk.

Common Voice
Attack        Type      AlexNet  VGG   CNN   MEH-FEST  MS-AlexNet
White (C&W)   Keyword   0.89     0.28  0.84  0.22      0.93*
White (C&W)   Sentence  0.92*    0.25  0.85  0.33      0.90
White (Time)  Keyword   0.76     0.17  0.19  0.11      0.98*
White (Time)  Sentence  0.79*    0.35  0.41  0.15      0.56
Black         Keyword   0.70     0.65  0.54  0.15      1.0*
Black         Sentence  0.74     0.66  0.85  0.16      0.92*

4. Experiments

We evaluate the adversarial detection accuracy of the proposed method on the open-source Chinese Mandarin speech corpus AISHELL-1 [15], which contains a 150-hour training set, a 10-hour development set and a 5-hour test set. The test set contains 7,176 utterances in total.

We generate adversarial samples using the three attack methods introduced in Section 2. Note that the C&W samples and the Time limit samples come from white-box attacks, and the Taori samples come from a black-box attack. To accomplish the black-box attack in limited time, we break long audio sequences into a 10-second series and a 5-second series to generate the Taori samples for AISHELL-1. We generate 630 C&W samples, 400 Taori samples and 440 Time limit samples from the AISHELL-1 dataset for keyword and sentence modification, respectively. DeepSpeech [2] is used as the victim ASR, where the white-box attacks are deployed on v0.4.1 and the black-box attack is deployed on v0.1.1. All experiments are performed on a single NVIDIA V100 machine.

We compare the proposed MS-AlexNet method with four other methods. The first is AlexNet [14], which uses MFCC features computed directly from the entire audio as input and acts as a binary classifier to detect adversarial examples. The second is CNN [16], which also uses MFCC features as input. The CNN architecture consists of 5 layers (3 convolutional layers alternating with 2 max pooling layers). The third employs a statistical pooling layer in the VGG [17] architecture to aggregate utterance statistics for decision making. The statistical pooling layer aggregates the mean and bias of the output of the last convolutional layer and forwards them to the dense layer. Finally, two dense layers project the statistics into a two-dimensional output space for decision making. The fourth, MEH-FEST [18], calculates the minimum energy in high frequencies through the short-time Fourier transform of the audio and uses it as a detection metric.
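To make the last baseline concrete, the statistic described above (the minimum, over short-time Fourier transform frames, of the energy in the high-frequency band) can be sketched as follows. This is a rough illustration of the sentence above and not the implementation from [18]: the function name `min_high_freq_energy`, the Hann window, the FFT size, the hop and the 7 kHz cutoff are all assumptions made for the example.

```python
import numpy as np

def min_high_freq_energy(wave, sr, n_fft=512, hop=256, cutoff_hz=7000.0):
    """Minimum over STFT frames of the spectral energy at or above cutoff_hz."""
    n_frames = 1 + (len(wave) - n_fft) // hop
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    # Windowed frames -> one-sided spectrum per frame.
    spec = np.fft.rfft(wave[idx] * np.hanning(n_fft), axis=1)
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    high = freqs >= cutoff_hz                      # hypothetical band edge
    energy = np.sum(np.abs(spec[:, high]) ** 2, axis=1)
    return float(energy.min())

if __name__ == "__main__":
    sr = 16000
    rng = np.random.default_rng(0)
    t = np.arange(sr) / sr
    clean = 0.5 * np.sin(2 * np.pi * 440.0 * t)            # toy "clean" signal
    noisy = clean + 0.01 * rng.standard_normal(sr)         # toy broadband perturbation
    print(min_high_freq_energy(clean, sr), min_high_freq_energy(noisy, sr))
```

In this toy setting the broadband noise raises the statistic well above its value for the pure tone; a detector built on it would compare the value against a threshold chosen on held-out data.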
Figure 2 plots the ROC curves of the five compared methods for keyword and sentence modification under the two white-box attacks and the black-box attack. In general, black-box attacks are easier to detect than white-box attacks because they add perturbations of larger strength to the signals, especially those used for sentence modification. The proposed MS-AlexNet significantly outperforms the CNN, VGG and MEH-FEST detection methods in most cases. This demonstrates the effectiveness of audio adversarial detection in the multi-spectrum domain.

In addition, we conduct ablation experiments using the AlexNet model for adversarial detection without the multi-spectrum method, and the results show that the proposed multi-spectrum method benefits audio adversarial detection. Table 1 displays the detection accuracy (TPR at 5% FPR) for keyword and sentence modification. It can be seen that MS-AlexNet performs best in all cases except sentence modification under the two white-box attacks (C&W and Time limit). Under the C&W attack with sentence modification, AlexNet (0.92) and MS-AlexNet (0.90) are comparable, whereas under the Time limit attack with sentence modification AlexNet (0.79) outperforms MS-AlexNet (0.56). Overall, MS-AlexNet achieves the best TPR in four of the six settings in Table 1.

5. Conclusions

In this paper, we study the short-time spectrum of audio sequences in the MFCC domain, analyze audio adversarial characteristics and, accordingly, propose a multi-spectrum based detection method for audio adversarial examples against ASR. The proposed method achieves a significant improvement in detection accuracy, especially for keyword modification attacks under both white-box and black-box scenarios. The multi-spectrum detection method also detects adversarial examples well under different attack methods.

6. Acknowledgements

This work was financially supported by Guangdong Natural Science Key Field Project (2019KZDZX1008).

7. References

[1] Yu, D., Deng, L.: Automatic Speech Recognition. Springer (2016)
[2] Hannun, A., Case, C., Casper, J., Catanzaro, B., Diamos, G., Elsen, E., Prenger, R., Satheesh, S., Sengupta, S., Coates, A., et al.: Deep speech: Scaling up end-to-end speech recognition. arXiv preprint arXiv:1412.5567 (2014)
[3] Shen, J., Nguyen, P., Wu, Y., Chen, Z., Chen, M.X., Jia, Y., Kannan, A., Sainath, T., Cao, Y., Chiu, C.C., et al.: Lingvo: a modular and scalable framework for sequence-to-sequence modeling. arXiv preprint arXiv:1902.08295 (2019)
[4] Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., Fergus, R.: Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199 (2013)
[5] Cisse, M.M., Adi, Y., Neverova, N., Keshet, J.: Houdini: Fooling deep structured visual and speech recognition models with adversarial examples. Advances in neural information processing systems 30 (2017)
[6] Carlini, N., Wagner, D.: Audio adversarial examples: Targeted attacks on speech-to-text. In: 2018 IEEE Security and Privacy Workshops (SPW), IEEE (2018) 1–7
[7] Taori, R., Kamsetty, A., Chu, B., Vemuri, N.: Targeted adversarial examples for black box audio systems. In: 2019 IEEE Security and Privacy Workshops (SPW), IEEE (2019) 15–20
[8] Yakura, H., Sakuma, J.: Robust audio adversarial example for a physical attack. arXiv preprint arXiv:1810.11793 (2018)
[9] Mikolov, T., Karafiát, M., Burget, L., Černocký, J., Khudanpur, S.: Recurrent neural network based language model. In: Interspeech. Volume 2., Makuhari (2010) 1045–1048
[10] Yang, Z., Li, B., Chen, P.Y., Song, D.: Characterizing audio adversarial examples using temporal dependency. arXiv preprint arXiv:1809.10875 (2018)
[11] Zeng, Q., Su, J., Fu, C., Kayas, G., Luo, L., Du, X., Tan, C.C., Wu, J.: A multiversion programming inspired approach to detecting audio adversarial examples. In: 2019 49th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), IEEE (2019) 39–51
[12] Jayashankar, T., Roux, J.L., Moulin, P.: Detecting audio attacks on asr systems with dropout uncertainty. arXiv preprint arXiv:2006.01906 (2020)
[13] Schuster, M., Paliwal, K.K.: Bidirectional recurrent neural networks. IEEE transactions on Signal Processing 45(11) (1997) 2673–2681
[14] Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems 25 (2012)
[15] Bu, H., Du, J., Na, X., Wu, B., Zheng, H.: Aishell-1: An open-source mandarin speech corpus and a speech recognition baseline. In: 2017 20th conference of the oriental chapter of the international coordinating committee on speech databases and speech I/O systems and assessment (O-COCOSDA), IEEE (2017) 1–5
[16] Samizade, S., Tan, Z.H., Shen, C., Guan, X.: Adversarial example detection by classification for deep speech recognition. In: ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). (2020) 3102–3106
[17] Li, X., Li, N., Zhong, J., Wu, X., Liu, X., Su, D., Yu, D., Meng, H.: Investigating Robustness of Adversarial Samples Detection for Automatic Speaker Verification. In: Proc. Interspeech 2020. (2020) 1540–1544
[18] Chen, Z.: On the detection of adaptive adversarial attacks in speaker verification systems. arXiv preprint arXiv:2202.05725 (2022)