Mask - Based Minimum Variance Distortionless Response Beamforming Using the Power Level Difference Quan Trong The Digital Agriculture Cooperative, Cau Giay, Ha Noi, Viet Nam. Abstract Speech enhancement is one of the most vulnerable problems, which exists as a complex challenge task for scholars. Single channel - approach has the low computation and easy implementation, which almost use the spectral subtraction operation. However, this research direction leads to speech distortion in the scenarios with non - stationary environment. Consequently, microphone array technology is used for reducing speech distortion by using the priori information about spatial beampattern. Minimum Variance Distortionless Response owns high directional beampattern while suppressing all background noise, interference while preserving the certain direction target speaker. In realistic situations, the performance of MVDR beamformer is often corrupted due to many reasons, the different microphone array sensitivities, the error of the direction of arrival of interest signal or the imprecise array distribution. In this article, the author suggested using spectral mask, which uses the information of power level difference to enhance MVDR beamformer’s evaluation. The demonstrated experiment shows the improvement of speech enhancement with the signal-to- noise ratio (SNR) from 2.0 (dB) to 3.1 (dB). Keywords 1 Microphone array, beamforming, minimum variance distortionless response, speech enhancement, noise reduction, steering vector. 1. Introduction Target speech extracting digital signal processing algorithm separates the desired talker in an annoying complex environment when third - party speaker, interference, living equipment or background noise exit. These methods serve as an essential preprocessing front - end for several speech communication systems, such as speech acquisition, speech enhancement, surveillance devices, smart home, automatic speech recognition, human verification, and digital hearing - aid devices. With recent development of microphone array (MA) technique, several research about MA beamforming, which use the prior spatial information about the direction of arrival (DOA) of useful signal, the properties of surrounding recording environment. Minimum Variance Distortionless Response (MVDR) beamformer is one of the most suitable techniques, which installed in almost speech applications for suppressing background noise with speech distortion. However, the real - life performance always degraded due to the imprecise necessary parameters, undetermined estimation of DOA, that corrupt the MVDR beamformer’s evaluation. Therefore, an important problem is increasing the outperformed MVDR, which requires preserving the original speech component while alleviating the total output noise power. Much early direction research synthesizes the beamformer’s output after cooperating with the time - frequency (TF) spectral mask with the obtained microphone array signals [1 - 3]. Exploiting the noise phase plays a major role in improvement of speech quality and speech intelligibility [4 - 7]. Phase - aware T - F masks have been COLINS-2023: 7th International Conference on Computational Linguistics and Intelligent Systems, April 20–21, 2023, Kharkiv, Ukraine EMAIL: quantrongthe1984@gmail.com ORCID: 0000 - 0002 - 2456 - 9598 (Quan Trong The) ©️ 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org) categorized in two branches: complex ration mask [8] and phase - sensitive mask [3], [8], which have been proposed, evaluated, and enhanced the overall MVDR beamformer’s speech separation. Figure 1: The complex surrounding environment around the target speaker Besides T-F mask, several research, which avoids estimating the magnitude and phase parts [9], [10], which only operates directly on the time - domain of noisy mixture of microphone array signals. The approach, which uses the neural network (NN) - based speech separation systems that have been demonstrated. Many approaches [11], [12] replace the conventional STFT and inverse STFT signal processing by a learnable NN based encoder and decoder configuration for enhancing performance according to many objective measurements to extract the target speech. Purely NN - based speech separation system has obtained promising resulting numerical experiments since they greatly suppress the amount of the remaining noisy components or interfering third - party speech. Figure 2: The using of microphone array for extracting the desired talker Recently, the combination with the multi-channel Wiener filter [13 - 14], the linearly constrained minimum variance (LCMV) filter [15] has been proposed. Other beamformer, such as Generalized Eigenvalue (GEV) beamformer [16 - 17] aim to improve the signal - to - noise ratio (SNR) without decreasing the speech component has achieved many successful. Additionally, multi-frame MVDR (MF - MVDR) [18 - 20] have been adopted in single - channel speech separation systems to block the noise and ensure the purpose of distortionless of the obtained result. These studies prove that when oracle information is available, the MF - MVDR filter can diminish the interference and background noise while introducing very little distortion. The combination between T-F mask and NN has been studied in [21 - 24], that leads to more precise estimation of the speech and noise components, and better speech enhancement or speech recognition caused of fewer distortion and increasing the speech quality. Figure 3: Beampattern, the essential component of microphone array beamforming Many online MA beamforming techniques for purpose of signal processing real - time or time – varying. In [25], a recursive algorithm with heuristic updating factors to calculate the time - varying covariance matrix of speech and noise components. The authors [26 - 28] also use the smoothing factors to estimate the time - varying covariance matrices and allow better outperformed evaluation in noisy conditions. [29] presented a frame - level beamforming method and obtained achieved more robustness of MA beamforming. However, in many complex undetermined acoustical environments, these above literatures own speech distortion, which cause the degradation of the speech quality or speech intelligibility. The purpose of the presented work is to resolve this problem by using an additive spectral mask, which reduces the above unresolved lack. In this contribution, the authors proposed a new enhanced MVDR beamformer, which exploits an effective spectral mask. The resulting results prove that the suggested method allows increasing the speech quality in terms of the signal-to-noise ratio (SNR) from 2.0 to 3.1 (dB), reducing the speech distortion to 8.0 (dB). The remaining section of this paper is organized as following way: Section II describes the brief of principal working of MVDR beamformer, Section III presents the proposed method. Section IV illustrates the evaluated experiments and Section V concludes. 2. The signal model MVDR beamformer is based on the constrained mathematical problem of preserving the target signal at a certain direction while removing the background noise with minimum total output noise power. The criterion of saving desired signal is the beampattern at the direction of useful signal is equal 1. The signal processing of MVDR beamformer can be expressed through the above formulation. In general speaking, we will consider the model signal with dual - microphone system (DMA2). The two captured microphone array signals can be derived as: 𝑋1 (𝑓, 𝑘) = 𝑆(𝑓, 𝑘)𝑒 𝑗𝛷𝑠 + 𝑉1 (𝑓, 𝑘) (1) 𝑋2 (𝑓, 𝑘) = 𝑆(𝑓, 𝑘)𝑒 −𝑗𝛷𝑠 + 𝑉2 (𝑓, 𝑘) (2) With the current frame 𝑘, current frequency 𝑓, the desired speech component 𝑆(𝑓, 𝑘), the additive noise 𝑉1 (𝑓, 𝑘), 𝑉2 (𝑓, 𝑘), 𝜃𝑠 direction of arrival of interest talker, the distance between two microphones 𝑑, speed propagation of sound in the fresh air is 𝑐 (343 m/s), 𝜏0 = 𝑑/𝑐 is the sound delay and 𝛷𝑠 = 𝜋𝑓𝜏0 𝑐𝑜𝑠(𝜃𝑠 ). . Figure 4: The principal working of MVDR beamformer With the predefined formulation: 𝑫(𝑓, 𝜃𝑠 ) = [𝑒 𝑗𝛷𝑠 𝑒 −𝑗𝛷𝑠 ]𝑇 is the steering vector, 𝑿(𝑓, 𝑘) = [𝑋1 (𝑓, 𝑘) 𝑋2 (𝑓, 𝑘)]𝑇 , 𝑽(𝑓, 𝑘) = [𝑉1 (𝑓, 𝑘) 𝑉2 (𝑓, 𝑘)]𝑇 with 𝑇 indicates transpose operator. The system (1-2) can be rewritten as: 𝑿(𝑓, 𝑘) = 𝑆(𝑓, 𝑘)𝑫(𝑓, 𝜃𝑠 ) + 𝑽(𝑓, 𝑘) (3) In most of digital signal processing problem, the scholars need find an appropriate coefficient 𝑾(𝑓, 𝑘), which can allow obtaining the final output signal 𝑆̂(𝑓, 𝑘) ≈ 𝑆(𝑓, 𝑘): 𝑆̂(𝒇, 𝒌) = 𝑾𝐻 (𝑓, 𝑘)𝑿(𝑓, 𝑘) (4) The constrained criteria are illustrated as the following way: 𝑚𝑖𝑛 (5) 𝑾𝐻 (𝑓, 𝑘)𝑷𝑉𝑉 (𝑓, 𝑘)𝑾(𝑓, 𝑘) 𝑠. 𝑡. 𝑾𝐻 (𝑓, 𝑘)𝑫(𝑓, 𝜃𝑠 ) = 1 𝑾(𝑓, 𝑘) where 𝑷𝑉𝑉 (𝑓, 𝑘) = 𝐸{𝑽(𝑓, 𝑘)𝑽∗ (𝑓, 𝑘)} is a covariance matrix of noise signals. The optimum coefficients of MVDR beamformer, which is derived from (5) can be expressed as: 𝑷−1𝑉𝑉 𝑫(𝑓, 𝜃𝑠 ) (6) 𝑾(𝑓, 𝑘) = 𝑫𝐻 (𝑓, 𝜃𝑠 )𝑷−1 𝑉𝑉 𝑫(𝑓, 𝜃𝑠 ) However, in realistic situations, the information about covariance matrix of noise is not easy calculated, the covariance matrix of captured microphone array signals is used instead of. The final optimum solution for MVDR beamformer is achieved that: −1 𝑷𝑋𝑋 𝑫(𝑓, 𝜃𝑠 ) (7) 𝑾(𝑓, 𝑘) = 𝐻 −1 𝑫 (𝑓, 𝜃𝑠 )𝑷𝑋𝑋 𝑫(𝑓, 𝜃𝑠 ) 𝑷𝑋𝑋 (𝑓, 𝑘) = 𝐸{𝑿(𝑓, 𝑘)𝑿∗ (𝑓, 𝑘)} of observed microphone signals are computed as: 𝑃𝑋 𝑋 (𝑓, 𝑘) ∗ 1.001 𝑃𝑋1 𝑋2 (𝑓, 𝑘) (8) 𝑷𝑋𝑋 (𝑓, 𝑘) = { 1 1 } 𝑃𝑋2 𝑋1 (𝑓, 𝑘) 𝑃𝑋2 𝑋2 (𝑓, 𝑘) ∗ 1.001 where 𝑃𝑋𝑖𝑋𝑗 (𝑓, 𝑘), 𝑃𝑋𝑖𝑋𝑖 (𝑓, 𝑘), 𝑖, 𝑗 ∈ {1,2} are determined as: 𝑃𝑋𝑖𝑋𝑗 (𝑓, 𝑘) = (1 − 𝛼)𝑃𝑋𝑖𝑋𝑗 (𝑓, 𝑘 − 1) + 𝛼𝑋𝑖∗ (𝑓, 𝑘)𝑋𝑗 (𝑓, 𝑘) (9) In almost acoustic environments, the unwanted and imprecise factors also degrade the evaluation of MVDR beamformer. Speech distortion, corrupted speech quality is the lack in microphone array beamforming. In the next section, the author suggested a spectral mask for dealing this problem. 3. The proposed spectral mask MA signal processing uses the spatio - temporal priori information, which is obtained from the configuration of MA with sound source, the coherence of background noise, or the MA signals. Spectral masks have been an attractive research direction for decades and play an essential role in the development of almost speech applications. And recently, the mask - based MA beamforming has attracted increased research due to their effectiveness of pre-processing signal, reducing the speech distortion, and improving total speech enhancement of system. In this section, the author proposed a spectral mask, 𝑚𝑠𝑝(𝑓, 𝑘), which suppresses the speech component at the recorded microphone array signals as the above approach: 𝑋̂1 (𝑓, 𝑘) = msp(𝑓, 𝑘) × 𝑋1 (𝑓, 𝑘) (10) 𝑋2 (𝑓, 𝑘) = msp(𝑓, 𝑘) × 𝑋2 (𝑓, 𝑘) (11) The ideal of suggested 𝑚𝑠𝑝(𝑓, 𝑘) based on the exponent function of the power level difference (PLD) as the following way: 𝑃𝑋1 𝑋1 (𝑓, 𝑘) − 𝑃𝑋2 𝑋2 (𝑓, 𝑘) (12) 𝑃𝐿𝐷(𝑓, 𝑘) = 𝑃𝑋1 𝑋1 (𝑓, 𝑘) + 𝑃𝑋2 𝑋2 (𝑓, 𝑘) msp(𝑓, 𝑘) = 𝑒 −𝑃𝐿𝐷(𝑓,𝑘) (13) In the frame, in which only exits noisy component, 𝑃𝐿𝐷(𝑓, 𝑘) towards “0’, consequently the 𝑚𝑠𝑝(𝑓, 𝑘) towards 1. In the presence of speech component, 𝑃𝐿𝐷(𝑓, 𝑘) often obtained value from 0.2 to 0.5; therefore, 𝑚𝑠𝑝(𝑓, 𝑘) less than 1, consequently, the operator (10 -11) ensures block the speech component at microphone array signal 𝑋1 (𝑓, 𝑘), 𝑋2 (𝑓, 𝑘). In the next section, an experiment is performed to confirm the promising advantage of 𝑚𝑠𝑝(𝑓, 𝑘). 4. Experiments In this section, the author demonstrated a promising experiment to rate the effectiveness of suggested method. DMA2 with two mounted microphones is one of the most suitable MA’s configurations in several acoustic equipment, which is common used for extracting the desired target speaker while eliminating the background noise, interferences or annoying noise. The author illustrated a talker, which stands at 2 (m) relative to the DMA2’s axis. The scheme is shown in Figure 5. The purpose of the experiment is comparison the promising performance of the proposed method (pro-sm-me) with the conventional MVDR beamformer (ctl-MVDR-beam). All microphone array signal are sampled at Fs = 16kHz. For further signal processing, the author used these necessary parameters: NFFT = 512, overlap 50%, the smoothing parameter 𝛼 = 0.5. An objective measurement [30] is used for calculating the speech quality in terms of the signal-to-noise ratio (SNR). The experiment is conducted in living room, where exits the other sound source or noise. Figure 5: The illustrated experiment with DMA2 The waveform of the original microphone array signal is presented in Figure 6. Figure 6: The waveform of the original microphone array signal The waveform of processed signal by ctl-MVDR-beam is show in Figure 7. Figure 7: The waveform of processed signal by ctl-MVDR-beam The promising signal, which was derived by pro-sm-me, is expressed in Figure 8. Figure 8: The waveform of processed signal by pro-sm-me And the comparison of energy between the original microphone array signal, the processed signal by ctl-MVDR-beam, pro-sm-me are illustrated in Figure 9. Figure 9: The energy of microphone array signal, the processed signals by ctl-MVDR-beam, pro- sm-me From these figures, we can see that the suggested technique allows better achieving result of speech enhancement. The SNR was increased from 2.0 (dB) to 3.1 (dB), and pro-sm-me reduces the speech distortion to 8.0 (dB). Table 1 The signal - to - noise ratio (SNR) Method Estimation Microphone array ctl-MVDR-beam pro-sm-me signal NIST STNR 3.0 20.2 22.2 WADA SNR 2.2 16.4 19.5 Through the numerical results, the proposed method not only improves the overall MVDR beamformer’s performance, but also reduces the unwanted speech distortion. The proposed technique can be integrated into multi - microphone system to enhance the evaluation in real - life recording scenarios. Degraded MVDR beamformer’s performance is unavoidable problem in almost existing speech applications due to microphone mismatches, the error of the direction of arrival of useful signal or the imprecise microphone array ‘s distribution or undetermined the properties of surrounding environment. These factors corrupt the signal processing and cause speech distortion or decreasing speech enhancement. Therefore, dealing the drawback of MVDR beamformer is essential problem, which was deal in this correspondence. 5. Conclusion Speech enhancement plays a major role in numerous speech applications, such as hand-free communication, mobile phones, audio processing, stereo-sound systems. Digital signal processing algorithms are chosen based on the type of properties of surrounding environment, the recording configuration, and different noisy cases. MA uses the spatial beampattern to retrieve the desired clean speech component from noisy situations and becomes increasingly important. In this correspondence, the author proposed a new spectral mask. The results of this study clearly show that the spectral mask is an appropriate approach for dealing with the speech distortion in MA beamforming and enhancing the speech quality. 6. Acknowledgements This research was supported by Digital Agriculture Cooperative. The author thanks our colleagues from Digital Agriculture Cooperative, who provided insight and expertise that greatly assisted the research. 7. References [1] Wang Y., Narayanan A., Wang D. On training targets for supervised speech separation. IEEE Trans. Audio, Speech, Lang. Process., vol. 22, no. 12, pp. 1849–1858, Dec. 2014. DOI: 10.1109/TASLP.2014.2352935. [2] Erdogan H., Hershey J.R., Watanabe S., Mandel M.I., Roux J.L. Improved MVDR beamforming using single-channel mask prediction networks // Proc. Annu. Conf. Int. Speech Commun. Assoc., 2016, pp. 1981–1985. DOI:10.21437/Interspeech.2016-552. [3] Zhang Z. On loss functions and recurrency training for GAN-based speech enhancement systems // Proc. Annu. Conf. Int. Speech Commun. Assoc., pp. 3266–3270, 2020. https://doi.org/10.48550/arXiv.2007.14974. [4] Paliwal K., Wójcicki K., Shannon B. The importance of phase in speech enhancement. Speech Commun., vol. 53, no. 4, pp. 465–494, 2011. DOI:10.1016/j.specom.2010.12.003. [5] Williamson D.S., Wang Y., Wang D. Complex ratio masking for joint enhancement of magnitude and phase // Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2016, pp. 5220–5224. DOI: 10.1109/ICASSP.2016.7472673. [6] Xu Y., Chen M., LaFaire P., Tan X., Richter C.P. Distorting temporal fine structure by phase shifting and its effects on speech intelligibility and neural phase locking. Sci. Rep., vol. 7, no. 1, pp. 1–9, 2017. DOI: 10.1038/s41598-017-12975-3. [7] Zhang Z., Williamson D. S., Shen Y. Investigation of phase distortion on perceived speech quality for hearing-impaired listeners // Proc. Annu. Conf. Int. Speech Commun. Assoc., pp. 2512–2516, 2020. https://doi.org/10.48550/arXiv.2007.14986. [8] Erdogan H., Hershey J. R., Watanabe S., Roux J. L. Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks // Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2015, pp. 708–712. DOI: 10.1109/ICASSP.2015.7178061. [9] Pascual S., Bonafonte A., Serra J. Speech enhancement generative adversarial network // Proc. INTERSPEECH, 2017, arXiv:1703.09452. DOI:10.21437/Interspeech.2017-1428. [10] Luo Y., Mesgarani N. Conv-TasNet: Surpassing ideal time-frequency magnitude masking for speech separation. IEEE Trans. Audio, Speech, Lang. Process., vol. 27, no. 8, pp. 1256–1266, Aug. 2019. https://doi.org/10.1109/TASLP.2019.2915167. [11] Stoller D., Ewert S., Dixon S. Wave-U-Net: A multi-scale neural network for end-to-end audio source separation // Proc. Int. Soc. Music Inf. Retrieval Conf., pp. 334–340, 2018. https://doi.org/10.48550/arXiv.1806.03185. [12] Luo Y., Mesgarani N. TasNet: Time-domain audio separation network for real-time, single- channel speech separation // Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2018, pp. 696–700. DOI: 10.1109/ICASSP.2018.8462116. [13] Huang Y., Benesty J., Chen J. Analysis and comparison of multi - channel noise reduction methods in a common framework. IEEE Trans. Audio, Speech, Lang. Process., vol. 16, no. 5, pp. 957–968, Jul. 2008. DOI: 10.1109/TASL.2008.921754. [14] Souden M., Benesty J., Affes S. New insights into non-causal multichannel linear filtering for noise reduction // Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2009, pp. 141–144. DOI: 10.1109/ICASSP.2009.4959540. [15] Van V., Buckley K. M. Beamforming: A versatile approach to spatial filtering. IEEE ASSP Mag., vol. 5, no. 2, pp. 4–24, Apr. 1988. DOI: 10.1109/53.665. [16] Warsitz E., Haeb-Umbach R. Blind acoustic beamforming based on generalized eigenvalue decomposition // IEEE Trans. Audio, Speech, Lang. Process., vol. 15, no. 5, pp. 1529–1539, Jul. 2007. DOI:10.1109/TASL.2007.898454. [17] Heymann J., Drude D., Chinaev A., Haeb-Umbach R. BLSTM supported GEV beamformer front- end for the 3rd CHiME challenge // Proc. IEEE Workshop Autom. Speech Recognit. Understanding, 2015, pp. 444–451. DOI: 10.1109/ASRU.2015.7404829. [18] Huang Y.A., Benesty J. A multi-frame approach to the frequency - domain single-channel noise reduction problem. IEEE Trans. Audio, Speech, Lang. Process., vol. 20, no. 4, pp. 1256–1269, May 2012. DOI: 10.1109/TASL.2011.2174226. [19] Schasse A., Martin R. Estimation of subband speech correlations for noise reduction via MVDR processing. IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 22, no. 9, pp. 1355–1365, Sep. 2014. DOI: 10.1109/TASLP.2014.2329633. [20] Fischer D., Doclo S. Robust Constrained MFMVDR Filtering for Single-Microphone Speech Enhancement // Proc. 16th Int. Workshop Acoust. Signal Enhancement, 2018, pp. 41–45. DOI:10.1109/TASLP.2020.3042013. [21] Xu Y. Neural spatio-temporal beamformer for target speech separation // Proc. Annu. Conf. Int. Speech Commun. Assoc., pp. 56–60, 2020. [22] Xiao X., Zhao S., Jones D. L., Li H. On time-frequency mask estimation for MVDR beamforming with application in robust speech recognition // Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2017, pp. 3246–3250. DOI: 10.1109/ICASSP.2017.7952756. [23] Xu Y. Joint training of complex ratio mask based beamformer and acoustic model for noise robust ASR // Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2019, pp. 6745–6749. DOI: 10.1109/ICASSP.2019.8682576. [24] Tammen M., Fischer D., Doclo S. DNN-based multi-frame MVDR filtering for single-microphone speech enhancement. 2019, arXiv:1905.08492. [25] Souden M., Chen J., Benesty J., Affes S. An integrated solution for online multichannel noise tracking and reduction. IEEE Trans. Audio, Speech, Lang. Process., vol. 19, no. 7, pp. 2159–2169, Sep. 2011. DOI: 10.1109/TASL.2011.2118205. [26] Taseska M., Habets E. A. Nonstationary noise PSD matrix estimation for multichannel blind speech extraction. IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 25, no. 11, pp. 2223–2236, Nov. 2017. DOI: 10.1109/TASLP.2017.2750239. [27] Chakrabarty S., Habets E. A. Time-frequency masking based online multi-channel speech enhancement with convolutional recurrent neural networks. IEEE J. Sel. Top. Signal Process., vol. 13, no. 4, pp. 787–799, Aug. 2019. DOI:10.1109/JSTSP.2019.2911401. [28] Martín-Doñas J. M., Jensen J., Tan Z. H., Gomez A. M., Peinado A. M. Online multichannel speech enhancement based on recursive EM and DNN-based speech presence estimation. IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 28, pp. 3080–3094, Nov. 2020, doi: 10.1109/TASLP.2020.3036776. DOI: 10.1109/TASLP.2020.3036776. [29] Higuchi T., Kinoshita K., Ito N., Karita S., Nakatani T. Frame-by-frame closed-form update for mask-based adaptive MVDR beamforming // Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2018, pp. 531–535. DOI: 10.1109/ICASSP.2018.8461850. [30] https://labrosa.ee.columbia.edu/projects/snreval/