<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>International Journal of Information Technology and Computer Science</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.1177/2331216518770964</article-id>
      <title-group>
        <article-title>Using Recurrent Neural Network to Noise Absorption from Audio Files</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Nataliya Boyko</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anastasiia Hrynyshyn</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Lviv Polytechnic National University</institution>
          ,
          <addr-line>Profesorska Street 1, Lviv, 79013</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <volume>7</volume>
      <fpage>8</fpage>
      <lpage>18</lpage>
      <abstract>
        <p>The study reveals the idea of noise absorption, that is, removing noise from an input signal with minimal distortion of the speech. During the study of this topic, many articles and publications were analyzed that present new approaches to noise absorption or modifications of existing ones. This paper considers noise absorption algorithms. High-performance algorithms for separating noise and human speech in an audio stream are also analyzed. The paper uses traditional algorithms for digital signal processing. The practical value of the results lies in improving the quality of video and audio calls by eliminating background noise, as well as improving voice recognition. The paper uses classic solutions for filtering unwanted noise. Experiments were performed to compare three different methods of noise processing in audio files. Statistical methods are used to build a noise model, which is then used to recover the output sound from the noisy input signal. The study uses deep learning for comparison. STOI and PESQ scores are used to evaluate the audio recordings obtained after noise removal.</p>
      </abstract>
      <kwd-group>
        <kwd>Artificial Intelligence</kwd>
        <kwd>Fourier transform</kwd>
        <kwd>fast Fourier transform</kwd>
        <kwd>discrete Fourier transform</kwd>
        <kwd>Convolutional Neural Network</kwd>
        <kwd>Recurrent Neural Network</kwd>
        <kwd>Short-time objective intelligibility</kwd>
        <kwd>mean square error</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Today there are many means of communication. A conversation partner can be on the other side of the world, yet talking to them is not a problem. There are situations, however, when communication is impaired by ambient background noise, because it is impossible to find a quiet place to talk. In this case, noise absorption algorithms are used.</p>
      <p>There is traditional noise suppression, which uses two or more microphones [16]. The first microphone is located at the lower front of the phone, closest to the user's mouth, to directly capture their voice during a conversation. The second microphone is placed as far away from the first as possible, usually at the top back of the phone.</p>
      <p>Both microphones pick up ambient sounds. The microphone closer to the mouth captures more of the speaker's voice, while the other captures less of it. The software effectively separates the two signals, giving an almost clean "voice".</p>
      <p>This may sound easy, but there are many situations where such technology does not work. For
example, when a person does not speak, so the microphones receive only noise, or actively
shakes/turns the phone during a conversation, as during a run. Solving these problems is a complex
process.</p>
      <p>Traditional digital signal processing (DSP) algorithms [17] constantly try to find a noise pattern
and adapt it. These algorithms work well in some cases; however, they don't scale to the variety
and variability of noise in our everyday environment. That is why deep learning is used to solve
this problem.</p>
      <p>The relevance of the topic: there are many definitions of noise, but in general it is background sound caused by people, music, car traffic, and more. These are primarily sounds that should not be present in a conversation, video, or audio file. Noise distracts the audience's attention from the core material and therefore degrades the perception of information. The main risk of noise in audio files, however, is poor speech recognition. Many technologies work with voice commands, but due to excessive noise the voice may be poorly recognized, so the program will not perform the correct task or may not receive the command at all. Noise suppression is used to eliminate this risk.</p>
      <p>The main idea of noise absorption is that the input is a signal with noise and the output is that signal with the noise removed and minimal speech distortion. This topic has been studied since the 1970s; one early example is the suppression of acoustic noise in speech by spectral subtraction [18]. Although research on this problem began a long time ago, the topic remains relevant to this day, since there is no perfect solution.</p>
      <p>Having received a noisy signal at the input, we strive to filter out unwanted noise without degrading the underlying signal. There are classic solutions to this problem. First, they use generative modeling, which applies statistical methods such as Gaussian filters to build a noise model. That model can then be used to recover the clean sound from the noisy input signal. Recently, however, developments have shown that deep learning is superior to these classical solutions, provided enough data is available.</p>
      <p>The work's goal is to increase noise absorption efficiency to reduce the risk of incorrect speech
recognition and train the recurrent network on different types of noise.</p>
      <p>The practical value of the results obtained in this work will help improve the quality of video
and audio calls, eliminating background noise. This model will also reduce the risks of incorrect
voice recognition caused by background noise.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Review of Literature Sources</title>
      <p>During the study and research of this topic, I found many different articles and publications. Each of them presents a new approach to noise absorption or a modification of existing ones. These materials are reviewed below with a brief analysis of the specific techniques used.</p>
      <p>The idea of using deep neural networks was first covered in the article «A regression approach to speech enhancement based on deep neural network» [19], authored by Yong Xu, Jun Du, Li-Rong Dai, and Chin-Hui Lee. The basic idea is to use a regression method that produces a ratio mask for each sound frequency. The purpose of this mask is to remove extraneous noise while leaving the human voice intact. This method was far from perfect but was an excellent early solution.</p>
      <p>After the publication of the idea using deep neural networks, various theories were proposed,
one of which is using a recurrent neural network. This method was demonstrated in the RNNoise
project. Combining classical signal processing with deep learning to create a real-time noise
absorption algorithm is the main idea. A more detailed description is given in the article «A Hybrid
DSP/Deep Learning Approach to Real-Time Full-Band Speech Enhancement»[20], authored by
Jean-Marc Valin.</p>
      <p>Another interesting example of the use of neural networks for noise absorption was proposed in «Practical Deep Learning Audio Denoising» [21], authored by Thalles Santos Silva. This article used a convolutional neural network (CNN) to create a statistical model that can extract the pure signal and return it to the user. In most results, however, the model manages only to smooth out the noise, not to remove it. Therefore, my choice fell on recurrent neural networks.</p>
      <p>The following article, authored by Michael Michelashvili and Lior Wolf, proposed a noise absorption method that trains on a noisy sound signal and produces a clean baseline signal [22]. However, the technique is not fully supervised and is trained only on the specific audio file being denoised. The disadvantage of this implementation is that if the type of noise changes, the neural network, which was trained on other data, will not provide noise absorption.</p>
      <p>Another method of training recurrent neural networks was proposed in the article «Listening to Sounds of Silence for Speech Denoising» [23] by Ruilin Xu, Rundi Wu, Yuko Ishiwaka, Carl Vondrick, and Changxi Zheng. The proposed approach is based on an observation about human speech, namely the pauses between words and sentences. These intervals are used to train the model. Since this algorithm learns the noise in real time, it can model the noise dynamics and absorb it. This method, in my opinion, is one of the best because, in contrast to the previous one, it can adapt to changes in the noise.</p>
      <p>Noise absorption is used not only for audio and video calls; it is also used in hearing aids. The article entitled "Use of a Deep Recurrent Neural Network to Reduce Wind Noise: Effects on Judged Speech Intelligibility and Sound Quality" [24], written by Mahmoud Keshavarzi, Tobias Goehring, Justin Zakis, Richard E. Turner, and Brian C. J. Moore, demonstrated the use of an RNN to reduce wind noise, which improved sound quality. Recurrent neural networks performed significantly better than high-frequency filtering. These results were tested with the help of eighteen participants, nine of whom had mild or moderate hearing impairments. According to them, the sound quality and intelligibility were much better when using the RNN.</p>
      <p>Analysis of the sources described above provided more information about the use of deep neural networks and their various practical applications. In addition, multiple methods and modifications have been developed that achieve much better noise absorption results.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Materials and Methods</title>
      <p>The separation of noise and human speech in the audio stream is a complex problem for which
there are no high-performance algorithms.</p>
      <p>Traditional digital signal processing (DSP) algorithms [17] try to constantly find the noise
pattern and adapt it by processing the sound frame by frame.</p>
      <p>There are two basic types of noise: stationary and nonstationary. An example is shown below in Fig. 1. Stationary means that the noise statistics regarding intensity, spectrum shape, and other factors are unchanged over time. Metaphorically speaking, stationary means that none of the statistical parameters of the process changes its position in the parameter space. Traditional DSP algorithms (adaptive filters) can be quite effective in filtering such noise. Let's take a closer look.</p>
      <p>Digital Signal Processing (DSP) is a dynamically evolving field of computer technology that covers both hardware and software [25]. Related areas include information theory, optimal signal reception theory, and pattern recognition theory. In the first case, the main task is to detect the signal against background noise and interference of different physical natures; in the second, to automatically recognize, i.e., classify and identify, the signal.</p>
      <p>Digital processing uses the representation of signals in the form of sequences of numbers or
symbols. The purpose of such processing may be to evaluate the signal's characteristic parameters
or convert the signal into a format that is in some sense more convenient. For classical numerical
analysis, formulas such as interpolation, integration, and differentiation are digital processing
algorithms. High-speed digital computers contribute to increasingly complex and efficient signal
processing algorithms; recent advances in integrated circuit technology promise high
cost-effectiveness in building very complex digital signal processing systems.</p>
      <p>Digital signal processing is an alternative to traditional analog processing. Its most important qualitative advantages include the ability to implement arbitrarily complex (optimal) processing algorithms; guaranteed accuracy independent of destabilizing factors; programmability and functional flexibility; the possibility of adaptation to the processed signals; and manufacturability.</p>
      <p>The development of a new perspective on digital signal processing was accelerated by the
discovery in 1965 of efficient algorithms for calculating Fourier transforms. This class of
algorithms became known as fast Fourier transform (FFT).</p>
    </sec>
    <sec id="sec-4">
      <title>3.1. Fast Fourier Transform</title>
      <p>The fast Fourier transform (FFT) is an algorithm that efficiently computes the discrete Fourier transform (DFT) of a given sequence [20]. The difference between the FT (Fourier transform) and the DFT is that the FT operates on a continuous signal, while the DFT receives a discrete signal at its input; the FFT is simply a fast way of computing the DFT. The DFT converts a sequence into its frequency components in the same way that the FT does for a continuous signal. In other words, the FFT converts a signal from the time domain to the frequency domain.</p>
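      <p>As a quick sanity check, the equivalence between the DFT definition and the FFT can be demonstrated in a few lines of numpy (an illustrative sketch, not part of the original paper):</p>

```python
import numpy as np

# Naive O(n^2) DFT straight from the definition:
# X[f] = sum_n x[n] * exp(-2j*pi*f*n/N)
def naive_dft(x):
    n = np.arange(len(x))
    # Matrix of complex exponentials over all (frequency, time) index pairs
    M = np.exp(-2j * np.pi * np.outer(n, n) / len(x))
    return M @ x

rng = np.random.default_rng(0)
x = rng.standard_normal(64)

# numpy's FFT computes the same DFT, but in O(n log n)
assert np.allclose(naive_dft(x), np.fft.fft(x))
```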
      <p>The visualization of the process is demonstrated below (Fig. 2). FFT analysis works as follows. In the first step, a portion of the signal is scanned and stored in memory for further processing. Two parameters are relevant:
1. The sampling frequency (fs) of the measuring system (for example, 48 kHz). This is the average number of samples obtained per second.
2. The selected number of samples, or block length (BL).</p>
      <p>From the two main parameters fs and BL you can determine further measurement parameters.
For example, bandwidth (fn) indicates the theoretical maximum frequency determined using FFT
(Formula 1).</p>
      <p>fn = fs / 2
For example, at a sampling frequency of 48 kHz, it is theoretically possible to determine frequency components up to 24 kHz. In an analog system, however, the practically realizable value is usually slightly lower, due to the analog anti-aliasing filters - for example, around 20 kHz.</p>
      <p>Measurement of duration (D). The measurement duration is determined by the sampling
frequency fs and the length of the block BL (Formula 2).</p>
      <p>D = BL / fs,
where fs = 48 kHz and BL = 1024 give 1024 / 48000 = 21.33 ms.</p>
      <p>Frequency resolution (df) indicates the frequency interval between two measurement results
(Formula 3).</p>
      <p>
        df = fs / BL
In practice, the sampling rate fs is usually a fixed value given by the system. However, by selecting the block length BL, you can control the measurement duration and the frequency resolution. The following applies:
• A short block length results in rapid repetition of measurements with coarse frequency resolution.
• A long block length results in slower repetition of measurements with fine frequency resolution.
      </p>
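      <p>The three derived parameters can be computed directly from fs and BL; a minimal sketch using the example values from Formulas 1-3:</p>

```python
fs = 48_000   # sampling frequency, Hz
BL = 1024     # block length, samples

fn = fs / 2   # bandwidth (Formula 1): theoretical maximum frequency
D = BL / fs   # measurement duration (Formula 2), in seconds
df = fs / BL  # frequency resolution (Formula 3), in Hz

print(fn, D, df)  # 24000.0  0.021333...  46.875
```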
    </sec>
    <sec id="sec-5">
      <title>3.2. Spectral subtraction</title>
      <sec id="sec-5-1">
        <title>The method of spectral subtraction is widespread.</title>
        <p>Additive stationary noise is generated by the environment, sound recording equipment, etc. Stationarity means that the properties of the noise (power, spectral composition) do not change over time. Additivity means that the noise is summed with the "pure" signal y[t] and does not depend on it (Formula 4):</p>
        <p>x[t] = y[t] + noise[t],
where t is the time.</p>
        <p>A spectral subtraction algorithm is used to suppress additive stationary noise. It consists of the following stages:
1. Decomposition of the signal by the short-time (windowed) Fourier transform (STFT), which compactly localizes the signal energy.
2. Assembly of the noise footprint to subtract. The noise model is obtained by averaging the amplitude spectra taken from a pre-selected region of noise that does not contain the proper signal (Formula 5).</p>
        <p>footprint[f] = (1/k) ∑_{t=1}^{k} noise[f, t],
where noise[f, t] is the noise spectrum, f is the Fourier transform index corresponding to the frequency, t is the number of the current STFT window, and k is the number of windows in the region with noise.</p>
        <p>3. "Subtraction" (in the generalized sense) of the amplitude spectrum of noise from the
amplitude spectrum of the signal.
4. Inverse conversion of STFT - synthesis of the resulting signal.</p>
      </sec>
      <sec id="sec-5-2">
        <title>Subtraction of amplitude spectra is carried out by formula 6:</title>
        <p>Y[f, t] = max{X[f, t] − k · W[f, t], 0},
where X[f, t] and W[f, t] are the amplitude spectra of the signal and the noise, respectively;
Y[f, t] is the amplitude spectrum of the resulting purified signal;
k is the suppression factor.</p>
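        <p>A minimal numpy sketch of Formulas 5 and 6, with toy spectrograms standing in for real STFT data (the array shapes and the suppression factor k are illustrative assumptions):</p>

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy amplitude spectrograms: rows = frequency f, columns = STFT window t
X = rng.uniform(0.0, 2.0, size=(129, 50))             # noisy-signal spectrum
noise_region = rng.uniform(0.0, 0.5, size=(129, 10))  # windows known to contain only noise

# Formula 5: average the noise-only windows into a footprint W[f]
W = noise_region.mean(axis=1, keepdims=True)

# Formula 6: Y[f, t] = max(X[f, t] - k * W[f], 0)
k = 2.0  # suppression factor
Y = np.maximum(X - k * W, 0.0)
```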
        <p>
          The phase spectrum of the cleaned signal is taken to be equal to the phase spectrum of the noisy input signal. The result of this method is shown in Fig. 3.
The problem with these methods is that the FFT and spectral subtraction are not suitable for nonstationary signal analysis, because nonstationary signals consist of frequency components that change over time. As is known, the Fourier transform is suitable for signals whose frequencies are fixed over time (e.g., sine waves, voiced signals). For nonstationary signals, the Fourier transform cannot give the proper spectrum, and we will not know which frequencies are present at what time. In spectral subtraction, the STFT coefficients of noise signals are statistically random, which leads to uneven noise elimination.
        </p>
        <p>Nonstationary noises have complex patterns that are difficult to distinguish from the human voice. Moreover, such noise can be very short-lived, coming and going quickly (for example, keyboard clicks or a siren). To handle both stationary and nonstationary noise, you need to go beyond traditional DSP.</p>
        <p>To better eliminate noise, various neural network methods are used, some of which were discussed briefly in the review of literature sources. Let us consider some of these methods in more detail.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>3.3. Method using convolutional neural networks</title>
      <p>This method is based on "A Fully Convolutional Neural Network for Speech Enhancement" [21]. In it, the author proposes a Cascaded Redundant Convolutional Encoder-Decoder network (CR-CED).</p>
      <p>The model is based on symmetric encoder-decoder architectures. Both components contain
repetitive convolution blocks, ReLU, and batch normalization. In total, the network includes 16
such blocks - this adds up to 33K parameters.</p>
      <p>In addition, there are skip connections between some encoder and decoder units, where the feature vectors of the two components are combined by addition. As in ResNets, these shortcuts accelerate convergence and mitigate the vanishing gradient problem.</p>
      <p>Another essential feature of the CR-CED network is that the convolution is performed in only one dimension. More specifically, given an input spectrum of shape (129 x 8), the convolution is performed only along the frequency axis (i.e., the first one). This ensures that the frequency axis remains unchanged during forward propagation.</p>
      <p>The combination of a small number of learnable parameters and this model architecture makes the model extremely lightweight and fast to execute, especially on mobile devices.</p>
      <p>Once the network produces its output, we optimize (minimize) the mean squared error (MSE) between the output and the target (pure sound) signal (Fig. 4).</p>
      <sec id="sec-6-1">
        <title>The results of this method are presented in Fig. 5.</title>
        <p>Figure 5 shows the initial audio without noise, the audio to which the noise was added, and the result of processing by this method. As you can see, given the complexity of the task, the results are acceptable but not perfect, because some noise remains in the audio file.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>3.4. Method using a recurrent neural network (GRU)</title>
      <p>This method grew out of work on noise removal using artificial intelligence [19]. It does not rely on deep learning alone; it uses a hybrid approach. The central processing loop is based on 20 ms windows with 50% overlap (a 10 ms offset). Both analysis and synthesis use a Vorbis window, which satisfies the Princen-Bradley criterion. The window is defined by the following formula 7:</p>
      <p>
        w(n) = sin[(π/2) · sin²(π(n + 1/2) / N)],
where N is the length of the window.
      </p>
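      <p>The window and the Princen-Bradley property it satisfies (w(n)² + w(n + N/2)² = 1, which makes reconstruction with 50% overlap exact) can be verified numerically; a sketch assuming a 20 ms window at 48 kHz:</p>

```python
import numpy as np

N = 960  # window length: 20 ms at 48 kHz
n = np.arange(N)

# Vorbis window (Formula 7): w(n) = sin[(pi/2) * sin^2(pi*(n + 1/2)/N)]
w = np.sin((np.pi / 2) * np.sin(np.pi * (n + 0.5) / N) ** 2)

# Princen-Bradley condition: w(n)^2 + w(n + N/2)^2 == 1 for all n in [0, N/2)
half = N // 2
assert np.allclose(w[:half] ** 2 + w[half:] ** 2, 1.0)
```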
      <p>Fig. 7 shows a block diagram of this method. To avoid a huge number of outputs and, consequently, a large number of neurons, the algorithm does not work directly with the samples or with the full spectrum. It considers frequency bands that follow the Bark scale, which corresponds to how we perceive sound. A total of 22 bands are used instead of 480 spectral values, which reduces the number of calculations. Let w_b(k) be the amplitude of band b at frequency k; then ∑_b w_b(k) = 1. For the transformed signal X(k), the energy in a band is calculated by formula 8:</p>
      <sec id="sec-7-1">
        <title>The gain in the band is defined as gb.</title>
        <p>E(b) = ∑_k w_b(k) |X(k)|²</p>
        <p>g_b = √(E_s(b) / E_x(b)),
where E_s(b) is the energy of the pure speech and E_x(b) is the energy of the input (noisy) speech.</p>
        <p>Given the ideal gain ĝ_b, the following interpolated gain is applied to each frequency bin k (formula 10):
r(k) = ∑_b w_b(k) ĝ_b
The main drawback of the lower resolution we get from using bands is that we lack a fine enough resolution to suppress the noise between pitch harmonics. However, this task is not essential and can easily be handled with a comb filter.</p>
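        <p>A toy numpy sketch of formulas 8-10, with random stand-ins for the band responses w_b(k) and for the spectra (in the real method the 22 band shapes follow the Bark scale):</p>

```python
import numpy as np

rng = np.random.default_rng(2)
n_bands, n_bins = 22, 480

# Stand-in band responses w_b(k), normalized so that sum_b w_b(k) = 1
W = rng.uniform(0.1, 1.0, size=(n_bands, n_bins))
W /= W.sum(axis=0, keepdims=True)

Xs = rng.standard_normal(n_bins) + 1j * rng.standard_normal(n_bins)               # clean spectrum
Xx = Xs + 0.3 * (rng.standard_normal(n_bins) + 1j * rng.standard_normal(n_bins))  # noisy spectrum

# Formula 8: per-band energy E(b) = sum_k w_b(k) |X(k)|^2
Es = W @ (np.abs(Xs) ** 2)
Ex = W @ (np.abs(Xx) ** 2)

# Formula 9: per-band gain g_b = sqrt(E_s(b) / E_x(b))
g = np.sqrt(Es / Ex)

# Formula 10: interpolate the 22 band gains back to all 480 frequency bins
r = W.T @ g
```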
        <p>Since the result we calculate is based on 22 bands, there is little point in supplying a higher input resolution, so the same 22 bands are used to feed spectral information to the neural network [22].</p>
        <p>To better prepare the data for training, a DCT is applied to the log spectrum. At the output, we obtain 22 Bark-frequency cepstral coefficients (BFCC). These form a cepstrum based on the Bark scale, closely related to the MFC coefficients often used for speech recognition.</p>
        <p>In addition to our cepstral coefficients, the following is also added:
• The first and second derivatives of the first 6 coefficients across frames
• The pitch period (1/frequency of the fundamental)
• The pitch gain (voicing strength) in 6 bands
• A special non-stationarity value that's useful for detecting speech (but beyond the scope of
this demo).</p>
        <p>This makes a total of 42 input features for the neural network. The neural network architecture used in this method is inspired by the traditional approach to noise suppression. Most of the work is performed by three layers of GRUs. Figure 7 shows the layers used to calculate the band gains.</p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>4. Experiments</title>
      <p>We will conduct experiments comparing three different methods of noise processing in audio files.
Two of them relate to algorithms using artificial intelligence, namely CNN and RNN.</p>
      <p>The first algorithm to be used for comparison is spectral subtraction. The main steps of this algorithm:
• Calculate the FFT of an audio clip that contains only noise
• Compute statistics over the FFT of the noise
• Calculate a threshold based on the noise statistics
• Calculate the FFT of the signal
• Determine a mask by comparing the FFT of the signal with the threshold value
• Smooth the mask with a filter over frequency and time
• Apply the mask to the FFT of the signal and invert it</p>
      <p>To begin, we load the data without noise (Fig. 8) and divide the samples from the file by 32768, because the file we load has the wav extension: its 16-bit data lies in the range [-32768, 32767], so dividing by 32768 maps it to the range [-1, 1]. The next step is to add noise to the audio file (Fig. 9-10); the noise file also has the wav extension. The preparations for working with a noisy audio file are now complete, and we can start with the central part, namely the calculation of the FFT. The STFT function calculates the FFT of the audio file containing noise. The Short-Time Fourier Transform (STFT) is a Fourier-related transform used to determine the frequency and phase content of local sections of a signal as it changes over time. In practice, calculating the STFT means dividing a long signal into shorter segments of equal length and then calculating the Fourier transform separately for each segment.</p>
      <p>The STFT algorithm consists of the following steps:
• Select a data segment from the overall signal
• Multiply this segment by a semi-cosine window function
• Pad the end of the segment with zeros
• Compute the Fourier transform of this segment over the positive and negative frequencies
• Combine the energy of the positive and negative frequencies and return the one-sided spectrum
• Scale the resulting spectrum in dB for easy viewing
• Record the signal to eliminate the noise beyond the noise threshold</p>
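      <p>The segmentation-plus-FFT core of these steps can be sketched in numpy as follows (the window choice, FFT size, and hop are illustrative assumptions, not the exact values used in the experiments):</p>

```python
import numpy as np

def stft(x, n_fft=512, hop=128):
    """Short-time Fourier transform: split the signal into overlapping
    windowed segments and take the FFT of each one."""
    window = np.hanning(n_fft)  # semi-cosine (Hann) window
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    # rfft keeps only the non-negative frequencies of a real signal,
    # i.e. the one-sided spectrum
    return np.fft.rfft(frames, axis=1)

rng = np.random.default_rng(3)
x = rng.standard_normal(48_000)  # 1 s of synthetic audio at 48 kHz
S = stft(x)                      # rows = windows, columns = frequency bins
```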
      <p>After performing this function, we obtain a complex-valued matrix of short-time Fourier transform coefficients. We convert it to dB and proceed to the next step.</p>
      <p>We then calculate the noise statistics, namely the mean and standard deviation for each frequency. Next, we multiply the standard deviation by the parameter n_std_thresh, which sets how many standard deviations above the mean the sound must be to count as signal rather than noise; by default it has a value of 1.5. Adding this product to the mean gives the noise threshold.</p>
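      <p>A sketch of this threshold computation (the dB spectrogram here is synthetic, and the shapes are illustrative):</p>

```python
import numpy as np

rng = np.random.default_rng(4)
# dB-scaled noise spectrogram: rows = frequency bins, columns = STFT windows
noise_db = rng.normal(-40.0, 5.0, size=(257, 100))

# Per-frequency noise statistics
mean_freq = noise_db.mean(axis=1)
std_freq = noise_db.std(axis=1)

# How many standard deviations above the mean counts as signal, not noise
n_std_thresh = 1.5
noise_thresh = mean_freq + std_freq * n_std_thresh
```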
      <p>We calculate the STFT for the non-noisy signal and also convert it to dB. Now we create the mask: for this, we find the minimum value of the complex matrix obtained in the previous step, create a smoothing filter for the mask over time and frequency, and calculate the threshold for each frequency interval.</p>
      <p>We then smooth the mask with the fftconvolve filter. Convolution is a simple mathematical operation that requires multiplying vectors, so its direct execution has complexity O(n²). To speed up the process, the convolution is performed via the fast Fourier transform, which reduces the complexity from O(n²) to O(n log n). The algorithm is presented in Fig. 11. After creating the mask, we proceed to the final stage: removing the noise from the audio file. To do this, we use the inverse Fourier transform: each window is returned to the time domain using the IFFT, then shifted by the step size and added to the result of the previous shift. The following diagram represents this process. In the end, we return the resulting audio file with reduced noise, presented in Fig. 13. The next algorithm uses a convolutional neural network (CNN) to reduce noise in an audio file; its architecture consists of an encoder and a decoder with residual connections between pairs of layers.</p>
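      <p>The equivalence between direct and FFT-based convolution that motivates fftconvolve can be shown with numpy alone (the moving-average smoothing kernel is an illustrative assumption):</p>

```python
import numpy as np

def fft_convolve(a, b):
    """Linear convolution via the FFT: O(n log n) instead of the O(n^2) direct form."""
    n = len(a) + len(b) - 1
    # rfft zero-pads both inputs to length n; pointwise product in the
    # frequency domain equals convolution in the time domain
    return np.fft.irfft(np.fft.rfft(a, n) * np.fft.rfft(b, n), n)

rng = np.random.default_rng(5)
mask = rng.uniform(size=200)
smoothing = np.ones(9) / 9.0  # simple moving-average smoothing kernel

# Same result as the direct O(n^2) convolution
assert np.allclose(fft_convolve(mask, smoothing), np.convolve(mask, smoothing))
```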
      <p>The first step is to initialize the weights, and this is an important step. If the weights are too small, the variance of the input signal begins to shrink as it passes through each layer of the network; eventually the activations fall to very small values and are no longer useful. On the other hand, if the weights are too large, the variance of the data tends to grow rapidly with each subsequent layer. Thus, initializing the network with suitable weights is very important for it to work correctly, and we need to make sure the weights are within reasonable limits before training begins. That is why Xavier initialization is used.</p>
      <p>Xavier initialization is an initialization scheme for neural networks. The biases are initialized to 0, and each weight W_ij is initialized as W_ij ~ U[-1/√n, 1/√n], where U is a uniform distribution and n is the size of the previous layer (the number of columns in W).</p>
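      <p>A sketch of this initialization scheme in numpy (the layer sizes are arbitrary):</p>

```python
import numpy as np

def xavier_init(n_in, n_out, rng):
    """Xavier initialization: W_ij ~ U[-1/sqrt(n), 1/sqrt(n)],
    where n is the size of the previous layer."""
    limit = 1.0 / np.sqrt(n_in)
    return rng.uniform(-limit, limit, size=(n_out, n_in))

rng = np.random.default_rng(6)
W = xavier_init(256, 128, rng)

# All weights stay within the prescribed bounds
assert abs(W).max() <= 1.0 / np.sqrt(256)
```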
      <p>In the second step, we initialize the vector z with random values from 0 to 1. The next step is to obtain a mask the size of the STFT of the signal, with values in the range [0, 1]; the signal Y is used as the method's input.</p>
      <p>Once we have obtained the vector z, the method proceeds through iterations. The number of iterations is set by the parameter t, which is passed to the function along with the audio file. Each iteration has the following steps. First, the network f_{i-1} is trained for one iteration, producing f_i. Then f_i(z) and its STFT Y_i are calculated. Next, we find the value H_i, the absolute difference between Y_i and Y_{i-1}, and normalize the resulting difference by Y_i.</p>
      <p>The following steps check the obtained value of H_i. To get rid of extreme values, everything below 10 and above 90 is truncated. The value C is then accumulated as the product of the matrices; C will have high values at the time-frequency coordinates where the recovery of y by the network f is least stable.</p>
      <p>After completing all iterations, the value of C is normalized to lie in the range from 0 to 1. A high accumulation of variability implies noise, and therefore the values are flipped (max(C) − C rather than C − min(C)) before returning the mask M.</p>
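      <p>The flip-and-normalize step can be sketched as follows (C here is a random stand-in for the accumulated variability map):</p>

```python
import numpy as np

rng = np.random.default_rng(7)
C = rng.uniform(0.0, 5.0, size=(257, 100))  # accumulated variability map

# Normalize to [0, 1]; high variability implies noise, so the values are
# flipped (max(C) - C rather than C - min(C)) before use as the mask M
M = (C.max() - C) / (C.max() - C.min())
```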
      <p>The method using recurrent neural networks uses a recurrent network with GRUs designed to overcome the noise in the audio recording. This neural network architecture is based on three recurrent layers, each responsible for one of the main components. It includes 215 units in total across 4 hidden layers, the largest layer having 96 units. Increasing the number of layers does not significantly improve the quality of noise absorption. However, the loss function and the way the training data is constructed substantially influence the final result.</p>
      <p>One of the essential parts of learning is the dataset. To train the network, both noisy and pure speech are needed, so the training data is built artificially, as for the previous algorithms.</p>
      <p>Noise is mixed in at different levels to provide a wide range of signal-to-noise ratios, including clean-speech-only and noise-only segments. The algorithm does not use cepstral mean normalization, and data augmentation is used to make the network robust to changes in frequency response. This is achieved by filtering the noise and speech signals independently for each training example using second-order filters (formula 11).</p>
      <p>H(z) = (1 + r1·z^−1 + r2·z^−2) / (1 + r3·z^−1 + r4·z^−2), (11)</p>
      <p>where r1, ..., r4 are random values uniformly distributed in the range from −3/8 to 3/8. Robustness to the
amplitude of the signal is achieved by varying the final level of the mixed signal.</p>
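      <p>Formula (11) is a biquad (second-order IIR) filter; the augmentation step can be sketched in pure Python (function names are illustrative, not from the paper's code):

```python
import random

def random_biquad_coeffs(rng=random):
    """Draw r1..r4 uniformly from [-3/8, 3/8], as in formula (11)."""
    return [rng.uniform(-3 / 8, 3 / 8) for _ in range(4)]

def biquad(x, r1, r2, r3, r4):
    """Apply H(z) = (1 + r1 z^-1 + r2 z^-2) / (1 + r3 z^-1 + r4 z^-2) to a sequence."""
    y = []
    for n, xn in enumerate(x):
        xn1 = x[n - 1] if n >= 1 else 0.0
        xn2 = x[n - 2] if n >= 2 else 0.0
        yn1 = y[n - 1] if n >= 1 else 0.0
        yn2 = y[n - 2] if n >= 2 else 0.0
        # difference equation of the transfer function above
        y.append(xn + r1 * xn1 + r2 * xn2 - r3 * yn1 - r4 * yn2)
    return y
```

With |r3|, |r4| ≤ 3/8, the poles of the denominator stay inside the unit circle, so every randomly drawn filter is stable.</p>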
      <p>In total, there are 6 hours of speech and 4 hours of noise data, which we use to generate 140
hours of noisy speech using various combinations of gains and filters and by resampling the
data to rates from 40 kHz to 54 kHz.</p>
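      <p>Mixing noise at a target signal-to-noise ratio, as described above, can be sketched as follows (mix_at_snr is a hypothetical helper for illustration, not from the paper):

```python
import math

def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so the speech/noise power ratio equals the requested SNR (dB),
    then return the sample-wise mix."""
    p_speech = sum(s * s for s in speech) / len(speech)
    p_noise = sum(n * n for n in noise) / len(noise)
    gain = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10.0)))
    return [s + gain * n for s, n in zip(speech, noise)]
```

Sweeping snr_db over a wide range (including very low and very high values) yields the spread of training conditions, from noise-dominated segments to nearly clean speech.</p>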
      <p>The RNNoise class consists of the following methods:
• read_wav(): takes the name of a .wav audio recording, converts it to a supported format
(16-bit, mono), and returns a pydub.AudioSegment object with the audio recording
• write_wav(): accepts the name of a .wav audio recording and a pydub.AudioSegment
object (or a byte string with audio data without wav headers) and saves the audio recording
under the given name
• filter(): accepts a pydub.AudioSegment object (or a byte string with audio data without
wav headers), resamples it to 48000 Hz, splits the audio into frames (10
milliseconds long), clears them of noise, and returns a pydub.AudioSegment object (or
a byte string without wav headers) while preserving the original sampling rate
• filter_frame(): clears a single frame (10 ms, 16-bit, mono, 48000 Hz) of noise (accessing
the binary of the RNNoise library directly)</p>
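      <p>At 48000 Hz, 16-bit mono, the 10 ms framing performed by filter() works out to 480 samples, i.e. 960 bytes per frame. A sketch of the split (zero-padding the final short frame is an assumption for illustration, not necessarily what the wrapper does):

```python
RATE = 48000          # samples per second
FRAME_MS = 10         # frame length in milliseconds
SAMPLE_BYTES = 2      # 16-bit mono PCM
FRAME_BYTES = RATE * FRAME_MS // 1000 * SAMPLE_BYTES  # 480 samples -> 960 bytes

def split_frames(raw: bytes):
    """Split raw 16-bit mono 48 kHz PCM into 10 ms frames, zero-padding the tail."""
    frames = [raw[i:i + FRAME_BYTES] for i in range(0, len(raw), FRAME_BYTES)]
    if frames and len(frames[-1]) < FRAME_BYTES:
        frames[-1] += b"\x00" * (FRAME_BYTES - len(frames[-1]))
    return frames
```

Each resulting frame would then be passed to filter_frame() and the denoised frames concatenated back into one stream.</p>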
      <p>The input is an audio file that contains some noise (Fig. 14 (a)), and the output is an audio file with
reduced noise (Fig. 14 (b)).
Figure 14: Audio file diagram with noise (a) and after processing (b)</p>
    </sec>
    <sec id="sec-9">
      <title>Results</title>
      <p>The above algorithms were tested on four different audio recordings containing stationary and nonstationary noise, such as music
or background conversation.</p>
      <p>STOI and PESQ scores were used to evaluate the audio recordings obtained after noise
removal. STOI is a metric for predicting the intelligibility of noisy speech rather than the quality of speech
(which is usually evaluated in silence). The main subjective tests of this method are
intelligibility tests (asking listeners to report recognized words, symbols, etc.) [18].</p>
      <p>PESQ is a family of standards that includes a testing methodology for automatically assessing
the quality of speech as experienced by a telephone system user. It was standardized as
Recommendation ITU-T P.862 [19].</p>
      <p>The results are presented in Table 1.
When working with stationary noise, every network showed good results. However, spectral
subtraction showed the worst noise elimination on data with street noise, because street noise has sudden
drops and rises. The recurrent neural network showed its worst result when removing background music,
which effectively becomes noise and is quite challenging to deal with.</p>
      <p>Audio diagrams with and without the added musical noise are shown in
Fig. 15.
Figure 15: (a) sound with added music noise, (b) sound after processing by spectral subtraction,
(c) sound after processing by a recurrent neural network, (d) sound after processing by a
convolutional neural network
The graphs in Fig. 15 display the sound before and after processing
by the different methods. The convolutional network coped best with this sound, reducing the amount
of noise the most. It is also worth noting that one of the essential requirements of noise absorption is not to
degrade the sound itself. By this measure spectral subtraction performs worst, because in
addition to the noise it removes part of the speech itself, which sometimes makes the speech harder to
recognize. The advantages of this algorithm are its simplicity and the absence of any training requirement.</p>
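      <p>The spectral subtraction principle discussed here can be sketched with a naive DFT (an illustrative pure-Python toy, not the implementation evaluated in the experiments): subtract an estimated noise magnitude spectrum from each frame, floor negative magnitudes at zero, and resynthesize with the noisy phase.

```python
import cmath, math

def dft(x):
    """Naive O(N^2) discrete Fourier transform of a real sequence."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
            for k in range(N)]

def idft(X):
    """Inverse DFT, returning the real part of the reconstruction."""
    N = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * k * n / N) for k in range(N)).real / N
            for n in range(N)]

def spectral_subtract(frame, noise_mag):
    """Remove an estimated noise magnitude spectrum, keeping the noisy phase."""
    out = []
    for k, Xk in enumerate(dft(frame)):
        mag = max(abs(Xk) - noise_mag[k], 0.0)  # half-wave rectify: floor at zero
        out.append(cmath.rect(mag, cmath.phase(Xk)))
    return idft(out)
```

The flooring at zero is also the source of the "musical noise" artifacts that make plain spectral subtraction damage speech more than the learned methods.</p>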
      <p>In this case, the CNN algorithm was better than RNN, but as shown in Table 1, RNN was better
at removing street and stationary noise, by 0.0024297737659774 and
0.0147917149013511, respectively. This method also showed promising PESQ
results. Diagrams of the sound with added street noise and of the sound after processing
by the methods are presented in Fig. 16.</p>
    </sec>
    <sec id="sec-10">
      <title>6. Discussion</title>
      <p>The experiments section demonstrates various methods for removing noise from an audio file:
spectral subtraction, recurrent neural networks, and convolutional neural networks. These
methods were tested on different types of sound: stationary noise, background conversation
sounds, street sounds, and background music noise. We present the results using bar charts.
To begin with, let us analyze the STOI estimate (Fig. 17).</p>
      <p>Figure 17: STOI estimates for stationary, office, street, and music noise using the three noise absorption methods</p>
      <p>From Fig. 17 you can see that the worst noise removal result belonged to spectral
subtraction, which, unlike the other two, is not an artificial intelligence algorithm.
Its best results were on stationary noise, as removing this kind of noise is the primary purpose
of the method. Even so, these results were still worse than CNN by
0.0695942286068124 and than RNN (GRU) by 0.0843859435081635.</p>
      <p>So, if we compare simple algorithms with methods using artificial intelligence, the
latter are preferable, as they can adapt to different noises. The quality of speech in the audio
file suffers much less, which yields a higher STOI score, because this assessment is based on
speech intelligibility. Spectral subtraction damages speech more, and speech quality is the
leading indicator of deterioration. Because of this, when preprocessing data for subsequent speech
recognition, this algorithm will work worse and degrade the final data.</p>
      <p>The comparison of the AI methods showed that each copes with this task, but there is no clear
winner. This probably reflects the fact that both methods embody a trade-off
between different factors. RNN processing better reduced stationary and street noise, but CNN
processing performed better in removing noise such as background conversation and music,
according to the STOI score. It should be noted, however, that the difference between the estimates is not
significant, as can be seen in Fig. 18.</p>
      <p>Let us analyze the next assessment, which is based on the quality of
speech. To begin with, we present the diagram.
Figure 18: PESQ estimates for different audio files using noise absorption by spectral
subtraction, convolutional neural networks, and recurrent neural networks
From this diagram we can see that, as with the previous assessment, the best results were shown when
removing stationary noise. Since this assessment is based on speech quality, it is not surprising
that spectral subtraction showed such low results.</p>
      <p>The RNN method showed the best results in all cases except for musical noise, which
indicates that this algorithm does not severely damage the audio file when removing noise; the
audio file itself remains of good quality.</p>
      <p>Therefore, algorithms using artificial intelligence are more advantageous, as they can adapt to
the sound and damage the data itself less.</p>
    </sec>
    <sec id="sec-11">
      <title>7. Conclusion</title>
      <p>Sound noise is a classic problem that arose long ago but has not been fully
resolved to this day. It can damage audio files, creating a risk of impaired audio
recognition and making it difficult to recognize speech. AI technologies are now most often used to solve
this problem, and this paper demonstrates results that show the benefits of these
technologies. They have outperformed spectral subtraction both in noise removal and in
preserved speech intelligibility, which are the main tasks in this topic.</p>
      <p>The studies were performed using methods such as CNN and RNN. There was no clear winner
in these studies. Although CNN is more commonly used in image processing, the algorithm also
proved itself well on problems related to noise in audio files, and RNN
is not far behind in this respect. Each method performed better on different noises: RNN
outperformed CNN in removing stationary and street noise, while CNN was better on background
conversation and music. RNN also showed high results in preserved sound quality, in contrast to CNN, which
gives this algorithm certain advantages.</p>
    </sec>
  </body>
  <back>
  </back>
</article>