<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Real-time Sound Source Separation System</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Cheng-Yuan Tsai</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ting-Yu Lai</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Chao-Hsiang Hung</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hau-Shiang Jung</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Syu-Siang Wang</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shih-Hau Fang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Electrical Engineering, National Taiwan Normal University</institution>
          ,
          <addr-line>No. 162, Sec. 1, Heping E. Rd., Da'an Dist., Taipei City 106, Taiwan</addr-line>
          ,
          <country country="CN">R.O.C.</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Electrical Engineering, Yuan Ze University</institution>
          ,
          <addr-line>No. 135, Yuandong Rd., Zhongli Dist., Taoyuan City 320, Taiwan</addr-line>
          ,
          <country country="CN">R.O.C.</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Forensic Science Division, Ministry of Justice Investigation Bureau</institution>
          ,
          <addr-line>New Taipei City 231, Taiwan</addr-line>
          ,
          <country country="CN">R.O.C.</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The COVID-19 pandemic has accelerated the shift of many meetings to online platforms. In crucial multi-person meetings, real-time speech separation can significantly enhance subsequent applications. Additionally, real-time speech separation in various speech-related products, such as hearing aids, can improve backend services. However, achieving real-time speech separation poses substantial challenges. It requires segmenting speech over time, and as the information within each segment decreases, the difficulty of source separation increases, along with the necessary computational time. Most previous research has focused on non-real-time speech source separation, and real-time systems often rely on additional information or resources, primarily utilizing English corpora as the dataset. This research aims to develop a real-time speech source separation system. We use the architecture proposed in [1] as our main framework and modify it to handle segmented speech processing. The system is designed to perform real-time computations using a single-core CPU. The scale-invariant signal-to-noise ratio (SI-SNR) is employed as the evaluation metric for objective assessment. We validate the performance of our method with a small dataset in a real-time system and simulate the speech segmentation of the real-time system offline using a large dataset.</p>
      </abstract>
      <kwd-group>
        <kwd>Speaker Separation</kwd>
        <kwd>Source Separation</kwd>
        <kwd>SI-SNR</kwd>
        <kwd>real-time system</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Sound is vital for human perception and communication, carrying useful information such as
scene, speaker, and language. It has many applications like hearing aids, speaker recognition,
and speech recognition. Therefore, computer processing of speech signals is a popular research
topic.</p>
      <p>
        Source separation is a field that many researchers have delved into with abundant resources,
but the signal processing methods used vary. The traditional techniques mainly include
beamforming [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ] and blind source separation [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], both multi-channel source separation methods.
Beamforming uses multiple microphones for omnidirectional reception, determines the direction
of the source based on the time of arrival and intensity differences of the signal, and suppresses
all sounds except the target source direction, thereby improving the signal-to-noise ratio (SNR)
of the entire system and achieving separation. Blind source separation assumes that each
source has independent statistical characteristics and designs filters to separate an unknown
number of sources [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. However, multi-channel signal processing means increased cost. In
addition to traditional methods, previous studies have shown that deep learning can achieve
excellent separation results in single-channel source separation. Common source separation
systems use frequency domain features as inputs, such as in [
        <xref ref-type="bibr" rid="ref5 ref6 ref7 ref8">5, 6, 7, 8</xref>
        ], where speech is first
converted from time domain to frequency domain signals. However, this process has several
drawbacks. Firstly, the conversion process will add a lot of computational cost, and secondly,
there is a risk of distortion in the phase and magnitude of the signal during the conversion
process. In response to these issues, many scientists have begun to research source separation
of time-domain signals, such as in the paper [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], which uses a time-domain audio separation
network to overcome the drawbacks of frequency-domain transformation. In the time-domain
computations, an encoder-decoder[
        <xref ref-type="bibr" rid="ref10 ref11 ref12 ref13">10, 11, 12, 13</xref>
        ] framework is used to model the time-domain
signal directly. This approach eliminates the frequency decomposition step and simplifies
the separation problem to calculate the output of the encoder, which is then synthesized and
restored into the time-domain signal by the decoder. However, this method still results in a
large model size and is not suitable for modeling data with longer time series.
      </p>
      <p>
        Combining the research from both fields, more and more researchers hope to use
time-domain signals for source separation. The paper [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] presents a deep learning framework
for end-to-end[
        <xref ref-type="bibr" rid="ref15 ref16 ref17 ref18 ref19">15, 16, 17, 18, 19</xref>
        ] time-domain source separation. It uses a linear encoder to
generate optimized speech waveforms for separating each source. The encoder’s output is then
passed through a set of estimated weight masks [
        <xref ref-type="bibr" rid="ref20 ref21 ref22 ref23">20, 21, 22, 23</xref>
        ] to achieve source separation. The
separated signals are then reversed back to speech waveforms using a linear decoder. Although
this method solves the long delay time required for computing spectrograms and reduces the
size of some models, it is still not suitable for processing data with longer time series.
      </p>
      <p>
        The dual-path recurrent neural network (DPRNN) method in [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ] models longer time-series
data effectively by splitting the input sequence into smaller chunks. Using the Wall Street
Journal dataset (WSJ0-2mix), DPRNN achieved a scale-invariant signal-to-noise ratio (SI-SNR) of 18.8
dB in clean environments and 8.4 dB in noisy environments. However, the non-real-time nature
and high computational resources required during testing are limitations.
      </p>
      <p>
        To achieve real-time source separation, methods like multi-input-multi-output (MIMO) require
spatial information, increasing costs and resources. Real-time tracking using upper-body
humanoid robots and distributed processing, as seen in [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ], and fusion methods of camera and
microphone array sensors, as in [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ], have been explored.
      </p>
      <p>
        Building on [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ], we have enhanced robustness in noisy environments, adapted the system
for a Chinese corpus, and developed a real-time separation system.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. MATERIALS AND METHODS</title>
      <sec id="sec-2-1">
        <title>2.1. Database Description</title>
        <p>
          For the evaluation task, we utilized the Taiwan Mandarin Chinese version of the Hearing in
Noise Test (TMHINT) [
          <xref ref-type="bibr" rid="ref27">27</xref>
          ] to prepare the speech dataset. Each phrase consists of ten characters,
encompassing a wide range of commonly used pronunciation types in daily life.
        </p>
        <p>A total of 19 participants, comprising 12 males and 7 females aged between 20 and 28
years, took part in the audio recording for this experiment. None of the participants exhibited
noticeable pronunciation difficulties or issues with oral expression, thus representing typical
speech conditions.</p>
        <p>During the recording process, the equipment, including the laptop array microphone and a
standard microphone, was first calibrated for frequency response and to assess the impact of
environmental noise on the recording results. The recording took place in a semi-reverberant
room measuring 2m x 2m x 2m. The participants sat facing a computer, with the standard
microphone positioned at mouth height to minimize distance variation due to differences in
participant height. We used Audacity software to control the laptop array microphone and
ACQUA software for the standard microphone, both operating at a 48kHz sampling rate. Each
participant read the 320 phrases from the TMHINT text sequentially. Each phrase lasted three
to four seconds, with a one-second pause between phrases. After reading all 320 phrases, the
recordings were stopped in both Audacity and ACQUA, and the files were saved and checked
for errors to complete the session.</p>
        <p>In the data processing stage, we addressed issues of uneven speech volume and varying
speech duration. To ensure consistency, the entire corpus was recorded in one session to avoid
variations due to different recording times or changes in speaker conditions. We inserted a blank
interval of approximately 1.5 seconds between each sentence to facilitate audio segmentation.
We then extracted 320 individual audio files based on these intervals. Voice activity detection
(VAD) was performed on the data to remove silent parts, preventing excessive silence in the
clips that could negatively impact model training. We manually verified the effectiveness of the
VAD method in eliminating silent portions.</p>
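        <p>As a rough illustration, an energy-threshold VAD of the kind that could be used for this step is sketched below. The paper does not specify the exact VAD algorithm, so the frame length, threshold, and function name are illustrative assumptions only.</p>
        <preformat>
import numpy as np

def energy_vad(signal, frame_len=400, threshold_db=-40.0):
    """Illustrative energy-based VAD: keep frames whose energy is within
    threshold_db of the loudest frame (assumed parameters)."""
    # Split the signal into non-overlapping frames.
    n_frames = len(signal) // frame_len
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)
    # Per-frame log energy in dB.
    energy = 10.0 * np.log10(np.sum(frames ** 2, axis=1) + 1e-10)
    # Keep frames close in energy to the loudest frame; drop the silent ones.
    keep = energy - energy.max() > threshold_db
    return frames[keep].reshape(-1)
</preformat>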
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Proposed method</title>
        <p>In this paper, we use the DPRNN as our main reference. Figure 1 illustrates the DPRNN process.
Initially, an audio clip is input and converted into features that are easily learned by the model
through an encoder. The audio then enters the separation model, which is divided into three
parts: segmentation processing, DPRNN block processing, and overlap-add.</p>
        <p>In segmentation processing, the encoded audio features are broken into block-like segments
and concatenated to form a three-dimensional tensor. The encoded input is a two-dimensional
representation whose axes are the feature dimension and the time length; it is split into blocks
of a fixed length, using a 50% overlap rate in this study, and the resulting blocks are stacked
along a third axis, yielding a three-dimensional tensor of shape (feature dimension) × (block
length) × (number of blocks).</p>
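        <p>A minimal sketch of this segmentation step is given below, assuming the usual DPRNN convention of chunking an encoded feature sequence with 50% overlap; the function and variable names are ours and not part of the original system.</p>
        <preformat>
import torch

def segment(features, chunk_len):
    """Split encoded features of shape (feature_dim, time) into 50%-overlapping
    chunks and stack them into a 3-D tensor (feature_dim, chunk_len, n_chunks)."""
    feat_dim, time_len = features.shape
    hop = chunk_len // 2                                  # 50% overlap
    # Number of chunks needed to cover the sequence (ceiling division).
    n_chunks = -(-max(time_len - chunk_len, 0) // hop) + 1
    # Zero-pad so the last chunk is complete.
    pad = (n_chunks - 1) * hop + chunk_len - time_len
    padded = torch.nn.functional.pad(features, (0, pad))
    chunks = [padded[:, i * hop : i * hop + chunk_len] for i in range(n_chunks)]
    return torch.stack(chunks, dim=-1)                    # (feature_dim, chunk_len, n_chunks)
</preformat>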
        <p>After obtaining the three-dimensional tensor, DPRNN block processing is performed, which
is divided into Intra-chunk RNN and Inter-chunk RNN. The Intra-chunk RNN processes local
information within each block in parallel, while the Inter-chunk RNN processes global
information across all blocks. Data in the Intra-chunk RNN undergoes a Bi-LSTM, a fully connected
layer, and a normalization layer. The Inter-chunk RNN follows the same procedure. These two
components combine to form a complete DPRNN block, which can be stacked to increase model
depth. In our experiments, we used six stacking layers.</p>
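        <p>The following is a condensed PyTorch-style sketch of one dual-path block as described above (Bi-LSTM, fully connected layer, and normalization on both the intra-chunk and inter-chunk paths). The feature and hidden sizes are illustrative assumptions, not the exact configuration used in our experiments.</p>
        <preformat>
import torch
import torch.nn as nn

class DPRNNBlock(nn.Module):
    """One dual-path block: intra-chunk Bi-LSTM within each chunk, then
    inter-chunk Bi-LSTM across chunks (sketch with assumed sizes)."""
    def __init__(self, feat_dim=64, hidden=128):
        super().__init__()
        self.intra_rnn = nn.LSTM(feat_dim, hidden, bidirectional=True, batch_first=True)
        self.intra_fc = nn.Linear(2 * hidden, feat_dim)
        self.intra_norm = nn.GroupNorm(1, feat_dim)
        self.inter_rnn = nn.LSTM(feat_dim, hidden, bidirectional=True, batch_first=True)
        self.inter_fc = nn.Linear(2 * hidden, feat_dim)
        self.inter_norm = nn.GroupNorm(1, feat_dim)

    def forward(self, x):                          # x: (batch, feat, chunk_len, n_chunks)
        b, f, k, s = x.shape
        # Intra-chunk path: run the RNN along the chunk-length axis of every chunk.
        intra = x.permute(0, 3, 2, 1).reshape(b * s, k, f)
        intra = self.intra_fc(self.intra_rnn(intra)[0]).reshape(b, s, k, f).permute(0, 3, 2, 1)
        x = x + self.intra_norm(intra.reshape(b, f, k * s)).reshape(b, f, k, s)
        # Inter-chunk path: run the RNN across chunks at each in-chunk position.
        inter = x.permute(0, 2, 3, 1).reshape(b * k, s, f)
        inter = self.inter_fc(self.inter_rnn(inter)[0]).reshape(b, k, s, f).permute(0, 3, 1, 2)
        return x + self.inter_norm(inter.reshape(b, f, k * s)).reshape(b, f, k, s)
</preformat>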
        <p>The purpose of overlap-add is to convert the data processed by the DPRNN block back to
the input format. As shown in Fig 1, the three-dimensional output of the last DPRNN block is
transformed back to a sequence output through overlap-add.</p>
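        <p>The inverse operation can be sketched as follows, assuming the same 50% overlap and the chunk layout produced by the segmentation sketch above.</p>
        <preformat>
import torch

def overlap_add(chunks, out_len):
    """Fold a (feature_dim, chunk_len, n_chunks) tensor back into a
    (feature_dim, time) sequence, averaging the 50%-overlapped regions."""
    feat_dim, chunk_len, n_chunks = chunks.shape
    hop = chunk_len // 2
    total = (n_chunks - 1) * hop + chunk_len
    out = torch.zeros(feat_dim, total)
    weight = torch.zeros(1, total)
    for i in range(n_chunks):
        out[:, i * hop : i * hop + chunk_len] += chunks[:, :, i]
        weight[:, i * hop : i * hop + chunk_len] += 1.0
    return (out / weight)[:, :out_len]             # trim the zero-padding
</preformat>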
        <p>For generating training data, we randomly selected ten males and six females from the
previously recorded data, totaling sixteen participants. The training data had a sampling rate of 8kHz.
The data from these participants were mixed and divided into three combinations: male-male,
male-female, and female-female, with 13,500, 10,000, and 6,500 sentences, respectively. The
generation of each combination followed these steps: (1) Two different people were selected
from the sixteen, each with 170 sentences of speech signal. (2) One sentence was randomly
selected from each person’s 170 sentences, resulting in two speech signals. (3) A signal-to-signal
ratio between −5 and 5 was randomly selected, and the two speech signals were mixed to
obtain a mixed speech signal. (4) Steps (1) to (3) were repeated to obtain a training corpus of
approximately 30 hours, serving as a small training database according to the quantity of each
combination.</p>
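        <p>Steps (1) to (3) amount to a simple mixing routine, sketched below; the random-ratio range follows the text, while the helper name and the assumption that the ratio is an energy ratio in dB are ours.</p>
        <preformat>
import numpy as np

def mix_pair(speech_a, speech_b, ratio_db):
    """Mix two utterances at a given signal-to-signal (energy) ratio in dB."""
    # Trim both signals to the shorter length so they can be summed.
    n = min(len(speech_a), len(speech_b))
    a, b = speech_a[:n], speech_b[:n]
    # Scale speaker B so that the A-to-B energy ratio equals ratio_db.
    gain = np.sqrt(np.sum(a ** 2) / (np.sum(b ** 2) * 10 ** (ratio_db / 10.0) + 1e-10))
    return a + gain * b

# Step (3): draw a random ratio between -5 and 5 for each mixture.
rng = np.random.default_rng(0)
ratio = rng.uniform(-5.0, 5.0)
</preformat>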
        <p>The test set included four trained speakers and four untrained speakers, each with two males
and two females. The test data generation followed these steps: (1) Three gender combinations
were used for mixing: male-male, male-female, and female-female. (2) Two speakers were
selected within each gender combination, each with 150 speech signals, different from the
training set. (3) One sentence was selected from each speaker’s 150 speech signals, resulting
in two speech signals for each gender combination. (4) The two selected speech signals were
mixed at a signal-to-signal energy ratio of 0.5 to obtain a mixed speech signal. (5) Steps (2) to
(4) were repeated until each gender combination contained 2000 sentences, resulting in a small
test set of 6000 sentences. The number of training and testing sentences is listed in Table 1.</p>
        <p>Table 1. Number of sentences per gender combination (training / testing): M-M 13500 / 2000; M-F 10000 / 2000; F-F 6500 / 2000; Total 30000 / 6000.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Real-time System</title>
      <p>In this paper, we initially used a mobile phone to play the audio, which we then recorded directly
using the computer’s microphone. However, this method caused the system to capture additional
environmental noise and potentially uneven volume due to human factors during
the recording process. To address these issues, we decided to use the mobile phone as the input
device for the real-time system, connecting it to the computer via an audio cable. This setup
allowed us to avoid the aforementioned problems. In recording mode, the system saves the
entire separated audio file to the computer’s hard drive at the end of the execution. In playback
mode, after separating a segment of the audio, the system plays it in real time through the
speakers.</p>
      <p>As shown in Fig 2, the experimental steps are as follows: (1) Data collection and preprocessing
of audio files. (2) Simulation and computer validation of source separation algorithms. (3)
Optimization of algorithm performance. (4) Development of time-domain source separation
techniques. (5) Development of real-time system simulation and objective evaluation standards.
(6) Real-time system performance optimization.</p>
      <sec id="sec-3-1">
        <title>3.1. Recording Mode and Playing Mode</title>
        <p>To effectively evaluate the performance of the system, we further expanded it with two modes:
recording mode and playback mode. In the recording mode, the system executes the complete
process and generates audio files, which are then saved to the computer’s hard drive. In the
playback mode, the system is capable of playing the separated audio in real-time, allowing users
to assess its performance immediately. The application of these two modes enables a more
comprehensive evaluation of the system’s performance while providing users with a better
experience.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Improved DPRNN Model</title>
        <p>To meet the requirements of real-time computation, we cannot input the complete audio file into
the model for processing and output the results. Therefore, the audio files need to be segmented,
resulting in a reduction in the amount of data input to the model. This reduction in data volume
significantly affects the separation performance of the model.</p>
        <p>Fig 3 (spectrograms, Frequency (kHz) vs. Time (s)): (a) Original model, speaker 1; (b) Improved model, speaker 1; (c) Original model, speaker 2; (d) Improved model, speaker 2.</p>
        <p>Therefore, we made some improvements to the DPRNN method. We studied the model
architecture of DPRNN, where the original DPRNN model first segments an input audio file
into several chunks and then stacks them sequentially. The x-axis represents the length of
each segmented audio file, and the y-axis represents the number of segmented audio files. The
stacked segmented audio files are then fed into the RNN block of the DPRNN, which consists
of two parts: intra-chunk operations based on the temporal sequence of the inputs and
inter-chunk operations across the segmented audio files. In the context of real-time computation,
the audio is segmented before each model inference, resulting in short segmented audio files
being processed each time. As a result, the effectiveness of the inter-chunk RNN
that operates across time on the segmented audio files is significantly reduced. In situations
where the inter-chunk RNN does not provide significant benefits, we choose to omit it. This saves
half of the computational workload and reduces the impact of audio file segmentation on the
system. Finally, considering the trade-off between model performance and computational time, we settled
on a model architecture with 4 layers of intra-chunk RNN.</p>
        <p>In addition to the 4 layers of intra-chunk RNN, we believe that even though the inter-chunk
RNN may not provide substantial help when the audio files are segmented and stacked, having a
single layer of inter-chunk RNN can still aid the model in feature learning. Therefore, based on
the architecture of 4 layers of intra-chunk RNN, while maintaining its computational workload,
we replace the last layer of intra-chunk RNN with inter-chunk RNN. The final architecture
consists of 3 layers of intra-chunk RNN plus 1 layer of inter-chunk RNN.</p>
        <p>The original full model, consisting of 6 layers, had a size of 31MB. We adjusted the model
architecture to 3 layers of intra-chunk plus 1 layer of inter-chunk, meeting the real-time
computation requirements and reducing the model size to approximately 10MB.</p>
        <p>Fig 3 shows the spectrograms of the results before and after model adjustment. It can be
observed that the adjustment has little impact on the results, and the spectrograms of both
versions still exhibit a high degree of similarity.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Computer hardware and software connections in the system</title>
        <p>In the system, it is necessary to establish connections between the model and external audio
sources. To achieve this purpose, we utilize a tool called "pyaudio". Pyaudio is an open-source,
third-party Python package that provides capabilities for recording, reading audio files, and
playing audio files within Python. This tool enables us to easily access audio sources and
transmit audio data to our system for further processing.</p>
        <p>One of the advantages of pyaudio is its cross-platform compatibility. It supports various
operating systems, including Linux, Microsoft Windows, and Apple macOS, making our system
theoretically capable of functioning across different operating system environments.</p>
        <p>With pyaudio, we can effortlessly set up recording or playback functions for audio sources
and connect audio data to our model through appropriate configurations. This facilitates an
effective linkage between audio source input and model separation, equipping the system with
essential tools and functionalities for smooth operation.</p>
        <p>Overall, the utilization of pyaudio allows us to seamlessly handle audio within the Python
environment and connect it to our model. This empowers our system with enhanced capabilities
and flexibility, enabling us to achieve the goals of efficient and accurate audio separation and
real-time playback.</p>
        <p>Furthermore, while using Pyaudio, it’s important to note that the precision of audio data input
is in 16-bit format, whereas the model’s computation requires data in Float32 format. Hence,
before feeding data into the model for computation, a conversion from 16-bit to Float32 format
is necessary to ensure proper data processing by the model. Once the model’s computation is
complete, the results must be converted back to a 16-bit format for audio data storage using
pyaudio.</p>
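        <p>A minimal sketch of this input/output path is shown below, assuming a mono pyaudio stream at 8 kHz and a separate(...) placeholder standing in for the model inference; the chunk size and other parameter values are illustrative.</p>
        <preformat>
import numpy as np
import pyaudio

CHUNK = 2000                     # samples fed to the model per inference
RATE = 8000                      # assumed sampling rate of the model input

def separate(frame_float32):
    """Placeholder for the model inference; returns the separated signal."""
    return frame_float32         # identity here; the real system runs the model

pa = pyaudio.PyAudio()
stream = pa.open(format=pyaudio.paInt16, channels=1, rate=RATE,
                 input=True, output=True, frames_per_buffer=CHUNK)

for _ in range(10):                              # process a few chunks as a demo
    raw = stream.read(CHUNK)                     # 16-bit PCM bytes from the input
    frame = np.frombuffer(raw, dtype=np.int16).astype(np.float32) / 32768.0
    out = separate(frame)                        # the model works in float32
    pcm = np.clip(out * 32768.0, -32768, 32767).astype(np.int16)
    stream.write(pcm.tobytes())                  # back to 16-bit for playback/storage

stream.stop_stream()
stream.close()
pa.terminate()
</preformat>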
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <p>
        4.1. SI-SNR
      </p>
      <p>
        The Scale-Invariant Signal-to-Noise Ratio (SI-SNR) [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ] is a metric used to evaluate the performance of signal separation. SI-SNR measures the energy
ratio between the target component of the separated signal and the residual error, and it
effectively quantifies the separation quality between the signal and the noise. Writing ŝ for the
separated (estimated) signal and s for the target signal, SI-SNR is defined in Equation (1):
      </p>
      <p>
        $$\begin{cases} s_{\mathrm{target}} := \dfrac{\langle \hat{s}, s \rangle \, s}{\lVert s \rVert^{2}} \\ e_{\mathrm{noise}} := \hat{s} - s_{\mathrm{target}} \\ \text{SI-SNR} := 10 \log_{10} \dfrac{\lVert s_{\mathrm{target}} \rVert^{2}}{\lVert e_{\mathrm{noise}} \rVert^{2}} \end{cases} \qquad (1)$$
      </p>
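      <p>As a minimal illustration, Equation (1) can be computed in a few lines of NumPy; the sketch below also subtracts the mean of each signal first, which is common practice for this metric, and the function name is ours.</p>
      <preformat>
import numpy as np

def si_snr(estimate, reference, eps=1e-8):
    """Scale-invariant SNR in dB between an estimated and a reference signal."""
    # Remove the means so the metric ignores DC offsets (common practice).
    est = estimate - np.mean(estimate)
    ref = reference - np.mean(reference)
    # Project the estimate onto the reference (the scale-invariant target).
    s_target = np.dot(est, ref) / (np.dot(ref, ref) + eps) * ref
    e_noise = est - s_target
    return 10.0 * np.log10(np.dot(s_target, s_target) / (np.dot(e_noise, e_noise) + eps))
</preformat>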
      <sec id="sec-4-1">
        <title>4.2. Dual-channel engineering method</title>
        <p>Since calculating SI-SNR requires aligning every point of the two audio signals, there may be
human errors in real-time systems, such as from the start of the system to playing the audio
file, or from the end of the audio file to the end of the system, which can lead to imprecise
separation of the audio files. Therefore, we propose a dual-channel engineering method to
address this issue. Fig 4 shows the core of this method: a dual-channel approach that combines
the target audio file and the mixed audio file into a single two-channel file and uses this
processed dual-channel audio file as the system’s input. This ensures that the blank spaces in the output
audio file and the target audio file are of the same length, thereby achieving alignment.</p>
        <p>The reason for doing this is to align the timing of the mixed speech and the target speech.
At the same time, we want the input and output of the system in recording mode to be a
dual-channel audio file. Therefore, we first merge the mixed speech and target speech A into a
dual-channel audio file, where the first channel is the mixed speech and the second channel is
the target speech A. This ensures that both enter the system simultaneously and are output
simultaneously, with aligned timing. However, once inside the system, only the first channel,
which is the mixed speech, enters the model for separation. After going through the model,
we obtain the separated speech A, while the target speech A does not enter the model. Before
outputting, we mix these two speech segments back into a dual-channel audio file.</p>
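        <p>A sketch of this dual-channel bookkeeping is given below, using the soundfile package for file I/O as an assumption (any WAV library would do); the file names and the separate(...) placeholder are hypothetical.</p>
        <preformat>
import numpy as np
import soundfile as sf

def separate(mono):
    """Placeholder for the separation model."""
    return mono

# Build the dual-channel input: channel 0 = mixture, channel 1 = target A.
mixture, sr = sf.read("mixture.wav")             # hypothetical file names
target_a, _ = sf.read("target_a.wav")
n = min(len(mixture), len(target_a))
stereo_in = np.stack([mixture[:n], target_a[:n]], axis=1)
sf.write("dual_channel_input.wav", stereo_in, sr)

# Inside the system, only channel 0 is separated; channel 1 passes through,
# so both output channels stay sample-aligned for SI-SNR scoring.
def run_system(stereo):
    separated_a = separate(stereo[:, 0])
    return np.stack([separated_a, stereo[:, 1]], axis=1)
</preformat>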
      </sec>
      <sec id="sec-4-2">
        <title>4.3. Result of the Models in Clean and Noisy Speech</title>
        <p>In Table 2, we conducted experiments using the audio files mentioned earlier, analyzing
male-male, male-female, and female-female combinations. The DPRNN model achieved an average
SI-SNR of 16.7 dB, with male-female pairs performing the best at an SI-SNR of 17.9 dB. We speculate
that this is due to the greater difference in frequency range between male and female voices,
making it easier for the model to separate the two.</p>
        <p>Additionally, we added two types of noise, computer fan noise and pink noise, to the training
and testing data used in a clean background environment. The noise was mixed into the data at
SNRs of 0, 5, and 10. We randomly selected one type of noise and one SNR and mixed it into
30,000 sentences of training data and 6,000 sentences of testing data. We trained the model with
the noisy data, changing the output from two speakers to three, including two speakers and the
noise. The purpose of this was to treat the noise as a third speaker for separation.</p>
        <p>We tested the model on audio files in both noisy and clean background environments, with an
average SI-SNR of 14.9 in the noisy environment. Male-male, male-female, and female-female
pairs achieved SI-SNRs of 15.5, 15.8, and 13.2, respectively.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.4. Result of Real-time System</title>
        <p>Real-time System Performance in a Small Dataset: We selected five sentences from the clean
test data to conduct small data testing on a single-core CPU. The selection criterion was based
on an average SI-SNR of 16.7 dB. We chose two sentences above the average SI-SNR, two sentences
below the average, and one sentence approximately equal to the average, resulting in a total
of five sentences. As shown in Table 3, we tested the impact of different input sizes on the
system’s performance. The input sizes were 800 and 2000 points per input. It can be observed
that the smaller the input size, the poorer the system’s performance. This is because our main
method is based on LSTM, which is highly influenced by the length of the input sequence. If the
information available for each model processing is reduced, its performance will be adversely
affected.</p>
      </sec>
      <sec id="sec-4-4">
        <title>4.5. Result of the Offline Simulation on a Large Dataset</title>
        <p>Due to the limitation of obtaining only one audio segment at a time in the real-time system,
evaluating the performance with a large amount of data would be time-consuming. Therefore, we
conducted offline simulations by manually segmenting the audio files to mimic the segmentation
approach in the real-time system. These segmented audio files were then processed through
the model in separate batches, and the resulting segmented audio segments were concatenated
to form a complete sentence. We compared five different models: a 6-layer full model, a 4-layer
full model, a 2-layer full model, a 4-layer intra-chunk model, and a 3-layer intra-chunk with a
1-layer inter-chunk model, running on a single-core CPU. We tested the input lengths of 800
samples and 2000 samples. Table 4 shows that the adjusted models outperformed the original
DPRNN model significantly. Compared to the 6-layer full model, the 4-layer intra-chunk model
achieved a 0.74 dB improvement in SI-SNR for the input length of 2000 samples, and a decrease
of 0.15 dB in SI-SNR for the input length of 800 samples. It also showed improvements compared
to other full models. In the 3-layer intra-chunk with the 1-layer inter-chunk model, there was a
2.28 dB improvement in SI-SNR for the input length of 2000 samples compared to the 4-layer
intra-chunk model, and improvements compared to other models for the input length of 800
samples. Therefore, we concluded that when the audio is segmented in time and the input
length is shorter, adding a significant number of inter-chunk layers would lead to a decrease in
model performance. However, not adding any inter-chunk layer would also not achieve the
best performance. Thus, the 3-layer intra-chunk with 1-layer inter-chunk model exhibited the
best performance in both input length scenarios.</p>
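        <p>The offline simulation can be sketched as a simple chunking loop: split each test utterance into fixed-length segments (800 or 2000 samples), run the model on each segment independently, and concatenate the outputs. The separate(...) placeholder again stands in for the model call.</p>
        <preformat>
import numpy as np

def separate(segment):
    """Placeholder for the separation model."""
    return segment

def simulate_realtime(signal, seg_len=2000):
    """Mimic the real-time system offline: process fixed-length segments
    independently and concatenate the separated outputs."""
    outputs = []
    for start in range(0, len(signal), seg_len):
        outputs.append(separate(signal[start : start + seg_len]))
    return np.concatenate(outputs)[: len(signal)]
</preformat>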
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>This study focuses on developing a real-time speech source separation system in a single-core
CPU computing environment. First, we validate the performance of the complete DPRNN
model on Chinese speech data and in the presence of noise backgrounds, demonstrating its
feasibility. Next, we modify the DPRNN model architecture to meet the requirements of a
real-time system. In a real-time system, the input audio needs to be segmented over time. Therefore,
we change the original model’s structure to a 3-layer intra-chunk plus 1-layer inter-chunk
architecture, which satisfies the computational time constraints of a real-time system using a
single-core CPU. The objective evaluation criterion used is SI-SNR. The 3-layer intra-chunk plus
1-layer inter-chunk model architecture performs best even with an input length of 2000 samples.
Additionally, we develop a dual-channel engineering approach to evaluate the performance of
the real-time system under a small dataset scenario. Again, the 3-layer intra-chunk plus 1-layer
inter-chunk model architecture achieves the best performance evaluation results with an input
length of 2000 samples. Furthermore, the 3-layer intra-chunk plus 1-layer inter-chunk model
architecture meets the real-time computational time requirements in both single-core CPU and
GPU computations.</p>
      <p>However, the current performance is maintained when the input length is 2000 samples,
which corresponds to a minimum of 250 milliseconds (ms) processing time. In applications
that require audiovisual synchronization, this delay is noticeable to users. Therefore, future
research can focus on achieving a balance between performance and reducing the input length
per model inference, aiming to minimize latency.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Acknowledgments</title>
      <p>The authors would like to thank the financial support from the Ministry of Justice
(113-1301-0504-01) and the National Science and Technology Council (NSTC112-2221-E-155-017-MY2 and
NSTC112-2222-E-155-002).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Z.-Q.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Erdogan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wisdom</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Wilson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Raj</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Watanabe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Hershey</surname>
          </string-name>
          ,
          <article-title>Sequential Multi-Frame neural beamforming for speech separation and enhancement</article-title>
          ,
          <source>in: Spoken Language Technology Workshop (SLT)</source>
          , IEEE,
          <year>2021</year>
          , pp.
          <fpage>905</fpage>
          -
          <lpage>911</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M. Z.</given-names>
            <surname>Ikram</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. R.</given-names>
            <surname>Morgan</surname>
          </string-name>
          ,
          <article-title>A beamforming approach to permutation alignment for multichannel frequency-domain blind speech separation</article-title>
          , in: International Conference on Acoustics,
          <source>Speech, and Signal Processing</source>
          , volume
          <volume>1</volume>
          , IEEE,
          <year>2002</year>
          , pp.
          <source>I-881-I-884.</source>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Benabderrahmane</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Selouani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>O'Shaughnessy</surname>
          </string-name>
          ,
          <article-title>Blind speech separation for convolutive mixtures using an oriented principal components analysis method</article-title>
          ,
          <source>in: 18th European Signal Processing Conference</source>
          ,
          <year>2010</year>
          , pp.
          <fpage>1553</fpage>
          -
          <lpage>1557</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Liutkus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-L.</given-names>
            <surname>Durrieu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Daudet</surname>
          </string-name>
          , G. Richard,
          <article-title>An overview of informed audio source separation</article-title>
          ,
          <source>in: 14th International Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS)</source>
          ,
          <year>2013</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>4</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>T.</given-names>
            <surname>Virtanen</surname>
          </string-name>
          ,
          <article-title>Monaural sound source separation by nonnegative matrix factorization with temporal continuity and sparseness criteria</article-title>
          ,
          <source>IEEE Transactions on Audio, Speech, and Language Processing</source>
          <volume>15</volume>
          (
          <year>2007</year>
          )
          <fpage>1066</fpage>
          -
          <lpage>1074</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>T.</given-names>
            <surname>Hasumi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Kobayashi</surname>
          </string-name>
          , T. Ogawa,
          <article-title>Investigation of network architecture for singlechannel end-to-end denoising</article-title>
          ,
          <source>in: 28th European Signal Processing Conference (EUSIPCO)</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>441</fpage>
          -
          <lpage>445</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>R. J.</given-names>
            <surname>Issa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. F.</given-names>
            <surname>Al-Irhaym</surname>
          </string-name>
          ,
          <article-title>Audio source separation using supervised deep neural network</article-title>
          ,
          <source>in: Journal of Physics: Conference Series</source>
          , volume
          <volume>1879</volume>
          , IOP Publishing
          ,
          <year>2021</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>R.</given-names>
            <surname>Watanabe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kitamura</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Saruwatari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Takahashi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kondo</surname>
          </string-name>
          ,
          <article-title>DNN-Based frequency component prediction for frequency-domain audio source separation</article-title>
          ,
          <source>in: 28th European Signal Processing Conference (EUSIPCO)</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>805</fpage>
          -
          <lpage>809</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Luo</surname>
          </string-name>
          , N. Mesgarani,
          <article-title>TaSNet: Time-Domain audio separation network for real-time, single-channel speech separation</article-title>
          , in: International Conference on Acoustics,
          <source>Speech and Signal Processing (ICASSP)</source>
          , IEEE,
          <year>2018</year>
          , pp.
          <fpage>696</fpage>
          -
          <lpage>700</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>V.</given-names>
            <surname>Badrinarayanan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kendall</surname>
          </string-name>
          , R. Cipolla,
          <article-title>SegNet: A deep convolutional encoder-decoder architecture for image segmentation</article-title>
          ,
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          <volume>39</volume>
          (
          <year>2017</year>
          )
          <fpage>2481</fpage>
          -
          <lpage>2495</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>K.</given-names>
            <surname>Cho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. Van</given-names>
            <surname>Merriënboer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gulcehre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Bahdanau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Bougares</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Schwenk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <article-title>Learning phrase representations using RNN encoder-decoder for statistical machine translation</article-title>
          ,
          <source>arXiv preprint arXiv:1406.1078</source>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>L.-C.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Papandreou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Schroff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Adam</surname>
          </string-name>
          ,
          <article-title>Encoder-decoder with atrous separable convolution for semantic image segmentation</article-title>
          ,
          <source>in: Proceedings of the European Conference on Computer Vision (ECCV)</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>P.</given-names>
            <surname>Malhotra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ramakrishnan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Anand</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Vig</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Shroff</surname>
          </string-name>
          ,
          <article-title>LSTM-based encoder-decoder for multi-sensor anomaly detection</article-title>
          ,
          <source>arXiv preprint arXiv:1607.00148</source>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Mesgarani</surname>
          </string-name>
          , Conv-Tasnet:
          <article-title>Surpassing ideal time-frequency magnitude masking for speech separation</article-title>
          ,
          <source>IEEE/ACM Transactions on Audio, Speech, and Language Processing</source>
          <volume>27</volume>
          (
          <year>2019</year>
          )
          <fpage>1256</fpage>
          -
          <lpage>1266</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>S.-W.</given-names>
            <surname>Fu</surname>
          </string-name>
          , T.-W. Wang,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tsao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Kawai</surname>
          </string-name>
          ,
          <article-title>End-to-end waveform utterance enhancement for direct evaluation metrics optimization by fully Convolutional Neural Networks</article-title>
          ,
          <source>IEEE/ACM Transactions on Audio, Speech, and Language Processing</source>
          <volume>26</volume>
          (
          <year>2018</year>
          )
          <fpage>1570</fpage>
          -
          <lpage>1584</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>N.</given-names>
            <surname>Zeghidour</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Usunier</surname>
          </string-name>
          , G. Synnaeve,
          <string-name>
            <given-names>R.</given-names>
            <surname>Collobert</surname>
          </string-name>
          , E. Dupoux,
          <article-title>End-to-end speech recognition from the raw waveform</article-title>
          ,
          <source>arXiv preprint arXiv:1806.07098</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>F.</given-names>
            <surname>Lluís</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pons</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Serra</surname>
          </string-name>
          ,
          <article-title>End-to-end music source separation: Is it possible in the waveform domain?</article-title>
          ,
          <source>arXiv preprint arXiv:1810.12187</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Mesgarani</surname>
          </string-name>
          , T. Yoshioka,
          <article-title>End-to-end microphone permutation and number invariant multi-channel speech separation</article-title>
          , in: IEEE International Conference on Acoustics,
          <source>Speech and Signal Processing (ICASSP)</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>6394</fpage>
          -
          <lpage>6398</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Z.-Q.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. L.</given-names>
            <surname>Roux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Hershey</surname>
          </string-name>
          ,
          <article-title>End-to-end speech separation with unfolded iterative phase reconstruction</article-title>
          ,
          <source>arXiv preprint arXiv:1804.10204</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Isik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. L.</given-names>
            <surname>Roux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Watanabe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Hershey</surname>
          </string-name>
          ,
          <article-title>Single-channel multi-speaker separation using deep clustering</article-title>
          ,
          <source>arXiv preprint arXiv:1607.02173</source>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>D.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kolbaek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.-H.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Jensen</surname>
          </string-name>
          ,
          <article-title>Permutation invariant training of deep models for speaker-independent multi-talker speech separation</article-title>
          , in: IEEE International Conference on Acoustics,
          <source>Speech and Signal Processing (ICASSP)</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>241</fpage>
          -
          <lpage>245</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>M.</given-names>
            <surname>Kolbaek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.-H.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Jensen</surname>
          </string-name>
          ,
          <article-title>Multitalker speech separation with utterance-level permutation invariant training of deep recurrent neural networks</article-title>
          ,
          <source>IEEE/ACM Transactions on Audio, Speech, and Language Processing</source>
          <volume>25</volume>
          (
          <year>2017</year>
          )
          <fpage>1901</fpage>
          -
          <lpage>1913</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Mesgarani</surname>
          </string-name>
          ,
          <article-title>Speaker-independent speech separation with deep attractor network</article-title>
          ,
          <source>IEEE/ACM Transactions on Audio, Speech, and Language Processing</source>
          <volume>26</volume>
          (
          <year>2018</year>
          )
          <fpage>787</fpage>
          -
          <lpage>796</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chen</surname>
          </string-name>
          , T. Yoshioka,
          <article-title>Dual-Path RNN: Efficient long sequence modeling for time-domain single-channel speech separation</article-title>
          , in: International Conference on Acoustics,
          <source>Speech and Signal Processing (ICASSP)</source>
          , IEEE,
          <year>2020</year>
          , pp.
          <fpage>46</fpage>
          -
          <lpage>50</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>K.</given-names>
            <surname>Nakadai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Hidai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Okuno</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Kitano</surname>
          </string-name>
          ,
          <article-title>Real-time speaker localization and speech separation by audio-visual integration</article-title>
          ,
          <source>in: Proceedings 2002 IEEE International Conference on Robotics and Automation</source>
          , volume
          <volume>1</volume>
          ,
          <year>2002</year>
          , pp.
          <fpage>1043</fpage>
          -
          <lpage>1049</lpage>
          vol.
          <volume>1</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>C.-F.</given-names>
            <surname>Liu</surname>
          </string-name>
          , W.-S. Ciou, P.-T. Chen, Y.-C. Du,
          <article-title>A real-time speech separation method based on camera and microphone array sensors fusion approach</article-title>
          ,
          <source>Sensors</source>
          <volume>20</volume>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>M.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <article-title>Development of Taiwan Mandarin Hearing in Noise Test, Department of Speech Language Pathology and Audiology</article-title>
          , National Taipei University of Nursing and Health science (
          <year>2005</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>J.</given-names>
            <surname>Le Roux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wisdom</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Erdogan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Hershey</surname>
          </string-name>
          ,
          <article-title>SDR-half-baked or well done?</article-title>
          , in: International Conference on Acoustics,
          <source>Speech and Signal Processing (ICASSP)</source>
          , IEEE,
          <year>2019</year>
          , pp.
          <fpage>626</fpage>
          -
          <lpage>630</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>