<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Detecting Utterance Scenes of a Specific Person</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Kunihiko Sato</string-name>
          <email>kunihiko.k.r.r@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jun Rekimoto</string-name>
          <email>rekimoto@acm.org</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>The University of Tokyo</institution>
          ,
          <institution>Sony Computer Science Laboratory</institution>
          ,
          <addr-line>Tokyo</addr-line>
          ,
          <country country="JP">Japan</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>The University of Tokyo</institution>
          ,
          <addr-line>Tokyo</addr-line>
          ,
          <country country="JP">Japan</country>
        </aff>
      </contrib-group>
      <fpage>563</fpage>
      <lpage>572</lpage>
      <abstract>
        <p>We propose a system that detects the scenes in a video in which a specific speaker is speaking and displays their locations as a heat map on the video's timeline. By detecting such scenes in a drama, talk show, or discussion TV program, the system enables users to skip to the parts of the timeline they want to hear. To detect a specific speaker's utterances, we develop a deep neural network (DNN) that extracts only that speaker from the original sound source. We also implement a detection algorithm based on the output of the proposed DNN and an interface for displaying the detection result. We conduct two experiments on the proposed system. The first confirms how much the proposed DNN suppresses the amplitude of other sounds and how well it preserves the amplitude of the specific person's utterances. The second confirms how accurately the proposed system can detect the utterance scenes of a specific person.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>The demand for video streaming services, such as YouTube,
Netflix, and Amazon Prime, is increasing along with the
amount of video content on the Web. Because so many videos
have already been uploaded, it has become more important to
help users browse videos efficiently.</p>
      <p>
        One method for efficient video browsing is fast-forwarding.
Several researchers have developed content-aware
fast-forwarding techniques that dynamically change playback
speeds depending on the importance given to each video
frame. These techniques rely on key clips [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ], a
skimming model [3], and the viewing histories of other
people [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Direct manipulation techniques enable users to
manipulate object positions in video frames to seek to
specific points in the video timeline [
        <xref ref-type="bibr" rid="ref5 ref6 ref7">5, 6, 7, 8</xref>
        ]. Video streaming
services, such as YouTube, Netflix, and Amazon Prime,
show a small preview image of the video that corresponds to
the playhead position in the timeline.
      </p>
      <p>© 2018. Copyright for the individual papers remains with the authors.
Copying permitted for private and academic purposes.
WII'18, March 11, Tokyo, Japan.</p>
      <p>
        Several studies on video navigation have used audio
information. Conventional audio-based methods [9]
summarize and classify videos based on silence,
speech, and music. CinemaGazer [
        <xref ref-type="bibr" rid="ref9">10</xref>
        ] is an audio-based
technique that fast-forwards scenes without speech. This
technique can only distinguish whether or not a scene
includes speech; it cannot distinguish who is speaking. Thus,
although some studies have supported video browsing using
sound classes, fewer audio-based methods than image-based or
metadata-based methods have been used
to seek to specific points in a video timeline.
      </p>
      <p>We propose a system that detects the scenes in a video in which a
specific speaker is speaking and displays their
locations as a heat map on the video's timeline, as shown in
Figure 1. By detecting such scenes in a drama, talk show,
or discussion TV program, the system enables users to skip
to the parts of the timeline they want to hear. To detect a
specific speaker's utterances, we
develop a deep neural network (DNN) that extracts only that
speaker from the original sound source. Leveraging
this sound source separation DNN, the system operates as
follows. First, the DNN extracts the utterances of the
specific person from the audio of the target video and
diminishes other sounds. After DNN filtering, the
amplitude of scenes in which the target person is
speaking is not reduced much, while that of the
other scenes becomes small. The system then calculates the
difference between the amplitude of the original sound
waveform and that of the filtered sound waveform. The
system judges that scenes whose difference is larger than a
threshold are ones in which the target person is not speaking, and
that scenes with a smaller difference are ones in which the target
person is speaking. Based on this judgment, the scenes in which
the target person speaks are displayed on the video timeline
as a heat map.</p>
      <p>We conduct two experiments on the proposed system. The first
confirms how much the sound source separation DNN suppresses
the amplitude of other sounds and how well it preserves the
amplitude of the specific person's utterances. The second
confirms how accurately the system can detect the utterance
scenes of a specific person.</p>
    </sec>
    <sec id="sec-2">
      <title>Our contributions are summarized as follows.</title>
      <p>We propose a novel system that automatically detects
the utterance scenes of a specific person. We also
confirm how accurately the system can detect these
scenes.</p>
      <p>We develop a sound source separation DNN that can
extract only a specific person's utterances, and propose
a way to create a training dataset for the DNN. Many
studies have successfully tackled monaural sound source
separation. However, these prior studies confirmed their
effectiveness only for separation between
distinct classes, such as “speech and noise”, or
among multiple speakers. They did not clarify
whether a single specific speaker can be separated
when diverse sounds are mixed in
the sound source. We confirm how much the proposed DNN
suppresses the amplitude of other sounds and
how well it preserves the amplitude of the specific
person's utterances.</p>
      <p>RELATED WORK</p>
      <p>Browsing Support for Videos</p>
      <p>
        Various techniques that support users in browsing videos have
been well studied. Fast-forwarding techniques, such as those in
[
        <xref ref-type="bibr" rid="ref10 ref11">11, 12</xref>
        ], help users watch videos in
less time. Several researchers have also developed
content-aware fast-forwarding techniques that dynamically change
playback speeds depending on the importance given to each
video frame. Higuchi et al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] proposed a video
fast-forwarding interface that helps users find important events
in lengthy first-person videos continuously recorded with
wearable cameras. The method of Pongnumkul et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]
makes it easy to find scene changes when sliding the
video seek bar. Cheng et al. [3] proposed a video system that
learns the user's favorite scenes for fast-forwarding. The method
of Kim et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] shows important scenes based on the
viewing histories of other people. CinemaGazer [
        <xref ref-type="bibr" rid="ref9">10</xref>
        ] is an
audio-based technique that fast-forwards scenes without
speech.
      </p>
      <p>
        Several techniques for revealing information that is latent in
a video have also been studied. These include spatio-temporal
volume [
        <xref ref-type="bibr" rid="ref12">13</xref>
        ], positional information [
        <xref ref-type="bibr" rid="ref13">14</xref>
        ], and video
synopsis [
        <xref ref-type="bibr" rid="ref14 ref15 ref16">15, 16, 17</xref>
        ]. Meanwhile, direct manipulation
techniques enable users to manipulate object positions in
video frames to seek to specific points in the video timeline [
        <xref ref-type="bibr" rid="ref5 ref6 ref7">5, 6, 7, 8</xref>
        ].
Video Lens allows users to interactively explore large
collections of baseball videos and related metadata [
        <xref ref-type="bibr" rid="ref17">18</xref>
        ].
On-demand video streaming services, such as YouTube,
Netflix, and Amazon Prime, show a small preview image of the
video that corresponds to the playhead position in the timeline.
Unlike these previous studies, ours focuses on providing an
efficient way for users to skip to the scenes
in which the specific person they are searching for is
speaking.
      </p>
      <p>Monaural Source Separation</p>
      <p>Monaural sound source separation studies are closely
related to the proposed method. We introduce these
methods here and describe how they differ from the proposed
method.</p>
      <p>
        Wiener filtering is a classical method used for separating a
specific sound source from a source waveform [
        <xref ref-type="bibr" rid="ref18">19</xref>
        ]. The
Wiener filtering method heuristically determines
parameters; hence, the parameters cannot be optimized for
various sound sources [
        <xref ref-type="bibr" rid="ref19">20</xref>
        ].
      </p>
      <p>
        In recent years, many studies have attempted to separate
monaural sound sources using deep learning. Previous deep
network approaches to separation [
        <xref ref-type="bibr" rid="ref20 ref21 ref22 ref23 ref24 ref25 ref26 ref27 ref28 ref29 ref30">21, 22, 23, 24, 25, 26, 27, 28, 29, 30,
31</xref>
        ] showed promising performance in
scenarios with sources belonging to distinct signal classes,
such as “speech and noise” and “vocal and accompaniment.”
In addition, many studies have attempted to separate
multiple speakers using DNNs [
        <xref ref-type="bibr" rid="ref21 ref31 ref32 ref33 ref34 ref35 ref36">22, 32, 33, 34, 35, 36, 37</xref>
        ]. These
studies performed well in the speaker-dependent separation
of two or three speakers. Deep clustering [
        <xref ref-type="bibr" rid="ref28 ref37 ref38 ref39">29, 38, 39, 40</xref>
        ] is
a deep learning framework that can be used for
speaker-independent separation of two or more speakers, with no
special constraint on vocabulary and grammar.
      </p>
      <p>Despite these advantages, the prior studies confirmed
their effectiveness only for separation between distinct
classes or among multiple speakers. The proposed approach
instead requires isolating only the speech of a
specific person from a sound source that includes various
noises and multiple speakers.</p>
      <p>Speaker Recognition &amp; Audio Event Detection</p>
      <p>
        Speaker recognition techniques appear effective for
detecting the utterance sections of a specific speaker.
Techniques using phonemes [
        <xref ref-type="bibr" rid="ref40 ref41">41, 42</xref>
        ] perform well. However,
speaker recognition methods are weak against noise. In
addition, the shorter the input speech, the lower the
recognition precision. Ranjan et al. [
        <xref ref-type="bibr" rid="ref42">43</xref>
        ] reported
that the equal error rate (the rate at which the false negative
rate equals the false positive rate) becomes close to 40% when
the input duration is 3 s. Because speaker recognition is
vulnerable to noise and short inputs, it is not suitable for
detecting the utterance scenes of a specific speaker in videos.
      </p>
      <p>
        Jansen et al. [
        <xref ref-type="bibr" rid="ref43">44</xref>
        ] proposed a method for detecting
recurring audio events in YouTube videos using a small
portion of a manually annotated audio data set [
        <xref ref-type="bibr" rid="ref44">45</xref>
        ].
However, although this method can distinguish between
categories of sound, such as human voice and whistle, it
cannot distinguish who is speaking.
      </p>
      <p>IMPLEMENTATION</p>
      <p>The proposed system detects the scenes in a video in which a
specific speaker is speaking and displays their locations as a
heat map on the video's timeline. Figure 2 shows the
system's process. The system first loads the audio of the
target video. Leveraging a DNN, the system then
extracts only the specific speaker from the original sound
source and diminishes the other sounds. In the sound
waveform filtered by the DNN, the amplitude of scenes
in which the target person is speaking is not reduced
much, while that of the other scenes becomes small. The
system calculates the difference between the amplitude
of the original sound waveform and that of the
filtered sound waveform. The system then judges that
scenes whose difference is larger than a threshold are ones in
which the target person is not speaking, and that scenes with a
smaller difference are ones in which the target person is
speaking. Based on this judgment, the scenes in which the target
person speaks are displayed on the video timeline as a heat
map.</p>
      <sec id="sec-2-1">
        <title>Relationship between separated sound sources</title>
      </sec>
      <sec id="sec-2-2">
        <title>Class</title>
        <p>based</p>
      </sec>
      <sec id="sec-2-3">
        <title>Speaker separation</title>
        <p>The following subsections describe the implementation of the
proposed sound source separation DNN, the detection algorithm,
and the interface.</p>
        <p>Sound Source Separation between a Specific Speaker
and Other Sounds</p>
        <p>
          We propose a DNN that detects the utterances of a specific
person and separates them from the other sounds.
This DNN differs from previous sound source
separation methods in the relationship between the
separated sound sources, as shown in Table 1.
Many previous studies tackled separation between different
classes of sound sources, such as “speech and noise,” or among a
fixed number of sound sources, such as “two or three
speakers.”
However, we assumed that the DNN models of the previous
studies could be applied to our task by changing the
training data. We therefore surveyed previous studies and
found that Rethage's method [
          <xref ref-type="bibr" rid="ref30">31</xref>
          ] was appropriate because
it uses a convolutional neural network, which
allows parallel computation. Many previous methods
[
          <xref ref-type="bibr" rid="ref21 ref22 ref23 ref24">22, 23, 24, 25</xref>
          ] employed recurrent neural networks
(RNNs), including long short-term memory (LSTM)
networks, for source separation. As shown in Figure 3, a
limitation of RNNs is that they are difficult to
parallelize because the computation at each
timestep depends on the result of the previous timestep.
Many videos on the Web are several hours long; without
parallel computation, the processing time grows in proportion
to the video length, which is a significant problem.
Furthermore, as the authors of deep clustering
[
          <xref ref-type="bibr" rid="ref37">38</xref>
          ] reported, the most serious problem is that LSTMs
perform poorly when separating speakers
who are not in the training data.
        </p>
        <p>To realize the proposed DNN, we devised a training dataset.
As input data, we created sound mixtures by merging
the target speaker's speech with various environmental noises
and other speakers. We set the clean speech of the target
speaker as the ideal output. By training on this dataset,
the proposed DNN learned to extract the speech of the
target speaker and mute other sounds.</p>
        <p>
          We implemented Rethage's DNN model as described in their
article. Figure 4 illustrates the
implementation. The model is trained to extract a specific
speaker by taking raw waveform data as input and producing
waveform data as output.
Their approach incorporates some techniques used in
WaveNet [
          <xref ref-type="bibr" rid="ref45">46</xref>
          ], such as gated units, skip connections, and
residual blocks. The DNN model features 30 residual
blocks. The dilation factor of each layer doubles from 1 to 512
(1, 2, ..., 256, 512), and this pattern is
repeated three times (three stacks). Prior to the first dilated
convolution, the one-channel input is linearly projected to
128 channels by a standard 3 × 1 convolution to match
the number of filters in each residual layer. The skip
connections are 1 × 1 convolutions, which also feature 128
filters. A rectified linear unit (ReLU) is applied after
summing all skip connections. The final two 3 × 1
convolutional layers are not dilated; they contain 2048 and 256
filters, respectively, and are separated by a ReLU. The
output layer linearly projects the feature map into a
single-channel temporal signal using a 1 × 1 filter.
        </p>
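        <p>The dilation pattern described above can be written out as a minimal sketch. The kernel size of the dilated convolutions is not stated in the text, so the value of 3 used here is an assumption, and the function names are hypothetical. Under that assumption, the sketch lists the 30 dilation factors and estimates how many input samples the dilated layers can see at once.</p>
        <preformat preformat-type="code"><![CDATA[
```python
# Sketch of the dilation schedule: ten dilations per stack, three stacks.
# ASSUMPTION: the dilated convolutions use kernel size 3 (not stated in the text).

DILATIONS_PER_STACK = [2 ** i for i in range(10)]  # 1, 2, 4, ..., 256, 512
NUM_STACKS = 3
KERNEL_SIZE = 3  # assumed value
SAMPLE_RATE = 16000  # Hz, the rate used for training in Experiment 1


def dilation_schedule(per_stack=DILATIONS_PER_STACK, stacks=NUM_STACKS):
    """Return the dilation factor of every residual block, in order."""
    return list(per_stack) * stacks


def receptive_field(dilations, kernel_size=KERNEL_SIZE):
    """Receptive field, in samples, of the stacked dilated convolutions.

    A layer with dilation d widens the field by (kernel_size - 1) * d.
    Only the dilated layers are counted here, not the input projection
    or the final non-dilated convolutions.
    """
    return 1 + sum((kernel_size - 1) * d for d in dilations)


schedule = dilation_schedule()
field = receptive_field(schedule)
print(len(schedule), "residual blocks")
print(field, "samples, roughly", round(1000 * field / SAMPLE_RATE), "ms at 16 kHz")
```
]]></preformat>
        <p>Because every output sample depends only on a fixed window of input samples rather than on a recurrent state, long audio files can be processed in independent chunks, which is the parallelism argument made above.</p>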
        <p>Detection</p>
        <p>After the voice of a specific speaker is extracted by the
sound source separation DNN, the algorithm for detecting
the utterance scenes of the speaker operates as follows. The
algorithm segments the original and filtered sound
waveforms into windows of a fixed size, as shown in Figure 5.
It then calculates the difference between the
amplitudes of the two segments. This calculation
gives the amplitude ratio of the original and filtered
waveforms in decibels. The amplitude difference is obtained by the
following equation:
          <disp-formula>
            <tex-math><![CDATA[\mathrm{Difference} = 20 \log_{10} \frac{\mathrm{RMS}(\mathit{original})}{\mathrm{RMS}(\mathit{filtered})}]]></tex-math>
          </disp-formula>
where RMS(original) represents the root mean square of the
amplitude of the original waveform segment and
RMS(filtered) represents the root mean square of that of the
filtered waveform segment. The difference value (dB)
indicates how much the amplitude of the original sound is
attenuated after it is filtered by the proposed DNN. A
small difference value means that the amplitude of the
original sound is not attenuated much, and a large difference
value means that the amplitude is greatly attenuated.
With the proposed DNN, the amplitude in scenes
in which the target person is speaking is not reduced
much (the difference is small), while that in the other
scenes becomes small (the difference is large), as shown in
Figure 6. Therefore, the algorithm judges that scenes
whose difference is larger than a threshold are ones in which the
target person is not speaking, while those with a smaller
difference are ones in which the target person is speaking. After the
judgment, the window shifts to the next segments. This
operation is repeated until the window
reaches the end of each waveform.</p>
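        <p>The windowed comparison above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names, the small constant <monospace>eps</monospace> that guards against taking the logarithm of zero, and the default threshold of 10 dB are assumptions made for the example. The windows do not overlap, matching the 0.1 s window and 0.1 s step used in Experiment 2.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import numpy as np


def rms(segment):
    """Root mean square of a one-dimensional waveform segment."""
    segment = np.asarray(segment, dtype=np.float64)
    return float(np.sqrt(np.mean(segment ** 2)))


def amplitude_difference_db(original, filtered, eps=1e-12):
    """Difference = 20 * log10(RMS(original) / RMS(filtered)), in dB.

    eps is an assumed safeguard against division by zero and log(0).
    """
    return 20.0 * np.log10((rms(original) + eps) / (rms(filtered) + eps))


def detect_utterance(original, filtered, sample_rate=16000,
                     window_s=0.1, threshold_db=10.0):
    """Return one flag per window: True where the target is judged to speak.

    A window is judged as an utterance of the target speaker when the
    attenuation caused by the filter is smaller than the threshold.
    """
    win = int(round(window_s * sample_rate))
    n_windows = min(len(original), len(filtered)) // win
    flags = []
    for i in range(n_windows):
        seg = slice(i * win, (i + 1) * win)
        diff = amplitude_difference_db(original[seg], filtered[seg])
        flags.append(bool(diff < threshold_db))
    return flags


# Synthetic check: the "filter" keeps the first 0.1 s and attenuates the
# second 0.1 s by a factor of 10 (20 dB).
t = np.arange(3200) / 16000.0
original = np.sin(2 * np.pi * 440.0 * t)
filtered = original.copy()
filtered[1600:] *= 0.1
print(detect_utterance(original, filtered))
```
]]></preformat>
        <p>Note that in this sketch a window that is silent in both waveforms yields a difference of 0 dB and is therefore flagged as an utterance; a practical implementation would likely need an additional energy check.</p>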
        <p>The default value of the threshold is determined based on
the average amplitude ratio of the original and filtered
waveforms. This default value is derived from the results of
Experiment 1, which is described later.</p>
        <p>Interface</p>
        <p>After the scenes in which the specific speaker is speaking are
identified, they are displayed on the timeline as a heat map.</p>
        <p>The red marks on the heat map represent the detected
scenes. The user can jump to a scene in which the
specific speaker is speaking by clicking the corresponding red mark.
In addition, the user can change the threshold of the
detection algorithm by operating the bar on the right side of
the interface. Figure 7 shows how the
heat map changes as the bar is moved. Figure 8
shows how the judgment for detecting the utterance scenes
of the specific speaker changes with the threshold.
Lowering the bar lowers the threshold and reduces the number
of red marks in the timeline, so that
only the scenes that are more likely to be utterances
of the specific speaker are displayed. Raising the bar raises
the threshold and increases the number of red marks.
Scenes that are less likely to be the
specific speaker's utterances may then be included in the heat
map, but the user is less likely to miss a scene in which
the speaker is speaking.</p>
        <p>EXPERIMENT 1</p>
        <p>This experiment confirms how much the sound source
separation DNN suppresses the amplitude of other sounds and
how well it preserves the amplitude of the specific person's
utterances. Ideally, the target
speaker's utterances are not reduced much while the other
sounds become smaller. Such a result indicates
that the proposed DNN extracts only the
utterances of the target speaker.</p>
        <p>We trained the sound source separation DNN model with
the following setup. Then, using the test dataset, we calculated
by how many decibels (dB) the other sounds were
suppressed.</p>
        <p>Setup</p>
        <p>Dataset</p>
        <p>
          We created a training dataset of sound mixtures using
noises from the Diverse Environments Multichannel
Acoustic Noise Database (DEMAND) [
          <xref ref-type="bibr" rid="ref46">47</xref>
          ] and utterances
from the TIMIT corpus [
          <xref ref-type="bibr" rid="ref47">48</xref>
          ] and the CMU ARCTIC corpus [
          <xref ref-type="bibr" rid="ref48">49</xref>
          ].
Figure 9 illustrates how the training
dataset was created. The target speakers for detection were taken from
the CMU ARCTIC corpus. The subset of the CMU corpus
we used features two native English speakers, a
man (ID: RMS) and a woman (ID: SLT). Note that using two
target speakers is common in speech research such as voice
conversion. We randomly chose 593
sentences, corresponding to 30 minutes, from each
speaker as the training samples.
        </p>
        <p>We mixed the training samples of each target speaker with
the noise recordings provided by DEMAND. The subset of
DEMAND that we used provides recordings in 17 different
environmental conditions, such as a park, a bus, or a cafe.
Ten background noises were synthetically mixed with the
target speech for training, while seven background noises
were used for testing. All training samples of each target
speaker (593 sentences) were synthetically mixed with each
of the ten noise types at each of the following signal-to-noise
ratios (SNRs): 0, 5, 10, and 15 dB. Note that the smaller the
dB value, the louder the noise relative to the speech.
We also mixed the training samples of each target speaker
with different speakers from the TIMIT corpus, which
features 24 English speakers covering the following
dialects: New England, Northern, North Midland,
South Midland, Southern, New York City, Western, and
Army Brat. We synthetically mixed all training samples
of each target speaker with a TIMIT speaker at each SNR
(0, 5, 10, and 15 dB). Additionally, we created a new corpus
of two-speaker mixtures using utterances from the TIMIT
corpus. These mixtures were mixed with all training samples
of each target speaker at each SNR. As a result, the
training data for each target speaker totaled 28,464
sentences.</p>
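        <p>Mixing at a given SNR can be sketched as below. The text does not specify how the interfering sound was scaled or how length mismatches were handled, so the power-based scaling, the tiling of short noise clips, and the function name <monospace>mix_at_snr</monospace> are assumptions made for illustration. The last lines show one reading of the mixing conditions that reproduces the reported total of 28,464 sentences per target speaker.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import numpy as np


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech so that the speech-to-noise power ratio is snr_db.

    ASSUMPTIONS: the SNR is defined on mean power over the whole clip, the
    noise is not silent, and a noise clip shorter than the speech is tiled.
    """
    speech = np.asarray(speech, dtype=np.float64)
    noise = np.asarray(noise, dtype=np.float64)
    if len(noise) < len(speech):
        repeats = int(np.ceil(len(speech) / len(noise)))
        noise = np.tile(noise, repeats)
    noise = noise[: len(speech)]
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # Choose the gain so that speech_power / (gain**2 * noise_power)
    # equals 10 ** (snr_db / 10).
    gain = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise


# One reading of the dataset size: 593 sentences mixed at 4 SNRs with
# 10 noise types, one TIMIT speaker, and one two-speaker mixture.
SENTENCES = 593
SNRS_DB = [0, 5, 10, 15]
INTERFERER_KINDS = 10 + 1 + 1
total_mixtures = SENTENCES * len(SNRS_DB) * INTERFERER_KINDS
print(total_mixtures)
```
]]></preformat>
        <p>A lower SNR gives the interfering sound a larger gain, which matches the note above that smaller dB values mean louder noise relative to the speech.</p>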
        <p>Learning</p>
        <p>
          We trained the sound source separation DNN on the
above training dataset at 16 kHz, as shown in Figure 10.
The loss function we used was the same as Rethage's [
          <xref ref-type="bibr" rid="ref30">31</xref>
          ].
The training conditions were as follows: a learning rate of
0.001, a batch size of 60, and an early stopping epoch of 4.
The GPU we used was an NVIDIA TITAN X Pascal.
        </p>
        <p>Test</p>
        <p>We randomly chose 100 sentences of the target speaker
that were not included in the training dataset as test samples.
The test samples were synthetically mixed at each of the
following SNRs: -10, 0, and 10 dB, with the seven test-noise
types from DEMAND, and with one-speaker and two-speaker
mixtures from the TIMIT corpus. Furthermore, we used
noise-only and target-speaker-only sources as test data.
We input 100 files of each source type (noise only,
sound mixtures at -10, 0, and 10 dB, and target only) into each
trained DNN and calculated the average amplitude
difference between the output waveform and the input
waveform.
        </p>
        <p>Result</p>
        <p>Table 2 shows the results. A larger average difference
means that the input was suppressed more.
The results demonstrate that the amplitude of the target speech
is not reduced much, while that of the other sounds
becomes small. In addition, because
the DNN decreases the amplitude of the input waveform by
about 20 dB at most and about 0 dB at
least, the results suggest that it is appropriate to set the
threshold within that range.</p>
        <p>EXPERIMENT 2</p>
        <p>This experiment confirms how accurately the proposed
system can detect the utterance scenes of a specific person.
We had the system detect the target
speech contained in a 10-minute audio track.</p>
        <p>Setup</p>
        <p>The 10-minute audio track was created by concatenating
recordings from DEMAND and the TIMIT corpus that were not in
the training dataset. We randomly chose 100 sentences of the
target speech and superimposed them on the 10-minute track. The
SNR of the target speech relative to the 10-minute track was chosen
randomly from 0, 5, 10, and 15 dB. We used the sound
source separation DNN trained in Experiment 1.</p>
      </sec>
      <sec id="sec-2-4">
        <title>System predicts “not utterance scene of a specific person”</title>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>True Positive</title>
    </sec>
    <sec id="sec-4">
      <title>False negative</title>
    </sec>
    <sec id="sec-5">
      <title>False Positive</title>
    </sec>
    <sec id="sec-6">
      <title>True negative</title>
      <p>The window size of the detection was 0.1 s, and the window's
step length was also 0.1 s. We varied the threshold in
5 dB steps (-5, 0, 5, 10, 15, and 20 dB) to confirm how the
result changes.</p>
      <p>We used the following four events for the test: true positive
(TP), false positive (FP), false negative (FN), and true
negative (TN). Table 3 shows the definition of each event.
Based on the four events, we calculated the accuracy and the
precision, which are formulated as follows:
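        <disp-formula>
          <tex-math><![CDATA[\mathrm{Accuracy}\ (\%) = \frac{TP + TN}{TP + FP + FN + TN} \times 100]]></tex-math>
        </disp-formula>
        <disp-formula>
          <tex-math><![CDATA[\mathrm{Precision}\ (\%) = \frac{TP}{TP + FP} \times 100]]></tex-math>
        </disp-formula>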
The system makes a prediction for each segment of the
waveforms, as shown in Figure 11. When the middle of a
segment falls within the actual utterance timing of the
specific person, the true condition is “Actual utterance
scene of a specific person,” as shown in Figure 12.</p>
      <p>Result</p>
      <p>Table 4 shows the results. In the best case, the
accuracy is 83% and the precision is 92%.
For each target speaker, the accuracy is higher when the
threshold is around 10 to 15 dB, and the precision is higher
when the threshold is around 0 to 5 dB.</p>
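      <p>The evaluation above can be sketched in a few lines. The ground-truth rule follows the text (a segment counts as an utterance when its midpoint falls inside an actual utterance interval), but the function names and the toy intervals are hypothetical and serve only as an illustration.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def label_segments(utterance_intervals, n_segments, window_s=0.1):
    """Ground truth per segment: True if the segment midpoint lies inside
    any (start, end) utterance interval, given in seconds."""
    labels = []
    for i in range(n_segments):
        mid = (i + 0.5) * window_s
        labels.append(any(start <= mid <= end
                          for start, end in utterance_intervals))
    return labels


def confusion_counts(predicted, actual):
    """Count TP, FP, FN, and TN over paired boolean sequences."""
    tp = fp = fn = tn = 0
    for p, a in zip(predicted, actual):
        if p and a:
            tp += 1
        elif p:
            fp += 1
        elif a:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def accuracy_percent(tp, fp, fn, tn):
    """Accuracy (%) = (TP + TN) / (TP + FP + FN + TN) * 100."""
    return 100.0 * (tp + tn) / (tp + fp + fn + tn)


def precision_percent(tp, fp):
    """Precision (%) = TP / (TP + FP) * 100; NaN if nothing was predicted."""
    return 100.0 * tp / (tp + fp) if (tp + fp) else float("nan")


# Toy example: five 0.1 s segments, one utterance from 0.12 s to 0.31 s.
actual = label_segments([(0.12, 0.31)], n_segments=5)
predicted = [False, True, False, True, False]
tp, fp, fn, tn = confusion_counts(predicted, actual)
print(accuracy_percent(tp, fp, fn, tn), precision_percent(tp, fp))
```
]]></preformat>
      <p>Accuracy rewards correct decisions on both speaking and non-speaking segments, whereas precision only asks how many of the flagged segments were correct, which is why the two measures can peak at different thresholds.</p>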
      <p>FUTURE WORK</p>
      <p>User study</p>
      <p>In this paper, we evaluated only the basic performance of
the proposed system and did not conduct a user study. We need to
perform a user study to verify that users can find the
scenes they want to hear accurately and quickly.
We will also need to refine the interface based on the user study.
One alternative interface is to display the utterance scenes
of a specific person as a graph in the video timeline. We will
confirm how usability changes with the interface.</p>
      <p>Improving accuracy</p>
      <p>We need to explore a specialized DNN structure for extracting a
specific speaker more accurately. Such a
structure could improve the system's accuracy
on the Experiment 2 task.</p>
      <table-wrap>
        <label>Table 4</label>
        <table>
          <thead>
            <tr>
              <th rowspan="2">Threshold</th>
              <th colspan="2">Accuracy</th>
              <th colspan="2">Precision</th>
            </tr>
            <tr>
              <th>ID: RMS</th>
              <th>ID: SLT</th>
              <th>ID: RMS</th>
              <th>ID: SLT</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td>-5 dB</td>
              <td>48%</td>
              <td>58%</td>
              <td>83%</td>
              <td>88%</td>
            </tr>
            <tr>
              <td>0 dB</td>
              <td>59%</td>
              <td>67%</td>
              <td>88%</td>
              <td>92%</td>
            </tr>
            <tr>
              <td>5 dB</td>
              <td>73%</td>
              <td>79%</td>
              <td>89%</td>
              <td>91%</td>
            </tr>
            <tr>
              <td>10 dB</td>
              <td>79%</td>
              <td>83%</td>
              <td>85%</td>
              <td>85%</td>
            </tr>
          </tbody>
        </table>
      </table-wrap>
      <sec id="sec-6-3">
        <title>CONCLUSION</title>
        <p>We proposed a system that detects the scenes in a video in
which a specific person speaks and displays them on the
timeline. By detecting such scenes in a drama, talk show,
or discussion TV program, the system enables users to skip to
the parts of the timeline they want to hear.</p>
        <p>We conducted two experiments on the proposed system.
The first confirmed how much the sound source separation DNN
suppresses the amplitude of other sounds and how well it
preserves the amplitude of the specific person's
utterances. The result showed that the smaller the amplitude
of the target speech included in the input source, the
larger the average amplitude difference between the input
and output waveforms. That is, the result was as
expected.</p>
        <p>The second experiment confirmed how accurately the
system can detect the utterance scenes of a specific person.
The result showed that, in the best case, the accuracy was 83%
and the precision was 92%.</p>
        <p>This system can also be applied to audio services, such as
podcasts, Spotify, and SoundCloud. With the advent of smart
speakers, such as Amazon Echo and Google Home, audio
content is likely to increase, along with the importance of
searching timelines based on audio content.</p>
        <p>Cuong Nguyen, Yuzhen Niu, and Feng Liu. 2013.
Direct manipulation video navigation in 3D.
In Proceedings of the SIGCHI Conference on Human
Factors in Computing Systems (CHI '13). ACM, New
York, NY, USA, 1169-1172.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>Keita</given-names>
            <surname>Higuchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ryo</given-names>
            <surname>Yonetani</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Yoichi</given-names>
            <surname>Sato</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>EgoScanning: Quickly Scanning First-Person Videos with Egocentric Elastic Timelines</article-title>
          .
          <source>In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems (CHI '17)</source>
          . ACM, New York, NY, USA,
          <fpage>6536</fpage>
          -
          <lpage>6546</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>Suporn</given-names>
            <surname>Pongnumkul</surname>
          </string-name>
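          ,
          <string-name>
            <given-names>Jue</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Gonzalo</given-names>
            <surname>Ramos</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Michael</given-names>
            <surname>Cohen</surname>
          </string-name>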
          .
          <year>2010</year>
          .
          <article-title>Content-aware dynamic timeline for video browsing</article-title>
          .
          <source>In Proceedings of the 23rd annual ACM symposium on User interface software and technology (UIST '10)</source>
          . ACM, New York, NY, USA,
          <fpage>139</fpage>
          -
          <lpage>142</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>Kai-Yin</given-names>
            <surname>Cheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Sheng-Jie</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Bing-Yu</given-names>
            <surname>Chen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Hao-Hua</given-names>
            <surname>Chu</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>SmartPlayer: user-centric video fast-forwarding</article-title>
          .
          <source>In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI '09)</source>
          . ACM, New York, NY, USA,
          <fpage>789</fpage>
          -
          <lpage>798</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>Juho</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Philip J.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Carrie J.</given-names>
            <surname>Cai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Shang-Wen (Daniel)</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Krzysztof Z.</given-names>
            <surname>Gajos</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Robert C.</given-names>
            <surname>Miller</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Data-driven interaction techniques for improving navigation of educational videos</article-title>
          .
          <source>In Proceedings of the 27th annual ACM symposium on User interface software and technology (UIST '14)</source>
          . ACM, New York, 6.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>Pierre</given-names>
            <surname>Dragicevic</surname>
          </string-name>
          , Gonzalo Ramos, Jacobo Bibliowitcz, Derek Nowrouzezahrai, Ravin Balakrishnan, and
          <string-name>
            <given-names>Karan</given-names>
            <surname>Singh</surname>
          </string-name>
          .
          <year>2008</year>
          .
          <article-title>Video browsing by direct manipulation</article-title>
          .
          <source>In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI '08)</source>
          . ACM, New York, NY, USA,
          <fpage>237</fpage>
          -
          <lpage>246</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          7.
          <string-name>
            <given-names>Thorsten</given-names>
            <surname>Karrer</surname>
          </string-name>
          , Malte Weiss,
          <string-name>
            <given-names>Eric</given-names>
            <surname>Lee</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Jan</given-names>
            <surname>Borchers</surname>
          </string-name>
          .
          <year>2008</year>
          .
          <article-title>DRAGON: a direct manipulation interface for frame-accurate in-scene video navigation</article-title>
          .
          <source>In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI '08)</source>
          . ACM, New York, NY, USA,
          <fpage>247</fpage>
          -
          <lpage>250</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          8.
          <string-name>
            <given-names>Thorsten</given-names>
            <surname>Karrer</surname>
          </string-name>
          , Moritz Wittenhagen, and
          <string-name>
            <given-names>Jan</given-names>
            <surname>Borchers</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>DragLocks: handling temporal ambiguities in direct manipulation video navigation</article-title>
          .
          <source>In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI '12)</source>
          . ACM, New York, NY, USA,
          <fpage>623</fpage>
          -
          <lpage>626</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          9.
          <string-name>
            <given-names>C.</given-names>
            <surname>Saraceno</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.</given-names>
            <surname>Leonardi</surname>
          </string-name>
          ,
          <article-title>"Audio as a support to scene change detection and characterization of video sequences,"</article-title>
          <source>1997 IEEE International Conference on Acoustics, Speech, and Signal Processing</source>
          , Munich,
          <year>1997</year>
          , pp.
          <fpage>2597</fpage>
          -
          <lpage>2600</lpage>
          vol.
          <volume>4</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          10.
          <string-name>
            <given-names>Kazutaka</given-names>
            <surname>Kurihara</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>CinemaGazer: a system for watching videos at very high speed</article-title>
          .
          <source>In Proceedings of the International Working Conference on Advanced Visual Interfaces (AVI '12)</source>
          , Genny Tortora, Stefano Levialdi, and Maurizio Tucci (Eds.). ACM, New York, NY, USA,
          <fpage>108</fpage>
          -
          <lpage>115</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          11.
          <string-name>
            <given-names>Abir</given-names>
            <surname>Al-Hajri</surname>
          </string-name>
          , Matthew Fong,
          <string-name>
            <given-names>Gregor</given-names>
            <surname>Miller</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Sidney</given-names>
            <surname>Fels</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Fast forward with your VCR: visualizing single-video viewing statistics for navigation and sharing</article-title>
          .
          <source>In Proceedings of Graphics Interface 2014 (GI '14)</source>
          . Canadian Information Processing Society, Toronto, Ont., Canada,
          <fpage>123</fpage>
          -
          <lpage>128</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          12.
          <string-name>
            <given-names>Neel</given-names>
            <surname>Joshi</surname>
          </string-name>
          , Wolf Kienzle, Mike Toelle, Matt Uyttendaele, and
          <string-name>
            <given-names>Michael F.</given-names>
            <surname>Cohen</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Real-time hyperlapse creation via optimal frame selection</article-title>
          .
          <source>ACM Trans. Graph</source>
          .
          <volume>34</volume>
          ,
          <issue>4</issue>
          ,
          Article 63 (July
          <year>2015</year>
          ), 9 pages.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          13.
          <string-name>
            <given-names>Cuong</given-names>
            <surname>Nguyen</surname>
          </string-name>
          , Yuzhen Niu, and Feng Liu.
          <year>2012</year>
          .
          <article-title>Video summagator: an interface for video summarization and navigation</article-title>
          .
          <source>In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI '12)</source>
          . ACM, New York, NY, USA,
          <fpage>647</fpage>
          -
          <lpage>650</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          14.
          <string-name>
            <given-names>Suporn</given-names>
            <surname>Pongnumkul</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Jue</given-names>
            <surname>Wang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Michael</given-names>
            <surname>Cohen</surname>
          </string-name>
          .
          <year>2008</year>
          .
          <article-title>Creating map-based storyboards for browsing tour videos</article-title>
          .
          <source>In Proceedings of the 21st annual ACM symposium on User interface software and technology (UIST '08)</source>
          . ACM, New York, NY, USA,
          <fpage>13</fpage>
          -
          <lpage>22</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          15.
          <string-name>
            <given-names>Alex</given-names>
            <surname>Rav-Acha</surname>
          </string-name>
          , Yael Pritch, and Shmuel Peleg.
          <source>In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR '06)</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          16.
          <string-name>
            <given-names>Yael</given-names>
            <surname>Pritch</surname>
          </string-name>
          , Alex Rav-Acha,
          <string-name>
            <given-names>Avital</given-names>
            <surname>Gutman</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Shmuel</given-names>
            <surname>Peleg</surname>
          </string-name>
          .
          <year>2007</year>
          .
          <article-title>Webcam Synopsis: Peeking Around the World</article-title>
          .
          <source>In Proc. IEEE International Conference on Computer Vision (ICCV '07)</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          17.
          <string-name>
            <given-names>Yael</given-names>
            <surname>Pritch</surname>
          </string-name>
          , Alex Rav-Acha, and
          <string-name>
            <given-names>Shmuel</given-names>
            <surname>Peleg</surname>
          </string-name>
          .
          <year>2008</year>
          .
          <article-title>Nonchronological Video Synopsis and Indexing</article-title>
          .
          <source>IEEE Trans. Pattern Anal. Mach. Intell</source>
          .
          <volume>30</volume>
          ,
          <issue>11</issue>
          (November
          <year>2008</year>
          ),
          <fpage>1971</fpage>
          -
          <lpage>1984</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          18.
          <string-name>
            <given-names>Justin</given-names>
            <surname>Matejka</surname>
          </string-name>
          , Tovi Grossman, and
          <string-name>
            <given-names>George</given-names>
            <surname>Fitzmaurice</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Video lens: rapid playback and exploration of large video collections and associated metadata</article-title>
          .
          <source>In Proceedings of the 27th annual ACM symposium on User interface software and technology (UIST '14)</source>
          . ACM, New York, NY, USA,
          <fpage>541</fpage>
          -
          <lpage>550</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          19.
          <string-name>
            <given-names>Pascal</given-names>
            <surname>Scalart</surname>
          </string-name>
          et al.
          <article-title>Speech enhancement based on a priori signal to noise estimation</article-title>
          .
          <source>In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source>
          , volume
          <volume>2</volume>
          , pp.
          <fpage>629</fpage>
          -
          <lpage>632</lpage>
          ,
          <year>1996</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          20.
          <string-name>
            <given-names>Monisankha</given-names>
            <surname>Pal</surname>
          </string-name>
          , et al.
          <article-title>Robustness of Voice Conversion Techniques Under Mismatched Conditions</article-title>
          .
          <source>arXiv preprint arXiv:1612.07523</source>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          21.
          <string-name>
            <given-names>Xugang</given-names>
            <surname>Lu</surname>
          </string-name>
          , Yu Tsao, Shigeki Matsuda, and
          <string-name>
            <given-names>Chiori</given-names>
            <surname>Hori</surname>
          </string-name>
          .
          <article-title>Speech enhancement based on deep denoising autoencoder</article-title>
          .
          <source>In Interspeech</source>
          , pp.
          <fpage>436</fpage>
          -
          <lpage>440</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          22.
          <string-name>
            <given-names>Po-Sen</given-names>
            <surname>Huang</surname>
          </string-name>
          , Minje Kim, Mark Hasegawa-Johnson, and Paris Smaragdis.
          <article-title>Joint optimization of masks and deep recurrent neural networks for monaural source separation</article-title>
          .
          <source>IEEE/ACM Transactions on Audio, Speech and Language Processing</source>
          ,
          <volume>23</volume>
          (
          <issue>12</issue>
          ):
          <fpage>2136</fpage>
          -
          <lpage>2147</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          23.
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. R.</given-names>
            <surname>Dai</surname>
          </string-name>
          and
          <string-name>
            <given-names>C. H.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <article-title>A Regression Approach to Speech Enhancement Based on Deep Neural Networks</article-title>
          ,
          <source>in IEEE/ACM Transactions on Audio, Speech, and Language Processing</source>
          , vol.
          <volume>23</volume>
          , no.
          <issue>1</issue>
          , pp.
          <fpage>7</fpage>
          -
          <lpage>19</lpage>
          , Jan.
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          24.
          <string-name>
            <given-names>Anurag</given-names>
            <surname>Kumar</surname>
          </string-name>
          and
          <string-name>
            <given-names>Dinei</given-names>
            <surname>Florencio</surname>
          </string-name>
          .
          <article-title>Speech enhancement in multiple-noise conditions using deep neural networks</article-title>
          .
          <source>arXiv preprint arXiv:1605.02427</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          25.
          <string-name>
            <given-names>Jordi</given-names>
            <surname>Pons</surname>
          </string-name>
          , Jordi Janer, Thilo Rode, and
          <string-name>
            <given-names>Waldo</given-names>
            <surname>Nogueira</surname>
          </string-name>
          .
          <article-title>Remixing music using source separation algorithms to improve the musical experience of cochlear implant users</article-title>
          .
          <source>The Journal of the Acoustical Society of America</source>
          ,
          <volume>140</volume>
          (
          <issue>6</issue>
          ):
          <fpage>4338</fpage>
          -
          <lpage>4349</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          26.
          <string-name>
            <given-names>Kaizhi</given-names>
            <surname>Qian</surname>
          </string-name>
          , et al.
          <article-title>Speech enhancement using Bayesian WaveNet</article-title>
          .
          <source>Proc. Interspeech 2017</source>
          (
          <year>2017</year>
          ):
          <fpage>2013</fpage>
          -
          <lpage>2017</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          27.
          <string-name>
            <given-names>Ming</given-names>
            <surname>Tu</surname>
          </string-name>
          and
          <string-name>
            <given-names>Xianxian</given-names>
            <surname>Zhang</surname>
          </string-name>
          .
          <article-title>Speech enhancement based on deep neural networks with skip connections</article-title>
          .
          <source>2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source>
          . IEEE
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          28.
          <string-name>
            <given-names>Santiago</given-names>
            <surname>Pascual</surname>
          </string-name>
          , Antonio Bonafonte, and
          <string-name>
            <given-names>Joan</given-names>
            <surname>Serrà</surname>
          </string-name>
          .
          <article-title>SEGAN: Speech Enhancement Generative Adversarial Network</article-title>
          .
          <source>arXiv preprint arXiv:1703.09452</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          29.
          <string-name>
            <given-names>Y.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Hershey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Le Roux</surname>
          </string-name>
          and
          <string-name>
            <given-names>N.</given-names>
            <surname>Mesgarani</surname>
          </string-name>
          ,
          <article-title>"Deep clustering and conventional networks for music separation: Stronger together,"</article-title>
          <source>2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source>
          , New Orleans, LA,
          <year>2017</year>
          , pp.
          <fpage>61</fpage>
          -
          <lpage>65</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          30.
          <string-name>
            <given-names>Szu-Wei</given-names>
            <surname>Fu</surname>
          </string-name>
          , et al.
          <article-title>Raw Waveform-based Speech Enhancement by Fully Convolutional Networks</article-title>
          .
          <source>arXiv preprint arXiv:1703.02205</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          31.
          <string-name>
            <given-names>Dario</given-names>
            <surname>Rethage</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Jordi</given-names>
            <surname>Pons</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Xavier</given-names>
            <surname>Serra</surname>
          </string-name>
          .
          <article-title>A Wavenet for Speech Denoising</article-title>
          .
          <source>arXiv preprint arXiv:1706.07162</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          32.
          <string-name>
            <given-names>Pritish</given-names>
            <surname>Chandna</surname>
          </string-name>
          , Marius Miron, Jordi Janer, and
          <string-name>
            <given-names>Emilia</given-names>
            <surname>Gómez</surname>
          </string-name>
          .
          <article-title>Monoaural audio source separation using deep convolutional neural networks</article-title>
          .
          <source>In International Conference on Latent Variable Analysis and Signal Separation</source>
          , pages
          <fpage>258</fpage>
          -
          <lpage>266</lpage>
          . Springer,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          33.
          <string-name>
            <given-names>Z. Q.</given-names>
            <surname>Wang</surname>
          </string-name>
          and
          <string-name>
            <given-names>D.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>"Recurrent deep stacking networks for supervised speech separation,"</article-title>
          <source>2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source>
          , New Orleans, LA,
          <year>2017</year>
          , pp.
          <fpage>71</fpage>
          -
          <lpage>75</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          34.
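          <string-name>
            <given-names>Jen-Tzung</given-names>
            <surname>Chien</surname>
          </string-name>
          and
          <string-name>
            <given-names>Kuan-Ting</given-names>
            <surname>Kuo</surname>
          </string-name>
          .
          <article-title>Variational Recurrent Neural Networks for Speech Separation</article-title>
          .
          <source>In Interspeech</source>
          , pp.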
          <fpage>1193</fpage>
          -
          <lpage>1197</lpage>
          ,
          <year>2017</year>
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          35.
          <string-name>
            <given-names>K.</given-names>
            <surname>Osako</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Mitsufuji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Singh</surname>
          </string-name>
          and
          <string-name>
            <given-names>B.</given-names>
            <surname>Raj</surname>
          </string-name>
          ,
          <article-title>"Supervised monaural source separation based on autoencoders,"</article-title>
          <source>2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source>
          , New Orleans, LA,
          <year>2017</year>
          , pp.
          <fpage>11</fpage>
          -
          <lpage>15</lpage>
          . doi: 10.1109/ICASSP.2017.7951788
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          36.
          <string-name>
            <given-names>Yuan-Shan</given-names>
            <surname>Lee</surname>
          </string-name>
          , et al.
          <article-title>Fully complex deep neural network for phase-incorporating monaural source separation</article-title>
          .
          <source>2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source>
          . IEEE
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          37.
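          <string-name>
            <given-names>Yannan</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Jun</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Li-Rong</given-names>
            <surname>Dai</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Chin-Hui</given-names>
            <surname>Lee</surname>
          </string-name>
          .
          <article-title>A Maximum Likelihood Approach to Deep Neural Network Based Nonlinear Spectral Mapping for Single-Channel Speech Separation</article-title>
          .
          <source>In Interspeech</source>
          , pp.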
          <fpage>1178</fpage>
          -
          <lpage>1182</lpage>
          ,
          <year>2017</year>
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          38.
          <string-name>
            <given-names>John R.</given-names>
            <surname>Hershey</surname>
          </string-name>
          , et al.
          <article-title>Deep clustering: Discriminative embeddings for segmentation and separation</article-title>
          .
          <source>2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source>
          . IEEE
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          39.
          <string-name>
            <given-names>Yusuf</given-names>
            <surname>Isik</surname>
          </string-name>
          , et al.
          <article-title>Single-channel multi-speaker separation using deep clustering</article-title>
          .
          <source>arXiv preprint arXiv:1607.02173</source>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          40.
          <string-name>
            <given-names>Dong</given-names>
            <surname>Yu</surname>
          </string-name>
          , et al.
          <article-title>Permutation invariant training of deep models for speaker-independent multi-talker speech separation</article-title>
          .
          <source>arXiv preprint arXiv:1607.00325</source>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          41.
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. Q.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>"Deep neural networks based speaker modeling at different levels of phonetic granularity,"</article-title>
          <source>2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source>
          , March
          <year>2017</year>
          , pp.
          <fpage>5440</fpage>
          -
          <lpage>5444</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>
          42.
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Scheffer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Ferrer</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>McLaren</surname>
          </string-name>
          ,
          <article-title>"A novel scheme for speaker recognition using a phonetically-aware deep neural network,"</article-title>
          <source>2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source>
          , Florence,
          <year>2014</year>
          , pp.
          <fpage>1695</fpage>
          -
          <lpage>1699</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref42">
        <mixed-citation>
          43.
          <string-name>
            <given-names>S.</given-names>
            <surname>Ranjan</surname>
          </string-name>
          and
          <string-name>
            <given-names>J. H. L.</given-names>
            <surname>Hansen</surname>
          </string-name>
          ,
          <article-title>"Curriculum Learning Based Approaches for Noise Robust Speaker Recognition,"</article-title>
          <source>in IEEE/ACM Transactions on Audio, Speech, and Language Processing</source>
          , vol.
          <volume>26</volume>
          , no.
          <issue>1</issue>
          , pp.
          <fpage>197</fpage>
          -
          <lpage>210</lpage>
          , Jan.
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref43">
        <mixed-citation>
          44.
          <string-name>
            <given-names>Aren</given-names>
            <surname>Jansen</surname>
          </string-name>
          , et al.
          <article-title>Large-scale audio event discovery in one million YouTube videos</article-title>
          .
          <source>Proceedings of ICASSP</source>
          .
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref44">
        <mixed-citation>
          45.
          <string-name>
            <surname>Gemmeke</surname>
            ,
            <given-names>Jort F.</given-names>
          </string-name>
          , et al.
          <article-title>"Audio Set: An ontology and human-labeled dataset for audio events."</article-title>
          <source>IEEE ICASSP</source>
          .
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref45">
        <mixed-citation>
          46.
          <string-name>
            <surname>Oord</surname>
            ,
            <given-names>Aaron van den</given-names>
          </string-name>
          , et al.
          <article-title>"WaveNet: A generative model for raw audio."</article-title>
          <source>arXiv preprint arXiv:1609.03499</source>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref46">
        <mixed-citation>
          47.
          <string-name>
            <given-names>Joachim</given-names>
            <surname>Thiemann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Nobutaka</given-names>
            <surname>Ito</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Emmanuel</given-names>
            <surname>Vincent</surname>
          </string-name>
          .
          <article-title>The diverse environments multichannel acoustic noise database: A database of multichannel environmental noise recordings</article-title>
          .
          <source>The Journal of the Acoustical Society of America</source>
          ,
          <volume>133</volume>
          (
          <issue>5</issue>
          ):
          <fpage>3591</fpage>
          -
          <lpage>3591</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref47">
        <mixed-citation>
          48.
          <string-name>
            <given-names>J.</given-names>
            <surname>Garofolo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Lamel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Fisher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Fiscus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Pallett</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Dahlgren</surname>
          </string-name>
          , and
          <string-name>
            <given-names>V.</given-names>
            <surname>Zue</surname>
          </string-name>
          , “
          <article-title>TIMIT acoustic-phonetic continuous speech corpus</article-title>
          ,”
          <year>1993</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref48">
        <mixed-citation>
          49.
          <string-name>
            <given-names>J.</given-names>
            <surname>Kominek</surname>
          </string-name>
          and
          <string-name>
            <given-names>A. W.</given-names>
            <surname>Black</surname>
          </string-name>
          , “
          <article-title>The CMU Arctic speech databases</article-title>
          ,” in
          <source>Fifth ISCA Workshop on Speech Synthesis</source>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>