Detecting Utterance Scenes of a Specific Person

Kunihiko Sato, The University of Tokyo, Tokyo, Japan, kunihiko.k.r.r@gmail.com
Jun Rekimoto, The University of Tokyo / Sony Computer Science Laboratory, Tokyo, Japan, rekimoto@acm.org

ABSTRACT
We propose a system that detects the scenes in which a specific speaker is speaking in a video and displays those scenes as a heat map on the video's timeline. The system enables users to skip to the parts they want to hear by detecting the scenes of a drama, talk show, or discussion TV program in which a specific speaker is speaking. To detect a specific speaker's utterances, we develop a deep neural network (DNN) that extracts only the specific speaker from the original sound source. We also implement a detection algorithm based on the output of the proposed DNN and an interface for displaying the detection result. We conduct two experiments on the proposed system. The first confirms how much the amplitude of the other sounds is suppressed by the proposed DNN while the amplitude of the specific person's utterances is preserved. The second confirms how accurately the proposed system detects the utterance scenes of a specific person.

Figure 1. Proposed interface. The red marks in the timeline describe the utterance scenes of a specific person. The threshold bar changes the threshold of the scene detection algorithm.

Author Keywords
Scene detection; timeline; video; sound source separation; deep learning.

ACM Classification Keywords
H5.1. Information interfaces and presentation (e.g., HCI): Multimedia Information Systems; H5.2. Information interfaces and presentation (e.g., HCI): User Interface

© 2018. Copyright for the individual papers remains with the authors. Copying permitted for private and academic purposes. WII'18, March 11, Tokyo, Japan.

INTRODUCTION
The demand for video streaming services such as YouTube, Netflix, and Amazon Prime is increasing, as is the amount of video content on the Web. With so many videos already uploaded, supporting users in browsing videos efficiently has become increasingly important.

One method for efficient video browsing is fast-forwarding. Several researchers have developed content-aware fast-forwarding techniques that dynamically change the playback speed depending on the importance of each video frame, using key clips [1, 2], a skimming model [3], or the viewing histories of other people [4]. Direct manipulation techniques enable users to manipulate object positions in video frames to seek specific video timelines [5, 6, 7, 8]. Video streaming services such as YouTube, Netflix, and Amazon Prime show a thumbnail of the video at the position of the playhead in the timeline.

Several studies on video navigation have used audio information. Conventional methods [9] summarize and classify videos based on silence, speech, and music. CinemaGazer [10] is an audio-based technique that fast-forwards scenes without speech. It can only distinguish whether a scene includes speech, not who is speaking. In short, some studies have supported video browsing using sound classes, but audio-based methods for seeking specific video timelines remain less common than image- or metadata-based methods.

We propose a system that detects the scenes in which a specific speaker is speaking in a video and displays them as a heat map on the video's timeline, as shown in Figure 1. The system enables users to skip to the parts they want to hear by detecting the scenes of a drama, talk show, or discussion TV program in which a specific speaker is speaking.
To detect a specific speaker's utterances, we develop a deep neural network (DNN) that extracts only the specific speaker from the original sound source. Leveraging this sound source separation DNN, the system operates as follows. First, the DNN extracts the utterances of the specific person from the audio track of the target video and diminishes all other sounds. As a result of this filtering, the amplitude of the scenes in which the target person is speaking does not become very small, while that of the other scenes becomes small. The system then calculates the difference between the amplitude of the original sound waveform and that of the filtered waveform. Scenes whose difference is larger than a threshold are judged to be where the target person does not speak, and scenes with a smaller difference are judged to be where the target person utters. Based on this judgment, the scenes in which the target person speaks are displayed on the video timeline as a heat map.

We conduct two experiments on the proposed system. The first confirms how much the amplitude of the other sounds is suppressed, and how much that of the specific person's utterances is preserved, by the sound source separation DNN that extracts only the specific person's utterances. The second confirms how accurately the system detects the utterance scenes of a specific person.

Our contributions are summarized as follows.
- We propose a novel system that automatically detects the utterance scenes of a specific person, and we confirm how accurately the system can detect them.
- We develop a sound source separation DNN that extracts only a specific person's utterances, and we propose how to create a training dataset for this DNN. Many studies have successfully tackled monaural sound source separation, but these prior studies only confirmed separation between distinct classes, such as "speech and noise," or between multiple speakers. They did not clarify whether only a specific speaker can be separated when diverse sounds are mixed in the source. We confirm how much the amplitude of the other sounds is suppressed while that of the specific person's utterances is preserved by the proposed DNN.

RELATED WORK

Browsing Support for Videos
Various techniques for supporting users in browsing videos have been studied. Fast-forwarding techniques, such as those in [11, 12], help users watch videos in less time. Several researchers have developed content-aware fast-forwarding techniques that dynamically change the playback speed depending on the importance of each video frame. Higuchi et al. [1] proposed a fast-forwarding interface that helps users find important events in lengthy first-person videos continuously recorded with wearable cameras. The proposal of Pongnumkul et al. [2] makes it easy to find scene changes when sliding the video seek bar. Cheng et al. [3] proposed a video system that learns the user's favorite scenes for fast-forwarding. Kim et al.'s method [4] shows important scenes based on the viewing histories of other people. CinemaGazer [10] is an audio-based technique that fast-forwards scenes without speech.

Several techniques for indicating potential information in a video have also been studied, including spatio-temporal volumes [13], positional information [14], and video synopsis [15, 16, 17]. Meanwhile, direct manipulation techniques enable users to manipulate object positions in video frames to seek specific video timelines [5, 6, 7, 8]. Video Lens allows users to interactively explore large collections of baseball videos and related metadata [18]. On-demand video streaming services, such as YouTube, Netflix, and Amazon Prime, show a thumbnail of the video at the position of the playhead in the timeline. Unlike these previous studies, ours focuses on providing an efficient way for users to skip to the scenes in which a specific person they are searching for is speaking.

Monaural Source Separation
Monaural sound source separation studies are closely related to the proposed method. We introduce these methods here and show how they differ from ours. Wiener filtering is a classical method for separating a specific sound source from a source waveform [19]. It determines its parameters heuristically; hence, the parameters cannot be optimized for various sound sources [20].
In recent years, many studies have attempted to separate monaural sound sources using deep learning. Previous deep network approaches to separation [21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31] showed promising performance in scenarios where the sources belong to distinct signal classes, such as "speech and noise" or "vocals and accompaniment." In addition, many studies have attempted to separate multiple speakers using DNNs [22, 32, 33, 34, 35, 36, 37]; these performed well in the speaker-dependent separation of two or three speakers. Deep clustering [29, 38, 39, 40] is a deep learning framework for the speaker-independent separation of two or more speakers, with no special constraint on vocabulary or grammar.

Despite these advances, the prior studies only confirmed separation between distinct classes or between multiple speakers. The function required in our approach is to isolate only the speech of a specific person from a sound source that includes various noises and multiple speakers.

Speaker Recognition & Audio Event Detection
Speaker recognition seems effective for detecting the utterance sections of a specific speaker, and techniques using phonemes [41, 42] perform well. However, speaker recognition methods are weak against noise, and the shorter the input speech duration, the lower the recognition precision. Ranjan et al. [43] reported that the equal error rate (where the false negative rate equals the false positive rate) approaches 40% when the input duration is 3 s. Speaker recognition is therefore not suitable for detecting the utterance scenes of a specific speaker in videos, because it is vulnerable to noise and short inputs.

Jansen et al. [44] proposed a method for detecting recurring audio events in YouTube videos using a small portion of a manually annotated audio dataset [45]. However, while this method can distinguish between categories of sound, such as a human voice and a whistle, it cannot distinguish who is speaking.
IMPLEMENTATION
The proposed system detects the scenes in which a specific speaker is speaking in a video and displays them as a heat map on the video's timeline. Figure 2 shows the system's process. The system first loads the sound of the target video. Leveraging a DNN, it then extracts only the specific speaker from the original sound source and diminishes the other sounds. In the waveform filtered by the DNN, the amplitude of the scenes in which the target person is speaking does not become too small, while that of the other scenes becomes small. The system calculates the difference between the amplitude of the original waveform and that of the filtered waveform, and judges that scenes with a difference larger than a threshold are where the target person does not speak, while those with a smaller difference are where the target person utters. The scenes in which the target person speaks are then displayed on the video timeline as a heat map. The following subsections describe the implementation of the proposed sound source separation DNN, the detection algorithm, and the interface.

Figure 2. Proposed system's process. The system extracts the audio waveform from the target video. The system's DNN extracts the utterances of a specific person from the audio and diminishes the other sounds. The system calculates the difference between the amplitude of the original sound waveform and that of the filtered waveform. Scenes with a difference larger than a threshold are judged to be where the target person does not speak, and those with a smaller difference are judged to be where the target person utters. The scenes in which the target person speaks are displayed on the video timeline as a heat map based on this judgment.

Sound Source Separation between a Specific Speaker and Other Sounds
We propose a DNN that detects the utterances of a specific person and separates them from all other sounds. This DNN differs from previous sound source separation methods in the relationship between the separated sound sources, as shown in Table 1. Many previous studies tackled separation between different classes of sound source, such as "speech and noise," or between a fixed number of sources, such as two or three speakers.

Table 1. Difference between the previous sound source separation methods and the proposed method (relationship between the separated sound sources).
- Class-based separation: speech vs. noise
- Speaker separation: speaker vs. speaker
- Proposed: a specific speaker vs. all other sounds, including noise and other speakers

However, we assumed that the DNN models of previous studies could be applied to our task if we changed the training data. We therefore surveyed previous studies and found Rethage's method [31] appropriate because it uses a convolutional neural network, which allows parallel computation. Many previous methods [22, 23, 24, 25] employed recurrent neural networks (RNNs), including long short-term memory (LSTM) networks, for source separation. As shown in Figure 3, the limitation of RNNs is that parallel computation is difficult because the computation at each timestep depends on the results of the previous timestep. Many videos on the Web are several hours long; without parallel computation, the processing time grows linearly with the video length. Furthermore, as the authors of deep clustering [38] reported, the most serious problem is that LSTMs perform poorly when separating speakers who are not in the training data.

Figure 3. Diagrams showing the computational structure of typical CNN and LSTM architectures. Red signifies convolutions or matrix multiplications. The computation of an LSTM at each timestep depends on the results of the previous timestep, which is why LSTMs are difficult to parallelize.

To realize the proposed DNN, we devised a training dataset. As input data, we created sound mixtures by merging the target speaker with various environmental noises and other speakers. We set the clean speech of the target speaker as the ideal output. By training on this dataset, the proposed DNN learns to extract the speech of the target speaker and mute the other sounds.

We implemented Rethage's DNN model as described in their article; Figure 4 visualizes the implementation. The model is trained to extract a specific speaker by taking waveform data as-is for both input and output. The approach incorporates techniques from WaveNet [46], such as gated units, skip connections, and residual blocks. The model features 30 residual blocks. The dilation factor in each layer increases by powers of 2 in the range 1, 2, ..., 256, 512, and this pattern is repeated three times (three stacks). Before the first dilated convolution, the one-channel input is linearly projected to 128 channels by a standard 3 × 1 convolution to match the number of filters in each residual layer. The skip connections are 1 × 1 convolutions, also with 128 filters. A rectified linear unit (ReLU) is applied after summing all skip connections. The final two 3 × 1 convolutional layers are not dilated, contain 2048 and 256 filters respectively, and are separated by a ReLU. The output layer linearly projects the feature map to a single-channel temporal signal using a 1 × 1 filter.

Figure 4. Left: Schematic diagram of the sound source separation DNN model. The waveform data is used as-is for input and output, without frequency-domain features. Right: Implementation details of the sound source separation DNN.
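To make the architecture description above concrete, the following PyTorch sketch reproduces its main ingredients (30 residual blocks in three stacks with dilations 1 to 512, 128-channel residual and skip paths, and the 2048/256-filter output stage). It is an illustration only, not the authors' code: the padding, the exact gated-unit wiring, and the training loss follow Rethage et al. [31] only loosely, and all class and variable names are ours.

```python
# A minimal sketch of the dilated-convolution denoising network described above.
# Assumptions: gated units and 1x1 residual/skip projections as in WaveNet-style
# models; "same" padding so the output length matches the input length.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels=128, dilation=1):
        super().__init__()
        self.filter_conv = nn.Conv1d(channels, channels, 3, padding=dilation, dilation=dilation)
        self.gate_conv = nn.Conv1d(channels, channels, 3, padding=dilation, dilation=dilation)
        self.res_conv = nn.Conv1d(channels, channels, 1)   # residual 1x1 projection
        self.skip_conv = nn.Conv1d(channels, channels, 1)  # skip 1x1 projection (128 filters)

    def forward(self, x):
        z = torch.tanh(self.filter_conv(x)) * torch.sigmoid(self.gate_conv(x))  # gated unit
        return x + self.res_conv(z), self.skip_conv(z)

class SpeakerExtractionNet(nn.Module):
    def __init__(self, channels=128, stacks=3, layers_per_stack=10):
        super().__init__()
        self.input_proj = nn.Conv1d(1, channels, 3, padding=1)  # 1 channel -> 128 channels
        self.blocks = nn.ModuleList(
            ResidualBlock(channels, dilation=2 ** i)
            for _ in range(stacks) for i in range(layers_per_stack)  # 3 x 10 = 30 blocks
        )
        self.post = nn.Sequential(
            nn.ReLU(),                                # ReLU after summing the skips
            nn.Conv1d(channels, 2048, 3, padding=1),  # non-dilated 3x1, 2048 filters
            nn.ReLU(),
            nn.Conv1d(2048, 256, 3, padding=1),       # non-dilated 3x1, 256 filters
            nn.Conv1d(256, 1, 1),                     # 1x1 projection to one channel
        )

    def forward(self, mixture):            # mixture: (batch, 1, samples)
        x = self.input_proj(mixture)
        skips = 0
        for block in self.blocks:
            x, skip = block(x)
            skips = skips + skip           # sum all skip connections
        return self.post(skips)            # estimated clean target-speaker waveform
```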
Detection
After the voice of the specific speaker is extracted by the sound source separation DNN, the algorithm for detecting the speaker's utterance scenes operates as follows. The algorithm segments the original and the filtered sound waveforms into windows of a certain size, as shown in Figure 5, and calculates the amplitude difference between the corresponding segments. This calculation yields the amplitude ratio of the original and filtered waveforms:

$\mathit{diff}\,(\mathrm{dB}) = 20 \log_{10} \dfrac{A_{\mathrm{RMS(original)}}}{A_{\mathrm{RMS(filtered)}}}$

where $A_{\mathrm{RMS(original)}}$ is the root mean square of the amplitude of the original waveform segment and $A_{\mathrm{RMS(filtered)}}$ is the root mean square of that of the filtered waveform segment. The difference value (dB) indicates how much the amplitude of the original sound is attenuated after filtering by the proposed DNN: a small value means the amplitude is barely attenuated, and a large value means it is greatly attenuated.

Figure 5. Visualization of segmenting the original and the filtered sound waveforms into windows of a certain size.

Figure 6. Line graph of the difference between the amplitude of the original waveform and that of the filtered waveform (vertical axis) over time (horizontal axis). The pale red marks represent actual utterance scenes of a specific person. The graph suggests that the amplitude difference in the utterance scenes of the specific person is smaller than in the other scenes.

Leveraging the proposed DNN, the amplitude in the scenes in which the target person is speaking does not become very small (the difference is small), while the amplitude in the other scenes becomes small (the difference is large), as shown in Figure 6. The algorithm therefore judges that scenes whose difference is larger than a threshold are where the target person does not speak, and those with a smaller difference are where the target person utters. After each judgment, the window shifts to the next segments, and this operation is repeated until the window reaches the end of each waveform.
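The per-window amplitude difference defined above can be computed directly from the two waveforms. The NumPy sketch below assumes the original and filtered signals are aligned arrays at the same sampling rate; the small epsilon that guards against a log of zero is an implementation detail not stated in the paper, and the 0.1 s window matches the value used later in Experiment 2.

```python
# A minimal sketch of the per-window amplitude difference (in dB) between the
# original waveform and the DNN-filtered waveform.
import numpy as np

def amplitude_difference_db(original, filtered, sr=16000, window_s=0.1, eps=1e-10):
    """Return one diff value (dB) per window: 20*log10(RMS(original)/RMS(filtered))."""
    win = int(sr * window_s)
    n_windows = len(original) // win
    diffs = []
    for i in range(n_windows):
        seg_o = original[i * win:(i + 1) * win]
        seg_f = filtered[i * win:(i + 1) * win]
        rms_o = np.sqrt(np.mean(seg_o ** 2))
        rms_f = np.sqrt(np.mean(seg_f ** 2))
        diffs.append(20.0 * np.log10((rms_o + eps) / (rms_f + eps)))
    return np.array(diffs)  # small values -> likely utterance of the target speaker
```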
The default value of the threshold is determined from the average amplitude ratio of the original and filtered waveforms; this default is clarified by Experiment 1, described later.

Interface
After the speaking scenes of the specific speaker are identified, they are displayed on the timeline as a heat map. The red marks on the heat map represent the detected scenes, and the user can jump to a scene uttered by the specific speaker by clicking a red mark. In addition, the user can change the threshold of the detection algorithm with the bar on the right side of the interface. Figure 7 shows how the appearance of the heat map changes when the bar is operated, and Figure 8 shows how the judgment of the utterance scenes changes with the threshold. Lowering the bar lowers the threshold and decreases the number of red marks in the timeline, so only the scenes most likely to be utterances of the specific speaker are displayed. Raising the bar raises the threshold and increases the number of red marks; scenes with a lower probability of being the specific speaker's utterances may then be included in the heat map, but this prevents the user from missing any of the speaker's utterance scenes.

Figure 7. Left: The number of red marks decreases when the bar is lowered. Right: The number of red marks increases when the bar is raised.

Figure 8. Visualization of how the judgment of the specific speaker's utterance scenes changes with the threshold. The line graphs are the same as in Figure 6. When the threshold becomes lower, fewer scenes are judged to be where the target person speaks; when it becomes higher, more scenes are judged to be where the target person speaks.
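As an illustration of how the threshold bar maps onto the detection result, the sketch below turns the per-window difference values into timeline intervals for the heat map. Merging consecutive below-threshold windows into a single red mark is our assumption about the rendering; the paper only specifies that windows whose difference is below the threshold count as utterance scenes of the target speaker.

```python
# A sketch of converting per-window diff values (dB) into timeline intervals.
import numpy as np

def detect_utterance_scenes(diffs_db, threshold_db=10.0, window_s=0.1):
    """Return (start_s, end_s) intervals whose amplitude difference is below the threshold."""
    is_target = diffs_db < threshold_db
    scenes, start = [], None
    for i, flag in enumerate(is_target):
        if flag and start is None:
            start = i * window_s
        elif not flag and start is not None:
            scenes.append((start, i * window_s))
            start = None
    if start is not None:
        scenes.append((start, len(is_target) * window_s))
    return scenes

# Lowering the threshold keeps only high-confidence scenes; raising it adds
# lower-confidence ones, mirroring the behavior of the interface's slider.
diffs = np.array([2.0, 3.0, 18.0, 19.0, 4.0, 5.0, 20.0])
print(detect_utterance_scenes(diffs, threshold_db=10.0))  # two scenes: about 0.0-0.2 s and 0.4-0.6 s
```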
EXPERIMENT 1
This experiment confirms how much the amplitude of the other sounds is suppressed, and how much that of the specific person's utterances is preserved, by the sound source separation DNN that extracts only the specific person's utterances. The ideal result is that the target speaker's utterances do not become very small while the other sounds become smaller; if so, the proposed DNN can be said to extract only the utterances of the target speaker. We trained the sound source separation DNN model with the following setup and then measured, on the test dataset, by how many decibels (dB) the other sounds were suppressed.

Setup

Dataset
We created a training dataset of sound mixtures using noises from the Diverse Environments Multichannel Acoustic Noise Database (DEMAND) [47] and utterances from the TIMIT corpus [48] and the CMU ARCTIC corpus [49]. Figure 9 visualizes the creation of the training dataset. The target speakers for detection were taken from the CMU ARCTIC corpus. The subset we used features two native English speakers, a man (ID: RMS) and a woman (ID: SLT); using two target speakers is common in speech research such as voice conversion. We randomly chose 593 sentences, corresponding to 30 minutes, from each speaker as training samples.

We mixed the training samples of each target speaker with the noise sounds provided by DEMAND. The DEMAND subset we used provides recordings in 17 different environmental conditions, such as a park, a bus, or a cafe. Ten background noises were synthetically mixed with the target speech for training, while seven were reserved for testing. All training samples of each target speaker (593 sentences) were synthetically mixed with each of the ten noise types at each of the following signal-to-noise ratios (SNRs): 0, 5, 10, and 15 dB. Note that the smaller the dB value, the louder the noise relative to the speech.

We also mixed the training samples of each target speaker with different speakers from the TIMIT corpus, which features 24 English speakers covering various dialects: New England, Northern, North Midland, South Midland, Southern, New York City, Western, and Army Brat. We synthetically mixed all training samples of each target speaker with a TIMIT speaker at each of the same SNRs (0, 5, 10, and 15 dB). Additionally, we created a new corpus of two-speaker mixtures using utterances from the TIMIT corpus and mixed these with all training samples of each target speaker at each SNR. As a result, the number of training samples per target speaker was 28,464 sentences.

Figure 9. Visualization of creating the training dataset: 593 sentences × 12 types of other sounds × 4 SNRs = 28,464 sentences.
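A training mixture at a given SNR can be created by scaling the interfering sound relative to the clean target speech. The sketch below assumes both signals are equally long NumPy arrays at 16 kHz; the function name and the power-based scaling are ours, and any alignment or level normalization the authors may have applied is omitted.

```python
# A minimal sketch of creating one training mixture at a requested SNR.
import numpy as np

def mix_at_snr(target, interference, snr_db):
    """Scale `interference` (noise or another speaker) to the requested SNR and add it."""
    p_target = np.mean(target ** 2)
    p_interf = np.mean(interference ** 2)
    gain = np.sqrt(p_target / (p_interf * 10 ** (snr_db / 10.0)))
    return target + gain * interference

# Each training sentence is mixed with every interfering source at 0, 5, 10, and 15 dB;
# the clean sentence itself serves as the ideal DNN output.
```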
Learning
We trained the sound source separation DNN on the above training dataset at 16 kHz, as shown in Figure 10. The loss function was the same as Rethage's [31]. The learning conditions were as follows: a learning rate of 0.001, a batch size of 60, early stopping after 4 epochs, and an NVIDIA TITAN X Pascal GPU.

Figure 10. The DNN learns to output the clean target speech from target speech mixed with various sounds, including noises and other persons' voices.

Test
We randomly chose 100 sentences from each target speaker, not included in the training dataset, as test samples. The test samples were synthetically mixed at each of the following SNRs: -10, 0, and 10 dB, with the seven test noise types from DEMAND and with one-speaker and two-speaker mixtures from the TIMIT corpus. We also used noise-only and target-speaker-only sources in the test dataset. We input 100 files of each source type (noise only; mixtures at -10, 0, and 10 dB; and target only) into each learned DNN and calculated the average amplitude difference between the output waveform and the input waveform.

Result
Table 2 shows the results. A larger average amplitude difference means that the input was suppressed more. The results show that the smaller the amplitude of the target speech in the input source, the larger the average amplitude difference becomes; that is, the amplitude of the target speech does not become very small while that of the other sounds does. In addition, because the DNN attenuates the input waveform by about 20 dB at most and about 0 dB at least, it is appropriate to set the threshold within that interval.

Table 2. Average amplitude difference between the output and input waveforms for each input source type.
Input source type:                           Noise only | -10 dB | 0 dB | 10 dB | Target only
Average amplitude difference (dB), ID: RMS:  19.77      | 8.75   | 3.12 | 0.64  | 0.25
Average amplitude difference (dB), ID: SLT:  22.99      | 11.06  | 3.20 | 0.84  | 0.45

EXPERIMENT 2
This experiment confirms how accurately the proposed system detects the utterance scenes of a specific person. We had the system perform the task of detecting the target speech included in a 10-minute sound.

Setup
The 10-minute sound was created by concatenating DEMAND and TIMIT material not included in the training dataset. We randomly chose 100 sentences of target speech and superimposed them on the 10-minute sound; the SNR of the target speech relative to the 10-minute sound was chosen randomly from 0, 5, 10, and 15 dB. We used the sound source separation DNN trained in Experiment 1. The detection window size was 0.1 s, and the window step length was also 0.1 s. We varied the threshold in 5 dB steps (-5, 0, 5, 10, 15, and 20 dB) to confirm how the result changes.

We used the following four events for evaluation: true positive (TP), false positive (FP), false negative (FN), and true negative (TN). Table 3 shows the definition of each event. Based on these four events, we calculated the accuracy and the precision, formulated as follows:

Accuracy (%) = (TP + TN) / (TP + FP + FN + TN)
Precision (%) = TP / (TP + FP)

Table 3. Contingency table of true positive, false positive, false negative, and true negative.
                                                             True condition: actual utterance scene | True condition: not an utterance scene
System predicts "utterance scene of a specific person":      True positive                          | False positive
System predicts "not utterance scene of a specific person":  False negative                         | True negative

The system performs a prediction for each segment of the waveforms, as shown in Figure 11. When the middle of a segment falls within the actual utterance timing of the specific person, the true condition is "actual utterance scene of a specific person," as shown in Figure 12.

Figure 11. Visualization of predicting whether or not each scene includes the target speaker's utterance. The system performs a prediction for each segment of the waveforms.

Figure 12. Upper: the middle of the segment is included in the actual utterance timing of the specific person. Lower: the middle of the segment is not included in that timing. The green line represents the middle of the segment, and the pale red marks represent actual utterance scenes of the specific person. When the middle of the segment is included in the actual utterance timing, the true condition is "actual utterance scene of a specific person."
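The evaluation can be expressed compactly: each 0.1 s window is labeled positive when its midpoint falls inside an annotated utterance interval of the target speaker (as in Figure 12), compared against the thresholded prediction, and accumulated into the four events of Table 3. The interval representation and the function below are illustrative assumptions, not the authors' evaluation code.

```python
# A sketch of scoring per-window predictions against ground-truth utterance intervals.
import numpy as np

def evaluate(diffs_db, truth_intervals, threshold_db, window_s=0.1):
    """diffs_db: per-window amplitude differences; truth_intervals: list of (start_s, end_s)."""
    tp = fp = fn = tn = 0
    for i, diff in enumerate(diffs_db):
        midpoint = (i + 0.5) * window_s
        actual = any(s <= midpoint < e for s, e in truth_intervals)  # Figure 12 rule
        predicted = bool(diff < threshold_db)   # small difference -> target speaker utters
        tp += predicted and actual
        fp += predicted and not actual
        fn += (not predicted) and actual
        tn += (not predicted) and not actual
    accuracy = (tp + tn) / max(tp + fp + fn + tn, 1)
    precision = tp / max(tp + fp, 1)
    return accuracy, precision
```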
Result
Table 4 shows the results. The accuracy is 83% and the precision is 92% in the best case. For each target speaker, the accuracy is highest when the threshold is around 10 to 15 dB, and the precision is highest when the threshold is around 0 to 5 dB.

Table 4. Accuracy and precision for each target speaker.
Threshold:           -5 dB | 0 dB | 5 dB | 10 dB | 15 dB | 20 dB
Accuracy, ID: RMS:   48%   | 59%  | 73%  | 79%   | 78%   | 72%
Accuracy, ID: SLT:   58%   | 67%  | 79%  | 83%   | 81%   | 74%
Precision, ID: RMS:  83%   | 88%  | 89%  | 85%   | 78%   | 69%
Precision, ID: SLT:  88%   | 92%  | 91%  | 85%   | 81%   | 74%

FUTURE WORK

User study
In this paper, we evaluated the basic performance of the proposed system but did not conduct a user study. We need to perform a user study and verify that users can find the scenes they want to hear accurately and quickly. We will also need to refine the interface based on the user study; one alternative is to display the utterance scenes of a specific person as a graph on the video timeline. We will confirm how usability changes with the interface.

Improving accuracy
We need to explore a DNN structure specialized for extracting a specific speaker more accurately. If we find such a structure, the system could improve its accuracy on the Experiment 2 task.

CONCLUSION
We proposed a system that detects the scenes in which a specific person speaks in a video and displays them on the timeline. The system enables users to skip to the parts they want to hear by detecting the scenes of a drama, talk show, or discussion TV program in which a specific speaker is speaking.

We conducted two experiments on the proposed system. The first confirmed how much the amplitude of the other sounds is suppressed, and how much that of the specific person's utterances is preserved, by the sound source separation DNN that extracts only the specific person's utterances. The result showed that the smaller the amplitude of the target speech in the input source, the larger the average amplitude difference between the input and output waveforms became; that is, we obtained the expected result. The second experiment confirmed how accurately the system detects the utterance scenes of a specific person; the accuracy was 83% and the precision was 92% in the best case.

This system can also be applied to voice services such as podcasts, Spotify, and SoundCloud. With the advent of smart speakers such as Amazon Echo and Google Home, audio content is likely to increase, along with the importance of searching timelines based on audio content.
REFERENCES
1. Keita Higuchi, Ryo Yonetani, and Yoichi Sato. 2017. EgoScanning: Quickly Scanning First-Person Videos with Egocentric Elastic Timelines. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems (CHI '17). ACM, New York, NY, USA, 6536-6546.
2. Suporn Pongnumkul, Jue Wang, Gonzalo Ramos, and Michael Cohen. 2010. Content-aware dynamic timeline for video browsing. In Proceedings of the 23rd Annual ACM Symposium on User Interface Software and Technology (UIST '10). ACM, New York, NY, USA, 139-142.
3. Kai-Yin Cheng, Sheng-Jie Luo, Bing-Yu Chen, and Hao-Hua Chu. 2009. SmartPlayer: user-centric video fast-forwarding. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI '09). ACM, New York, NY, USA, 789-798.
4. Juho Kim, Philip J. Guo, Carrie J. Cai, Shang-Wen (Daniel) Li, Krzysztof Z. Gajos, and Robert C. Miller. 2014. Data-driven interaction techniques for improving navigation of educational videos. In Proceedings of the 27th Annual ACM Symposium on User Interface Software and Technology (UIST '14). ACM, New York, NY, USA, 563-572.
5. Pierre Dragicevic, Gonzalo Ramos, Jacobo Bibliowitcz, Derek Nowrouzezahrai, Ravin Balakrishnan, and Karan Singh. 2008. Video browsing by direct manipulation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI '08). ACM, New York, NY, USA, 237-246.
6. Cuong Nguyen, Yuzhen Niu, and Feng Liu. 2013. Direct manipulation video navigation in 3D. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI '13). ACM, New York, NY, USA, 1169-1172.
7. Thorsten Karrer, Malte Weiss, Eric Lee, and Jan Borchers. 2008. DRAGON: a direct manipulation interface for frame-accurate in-scene video navigation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI '08). ACM, New York, NY, USA, 247-250.
8. Thorsten Karrer, Moritz Wittenhagen, and Jan Borchers. 2012. DragLocks: handling temporal ambiguities in direct manipulation video navigation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI '12). ACM, New York, NY, USA, 623-626.
9. C. Saraceno and R. Leonardi. 1997. Audio as a support to scene change detection and characterization of video sequences. In 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Munich, 2597-2600, vol. 4.
10. Kazutaka Kurihara. 2012. CinemaGazer: a system for watching videos at very high speed. In Proceedings of the International Working Conference on Advanced Visual Interfaces (AVI '12). ACM, New York, NY, USA, 108-115.
11. Abir Al-Hajri, Matthew Fong, Gregor Miller, and Sidney Fels. 2014. Fast forward with your VCR: visualizing single-video viewing statistics for navigation and sharing. In Proceedings of Graphics Interface 2014 (GI '14). Canadian Information Processing Society, Toronto, Canada, 123-128.
12. Neel Joshi, Wolf Kienzle, Mike Toelle, Matt Uyttendaele, and Michael F. Cohen. 2015. Real-time hyperlapse creation via optimal frame selection. ACM Trans. Graph. 34, 4, Article 63 (July 2015), 9 pages.
13. Cuong Nguyen, Yuzhen Niu, and Feng Liu. 2012. Video Summagator: an interface for video summarization and navigation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI '12). ACM, New York, NY, USA, 647-650.
14. Suporn Pongnumkul, Jue Wang, and Michael Cohen. 2008. Creating map-based storyboards for browsing tour videos. In Proceedings of the 21st Annual ACM Symposium on User Interface Software and Technology (UIST '08). ACM, New York, NY, USA, 13-22.
15. Alex Rav-Acha, Yael Pritch, and Shmuel Peleg. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR '06).
16. Yael Pritch, Alex Rav-Acha, Avital Gutman, and Shmuel Peleg. 2007. Webcam Synopsis: Peeking Around the World. In Proc. IEEE International Conference on Computer Vision (ICCV '07).
17. Yael Pritch, Alex Rav-Acha, and Shmuel Peleg. 2008. Nonchronological Video Synopsis and Indexing. IEEE Trans. Pattern Anal. Mach. Intell. 30, 11 (November 2008), 1971-1984.
18. Justin Matejka, Tovi Grossman, and George Fitzmaurice. 2014. Video Lens: rapid playback and exploration of large video collections and associated metadata. In Proceedings of the 27th Annual ACM Symposium on User Interface Software and Technology (UIST '14). ACM, New York, NY, USA, 541-550.
19. Pascal Scalart et al. 1996. Speech enhancement based on a priori signal to noise estimation. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), volume 2, 629-632.
20. Monisankha Pal et al. 2016. Robustness of Voice Conversion Techniques Under Mismatched Conditions. arXiv preprint arXiv:1612.07523.
21. Xugang Lu, Yu Tsao, Shigeki Matsuda, and Chiori Hori. 2013. Speech enhancement based on deep denoising autoencoder. In Interspeech, 436-440.
22. Po-Sen Huang, Minje Kim, Mark Hasegawa-Johnson, and Paris Smaragdis. 2015. Joint optimization of masks and deep recurrent neural networks for monaural source separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 23(12), 2136-2147.
23. Y. Xu, J. Du, L. R. Dai, and C. H. Lee. 2015. A Regression Approach to Speech Enhancement Based on Deep Neural Networks. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 23(1), 7-19.
24. Anurag Kumar and Dinei Florencio. 2016. Speech enhancement in multiple-noise conditions using deep neural networks. arXiv preprint arXiv:1605.02427.
25. Jordi Pons, Jordi Janer, Thilo Rode, and Waldo Nogueira. 2016. Remixing music using source separation algorithms to improve the musical experience of cochlear implant users. The Journal of the Acoustical Society of America, 140(6), 4338-4349.
26. Kaizhi Qian et al. 2017. Speech enhancement using Bayesian WaveNet. In Proc. Interspeech 2017, 2013-2017.
27. Ming Tu and Xianxian Zhang. 2017. Speech enhancement based on deep neural networks with skip connections. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
28. Santiago Pascual, Antonio Bonafonte, and Joan Serrà. 2017. SEGAN: Speech Enhancement Generative Adversarial Network. arXiv preprint arXiv:1703.09452.
29. Y. Luo, Z. Chen, J. R. Hershey, J. Le Roux, and N. Mesgarani. 2017. Deep clustering and conventional networks for music separation: Stronger together. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, 61-65.
30. Szu-Wei Fu et al. 2017. Raw Waveform-based Speech Enhancement by Fully Convolutional Networks. arXiv preprint arXiv:1703.02205.
31. Dario Rethage, Jordi Pons, and Xavier Serra. 2017. A WaveNet for Speech Denoising. arXiv preprint arXiv:1706.07162.
32. Pritish Chandna, Marius Miron, Jordi Janer, and Emilia Gómez. 2017. Monoaural audio source separation using deep convolutional neural networks. In International Conference on Latent Variable Analysis and Signal Separation, Springer, 258-266.
33. Z. Q. Wang and D. Wang. 2017. Recurrent deep stacking networks for supervised speech separation. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, 71-75.
34. Jen-Tzung Chien and Kuan-Ting Kuo. 2017. Variational Recurrent Neural Networks for Speech Separation. In Interspeech, 1193-1197.
35. K. Osako, Y. Mitsufuji, R. Singh, and B. Raj. 2017. Supervised monaural source separation based on autoencoders. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, 11-15.
36. Yuan-Shan Lee et al. 2017. Fully complex deep neural network for phase-incorporating monaural source separation. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
37. Yannan Wang, Jun Du, Li-Rong Dai, and Chin-Hui Lee. 2017. A Maximum Likelihood Approach to Deep Neural Network Based Nonlinear Spectral Mapping for Single-Channel Speech Separation. In Interspeech, 1178-1182.
38. John R. Hershey et al. 2016. Deep clustering: Discriminative embeddings for segmentation and separation. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
39. Yusuf Isik et al. 2016. Single-channel multi-speaker separation using deep clustering. arXiv preprint arXiv:1607.02173.
40. Dong Yu et al. 2016. Permutation invariant training of deep models for speaker-independent multi-talker speech separation. arXiv preprint arXiv:1607.00325.
41. Y. Tian, L. He, M. Cai, W. Q. Zhang, and J. Liu. 2017. Deep neural networks based speaker modeling at different levels of phonetic granularity. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 5440-5444.
42. Y. Lei, N. Scheffer, L. Ferrer, and M. McLaren. 2014. A novel scheme for speaker recognition using a phonetically-aware deep neural network. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Florence, 1695-1699.
43. S. Ranjan and J. H. L. Hansen. 2018. Curriculum Learning Based Approaches for Noise Robust Speaker Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 26(1), 197-210.
44. Aren Jansen et al. 2017. Large-scale audio event discovery in one million YouTube videos. In Proceedings of ICASSP 2017.
45. Jort F. Gemmeke et al. 2017. Audio Set: An ontology and human-labeled dataset for audio events. In IEEE ICASSP 2017.
46. Aaron van den Oord et al. 2016. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.
47. Joachim Thiemann, Nobutaka Ito, and Emmanuel Vincent. 2013. The diverse environments multichannel acoustic noise database: A database of multichannel environmental noise recordings. The Journal of the Acoustical Society of America, 133(5), 3591-3591.
48. J. Garofolo, L. Lamel, W. Fisher, J. Fiscus, D. Pallett, N. Dahlgren, and V. Zue. 1993. TIMIT acoustic-phonetic continuous speech corpus.
49. J. Kominek and A. W. Black. 2004. The CMU Arctic speech databases. In Fifth ISCA Workshop on Speech Synthesis.