Detecting Utterance Scenes of a Specific Person

Kunihiko Sato, The University of Tokyo, Tokyo, Japan, kunihiko.k.r.r@gmail.com
Jun Rekimoto, The University of Tokyo / Sony Computer Science Laboratory, Tokyo, Japan, rekimoto@acm.org

ABSTRACT
We propose a system that detects the scenes in which a specific speaker is speaking in a video and displays those scenes as a heat map on the video's timeline. The system enables users to skip to the parts they want to hear by detecting the scenes of a drama, talk show, or discussion TV program in which a specific speaker is speaking. To detect a specific speaker's utterances, we develop a deep neural network (DNN) that extracts only the specific speaker from the original sound source. We also implement a detection algorithm based on the output of the proposed DNN and an interface for displaying the detection result. We conduct two experiments on the proposed system. The first confirms how much the amplitude of the other sounds is suppressed by the proposed DNN while the amplitude of the specific person's utterances is preserved. The second confirms how accurately the proposed system detects the utterance scenes of a specific person.

Figure 1. Proposed interface. The red marks in the timeline describe the utterance scenes of a specific person. The threshold bar changes the threshold of the scene detection algorithm.

Author Keywords
Scene detection; timeline; video; sound source separation; deep learning.

ACM Classification Keywords
H5.1. Information interfaces and presentation (e.g., HCI): Multimedia Information Systems; H5.2. Information interfaces and presentation (e.g., HCI): User Interface

© 2018. Copyright for the individual papers remains with the authors. Copying permitted for private and academic purposes. WII'18, March 11, Tokyo, Japan.

INTRODUCTION
The demand for video streaming services such as YouTube, Netflix, and Amazon Prime is increasing, as is the amount of video content on the Web. With so many videos already uploaded, supporting users in browsing videos efficiently has become increasingly important.

One method for efficient video browsing is fast-forwarding. Several researchers have developed content-aware fast-forwarding techniques that dynamically change the playback speed depending on the importance of each video frame, using key clips [1, 2], a skimming model [3], or the viewing histories of other people [4]. Direct manipulation techniques enable users to manipulate object positions in video frames to seek specific video timelines [5, 6, 7, 8]. Video streaming services such as YouTube, Netflix, and Amazon Prime show a thumbnail of the video at the position of the playhead in the timeline.

Several studies on video navigation have used audio information. Conventional methods [9] summarize and classify videos based on silence, speech, and music. CinemaGazer [10] is an audio-based technique that fast-forwards scenes without speech. It can only distinguish whether a scene includes speech, not who is speaking. In short, some studies have supported video browsing using sound classes, but audio-based methods for seeking specific video timelines remain less common than image- or metadata-based methods.

We propose a system that detects the scenes in which a specific speaker is speaking in a video and displays them as a heat map on the video's timeline, as shown in Figure 1. The system enables users to skip to the parts they want to hear by detecting the scenes of a drama, talk show, or discussion TV program in which a specific speaker is speaking.
To detect a specific speaker's utterances, we develop a deep neural network (DNN) that extracts only the specific speaker from the original sound source. Leveraging this sound source separation DNN, the system operates as follows. First, the DNN extracts the utterances of the specific person from the audio track of the target video and diminishes all other sounds. As a result of this filtering, the amplitude of the scenes in which the target person is speaking does not become very small, while that of the other scenes becomes small. The system then calculates the difference between the amplitude of the original sound waveform and that of the filtered waveform. Scenes whose difference is larger than a threshold are judged to be where the target person does not speak, and scenes with a smaller difference are judged to be where the target person utters. Based on this judgment, the scenes in which the target person speaks are displayed on the video timeline as a heat map.

We conduct two experiments on the proposed system. The first confirms how much the amplitude of the other sounds is suppressed, and how much that of the specific person's utterances is preserved, by the sound source separation DNN that extracts only the specific person's utterances. The second confirms how accurately the system detects the utterance scenes of a specific person.

Our contributions are summarized as follows.
- We propose a novel system that automatically detects the utterance scenes of a specific person, and we confirm how accurately the system can detect them.
- We develop a sound source separation DNN that extracts only a specific person's utterances, and we propose how to create a training dataset for this DNN. Many studies have successfully tackled monaural sound source separation, but these prior studies only confirmed separation between distinct classes, such as "speech and noise," or between multiple speakers. They did not clarify whether only a specific speaker can be separated when diverse sounds are mixed in the source. We confirm how much the amplitude of the other sounds is suppressed while that of the specific person's utterances is preserved by the proposed DNN.

RELATED WORK

Browsing Support for Videos
Various techniques for supporting users in browsing videos have been studied. Fast-forwarding techniques, such as those in [11, 12], help users watch videos in less time. Several researchers have developed content-aware fast-forwarding techniques that dynamically change the playback speed depending on the importance of each video frame. Higuchi et al. [1] proposed a fast-forwarding interface that helps users find important events in lengthy first-person videos continuously recorded with wearable cameras. The proposal of Pongnumkul et al. [2] makes it easy to find scene changes when sliding the video seek bar. Cheng et al. [3] proposed a video system that learns the user's favorite scenes for fast-forwarding. Kim et al.'s method [4] shows important scenes based on the viewing histories of other people. CinemaGazer [10] is an audio-based technique that fast-forwards scenes without speech.

Several techniques for indicating potential information in a video have also been studied, including spatio-temporal volumes [13], positional information [14], and video synopsis [15, 16, 17]. Meanwhile, direct manipulation techniques enable users to manipulate object positions in video frames to seek specific video timelines [5, 6, 7, 8]. Video Lens allows users to interactively explore large collections of baseball videos and related metadata [18]. On-demand video streaming services, such as YouTube, Netflix, and Amazon Prime, show a thumbnail of the video at the position of the playhead in the timeline. Unlike these previous studies, ours focuses on providing an efficient way for users to skip to the scenes in which a specific person they are searching for is speaking.

Monaural Source Separation
Monaural sound source separation studies are closely related to the proposed method. We introduce these methods here and show how they differ from ours. Wiener filtering is a classical method for separating a specific sound source from a source waveform [19]. It determines its parameters heuristically; hence, the parameters cannot be optimized for various sound sources [20].
In recent years, many studies have attempted to separate monaural sound sources using deep learning. Previous deep network approaches to separation [21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31] showed promising performance in scenarios where the sources belong to distinct signal classes, such as "speech and noise" or "vocals and accompaniment." In addition, many studies have attempted to separate multiple speakers using DNNs [22, 32, 33, 34, 35, 36, 37]; these performed well in the speaker-dependent separation of two or three speakers. Deep clustering [29, 38, 39, 40] is a deep learning framework for the speaker-independent separation of two or more speakers, with no special constraint on vocabulary or grammar.

Despite these advances, the prior studies only confirmed separation between distinct classes or between multiple speakers. The function required in our approach is to isolate only the speech of a specific person from a sound source that includes various noises and multiple speakers.

Speaker Recognition & Audio Event Detection
Speaker recognition seems effective for detecting the utterance sections of a specific speaker, and techniques using phonemes [41, 42] perform well. However, speaker recognition methods are weak against noise, and the shorter the input speech duration, the lower the recognition precision. Ranjan et al. [43] reported that the equal error rate (where the false negative rate equals the false positive rate) approaches 40% when the input duration is 3 s. Speaker recognition is therefore not suitable for detecting the utterance scenes of a specific speaker in videos, because it is vulnerable to noise and short inputs.

Jansen et al. [44] proposed a method for detecting recurring audio events in YouTube videos using a small portion of a manually annotated audio dataset [45]. However, while this method can distinguish between categories of sound, such as a human voice and a whistle, it cannot distinguish who is speaking.
IMPLEMENTATION
The proposed system detects the scenes in which a specific speaker is speaking in a video and displays them as a heat map on the video's timeline. Figure 2 shows the system's process. The system first loads the sound of the target video. Leveraging a DNN, it then extracts only the specific speaker from the original sound source and diminishes the other sounds. In the waveform filtered by the DNN, the amplitude of the scenes in which the target person is speaking does not become too small, while that of the other scenes becomes small. The system calculates the difference between the amplitude of the original waveform and that of the filtered waveform, and judges that scenes with a difference larger than a threshold are where the target person does not speak, while those with a smaller difference are where the target person utters. The scenes in which the target person speaks are then displayed on the video timeline as a heat map. The following subsections describe the implementation of the proposed sound source separation DNN, the detection algorithm, and the interface.

Figure 2. Proposed system's process. The system extracts the audio waveform from the target video. The system's DNN extracts the utterances of a specific person from the audio and diminishes the other sounds. The system calculates the difference between the amplitude of the original sound waveform and that of the filtered waveform. Scenes with a difference larger than a threshold are judged to be where the target person does not speak, and those with a smaller difference are judged to be where the target person utters. The scenes in which the target person speaks are displayed on the video timeline as a heat map based on this judgment.

Sound Source Separation between a Specific Speaker and Other Sounds
We propose a DNN that detects the utterances of a specific person and separates them from all other sounds. This DNN differs from previous sound source separation methods in the relationship between the separated sound sources, as shown in Table 1. Many previous studies tackled separation between different classes of sound source, such as "speech and noise," or between a fixed number of sources, such as two or three speakers.

Table 1. Difference between the previous sound source separation methods and the proposed method (relationship between the separated sound sources).
- Class-based separation: speech vs. noise
- Speaker separation: speaker vs. speaker
- Proposed: a specific speaker vs. all other sounds, including noise and other speakers

However, we assumed that the DNN models of previous studies could be applied to our task if we changed the training data. We therefore surveyed previous studies and found Rethage's method [31] appropriate because it uses a convolutional neural network, which allows parallel computation. Many previous methods [22, 23, 24, 25] employed recurrent neural networks (RNNs), including long short-term memory (LSTM) networks, for source separation. As shown in Figure 3, the limitation of RNNs is that parallel computation is difficult because the computation at each timestep depends on the results of the previous timestep. Many videos on the Web are several hours long; without parallel computation, the processing time grows linearly with the video length. Furthermore, as the authors of deep clustering [38] reported, the most serious problem is that LSTMs perform poorly when separating speakers who are not in the training data.

Figure 3. Diagrams showing the computational structure of typical CNN and LSTM architectures. Red signifies convolutions or matrix multiplications. The computation of an LSTM at each timestep depends on the results of the previous timestep, which is why LSTMs are difficult to parallelize.

To realize the proposed DNN, we devised a training dataset. As input data, we created sound mixtures by merging the target speaker with various environmental noises and other speakers. We set the clean speech of the target speaker as the ideal output. By training on this dataset, the proposed DNN learns to extract the speech of the target speaker and mute the other sounds.

We implemented Rethage's DNN model as described in their article; Figure 4 visualizes the implementation. The model is trained to extract a specific speaker by taking waveform data as-is for both input and output. The approach incorporates techniques from WaveNet [46], such as gated units, skip connections, and residual blocks. The model features 30 residual blocks. The dilation factor in each layer increases by powers of 2 in the range 1, 2, ..., 256, 512, and this pattern is repeated three times (three stacks). Before the first dilated convolution, the one-channel input is linearly projected to 128 channels by a standard 3 × 1 convolution to match the number of filters in each residual layer. The skip connections are 1 × 1 convolutions, also with 128 filters. A rectified linear unit (ReLU) is applied after summing all skip connections. The final two 3 × 1 convolutional layers are not dilated, contain 2048 and 256 filters respectively, and are separated by a ReLU. The output layer linearly projects the feature map to a single-channel temporal signal using a 1 × 1 filter.

Figure 4. Left: Schematic diagram of the sound source separation DNN model. The waveform data is used as-is for input and output, without frequency-domain features. Right: Implementation details of the sound source separation DNN.
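To make the architecture description above concrete, the following PyTorch sketch reproduces its main ingredients (30 residual blocks in three stacks with dilations 1 to 512, 128-channel residual and skip paths, and the 2048/256-filter output stage). It is an illustration only, not the authors' code: the padding, the exact gated-unit wiring, and the training loss follow Rethage et al. [31] only loosely, and all class and variable names are ours.

```python
# A minimal sketch of the dilated-convolution denoising network described above.
# Assumptions: gated units and 1x1 residual/skip projections as in WaveNet-style
# models; "same" padding so the output length matches the input length.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels=128, dilation=1):
        super().__init__()
        self.filter_conv = nn.Conv1d(channels, channels, 3, padding=dilation, dilation=dilation)
        self.gate_conv = nn.Conv1d(channels, channels, 3, padding=dilation, dilation=dilation)
        self.res_conv = nn.Conv1d(channels, channels, 1)   # residual 1x1 projection
        self.skip_conv = nn.Conv1d(channels, channels, 1)  # skip 1x1 projection (128 filters)

    def forward(self, x):
        z = torch.tanh(self.filter_conv(x)) * torch.sigmoid(self.gate_conv(x))  # gated unit
        return x + self.res_conv(z), self.skip_conv(z)

class SpeakerExtractionNet(nn.Module):
    def __init__(self, channels=128, stacks=3, layers_per_stack=10):
        super().__init__()
        self.input_proj = nn.Conv1d(1, channels, 3, padding=1)  # 1 channel -> 128 channels
        self.blocks = nn.ModuleList(
            ResidualBlock(channels, dilation=2 ** i)
            for _ in range(stacks) for i in range(layers_per_stack)  # 3 x 10 = 30 blocks
        )
        self.post = nn.Sequential(
            nn.ReLU(),                                # ReLU after summing the skips
            nn.Conv1d(channels, 2048, 3, padding=1),  # non-dilated 3x1, 2048 filters
            nn.ReLU(),
            nn.Conv1d(2048, 256, 3, padding=1),       # non-dilated 3x1, 256 filters
            nn.Conv1d(256, 1, 1),                     # 1x1 projection to one channel
        )

    def forward(self, mixture):            # mixture: (batch, 1, samples)
        x = self.input_proj(mixture)
        skips = 0
        for block in self.blocks:
            x, skip = block(x)
            skips = skips + skip           # sum all skip connections
        return self.post(skips)            # estimated clean target-speaker waveform
```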
Detection
After the voice of the specific speaker is extracted by the sound source separation DNN, the algorithm for detecting the speaker's utterance scenes operates as follows. The algorithm segments the original and the filtered sound waveforms into windows of a certain size, as shown in Figure 5, and calculates the amplitude difference between the corresponding segments. This calculation yields the amplitude ratio of the original and filtered waveforms:

$\mathit{diff}\,(\mathrm{dB}) = 20 \log_{10} \dfrac{A_{\mathrm{RMS(original)}}}{A_{\mathrm{RMS(filtered)}}}$

where $A_{\mathrm{RMS(original)}}$ is the root mean square of the amplitude of the original waveform segment and $A_{\mathrm{RMS(filtered)}}$ is the root mean square of that of the filtered waveform segment. The difference value (dB) indicates how much the amplitude of the original sound is attenuated after filtering by the proposed DNN: a small value means the amplitude is barely attenuated, and a large value means it is greatly attenuated.

Figure 5. Visualization of segmenting the original and the filtered sound waveforms into windows of a certain size.

Figure 6. Line graph of the difference between the amplitude of the original waveform and that of the filtered waveform (vertical axis) over time (horizontal axis). The pale red marks represent actual utterance scenes of a specific person. The graph suggests that the amplitude difference in the utterance scenes of the specific person is smaller than in the other scenes.

Leveraging the proposed DNN, the amplitude in the scenes in which the target person is speaking does not become very small (the difference is small), while the amplitude in the other scenes becomes small (the difference is large), as shown in Figure 6. The algorithm therefore judges that scenes whose difference is larger than a threshold are where the target person does not speak, and those with a smaller difference are where the target person utters. After each judgment, the window shifts to the next segments, and this operation is repeated until the window reaches the end of each waveform.
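The per-window amplitude difference defined above can be computed directly from the two waveforms. The NumPy sketch below assumes the original and filtered signals are aligned arrays at the same sampling rate; the small epsilon that guards against a log of zero is an implementation detail not stated in the paper, and the 0.1 s window matches the value used later in Experiment 2.

```python
# A minimal sketch of the per-window amplitude difference (in dB) between the
# original waveform and the DNN-filtered waveform.
import numpy as np

def amplitude_difference_db(original, filtered, sr=16000, window_s=0.1, eps=1e-10):
    """Return one diff value (dB) per window: 20*log10(RMS(original)/RMS(filtered))."""
    win = int(sr * window_s)
    n_windows = len(original) // win
    diffs = []
    for i in range(n_windows):
        seg_o = original[i * win:(i + 1) * win]
        seg_f = filtered[i * win:(i + 1) * win]
        rms_o = np.sqrt(np.mean(seg_o ** 2))
        rms_f = np.sqrt(np.mean(seg_f ** 2))
        diffs.append(20.0 * np.log10((rms_o + eps) / (rms_f + eps)))
    return np.array(diffs)  # small values -> likely utterance of the target speaker
```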
The default value of the threshold is determined from the average amplitude ratio of the original and filtered waveforms; this default is clarified by Experiment 1, described later.

Interface
After the speaking scenes of the specific speaker are identified, they are displayed on the timeline as a heat map. The red marks on the heat map represent the detected scenes, and the user can jump to a scene uttered by the specific speaker by clicking a red mark. In addition, the user can change the threshold of the detection algorithm with the bar on the right side of the interface. Figure 7 shows how the appearance of the heat map changes when the bar is operated, and Figure 8 shows how the judgment of the utterance scenes changes with the threshold. Lowering the bar lowers the threshold and decreases the number of red marks in the timeline, so only the scenes most likely to be utterances of the specific speaker are displayed. Raising the bar raises the threshold and increases the number of red marks; scenes with a lower probability of being the specific speaker's utterances may then be included in the heat map, but this prevents the user from missing any of the speaker's utterance scenes.

Figure 7. Left: The number of red marks decreases when the bar is lowered. Right: The number of red marks increases when the bar is raised.

Figure 8. Visualization of how the judgment of the specific speaker's utterance scenes changes with the threshold. The line graphs are the same as in Figure 6. When the threshold becomes lower, fewer scenes are judged to be where the target person speaks; when it becomes higher, more scenes are judged to be where the target person speaks.
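As an illustration of how the threshold bar maps onto the detection result, the sketch below turns the per-window difference values into timeline intervals for the heat map. Merging consecutive below-threshold windows into a single red mark is our assumption about the rendering; the paper only specifies that windows whose difference is below the threshold count as utterance scenes of the target speaker.

```python
# A sketch of converting per-window diff values (dB) into timeline intervals.
import numpy as np

def detect_utterance_scenes(diffs_db, threshold_db=10.0, window_s=0.1):
    """Return (start_s, end_s) intervals whose amplitude difference is below the threshold."""
    is_target = diffs_db < threshold_db
    scenes, start = [], None
    for i, flag in enumerate(is_target):
        if flag and start is None:
            start = i * window_s
        elif not flag and start is not None:
            scenes.append((start, i * window_s))
            start = None
    if start is not None:
        scenes.append((start, len(is_target) * window_s))
    return scenes

# Lowering the threshold keeps only high-confidence scenes; raising it adds
# lower-confidence ones, mirroring the behavior of the interface's slider.
diffs = np.array([2.0, 3.0, 18.0, 19.0, 4.0, 5.0, 20.0])
print(detect_utterance_scenes(diffs, threshold_db=10.0))  # two scenes: about 0.0-0.2 s and 0.4-0.6 s
```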
EXPERIMENT 1
This experiment confirms how much the amplitude of the other sounds is suppressed, and how much that of the specific person's utterances is preserved, by the sound source separation DNN that extracts only the specific person's utterances. The ideal result is that the target speaker's utterances do not become very small while the other sounds become smaller; if so, the proposed DNN can be said to extract only the utterances of the target speaker. We trained the sound source separation DNN model with the following setup and then measured, on the test dataset, by how many decibels (dB) the other sounds were suppressed.

Setup

Dataset
We created a training dataset of sound mixtures using noises from the Diverse Environments Multichannel Acoustic Noise Database (DEMAND) [47] and utterances from the TIMIT corpus [48] and the CMU ARCTIC corpus [49]. Figure 9 visualizes the creation of the training dataset. The target speakers for detection were taken from the CMU ARCTIC corpus. The subset we used features two native English speakers, a man (ID: RMS) and a woman (ID: SLT); using two target speakers is common in speech research such as voice conversion. We randomly chose 593 sentences, corresponding to 30 minutes, from each speaker as training samples.

We mixed the training samples of each target speaker with the noise sounds provided by DEMAND. The DEMAND subset we used provides recordings in 17 different environmental conditions, such as a park, a bus, or a cafe. Ten background noises were synthetically mixed with the target speech for training, while seven were reserved for testing. All training samples of each target speaker (593 sentences) were synthetically mixed with each of the ten noise types at each of the following signal-to-noise ratios (SNRs): 0, 5, 10, and 15 dB. Note that the smaller the dB value, the louder the noise relative to the speech.

We also mixed the training samples of each target speaker with different speakers from the TIMIT corpus, which features 24 English speakers covering various dialects: New England, Northern, North Midland, South Midland, Southern, New York City, Western, and Army Brat. We synthetically mixed all training samples of each target speaker with a TIMIT speaker at each of the same SNRs (0, 5, 10, and 15 dB). Additionally, we created a new corpus of two-speaker mixtures using utterances from the TIMIT corpus and mixed these with all training samples of each target speaker at each SNR. As a result, the number of training samples per target speaker was 28,464 sentences.

Figure 9. Visualization of creating the training dataset: 593 sentences × 12 types of other sounds × 4 SNRs = 28,464 sentences.
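A training mixture at a given SNR can be created by scaling the interfering sound relative to the clean target speech. The sketch below assumes both signals are equally long NumPy arrays at 16 kHz; the function name and the power-based scaling are ours, and any alignment or level normalization the authors may have applied is omitted.

```python
# A minimal sketch of creating one training mixture at a requested SNR.
import numpy as np

def mix_at_snr(target, interference, snr_db):
    """Scale `interference` (noise or another speaker) to the requested SNR and add it."""
    p_target = np.mean(target ** 2)
    p_interf = np.mean(interference ** 2)
    gain = np.sqrt(p_target / (p_interf * 10 ** (snr_db / 10.0)))
    return target + gain * interference

# Each training sentence is mixed with every interfering source at 0, 5, 10, and 15 dB;
# the clean sentence itself serves as the ideal DNN output.
```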
Learning
We trained the sound source separation DNN on the above training dataset at 16 kHz, as shown in Figure 10. The loss function was the same as Rethage's [31]. The learning conditions were as follows: a learning rate of 0.001, a batch size of 60, early stopping after 4 epochs, and an NVIDIA TITAN X Pascal GPU.

Figure 10. The DNN learns to output the clean target speech from target speech mixed with various sounds, including noises and other persons' voices.

Test
We randomly chose 100 sentences from each target speaker, not included in the training dataset, as test samples. The test samples were synthetically mixed at each of the following SNRs: -10, 0, and 10 dB, with the seven test noise types from DEMAND and with one-speaker and two-speaker mixtures from the TIMIT corpus. We also used noise-only and target-speaker-only sources in the test dataset. We input 100 files of each source type (noise only; mixtures at -10, 0, and 10 dB; and target only) into each learned DNN and calculated the average amplitude difference between the output waveform and the input waveform.

Result
Table 2 shows the results. A larger average amplitude difference means that the input was suppressed more. The results show that the smaller the amplitude of the target speech in the input source, the larger the average amplitude difference becomes; that is, the amplitude of the target speech does not become very small while that of the other sounds does. In addition, because the DNN attenuates the input waveform by about 20 dB at most and about 0 dB at least, it is appropriate to set the threshold within that interval.

Table 2. Average amplitude difference between the output and input waveforms for each input source type.
Input source type:                           Noise only | -10 dB | 0 dB | 10 dB | Target only
Average amplitude difference (dB), ID: RMS:  19.77      | 8.75   | 3.12 | 0.64  | 0.25
Average amplitude difference (dB), ID: SLT:  22.99      | 11.06  | 3.20 | 0.84  | 0.45

EXPERIMENT 2
This experiment confirms how accurately the proposed system detects the utterance scenes of a specific person. We had the system perform the task of detecting the target speech included in a 10-minute sound.

Setup
The 10-minute sound was created by concatenating DEMAND and TIMIT material not included in the training dataset. We randomly chose 100 sentences of target speech and superimposed them on the 10-minute sound; the SNR of the target speech relative to the 10-minute sound was chosen randomly from 0, 5, 10, and 15 dB. We used the sound source separation DNN trained in Experiment 1. The detection window size was 0.1 s, and the window step length was also 0.1 s. We varied the threshold in 5 dB steps (-5, 0, 5, 10, 15, and 20 dB) to confirm how the result changes.

We used the following four events for evaluation: true positive (TP), false positive (FP), false negative (FN), and true negative (TN). Table 3 shows the definition of each event. Based on these four events, we calculated the accuracy and the precision, formulated as follows:

Accuracy (%) = (TP + TN) / (TP + FP + FN + TN)
Precision (%) = TP / (TP + FP)

Table 3. Contingency table of true positive, false positive, false negative, and true negative.
                                                             True condition: actual utterance scene | True condition: not an utterance scene
System predicts "utterance scene of a specific person":      True positive                          | False positive
System predicts "not utterance scene of a specific person":  False negative                         | True negative

The system performs a prediction for each segment of the waveforms, as shown in Figure 11. When the middle of a segment falls within the actual utterance timing of the specific person, the true condition is "actual utterance scene of a specific person," as shown in Figure 12.

Figure 11. Visualization of predicting whether or not each scene includes the target speaker's utterance. The system performs a prediction for each segment of the waveforms.

Figure 12. Upper: the middle of the segment is included in the actual utterance timing of the specific person. Lower: the middle of the segment is not included in that timing. The green line represents the middle of the segment, and the pale red marks represent actual utterance scenes of the specific person. When the middle of the segment is included in the actual utterance timing, the true condition is "actual utterance scene of a specific person."
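The evaluation can be expressed compactly: each 0.1 s window is labeled positive when its midpoint falls inside an annotated utterance interval of the target speaker (as in Figure 12), compared against the thresholded prediction, and accumulated into the four events of Table 3. The interval representation and the function below are illustrative assumptions, not the authors' evaluation code.

```python
# A sketch of scoring per-window predictions against ground-truth utterance intervals.
import numpy as np

def evaluate(diffs_db, truth_intervals, threshold_db, window_s=0.1):
    """diffs_db: per-window amplitude differences; truth_intervals: list of (start_s, end_s)."""
    tp = fp = fn = tn = 0
    for i, diff in enumerate(diffs_db):
        midpoint = (i + 0.5) * window_s
        actual = any(s <= midpoint < e for s, e in truth_intervals)  # Figure 12 rule
        predicted = bool(diff < threshold_db)   # small difference -> target speaker utters
        tp += predicted and actual
        fp += predicted and not actual
        fn += (not predicted) and actual
        tn += (not predicted) and not actual
    accuracy = (tp + tn) / max(tp + fp + fn + tn, 1)
    precision = tp / max(tp + fp, 1)
    return accuracy, precision
```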
Result
Table 4 shows the results. The accuracy is 83% and the precision is 92% in the best case. For each target speaker, the accuracy is highest when the threshold is around 10 to 15 dB, and the precision is highest when the threshold is around 0 to 5 dB.

Table 4. Accuracy and precision for each target speaker.
Threshold:           -5 dB | 0 dB | 5 dB | 10 dB | 15 dB | 20 dB
Accuracy, ID: RMS:   48%   | 59%  | 73%  | 79%   | 78%   | 72%
Accuracy, ID: SLT:   58%   | 67%  | 79%  | 83%   | 81%   | 74%
Precision, ID: RMS:  83%   | 88%  | 89%  | 85%   | 78%   | 69%
Precision, ID: SLT:  88%   | 92%  | 91%  | 85%   | 81%   | 74%

FUTURE WORK

User study
In this paper, we evaluated the basic performance of the proposed system but did not conduct a user study. We need to perform a user study and verify that users can find the scenes they want to hear accurately and quickly. We will also need to refine the interface based on the user study; one alternative is to display the utterance scenes of a specific person as a graph on the video timeline. We will confirm how usability changes with the interface.

Improving accuracy
We need to explore a DNN structure specialized for extracting a specific speaker more accurately. If we find such a structure, the system could improve its accuracy on the Experiment 2 task.

CONCLUSION
We proposed a system that detects the scenes in which a specific person speaks in a video and displays them on the timeline. The system enables users to skip to the parts they want to hear by detecting the scenes of a drama, talk show, or discussion TV program in which a specific speaker is speaking.

We conducted two experiments on the proposed system. The first confirmed how much the amplitude of the other sounds is suppressed, and how much that of the specific person's utterances is preserved, by the sound source separation DNN that extracts only the specific person's utterances. The result showed that the smaller the amplitude of the target speech in the input source, the larger the average amplitude difference between the input and output waveforms became; that is, we obtained the expected result. The second experiment confirmed how accurately the system detects the utterance scenes of a specific person; the accuracy was 83% and the precision was 92% in the best case.

This system can also be applied to voice services such as podcasts, Spotify, and SoundCloud. With the advent of smart speakers such as Amazon Echo and Google Home, audio content is likely to increase, along with the importance of searching timelines based on audio content.
REFERENCES
1. Keita Higuchi, Ryo Yonetani, and Yoichi Sato. 2017. EgoScanning: Quickly Scanning First-Person Videos with Egocentric Elastic Timelines. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems (CHI '17). ACM, New York, NY, USA, 6536-6546.
2. Suporn Pongnumkul, Jue Wang, Gonzalo Ramos, and Michael Cohen. 2010. Content-aware dynamic timeline for video browsing. In Proceedings of the 23rd Annual ACM Symposium on User Interface Software and Technology (UIST '10). ACM, New York, NY, USA, 139-142.
3. Kai-Yin Cheng, Sheng-Jie Luo, Bing-Yu Chen, and Hao-Hua Chu. 2009. SmartPlayer: user-centric video fast-forwarding. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI '09). ACM, New York, NY, USA, 789-798.
4. Juho Kim, Philip J. Guo, Carrie J. Cai, Shang-Wen (Daniel) Li, Krzysztof Z. Gajos, and Robert C. Miller. 2014. Data-driven interaction techniques for improving navigation of educational videos. In Proceedings of the 27th Annual ACM Symposium on User Interface Software and Technology (UIST '14). ACM, New York, NY, USA, 563-572.
5. Pierre Dragicevic, Gonzalo Ramos, Jacobo Bibliowitcz, Derek Nowrouzezahrai, Ravin Balakrishnan, and Karan Singh. 2008. Video browsing by direct manipulation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI '08). ACM, New York, NY, USA, 237-246.
6. Cuong Nguyen, Yuzhen Niu, and Feng Liu. 2013. Direct manipulation video navigation in 3D. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI '13). ACM, New York, NY, USA, 1169-1172.
7. Thorsten Karrer, Malte Weiss, Eric Lee, and Jan Borchers. 2008. DRAGON: a direct manipulation interface for frame-accurate in-scene video navigation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI '08). ACM, New York, NY, USA, 247-250.
8. Thorsten Karrer, Moritz Wittenhagen, and Jan Borchers. 2012. DragLocks: handling temporal ambiguities in direct manipulation video navigation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI '12). ACM, New York, NY, USA, 623-626.
9. C. Saraceno and R. Leonardi. 1997. Audio as a support to scene change detection and characterization of video sequences. In 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Munich, 2597-2600, vol. 4.
10. Kazutaka Kurihara. 2012. CinemaGazer: a system for watching videos at very high speed. In Proceedings of the International Working Conference on Advanced Visual Interfaces (AVI '12). ACM, New York, NY, USA, 108-115.
11. Abir Al-Hajri, Matthew Fong, Gregor Miller, and Sidney Fels. 2014. Fast forward with your VCR: visualizing single-video viewing statistics for navigation and sharing. In Proceedings of Graphics Interface 2014 (GI '14). Canadian Information Processing Society, Toronto, Canada, 123-128.
12. Neel Joshi, Wolf Kienzle, Mike Toelle, Matt Uyttendaele, and Michael F. Cohen. 2015. Real-time hyperlapse creation via optimal frame selection. ACM Trans. Graph. 34, 4, Article 63 (July 2015), 9 pages.
13. Cuong Nguyen, Yuzhen Niu, and Feng Liu. 2012. Video Summagator: an interface for video summarization and navigation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI '12). ACM, New York, NY, USA, 647-650.
14. Suporn Pongnumkul, Jue Wang, and Michael Cohen. 2008. Creating map-based storyboards for browsing tour videos. In Proceedings of the 21st Annual ACM Symposium on User Interface Software and Technology (UIST '08). ACM, New York, NY, USA, 13-22.
15. Alex Rav-Acha, Yael Pritch, and Shmuel Peleg. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR '06).
16. Yael Pritch, Alex Rav-Acha, Avital Gutman, and Shmuel Peleg. 2007. Webcam Synopsis: Peeking Around the World. In Proc. IEEE International Conference on Computer Vision (ICCV '07).
17. Yael Pritch, Alex Rav-Acha, and Shmuel Peleg. 2008. Nonchronological Video Synopsis and Indexing. IEEE Trans. Pattern Anal. Mach. Intell. 30, 11 (November 2008), 1971-1984.
18. Justin Matejka, Tovi Grossman, and George Fitzmaurice. 2014. Video Lens: rapid playback and exploration of large video collections and associated metadata. In Proceedings of the 27th Annual ACM Symposium on User Interface Software and Technology (UIST '14). ACM, New York, NY, USA, 541-550.
19. Pascal Scalart et al. 1996. Speech enhancement based on a priori signal to noise estimation. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), volume 2, 629-632.
20. Monisankha Pal et al. 2016. Robustness of Voice Conversion Techniques Under Mismatched Conditions. arXiv preprint arXiv:1612.07523.
21. Xugang Lu, Yu Tsao, Shigeki Matsuda, and Chiori Hori. 2013. Speech enhancement based on deep denoising autoencoder. In Interspeech, 436-440.
22. Po-Sen Huang, Minje Kim, Mark Hasegawa-Johnson, and Paris Smaragdis. 2015. Joint optimization of masks and deep recurrent neural networks for monaural source separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 23(12), 2136-2147.
23. Y. Xu, J. Du, L. R. Dai, and C. H. Lee. 2015. A Regression Approach to Speech Enhancement Based on Deep Neural Networks. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 23(1), 7-19.
24. Anurag Kumar and Dinei Florencio. 2016. Speech enhancement in multiple-noise conditions using deep neural networks. arXiv preprint arXiv:1605.02427.
25. Jordi Pons, Jordi Janer, Thilo Rode, and Waldo Nogueira. 2016. Remixing music using source separation algorithms to improve the musical experience of cochlear implant users. The Journal of the Acoustical Society of America, 140(6), 4338-4349.
26. Kaizhi Qian et al. 2017. Speech enhancement using Bayesian WaveNet. In Proc. Interspeech 2017, 2013-2017.
27. Ming Tu and Xianxian Zhang. 2017. Speech enhancement based on deep neural networks with skip connections. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
28. Santiago Pascual, Antonio Bonafonte, and Joan Serrà. 2017. SEGAN: Speech Enhancement Generative Adversarial Network. arXiv preprint arXiv:1703.09452.
29. Y. Luo, Z. Chen, J. R. Hershey, J. Le Roux, and N. Mesgarani. 2017. Deep clustering and conventional networks for music separation: Stronger together. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, 61-65.
30. Szu-Wei Fu et al. 2017. Raw Waveform-based Speech Enhancement by Fully Convolutional Networks. arXiv preprint arXiv:1703.02205.
31. Dario Rethage, Jordi Pons, and Xavier Serra. 2017. A WaveNet for Speech Denoising. arXiv preprint arXiv:1706.07162.
32. Pritish Chandna, Marius Miron, Jordi Janer, and Emilia Gómez. 2017. Monoaural audio source separation using deep convolutional neural networks. In International Conference on Latent Variable Analysis and Signal Separation, Springer, 258-266.
33. Z. Q. Wang and D. Wang. 2017. Recurrent deep stacking networks for supervised speech separation. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, 71-75.
34. Jen-Tzung Chien and Kuan-Ting Kuo. 2017. Variational Recurrent Neural Networks for Speech Separation. In Interspeech, 1193-1197.
35. K. Osako, Y. Mitsufuji, R. Singh, and B. Raj. 2017. Supervised monaural source separation based on autoencoders. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, 11-15.
36. Yuan-Shan Lee et al. 2017. Fully complex deep neural network for phase-incorporating monaural source separation. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
37. Yannan Wang, Jun Du, Li-Rong Dai, and Chin-Hui Lee. 2017. A Maximum Likelihood Approach to Deep Neural Network Based Nonlinear Spectral Mapping for Single-Channel Speech Separation. In Interspeech, 1178-1182.
38. John R. Hershey et al. 2016. Deep clustering: Discriminative embeddings for segmentation and separation. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
39. Yusuf Isik et al. 2016. Single-channel multi-speaker separation using deep clustering. arXiv preprint arXiv:1607.02173.
40. Dong Yu et al. 2016. Permutation invariant training of deep models for speaker-independent multi-talker speech separation. arXiv preprint arXiv:1607.00325.
41. Y. Tian, L. He, M. Cai, W. Q. Zhang, and J. Liu. 2017. Deep neural networks based speaker modeling at different levels of phonetic granularity. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 5440-5444.
42. Y. Lei, N. Scheffer, L. Ferrer, and M. McLaren. 2014. A novel scheme for speaker recognition using a phonetically-aware deep neural network. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Florence, 1695-1699.
43. S. Ranjan and J. H. L. Hansen. 2018. Curriculum Learning Based Approaches for Noise Robust Speaker Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 26(1), 197-210.
44. Aren Jansen et al. 2017. Large-scale audio event discovery in one million YouTube videos. In Proceedings of ICASSP 2017.
45. Jort F. Gemmeke et al. 2017. Audio Set: An ontology and human-labeled dataset for audio events. In IEEE ICASSP 2017.
46. Aaron van den Oord et al. 2016. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.
47. Joachim Thiemann, Nobutaka Ito, and Emmanuel Vincent. 2013. The diverse environments multichannel acoustic noise database: A database of multichannel environmental noise recordings. The Journal of the Acoustical Society of America, 133(5), 3591-3591.
48. J. Garofolo, L. Lamel, W. Fisher, J. Fiscus, D. Pallett, N. Dahlgren, and V. Zue. 1993. TIMIT acoustic-phonetic continuous speech corpus.
49. J. Kominek and A. W. Black. 2004. The CMU Arctic speech databases. In Fifth ISCA Workshop on Speech Synthesis.