Eyes and Ears Together: New Task for Multimodal Spoken Content Analysis

Yasufumi Moriya 1, Ramon Sanabria 2, Florian Metze 2, Gareth J. F. Jones 1
1 Dublin City University, Dublin, Ireland
2 Carnegie Mellon University, Pittsburgh, PA, USA
{yasufumi.moriya,gareth.jones}@adaptcentre.ie, {ramons,fmetze}@cs.cmu.edu

ABSTRACT
Human speech processing is often a multimodal process combining audio and visual processing. Eyes and Ears Together proposes two benchmark multimodal speech processing tasks: (1) multimodal automatic speech recognition (ASR) and (2) multimodal co-reference resolution on spoken multimedia. These tasks are motivated by our desire to address the difficulties of ASR for multimedia spoken content. We review prior work on the integration of multimodal signals into speech processing for multimedia data, introduce a multimedia dataset for our proposed tasks, and outline these tasks.

1 INTRODUCTION
Human use of natural language for communication is grounded in real-world entities, concepts, and activities. The importance of the real world in language interpretation is illustrated in [18], where participants were presented with a picture of an apple on a towel, a towel without an apple, and a box. When they heard the sentence "put the apple on the towel", their gaze moved to the towel without an apple before they heard the complete sentence "put the apple on the towel in the box". Another example of visual grounding in language understanding is the McGurk effect [9]. When participants were exposed to the voiced alveolar stop ("da") sound together with a video whose lip movements indicate the voiced bilabial stop ("ba") sound, they perceived the bilabial sound rather than the alveolar sound. Furthermore, it is reported that the presence of a speaker's face facilitates speech comprehension [19]. These experiments all demonstrate that human language processing is affected by the context provided by visual signals.

Despite these findings, research on automatic speech recognition (ASR) has generally focused only on audio signals, even in settings where visual and contextual information is available (e.g., multimedia data where the audio is accompanied by a video stream and metadata).
However, high word error rates (WERs) of 30-40% are often reported for ASR of multimedia data from contemporary sources such as YouTube videos and TV shows [1, 8]. More recent work on ASR for YouTube videos illustrates that much lower WERs are possible [17], but the use of 100k hours of data for system development is not feasible in many situations.

This paper presents two proposed multimodal speech processing benchmark tasks motivated by the multimodal nature of human language processing and the practical difficulty of automated spoken data processing. The remainder of the paper is organised as follows: Section 2 reviews previous work on multimodal processing of spoken multimedia data. Section 3 presents an audio-visual dataset suitable for our proposed tasks. Section 4 introduces our potential tasks for spoken multimedia understanding. Section 5 provides concluding remarks.

2 PRIOR WORK
The integration of visual information into ASR systems is a long-standing topic of investigation in the field of audio-visual speech recognition (AVSR). Motivated by the McGurk effect [9], AVSR aims to build noise-robust ASR systems by incorporating lip movement into the recognition of phonemes [15]. The most recent approach to AVSR employs a multimodal deep neural network (DNN) to fuse visual lip movement with audio features [13].

Although AVSR is known to be effective in noisy audio conditions, its application is limited to situations where the speaker's frontal face is visible so that lip movement features can be extracted. Figure 1 shows a comparison of AVSR (Grid corpus) [2] with multimodal ASR (CMU "How-to" corpus) [3, 16] (the corpus was also used in the 2018 Jelinek workshop: https://www.clsp.jhu.edu/workshops/18-workshop/). As shown in the Grid corpus example, the constraints on an AVSR dataset are: (1) the presence of the speaker's mouth region, and (2) precise synchronisation of the visual signal with speech. Multimodal ASR, in contrast, can exploit any available contextual information to improve ASR accuracy.

Figure 1: Comparison of the Grid corpus for AVSR (left) with the CMU "How-to" corpus for multimodal ASR (right).

Recent work has begun to explore the use of more general multimodal information in ASR. Figure 2 shows a basic framework for the integration of contextual information into ASR using a DNN acoustic model [4] and a recurrent neural network (RNN) language model with long short-term memory (LSTM) [5, 10]. In this framework, a convolutional neural network (CNN) extracts a fixed-length image feature vector from a video frame within the time region of each utterance. The image feature vector is concatenated with the audio feature vector for the DNN acoustic model. Alternatively, the image feature vector can be taken as the input to the first step of the RNN language model, before the model reads the embedded word tokens of the utterance. It should be noted that the contextual feature vector does not need to be extracted from a video frame, but can be taken from any feature that represents the environment in which the utterance is spoken. Typically, the DNN acoustic model is used for ASR with a weighted finite-state transducer [11], and the RNN language model re-scores the n-best hypotheses generated in the ASR decoding step.

Figure 2: Framework for integration of visual features into ASR. A convolutional neural network (CNN) extracts a fixed-length vector from a video frame, which is either appended to the audio feature vector or fed to the neural language model before it reads the embedded word tokens.
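To make the two fusion points in this framework concrete, the sketch below shows them in minimal form: an utterance-level visual vector concatenated with the per-frame audio features of a DNN acoustic model, and the same vector fed as the initial step of an LSTM language model. This is a sketch under assumed settings only; the layer sizes, feature dimensions, and module names (FusionAcousticModel, VisualLSTMLanguageModel) are illustrative and do not reproduce the exact configurations of [3, 4, 12].

```python
# Minimal sketch (PyTorch) of the two fusion points described above.
# Dimensions and module names are illustrative assumptions, not the
# configuration of any cited system.
import torch
import torch.nn as nn


class FusionAcousticModel(nn.Module):
    """DNN acoustic model fed with audio features concatenated with a visual vector."""

    def __init__(self, audio_dim=40, visual_dim=512, hidden_dim=1024, num_senones=4000):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(audio_dim + visual_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, num_senones),
        )

    def forward(self, audio_frames, visual_vector):
        # audio_frames: (T, audio_dim); visual_vector: (visual_dim,)
        # The utterance-level visual vector is repeated for every audio frame.
        visual = visual_vector.unsqueeze(0).expand(audio_frames.size(0), -1)
        return self.net(torch.cat([audio_frames, visual], dim=-1))


class VisualLSTMLanguageModel(nn.Module):
    """LSTM language model that reads the visual vector before the word tokens."""

    def __init__(self, vocab_size=10000, embed_dim=256, visual_dim=512, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.visual_proj = nn.Linear(visual_dim, embed_dim)  # map image vector into embedding space
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, word_ids, visual_vector):
        # word_ids: (B, T); visual_vector: (B, visual_dim)
        visual_step = self.visual_proj(visual_vector).unsqueeze(1)  # (B, 1, embed_dim)
        tokens = self.embed(word_ids)                               # (B, T, embed_dim)
        inputs = torch.cat([visual_step, tokens], dim=1)            # visual vector read first
        hidden, _ = self.lstm(inputs)
        # Logits at each step; step i predicts the token following the i-th input,
        # so the visual step already conditions the prediction of the first word.
        return self.out(hidden)


# Toy usage with random tensors standing in for real features.
am = FusionAcousticModel()
lm = VisualLSTMLanguageModel()
senone_logits = am(torch.randn(200, 40), torch.randn(512))                # one utterance, 200 audio frames
word_logits = lm(torch.randint(0, 10000, (8, 12)), torch.randn(8, 512))   # batch of 8 utterances
```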
A number of interesting findings have been reported in existing work on multimodal ASR. Gupta et al. extracted object features and scene features from a video frame chosen randomly from within the time range of each utterance [3]. They used these features to adapt the DNN acoustic model and the RNN language model. Scene features were particularly effective in improving recognition of utterances spoken outdoors. It is likely that letting the acoustic model know that the audio input may contain background noise implicitly transforms the audio features into a cleaner representation. Moriya and Jones investigated whether video titles can provide the RNN language model with the background context of each video [12]. They represented each video title as the average of the embedded words in the title, and found that the adapted model predicted "keywords" of a video better than the non-adapted model (e.g., "fish" in a fishing video).

Huang et al. conducted a new line of work that connects speech transcription with vision. In [7], they propose a method to align entities in a video with the actions that produce them. Their goal was to jointly resolve linguistic ambiguities (e.g., "oil mixed with salt" can be referred to as "the mixture") and visual ambiguities (e.g., "yogurt" can look similar to "dressing"). This approach was further extended to a multimodal co-reference resolution system which links entities in a video with the objects in a transcription, and even with referring expressions (e.g., "it") [6]. Their system was evaluated on the YouCook2 dataset, a collection of unstructured cooking videos [20]. Although the spoken transcriptions are accompanied by video data, the transcriptions are simplified to imperative sentences and do not represent the utterances actually spoken in the videos. We believe that this may form an interesting new task: analysing the environments (video) in which utterances (speech) are spoken.
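At its core, the title-based adaptation of [12] reduces to producing one context vector per video by averaging the embeddings of the title words; that vector can then condition the language model in the same way as the visual vector sketched above. The snippet below is a minimal sketch of that averaging step, assuming a generic pretrained word-embedding lookup; the embedding table, its dimensionality, and the simple whitespace tokenisation are placeholder assumptions, not the setup of [12].

```python
# Minimal sketch: represent a video title as the average of its word embeddings,
# as described for the language-model adaptation in [12]. The embedding table,
# its dimensionality, and the tokenisation are placeholder assumptions.
import numpy as np


def title_vector(title: str, embeddings: dict, dim: int = 300) -> np.ndarray:
    """Average the embeddings of the in-vocabulary title words."""
    words = title.lower().split()
    vectors = [embeddings[w] for w in words if w in embeddings]
    if not vectors:
        return np.zeros(dim)  # back off to a zero vector for fully out-of-vocabulary titles
    return np.mean(vectors, axis=0)


# Toy usage with random embeddings standing in for a pretrained table.
rng = np.random.default_rng(0)
embeddings = {w: rng.standard_normal(300) for w in ["how", "to", "fillet", "a", "fish"]}
context = title_vector("How to Fillet a Fish", embeddings)
print(context.shape)  # (300,)
```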
3 DATASET
This section outlines the CMU "How-to" corpus [16]. The corpus contains instruction videos from YouTube, speech transcriptions, and various types of metadata (e.g., video titles, video descriptions, the number of likes). An example image from the corpus is shown in Figure 1. The audio conditions of the videos vary; for example, some of the videos are recorded outdoors with background noise present. The corpus was used for the experiments in [3] and [12]. Two different setups of the corpus are provided: 480 hours of audio and 90 hours of audio. In both setups, the development and test partitions remain the same. In [12], symbols and numbers in the transcription were removed or expanded to words. In addition, regions of the transcription likely to mismatch the audio were rejected. For this reason, the experimental results in [3] and [12] are not directly comparable. We propose the creation of a standardised version of the corpus for the development of a common multimodal ASR task.

4 TASK DESCRIPTION
We propose two tasks for investigation with spoken multimedia content: multimodal ASR and multimodal co-reference resolution.

4.1 Multimodal ASR
Multimodal ASR is a conventional ASR task that focuses on the use of multimodal signals in ASR, with effectiveness measured using the standard WER. The two main goals of this task are: (1) identifying visual or contextual features that contribute to the improvement of ASR systems; (2) exploring ASR system architectures better suited to exploiting visual or contextual features. The former encourages participants to explore alternative features available in the videos and metadata, e.g., temporal features. The latter aims to explore unconventional architectures for ASR systems, e.g., the use of an end-to-end neural architecture [14].
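For concreteness, the sketch below shows the standard WER computation that would score this task: the word-level edit distance (substitutions, insertions, and deletions) between reference and hypothesis, divided by the number of reference words. It is a minimal reference implementation for illustration, not the official scoring tool.

```python
# Minimal sketch of the standard word error rate (WER): the Levenshtein
# distance over words divided by the number of reference words.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference words and the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution (or match)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)


print(wer("put the apple on the towel", "put that apple on towel"))  # 2 errors / 6 words ≈ 0.33
```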
4.2 Multimodal Co-reference Resolution
Multimodal co-reference resolution aims to bridge the gap between the speech modality and the visual modality. We plan to provide participants with ASR transcriptions containing pronouns, together with the referred objects appearing in the video, with the task of resolving the pronouns. Effectiveness may be measured using F1 scores, as in [6]. Such resolution may also find utility in reducing WER in second-pass ASR decoding.

5 CONCLUSION
This paper presents potential tasks for multimodal spoken content analysis. The motivation for the use of multimodal grounding in ASR arises from the multimodal nature of human language understanding and from the poor performance of ASR systems when applied to multimedia data. Section 2 highlights existing work on the integration of multimodal signals into ASR. Section 3 introduces a multimodal dataset suitable for use in the proposed tasks, and Section 4 outlines the details of the two proposed tasks.

REFERENCES
[1] P. Bell, M. J. F. Gales, T. Hain, J. Kilgour, P. Lanchantin, X. Liu, A. McParland, S. Renals, O. Saz, M. Wester, and P. C. Woodland. 2015. The MGB challenge: Evaluating multi-genre broadcast media recognition. In Proceedings of IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU). 687–693.
[2] M. Cooke, J. Barker, S. Cunningham, and X. Shao. 2006. An audio-visual corpus for speech perception and automatic speech recognition. The Journal of the Acoustical Society of America 120, 5 (2006), 2421–2424. https://doi.org/10.1121/1.2229005
[3] A. Gupta, Y. Miao, L. Neves, and F. Metze. 2017. Visual features for context-aware speech recognition. In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 5020–5024.
[4] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, and B. Kingsbury. 2012. Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups. IEEE Signal Processing Magazine 29, 6 (Nov 2012), 82–97. https://doi.org/10.1109/MSP.2012.2205597
[5] S. Hochreiter and J. Schmidhuber. 1997. Long short-term memory. Neural Computation 9, 8 (1997), 1735–1780.
[6] D. A. Huang, S. Buch, L. Dery, A. Garg, L. Fei-Fei, and J. C. Niebles. 2018. Finding "It": Weakly-Supervised, Reference-Aware Visual Grounding in Instructional Videos. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 5948–5957.
[7] D. A. Huang, J. J. Lim, L. Fei-Fei, and J. C. Niebles. 2017. Unsupervised Visual-Linguistic Reference Resolution in Instructional Videos. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2183–2192.
[8] H. Liao, E. McDermott, and A. Senior. 2013. Large scale deep neural network acoustic modeling with semi-supervised training data for YouTube video transcription. In Proceedings of IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU). 368–373.
[9] H. McGurk and J. MacDonald. 1976. Hearing lips and seeing voices. Nature 264, 5588 (1976), 746–748.
[10] T. Mikolov, M. Karafiat, L. Burget, J. Cernocky, and S. Khudanpur. 2010. Recurrent neural network based language model. In Proceedings of Interspeech. 1045–1048.
[11] M. Mohri, F. Pereira, and M. Riley. 2002. Weighted finite-state transducers in speech recognition. Computer Speech & Language 16, 1 (2002), 69–88.
[12] Y. Moriya and G. J. F. Jones. 2018. LSTM language model adaptation with images and titles for multimedia automatic speech recognition. In (to appear) IEEE Workshop on Spoken Language Technology (SLT).
[13] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng. 2011. Multimodal deep learning. In Proceedings of the International Conference on Machine Learning (ICML). 689–696.
[14] S. Palaskar, R. Sanabria, and F. Metze. 2018. End-to-End Multimodal Speech Recognition. In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 5774–5778.
[15] G. Potamianos, C. Neti, G. Gravier, A. Garg, and A. W. Senior. 2003. Recent advances in the automatic recognition of audiovisual speech. Proceedings of the IEEE 91, 9 (Sept 2003), 1306–1326. https://doi.org/10.1109/JPROC.2003.817150
[16] R. Sanabria, O. Caglayan, S. Palaskar, D. Elliott, L. Barrault, L. Specia, and F. Metze. 2018. How2: A large-scale dataset for multimodal language understanding. In (to appear) Proceedings of Neural Information Processing Systems (NIPS).
[17] H. Soltau, H. Liao, and H. Sak. 2016. Neural speech recognizer: acoustic-to-word LSTM model for large vocabulary speech recognition. arXiv (2016). http://arxiv.org/abs/1610.09975
[18] M. K. Tanenhaus, M. J. Spivey-Knowlton, K. M. Eberhard, and J. C. Sedivy. 1995. Integration of visual and linguistic information in spoken language comprehension. Science 268, 5217 (1995), 1632–1634.
[19] V. van Wassenhove, K. W. Grant, and D. Poeppel. 2005. Visual speech speeds up the neural processing of auditory speech. Proceedings of the National Academy of Sciences 102, 4 (2005), 1181–1186. https://doi.org/10.1073/pnas.0408949102
[20] L. Zhou, C. Xu, and J. J. Corso. 2018. Towards Automatic Learning of Procedures from Web Instructional Videos. In AAAI. 7590–7598.