Eyes and Ears Together: New Task for Multimodal Spoken Content Analysis

Yasufumi Moriya 1, Ramon Sanabria 2, Florian Metze 2, Gareth J. F. Jones 1
1 Dublin City University, Dublin, Ireland
2 Carnegie Mellon University, Pittsburgh, PA, USA
{yasufumi.moriya,gareth.jones}@adaptcentre.ie, {ramons,fmetze}@cs.cmu.edu

ABSTRACT
Human speech processing is often a multimodal process combining audio and visual processing. Eyes and Ears Together proposes two benchmark multimodal speech processing tasks: (1) multimodal automatic speech recognition (ASR) and (2) multimodal co-reference resolution on spoken multimedia. These tasks are motivated by our desire to address the difficulties of ASR for multimedia spoken content. We review prior work on the integration of multimodal signals into speech processing for multimedia data, introduce a multimedia dataset for our proposed tasks, and outline these tasks.

1 INTRODUCTION
Human use of natural language for communication is grounded in real-world entities, concepts, and activities. The importance of the real world in language interpretation is illustrated in [18], where participants were presented with a picture of an apple on a towel, a towel without an apple, and a box. When they heard the sentence "put the apple on the towel", their gaze moved to the towel without an apple before they heard the complete sentence "put the apple on the towel in the box". Another example of visual grounding in language understanding is the McGurk effect [9]. When participants were exposed to the voiced alveolar stop ("da") sound together with a video whose lip movements indicate the voiced bilabial stop ("ba") sound, they perceived the bilabial sound rather than the alveolar sound. Furthermore, it is reported that the presence of a speaker's face facilitates speech comprehension [19]. These experiments all demonstrate that human language processing is affected by the context provided by visual signals.

Despite these findings, research on automatic speech recognition (ASR) has generally focused only on audio signals, even in settings where visual and contextual information is available (e.g., multimedia data where the audio is accompanied by a video stream and metadata).
However, high word error rates (WERs) of 30-40% are often reported for ASR of multimedia data from contemporary sources such as YouTube videos and TV shows [1, 8]. More recent work on ASR for YouTube videos illustrates that much lower WERs are possible [17], but the use of 100k hours of data for system development is not feasible in many situations.

This paper presents two proposed multimodal speech processing benchmark tasks motivated by the multimodal nature of human language processing and the practical difficulty of automated spoken data processing. The remainder of the paper is organised as follows: Section 2 reviews previous work on multimodal processing of spoken multimedia data. Section 3 presents an audio-visual dataset suitable for our proposed tasks. Section 4 introduces our potential tasks for spoken multimedia understanding. Section 5 provides concluding remarks.

2 PRIOR WORK
The integration of visual information into ASR systems is a long-standing topic of investigation in the field of audio-visual speech recognition (AVSR). Motivated by the McGurk effect [9], AVSR aims to build noise-robust ASR systems by incorporating lip movement into the recognition of phonemes [15]. The most recent approach to AVSR employs a multimodal deep neural network (DNN) to fuse visual lip movement with audio features [13].

Although AVSR is known to be effective in noisy audio conditions, its application is limited to situations where the speaker's frontal face is visible so that lip movement features can be extracted. Figure 1 shows a comparison of AVSR (Grid corpus) [2] with multimodal ASR (CMU "How-to" corpus) [3, 16] (the corpus was also used in the 2018 Jelinek workshop: https://www.clsp.jhu.edu/workshops/18-workshop/). As shown in the Grid corpus example, the constraints on an AVSR dataset are: (1) the presence of the speaker's mouth region, and (2) precise synchronisation of the visual signal with speech. Multimodal ASR, in contrast, can exploit any available contextual information to improve ASR accuracy.

Figure 1: Comparison of the Grid corpus for AVSR (left) with the CMU "How-to" corpus for multimodal ASR (right).

Recent work has begun to explore the use of more general multimodal information in ASR. Figure 2 shows a basic framework for the integration of contextual information into ASR using a DNN acoustic model [4] and a recurrent neural network (RNN) language model with long short-term memory (LSTM) [5, 10]. In this framework, a convolutional neural network (CNN) extracts a fixed-length image feature vector from a video frame within the time region of each utterance. The image feature vector is concatenated with the audio feature vector for the DNN acoustic model. Alternatively, the image feature vector can be taken as the input to the first step of the RNN language model, before the model reads the embedded word tokens of the utterance. It should be noted that the contextual feature vector does not need to be extracted from a video frame, but can be taken from any feature that represents the environment in which the utterance is spoken. Typically, the DNN acoustic model is used for ASR with a weighted finite-state transducer [11], and the RNN language model re-scores the n-best hypotheses generated in the ASR decoding step.

Figure 2: Framework for integration of visual features into ASR. A convolutional neural network (CNN) extracts a fixed-length vector from a video frame, which is either appended to the audio feature vector or fed to the neural language model before it reads the embedded word tokens.
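To make the two fusion points in this framework concrete, the sketch below shows them in minimal form: an utterance-level visual vector concatenated with the per-frame audio features of a DNN acoustic model, and the same vector fed as the initial step of an LSTM language model. This is a sketch under assumed settings only; the layer sizes, feature dimensions, and module names (FusionAcousticModel, VisualLSTMLanguageModel) are illustrative and do not reproduce the exact configurations of [3, 4, 12].

```python
# Minimal sketch (PyTorch) of the two fusion points described above.
# Dimensions and module names are illustrative assumptions, not the
# configuration of any cited system.
import torch
import torch.nn as nn


class FusionAcousticModel(nn.Module):
    """DNN acoustic model fed with audio features concatenated with a visual vector."""

    def __init__(self, audio_dim=40, visual_dim=512, hidden_dim=1024, num_senones=4000):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(audio_dim + visual_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, num_senones),
        )

    def forward(self, audio_frames, visual_vector):
        # audio_frames: (T, audio_dim); visual_vector: (visual_dim,)
        # The utterance-level visual vector is repeated for every audio frame.
        visual = visual_vector.unsqueeze(0).expand(audio_frames.size(0), -1)
        return self.net(torch.cat([audio_frames, visual], dim=-1))


class VisualLSTMLanguageModel(nn.Module):
    """LSTM language model that reads the visual vector before the word tokens."""

    def __init__(self, vocab_size=10000, embed_dim=256, visual_dim=512, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.visual_proj = nn.Linear(visual_dim, embed_dim)  # map image vector into embedding space
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, word_ids, visual_vector):
        # word_ids: (B, T); visual_vector: (B, visual_dim)
        visual_step = self.visual_proj(visual_vector).unsqueeze(1)  # (B, 1, embed_dim)
        tokens = self.embed(word_ids)                               # (B, T, embed_dim)
        inputs = torch.cat([visual_step, tokens], dim=1)            # visual vector read first
        hidden, _ = self.lstm(inputs)
        # Logits at each step; step i predicts the token following the i-th input,
        # so the visual step already conditions the prediction of the first word.
        return self.out(hidden)


# Toy usage with random tensors standing in for real features.
am = FusionAcousticModel()
lm = VisualLSTMLanguageModel()
senone_logits = am(torch.randn(200, 40), torch.randn(512))                # one utterance, 200 audio frames
word_logits = lm(torch.randint(0, 10000, (8, 12)), torch.randn(8, 512))   # batch of 8 utterances
```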
A number of interesting findings have been reported in existing work on multimodal ASR. Gupta et al. extracted object features and scene features from a video frame chosen randomly from within the time range of each utterance [3]. They used these features to adapt the DNN acoustic model and the RNN language model. Scene features were particularly effective in improving recognition of utterances spoken outdoors. It is likely that letting the acoustic model know that the audio input may contain background noise implicitly transforms the audio features into a cleaner representation. Moriya and Jones investigated whether video titles can provide the RNN language model with the background context of each video [12]. They represented each video title as the average of the embedded words in the title, and found that the adapted model predicted "keywords" of a video better than the non-adapted model (e.g., "fish" in a fishing video).

Huang et al. conducted a new line of work that connects speech transcription with vision. In [7], they propose a method to align entities in a video with the actions that produce them. Their goal was to jointly resolve linguistic ambiguities (e.g., "oil mixed with salt" can be referred to as "the mixture") and visual ambiguities (e.g., "yogurt" can look similar to "dressing"). This approach was further extended to a multimodal co-reference resolution system which links entities in a video with the objects in a transcription, and even with referring expressions (e.g., "it") [6]. Their system was evaluated on the YouCook2 dataset, a collection of unstructured cooking videos [20]. Although the spoken transcriptions are accompanied by video data, the transcriptions are simplified to imperative sentences and do not represent the utterances actually spoken in the videos. We believe that this may form an interesting new task: analysing the environments (video) in which utterances (speech) are spoken.
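At its core, the title-based adaptation of [12] reduces to producing one context vector per video by averaging the embeddings of the title words; that vector can then condition the language model in the same way as the visual vector sketched above. The snippet below is a minimal sketch of that averaging step, assuming a generic pretrained word-embedding lookup; the embedding table, its dimensionality, and the simple whitespace tokenisation are placeholder assumptions, not the setup of [12].

```python
# Minimal sketch: represent a video title as the average of its word embeddings,
# as described for the language-model adaptation in [12]. The embedding table,
# its dimensionality, and the tokenisation are placeholder assumptions.
import numpy as np


def title_vector(title: str, embeddings: dict, dim: int = 300) -> np.ndarray:
    """Average the embeddings of the in-vocabulary title words."""
    words = title.lower().split()
    vectors = [embeddings[w] for w in words if w in embeddings]
    if not vectors:
        return np.zeros(dim)  # back off to a zero vector for fully out-of-vocabulary titles
    return np.mean(vectors, axis=0)


# Toy usage with random embeddings standing in for a pretrained table.
rng = np.random.default_rng(0)
embeddings = {w: rng.standard_normal(300) for w in ["how", "to", "fillet", "a", "fish"]}
context = title_vector("How to Fillet a Fish", embeddings)
print(context.shape)  # (300,)
```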
3 DATASET
This section outlines the CMU "How-to" corpus [16]. The corpus contains instruction videos from YouTube, speech transcriptions, and various types of metadata (e.g., video titles, video descriptions, the number of likes). An example image from the corpus is shown in Figure 1. The audio conditions of the videos vary; for example, some of the videos are recorded outdoors with background noise present. The corpus was used for the experiments in [3] and [12]. Two different setups of the corpus are provided: 480 hours of audio and 90 hours of audio. In both setups, the development and test partitions remain the same. In [12], symbols and numbers in the transcription were removed or expanded to words. In addition, regions of the transcription likely to mismatch the audio were rejected. For this reason, the experimental results in [3] and [12] are not directly comparable. We propose the creation of a standardised version of the corpus for the development of a common multimodal ASR task.

4 TASK DESCRIPTION
We propose two tasks for investigation with spoken multimedia content: multimodal ASR and multimodal co-reference resolution.

4.1 Multimodal ASR
Multimodal ASR is a conventional ASR task that focuses on the use of multimodal signals in ASR, with effectiveness measured using the standard WER. The two main goals of this task are: (1) identifying visual or contextual features that contribute to the improvement of ASR systems; (2) exploring ASR system architectures better suited to exploiting visual or contextual features. The former encourages participants to explore alternative features available in the videos and metadata, e.g., temporal features. The latter aims to explore unconventional architectures for ASR systems, e.g., the use of an end-to-end neural architecture [14].
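For concreteness, the sketch below shows the standard WER computation that would score this task: the word-level edit distance (substitutions, insertions, and deletions) between reference and hypothesis, divided by the number of reference words. It is a minimal reference implementation for illustration, not the official scoring tool.

```python
# Minimal sketch of the standard word error rate (WER): the Levenshtein
# distance over words divided by the number of reference words.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference words and the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution (or match)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)


print(wer("put the apple on the towel", "put that apple on towel"))  # 2 errors / 6 words ≈ 0.33
```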
4.2 Multimodal Co-reference Resolution
Multimodal co-reference resolution aims to bridge the gap between the speech modality and the visual modality. We plan to provide participants with ASR transcriptions containing pronouns, together with the referred objects appearing in the video, with the task of resolving the pronouns. Effectiveness may be measured using F1 scores, as in [6]. Such resolution may also find utility in reducing WER in second-pass ASR decoding.

5 CONCLUSION
This paper presents potential tasks for multimodal spoken content analysis. The motivation for the use of multimodal grounding in ASR arises from the multimodal nature of human language understanding and from the poor performance of ASR systems when applied to multimedia data. Section 2 highlights existing work on the integration of multimodal signals into ASR. Section 3 introduces a multimodal dataset suitable for use in the proposed tasks, and Section 4 outlines the details of the two proposed tasks.

REFERENCES
[1] P. Bell, M. J. F. Gales, T. Hain, J. Kilgour, P. Lanchantin, X. Liu, A. McParland, S. Renals, O. Saz, M. Wester, and P. C. Woodland. 2015. The MGB challenge: Evaluating multi-genre broadcast media recognition. In Proceedings of IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU). 687–693.
[2] M. Cooke, J. Barker, S. Cunningham, and X. Shao. 2006. An audio-visual corpus for speech perception and automatic speech recognition. The Journal of the Acoustical Society of America 120, 5 (2006), 2421–2424. https://doi.org/10.1121/1.2229005
[3] A. Gupta, Y. Miao, L. Neves, and F. Metze. 2017. Visual features for context-aware speech recognition. In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 5020–5024.
[4] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, and B. Kingsbury. 2012. Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups. IEEE Signal Processing Magazine 29, 6 (Nov 2012), 82–97. https://doi.org/10.1109/MSP.2012.2205597
[5] S. Hochreiter and J. Schmidhuber. 1997. Long short-term memory. Neural Computation 9, 8 (1997), 1735–1780.
[6] D. A. Huang, S. Buch, L. Dery, A. Garg, L. Fei-Fei, and J. C. Niebles. 2018. Finding "It": Weakly-Supervised, Reference-Aware Visual Grounding in Instructional Videos. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 5948–5957.
[7] D. A. Huang, J. J. Lim, L. Fei-Fei, and J. C. Niebles. 2017. Unsupervised Visual-Linguistic Reference Resolution in Instructional Videos. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2183–2192.
[8] H. Liao, E. McDermott, and A. Senior. 2013. Large scale deep neural network acoustic modeling with semi-supervised training data for YouTube video transcription. In Proceedings of IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU). 368–373.
[9] H. McGurk and J. MacDonald. 1976. Hearing lips and seeing voices. Nature 264, 5588 (1976), 746–748.
[10] T. Mikolov, M. Karafiat, L. Burget, J. Cernocky, and S. Khudanpur. 2010. Recurrent neural network based language model. In Proceedings of Interspeech. 1045–1048.
[11] M. Mohri, F. Pereira, and M. Riley. 2002. Weighted finite-state transducers in speech recognition. Computer Speech & Language 16, 1 (2002), 69–88.
[12] Y. Moriya and G. J. F. Jones. 2018. LSTM language model adaptation with images and titles for multimedia automatic speech recognition. In (to appear) IEEE Workshop on Spoken Language Technology (SLT).
[13] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng. 2011. Multimodal deep learning. In Proceedings of the International Conference on Machine Learning (ICML). 689–696.
[14] S. Palaskar, R. Sanabria, and F. Metze. 2018. End-to-End Multimodal Speech Recognition. In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 5774–5778.
[15] G. Potamianos, C. Neti, G. Gravier, A. Garg, and A. W. Senior. 2003. Recent advances in the automatic recognition of audiovisual speech. Proceedings of the IEEE 91, 9 (Sept 2003), 1306–1326. https://doi.org/10.1109/JPROC.2003.817150
[16] R. Sanabria, O. Caglayan, S. Palaskar, D. Elliott, L. Barrault, L. Specia, and F. Metze. 2018. How2: A large-scale dataset for multimodal language understanding. In (to appear) Proceedings of Neural Information Processing Systems (NIPS).
[17] H. Soltau, H. Liao, and H. Sak. 2016. Neural speech recognizer: acoustic-to-word LSTM model for large vocabulary speech recognition. arXiv (2016). http://arxiv.org/abs/1610.09975
[18] M. K. Tanenhaus, M. J. Spivey-Knowlton, K. M. Eberhard, and J. C. Sedivy. 1995. Integration of visual and linguistic information in spoken language comprehension. Science 268, 5217 (1995), 1632–1634.
[19] V. van Wassenhove, K. W. Grant, and D. Poeppel. 2005. Visual speech speeds up the neural processing of auditory speech. Proceedings of the National Academy of Sciences 102, 4 (2005), 1181–1186. https://doi.org/10.1073/pnas.0408949102
[20] L. Zhou, C. Xu, and J. J. Corso. 2018. Towards Automatic Learning of Procedures from Web Instructional Videos. In AAAI. 7590–7598.