<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Eyes and Ears Together: New Task for Multimodal Spoken Content Analysis</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Yasufumi Moriya</string-name>
          <email>yasufumi.moriya@adaptcentre.ie</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ramon Sanabria</string-name>
          <email>ramons@cs.cmu.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Florian Metze</string-name>
          <email>fmetze@cs.cmu.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gareth J. F. Jones</string-name>
          <email>gareth.jones@adaptcentre.ie</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Carnegie Mellon University</institution>
          ,
          <addr-line>Pittsburgh, PA</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Dublin City University</institution>
          ,
          <addr-line>Dublin</addr-line>
          ,
          <country country="IE">Ireland</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2018</year>
      </pub-date>
      <fpage>29</fpage>
      <lpage>31</lpage>
      <abstract>
        <p>Human speech processing is often a multimodal process combining audio and visual processing. Eyes and Ears Together proposes two benchmark multimodal speech processing tasks: (1) multimodal automatic speech recognition (ASR) and (2) multimodal co-reference resolution on spoken multimedia. These tasks are motivated by our desire to address the difficulties of ASR for multimedia spoken content. We review prior work on the integration of multimodal signals into speech processing for multimedia data, introduce a multimedia dataset for our proposed tasks, and outline these tasks.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        Human use of natural language for communication is grounded
in real world entities, concepts, and activities. The importance
of the real world in language interpretation is illustrated in [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ],
where participants were presented with a picture of an apple on
a towel, a towel without an apple, and a box. When they heard
the sentence “put the apple on the towel”, their gaze moved to the
towel without an apple, before the reader finished the complete
sentence “put the apple on the towel in the box”. Another example
of visual grounding in language understanding is the McGurk effect
[
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. When participants heard the voiced alveolar stop
(“da”) sound while watching a video whose lip movement indicated the
voiced bilabial stop (“ba”) sound, they perceived the bilabial sound
rather than the alveolar sound. Furthermore, it is reported that the
presence of a speaker’s face facilitates speech comprehension [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ].
These experiments all demonstrate that human language processing
is affected by the context provided in visual signals.
      </p>
      <p>
        Despite those findings, research on automatic speech recognition
(ASR) has generally focused only on audio signals, even where visual
and contextual information is available (e.g., in
multimedia data where the audio is accompanied by a video data
stream and metadata). Meanwhile, high word error rates (WERs)
of 30-40% are often reported for ASR of multimedia data from
contemporary sources such as YouTube videos and TV shows [
        <xref ref-type="bibr" rid="ref1 ref8">1, 8</xref>
        ].
More recent work on ASR for YouTube videos illustrates that much
lower WERs are possible [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], but the use of 100k hours of data for
system development is not feasible in many situations.
      </p>
      <p>This paper presents two proposed multimodal speech processing
benchmark tasks motivated by the multimodal nature of human
language processing and the practical difficulty of automated
spoken data processing. The remainder of the paper is organised as
follows: Section 2 reviews previous work on multimodal
processing of spoken multimedia data. Section 3 presents an audio-visual
dataset suitable for our proposed tasks. Section 4 introduces our
potential tasks for spoken multimedia understanding. Section 5
provides concluding remarks.</p>
      <p>
        The integration of visual information into ASR systems is a
longstanding topic of investigation in the field of audio-visual speech
recognition (AVSR). Motivated by the McGurk effect [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], AVSR aims
to build noise-robust ASR systems by incorporating lip movement
into recognition of phonemes [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. The most recent approach to
AVSR employs a multimodal deep neural network (DNN) to fuse
visual lip movement with audio features [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].
      </p>
      <p>
        Although AVSR is known to be effective in noisy audio
conditions, its application is limited to situations where the
speaker's frontal face is visible, so that lip movement features can be
extracted. Figure 1 shows a comparison of AVSR (Grid corpus) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]
with multimodal ASR (CMU “How-to” corpus) [
        <xref ref-type="bibr" rid="ref16 ref3">3, 16</xref>
        ]<fn id="fn1"><label>1</label><p>The corpus was also used in the 2018 Jelinek workshop: https://www.clsp.jhu.edu/workshops/18-workshop/</p></fn>. As shown
in the Grid corpus example, constraints on an AVSR dataset are: (1)
the presence of the speaker's mouth region, and (2) precise
synchronisation of the visual signal with speech. In contrast, multimodal ASR can exploit
any available contextual information to improve ASR accuracy.
      </p>
      <p>
        Recent work has begun to explore the use of more general
multimodal information in ASR. Figure 2 demonstrates a basic
framework for the integration of contextual information into ASR using
a DNN acoustic model [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and a recurrent neural network (RNN)
language model with long short-term memory (LSTM) [
        <xref ref-type="bibr" rid="ref10 ref5">5, 10</xref>
        ]. In
this framework, a convolutional neural network (CNN) model
extracts a fixed-length image feature vector from the video frame
within the time region of each utterance. The image feature vector
is concatenated with the audio feature vector for the DNN
acoustic model. Alternatively, the image feature vector can be taken as
the input of the first token of the RNN language model, before
the model reads embedded word tokens of the utterance. It should
be noted that the contextual feature vector does not need to be
extracted from a video frame, but can be taken from any feature
that represents the environment of the utterance being spoken.
Typically, the DNN acoustic model is used for ASR with a weighted
finite-state transducer [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], and the RNN language model re-scores
n-best hypotheses generated in the ASR decoding step.
      </p>
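      <p>
        As a minimal sketch of the two integration points described above, and not the implementation used in the cited work, the following Python code appends one utterance-level image vector to every audio frame and, alternatively, places a projected image vector in front of the embedded word tokens. The function names, the toy dimensions and the linear projection to the embedding size are illustrative assumptions that do not appear in the framework description.
      </p>
      <preformat>
```python
def fuse_for_acoustic_model(audio_frames, image_vec):
    """Append the same utterance-level image vector to every audio frame."""
    return [list(frame) + list(image_vec) for frame in audio_frames]


def prepend_for_language_model(word_embeddings, image_vec, projection):
    """Project the image vector to the embedding size and place it first.

    `projection` is a hypothetical matrix given as a list of rows with
    shape (image dimension, embedding dimension).
    """
    first_token = [
        sum(x * w for x, w in zip(image_vec, column))
        for column in zip(*projection)
    ]
    return [first_token] + [list(e) for e in word_embeddings]


# Toy sizes, chosen only for illustration.
audio = [[0.1, 0.2], [0.3, 0.4]]                    # 2 frames, 2-dim audio features
image = [1.0, 0.0, 2.0]                             # 3-dim image feature vector
words = [[0.0, 1.0]]                                # 1 embedded word token, 2-dim
projection = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]   # 3 x 2 projection (assumed)

am_input = fuse_for_acoustic_model(audio, image)
lm_input = prepend_for_language_model(words, image, projection)
print(len(am_input[0]), len(lm_input))  # 5 2
```
      </preformat>
      <p>
        In this sketch every frame of an utterance receives the same image vector, which reflects the use of a single fixed-length vector per utterance; the language model input simply grows by one leading token.
      </p>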
      <p>
        A number of interesting findings have been reported in the
existing work on multimodal ASR. Gupta et al. extracted object features
and scene features from a video frame randomly chosen from within
the time range of each utterance [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. They used these features to
adapt the DNN acoustic model and the RNN language model. Scene
features were particularly effective in improving recognition of
utterances spoken outdoors. It is likely that informing the
acoustic model that the audio input may contain background
noise implicitly transforms the audio features into a cleaner
representation. Moriya and Jones investigated whether video titles
can provide the RNN language model with background context of
each video [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. They represented each video title as the average
of the embeddings of the words in the title, and found that the adapted model
predicted “keywords” of a video better than the non-adapted model
(e.g., “fish” in a fishing video).
      </p>
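      <p>
        The title representation described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the code of the cited work: the toy embedding table, the lower-casing, the whitespace tokenisation and the zero-vector fallback for titles with no known words are all hypothetical choices.
      </p>
      <preformat>
```python
def title_embedding(title, embeddings, dim):
    """Average the embeddings of the title words found in the vocabulary."""
    vectors = [embeddings[w] for w in title.lower().split() if w in embeddings]
    if not vectors:
        # Assumption: fall back to a zero vector when no title word is known.
        return [0.0] * dim
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Hypothetical 2-dim embedding table, for illustration only.
toy_embeddings = {"fly": [1.0, 0.0], "fishing": [0.0, 1.0]}
print(title_embedding("Fly fishing", toy_embeddings, dim=2))  # [0.5, 0.5]
```
      </preformat>
      <p>
        The resulting fixed-length vector can then play the role of the contextual feature vector supplied to the RNN language model.
      </p>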
      <p>
        Huang et al. conducted a new line of work that connects speech
transcription with vision. In [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], they proposed a method to align
entities in a video with actions that produce the entities. Their goal
was to jointly resolve linguistic ambiguities (e.g., “oil mixed with
salt” can be referred to as “the mixture”) and visual ambiguities (e.g.,
“yogurt” can look similar to “dressing”). This approach was further
extended to a multimodal co-reference resolution system which
links entities in a video with the objects in a transcription, and even
with referring expressions (e.g., “it”) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Their system was evaluated
on the YouCook2 dataset, a collection of unstructured cooking
videos [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]. Although the transcriptions are accompanied by
video data, they are simplified to imperative sentences
and do not represent the utterances actually spoken in the
videos. We believe that this line of work may form the basis of an interesting new task:
analysing the environments (video) in which utterances (speech) are spoken.
      </p>
    </sec>
    <sec id="sec-2">
      <title>DATASET</title>
      <p>
        This section outlines the CMU “How-to” corpus [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. The corpus
contains instructional videos from YouTube, speech transcriptions,
and various types of metadata (e.g., video titles, video descriptions,
the number of likes). An example image from the corpus is shown
in Figure 1. Audio conditions vary across videos, e.g., some videos
are recorded outdoors with background noise present. The corpus
was used for experiments in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. Two different setups of
the corpus are provided: 480 hours of audio and 90 hours of audio.
In both setups, the development and test partitions remain the same.
In [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], symbols and numbers in the transcription were removed
or expanded to words. In addition, regions of the transcription that were
likely to mismatch the audio were rejected. For this reason,
the experimental results in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] are not directly comparable.
We propose the creation of a standardised version of the corpus for
development of a common multimodal ASR task.
      </p>
    </sec>
    <sec id="sec-3">
      <title>TASK DESCRIPTION</title>
      <p>We propose two tasks for investigation with spoken multimedia
content: multimodal ASR and multimodal co-reference resolution.</p>
    </sec>
    <sec id="sec-4">
      <title>Multimodal ASR</title>
      <p>
        Multimodal ASR is a conventional ASR task that focuses on the use
of multimodal signals in ASR, with effectiveness measured using
standard WER. The two main goals of this task are: (1) identifying
visual or contextual features that contribute to the improvement
of ASR systems; and (2) exploring suitable ASR system architectures
for better exploitation of visual or contextual features. The former
encourages participants to explore alternative features available
in videos and metadata, e.g., temporal features. The latter
aims to explore unconventional architectures for ASR systems, e.g.,
use of an end-to-end neural architecture [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ].
      </p>
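      <p>
        For reference, WER is the word-level edit distance (substitutions, deletions and insertions) between a hypothesis and its reference, divided by the number of reference words. The sketch below is a minimal illustration of that standard definition; whitespace tokenisation and the absence of any text normalisation are simplifying assumptions, and the function name is hypothetical.
      </p>
      <preformat>
```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] is the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# One substituted word out of six reference words.
print(word_error_rate("put the apple on the towel",
                      "put the apple in the towel"))
```
      </preformat>
      <p>
        A shared scoring routine of this kind, together with a fixed text normalisation, would help make results from different participants directly comparable.
      </p>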
    </sec>
    <sec id="sec-5">
      <title>Multimodal Co-reference Resolution</title>
      <p>
        Multimodal co-reference resolution aims to bridge the
gap between the speech modality and the visual modality. We plan
to provide participants with ASR transcriptions containing pronouns,
together with the referred objects appearing in a video, with the task of resolving
the pronouns. Effectiveness may be measured using F1 scores, as
in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Such resolution may find utility in reducing WER in second-pass
ASR decoding.
      </p>
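      <p>
        As a minimal sketch of F1-based scoring, the code below compares predicted links between referring expressions and objects against gold links. Representing each link as an (expression, object) pair that must match exactly is an assumption made for illustration; the identifiers in the example are hypothetical and the precise matching criterion of the cited evaluation is not reproduced here.
      </p>
      <preformat>
```python
def f1_score(predicted_links, gold_links):
    """Harmonic mean of precision and recall over exactly matching links."""
    predicted, gold = set(predicted_links), set(gold_links)
    true_positives = len(predicted.intersection(gold))
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical links between pronoun mentions and objects seen in a video.
predicted = [("it_1", "oil"), ("it_2", "bowl")]
gold = [("it_1", "oil"), ("it_2", "pan"), ("them_1", "eggs")]
print(f1_score(predicted, gold))
```
      </preformat>
      <p>
        Here one of two predicted links is correct (precision 1/2) and one of three gold links is recovered (recall 1/3).
      </p>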
    </sec>
    <sec id="sec-6">
      <title>CONCLUSION</title>
      <p>This paper presents potential tasks for multimodal spoken content
analysis. The motivation for the use of multimodal grounding in
ASR arises from the multimodal nature of human language
understanding and from the poor performance of ASR systems when
applied to multimedia data. Section 2 highlights existing work on
the integration of multimodal signals into ASR. Section 3 introduces
a multimodal dataset suitable for use in the proposed tasks, and
Section 4 outlines the two proposed tasks.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>P.</given-names>
            <surname>Bell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. J. F.</given-names>
            <surname>Gales</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Hain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kilgour</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Lanchantin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>McParland</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Renals</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Saz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wester</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P. C.</given-names>
            <surname>Woodland</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>The MGB challenge: Evaluating multi-genre broadcast media recognition</article-title>
          .
          <source>In Proceedings of IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU)</source>
          .
          <fpage>687</fpage>
          -
          <lpage>693</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Cooke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Barker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Cunningham</surname>
          </string-name>
          , and
          <string-name>
            <given-names>X.</given-names>
            <surname>Shao</surname>
          </string-name>
          .
          <year>2006</year>
          .
          <article-title>An audio-visual corpus for speech perception and automatic speech recognition</article-title>
          .
          <source>The Journal of the Acoustical Society of America</source>
          <volume>120</volume>
          ,
          <issue>5</issue>
          (
          <year>2006</year>
          ),
          <fpage>2421</fpage>
          -
          <lpage>2424</lpage>
          . https://doi.org/10.1121/1.2229005
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Miao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Neves</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Metze</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Visual features for context-aware speech recognition</article-title>
          .
          <source>In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source>
          .
          <fpage>5020</fpage>
          -
          <lpage>5024</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>G.</given-names>
            <surname>Hinton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. E.</given-names>
            <surname>Dahl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mohamed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Jaitly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Senior</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Vanhoucke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. N.</given-names>
            <surname>Sainath</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Kingsbury</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups</article-title>
          .
          <source>IEEE Signal Processing Magazine</source>
          <volume>29</volume>
          ,
          <issue>6</issue>
          (Nov
          <year>2012</year>
          ),
          <fpage>82</fpage>
          -
          <lpage>97</lpage>
          . https://doi.org/10.1109/MSP.2012.2205597
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Hochreiter</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Schmidhuber</surname>
          </string-name>
          .
          <year>1997</year>
          .
          <article-title>Long short-term memory</article-title>
          .
          <source>Neural Computation</source>
          <volume>9</volume>
          ,
          <issue>8</issue>
          (
          <year>1997</year>
          ),
          <fpage>1735</fpage>
          -
          <lpage>1780</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>D. A.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Buch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Dery</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Garg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Fei-Fei</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J. C.</given-names>
            <surname>Niebles</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Finding “It”: Weakly-Supervised, Reference-Aware Visual Grounding in Instructional Videos</article-title>
          .
          <source>In International Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          .
          <fpage>5948</fpage>
          -
          <lpage>5957</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>D. A.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. J.</given-names>
            <surname>Lim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Fei-Fei</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J. C.</given-names>
            <surname>Niebles</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Unsupervised Visual-Linguistic Reference Resolution in Instructional Videos</article-title>
          .
          <source>In International Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          .
          <fpage>2183</fpage>
          -
          <lpage>2192</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>H.</given-names>
            <surname>Liao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>McDermott</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Senior</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Large scale deep neural network acoustic modeling with semi-supervised training data for YouTube video transcription</article-title>
          .
          <source>In Proceedings of IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU)</source>
          .
          <fpage>368</fpage>
          -
          <lpage>373</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>H.</given-names>
            <surname>McGurk</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>MacDonald</surname>
          </string-name>
          .
          <year>1976</year>
          .
          <article-title>Hearing lips and seeing voices</article-title>
          .
          <source>Nature</source>
          <volume>264</volume>
          ,
          <issue>5588</issue>
          (
          <year>1976</year>
          ),
          <fpage>746</fpage>
          -
          <lpage>748</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>T.</given-names>
            <surname>Mikolov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Karafiat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Burget</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Cernocky</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Khudanpur</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>Recurrent neural network based language model</article-title>
          .
          <source>In Proceedings of Interspeech</source>
          .
          <fpage>1045</fpage>
          -
          <lpage>1048</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>M.</given-names>
            <surname>Mohri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Pereira</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Riley</surname>
          </string-name>
          .
          <year>2002</year>
          .
          <article-title>Weighted finite-state transducers in speech recognition</article-title>
          .
          <source>Computer Speech &amp; Language</source>
          <volume>16</volume>
          ,
          <issue>1</issue>
          (
          <year>2002</year>
          ),
          <fpage>69</fpage>
          -
          <lpage>88</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Moriya</surname>
          </string-name>
          and
          <string-name>
            <given-names>G. J. F.</given-names>
            <surname>Jones</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>LSTM language model adaptation with images and titles for multimedia automatic speech recognition</article-title>
          .
          <source>In (to appear) Workshop on Spoken Language Technology (SLT)</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>J.</given-names>
            <surname>Ngiam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Khosla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Nam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Lee</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A. Y.</given-names>
            <surname>Ng</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>Multimodal deep learning</article-title>
          .
          <source>In Proceedings of the International Conference on Machine Learning (ICML)</source>
          .
          <fpage>689</fpage>
          -
          <lpage>696</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>S.</given-names>
            <surname>Palaskar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Sanabria</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Metze</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>End-to-End Multimodal Speech Recognition</article-title>
          .
          <source>In International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source>
          .
          <fpage>5774</fpage>
          -
          <lpage>5778</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>G.</given-names>
            <surname>Potamianos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Neti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Gravier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Garg</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A. W.</given-names>
            <surname>Senior</surname>
          </string-name>
          .
          <year>2003</year>
          .
          <article-title>Recent advances in the automatic recognition of audiovisual speech</article-title>
          .
          <source>Proc. IEEE</source>
          <volume>91</volume>
          ,
          <issue>9</issue>
          (Sept
          <year>2003</year>
          ),
          <fpage>1306</fpage>
          -
          <lpage>1326</lpage>
          . https://doi.org/10.1109/JPROC.2003.817150
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>R.</given-names>
            <surname>Sanabria</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Caglayan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Palaskar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Elliott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Barrault</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Specia</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Metze</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>How2: A large-scale dataset for multimodal language understanding</article-title>
          .
          <source>In (to appear) Proceedings of Neural Information Processing Systems (NIPS).</source>
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>H.</given-names>
            <surname>Soltau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Liao</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Sak</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Neural speech recognizer: acoustic-to-word LSTM model for large vocabulary speech recognition</article-title>
          .
          <source>arXiv</source>
          (
          <year>2016</year>
          ). http://arxiv.org/abs/1610.09975
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>M. K.</given-names>
            <surname>Tanenhaus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Spivey-Knowlton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. M.</given-names>
            <surname>Eberhard</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J. C.</given-names>
            <surname>Sedivy</surname>
          </string-name>
          .
          <year>1995</year>
          .
          <article-title>Integration of visual and linguistic information in spoken language comprehension</article-title>
          .
          <source>Science</source>
          <volume>268</volume>
          ,
          <issue>5217</issue>
          (
          <year>1995</year>
          ),
          <fpage>1632</fpage>
          -
          <lpage>1634</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>V.</given-names>
            <surname>van Wassenhove</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. W.</given-names>
            <surname>Grant</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Poeppel</surname>
          </string-name>
          .
          <year>2005</year>
          .
          <article-title>Visual speech speeds up the neural processing of auditory speech</article-title>
          .
          <source>Proceedings of the National Academy of Sciences</source>
          <volume>102</volume>
          ,
          <issue>4</issue>
          (
          <year>2005</year>
          ),
          <fpage>1181</fpage>
          -
          <lpage>1186</lpage>
          . https://doi.org/10.1073/pnas.0408949102
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J. J.</given-names>
            <surname>Corso</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Towards Automatic Learning of Procedures from Web Instructional Videos</article-title>
          .
          <source>In AAAI</source>
          .
          <fpage>7590</fpage>
          -
          <lpage>7598</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>