<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <article-id pub-id-type="doi">10.18653/v1/W19-5210</article-id>
      <title-group>
        <article-title>Evaluating Heuristics for Audio-Visual Translation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Timo Baumann</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ashutosh Saboo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>BITS Pilani, K.K. Birla Goa Campus</institution>
          ,
          <addr-line>Goa</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Informatics, Universität Hamburg</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <volume>1</volume>
      <issue>4</issue>
      <fpage>17</fpage>
      <lpage>19</lpage>
      <abstract>
<p>Dubbing, i.e., the lip-synchronous translation and revoicing of audio-visual media into a target language from a different source language, is essential for the full-fledged reception of foreign audio-visual media, be it movies, instructional videos or short social media clips. In this paper, we objectify influences on the 'dubbability' of translations, i.e., how well a translation would be synchronously revoiceable to the lips on screen. We explore the value of traditional heuristics used in evaluating the qualitative aspects, in particular matching bilabial consonants and the jaw opening while producing vowels, and control for quantity, i.e., that translations are similar to the source in length. We perform an ablation study using an adversarial neural classifier which is trained to differentiate “true” dubbing translations from machine translations. While we are able to confirm the value of matching lip closure in dubbing, we find that the opening angle of the jaw as determined by the realized vowel may be less relevant than frequently considered in audio-visual translation.</p>
      </abstract>
      <kwd-group>
        <kwd>audiovisual translation</kwd>
        <kwd>dubbing</kwd>
        <kwd>lip synchrony</kwd>
        <kwd>machine translation</kwd>
        <kwd>ablation study</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Dubbing is studied in audio-visual translation [
        <xref ref-type="bibr" rid="ref15">16</xref>
        ], a branch of translatology, and is at present
typically performed manually (although supported by specialized software environments). A
major focus is on producing translations that can be spoken in synchrony along with the facial
movements (in particular lip and jaw movements) visible on screen. The literature [
        <xref ref-type="bibr" rid="ref11 ref3">11, 3</xref>
        ]
differentiates between quantitative and qualitative aspects of synchrony in dubbing. Both are
accepted to be highly relevant, but quantity appears to be more important than quality. Quantity
is concerned with the temporal coordination of speech and lip movements and is meant to avoid
visual or auditory phantom effects. Potentially, the number of syllables or the estimated speaking
time in the source and target languages (SL, TL) can be helpful indicators for finding translations
that enable quantitative synchrony [19].
      </p>
      <p>Quality is important once quantity is established, and is concerned with matching the visemic
characteristics (i.e., what speech sounds look like when pronounced) of source and target
speech, such as the opening angle of the jaw for vowels and lip closure for consonants (e.g., when
there is a /b/ in SL, prefer a translation that features one of /m b p/ at that time over one
that features /g/, to match lip closure). Quality is often characterized by the heuristic of
finding a translation that ‘best matches phonetically’ the source language as it is visible on
screen, as estimated by the human audio-visual translator. Although the idea of ‘best matching
phonetically’ is intuitively plausible, there is a research gap on objective and computational
measures for the dubbing quality of a given translation, which we aim to fill with this paper.
Our long-term goal is to automatically generate a translated script which can be revoiced
easily to yield a dubbed film that transparently appears as if it had been recorded in the target
language all along.</p>
      <p>[Figure 1: Example of an original, dubbed, and machine-translated utterance.
source (en): No, no. Each individual's blood chemistry is unique, like fingerprints.
dubbed (es): No, no. La sangre de cada individuo es única, como una huella.
ideal MT (Google): No, no. La química de la sangre de cada individuo es única, como las huellas dactilares.]</p>
      <p>There is some limited recent work [19] on establishing quantitative similarity for dubbing in
machine translation (MT). Here, we specifically explore the qualitative factors of speech sounds
that may be important beyond matching syllable counts while controlling for quantity.</p>
      <p>The need for objective measures of the dubbing optimality of a given translation arises from
the fact that most MT systems are trained on textual material that disregards dubbing
optimality; corpora of such material exceed the size of available dubbed material by several
orders of magnitude. Even subtitles do not fully cover dubbing characteristics. As a result,
high-performance MT has no implicit notion of dubbing optimality and yields results
that are not directly suitable for dubbing, although optimal as textual translations. It is our
goal to estimate the importance of qualitative matching between SL and TL and to later add
these aspects as constraints to the translation process.</p>
      <p>A way of enriching MT with external constraints is described in the following section and
builds on heuristics that can be evaluated on partial or full translations of utterances. We use
this method to balance MT for quantitative similarity as a basis for our analysis of factors that
influence qualitative similarity using an ablation study that employs an adversarial classifier.
Our empirical analysis confirms the importance of qualitative similarity and matching lip
closures in dubbing. We find that the opening angle of the jaw is comparatively less relevant
for dubbing.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Dubbing and Translation</title>
      <p>Translation from one language to another aims to be a meaning-preserving conversion (typically
of text but also of speech) from a source to a target language (and, to a lesser extent, from
one socio-cultural context to another). Audio-visual translation adds the constraint that the
target language material shall closely match the visemic characteristics of the source to give
the impression that a video of the source speaker actually shows the speaker speaking TL when
revoiced by a dubbing artist.</p>
      <p>[Figure 2: Overview of a full dubbing system under human evaluation of overall quality perception: machine translation (the most faithful translation is not necessarily synchronous), speech synthesis (the most natural speech is not necessarily synchronous), and video adaptation (no adaptation means no artifacts).]</p>
      <p>A perfect dubbing is not always possible, given that the same meaning in two languages is
expressed with different syntactic structures and different words, resulting in different speech
sounds (and accentuation patterns) that yield different articulatory characteristics (visemes,
such as the opening or closing of the lips and jaw). Thus, a tradeoff must be found between
meaning preservation and dubbability. Figure 1 shows, as an example, one original and dubbed
utterance in a TV show, as well as the machine translation of the source to the target language
via Google Translate. We find that MT performs quite well and yields a meaning-preserving
translation, which, however, is substantially longer. In contrast, the dubbed version changes the
syntactic structure and uses synonymy to leave out material (ignoring the ‘chemistry’ aspect
of blood and the ‘finger’ aspect of the print), yielding a more dubbable text.</p>
      <p>The translation for dubbing is clearly geared towards more easily ‘dubbable’ text and it is
then the dubbing artist’s task to speak the material in such a way that it appears as natural
as possible given the video of the original speaker.</p>
      <p>A full dubbing system that covers both translation and speech synthesis as well as
potential video adaptation should yield a solution that is globally optimized towards the user's
perception, as sketched in Figure 2: it can be wise to choose a sub-optimal translation to yield
better overall synchrony of the system.</p>
      <p>
        Neural machine translation (NMT) has become a popular approach for MT, originally
proposed in [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], [
        <xref ref-type="bibr" rid="ref19">20</xref>
        ], [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. NMT trains a single, end-to-end neural network over parallel corpora of
SL and TL pairs. Most NMT architectures belong to the encoder-decoder family [
        <xref ref-type="bibr" rid="ref19 ref5">20, 5</xref>
        ]: after
encoding a SL sentence, the decoder generates the corresponding TL sentence word-by-word
[
        <xref ref-type="bibr" rid="ref19">20</xref>
        ] (and possibly using attention [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] as guidance), thus making a series of locally optimal decisions.
Beam search helps to approximate global optimality [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] and is a convenient lever for adding
external information into the search process to steer decoding.
      </p>
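      <p>To make this concrete, the following sketch (in Python) shows one plausible way to interleave beam pruning with a higher-is-better external score (e.g., a negated dissimilarity heuristic). The decoding interface next_token_logprobs and the external_score function are hypothetical stand-ins, not the API of any particular MT toolkit:</p>
      <preformat>
# A generic sketch of steering beam search with an external score; the model
# interface (next_token_logprobs) and external_score are hypothetical stand-ins.
import heapq

def beam_search(next_token_logprobs, external_score, start, eos,
                beam_size=5, alpha=0.3, max_len=100):
    # Each hypothesis keeps its cumulative model log-probability; the external
    # heuristic is recomputed on the partial hypothesis at every step.
    beams = [(0.0, [start])]  # (model log-prob, token sequence)

    def combined(hyp):
        logprob, seq = hyp
        return (1 - alpha) * logprob + alpha * external_score(seq)

    for _ in range(max_len):
        candidates = []
        for logprob, seq in beams:
            if seq[-1] == eos:           # finished hypotheses are kept as-is
                candidates.append((logprob, seq))
                continue
            for tok, logp in next_token_logprobs(seq):
                candidates.append((logprob + logp, seq + [tok]))
        # prune to the beam, ranking by model score plus external heuristic
        beams = heapq.nlargest(beam_size, candidates, key=combined)
        if all(seq[-1] == eos for _, seq in beams):
            break
    return max(beams, key=combined)[1]
      </preformat>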
      <p>In previous work [19], we enriched a translation system with an external dubbing-optimality
scorer to yield controllable and dubbing-optimal translations, however only for the quantity
of TL material produced. Here, we explore the influence and relative importance of qualitative
aspects in human dubbing.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Measures of Dubbing Optimality</title>
      <p>Dubbing optimality is largely governed by lip synchrony and the opening angle of the jaw and,
of course, by the quantity of speech (which is often taken for granted in the research literature on
dubbing). We first describe previous results in the literature on quantitative measures [19] and
then how we use these to establish and analyze qualitative measures, based on an adversarial
approach in which we train a classifier that attempts to differentiate human gold-standard
dubbing from quantitatively re-balanced MT. If this classifier performs poorly, then
MT is difficult to distinguish from gold-standard translation. We then validate various
qualitative factors, such as the importance of the opening angle of the jaw, the closure of the lips,
prosody and word boundaries, by performing ablation studies with this classifier.</p>
      <sec id="sec-3-1">
        <title>3.1. Enforcing quantitative similarity of phonetic material</title>
        <p>In order to allow for even approximate lip synchrony, the duration of the revoicing should
match that of the original speech, so as to avoid audio-visual phantom effects. These can be
seemingly ‘stray’ movements of the mouth in the dubbed version if there is too little to speak,
or audible speech while the articulators are not moving if there is too much material to speak.
As in [19], we use the number of syllables as the primary indicator of visemic similarity: we use the
standard hyphenation library Pyphen (https://pyphen.org) to count the number of syllables in the SL
sentence as well as in the candidates in TL, and take the relative difference of the two as the
similarity metric (this works well for English-Spanish translation; other language pairs, e.g.,
mora-timed languages, may require other quantitative measures). We then
rescore the NMT’s output by the similarity metric using some weight α. For the experiments
below, we report results across the range 0 ≤ α &lt; 1; [19] found α ∼ 0.3 to yield the
best balance between BLEU score (a measure of translation quality, [18]) and quantitative
similarity. We will therefore highlight the results in the range 0.2 ≤ α ≤ 0.5.</p>
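        <p>A minimal sketch of this rescoring, assuming candidates arrive as (translation, model score) pairs; the exact interpolation used in [19] may differ:</p>
        <preformat>
# Syllable-based quantity rescoring using Pyphen, as described above.
import pyphen

_dic_en = pyphen.Pyphen(lang="en_US")
_dic_es = pyphen.Pyphen(lang="es")

def count_syllables(sentence, dic):
    # Approximate syllables as hyphenation pieces per word.
    words = [w for w in sentence.split() if any(c.isalpha() for c in w)]
    return sum(len(dic.inserted(w).split("-")) for w in words)

def quantity_dissimilarity(src, cand):
    # Relative difference in syllable counts between SL and TL candidate.
    s = count_syllables(src, _dic_en)
    t = count_syllables(cand, _dic_es)
    return abs(s - t) / max(s, 1)

def rescore(src, candidates, alpha=0.3):
    # Interpolate the NMT model score with the similarity metric via alpha.
    return max(
        candidates,
        key=lambda c: (1 - alpha) * c[1] - alpha * quantity_dissimilarity(src, c[0]),
    )[0]
        </preformat>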
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Qualitative similarity of phonetic material</title>
        <p>
          Qualitative similarity, i.e., the dubbing artist’s voice closely matching the articulatory
movements visible on screen, is also highly desirable, beyond quantitative matching. Phonetic
aspects of consonants and vowels, such as lip closure and opening angle of the jaw have been
reported as being relevant for translations that can be lip-synchronously dubbed, as well as
supra-segmental aspects such as prosodic phrasing [
          <xref ref-type="bibr" rid="ref14">15</xref>
          ].
        </p>
        <p>We explore the relative importance of these aspects using an ablation experiment on a
classifier that is trained to differentiate human dubbing translations from NMT translations
(rescored to yield quantitative similarity). For MT that is ideal for dubbing, this classifier
performs poorly, whereas it performs better the more easily gold-standard dubbing and MT
can be differentiated. In essence, if the features that the classifier is deprived of in an ablation
setting are not relevant, the performance of the classifier should not degrade (or may even improve);
if, however, the classifier is deprived of relevant features, we expect a performance degradation.</p>
        <p>We here explore the importance of phonetic/visemic characteristics via different
simplifications of the textual material that we feed to the classifier. For example, when we leave out all
whitespace and punctuation, the classifier is deprived of morphological and prosodic structure
features. If its performance drops (relative to the full input), this reflects their influence on
dubbability. Note that Spanish, the TL in our experiments, has highly regular grapheme-phoneme
correspondences, which allows us to base our experiment directly on ablations of the graphemic
representations.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Text simplifications for ablation study</title>
        <p>We use the following simplifications in addition to passing the full text to the classifier (full); see the sketch after this list.
no punctuation: tests the influence of phrasing as far as it is expressed by punctuation in text;
no whitespace (in addition to no punctuation): tests the importance of word boundaries; we
hypothesize that word boundaries are of little relevance in dubbing, as they are not
clearly observable in continuous speech.</p>
        <p>In addition to whitespace and punctuation removal:
vowels vs. C: we replace all consonants by “C” but keep the vowels, to test how the opening angle
of the jaw alone (which, to a large extent, depends on the vowel produced) helps the model;
consonants vs. V: we replace all vowels by “V”; as a result, the opening angle of the jaw is
not observable to the model;
bilabials vs. C vs. V: we replace all vowels (“V”) and all consonants (“C”) except for bilabials
(“b, p, m”), which are kept; thus, lip closure is the only consonant characteristic
observable to the model;
C vs. V: tests whether syllable structure alone is valuable for dubbing optimality.</p>
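        <p>The following sketch implements these simplifications for Spanish graphemes (a simplified treatment: accented vowels count as vowels, digraphs are ignored, and the mode names are ours):</p>
        <preformat>
# Ablation text simplifications; the orthography assumed is Spanish.
import re

VOWELS = set("aeiouáéíóúü")
BILABIALS = set("bpm")

def simplify(text, mode):
    text = text.lower()
    if mode == "full":
        return text
    text = re.sub(r"[^a-záéíóúüñ\s]", "", text)   # no punctuation
    if mode == "no_punctuation":
        return text
    text = re.sub(r"\s+", "", text)               # additionally, no whitespace
    if mode == "no_whitespace":
        return text
    out = []
    for ch in text:
        if mode == "vowels_vs_C":                 # keep vowels, mask consonants
            out.append(ch if ch in VOWELS else "C")
        elif mode == "consonants_vs_V":           # keep consonants, mask vowels
            out.append("V" if ch in VOWELS else ch)
        elif mode == "bilabials_vs_C_vs_V":       # keep only lip-closure info
            out.append(ch if ch in BILABIALS else
                       ("V" if ch in VOWELS else "C"))
        elif mode == "C_vs_V":                    # syllable structure only
            out.append("V" if ch in VOWELS else "C")
    return "".join(out)

# e.g., simplify("¿Cómo estás?", "bilabials_vs_C_vs_V") yields "CVmVVCCVC"
        </preformat>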
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Model and Training Procedure</title>
        <p>
          For our method, we use an encoder-encoder architecture with siamese parameters for the two
TL candidates to be compared based on the SL sentence as depicted in Figure 3. We first encode
the SL sentence bidirectionally character-by-character using an RNN based on GRUs [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. Each
TL sentence (gold-standard dubbing and NMT) is then also encoded via its characters and
GRU units, which take as additional input the attended-to output from the SL encoder. This
attention layer conditions on the TL recurrent state and we expect that it will be able to learn
the relation of source words to target words, or even of textually observable phonetic sub-word
features (like bilabial consonants), thereby computing the matching of TL and corresponding
SL material in one encoding. We train the TL encoders for each candidate TL sentence in a
siamese setup [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], where parameters are shared, and then subtract the resulting representations
in order to yield the difference between the two candidates. The multi-dimensional difference
is then passed to a final decision layer. We train this setup and report results for each kind of
experimental text simplification in order to determine the value of different kinds of information
(expressed as the relative performance penalty of leaving out the corresponding feature).
        </p>
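        <p>A compact illustrative re-sketch of this architecture in PyTorch (the actual implementation, cf. Section 4, uses DyNet; for brevity, the attention layer is reduced here to mean-pooling over the SL encoding):</p>
        <preformat>
# Siamese comparison of two TL candidates conditioned on the SL sentence.
import torch
import torch.nn as nn

class DubbingComparator(nn.Module):
    def __init__(self, vocab_size, dim=20):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.src_enc = nn.GRU(dim, dim, bidirectional=True, batch_first=True)
        # One TL encoder applied to both candidates (shared, siamese weights);
        # it consumes the pooled SL context at every character position.
        self.tgt_enc = nn.GRU(dim + 2 * dim, dim, batch_first=True)
        self.decide = nn.Linear(dim, 2)  # which candidate is the true dubbing

    def encode_tl(self, tl, src_ctx):
        x = self.embed(tl)
        ctx = src_ctx.unsqueeze(1).expand(-1, x.size(1), -1)
        _, h = self.tgt_enc(torch.cat([x, ctx], dim=-1))
        return h[-1]

    def forward(self, src, tl_a, tl_b):
        src_out, _ = self.src_enc(self.embed(src))
        src_ctx = src_out.mean(dim=1)  # stand-in for the attention layer
        diff = self.encode_tl(tl_a, src_ctx) - self.encode_tl(tl_b, src_ctx)
        return self.decide(diff)       # logits for a softmax decision
        </preformat>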
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Data and Experiments</title>
      <p>
        We use the HEROes corpus [
        <xref ref-type="bibr" rid="ref13">14</xref>
        ], a corpus of the TV show of the same name, with the source
(English) audio and its dubbing into Spanish. The corpus contains a total of 7000 utterance pairs in 9.5
hours of speech, based on forced alignment of video subtitles to the audio tracks. The
alignment results have been manually checked and re-aligned to each other.
      </p>
      <p>[Figure 3: Siamese classifier architecture: the dubbed Spanish text and the machine-translated Spanish text are each encoded character-by-character by RNNs with shared parameters, attending to a character-level RNN encoding of the source English text; the two encodings are subtracted and passed to a softmax decision layer for classification.]</p>
      <p>
        We trained an NMT system on the OpenSubtitles corpus [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] with fairseq [17], with settings
as described in [19]. The NMT yields a BLEU score of 26.31 on our data, which degrades to
25.43 after rescoring with an α value of 0.3. We produce rescored translation results for all α
weightings.
      </p>
      <p>
        Our classifier is trained on the triples of SL, TL dubbing, and TL NMT candidate for all
text simplifications and all values of α, using 10-fold cross validation on the corpus. We report
the overall accuracy for each classification setting as well as the standard deviation across folds.
        The classifier is implemented in DyNet [13]; the code and full experimental data are available at https://github.com/timobaumann/duboptimal. We use 20-dimensional character encodings,
20-dimensional RNN states, and 20-dimensional attention. We train with the Adam method [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] for
10 iterations, using a dropout of 0.2. This is not the result of an extensive hyper-parameter
search but a mixture of best guesses and experience.
      </p>
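      <p>For illustration, a simplified training loop matching the hyper-parameters above (Adam, 10 iterations), reusing the DubbingComparator sketch from Section 3.4; batching and data loading are placeholder assumptions, and the dropout of 0.2 would be applied inside the model:</p>
      <preformat>
# Training over (SL, TL-dubbing, TL-NMT) triples with Adam.
import torch
import torch.nn as nn

def train(model, triples, epochs=10, lr=1e-3):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for src, tl_dub, tl_mt in triples:
            # Randomly swap candidate order so the label is not trivially 0.
            if torch.rand(1).item() > 0.5:
                inputs, label = (src, tl_mt, tl_dub), 1
            else:
                inputs, label = (src, tl_dub, tl_mt), 0
            logits = model(*inputs)
            loss = loss_fn(logits, torch.tensor([label]))
            opt.zero_grad()
            loss.backward()
            opt.step()
      </preformat>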
    </sec>
    <sec id="sec-5">
      <title>5. Results and Discussion</title>
      <p>The results of the experiments are presented in Figure 4, where the x-axis denotes the control
over quantitative similarity (via the rescoring factor α) and the y-axis denotes the classifier
accuracy percentage (an accuracy around 50 % indicates that the classifier is unable to
differentiate true dubbing from MT). The standard deviation across folds for each of the values is in the range
of 1-4 percentage points. Thus, while we have not performed significance tests across folds, we
feel confident that the differences reported below are likely ‘real’.</p>
      <p>The figure shows that the classifier performs best with full input. Translations in the
relevant α-range that yield good quantity (but still reasonable translations) are more difficult to classify,
indicating that the adversarial task is particularly hard under these circumstances. Only
retaining the syllabic structure (C vs. V) yields the worst performance (only marginally above
chance) and can be considered hardly helpful; this debunks the common misunderstanding
that all that needs to be kept in dubbing is the right number of syllables.</p>
      <p>Leaving out punctuation and whitespace has some, but not radical, effects (probably within
the margin of error), indicating that both prosodic phrasing and lexicomorphology need not
be strictly retained while translating for dubbing; instead, these allow for some degree of
freedom to better match other aspects.</p>
      <p>Regarding lip closure and jaw movements, we find that (a) removing vowel information (all
consonants vs. V) only hurts a little, whereas retaining only vowel information (all vowels
vs. C) leads to a considerable performance drop. From this we conclude that matching the
opening angle of the jaw is at least not achieved through vowel choice, and may be less critical
than described in the literature.</p>
      <p>(b) In contrast, removing vowel information and even reducing the consonant information
to whether it is a bilabial or not (bilabials vs. C vs. V) yields surprisingly high performance
(even better than retaining all consonants, possibly because the model learns more easily with
fewer input symbols), which indicates that lip closure is indeed observed closely in the dubbing
corpus.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Summary and Conclusion</title>
      <p>We have studied the importance of aspects of qualitative similarity in dubbing, in particular
when quantitative similarity is controlled for. The literature in translatology for dubbing posits
that jaw movement and lip closure are critical aspects to be observed in dubbing. However, we
found no study prior to ours that investigates the relative importance of these aspects, measures their
importance in an objective way, or investigates the importance of further potential influences
such as lexicomorphology and prosodic phrasing. We have presented an ablation study to
find those features that are particularly relevant for discerning qualitatively ignorant NMT from
true dubbing, using a neural siamese classifier.</p>
      <p>We can confirm the importance of matching lip closures in dubbing. We
therefore conclude that good dubbing requires a good matching of lip closures. By comparison,
the opening angle of the jaw (which intrinsically varies between different vowel types) appears
to be far less important. Our quantification of dubbing constraints leads the way towards
further optimization of machine translation for dubbing, as it enables the training or adaptation
procedure to take these constraints into account. Additionally, our classifier could directly be
included in NMT via an adversarial learning procedure.</p>
      <p>Our experiments yield objective evidence about the importance of qualitative aspects for
dubbing. However, we acknowledge that further research is needed. In particular, our study is
restricted to the textual form and does not include the speech signal in the corpus, which would
allow for a better temporal alignment analysis. Furthermore, our analysis uses the full corpus
rather than only those parts where the face is visible on-screen (and hence qualitative aspects
matter); a tool for on- vs. off-screen detection has become available only very recently [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. Finally, the ultimate evaluation gold standard for dubbing would be a user study
that compares different dubbing alternatives. This could be used to directly optimize towards
human judgements of dubbing alternatives (which might even differ with user preferences),
or towards information retention for educational material, to estimate the distraction caused by
less-than-ideal dubbing.</p>
      <p>More broadly, we believe that ablation studies are a suitable tool in computational
humanities research, as they can help to objectively analyse and quantify the various aspects of existing
humanistic theories for complex phenomena such as the one studied here.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>The second author’s work was performed during an internship at Universität Hamburg, which
was partially supported by the Volkswagen Foundation under the funding codes 91926 and 93255.
We thank the anonymous reviewers for their insightful remarks.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>D.</given-names>
            <surname>Bahdanau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Cho</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          . “
          <article-title>Neural Machine Translation by Jointly Learning to Align and Translate”</article-title>
          .
          <source>In: Proceedings of International Conference on Learning Representations (ICLR)</source>
          .
          <source>Vol. abs/1409.0473</source>
          .
          <year>2014</year>
          . arXiv: 1409.0473. url: http://arxiv.org/abs/1409.0473.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Bromley</surname>
          </string-name>
          , I. Guyon,
          <string-name>
            <given-names>Y.</given-names>
            <surname>LeCun</surname>
          </string-name>
          , E. Säckinger, and
          <string-name>
            <given-names>R.</given-names>
            <surname>Shah</surname>
          </string-name>
          . “
          <article-title>Signature verification using a “siamese” time delay neural network”</article-title>
          .
          <source>In: Advances in Neural Information Processing Systems</source>
          . Ed. by
          <string-name>
            <given-names>J.</given-names>
            <surname>Cowan</surname>
          </string-name>
          , G. Tesauro, and
          <string-name>
            <given-names>J.</given-names>
            <surname>Alspector</surname>
          </string-name>
          . Vol.
          <volume>6</volume>
          . San Francisco, USA: Morgan-Kaufmann,
          <year>1994</year>
          , pp.
          <fpage>737</fpage>
          -
          <lpage>744</lpage>
          . url: https://proceedings.neurips.cc/paper/1993/ file/288cc0ff022877bd3df94bc9360b9c5d-Paper.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>F.</given-names>
            <surname>Chaume</surname>
          </string-name>
          .
          <source>Audiovisual translation: Dubbing. St. Jerome Publishing</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>K.</given-names>
            <surname>Cho</surname>
          </string-name>
          ,
          <string-name>
            <surname>B. van Merriënboer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Bahdanau</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          . “
          <article-title>On the Properties of Neural Machine Translation: Encoder-Decoder Approaches”</article-title>
          . In: Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation (
          <year>2014</year>
          ), pp.
          <fpage>103</fpage>
          -
          <lpage>111</lpage>
          . doi: 10.3115/v1/W14-4012. url: https://aclanthology.org/W14-4012.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>K.</given-names>
            <surname>Cho</surname>
          </string-name>
          ,
          <string-name>
            <surname>B. van Merriënboer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gulcehre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Bahdanau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Bougares</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Schwenk</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          . “
          <article-title>Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation”</article-title>
          . In: (
          <year>2014</year>
          ), pp.
          <fpage>1724</fpage>
          -
          <lpage>1734</lpage>
          . doi: 10.3115/v1/D14-1179. url: https://aclanthology.org/D14-1179.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>K.</given-names>
            <surname>Cho</surname>
          </string-name>
          ,
          <string-name>
            <surname>B. van Merriënboer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gulcehre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Bahdanau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Bougares</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Schwenk</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          . “
          <article-title>Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation”</article-title>
          .
          <source>In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)</source>
          . Doha, Qatar: Association for Computational Linguistics,
          <year>2014</year>
          , pp.
          <fpage>1724</fpage>
          -
          <lpage>1734</lpage>
          . doi: 10.3115/v1/D14-1179. url: https://aclanthology.org/D14-1179.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>X.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Lan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          . “
          <article-title>Improved Beam Search with Constrained Softmax for NMT”</article-title>
          .
          <source>In: Proceedings of Machine Translation Summit XV: Papers. Miami, USA</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>297</fpage>
          -
          <lpage>309</lpage>
          . url: https://aclanthology.org/2015.mtsummit-papers.23.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>N.</given-names>
            <surname>Kalchbrenner</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>Blunsom</surname>
          </string-name>
          . “
          <article-title>Recurrent continuous translation models”</article-title>
          .
          <source>In: Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (EMNLP)</source>
          .
          <source>Association for Computational Linguistics</source>
          ,
          <year>2013</year>
          , pp.
          <fpage>1700</fpage>
          -
          <lpage>1709</lpage>
          . url: https://aclanthology.org/D13-1176.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>D. P.</given-names>
            <surname>Kingma</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Ba</surname>
          </string-name>
          . “
          <article-title>Adam: A Method for Stochastic Optimization”</article-title>
          .
          <source>In: 3rd International Conference on Learning Representations, (ICLR</source>
          <year>2015</year>
          ). Ed. by
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          and
          <string-name>
            <given-names>Y.</given-names>
            <surname>LeCun</surname>
          </string-name>
          . San Diego, USA,
          <year>2015</year>
          . url: http://arxiv.org/abs/1412.6980.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>P.</given-names>
            <surname>Lison</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Tiedemann</surname>
          </string-name>
          . “OpenSubtitles2016:
          <article-title>Extracting Large Parallel Corpora from Movie and TV Subtitles”</article-title>
          .
          <source>In: Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)</source>
          . Ed. by
          <string-name>
            <given-names>N.</given-names>
            <surname>Calzolari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Choukri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Declerck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Goggi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Grobelnik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Maegaard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Mariani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Mazo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Moreno</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Odijk</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Piperidis</surname>
          </string-name>
          . Portorož,
          <source>Slovenia: European Language Resources Association (ELRA)</source>
          ,
          <year>2016</year>
          . url: https://aclanthology.org/L16-1147.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>X.</given-names>
            <surname>Martı</surname>
          </string-name>
          <article-title>́nez. “Film dubbing: Its process and translation”</article-title>
          . In: Topics in Audiovisual Translation. Ed. by
          <string-name>
            <given-names>P.</given-names>
            <surname>Orero</surname>
          </string-name>
          . John Benjamins Publishing,
          <year>2004</year>
          , pp.
          <fpage>18</fpage>
          -
          <lpage>22</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>S.</given-names>
            <surname>Nayak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Baumann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bhattacharya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Karakanta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Negri</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Turchi</surname>
          </string-name>
          . “
          <article-title>See me speaking? Differentiating on whether words are spoken on screen or off to optimize machine dubbing”</article-title>
          .
          <source>In: ICMI Companion: 1st Int. Workshop on Deep Video Understanding. ACM</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>130</fpage>
          -
          <lpage>134</lpage>
          . doi: 10.1145/3395035.3425640.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>A.</given-names>
            <surname>Öktem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Farrús</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Bonafonte</surname>
          </string-name>
          . “
          <article-title>Bilingual Prosodic Dataset Compilation for Spoken Language Translation”</article-title>
          .
          <source>In: Proc. IberSPEECH</source>
          <year>2018</year>
          . Isca,
          <year>2018</year>
          , pp.
          <fpage>20</fpage>
          -
          <lpage>24</lpage>
          . doi: 10.21437/IberSPEECH.2018-5.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>A.</given-names>
            <surname>Öktem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Farrús</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Bonafonte</surname>
          </string-name>
          . “
          <article-title>Prosodic Phrase Alignment for Machine Dubbing”</article-title>
          .
          <source>In: Proc. Interspeech</source>
          <year>2019</year>
          .
          <year>2019</year>
          , pp.
          <fpage>4215</fpage>
          -
          <lpage>4219</lpage>
          . doi: 10.21437/Interspeech.2019-1621. url: http://dx.doi.org/10.21437/Interspeech.2019-1621.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>P.</given-names>
            <surname>Orero</surname>
          </string-name>
          .
          <article-title>Topics in audiovisual translation</article-title>
          . Vol.
          <volume>56</volume>
          . John Benjamins Publishing,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>M.</given-names>
            <surname>Ott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Edunov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Baevski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gross</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Grangier</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Auli</surname>
          </string-name>
          .
          <article-title>“fairseq: A Fast, Extensible Toolkit for Sequence Modeling”</article-title>
          .
          <source>In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations)</source>
          . Minneapolis, USA: Association for Computational Linguistics,
          <year>2019</year>
          , pp.
          <fpage>48</fpage>
          -
          <lpage>53</lpage>
          . doi: 10.18653/v1/N19-4009. url: https://www.aclweb.org/anthology/N19-4009.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>K.</given-names>
            <surname>Papineni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Roukos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Ward</surname>
          </string-name>
          , and W.-J. Zhu. “
          <article-title>BLEU: a Method for Automatic Evaluation of Machine Translation”</article-title>
          .
          <source>In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics</source>
          . Philadelphia, Pennsylvania, USA: Association for Computational Linguistics,
          <year>2002</year>
          , pp.
          <fpage>311</fpage>
          -
          <lpage>318</lpage>
          . doi: 10.3115/1073083.1073135. url: https://aclanthology.org/P02-1040.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>G.</given-names>
            <surname>Neubig</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Dyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Goldberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Matthews</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Ammar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Anastasopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ballesteros</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Clothiaux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Cohn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Duh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Faruqui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Garrette</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kuncoro</surname>
          </string-name>
          , G. Kumar,
          <string-name>
            <given-names>C.</given-names>
            <surname>Malaviya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Michel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Oda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Richardson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Saphra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Swayamdipta</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Yin</surname>
          </string-name>
          . “
          <article-title>DyNet: The Dynamic Neural Network Toolkit”</article-title>
          .
          <source>In: CoRR abs/1701.03980</source>
          (
          <year>2017</year>
          ). arXiv: 1701.03980 [stat.ML]. url: http://arxiv.org/abs/1701.03980.
        </mixed-citation>
      </ref>
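      <ref id="ref18a">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>A.</given-names>
            <surname>Saboo</surname>
          </string-name>
          and
          <string-name>
            <given-names>T.</given-names>
            <surname>Baumann</surname>
          </string-name>
          . “
          <article-title>Integration of Dubbing Constraints into Machine Translation”</article-title>
          .
          <source>In: Proceedings of the Fourth Conference on Machine Translation (WMT)</source>
          . Florence, Italy: Association for Computational Linguistics,
          <year>2019</year>
          . doi: 10.18653/v1/W19-5210. url: https://aclanthology.org/W19-5210.
        </mixed-citation>
      </ref>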
      <ref id="ref19">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>I.</given-names>
            <surname>Sutskever</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Vinyals</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Q. V.</given-names>
            <surname>Le</surname>
          </string-name>
          . “
          <article-title>Sequence to sequence learning with neural networks”</article-title>
          .
          <source>In: Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2. Nips'14</source>
          . Montreal, Canada: MIT Press,
          <year>2014</year>
          , pp.
          <fpage>3104</fpage>
          -
          <lpage>3112</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>