Evaluating Heuristics for Audio-Visual Translation
Timo Baumann1 , Ashutosh Saboo1,2
1 Department of Informatics, Universität Hamburg, Germany
2 BITS Pilani, K.K. Birla Goa Campus, Goa, India


Abstract
Dubbing, i.e., the lip-synchronous translation and revoicing of audio-visual media into a target language from a different source language, is essential for the full-fledged reception of foreign audio-visual media, be it movies, instructional videos or short social media clips. In this paper, we objectify influences on the ‘dubbability’ of translations, i.e., how well a translation would be synchronously revoiceable to the lips on screen. We explore the value of traditional heuristics used in evaluating the qualitative aspects, in particular matching bilabial consonants and the jaw opening while producing vowels, and control for quantity, i.e., that translations are similar to the source in length. We perform an ablation study using an adversarial neural classifier which is trained to differentiate “true” dubbing translations from machine translations. While we are able to confirm the value of matching lip closure in dubbing, we find that the opening angle of the jaw as determined by the realized vowel may be less relevant than frequently considered in audio-visual translation.

Keywords
audiovisual translation, dubbing, lip synchrony, machine translation, ablation study




1. Introduction
Dubbing is studied in audio-visual translation [16], a branch of translatology, and is at present
typically performed manually (although supported by specialized software environments). A
major focus is on producing translations that can be spoken in synchrony along with the facial
movements (in particular lip and jaw movements) visible on screen. The literature [11, 3]
differentiates between quantitative and qualitative aspects of synchrony in dubbing. Both are
accepted to be highly relevant, but quantity appears to be more important than quality. Quantity
is concerned with the coordination of time of speech and lip movements and is meant to avoid
visual or auditory phantom effects. Potentially, the number of syllables or the time estimated to
speak in the source and target languages (SL, TL) can be helpful indicators to find translations
that enable quantitative synchrony [19].
   Quality is important once quantity is established, and is concerned with matching visemic
characteristics (i.e., what speech sounds look like when pronounced) of source and target
speech, such as opening angle of the jaw for vowels and lip closure for consonants (e.g., when
there is a /b/ in SL, prefer a translation that features one of /m b p/ at that time over one
that features /g/, to match lip closure). Quality is often characterized by the heuristic of
finding a translation that ‘best matches phonetically’ the source language as it is visible on
screen, as estimated by the human audio-visual translator. Although the idea of ‘best matching
phonetically’ is intuitively plausible, there is a research gap on objective and computational
CHR 2021: Computational Humanities Research Conference, November 17–19, 2021, Amsterdam, The Netherlands
✉ mail@timobaumann.de (T. Baumann); ashutosh.saboo96@gmail.com (A. Saboo)
ORCID: 0000-0003-2203-1783 (T. Baumann)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
                               CEUR Workshop Proceedings (CEUR-WS.org)




            source (en): No, no. Each individual's blood chemistry is unique, like fingerprints.
           dubbed (es): No, no. La sangre de cada individuo es única, como una huella.
    ideal MT (Google): No, no. La química de la sangre de cada individuo es única, como las huellas dactilares.
Figure 1: Example dubbing from English to Spanish in the show “Heroes” (season 3, episode 1, starting at
29’15”, from [14]); MT via Google translate.


measures for the dubbing quality of a given translation, which we aim to fill with this paper.
Our long-term goal is to automatically generate a translated script which can be revoiced
easily to yield a dubbed film that transparently appears as if it had been recorded in the target
language all along.
   There is some limited recent work [19] on establishing quantitative similarity for dubbing in
machine translation (MT). Here, we specifically explore the qualitative factors of speech sounds
that may be important beyond matching syllable counts while controlling for quantity.
   The need for objective measures of the dubbing optimality of a given translation arises from
the fact that most MT systems are trained on textual material that does not regard dubbing
optimality; such text corpora exceed the amount of available dubbed material by several
orders of magnitude. Even subtitles do not fully cover dubbing characteristics. As a result,
high-performance MT does not have an implicit notion of dubbing optimality and yields results
that are not directly suitable for dubbing, although optimal for textual translations. It is our
goal to estimate the importance of qualitative matching between SL and TL and to later add
these as constraints to the translation process.
   A way of enriching MT with external constraints is described in the following section and
builds on heuristics that can be evaluated on partial or full translations of utterances. We use
this method to balance MT for quantitative similarity as a basis for our analysis of factors that
influence qualitative similarity using an ablation study that employs an adversarial classifier.
Our empirical analysis confirms the importance of qualitative similarity and matching lip
closures in dubbing. We find that the opening angle of the jaw is comparatively less relevant
for dubbing.


2. Dubbing and Translation
Translation from one language to another aims to be a meaning-preserving conversion (typically
of text but also of speech) from a source to a target language (and, to a lesser extent, from
one socio-cultural context to another). Audio-visual translation adds the constraint that the
target language material shall closely match the visemic characteristics of the source to give
the impression that a video of the source speaker actually shows the speaker speaking TL when




[Figure 2 illustrates a pipeline of machine translation, speech synthesis, video adaptation, and human evaluation of overall quality perception. Its extremes range from the most faithful translation, most natural speech, and no adaptation (hence not synchronous, but artifact-free) to the most synchronous translation, timings forced to source visemes, and boundless adaptation (hence mistranslations, unintelligibility, and speech/gesture/dialog asynchrony); the globally optimal solution lies in between.]

Figure 2: Machine Translation as one part of a full system for dubbing that includes speech synthesis, video
adaptation, and considers the perceived quality loss of misaligned speech. Given the complexity of the task, a
modular system has clear advantages over an end-to-end monolithic system but requires a notion of dubbing
optimality.


revoiced by a dubbing artist.
   A perfect dubbing is not always possible given that the same meaning in two languages is
expressed with different syntactic structures and different words, resulting in different speech
sounds (and accentuation patterns) that yield different articulatory characteristics (visemes
such as the opening or closing of the lips and jaw). Thus, a tradeoff must be found between
meaning preservation and dubbability. Figure 1 shows, as an example, one original and dubbed
utterance in a TV show, as well as the machine translation of the source to the target language
via Google translate. We find that MT performs quite well and yields a meaning-preserving
translation, which however is substantially longer. In contrast, the dubbed version changes the
syntactic structure and uses synonymy to leave out material (ignoring the ‘chemistry’ aspect
of blood and the ‘finger’ aspect of the print), yielding a more dubbable text.
   The translation for dubbing is clearly geared towards more easily ‘dubbable’ text and it is
then the dubbing artist’s task to speak the material in such a way that it appears as natural
as possible given the video of the original speaker.
   A full dubbing system covering both translation and speech synthesis as well as potential
video adaptation should yield a solution that is globally optimized towards user perception, as
sketched in Figure 2: it can be wise to choose a sub-optimal translation to yield better overall
synchrony of the complete system.
   Neural machine translation (NMT) has become a popular approach for MT, originally
proposed by [8, 20, 4]. NMT trains a single, end-to-end neural network over parallel corpora of
SL and TL pairs. Most NMT architectures belong to the encoder-decoder family [20, 5]: after
encoding a SL sentence, the decoder generates the corresponding TL sentence word-by-word
[20] (and possibly using attention [1] as guidance), thus in a series of locally optimal decisions.
Beam search helps to approximate global optimality [7] and is a convenient lever for adding
external information into the search process to steer decoding.
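   To illustrate this lever, the following minimal Python sketch re-ranks an n-best list of beam-search candidates with an external scorer; the function, its interface, and the linear score combination are our illustrative assumptions (assuming model scores normalized to a comparable range), not the exact mechanism used in [19].

    from typing import Callable, List, Tuple

    def rerank_nbest(nbest: List[Tuple[str, float]],
                     external_score: Callable[[str], float],
                     alpha: float) -> List[Tuple[str, float]]:
        """Re-rank beam-search candidates (hypothesis, normalized model score) by
        interpolating the model score with an external score weighted by alpha."""
        rescored = [(hyp, (1 - alpha) * score + alpha * external_score(hyp))
                    for hyp, score in nbest]
        return sorted(rescored, key=lambda pair: pair[1], reverse=True)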
   In previous work [19], we enriched a translation system with an external dubbing optimality




scorer to yield controllable and dubbing-optimal translations; this, however, covered only the quantity
of TL material produced. We here explore the influence and relative importance of qualitative
aspects in human dubbing.


3. Measures of Dubbing Optimality
Dubbing optimality is largely governed by lip synchrony and the opening angle of the jaw and,
of course, the quantity of speech (which is often taken for granted in the research literature on
dubbing). We first describe previous results in the literature on quantitative measures [19] and
then how we use these to establish and analyze qualitative measures based on an adversarial
approach in which we aim to train a classifier that attempts to differentiate human gold-
standard dubbing from quantitatively re-balanced MT. If this classifier performs poorly, then
MT is more difficult to distinguish from gold-standard translation. We then validate various
qualitative factors, such as importance of the opening angle of the jaws, closure of the lips,
prosody and word boundaries by performing ablation studies with this classifier.

3.1. Enforcing quantitative similarity of phonetic material
To allow for even approximate lip synchrony, the duration of the revoicing should
match that of the original speech, so as to avoid audio-visual phantom effects. These can be
seemingly ‘stray’ movements of the mouth in the dubbed version if there is too little to speak,
or audible speech while the articulators are not moving if there is too much material to speak.
As in [19], we use the number of syllables as the primary indicator of visemic similarity via the
standard hyphenation library Pyphen1 for counting the number of syllables in SL as well as
candidates in TL and take the relative difference of the two as the similarity metric.2 We then
rescore the NMT’s output by the similarity metric using some weight α. For the experiments
below, we report results across the full range 0 ≤ α < 1; [19] found α ∼ 0.3 to yield the
best balance between BLEU score (a measure for translation quality, [18]) and quantitative
similarity. We will therefore highlight the results in the 0.2 ≤ α ≤ 0.5-range.
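   As a concrete, hedged sketch of this quantitative measure, the following Python code counts syllables via Pyphen hyphenation points and turns the relative difference into a similarity score; the exact tokenization and normalization used in [19] may differ, and the function names are ours. The result can serve as the external scorer in the re-ranking sketch of Section 2.

    import pyphen

    # assumes the English and Spanish hyphenation dictionaries shipped with Pyphen
    _DICTS = {"en": pyphen.Pyphen(lang="en_US"), "es": pyphen.Pyphen(lang="es")}

    def count_syllables(sentence: str, lang: str) -> int:
        """Approximate the syllable count of a sentence via hyphenation points."""
        words = [w.strip(".,!?;:\"'") for w in sentence.lower().split()]
        return sum(len(_DICTS[lang].inserted(w).split("-")) for w in words if w)

    def quantity_similarity(src_en: str, hyp_es: str) -> float:
        """1 minus the relative syllable-count difference; 1.0 means a perfect match
        (may become negative for strongly mismatched lengths, which is fine for ranking)."""
        s = count_syllables(src_en, "en")
        t = count_syllables(hyp_es, "es")
        return 1.0 - abs(s - t) / max(s, 1)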

3.2. Qualitative similarity of phonetic material
Qualitative similarity, i.e., the dubbing artist’s voice closely matching the articulatory move-
ments visible on screen, is also highly desirable, beyond quantitative matching. Phonetic
aspects of consonants and vowels, such as lip closure and opening angle of the jaw have been
reported as being relevant for translations that can be lip-synchronously dubbed, as well as
supra-segmental aspects such as prosodic phrasing [15].
   We explore the relative importance of these aspects using an ablation experiment on a
classifier that is trained to differentiate human dubbing translations from NMT translations
(rescored to yield quantitative similarity). For MT that is ideal for dubbing, this classifier
performs poorly, whereas it performs better the more easily gold-standard dubbing and MT
can be differentiated. In essence, if the features that the classifier is deprived of in an ablation
setting are not relevant, the performance of the classifier should not reduce (or even improve);
if however the classifier is deprived of relevant features, we expect a performance degradation.
1 Pyphen: https://pyphen.org.
2 This works well for English-Spanish translation; other language pairs may require other quantitative measures, e.g., for mora-driven languages.




   We here explore the importance of phonetic/visemic characteristics via different simplifica-
tions of the textual material that we feed to the classifier. For example, when we leave out all
whitespace and punctuation, the classifier is deprived of morphological and prosodic structure
features. If its performance drops (relative to the full input) this reflects the influence on
dubbability. Note that Spanish, the TL in our experiments, has highly regular grapheme-phoneme
correspondences, which allows us to base our experiment directly on ablations of the graphemic
representations.

3.3. Text simplifications for ablation study
We use the following simplifications in addition to passing the full text to the classifier (full); see the code sketch after this list:
no punctuation tests the influence of phrasing as far as expressed by punctuation in text,
no whitespace (in addition to no punctuation): tests the importance of word boundaries; we
      hypothesize that word boundaries are of little relevance when dubbing as they are not
      clearly observable in continuous speech.
In addition to whitespace and punctuation removal:
vowels vs. C we replace all consonants by “C” but not the vowels, to test how opening angle
      of the jaw alone (which, to a large extent, depends on the vowel produced) helps the
      model,
consonants vs. V we replace all vowels by “V”; as a result, the opening angle of the jaw is
      not observable to the model,
bilabials vs. C vs. V we replace all vowels (“V”) and consonants (“C”) except for bilabials
      (“b, p, m”) which are not replaced; thus, lip closure is the only consonant characteristic
      observable to the model,
C vs. V to test if syllable structure alone is valuable for dubbing optimality.
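   A minimal Python sketch of these simplifications follows, assuming lower-cased Spanish input with accented vowels counted as vowels; the exact handling of digits and other edge cases in our preprocessing may differ.

    import re

    VOWELS = set("aeiouáéíóúü")
    BILABIALS = set("bpm")

    def simplify(text: str, condition: str) -> str:
        """Apply one of the ablation conditions of Section 3.3 to a TL sentence."""
        t = text.lower()
        if condition == "full":
            return t
        t = re.sub(r"[^\w\s]", "", t)          # drop punctuation
        if condition == "no punctuation":
            return t
        t = re.sub(r"\s+", "", t)              # additionally drop whitespace
        if condition == "no whitespace":
            return t
        if condition == "vowels vs. C":
            return "".join(c if c in VOWELS else "C" for c in t)
        if condition == "consonants vs. V":
            return "".join("V" if c in VOWELS else c for c in t)
        if condition == "bilabials vs. C vs. V":
            return "".join("V" if c in VOWELS else c if c in BILABIALS else "C" for c in t)
        if condition == "C vs. V":
            return "".join("V" if c in VOWELS else "C" for c in t)
        raise ValueError(f"unknown condition: {condition}")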

3.4. Model and Training Procedure
For our method, we use an encoder-encoder architecture with siamese parameters for the two
TL candidates to be compared based on the SL sentence as depicted in Figure 3. We first encode
the SL sentence bidirectionally character-by-character using an RNN based on GRUs [6]. Each
TL sentence (gold-standard dubbing and NMT) is then also encoded via its characters and
GRU units, which take as additional input the attended-to output from the SL encoder. This
attention layer conditions on the TL recurrent state and we expect that it will be able to learn
the relation of source words to target words, or even of textually observable phonetic sub-word
features (like bilabial consonants), thereby computing the matching of TL and corresponding
SL material in one encoding. We train the TL encoders for each candidate TL sentence in a
siamese setup [2], where parameters are shared, and then subtract the resulting representations
in order to yield the difference between the two candidates. The multi-dimensional difference
is then passed to a final decision layer. We train this setup and report results for each kind of
experimental text simplification in order to find out the value of different kinds of information
(expressed as relative performance penalty of leaving out the corresponding feature).
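   For illustration, the following simplified PyTorch sketch re-implements this architecture (our actual implementation uses DyNet, cf. Section 4, and the 20-dimensional sizes are taken from there); class and variable names are illustrative, and batching of variable-length sequences is glossed over.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SiameseDubbabilityClassifier(nn.Module):
        """Simplified sketch of the encoder-encoder classifier of Figure 3 (illustrative only)."""

        def __init__(self, vocab_size, emb_dim=20, hid_dim=20):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, emb_dim)
            # bidirectional character-level encoder for the SL (English) sentence
            self.src_enc = nn.GRU(emb_dim, hid_dim, bidirectional=True, batch_first=True)
            # a single TL encoder cell: shared ("siamese") parameters for both candidates
            self.tgt_cell = nn.GRUCell(emb_dim + 2 * hid_dim, hid_dim)
            self.att = nn.Linear(hid_dim, 2 * hid_dim)   # attention conditioned on the TL state
            self.decide = nn.Linear(hid_dim, 2)          # decision layer over the subtracted encodings

        def _encode_tl(self, tl_chars, src_states):
            """Encode one TL candidate character-by-character, attending to the SL encoding."""
            h = src_states.new_zeros(src_states.size(0), self.tgt_cell.hidden_size)
            for emb_t in self.embed(tl_chars).unbind(dim=1):
                scores = torch.bmm(src_states, self.att(h).unsqueeze(2)).squeeze(2)
                context = torch.bmm(F.softmax(scores, dim=1).unsqueeze(1), src_states).squeeze(1)
                h = self.tgt_cell(torch.cat([emb_t, context], dim=1), h)
            return h

        def forward(self, src_chars, dubbed_chars, nmt_chars):
            src_states, _ = self.src_enc(self.embed(src_chars))   # (batch, time, 2*hid_dim)
            diff = self._encode_tl(dubbed_chars, src_states) - self._encode_tl(nmt_chars, src_states)
            return self.decide(diff)   # logits; softmax/cross-entropy is applied in the loss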


4. Data and Experiments
We use the HEROes corpus [14], a corpus built from the TV show of the same name, with the English
source and its dubbing into Spanish. The corpus contains a total of 7000 utterance pairs in 9.5




[Figure 3 depicts the siamese encoder-encoder classifier: the dubbed Spanish text and the machine-translated Spanish text are each encoded character-by-character by RNNs with siamese (shared) parameters, both attending to a bidirectional character-level RNN encoding of the source English text; the two candidate encodings are subtracted and passed to a decision layer with softmax classification.]

Figure 3: Siamese encoder-encoder classifier for comparing the ‘dubbability’ of two TL candidates given a
SL sentence.


hours of speech that are based on forced alignment of video subtitles to the audio tracks. The
results have been manually checked and re-aligned to each other.
  We trained an NMT system on the OpenSubtitles corpus [10] with fairseq [17] with settings
as described in [19]. The NMT yields a BLEU score of 26.31 on our data, which degrades to
25.43 after rescoring with an α value of 0.3. We produce rescored translation results for all α
weightings.
  Our classifier is trained on the triples of SL, TL dubbing, and TL NMT candidate for all
text simplifications and all values of α, using 10-fold cross validation on the corpus. We report
the overall accuracy for each classification setting as well as the standard deviation across folds.
The classifier is implemented in DyNet [13].3 We use 20-dimensional character encodings, 20-
dimensional RNN states, and 20-dimensional attention. We train with the Adam method [9] for
10 iterations, using a dropout of 0.2. This is not the result of an extensive hyper-parameter
search but a mixture of best guesses and experience.
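   The evaluation protocol can be sketched as follows; train_classifier and evaluate are hypothetical placeholders for our DyNet training loop and accuracy computation.

    import numpy as np
    from sklearn.model_selection import KFold

    def cross_validate(triples, train_classifier, evaluate, n_splits=10, seed=1):
        """10-fold cross-validation over (SL, TL-dubbing, TL-NMT) triples;
        returns mean accuracy and standard deviation across folds."""
        accuracies = []
        for train_idx, test_idx in KFold(n_splits, shuffle=True, random_state=seed).split(triples):
            model = train_classifier([triples[i] for i in train_idx])   # e.g. 10 Adam iterations, dropout 0.2
            accuracies.append(evaluate(model, [triples[i] for i in test_idx]))
        return float(np.mean(accuracies)), float(np.std(accuracies))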


5. Results and Discussion
The results of the experiments are presented in Figure 4, where the x-axis denotes the control
over quantitative similarity (via the rescoring factor α) and the y-axis denotes the classifier
3 The code and full experimental data are available at https://github.com/timobaumann/duboptimal.




accuracy percentage (an accuracy around 50 % indicates that the classifier is unable to
differentiate true dubbing from MT). The standard deviation across folds for each of the values is in the range
of 1-4 percentage points. Thus, while we have not performed significance tests across folds, we
feel confident that differences reported below are likely ‘real’.
   The figure shows that the classifier performs best with full input. Translations in the
relevant α-range that yield good quantity (but still reasonable translations) are harder to
classify, indicating that the adversarial task is particularly difficult under these circumstances. Only
retaining the syllabic structure (C vs. V) yields the worst performance (only marginally above
chance) and can be considered hardly helpful – this debunks the common misunderstanding
that all that needs to be kept in dubbing is the right number of syllables.
   Leaving out punctuation and whitespace has some but not radical effects (probably within
the margin of error) indicating that both prosodic phrasing and lexicomorphology do not need
to be strictly retained while translating for dubbing; instead, these allow for some degree of
freedom to better match other aspects.
   Regarding lip closure and jaw movements, we find that (a) removing vowel information (all
consonants vs. V) only hurts a little, whereas retaining only vowel information (all vowels
vs. C) leads to a considerable performance drop. From this we conclude that matching the
opening angle of the jaw is at least not done through vowel choice, and may be less critical
than described in the literature.
   (b) In contrast, removing vowel information and even reducing the consonant information
to whether it is a bilabial or not (bilabials vs. C vs. V) yields surprisingly high performance
(even better than retaining all consonants, possibly because the model learns more easily with
fewer input symbols), which indicates that lip closure is indeed closely observed in the dubbing
corpus.




Figure 4: Plots of classifier performance (in %) for the ablation settings across all values for the control of
quantitative similarity (α). The less relevant ranges for α are shaded in gray.




6. Summary and Conclusion
We have studied the importance of aspects of qualitative similarity in dubbing, in particular
when quantitative similarity is controlled for. The literature in translatology for dubbing posits
that jaw movement and lip closure are critical aspects to be observed in dubbing. However, we
found no study prior to ours that investigates the relative importance of these aspects, measures
their importance in an objective way, or investigates further potential influences
such as lexicomorphology and prosodic phrasing. We have presented an ablation study to try
to find those features that are particularly relevant to discern qualitatively ignorant NMT from
true dubbing using a neural siamese classifier.
   Our results confirm the importance of matching lip closures in dubbing: good dubbing
requires a close match of lip closures between source and target speech. By comparison,
the opening angle of the jaw (which intrinsically varies between different vowel types) appears
to be far less important. Our quantification of dubbing constraints paves the way towards
further optimization of machine translation for dubbing, as it enables the training or adaptation
procedure to take these constraints into account. Additionally, our classifier could directly be
included into NMT via an adversarial learning procedure.
   Our experiments yield objective evidence about the importance of qualitative aspects for
dubbing. However, we acknowledge that further research is needed. In particular, our study is
restricted to the textual form and does not include the speech signal in the corpus, which would
allow for a better temporal alignment analysis. Furthermore, our analysis uses the full corpus
rather than only those parts where the face is visible on-screen (and hence qualitative aspects
matter).4 Finally, the ultimate evaluation gold standard for dubbing would be a user study
that compares different dubbing alternatives. This could be used to directly optimize towards
human judgements of dubbing alternatives (which might even differ with user preferences),
or towards information retention for educational material, in order to estimate the distraction
caused by less-than-ideal dubbing.
   More broadly, we believe that ablation studies are a suitable tool in computational humani-
ties research as they can help to objectively analyse and quantify the various aspects of existing
humanistic theories of complex phenomena such as the one studied here.


Acknowledgments
The second author’s work was performed during an internship at Universität Hamburg which
was partially supported by Volkswagen Foundation under the funding codes 91926 and 93255.
We thank the anonymous reviewers for their insightful remarks.


References
 [1]    D. Bahdanau, K. Cho, and Y. Bengio. “Neural Machine Translation by Jointly Learning
        to Align and Translate”. In: Proceedings of the International Conference on Learning
        Representations (ICLR). 2014. arXiv: 1409.0473. url: http://arxiv.org/abs/1409.0473.



4 A tool for on- vs. off-screen detection has become available only very recently [12].




 [2]   J. Bromley, I. Guyon, Y. LeCun, E. Säckinger, and R. Shah. “Signature verification using
       a “siamese” time delay neural network”. In: Advances in Neural Information Processing
       Systems. Ed. by J. Cowan, G. Tesauro, and J. Alspector. Vol. 6. San Francisco, USA:
       Morgan-Kaufmann, 1994, pp. 737–744. url: https://proceedings.neurips.cc/paper/1993/
       file/288cc0ff022877bd3df94bc9360b9c5d-Paper.pdf.
 [3]   F. Chaume. Audiovisual translation: Dubbing. St. Jerome Publishing, 2012.
 [4]   K. Cho, B. van Merriënboer, D. Bahdanau, and Y. Bengio. “On the Properties of Neural
       Machine Translation: Encoder–Decoder Approaches”. In: Proceedings of SSST-8, Eighth
       Workshop on Syntax, Semantics and Structure in Statistical Translation. Doha, Qatar:
       Association for Computational Linguistics, 2014, pp. 103–111. doi: 10.3115/v1/W14-4012.
       url: https://aclanthology.org/W14-4012.
 [5]   K. Cho, B. van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y.
       Bengio. “Learning Phrase Representations using RNN Encoder–Decoder for Statistical
       Machine Translation”. In: Proceedings of the 2014 Conference on Empirical Methods in
       Natural Language Processing (EMNLP). Doha, Qatar: Association for Computational
       Linguistics, 2014, pp. 1724–1734. doi: 10.3115/v1/D14-1179. url: https://aclanthology.org/D14-1179.
 [6]   K. Cho, B. van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y.
       Bengio. “Learning Phrase Representations using RNN Encoder–Decoder for Statistical
       Machine Translation”. In: Proceedings of the 2014 Conference on Empirical Methods in
       Natural Language Processing (EMNLP). Doha, Qatar: Association for Computational
       Linguistics, 2014, pp. 1724–1734. doi: 10.3115/v1/D14-1179. url: https://aclanthology.
       org/D14-1179.
 [7]   X. Hu, W. Li, X. Lan, H. Wu, and H. Wang. “Improved Beam Search with Constrained
       Softmax for NMT”. In: Proceedings of Machine Translation Summit XV: Papers. Miami,
       USA, 2015, pp. 297–309. url: https://aclanthology.org/2015.mtsummit-papers.23.
 [8]   N. Kalchbrenner and P. Blunsom. “Recurrent continuous translation models”. In: Pro-
       ceedings of the 2013 Conference on Empirical Methods in Natural Language Processing
       (EMNLP). Association for Computational Linguistics, 2013, pp. 1700–1709. url: https:
       //aclanthology.org/D13-1176.
 [9]   D. P. Kingma and J. Ba. “Adam: A Method for Stochastic Optimization”. In: 3rd Inter-
       national Conference on Learning Representations, (ICLR 2015). Ed. by Y. Bengio and
       Y. LeCun. San Diego, USA, 2015. url: http://arxiv.org/abs/1412.6980.
[10]   P. Lison and J. Tiedemann. “OpenSubtitles2016: Extracting Large Parallel Corpora from
       Movie and TV Subtitles”. In: Proceedings of the Tenth International Conference on Lan-
       guage Resources and Evaluation (LREC’16). Ed. by N. Calzolari, K. Choukri, T. Declerck,
       S. Goggi, M. Grobelnik, B. Maegaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk, and S.
       Piperidis. Portorož, Slovenia: European Language Resources Association (ELRA), 2016.
       url: https://aclanthology.org/L16-1147.
[11]   X. Martínez. “Film dubbing: Its process and translation”. In: Topics in Audiovisual
       Translation. Ed. by P. Orero. John Benjamins Publishing, 2004, pp. 18–22.
[12]   S. Nayak, T. Baumann, S. Bhattacharya, A. Karakanta, M. Negri, and M. Turchi. “See
       me speaking? Differentiating on whether words are spoken on screen or off to optimize
       machine dubbing”. In: ICMI Companion: 1st Int. Workshop on Deep Video Understanding.
        ACM, 2020, pp. 130–134. doi: 10.1145/3395035.3425640.




[13]   G. Neubig, C. Dyer, Y. Goldberg, A. Matthews, W. Ammar, A. Anastasopoulos, M.
       Ballesteros, D. Chiang, D. Clothiaux, T. Cohn, K. Duh, M. Faruqui, C. Gan, D. Garrette,
       Y. Ji, L. Kong, A. Kuncoro, G. Kumar, C. Malaviya, P. Michel, Y. Oda, M. Richardson,
       N. Saphra, S. Swayamdipta, and P. Yin. “DyNet: The Dynamic Neural Network Toolkit.”
       In: CoRR abs/1701.03980 (2017). arXiv: 1701.03980 [stat.ML]. url: http://arxiv.org/
       abs/1701.03980.
[14]   A. Öktem, M. Farrús, and A. Bonafonte. “Bilingual Prosodic Dataset Compilation for
        Spoken Language Translation”. In: Proc. IberSPEECH 2018. ISCA, 2018, pp. 20–24. doi:
       10.21437/IberSPEECH.2018-5.
[15]   A. Öktem, M. Farrús, and A. Bonafonte. “Prosodic Phrase Alignment for Machine Dub-
       bing”. In: Proc. Interspeech 2019. 2019, pp. 4215–4219. doi: 10.21437/Interspeech.2019-
       1621. url: http://dx.doi.org/10.21437/Interspeech.2019-1621.
[16]   P. Orero. Topics in audiovisual translation. Vol. 56. John Benjamins Publishing, 2004.
[17]   M. Ott, S. Edunov, A. Baevski, A. Fan, S. Gross, N. Ng, D. Grangier, and M. Auli.
       “fairseq: A Fast, Extensible Toolkit for Sequence Modeling”. In: Proceedings of the 2019
       Conference of the North American Chapter of the Association for Computational Lin-
       guistics (Demonstrations). Minneapolis, USA: Association for Computational Linguistics,
       2019, pp. 48–53. doi: 10.18653/v1/N19-4009. url: https://www.aclweb.org/anthology/
        N19-4009.
[18]   K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. “BLEU: a Method for Automatic
       Evaluation of Machine Translation”. In: Proceedings of the 40th Annual Meeting of the
       Association for Computational Linguistics. Philadelphia, Pennsylvania, USA: Association
       for Computational Linguistics, 2002, pp. 311–318. doi: 10.3115/1073083.1073135. url:
       https://aclanthology.org/P02-1040.
[19]   A. Saboo and T. Baumann. “Integration of Dubbing Constraints into Machine Transla-
       tion”. In: Proceedings of the Fourth Conference on Machine Translation (Volume 1: Re-
       search Papers). Florence, Italy: Association for Computational Linguistics, 2019, pp. 94–
       101. doi: 10.18653/v1/W19-5210. url: https://www.aclweb.org/anthology/W19-5210.
[20]   I. Sutskever, O. Vinyals, and Q. V. Le. “Sequence to sequence learning with neural
       networks”. In: Proceedings of the 27th International Conference on Neural Information
        Processing Systems - Volume 2. NIPS’14. Montreal, Canada: MIT Press, 2014, pp. 3104–
       3112.



