                       Are Subtitling Corpora really Subtitle-like?

Alina Karakanta¹,², Matteo Negri¹, Marco Turchi¹
¹ Fondazione Bruno Kessler, Via Sommarive 18, Povo, Trento, Italy
² University of Trento, Italy
{akarakanta,negri,turchi}@fbk.eu



Abstract

Growing needs in translating multimedia content have resulted in Neural Machine Translation (NMT) gradually becoming an established practice in the field of subtitling. Contrary to text translation, subtitling is subject to spatial and temporal constraints, which greatly increase the post-processing effort required to restore the NMT output to a proper subtitle format. In this work, we explore whether existing subtitling corpora conform to the constraints of 1) length and reading speed and 2) proper line breaks. We show that the process of creating parallel sentence alignments removes important time and line break information, and we propose practices for creating resources for subtitling-oriented NMT that are faithful to the subtitle format.

1   Introduction

Machine Translation (MT) of subtitles is a growing need for various applications, given the amount of online multimedia content becoming available daily. Subtitle translation is a complex process consisting of several stages (transcription, translation, timing), and manual approaches to the task are laborious and costly. Subtitles have to conform to spatial constraints, such as length, and temporal constraints, such as reading speed. While length and reading speed can be modelled as a post-processing step in an MT workflow using simple rules, subtitle segmentation, i.e. where and whether to insert a line break, depends on semantic and syntactic properties. Subtitle segmentation is particularly important, since it has been shown that proper segmentation by phrase or sentence significantly reduces reading time and improves comprehension (Perego, 2008; Rajendran et al., 2013).

Hence, there is ample room for developing fully or at least partially automated solutions for subtitle-oriented NMT, which would contribute to reducing post-processing effort and speeding up turnaround times. Automated approaches, though, especially NMT, are data-hungry: performance greatly depends on the availability of large amounts of high-quality data (up to tens of millions of parallel sentences) specifically tailored for the task. In the case of subtitle-oriented NMT, this implies having access to large subtitle training corpora. This leads to the following question: what should data specifically tailored for subtitling-oriented NMT look like?

There are large amounts of available parallel data extracted from subtitles (Lison and Tiedemann, 2016; Pryzant et al., 2018; Di Gangi et al., 2019). These corpora are usually obtained by collecting files in a subtitle-specific format (.srt) in several languages and then parsing and aligning them at sentence level. MT training at sentence level generally increases performance, as the system receives longer context (useful, for instance, to disambiguate words). As shown in Table 1, however, this process compromises the subtitle format by converting the subtitle blocks into full sentences. With this "merging", information about subtitle segmentation (line breaks) is often lost. Therefore, the MT output has to be restored to a proper subtitle format afterwards, either through post-editing or by using hand-crafted rules and boundary predictions. Integrating the subtitling constraints into the model can help reduce the post-processing effort, especially in cases where the input is a stream of data, as in end-to-end Speech Neural Machine Translation. To date, there has been no study examining the consequences of obtaining parallel sentences from subtitles on preserving the subtitling constraints.

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
1
00:00:14,820 --> 00:00:18,820
Grazie mille, Chris.
È un grande onore venire

2
00:00:18,820 --> 00:00:22,820
su questo palco due volte.
Vi sono estremamente grato.

Grazie mille, Chris.
È un grande onore venire su questo palco due volte.
Vi sono estremamente grato.

Table 1: Subtitle blocks (top, 1-2) as they appear in an .srt file and the processed output for obtaining aligned sentences (bottom).
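To make the merging step concrete, here is a minimal sketch in Python (our illustration, not the pipeline used by any of the corpora) of how the .srt blocks of Table 1 can be parsed and joined into sentence candidates; the time codes and line breaks are exactly the information that gets discarded:

import re

# Minimal .srt parsing: a block is an index line, a time-code line and one
# or more text lines, separated from the next block by a blank line.
TIME_RE = re.compile(r"(\d\d:\d\d:\d\d,\d\d\d) --> (\d\d:\d\d:\d\d,\d\d\d)")

def parse_srt(raw):
    """Yield (start, end, text_lines) for each subtitle block."""
    for block in raw.strip().split("\n\n"):
        lines = block.strip().splitlines()
        match = TIME_RE.search(lines[1])
        if match:
            yield match.group(1), match.group(2), lines[2:]

def merge_into_sentences(blocks):
    """Join the block texts and re-split at sentence-final punctuation,
    losing time codes and line breaks in the process (cf. Table 1)."""
    text = " ".join(" ".join(text_lines) for _, _, text_lines in blocks)
    return re.split(r"(?<=[.!?])\s+", text)

raw = ("1\n00:00:14,820 --> 00:00:18,820\n"
       "Grazie mille, Chris.\nÈ un grande onore venire\n\n"
       "2\n00:00:18,820 --> 00:00:22,820\n"
       "su questo palco due volte.\nVi sono estremamente grato.")
print(merge_into_sentences(list(parse_srt(raw))))
# ['Grazie mille, Chris.', 'È un grande onore venire su questo palco due volte.',
#  'Vi sono estremamente grato.']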
In this work, we explore whether the large, publicly available parallel data compiled from subtitles conform to the temporal and spatial constraints necessary for achieving quality subtitles. We compare the existing resources to an adaptation of MuST-C (Di Gangi et al., 2019), where the data is kept as subtitles. For evaluating length and reading speed, we employ character counts, while for proper line breaks we use the Chink-Chunk algorithm (Liberman and Church, 1992). Based on the analysis, we discuss limitations of the existing data and present a preliminary road-map towards creating resources for training subtitling-oriented NMT faithful to the subtitle format.

2   Related work

2.1   Subtitling corpora

Building an end-to-end subtitle-oriented translation system poses several challenges, mainly related to the fact that NMT training needs large amounts of high-quality data representative of the target application scenario (subtitling, in our case). Human subtitlers translate either directly from the audio/video, or they are provided with a template with the source text already in the format of subtitles, containing time codes and line breaks, which they have to adhere to when translating.

Several projects have attempted to collect parallel subtitling corpora. The most well-known is the OpenSubtitles¹ corpus (Lison and Tiedemann, 2016), extracted from 3.7 million subtitles across 60 languages. Since subtitle blocks do not always correspond to sentences (see Table 1), the blocks are merged and then segmented into sentences using heuristics based on time codes and punctuation. The extracted sentences are then aligned to create parallel corpora with the time-overlap algorithm (Tiedemann, 2008) and bilingual dictionaries. The 2018 version of OpenSubtitles has high-quality sentence alignments; however, it does not resemble the realistic subtitling scenario described above, since time and line break information are lost in the merging process. The same methodology was used for compiling MontenegrinSubs (Božović et al., 2018), an English–Montenegrin parallel corpus of subtitles, which contains only 68k sentences.

The Japanese-English Subtitle Corpus JESC (Pryzant et al., 2018) is a large parallel subtitling corpus consisting of 2.8 million sentences. It was created by crawling the internet for film and TV subtitles and aligning their captions with improved document and caption alignment algorithms. This corpus is aligned at caption level, so its format is closer to our scenario. On the other hand, non-matching alignments are discarded, which might hurt the integrity of the subtitle documents. As we will show, this is particularly important for learning proper line breaks between subtitle blocks.

A corpus preserving both subtitle segmentation and the order of lines is SubCo (Martínez and Vela, 2016), a corpus of machine- and human-translated subtitles for English–German. However, it only consists of 2 source texts (∼150 captions each) with multiple student and machine translations. It is therefore not sufficient for training MT systems, although it could be useful for evaluation thanks to its multiple reference translations.

Slightly deviating from the domain of films and TV series, corpora for Spoken Language Translation (SLT) have been created from TED talks. The Web Inventory of Transcribed and Translated Talks (WIT³) (Cettolo et al., 2012) is a multilingual collection of transcriptions and translations of TED talks, aligned at sentence level without audio information. Based on WIT³, the IWSLT campaigns (Niehues et al., 2018) annually release parallel data and the corresponding audio for the task of SLT, extracted based on time codes but again with merging operations to create segments. MuST-C (Di Gangi et al., 2019) is to date the largest multilingual corpus for end-to-end speech translation.

¹ http://www.opensubtitles.org/
It contains (audio, source language transcription, target language translation) triplets, aligned at segment level. The process of creation is the opposite of IWSLT's: the authors first align the written parts and then match the audio. This is a promising corpus for an end-to-end system which translates from audio directly into subtitles. However, the translations are merged to create sentences, therefore they are far from the suitable subtitle format. Given the challenges discussed above, there exists no systematic study of the suitability of the existing corpora for subtitling-oriented NMT.

2.2   Subtitle segmentation

Subtitle segmentation techniques have so far focused on monolingual subtitle data. Álvarez et al. (2014) trained Support Vector Machine and Logistic Regression classifiers on correctly/incorrectly segmented subtitles to predict line breaks. Extending this work, Álvarez et al. (2017) used a Conditional Random Field (CRF) classifier for the same task, also differentiating between line breaks (next subtitle line) and subtitle breaks (next subtitle block). Recently, Song et al. (2019) employed a Long Short-Term Memory network (LSTM) to predict the position of the period in order to improve the readability of automatically generated YouTube captions. To our knowledge, there is to date no approach attempting to learn bilingual subtitle segmentation or to incorporate subtitle segmentation in an end-to-end NMT system.

3   Criteria for assessing subtitle quality

3.1   Background

The quality of translated subtitles is evaluated not only in terms of fluency and adequacy, but also based on their format. We assess whether the available subtitle corpora conform to the constraints of length, reading speed (for the corpora where time information is available) and proper line breaks, on the basis of the criteria for subtitle segmentation mentioned in the Audiovisual Translation (AVT) literature (Cintas and Remael, 2007) and the TED talk subtitling guidelines²:

1. Characters per line. The space available for a subtitle is limited. The length of a subtitle depends on different factors, such as size of screen, font, age of the audience and country. For our analysis, we consider a maximum of 42 characters per line for Latin alphabets and 14 for Japanese (including spaces).

2. Lines per subtitle. Subtitles should not take up too much space on screen; the space allowed for a subtitle is about 20% of the screen. Therefore, a subtitle block should not exceed 2 lines.

3. Reading speed. The on-air time of a subtitle should be sufficient for the audience to read and process its content. The subtitle should match as much as possible the start and the end of an utterance, and the duration of the utterance (measured either in seconds or in feet/frames) is directly equivalent to the space a subtitle should occupy. As a general rule, we consider a maximum of 21 characters per second.

4. Preserve 'linguistic wholes'. This criterion is related to subtitle segmentation, which does not rely only on the allowed length but should also respect linguistic norms. To facilitate readability, subtitle splits should not "break" semantic and syntactic units. In the ideal case, every subtitle line (or at least every subtitle block) represents a coherent linguistic chunk (i.e. a sentence or a phrase). For example, a noun should not be separated from its article. Lastly, subtitles should respect natural pauses.

5. Equal length of lines. Another criterion for splitting subtitles relates to aesthetics. There is no consensus about whether the top line should be longer or shorter; however, it has been shown that subtitle lines of equal length are easier to read, because the viewer's eyes return to the same point on the screen when reading the second line.

While subtitle length and reading speed are factors that can be controlled directly by the subtitling software used by translators, subtitle segmentation is left to the translator's decision. Translators often have to either compromise the aesthetics in favour of the linguistic wholes, or resort to omissions and substitutions. Therefore, modelling the segmentation decisions based on the large available corpora is of great importance for a high-quality subtitle-oriented NMT system.

² https://www.ted.com/participate/translate/guidelines

3.2   Quality criteria filters

In order to assess the conformity of the existing subtitle corpora to the constraints mentioned above, we implement the following filters.
Characters per line (CPL): As mentioned above, the information about line breaks inside subtitle blocks is discarded in the process of creating parallel data. Therefore, we can only assume that a subtitle fulfils criteria 1 and 2 above by calculating the maximum possible length for a subtitle block: 2 × 42 = 84 characters for Latin scripts and 2 × 14 = 28 for Japanese. If CPL > max_length, the subtitle does not conform to the length constraints.

Characters per second (CPS): This metric relates to reading speed. For the corpora where time codes and duration are preserved, we calculate CPS as CPS = #chars / duration.
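For illustration, a minimal version of these two filters could look as follows (a sketch under the thresholds above; the function names and the duration argument are our own):

MAX_LINES = 2    # criterion 2: at most 2 lines per subtitle block
MAX_CPL = 42     # criterion 1: characters per line (14 for Japanese)
MAX_CPS = 21     # criterion 3: characters per second

def passes_cpl(text, max_cpl=MAX_CPL):
    """Since line breaks inside blocks are lost, check only the maximum
    possible block length: MAX_LINES * max_cpl characters."""
    return len(text) <= MAX_LINES * max_cpl

def passes_cps(text, duration_sec):
    """Reading speed: characters per second must not exceed MAX_CPS."""
    return len(text) / duration_sec <= MAX_CPS

subtitle = "È un grande onore venire su questo palco due volte."
print(passes_cpl(subtitle), passes_cps(subtitle, 4.0))  # True True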
Chink-Chunk: Chink-Chunk is a low-level parsing algorithm which can be used as a rule-based method to insert line breaks between subtitles. It is a simple but efficient way to detect syntactic boundaries. It relates to preserving linguistic wholes, since it uses POS information to split units only at punctuation marks (logical completion) or where an open-class or content word (chunk) is followed by a closed-class or function word (chink). Here, we use this algorithm to compute statistics about the type of subtitle block breaks in the data (punctuation break, content-function break or other), as described in Algorithm 1.

Algorithm 1: Chink-Chunk algorithm
 1  if POS_last in [PUNCT, SYM, X] then
 2      punc_break += 1;
 3  else
 4      if POS_last in content words and POS_next in function words then
 5          cf_break += 1;
 6      else
 7          other_split += 1;
 8      end
 9  end
10  return punc_break, cf_break, other_split
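A direct Python rendering of Algorithm 1 might look like this (a sketch; the assignment of Universal Dependencies UPOS tags to the content and function word classes is our assumption, not spelled out in the paper):

from collections import Counter

# Our assumed mapping of UPOS tags to word classes: open-class tags are
# content words (chunks), closed-class tags are function words (chinks).
CONTENT = {"NOUN", "PROPN", "VERB", "ADJ", "ADV", "NUM", "INTJ"}
FUNCTION = {"ADP", "AUX", "CCONJ", "SCONJ", "DET", "PART", "PRON"}

def classify_break(pos_last, pos_next):
    """Classify one subtitle block break as in Algorithm 1."""
    if pos_last in {"PUNCT", "SYM", "X"}:
        return "punc_break"
    if pos_last in CONTENT and pos_next in FUNCTION:
        return "cf_break"
    return "other_split"

def break_statistics(boundaries):
    """Count break types over (POS_last, POS_next) pairs at block boundaries."""
    return Counter(classify_break(last, nxt) for last, nxt in boundaries)

print(break_statistics([("PUNCT", "PRON"), ("VERB", "DET"), ("NOUN", "NOUN")]))
# Counter({'punc_break': 1, 'cf_break': 1, 'other_split': 1})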
4   Experiments

For our experiments we consider the corpora which are large enough to train NMT systems: OpenSubtitles, JESC and MuST-C. We focus on 3 language pairs, Japanese, Italian and German, each paired with English, as languages coming from different families and having a large portion of sentences in all corpora. We tokenise and then tag the data with Universal Dependencies³ to obtain POS tags for the Chink-Chunk algorithm.

³ https://universaldependencies.org/
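The paper does not name the tagging tool; as one possible setup, UPOS tags in the Universal Dependencies inventory can be obtained with an off-the-shelf pipeline such as stanza (our choice for illustration, not necessarily the authors'):

import stanza

# One-off model download, then a tokenisation + POS pipeline for Italian.
stanza.download("it")
nlp = stanza.Pipeline("it", processors="tokenize,pos")

doc = nlp("È un grande onore venire su questo palco due volte.")
print([(word.text, word.upos) for sent in doc.sentences for word in sent.words])
# e.g. [('È', 'AUX'), ('un', 'DET'), ('grande', 'ADJ'), ('onore', 'NOUN'), ...]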
To observe the effect of the merging processes on preserving the subtitling constraints, we create a version of MuST-C at subtitle level. We obtain the same .srt files used to create MuST-C and extract only the subtitles with matching timestamps from the common talks in each language pair, without any merging operations. Table 2 shows the statistics of the extracted corpus. We randomly sample 1,000 sentence pairs and manually inspect their alignments: 94% were correctly aligned, 3% partially aligned and 3% misaligned.

LP      Total   Extracted      MuST-C
EN-IT   671K    452K / 3.4M    253K / 4.8M
EN-DE   575K    361K / 2.7M    229K / 4.2M
EN-JA   669K    399K / 3M      -

Table 2: Total number of subtitles vs. number of extracted subtitles (in lines) from the TED talk .srt files vs. the original MuST-C corpus. The first number shows lines (or sentences, respectively), while the second shows words on the English side.

We apply each of the criteria filters in Section 3.2 to the corpora, on the source and the target side independently. Then, we take the intersection of the outputs of all the filters to obtain the lines/sentences which conform to all the criteria.
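In other words, if each filter returns the set of indices of the lines it accepts, the retained data is the intersection of those sets (a sketch with our own function names):

def retained_indices(src_lines, tgt_lines, filters):
    """Keep only the line pairs that pass every filter on both sides."""
    keep = set(range(len(src_lines)))
    for passes in filters:
        keep &= {i for i in keep
                 if passes(src_lines[i]) and passes(tgt_lines[i])}
    return sorted(keep)

# Example with a single length filter (84 = 2 lines * 42 characters):
length_ok = lambda line: len(line) <= 84
print(retained_indices(["Grazie mille, Chris.", "x" * 100],
                       ["Thank you so much, Chris.", "y" * 100],
                       [length_ok]))  # [0]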
5   Analysis

Table 3 shows the percentage of preserved lines/sentences after applying each criterion.

LP      Corpus          Format     Time   CPL (s/t) %   CPS (s/t) %   Chink-Chunk (s/t) %   Total %
EN-IT   MuST-C          segment    yes    49 / 48       78 / 72       99 / 99               45
EN-IT   OpenSubtitles   segment    -      95 / 94       -             99 / 99               91
EN-IT   MuST-C subs     subtitle   yes    99 / 98       86 / 81       87 / 83               79
EN-DE   MuST-C          segment    yes    51 / 47       77 / 66       99 / 99               42
EN-DE   OpenSubtitles   segment    -      95 / 95       -             99 / 99               92
EN-DE   MuST-C subs     subtitle   yes    99 / 98       84 / 75       87 / 87               74
EN-JA   OpenSubtitles   segment    -      96 / 93       -             99 / 98               91
EN-JA   JESC            subtitle   -      97 / 94       -             88 / 87               85
EN-JA   MuST-C subs     subtitle   yes    99 / 94       85 / 99       92 / 91               83

Table 3: Percentage of data preserved after applying each of the quality criteria filters on the subtitling corpora independently. Percentages are given on the source and target side (s/t), except for the Total, where source and target are combined.

Length: The analysis with the Characters per line filter shows that both OpenSubtitles and JESC conform to the quality criterion of length in at least 94% of the cases. Despite the merging operations to obtain sentence alignments, OpenSubtitles still preserves a short line length, possibly because of the nature of subtitle text: a manual inspection shows that it consists mainly of short dialogues, while the long sentences are parts of descriptions or monologues, which are rarer. On the other hand, the merging operations in MuST-C create long sentences that do not resemble the subtitling format. This could be attributed to the format of TED talks, which mostly contain text written to be spoken, in prepared talks usually delivered by one speaker with few dialogue turns. Among all corpora, MuST-C subs shows the highest conformity to the criterion of length, since indeed no merging operations were performed.

Reading speed: Conformity to the criterion of reading speed is achieved to a lesser degree, as shown by the Characters per second filter. Except for Japanese, where the allowed number of characters per line is lower, all other languages range between 66% and 86%. In general, MuST-C subs, being in subtitling format, seems to conform better to reading speed. Unfortunately, time information is not present in corpora other than the two versions of MuST-C, therefore a full comparison is not possible.

Linguistic wholes: The Chink-Chunk algorithm reveals interesting properties of the subtitle breaks in all the corpora. MuST-C and OpenSubtitles conform to the criterion of preserving linguistic wholes in 99% of the sentences, which is not the case for the corpora in subtitle format, JESC and MuST-C subs. Since these two corpora are compiled by removing captions based on unmatched time codes, the integrity of the documents is possibly broken: subtitles are removed arbitrarily, so consecutive subtitles are often not kept in order. This shows the importance of preserving the order of subtitles when creating subtitling corpora.

This observation might lead to the assumption that JESC and MuST-C subs are less subtitle-like. However, a closer inspection of the breaks shows that OpenSubtitles and MuST-C end in a punctuation mark in 99.9% of the cases. Even though they preserve logical completion, these corpora do not contain sufficient examples of line breaks preserving linguistic wholes. On the other hand, the subtitle-level corpora contain between 5% and 11% subtitle breaks of the content-function word type. In a realistic subtitling scenario, an NMT system at inference time will often receive unfinished sentences, either from an audio stream or from a subtitling template. Therefore, line break information might be valuable for training NMT systems that learn to translate and segment.

The total retained material shows that OpenSubtitles is the most suitable corpus for producing quality subtitles in all investigated languages, as more than 90% of the sentences passed the filters. However, this is not a fair comparison, given that the data was filtered with only 2 out of the 3 filters. One serious limitation of OpenSubtitles is the lack of time information, which does not allow for modelling reading speed. We showed that corpora in subtitling format (JESC, MuST-C subs) contain useful information about line breaks not ending in punctuation marks, which are mostly absent from OpenSubtitles. Since no information about subtitle line breaks (inside a subtitle block) is preserved in any of the corpora, the criterion of equal length of lines cannot be explored in this study.

6   Conclusions and discussion

We explored whether the existing parallel subtitling resources conform to the subtitling constraints. We found that subtitling corpora generally conform to length and proper line breaks, despite the merging operations for aligning parallel sentences. We isolated some missing elements: the lack of time information (duration of utterance) and the insufficient representation of line breaks other than at punctuation marks.

This raises several open issues for creating corpora for subtitling-oriented NMT: i) subtitling constraints: a subtitling corpus, in order to be representative of the task, should respect the subtitling constraints; ii) duration of utterance: since the translation of a subtitle depends on the duration of the utterance, time information is highly relevant; iii) integrity of documents: a subtitle often occupies several lines, therefore the order of subtitles should be preserved whenever possible;
iv) line break information: while parallel sentence alignments are indispensable, they should not compromise line break and subtitle block information. Break information could be preserved by inserting special symbols, as illustrated below.
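For instance (our notation, not one defined in the paper), line breaks inside a block and block boundaries could be marked with special tokens such as <eol> and <eob> before sentence alignment:

# Sketch: flatten subtitle blocks into one string while keeping break
# information as special tokens (<eol> = line break, <eob> = end of block).
blocks = [["È un grande onore venire", "su questo palco due volte."],
          ["Vi sono estremamente grato."]]
tagged = " <eob> ".join(" <eol> ".join(lines) for lines in blocks)
print(tagged)
# È un grande onore venire <eol> su questo palco due volte. <eob> Vi sono estremamente grato.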
                                                              Movie and TV subtitles. In Proceedings of the In-
   We intend to use these observations for an                 ternational Conference on Language Resources and
adaptation of MuST-C, containing triplets (audio,             Evaluation, LREC.
source language subtitle, target language subtitle),
                                                           José Manuel Martı́nez Martı́nez and Mihaela Vela.
preserving line break information and taking ad-
                                                              2016. SubCo: A learner translation corpus of hu-
vantage of natural pauses in the audio. In the long           man and machine subtitles. In Language Resources
run, we would like to train NMT systems which                 and Evaluation Conference (LREC).
predict line breaks while translating, possibly ex-
                                                           Jan Niehues, Roldano Cattoni, Mauro Cettool Sebas-
tending the input context using methods from doc-             tian Stuke an, Marco Turchi, and Marcello Federico.
ument level translation.                                      2018. The IWSLT 2018 evaluation campaign. In
                                                              Proceedings of IWSLT 2018.
Acknowledgements
                                                           Elisa Perego. 2008. Subtitles and line-breaks: Towards
This work is part of a project financially supported          improved readability. Between Text and Image: Up-
by an Amazon AWS ML Grant.                                    dating research in screen translation, 78(1):211–
                                                              223.
                                                           Reid Pryzant, Yongjoo Chung, Dan Jurafsky, and
References                                                   Denny Britz. 2018. JESC: Japanese-English Sub-
Aitor Álvarez, Haritz Arzelus, and Thierry Etchegoy-        title Corpus. Language Resources and Evaluation
  hen. 2014. Towards customized automatic segmen-            Conference (LREC).
  tation of subtitles. In Advances in Speech and Lan-
  guage Technologies for Iberian Languages, pages          Dhevi J. Rajendran, Andrew T. Duchowski, Pilar
  229–238, Cham. Springer International Publishing.          Orero, Juan Martnez, and Pablo Romero-Fresco.
                                                             2013. Effects of text chunking on subtitling: A
Aitor Álvarez, Carlos-D. Martı́nez-Hinarejos, Haritz        quantitative and qualitative examination. Perspec-
  Arzelus, Marina Balenciaga, and Arantza del Pozo.          tives, 21(1):5–21.
  2017. Improving the automatic segmentation of
  subtitles through conditional random field.      In      Hye-Jeong Song, Hong-Ki Kim, Jong-Dae Kim, Chan-
  Speech Communication, volume 88, pages 83–95.              Young Park, and Yu-Seop Kim. 2019. Inter-
  Elsevier BV.                                               sentence segmentation of YouTube subtitles using
                                                             long-short term memory (LSTM). 9:1504.
Petar Božović, Tomaž Erjavec, Jörg Tiedemann, Nikola
  Ljubešić, and Vojko Gorjanc.          2018.   Opus-    Jörg Tiedemann. 2008. Synchronizing translated
  Montenegrinsubs 1.0: First electronic corpus of the          movie subtitles. In Proceedings of the International
  Montenegrin language. In Conference on Language              Conference on Language Resources and Evaluation,
  Technologies & Digital Humanities.                           LREC 2008.

Mauro Cettolo, Christian Girardi, and Marcello Fed-
 erico. 2012. Wit3 : Web Inventory of Transcribed
 and Translated Talks. In Proceedings of the 16th
 Conference of the European Association for Ma-
 chine Translation (EAMT), pages 261–268, Trento,
 Italy, May.
Jorge Diaz Cintas and Aline Remael. 2007. Audiovi-
   sual Translation: Subtitling. Translation practices
   explained. Routledge.
Mattia Antonino Di Gangi, Roldano Cattoni, Luisa
 Bentivogli, Matteo Negri, and Marco Turchi. 2019.
 MuST-C: a multilingual speech translation corpus.
 In Proceedings of the 2019 Conference of the North
 American Chapter of the Association for Computa-
 tional Linguistics: Human Language Technologies,
 Volume 2 (Short Papers), Minneapolis, MN, USA,
 June.