Point Break: Surfing Heterogeneous Data for Subtitle Segmentation

Alina Karakanta1,2, Matteo Negri1, Marco Turchi1
1 Fondazione Bruno Kessler, Via Sommarive 18, Povo, Trento - Italy
2 University of Trento, Italy
{akarakanta,negri,turchi}@fbk.eu

Abstract

Subtitles, in order to achieve their purpose of transmitting information, need to be easily readable. The segmentation of subtitles into phrases or linguistic units is key to their readability and comprehension. However, automatically segmenting a sentence into subtitles is a challenging task and data containing reliable human segmentation decisions are often scarce. In this paper, we leverage data with noisy segmentation from large subtitle corpora and combine them with smaller amounts of high-quality data in order to train models which perform automatic segmentation of a sentence into subtitles. We show that even a minimum amount of reliable data can lead to readable subtitles and that quality is more important than quantity for the task of subtitle segmentation.1

1 Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

In a world dominated by screens, subtitles are a vital means for facilitating access to information for diverse audiences. Subtitles are classified as interlingual (subtitles in a different language than the original video) and intralingual (in the same language as the original video) (Bartoll, 2004). Viewers normally resort to interlingual subtitles because they do not speak the language of the original video, while intralingual subtitles (also called captions) are used by people who cannot rely solely on the original audio for comprehension. Such viewers are, for example, the deaf and hard of hearing and language learners. Apart from creating a bridge towards information, entertainment and education, subtitles are a means of improving the reading skills of children and immigrants (Gottlieb, 2004). With such a large pool of users and such a wide variety of functions, subtitling is probably the most dominant form of Audiovisual Translation.

Subtitles, however, in order to fulfil the purposes described above, need to be presented on the screen in a way that facilitates readability and comprehension. Bartoll and Tejerina (2010) claim that subtitles which cannot be read, or can be read only with difficulty, 'are almost as bad as no subtitles at all'. Creating readable subtitles comes with several challenges. The difficulty imposed by the transition to a different semiotic means, which takes place when transcribing or translating the original audio into text, is further exacerbated by the limitations of the medium (time and space on screen). Subtitles should not exceed a maximum length, usually ranging between 35-46 characters, depending on screen size and audience age or preferences. They should also be presented at a comfortable reading speed for the viewer. Moreover, chunking or segmentation, i.e. the way a subtitle is split across the screen, has a great impact on comprehension. Studies have shown that a proper segmentation can balance gazing behaviour and subtitle reading (Perego, 2008; Rajendran et al., 2013). Each subtitle should, if possible, have a logical completion. This is equivalent to a segmentation by phrase, sentence or unit of information. Where and whether to insert a subtitle break depends on several factors, such as speech rhythm and pauses, but also semantic and syntactic properties. All this makes segmenting a full sentence into subtitles a complex and challenging problem.
Developing automatic solutions for subtitle segmentation has long been impeded by the lack of representative data. Line breaks are the new lines inside a subtitle block, which are used to split a long subtitle into two shorter lines. This type of break is not present in the subtitle files used to create large subtitling corpora such as OpenSubtitles (Lison and Tiedemann, 2016) and corpora based on TED Talks (Cettolo et al., 2012; Di Gangi et al., 2019), possibly because of encoding issues and the pre-processing of the subtitles into parallel sentences (Karakanta et al., 2019). Recently, MuST-Cinema (Karakanta et al., 2020b), a corpus based on TED Talks, was released, which added the missing line breaks from the subtitle files (.srt2) using an automatic annotation procedure. This makes MuST-Cinema a high-quality resource for the task of subtitle segmentation. However, the size of MuST-Cinema (about 270k sentences) might not be sufficient for developing automatic solutions based on data-hungry neural-network approaches, and its language coverage is so far limited to 7 languages. On the other hand, the OpenSubtitles corpus, despite being rather noisy, constitutes a large resource of subtitling data.

2 http://zuggy.wz.cz/

In this work, we leverage the available subtitling resources in different resource conditions to train models which automatically segment sentences into readable subtitles. The goal is to exploit the advantages of the available resources, i.e. size for OpenSubtitles and quality for MuST-Cinema, for maximising segmentation performance, while also taking into account training efficiency and cost. We experiment with a sequence-to-sequence model, which we train and fine-tune on different amounts of data. More specifically, we hypothesise the condition where data containing high-quality segmentation decisions are scarce or non-existent and we resort to existing resources (OpenSubtitles). We show that high-quality data representative of the task, even in small amounts, are key to finding the break points for readable subtitles.
2 Related work

Automatically segmenting text into subtitles has long been addressed as a post-processing step in a translation/transcription pipeline. In industry, language-specific rules and simple algorithms are employed for this purpose. Most academic approaches to subtitle segmentation make use of a classifier which predicts subtitle breaks. One of these approaches used Support Vector Machine and Logistic Regression classifiers on correctly/incorrectly segmented subtitles to determine subtitle breaks (Álvarez et al., 2014). Extending this work, Álvarez et al. (2017) trained a Conditional Random Field (CRF) classifier for the same task, but in this case making a distinction between line breaks (next subtitle line) and subtitle breaks (next subtitle block). A more recent, neural-based approach (Song et al., 2019) employed a Long Short-Term Memory network (LSTM) to predict the position of the period in order to improve the readability of automatically generated YouTube captions, but without focusing specifically on the segmentation of subtitles. Focusing on the length constraint, Liu et al. (2020) proposed adapting an Automatic Speech Recognition (ASR) system to incorporate transcription and text compression, with a view to generating more readable subtitles.

A recent line of work has paved the way for Neural Machine Translation systems which generate translations segmented into subtitles, here in a bilingual scenario. Matusov et al. (2019) customised an NMT system for subtitles and introduced a segmentation module based on human segmentation decisions learned from OpenSubtitles and on penalties well established in the subtitling industry. Karakanta et al. (2020a) were the first to propose an end-to-end solution for Speech Translation into subtitles. Their findings indicated the importance of prosody, and more specifically pauses, for achieving subtitle segmentation in line with the speech rhythm. They further confirmed the different roles of line breaks (new line inside a subtitle block) and subtitle block breaks (the next subtitle appears on a new screen): while block breaks depend on speech rhythm, line breaks follow syntactic patterns. All this shows that subtitle segmentation is a complex and dynamic process which depends on several and varied factors.

3 Methodology

This section describes the data processing, model and evaluation used for the experiments. All experiments are run for English, as the language with the largest amount of available resources, but the approach is easily extended to all languages. Note that here we are focusing on a monolingual scenario, where subtitle segmentation is seen as a sequence-to-sequence task of passing from English sentences without break symbols to English sentences containing break symbols.

3.1 Data

As training data we use MuST-Cinema and OpenSubtitles. MuST-Cinema contains special symbols to indicate the breaks: <eob> for subtitle breaks and <eol> for line breaks inside a subtitle block. We train models using all data (MC-all) and only 100k sentences (MC-100).3

3 Training a model with 10k data did not bring good results.

The monolingual files for OpenSubtitles come in XML format, where each subtitle block forming a sentence is wrapped in XML tags. We are therefore able to insert the <eob> symbols marking the end of a subtitle block. However, as mentioned above, line breaks are not present in OpenSubtitles. We hence proceed to creating artificial annotations for <eol>. We filter all sentences for which all subtitles have a maximum length of 42 characters (OpenSubs-42). Then, for each <eob>, we substitute it with <eol> with a probability of 0.25, making sure to avoid having two consecutive <eol>, as this would lead to a subtitle of three lines, which occupies too much space on the screen. Since this length constraint results in filtering out a lot of data, we also relax it by allowing sentences with subtitles of up to 48 characters (OpenSubs-48). The motivation for this relaxation is that, if a sequence-to-sequence model is not able to learn the length constraint from the data but instead learns segmentation decisions based on patterns of neighbouring words, having more data will increase the amount and variety of segmentation decisions observed by the model. This may result in more plausible segmentation, possibly though at the expense of length conformity. Dataset sizes are reported in Table 1.

Table 1: Dataset sizes in sentences.

Data           Sentences
MuST-Cinema       275,085
OpenSubs-42       185,758
OpenSubs-48    13,713,708
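To make the artificial annotation procedure concrete, the sketch below shows one possible implementation of the OpenSubs-42 length filter and of the probabilistic substitution of <eob> with <eol>. It is a minimal illustration under our own assumptions (function names, the assumption that break symbols appear as separate whitespace-delimited tokens, and the way consecutive breaks are tracked), not the script used for the experiments.

```python
import random

def within_length_limit(sentence, max_len=42):
    """Keep only sentences whose subtitle blocks all fit in max_len characters.

    The input is assumed to be an OpenSubtitles sentence carrying an <eob>
    symbol at the end of every subtitle block (no <eol> yet), e.g.
    "I didn't know <eob> you were coming tonight. <eob>"
    """
    blocks = [b.strip() for b in sentence.split("<eob>") if b.strip()]
    return all(len(b) <= max_len for b in blocks)

def add_artificial_eol(sentence, eol_prob=0.25):
    """Replace each <eob> with <eol> with probability eol_prob,
    never producing two consecutive <eol> (i.e. a three-line subtitle)."""
    tokens = sentence.split()
    previous_break_was_eol = False
    for i, token in enumerate(tokens):
        if token == "<eob>":
            if not previous_break_was_eol and random.random() < eol_prob:
                tokens[i] = "<eol>"
                previous_break_was_eol = True
            else:
                previous_break_was_eol = False
    return " ".join(tokens)
```

Run over the OpenSubtitles monolingual file, only sentences passing within_length_limit would be kept (with max_len=48 for OpenSubs-48), and add_artificial_eol would then yield the mixed <eol>/<eob> annotation used for training.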
We are interested in the real application scenario where high-quality data containing human segmentation decisions are not available or are scarce. According to our hypothesis, a relatively limited amount of high-quality data can be compensated for by OpenSubtitles. Therefore, we fine-tune each of the OpenSubtitles models on 10k and 100k sentences from MuST-Cinema, which contain high-quality break annotations.

OpenSubtitles and TED Talks have been shown to have large differences and to constitute a sub-classification of the subtitling genre (Müller and Volk, 2013). For this reason, we experiment with 2 test sets for cross-domain evaluation. The first test set is the English one released with MuST-Cinema, containing 10 single-speaker TED Talks (545 sentences). The second test set (782 sentences) is much more diverse. In order to create it, we selected a mix of public and proprietary data, more specifically excerpts from a TV series, a documentary, two short interviews and one advertising video. The subtitling was performed by professional translators and the .srt files were processed to insert the break symbols in the positions where subtitle and line breaks occur.

3.2 Model

The model is a sequence-to-sequence model based on the Transformer architecture (Vaswani et al., 2017), trained using fairseq (Ott et al., 2019) with the same settings as in Karakanta et al. (2020b). It takes as input a full sentence and returns the same sentence annotated with subtitle and line breaks. We process the data into sub-word units with SentencePiece (Kudo and Richardson, 2018) with an 8K vocabulary size. The special symbols are kept as single sub-words. Models were trained until convergence on 1 Nvidia GeForce GTX 1080 Ti GPU.
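The snippet below illustrates one way to obtain this behaviour with the SentencePiece Python API, namely declaring the break symbols as user-defined symbols so that they are never split. The file names and the choice of this particular mechanism are our assumptions; the paper only states the 8K vocabulary and that the symbols are kept as single sub-words.

```python
import sentencepiece as spm

# Train an 8K sub-word model in which <eol> and <eob> are atomic tokens.
# "train.en" stands for the annotated training sentences (hypothetical name).
spm.SentencePieceTrainer.train(
    input="train.en",
    model_prefix="subseg8k",
    vocab_size=8000,
    user_defined_symbols=["<eol>", "<eob>"],
)

sp = spm.SentencePieceProcessor(model_file="subseg8k.model")
pieces = sp.encode(
    "Meditation is a technique <eol> of finding well-being <eob>",
    out_type=str,
)
# Each break symbol surfaces as exactly one piece in `pieces`.
```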
As baseline, we use a simple segmentation approach which inserts a break symbol at the first space before every 42 characters. Of the two types of symbols, <eol> is selected with a 0.25 probability, but we avoid inserting two consecutive <eol>, since this would lead to a subtitle of three lines.
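A minimal sketch of this baseline is given below. It follows our reading of the rule, segmenting greedily word by word so that a break is placed at the last space that keeps the current subtitle within 42 characters; the function name and the treatment of words longer than the limit are our own choices, not a description of the authors' implementation.

```python
import random

def length_baseline(sentence, max_len=42, eol_prob=0.25):
    """Insert <eol>/<eob> purely on length: break before the word that
    would push the current subtitle beyond max_len characters.
    <eol> is picked with probability eol_prob, never twice in a row."""
    out, current_len, previous_break_was_eol = [], 0, False
    for word in sentence.split():
        needed = len(word) + (1 if current_len else 0)  # +1 for the space
        if current_len and current_len + needed > max_len:
            if not previous_break_was_eol and random.random() < eol_prob:
                out.append("<eol>")
                previous_break_was_eol = True
            else:
                out.append("<eob>")
                previous_break_was_eol = False
            current_len = 0
        out.append(word)
        current_len += len(word) + (1 if current_len else 0)
    return " ".join(out)
```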
3.3 Evaluation

Subtitle segmentation is evaluated with the following metrics. First, we compute the precision, recall and F1-score between the output of the segmenter and the human-generated subtitles, in order to test the model's performance at inserting a sufficient number of breaks and at the right positions in the sentence. Additionally, we compute the BLEU score (Papineni et al., 2002) between the output of the segmenter and the human reference. Higher values for BLEU indicate a high similarity between the model's output and the desired output.

Finally, we want to check the performance of the system in generating readable subtitles, therefore we use an intrinsic, task-specific metric. We compute the percentage of subtitles with a length of <= 42 characters (Characters per Line - CPL), according to the TED subtitling guidelines. This shows the ability of the system to segment the sentences into readable subtitles, by producing subtitles that are not too long to appear on the screen. We additionally report training time, as efficiency and cost are important factors for scaling such methods to tens of languages.
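One possible operationalisation of these metrics is sketched below: precision, recall and F1 are computed over the positions (and types) of the predicted break symbols against the reference, and CPL conformity is the share of lines not exceeding 42 characters. Whether the break type must match for a prediction to count as correct, and how the scores are aggregated over a test set, are not spelled out in the paper, so this remains an illustrative assumption.

```python
def breaks_by_position(annotated):
    """Map the index of the word preceding each break to its break symbol."""
    breaks, idx = {}, 0
    for token in annotated.split():
        if token in ("<eol>", "<eob>"):
            breaks[idx] = token
        else:
            idx += 1
    return breaks

def break_prf(hypothesis, reference):
    """Precision/recall/F1 over break positions and types for one sentence pair."""
    hyp, ref = breaks_by_position(hypothesis), breaks_by_position(reference)
    correct = sum(1 for pos, sym in hyp.items() if ref.get(pos) == sym)
    precision = correct / len(hyp) if hyp else 0.0
    recall = correct / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def cpl_conformity(hypothesis, max_len=42):
    """Share of subtitle lines with at most max_len characters (CPL)."""
    lines = [l.strip()
             for l in hypothesis.replace("<eol>", "<eob>").split("<eob>")
             if l.strip()]
    return sum(len(l) <= max_len for l in lines) / len(lines) if lines else 1.0
```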
4 Results

Tables 2 and 3 show the results for the MuST-Cinema and the second test set respectively.

Table 2: Results for the MuST-Cinema test set. Training time in minutes.

Model          BLEU   Prec  Rec  F1   CPL (%)  Time
baseline       55.30    50   47  48      100      -
MC-all         84.00    85   85  85       96    305
MC-100         81.77    84   83  83       94    210
OpenSubs-42    72.24    86   66  73       74    270
  + MC-10      77.99    83   76  79       88    +26
  + MC-100     80.09    87   78  81       88   +250
OpenSubs-48    76.00    77   67  68       72   6980
  + MC-10      82.46    86   80  82       91   +240

Table 3: Results for the second test set. Training time in minutes.

Model          BLEU   Prec  Rec  F1   CPL (%)  Time
baseline       51.45    46   43  44      100      -
MC-all         66.38    72   64  69       97    305
MC-100         65.38    76   64  68       96    210
OpenSubs-42    61.41    84   56  65       79    270
  + MC-10      63.53    76   60  66       93    +26
  + MC-100     65.30    77   62  67       94   +250
OpenSubs-48    63.37    63   56  59       81   6980
  + MC-10      65.66    78   61  67       94   +240

As expected, the simple baseline achieves 100% conformity to the length constraint; it is, however, not accurate in inserting the breaks at the right positions, as shown by the very low BLEU (55.30 and 51.45) and F1 scores (48 and 44). The best performance for all metrics and both test sets is achieved when using all available MuST-Cinema data (MC-all). For the in-domain test set, BLEU and F1 are higher than for the out-of-domain test set; however, the number of subtitles conforming to the length constraint is consistently high (96% and 97%). This suggests that the systems trained on high-quality segmentation data are able to produce readable subtitles in terms of length in diverse testing conditions even without massive amounts of data. Even with 100k sentences of training data (MC-100), the performance of the model, which is the fastest to train, drops only slightly, with -2% for all metrics on the MuST-Cinema test set and -1% on the second test set. This shows that high efficiency can be achieved without dramatically sacrificing quality. This is particularly important for industry applications where tens of languages are involved and training data for a domain might not be vast.

The models trained only on OpenSubtitles show a great drop in performance on the MuST-Cinema test set, which is to be expected because of the different nature of the data. However, the drop is present also for the second test set, which shows that these models are not robust to different domains. Surprisingly, the larger model (OpenSubs-48) does not perform much better than the model with less data (OpenSubs-42), even though it is trained on almost 10 times as much data. This could be an indication of a trade-off between data quality and data size. OpenSubs-48, with more noisy data, has similar recall to OpenSubs-42, but it is much less accurate in the position of the breaks, as shown by the drop in precision (86 vs. 77 and 84 vs. 63). We conjecture that the procedure of artificially inserting <eol> symbols by changing the existing <eob> does not reflect the distribution of the types of breaks in real data. Interestingly, the OpenSubs-42 model, despite being trained only on subtitles with a maximum length of 42 characters, is not able to generate subtitles which respect the length constraint (74% and 79%). It is therefore possible that the segmenter does not learn to take the length constraint into consideration, but that the segmentation decisions are based on lexical patterns in the data, as also suggested by Karakanta et al. (2020a).

Fine-tuning, even on a minimum amount of real data, as shown when fine-tuning on 10k sentences of MuST-Cinema, can significantly boost the performance compared to the OpenSubtitles models and is a viable and fast solution towards readable subtitles. This corroborates the claim in favour of creating datasets which are representative of the task at hand. Surprisingly though, fine-tuning the OpenSubs-42 model on MC-100 does not improve over training the model from scratch on MC-100 for either test set. For the case when only a small amount of MuST-Cinema data is available (MC-10), having a larger base model on which to fine-tune (OpenSubs-48) is beneficial, since there is an improvement for all metrics and in both testing conditions compared to all other models trained on OpenSubtitles or fine-tuned on them. Therefore, we conclude that, in the presence of little data containing human segmentation decisions, a model trained on more data, even though possibly noisier, is a more robust base on which to fine-tune using the high-quality data. One considerable drawback is that the improvement comes at a training time 25 times longer than that of the other base model (OpenSubs-42), which raises significant considerations for cost and efficiency. Such a model, however, once trained, could be re-used for fine-tuning on several domains and for different client specifications.

5 Analysis and Discussion

We further perform a manual inspection to identify issues related to the models. We hypothesise that low precision is connected to over-splitting or splitting in wrong positions, while low recall suggests under-splitting (not inserting a sufficient number of breaks). Indeed, we observe that the OpenSubtitles models tend to over-segment short sentences, but under-segment longer sentences:

Reference:
Let's turn our attention to the hows.
(37 characters)

OpenSubs-42:
Let's turn our attention
to the hows.
(25 + 12 characters)

Reference:
My family's traditions
and expectations for a woman
wouldn't allow me to own a mobile
phone until I was married.
(22 + 28 + 39 + 20 characters)

OpenSubs-42:
My family's traditions and expectations
for a woman wouldn't allow me to own a mobile phone until I was married.
(39 + 72 characters)

In the following example, fine-tuning on MC increases length conformity, splitting the first subtitle in two, while MC-100k succeeds in segmenting all subtitles exceeding 42 characters, matching the reference segmentation.

Reference:
Meditation is a technique
of finding well-being
in the present moment
before anything happens.

OpenSubs-42:
Meditation is a technique of finding well-being
in the present moment before anything happens.
(47 + 46 characters)

OpenSubs-42 + MC 10K:
Meditation is a technique
of finding well-being
in the present moment before anything happens.
(25 + 21 + 46 characters)

MC-100K:
Meditation is a technique
of finding well-being
in the present moment
before anything happens.

The examples above confirm our results, which showed that the models do not explicitly learn the length constraint, but rather patterns of segmentation. From a syntactic point of view, the break symbols are inserted after a noun (e.g. attention, expectations) and before a preposition/conjunction (to, for, in, before), regardless of the model. The break symbols, even though they do not overlap with the human segmentation decisions, are inserted at plausible positions. This leads to subtitles that present logical completion, i.e. each subtitle is formed by a phrase or syntactic unit, even though they do not respect the length constraint. Conformity to the length constraint seems to be enforced only by the high-quality MuST-Cinema data. It is possible that the artificial break symbols in OpenSubtitles clash with the real break symbols in MuST-Cinema, which creates confusion for the model. Replacing some <eob> with <eol> symbols in OpenSubtitles, to simulate data where human-annotated line breaks exist, means that the models trained on OpenSubtitles observe a line break at positions where normally a subtitle break is present. Given the different functions of the two types of breaks, this is a possible explanation of why fine-tuning OpenSubs-42 on MC-100 performs worse than training on MC-100 from scratch, and it provides us with insights for the future design of artificial segmentation decisions to augment subtitling data.

6 Conclusion

We have presented methods to combine heterogeneous subtitling data in order to improve automatic segmentation of subtitles. We leverage large data containing noisy segmentation decisions from OpenSubtitles and combine them with smaller amounts of high-quality data from MuST-Cinema to generate readable subtitles from full sentences. We found that even limited data with reliable segmentation can improve performance. We conclude that quality matters more than size for determining the break points between subtitles.

Acknowledgments

This work is part of the "End-to-end Spoken Language Translation in Rich Data Conditions" project,4 which is financially supported by an Amazon AWS ML Grant.

4 https://ict.fbk.eu/units-hlt-mt-e2eslt/

References

Aitor Álvarez, Haritz Arzelus, and Thierry Etchegoyhen. 2014. Towards customized automatic segmentation of subtitles. In Advances in Speech and Language Technologies for Iberian Languages, pages 229–238, Cham. Springer International Publishing.

Aitor Álvarez, Carlos-D. Martínez-Hinarejos, Haritz Arzelus, Marina Balenciaga, and Arantza del Pozo. 2017. Improving the automatic segmentation of subtitles through conditional random field. In Speech Communication, volume 88, pages 83–95. Elsevier BV.

E. Bartoll and A. Martínez Tejerina. 2010. The positioning of subtitles for the deaf and hard of hearing. Listening to Subtitles. Subtitles for the Deaf and Hard of Hearing, pages 69–86.

Eduard Bartoll. 2004. Parameters for the classification of subtitles. Topics in Audiovisual Translation, 9:53–60.

Mauro Cettolo, Christian Girardi, and Marcello Federico. 2012. WIT3: Web Inventory of Transcribed and Translated Talks. In Proceedings of the 16th Conference of the European Association for Machine Translation (EAMT), pages 261–268, Trento, Italy, May.

Mattia Antonino Di Gangi, Roldano Cattoni, Luisa Bentivogli, Matteo Negri, and Marco Turchi. 2019. MuST-C: a Multilingual Speech Translation Corpus. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), Minneapolis, MN, USA, June.

Henrik Gottlieb. 2004. Language-political implications of subtitling. Topics in Audiovisual Translation, 9:83–100.

Alina Karakanta, Matteo Negri, and Marco Turchi. 2019. Are Subtitling Corpora really Subtitle-like? In Sixth Italian Conference on Computational Linguistics, CLiC-it.

Alina Karakanta, Matteo Negri, and Marco Turchi. 2020a. Is 42 the answer to everything in subtitling-oriented speech translation? In Proceedings of the 17th International Conference on Spoken Language Translation, pages 209–219, Online, July. Association for Computational Linguistics.

Alina Karakanta, Matteo Negri, and Marco Turchi. 2020b. MuST-Cinema: a Speech-to-Subtitles Corpus. In Proceedings of the 12th International Conference on Language Resources and Evaluation (LREC 2020), Marseille, France, May 13-15.

Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66–71, Brussels, Belgium, November. Association for Computational Linguistics.

Pierre Lison and Jörg Tiedemann. 2016. OpenSubtitles2016: Extracting large parallel corpora from movie and TV subtitles. In Proceedings of the International Conference on Language Resources and Evaluation, LREC.
Danni Liu, Jan Niehues, and Gerasimos Spanakis. 2020. Adapting end-to-end speech recognition for readable subtitles. In Proceedings of the 17th International Conference on Spoken Language Translation, pages 247–256, Online, July. Association for Computational Linguistics.

Evgeny Matusov, Patrick Wilken, and Yota Georgakopoulou. 2019. Customizing neural machine translation for subtitling. In Proceedings of the Fourth Conference on Machine Translation (Volume 1: Research Papers), pages 82–93, Florence, Italy, August. Association for Computational Linguistics.

Mathias Müller and Martin Volk. 2013. Statistical machine translation of subtitles: From OpenSubtitles to TED. In Iryna Gurevych, Chris Biemann, and Torsten Zesch, editors, Language Processing and Knowledge in the Web, pages 132–138, Berlin, Heidelberg. Springer Berlin Heidelberg.

Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of NAACL-HLT 2019: Demonstrations.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318. Association for Computational Linguistics.

Elisa Perego. 2008. Subtitles and line-breaks: Towards improved readability. Between Text and Image: Updating Research in Screen Translation, 78(1):211–223.

Dhevi J. Rajendran, Andrew T. Duchowski, Pilar Orero, Juan Martínez, and Pablo Romero-Fresco. 2013. Effects of text chunking on subtitling: A quantitative and qualitative examination. Perspectives, 21(1):5–21.

Hye-Jeong Song, Hong-Ki Kim, Jong-Dae Kim, Chan-Young Park, and Yu-Seop Kim. 2019. Inter-sentence segmentation of YouTube subtitles using long-short term memory (LSTM). 9:1504.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000–6010.