<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Point Break: Surfing Heterogeneous Data for Subtitle Segmentation</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Alina Karakanta</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Fondazione Bruno Kessler</institution>
          ,
          <addr-line>Via Sommarive 18, Povo, Trento</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Trento</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Subtitles, in order to achieve their purpose of transmitting information, need to be easily readable. The segmentation of subtitles into phrases or linguistic units is key to their readability and comprehension. However, automatically segmenting a sentence into subtitles is a challenging task and data containing reliable human segmentation decisions are often scarce. In this paper, we leverage data with noisy segmentation from large subtitle corpora and combine them with smaller amounts of high-quality data in order to train models which perform automatic segmentation of a sentence into subtitles. We show that even a minimum amount of reliable data can lead to readable subtitles and that quality is more important than quantity for the task of subtitle segmentation.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        In a world dominated by screens, subtitles are a
vital means for facilitating access to information
for diverse audiences. Subtitles are classified as
interlingual (subtitles in a different language from
the original video) and intralingual (in the same
language as the original video)
        <xref ref-type="bibr" rid="ref4">(Bartoll, 2004)</xref>
        .
Viewers normally resort to interlingual subtitles
because they do not speak the language of the
original video, while intralingual subtitles (also
called captions) are used by people who cannot
rely solely on the original audio for
comprehension. Such viewers are, for example, the deaf and
hard of hearing and language learners. Apart from
creating a bridge towards information,
entertainment and education, subtitles are a means of
improving the reading skills of children and
immigrants
        <xref ref-type="bibr" rid="ref7">(Gottlieb, 2004)</xref>
        . Having such a large pool
of users and covering a wide variety of functions,
subtitling is probably the most dominant form of
Audiovisual Translation.
      </p>
      <p>
        Subtitles, however, in order to fulfil their
purposes as described above, need to be presented
on the screen in a way that facilitates readability
and comprehension.
        <xref ref-type="bibr" rid="ref3">Bartoll and Tejerina (2010)</xref>
claim that subtitles which cannot be read or can be
read only with difficulty ‘are almost as bad as no
subtitles at all’. Creating readable subtitles comes
with several challenges. The difficulty imposed by
the transition to a different semiotic means, which
takes place when transcribing or translating the
original audio into text, is further exacerbated by
the limitations of the medium (time and space on
screen). Subtitles should not exceed a maximum
length, usually ranging between 35 and 46 characters,
depending on screen size and audience age or
preferences. They should also be presented at a
comfortable reading speed for the viewer. Moreover,
chunking or segmentation, i.e. the way a subtitle is
split across the screen, has a great impact on
comprehension. Studies have shown that a proper
segmentation can balance gazing behaviour and
subtitle reading
        <xref ref-type="bibr" rid="ref18 ref19">(Perego, 2008; Rajendran et al., 2013)</xref>
        .
Each subtitle should – if possible – have a logical
completion. This is equivalent to a segmentation
by phrase, sentence or unit of information. Where
and if to insert a subtitle break depends on
several factors such as speech rhythm, pauses but also
semantic and syntactic properties. This all makes
segmenting a full sentence into subtitles a complex
and challenging problem.
      </p>
      <p>
        Developing automatic solutions for subtitle
segmentation has long been impeded by the lack of
representative data. Line breaks are the new lines
inside a subtitle block, which are used to split
a long subtitle into two shorter lines. This type
of breaks is not present in the subtitle files used
to create large subtitling corpora such as
OpenSubtitles
        <xref ref-type="bibr" rid="ref12">(Lison and Tiedemann, 2016)</xref>
        and
corpora based on TED Talks
        <xref ref-type="bibr" rid="ref5 ref6">(Cettolo et al., 2012;
Di Gangi et al., 2019)</xref>
        , possibly because of
encoding issues and the pre-processing of the
subtitles into parallel sentences
        <xref ref-type="bibr" rid="ref8">(Karakanta et al.,
2019)</xref>
        . Recently, MuST-Cinema
        <xref ref-type="bibr" rid="ref10 ref9">(Karakanta et al.,
2020b)</xref>
        , a corpus based on TED Talks, was
released, which added the missing line breaks from
the subtitle files (.srt, http://zuggy.wz.cz/) using an automatic
annotation procedure. This makes MuST-Cinema a
high-quality resource for the task of subtitle
segmentation. However, the size of MuST-Cinema (about
270k sentences) might not be sufficient for
developing automatic solutions based on data-hungry
neural-network approaches, and its language
coverage is so far limited to 7 languages. On the
other hand, the OpenSubtitles corpus, despite
being rather noisy, constitutes a large resource of
subtitling data.
      </p>
      <p>In this work, we leverage available subtitling
resources in different resource conditions to train
models which automatically segment sentences
into readable subtitles. The goal is to exploit the
advantages of the available resources, i.e. size
for OpenSubtitles and quality for MuST-Cinema,
for maximising segmentation performance, but
also taking into account training efficiency and
cost. We experiment with a sequence-to-sequence
model, which we train and fine-tune on different
amounts of data. More specifically, we
hypothesise the condition where data containing
high-quality segmentation decisions is scarce or
non-existent and we resort to existing resources
(OpenSubtitles). We show that high-quality data,
representative of the task, even in small amounts, are a
key to finding the break points for readable
subtitles.
</p>
    </sec>
    <sec id="sec-2">
      <title>2 Related work</title>
      <p>
        Automatically segmenting text into subtitles has
long been addressed as a post-processing step in
a translation/transcription pipeline. In industry,
language-specific rules and simple algorithms are
employed for this purpose. Most academic
approaches on subtitle segmentation make use of
a classifier which predicts subtitle breaks. One
of these approaches used Support Vector
Machine and Logistic Regression classifiers on
correctly/incorrectly segmented subtitles to
determine subtitle breaks
        <xref ref-type="bibr" rid="ref1">(Álvarez et al., 2014)</xref>
        .
Extending this work, Álvarez et al. (2017) trained
a Conditional Random Field (CRF) classifier for
the same task, but in this case making a
distinction between line breaks (next subtitle line) and
subtitle breaks (next subtitle block). A more
recent, neural-based approach
        <xref ref-type="bibr" rid="ref20">(Song et al., 2019)</xref>
employed a Long Short-Term Memory (LSTM)
network to predict the position of the period in
order to improve the readability of automatically
generated YouTube captions, but without focusing
specifically on the segmentation of subtitles.
Focusing on the length constraint, Liu et al. (2020)
proposed adapting an Automatic Speech
Recognition (ASR) system to incorporate transcription and
text compression, with a view to generating more
readable subtitles.
      </p>
      <p>A recent line of works has paved the way for
Neural Machine Translation systems which
generate translations segmented into subtitles, here
in a bilingual scenario. Matusov et al. (2019)
customised an NMT system to subtitles and
introduced a segmentation module based on
human segmentation decisions trained on
OpenSubtitles and penalties well established in the
subtitling industry. Karakanta et al. (2020a) were the
first to propose an end-to-end solution for Speech
Translation into subtitles. Their findings indicated
the importance of prosody, and more specifically
pauses, to achieving subtitle segmentation in line
with the speech rhythm. They further confirmed
the different roles of line breaks (new line inside a
subtitle block) and subtitle block breaks (the next
subtitle appears on a new screen); while block
breaks depend on speech rhythm, line breaks
follow syntactic patterns. All this shows that subtitle
segmentation is a complex and dynamic process
that depends on several, varied factors.</p>
    </sec>
    <sec id="sec-3">
      <title>3 Methodology</title>
      <p>This section describes the data processing, model
and evaluation used for the experiments. All
experiments are run for English, as the language
with the largest amount of available resources, but
the approach is easily extended to all languages.
Note that here we are focusing on a monolingual
scenario, where subtitle segmentation is seen as
a sequence-to-sequence task of passing from
English sentences without break symbols to English
sentences containing break symbols.</p>
      <sec id="sec-3-1">
        <title>3.1 Data</title>
        <p>
As training data we use MuST-Cinema and
OpenSubtitles. MuST-Cinema contains special symbols
to indicate the breaks: &lt;eob&gt; for subtitle breaks
and &lt;eol&gt; for line breaks inside a subtitle block.
We train models using all data (MC-all) and only
100k sentences (MC-100); training a model with only
10k sentences did not bring good results.</p>
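        <p>For illustration, a training pair for the segmenter (input without breaks, output with breaks; adapted from the reference subtitle shown in Section 5) looks as follows:</p>
        <preformat>
Input:  My family's traditions and expectations for a woman wouldn't
        allow me to own a mobile phone until I was married.
Output: My family's traditions &lt;eol&gt; and expectations for a woman &lt;eob&gt;
        wouldn't allow me to own a mobile &lt;eol&gt; phone until I was married. &lt;eob&gt;
        </preformat>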
      <p>The monolingual files for OpenSubtitles come
in XML format, where each subtitle block
forming a sentence is wrapped in XML tags. We are
therefore able to insert the &lt;eob&gt; symbols for
determining the end of a subtitle block. However,
we mentioned that line breaks are not present in
OpenSubtitles. We hence proceed to create
artificial annotations for &lt;eol&gt;. We filter all
sentences for which all subtitles have a maximum
length of 42 characters (OpenSubs-42). Then, for
each &lt;eob&gt;, we substitute it with &lt;eol&gt; with a
probability of 0.25, making sure to avoid having
two consecutive &lt;eol&gt;, as this would lead to a
subtitle of three lines, which occupies too much
space on the screen. Since this length constraint
results in filtering out a lot of data, we also
relax the length constraint by allowing sentences
with subtitles with up to 48 characters
(OpenSubs-48). The motivation for this relaxation is that, if
a sequence-to-sequence model is not able to learn
the constraint of length from the data but instead
learns segmentation decisions based on patterns
of neighbouring words, having more data will
increase the amount and variety of segmentation
decisions observed by the model. This may result
in more plausible segmentations, possibly at
the expense of length conformity. Dataset sizes
are reported in Table 1.</p>
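        <p>The following minimal Python sketch illustrates this annotation procedure (an illustration of the described heuristic, not the exact implementation used in the experiments):</p>
        <preformat>
import random

def add_artificial_eol(sentence, p=0.25):
    """Turn &lt;eob&gt; (block break) into &lt;eol&gt; (line break) with
    probability p, never producing two &lt;eol&gt; in a row, which
    would yield a three-line subtitle."""
    out, last_break = [], None
    for tok in sentence.split():
        if tok == "&lt;eob&gt;" and last_break != "&lt;eol&gt;" and p &gt; random.random():
            tok = "&lt;eol&gt;"
        if tok in ("&lt;eob&gt;", "&lt;eol&gt;"):
            last_break = tok
        out.append(tok)
    return " ".join(out)
        </preformat>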
      <p>We are interested in the real application
scenario where high-quality data containing human
segmentation decisions are not available or scarce.
According to our hypothesis, a relatively limited
size of high-quality data can be compensated by
OpenSubtitles. Therefore, we fine-tune each of the
OpenSubtitles models on 10k and 100k sentences
from MuST-Cinema, which contain high-quality
break annotations.</p>
      <p>
        OpenSubtitles and TED Talks have been shown
to have large differences and to constitute a
subclassification of the subtitling genre
        <xref ref-type="bibr" rid="ref15">(Mu¨ller and
Volk, 2013)</xref>
        . For this reason, we experiment with
two test sets for cross-domain evaluation. The first
test set is the English one released with
MuST-Cinema, containing 10 single-speaker TED Talks
(545 sentences). The second test set (782
sentences) is much more diverse. In order to create it,
we have selected a mix of public and proprietary
data, more specifically, excerpts from a TV series,
a documentary, two short interviews and one
advertising video. The subtitling was performed by
professional translators and the .srt files were
processed to insert the break symbols in the positions
where subtitle and line breaks occur.
        </p>
        <p>[Table 1: Number of sentences (Sents) in MuST-Cinema, OpenSubs-42 and OpenSubs-48.]</p>
      </sec>
        <sec id="sec-3-3-1">
          <title>Model</title>
          <p>
            The model is a sequence-to-sequence model based
on the Transformer architecture
            <xref ref-type="bibr" rid="ref21">(Vaswani et al.,
2017)</xref>
            , trained using fairseq
            <xref ref-type="bibr" rid="ref16">(Ott et al., 2019)</xref>
            with
the same settings as in Karakanta et al. (2020b). It
takes as input a full sentence and returns the same
sentence annotated with subtitle and line breaks.
We process the data into sub-word units with
SentencePiece
            <xref ref-type="bibr" rid="ref11">(Kudo and Richardson, 2018)</xref>
            with 8K
vocabulary size. The special symbols are kept as
a single sub-word. Models were trained until
convergence on one Nvidia GeForce GTX 1080 Ti GPU.
          </p>
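          <p>A minimal sketch of the sub-word preprocessing with SentencePiece (file names are illustrative; the exact training settings follow Karakanta et al. (2020b)):</p>
          <preformat>
import sentencepiece as spm

# Train an 8k-vocabulary model; declaring the break symbols as
# user-defined symbols keeps each of them as a single sub-word.
spm.SentencePieceTrainer.train(
    input="train.en",
    model_prefix="subseg8k",
    vocab_size=8000,
    user_defined_symbols=["&lt;eol&gt;", "&lt;eob&gt;"],
)

sp = spm.SentencePieceProcessor(model_file="subseg8k.model")
print(sp.encode("My family's traditions &lt;eol&gt; and expectations &lt;eob&gt;", out_type=str))
          </preformat>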
          <p>As a baseline, we use a simple segmentation
approach which inserts a break symbol at the first space
before every 42 characters. Of the two types of
symbols, &lt;eol&gt; is selected with a probability of
0.25, but we avoid inserting two consecutive &lt;eol&gt;,
since this would lead to a subtitle of three lines.</p>
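          <p>A sketch of this baseline (illustrative Python; the paper does not publish the exact code):</p>
          <preformat>
import random

def baseline_segment(sentence, max_len=42, p_eol=0.25):
    """Greedily fill lines up to max_len characters, inserting a break
    at the space before the limit; choose &lt;eol&gt; with probability
    p_eol, but never twice in a row."""
    out, line, last_break = [], "", "&lt;eob&gt;"
    for w in sentence.split():
        if line and len(line) + 1 + len(w) &gt; max_len:
            if last_break != "&lt;eol&gt;" and p_eol &gt; random.random():
                brk = "&lt;eol&gt;"
            else:
                brk = "&lt;eob&gt;"
            out.append(line + " " + brk)
            last_break, line = brk, w
        else:
            line = w if not line else line + " " + w
    if line:
        out.append(line + " &lt;eob&gt;")
    return " ".join(out)
          </preformat>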
        </sec>
        <sec id="sec-3-3-2">
          <title>Evaluation</title>
          <p>
Subtitle segmentation is evaluated
with the following metrics. First, we compute the
precision, recall and F1-score between the output
of the segmenter and the human-generated
subtitles, in order to test the model’s performance at
inserting a sufficient number of breaks and at the
right positions in the sentence. Additionally, we
compute the BLEU score
            <xref ref-type="bibr" rid="ref17">(Papineni et al., 2002)</xref>
            between the output of the segmenter and the
human reference. Higher values for BLEU indicate
a high similarity between the model’s output and the
desired output.
          </p>
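          <p>Precision and recall are computed over break positions; a sketch of one possible implementation follows (the paper does not detail the exact matching procedure):</p>
          <preformat>
def break_positions(tagged):
    """Map each break symbol to the index of the word it follows."""
    pos, i = {}, 0
    for tok in tagged.split():
        if tok in ("&lt;eol&gt;", "&lt;eob&gt;"):
            pos[i] = tok
        else:
            i += 1
    return pos

def precision_recall_f1(hyp, ref):
    h, r = break_positions(hyp), break_positions(ref)
    tp = sum(1 for i, t in h.items() if r.get(i) == t)
    p = tp / len(h) if h else 0.0
    rec = tp / len(r) if r else 0.0
    f1 = 2 * p * rec / (p + rec) if p + rec else 0.0
    return p, rec, f1
          </preformat>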
          <p>Finally, we want to check the performance of
the system in generating readable subtitles;
therefore, we use an intrinsic, task-specific metric. We
compute the number of subtitles with a length of
&lt;= 42 characters (Characters per Line - CPL),
according to the TED subtitling guidelines. This
shows the ability of the system to segment the
sentences into readable subtitles, by producing
subtitles that are not too long to appear on the screen.
We additionally report training time, as efficiency
and cost are important factors for scaling such
methods to tens of languages.
</p>
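          <p>The CPL metric can be computed by splitting the output at the break symbols and checking each resulting line against the limit, e.g.:</p>
          <preformat>
import re

def cpl_conformity(tagged, max_len=42):
    """Share of subtitle lines with at most max_len characters."""
    lines = [l.strip()
             for l in re.split(r"&lt;eol&gt;|&lt;eob&gt;", tagged)
             if l.strip()]
    ok = sum(1 for l in lines if max_len &gt;= len(l))
    return ok / len(lines) if lines else 1.0
          </preformat>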
        </sec>
    </sec>
    <sec id="sec-4">
      <title>4 Results</title>
      <p>Tables 2 and 3 show the results for the
MuST-Cinema test set and the second test set, respectively. As
expected, the simple baseline achieves a 100%
conformity to the length constraint, it is however
not accurate in inserting the breaks at the right
positions, as shown by the very low BLEU (55.30
and 51.45) and F1 scores (48 and 44). The best
performance for all metrics and both test sets is
achieved when using all available MuST-Cinema
data (MC-all). For the in-domain test set, BLEU
and F1 are higher than for the out-of-domain test
set, however the number of subtitles conforming
to the length constraint is consistently high (96%
and 97%). This suggests that the systems trained
on high-quality segmentation are able to produce
readable subtitles, in terms of length, in diverse
testing conditions even without massive amounts of
data. Even with 100k sentences of training data (MC-100),
the performance of the model, which is the fastest
model to train, drops only slightly, with -2% for
all metrics on the MuST-Cinema test set and -1%
on the second test set. This shows that high
efficiency can be achieved without dramatically
sacrificing quality. This is particularly important for
industry applications where tens of languages are
involved and training data for a domain might not
be vast.</p>
      <p>The models trained only on OpenSubtitles show
a great drop in performance on the MuST-Cinema
test set, which is to be expected because of the
different nature of the data. However, the drop is present
also for the second test set, which shows that these
models are not robust to different domains.
Surprisingly, the larger model (OpenSubs-48) does
not perform much better than the model with less
data (OpenSubs-42) even though it is trained on
almost 10 times as much data. This could be an
indication of a trade-off between data quality and
data size. OpenSubs-48 with more noisy data has
similar recall to OpenSubs-42, but it is much less
accurate in the position of the breaks, as shown by
the drop in precision (86 vs. 77 and 84 vs. 63).
We conjecture that the procedure of artificially
inserting &lt;eol&gt; symbols by changing the existing
&lt;eob&gt; does not reflect the distribution of the type
of breaks in real data. Interestingly, the
OpenSubs42 model, despite containing only subtitles of a
maximum length of 42, is not able to generate
subtitles which respect the length constraint (74% and
79%). It is therefore possible that the segmenter
does not learn to take into consideration the
constraint of length, but the segmentation decisions
are based on lexical patterns in the data, as also
suggested by Karakanta et al. (2020a).</p>
      <p>Fine-tuning, even on a minimal amount of real
data, as shown when fine-tuning on 10k sentences of
MuST-Cinema, can significantly boost the performance
compared to the OpenSubtitles models and is a
viable and fast solution towards readable
subtitles. This corroborates the claim in favour of
creating datasets which are representative of the
task at hand. Surprisingly though, fine-tuning the
OpenSubs-42 model on MC-100 does not improve
over training the model from scratch on MC-100
for either test set. For the case when only a small
amount of MuST-Cinema data is available
(MC-10), having a larger base model on which to
fine-tune (OpenSubs-48) is beneficial, since there is an
improvement for all metrics and in both testing
conditions compared to all other models trained
on OpenSubtitles or fine-tuned on them.
Therefore, we conclude that, in the presence of little
data containing human segmentation decisions, a
model trained on more data, even though possibly
noisier, is a more robust base on which to
fine-tune using the high-quality data. One
considerable drawback is that the improvement comes at
a training time 25 times that of the other base model
(OpenSubs-42), which raises significant
considerations for cost and efficiency. Such a model,
however, once trained, could be re-used for fine-tuning
on several domains and for different client
specifications.</p>
    </sec>
    <sec id="sec-5">
      <title>5 Analysis and Discussion</title>
      <p>We further perform a manual inspection to
identify issues related to the models. We hypothesise
that low precision is connected to over-splitting
or splitting in wrong positions, while low recall
suggests under-splitting (not inserting a sufficient
number of breaks). Indeed, we observe that the
OpenSubtitles models tend to over-segment short
sentences, but under-segment longer sentences:</p>
      <sec id="sec-5-1">
        <title>Reference:</title>
        <p>My family’s traditions &lt;eol&gt;
and expectations for a woman &lt;eob&gt;
wouldn’t allow me to own a mobile &lt;eol&gt;
phone until I was married. &lt;eob&gt;
(22 + 28 + 39 + 20 characters)</p>
      </sec>
      <sec id="sec-5-2">
        <title>OpenSubs-42:</title>
        <p>My family’s traditions and expectations
&lt;eol&gt;
for a woman wouldn’t allow me to own a
mobile phone until I was married. &lt;eob&gt;
(39+72 characters)</p>
        <p>In the following example, fine-tuning on MC
increases length conformity, splitting the first
subtitle in two, while MC-100k succeeds in segmenting
all subtitles exceeding 42 characters, matching the
reference segmentation.</p>
      </sec>
      <sec id="sec-5-3">
        <title>Reference:</title>
        <p>Meditation is a technique &lt;eol&gt;
of finding well-being &lt;eob&gt;
in the present moment &lt;eol&gt;
before anything happens. &lt;eob&gt;</p>
      </sec>
      <sec id="sec-5-4">
        <title>OpenSubs-42:</title>
        <p>Meditation is a technique of finding
wellbeing &lt;eob&gt;
in the present moment before anything
happens. &lt;eob&gt;
(47+46 characters)</p>
      </sec>
      <sec id="sec-5-5">
        <title>OpenSubs-42 + MC 10K:</title>
        <p>Meditation is a technique &lt;eol&gt;
of finding well-being &lt;eob&gt;
in the present moment before anything
happens. &lt;eob&gt;
(25+21+46 characters)</p>
      </sec>
      <sec id="sec-5-6">
        <title>MC-100K:</title>
        <p>Meditation is a technique &lt;eol&gt;
of finding well-being &lt;eob&gt;
in the present moment &lt;eol&gt;
before anything happens. &lt;eob&gt;</p>
      </sec>
        <p>The examples above confirm our results which
showed that the models do not explicitly learn
the constraint of length, but rather patterns of
segmentation. From a syntactic point of view,
the break symbols are inserted after a noun (e.g.
attention, expectations) and before a
preposition/conjunction (to, for, in, before), regardless of
the model. The break symbols, even though they do not
overlap with the human segmentation decisions,
are inserted at plausible positions. This leads to
subtitles that present logical completion, i.e. each
subtitle is formed by a phrase or syntactic unit,
even though they do not respect the constraint of
length. The conformity to the length constraint
seems to be forced only with the high-quality
MuST-Cinema data. It is possible that the artificial
break symbols in OpenSubtitles clash with the real
break symbols in MuST-Cinema, which creates
confusion for the model. Replacing some &lt;eob&gt;
with &lt;eol&gt; symbols in OpenSubtitles to
simulate data where human-annotated line breaks exist
means that the models trained on OpenSubtitles
observe a line break at positions where normally a
subtitle break is present. Given the different
functions of the two types of breaks, this is a possible
explanation of why fine-tuning OpenSubs-42 on
MC-100 performs worse than training on MC-100
from scratch, and provides us with insights for the
future design of artificial segmentation decisions to
augment subtitling data.</p>
    </sec>
    <sec id="sec-6">
      <title>6 Conclusion</title>
      <p>We have presented methods to combine
heterogeneous subtitling data in order to improve
automatic segmentation of subtitles. We leverage
large data containing noisy segmentation
decisions from OpenSubtitles and combine them with
smaller amounts of high-quality data from
MuST-Cinema to generate readable subtitles from full
sentences. We found that even limited data with
reliable segmentation can improve performance.
We conclude that quality matters more than size
for determining the break points between subtitles.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>This work is part of the “End-to-end Spoken
Language Translation in Rich Data Conditions”
project (https://ict.fbk.eu/units-hlt-mt-e2eslt/),
which is financially supported by an
Amazon AWS ML Grant.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <given-names>Aitor</given-names>
            <surname>Álvarez</surname>
          </string-name>
          , Haritz Arzelus, and
          <string-name>
            <given-names>Thierry</given-names>
            <surname>Etchegoyhen</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Towards customized automatic segmentation of subtitles</article-title>
          .
          <source>In Advances in Speech and Language Technologies for Iberian Languages</source>
          , pages
          <fpage>229</fpage>
          -
          <lpage>238</lpage>
          , Cham. Springer International Publishing.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <given-names>Aitor</given-names>
            <surname>Álvarez</surname>
          </string-name>
          , Carlos-D.
          <string-name>
            <surname>Martínez-Hinarejos</surname>
          </string-name>
          , Haritz Arzelus, Marina Balenciaga, and Arantza del Pozo
          .
          <year>2017</year>
          .
          <article-title>Improving the automatic segmentation of subtitles through conditional random field</article-title>
          .
          <source>In Speech Communication</source>
          , volume
          <volume>88</volume>
          , pages
          <fpage>83</fpage>
          -
          <lpage>95</lpage>
          . Elsevier BV.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <given-names>E.</given-names>
            <surname>Bartoll</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Martínez Tejerina</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>The positioning of subtitles for the deaf and hard of hearing</article-title>
          .
          <source>Listening to Subtitles. Subtitles for the Deaf and Hard of Hearing</source>
          , pages
          <fpage>69</fpage>
          -
          <lpage>86</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <given-names>Eduard</given-names>
            <surname>Bartoll</surname>
          </string-name>
          .
          <year>2004</year>
          .
          <article-title>Parameters for the classification of subtitles</article-title>
          .
          <source>Topics in Audiovisual Translation</source>
          ,
          <volume>9</volume>
          :
          <fpage>53</fpage>
          -
          <lpage>60</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <given-names>Mauro</given-names>
            <surname>Cettolo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Christian</given-names>
            <surname>Girardi</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Marcello</given-names>
            <surname>Federico</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>WIT3: Web Inventory of Transcribed and Translated Talks</article-title>
          .
          <source>In Proceedings of the 16th Conference of the European Association for Machine Translation (EAMT)</source>
          , pages
          <fpage>261</fpage>
          -
          <lpage>268</lpage>
          , Trento, Italy, May.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <given-names>Mattia Antonino</given-names>
            <surname>Di Gangi</surname>
          </string-name>
          , Roldano Cattoni, Luisa Bentivogli, Matteo Negri, and
          <string-name>
            <given-names>Marco</given-names>
            <surname>Turchi</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>MuST-C: a Multilingual Speech Translation Corpus</article-title>
          .
          <source>In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , Volume
          <volume>2</volume>
          (Short Papers), Minneapolis, MN, USA, June.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <given-names>Henrik</given-names>
            <surname>Gottlieb</surname>
          </string-name>
          .
          <year>2004</year>
          .
          <article-title>Language-political implications of subtitling</article-title>
          .
          <source>Topics in Audiovisual Translation</source>
          ,
          <volume>9</volume>
          :
          <fpage>83</fpage>
          -
          <lpage>100</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <given-names>Alina</given-names>
            <surname>Karakanta</surname>
          </string-name>
          , Matteo Negri, and
          <string-name>
            <given-names>Marco</given-names>
            <surname>Turchi</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Are Subtitling Corpora really Subtitle-like?</article-title>
          <source>In Sixth Italian Conference on Computational Linguistics</source>
          , CLiC-It.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <given-names>Alina</given-names>
            <surname>Karakanta</surname>
          </string-name>
          , Matteo Negri, and
          <string-name>
            <given-names>Marco</given-names>
            <surname>Turchi</surname>
          </string-name>
          . 2020a.
          <article-title>Is 42 the answer to everything in subtitling-oriented speech translation?</article-title>
          <source>In Proceedings of the 17th International Conference on Spoken Language Translation</source>
          , pages
          <fpage>209</fpage>
          -
          <lpage>219</lpage>
          , Online, July. Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <given-names>Alina</given-names>
            <surname>Karakanta</surname>
          </string-name>
          , Matteo Negri, and
          <string-name>
            <given-names>Marco</given-names>
            <surname>Turchi</surname>
          </string-name>
          . 2020b.
          <article-title>MuST-Cinema: a speech-to-subtitles corpus</article-title>
          .
          <source>In Proceedings of the 12th International Conference on Language Resources and Evaluation (LREC 2020)</source>
          , Marseille, France, May 13-15.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <given-names>Taku</given-names>
            <surname>Kudo</surname>
          </string-name>
          and
          <string-name>
            <given-names>John</given-names>
            <surname>Richardson</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing</article-title>
          .
          <source>In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations</source>
          , pages
          <fpage>66</fpage>
          -
          <lpage>71</lpage>
          , Brussels, Belgium, November. Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <given-names>Pierre</given-names>
            <surname>Lison</surname>
          </string-name>
          and Jörg Tiedemann.
          <year>2016</year>
          .
          <article-title>OpenSubtitles2016: Extracting large parallel corpora from movie and TV subtitles</article-title>
          .
          <source>In Proceedings of the International Conference on Language Resources and Evaluation</source>
          , LREC.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <given-names>Danni</given-names>
            <surname>Liu</surname>
          </string-name>
          , Jan Niehues, and
          <string-name>
            <given-names>Gerasimos</given-names>
            <surname>Spanakis</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Adapting end-to-end speech recognition for readable subtitles</article-title>
          .
          <source>In Proceedings of the 17th International Conference on Spoken Language Translation</source>
          , pages
          <fpage>247</fpage>
          -
          <lpage>256</lpage>
          , Online, July. Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <given-names>Evgeny</given-names>
            <surname>Matusov</surname>
          </string-name>
          , Patrick Wilken, and
          <string-name>
            <given-names>Yota</given-names>
            <surname>Georgakopoulou</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Customizing neural machine translation for subtitling</article-title>
          .
          <source>In Proceedings of the Fourth Conference on Machine Translation (Volume</source>
          <volume>1</volume>
          : Research Papers), pages
          <fpage>82</fpage>
          -
          <lpage>93</lpage>
          , Florence, Italy, August. Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <given-names>Mathias</given-names>
            <surname>Müller</surname>
          </string-name>
          and
          <string-name>
            <given-names>Martin</given-names>
            <surname>Volk</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Statistical machine translation of subtitles: From OpenSubtitles to TED</article-title>
          . In Iryna Gurevych, Chris Biemann, and Torsten Zesch, editors,
          <source>Language Processing and Knowledge in the Web</source>
          , pages
          <fpage>132</fpage>
          -
          <lpage>138</lpage>
          , Berlin, Heidelberg. Springer Berlin Heidelberg.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <given-names>Myle</given-names>
            <surname>Ott</surname>
          </string-name>
          , Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and
          <string-name>
            <given-names>Michael</given-names>
            <surname>Auli</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>fairseq: A fast, extensible toolkit for sequence modeling</article-title>
          .
          <source>In Proceedings of NAACL-HLT</source>
          <year>2019</year>
          : Demonstrations.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <string-name>
            <given-names>Kishore</given-names>
            <surname>Papineni</surname>
          </string-name>
          , Salim Roukos, Todd Ward, and
          <string-name>
            <given-names>Wei-Jing</given-names>
            <surname>Zhu</surname>
          </string-name>
          </string-name>
          .
          <year>2002</year>
          .
          <article-title>BLEU: a method for automatic evaluation of machine translation</article-title>
          .
          <source>In Proceedings of the 40th annual meeting on association for computational linguistics</source>
          , pages
          <fpage>311</fpage>
          -
          <lpage>318</lpage>
          . Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <string-name>
            <given-names>Elisa</given-names>
            <surname>Perego</surname>
          </string-name>
          .
          <year>2008</year>
          .
          <article-title>Subtitles and line-breaks: Towards improved readability</article-title>
          .
          <source>Between Text and Image: Updating research in screen translation</source>
          ,
          <volume>78</volume>
          (
          <issue>1</issue>
          ):
          <fpage>211</fpage>
          -
          <lpage>223</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <string-name>
            <given-names>Dhevi J.</given-names>
            <surname>Rajendran</surname>
          </string-name>
          , Andrew T. Duchowski, Pilar Orero, Juan Martínez, and
          <string-name>
            <given-names>Pablo</given-names>
            <surname>Romero-Fresco</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Effects of text chunking on subtitling: A quantitative and qualitative examination</article-title>
          .
          <source>Perspectives</source>
          ,
          <volume>21</volume>
          (
          <issue>1</issue>
          ):
          <fpage>5</fpage>
          -
          <lpage>21</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <string-name>
            <given-names>Hye-Jeong</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Hong-Ki</given-names>
            <surname>Kim</surname>
          </string-name>
          , Jong-Dae Kim, Chan-Young Park, and
          <string-name>
            <given-names>Yu-Seop</given-names>
            <surname>Kim</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Inter-sentence segmentation of YouTube subtitles using long-short term memory (LSTM)</article-title>
          .
          <volume>9</volume>
          :
          <fpage>1504</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <string-name>
            <given-names>Ashish</given-names>
            <surname>Vaswani</surname>
          </string-name>
          , Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
          Łukasz Kaiser, and
          <string-name>
            <given-names>Illia</given-names>
            <surname>Polosukhin</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Attention is all you need</article-title>
          .
          <source>In Advances in Neural Information Processing Systems</source>
          , pages
          <fpage>6000</fpage>
          -
          <lpage>6010</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>