<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Presenting Simultaneous Translation in Limited Space</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Dominik Macháček</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ondřej Bojar</string-name>
          <email>bojar@ufal.mff.cuni.cz</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Some methods of automatic simultaneous translation of long-form speech allow revisions of outputs, trading accuracy for low latency. Deploying these systems for users faces the problem of presenting subtitles in a limited space, such as two lines on a television screen. The subtitles must be shown promptly, incrementally, and with adequate time for reading. We provide an algorithm for subtitling. Furthermore, we propose a way to estimate the overall usability of the combination of automatic translation and subtitling by measuring the quality, latency, and stability on a test set, and we propose an improved measure for translation latency.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>The quality of automatic speech recognition and machine
translation of texts is constantly increasing. This creates
an opportunity to connect the two components and use
them for spoken language translation (SLT). The output of
an SLT system can be delivered to users either as speech
or as text. In simultaneous SLT, where the output has to be
delivered during the speech with as low a delay as
possible, there is a trade-off between latency and quality. With
textual output, it is possible to present users with early,
partial translation hypotheses at low latency and to correct
them later with final, more accurate updates, after the
system receives more context for disambiguation or after a
secondary, bigger model produces its translation. Rewriting
brings another challenge: the stability of the output. If the
updates are too frequent, the user is unable to read the text.
The problem of unstable output could be solved by using
a large space for showing subtitles. The unstable, flickering
output would then appear only at the end, allowing the user to
easily ignore the flickering part and read only the
stabilized part of the output. However, in many situations, the
space for subtitles is restricted. For example, if the users
have to follow the speaker and slides at the same time, they
lack the mental capacity to search for the stabilized part
of the translations. It is, therefore, necessary to put the
subtitles and slides on the same screen, restricting the subtitling
area to a small window.</p>
      <p>In this paper, we propose an algorithm for presenting
SLT subtitles in limited space, a way of estimating the
overall usability of simultaneous SLT subtitling in a
limited area, and an improved translation latency measure for
comparing SLT systems. Section 2 describes the properties of SLT
assumed by our subtitler. Section 3 details the main new
component for presenting a text stream as readable
subtitles. Section 4 proposes how to estimate the usability of
subtitling for multiple realistic SLT systems. We
conclude the paper in Section 5.</p>
    </sec>
    <sec id="sec-3">
      <title>Re-Translating Spoken Language Translation</title>
      <p>
        Our subtitler solves the problem of presenting the output of an SLT
system that re-translates its early hypotheses, similarly to
[
        <xref ref-type="bibr" rid="ref1 ref2 ref3">1, 2, 3</xref>
        ]. Although it can also present subtitles from
automatic speech recognition (ASR) that re-estimates its
early hypotheses, or generally from any audio-to-text processor,
we limit ourselves to SLT in this paper for brevity.
      </p>
      <sec id="sec-3-1">
        <title>Stable and Unstable Segments</title>
        <p>SLT systems output a potentially infinite stream of
segments, each containing the start and end timestamps of an
interval of the source audio and the translated text for
that interval. We assume that the segments can be marked
as stable or unstable, depending on whether the system
can still change them. This is a
realistic assumption because ASR and SLT systems usually
process a limited window of the source audio. Whenever a
part of the source audio leaves this window, the
corresponding output becomes stable.</p>
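        <p>As an illustration only, such a segment could be modelled as a small record. The sketch below assumes a hypothetical line-oriented format, <monospace>start end STATUS text</monospace>; the field order, the time unit, and the names <monospace>Segment</monospace> and <monospace>parse_segment</monospace> are assumptions made for this example and do not specify the interface of any particular SLT system.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    start: int    # start timestamp of the source-audio interval (unit assumed)
    end: int      # end timestamp of the source-audio interval
    stable: bool  # True if the system will no longer change this segment
    text: str     # translated text for the interval


def parse_segment(line: str) -> Segment:
    """Parse one line of the assumed format '<start> <end> <STABLE|UNSTABLE> <text>'."""
    start, end, status, text = line.strip().split(" ", 3)
    if status not in ("STABLE", "UNSTABLE"):
        raise ValueError(f"unknown status: {status!r}")
    return Segment(int(start), int(end), status == "STABLE", text)


segment = parse_segment("134 189 UNSTABLE Zu jedem Zeitpunkt.")
```
]]></preformat>
        <p>Keeping the stability flag on the segment itself lets later components decide per segment whether the text may still be rewritten.</p>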
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Subtitler</title>
      <p>This section presents the design and algorithm of
the subtitler.</p>
      <p>The subtitler is a cache on a stream of input messages
aiming to satisfy the following conflicting needs:</p>
      <list list-type="bullet">
        <list-item><p>The output should be presented with the lowest possible delay to achieve the effect of simultaneous translation as much as possible.</p></list-item>
        <list-item><p>The flickering of the partial outputs is partially desired because it highlights the simultaneity of the translation and reassures the user that the system is not stuck.</p></list-item>
        <list-item><p>The flickering should be minimized. If some output was presented at a position of the screen, it should keep the position until it is outdated.</p></list-item>
        <list-item><p>The user must have enough time to read the message.</p></list-item>
        <list-item><p>Only a predefined space of w (width) characters and h (height) lines is available.</p></list-item>
      </list>
      <p>[Figure: example of subtitler inputs and windows. Input 1: "23 134 STABLE Pixelen auf Ihrem Bildschirm." followed by "134 189 UNSTABLE Zu jedem Zeitpunk. Es ist auch eine sehr flexible Architektur..."; panels Window 1.1 and Window 1.2; buffer "Pixelen auf Ihrem Bildschirm. Zu jedem Zeitpunkt. Es ist auch eine sehr flexible Architektur..." with its stable and unstable parts marked. Input 2: "134 156 STABLE Zu jedem Zeitpunk." followed by "156 210 UNSTABLE Sie ist auch sehr flexibel. Die architektur ist ein ganzes Buch."; panels Window 2.1, Window 2.2, and reset window 2.2.]</p>
      <p>Given an input stream of stable and unstable segments
as described above, the subtitler emits a stream of
“subtitle windows”. On every update, the former window is
replaced by a new one.</p>
      <p>The basic operation of the subtitler is depicted in Figures 1
and 2. The elements of the subtitler are a buffer of input
segments, a presentation window, and two independent
processing threads.</p>
      <p>The buffer is an ordered list of segments. The
presentation window is organized as a list of text lines of the
required width and count. The count corresponds to the
height of the subtitling window plus one, to allow scrolling
up the top line after it has been displayed for the minimum reading
time. This line view is regenerated whenever needed from
the current starting position of the window in the buffer,
wrapping words into lines.</p>
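      <p>The regeneration of the line view can be sketched as follows. This is a minimal illustration, not the implementation used in this work: it assumes that the window position is a character offset into the concatenated buffer text, and the name <monospace>window_lines</monospace> is hypothetical. Word wrapping is delegated to the standard Python <monospace>textwrap</monospace> module.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import textwrap
from typing import List


def window_lines(buffer_text: str, start: int, width: int, height: int) -> List[str]:
    """Wrap the buffer text from character offset `start` into lines.

    Returns at most height + 1 lines: `height` visible lines plus one
    extra line that becomes visible after the next scroll.
    """
    lines = textwrap.wrap(buffer_text[start:], width=width)
    return lines[: height + 1]


text = "Pixelen auf Ihrem Bildschirm. Zu jedem Zeitpunkt."
lines = window_lines(text, start=0, width=20, height=2)
# lines == ["Pixelen auf Ihrem", "Bildschirm. Zu jedem", "Zeitpunkt."]
```
]]></preformat>
      <p>Because the view is recomputed from the buffer on demand, a changed segment never has to be patched into already wrapped lines.</p>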
      <p>The input thread receives the input stream and updates
the buffer. It replaces outdated segments with their new
versions, extends the buffer, and removes old segments that
are no longer needed. If an update happens within or before the
current position of the presentation window, the output thread
is notified to perform a forced update.</p>
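      <p>A possible shape of the buffer update is sketched below. It rests on assumptions that go beyond the description above: segments are tuples ordered by time, an incoming batch rewrites everything from the start time of its first segment onwards, and the window position is represented by the end time of the last displayed segment. The names <monospace>update_buffer</monospace> and <monospace>window_end</monospace> are hypothetical, and the removal of old segments is left out.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from typing import List, Tuple

# (start, end, stable, text); this tuple layout is an assumption of the sketch
Seg = Tuple[int, int, bool, str]


def update_buffer(buffer: List[Seg], new: List[Seg], window_end: int) -> Tuple[List[Seg], bool]:
    """Replace outdated segments by `new`; report whether a forced update is needed."""
    if not new:
        return buffer, False
    first_new = new[0][0]
    # Keep only the segments that start before the rewritten region.
    kept = [seg for seg in buffer if seg[0] < first_new]
    # The update touches text within or before the displayed window.
    forced = first_new <= window_end
    return kept + new, forced


buffer = [
    (23, 134, True, "Pixelen auf Ihrem Bildschirm."),
    (134, 189, False, "Zu jedem Zeitpunkt. Es ist auch eine sehr flexible Architektur..."),
]
new = [
    (134, 156, True, "Zu jedem Zeitpunkt."),
    (156, 210, False, "Sie ist auch sehr flexibel."),
]
buffer2, forced = update_buffer(buffer, new, window_end=189)
```
]]></preformat>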
      <p>Independently, the output thread updates the position of
the presentation window in the buffer, obeying the
following timeouts and triggers:</p>
      <list list-type="bullet">
        <list-item><p>On forced updates, the output thread detects whether any content changed before the beginning of the already presented window, which would cause a reset. In that case, the window position in the buffer has to be moved back, and the content can no longer be presented to the user incrementally. Instead, the beginning of the first line in the window shows a newer version of an old sentence that has already scrolled away.</p></list-item>
        <list-item><p>If the first line of the presentation window has not been changed for the minimum reading time and if there is any input to present in the extra line of the window, the window is “scrolled” by one line, i.e., the first line is discarded, the window starting position within the buffer is updated, and the extra line is shown as the last line of the window.</p></list-item>
        <list-item><p>If the whole presentation window has not been changed for a long time, e.g., 5 or 20 seconds, it is blanked by emitting empty lines.</p></list-item>
      </list>
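      <p>The triggers of the output thread can be summarized as a small decision function. The sketch below is only one possible reading: the order in which the conditions are checked (reset first, then scroll, then blank) and all names are assumptions, and the detection of a reset is reduced to a Boolean input.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def next_action(reset_needed: bool,
                first_line_age: float,
                window_age: float,
                has_extra_line: bool,
                min_reading_time: float,
                blank_time: float) -> str:
    """Pick the next step of the output thread: 'reset', 'scroll', 'blank' or 'wait'.

    Ages are seconds since the first line / the whole window last changed.
    """
    if reset_needed:
        # Content before the window start changed: move the window back.
        return "reset"
    if has_extra_line and first_line_age >= min_reading_time:
        # The top line was readable long enough and new text is waiting.
        return "scroll"
    if window_age >= blank_time:
        # Nothing changed for a long time: emit empty lines.
        return "blank"
    return "wait"


action = next_action(False, first_line_age=5.0, window_age=5.0,
                     has_extra_line=True, min_reading_time=4.0, blank_time=20.0)
```
]]></preformat>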
      <sec id="sec-4-1">
        <title>Timing Parameters</title>
        <p>The subtitler requires two timing parameters. A line of
subtitles is displayed to a user for a “minimum reading
time” before it can be scrolled away. If no input arrives for
a “blank time”, the subtitling window is blanked to indicate
this and to prevent the user from re-reading the last message
unnecessarily.</p>
        <p>
          We suggest adopting the minimum reading time
parameter from the standards for subtitling films and videos (e.g.,
[
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]) until standards for simultaneous SLT subtitling are
established. [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] claim that 15 characters per second is a
standard reading speed in English interlingual subtitling of
films and videos for the deaf and hard of hearing. The standards
in other European regions are close to 15 characters per
second. We use this value for the evaluation in this work.
        </p>
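        <p>To make the parameter concrete, a reading speed in characters per second can be converted into a display time per line. The conversion below, which divides the number of characters actually present in the line by the reading speed, is our assumption for illustration; the function name <monospace>min_reading_time</monospace> is hypothetical.</p>
        <preformat preformat-type="code"><![CDATA[
```python
def min_reading_time(line: str, chars_per_second: float = 15.0) -> float:
    """Seconds a line should stay on screen before it may scroll away."""
    if chars_per_second <= 0:
        raise ValueError("chars_per_second must be positive")
    return len(line) / chars_per_second


short_line = min_reading_time("x" * 30)  # 30 characters at 15 characters per second
full_line = min_reading_time("x" * 70)   # a full line of width 70
```
]]></preformat>
        <p>Under this assumption, a full line of 70 characters has to stay visible for 70 / 15 seconds, i.e., a little under five seconds.</p>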
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Estimating Usability</title>
      <p>
        The challenges in simultaneous SLT are quality, latency,
and stability [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. All of these properties are critical for
the overall usability of the SLT system. The quality of
translation is a property of the SLT system; the subtitler
has no impact on it. The minimum reading time ensures
a minimum level of stability, so that all stable
content is readable, but it may increase the latency if the
original speech is faster than the reading speed. The size of
the subtitling window and the timing parameters affect the overall
latency and stability. The bigger the window, the longer
the updates of translations that fit into it without a reset. The
timing parameters determine how long the content stays
unchanged in the window before scrolling. A small subtitling
window or a short reading or blanking time may cause a
reset. Every reset increases latency because it returns to
already displayed content. On the other hand, a
significant latency may improve stability by skipping the early
unstable hypotheses and presenting only the stable ones.
      </p>
      <p>We provide three automatic measures for assessing the
practical usability of simultaneous SLT subtitling on a
test set. The automatic evaluation may serve for a rough
estimation of the usefulness, or for the selection of the best
candidate setups. We do not provide a strict way to judge
which SLT system and subtitling setup are useful and
which are not. The final decision should ideally consider
the particular display conditions, expectations, and needs
of the users, and should be based on a significant human
evaluation.</p>
      <sec id="sec-5-1">
        <title>Evaluation Measures</title>
        <p>
          For quality, we report the automatic machine translation
measure BLEU computed by sacrebleu [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] after automatic
sentence alignment using mwerSegmenter [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. BLEU is
considered to correlate with human judgement of quality:
the higher the BLEU score, the higher the translation quality.
        </p>
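        <p>
          To explain the measures of latency and stability, let us
use the terminology of [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. The EventLog is an ordered
list of events. The <inline-formula><tex-math>i</tex-math></inline-formula>-th event is a triple <inline-formula><tex-math>(s_i, o_i, t_i)</tex-math></inline-formula>, where <inline-formula><tex-math>s_i</tex-math></inline-formula>
is the source text recognized so far, <inline-formula><tex-math>o_i</tex-math></inline-formula> is the current SLT
output, and <inline-formula><tex-math>t_i</tex-math></inline-formula> is the time when this event was produced.
Source and output, <inline-formula><tex-math>s_i</tex-math></inline-formula> and <inline-formula><tex-math>o_i</tex-math></inline-formula>, are sequences of tokens. Let
<inline-formula><tex-math>c(o_i)</tex-math></inline-formula> denote the transformation of a token sequence into a
sequence of characters, including spaces and punctuation.
Let <inline-formula><tex-math>I</tex-math></inline-formula> be the number of all events, with an update either
in source or output, and <inline-formula><tex-math>T</tex-math></inline-formula> the number of events with an
update in translation.
        </p>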
        <p>
          <bold>Character Erasure.</bold> To evaluate how many updates fit
into the subtitling window, we define character erasure
(cE). It is the number of characters that must be deleted
from the tail of the current translation hypothesis to update
it to a new one. If a new translation only appends words
to the end, the erasure is zero. The character erasure is
<disp-formula><tex-math>cE(i) = |c(o_{i-1})| - |LCP(c(o_i), c(o_{i-1}))|,</tex-math></disp-formula>
where LCP stands for the longest common prefix. The average
character erasure is
<disp-formula><tex-math>AcE = \frac{1}{T} \sum_{i=1}^{I} cE(i).</tex-math></disp-formula>
It is inspired by the normalized erasure (NE) of [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], but we do not divide it by
the output length in the final event, but only by the number
of translation events.
        </p>
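        <p>The two quantities can be computed directly from the sequence of translation hypotheses. The following sketch assumes that the hypotheses are already given as character strings, one per event, and that the output before the first event is empty; the function names are hypothetical.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from typing import List


def char_erasure(prev: str, cur: str) -> int:
    """cE: characters deleted from the tail of `prev` to turn it into `cur`."""
    lcp = 0
    for a, b in zip(prev, cur):
        if a != b:
            break
        lcp += 1
    return len(prev) - lcp


def average_char_erasure(hypotheses: List[str]) -> float:
    """AcE: total erasure divided by the number of events that change the output."""
    total, updates, prev = 0, 0, ""  # the output before the first event is assumed empty
    for cur in hypotheses:
        if cur != prev:
            updates += 1
            total += char_erasure(prev, cur)
        prev = cur
    return total / updates if updates else 0.0


hyps = ["Zu jedem", "Zu jedem Zeitpunkt.", "Zu jeder Zeit."]
ace = average_char_erasure(hyps)
```
]]></preformat>
        <p>In this toy example, only the last update rewrites text: it erases the 12 characters after the common prefix "Zu jede", so the average over the three updates is 4 characters.</p>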
      </sec>
      <sec id="sec-5-2">
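        <title>Translation Latency with Sentence-Alignment Catch-up</title>
        <p>The translation latency may be measured with the use
of a finalization event of the <inline-formula><tex-math>j</tex-math></inline-formula>-th word in the output. It is
<disp-formula><tex-math>f(o, j) = \min i \ \text{such that}\ o_{i', j'} = o_{I, j'} \quad \forall i' \ge i \ \text{and}\ \forall j' \le j.</tex-math></disp-formula>
In other words, the word <inline-formula><tex-math>j</tex-math></inline-formula> is finalized in the first event <inline-formula><tex-math>i</tex-math></inline-formula>
for which the word <inline-formula><tex-math>j</tex-math></inline-formula> and all the preceding words <inline-formula><tex-math>j'</tex-math></inline-formula> remain
unchanged in all subsequent events <inline-formula><tex-math>i'</tex-math></inline-formula>.</p>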
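        <p>A direct way to find the finalization event is to walk backwards from the final event while the first j words still agree with the final output. The sketch below assumes that the outputs of all events are available as token lists and uses 1-based indices for events and words, as in the text; the function name is hypothetical.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from typing import List


def finalization_event(outputs: List[List[str]], j: int) -> int:
    """Return the 1-based index of the first event from which the first j
    words of the output stay equal to those of the final output."""
    final_prefix = outputs[-1][:j]
    i = len(outputs)  # the final event trivially satisfies the condition
    # outputs[i - 2] is the output of event i - 1 (lists are 0-based).
    while i > 1 and outputs[i - 2][:j] == final_prefix:
        i -= 1
    return i


events = [
    ["Zu"],
    ["Zu", "jedem"],
    ["Zu", "jeder", "Zeit."],
    ["Zu", "jeder", "Zeit."],
]
f1 = finalization_event(events, 1)  # "Zu" never changes
f2 = finalization_event(events, 2)  # "jedem" is rewritten in the third event
```
]]></preformat>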
        <p>
          The translation latency of output word <inline-formula><tex-math>j</tex-math></inline-formula> is the time
difference between the finalization event of the word <inline-formula><tex-math>j</tex-math></inline-formula> in the
output and that of its corresponding word <inline-formula><tex-math>j^*</tex-math></inline-formula> in the source. [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]
estimate the source word simply as <inline-formula><tex-math>j^* = (j / |o_I|) |s_I|</tex-math></inline-formula>. This is
problematic if the output is substantially shorter than the
input, because the latency may then be incorrectly based on a
word which has not been uttered yet, leading to a negative
time difference. A proper word alignment would provide
the most reliable correspondence. However, we propose a
simpler and sufficiently reliable solution. The following
improved measure is our novel contribution. We use it to
compare the SLT systems.
        </p>
        <p>
          We utilize the fact that our ASR produces punctuated
text, where the sentence boundaries can be detected. The
sentences coming from SLT and ASR in their sequential
order are parallel. They can be simply aligned because our
SLT systems translate the individual sentences and keep
the sentence boundaries. If the SLT does not produce
individual sentences, then we use a rule-based sentence
segmenter, e.g. from [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], and must be aware of the potential
inaccuracy.
        </p>
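        <p>
          We use the sentence alignment for a catch-up, and the
simple temporal correspondence of [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] only within the
last sentence. To express it formally, let us assume that
the EventLog also has a function <inline-formula><tex-math>S(o, j)</tex-math></inline-formula>, returning the
index of the sentence containing the word <inline-formula><tex-math>j</tex-math></inline-formula> in <inline-formula><tex-math>o</tex-math></inline-formula>, and
<inline-formula><tex-math>L(o, k)</tex-math></inline-formula>, the length of the sentence <inline-formula><tex-math>k</tex-math></inline-formula> in <inline-formula><tex-math>o</tex-math></inline-formula>. Let
<disp-formula><tex-math>x(j) = j - \sum_{i=1}^{S(o,j)-1} L(o, i)</tex-math></disp-formula>
be the index of an output word <inline-formula><tex-math>j</tex-math></inline-formula> in its sentence. The corresponding source word is then
<disp-formula><tex-math>j^* = \sum_{i=1}^{S(o,j)-1} L(s, i) + x(j) \left\lfloor \frac{L(s, S(o,j))}{L(o, S(o,j))} \right\rfloor.</tex-math></disp-formula>
        </p>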
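        <p>The helper functions of this definition are simple to state in code. The sketch below covers only the unambiguous parts: the sentence lengths, the sentence index S, the in-sentence index x(j), and the catch-up offset, i.e., the total length of the source sentences that precede the aligned one. It assumes that a sentence ends at a token ending in a full stop, question mark, or exclamation mark, which is a simplification; all function names are hypothetical. The scaling within the last sentence is deliberately left out.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from typing import List


def sentence_lengths(tokens: List[str]) -> List[int]:
    """Lengths of the sentences in a punctuated token sequence."""
    lengths, current = [], 0
    for token in tokens:
        current += 1
        if token.endswith((".", "?", "!")):
            lengths.append(current)
            current = 0
    if current:  # unfinished last sentence
        lengths.append(current)
    return lengths


def sentence_index(lengths: List[int], j: int) -> int:
    """S: 1-based index of the sentence containing the 1-based word j."""
    seen = 0
    for k, n in enumerate(lengths, start=1):
        seen += n
        if j <= seen:
            return k
    raise IndexError("word index beyond the end of the text")


def index_in_sentence(lengths: List[int], j: int) -> int:
    """x(j): position of word j inside its own sentence."""
    k = sentence_index(lengths, j)
    return j - sum(lengths[: k - 1])


def catch_up_offset(src_lengths: List[int], out_lengths: List[int], j: int) -> int:
    """Number of source words in the sentences preceding the one aligned to word j."""
    k = sentence_index(out_lengths, j)
    return sum(src_lengths[: k - 1])


out_len = sentence_lengths("Zu jedem Zeitpunkt. Sie ist auch sehr flexibel.".split())
src_len = sentence_lengths("At any given time. It is also very flexible.".split())
```
]]></preformat>
        <p>For the fifth output word in this example, the catch-up skips the four source words of the first sentence, regardless of how the total lengths of source and output relate.</p>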
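        <p>Finally, our translation latency with sentence-alignment
catch-up is <inline-formula><tex-math>TL^*(o, j) = t_{f(o, j)} - t_{f(s, j^*)}</tex-math></inline-formula>. This is then
averaged for all output words in the document:<sup>1</sup>
<disp-formula><tex-math>TL^* = \frac{1}{|o_I|} \sum_{j=1}^{|o_I|} TL^*(o, j).</tex-math></disp-formula></p>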
      </sec>
      <sec id="sec-5-3">
        <title>SLT Evaluation</title>
        <p>We use one ASR system for English and nine SLT
systems from English into Czech (three models
differing in the data and training parameters), German (two
different systems), French (two different systems), Spanish, and
Russian. All the SLT systems are cascades of an ASR system,
a punctuator, which inserts punctuation and capitalization
into the unsegmented ASR output, and neural machine
translation (NMT) of the text. The systems and their quality
measures are listed in Table 1. DE A, ES, and FR B are NMT systems
adapted for spoken translation as in [9]. The others are
basic sentence-level Transformer NMT systems connected to the ASR.
The ASR is a hybrid DNN-HMM system by [10].</p>
        <p>
          We evaluate the systems on IWSLT tst2015 dataset. We
downloaded the referential translation from the TED
website as [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], and removed the single words in parentheses
because they were not verbatim translations of the speech,
but marked sounds such as applause, laughter, or music.
4.3
        </p>
      </sec>
      <sec id="sec-5-4">
        <title>Reset Rate</title>
        <p>The average character erasure does not reflect the
frequency and size of the individual erasures. Therefore, in
Figure 3, we display the cumulative distribution function of
character erasure in the dataset. The vertical axis is the
percentage of all translation updates in which the
character erasure was less than or equal to the value on the
horizontal axis. E.g., for a subtitler window with a total size
of 140 characters, 99.03 % of the updates of the SLT system CZ
A fit into this area. Table 2 displays the same for selected
sizes, which fit into 1, 2, and 3 lines of a subtitler window
of width 70, and also the percentage of updates without any
erasure (x = 0).</p>
        <p><sup>1</sup> For a set of documents <inline-formula><tex-math>D</tex-math></inline-formula>, <inline-formula><tex-math>TL^* = \frac{\sum_{(o, I) \in D} \sum_{j=1}^{|o_I|} TL^*(o, j)}{\sum_{(o, I) \in D} |o_I|}</tex-math></inline-formula>.</p>
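        <p>The percentage of updates whose character erasure fits into a window of a given size is a point on the empirical cumulative distribution of the erasures. A minimal sketch, with a hypothetical function name and made-up erasure values that are not taken from our data:</p>
        <preformat preformat-type="code"><![CDATA[
```python
from typing import List


def fraction_fitting(erasures: List[int], window_chars: int) -> float:
    """Share of translation updates whose character erasure is at most `window_chars`."""
    if not erasures:
        raise ValueError("no updates to evaluate")
    return sum(1 for e in erasures if e <= window_chars) / len(erasures)


# Illustrative values only.
erasures = [0, 0, 12, 75, 150]
no_erasure = fraction_fitting(erasures, 0)   # updates that only append text
one_line = fraction_fitting(erasures, 70)    # fits into 1 line of width 70
two_lines = fraction_fitting(erasures, 140)  # fits into 2 lines of width 70
```
]]></preformat>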
        <p>The values approximate the expected number of resets.
However, the resets are also affected by the blanking time,
so the real number of resets may be higher if the speech
contains long pauses. The values in Figure 3 thus give
a lower bound on the number of resets.</p>
        <p>The subtitling latency is the difference between the finalization
time of a word in the subtitler and in the SLT. We compute it
similarly to the translation latency, but the word
correspondence is the identity function because the language in the SLT
and the subtitler is the same.</p>
        <p>[Figure 4: subtitling latency plotted against time (s) for window heights 1, 2, and 3.]</p>
        <p>We computed the latency caused by the subtitler with 1,
2, and 3 lines of width 70 for one talk and SLT systems,
see Figure 4. Generally, the bigger the subtitling window,
the lower the latency.</p>
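        <p>Since the word correspondence is the identity, the subtitling latency reduces to a per-word difference of two finalization times. The sketch below assumes that both lists are aligned word by word and given in seconds; the function name and the numbers are illustrative only.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from typing import List


def subtitling_latencies(slt_times: List[float], subtitler_times: List[float]) -> List[float]:
    """Per-word delay added by the subtitler on top of the SLT output."""
    if len(slt_times) != len(subtitler_times):
        raise ValueError("both lists must cover the same words")
    return [sub - slt for slt, sub in zip(slt_times, subtitler_times)]


# Finalization times of three words in the SLT output and in the subtitler.
latencies = subtitling_latencies([1.0, 1.5, 4.0], [1.5, 2.0, 9.0])
average = sum(latencies) / len(latencies)
```
]]></preformat>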
      </sec>
      <sec id="sec-5-5">
        <title>User Evaluation</title>
        <p>We asked one user to rate the overall fluency and
stability of subtitling for the first 7-minute part of
tst2015.en.talkid1922 translated by CZ A. We presented
the user with the subtitles three times, in a window of
width 70 and heights 1, 2, and 3. The minimum reading
time parameter was 15 characters per second. The user
was asked to express their subjective quality assessment by
pressing one of five buttons: undecided (0), horrible (1),
usable with problems (2), minor flaws, but usable (3), and
perfect (4). The user was asked to press the buttons
while reading the subtitles, whenever the assessment
changed. The source audio and video were not presented, so
this setup is comparable to situations where the user does
not understand the source language at all. The user is a
native speaker of Czech.</p>
        <p>Table 3 summarizes the percentage of the assessed
duration for each quality level. The user did not use the level
undecided (0). The main problem that the user reported
was limited readability due to resets and unstable
translations. The flaws in the usable parts of the subtitling were subtle
changes of subtitles that did not distract from reading
the new input, or disfluent formulations.</p>
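        <p>One way to turn a log of button presses into percentages of assessed duration is sketched below. It assumes that a pressed level stays valid until the next press or until the end of the session; this reading, the function name, and the example log are our assumptions for illustration and are not the recorded data.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from typing import Dict, List, Tuple


def rating_shares(presses: List[Tuple[float, int]], end_time: float) -> Dict[int, float]:
    """Percentage of the assessed duration spent at each quality level.

    `presses` holds (time in seconds, level) pairs sorted by time.
    """
    totals: Dict[int, float] = {}
    boundaries = [t for t, _ in presses[1:]] + [end_time]
    for (start, level), stop in zip(presses, boundaries):
        totals[level] = totals.get(level, 0.0) + (stop - start)
    assessed = sum(totals.values())
    if assessed == 0:
        return {}
    return {level: 100.0 * duration / assessed for level, duration in totals.items()}


# Made-up log: level 3 from 0 s, level 1 from 100 s, level 3 again from 150 s.
shares = rating_shares([(0.0, 3), (100.0, 1), (150.0, 3)], end_time=400.0)
```
]]></preformat>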
        <p>In the right-most column of Table 3, we show the
percentage of erasures in the evaluated part of the document
that fit into the subtitling window. We hypothesize that
the automatic measure of character erasure may be used to
estimate the user assessment of readability.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Conclusion</title>
      <p>We proposed an algorithm for presenting automatic speech
translation simultaneously in the limited space of
subtitles. The algorithm is independent of the SLT system. It
ensures a minimum level of stability and allows
simultaneity. Furthermore, we proposed a way of estimating the
reader’s comfort and the overall usability of SLT with
subtitling in limited space, and observed a correspondence with
the user rating. Last but not least, we suggested a catch-up
based on sentence alignment in ASR and SLT to measure
the translation latency simply and realistically.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>The research was partially supported by the grants
CZ.07.1.02/0.0/0.0/16_023/0000108 (Operational
Programme – Growth Pole of the Czech Republic),
H2020-ICT-2018-2-825460 (ELITR) of the EU, 398120 of the
Grant Agency of Charles University, and by SVV project
number 260 575.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] <string-name><given-names>J.</given-names> <surname>Niehues</surname></string-name> et al., “<article-title>Dynamic transcription for low-latency speech translation</article-title>,” in <source>Proceedings of Interspeech</source>, <year>2016</year>.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] <string-name><given-names>N.</given-names> <surname>Arivazhagan</surname></string-name>, <string-name><given-names>C.</given-names> <surname>Cherry</surname></string-name>, <string-name><given-names>T.</given-names> <surname>I</surname></string-name>, <string-name><given-names>W.</given-names> <surname>Macherey</surname></string-name>, <string-name><given-names>P.</given-names> <surname>Baljekar</surname></string-name>, and <string-name><given-names>G.</given-names> <surname>Foster</surname></string-name>, “<article-title>Re-translation strategies for long form, simultaneous, spoken language translation</article-title>,” <year>2019</year>.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] <string-name><given-names>N.</given-names> <surname>Arivazhagan</surname></string-name>, <string-name><given-names>C.</given-names> <surname>Cherry</surname></string-name>, <string-name><given-names>W.</given-names> <surname>Macherey</surname></string-name>, and <string-name><given-names>G.</given-names> <surname>Foster</surname></string-name>, “<article-title>Re-translation versus streaming for simultaneous translation</article-title>,” <source>ArXiv</source>, vol. abs/2004.03643, <year>2020</year>.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] <string-name><given-names>F.</given-names> <surname>Karamitroglou</surname></string-name>, “<article-title>A proposed set of subtitling standards in Europe</article-title>,” <source>Translation Journal</source>, vol. <volume>2</volume>, no. <issue>4</issue>, <year>1998</year>.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] <string-name><given-names>A.</given-names> <surname>Szarkowska</surname></string-name> and <string-name><given-names>O.</given-names> <surname>Gerber-Morón</surname></string-name>, “<article-title>Viewers can keep up with fast subtitles: Evidence from eye movements</article-title>,” in <source>PLoS ONE</source>, <year>2018</year>.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] <string-name><given-names>M.</given-names> <surname>Post</surname></string-name>, “<article-title>A call for clarity in reporting BLEU scores</article-title>,” <source>Association for Computational Linguistics</source>, <year>2018</year>.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] <string-name><given-names>E.</given-names> <surname>Matusov</surname></string-name> et al., “<article-title>Evaluating machine translation output with automatic sentence segmentation</article-title>,” in <source>International Workshop on Spoken Language Translation</source>, Oct. <year>2005</year>.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] <string-name><given-names>P.</given-names> <surname>Koehn</surname></string-name> et al., “<article-title>Moses: Open source toolkit for statistical machine translation</article-title>,” ser. <source>ACL '07</source>, <year>2007</year>.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] <string-name><given-names>J.</given-names> <surname>Niehues</surname></string-name> et al., “<article-title>Low-latency neural speech translation</article-title>,” <source>Interspeech 2018</source>, Sep. <year>2018</year>.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] <string-name><given-names>T.-S.</given-names> <surname>Nguyen</surname></string-name> et al., “<article-title>The 2017 KIT IWSLT Speech-to-Text Systems for English and German</article-title>,” December 14-15, <year>2017</year>.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>