=Paper= {{Paper |id=Vol-2718/paper23 |storemode=property |title=Presenting Simultaneous Translation in Limited Space |pdfUrl=https://ceur-ws.org/Vol-2718/paper23.pdf |volume=Vol-2718 |authors=Dominik Macháček,Ondřej Bojar |dblpUrl=https://dblp.org/rec/conf/itat/MachacekB20 }} ==Presenting Simultaneous Translation in Limited Space== https://ceur-ws.org/Vol-2718/paper23.pdf
                      Presenting Simultaneous Translation in Limited Space

                                                     Dominik Macháček, Ondřej Bojar

                                                                Charles University
                                                      Faculty of Mathematics and Physics
                                                  Institute of Formal and Applied Linguistics
                                                  {machacek,bojar}@ufal.mff.cuni.cz

Abstract: Some methods of automatic simultaneous translation of a long-form speech allow revisions of outputs, trading accuracy for low latency. Deploying these systems for users faces the problem of presenting subtitles in a limited space, such as two lines on a television screen. The subtitles must be shown promptly, incrementally, and with adequate time for reading. We provide an algorithm for subtitling. Furthermore, we propose a way to estimate the overall usability of the combination of automatic translation and subtitling by measuring the quality, latency, and stability on a test set, and propose an improved measure for translation latency.

1 Introduction

The quality of automatic speech recognition and machine translation of texts is constantly increasing. This opens the opportunity to connect these two components and use them for spoken language translation (SLT). The output of the SLT system can be delivered to users either as speech or text. In simultaneous SLT, where the output has to be delivered during the speech with as low delay as possible, there is a trade-off between latency and quality. With textual output, it is possible to present users with early, partial translation hypotheses at low latency, and correct them later by final, more accurate updates, after the system receives more context for disambiguation, or after a secondary big model produces its translation. Rewriting brings another challenge: the stability of output. If the updates are too frequent, the user is unable to read the text. The problem of unstable output could be solved by using a big space for showing subtitles. The unstable, flickering output would appear only at the end, allowing the user to easily ignore the flickering part and read only the stabilized part of the output. However, in many situations, the space for subtitles is restricted. For example, if the users have to follow the speaker and slides at the same time, they lack the mental capacity to search for the stabilized part of the translations. It is, therefore, necessary to put the subtitles and slides on the same screen, restricting the subtitling area to a small window.
   In this paper, we propose an algorithm for presenting SLT subtitles in limited space, a way of estimating the overall usability of simultaneous SLT subtitling in a limited area, and an improved translation latency measure for SLT comparison. Section 2 describes the properties of SLT for use with our subtitler. Section 3 details the main new component for presenting a text stream as readable subtitles. Section 4 proposes the estimation of the usability of the subtitling of multiple realistic SLT systems. We conclude the paper in Section 5.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

2 Re-Translating Spoken Language Translation

Our subtitler solves the problem of presenting SLT output with re-translated early hypotheses, similarly to [1, 2, 3]. Although it can also present the subtitles from automatic speech recognition (ASR) that re-estimates early hypotheses, or generally from any audio-to-text processor, we limit ourselves to SLT in this paper for brevity.

2.1 Stable and Unstable Segments

SLT systems output a potentially infinite stream of segments containing the beginning and final timestamps of an interval from the source audio, and the translated text in that interval. We assume that the segments can be marked as stable or unstable, depending on whether the system still has the possibility to change them. This is a realistic assumption because ASR and SLT systems usually process a limited window of the source audio. Whenever a part of the source audio exceeds this window, the corresponding output becomes stable.

3 Subtitler

This section presents the design and algorithm of the "subtitler".
   The subtitler is a cache on a stream of input messages aiming to satisfy the following conflicting needs:

   • The output should be presented with the lowest possible delay to achieve the effect of simultaneous translation as much as possible.
   • The flickering of the partial outputs is partially desired because it highlights the simultaneity of the translation and comforts the user in knowing that the system is not stuck.
Input 1:   23 134 STABLE Pixelen auf Ihrem Bildschirm.
           134 189 UNSTABLE Zu jedem Zeitpunk. Es ist auch eine sehr flexible Architektur...


                            Window 1.1                          Window 1.2

  Buffer: Pixelen auf Ihrem Bildschirm. Zu jedem Zeitpunkt. Es ist auch eine sehr flexible Architektur...
                         STABLE             UNSTABLE



Input 2:   134 156 STABLE Zu jedem Zeitpunk.
           156 210 UNSTABLE Sie ist auch sehr flexibel. Die architektur ist ein ganzes Buch.
                                                                               Reset window 2.2
                             Window 2.1                           Window 2.2
 Buffer:    Pixelen auf Ihrem Bildschirm. Zu jedem Zeitpunkt. Sie ist auch sehr flexibel. Die architektur ist ein ganzes Buch.
                                                 STABLE            UNSTABLE


Figure 1: Illustration of speech translation subtitling in two subsequent inputs from SLT. The input arrives as a sequence
of quadruples: segment beginning time, segment end time, stable/unstable flag, text. The rectangles indicate the content
of the subtitling area of one line.
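As a sketch of the input format shown in Figure 1 and the buffer update it illustrates, the following Python fragment parses the quadruples and merges an update into the buffer. The class and function names are ours, assumed for illustration; this is not the subtitler's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    start: int      # beginning timestamp in the source audio
    end: int        # end timestamp in the source audio
    stable: bool    # True once the SLT system may no longer revise it
    text: str

def parse_line(line: str) -> Segment:
    """Parse one 'start end FLAG text...' quadruple as in Figure 1."""
    start, end, flag, text = line.split(maxsplit=3)
    return Segment(int(start), int(end), flag == "STABLE", text)

def merge(buffer: list[Segment], update: list[Segment]) -> list[Segment]:
    """Replace outdated segments with their new versions.

    Everything from the first updated timestamp onward is re-stated by
    the SLT system, so we keep only the older buffered segments and
    append the update.
    """
    cut = update[0].start
    return [s for s in buffer if s.start < cut] + update

# Inputs 1 and 2 of Figure 1, abbreviated:
buffer = [parse_line("23 134 STABLE Pixelen auf Ihrem Bildschirm."),
          parse_line("134 189 UNSTABLE Zu jedem Zeitpunkt.")]
update = [parse_line("134 156 STABLE Zu jedem Zeitpunkt."),
          parse_line("156 210 UNSTABLE Sie ist auch sehr flexibel.")]
buffer = merge(buffer, update)
```

After the merge, the previously unstable tail is replaced by its newer stable and unstable versions, exactly as in the second buffer row of Figure 1.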


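The presentation window described in this section, regenerating the line view by wrapping the buffer text and scrolling the top line away after the minimum reading time, could be sketched roughly as follows. `window_lines` and `scroll` are our names for a simplified, single-threaded illustration, not the actual two-threaded implementation.

```python
import textwrap

def window_lines(buffer_text: str, start: int, width: int, height: int) -> list[str]:
    """Regenerate the line view: wrap words from the current starting
    position of the window in the buffer into at most `height` + 1 lines
    (the extra line becomes visible when the window scrolls)."""
    lines = textwrap.wrap(buffer_text[start:], width=width)
    return lines[:height + 1]

def scroll(buffer_text: str, start: int, width: int) -> int:
    """After the minimum reading time, discard the top line by advancing
    the window start past the first wrapped line (assumes single spaces
    between words, as in our buffer text)."""
    lines = textwrap.wrap(buffer_text[start:], width=width)
    if not lines:
        return start
    return start + len(lines[0]) + 1  # +1 skips the space eaten by wrapping

text = "Pixelen auf Ihrem Bildschirm. Zu jedem Zeitpunkt."
view = window_lines(text, 0, width=20, height=2)
```

A forced update before `start` would correspond to a reset: moving `start` back and regenerating the view non-incrementally.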
   • The flickering should be minimized. If some output was presented at a position of the screen, it should keep that position until it is outdated.
   • The user must have enough time to read the message.
   • Only a predefined space of w (width) characters and h (height) lines is available.

   Given an input stream of stable and unstable segments as described above, the subtitler emits a stream of "subtitle windows". On every update, the former window is replaced by a new one.
   The basic operation of the subtitler is depicted in Figures 1 and 2. The elements of the subtitler are a buffer of input segments, a presentation window, and two independent processing threads.

Figure 2: Subtitler processing of the inputs in Figure 1 with different timings. In the left one, Input 2 changes the word "Es", which has already been read by the user and scrolled away, and thus causes a reset of the window start. In the right one, the word "Es" is changed while still in the window on the current display.

   The buffer is an ordered list of segments. The presentation window is organized as a list of text lines of the required width and count. The count corresponds to the height of the subtitling window plus one, to allow scrolling up the top line after displaying it for the minimum reading time. This line view is regenerated whenever needed from the current starting position of the window in the buffer, wrapping words into lines.
   The input thread receives the input stream and updates the buffer. It replaces outdated segments with their new versions, extends the buffer, and removes old unnecessary segments. If an update happens within or before the current position of the presentation window, the output thread is notified for a forced update.
   Independently, the output thread updates the position of the presentation window in the buffer, obeying the following timeouts and triggers:

   • On forced updates, the output thread detects whether any content changed before the beginning of the already presented window, which would cause a reset. In that case, the window position in the buffer has to be moved back, and the content can no longer be presented to the user incrementally. Instead, the beginning of the first line in the window shows a newer version of an old sentence that has already scrolled away.
   • If the first line of the presentation window has not been changed for the minimum reading time and if there is any input to present in the extra line of the window, the window is "scrolled" by one line, i.e., the first line is discarded, the window starting position within the buffer is updated, and the extra line is shown as the last line of the window.
   • If the whole presentation window has not been changed for a long time, e.g., 5 or 20 seconds, it is blanked by emitting empty lines.

3.1 Timing Parameters

The subtitler requires two timing parameters. A line of subtitles is displayed to a user for a “minimum reading
time” before it can be scrolled away. If no input arrives for a “blank time”, the subtitling window blanks to indicate it and to prevent the user from unnecessarily reading the last message.
   We suggest adopting the minimum reading time parameter from the standards for subtitling films and videos (e.g., [4]), until standards for simultaneous SLT subtitling are established. [5] claim that 15 characters per second is a standard reading time in English interlingual subtitling of films and videos for the deaf and hard of hearing. The standards in other European regions are close to 15 characters per second. We use this value for the evaluation in this work.

4 Estimating Usability

The challenges in simultaneous SLT are quality, latency, and stability [1, 2]. All of these properties are critical for the overall usability of the SLT system. The quality of translation is a property of the SLT system; the subtitler has no impact on it. The minimum reading time ensures a minimum level of stability, making every stable content readable, and may increase the latency if the original speech is faster than the reading time. The size of the subtitling window and the timing parameters affect the overall latency and stability. The bigger the window, the longer the translation updates that fit into it without a reset. The timing parameters determine how long the content stays unchanged in the window before scrolling. A small subtitling window or a short reading or blanking time may cause a reset. Every reset increases latency because it returns to already displayed content. On the other hand, a significant latency may improve stability by skipping the early unstable hypotheses and presenting only the stable ones.
   We provide three automatic measures for assessing the practical usability of simultaneous SLT subtitling on a test set. The automatic evaluation may serve for a rough estimation of the usefulness, or for the selection of the best candidate setups. We do not provide a strict way to judge which SLT system and subtitling setup are useful and which are not. The final decision should ideally consider the particular display conditions, expectations, and needs of the users, and should be based on a significant human evaluation.

4.1 Evaluation Measures

For quality, we report the automatic machine translation measure BLEU, computed by sacrebleu [6] after automatic sentence alignment using mwerSegmenter [7]. BLEU is considered to correlate with human quality judgement; the higher the BLEU, the higher the translation quality.
   To explain the measures of latency and stability, let us use the terminology of [2]. The EventLog is an ordered list of events. The i-th event is a triple (s_i, o_i, t_i), where s_i is the source text recognized so far, o_i is the current SLT output, and t_i is the time when this event was produced. Source and output, s_i and o_i, are sequences of tokens. Let us denote by c(o_i) a transformation of a token sequence into a sequence of characters, including spaces and punctuation. Let I be the number of all events, with an update either in the source or in the output, and T the number of events with an update in the translation.

Character Erasure To evaluate how many updates fit into the subtitling window, we define character erasure (cE). It is the number of characters that must be deleted from the tail of the current translation hypothesis to update it to a new one. If a new translation only appends words to the end, the erasure is zero. The character erasure is cE(i) = |c(o_{i−1})| − |LCP(c(o_i), c(o_{i−1}))|, where LCP stands for the longest common prefix. The average character erasure is AcE = (1/T) ∑_{i=1}^{I} cE(i). It is inspired by the normalized erasure (NE) of [2], but we do not divide by the output length in the final event, only by the number of translation events.

Translation Latency with Sentence-Alignment Catch-up The translation latency may be measured using the finalization event of the j-th word in the output. It is f(o, j) = min i such that o_{i′,j′} = o_{I,j′} for all i′ ≥ i and all j′ ≤ j. In other words, the word j is finalized in the first event i for which the word j and all the preceding words j′ remain unchanged in all subsequent events i′.
   The translation latency of output word j is the time difference between the finalization event of the word j in the output and that of its corresponding word j* in the source. [2] estimate the source word simply as j* = (j/|o_I|)|s_I|. This is problematic if the output is substantially shorter than the input, because then it may incorrectly base the latency on a word which has not been uttered yet, leading to a negative time difference. A proper word alignment would provide the most reliable correspondence. However, we propose a simpler and appropriately reliable solution. The following improved measure is our novel contribution. We use it to compare the SLT systems.
   We utilize the fact that our ASR produces punctuated text, where the sentence boundaries can be detected. The sentences coming from SLT and ASR in their sequential order are parallel. They can be simply aligned because our SLT systems translate the individual sentences and keep the sentence boundaries. If the SLT does not produce individual sentences, then we use a rule-based sentence segmenter, e.g. from [8], and must be aware of the potential inaccuracy.
   We use the sentence alignment for a catch-up, and the simple temporal correspondence of [2] only within the last sentence. To express it formally, let us assume that the EventLog also has a function S(o, j), returning the index of the sentence containing the word j in o, and L(o, k), the length of the sentence k in o. Let x(j) = j − ∑_{i=1}^{S(o,j)−1} L(o, i) be the index of an output word j in its
Table 1: Quality measures of the English ASR and SLT systems from English into the target language in the left-most column, on IWSLT tst2015. The letters A, B, C denote different variants of SLT systems with the same target. Translation lag (TL*) is in seconds. AcE is average character erasure, NE is normalized erasure.

   SLT        BLEU      TL*     AcE     NE
   EN (ASR)   58.4747           29.22   5.88
   CZ A       17.5441   2.226   24.20   7.05
   CZ B       12.2914   2.622   29.48   5.30
   CZ C       18.1505   2.933   27.90   3.93
   DE A       15.2678   3.506   47.32   1.39
   DE B       15.9672   1.845   38.12   5.46
   ES         21.8516   5.429   43.30   1.49
   FR A       25.8964   1.269   31.97   3.32
   FR B       20.5367   5.425   47.92   1.46
   RU         11.6279   3.168   31.78   4.05

Figure 3: The percentage of translation updates in the validation set with the character erasure less than or equal to the value on the x-axis, for all our ASR and SLT systems. The x-axis corresponds to the size of the subtitling window. [Line plot, one curve per system; y-axis: % of translation updates, 65 to 100; x-axis: character erasure (cE), 80 to 180.]
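The character erasure behind Figure 3 and Table 2 is straightforward to compute. The following Python sketch, with function names of our own choosing, implements cE via the longest common prefix and the cumulative percentage plotted in Figure 3:

```python
import os

def character_erasure(prev_hyp: str, new_hyp: str) -> int:
    """cE = |c(o_{i-1})| - |LCP(c(o_i), c(o_{i-1}))|: the number of
    trailing characters of the previous hypothesis that must be deleted
    to update it to the new one."""
    lcp = os.path.commonprefix([prev_hyp, new_hyp])  # character-wise prefix
    return len(prev_hyp) - len(lcp)

def erasure_cdf_percent(erasures: list[int], x: int) -> float:
    """Percentage of translation updates with cE <= x, as in Figure 3."""
    return 100.0 * sum(e <= x for e in erasures) / len(erasures)
```

For example, revising the hypothesis "Zu jedem Zeitpunk." to "Zu jedem Zeitpunkt." deletes one trailing character, so its cE is 1; an update that only appends words has cE of 0.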
sentence. Then we define our caught-up correspondence as

   j** = ∑_{i=1}^{S(o,j)−1} L(s, i) + ⌊x(j) · L(s, S(o,j)) / L(o, S(o,j))⌋

   Finally, our translation latency with sentence-alignment catch-up is TL*(o, j) = t_{f(o,j)} − t_{f(s,j**)}. This is then averaged over all output words in the document: TL* = (1/|o_I|) ∑_{j=1}^{|o_I|} TL*(o, j).¹

¹ For a set of documents D, TL* = (∑_{o,I∈D} ∑_{j=1}^{|o_I|} TL*(o, j)) / (∑_{o,I∈D} |o_I|).

4.2 SLT Evaluation

We use one ASR system for English and nine SLT systems from English into Czech (three different models differing in the data and training parameters), German (two different systems), French (two different systems), Spanish, and Russian. All the SLT systems are cascades of an ASR, a punctuator, which inserts punctuation and capitalization into unsegmented ASR output, and a neural machine translation (NMT) system translating from the text. The systems and their quality measures are in Table 1. DE A, ES, and FR B are NMT adapted for spoken translation as in [9]. The others are basic sentence-level Transformer NMT connected to ASR. The ASR is a hybrid DNN-HMM by [10].
   We evaluate the systems on the IWSLT tst2015 dataset. We downloaded the reference translations from the TED website as [2], and removed the single words in parentheses because they were not verbatim translations of the speech, but marked sounds such as applause, laughter, or music.

4.3 Reset Rate

The average character erasure does not reflect the frequency and size of the individual erasures. Therefore, in Figure 3, we display the cumulative distribution function of character erasure in the dataset. The vertical axis is the percentage of all translation updates in which the character erasure was shorter than or equal to the value on the horizontal axis. E.g., for a subtitler window with a total size of 140 characters, 99.03 % of the updates of SLT CZ A fit into this area. Table 2 displays the same for selected sizes, which fit into 1, 2, and 3 lines of a subtitler window of width 70, and also the percentage of updates without any erasure (x = 0).
   The values approximate the expected number of resets. However, the resets are also affected by the blanking time, so the real number of resets may be higher if the speech contains long pauses. The percentage in Figure 3 serves as a lower bound.

Table 2: Percentage of character erasures in all translation updates which are shorter than or equal to x characters, for selected values of x.

   SLT        x = 0   x = 70   x = 140   x = 210
   EN (ASR)   20.76   84.23    99.96     100.00
   CZ A       41.37   91.98    99.03      99.76
   CZ B       28.61   89.78    98.63      99.77
   CZ C       30.93   88.31    98.53      99.72
   FR A       31.65   84.47    98.14      99.51
   RU         35.42   85.17    97.82      99.38
   ES         29.01   71.71    97.08      99.43
   DE B       27.89   80.90    97.05      99.38
   DE A       30.85   67.65    95.83      99.13
   FR B       30.39   66.15    95.67      99.39

4.4 Subtitling Latency

The subtitling latency is the difference between the finalization time of a word in the subtitler and in the SLT. We compute it similarly to the translation latency, but the word correspondence is the identity function because the language in the SLT and the subtitler is the same.
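Given per-sentence token counts, the caught-up correspondence j** reduces to a few lines of integer arithmetic. The following Python sketch uses our own function names and 1-based word indices as in the text; it is illustrative, not the paper's implementation:

```python
def caught_up_correspondence(j, src_sent_lens, out_sent_lens):
    """Estimate the source word index j** for output word j (1-based):
    whole aligned sentences are caught up exactly, and the proportional
    correspondence of [2] is applied only inside the last sentence."""
    # find the sentence containing output word j (k = S(o, j) - 1, 0-based)
    k, words_before = 0, 0
    while words_before + out_sent_lens[k] < j:
        words_before += out_sent_lens[k]
        k += 1
    x = j - words_before                 # x(j): index of word j in its sentence
    src_before = sum(src_sent_lens[:k])  # sum of L(s, i) over preceding sentences
    # floor(x(j) * L(s, S(o,j)) / L(o, S(o,j)))
    return src_before + (x * src_sent_lens[k]) // out_sent_lens[k]
```

For example, with source sentence lengths [6, 8] and output sentence lengths [5, 4], the second word of the second output sentence maps to source word 6 + floor(2 · 8 / 4) = 10. TL*(o, j) is then the difference between the finalization times of output word j and source word j**.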
Figure 4: Subtitling latency (y-axis) over time (x-axis) for tst2015.en.talkid1922 translated by CZ A. The subtitling window has width 70 and height 1, 2, and 3 lines. The minimum reading time is 15 characters per second (one line per 4.7 s). [Line plot, one curve per window height; y-axis: subtitler lag (s), 0 to 40; x-axis: time (s), 0 to 600.]

   We computed the latency caused by the subtitler with 1, 2, and 3 lines of width 70 for one talk and SLT system, see Figure 4. Generally, the bigger the subtitling window, the lower the latency.

4.5 User Evaluation

We asked one user to rate the overall fluency and stability of the subtitling for the first 7-minute part of tst2015.en.talkid1922 translated by CZ A. We presented the user with the subtitles three times, in a window of width 70 and heights 1, 2, and 3. The minimum reading time parameter was 15 characters per second. The user was asked to express his subjective quality assessment by pressing one of five buttons: undecided (0), horrible (1), usable with problems (2), minor flaws, but usable (3), and perfect (4). The user was asked to press them while reading the subtitles, whenever the assessment changed. The source audio or video was not presented, so this setup is comparable to situations where the user does not understand the source language at all. The user is a native speaker of Czech.
   Table 3 summarizes the percentage of the assessed duration spent at each quality level. The user never used the level undecided (0). The main problem that the user reported was limited readability due to resets and unstable translations. The flaws in usable parts of the subtitling were subtle changes of subtitles which did not distract from reading the new input, or disfluent formulations.
   In the right-most column of Table 3, we show the percentage of erasures in the part of the evaluated document which fit into the subtitling window. We hypothesize that the automatic measure of character erasure may be used to estimate the user assessment of readability.

Table 3: Results of the user evaluation with three subtitling windows of different heights (h). Quality level 4 is the highest, 1 is the lowest. The right-most column is the percentage of erasures fitting into the subtitling window.

            Percentage of quality levels
   height   level=1   level=2   level=3   level=4   cE < 70 · h
   h=1      35.27 %   28.79 %   14.95 %   20.99 %    88.59 %
   h=2      11.08 %   29.94 %   35.73 %   23.24 %    98.73 %
   h=3      16.33 %   19.90 %   33.67 %   30.11 %    99.64 %

5 Conclusion

We proposed an algorithm for presenting automatic speech translation simultaneously in the limited space of subtitles. The algorithm is independent of the SLT system. It ensures a minimum level of stability and allows simultaneity. Furthermore, we proposed a way of estimating the reader's comfort and the overall usability of SLT with subtitling in limited space, and observed a correspondence with the user rating. Last but not least, we suggested a catch-up based on the sentence alignment of ASR and SLT to measure the translation latency simply and realistically.

Acknowledgments

The research was partially supported by the grant CZ.07.1.02/0.0/0.0/16_023/0000108 (Operational Programme – Growth Pole of the Czech Republic), H2020-ICT-2018-2-825460 (ELITR) of the EU, 398120 of the Grant Agency of Charles University, and by SVV project number 260 575.

References

[1] J. Niehues et al., "Dynamic transcription for low-latency speech translation," in Proceedings of Interspeech, 2016.
[2] N. Arivazhagan, C. Cherry, T. I, W. Macherey, P. Baljekar, and G. Foster, "Re-translation strategies for long form, simultaneous, spoken language translation," 2019.
[3] N. Arivazhagan, C. Cherry, W. Macherey, and G. Foster, "Re-translation versus streaming for simultaneous translation," ArXiv, vol. abs/2004.03643, 2020.
[4] F. Karamitroglou, "A proposed set of subtitling standards in Europe," Translation Journal, vol. 2, 4, 1998.
[5] A. Szarkowska and O. Gerber-Morón, "Viewers can keep up with fast subtitles: Evidence from eye movements," in PLoS ONE, 2018.
[6] M. Post, "A call for clarity in reporting BLEU scores." Association for Computational Linguistics, 2018.
[7] E. Matusov et al., "Evaluating machine translation output with automatic sentence segmentation," in International Workshop on Spoken Language Translation, Oct. 2005.
[8] P. Koehn et al., "Moses: Open source toolkit for statistical machine translation," ser. ACL '07, 2007.
[9] J. Niehues et al., "Low-latency neural speech translation," Interspeech 2018, Sep. 2018.
[10] T.-S. Nguyen et al., "The 2017 KIT IWSLT Speech-to-Text Systems for English and German," December 14–15, 2017.