=Paper= {{Paper |id=Vol-2718/paper23 |storemode=property |title=Presenting Simultaneous Translation in Limited Space |pdfUrl=https://ceur-ws.org/Vol-2718/paper23.pdf |volume=Vol-2718 |authors=Dominik Macháček,Ondřej Bojar |dblpUrl=https://dblp.org/rec/conf/itat/MachacekB20 }} ==Presenting Simultaneous Translation in Limited Space== https://ceur-ws.org/Vol-2718/paper23.pdf
                      Presenting Simultaneous Translation in Limited Space

                                                     Dominik Macháček, Ondřej Bojar

                                                                Charles University
                                                      Faculty of Mathematics and Physics
                                                  Institute of Formal and Applied Linguistics
                                                  {machacek,bojar}@ufal.mff.cuni.cz

Abstract: Some methods of automatic simultaneous translation of a long-form speech allow revisions of outputs, trading accuracy for low latency. Deploying these systems for users faces the problem of presenting subtitles in a limited space, such as two lines on a television screen. The subtitles must be shown promptly, incrementally, and with adequate time for reading. We provide an algorithm for subtitling. Furthermore, we propose a way to estimate the overall usability of the combination of automatic translation and subtitling by measuring the quality, latency, and stability on a test set, and propose an improved measure for translation latency.

1 Introduction

The quality of automatic speech recognition and machine translation of texts is constantly increasing. This opens the opportunity to connect these two components and use them for spoken language translation (SLT). The output of the SLT system can be delivered to users either as speech or text. In simultaneous SLT, where the output has to be delivered during the speech with as low delay as possible, there is a trade-off between latency and quality. With textual output, it is possible to present users with early, partial translation hypotheses at low latency, and correct them later by final, more accurate updates, after the system receives more context for disambiguation, or after a secondary big model produces its translation. Rewriting brings another challenge: the stability of output. If the updates are too frequent, the user is unable to read the text. The problem of unstable output could be solved by using a big space for showing subtitles. The unstable, flickering output would appear only at the end, allowing the user to easily ignore the flickering part and read only the stabilized part of the output. However, in many situations, the space for subtitles is restricted. For example, if the users have to follow the speaker and slides at the same time, they lack the mental capacity to search for the stabilized part of the translations. It is, therefore, necessary to put the subtitles and slides on the same screen, restricting the subtitling area to a small window.
   In this paper, we propose an algorithm for presenting SLT subtitles in limited space, a way of estimating the overall usability of simultaneous SLT subtitling in a limited area, and an improved translation latency measure for SLT comparison. Section 2 describes the properties of SLT for use with our subtitler. Section 3 details the main new component for presenting a text stream as readable subtitles. Section 4 proposes the estimation of the usability of the subtitling of multiple realistic SLT systems. We conclude the paper in Section 5.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

2 Re-Translating Spoken Language Translation

Our subtitler solves the problem of presenting SLT output with re-translated early hypotheses, similarly to [1, 2, 3]. Although it can also present the subtitles from automatic speech recognition (ASR) that re-estimates early hypotheses, or generally from any audio-to-text processor, we limit ourselves to SLT in this paper for brevity.

2.1 Stable and Unstable Segments

SLT systems output a potentially infinite stream of segments containing the beginning and final timestamps of an interval from the source audio, and the translated text in that interval. We assume that the segments can be marked as stable or unstable, depending on whether the system still has the possibility to change them. This is a realistic assumption because ASR and SLT systems usually process a limited window of the source audio. Whenever a part of the source audio exceeds this window, the corresponding output becomes stable.

3 Subtitler

This section presents the design and algorithm of the "subtitler".
   The subtitler is a cache on a stream of input messages aiming to satisfy the following conflicting needs:

   • The output should be presented with the lowest possible delay to achieve the effect of simultaneous translation as much as possible.
   • The flickering of the partial outputs is partially desired because it highlights the simultaneity of the translation and comforts the user in knowing that the system is not stuck.
Input 1:   23 134 STABLE Pixelen auf Ihrem Bildschirm.
           134 189 UNSTABLE Zu jedem Zeitpunk. Es ist auch eine sehr flexible Architektur...


                            Window 1.1                          Window 1.2

  Buffer: Pixelen auf Ihrem Bildschirm. Zu jedem Zeitpunkt. Es ist auch eine sehr flexible Architektur...
                         STABLE             UNSTABLE



Input 2:   134 156 STABLE Zu jedem Zeitpunk.
           156 210 UNSTABLE Sie ist auch sehr flexibel. Die architektur ist ein ganzes Buch.
                                                                               Reset window 2.2
                             Window 2.1                           Window 2.2
 Buffer:    Pixelen auf Ihrem Bildschirm. Zu jedem Zeitpunkt. Sie ist auch sehr flexibel. Die architektur ist ein ganzes Buch.
                                                 STABLE            UNSTABLE


Figure 1: Illustration of speech translation subtitling in two subsequent inputs from SLT. The input arrives as a sequence
of quadruples: segment beginning time, segment end time, stable/unstable flag, text. The rectangles indicate the content
of the subtitling area of one line.
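As a sketch of the input format shown in Figure 1 and the buffer update it illustrates, the following Python fragment parses the quadruples and merges an update into the buffer. The class and function names are ours, assumed for illustration; this is not the subtitler's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    start: int      # beginning timestamp in the source audio
    end: int        # end timestamp in the source audio
    stable: bool    # True once the SLT system may no longer revise it
    text: str

def parse_line(line: str) -> Segment:
    """Parse one 'start end FLAG text...' quadruple as in Figure 1."""
    start, end, flag, text = line.split(maxsplit=3)
    return Segment(int(start), int(end), flag == "STABLE", text)

def merge(buffer: list[Segment], update: list[Segment]) -> list[Segment]:
    """Replace outdated segments with their new versions.

    Everything from the first updated timestamp onward is re-stated by
    the SLT system, so we keep only the older buffered segments and
    append the update.
    """
    cut = update[0].start
    return [s for s in buffer if s.start < cut] + update

# Inputs 1 and 2 of Figure 1, abbreviated:
buffer = [parse_line("23 134 STABLE Pixelen auf Ihrem Bildschirm."),
          parse_line("134 189 UNSTABLE Zu jedem Zeitpunkt.")]
update = [parse_line("134 156 STABLE Zu jedem Zeitpunkt."),
          parse_line("156 210 UNSTABLE Sie ist auch sehr flexibel.")]
buffer = merge(buffer, update)
```

After the merge, the previously unstable tail is replaced by its newer stable and unstable versions, exactly as in the second buffer row of Figure 1.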


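The presentation window described in this section, regenerating the line view by wrapping the buffer text and scrolling the top line away after the minimum reading time, could be sketched roughly as follows. `window_lines` and `scroll` are our names for a simplified, single-threaded illustration, not the actual two-threaded implementation.

```python
import textwrap

def window_lines(buffer_text: str, start: int, width: int, height: int) -> list[str]:
    """Regenerate the line view: wrap words from the current starting
    position of the window in the buffer into at most `height` + 1 lines
    (the extra line becomes visible when the window scrolls)."""
    lines = textwrap.wrap(buffer_text[start:], width=width)
    return lines[:height + 1]

def scroll(buffer_text: str, start: int, width: int) -> int:
    """After the minimum reading time, discard the top line by advancing
    the window start past the first wrapped line (assumes single spaces
    between words, as in our buffer text)."""
    lines = textwrap.wrap(buffer_text[start:], width=width)
    if not lines:
        return start
    return start + len(lines[0]) + 1  # +1 skips the space eaten by wrapping

text = "Pixelen auf Ihrem Bildschirm. Zu jedem Zeitpunkt."
view = window_lines(text, 0, width=20, height=2)
```

A forced update before `start` would correspond to a reset: moving `start` back and regenerating the view non-incrementally.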
   • The flickering should be minimized. If some output was presented at a position of the screen, it should keep that position until it is outdated.
   • The user must have enough time to read the message.
   • Only a predefined space of w (width) characters and h (height) lines is available.

   Given an input stream of stable and unstable segments as described above, the subtitler emits a stream of "subtitle windows". On every update, the former window is replaced by a new one.
   The basic operation of the subtitler is depicted in Figures 1 and 2. The elements of the subtitler are a buffer of input segments, a presentation window, and two independent processing threads.

Figure 2: Subtitler processing of the inputs in Figure 1 with different timings. In the left one, Input 2 changes the word "Es", which has already been read by the user and scrolled away, and thus causes a reset of the window start. In the right one, the word "Es" is changed while still in the window on the current display.

   The buffer is an ordered list of segments. The presentation window is organized as a list of text lines of the required width and count. The count corresponds to the height of the subtitling window plus one, to allow scrolling up the top line after displaying it for the minimum reading time. This line view is regenerated whenever needed from the current starting position of the window in the buffer, wrapping words into lines.
   The input thread receives the input stream and updates the buffer. It replaces outdated segments with their new versions, extends the buffer, and removes old unnecessary segments. If an update happens within or before the current position of the presentation window, the output thread is notified for a forced update.
   Independently, the output thread updates the position of the presentation window in the buffer, obeying the following timeouts and triggers:

   • On forced updates, the output thread detects whether any content changed before the beginning of the already presented window, which would cause a reset. In that case, the window position in the buffer has to be moved back, and the content can no longer be presented to the user incrementally. Instead, the beginning of the first line in the window shows a newer version of an old sentence that has already scrolled away.
   • If the first line of the presentation window has not been changed for the minimum reading time and if there is any input to present in the extra line of the window, the window is "scrolled" by one line, i.e., the first line is discarded, the window starting position within the buffer is updated, and the extra line is shown as the last line of the window.
   • If the whole presentation window has not been changed for a long time, e.g., 5 or 20 seconds, it is blanked by emitting empty lines.

3.1 Timing Parameters

The subtitler requires two timing parameters. A line of subtitles is displayed to a user for a “minimum reading
time” before it can be scrolled away. If no input arrives for a “blank time”, the subtitling window blanks to indicate it and to prevent the user from unnecessarily reading the last message.
   We suggest adopting the minimum reading time parameter from the standards for subtitling films and videos (e.g., [4]), until standards for simultaneous SLT subtitling are established. [5] claim that 15 characters per second is a standard reading time in English interlingual subtitling of films and videos for the deaf and hard of hearing. The standards in other European regions are close to 15 characters per second. We use this value for the evaluation in this work.

4 Estimating Usability

The challenges in simultaneous SLT are quality, latency, and stability [1, 2]. All of these properties are critical for the overall usability of the SLT system. The quality of translation is a property of the SLT system; the subtitler has no impact on it. The minimum reading time ensures a minimum level of stability, making every stable content readable, and may increase the latency if the original speech is faster than the reading time. The size of the subtitling window and the timing parameters affect the overall latency and stability. The bigger the window, the longer the translation updates that fit into it without a reset. The timing parameters determine how long the content stays unchanged in the window before scrolling. A small subtitling window or a short reading or blanking time may cause a reset. Every reset increases latency because it returns to already displayed content. On the other hand, a significant latency may improve stability by skipping the early unstable hypotheses and presenting only the stable ones.
   We provide three automatic measures for assessing the practical usability of simultaneous SLT subtitling on a test set. The automatic evaluation may serve for a rough estimation of the usefulness, or for the selection of the best candidate setups. We do not provide a strict way to judge which SLT system and subtitling setup are useful and which are not. The final decision should ideally consider the particular display conditions, expectations, and needs of the users, and should be based on a significant human evaluation.

4.1 Evaluation Measures

For quality, we report the automatic machine translation measure BLEU, computed by sacrebleu [6] after automatic sentence alignment using mwerSegmenter [7]. BLEU is considered to correlate with human quality judgement; the higher the BLEU, the higher the translation quality.
   To explain the measures of latency and stability, let us use the terminology of [2]. The EventLog is an ordered list of events. The i-th event is a triple (s_i, o_i, t_i), where s_i is the source text recognized so far, o_i is the current SLT output, and t_i is the time when this event was produced. Source and output, s_i and o_i, are sequences of tokens. Let us denote by c(o_i) a transformation of a token sequence into a sequence of characters, including spaces and punctuation. Let I be the number of all events, with an update either in the source or in the output, and T the number of events with an update in the translation.

Character Erasure To evaluate how many updates fit into the subtitling window, we define character erasure (cE). It is the number of characters that must be deleted from the tail of the current translation hypothesis to update it to a new one. If a new translation only appends words to the end, the erasure is zero. The character erasure is cE(i) = |c(o_{i−1})| − |LCP(c(o_i), c(o_{i−1}))|, where LCP stands for the longest common prefix. The average character erasure is AcE = (1/T) ∑_{i=1}^{I} cE(i). It is inspired by the normalized erasure (NE) of [2], but we do not divide by the output length in the final event, only by the number of translation events.

Translation Latency with Sentence-Alignment Catch-up The translation latency may be measured using the finalization event of the j-th word in the output. It is f(o, j) = min i such that o_{i′,j′} = o_{I,j′} for all i′ ≥ i and all j′ ≤ j. In other words, the word j is finalized in the first event i for which the word j and all the preceding words j′ remain unchanged in all subsequent events i′.
   The translation latency of output word j is the time difference between the finalization event of the word j in the output and that of its corresponding word j* in the source. [2] estimate the source word simply as j* = (j/|o_I|)|s_I|. This is problematic if the output is substantially shorter than the input, because then it may incorrectly base the latency on a word which has not been uttered yet, leading to a negative time difference. A proper word alignment would provide the most reliable correspondence. However, we propose a simpler and appropriately reliable solution. The following improved measure is our novel contribution. We use it to compare the SLT systems.
   We utilize the fact that our ASR produces punctuated text, where the sentence boundaries can be detected. The sentences coming from SLT and ASR in their sequential order are parallel. They can be simply aligned because our SLT systems translate the individual sentences and keep the sentence boundaries. If the SLT does not produce individual sentences, then we use a rule-based sentence segmenter, e.g. from [8], and must be aware of the potential inaccuracy.
   We use the sentence alignment for a catch-up, and the simple temporal correspondence of [2] only within the last sentence. To express it formally, let us assume that the EventLog also has a function S(o, j), returning the index of the sentence containing the word j in o, and L(o, k), the length of the sentence k in o. Let x(j) = j − ∑_{i=1}^{S(o,j)−1} L(o, i) be the index of an output word j in its
Table 1: Quality measures of the English ASR and SLT systems from English into the target language in the left-most column, on IWSLT tst2015. The letters A, B, C denote different variants of SLT systems with the same target. Translation lag (TL*) is in seconds. AcE is average character erasure, NE is normalized erasure.

   SLT        BLEU      TL*     AcE     NE
   EN (ASR)   58.4747           29.22   5.88
   CZ A       17.5441   2.226   24.20   7.05
   CZ B       12.2914   2.622   29.48   5.30
   CZ C       18.1505   2.933   27.90   3.93
   DE A       15.2678   3.506   47.32   1.39
   DE B       15.9672   1.845   38.12   5.46
   ES         21.8516   5.429   43.30   1.49
   FR A       25.8964   1.269   31.97   3.32
   FR B       20.5367   5.425   47.92   1.46
   RU         11.6279   3.168   31.78   4.05

Figure 3: The percentage of translation updates in the validation set with the character erasure less than or equal to the value on the x-axis, for all our ASR and SLT systems. The x-axis corresponds to the size of the subtitling window. [Line plot, one curve per system; y-axis: % of translation updates, 65 to 100; x-axis: character erasure (cE), 80 to 180.]
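The character erasure behind Figure 3 and Table 2 is straightforward to compute. The following Python sketch, with function names of our own choosing, implements cE via the longest common prefix and the cumulative percentage plotted in Figure 3:

```python
import os

def character_erasure(prev_hyp: str, new_hyp: str) -> int:
    """cE = |c(o_{i-1})| - |LCP(c(o_i), c(o_{i-1}))|: the number of
    trailing characters of the previous hypothesis that must be deleted
    to update it to the new one."""
    lcp = os.path.commonprefix([prev_hyp, new_hyp])  # character-wise prefix
    return len(prev_hyp) - len(lcp)

def erasure_cdf_percent(erasures: list[int], x: int) -> float:
    """Percentage of translation updates with cE <= x, as in Figure 3."""
    return 100.0 * sum(e <= x for e in erasures) / len(erasures)
```

For example, revising the hypothesis "Zu jedem Zeitpunk." to "Zu jedem Zeitpunkt." deletes one trailing character, so its cE is 1; an update that only appends words has cE of 0.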
sentence. Then we define our caught-up correspondence as

   j** = ∑_{i=1}^{S(o,j)−1} L(s, i) + ⌊x(j) · L(s, S(o,j)) / L(o, S(o,j))⌋

   Finally, our translation latency with sentence-alignment catch-up is TL*(o, j) = t_{f(o,j)} − t_{f(s,j**)}. This is then averaged over all output words in the document: TL* = (1/|o_I|) ∑_{j=1}^{|o_I|} TL*(o, j).¹

¹ For a set of documents D, TL* = (∑_{o,I∈D} ∑_{j=1}^{|o_I|} TL*(o, j)) / (∑_{o,I∈D} |o_I|).

4.2 SLT Evaluation

We use one ASR system for English and nine SLT systems from English into Czech (three different models differing in the data and training parameters), German (two different systems), French (two different systems), Spanish, and Russian. All the SLT systems are cascades of an ASR, a punctuator, which inserts punctuation and capitalization into unsegmented ASR output, and a neural machine translation (NMT) system translating from the text. The systems and their quality measures are in Table 1. DE A, ES, and FR B are NMT adapted for spoken translation as in [9]. The others are basic sentence-level Transformer NMT connected to ASR. The ASR is a hybrid DNN-HMM by [10].
   We evaluate the systems on the IWSLT tst2015 dataset. We downloaded the reference translations from the TED website as [2], and removed the single words in parentheses because they were not verbatim translations of the speech, but marked sounds such as applause, laughter, or music.

4.3 Reset Rate

The average character erasure does not reflect the frequency and size of the individual erasures. Therefore, in Figure 3, we display the cumulative distribution function of character erasure in the dataset. The vertical axis is the percentage of all translation updates in which the character erasure was shorter than or equal to the value on the horizontal axis. E.g., for a subtitler window with a total size of 140 characters, 99.03 % of the updates of SLT CZ A fit into this area. Table 2 displays the same for selected sizes, which fit into 1, 2, and 3 lines of a subtitler window of width 70, and also the percentage of updates without any erasure (x = 0).
   The values approximate the expected number of resets. However, the resets are also affected by the blanking time, so the real number of resets may be higher if the speech contains long pauses. The percentage in Figure 3 serves as a lower bound.

Table 2: Percentage of character erasures in all translation updates which are shorter than or equal to x characters, for selected values of x.

   SLT        x = 0   x = 70   x = 140   x = 210
   EN (ASR)   20.76   84.23    99.96     100.00
   CZ A       41.37   91.98    99.03      99.76
   CZ B       28.61   89.78    98.63      99.77
   CZ C       30.93   88.31    98.53      99.72
   FR A       31.65   84.47    98.14      99.51
   RU         35.42   85.17    97.82      99.38
   ES         29.01   71.71    97.08      99.43
   DE B       27.89   80.90    97.05      99.38
   DE A       30.85   67.65    95.83      99.13
   FR B       30.39   66.15    95.67      99.39

4.4 Subtitling Latency

The subtitling latency is the difference between the finalization time of a word in the subtitler and in the SLT. We compute it similarly to the translation latency, but the word correspondence is the identity function because the language in the SLT and the subtitler is the same.
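Given per-sentence token counts, the caught-up correspondence j** reduces to a few lines of integer arithmetic. The following Python sketch uses our own function names and 1-based word indices as in the text; it is illustrative, not the paper's implementation:

```python
def caught_up_correspondence(j, src_sent_lens, out_sent_lens):
    """Estimate the source word index j** for output word j (1-based):
    whole aligned sentences are caught up exactly, and the proportional
    correspondence of [2] is applied only inside the last sentence."""
    # find the sentence containing output word j (k = S(o, j) - 1, 0-based)
    k, words_before = 0, 0
    while words_before + out_sent_lens[k] < j:
        words_before += out_sent_lens[k]
        k += 1
    x = j - words_before                 # x(j): index of word j in its sentence
    src_before = sum(src_sent_lens[:k])  # sum of L(s, i) over preceding sentences
    # floor(x(j) * L(s, S(o,j)) / L(o, S(o,j)))
    return src_before + (x * src_sent_lens[k]) // out_sent_lens[k]
```

For example, with source sentence lengths [6, 8] and output sentence lengths [5, 4], the second word of the second output sentence maps to source word 6 + floor(2 · 8 / 4) = 10. TL*(o, j) is then the difference between the finalization times of output word j and source word j**.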
Figure 4: Subtitling latency (y-axis) over time (x-axis) for tst2015.en.talkid1922 translated by CZ A. The subtitling window has width 70 and height 1, 2, and 3 lines. The minimum reading time is 15 characters per second (one line per 4.7 s). [Line plot, one curve per window height; y-axis: subtitler lag (s), 0 to 40; x-axis: time (s), 0 to 600.]

   We computed the latency caused by the subtitler with 1, 2, and 3 lines of width 70 for one talk and SLT system, see Figure 4. Generally, the bigger the subtitling window, the lower the latency.

4.5 User Evaluation

We asked one user to rate the overall fluency and stability of the subtitling for the first 7-minute part of tst2015.en.talkid1922 translated by CZ A. We presented the user with the subtitles three times, in a window of width 70 and heights 1, 2, and 3. The minimum reading time parameter was 15 characters per second. The user was asked to express his subjective quality assessment by pressing one of five buttons: undecided (0), horrible (1), usable with problems (2), minor flaws, but usable (3), and perfect (4). The user was asked to press them while reading the subtitles, whenever the assessment changed. The source audio or video was not presented, so this setup is comparable to situations where the user does not understand the source language at all. The user is a native speaker of Czech.
   Table 3 summarizes the percentage of the assessed duration spent at each quality level. The user never used the level undecided (0). The main problem that the user reported was limited readability due to resets and unstable translations. The flaws in usable parts of the subtitling were subtle changes of subtitles which did not distract from reading the new input, or disfluent formulations.
   In the right-most column of Table 3, we show the percentage of erasures in the part of the evaluated document which fit into the subtitling window. We hypothesize that the automatic measure of character erasure may be used to estimate the user assessment of readability.

Table 3: Results of the user evaluation with three subtitling windows of different heights (h). Quality level 4 is the highest, 1 is the lowest. The right-most column is the percentage of erasures fitting into the subtitling window.

            Percentage of quality levels
   height   level=1   level=2   level=3   level=4   cE < 70 · h
   h=1      35.27 %   28.79 %   14.95 %   20.99 %    88.59 %
   h=2      11.08 %   29.94 %   35.73 %   23.24 %    98.73 %
   h=3      16.33 %   19.90 %   33.67 %   30.11 %    99.64 %

5 Conclusion

We proposed an algorithm for presenting automatic speech translation simultaneously in the limited space of subtitles. The algorithm is independent of the SLT system. It ensures a minimum level of stability and allows simultaneity. Furthermore, we proposed a way of estimating the reader's comfort and the overall usability of SLT with subtitling in limited space, and observed a correspondence with the user rating. Last but not least, we suggested a catch-up based on the sentence alignment of ASR and SLT to measure the translation latency simply and realistically.

Acknowledgments

The research was partially supported by the grant CZ.07.1.02/0.0/0.0/16_023/0000108 (Operational Programme – Growth Pole of the Czech Republic), H2020-ICT-2018-2-825460 (ELITR) of the EU, 398120 of the Grant Agency of Charles University, and by SVV project number 260 575.

References

[1] J. Niehues et al., "Dynamic transcription for low-latency speech translation," in Proceedings of Interspeech, 2016.
[2] N. Arivazhagan, C. Cherry, T. I, W. Macherey, P. Baljekar, and G. Foster, "Re-translation strategies for long form, simultaneous, spoken language translation," 2019.
[3] N. Arivazhagan, C. Cherry, W. Macherey, and G. Foster, "Re-translation versus streaming for simultaneous translation," ArXiv, vol. abs/2004.03643, 2020.
[4] F. Karamitroglou, "A proposed set of subtitling standards in Europe," Translation Journal, vol. 2, 4, 1998.
[5] A. Szarkowska and O. Gerber-Morón, "Viewers can keep up with fast subtitles: Evidence from eye movements," in PLoS ONE, 2018.
[6] M. Post, "A call for clarity in reporting BLEU scores." Association for Computational Linguistics, 2018.
[7] E. Matusov et al., "Evaluating machine translation output with automatic sentence segmentation," in International Workshop on Spoken Language Translation, Oct. 2005.
[8] P. Koehn et al., "Moses: Open source toolkit for statistical machine translation," ser. ACL '07, 2007.
[9] J. Niehues et al., "Low-latency neural speech translation," Interspeech 2018, Sep. 2018.
[10] T.-S. Nguyen et al., "The 2017 KIT IWSLT Speech-to-Text Systems for English and German," December 14–15, 2017.