=Paper=
{{Paper
|id=Vol-2718/paper23
|storemode=property
|title=Presenting Simultaneous Translation in Limited Space
|pdfUrl=https://ceur-ws.org/Vol-2718/paper23.pdf
|volume=Vol-2718
|authors=Dominik Macháček,Ondřej Bojar
|dblpUrl=https://dblp.org/rec/conf/itat/MachacekB20
}}
==Presenting Simultaneous Translation in Limited Space==
Dominik Macháček, Ondřej Bojar
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics
{machacek,bojar}@ufal.mff.cuni.cz

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract: Some methods of automatic simultaneous translation of long-form speech allow revisions of outputs, trading accuracy for low latency. Deploying these systems for users faces the problem of presenting subtitles in a limited space, such as two lines on a television screen. The subtitles must be shown promptly, incrementally, and with adequate time for reading. We provide an algorithm for subtitling. Furthermore, we propose a way to estimate the overall usability of the combination of automatic translation and subtitling by measuring the quality, latency, and stability on a test set, and propose an improved measure of translation latency.

1 Introduction

The quality of automatic speech recognition and machine translation of texts is constantly increasing. This opens an opportunity to connect the two components and use them for spoken language translation (SLT). The output of an SLT system can be delivered to users either as speech or text. In simultaneous SLT, where the output has to be delivered during the speech with as low a delay as possible, there is a trade-off between latency and quality. With textual output, it is possible to present users with early, partial translation hypotheses at low latency and correct them later with final, more accurate updates, after the system receives more context for disambiguation, or after a secondary big model produces its translation. Rewriting brings another challenge: the stability of the output. If the updates are too frequent, the user is unable to read the text.

The problem of unstable output could be solved by using a big space for showing subtitles. The unstable, flickering output would appear only at the end, allowing the user to easily ignore the flickering part and read only the stabilized part of the output. However, in many situations the space for subtitles is restricted. For example, if the users have to follow the speaker and slides at the same time, they lack the mental capacity to search for the stabilized part of the translations. It is therefore necessary to put the subtitles and slides on the same screen, restricting the subtitling area to a small window.

In this paper, we propose an algorithm for presenting SLT subtitles in limited space, a way of estimating the overall usability of simultaneous SLT subtitling in a limited area, and an improved translation latency measure for SLT comparison. Section 2 describes the properties of SLT for use with our subtitler. Section 3 details the main new component for presenting a text stream as readable subtitles. Section 4 proposes the estimation of the usability of the subtitling of multiple realistic SLT systems. We conclude the paper in Section 5.

2 Re-Translating Spoken Language Translation

Our subtitler solves the problem of presentation of SLT output with a re-translating early hypothesis, similarly to [1, 2, 3]. Although it can also present subtitles from an automatic speech recognition (ASR) system that re-estimates its early hypotheses, or generally from any audio-to-text processor, we limit ourselves to SLT in this paper for brevity.

2.1 Stable and Unstable Segments

SLT systems output a potentially infinite stream of segments containing the beginning and final timestamps of an interval from the source audio, and the translated text in the interval. We assume that the segments can be marked as stable or unstable, depending on whether the system still has the possibility to change them or not. This is a realistic assumption because ASR and SLT systems usually process a limited window of the source audio. Whenever a part of the source audio exceeds this window, the corresponding output becomes stable.

3 Subtitler

This section presents the design and algorithm of the "subtitler". The subtitler is a cache on a stream of input messages aiming to satisfy the following conflicting needs:

• The output should be presented with the lowest possible delay, to achieve the effect of simultaneous translation as much as possible.
• The flickering of the partial outputs is partially desired, because it highlights the simultaneity of the translation and comforts the user in knowing that the system is not stuck.
• The flickering should be minimized. If some output was presented at a position of the screen, it should keep the position until it is outdated.
• The user must have enough time to read the message.
• Only a predefined space of w (width) characters and h (height) lines is available.

Given an input stream of stable and unstable segments as described above, the subtitler emits a stream of "subtitle windows". On every update, the former window is replaced by a new one. The basic operation of the subtitler is depicted in Figures 1 and 2.

Figure 1: Illustration of speech translation subtitling in two subsequent inputs from SLT. The input arrives as a sequence of quadruples: segment beginning time, segment end time, stable/unstable flag, text. The rectangles indicate the content of the subtitling area of one line. (Example inputs: Input 1: "23 134 STABLE Pixelen auf Ihrem Bildschirm." and "134 189 UNSTABLE Zu jedem Zeitpunk. Es ist auch eine sehr flexible Architektur..."; Input 2: "134 156 STABLE Zu jedem Zeitpunk." and "156 210 UNSTABLE Sie ist auch sehr flexibel. Die architektur ist ein ganzes Buch.")

Figure 2: Subtitler processing of the inputs in Figure 1 with different timings. In the left one, Input 2 changes the word "Es", which has been read by the user and scrolled away, and causes a reset of the window start. In the right one, the word "Es" is changed in the window on the current display.

The elements of the subtitler are a buffer of input segments, a presentation window, and two independent processing threads. The buffer is an ordered list of segments. The presentation window is organized as a list of text lines of the required width and count. The count corresponds to the height of the subtitling window plus one, to allow scrolling up the top line after displaying it for the minimum reading time. This line view is regenerated whenever needed from the current starting position of the window in the buffer, wrapping words into lines.

The input thread receives the input stream and updates the buffer. It replaces outdated segments with their new versions, extends the buffer, and removes old unnecessary segments. If an update happens within or before the current position of the presentation window, the output thread is notified for a forced update.

Independently, the output thread updates the position of the presentation window in the buffer, obeying the following timeouts and triggers:

• On forced updates, the output thread detects whether any content changed before the beginning of the already presented window, which would cause a reset. In that case, the window position in the buffer has to be moved back, and the content can no longer be presented to the user incrementally. Instead, the beginning of the first line in the window shows a newer version of an old sentence that has already scrolled away.
• If the first line of the presentation window has not been changed for the minimum reading time, and if there is any input to present in the extra line of the window, the window is "scrolled" by one line, i.e., the first line is discarded, the window starting position within the buffer is updated, and the extra line is shown as the last line of the window.
• If the whole presentation window has not been changed for a long time, e.g., 5 or 20 seconds, it is blanked by emitting empty lines.

3.1 Timing Parameters

The subtitler requires two timing parameters. A line of subtitles is displayed to a user for a "minimum reading time" before it can be scrolled away. If no input arrives for a "blank time", the subtitling window blanks, to indicate the pause and to prevent the user from reading the last message unnecessarily.

We suggest adopting the minimum reading time parameter from the standards for subtitling films and videos (e.g., [4]), until standards for simultaneous SLT subtitling are established. [5] claim that 15 characters per second is a standard reading time in English interlingual subtitling of films and videos for the deaf and hard of hearing. The standards in other European regions are close to 15 characters per second. We use this value for the evaluation in this work.

4 Estimating Usability

The challenges in simultaneous SLT are quality, latency, and stability [1, 2]. All of these properties are critical for the overall usability of the SLT system. The quality of translation is a property of the SLT system; the subtitler has no impact on it. The minimum reading time ensures a minimum level of stability, guaranteeing that every stable content is readable, and may increase the latency if the original speech is faster than the reading time. The size of the subtitling window and the timing parameters affect the overall latency and stability. The bigger the window, the longer the translation updates that fit into it without a reset. The timing parameters determine how long the content stays unchanged in the window before scrolling. A small subtitling window or a short reading or blanking time may cause a reset. Every reset increases latency because it returns to already displayed content. On the other hand, significant latency may improve stability by skipping the early unstable hypotheses and presenting only the stable ones.

We provide three automatic measures for assessing the practical usability of simultaneous SLT subtitling on a test set. The automatic evaluation may serve for a rough estimation of the usefulness, or for the selection of the best candidate setups. We do not provide a strict way to judge which SLT system and subtitling setup are useful and which are not. The final decision should ideally consider the particular display conditions, expectations, and needs of the users, and should be based on a significant human evaluation.

4.1 Evaluation Measures

For quality, we report the automatic machine translation measure BLEU computed by sacrebleu [6] after automatic sentence alignment using mwerSegmenter [7]. BLEU is considered to correlate with human quality judgement. The higher the BLEU, the higher the translation quality.

To explain the measures of latency and stability, let us use the terminology of [2]. The EventLog is an ordered list of events. The i-th event is a triple (s_i, o_i, t_i), where s_i is the source text recognized so far, o_i is the current SLT output, and t_i is the time when this event was produced. Source and output, s_i and o_i, are sequences of tokens. Let us denote by c(o_i) a transformation of a token sequence into a sequence of characters, including spaces and punctuation. Let I be the number of all events, with an update either in source or output, and T the number of events with an update in translation.

Character Erasure. To evaluate how many updates fit into the subtitling window, we define the character erasure (cE). It is the number of characters that must be deleted from the tail of the current translation hypothesis to update it to a new one. If a new translation only appends words to the end, the erasure is zero. The character erasure is

cE(i) = |c(o_{i-1})| − |LCP(c(o_i), c(o_{i-1}))|,

where LCP stands for the longest common prefix. The average character erasure is AcE = (1/T) · Σ_{i=1..I} cE(i). It is inspired by the normalized erasure (NE) of [2], but we do not divide by the output length in the final event, only by the number of translation events.

Translation Latency with Sentence-Alignment Catch-up. The translation latency may be measured using the finalization event of the j-th word in the output. It is f(o, j) = min i such that o_{i',j'} = o_{I,j'} for all i' ≥ i and all j' ≤ j. In other words, the word j is finalized in the first event i for which the word j and all the preceding words j' remain unchanged in all subsequent events i'.

The translation latency of an output word j is the time difference between the finalization event of the word j in the output and that of its corresponding word j* in the source. [2] estimate the source word simply as j* = (j / |o_I|) · |s_I|. This is problematic if the output is substantially shorter than the input, because then it may incorrectly base the latency on a word which has not been uttered yet, leading to a negative time difference. A proper word alignment would provide the most reliable correspondence. However, we propose a simpler and appropriately reliable solution. The following improved measure is our novel contribution. We use it to compare the SLT systems.

We utilize the fact that our ASR produces punctuated text, in which the sentence boundaries can be detected. The sentences coming from SLT and ASR in their sequential order are parallel. They can be simply aligned because our SLT systems translate the individual sentences and keep the sentence boundaries. If the SLT does not produce individual sentences, then we use a rule-based sentence segmenter, e.g. from [8], and must be aware of the potential inaccuracy.

We use the sentence alignment for a catch-up, and the simple temporal correspondence of [2] only within the last sentence. To express it formally, let us assume that the EventLog also has a function S(o, j), returning the index of the sentence containing the word j in o, and L(o, k), the length of the sentence k in o. Let x(j) = j − Σ_{i=1..S(o,j)−1} L(o, i) be the index of an output word j within its sentence. Then we define our caught-up correspondence as

j** = Σ_{i=1..S(o,j)−1} L(s, i) + ⌊x(j) · L(s, S(o,j)) / L(o, S(o,j))⌋.

Finally, our translation latency with sentence-alignment catch-up is TL*(o, j) = t_{f(o,j)} − t_{f(s,j**)}. This is then averaged over all output words in the document: TL* = (1/|o_I|) · Σ_{j=1..|o_I|} TL*(o, j). (For a set of documents D, TL* = (Σ_{(o,I)∈D} Σ_{j=1..|o_I|} TL*(o, j)) / (Σ_{(o,I)∈D} |o_I|).)

4.2 SLT Evaluation

We use one ASR system for English and nine SLT systems from English into Czech (three different models differing in the data and training parameters), German (two different systems), French (two different systems), Spanish, and Russian. All the SLT systems are cascades of an ASR, a punctuator, which inserts punctuation and capitalization into the unsegmented ASR output, and a neural machine translation (NMT) system operating on the text. The systems and their quality measures are in Table 1. DE A, ES, and FR B are NMT adapted for spoken translation as in [9]. The others are basic sentence-level Transformer NMT connected to the ASR. The ASR is a hybrid DNN-HMM by [10].

We evaluate the systems on the IWSLT tst2015 dataset. We downloaded the reference translations from the TED website as [2], and removed the single words in parentheses because they were not verbatim translations of the speech, but marked sounds such as applause, laughter, or music.

Table 1: Quality measures of the English ASR and the SLT systems from English into the target language in the left-most column, on IWSLT tst2015. The letters A, B, C denote different variants of SLT systems with the same target. Translation lag (TL*) is in seconds. AcE is average character erasure, NE is normalized erasure.

  SLT       BLEU     TL*    AcE    NE
  EN (ASR)  58.4747  –      29.22  5.88
  CZ A      17.5441  2.226  24.20  7.05
  CZ B      12.2914  2.622  29.48  5.30
  CZ C      18.1505  2.933  27.90  3.93
  DE A      15.2678  3.506  47.32  1.39
  DE B      15.9672  1.845  38.12  5.46
  ES        21.8516  5.429  43.30  1.49
  FR A      25.8964  1.269  31.97  3.32
  FR B      20.5367  5.425  47.92  1.46
  RU        11.6279  3.168  31.78  4.05

4.3 Reset Rate

The average character erasure does not reflect the frequency and size of the individual erasures. Therefore, in Figure 3, we display the cumulative density function of character erasure in the dataset. The vertical axis is the percentage of all translation updates in which the character erasure was shorter than or equal to the value on the horizontal axis. E.g., for a subtitler window with a total size of 140 characters, 99.03 % of the updates of the SLT system CZ A fit into this area. Table 2 displays the same for selected sizes, which fit into 1, 2, and 3 lines of a subtitler window of width 70, and also the percentage of updates without any erasure (x = 0).

Figure 3: The percentage of translation updates in the validation set with character erasure less than or equal to the value on the x-axis, for all our ASR and SLT systems. The x-axis corresponds with the size of the subtitling window.

Table 2: Percentage of character erasures in all translation updates which are shorter than or equal to x characters, for selected values of x.

  SLT       x=0    x=70   x=140  x=210
  EN (ASR)  20.76  84.23  99.96  100.00
  CZ A      41.37  91.98  99.03  99.76
  CZ B      28.61  89.78  98.63  99.77
  CZ C      30.93  88.31  98.53  99.72
  FR A      31.65  84.47  98.14  99.51
  RU        35.42  85.17  97.82  99.38
  ES        29.01  71.71  97.08  99.43
  DE B      27.89  80.90  97.05  99.38
  DE A      30.85  67.65  95.83  99.13
  FR B      30.39  66.15  95.67  99.39

The values approximate the expected number of resets. However, the resets are also affected by the blanking time, so the real number of resets may be higher if the speech contains long pauses. The percentage in Figure 3 serves as a lower bound.

4.4 Subtitling Latency

The subtitling latency is the difference between the finalization time of a word in the subtitler and in the SLT. We count it similarly to the translation latency, but the word correspondence is the identity function, because the language in the SLT and the subtitler is the same. We computed the latency caused by the subtitler with 1, 2, and 3 lines of width 70 for one talk and SLT system, see Figure 4. Generally, the bigger the subtitling window, the lower the latency.

Figure 4: Subtitling latency (y-axis) over time (x-axis) for tst2015.en.talkid1922 translated by CZ A. The subtitling window has width 70 and height 1, 2, and 3 lines. The minimum reading time is 15 characters per second (one line per 4.7 s).

4.5 User Evaluation

We asked one user to rate the overall fluency and stability of the subtitling for the first 7-minute part of tst2015.en.talkid1922 translated by CZ A. We presented the user with the subtitles three times, in a window of width 70 and heights 1, 2, and 3. The minimum reading time parameter was 15 characters per second. The user was asked to express his subjective quality assessment by pressing one of five buttons: undecided (0), horrible (1), usable with problems (2), minor flaws, but usable (3), and perfect (4). The user was asked to press them while reading the subtitles, whenever the assessment changed. The source audio or video was not presented, so this setup is comparable to situations where the user does not understand the source language at all. The user is a native speaker of Czech.

Table 3 summarizes the percentage of the assessed duration spent at each quality level. The user did not use the level undecided (0). The main problem that the user reported was limited readability due to resets and unstable translations. The flaws in the usable parts of the subtitling were subtle changes of subtitles which did not distract from reading the new input, or disfluent formulations.

Table 3: Results of the user evaluation with three subtitling windows of different heights (h). Quality level 4 is the highest, 1 is the lowest. The right-most column is the percentage of erasures fitting into the subtitling window.

  height  level=1  level=2  level=3  level=4  cE < 70·h
  h=1     35.27 %  28.79 %  14.95 %  20.99 %  88.59 %
  h=2     11.08 %  29.94 %  35.73 %  23.24 %  98.73 %
  h=3     16.33 %  19.90 %  33.67 %  30.11 %  99.64 %

In the right-most column of Table 3 we show the percentage of erasures in the part of the evaluated document which fit into the subtitling window. We hypothesize that the automatic measure of character erasure may be used to estimate the user assessment of readability.

5 Conclusion

We proposed an algorithm for presenting automatic speech translation simultaneously in the limited space of subtitles. The algorithm is independent of the SLT system. It ensures a minimum level of stability and allows simultaneity. Furthermore, we proposed a way of estimating the reader's comfort and the overall usability of the SLT with subtitling in limited space, and observed a correspondence with the user rating. Last but not least, we suggested a catch-up based on the sentence alignment between ASR and SLT to measure the translation latency simply and realistically.

Acknowledgments

The research was partially supported by the grant CZ.07.1.02/0.0/0.0/16_023/0000108 (Operational Programme – Growth Pole of the Czech Republic), H2020-ICT-2018-2-825460 (ELITR) of the EU, 398120 of the Grant Agency of Charles University, and by SVV project number 260 575.

References

[1] J. Niehues et al., "Dynamic transcription for low-latency speech translation," in Proceedings of Interspeech, 2016.
[2] N. Arivazhagan, C. Cherry, T. I, W. Macherey, P. Baljekar, and G. Foster, "Re-translation strategies for long form, simultaneous, spoken language translation," 2019.
[3] N. Arivazhagan, C. Cherry, W. Macherey, and G. Foster, "Re-translation versus streaming for simultaneous translation," ArXiv, vol. abs/2004.03643, 2020.
[4] F. Karamitroglou, "A proposed set of subtitling standards in Europe," Translation Journal, vol. 2, April 1998.
[5] A. Szarkowska and O. Gerber-Morón, "Viewers can keep up with fast subtitles: Evidence from eye movements," in PLoS ONE, 2018.
[6] M. Post, "A call for clarity in reporting BLEU scores," Association for Computational Linguistics, 2018.
[7] E. Matusov et al., "Evaluating machine translation output with automatic sentence segmentation," in International Workshop on Spoken Language Translation, Oct. 2005.
[8] P. Koehn et al., "Moses: Open source toolkit for statistical machine translation," ser. ACL '07, 2007.
[9] J. Niehues et al., "Low-latency neural speech translation," Interspeech 2018, Sep. 2018.
[10] T.-S. Nguyen et al., "The 2017 KIT IWSLT Speech-to-Text Systems for English and German," December 14–15, 2017.
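As an illustration of the subtitler's core window logic from Section 3, the following is a minimal, single-threaded Python sketch. The class and method names are ours, and several aspects are simplified: the paper's subtitler keeps a buffer of timestamped segments and uses two threads with a minimum reading time and a blanking timeout, whereas this sketch tracks a single running hypothesis string and exposes the reset detection, window wrapping, and scrolling as plain method calls.

```python
# Minimal single-threaded sketch of the "subtitler" cache (names are ours).
# Assumptions: the hypothesis is one running text with single spaces between
# words; the real subtitler additionally handles segment timestamps, the
# minimum reading time, and blanking via a second thread.
import textwrap

class Subtitler:
    def __init__(self, width=70, height=2):
        self.width = width    # w: characters per line
        self.height = height  # h: lines in the subtitling window
        self.buffer = ""      # full hypothesis text received so far
        self.start = 0        # character offset where the window begins

    def update(self, new_text):
        """Replace the buffered hypothesis. Returns True if the change forces
        a reset, i.e. an edit before the current window start."""
        lcp = 0  # longest common prefix of old and new hypothesis
        for a, b in zip(self.buffer, new_text):
            if a != b:
                break
            lcp += 1
        reset = lcp < self.start  # edit in already-scrolled-away text
        if reset:
            # move the window back; presentation is no longer incremental
            self.start = lcp
        self.buffer = new_text
        return reset

    def window(self):
        """Wrap the text after the window start into at most `height` lines."""
        lines = textwrap.wrap(self.buffer[self.start:], self.width)
        return lines[: self.height]

    def scroll(self):
        """Discard the top line (after its minimum reading time has elapsed)."""
        lines = textwrap.wrap(self.buffer[self.start:], self.width)
        if len(lines) > 1:
            self.start += len(lines[0]) + 1  # skip the line and the space
```

For example, an update that only changes words still inside the window is presented incrementally, while an update that rewrites text which has already scrolled away returns a reset, as in Figure 2.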
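The character erasure measure of Section 4.1 is straightforward to compute; the sketch below follows the definition cE(i) = |c(o_{i-1})| − |LCP(c(o_i), c(o_{i-1}))| directly (function names are ours). Note one simplification: the paper's AcE sums over all I events but divides by the number of translation events T, while this sketch divides by the number of consecutive update pairs it is given.

```python
# Sketch of the character erasure (cE) metric of Section 4.1 (names are ours).
import os

def character_erasure(prev, new):
    """cE: characters to delete from the tail of `prev` to turn it into a
    prefix-compatible base for `new` (zero if `new` only appends)."""
    lcp = len(os.path.commonprefix([prev, new]))  # longest common prefix
    return len(prev) - lcp

def average_character_erasure(hypotheses):
    """Mean erasure over consecutive hypothesis updates (simplified AcE:
    the paper divides by the number of translation events T)."""
    erasures = [character_erasure(a, b)
                for a, b in zip(hypotheses, hypotheses[1:])]
    return sum(erasures) / len(erasures) if erasures else 0.0
```

A sorted list of per-update erasures also directly yields the cumulative distribution plotted in Figure 3.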
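The caught-up correspondence j** of Section 4.1 can likewise be sketched in a few lines. The function below is our illustration, not the paper's code: it takes the per-sentence word counts of the output (L(o, ·)) and of the parallel source (L(s, ·)), uses 1-based word indices as in the paper, and applies the proportional position only within the word's own sentence.

```python
# Sketch of the sentence-alignment catch-up correspondence j** (Section 4.1).
# Indices are 1-based as in the paper; function and argument names are ours.
from math import floor

def caught_up_correspondence(j, out_lens, src_lens):
    """j** for output word j, given per-sentence word counts of the output
    (out_lens = L(o, .)) and the sentence-parallel source (src_lens = L(s, .))."""
    # S(o, j): 1-based index of the sentence containing word j
    k, total = 0, 0
    while total < j:
        total += out_lens[k]
        k += 1
    # x(j): position of word j within its sentence
    x = j - sum(out_lens[: k - 1])
    # catch up over preceding source sentences, proportional inside the last
    return sum(src_lens[: k - 1]) + floor(x * src_lens[k - 1] / out_lens[k - 1])
```

Unlike the purely proportional j* of [2], words in any sentence but the last are mapped into their own aligned source sentence, so the latency can never be computed against a source word many sentences ahead.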