Visualization: The Missing Factor in Simultaneous Speech Translation

Sara Papi1,2, Matteo Negri1, Marco Turchi1
1. Fondazione Bruno Kessler, Italy
2. University of Trento, Italy
{spapi,negri,turchi}@fbk.eu

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

Simultaneous speech translation (SimulST) is the task in which output generation has to be performed on partial, incremental speech input. In recent years, SimulST has become popular due to the spread of multilingual application scenarios, like international live conferences and streaming lectures, in which on-the-fly speech translation can facilitate users' access to audio-visual content. In this paper, we analyze the characteristics of the SimulST systems developed so far, discussing their strengths and weaknesses. We then concentrate on the evaluation framework required to properly assess systems' effectiveness. To this end, we raise the need for a broader performance analysis, which also includes the user experience standpoint. We argue that SimulST systems should be evaluated not only in terms of quality/latency measures, but also via task-oriented metrics accounting, for instance, for the visualization strategy adopted. In light of this, we highlight the goals achieved by the community and what is still missing.

1 Introduction

Simultaneous speech translation (SimulST) is the task in which the translation of a source language speech has to be performed on partial, incremental input. This is a key feature to achieve low latency in scenarios like streaming conferences and lectures, where the text has to be displayed following as much as possible the pace of the speech.

SimulST is indeed a complex task, in which the difficulties of performing speech recognition from partial inputs are exacerbated by the problem of projecting meaning across languages. Despite the increasing demand for such systems, the problem is still far from being solved.

So far, research efforts have mainly focused on the quality/latency trade-off, i.e. producing high quality outputs in the shortest possible time, balancing the need for a good translation with the necessity of rapid text generation. Previous studies, however, disregard how the translation is displayed and, consequently, how it is actually perceived by the end users. After a concise survey of the state of the art in the field, in this paper we posit that, from the users' experience standpoint, output visualization is at least as important as having a good translation in a short time. This raises the need for a broader, task-oriented and human-centered analysis of SimulST systems' performance, also accounting for this third crucial factor.

2 Background

As in the case of offline speech translation, the adoption of cascade architectures (Stentiford and Steer, 1988; Waibel et al., 1991) was the first attempt made by the SimulST community to tackle the problem of generating text from partial, incremental input. Cascade systems (Fügen, 2009; Fujita et al., 2013; Niehues et al., 2018; Xiong et al., 2019; Arivazhagan et al., 2020b) involve a pipeline of two components. First, a streaming automatic speech recognition (ASR) module transcribes the input speech into the corresponding text (Wang et al., 2020; Moritz et al., 2020). Then, a simultaneous text-to-text translation module translates the partial transcription into target-language text (Gu et al., 2017; Dalvi et al., 2018; Ma et al., 2019; Arivazhagan et al., 2019). This approach suffers from error propagation, a well-known problem even in the offline scenario: the transcription errors made by the ASR module are propagated to the MT module, which cannot recover from them as it does not have direct access to the audio. Another strong limitation of cascaded systems is the extra latency added by the two-step pipeline, since the MT module has to wait until the streaming ASR output is produced.

To overcome these issues, the direct models initially proposed by Bérard et al. (2016) and Weiss et al. (2017) represent a valid alternative that is gaining increasing traction (Bentivogli et al., 2021). Direct ST models are composed of an encoder, usually bidirectional, and a decoder. The encoder starts from the audio features extracted from the input signal and computes a hidden representation; the decoder transforms this representation into target language text. Direct modeling becomes crucial in the simultaneous scenario, as it reduces the overall system's latency due to the absence of intermediate symbolic representation steps. Despite the data scarcity issue caused by the limited availability of speech-to-translation corpora, the adoption of direct architectures has proved promising (Weiss et al., 2017; Ren et al., 2020; Zeng et al., 2021), driving recent efforts towards the development of increasingly powerful and efficient models.
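For readers less familiar with the two designs, the following minimal sketch contrasts where the cascade's two weaknesses arise. All model functions here are toy stubs introduced only for illustration, not the API of any real toolkit.

```python
# Toy stubs standing in for real models; hypothetical names, illustration only.
def streaming_asr(chunk, state):  return f"<transcript of {chunk}>"
def simultaneous_mt(text, state): return f"<translation of {text}>"
def direct_st(chunk, state):      return f"<translation of {chunk}>"

def cascade_step(audio_chunk, state):
    # Step 1: any transcription error made here...
    partial = streaming_asr(audio_chunk, state)
    # ...is inherited by step 2, which never sees the audio (error
    # propagation) and must wait for step 1's output (extra latency).
    return simultaneous_mt(partial, state)

def direct_step(audio_chunk, state):
    # A single model maps audio straight to target-language text,
    # with no intermediate symbolic (transcript) representation.
    return direct_st(audio_chunk, state)
```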
3 Architectural Challenges

This section surveys the direct SimulST models developed so far, highlighting strengths and weaknesses of the current architectures and decision policies – i.e. the strategies used by the system to decide whether to output a partial translation or to wait for more audio information. We discuss ongoing research on architectural improvements of encoder-decoder models, as well as popular approaches like offline training and re-translation. All these works concentrate on reducing system latency, targeting a better quality/latency trade-off.

Encoding Strategy. Few studies (Elbayad et al., 2020a; Nguyen et al., 2021b) tried to improve the encoder part of simultaneous systems. Elbayad et al. (2020a) and Nguyen et al. (2021b) introduced the use of unidirectional encoders instead of standard bidirectional ones (i.e. the encoder states are not updated after each read action) to speed up the decoding phase. Nguyen et al. (2021b) also proposed an encoding strategy called Overlap-and-Compensate, where the encoder exploits extra frames from the past that were discarded during the previous encoding step. The segmentation problem is a crucial aspect in SimulST, where the system needs to split a long audio input into smaller chunks (speech frames) in order to process them. Different segmentation techniques can be adopted, ranging from the easiest one based on fixed time windows (Ma et al., 2020b) to dynamic ones based on automatically detected word boundaries (Zeng et al., 2021; Chen et al., 2021). Ma et al. (2020b) also studied a dynamic segmentation based on oracle boundaries, but discovered that, in their scenario, it performed worse than the fixed segmentation.
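As a reference point, fixed-window segmentation is the simplest of these techniques. The sketch below assumes raw mono audio and an illustrative 280 ms window; the chunk size is a tunable latency knob, and no specific value is prescribed by the cited works.

```python
import numpy as np

def fixed_window_segments(waveform: np.ndarray, sample_rate: int,
                          window_ms: int = 280):
    """Split a mono waveform into fixed-size speech chunks.

    window_ms is illustrative only: smaller windows mean more frequent
    READ actions (lower latency, less context per step) and vice versa.
    """
    step = int(sample_rate * window_ms / 1000)
    return [waveform[i:i + step] for i in range(0, len(waveform), step)]

# Example: 10 s of 16 kHz audio -> ceil(10000 ms / 280 ms) = 36 chunks.
chunks = fixed_window_segments(np.zeros(160_000), 16_000)
assert len(chunks) == 36
```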
Decoding Strategy. Some efforts have been made to improve the decoding strategy, as it strongly correlates with the decision policy of simultaneous systems. Speculative beam search, or SBS (Zheng et al., 2019c), represents the first successful attempt to use beam search in SimulST. This technique consists in hallucinating several prediction steps in the future in order to make more accurate decisions based on the best "speculative" prediction obtained. Zeng et al. (2021) also integrate beam search in the decoding strategy, developing the wait-k-stride-N strategy. In particular, the authors bypass output speculation by directly applying beam search, after waiting for k words, on a word stride of size N (i.e., on N words at a time) instead of one single word as prescribed by the standard wait-k. Nguyen et al. (2021a) analyzed several decoding strategies relying on different output token granularities, such as characters and Byte Pair Encoding (BPE), showing that the latter yields lower latency.

Offline or Online training? An alternative approach to simultaneous training is the offline (or full-sentence) training of the system and its subsequent use as a simultaneous one. Nguyen et al. (2021a) explored this solution with an LSTM-based direct ST system, analyzing the effectiveness of different decoding strategies. Interestingly, the offline approach not only preserves overall performance despite the switch of modality, but also improves the system's ability to generate well-formed sentences. These results are confirmed by Chen et al. (2021), who successfully exploit a direct ST system jointly trained in an offline fashion with an ASR one.

Another point of view: re-translation. Re-translation (Niehues et al., 2016; Niehues et al., 2018; Arivazhagan et al., 2020a; Arivazhagan et al., 2020b) consists in re-generating the output from scratch (e.g. after a fixed amount of time) for as long as new information is received. This approach ensures high quality (the final output is produced with all the available context) and low latency (partial translations can be generated with a fixed, controllable delay). This, however, comes at the cost of strong output instability (the so-called flickering, due to continuous updates of the displayed translations), which is not optimal from the user experience standpoint. To this end, some metrics have been developed to measure the instability phenomenon, such as the Erasure (Arivazhagan et al., 2020b), which measures the number of tokens that were deleted from the emitted translation to produce the next one.
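The prefix-based idea behind such an instability measure can be sketched as follows. This is a simplified reading of Erasure (tokens deleted from the tail of the previous output); the exact formulation and normalization in Arivazhagan et al. (2020b) may differ.

```python
def erasure(prev_tokens, new_tokens):
    """Tokens that must be deleted from the tail of the previous output
    before appending new material (0 = purely additive update)."""
    common = 0
    for p, n in zip(prev_tokens, new_tokens):
        if p != n:
            break
        common += 1
    return len(prev_tokens) - common

def total_erasure(revisions):
    """Sum the erasure over a sequence of re-translated outputs."""
    return sum(erasure(a, b) for a, b in zip(revisions, revisions[1:]))

# A flickering re-translation: "war" is erased and rewritten as "ging um".
revisions = [["Es"], ["Es", "war"], ["Es", "ging", "um"]]
print(total_erasure(revisions))  # 1
```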
Decision Policy. In simultaneous settings, the model has to decide, at each time step, whether the available information is enough to produce a partial translation – i.e. to perform a write action using the information received until that step (audio chunk/s in the case of SimulST, or token/s in the case of simultaneous MT) – or whether it has to wait and perform a read action to receive new information from the input. Possible decision policies result in different ways to balance the quality/latency trade-off. On one side, more read actions provide the system with a larger context, useful to generate translations of higher quality. On the other side, this comes at the cost of an increased, sometimes unacceptable, latency.

To address this problem, two types of policy have been proposed so far: fixed and adaptive. While fixed decision policies only look at the number of ingested tokens (or speech chunks, in the speech scenario), in adaptive ones the decision is taken by also looking at the contextual information extracted from the input. While little research focused on adaptive policies (Gu et al., 2017; Zheng et al., 2019a; Zheng et al., 2020) due to their hard and time-consuming training (Zheng et al., 2019b; Arivazhagan et al., 2019), the adoption of very easy-to-train fixed policies is the typical choice. Indeed, the most widely used policy is a fixed one, called wait-k (Ma et al., 2019). Simple yet effective, it is based on waiting for k source words before starting to generate the target sentence, as shown in Table 1.

| source | It | was | a | way | that | parents | ... |
|--------|----|-----|---|-----|------|---------|------|
| wait-3 | -  | -   | - | Es  | ging | um      | eine |
| wait-5 | -  | -   | - | -   | -    | Es      | ging |

Table 1: wait-k policy example with k = {3, 5}.
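The read/write schedule of Table 1 can be sketched in a few lines. Here `predict_next(src, tgt)` stands for one decoder step of a simultaneous model and is a caller-supplied placeholder, not the API of any specific toolkit:

```python
def wait_k_decode(source_stream, predict_next, k, max_len=100):
    """Fixed wait-k policy: READ k source units first, then alternate one
    WRITE per READ; once the source is exhausted, finish with full context.
    """
    src, tgt = [], []
    for unit in source_stream:                 # READ a word / speech chunk
        src.append(unit)
        if len(src) >= k and len(tgt) < max_len:
            tgt.append(predict_next(src, tgt))     # WRITE one target unit
    while len(tgt) < max_len and (not tgt or tgt[-1] != "</s>"):
        tgt.append(predict_next(src, tgt))         # tail of the sentence
    return tgt

# Toy "decoder" that just numbers its outputs, to show the schedule.
out = wait_k_decode(["It", "was", "a", "way", "that", "parents"],
                    lambda s, t: f"y{len(t) + 1}" if len(t) + 1 < 7 else "</s>",
                    k=3)
print(out)  # ['y1', 'y2', 'y3', 'y4', 'y5', 'y6', '</s>']
```

The wait-k-stride-N variant discussed above would replace the single WRITE with N tokens emitted at once, possibly re-ranked with beam search over the stride.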
As the original wait-k implementation is based on textual source data, Ma et al. (2020b) adapted it to the audio domain by waiting for k fixed time frames (audio chunks or speech frames) rather than k words. However, this simplistic approach does not consider various aspects of human speech, such as different speech rates, durations, pauses, and silences. In (Ren et al., 2020), the adaptation was done differently, by including a Connectionist Temporal Classification (CTC)-based (Graves et al., 2006) segmentation module that is able to determine word boundaries. In this case, the wait-k strategy is applied by waiting for k pauses between words, automatically detected by the segmenter. Similarly, Zeng et al. (2021) employed the CTC-based segmentation method, but applied a wait-k-stride-N policy to allow re-ranking during the decoding phase. The wait-k-stride-N model emits more than one word at a time, slightly increasing the latency, since the output is emitted only after the whole stride is processed. This small increase in latency, however, allows the model to perform beam search on the stride, which has been shown to be effective in improving translation quality (Sutskever et al., 2014). Decoding more than one word at a time is the approach also employed by Nguyen et al. (2021a), who showed that emitting two words at a time increases the quality of the translation without any relevant impact on latency.

Another way of applying the wait-k strategy was proposed by Chen et al. (2021), where a streaming ASR system is used to guide the direct ST decoding. They look at the ASR beam to decide how many tokens have been emitted within the partial audio segment, hence having the information needed to apply the original wait-k policy in a straightforward way. An interesting solution is also the one by Elbayad et al. (2020a), who jointly train a direct model across multiple wait-k paths. Once the sentence has been encoded, they optimize the system by uniformly sampling the k value for the decoding step. Even though they reach good performance by using a single-path training with k=7 and a different k value for testing, the multi-path approach proved to be effective. One of its advantages is that no k value has to be specified for training, which avoids training several models from scratch for different values of k.

Retrospective. All the aspects analyzed in this section highlight several research directions already taken by the simultaneous community, which have to be studied more in depth. Among them, the audio or text segmentation strategy clearly emerges as a fundamental factor of simultaneous systems, and the ambivalent results obtained in several studies point out that this aspect has to be better clarified. Moreover, the extensive literature on the wait-k policy shows that it represents one of the topics of greatest interest to the community, which continues to work on it to further improve its effectiveness, as it directly impacts the systems' performance, especially latency. Unfortunately, all these studies focus on architecture enhancements and decision policies despite the absence of a unique and clear evaluation framework allowing for a correct and complete analysis of the systems.

4 Evaluation Challenges

A good simultaneous model should produce a high quality translation in reasonable time, as waiting too long will negatively affect the user experience. The offline MT and ST communities commonly use the well-established BLEU metric (Papineni et al., 2002; Post, 2018) to measure the quality of the output translation, but a simultaneous system also needs a metric that accounts for the time spent by the system to output the partial translation. Simultaneous MT (SimulMT) is the task in which a real-time translation is produced having only a partial source text at disposal. Since SimulMT was the first and easiest simultaneous scenario studied by the community, a set of metrics was initially introduced for the textual input-output setting.

Latency Metrics for SimulMT. The first metric, the Average Proportion (AP), was proposed by Cho and Esipova (2016) and measures the average proportion of source input read when generating a target prediction, computed by summing the number of source tokens read when generating each partial target token. However, AP is not length-invariant, i.e. the value of the metric depends on the input and output lengths, and it is not evenly distributed on the [0, 1] interval (Ma et al., 2019), making it strongly unreliable.

To overcome these problems, Ma et al. (2019) introduced the Average Lagging (AL), which directly describes the lag behind the ideal policy, i.e. a policy that produces the output exactly at the same pace as the speech source. As a downside, Average Lagging is not differentiable, which would instead be a useful property, especially if the metric is to be added to the system's loss computation. For this reason, Cherry and Foster (2019) proposed the Differentiable Average Lagging (DAL), introducing a minimum delay after each operation.
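For concreteness, the two metrics can be written down explicitly. The notation follows Ma et al. (2019): x is the source, y the generated output, and g(t) the number of source units that have been read when the t-th target unit is emitted:

\[ \mathrm{AP} = \frac{1}{|\mathbf{x}|\,|\mathbf{y}|} \sum_{t=1}^{|\mathbf{y}|} g(t) \]

\[ \mathrm{AL} = \frac{1}{\tau} \sum_{t=1}^{\tau} \left( g(t) - \frac{t-1}{\gamma} \right), \qquad \gamma = \frac{|\mathbf{y}|}{|\mathbf{x}|}, \qquad \tau = \min \{ t : g(t) = |\mathbf{x}| \} \]

Here \((t-1)/\gamma\) is the ideal, perfectly paced policy that AL measures the lag against, and \(\tau\) truncates the sum at the step in which the source has been entirely read.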
Another way of measuring the lagging is to compute the alignment difficulty of a source-target pair. Hence, Elbayad et al. (2020b) proposed the Lagging Difficulty (LD) metric, which exploits the fast-align (Dyer et al., 2013) tool to estimate the source-target alignments. They then infer the reference decoding path and compute the AL metric on it. The authors claim LD to be a realistic measure of simultaneous translation, as it also evaluates how easy a translation is to align given the context available when decoding.

Latency Metrics for SimulST. The popular AP, AL and DAL metrics were subsequently adapted by the SimulST community to the speech scenario by converting, for instance, the number of words into the sum of the speech segment durations, as per (Ma et al., 2020a). Later, Ma et al. (2020b) raised the issue of using computation-unaware metrics and proposed computation-aware variants that account for the time spent by the model to generate the output. Unfortunately, computing such metrics is far from easy in the absence of a unique and reproducible environment in which to evaluate the model's performance. To this end, Ma et al. (2020a) proposed SimulEval, a tool which computes the metrics by simulating a real-time scenario with a server-client scheme. This toolkit automatically evaluates simultaneous translations (both text and speech) given a customizable agent that can be defined by the user and that will depend on the adopted policy. Despite the progress in the metrics for evaluating quality and latency, no studies have been conducted on their actual correlation with user experience. This represents a missing key point in the current evaluation framework landscape, giving rise to the need for a tool that combines quality and latency metrics with application-oriented metrics (e.g., reading speed), which are strongly correlated to the visualization and, as an ultimate goal, to the user experience.

5 The Missing Factor: Visualization

In the previous section, we introduced the most popular metrics used to evaluate simultaneous systems' performance. These metrics account for the quality and the latency of the system without capturing the user needs. Although many researchers acknowledge the importance of human evaluation, this partial view can push the community in the wrong direction, in which all the efforts are focused on the quality/latency factors while the problem experienced by the user is of another kind. Indeed, the third factor that matters, and strongly influences human understanding of even a very good translation, is the visualization strategy adopted. The visualization problem, and the need to present the text in a readable fashion for the user, has so far been faced only in our previous work (Karakanta et al., 2021). In that paper, we raised the need for a clearer and less distracting visualization of the texts generated by a SimulST system by presenting them as subtitles (text segmented in lines preserving coherent information). We proposed different visualization strategies to better assess the online display problem, attempting to simulate a setting where human understanding is at the core of the analysis.

Visualization modalities. The standard word-for-word visualization method (Ma et al., 2019), in which the words appear sequentially on the screen as they are generated, can be strongly sub-optimal for human understanding (Romero-Fresco, 2011). In fact, the word-for-word approach has two main problems: i) the emission rate of words (some go too fast, some too slow) is irregular, and users waste more time reading the text because their eyes have to make more movements; and ii) the emitted pieces of text do not correspond to linguistic units/chunks, requiring more cognitive effort. Moreover, when the maximum length of the subtitle (which depends on the dimensions of the screen) is reached, the subtitle disappears without giving the user enough time to read the last words emitted. As this negatively impacts the user experience, in (Karakanta et al., 2021) we proposed to adopt different visualization modes that better accommodate human reading requirements.

We first introduced the block visualization mode, in which an entire subtitle (usually one or two lines maximum) is displayed at once, as soon as the system has finished generating it. This display mode is the easiest to read for the user because it prevents re-reading phenomena (Rajendran et al., 2013) and unnecessary/excessive eye fixations (Romero-Fresco, 2010), reducing the human effort. However, we discovered that the latency introduced by waiting for an entire subtitle is too high for this visualization mode to be used in many simultaneous scenarios. As a consequence, we proposed the scrolling lines visualization mode, which displays the subtitles line by line. Every time a new line becomes available, it appears at the bottom of the screen, while the previous (older) line is scrolled to the upper position. In this way, there are always two lines displayed on the screen.
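The display logic of the scrolling-lines mode is simple enough to sketch in a few lines. This is only a minimal illustration of the two-line buffer, not the actual implementation used in Karakanta et al. (2021):

```python
from collections import deque

class ScrollingLinesDisplay:
    """Two-line subtitle area: a new line enters at the bottom and pushes
    the previous one to the upper position; the oldest line scrolls off."""

    def __init__(self, max_lines=2):
        self.lines = deque(maxlen=max_lines)  # overflow drops the oldest line

    def push_line(self, line):
        self.lines.append(line)

    def render(self):
        return "\n".join(self.lines)

display = ScrollingLinesDisplay()
for line in ["Es ging um eine Möglichkeit,",
             "wie Eltern lernen können,",
             "ihren Kindern zu helfen."]:
    display.push_line(line)
print(display.render())  # only the last two lines remain on screen
```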
To evaluate the performance of the system in the different visualization modes, we also proposed an ad-hoc calculation of the reading speed (characters per second, or CPS) that correlates with human judgment of the subtitles (Perego et al., 2010). The reading speed shows how fast a user needs to read in order not to miss any part of the subtitle. The lower the reading speed, the better the model's output, since a fast reading speed increases the cognitive load and leaves less time to look at the image. The scrolling lines method offers the best balance between latency and a comfortable reading speed, making it the best choice for the simultaneous scenario. On the other hand, this approach requires segmented text (i.e. text that is divided into subtitles), thus the system needs to be able to simultaneously generate transcripts or translations together with proper subtitle delimiters. However, building a simultaneous subtitling system combines the difficulties of the simultaneous setting with the constraint of having the text formatted in proper subtitles. Since both these research directions are still evolving, a lot of work is required to achieve good results.
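The basic quantity behind this evaluation can be sketched as follows. This is a simplification: the ad-hoc calculation in Karakanta et al. (2021) accounts for the incremental way subtitles appear, while the sketch below only computes the plain characters-per-second rate of a fully displayed subtitle.

```python
def reading_speed_cps(subtitle, appear_s, disappear_s):
    """Characters per second a viewer must sustain to finish reading the
    subtitle while it is on screen; lower values mean a more comfortable
    read (professional subtitling guidelines often cap CPS somewhere
    around 15-21, an indicative range only)."""
    display_time = disappear_s - appear_s
    if display_time <= 0:
        return float("inf")   # never displayed: unreadable
    return len(subtitle) / display_time

# A 42-character line shown for 3 seconds requires reading at 14 CPS.
line = "Es ging um eine Möglichkeit für die Eltern"
print(round(reading_speed_cps(line, appear_s=10.0, disappear_s=13.0), 1))  # 14.0
```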
The lack of studies on these aspects highlights the shortcomings of current SimulST systems, identifying possible improvements that will allow the systems to evolve in a more organic and complete way according to the user needs. Moreover, to completely assess the subtitling scenario, a system has to be able to jointly produce timestamp metadata linked to the emitted words, a task that has not been addressed so far. The need for this kind of system represents an interesting direction for the simultaneous community to follow. In light of this, researchers should also take the three quality-latency-visualization factors into account in their analyses. We are convinced that these are the most promising aspects to work on to build the best SimulST system for the audience, and that human evaluation has to play a crucial role in future studies. We also believe that an interdisciplinary dialogue with other fields such as cognitive studies, media accessibility and human-computer interaction would be very insightful to evaluate SimulST outputs from communicative perspectives (Fantinuoli and Prandi, 2021).

6 Conclusions and Future Directions

SimulST systems have become increasingly popular in recent years and many efforts have been made to build robust and efficient models. Despite the difficulties introduced by the online framework, these models have rapidly improved, achieving results comparable to those of offline systems. However, many research directions have not been explored enough (e.g., the adoption of dynamic or fixed segmentation, offline or online training). First among all, the visualization strategy adopted to display the output of simultaneous systems is an important and largely under-analyzed aspect of the simultaneous experience. We posit that application-oriented metrics (e.g., reading speed), which are strongly related to the visualization and, as an ultimate goal, to the user experience, are the factor missing from the current evaluation environment. Indeed, this paper points out that BLEU and Average Lagging are not the only metrics that matter to effectively evaluate a SimulST model, even if they are fundamental to judge a correct and timely translation. We hope that this will inspire the community to work on this critical aspect in the future.

Acknowledgement

This work has been carried out as part of the project Smarter Interpreting (https://kunveno.digital/) financed by CDTI Neotec funds.

References

Naveen Arivazhagan, Colin Cherry, Wolfgang Macherey, Chung-Cheng Chiu, Semih Yavuz, Ruoming Pang, Wei Li, and Colin Raffel. 2019. Monotonic infinite lookback attention for simultaneous machine translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1313–1323, Florence, Italy, July. Association for Computational Linguistics.

Naveen Arivazhagan, Colin Cherry, Wolfgang Macherey, and George Foster. 2020a. Re-translation versus streaming for simultaneous translation. In Proceedings of the 17th International Conference on Spoken Language Translation, pages 220–227, Online, July. Association for Computational Linguistics.

Naveen Arivazhagan, Colin Cherry, Isabelle Te, Wolfgang Macherey, Pallavi Baljekar, and George Foster. 2020b. Re-translation strategies for long form, simultaneous, spoken language translation. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7919–7923. IEEE.

Luisa Bentivogli, Mauro Cettolo, Marco Gaido, Alina Karakanta, Alberto Martinelli, Matteo Negri, and Marco Turchi. 2021. Cascade versus direct speech translation: Do the differences still make a difference? In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 2873–2887, Online, August. Association for Computational Linguistics.

Alexandre Bérard, Olivier Pietquin, Christophe Servan, and Laurent Besacier. 2016. Listen and Translate: A Proof of Concept for End-to-End Speech-to-Text Translation. In NIPS Workshop on End-to-End Learning for Speech and Audio Processing, Barcelona, Spain, December.

Junkun Chen, Mingbo Ma, Renjie Zheng, and Liang Huang. 2021. Direct simultaneous speech-to-text translation assisted by synchronized streaming ASR. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 4618–4624, Online, August. Association for Computational Linguistics.

Colin Cherry and George Foster. 2019. Thinking slow about latency evaluation for simultaneous machine translation.

Kyunghyun Cho and Masha Esipova. 2016. Can neural machine translation do simultaneous translation?

Fahim Dalvi, Nadir Durrani, Hassan Sajjad, and Stephan Vogel. 2018. Incremental decoding and training methods for simultaneous translation in neural machine translation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 493–499, New Orleans, Louisiana, June. Association for Computational Linguistics.
Chris Dyer, Victor Chahuneau, and Noah A. Smith. 2013. A simple, fast, and effective reparameterization of IBM model 2. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 644–648, Atlanta, Georgia, June. Association for Computational Linguistics.

Maha Elbayad, Laurent Besacier, and Jakob Verbeek. 2020a. Efficient Wait-k Models for Simultaneous Machine Translation. In Proc. Interspeech 2020, pages 1461–1465.

Maha Elbayad, Michael Ustaszewski, Emmanuelle Esperança-Rodier, Francis Brunet-Manquat, Jakob Verbeek, and Laurent Besacier. 2020b. Online versus offline NMT quality: An in-depth analysis on English-German and German-English. In Proceedings of the 28th International Conference on Computational Linguistics, pages 5047–5058, Barcelona, Spain (Online), December. International Committee on Computational Linguistics.

Claudio Fantinuoli and Bianca Prandi. 2021. Towards the evaluation of automatic simultaneous speech translation from a communicative perspective. In Proceedings of the 18th International Conference on Spoken Language Translation (IWSLT 2021), pages 245–254, Bangkok, Thailand (online), August. Association for Computational Linguistics.

C. Fügen. 2009. A system for simultaneous translation of lectures and speeches.

Tomoki Fujita, Graham Neubig, S. Sakti, T. Toda, and Satoshi Nakamura. 2013. Simple, lexicalized choice of translation timing for simultaneous speech translation. In INTERSPEECH.

Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. 2006. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning, ICML '06, pages 369–376, New York, NY, USA. Association for Computing Machinery.

Jiatao Gu, Graham Neubig, Kyunghyun Cho, and Victor O.K. Li. 2017. Learning to translate in real-time with neural machine translation. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 1053–1062, Valencia, Spain, April. Association for Computational Linguistics.

Alina Karakanta, Sara Papi, Matteo Negri, and Marco Turchi. 2021. Simultaneous speech translation for live subtitling: from delay to display. In Proceedings of the 1st Workshop on Automatic Spoken Language Translation in Real-World Settings (ASLTRW), pages 35–48, Virtual, August. Association for Machine Translation in the Americas.

Mingbo Ma, Liang Huang, Hao Xiong, Renjie Zheng, Kaibo Liu, Baigong Zheng, Chuanqiang Zhang, Zhongjun He, Hairong Liu, Xing Li, Hua Wu, and Haifeng Wang. 2019. STACL: Simultaneous translation with implicit anticipation and controllable latency using prefix-to-prefix framework. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3025–3036, Florence, Italy, July. Association for Computational Linguistics.

Xutai Ma, Mohammad Javad Dousti, Changhan Wang, Jiatao Gu, and Juan Pino. 2020a. SimulEval: An evaluation toolkit for simultaneous translation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 144–150, Online, October. Association for Computational Linguistics.

Xutai Ma, Juan Pino, and Philipp Koehn. 2020b. SimulMT to SimulST: Adapting simultaneous text translation to end-to-end simultaneous speech translation. In Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing, pages 582–587, Suzhou, China, December. Association for Computational Linguistics.

Niko Moritz, Takaaki Hori, and Jonathan Le Roux. 2020. Streaming automatic speech recognition with the transformer model. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6074–6078. IEEE.

Ha Nguyen, Yannick Estève, and Laurent Besacier. 2021a. An empirical study of end-to-end simultaneous speech translation decoding strategies. In ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7528–7532. IEEE.

Ha Nguyen, Yannick Estève, and Laurent Besacier. 2021b. Impact of Encoding and Segmentation Strategies on End-to-End Simultaneous Speech Translation. In Proc. Interspeech 2021, pages 2371–2375.

J. Niehues, T. Nguyen, Eunah Cho, Thanh-Le Ha, Kevin Kilgour, Markus Müller, Matthias Sperber, S. Stüker, and A. Waibel. 2016. Dynamic transcription for low-latency speech translation. In INTERSPEECH.

J. Niehues, Ngoc-Quan Pham, Thanh-Le Ha, Matthias Sperber, and A. Waibel. 2018. Low-latency neural speech translation. In INTERSPEECH.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA, July. Association for Computational Linguistics.
Elisa Perego, F. Del Missier, M. Porta, and M. Mosconi. 2010. The cognitive effectiveness of subtitle processing. Media Psychology, 13:243–272.

Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Brussels, Belgium, October. Association for Computational Linguistics.

Dhevi J. Rajendran, Andrew T. Duchowski, Pilar Orero, Juan Martínez, and Pablo Romero-Fresco. 2013. Effects of text chunking on subtitling: A quantitative and qualitative examination. Perspectives, 21(1):5–21.

Yi Ren, Jinglin Liu, Xu Tan, Chen Zhang, Tao Qin, Zhou Zhao, and Tie-Yan Liu. 2020. SimulSpeech: End-to-end simultaneous speech to text translation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 3787–3796, Online, July. Association for Computational Linguistics.

Pablo Romero-Fresco. 2010. Standing on quicksand: hearing viewers' comprehension and reading patterns of respoken subtitles for the news, pages 175–194. Brill, Leiden, The Netherlands.

Pablo Romero-Fresco. 2011. Subtitling through speech recognition: Respeaking. Manchester: St. Jerome.

Frederick W. M. Stentiford and Martin G. Steer. 1988. Machine Translation of Speech. British Telecom Technology Journal, 6(2):116–122.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 27. Curran Associates, Inc.
Alex Waibel, Ajay N. Jain, Arthur E. McNair, Hiroaki Saito, Alexander G. Hauptmann, and Joe Tebelskis. 1991. JANUS: A Speech-to-Speech Translation System Using Connectionist and Symbolic Processing Strategies. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing, ICASSP 1991, pages 793–796, Toronto, Canada, May 14-17.

Chengyi Wang, Yu Wu, Shujie Liu, Jinyu Li, Liang Lu, Guoli Ye, and Ming Zhou. 2020. Low latency end-to-end streaming speech recognition with a scout network. arXiv preprint arXiv:2003.10369.

Ron J. Weiss, Jan Chorowski, Navdeep Jaitly, Yonghui Wu, and Zhifeng Chen. 2017. Sequence-to-Sequence Models Can Directly Translate Foreign Speech. In Proceedings of Interspeech 2017, pages 2625–2629, Stockholm, Sweden, August.

Hao Xiong, Ruiqing Zhang, Chuanqiang Zhang, Zhongjun He, Hua Wu, and Haifeng Wang. 2019. Dutongchuan: Context-aware translation model for simultaneous interpreting. arXiv preprint arXiv:1907.12984.

Xingshan Zeng, Liangyou Li, and Qun Liu. 2021. RealTranS: End-to-end simultaneous speech translation with convolutional weighted-shrinking transformer. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 2461–2474, Online, August. Association for Computational Linguistics.

Baigong Zheng, Renjie Zheng, Mingbo Ma, and Liang Huang. 2019a. Simpler and faster learning of adaptive policies for simultaneous translation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1349–1354, Hong Kong, China, November. Association for Computational Linguistics.

Baigong Zheng, Renjie Zheng, Mingbo Ma, and Liang Huang. 2019b. Simultaneous translation with flexible policy via restricted imitation learning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5816–5822, Florence, Italy, July. Association for Computational Linguistics.

Renjie Zheng, Mingbo Ma, Baigong Zheng, and Liang Huang. 2019c. Speculative beam search for simultaneous translation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1395–1402, Hong Kong, China, November. Association for Computational Linguistics.

Baigong Zheng, Kaibo Liu, Renjie Zheng, M. Ma, Hairong Liu, and L. Huang. 2020. Simultaneous translation policies: From fixed to adaptive. ArXiv, abs/2004.13169.