Visualization: The Missing Factor in Simultaneous Speech Translation

Sara Papi1,2, Matteo Negri1, Marco Turchi1
1. Fondazione Bruno Kessler, Italy
2. University of Trento, Italy
{spapi,negri,turchi}@fbk.eu

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

Simultaneous speech translation (SimulST) is the task in which output generation has to be performed on partial, incremental speech input. In recent years, SimulST has become popular due to the spread of multilingual application scenarios, like international live conferences and streaming lectures, in which on-the-fly speech translation can facilitate users' access to audio-visual content. In this paper, we analyze the characteristics of the SimulST systems developed so far, discussing their strengths and weaknesses. We then concentrate on the evaluation framework required to properly assess systems' effectiveness. To this end, we raise the need for a broader performance analysis, which also includes the user experience standpoint. We argue that SimulST systems should be evaluated not only in terms of quality/latency measures, but also via task-oriented metrics accounting, for instance, for the visualization strategy adopted. In light of this, we highlight the goals achieved by the community and what is still missing.

1 Introduction

Simultaneous speech translation (SimulST) is the task in which the translation of a source language speech has to be performed on partial, incremental input. This is a key feature to achieve low latency in scenarios like streaming conferences and lectures, where the text has to be displayed following as much as possible the pace of the speech.

SimulST is indeed a complex task, in which the difficulties of performing speech recognition from partial inputs are exacerbated by the problem of projecting meaning across languages. Despite the increasing demand for such systems, the problem is still far from being solved.

So far, research efforts have mainly focused on the quality/latency trade-off, i.e. producing high quality outputs in the shortest possible time, balancing the need for a good translation with the necessity of rapid text generation. Previous studies, however, disregard how the translation is displayed and, consequently, how it is actually perceived by the end users. After a concise survey of the state of the art in the field, in this paper we posit that, from the users' experience standpoint, output visualization is at least as important as having a good translation in a short time. This raises the need for a broader, task-oriented and human-centered analysis of SimulST systems' performance, also accounting for this third crucial factor.

2 Background

As in the case of offline speech translation, the adoption of cascade architectures (Stentiford and Steer, 1988; Waibel et al., 1991) was the first attempt made by the SimulST community to tackle the problem of generating text from partial, incremental input. Cascade systems (Fügen, 2009; Fujita et al., 2013; Niehues et al., 2018; Xiong et al., 2019; Arivazhagan et al., 2020b) involve a pipeline of two components. First, a streaming automatic speech recognition (ASR) module transcribes the input speech into the corresponding text (Wang et al., 2020; Moritz et al., 2020). Then, a simultaneous text-to-text translation module translates the partial transcription into target-language text (Gu et al., 2017; Dalvi et al., 2018; Ma et al., 2019; Arivazhagan et al., 2019). This approach suffers from error propagation, a well-known problem even in the offline scenario: the transcription errors made by the ASR module are propagated to the MT module, which cannot recover from them as it does not have direct access to the audio. Another strong limitation of cascaded systems is the extra latency added by the two-step pipeline, since the MT module has to wait until the streaming ASR output is produced.

To overcome these issues, the direct models initially proposed by Bérard et al. (2016) and Weiss et al. (2017) represent a valid alternative that is gaining increasing traction (Bentivogli et al., 2021). Direct ST models are composed of an encoder, usually bidirectional, and a decoder. The encoder starts from the audio features extracted from the input signal and computes a hidden representation; the decoder transforms this representation into target language text. Direct modeling becomes crucial in the simultaneous scenario, as it reduces the overall system's latency due to the absence of intermediate symbolic representation steps. Despite the data scarcity issue caused by the limited availability of speech-to-translation corpora, the adoption of direct architectures has proved promising (Weiss et al., 2017; Ren et al., 2020; Zeng et al., 2021), driving recent efforts towards the development of increasingly powerful and efficient models.
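For readers less familiar with the two designs, the following minimal sketch contrasts where the cascade's two weaknesses arise. All model functions here are toy stubs introduced only for illustration, not the API of any real toolkit.

```python
# Toy stubs standing in for real models; hypothetical names, illustration only.
def streaming_asr(chunk, state):  return f"<transcript of {chunk}>"
def simultaneous_mt(text, state): return f"<translation of {text}>"
def direct_st(chunk, state):      return f"<translation of {chunk}>"

def cascade_step(audio_chunk, state):
    # Step 1: any transcription error made here...
    partial = streaming_asr(audio_chunk, state)
    # ...is inherited by step 2, which never sees the audio (error
    # propagation) and must wait for step 1's output (extra latency).
    return simultaneous_mt(partial, state)

def direct_step(audio_chunk, state):
    # A single model maps audio straight to target-language text,
    # with no intermediate symbolic (transcript) representation.
    return direct_st(audio_chunk, state)
```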
3 Architectural Challenges

This section surveys the direct SimulST models developed so far, highlighting strengths and weaknesses of the current architectures and decision policies – i.e. the strategies used by the system to decide whether to output a partial translation or to wait for more audio information. We discuss ongoing research on architectural improvements of encoder-decoder models, as well as popular approaches like offline training and re-translation. All these works concentrate on reducing system latency, targeting a better quality/latency trade-off.

Encoding Strategy. Few studies (Elbayad et al., 2020a; Nguyen et al., 2021b) tried to improve the encoder part of simultaneous systems. Elbayad et al. (2020a) and Nguyen et al. (2021b) introduced the use of unidirectional encoders instead of standard bidirectional ones (i.e. the encoder states are not updated after each read action) to speed up the decoding phase. Nguyen et al. (2021b) also proposed an encoding strategy called Overlap-and-Compensate, where the encoder exploits extra frames from the past that were discarded during the previous encoding step. The segmentation problem is a crucial aspect in SimulST, where the system needs to split a long audio input into smaller chunks (speech frames) in order to process them. Different segmentation techniques can be adopted, ranging from the easiest one based on fixed time windows (Ma et al., 2020b) to dynamic ones based on automatically detected word boundaries (Zeng et al., 2021; Chen et al., 2021). Ma et al. (2020b) also studied a dynamic segmentation based on oracle boundaries, but discovered that, in their scenario, it performed worse than the fixed segmentation.
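As a reference point, fixed-window segmentation is the simplest of these techniques. The sketch below assumes raw mono audio and an illustrative 280 ms window; the chunk size is a tunable latency knob, and no specific value is prescribed by the cited works.

```python
import numpy as np

def fixed_window_segments(waveform: np.ndarray, sample_rate: int,
                          window_ms: int = 280):
    """Split a mono waveform into fixed-size speech chunks.

    window_ms is illustrative only: smaller windows mean more frequent
    READ actions (lower latency, less context per step) and vice versa.
    """
    step = int(sample_rate * window_ms / 1000)
    return [waveform[i:i + step] for i in range(0, len(waveform), step)]

# Example: 10 s of 16 kHz audio -> ceil(10000 ms / 280 ms) = 36 chunks.
chunks = fixed_window_segments(np.zeros(160_000), 16_000)
assert len(chunks) == 36
```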
Decoding Strategy. Some efforts have been made to improve the decoding strategy, as it strongly correlates with the decision policy of simultaneous systems. Speculative beam search, or SBS (Zheng et al., 2019c), represents the first successful attempt to use beam search in SimulST. This technique consists in hallucinating several prediction steps in the future in order to make more accurate decisions based on the best "speculative" prediction obtained. Zeng et al. (2021) also integrate beam search in the decoding strategy, developing the wait-k-stride-N strategy. In particular, the authors bypass output speculation by directly applying beam search, after waiting for k words, on a word stride of size N (i.e., on N words at a time) instead of one single word as prescribed by the standard wait-k. Nguyen et al. (2021a) analyzed several decoding strategies relying on different output token granularities, such as characters and Byte Pair Encoding (BPE), showing that the latter yields lower latency.

Offline or Online training? An alternative approach to simultaneous training is the offline (or full-sentence) training of the system and its subsequent use as a simultaneous one. Nguyen et al. (2021a) explored this solution with an LSTM-based direct ST system, analyzing the effectiveness of different decoding strategies. Interestingly, the offline approach not only preserves overall performance despite the switch of modality, but also improves the system's ability to generate well-formed sentences. These results are confirmed by Chen et al. (2021), who successfully exploit a direct ST system jointly trained in an offline fashion with an ASR one.

Another point of view: re-translation. Re-translation (Niehues et al., 2016; Niehues et al., 2018; Arivazhagan et al., 2020a; Arivazhagan et al., 2020b) consists in re-generating the output from scratch (e.g. after a fixed amount of time) for as long as new information is received. This approach ensures high quality (the final output is produced with all the available context) and low latency (partial translations can be generated with a fixed, controllable delay). This, however, comes at the cost of strong output instability (the so-called flickering, due to continuous updates of the displayed translations), which is not optimal from the user experience standpoint. To this end, some metrics have been developed to measure the instability phenomenon, such as the Erasure (Arivazhagan et al., 2020b), which measures the number of tokens that were deleted from the emitted translation to produce the next one.
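The prefix-based idea behind such an instability measure can be sketched as follows. This is a simplified reading of Erasure (tokens deleted from the tail of the previous output); the exact formulation and normalization in Arivazhagan et al. (2020b) may differ.

```python
def erasure(prev_tokens, new_tokens):
    """Tokens that must be deleted from the tail of the previous output
    before appending new material (0 = purely additive update)."""
    common = 0
    for p, n in zip(prev_tokens, new_tokens):
        if p != n:
            break
        common += 1
    return len(prev_tokens) - common

def total_erasure(revisions):
    """Sum the erasure over a sequence of re-translated outputs."""
    return sum(erasure(a, b) for a, b in zip(revisions, revisions[1:]))

# A flickering re-translation: "war" is erased and rewritten as "ging um".
revisions = [["Es"], ["Es", "war"], ["Es", "ging", "um"]]
print(total_erasure(revisions))  # 1
```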
Decision Policy. In simultaneous settings, the model has to decide, at each time step, whether the available information is enough to produce a partial translation – i.e. to perform a write action using the information received until that step (audio chunk/s in the case of SimulST, or token/s in the case of simultaneous MT) – or whether it has to wait and perform a read action to receive new information from the input. Possible decision policies result in different ways to balance the quality/latency trade-off. On one side, more read actions provide the system with a larger context, useful to generate translations of higher quality. On the other side, this comes at the cost of an increased, sometimes unacceptable, latency.

To address this problem, two types of policy have been proposed so far: fixed and adaptive. While fixed decision policies only look at the number of ingested tokens (or speech chunks, in the speech scenario), in adaptive ones the decision is taken by also looking at the contextual information extracted from the input. While little research focused on adaptive policies (Gu et al., 2017; Zheng et al., 2019a; Zheng et al., 2020) due to their hard and time-consuming training (Zheng et al., 2019b; Arivazhagan et al., 2019), the adoption of very easy-to-train fixed policies is the typical choice. Indeed, the most widely used policy is a fixed one, called wait-k (Ma et al., 2019). Simple yet effective, it is based on waiting for k source words before starting to generate the target sentence, as shown in Table 1.

| source | It | was | a | way | that | parents | ... |
|--------|----|-----|---|-----|------|---------|------|
| wait-3 | -  | -   | - | Es  | ging | um      | eine |
| wait-5 | -  | -   | - | -   | -    | Es      | ging |

Table 1: wait-k policy example with k = {3, 5}.
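The read/write schedule of Table 1 can be sketched in a few lines. Here `predict_next(src, tgt)` stands for one decoder step of a simultaneous model and is a caller-supplied placeholder, not the API of any specific toolkit:

```python
def wait_k_decode(source_stream, predict_next, k, max_len=100):
    """Fixed wait-k policy: READ k source units first, then alternate one
    WRITE per READ; once the source is exhausted, finish with full context.
    """
    src, tgt = [], []
    for unit in source_stream:                 # READ a word / speech chunk
        src.append(unit)
        if len(src) >= k and len(tgt) < max_len:
            tgt.append(predict_next(src, tgt))     # WRITE one target unit
    while len(tgt) < max_len and (not tgt or tgt[-1] != "</s>"):
        tgt.append(predict_next(src, tgt))         # tail of the sentence
    return tgt

# Toy "decoder" that just numbers its outputs, to show the schedule.
out = wait_k_decode(["It", "was", "a", "way", "that", "parents"],
                    lambda s, t: f"y{len(t) + 1}" if len(t) + 1 < 7 else "</s>",
                    k=3)
print(out)  # ['y1', 'y2', 'y3', 'y4', 'y5', 'y6', '</s>']
```

The wait-k-stride-N variant discussed above would replace the single WRITE with N tokens emitted at once, possibly re-ranked with beam search over the stride.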
As the original wait-k implementation is based on textual source data, Ma et al. (2020b) adapted it to the audio domain by waiting for k fixed time frames (audio chunks or speech frames) rather than k words. However, this simplistic approach does not consider various aspects of human speech, such as different speech rates, durations, pauses, and silences. In (Ren et al., 2020), the adaptation was done differently, by including a Connectionist Temporal Classification (CTC)-based (Graves et al., 2006) segmentation module that is able to determine word boundaries. In this case, the wait-k strategy is applied by waiting for k pauses between words, automatically detected by the segmenter. Similarly, Zeng et al. (2021) employed the CTC-based segmentation method, but applied a wait-k-stride-N policy to allow re-ranking during the decoding phase. The wait-k-stride-N model emits more than one word at a time, slightly increasing the latency, since the output is emitted only after the whole stride is processed. This small increase in latency, however, allows the model to perform beam search on the stride, which has been shown to be effective in improving translation quality (Sutskever et al., 2014). Decoding more than one word at a time is the approach also employed by Nguyen et al. (2021a), who showed that emitting two words at a time increases the quality of the translation without any relevant impact on latency.

Another way of applying the wait-k strategy was proposed by Chen et al. (2021), where a streaming ASR system is used to guide the direct ST decoding. They look at the ASR beam to decide how many tokens have been emitted within the partial audio segment, hence having the information needed to apply the original wait-k policy in a straightforward way. An interesting solution is also the one by Elbayad et al. (2020a), who jointly train a direct model across multiple wait-k paths. Once the sentence has been encoded, they optimize the system by uniformly sampling the k value for the decoding step. Even though they reach good performance by using a single-path training with k=7 and a different k value for testing, the multi-path approach proved to be effective. One of its advantages is that no k value has to be specified for training, which avoids training several models from scratch for different values of k.

Retrospective. All the aspects analyzed in this section highlight several research directions already taken by the simultaneous community, which have to be studied more in depth. Among them, the audio or text segmentation strategy clearly emerges as a fundamental factor of simultaneous systems, and the ambivalent results obtained in several studies point out that this aspect has to be better clarified. Moreover, the extensive literature on the wait-k policy shows that it represents one of the topics of greatest interest to the community, which continues to work on it to further improve its effectiveness, as it directly impacts the systems' performance, especially latency. Unfortunately, all these studies focus on architecture enhancements and decision policies despite the absence of a unique and clear evaluation framework allowing for a correct and complete analysis of the systems.

4 Evaluation Challenges

A good simultaneous model should produce a high quality translation in reasonable time, as waiting too long will negatively affect the user experience. The offline MT and ST communities commonly use the well-established BLEU metric (Papineni et al., 2002; Post, 2018) to measure the quality of the output translation, but a simultaneous system also needs a metric that accounts for the time spent by the system to output the partial translation. Simultaneous MT (SimulMT) is the task in which a real-time translation is produced having only a partial source text at disposal. Since SimulMT was the first and easiest simultaneous scenario studied by the community, a set of metrics was initially introduced for the textual input-output setting.

Latency Metrics for SimulMT. The first metric, the Average Proportion (AP), was proposed by Cho and Esipova (2016) and measures the average proportion of source input read when generating a target prediction, computed by summing the number of source tokens read when generating each partial target token. However, AP is not length-invariant, i.e. the value of the metric depends on the input and output lengths, and it is not evenly distributed on the [0, 1] interval (Ma et al., 2019), making it strongly unreliable.

To overcome these problems, Ma et al. (2019) introduced the Average Lagging (AL), which directly describes the lag behind the ideal policy, i.e. a policy that produces the output exactly at the same pace as the speech source. As a downside, Average Lagging is not differentiable, which would instead be a useful property, especially if the metric is to be added to the system's loss computation. For this reason, Cherry and Foster (2019) proposed the Differentiable Average Lagging (DAL), introducing a minimum delay after each operation.
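For concreteness, the two metrics can be written down explicitly. The notation follows Ma et al. (2019): x is the source, y the generated output, and g(t) the number of source units that have been read when the t-th target unit is emitted:

\[ \mathrm{AP} = \frac{1}{|\mathbf{x}|\,|\mathbf{y}|} \sum_{t=1}^{|\mathbf{y}|} g(t) \]

\[ \mathrm{AL} = \frac{1}{\tau} \sum_{t=1}^{\tau} \left( g(t) - \frac{t-1}{\gamma} \right), \qquad \gamma = \frac{|\mathbf{y}|}{|\mathbf{x}|}, \qquad \tau = \min \{ t : g(t) = |\mathbf{x}| \} \]

Here \((t-1)/\gamma\) is the ideal, perfectly paced policy that AL measures the lag against, and \(\tau\) truncates the sum at the step in which the source has been entirely read.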
Another way of measuring the lagging is to compute the alignment difficulty of a source-target pair. Hence, Elbayad et al. (2020b) proposed the Lagging Difficulty (LD) metric, which exploits the fast-align (Dyer et al., 2013) tool to estimate the source-target alignments. They then infer the reference decoding path and compute the AL metric on it. The authors claim LD to be a realistic measure of simultaneous translation, as it also evaluates how easy a translation is to align given the context available when decoding.

Latency Metrics for SimulST. The popular AP, AL and DAL metrics were subsequently adapted by the SimulST community to the speech scenario by converting, for instance, the number of words into the sum of the speech segment durations, as per (Ma et al., 2020a). Later, Ma et al. (2020b) raised the issue of using computation-unaware metrics and proposed computation-aware variants that account for the time spent by the model to generate the output. Unfortunately, computing such metrics is far from easy in the absence of a unique and reproducible environment in which to evaluate the model's performance. To this end, Ma et al. (2020a) proposed SimulEval, a tool which computes the metrics by simulating a real-time scenario with a server-client scheme. This toolkit automatically evaluates simultaneous translations (both text and speech) given a customizable agent that can be defined by the user and that will depend on the adopted policy. Despite the progress in the metrics for evaluating quality and latency, no studies have been conducted on their actual correlation with user experience. This represents a missing key point in the current evaluation framework landscape, giving rise to the need for a tool that combines quality and latency metrics with application-oriented metrics (e.g., reading speed), which are strongly correlated to the visualization and, as an ultimate goal, to the user experience.

5 The Missing Factor: Visualization

In the previous section, we introduced the most popular metrics used to evaluate simultaneous systems' performance. These metrics account for the quality and the latency of the system without capturing the user needs. Although many researchers acknowledge the importance of human evaluation, this partial view can push the community in the wrong direction, in which all the efforts are focused on the quality/latency factors while the problem experienced by the user is of another kind. Indeed, the third factor that matters, and strongly influences human understanding of even a very good translation, is the visualization strategy adopted. The visualization problem, and the need to present the text in a readable fashion for the user, has so far been faced only in our previous work (Karakanta et al., 2021). In that paper, we raised the need for a clearer and less distracting visualization of the texts generated by a SimulST system by presenting them as subtitles (text segmented in lines preserving coherent information). We proposed different visualization strategies to better assess the online display problem, attempting to simulate a setting where human understanding is at the core of the analysis.

Visualization modalities. The standard word-for-word visualization method (Ma et al., 2019), in which the words appear sequentially on the screen as they are generated, can be strongly sub-optimal for human understanding (Romero-Fresco, 2011). In fact, the word-for-word approach has two main problems: i) the emission rate of words (some go too fast, some too slow) is irregular, and users waste more time reading the text because their eyes have to make more movements; and ii) the emitted pieces of text do not correspond to linguistic units/chunks, requiring more cognitive effort. Moreover, when the maximum length of the subtitle (which depends on the dimensions of the screen) is reached, the subtitle disappears without giving the user enough time to read the last words emitted. As this negatively impacts the user experience, in (Karakanta et al., 2021) we proposed to adopt different visualization modes that better accommodate human reading requirements.

We first introduced the block visualization mode, in which an entire subtitle (usually one or two lines maximum) is displayed at once, as soon as the system has finished generating it. This display mode is the easiest to read for the user because it prevents re-reading phenomena (Rajendran et al., 2013) and unnecessary/excessive eye fixations (Romero-Fresco, 2010), reducing the human effort. However, we discovered that the latency introduced by waiting for an entire subtitle is too high for this visualization mode to be used in many simultaneous scenarios. As a consequence, we proposed the scrolling lines visualization mode, which displays the subtitles line by line. Every time a new line becomes available, it appears at the bottom of the screen, while the previous (older) line is scrolled to the upper position. In this way, there are always two lines displayed on the screen.
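The display logic of the scrolling-lines mode is simple enough to sketch in a few lines. This is only a minimal illustration of the two-line buffer, not the actual implementation used in Karakanta et al. (2021):

```python
from collections import deque

class ScrollingLinesDisplay:
    """Two-line subtitle area: a new line enters at the bottom and pushes
    the previous one to the upper position; the oldest line scrolls off."""

    def __init__(self, max_lines=2):
        self.lines = deque(maxlen=max_lines)  # overflow drops the oldest line

    def push_line(self, line):
        self.lines.append(line)

    def render(self):
        return "\n".join(self.lines)

display = ScrollingLinesDisplay()
for line in ["Es ging um eine Möglichkeit,",
             "wie Eltern lernen können,",
             "ihren Kindern zu helfen."]:
    display.push_line(line)
print(display.render())  # only the last two lines remain on screen
```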
To evaluate the performance of the system in the different visualization modes, we also proposed an ad-hoc calculation of the reading speed (characters per second, or CPS) that correlates with human judgment of the subtitles (Perego et al., 2010). The reading speed shows how fast a user needs to read in order not to miss any part of the subtitle. The lower the reading speed, the better the model's output, since a fast reading speed increases the cognitive load and leaves less time to look at the image. The scrolling lines method offers the best balance between latency and a comfortable reading speed, making it the best choice for the simultaneous scenario. On the other hand, this approach requires segmented text (i.e. text that is divided into subtitles), thus the system needs to be able to simultaneously generate transcripts or translations together with proper subtitle delimiters. However, building a simultaneous subtitling system combines the difficulties of the simultaneous setting with the constraint of having the text formatted in proper subtitles. Since both these research directions are still evolving, a lot of work is required to achieve good results.
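The basic quantity behind this evaluation can be sketched as follows. This is a simplification: the ad-hoc calculation in Karakanta et al. (2021) accounts for the incremental way subtitles appear, while the sketch below only computes the plain characters-per-second rate of a fully displayed subtitle.

```python
def reading_speed_cps(subtitle, appear_s, disappear_s):
    """Characters per second a viewer must sustain to finish reading the
    subtitle while it is on screen; lower values mean a more comfortable
    read (professional subtitling guidelines often cap CPS somewhere
    around 15-21, an indicative range only)."""
    display_time = disappear_s - appear_s
    if display_time <= 0:
        return float("inf")   # never displayed: unreadable
    return len(subtitle) / display_time

# A 42-character line shown for 3 seconds requires reading at 14 CPS.
line = "Es ging um eine Möglichkeit für die Eltern"
print(round(reading_speed_cps(line, appear_s=10.0, disappear_s=13.0), 1))  # 14.0
```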
The lack of studies on these aspects highlights the shortcomings of current SimulST systems, identifying possible improvements that will allow the systems to evolve in a more organic and complete way according to the user needs. Moreover, to completely assess the subtitling scenario, a system has to be able to jointly produce timestamp metadata linked to the emitted words, a task that has not been addressed so far. The need for this kind of system represents an interesting direction for the simultaneous community to follow. In light of this, researchers should also take the three quality-latency-visualization factors into account in their analyses. We are convinced that these are the most promising aspects to work on to build the best SimulST system for the audience, and that human evaluation has to play a crucial role in future studies. We also believe that an interdisciplinary dialogue with other fields such as cognitive studies, media accessibility and human-computer interaction would be very insightful to evaluate SimulST outputs from communicative perspectives (Fantinuoli and Prandi, 2021).

6 Conclusions and Future Directions

SimulST systems have become increasingly popular in recent years and many efforts have been made to build robust and efficient models. Despite the difficulties introduced by the online framework, these models have rapidly improved, achieving results comparable to those of offline systems. However, many research directions have not been explored enough (e.g., the adoption of dynamic or fixed segmentation, offline or online training). First among all, the visualization strategy adopted to display the output of simultaneous systems is an important and largely under-analyzed aspect of the simultaneous experience. We posit that application-oriented metrics (e.g., reading speed), which are strongly related to the visualization and, as an ultimate goal, to the user experience, are the factor missing from the current evaluation environment. Indeed, this paper points out that BLEU and Average Lagging are not the only metrics that matter to effectively evaluate a SimulST model, even if they are fundamental to judge a correct and timely translation. We hope that this will inspire the community to work on this critical aspect in the future.

Acknowledgement

This work has been carried out as part of the project Smarter Interpreting (https://kunveno.digital/) financed by CDTI Neotec funds.

References

Naveen Arivazhagan, Colin Cherry, Wolfgang Macherey, Chung-Cheng Chiu, Semih Yavuz, Ruoming Pang, Wei Li, and Colin Raffel. 2019. Monotonic infinite lookback attention for simultaneous machine translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1313–1323, Florence, Italy, July. Association for Computational Linguistics.

Naveen Arivazhagan, Colin Cherry, Wolfgang Macherey, and George Foster. 2020a. Re-translation versus streaming for simultaneous translation. In Proceedings of the 17th International Conference on Spoken Language Translation, pages 220–227, Online, July. Association for Computational Linguistics.

Naveen Arivazhagan, Colin Cherry, Isabelle Te, Wolfgang Macherey, Pallavi Baljekar, and George Foster. 2020b. Re-translation strategies for long form, simultaneous, spoken language translation. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7919–7923. IEEE.

Luisa Bentivogli, Mauro Cettolo, Marco Gaido, Alina Karakanta, Alberto Martinelli, Matteo Negri, and Marco Turchi. 2021. Cascade versus direct speech translation: Do the differences still make a difference? In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 2873–2887, Online, August. Association for Computational Linguistics.

Alexandre Bérard, Olivier Pietquin, Christophe Servan, and Laurent Besacier. 2016. Listen and Translate: A Proof of Concept for End-to-End Speech-to-Text Translation. In NIPS Workshop on End-to-End Learning for Speech and Audio Processing, Barcelona, Spain, December.

Junkun Chen, Mingbo Ma, Renjie Zheng, and Liang Huang. 2021. Direct simultaneous speech-to-text translation assisted by synchronized streaming ASR. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 4618–4624, Online, August. Association for Computational Linguistics.

Colin Cherry and George Foster. 2019. Thinking slow about latency evaluation for simultaneous machine translation.

Kyunghyun Cho and Masha Esipova. 2016. Can neural machine translation do simultaneous translation?

Fahim Dalvi, Nadir Durrani, Hassan Sajjad, and Stephan Vogel. 2018. Incremental decoding and training methods for simultaneous translation in neural machine translation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 493–499, New Orleans, Louisiana, June. Association for Computational Linguistics.
Chris Dyer, Victor Chahuneau, and Noah A. Smith. 2013. A simple, fast, and effective reparameterization of IBM model 2. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 644–648, Atlanta, Georgia, June. Association for Computational Linguistics.

Maha Elbayad, Laurent Besacier, and Jakob Verbeek. 2020a. Efficient Wait-k Models for Simultaneous Machine Translation. In Proc. Interspeech 2020, pages 1461–1465.

Maha Elbayad, Michael Ustaszewski, Emmanuelle Esperança-Rodier, Francis Brunet-Manquat, Jakob Verbeek, and Laurent Besacier. 2020b. Online versus offline NMT quality: An in-depth analysis on English-German and German-English. In Proceedings of the 28th International Conference on Computational Linguistics, pages 5047–5058, Barcelona, Spain (Online), December. International Committee on Computational Linguistics.

Claudio Fantinuoli and Bianca Prandi. 2021. Towards the evaluation of automatic simultaneous speech translation from a communicative perspective. In Proceedings of the 18th International Conference on Spoken Language Translation (IWSLT 2021), pages 245–254, Bangkok, Thailand (online), August. Association for Computational Linguistics.

C. Fügen. 2009. A system for simultaneous translation of lectures and speeches.

Tomoki Fujita, Graham Neubig, S. Sakti, T. Toda, and Satoshi Nakamura. 2013. Simple, lexicalized choice of translation timing for simultaneous speech translation. In INTERSPEECH.

Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. 2006. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning, ICML '06, pages 369–376, New York, NY, USA. Association for Computing Machinery.

Jiatao Gu, Graham Neubig, Kyunghyun Cho, and Victor O.K. Li. 2017. Learning to translate in real-time with neural machine translation. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 1053–1062, Valencia, Spain, April. Association for Computational Linguistics.

Alina Karakanta, Sara Papi, Matteo Negri, and Marco Turchi. 2021. Simultaneous speech translation for live subtitling: from delay to display. In Proceedings of the 1st Workshop on Automatic Spoken Language Translation in Real-World Settings (ASLTRW), pages 35–48, Virtual, August. Association for Machine Translation in the Americas.

Mingbo Ma, Liang Huang, Hao Xiong, Renjie Zheng, Kaibo Liu, Baigong Zheng, Chuanqiang Zhang, Zhongjun He, Hairong Liu, Xing Li, Hua Wu, and Haifeng Wang. 2019. STACL: Simultaneous translation with implicit anticipation and controllable latency using prefix-to-prefix framework. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3025–3036, Florence, Italy, July. Association for Computational Linguistics.

Xutai Ma, Mohammad Javad Dousti, Changhan Wang, Jiatao Gu, and Juan Pino. 2020a. SimulEval: An evaluation toolkit for simultaneous translation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 144–150, Online, October. Association for Computational Linguistics.

Xutai Ma, Juan Pino, and Philipp Koehn. 2020b. SimulMT to SimulST: Adapting simultaneous text translation to end-to-end simultaneous speech translation. In Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing, pages 582–587, Suzhou, China, December. Association for Computational Linguistics.

Niko Moritz, Takaaki Hori, and Jonathan Le Roux. 2020. Streaming automatic speech recognition with the transformer model. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6074–6078. IEEE.

Ha Nguyen, Yannick Estève, and Laurent Besacier. 2021a. An empirical study of end-to-end simultaneous speech translation decoding strategies. In ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7528–7532. IEEE.

Ha Nguyen, Yannick Estève, and Laurent Besacier. 2021b. Impact of Encoding and Segmentation Strategies on End-to-End Simultaneous Speech Translation. In Proc. Interspeech 2021, pages 2371–2375.

J. Niehues, T. Nguyen, Eunah Cho, Thanh-Le Ha, Kevin Kilgour, Markus Müller, Matthias Sperber, S. Stüker, and A. Waibel. 2016. Dynamic transcription for low-latency speech translation. In INTERSPEECH.

J. Niehues, Ngoc-Quan Pham, Thanh-Le Ha, Matthias Sperber, and A. Waibel. 2018. Low-latency neural speech translation. In INTERSPEECH.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA, July. Association for Computational Linguistics.
Elisa Perego, F. Del Missier, M. Porta, and M. Mosconi. 2010. The cognitive effectiveness of subtitle processing. Media Psychology, 13:243–272.

Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Brussels, Belgium, October. Association for Computational Linguistics.

Dhevi J. Rajendran, Andrew T. Duchowski, Pilar Orero, Juan Martínez, and Pablo Romero-Fresco. 2013. Effects of text chunking on subtitling: A quantitative and qualitative examination. Perspectives, 21(1):5–21.

Yi Ren, Jinglin Liu, Xu Tan, Chen Zhang, Tao Qin, Zhou Zhao, and Tie-Yan Liu. 2020. SimulSpeech: End-to-end simultaneous speech to text translation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 3787–3796, Online, July. Association for Computational Linguistics.

Pablo Romero-Fresco. 2010. Standing on quicksand: hearing viewers' comprehension and reading patterns of respoken subtitles for the news, pages 175–194. Brill, Leiden, The Netherlands.

Pablo Romero-Fresco. 2011. Subtitling through speech recognition: Respeaking. Manchester: St. Jerome.

Frederick W. M. Stentiford and Martin G. Steer. 1988. Machine Translation of Speech. British Telecom Technology Journal, 6(2):116–122.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 27. Curran Associates, Inc.
Alex Waibel, Ajay N. Jain, Arthur E. McNair, Hiroaki Saito, Alexander G. Hauptmann, and Joe Tebelskis. 1991. JANUS: A Speech-to-Speech Translation System Using Connectionist and Symbolic Processing Strategies. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing, ICASSP 1991, pages 793–796, Toronto, Canada, May 14-17.

Chengyi Wang, Yu Wu, Shujie Liu, Jinyu Li, Liang Lu, Guoli Ye, and Ming Zhou. 2020. Low latency end-to-end streaming speech recognition with a scout network. arXiv preprint arXiv:2003.10369.

Ron J. Weiss, Jan Chorowski, Navdeep Jaitly, Yonghui Wu, and Zhifeng Chen. 2017. Sequence-to-Sequence Models Can Directly Translate Foreign Speech. In Proceedings of Interspeech 2017, pages 2625–2629, Stockholm, Sweden, August.

Hao Xiong, Ruiqing Zhang, Chuanqiang Zhang, Zhongjun He, Hua Wu, and Haifeng Wang. 2019. Dutongchuan: Context-aware translation model for simultaneous interpreting. arXiv preprint arXiv:1907.12984.

Xingshan Zeng, Liangyou Li, and Qun Liu. 2021. RealTranS: End-to-end simultaneous speech translation with convolutional weighted-shrinking transformer. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 2461–2474, Online, August. Association for Computational Linguistics.

Baigong Zheng, Renjie Zheng, Mingbo Ma, and Liang Huang. 2019a. Simpler and faster learning of adaptive policies for simultaneous translation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1349–1354, Hong Kong, China, November. Association for Computational Linguistics.

Baigong Zheng, Renjie Zheng, Mingbo Ma, and Liang Huang. 2019b. Simultaneous translation with flexible policy via restricted imitation learning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5816–5822, Florence, Italy, July. Association for Computational Linguistics.

Renjie Zheng, Mingbo Ma, Baigong Zheng, and Liang Huang. 2019c. Speculative beam search for simultaneous translation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1395–1402, Hong Kong, China, November. Association for Computational Linguistics.

Baigong Zheng, Kaibo Liu, Renjie Zheng, M. Ma, Hairong Liu, and L. Huang. 2020. Simultaneous translation policies: From fixed to adaptive. ArXiv, abs/2004.13169.