=Paper= {{Paper |id=Vol-2957/sepp_paper3 |storemode=property |title=Multilingual Simultaneous Sentence End and Punctuation Prediction (short paper) |pdfUrl=https://ceur-ws.org/Vol-2957/sepp_paper3.pdf |volume=Vol-2957 |authors=Ricardo Rei,Fernando Batista,Nuno M. Guerreiro,Luisa Coheur |dblpUrl=https://dblp.org/rec/conf/swisstext/ReiBGC21 }} ==Multilingual Simultaneous Sentence End and Punctuation Prediction (short paper)== https://ceur-ws.org/Vol-2957/sepp_paper3.pdf
Multilingual Simultaneous Sentence End and Punctuation Prediction

Ricardo Rei
Unbabel / INESC-ID / Instituto Superior Técnico
ricardo.rei@unbabel.com

Fernando Batista
INESC-ID / ISCTE - Instituto Universitário de Lisboa
fernando.batista@inesc-id.pt

Nuno M. Guerreiro
Instituto de Telecomunicações / Instituto Superior Técnico
nuno.s.guerreiro@tecnico.pt

Luisa Coheur
INESC-ID / Instituto Superior Técnico
luisa.coheur@inesc-id.pt

Abstract

This paper describes the model, and its corresponding setup, proposed by the Unbabel & INESC-ID team for the 1st Shared Task on Sentence End and Punctuation Prediction in NLG Text (SEPP-NLG 2021). The shared task covers 4 languages (English, German, French and Italian) and includes two subtasks: subtask 1 – detecting the end of a sentence, and subtask 2 – predicting a range of punctuation marks. Our team proposes a single multilingual and multitask model that is able to produce suitable results for all the languages and subtasks involved. The results show that it is possible to achieve state-of-the-art results using one single multilingual model for both tasks and multiple languages. Using a single multilingual model to solve the task for multiple languages is of particular importance, since training a different model for each language is a cumbersome and time-consuming process. Finally, the code for the shared task is publicly available for reproducibility purposes at https://github.com/Unbabel/caption/tree/shared-task.

Copyright ©2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

The text produced by a speech recognition system or by an automatic machine translation system often includes misplaced punctuation and, in the case of a speech recognition system, the output often consists of raw single-case words, without punctuation marks, and may not even include sentence boundaries. Detecting the sentence boundaries and the missing punctuation in such automatically generated texts improves their quality, and is often relevant for a number of downstream tasks, such as parsing, information extraction, dialog act modeling, Named Entity Recognition (NER), and summarization (Zechner, 2002; Huang and Zweig, 2002; Kim and Woodland, 2003; Ostendorf et al., 2005; Jones et al., 2005; Makhoul et al., 2005; Shriberg, 2005; Matusov et al., 2006; Peitz et al., 2011; Cattoni et al., 2007; Ostendorf et al., 2008; Liao et al., 2020).

Most of the available studies focus on the full stop and the comma, which have higher corpus frequencies, and a number of more restricted studies also consider the question mark. However, several punctuation marks can be considered for automatically generated texts, including: comma; period or full stop; exclamation mark; question mark; colon; semicolon; and quotation marks. Nevertheless, most of these marks rarely occur and are quite difficult to insert or evaluate. Quotation marks and semicolons, for example, are often used inconsistently and in a highly variable way.

This paper proposes a multilingual model that is able to detect sentence boundaries and predict a wide range of punctuation marks, based on pre-trained contextual embeddings. Our architecture is composed of three main building blocks: a pre-trained Transformer-based encoder model, an attention mechanism over the encoder layers, and the task classification heads. The proposed model derives from the multilingual model proposed by Guerreiro et al. (2021), which achieves fairly competitive results in a multi-language scenario, even surpassing the existing results for some of the languages.

The remainder of the paper is organized as follows: Section 2 presents an overview of the related work. Section 3 overviews the data used for training and fine-tuning our model. Section 4 presents the building blocks of the model architecture and the setup parameters. Section 5 reports the experiments performed and Section 6 presents the corresponding results. Finally, Section 7 presents the most relevant conclusions and mentions possible future directions.

2 Related work

Proper identification of sentence boundaries and punctuation recovery are two profoundly connected tasks whose resolution can yield great improvements for downstream speech processing tasks (Harper et al., 2005; Mrozinsk et al., 2006; Ostendorf et al., 2008). For that reason, recovering structural information from text produced by Automatic Speech Recognition (ASR) has become an objective of many studies. Early studies used a combination of n-grams with prosodic classifiers through the general Hidden Markov Model framework (Beeferman et al., 1998; Christensen et al., 2001; Kim and Woodland, 2001). With the development of Conditional Random Fields (CRF) and Maximum Entropy models, researchers were able to further improve these tasks (Huang and Zweig, 2002; Liu et al., 2005, 2006; Batista et al., 2007, 2008, 2009; Lu and Ng, 2010; Batista et al., 2010, 2012; Ueffing et al., 2013).

Regarding machine translation, it is a well-known fact that punctuation and capitalization errors are a predominant problem for Statistical Machine Translation (SMT). Several studies tried to enrich the SMT output by inserting proper capitalization and punctuation in the returned translation (Cattoni et al., 2007; Peitz et al., 2011). Even with Neural Machine Translation (NMT), punctuation errors remain the most predominant error type: they represent around 20% of the errors produced by the high-performing systems from the WMT20 News Translation shared task (Freitag et al., 2021).

Most of the recent approaches for punctuation restoration are based on neural networks such as Recurrent Neural Networks (RNN) and Transformers. With that said, most works treat the problem either as a sequence-to-sequence or as a sequence labelling task (Tilk and Alumäe, 2015, 2016; Che et al., 2016; Klejch et al., 2017; Yi and Tao, 2019; Kim, 2019). Following the recent trends in Natural Language Processing (NLP), some of these works take advantage of pre-trained models such as BERT (Cai and Wang, 2019; Makhija et al., 2019; Guerreiro et al., 2021). Our shared task participation is mostly based on the work by Guerreiro et al. (2021), which showed that having one single multilingual model is competitive with having one model trained for each language.

3 Corpora

Figure 1: Frequency of each punctuation mark (log-scale counts of words, commas, full stops, dashes, colons and question marks for EN, DE, FR and IT).

Figure 2: Frequency of each punctuation mark (relative frequencies per language; comma: EN 52.9%, DE 60.1%, FR 54.6%, IT 53.9%; full stop: EN 42.0%, DE 35.2%, FR 40.3%, IT 40.8%).

The SEPP-NLG challenge adopted the Europarl corpus, covering English, German, French, and Italian. The corpus was previously processed in order to remove punctuation marks and case information, as a way to simulate Natural Language Generated text. The challenge considers 5 different punctuation marks: comma (,), full stop (.), dash (-), colon (:) and question mark (?). Figures 1 and 2 show the frequency of the words and punctuation marks for each one of the languages, considering the training and development sets. As expected, of all the punctuation marks being considered, the comma is the most frequent, occurring between 52.9% (EN) and 60.1% (DE) of the time, followed by the full stop, occurring between 42.0% (EN) and 35.2% (DE) of the time. All the other punctuation marks under consideration occur less than 0.24% of the time for all the considered languages. About 95% of the sentences contain between 3 and 50 words, but the maximum sentence length is 303 words for EN, 450 for DE and IT, and 423 for FR. 99% of the sentences contain 1 to 7 punctuation marks, including the corresponding sentence boundary. However,
some of the sentences, mostly consisting of lists of numbers, may contain up to about 200 commas.

Figure 3: Model architecture used to compete in the SEPP-NLG 2021 shared task. This model follows the architecture proposed by Guerreiro et al. (2021), but with a classification head that simultaneously predicts sentence ends (binary classification) and punctuation marks (multinomial classification).

4 System Description

As previously mentioned, our system architecture extends the architecture proposed by Guerreiro et al. (2021), which has shown promising results in multilingual punctuation prediction and capitalization (Rei et al., 2020). This architecture is composed of 3 modules: an Encoder Model, a Layer-wise Attention Mechanism, and a Classification Head. In our experiments for the shared task, we replaced the XLM-R base encoder with XLM-R large (Conneau et al., 2020) and also added a new binary classification head for subtask 1 (full-stop prediction).

With that said, when our system receives a document, that document is tokenized using the XLM-R tokenizer and divided into several input sequences x^i = [x^i_0, x^i_1, ..., x^i_511] of 512 sub-words. Then, for each input sequence, the encoder produces an embedding e^(ℓ)_{x^i_j} for each sub-word x^i_j and each layer ℓ ∈ {0, 1, ..., 24}. To encapsulate the information from all transformer layers into a single embedding e_{x^i_j}, the following layer-wise attention mechanism is used:

    e_{x^i_j} = γ E_{x^i_j} Λ    (1)

where γ is a trainable scaling factor, E_{x^i_j} = [e^(0)_{x^i_j}, e^(1)_{x^i_j}, ..., e^(24)_{x^i_j}] corresponds to the vector of layer embeddings for sub-word x^i_j, and Λ = softmax([λ^(0), λ^(1), ..., λ^(24)]) is a vector constituted by the layer-wise trainable scalar parameters, which are shared across all sub-words. Finally, we concatenate the embeddings of consecutive words¹ in the input sequence x^i and use those as features for our punctuation (ML – multi-label) and full-stop (B – binary) classification heads. Figure 3 illustrates the described architecture.

¹ When a word is divided into several sub-words, we use the embedding of the first sub-word to represent the entire word.

5 Experiments

We started our experiments with the exact same hyper-parameters used by Guerreiro et al. (2021). To achieve better performance, we also ran a hyper-parameter search using Optuna (Akiba et al., 2019).
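As an illustration of the layer-wise pooling in Equation (1), the sketch below implements it in plain Python with toy sizes; plain floats stand in for the trainable tensors, and this is not the shared-task implementation itself:

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def layerwise_pool(layer_embeddings, lambdas, gamma):
    """Equation (1): e = gamma * sum_l softmax(lambda)_l * e^(l).

    layer_embeddings: one vector per encoder layer for a single sub-word
    lambdas: one trainable scalar per layer (shared across sub-words)
    gamma: trainable scaling factor
    """
    weights = softmax(lambdas)
    dim = len(layer_embeddings[0])
    pooled = [0.0] * dim
    for w, emb in zip(weights, layer_embeddings):
        for i, v in enumerate(emb):
            pooled[i] += w * v
    return [gamma * v for v in pooled]

# Toy example: 2 "layers", 2-dimensional embeddings.
out = layerwise_pool([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0], 2.0)
```

In the real model the pooling runs over XLM-R large's 25 hidden-state layers (ℓ ∈ {0, ..., 24}), with γ and the λ's learned jointly with the classification heads.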
Figure 4: Best trial hyper-parameters highlighted in the Optuna search space.
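As a rough, stdlib-only illustration of the search space highlighted in Figure 4, the snippet below draws one random configuration from the same ranges; plain random sampling stands in for Optuna's actual sampler, and all dictionary keys are made-up names for this sketch:

```python
import math
import random

def sample_trial(rng):
    """Draw one configuration from the hyper-parameter search space.

    Illustrative only: Optuna's sampler is more sophisticated than
    uniform random sampling, and these key names are hypothetical.
    """
    def log_uniform(lo, hi):
        # Sample from a log-uniform distribution on [lo, hi].
        return math.exp(rng.uniform(math.log(lo), math.log(hi)))

    return {
        "accumulate_grad_batches": rng.randint(1, 32),  # simulates bigger batches
        "heads_dropout": rng.uniform(0.1, 0.5),         # uniform
        "layerwise_decay": rng.uniform(0.75, 1.0),      # uniform
        "encoder_lr": log_uniform(1e-5, 1e-4),          # log-uniform
        "heads_lr": log_uniform(1e-5, 3e-4),            # log-uniform
        "binary_loss_weight": rng.choice([1, 2]),       # full-stop loss weight
        "punct_loss_weight": rng.choice([1, 2, 3]),     # punctuation loss weight
    }
```

In the real setup each configuration would be suggested per trial by Optuna, with the trial's development Macro-F1 reported back to the study.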


In this section, we describe the training setup and the evaluation metrics used for these experiments.

5.1 Evaluation Setup

The official shared task metric for full-stop prediction is the F1 score of the positive class (sentence end). For the punctuation prediction subtask, the official metric is Macro-F1. Since our model performs both tasks at the same time, we also combine those two metrics by multiplying them. Following Guerreiro et al. (2021), we additionally measure the punctuation Slot Error Rate (SER) (Makhoul et al., 2005), a commonly used metric for the task at hand. Also, we discard the "O" (no punctuation) label when computing our Macro-F1 scores.

5.2 Training Setup

Our model uses a discriminative fine-tuning strategy with gradual unfreezing, splitting the model parameters into two groups: the XLM-R parameters and the classification heads on top. The encoder parameters are kept frozen during the first 0.1 epochs of training. This allows the parameters of the classification heads to adjust to the task objective before the pre-trained ones are changed. Then, the entire model is fine-tuned, except for the embedding layer, which is kept frozen. Keeping the embedding layer frozen allows us to save some GPU memory and fit the entire model into a single 12GB-memory GPU.

Evaluation is performed after each epoch using only 50% of the entire development data. Training is interrupted after 2 epochs without improvement on the punctuation task Macro-F1.

5.3 Hyper-parameter Search

We used Optuna (Akiba et al., 2019) to search for the optimal hyper-parameters for our model. Our search space was defined as follows:

   • Accumulate gradients for 1 to 32 batches (this simulates bigger batches while avoiding memory issues);

   • Classification heads dropout between 0.1 and 0.5, sampled from a uniform distribution;

   • Layer-wise learning rate decay between 0.75 and 1.0, sampled from a uniform distribution;

   • Encoder model learning rate between 1e-05 and 1e-04, sampled from a log-uniform distribution;

   • Classification heads learning rate between 1e-05 and 3e-04, sampled from a log-uniform distribution;

   • Full-stop prediction loss weight with two possible values: 1 and 2;

   • Punctuation prediction loss weight with three possible values: 1, 2 and 3.

To speed up the hyper-parameter search, we used only 50% of the available training data while keeping the 50% development data described above.

Table 2 reports the results of our baseline against the large models with default hyper-parameters and the best trial results from Optuna. As expected, we can observe that the biggest improvement comes from using XLM-R large in place of the base model. We can also observe that further hyper-parameter tuning helps, especially in terms of the SER.

Figure 4 shows that the best results were achieved by keeping the encoder learning rate low, with a high layer-wise decay (above 0.9). The learning rate for the classification heads is almost 10× higher than the encoder learning rate. Finally, the weight of the punctuation prediction loss is set to 2× the weight of the binary prediction loss. Table 1 describes the hyper-parameters used in our baseline along with those of our final submission.

Hyper-parameter            Baseline        Final submission
Encoder Model              XLM-R (base)    XLM-R (large)
Optimizer                  AdamW           AdamW
nº frozen epochs           0.1             0.1
Learning rate              5e-05           2.37e-04
Encoder Learning Rate      3e-05           2.57e-05
Layerwise Decay            1.0             0.925
Batch size                 12              8
Loss function              Cross-Entropy   Cross-Entropy
Binary Loss Weight         1               1
Punctuation Loss Weight    1               2
Dropout                    0.1             0.125
FP precision               32              16

Table 1: Hyper-parameters used in our final submission compared with the baseline hyper-parameters from Guerreiro et al. (2021).

6 Results

Table 2 shows that, as expected, using a larger encoder improves our results. Also, by using Optuna we were able to further improve our results, which suggests that the models presented by Guerreiro et al. (2021) are under-tuned and could be further improved with a better selection of hyper-parameters.

Looking into the results for individual punctuation marks, we can observe that our final submission has a high F1 for commas, full stops and question marks (96%, 94% and 89%, respectively). Yet, the model seems to struggle at predicting dashes and colons (63% and 39% F1, respectively). By looking at Figure 5, we can observe that, as expected, dashes and colons are frequently confused with commas and full stops, respectively. These marks can often be interchanged without loss of meaning. This is further evidence to support the rationale of some proposed approaches to this task (Tilk and Alumäe, 2015; Che et al., 2016; Guerreiro et al., 2021), in which dashes and colons tend to be aggregated with the comma and full-stop labels, respectively.

                          Predicted labels
True labels    Comma    Full-stop    Dash    Colon   Q. mark
Comma        1443471        36286    8252     3667      1142
Full-stop      39929      1164106     622     4255      2442
Dash           32925         4453   18632     1154        99
Colon           5106        16404     280    24149       101
Q. mark         1493         3234      44       50     34635

Figure 5: Confusion matrix for punctuation prediction.

7 Conclusions and future work

We have described a multilingual model that is able to simultaneously detect sentence boundaries and predict 5 different punctuation marks over 4 different languages (English, German, French and Italian). The model was adapted from Guerreiro et al. (2021) and used by the Unbabel & INESC-ID team in the 1st Shared Task on Sentence End and Punctuation Prediction in NLG Text (SEPP-NLG 2021), achieving one of the top results. The results confirm that it is possible to achieve state-of-the-art results using a single multilingual model for both tasks and multiple languages, supporting what was already observed in the experiments performed by Guerreiro et al. (2021). The code used to produce the results is publicly available at: https://github.com/Unbabel/caption/tree/shared-task.

In the future, we plan to extend this work to include other language families, such as Semitic and Slavic languages. Moreover, we would like to extend our setup so that it can simultaneously solve the capitalization task too. Having one single multilingual model that is capable of identifying sentence boundaries, punctuation marks and proper capitalization would constitute a major step towards recovering from ASR recognition errors and translation errors from MT systems.
Development Models                    SER↓    Binary F1↑    Macro F1↑    Macro×Binary↑
Baseline (Guerreiro et al., 2021)    0.265         0.926        0.399            0.369
XLM-R large (default)                0.243         0.944        0.411            0.388
XLM-R large Optuna                   0.214         0.944        0.444            0.419

Table 2: Results of our models on the shared task development data. Our baseline model is trained with the exact same setup as the multilingual models from Guerreiro et al. (2021). We then replaced XLM-R base by XLM-R large. Finally, to further improve our results, we used Optuna to search over the hyper-parameter space described in Section 5.3. Note that these experiments were performed using the shared task corpus V1.
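The metric combination from Section 5.1 (Binary F1 × Macro-F1, with the "O" label discarded from the macro average) can be sketched with a small stdlib-only helper; this is an illustrative reimplementation, not the official scoring script, and SER is omitted for brevity:

```python
def f1(tp, fp, fn):
    # Standard F1 from counts; defined as 0.0 when the denominator is 0.
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_f1(gold, pred, ignore=("O",)):
    """Macro-F1 over punctuation labels, discarding the 'O' class."""
    labels = sorted({*gold, *pred} - set(ignore))
    scores = []
    for lab in labels:
        tp = sum(g == p == lab for g, p in zip(gold, pred))
        fp = sum(p == lab and g != lab for g, p in zip(gold, pred))
        fn = sum(g == lab and p != lab for g, p in zip(gold, pred))
        scores.append(f1(tp, fp, fn))
    return sum(scores) / len(scores) if scores else 0.0

def combined(binary_f1, macro):
    # Combined selection metric: Binary F1 x Macro F1.
    return binary_f1 * macro
```

For instance, a binary F1 of 0.944 and a Macro-F1 of 0.444 yield a combined score of about 0.419, matching the Optuna row of Table 2.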


Acknowledgments

This work was supported by national funds through FCT, Fundação para a Ciência e a Tecnologia, under project UIDB/50021/2020, and by the P2020 Program through the projects "Unbabel Scribe" and "MAIA", supervised by ANI under contract numbers 038510 and 045909, respectively.

References

Takuya Akiba, Shotaro Sano, Toshihiko Yanase, Takeru Ohta, and Masanori Koyama. 2019. Optuna: A next-generation hyperparameter optimization framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD '19, pages 2623–2631, New York, NY, USA. Association for Computing Machinery.

F. Batista, D. Caseiro, N. Mamede, and I. Trancoso. 2008. Recovering capitalization and punctuation marks for automatic speech recognition: Case study for Portuguese broadcast news. Speech Communication, 50(10):847–862.

Fernando Batista, Diamantino Caseiro, Nuno J. Mamede, and Isabel Trancoso. 2007. Recovering punctuation marks for automatic speech recognition. In INTERSPEECH 2007, 8th Annual Conference of the International Speech Communication Association, Antwerp, Belgium, August 27-31, 2007, pages 2153–2156. ISCA.

Fernando Batista, Helena Moniz, Isabel Trancoso, and Nuno J. Mamede. 2012. Bilingual experiments on automatic recovery of capitalization and punctuation of automatic speech transcripts. IEEE Transactions on Audio, Speech and Language Processing, Special Issue on New Frontiers in Rich Transcription, 20(2):474–485.

Fernando Batista, Helena Moniz, Isabel Trancoso, Hugo Meinedo, Ana Isabel Mata, and Nuno J. Mamede. 2010. Extending the punctuation module for European Portuguese. In INTERSPEECH 2010, 11th Annual Conference of the International Speech Communication Association, Makuhari, Chiba, Japan, September 26-30, 2010, pages 1509–1512. ISCA.

Fernando Batista, Isabel Trancoso, and Nuno J. Mamede. 2009. Comparing automatic rich transcription for Portuguese, Spanish and English broadcast news. In Automatic Speech Recognition and Understanding, 2009. ASRU 2009. IEEE Workshop on, pages 540–545. IEEE.

Doug Beeferman, Adam Berger, and John Lafferty. 1998. Cyberpunc: A lightweight punctuation annotation system for speech. In ICASSP, pages 689–692.

Y. Cai and D. Wang. 2019. Question mark prediction by BERT. In 2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), pages 363–367.

Roldano Cattoni, Nicola Bertoldi, and Marcello Federico. 2007. Punctuating confusion networks for speech translation. In INTERSPEECH 2007, 8th Annual Conference of the International Speech Communication Association, Antwerp, Belgium, August 27-31, 2007, pages 2453–2456. ISCA.

Xiaoyin Che, Cheng Wang, Haojin Yang, and Christoph Meinel. 2016. Punctuation prediction for unsegmented transcript based on word vector. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 654–658, Portorož, Slovenia. European Language Resources Association (ELRA).

H. Christensen, Y. Gotoh, and S. Renals. 2001. Punctuation annotation using statistical prosody models. In Proc. of the ISCA Workshop on Prosody in Speech Recognition and Understanding, pages 35–40.

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8440–8451, Online. Association for Computational Linguistics.

Markus Freitag, George Foster, David Grangier, Viresh Ratnakar, Qijun Tan, and Wolfgang Macherey. 2021. Experts, errors, and context: A large-scale study of human evaluation for machine translation.

Nuno Miguel Guerreiro, Ricardo Rei, and Fernando Batista. 2021. Towards better subtitles: A multilingual approach for punctuation restoration of speech transcripts. Expert Systems With Applications (under review).

Mary Harper, Bonnie Dorr, John Hale, Brian Roark, Ishak Shafran, Matthew Lease, Yang Liu, Matthew Snover, Lisa Yung, Anna Krasnyanskaya, and Robin Stewart. 2005. Parsing and spoken structural event detection. In 2005 Johns Hopkins Summer Workshop Final Report.

Jing Huang and Geoffrey Zweig. 2002. Maximum entropy modeling for punctuation from speech. In Proceedings of ICSLP.

D. Jones, E. Gibson, W. Shen, N. Granoien, M. Herzog, D. Reynolds, and C. Weinstein. 2005. Measuring human readability of machine generated text: Three case studies in speech recognition and machine translation. In Proc. of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '05), volume 5, pages v/1009–v/1012.

J. Kim and P. C. Woodland. 2001. The use of prosody in a combined system for punctuation generation and speech recognition. In Proc. of Eurospeech, pages 2757–2760.

Ji-Hwan Kim and Philip C. Woodland. 2003. A combined punctuation generation and speech recognition system and its performance enhancement using prosody. Speech Communication, 41(4):563–577.

Wei Lu and Hwee Tou Ng. 2010. Better punctuation prediction with dynamic conditional random fields. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 177–186, Cambridge, MA. Association for Computational Linguistics.

K. Makhija, T. Ho, and E. Chng. 2019. Transfer learning for punctuation prediction. In 2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), pages 268–273.

J. Makhoul, A. Baron, I. Bulyko, L. Nguyen, L. Ramshaw, D. Stallard, R. Schwartz, and B. Xiang. 2005. The effects of speech recognition and punctuation on information extraction. In INTERSPEECH-05, pages 57–60.

Evgeny Matusov, Arne Mauser, and Hermann Ney. 2006. Automatic sentence segmentation and punctuation prediction for spoken language translation. In International Workshop on Spoken Language Translation, pages 158–165, Kyoto, Japan.

Joanna Mrozinsk, Edward W. D. Whittaker, Pierre Chatain, and Sadaoki Furui. 2006. Automatic sentence segmentation of speech for automatic summarization. In Proc. of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '06).

M. Ostendorf, E. Shriberg, and A. Stolcke. 2005. Human language technology: Opportunities and chal-
Seokhwan Kim. 2019. Deep recurrent neural networks         lenges. In Proc. of the IEEE International Confer-
  with layer-wise multi-head attentions for punctua-       ence on Acoustics, Speech, and Signal Processing
  tion restoration. ICASSP 2019 - 2019 IEEE Interna-       (ICASSP ’05), Philadelphia.
  tional Conference on Acoustics, Speech and Signal
  Processing (ICASSP), pages 7280–7284.                  Mari Ostendorf, Benoit Favre, Ralph Grishman, Dilek
                                                          Hakkani-Tür, Mary Harper, Dustin Hillard, Julia
Ondrej Klejch, Peter Bell, and Steve Renals. 2017.        Hirschberg, Heng Ji, Jeremy G. Kahn, Yang Liu,
  Sequence-to-sequence models for punctuated tran-        Sameer Maskey, Evgeny Matusov, Hermann Ney,
  scription combining lexical and acoustic features.      Andrew Rosenberg, Elizabeth Shriberg, Wen Wang,
  2017 IEEE International Conference on Acoustics,        and Chuck Wooters. 2008. Speech segmentation
  Speech and Signal Processing (ICASSP), pages            and spoken document processing. IEEE Signal Pro-
  5700–5704.                                              cessing Magazine, 25(3):59–69.
Junwei Liao, Sefik Emre Eskimez, Liyang Lu, Yu Shi,
  Ming Gong, Linjun Shou, Hong Qu, and Michael           Stephan Peitz, Markus Freitag, Arne Mauser, and Her-
  Zeng. 2020. Improving readability for automatic           mann Ney. 2011. Modeling punctuation prediction
  speech recognition transcription.                         as machine translation. In International Workshop
                                                            on Spoken Language Translation, pages 238–245,
Yang Liu, Elizabeth Shriberg, Andreas Stolcke, Dustin       San Francisco, CA, USA.
  Hillard, Mari Ostendorf, and Mary Harper. 2006.
  Enriching speech recognition with automatic detec-     Ricardo Rei, Nuno Miguel Guerreiro, and Fernando
  tion of sentence boundaries and disfluencies. IEEE        Batista. 2020. Automatic truecasing of video sub-
  Transaction on Audio, Speech and Language Pro-           titles using bert: A multilingual adaptable approach.
  cessing, 14(5):1526–1540.                                In Information Processing and Management of Un-
                                                           certainty in Knowledge-Based Systems, pages 708–
Yang Liu, Elizabeth Shriberg, Andreas Stolcke, Bar-        721, Cham. Springer International Publishing.
  bara Peskin, Jeremy Ang, Dustin Hillard, Mari Os-
  tendorf, Marcus Tomalin, Phil Woodland, and Mary       Elisabeth Shriberg. 2005. Spontaneous speech: How
  Harper. 2005. Structural metadata research in the         people really talk, and why engineers should care.
  EARS program. In Proc. of the IEEE International          In Proc. of Eurospeech - 9th European Conference
  Conference on Acoustics, Speech, and Signal Pro-          on Speech Communication and Technology (Inter-
  cessing (ICASSP ’05), Philadelphia, USA.                  speech 2005), pages 1781 – 1784, Lisbon, Portugal.
Ottokar Tilk and Tanel Alumäe. 2015. LSTM for punc-
  tuation restoration in speech transcripts. In INTER-
  SPEECH.
Ottokar Tilk and Tanel Alumäe. 2016. Bidirectional re-
  current neural network with attention mechanism for
  punctuation restoration. In INTERSPEECH, pages
  3047–3051.
Nicola Ueffing, Maximilian Bisani, and Paul Vozila.
  2013. Improved models for automatic punctuation
  prediction for spoken and written text. In INTER-
  SPEECH.
Jiangyan Yi and Jianhua Tao. 2019. Self-attention based model for punctuation prediction using word and speech embeddings. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7270–7274.
Klaus Zechner. 2002. Automatic summarization of
  open-domain multiparty dialogues in diverse genres.
  Computational Linguistics, 28(4):447–485.