Deep Contextual Punctuator for NLG Text

Vandan Mujadia, Pruthwik Mishra, Dipti Misra Sharma
Language Technologies Research Center, Kohli Center on Intelligent Systems, IIIT Hyderabad
{vandan.mu, pruthwik.mishra}@research.iiit.ac.in, dipti@iiit.ac.in

Abstract

This paper describes our team oneNLP's (LTRC, IIIT-Hyderabad) participation in the SEPP-NLG 2021 shared task (https://sites.google.com/view/sentence-segmentation/), Sentence End and Punctuation Prediction in NLG Text. We applied sequence-to-tag prediction over contextual embeddings, as a fine-tuning task, to both subtasks. We also explored the use of multilingual BERT and multi-task learning for these tasks on English, German, French and Italian.

Copyright ©2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

Generally, the output of automatic speech recognition (ASR) systems omits punctuation marks. Similarly, the output of OCR systems (Nguyen et al., 2019) needs automatic validation of its punctuation. Apart from omitting punctuation markers, some automatically generated texts, e.g. from PDF-to-text extraction, may erroneously displace sentences for several reasons. Here, detecting the end of a sentence and placing an appropriate punctuation mark significantly improves the quality of such outputs by preserving the original meaning. Missing or inappropriate punctuation thus degrades the readability of the presented text, leads to poor user experiences in real-world scenarios (Che et al., 2016; Ueffing et al., 2013), and produces erroneous input to subsequent automatic systems such as machine translation, summarization, question answering and NLU. It is therefore necessary to restore or correct punctuation marks in these automatic outputs.

Traditionally, automatic punctuation-marking approaches (Lu and Ng, 2010) can be divided into three broad categories (Vandeghinste et al., 2018) based on the features used: prosody-based features (Kim and Woodland, 2001; Christensen et al., 2001), lexical features (Augustyniak et al., 2020; Peitz et al., 2014), or hybrid methods combining the two. Recent lexical punctuation-prediction methods build upon deep neural networks, where the problem is modeled as a sequence-to-tag prediction task (Li and Lin, 2020) or a sequence-to-sequence prediction task (Vandeghinste et al., 2018).

The simplest form of punctuation prediction is the discovery of sentence boundaries; here the problem is binary classification (with period or empty as the labels). An incremental and somewhat harder problem is the prediction of each individual punctuation mark; here the class labels for subtask 2 are ":", "-", ",", "?", "." and "0" (0 indicating no punctuation). SEPP-NLG 2021 presents both of these tasks as a challenge for English, German, French and Italian.

With recent advances in deep learning, pre-trained language models such as ELMo (Peters et al., 2018), ULMFiT (Howard and Ruder, 2018), the OpenAI Transformer (Lee and Hsiang, 2020) and BERT (Devlin et al., 2018) have produced a massive jump in state-of-the-art performance on many NLP tasks, e.g. text classification (Büyüköz et al., 2020), natural language inference, question answering, and dialogue systems (Budzianowski and Vulić, 2019). All of these approaches pre-train an unsupervised language model on a large corpus, such as all Wikipedia articles or news articles, and then fine-tune the pre-trained model on different downstream tasks. For our experiments on the two punctuation-prediction tasks, we fine-tune multilingual BERT and ALBERT (for English), alongside baseline experiments with a CRF.

2 Dataset

As part of SEPP-NLG 2021, the organizers released a Europarl corpus of spoken texts, lowercased and with all punctuation removed, with transcripts available in multiple languages. Table 1 describes the training and development corpora for all languages in terms of sentences and tokens. Tables 2, 3, 4 and 5 describe the training corpora in terms of punctuation classes and their respective distributions. The counts are reported after "!" and ";" are mapped to ".".

Lang     #Train Sents  #Train Toks  Avg Train Sent Len  #Dev Sents  #Dev Toks
English  1406577       33779095     24.015              321333      7743489
German   1308508       28645112     21.891              291443      6358683
French   1236504       32690367     26.438              332330      8781593
Italian  1132554       28167993     24.871              290089      7194189

Table 1: Training and development data details

Class  #Count    #Percentage
:      43133     0.128
-      80916     0.240
,      1759686   5.209
?      44290     0.131
.      1396166   4.133
0      30454904  90.159
Total  33779095

Table 2: English: class details for training data

Class  #Count    #Percentage
:      51192     0.179
-      81710     0.285
,      2208970   7.712
?      40511     0.141
.      1290282   4.504
0      24972447  87.179
Total  28645112

Table 3: German: class details for training data

Class  #Count    #Percentage
:      46128     0.141
-      68523     0.210
,      1657880   5.071
?      41005     0.125
.      1223802   3.744
0      29653029  90.709
Total  32690367

Table 4: French: class details for training data

Class  #Count    #Percentage
:      55080     0.196
-      52983     0.188
,      1503502   5.338
?      38807     0.138
.      1138669   4.042
0      25378952  90.099
Total  28167993

Table 5: Italian: class details for training data

3 Approach

We primarily used two broad categories of approaches, modeling the problem as a sequence-labeling task in both. On the machine-learning side, we trained a CRF model to identify the different kinds of labels. Transformer-based BERT fine-tuning was used as the other technique.

3.1 CRF

We split the English training data into sequences of 25 tokens each; this maximum sequence length was chosen based on the average sentence length of the English training data. We used only words as features, taking a continuous window of 5 words over the full corpus as the features for the CRF.

3.2 Fine-tuning Contextual Embeddings

Multi-task learning (MTL) is a technique which aims to improve generalization, strengthen representations and enable adaptation in machine learning for related tasks (Worsham and Kalita, 2020). In our case, we enabled multi-task learning for the ALBERT- and mBERT-based contextual experiments as presented in Figure 1, where the contextual embeddings are shared across the subtasks. We applied a sequence-to-tag classifier on the output contextualized token embeddings of ALBERT/mBERT for tag prediction. We used ALBERT (https://tfhub.dev/google/albert_base/3) for English and mBERT for the other languages.

Figure 1: Multi-task learning

The following configuration was used for BERT fine-tuning:

• Input: subword tokens (same as the BERT/ALBERT vocabulary)
• Embedding size: 512
• Transformer config: 6 layers, hidden size 2048, 8 attention heads
• Dropout: 0.30; optimizer: Adam
• Maximum word sequence length: 300
• Fine-tuning steps: 90K with batch size 40

Due to algorithmic limitations, we were not able to apply MTL to the CRF, as it does not facilitate joint learning of multiple classification tasks in one go.

4 Results and Discussion

As the results of the CRF with word-level features for English were poor (shown in Table 7), we did not conduct CRF experiments on the other languages. We observe that the results of BERT with multi-task learning are superior to those of the CRF.

Dataset        Lang  Pr    Re    F1
Test           EN    0.92  0.92  0.92
               DE    0.93  0.95  0.94
               FR    0.9   0.89  0.9
               IT    0.88  0.89  0.89
               AVG   0.91  0.91  0.91
Surprise Test  EN    0.81  0.67  0.73
               DE    0.85  0.72  0.78
               FR    0.77  0.62  0.69
               IT    0.78  0.58  0.67
               AVG   0.8   0.65  0.72
Dev            EN    0.92  0.92  0.92
               DE    0.94  0.95  0.94
               FR    0.9   0.89  0.9
               IT    0.88  0.89  0.89
               AVG   0.91  0.91  0.91

Table 6: Subtask 1 results using BERT MTL

Subtask#  Pr    Re    F1-score
1         0.73  0.52  0.61
2         0.71  0.32  0.35

Table 7: CRF results for subtasks 1 and 2 on English dev data
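To make the sequence-labeling formulation concrete, the sketch below converts a punctuated token stream into the per-word tag sequences for the two subtasks, applying the "!"/";" → "." mapping described in Section 2. This is our own minimal illustration, not the organizers' preprocessing script; in particular, treating "?" as sentence-final for subtask 1 is an assumption we make here for clarity.

```python
def make_tags(tokens):
    """Turn punctuated tokens into (words, subtask-1 tags, subtask-2 tags).

    Each punctuation token is attached as the label of the preceding word;
    words themselves get the label "0" (no punctuation follows).
    """
    PUNCT = {":", "-", ",", "?", ".", "!", ";"}
    SENT_END = {".", "?"}  # assumption: these close a sentence (subtask 1)
    words, sub1, sub2 = [], [], []
    for tok in tokens:
        if tok in PUNCT:
            mapped = "." if tok in {"!", ";"} else tok  # map ! and ; to .
            if words:
                sub2[-1] = mapped          # subtask 2: full label set
                if mapped in SENT_END:
                    sub1[-1] = "."         # subtask 1: period vs empty ("0")
        else:
            words.append(tok.lower())      # the corpus is lowercased
            sub1.append("0")
            sub2.append("0")
    return words, sub1, sub2
```

For example, `make_tags("hello , how are you ? fine !".split())` yields the words `["hello", "how", "are", "you", "fine"]` with subtask-2 tags `[",", "0", "0", "?", "."]`.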
This is due to the better sentence and sequence representations learnt by the transformers; simple surface-level word features fail to capture sentence-end or punctuation markers in the CRF.

Dataset        Lang  Pr    Re    F1
Test           EN    0.79  0.69  0.72
               DE    0.8   0.74  0.77
               FR    0.79  0.65  0.68
               IT    0.78  0.62  0.66
               AVG   0.79  0.68  0.71
Surprise Test  EN    0.62  0.52  0.56
               DE    0.61  0.57  0.58
               FR    0.61  0.48  0.51
               IT    0.54  0.43  0.46
               AVG   0.6   0.5   0.53
Dev            EN    0.8   0.69  0.72
               DE    0.81  0.74  0.77
               FR    0.79  0.65  0.69
               IT    0.78  0.62  0.66
               AVG   0.8   0.68  0.71

Table 8: Subtask 2 results using BERT MTL

5 Conclusion and Future work

We have successfully applied contextual embeddings to the task of punctuation prediction and achieved comparable results on both subtasks. We believe that fine-tuning BERT on more data would benefit the overall punctuation task, and that language-specific contextual embeddings would improve performance in the other languages. We will incorporate both of these points in our future work.

mBERT model: https://tfhub.dev/tensorflow/bert_multi_cased_L-12_H-768_A-12/4

References

Łukasz Augustyniak, Piotr Szymanski, Mikołaj Morzy, Piotr Zelasko, Adrian Szymczak, Jan Mizgajski, Yishay Carmiel, and Najim Dehak. 2020. Punctuation prediction in spontaneous conversations: Can we mitigate ASR errors with retrofitted word embeddings? arXiv preprint arXiv:2004.05985.

Paweł Budzianowski and Ivan Vulić. 2019. Hello, it's GPT-2 — how can I help you? Towards the use of pretrained language models for task-oriented dialogue systems. arXiv preprint arXiv:1907.05774.

Berfu Büyüköz, Ali Hürriyetoğlu, and Arzucan Özgür. 2020. Analyzing ELMo and DistilBERT on socio-political news classification. In Proceedings of the Workshop on Automated Extraction of Socio-political Events from News 2020, pages 9–18.

Xiaoyin Che, Cheng Wang, Haojin Yang, and Christoph Meinel. 2016. Punctuation prediction for unsegmented transcript based on word vector. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 654–658.

Heidi Christensen, Yoshihiko Gotoh, and Steve Renals. 2001. Punctuation annotation using statistical prosody models.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. arXiv preprint arXiv:1801.06146.

Ji-Hwan Kim and Philip C. Woodland. 2001. The use of prosody in a combined system for punctuation generation and speech recognition. In Seventh European Conference on Speech Communication and Technology.

Jieh-Sheng Lee and Jieh Hsiang. 2020. Patent claim generation by fine-tuning OpenAI GPT-2. World Patent Information, 62:101983.

Xinxing Li and Edward Lin. 2020. A 43 language multilingual punctuation prediction neural network model. Proc. Interspeech 2020, pages 1067–1071.

Wei Lu and Hwee Tou Ng. 2010. Better punctuation prediction with dynamic conditional random fields. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 177–186.

Thi-Tuyet-Hai Nguyen, Adam Jatowt, Mickael Coustaty, Nhu-Van Nguyen, and Antoine Doucet. 2019. Deep statistical analysis of OCR errors for effective post-OCR processing. In 2019 ACM/IEEE Joint Conference on Digital Libraries (JCDL), pages 29–38.

Stephan Peitz, Markus Freitag, and Hermann Ney. 2014. Better punctuation prediction with hierarchical phrase-based translation. In Proc. of the Int. Workshop on Spoken Language Translation (IWSLT), South Lake Tahoe, CA, USA.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proc. of NAACL.

Nicola Ueffing, Maximilian Bisani, and Paul Vozila. 2013. Improved models for automatic punctuation prediction for spoken and written text. In Interspeech, pages 3097–3101.

Vincent Vandeghinste, Lyan Verwimp, Joris Pelemans, and Patrick Wambacq. 2018. A comparison of different punctuation prediction approaches in a translation context. In Proceedings of the 21st Annual Conference of the European Association for Machine Translation, pages 269–278. European Association for Machine Translation.

Joseph Worsham and Jugal Kalita. 2020. Multi-task learning for natural language processing in the 2020s: Where are we going? Pattern Recognition Letters, 136:120–126.
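Appendix: the CRF setup of Section 3.1 (25-token sequences, a continuous window of 5 words as features) can be sketched as below. This is a minimal illustration under our own assumptions: the feature-dict format follows the common CRF-toolkit convention, and the 5-word window is read as the current word plus two words on each side; the paper does not specify the exact template.

```python
def chunk(tokens, size=25):
    """Split the token stream into fixed-length sequences of `size` tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

def window_features(seq, i, window=5):
    """Feature dict for position i: the word and its neighbours in a
    symmetric window (window=5 -> two words to each side), truncated at
    sequence boundaries."""
    feats = {"word": seq[i]}
    half = window // 2
    for off in range(-half, half + 1):
        j = i + off
        if off != 0 and 0 <= j < len(seq):
            feats[f"word[{off:+d}]"] = seq[j]
    return feats
```

One such feature dict per token, paired with the tag sequences from the training data, is the usual input format for CRF toolkits such as sklearn-crfsuite.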