Deep Contextual Punctuator for NLG Text

Vandan Mujadia, Pruthwik Mishra, Dipti Misra Sharma
Language Technologies Research Center, Kohli Center on Intelligent Systems, IIIT Hyderabad
{vandan.mu, pruthwik.mishra}@research.iiit.ac.in, dipti@iiit.ac.in

Abstract

This paper describes our team oneNLP's (LTRC, IIIT-Hyderabad) participation in the SEPP-NLG 2021 shared task (https://sites.google.com/view/sentence-segmentation/), Sentence End and Punctuation Prediction in NLG Text. We applied sequence-to-tag prediction over contextual embeddings, as a fine-tuning task, to both subtasks. We also explored the use of multilingual BERT and multi-task learning for these tasks on English, German, French and Italian.

Copyright ©2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

Generally, the output of automatic speech recognition (ASR) systems omits punctuation marks. Similarly, the output of OCR systems (Nguyen et al., 2019) needs automatic validation of its punctuation. Apart from omitting punctuation markers, some automatically generated texts, e.g. from PDF-to-text extraction, may erroneously displace sentences for several reasons. Here, detecting the end of a sentence and placing an appropriate punctuation mark significantly improves the quality of such outputs by preserving the original meaning. Missing or inappropriate punctuation thus degrades the readability of the presented text, leads to poor user experiences in real-world scenarios (Che et al., 2016; Ueffing et al., 2013), and produces erroneous input to subsequent automatic systems such as machine translation, summarization, question answering and NLU. It is therefore necessary to restore or correct punctuation marks in these automatic outputs.

Traditionally, automatic punctuation-marking approaches (Lu and Ng, 2010) can be divided into three broad categories (Vandeghinste et al., 2018) based on the features used: prosody-based features (Kim and Woodland, 2001; Christensen et al., 2001), lexical features (Augustyniak et al., 2020; Peitz et al., 2014), or hybrid methods combining the two. Recent lexical punctuation-prediction methods build upon deep neural networks, where the problem is modeled as a sequence-to-tag prediction task (Li and Lin, 2020) or a sequence-to-sequence prediction task (Vandeghinste et al., 2018).

The simplest form of punctuation prediction is the discovery of sentence boundaries; here the problem is binary classification (with period or empty as the labels). An incremental and somewhat harder problem is the prediction of each individual punctuation mark; here the class labels for subtask 2 are ":", "-", ",", "?", "." and "0" (0 indicating no punctuation). SEPP-NLG 2021 presents both of these tasks as a challenge for English, German, French and Italian.

With recent advances in deep learning, pre-trained language models such as ELMo (Peters et al., 2018), ULMFiT (Howard and Ruder, 2018), the OpenAI Transformer (Lee and Hsiang, 2020) and BERT (Devlin et al., 2018) have produced a massive jump in state-of-the-art performance on many NLP tasks, e.g. text classification (Büyüköz et al., 2020), natural language inference, question answering, and dialogue systems (Budzianowski and Vulić, 2019). All of these approaches pre-train an unsupervised language model on a large corpus, such as all Wikipedia articles or news articles, and then fine-tune the pre-trained model on different downstream tasks. For our experiments on the two punctuation-prediction tasks, we fine-tune multilingual BERT and ALBERT (for English), alongside baseline experiments with a CRF.

2 Dataset

As part of SEPP-NLG 2021, the organizers released a Europarl corpus of spoken texts, lowercased and with all punctuation removed, with transcripts available in multiple languages. Table 1 describes the training and development corpora for all languages in terms of sentences and tokens. Tables 2, 3, 4 and 5 describe the training corpora in terms of punctuation classes and their respective distributions. The counts are reported after "!" and ";" are mapped to ".".

Lang     #Train Sents  #Train Toks  Avg Train Sent Len  #Dev Sents  #Dev Toks
English  1406577       33779095     24.015              321333      7743489
German   1308508       28645112     21.891              291443      6358683
French   1236504       32690367     26.438              332330      8781593
Italian  1132554       28167993     24.871              290089      7194189

Table 1: Training and development data details

Class  #Count    #Percentage
:      43133     0.128
-      80916     0.240
,      1759686   5.209
?      44290     0.131
.      1396166   4.133
0      30454904  90.159
Total  33779095

Table 2: English: class details for training data

Class  #Count    #Percentage
:      51192     0.179
-      81710     0.285
,      2208970   7.712
?      40511     0.141
.      1290282   4.504
0      24972447  87.179
Total  28645112

Table 3: German: class details for training data

Class  #Count    #Percentage
:      46128     0.141
-      68523     0.210
,      1657880   5.071
?      41005     0.125
.      1223802   3.744
0      29653029  90.709
Total  32690367

Table 4: French: class details for training data

Class  #Count    #Percentage
:      55080     0.196
-      52983     0.188
,      1503502   5.338
?      38807     0.138
.      1138669   4.042
0      25378952  90.099
Total  28167993

Table 5: Italian: class details for training data

3 Approach

We primarily used two broad categories of approaches, modeling the problem as a sequence-labeling task in both. On the machine-learning side, we trained a CRF model to identify the different kinds of labels. Transformer-based BERT fine-tuning was used as the other technique.

3.1 CRF

We split the English training data into sequences of 25 tokens each; this maximum sequence length was chosen based on the average sentence length of the English training data. We used only words as features, taking a continuous window of 5 words over the full corpus as the features for the CRF.

3.2 Fine-tuning Contextual Embeddings

Multi-task learning (MTL) is a technique which aims to improve generalization, strengthen representations and enable adaptation in machine learning for related tasks (Worsham and Kalita, 2020). In our case, we enabled multi-task learning for the ALBERT- and mBERT-based contextual experiments as presented in Figure 1, where the contextual embeddings are shared across the subtasks. We applied a sequence-to-tag classifier on the output contextualized token embeddings of ALBERT/mBERT for tag prediction. We used ALBERT (https://tfhub.dev/google/albert_base/3) for English and mBERT for the other languages.

Figure 1: Multi-task learning

The following configuration was used for BERT fine-tuning:

• Input: subword tokens (same as the BERT/ALBERT vocabulary)
• Embedding size: 512
• Transformer config: 6 layers, hidden size 2048, 8 attention heads
• Dropout: 0.30; optimizer: Adam
• Maximum word sequence length: 300
• Fine-tuning steps: 90K with batch size 40

Due to algorithmic limitations, we were not able to apply MTL to the CRF, as it does not facilitate joint learning of multiple classification tasks in one go.

4 Results and Discussion

As the results of the CRF with word-level features for English were poor (shown in Table 7), we did not conduct CRF experiments on the other languages. We observe that the results of BERT with multi-task learning are superior to those of the CRF.

Dataset        Lang  Pr    Re    F1
Test           EN    0.92  0.92  0.92
               DE    0.93  0.95  0.94
               FR    0.9   0.89  0.9
               IT    0.88  0.89  0.89
               AVG   0.91  0.91  0.91
Surprise Test  EN    0.81  0.67  0.73
               DE    0.85  0.72  0.78
               FR    0.77  0.62  0.69
               IT    0.78  0.58  0.67
               AVG   0.8   0.65  0.72
Dev            EN    0.92  0.92  0.92
               DE    0.94  0.95  0.94
               FR    0.9   0.89  0.9
               IT    0.88  0.89  0.89
               AVG   0.91  0.91  0.91

Table 6: Subtask 1 results using BERT MTL

Subtask#  Pr    Re    F1-score
1         0.73  0.52  0.61
2         0.71  0.32  0.35

Table 7: CRF results for subtasks 1 and 2 on English dev data
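To make the sequence-labeling formulation concrete, the sketch below converts a punctuated token stream into the per-word tag sequences for the two subtasks, applying the "!"/";" → "." mapping described in Section 2. This is our own minimal illustration, not the organizers' preprocessing script; in particular, treating "?" as sentence-final for subtask 1 is an assumption we make here for clarity.

```python
def make_tags(tokens):
    """Turn punctuated tokens into (words, subtask-1 tags, subtask-2 tags).

    Each punctuation token is attached as the label of the preceding word;
    words themselves get the label "0" (no punctuation follows).
    """
    PUNCT = {":", "-", ",", "?", ".", "!", ";"}
    SENT_END = {".", "?"}  # assumption: these close a sentence (subtask 1)
    words, sub1, sub2 = [], [], []
    for tok in tokens:
        if tok in PUNCT:
            mapped = "." if tok in {"!", ";"} else tok  # map ! and ; to .
            if words:
                sub2[-1] = mapped          # subtask 2: full label set
                if mapped in SENT_END:
                    sub1[-1] = "."         # subtask 1: period vs empty ("0")
        else:
            words.append(tok.lower())      # the corpus is lowercased
            sub1.append("0")
            sub2.append("0")
    return words, sub1, sub2
```

For example, `make_tags("hello , how are you ? fine !".split())` yields the words `["hello", "how", "are", "you", "fine"]` with subtask-2 tags `[",", "0", "0", "?", "."]`.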
This is due to the better sentence and sequence representations learnt by the transformers; simple surface-level word features fail to capture sentence-end or punctuation markers in the CRF.

Dataset        Lang  Pr    Re    F1
Test           EN    0.79  0.69  0.72
               DE    0.8   0.74  0.77
               FR    0.79  0.65  0.68
               IT    0.78  0.62  0.66
               AVG   0.79  0.68  0.71
Surprise Test  EN    0.62  0.52  0.56
               DE    0.61  0.57  0.58
               FR    0.61  0.48  0.51
               IT    0.54  0.43  0.46
               AVG   0.6   0.5   0.53
Dev            EN    0.8   0.69  0.72
               DE    0.81  0.74  0.77
               FR    0.79  0.65  0.69
               IT    0.78  0.62  0.66
               AVG   0.8   0.68  0.71

Table 8: Subtask 2 results using BERT MTL

5 Conclusion and Future work

We have successfully applied contextual embeddings to the task of punctuation prediction and achieved comparable results on both subtasks. We believe that fine-tuning BERT on more data would benefit the overall punctuation task, and that language-specific contextual embeddings would improve performance in the other languages. We will incorporate both of these points in our future work.

mBERT model: https://tfhub.dev/tensorflow/bert_multi_cased_L-12_H-768_A-12/4

References

Łukasz Augustyniak, Piotr Szymanski, Mikołaj Morzy, Piotr Zelasko, Adrian Szymczak, Jan Mizgajski, Yishay Carmiel, and Najim Dehak. 2020. Punctuation prediction in spontaneous conversations: Can we mitigate ASR errors with retrofitted word embeddings? arXiv preprint arXiv:2004.05985.

Paweł Budzianowski and Ivan Vulić. 2019. Hello, it's GPT-2 — how can I help you? Towards the use of pretrained language models for task-oriented dialogue systems. arXiv preprint arXiv:1907.05774.

Berfu Büyüköz, Ali Hürriyetoğlu, and Arzucan Özgür. 2020. Analyzing ELMo and DistilBERT on socio-political news classification. In Proceedings of the Workshop on Automated Extraction of Socio-political Events from News 2020, pages 9–18.

Xiaoyin Che, Cheng Wang, Haojin Yang, and Christoph Meinel. 2016. Punctuation prediction for unsegmented transcript based on word vector. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 654–658.

Heidi Christensen, Yoshihiko Gotoh, and Steve Renals. 2001. Punctuation annotation using statistical prosody models.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. arXiv preprint arXiv:1801.06146.

Ji-Hwan Kim and Philip C. Woodland. 2001. The use of prosody in a combined system for punctuation generation and speech recognition. In Seventh European Conference on Speech Communication and Technology.

Jieh-Sheng Lee and Jieh Hsiang. 2020. Patent claim generation by fine-tuning OpenAI GPT-2. World Patent Information, 62:101983.

Xinxing Li and Edward Lin. 2020. A 43 language multilingual punctuation prediction neural network model. Proc. Interspeech 2020, pages 1067–1071.

Wei Lu and Hwee Tou Ng. 2010. Better punctuation prediction with dynamic conditional random fields. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 177–186.

Thi-Tuyet-Hai Nguyen, Adam Jatowt, Mickael Coustaty, Nhu-Van Nguyen, and Antoine Doucet. 2019. Deep statistical analysis of OCR errors for effective post-OCR processing. In 2019 ACM/IEEE Joint Conference on Digital Libraries (JCDL), pages 29–38.

Stephan Peitz, Markus Freitag, and Hermann Ney. 2014. Better punctuation prediction with hierarchical phrase-based translation. In Proc. of the Int. Workshop on Spoken Language Translation (IWSLT), South Lake Tahoe, CA, USA.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proc. of NAACL.

Nicola Ueffing, Maximilian Bisani, and Paul Vozila. 2013. Improved models for automatic punctuation prediction for spoken and written text. In Interspeech, pages 3097–3101.

Vincent Vandeghinste, Lyan Verwimp, Joris Pelemans, and Patrick Wambacq. 2018. A comparison of different punctuation prediction approaches in a translation context. In Proceedings of the 21st Annual Conference of the European Association for Machine Translation, pages 269–278. European Association for Machine Translation.

Joseph Worsham and Jugal Kalita. 2020. Multi-task learning for natural language processing in the 2020s: Where are we going? Pattern Recognition Letters, 136:120–126.
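Appendix: the CRF setup of Section 3.1 (25-token sequences, a continuous window of 5 words as features) can be sketched as below. This is a minimal illustration under our own assumptions: the feature-dict format follows the common CRF-toolkit convention, and the 5-word window is read as the current word plus two words on each side; the paper does not specify the exact template.

```python
def chunk(tokens, size=25):
    """Split the token stream into fixed-length sequences of `size` tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

def window_features(seq, i, window=5):
    """Feature dict for position i: the word and its neighbours in a
    symmetric window (window=5 -> two words to each side), truncated at
    sequence boundaries."""
    feats = {"word": seq[i]}
    half = window // 2
    for off in range(-half, half + 1):
        j = i + off
        if off != 0 and 0 <= j < len(seq):
            feats[f"word[{off:+d}]"] = seq[j]
    return feats
```

One such feature dict per token, paired with the tag sequences from the training data, is the usual input format for CRF toolkits such as sklearn-crfsuite.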