<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Deep Contextual Punctuator for NLG Text</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Vandan</forename><surname>Mujadia</surname></persName>
							<email>vandan.mu@research.iiit.ac.in</email>
							<affiliation key="aff0">
								<orgName type="department">Language Technologies Research Center, Kohli Center on Intelligent Systems</orgName>
								<orgName type="institution">IIIT Hyderabad</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Pruthwik</forename><surname>Mishra</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Language Technologies Research Center, Kohli Center on Intelligent Systems</orgName>
								<orgName type="institution">IIIT Hyderabad</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Dipti</forename><forename type="middle">Misra</forename><surname>Sharma</surname></persName>
							<email>dipti@iiit.ac.in</email>
							<affiliation key="aff0">
								<orgName type="department">Language Technologies Research Center, Kohli Center on Intelligent Systems</orgName>
								<orgName type="institution">IIIT Hyderabad</orgName>
							</affiliation>
						</author>
						<title level="a" type="main">Deep Contextual Punctuator for NLG Text</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">327884D273E113FA3733216B5258A22F</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T19:36+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>This paper describes our team oneNLP's (LTRC, IIIT-Hyderabad) participation in SEPP-NLG 2021, the Sentence End and Punctuation Prediction in NLG Text shared task. We applied sequence-to-tag prediction over contextual embeddings, as a fine-tuning task, for both subtasks. We also explored the use of multilingual BERT and multi-task learning for these tasks on English, German, French and Italian.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Dataset</head><p>As a part of SEPP-NLG 2021, the organizers released a Europarl corpus of spoken texts, prepared by lower-casing and removing all punctuation from the transcripts, available in multiple languages.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Generally, the output of automatic speech recognition (ASR) systems omits punctuation marks. Similarly, the output of OCR systems <ref type="bibr" target="#b9">(Nguyen et al., 2019)</ref> needs automatic validation of punctuation. Apart from the omission of punctuation markers, some automatically generated texts, e.g. from PDF-to-text extraction, may erroneously displace sentences for several reasons. Here, detecting the end of a sentence and placing an appropriate punctuation mark significantly improves the quality of such outputs by preserving the original meaning. Missing or inappropriate punctuation thus degrades the readability of the presented text and leads to poor user experiences in real-world scenarios <ref type="bibr" target="#b2">(Che et al., 2016;</ref><ref type="bibr" target="#b12">Ueffing et al., 2013)</ref>, as well as to erroneous input for subsequent automatic systems such as machine translation, summarization, question answering, NLU, etc. Therefore, it is necessary to restore or correct punctuation marks in these automatic outputs.</p><p>Traditionally, automatic punctuation marking approaches <ref type="bibr" target="#b8">(Lu and Ng, 2010)</ref> can be divided into three broad categories <ref type="bibr" target="#b13">(Vandeghinste et al., 2018)</ref> based on the features used: prosody-based features <ref type="bibr" target="#b6">(Kim and Woodland, 2001;</ref><ref type="bibr" target="#b3">Christensen et al., 2001)</ref>, lexical features <ref type="bibr" target="#b0">(Augustyniak et al., 2020;</ref><ref type="bibr" target="#b10">Peitz et al., 2014)</ref>, or hybrid combinations of the two. 
Recent lexical punctuation prediction methods build upon deep neural networks, where the problem is modeled as a sequence-to-tag prediction task <ref type="bibr" target="#b7">(Li and Lin, 2020)</ref> or a sequence-to-sequence prediction task <ref type="bibr" target="#b13">(Vandeghinste et al., 2018)</ref>.</p><p>The simplest form of punctuation prediction is the detection of sentence boundaries, which is a binary classification problem (where the classes are period or empty). An incremental and somewhat harder problem is the prediction of each individual punctuation mark; here the class labels for subtask 2 are ": -, ? . 0" (0 indicating no punctuation). SEPP-NLG 2021 presents both of these tasks as a challenge for the English, German, French and Italian languages.</p><p>With recent advances in deep learning, pre-trained language models such as ELMo <ref type="bibr" target="#b11">(Peters et al., 2018)</ref>, ULMFiT <ref type="bibr" target="#b5">(Howard and Ruder, 2018)</ref>, the OpenAI Transformer <ref type="bibr">(Lee and Hsiang, 2020)</ref> and BERT <ref type="bibr" target="#b4">(Devlin et al., 2018)</ref> have resulted in a massive jump in state-of-the-art performance on many NLP tasks, e.g. text classification <ref type="bibr" target="#b1">(Büyüköz et al., 2020)</ref>, natural language inference, question answering, dialogue systems (Budzianowski and Vulić, 2019), etc. All these approaches pre-train an unsupervised language model on a large corpus, such as Wikipedia or news articles, and then fine-tune the pre-trained model on different downstream tasks.</p><p>Here, for our experiments on the two punctuation prediction tasks, we use multilingual BERT, and ALBERT for English, in a fine-tuning setup, along with baseline experiments with a CRF. 
Table <ref type="table" target="#tab_0">1</ref> gives the corpus details for the training and development data for all languages in terms of sentences and tokens. Table <ref type="table" target="#tab_1">2</ref>, Table <ref type="table" target="#tab_2">3</ref>, Table <ref type="table" target="#tab_3">4</ref> and Table <ref type="table" target="#tab_4">5</ref> give the training corpus details in terms of punctuation classes and their respective distributions. The numbers are reported after ''!'' and '';'' have been mapped to ''.''.</p></div>
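The label scheme described above can be sketched as follows. This is a minimal illustration under our assumptions (the function names are ours, not the organizers' preprocessing script): each word receives the punctuation mark that follows it as its label, with ''!'' and '';'' merged into ''.'', and ''0'' meaning no punctuation.

```python
# Sketch: derive subtask labels from a punctuated token stream.
PUNCT = {":", "-", ",", "?", ".", "!", ";"}
MERGE = {"!": ".", ";": "."}  # "!" and ";" are mapped to "."

def derive_labels(tokens):
    """Return (words, subtask2_labels) for a punctuated token list."""
    words, labels = [], []
    for i, tok in enumerate(tokens):
        if tok in PUNCT:
            continue  # punctuation marks become labels, not words
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        labels.append(MERGE.get(nxt, nxt) if nxt in PUNCT else "0")
        words.append(tok.lower())  # transcripts are lower-cased
    return words, labels

def subtask1_labels(labels):
    # Binary sentence-end task: period vs. empty ("0").
    # Assumption: "?" also marks a sentence end.
    return ["." if l in {".", "?"} else "0" for l in labels]
```

For example, `derive_labels(["hello", ",", "world", "."])` yields the words `["hello", "world"]` with subtask-2 labels `[",", "."]`.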
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Approach</head><p>We model the problem as a sequence labeling task and use two broad categories of approaches. On the machine learning side, we trained a CRF model to identify the different kinds of labels. As the other technique, we used Transformer-based BERT fine-tuning.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">CRF</head><p>We split the English training data into sequences of 25 tokens each. The maximum sequence length of 25 was chosen based on the average sentence length of the English training data. We used only words as features, taking a sliding window of 5 words over the full corpus as the feature context for the CRF.</p></div>
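The sequence splitting and windowed word features above can be sketched as follows. This is an illustrative simplification under our assumptions (exact feature templates and the CRF training library are omitted; the feature dicts are the kind of input a toolkit such as CRFsuite expects):

```python
# Sketch: fixed-length sequences plus a symmetric 5-word window per token
# (2 words left + current word + 2 words right).
SEQ_LEN = 25   # chosen from the average sentence length in English
WINDOW = 2     # 2 neighbors per side -> 5-word window

def split_sequences(tokens, seq_len=SEQ_LEN):
    """Chunk the token stream into sequences of at most seq_len tokens."""
    return [tokens[i:i + seq_len] for i in range(0, len(tokens), seq_len)]

def window_features(seq, idx, window=WINDOW):
    """Word-only features for the token at position idx."""
    feats = {"w0": seq[idx]}
    for off in range(1, window + 1):
        feats[f"w-{off}"] = seq[idx - off] if idx - off >= 0 else "<PAD>"
        feats[f"w+{off}"] = seq[idx + off] if idx + off < len(seq) else "<PAD>"
    return feats

def featurize(seq):
    return [window_features(seq, i) for i in range(len(seq))]
```

Each sequence then yields one feature dict per token, which, paired with the labels, forms a CRF training instance.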
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Fine-tuning Contextual Embedding</head><p>Multi-task learning (MTL) is a technique which aims to improve generalization, strengthen representations and enable adaptation across related tasks in machine learning <ref type="bibr" target="#b14">(Worsham and Kalita, 2020)</ref>. In our case, we enabled multi-task learning for our ALBERT- and mBERT-based contextual experiments, as presented in Figure <ref type="figure" target="#fig_0">1</ref>, where the contextual embeddings are shared across the sub-tasks. We applied a sequence-to-tag classifier on the output contextualized token embeddings of ALBERT/mBERT for tag prediction. Here, we used ALBERT<ref type="foot" target="#foot_1">2</ref> for English and multilingual BERT<ref type="foot" target="#foot_2">3</ref> for the other languages.</p></div>
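The hard parameter sharing of Figure 1 can be sketched numerically as below. This is an illustrative NumPy toy with random weights, not our TensorFlow Hub setup: one shared contextual representation feeds two task-specific classification heads, and only the heads differ between subtasks.

```python
import numpy as np

# Toy hard-parameter-sharing setup: shared encoder output, two linear heads
# (subtask 1: 2 labels {".", "0"}; subtask 2: 6 labels {":", "-", ",", "?", ".", "0"}).
rng = np.random.default_rng(0)
HIDDEN = 8      # stand-in for the 768-dim BERT hidden size
N_TOKENS = 5

shared = rng.normal(size=(N_TOKENS, HIDDEN))  # contextual token embeddings
W1 = rng.normal(size=(HIDDEN, 2))             # head for subtask 1
W2 = rng.normal(size=(HIDDEN, 6))             # head for subtask 2

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Both heads read the SAME shared representation; only W1/W2 are task-specific,
# so gradients from both subtasks would flow into the shared encoder.
probs1 = softmax(shared @ W1)   # per-token distribution for subtask 1
probs2 = softmax(shared @ W2)   # per-token distribution for subtask 2
pred1 = probs1.argmax(axis=-1)
pred2 = probs2.argmax(axis=-1)
```

In the real setup, the shared block is the pre-trained ALBERT/mBERT encoder and both heads are fine-tuned jointly.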
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Results and Discussion</head><p>As the results of the CRF with word-level features for English were poor (shown in Table 7), we did not conduct CRF experiments on the other languages.</p><p>We observe that the results of BERT with multi-task learning are superior to those of the CRF. This is due to the better sentence and sequence representations learnt by the transformers; simple surface-level word features fail to capture the sentence-end or punctuation markers in the CRF.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Conclusion and Future work</head><p>We have successfully applied contextual embeddings to the task of punctuation prediction and achieved comparable results on both subtasks. We believe that fine-tuning BERT on more data would benefit the overall punctuation task, and that language-specific contextual embeddings would improve performance in the other languages. We will incorporate both of these points in our future work.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Multi Task Learning</figDesc><graphic coords="3,120.56,62.79,356.42,154.21" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1 :</head><label>1</label><figDesc>Training and Development Data Detail</figDesc><table><row><cell>Lang</cell><cell>#Train Sents</cell><cell>#Train Toks</cell><cell>Avg Train Sent Len</cell><cell>#Dev Sents</cell><cell>#Dev Toks</cell></row><row><cell>English</cell><cell>1406577</cell><cell>33779095</cell><cell>24.015</cell><cell>321333</cell><cell>7743489</cell></row><row><cell>German</cell><cell>1308508</cell><cell>28645112</cell><cell>21.891</cell><cell>291443</cell><cell>6358683</cell></row><row><cell>French</cell><cell>1236504</cell><cell>32690367</cell><cell>26.438</cell><cell>332330</cell><cell>8781593</cell></row><row><cell>Italian</cell><cell>1132554</cell><cell>28167993</cell><cell>24.871</cell><cell>290089</cell><cell>7194189</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2 :</head><label>2</label><figDesc>English : Class Details for Training Data</figDesc><table><row><cell>Class</cell><cell>#Count</cell><cell>#Percentage</cell></row><row><cell>:</cell><cell>43133</cell><cell>0.128</cell></row><row><cell>-</cell><cell>80916</cell><cell>0.240</cell></row><row><cell>,</cell><cell>1759686</cell><cell>5.209</cell></row><row><cell>?</cell><cell>44290</cell><cell>0.131</cell></row><row><cell>.</cell><cell>1396166</cell><cell>4.133</cell></row><row><cell>0</cell><cell>30454904</cell><cell>90.159</cell></row><row><cell>Total</cell><cell>33779095</cell><cell></cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 3 :</head><label>3</label><figDesc>German : Class Details for Training Data</figDesc><table><row><cell>Class</cell><cell>#Count</cell><cell>#Percentage</cell></row><row><cell>:</cell><cell>51192</cell><cell>0.179</cell></row><row><cell>-</cell><cell>81710</cell><cell>0.285</cell></row><row><cell>,</cell><cell>2208970</cell><cell>7.712</cell></row><row><cell>?</cell><cell>40511</cell><cell>0.141</cell></row><row><cell>.</cell><cell>1290282</cell><cell>4.504</cell></row><row><cell>0</cell><cell>24972447</cell><cell>87.179</cell></row><row><cell>Total</cell><cell>28645112</cell><cell></cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 4 :</head><label>4</label><figDesc>French : Class Details for Training Data</figDesc><table><row><cell>Class</cell><cell>#Count</cell><cell>#Percentage</cell></row><row><cell>:</cell><cell>46128</cell><cell>0.141</cell></row><row><cell>-</cell><cell>68523</cell><cell>0.210</cell></row><row><cell>,</cell><cell>1657880</cell><cell>5.071</cell></row><row><cell>?</cell><cell>41005</cell><cell>0.125</cell></row><row><cell>.</cell><cell>1223802</cell><cell>3.744</cell></row><row><cell>0</cell><cell>29653029</cell><cell>90.709</cell></row><row><cell>Total</cell><cell>32690367</cell><cell></cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_4"><head>Table 5 :</head><label>5</label><figDesc>Italian : Class Details for Training Data</figDesc><table><row><cell>Class</cell><cell>#Count</cell><cell>#Percentage</cell></row><row><cell>:</cell><cell>55080</cell><cell>0.196</cell></row><row><cell>-</cell><cell>52983</cell><cell>0.188</cell></row><row><cell>,</cell><cell>1503502</cell><cell>5.338</cell></row><row><cell>?</cell><cell>38807</cell><cell>0.138</cell></row><row><cell>.</cell><cell>1138669</cell><cell>4.042</cell></row><row><cell>0</cell><cell>25378952</cell><cell>90.099</cell></row><row><cell>Total</cell><cell>28167993</cell><cell></cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_5"><head>Table 6 :</head><label>6</label><figDesc>Subtask1 Results using BERT MTL</figDesc><table><row><cell>Dataset</cell><cell>Lang</cell><cell>Pr</cell><cell>Re</cell><cell>F1</cell></row><row><cell>Test</cell><cell>EN</cell><cell>0.92</cell><cell>0.92</cell><cell>0.92</cell></row><row><cell></cell><cell>DE</cell><cell>0.93</cell><cell>0.95</cell><cell>0.94</cell></row><row><cell></cell><cell>FR</cell><cell>0.9</cell><cell>0.89</cell><cell>0.9</cell></row><row><cell></cell><cell>IT</cell><cell>0.88</cell><cell>0.89</cell><cell>0.89</cell></row><row><cell></cell><cell>AVG</cell><cell>0.91</cell><cell>0.91</cell><cell>0.91</cell></row><row><cell>Surprise Test</cell><cell>EN</cell><cell>0.81</cell><cell>0.67</cell><cell>0.73</cell></row><row><cell></cell><cell>DE</cell><cell>0.85</cell><cell>0.72</cell><cell>0.78</cell></row><row><cell></cell><cell>FR</cell><cell>0.77</cell><cell>0.62</cell><cell>0.69</cell></row><row><cell></cell><cell>IT</cell><cell>0.78</cell><cell>0.58</cell><cell>0.67</cell></row><row><cell></cell><cell>AVG</cell><cell>0.8</cell><cell>0.65</cell><cell>0.72</cell></row><row><cell>Dev</cell><cell>EN</cell><cell>0.92</cell><cell>0.92</cell><cell>0.92</cell></row><row><cell></cell><cell>DE</cell><cell>0.94</cell><cell>0.95</cell><cell>0.94</cell></row><row><cell></cell><cell>FR</cell><cell>0.9</cell><cell>0.89</cell><cell>0.9</cell></row><row><cell></cell><cell>IT</cell><cell>0.88</cell><cell>0.89</cell><cell>0.89</cell></row><row><cell></cell><cell>AVG</cell><cell>0.91</cell><cell>0.91</cell><cell>0.91</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_6"><head>Table 7 :</head><label>7</label><figDesc>CRF Results of Subtask 1 and 2 for English Dev data</figDesc><table><row><cell>Subtask#</cell><cell>Pr</cell><cell>Re</cell><cell>F1-score</cell></row><row><cell>1</cell><cell>0.73</cell><cell>0.52</cell><cell>0.61</cell></row><row><cell>2</cell><cell>0.71</cell><cell>0.32</cell><cell>0.35</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">https://sites.google.com/view/sentence-segmentation/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">https://tfhub.dev/google/albert_base/3</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2">https://tfhub.dev/tensorflow/bert_multi_cased_L-12_H-768_A-12/4</note>
		</body>
		<back>
			<div type="annex">
<div xmlns="http://www.tei-c.org/ns/1.0" />			</div>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<author>
			<persName><forename type="first">Łukasz</forename><surname>Augustyniak</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Piotr</forename><surname>Szymanski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Mikołaj</forename><surname>Morzy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Piotr</forename><surname>Zelasko</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Adrian</forename><surname>Szymczak</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jan</forename><surname>Mizgajski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yishay</forename><surname>Carmiel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Najim</forename><surname>Dehak</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1907.05774</idno>
		<title level="m">Hello, it&apos;s gpt-2-how can i help you? towards the use of pretrained language models for task-oriented dialogue systems</title>
				<imprint>
			<date type="published" when="2019">2020. 2019</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
	<note>Punctuation prediction in spontaneous conversations: Can Paweł Budzianowski and Ivan Vulić</note>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Analyzing elmo and distilbert on sociopolitical news classification</title>
		<author>
			<persName><forename type="first">Berfu</forename><surname>Büyüköz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ali</forename><surname>Hürriyetoglu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Arzucan</forename><surname>Özgür</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Workshop on Automated Extraction of Sociopolitical Events from News 2020</title>
				<meeting>the Workshop on Automated Extraction of Sociopolitical Events from News 2020</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="9" to="18" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Punctuation prediction for unsegmented transcript based on word vector</title>
		<author>
			<persName><forename type="first">Xiaoyin</forename><surname>Che</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Cheng</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Haojin</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Christoph</forename><surname>Meinel</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC&apos;16)</title>
				<meeting>the Tenth International Conference on Language Resources and Evaluation (LREC&apos;16)</meeting>
		<imprint>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="654" to="658" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<title level="m" type="main">Punctuation annotation using statistical prosody models</title>
		<author>
			<persName><forename type="first">Heidi</forename><surname>Christensen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yoshihiko</forename><surname>Gotoh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Steve</forename><surname>Renals</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2001">2001</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<title level="m" type="main">Bert: Pre-training of deep bidirectional transformers for language understanding</title>
		<author>
			<persName><forename type="first">Jacob</forename><surname>Devlin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ming-Wei</forename><surname>Chang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Kenton</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Kristina</forename><surname>Toutanova</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1810.04805</idno>
		<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<author>
			<persName><forename type="first">Jeremy</forename><surname>Howard</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Sebastian</forename><surname>Ruder</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1801.06146</idno>
		<title level="m">Universal language model fine-tuning for text classification</title>
				<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">The use of prosody in a combined system for punctuation generation and speech recognition</title>
		<author>
			<persName><forename type="first">Ji-Hwan</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Philip</forename><forename type="middle">C</forename><surname>Woodland</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Seventh European conference on speech communication and technology</title>
				<imprint>
			<publisher>Jieh-Sheng Lee and Jieh Hsiang</publisher>
			<date type="published" when="2001">2001. 2020</date>
			<biblScope unit="volume">62</biblScope>
			<biblScope unit="page">101983</biblScope>
		</imprint>
	</monogr>
	<note>Patent claim generation by fine-tuning openai gpt-2</note>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">A 43 language multilingual punctuation prediction neural network model</title>
		<author>
			<persName><forename type="first">Xinxing</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Edward</forename><surname>Lin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. Interspeech 2020</title>
				<meeting>Interspeech 2020</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="1067" to="1071" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Better punctuation prediction with dynamic conditional random fields</title>
		<author>
			<persName><forename type="first">Wei</forename><surname>Lu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Hwee Tou</forename><surname>Ng</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2010 conference on empirical methods in natural language processing</title>
				<meeting>the 2010 conference on empirical methods in natural language processing</meeting>
		<imprint>
			<date type="published" when="2010">2010</date>
			<biblScope unit="page" from="177" to="186" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Deep statistical analysis of ocr errors for effective post-ocr processing</title>
		<author>
			<persName><forename type="first">Thi-Tuyet-Hai</forename><surname>Nguyen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Adam</forename><surname>Jatowt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Mickael</forename><surname>Coustaty</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Antoine</forename><surname>Nhu-Van Nguyen</surname></persName>
		</author>
		<author>
			<persName><surname>Doucet</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ACM/IEEE Joint Conference on Digital Libraries (JCDL)</title>
				<imprint>
			<date type="published" when="2019">2019. 2019</date>
			<biblScope unit="page" from="29" to="38" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Better punctuation prediction with hierarchical phrase-based translation</title>
		<author>
			<persName><forename type="first">Stephan</forename><surname>Peitz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Markus</forename><surname>Freitag</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Hermann</forename><surname>Ney</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of the Int. Workshop on Spoken Language Translation (IWSLT)</title>
				<meeting>of the Int. Workshop on Spoken Language Translation (IWSLT)<address><addrLine>South Lake Tahoe, CA, USA</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Deep contextualized word representations</title>
		<author>
			<persName><forename type="first">Matthew</forename><forename type="middle">E</forename><surname>Peters</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Mark</forename><surname>Neumann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Mohit</forename><surname>Iyyer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Matt</forename><surname>Gardner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Christopher</forename><surname>Clark</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Kenton</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Luke</forename><surname>Zettlemoyer</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of NAACL</title>
				<meeting>of NAACL</meeting>
		<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Improved models for automatic punctuation prediction for spoken and written text</title>
		<author>
			<persName><forename type="first">Nicola</forename><surname>Ueffing</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Maximilian</forename><surname>Bisani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Paul</forename><surname>Vozila</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Interspeech</title>
				<imprint>
			<date type="published" when="2013">2013</date>
			<biblScope unit="page" from="3097" to="3101" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">A comparison of different punctuation prediction approaches in a translation context</title>
		<author>
			<persName><forename type="first">Vincent</forename><surname>Vandeghinste</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Lyan</forename><surname>Verwimp</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Joris</forename><surname>Pelemans</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Patrick</forename><surname>Wambacq</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 21st Annual Conference of the European Association for Machine Translation: 28-30</title>
				<meeting>the 21st Annual Conference of the European Association for Machine Translation: 28-30<address><addrLine>Universitat d&apos;Alacant, Alacant, Spain</addrLine></address></meeting>
		<imprint>
			<publisher>European Association for Machine Translation</publisher>
			<date type="published" when="2018-05">2018. May 2018</date>
			<biblScope unit="page" from="269" to="278" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Multitask learning for natural language processing in the 2020s: Where are we going?</title>
		<author>
			<persName><forename type="first">Joseph</forename><surname>Worsham</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jugal</forename><surname>Kalita</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Pattern Recognition Letters</title>
		<imprint>
			<biblScope unit="volume">136</biblScope>
			<biblScope unit="page" from="120" to="126" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
