=Paper=
{{Paper
|id=Vol-2957/sepp_paper7
|storemode=property
|title=Deep Contextual Punctuator for NLG Text (short paper)
|pdfUrl=https://ceur-ws.org/Vol-2957/sepp_paper7.pdf
|volume=Vol-2957
|authors=Vandan Mujadia,Pruthwik Mishra,Dipti Misra Sharma
|dblpUrl=https://dblp.org/rec/conf/swisstext/MujadiaMS21
}}
==Deep Contextual Punctuator for NLG Text (short paper)==
Deep Contextual Punctuator for NLG Text
Vandan Mujadia Pruthwik Mishra Dipti Misra Sharma
Language Technologies Research Center
Kohli Center On Intelligent Systems, IIIT Hyderabad
{vandan.mu, pruthwik.mishra}@research.iiit.ac.in, dipti@iiit.ac.in
Abstract

This paper describes our team oneNLP's (LTRC, IIIT Hyderabad) participation in the SEPP-NLG 2021 shared tasks¹, Sentence End and Punctuation Prediction in NLG Text. We applied sequence-to-tag prediction over contextual embeddings, fine-tuned for both of these tasks. We also explored the use of multilingual BERT and multi-task learning for these tasks on English, German, French and Italian.

¹ https://sites.google.com/view/sentence-segmentation/

Copyright ©2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

Generally, the output of automatic speech recognition (ASR) systems omits punctuation marks. Similarly, the output of OCR systems (Nguyen et al., 2019) needs automatic validation of its punctuation. Apart from omitting punctuation markers, some automatically generated texts, e.g. from PDF-to-text extraction, may erroneously displace sentences for several reasons. In such cases, detecting the end of a sentence and placing an appropriate punctuation mark significantly improves the quality of the output by preserving the original meaning. Missing or inappropriate punctuation degrades the readability of the presented text and leads to poor user experiences in real-world scenarios (Che et al., 2016; Ueffing et al., 2013), as well as to erroneous input for subsequent automatic systems such as machine translation, summarization, question answering and NLU. It is therefore necessary to restore or correct punctuation marks in these automatic outputs.

Traditionally, automatic punctuation marking approaches (Lu and Ng, 2010) are divided into three broad categories (Vandeghinste et al., 2018) based on the features used: prosody-based features (Kim and Woodland, 2001; Christensen et al., 2001), lexical features (Augustyniak et al., 2020; Peitz et al., 2014), or hybrid methods combining the two. Recent lexical punctuation prediction methods build upon deep neural networks, where the problem is modeled as a sequence-to-tag prediction task (Li and Lin, 2020) or a sequence-to-sequence prediction task (Vandeghinste et al., 2018).

The simplest form of punctuation prediction is the discovery of sentence boundaries; here the problem is binary classification, with period and empty as the labels. An incremental and somewhat harder problem is the prediction of each individual punctuation mark; here the class labels for subtask 2 are ":", "-", ",", "?", "." and "0" (0 indicating no punctuation). SEPP-NLG 2021 presents both of these tasks as a challenge for English, German, French and Italian.

With recent advances in deep learning, pre-trained language models such as ELMo (Peters et al., 2018), ULMFiT (Howard and Ruder, 2018), the OpenAI Transformer (Lee and Hsiang, 2020) and BERT (Devlin et al., 2018) have produced a massive jump in state-of-the-art performance on many NLP tasks, e.g. text classification (Büyüköz et al., 2020), natural language inference, question answering and dialogue systems (Budzianowski and Vulić, 2019). All of these approaches pre-train an unsupervised language model on a large corpus, such as all Wikipedia or news articles, and then fine-tune the pre-trained model on different downstream tasks.

For our experiments on the two punctuation prediction tasks, we use multilingual BERT (mBERT) and ALBERT (for English) in a fine-tuning setup, along with baseline experiments with a CRF.

2 Dataset

As part of SEPP-NLG 2021, the organizers released a Europarl corpus of spoken texts in multiple languages, obtained by lowercasing the transcripts and removing all punctuation.
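To make the tag schemes concrete: for subtask 2, per-token labels can be derived from a punctuated sentence by attaching each punctuation mark to the preceding token. The following is our own toy sketch, not code or data from the shared task:

```python
PUNCT = {":", "-", ",", "?", "."}

def make_labels(tokens_with_punct):
    """Pair each word with the punctuation mark that follows it,
    or with '0' when no punctuation follows (the subtask-2 scheme)."""
    pairs = []
    for i, tok in enumerate(tokens_with_punct):
        if tok in PUNCT:
            continue
        nxt = tokens_with_punct[i + 1] if i + 1 < len(tokens_with_punct) else None
        pairs.append((tok.lower(), nxt if nxt in PUNCT else "0"))
    return pairs

pairs = make_labels(["Hello", ",", "how", "are", "you", "?"])
# → [('hello', ','), ('how', '0'), ('are', '0'), ('you', '?')]
```

For subtask 1, the labels collapse to a binary decision on whether the token ends a sentence.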
Lang #Train Sents #Train Toks Avg Train Sent Len #Dev Sents #Dev Toks
English 1406577 33779095 24.015 321333 7743489
German 1308508 28645112 21.891 291443 6358683
French 1236504 32690367 26.438 332330 8781593
Italian 1132554 28167993 24.871 290089 7194189
Table 1: Training and Development Data Detail
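The average sentence length column in Table 1 is simply the token count divided by the sentence count; for English:

```python
# English training counts from Table 1.
en_tokens, en_sents = 33779095, 1406577
avg_len = en_tokens / en_sents   # ≈ 24.015, as reported in Table 1
```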
Class  #Count    #Percentage
:      43133     0.128
-      80916     0.240
,      1759686   5.209
?      44290     0.131
.      1396166   4.133
0      30454904  90.159
Total  33779095
Table 2: English: Class Details for Training Data

Class  #Count    #Percentage
:      51192     0.179
-      81710     0.285
,      2208970   7.712
?      40511     0.141
.      1290282   4.504
0      24972447  87.179
Total  28645112
Table 3: German: Class Details for Training Data

Class  #Count    #Percentage
:      46128     0.141
-      68523     0.210
,      1657880   5.071
?      41005     0.125
.      1223802   3.744
0      29653029  90.709
Total  32690367
Table 4: French: Class Details for Training Data

Class  #Count    #Percentage
:      55080     0.196
-      52983     0.188
,      1503502   5.338
?      38807     0.138
.      1138669   4.042
0      25378952  90.099
Total  28167993
Table 5: Italian: Class Details for Training Data
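The percentages in Tables 2-5 are class counts divided by the total token count; for the English training data (Table 2):

```python
# English training class counts from Table 2.
total = 33779095
counts = {":": 43133, "-": 80916, ",": 1759686,
          "?": 44290, ".": 1396166, "0": 30454904}

# The six classes partition the corpus, and the rounded ratios
# reproduce Table 2, e.g. pct[","] == 5.209 and pct["0"] == 90.159.
pct = {c: round(100 * n / total, 3) for c, n in counts.items()}
```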
Table 1 describes the corpus details of the training and development sets for all languages in terms of sentences and tokens. Tables 2, 3, 4 and 5 describe the training corpora in terms of punctuation classes and their respective distributions. The numbers are reported after mapping "!" and ";" to ".".

3 Approach

We primarily used two broad categories of approaches, modeling the problem as a sequence labeling task in both. As a machine learning baseline, we trained a CRF model to identify the different kinds of labels. Fine-tuning a Transformer-based BERT model is the other technique.

3.1 CRF

We split the English training data into sequences of 25 tokens each. The decision to set the maximum sequence length to 25 was based on the average sentence length of the English training data. We only used words as features, utilizing a continuous window of 5 words over the full corpus as the features for the CRF.

3.2 Fine-tuning Contextual Embeddings

Multi-task learning (MTL) is a technique which aims to improve generalization, strengthen representations and enable adaptation in machine learning for related tasks (Worsham and Kalita, 2020). In our case, we enabled multi-task learning for our ALBERT- and mBERT-based contextual experiments, as presented in Figure 1, where the contextual embeddings are shared across the sub-tasks. We applied a sequence-to-tag classifier on the output contextualized token embeddings of ALBERT/mBERT for the tag prediction. We used ALBERT² for English and mBERT³ for the other languages.

² https://tfhub.dev/google/albert_base/3
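The windowed word features of Section 3.1 can be reconstructed roughly as follows. This is our own sketch: the exact feature template and CRF toolkit are not specified here, and the feature-dict format shown is the one consumed by common CRF toolkits such as sklearn-crfsuite:

```python
def window_features(tokens, i, size=5):
    """Word-identity features in a centered window of `size` tokens
    around position i, padding at the sentence edges."""
    feats = {"bias": 1.0}
    half = size // 2
    for off in range(-half, half + 1):
        j = i + off
        feats[f"w[{off}]"] = tokens[j] if 0 <= j < len(tokens) else "<PAD>"
    return feats

sent = ["we", "thank", "the", "rapporteur", "for", "his", "work"]
X = [window_features(sent, i) for i in range(len(sent))]
# X[3], centered on "rapporteur", sees "thank" ... "his".
```

Each token's feature dict would then be paired with its punctuation label and passed to the CRF trainer.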
Figure 1: Multi-Task Learning

The following configuration is used for BERT fine-tuning:

• Input: subword tokens (same as the BERT/ALBERT vocabulary)
• Embedding size: 512
• Transformer configuration: 6 layers, hidden size 2048, 8 attention heads
• Dropout: 0.30; optimizer: Adam
• Maximum word sequence length: 300
• Fine-tuning steps: 90K, with batch size 40

Dataset        Lang  Pr    Re    F1
Test           EN    0.92  0.92  0.92
               DE    0.93  0.95  0.94
               FR    0.90  0.89  0.90
               IT    0.88  0.89  0.89
               AVG   0.91  0.91  0.91
Surprise Test  EN    0.81  0.67  0.73
               DE    0.85  0.72  0.78
               FR    0.77  0.62  0.69
               IT    0.78  0.58  0.67
               AVG   0.80  0.65  0.72
Dev            EN    0.92  0.92  0.92
               DE    0.94  0.95  0.94
               FR    0.90  0.89  0.90
               IT    0.88  0.89  0.89
               AVG   0.91  0.91  0.91
Table 6: Subtask 1 results using BERT MTL

Due to algorithmic limitations, we were not able to apply MTL to the CRF, as it does not facilitate the joint learning of multiple classification tasks in one go.
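The multi-task setup of Figure 1, a shared encoder feeding one sequence-to-tag head per subtask under a summed loss, can be illustrated at the shape level. This is a minimal sketch with random stand-ins; the dimensions are illustrative and it is not the actual training code:

```python
import numpy as np

rng = np.random.default_rng(0)

seq_len, hidden = 25, 512        # illustrative sizes only
n_task1, n_task2 = 2, 6          # sentence-end vs. the six subtask-2 labels

# Stand-in for the shared ALBERT/mBERT contextual token embeddings.
H = rng.standard_normal((seq_len, hidden))

# One linear sequence-to-tag head per subtask; only the heads differ,
# the encoder output H is shared.
W1 = 0.02 * rng.standard_normal((hidden, n_task1))
W2 = 0.02 * rng.standard_normal((hidden, n_task2))

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

probs_task1 = softmax(H @ W1)    # (25, 2): per-token sentence-end decision
probs_task2 = softmax(H @ W2)    # (25, 6): per-token punctuation class

# Joint objective: the per-task cross-entropies are summed, so gradients
# from both subtasks would flow into the shared encoder.
y1 = rng.integers(0, n_task1, seq_len)
y2 = rng.integers(0, n_task2, seq_len)
idx = np.arange(seq_len)
loss = (-np.log(probs_task1[idx, y1]).mean()
        - np.log(probs_task2[idx, y2]).mean())
```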
4 Results and Discussion

As the results of the CRF with word-level features for English were poor (shown in Table 7), we did not conduct CRF experiments on the other languages. We observe that the results of BERT with multi-task learning are superior to those of the CRF. This is due to the better sentence and sequence representations learnt by the transformers; the simple surface-level word features of the CRF fail to capture sentence-end or punctuation markers.

Subtask  Pr    Re    F1
1        0.73  0.52  0.61
2        0.71  0.32  0.35
Table 7: CRF results of subtasks 1 and 2 on the English dev data

5 Conclusion and Future Work

We have successfully applied contextual embeddings to the task of punctuation prediction and achieved comparable results on both subtasks. We believe that fine-tuning BERT on more data would benefit the overall punctuation task. Also, language-specific contextual embeddings would improve performance in the other languages. We will be incorporating both of these points in our future work.

³ https://tfhub.dev/tensorflow/bert_multi_cased_L-12_H-768_A-12/4

References

Łukasz Augustyniak, Piotr Szymanski, Mikołaj Morzy, Piotr Zelasko, Adrian Szymczak, Jan Mizgajski, Yishay Carmiel, and Najim Dehak. 2020. Punctuation prediction in spontaneous conversations: Can
Dataset        Lang  Pr    Re    F1
Test           EN    0.79  0.69  0.72
               DE    0.80  0.74  0.77
               FR    0.79  0.65  0.68
               IT    0.78  0.62  0.66
               AVG   0.79  0.68  0.71
Surprise Test  EN    0.62  0.52  0.56
               DE    0.61  0.57  0.58
               FR    0.61  0.48  0.51
               IT    0.54  0.43  0.46
               AVG   0.80  0.65  0.72
Dev            EN    0.80  0.69  0.72
               DE    0.81  0.74  0.77
               FR    0.79  0.65  0.69
               IT    0.78  0.62  0.66
               AVG   0.80  0.68  0.71
Table 8: Subtask 2 results using BERT MTL

we mitigate asr errors with retrofitted word embeddings? arXiv preprint arXiv:2004.05985.

Paweł Budzianowski and Ivan Vulić. 2019. Hello, it's GPT-2 - how can I help you? Towards the use of pretrained language models for task-oriented dialogue systems. arXiv preprint arXiv:1907.05774.

Berfu Büyüköz, Ali Hürriyetoğlu, and Arzucan Özgür. 2020. Analyzing ELMo and DistilBERT on socio-political news classification. In Proceedings of the Workshop on Automated Extraction of Socio-political Events from News 2020, pages 9–18.

Xiaoyin Che, Cheng Wang, Haojin Yang, and Christoph Meinel. 2016. Punctuation prediction for unsegmented transcript based on word vector. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 654–658.

Heidi Christensen, Yoshihiko Gotoh, and Steve Renals. 2001. Punctuation annotation using statistical prosody models.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. arXiv preprint arXiv:1801.06146.

Ji-Hwan Kim and Philip C. Woodland. 2001. The use of prosody in a combined system for punctuation generation and speech recognition. In Seventh European Conference on Speech Communication and Technology.

Jieh-Sheng Lee and Jieh Hsiang. 2020. Patent claim generation by fine-tuning OpenAI GPT-2. World Patent Information, 62:101983.

Xinxing Li and Edward Lin. 2020. A 43 language multilingual punctuation prediction neural network model. In Proc. Interspeech 2020, pages 1067–1071.

Wei Lu and Hwee Tou Ng. 2010. Better punctuation prediction with dynamic conditional random fields. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 177–186.

Thi-Tuyet-Hai Nguyen, Adam Jatowt, Mickael Coustaty, Nhu-Van Nguyen, and Antoine Doucet. 2019. Deep statistical analysis of OCR errors for effective post-OCR processing. In 2019 ACM/IEEE Joint Conference on Digital Libraries (JCDL), pages 29–38.

Stephan Peitz, Markus Freitag, and Hermann Ney. 2014. Better punctuation prediction with hierarchical phrase-based translation. In Proc. of the International Workshop on Spoken Language Translation (IWSLT), South Lake Tahoe, CA, USA.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proc. of NAACL.

Nicola Ueffing, Maximilian Bisani, and Paul Vozila. 2013. Improved models for automatic punctuation prediction for spoken and written text. In Interspeech, pages 3097–3101.

Vincent Vandeghinste, Lyan Verwimp, Joris Pelemans, and Patrick Wambacq. 2018. A comparison of different punctuation prediction approaches in a translation context. In Proceedings of the 21st Annual Conference of the European Association for Machine Translation, pages 269–278.

Joseph Worsham and Jugal Kalita. 2020. Multi-task learning for natural language processing in the 2020s: Where are we going? Pattern Recognition Letters, 136:120–126.