<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Deep Contextual Punctuator for NLG Text</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Vandan Mujadia</string-name>
          <email>vandan.mu@research.iiit.ac.in</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pruthwik Mishra</string-name>
          <email>pruthwik.mishra@research.iiit.ac.in</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dipti Misra Sharma</string-name>
          <email>dipti@iiit.ac.in</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Language Technologies Research Center, Kohli Center on Intelligent Systems</institution>
          ,
          <addr-line>IIIT Hyderabad</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper describes the participation of our team oneNLP (LTRC, IIIT-Hyderabad) in the SEPP-NLG 2021 shared task, Sentence End and Punctuation Prediction in NLG Text (https://sites.google.com/view/sentence-segmentation/). We applied sequence-to-tag prediction over contextual embeddings, fine-tuned for both tasks. We also explored the use of multilingual BERT and multi-task learning for these tasks on English, German, French and Italian.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        Generally, the output of automatic speech recognition (ASR) systems omits punctuation marks. Similarly, the output of OCR systems
        <xref ref-type="bibr" rid="ref13">(Nguyen et al., 2019)</xref>
        needs automatic validation of punctuation. Apart from omitting punctuation markers, some automatically generated texts, e.g. from PDF-to-text extraction, may erroneously displace sentence boundaries for several reasons. Detecting the end of a sentence and placing an appropriate punctuation mark significantly improves the quality of such outputs by preserving the original meaning. Missing or inappropriate punctuation degrades the readability of the presented text and leads to poor user experiences in real-world scenarios
        <xref ref-type="bibr" rid="ref5 ref16">(Che et al., 2016; Ueffing et al., 2013)</xref>
        , as well as to erroneous input to subsequent automatic systems such as machine translation, summarization, question answering, NLU, etc. It is therefore necessary to restore or correct punctuation marks in these automatic outputs.
      </p>
      <p>
        Traditionally, automatic punctuation marking approaches
        <xref ref-type="bibr" rid="ref12">(Lu and Ng, 2010)</xref>
        can be divided into three broad categories
        <xref ref-type="bibr" rid="ref17">(Vandeghinste et al., 2018)</xref>
        based on the features used: prosody-based features
        <xref ref-type="bibr" rid="ref9 ref6">(Kim and Woodland, 2001; Christensen et al., 2001)</xref>
        , lexical features
        <xref ref-type="bibr" rid="ref1 ref14">(Augustyniak et al., 2020; Peitz et al., 2014)</xref>
        , or hybrid methods combining the previous two kinds of features. Recent lexical punctuation prediction methods build upon deep neural networks, where the problem is modeled either as a sequence-to-tag prediction task
        <xref ref-type="bibr" rid="ref11">(Li and Lin, 2020)</xref>
        or as a sequence-to-sequence prediction task
        <xref ref-type="bibr" rid="ref17">(Vandeghinste et al., 2018)</xref>
        .
      </p>
      <p>The simplest form of punctuation prediction is the detection of sentence boundaries, where the problem is a binary classification (the classes being a period or the empty label). An incremental and harder problem is the prediction of each individual punctuation mark; here the class labels for sub-task 2 are ":", "-", ",", "?", "." and "0" (0 indicating no punctuation). SEPP-NLG 2021 presents both tasks as a challenge for English, German, French and Italian.</p>
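      <p>To make the sequence-labeling framing concrete, the following is a minimal, hypothetical Python sketch of how per-token labels for both sub-tasks can be derived from a punctuated reference; the function names and the exact sentence-end convention are our assumptions, not part of the task definition.</p>
      <preformat>
# Hypothetical labeling sketch for the two sub-tasks.
from itertools import zip_longest

PUNCT = {":", "-", ",", "?", "."}
SENT_END = {".", "?"}

def make_labels(tokens):
    """tokens: reference tokens including punctuation,
    e.g. ['hello', ',', 'how', 'are', 'you', '?']"""
    words, task1, task2 = [], [], []
    for tok, nxt in zip_longest(tokens, tokens[1:], fillvalue="0"):
        if tok in PUNCT or tok in {"!", ";"}:
            continue  # punctuation itself gets no input row
        nxt = "." if nxt in {"!", ";"} else nxt  # '!' and ';' map to '.'
        words.append(tok.lower())
        task2.append(nxt if nxt in PUNCT else "0")     # sub-task 2 label
        task1.append("." if nxt in SENT_END else "0")  # sub-task 1 label
    return words, task1, task2
</preformat>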
      <p>
        With recent advances in deep learning, pretrained language models such as ELMo
        <xref ref-type="bibr" rid="ref15">(Peters et al., 2018)</xref>
        , ULMFiT
        <xref ref-type="bibr" rid="ref8">(Howard and Ruder, 2018)</xref>
        , the OpenAI Transformer
        <xref ref-type="bibr" rid="ref10">(Lee and Hsiang, 2020)</xref>
        and BERT
        <xref ref-type="bibr" rid="ref7">(Devlin et al., 2018)</xref>
        have resulted in a massive jump in state-of-the-art performance on many NLP tasks, e.g. text classification
        <xref ref-type="bibr" rid="ref4">(Büyüköz et al., 2020)</xref>
        , natural language inference, question answering, and dialogue systems
        <xref ref-type="bibr" rid="ref3">(Budzianowski and Vulić, 2019)</xref>
        . All these approaches pre-train an unsupervised language model on a large corpus, such as all Wikipedia articles or news articles, and then fine-tune the pretrained model on different downstream tasks.
      </p>
      <p>Here, for our experiments on the two punctuation prediction tasks, we use multilingual BERT and ALBERT (for English) in a fine-tuning setup, along with baseline experiments with a CRF.</p>
    </sec>
    <sec id="sec-2">
      <title>2 Dataset</title>
      <p>As a part of SEPP-NLG 2021, the organizers released a Europarl corpus of spoken texts, produced by lowercasing and removing all punctuation from the transcripts, available in multiple languages. Table 1 describes the training and development corpora for all languages in terms of sentences and tokens. Table 2, Table 3, Table 4 and Table 5 describe the training corpora in terms of the punctuation classes (":", ",", "?", ".", "0") and their respective distributions. The reported numbers are computed after "!" and ";" are mapped to ".".</p>
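      <p>A small sketch of how such class counts can be reproduced from punctuated reference transcripts; this is our assumption of the procedure, not the organizers' script.</p>
      <preformat>
# Count punctuation-class frequencies after mapping '!' and ';' to '.'.
from collections import Counter

TABLE_CLASSES = {":", ",", "?", "."}

def class_distribution(tokens):
    counts = Counter()
    for tok in tokens:
        tok = "." if tok in {"!", ";"} else tok
        counts[tok if tok in TABLE_CLASSES else "0"] += 1
    return counts
</preformat>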
    </sec>
    <sec id="sec-3">
      <title>3 Approach</title>
      <p>We primarily used two broad categories of approaches, modeling the problem as a sequence labeling task in both. As a machine learning approach, we trained a CRF model to correctly identify the different kinds of labels. Transformer-based BERT fine-tuning was used as the other technique.</p>
      <sec id="sec-3-1">
        <title>3.1 CRF</title>
        <p>We split the English training data into sequences of 25 tokens each. The decision to set the maximum sequence length to 25 was based on the average sentence length of the English training data. We used only words as features, taking a continuous window of 5 words over the full corpus as the required features for the CRF.</p>
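        <p>A minimal sketch of this baseline, assuming the sklearn-crfsuite library (the paper does not name an implementation; helper names are ours):</p>
        <preformat>
# CRF baseline sketch: 25-token sequences, 5-word window features.
import sklearn_crfsuite

def window_features(seq, i, size=2):
    # words at offsets -2..+2 around position i (a 5-word window)
    feats = {}
    for off in range(-size, size + 1):
        j = i + off
        feats["w[%d]" % off] = seq[j] if j in range(len(seq)) else "PAD"
    return feats

def featurize(sequences):
    # sequences: lists of at most 25 lowercased tokens each
    return [[window_features(s, i) for i in range(len(s))] for s in sequences]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=100)
# X, y = featurize(train_seqs), train_tag_seqs  # tags as in sub-task 2
# crf.fit(X, y)
</preformat>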
      </sec>
      <sec id="sec-3-2">
        <title>3.2 Multi-Task Learning</title>
        <p>
          Multi-task learning (MTL) is a technique that aims to improve generalization, strengthen representations and enable adaptation in machine learning
          <xref ref-type="bibr" rid="ref18">(Worsham and Kalita, 2020)</xref>
          across related tasks. In our case, we enabled multi-task learning for our ALBERT- and mBERT-based contextual experiments as presented in Figure 1, where the contextual embeddings are shared across the sub-tasks. We applied a sequence-to-tag classifier on the contextualized output token embeddings of ALBERT/mBERT for tag prediction. We used ALBERT (https://tfhub.dev/google/albert_base/3) for English and mBERT (https://tfhub.dev/tensorflow/bert_multi_cased_L-12_H-768_A-12/4) for the other languages.
        </p>
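        <p>A sketch of this shared-encoder setup, reconstructing Figure 1 under stated assumptions: we use the HuggingFace transformers API for brevity rather than the TF Hub checkpoints cited above, and all layer and variable names are ours.</p>
        <preformat>
# Shared-encoder multi-task sketch: one contextual encoder feeds
# two token-level softmax heads, one per sub-task.
import tensorflow as tf
from transformers import TFAutoModel

MAX_LEN = 300   # max word sequence length from the configuration below
N_TAGS_T1 = 2   # sub-task 1: sentence end vs. none
N_TAGS_T2 = 6   # sub-task 2: ':', '-', ',', '?', '.', '0'

encoder = TFAutoModel.from_pretrained("bert-base-multilingual-cased")

input_ids = tf.keras.Input(shape=(MAX_LEN,), dtype=tf.int32, name="input_ids")
attn_mask = tf.keras.Input(shape=(MAX_LEN,), dtype=tf.int32, name="attention_mask")

# Contextualized token embeddings shared by both sub-tasks
hidden = encoder(input_ids, attention_mask=attn_mask).last_hidden_state
hidden = tf.keras.layers.Dropout(0.30)(hidden)

# One sequence-to-tag softmax head per sub-task
task1 = tf.keras.layers.Dense(N_TAGS_T1, activation="softmax", name="task1")(hidden)
task2 = tf.keras.layers.Dense(N_TAGS_T2, activation="softmax", name="task2")(hidden)

model = tf.keras.Model([input_ids, attn_mask], [task1, task2])
</preformat>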
      <sec id="sec-3-1">
        <title>Dataset</title>
        <sec id="sec-3-1-1">
          <title>Test</title>
        </sec>
        <sec id="sec-3-1-2">
          <title>Surprise Test Dev</title>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>Lang</title>
        <p>EN
DE
FR</p>
        <p>IT
AVG
EN
DE
FR</p>
        <p>IT
AVG
EN
DE
FR</p>
        <p>IT
AVG
Pr
0.73
0.71
and mBert3 for other languages.</p>
        <p>The following configuration was used for BERT fine-tuning:</p>
        <p>• Input: subword tokens (same as the BERT/ALBERT vocabulary)
• Embedding size: 512
• Transformer config: 6 layers, hidden size 2048, 8 attention heads
• Dropout: 0.30; Optimizer: Adam
• Maximum word sequence length: 300
• Fine-tuning steps: 90K with batch size 40</p>
        <p>Due to algorithmic limitations, we were not able to apply MTL to the CRF, as it does not facilitate the joint learning of multiple classification tasks in one go.</p>
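        <p>Continuing the sketch above, a training call consistent with this configuration might look as follows; the dataset tensors (x_ids, x_mask, y_task1, y_task2) are assumed placeholders, not the authors' code.</p>
        <preformat>
# Training setup matching the listed configuration (illustrative only).
model.compile(
    optimizer=tf.keras.optimizers.Adam(),
    loss={"task1": "sparse_categorical_crossentropy",
          "task2": "sparse_categorical_crossentropy"})

model.fit(
    {"input_ids": x_ids, "attention_mask": x_mask},
    {"task1": y_task1, "task2": y_task2},
    batch_size=40,                     # batch size from the paper
    epochs=10, steps_per_epoch=9000)   # roughly 90K fine-tuning steps
</preformat>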
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4 Results and Discussion</title>
      <p>As the results of the CRF with word-level features for English were poor (shown in Table 7), we did not conduct CRF experiments on the other languages.</p>
      <p>We observe that the results of BERT with multi-task learning are superior to those of the CRF. This is due to the better sentence and sequence representations learnt by the transformer; simple surface-level word features in the CRF fail to capture sentence-end or punctuation markers.</p>
      <p>[Recovered results-table fragments: per-language rows (EN, DE, FR, IT, AVG) for the Test, Surprise Test and Dev sets, with a precision (Pr) column reading 0.92/0.93/0.90/0.88/0.91 (Test), 0.81/0.85/0.77/0.78/0.80 (Surprise Test) and 0.92/0.94/0.90/0.88/0.91 (Dev); Table 8: Subtask 2 results using BERT MTL.]</p>
    </sec>
    <sec id="sec-5">
      <title>5 Conclusion and Future Work</title>
      <p>We have successfully applied contextual embeddings to the task of punctuation prediction and achieved comparable results on both subtasks. We believe that fine-tuning BERT on more data would benefit the overall punctuation task, and that language-specific contextual embeddings would improve performance for the other languages. We will incorporate both of these points in our future work.</p>
      <sec id="sec-5-1">
        <title>Test</title>
      </sec>
      <sec id="sec-5-2">
        <title>Surprise Test Dev</title>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <given-names>Łukasz</given-names>
            <surname>Augustyniak</surname>
          </string-name>
          , Piotr Szymanski, Mikołaj Morzy, Piotr Zelasko, Adrian Szymczak, Jan Mizgajski, Yishay Carmiel, and
          <string-name>
            <given-names>Najim</given-names>
            <surname>Dehak</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Punctuation prediction in spontaneous conversations: Can we mitigate ASR errors with retrofitted word embeddings?</article-title>
          . arXiv preprint arXiv:2004.05985.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <article-title>we mitigate asr errors with retrofitted word embeddings? arXiv preprint</article-title>
          arXiv:
          <year>2004</year>
          .05985.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <given-names>Paweł</given-names>
            <surname>Budzianowski</surname>
          </string-name>
          and Ivan Vulić.
          <year>2019</year>
          .
          <article-title>Hello, it's gpt-2 - how can i help you? towards the use of pretrained language models for task-oriented dialogue systems</article-title>
          . arXiv preprint arXiv:
          <year>1907</year>
          .05774.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <given-names>Berfu</given-names>
            <surname>Büyüköz</surname>
          </string-name>
          , Ali Hürriyetoğlu, and Arzucan Özgür.
          <year>2020</year>
          .
          <article-title>Analyzing elmo and distilbert on sociopolitical news classification</article-title>
          .
          <source>In Proceedings of the Workshop on Automated Extraction of Sociopolitical Events from News</source>
          <year>2020</year>
          , pages
          <fpage>9</fpage>
          -
          <lpage>18</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <given-names>Xiaoyin</given-names>
            <surname>Che</surname>
          </string-name>
          , Cheng Wang, Haojin Yang, and
          <string-name>
            <given-names>Christoph</given-names>
            <surname>Meinel</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Punctuation prediction for unsegmented transcript based on word vector</article-title>
          .
          <source>In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)</source>
          , pages
          <fpage>654</fpage>
          -
          <lpage>658</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <given-names>Heidi</given-names>
            <surname>Christensen</surname>
          </string-name>
          , Yoshihiko Gotoh, and
          <string-name>
            <given-names>Steve</given-names>
            <surname>Renals</surname>
          </string-name>
          .
          <year>2001</year>
          .
          <article-title>Punctuation annotation using statistical prosody models</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <given-names>Jacob</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ming-Wei</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Kenton</given-names>
            <surname>Lee</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Kristina</given-names>
            <surname>Toutanova</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Bert: Pre-training of deep bidirectional transformers for language understanding</article-title>
          . arXiv preprint arXiv:
          <year>1810</year>
          .04805.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <given-names>Jeremy</given-names>
            <surname>Howard</surname>
          </string-name>
          and
          <string-name>
            <given-names>Sebastian</given-names>
            <surname>Ruder</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Universal language model fine-tuning for text classification</article-title>
          . arXiv preprint arXiv:
          <year>1801</year>
          .06146.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <given-names>Ji-Hwan</given-names>
            <surname>Kim</surname>
          </string-name>
          and
          <string-name>
            <given-names>Philip C</given-names>
            <surname>Woodland</surname>
          </string-name>
          .
          <year>2001</year>
          .
          <article-title>The use of prosody in a combined system for punctuation generation and speech recognition</article-title>
          .
          <source>In Seventh European conference on speech communication and technology.</source>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <given-names>Jieh-Sheng</given-names>
            <surname>Lee</surname>
          </string-name>
          and
          <string-name>
            <given-names>Jieh</given-names>
            <surname>Hsiang</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Patent claim generation by fine-tuning openai gpt-2</article-title>
          . World Patent Information,
          <volume>62</volume>
          :
          <fpage>101983</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <given-names>Xinxing</given-names>
            <surname>Li</surname>
          </string-name>
          and
          <string-name>
            <given-names>Edward</given-names>
            <surname>Lin</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>A 43 language multilingual punctuation prediction neural network model</article-title>
          .
          <source>Proc. Interspeech</source>
          <year>2020</year>
          , pages
          <fpage>1067</fpage>
          -
          <lpage>1071</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <given-names>Wei</given-names>
            <surname>Lu</surname>
          </string-name>
          and Hwee Tou Ng.
          <year>2010</year>
          .
          <article-title>Better punctuation prediction with dynamic conditional random fields</article-title>
          .
          <source>In Proceedings of the 2010 conference on empirical methods in natural language processing</source>
          , pages
          <fpage>177</fpage>
          -
          <lpage>186</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <given-names>Thi-Tuyet-Hai</given-names>
            <surname>Nguyen</surname>
          </string-name>
          , Adam Jatowt, Mickael Coustaty, Nhu-Van Nguyen, and
          <string-name>
            <given-names>Antoine</given-names>
            <surname>Doucet</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Deep statistical analysis of ocr errors for effective post-ocr processing</article-title>
          .
          <source>In 2019 ACM/IEEE Joint Conference on Digital Libraries (JCDL)</source>
          , pages
          <fpage>29</fpage>
          -
          <lpage>38</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <given-names>Stephan</given-names>
            <surname>Peitz</surname>
          </string-name>
          , Markus Freitag, and Hermann Ney.
          <year>2014</year>
          .
          <article-title>Better punctuation prediction with hierarchical phrase-based translation</article-title>
          .
          <source>In Proc. of the Int. Workshop on Spoken Language Translation (IWSLT)</source>
          , South Lake Tahoe, CA, USA.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <given-names>Matthew E.</given-names>
            <surname>Peters</surname>
          </string-name>
          , Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark,
          <string-name>
            <given-names>Kenton</given-names>
            <surname>Lee</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Luke</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Deep contextualized word representations</article-title>
          .
          <source>In Proc. of NAACL.</source>
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <given-names>Nicola</given-names>
            <surname>Ueffing</surname>
          </string-name>
          , Maximilian Bisani, and
          <string-name>
            <given-names>Paul</given-names>
            <surname>Vozila</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Improved models for automatic punctuation prediction for spoken and written text</article-title>
          .
          <source>In Interspeech</source>
          , pages
          <fpage>3097</fpage>
          -
          <lpage>3101</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <string-name>
            <given-names>Vincent</given-names>
            <surname>Vandeghinste</surname>
          </string-name>
          , Lyan Verwimp, Joris Pelemans, and
          <string-name>
            <given-names>Patrick</given-names>
            <surname>Wambacq</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>A comparison of different punctuation prediction approaches in a translation context</article-title>
          .
          <source>In Proceedings of the 21st Annual Conference of the European Association for Machine Translation: 28-30 May</source>
          <year>2018</year>
          , Universitat d'Alacant, Alacant, Spain, pages
          <fpage>269</fpage>
          -
          <lpage>278</lpage>
          . European Association for Machine Translation.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <string-name>
            <given-names>Joseph</given-names>
            <surname>Worsham</surname>
          </string-name>
          and
          <string-name>
            <given-names>Jugal</given-names>
            <surname>Kalita</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Multitask learning for natural language processing in the 2020s: Where are we going?</article-title>
          .
          <source>Pattern Recognition Letters</source>
          ,
          <volume>136</volume>
          :
          <fpage>120</fpage>
          -
          <lpage>126</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>