=Paper=
{{Paper
|id=Vol-2957/sepp_paper4
|storemode=property
|title=FullStop: Multilingual Deep Models for Punctuation Prediction
|pdfUrl=https://ceur-ws.org/Vol-2957/sepp_paper4.pdf
|volume=Vol-2957
|authors=Oliver Guhr,Anne-Kathrin Schumann,Frank Bahrmann,Hans-Joachim Böhme
|dblpUrl=https://dblp.org/rec/conf/swisstext/GuhrSBB21
}}
==FullStop: Multilingual Deep Models for Punctuation Prediction==
FullStop: Multilingual Deep Models for Punctuation Prediction

Oliver Guhr (1), Anne-Kathrin Schumann (2), Frank Bahrmann (1), Hans-Joachim Böhme (1)
(1) University of Applied Sciences (HTW) Dresden, Germany
(2) t2k GmbH, Dresden, Germany
{oliver.guhr, frank.bahrmann, hans-joachim.boehme}@htw-dresden.de, anne-kathrin.schumann@text2knowledge.de

Abstract

This paper describes our contribution to the SEPP-NLG Shared Task on multilingual sentence segmentation and punctuation prediction. The goal of this task is to train NLP models that predict end-of-sentence (EOS) and punctuation marks in automatically generated or transcribed texts. We show that these tasks benefit from crosslingual transfer by successfully employing multilingual deep language models. Our multilingual model achieves an average F1-score of 0.94 for EOS prediction on English, German, French, and Italian texts and an average F1-score of 0.78 for punctuation mark prediction.

1 Introduction

The prediction of EOS and punctuation marks in automatically generated or transcribed texts is a relatively novel task. While sentence segmentation is a core, low-level natural language processing (NLP) task, punctuation has in the past primarily been studied in the context of error correction and the normalisation of automatic speech recognition (ASR) output. However, with the recent rise of conversational agents and other NLP systems that are able to generate new texts, the insertion of punctuation and EOS marks has gained wider interest. This is hardly surprising, because punctuation affects the readability of the text produced by an NLP system and, thus, its perceived overall performance. The SEPP-NLG Shared Task offers two subtasks, namely:

• Subtask 1 – Sentence segmentation: full-stop prediction on fully unpunctuated, lowercased documents.
• Subtask 2 – Punctuation prediction: prediction of all punctuation marks on fully unpunctuated, lowercased documents, where the possible punctuation marks are members of the set p = {",", ".", "?", ":", "-", "0"}, with "0" indicating no punctuation.

The task is carried out on the German, English, French, and Italian sections of the Europarl corpus (Koehn, 2005), since it offers transcripts of spoken texts for multiple languages. We developed models for both tasks based on the Transformers library by Wolf et al. (2020). These models and our code are publicly available at https://github.com/oliverguhr/fullstop-deep-punctuation-prediction.

Copyright ©2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
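To make the token-wise formulation of the two subtasks concrete, the following minimal example (our own illustration, not the shared task's actual file format) shows the per-token targets for a short lowercased, unpunctuated input:

<syntaxhighlight lang="python">
# Illustrative example only (not taken from the shared-task files): token-wise
# targets for the sentence "hello, how are you? fine, thanks." after it has
# been lowercased and stripped of all punctuation.

tokens = ["hello", "how", "are", "you", "fine", "thanks"]

# Subtask 1: binary EOS labels (1 = a sentence ends after this token).
subtask1_labels = [0, 0, 0, 1, 0, 1]

# Subtask 2: one label per token from the set {",", ".", "?", ":", "-", "0"},
# where "0" means that no punctuation mark follows the token.
subtask2_labels = [",", "0", "0", "?", ",", "."]

assert len(tokens) == len(subtask1_labels) == len(subtask2_labels)
</syntaxhighlight>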
2 Related Work

Earlier studies on EOS and punctuation prediction reflect the various fields of application of this technology. The task is mostly modeled as token-wise prediction. Over the last few years, consistent performance improvements have, unsurprisingly, been achieved with the help of neural network approaches and large-scale neural language models.

The work by Attia et al. (2014) constitutes a rather traditional approach to spelling and punctuation correction, in this case for Arabic. The authors report that punctuation errors constitute 40% of all errors in their data set. The task is modeled as token-wise classification with context windows varying between 4 and 8 words. Classification is carried out with Support Vector Machine and Conditional Random Field (CRF) classifiers, using part-of-speech (POS) and morphological information. The authors obtain their best result, an F1-score of 0.56, with the CRF classifier and a window size of five tokens.

Che et al. (2016) experiment with three different neural network architectures, using pretrained GloVe (Pennington et al., 2014) embeddings as inputs. Since their goal is to predict punctuation marks specifically on ASR output, they evaluate their models on ASR transcripts of TED talks. Predicting the positions of commas, periods, and question marks, their best result in this 4-class classification task is an F1-score of 0.54.

Treviso et al. (2017) study sentence segmentation, not punctuation, in narrative transcripts that were generated in the context of examining patients for symptoms of language-impairing dementia. They work on three different Portuguese data sets. The input data is modeled by means of POS features, word embeddings, and prosodic information. They then combine convolutional and recurrent neural network layers, achieving F1-scores between 0.7 and 0.8 on two evaluation data sets.

Schweter and Ahmed (2019) also experiment with the Europarl corpus; however, their task is different from the task presented here: they model only sentence segmentation by predicting, for each full stop in the input text, whether it is an EOS marker or part of another linguistic unit (for instance, an abbreviation). Predictions are produced by character-level models that are fed not only the token to disambiguate but also local context in the form of context windows. Working on a wide variety of languages, including often overlooked languages such as Bosnian, Greek, or Romanian, they achieve F1-scores between 0.98 and 0.99, with their BiLSTM model performing best on average.

Sunkara et al. (2020) also work in the clinical domain, more precisely on the output of medical ASR systems. They jointly model punctuation and truecasing by first predicting a punctuation sequence and then the case of each input word.
The authors use a pretrained transformer model (Devlin et al., 2019; Liu et al., 2019) in combination with subword embeddings to overcome lexical sparsity in the medical domain. They also carry out a fine-tuning step on medical data and a task adaptation step, randomly masking punctuation marks in the text, before training the actual model. Predicting full stops and commas, the authors achieve F1-scores of 0.81 (for commas) and 0.92 (for full stops) with Bio-BERT (Lee et al., 2019), which was trained on biomedical corpora.

3 Task and Data

The task consists in predicting EOS and punctuation marks on unpunctuated, lowercased text. The organizers of the SEPP-NLG shared task provided 470 MB of English, German, French, and Italian text. This data set consists of a training and a development set. For system ranking, a test set with in-domain texts and a surprise set with out-of-domain texts were used.

Figure 1 shows the distribution of the punctuation labels for subtask 2 for all languages. As can be seen from the figure, the distribution of the labels is quite skewed, even if we disregard the fact that the majority of tokens in each data set carries the label "0" (omitted in Figure 1 for better readability). All languages follow the same distribution pattern; however, they exhibit subtle differences. For instance, the difference in frequency between commas and full stops is particularly pronounced for German, and German in general has a higher proportion of commas, indicating complex sentence structures. For other language pairs, we observe slight differences in the distribution of hyphens and colons.

Figure 1: Distribution of punctuation labels for the four languages on the training sets of task 2 (label counts shown below). Mean document length varies between 10,378 words (Italian) and 12,275 words (French).

Label | German | English | French | Italian
? | 40,511 | 44,290 | 41,005 | 38,807
: | 51,192 | 43,133 | 46,128 | 55,080
- | 81,710 | 80,916 | 68,523 | 52,983
. | 1,290,282 | 1,396,166 | 1,223,802 | 1,138,669
, | 2,208,970 | 1,759,686 | 1,657,880 | 1,503,502

Earlier versions of subtask 2 also required predictions for the punctuation marks "!" and ";". During the training phase, the task organizers mapped these symbols to the full stop to account for strongly skewed distributions and potential HTML artefacts. Sentences containing punctuation symbols other than those already mentioned, parentheses for instance, were removed by the task organizers because not all instances of parentheses were well-formed (i.e. not every opening parenthesis was matched by a closing parenthesis). These issues leave avenues for future research.

4 Models

4.1 Baselines and Model Selection

The transformer architecture (Vaswani et al., 2017) and transfer learning with transformer-based language models (Devlin et al., 2019) have led to notable performance gains for many NLP tasks. For this reason, we focused our research on a transformer-based architecture, exploring a number of recent language models and multilingual transfer learning. Following earlier work, we modelled the task as token-wise prediction.

However, to assess the performance gain enabled by a transformer-based language model, we also trained a first, non-neural baseline for German sentence segmentation: a CRF model based on bag-of-words, POS, and local context (+/- 2 tokens) features. This model seemed to perform much better than the spaCy (https://spacy.io/) baseline provided for subtask 1 (https://sites.google.com/view/sentence-segmentation/); however, since it was outperformed by all transformer-based models by a large margin, we decided not to explore this direction any further.

As a second baseline, we trained a vanilla multilingual Bert model and explored techniques to improve on this baseline. In particular, we focused on three different options, namely data augmentation, hyperparameter optimization, and the selection of different architectures and pre-trained models. We also tested various preprocessing steps to remove special characters and HTML artefacts, but this had no significant effect on our results.

As a first step towards model selection, we trained a set of mono- and multilingual models on 10% of the training data for each task. We then selected the best models per language and the best multilingual model and trained them on the full training data set. This approach helped us to iterate quickly by avoiding long training times (up to 20 hours on a single GPU) just for model selection. We selected the following architectures for our tests:

• BERT (Devlin et al., 2019)
• DistilBERT (Sanh et al., 2019)
• Electra (Clark et al., 2020)
• RoBERTa (Liu et al., 2019)
• XLM-RoBERTa (Conneau et al., 2020)
• CamemBERT (Martin et al., 2020)

First experiments with data augmentation and hyperparameter optimization showed that these techniques had only a minor effect on the models' performance. All of our 10% and full models were trained for 3 epochs using Adafactor (Shazeer and Stern, 2018) with a learning rate of 4e-5 and a batch size of 8. Furthermore, we used 16-bit precision training to improve training speed. We ran hyperparameter optimizations with limited success; for more information, see the ablations in Section 7. We then focused on the selection of architectures and pretrained models.

Table 1: F1 scores of all base models trained on 10% of the language-specific data, or on 10% of all languages for the multilingual models. All models were trained for 3 epochs using Adafactor and a learning rate of 4e-5. For task 1, we report the F1 score of the EOS class; for task 2, the macro-average F1 of all classes.

Base Model | Task 1 F1 | Task 2 F1
English
distilbert-base-uncased | 0.849048 | 0.581294
google/electra-base-generator | 0.867502 | 0.426554
google/electra-small-generator | 0.872033 | 0.590815
bert-base-uncased | 0.885560 | 0.647669
google/electra-large-generator | 0.901298 | 0.558433
bert-large-uncased | 0.903943 | 0.699679
roberta-base | 0.921170 | 0.719705
xlm-roberta-large | 0.932057 | 0.740402
roberta-large | 0.935672 | 0.742778
German
bert-base-multilingual-uncased | 0.931668 | 0.708220
dbmdz/bert-base-german-uncased | 0.943437 | 0.746249
deepset/gbert-base | 0.943571 | 0.753979
german-nlp-group/electra-base-german-uncased | 0.950070 | 0.759387
French
bert-base-multilingual-uncased | 0.881648 | 0.658968
camembert-base | 0.914799 | 0.702187
camembert/camembert-large | 0.935436 | 0.756594
Italian
dbmdz/electra-base-italian-xxl-cased-generator | 0.866070 | 0.496291
bert-base-multilingual-uncased | 0.867798 | 0.586234
dbmdz/bert-base-italian-cased | 0.897765 | 0.658520
dbmdz/bert-base-italian-xxl-uncased | 0.910585 | 0.693615
Multilingual
bert-base-multilingual-uncased | 0.887909 | 0.683688
xlm-roberta-base | 0.915930 | 0.716822
xlm-roberta-large | 0.935946 | 0.753770

We trained a 10% and a 100% model for all architecture types to ensure that the architectures scale well with the increased data. Comparing the results in Table 1 and Table 3, we found that the models for task 1 gain between 0.1% and 1% by scaling from 10% to 100% of the data, and the models for task 2 gain between 3% and 5%.
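As an illustration of the training setup just described, the following sketch fine-tunes a token-classification model with the Transformers library using Adafactor, a learning rate of 4e-5, a batch size of 8, 3 epochs, and 16-bit precision. It is a minimal sketch, not the original training code: the tiny in-memory data set, its column names, and the encoding helper stand in for the shared-task data preparation.

<syntaxhighlight lang="python">
from datasets import Dataset
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          DataCollatorForTokenClassification, Trainer,
                          TrainingArguments)
from transformers.optimization import Adafactor

label_list = [",", ".", "?", ":", "-", "0"]   # subtask 2 label set
label2id = {label: i for i, label in enumerate(label_list)}
model_name = "xlm-roberta-large"              # one of the tested architectures

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(
    model_name, num_labels=len(label_list))

def encode(example):
    # Tokenize pre-split words and attach one label to the first sub-token of
    # each word; remaining sub-tokens and special tokens are masked with -100.
    enc = tokenizer(example["words"], is_split_into_words=True,
                    truncation=True, max_length=512)
    labels, previous = [], None
    for word_id in enc.word_ids():
        if word_id is None or word_id == previous:
            labels.append(-100)
        else:
            labels.append(label2id[example["labels"][word_id]])
        previous = word_id
    enc["labels"] = labels
    return enc

# Tiny stand-in corpus; the shared-task files would be read into the same shape.
raw = Dataset.from_dict({
    "words": [["hello", "how", "are", "you", "fine", "thanks"]],
    "labels": [[",", "0", "0", "?", ",", "."]],
})
train_data = raw.map(encode, remove_columns=["words", "labels"])

# Adafactor with a fixed learning rate of 4e-5, as used for all reported models.
optimizer = Adafactor(model.parameters(), lr=4e-5, scale_parameter=False,
                      relative_step=False, warmup_init=False)

args = TrainingArguments(output_dir="fullstop-sketch",
                         num_train_epochs=3,               # 3 epochs
                         per_device_train_batch_size=8,    # batch size of 8
                         fp16=True)                        # 16-bit precision (needs a GPU)

trainer = Trainer(model=model, args=args, train_dataset=train_data,
                  data_collator=DataCollatorForTokenClassification(tokenizer),
                  optimizers=(optimizer, None))
trainer.train()
</syntaxhighlight>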
4.2 Windowing Approach

All selected architectures are limited with respect to the number of tokens they can process, typically 512. Since most documents are longer than this limit (see Figure 1), we needed a strategy to handle longer sequences.

The simplest method is to split the text into chunks of 200 words before processing. The number of 200 words was chosen empirically to account for the fact that words can be tokenized into more than one token. The disadvantage of this approach is that it is inefficient, since most sequences will not utilize the full 512-token capacity of the model.

We therefore chose to first tokenize each document and then split it into sequences of 512 tokens. However, this approach, just like the first one, can produce sequences that start with the last word of a sentence or end with the first word of a sentence, giving the model no context for the prediction. To address this issue, we used a sliding window approach and ran experiments with different step sizes, similar to the stride parameter in convolutional neural networks. This method ensures that the model has additional context for making predictions. For training, we ran a grid search to find the optimal length of the overlapping window, using an English Bert base model on 10% of the data. Based on the results shown in Table 2, we chose an overlapping window size of 100 tokens for training our models. The loss was calculated for the whole sequence, including the overlapping part. Since this method also generates new training sequences, it also acts as a form of data augmentation.

Table 2: An overlap of 100 tokens between consecutive sequences improves the model's performance.

Overlapping Tokens | F1 Score Task 1
0 | 0.87893
10 | 0.87933
100 | 0.88556
200 | 0.88375
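The chunking scheme with overlapping context can be sketched as follows. This is our own minimal illustration of the described procedure; the function name is ours, and only the window and overlap sizes follow the paper.

<syntaxhighlight lang="python">
def split_with_overlap(token_ids, window_size=512, overlap=100):
    """Split a tokenized document into windows of at most `window_size` tokens,
    where consecutive windows share `overlap` tokens of context."""
    stride = window_size - overlap
    windows = []
    for start in range(0, max(len(token_ids) - overlap, 1), stride):
        windows.append(token_ids[start:start + window_size])
    return windows

# A 1,200-token document yields windows starting at positions 0, 412 and 824,
# each sharing 100 tokens with its predecessor.
document = list(range(1200))
for window in split_with_overlap(document):
    print(window[0], window[-1], len(window))
</syntaxhighlight>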
5 Results

Table 1 shows the results of the 10% model comparison training. All models that performed best on task 1 also performed best on task 2. For English, we selected two models, XLM-RoBERTa large and RoBERTa large, since their scores were about even. An Electra-based model achieved the best results for German, whereas, surprisingly, the English and Italian Electra models scored below the baseline Bert models. For French, we selected CamemBERT large, a 335-million-parameter RoBERTa-based model that scores notably better than CamemBERT base with its 110 million parameters. The digital library team at the Bavarian State Library (dbmdz) published two different Italian Bert-based models; the XXL version was trained on the larger corpus and achieved the best result. The multilingual XLM-RoBERTa base model achieved better scores than the older multilingual Bert model with the same number of parameters. The larger version of this model achieved the best multilingual score, on par with the language-specific models. Note that the scores of the multilingual models are evaluated on a multilingual development set.

We trained the selected models on the full training set for each task and evaluated them on the development sets. The results of this evaluation can be found in Table 3 for both subtasks 1 and 2. For both tasks, the large multilingual XLM-RoBERTa model outperformed all language-specific models. Therefore, we submitted our XLM-RoBERTa-based models for tasks 1 and 2. For Italian, the XLM-RoBERTa-based model scored notably better than the best language-specific model. However, for the other languages, the performance gains are not that significant. The scores of the German Electra-based model are comparable to those of XLM-RoBERTa, despite using 110 million parameters in contrast to the 550 million parameters of XLM-RoBERTa large. This indicates that there is room for possible performance improvements.

5.1 Final Models and Evaluation

Since the multilingual models outperformed almost all monolingual models, we selected these for subtasks 1 and 2. Furthermore, we submitted one smaller monolingual model to evaluate its performance on the test set and the out-of-domain test set (surprise test).

FullStop Multilingual Task 1: This model is based on the 550-million-parameter XLM-RoBERTa large model and was trained on the labeled data of task 1. Across all four languages, this model achieved an average F1 score of 0.94 on the test set and an average F1 score of 0.78 on the surprise test set.

FullStop German Task 1: This model is based on the 110-million-parameter German Electra base model. It was trained on the labeled data for task 1 and an additional data set consisting of speeches from the German parliament (Bundestag, 134 MB, https://github.com/Datenschule/offenesparlament-data) and a text crawl from the Leipzig corpora collection (245 MB, https://wortschatz.uni-leipzig.de/de/download/German), containing a mixture of news texts and Wikipedia articles. For German, this model achieved an F1 score of 0.95 on the test set and an F1 score of 0.80 on the surprise test set.

FullStop Multilingual Task 2: This model is also based on XLM-RoBERTa large and was trained on the labeled data for task 2. As shown in Figure 2 and Table 4, the model performs well on EOS marks across all languages. In contrast, the performance for colons and hyphens is lower. We suspect that this is due to the properties of the data set described in Section 3. We have seen that hyphens and colons are not only infrequent in the training data for all languages, they also exhibit unstable distribution patterns across languages. Intuitively, this is not surprising, as hyphens and colons are in many cases optional in the sense that they can be substituted by either a comma or a full stop, i.e. the rules for their usage are not only grammatical and syntactic but also stylistic. Performance increases might be achieved through targeted training with adversarial examples. The model achieves an average F1 of 0.78 on the test set. Similar to the other models, its performance degrades to an average F1 of 0.61 on the out-of-domain surprise set.

Inference on the complete test and surprise set (470 MB) takes about one hour for each multilingual FullStop model on an Nvidia 3090 GPU.
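For completeness, applying one of the released checkpoints amounts to running a token-classification pipeline over lowercased, unpunctuated text. The sketch below is ours; the checkpoint identifier is an assumption based on the repository linked in the introduction, and the exact label names returned by the model should be verified there.

<syntaxhighlight lang="python">
from transformers import pipeline

# The checkpoint identifier below is an assumption based on the repository
# linked in the introduction; check its README for the released model names.
model_id = "oliverguhr/fullstop-punctuation-multilang-large"

punctuate = pipeline("token-classification", model=model_id)

# The input must look like the shared-task data: lowercased and unpunctuated.
text = "hello how are you fine thanks"

for prediction in punctuate(text):
    # Each prediction carries a (sub)word, the predicted punctuation label
    # ("0" is expected to mean "no punctuation") and a confidence score.
    print(prediction["word"], prediction["entity"], round(prediction["score"], 3))
</syntaxhighlight>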
Table 3: All models for subtasks 1 and 2 were trained on the full data set for each language. For task 1, we report the F1 score of the sentence-end class; for task 2, the macro-average F1 score.

Model | Test Language | F1 Score Task 1 | F1 Score Task 2
roberta-large | EN | 0.941992 | 0.772326
xlm-roberta-large (English only) | EN | 0.938764 | 0.765496
electra-base-german-uncased | DE | 0.953894 | 0.795759
electra-base-german-uncased with data augmentation | DE | 0.954782 | –
camembert-large | FR | 0.937222 | 0.778617
bert-base-italian-xxl-uncased | IT | 0.919729 | 0.732624
xlm-roberta-large (multilingual) | EN | 0.945746 | 0.774601
xlm-roberta-large (multilingual) | DE | 0.958591 | 0.813861
xlm-roberta-large (multilingual) | FR | 0.941974 | 0.781834
xlm-roberta-large (multilingual) | IT | 0.934144 | 0.761775

Table 4: Per-class F1 scores of the FullStop Multilingual Task 2 model on the dev data set.

Label | EN | DE | FR | IT
, | 0.819 | 0.945 | 0.831 | 0.798
- | 0.425 | 0.435 | 0.431 | 0.421
. | 0.948 | 0.961 | 0.945 | 0.942
0 | 0.991 | 0.997 | 0.992 | 0.989
: | 0.575 | 0.652 | 0.620 | 0.588
? | 0.890 | 0.893 | 0.871 | 0.832
macro avg | 0.775 | 0.814 | 0.782 | 0.762

[Figure 2: Confusion matrices for the XLM-RoBERTa-based multilingual FullStop model for task 2. Note that all values are rounded.]

6 Key Findings

The type and amount of data used for pretraining has a significant impact on the final model's performance. Table 1 shows that, for Italian, there is a 5% difference in task 2 performance between the two monolingual Bert-based models. Both models use the same 110 million parameters of the Bert architecture but were trained on different corpus sizes: the "bert-base-italian-uncased" model was trained on a 13 GB corpus, whereas the "bert-base-italian-xxl-uncased" model was trained on an 81 GB corpus. The positive effect of larger corpus sizes on model performance has also been verified for other transformer architectures, for instance by Conneau et al. (2020) and Clark et al. (2020).

Model architectures do not work equally well for different languages. Electra is the best-performing monolingual German model, but for English and Italian, the results obtained with Electra are well behind those obtained with mono- and multilingual Bert models. We conducted a series of tests with different hyperparameters for the English Electra models but could not further improve the results.

Both tasks benefit from multilingual models and training data. To our surprise, the multilingual XLM-RoBERTa-based model outperformed all monolingual models, even though earlier multilingual Bert models were, in most cases, outperformed by their language-specific counterparts. We suspected that this could be explained by the much larger number of parameters used by XLM-RoBERTa large.
To test this hypothesis, we trained a monolingual English model based on XLM-RoBERTa and another English model based on the monolingual RoBERTa. As shown in Table 3, both models are outperformed by the multilingual XLM-RoBERTa model, showing that the model benefits from multilinguality. Although we have no direct explanation for the superior performance of the multilingual model, we would like to point out that it is in line with earlier work (Muller et al., 2021) confirming, for mBERT, that the lower layers of multilingual models act as multilingual encoders by representing linguistic knowledge for various languages. If this is true here as well, the larger number of multilingual training examples might indeed improve performance on the punctuation task. Our successful pruning experiments also point in this direction. However, these hypotheses need empirical validation.

Punctuation patterns are domain-specific, and robust punctuation prediction requires training on diverse data sets. The data set that we trained on (Europarl) consists of data from a single domain, i.e. political speeches. As our scores on the surprise set revealed, the performance of our models degrades on texts from other domains. The performance of our task 1 model drops from an average F1 of 0.94 across all languages on the in-domain test set to an F1 of 0.78 on the out-of-domain surprise set. The other models participating in the shared task suffer from similar performance degradations.

7 Ablations

What are the optimal hyperparameters for each model? We ran a hyperparameter search for the Adam optimizer using the Optuna framework (Akiba et al., 2019) with a budget of 200 trials on the German Electra base model. For the hyperparameter search, we configured the following search space: learning rates between 1e-2 and 1e-5, 1 to 5 training epochs, batch sizes from 2^2 to 2^7, weight decay from 1e-1 to 1e-12, and Adam epsilon from 1e-6 to 1e-10.
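The described search can be reproduced with Optuna roughly as follows. This is a minimal sketch under our own assumptions: the train_and_evaluate function is a placeholder for one fine-tuning run of the German Electra base model that returns the development F1 score.

<syntaxhighlight lang="python">
import random
import optuna

def train_and_evaluate(learning_rate, epochs, batch_size, weight_decay, adam_epsilon):
    # Placeholder: in the real setup this would fine-tune the German Electra
    # base model with Adam using the given hyperparameters and return the F1
    # score on the development set. It returns a dummy value here so that the
    # sketch runs on its own.
    return random.random()

def objective(trial):
    # Search space as reported above.
    params = {
        "learning_rate": trial.suggest_float("learning_rate", 1e-5, 1e-2, log=True),
        "epochs": trial.suggest_int("epochs", 1, 5),
        "batch_size": trial.suggest_categorical("batch_size", [4, 8, 16, 32, 64, 128]),
        "weight_decay": trial.suggest_float("weight_decay", 1e-12, 1e-1, log=True),
        "adam_epsilon": trial.suggest_float("adam_epsilon", 1e-10, 1e-6, log=True),
    }
    return train_and_evaluate(**params)

study = optuna.create_study(direction="maximize")  # maximize dev F1
study.optimize(objective, n_trials=200)            # budget of 200 trials
print(study.best_params)
</syntaxhighlight>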
We compared these settings with Adafactor (Shazeer and Stern, 2018), using a learning rate of 4e-5. For both optimizers, we trained models for tasks 1 and 2 on 10% of the training data. The results of this comparison are shown in Table 5. Adafactor matches the performance of Adam but eliminates the need for a time-consuming hyperparameter search; we therefore decided to use Adafactor for all models.

Table 5: In a comparison between Adam with optimized hyperparameters and Adafactor, we found only minor differences in the resulting F1 scores.

Task | Adafactor | Adam | Difference (Adafactor - Adam)
1 | 0.95007 | 0.95087 | -0.00080
2 | 0.75939 | 0.75587 | +0.00352

Is it possible to use one model for both tasks? The labels of task 2 are a superset of the labels of task 1; therefore, a model trained for task 2 can also be used for task 1. We remapped the classification output of the task 2 model by mapping the sentence-end labels "." and "?" to label 1 and all other labels to label 0. The results in Table 6 show that this method decreases the final scores only marginally. For many applications, it is therefore sufficient to train one model that processes all four languages for both tasks. For this shared task, we trained and submitted two different models, since a dedicated model for task 1 slightly improves the results.

Table 6: Comparison of the scores of the "FullStop Multilingual Task 1" model with the remapped output of the "FullStop Multilingual Task 2" model (remapped to match the labels of task 1). The remapping leads to a slightly decreased F1 score.

Language | Task 1 Model | Remapped Task 2 Model
en | 0.945746 | 0.941686
de | 0.958591 | 0.955926
fr | 0.941974 | 0.938254
it | 0.934144 | 0.930851
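The remapping from task 2 output to task 1 labels is a simple transformation; the following snippet is our own illustration of it.

<syntaxhighlight lang="python">
def task2_to_task1(task2_labels):
    """Map task 2 punctuation labels to binary task 1 EOS labels:
    "." and "?" mark a sentence end (1), every other label maps to 0."""
    return [1 if label in (".", "?") else 0 for label in task2_labels]

# Example: predictions for "hello, how are you? fine, thanks."
print(task2_to_task1([",", "0", "0", "?", ",", "."]))   # -> [0, 0, 0, 1, 0, 1]
</syntaxhighlight>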
Do we need a deep model for these tasks? For the purpose of the shared task, we did not aim at optimizing inference and training efficiency. However, we tested whether it is necessary to use all 12 layers of a Bert base model. To this end, we trained a set of models on 10% of the English data using 3, 6, 9, and 12 layers on task 1. To keep the results comparable, we used the same hyperparameters as for all other models, described in Section 4. The results in Table 7 show that with this simple layer-pruning approach it is possible to retain 99% of the model's performance while removing the last quarter of the layers. We suggest exploring more advanced optimization techniques in further studies.

Table 7: F1 scores of a pruned Bert base model at various levels of pruning. Scores are for an English model trained on 10% of the data.

Layers | Parameters | F1 Score Task 1
3 | 45,102,338 | 0.74758
6 | 66,365,954 | 0.84408
9 | 87,629,570 | 0.87776
12 | 108,893,186 | 0.88556
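One way to realize this kind of layer pruning with the Transformers library is to truncate the encoder's module list before fine-tuning. The sketch below is our own illustration for a Bert base token classifier and is not necessarily the exact code used for these experiments.

<syntaxhighlight lang="python">
from transformers import AutoModelForTokenClassification

def prune_bert_layers(model_name="bert-base-uncased", keep_layers=9, num_labels=2):
    """Load a BERT-based token classifier and keep only the lowest
    `keep_layers` transformer layers of its encoder."""
    model = AutoModelForTokenClassification.from_pretrained(
        model_name, num_labels=num_labels)
    # Truncate the stack of encoder layers and update the config so that the
    # model reports the pruned depth.
    model.bert.encoder.layer = model.bert.encoder.layer[:keep_layers]
    model.config.num_hidden_layers = keep_layers
    return model

model = prune_bert_layers(keep_layers=9)
# About 87.6 million parameters remain for 9 layers, in line with Table 7.
print(sum(p.numel() for p in model.parameters()))
</syntaxhighlight>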
8 Conclusion

In this paper, we have shown that transformer-based architectures can be successfully applied to the tasks of punctuation mark and sentence end prediction. To our surprise, monolingual models are outperformed by multilingual models, showing that these models can transfer knowledge across languages. For the future, we plan to improve on two main aspects. Firstly, we want to reduce the size of our models. Both "FullStop Multilingual" models use 550 million parameters, which leads to computationally expensive inference. In our ablations, we have demonstrated a first approach to reducing the number of parameters. Secondly, we would like to improve the out-of-domain performance of our models. The shared task surprise set showed that there is a performance degradation on texts from unseen domains. We will address this issue in future research.

Acknowledgments

This research has been funded by the European Social Fund (ESF), SAB grant number 100339497, and the European Regional Development Fund (ERDF) (ERDF-100346119). Anne-Kathrin Schumann has received funding through the SAB's technology startup scholarship (Technologiegründerstipendium).

References

Takuya Akiba, Shotaro Sano, Toshihiko Yanase, Takeru Ohta, and Masanori Koyama. 2019. Optuna: A next-generation hyperparameter optimization framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.

Mohammed Attia, Mohamed Al-Badrashiny, and Mona Diab. 2014. GWU-HASP: Hybrid Arabic Spelling and Punctuation Corrector. In Proceedings of the EMNLP 2014 Workshop on Arabic Natural Language Processing (ANLP), pages 148–154. Association for Computational Linguistics.

Xiaoyin Che, Cheng Wang, Haojin Yang, and Christoph Meinel. 2016. Punctuation Prediction for Unsegmented Transcript Based on Word Vector. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), pages 654–658. European Language Resources Association (ELRA).

Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. 2020. ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators. In ICLR.

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised Cross-lingual Representation Learning at Scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8440–8451. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics.

Philipp Koehn. 2005. Europarl: A Parallel Corpus for Statistical Machine Translation. In Proceedings of the 10th Machine Translation Summit, pages 79–86. AAMT.

Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. 2019. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 36(4):1234–1240.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. http://arxiv.org/abs/1907.11692.

Louis Martin, Benjamin Muller, Pedro Javier Ortiz Suárez, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah, and Benoît Sagot. 2020. CamemBERT: a Tasty French Language Model. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7203–7219. Association for Computational Linguistics.

Benjamin Muller, Yanai Elazar, Benoît Sagot, and Djamé Seddah. 2021. First Align, then Predict: Understanding the Cross-Lingual Ability of Multilingual BERT. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 2214–2231. Association for Computational Linguistics.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP 2014), pages 1532–1543. Association for Computational Linguistics.

Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. CoRR, abs/1910.01108.

Stefan Schweter and Sajawel Ahmed. 2019. Deep-EOS: General-Purpose Neural Networks for Sentence Boundary Detection. In Proceedings of the 15th Conference on Natural Language Processing (KONVENS 2019), pages 251–255.

Noam Shazeer and Mitchell Stern. 2018. Adafactor: Adaptive learning rates with sublinear memory cost. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 4596–4604. PMLR.

Monica Sunkara, Srikanth Ronanki, Kalpit Dixit, Sravan Bodapati, and Katrin Kirchhoff. 2020. Robust Prediction of Punctuation and Truecasing for Medical ASR. In Proceedings of the 1st Workshop on NLP for Medical Conversations, pages 53–62. Association for Computational Linguistics.

Marcos Vinícius Treviso, Christopher Shulby, and Sandra Maria Aluísio. 2017. Sentence Segmentation in Narrative Transcripts from Neuropsychological Tests using Recurrent Convolutional Neural Networks. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 315–325. Association for Computational Linguistics.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2020. Transformers: State-of-the-art Natural Language Processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics.