UZH OnPoint at Swisstext-2021: Sentence End and Punctuation
    Prediction in NLG Text Through Ensembling of Different Transformers

        Andrianos Michail∗    Silvan Wehrli∗      Terézia Bucková
                            University of Zurich
      {andrianos.michail, silvan.wehrli, terezia.buckova}@uzh.ch


                        Abstract                                 Therefore, attempts in improving the quality of
     This paper presents our solutions for the Swiss-         such texts must also focus on a more precise predic-
     Text 2021 shared task “Sentence End and                  tion of punctuation. Consequently, this is an ongo-
     Punctuation Prediction in NLG Text”. We en-              ing research effort in the NLP community. Recent
     gaged with both subtasks (i.e., sentence end             developments in NLP (such as Transformers) offer
     detection and full-punctuation prediction) and           new possibilities to tackle punctuation prediction
     built systems for English, German, French                effectively. Some of these attempts are discussed
     and Italian. To tackle the punctuation pre-
                                                              in Section 2.
     diction problem, we ensemble multiple dif-
     ferently trained Transformer models (BERT,                  Following recent attempts, we propose an ensem-
     CamemBERT, Electra, Longformer, MPNet,                   ble system based on the Transformer architecture,
     XLM-RoBERTa, XLNet) and leverage their re-               where multiple models predict the punctuation sym-
     sults using a sliding window method during in-           bols of a given text. The results are then combined
     ference time. As a result, we achieve an F1              and the ﬁnal predictions are made. Our language-
     score of the positive class of 0.94 for English,         speciﬁc systems are able to predict punctuation for
     0.96 for German, 0.93 for French, and 0.93 for           English, German, French, and Italian texts and are
     Italian for the subtask 1 “sentence end detec-
                                                              on par – if not better – with current state-of-the-art
     tion” on the respective test sets. Furthermore,
     Macro F1 results on test sets for subtask 2              models that participated in the shared task.
     “full-punctuation prediction” for English, Ger-             Our main contributions include
     man, French and Italian are 0.78, 0.81, 0.78,                1. the exploration of different Transformer-based
     0.76 respectively.
                                                                     models and identiﬁcation of the most impor-
1    Introduction                                                    tant features which affect the performance for
                                                                     this task, and
Transcribed or translated texts often contain erro-
neous punctuation. Correct punctuation, however,                  2. a showcase that the ensembling of differently
is crucial for human understanding of a text, as                     trained models enhances the performance for
shown by Tündik et al. (2018). Rightly placed                       the punctuation prediction task.
punctuation not only makes the text more read-
able and intelligible but can change the meaning of           2     Related Work
sentences, as well. Translated texts pose another             Punctuation prediction tasks pose many challenges.
challenge: Different languages expose different               One of them is the restricted input length thus re-
sentence structuring conventions and hence use                stricted context for the Transformers. To solve the
punctuation very differently.                                 above mentioned limitation, Nguyen et al. (2019)
   However, systems for automatic transcription of            used an overlapped chunk method (i.e., an over-
speech nowadays focus on minimizing the Word                  lapping sliding window) combined with a capi-
Error Rate (WER), which omits punctuation (He                 talization and a punctuation model to tackle the
et al., 2011). As a result, the state-of-the-art sys-         punctuation problem in long documents. First,
tems are focused on the correct transcription of              the text is divided into chunks with overlapping
words and not necessarily correct segmentation of             segments. Second, a punctuation model (seq2seq
text or correct punctuation (Tündik et al., 2018).           LSTM, Transformer) predicts punctuation and cap-
    * Equal contribution. Order determined by coin ﬂip.       italization for every segment. Lastly, overlapped


    Copyright ©2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 Interna-
tional (CC BY 4.0).
chunk merging combines chunks by discarding a                  Language      Training     Evaluation
deﬁned number of tokens per overlapped chunk.                  English         11,028         10,521
   Courtland et al. (2020) changed the usual frame-            German          11,495         10,207
work of punctuation prediction to predicting punc-             French          12,276         13,366
tuation for the whole sequence rather than for sin-            Italian         10,379         10,502
gle tokens. The authors used a feedforward neural
                                                        Table 1: Mean token length per document in the train-
network. Similar to Nguyen et al. (2019), they ﬁnd      ing and evaluation dataset.
that using a sliding window approach improves
prediction performance. However, instead of pro-
ducing multiple predictions for the same token,         and experimental setup. Section 6 presents our
they sum activations before prediction and make         results and discusses the impact of used methods.
inference afterwards.                                   Finally, a conclusion is drawn in Section 7.
   Sunkara et al. (2020) used a joint learning objec-
tive for capitalization and punctuation prediction.     3     Dataset
The model input are sub-word embeddings. The            The Europarl Parallel Corpus (Koehn, 2005) serves
authors used the pre-trained BERT model (BERT           as the data source for the training, development,
base truncated to the ﬁrst six layers). They ﬁne-       and test set. The surprise test set (of an undisclosed
tuned the model on medical domain data because          domain during evaluation) is an out-of-domain
the medical domain was in the main scope of this        dataset that consists of a sample from the TED
paper. They also ﬁne-tuned the model for the punc-      2020 dataset (Reimers and Gurevych, 2020) with
tuation prediction task. The authors used masked        a low vocabulary overlap with the training data.
language learning objective while forcing half of       As provided by the organizers of the shared task,
the masked tokens to be punctuation marks.              samples in all datasets were lowercased and all
   Similarly, Nagy et al. (2021) also leveraged pre-    punctuation marks were removed.
trained BERT models (BERT base cased and un-               Subsequently, we outline challenges that we be-
cased and a smaller version for English; multilin-      lieve are especially relevant in solving this shared
gual and Hungarian-speciﬁc BERT versions for            task and thus directly inﬂuenced our proposed sys-
Hungarian). They added a two-layer multi-layer          tem architecture.
perceptron network with a soft-max output layer.
The model also used a sliding window approach to        3.1    Long Documents
enhance the results further. This model is trained      As shown in Table 1, the mean token length is
to predict four labels: empty (no punctuation),         many orders of magnitudes longer than what typi-
comma, period and question mark.                        cal Transformer architectures can process at once
   Our approach differs from the above men-             (typically up to 512 subtokens). It should be noted
tioned in using ensembling of multiple pre-trained      that some of the documents are especially long and
Transformer-based models ﬁne-tuned for the given        can contain up to 100,000 tokens. The most obvi-
task. Very importantly, our systems predict six         ous solution would be just to split documents into
different punctuation symbols for the punctuation       smaller sequences and subsequently merge predic-
prediction task.                                        tions. However, this approach lowers the context
   Additionally, a multilingual Transformer was         with which a model is confronted and might lead
used as a part of our ensemble. We hypothesize          to lower prediction quality (presumably at the be-
that it would be able to capture more accurately the    ginning and end of a sequence).
multilingual content of the EuroParl data. Further-
more, low-resource Latin languages might bene-          3.2    Multilingual Content
ﬁt from pre-training on more data, e.g., including      To some extent, documents in the EuroParl Corpus
other Latin languages.                                  contain multilingual content. As shown in the ex-
   In Section 3 we will discuss the datasets we used    amples in Table 2, many documents contain names
and the challenges they provide. The problem is         of people and areas that reﬂect the multilingualism
described in Section 4 together with detailed de-       of the participants of the European Parliament, i.e.,
scription of our approach. Section 5 contains expla-    members come from all over Europe. Therefore,
nation of used hyperparameters, technical details       using pre-trained models trained on monolingual
      Sentence Excerpt                                          Language       Transformers - Base
      i agree completely with mr pöttering                     English        Electra, Longformer,
      and with you too mr swoboda                                              MPNet, XLNet
      the president of the european commission                  German         BERT, Electra‡ ,
      josé manuel durão barroso however                                      XLM-RoBERTa
                                                                French         CamemBERT‡ , Electra,
Table 2: Excerpts from the English training data that                          XLM-RoBERTa
contain multilingual content.
                                                                Italian        BERT, Electra‡ ,
                                                                               XLM-RoBERTa
data only may result in an inaccurate representation      Table 4: Transformer models that were used for each
of this content.                                          language-speciﬁc model. Models marked with ‡ were
                                                          used twice: Once trained without weighted loss and
3.3     Imbalanced Class Distribution                     once with weighted loss.

The class distribution of the training and evaluation
set, as shown by Table 3, presents a rather typical       4.2   Transformers
situation in machine learning: Some of the classes
                                                          The corner-stones of our systems are pre-trained
have very few examples compared to the biggest
                                                          Transformer models. We trained four different ﬁne-
classes. Neglecting this circumstance will likely
                                                          tuned models for each language and combined the
lead to low performance for minority classes. Us-
                                                          predictions using majority vote ensembling (see
ing typical techniques such as class-speciﬁc loss
                                                          Section 4.5). Table 4 provides an overview.
weights or data augmentation might improve per-
                                                              Electra (Clark et al., 2020) is trained as a dis-
formance to some extent. We have tried to reduce
                                                          criminator, and the authors suggest that it is more
this problem by adding a model with altered loss
                                                          suitable for downstream sequence labelling tasks.
weights to the ensemble.
                                                          In fact, we can further support this claim because
                                                          this model architecture was the best-performing
      Punctuation     Training     Development
                                                          single model for all languages except French (see
      :                 43,133            9,490
                                                          Table 5 and 6).
      ?                 44,290            9,815
      -                 80,916           18,335               Both MPNet (Song et al., 2020) and XLNet
      .              1,396,166          319,751           (Yang et al., 2019) are trained (slightly differently)
      ,              1,759,686          401,095           through permuted language modelling, allowing
      0             30,454,904        6,985,003           a better understanding of bidirectional contexts,
                                                          which is often needed with punctuation. Both of
Table 3: The distribution of class labels for English     these single models performed exceptionally well
for the training and development set. 0 indicates the     in our experiments.
absence of a punctuation mark. The distributions for          Longformer (Beltagy et al., 2020), due to its
German, French and Italian are similar.
                                                          local windowed attention with a task motivated
                                                          global attention, can process larger sequence
                                                          lengths (up to 4096) and perform well on the longer
4     Methods                                             documents of this task.
                                                              XLM-RoBERTa (Conneau et al., 2019) is a mul-
4.1     Problem Modelling                                 tilingual transformer that is trained on over 100
We modelled this problem as a token classiﬁcation         languages. In our experiments it was demonstrated
task. More precisely, each token is assigned a label      to be the best performing multilingual model.
representing the following punctuation symbol (if             The authors of CamemBERT (Martin et al.,
any). We concentrated our main efforts and focus          2019) show that it performs exceedingly well in
on the full punctuation prediction. As such, we           NER token classiﬁcation. Moreover, the good per-
built all of the models to be able to predict all punc-   formance translated to our French full-punctuation
tuation symbols. For the end of sentence prediction       prediction experiments.
task, we mapped predictions of ‘.’ ‘?’ to 1 and the           BERT (Devlin et al., 2018) has models pre-
rest to 0.                                                trained in multiple languages. We used language-
speciﬁc BERT models as part of German, French             by sacriﬁcing overall performance, which, in re-
and Italian ensembles.                                    turn, helps an ensemble to create more accurate
                                                          predictions.
4.3   Sliding Window
                                                             Initially, we used inverted class frequencies as
As discussed earlier, documents in the corpus can         loss weights. However, this approach turned out to
be rather long, and typical Transformers cannot pro-      be too aggressive (worse minority class and over-
cess such documents at once. Therefore, instead           all performance). Further, we experimented with
of simply splitting the documents into smaller seg-       increasing minority class (‘-’, ‘:’) weights. Ini-
ments, sequences are overlapped for inference. In         tial experiments showed showed that weights set
other terms, a sliding window is applied, as sug-         to three for minority classes and one for majority
gested by Nguyen et al. (2019). Subsequently, the         classes performed best on the development set. Our
overlapped sequences are merged back together by          approach is rather heuristic, and further experimen-
discarding half of the overlapped tokens at the beg-      tation may lead to better results.
ging and end of each sequence. Our experiments
have shown that an overlap of 40 tokens performs          4.5    Majority Vote Ensembling
best. Consequently, we chose this overlap length
for the ﬁnal models.                                      We did preliminary experiments in separate stack-
                                                          ing models as mentioned in Wolpert (1992) as well
4.4   Weighted Loss                                       as ensembling using the arithmetic average of class
For the German, French and Italian ensembles, we          probabilities of single models as described in Good-
retrained the best performing model with weighted         fellow et al. (2014). However, one technique was
loss. We set the weights to three for the two             shown to be more effective: majority vote ensem-
least performing classes (‘-’, ‘:’) and left them         bling. More concretely, all the models predict (i.e.,
unchanged for the other classes (i.e., a weight of        vote) and the most voted label is then used as the
one). The idea is to increase recall for these classes    ﬁnal prediction. In case of a tie, the least common

          Language                                  Models                                  Ensemble
                        Electra   Longformer         MPNet               XLNet
          English
                        0.940     0.934              0.940               0.937              0.943
                        Electra   XLM-RoBERTa        BERT                Electra‡
          German
                        0.954     0.952              0.950               0.953              0.955
                        Electra   XLM-RoBERTa        CamemBERT           CamemBERT‡
          French
                        0.923     0.926              0.930               0.928              0.933
                        Electra   XLM-RoBERTa        BERT                Electra‡
          Italian
                        0.922     0.918              0.918               0.919              0.926

Table 5: Positive class (sentence end) F1 results on the development set for all single models and the correspond-
ing ensemble for sentence end prediction. Models marked with ‡ denote a model trained with weighted loss as
described in subsection 4.4.

          Language                                  Models                                  Ensemble
                        Electra   Longformer         MPNet               XLNet
          English
                        0.769     0.760              0.768               0.763              0.777
                        Electra   XLM-RoBERTa        BERT                Electra‡
          German
                        0.803     0.795              0.792               0.805              0.812
                        Electra   XLM-RoBERTa        CamemBERT           CamemBERT‡
          French
                        0.758     0.761              0.769               0.770              0.778
                        Electra   XLM-RoBERTa        BERT                Electra‡
          Italian
                        0.746     0.732              0.741               0.739              0.755

Table 6: Macro F1 results on the development set for all single models and the corresponding ensemble for full-
punctuation prediction. Models marked with ‡ denote a model trained with weighted loss as described in subsection
4.4.
                                  Development                      Test            Surprise Test
                 Language
                                 P     R    F1           P          R      F1     P     R     F1
                   English      0.93 0.96 0.94          0.93       0.95   0.94   0.84 0.75 0.80
                   German       0.95 0.96 0.96          0.95       0.96   0.96   0.89 0.77 0.82
                   French       0.92 0.94 0.93          0.92       0.94   0.93   0.82 0.72 0.77
                   Italian      0.91 0.95 0.93          0.90       0.95   0.93   0.83 0.71 0.77

Table 7: Ensembling positive class (sentence end) F1 results on the development, test and surprise test set for
sentence end prediction.

                                  Development                      Test            Surprise Test
                 Language
                                 P     R    F1           P          R      F1     P     R     F1
                   English      0.82 0.75 0.78          0.81       0.75   0.77   0.65 0.59 0.62
                   German       0.82 0.80 0.81          0.82       0.80   0.81   0.66 0.65 0.65
                   French       0.80 0.76 0.78          0.78       0.77   0.77   0.63 0.60 0.61
                   Italian      0.77 0.74 0.76          0.77       0.74   0.75   0.57 0.55 0.56

    Table 8: Ensembling Macro F1 results on the evaluation, test and surprise test set for punctuation prediction.


label is chosen. Additionally, predictions for a hy-           5.3    Experimental Setup
phen are counted twice – mainly to increase the
                                                               We trained all of the models on a single T4 GPU
performance for the worst-performing label (which
                                                               instance. Our ﬁnal models shared some of the hy-
was the case for all languages). Our experiments on
                                                               perparameters, namely a learning rate of 4e−5, a
the development set have shown that this leads to an
                                                               batch size of 16 (four for Longformer) and the max-
increase of 1-2% Macro F1 score for all languages
                                                               imum sequence length (512, 4096 for Longformer).
compared to the single best performing model.
                                                               We trained each model for ﬁve epochs.

5       System Architecture                                    6     Results & Discussion
5.1      Hyperparameter Setup                               Our results for sentence end prediction and full
                                                            punctuation prediction can be seen in Table 7 and
At the beginning of development, we empirically
                                                            Table 8, respectively. They demonstrate the high ca-
determined what characteristics of the model and
                                                            pability of using Transformers in predicting punctu-
ﬁne-tuning correlate with better performance. For
                                                            ation marks. Especially for sentence end prediction,
ﬁne-tuning, ﬁve epochs performed consistently
                                                            the F1 scores are well above 90% for all languages.
well for all transformer architectures. Due to the
                                                            We hypothesize that it is because usage of sentence
large document size, the larger the maximum se-
                                                            end punctuation is less ambiguous – it is consis-
quence length, the better the performance. To our
                                                            tently and grammatically correctly used in the data.
surprise, there were no signiﬁcant differences be-
                                                            For full punctuation prediction, the overall perfor-
tween the performance of cased vs. uncased Trans-
                                                            mance is signiﬁcantly lower for all languages. The
formers on our lower-cased data.
                                                            full punctuation prediction task is more difﬁcult
                                                            not only because of the existence of more labels,
5.2      Technical Implementation                           but also because some of the labels might not fol-
For the training of our models, we used the Simple          low strict grammatical rules. For example ‘-’ or a
Transformers 1 library, a wrapper for the Hugging           ‘:’ can be used differently due to different styles
Face 2 library, that allows for fast experimenting.         of linguistic expressions, while a label such as a
As the Simple Transformers library does not sup-            comma might be misplaced due to human error.
port weighted loss training, we have adapted the               With respect to our system, sliding windows are
relevant code for this purpose.                             a simple way to improve performance when an in-
                                                            put sequence is much longer than what a model
    1
        https://simpletransformers.ai                       can actually process. However, this performance
    2
        https://huggingface.co                              gain is limited, and as of now, it is not clear how
this compares to a model that can process much            Xiaodong He, Li Deng, and Alex Acero. 2011. Why
longer sequences. Observing results we have ob-             word error rate is not a good metric for speech
                                                            recognizer training for the speech translation task?
tained from single models at Table 5 and 6 for both
                                                            In 2011 IEEE International Conference on Acous-
subtasks we can see that the model architecture             tics, Speech and Signal Processing (ICASSP), pages
has an effect on performance. Within our experi-            5632–5635. IEEE.
ments, majority vote ensembling further enhances
                                                          Philipp Koehn. 2005. Europarl: A parallel corpus for
performance.                                                statistical machine translation. In MT summit, vol-
                                                            ume 5, pages 79–86. Citeseer.
7   Conclusion
                                                          Louis Martin, Benjamin Muller, Pedro Javier Ortiz
In this paper, we showed that the ensembling of             Suárez, Yoann Dupont, Laurent Romary, Éric Ville-
diversely trained Transformers can yield signiﬁcant         monte de la Clergerie, Djamé Seddah, and Benoı̂t
improvement and allows for good generalisation              Sagot. 2019. Camembert: a tasty french language
for punctuation prediction in out-of-domain exam-           model. arXiv preprint arXiv:1911.03894.
ples. From this work, it can be seen that combin-         Attila Nagy, Bence Bial, and Judit Ács. 2021. Au-
ing different Transformers can be really beneﬁcial.         tomatic punctuation restoration with bert models.
However, further work is needed to determine if             arXiv preprint arXiv:2101.07343.
more advanced ensembling techniques could fur-            Binh Nguyen, Vu Bao Hung Nguyen, Hien
ther increase the quality of the predictions.               Nguyen, Pham Ngoc Phuong, The-Loc Nguyen,
                                                            Quoc Truong Do, and Luong Chi Mai. 2019. Fast
Acknowledgments                                             and accurate capitalization and punctuation for
                                                            automatic speech recognition using transformer
We want to thank Simon Clematide and Phillip                and chunk merging. In 2019 22nd Conference of
Ströbel for their valuable inputs and the Departe-         the Oriental COCOSDA International Committee
ment of Computational Linguistics for providing             for the Co-ordination and Standardisation of
                                                            Speech Databases and Assessment Techniques
us with the necessary technical infrastructure.             (O-COCOSDA), pages 1–5. IEEE.
                                                          Nils Reimers and Iryna Gurevych. 2020.        Mak-
References                                                  ing monolingual sentence embeddings multilin-
                                                            gual using knowledge distillation. arXiv preprint
Iz Beltagy, Matthew E Peters, and Arman Cohan.              arXiv:2004.09813.
   2020. Longformer: The long-document transformer.
   arXiv preprint arXiv:2004.05150.                       Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-
                                                            Yan Liu. 2020. Mpnet: Masked and permuted pre-
Kevin Clark, Minh-Thang Luong, Quoc V Le, and               training for language understanding. arXiv preprint
  Christopher D Manning. 2020. Electra: Pre-training        arXiv:2004.09297.
  text encoders as discriminators rather than genera-
  tors. arXiv preprint arXiv:2003.10555.                  Monica Sunkara, Srikanth Ronanki, Kalpit Dixit, Sra-
                                                           van Bodapati, and Katrin Kirchhoff. 2020. Robust
Alexis Conneau, Kartikay Khandelwal, Naman Goyal,          prediction of punctuation and truecasing for medical
  Vishrav Chaudhary, Guillaume Wenzek, Francisco           asr. In Proceedings of the First Workshop on Natu-
  Guzmán, Edouard Grave, Myle Ott, Luke Zettle-           ral Language Processing for Medical Conversations,
  moyer, and Veselin Stoyanov. 2019. Unsupervised          pages 53–62.
  cross-lingual representation learning at scale. arXiv
  preprint arXiv:1911.02116.                              Máté Akos Tündik, György Szaszák, Gábor Gosztolya,
                                                            and András Beke. 2018. User-centric evaluation of
Maury Courtland, Adam Faulkner, and Gayle McEl-             automatic punctuation in asr closed captioning.
 vain. 2020. Efﬁcient automatic punctuation restora-
 tion using bidirectional transformers with robust in-    David H Wolpert. 1992. Stacked generalization. Neu-
 ference. In Proceedings of the 17th International          ral networks, 5(2):241–259.
 Conference on Spoken Language Translation, pages
 272–279.                                                 Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Car-
                                                            bonell, Ruslan Salakhutdinov, and Quoc V Le.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and               2019. Xlnet: Generalized autoregressive pretrain-
   Kristina Toutanova. 2018. Bert: Pre-training of deep     ing for language understanding. arXiv preprint
   bidirectional transformers for language understand-      arXiv:1906.08237.
   ing. arXiv preprint arXiv:1810.04805.
Ian J Goodfellow, Jonathon Shlens, and Christian
   Szegedy. 2014. Explaining and harnessing adversar-
   ial examples. arXiv preprint arXiv:1412.6572.