University of Regensburg @ SwissText 2021 SEPP-NLG: Adding Sentence Structure to Unpunctuated Text

Gregor Donabauer (University of Regensburg, Regensburg, Germany, gregor.donabauer@stud.uni-regensburg.de)
Udo Kruschwitz (University of Regensburg, Regensburg, Germany, udo.kruschwitz@ur.de)

Abstract

This paper describes our approach (UR-mSBD) to address the shared task on Sentence End and Punctuation Prediction in NLG Text (SEPP-NLG) organised as part of SwissText 2021. We participated in Subtask 1 (fully unpunctuated sentences – full stop detection) and submitted a run for every featured language (English, German, French, Italian). Our submissions are based on pre-trained BERT models that have been fine-tuned to the task at hand. We had recently demonstrated that such an approach achieves state-of-the-art performance when identifying end-of-sentence markers on automatically transcribed texts. The difference to that work is that here we use language-specific BERT models for each featured language. By framing the problem as a binary tagging task using the outlined architecture we are able to achieve competitive results on the official test set across all languages, with Recall, Precision and F1 ranging between 0.91 and 0.96, which makes us joint winners for Recall in two of the languages. The official baselines are beaten by large margins.

1 Introduction

Text normalization has always been a core building block of natural language processing, aimed at converting raw text into a more convenient, standard form (Jurafsky and Martin, 2020). Besides tokenization, stemming and lemmatization, this process includes sentence segmentation. What is interesting, though, is that text pre-processing and normalization is by no means a solved challenge.

The SwissText 2021 Shared Task 2: Sentence End and Punctuation Prediction in NLG Text is concerned with exactly this problem area. The goal is to develop approaches for sentence boundary detection (SBD) in unpunctuated text. Providing suitable solutions means fostering readability and restoring the text's original meaning.

We took part in Subtask 1 (fully unpunctuated sentences – full stop detection) of this challenge and did so for all featured languages. This report starts by contextualising the task as part of a short discussion of related work. We will then introduce our methodology, briefly describe the data and report results. Finally, we present some discussion and conclusions.

2 Related Work

Sentences are considered a fundamental information unit of written text (Jurafsky and Martin, 2020; Levinson, 1985). Therefore, many NLP pipelines in practice split text into sentences. Fact checking is just one – currently very popular – challenge where the automated detection of sentences within a stream of input data is essential. Fact checkers are increasingly turning to technology to help, including NLP (Arnold, 2020). These tools can help identify claims worth checking, find repeats of claims that have already been checked, or even assist in the verification process directly (Nakov et al., 2021). Most such tools rely on text as input and require the text to be split into sentences (Donabauer et al., 2021). For this and other application areas sentence segmentation will remain a challenging task, despite the fact that recent developments suggest that for some NLP tasks it is possible to achieve state-of-the-art performance without conducting any pre-processing of the raw data, e.g. (Shaham and Levy, 2021).
Sentence Boundary Detection (SBD) is an important and actually well-studied text processing step, but it typically relies on the presence of punctuation within the input text (Jurafsky and Martin, 2020). Even with such punctuation it can be a difficult task, e.g. (Gillick, 2009; Sanchez, 2019), and traditional approaches use a variety of architectures including CRFs (Liu et al., 2005) and combinations of HMMs, maximum likelihood as well as maximum entropy approaches (Liu et al., 2004). With unpunctuated texts (and a lack of word-casing information) it becomes a lot harder, as even humans find it difficult to determine sentence boundaries in this case (Stevenson and Gaizauskas, 2000). Song et al. (2019) simplify the problem by aiming to detect the sentence boundary within a 5-word chunk, using YouTube subtitle data. Using LSTMs they predict the position of each sample's sentence boundary but do not consider any chunks without a sentence boundary. Le (2020) presents a hybrid model (using BiLSTMs and CRFs) originally used for NER that was evaluated on SBD in the context of conversational data by preprocessing the CornellMovie-Dialogue and the DailyDialog datasets to obtain samples that neither contain sentence boundary punctuation nor word-casing information (they also predict whether a sentence is a statement or a question). Du et al. (2019) present a transformer-based approach to the problem, but they assume partially punctuated text and word-casing information. Recently, it was shown that a simple fine-tuned BERT model was able to improve on the state of the art on fully unpunctuated, case-folded input data (Donabauer et al., 2021).

3 System

3.1 General Architecture of UR-mSBD

The system architecture we use is adopted from our previous work, which achieved state-of-the-art performance on a very similar task (Donabauer et al., 2021). That architecture demonstrated the suitability of a BERT-based token classification approach for sentence end prediction in the context of improving text processing pipelines for fact-checking. The underlying idea is to treat the restoration of sentence boundary information as a problem similar to IO-tagging in named entity recognition. For the implementation we refer to our GitHub repository [1]. The last token of every sentence, i.e. the token after which a sentence boundary punctuation mark would follow, is labeled with EOS. In our previous work we predicted the beginning of a sentence rather than its end and therefore labeled the first token of every sentence with BOS. The out-of-context label O is assigned to all other tokens of the text. We fine-tuned a pre-trained BERT model on the problem and obtained high F1 scores for the desired positive class (sentence end), outperforming alternative approaches on different datasets. We use a softmax classification head predicting the label (EOS or O) by the highest probability at each token.

[1] https://github.com/doGregor/SBD-SCD-pipeline
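For illustration, such a token classification setup can be instantiated with the huggingface transformers library roughly as in the following minimal sketch. This is not the code from the repository referenced above; the checkpoint name, the example input and all variable names are illustrative assumptions, and the classification head would of course first have to be fine-tuned as described in Section 3.2 before its predictions become meaningful.

# Minimal sketch of a BERT-based token classification setup with the
# label set {O, EOS}; checkpoint name and example input are illustrative.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

LABELS = ["O", "EOS"]  # O = ordinary token, EOS = last token of a sentence

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=len(LABELS),
    id2label=dict(enumerate(LABELS)),
    label2id={label: i for i, label in enumerate(LABELS)},
)

# Unpunctuated, lower-cased input words (as provided in the shared-task files).
words = ["this", "is", "a", "test", "another", "sentence", "follows"]
encoding = tokenizer(words, is_split_into_words=True, return_tensors="pt")

with torch.no_grad():
    logits = model(**encoding).logits          # shape: (1, seq_len, 2)
probs = torch.softmax(logits, dim=-1)          # softmax classification head
pred_ids = probs.argmax(dim=-1)[0].tolist()    # highest-probability label per token

# Map sub-token predictions back to the original words (first sub-token wins).
word_ids = encoding.word_ids(batch_index=0)
seen = set()
for token_pos, word_id in enumerate(word_ids):
    if word_id is None or word_id in seen:
        continue
    seen.add(word_id)
    print(words[word_id], LABELS[pred_ids[token_pos]])

Taking the argmax over the softmax output corresponds to the "highest probability at each token" decision rule, and mapping each word to its first sub-token keeps the predictions aligned with the word-level labels of the shared task.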
3.2 Adjustments for the Shared Task

We apply two changes to the model fine-tuning process for this shared task:

• First of all, we are faced with four different languages and not just English texts. The two obvious options would be to use a multilingual language model or to choose a different language-specific pre-trained model for each of the languages, i.e. German, French, English and Italian. We decided to adopt language-specific BERT-base models, as Nozza et al. (2020) report that this yields better results than using mBERT, which is pre-trained on a multilingual corpus.

• Secondly, we change the process of sample construction. We handle the unpunctuated input text as one long chain of words. We originally split this chain into samples of 64 words and fine-tuned the model with a maximum sequence length of 128 BERT-specific tokens. Further experiments have shown that utilizing token sequences that are as long as possible (512 BERT tokens) yields the best results. Therefore, we pre-process the raw text data by sending it through the model's tokenizer first. Each time a batch of iterated words fits 512 BERT tokens, we create a sample from it. If a word at the transition between two samples would be ripped apart (as adding it entirely to the current sample would exceed 512 tokens), we put it at the beginning of a new sample and pad the rest of the previous one with special PAD tokens (see the sketch after this list).
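The packing step described in the second point can be sketched as follows. The function name, example data and choice of tokenizer here are illustrative assumptions rather than an excerpt from the released pipeline, and padding the closed samples up to the maximum sequence length is left to the tokenizer at fine-tuning time.

# Illustrative sketch of the sample-construction step: words are packed greedily
# into samples whose BERT sub-token count does not exceed the maximum sequence
# length; a word that would be split across two samples is moved entirely to the
# next sample (the remainder of the previous sample is later filled with [PAD]).
from transformers import AutoTokenizer

def build_samples(words, labels, tokenizer, max_tokens=512):
    samples, current_words, current_labels, current_len = [], [], [], 0
    for word, label in zip(words, labels):
        n_subtokens = len(tokenizer.tokenize(word))
        if current_len + n_subtokens > max_tokens and current_words:
            # The word would be ripped apart -> close the current sample and
            # start a new one beginning with this word.
            samples.append((current_words, current_labels))
            current_words, current_labels, current_len = [], [], 0
        current_words.append(word)
        current_labels.append(label)
        current_len += n_subtokens
    if current_words:
        samples.append((current_words, current_labels))
    return samples

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # illustrative choice
samples = build_samples(["hello", "world"] * 600, ["0", "1"] * 600, tokenizer)
print(len(samples), len(samples[0][0]))  # number of samples, words in first sample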
All other hyperparameters are kept in line with Donabauer et al. (2021), namely an epoch number of 3 and a batch size of 8 per device. Since we run the fine-tuning on 3 GPUs simultaneously, the batch size per iteration increases to 24. We also evaluated our approach on the datasets with tuned hyperparameters; however, it turned out that increasing the number of epochs to 5 leads to a deterioration of results.

4 Data and Setup

We participated in Subtask 1 (fully unpunctuated sentences – full stop detection) of SwissText's SEPP-NLG Shared Task 2. Before addressing the experimental setup we briefly describe the provided data sets. The challenge's domain is NLG text. Since there are no corpora that feature such data, nor manually corrected versions, the organizers selected Europarl [2] as source. This corpus includes transcribed text data originating from spoken text in many different languages. The data come in lowercase format and are already split up into tokens. Sentence boundary punctuation is removed; instead, labels are assigned that mark upcoming sentence ends. The last token of each sentence is labeled with '1', all remaining tokens with '0'.

The data are provided as multiple tab-separated value files grouped by language and set. The number of tokens per language and dataset is reported in Table 1. We explain our pre-processing with respect to a single set for a single language, e.g. the English evaluation set. Firstly, we read each tsv file one after the other and concatenate all tokens and labels into two long lists. During reading we save the order and length of the input files. By that we are able to reconstruct the original structure of the files later on. The list of tokens is fed into the model-specific tokenizer. If tokens are not recognized properly we replace them with 'nan'. Each time a batch of 512 BERT tokens is filled, we create a sample from it. Data are saved in CoNLL-2003 format (Tjong Kim Sang and De Meulder, 2003): tokens and labels are separated horizontally with spaces, and samples are separated vertically with empty lines. We use the tokenizer during pre-processing only to calculate the number of BERT tokens for each input word; the samples themselves consist of plain text tokens. Thus the dimension and order of the predicted labels correspond to the structure of the processed tsv files, and we can simply map our output back to the words in the input data.

[2] https://opus.nlpl.eu/Europarl.php
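A minimal sketch of this pre-processing step follows, assuming one token and its Subtask 1 label per tab-separated line; the paths, the alphabetical file ordering and the helper names are illustrative assumptions, not the exact code used for the submission. The samples passed to the writer are of the form produced by the packing sketch in Section 3.2.

# Sketch of the pre-processing described above: read all tsv files of one set,
# concatenate tokens and labels into two long lists while remembering each
# file's length (so predictions can later be mapped back), and write packed
# samples in CoNLL-2003 style (token and label separated by a space, samples
# separated by an empty line). The tokenizer-level replacement of unreadable
# tokens with 'nan' mentioned above is not shown here.
import csv
from pathlib import Path

def read_tsv_files(folder):
    tokens, labels, file_lengths = [], [], []
    for tsv_path in sorted(Path(folder).glob("*.tsv")):  # ordering is an assumption
        n = 0
        with open(tsv_path, encoding="utf-8") as f:
            for row in csv.reader(f, delimiter="\t"):
                if not row:
                    continue
                tokens.append(row[0])
                labels.append(row[1])
                n += 1
        file_lengths.append((tsv_path.name, n))
    return tokens, labels, file_lengths

def write_conll(samples, out_path):
    with open(out_path, "w", encoding="utf-8") as f:
        for words, labs in samples:
            for word, lab in zip(words, labs):
                f.write(f"{word} {lab}\n")
            f.write("\n")  # empty line separates samples

Because the per-file lengths are retained, the flat list of predicted labels can afterwards be split back into per-file chunks in the original order for submission.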
As mentioned earlier, we make use of language-specific models rather than mBERT. We briefly describe the respective models and the corpora they were trained on:

• English: classic BERT base uncased model, trained on English lowercase text (Devlin et al., 2019).

• German: BERT base uncased model, trained on a 16GB monolingual German corpus by dbmdz (the MDZ Digital Library team at the Bavarian State Library) [3].

• French: BERT base uncased model, trained on a 71GB monolingual French corpus (Le et al., 2020).

• Italian: BERT base uncased model, trained on 81GB of monolingual Italian text by dbmdz.

We make use of the PyTorch [4] version of the Python huggingface [5] transformers library to access the models and run the fine-tuning. We execute the scripts on 3 Nvidia GeForce RTX 2080 Ti GPUs with an overall memory size of 33GB.

[3] https://github.com/dbmdz/berts
[4] https://pytorch.org/
[5] https://huggingface.co/
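Putting the pieces together, the language-specific fine-tuning setup can be sketched as follows. The checkpoint identifiers are assumptions chosen to match the descriptions above (exact identifiers are not stated in this report), and only the hyperparameters named in Section 3.2 (3 epochs, batch size 8 per device) are reflected in the training arguments; everything else in the sketch is illustrative.

# Assumed mapping from language to a public checkpoint matching the
# descriptions above, plus training arguments mirroring Section 3.2.
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          TrainingArguments)

MODEL_FOR_LANGUAGE = {
    "en": "bert-base-uncased",                    # Devlin et al. (2019)
    "de": "dbmdz/bert-base-german-uncased",       # dbmdz German BERT (assumed)
    "fr": "flaubert/flaubert_base_uncased",       # FlauBERT, Le et al. (2020) (assumed)
    "it": "dbmdz/bert-base-italian-xxl-uncased",  # dbmdz Italian BERT, 81GB corpus (assumed)
}

def load_model_and_tokenizer(language):
    name = MODEL_FOR_LANGUAGE[language]
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForTokenClassification.from_pretrained(name, num_labels=2)
    return model, tokenizer

training_args = TrainingArguments(
    output_dir="./sbd-model",          # illustrative output path
    num_train_epochs=3,                # as stated in Section 3.2
    per_device_train_batch_size=8,     # 3 GPUs -> effective batch size of 24
)

A Trainer built from these arguments, the selected model and the packed CoNLL-style samples would then perform the fine-tuning for each language.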
5 Results

5.1 Baselines

The official baseline is produced using the spaCy NLP package. The organisers report scores for different pipeline versions, and we describe the best performing one for every language in Table 2. The official evaluation metrics are Precision, Recall and F1-score of the positive class label (i.e., sentence end).

As Table 2 illustrates, the F1-scores for English, German and French range from 0.32 to 0.47. For Italian the F1-metric collapses to 0.01, caused by a very low Recall of 0.00.

5.2 UR-mSBD

We summarise the results obtained when running our system, UR-mSBD, on the test data. For each language we also include scores obtained on the dev set as well as on the surprise test set that was introduced to check the generalizability of the different approaches.

Table 3 presents the results for the English data, Table 4 for German, Table 5 for French, and Table 6 for the Italian test data.

We see overall consistently high scores for all three metrics and across all languages when looking at the official test sets. An average of F1=0.93 aggregated over all languages places us just one percentage point behind the top performance. Looking at Recall, we actually end up being joint winners for the German and French test data.

The highest scores are reported for German (with Precision at 0.94, Recall at 0.96 and F1 at 0.95). All the scores for the test sets are above 0.90. For the surprise test set the results drop quite a bit but are still reasonably high given that the data is not representative of the data the system was trained on.

Across the board, all the baselines were beaten by large margins.

Language   Train        Dev         Test        Surprise Test
English    33,779,095   7,743,489   10,039,222  1,081,910
German     28,645,112   6,358,683   9,575,861   979,982
French     32,690,367   8,781,593   11,297,534  1,143,911
Italian    28,167,993   7,194,189   10,193,542  985,448

Table 1: Number of tokens in the respective data sets for each language.

Dataset   Precision  Recall  F1
Dev EN    0.49       0.23    0.32
Test EN   0.49       0.24    0.32
Dev DE    0.51       0.44    0.47
Test DE   0.49       0.44    0.46
Dev FR    0.71       0.24    0.36
Test FR   0.63       0.24    0.35
Dev IT    0.64       0.00    0.01
Test IT   0.51       0.00    0.01

Table 2: Highest baseline scores for EN, DE, FR, IT.

Dataset        Precision  Recall  F1
Dev            0.92       0.92    0.92
Test           0.91       0.92    0.92
Surprise Test  0.82       0.68    0.74

Table 3: UR-mSBD scores for English.

Dataset        Precision  Recall  F1
Dev            0.96       0.95    0.95
Test           0.94       0.96    0.95
Surprise Test  0.89       0.73    0.80

Table 4: UR-mSBD scores for German.

Dataset        Precision  Recall  F1
Dev            0.94       0.93    0.93
Test           0.93       0.94    0.93
Surprise Test  0.83       0.70    0.76

Table 5: UR-mSBD scores for French.

Dataset        Precision  Recall  F1
Dev            0.93       0.91    0.92
Test           0.91       0.93    0.92
Surprise Test  0.84       0.67    0.74

Table 6: UR-mSBD scores for Italian.

6 Discussion

For all featured languages our fine-tuned BERT-based predictions perform very well, with results for all three metrics (P/R/F1) in the 90s and being very competitive when compared with the other submissions for this shared task. This first of all demonstrates the power of transformer-based models and confirms findings we reported previously (Donabauer et al., 2021).

The fact that the baselines were outperformed by such large margins is perhaps a sign that non-neural approaches are not competitive for the task and data at hand.

We note that our approach performed best for German texts, which might be caused by a high similarity between the data the model was pre-trained on and the data sampled to form the training, dev and test sets for this task. It will be worth exploring whether for different data samples we observe a similar pattern or whether the differences are in fact not significant.

Taking a slightly broader perspective, we observe that the scores obtained here are similar to what we obtained when running our sentence boundary detection algorithm on a dataset comprising transcribed lectures given at Stanford University, first proposed by Song et al. (2019), and the DailyDialog dataset (Li et al., 2017), but that extending these datasets or creating a hybrid version resulted in significant drops in performance (Donabauer et al., 2021). It would therefore be interesting to see whether other approaches show similar patterns.

Another general pattern we read into the results is that there are only small differences when comparing results on the dev sets with the results on the test sets. We conclude that our approach can generalize to unseen data as long as the training data is representative of the data used for testing. The approach does, however, generalise less well on out-of-domain ('surprise') data, with F1-scores dropping between 0.15 and 0.18 compared to the Europarl sets. We still consider the results to be reasonably good, though, given they are on average over all languages only 0.03 behind the top-performing system.

7 Conclusions

We framed the task of full-stop prediction (Subtask 1 of Shared Task 2 at SwissText 2021) as a binary classification task over all input tokens, identifying whether each of these tokens should indicate the position of a full stop or not. Fine-tuning language-specific pre-trained BERT models for each of the four languages resulted in competitive results. Given the small difference in F1 of 0.01 compared to the top results reported for this competition for three of the languages (as well as aggregated over all languages), we will await statistical significance tests as our results may well turn out to be on par with the top results in this task.

Acknowledgements

This work was supported by the project COURAGE: A Social Media Companion Safeguarding and Educating Students, funded by the Volkswagen Foundation, grant number 95564.
References

Phoebe Arnold. 2020. The challenges of online fact checking. Technical report, Full Fact, London, UK.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Gregor Donabauer, Udo Kruschwitz, and David Corney. 2021. Making sense of subtitles: Sentence boundary detection and speaker change detection in unpunctuated texts. In Companion Proceedings of the Web Conference 2021 (WWW '21 Companion), New York, NY. ACM.

Jinhua Du, Yan Huang, and Karo Moilanen. 2019. AIG Investments.AI at the FinSBD task: Sentence boundary detection through sequence labelling and BERT fine-tuning. In Proceedings of the First Workshop on Financial Technology and Natural Language Processing, pages 81–87, Macao, China. Association for Computational Linguistics.

Dan Gillick. 2009. Sentence boundary detection and the problem with the U.S. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Short Papers, NAACL-Short '09, pages 241–244, USA. Association for Computational Linguistics.

Daniel Jurafsky and James Martin. 2020. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Current draft of third edition (30 Dec 2020).

Hang Le, Loïc Vial, Jibril Frej, Vincent Segonne, Maximin Coavoux, Benjamin Lecouteux, Alexandre Allauzen, Benoît Crabbé, Laurent Besacier, and Didier Schwab. 2020. FlauBERT: Unsupervised language model pre-training for French. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 2479–2490, Marseille, France. European Language Resources Association.

The Anh Le. 2020. Sequence labeling approach to the task of sentence boundary detection. In Proceedings of the 4th International Conference on Machine Learning and Soft Computing, ICMLSC 2020, pages 144–148, New York, NY, USA. ACM.

Joan Persily Levinson. 1985. Punctuation and the orthographic sentence: a linguistic analysis. Doctoral dissertation, City University of New York.

Yanran Li, Hui Su, Xiaoyu Shen, Wenjie Li, Ziqiang Cao, and Shuzi Niu. 2017. DailyDialog: A manually labelled multi-turn dialogue dataset. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 986–995, Taipei, Taiwan. Asian Federation of Natural Language Processing.

Yang Liu, Andreas Stolcke, Elizabeth Shriberg, and Mary Harper. 2004. Comparing and combining generative and posterior probability models: Some advances in sentence boundary detection in speech. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, pages 64–71, Barcelona, Spain. Association for Computational Linguistics.

Yang Liu, Andreas Stolcke, Elizabeth Shriberg, and Mary Harper. 2005. Using conditional random fields for sentence boundary detection in speech. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, ACL '05, pages 451–458, USA. Association for Computational Linguistics.

Preslav Nakov, David P. A. Corney, Maram Hasanain, Firoj Alam, Tamer Elsayed, Alberto Barrón-Cedeño, Paolo Papotti, Shaden Shaar, and Giovanni Da San Martino. 2021. Automated fact-checking for assisting human fact-checkers. CoRR, abs/2103.07769.

Debora Nozza, Federico Bianchi, and Dirk Hovy. 2020. What the [MASK]? Making sense of language-specific BERT models. CoRR, abs/2003.02912.

George Sanchez. 2019. Sentence boundary detection in legal text. In Proceedings of the Natural Legal Language Processing Workshop 2019, pages 31–38, Minneapolis, Minnesota. Association for Computational Linguistics.

Uri Shaham and Omer Levy. 2021. Neural machine translation without embeddings. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 181–186, Online. Association for Computational Linguistics.

Hye Jeong Song, Hong Ki Kim, Jong Dae Kim, Chan Young Park, and Yu Seop Kim. 2019. Inter-sentence segmentation of YouTube subtitles using Long-Short Term Memory (LSTM). Applied Sciences (Switzerland), 9(7).

Mark Stevenson and Robert Gaizauskas. 2000. Experiments on sentence boundary detection. In Proceedings of the Sixth Conference on Applied Natural Language Processing, pages 84–89, Morristown, NJ, USA. Association for Computational Linguistics.

Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, volume 4, pages 142–147, Morristown, NJ, USA. Association for Computational Linguistics.