The Sentence End and Punctuation Prediction in NLG Text (SEPP-NLG) Shared Task 2021

Don Tuggener, Ahmad Aghaebrahimian
Zurich University of Applied Sciences (ZHAW), Winterthur, Switzerland
{tuge, agha}@zhaw.ch

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

This paper describes the first Sentence End and Punctuation Prediction in Natural Language Generation (SEPP-NLG) shared task (https://sites.google.com/view/sentence-segmentation/) held at the SwissText conference 2021. The goal of the shared task was to develop solutions for the identification of sentence boundaries and the insertion of punctuation marks into texts produced by NLG systems. The data and submissions (https://drive.switch.ch/index.php/s/g3fMhMZU2uo32mf) as well as the codebase (https://github.com/dtuggener/SEPP-NLG-2021) for the shared task are publicly available.

1 Introduction

Sentence end detection, also known as sentence boundary disambiguation (SBD) or boundary detection, is the Natural Language Processing (NLP) task of recognizing where a sentence begins and ends. A period is the most common end-of-sentence indicator in written English as well as in many other Indo-European languages. However, a period may also occur in a decimal number, an abbreviation, an email address, and other contexts, which makes sentence boundary detection a challenge. Other forms of punctuation such as question and exclamation marks, semicolons, commas, etc. add to this challenge. Although sentence boundary detection is considered an almost solved problem for formal written language (Walker et al., 2001), it poses a challenge in terms of meaning distortion and readability in synthetic or automatically translated or transcribed texts such as the output of Automatic Speech Recognition (ASR) or Machine Translation (MT) systems. The punctuation marks in such synthetic text may be displaced for several reasons. Detecting the end of a sentence and placing an appropriate punctuation mark improves the quality of such texts not only by preserving the original meaning but also by enhancing their readability.

The goal of the SEPP-NLG shared task is to build models that identify sentence ends and place an appropriate punctuation mark at the appropriate positions.

2 Related Work

The earliest attempts at sentence boundary detection, such as the system proposed by Grefenstette and Tapanainen (1997), utilize sets of rules or regular expressions. In a different direction, Reynar and Ratnaparkhi (1997) and Kiss and Strunk (2006) proposed an information-centric approach based on the Maximum Entropy model and an unsupervised method based on collocation statistics, respectively. Decision tree (Riley, 1989), Naïve Bayes (López and Pardo, 2015), and deep learning based (Kaur and Singh, 2019) models are the most recent machine learning advances proposed for predicting correct positions for the period in particular and other punctuation marks in general. Combining rule-based and machine learning based approaches, Deepamala and Ramakanth (2012) proposed a hybrid system with high performance.

Our task is closely related to Tilk and Alumäe (2016) and follow-up work that uses the Europarl and TED talk corpora for punctuation prediction.
Similar to our goal, Żelasko et al. (2018) and Donabauer et al. (2021) investigate sentence boundary detection in unpunctuated ASR outputs of spoken dialogues based on textual features. Cho et al. (2017) propose a method to predict sentence boundaries and punctuation insertion in a real-time spoken language translation tool. In a similar setting, Klejch et al. (2017) include acoustic features to improve punctuation prediction in a speech translation system, and Yi and Tao (2019) combine lexical and speech features for punctuation prediction in a traditional ASR setting. Finally, Rehbein et al. (2020) investigate the annotation and detection of sentence-like units in spoken language transcripts.

3 Task Overview

Ultimately, the goal of SEPP-NLG is to predict sentence ends and punctuation in NLG texts. However, there are no corpora that feature NLG texts together with their manually transcribed and corrected versions. Therefore, we approximate the setting by using a) transcripts of spoken texts, and b) lower-casing the texts and removing all punctuation marks. While there are multiple corpora of transcribed spoken language, we choose the Europarl corpus (http://www.statmt.org/europarl/) (Koehn, 2005) as the source for our data. The Europarl corpus consists of transcripts of the sessions of the European Parliament and features transcripts in multiple languages.

We offer the following subtasks:

• Subtask 1 (fully unpunctuated sentences, full stop detection): Given the textual content of an utterance where all punctuation marks are removed, correctly detect the end of sentences by placing a full stop in the appropriate positions.

• Subtask 2 (fully unpunctuated sentences, full punctuation marks): Given the textual content of an utterance where all punctuation marks are removed, correctly predict all punctuation marks.

Participants were free to choose for which languages and subtasks they contributed a submission, but were encouraged to participate in all languages.

3.1 Data

We leverage the open parallel corpus (OPUS) version of the Europarl corpus (https://opus.nlpl.eu/Europarl.php) (Tiedemann, 2012) for extracting the task data, as it provides sentence boundaries and tokenization. Although the sentence boundaries in the corpus are automatically generated, they are quite reliable, as the data the boundary detection models were trained on contains all the original punctuation symbols of the transcripts. In the spirit of the "Swissness" of the SwissText conference where SEPP-NLG 2021 is co-located, we select three of the four official languages of Switzerland, i.e. German, French, and Italian (the fourth, Romansh, is not represented in Europarl), and complement the selection by incorporating English. Incorporating further languages from the OPUS corpus using our scripts is seamless, as the data format is consistent across languages.

The Europarl corpus contains multiple punctuation symbols. For subtask 2, we gauged which subset of them represents a realistic and feasible target for automatic prediction in a stream of unpunctuated, lower-cased tokens. We also considered which punctuation marks improve the readability of a text the most. Hence, we consolidated the selection of punctuation symbols for subtask 2 to ":", "-", ",", "?", and "." plus the label "0" (indicating no punctuation), and mapped the symbols "!" and ";" to ".", the period. We removed all sentences from the data that contain other punctuation symbols, such as parentheses, as there is no straightforward way to remove such punctuation without interfering with the naturalness of a sentence. This removal affected the data for both subtasks and resulted in discarding less than 10% of the data per language. We also removed HTML artifacts and special (non-visible) characters (zero-width space, soft hyphen) from the data. Finally, we omitted sentences with fewer than 3 tokens and documents with fewer than 2 sentences.
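As an illustration of the consolidation described above, a token's subtask 2 label can be derived from the punctuation symbol that follows it. This is a minimal sketch under stated assumptions, not the organizers' actual preprocessing script (which is available in the linked repository).

```python
# Punctuation consolidation for subtask 2, as described above.
KEPT = {":", "-", ",", "?", "."}   # retained subtask 2 labels (plus "0")
REMAP = {"!": ".", ";": "."}       # symbols mapped to the period

def subtask2_label(following_symbol):
    """Return the subtask 2 label a token emits, given the punctuation
    symbol (if any) that directly follows it in the original text.
    Returns None if the symbol is outside the selection, in which case
    the whole sentence is dropped from the data."""
    if following_symbol is None:
        return "0"
    symbol = REMAP.get(following_symbol, following_symbol)
    return symbol if symbol in KEPT else None

print(subtask2_label("!"))    # "." -- the exclamation mark maps to the period
print(subtask2_label(None))   # "0" -- no punctuation follows the token
print(subtask2_label("("))    # None -- the sentence would be removed
```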
The data format is as follows: the lower-cased tokens of each file are listed vertically, and the labels for subtask 1 (binary classification) and subtask 2 (multi-class classification) are appended horizontally, separated by tabs. The labels encode whether a token emits a sentence end (subtask 1) and a punctuation symbol (subtask 2). Table 1 shows an example.

Token        Label 1  Label 2
the          0        0
next         0        0
item         0        0
is           0        0
the          0        0
commission   0        0
statement    0        0
on           0        0
the          0        0
referendum   0        0
in           0        0
venezuela    1        .
member       0        0
of           0        0
the          0        0
commission   0        .
madam        0        0
president    0        ,
the          0        0

Table 1: Example of the data format.

Per language, we randomly selected 80% of the documents for the training set and 20% for the test set. From the training set, we then randomly sampled 20% of the documents as the development set.
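To illustrate the format shown in Table 1, the following minimal sketch reads one such tab-separated file. The file name is a placeholder, and the official data loaders in the task repository may differ in details.

```python
from collections import Counter

def read_task_file(path):
    """Yield (token, subtask1_label, subtask2_label) triples from one file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            token, label1, label2 = line.split("\t")
            yield token, label1, label2

# Example: the subtask 2 label distribution of a single (hypothetical) file;
# the counts over the full English test set are shown in Table 3.
label_counts = Counter(l2 for _, _, l2 in read_task_file("example_document.tsv"))
print(label_counts)
```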
Table 2 shows several statistics of our data. We see similar properties for all languages: most sentences are unique, and there are few sentences that occur in both the train and test sets (duplicate sentences are often formulaic, administrative ones, like "The session is adjourned."). German features the largest vocabulary, as is expected due to its morphological richness, and the vocabulary overlap between train and test sets is roughly 50% for all languages.

Lang  #sentences  unique     train∩test  #tokens     unique   train∩test
EN    1'406'577   1'382'738  2'660       33'779'095  88'370   43'744
DE    1'308'508   1'276'691  2'806       28'645'112  294'035  112'000
FR    1'236'504   1'215'981  2'081       32'690'367  103'774  57'112
IT    1'132'554   1'112'742  1'746       28'167'993  131'024  67'626

Table 2: Training data statistics, showing the number of (unique) sentences and tokens and the number of sentences and tokens that occur in both the training and test set (train∩test) per language.

Concerning the labels, the data is highly skewed towards the 0 label for both tasks, as most tokens do not emit a sentence end or punctuation symbol. For example, there are 9'618'776 tokens with label 0 and 420'446 tokens with label 1 for subtask 1 in the English test set, which yields an average sentence length of almost 24 tokens. Table 3 shows a breakdown of the label counts in the English test set for subtask 2. It shows that the period and comma symbols have similar counts and are the most frequent labels among the non-0 labels. The remaining labels occur at least an order of magnitude less frequently. These label distribution properties are similar across all languages.

Label  Count
0      9'050'256
,      521'594
.      417'560
-      23'600
:      13'146
?      13'066

Table 3: Label distribution for subtask 2 in the English test set.

3.2 Surprise Test Data

The Europarl corpus covers domain-specific language, i.e. political statements in the European Parliament. To measure how well the participating systems trained on our data generalize to out-of-domain data, we incorporated a surprise test set comprised of TED talk transcripts (https://opus.nlpl.eu/TED2020.php) (Reimers and Gurevych, 2020). For each language, we sampled 500 TED talks, favoring those with the lowest vocabulary overlap with our Europarl test sets to maximize the vocabulary shift. The document-based average percentage of vocabulary overlap ranges from 85% to 90%, meaning that on average 10-15% of the tokens per document in the surprise test set do not occur in the Europarl test set.

While being one order of magnitude smaller than the Europarl test set, the surprise test set is also highly and similarly imbalanced regarding the label distribution. In the English surprise test set, there are 67'446 tokens with label 1 and 1'014'464 tokens with label 0. This yields an average sentence length of 16 tokens, which is significantly lower than the 24 tokens in the English Europarl test set. The label counts for subtask 2 follow an almost identical distribution in both test sets.

4 Submissions

ZHAW-mbert: We provided a baseline based on the multilingual BERT model (Devlin et al., 2019), mBERT, implemented in the simpletransformers library (https://github.com/ThilinaRajapakse/simpletransformers). We treat the task as a token classification problem and segment the documents into subsequent, non-overlapping chunks of length 512 to adhere to the sequence length restrictions of BERT. We fine-tuned the model on the training data of all languages, with a randomly shuffled file order across all languages and vanilla settings, for about one week on a single GPU.

ZHAW-adapter-mbert: To contrast the resource-intensive fine-tuning of mBERT with the computationally cheaper approach of task adaption, we apply the adapter-transformers library (https://github.com/Adapter-Hub/adapter-transformers/) (Pfeiffer et al., 2020). Instead of updating all the weights of the base model (mBERT in our case), the adapter approach inserts a few feed-forward layers in between the transformer blocks and only trains those to adapt the base model to a new task. We again use the vanilla settings and train the model for one day.
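The following is a conceptual sketch of the bottleneck-adapter idea in plain PyTorch, not the adapter-transformers implementation: a small feed-forward block with a residual connection is inserted after a (frozen) transformer layer, and only the adapters and the task head are trained. Hidden and bottleneck sizes are illustrative assumptions.

```python
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Down-project, apply a non-linearity, up-project, and add a residual."""
    def __init__(self, hidden_size=768, bottleneck_size=64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck_size)
        self.up = nn.Linear(bottleneck_size, hidden_size)
        self.activation = nn.ReLU()

    def forward(self, hidden_states):
        # The residual connection preserves the frozen base model's representation.
        return hidden_states + self.up(self.activation(self.down(hidden_states)))

# During adapter training, the base model's parameters stay frozen and only the
# adapters (and the classification head) receive gradient updates, e.g.:
#   for p in base_model.parameters():
#       p.requires_grad = False
```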
OnPoint: In their study of sentence segmentation, Michail et al. (2021) proposed a majority-voting ensemble consisting of several Transformer models trained in different ways. The models' predictions are combined at test time using a sliding window to obtain the final predictions. They offered their system as language-dependent models for all four languages of the shared task and both subtasks.

Unbabel-INESC-ID: Rei et al. (2021) extend the architecture proposed by Rei et al. (2020) to develop a multilingual model for sentence end and punctuation prediction. Their system is based on pre-trained contextual embeddings and built on top of a pre-trained Transformer-based encoder model. They propose their method as a single multilingual model for all languages and subtasks of the shared task.

UR-mSBD: Donabauer and Kruschwitz (2021) propose a system based on a pre-trained BERT model fine-tuned for the first subtask. They use language-specific models for each of the four languages of the shared task. They treat subtask 1 as a binary classification problem, identifying the tokens that indicate the position of a full stop.

oneNLP: Applying a multi-task ALBERT for English and multilingual BERT for the other languages, Mujadia et al. (2021) explored the impact of using contextual language models for sentence end and punctuation prediction. They modeled both subtasks as sequence labeling tasks. They presented the results of a baseline CRF as well as the results of fine-tuning contextual embeddings.

HULAT UC3M: Based on the Punctuator framework (Tilk and Alumäe, 2016), a bidirectional recurrent neural network model equipped with an attention mechanism, Masiello-Ruiz et al. (2021) developed an automatic punctuation system named HULAT-UC3M. They trained HULAT-UC3M for all languages as well as both subtasks of the shared task individually.

HTW: Guhr et al. (2021) modeled the task as a token-wise prediction and examined several language models based on the transformer architecture. They trained two separate models for the two tasks and submitted their results for all four languages of the shared task. They advocated transfer learning for solving the task and showed that multilingual transformer models yielded better results than monolingual models. By pruning BERT layers, they also showed that their model retains 99% of its performance when the last 1/4 of its layers is removed.

5 Results

In section 3.1 we showed that our data is highly imbalanced regarding the label distribution. Accuracy or Macro F1 scores are not suitable metrics in this setting, as, e.g., majority class prediction would already yield an accuracy of 96% for subtask 1 on the English test set (9'618'776 of the 10'039'222 tokens carry label 0). Therefore, we applied the following metrics to evaluate the participants' submissions:

• Subtask 1: F1 score of the label 1 (the positive class, i.e. sentence end)

• Subtask 2: Macro F1 of the selected punctuation symbols
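A sketch of this evaluation is shown below, assuming flat per-token label lists and scikit-learn's f1_score. The inclusion of the "0" label in the macro average is an assumption (it is consistent with the per-label scores in Table 6), and the official evaluation script in the task repository may differ in details.

```python
from sklearn.metrics import f1_score

SUBTASK2_LABELS = [":", "-", ",", "?", ".", "0"]  # assumption: "0" is included

def evaluate(gold_st1, pred_st1, gold_st2, pred_st2):
    """Compute the two task metrics from flat per-token label lists."""
    # Subtask 1: F1 of the positive class ("1", i.e. sentence end).
    f1_st1 = f1_score(gold_st1, pred_st1, pos_label="1", average="binary")
    # Subtask 2: Macro F1 over the selected punctuation labels.
    f1_st2 = f1_score(gold_st2, pred_st2, labels=SUBTASK2_LABELS, average="macro")
    return f1_st1, f1_st2
```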
We observe that a) most systems achieve a very high score for subtask 1 for all languages on the Europarl data, and b) the F1 scores are almost identical (with seemingly minor differences in precision and recall) for the top-ranking systems on both tasks. Further, the top-ranking systems are the same for both tasks. This is to be expected to some degree, as it can be argued that subtask 2 subsumes subtask 1.

TEST SET (values are Prec/Rec/F1)
htw+t2k fullstop multilang: EN 0.94/0.95/0.94, DE 0.95/0.96/0.96, FR 0.94/0.94/0.94, IT 0.92/0.94/0.93, AVG 0.94/0.95/0.94
OnPoint: EN 0.93/0.95/0.94, DE 0.95/0.96/0.96, FR 0.92/0.94/0.93, IT 0.90/0.95/0.92, AVG 0.93/0.95/0.94
Unbabel-INESC-ID: EN 0.94/0.94/0.94, DE 0.95/0.96/0.96, FR 0.94/0.94/0.94, IT 0.92/0.94/0.93, AVG 0.94/0.95/0.94
UR-mSBD: EN 0.91/0.92/0.92, DE 0.94/0.96/0.95, FR 0.93/0.94/0.93, IT 0.91/0.93/0.92, AVG 0.92/0.94/0.93
ZHAW-mbert: EN 0.91/0.93/0.92, DE 0.93/0.96/0.95, FR 0.90/0.93/0.91, IT 0.88/0.93/0.90, AVG 0.91/0.94/0.92
oneNLP: EN 0.92/0.92/0.92, DE 0.93/0.95/0.94, FR 0.90/0.89/0.89, IT 0.88/0.89/0.89, AVG 0.91/0.91/0.91
ZHAW-adapter-mbert: EN 0.88/0.90/0.89, DE 0.79/0.85/0.82, FR 0.81/0.84/0.83, IT 0.77/0.78/0.77, AVG 0.81/0.84/0.83
HULAT UC3M: EN 0.86/0.80/0.83, DE 0.23/0.90/0.36, FR 0.86/0.79/0.83, IT 0.84/0.78/0.81, AVG 0.70/0.82/0.71
htw+t2k fullstop german: DE 0.95/0.96/0.95

SURPRISE TEST SET (values are Prec/Rec/F1)
htw+t2k fullstop multilang: EN 0.85/0.70/0.77, DE 0.90/0.74/0.82, FR 0.84/0.70/0.76, IT 0.85/0.67/0.75, AVG 0.86/0.70/0.78
OnPoint: EN 0.84/0.75/0.80, DE 0.89/0.77/0.82, FR 0.82/0.72/0.77, IT 0.83/0.71/0.77, AVG 0.85/0.74/0.79
Unbabel-INESC-ID: EN 0.92/0.75/0.83, DE 0.88/0.71/0.78, FR 0.85/0.72/0.78, IT 0.86/0.68/0.76, AVG 0.88/0.72/0.79
UR-mSBD: EN 0.82/0.68/0.74, DE 0.89/0.73/0.80, FR 0.83/0.70/0.76, IT 0.84/0.67/0.74, AVG 0.85/0.70/0.76
ZHAW-mbert: EN 0.78/0.70/0.74, DE 0.86/0.74/0.80, FR 0.78/0.69/0.73, IT 0.77/0.65/0.70, AVG 0.80/0.70/0.74
oneNLP: EN 0.81/0.67/0.73, DE 0.85/0.72/0.78, FR 0.77/0.62/0.69, IT 0.78/0.58/0.67, AVG 0.80/0.65/0.72
ZHAW-adapter-mbert: EN 0.75/0.69/0.71, DE 0.75/0.69/0.72, FR 0.72/0.67/0.69, IT 0.71/0.55/0.62, AVG 0.73/0.65/0.69
HULAT UC3M: EN 0.68/0.41/0.51, DE 0.41/0.61/0.49, FR 0.74/0.41/0.53, IT 0.73/0.30/0.43, AVG 0.64/0.43/0.49
htw+t2k fullstop german: DE 0.90/0.75/0.80

Table 4: Results for subtask 1 (htw+t2k fullstop german reported results for German only).

TEST SET (values are Prec/Rec/F1)
htw+t2k fullstop multilang: EN 0.82/0.74/0.77, DE 0.84/0.79/0.81, FR 0.83/0.75/0.78, IT 0.82/0.72/0.76, AVG 0.83/0.75/0.78
OnPoint: EN 0.81/0.75/0.77, DE 0.82/0.80/0.81, FR 0.78/0.77/0.77, IT 0.77/0.74/0.75, AVG 0.80/0.77/0.78
Unbabel-INESC-ID: EN 0.83/0.72/0.76, DE 0.84/0.77/0.80, FR 0.83/0.74/0.77, IT 0.82/0.70/0.74, AVG 0.83/0.73/0.77
ZHAW-mbert: EN 0.80/0.71/0.74, DE 0.82/0.75/0.78, FR 0.81/0.71/0.75, IT 0.79/0.66/0.71, AVG 0.81/0.71/0.75
oneNLP: EN 0.79/0.69/0.72, DE 0.80/0.74/0.77, FR 0.79/0.65/0.68, IT 0.78/0.62/0.66, AVG 0.79/0.68/0.71
HULAT UC3M: EN 0.76/0.60/0.63, DE 0.79/0.65/0.69, FR 0.75/0.59/0.64, IT 0.71/0.52/0.57, AVG 0.75/0.59/0.63
ZHAW-adapter-mbert: EN 0.78/0.64/0.68, DE 0.59/0.48/0.49, FR 0.70/0.55/0.59, IT 0.64/0.46/0.49, AVG 0.68/0.53/0.56

SURPRISE TEST SET (values are Prec/Rec/F1)
htw+t2k fullstop multilang: EN 0.65/0.57/0.60, DE 0.68/0.64/0.66, FR 0.66/0.60/0.62, IT 0.61/0.53/0.56, AVG 0.65/0.59/0.61
OnPoint: EN 0.65/0.59/0.62, DE 0.66/0.65/0.65, FR 0.63/0.60/0.61, IT 0.57/0.55/0.56, AVG 0.63/0.60/0.61
Unbabel-INESC-ID: EN 0.68/0.57/0.61, DE 0.71/0.63/0.65, FR 0.69/0.59/0.63, IT 0.63/0.53/0.56, AVG 0.68/0.58/0.61
ZHAW-mbert: EN 0.62/0.51/0.55, DE 0.66/0.58/0.60, FR 0.64/0.54/0.57, IT 0.51/0.45/0.47, AVG 0.61/0.52/0.55
oneNLP: EN 0.62/0.52/0.56, DE 0.61/0.57/0.58, FR 0.61/0.48/0.51, IT 0.54/0.43/0.46, AVG 0.60/0.50/0.53
HULAT UC3M: EN 0.50/0.40/0.43, DE 0.59/0.47/0.51, FR 0.56/0.38/0.41, IT 0.45/0.33/0.36, AVG 0.53/0.40/0.43
ZHAW-adapter-mbert: EN 0.60/0.48/0.51, DE 0.54/0.41/0.44, FR 0.60/0.44/0.48, IT 0.51/0.35/0.38, AVG 0.56/0.42/0.45

Table 5: Results for subtask 2.

While the F1 scores for subtask 2 seem low compared to subtask 1, a more detailed analysis of the results reveals that the lower (Macro) F1 scores mainly stem from the labels with the lowest counts in the data. Table 6 gives the detailed classification report for the top three ranking systems on the English test set. It shows that the systems are able to predict periods, commas, and question marks reliably, but that they struggle with hyphens and colons, which lowers the Macro F1 scores.
Label  htw+t2k  OnPoint  Unbabel
0      0.99     0.99     0.99
,      0.82     0.82     0.80
.      0.95     0.95     0.94
-      0.42     0.41     0.37
:      0.57     0.57     0.56
?      0.88     0.91     0.89

Table 6: F1 scores per label for the top-performing systems on the English test set for subtask 2.

All systems perform significantly worse on the surprise test sets for both tasks. To gauge the difficulty of the task on the TED dataset compared to the Europarl dataset, we train the ZHAW-mbert approach on the remaining TED talks that were not selected for the surprise test set and then test it on the surprise test set. Table 7 shows that the average F1 score does improve by 11 percentage points when training the ZHAW-mbert system on in-domain data. Still, the resulting 0.66 F1 score is 9 percentage points behind the average F1 score on the Europarl data. Hence, the drop in performance of the Europarl-trained ZHAW-mbert on the surprise test set can be accounted for both by the domain shift and by the increased difficulty of the target domain (TED talks). We expect that this applies to the performance drop of all systems.

            Prec.  Rec.  F1
ZHAW-mbert  0.76   0.63  0.66

Table 7: Results of training ZHAW-mbert on TED talks for subtask 2 (averaged over all languages).

We expected some submissions to use linguistic features such as part-of-speech tags or partial syntax parse trees and hypothesized that such systems would fare better on out-of-domain data. However, all participating systems applied neural encodings of the surface tokens and did not encode linguistic features explicitly. Still, the ranking of the systems remains intact on the surprise test sets.

The top three systems in both tasks all use transformer-based approaches and tackle the tasks in a similar manner. We hypothesize that this is the main reason for the near-identical performance of the systems in terms of F1 scores. Based on the task results, these three systems seem to produce near-identical output. To better gauge their similarities and differences, we evaluate their outputs for subtask 2 in a pair-wise manner on the English test set. We apply the evaluation metric such that one system's output takes the role of the ground truth and the other that of the system prediction, which yields F1 scores per class that we leverage as an indicator of the similarity, or agreement, of the per-token predictions. Table 8 shows the results. While the Macro F1 scores and even the per-class F1 scores in Table 6 are highly similar, there are significant differences in this analysis. For example, for the hyphen class, the systems produce different predictions in over 30% of the cases, and for the colon in roughly 20%.
For the majority classes among the non-0 classes, the systems disagree in about 10% of the cases for the comma, but their predictions are highly similar for the period (96% agreement).

Label  htw+t2k vs Unbabel  OnPoint vs Unbabel  OnPoint vs htw+t2k
0      0.99                0.99                1.00
,      0.90                0.90                0.92
.      0.96                0.96                0.96
-      0.67                0.66                0.68
:      0.79                0.81                0.81
?      0.89                0.92                0.91

Table 8: System prediction similarity between the three top-performing systems on the English test set for subtask 2.

Following Tuggener (2017), we can take the comparison a step further and analyse the types of differences per label. For example, the OnPoint submission's F1 score for the hyphen is 4 percentage points higher than that of Unbabel, and their prediction agreement for the hyphen is 68%. This does not indicate, however, whether OnPoint's predictions are always better. The aforementioned comparison takes a ground truth label G, the predicted label A of one system, and the predicted label B of another system, and defines three types of differences for the cases where A ≠ B:

• correction: G = B

• new error: G = A

• changed error: G ≠ A ≠ B

Table 9 shows the results. We see that the prediction of commas makes up a large portion of the differences. When OnPoint's prediction differs from Unbabel's for the comma, OnPoint is correct and Unbabel incorrect in nearly 70% of the cases, which explains the 2 percentage point higher performance of OnPoint in Table 6. Still, Unbabel is correct in almost 30% of the cases where the two predictions differ.

Label  #Diff.  corr.   new err.  changed err.
0      45'552  34.22%  62.59%    3.19%
,      50'496  69.01%  28.30%    2.69%
.      16'190  49.28%  44.69%    6.03%
-      4'422   51.15%  33.04%    15.81%
:      2'014   41.46%  31.43%    27.11%
?      1'158   63.90%  29.53%    6.56%

Table 9: Detailed comparison of the differences in Unbabel's predictions versus OnPoint's predictions for English in subtask 2. #Diff. signifies the number of tokens that have the respective label as the ground truth and for which OnPoint's and Unbabel's predictions differ. The remaining columns give the percentage of this number in each difference class.
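This comparison can be sketched in a few lines. The following assumes flat per-token label lists for the gold standard and the two systems; it is an illustration of the difference typology defined above, not the original analysis script.

```python
from collections import Counter

def difference_type(g, a, b):
    """Classify a disagreement between system A and system B against gold label g."""
    if a == b:
        return None           # the systems agree; not counted in Table 9
    if b == g:
        return "correction"   # B is correct where A is not
    if a == g:
        return "new error"    # A is correct, B introduces an error
    return "changed error"    # both are wrong, in different ways

def compare(gold, preds_a, preds_b):
    """Count difference types per gold label, analogous to Table 9."""
    counts = Counter()
    for g, a, b in zip(gold, preds_a, preds_b):
        kind = difference_type(g, a, b)
        if kind is not None:
            counts[(g, kind)] += 1
    return counts
```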
In conclusion, we observe that while the top three systems perform similarly in terms of Macro F1 scores for subtask 2, there are nuances to each system that distinguish it from the others.

5.1 Winners

While we showed that there are differences in the outputs of the top three systems that are not reflected in the averaged F1 scores, the declared criteria for winning the task are the averaged F1 scores in Tables 4 and 5. Since the top three systems in these tables are practically indistinguishable based on these F1 scores, we declare OnPoint, htw+t2k, and Unbabel the joint winners of the SEPP-NLG 2021 shared task. Congratulations!

6 Conclusions

We presented the setting and results of the first Sentence End and Punctuation Prediction in NLG text (SEPP-NLG 2021) shared task. We found that all participants explored neural network based models (particularly transformers) to tackle the task. The results on the in-domain Europarl data were high for the most common punctuation symbols, but performance decreased significantly when the models were faced with out-of-domain data.

The discussion of the task results during the session at the SwissText conference yielded the following desiderata for future iterations of the shared task:

• More heterogeneous data (more domains)

• Add truecasing as an additional task

• Add other language families

• Take inference time / computational costs into account as an additional evaluation criterion, or create a separate track that puts emphasis on a low-resource/low-latency setting

Acknowledgments

We thank the participants for their submissions and their valuable feedback on early versions of the data and task details. This work was funded by Innosuisse under grant project nr. 43446.1 IP-ICT.

References

Eunah Cho, Jan Niehues, and Alex Waibel. 2017. NMT-based segmentation and punctuation insertion for real-time spoken language translation. In Interspeech, pages 2645–2649.

Nn. Deepamala and P. Ramakanth. 2012. Sentence boundary detection in Kannada language. International Journal of Computer Applications, 39:38–41.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.

Gregor Donabauer and Udo Kruschwitz. 2021. University of Regensburg @ SwissText 2021 SEPP-NLG: Adding sentence structure to unpunctuated text. In Proceedings of the 1st Shared Task on Sentence End and Punctuation Prediction in NLG Text (SEPP-NLG 2021) at SwissText 2021.

Gregor Donabauer, Udo Kruschwitz, and David Corney. 2021. Making sense of subtitles: Sentence boundary detection and speaker change detection in unpunctuated texts. In Companion Proceedings of the Web Conference 2021, pages 357–362.

Gregory Grefenstette and Pasi Tapanainen. 1997. What is a word, what is a sentence? Problems of tokenization.

Oliver Guhr, Anne-Kathrin Schumann, Frank Bahrmann, and Hans-Joachim Bohme. 2021. FullStop: Multilingual deep models for punctuation prediction. In Proceedings of the 1st Shared Task on Sentence End and Punctuation Prediction in NLG Text (SEPP-NLG 2021) at SwissText 2021.

Jagroop Kaur and Jaswinder Singh. 2019. Deep neural network based sentence boundary detection and end marker suggestion for social media text. In 2019 International Conference on Computing, Communication, and Intelligent Systems (ICCCIS), pages 292–295.

Tibor Kiss and Jan Strunk. 2006. Unsupervised multilingual sentence boundary detection. Computational Linguistics, 32(4):485–525.

Ondřej Klejch, Peter Bell, and Steve Renals. 2017. Sequence-to-sequence models for punctuated transcription combining lexical and acoustic features. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5700–5704. IEEE.

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. Machine Translation Summit, 2005, pages 79–86.

Roque López and Thiago A. S. Pardo. 2015. Experiments on sentence boundary detection in user-generated web content. In Computational Linguistics and Intelligent Text Processing, pages 227–237, Cham. Springer International Publishing.

Jose Manuel Masiello-Ruiz, Jose Luis Lopez Cuadrado, and Paloma Martinez. 2021. Participation of HULAT-UC3M in SEPP-NLG 2021 shared task. In Proceedings of the 1st Shared Task on Sentence End and Punctuation Prediction in NLG Text (SEPP-NLG 2021) at SwissText 2021.

Andrianos Michail, Silvan Wehrli, and Terézia Bucková. 2021. UZH OnPoint at SwissText-2021: Sentence end and punctuation prediction in NLG text through ensembling of different transformers. In Proceedings of the 1st Shared Task on Sentence End and Punctuation Prediction in NLG Text (SEPP-NLG 2021) at SwissText 2021.

Vandan Mujadia, Pruthwik Mishra, and Dipti Misra Sharma. 2021. Deep contextual punctuator for NLG text. In Proceedings of the 1st Shared Task on Sentence End and Punctuation Prediction in NLG Text (SEPP-NLG 2021) at SwissText 2021.

Jonas Pfeiffer, Andreas Rücklé, Clifton Poth, Aishwarya Kamath, Ivan Vulić, Sebastian Ruder, Kyunghyun Cho, and Iryna Gurevych. 2020. AdapterHub: A framework for adapting transformers. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 46–54.
Ines Rehbein, Josef Ruppenhofer, and Thomas Schmidt. 2020. Improving sentence boundary detection for spoken language transcripts. In Proceedings of the 12th International Conference on Language Resources and Evaluation (LREC), May 11-16, 2020, Palais du Pharo, Marseille, France, pages 7102–7111. European Language Resources Association.

Ricardo Rei, Fernando Batista, Nuno M. Guerreiro, and Luisa Coheur. 2021. Multilingual simultaneous sentence end and punctuation prediction. In Proceedings of the 1st Shared Task on Sentence End and Punctuation Prediction in NLG Text (SEPP-NLG 2021) at SwissText 2021.

Ricardo Rei, Nuno Miguel Guerreiro, and Fernando Batista. 2020. Automatic truecasing of video subtitles using BERT: A multilingual adaptable approach. In Information Processing and Management of Uncertainty in Knowledge-Based Systems, pages 708–721, Cham. Springer International Publishing.

Nils Reimers and Iryna Gurevych. 2020. Making monolingual sentence embeddings multilingual using knowledge distillation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4512–4525.

Jeffrey C. Reynar and Adwait Ratnaparkhi. 1997. A maximum entropy approach to identifying sentence boundaries. ANLC '97, pages 16–19, USA. Association for Computational Linguistics.

Michael D. Riley. 1989. Some applications of tree-based modelling to speech and language. HLT '89, pages 339–352, USA. Association for Computational Linguistics.

Jörg Tiedemann. 2012. Parallel data, tools and interfaces in OPUS. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), Istanbul, Turkey. European Language Resources Association (ELRA).

Ottokar Tilk and Tanel Alumäe. 2016. Bidirectional recurrent neural network with attention mechanism for punctuation restoration. In INTERSPEECH.

Don Tuggener. 2017. A method for in-depth comparative evaluation: How (dis)similar are outputs of POS taggers, dependency parsers and coreference resolvers really? In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 188–198, Valencia, Spain. Association for Computational Linguistics.

Daniel J. Walker, David E. Clements, Maki Darwin, and Jan W. Amtrup. 2001. Sentence boundary detection: A comparison of paradigms for improving MT quality. In Proceedings of MT Summit VIII, Santiago de Compostela, pages 18–22.

Jiangyan Yi and Jianhua Tao. 2019. Self-attention based model for punctuation prediction using word and speech embeddings. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7270–7274. IEEE.

Piotr Żelasko, Piotr Szymański, Jan Mizgajski, Adrian Szymczak, Yishay Carmiel, and Najim Dehak. 2018. Punctuation prediction model for conversational speech. Proc. Interspeech 2018, pages 2633–2637.