The Sentence End and Punctuation Prediction in NLG Text (SEPP-NLG) Shared Task 2021

Don Tuggener, Ahmad Aghaebrahimian
Zurich University of Applied Sciences (ZHAW), Winterthur, Switzerland
{tuge, agha}@zhaw.ch

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

This paper describes the first Sentence End and Punctuation Prediction in Natural Language Generation (SEPP-NLG) shared task (https://sites.google.com/view/sentence-segmentation/) held at the SwissText conference 2021. The goal of the shared task was to develop solutions for the identification of sentence boundaries and the insertion of punctuation marks into texts produced by NLG systems. The data and submissions (https://drive.switch.ch/index.php/s/g3fMhMZU2uo32mf) as well as the codebase (https://github.com/dtuggener/SEPP-NLG-2021) for the shared task are publicly available.

1 Introduction

Sentence end detection, also known as sentence boundary disambiguation (SBD) or boundary detection, is the Natural Language Processing (NLP) task of recognizing where a sentence begins and ends. A period is the most common end-of-sentence indicator in written English as well as in many other Indo-European languages. However, a period may also occur in a decimal number, an abbreviation, an email address, and other contexts, which makes sentence boundary detection a challenge. Other forms of punctuation such as question and exclamation marks, semicolons, commas, etc. add to this challenge. Although sentence boundary detection is considered an almost solved problem for formal written language (Walker et al., 2001), it poses a challenge in terms of meaning distortion and readability in synthetic or automatically translated or transcribed texts such as the output of Automatic Speech Recognition (ASR) or Machine Translation (MT) systems. The punctuation marks in such synthetic text may be displaced for several reasons. Detecting the end of a sentence and placing an appropriate punctuation mark improves the quality of such texts not only by preserving the original meaning but also by enhancing their readability.

The goal of the SEPP-NLG shared task is to build models that identify sentence ends and place an appropriate punctuation mark at the appropriate positions.

2 Related Work

The earliest attempts at sentence boundary detection, such as the system proposed by Grefenstette and Tapanainen (1997), utilize sets of rules or regular expressions. In a different direction, Reynar and Ratnaparkhi (1997) and Kiss and Strunk (2006) proposed an information-centric approach based on the Maximum Entropy model and an unsupervised method based on collocation statistics, respectively. Decision tree (Riley, 1989), Naïve Bayes (López and Pardo, 2015), and deep learning based (Kaur and Singh, 2019) models are the most recent machine learning advances proposed for predicting correct positions for the period in particular and other punctuation marks in general. Combining rule-based and machine learning based approaches, Deepamala and Ramakanth (2012) proposed a hybrid system with high performance.

Our task is closely related to Tilk and Alumäe (2016) and follow-up work that uses the Europarl and TED talk corpora for punctuation prediction.
Similar to our goal, Żelasko et al. (2018) and Donabauer et al. (2021) investigate sentence boundary detection in unpunctuated ASR outputs of spoken dialogues based on textual features. Cho et al. (2017) propose a method to predict sentence boundaries and punctuation insertion in a real-time spoken language translation tool. In a similar setting, Klejch et al. (2017) include acoustic features to improve punctuation prediction in a speech translation system, and Yi and Tao (2019) combine lexical and speech features for punctuation prediction in a traditional ASR setting. Finally, Rehbein et al. (2020) investigate the annotation and detection of sentence-like units in spoken language transcripts.

3 Task Overview

Ultimately, the goal of SEPP-NLG is to predict sentence ends and punctuation in NLG texts. However, there are no corpora that feature NLG texts together with their manually transcribed and corrected versions. Therefore, we approximate the setting by using a) transcripts of spoken texts, and b) lower-casing the texts and removing all punctuation marks. While there are multiple corpora of transcribed spoken language, we choose the Europarl corpus (http://www.statmt.org/europarl/) (Koehn, 2005) as the source for our data. The Europarl corpus consists of transcripts of the sessions of the European Parliament and features transcripts in multiple languages.

We offer the following subtasks:

• Subtask 1 (fully unpunctuated sentences, full stop detection): Given the textual content of an utterance where all punctuation marks are removed, correctly detect the end of sentences by placing a full stop in the appropriate positions.

• Subtask 2 (fully unpunctuated sentences, full punctuation marks): Given the textual content of an utterance where all punctuation marks are removed, correctly predict all punctuation marks.

Participants were free to choose for which languages and subtasks they contributed a submission, but were encouraged to participate in all languages.

3.1 Data

We leverage the open parallel corpus (OPUS) version of the Europarl corpus (https://opus.nlpl.eu/Europarl.php) (Tiedemann, 2012) for extracting the task data, as it provides sentence boundaries and tokenization. Although the sentence boundaries in the corpus are automatically generated, they are quite reliable, as the data the boundary detection models were trained on contains all the original punctuation symbols of the transcripts. In the spirit of the "Swissness" of the SwissText conference where SEPP-NLG 2021 is co-located, we select three of the four official languages of Switzerland, i.e. German, French, and Italian (the fourth, Romansh, is not represented in Europarl), and complement the selection by incorporating English. Incorporating further languages from the OPUS corpus using our scripts is seamless, as the data format is consistent across languages.

The Europarl corpus contains multiple punctuation symbols. For subtask 2, we gauged which subset of them represents a realistic and feasible target for automatic prediction in a stream of unpunctuated, lower-cased tokens. We also considered which punctuation marks improve the readability of a text the most. Hence, we consolidated the selection of punctuation symbols for subtask 2 to ":", "-", ",", "?", and "." plus the label "0" (indicating no punctuation), and mapped the symbols "!" and ";" to ".", the period. We removed all sentences from the data that contain other punctuation symbols, such as parentheses, as there is no straightforward way to remove such punctuation without interfering with the naturalness of a sentence. This removal affected the data for both subtasks and resulted in discarding less than 10% of the data per language. We also removed HTML artifacts and special (non-visible) characters (zero-width space, soft hyphen) from the data. Finally, we omitted sentences with fewer than 3 tokens and documents with fewer than 2 sentences.
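As an illustration of the consolidation described above, a token's subtask 2 label can be derived from the punctuation symbol that follows it. This is a minimal sketch under stated assumptions, not the organizers' actual preprocessing script (which is available in the linked repository).

```python
# Punctuation consolidation for subtask 2, as described above.
KEPT = {":", "-", ",", "?", "."}   # retained subtask 2 labels (plus "0")
REMAP = {"!": ".", ";": "."}       # symbols mapped to the period

def subtask2_label(following_symbol):
    """Return the subtask 2 label a token emits, given the punctuation
    symbol (if any) that directly follows it in the original text.
    Returns None if the symbol is outside the selection, in which case
    the whole sentence is dropped from the data."""
    if following_symbol is None:
        return "0"
    symbol = REMAP.get(following_symbol, following_symbol)
    return symbol if symbol in KEPT else None

print(subtask2_label("!"))    # "." -- the exclamation mark maps to the period
print(subtask2_label(None))   # "0" -- no punctuation follows the token
print(subtask2_label("("))    # None -- the sentence would be removed
```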
The data format is as follows: the lower-cased tokens of each file are listed vertically, and the labels for subtask 1 (binary classification) and subtask 2 (multi-class classification) are appended horizontally, separated by tabs. The labels encode whether a token emits a sentence end (subtask 1) and a punctuation symbol (subtask 2). Table 1 shows an example.

Token        Label 1  Label 2
the          0        0
next         0        0
item         0        0
is           0        0
the          0        0
commission   0        0
statement    0        0
on           0        0
the          0        0
referendum   0        0
in           0        0
venezuela    1        .
member       0        0
of           0        0
the          0        0
commission   0        .
madam        0        0
president    0        ,
the          0        0

Table 1: Example of the data format.

Per language, we randomly selected 80% of the documents for the training set and 20% for the test set. From the training set, we then randomly sampled 20% of the documents as the development set.
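To illustrate the format shown in Table 1, the following minimal sketch reads one such tab-separated file. The file name is a placeholder, and the official data loaders in the task repository may differ in details.

```python
from collections import Counter

def read_task_file(path):
    """Yield (token, subtask1_label, subtask2_label) triples from one file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            token, label1, label2 = line.split("\t")
            yield token, label1, label2

# Example: the subtask 2 label distribution of a single (hypothetical) file;
# the counts over the full English test set are shown in Table 3.
label_counts = Counter(l2 for _, _, l2 in read_task_file("example_document.tsv"))
print(label_counts)
```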
Table 2 shows several statistics of our data. We see similar properties for all languages: most sentences are unique, and there are few sentences that occur in both the train and test sets (duplicate sentences are often formulaic, administrative ones, like "The session is adjourned."). German features the largest vocabulary, as is expected due to its morphological richness, and the vocabulary overlap between train and test sets is roughly 50% for all languages.

Lang  #sentences  unique     train∩test  #tokens     unique   train∩test
EN    1'406'577   1'382'738  2'660       33'779'095  88'370   43'744
DE    1'308'508   1'276'691  2'806       28'645'112  294'035  112'000
FR    1'236'504   1'215'981  2'081       32'690'367  103'774  57'112
IT    1'132'554   1'112'742  1'746       28'167'993  131'024  67'626

Table 2: Training data statistics, showing the number of (unique) sentences and tokens and the number of sentences and tokens that occur in both the training and test set (train∩test) per language.

Concerning the labels, the data is highly skewed towards the 0 label for both tasks, as most tokens do not emit a sentence end or punctuation symbol. For example, there are 9'618'776 tokens with label 0 and 420'446 tokens with label 1 for subtask 1 in the English test set, which yields an average sentence length of almost 24 tokens. Table 3 shows a breakdown of the label counts in the English test set for subtask 2. It shows that the period and comma symbols have similar counts and are the most frequent labels among the non-0 labels. The remaining labels occur at least an order of magnitude less frequently. These label distribution properties are similar across all languages.

Label  Count
0      9'050'256
,      521'594
.      417'560
-      23'600
:      13'146
?      13'066

Table 3: Label distribution for subtask 2 in the English test set.

3.2 Surprise Test Data

The Europarl corpus covers domain-specific language, i.e. political statements in the European Parliament. To measure how well the participating systems trained on our data generalize to out-of-domain data, we incorporated a surprise test set comprised of TED talk transcripts (https://opus.nlpl.eu/TED2020.php) (Reimers and Gurevych, 2020). For each language, we sampled 500 TED talks, favoring those with the lowest vocabulary overlap with our Europarl test sets to maximize the vocabulary shift. The document-based average percentage of vocabulary overlap ranges from 85% to 90%, meaning that on average 10-15% of the tokens per document in the surprise test set do not occur in the Europarl test set.

While being one order of magnitude smaller than the Europarl test set, the surprise test set is also highly and similarly imbalanced regarding the label distribution. In the English surprise test set, there are 67'446 tokens with label 1 and 1'014'464 tokens with label 0. This yields an average sentence length of 16 tokens, which is significantly lower than the 24 tokens in the English Europarl test set. The label counts for subtask 2 follow an almost identical distribution in both test sets.

4 Submissions

ZHAW-mbert: We provided a baseline based on the multilingual BERT model (Devlin et al., 2019), mBERT, implemented in the simpletransformers library (https://github.com/ThilinaRajapakse/simpletransformers). We treat the task as a token classification problem and segment the documents into subsequent, non-overlapping chunks of length 512 to adhere to the sequence length restrictions of BERT. We fine-tuned the model on the training data of all languages, with a randomly shuffled file order across all languages and vanilla settings, for about one week on a single GPU.

ZHAW-adapter-mbert: To contrast the resource-intensive fine-tuning of mBERT with the computationally cheaper approach of task adaption, we apply the adapter-transformers library (https://github.com/Adapter-Hub/adapter-transformers/) (Pfeiffer et al., 2020). Instead of updating all the weights of the base model (mBERT in our case), the adapter approach inserts a few feed-forward layers in between the transformer blocks and only trains those to adapt the base model to a new task. We again use the vanilla settings and train the model for one day.
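The following is a conceptual sketch of the bottleneck-adapter idea in plain PyTorch, not the adapter-transformers implementation: a small feed-forward block with a residual connection is inserted after a (frozen) transformer layer, and only the adapters and the task head are trained. Hidden and bottleneck sizes are illustrative assumptions.

```python
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Down-project, apply a non-linearity, up-project, and add a residual."""
    def __init__(self, hidden_size=768, bottleneck_size=64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck_size)
        self.up = nn.Linear(bottleneck_size, hidden_size)
        self.activation = nn.ReLU()

    def forward(self, hidden_states):
        # The residual connection preserves the frozen base model's representation.
        return hidden_states + self.up(self.activation(self.down(hidden_states)))

# During adapter training, the base model's parameters stay frozen and only the
# adapters (and the classification head) receive gradient updates, e.g.:
#   for p in base_model.parameters():
#       p.requires_grad = False
```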
OnPoint: In their study of sentence segmentation, Michail et al. (2021) proposed a majority-voting ensemble consisting of several Transformer models trained in different ways. The models' predictions are combined at test time using a sliding window to obtain the final predictions. They offered their system as language-dependent models for all four languages of the shared task and both subtasks.

Unbabel-INESC-ID: Rei et al. (2021) extend the architecture proposed by Rei et al. (2020) to develop a multilingual model for sentence end and punctuation prediction. Their system is based on pre-trained contextual embeddings and built on top of a pre-trained Transformer-based encoder model. They propose their method as a single multilingual model for all languages and subtasks of the shared task.

UR-mSBD: Donabauer and Kruschwitz (2021) propose a system based on a pre-trained BERT model fine-tuned for the first subtask. They use language-specific models for each of the four languages of the shared task. They treat subtask 1 as a binary classification problem, identifying the tokens that indicate the position of a full stop.

oneNLP: Applying a multi-task ALBERT for English and multilingual BERT for the other languages, Mujadia et al. (2021) explored the impact of using contextual language models for sentence end and punctuation prediction. They modeled both subtasks as sequence labeling tasks. They presented the results of a baseline CRF as well as the results of fine-tuning contextual embeddings.

HULAT UC3M: Based on the Punctuator framework (Tilk and Alumäe, 2016), a bidirectional recurrent neural network model equipped with an attention mechanism, Masiello-Ruiz et al. (2021) developed an automatic punctuation system named HULAT-UC3M. They trained HULAT-UC3M for all languages as well as both subtasks of the shared task individually.

HTW: Guhr et al. (2021) modeled the task as a token-wise prediction and examined several language models based on the transformer architecture. They trained two separate models for the two tasks and submitted their results for all four languages of the shared task. They advocated transfer learning for solving the task and showed that multilingual transformer models yielded better results than monolingual models. By pruning BERT layers, they also showed that their model retains 99% of its performance when the last 1/4 of its layers is removed.

5 Results

In section 3.1 we showed that our data is highly imbalanced regarding the label distribution. Accuracy or Macro F1 scores are not suitable metrics in this setting, as, e.g., majority class prediction would already yield an accuracy of 96% for subtask 1 on the English test set (9'618'776 of the 10'039'222 tokens carry label 0). Therefore, we applied the following metrics to evaluate the participants' submissions:

• Subtask 1: F1 score of the label 1 (the positive class, i.e. sentence end)

• Subtask 2: Macro F1 of the selected punctuation symbols
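A sketch of this evaluation is shown below, assuming flat per-token label lists and scikit-learn's f1_score. The inclusion of the "0" label in the macro average is an assumption (it is consistent with the per-label scores in Table 6), and the official evaluation script in the task repository may differ in details.

```python
from sklearn.metrics import f1_score

SUBTASK2_LABELS = [":", "-", ",", "?", ".", "0"]  # assumption: "0" is included

def evaluate(gold_st1, pred_st1, gold_st2, pred_st2):
    """Compute the two task metrics from flat per-token label lists."""
    # Subtask 1: F1 of the positive class ("1", i.e. sentence end).
    f1_st1 = f1_score(gold_st1, pred_st1, pos_label="1", average="binary")
    # Subtask 2: Macro F1 over the selected punctuation labels.
    f1_st2 = f1_score(gold_st2, pred_st2, labels=SUBTASK2_LABELS, average="macro")
    return f1_st1, f1_st2
```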
We observe that a) most systems achieve a very high score for subtask 1 for all languages on the Europarl data, and b) the F1 scores are almost identical (with seemingly minor differences in precision and recall) for the top-ranking systems on both tasks. Further, the top-ranking systems are the same for both tasks. This is to be expected to some degree, as it can be argued that subtask 2 subsumes subtask 1.

TEST SET (values are Prec/Rec/F1)
htw+t2k fullstop multilang: EN 0.94/0.95/0.94, DE 0.95/0.96/0.96, FR 0.94/0.94/0.94, IT 0.92/0.94/0.93, AVG 0.94/0.95/0.94
OnPoint: EN 0.93/0.95/0.94, DE 0.95/0.96/0.96, FR 0.92/0.94/0.93, IT 0.90/0.95/0.92, AVG 0.93/0.95/0.94
Unbabel-INESC-ID: EN 0.94/0.94/0.94, DE 0.95/0.96/0.96, FR 0.94/0.94/0.94, IT 0.92/0.94/0.93, AVG 0.94/0.95/0.94
UR-mSBD: EN 0.91/0.92/0.92, DE 0.94/0.96/0.95, FR 0.93/0.94/0.93, IT 0.91/0.93/0.92, AVG 0.92/0.94/0.93
ZHAW-mbert: EN 0.91/0.93/0.92, DE 0.93/0.96/0.95, FR 0.90/0.93/0.91, IT 0.88/0.93/0.90, AVG 0.91/0.94/0.92
oneNLP: EN 0.92/0.92/0.92, DE 0.93/0.95/0.94, FR 0.90/0.89/0.89, IT 0.88/0.89/0.89, AVG 0.91/0.91/0.91
ZHAW-adapter-mbert: EN 0.88/0.90/0.89, DE 0.79/0.85/0.82, FR 0.81/0.84/0.83, IT 0.77/0.78/0.77, AVG 0.81/0.84/0.83
HULAT UC3M: EN 0.86/0.80/0.83, DE 0.23/0.90/0.36, FR 0.86/0.79/0.83, IT 0.84/0.78/0.81, AVG 0.70/0.82/0.71
htw+t2k fullstop german: DE 0.95/0.96/0.95

SURPRISE TEST SET (values are Prec/Rec/F1)
htw+t2k fullstop multilang: EN 0.85/0.70/0.77, DE 0.90/0.74/0.82, FR 0.84/0.70/0.76, IT 0.85/0.67/0.75, AVG 0.86/0.70/0.78
OnPoint: EN 0.84/0.75/0.80, DE 0.89/0.77/0.82, FR 0.82/0.72/0.77, IT 0.83/0.71/0.77, AVG 0.85/0.74/0.79
Unbabel-INESC-ID: EN 0.92/0.75/0.83, DE 0.88/0.71/0.78, FR 0.85/0.72/0.78, IT 0.86/0.68/0.76, AVG 0.88/0.72/0.79
UR-mSBD: EN 0.82/0.68/0.74, DE 0.89/0.73/0.80, FR 0.83/0.70/0.76, IT 0.84/0.67/0.74, AVG 0.85/0.70/0.76
ZHAW-mbert: EN 0.78/0.70/0.74, DE 0.86/0.74/0.80, FR 0.78/0.69/0.73, IT 0.77/0.65/0.70, AVG 0.80/0.70/0.74
oneNLP: EN 0.81/0.67/0.73, DE 0.85/0.72/0.78, FR 0.77/0.62/0.69, IT 0.78/0.58/0.67, AVG 0.80/0.65/0.72
ZHAW-adapter-mbert: EN 0.75/0.69/0.71, DE 0.75/0.69/0.72, FR 0.72/0.67/0.69, IT 0.71/0.55/0.62, AVG 0.73/0.65/0.69
HULAT UC3M: EN 0.68/0.41/0.51, DE 0.41/0.61/0.49, FR 0.74/0.41/0.53, IT 0.73/0.30/0.43, AVG 0.64/0.43/0.49
htw+t2k fullstop german: DE 0.90/0.75/0.80

Table 4: Results for subtask 1 (htw+t2k fullstop german reported results for German only).

TEST SET (values are Prec/Rec/F1)
htw+t2k fullstop multilang: EN 0.82/0.74/0.77, DE 0.84/0.79/0.81, FR 0.83/0.75/0.78, IT 0.82/0.72/0.76, AVG 0.83/0.75/0.78
OnPoint: EN 0.81/0.75/0.77, DE 0.82/0.80/0.81, FR 0.78/0.77/0.77, IT 0.77/0.74/0.75, AVG 0.80/0.77/0.78
Unbabel-INESC-ID: EN 0.83/0.72/0.76, DE 0.84/0.77/0.80, FR 0.83/0.74/0.77, IT 0.82/0.70/0.74, AVG 0.83/0.73/0.77
ZHAW-mbert: EN 0.80/0.71/0.74, DE 0.82/0.75/0.78, FR 0.81/0.71/0.75, IT 0.79/0.66/0.71, AVG 0.81/0.71/0.75
oneNLP: EN 0.79/0.69/0.72, DE 0.80/0.74/0.77, FR 0.79/0.65/0.68, IT 0.78/0.62/0.66, AVG 0.79/0.68/0.71
HULAT UC3M: EN 0.76/0.60/0.63, DE 0.79/0.65/0.69, FR 0.75/0.59/0.64, IT 0.71/0.52/0.57, AVG 0.75/0.59/0.63
ZHAW-adapter-mbert: EN 0.78/0.64/0.68, DE 0.59/0.48/0.49, FR 0.70/0.55/0.59, IT 0.64/0.46/0.49, AVG 0.68/0.53/0.56

SURPRISE TEST SET (values are Prec/Rec/F1)
htw+t2k fullstop multilang: EN 0.65/0.57/0.60, DE 0.68/0.64/0.66, FR 0.66/0.60/0.62, IT 0.61/0.53/0.56, AVG 0.65/0.59/0.61
OnPoint: EN 0.65/0.59/0.62, DE 0.66/0.65/0.65, FR 0.63/0.60/0.61, IT 0.57/0.55/0.56, AVG 0.63/0.60/0.61
Unbabel-INESC-ID: EN 0.68/0.57/0.61, DE 0.71/0.63/0.65, FR 0.69/0.59/0.63, IT 0.63/0.53/0.56, AVG 0.68/0.58/0.61
ZHAW-mbert: EN 0.62/0.51/0.55, DE 0.66/0.58/0.60, FR 0.64/0.54/0.57, IT 0.51/0.45/0.47, AVG 0.61/0.52/0.55
oneNLP: EN 0.62/0.52/0.56, DE 0.61/0.57/0.58, FR 0.61/0.48/0.51, IT 0.54/0.43/0.46, AVG 0.60/0.50/0.53
HULAT UC3M: EN 0.50/0.40/0.43, DE 0.59/0.47/0.51, FR 0.56/0.38/0.41, IT 0.45/0.33/0.36, AVG 0.53/0.40/0.43
ZHAW-adapter-mbert: EN 0.60/0.48/0.51, DE 0.54/0.41/0.44, FR 0.60/0.44/0.48, IT 0.51/0.35/0.38, AVG 0.56/0.42/0.45

Table 5: Results for subtask 2.

While the F1 scores for subtask 2 seem low compared to subtask 1, a more detailed analysis of the results reveals that the lower (Macro) F1 scores mainly stem from the labels with the lowest counts in the data. Table 6 gives the detailed classification report for the top three ranking systems on the English test set. It shows that the systems are able to predict periods, commas, and question marks reliably, but that they struggle with hyphens and colons, which lowers the Macro F1 scores.
Label  htw+t2k  OnPoint  Unbabel
0      0.99     0.99     0.99
,      0.82     0.82     0.80
.      0.95     0.95     0.94
-      0.42     0.41     0.37
:      0.57     0.57     0.56
?      0.88     0.91     0.89

Table 6: F1 scores per label for the top-performing systems on the English test set for subtask 2.

All systems perform significantly worse on the surprise test sets for both tasks. To gauge the difficulty of the task on the TED dataset compared to the Europarl dataset, we train the ZHAW-mbert approach on the remaining TED talks that were not selected for the surprise test set and then test it on the surprise test set. Table 7 shows that the average F1 score does improve by 11 percentage points when training the ZHAW-mbert system on in-domain data. Still, the resulting 0.66 F1 score is 9 percentage points behind the average F1 score on the Europarl data. Hence, the drop in performance of the Europarl-trained ZHAW-mbert on the surprise test set can be accounted for both by the domain shift and by the increased difficulty of the target domain (TED talks). We expect that this applies to the performance drop of all systems.

            Prec.  Rec.  F1
ZHAW-mbert  0.76   0.63  0.66

Table 7: Results of training ZHAW-mbert on TED talks for subtask 2 (averaged over all languages).

We expected some submissions to use linguistic features such as part-of-speech tags or partial syntax parse trees and hypothesized that such systems would fare better on out-of-domain data. However, all participating systems applied neural encodings of the surface tokens and did not encode linguistic features explicitly. Still, the ranking of the systems remains intact on the surprise test sets.

The top three systems in both tasks all use transformer-based approaches and tackle the tasks in a similar manner. We hypothesize that this is the main reason for the near-identical performance of the systems in terms of F1 scores. Based on the task results, these three systems seem to produce near-identical output. To better gauge their similarities and differences, we evaluate their outputs for subtask 2 in a pair-wise manner on the English test set. We apply the evaluation metric such that one system's output takes the role of the ground truth and the other that of the system prediction, which yields F1 scores per class that we leverage as an indicator of the similarity, or agreement, of the per-token predictions. Table 8 shows the results. While the Macro F1 scores and even the per-class F1 scores in Table 6 are highly similar, there are significant differences in this analysis. For example, for the hyphen class, the systems produce different predictions in over 30% of the cases, and for the colon in roughly 20%.
For the majority classes among the non-0 classes, the systems disagree in about 10% of the cases for the comma, but their predictions are highly similar for the period (96% agreement).

Label  htw+t2k vs Unbabel  OnPoint vs Unbabel  OnPoint vs htw+t2k
0      0.99                0.99                1.00
,      0.90                0.90                0.92
.      0.96                0.96                0.96
-      0.67                0.66                0.68
:      0.79                0.81                0.81
?      0.89                0.92                0.91

Table 8: System prediction similarity between the three top-performing systems on the English test set for subtask 2.

Following Tuggener (2017), we can take the comparison a step further and analyse the types of differences per label. For example, the OnPoint submission's F1 score for the hyphen is 4 percentage points higher than that of Unbabel, and their prediction agreement for the hyphen is 68%. This does not indicate, however, whether OnPoint's predictions are always better. The aforementioned comparison takes a ground truth label G, the predicted label A of one system, and the predicted label B of another system, and defines three types of differences for the cases where A ≠ B:

• correction: G = B

• new error: G = A

• changed error: G ≠ A ≠ B

Table 9 shows the results. We see that the prediction of commas makes up a large portion of the differences. When OnPoint's prediction differs from Unbabel's for the comma, OnPoint is correct and Unbabel incorrect in nearly 70% of the cases, which explains the 2 percentage point higher performance of OnPoint in Table 6. Still, Unbabel is correct in almost 30% of the cases where the two predictions differ.

Label  #Diff.  corr.   new err.  changed err.
0      45'552  34.22%  62.59%    3.19%
,      50'496  69.01%  28.30%    2.69%
.      16'190  49.28%  44.69%    6.03%
-      4'422   51.15%  33.04%    15.81%
:      2'014   41.46%  31.43%    27.11%
?      1'158   63.90%  29.53%    6.56%

Table 9: Detailed comparison of the differences in Unbabel's predictions versus OnPoint's predictions for English in subtask 2. #Diff. signifies the number of tokens that have the respective label as the ground truth and for which OnPoint's and Unbabel's predictions differ. The remaining columns give the percentage of this number in each difference class.
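This comparison can be sketched in a few lines. The following assumes flat per-token label lists for the gold standard and the two systems; it is an illustration of the difference typology defined above, not the original analysis script.

```python
from collections import Counter

def difference_type(g, a, b):
    """Classify a disagreement between system A and system B against gold label g."""
    if a == b:
        return None           # the systems agree; not counted in Table 9
    if b == g:
        return "correction"   # B is correct where A is not
    if a == g:
        return "new error"    # A is correct, B introduces an error
    return "changed error"    # both are wrong, in different ways

def compare(gold, preds_a, preds_b):
    """Count difference types per gold label, analogous to Table 9."""
    counts = Counter()
    for g, a, b in zip(gold, preds_a, preds_b):
        kind = difference_type(g, a, b)
        if kind is not None:
            counts[(g, kind)] += 1
    return counts
```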
In conclusion, we observe that while the top three systems perform similarly in terms of Macro F1 scores for subtask 2, there are nuances to each system that distinguish it from the others.

5.1 Winners

While we showed that there are differences in the outputs of the top three systems that are not reflected in the averaged F1 scores, the declared criteria for winning the task are the averaged F1 scores in Tables 4 and 5. Since the top three systems in these tables are practically indistinguishable based on these F1 scores, we declare OnPoint, htw+t2k, and Unbabel the joint winners of the SEPP-NLG 2021 shared task. Congratulations!

6 Conclusions

We presented the setting and results of the first Sentence End and Punctuation Prediction in NLG text (SEPP-NLG 2021) shared task. We found that all participants explored neural network based models (particularly transformers) to tackle the task. The results on the in-domain Europarl data were high for the most common punctuation symbols, but performance decreased significantly when the models were faced with out-of-domain data.

The discussion of the task results during the session at the SwissText conference yielded the following desiderata for future iterations of the shared task:

• More heterogeneous data (more domains)

• Add truecasing as an additional task

• Add other language families

• Take inference time / computational costs into account as an additional evaluation criterion, or create a separate track that puts emphasis on a low-resource/low-latency setting

Acknowledgments

We thank the participants for their submissions and their valuable feedback on early versions of the data and task details. This work was funded by Innosuisse under grant project nr. 43446.1 IP-ICT.

References

Eunah Cho, Jan Niehues, and Alex Waibel. 2017. NMT-based segmentation and punctuation insertion for real-time spoken language translation. In Interspeech, pages 2645–2649.

Nn. Deepamala and P. Ramakanth. 2012. Sentence boundary detection in Kannada language. International Journal of Computer Applications, 39:38–41.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.

Gregor Donabauer and Udo Kruschwitz. 2021. University of Regensburg @ SwissText 2021 SEPP-NLG: Adding sentence structure to unpunctuated text. In Proceedings of the 1st Shared Task on Sentence End and Punctuation Prediction in NLG Text (SEPP-NLG 2021) at SwissText 2021.

Gregor Donabauer, Udo Kruschwitz, and David Corney. 2021. Making sense of subtitles: Sentence boundary detection and speaker change detection in unpunctuated texts. In Companion Proceedings of the Web Conference 2021, pages 357–362.

Gregory Grefenstette and Pasi Tapanainen. 1997. What is a word, what is a sentence? Problems of tokenization.

Oliver Guhr, Anne-Kathrin Schumann, Frank Bahrmann, and Hans-Joachim Bohme. 2021. FullStop: Multilingual deep models for punctuation prediction. In Proceedings of the 1st Shared Task on Sentence End and Punctuation Prediction in NLG Text (SEPP-NLG 2021) at SwissText 2021.

Jagroop Kaur and Jaswinder Singh. 2019. Deep neural network based sentence boundary detection and end marker suggestion for social media text. In 2019 International Conference on Computing, Communication, and Intelligent Systems (ICCCIS), pages 292–295.

Tibor Kiss and Jan Strunk. 2006. Unsupervised multilingual sentence boundary detection. Computational Linguistics, 32(4):485–525.

Ondřej Klejch, Peter Bell, and Steve Renals. 2017. Sequence-to-sequence models for punctuated transcription combining lexical and acoustic features. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5700–5704. IEEE.

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. Machine Translation Summit, 2005, pages 79–86.

Roque López and Thiago A. S. Pardo. 2015. Experiments on sentence boundary detection in user-generated web content. In Computational Linguistics and Intelligent Text Processing, pages 227–237, Cham. Springer International Publishing.

Jose Manuel Masiello-Ruiz, Jose Luis Lopez Cuadrado, and Paloma Martinez. 2021. Participation of HULAT-UC3M in SEPP-NLG 2021 shared task. In Proceedings of the 1st Shared Task on Sentence End and Punctuation Prediction in NLG Text (SEPP-NLG 2021) at SwissText 2021.

Andrianos Michail, Silvan Wehrli, and Terézia Bucková. 2021. UZH OnPoint at SwissText-2021: Sentence end and punctuation prediction in NLG text through ensembling of different transformers. In Proceedings of the 1st Shared Task on Sentence End and Punctuation Prediction in NLG Text (SEPP-NLG 2021) at SwissText 2021.

Vandan Mujadia, Pruthwik Mishra, and Dipti Misra Sharma. 2021. Deep contextual punctuator for NLG text. In Proceedings of the 1st Shared Task on Sentence End and Punctuation Prediction in NLG Text (SEPP-NLG 2021) at SwissText 2021.

Jonas Pfeiffer, Andreas Rücklé, Clifton Poth, Aishwarya Kamath, Ivan Vulić, Sebastian Ruder, Kyunghyun Cho, and Iryna Gurevych. 2020. AdapterHub: A framework for adapting transformers. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 46–54.
Ines Rehbein, Josef Ruppenhofer, and Thomas Schmidt. 2020. Improving sentence boundary detection for spoken language transcripts. In Proceedings of the 12th International Conference on Language Resources and Evaluation (LREC), May 11-16, 2020, Palais du Pharo, Marseille, France, pages 7102–7111. European Language Resources Association.

Ricardo Rei, Fernando Batista, Nuno M. Guerreiro, and Luisa Coheur. 2021. Multilingual simultaneous sentence end and punctuation prediction. In Proceedings of the 1st Shared Task on Sentence End and Punctuation Prediction in NLG Text (SEPP-NLG 2021) at SwissText 2021.

Ricardo Rei, Nuno Miguel Guerreiro, and Fernando Batista. 2020. Automatic truecasing of video subtitles using BERT: A multilingual adaptable approach. In Information Processing and Management of Uncertainty in Knowledge-Based Systems, pages 708–721, Cham. Springer International Publishing.

Nils Reimers and Iryna Gurevych. 2020. Making monolingual sentence embeddings multilingual using knowledge distillation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4512–4525.

Jeffrey C. Reynar and Adwait Ratnaparkhi. 1997. A maximum entropy approach to identifying sentence boundaries. ANLC '97, pages 16–19, USA. Association for Computational Linguistics.

Michael D. Riley. 1989. Some applications of tree-based modelling to speech and language. HLT '89, pages 339–352, USA. Association for Computational Linguistics.

Jörg Tiedemann. 2012. Parallel data, tools and interfaces in OPUS. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), Istanbul, Turkey. European Language Resources Association (ELRA).

Ottokar Tilk and Tanel Alumäe. 2016. Bidirectional recurrent neural network with attention mechanism for punctuation restoration. In INTERSPEECH.

Don Tuggener. 2017. A method for in-depth comparative evaluation: How (dis)similar are outputs of POS taggers, dependency parsers and coreference resolvers really? In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 188–198, Valencia, Spain. Association for Computational Linguistics.

Daniel J. Walker, David E. Clements, Maki Darwin, and Jan W. Amtrup. 2001. Sentence boundary detection: A comparison of paradigms for improving MT quality. In Proceedings of MT Summit VIII, Santiago de Compostela, pages 18–22.

Jiangyan Yi and Jianhua Tao. 2019. Self-attention based model for punctuation prediction using word and speech embeddings. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7270–7274. IEEE.

Piotr Żelasko, Piotr Szymański, Jan Mizgajski, Adrian Szymczak, Yishay Carmiel, and Najim Dehak. 2018. Punctuation prediction model for conversational speech. Proc. Interspeech 2018, pages 2633–2637.