University of Regensburg @ SwissText 2021 SEPP-NLG: Adding Sentence Structure to Unpunctuated Text

Gregor Donabauer (University of Regensburg, Regensburg, Germany, gregor.donabauer@stud.uni-regensburg.de)
Udo Kruschwitz (University of Regensburg, Regensburg, Germany, udo.kruschwitz@ur.de)

Abstract

This paper describes our approach (UR-mSBD) to address the shared task on Sentence End and Punctuation Prediction in NLG Text (SEPP-NLG) organised as part of SwissText 2021. We participated in Subtask 1 (fully unpunctuated sentences – full stop detection) and submitted a run for every featured language (English, German, French, Italian). Our submissions are based on pre-trained BERT models that have been fine-tuned to the task at hand. We had recently demonstrated that such an approach achieves state-of-the-art performance when identifying end-of-sentence markers on automatically transcribed texts. The difference to that work is that here we use language-specific BERT models for each featured language. By framing the problem as a binary tagging task using the outlined architecture we are able to achieve competitive results on the official test set across all languages, with Recall, Precision and F1 ranging between 0.91 and 0.96, which makes us joint winners for Recall in two of the languages. The official baselines are beaten by large margins.

1 Introduction

Text normalization has always been a core building block of natural language processing, aimed at converting raw text into a more convenient, standard form (Jurafsky and Martin, 2020). Besides tokenization, stemming and lemmatization, this process includes sentence segmentation. What is interesting, though, is that text pre-processing and normalization is by no means a solved challenge.

The SwissText 2021 Shared Task 2: Sentence End and Punctuation Prediction in NLG Text is concerned with exactly this problem area. The goal is to develop approaches for sentence boundary detection (SBD) in unpunctuated text. Providing suitable solutions means fostering readability and restoring the text's original meaning.

We took part in Subtask 1 (fully unpunctuated sentences – full stop detection) of this challenge and did so for all featured languages. This report starts by contextualising the task as part of a short discussion of related work. We will then introduce our methodology, briefly describe the data and report results. Finally, we present some discussion and conclusions.

2 Related Work

Sentences are considered a fundamental information unit of written text (Jurafsky and Martin, 2020; Levinson, 1985). Therefore, many NLP pipelines in practice split text into sentences. Fact checking is just one – currently very popular – challenge where the automated detection of sentences within a stream of input data is essential. Fact checkers are increasingly turning to technology to help, including NLP (Arnold, 2020). These tools can help identify claims worth checking, find repeats of claims that have already been checked, or even assist in the verification process directly (Nakov et al., 2021). Most such tools rely on text as input and require the text to be split into sentences (Donabauer et al., 2021). For this and other application areas sentence segmentation will remain a challenging task, despite the fact that recent developments suggest that for some NLP tasks it is possible to achieve state-of-the-art performance without conducting any pre-processing of the raw data, e.g. (Shaham and Levy, 2021).
Sentence Boundary Detection (SBD) is an important and actually well-studied text processing step, but it typically relies on the presence of punctuation within the input text (Jurafsky and Martin, 2020). Even with such punctuation it can be a difficult task, e.g. (Gillick, 2009; Sanchez, 2019), and traditional approaches use a variety of architectures including CRFs (Liu et al., 2005) and combinations of HMMs, maximum likelihood as well as maximum entropy approaches (Liu et al., 2004). With unpunctuated texts (and a lack of word-casing information) it becomes a lot harder, as even humans find it difficult to determine sentence boundaries in this case (Stevenson and Gaizauskas, 2000). Song et al. (2019) simplify the problem by aiming to detect the sentence boundary within a 5-word chunk, using YouTube subtitle data. Using LSTMs they predict the position of each sample's sentence boundary but do not consider any chunks without a sentence boundary. Le (2020) presents a hybrid model (using BiLSTMs and CRFs) originally used for NER that was evaluated on SBD in the context of conversational data by preprocessing the CornellMovie-Dialogue and the DailyDialog datasets to obtain samples that neither contain sentence boundary punctuation nor word-casing information (they also predict whether a sentence is a statement or a question). Du et al. (2019) present a transformer-based approach to the problem, but they assume partially punctuated text and word-casing information. Recently, it was shown that a simple fine-tuned BERT model was able to improve on the state of the art on fully unpunctuated, case-folded input data (Donabauer et al., 2021).

3 System

3.1 General Architecture of UR-mSBD

The system architecture we use is adopted from our previous work, which achieved state-of-the-art performance on a very similar task (Donabauer et al., 2021). That architecture demonstrated the suitability of a BERT-based token classification approach for sentence end prediction in the context of improving text processing pipelines for fact-checking. The underlying idea is to treat the restoration of sentence boundary information as a problem similar to IO-tagging in named entity recognition. For the implementation we refer to our GitHub repository [1]. The last token of every sentence, i.e. the token after which a sentence boundary punctuation mark would follow, is labeled with EOS. In our previous work we predicted the beginning of a sentence rather than its end and therefore labeled the first token of every sentence with BOS. The out-of-context label O is assigned to all other tokens of the text. We fine-tuned a pre-trained BERT model on the problem and obtained high F1 scores for the desired positive class (sentence end), outperforming alternative approaches on different datasets. We use a softmax classification head predicting the label (EOS or O) by the highest probability at each token.

[1] https://github.com/doGregor/SBD-SCD-pipeline
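For illustration, such a token classification setup can be instantiated with the huggingface transformers library roughly as in the following minimal sketch. This is not the code from the repository referenced above; the checkpoint name, the example input and all variable names are illustrative assumptions, and the classification head would of course first have to be fine-tuned as described in Section 3.2 before its predictions become meaningful.

# Minimal sketch of a BERT-based token classification setup with the
# label set {O, EOS}; checkpoint name and example input are illustrative.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

LABELS = ["O", "EOS"]  # O = ordinary token, EOS = last token of a sentence

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=len(LABELS),
    id2label=dict(enumerate(LABELS)),
    label2id={label: i for i, label in enumerate(LABELS)},
)

# Unpunctuated, lower-cased input words (as provided in the shared-task files).
words = ["this", "is", "a", "test", "another", "sentence", "follows"]
encoding = tokenizer(words, is_split_into_words=True, return_tensors="pt")

with torch.no_grad():
    logits = model(**encoding).logits          # shape: (1, seq_len, 2)
probs = torch.softmax(logits, dim=-1)          # softmax classification head
pred_ids = probs.argmax(dim=-1)[0].tolist()    # highest-probability label per token

# Map sub-token predictions back to the original words (first sub-token wins).
word_ids = encoding.word_ids(batch_index=0)
seen = set()
for token_pos, word_id in enumerate(word_ids):
    if word_id is None or word_id in seen:
        continue
    seen.add(word_id)
    print(words[word_id], LABELS[pred_ids[token_pos]])

Taking the argmax over the softmax output corresponds to the "highest probability at each token" decision rule, and mapping each word to its first sub-token keeps the predictions aligned with the word-level labels of the shared task.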
3.2 Adjustments for the Shared Task

We apply two changes to the model fine-tuning process for this shared task:

• First of all, we are faced with four different languages and not just English texts. The two obvious options would be to use a multilingual language model or to choose a different language-specific pre-trained model for each of the languages, i.e. German, French, English and Italian. We decided to adopt language-specific BERT-base models, as Nozza et al. (2020) report that this yields better results than using mBERT, which is pre-trained on a multilingual corpus.

• Secondly, we change the process of sample construction. We handle the unpunctuated input text as one long chain of words. We originally split this chain into samples of 64 words and fine-tuned the model with a maximum sequence length of 128 BERT-specific tokens. Further experiments have shown that utilizing token sequences that are as long as possible (512 BERT tokens) yields the best results. Therefore, we pre-process the raw text data by sending it through the model's tokenizer first. Each time a batch of iterated words fits 512 BERT tokens, we create a sample from it. If a word at the transition between two samples would be ripped apart (as adding it entirely to the current sample would exceed 512 tokens), we put it at the beginning of a new sample and pad the rest of the previous one with special PAD tokens (see the sketch after this list).
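The packing step described in the second point can be sketched as follows. The function name, example data and choice of tokenizer here are illustrative assumptions rather than an excerpt from the released pipeline, and padding the closed samples up to the maximum sequence length is left to the tokenizer at fine-tuning time.

# Illustrative sketch of the sample-construction step: words are packed greedily
# into samples whose BERT sub-token count does not exceed the maximum sequence
# length; a word that would be split across two samples is moved entirely to the
# next sample (the remainder of the previous sample is later filled with [PAD]).
from transformers import AutoTokenizer

def build_samples(words, labels, tokenizer, max_tokens=512):
    samples, current_words, current_labels, current_len = [], [], [], 0
    for word, label in zip(words, labels):
        n_subtokens = len(tokenizer.tokenize(word))
        if current_len + n_subtokens > max_tokens and current_words:
            # The word would be ripped apart -> close the current sample and
            # start a new one beginning with this word.
            samples.append((current_words, current_labels))
            current_words, current_labels, current_len = [], [], 0
        current_words.append(word)
        current_labels.append(label)
        current_len += n_subtokens
    if current_words:
        samples.append((current_words, current_labels))
    return samples

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # illustrative choice
samples = build_samples(["hello", "world"] * 600, ["0", "1"] * 600, tokenizer)
print(len(samples), len(samples[0][0]))  # number of samples, words in first sample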
All other hyperparameters are kept in line with Donabauer et al. (2021), namely an epoch number of 3 and a batch size of 8 per device. Since we run the fine-tuning on 3 GPUs simultaneously, the batch size per iteration increases to 24. We also evaluated our approach on the datasets with tuned hyperparameters; however, it turned out that increasing the number of epochs to 5 leads to a deterioration of results.

4 Data and Setup

We participated in Subtask 1 (fully unpunctuated sentences – full stop detection) of SwissText's SEPP-NLG Shared Task 2. Before addressing the experimental setup we briefly describe the provided data sets. The challenge's domain is NLG text. Since there are no corpora that feature such data, nor manually corrected versions, the organizers selected Europarl [2] as source. This corpus includes transcribed text data originating from spoken text in many different languages. The data come in lowercase format and are already split up into tokens. Sentence boundary punctuation is removed; instead, labels are assigned that mark upcoming sentence ends. The last token of each sentence is labeled with '1', all remaining tokens with '0'.

The data are provided as multiple tab-separated value files grouped by language and set. The number of tokens per language and dataset is reported in Table 1. We explain our pre-processing with respect to a single set for a single language, e.g. the English evaluation set. Firstly, we read each tsv file one after the other and concatenate all tokens and labels into two long lists. During reading we save the order and length of the input files. By that we are able to reconstruct the original structure of the files later on. The list of tokens is fed into the model-specific tokenizer. If tokens are not recognized properly we replace them with 'nan'. Each time a batch of 512 BERT tokens is filled, we create a sample from it. Data are saved in CoNLL-2003 format (Tjong Kim Sang and De Meulder, 2003): tokens and labels are separated horizontally with spaces, and samples are separated vertically with empty lines. We use the tokenizer during pre-processing only to calculate the number of BERT tokens for each input word; the samples themselves consist of plain text tokens. Thus the dimension and order of the predicted labels correspond to the structure of the processed tsv files, and we can simply map our output back to the words in the input data.

[2] https://opus.nlpl.eu/Europarl.php
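A minimal sketch of this pre-processing step follows, assuming one token and its Subtask 1 label per tab-separated line; the paths, the alphabetical file ordering and the helper names are illustrative assumptions, not the exact code used for the submission. The samples passed to the writer are of the form produced by the packing sketch in Section 3.2.

# Sketch of the pre-processing described above: read all tsv files of one set,
# concatenate tokens and labels into two long lists while remembering each
# file's length (so predictions can later be mapped back), and write packed
# samples in CoNLL-2003 style (token and label separated by a space, samples
# separated by an empty line). The tokenizer-level replacement of unreadable
# tokens with 'nan' mentioned above is not shown here.
import csv
from pathlib import Path

def read_tsv_files(folder):
    tokens, labels, file_lengths = [], [], []
    for tsv_path in sorted(Path(folder).glob("*.tsv")):  # ordering is an assumption
        n = 0
        with open(tsv_path, encoding="utf-8") as f:
            for row in csv.reader(f, delimiter="\t"):
                if not row:
                    continue
                tokens.append(row[0])
                labels.append(row[1])
                n += 1
        file_lengths.append((tsv_path.name, n))
    return tokens, labels, file_lengths

def write_conll(samples, out_path):
    with open(out_path, "w", encoding="utf-8") as f:
        for words, labs in samples:
            for word, lab in zip(words, labs):
                f.write(f"{word} {lab}\n")
            f.write("\n")  # empty line separates samples

Because the per-file lengths are retained, the flat list of predicted labels can afterwards be split back into per-file chunks in the original order for submission.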
As mentioned earlier, we make use of language-specific models rather than mBERT. We briefly describe the respective models and the corpora they were trained on:

• English: classic BERT base uncased model, trained on English lowercase text (Devlin et al., 2019).

• German: BERT base uncased model, trained on a 16GB monolingual German corpus by dbmdz (the MDZ Digital Library team at the Bavarian State Library) [3].

• French: BERT base uncased model, trained on a 71GB monolingual French corpus (Le et al., 2020).

• Italian: BERT base uncased model, trained on 81GB of monolingual Italian text by dbmdz.

We make use of the PyTorch [4] version of the Python huggingface [5] transformers library to access the models and run the fine-tuning. We execute the scripts on 3 Nvidia GeForce RTX 2080 Ti GPUs with an overall memory size of 33GB.

[3] https://github.com/dbmdz/berts
[4] https://pytorch.org/
[5] https://huggingface.co/
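Putting the pieces together, the language-specific fine-tuning setup can be sketched as follows. The checkpoint identifiers are assumptions chosen to match the descriptions above (exact identifiers are not stated in this report), and only the hyperparameters named in Section 3.2 (3 epochs, batch size 8 per device) are reflected in the training arguments; everything else in the sketch is illustrative.

# Assumed mapping from language to a public checkpoint matching the
# descriptions above, plus training arguments mirroring Section 3.2.
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          TrainingArguments)

MODEL_FOR_LANGUAGE = {
    "en": "bert-base-uncased",                    # Devlin et al. (2019)
    "de": "dbmdz/bert-base-german-uncased",       # dbmdz German BERT (assumed)
    "fr": "flaubert/flaubert_base_uncased",       # FlauBERT, Le et al. (2020) (assumed)
    "it": "dbmdz/bert-base-italian-xxl-uncased",  # dbmdz Italian BERT, 81GB corpus (assumed)
}

def load_model_and_tokenizer(language):
    name = MODEL_FOR_LANGUAGE[language]
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForTokenClassification.from_pretrained(name, num_labels=2)
    return model, tokenizer

training_args = TrainingArguments(
    output_dir="./sbd-model",          # illustrative output path
    num_train_epochs=3,                # as stated in Section 3.2
    per_device_train_batch_size=8,     # 3 GPUs -> effective batch size of 24
)

A Trainer built from these arguments, the selected model and the packed CoNLL-style samples would then perform the fine-tuning for each language.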
5 Results

5.1 Baselines

The official baseline is produced using the spaCy NLP package. The organisers report scores for different pipeline versions, and we describe the best performing one for every language in Table 2. The official evaluation metrics are Precision, Recall and F1-score of the positive class label (i.e., sentence end).

As Table 2 illustrates, the F1-scores for English, German and French range from 0.32 to 0.47. For Italian the F1-metric collapses to 0.01, caused by a very low Recall of 0.00.

5.2 UR-mSBD

We summarise the results obtained when running our system, UR-mSBD, on the test data. For each language we also include scores obtained on the dev set as well as on the surprise test set that was introduced to check the generalizability of the different approaches.

Table 3 presents the results for the English data, Table 4 for German, Table 5 for French, and Table 6 for the Italian test data.

We see overall consistently high scores for all three metrics and across all languages when looking at the official test sets. An average of F1=0.93 aggregated over all languages places us just one percentage point behind the top performance. Looking at Recall, we actually end up being joint winners for the German and French test data.

The highest scores are reported for German (with Precision at 0.94, Recall at 0.96 and F1 at 0.95). All the scores for the test sets are above 0.90. For the surprise test set the results drop quite a bit but are still reasonably high given that the data is not representative of the data the system was trained on.

Across the board, all the baselines were beaten by large margins.

Language   Train        Dev         Test        Surprise Test
English    33,779,095   7,743,489   10,039,222  1,081,910
German     28,645,112   6,358,683   9,575,861   979,982
French     32,690,367   8,781,593   11,297,534  1,143,911
Italian    28,167,993   7,194,189   10,193,542  985,448

Table 1: Number of tokens in the respective data sets for each language.

Dataset   Precision  Recall  F1
Dev EN    0.49       0.23    0.32
Test EN   0.49       0.24    0.32
Dev DE    0.51       0.44    0.47
Test DE   0.49       0.44    0.46
Dev FR    0.71       0.24    0.36
Test FR   0.63       0.24    0.35
Dev IT    0.64       0.00    0.01
Test IT   0.51       0.00    0.01

Table 2: Highest baseline scores for EN, DE, FR, IT.

Dataset        Precision  Recall  F1
Dev            0.92       0.92    0.92
Test           0.91       0.92    0.92
Surprise Test  0.82       0.68    0.74

Table 3: UR-mSBD scores for English.

Dataset        Precision  Recall  F1
Dev            0.96       0.95    0.95
Test           0.94       0.96    0.95
Surprise Test  0.89       0.73    0.80

Table 4: UR-mSBD scores for German.

Dataset        Precision  Recall  F1
Dev            0.94       0.93    0.93
Test           0.93       0.94    0.93
Surprise Test  0.83       0.70    0.76

Table 5: UR-mSBD scores for French.

Dataset        Precision  Recall  F1
Dev            0.93       0.91    0.92
Test           0.91       0.93    0.92
Surprise Test  0.84       0.67    0.74

Table 6: UR-mSBD scores for Italian.

6 Discussion

For all featured languages our fine-tuned BERT-based predictions perform very well, with results for all three metrics (P/R/F1) in the 90s and being very competitive when compared with the other submissions for this shared task. This first of all demonstrates the power of transformer-based models and confirms findings we reported previously (Donabauer et al., 2021).

The fact that the baselines were outperformed by such large margins is perhaps a sign that non-neural approaches are not competitive for the task and data at hand.

We note that our approach performed best for German texts, which might be caused by a high similarity between the data the model was pre-trained on and the data sampled to form the training, dev and test sets for this task. It will be worth exploring whether for different data samples we observe a similar pattern or whether the differences are in fact not significant.

Taking a slightly broader perspective, we observe that the scores obtained here are similar to what we obtained when running our sentence boundary detection algorithm on a dataset comprising transcribed lectures given at Stanford University, first proposed by Song et al. (2019), and the DailyDialog dataset (Li et al., 2017), but that extending these datasets or creating a hybrid version resulted in significant drops in performance (Donabauer et al., 2021). It would therefore be interesting to see whether other approaches show similar patterns.

Another general pattern we read into the results is that there are only small differences when comparing results on the dev sets with the results on the test sets. We conclude that our approach can generalize to unseen data as long as the training data is representative of the data used for testing. The approach does, however, generalise less well on out-of-domain ('surprise') data, with F1-scores dropping between 0.15 and 0.18 compared to the Europarl sets. We still consider the results to be reasonably good, though, given they are on average over all languages only 0.03 behind the top-performing system.

7 Conclusions

We framed the task of full-stop prediction (Subtask 1 of Shared Task 2 at SwissText 2021) as a binary classification task over all input tokens, identifying whether each of these tokens should indicate the position of a full stop or not. Fine-tuning language-specific pre-trained BERT models for each of the four languages resulted in competitive results. Given the small difference in F1 of 0.01 compared to the top results reported for this competition for three of the languages (as well as aggregated over all languages), we will await statistical significance tests as our results may well turn out to be on par with the top results in this task.

Acknowledgements

This work was supported by the project COURAGE: A Social Media Companion Safeguarding and Educating Students, funded by the Volkswagen Foundation, grant number 95564.
References

Phoebe Arnold. 2020. The challenges of online fact checking. Technical report, Full Fact, London, UK.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Gregor Donabauer, Udo Kruschwitz, and David Corney. 2021. Making sense of subtitles: Sentence boundary detection and speaker change detection in unpunctuated texts. In Companion Proceedings of the Web Conference 2021 (WWW '21 Companion), New York, NY. ACM.

Jinhua Du, Yan Huang, and Karo Moilanen. 2019. AIG Investments.AI at the FinSBD task: Sentence boundary detection through sequence labelling and BERT fine-tuning. In Proceedings of the First Workshop on Financial Technology and Natural Language Processing, pages 81–87, Macao, China. Association for Computational Linguistics.

Dan Gillick. 2009. Sentence boundary detection and the problem with the U.S. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Short Papers, NAACL-Short '09, pages 241–244, USA. Association for Computational Linguistics.

Daniel Jurafsky and James Martin. 2020. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Current draft of third edition (30 Dec 2020).

Hang Le, Loïc Vial, Jibril Frej, Vincent Segonne, Maximin Coavoux, Benjamin Lecouteux, Alexandre Allauzen, Benoît Crabbé, Laurent Besacier, and Didier Schwab. 2020. FlauBERT: Unsupervised language model pre-training for French. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 2479–2490, Marseille, France. European Language Resources Association.

The Anh Le. 2020. Sequence labeling approach to the task of sentence boundary detection. In Proceedings of the 4th International Conference on Machine Learning and Soft Computing, ICMLSC 2020, pages 144–148, New York, NY, USA. ACM.

Joan Persily Levinson. 1985. Punctuation and the orthographic sentence: a linguistic analysis. Doctoral dissertation, City University of New York.

Yanran Li, Hui Su, Xiaoyu Shen, Wenjie Li, Ziqiang Cao, and Shuzi Niu. 2017. DailyDialog: A manually labelled multi-turn dialogue dataset. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 986–995, Taipei, Taiwan. Asian Federation of Natural Language Processing.

Yang Liu, Andreas Stolcke, Elizabeth Shriberg, and Mary Harper. 2004. Comparing and combining generative and posterior probability models: Some advances in sentence boundary detection in speech. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, pages 64–71, Barcelona, Spain. Association for Computational Linguistics.

Yang Liu, Andreas Stolcke, Elizabeth Shriberg, and Mary Harper. 2005. Using conditional random fields for sentence boundary detection in speech. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, ACL '05, pages 451–458, USA. Association for Computational Linguistics.

Preslav Nakov, David P. A. Corney, Maram Hasanain, Firoj Alam, Tamer Elsayed, Alberto Barrón-Cedeño, Paolo Papotti, Shaden Shaar, and Giovanni Da San Martino. 2021. Automated fact-checking for assisting human fact-checkers. CoRR, abs/2103.07769.

Debora Nozza, Federico Bianchi, and Dirk Hovy. 2020. What the [MASK]? Making sense of language-specific BERT models. CoRR, abs/2003.02912.

George Sanchez. 2019. Sentence boundary detection in legal text. In Proceedings of the Natural Legal Language Processing Workshop 2019, pages 31–38, Minneapolis, Minnesota. Association for Computational Linguistics.

Uri Shaham and Omer Levy. 2021. Neural machine translation without embeddings. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 181–186, Online. Association for Computational Linguistics.

Hye Jeong Song, Hong Ki Kim, Jong Dae Kim, Chan Young Park, and Yu Seop Kim. 2019. Inter-sentence segmentation of YouTube subtitles using Long-Short Term Memory (LSTM). Applied Sciences (Switzerland), 9(7).

Mark Stevenson and Robert Gaizauskas. 2000. Experiments on sentence boundary detection. In Proceedings of the Sixth Conference on Applied Natural Language Processing, pages 84–89, Morristown, NJ, USA. Association for Computational Linguistics.

Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, volume 4, pages 142–147, Morristown, NJ, USA. Association for Computational Linguistics.