UZH OnPoint at Swisstext-2021: Sentence End and Punctuation Prediction in NLG Text Through Ensembling of Different Transformers Andrianos Michail∗ Silvan Wehrli∗ Terézia Bucková University of Zurich {andrianos.michail, silvan.wehrli, terezia.buckova}@uzh.ch Abstract Therefore, attempts in improving the quality of This paper presents our solutions for the Swiss- such texts must also focus on a more precise predic- Text 2021 shared task “Sentence End and tion of punctuation. Consequently, this is an ongo- Punctuation Prediction in NLG Text”. We en- ing research effort in the NLP community. Recent gaged with both subtasks (i.e., sentence end developments in NLP (such as Transformers) offer detection and full-punctuation prediction) and new possibilities to tackle punctuation prediction built systems for English, German, French effectively. Some of these attempts are discussed and Italian. To tackle the punctuation pre- in Section 2. diction problem, we ensemble multiple dif- ferently trained Transformer models (BERT, Following recent attempts, we propose an ensem- CamemBERT, Electra, Longformer, MPNet, ble system based on the Transformer architecture, XLM-RoBERTa, XLNet) and leverage their re- where multiple models predict the punctuation sym- sults using a sliding window method during in- bols of a given text. The results are then combined ference time. As a result, we achieve an F1 and the final predictions are made. Our language- score of the positive class of 0.94 for English, specific systems are able to predict punctuation for 0.96 for German, 0.93 for French, and 0.93 for English, German, French, and Italian texts and are Italian for the subtask 1 “sentence end detec- on par – if not better – with current state-of-the-art tion” on the respective test sets. Furthermore, Macro F1 results on test sets for subtask 2 models that participated in the shared task. “full-punctuation prediction” for English, Ger- Our main contributions include man, French and Italian are 0.78, 0.81, 0.78, 1. the exploration of different Transformer-based 0.76 respectively. models and identification of the most impor- 1 Introduction tant features which affect the performance for this task, and Transcribed or translated texts often contain erro- neous punctuation. Correct punctuation, however, 2. a showcase that the ensembling of differently is crucial for human understanding of a text, as trained models enhances the performance for shown by Tündik et al. (2018). Rightly placed the punctuation prediction task. punctuation not only makes the text more read- able and intelligible but can change the meaning of 2 Related Work sentences, as well. Translated texts pose another Punctuation prediction tasks pose many challenges. challenge: Different languages expose different One of them is the restricted input length thus re- sentence structuring conventions and hence use stricted context for the Transformers. To solve the punctuation very differently. above mentioned limitation, Nguyen et al. (2019) However, systems for automatic transcription of used an overlapped chunk method (i.e., an over- speech nowadays focus on minimizing the Word lapping sliding window) combined with a capi- Error Rate (WER), which omits punctuation (He talization and a punctuation model to tackle the et al., 2011). As a result, the state-of-the-art sys- punctuation problem in long documents. First, tems are focused on the correct transcription of the text is divided into chunks with overlapping words and not necessarily correct segmentation of segments. Second, a punctuation model (seq2seq text or correct punctuation (Tündik et al., 2018). LSTM, Transformer) predicts punctuation and cap- * Equal contribution. Order determined by coin flip. italization for every segment. Lastly, overlapped Copyright ©2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 Interna- tional (CC BY 4.0). chunk merging combines chunks by discarding a Language Training Evaluation defined number of tokens per overlapped chunk. English 11,028 10,521 Courtland et al. (2020) changed the usual frame- German 11,495 10,207 work of punctuation prediction to predicting punc- French 12,276 13,366 tuation for the whole sequence rather than for sin- Italian 10,379 10,502 gle tokens. The authors used a feedforward neural Table 1: Mean token length per document in the train- network. Similar to Nguyen et al. (2019), they find ing and evaluation dataset. that using a sliding window approach improves prediction performance. However, instead of pro- ducing multiple predictions for the same token, and experimental setup. Section 6 presents our they sum activations before prediction and make results and discusses the impact of used methods. inference afterwards. Finally, a conclusion is drawn in Section 7. Sunkara et al. (2020) used a joint learning objec- tive for capitalization and punctuation prediction. 3 Dataset The model input are sub-word embeddings. The The Europarl Parallel Corpus (Koehn, 2005) serves authors used the pre-trained BERT model (BERT as the data source for the training, development, base truncated to the first six layers). They fine- and test set. The surprise test set (of an undisclosed tuned the model on medical domain data because domain during evaluation) is an out-of-domain the medical domain was in the main scope of this dataset that consists of a sample from the TED paper. They also fine-tuned the model for the punc- 2020 dataset (Reimers and Gurevych, 2020) with tuation prediction task. The authors used masked a low vocabulary overlap with the training data. language learning objective while forcing half of As provided by the organizers of the shared task, the masked tokens to be punctuation marks. samples in all datasets were lowercased and all Similarly, Nagy et al. (2021) also leveraged pre- punctuation marks were removed. trained BERT models (BERT base cased and un- Subsequently, we outline challenges that we be- cased and a smaller version for English; multilin- lieve are especially relevant in solving this shared gual and Hungarian-specific BERT versions for task and thus directly influenced our proposed sys- Hungarian). They added a two-layer multi-layer tem architecture. perceptron network with a soft-max output layer. The model also used a sliding window approach to 3.1 Long Documents enhance the results further. This model is trained As shown in Table 1, the mean token length is to predict four labels: empty (no punctuation), many orders of magnitudes longer than what typi- comma, period and question mark. cal Transformer architectures can process at once Our approach differs from the above men- (typically up to 512 subtokens). It should be noted tioned in using ensembling of multiple pre-trained that some of the documents are especially long and Transformer-based models fine-tuned for the given can contain up to 100,000 tokens. The most obvi- task. Very importantly, our systems predict six ous solution would be just to split documents into different punctuation symbols for the punctuation smaller sequences and subsequently merge predic- prediction task. tions. However, this approach lowers the context Additionally, a multilingual Transformer was with which a model is confronted and might lead used as a part of our ensemble. We hypothesize to lower prediction quality (presumably at the be- that it would be able to capture more accurately the ginning and end of a sequence). multilingual content of the EuroParl data. Further- more, low-resource Latin languages might bene- 3.2 Multilingual Content fit from pre-training on more data, e.g., including To some extent, documents in the EuroParl Corpus other Latin languages. contain multilingual content. As shown in the ex- In Section 3 we will discuss the datasets we used amples in Table 2, many documents contain names and the challenges they provide. The problem is of people and areas that reflect the multilingualism described in Section 4 together with detailed de- of the participants of the European Parliament, i.e., scription of our approach. Section 5 contains expla- members come from all over Europe. Therefore, nation of used hyperparameters, technical details using pre-trained models trained on monolingual Sentence Excerpt Language Transformers - Base i agree completely with mr pöttering English Electra, Longformer, and with you too mr swoboda MPNet, XLNet the president of the european commission German BERT, Electra‡ , josé manuel durão barroso however XLM-RoBERTa French CamemBERT‡ , Electra, Table 2: Excerpts from the English training data that XLM-RoBERTa contain multilingual content. Italian BERT, Electra‡ , XLM-RoBERTa data only may result in an inaccurate representation Table 4: Transformer models that were used for each of this content. language-specific model. Models marked with ‡ were used twice: Once trained without weighted loss and 3.3 Imbalanced Class Distribution once with weighted loss. The class distribution of the training and evaluation set, as shown by Table 3, presents a rather typical 4.2 Transformers situation in machine learning: Some of the classes The corner-stones of our systems are pre-trained have very few examples compared to the biggest Transformer models. We trained four different fine- classes. Neglecting this circumstance will likely tuned models for each language and combined the lead to low performance for minority classes. Us- predictions using majority vote ensembling (see ing typical techniques such as class-specific loss Section 4.5). Table 4 provides an overview. weights or data augmentation might improve per- Electra (Clark et al., 2020) is trained as a dis- formance to some extent. We have tried to reduce criminator, and the authors suggest that it is more this problem by adding a model with altered loss suitable for downstream sequence labelling tasks. weights to the ensemble. In fact, we can further support this claim because this model architecture was the best-performing Punctuation Training Development single model for all languages except French (see : 43,133 9,490 Table 5 and 6). ? 44,290 9,815 - 80,916 18,335 Both MPNet (Song et al., 2020) and XLNet . 1,396,166 319,751 (Yang et al., 2019) are trained (slightly differently) , 1,759,686 401,095 through permuted language modelling, allowing 0 30,454,904 6,985,003 a better understanding of bidirectional contexts, which is often needed with punctuation. Both of Table 3: The distribution of class labels for English these single models performed exceptionally well for the training and development set. 0 indicates the in our experiments. absence of a punctuation mark. The distributions for Longformer (Beltagy et al., 2020), due to its German, French and Italian are similar. local windowed attention with a task motivated global attention, can process larger sequence lengths (up to 4096) and perform well on the longer 4 Methods documents of this task. XLM-RoBERTa (Conneau et al., 2019) is a mul- 4.1 Problem Modelling tilingual transformer that is trained on over 100 We modelled this problem as a token classification languages. In our experiments it was demonstrated task. More precisely, each token is assigned a label to be the best performing multilingual model. representing the following punctuation symbol (if The authors of CamemBERT (Martin et al., any). We concentrated our main efforts and focus 2019) show that it performs exceedingly well in on the full punctuation prediction. As such, we NER token classification. Moreover, the good per- built all of the models to be able to predict all punc- formance translated to our French full-punctuation tuation symbols. For the end of sentence prediction prediction experiments. task, we mapped predictions of ‘.’ ‘?’ to 1 and the BERT (Devlin et al., 2018) has models pre- rest to 0. trained in multiple languages. We used language- specific BERT models as part of German, French by sacrificing overall performance, which, in re- and Italian ensembles. turn, helps an ensemble to create more accurate predictions. 4.3 Sliding Window Initially, we used inverted class frequencies as As discussed earlier, documents in the corpus can loss weights. However, this approach turned out to be rather long, and typical Transformers cannot pro- be too aggressive (worse minority class and over- cess such documents at once. Therefore, instead all performance). Further, we experimented with of simply splitting the documents into smaller seg- increasing minority class (‘-’, ‘:’) weights. Ini- ments, sequences are overlapped for inference. In tial experiments showed showed that weights set other terms, a sliding window is applied, as sug- to three for minority classes and one for majority gested by Nguyen et al. (2019). Subsequently, the classes performed best on the development set. Our overlapped sequences are merged back together by approach is rather heuristic, and further experimen- discarding half of the overlapped tokens at the beg- tation may lead to better results. ging and end of each sequence. Our experiments have shown that an overlap of 40 tokens performs 4.5 Majority Vote Ensembling best. Consequently, we chose this overlap length for the final models. We did preliminary experiments in separate stack- ing models as mentioned in Wolpert (1992) as well 4.4 Weighted Loss as ensembling using the arithmetic average of class For the German, French and Italian ensembles, we probabilities of single models as described in Good- retrained the best performing model with weighted fellow et al. (2014). However, one technique was loss. We set the weights to three for the two shown to be more effective: majority vote ensem- least performing classes (‘-’, ‘:’) and left them bling. More concretely, all the models predict (i.e., unchanged for the other classes (i.e., a weight of vote) and the most voted label is then used as the one). The idea is to increase recall for these classes final prediction. In case of a tie, the least common Language Models Ensemble Electra Longformer MPNet XLNet English 0.940 0.934 0.940 0.937 0.943 Electra XLM-RoBERTa BERT Electra‡ German 0.954 0.952 0.950 0.953 0.955 Electra XLM-RoBERTa CamemBERT CamemBERT‡ French 0.923 0.926 0.930 0.928 0.933 Electra XLM-RoBERTa BERT Electra‡ Italian 0.922 0.918 0.918 0.919 0.926 Table 5: Positive class (sentence end) F1 results on the development set for all single models and the correspond- ing ensemble for sentence end prediction. Models marked with ‡ denote a model trained with weighted loss as described in subsection 4.4. Language Models Ensemble Electra Longformer MPNet XLNet English 0.769 0.760 0.768 0.763 0.777 Electra XLM-RoBERTa BERT Electra‡ German 0.803 0.795 0.792 0.805 0.812 Electra XLM-RoBERTa CamemBERT CamemBERT‡ French 0.758 0.761 0.769 0.770 0.778 Electra XLM-RoBERTa BERT Electra‡ Italian 0.746 0.732 0.741 0.739 0.755 Table 6: Macro F1 results on the development set for all single models and the corresponding ensemble for full- punctuation prediction. Models marked with ‡ denote a model trained with weighted loss as described in subsection 4.4. Development Test Surprise Test Language P R F1 P R F1 P R F1 English 0.93 0.96 0.94 0.93 0.95 0.94 0.84 0.75 0.80 German 0.95 0.96 0.96 0.95 0.96 0.96 0.89 0.77 0.82 French 0.92 0.94 0.93 0.92 0.94 0.93 0.82 0.72 0.77 Italian 0.91 0.95 0.93 0.90 0.95 0.93 0.83 0.71 0.77 Table 7: Ensembling positive class (sentence end) F1 results on the development, test and surprise test set for sentence end prediction. Development Test Surprise Test Language P R F1 P R F1 P R F1 English 0.82 0.75 0.78 0.81 0.75 0.77 0.65 0.59 0.62 German 0.82 0.80 0.81 0.82 0.80 0.81 0.66 0.65 0.65 French 0.80 0.76 0.78 0.78 0.77 0.77 0.63 0.60 0.61 Italian 0.77 0.74 0.76 0.77 0.74 0.75 0.57 0.55 0.56 Table 8: Ensembling Macro F1 results on the evaluation, test and surprise test set for punctuation prediction. label is chosen. Additionally, predictions for a hy- 5.3 Experimental Setup phen are counted twice – mainly to increase the We trained all of the models on a single T4 GPU performance for the worst-performing label (which instance. Our final models shared some of the hy- was the case for all languages). Our experiments on perparameters, namely a learning rate of 4e−5, a the development set have shown that this leads to an batch size of 16 (four for Longformer) and the max- increase of 1-2% Macro F1 score for all languages imum sequence length (512, 4096 for Longformer). compared to the single best performing model. We trained each model for five epochs. 5 System Architecture 6 Results & Discussion 5.1 Hyperparameter Setup Our results for sentence end prediction and full punctuation prediction can be seen in Table 7 and At the beginning of development, we empirically Table 8, respectively. They demonstrate the high ca- determined what characteristics of the model and pability of using Transformers in predicting punctu- fine-tuning correlate with better performance. For ation marks. Especially for sentence end prediction, fine-tuning, five epochs performed consistently the F1 scores are well above 90% for all languages. well for all transformer architectures. Due to the We hypothesize that it is because usage of sentence large document size, the larger the maximum se- end punctuation is less ambiguous – it is consis- quence length, the better the performance. To our tently and grammatically correctly used in the data. surprise, there were no significant differences be- For full punctuation prediction, the overall perfor- tween the performance of cased vs. uncased Trans- mance is significantly lower for all languages. The formers on our lower-cased data. full punctuation prediction task is more difficult not only because of the existence of more labels, 5.2 Technical Implementation but also because some of the labels might not fol- For the training of our models, we used the Simple low strict grammatical rules. For example ‘-’ or a Transformers 1 library, a wrapper for the Hugging ‘:’ can be used differently due to different styles Face 2 library, that allows for fast experimenting. of linguistic expressions, while a label such as a As the Simple Transformers library does not sup- comma might be misplaced due to human error. port weighted loss training, we have adapted the With respect to our system, sliding windows are relevant code for this purpose. a simple way to improve performance when an in- put sequence is much longer than what a model 1 https://simpletransformers.ai can actually process. However, this performance 2 https://huggingface.co gain is limited, and as of now, it is not clear how this compares to a model that can process much Xiaodong He, Li Deng, and Alex Acero. 2011. Why longer sequences. Observing results we have ob- word error rate is not a good metric for speech recognizer training for the speech translation task? tained from single models at Table 5 and 6 for both In 2011 IEEE International Conference on Acous- subtasks we can see that the model architecture tics, Speech and Signal Processing (ICASSP), pages has an effect on performance. Within our experi- 5632–5635. IEEE. ments, majority vote ensembling further enhances Philipp Koehn. 2005. Europarl: A parallel corpus for performance. statistical machine translation. In MT summit, vol- ume 5, pages 79–86. Citeseer. 7 Conclusion Louis Martin, Benjamin Muller, Pedro Javier Ortiz In this paper, we showed that the ensembling of Suárez, Yoann Dupont, Laurent Romary, Éric Ville- diversely trained Transformers can yield significant monte de la Clergerie, Djamé Seddah, and Benoı̂t improvement and allows for good generalisation Sagot. 2019. Camembert: a tasty french language for punctuation prediction in out-of-domain exam- model. arXiv preprint arXiv:1911.03894. ples. From this work, it can be seen that combin- Attila Nagy, Bence Bial, and Judit Ács. 2021. Au- ing different Transformers can be really beneficial. tomatic punctuation restoration with bert models. However, further work is needed to determine if arXiv preprint arXiv:2101.07343. more advanced ensembling techniques could fur- Binh Nguyen, Vu Bao Hung Nguyen, Hien ther increase the quality of the predictions. Nguyen, Pham Ngoc Phuong, The-Loc Nguyen, Quoc Truong Do, and Luong Chi Mai. 2019. Fast Acknowledgments and accurate capitalization and punctuation for automatic speech recognition using transformer We want to thank Simon Clematide and Phillip and chunk merging. In 2019 22nd Conference of Ströbel for their valuable inputs and the Departe- the Oriental COCOSDA International Committee ment of Computational Linguistics for providing for the Co-ordination and Standardisation of Speech Databases and Assessment Techniques us with the necessary technical infrastructure. (O-COCOSDA), pages 1–5. IEEE. Nils Reimers and Iryna Gurevych. 2020. Mak- References ing monolingual sentence embeddings multilin- gual using knowledge distillation. arXiv preprint Iz Beltagy, Matthew E Peters, and Arman Cohan. arXiv:2004.09813. 2020. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150. Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie- Yan Liu. 2020. Mpnet: Masked and permuted pre- Kevin Clark, Minh-Thang Luong, Quoc V Le, and training for language understanding. arXiv preprint Christopher D Manning. 2020. Electra: Pre-training arXiv:2004.09297. text encoders as discriminators rather than genera- tors. arXiv preprint arXiv:2003.10555. Monica Sunkara, Srikanth Ronanki, Kalpit Dixit, Sra- van Bodapati, and Katrin Kirchhoff. 2020. Robust Alexis Conneau, Kartikay Khandelwal, Naman Goyal, prediction of punctuation and truecasing for medical Vishrav Chaudhary, Guillaume Wenzek, Francisco asr. In Proceedings of the First Workshop on Natu- Guzmán, Edouard Grave, Myle Ott, Luke Zettle- ral Language Processing for Medical Conversations, moyer, and Veselin Stoyanov. 2019. Unsupervised pages 53–62. cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116. Máté Akos Tündik, György Szaszák, Gábor Gosztolya, and András Beke. 2018. User-centric evaluation of Maury Courtland, Adam Faulkner, and Gayle McEl- automatic punctuation in asr closed captioning. vain. 2020. Efficient automatic punctuation restora- tion using bidirectional transformers with robust in- David H Wolpert. 1992. Stacked generalization. Neu- ference. In Proceedings of the 17th International ral networks, 5(2):241–259. Conference on Spoken Language Translation, pages 272–279. Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Car- bonell, Ruslan Salakhutdinov, and Quoc V Le. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and 2019. Xlnet: Generalized autoregressive pretrain- Kristina Toutanova. 2018. Bert: Pre-training of deep ing for language understanding. arXiv preprint bidirectional transformers for language understand- arXiv:1906.08237. ing. arXiv preprint arXiv:1810.04805. Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. 2014. Explaining and harnessing adversar- ial examples. arXiv preprint arXiv:1412.6572.