=Paper=
{{Paper
|id=Vol-2957/sepp_paper4
|storemode=property
|title=FullStop: Multilingual Deep Models for Punctuation Prediction
|pdfUrl=https://ceur-ws.org/Vol-2957/sepp_paper4.pdf
|volume=Vol-2957
|authors=Oliver Guhr,Anne-Kathrin Schumann,Frank Bahrmann,Hans-Joachim Böhme
|dblpUrl=https://dblp.org/rec/conf/swisstext/GuhrSBB21
}}
==FullStop: Multilingual Deep Models for Punctuation Prediction==
FullStop: Multilingual Deep Models for Punctuation Prediction

Oliver Guhr (1), Anne-Kathrin Schumann (2), Frank Bahrmann (1), Hans-Joachim Böhme (1)
(1) University of Applied Sciences (HTW) Dresden, Germany
(2) t2k GmbH, Dresden, Germany
{oliver.guhr, frank.bahrmann, hans-joachim.boehme}@htw-dresden.de, anne-kathrin.schumann@text2knowledge.de

Abstract

This paper describes our contribution to the SEPP-NLG Shared Task on multilingual sentence segmentation and punctuation prediction. The goal of this task is to train NLP models that predict end-of-sentence (EOS) and punctuation marks in automatically generated or transcribed texts. We show that these tasks benefit from crosslingual transfer by successfully employing multilingual deep language models. Our multilingual model achieves an average F1-score of 0.94 for EOS prediction on English, German, French, and Italian texts and an average F1-score of 0.78 for punctuation mark prediction.

1 Introduction

The prediction of EOS and punctuation marks in automatically generated or transcribed texts is a relatively novel task. While sentence segmentation is a core, low-level natural language processing (NLP) task, punctuation has in the past primarily been studied in the context of error correction and the normalisation of automatic speech recognition (ASR) output. However, with the recent rise of conversational agents and other NLP systems that are able to generate new texts, the insertion of punctuation and EOS marks has gained wider interest. This is hardly surprising, because punctuation affects the readability of the text produced by an NLP system and, thus, its perceived overall performance. The SEPP-NLG Shared Task offers two subtasks, namely:

• Subtask 1 – Sentence segmentation: full-stop prediction on fully unpunctuated, lowercased documents.
• Subtask 2 – Punctuation prediction: prediction of all punctuation marks on fully unpunctuated, lowercased documents, where the possible punctuation marks are members of the set p = {",", ".", "?", ":", "-", "0"}, with "0" indicating no punctuation.

The task is carried out on the German, English, French, and Italian sections of the Europarl corpus (Koehn, 2005), since it offers transcripts of spoken texts for multiple languages. We developed models for both tasks based on the Transformers library by Wolf et al. (2020). These models and our code are publicly available at https://github.com/oliverguhr/fullstop-deep-punctuation-prediction.

Copyright ©2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
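To make the token-wise formulation of the two subtasks concrete, the following minimal example (our own illustration, not the shared task's actual file format) shows the per-token targets for a short lowercased, unpunctuated input:

<syntaxhighlight lang="python">
# Illustrative example only (not taken from the shared-task files): token-wise
# targets for the sentence "hello, how are you? fine, thanks." after it has
# been lowercased and stripped of all punctuation.

tokens = ["hello", "how", "are", "you", "fine", "thanks"]

# Subtask 1: binary EOS labels (1 = a sentence ends after this token).
subtask1_labels = [0, 0, 0, 1, 0, 1]

# Subtask 2: one label per token from the set {",", ".", "?", ":", "-", "0"},
# where "0" means that no punctuation mark follows the token.
subtask2_labels = [",", "0", "0", "?", ",", "."]

assert len(tokens) == len(subtask1_labels) == len(subtask2_labels)
</syntaxhighlight>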
2 Related Work

Earlier studies on EOS and punctuation prediction reflect the various fields of application of this technology. The task is mostly modeled as token-wise prediction. Over the last few years, consistent performance improvements have, unsurprisingly, been achieved with the help of neural network approaches and large-scale neural language models.

The work by Attia et al. (2014) constitutes a rather traditional approach to spelling and punctuation correction, in this case for Arabic. The authors report that punctuation errors constitute 40% of all errors in their data set. The task is modeled as token-wise classification with context windows varying between 4 and 8 words. Classification is carried out with Support Vector Machine and Conditional Random Field (CRF) classifiers, using part-of-speech (POS) and morphological information. The authors obtain their best result, an F1-score of 0.56, with the CRF classifier and a window size of five tokens.

Che et al. (2016) experiment with three different neural network architectures, using pretrained GloVe (Pennington et al., 2014) embeddings as inputs. Since their goal is to predict punctuation marks specifically on ASR output, they evaluate their models on ASR transcripts of TED talks. Predicting the positions of commas, periods, and question marks, their best result in this 4-class classification task is an F1-score of 0.54.

Treviso et al. (2017) study sentence segmentation, not punctuation, in narrative transcripts that were generated in the context of examining patients for symptoms of language-impairing dementia. They work on three different Portuguese data sets. The input data is modeled by means of POS features, word embeddings, and prosodic information. They then combine convolutional and recurrent neural network layers, achieving F1-scores between 0.7 and 0.8 on two evaluation data sets.

Schweter and Ahmed (2019) also experiment with the Europarl corpus; however, their task is different from the task presented here: they model only sentence segmentation by predicting, for each full stop in the input text, whether it is an EOS marker or part of another linguistic unit (for instance, an abbreviation). Predictions are produced by character-level models that are fed not only the token to disambiguate but also local context in the form of context windows. Working on a wide variety of languages, including often overlooked languages such as Bosnian, Greek, or Romanian, they achieve F1-scores between 0.98 and 0.99, with their BiLSTM model performing best on average.

Sunkara et al. (2020) also work in the clinical domain, more precisely on the output of medical ASR systems. They jointly model punctuation and truecasing by first predicting a punctuation sequence and then the case of each input word.
The authors use a pretrained transformer model (Devlin et al., 2019; Liu et al., 2019) in combination with subword embeddings to overcome lexical sparsity in the medical domain. They also carry out a fine-tuning step on medical data and a task adaptation step, randomly masking punctuation marks in the text, before training the actual model. Predicting full stops and commas, the authors achieve F1-scores of 0.81 (for commas) and 0.92 (for full stops) with Bio-BERT (Lee et al., 2019), which was trained on biomedical corpora.

3 Task and Data

The task consists in predicting EOS and punctuation marks on unpunctuated, lowercased text. The organizers of the SEPP-NLG shared task provided 470 MB of English, German, French, and Italian text. This data set consists of a training and a development set. For system ranking, a test set with in-domain texts and a surprise set with out-of-domain texts were used.

Figure 1 shows the distribution of the punctuation labels for subtask 2 for all languages. As can be seen from the figure, the distribution of the labels is quite skewed, even if we disregard the fact that the majority of tokens in each data set carries the label "0" (omitted in Figure 1 for better readability). All languages follow the same distribution pattern; however, they exhibit subtle differences. For instance, the difference in frequency between commas and full stops is particularly pronounced for German, and German in general has a higher proportion of commas, indicating complex sentence structures. For other language pairs, we observe slight differences in the distribution of hyphens and colons.

Figure 1: Distribution of punctuation labels for the four languages on the training sets of task 2 (label counts shown below). Mean document length varies between 10,378 words (Italian) and 12,275 words (French).

Label | German | English | French | Italian
? | 40,511 | 44,290 | 41,005 | 38,807
: | 51,192 | 43,133 | 46,128 | 55,080
- | 81,710 | 80,916 | 68,523 | 52,983
. | 1,290,282 | 1,396,166 | 1,223,802 | 1,138,669
, | 2,208,970 | 1,759,686 | 1,657,880 | 1,503,502

Earlier versions of subtask 2 also required predictions for the punctuation marks "!" and ";". During the training phase, the task organizers mapped these symbols to the full stop to account for strongly skewed distributions and potential HTML artefacts. Sentences containing punctuation symbols other than those already mentioned, parentheses for instance, were removed by the task organizers because not all instances of parentheses were well-formed (i.e. not every opening parenthesis was matched by a closing parenthesis). These issues leave avenues for future research.

4 Models

4.1 Baselines and Model Selection

The transformer architecture (Vaswani et al., 2017) and transfer learning with transformer-based language models (Devlin et al., 2019) have led to notable performance gains for many NLP tasks. For this reason, we focused our research on a transformer-based architecture, exploring a number of recent language models and multilingual transfer learning. Following earlier work, we modelled the task as token-wise prediction.

However, to assess the performance gain enabled by a transformer-based language model, we also trained a first, non-neural baseline for German sentence segmentation: a CRF model based on bag-of-words, POS, and local context (+/- 2 tokens) features. This model seemed to perform much better than the spaCy (https://spacy.io/) baseline provided for subtask 1 (https://sites.google.com/view/sentence-segmentation/); however, since it was outperformed by all transformer-based models by a large margin, we decided not to explore this direction any further.

As a second baseline, we trained a vanilla multilingual Bert model and explored techniques to improve on this baseline. In particular, we focused on three different options, namely data augmentation, hyperparameter optimization, and the selection of different architectures and pre-trained models. We also tested various preprocessing steps to remove special characters and HTML artefacts, but this had no significant effect on our results.

As a first step towards model selection, we trained a set of mono- and multilingual models on 10% of the training data for each task. We then selected the best models per language and the best multilingual model and trained them on the full training data set. This approach helped us to iterate quickly by avoiding long training times (up to 20 hours on a single GPU) just for model selection. We selected the following architectures for our tests:

• BERT (Devlin et al., 2019)
• DistilBERT (Sanh et al., 2019)
• Electra (Clark et al., 2020)
• RoBERTa (Liu et al., 2019)
• XLM-RoBERTa (Conneau et al., 2020)
• CamemBERT (Martin et al., 2020)

First experiments with data augmentation and hyperparameter optimization showed that these techniques had only a minor effect on the models' performance. All of our 10% and full models were trained for 3 epochs using Adafactor (Shazeer and Stern, 2018) with a learning rate of 4e-5 and a batch size of 8. Furthermore, we used 16-bit precision training to improve training speed. We ran hyperparameter optimizations with limited success; for more information, see the ablations in Section 7. We then focused on the selection of architectures and pretrained models.

Table 1: F1 scores of all base models trained on 10% of the language-specific data, or on 10% of all languages for the multilingual models. All models were trained for 3 epochs using Adafactor and a learning rate of 4e-5. For task 1, we report the F1 score of the EOS class; for task 2, the macro-average F1 of all classes.

Base Model | Task 1 F1 | Task 2 F1
English
distilbert-base-uncased | 0.849048 | 0.581294
google/electra-base-generator | 0.867502 | 0.426554
google/electra-small-generator | 0.872033 | 0.590815
bert-base-uncased | 0.885560 | 0.647669
google/electra-large-generator | 0.901298 | 0.558433
bert-large-uncased | 0.903943 | 0.699679
roberta-base | 0.921170 | 0.719705
xlm-roberta-large | 0.932057 | 0.740402
roberta-large | 0.935672 | 0.742778
German
bert-base-multilingual-uncased | 0.931668 | 0.708220
dbmdz/bert-base-german-uncased | 0.943437 | 0.746249
deepset/gbert-base | 0.943571 | 0.753979
german-nlp-group/electra-base-german-uncased | 0.950070 | 0.759387
French
bert-base-multilingual-uncased | 0.881648 | 0.658968
camembert-base | 0.914799 | 0.702187
camembert/camembert-large | 0.935436 | 0.756594
Italian
dbmdz/electra-base-italian-xxl-cased-generator | 0.866070 | 0.496291
bert-base-multilingual-uncased | 0.867798 | 0.586234
dbmdz/bert-base-italian-cased | 0.897765 | 0.658520
dbmdz/bert-base-italian-xxl-uncased | 0.910585 | 0.693615
Multilingual
bert-base-multilingual-uncased | 0.887909 | 0.683688
xlm-roberta-base | 0.915930 | 0.716822
xlm-roberta-large | 0.935946 | 0.753770

We trained a 10% and a 100% model for all architecture types to ensure that the architectures scale well with the increased data. Comparing the results in Table 1 and Table 3, we found that the models for task 1 gain between 0.1% and 1% by scaling from 10% to 100% of the data, and the models for task 2 gain between 3% and 5%.
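As an illustration of the training setup just described, the following sketch fine-tunes a token-classification model with the Transformers library using Adafactor, a learning rate of 4e-5, a batch size of 8, 3 epochs, and 16-bit precision. It is a minimal sketch, not the original training code: the tiny in-memory data set, its column names, and the encoding helper stand in for the shared-task data preparation.

<syntaxhighlight lang="python">
from datasets import Dataset
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          DataCollatorForTokenClassification, Trainer,
                          TrainingArguments)
from transformers.optimization import Adafactor

label_list = [",", ".", "?", ":", "-", "0"]   # subtask 2 label set
label2id = {label: i for i, label in enumerate(label_list)}
model_name = "xlm-roberta-large"              # one of the tested architectures

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(
    model_name, num_labels=len(label_list))

def encode(example):
    # Tokenize pre-split words and attach one label to the first sub-token of
    # each word; remaining sub-tokens and special tokens are masked with -100.
    enc = tokenizer(example["words"], is_split_into_words=True,
                    truncation=True, max_length=512)
    labels, previous = [], None
    for word_id in enc.word_ids():
        if word_id is None or word_id == previous:
            labels.append(-100)
        else:
            labels.append(label2id[example["labels"][word_id]])
        previous = word_id
    enc["labels"] = labels
    return enc

# Tiny stand-in corpus; the shared-task files would be read into the same shape.
raw = Dataset.from_dict({
    "words": [["hello", "how", "are", "you", "fine", "thanks"]],
    "labels": [[",", "0", "0", "?", ",", "."]],
})
train_data = raw.map(encode, remove_columns=["words", "labels"])

# Adafactor with a fixed learning rate of 4e-5, as used for all reported models.
optimizer = Adafactor(model.parameters(), lr=4e-5, scale_parameter=False,
                      relative_step=False, warmup_init=False)

args = TrainingArguments(output_dir="fullstop-sketch",
                         num_train_epochs=3,               # 3 epochs
                         per_device_train_batch_size=8,    # batch size of 8
                         fp16=True)                        # 16-bit precision (needs a GPU)

trainer = Trainer(model=model, args=args, train_dataset=train_data,
                  data_collator=DataCollatorForTokenClassification(tokenizer),
                  optimizers=(optimizer, None))
trainer.train()
</syntaxhighlight>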
4.2 Windowing Approach

All selected architectures are limited with respect to the number of tokens they can process, typically 512. Since most documents are longer than this limit (see Figure 1), we needed a strategy to handle longer sequences.

The simplest method is to split the text into chunks of 200 words before processing. The number of 200 words was chosen empirically to account for the fact that words can be tokenized into more than one token. The disadvantage of this approach is that it is inefficient, since most sequences will not utilize the full 512-token capacity of the model.

We therefore chose to first tokenize each document and then split it into sequences of 512 tokens. However, this approach, just like the first one, can produce sequences that start with the last word of a sentence or end with the first word of a sentence, giving the model no context for the prediction. To address this issue, we used a sliding window approach and ran experiments with different step sizes, similar to the stride parameter in convolutional neural networks. This method ensures that the model has additional context for making predictions. For training, we ran a grid search to find the optimal length of the overlapping window, using an English Bert base model on 10% of the data. Based on the results shown in Table 2, we chose an overlapping window size of 100 tokens for training our models. The loss was calculated for the whole sequence, including the overlapping part. Since this method also generates new training sequences, it also acts as a form of data augmentation.

Table 2: An overlap of 100 tokens between consecutive sequences improves the model's performance.

Overlapping Tokens | F1 Score Task 1
0 | 0.87893
10 | 0.87933
100 | 0.88556
200 | 0.88375
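The chunking scheme with overlapping context can be sketched as follows. This is our own minimal illustration of the described procedure; the function name is ours, and only the window and overlap sizes follow the paper.

<syntaxhighlight lang="python">
def split_with_overlap(token_ids, window_size=512, overlap=100):
    """Split a tokenized document into windows of at most `window_size` tokens,
    where consecutive windows share `overlap` tokens of context."""
    stride = window_size - overlap
    windows = []
    for start in range(0, max(len(token_ids) - overlap, 1), stride):
        windows.append(token_ids[start:start + window_size])
    return windows

# A 1,200-token document yields windows starting at positions 0, 412 and 824,
# each sharing 100 tokens with its predecessor.
document = list(range(1200))
for window in split_with_overlap(document):
    print(window[0], window[-1], len(window))
</syntaxhighlight>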
5 Results

Table 1 shows the results of the 10% model comparison training. All models that performed best on task 1 also performed best on task 2. For English, we selected two models, XLM-RoBERTa large and RoBERTa large, since their scores were about even. An Electra-based model achieved the best results for German, whereas, surprisingly, the English and Italian Electra models scored below the baseline Bert models. For French, we selected CamemBERT large, a 335-million-parameter RoBERTa-based model that scores notably better than CamemBERT base with its 110 million parameters. The digital library team at the Bavarian State Library (dbmdz) published two different Italian Bert-based models; the XXL version was trained on the larger corpus and achieved the best result. The multilingual XLM-RoBERTa base model achieved better scores than the older multilingual Bert model with the same number of parameters. The larger version of this model achieved the best multilingual score, on par with the language-specific models. Note that the scores of the multilingual models are evaluated on a multilingual development set.

We trained the selected models on the full training set for each task and evaluated them on the development sets. The results of this evaluation can be found in Table 3 for both subtasks 1 and 2. For both tasks, the large multilingual XLM-RoBERTa model outperformed all language-specific models. Therefore, we submitted our XLM-RoBERTa-based models for tasks 1 and 2. For Italian, the XLM-RoBERTa-based model scored notably better than the best language-specific model. However, for the other languages, the performance gains are not that significant. The scores of the German Electra-based model are comparable to those of XLM-RoBERTa, despite using 110 million parameters in contrast to the 550 million parameters of XLM-RoBERTa large. This indicates that there is room for possible performance improvements.

5.1 Final Models and Evaluation

Since the multilingual models outperformed almost all monolingual models, we selected these for subtasks 1 and 2. Furthermore, we submitted one smaller monolingual model to evaluate its performance on the test set and the out-of-domain test set (surprise test).

FullStop Multilingual Task 1: This model is based on the 550-million-parameter XLM-RoBERTa large model and was trained on the labeled data of task 1. Across all four languages, this model achieved an average F1 score of 0.94 on the test set and an average F1 score of 0.78 on the surprise test set.

FullStop German Task 1: This model is based on the 110-million-parameter German Electra base model. It was trained on the labeled data for task 1 and an additional data set consisting of speeches from the German parliament (Bundestag, 134 MB, https://github.com/Datenschule/offenesparlament-data) and a text crawl from the Leipzig corpora collection (245 MB, https://wortschatz.uni-leipzig.de/de/download/German), containing a mixture of news texts and Wikipedia articles. For German, this model achieved an F1 score of 0.95 on the test set and an F1 score of 0.80 on the surprise test set.

FullStop Multilingual Task 2: This model is also based on XLM-RoBERTa large and was trained on the labeled data for task 2. As shown in Figure 2 and Table 4, the model performs well on EOS marks across all languages. In contrast, the performance for colons and hyphens is lower. We suspect that this is due to the properties of the data set described in Section 3. We have seen that hyphens and colons are not only infrequent in the training data for all languages, they also exhibit unstable distribution patterns across languages. Intuitively, this is not surprising, as hyphens and colons are in many cases optional in the sense that they can be substituted by either a comma or a full stop, i.e. the rules for their usage are not only grammatical and syntactic but also stylistic. Performance increases might be achieved through targeted training with adversarial examples. The model achieves an average F1 of 0.78 on the test set. Similar to the other models, its performance degrades to an average F1 of 0.61 on the out-of-domain surprise set.

Inference on the complete test and surprise set (470 MB) takes about one hour for each multilingual FullStop model on an Nvidia 3090 GPU.
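For completeness, applying one of the released checkpoints amounts to running a token-classification pipeline over lowercased, unpunctuated text. The sketch below is ours; the checkpoint identifier is an assumption based on the repository linked in the introduction, and the exact label names returned by the model should be verified there.

<syntaxhighlight lang="python">
from transformers import pipeline

# The checkpoint identifier below is an assumption based on the repository
# linked in the introduction; check its README for the released model names.
model_id = "oliverguhr/fullstop-punctuation-multilang-large"

punctuate = pipeline("token-classification", model=model_id)

# The input must look like the shared-task data: lowercased and unpunctuated.
text = "hello how are you fine thanks"

for prediction in punctuate(text):
    # Each prediction carries a (sub)word, the predicted punctuation label
    # ("0" is expected to mean "no punctuation") and a confidence score.
    print(prediction["word"], prediction["entity"], round(prediction["score"], 3))
</syntaxhighlight>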
Table 3: All models for subtasks 1 and 2 were trained on the full data set for each language. For task 1, we report the F1 score of the sentence-end class; for task 2, the macro-average F1 score.

Model | Test Language | F1 Score Task 1 | F1 Score Task 2
roberta-large | EN | 0.941992 | 0.772326
xlm-roberta-large (English only) | EN | 0.938764 | 0.765496
electra-base-german-uncased | DE | 0.953894 | 0.795759
electra-base-german-uncased with data augmentation | DE | 0.954782 | –
camembert-large | FR | 0.937222 | 0.778617
bert-base-italian-xxl-uncased | IT | 0.919729 | 0.732624
xlm-roberta-large (multilingual) | EN | 0.945746 | 0.774601
xlm-roberta-large (multilingual) | DE | 0.958591 | 0.813861
xlm-roberta-large (multilingual) | FR | 0.941974 | 0.781834
xlm-roberta-large (multilingual) | IT | 0.934144 | 0.761775

Table 4: Per-class F1 scores of the FullStop Multilingual Task 2 model on the dev data set.

Label | EN | DE | FR | IT
, | 0.819 | 0.945 | 0.831 | 0.798
- | 0.425 | 0.435 | 0.431 | 0.421
. | 0.948 | 0.961 | 0.945 | 0.942
0 | 0.991 | 0.997 | 0.992 | 0.989
: | 0.575 | 0.652 | 0.620 | 0.588
? | 0.890 | 0.893 | 0.871 | 0.832
macro avg | 0.775 | 0.814 | 0.782 | 0.762

[Figure 2: Confusion matrices for the XLM-RoBERTa-based multilingual FullStop model for task 2. Note that all values are rounded.]

6 Key Findings

The type and amount of data used for pretraining has a significant impact on the final model's performance. Table 1 shows that, for Italian, there is a 5% difference in task 2 performance between the two monolingual Bert-based models. Both models use the same 110 million parameters of the Bert architecture but were trained on different corpus sizes: the "bert-base-italian-uncased" model was trained on a 13 GB corpus, whereas the "bert-base-italian-xxl-uncased" model was trained on an 81 GB corpus. The positive effect of larger corpus sizes on model performance has also been verified for other transformer architectures, for instance by Conneau et al. (2020) and Clark et al. (2020).

Model architectures do not work equally well for different languages. Electra is the best-performing monolingual German model, but for English and Italian, the results obtained with Electra are well behind those obtained with mono- and multilingual Bert models. We conducted a series of tests with different hyperparameters for the English Electra models but could not further improve the results.

Both tasks benefit from multilingual models and training data. To our surprise, the multilingual XLM-RoBERTa-based model outperformed all monolingual models, even though earlier multilingual Bert models were, in most cases, outperformed by their language-specific counterparts. We suspected that this could be explained by the much larger number of parameters used by XLM-RoBERTa large.
To test this hypothesis, we trained a monolingual English model based on XLM-RoBERTa and another English model based on the monolingual RoBERTa. As shown in Table 3, both models are outperformed by the multilingual XLM-RoBERTa model, showing that the model benefits from multilinguality. Although we have no direct explanation for the superior performance of the multilingual model, we would like to point out that it is in line with earlier work (Muller et al., 2021) confirming, for mBERT, that the lower layers of multilingual models act as multilingual encoders by representing linguistic knowledge for various languages. If this is true here as well, the larger number of multilingual training examples might indeed improve performance on the punctuation task. Our successful pruning experiments also point in this direction. However, these hypotheses need empirical validation.

Punctuation patterns are domain-specific, and robust punctuation prediction requires training on diverse data sets. The data set that we trained on (Europarl) consists of data from a single domain, i.e. political speeches. As our scores on the surprise set revealed, the performance of our models degrades on texts from other domains. The performance of our task 1 model drops from an average F1 of 0.94 across all languages on the in-domain test set to an F1 of 0.78 on the out-of-domain surprise set. The other models participating in the shared task suffer from similar performance degradations.

7 Ablations

What are the optimal hyperparameters for each model? We ran a hyperparameter search for the Adam optimizer using the Optuna framework (Akiba et al., 2019) with a budget of 200 trials on the German Electra base model. For the hyperparameter search, we configured the following search space: learning rates between 1e-2 and 1e-5, 1 to 5 training epochs, batch sizes from 2^2 to 2^7, weight decay from 1e-1 to 1e-12, and Adam epsilon from 1e-6 to 1e-10.
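The described search can be reproduced with Optuna roughly as follows. This is a minimal sketch under our own assumptions: the train_and_evaluate function is a placeholder for one fine-tuning run of the German Electra base model that returns the development F1 score.

<syntaxhighlight lang="python">
import random
import optuna

def train_and_evaluate(learning_rate, epochs, batch_size, weight_decay, adam_epsilon):
    # Placeholder: in the real setup this would fine-tune the German Electra
    # base model with Adam using the given hyperparameters and return the F1
    # score on the development set. It returns a dummy value here so that the
    # sketch runs on its own.
    return random.random()

def objective(trial):
    # Search space as reported above.
    params = {
        "learning_rate": trial.suggest_float("learning_rate", 1e-5, 1e-2, log=True),
        "epochs": trial.suggest_int("epochs", 1, 5),
        "batch_size": trial.suggest_categorical("batch_size", [4, 8, 16, 32, 64, 128]),
        "weight_decay": trial.suggest_float("weight_decay", 1e-12, 1e-1, log=True),
        "adam_epsilon": trial.suggest_float("adam_epsilon", 1e-10, 1e-6, log=True),
    }
    return train_and_evaluate(**params)

study = optuna.create_study(direction="maximize")  # maximize dev F1
study.optimize(objective, n_trials=200)            # budget of 200 trials
print(study.best_params)
</syntaxhighlight>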
We compared these settings with Adafactor (Shazeer and Stern, 2018), using a learning rate of 4e-5. For both optimizers, we trained models for tasks 1 and 2 on 10% of the training data. The results of this comparison are shown in Table 5. Adafactor matches the performance of Adam but eliminates the need for a time-consuming hyperparameter search; we therefore decided to use Adafactor for all models.

Table 5: In a comparison between Adam with optimized hyperparameters and Adafactor, we found only minor differences in the resulting F1 scores.

Task | Adafactor | Adam | Difference (Adafactor - Adam)
1 | 0.95007 | 0.95087 | -0.00080
2 | 0.75939 | 0.75587 | +0.00352

Is it possible to use one model for both tasks? The labels of task 2 are a superset of the labels of task 1; therefore, a model trained for task 2 can also be used for task 1. We remapped the classification output of the task 2 model by mapping the sentence-end labels "." and "?" to label 1 and all other labels to label 0. The results in Table 6 show that this method decreases the final scores only marginally. For many applications, it is therefore sufficient to train one model that processes all four languages for both tasks. For this shared task, we trained and submitted two different models, since a dedicated model for task 1 slightly improves the results.

Table 6: Comparison of the scores of the "FullStop Multilingual Task 1" model with the remapped output of the "FullStop Multilingual Task 2" model (remapped to match the labels of task 1). The remapping leads to a slightly decreased F1 score.

Language | Task 1 Model | Remapped Task 2 Model
en | 0.945746 | 0.941686
de | 0.958591 | 0.955926
fr | 0.941974 | 0.938254
it | 0.934144 | 0.930851
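The remapping from task 2 output to task 1 labels is a simple transformation; the following snippet is our own illustration of it.

<syntaxhighlight lang="python">
def task2_to_task1(task2_labels):
    """Map task 2 punctuation labels to binary task 1 EOS labels:
    "." and "?" mark a sentence end (1), every other label maps to 0."""
    return [1 if label in (".", "?") else 0 for label in task2_labels]

# Example: predictions for "hello, how are you? fine, thanks."
print(task2_to_task1([",", "0", "0", "?", ",", "."]))   # -> [0, 0, 0, 1, 0, 1]
</syntaxhighlight>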
Do we need a deep model for these tasks? For the purpose of the shared task, we did not aim at optimizing inference and training efficiency. However, we tested whether it is necessary to use all 12 layers of a Bert base model. To this end, we trained a set of models on 10% of the English data using 3, 6, 9, and 12 layers on task 1. To keep the results comparable, we used the same hyperparameters as for all other models, described in Section 4. The results in Table 7 show that with this simple layer-pruning approach it is possible to retain 99% of the model's performance while removing the last quarter of the layers. We suggest exploring more advanced optimization techniques in further studies.

Table 7: F1 scores of a pruned Bert base model at various levels of pruning. Scores are for an English model trained on 10% of the data.

Layers | Parameters | F1 Score Task 1
3 | 45,102,338 | 0.74758
6 | 66,365,954 | 0.84408
9 | 87,629,570 | 0.87776
12 | 108,893,186 | 0.88556
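One way to realize this kind of layer pruning with the Transformers library is to truncate the encoder's module list before fine-tuning. The sketch below is our own illustration for a Bert base token classifier and is not necessarily the exact code used for these experiments.

<syntaxhighlight lang="python">
from transformers import AutoModelForTokenClassification

def prune_bert_layers(model_name="bert-base-uncased", keep_layers=9, num_labels=2):
    """Load a BERT-based token classifier and keep only the lowest
    `keep_layers` transformer layers of its encoder."""
    model = AutoModelForTokenClassification.from_pretrained(
        model_name, num_labels=num_labels)
    # Truncate the stack of encoder layers and update the config so that the
    # model reports the pruned depth.
    model.bert.encoder.layer = model.bert.encoder.layer[:keep_layers]
    model.config.num_hidden_layers = keep_layers
    return model

model = prune_bert_layers(keep_layers=9)
# About 87.6 million parameters remain for 9 layers, in line with Table 7.
print(sum(p.numel() for p in model.parameters()))
</syntaxhighlight>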
8 Conclusion

In this paper, we have shown that transformer-based architectures can be successfully applied to the tasks of punctuation mark and sentence end prediction. To our surprise, monolingual models are outperformed by multilingual models, showing that these models can transfer knowledge across languages. For the future, we plan to improve on two main aspects. Firstly, we want to reduce the size of our models. Both "FullStop Multilingual" models use 550 million parameters, which leads to computationally expensive inference. In our ablations, we have demonstrated a first approach to reducing the number of parameters. Secondly, we would like to improve the out-of-domain performance of our models. The shared task surprise set showed that there is a performance degradation on texts from unseen domains. We will address this issue in future research.

Acknowledgments

This research has been funded by the European Social Fund (ESF), SAB grant number 100339497, and the European Regional Development Fund (ERDF) (ERDF-100346119). Anne-Kathrin Schumann has received funding through the SAB's technology startup scholarship (Technologiegründerstipendium).

References

Takuya Akiba, Shotaro Sano, Toshihiko Yanase, Takeru Ohta, and Masanori Koyama. 2019. Optuna: A next-generation hyperparameter optimization framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.

Mohammed Attia, Mohamed Al-Badrashiny, and Mona Diab. 2014. GWU-HASP: Hybrid Arabic Spelling and Punctuation Corrector. In Proceedings of the EMNLP 2014 Workshop on Arabic Natural Language Processing (ANLP), pages 148–154. Association for Computational Linguistics.

Xiaoyin Che, Cheng Wang, Haojin Yang, and Christoph Meinel. 2016. Punctuation Prediction for Unsegmented Transcript Based on Word Vector. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), pages 654–658. European Language Resources Association (ELRA).

Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. 2020. ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators. In ICLR.

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised Cross-lingual Representation Learning at Scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8440–8451. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics.

Philipp Koehn. 2005. Europarl: A Parallel Corpus for Statistical Machine Translation. In Proceedings of the 10th Machine Translation Summit, pages 79–86. AAMT.

Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. 2019. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 36(4):1234–1240.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. http://arxiv.org/abs/1907.11692.

Louis Martin, Benjamin Muller, Pedro Javier Ortiz Suárez, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah, and Benoît Sagot. 2020. CamemBERT: a Tasty French Language Model. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7203–7219. Association for Computational Linguistics.

Benjamin Muller, Yanai Elazar, Benoît Sagot, and Djamé Seddah. 2021. First Align, then Predict: Understanding the Cross-Lingual Ability of Multilingual BERT. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 2214–2231. Association for Computational Linguistics.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP 2014), pages 1532–1543. Association for Computational Linguistics.

Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. CoRR, abs/1910.01108.

Stefan Schweter and Sajawel Ahmed. 2019. Deep-EOS: General-Purpose Neural Networks for Sentence Boundary Detection. In Proceedings of the 15th Conference on Natural Language Processing (KONVENS 2019), pages 251–255.

Noam Shazeer and Mitchell Stern. 2018. Adafactor: Adaptive learning rates with sublinear memory cost. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 4596–4604. PMLR.

Monica Sunkara, Srikanth Ronanki, Kalpit Dixit, Sravan Bodapati, and Katrin Kirchhoff. 2020. Robust Prediction of Punctuation and Truecasing for Medical ASR. In Proceedings of the 1st Workshop on NLP for Medical Conversations, pages 53–62. Association for Computational Linguistics.

Marcos Vinícius Treviso, Christopher Shulby, and Sandra Maria Aluísio. 2017. Sentence Segmentation in Narrative Transcripts from Neuropsychological Tests using Recurrent Convolutional Neural Networks. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 315–325. Association for Computational Linguistics.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2020. Transformers: State-of-the-art Natural Language Processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics.