KLUMSy@KIPoS: Experiments on Part-of-Speech Tagging of Spoken Italian

Thomas Proisl
Computational Corpus Linguistics Group
Friedrich-Alexander-Universität Erlangen-Nürnberg
Bismarckstr. 6, 91054 Erlangen, Germany
thomas.proisl@fau.de

Gabriella Lapesa
Institute for Natural Language Processing
Universität Stuttgart
Pfaffenwaldring 5 b, 70569 Stuttgart, Germany
gabriella.lapesa@ims.uni-stuttgart.de

Abstract

In this paper, we describe experiments on part-of-speech tagging of spoken Italian that we conducted in the context of the EVALITA 2020 KIPoS shared task (Bosco et al., 2020). Our submission to the shared task is based on SoMeWeTa (Proisl, 2018), a tagger which supports domain adaptation and is designed to flexibly incorporate external resources. We document our approach and discuss our results in the shared task, along with a statistical analysis of the factors which impact performance the most. Additionally, we report on a set of additional experiments involving the combination of a neural language model with an unsupervised HMM, and compare its performance to that of our system.

1 Introduction

Part-of-speech taggers trained on standard newspaper texts usually perform relatively poorly on spoken language or on written communication that is “conceptually oral”, e.g. tweets or chat messages. The challenges of spoken language include non-standard lexis, e.g. the use of colloquial and dialectal forms, and non-standard syntax, e.g. false starts, repetitions, incomplete sentences and the use of fillers. To make things worse, the amount of training data available for spoken language – or non-standard varieties in general – is usually several orders of magnitude smaller than for the usual newspaper corpora. One strategy for coping with this is to incorporate additional resources, e.g. lexica or distributional information obtained from large amounts of unannotated text. Another strategy is domain adaptation, i.e. leveraging existing written standard corpora to pre-train an out-of-domain tagger model and then adapting that model to the target domain using a small amount of in-domain data.

We experiment with these ideas in the context of the EVALITA 2020 shared task on part-of-speech tagging of spoken Italian (Bosco et al., 2020; Basile et al., 2020). The data of the shared task have been drawn from the KIParla corpus (Mauri et al., 2019) and consist of manually annotated training and test datasets and a silver dataset that has been automatically tagged by the task organizers using a UDPipe model (http://ufal.mff.cuni.cz/udpipe/1) trained on all Italian treebanks in the Universal Dependencies (UD) project (https://universaldependencies.org/). While the silver dataset is annotated with the standard UD tagset (as are the corpora on which the tagger has been trained), the training and test sets use an extended version where tags can optionally be assigned one of two subcategories: .DIA for dialectal forms and .LIN for foreign words.

2 Additional resources

2.1 Corpora

We use a collection of plain text corpora to compute Brown clusters (Brown et al., 1992) that the tagger can use as an additional resource.

Ideally, we would use large amounts of transcribed speech for the present task. Since no such dataset exists, we try to use corpora that come close. The closest to authentic speech is scripted speech, therefore we use the Italian movie subtitles from the OpenSubtitles corpus (Lison and Tiedemann, 2016; http://opus.nlpl.eu/OpenSubtitles-v2018.php). Computer-mediated communication, e.g. in social media, sometimes exhibits features that are typical of spoken language use. Therefore, we also use a collection of roughly 11.7 million Italian tweets and ca. 2.7 million Reddit posts (submissions and comments) from the years 2011–2018. We extracted the Reddit posts from Jason Baumgartner’s collection of Reddit submissions and comments (https://files.pushshift.io/reddit/) using the processing pipeline by Blombach et al. (2020). Additionally, we include all Italian corpora from the Universal Dependencies project and, to further increase the amount of data, a number of web corpora: the PAISÀ corpus of Italian texts from the web (Lyding et al., 2014; http://www.corpusitaliano.it/), the text of the Italian Wikimedia dumps (https://dumps.wikimedia.org/), i.e. Wiki(pedia|books|news|versity|voyage), as extracted by Wikipedia Extractor (http://medialab.di.unipi.it/wiki/Wikipedia_Extractor), and the Italian subset of OSCAR, a huge multilingual Common Crawl corpus (Ortiz Suárez et al., 2019; https://oscar-corpus.com/).

We tokenize and sentence-split all corpora using UDPipe trained on the union of all Italian UD corpora and remove all duplicate sentences. The sizes of the resulting corpora are given in Table 1. As final preprocessing steps, we lowercase all words and normalize numbers, user mentions, email addresses and URLs. Finally, we use the implementation by Liang (2005) (https://github.com/percyliang/brown-cluster/) to compute 1,000 Brown clusters with a minimum frequency of 5 (a sketch of these steps follows Table 1).

corpus          complete        deduplicated
oscar           –               13,787,307,218
opensubtitles   795,250,711     378,348,061
paisa           282,631,297     258,679,965
reddit          112,735,958     105,274,620
tweets          152,496,728     148,031,020
ud              672,929         615,057
wiki            578,425,024     560,863,691
wikibooks       12,106,499      11,825,870
wikinews        2,744,317       2,583,135
wikiversity     5,766,859       5,365,924
wikivoyage      3,911,881       3,825,872

Table 1: Sizes of the additional corpora in tokens. OSCAR is already deduplicated on the line level.
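As an illustration of these final preprocessing steps (lowercasing, normalization and deduplication), a minimal sketch is shown below. It is not our exact pipeline: the regular expressions, placeholder tokens and function names are chosen for the example only, and the clustering itself is performed afterwards with Liang's brown-cluster implementation on the normalized one-sentence-per-line files.

```python
# Minimal sketch of the normalization and deduplication steps; the regular
# expressions and placeholder tokens are illustrative, not the exact ones used.
import re

URL = re.compile(r"https?://\S+|www\.\S+")
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
MENTION = re.compile(r"@\w+")
NUMBER = re.compile(r"\d+(?:[.,]\d+)*")

def normalize(sentence: str) -> str:
    """Lowercase a tokenized sentence and replace variable token types."""
    sentence = sentence.lower()
    sentence = URL.sub("<url>", sentence)
    sentence = EMAIL.sub("<email>", sentence)
    sentence = MENTION.sub("<mention>", sentence)
    sentence = NUMBER.sub("<number>", sentence)
    return sentence

def deduplicate(sentences):
    """Drop duplicate sentences while keeping the original order."""
    seen = set()
    for sentence in sentences:
        if sentence not in seen:
            seen.add(sentence)
            yield sentence

if __name__ == "__main__":
    sample = ["Scrivi a mario.rossi@example.com entro il 31.12.2020 !",
              "Scrivi a mario.rossi@example.com entro il 31.12.2020 !"]
    for s in deduplicate(map(normalize, sample)):
        print(s)   # scrivi a <email> entro il <number> !
```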
2.2 Morphological lexicon

We incorporate linguistic knowledge in the form of Morph-it! (Zanchetta and Baroni, 2005; https://docs.sslmit.unibo.it/doku.php?id=resources:morph-it), a morphological lexicon for Italian that contains morphological analyses of roughly 505,000 word forms corresponding to about 35,000 lemmata. In its analyses, Morph-it! distinguishes between derivational features and inflectional features; in total, there are 664 unique feature combinations. We simplify the analyses by stripping away all inflectional features and some of the derivational features, i.e. gender (for articles, nouns and pronouns) and person and number (for pronouns). This results in 39 coarse-grained categories that correspond to major word classes, with some finer distinctions for determiners and pronouns.
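To illustrate the kind of mapping we apply, the sketch below assumes the Morph-it! convention that inflectional features follow a colon and that derivational features are hyphen-separated (e.g. NOUN-M:s). The feature inventories, file handling and function names are simplified for the example and do not reproduce the exact mapping that yields the 39 categories.

```python
# Rough sketch of the tag simplification; the feature inventories below are
# illustrative and smaller than the real Morph-it! feature set.
GENDER = {"M", "F"}
PERSON_NUMBER = {"1", "2", "3", "S", "P"}

def simplify(analysis: str) -> str:
    """Map a full Morph-it! analysis to a coarse-grained category."""
    derivational = analysis.split(":", 1)[0]          # strip inflectional features
    features = derivational.split("-")
    pos = features[0]
    if pos in {"ART", "NOUN"}:                        # drop gender
        features = [f for f in features if f not in GENDER]
    elif pos.startswith("PRO"):                       # drop gender, person and number
        features = [f for f in features if f not in GENDER | PERSON_NUMBER]
    return "-".join(features)

def load_lexicon(path: str) -> dict:
    """Read the tab-separated Morph-it! file (word form, lemma, analysis)."""
    lexicon = {}
    with open(path, encoding="utf-8") as f:           # older releases may use Latin-1
        for line in f:
            parts = line.rstrip("\n").split("\t")
            if len(parts) == 3:
                form, _lemma, analysis = parts
                lexicon.setdefault(form, set()).add(simplify(analysis))
    return lexicon

print(simplify("NOUN-M:s"))              # NOUN
print(simplify("PRO-PERS-CLI-1-M-S"))    # PRO-PERS-CLI
```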
3 System description

For our submission to the shared task we use SoMeWeTa (Proisl, 2018; https://github.com/tsproisl/SoMeWeTa), a tagger that is based on the averaged structured perceptron, supports domain adaptation and can incorporate external resources such as Brown clusters and lexica. Its ability to make use of existing linguistic resources allows the tagger to achieve competitive results even with relatively small amounts of in-domain training data, which is particularly useful for non-standard varieties or under-resourced languages (Kabashi and Proisl, 2018; Proisl et al., 2019).

We participate in all three subtasks: the main subtask, where we use all the available silver and training data; subtask A, where we only use the data from the formal register; and subtask B, where we only use the informal data. The training scheme is the same for all three subtasks. First, we train preliminary models on the silver data provided by the task organizers. Keep in mind that the silver dataset has been automatically tagged; therefore, it is annotated with the standard version of the UD tagset rather than with the extended one used in the shared task, and it contains a certain amount of tagging errors. Nevertheless, the dataset provides the tagger with (imperfect) domain-specific background knowledge. In the next step, we adapt the silver models to the union of the Italian UD treebanks, i.e. to high-quality but out-of-domain data. In the final step, we adapt the models to spoken Italian using the manually annotated training data. In every step we train for 12 iterations using a search beam size of 10 and provide the tagger with the Brown clusters and the Morph-it!-based lexicon (Section 2).
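The three training stages can be scripted, for example, by driving SoMeWeTa's command-line interface from Python as sketched below. The file names are placeholders, and the option names (--train, --prior, --brown, --lexicon, --iterations, --beam-size) reflect our reading of the SoMeWeTa documentation rather than a verified invocation, so they should be checked against `somewe-tagger --help`.

```python
# Sketch of the three-stage training scheme; file names are placeholders and
# the option names should be double-checked against the installed SoMeWeTa version.
import subprocess

RESOURCES = ["--brown", "brown_clusters.txt", "--lexicon", "morphit_coarse.txt",
             "--iterations", "12", "--beam-size", "10"]

def train(model: str, corpus: str, prior: str = None) -> None:
    """Train a SoMeWeTa model, optionally adapting an existing (prior) model."""
    cmd = ["somewe-tagger", "--train", model] + RESOURCES
    if prior is not None:
        cmd += ["--prior", prior]
    subprocess.run(cmd + [corpus], check=True)

# Stage 1: preliminary model on the automatically tagged silver data
train("silver.model", "kipos_silver.txt")
# Stage 2: adapt to the union of the Italian UD treebanks (high quality, out of domain)
train("ud.model", "it_ud_all.txt", prior="silver.model")
# Stage 3: adapt to spoken Italian using the manually annotated KIPoS training data
train("kipos.model", "kipos_train.txt", prior="ud.model")
```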
4 Evaluation

4.1 Data preparation and evaluation results

The silver data, the training data and the data from the UD treebanks follow UD tokenization guidelines, i.e. contractions such as parlarmi (parlar+mi) ‘to talk+to me’ or della (di+la) ‘of+the’ are split into their constituents for annotation. This is not the case for the test data, where contractions have to be assigned a joint tag, e.g. VERB_PRON or ADP_A. Therefore, we run the test data through the UDPipe tokenizer from Section 2.1, tag the resulting tokens and merge the tags for all tokens that have been split (a sketch of this merging step follows Table 2).

Table 2 shows the results on the two test sets. (Unfortunately, when preparing our submission, we did not notice that contractions of prepositions (ADP) and determiners (DET) have to be tagged as ADP_A. As a consequence, we mis-tagged all these contractions as ADP_DET. For reference, the results of our faulty submission on the formal/informal test sets are: main 87.56/88.24, subA 87.37/87.58, subB 87.81/88.11.) On the main task, SoMeWeTa performs reasonably well, only 1–1.4 points worse than the fine-tuned UmBERTo model by Tamburini (2020). On subtasks A and B, it even outperforms that system by a considerable margin.

task   system             formal   informal
main   corrected          92.12    90.11
       gold tokens        92.31    90.66
       Tamburini (2020)   93.49    91.13
subA   corrected          91.92    89.45
       gold tokens        92.12    89.97
       Tamburini (2020)   86.47    83.16
subB   corrected          92.37    89.97
       gold tokens        92.54    90.53
       Tamburini (2020)   89.74    89.52

Table 2: Accuracy scores for our submissions in two variants: (i) with ADP_DET corrected to ADP_A and (ii) based on the true token boundaries instead of on UDPipe tokens.
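The merging step described above is straightforward if the tagged test data is available in CoNLL-U format, where contractions appear as multiword token ranges followed by their parts; the sketch below makes that assumption. Function names and the example are illustrative, and the final mapping of ADP_DET to ADP_A corresponds to the correction mentioned in the parenthetical note above.

```python
# Sketch of merging the tags of split contractions from tagged CoNLL-U data.
def merge_tags(conllu_lines):
    """Yield (surface form, merged tag) pairs, joining the tags of split contractions."""
    rows = [line.rstrip("\n").split("\t") for line in conllu_lines
            if line.strip() and not line.startswith("#")]
    upos = {row[0]: row[3] for row in rows if "-" not in row[0]}
    covered = 0
    for row in rows:
        token_id, form = row[0], row[1]
        if "-" in token_id:                              # multiword token line, e.g. "2-3  della"
            start, end = (int(i) for i in token_id.split("-"))
            tag = "_".join(upos[str(i)] for i in range(start, end + 1))
            # the shared task expects ADP_A for preposition+article contractions
            yield form, tag.replace("ADP_DET", "ADP_A")
            covered = end
        elif token_id.isdigit() and int(token_id) > covered:   # ordinary single-word token
            yield form, upos[token_id]

example = [
    "1\tparlo\tparlare\tVERB\t_\t_\t_\t_\t_\t_",
    "2-3\tdella\t_\t_\t_\t_\t_\t_\t_\t_",
    "2\tdi\tdi\tADP\t_\t_\t_\t_\t_\t_",
    "3\tla\til\tDET\t_\t_\t_\t_\t_\t_",
]
print(list(merge_tags(example)))   # [('parlo', 'VERB'), ('della', 'ADP_A')]
```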
4.2 Mining tagging accuracy

To get a better insight into the impact of the different experimental variables involved in this study, we carried out feature ablation experiments which targeted the different components of our system, namely the different combinations of training and test data (formal vs. informal) and the different additional resources described in Section 2 (use of Brown clusters, Morph-it!, silver data, and UD corpora). We then carried out a linear regression analysis with tagging accuracy as the dependent variable and the different experimental parameters as independent variables (predictors). We follow the methodology outlined in Lapesa and Evert (2014) and quantify the impact of a specific predictor (e.g. the use of Brown clusters) as the amount of variance in the dependent variable (tagging accuracy) it accounts for. We considered the following experimental parameters as predictors:

• setup: training/test setup; this predictor encodes the combination of training and test data and has the following values: all_formal (i.e. trained on the full set, tested on formal), all_informal, formal_formal, formal_informal, informal_formal, informal_informal
• silver: use of silver data during training (yes, no)
• ud: use of UD corpora during training (yes, no)
• morph: use of Morph-it! (yes, no)
• brown: use of Brown clusters (yes, no)

We tested all possible configurations, i.e. all combinations of the parameters described above, and, to account for random effects during training, ran each configuration 10 times. This resulted in 960 experimental runs, each corresponding to a single datapoint in our regression analysis. Given that it is reasonable to assume that specific parameter values will influence the performance of other parameters (e.g., the use of Morph-it! could boost performance but only if larger corpora are employed), we also test all the 2-way interactions. As a sanity check, we also introduce the number of an experimental run as a predictor (1 to 10, as a categorical variable), in the hope, obviously, of finding no effect for it. Summing up, our regression equation looks as follows:

    accuracy ~ (setup + silver + ud + morph + brown + run)^2

We ran the regression analysis in R; the equation follows R syntax, in which “^2” denotes all pairwise interactions of the predictors between parentheses.
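As noted, the analysis itself was run in R. An analogous fit in Python with statsmodels and patsy is sketched below; the data frame, file and column names are assumptions made for the illustration.

```python
# Sketch of the regression fit; in patsy, (a + b + ...)**2 expands to all main
# effects plus all 2-way interactions, mirroring the R formula above.
import pandas as pd
import statsmodels.formula.api as smf

runs = pd.read_csv("experimental_runs.csv")   # 960 rows: accuracy plus parameter settings
runs["run"] = runs["run"].astype(str)         # treat the run number as categorical

model = smf.ols(
    "accuracy ~ (setup + silver + ud + morph + brown + run)**2",
    data=runs,
).fit()
print(model.rsquared_adj)                     # adjusted R-squared (95.2% in our analysis)
print(model.summary())
```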
Unsurprisingly, our model achieves an excellent fit to the data, quantified in an adjusted R-squared of 95.2%. Table 3 lists all significant predictors and interactions, along with their explained variance. Explained variance quantifies the portion of the total R-squared that a specific parameter (or interaction) is responsible for and can be straightforwardly interpreted as the impact that the manipulation of a specific parameter has on the accuracy of our tagger. Reassuringly, we found no effect of the experimental run. All other predictors, and all the corresponding interactions, turned out to be highly significant (with one minor exception). The biggest role is played by the setup variable, which alone accounts for 42.06%. Using UD corpora in training also has a strong impact, with a strong interaction involving the use of silver data (6.00% R-squared). Further strong interactions are found between brown and morph, and brown and ud, probably suggesting that introducing a 3-way interaction would be appropriate here. Given the increased complexity, however, this extension is left for future work.

Predictor       Explained variance
setup           42.06 ***
silver           8.62 ***
ud              12.63 ***
brown            8.76 ***
morph            7.17 ***
setup:silver     1.21 ***
setup:ud         1.08 ***
setup:brown      0.42 ***
setup:morph      0.50 ***
silver:ud        6.00 ***
silver:brown     0.39 ***
silver:morph     1.98 ***
ud:brown         0.03 *
ud:morph         2.48 ***
brown:morph      2.44 ***

Table 3: Regression on tagging accuracy: predictors and explained variance. Adj. R-squared: 95.2%. Sign. thresholds: ***: 0.001; *: 0.05.

Now that we have established which parameters or interactions have the strongest impact on model performance, it is time to ask which parameter values ensure the best performance. In our case, given that the system can be assembled incrementally (adding external resources and training data to a basic configuration), asking what the best parameter values are amounts to determining whether, for example, the addition of Brown clusters improves performance or is detrimental. Note that the significance of the brown predictor in the regression analysis already tells us that the predictor affects performance, ruling out the possibility that it has no impact at all. To visualize the effects in the linear model, we follow Lapesa and Evert (2014) and employ effect displays, which show the partial effect of one or two parameters by marginalizing over all other parameters. Unlike coefficient estimates, they allow an intuitive interpretation of the effect sizes of categorical variables irrespective of the dummy coding scheme used.

Let us start with the strongest predictor, setup, in its strongest interaction, the one with silver. Figure 1 displays the predicted accuracies resulting from the different parameter combinations of the two predictors. Note that, given the excellent fit of the regression model, we can assume predicted accuracy to be a reliable estimate of actual accuracy. Also, note that while we are visualizing the predicted accuracy of a 2-way interaction, we are actually displaying the effect of the individual terms (setup and silver) and of the interaction (setup:silver) jointly. We observe that, unsurprisingly, independently of the use of silver data, training on the whole dataset ensures the best performance on both the formal and informal test sets. The use of silver data (pink line) improves performance, but with differences across the training/test setups. Interestingly, using the silver data makes the performance gap between the models trained on the whole dataset and those trained on just the informal dataset negligible. Surprisingly, we observe that the best performance on the formal test set is predicted when the informal set is used for training. Further experiments on the complementarity of the two subtasks are needed to clarify this contradiction.

[Figure 1: Interaction: setup and silver data. Effect display of predicted accuracy for the six training/test setups, with and without silver data.]

Figure 2 displays the interaction between the use of UD corpora and the integration of Morph-it! in SoMeWeTa. Note that the performance gaps are smaller here than in the previous interaction: this is no surprise, given the smaller explanatory power (explained variance) of the parameters and interactions involved. Morph-it! produces substantial improvements, but again to a lesser extent if UD corpora are employed: this could either be due to a lower coverage of Morph-it! on the UD corpora, or to the boost in model robustness produced by the introduction of a larger training set. The steep slope of the blue line with respect to the pink one suggests that the presence of a morphological lexicon like Morph-it! can compensate for the lack of training data. Let us conclude with the third strongest interaction, the one between the use of Brown clusters and the use of Morph-it!, not shown here for space constraints. It is strikingly similar to the one in Figure 2: Morph-it! improves performance overall, and the steeper improvement in the absence of the Brown clusters suggests that the quality of the information encoded in Morph-it! can compensate for the lack of external resources.

[Figure 2: Interaction: UD corpora and Morph-it!. Effect display of predicted accuracy with and without UD corpora, with and without Morph-it!.]

In sum, our analysis supports the starting assumption that in a low-resource setting like the one of KIPoS, integrating additional, focussed resources consistently improves performance.

5 Additional experiments: RoBERTa with unsupervised HMM

Fine-tuned neural language models have been extremely successful in all areas of natural language processing (NLP). Not only can language models trained on huge amounts of plain text be fine-tuned to all NLP tasks, they have also been shown to learn certain linguistic abstractions (Tenney et al., 2019). At least that seems to be the case for English. Languages that are typologically different from English are both more difficult to model with current architectures (Mielke et al., 2019) and seem to be more challenging when it comes to learning linguistic abstractions (Ravfogel et al., 2018). In the experiment described in this section, we extend a state-of-the-art language model architecture to explicitly model part-of-speech information. To this end, we combine a RoBERTa language model (Liu et al., 2019) with an unsupervised neural hidden Markov model (HMM) for part-of-speech induction.

The architecture of the unsupervised HMM follows the LSTM-based variant described by Tran et al. (2016). We directly use the negative logarithm of the observation likelihood determined by the backward algorithm as an additional loss for the language model. The embeddings of the best tag sequence (determined using the Viterbi algorithm) are added to the word embeddings before feeding them into the language model.
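A heavily simplified sketch of this combination is shown below. It is not our actual implementation: Tran et al. (2016) parameterize transitions and emissions with neural networks (including an LSTM variant), whereas the sketch uses plain parameter matrices, computes the sequence likelihood with the forward recursion (which yields the same marginal as the backward algorithm), and stands in a dummy loss for the real RoBERTa masked-language-modelling loss. All names and sizes are illustrative.

```python
# Simplified sketch of an unsupervised neural HMM combined with a language model.
import torch
import torch.nn as nn

class NeuralHMM(nn.Module):
    """Unsupervised HMM over word ids with trainable start, transition and emission scores."""

    def __init__(self, num_tags: int, vocab_size: int):
        super().__init__()
        self.start = nn.Parameter(torch.zeros(num_tags))
        self.trans = nn.Parameter(torch.zeros(num_tags, num_tags))
        self.emit = nn.Parameter(torch.zeros(num_tags, vocab_size))

    def _log_probs(self, words):
        start = self.start.log_softmax(-1)               # (K,)
        trans = self.trans.log_softmax(-1)               # (K, K), row = previous tag
        emit = self.emit.log_softmax(-1)[:, words].t()   # (T, K)
        return start, trans, emit

    def neg_log_likelihood(self, words):
        """Negative log marginal likelihood of the observed words (forward recursion)."""
        start, trans, emit = self._log_probs(words)
        alpha = start + emit[0]
        for t in range(1, emit.size(0)):
            alpha = torch.logsumexp(alpha.unsqueeze(1) + trans, dim=0) + emit[t]
        return -torch.logsumexp(alpha, dim=0)

    def viterbi(self, words):
        """Most probable tag sequence for the observed words."""
        start, trans, emit = self._log_probs(words)
        score, backptrs = start + emit[0], []
        for t in range(1, emit.size(0)):
            best, idx = (score.unsqueeze(1) + trans).max(dim=0)
            score = best + emit[t]
            backptrs.append(idx)
        tags = [int(score.argmax())]
        for idx in reversed(backptrs):
            tags.append(int(idx[tags[-1]]))
        return list(reversed(tags))


# Toy usage: the HMM likelihood becomes an additional loss term and the embeddings
# of the Viterbi tags are added to the word embeddings fed into the language model.
vocab_size, num_tags, hidden = 1000, 17, 64
hmm = NeuralHMM(num_tags, vocab_size)
word_emb = nn.Embedding(vocab_size, hidden)
tag_emb = nn.Embedding(num_tags, hidden)

words = torch.randint(0, vocab_size, (12,))       # one token sequence
hmm_loss = hmm.neg_log_likelihood(words)          # additional loss for the LM
tags = torch.tensor(hmm.viterbi(words))
inputs = word_emb(words) + tag_emb(tags)          # would be fed into RoBERTa
lm_loss = inputs.pow(2).mean()                    # stand-in for the real MLM loss
(lm_loss + hmm_loss).backward()
```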
Due to time and resource constraints, we opt for a small to medium-sized model with a total of 45.5 million trainable parameters and train it on 1.9 billion tokens of text (the corpora described in Section 2.1, excluding OSCAR). We use the RoBERTa implementation from the transformers library (https://github.com/huggingface/transformers) with 6 hidden layers, 8 attention heads, a hidden size of 512 and an intermediate size of 2048. The model variant with the unsupervised HMM totals 48.7 million trainable parameters. We pre-train and fine-tune both models with the same set of parameters: pre-training for 100,000 steps with a batch size of 500, a peak learning rate of 5 × 10⁻⁴, 6,000 warm-up steps and dropout set to 0.1; fine-tuning to the KIPoS task on the entire training data for 4 epochs with a batch size of 32 and a learning rate of 3 × 10⁻⁴.

The results are summarized in Table 4. Due to the small model size and the relatively small amount of training data, the performance of both models is below SoMeWeTa’s. Keep in mind that state-of-the-art language models for Italian like UmBERTo (https://github.com/musixmatchresearch/umberto) or GilBERTo (https://github.com/idb-ita/GilBERTo) are based on the same RoBERTa architecture but feature roughly three times as many parameters and have been trained on an order of magnitude more data. However, the experiment is successful insofar as explicitly modelling part-of-speech information using an unsupervised HMM gives modest gains on both test sets. On the union of the two test sets, this corresponds to a statistically significant improvement from 89.84 to 90.42 (McNemar mid-p test: p = 0.0133; a sketch of the test follows Table 4).

model         formal   informal
RoBERTa       91.28    88.46
RoBERTa+HMM   91.84    89.05

Table 4: Results for RoBERTa and for RoBERTa with an additional unsupervised HMM.
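The significance test can be computed directly from the paired per-token predictions. The sketch below assumes aligned lists of gold tags and the two systems' predictions over the union of the test sets, and uses the common mid-p definition (exact two-sided binomial p-value minus the point probability of the observed discordant count); the function name is illustrative.

```python
# Sketch of a McNemar mid-p test on paired per-token correctness.
from scipy.stats import binom

def mcnemar_midp(gold, pred_a, pred_b) -> float:
    """Two-sided mid-p McNemar test comparing two taggers on the same tokens."""
    b = sum(a == g != p for g, a, p in zip(gold, pred_a, pred_b))  # only system A correct
    c = sum(p == g != a for g, a, p in zip(gold, pred_a, pred_b))  # only system B correct
    n, k = b + c, min(b, c)
    midp = 2 * binom.cdf(k, n, 0.5) - binom.pmf(k, n, 0.5)
    return min(float(midp), 1.0)
```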
6 Conclusion

This paper started out with the assumption that in low-resource scenarios like the KIPoS shared task, the integration of additional resources such as lexica (in our case, Morph-it!) and distributional information from larger corpora (in our case, the Brown clusters) can compensate for the lack of large amounts of training data. Moreover, our strategy also built on the assumption that in a low-resource scenario domain adaptation would be a winning strategy, as it would enable us to exploit larger training sets for written language (out of domain) and then fine-tune the tagger on spoken language (in domain). The results of our experiments and the insights gathered from the statistical analysis indicate that both assumptions hold, as far as our contribution to the KIPoS shared task is concerned. In subtasks A and B, where only half the amount of training data was available, this strategy even outperformed a fine-tuned state-of-the-art neural language model. Further work is needed to assess the complementarity of the error profiles of the different configurations, also taking into account the neural architectures evaluated in Section 5.

References

Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro. 2020. EVALITA 2020: Overview of the 7th evaluation campaign of natural language processing and speech tools for Italian. In Proc. of EVALITA, Online. CEUR.org.

Andreas Blombach, Natalie Dykes, Philipp Heinrich, Besim Kabashi, and Thomas Proisl. 2020. A corpus of German Reddit exchanges (GeRedE). In Proc. of LREC, pages 6310–6316, Marseille. ELRA.

Cristina Bosco, Silvia Ballarè, Massimo Cerruti, Eugenio Goria, and Caterina Mauri. 2020. KIPoS@EVALITA2020: Overview of the task on KIParla part of speech tagging. In Proc. of EVALITA. CEUR.org.

Peter F. Brown, Vincent J. Della Pietra, Peter V. de Souza, Jennifer C. Lai, and Robert L. Mercer. 1992. Class-based n-gram models of natural language. Computational Linguistics, 18(4):467–479.

Besim Kabashi and Thomas Proisl. 2018. Albanian part-of-speech tagging: Gold standard and evaluation. In Proc. of LREC, pages 2593–2599, Miyazaki. ELRA.

Gabriella Lapesa and Stefan Evert. 2014. A large scale evaluation of distributional semantic models: Parameters, interactions and model selection. TACL, 2:531–546.

Percy Liang. 2005. Semi-supervised learning for natural language. Master’s thesis, Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science.

Pierre Lison and Jörg Tiedemann. 2016. OpenSubtitles2016: Extracting large parallel corpora from movie and TV subtitles. In Proc. of LREC, pages 923–929, Portorož. ELRA.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach.

Verena Lyding, Egon Stemle, Claudia Borghetti, Marco Brunello, Sara Castagnoli, Felice Dell’Orletta, Henrik Dittmann, Alessandro Lenci, and Vito Pirrelli. 2014. The PAISÀ corpus of Italian web texts. In Proc. of WaC-9, pages 36–43, Gothenburg. ACL.

Caterina Mauri, Silvia Ballarè, Eugenio Goria, Massimo Cerruti, and Francesco Suriano. 2019. KIParla corpus: A new resource for spoken Italian. In Proc. of CLiC-it, Bari.

Sabrina J. Mielke, Ryan Cotterell, Kyle Gorman, Brian Roark, and Jason Eisner. 2019. What kind of language is hard to language-model? In Proc. of ACL, pages 4975–4989, Florence. ACL.

Pedro Javier Ortiz Suárez, Benoît Sagot, and Laurent Romary. 2019. Asynchronous pipelines for processing huge corpora on medium to low resource infrastructures. In Proc. of CMLC-7, pages 9–16, Mannheim. Leibniz-Institut für Deutsche Sprache.

Thomas Proisl, Peter Uhrig, Philipp Heinrich, Andreas Blombach, Sefora Mammerella, Natalie Dykes, and Besim Kabashi. 2019. The_Illiterati: Part-of-speech tagging for Magahi and Bhojpuri without even knowing the alphabet. In Proc. of NSURL, Trento.

Thomas Proisl. 2018. SoMeWeTa: A part-of-speech tagger for German social media and web texts. In Proc. of LREC, pages 665–670, Miyazaki. ELRA.

Shauli Ravfogel, Yoav Goldberg, and Francis Tyers. 2018. Can LSTM learn to capture agreement? The case of Basque. In Proc. of BlackboxNLP, pages 98–107, Brussels. ACL.

Fabio Tamburini. 2020. UniBO@KIPoS: Fine-tuning the Italian “BERTology” for the EVALITA 2020 KIPOS task. In Proc. of EVALITA. CEUR.org.

Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019. BERT rediscovers the classical NLP pipeline. In Proc. of ACL, pages 4593–4601, Florence. ACL.

Ke M. Tran, Yonatan Bisk, Ashish Vaswani, Daniel Marcu, and Kevin Knight. 2016. Unsupervised neural hidden Markov models. In Proc. of the Workshop on Structured Prediction for NLP, pages 63–71, Austin, TX. ACL.

Eros Zanchetta and Marco Baroni. 2005. Morph-it! A free corpus-based morphological resource for the Italian language. In Proc. of Corpus Linguistics, Birmingham.