UmBERTo-MTSA @ AcCompl-It: Improving Complexity and Acceptability Prediction with Multi-task Learning on Self-Supervised Annotations Gabriele Sarti Department of Mathematics and Geoscience, University of Trieste International School for Advanced Studies (SISSA), Trieste, Italy gsarti@sissa.it Abstract English. This work describes a self- 1 Introduction supervised data augmentation approach In recent times, pre-trained neural language mod- used to improve learning models’ perfor- els (NLMs) have become the preferred approach mances when only a moderate amount of for language representation learning, pushing the labeled data is available. Multiple copies state-of-the-art in multiple NLP tasks (Devlin et al. of the original model are initially trained (2019); Radford et al. (2019); Yang et al. (2019); on the downstream task. Their predic- Raffel et al. (2019) inter alia). These approaches tions are then used to annotate a large rely on a two-step training process: first, a self- set of unlabeled examples. Finally, multi- supervised pre-training is performed on large- task training is performed on the par- scale corpora; then, the model undergoes a super- allel annotations of the resulting train- vised fine-tuning on downstream task labels using ing set, and final scores are obtained by task-specific prediction heads. While this method averaging annotator-specific head predic- was found to be effective in scenarios where a rel- tions. Neural language models are fine- atively large amount of labeled data are present, tuned using this procedure in the con- researchers highlighted that this is not the case in text of the AcCompl-it shared task at low-resource settings (Yogatama et al., 2019). EVALITA 2020, obtaining considerable Recently, pattern-exploiting training (PET, improvements in prediction quality. Schick and Schutze (2020a,b) tackles the depen- dence of NLMs on labeled data by first reformu- Italiano. Questo articolo descrive un ap- lating tasks as cloze questions using task-related proccio di self-supervised data augmenta- patterns and keywords, and then using language tion utilizzabile al fine di migliorare le per- models trained on those to annotate large sets of formance di algoritmi di apprendimento unlabeled examples with soft labels. PET can be su task aventi solo una modesta quantità thought of as an offline version of knowledge dis- di dati annotati. Inizialmente, molteplici tillation (Hinton et al., 2015), which is a well- copie del modello originale vengono al- established approach to transfer the knowledge lenate sul task prescelto. Le loro pre- across models of different size, or even between visioni vengono poi utilizzate per anno- different versions of the same model as in self- tare grandi quantità di esempi non etichet- training (Scudder, 1965; Yarowsky, 1995). While tati. In conclusione, un approccio di multi- effective on classification tasks that can be easily task training viene utilizzato, con le an- reformulated as cloze questions, PET cannot be notazioni del dataset risultante in veste di easily extended to regression settings since they task indipendenti, per ottenere previsioni cannot be adequately verbalized. Contemporary finali come medie dei i punteggi dei sin- work by Du et al. (2020) showed how self-training goli annotatori. Questa procedura è stata and pre-training provide complementary informa- utilizzata per allenare modelli del linguag- tion for natural language understanding tasks. gio neurali per lo shared task AcCompl-it Copyright c 2020 for this paper by its authors. Use a EVALITA 2020, ottenendo ampi miglio- permitted under Creative Commons License Attribution 4.0 ramenti nella qualità predittiva. International (CC BY 4.0). In this paper, I propose a simple self-supervised probability distributions after the softmax, which data augmentation approach that can be used to are typically used in the knowledge distillation lit- improve the generalization capabilities of NLMs erature, to keep the approach simple while making on regression and classification tasks for modest- it viable in the context of regression tasks. sized labeled corpora. In short, an ensemble of Now that the large corpus is annotated, a multi- fine-tuned models is used to annotate a large cor- task NLM M T M : xi → ẏi1 . . . ẏik is fine-tuned on pus of unlabeled text, and new annotations are U 0 by treating each annotation in the set ŷ 01 . . . ŷ 0k leveraged in a multi-task setting to obtain final as a separate task, using 1-layer feed-forward neu- predictions over the original test set. The method ral networks as task-specific heads while perform- was tested on the AcCompl-it shared tasks of the ing hard parameter sharing (Caruana, 1997) on EVALITA 2020 campaign (Brunato et al., 2020b; underlying model parameters. Intuitively, the k Basile et al., 2020), where the objective was to models used to produce annotations were trained predict respectively complexity and acceptability on different folds of the original corpus, and as scores on a 1-7 Likert scale for each test sen- such, they provide complementary viewpoints on tence, alongside an estimation of its standard er- the modeled phenomenon when k is small. ror. Results show considerable improvements As a final step, M T M is fine-tuned on a over regular fine-tuning performances on COMPL training portion of L, using as prediction scores and ACCEPT using the UmBERTo pre-trained f (ẏi1 . . . ẏik ), where f is a task and context- model (Francia et al., 2020), suggesting the valid- dependent aggregation function. For example, in ity of this approach for complexity/acceptability the case of a classification task, one can select the prediction and possibly other language processing majority vote from the ensemble of model heads tasks. as the final prediction, while in a regression set- ting this can be done by averaging scores across 2 Description of the Approach heads. Once fine-tuned, the model can be tested Let: on the test portion of L using the same f as the aggregator. I refer to this approach as Multi-Task • L = [(x1 , y1 ), . . . (xn , yn )] be the initial la- Self-Annotation (MTSA) in the following sections. beled corpus containing sentence-annotation pairs xi ∈ X, yi ∈ Yx . 1 3 Experimental Evaluation For the experimental evaluation part: • U = [x01 , . . . x0m ] be a large unlabeled corpus such that m  n • The ACCEPT and COMPL training corpora, containing respectively 1339 and 2012 sen- • M : xi → ŷi be a pre-trained neural language tences labeled with average scores and stan- model with a single task-specific heads, tak- dard error across annotators, were used as la- ing sentence xi as input and predicting label beled datasets LA , LC . The two tasks were yi at inference time. learned separately, following the same ap- For some k ∈ N1 , we begin by splitting L proach described in the previous section. in k equal-sized segments L1 , . . . , Lk and fine- • A set of multiple Italian treebanks includ- tune k identical versions of M using k-fold ing train, dev, and test sets of the Ital- cross-validation. We call the resulting models ian Stanford Dependency Treebank (Bosco M 1 , . . . , M k “NLMs with standard fine-tuning on et al., 2013), the Turin University Paral- the y target task”, with M i being trained on the lel Treebank (Sanguinetti and Bosco, 2015), subset L − Li and evaluated on Li . Then, each PoSTWITA-UD (Sanguinetti et al., 2018) sentence of U is passed to each model, obtaining and the Venice Italian Treebank (Delmonte the corpus et al., 2007) was used as unlabeled corpus U. The final corpus contains 37,344 unlabeled U 0 = [(x01 , ŷ101 . . . ŷ10k ), . . . , (x0m , ŷm 01 0k . . . ŷm )] (1) sentences and spans multiple textual genres. labeled with expert annotations from fine-tuned • The UmBERTo model (Francia et al., 2020) models. Predicted values are taken instead of available through the HuggingFace’s Trans- 1 yi can be either discrete or continuous in this context. formers framework (Wolf et al., 2019) was Model Score (ρ) Error (ρ) to obtain a sentence-level representation instead of using the [CLS] token. During the training on the UmBERTo surprisal -0.36 0.17 whole unlabeled corpus, the evaluation steps were Length (# of tokens) -0.39 0.17 increased to 100 to balance evaluation time with Length (characters) -0.39 0.21 the corpus’s increased size. UmBERTo fine-tuned 0.90 0.50 UmBERTo-STSA 0.91 0.53 4 Results UmBERTo-MTSA 0.91 0.54 Table 1 presents methods for which the correla- UmBERTo surprisal 0.49 0.28 tion between values and complexity scores was Length (# of tokens) 0.55 0.36 tested on the training portion of the ACCEPT Length (characters) 0.60 0.39 and COMPL tasks with 5-fold cross validation, UmBERTo fine-tuned 0.84 0.54 leading to the selection of MTSA as the top- UmBERTo-STSA 0.87 0.62 performing approach: UmBERTo-MTSA 0.88 0.63 • UmBERTo surprisal: Sentence-level sur- Table 1: Spearman’s correlation scores on the AC- prisal estimates are produced using the pre- CEPT (top) and COMPL (bottom) subtasks’ train- trained model without fine-tuning as: ing portions. Models are evaluated using 5-fold m cross-validation. All scores have p < 0.001 Y P (x) = P (wi |w1:i−1 , wi+1:m ) (2) i=1 used both for fine-tuning M 1...k during the • Length (# of tokens): Length of the sentence annotation part and for fine-tuning M T M . in number of tokens The model is based on the RoBERTa archi- tecture (Liu et al., 2019) and was pre-trained • Length (characters): Length of the sentence on the Italian portion of the OSCAR Com- in number of characters (including whites- monCrawl corpus (Ortiz Suárez et al., 2020), paces) containing roughly 210M sentences and over 11B tokens. • UmBERTo fine-tuned: Predictions pro- duced by Umberto with standard fine-tuning Since both tasks involve predicting both aver- on complexity corpus annotations. aged scores and the original standard error across participants, the approach presented in the previ- • UmBERTo-STSA: A variant of the MTSA ous section was adapted to account for multi-task approach where instead of performing multi- learning of scores and errors from the beginning, task learning over model annotations on U, with each model M i producing both a predicted we average them in a single score, and the score ŷ 0i and a predicted error ˆ0i for the annota- model is trained on it with single-task fine- tion step. The k parameter was set to 5 to prevent tuning. excessive overlapping of training data across mod- • UmBERTo-MTSA: The approach presented els, with the final multi-task model M T M : xi → in this work. ẏi1 . . . yi5 , 1i . . . 5i returning prediction for scores and errors for all the five sets of fine-tuned model From Table 1, it can be observed that, although annotations. length alone is already correlated with accept- Models M 1...k were trained for a maximum of ability complexity scores, UmBERTo can lever- 15 epochs on the labeled training sets using early age additional information from its representation stopping (5 patience steps, 20 evaluation steps us- to produce much stronger predictions. Interest- ing a 10% slice as dev set), learning rate λ = 1e−5 , ingly, both the STSA and MTSA self-annotation batch size b = 32 and embedding dropout δ = 0.1. approaches consistently outperform regular fine- The model’s base variant was used, having a hid- tuning, especially for what concerns standard er- den size |h| = 768, and a maximum sequence ror scores. This fact suggests that self-annotation length of 128. Notably, the representations at the leads to better generalization capabilities in the last layer of the UmBERTo model were averaged model over downstream tasks when relatively few Model Score (ρ) Error (ρ) Acceptability Complexity ρ(y ) ρ( ) ρ(y ) ρ( ) SVM 2-gram baseline 0.30 0.35 UmBERTo-MTSA 0.88 0.52 avg. score (y) -25% 10% 41% -2% std. error () 12% 2% 23% 27% SVM length baseline 0.50 0.33 upos dist PROPN 19% -3% 4% 6% UmBERTo-MTSA 0.83 0.51 dep dist nmod 19% -8% 4% 1% avg max depth 16% -3% 7% -7% n prep chains 16% -8% 4% -2% Table 2: Correlation scores with gold labels on the prep chain len 16% -6% 9% -4% ACCEPT (top) and COMPL (bottom) subtasks’ upos dist PRON 1% 20% 8% 9% test portions. All scores have p < 0.001. dep dist root -9% 18% -4% 23% dep dist punct -9% 17% 1% -3% aux mood dist Imp 7% 6% 17% 7% n tokens 9% -13% 5% -18% annotations are available. While the contribu- avg links len -3% 1% -6% -17% max links len -1% -9% -1% -16% tion of multi-task learning is modest, the MTSA approach may prove especially beneficial when training models M 1...k on scores produced by dif- Table 3: Pearson’s correlation scores between pre- ferent annotators instead of using different folds of diction errors and various linguistic features. Or- the same corpus, as in this case. In both cases, pre- ange and cyan cells contain respectively positive dicted surprisal scores act as poor predictors for and negative scores for which p < 0.001. downstream tasks. It should also be noted that length appears to be negatively correlated to ac- standard errors, respectively. Table 3 presents the ceptability scores (i.e. longer sentences are gener- results of the error analysis. ally less acceptable), while the relation is positive in the case of complexity (i.e. longer sentences are Strongly correlated values in Table 3 corre- generally more complex). spond to features that highly influence, either Table 2 reports the scores obtained by MTSA positively or negatively, the prediction capabili- over the test sets for the ACCEPT and the COMPL ties of the MTSA model. Extreme task scores shared tasks. The organizers’ baseline scores cor- (avg. score), denoting either not very acceptable respond to the correlation among gold labels and or highly complex sentences, are less predictable acceptability and complexity predictions produced than their average counterparts by MTSA. Sen- by an SVM model trained on 1-grams and bi- tences for whose the standard deviation of scores grams of sentences and an SVM trained on sen- is high across participants appear to be less pre- tence length, respectively. The MTSA approach dictable in the context of complexity scores, while achieved the first rank in both tasks, with consid- this does not affect acceptability predictions. erable improvements over baseline scores. Concerning acceptability, I found a significant correlation between acceptability prediction er- 5 Error Analysis rors and the presence of multilevel syntactic struc- tures, (avg max depth) multiple long preposi- Finally, some error analysis is performed to gain tional chains (n prep chains, prep chain len) and additional insights on which factors influence nominal modifiers (dep dist nmod). From the the predictability of complexity and acceptabil- complexity viewpoint, instead, the presence of ity judgments. The Profiling-UD tool by Brunato inflectional morphology related to the imperfect et al. (2020a) is used to produce linguistic anno- tense in auxiliaries (aux mood dist Imp) was the tations on test sentences for both tasks. Given only property related to higher prediction errors. an input sentence, Profiling-UD produces roughly However, high token counts (n tokens) and long ∼ 100 numeric scores representing different phe- dependency links (avg links len, max links len) nomena and properties at different language lev- were shown to make the variability in complexity els.2 I then correlate the value of all features with scores more predictable. y and  , representing the mean absolute error Overall, results suggest that incorporating syn- between true and predicted values for scores and tactic information during the model’s training pro- 2 A description of produced annotations is omitted for cess may further improve complexity and accept- brevity. Refer to Brunato et al. (2020a) for additional details. ability models. 6 Discussion and Conclusion and Speech Tools for Italian. Final Workshop (EVALITA 2020), Online. CEUR.org. This work introduced a simple and effective data augmentation approach improving the fine-tuning Cristina Bosco, Simonetta Montemagni, and performances of NLMs when only a modest Maria Simi. 2013. Converting Italian treebanks: amount of labeled data is available. The approach Towards an Italian Stanford dependency tree- was first formalized and then empirically tested bank. In Proceedings of the 7th Linguistic on the ACCEPT and COMPL shared tasks of the Annotation Workshop and Interoperability with EVALITA 2020 campaign. Strong performances Discourse, pages 61–69, Sofia, Bulgaria. Asso- were reported for both acceptability and complex- ciation for Computational Linguistics. ity prediction using a multi-task self-training ap- Dominique Brunato, Andrea Cimino, Felice proach, obtaining the top position in both sub- Dell’Orletta, Giulia Venturi, and Simonetta tasks. Finally, an error analysis highlighted the Montemagni. 2020a. Profiling-UD: a tool for unpredictability of extreme scores and sentences linguistic profiling of texts. In Proceedings having complex syntactic structures. of The 12th Language Resources and Evalua- The suggested approach, although computa- tion Conference, pages 7147–7153, Marseille, tionally refined and well-performing, is lacking France. European Language Resources Associ- in terms of complexity-driven biases that may ation. prove useful in the context of complexity and ac- Dominique Brunato, Chesi Cristiano, Felice ceptability prediction. A possible extension of Dell’Orletta, Simonetta Montemagni, Giu- this work may include a complementary syntac- lia Venturi, and Roberto Zamparelli. 2020b. tic task (e.g., biaffine parsing, as in Glavas and AcCompl-it @ EVALITA2020: Overview of Vulic (2020)) during multi-task learning to see if the acceptability complexity evaluation task forcing syntactically-competent representations in for italian. In Proceedings of Seventh Evalua- the top layers may prove beneficial in the context tion Campaign of Natural Language Processing of syntax-heavy tasks like complexity and accept- and Speech Tools for Italian. Final Workshop ability prediction. Moreover, it would be interest- (EVALITA 2020), Online. CEUR.org. ing to evaluate multi-task learning performances with complexity and acceptability parallel annota- Rich Caruana. 1997. Multitask learning. Machine tions given the conceptual similarity between the Learning, 28:41–75. two tasks and estimate the effectiveness of a feed- Rodolfo Delmonte, Antonella Bristot, and Sara forward network as the final aggregator f in the Tonelli. 2007. VIT–venice italian treebank: MTSA paradigm instead of merely averaging pre- syntactic and quantitative features. dictions. Finally, Du et al. (2020) findings suggest Jacob Devlin, Ming-Wei Chang, Kenton Lee, and that using an unsupervised in-domain filtering ap- Kristina Toutanova. 2019. BERT: Pre-training proach may further improve the self-training pro- of deep bidirectional transformers for language cedure when large unlabeled corpora are available. understanding. In Proceedings of the 2019 Con- ference of the North American Chapter of the Acknowledgments Association for Computational Linguistics: Hu- The author was supported by a scholarship for man Language Technologies, Volume 1 (Long Data Science and Scientific Computing students and Short Papers), pages 4171–4186, Min- from the International School of Advanced Stud- neapolis, Minnesota. Association for Computa- ies (SISSA). tional Linguistics. Jingfei Du, E. Grave, Beliz Gunel, Vishrav Chaud- References hary, Onur Çelebi, M. Auli, Ves Stoyanov, and Valerio Basile, Danilo Croce, Maria Di Maro, Alexis Conneau. 2020. Self-training improves and Lucia C. Passaro. 2020. EVALITA 2020: pre-training for natural language understanding. Overview of the 7th evaluation campaign of ArXiv, abs/2010.02194. natural language processing and speech tools Simone Francia, Loreto Parisi, and Magnani for italian. In Proceedings of Seventh Evalua- Paolo. 2020. UmBERTo: an italian language tion Campaign of Natural Language Processing model trained with whole word maskings. Goran Glavas and Ivan Vulic. 2020. Is supervised models are also few-shot learners. ArXiv, syntactic parsing beneficial for language under- abs/2009.07118. standing? an empirical investigation. ArXiv, H Scudder. 1965. Probability of error of abs/2008.06788. some adaptive pattern-recognition machines. Geoffrey E. Hinton, Oriol Vinyals, and J. Dean. IEEE Transactions on Information Theory, 2015. Distilling the knowledge in a neural net- 11(3):363–371. work. ArXiv, abs/1503.02531. Thomas Wolf, Lysandre Debut, Victor Sanh, Y. Liu, Myle Ott, Naman Goyal, Jingfei Du, Julien Chaumond, Clement Delangue, An- Mandar Joshi, Danqi Chen, Omer Levy, thony Moi, Pierric Cistac, Tim Rault, R’emi M. Lewis, L. Zettlemoyer, and V. Stoyanov. Louf, Morgan Funtowicz, and Jamie Brew. 2019. RoBERTa: A robustly optimized bert pre- 2019. Huggingface’s transformers: State-of- training approach. ArXiv, abs/1907.11692. the-art natural language processing. ArXiv, abs/1910.03771. Pedro Javier Ortiz Suárez, Laurent Romary, and Z. Yang, Zihang Dai, Y. Yang, J. Carbonell, Benoı̂t Sagot. 2020. A monolingual approach R. Salakhutdinov, and Quoc V. Le. 2019. XL- to contextualized word embeddings for mid- Net: Generalized autoregressive pretraining for resource languages. In Proceedings of the 58th language understanding. In NeurIPS. Annual Meeting of the Association for Compu- tational Linguistics, pages 1703–1714, Online. David Yarowsky. 1995. Unsupervised word sense Association for Computational Linguistics. disambiguation rivaling supervised methods. In 33rd Annual Meeting of the Association A. Radford, Jeffrey Wu, R. Child, David Luan, for Computational Linguistics, pages 189–196, Dario Amodei, and Ilya Sutskever. 2019. Lan- Cambridge, Massachusetts, USA. Association guage models are unsupervised multitask learn- for Computational Linguistics. ers. OpenAI. Dani Yogatama, Cyprien de Masson d’Autume, Colin Raffel, Noam Shazeer, Adam Roberts, J. Connor, Tomás Kociský, M. Chrzanowski, Katherine Lee, Sharan Narang, Michael Lingpeng Kong, A. Lazaridou, W. Ling, L. Yu, Matena, Yanqi Zhou, W. Li, and P. Liu. 2019. Chris Dyer, and P. Blunsom. 2019. Learning Exploring the limits of transfer learning with and evaluating general linguistic intelligence. a unified text-to-text transformer. ArXiv, ArXiv, abs/1901.11373. abs/1910.10683. Manuela Sanguinetti and Cristina Bosco. 2015. PartTUT: The Turin University Parallel Tree- bank, pages 51–69. Springer International Pub- lishing, Cham. Manuela Sanguinetti, Cristina Bosco, Alberto Lavelli, Alessandro Mazzei, Oronzo Antonelli, and Fabio Tamburini. 2018. PoSTWITA-UD: an Italian Twitter treebank in Universal Depen- dencies. In Proceedings of the Eleventh In- ternational Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association (ELRA). Timo Schick and Hinrich Schutze. 2020a. Exploit- ing cloze questions for few-shot text classifi- cation and natural language inference. ArXiv, abs/2001.07676. Timo Schick and Hinrich Schutze. 2020b. It’s not just size that matters: Small language