UmBERTo-MTSA @ AcCompl-It: Improving Complexity and Acceptability Prediction with Multi-task Learning on Self-Supervised Annotations

Gabriele Sarti
Department of Mathematics and Geoscience, University of Trieste
International School for Advanced Studies (SISSA), Trieste, Italy
gsarti@sissa.it

Abstract

This work describes a self-supervised data augmentation approach used to improve learning models' performances when only a moderate amount of labeled data is available. Multiple copies of the original model are initially trained on the downstream task. Their predictions are then used to annotate a large set of unlabeled examples. Finally, multi-task training is performed on the parallel annotations of the resulting training set, and final scores are obtained by averaging annotator-specific head predictions. Neural language models are fine-tuned using this procedure in the context of the AcCompl-it shared task at EVALITA 2020, obtaining considerable improvements in prediction quality.

1 Introduction

In recent times, pre-trained neural language models (NLMs) have become the preferred approach for language representation learning, pushing the state of the art in multiple NLP tasks (Devlin et al. (2019); Radford et al. (2019); Yang et al. (2019); Raffel et al. (2019), inter alia). These approaches rely on a two-step training process: first, self-supervised pre-training is performed on large-scale corpora; then, the model undergoes supervised fine-tuning on downstream task labels using task-specific prediction heads. While this method was found to be effective in scenarios where a relatively large amount of labeled data is present, researchers have highlighted that this is not the case in low-resource settings (Yogatama et al., 2019).

Recently, pattern-exploiting training (PET; Schick and Schütze, 2020a,b) tackled the dependence of NLMs on labeled data by first reformulating tasks as cloze questions using task-related patterns and keywords, and then using language models trained on those to annotate large sets of unlabeled examples with soft labels. PET can be thought of as an offline version of knowledge distillation (Hinton et al., 2015), a well-established approach for transferring knowledge across models of different sizes, or even between different versions of the same model, as in self-training (Scudder, 1965; Yarowsky, 1995). While effective on classification tasks that can be easily reformulated as cloze questions, PET cannot be easily extended to regression settings, since continuous targets cannot be adequately verbalized. Contemporary work by Du et al. (2020) showed how self-training and pre-training provide complementary information for natural language understanding tasks.
In this paper, I propose a simple self-supervised data augmentation approach that can be used to improve the generalization capabilities of NLMs on regression and classification tasks for modest-sized labeled corpora. In short, an ensemble of fine-tuned models is used to annotate a large corpus of unlabeled text, and the new annotations are leveraged in a multi-task setting to obtain final predictions over the original test set. The method was tested on the AcCompl-it shared tasks of the EVALITA 2020 campaign (Brunato et al., 2020b; Basile et al., 2020), where the objective was to predict respectively complexity and acceptability scores on a 1-7 Likert scale for each test sentence, alongside an estimation of their standard error. Results show considerable improvements over regular fine-tuning performances on COMPL and ACCEPT using the UmBERTo pre-trained model (Francia et al., 2020), suggesting the validity of this approach for complexity/acceptability prediction and possibly other language processing tasks.

2 Description of the Approach

Let:

• L = [(x_1, y_1), …, (x_n, y_n)] be the initial labeled corpus containing sentence-annotation pairs x_i ∈ X, y_i ∈ Y_x (y_i can be either discrete or continuous in this context);

• U = [x′_1, …, x′_m] be a large unlabeled corpus such that m ≫ n;

• M : x_i → ŷ_i be a pre-trained neural language model with a single task-specific head, taking sentence x_i as input and predicting label y_i at inference time.

For some positive integer k, we begin by splitting L into k equal-sized segments L_1, …, L_k and fine-tuning k identical copies of M using k-fold cross-validation. We call the resulting models M^1, …, M^k "NLMs with standard fine-tuning on the y target task", with M^i being trained on the subset L − L_i and evaluated on L_i. Then, each sentence of U is passed to each model, obtaining the corpus

U′ = [(x′_1, ŷ′_1^1, …, ŷ′_1^k), …, (x′_m, ŷ′_m^1, …, ŷ′_m^k)]   (1)

labeled with expert annotations from the fine-tuned models. Predicted values are taken instead of the post-softmax probability distributions typically used in the knowledge distillation literature, to keep the approach simple while making it viable in the context of regression tasks.

Now that the large corpus is annotated, a multi-task NLM MTM : x_i → ẏ_i^1, …, ẏ_i^k is fine-tuned on U′ by treating each annotation in the set ŷ′^1, …, ŷ′^k as a separate task, using 1-layer feed-forward neural networks as task-specific heads while performing hard parameter sharing (Caruana, 1997) on the underlying model parameters. Intuitively, the k models used to produce the annotations were trained on different folds of the original corpus, and as such, they provide complementary viewpoints on the modeled phenomenon when k is small.

As a final step, MTM is fine-tuned on a training portion of L, using as prediction scores f(ẏ_i^1, …, ẏ_i^k), where f is a task- and context-dependent aggregation function. For example, in the case of a classification task, one can select the majority vote from the ensemble of model heads as the final prediction, while in a regression setting this can be done by averaging scores across heads. Once fine-tuned, the model can be tested on the test portion of L using the same f as the aggregator. I refer to this approach as Multi-Task Self-Annotation (MTSA) in the following sections.
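The full procedure can be summarized in a short sketch. The helpers `fine_tune`, `predict`, and `fine_tune_multitask` below are hypothetical placeholders for standard Transformers training and inference loops (they are not part of any released API), and the sketch follows the regression setting used in this paper, with averaging as the aggregator f; it is an illustration of the description above, not the author's exact implementation.

```python
# A minimal sketch of the MTSA procedure, under the assumptions stated above.
from copy import deepcopy

import numpy as np
from sklearn.model_selection import KFold

def mtsa(model, labeled_sents, labels: np.ndarray, unlabeled_sents, k=5):
    # Step 1: fine-tune k copies of the base model with k-fold CV,
    # so each copy sees a different (k-1)/k slice of the labeled data.
    annotators = []
    for train_idx, _ in KFold(n_splits=k, shuffle=True).split(labeled_sents):
        m_i = fine_tune(deepcopy(model),                     # hypothetical helper
                        [labeled_sents[i] for i in train_idx],
                        labels[train_idx])
        annotators.append(m_i)

    # Step 2: annotate the large unlabeled corpus U with each fold model.
    # Shape: (k, m) -- one row of soft labels per annotator.
    soft_labels = np.stack([predict(m_i, unlabeled_sents)    # hypothetical helper
                            for m_i in annotators])

    # Step 3: multi-task fine-tuning on U', one regression head per
    # annotator, with hard parameter sharing on the encoder.
    mtm = fine_tune_multitask(deepcopy(model),               # hypothetical helper
                              unlabeled_sents, soft_labels)

    # Step 4: final fine-tuning on the original labeled corpus L;
    # at inference, the aggregator f averages the k head outputs.
    return fine_tune(mtm, labeled_sents, labels)

def aggregate(head_predictions: np.ndarray) -> np.ndarray:
    # f for regression: mean over the k annotator-specific heads.
    return np.mean(head_predictions, axis=0)
```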
3 Experimental Evaluation

For the experimental evaluation:

• The ACCEPT and COMPL training corpora, containing respectively 1339 and 2012 sentences labeled with average scores and standard errors across annotators, were used as labeled datasets L_A, L_C. The two tasks were learned separately, following the same approach described in the previous section.

• A set of multiple Italian treebanks, including the train, dev, and test sets of the Italian Stanford Dependency Treebank (Bosco et al., 2013), the Turin University Parallel Treebank (Sanguinetti and Bosco, 2015), PoSTWITA-UD (Sanguinetti et al., 2018) and the Venice Italian Treebank (Delmonte et al., 2007), was used as the unlabeled corpus U. The final corpus contains 37,344 unlabeled sentences and spans multiple textual genres.

• The UmBERTo model (Francia et al., 2020), available through HuggingFace's Transformers framework (Wolf et al., 2019), was used both for fine-tuning M^1, …, M^k during the annotation part and for fine-tuning MTM. The model is based on the RoBERTa architecture (Liu et al., 2019) and was pre-trained on the Italian portion of the OSCAR CommonCrawl corpus (Ortiz Suárez et al., 2020), containing roughly 210M sentences and over 11B tokens.

Since both tasks involve predicting both averaged scores and the original standard error across participants, the approach presented in the previous section was adapted to account for multi-task learning of scores and errors from the beginning, with each model M^i producing both a predicted score ŷ′^i and a predicted error ε̂′^i for the annotation step. The k parameter was set to 5 to prevent excessive overlapping of training data across models, with the final multi-task model MTM : x_i → ẏ_i^1, …, ẏ_i^5, ε̇_i^1, …, ε̇_i^5 returning predictions for scores and errors for all five sets of fine-tuned model annotations.

Models M^1, …, M^k were trained for a maximum of 15 epochs on the labeled training sets using early stopping (5 patience steps, 20 evaluation steps, using a 10% slice as dev set), learning rate λ = 1e−5, batch size b = 32 and embedding dropout δ = 0.1. The model's base variant was used, with a hidden size |h| = 768 and a maximum sequence length of 128. During training on the whole unlabeled corpus, the evaluation steps were increased to 100 to balance evaluation time against the corpus's increased size. Notably, the representations at the last layer of the UmBERTo model were averaged to obtain a sentence-level representation, instead of using the [CLS] token, as sketched below.
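The following is a minimal sketch of the pooling and head setup described above, not the author's released code: last-layer token representations are mean-pooled into a sentence vector, and separate 1-layer feed-forward heads predict a score and a standard error for each of the k = 5 annotation sets, with hard parameter sharing on the encoder. The checkpoint identifier is assumed to be the public UmBERTo release on the HuggingFace Hub.

```python
import torch
from transformers import AutoModel

class ScoreErrorRegressor(torch.nn.Module):
    def __init__(self, model_name="Musixmatch/umberto-commoncrawl-cased-v1",
                 n_heads=5, hidden_size=768):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        # One score head and one error head per self-annotator (k = 5),
        # all sharing the same encoder (hard parameter sharing).
        self.score_heads = torch.nn.ModuleList(
            [torch.nn.Linear(hidden_size, 1) for _ in range(n_heads)])
        self.error_heads = torch.nn.ModuleList(
            [torch.nn.Linear(hidden_size, 1) for _ in range(n_heads)])

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids,
                              attention_mask=attention_mask
                              ).last_hidden_state           # (batch, seq, 768)
        # Mean-pool over real tokens only, masking out padding positions,
        # instead of taking the [CLS] representation.
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (hidden * mask).sum(1) / mask.sum(1)       # (batch, 768)
        scores = torch.cat([h(pooled) for h in self.score_heads], dim=-1)
        errors = torch.cat([h(pooled) for h in self.error_heads], dim=-1)
        # Final predictions aggregate the 5 heads by averaging (the f above).
        return scores.mean(-1), errors.mean(-1)
```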
4 Results

Table 1 reports the methods for which the correlation between predicted values and the gold acceptability/complexity scores was tested on the training portions of the ACCEPT and COMPL tasks with 5-fold cross-validation, leading to the selection of MTSA as the top-performing approach:

• UmBERTo surprisal: Sentence-level surprisal estimates produced by the pre-trained model without fine-tuning as

P(x) = ∏_{i=1}^{m} P(w_i | w_{1:i−1}, w_{i+1:m})   (2)

(one way to compute such estimates is sketched after this list).

• Length (# of tokens): Length of the sentence in number of tokens.

• Length (characters): Length of the sentence in number of characters (including whitespace).

• UmBERTo fine-tuned: Predictions produced by UmBERTo with standard fine-tuning on each task's corpus annotations.

• UmBERTo-STSA: A variant of the MTSA approach where, instead of performing multi-task learning over model annotations on U, we average them into a single score, on which the model is trained with single-task fine-tuning.

• UmBERTo-MTSA: The approach presented in this work.

ACCEPT                   Score (ρ)   Error (ρ)
UmBERTo surprisal        -0.36       0.17
Length (# of tokens)     -0.39       0.17
Length (characters)      -0.39       0.21
UmBERTo fine-tuned       0.90        0.50
UmBERTo-STSA             0.91        0.53
UmBERTo-MTSA             0.91        0.54

COMPL                    Score (ρ)   Error (ρ)
UmBERTo surprisal        0.49        0.28
Length (# of tokens)     0.55        0.36
Length (characters)      0.60        0.39
UmBERTo fine-tuned       0.84        0.54
UmBERTo-STSA             0.87        0.62
UmBERTo-MTSA             0.88        0.63

Table 1: Spearman's correlation scores on the ACCEPT (top) and COMPL (bottom) subtasks' training portions. Models are evaluated using 5-fold cross-validation. All scores have p < 0.001.
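Equation (2) scores each token given its full bidirectional context, which corresponds to a masked-LM pseudo-(log-)likelihood: each token is masked in turn and scored by the pre-trained model. The sketch below is one plausible way to compute such sentence-level surprisal estimates, not necessarily the paper's exact implementation; the checkpoint identifier is again the assumed public UmBERTo release.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

MODEL_NAME = "Musixmatch/umberto-commoncrawl-cased-v1"  # assumed public checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForMaskedLM.from_pretrained(MODEL_NAME).eval()

def sentence_surprisal(sentence: str) -> float:
    ids = tokenizer(sentence, return_tensors="pt").input_ids[0]
    nll = 0.0
    # Skip the special tokens at positions 0 and -1 (<s>, </s>).
    for i in range(1, len(ids) - 1):
        masked = ids.clone()
        masked[i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, i]
        # Surprisal of the true token given its bidirectional context,
        # i.e. the negative log of one factor of Eq. (2).
        nll -= torch.log_softmax(logits, dim=-1)[ids[i]].item()
    return nll  # higher = more surprising sentence

print(sentence_surprisal("Il gatto dorme sul divano."))
```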
Sen- by an SVM model trained on 1-grams and bi- tences for whose the standard deviation of scores grams of sentences and an SVM trained on sen- is high across participants appear to be less pre- tence length, respectively. The MTSA approach dictable in the context of complexity scores, while achieved the first rank in both tasks, with consid- this does not affect acceptability predictions. erable improvements over baseline scores. Concerning acceptability, I found a significant correlation between acceptability prediction er- 5 Error Analysis rors and the presence of multilevel syntactic struc- tures, (avg max depth) multiple long preposi- Finally, some error analysis is performed to gain tional chains (n prep chains, prep chain len) and additional insights on which factors influence nominal modifiers (dep dist nmod). From the the predictability of complexity and acceptabil- complexity viewpoint, instead, the presence of ity judgments. The Profiling-UD tool by Brunato inflectional morphology related to the imperfect et al. (2020a) is used to produce linguistic anno- tense in auxiliaries (aux mood dist Imp) was the tations on test sentences for both tasks. Given only property related to higher prediction errors. an input sentence, Profiling-UD produces roughly However, high token counts (n tokens) and long ∼ 100 numeric scores representing different phe- dependency links (avg links len, max links len) nomena and properties at different language lev- were shown to make the variability in complexity els.2 I then correlate the value of all features with scores more predictable. y and  , representing the mean absolute error Overall, results suggest that incorporating syn- between true and predicted values for scores and tactic information during the model’s training pro- 2 A description of produced annotations is omitted for cess may further improve complexity and accept- brevity. Refer to Brunato et al. (2020a) for additional details. ability models. 6 Discussion and Conclusion and Speech Tools for Italian. Final Workshop (EVALITA 2020), Online. CEUR.org. This work introduced a simple and effective data augmentation approach improving the fine-tuning Cristina Bosco, Simonetta Montemagni, and performances of NLMs when only a modest Maria Simi. 2013. Converting Italian treebanks: amount of labeled data is available. The approach Towards an Italian Stanford dependency tree- was first formalized and then empirically tested bank. In Proceedings of the 7th Linguistic on the ACCEPT and COMPL shared tasks of the Annotation Workshop and Interoperability with EVALITA 2020 campaign. Strong performances Discourse, pages 61–69, Sofia, Bulgaria. Asso- were reported for both acceptability and complex- ciation for Computational Linguistics. ity prediction using a multi-task self-training ap- Dominique Brunato, Andrea Cimino, Felice proach, obtaining the top position in both sub- Dell’Orletta, Giulia Venturi, and Simonetta tasks. Finally, an error analysis highlighted the Montemagni. 2020a. Profiling-UD: a tool for unpredictability of extreme scores and sentences linguistic profiling of texts. In Proceedings having complex syntactic structures. of The 12th Language Resources and Evalua- The suggested approach, although computa- tion Conference, pages 7147–7153, Marseille, tionally refined and well-performing, is lacking France. European Language Resources Associ- in terms of complexity-driven biases that may ation. 
6 Discussion and Conclusion

This work introduced a simple and effective data augmentation approach that improves the fine-tuning performances of NLMs when only a modest amount of labeled data is available. The approach was first formalized and then empirically tested on the ACCEPT and COMPL shared tasks of the EVALITA 2020 campaign. Strong performances were reported for both acceptability and complexity prediction using a multi-task self-training approach, obtaining the top position in both subtasks. Finally, an error analysis highlighted the unpredictability of extreme scores and of sentences with complex syntactic structures.

The suggested approach, although computationally refined and well-performing, lacks complexity-driven inductive biases that may prove useful in the context of complexity and acceptability prediction. A possible extension of this work may include a complementary syntactic task (e.g., biaffine parsing, as in Glavaš and Vulić (2020)) during multi-task learning, to see whether forcing syntactically-competent representations in the top layers proves beneficial for syntax-heavy tasks like complexity and acceptability prediction. Moreover, it would be interesting to evaluate multi-task learning performances with parallel complexity and acceptability annotations, given the conceptual similarity between the two tasks, and to estimate the effectiveness of a feed-forward network as the final aggregator f in the MTSA paradigm, instead of merely averaging predictions. Finally, the findings of Du et al. (2020) suggest that an unsupervised in-domain filtering approach may further improve the self-training procedure when large unlabeled corpora are available.

Acknowledgments

The author was supported by a scholarship for Data Science and Scientific Computing students from the International School for Advanced Studies (SISSA).

References

Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro. 2020. EVALITA 2020: Overview of the 7th evaluation campaign of natural language processing and speech tools for Italian. In Proceedings of the Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020), Online. CEUR.org.

Cristina Bosco, Simonetta Montemagni, and Maria Simi. 2013. Converting Italian treebanks: Towards an Italian Stanford dependency treebank. In Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse, pages 61–69, Sofia, Bulgaria. Association for Computational Linguistics.

Dominique Brunato, Andrea Cimino, Felice Dell'Orletta, Giulia Venturi, and Simonetta Montemagni. 2020a. Profiling-UD: A tool for linguistic profiling of texts. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 7147–7153, Marseille, France. European Language Resources Association.

Dominique Brunato, Cristiano Chesi, Felice Dell'Orletta, Simonetta Montemagni, Giulia Venturi, and Roberto Zamparelli. 2020b. AcCompl-it @ EVALITA2020: Overview of the acceptability and complexity evaluation task for Italian. In Proceedings of the Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020), Online. CEUR.org.

Rich Caruana. 1997. Multitask learning. Machine Learning, 28:41–75.

Rodolfo Delmonte, Antonella Bristot, and Sara Tonelli. 2007. VIT – Venice Italian Treebank: Syntactic and quantitative features.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Jingfei Du, Edouard Grave, Beliz Gunel, Vishrav Chaudhary, Onur Çelebi, Michael Auli, Ves Stoyanov, and Alexis Conneau. 2020. Self-training improves pre-training for natural language understanding. ArXiv, abs/2010.02194.

Simone Francia, Loreto Parisi, and Paolo Magnani. 2020. UmBERTo: An Italian language model trained with whole word masking.

Goran Glavaš and Ivan Vulić. 2020. Is supervised syntactic parsing beneficial for language understanding? An empirical investigation. ArXiv, abs/2008.06788.

Geoffrey E. Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. ArXiv, abs/1503.02531.
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. ArXiv, abs/1907.11692.

Pedro Javier Ortiz Suárez, Laurent Romary, and Benoît Sagot. 2020. A monolingual approach to contextualized word embeddings for mid-resource languages. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1703–1714, Online. Association for Computational Linguistics.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. ArXiv, abs/1910.10683.

Manuela Sanguinetti and Cristina Bosco. 2015. PartTUT: The Turin University Parallel Treebank, pages 51–69. Springer International Publishing, Cham.

Manuela Sanguinetti, Cristina Bosco, Alberto Lavelli, Alessandro Mazzei, Oronzo Antonelli, and Fabio Tamburini. 2018. PoSTWITA-UD: An Italian Twitter treebank in Universal Dependencies. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association (ELRA).

Timo Schick and Hinrich Schütze. 2020a. Exploiting cloze questions for few-shot text classification and natural language inference. ArXiv, abs/2001.07676.

Timo Schick and Hinrich Schütze. 2020b. It's not just size that matters: Small language models are also few-shot learners. ArXiv, abs/2009.07118.

H. Scudder. 1965. Probability of error of some adaptive pattern-recognition machines. IEEE Transactions on Information Theory, 11(3):363–371.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace's Transformers: State-of-the-art natural language processing. ArXiv, abs/1910.03771.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. In NeurIPS.

David Yarowsky. 1995. Unsupervised word sense disambiguation rivaling supervised methods. In 33rd Annual Meeting of the Association for Computational Linguistics, pages 189–196, Cambridge, Massachusetts, USA. Association for Computational Linguistics.

Dani Yogatama, Cyprien de Masson d'Autume, Jerome Connor, Tomáš Kočiský, Mike Chrzanowski, Lingpeng Kong, Angeliki Lazaridou, Wang Ling, Lei Yu, Chris Dyer, and Phil Blunsom. 2019. Learning and evaluating general linguistic intelligence. ArXiv, abs/1901.11373.