                   UmBERTo-MTSA @ AcCompl-It:
          Improving Complexity and Acceptability Prediction
        with Multi-task Learning on Self-Supervised Annotations
                                      Gabriele Sarti
             Department of Mathematics and Geosciences, University of Trieste
             International School for Advanced Studies (SISSA), Trieste, Italy
                                   gsarti@sissa.it

                 Abstract

English. This work describes a self-supervised data augmentation approach used to improve learning models' performances when only a moderate amount of labeled data is available. Multiple copies of the original model are initially trained on the downstream task. Their predictions are then used to annotate a large set of unlabeled examples. Finally, multi-task training is performed on the parallel annotations of the resulting training set, and final scores are obtained by averaging annotator-specific head predictions. Neural language models are fine-tuned using this procedure in the context of the AcCompl-it shared task at EVALITA 2020, obtaining considerable improvements in prediction quality.

Italiano. This article describes a self-supervised data augmentation approach that can be used to improve the performance of learning algorithms on tasks with only a modest amount of annotated data. Initially, multiple copies of the original model are trained on the chosen task. Their predictions are then used to annotate large quantities of unlabeled examples. Finally, a multi-task training approach is applied, with the annotations of the resulting dataset treated as independent tasks, and final predictions are obtained as averages of the individual annotators' scores. This procedure was used to train neural language models for the AcCompl-it shared task at EVALITA 2020, obtaining large improvements in prediction quality.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1   Introduction

In recent times, pre-trained neural language models (NLMs) have become the preferred approach for language representation learning, pushing the state of the art in multiple NLP tasks (Devlin et al. (2019); Radford et al. (2019); Yang et al. (2019); Raffel et al. (2019), inter alia). These approaches rely on a two-step training process: first, self-supervised pre-training is performed on large-scale corpora; then, the model undergoes supervised fine-tuning on downstream task labels using task-specific prediction heads. While this method was found to be effective in scenarios where a relatively large amount of labeled data is present, researchers have highlighted that this is not the case in low-resource settings (Yogatama et al., 2019).

Recently, pattern-exploiting training (PET; Schick and Schutze (2020a,b)) has tackled the dependence of NLMs on labeled data by first reformulating tasks as cloze questions using task-related patterns and keywords, and then using language models trained on those to annotate large sets of unlabeled examples with soft labels. PET can be thought of as an offline version of knowledge distillation (Hinton et al., 2015), a well-established approach for transferring knowledge across models of different sizes, or even between different versions of the same model, as in self-training (Scudder, 1965; Yarowsky, 1995). While effective on classification tasks that can easily be reformulated as cloze questions, PET cannot be readily extended to regression settings, since regression targets cannot be adequately verbalized. Contemporary work by Du et al. (2020) showed how self-training and pre-training provide complementary information for natural language understanding tasks.
In this paper, I propose a simple self-supervised data augmentation approach that can be used to improve the generalization capabilities of NLMs on regression and classification tasks with modest-sized labeled corpora. In short, an ensemble of fine-tuned models is used to annotate a large corpus of unlabeled text, and the new annotations are leveraged in a multi-task setting to obtain final predictions over the original test set. The method was tested on the AcCompl-it shared tasks of the EVALITA 2020 campaign (Brunato et al., 2020b; Basile et al., 2020), where the objective was to predict complexity and acceptability scores, respectively, on a 1–7 Likert scale for each test sentence, alongside an estimate of their standard error. Results show considerable improvements over regular fine-tuning performances on COMPL and ACCEPT using the UmBERTo pre-trained model (Francia et al., 2020), suggesting the validity of this approach for complexity/acceptability prediction and possibly other language processing tasks.

2   Description of the Approach

Let:

  • L = [(x_1, y_1), ..., (x_n, y_n)] be the initial labeled corpus containing sentence-annotation pairs x_i ∈ X, y_i ∈ Y_x;¹

  • U = [x'_1, ..., x'_m] be a large unlabeled corpus such that m ≫ n;

  • M : x_i → ŷ_i be a pre-trained neural language model with a single task-specific head, taking sentence x_i as input and predicting label y_i at inference time.

¹ y_i can be either discrete or continuous in this context.

For some integer k > 1, we begin by splitting L into k equal-sized segments L_1, ..., L_k and fine-tuning k identical copies of M using k-fold cross-validation. We call the resulting models M^1, ..., M^k "NLMs with standard fine-tuning on the y target task", with M^i being trained on the subset L − L_i and evaluated on L_i. Then, each sentence of U is passed to each model, obtaining the corpus

    $U' = [(x'_1, \hat{y}'^1_1 \ldots \hat{y}'^k_1), \ldots, (x'_m, \hat{y}'^1_m \ldots \hat{y}'^k_m)]$    (1)

labeled with expert annotations from the fine-tuned models. Predicted values are taken instead of the probability distributions after the softmax, which are typically used in the knowledge distillation literature, to keep the approach simple while making it viable in the context of regression tasks.

Now that the large corpus is annotated, a multi-task NLM MTM : x_i → ẏ^1_i ... ẏ^k_i is fine-tuned on U' by treating each annotation in the set ŷ'^1 ... ŷ'^k as a separate task, using 1-layer feed-forward neural networks as task-specific heads while performing hard parameter sharing (Caruana, 1997) on the underlying model parameters. Intuitively, the k models used to produce the annotations were trained on different folds of the original corpus, and as such they provide complementary viewpoints on the modeled phenomenon when k is small.

As a final step, MTM is fine-tuned on a training portion of L, using as prediction scores f(ẏ^1_i ... ẏ^k_i), where f is a task- and context-dependent aggregation function. For example, in the case of a classification task, one can select the majority vote from the ensemble of model heads as the final prediction, while in a regression setting this can be done by averaging scores across heads. Once fine-tuned, the model can be tested on the test portion of L using the same f as the aggregator. I refer to this approach as Multi-Task Self-Annotation (MTSA) in the following sections.
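To make the procedure concrete, below is a minimal, self-contained sketch of the MTSA pipeline for regression. It is an illustration under simplifying assumptions, not the author's released code: a toy feed-forward Encoder stands in for the pre-trained NLM, training is a plain MSE loop, and the final fine-tuning on L is read as fitting every head to the gold score before aggregating with f = mean.

```python
# Minimal sketch of Multi-Task Self-Annotation (MTSA) for regression.
# Assumption: a toy Encoder replaces the pre-trained NLM body (UmBERTo
# in the paper), so the whole script runs on random data.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Stand-in for a pre-trained NLM body producing 32-dim sentence vectors."""
    def __init__(self, in_dim=64, hidden=32):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(in_dim, hidden), nn.Tanh())

    def forward(self, x):
        return self.body(x)

class SingleTaskModel(nn.Module):
    """M: encoder plus one task-specific regression head."""
    def __init__(self, encoder):
        super().__init__()
        self.encoder, self.head = encoder, nn.Linear(32, 1)

    def forward(self, x):
        return self.head(self.encoder(x)).squeeze(-1)

class MultiTaskModel(nn.Module):
    """MTM: shared encoder (hard parameter sharing) with k 1-layer heads."""
    def __init__(self, encoder, k):
        super().__init__()
        self.encoder = encoder
        self.heads = nn.ModuleList([nn.Linear(32, 1) for _ in range(k)])

    def forward(self, x):
        h = self.encoder(x)  # shared representation for all heads
        return torch.stack([head(h).squeeze(-1) for head in self.heads], dim=-1)

def train(model, x, y, epochs=50, lr=1e-2):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(x), y)
        loss.backward()
        opt.step()

k, n, m, d = 5, 100, 1000, 64
x_lab, y_lab = torch.randn(n, d), torch.randn(n)   # labeled corpus L
x_unlab = torch.randn(m, d)                        # unlabeled corpus U

# 1) Fine-tune k copies of M on k-fold splits of L.
folds = torch.arange(n) % k
annotators = []
for i in range(k):
    model = SingleTaskModel(Encoder())
    train(model, x_lab[folds != i], y_lab[folds != i])  # train on L - L_i
    annotators.append(model)

# 2) Annotate U with each fold model, yielding U' (Eq. 1).
with torch.no_grad():
    y_unlab = torch.stack([m_(x_unlab) for m_ in annotators], dim=-1)

# 3) Multi-task fine-tuning of MTM on U', one head per annotator.
mtm = MultiTaskModel(Encoder(), k)
train(mtm, x_unlab, y_unlab)

# 4) Final fine-tuning on L and prediction with f = mean over heads
#    (one plausible reading of the aggregation step described above).
train(mtm, x_lab, y_lab.unsqueeze(-1).expand(-1, k))
with torch.no_grad():
    prediction = mtm(x_lab).mean(dim=-1)  # at test time, apply the same f
```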
3   Experimental Evaluation

For the experimental evaluation:

  • The ACCEPT and COMPL training corpora, containing respectively 1339 and 2012 sentences labeled with average scores and standard errors across annotators, were used as labeled datasets L_A and L_C. The two tasks were learned separately, following the same approach described in the previous section.

  • A set of Italian treebanks, including the train, dev, and test sets of the Italian Stanford Dependency Treebank (Bosco et al., 2013), the Turin University Parallel Treebank (Sanguinetti and Bosco, 2015), PoSTWITA-UD (Sanguinetti et al., 2018), and the Venice Italian Treebank (Delmonte et al., 2007), was used as the unlabeled corpus U. The final corpus contains 37,344 unlabeled sentences and spans multiple textual genres.

  • The UmBERTo model (Francia et al., 2020), available through HuggingFace's Transformers framework (Wolf et al., 2019), was used both for fine-tuning M^1, ..., M^k during the annotation part and for fine-tuning MTM. The model is based on the RoBERTa architecture (Liu et al., 2019) and was pre-trained on the Italian portion of the OSCAR CommonCrawl corpus (Ortiz Suárez et al., 2020), containing roughly 210M sentences and over 11B tokens.

Since both tasks involve predicting both averaged scores and the original standard error across participants, the approach presented in the previous section was adapted to account for multi-task learning of scores and errors from the beginning, with each model M^i producing both a predicted score ŷ'^i and a predicted error ε̂'^i for the annotation step. The k parameter was set to 5 to prevent excessive overlap of training data across models, with the final multi-task model MTM : x_i → ẏ^1_i ... ẏ^5_i, ε̂^1_i ... ε̂^5_i returning predictions of scores and errors for all five sets of fine-tuned model annotations.

Models M^1, ..., M^5 were trained for a maximum of 15 epochs on the labeled training sets using early stopping (patience of 5 evaluations, one evaluation every 20 steps, with a 10% slice held out as dev set), learning rate λ = 1e−5, batch size b = 32, and embedding dropout δ = 0.1. The model's base variant was used, with hidden size |h| = 768 and a maximum sequence length of 128. Notably, the representations at the last layer of the UmBERTo model were averaged to obtain a sentence-level representation, instead of using the [CLS] token. During training on the whole unlabeled corpus, the evaluation interval was increased to 100 steps to balance evaluation time against the corpus's increased size.
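The sentence-representation choice mentioned above (averaging last-layer token vectors rather than taking the [CLS] token) is commonly implemented as a mask-aware mean pooling. The helper below is a sketch of that operation, assuming HuggingFace-style tensors (last_hidden_state and attention_mask); it is not taken from the paper's code.

```python
# Sketch of mask-aware mean pooling over last-layer token representations.
import torch

def mean_pool(last_hidden_state: torch.Tensor,
              attention_mask: torch.Tensor) -> torch.Tensor:
    """last_hidden_state: (batch, seq_len, hidden); attention_mask: (batch, seq_len)."""
    mask = attention_mask.unsqueeze(-1).float()      # (batch, seq_len, 1)
    summed = (last_hidden_state * mask).sum(dim=1)   # ignore padding positions
    counts = mask.sum(dim=1).clamp(min=1e-9)         # real tokens per sentence
    return summed / counts                           # (batch, hidden)

# Example with the dimensions used in the paper (|h| = 768, max length 128):
reps = mean_pool(torch.randn(32, 128, 768), torch.ones(32, 128))
print(reps.shape)  # torch.Size([32, 768])
```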
4   Results

    Model                     Score (ρ)   Error (ρ)
    ACCEPT
    UmBERTo surprisal           -0.36       0.17
    Length (# of tokens)        -0.39       0.17
    Length (characters)         -0.39       0.21
    UmBERTo fine-tuned           0.90       0.50
    UmBERTo-STSA                 0.91       0.53
    UmBERTo-MTSA                 0.91       0.54
    COMPL
    UmBERTo surprisal            0.49       0.28
    Length (# of tokens)         0.55       0.36
    Length (characters)          0.60       0.39
    UmBERTo fine-tuned           0.84       0.54
    UmBERTo-STSA                 0.87       0.62
    UmBERTo-MTSA                 0.88       0.63

Table 1: Spearman's correlation scores on the ACCEPT (top) and COMPL (bottom) subtasks' training portions. Models are evaluated using 5-fold cross-validation. All scores have p < 0.001.

Table 1 presents the methods for which the correlation between predicted values and acceptability/complexity scores was tested on the training portions of the ACCEPT and COMPL tasks with 5-fold cross-validation, leading to the selection of MTSA as the top-performing approach:

  • UmBERTo surprisal: sentence-level surprisal estimates are produced by the pre-trained model, without fine-tuning, as

        $P(x) = \prod_{i=1}^{m} P(w_i \mid w_{1:i-1}, w_{i+1:m})$    (2)

    (see the code sketch after this list).

  • Length (# of tokens): length of the sentence in number of tokens.

  • Length (characters): length of the sentence in number of characters (including whitespace).

  • UmBERTo fine-tuned: predictions produced by UmBERTo with standard fine-tuning on each task's corpus annotations.

  • UmBERTo-STSA: a variant of the MTSA approach where, instead of performing multi-task learning over model annotations on U, the annotations are averaged into a single score, on which the model is trained with single-task fine-tuning.

  • UmBERTo-MTSA: the approach presented in this work.
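Equation (2) corresponds to a masked-LM pseudo-likelihood, obtained by masking one token at a time and reading off the probability of the true token given its bidirectional context. The sketch below illustrates this with the Transformers library; the checkpoint name and the summation of token surprisals are assumptions, since the paper does not specify these implementation details.

```python
# Sketch of the sentence-level surprisal baseline in Eq. (2): a masked-LM
# pseudo-likelihood, masking one position at a time.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

name = "Musixmatch/umberto-commoncrawl-cased-v1"  # assumed UmBERTo checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForMaskedLM.from_pretrained(name).eval()

def sentence_surprisal(sentence: str) -> float:
    """-log P(x), with P(x) = prod_i P(w_i | w_1:i-1, w_i+1:m) as in Eq. (2)."""
    ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    total = 0.0
    for i in range(1, len(ids) - 1):          # skip special boundary tokens
        masked = ids.clone()
        masked[i] = tokenizer.mask_token_id   # hide w_i, keep both-side context
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, i]
        log_probs = torch.log_softmax(logits, dim=-1)
        total -= log_probs[ids[i]].item()     # surprisal of the true token
    return total

print(sentence_surprisal("Il gatto dorme sul divano."))
```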
From Table 1, it can be observed that, although length alone is already correlated with acceptability and complexity scores, UmBERTo can leverage additional information from its representations to produce much stronger predictions. Interestingly, both the STSA and MTSA self-annotation approaches consistently outperform regular fine-tuning, especially as regards standard-error scores. This fact suggests that self-annotation leads to better generalization over downstream tasks when relatively few annotations are available. While the contribution of multi-task learning is modest, the MTSA approach may prove especially beneficial when training models M^1, ..., M^k on scores produced by different annotators, instead of on different folds of the same corpus as done here. In both cases, predicted surprisal scores act as poor predictors for the downstream tasks. It should also be noted that length is negatively correlated with acceptability scores (i.e., longer sentences are generally less acceptable), while the relation is positive in the case of complexity (i.e., longer sentences are generally more complex).

Table 2 reports the scores obtained by MTSA over the test sets of the ACCEPT and COMPL shared tasks. The organizers' baseline scores correspond to the correlation between gold labels and the predictions of an SVM trained on sentence unigrams and bigrams (ACCEPT) and an SVM trained on sentence length (COMPL), respectively. The MTSA approach achieved the first rank in both tasks, with considerable improvements over the baseline scores.

    Model                     Score (ρ)   Error (ρ)
    ACCEPT
    SVM 2-gram baseline          0.30       0.35
    UmBERTo-MTSA                 0.88       0.52
    COMPL
    SVM length baseline          0.50       0.33
    UmBERTo-MTSA                 0.83       0.51

Table 2: Correlation scores with gold labels on the ACCEPT (top) and COMPL (bottom) subtasks' test portions. All scores have p < 0.001.

5   Error Analysis

Finally, some error analysis is performed to gain additional insight into which factors influence the predictability of complexity and acceptability judgments. The Profiling-UD tool by Brunato et al. (2020a) is used to produce linguistic annotations on the test sentences of both tasks. Given an input sentence, Profiling-UD produces roughly 100 numeric scores representing different phenomena and properties at different linguistic levels.² I then correlate the value of all features with Δy and Δε, the mean absolute errors between true and predicted values for scores and standard errors, respectively. Table 3 presents the results of the error analysis.

² A description of the produced annotations is omitted for brevity. Refer to Brunato et al. (2020a) for additional details.

    Feature               Acceptability       Complexity
                          ρ(Δy)   ρ(Δε)      ρ(Δy)   ρ(Δε)
    avg. score (y)         -25%     10%        41%     -2%
    std. error (ε)          12%      2%        23%     27%
    upos dist PROPN         19%     -3%         4%      6%
    dep dist nmod           19%     -8%         4%      1%
    avg max depth           16%     -3%         7%     -7%
    n prep chains           16%     -8%         4%     -2%
    prep chain len          16%     -6%         9%     -4%
    upos dist PRON           1%     20%         8%      9%
    dep dist root           -9%     18%        -4%     23%
    dep dist punct          -9%     17%         1%     -3%
    aux mood dist Imp        7%      6%        17%      7%
    n tokens                 9%    -13%         5%    -18%
    avg links len           -3%      1%        -6%    -17%
    max links len           -1%     -9%        -1%    -16%

Table 3: Pearson's correlation scores between prediction errors and various linguistic features. Orange and cyan cells contain respectively positive and negative scores for which p < 0.001.

Strongly correlated values in Table 3 correspond to features that heavily influence, either positively or negatively, the prediction capabilities of the MTSA model. Extreme task scores (avg. score), denoting either barely acceptable or highly complex sentences, are less predictable by MTSA than their average counterparts. Sentences for which the standard deviation of scores across participants is high appear to be less predictable in the context of complexity scores, while this does not affect acceptability predictions.

Concerning acceptability, I found a significant correlation between acceptability prediction errors and the presence of multilevel syntactic structures (avg max depth), multiple long prepositional chains (n prep chains, prep chain len), and nominal modifiers (dep dist nmod). From the complexity viewpoint, instead, the presence of inflectional morphology related to the imperfect tense in auxiliaries (aux mood dist Imp) was the only property related to higher prediction errors. However, high token counts (n tokens) and long dependency links (avg links len, max links len) were shown to make the variability in complexity scores more predictable.

Overall, these results suggest that incorporating syntactic information during the model's training process may further improve complexity and acceptability models.
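The correlation analysis described in this section can be sketched in a few lines: compute Pearson's correlation between each Profiling-UD feature and the absolute prediction errors, keeping only coefficients significant at p < 0.001. The data-frame and function names below are illustrative assumptions, not part of the paper's released code.

```python
# Sketch of the error analysis: correlate linguistic features with the
# absolute prediction errors (Δy or Δε) and keep significant coefficients.
import pandas as pd
from scipy.stats import pearsonr

def significant_correlations(features: pd.DataFrame,
                             abs_errors: pd.Series,
                             alpha: float = 0.001) -> pd.Series:
    """Pearson's r between every feature column and |error| values."""
    rows = {}
    for col in features.columns:
        r, p = pearsonr(features[col], abs_errors)
        if p < alpha:          # keep only highly significant correlations
            rows[col] = r
    return pd.Series(rows).sort_values(key=abs, ascending=False)
```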
6   Discussion and Conclusion

This work introduced a simple and effective data augmentation approach that improves the fine-tuning performances of NLMs when only a modest amount of labeled data is available. The approach was first formalized and then empirically tested on the ACCEPT and COMPL shared tasks of the EVALITA 2020 campaign. Strong performances were reported for both acceptability and complexity prediction using a multi-task self-training approach, obtaining the top position in both subtasks. Finally, an error analysis highlighted the unpredictability of extreme scores and of sentences with complex syntactic structures.

The suggested approach, although computationally refined and well-performing, lacks the complexity-driven biases that may prove useful in the context of complexity and acceptability prediction. A possible extension of this work may include a complementary syntactic task (e.g., biaffine parsing, as in Glavas and Vulic (2020)) during multi-task learning, to test whether forcing syntactically-competent representations in the top layers proves beneficial for syntax-heavy tasks like complexity and acceptability prediction. Moreover, it would be interesting to evaluate multi-task learning performances with parallel complexity and acceptability annotations, given the conceptual similarity between the two tasks, and to estimate the effectiveness of a feed-forward network as the final aggregator f in the MTSA paradigm, instead of merely averaging predictions. Finally, the findings of Du et al. (2020) suggest that an unsupervised in-domain filtering approach may further improve the self-training procedure when large unlabeled corpora are available.

Acknowledgments

The author was supported by a scholarship for Data Science and Scientific Computing students from the International School for Advanced Studies (SISSA).
References

Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro. 2020. EVALITA 2020: Overview of the 7th evaluation campaign of natural language processing and speech tools for Italian. In Proceedings of the Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020), Online. CEUR.org.

Cristina Bosco, Simonetta Montemagni, and Maria Simi. 2013. Converting Italian treebanks: Towards an Italian Stanford dependency treebank. In Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse, pages 61–69, Sofia, Bulgaria. Association for Computational Linguistics.

Dominique Brunato, Andrea Cimino, Felice Dell'Orletta, Giulia Venturi, and Simonetta Montemagni. 2020a. Profiling-UD: a tool for linguistic profiling of texts. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 7147–7153, Marseille, France. European Language Resources Association.

Dominique Brunato, Cristiano Chesi, Felice Dell'Orletta, Simonetta Montemagni, Giulia Venturi, and Roberto Zamparelli. 2020b. AcCompl-it @ EVALITA2020: Overview of the acceptability & complexity evaluation task for Italian. In Proceedings of the Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020), Online. CEUR.org.

Rich Caruana. 1997. Multitask learning. Machine Learning, 28:41–75.

Rodolfo Delmonte, Antonella Bristot, and Sara Tonelli. 2007. VIT - Venice Italian Treebank: syntactic and quantitative features.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Jingfei Du, E. Grave, Beliz Gunel, Vishrav Chaudhary, Onur Çelebi, M. Auli, Ves Stoyanov, and Alexis Conneau. 2020. Self-training improves pre-training for natural language understanding. ArXiv, abs/2010.02194.

Simone Francia, Loreto Parisi, and Paolo Magnani. 2020. UmBERTo: an Italian language model trained with whole word masking.

Goran Glavas and Ivan Vulic. 2020. Is supervised syntactic parsing beneficial for language understanding? An empirical investigation. ArXiv, abs/2008.06788.

Geoffrey E. Hinton, Oriol Vinyals, and J. Dean. 2015. Distilling the knowledge in a neural network. ArXiv, abs/1503.02531.

Y. Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. ArXiv, abs/1907.11692.

Pedro Javier Ortiz Suárez, Laurent Romary, and Benoît Sagot. 2020. A monolingual approach to contextualized word embeddings for mid-resource languages. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1703–1714, Online. Association for Computational Linguistics.

A. Radford, Jeffrey Wu, R. Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, W. Li, and P. Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. ArXiv, abs/1910.10683.

Manuela Sanguinetti and Cristina Bosco. 2015. PartTUT: The Turin University Parallel Treebank, pages 51–69. Springer International Publishing, Cham.

Manuela Sanguinetti, Cristina Bosco, Alberto Lavelli, Alessandro Mazzei, Oronzo Antonelli, and Fabio Tamburini. 2018. PoSTWITA-UD: an Italian Twitter treebank in Universal Dependencies. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association (ELRA).

Timo Schick and Hinrich Schutze. 2020a. Exploiting cloze questions for few-shot text classification and natural language inference. ArXiv, abs/2001.07676.

Timo Schick and Hinrich Schutze. 2020b. It's not just size that matters: Small language models are also few-shot learners. ArXiv, abs/2009.07118.

H. Scudder. 1965. Probability of error of some adaptive pattern-recognition machines. IEEE Transactions on Information Theory, 11(3):363–371.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace's Transformers: State-of-the-art natural language processing. ArXiv, abs/1910.03771.

Z. Yang, Zihang Dai, Y. Yang, J. Carbonell, R. Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. In NeurIPS.

David Yarowsky. 1995. Unsupervised word sense disambiguation rivaling supervised methods. In 33rd Annual Meeting of the Association for Computational Linguistics, pages 189–196, Cambridge, Massachusetts, USA. Association for Computational Linguistics.

Dani Yogatama, Cyprien de Masson d'Autume, J. Connor, Tomás Kociský, M. Chrzanowski, Lingpeng Kong, A. Lazaridou, W. Ling, L. Yu, Chris Dyer, and P. Blunsom. 2019. Learning and evaluating general linguistic intelligence. ArXiv, abs/1901.11373.