<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Tackling Italian University Assessment Tests with Transformer-Based Language Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Daniele Puccinelli</string-name>
          <email>daniele.puccinelli@supsi.ch</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Silvia Demartini</string-name>
          <email>silvia.demartini@supsi.ch</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pier Luigi Ferrari</string-name>
          <email>pierluigi.ferrari@uniupo.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Applied Sciences and Arts of Southern Switzerland</institution>
          ,
          <country country="CH">Switzerland</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Eastern Piedmont</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Cloze tests are a great tool to assess reading proficiency as well as analytical thinking, and are therefore employed in admission and assessment tests at various levels of the education system in multiple countries. In Italy, cloze tests are administered to incoming university students to ascertain their starting level. The goal of a cloze test is to determine several tokens that have been pre-deleted from a text; this is largely equivalent to the well-known NLP task of missing token prediction. In this paper, we show that cloze tests can be solved reasonably well with various Transformer-based pre-trained language models, whose performance often compares favorably to that of incoming Italian university students.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>A cloze test is a reading comprehension
assessment where participants are presented with a text
in which selected tokens have been replaced with
blanks. The goal is for the participant to choose
tokens (often from a list) and use them to replace
the blanks based on the overall context. Typically,
one in every 5-10 tokens is replaced with a blank.</p>
      <p>
        Cloze tests are one of the most common
linguistic tests in use for formative and summative
purposes, along with written responses,
multiple-choice tests, matching tests, ordering tests,
summarizing tests, etc.
        <xref ref-type="bibr" rid="ref8">(Lugarini, 2010)</xref>
        . Cloze tests
were originally introduced in the United States in
the 1950s to measure the readability of texts
        <xref ref-type="bibr" rid="ref15">(Taylor, 1953)</xref>
        and involved the random, rather than
predetermined, deletion of words that appeared at
predefined intervals. This method was too general
for didactic and evaluation purposes, but it was
quickly adapted and became very widespread as a
teaching and testing technique
        <xref ref-type="bibr" rid="ref12">(Radice, 1978)</xref>
        . In
education, cloze tests have become more targeted:
words are deleted according to various criteria,
depending on the specific testing goals. In general,
cloze tests are designed to evaluate one of the
following:
• field-specific knowledge acquisition, by
asking test takers to insert appropriate words about a topic
or a discipline;
• text comprehension, by asking for
information that can be inferred from the text (with
no prior domain knowledge);
• linguistic aspects, typically with respect to
L1, L2 and FL (foreign language) acquisition
at different levels (i.e., vocabulary, specific
parts of speech, etc.).
      </p>
      <p>
        If carefully designed, cloze tests can be a very
effective tool at all educational levels; on the other
hand, cloze tests may also show some limits and
issues in assessing linguistic competence
        <xref ref-type="bibr" rid="ref2">(Chiari, 2002)</xref>
        , as they necessarily offer a partial and
contextual view. However, their long tradition of study
and use in the fields of educational linguistics and
linguistics makes it very interesting to compare
human and automatic performance on
cloze tests.
      </p>
      <p>
        Copyright © 2021 for this paper by its authors. Use
permitted under Creative Commons License Attribution 4.0
International (CC BY 4.0).
      </p>
    </sec>
    <sec id="sec-2">
      <title>Methodology</title>
      <p>
        We tackle the cloze tests in our dataset with
pre-trained language models based on the Transformer
architecture
        <xref ref-type="bibr" rid="ref16">(Vaswani et al., 2017)</xref>
        . We
employ both autoencoding and autoregressive
models. Given the very small number of datapoints
at our disposal, model fine-tuning is not a viable
option; therefore, we use pre-trained versions of
such models, all of which are publicly available
through Hugging Face at the time of writing
(summer 2021).
      </p>
      <p>Dataset. Our dataset contains eleven cloze tests
focusing on general linguistic competence that
were administered to incoming first-year students
at the University of Eastern Piedmont in the cities
of Alessandria and Vercelli in northwestern Italy
between 2017 and 2019. Each cloze test was taken
by between 130 and 390 students. As these are
university-level tests, all students had at least a high school
diploma. Most of the students were L1 speakers of Italian. The tests
were administered on-site (in information technology
classrooms) through the Moodle learning
platform.</p>
      <p>
        Our dataset contains two types of cloze tests:
nine restricted tests where a list of options is
provided for each blank to be filled, and two
unrestricted tests where a global list of options is
provided for all blanks with no token subgrouping
(i.e., with no information about which tokens are
supposed to go where). In the two unrestricted
tests and three of the nine restricted ones, the
list(s) contain single token options. In the other six
restricted tests, the lists contain at least one
multiple token option (e.g., il quale or con l’utilizzo).
These cloze tests involved both function words
and content words, with both lexical and
grammatical meanings.
      </p>
      <p>
        Autoencoding models. Our choices for
autoencoding models are BERT
        <xref ref-type="bibr" rid="ref5">(Devlin et al., 2019)</xref>
        ,
RoBERTa
        <xref ref-type="bibr" rid="ref7">(Liu et al., 2019)</xref>
        , DistilBERT
        <xref ref-type="bibr" rid="ref13">(Sanh et
al., 2019)</xref>
        , and ELECTRA
        <xref ref-type="bibr" rid="ref3">(Clark et al., 2020)</xref>
        .
      </p>
      <p>BERT is a natural choice because one of its two
pre-training tasks is masked language modeling:
a fraction of tokens in the pre-training data are
masked so that BERT can be pre-trained to
reconstruct them. Viewed as an NLP task, a cloze test is
a special case of the masked language modeling task
in which tokens are masked in an adversarial fashion:
instead of choosing tokens to be masked uniformly
at random, tokens are masked to challenge the test
taker to reconstruct the meaning of the original
text. Because a cloze test is functionally
equivalent to a masked language modeling task, it is
reasonable to use pre-trained BERT with no further
task-specific fine-tuning.</p>
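      <p>To illustrate how a masked language model can be applied to a restricted cloze blank, the following Python sketch restricts a predicted token distribution to the options offered for that blank and ranks them. It is a minimal sketch under stated assumptions rather than the procedure used in this study: the function names pick_option and rank_options are hypothetical, and the dictionary stands in for the output of a pre-trained masked language model. The values for Tra and Per are the approximate probabilities reported for UmBERTo in the discussion of the geometry test; all other values are invented for illustration.</p>
      <preformat><![CDATA[
```python
from typing import Dict, List


def rank_options(token_probs: Dict[str, float], options: List[str]) -> List[str]:
    """Sort the options for one blank from most to least probable.

    token_probs stands in for the distribution that a masked language
    model assigns to the blank; missing tokens count as probability 0.
    """
    return sorted(options, key=lambda o: token_probs.get(o, 0.0), reverse=True)


def pick_option(token_probs: Dict[str, float], options: List[str]) -> str:
    """Return the single most probable option for one blank."""
    return max(options, key=lambda o: token_probs.get(o, 0.0))


# Blank 5 of the geometry test. "Tra" and "Per" use the approximate
# values reported for UmBERTo; the remaining values are invented.
blank5_probs = {"Tra": 3.3e-2, "Per": 2.9e-3, "In": 1.0e-3,
                "A": 5.0e-4, "Sopra": 1.0e-5}
blank5_options = ["Per", "A", "Sopra", "In", "Tra"]

ranking = rank_options(blank5_probs, blank5_options)
print(ranking[0], ranking[1])  # prints "Tra Per"
```
]]></preformat>
      <p>In this toy setting the sketch reproduces the reported behavior: Tra is chosen, and the correct answer Per ranks second.</p>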
      <p>
        RoBERTa improves on the original BERT by
focusing on the aforementioned masked language
modeling task and removing the other pre-training
task (next sentence prediction). UmBERTo is
a RoBERTa-based model that includes some
interesting optimizations such as SentencePiece tokenization and
Whole Word Masking. UmBERTo has been
shown to perform very well compared to other
BERT-based models
        <xref ref-type="bibr" rid="ref14">(Tamburini, 2020)</xref>
        .
      </p>
      <p>
        DistilBERT
        <xref ref-type="bibr" rid="ref13">(Sanh et al., 2019)</xref>
        is a more
compact language model pre-trained with knowledge
distillation
        <xref ref-type="bibr" rid="ref6">(Hinton et al., 2015)</xref>
        , a technique that
uses the output of a larger teacher network to train
a smaller student network. BERTino
        <xref ref-type="bibr" rid="ref10">(Muffo and
Bertino, 2020)</xref>
        is an Italian DistilBERT model that
was recently proposed as a lightweight alternative
to BERT specifically for the Italian language.
      </p>
      <p>ELECTRA is pre-trained with replaced token
detection: instead of being masked, tokens are
replaced with plausible alternatives sampled from a
generator network; the model is then pre-trained
to discriminate whether each token was replaced
by a generator sample or not. At the outset of this
study, we posited that replaced token
detection is enough to make ELECTRA reasonably
ready to tackle cloze tests with no further
task-specific fine-tuning; this is indeed the case, as
confirmed by the results shown in Table 1.</p>
      <p>To summarize, we employ the following
autoencoding models (all cased, as the cloze tests in
our dataset contain case-sensitive options):
• multilingual BERT-base (bert-base-multilingual-cased; BERT multi),
which serves as a baseline for autoencoding
models;
• the Bayerische Staatsbibliothek’s Italian
BERT model (dbmdz/bert-base-italian-xxl-cased; BERT it);
• a smaller version of multilingual BERT-base
(Geotrend/bert-base-it-cased; BERT it LWYN) based on the Load What
You Need concept described in
        <xref ref-type="bibr" rid="ref1">(Abdaoui et al., 2020)</xref>
        ;
• UmBERTo (Musixmatch/UmBERTo-commoncrawl-cased-v1; see also
https://github.com/musixmatchresearch/UmBERTo) as the representative of the
RoBERTa family;
• BERTino (indigo-ai/BERTino) as the representative of the
DistilBERT family;
• the Bayerische Staatsbibliothek’s Italian
ELECTRA model<sup>1</sup>.</p>
      <p>
        Autoregressive models. The key limitation of
masked language modeling as a proxy for cloze
tests is its focus on single-token masking.
Therefore, autoencoding models are not applicable to
the six cloze tests in our dataset that feature at least
one multiple-token option. (In some cases, the
multiple-token options are consistently among the
incorrect options; using our autoencoding
models in such cases would therefore skew the results
in the models’ favor.) For these tests, we employ
a simple strategy based on autoregressive models:
we iterate over all possible substitutions given the
options offered by a test and choose the one with
the lowest perplexity as determined by each of our
autoregressive language models, all of which are
from the GPT-2
        <xref ref-type="bibr" rid="ref11">(Radford et al., 2019)</xref>
        family and
include the following:
• a standard GPT-2 model<sup>2</sup>, which serves as a
performance lower bound (Vanilla GPT-2);
• a recycled version of GPT-2<sup>3</sup> transferred to
the Italian language
        <xref ref-type="bibr" rid="ref4">(de Vries and Nissim, 2020)</xref>
        (Recycled GPT-2);
• GePpeTto<sup>4</sup>
        <xref ref-type="bibr" rid="ref9">(De Mattei et al., 2020)</xref>
        , the first
generative language model for Italian, also built
using the GPT-2 architecture.
      </p>
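      <p>The substitution strategy described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the code used in this study: the helper names (fill, perplexity, best_substitution, make_bigram_log_prob) and the "[BLANK]" placeholder are hypothetical, and a toy add-one-smoothed bigram model trained on two invented sentences stands in for GPT-2 so that the example is self-contained. The sketch assumes one option list per blank, as in a restricted test.</p>
      <preformat><![CDATA[
```python
import itertools
import math
from collections import Counter


def fill(template, choice):
    """Replace each "[BLANK]" in template with the next option in choice."""
    remaining = iter(choice)
    filled = []
    for token in template:
        # A multi-token option such as "il quale" expands into several tokens.
        filled.extend(next(remaining).split() if token == "[BLANK]" else [token])
    return filled


def perplexity(tokens, log_prob):
    """exp of the negative mean log-probability of each token given its predecessor."""
    pairs = list(zip(["<s>"] + list(tokens), tokens))
    total = sum(log_prob(prev, tok) for prev, tok in pairs)
    return math.exp(-total / len(pairs))


def best_substitution(template, options_per_blank, log_prob):
    """Try every combination of options and keep the lowest-perplexity one."""
    candidates = itertools.product(*options_per_blank)
    return min(candidates, key=lambda c: perplexity(fill(template, c), log_prob))


def make_bigram_log_prob(corpus):
    """Toy add-one-smoothed bigram model; a stand-in for GPT-2."""
    bigrams, unigrams, vocab = Counter(), Counter(), set()
    for sentence in corpus:
        padded = ["<s>"] + sentence
        vocab.update(padded)
        unigrams.update(padded[:-1])
        bigrams.update(zip(padded, padded[1:]))
    size = len(vocab) + 1  # one extra slot for unseen tokens

    def log_prob(prev, tok):
        return math.log((bigrams[(prev, tok)] + 1) / (unigrams[prev] + size))

    return log_prob


# Invented training sentences and a two-blank template (illustration only).
corpus = [
    "il punto giace su una retta".split(),
    "due rette hanno un punto in comune".split(),
]
log_prob = make_bigram_log_prob(corpus)
template = ("il punto giace [BLANK] una retta "
            "due rette hanno un punto [BLANK] comune").split()
options_per_blank = [["su", "in"], ["in", "su"]]

best = best_substitution(template, options_per_blank, log_prob)
print(best)  # prints ('su', 'in')
```
]]></preformat>
      <p>The search is exhaustive, so its cost grows with the product of the option-list sizes; this is manageable for tests with a handful of blanks. With a real autoregressive model, log_prob would be replaced by the model’s own token log-probabilities computed over the full left context.</p>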
    </sec>
    <sec id="sec-3">
      <title>Results</title>
      <p>The results of our study are summarized in Table
1. We report the results obtained by the human
test takers and the models for each of the eleven
cloze tests in our dataset as well as aggregates
(mean values) over the whole dataset. For each
cloze test, we report the number of blanks to
be filled (Questions, which varies from 4 to
6), the number of human test takers (Human
count), and the mean and the standard
deviation of the scores. Each test is identified
by the initial of its topic (S=Science, L=Legal,
G=Geometry, R=Reasoning, E=Education,
H=History, T=Technology) along with a numeral
to disambiguate multiple tests on the same topic.
As previously mentioned, two tests are
unrestricted (all the provided options can go anywhere
in the text) and the others are restricted (there are
specific option lists for each blank to be filled).
As previously explained, six tests (L2, G2, E, H1,
H2, T) contain at least one multi-token option and
are only tackled with autoregressive models. On
average, we observe that:
• humans do better than the best model
(ELECTRA) by eight percentage points;
• ELECTRA, UmBERTo, and GePpeTto are the
top three performers;
• Vanilla GPT-2 aside, BERT it LWYN comes
in last and underperforms BERT multi.</p>
      <p><sup>1</sup> dbmdz/electra-base-italian-xxl-cased-generator;
<sup>2</sup> https://huggingface.co/gpt2;
<sup>3</sup> GroNLP/gpt2-medium-italian-embeddings;
<sup>4</sup> LorenzoDeMattei/GePpeTto.</p>
      <p>Averages, however, hide the enormous gap
between restricted and unrestricted tests. We
illustrate this gap in Table 2, which compares these
two categories of tests model by model and also
shows averages across autoencoding and
autoregressive models (computed over the best
models for each category, i.e., without
BERT it LWYN and BERT multi for autoencoding
models and without Vanilla GPT-2 for
autoregressive models). This leads us to the following
observations:
• our best autoencoding models outperform the
human average;
• as expected, our models perform much better
in restricted tests (we see a gap of 30
percentage points for autoencoding models and
10 points for autoregressive models);
• autoregressive models outperform
autoencoding models in unrestricted tests, while
the converse holds in restricted tests;
• humans perform similarly on both our
restricted and unrestricted tests (and so does
our performance lower bound, Vanilla
GPT-2).</p>
      <p>In our restricted tests, UmBERTo and ELECTRA
outperform the human average and emerge as the
top performers among our models. Though far
below the human average, GePpeTto and
Recycled GPT-2 are the two top performers in
unrestricted tests, where none of the autoencoding
models reaches the pass threshold of 0.6. Vanilla
GPT-2 aside, BERT it LWYN comes in last and
underperforms BERT multi in restricted
tests while matching its baseline performance in
unrestricted tests.</p>
      <p>Dati due punti distinti A e B esiste
una e una sola retta r tale che A e B
appartengono [1] r. Invece di “A appartiene
a r” possiamo scrivere “A giace [2] r”
oppure A è un punto [3] r. Due rette
complanari hanno o un punto o nessun
punto [4] comune. [5] una retta e un punto
che non giace [6] medesima, può essere
fatto passare uno e un solo piano.</p>
      <p>The replacements are reported in Table 3 and
show that this specific cloze test focuses solely on
function words.</p>
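      <table-wrap>
        <label>Table 3</label>
        <table>
          <thead>
            <tr><th>Blank</th><th>Replacement</th></tr>
          </thead>
          <tbody>
            <tr><td>1</td><td>a, su, di, in, per</td></tr>
            <tr><td>2</td><td>su, a, di, in, per</td></tr>
            <tr><td>3</td><td>di, a, da, in, per</td></tr>
            <tr><td>4</td><td>in, a, di, su, per</td></tr>
            <tr><td>5</td><td>Per, A, Sopra, In, Tra</td></tr>
            <tr><td>6</td><td>sulla, alla, della, dalla, tra</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>UmBERTo offers the best performance.
UmBERTo’s only mistake is at blank 5, where Tra is
chosen instead of Per. We note that this is a typical
mistake made by the students who took this cloze
test. The correct answer, Per, ranks second among
UmBERTo’s top picks, with a probability of
approximately 2.9 × 10<sup>−3</sup> as opposed to 3.3 × 10<sup>−2</sup> for
Tra. The second-best models, BERTino,
BERT-base, and ELECTRA-base, make an additional
mistake at blank 2.</p>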
      <p>Let us now consider the following unrestricted
cloze test (L1 in Table 1).</p>
      <p>Ai fini della sicurezza della circolazione e
della tutela della vita umana la velocità [1]
non può superare i 130 km/h per le
autostrade, i 110 km/h per le strade
extraurbane principali, i 90 km/h per le strade
extraurbane secondarie e per le strade
extraurbane locali, e i 50 km/h per le strade
nei centri abitati, con la possibilità di [2] il
limite fino a un massimo di 70 km/h per le
strade urbane le cui caratteristiche
costruttive e funzionali lo consentano, [3]
installazione degli appositi segnali. Sulle
autostrade a tre corsie più corsia di
emergenza per ogni senso di marcia, dotate di
apparecchiature [4] omologate per il
calcolo della velocità media di percorrenza
su tratti determinati, gli enti proprietari
o concessionari possono elevare il limite
massimo di velocità fino a 150 km/h sulla
base delle caratteristiche progettuali ed
effettive del tracciato, previa installazione
degli appositi segnali, [5] lo consentano
l’intensità del traffico, le condizioni
atmosferiche prevalenti e i dati di
incidentalità dell’ultimo [6]. In caso di
precipitazioni atmosferiche di qualsiasi natura, la
velocità massima non può superare i 110
km/h per le autostrade e i 90 km/h per le
strade extraurbane principali.</p>
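      <p>The replacements are reported in Table 4 and
show that this specific cloze test focuses primarily
on content words.</p>
      <table-wrap>
        <label>Table 4</label>
        <table>
          <thead>
            <tr><th>Blank</th><th>Replacement</th></tr>
          </thead>
          <tbody>
            <tr><td>1</td><td>massima</td></tr>
            <tr><td>2</td><td>elevare</td></tr>
            <tr><td>3</td><td>previa</td></tr>
            <tr><td>4</td><td>debitamente</td></tr>
            <tr><td>5</td><td>purché</td></tr>
            <tr><td>6</td><td>quinquennio</td></tr>
            <tr><td>incorrect</td><td>indebitamente, ridurre, finché, secolo, compresa, sebbene, giorno, poiché, esclusa, velocemente, dimezzare, minima</td></tr>
          </tbody>
        </table>
      </table-wrap>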
      <p>Autoregressive models perform very well on this test. GePpeTto
offers the best performance (no incorrect
replacements). Recycled GPT-2 is second best, with only
one incorrect replacement out of 6: giorno is
chosen instead of the correct token quinquennio. This
replacement requires a level of contextual
understanding that cannot be realistically expected from
a language model at this point in time; our
conjecture is that, in this specific instance, GePpeTto’s
correct replacement is most likely fortuitous (its
performance range across all of our tests seems to
support our conjecture). Autoencoding models
fare substantially worse, even though ELECTRA
and BERT-base are fairly close to the average
human performance.</p>
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
      <p>While these results are based on as few as eleven
cloze tests (and only two unrestricted ones), the
key takeaway is that existing pre-trained
Italian language models with no task-specific
fine-tuning can successfully tackle (and pass)
relatively sophisticated tests designed for Italian
students who have successfully completed their high
school education. These results, though
preliminary in nature, suggest various research questions,
which could be answered based on a larger set of
cloze tests. Such questions include whether there
exists a pattern to the incorrect replacements made
by the models, how the models fare with
different parts of speech and with function words as
opposed to content words, and how much their
performance would improve with task-specific
fine-tuning.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <given-names>Amine</given-names>
            <surname>Abdaoui</surname>
          </string-name>
          , Camille Pradel, and Grégoire Sigel.
          <year>2020</year>
          .
          <article-title>Load what you need: Smaller versions of multilingual BERT</article-title>
          . In SustaiNLP / EMNLP.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <given-names>Isabella</given-names>
            <surname>Chiari</surname>
          </string-name>
          .
          <year>2002</year>
          .
          <article-title>La procedura cloze, la ridondanza e la valutazione della competenza della lingua italiana</article-title>
          .
          <source>ITALICA</source>
          ,
          <volume>79</volume>
          :
          <fpage>466</fpage>
          -
          <lpage>481</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <given-names>Kevin</given-names>
            <surname>Clark</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Minh-Thang</given-names>
            <surname>Luong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Quoc V.</given-names>
            <surname>Le</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Christopher D.</given-names>
            <surname>Manning</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>ELECTRA: Pre-training text encoders as discriminators rather than generators</article-title>
          . In
          <source>ICLR</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <given-names>Wietse</given-names>
            <surname>de Vries</surname>
          </string-name>
          and
          <string-name>
            <given-names>Malvina</given-names>
            <surname>Nissim</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>As good as new. How to successfully recycle English GPT-2 to make models for other languages</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ming-Wei</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Kenton</given-names>
            <surname>Lee</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Kristina</given-names>
            <surname>Toutanova</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>BERT: Pre-training of deep bidirectional transformers for language understanding</article-title>
          . In
          <source>NAACL-HLT</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <given-names>Geoffrey</given-names>
            <surname>Hinton</surname>
          </string-name>
          ,
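          <string-name>
            <given-names>Oriol</given-names>
            <surname>Vinyals</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Jeffrey</given-names>
            <surname>Dean</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Distilling the knowledge in a neural network</article-title>
          . In
          <source>NIPS Deep Learning and Representation Learning Workshop</source>
          .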
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <given-names>Yinhan</given-names>
            <surname>Liu</surname>
          </string-name>
          , Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen,
          <string-name>
            <given-names>Omer</given-names>
            <surname>Levy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Mike</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Luke</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Veselin</given-names>
            <surname>Stoyanov</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>RoBERTa: A robustly optimized BERT pretraining approach</article-title>
          . arXiv preprint arXiv:1907.11692.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <given-names>Edoardo</given-names>
            <surname>Lugarini</surname>
          </string-name>
          .
          <year>2010</year>
          . Franco Angeli.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <given-names>Lorenzo</given-names>
            <surname>De Mattei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Michele</given-names>
            <surname>Cafagna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Dell'Orletta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Nissim</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Marco</given-names>
            <surname>Guerini</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>GePpeTto carves Italian into a language model</article-title>
          .
          <source>ArXiv</source>
          , abs/2004.14253.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <given-names>Matteo</given-names>
            <surname>Muffo</surname>
          </string-name>
          and
          <string-name>
            <given-names>E.</given-names>
            <surname>Bertino</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>BERTino: An Italian DistilBERT model</article-title>
          . In CLiC-it.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <given-names>Alec</given-names>
            <surname>Radford</surname>
          </string-name>
          , Jeff Wu, Rewon Child, David Luan,
          <string-name>
            <given-names>Dario</given-names>
            <surname>Amodei</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Ilya</given-names>
            <surname>Sutskever</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Language models are unsupervised multitask learners</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <given-names>F. W.</given-names>
            <surname>Radice</surname>
          </string-name>
          .
          <year>1978</year>
          .
          <article-title>Using the cloze procedure as a teaching technique</article-title>
          .
          <source>ELT Journal</source>
          , XXXII.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <given-names>Victor</given-names>
            <surname>Sanh</surname>
          </string-name>
          , Lysandre Debut, Julien Chaumond, and
          <string-name>
            <given-names>Thomas</given-names>
            <surname>Wolf</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter</article-title>
          . ArXiv, abs/1910.01108.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <given-names>F.</given-names>
            <surname>Tamburini</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>How “BERTology” changed the state-of-the-art also for Italian NLP</article-title>
          . In CLiC-it.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <given-names>Wilson L.</given-names>
            <surname>Taylor</surname>
          </string-name>
          .
          <year>1953</year>
          .
          <article-title>'cloze' procedure: A new tool for measuring readability</article-title>
          .
          <source>Journalism Quarterly</source>
          ,
          <volume>30</volume>
          :
          <fpage>415</fpage>
          -
          <lpage>433</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <given-names>Ashish</given-names>
            <surname>Vaswani</surname>
          </string-name>
          , Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
          <string-name>
            <given-names>Lukasz</given-names>
            <surname>Kaiser</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Illia</given-names>
            <surname>Polosukhin</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Attention is all you need</article-title>
          . In
          <source>Advances in Neural Information Processing Systems</source>
          , volume
          <volume>30</volume>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>