On the comparability of pre-trained language models

Matthias Aßenmacher, Christian Heumann
Department of Statistics, Ludwig-Maximilians-Universität, Munich, Germany
{matthias,chris}@stat.uni-muenchen.de

Abstract

Recent developments in unsupervised representation learning have successfully established the concept of transfer learning in NLP. Instead of simply plugging in static pre-trained representations, end-to-end trainable model architectures are making better use of contextual information through more intelligently designed language modelling objectives. Along with this, larger corpora are used for self-supervised pre-training of models which are afterwards fine-tuned on supervised tasks. Advances in parallel computing made it possible to train these models with growing capacities in the same or even shorter time than previously established models. These developments culminate in new state-of-the-art results being revealed at an increasing frequency. Nevertheless, we show that it is not possible to completely disentangle the contributions of the three driving forces to these improvements. We provide a concise overview of several large pre-trained language models, which achieved state-of-the-art results on different leaderboards in the last two years, and compare them with respect to their use of new architectures and resources. We clarify where the differences between the models are and attempt to gain some insight into the single contributions of lexical and computational improvements as well as those of architectural changes. We do not intend to quantify these contributions, but rather see our work as an overview in order to identify potential starting points for benchmark comparisons.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

For solving NLP tasks, most researchers turn to using pre-trained word embeddings (Mikolov et al., 2013; Pennington et al., 2014; Bojanowski et al., 2017) as a key component of their models. These representations map each word of a sequence to a real-valued vector of fixed dimension. Drawbacks of these kinds of externally learned features are that they are (i) fixed, i.e. they cannot be adapted to the specific domain they are used in, and (ii) context independent, i.e. there is only one embedding per word, by which it is represented in any context.
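To make drawback (ii) concrete, the following minimal sketch (with made-up toy vectors, not taken from any of the cited embedding models) shows that a static lookup table returns the identical vector for a word regardless of the context it appears in:

```python
import numpy as np

# Toy lookup table standing in for pre-trained static embeddings
# (word2vec/GloVe-style); the vectors here are invented for illustration.
embeddings = {
    "bank":  np.array([0.21, -0.53, 0.07]),
    "river": np.array([0.33,  0.14, -0.48]),
    "money": np.array([-0.12, 0.61, 0.25]),
}

def embed(tokens):
    """Map each token of a sequence to its fixed vector."""
    return [embeddings[tok] for tok in tokens]

ctx1 = embed(["river", "bank"])   # 'bank' as in river bank
ctx2 = embed(["money", "bank"])   # 'bank' as in financial institution

# The representation of 'bank' is identical in both contexts:
assert np.allclose(ctx1[1], ctx2[1])
```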
More recently, transfer learning approaches, as for example convolutional neural networks (CNNs) pre-trained on ImageNet (Krizhevsky et al., 2012) in computer vision, have entered the discussion. Transfer learning in the NLP context means pre-training a network with a self-supervised objective on large amounts of plain text and fine-tuning its weights afterwards on a task-specific, labelled data set. For a comprehensive overview of the current state of transfer learning in NLP, we recommend the excellent tutorial and blog post by Ruder et al. (2019) (https://ruder.io/state-of-transfer-learning-in-nlp/).

With ULMFiT (Universal Language Model Fine Tuning), Howard and Ruder (2018) proposed an LSTM-based (Hochreiter and Schmidhuber, 1997) approach for transfer learning in NLP using AWD-LSTMs (Merity et al., 2017). This model can be characterised as unidirectional contextual, while a bidirectionally contextual LSTM-based model was presented in ELMo (Embeddings from Language Models) by Peters et al. (2018). The bidirectionality in ELMo is achieved by using biLSTMs instead of AWD-LSTMs. On the other hand, ULMFiT uses a more "pure" transfer learning approach compared to ELMo, as the ELMo embeddings are extracted from the pre-trained model and are not fine-tuned in conjunction with the weights of the task-specific architecture.

The OpenAI GPT (Generative Pre-Training, Radford et al., 2018) is a model which resembles the characteristics of ULMFiT in two crucial points: It is a unidirectional language model and it allows stacking task-specific layers on top after pre-training, i.e. it is fully end-to-end trainable. The major difference between them is the internal architecture, where GPT uses a Transformer decoder architecture (Vaswani et al., 2017). Instead of processing one input token at a time, like recurrent architectures (LSTMs, GRUs) do, Transformers process whole sequences all at once. This is possible because they utilize a variant of the Attention mechanism (Bahdanau et al., 2014), which allows modelling dependencies without having to feed the data to the model sequentially. At the same time, GPT can be characterised as unidirectional as it just takes into account the left side of the context. Its successor OpenAI GPT2 (Radford et al., 2019) possesses (despite some smaller architectural changes) the same model architecture and thus can also be termed unidirectional contextual.
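The unidirectional restriction just described can be made concrete at the level of the attention weights. The following is a generic NumPy sketch of scaled dot-product self-attention, not code from any of the cited models: with causal=True each position attends only to itself and to positions on its left (the unidirectional setting), while without the mask every position sees the whole sequence, which corresponds to the bidirectional setting discussed next.

```python
import numpy as np

def self_attention(x, causal=False):
    """Scaled dot-product self-attention over a sequence x of shape (n, d).

    For clarity, queries, keys and values are simply x itself; real models
    apply learned projections first."""
    n, d = x.shape
    scores = x @ x.T / np.sqrt(d)                     # (n, n) similarity matrix
    if causal:                                        # unidirectional context
        allowed = np.tril(np.ones((n, n), dtype=bool))
        scores = np.where(allowed, scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ x                                # mix of the allowed positions

x = np.random.default_rng(0).normal(size=(5, 8))      # 5 tokens, 8 dimensions
bidirectional = self_attention(x)                     # every token sees the whole sequence
unidirectional = self_attention(x, causal=True)       # token i sees tokens 0..i only
```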
BERT (Bidirectional Encoder Representations from Transformers, Devlin et al., 2019), and consequently the other two BERT-based approaches discussed here (Liu et al., 2019; Lan et al., 2019) as well, differ from the GPT models by the fact that they are bidirectional Transformer encoder models. Devlin et al. (2019) proposed Masked Language Modelling (MLM) as a special training objective which allows the use of a bidirectional Transformer encoder without compromising the language modelling objective. XLNet (Yang et al., 2019), on the contrary, relies on an objective which the authors call Permutation Language Modelling (PLM) and is also able to model a bidirectional context despite being an auto-regressive model.

2 Related work

In their stimulating paper, Raffel et al. (2019) take several steps in a similar direction by trying to ensure comparability among different Transformer-based models. They perform various experiments with respect to the transfer learning ability of a Transformer encoder-decoder architecture by varying the pre-training objective (different variants of denoising vs. language modelling), the pre-training resources (their newly introduced C4 corpus vs. variants thereof) and the parameter size (from 200M up to 11B). In particular, their idea of introducing a new corpus and creating subsets resembling previously used corpora like RealNews (Zellers et al., 2019) or OpenWebText (Gokaslan and Cohen, 2019) is a promising approach in order to ensure comparability. However, their experiments do not cover an important point we are trying to address with our work: Focussing on only one specific architecture does not yield an answer to the question of which components explain the performance differences between models where the overall architecture differs (e.g. Attention-based vs. LSTM-based).

Yang et al. (2019) also address comparability to some extent by performing an ablation study to compare their XLNet explicitly to BERT. They train six different XLNet-based models where they modify different parts of their model in order to quantify how these design choices influence performance. At the same time, they restrict themselves to an architecture of the same size as BERT-BASE and use the same amount of lexical resources for pre-training. Liu et al. (2019) vary RoBERTa with respect to model size and amount of pre-training resources in order to perform an ablation study also aiming at comparability to BERT. Lan et al. (2019) go one step further with ALBERT by also comparing their model to BERT with regard to run time as well as width and depth of the model.

Although all these experiments are highly valuable steps towards better comparability, there are still no clear guidelines on which comparisons to perform in order to ensure a maximum degree of comparability with respect to multiple potentially influential factors at the same time.

3 Materials and Methods

First, we present the different corpora which were utilised for pre-training the models and compare them with respect to their size and their accessibility (cf. Tab. 1). Subsequently, we will briefly introduce benchmark data sets which the models are commonly fine-tuned and evaluated on. While conceptual differences between the evaluated models have been addressed in the introduction, the models will now be described in more detail. This is driven by the intention to emphasise differences beyond the obvious, conceptual ones.

3.1 Pre-training corpora

English Wikipedia. Devlin et al. (2019) state that they used data from the English Wikipedia and provide a manual for crawling it, but no actual data set. Their version encompassed around 2.5B words. Wikipedia data sets are available in the TensorFlow Datasets module.
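Devlin et al. (2019) only describe how to obtain the data themselves; the TensorFlow Datasets module mentioned above ships ready-made Wikipedia dumps. A minimal sketch (the dated config name "20200301.en" is an example and depends on the installed TFDS version; without the prepared copies on the public GCS bucket, building Wikipedia locally would additionally require Apache Beam):

```python
import tensorflow_datasets as tfds

# Load an English Wikipedia dump; try_gcs=True reuses the prepared copy
# hosted by TFDS instead of building the dataset locally.
ds = tfds.load("wikipedia/20200301.en", split="train", try_gcs=True)

for article in ds.take(1):
    print(article["title"].numpy().decode())
    print(article["text"].numpy().decode()[:200])
```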
CommonCrawl. Among other resources, Yang et al. (2019) used data from CommonCrawl. Besides stating that they filtered out short or low-quality content, no further information is given. Since CommonCrawl is a dynamic database, which is updated on a monthly basis (and the extracted amount of data always depends on the user), we cannot provide a word count for this source in Tab. 1.

ClueWeb (Callan et al., 2009), Giga5 (Parker et al., 2011). The information about ClueWeb and Giga5 is similarly sparse as for CommonCrawl. ClueWeb was obtained by crawling ∼2.8M web pages in 2012; Giga5 was crawled between 01/2009 and 12/2010.

Wikitext-103 (Merity et al., 2016a,b). The authors emphasised the necessity for a new large-scale language modelling data set by stressing the shortcomings of other corpora. They highlight the occurrence of complete articles, which allows learning long-range dependencies, as one of the main benefits of their corpus. This property is, according to the authors, not given in the 1B Word Benchmark as the sentence ordering is randomised there. With a count of 103,227,021 tokens and a vocabulary size of 267,735, it is about one eighth of the 1B Word Benchmark's size concerning token count and about one third concerning the vocabulary size. Note that there is also the smaller Wikitext-2 corpus (Merity et al., 2016c) available, which is a subset of about 2% of the size of Wikitext-103.

CC-News (Nagel, 2016). This corpus was presented and used by Liu et al. (2019). They used a web crawler proposed by Hamborg et al. (2017) to extract data from the CommonCrawl News data set (Nagel, 2016) and obtained a data set similar to the RealNews data set (Zellers et al., 2019).

1B Word Benchmark (Chelba et al., 2013; https://research.google/pubs/pub41880/). This corpus, actually introduced as a benchmark data set by Chelba et al. (2013), combines multiple data sets from the EMNLP 2011 workshop on Statistical Machine Translation. The authors normalised and tokenized the corpus and performed further pre-processing steps in dropping duplicate sentences as well as discarding words with a count below three. Additionally, they randomised the ordering of the sentences in the corpus. This constitutes a corpus with a vocabulary of 793,471 words and a total word count of 829,250,940 words.

BooksCorpus (Zhu et al., 2015; https://yknzhu.wixsite.com/mbweb). In 2015, Zhu et al. introduced the BooksCorpus, which is heavily used for pre-training language models (cf. Tab. 1). In their work, they used the BooksCorpus in order to train a model for retrieving sentence similarity. Overall, the corpus comprises 984,846,357 words in 74,004,228 sentences obtained from analysing 11,038 books. They report a vocabulary consisting of 1,316,420 unique words, making the corpus lexically more diverse than the 1B Word Benchmark, as it possesses a vocabulary that is 66% larger while having a word count that is only 19% higher.

Stories (Trinh and Le, 2018; https://console.cloud.google.com/storage/browser/commonsense-reasoning/reproduce/stories_corpus). The authors built a specific subset of the CommonCrawl data based on questions from common sense reasoning tasks. They extracted nearly 1M documents, most of which are taken from longer, coherent stories.

WebText (Radford et al., 2019). This pre-training corpus, obtained by creating "a new web scrape which emphasised document quality" (Radford et al., 2019), is not publicly available.

OpenWebText (Gokaslan and Cohen, 2019). As a reaction to Radford et al. (2019) not releasing their pre-training corpus, Gokaslan and Cohen (2019) started an initiative to emulate an open-source version of the WebText corpus.

It becomes obvious that there is a lot of heterogeneity with respect to the observed combinations of availability, quality and corpus size. Thus, we can state that there is some lack of transparency when it comes to the lexical resources used for pre-training. In particular, the missing standardised availability of the BooksCorpus is problematic, as this corpus is heavily used for pre-training.

Corpora | Word-count♥ | Accessibility | Used by
English Wikipedia | ∼2,500M | Fully available | BERT; XLNet; RoBERTa; ALBERT
CommonCrawl | Unclear | Fully available | XLNet
ClueWeb 2012-B, Giga5 | Unclear | Fully available ($$) | XLNet
1B Word Benchmark | ∼830M | Fully available | ELMo
BooksCorpus | ∼985M | Not available | GPT; BERT; XLNet; RoBERTa; ALBERT
Wikitext-103 | ∼103M | Fully available | ULMFiT
CC-News | Unclear | Crawling Manual | RoBERTa
Stories | ∼7,000M♦ | Fully available | RoBERTa
WebText | Unclear | Not available | GPT2
OpenWebText | Unclear | Fully available | RoBERTa

Table 1: Pre-training resources (sorted by date). "Crawling Manual" means the authors did not provide data, but at least a manual for crawling it. Dollar signs signify the necessity of a payment in order to get access. RealNews (Zellers et al., 2019) and C4 (Raffel et al., 2019) are not included as they were not used by the evaluated models. ♥ We report the word-count as given in the respective articles proposing the corpora. Note that the number of tokens reported in other articles depends on the tokenization scheme used by a specific model. ♦ Stated by one of the authors on Twitter: https://twitter.com/thtrieu_/status/1096672446864748545
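The 66% and 19% figures quoted above for the BooksCorpus relative to the 1B Word Benchmark follow directly from the reported counts. The sketch below reproduces this arithmetic and illustrates how such statistics could be computed under a naive whitespace tokenisation (the cited papers each use their own tokenisation schemes, so the helper is only an approximation):

```python
# Reported corpus statistics (cf. Sec. 3.1)
bookscorpus = {"words": 984_846_357, "vocab": 1_316_420}
billion_word = {"words": 829_250_940, "vocab": 793_471}

vocab_increase = bookscorpus["vocab"] / billion_word["vocab"] - 1   # ~0.66 -> "66% larger"
word_increase = bookscorpus["words"] / billion_word["words"] - 1    # ~0.19 -> "19% higher"
print(f"vocabulary: +{vocab_increase:.0%}, word count: +{word_increase:.0%}")

def corpus_stats(lines):
    """Word count and vocabulary size under a naive whitespace tokenisation."""
    vocab, n_words = set(), 0
    for line in lines:
        tokens = line.split()
        n_words += len(tokens)
        vocab.update(tokens)
    return n_words, len(vocab)
```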
3.2 Benchmark data sets for fine-tuning

GLUE (Wang et al., 2018; https://gluebenchmark.com/). The General Language Understanding Evaluation (GLUE) benchmark is a freely available collection of nine data sets on which models can be evaluated. It provides a fixed train-dev-test split with held-out labels for the test set, as well as a leaderboard which displays the top submissions and the current state-of-the-art (SOTA). The relevant metric for the SOTA is an aggregate measure of the nine single task metrics. The benchmark includes two binary classification tasks with single-sentence inputs (CoLA [Warstadt et al., 2018] and SST-2 [Socher et al., 2013]) and five binary classification tasks with inputs that consist of sentence pairs (MRPC [Dolan and Brockett, 2005], QQP [Shankar et al., 2017], QNLI, RTE and WNLI [all Wang et al., 2018]). The remaining two tasks also take sentence pairs as input but have a multi-class classification objective with either three (MNLI [Williams et al., 2017]) or five classes (STS-B [Cer et al., 2017]).
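The leaderboard ranks submissions by an aggregate of the nine task metrics. Below is a deliberately simplified sketch of such a macro-average; the per-task numbers are hypothetical, and the official GLUE score additionally averages multiple metrics within some tasks, which is omitted here.

```python
# Hypothetical per-task scores (one entry per GLUE task, in percent)
task_scores = {
    "CoLA": 60.5, "SST-2": 94.9, "MRPC": 89.3, "QQP": 72.1,
    "STS-B": 86.5, "MNLI": 86.7, "QNLI": 92.7, "RTE": 70.1, "WNLI": 65.1,
}

# Simplified aggregate: unweighted mean over the nine tasks
glue_score = sum(task_scores.values()) / len(task_scores)
print(f"aggregate GLUE score: {glue_score:.1f}")
```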
SuperGLUE (Wang et al., 2019; https://super.gluebenchmark.com/). As a reaction to human baselines being surpassed by the top-ranked models, Wang et al. (2019) proposed a set of benchmark data sets similar to, but, according to the authors, more difficult than GLUE. It did not make sense to include it as a part of our model comparison, as (at the time of writing) only two of the discussed models were evaluated on SuperGLUE.

SQuAD (Rajpurkar et al., 2016, 2018; https://rajpurkar.github.io/SQuAD-explorer/). The Stanford Question Answering Dataset (SQuAD) 1.1 consists of 100,000+ questions explicitly designed to be answerable by reading segments of Wikipedia articles. The task is to correctly locate the segment in the text which contains the answer. A shortcoming is the omission of situations where the question is not answerable by reading the provided article. Rajpurkar et al. (2018) address this problem in SQuAD 2.0 by adding 50,000 hand-crafted unanswerable questions to SQuAD 1.1. The authors provide a train and development set as well as an official leaderboard. The test set is completely held out; participants are required to upload their models to CodaLab. The SQuAD 1.1 data is, in an augmented form (QNLI), also part of GLUE.

RACE (Lai et al., 2017; http://www.qizhexie.com/data/RACE_leaderboard.html). The Large-scale ReAding Comprehension Dataset From Examinations (RACE) contains English exam questions for Chinese students (middle/high school). In most of the articles using RACE for evaluation, it is described as especially challenging due to (i) the length of the passages, (ii) the inclusion of reasoning questions and (iii) the intentionally tricky design of the questions in order to test a human's ability in reading comprehension. The data set can be subdivided into RACE-M (middle school examinations) and RACE-H (high school examinations) and comprises a total of 97,687 questions on 27,933 passages of text.

3.3 Evaluated Models

ULMFiT (Howard and Ruder, 2018). The AWD-LSTMs in this architecture make use of DropConnect (Wan et al., 2013) for better regularisation and apply averaged stochastic gradient descent (ASGD) for optimization (Polyak and Juditsky, 1992). The model consists of an embedding layer followed by three LSTM layers with a softmax classifier on top for pre-training. It is complemented by a task-specific final layer during fine-tuning. The vocabulary size is limited to 30k words as in Johnson and Zhang (2017). ULMFiT was not evaluated on GLUE, but on several other data sets (IMDb [Maas et al., 2011], TREC-6 [Voorhees and Tice, 1999], Yelp-bi, Yelp-full, AG's news, DBpedia [all Zhang et al., 2015]).

ELMo (Peters et al., 2018). ELMo consists of multiple biLSTM layers, from which multiple intermediate-layer representations can be extracted. These representations are used for computing a (task-specific) weighted combination, which is concatenated with external, static word embeddings. During the training of the downstream model, the ELMo embeddings are not updated, only the weights for combining them are. For the GLUE benchmark there are multiple ELMo-based architectures available on the leaderboard. In Tab. 3, we report the best-performing model, an ELMo-based BiLSTM model with Attention (Wang et al., 2018).
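The (task-specific) weighted combination described above can be written down in a few lines. This is a generic sketch following that description (softmax-normalised layer weights plus a scalar), not the authors' implementation:

```python
import numpy as np

def combine_elmo_layers(layer_reps, layer_logits, gamma):
    """Collapse per-layer representations (L, n, d) into one (n, d) matrix.

    layer_logits (length L) and gamma are the only parameters updated while
    training the downstream model; the layer representations stay frozen."""
    w = np.exp(layer_logits - np.max(layer_logits))
    w /= w.sum()                                    # softmax over the L layers
    return gamma * np.tensordot(w, layer_reps, axes=1)

L, n, d = 3, 7, 16                                  # layers, tokens, dimensions
layers = np.random.default_rng(1).normal(size=(L, n, d))
task_embedding = combine_elmo_layers(layers, layer_logits=np.zeros(L), gamma=1.0)
# task_embedding would then be concatenated with static word embeddings
# and fed into the task-specific downstream model.
```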
Fine- ments improve performance of the model and make tuning was, amongst others, performed on the nine it competitive to the performance of XLNet. tasks that together form the GLUE benchmark. ALBERT (Lan et al., 2019) By identifying that BERT (Devlin et al., 2019) BERT can be seen the increase of the model size is a problem, AL- as a reference point for everything that came there- BERT (A Lite BERT) goes into another direc- after. Similar to GPT it uses Byte-Pair Encod- tion compared to most of post-BERT architectures. ing (BPE) with a vocabulary size of 30k. By in- Parameter-reduction techniques are applied in or- troducing the MLM objective, the authors were der to train a faster model with lower memory de- able to combine deep bidirectionality with Self- mands that, at the same time, yields a comparable Compute Resources Model Hardware Training time pfs-days ♥ #parameters lexical ULMFiT NA NA NA 33M 0.18GB GPT 8 GPUs (P600) ∼ 30 days 0.96 117M < 13GB BERT-BASE 4 Cloud TPUs ∼ 4 days 0.96 [2.24] ♦ 110M 13GB BERT-LARGE 16 Cloud TPUs ∼ 4 days 3.84 [8.96] ♦ 340M 13GB GPT2-MEDIUM NA NA NA 345M 40GB GPT2-XLARGE 8 v3 Cloud TPUs ∼ 7 days 7.84 1.500M 40GB XLNet-LARGE 128 v3 Cloud TPUs ∼ 2.5 days 44.8 340M 126GB RoBERTa DGX-1 GPUs (8xV100) ♣ NA ♣ NA 360M 160GB 1024 32GB V100 GPUs ♠ ∼ 1 day ♠ 4.78 360M 16GB ALBERT 64 – 1024 v3 Cloud TPUs NA NA 233M 16GB Table 2: Usage of compute and pre-training resources alongside with model size for the evaluated model archi- tectures. With lexical resources we refer to the size of the pre-training corpus. ELMo not included as it is not end-to-end trainable (Size depends on the used model after obtaining the embeddings). The size of ULMFiT is assumed to be the larger value from Merity et al. (2017), since Howard and Ruder (2018) use AWD-LSTMs with a vocabulary size of 30k tokens (Johnson and Zhang, 2016, 2017). Values for GPT2-XLARGE are taken from Strubell et al. (2019). ♥ Petaflop-days: Estimation according to the formula proposed on https://openai.com/blog/ ai-and-compute/: pfs-days = number of units × PFLOPS/unit × days trained × utilization, with an assumed utilization of 13 . PFLOPS/unit for TPUs from https://cloud.google.com/tpu/. ♦ Unclear, whether v2 or v3 TPUs were used. Thus, we provide calculations for both: v2[v3] ♣ ♠ Full RoBERTa model (Liu et al., 2019) RoBERTa variant utilizing less pre-training resources performance to SOTA models. We will always re- et al., 2019) and also limits portability to smaller fer to the best performing ALBERT-XXLARGE, devices. despite also the smaller ALBERT models yield re- sults comparable to BERT. Further, it is important to consider the differ- ences displayed in the Tab. 2 and Tab. 3 when 4 Model comparison comparing the model performances. Consider- ing two models of approximately the same size Tab. 2 gives an overview on the amount of com- (BERT-BASE vs. GPT), the superior performance putational power needed to pre-train a given archi- of BERT-BASE seems to originate purely from its tecture on given pre-training (lexical) resources. In more elaborated architecture because of the similar Tab. 3 we will directly try to relate model architec- size. But one should also be aware of the larger ture and size as well as usage of lexical resources lexical resources (BERT-BASE uses at least twice to model performance. as much data for pre-training) and the unknown One thing we can learn from Tab. 2 is the lack of de- differences in usage of computational power. 
4 Model comparison

Tab. 2 gives an overview of the amount of computational power needed to pre-train a given architecture on given pre-training (lexical) resources. In Tab. 3 we will directly try to relate model architecture and size as well as usage of lexical resources to model performance.

One thing we can learn from Tab. 2 is the lack of details when it comes to reporting the computational resources used for pre-training. While Howard and Ruder (2018) do not provide any information on the computational power utilised for pre-training, the other articles report it to different degrees. Unfortunately, there are no clear guidelines on how to appraise this when it comes to evaluating and comparing models. This may be attributed to the rapidly growing availability of hardware, but in our opinion it should nevertheless be accounted for, since it might pose environmental issues (Strubell et al., 2019) and also limits portability to smaller devices.

Further, it is important to consider the differences displayed in Tab. 2 and Tab. 3 when comparing the model performances. Considering two models of approximately the same size (BERT-BASE vs. GPT), the superior performance of BERT-BASE seems to originate purely from its more elaborate architecture because of the similar size. But one should also be aware of the larger lexical resources (BERT-BASE uses at least twice as much data for pre-training) and the unknown differences in the usage of computational power. We approximated the latter as the pfs-days (cf. Tab. 2), resulting in an estimate for BERT-BASE that is not lower than the one for GPT.

Another aspect which should not be ignored when evaluating performance is ensembling. As can be seen in the first column of Tab. 3, the three model ensembles outperform both of the BERT models by a large margin. Only parts of these differences may be attributed to the model architecture or the hyperparameter settings, as the ensembling as well as the larger pre-training resources might give an advantage to these models. As there are no performance values of single models available for XLNet, RoBERTa and ALBERT on the official GLUE leaderboard, we also compare the single model performances from Lan et al. (2019) obtained on the dev sets. From this comparison we get an impression of how high the contribution of ensembling might be: The difference between BERT-LARGE and the XLNet ensemble in the official score (7.9 %pts) is more than twice as high as the difference in dev score (3.4 %pts).

Model | GLUE leaderboard | GLUE dev♥ | SQuAD v1.1 (dev) | SQuAD v2.0 (dev) | RACE test | #parameters | lexical
BERT-BASE | 78.3 | – | 88.5 | 76.3♣ | 65.0♠ | 110M | 13GB
ELMo-based | −8.3 | – | −2.9 | – | – | – | –
GPT | −5.5 | – | – | – | −6.0 | 1.1x | <0.5x
BERT-LARGE | +2.2 | 84.05 | +2.4 | +5.6 | +7.0♠ | 3.1x | 1.0x
XLNet-BASE | – | – | – | +5.03 | +1.05 | ∼1.0x | 1.0x
XLNet-LARGE | +10.1♦ | +3.39 | +6.0 | +12.5 | +16.75 | 3.1x | 9.7x
RoBERTa | +10.2♦ | +5.19 | +6.1 | +13.1 | +18.2 | 3.3x | 12.3x
RoBERTa-BASE | – | +2.30 | – | – | – | 1.0x | 12.3x
RoBERTa‡ | – | +3.79 | +5.1 | +11.0 | – | 3.3x | 1.2x†
ALBERT | +11.1♦ | +5.91 | +5.6 | +13.9 | +21.5 | 2.1x | 1.2x†

Table 3: Performance values as well as model size and resource usage (the reference row BERT-BASE is given in absolute values; all other rows as differences relative to it). Performance differences are given in percentage points (%pts), differences in size/resources as factors. ULMFiT and GPT2 are omitted as there are no performance values on these data sets publicly available. No model size for ELMo is provided, since the performance values are from different models (cf. Sec. 3.3). Displayed performance measures are Matthews Correlation (GLUE), F1 score (SQuAD) and Accuracy (RACE). ♥ Own calculations based on Lan et al. (2019), Tab. 13; WNLI is excluded. ♦ Ensemble performance. ♣ Values taken from Yang et al. (2019), Tab. 6. ♠ Values taken from Zhang et al. (2019), Tab. 2. † Liu et al. (2019) and Lan et al. (2019) specify the BooksCorpus + English Wikipedia as 16GB. ‡ This variant of RoBERTa uses only BooksCorpus + English Wikipedia for pre-training.
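Tab. 3 reports performance as percentage-point differences and size/resource usage as factors relative to the BERT-BASE reference row. A minimal sketch of this bookkeeping, using the BERT-LARGE row as an example (the absolute SQuAD v1.1 value of 90.9 is only implied by the table, not reported there):

```python
def as_delta(score, reference):
    """Performance difference in percentage points (%pts)."""
    return round(score - reference, 2)

def as_factor(value, reference):
    """Size or resource usage expressed as a multiple of the reference."""
    return round(value / reference, 1)

# BERT-BASE serves as the reference (cf. Tab. 2 and Tab. 3)
print(as_delta(90.9, 88.5))       # SQuAD v1.1 F1: +2.4 %pts
print(as_factor(340e6, 110e6))    # #parameters:   3.1x
print(as_factor(13, 13))          # lexical:       1.0x
```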
In order to address the differences in the size of the pre-training resources, Yang et al. (2019) make the extremely insightful effort to compare an XLNet-BASE variant to BERT-BASE using the same pre-training resources. While the F1 score on SQuAD v2.0 is still remarkably higher than for BERT-BASE (comparable to BERT-LARGE), it does not show a large improvement on RACE (which might have been expected due to the large improvement of XLNet-LARGE over both BERT models).

The comparability of RoBERTa from the GLUE leaderboard (ensemble + larger pre-training resources) to BERT-LARGE is limited, but the authors perform several experiments in order to show the usefulness of their optimisations. When pre-training a single model on comparable lexical resources (13GB for BERT vs. 16GB for RoBERTa), the RoBERTa model shows a smaller (compared to the RoBERTa ensemble), but still remarkable, improvement over BERT-LARGE. In another ablation study, Liu et al. (2019) train a RoBERTa-BASE variant on larger pre-training resources. Even though it comprises only about one third of the size of BERT-LARGE, the larger pre-training corpus in conjunction with the optimised training leads to a slightly better performance on the GLUE dev set. We are not able to compare RoBERTa-BASE to BERT-BASE, as neither the "official" leaderboard score for RoBERTa-BASE nor the "unofficial" dev set score for BERT-BASE are available.

In order to set the results of ULMFiT into context, we present the results published by Yang et al. (2019) alongside information on size and pre-training resources in Tab. 4. Despite being much larger and pre-training on corpora that are some orders of magnitude larger, BERT-LARGE and XLNet-LARGE do not exhibit that large improvements over the performance of ULMFiT. This might partly originate from the relative simplicity of the tasks, but partly also from the already achieved high performances.

Model | IMDb | Yelp-bi | Yelp-full | AG's news | DBpedia | size | lexical
ULMFiT | 95.40 | 97.84 | 70.02 | 94.99 | 99.20 | 33M | 0.18GB
BERT-LARGE | +0.09 | +0.27 | +0.66 | – | +0.16 | 10.3x | 72.2x
XLNet-LARGE | +0.81 | +0.61 | +2.28 | +0.52 | +0.18 | 10.3x | 222.2x

Table 4: Performance comparison (+ model size and resource usage) on the benchmark data sets used by Howard and Ruder (2018): sentiment classification (IMDb, Yelp-bi, Yelp-full) and topic classification (AG's news, DBpedia). Specification of the differences and highlighting as in Tab. 3. We report accuracies, as opposed to Howard and Ruder (2018); Yang et al. (2019), in order to facilitate a similar interpretation compared to Tab. 3.
5 Discussion

This section reflects the main takeaways from the above comparisons and raises some issues for research practices. We do not claim to have a solution to these potentially problematic aspects, but rather think that these points are highly debatable.

Why no benchmark corpus for pre-training? It is good practice to use benchmark data sets for comparing the performance of pre-trained language models on different types of natural language understanding (NLU) tasks. Many recently published articles (Liu et al., 2019; Yang et al., 2019; Lan et al., 2019) perform (partly extensive) ablation studies controlling for pre-training resources in order to make (versions of) their models comparable to BERT, which is really important as it helps to get an intuition for the impact of pre-training resources. Nevertheless, this is unfortunately not perfect due to two critical issues: (i) BERT and all of its successors make use of the BooksCorpus (Zhu et al., 2015), which is not publicly available, and (ii) this only leads to model comparisons in a low pre-training resource environment (compared to more recent models) and yields no insight on the behaviour of the reference model (e.g. BERT) in a medium or high resource context. We therefore view statements of the type "Model architecture A is superior to model architecture B on performing task X." somewhat critically and propose to phrase them more like the following statement: "Model architecture A is superior to model architecture B on performing task X, when pre-trained on a small/medium/large corpus of low/high quality data from domain Y for pre-training time Z."

Why no standardised description of (computational) resources? When writing this article, it turned out to be difficult to get one unified measure for the amount of computational power used for pre-training. In our opinion, this is not carelessness of the authors but rather the lack of a clear reporting standard. We found ourselves confronted with the following situations:

a) No information at all (Radford et al., 2019)
b) Hardware (Liu et al., 2019; Lan et al., 2019)
c) Hardware and training time (Devlin et al., 2019; Yang et al., 2019)
d) Standardised measure (Radford, 2018)

While a) is clearly unsatisfactory and should be avoided, b) and c) provide most of the necessary information but miss out on taking the final step to d), where the reporting reaches universal comparability across different articles. The measure we computed (cf. Tab. 2) is of course not as exact as a computation based on the counts of operations in a network, but it requires no deep insight into the model architecture and is thus applicable to a wide range of architectures without much effort.

Shouldn't performance be evaluated in relation to size and resource usage? As larger models have a higher capacity for learning representations and using larger pre-training resources should improve their quality, varying these two components simultaneously with the model architecture might lead to interference between the individual effects on model performance. This aspect has a slight overlap with the question raised above, but while the above is more or less about introducing some reference, this is about carefully varying and evaluating the effects of different model parts.

6 Conclusion

As can be seen from the above analysis, there is a lack of a concise guideline for fair comparisons of large pre-trained language models. It is not sufficient to just rank models by their performance on the common benchmark data sets, as this does not take into account all the other factors mentioned in this analysis. Further aspects worth reporting are the resources (time and compute) spent on model development (including all experimental runs and trials) and hyperparameter tuning during pre-training. In our opinion, this is important with respect to two facets: On the one hand, it is important to take into account environmental considerations when training deep learning models (Strubell et al., 2019); on the other hand, it is also a signal to the reader/user of how difficult it is to train (and to fine-tune) the model. This might have implications for the usage of a model as a transfer learning model for diverse downstream tasks. Models that have already been tuned to a high degree during pre-training to reach a certain level of performance may have, in the long run, less potential for further improvements compared to models which do so without much hyperparameter tuning.

To conclude, we unfortunately cannot determine which one of the influential factors (architecture or amount of pre-training resources) is more important, but we think that a substantial amount of the recent improvements can be attributed to larger pre-training resources. A detailed disentanglement of the influence of the different components remains an open research question which might be answerable by carefully designed benchmark studies.

Acknowledgments

We would like to thank the three anonymous reviewers for their insightful comments and their feedback on our work.
References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.

Jamie Callan, Mark Hoy, Changkuk Yoo, and Le Zhao. 2009. ClueWeb09 data set.

Daniel Cer, Mona Diab, Eneko Agirre, Inigo Lopez-Gazpio, and Lucia Specia. 2017. SemEval-2017 task 1: Semantic textual similarity-multilingual and cross-lingual focused evaluation. arXiv preprint arXiv:1708.00055.

Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. 2013. One billion word benchmark for measuring progress in statistical language modeling. arXiv preprint arXiv:1312.3005.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

William B Dolan and Chris Brockett. 2005. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing (IWP2005).

Aaron Gokaslan and Vanya Cohen. 2019. OpenWebText corpus.

Felix Hamborg, Norman Meuschke, Corinna Breitinger, and Bela Gipp. 2017. News-please: a generic news crawler and extractor. In 15th International Symposium of Information Science (ISI 2017), pages 218–223.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. arXiv preprint arXiv:1801.06146.

Rie Johnson and Tong Zhang. 2016. Convolutional neural networks for text categorization: Shallow word-level vs. deep character-level. arXiv preprint arXiv:1609.00718.

Rie Johnson and Tong Zhang. 2017. Deep pyramid convolutional neural networks for text categorization. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 562–570.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105.

Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. 2017. RACE: Large-scale reading comprehension dataset from examinations. arXiv preprint arXiv:1704.04683.

Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2019. ALBERT: A lite BERT for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942.
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Andrew L Maas, Raymond E Daly, Peter T Pham, Dan Huang, Andrew Y Ng, and Christopher Potts. 2011. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, pages 142–150. Association for Computational Linguistics.

Stephen Merity, Nitish Shirish Keskar, and Richard Socher. 2017. Regularizing and optimizing LSTM language models. arXiv preprint arXiv:1708.02182.

Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2016a. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843.

Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2016b. Wikitext-103. Accessed: 2020-02-10.

Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2016c. Wikitext-2. Accessed: 2020-02-10.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Sebastian Nagel. 2016. CC-News.

Robert Parker, David Graff, Junbo Kong, Ke Chen, and Kazuaki Maeda. 2011. English Gigaword fifth edition, June. Linguistic Data Consortium, LDC2011T07, 12.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. arXiv preprint arXiv:1802.05365.

Boris T Polyak and Anatoli B Juditsky. 1992. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838–855.

Alec Radford. 2018. Improving language understanding with unsupervised learning. Accessed: 2020-02-10.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. URL https://s3-us-west-2.amazonaws.com/openai-assets/researchcovers/languageunsupervised/language_understanding_paper.pdf.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog, 1(8).

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683.

Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don't know: Unanswerable questions for SQuAD. arXiv preprint arXiv:1806.03822.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250.

Sebastian Ruder, Matthew E. Peters, Swabha Swayamdipta, and Thomas Wolf. 2019. Transfer learning in natural language processing. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Tutorials, pages 15–18, Minneapolis, Minnesota. Association for Computational Linguistics.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909.

Iyer Shankar, Dandekar Nikhil, and Csernai Kornél. 2017. First Quora dataset release: Question pairs. Accessed: 2020-02-10.

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642.

Emma Strubell, Ananya Ganesh, and Andrew McCallum. 2019. Energy and policy considerations for deep learning in NLP. arXiv preprint arXiv:1906.02243.
Trieu H Trinh and Quoc V Le. 2018. A simple method for commonsense reasoning. arXiv preprint arXiv:1806.02847.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Ellen M Voorhees and Dawn M Tice. 1999. The TREC-8 question answering track evaluation. In TREC, volume 1999, page 82. Citeseer.

Li Wan, Matthew Zeiler, Sixin Zhang, Yann Le Cun, and Rob Fergus. 2013. Regularization of neural networks using DropConnect. In International Conference on Machine Learning, pages 1058–1066.

Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. 2019. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. arXiv preprint arXiv:1905.00537.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355, Brussels, Belgium. Association for Computational Linguistics.

Alex Warstadt, Amanpreet Singh, and Samuel R Bowman. 2018. Neural network acceptability judgments. arXiv preprint arXiv:1805.12471.

Adina Williams, Nikita Nangia, and Samuel R Bowman. 2017. A broad-coverage challenge corpus for sentence understanding through inference. arXiv preprint arXiv:1704.05426.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237.

Rowan Zellers, Ari Holtzman, Hannah Rashkin, Yonatan Bisk, Ali Farhadi, Franziska Roesner, and Yejin Choi. 2019. Defending against neural fake news. arXiv preprint arXiv:1905.12616.

Shuailiang Zhang, Hai Zhao, Yuwei Wu, Zhuosheng Zhang, Xi Zhou, and Xiang Zhou. 2019. Dual co-matching network for multi-choice reading comprehension. arXiv preprint arXiv:1901.09381.

Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems, pages 649–657.

Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE International Conference on Computer Vision, pages 19–27.