On the comparability of pre-trained language models

Matthias Aßenmacher, Christian Heumann
Department of Statistics
Ludwig-Maximilians-Universität Munich, Germany
{matthias,chris}@stat.uni-muenchen.de


Abstract

Recent developments in unsupervised representation learning have successfully established the concept of transfer learning in NLP. Instead of simply plugging in static pre-trained representations, end-to-end trainable model architectures make better use of contextual information through more intelligently designed language modelling objectives. Along with this, larger corpora are used for self-supervised pre-training of models which are afterwards fine-tuned on supervised tasks. Advances in parallel computing have made it possible to train these models with growing capacities in the same or even shorter time than previously established models. These developments culminate in new state-of-the-art results being revealed at an increasing frequency. Nevertheless, we show that it is not possible to completely disentangle the contributions of the three driving forces to these improvements. We provide a concise overview of several large pre-trained language models which achieved state-of-the-art results on different leaderboards in the last two years, and compare them with respect to their use of new architectures and resources. We clarify where the differences between the models lie and attempt to gain some insight into the individual contributions of lexical and computational improvements as well as those of architectural changes. We do not intend to quantify these contributions, but rather see our work as an overview in order to identify potential starting points for benchmark comparisons.

1 Introduction

For solving NLP tasks, most researchers turn to using pre-trained word embeddings (Mikolov et al., 2013; Pennington et al., 2014; Bojanowski et al., 2017) as a key component of their models. These representations map each word of a sequence to a real-valued vector of fixed dimension. Drawbacks of these kinds of externally learned features are that they are (i) fixed, i.e. they cannot be adapted to the specific domain they are used in, and (ii) context independent, i.e. there is only one embedding per word by which it is represented in any context.
More recently, transfer learning approaches, such as convolutional neural networks (CNNs) pre-trained on ImageNet (Krizhevsky et al., 2012) in computer vision, have entered the discussion. Transfer learning in the NLP context means pre-training a network with a self-supervised objective on large amounts of plain text and fine-tuning its weights afterwards on a task-specific, labelled data set. For a comprehensive overview of the current state of transfer learning in NLP, we recommend the excellent tutorial and blog post by Ruder et al. (2019)¹.
With ULMFiT (Universal Language Model Fine-Tuning), Howard and Ruder (2018) proposed an LSTM-based (Hochreiter and Schmidhuber, 1997) approach for transfer learning in NLP using AWD-LSTMs (Merity et al., 2017). This model can be characterised as unidirectional contextual, while a bidirectionally contextual LSTM-based model was presented in ELMo (Embeddings from Language Models) by Peters et al. (2018).

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

¹ https://ruder.io/state-of-transfer-learning-in-nlp/
The bidirectionality in ELMo is achieved by using biLSTMs instead of AWD-LSTMs. On the other hand, ULMFiT uses a more "pure" transfer learning approach compared to ELMo, as the ELMo embeddings are extracted from the pre-trained model and are not fine-tuned in conjunction with the weights of the task-specific architecture.
The OpenAI GPT (Generative Pre-Training, Radford et al., 2018) resembles ULMFiT in two crucial points: it is a unidirectional language model and it allows stacking task-specific layers on top after pre-training, i.e. it is fully end-to-end trainable. The major difference between them is the internal architecture, where GPT uses a Transformer decoder architecture (Vaswani et al., 2017).
Instead of processing one input token at a time, as recurrent architectures (LSTMs, GRUs) do, Transformers process whole sequences at once. This is possible because they utilize a variant of the Attention mechanism (Bahdanau et al., 2014), which allows modelling dependencies without having to feed the data to the model sequentially. At the same time, GPT can be characterised as unidirectional as it only takes the left side of the context into account. Its successor, OpenAI GPT2 (Radford et al., 2019), possesses (despite some smaller architectural changes) the same model architecture and can thus also be termed unidirectional contextual.
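As a point of reference (this standard formulation is taken from Vaswani et al. (2017) and is not spelled out in the cited papers themselves), the scaled dot-product self-attention at the core of the Transformer computes, for query, key and value matrices Q, K, V (learned linear projections of the input sequence) and key dimension d_k,

    \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V .

Since every token attends to all positions of the sequence within a single matrix operation, the whole sequence can be processed in parallel rather than step by step.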
BERT (Bidirectional Encoder Representations from Transformers, Devlin et al., 2019), and consequently the other two BERT-based approaches discussed here (Liu et al., 2019; Lan et al., 2019) as well, differ from the GPT models in that they are bidirectional Transformer encoder models. Devlin et al. (2019) proposed Masked Language Modelling (MLM) as a special training objective which allows the use of a bidirectional Transformer encoder without compromising the language modelling objective. XLNet (Yang et al., 2019), on the contrary, relies on an objective which the authors call Permutation Language Modelling (PLM) and is also able to model a bidirectional context despite being an auto-regressive model.

2 Related work

In their stimulating paper, Raffel et al. (2019) take several steps in a similar direction by trying to ensure comparability among different Transformer-based models. They perform various experiments with respect to the transfer learning ability of a Transformer encoder-decoder architecture by varying the pre-training objective (different variants of denoising vs. language modelling), the pre-training resources (their newly introduced C4 corpus vs. variants thereof) and the parameter size (from 200M up to 11B). In particular, their idea of introducing a new corpus and creating subsets resembling previously used corpora like RealNews (Zellers et al., 2019) or OpenWebText (Gokaslan and Cohen, 2019) is a promising approach for ensuring comparability.
However, their experiments do not cover an important point we are trying to address with our work: focussing on only one specific architecture does not answer the question of which components explain the performance differences between models whose overall architecture differs (e.g. Attention-based vs. LSTM-based). Yang et al. (2019) also address comparability to some extent by performing an ablation study to compare their XLNet explicitly to BERT. They train six different XLNet-based models where they modify different parts of their model in order to quantify how these design choices influence performance. At the same time, they restrict themselves to an architecture of the same size as BERT-BASE and use the same amount of lexical resources for pre-training. Liu et al. (2019) vary RoBERTa with respect to model size and amount of pre-training resources in order to perform an ablation study also aiming at comparability to BERT. Lan et al. (2019) go one step further with ALBERT by also comparing their model to BERT with regard to run time as well as width and depth of the model.
Although all these experiments are highly valuable steps towards better comparability, there are still no clear guidelines on which comparisons to perform in order to ensure a maximum degree of comparability with respect to multiple potentially influential factors at the same time.

3 Materials and Methods

First, we present the different corpora which were utilised for pre-training the models and compare them with respect to their size and their accessibility (cf. Tab. 1). Subsequently, we briefly introduce the benchmark data sets which the models are commonly fine-tuned and evaluated on. While conceptual differences between the evaluated models have been addressed in the introduction, the models will now be described in more detail. This is driven by the intention to emphasise differences beyond the obvious, conceptual ones.
3.1 Pre-training corpora

English Wikipedia  Devlin et al. (2019) state that they used data from the English Wikipedia and provide a manual for crawling it, but no actual data set. Their version encompassed around 2.5B words. Wikipedia data sets are available in the Tensorflow Datasets module.
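As an illustration (a minimal sketch, not part of the original experiments; the exact dump name is an assumption and depends on the installed version of Tensorflow Datasets), such a Wikipedia dump can be loaded as follows:

    import tensorflow_datasets as tfds

    # The configuration name "wikipedia/20190301.en" is an assumption;
    # the available dumps differ between TFDS versions.
    wiki = tfds.load("wikipedia/20190301.en", split="train")
    for article in wiki.take(1):
        print(article["title"].numpy(), article["text"].numpy()[:200])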
CommonCrawl  Among other resources, Yang et al. (2019) used data from CommonCrawl. Besides stating that they filtered out short or low-quality content, no further information is given. Since CommonCrawl is a dynamic database which is updated on a monthly basis (and the extracted amount of data always depends on the user), we cannot provide a word count for this source in Tab. 1.

ClueWeb (Callan et al., 2009), Giga5 (Parker et al., 2011)  The information about ClueWeb and Giga5 is similarly sparse as for CommonCrawl. ClueWeb was obtained by crawling ∼ 2.8M web pages in 2012, Giga5 was crawled between 01/2009 and 12/2010.

1B Word Benchmark² (Chelba et al., 2013)  This corpus, actually introduced as a benchmark data set by Chelba et al. (2013), combines multiple data sets from the EMNLP 2011 workshop on Statistical Machine Translation. The authors normalised and tokenized the corpus and performed further pre-processing steps, dropping duplicate sentences as well as discarding words with a count below three. Additionally, they randomised the ordering of the sentences in the corpus. This constitutes a corpus with a vocabulary of 793,471 words and a total word count of 829,250,940 words.

BooksCorpus³ (Zhu et al., 2015)  In 2015, Zhu et al. introduced the BooksCorpus, which is heavily used for pre-training language models (cf. Tab. 1). In their work, they used the BooksCorpus in order to train a model for retrieving sentence similarity. Overall, the corpus comprises 984,846,357 words in 74,004,228 sentences obtained from analysing 11,038 books. They report a vocabulary consisting of 1,316,420 unique words, making the corpus lexically more diverse than the 1B Word Benchmark, as its vocabulary is 66% larger whereas its word count is only 19% higher.

Wikitext-103 (Merity et al., 2016a,b)  The authors emphasised the necessity for a new large-scale language modelling data set by stressing the shortcomings of other corpora. They highlight the occurrence of complete articles, which allows learning long-range dependencies, as one of the main benefits of their corpus. This property is, according to the authors, not given in the 1B Word Benchmark as the sentence ordering is randomised there. With a count of 103,227,021 tokens and a vocabulary size of 267,735, it is about one eighth of the 1B Word Benchmark's size concerning token count and about one third concerning vocabulary size. Note that there is also the smaller Wikitext-2 corpus (Merity et al., 2016c) available, which is a subset of about 2% of the size of Wikitext-103.

CC-News (Nagel, 2016)  This corpus was presented and used by Liu et al. (2019). They used a web crawler proposed by Hamborg et al. (2017) to extract data from the CommonCrawl News data set (Nagel, 2016) and obtained a data set similar to the RealNews data set (Zellers et al., 2019).

Stories⁴ (Trinh and Le, 2018)  The authors built a specific subset of the CommonCrawl data based on questions from common sense reasoning tasks. They extracted nearly 1M documents, most of which are taken from longer, coherent stories.

WebText (Radford et al., 2019)  This pre-training corpus, obtained by creating "a new web scrape which emphasised document quality" (Radford et al., 2019), is not publicly available.

OpenWebText (Gokaslan and Cohen, 2019)  As a reaction to Radford et al. (2019) not releasing their pre-training corpus, Gokaslan and Cohen (2019) started an initiative to emulate an open-source version of the WebText corpus.

It becomes obvious that there is a lot of heterogeneity with respect to the observed combinations of availability, quality and corpus size. Thus, we can state that there is some lack of transparency when it comes to the lexical resources used for pre-training. In particular, the missing standardised availability of the BooksCorpus is problematic, as this corpus is heavily used for pre-training.

² https://research.google/pubs/pub41880/
³ https://yknzhu.wixsite.com/mbweb
⁴ https://console.cloud.google.com/storage/browser/commonsense-reasoning/reproduce/stories_corpus
Corpora                  Word count♥     Accessibility            Used by
English Wikipedia        ∼ 2,500M        Fully available          BERT; XLNet; RoBERTa; ALBERT
CommonCrawl              Unclear         Fully available          XLNet
ClueWeb 2012-B, Giga5    Unclear         Fully available ($$)     XLNet
1B Word Benchmark        ∼ 830M          Fully available          ELMo
BooksCorpus              ∼ 985M          Not available            GPT; BERT; XLNet; RoBERTa; ALBERT
Wikitext-103             ∼ 103M          Fully available          ULMFiT
CC-News                  Unclear         Crawling manual          RoBERTa
Stories                  ∼ 7,000M♦       Fully available          RoBERTa
WebText                  Unclear         Not available            GPT2
OpenWebText              Unclear         Fully available          RoBERTa

Table 1: Pre-training resources (sorted by date). "Crawling manual" means the authors did not provide data, but at least a manual for crawling it. Dollar signs signify the necessity of a payment in order to get access. RealNews (Zellers et al., 2019) and C4 (Raffel et al., 2019) are not included as they were not used by the evaluated models.
♥ We report the word count as given in the respective articles proposing the corpora. Note that the number of tokens reported in other articles depends on the tokenization scheme used by a specific model.
♦ Stated by one of the authors on Twitter: https://twitter.com/thtrieu_/status/1096672446864748545

3.2 Benchmark data sets for fine-tuning

GLUE⁵ (Wang et al., 2018)  The General Language Understanding Evaluation (GLUE) benchmark is a freely available collection of nine data sets on which models can be evaluated. It provides a fixed train-dev-test split with held-out labels for the test set, as well as a leaderboard which displays the top submissions and the current state-of-the-art (SOTA). The relevant metric for the SOTA is an aggregate measure of the nine single-task metrics. The benchmark includes two binary classification tasks with single-sentence inputs (CoLA [Warstadt et al., 2018] and SST-2 [Socher et al., 2013]) and five binary classification tasks with inputs that consist of sentence pairs (MRPC [Dolan and Brockett, 2005], QQP [Shankar et al., 2017], QNLI, RTE and WNLI [all Wang et al., 2018]). The remaining two tasks also take sentence pairs as input but have a multi-class classification objective with either three (MNLI [Williams et al., 2017]) or five classes (STS-B [Cer et al., 2017]).
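For illustration (a minimal sketch under the assumption that the Hugging Face datasets library is used; the data can equally be obtained from the official GLUE website), a single GLUE task can be loaded with its fixed split as follows:

    from datasets import load_dataset

    # "mrpc" is just an example task; the labels of the test split are held out by GLUE.
    mrpc = load_dataset("glue", "mrpc")
    print(mrpc)                 # DatasetDict with 'train', 'validation' and 'test' splits
    print(mrpc["train"][0])     # one sentence pair with its label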
SuperGLUE⁶ (Wang et al., 2019)  As a reaction to human baselines being surpassed by the top-ranked models, Wang et al. (2019) proposed a set of benchmark data sets similar to, but, according to the authors, more difficult than GLUE. It did not make sense to include it as a part of our model comparison, as (at the time of writing) only two of the discussed models were evaluated on SuperGLUE.

SQuAD⁷ (Rajpurkar et al., 2016, 2018)  The Stanford Question Answering Dataset (SQuAD) 1.1 consists of 100,000+ questions explicitly designed to be answerable by reading segments of Wikipedia articles. The task is to correctly locate the segment in the text which contains the answer. A shortcoming is the omission of situations where the question is not answerable by reading the provided article. Rajpurkar et al. (2018) address this problem in SQuAD 2.0 by adding 50,000 hand-crafted unanswerable questions to SQuAD 1.1. The authors provide a train and development set as well as an official leaderboard. The test set is completely held out; participants are required to upload their models to CodaLab. The SQuAD 1.1 data is, in an augmented form (QNLI), also part of GLUE.

RACE⁸ (Lai et al., 2017)  The Large-scale ReAding Comprehension Dataset From Examinations (RACE) contains English exam questions for Chinese students (middle and high school). In most of the articles using RACE for evaluation, it is described as especially challenging due to (i) the length of the passages, (ii) the inclusion of reasoning questions and (iii) the intentionally tricky design of the questions in order to test a human's ability in reading comprehension. The data set can be subdivided into RACE-M (middle school examinations) and RACE-H (high school examinations) and comprises a total of 97,687 questions on 27,933 passages of text.

⁵ https://gluebenchmark.com/
⁶ https://super.gluebenchmark.com/
⁷ https://rajpurkar.github.io/SQuAD-explorer/
⁸ http://www.qizhexie.com/data/RACE_leaderboard.html
3.3 Evaluated Models

ULMFiT (Howard and Ruder, 2018)  The AWD-LSTMs in this architecture make use of DropConnect (Wan et al., 2013) for better regularisation and apply averaged stochastic gradient descent (ASGD) for optimization (Polyak and Juditsky, 1992). The model consists of an embedding layer followed by three LSTM layers with a softmax classifier on top for pre-training. It is complemented by a task-specific final layer during fine-tuning. The vocabulary size is limited to 30k words as in Johnson and Zhang (2017). ULMFiT was not evaluated on GLUE, but on several other data sets (IMDb [Maas et al., 2011], TREC-6 [Voorhees and Tice, 1999], Yelp-bi, Yelp-full, AG's news, DBpedia [all Zhang et al., 2015]).

ELMo (Peters et al., 2018)  Consisting of multiple biLSTM layers, ELMo allows extracting multiple intermediate-layer representations. These representations are used for computing a (task-specific) weighted combination, which is concatenated with external, static word embeddings. During the training of the downstream model, the ELMo embeddings are not updated, only the weights for combining them are. For the GLUE benchmark there are multiple ELMo-based architectures available on the leaderboard. In Tab. 3, we report the best-performing model, an ELMo-based BiLSTM model with Attention (Wang et al., 2018).
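For reference, the combination scheme (as formulated by Peters et al. (2018); the notation is not restated in the description above) collapses the L + 1 layer representations h_{k,j} of token k into a single vector using softmax-normalised scalar weights s_j^{task} and a scaling factor γ^{task}:

    \mathrm{ELMo}_k^{\mathrm{task}} = \gamma^{\mathrm{task}} \sum_{j=0}^{L} s_j^{\mathrm{task}} \, h_{k,j}

Only γ^{task} and the weights s_j^{task} are learned during downstream training; the biLSTM parameters themselves remain frozen.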
OpenAI GPT (Radford et al., 2018)  The OpenAI GPT is a pure attention-based architecture that does not make use of any recurrent layers. Pre-training is performed by combining Byte-Pair encoded (Sennrich et al., 2015) token embeddings with learned position embeddings and feeding them into a multi-layer transformer decoder architecture with a standard language modelling objective. Fine-tuning was, amongst others, performed on the nine tasks that together form the GLUE benchmark.

BERT (Devlin et al., 2019)  BERT can be seen as a reference point for everything that came thereafter. Similar to GPT, it uses Byte-Pair Encoding (BPE) with a vocabulary size of 30k. By introducing the MLM objective, the authors were able to combine deep bidirectionality with Self-Attention for the first time. Additionally, BERT also utilizes the next-sentence prediction (NSP) objective, the usefulness of which has been debated in other research papers (Liu et al., 2019). The BERT-BASE model consists of 12 bidirectional transformer-encoder blocks (24 for BERT-LARGE) with 12 (16, respectively) attention heads per block and an embedding size of 768 (1024, respectively).

OpenAI GPT2 (Radford et al., 2019)  Compared to its predecessor GPT, it contains some smaller changes concerning the placement of layer normalisation and residual connections. Overall, there are four different versions of GPT2, with the smallest one being equal to GPT, the medium one being of similar size as BERT-LARGE and the xlarge one being released as the actual GPT2 model with 1.5B parameters.

XLNet (Yang et al., 2019)  In order to overcome (what they call) the pretraining-finetune discrepancy, which is a consequence of BERT's MLM objective, and to simultaneously include bidirectional contexts, Yang et al. (2019) propose the PLM objective. They use two-stream self-attention for preserving the position information of the token to be predicted, which would otherwise be lost due to the permutation. While the content stream attention resembles the standard Self-Attention in a transformer decoder, the query stream attention does not allow the token to see itself but just the preceding tokens of the permuted sequence.
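The following sketch is our own, purely conceptual illustration (it ignores the two-stream parameterisation and the partial-prediction scheme actually used by XLNet) of which positions the query stream may attend to under a sampled factorisation order:

    import numpy as np

    def plm_query_mask(seq_len: int, rng: np.random.Generator) -> np.ndarray:
        """Boolean matrix: entry [i, j] is True if position i may attend to position j."""
        order = rng.permutation(seq_len)      # sampled factorisation order z
        mask = np.zeros((seq_len, seq_len), dtype=bool)
        for t, pos in enumerate(order):
            mask[pos, order[:t]] = True       # i sees only tokens preceding it in z, never itself
        return mask

    print(plm_query_mask(5, np.random.default_rng(0)).astype(int))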
RoBERTa (Liu et al., 2019)  With RoBERTa (Robustly optimized BERT approach), Liu et al. (2019) introduce a replicate of BERT with tuned hyperparameters and a larger corpus used for pre-training. The masking strategy is changed from static (once during pre-processing) to dynamic (for every sequence just before feeding it to the model), the additional NSP objective is removed, the BPE vocabulary is increased to 50k and training is performed on larger batches than for BERT. These adjustments improve the performance of the model and make it competitive with XLNet.
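To make the contrast concrete, the following sketch is our own simplification (RoBERTa operates on BPE tokens and uses the 80/10/10 replacement scheme of Devlin et al. (2019), which is omitted here): static masking fixes the masked positions once during pre-processing, while dynamic masking re-samples them every time a sequence is fed to the model.

    import random

    def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=None):
        """Replace roughly 15% of the tokens by a mask symbol (simplified)."""
        rng = random.Random(seed)
        return [mask_token if rng.random() < mask_prob else tok for tok in tokens]

    sequence = "the model is pre-trained on large corpora".split()
    static_view = mask_tokens(sequence, seed=42)                 # fixed once, reused every epoch
    dynamic_views = [mask_tokens(sequence) for _ in range(3)]    # re-sampled at every pass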
ALBERT (Lan et al., 2019)  Identifying the increase in model size as a problem, ALBERT (A Lite BERT) goes in a different direction compared to most post-BERT architectures. Parameter-reduction techniques are applied in order to train a faster model with lower memory demands that, at the same time, yields a performance comparable to SOTA models. We will always refer to the best-performing ALBERT-XXLARGE, although the smaller ALBERT models also yield results comparable to BERT.
                                         Compute                                      Resources
Model           Hardware                   Training time   pfs-days♥       #parameters   lexical
ULMFiT          NA                         NA              NA              33M           0.18GB
GPT             8 GPUs (P600)              ∼ 30 days       0.96            117M          < 13GB
BERT-BASE       4 Cloud TPUs               ∼ 4 days        0.96 [2.24]♦    110M          13GB
BERT-LARGE      16 Cloud TPUs              ∼ 4 days        3.84 [8.96]♦    340M          13GB
GPT2-MEDIUM     NA                         NA              NA              345M          40GB
GPT2-XLARGE     8 v3 Cloud TPUs            ∼ 7 days        7.84            1,500M        40GB
XLNet-LARGE     128 v3 Cloud TPUs          ∼ 2.5 days      44.8            340M          126GB
RoBERTa         DGX-1 GPUs (8xV100)♣       NA♣             NA              360M          160GB
                1024 32GB V100 GPUs♠       ∼ 1 day♠        4.78            360M          16GB
ALBERT          64 – 1024 v3 Cloud TPUs    NA              NA              233M          16GB

Table 2: Usage of compute and pre-training resources alongside model size for the evaluated model architectures. With lexical resources we refer to the size of the pre-training corpus. ELMo is not included as it is not end-to-end trainable (its size depends on the model used after obtaining the embeddings). The size of ULMFiT is assumed to be the larger value from Merity et al. (2017), since Howard and Ruder (2018) use AWD-LSTMs with a vocabulary size of 30k tokens (Johnson and Zhang, 2016, 2017). Values for GPT2-XLARGE are taken from Strubell et al. (2019).
♥ Petaflop/s-days: estimation according to the formula proposed on https://openai.com/blog/ai-and-compute/: pfs-days = number of units × PFLOPS/unit × days trained × utilization, with an assumed utilization of 1/3. PFLOPS/unit for TPUs from https://cloud.google.com/tpu/.
♦ Unclear whether v2 or v3 TPUs were used. Thus, we provide calculations for both: v2 [v3].
♣ Full RoBERTa model (Liu et al., 2019)    ♠ RoBERTa variant utilizing less pre-training resources

4 Model comparison

Tab. 2 gives an overview of the amount of computational power needed to pre-train a given architecture on given pre-training (lexical) resources. In Tab. 3 we will directly try to relate model architecture and size as well as usage of lexical resources to model performance.
One thing we can learn from Tab. 2 is the lack of detail when it comes to reporting the computational resources used for pre-training. While Howard and Ruder (2018) do not provide any information on the computational power utilised for pre-training, the other articles report it to varying degrees. Unfortunately, there are no clear guidelines on how to appraise this when it comes to evaluating and comparing models. This may be attributed to the rapidly growing availability of hardware, but in our opinion it should nevertheless be accounted for, since it might pose environmental issues (Strubell et al., 2019) and also limits portability to smaller devices.

Further, it is important to consider the differences displayed in Tab. 2 and Tab. 3 when comparing the model performances. Considering two models of approximately the same size (BERT-BASE vs. GPT), the superior performance of BERT-BASE seems to originate purely from its more elaborate architecture. But one should also be aware of the larger lexical resources (BERT-BASE uses at least twice as much data for pre-training) and the unknown differences in the usage of computational power. We approximated the latter as pfs-days (cf. Tab. 2), resulting in an estimate for BERT-BASE that is not less than the one for GPT.
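For illustration, the estimate for BERT-BASE in Tab. 2 can be reproduced with the formula from the table's footnote (our own sketch; the per-device values of 0.18 PFLOPS for a TPU v2 and 0.42 PFLOPS for a TPU v3 are assumptions consistent with the figures reported there):

    def pfs_days(n_units: int, pflops_per_unit: float, days: float, utilization: float = 1 / 3) -> float:
        """Petaflop/s-days = number of units x PFLOPS/unit x days trained x utilization."""
        return n_units * pflops_per_unit * days * utilization

    print(round(pfs_days(4, 0.18, 4), 2))   # 0.96  (BERT-BASE, 4 Cloud TPUs v2, ~4 days)
    print(round(pfs_days(4, 0.42, 4), 2))   # 2.24  (the v3 variant reported in brackets)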
Another aspect which should not be ignored when evaluating performance is ensembling. As can be seen in the first column of Tab. 3, the three model ensembles outperform both of the BERT models by a large margin. Only part of these differences may be attributed to the model architecture or the hyperparameter settings, as the ensembling as well as the larger pre-training resources might give an advantage to these models.
                      GLUE                       SQuAD                     RACE        Resources
Model             leaderboard   dev♥      v1.1 (dev)   v2.0 (dev)   test        #parameters   lexical
BERT-BASE         78.3          –         88.5         76.3♣        65.0♠       110M          13GB
ELMo-based        - 8.3         –         - 2.9        –            –           –             –
GPT               - 5.5         –         –            –            - 6.0       1.1x          < 0.5x
BERT-LARGE        + 2.2         84.05     + 2.4        + 5.6        + 7.0♠      3.1x          1.0x
XLNet-BASE        –             –         –            + 5.03       + 1.05      ∼ 1.0x        1.0x
XLNet-LARGE       + 10.1♦       + 3.39    + 6.0        + 12.5       + 16.75     3.1x          9.7x
RoBERTa           + 10.2♦       + 5.19    + 6.1        + 13.1       + 18.2      3.3x          12.3x
RoBERTa-BASE      –             + 2.30    –            –            –           1.0x          12.3x
RoBERTa‡          –             + 3.79    + 5.1        + 11.0       –           3.3x          1.2x†
ALBERT            + 11.1♦       + 5.91    + 5.6        + 13.9       + 21.5      2.1x          1.2x†

Table 3: Performance values as well as model size and resource usage (reference in italics, highest improvements in bold). Performance differences are given in percentage points (%pts), differences in size/resources as factors. ULMFiT and GPT2 are omitted as no performance values on these data sets are publicly available. No model size is provided for ELMo, since the performance values are from different models (cf. Sec. 3.3). Displayed performance measures are Matthews Correlation (GLUE), F1 score (SQuAD) and Accuracy (RACE).
♥ Own calculations based on Lan et al. (2019), Tab. 13; WNLI is excluded.    ♦ Ensemble performance
♣ Values taken from Yang et al. (2019), Tab. 6.    ♠ Values taken from Zhang et al. (2019), Tab. 2.
† Liu et al. (2019) and Lan et al. (2019) specify the BooksCorpus + English Wikipedia as 16GB.
‡ This variant of RoBERTa uses only BooksCorpus + English Wikipedia for pre-training.

As there are no performance values of single models available for XLNet, RoBERTa and ALBERT on the official GLUE leaderboard, we also compare the single-model performances from Lan et al. (2019) obtained on the dev sets. From this comparison we get an impression of how high the contribution of ensembling might be: the difference between BERT-LARGE and the XLNet ensemble in the official score (7.9 %pts) is more than twice as high as the difference in dev score (3.4 %pts).
In order to address the differences in size of the pre-training resources, Yang et al. (2019) make the extremely insightful effort of comparing an XLNet-BASE variant to BERT-BASE using the same pre-training resources. While the F1 score on SQuAD v2.0 is still remarkably higher than for BERT-BASE (comparable to BERT-LARGE), it does not show a large improvement on RACE (which might have been expected due to the large improvement of XLNet-LARGE over both BERT models).
The comparability of RoBERTa from the GLUE leaderboard (ensemble + larger pre-training resources) to BERT-LARGE is limited, but the authors perform several experiments in order to show the usefulness of their optimisations. When pre-training a single model on comparable lexical resources (13GB for BERT vs. 16GB for RoBERTa), the RoBERTa model shows a smaller (compared to the RoBERTa ensemble), but still remarkable, improvement over BERT-LARGE. In another ablation study, Liu et al. (2019) train a RoBERTa-BASE variant on larger pre-training resources. Even though it comprises only about one third of the size of BERT-LARGE, the larger pre-training corpus in conjunction with the optimised training leads to a slightly better performance on the GLUE dev set. We are not able to compare RoBERTa-BASE to BERT-BASE, as neither the "official" leaderboard score for RoBERTa-BASE nor the "unofficial" dev set score for BERT-BASE is available.
In order to set the results of ULMFiT into context, we present the results published by Yang et al. (2019) along with information on size and pre-training resources in Tab. 4. Despite being much larger and being pre-trained on corpora that are some orders of magnitude larger, BERT-LARGE and XLNet-LARGE do not exhibit that large improvements over the performance of ULMFiT. This might partly originate from the relative simplicity of the tasks, but partly also from the already achieved high performances.
                        Sentiment                           Topic                   Resources
Model           IMDb      Yelp-bi    Yelp-full     AG's news   DBpedia      size     lexical
ULMFiT          95.40     97.84      70.02         94.99       99.20        33M      0.18GB
BERT-LARGE      + 0.09    + 0.27     + 0.66        –           + 0.16       10.3x    72.2x
XLNet-LARGE     + 0.81    + 0.61     + 2.28        + 0.52      + 0.18       10.3x    222.2x

Table 4: Performance comparison (+ model size and resource usage) on the benchmark data sets used by Howard and Ruder (2018). Specification of the differences and highlighting as in Tab. 3. We report accuracies, as opposed to Howard and Ruder (2018); Yang et al. (2019), in order to facilitate a similar interpretation compared to Tab. 3.

5 Discussion

This chapter reflects the main takeaways from the above comparisons and raises some issues for research practices. We do not claim to have a solution to these potentially problematic aspects, but rather think that these points are highly debatable.

Why no benchmark corpus for pre-training?  It is good practice to use benchmark data sets for comparing the performance of pre-trained language models on different types of natural language understanding (NLU) tasks. Many recently published articles (Liu et al., 2019; Yang et al., 2019; Lan et al., 2019) perform (partly extensive) ablation studies controlling for pre-training resources in order to make (versions of) their models comparable to BERT, which is really important as it helps to get an intuition for the impact of pre-training resources. Nevertheless, this is unfortunately not perfect due to two critical issues: (i) BERT and all of its successors make use of the BooksCorpus (Zhu et al., 2015), which is not publicly available, and (ii) this only leads to model comparisons in a low pre-training resource environment (compared to more recent models) and yields no insight into the behaviour of the reference model (e.g. BERT) in a medium or high resource context. So we view statements of the type "Model architecture A is superior to model architecture B on performing task X." somewhat critically and propose to phrase them more like the following: "Model architecture A is superior to model architecture B on performing task X, when pre-trained on a small/medium/large corpus of low/high quality data from domain Y for pre-training time Z."

Why no standardised description of (computational) resources?  When writing this article, it turned out to be difficult to get one unified measure for the amount of computational power used for pre-training. In our opinion, this is not carelessness of the authors but rather reflects the lack of a clear reporting standard. We found ourselves confronted with the following situations:

a) No information at all (Radford et al., 2019)
b) Hardware (Liu et al., 2019; Lan et al., 2019)
c) Hardware and training time (Devlin et al., 2019; Yang et al., 2019)
d) Standardised measure (Radford, 2018)

While a) is clearly unsatisfactory and should be avoided, b) and c) provide most of the necessary information but miss out on the final step to d), where the reporting reaches universal comparability across different articles. The measure we computed (cf. Tab. 2) is of course not as exact as a computation based on the counts of operations in a network, but it requires no deep insight into the model architecture and is thus applicable to a wide range of architectures without much effort.

Shouldn't performance be evaluated in relation to size and resource usage?  As larger models have a higher capacity for learning representations and using larger pre-training resources should improve their quality, varying these two components simultaneously with the model architecture might lead to interference between the individual effects on model performance. This aspect has a slight overlap with the question raised above, but while the above is more or less about introducing some reference, this is about carefully varying and evaluating the effects of different model parts.

6 Conclusion
As can be seen from the above analysis, there is a lack of a concise guideline for fair comparisons of large pre-trained language models. It is not sufficient to just rank models by their performance on the common benchmark data sets, as this does not take into account all the other factors mentioned in this analysis. Further aspects worth reporting are the use of resources (time and compute) spent on model development (including all experimental runs and trials) and hyperparameter tuning during pre-training. In our opinion, this is important with respect to two facets: on the one hand, it is important to take into account environmental considerations when training deep learning models (Strubell et al., 2019); on the other hand, it is also a signal to the reader/user how difficult it is to train (and to fine-tune) the model. This might have implications for the usage of a model as a transfer learning model for diverse downstream tasks. Models that have already been tuned to a high degree during pre-training to reach a certain level of performance may have, in the long run, less potential for further improvements compared to models which reach that level without much hyperparameter tuning.
To conclude, we unfortunately cannot say with certainty which one of the influential factors (architecture or amount of pre-training resources) is more important, but we think that a substantial amount of the recent improvements can be attributed to larger pre-training resources. A detailed disentanglement of the influence of the different components remains an open research question which might be answerable by carefully designed benchmark studies.

Acknowledgments

We would like to thank the three anonymous reviewers for their insightful comments and their feedback on our work.

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.
Jamie Callan, Mark Hoy, Changkuk Yoo, and Le Zhao. 2009. ClueWeb09 data set.
Daniel Cer, Mona Diab, Eneko Agirre, Inigo Lopez-Gazpio, and Lucia Specia. 2017. SemEval-2017 task 1: Semantic textual similarity-multilingual and cross-lingual focused evaluation. arXiv preprint arXiv:1708.00055.
Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. 2013. One billion word benchmark for measuring progress in statistical language modeling. arXiv preprint arXiv:1312.3005.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
William B. Dolan and Chris Brockett. 2005. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing (IWP2005).
Aaron Gokaslan and Vanya Cohen. 2019. OpenWebText corpus.
Felix Hamborg, Norman Meuschke, Corinna Breitinger, and Bela Gipp. 2017. news-please: A generic news crawler and extractor. In 15th International Symposium of Information Science (ISI 2017), pages 218–223.
Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.
Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. arXiv preprint arXiv:1801.06146.
Rie Johnson and Tong Zhang. 2016. Convolutional neural networks for text categorization: Shallow word-level vs. deep character-level. arXiv preprint arXiv:1609.00718.
Rie Johnson and Tong Zhang. 2017. Deep pyramid convolutional neural networks for text categorization. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 562–570.
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105.
Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. 2017. RACE: Large-scale reading comprehension dataset from examinations. arXiv preprint arXiv:1704.04683.
Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2019. ALBERT: A lite BERT for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942.
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. 2011. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, pages 142–150. Association for Computational Linguistics.
Stephen Merity, Nitish Shirish Keskar, and Richard Socher. 2017. Regularizing and optimizing LSTM language models. arXiv preprint arXiv:1708.02182.
Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2016a. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843.
Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2016b. Wikitext-103. Accessed: 2020-02-10.
Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2016c. Wikitext-2. Accessed: 2020-02-10.
Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
Sebastian Nagel. 2016. CC-News.
Robert Parker, David Graff, Junbo Kong, Ke Chen, and Kazuaki Maeda. 2011. English Gigaword fifth edition, June. Linguistic Data Consortium, LDC2011T07, 12.
Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.
Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. arXiv preprint arXiv:1802.05365.
Boris T. Polyak and Anatoli B. Juditsky. 1992. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838–855.
Alec Radford. 2018. Improving language understanding with unsupervised learning. Accessed: 2020-02-10.
Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. URL https://s3-us-west-2.amazonaws.com/openai-assets/researchcovers/languageunsupervised/language understanding paper.pdf.
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog, 1(8).
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683.
Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don't know: Unanswerable questions for SQuAD. arXiv preprint arXiv:1806.03822.
Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250.
Sebastian Ruder, Matthew E. Peters, Swabha Swayamdipta, and Thomas Wolf. 2019. Transfer learning in natural language processing. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Tutorials, pages 15–18, Minneapolis, Minnesota. Association for Computational Linguistics.
Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909.
Iyer Shankar, Dandekar Nikhil, and Csernai Kornél. 2017. First Quora dataset release: Question pairs. Accessed: 2020-02-10.
Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642.
Emma Strubell, Ananya Ganesh, and Andrew McCallum. 2019. Energy and policy considerations for deep learning in NLP. arXiv preprint arXiv:1906.02243.
Trieu H. Trinh and Quoc V. Le. 2018. A simple method for commonsense reasoning. arXiv preprint arXiv:1806.02847.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.
Ellen M. Voorhees and Dawn M. Tice. 1999. The TREC-8 question answering track evaluation. In TREC, volume 1999, page 82. Citeseer.
Li Wan, Matthew Zeiler, Sixin Zhang, Yann Le Cun, and Rob Fergus. 2013. Regularization of neural networks using DropConnect. In International Conference on Machine Learning, pages 1058–1066.
Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. arXiv preprint arXiv:1905.00537.
Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355, Brussels, Belgium. Association for Computational Linguistics.
Alex Warstadt, Amanpreet Singh, and Samuel R. Bowman. 2018. Neural network acceptability judgments. arXiv preprint arXiv:1805.12471.
Adina Williams, Nikita Nangia, and Samuel R. Bowman. 2017. A broad-coverage challenge corpus for sentence understanding through inference. arXiv preprint arXiv:1704.05426.
Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237.
Rowan Zellers, Ari Holtzman, Hannah Rashkin, Yonatan Bisk, Ali Farhadi, Franziska Roesner, and Yejin Choi. 2019. Defending against neural fake news. arXiv preprint arXiv:1905.12616.
Shuailiang Zhang, Hai Zhao, Yuwei Wu, Zhuosheng Zhang, Xi Zhou, and Xiang Zhou. 2019. Dual co-matching network for multi-choice reading comprehension. arXiv preprint arXiv:1901.09381.
Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems, pages 649–657.
Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE International Conference on Computer Vision, pages 19–27.