On the comparability of pre-trained language models

Matthias Aßenmacher, Christian Heumann
Department of Statistics, Ludwig-Maximilians-Universität, Munich, Germany
{matthias,chris}@stat.uni-muenchen.de

Abstract

Recent developments in unsupervised representation learning have successfully established the concept of transfer learning in NLP. Instead of simply plugging in static pre-trained representations, end-to-end trainable model architectures are making better use of contextual information through more intelligently designed language modelling objectives. Along with this, larger corpora are used for self-supervised pre-training of models which are afterwards fine-tuned on supervised tasks. Advances in parallel computing made it possible to train these models with growing capacities in the same or even shorter time than previously established models. These developments culminate in new state-of-the-art results being revealed at an increasing frequency. Nevertheless, we show that it is not possible to completely disentangle the contributions of the three driving forces to these improvements. We provide a concise overview of several large pre-trained language models, which achieved state-of-the-art results on different leaderboards in the last two years, and compare them with respect to their use of new architectures and resources. We clarify where the differences between the models are and attempt to gain some insight into the single contributions of lexical and computational improvements as well as those of architectural changes. We do not intend to quantify these contributions, but rather see our work as an overview in order to identify potential starting points for benchmark comparisons.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

For solving NLP tasks, most researchers turn to using pre-trained word embeddings (Mikolov et al., 2013; Pennington et al., 2014; Bojanowski et al., 2017) as a key component of their models. These representations map each word of a sequence to a real-valued vector of fixed dimension. Drawbacks of these kinds of externally learned features are that they are (i) fixed, i.e. they cannot be adapted to the specific domain they are used in, and (ii) context independent, i.e. there is only one embedding per word, by which it is represented in any context.
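To make drawback (ii) concrete, the following minimal sketch (with made-up toy vectors, not taken from any of the cited embedding models) shows that a static lookup table returns the identical vector for a word regardless of the context it appears in:

```python
import numpy as np

# Toy lookup table standing in for pre-trained static embeddings
# (word2vec/GloVe-style); the vectors here are invented for illustration.
embeddings = {
    "bank":  np.array([0.21, -0.53, 0.07]),
    "river": np.array([0.33,  0.14, -0.48]),
    "money": np.array([-0.12, 0.61, 0.25]),
}

def embed(tokens):
    """Map each token of a sequence to its fixed vector."""
    return [embeddings[tok] for tok in tokens]

ctx1 = embed(["river", "bank"])   # 'bank' as in river bank
ctx2 = embed(["money", "bank"])   # 'bank' as in financial institution

# The representation of 'bank' is identical in both contexts:
assert np.allclose(ctx1[1], ctx2[1])
```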
More recently, transfer learning approaches, as for example convolutional neural networks (CNNs) pre-trained on ImageNet (Krizhevsky et al., 2012) in computer vision, have entered the discussion. Transfer learning in the NLP context means pre-training a network with a self-supervised objective on large amounts of plain text and fine-tuning its weights afterwards on a task-specific, labelled data set. For a comprehensive overview of the current state of transfer learning in NLP, we recommend the excellent tutorial and blog post by Ruder et al. (2019) (https://ruder.io/state-of-transfer-learning-in-nlp/).

With ULMFiT (Universal Language Model Fine Tuning), Howard and Ruder (2018) proposed an LSTM-based (Hochreiter and Schmidhuber, 1997) approach for transfer learning in NLP using AWD-LSTMs (Merity et al., 2017). This model can be characterised as unidirectional contextual, while a bidirectionally contextual LSTM-based model was presented in ELMo (Embeddings from Language Models) by Peters et al. (2018). The bidirectionality in ELMo is achieved by using biLSTMs instead of AWD-LSTMs. On the other hand, ULMFiT uses a more "pure" transfer learning approach compared to ELMo, as the ELMo embeddings are extracted from the pre-trained model and are not fine-tuned in conjunction with the weights of the task-specific architecture.

The OpenAI GPT (Generative Pre-Training, Radford et al., 2018) is a model which resembles the characteristics of ULMFiT in two crucial points: It is a unidirectional language model and it allows stacking task-specific layers on top after pre-training, i.e. it is fully end-to-end trainable. The major difference between them is the internal architecture, where GPT uses a Transformer decoder architecture (Vaswani et al., 2017). Instead of processing one input token at a time, like recurrent architectures (LSTMs, GRUs) do, Transformers process whole sequences all at once. This is possible because they utilize a variant of the Attention mechanism (Bahdanau et al., 2014), which allows modelling dependencies without having to feed the data to the model sequentially. At the same time, GPT can be characterised as unidirectional as it just takes into account the left side of the context. Its successor OpenAI GPT2 (Radford et al., 2019) possesses (despite some smaller architectural changes) the same model architecture and thus can also be termed unidirectional contextual.
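The unidirectional restriction just described can be made concrete at the level of the attention weights. The following is a generic NumPy sketch of scaled dot-product self-attention, not code from any of the cited models: with causal=True each position attends only to itself and to positions on its left (the unidirectional setting), while without the mask every position sees the whole sequence, which corresponds to the bidirectional setting discussed next.

```python
import numpy as np

def self_attention(x, causal=False):
    """Scaled dot-product self-attention over a sequence x of shape (n, d).

    For clarity, queries, keys and values are simply x itself; real models
    apply learned projections first."""
    n, d = x.shape
    scores = x @ x.T / np.sqrt(d)                     # (n, n) similarity matrix
    if causal:                                        # unidirectional context
        allowed = np.tril(np.ones((n, n), dtype=bool))
        scores = np.where(allowed, scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ x                                # mix of the allowed positions

x = np.random.default_rng(0).normal(size=(5, 8))      # 5 tokens, 8 dimensions
bidirectional = self_attention(x)                     # every token sees the whole sequence
unidirectional = self_attention(x, causal=True)       # token i sees tokens 0..i only
```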
BERT (Bidirectional Encoder Representations from Transformers, Devlin et al., 2019), and consequently the other two BERT-based approaches discussed here (Liu et al., 2019; Lan et al., 2019) as well, differ from the GPT models by the fact that they are bidirectional Transformer encoder models. Devlin et al. (2019) proposed Masked Language Modelling (MLM) as a special training objective which allows the use of a bidirectional Transformer encoder without compromising the language modelling objective. XLNet (Yang et al., 2019), on the contrary, relies on an objective which the authors call Permutation Language Modelling (PLM) and is also able to model a bidirectional context despite being an auto-regressive model.

2 Related work

In their stimulating paper, Raffel et al. (2019) take several steps in a similar direction by trying to ensure comparability among different Transformer-based models. They perform various experiments with respect to the transfer learning ability of a Transformer encoder-decoder architecture by varying the pre-training objective (different variants of denoising vs. language modelling), the pre-training resources (their newly introduced C4 corpus vs. variants thereof) and the parameter size (from 200M up to 11B). In particular, their idea of introducing a new corpus and creating subsets resembling previously used corpora like RealNews (Zellers et al., 2019) or OpenWebText (Gokaslan and Cohen, 2019) is a promising approach in order to ensure comparability. However, their experiments do not cover an important point we are trying to address with our work: Focussing on only one specific architecture does not yield an answer to the question of which components explain the performance differences between models where the overall architecture differs (e.g. Attention-based vs. LSTM-based).

Yang et al. (2019) also address comparability to some extent by performing an ablation study to compare their XLNet explicitly to BERT. They train six different XLNet-based models where they modify different parts of their model in order to quantify how these design choices influence performance. At the same time, they restrict themselves to an architecture of the same size as BERT-BASE and use the same amount of lexical resources for pre-training. Liu et al. (2019) vary RoBERTa with respect to model size and amount of pre-training resources in order to perform an ablation study also aiming at comparability to BERT. Lan et al. (2019) go one step further with ALBERT by also comparing their model to BERT with regard to run time as well as width and depth of the model.

Although all these experiments are highly valuable steps towards better comparability, there are still no clear guidelines on which comparisons to perform in order to ensure a maximum degree of comparability with respect to multiple potentially influential factors at the same time.

3 Materials and Methods

First, we present the different corpora which were utilised for pre-training the models and compare them with respect to their size and their accessibility (cf. Tab. 1). Subsequently, we will briefly introduce benchmark data sets which the models are commonly fine-tuned and evaluated on. While conceptual differences between the evaluated models have been addressed in the introduction, the models will now be described in more detail. This is driven by the intention to emphasise differences beyond the obvious, conceptual ones.

3.1 Pre-training corpora

English Wikipedia. Devlin et al. (2019) state that they used data from the English Wikipedia and provide a manual for crawling it, but no actual data set. Their version encompassed around 2.5B words. Wikipedia data sets are available in the TensorFlow Datasets module.
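Devlin et al. (2019) only describe how to obtain the data themselves; the TensorFlow Datasets module mentioned above ships ready-made Wikipedia dumps. A minimal sketch (the dated config name "20200301.en" is an example and depends on the installed TFDS version; without the prepared copies on the public GCS bucket, building Wikipedia locally would additionally require Apache Beam):

```python
import tensorflow_datasets as tfds

# Load an English Wikipedia dump; try_gcs=True reuses the prepared copy
# hosted by TFDS instead of building the dataset locally.
ds = tfds.load("wikipedia/20200301.en", split="train", try_gcs=True)

for article in ds.take(1):
    print(article["title"].numpy().decode())
    print(article["text"].numpy().decode()[:200])
```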
CommonCrawl. Among other resources, Yang et al. (2019) used data from CommonCrawl. Besides stating that they filtered out short or low-quality content, no further information is given. Since CommonCrawl is a dynamic database, which is updated on a monthly basis (and the extracted amount of data always depends on the user), we cannot provide a word count for this source in Tab. 1.

ClueWeb (Callan et al., 2009), Giga5 (Parker et al., 2011). The information about ClueWeb and Giga5 is similarly sparse as for CommonCrawl. ClueWeb was obtained by crawling ∼2.8M web pages in 2012; Giga5 was crawled between 01/2009 and 12/2010.

Wikitext-103 (Merity et al., 2016a,b). The authors emphasised the necessity for a new large-scale language modelling data set by stressing the shortcomings of other corpora. They highlight the occurrence of complete articles, which allows learning long-range dependencies, as one of the main benefits of their corpus. This property is, according to the authors, not given in the 1B Word Benchmark as the sentence ordering is randomised there. With a count of 103,227,021 tokens and a vocabulary size of 267,735, it is about one eighth of the 1B Word Benchmark's size concerning token count and about one third concerning the vocabulary size. Note that there is also the smaller Wikitext-2 corpus (Merity et al., 2016c) available, which is a subset of about 2% of the size of Wikitext-103.

CC-News (Nagel, 2016). This corpus was presented and used by Liu et al. (2019). They used a web crawler proposed by Hamborg et al. (2017) to extract data from the CommonCrawl News data set (Nagel, 2016) and obtained a data set similar to the RealNews data set (Zellers et al., 2019).

1B Word Benchmark (Chelba et al., 2013; https://research.google/pubs/pub41880/). This corpus, actually introduced as a benchmark data set by Chelba et al. (2013), combines multiple data sets from the EMNLP 2011 workshop on Statistical Machine Translation. The authors normalised and tokenized the corpus and performed further pre-processing steps in dropping duplicate sentences as well as discarding words with a count below three. Additionally, they randomised the ordering of the sentences in the corpus. This constitutes a corpus with a vocabulary of 793,471 words and a total word count of 829,250,940 words.

BooksCorpus (Zhu et al., 2015; https://yknzhu.wixsite.com/mbweb). In 2015, Zhu et al. introduced the BooksCorpus, which is heavily used for pre-training language models (cf. Tab. 1). In their work, they used the BooksCorpus in order to train a model for retrieving sentence similarity. Overall, the corpus comprises 984,846,357 words in 74,004,228 sentences obtained from analysing 11,038 books. They report a vocabulary consisting of 1,316,420 unique words, making the corpus lexically more diverse than the 1B Word Benchmark, as it possesses a vocabulary that is 66% larger while having a word count that is only 19% higher.

Stories (Trinh and Le, 2018; https://console.cloud.google.com/storage/browser/commonsense-reasoning/reproduce/stories_corpus). The authors built a specific subset of the CommonCrawl data based on questions from common sense reasoning tasks. They extracted nearly 1M documents, most of which are taken from longer, coherent stories.

WebText (Radford et al., 2019). This pre-training corpus, obtained by creating "a new web scrape which emphasised document quality" (Radford et al., 2019), is not publicly available.

OpenWebText (Gokaslan and Cohen, 2019). As a reaction to Radford et al. (2019) not releasing their pre-training corpus, Gokaslan and Cohen (2019) started an initiative to emulate an open-source version of the WebText corpus.

It becomes obvious that there is a lot of heterogeneity with respect to the observed combinations of availability, quality and corpus size. Thus, we can state that there is some lack of transparency when it comes to the lexical resources used for pre-training. In particular, the missing standardised availability of the BooksCorpus is problematic, as this corpus is heavily used for pre-training.

Corpora | Word-count♥ | Accessibility | Used by
English Wikipedia | ∼2,500M | Fully available | BERT; XLNet; RoBERTa; ALBERT
CommonCrawl | Unclear | Fully available | XLNet
ClueWeb 2012-B, Giga5 | Unclear | Fully available ($$) | XLNet
1B Word Benchmark | ∼830M | Fully available | ELMo
BooksCorpus | ∼985M | Not available | GPT; BERT; XLNet; RoBERTa; ALBERT
Wikitext-103 | ∼103M | Fully available | ULMFiT
CC-News | Unclear | Crawling Manual | RoBERTa
Stories | ∼7,000M♦ | Fully available | RoBERTa
WebText | Unclear | Not available | GPT2
OpenWebText | Unclear | Fully available | RoBERTa

Table 1: Pre-training resources (sorted by date). "Crawling Manual" means the authors did not provide data, but at least a manual for crawling it. Dollar signs signify the necessity of a payment in order to get access. RealNews (Zellers et al., 2019) and C4 (Raffel et al., 2019) are not included as they were not used by the evaluated models. ♥ We report the word-count as given in the respective articles proposing the corpora. Note that the number of tokens reported in other articles depends on the tokenization scheme used by a specific model. ♦ Stated by one of the authors on Twitter: https://twitter.com/thtrieu_/status/1096672446864748545
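The 66% and 19% figures quoted above for the BooksCorpus relative to the 1B Word Benchmark follow directly from the reported counts. The sketch below reproduces this arithmetic and illustrates how such statistics could be computed under a naive whitespace tokenisation (the cited papers each use their own tokenisation schemes, so the helper is only an approximation):

```python
# Reported corpus statistics (cf. Sec. 3.1)
bookscorpus = {"words": 984_846_357, "vocab": 1_316_420}
billion_word = {"words": 829_250_940, "vocab": 793_471}

vocab_increase = bookscorpus["vocab"] / billion_word["vocab"] - 1   # ~0.66 -> "66% larger"
word_increase = bookscorpus["words"] / billion_word["words"] - 1    # ~0.19 -> "19% higher"
print(f"vocabulary: +{vocab_increase:.0%}, word count: +{word_increase:.0%}")

def corpus_stats(lines):
    """Word count and vocabulary size under a naive whitespace tokenisation."""
    vocab, n_words = set(), 0
    for line in lines:
        tokens = line.split()
        n_words += len(tokens)
        vocab.update(tokens)
    return n_words, len(vocab)
```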
3.2 Benchmark data sets for fine-tuning

GLUE (Wang et al., 2018; https://gluebenchmark.com/). The General Language Understanding Evaluation (GLUE) benchmark is a freely available collection of nine data sets on which models can be evaluated. It provides a fixed train-dev-test split with held-out labels for the test set, as well as a leaderboard which displays the top submissions and the current state-of-the-art (SOTA). The relevant metric for the SOTA is an aggregate measure of the nine single task metrics. The benchmark includes two binary classification tasks with single-sentence inputs (CoLA [Warstadt et al., 2018] and SST-2 [Socher et al., 2013]) and five binary classification tasks with inputs that consist of sentence pairs (MRPC [Dolan and Brockett, 2005], QQP [Shankar et al., 2017], QNLI, RTE and WNLI [all Wang et al., 2018]). The remaining two tasks also take sentence pairs as input but have a multi-class classification objective with either three (MNLI [Williams et al., 2017]) or five classes (STS-B [Cer et al., 2017]).
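The leaderboard ranks submissions by an aggregate of the nine task metrics. Below is a deliberately simplified sketch of such a macro-average; the per-task numbers are hypothetical, and the official GLUE score additionally averages multiple metrics within some tasks, which is omitted here.

```python
# Hypothetical per-task scores (one entry per GLUE task, in percent)
task_scores = {
    "CoLA": 60.5, "SST-2": 94.9, "MRPC": 89.3, "QQP": 72.1,
    "STS-B": 86.5, "MNLI": 86.7, "QNLI": 92.7, "RTE": 70.1, "WNLI": 65.1,
}

# Simplified aggregate: unweighted mean over the nine tasks
glue_score = sum(task_scores.values()) / len(task_scores)
print(f"aggregate GLUE score: {glue_score:.1f}")
```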
SuperGLUE (Wang et al., 2019; https://super.gluebenchmark.com/). As a reaction to human baselines being surpassed by the top-ranked models, Wang et al. (2019) proposed a set of benchmark data sets similar to, but, according to the authors, more difficult than GLUE. It did not make sense to include it as a part of our model comparison, as (at the time of writing) only two of the discussed models were evaluated on SuperGLUE.

SQuAD (Rajpurkar et al., 2016, 2018; https://rajpurkar.github.io/SQuAD-explorer/). The Stanford Question Answering Dataset (SQuAD) 1.1 consists of 100,000+ questions explicitly designed to be answerable by reading segments of Wikipedia articles. The task is to correctly locate the segment in the text which contains the answer. A shortcoming is the omission of situations where the question is not answerable by reading the provided article. Rajpurkar et al. (2018) address this problem in SQuAD 2.0 by adding 50,000 hand-crafted unanswerable questions to SQuAD 1.1. The authors provide a train and development set as well as an official leaderboard. The test set is completely held out; participants are required to upload their models to CodaLab. The SQuAD 1.1 data is, in an augmented form (QNLI), also part of GLUE.

RACE (Lai et al., 2017; http://www.qizhexie.com/data/RACE_leaderboard.html). The Large-scale ReAding Comprehension Dataset From Examinations (RACE) contains English exam questions for Chinese students (middle/high school). In most of the articles using RACE for evaluation, it is described as especially challenging due to (i) the length of the passages, (ii) the inclusion of reasoning questions and (iii) the intentionally tricky design of the questions in order to test a human's ability in reading comprehension. The data set can be subdivided into RACE-M (middle school examinations) and RACE-H (high school examinations) and comprises a total of 97,687 questions on 27,933 passages of text.

3.3 Evaluated Models

ULMFiT (Howard and Ruder, 2018). The AWD-LSTMs in this architecture make use of DropConnect (Wan et al., 2013) for better regularisation and apply averaged stochastic gradient descent (ASGD) for optimization (Polyak and Juditsky, 1992). The model consists of an embedding layer followed by three LSTM layers with a softmax classifier on top for pre-training. It is complemented by a task-specific final layer during fine-tuning. The vocabulary size is limited to 30k words as in Johnson and Zhang (2017). ULMFiT was not evaluated on GLUE, but on several other data sets (IMDb [Maas et al., 2011], TREC-6 [Voorhees and Tice, 1999], Yelp-bi, Yelp-full, AG's news, DBpedia [all Zhang et al., 2015]).

ELMo (Peters et al., 2018). ELMo consists of multiple biLSTM layers, from which multiple intermediate-layer representations can be extracted. These representations are used for computing a (task-specific) weighted combination, which is concatenated with external, static word embeddings. During the training of the downstream model, the ELMo embeddings are not updated, only the weights for combining them are. For the GLUE benchmark there are multiple ELMo-based architectures available on the leaderboard. In Tab. 3, we report the best-performing model, an ELMo-based BiLSTM model with Attention (Wang et al., 2018).
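The (task-specific) weighted combination described above can be written down in a few lines. This is a generic sketch following that description (softmax-normalised layer weights plus a scalar), not the authors' implementation:

```python
import numpy as np

def combine_elmo_layers(layer_reps, layer_logits, gamma):
    """Collapse per-layer representations (L, n, d) into one (n, d) matrix.

    layer_logits (length L) and gamma are the only parameters updated while
    training the downstream model; the layer representations stay frozen."""
    w = np.exp(layer_logits - np.max(layer_logits))
    w /= w.sum()                                    # softmax over the L layers
    return gamma * np.tensordot(w, layer_reps, axes=1)

L, n, d = 3, 7, 16                                  # layers, tokens, dimensions
layers = np.random.default_rng(1).normal(size=(L, n, d))
task_embedding = combine_elmo_layers(layers, layer_logits=np.zeros(L), gamma=1.0)
# task_embedding would then be concatenated with static word embeddings
# and fed into the task-specific downstream model.
```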
Fine- ments improve performance of the model and make tuning was, amongst others, performed on the nine it competitive to the performance of XLNet. tasks that together form the GLUE benchmark. ALBERT (Lan et al., 2019) By identifying that BERT (Devlin et al., 2019) BERT can be seen the increase of the model size is a problem, AL- as a reference point for everything that came there- BERT (A Lite BERT) goes into another direc- after. Similar to GPT it uses Byte-Pair Encod- tion compared to most of post-BERT architectures. ing (BPE) with a vocabulary size of 30k. By in- Parameter-reduction techniques are applied in or- troducing the MLM objective, the authors were der to train a faster model with lower memory de- able to combine deep bidirectionality with Self- mands that, at the same time, yields a comparable Compute Resources Model Hardware Training time pfs-days ♥ #parameters lexical ULMFiT NA NA NA 33M 0.18GB GPT 8 GPUs (P600) ∼ 30 days 0.96 117M < 13GB BERT-BASE 4 Cloud TPUs ∼ 4 days 0.96 [2.24] ♦ 110M 13GB BERT-LARGE 16 Cloud TPUs ∼ 4 days 3.84 [8.96] ♦ 340M 13GB GPT2-MEDIUM NA NA NA 345M 40GB GPT2-XLARGE 8 v3 Cloud TPUs ∼ 7 days 7.84 1.500M 40GB XLNet-LARGE 128 v3 Cloud TPUs ∼ 2.5 days 44.8 340M 126GB RoBERTa DGX-1 GPUs (8xV100) ♣ NA ♣ NA 360M 160GB 1024 32GB V100 GPUs ♠ ∼ 1 day ♠ 4.78 360M 16GB ALBERT 64 – 1024 v3 Cloud TPUs NA NA 233M 16GB Table 2: Usage of compute and pre-training resources alongside with model size for the evaluated model archi- tectures. With lexical resources we refer to the size of the pre-training corpus. ELMo not included as it is not end-to-end trainable (Size depends on the used model after obtaining the embeddings). The size of ULMFiT is assumed to be the larger value from Merity et al. (2017), since Howard and Ruder (2018) use AWD-LSTMs with a vocabulary size of 30k tokens (Johnson and Zhang, 2016, 2017). Values for GPT2-XLARGE are taken from Strubell et al. (2019). ♥ Petaflop-days: Estimation according to the formula proposed on https://openai.com/blog/ ai-and-compute/: pfs-days = number of units × PFLOPS/unit × days trained × utilization, with an assumed utilization of 13 . PFLOPS/unit for TPUs from https://cloud.google.com/tpu/. ♦ Unclear, whether v2 or v3 TPUs were used. Thus, we provide calculations for both: v2[v3] ♣ ♠ Full RoBERTa model (Liu et al., 2019) RoBERTa variant utilizing less pre-training resources performance to SOTA models. We will always re- et al., 2019) and also limits portability to smaller fer to the best performing ALBERT-XXLARGE, devices. despite also the smaller ALBERT models yield re- sults comparable to BERT. Further, it is important to consider the differ- ences displayed in the Tab. 2 and Tab. 3 when 4 Model comparison comparing the model performances. Consider- ing two models of approximately the same size Tab. 2 gives an overview on the amount of com- (BERT-BASE vs. GPT), the superior performance putational power needed to pre-train a given archi- of BERT-BASE seems to originate purely from its tecture on given pre-training (lexical) resources. In more elaborated architecture because of the similar Tab. 3 we will directly try to relate model architec- size. But one should also be aware of the larger ture and size as well as usage of lexical resources lexical resources (BERT-BASE uses at least twice to model performance. as much data for pre-training) and the unknown One thing we can learn from Tab. 2 is the lack of de- differences in usage of computational power. 
4 Model comparison

Tab. 2 gives an overview of the amount of computational power needed to pre-train a given architecture on given pre-training (lexical) resources. In Tab. 3 we will directly try to relate model architecture and size as well as usage of lexical resources to model performance.

One thing we can learn from Tab. 2 is the lack of details when it comes to reporting the computational resources used for pre-training. While Howard and Ruder (2018) do not provide any information on the computational power utilised for pre-training, the other articles report it to different degrees. Unfortunately, there are no clear guidelines on how to appraise this when it comes to evaluating and comparing models. This may be attributed to the rapidly growing availability of hardware, but in our opinion it should nevertheless be accounted for, since it might pose environmental issues (Strubell et al., 2019) and also limits portability to smaller devices.

Further, it is important to consider the differences displayed in Tab. 2 and Tab. 3 when comparing the model performances. Considering two models of approximately the same size (BERT-BASE vs. GPT), the superior performance of BERT-BASE seems to originate purely from its more elaborate architecture because of the similar size. But one should also be aware of the larger lexical resources (BERT-BASE uses at least twice as much data for pre-training) and the unknown differences in the usage of computational power. We approximated the latter as the pfs-days (cf. Tab. 2), resulting in an estimate for BERT-BASE that is not lower than the one for GPT.

Another aspect which should not be ignored when evaluating performance is ensembling. As can be seen in the first column of Tab. 3, the three model ensembles outperform both of the BERT models by a large margin. Only parts of these differences may be attributed to the model architecture or the hyperparameter settings, as the ensembling as well as the larger pre-training resources might give an advantage to these models. As there are no performance values of single models available for XLNet, RoBERTa and ALBERT on the official GLUE leaderboard, we also compare the single model performances from Lan et al. (2019) obtained on the dev sets. From this comparison we get an impression of how high the contribution of ensembling might be: The difference between BERT-LARGE and the XLNet ensemble in the official score (7.9 %pts) is more than twice as high as the difference in dev score (3.4 %pts).

Model | GLUE leaderboard | GLUE dev♥ | SQuAD v1.1 (dev) | SQuAD v2.0 (dev) | RACE test | #parameters | lexical
BERT-BASE | 78.3 | – | 88.5 | 76.3♣ | 65.0♠ | 110M | 13GB
ELMo-based | −8.3 | – | −2.9 | – | – | – | –
GPT | −5.5 | – | – | – | −6.0 | 1.1x | <0.5x
BERT-LARGE | +2.2 | 84.05 | +2.4 | +5.6 | +7.0♠ | 3.1x | 1.0x
XLNet-BASE | – | – | – | +5.03 | +1.05 | ∼1.0x | 1.0x
XLNet-LARGE | +10.1♦ | +3.39 | +6.0 | +12.5 | +16.75 | 3.1x | 9.7x
RoBERTa | +10.2♦ | +5.19 | +6.1 | +13.1 | +18.2 | 3.3x | 12.3x
RoBERTa-BASE | – | +2.30 | – | – | – | 1.0x | 12.3x
RoBERTa‡ | – | +3.79 | +5.1 | +11.0 | – | 3.3x | 1.2x†
ALBERT | +11.1♦ | +5.91 | +5.6 | +13.9 | +21.5 | 2.1x | 1.2x†

Table 3: Performance values as well as model size and resource usage (the reference row BERT-BASE is given in absolute values; all other rows as differences relative to it). Performance differences are given in percentage points (%pts), differences in size/resources as factors. ULMFiT and GPT2 are omitted as there are no performance values on these data sets publicly available. No model size for ELMo is provided, since the performance values are from different models (cf. Sec. 3.3). Displayed performance measures are Matthews Correlation (GLUE), F1 score (SQuAD) and Accuracy (RACE). ♥ Own calculations based on Lan et al. (2019), Tab. 13; WNLI is excluded. ♦ Ensemble performance. ♣ Values taken from Yang et al. (2019), Tab. 6. ♠ Values taken from Zhang et al. (2019), Tab. 2. † Liu et al. (2019) and Lan et al. (2019) specify the BooksCorpus + English Wikipedia as 16GB. ‡ This variant of RoBERTa uses only BooksCorpus + English Wikipedia for pre-training.
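Tab. 3 reports performance as percentage-point differences and size/resource usage as factors relative to the BERT-BASE reference row. A minimal sketch of this bookkeeping, using the BERT-LARGE row as an example (the absolute SQuAD v1.1 value of 90.9 is only implied by the table, not reported there):

```python
def as_delta(score, reference):
    """Performance difference in percentage points (%pts)."""
    return round(score - reference, 2)

def as_factor(value, reference):
    """Size or resource usage expressed as a multiple of the reference."""
    return round(value / reference, 1)

# BERT-BASE serves as the reference (cf. Tab. 2 and Tab. 3)
print(as_delta(90.9, 88.5))       # SQuAD v1.1 F1: +2.4 %pts
print(as_factor(340e6, 110e6))    # #parameters:   3.1x
print(as_factor(13, 13))          # lexical:       1.0x
```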
In order to address the differences in the size of the pre-training resources, Yang et al. (2019) make the extremely insightful effort to compare an XLNet-BASE variant to BERT-BASE using the same pre-training resources. While the F1 score on SQuAD v2.0 is still remarkably higher than for BERT-BASE (comparable to BERT-LARGE), it does not show a large improvement on RACE (which might have been expected due to the large improvement of XLNet-LARGE over both BERT models).

The comparability of RoBERTa from the GLUE leaderboard (ensemble + larger pre-training resources) to BERT-LARGE is limited, but the authors perform several experiments in order to show the usefulness of their optimisations. When pre-training a single model on comparable lexical resources (13GB for BERT vs. 16GB for RoBERTa), the RoBERTa model shows a smaller (compared to the RoBERTa ensemble), but still remarkable, improvement over BERT-LARGE. In another ablation study, Liu et al. (2019) train a RoBERTa-BASE variant on larger pre-training resources. Even though it comprises only about one third of the size of BERT-LARGE, the larger pre-training corpus in conjunction with the optimised training leads to a slightly better performance on the GLUE dev set. We are not able to compare RoBERTa-BASE to BERT-BASE, as neither the "official" leaderboard score for RoBERTa-BASE nor the "unofficial" dev set score for BERT-BASE are available.

In order to set the results of ULMFiT into context, we present the results published by Yang et al. (2019) alongside information on size and pre-training resources in Tab. 4. Despite being much larger and pre-training on corpora that are some orders of magnitude larger, BERT-LARGE and XLNet-LARGE do not exhibit that large improvements over the performance of ULMFiT. This might partly originate from the relative simplicity of the tasks, but partly also from the already achieved high performances.

Model | IMDb | Yelp-bi | Yelp-full | AG's news | DBpedia | size | lexical
ULMFiT | 95.40 | 97.84 | 70.02 | 94.99 | 99.20 | 33M | 0.18GB
BERT-LARGE | +0.09 | +0.27 | +0.66 | – | +0.16 | 10.3x | 72.2x
XLNet-LARGE | +0.81 | +0.61 | +2.28 | +0.52 | +0.18 | 10.3x | 222.2x

Table 4: Performance comparison (+ model size and resource usage) on the benchmark data sets used by Howard and Ruder (2018): sentiment classification (IMDb, Yelp-bi, Yelp-full) and topic classification (AG's news, DBpedia). Specification of the differences and highlighting as in Tab. 3. We report accuracies, as opposed to Howard and Ruder (2018); Yang et al. (2019), in order to facilitate a similar interpretation compared to Tab. 3.
5 Discussion

This section reflects the main takeaways from the above comparisons and raises some issues for research practices. We do not claim to have a solution to these potentially problematic aspects, but rather think that these points are highly debatable.

Why no benchmark corpus for pre-training? It is good practice to use benchmark data sets for comparing the performance of pre-trained language models on different types of natural language understanding (NLU) tasks. Many recently published articles (Liu et al., 2019; Yang et al., 2019; Lan et al., 2019) perform (partly extensive) ablation studies controlling for pre-training resources in order to make (versions of) their models comparable to BERT, which is really important as it helps to get an intuition for the impact of pre-training resources. Nevertheless, this is unfortunately not perfect due to two critical issues: (i) BERT and all of its successors make use of the BooksCorpus (Zhu et al., 2015), which is not publicly available, and (ii) this only leads to model comparisons in a low pre-training resource environment (compared to more recent models) and yields no insight on the behaviour of the reference model (e.g. BERT) in a medium or high resource context. We therefore view statements of the type "Model architecture A is superior to model architecture B on performing task X." somewhat critically and propose to phrase them more like the following statement: "Model architecture A is superior to model architecture B on performing task X, when pre-trained on a small/medium/large corpus of low/high quality data from domain Y for pre-training time Z."

Why no standardised description of (computational) resources? When writing this article, it turned out to be difficult to get one unified measure for the amount of computational power used for pre-training. In our opinion, this is not carelessness of the authors but rather the lack of a clear reporting standard. We found ourselves confronted with the following situations:

a) No information at all (Radford et al., 2019)
b) Hardware (Liu et al., 2019; Lan et al., 2019)
c) Hardware and training time (Devlin et al., 2019; Yang et al., 2019)
d) Standardised measure (Radford, 2018)

While a) is clearly unsatisfactory and should be avoided, b) and c) provide most of the necessary information but miss out on taking the final step to d), where the reporting reaches universal comparability across different articles. The measure we computed (cf. Tab. 2) is of course not as exact as a computation based on the counts of operations in a network, but it requires no deep insight into the model architecture and is thus applicable to a wide range of architectures without much effort.

Shouldn't performance be evaluated in relation to size and resource usage? As larger models have a higher capacity for learning representations and using larger pre-training resources should improve their quality, varying these two components simultaneously with the model architecture might lead to interference between the individual effects on model performance. This aspect has a slight overlap with the question raised above, but while the above is more or less about introducing some reference, this is about carefully varying and evaluating the effects of different model parts.

6 Conclusion

As can be seen from the above analysis, there is a lack of a concise guideline for fair comparisons of large pre-trained language models. It is not sufficient to just rank models by their performance on the common benchmark data sets, as this does not take into account all the other factors mentioned in this analysis. Further aspects worth reporting are the resources (time and compute) spent on model development (including all experimental runs and trials) and hyperparameter tuning during pre-training. In our opinion, this is important with respect to two facets: On the one hand, it is important to take into account environmental considerations when training deep learning models (Strubell et al., 2019); on the other hand, it is also a signal to the reader/user of how difficult it is to train (and to fine-tune) the model. This might have implications for the usage of a model as a transfer learning model for diverse downstream tasks. Models that have already been tuned to a high degree during pre-training to reach a certain level of performance may have, in the long run, less potential for further improvements compared to models which do so without much hyperparameter tuning.

To conclude, we unfortunately cannot determine which one of the influential factors (architecture or amount of pre-training resources) is more important, but we think that a substantial amount of the recent improvements can be attributed to larger pre-training resources. A detailed disentanglement of the influence of the different components remains an open research question which might be answerable by carefully designed benchmark studies.

Acknowledgments

We would like to thank the three anonymous reviewers for their insightful comments and their feedback on our work.
References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.

Jamie Callan, Mark Hoy, Changkuk Yoo, and Le Zhao. 2009. ClueWeb09 data set.

Daniel Cer, Mona Diab, Eneko Agirre, Inigo Lopez-Gazpio, and Lucia Specia. 2017. SemEval-2017 task 1: Semantic textual similarity-multilingual and cross-lingual focused evaluation. arXiv preprint arXiv:1708.00055.

Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. 2013. One billion word benchmark for measuring progress in statistical language modeling. arXiv preprint arXiv:1312.3005.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

William B Dolan and Chris Brockett. 2005. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing (IWP2005).

Aaron Gokaslan and Vanya Cohen. 2019. OpenWebText corpus.

Felix Hamborg, Norman Meuschke, Corinna Breitinger, and Bela Gipp. 2017. News-please: a generic news crawler and extractor. In 15th International Symposium of Information Science (ISI 2017), pages 218–223.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. arXiv preprint arXiv:1801.06146.

Rie Johnson and Tong Zhang. 2016. Convolutional neural networks for text categorization: Shallow word-level vs. deep character-level. arXiv preprint arXiv:1609.00718.

Rie Johnson and Tong Zhang. 2017. Deep pyramid convolutional neural networks for text categorization. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 562–570.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105.

Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. 2017. RACE: Large-scale reading comprehension dataset from examinations. arXiv preprint arXiv:1704.04683.

Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2019. ALBERT: A lite BERT for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942.
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Andrew L Maas, Raymond E Daly, Peter T Pham, Dan Huang, Andrew Y Ng, and Christopher Potts. 2011. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, pages 142–150. Association for Computational Linguistics.

Stephen Merity, Nitish Shirish Keskar, and Richard Socher. 2017. Regularizing and optimizing LSTM language models. arXiv preprint arXiv:1708.02182.

Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2016a. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843.

Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2016b. Wikitext-103. Accessed: 2020-02-10.

Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2016c. Wikitext-2. Accessed: 2020-02-10.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Sebastian Nagel. 2016. CC-News.

Robert Parker, David Graff, Junbo Kong, Ke Chen, and Kazuaki Maeda. 2011. English Gigaword fifth edition, June. Linguistic Data Consortium, LDC2011T07, 12.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. arXiv preprint arXiv:1802.05365.

Boris T Polyak and Anatoli B Juditsky. 1992. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838–855.

Alec Radford. 2018. Improving language understanding with unsupervised learning. Accessed: 2020-02-10.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. URL https://s3-us-west-2.amazonaws.com/openai-assets/researchcovers/languageunsupervised/language_understanding_paper.pdf.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog, 1(8).

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683.

Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don't know: Unanswerable questions for SQuAD. arXiv preprint arXiv:1806.03822.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250.

Sebastian Ruder, Matthew E. Peters, Swabha Swayamdipta, and Thomas Wolf. 2019. Transfer learning in natural language processing. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Tutorials, pages 15–18, Minneapolis, Minnesota. Association for Computational Linguistics.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909.

Iyer Shankar, Dandekar Nikhil, and Csernai Kornél. 2017. First Quora dataset release: Question pairs. Accessed: 2020-02-10.

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642.

Emma Strubell, Ananya Ganesh, and Andrew McCallum. 2019. Energy and policy considerations for deep learning in NLP. arXiv preprint arXiv:1906.02243.
Trieu H Trinh and Quoc V Le. 2018. A simple method for commonsense reasoning. arXiv preprint arXiv:1806.02847.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Ellen M Voorhees and Dawn M Tice. 1999. The TREC-8 question answering track evaluation. In TREC, volume 1999, page 82. Citeseer.

Li Wan, Matthew Zeiler, Sixin Zhang, Yann Le Cun, and Rob Fergus. 2013. Regularization of neural networks using DropConnect. In International Conference on Machine Learning, pages 1058–1066.

Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. 2019. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. arXiv preprint arXiv:1905.00537.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355, Brussels, Belgium. Association for Computational Linguistics.

Alex Warstadt, Amanpreet Singh, and Samuel R Bowman. 2018. Neural network acceptability judgments. arXiv preprint arXiv:1805.12471.

Adina Williams, Nikita Nangia, and Samuel R Bowman. 2017. A broad-coverage challenge corpus for sentence understanding through inference. arXiv preprint arXiv:1704.05426.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237.

Rowan Zellers, Ari Holtzman, Hannah Rashkin, Yonatan Bisk, Ali Farhadi, Franziska Roesner, and Yejin Choi. 2019. Defending against neural fake news. arXiv preprint arXiv:1905.12616.

Shuailiang Zhang, Hai Zhao, Yuwei Wu, Zhuosheng Zhang, Xi Zhou, and Xiang Zhou. 2019. Dual co-matching network for multi-choice reading comprehension. arXiv preprint arXiv:1901.09381.

Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems, pages 649–657.

Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE International Conference on Computer Vision, pages 19–27.