Investigating Embeddings for Sentiment
                     Analysis in Italian*

           Giuseppe Gambino1 and Roberto Pirrone1[0000−0001−9453−510X]

    Dipartimento di Ingegneria, Università degli Studi di Palermo, Viale delle Scienze,
                            Edificio 6, 90128, Palermo, Italy
                       giuseppe.gambino09@community.unipa.it
                              roberto.pirrone@unipa.it
                      http://www.unipa.it/dipartimenti/ingegneria


         Abstract. This paper compares the performance of contextualized
         and context-free embeddings for sentiment analysis tasks in Italian.
         The selected scenario is a pre-analysis stage in which the gross
         architectural parameters of the pipeline have to be devised, while
         only small data sets can be used for training the model and
         experiments have to be performed with reduced computational power.
         Two pipelines have been set up to this aim. The first one makes use
         of GloVe, trained on the same domain as the task at hand, followed
         by a deep neural architecture for classification. The second one
         uses a pre-trained BERT for the Italian language to perform the
         whole task. The result of our study is that a context-free embedding
         trained on the task domain outperforms the generic contextualized
         one. The models are reported in detail, along with the experiments
         on both the SENTIPOLC 2016 data set and a collection of about 100K
         TripAdvisor reviews.

         Keywords: Sentiment analysis · Text classification · Contextualized
         word embeddings · Very Deep Convolutional Networks · BERT · GloVe


1      Introduction

    The last few years have been a turning point for the development of machine
learning models, and of deep learning models in particular. The now widespread
practice of sharing online not only papers but also their implementations has
turned research into a sort of worldwide competition open to all researchers,
with the common interest of obtaining better results. One of the areas most
involved in this phenomenon is Natural Language Processing (NLP), where new
studies appear from day to day. This continuous increase in performance is
mainly due to the findings in the field of Distributional Semantics, and in
particular to the introduction of contextualized embeddings.
*
    Supported by PON “Ricerca e Innovazione” 2014-2020, Project “IDHEA” -
    Innovation for Data Elaboration in Heritage Areas


Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License
Attribution 4.0 International (CC BY 4.0).

    Despite the high potential of contextualized embeddings for representing
words, these approaches attain their maximum performance only if they are
trained purposely for the task at hand, and very large data sets are needed to
devise good representations. In this work we present a preliminary study aimed
at devising the best word embedding for an NLP task in a pre-analysis scenario.
In such a scenario the experiments are constrained by reduced computational
power, because access to large computing resources is very limited in general,
and yet one wants to devise an implementation fairly close to the final one.
The data sets are also reduced versions of the true data, and they are used to
perform a rough tuning of the hyperparameters with the aim of reducing the
dimensions of the search space. For the same reasons, one prefers to use
transfer learning with a pre-trained embedding instead of re-training it from
scratch.
    To this aim we developed two systems for sentiment polarity classification
in Italian to compare the performance of a generic (pre-trained) contextualized
embedding against a context-free embedding trained purposely for the task at
hand. Due to the limited availability of pre-trained embeddings for the Italian
language, we selected BERT [6] and GloVe [10], respectively, for our
comparative analysis. We used both the data set from the SENTIPOLC 2016
competition [1] and a collection of about 100K Italian reviews from TripAdvisor
that we built purposely for this research. GloVe embeddings were classified
using Very Deep Convolutional Neural Networks [13] in the SENTIPOLC tasks,
while Gated Recurrent Units [5] were used in the case of TripAdvisor reviews.
We present the implementation details of all the architectures, and compare the
performance of the embeddings. The result of our study is that a suitably
trained context-free embedding performs better than a pre-trained contextual
one, while consuming fewer computational resources.
    The rest of the paper is arranged as follows. Section 2 reports a brief overview
of some of the most recent embedding techniques. In section 3 the structure of
the different data sets is addressed along with the detail of the architectures
we used. Results are discussed in section 4, and some conclusions are drawn in
section 5.


2   Word Embeddings

Both word and character embeddings are a key component of any deep learning
architecture designed for an NLP task, and their performance has increased
markedly in the last couple of years since the introduction of contextualized
embeddings. The field of distributional semantics starts with the work by
Elman [7], while Bengio et al. [4] formulate one of the very first neural
language models, which learns embeddings.
    Word embeddings enter the NLP arena with Word2Vec [9], where both the
Continuous-Bag-Of-Words (CBOW) and Skip-gram models are introduced to obtain a
dual representation of the words in a text or sentence. Vectors in Word2Vec
represent both target words, given a context provided by a suitable window on
both sides of the word itself, and context words given the target one. Although
often considered another version of Word2Vec, GloVe [10] has a more rigorous
approach, which takes advantage of global co-occurrence statistics instead of
only local information to create word vectors that capture meaning in vector
space.
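
For reference, the weighted least-squares objective minimized by GloVe, as
defined in [10], can be written as follows. Here X_{ij} is the number of
co-occurrences of words i and j, w_i and \tilde{w}_j are the word and context
vectors, b_i and \tilde{b}_j are bias terms, and V is the vocabulary size. The
formula is given only as background and does not describe the specific settings
of the experiments.

```latex
J = \sum_{i,j=1}^{V} f(X_{ij})
    \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2,
\qquad
f(x) =
\begin{cases}
  (x / x_{\max})^{\alpha} & \text{if } x < x_{\max} \\
  1 & \text{otherwise}
\end{cases}
```

The weighting function f limits the influence of very frequent co-occurrences;
the original GloVe paper [10] reports x_max = 100 and alpha = 3/4 as values
that worked well.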
    Both Word2Vec and GloVe provide word representations that do not consider
the true context of the word, apart from the window surrounding the word
itself. Contextualized word embeddings fill this gap. Word representations in
ELMo [11] are functions of the entire input sentence. ELMo uses a
bi-directional Recurrent Neural Network (RNN) with Long Short Term Memory
(LSTM) units, which is trained on a specific task to create the embeddings. The
key idea behind ELMo is a language model trained to predict a word in a
sequence of words of variable length. The use of a bi-directional LSTM allows
learning a very good representation for each word because all the surrounding
text is considered.
    The Universal Language Model Fine-tuning for Text Classification [8], or
ULMFiT, bases its implementation on the concept of transfer learning. As a
consequence, ULMFiT reuses much of what the model learns during pre-training,
more than the other embeddings do. This method achieves excellent results on
various text classification tasks. In detail, ULMFiT undergoes three phases in
its training: a) general-domain language model pre-training, where an LSTM
based architecture is trained on the Wikitext-103 corpus, b) target task
language model fine-tuning, where each layer of the model is tuned with a
different learning rate, and c) target task classifier fine-tuning, where
gradual unfreezing is used: the last layer is unfrozen first and fine-tuned for
one epoch while the others are kept frozen, then the next lower layer is
unfrozen as well, and the process is repeated until all layers are fine-tuned
to convergence.
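
The gradual unfreezing strategy described above can be sketched as a simple
schedule. This is only an illustration of the idea in [8], not the ULMFiT
implementation: the function name and the layer names are hypothetical, and
one training epoch per stage is assumed.

```python
def gradual_unfreezing_schedule(layer_names):
    """Yield, for each stage, the list of layers that are trainable.

    Layers are unfrozen one at a time, starting from the last (top) layer;
    all the layers unfrozen so far are fine-tuned together at each stage.
    """
    for stage in range(1, len(layer_names) + 1):
        yield layer_names[-stage:]


# Hypothetical layer names, used only for illustration.
layers = ["embedding", "lstm_1", "lstm_2", "lstm_3", "classifier"]
schedule = list(gradual_unfreezing_schedule(layers))

for epoch, trainable in enumerate(schedule, start=1):
    print(f"epoch {epoch}: trainable layers = {trainable}")
```

In the first stage only the classifier is trained, and in the last stage the
whole network is fine-tuned.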
    BERT [6] stands for Bidirectional Encoder Representations from
Transformers, and it is the result of combining the concept of bidirectionality
introduced by ELMo with the very good results obtained in machine translation
by the Transformer architecture of Vaswani et al. [14]. BERT is categorized as
an autoencoder (AE) language model, that is, a model that aims to reconstruct
the original data from corrupted input. BERT overcomes the directionality
constraint of the previous models using a masked language model that randomly
masks some of the tokens of the input, and subsequently predicts these tokens
based only on their context. BERT performs well in a large number of NLP tasks
also thanks to its “next sentence prediction” objective, which yields excellent
results in natural language inference and paraphrasing, tasks that are based on
the prediction of relationships between sentences. Various pre-trained BERT
models are available for different languages. We used the Italian version for
our experiments.
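
The masking step of the masked language model can be illustrated with the
following minimal sketch. It is a simplification written for this
explanation: tokens are plain whitespace-separated words rather than WordPiece
units, every selected token is replaced by [MASK] (the original BERT procedure
[6] replaces only 80% of the selected tokens with [MASK], substitutes a random
token in 10% of the cases, and leaves the token unchanged in the remaining
10%), and the function name and parameters are hypothetical.

```python
import random


def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Return (masked_tokens, targets) for a simplified masked language model.

    targets maps each masked position to the original token that the model
    would have to predict from the surrounding context.
    """
    rng = random.Random(seed)
    masked, targets = [], {}
    for position, token in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(mask_token)
            targets[position] = token
        else:
            masked.append(token)
    return masked, targets


sentence = "il museo è molto bello e merita una visita".split()
masked_sentence, targets = mask_tokens(sentence, mask_prob=0.3, seed=1)
print(masked_sentence)
print(targets)
```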
    XLNet [15] is a generalized autoregressive (AR) pre-trained model, which
became famous because it outperforms BERT in 20 NLP tasks. The idea behind
XLNet is not far from that of BERT. An AR language model uses the context words
to predict the next word. This type of model is not bidirectional because it
cannot use both forward and backward context at the same time. To solve this
problem, XLNet adopts a permutation language modeling objective in the
pre-training phase: by considering permutations of the factorization order of
the tokens in the string, each token gathers information from all the other
ones on both sides. In this way XLNet remedies the problem of BERT, which
assumes that masked tokens are independent of each other.
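
As background, the permutation language modeling objective of [15] can be
stated as follows, where Z_T is the set of all permutations of the index
sequence [1, ..., T], z_t is the t-th element of a permutation z, and z_{<t}
denotes its first t-1 elements:

```latex
\max_{\theta} \;
\mathbb{E}_{\mathbf{z} \sim \mathcal{Z}_T}
\left[ \sum_{t=1}^{T}
  \log p_{\theta}\!\left( x_{z_t} \mid \mathbf{x}_{\mathbf{z}_{<t}} \right)
\right]
```

Since the expectation ranges over all factorization orders, each token is in
expectation predicted from context on both of its sides, while the model
remains autoregressive.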
    AlBERTo [12] (Italian BERT Language Understanding Model for NLP Challenging
Tasks Based on Tweets) is the latest embedding model available for the Italian
language. Its creators trained a BERT model for the Italian language; in
particular, AlBERTo is pre-trained only on Italian tweets to reach the best
performance on tasks that concern the Italian language used in social networks.
This model obtains state-of-the-art results in the EVALITA 2016 SENTIPOLC
(SENTIment POLarity Classification) task. Unfortunately, the AlBERTo embedding
was not yet available for the experiments at the time of writing the present
paper. It is nonetheless a very powerful ready-to-use tool for the Italian
language.

3     The Proposed Architectures
As already mentioned above, we built two systems for sentiment polarity
classification in both tweets from the SENTIPOLC 2016 competition and reviews
from a purposely built collection of TripAdvisor pages. We addressed both
SENTIPOLC Tasks 1 and 2, that is, subjectivity and polarity classification,
respectively. The goal of Task 1 is to identify the subjectivity of a tweet
using one label, which is 0 for an objective tweet and 1 for a subjective one.
The goal of Task 2 is to identify the polarity of a tweet; this is a case of
multi-label classification. Indeed, there are two labels: one states whether a
tweet is positive or not, and the other one states whether it is negative or
not. In this manner a tweet can be classified as positive, negative, both
positive and negative, or without polarity.
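
The two-label encoding of Task 2 can be made explicit with a small sketch. The
function name is hypothetical and is used only to illustrate how the pair of
binary labels maps to the four polarity classes listed above.

```python
def polarity_class(positive, negative):
    """Map the two binary Task 2 labels to one of the four polarity classes."""
    classes = {
        (1, 0): "positive",
        (0, 1): "negative",
        (1, 1): "both positive and negative",
        (0, 0): "without polarity",
    }
    return classes[(positive, negative)]


for labels in [(1, 0), (0, 1), (1, 1), (0, 0)]:
    print(labels, "->", polarity_class(*labels))
```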
     We performed just polarity classification for the TripAdvisor reviews,
which were split in two classes using the number of bubbles of each review. In
particular, we labeled the 5-bubble reviews as positive and the 1-bubble
reviews as negative.
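
A minimal sketch of this labeling rule is given below. The function name and
the sample reviews are hypothetical, and discarding the reviews with 2 to 4
bubbles is an assumption made here for illustration.

```python
def label_from_bubbles(bubbles):
    """Return 1 (positive) for 5 bubbles, 0 (negative) for 1 bubble.

    Reviews with an intermediate rating get None and are discarded below
    (an assumption of this sketch).
    """
    if bubbles == 5:
        return 1
    if bubbles == 1:
        return 0
    return None


# Hypothetical scraped reviews: (text, number of bubbles).
reviews = [
    ("Visita indimenticabile", 5),
    ("Esperienza deludente", 1),
    ("Niente di speciale", 3),
]
labeled = [
    (text, label_from_bubbles(bubbles))
    for text, bubbles in reviews
    if label_from_bubbles(bubbles) is not None
]
print(labeled)
```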
     Although we addressed the same task in both data sets, the language used
in these social media is very different. Tweets are very short and often use
informal language, while both emoticons and hashtags convey sentiment
information. On the other hand, TripAdvisor reviews can be very long, and are
grammatically more correct than tweets. They are very close to the plain text
that can be retrieved in general purpose text corpora such as Wikipedia pages.
As a consequence we trained the GloVe embedding differently for the two tasks,
and used a different deep neural architecture for classification. In what
follows we report in detail the features of each data set, and describe the two
architectures.

3.1   Data Sets
SENTIPOLC 2016. Table 1 reports the main features of the SENTIPOLC 2016
competition data set, which was given to the participants. As already pointed
out, tweets are very short: their average length is 15 words in the training
set and 14 in the test set. The other characterizing feature is the presence of
many political tweets, which led us to train the GloVe embedding on a data set
rich in political tweets.


                         Table 1. SENTIPOLC 2016 data sets


Data set   # Tweets (global)   # Political tweets   # Unique words   Av. length (words)
training         7410                 4279               28949               15
test             1998                 1498                9771               14


TripAdvisor Dataset. The TripAdvisor Dataset (see footnote 1) is oriented to
cultural heritage, and it has been built by scraping the reviews together with
their labels. In this way we easily obtained a labeled Italian data set. We
considered only sites of cultural interest, excluding restaurants and hotels.
Table 2 reports the main features of this corpus. We scraped 100K reviews,
which were used to pre-train the GloVe model. The training/validation set for
the classifier is made of 10K reviews equally split into positive and negative
ones. The average length of the reviews is 97 words.

                          Table 2. The TripAdvisor Dataset

    Data set              # Reviews   # Positive   # Unique words   Av. length (words)
    TripAdvisor Dataset     10000        5000          23131               97


3.2     Embeddings and Classification

Texts have been pre-processed in the following way. The ekphrasis Python
library [3] was used to normalize emoticons, URL strings, email addresses,
Twitter user names, dates and numbers. Moreover, we made all the text
lowercase, and removed all the unnecessary white-spaces and symbols. As regards
hashtags, we decided not to remove the number sign when training the GloVe
embedding, so that each hashtag is treated as a new word and the embedding
retains the information surrounding the hashtag rather than the associated
word. A hashtag can convey sentiment information very different from the
word(s) generating it. In general, a hashtag is surrounded by many other ones,
which could correspond to meaningless sentences if considered as a sequence of
plain words. Moreover, a hashtag can be made of many concatenated words (e.g.
#iostoconsaviano). Again, the sentence deriving from the segmentation of the
composing words does not convey the same sentiment information as the hashtag
as a whole.

1
    The TripAdvisor Dataset and the script to generate it are available at the
    following link:
    https://github.com/giuseppegambino/Scraping-TripAdvisor-with-Selenium-2019
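
The kind of normalization described above can be sketched with plain regular
expressions. This is a simplified stand-in and not the ekphrasis pipeline used
in the experiments: the patterns, the placeholder tags and the function name
are assumptions made for illustration, and emoticons and dates are not
handled.

```python
import re

# Simplified patterns (assumptions of this sketch, not the ekphrasis rules).
URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")
USER_RE = re.compile(r"@\w+")
NUMBER_RE = re.compile(r"\b\d+(?:[.,]\d+)?\b")
SPACE_RE = re.compile(r"\s+")


def normalize(text):
    """Lowercase the text and replace URLs, emails, users and numbers by tags.

    Hashtags are left untouched, number sign included, so that each hashtag
    is treated as a word of its own.
    """
    text = text.lower()
    text = URL_RE.sub("<url>", text)
    text = EMAIL_RE.sub("<email>", text)  # before USER_RE, which also uses "@"
    text = USER_RE.sub("<user>", text)
    text = NUMBER_RE.sub("<number>", text)
    return SPACE_RE.sub(" ", text).strip()


print(normalize("Grande @Saviano!  #iostoconsaviano https://example.com 2016"))
# prints: grande <user>! #iostoconsaviano <url> <number>
```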

BERT Embedding. The first architecture uses the uncased multilingual version of
BERT, which made it possible to perform the task on Italian data sets. The
network makes use of a BERT layer with three input layers composed as follows:
    – input ids, which are the integer indices of the tokens in the BERT
      vocabulary
    – segment ids, which help BERT distinguish between paired input sequences
    – input masks, which let BERT distinguish the actual tokens of the input
      from the padding added to reach the fixed input size.
Finally, the last dense layer performs classification using the sigmoid
activation function and the binary cross-entropy loss function. There are no
implementation differences between the SENTIPOLC and TripAdvisor applications,
except for the size of the input vectors, which have 23 elements in the
SENTIPOLC tasks and 300 elements in the TripAdvisor task to accommodate the
longest reviews. Otherwise the implementation is the one recommended by the
authors of BERT.
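
The three inputs listed above can be illustrated for a single sentence with
the following sketch. It is not the code used in the experiments: the toy
vocabulary, the whitespace tokens and the function name are hypothetical,
whereas the actual BERT model relies on its own WordPiece tokenizer and
vocabulary. The padding id 0 is also an assumption of the toy vocabulary.

```python
def build_bert_inputs(tokens, vocab, max_len):
    """Build input ids, input mask and segment ids for a single sentence."""
    # Truncate so that [CLS] and [SEP] still fit into max_len positions.
    tokens = ["[CLS]"] + tokens[: max_len - 2] + ["[SEP]"]
    input_ids = [vocab.get(token, vocab["[UNK]"]) for token in tokens]
    input_mask = [1] * len(input_ids)           # 1 marks an actual token
    padding = [0] * (max_len - len(input_ids))
    input_ids = input_ids + padding             # [PAD] has id 0 in toy_vocab
    input_mask = input_mask + padding           # 0 marks a padding position
    segment_ids = [0] * max_len                 # one sentence: segment 0 only
    return input_ids, input_mask, segment_ids


# Hypothetical toy vocabulary, used only for illustration.
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "bel": 4, "museo": 5}
ids, mask, segments = build_bert_inputs(
    ["bel", "museo", "davvero"], toy_vocab, max_len=23
)
print(ids)
print(mask)
```

With max_len=23, as in the SENTIPOLC tasks, the three tokens plus the two
special symbols occupy the first five positions and the remaining ones are
padding.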

GloVe Embedding. The GloVe embedding was trained on 230K Italian tweets, whose
topics were both generic and political. Since GloVe is unidirectional, we
applied a data augmentation technique during the different trials with the
SENTIPOLC data set: we added all the tweets of the data set in reverse order to
the original ones, so as to simulate a sort of bidirectionality. With this
technique we observed improvements in accuracy and a reduction of overfitting.
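
A possible reading of this augmentation is sketched below, under the
assumption that "reverse order" refers to the order of the words within each
tweet and that a reversed tweet keeps the label of the original one. The
function name and the sample data are hypothetical.

```python
def augment_with_reversed(tweets, labels):
    """Append a word-reversed copy of every tweet, keeping its label."""
    reversed_tweets = [" ".join(reversed(tweet.split())) for tweet in tweets]
    return tweets + reversed_tweets, labels + labels


tweets = ["che bella giornata", "governo in crisi"]
labels = [1, 0]
augmented_tweets, augmented_labels = augment_with_reversed(tweets, labels)
print(augmented_tweets)
print(augmented_labels)
```

The augmented set is twice as large as the original one.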
    The Very Deep Convolutional Neural Network (VDCNN) [13] is an architecture
that uses only small convolutions and pooling operations; owing to the nature
of convolutional layers, it works well with short texts such as tweets. It was
used for the SENTIPOLC 2016 tasks. The term “very deep” derives from the high
number of convolutional layers, and its authors show that the performance of
this network increases with depth. We fine-tuned the hyper-parameters of this
network. Due to training time constraints, we built a 9-layer VDCNN. The last
three layers are dense, and we fixed the dropout between them to 0.3. The other
parameters are 32 samples per batch, and the number of filters (64, 128, 256,
512). The implementation for the two tasks is exactly the same, except for the
last classification layer, which uses a sigmoid unit for Task 1, a binary
classification, and softmax for Task 2, a multi-label classification.
    The GRU recurrent neural network [5] was used only for the TripAdvisor
Dataset. RNNs are best suited for long texts, where dependencies between
distant words or even between different sentences may convey useful information
for the task. In particular, we found that Gated Recurrent Units performed
better than LSTM cells in our task. Fine-tuning of the hyper-parameters
provided the following values: 32 units, dropout and recurrent dropout fixed to
0.2, 128 samples per batch, and 30 training epochs.
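
For reference, the update equations of a Gated Recurrent Unit can be written as
follows, in the formulation of [5] with bias terms omitted. Here x_t is the
input at time t, h_t is the hidden state, z_t and r_t are the update and reset
gates, \sigma is the logistic sigmoid, and \odot is the element-wise product;
the weight matrices W and U are learned.

```latex
\begin{aligned}
z_t &= \sigma\left( W_z x_t + U_z h_{t-1} \right) \\
r_t &= \sigma\left( W_r x_t + U_r h_{t-1} \right) \\
\tilde{h}_t &= \tanh\left( W x_t + U \left( r_t \odot h_{t-1} \right) \right) \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
\end{aligned}
```

With 32 units, as in the configuration above, h_t is a vector of 32 elements.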


4   Results and Discussion

Tables 3 and 4 report the performance of our architectures together with the
top 5 official results for Task 1 and Task 2, respectively. In particular, the
columns of both tables report the F1 score for each label, either
objective/subjective or positive/negative, and the average F1 score, which is
the overall performance measure showing how well the models distinguish the
classes. The models are sorted by their average F1 score.
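
As a worked example of the measure, the sketch below computes an F1 score from
raw counts and the average of two per-label scores. The function names are
hypothetical; the per-label values are those reported for our two
architectures in Table 3, while the raw counts in the last line are made up
for illustration.

```python
def f1_score(tp, fp, fn):
    """F1 score from true positives, false positives and false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def average_f1(scores):
    """Unweighted mean of the per-label F1 scores."""
    return sum(scores) / len(scores)


# Per-label F1 scores (Obj, Subj) of our architectures in Task 1.
glove_task1 = average_f1([0.6512, 0.8248])
bert_task1 = average_f1([0.6224, 0.8100])
print(f"GloVe-VDCNN: {glove_task1:.4f}")  # prints: GloVe-VDCNN: 0.7380
print(f"BERT: {bert_task1:.4f}")          # prints: BERT: 0.7162

# Made-up counts: precision and recall are both 0.8, hence F1 = 0.8.
print(f"{f1_score(tp=8, fp=2, fn=2):.2f}")  # prints: 0.80
```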


Table 3. The Top 5 F1-scores for SENTIPOLC 2016 Task 1 compared with our
architectures

                    System         Obj    Subj   F1
                    Unitor.1.u    0.6784 0.8105 0.7444
                    GloVe-VDCNN   0.6512 0.8248 0.7380
                    Unitor.2.u    0.6723 0.7979 0.7351
                    samskara.1.c  0.6555 0.7814 0.7184
                    BERT          0.6224 0.8100 0.7162
                    ItaliaNLP.2.c 0.6733 0.7535 0.7134
                    IRADABE.2.c   0.6671 0.7539 0.7105


Table 4. The Top 5 F1-scores for SENTIPOLC 2016 Task 2 compared with our
architectures

                    System         Pos    Neg     F1
                    UniPI.2.c     0.6850 0.6426 0.6638
                    Unitor.1.u    0.6354 0.6885 0.6620
                    GloVe-VDCNN   0.6522 0.6690 0.6606
                    Unitor.2.u    0.6312 0.6838 0.6575
                    BERT          0.6500 0.6523 0.6511
                    ItaliaNLP.1.c 0.6265 0.6743 0.6504


   As already mentioned, we tested both GloVe and BERT on our TripAdvisor
Dataset. Table 5 shows the F1 score of both architectures. The results show
that GloVe performs better than BERT for both the Twitter and TripAdvisor data
sets. In particular, the GloVe-VDCNN architecture ranks second in Task 1, and
third in Task 2. We gained several insights from these results.

Table 5. F1-scores of our architectures for sentiment polarity classification in the
TripAdvisor Dataset

                                System      F1
                                GloVe-GRU 0.9434
                                BERT      0.9023


    Although BERT is newer than GloVe and also includes context analysis, it
achieves a comparatively poor result. We believe that this behaviour is mainly
due to GloVe’s Italian-only, task-specific pre-training. Even though we
collected different tweets from the ones used in the competition, their
linguistic features are essentially the same, so our GloVe embedding was
“focused” on the task in advance. On the other hand, BERT is a multi-lingual
model trained on Wikipedia, whose pages are not so sentiment-biased, use formal
grammatical structures, and have much longer sentences than tweets. The results
on the TripAdvisor Dataset also support our claim. In this case BERT does not
perform much worse than GloVe-GRU because the language used in TripAdvisor
reviews resembles the one that can be found in Wikipedia.
    Going into the detail of the results on the SENTIPOLC data set, we noticed
that the data are biased towards subjective tweets (5000 samples), while the
objective ones are 2300. This explains the unbalanced results of all the
models. It is worth noticing that our architectures rank first and third,
respectively, on the subjective F1 score. As regards Task 2, GloVe-VDCNN beats
the winning system on negative polarity tweets. We ascribe this result to the
tweets we used for pre-training the embedding: they were collected in a period
of political crisis, and are closer to a negative polarity. On the other hand,
BERT obtains almost identical values for both labels, which we attribute to the
use of context that allows the system to exhibit a good discrimination
capacity.
    Looking at the implementations of the winning systems, we identified at
least two main features of their training that led to their success. The former
is distant supervision to increase the size of the training set, and the latter
is the use of the TWITA data sets [2], which contain more than 100 million
generic and topic-specific Italian tweets, and were used to create embeddings.
Choosing a relatively small data set for training was a precise experimental
choice, motivated by the aim of achieving complete training with reduced
computational resources.
    Our objective was to obtain a competitive score under the constraint of a
light and fast implementation. Training was performed on a 2014 MacBook Pro 13"
with 8GB RAM and a CPU supporting the AVX2 and FMA extensions. Training
GloVe-VDCNN took 4 epochs at 3 minutes per epoch, while the BERT implementation
was slower: its training phase required 3 epochs at 15 minutes per epoch. A
similar behaviour was also noted in the polarity classification task on
TripAdvisor reviews. Training GloVe-GRU took 30 epochs at 1 minute per epoch,
while BERT required just 2 epochs, but at 50 minutes per epoch.
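
The total training times implied by these figures can be obtained with a short
computation. The dictionary layout is only illustrative; the numbers of epochs
and the minutes per epoch are the ones reported above.

```python
# (data set, model) -> (number of epochs, minutes per epoch)
runs = {
    ("SENTIPOLC", "GloVe-VDCNN"): (4, 3),
    ("SENTIPOLC", "BERT"): (3, 15),
    ("TripAdvisor", "GloVe-GRU"): (30, 1),
    ("TripAdvisor", "BERT"): (2, 50),
}

total_minutes = {key: epochs * minutes for key, (epochs, minutes) in runs.items()}

for (data_set, model), minutes in total_minutes.items():
    print(f"{data_set:12s} {model:12s} {minutes:4d} min")
```

The totals are 12 and 45 minutes on SENTIPOLC, and 30 and 100 minutes on the
TripAdvisor Dataset, for the GloVe-based and BERT-based systems respectively.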

5   Conclusions
Two neural architectures have been presented in this work, with the aim of
comparing the performance of contextual embeddings that are not pre-trained
purposely for the task with that of non-contextual ones trained on
domain-specific data, in a setup with reduced computational resources. The
selected task was sentiment classification in Italian social media. The
SENTIPOLC 2016 tweets data set and a purposely built collection of TripAdvisor
reviews were used to this aim. We compared a suitably trained GloVe embedding,
coupled with a deep neural classifier, against a BERT architecture that is
publicly available as a multilingual distribution. The result of our
investigation is that context-free embeddings perform better than contextual
ones, while requiring less computational power to be trained. The results in
the SENTIPOLC 2016 competition are satisfactory, and encourage us to
investigate this issue further. Future work will be devoted to trying the new
AlBERTo embedding model and to using wider data sets to train our embeddings,
while investigating transfer learning techniques to make our system demand as
few computational resources as possible.


References
 1. Francesco Barbieri, Valerio Basile, Danilo Croce, Malvina Nissim, Nicole Novielli,
    and Viviana Patti. Overview of the EVALITA 2016 sentiment polarity classification
    task. In Pierpaolo Basile, Anna Corazza, Francesco Cutugno, Simonetta Montemagni,
    Malvina Nissim, Viviana Patti, Giovanni Semeraro, and Rachele Sprugnoli,
    editors, Proceedings of Third Italian Conference on Computational Linguistics
    (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Processing and
    Speech Tools for Italian. Final Workshop (EVALITA 2016), Napoli, Italy,
    December 5-7, 2016, volume 1749 of CEUR Workshop Proceedings. CEUR-WS.org, 2016.
 2. Valerio Basile and Malvina Nissim. Sentiment analysis on Italian tweets. In
    Proceedings of the 4th Workshop on Computational Approaches to Subjectivity,
    Sentiment and Social Media Analysis, pages 100–107, Atlanta, 2013.
 3. Christos Baziotis, Nikos Pelekis, and Christos Doulkeridis. Datastories at semeval-
    2017 task 4: Deep LSTM with attention for message-level and topic-based senti-
    ment analysis. In Steven Bethard, Daniel M. Cer, Marine Carpuat, David Jurgens,
    Preslav Nakov, and Torsten Zesch, editors, Proceedings of the 10th International
    Workshop on Semantic Evaluation, SemEval@NAACL-HLT 2016, San Diego, CA,
    USA, June 16-17, 2016, pages 747–754. The Association for Computer Linguistics,
    2017.
 4. Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. A neural
    probabilistic language model. J. Mach. Learn. Res., 3:1137–1155, 2003.
 5. Junyoung Chung, Çaglar Gülçehre, KyungHyun Cho, and Yoshua Bengio. Empir-
    ical evaluation of gated recurrent neural networks on sequence modeling. CoRR,
    abs/1412.3555, 2014.
 6. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT:
    pre-training of deep bidirectional transformers for language understanding. In Jill
    Burstein, Christy Doran, and Thamar Solorio, editors, Proceedings of the 2019
    Conference of the North American Chapter of the Association for Computational
    Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN,
    USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pages 4171–4186. Asso-
    ciation for Computational Linguistics, 2019.
 7. Jeffrey L. Elman. Distributed representations, simple recurrent networks, and
    grammatical structure. Machine Learning, 7:195–225, 1991.
 8. Jeremy Howard and Sebastian Ruder. Universal language model fine-tuning for
    text classification. In Iryna Gurevych and Yusuke Miyao, editors, Proceedings of the
    56th Annual Meeting of the Association for Computational Linguistics, ACL 2018,
    Melbourne, Australia, July 15-20, 2018, Volume 1: Long Papers, pages 328–339.
    Association for Computational Linguistics, 2018.
 9. Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of
    word representations in vector space. In Yoshua Bengio and Yann LeCun, editors,
    1st International Conference on Learning Representations, ICLR 2013, Scottsdale,
    Arizona, USA, May 2-4, 2013, Workshop Track Proceedings, 2013.
10. Jeffrey Pennington, Richard Socher, and Christopher D. Manning. GloVe: Global
    vectors for word representation. In Alessandro Moschitti, Bo Pang, and Walter
    Daelemans, editors, Proceedings of the 2014 Conference on Empirical Methods in
    Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A
    meeting of SIGDAT, a Special Interest Group of the ACL, pages 1532–1543. ACL,
    2014.
11. Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher
    Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word represen-
    tations. In Marilyn A. Walker, Heng Ji, and Amanda Stent, editors, Proceedings of
    the 2018 Conference of the North American Chapter of the Association for Com-
    putational Linguistics: Human Language Technologies, NAACL-HLT 2018, New
    Orleans, Louisiana, USA, June 1-6, 2018, Volume 1 (Long Papers), pages 2227–
    2237. Association for Computational Linguistics, 2018.
12. Marco Polignano, Pierpaolo Basile, Marco de Gemmis, Giovanni Semeraro, and
    Valerio Basile. AlBERTo: Italian BERT language understanding model for NLP
    challenging tasks based on tweets. In Proceedings of the Sixth Italian
    Conference on Computational Linguistics (CLiC-it 2019). CEUR, 2019.
13. Holger Schwenk, Loïc Barrault, Alexis Conneau, and Yann LeCun. Very deep
    convolutional networks for text classification. In Mirella Lapata, Phil Blunsom,
    and Alexander Koller, editors, Proceedings of the 15th Conference of the European
    Chapter of the Association for Computational Linguistics, EACL 2017, Valencia,
    Spain, April 3-7, 2017, Volume 1: Long Papers, pages 1107–1116. Association for
    Computational Linguistics, 2017.
14. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones,
    Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need.
    CoRR, abs/1706.03762, 2017.
15. Zhilin Yang, Zihang Dai, Yiming Yang, Jaime G. Carbonell, Ruslan Salakhutdinov,
    and Quoc V. Le. XLNet: Generalized autoregressive pretraining for language
    understanding. CoRR, abs/1906.08237, 2019.