=Paper=
{{Paper
|id=Vol-2696/paper_100
|storemode=property
|title=NLP&IR@UNED at CheckThat! 2020: A Preliminary Approach for Check-Worthiness and Claim Retrieval Tasks using Neural Networks and Graphs
|pdfUrl=https://ceur-ws.org/Vol-2696/paper_100.pdf
|volume=Vol-2696
|authors=Juan R. Martinez-Rico,Lourdes Araujo,Juan Martinez-Romo
|dblpUrl=https://dblp.org/rec/conf/clef/Martinez-RicoAM20
}}
==NLP&IR@UNED at CheckThat! 2020: A Preliminary Approach for Check-Worthiness and Claim Retrieval Tasks using Neural Networks and Graphs==
Juan R. Martinez-Rico (1), Lourdes Araujo (1,2), and Juan Martinez-Romo (1,2)

(1) NLP & IR Group, Dpto. Lenguajes y Sistemas Informáticos, Universidad Nacional de Educación a Distancia (UNED), Madrid 28040, Spain. jrmartinezrico@invi.uned.es
(2) Instituto Mixto de Investigación - Escuela Nacional de Sanidad (IMIENS). lurdes@lsi.uned.es, juaner@lsi.uned.es

Abstract. Check-worthiness estimation and claim retrieval are two of the first tasks to be performed in the fake news detection pipeline. In this article we present our approach to these tasks as posed in the 2020 edition of the CheckThat! Lab. For task 1, Tweet Check-Worthiness in English, we propose a Bi-LSTM model with GloVe Twitter embeddings in which the input has been enlarged with a graph generated from the additional information provided for each tweet. For the Arabic version of task 1 we follow a similar approach, but with a feed-forward neural network model and Arabic embeddings. For task 5, Debate Check-Worthiness, we propose a simple Bi-LSTM model with GloVe embeddings. Finally, our approach to task 2, Claim Retrieval, is based on a feed-forward neural network model with features such as the cosine similarity between Universal Sentence Encoder embeddings of tweets and claims, together with other linguistic features extracted from both elements.

Keywords: check-worthiness, claim retrieval, embeddings, graph features

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CLEF 2020, 22-25 September 2020, Thessaloniki, Greece. This work has been partially supported by the Spanish Ministry of Science and Innovation within the projects PROSA-MED (TIN2016-77820-C3-2-R), DOTT-HEALTH (PID2019-106942RB-C32) and EXTRAE II (IMIENS 2019).

1 Introduction

One of the problems that current society faces is the spread of fake news. The influence of this type of news on the results of the United States presidential elections a few years ago and, more recently, the disinformation circulating during the COVID-19 pandemic are two examples of this phenomenon. To combat this problem, sites dedicated to checking the veracity of the news that circulates daily in traditional media and on social networks have proliferated. These sites normally carry out their work through human experts who review the news in circulation every day. To facilitate the work of these experts, systems have been proposed that prioritize the claims to be verified, or that return a list of already verified claims matching a given input claim. These two tasks can also be part of a larger system for claim verification and fake news detection.

This article describes the participation of the NLP&IR@UNED team (identified as NLPIR01 in the official results) in tasks T1, T2 and T5 of the CheckThat! Lab at CLEF 2020 [1][2]. Tasks T1 and T5 are dedicated to prioritizing claims to be verified (check-worthiness), while task T2 is a claim retrieval task.

The rest of the article is organized as follows: in section 2 we describe the approaches we have tried for each task, in section 3 we discuss the results obtained, and section 4 presents our conclusions and future work.
2 Proposed Approaches

2.1 Task 1 - Tweet Check-Worthiness in English and Arabic

The purpose of this task is, given a stream of tweets and one or more topics, to sort the tweets according to their check-worthiness for the topic to which they are supposed to belong. Three topics were provided for the Arabic language: "The protests in Lebanon", "Wassim Youssef" and "Turkey enters Syria", while all English-language tweets belong to the topic "COVID-19". The official evaluation measure is P@30 for Arabic and MAP for English.

To address these two versions of task 1, five different models have been analyzed. All models use cross entropy as the loss function and accuracy as the training metric. We plan to release the source code at https://github.com/jrmtnez/NLP-IR-UNED-at-CheckThat-2020. All models have been implemented in TensorFlow 2.1 with Keras (https://www.tensorflow.org).

Model 1. The first model is a feed-forward neural network (FFNN) whose input is made up of the word embeddings of the first n words of the tweet text, followed by a 1D global max pooling layer that operates on the n word vectors, a hidden layer, and a final sigmoid layer of size 1 that provides a check-worthiness score between 0 and 1.

Embedding features. Word embeddings can be learned during training, or they can be preloaded at startup. For this second option, pretrained English Twitter GloVe embeddings [3] of dimension 200 (https://nlp.stanford.edu/projects/glove/) and Arabic GloVe embeddings of dimension 256 (https://github.com/tarekeldeeb/GloVe-Arabic) have been used in the respective versions of task 1. Additionally, in the Arabic version of task 1, the title and description of the topic to which each tweet belongs have been concatenated to the text of the tweet to form the model input. To preprocess the raw Arabic text we use the simple_word_tokenize tokenizer from CAMeL Tools [4]. The first 25 tokens of each tweet and the first 100 tokens of the concatenated topic title and description are selected to form an input of size 125. For English, the NLTK tokenizer [5] (https://www.nltk.org/) has been used, and the first 50 tokens are selected as input.

Graph features. Taking advantage of the fact that the organizers provided the complete information of each tweet in a JSON file, we implemented a process that enlarges the input with tweets related to the current tweet. To do this, from the information of the tweets contained in the training, dev and test datasets, we extract triples of the types tweet-hashtags-hashtag, tweet-quoted-tweet, tweet-reply_status-tweet, tweet-contain_url-url and tweet-has_mention-user, and build one graph per dataset from these triples. After this, for each tweet present in the datasets, we search for the first three tweets that are neighbors of the current one; for example, we would go from the tweet node to a hashtag node and from there select three tweet nodes. If there are no hashtag neighbor nodes, or the hashtag neighbor nodes do not have three or more related tweet nodes, we search from the initial tweet node for tweet nodes behind the relations quoted, reply_status, contain_url and has_mention, in this order. The result is that the texts of up to three tweets can be concatenated to the text of the original tweet. As we will see in section 3.1, each model can be run either with the inputs of a single instance, or with the inputs of several instances concatenated through the tweet graph.
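To make the procedure concrete, the following is a minimal sketch of this graph expansion using networkx; the tweet records, field names and traversal details are simplified assumptions for illustration, not the exact implementation.

```python
# Sketch of the tweet-graph expansion described above; tweet records are
# assumed to be simplified dicts with "id", "text", "hashtags", "quoted_id".
import networkx as nx

RELATIONS = ["hashtags", "quoted", "reply_status", "contain_url", "has_mention"]

def build_graph(tweets):
    """Build an undirected graph linking tweets to hashtag/url/user/tweet nodes."""
    g = nx.Graph()
    for t in tweets:
        tid = ("tweet", t["id"])
        g.add_node(tid, text=t["text"])
        for tag in t.get("hashtags", []):
            g.add_edge(tid, ("hashtag", tag.lower()), rel="hashtags")
        if t.get("quoted_id"):
            g.add_edge(tid, ("tweet", t["quoted_id"]), rel="quoted")
        # reply_status, contain_url and has_mention edges added analogously
    return g

def related_texts(g, tweet_id, k=3):
    """Return the texts of up to k related tweets, trying hashtags first."""
    found = []
    start = ("tweet", tweet_id)
    for rel in RELATIONS:
        for nbr in g.neighbors(start):
            if g.edges[start, nbr].get("rel") != rel:
                continue
            # direct tweet neighbour (quoted / reply_status), or hop through
            # an intermediate hashtag/url/user node to reach other tweets
            candidates = [nbr] if nbr[0] == "tweet" else g.neighbors(nbr)
            for node in candidates:
                if node != start and node[0] == "tweet" and len(found) < k:
                    text = g.nodes[node].get("text", "")
                    if text:
                        found.append(text)
        if len(found) >= k:
            break
    return found
```

In practice one such graph is built per dataset, and the texts returned are concatenated to the original tweet text before tokenization.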
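For reference before the remaining models are described, Model 1 itself can be sketched in a few lines of Keras; the vocabulary size and embedding initialization are placeholders, not the exact configuration used in our runs.

```python
# Minimal Keras sketch of Model 1 (FFNN over word embeddings); vocabulary
# size, hidden width and embedding handling are illustrative placeholders.
from tensorflow.keras import layers, models

def build_model1(vocab_size=50000, emb_dim=200, seq_len=50, hidden=2000):
    model = models.Sequential([
        layers.Embedding(vocab_size, emb_dim,
                         input_length=seq_len),   # optionally GloVe-initialized
        layers.GlobalMaxPooling1D(),              # max over the seq_len vectors
        layers.Dense(hidden, activation="relu"),  # hidden layer
        layers.Dense(1, activation="sigmoid"),    # check-worthiness score in [0, 1]
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model
```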
Model 2. The second model is a CNN fed with the same features as Model 1. The input and embedding layers are followed by two pairs of 1D convolutional and 1D max pooling layers, a flatten layer that feeds a dense layer, and finally a dense sigmoid layer of size 1 at the output. Each 1D convolutional layer has a kernel size of 5 and the 1D max pooling layers have a pool size of 5.

Model 3. The third model is an LSTM network fed with the same features as Model 1. In this case, after the input and embedding layers there is an LSTM layer that feeds directly into the dense sigmoid layer of size 1 that forms the output.

Model 4. The fourth model is a Bi-LSTM network fed with the same features as the previous models. After the input and embedding layers there are two bidirectional LSTM layers, followed by a dense layer that precedes the output sigmoid layer.

Model 5. The last model is again an FFNN, but in this case the input is made up of tf-idf vectors. To build this input we follow the same strategy as in the embedding and graph features of the previous models: for Arabic, the title and description of the topic are concatenated to the tweet text, and up to three additional tweets retrieved through the graph built from the tweet information are added to the input. The network is made up of three pairs of dense and batch normalization layers, where the size of the dense layers decreases by 50% at each stage. These six layers are followed by a batch normalization layer, a dropout layer, and finally a sigmoid output layer.

2.2 Task 5 - Debate Check-Worthiness in English

The objective of this task is, given a transcribed political debate segmented into sentences and annotated with speakers, to sort the sentences according to their check-worthiness. The official evaluation measure for this task is MAP.

In this task, the five models described in section 2.1 have been used with slight variations. First, the FFNN models have also been run without a hidden layer. In addition to word embeddings and tf-idf vectors, another type of input data has been used in an additional FFNN model, as we explain below. To perform a text analysis of the sentences we prepared a version of the English Regressive Imagery Dictionary (RID) [6][7] in a format compatible with that used by the liwc Python module (https://pypi.org/project/liwc/). This dictionary contains 3150 words and roots in 48 categories, which in turn are grouped into three main categories: primary, secondary and emotion. The input vector consists of 51 decimal numbers, one for each category: for each instance of the training and test datasets, we calculate the percentage of words that fall in each of those 51 categories.

2.3 Task 2 - Claim Retrieval in English

In this task, for each check-worthy tweet provided, a ranked list of claims must be returned, reflecting which claims best support that tweet. The official evaluation measure for this task is MAP@5.
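Since MAP and MAP@k are the official measures across these tasks, we include for reference a sketch of the usual average-precision-at-k computation; this is the standard textbook formulation, not the lab's official scorer, and normalization conventions can vary between evaluation scripts.

```python
# Standard average-precision-at-k, shown for reference only; the official
# CheckThat! scorer may use a slightly different normalization.
def average_precision_at_k(ranked, relevant, k=5):
    """ranked: ids in predicted order; relevant: set of relevant ids."""
    hits, score = 0, 0.0
    for i, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / i  # precision at this cut-off
    return score / min(len(relevant), k) if relevant else 0.0

def map_at_k(all_ranked, all_relevant, k=5):
    """Mean of AP@k over all queries (tweets or sentences)."""
    aps = [average_precision_at_k(r, rel, k)
           for r, rel in zip(all_ranked, all_relevant)]
    return sum(aps) / len(aps) if aps else 0.0
```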
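Stepping back briefly to the RID features of section 2.2, a minimal sketch of how the 51-dimensional percentage vector might be computed; the dictionary format and matching rules shown are assumptions in the spirit of the liwc module's wildcard entries, not its actual API.

```python
# Hypothetical sketch of the RID-based features: percentage of tokens
# matching each of the 51 categories (48 categories + 3 main groupings).
def rid_vector(tokens, categories):
    """categories: dict mapping category name -> list of words/roots ('confus*')."""
    counts = {name: 0 for name in categories}
    for tok in tokens:
        for name, patterns in categories.items():
            for p in patterns:
                # entries ending in '*' are treated as prefixes (roots)
                if (p.endswith("*") and tok.startswith(p[:-1])) or tok == p:
                    counts[name] += 1
                    break
    n = max(len(tokens), 1)
    # one percentage per category, in a fixed order
    return [100.0 * counts[name] / n for name in sorted(categories)]
```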
In our approach to task 2 we have used an FFNN similar to Model 5 described above, and for this model we build a dataset in the following way. For each claim we calculate its sentence embedding v_c with the Universal Sentence Encoder [8], the ratio of distinct tokens to total tokens rt_c, the average number of characters per word avc_c, the number of verbs vn_c, the number of nouns nn_c, the ratio of content words (nouns, verbs, adjectives and adverbs) to the total number of words rcw_c, and the ratio of content tags ("NN", "NNS", "NNP", "NNPS", "VB", "VBD", "VBG", "VBN", "VBP", "VBZ", "JJ", "JJR", "JJS", "RB", "RBR", "RBS" and "WRB") to the total number of tags rct_c. The same values v_t, rt_t, avc_t, vn_t, nn_t, rcw_t, rct_t are calculated for each tweet, and v_ct, rt_ct, avc_ct, vn_ct, nn_ct, rcw_ct, rct_ct for each claim title. Combining claims and tweets, and claim titles and tweets, we obtain the features for our dataset shown in Table 1.

Claim - Tweet                       Claim title - Tweet
sim_ct  = cosine_sim(v_c, v_t)      sim_ctt  = cosine_sim(v_ct, v_t)
drt_ct  = rt_c - rt_t               drt_ctt  = rt_ct - rt_t
davc_ct = avc_c - avc_t             davc_ctt = avc_ct - avc_t
dvn_ct  = vn_c - vn_t               dvn_ctt  = vn_ct - vn_t
dnn_ct  = nn_c - nn_t               dnn_ctt  = nn_ct - nn_t
drcw_ct = rcw_c - rcw_t             drcw_ctt = rcw_ct - rcw_t
drct_ct = rct_c - rct_t             drct_ctt = rct_ct - rct_t
Table 1: Task 2 features

3 Experiments and Results

3.1 Task 1 - Tweet Check-Worthiness in English and Arabic

In task 1, for both English and Arabic, we enlarged the input by building a graph with which we retrieve tweets related to each instance of the datasets, based on the hypothesis that the relationships tweet-hashtags-hashtag, tweet-quoted-tweet, tweet-reply_status-tweet, tweet-contain_url-url and tweet-has_mention-user can enrich the information provided to the different classifiers. To verify this, we performed a grid search over different parameters, applied or not according to the model used: graph-generated inputs, pretrained embeddings, number of epochs, batch size, hidden layer size, activation type, optimizer type, and dropout. All models use the adam optimizer with its default parameter values, except Model 4, which uses nadam. For Arabic there was no dev dataset, so we set aside 20% of the training dataset instances as a dev dataset.

Table 2 shows the best results obtained on the dev dataset for the different combinations of pretrained embeddings and graph features, together with the optimal parameters for each model, for Arabic; Table 3 shows the best results for English.
Model                 Graph  Emb.  Activation  H. layer  Dropout  Ep./Bch.  MAP
Model 1 (FFNN)        N      N     relu        2000      -        50/8      0.1175
                      N      Y     relu        2000      -        50/8      0.1399
                      Y      N     relu        2000      -        50/8      0.1238
                      Y      Y     relu        2000      -        50/8      0.1454
Model 2 (CNN)         N      N     relu        100       -        25/8      0.1201
                      N      Y     relu        100       -        25/8      0.1436
                      Y      N     relu        100       -        25/8      0.1225
                      Y      Y     relu        100       -        25/8      0.1313
Model 3 (LSTM)        N      N     tanh        -         0.1      5/32      0.0646
                      N      Y     tanh        -         0.1      5/32      0.0821
                      Y      N     tanh        -         0.1      5/32      0.0634
                      Y      Y     tanh        -         0.1      5/32      0.0797
Model 4 (Bi-LSTM)     N      N     tanh        10        0.1      5/32      0.1145
                      N      Y     tanh        10        0.1      5/32      0.1131
                      Y      N     tanh        10        0.1      5/32      0.1211
                      Y      Y     tanh        10        0.1      5/32      0.1210
Model 5 (FFNN TF/IDF) N      -     relu        500       0.4      10/8      0.0665
                      Y      -     relu        500       0.4      10/8      0.0640
Table 2: Task 1 - Arabic parameter analysis

Observing the results for Arabic, we can see that in almost all cases the use of Arabic GloVe embeddings improves the result, and that the expansion of the inputs through the tweet graph improves five of the nine possible configurations, one of them (FFNN + graphs + embeddings) obtaining the best overall result.

For English it is again confirmed that the use of GloVe embeddings improves performance in all cases. The graph features do not show a homogeneous behavior, although the best value is obtained for the Bi-LSTM model with embeddings and graph features.

Model                 Graph  Emb.  Activation    H. layer  Dropout  Ep./Bch.  MAP
Model 1 (FFNN)        N      N     hard sigmoid  1000      -        100/8     0.6480
                      N      Y     hard sigmoid  1000      -        100/8     0.7145
                      Y      N     hard sigmoid  1000      -        100/8     0.6097
                      Y      Y     hard sigmoid  1000      -        100/8     0.7214
Model 2 (CNN)         N      N     sigmoid       32        -        10/16     0.5392
                      N      Y     sigmoid       32        -        10/16     0.6671
                      Y      N     sigmoid       32        -        10/16     0.5998
                      Y      Y     sigmoid       32        -        10/16     0.6507
Model 3 (LSTM)        N      N     tanh          -         0.1      5/8       0.5578
                      N      Y     tanh          -         0.1      5/8       0.6067
                      Y      N     tanh          -         0.1      5/8       0.4153
                      Y      Y     tanh          -         0.1      5/8       0.4385
Model 4 (Bi-LSTM)     N      N     tanh          10        0.1      10/64     0.6188
                      N      Y     tanh          10        0.1      10/64     0.7410
                      Y      N     tanh          10        0.1      10/64     0.6024
                      Y      Y     tanh          10        0.1      10/64     0.7425
Model 5 (FFNN TF/IDF) N      -     sigmoid       1000      0.4      50/8      0.4695
                      Y      -     sigmoid       1000      0.4      50/8      0.4634
Table 3: Task 1 - English parameter analysis

Regarding the official results, for Arabic our best run was the Bi-LSTM model with Arabic GloVe embeddings and graph features, sent as contrastive2, which obtained a P@30 of 0.5333, ranking 13th out of the 28 runs sent by all teams, while the FFNN (primary) and CNN (contrastive1) models with the same features obtained a P@30 of 0.3917 (19th) and 0.4833 (16th) respectively. For English, our best run was the Bi-LSTM model (primary), which obtained a MAP of 0.6069 (20th), while the FFNN (contrastive1) and CNN (contrastive2) models obtained a MAP of 0.5546 (22nd) and 0.5193 (23rd), respectively.

3.2 Task 5 - Debate Check-Worthiness in English

In task 5, different parameters and settings have also been analyzed. A dev dataset was not available for this task either, so we partitioned the training dataset, leaving 20% of it as a dev dataset. Given the large class imbalance (4027 negative instances versus 42 positive instances), we opted for an oversampling strategy to equalize the number of positive and negative instances. In this task, an additional FFNN model (Model 6), which makes use of the features derived from the RID text analysis, has been used. Table 4 shows the best results obtained for the different combinations of pretrained embeddings and oversampling, together with the optimal parameters for each model. All models use the adam optimizer with its default parameter values. As we can see, the use of oversampling in general does not improve the behavior of the different models.
The best results among the FFNN models are obtained with the RID-based features without oversampling, outperforming the FFNN models with inputs based on embeddings and tf-idf vectors. It is also clearly seen, as was the case in task 1, that the use of pretrained 6B-100D GloVe embeddings substantially improves the performance of all the models in which they can be used. The model that outperforms the rest by far is the Bi-LSTM with GloVe embeddings, and all of the runs we submitted for task 5 were based on it. The contrastive2 run used oversampling, while the primary and contrastive1 runs did not; these last two shared the same parameters, the only difference between them being the random weight initialization.

Model                 Over.  Emb.  Activation  H. layer  Dropout  Ep./Bch.  MAP
Model 1 (FFNN)        N      N     sigmoid     0         -        50/8      0.0482
                      N      Y     sigmoid     0         -        50/8      0.0605
                      Y      N     sigmoid     0         -        50/8      0.0410
                      Y      Y     sigmoid     0         -        50/8      0.0511
                      N      N     sigmoid     1000      -        20/64     0.0538
                      N      Y     sigmoid     1000      -        20/64     0.0547
                      Y      N     sigmoid     1000      -        20/64     0.0413
                      Y      Y     sigmoid     1000      -        20/64     0.0760
Model 2 (CNN)         N      N     sigmoid     32        -        20/64     0.0374
                      N      Y     sigmoid     32        -        20/64     0.0808
                      Y      N     sigmoid     32        -        20/64     0.0128
                      Y      Y     sigmoid     32        -        20/64     0.0282
Model 3 (LSTM)        N      N     tanh        -         0.2      20/64     0.0375
                      N      Y     tanh        -         0.2      20/64     0.0768
                      Y      N     tanh        -         0.2      20/64     0.0374
                      Y      Y     tanh        -         0.2      20/64     0.0487
Model 4 (Bi-LSTM)     N      N     tanh        256       0.2      10/16     0.0239
                      N      Y     tanh        256       0.2      10/16     0.1700
                      Y      N     tanh        256       0.2      10/16     0.0256
                      Y      Y     tanh        256       0.2      10/16     0.1050
Model 5 (FFNN TF/IDF) N      -     sigmoid     1000      0.4      20/64     0.0719
                      Y      -     sigmoid     1000      0.4      20/64     0.0421
Model 6 (FFNN RID)    N      -     elu         102       0.4      20/64     0.0871
                      Y      -     elu         102       0.4      20/64     0.0268
Table 4: Task 5 - English parameter analysis

Table 5 shows the official results of this task. Our primary and contrastive1 runs, based on the Bi-LSTM model with GloVe embeddings, obtained the first positions in the ranking.

Team      Run            MAP
NLPIR01   Primary        0.0867
NLPIR01   Contrastive-1  0.0849
UAICS     Primary        0.0515
UAICS     Contrastive-1  0.0431
TobbEtuP  Contrastive-1  0.0417
NLPIR01   Contrastive-2  0.0408
UAICS     Contrastive-2  0.0328
TobbEtuP  Primary        0.0183
Table 5: Task 5 - Official results

3.3 Task 2 - Claim Retrieval in English

In this task we used an FFNN with 1000 elu (exponential linear unit) units in the first hidden layer, 500 elu units in the second hidden layer and 250 elu units in the last hidden layer, fed with Universal Sentence Encoder embeddings of tweets, claims and claim titles. We used the adam optimizer with its default values and trained the model for 50 epochs with a batch size of 128. In our experiments on the dev dataset we verified that the features derived from the claim title do not improve on the performance offered by the Claim - Tweet features alone. Table 6 shows the results obtained with both configurations; both feature sets exceed the MAP@5 of 0.609 obtained by the Elasticsearch-based baseline provided by the organization. We submitted three runs with two different settings. In the first one, sent as primary, we made use of the seven features described in section 2.3 involving claims and tweets; with this configuration we obtained a MAP@5 of 0.8560, placing our team in fourth place. The contrastive1 run used all features, and contrastive2 was identical to the primary run but with a different random initialization.
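As an illustration of section 2.3, the Claim - Tweet feature vector might be computed as sketched below, using the public Universal Sentence Encoder from TensorFlow Hub and NLTK POS tags; the content-word ratio is approximated from the tags here, and the exact preprocessing of our runs may differ.

```python
# Hedged sketch of the seven Claim - Tweet features of section 2.3.
# Requires: nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")
import numpy as np
import nltk
import tensorflow_hub as hub

use = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
CONTENT_TAGS = {"NN", "NNS", "NNP", "NNPS", "VB", "VBD", "VBG", "VBN",
                "VBP", "VBZ", "JJ", "JJR", "JJS", "RB", "RBR", "RBS", "WRB"}

def text_stats(text):
    tokens = nltk.word_tokenize(text)
    tags = [tag for _, tag in nltk.pos_tag(tokens)]
    n = max(len(tokens), 1)
    return {
        "rt": len(set(tokens)) / n,                       # distinct/total tokens
        "avc": sum(len(t) for t in tokens) / n,           # avg chars per word
        "vn": sum(tag.startswith("VB") for tag in tags),  # verb count
        "nn": sum(tag.startswith("NN") for tag in tags),  # noun count
        # content-word ratio approximated from POS tags
        "rcw": sum(tag[:2] in {"NN", "VB", "JJ", "RB"} for tag in tags) / n,
        "rct": sum(tag in CONTENT_TAGS for tag in tags) / max(len(tags), 1),
    }

def claim_tweet_features(claim, tweet):
    vc, vt = use([claim, tweet]).numpy()
    sim = float(np.dot(vc, vt) / (np.linalg.norm(vc) * np.linalg.norm(vt)))
    sc, st = text_stats(claim), text_stats(tweet)
    return [sim] + [sc[k] - st[k] for k in ("rt", "avc", "vn", "nn", "rcw", "rct")]
```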
With the contrastive1 and contrastive2 configurations we obtained a MAP@5 of 0.8390 and 0.8550 respectively, confirming that the seven Claim title - Tweet features do not improve performance when used together with the seven Claim - Tweet features.

Feature Set  MAP@1  MAP@3  MAP@5  MAP@10  MAP@20  MAP@All
Claim-Tweet  0.739  0.793  0.798  0.802   0.804   0.805
All          0.723  0.781  0.787  0.790   0.792   0.794
Baseline     0.470  0.601  0.609  0.615   0.617   0.619
Table 6: Task 2 - Feature set comparison on the dev dataset

4 Conclusions and Future Work

In this paper we present our approach to the tweet check-worthiness, debate check-worthiness and claim retrieval tasks of the CLEF 2020 CheckThat! Lab.

Examining the results obtained in the two check-worthiness tasks in which we participated (three, if we take into account that the first was defined in two different languages), we can see that the use of pretrained embeddings of the appropriate language significantly improves the performance of the models compared with learning these embeddings during the training phase.

In the two versions of task 1 we used a particular method to enlarge the information that reaches the input of the models: a graph constructed from the tweets provided in the datasets that collects the relationships tweet-hashtags-hashtag, tweet-quoted-tweet, tweet-reply_status-tweet, tweet-contain_url-url and tweet-has_mention-user between tweets. Although this mechanism has not behaved homogeneously across the different models, it is the one that obtained the highest MAP values on the dev dataset in both the Arabic and English versions of this task.

In task 5 we made use of RID-based features and, although we did not send any run with them, in our tests with FFNN models on the dev dataset they obtained good results compared to embedding and tf-idf based features, so this type of text analysis-based features can be an alternative to better-known ones such as LIWC [9]. We did not have these features ready in time for task 1, but we think they can be applied to tweet texts, and this remains as future work. The MAP values in this task are certainly low; we think that the large class imbalance and the small number of positive instances may prevent these instances from being correctly characterized by the models.

In the claim retrieval task we assumed that the similarity between claim, claim title and tweet would allow a selection of claims appropriate to the requirements of the task, and to this end we implemented a mixed strategy using sentence embeddings and stylometric features based on token counts and ratios, far exceeding the provided baseline with these features.

Overall, the balance of our participation in these tasks has been positive: we obtained first place in task 5, fourth place in task 2, and more modest results in task 1. In future work we plan to continue experimenting with graph features to increase the size of the inputs or the number of training instances, combining this strategy with more sophisticated language representations such as BERT [10].

References

1. Alberto Barrón-Cedeño, Tamer Elsayed, Preslav Nakov, Giovanni Da San Martino, Maram Hasanain, Reem Suwaileh, Fatima Haouari, Nikolay Babulkov, Bayan Hamdan, Alex Nikolov, Shaden Shaar, and Zien Sheikh Ali. Overview of CheckThat! 2020: Automatic Identification and Verification of Claims in Social Media. arXiv:2007.07997 [cs], July 2020.
2. Alberto Barrón-Cedeño, Tamer Elsayed, Preslav Nakov, Giovanni Da San Martino, Maram Hasanain, Reem Suwaileh, and Fatima Haouari. CheckThat! at CLEF 2020: Enabling the Automatic Identification and Verification of Claims in Social Media. In Joemon M. Jose, Emine Yilmaz, João Magalhães, Pablo Castells, Nicola Ferro, Mário J. Silva, and Flávio Martins, editors, Advances in Information Retrieval, Lecture Notes in Computer Science, pages 499-507, Cham, 2020. Springer International Publishing.
3. Jeffrey Pennington, Richard Socher, and Christopher D. Manning. GloVe: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532-1543, 2014.
4. Ossama Obeid, Nasser Zalmout, Salam Khalifa, Dima Taji, Mai Oudah, Bashar Alhafni, Go Inoue, Fadhl Eryani, Alexander Erdmann, and Nizar Habash. CAMeL Tools: An Open Source Python Toolkit for Arabic Natural Language Processing. 2020.
5. Edward Loper and Steven Bird. NLTK: The Natural Language Toolkit. arXiv:cs/0205028, May 2002.
6. Colin Martindale. Romantic Progression: The Psychology of Literary History. Hemisphere, Washington, DC, 1975.
7. Colin Martindale. The Clockwork Muse: The Predictability of Artistic Change. Basic Books, New York, NY, US, 1990.
8. Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St John, Noah Constant, Mario Guajardo-Cespedes, Steve Yuan, and Chris Tar. Universal Sentence Encoder. arXiv preprint arXiv:1803.11175, 2018.
9. James W. Pennebaker, Ryan L. Boyd, Kayla Jordan, and Kate Blackburn. The Development and Psychometric Properties of LIWC2015. Technical report, 2015.
10. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805, 2018.