Detection of cyberbullying in Arabic social media
using dynamic graph neural network⋆
Ahmed Bouliche1,∗,† and Abdellah Rezoug2,†

,21 Department of computer science, Faculty of sciences, University of Boumerdes, Boumerdes, Algeria


                                         Abstract
                                         Despite all the advantages social networks have brought to the world, they are also a very favourable
                                         environment for the growth of so-called electronic crimes. Textual exchanges between users may include
                                         clues to crimes committed or being prepared. Usually, methods of Natural Language Processing (NLP)
                                         and neural networks are effective ways to detect cybercrimes particularly cyberbullying. In this paper,
                                         we proposed techniques that allow to use structures of dynamic temporal graphs as direct inputs to a
                                         model without turning them into static graphs as well as a message passing algorithm that fits well with
                                         the approach. The effectiveness of these techniques was tested on a prototype model. Fortunately, the
                                         proposed techniques have been proven to work, but with poor model performance. The applicability of
                                         a crime detector can be established with a session classifier if the data is more general, i.e., represents all
                                         the language used by bullies.


                                         Keywords
                                         natural language processing, Arabic language, dynamic graph neural network, graph neural network,
                                         cybercrime delection, cyberbullying


1. Introduction

The widespread use of social media in recent years has helped to spread many types of dangerous
cybercrimes. Fraud, swearing, harassment, terrorism, bullying, and drug dealing are examples
of many cybercrimes that people have been suffering from in recent years. Cyberbullying is
common among adolescents and has negative effects on the psychological and even physical
state of the victim. Therefore, focus has recently turned to developing solutions capable of
automatically detecting and predicting cybercrimes based mainly on artificial intelligence and
natural language processing (NLP).
   It is obvious that the domain of NLP is developing very well for some languages, such as
English. Yet, for the Arabic language, there are only some experimental attempts that are issued
here and there, which remain insufficient. In addition, the Arabic language poses another kind
of difficulty related to the way it is written. Also, there is no unified language among the Arab
TACC 2022: Tunisian-Algerian Joint Conference on Applied Computing, 13 - 14 Dec Constantine, Algeria
⋆


Envelope-Open bouliche.ahmed.2@gmail.com (A. Bouliche); a.rezoug@univ-boumerdes.dz (A. Rezoug)
                                       © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
    CEUR
    Workshop
    Proceedings
                  http://ceur-ws.org
                  ISSN 1613-0073
                                       CEUR Workshop Proceedings (CEUR-WS.org)
countries, as each country uses its own dialect.
   In recent years, researchers have proposed very effective machine learning-based solutions
to detect cyberbullying by processing published texts, images, and videos. In this study, we
are only interested in methods based on NLP. Arabic cyberbullying detection using sentiment
analysis was proposed by Almutiry and Abdel Fattah [1]. Abusive language detection on Arabic
tweets based on matching texts with a list of obscene words was presented by Mubarak et al.
[2]. Aldjanabi et al. [3] developed a classification system for determining offensive and hate
speech using a multi-task learning (MTL) model built on top of a pre-trained Arabic language
model. Fatemah Husain [4] applied a single learner machine learning approach and an ensemble
machine learning approach for offensive language detection in the Arabic language.
   This article presents an approach to solving the problem of detecting cyberbullying through
text analysis using a dynamic graph neural network (DGNN). The proposed method is a graph-
level classification to classify an entire comment session, whether it contains cyberbullying or
not. These sessions are a set of comments where each relationship between two comments is
a response relationship between the two. Each node in the graph is a comment, and an edge
between them is a reply link. The temporal aspect of the data must be taken into consideration
because texts are time-series data. The size of comments varies and is not fixed, so the type of
time graph that will be used is dynamic time graphs. According to our modest research, the
majority of methods on time graphs assume that graphs are static and often treat them as static
[28]. We propose a method that takes into account the dynamism of time graphs with a shift
method for processing temporal and spatial aspects.
   The remainder of this article is organised as follows. Section 2 summarises the state of the art
of the proposed solutions to detect cyberbullying using artificial intelligence methods. Section
3 explains in detail the approach proposed in this article. Section 4 summarises the conducted
experiments and the obtained results. This article is finished with conclusions on the most
relevant observations and results.


2. Related works

Several researchers have tried to handle cyberbullying detection using machine learning algo-
rithms. To detect aggressive behaviours, the work by [5] employs three deep learning models:
Closed Bi-directional Recurrent Unit (BiGRU), Transformers, and Convolutional Neural Net-
works (CNN). Experimental studies were conducted to examine how effective the models are at
classifying well-known hate speech Twitter data-sets into aggressive and non-aggressive. An
accuracy of about 88% was reached.
  The authors proposed by [6] another method based on supervised machine learning ap-
proaches. Several classifiers have been used to train and detect acts of bullying. The data set
shows that neural networks perform better and achieve an accuracy of 92.8%, whereas simple
vector machines (SVM) achieve 90.3%. Reynolds et al. [7] used textual features and constructed
several metrics to assess swear words through a swear word dictionary.
   What makes cyberbullying detection particularly challenging compared to offline bullying
is that language can pose difficulties like ambiguity and sarcasm. Role prediction can also be
a difficult task if we look at a conversation thread between the individuals involved in the
conversation (victim, bully, bully type, advocate, etc.). The predictions made by the previous
automatic methods for the detection of cyberbullying do not reflect the complexity of the task
described above. Recent efforts include combined word normalisation, named entity recognition
to detect person-specific references [8]. In their work, [9] collected a large data set from
ask.fm and used BOW (Bag Of Words) features as a first test, then extended it with term lists,
subjectivity lexicons, and subject model features.
   Recently, neural word embedding and neural network techniques have been applied. Convo-
lutional neural networks (CNN) on phonetic features have been applied by [10] and [11] to study,
among others, the same architecture on textual features in combination with long-short-term
memory (LSTM) networks.
   People tend to describe their experience of bullying in messages called bullying traces. The
researchers by [12] presented a method to examine bullying traces. They identify several key
issues in using social media to study bullying: text categorisation, role labelling, and topic
modelling.
   Using graphs to visualise potential cyberbullies and their connections is a recent method. The
researchers by [13] concretely implemented techniques from graph theory: Network functionali-
ties (degree of centrality: number of incident links from a node; close centrality: average distance
the shorter from each vertex to the other vertex.), Content-based features (length, sentiment,
offensive words, pronouns).
   The focus of most existing work done on cyberbullying is the independent content analysis
of comments within social media sessions. Suyu et al. [14] cited three different key limitations
for detection of cyberbullying, 1) only consider the content within a single comment rather than
the topic coherence across comments 2) remain generic and exploit limited interactions between
social media users, 3) overlook the temporal correlations among different comments. They
showed that interaction within the same session may evolve. Modelling topic coherence between
comments and temporal interactions is critical to capturing repetitive bullying characteristics.
leading to better predictive performance. To achieve this goal they constructed a unified temporal
graph for each social media session and then proposed a principled graph-based approach for
modelling the temporal dynamics and topic coherence throughout user interactions.
  Research in psychology and social science has shown that cyberbullying is carried out
repeatedly against victims [15]. Studies of user interactions allow us to characterise their
repetition by both content and temporal analysis. However, modelling interaction in social
media sessions presents some challenges: sparsity, repetition, and user characteristics.
3. The proposed approach

Our approach is based on dynamic graph neural networks. This part explains every step of
the procedure. It is composed of: data collection, graph construction and how constraints are
handled, coordination matrix creation, tokenisation and training the model.


3.1. Graphs generation

Data generation was the hardest task and the process that took us the most time. Understanding
the data structure is also an important factor for data generation. In our case, data generation is
a process based mainly on dynamic temporal graph creation.


3.1.1. Data collect

With some research and exploitation, we were able to collect 11268 comments, of which 8417
are negative instances that do not contain cyberbullying behaviour and 2851 positive instances
that do contain cyberbullying behaviour. The links to the databases used were shared by [16].


3.1.2. Construction of graphs

The construction of temporal graphs consists of understanding our data structure as well as the
two types of temporal graphs.

   1. Structure of a session The sessions will be represented by temporal graphs where each
      node represents a comment and the edges between them a response relationship. At the
      beginning (step 𝑡 = 0), the graph has nodes representing tokens corresponding to the
      𝑖𝑡ℎ comment. Each session is a dynamic temporal graph, dynamic temporal graphs are
      records over time. The records are graphs at each time step, with the 𝑖𝑡ℎ node representing
      the token of the 𝑖𝑡ℎ comment at time step 𝑡. Except the size of comments is not fixed, so
      some nodes and edges will be removed over time.
      There are two constraints that describe the structure we want to use for the sessions we
      are going to process:
         • edge loop. Each node in our graph is a comment and not a user. If a comment 𝑗 is
           a reply to a comment 𝑖, and we add a cycle edge (i.e) comment 𝑖 is also a reply to 𝑗,
           this just doesn’t make sense.
         • session owner. The session owner or the owner of a post is often the person
           bombarded with comments and insults.
   2. Constraint of dynamic temporal graphs creation
         • Node Deletion. nodes must be deleted only if the comment that corresponds to
           the node is less than the current time step, if the comment contains 10 tokens then
           the node will be deleted from the graph in the 11𝑡ℎ time step.
     • Edge removal. Edges should be removed only if one of the nodes linked to the
       edge is removed.
       Removing and adding nodes is important for processing the temporal aspect of the
       graph. The deletion of the edges is done if a node linked to it is deleted, because no
       spatial data is propagated in the network.
     • Number of nodes equal to the number of nodes in the COO matrix. The
       COO matrix (CO-Ordination matrix) is a an optimised representation of the edges in
       the graph that consumes less space and reduces the number of units of calculation.
       The matrix contains two rows. In each column the first row contains the sender
       nodes and the second row the receiver nodes. The union of the two lines gives us
       all the nodes of the matrix.


       Figure 1: Example of coordination matrix


       If the set of nodes in the COO matrix is greater than the set of nodes in the node
       embedding matrix, it means that the embedding matrix is missing information about
       some nodes. This will make message passing impossible.
     • Isolated node. Isolated nodes cause problems because they do not belong to the
       set of nodes in the COO matrix. The absence of a node in the COO matrix puts us in
       a case where we do not know if the node has been deleted or is isolated. Deleting a
       node causes an edge to be deleted and deleting an edge causes a node to be isolated.
     • Shifted node embedding. A node is added to the graph at time 𝑡 if and only if
       the length of the comment which corresponds to the node is greater than 𝑡. If
       between [0 − 𝑡 − 1], some nodes have been deleted. Deleting these nodes will cause
       a shift in the order of the nodes in the node embedding matrix. This means that the
       embedding of node 14 can actually be the embedding of node 18.
3. Graph creation
     • tokenisation: The Transformers library [17] offers an entire hub where multiple
       developers can put their algorithm to use. For the segmentation of Arabic words,
       we use the segmenter [18] and for the tokenisation algorithm we use a WordPiece
            algorithm developed by [19] for the Arabic language. WordPiece was originally
            designed by Google when designing the famous BERT model.
            The tokenisation is performed on the entire data-set. When creating temporal
            graphs, for each node (𝑖 for example) that we add to a graph in a time 𝑡, we add
            the identifier of the 𝑤𝑖 token of the comment 𝑖 in the embedding matrix. These
            identifiers will then be passed to an embedding layer which as output gives us an
            embedding vector (the dimension of the vector is a hyper-parameter).
          • Coordination matrix (COO-matrix). We start by picking a random number to
            determine the length of the session (number of comments or node in the session).
            In the first time step 𝑡 = 0 we build the first graph where each node 𝑖 will pick a
            random number 𝑒 ∈ [1, 𝑠𝑖𝑧𝑒𝑠 𝑒𝑠𝑠𝑖𝑜𝑛], 𝑒 is the node that will be connected to node𝑖. This
            will build the COO matrix which represents the edges and structure of the graph.
            We are still in the first step in time. To solve the isolated node problem, we add a
            self loop connection to the isolated node in the CO-Ordination matrix. Both the
            sender and the receiver are the isolated node and the weight of the edge is null.
            To solve all the rest of the constraints for the creation of the dynamic temporal
            graph mentioned above, we proposed shifting the indices of nodes. For each step
            in time, we save a COO matrix with the new structure (the nodes not deleted) and
            another COO matrix where the indexes of the nodes are shifted backwards.
            In the case of spatial data processing, we can use the nodes shifted edge indices,
            because the indices will be reset and this will prevent us from falling in the case
            where a node index is greater than the number of nodes in the embedding matrix.
            In the case of time series data processing, we use the original indices to check if
            the node is not deleted and the shifted edge indices matrix to select the embedding
            vector of nodes for processing their temporal information.
            Finally, we were able to generate 753 dynamic temporal graphs from which 191
            contain cyberbullying behaviour and 562 do not.


3.2. Model training

3.2.1. Spacial data processing

Generally, message passing is done through a dot product between the adjacency matrix and
the node embedding. We have a CO-Ordination matrix which consists of two rows and in each
column the two connected nodes. How can the COO matrix be used for message passing?
   The embedding matrix is created using an embedding layer that takes the token identifiers
of the nodes at time step 𝑡 as input. The embedding dimension we chose is 𝑑𝑖𝑚 = 128. We
create a copy of the embedding matrix and we update the vectors of the nodes which receive
information from their neighbours in the embedding matrix. This can be achieved as follows:
   For each column in the shifted edge indices CO-Ordination matrix, we update the embed-
ding vector of the receiver node in the new embedding matrix. Example the first column is:
𝑠ℎ𝑖𝑓 𝑡𝑒𝑑𝑒 𝑑𝑔𝑒𝑖 𝑛𝑑𝑖𝑐𝑒𝑠..,1 = [1, 3], in the new node embedding we will use the vectors in the position
1 and 3 to update the vector in the position 3. And if the vector 3 has already been updated, we
use the vectors in position 1 and the new vector in position 3.


Figure 2: The message passing algorithm


Figure 3: Analysis of temporal aspect in graphs


  This procedure must be done for all the columns in the shifted edge indices COO matrix.
The result are weighted by parameterised readable weight matrix to learn the interactions (the
number of layers we have chosen is three.). After applying these transformations to the graph
in the first time step, we will do the same for all the graphs in the future time steps individually.
This way, all spatial information about neighbouring nodes having been propagated through the
graph in all time steps, the resulting node embedding can be used as input data for processing
the temporal aspect.


3.2.2. Handling temporal aspect of graphs

After applying message passing for each node in each graph over time, our node embedding all
contain information about their neighbouring nodes. It was previously mentioned that deletion
leads to an embedding lag which can cause problems for handling the dynamic graph temporal
aspect. If the nodes will be shifted then the time series data processing will learn corrupted
and incorrect sequences. To solve this problem, we shifted the indices of the CO-Ordination
matrix so that Deep Learning architectures used for processing time-series data learn the correct
sequences. This is achieved as follows:
  We start by concatenating all the records of a node through time. This will prepare the input
data for the chosen Deep Learning architecture and will make processing temporal data very
easy. To concatenate the vectors of the node through time, we have to loop over the time steps
and check if the node has not been deleted from the COO matrix (this will indicate that the
number of comment tokens is greater than 𝑡) and we have the shifted index of the node so we
can extract the node embedding vector from the embedding matrix at time 𝑡, concatenate it
with the node embedding vector at time 𝑡 − 1. We will have a representation of all the records
over time in a matrix that will be used as input data for an LSTM layer. This concatenation
operation will be done for all the nodes of the session.
   An LSTM architecture will be used to process the temporal aspect of the data, the result
of each node that passes through the LSTM will also be concatenated. Then we calculate the
transpose the resulting matrix and get the average on the last dimension for shape reduction. A
finale output layer is added to predict if the session contains cyberbullying or not.


4. Experimental results

In order to verify the effectiveness of the proposed approach, we conducted experimentation
using the dataset. The input data contains 11268 Arabic comments (tweets), where 8417 do not
contain cyberbullying and 2851 do contain cyberbullying. From these comments, 753 dynamic
temporal graphs were generated, of which 191 contained cyberbullying and 562 did not.
   The training results of the first epoch were 49% but we saw improvements in the second
epoch with an accuracy of 74% which means that the learning took place and the proposed
techniques applied to the CO-Ordination matrix worked. We were not able to continue the
training for more than 2 epochs due to OOM (Out Of Memory) reasons. The temporal graphs
are very memory-consuming data structures. We believed that the imbalance and the size of
the database reduced the performance. The model that we have chosen is a prototype model to
test the efficiency and profitability of the shifted edge indices COO matrix that we proposed.
We could have used regularisation technique for generalisation or scheduling the learning rate
for better optimisation, but performance of the model was not our goal.
   We tested on the same set that we used for training and we noticed that there was an extreme
bias in the model predictions. This is not only because of the imbalance in the dataset, where
almost 75% of the examples are negative examples, but also because of the small size of the
dataset.

Table 1
confusion matrix
                                                Negative   positive
                                   Negative       562        191
                                   Positive        0          0

  From this confusion matrix, we calculated the following micro metrics (positive predictive
rate, recall, specificity, negative predictive rate, precision) and the f1 score as a macro metric.
The results are summarised in Table 2.

Table 2
performance indicator metrics
                       PPR      Recall   NPR     Specificity   acc    f1 score
                        0         0      100%       74%        74%      85%

   Knowing that comments are text in nature and responses between one comment and another
can be structured with a temporal graph gives us the ability to harness the power of dynamic
temporal graphs. With different message-passing algorithms available, GNNs capture the
relationships between nodes and learn to properly represent a graph in node embedding
that represents each node and the information acquired from their neighbours. And with
different deep learning architectures that process time-series data, we can learn how these node
interactions change over time. We leverage the dynamism of the dynamic temporal graphs with
a bunch of techniques that prove the learning has taken place despite the complex structure of
the dynamic temporal graph being kept.


5. Conclusion

Cyberbullying is a dangerous cybercrime that is lightly taken by governments despite the
psychological damage it causes to its victims. It has spread very rapidly due to the spread of
social media, which has caused an increase in the number of bullies that has led to a permanent
cycle of violence.
   Session classification seems more applicable for cyberbullying detection. Graph neural
networks (GNN) make it possible to capture the relationships between entities, and by adding
an architecture that deals with the temporal aspect of graphs, we have the ability to model
relationships between entities over time. We decided to leverage dynamic temporal graphs
because there is little research on this data structure, and hence the majority of research
considers these graphs to be static temporal graphs. When turning a dynamic temporal graph
into a static temporal graph, we think that the algorithm will waste a lot of training steps
learning insignificant alternatives that we added to make the structure static. Therefore, we
have proposed a technique that uses the dynamism of this structure. The method proved that
learning took place. Unfortunately, we did not have the necessary means to properly test the
performance of the method.


References
 [1] S. Almutiry, M. Abdel Fattah, Arabic cyberbullying detection using arabic sentiment
     analysis, The Egyptian Journal of Language Engineering 8 (2021) 39–50.
 [2] H. Mubarak, K. Darwish, W. Magdy, Abusive language detection on arabic social media,
     in: Proceedings of the first workshop on abusive language online, 2017, pp. 52–56.
 [3] W. Aldjanabi, A. Dahou, M. A. Al-qaness, M. A. Elaziz, A. M. Helmi, R. Damaševičius,
     Arabic offensive and hate speech detection using a cross-corpora multi-task learning
     model, in: Informatics, volume 8, MDPI, 2021, p. 69.
 [4] F. Husain, Arabic offensive language detection using machine learning and ensemble
     machine learning approaches, arXiv preprint arXiv:2005.08946 (2020).
 [5] M. Alotaibi, B. Alotaibi, A. Razaque, A multichannel deep learning framework for cyber-
     bullying detection on social media, Electronics 10 (2021). URL: https://www.mdpi.com/
     2079-9292/10/21/2664. doi:1 0 . 3 3 9 0 / e l e c t r o n i c s 1 0 2 1 2 6 6 4 .
 [6] J. Mounir, M. Nashaat, M. Ahmed, Z. Emad, E. Amer, A. Mohammed, Social media
     cyberbullying detection using machine learning, International Journal of Advanced
     Computer Science and Applications 10 (2019) 703–707. doi:1 0 . 1 4 5 6 9 / I J A C S A . 2 0 1 9 . 0 1 0 0 5 8 7 .
 [7] K. Reynolds, A. Kontostathis, L. Edwards, Using machine learning to detect cyberbully-
     ing, in: 2011 10th International Conference on Machine Learning and Applications and
     Workshops, volume 2, 2011, pp. 241–244. doi:1 0 . 1 1 0 9 / I C M L A . 2 0 1 1 . 1 5 2 .
 [8] U. Bretschneider, T. Wöhner, R. Peters, Detecting online harassment in social networks,
     in: Thirty Fifth International Conference on Information Systems, Auckland 2014, 2014,
     pp. 1–14.
 [9] C. Emmery, B. Verhoeven, G. D. Pauw, G. Jacobs, C. V. Hee, E. Lefever, B. Desmet, V. Hoste,
     W. Daelemans, Current limitations in cyberbullying detection: on evaluation criteria,
     reproducibility, and data scarcity, CoRR abs/1910.11922 (2019). URL: http://arxiv.org/abs/
     1910.11922. a r X i v : 1 9 1 0 . 1 1 9 2 2 .
[10] X. Zhang, J. Tong, N. Vishwamitra, E. Whittaker, J. P. Mazer, R. Kowalski, H. Hu, F. Luo,
     J. Macbeth, E. Dillon, Cyberbullying detection with a pronunciation based convolutional
     neural network, in: 2016 15th IEEE International Conference on Machine Learning and
     Applications (ICMLA), 2016, pp. 740–745. doi:1 0 . 1 1 0 9 / I C M L A . 2 0 1 6 . 0 1 3 2 .
[11] H. Rosa, D. Matos, R. Ribeiro, L. Coheur, J. P. Carvalho, A “deeper” look at detecting
     cyberbullying in social networks, in: 2018 International Joint Conference on Neural
     Networks (IJCNN), 2018, pp. 1–8. doi:1 0 . 1 1 0 9 / I J C N N . 2 0 1 8 . 8 4 8 9 2 1 1 .
[12] J.-M. Xu, K.-S. Jun, X. Zhu, A. Bellmore, Learning from bullying traces in social
     media, in: Proceedings of the 2012 Conference of the North American Chapter of
     the Association for Computational Linguistics: Human Language Technologies, As-
     sociation for Computational Linguistics, Montréal, Canada, 2012, pp. 656–666. URL:
     https://aclanthology.org/N12-1084.
[13] A. Squicciarini, S. Rajtmajer, Y. Liu, C. Griffin, Identification and characterization of
     cyberbullying dynamics in an online social network, in: 2015 IEEE/ACM International
     Conference on Advances in Social Networks Analysis and Mining (ASONAM), 2015, pp.
     280–285. doi:1 0 . 1 1 4 5 / 2 8 0 8 7 9 7 . 2 8 0 9 3 9 8 .
[14] S. Ge, L. Cheng, H. Liu, Improving cyberbully detection with user interaction, CoRR
     abs/2011.00449 (2020). URL: https://arxiv.org/abs/2011.00449. a r X i v : 2 0 1 1 . 0 0 4 4 9 .
[15] J. Dooley, J. Pyżalski, D. Cross, Cyberbullying versus face-to-face bullying a theoretical
     and conceptual review, Journal of Psychology 217 (2009) 182–188. doi:1 0 . 1 0 2 7 / 0 0 4 4 - 3 4 0 9 .
     217.4.182.
[16] H. Mubarak, K. Darwish, W. Magdy, Abusive language detection on Arabic social
     media, in: Proceedings of the First Workshop on Abusive Language Online, Asso-
     ciation for Computational Linguistics, Vancouver, BC, Canada, 2017, pp. 52–56. URL:
     https://aclanthology.org/W17-3008. doi:1 0 . 1 8 6 5 3 / v 1 / W 1 7 - 3 0 0 8 .
[17] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf,
     M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. L. Scao,
     S. Gugger, M. Drame, Q. Lhoest, A. M. Rush, Transformers: State-of-the-art natural lan-
     guage processing, in: Proceedings of the 2020 Conference on Empirical Methods in Natural
     Language Processing: System Demonstrations, Association for Computational Linguistics,
     Online, 2020, pp. 38–45. URL: https://www.aclweb.org/anthology/2020.emnlp-demos.6.
[18] A. Abdelali, K. Darwish, N. Durrani, H. Mubarak, Farasa: A fast and furious segmenter for
     Arabic, in: Proceedings of the 2016 Conference of the North American Chapter of the As-
     sociation for Computational Linguistics: Demonstrations, Association for Computational
     Linguistics, San Diego, California, 2016, pp. 11–16. URL: https://aclanthology.org/N16-3003.
     doi:1 0 . 1 8 6 5 3 / v 1 / N 1 6 - 3 0 0 3 .
[19] W. Antoun, F. Baly, H. Hajj, Arabert: Transformer-based model for arabic language
     understanding, in: LREC 2020 Workshop Language Resources and Evaluation Conference
     11–16 May 2020, 2020, p. 9.