BERT-based Models for Arabic Long Document Classification

Muhammad AL-Qurishi*, Riad Souissi
Elm Company, Research Department, Riyadh 12382, Saudi Arabia

Abstract
Given the number of Arabic speakers worldwide and the notably large amount of web content today in fields such as law, medicine, and news, documents of considerable length are produced regularly. Classifying those documents with traditional learning models is often impractical, since the extended length of the documents raises computational requirements to an unsustainable level. Thus, it is necessary to customize these models specifically for long textual documents. In this paper we propose two simple but effective models to classify long Arabic documents. We also fine-tune two different models, namely the Longformer and RoBERT, for the same task and compare their results to those of our models. Both of our models outperform the Longformer and RoBERT on this task over two different datasets.

Keywords
Arabic Text Processing, Long Document Classification, BERT-based Models, Sentence Segmentation

* Corresponding author: mualqurishi@elm.sa (M. AL-Qurishi); rsouissi@elm.sa (R. Souissi)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

1. Introduction

A large portion of the textual content that requires automated processing is in the form of long documents, and in some domains, such as law or medicine, long documents are the standard. This severely restricts the practical use of the most advanced Transformer models for text classification and other linguistic tasks [1]. For example, models such as BERT [2] have significantly improved the accuracy of automated NLP tasks, but their usefulness is limited to relatively short text sequences [3], because their computational cost grows quadratically with sequence length. Modifying BERT so as to disassociate sequence length from computing complexity would remove this obstacle and bring immediate benefits to numerous fields such as education, science, and business [4]. Innovative approaches that leverage the greatest advantages of Transformers while offsetting their major shortcomings are needed at this stage of development, as they could lead to the full maturation of a concept that has already proven impressively successful on semantic tasks.

There have been numerous attempts to improve the performance and efficiency of BERT on long documents, using a wide variety of approaches. Some of the proposed solutions are based on the sliding-window paradigm [5, 6]; the downside of this class of solutions is their inability to track long-range dependencies in the text, which weakens their analytic insights. Another group of works aims to simplify the Transformer architecture and thereby decrease complexity [7, 8, 9]. So far, none of these attempts matches the level of performance that BERT achieves on short text. Reusing previously completed computation is another strategy for adapting Transformers to longer text, with Transformer-XL [10] as a prominent example. The Longformer proposed by [11], which combines local and global attention to improve efficiency, may be the most promising solution for using Transformers with long text. The issue remains open, and new suggestions for the best method of long document processing are still being made on a regular basis.

In this paper we present two BERT-based language models and fine-tune two others for Arabic long document classification. The first model consists of five layers: a sentence segmentation layer, a BERT layer, a linear classification layer, a sentence grouping layer that regroups sentences by document, and finally a softmax layer. In this model, we segment the document into meaningful sentences and then feed these sentences into the BERT model along with their document ID. The second model follows the same idea of dividing the document into sentences, but instead we hypothesize that the majority of the semantically important information is concentrated within specific sentences of a longer text, making it unnecessary to check for connections between all words in a document. Instead, we use a BERT-based similarity-match algorithm that recognizes high-relevance sentences and passes them as input to the BERT-base model, which completes the desired classification task. Both of these models are based on the BERT architecture and require supervised training for best performance. Input text is divided into sentences that do not exceed the maximum length that BERT can accurately process (512 tokens).

In addition, we have fine-tuned two well-known language models for the long document classification task: the Longformer [11] and Recurrence over BERT (RoBERT) [6] (implementation: https://github.com/helmy-elrais/RoBERT_Recurrence_over_BERT).
Before the fine-tuning process, these two models were modified to suit the Arabic language. We compared the proposed models against the Longformer and RoBERT using two different Arabic datasets: the first was collected from the Mawdoo3 website (www.mawdoo3.com) and the second comes from previous related work [12]. The results showed that the first language model, which aggregates sentences after classifying them, is the best among all models on the news data, with a macro F1-score of 98%, while it achieved a result comparable to the Longformer on the Mawdoo3 dataset, which contains 22 classes.

2. Related Works

Most of the recent works addressing the problem of long document classification start from similar principles common to all deep learning methods. They also diverge in many aspects, as the authors explore different avenues for leveraging the power of the learning algorithms and overcoming the most significant obstacles [13]. Since the authors are essentially attempting to solve the same problem, namely how to maintain high accuracy of semantic predictions while keeping computing demands reasonable, it would be fair to describe the papers as belonging to the same family despite considerable differences in approach.

In terms of methodological choices, practically all works from this group acknowledge the unmatched power of the attention mechanism for analyzing semantic relationships and incorporate it in some way into the proposed architecture. There is a division between works that mostly (or completely) embrace an existing architecture and perform only minor operations such as fine-tuning or knowledge transfer in order to reduce the computational demands [14, 15]. At the other end of the spectrum, there are works that propose innovative hybrid solutions in which the attention mechanism and/or Transformer architecture are combined with elements of different deep learning paradigms, such as RNNs and CNNs. In particular, a common strategy is to adopt a hierarchical structure for the overall solution and use the attention mechanism only in a limited role, thus avoiding the explosive growth of complexity [16, 17].

The aforementioned methodological differences stem largely from the expectations for each paper, which range from proving a theoretical point to attempting to develop a specialized model for long document classification. Works with a narrower scope tend to stay closer to the original BERT design [11], while more ambitious efforts that aim to create new tools are more inclined to experiment with previously untested combinations of elements. In some papers, the scope of intended applications is limited to long documents from a certain domain (e.g., medical) [17], while others approach the problem in more general terms. Finally, there is an important distinction between works that aim for greater accuracy and those that primarily attempt to improve computational efficiency and shorten inference time [18].

It is a fair assessment that practically all works from this group are grappling with the same problem: the tendency of attention-based models to become prohibitively complex as the length of the analyzed text increases. In response, the authors have tried a variety of ideas that rely on vastly different mechanisms to decrease complexity. From fine-tuning and knowledge distillation to the introduction of hierarchical architectures and restrictive elements such as a fixed-length sliding window [11, 19], the proposed techniques are quite innovative and typically leverage known properties of deep learning models to shape how the attention mechanism performs in a particular deployment. The diversity of ideas found in these papers illustrates that researchers are currently casting a wide net and searching for unconventional answers to a difficult problem, without a single dominant strategy. On the other hand, hybrid approaches hold a lot of promise, combining proven elements from different methodologies into new, potentially more optimal configurations [16, 20].

Evaluation of the proposed changes to established algorithms is crucially important, and all of the reviewed works include some form of empirical confirmation of their premises. While the numbers seemingly validate that the proposed solutions achieve state-of-the-art results under the best possible conditions, those findings are self-reported and may often be too optimistic. All of the papers use document classification tasks to evaluate their solutions, but the datasets used for testing may not be comparable in size, diversity, and content. When directly comparing different solutions, it is extremely important to keep in mind the particulars of the evaluation protocols. Studies that provide independently administered comparative testing of several different BERT-like algorithms for document classification are slowly emerging and reporting interesting findings that often diverge from self-assessed results [4, 13, 18]. Still, there are no widely accepted evaluation standards, and every comparison suffers from an 'apples-to-oranges' problem to some extent.

When it comes to practical use of the proposed solutions, there is a general lack of field data, and even discussions of use cases are rare. This is understandable considering that the main focus is on discovering more efficient methods, but without real-world testing it is difficult to predict whether any of the solutions can deliver results similar to their reported findings. Some works may be directed at specific niches such as legal or medical, but even in these cases little attention is paid to the practicalities associated with real-world application. This weakness may reflect the current state of the field, which is highly experimental and mostly built on data collected in controlled environments.

3. Data

The experimental parts of our study are conducted using two different datasets, chosen to match our research domain of long Arabic documents. The datasets are vastly different in terms of size and diversity of classes.

3.1. Mawdoo3 Dataset

The first dataset was scraped from Mawdoo3 (https://mawdoo3.com), the largest Arabic content website. It covers 22 classes, and each category contains between 700 and 12K articles. We selected almost one thousand long articles from each category, as presented in Figure 2.

Figure 2: The Mawdoo3 dataset contains 22 classes; we selected almost 1000 articles under each class.

3.2. Arabic News Dataset

The second dataset consists of news articles downloaded from different sources [21, 12, 22]. These sources share almost the same 8 categories, so we merged them; the resulting dataset is described in Figure 3. We selected almost four thousand long articles from each class.

Figure 3: Arabic News dataset; we chose almost 4000 articles from each category.

4. Models

In this section we introduce two BERT-based language models. Both are based on the BERT architecture and require supervised training for best performance. Input text is divided into sentences that do not exceed the maximum length that BERT can accurately process (512 tokens). We also fine-tune two other models for Arabic long document classification. The following sections explain each of them in detail.

4.1. BERT-based Sentence Aggregation

We propose a simple but effective model for the long document classification task. The proposed model consists of multiple layers, as shown in Figure 1: a sentence segmentation layer, a BERT layer, a linear classification layer, a sentence grouping layer with respect to each document, and finally a softmax layer.

Figure 1: Proposed Model Architecture for Long Document Classification
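The layered pipeline above can be sketched in a few lines of Python. This is a minimal sketch, not the paper's implementation: `segment` is a naive punctuation-based splitter standing in for the Arabic-aware segmentation layer, `classify_sentence` stands in for the fine-tuned AraBERT encoder plus linear and softmax layers, and summing per-sentence probabilities is one plausible reading of the aggregation step.

```python
import re
from collections import defaultdict

def segment(text):
    """Naive sentence splitter on Arabic/Latin end punctuation; a stand-in
    for the Arabic-aware segmentation layer described above."""
    return [s.strip() for s in re.split(r"[.!?\u061F]+", text) if s.strip()]

def classify_corpus(docs, classify_sentence, n_classes):
    """docs: {doc_id: text}. Classify every sentence, keep its document ID,
    then aggregate per-class probabilities per document and take the argmax."""
    totals = defaultdict(lambda: [0.0] * n_classes)
    for doc_id, text in docs.items():
        for sent in segment(text):               # sentence paired with doc ID
            probs = classify_sentence(sent)      # per-sentence class probabilities
            totals[doc_id] = [t + p for t, p in zip(totals[doc_id], probs)]
    # the document label is the class with the highest aggregated probability
    return {d: max(range(n_classes), key=lambda c: totals[d][c]) for d in totals}
```

Any sentence-level classifier with the same interface can be plugged in, which is what makes the per-sentence design easy to swap between BERT variants.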
The first layer segments the long text into sentences, taking into account the structure of the sentence in the Arabic language, so that a sentence does not lose its meaning or break. The second layer is the BERT tokenizer followed by the embedding representation layer. Since we use the BERT base model AraBERT-V2 [23], this layer consists of 12 stacked encoder layers that receive the embedded inputs, process them, and pass the result to an MLP layer.

We train the model on all the sentences, with each sentence treated as a document. The training outputs are the classification probability for each class as well as the sentence ID and the original document ID. We group the text sentences with the probabilities of each category per sentence, and in the end we aggregate all sentences into the category with the highest probability with respect to the document ID.

4.2. BERT-based Key Sentences Model

This model follows the same idea of dividing the document into sentences, but instead we hypothesize that the majority of the semantically important information is concentrated within specific sentences of a longer text, making it unnecessary to check for connections between all words in a document. Instead, we use a BERT-based similarity-match algorithm that recognizes high-relevance sentences and passes them as input to the BERT model, which completes the desired classification task. The high-relevance sentences were selected by applying the maximal marginal relevance (MMR) [24] similarity algorithm, as shown in Equation 1. The length of the selected sentences is between 30 and 150 tokens.

$\mathrm{MMR} = \arg\max_{D_i \in X} \big[ \lambda\, \mathrm{Sim}_1(D_i, S) - (1 - \lambda) \max_{D_j \in C} \mathrm{Sim}_2(D_i, D_j) \big]$    (1)

where $S$ is the sentence vector, $D_i$ is the document vector related to $S$, $X$ is the subset of documents we selected from our dataset, $C$ is the set of already-selected candidates, and $\lambda$ is a constant in the range $[0, 1]$ that controls the diversification of the results. $\mathrm{Sim}_1$ and $\mathrm{Sim}_2$ are similarity functions that can be instantiated with cosine, Euclidean, Jaccard, or any other distance or similarity measure. In our model we used the proper cosine similarity given in Equation 2 (following the XCS224 Mod2 lecture by Prof. Potts):

$\dfrac{\cos^{-1}\!\left( \frac{\sum_{i=1}^{n} u_i v_i}{\lVert u \rVert_2\, \lVert v \rVert_2} \right)}{\pi}$    (2)

4.3. Fine-tuned Models

In this part, we reproduced and fine-tuned two important research works from the literature on processing long texts, which are explained in [6, 11]. We trained and fine-tuned them to classify Arabic long documents using the datasets mentioned in Tables 2 and 3.

4.4. Longformer

The Longformer [11] was proposed to reduce the complexity of the self-attention matrix. This is done by making the matrix sparser through the introduction of an attention pattern with specified locations that need to be prioritized. By using a sliding window with a fixed length, the model avoids quadratic growth and instead scales linearly with input sequence length. Additional gains can be achieved by dilating the sliding window, which frees up some attention heads to process the overall semantic context while non-dilated heads remain focused on local tokens. However, these restrictions interfere with the model's ability to be trained for specific tasks, which required the addition of global attention to the model. Linear projections are used to calculate the attention scores, and in this work an extra set of projections related to global attention is used to make training more reliable. The resulting language model has an impressive capacity for contextual analysis but expends far less computational resources on long-form documents than traditional BERT and other Transformer architectures. Nonetheless, the Longformer was trained for autoregressive modeling with a left-to-right word sequence, and training it with Arabic required some preprocessing. We converted the base AraBERT-V2 model into a Longformer and then fine-tuned the resulting model for our Arabic long document classification task.

4.5. RoBERT

In this model, the authors [6] look into possible ways to extend the usefulness of the BERT language model to text samples longer than a few hundred words. To do this, they introduce an extension to the fine-tuning procedure and separate the input into smaller chunks. After those chunks are processed by the base BERT model, they are passed through another Transformer or a single recurrent layer before a classification decision is made in the softmax layer. These variations were named RoBERT (Recurrence over BERT) and ToBERT (Transformer over BERT), collectively described as Hierarchical Transformers because they maintain a hierarchical structure of representations both at the level of the extracted segments and of the whole document. These models were found to converge very quickly when trained on a narrowly focused dataset and to perform better than the original BERT on long text sequences. The suitability of these derivative models was examined for different tasks, including topic identification and satisfaction prediction during a customer call, which are possible real-world applications. Unlike the Longformer, the RoBERT fine-tuning process was straightforward because it is a BERT-based model.
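The chunk-then-recur idea behind RoBERT can be sketched as follows. This is a schematic under stated assumptions: the function names are hypothetical, `encode_chunk` stands in for BERT's pooled chunk representation, `recur` for the recurrent layer, `classify` for the softmax head, and the chunk size and overlap are illustrative rather than the values used in [6].

```python
def chunk_tokens(tokens, size=200, overlap=50):
    """Split a long token sequence into overlapping chunks that each fit
    comfortably inside BERT's 512-token limit."""
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, max(len(tokens) - overlap, 1), step)]

def robert_predict(tokens, encode_chunk, recur, classify):
    """encode_chunk: chunk -> vector (stand-in for BERT's pooled output);
    recur: (state, vector) -> state (stand-in for the recurrent layer);
    classify: final state -> class index (stand-in for the softmax head)."""
    state = None
    for chunk in chunk_tokens(tokens):
        state = recur(state, encode_chunk(chunk))  # fold chunk embeddings into one state
    return classify(state)
```

The overlap keeps some shared context between adjacent chunks, so the recurrent state is not forced to bridge hard cut points.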
5. Experimental approach

In our work we aim to find a balance between model accuracy on the classification task performed over long text sequences and computational simplicity. We therefore use the base version of BERT, which has a smaller memory footprint of about 500 MB and a faster prediction process, with an embedding length of 768. We used Google Colab Pro to train and fine-tune our models. In terms of accuracy, we use standard metrics to track all of these qualities for the tested models. The macro F1 score is used as the general measure of prediction accuracy in all comparisons, as it provides a basis for comparing results between studies.

Several hyperparameters were set up to fine-tune the experimented models. Our proposed classification solutions were tested using the two collections of documents described in Sec. 3, where 80% of each dataset was used to train the model, 10% served as a validation set, and 10% was used for conducting the tests. Table 1 shows the general parameters used in the training and fine-tuning processes.

Table 1: Hyperparameters used in the training and fine-tuning processes

  number of epochs: 5
  maximum sequence length (aggregation and similarity models): 128
  maximum sequence length (Longformer model): 1024
  maximum sequence length (truncation, RoBERT models): 1024
  number of training steps (Mawdoo3 data): 107466
  adam epsilon: 1e-8
  train batch size: 64
  valid batch size: 128
  epochs: 20
  learning rate: 5e-5
  warmup ratio: 0.1
  max grad norm: 1.0
  accumulation steps: 1

6. Results Discussion and Analysis

The evaluation was conducted using the standardized hyperparameters shown in Table 1, such as batch size and sequence length, on the two datasets suitable for the Arabic long document classification task described in Sec. 3. We report and analyze the results for each dataset separately.

6.1. Mawdoo3 Dataset

All models were empirically evaluated on the long document classification task. We compared our proposed models with the Longformer as well as with RoBERT on the Mawdoo3 dataset. The results were very close between the two proposed solutions and the Longformer, with a very slight edge for the language model based on extracting key sentences using the MMR method, which reached a macro F1 score of 83%, while RoBERT performed very poorly on the Mawdoo3 dataset with a macro F1 score of 21%. The overall results of all models on the long document classification task are given in Table 2. These results support our hypothesis about identifying the most relevant parts of the text: the resulting solution retains the ability to capture relationships between distant tokens, but does not have to actively back-propagate through all of them and instead focuses only on key sentences. Because of this, the model avoids the rapid growth of complexity and remains efficient with much longer texts than the original BERT can handle. It is worth noting that we pre-processed this dataset and removed the information at the beginning of each article, because those parts of a document contain easily identifiable indicators of the class.

6.2. Arabic News Dataset

The results of the experiment were completely different on the Arabic news dataset. All models performed very well, and in this experiment the first model outperformed the rest with a macro F1 score of 98.4%, which revealed that the additional modifications can have a positive impact on model performance, although the impact depends on the dataset. We found that classifying each sentence is better than classifying the whole sequence, which can even increase performance when working with short sentences. Both the Longformer and our second model with MMR still perform very well, with macro F1 scores of 96% and 96.2%, respectively, whereas the RoBERT model reaches a macro F1 score of 74.4%. The overall results of all models on the long document classification task are described in Table 3.
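Since macro F1 is the headline metric in these comparisons, a minimal reference implementation (not the evaluation code used in this work) makes the averaging explicit: each class contributes equally regardless of its size, which is why macro F1 can sit well below accuracy when some classes are predicted poorly.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        # per-class F1 is the harmonic mean of precision and recall
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)
```

In practice a library routine such as scikit-learn's `f1_score` with `average='macro'` computes the same quantity.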
Table 2: Overall results of all models on the long text classification task on the Mawdoo3 dataset

  Model                  Macro F1   Macro Precision   Macro Recall   Accuracy
  Our Model Aggregating  0.82187    0.82338           0.83049        0.83083
  Our Model-MMR          0.82732    0.83162           0.83555        0.83522
  Longformer             0.82347    0.82497           0.83263        0.83291
  RoBERT                 0.21157    0.19309           0.29507        0.36461

Table 3: Overall results of all models on the long text classification task on the Arabic News dataset

  Model                  Macro F1   Macro Precision   Macro Recall   Accuracy
  Our Model Aggregating  0.98411    0.98591           0.98264        0.98434
  Our Model-MMR          0.96217    0.96240           0.96206        0.96263
  Longformer             0.95908    0.95956           0.95880        0.95961
  RoBERT                 0.73062    0.75382           0.75124        0.75142

7. Conclusion

The unmatched flexibility of BERT is one of the main reasons for its rapid acceptance as a state-of-the-art language model. With additional algorithms, some modifications, and fine-tuning, the model can be adjusted for certain topics or tasks and its accuracy pushed to an even higher level. This work explores this possibility in detail, taking long text classification as the target task and searching for the best parameters for this type of usage. In particular, different possibilities for supervised pre-training and fine-tuning were examined on two different datasets. Through detailed experimentation, we were able to identify the most effective procedures for making BERT more accurate on our particular downstream task. While the value of the proposed training and tuning actions was confirmed only for text classification, it stands to reason that analogous procedures could prove useful for other linguistic tasks as well. Finally, we note that we did not explore all hyperparameters, which remains future work, along with trying other language models such as RoBERTa and Electra.

References

[1] Y. Tay, M. Dehghani, S. Abnar, Y. Shen, D. Bahri, P. Pham, J. Rao, L. Yang, S. Ruder, D. Metzler, Long range arena: A benchmark for efficient transformers, arXiv preprint arXiv:2011.04006 (2020).
[2] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).
[3] M. Ding, C. Zhou, H. Yang, J. Tang, CogLTX: Applying BERT to long texts, Advances in Neural Information Processing Systems 33 (2020) 12792–12804.
[4] V. Wagh, S. Khandve, I. Joshi, A. Wani, G. Kale, R. Joshi, Comparative study of long document classification, in: TENCON 2021, IEEE Region 10 Conference (TENCON), IEEE, 2021, pp. 732–737.
[5] Z. Wang, P. Ng, X. Ma, R. Nallapati, B. Xiang, Multi-passage BERT: A globally normalized BERT model for open-domain question answering, arXiv preprint arXiv:1908.08167 (2019).
[6] R. Pappagari, P. Zelasko, J. Villalba, Y. Carmiel, N. Dehak, Hierarchical transformers for long document classification, in: 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), IEEE, 2019, pp. 838–844.
[7] S. Sukhbaatar, E. Grave, P. Bojanowski, A. Joulin, Adaptive attention span in transformers, arXiv preprint arXiv:1905.07799 (2019).
[8] Y. Tay, M. Dehghani, D. Bahri, D. Metzler, Efficient transformers: A survey, ACM Computing Surveys (CSUR) (2020).
[9] J. W. Rae, A. Potapenko, S. M. Jayakumar, T. P. Lillicrap, Compressive transformers for long-range sequence modelling, arXiv preprint arXiv:1911.05507 (2019).
[10] Z. Dai, Z. Yang, Y. Yang, J. Carbonell, Q. V. Le, R. Salakhutdinov, Transformer-XL: Attentive language models beyond a fixed-length context, arXiv preprint arXiv:1901.02860 (2019).
[11] I. Beltagy, M. E. Peters, A. Cohan, Longformer: The long-document transformer, arXiv preprint arXiv:2004.05150 (2020).
[12] M. Abbas, K. Smaili, Comparison of topic identification methods for Arabic language, in: Proceedings of the International Conference on Recent Advances in Natural Language Processing, RANLP, 2005, pp. 14–17.
[13] X. Dai, I. Chalkidis, S. Darkner, D. Elliott, Revisiting transformer-based models for long document classification, arXiv preprint arXiv:2204.06683 (2022).
[14] A. Adhikari, A. Ram, R. Tang, J. Lin, DocBERT: BERT for document classification, arXiv preprint arXiv:1904.08398 (2019).
[15] C. Sun, X. Qiu, Y. Xu, X. Huang, How to fine-tune BERT for text classification?, in: China National Conference on Chinese Computational Linguistics, Springer, 2019, pp. 194–206.
[16] W. Huang, Z. Tao, X. Huang, L. Xiong, J. Yu, Hierarchical self-attention hybrid sparse networks for document classification, Mathematical Problems in Engineering 2021 (2021).
[17] Y. Si, K. Roberts, Hierarchical transformer networks for longitudinal clinical document classification, arXiv preprint arXiv:2104.08444 (2021).
[18] H. H. Park, Y. Vyas, K. Shah, Efficient classification of long documents using transformers, arXiv preprint arXiv:2203.11258 (2022).
[19] Z. Wang, C. Wang, H. Zhang, Z. Duan, M. Zhou, B. Chen, Learning dynamic hierarchical topic graph with graph convolutional network for document classification, in: International Conference on Artificial Intelligence and Statistics, PMLR, 2020, pp. 3959–3969.
[20] J. He, L. Wang, L. Liu, J. Feng, H. Wu, Long document classification from local word glimpses via recurrent attention learning, IEEE Access 7 (2019) 40707–40718.
[21] A. Chouigui, O. Ben Khiroun, B. Elayeb, An Arabic multi-source news corpus: Experimenting on single-document extractive summarization, Arabian Journal for Science and Engineering 46 (2021) 3925–3938.
[22] M. Abbas, D. Berkani, Topic identification by statistical methods for Arabic language, WSEAS Transactions on Computers 5 (2006) 1908–1913.
[23] W. Antoun, F. Baly, H. Hajj, AraBERT: Transformer-based model for Arabic language understanding, arXiv preprint arXiv:2003.00104 (2020).
[24] J. Carbonell, J. Goldstein, The use of MMR, diversity-based reranking for reordering documents and producing summaries, in: Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1998, pp. 335–336.