<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>BERT-based Models for Arabic Long Document Classification⋆</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Muhammad AL-Qurishi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Riad Souissi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Elm Company, Research Department</institution>
          ,
          <addr-line>Riyadh 12382</addr-line>
          ,
          <country country="SA">Saudi Arabia</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Given the number of Arabic speakers worldwide and the notably large amount of content in the web today in some fields such as law, medicine, or even news, documents of considerable length are produced regularly. Classifying those documents using traditional learning models is often impractical since extended length of the documents increases computational requirements to an unsustainable level. Thus, it is necessary to customize these models specifically for long textual documents. In this paper we propose two simple but efective models to classify long length Arabic documents. We also fine-tune two diferent models-namely, Longformer and RoBERT, for the same task and compare their results to our models. Both of our models outperform the Longformer and RoBERT in this task over two diferent datasets.</p>
      </abstract>
      <kwd-group>
        <kwd>eol&gt;Arabic Text Processing</kwd>
        <kwd>Long Document Classification</kwd>
        <kwd>BERT-based Models</kwd>
        <kwd>Sentence Segmentation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        the proposed solutions are based on the sliding window
paradigm [
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ]. The downside of this class of solutions
A large portion of textual content that requires automated is their inability to track long-range dependencies in
processing is in the form of long documents. In some the text which weakens their analytic insights. Another
domains such as legal or medical, long documents are the group of works aim to simplify the architecture of
Transstandard. This severely restricts the possibilities for prac- formers and decrease complexity as result [
        <xref ref-type="bibr" rid="ref7 ref8 ref9">7, 8, 9</xref>
        ]. So
tical use of the most advanced Transformer models for far, none of these attempts could match the same level of
text classification and other linguistic tasks [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. For exam- performance that BERT achieves with short text. Reusing
ple, models such as BERT [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] have significantly improved previously completed steps is another strategy for
adaptthe accuracy of automated NLP tasks, but their useful- ing Transformers for longer text [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] as a prominent
ness is limited to relatively short text sequences [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] due example. Longformer model proposed by [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] may be
to the fact that their complexity increases geometrically. the most promising solution for the problem of using
Modifying BERT in such a way to disassociate sequence Transformers with long text, and it combines local and
length from computing complexity would remove this global attention to improve eficiency. The issue remains
obstacle and bring immediate benefits to numerous fields open, and new suggestions for the best method of long
such as education, science, and business [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Innova- document processing are still being made on a regular
tive approaches that leverage the greatest advantages of basis.
      </p>
      <p>Transformers while ofsetting their major shortcomings In this paper we present two BERT-based language
are needed at this stage of development, as they could models and fine-tune two others for Arabic long
doculead to full maturation of a concept that has been demon- ment classification. The first language model consists of
strated to be impressively successful with semantic tasks. four main layers: sentence segmentation layer, BERT</p>
      <p>
        There have been numerous attempts to improve the layer, a linear classification layer, then the sentence
performance and eficiency of BERT with long docu- grouping layer with respect to each document, and
fiments, using a wide variety of approaches. Some of nally the softmax layer. In this model, we segmented
the document into meaningful sentences and then fed
Woodstock’22: Symposium on the irreproducible science, June 07–11, these sentences into BERT model along with their
docu⋆20Y2o2u, Wcaonodusstoeckth,iNsYdocument as the template for preparing your ment ID. The second model has the same idea of dividing
publication. We recommend using the latest version of the ceurart the document into sentences, but instead we
hypothestyle. size that a majority of semantically important
informa* Corresponding author. tion is concentrated within specific sentences inside of a
† These authors contributed equally. longer text, making it unnecessary to check for
connec$ mualqurishi@elm.sa (M. AL-Qurishi); rsouissi@elm.sa tions between all words in a document. Instead, we used
(R. Souissi) BERT-based similarity match algorithm that can
recog(R. 0S0o0u0i-s0s0i)02-7594-7325 (M. AL-Qurishi); 0000-0002-7594-7325 nize high-relevance sentences and pass them as input
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License to the BERT-base model that can complete the desired
CPWrEooUrckReshdoinpgs IhStpN:/c1e6u1r3-w-0s.o7r3g ACttEribUutRion W4.0oInrtekrnsahtioonpal (PCCroBYce4.0e).dings (CEUR-WS.org)
classification task. Both of those models are based on attention mechanism only in a limited role, thus avoiding
BERT architecture, and require supervised training for the exponential growth of complexity [
        <xref ref-type="bibr" rid="ref16 ref17">16, 17</xref>
        ].
best performance. Input text is divided into sentences The aforementioned methodological diferences stem
that don’t exceed the maximum length that BERT can largely from the expectations for each paper, which range
accurately process (512 tokens). from proving a theoretical point to attempting to
de
      </p>
      <p>
        In addition, we have fine-tuned two well-known lan- velop specialized model for long document classification.
guage models for long documents classification task, Works with a narrower scope tend to stay closer to the
which are the Longformer [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] and Recurrent over original BERT model design [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], while more ambitious
BERT [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]1. Before the fintuning process, these two mod- eforts that aim to create new tools are more inclined
els have been modified to be suitable for the Arabic lan- to experiment with previously untested combinations
guage. We compared the proposed models against the of elements. In some papers, the scope of intended
apLongfroemer and RoBERT using two diferent Arabic plications is limited to long documents from a certain
datasets. The proposed language models were evaluated domain (i.e. medical) [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], while others are
approachagainst two models, Longfarmer and RoBERT, using two ing the problem in more general terms. Finally, there
datasets. The first dataset was collected from the Maw- is an important distinction between works that aim for
doo3 website2 and the second dataset was from previous greater accuracy, and those that primarily attempt to
imrelated work [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. The results showed that the first lan- prove computational eficiency and shorten the inference
guage model based on sentences aggregating after clas- time [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ].
sifying them is the best among all models on news data It’s a fair assessment that practically all works from
with a macro F1-score equal to 98%. While this model this group are grappling with the same problem – the
tenachieved a comparative result with the Longformer in dency of attention-based models to become prohibitively
the second Mawdoo3 dataset that contains 22 classes. complex as the length of the analyzed text is increased.
In response, the authors tried a variety of ideas that rely
on vastly diferent mechanisms to decrease complexity.
2. Related Works From fine-tuning and knowledge distillation to
introduction of hierarchical architectures and restrictive
eleMost of the recent works addressing the problem of long ments such as fixed-length sliding window [
        <xref ref-type="bibr" rid="ref11 ref19">11, 19</xref>
        ], the
document classification start from similar principles com- proposed techniques are quite innovative and typically
mon to all deep learning methods. They also diverge in leverage some known properties of deep learning
modmany aspects, as the authors explore diferent avenues els to afect how the attention mechanism performs in
for leveraging the power of the learning algorithms and a particular deployment. The diversity of ideas found
overcoming the most significant obstacles [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. Since the in those papers illustrates that researchers are currently
authors are essentially attempting to solve the same prob- casting a wide net and searching for unconventional
anlem, namely how to maintain high accuracy of seman- swers to a dificult problem, without a single dominant
tic predictions while keeping the computing demands strategy. On the other hand, hybrid approaches hold a
reasonable, it would be fair to describe the papers as lot of promise and they combine some proven elements
belonging to the same family despite the considerable from diferent methodologies into new, potentially more
diferences in approach. optimal configurations [
        <xref ref-type="bibr" rid="ref16 ref20">16, 20</xref>
        ].
      </p>
      <p>
        In terms of methodological choices, practically all Evaluation of the proposed changes to established
alworks from this group acknowledge the unmatched gorithms is crucially important, and all of the reviewed
power of the attention mechanism for analyzing seman- works include some form of empirical confirmation of
tic relationships, and incorporate it in some way into their premises. While the numbers seemingly validate
the proposed architecture. There is a division between that the proposed solutions achieve state-of-the-art
reworks that mostly (or completely) embrace an existing sults under the best possible conditions, those findings
architecture and perform only minor operations such as are self-reported and may often be too optimistic. All
ifne-tuning or knowledge transfer in order to reduce the of the papers are interested in document classification
computational demands [
        <xref ref-type="bibr" rid="ref14 ref15">14, 15</xref>
        ]. On a diferent end of the tasks and use it to evaluate their solutions, but datasets
spectrum, there are works that propose innovative hy- used for testing may not be the same in terms of size,
dibrid solutions in which the attention mechanism and/or versity, and content. When directly comparing diferent
Transformer architecture are combined with elements solutions, it’s extremely important to keep in mind the
of diferent deep learning paradigms, such as RNNs and particulars of the evaluation protocols. Studies aiming
CNNs. In particular, a common strategy is to adopt a to provide evaluations with independently administered
hierarchical structure for the overall solution and use the comparative testing of several diferent BERT-like
algo1https://github.com/helmy-elrais/RoBERT_Recurrence_over_BERT rithms for document classification are slowly emerging
2www.mawdoo3.com and reporting some interesting findings that often
di
      </p>
    </sec>
    <sec id="sec-2">
      <title>3. Data</title>
      <p>
        verge from self-assessed results [
        <xref ref-type="bibr" rid="ref13 ref18 ref4">4, 13, 18</xref>
        ]. Still, there
are no widely accepted evaluation standards and every
comparison sufers from ‘apples-to-oranges’ problem up Experimental parts of our study are conducted using two
to an extent. diferent datasets, with the choice of the datasets based
      </p>
      <p>When it comes to practical use of the proposed so- on the domain of research which is long length Arabic
lutions, there is a general lack of field data and even documents. The datasets are vastly diferent in terms of
discussions of use cases are rare. This is understandable size and diversity of classes.
considering the main focus is on discovering more
eficient methods, but without real world testing it’s dificult 3.1. Mawdoo3 Dataset
to predict whether any of the solutions can deliver
similar results to their reported findings. Some works may be The first dataset was scraped from Mawdoo3 which is the
directed as specific niches such as legal or medical, but largest Arabic content website3. The number of classes
even in this case little attention is paid to practicalities as- from mawdoo3 is 22 class and each category contains
sociated with real world application. This weakness may between 700 to 12K articles. We have selected almost one
reflect the current state of the field, which is highly exper- thousand of long articles from each category as presented
imental and mostly built on data collected in a controlled in Figure 2.
environment.</p>
      <sec id="sec-2-1">
        <title>3.2. Arabic News Dataset</title>
        <sec id="sec-2-1-1">
          <title>The second dataset was about news articles and we down</title>
          <p>
            loaded them from diferent sources [
            <xref ref-type="bibr" rid="ref12 ref21 ref22">21, 12, 22</xref>
            ]. These
data have almost the same 8 categories so we merged
them together and the resulted dataset is described in
Figure 3. We have selected almost four thousands of long
articles from each class.
mance. Input text is divided into sentences that don’t
exceed the maximum length that BERT can accurately
process (512 tokens). We also fine-tune two others for
Arabic long document classification. the following
sections explain that in details.
          </p>
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>4.1. BERT-based Sentence Aggregation</title>
        <sec id="sec-2-2-1">
          <title>We propose a simple but efective model to do a long doc</title>
          <p>
            4. Models ument classification task. Our proposed model consist of
multiple layers as shown in figure 1; namely, sentence
segIn this section we introduce two BERT-based language mentation layer, BERT layer, a linear classification layer,
models. Both of those models are based on BERT archi- then the sentence grouping layer with respect to each
tecture, and require supervised training for best perfor- document, and finally the softmax layer. The first layer
is to make a segmentation of sentences from the long
long texts, which are explained in [
            <xref ref-type="bibr" rid="ref11 ref6">6, 11</xref>
            ]. We have trained
text, taking into account the structure of the sentence
and fine-tuned them to classify Arabic long length
docuin Arabic language. So that the sentence does not lose
ments using the datasets mentioned in Tables 2 and 3.
          </p>
          <p>We train the model on all the sentences and each sen- attention pattern with specified locations that need to
its meaning or break. The second layer is the BERT
tokenizer followed by the embedding representation layer.</p>
        </sec>
        <sec id="sec-2-2-2">
          <title>Since we are using BERT base model named ArabertV2 [23], this layer consists of 12-layer stacked encoders that receive the embedding inputs and process it and send to the an MLP layer.</title>
          <p>tence is considered as document. The training outputs
are the classification probability for each class as well
as the sentence ID and orginal document ID. We make a
grouping of text sentences with the probabilities of each
category in each sentence, and in the end we aggregate
all sentence in the category with the highest probability
with respect to the document ID.</p>
        </sec>
      </sec>
      <sec id="sec-2-3">
        <title>4.2. BERT-based Key Sentences Model</title>
      </sec>
      <sec id="sec-2-4">
        <title>4.4. Longformer</title>
        <sec id="sec-2-4-1">
          <title>The Longformer [11] was proposed to reduce the com</title>
          <p>plexity of the self-attention matrix. This can be done by
making the matrix sparser through the introduction of
be prioritized. By using a sliding window with a fixed
length, the model doesn’t enter exponential progression
and instead scales linearly with input sequence length.</p>
        </sec>
        <sec id="sec-2-4-2">
          <title>Additional gains can be achieved by dilating the sliding window, which frees up some attention heads to process the overall semantic context while non-dilated heads remain focused on local tokens.</title>
        </sec>
        <sec id="sec-2-4-3">
          <title>However, the implemented restrictions interfere with the</title>
          <p>
            model’s ability to be trained for specific tasks, which
required the addition of global attention to the model.
LinThis model has the same idea of dividing the document
ear projections are used to calculate the attention scores,
into sentences, but instead we hypothesize that a
majorand in this work an extra set of projections related to
ity of semantically important information is concentrated
global attention are used to make training more reliable.
within specific sentences inside of a longer text, making it
The resulting linguistic model has an impressive capacity
unnecessary to check for connections between all words
for contextual analysis, but expends far less
computain a document. Instead, we used BERT-based similarity
tional resources when used with long-form documents
match algorithm that can recognize high-relevance sen- than traditional BERT and other Transformer
architectences and pass them as input to the BERT model that
tures. Nonetheless, Longformer was trained for
autorecan complete the desired classification task. The
highgressive modeling with left-to-right word sequence and
relevance sentences were selected by applying a maximal
train it with Arabic needed some preprocessing. We
conmarginal relevance (MMR) [
            <xref ref-type="bibr" rid="ref24">24</xref>
            ] similarity algorithm as
verted the base model of Arabert-V2 into a Longformer
shown in equation 1. The length of the sentences is be- then we fine-tuned the output model for our Arabic long
tween 30 to 150 tokens.
          </p>
          <p>document classification task.
   = ∈ [ 1(, ) − (1 −  )</p>
        </sec>
      </sec>
      <sec id="sec-2-5">
        <title>4.5. RoBERT</title>
        <p>
          ∈
max 2(,  )] (1)
Where  is the sentence vector and  is the document
vector related to .  is a subset of documents in our
dataset we already selected and  is a constant in range
of [
          <xref ref-type="bibr" rid="ref1">0–1</xref>
          ], for diversification of results. The
1 and
2 are the similarity function which can be replaced
larity measures. In our model we have used the proper
cosine similarity that explained by equation 2 4.
− 1 ∑︀=1 ×  )
( ||||2×|| ||2

(2)
        </p>
      </sec>
      <sec id="sec-2-6">
        <title>4.3. Fine-tuned Models</title>
        <sec id="sec-2-6-1">
          <title>In this part, we reproduced and fine-tuned two of the</title>
          <p>important research works in the literature for processing</p>
        </sec>
        <sec id="sec-2-6-2">
          <title>4XCS224 Mod2 lecture by Prof. Pott</title>
          <p>by cosine, euclidean, Jacard and any other distance simi- recurrent layer before a classification decision is made in</p>
        </sec>
        <sec id="sec-2-6-3">
          <title>In this model the authors [6] are looking into possible</title>
          <p>ways to extend the usefulness of the BERT linguistic
model to text samples longer than a few hundred words.</p>
        </sec>
        <sec id="sec-2-6-4">
          <title>To do this, they introduce an extension to the fine-tuning</title>
          <p>procedure and separate the input into smaller chunks.
After those chunks are processed by the base BERT model,
they are passed through another Transformer or a single
the softmax layer. Those variations were named RoBERT
(Recurrence over BERT) and ToBERT (Transformer over
BERT), collectively described as Hierarchical
Transformers because they maintain the hierarchical structure of
representations both on the level of extracted segments
and the whole document. Those models were found
to converge very quickly when trained on a narrowly
focused dataset and to perform better than the
original BERT with long text sequences. Suitability of those
derivative models was examined for diferent tasks,
including topic identification and satisfaction prediction
during a customer call, which are possible real world
applications. Unlike the Longformer, with RoBERT the
ifne-tuning process was straightforward because it was
a BERT-Based model.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>5. Experimental approach</title>
      <sec id="sec-3-1">
        <title>In our work we aim to find a balance between model</title>
        <p>accuracy on classification task performed over long text
sequences and computational simplicity. Therefor we
tried to utilize the base version of BERT which have
less memory size of 500MB and faster prediction
process where the length of the embedding is 768. We used
Google Colab pro to train and fine-tune our models. In
terms of accuracy, we use standard metrics to track all
of those qualities for the tested models. Macro F1 score
is used as a general measure of accurate prediction on
all comparisons , as it provides a basis for comparison of
results between studies.</p>
        <p>Several hyperparameters have been setup to fine-tune
the experimented models. Our proposed classification
solutions were tested using two collection of documents
mentioned in Sec. 3 where 80% of the dataset was used
to train the model and 10% as a validation set and 10%
utilized for conducting the tests. Table 1 shows the
general parameters used in the training and fine-tuning
processes.
All models were empirically evaluated on long document
classification task. We compared our proposed models
with Longformer as well as with the RoBERT on
Mawdoo3 dataset. The results were very close between the
two proposed solutions and the Longformer, with a very
slight superiority to the language model based on
extracting key sentences using MMR method with macro
F1 score equal to 83%. While Robert performed very
poorly on Mawdoo3 dataset with macro F1 score of 21%.
The overall results of all models in the long document
classification task are explained in Table 2. We can say
that this results support our hypothesis of identify the
most relevant parts of the text. The resulting solution
retains the ability to capture relationships between
distant tokens, but doesn’t have to actively back-propagate
all of them and instead focuses only on key sentences.
Because of this, the model avoids geometric progression
of complexity and continues to be eficient with much
longer texts than the original BERT is able to. It is worth
noting that we have pre-processed and removed the
information at the beginning of each article in this dataset
because that the parts of the document containing easily
identifiable indicators of the class.
6.2. Arabic News Dataset
number of epochs 5
maximum sequence length aggregation and similarity models 128
maximum sequence length longFormer model 1024
maximum sequence length truncation , RoBERT models 1024
mawdoo3 data number of training steps: 107466
adam epsilon 1e-8
train batch size 64
valid batch size 128
epochs 20
learning rate 5e-5
warmup ratio 0.1
max grad norm 1.0
accumulation steps 1</p>
      </sec>
      <sec id="sec-3-2">
        <title>The results of the experiment were completely diferent</title>
        <p>Table 1 with the Arabic news dataset. All models performed very
Hyperparameters used in the training and fine-tuning pro- well, and in this experiment, the first model outperformed
cesses the rest with macro F1 score of 98.4% which revealed that
Parameter Name Value additional modification can have a positive impact on
model performance, but it’s important which dataset is
used. It was discovered that classifying each sentence is
better than classifying the whole sequence, which could
even increase performance when working with short
sentences. However, both Longformer and our second
model with MMR are still performing very well with
macro F1 score of 96% and 96.2%, respectfully. Whereas
RoBERT model has macro F1 score of 74.4%. The overall
results of all models in the long document classification
task are described in Table 3.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>6. Results Discussion and Analysis</title>
      <p>The evaluation was conducted using standardized hyper
parameters such as batch size and sequence length and
others as shown in Table 1, with two diferent datasets
suitable for Arabic long document classification task as
described in Sec 3. We will try to report the results and
analyze them according to each data set separately.</p>
    </sec>
    <sec id="sec-5">
      <title>7. Conclusion</title>
      <p>Unmatched flexibility of BERT is one of the main
reasons for its rapid acceptance as state-of-the-art language
model. With additional algorithm and some
modifications and fine-tuning, the model can be adjusted for
certain topics or tasks and its accuracy pushed to even higher
level. This work explores this possibility in detail, taking
long text classification as the target task and searching
for the best parameters for this type of usage. In
particular, diferent possibilities for supervised pre-training
and fine-tuning were examined on two diferent datasets.
Through detailed experimentation, we were able to
identify the most optimal procedures that enable BERT to
be more accurate with our particular downstream task.
While the value of the proposed training and tuning
actions was confirmed only for text classification, it stands
to reason that analogous procedures could prove to be
useful for other linguistic tasks as well. Finally, we want
to denote that we did not explore all hyperparameters
which can be a future work to have along with trying
another language models such as Roberta and Electra.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dehghani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Abnar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Bahri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Pham</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Rao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ruder</surname>
          </string-name>
          ,
          <string-name>
            <surname>D.</surname>
          </string-name>
          <article-title>Metzler, Long range arena: A benchmark for eficient transformers</article-title>
          , arXiv preprint arXiv:
          <year>2011</year>
          .
          <volume>04006</volume>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          , M.-
          <string-name>
            <given-names>W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          , Bert:
          <article-title>Pre-training of deep bidirectional transformers for language understanding</article-title>
          , arXiv preprint arXiv:
          <year>1810</year>
          .
          <volume>04805</volume>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Ding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tang</surname>
          </string-name>
          , Cogltx:
          <article-title>Applying bert to long texts</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>33</volume>
          (
          <year>2020</year>
          )
          <fpage>12792</fpage>
          -
          <lpage>12804</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>V.</given-names>
            <surname>Wagh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Khandve</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Joshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Wani</surname>
          </string-name>
          , G. Kale,
          <string-name>
            <given-names>R.</given-names>
            <surname>Joshi</surname>
          </string-name>
          ,
          <article-title>Comparative study of long document classification</article-title>
          ,
          <source>in: TENCON 2021-2021 IEEE Region 10 Conference (TENCON)</source>
          , IEEE,
          <year>2021</year>
          , pp.
          <fpage>732</fpage>
          -
          <lpage>737</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Ng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Nallapati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Xiang</surname>
          </string-name>
          ,
          <article-title>Multipassage bert: A globally normalized bert model for open-domain question answering</article-title>
          , arXiv preprint arXiv:
          <year>1908</year>
          .
          <volume>08167</volume>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>R.</given-names>
            <surname>Pappagari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Zelasko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Villalba</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Carmiel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Dehak</surname>
          </string-name>
          ,
          <article-title>Hierarchical transformers for long document classification</article-title>
          ,
          <source>in: 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)</source>
          , IEEE,
          <year>2019</year>
          , pp.
          <fpage>838</fpage>
          -
          <lpage>844</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S.</given-names>
            <surname>Sukhbaatar</surname>
          </string-name>
          , E. Grave,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bojanowski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joulin</surname>
          </string-name>
          ,
          <article-title>Adaptive attention span in transformers</article-title>
          , arXiv preprint arXiv:
          <year>1905</year>
          .
          <volume>07799</volume>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dehghani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Bahri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Metzler</surname>
          </string-name>
          ,
          <article-title>Eficient transformers: A survey</article-title>
          ,
          <source>ACM Computing Surveys (CSUR)</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Rae</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Potapenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Jayakumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. P.</given-names>
            <surname>Lillicrap</surname>
          </string-name>
          ,
          <article-title>Compressive transformers for long-range sequence modelling</article-title>
          , arXiv preprint arXiv:
          <year>1911</year>
          .
          <volume>05507</volume>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          , J. Carbonell,
          <string-name>
            <given-names>Q. V.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Salakhutdinov</surname>
          </string-name>
          ,
          <article-title>Transformer-xl: Attentive language models beyond a fixed-length context</article-title>
          , arXiv preprint arXiv:
          <year>1901</year>
          .
          <volume>02860</volume>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>I.</given-names>
            <surname>Beltagy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. E.</given-names>
            <surname>Peters</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Cohan</surname>
          </string-name>
          ,
          <article-title>Longformer: The long-document transformer</article-title>
          , arXiv preprint arXiv:
          <year>2004</year>
          .
          <volume>05150</volume>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>M.</given-names>
            <surname>Abbas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Smaili</surname>
          </string-name>
          ,
          <article-title>Comparison of topic identification methods for arabic language</article-title>
          ,
          <source>in: Proceedings of International Conference on Recent Advances in Natural Language Processing</source>
          , RANLP,
          <year>2005</year>
          , pp.
          <fpage>14</fpage>
          -
          <lpage>17</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>X.</given-names>
            <surname>Dai</surname>
          </string-name>
          , I. Chalkidis,
          <string-name>
            <given-names>S.</given-names>
            <surname>Darkner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Elliott</surname>
          </string-name>
          ,
          <article-title>Revisiting transformer-based models for long document classification</article-title>
          ,
          <source>arXiv preprint arXiv:2204.06683</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>A.</given-names>
            <surname>Adhikari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ram</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <article-title>Docbert: Bert for document classification</article-title>
          , arXiv preprint arXiv:
          <year>1904</year>
          .
          <volume>08398</volume>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>C.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Qiu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <article-title>How to finetune bert for text classification?</article-title>
          ,
          <source>in: China national conference on Chinese computational linguistics</source>
          , Springer,
          <year>2019</year>
          , pp.
          <fpage>194</fpage>
          -
          <lpage>206</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>W.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Tao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <article-title>Hierarchical self-attention hybrid sparse networks for document classification</article-title>
          ,
          <source>Mathematical Problems in Engineering</source>
          <year>2021</year>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Si</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Roberts</surname>
          </string-name>
          ,
          <article-title>Hierarchical transformer networks for longitudinal clinical document classification</article-title>
          ,
          <source>arXiv preprint arXiv:2104.08444</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>H. H.</given-names>
            <surname>Park</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Vyas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Shah</surname>
          </string-name>
          ,
          <article-title>Eficient classification of long documents using transformers</article-title>
          ,
          <source>arXiv preprint arXiv:2203.11258</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Duan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>Learning dynamic hierarchical topic graph with graph convolutional network for document classification</article-title>
          ,
          <source>in: International Conference on Artificial Intelligence and Statistics</source>
          , PMLR,
          <year>2020</year>
          , pp.
          <fpage>3959</fpage>
          -
          <lpage>3969</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>J.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          , L. Liu,
          <string-name>
            <given-names>J.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <surname>H.</surname>
          </string-name>
          <article-title>Wu, Long document classification from local word glimpses via recurrent attention learning</article-title>
          ,
          <source>IEEE Access 7</source>
          (
          <year>2019</year>
          )
          <fpage>40707</fpage>
          -
          <lpage>40718</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>A.</given-names>
            <surname>Chouigui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O. Ben</given-names>
            <surname>Khiroun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Elayeb</surname>
          </string-name>
          ,
          <article-title>An arabic multi-source news corpus: experimenting on single-document extractive summarization</article-title>
          ,
          <source>Arabian Journal for Science and Engineering</source>
          <volume>46</volume>
          (
          <year>2021</year>
          )
          <fpage>3925</fpage>
          -
          <lpage>3938</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>M.</given-names>
            <surname>Abbas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Berkani</surname>
          </string-name>
          ,
          <article-title>Topic identification by statistical methods for arabic language</article-title>
          .,
          <source>WSEAS Transactions on Computers</source>
          <volume>5</volume>
          (
          <year>2006</year>
          )
          <fpage>1908</fpage>
          -
          <lpage>1913</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>W.</given-names>
            <surname>Antoun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Baly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hajj</surname>
          </string-name>
          , Arabert:
          <article-title>Transformerbased model for arabic language understanding</article-title>
          , arXiv preprint arXiv:
          <year>2003</year>
          .
          <volume>00104</volume>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>J.</given-names>
            <surname>Carbonell</surname>
          </string-name>
          , J. Goldstein,
          <article-title>The use of mmr, diversitybased reranking for reordering documents and producing summaries</article-title>
          ,
          <source>in: Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval</source>
          ,
          <year>1998</year>
          , pp.
          <fpage>335</fpage>
          -
          <lpage>336</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>