<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">BERT-based Models for Arabic Long Document Classification</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Muhammad</forename><surname>Al-Qurishi</surname></persName>
							<email>mualqurishi@elm.sa</email>
							<affiliation key="aff0">
								<orgName type="department">Research Department</orgName>
								<orgName type="institution">Elm Company</orgName>
								<address>
									<postCode>12382</postCode>
									<settlement>Riyadh</settlement>
									<country key="SA">Saudi Arabia</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Riad</forename><surname>Souissi</surname></persName>
							<email>rsouissi@elm.sa</email>
							<affiliation key="aff0">
								<orgName type="department">Research Department</orgName>
								<orgName type="institution">Elm Company</orgName>
								<address>
									<postCode>12382</postCode>
									<settlement>Riyadh</settlement>
									<country key="SA">Saudi Arabia</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">BERT-based Models for Arabic Long Document Classification</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">402A62BB7A6F44333892B55E6D87FA33</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T16:26+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Arabic Text Processing</term>
					<term>Long Document Classification</term>
					<term>BERT-based Models</term>
					<term>Sentence Segmentation</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Given the number of Arabic speakers worldwide and the large volume of web content in fields such as law, medicine, and news, documents of considerable length are produced regularly. Classifying such documents with traditional learning models is often impractical, since the extended length of the documents raises computational requirements to an unsustainable level. It is therefore necessary to customize these models specifically for long textual documents. In this paper we propose two simple but effective models for classifying long Arabic documents. We also fine-tune two existing models, namely Longformer and RoBERT, for the same task and compare their results to ours. Both of our models outperform Longformer and RoBERT on this task over two different datasets.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>A large portion of the textual content that requires automated processing comes in the form of long documents. In some domains, such as law or medicine, long documents are the standard. This severely restricts the practical use of the most advanced Transformer models for text classification and other linguistic tasks <ref type="bibr" target="#b0">[1]</ref>. For example, models such as BERT <ref type="bibr" target="#b1">[2]</ref> have significantly improved the accuracy of automated NLP tasks, but their usefulness is limited to relatively short text sequences <ref type="bibr" target="#b2">[3]</ref> because the cost of self-attention grows quadratically with sequence length.</p><p>Modifying BERT to disassociate sequence length from computational complexity would remove this obstacle and bring immediate benefits to numerous fields such as education, science, and business <ref type="bibr" target="#b3">[4]</ref>. Innovative approaches that leverage the greatest advantages of Transformers while offsetting their major shortcomings are needed at this stage of development, as they could lead to the full maturation of a concept that has already proven impressively successful on semantic tasks.</p><p>There have been numerous attempts to improve the performance and efficiency of BERT on long documents, using a wide variety of approaches. Some of the proposed solutions are based on the sliding-window paradigm <ref type="bibr" target="#b4">[5,</ref><ref type="bibr" target="#b5">6]</ref>. The downside of this class of solutions is their inability to track long-range dependencies in the text, which weakens their analytic insights. Another group of works aims to simplify the Transformer architecture and decrease complexity as a result <ref type="bibr" target="#b6">[7,</ref><ref type="bibr" target="#b7">8,</ref><ref type="bibr" target="#b8">9]</ref>. 
So far, none of these attempts has matched the level of performance that BERT achieves on short text. Reusing previously completed computation is another strategy for adapting Transformers to longer text, with <ref type="bibr" target="#b9">[10]</ref> as a prominent example. The Longformer model proposed by <ref type="bibr" target="#b10">[11]</ref> may be the most promising solution for using Transformers with long text; it combines local and global attention to improve efficiency. The issue remains open, and new proposals for the best method of long-document processing are still being made on a regular basis.</p><p>In this paper we present two BERT-based language models and fine-tune two others for Arabic long document classification. The first model consists of several layers: a sentence segmentation layer, a BERT layer, a linear classification layer, a sentence grouping layer that regroups sentences by document, and finally a softmax layer. In this model, we segment the document into meaningful sentences and then feed these sentences into the BERT model along with their document ID. The second model starts from the same idea of dividing the document into sentences, but we hypothesize that the majority of semantically important information is concentrated in specific sentences within a longer text, making it unnecessary to check for connections between all words in a document. Instead, we use a BERT-based similarity-matching algorithm that recognizes high-relevance sentences and passes them as input to a BERT-base model that completes the desired classification task. Both models are based on the BERT architecture and require supervised training for best performance. 
Input text is divided into sentences that do not exceed the maximum length that BERT can accurately process (512 tokens).</p><p>In addition, we have fine-tuned two well-known language models for the long document classification task: the Longformer <ref type="bibr" target="#b10">[11]</ref> and Recurrence over BERT (RoBERT) <ref type="bibr" target="#b5">[6]</ref> (https://github.com/helmy-elrais/RoBERT_Recurrence_over_BERT). Before the fine-tuning process, these two models were modified to be suitable for the Arabic language. We compared the proposed models against Longformer and RoBERT using two different Arabic datasets. The first dataset was collected from the Mawdoo3 website (www.mawdoo3.com) and the second dataset comes from previous related work <ref type="bibr" target="#b11">[12]</ref>. The results show that the first model, which classifies sentences and then aggregates them, is the best among all models on the news data, with a macro F1-score of 98%. On the second, 22-class Mawdoo3 dataset, this model achieved results comparable to the Longformer.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related Works</head><p>Most of the recent works addressing the problem of long document classification start from similar principles common to all deep learning methods, but they diverge in many aspects as their authors explore different avenues for leveraging the power of learning algorithms and overcoming the most significant obstacles <ref type="bibr" target="#b12">[13]</ref>. Since the authors are essentially attempting to solve the same problem, namely how to maintain highly accurate semantic predictions while keeping the computing demands reasonable, it is fair to describe these papers as belonging to the same family despite the considerable differences in approach.</p><p>In terms of methodological choices, practically all works from this group acknowledge the unmatched power of the attention mechanism for analyzing semantic relationships and incorporate it in some way into the proposed architecture. On one side are works that mostly (or completely) embrace an existing architecture and perform only minor operations such as fine-tuning or knowledge transfer in order to reduce the computational demands <ref type="bibr" target="#b13">[14,</ref><ref type="bibr" target="#b14">15]</ref>. At the other end of the spectrum are works that propose innovative hybrid solutions in which the attention mechanism and/or Transformer architecture is combined with elements of different deep learning paradigms, such as RNNs and CNNs. 
In particular, a common strategy is to adopt a hierarchical structure for the overall solution and use the attention mechanism only in a limited role, thus avoiding the rapid growth of complexity <ref type="bibr" target="#b15">[16,</ref><ref type="bibr" target="#b16">17]</ref>.</p><p>The aforementioned methodological differences stem largely from the expectations for each paper, which range from proving a theoretical point to attempting to develop a specialized model for long document classification. Works with a narrower scope tend to stay closer to the original BERT model design <ref type="bibr" target="#b10">[11]</ref>, while more ambitious efforts that aim to create new tools are more inclined to experiment with previously untested combinations of elements. In some papers, the scope of intended applications is limited to long documents from a certain domain (e.g., medical) <ref type="bibr" target="#b16">[17]</ref>, while others approach the problem in more general terms. Finally, there is an important distinction between works that aim for greater accuracy and those that primarily attempt to improve computational efficiency and shorten inference time <ref type="bibr" target="#b17">[18]</ref>.</p><p>It is a fair assessment that practically all works from this group are grappling with the same problem: the tendency of attention-based models to become prohibitively complex as the length of the analyzed text increases. In response, the authors have tried a variety of ideas that rely on vastly different mechanisms to decrease complexity. 
From fine-tuning and knowledge distillation to the introduction of hierarchical architectures and restrictive elements such as a fixed-length sliding window <ref type="bibr" target="#b10">[11,</ref><ref type="bibr" target="#b18">19]</ref>, the proposed techniques are quite innovative and typically leverage known properties of deep learning models to affect how the attention mechanism performs in a particular deployment. The diversity of ideas found in these papers illustrates that researchers are currently casting a wide net and searching for unconventional answers to a difficult problem, without a single dominant strategy. Hybrid approaches in particular hold a lot of promise, combining proven elements from different methodologies into new, potentially more optimal configurations <ref type="bibr" target="#b15">[16,</ref><ref type="bibr" target="#b19">20]</ref>.</p><p>Evaluation of the proposed changes to established algorithms is crucially important, and all of the reviewed works include some form of empirical confirmation of their premises. While the numbers seemingly validate that the proposed solutions achieve state-of-the-art results under the best possible conditions, those findings are self-reported and may often be too optimistic. All of the papers are interested in document classification tasks and use them to evaluate their solutions, but the datasets used for testing may not be the same in terms of size, diversity, and content. When directly comparing different solutions, it is extremely important to keep in mind the particulars of the evaluation protocols. Studies providing independently administered comparative testing of several different BERT-like algorithms for document classification are slowly emerging and reporting interesting findings that often diverge from the self-reported results.</p><p>When it comes to practical use of the proposed solutions, there is a general lack of field data, and even discussions of use cases are rare. 
This is understandable considering that the main focus is on discovering more efficient methods, but without real-world testing it is difficult to predict whether any of the solutions can deliver results similar to their reported findings. Some works may be directed at specific niches such as the legal or medical domains, but even in these cases little attention is paid to the practicalities of real-world application. This weakness may reflect the current state of the field, which is highly experimental and mostly built on data collected in a controlled environment.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Data</head><p>The experimental parts of our study are conducted on two different datasets, chosen to match our research domain of long Arabic documents. The datasets are vastly different in terms of size and diversity of classes.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Mawdoo3 Dataset</head><p>The first dataset was scraped from Mawdoo3, the largest Arabic content website <ref type="foot" target="#foot_0">3</ref>. It covers 22 classes, and each category contains between 700 and 12K articles. We selected almost one thousand long articles from each category, as presented in Figure <ref type="figure" target="#fig_1">2</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Arabic News Dataset</head><p>The second dataset consists of news articles downloaded from different sources <ref type="bibr" target="#b20">[21,</ref><ref type="bibr" target="#b11">12,</ref><ref type="bibr" target="#b21">22]</ref>. These sources share almost the same 8 categories, so we merged them together; the resulting dataset is described in Figure <ref type="figure" target="#fig_2">3</ref>. We selected almost four thousand long articles from each class.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Models</head><p>In this section we introduce two BERT-based language models. Both models are based on the BERT architecture and require supervised training for best performance. Input text is divided into sentences that do not exceed the maximum length that BERT can accurately process (512 tokens). We also fine-tune two other models for Arabic long document classification. The following sections explain this in detail.</p></div>
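The segmentation constraint just described (sentences packed into inputs of at most 512 tokens, without breaking any sentence) can be sketched as follows. This is an illustrative sketch, not the paper's implementation: whitespace tokens stand in for BERT wordpieces, and the sentence delimiters (including the Arabic question mark) are assumptions.

```python
import re

def split_into_segments(text, max_tokens=512):
    """Split text into sentence-based segments that each stay within a
    token budget (whitespace tokens as a rough stand-in for BERT tokens).
    Sentences are never broken mid-way; consecutive short sentences are
    packed together into one segment."""
    # Illustrative delimiters: Latin sentence enders plus the Arabic
    # question mark (U+061F).
    sentences = [s.strip()
                 for s in re.split(r"(?<=[.!?\u061f])\s+", text)
                 if s.strip()]
    segments, current, current_len = [], [], 0
    for sent in sentences:
        n = len(sent.split())
        if current and current_len + n > max_tokens:
            segments.append(" ".join(current))
            current, current_len = [], 0
        current.append(sent)
        current_len += n
    if current:
        segments.append(" ".join(current))
    return segments
```

A single sentence longer than the budget is kept whole as its own segment, matching the rule that a sentence must not be broken.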
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">BERT-based Sentence Aggregation</head><p>We propose a simple but effective model for the long document classification task. The proposed model consists of multiple layers, as shown in Figure <ref type="figure" target="#fig_0">1</ref>: a sentence segmentation layer, a BERT layer, a linear classification layer, a sentence grouping layer that regroups sentences by document, and finally a softmax layer. The first layer segments the long text into sentences, taking the structure of Arabic sentences into account so that no sentence loses its meaning or is broken mid-way. The second layer is the BERT tokenizer followed by the embedding representation layer. Since we use the BERT base model AraBERT-V2 <ref type="bibr" target="#b22">[23]</ref>, this layer consists of 12 stacked encoder layers that receive the embedding inputs, process them, and pass the output to an MLP layer. We train the model on all the sentences, with each sentence treated as a document. The training outputs are the classification probability for each class together with the sentence ID and the original document ID. We then group the sentences with their per-class probabilities by document ID, and finally assign each document the class with the highest aggregated probability over its sentences.</p></div>
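The grouping and aggregation step can be sketched as below, assuming each sentence has already received a class-probability vector from the classifier; averaging the per-sentence probabilities per document before taking the argmax is one plausible reading of the aggregation described above, not the paper's exact code.

```python
from collections import defaultdict

def aggregate_by_document(sentence_preds):
    """sentence_preds: list of (doc_id, [p_class0, p_class1, ...]) pairs,
    one entry per classified sentence. Returns {doc_id: predicted_class}
    by averaging sentence-level class probabilities per document and
    taking the highest-scoring class."""
    totals = defaultdict(list)
    for doc_id, probs in sentence_preds:
        totals[doc_id].append(probs)
    result = {}
    for doc_id, prob_lists in totals.items():
        # Column-wise mean over all sentences of this document.
        mean = [sum(col) / len(prob_lists) for col in zip(*prob_lists)]
        result[doc_id] = max(range(len(mean)), key=mean.__getitem__)
    return result
```

For example, a document whose two sentences score [0.9, 0.1] and [0.4, 0.6] averages to [0.65, 0.35] and is assigned class 0.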
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">BERT-based Key Sentences Model</head><p>This model starts from the same idea of dividing the document into sentences, but we hypothesize that the majority of semantically important information is concentrated in specific sentences within a longer text, making it unnecessary to check for connections between all words in a document. Instead, we use a BERT-based similarity-matching algorithm that recognizes high-relevance sentences and passes them as input to the BERT model that completes the desired classification task. The high-relevance sentences were selected by applying the maximal marginal relevance (MMR) <ref type="bibr" target="#b23">[24]</ref> algorithm shown in equation 1. The length of the sentences is between 30 and 150 tokens.</p><formula xml:id="formula_0">𝑀𝑀𝑅 = argmax_{𝐷𝑖 ∈ 𝑋} [𝜆 𝑆𝑖𝑚1(𝐷𝑖, 𝑆) − (1 − 𝜆) max_{𝐷𝑗 ∈ 𝐶} 𝑆𝑖𝑚2(𝐷𝑖, 𝐷𝑗)] (1)</formula><p>Where 𝑆 is the sentence vector and 𝐷𝑖 is the document vector related to 𝑆. 𝑋 is the subset of documents we selected from our dataset, 𝐶 is the set of items already selected, and 𝜆 is a constant in the range [0, 1] that controls the diversification of results. 𝑆𝑖𝑚1 and 𝑆𝑖𝑚2 are similarity functions, which can be cosine, Euclidean, Jaccard, or any other distance-based similarity measure. In our model we used the cosine-based angular similarity given in equation 2 (following the XCS224 lecture notes by Prof. Potts).</p><formula xml:id="formula_1">cos⁻¹( (∑_{𝑖=1}^{𝑛} 𝑢𝑖 × 𝑣𝑖) / (‖𝑢‖₂ × ‖𝑣‖₂) ) / 𝜋 (2)</formula></div>
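A minimal sketch of the greedy MMR selection in equation 1, using cosine similarity for both Sim1 and Sim2; the function names, the choice of k, and the loop structure are illustrative assumptions, not the paper's exact code.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def mmr_select(doc_vec, sentence_vecs, k, lam=0.5):
    """Greedy MMR: repeatedly pick the sentence most similar to the
    document vector while penalizing similarity to sentences already
    selected. Returns the indices of the k selected sentences in order."""
    selected = []
    candidates = list(range(len(sentence_vecs)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(sentence_vecs[i], doc_vec)
            redundancy = max((cosine(sentence_vecs[i], sentence_vecs[j])
                              for j in selected), default=0.0)
            return lam * relevance - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected
```

In the example below, the second pick skips a near-duplicate of the first sentence despite its high relevance, which is exactly the redundancy penalty of the second term in equation 1.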
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.">Fine-tuned Models</head><p>In this part, we reproduced and fine-tuned two important research works from the literature on processing long texts, which are explained in <ref type="bibr" target="#b5">[6,</ref><ref type="bibr" target="#b10">11]</ref>. We trained and fine-tuned them to classify long Arabic documents; their results are reported in Tables <ref type="table" target="#tab_2">2 and 3</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4.">Longformer</head><p>The Longformer <ref type="bibr" target="#b10">[11]</ref> was proposed to reduce the complexity of the self-attention matrix. This is done by making the matrix sparser through the introduction of an attention pattern with specified locations that are prioritized. By using a sliding window of fixed length, the model avoids quadratic growth and instead scales linearly with input sequence length. Additional gains can be achieved by dilating the sliding window, which frees up some attention heads to process the overall semantic context while non-dilated heads remain focused on local tokens. However, these restrictions interfere with the model's ability to be trained for specific tasks, which required the addition of global attention to the model. Linear projections are used to calculate the attention scores, and in this work an extra set of projections for the global attention is used to make training more reliable. The resulting language model has an impressive capacity for contextual analysis, yet expends far fewer computational resources on long-form documents than traditional BERT and other Transformer architectures. Nonetheless, the Longformer was trained for autoregressive modeling over left-to-right word sequences, and training it on Arabic required some preprocessing. We converted the AraBERT-V2 base model into a Longformer and then fine-tuned the resulting model for our Arabic long document classification task.</p></div>
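The sliding-window-plus-global attention pattern can be illustrated by constructing the Boolean attention mask it induces; the window size and the choice of global positions below are illustrative, and real Longformer implementations build this pattern with banded matrix operations rather than explicit lists.

```python
def longformer_attention_mask(seq_len, window, global_positions=()):
    """Build a boolean mask: position i may attend to j when
    |i - j| <= window // 2 (local sliding window), or when either i or j
    is a designated global position. The number of allowed pairs grows
    linearly with seq_len, unlike the full n*n attention matrix."""
    half = window // 2
    mask = [[abs(i - j) <= half for j in range(seq_len)]
            for i in range(seq_len)]
    for g in global_positions:
        for j in range(seq_len):
            mask[g][j] = True   # the global token attends everywhere
            mask[j][g] = True   # every token attends to the global token
    return mask
```

Making position 0 global (as done for the classification token) lets it see the whole sequence while all other positions keep their narrow local window.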
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.5.">RoBERT</head><p>In this model the authors <ref type="bibr" target="#b5">[6]</ref> look into possible ways to extend the usefulness of the BERT language model to text samples longer than a few hundred words.</p><p>To do this, they extend the fine-tuning procedure and separate the input into smaller chunks. After those chunks are processed by the base BERT model, they are passed through another Transformer or a single recurrent layer before a classification decision is made in the softmax layer. These variations were named RoBERT (Recurrence over BERT) and ToBERT (Transformer over BERT), collectively described as Hierarchical Transformers because they maintain a hierarchical structure of representations at the level of both the extracted segments and the whole document. These models were found to converge very quickly when trained on a narrowly focused dataset and to outperform the original BERT on long text sequences. The suitability of these derivative models was examined for different tasks, including topic identification and customer-call satisfaction prediction, which are possible real-world applications. Unlike the Longformer, RoBERT was straightforward to fine-tune because it is a BERT-based model.</p></div>
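The chunking that RoBERT relies on can be sketched as follows; the chunk size and stride values are illustrative assumptions, and in the full model each chunk would be encoded by BERT before its pooled representation is passed, in order, to the recurrent layer.

```python
def chunk_tokens(tokens, chunk_size=200, stride=150):
    """Split a token list into fixed-size overlapping chunks: each chunk
    starts `stride` tokens after the previous one, so consecutive chunks
    share chunk_size - stride tokens of context."""
    if stride <= 0 or stride > chunk_size:
        raise ValueError("need 0 < stride <= chunk_size")
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last chunk already covers the tail
        start += stride
    return chunks
```

The overlap preserves some context across chunk boundaries, which the recurrent layer then stitches into a document-level representation.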
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Experimental approach</head><p>In our work we aim to strike a balance between model accuracy on the classification task performed over long text sequences and computational simplicity. We therefore used the base version of BERT, which has a smaller memory footprint (about 500 MB) and a faster prediction process, with an embedding dimension of 768. We used Google Colab Pro to train and fine-tune our models. In terms of accuracy, we use standard metrics to track these qualities for the tested models. The macro F1 score is used as the general measure of prediction accuracy in all comparisons, as it provides a basis for comparing results between studies.</p><p>Several hyperparameters were set up to fine-tune the experimented models. Our proposed classification solutions were tested using the two collections of documents mentioned in Sec. 3, where 80% of each dataset was used to train the model, 10% as a validation set, and 10% for testing. Table <ref type="table" target="#tab_0">1</ref> shows the general parameters used in the training and fine-tuning processes.</p></div>
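The macro F1 score used in all comparisons is the unweighted mean of per-class F1 scores, so every class counts equally regardless of its frequency; a minimal sketch (standard libraries such as scikit-learn provide the same computation):

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes present in the
    gold labels, so rare classes count as much as frequent ones."""
    classes = sorted(set(y_true))
    f1s = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
    return sum(f1s) / len(f1s)
```

Because each class contributes equally, a model that collapses onto the majority class is penalized heavily, which explains RoBERT's low macro F1 on the 22-class Mawdoo3 data despite a higher raw accuracy.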
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Results Discussion and Analysis</head><p>The evaluation was conducted using the standardized hyperparameters shown in Table <ref type="table" target="#tab_0">1</ref>, such as batch size and sequence length, with two different datasets suitable for the Arabic long document classification task as described in Sec 3. We report the results and analyze them for each dataset separately.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.1.">Mawdoo3 Dataset</head><p>All models were empirically evaluated on the long document classification task. We compared our proposed models with the Longformer as well as with RoBERT on the Mawdoo3 dataset. The results were very close between the two proposed solutions and the Longformer, with a slight edge for the model based on extracting key sentences with MMR, which reached a macro F1 score of 83%. RoBERT performed very poorly on the Mawdoo3 dataset, with a macro F1 score of 21%.</p><p>The overall results of all models on the long document classification task are shown in Table <ref type="table" target="#tab_1">2</ref>. These results support our hypothesis that identifying the most relevant parts of the text is sufficient. The resulting solution retains the ability to capture relationships between distant tokens, but does not have to back-propagate through all of them, focusing only on key sentences. Because of this, the model avoids the quadratic growth of complexity and remains efficient on much longer texts than the original BERT can handle. It is worth noting that we pre-processed the articles in this dataset and removed the information at the beginning of each one, because those parts contain easily identifiable indicators of the class.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.2.">Arabic News Dataset</head><p>The experiment results were completely different on the Arabic news dataset. All models performed very well, and in this experiment the first model outperformed the rest with a macro F1 score of 98.4%, which shows that the proposed modifications can have a positive impact on performance, although the effect depends on the dataset used. Classifying each sentence separately proved better than classifying the whole sequence, and can even increase performance when working with short sentences. Both the Longformer and our second model with MMR still perform very well, with macro F1 scores of 96% and 96.2%, respectively, whereas the RoBERT model reaches a macro F1 score of 74.4%. The overall results of all models on the long document classification task are given in Table <ref type="table" target="#tab_2">3</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.">Conclusion</head><p>The unmatched flexibility of BERT is one of the main reasons for its rapid acceptance as a state-of-the-art language model. With additional algorithms, some modifications, and fine-tuning, the model can be adjusted to certain topics or tasks and its accuracy pushed to an even higher level. This work explores this possibility in detail, taking long text classification as the target task and searching for the best parameters for this type of usage. In particular, different possibilities for supervised pre-training and fine-tuning were examined on two different datasets. Through detailed experimentation, we identified the procedures that enable BERT to be more accurate on our particular downstream task. While the value of the proposed training and tuning actions was confirmed only for text classification, it stands to reason that analogous procedures could prove useful for other linguistic tasks as well. Finally, we note that we did not explore all hyperparameters, which can be future work, along with trying other language models such as RoBERTa and ELECTRA.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Proposed Model Architecture for Long Document Classification</figDesc><graphic coords="3,162.21,84.19,270.85,349.43" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Mawdoo3 dataset, containing 22 classes. We selected almost 1,000 articles from each class.</figDesc><graphic coords="4,193.47,84.19,208.36,248.22" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: Arabic news dataset. We chose almost 4,000 articles from each category.</figDesc><graphic coords="4,193.47,367.49,208.36,109.20" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1</head><label>1</label><figDesc>Hyperparameters used in the training and fine-tuning processes</figDesc><table><row><cell>Parameter Name</cell><cell>Value</cell></row><row><cell>number of epochs</cell><cell>5</cell></row><row><cell>maximum sequence length (aggregation and similarity models)</cell><cell>128</cell></row><row><cell>maximum sequence length (Longformer model)</cell><cell>1024</cell></row><row><cell>maximum sequence length (truncation, RoBERT model)</cell><cell>1024</cell></row><row><cell>Mawdoo3 data: number of training steps</cell><cell>107466</cell></row><row><cell>adam epsilon</cell><cell>1e-8</cell></row><row><cell>train batch size</cell><cell>64</cell></row><row><cell>valid batch size</cell><cell>128</cell></row><row><cell>epochs</cell><cell>20</cell></row><row><cell>learning rate</cell><cell>5e-5</cell></row><row><cell>warmup ratio</cell><cell>0.1</cell></row><row><cell>max grad norm</cell><cell>1.0</cell></row><row><cell>accumulation steps</cell><cell>1</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2</head><label>2</label><figDesc>Overall results of all models in the long text classification task on Mawdoo3 Dataset</figDesc><table><row><cell>Model</cell><cell cols="4">Macro F1 Macro Precision Macro Recall Accuracy</cell></row><row><cell cols="2">Our Model Aggregating 0.82187</cell><cell>0.82338</cell><cell>0.83049</cell><cell>0.83083</cell></row><row><cell>Our Model-MMR</cell><cell>0.82732</cell><cell>0.83162</cell><cell>0.83555</cell><cell>0.83522</cell></row><row><cell>Longformer</cell><cell>0.82347</cell><cell>0.82497</cell><cell>0.83263</cell><cell>0.83291</cell></row><row><cell>RoBERT</cell><cell>0.21157</cell><cell>0.19309</cell><cell>0.29507</cell><cell>0.36461</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 3</head><label>3</label><figDesc>Overall results of all models in the long text classification task on Arabic News Dataset</figDesc><table><row><cell>Model</cell><cell cols="4">Macro F1 Macro Precision Macro Recall Accuracy</cell></row><row><cell cols="2">Our Model Aggregating 0.98411</cell><cell>0.98591</cell><cell>0.98264</cell><cell>0.98434</cell></row><row><cell>Our Model-MMR</cell><cell>0.96217</cell><cell>0.96240</cell><cell>0.96206</cell><cell>0.96263</cell></row><row><cell>Longformer</cell><cell>0.95908</cell><cell>0.95956</cell><cell>0.95880</cell><cell>0.95961</cell></row><row><cell>RoBERT</cell><cell>0.73062</cell><cell>0.75382</cell><cell>0.75124</cell><cell>0.75142</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_0">https://mawdoo3.com/</note>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<author>
			<persName><forename type="first">Y</forename><surname>Tay</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Dehghani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Abnar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Shen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Bahri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Pham</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Rao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ruder</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Metzler</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2011.04006</idno>
		<title level="m">Long Range Arena: A benchmark for efficient transformers</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><surname>Devlin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M.-W</forename><surname>Chang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Toutanova</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1810.04805</idno>
		<title level="m">BERT: Pre-training of deep bidirectional transformers for language understanding</title>
				<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">CogLTX: Applying BERT to long texts</title>
		<author>
			<persName><forename type="first">M</forename><surname>Ding</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Tang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in Neural Information Processing Systems</title>
		<imprint>
			<biblScope unit="volume">33</biblScope>
			<biblScope unit="page" from="12792" to="12804" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Comparative study of long document classification</title>
		<author>
			<persName><forename type="first">V</forename><surname>Wagh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Khandve</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Joshi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Wani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Kale</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Joshi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">TENCON 2021-2021 IEEE Region 10 Conference (TENCON)</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="732" to="737" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<author>
			<persName><forename type="first">Z</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Ng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Ma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Nallapati</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Xiang</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1908.08167</idno>
		<title level="m">Multi-passage BERT: A globally normalized BERT model for open-domain question answering</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Hierarchical transformers for long document classification</title>
		<author>
			<persName><forename type="first">R</forename><surname>Pappagari</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Zelasko</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Villalba</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Carmiel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Dehak</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="838" to="844" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<author>
			<persName><forename type="first">S</forename><surname>Sukhbaatar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Grave</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Bojanowski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Joulin</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1905.07799</idno>
		<title level="m">Adaptive attention span in transformers</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Efficient transformers: A survey</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Tay</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Dehghani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Bahri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Metzler</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ACM Computing Surveys (CSUR)</title>
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">W</forename><surname>Rae</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Potapenko</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">M</forename><surname>Jayakumar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">P</forename><surname>Lillicrap</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1911.05507</idno>
		<title level="m">Compressive transformers for long-range sequence modelling</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b9">
	<monogr>
		<author>
			<persName><forename type="first">Z</forename><surname>Dai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Carbonell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><forename type="middle">V</forename><surname>Le</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Salakhutdinov</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1901.02860</idno>
		<title level="m">Transformer-XL: Attentive language models beyond a fixed-length context</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b10">
	<monogr>
		<author>
			<persName><forename type="first">I</forename><surname>Beltagy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">E</forename><surname>Peters</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Cohan</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2004.05150</idno>
		<title level="m">Longformer: The long-document transformer</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Comparison of topic identification methods for Arabic language</title>
		<author>
			<persName><forename type="first">M</forename><surname>Abbas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Smaili</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of International Conference on Recent Advances in Natural Language Processing</title>
				<meeting>International Conference on Recent Advances in Natural Language Processing</meeting>
		<imprint>
			<publisher>RANLP</publisher>
			<date type="published" when="2005">2005</date>
			<biblScope unit="page" from="14" to="17" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<monogr>
		<author>
			<persName><forename type="first">X</forename><surname>Dai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Chalkidis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Darkner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Elliott</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2204.06683</idno>
		<title level="m">Revisiting transformer-based models for long document classification</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b13">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><surname>Adhikari</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Ram</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Tang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Lin</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1904.08398</idno>
		<title level="m">DocBERT: BERT for document classification</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">How to fine-tune BERT for text classification?</title>
		<author>
			<persName><forename type="first">C</forename><surname>Sun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Qiu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Huang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">China National Conference on Chinese Computational Linguistics</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="194" to="206" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Hierarchical self-attention hybrid sparse networks for document classification</title>
		<author>
			<persName><forename type="first">W</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Tao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Xiong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Yu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Mathematical Problems in Engineering</title>
		<imprint>
			<biblScope unit="volume">2021</biblScope>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<monogr>
		<author>
			<persName><forename type="first">Y</forename><surname>Si</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Roberts</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2104.08444</idno>
		<title level="m">Hierarchical transformer networks for longitudinal clinical document classification</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b17">
	<monogr>
		<author>
			<persName><forename type="first">H</forename><forename type="middle">H</forename><surname>Park</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Vyas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Shah</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2203.11258</idno>
		<title level="m">Efficient classification of long documents using transformers</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Learning dynamic hierarchical topic graph with graph convolutional network for document classification</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Duan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Chen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Artificial Intelligence and Statistics</title>
				<meeting><address><addrLine>PMLR</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="3959" to="3969" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Long document classification from local word glimpses via recurrent attention learning</title>
		<author>
			<persName><forename type="first">J</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Feng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Wu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Access</title>
		<imprint>
			<biblScope unit="volume">7</biblScope>
			<biblScope unit="page" from="40707" to="40718" />
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">An Arabic multi-source news corpus: experimenting on single-document extractive summarization</title>
		<author>
			<persName><forename type="first">A</forename><surname>Chouigui</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Ben Khiroun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Elayeb</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Arabian Journal for Science and Engineering</title>
		<imprint>
			<biblScope unit="volume">46</biblScope>
			<biblScope unit="page" from="3925" to="3938" />
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">Topic identification by statistical methods for Arabic language</title>
		<author>
			<persName><forename type="first">M</forename><surname>Abbas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Berkani</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">WSEAS Transactions on Computers</title>
		<imprint>
			<biblScope unit="volume">5</biblScope>
			<biblScope unit="page" from="1908" to="1913" />
			<date type="published" when="2006">2006</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<monogr>
		<author>
			<persName><forename type="first">W</forename><surname>Antoun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Baly</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Hajj</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2003.00104</idno>
		<title level="m">AraBERT: Transformer-based model for Arabic language understanding</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b23">
	<analytic>
		<title level="a" type="main">The use of MMR, diversity-based reranking for reordering documents and producing summaries</title>
		<author>
			<persName><forename type="first">J</forename><surname>Carbonell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Goldstein</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval</title>
				<meeting>the 21st annual international ACM SIGIR conference on Research and development in information retrieval</meeting>
		<imprint>
			<date type="published" when="1998">1998</date>
			<biblScope unit="page" from="335" to="336" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
