<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">A NLP and Rule-Based Approach to Extract Spatial Entities and Relationships in Arabic Text</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Atmane</forename><surname>Hadji</surname></persName>
							<email>a.hadji@centre-univ-mila.dz</email>
							<affiliation key="aff0">
								<orgName type="department" key="dep1">Faculty of exact sciences</orgName>
								<orgName type="department" key="dep2">Department of Computer Science</orgName>
								<orgName type="institution">University of Bejaia</orgName>
								<address>
									<postCode>06000</postCode>
									<settlement>Bejaia</settlement>
									<country key="DZ">Algeria</country>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="department">Computer Science Department</orgName>
								<orgName type="laboratory">LISI Laboratory</orgName>
								<orgName type="institution">University Center A. Boussouf Mila</orgName>
								<address>
									<postCode>43000</postCode>
									<settlement>Mila</settlement>
									<country key="DZ">Algeria</country>
								</address>
							</affiliation>
							<affiliation key="aff4">
								<address>
									<country key="DZ">Algeria</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Mohamed</forename><forename type="middle">Khireddine</forename><surname>Kholladi</surname></persName>
							<email>kholladi@univ-eloued.dz</email>
							<affiliation key="aff2">
								<orgName type="department" key="dep1">Department of Mathematics and Computer Science</orgName>
								<orgName type="department" key="dep2">Faculty of Exact Sciences</orgName>
								<orgName type="institution">University of El-Oued</orgName>
								<address>
									<addrLine>HAMMA Lakhdar</addrLine>
									<settlement>El-Oued</settlement>
									<country key="DZ">Algeria</country>
								</address>
							</affiliation>
							<affiliation key="aff3">
								<orgName type="laboratory">MISC Laboratory of Abdelhamid</orgName>
								<orgName type="institution">Mehri university of Constantine</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Farid</forename><surname>Boumaza</surname></persName>
							<affiliation key="aff5">
								<orgName type="department">Computer Science Department</orgName>
								<orgName type="institution">University of Mohamed El Bachir El Ibrahimi</orgName>
								<address>
									<addrLine>Bordj Bou Arreridj</addrLine>
									<postCode>34030</postCode>
									<country key="DZ">Algeria</country>
								</address>
							</affiliation>
							<affiliation key="aff6">
								<orgName type="laboratory">LAPECI Laboratory</orgName>
								<orgName type="institution">University of Oran1</orgName>
								<address>
									<postCode>31000</postCode>
									<settlement>Oran</settlement>
									<country key="DZ">Algeria</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">A NLP and Rule-Based Approach to Extract Spatial Entities and Relationships in Arabic Text</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">D8B4670225653289C7617A69599F18BB</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T17:11+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Information extraction</term>
					<term>spatial information</term>
					<term>Arabic NLP</term>
					<term>JAPE rules</term>
					<term>0000-0001-6706-6360 (A. Hadji)</term>
					<term>0000-0002-3051-1317 (M. K. Kholladi)</term>
					<term>0000-0002-9785-420X (F. Boumaza)</term>
				</keywords>
			</textClass>
			<abstract/>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>In recent years, the massive growth of digital information, particularly on the Internet and within Big Data, has highlighted the need to develop efficient information processing systems. This exponential data increase, especially in georeferenced information, has amplified the challenge of information overload, making quick and accurate access to relevant data increasingly crucial, especially in specialized fields like geographic information systems (GIS). In this context, spatial information extraction from raw texts has become a vital area of research, encompassing disciplines such as natural language processing (NLP), information extraction (IE), information retrieval (IR), and GIS <ref type="bibr" target="#b0">[1]</ref>.</p><p>Spatial information extraction, especially in the Arabic language, offers significant advantages across various sectors. It enriches geospatial databases, enhances the accuracy of geographic information systems <ref type="bibr" target="#b1">[2]</ref>, and optimizes location-based services (LBS) <ref type="bibr" target="#b2">[3]</ref>. It also supports decision-making in critical fields such as urban planning, natural resource management, and disaster response. The extraction process involves transforming unstructured textual data into structured information, thereby identifying geospatial entities, relationships, semantic roles, and events for deeper analysis.</p><p>However, despite these potential benefits, spatial information extraction from Arabic texts remains a major challenge. Due to the morphological richness of the language and its semantic ambiguities, traditional information extraction methods, whether based on statistical techniques or machine learning, often fall short in addressing these challenges. 
The Arabic language presents linguistic and grammatical complexities that complicate the identification of georeferenced information, making integration into GIS even more challenging. This underscores the importance of developing advanced techniques that can effectively handle these linguistic specifics and overcome the limitations of traditional approaches.</p><p>Our approach leverages the complementary strengths of NLP techniques and JAPE rules to handle the linguistic intricacies of Arabic while addressing the growing needs of GIS users in the Arab world.</p><p>In this work, we address the challenges posed by spatial information extraction in Arabic-language texts, an underexplored field in GIS contexts. To this end, we developed solutions based on NLP techniques and JAPE rules. The objective is to overcome the limitations of traditional approaches in order to structure geospatial knowledge and facilitate the indexing and extraction of spatial entities and their relationships.</p><p>This first section introduces our approach based on JAPE rules. Section 2 provides a review of related works on information extraction systems in various domains. In Section 3, we present the proposed approach along with the system architecture, detailing the components involved. Section 4 focuses on the application and implementation of our approach. Section 5 reports the results obtained with our method, Section 6 compares them with other approaches, and Section 7 concludes.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related works</head><p>Rule-based methods have proven their effectiveness in various areas of information extraction, largely thanks to their ability to capture specific relationships by applying defined syntactic and semantic rules. <ref type="bibr" target="#b3">[4]</ref> reported a method for extracting and combining spatial and temporal information from Arabic texts that enhances search and exploration capabilities using the GATE (General Architecture for Text Engineering) architecture. <ref type="bibr" target="#b4">[5]</ref> introduced "drNER", a novel rule-based Named Entity Recognition (NER) method designed to extract dietary concepts; this approach showed significant results for the extraction of evidence-based dietary recommendations.</p><p>In the bibliographic domain, <ref type="bibr" target="#b5">[6]</ref> applied a rule-based information extraction process to bibliographic data, aiming to establish a database of relevant concepts, refine the retrieved data and automate the local retrieval process. <ref type="bibr" target="#b6">[7]</ref> developed a system combining information extraction and ontology creation to facilitate the extraction and visualization of clinical information.</p><p>Furthermore, <ref type="bibr" target="#b7">[8]</ref> addressed the challenge of automatic information structure extraction from PDF books, proposing an intelligent rule-based approach to accurately extract logical metadata from these documents, which are widely used on the semantic web. 
<ref type="bibr" target="#b8">[9]</ref> presented the VALET (Very Agile Language Extraction Toolkit) framework, a rule-based information extraction system that combines lexical, orthographic, syntactic and corpus-analytic information in a flexible syntax.</p><p><ref type="bibr" target="#b9">[10]</ref> proposed an approach integrating automatic natural language processing (ANLP) techniques, rules and gazetteers to extract spatial entities and their relationships from texts, offering a viable solution for enriching GIS with accurate spatial information. <ref type="bibr" target="#b10">[11]</ref> demonstrated the effectiveness of a rule-based approach for extracting spatial relationships from annotated corpora, particularly for simple directional relationships. <ref type="bibr" target="#b11">[12]</ref> also proposed a system that automatically generates extraction rules from complex Chinese literal features. <ref type="bibr" target="#b12">[13]</ref> demonstrated how cross-linguistic alignment based on specific grammatical rules can enrich Open IE datasets for under-represented languages such as Brazilian Portuguese. Finally, <ref type="bibr" target="#b13">[14]</ref> illustrated how AIS (Automatic Identification System) data from fishing vessels can be exploited to extract precise spatial information, with the aim of improving marine resource management.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Proposed JAPE rule-based method</head><p>The rule-based method is a classic and widely used approach in the field of information extraction. It relies on a set of predefined rules designed to identify and extract specific information from text or other types of data. These rules are usually expressed as patterns that correspond to specific linguistic structures in the data.</p><p>The general architecture of the proposed approach (Figure <ref type="figure" target="#fig_0">1</ref>) consists of four distinct phases. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Creation of JAPE rules</head><p>In the first phase, concepts related to Arabic spatial entities and spatial relationships are identified and collected. These concepts are then used to formulate specific JAPE rules <ref type="bibr" target="#b14">[15]</ref>, which are used to annotate and extract relevant spatial information from Arabic texts. JAPE rules are regular-expression-like patterns defined over annotations and executed by a Java engine, enabling the detection of complex patterns in text. They offer significant flexibility in natural language processing, particularly for extracting information from unstructured text. Their main strength lies in the ease of adding new rules or modifying existing ones: new words or expressions can be integrated into an existing system without disrupting the functionality of previously defined rules. This ability to quickly update the rules as the domain evolves or analysis needs change makes JAPE a particularly adaptable and efficient tool for tasks such as entity recognition and contextual information extraction.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Text processing</head><p>The second phase consists of applying natural language processing modules to prepare the raw text. This process includes steps such as normalization, tokenization and annotation of the spatial entities present in the text. These modules are crucial to ensuring that JAPE (Java Annotation Patterns Engine) rules can be applied efficiently and accurately.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Combination and Extraction</head><p>The third phase is based on the application of the JAPE rules created in the first phase. These rules are used to associate text segments with defined classes, subclasses or instances. This phase is essential for automatically extracting structured spatial information from unstructured text, taking into account the linguistic and contextual specificities of the Arabic language.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4.">Disambiguation and classification</head><p>The fourth phase focuses on disambiguation and classification of the extracted spatial entities. This step ensures that each entity and relationship is correctly interpreted in its specific context. JAPE rules are also used here to refine the results, applying disambiguation and classification criteria to improve the accuracy of the extracted data.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Application and realization</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Implementation phase</head><p>Our JAPE rule-based system architecture consists of two main phases, each playing a crucial role in the extraction of geographic information from natural language text (Figure <ref type="figure" target="#fig_1">2</ref>). The first phase uses advanced Natural Language Processing (NLP) techniques to prepare and normalize text data. This preparation includes text cleaning, sentence segmentation and initial annotation of linguistic elements, facilitating better rule application.</p><p>The second phase focuses on matching JAPE rules to extract specific information. This phase involves the definition and creation of rules, the matching of these rules with the text, disambiguation and the extraction of relevant information. Finally, post-processing is carried out to filter and structure the extracted data, making it ready for further analysis or integration into geospatial databases. Together, these phases ensure accurate and efficient information extraction, tailored to the needs of geographic analysis.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Application environment</head><p>We have chosen to use the GATE environment, a linguistic engineering framework developed by the University of Sheffield and widely adopted for teaching and research since its first release in 1996. GATE offers a suite of reusable processing resources written in Java, integrated into an information extraction system called ANNIE (A Nearly-New Information Extraction System) <ref type="bibr" target="#b15">[16]</ref>.</p><p>By default, ANNIE is configured for languages other than Arabic. To adapt this tool to our target language, we use specialized components such as the Arabic tokenizer, sentence splitter, POS tagger and Arabic morphological analyzer. To avoid interference with previous executions, we apply the "reset" option to remove all traces of previous processes. Annotations in GATE will be performed </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.">Phase One: NLP techniques</head><p>The first phase of our architecture implements NLP techniques that are essential for processing and understanding natural language text. The aim of this phase is to prepare the textual data in a way that facilitates the extraction of relevant information. We use the same corpus as the one discussed in <ref type="bibr" target="#b0">[1]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.1.">Linguistic pre-processing</head><p>Cleaning Arabic text is an essential step before applying Natural Language Processing (NLP) techniques. The main cleaning steps specific to Arabic texts are:</p><p>• Removal of diacritics (Tashkeel): Arabic texts may contain diacritics (harakat) such as Fatha, Damma and Kasra. These diacritics can be removed, as they are often unnecessary for analysis; • Removal of special characters and punctuation: As in other languages, special characters (such as !, @, #, etc.) and punctuation can be removed to simplify the text; • Character standardization: In Arabic, some characters can be written in more than one way. For example, أ, إ, and آ are often normalized to ا. Similarly, can be transformed into . • Removing superfluous spaces: Arabic texts may contain multiple spaces or spaces before or after punctuation. These spaces need to be normalized to ensure correct analysis. </p></div>
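<div xmlns="http://www.tei-c.org/ns/1.0"><p>The cleaning steps listed above can be sketched in a few lines of Python. This is only an illustration, not the components used in the study (which run inside GATE): the function name clean_arabic, the exact Unicode ranges, and the choice to normalize only the Alef variants are assumptions made for the example.</p><p><code lang="python"><![CDATA[
```python
import re
import unicodedata

# Arabic diacritics (tashkeel): fathatan through sukun, plus superscript alef.
TASHKEEL = re.compile("[\u064B-\u0652\u0670]")
# Tatweel (kashida) is a purely typographic elongation character.
TATWEEL = "\u0640"
# Assumed normalization: hamza- and madda-bearing Alef forms become bare Alef.
ALEF_VARIANTS = str.maketrans({"\u0623": "\u0627", "\u0625": "\u0627", "\u0622": "\u0627"})


def is_punct_or_symbol(ch):
    # Unicode general categories starting with P (punctuation) or S (symbol).
    return unicodedata.category(ch)[0] in ("P", "S")


def clean_arabic(text):
    """Apply the four cleaning steps in order (hypothetical helper)."""
    text = TASHKEEL.sub("", text)            # 1. remove diacritics
    text = text.replace(TATWEEL, "")
    text = "".join(" " if is_punct_or_symbol(ch) else ch for ch in text)  # 2. punctuation
    text = text.translate(ALEF_VARIANTS)     # 3. character standardization
    return re.sub(r"\s+", " ", text).strip() # 4. superfluous spaces


if __name__ == "__main__":
    # Escaped form of a diacritized two-word phrase followed by "!".
    sample = "\u0625\u0650\u0644\u064E\u0649  \u0645\u064E\u062F\u0650\u064A\u0646\u064E\u0629!"
    print(clean_arabic(sample))
```
]]></code></p><p>Because this sketch replaces punctuation with spaces, a pipeline that still needs sentence boundaries would apply that step only after sentence splitting.</p></div>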
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.2.">Application of NLP techniques</head><p>NLP components such as Document Reset, the Arabic Tokeniser, the Sentence Splitter, the POS Tagger and the Morphological Analyser are described in the following sections. In this study, we focus on the practical application of these techniques using the GATE platform <ref type="bibr" target="#b16">[17]</ref>. This exploration is intended to provide a better understanding of GATE and to serve as a practical guide to its use, particularly in the context of Arabic text. Given that online documentation is relatively limited, this section helps fill that gap by offering clear instructions for taking full advantage of GATE's features.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.3.">Sentence Splitter</head><p>Figure <ref type="figure" target="#fig_2">3</ref> shows a visualization of GATE during the Sentence Splitter step. This step splits the text into distinct sentences, improving the accuracy of syntactic and grammatical analyses.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.4.">Tokenization</head><p>Figure <ref type="figure" target="#fig_3">4</ref> shows a screenshot of the GATE platform, illustrating the tokenization process. It shows how GATE segments text into basic units (tokens) for further linguistic analysis. </p></div>
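<div xmlns="http://www.tei-c.org/ns/1.0"><p>To make the sentence splitting and tokenization steps concrete, the following Python sketch mimics what these two steps produce: a list of sentences, then tokens with character offsets. It is not GATE's Arabic tokeniser or sentence splitter; the regular expressions and the function names split_sentences and tokenize are assumptions for illustration, and the token pattern assumes diacritics have already been removed.</p><p><code lang="python"><![CDATA[
```python
import re

# Split after '.', '!', '?' or the Arabic question mark (U+061F)
# when followed by whitespace.
SENTENCE_END = re.compile(r"(?<=[.!?\u061F])\s+")
# A token is a run of word characters (letters or digits, Arabic included)
# or a single character that is neither a word character nor whitespace.
TOKEN = re.compile(r"\w+|[^\w\s]")


def split_sentences(text):
    """Return the non-empty sentences of a text (hypothetical helper)."""
    return [s for s in SENTENCE_END.split(text.strip()) if s]


def tokenize(sentence):
    """Return (start, end, string) triples, like offset-based annotations."""
    return [(m.start(), m.end(), m.group()) for m in TOKEN.finditer(sentence)]


if __name__ == "__main__":
    # Three short escaped Arabic words separated by '.' and U+061F.
    text = "\u0628\u062D\u0631. \u0648\u0627\u062F\u061F \u063A\u0627\u0628\u0629"
    for sentence in split_sentences(text):
        print(tokenize(sentence))
```
]]></code></p><p>Keeping the start and end offsets with each token is what allows later rules to attach annotations to precise spans of the original text.</p></div>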
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.5.">POS Tagging and Morphological Analyser</head><p>Part-of-speech (POS) tagging and morphological analysis (Morphological Analyser) are essential processes in natural language processing, particularly when integrated into advanced systems such as GATE (General Architecture for Text Engineering) <ref type="bibr" target="#b17">[18]</ref>. In this context, the output of these steps is exploited by JAPE (Java Annotation Patterns Engine) rules, which enable sophisticated annotation patterns to be defined for detecting specific linguistic structures within a corpus. When JAPE rules are executed, the annotations generated by POS tagging and morphological analysis enrich the corpus with detailed metadata on grammatical categories and the morphological structure of words. Although these annotations do not affect the visible display of the text, they provide information that is fundamental for subsequent linguistic analysis and accurate information extraction.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4.">Second phase: Application of JAPE rules</head><p>The second phase focuses on the application of JAPE rules for the extraction of specific spatial information. This phase follows well-defined steps that link rules to spatial information, namely the classification of spatial entities and relationships given in the following tables.</p><p>Table <ref type="table" target="#tab_0">1</ref> presents a classification of spatial entities, including natural entities, non-natural entities and entities corresponding to place names or locations.</p><p>Table <ref type="table" target="#tab_1">2</ref> presents a classification of spatial relationships, detailing the following categories: topological relationships, directional relationships, distance relationships and orientation relationships. These categories help us understand how spatial entities are positioned, oriented and related to each other in a given space. Topological relations describe relationships of contiguity or inclusion, directional relations indicate relative orientations, distance relations measure distances between entities, and orientation </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.5.">Creation of JAPE Rules</head><p>A JAPE grammar consists of a set of phases, each containing a series of pattern/action rules <ref type="bibr" target="#b14">[15]</ref>. The phases execute sequentially, forming a cascade of finite-state transducers over the annotations. The left-hand side (LHS) of a rule comprises an annotation pattern description, while the right-hand side (RHS) contains instructions for manipulating the annotations. Annotations matched on the LHS of a rule can be referenced on the RHS using labels attached to pattern elements. Below is an example of a JAPE rule (Figure <ref type="figure" target="#fig_4">5</ref>) for extracting named entities from the "Non-Natural Object" class containing specified </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.6.">Rule Matching</head><p>The defined rules (Figure <ref type="figure" target="#fig_4">5</ref>) are applied to the text or data to identify segments that match the specified patterns. For example, a rule could be designed to identify geographic entities by searching for phrases containing keywords such as "region," "city," or "country." In our method, the option control = appelt is used so that, when several rules match the same region of text, only one of them fires: the longest match wins, and ties are broken by explicit rule priority and then by the order in which the rules are defined. This makes rule application deterministic and avoids competing annotations on the same span, which improves the accuracy of the extraction.</p></div>
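<div xmlns="http://www.tei-c.org/ns/1.0"><p>GATE's documentation describes the appelt control style as firing a single rule per matched region, preferring the longest match, then the highest explicit priority, then the rule defined earliest in the grammar. The Python sketch below illustrates that selection logic only; it is not GATE code, and the Match class, its fields and the rule names are hypothetical.</p><p><code lang="python"><![CDATA[
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Match:
    rule: str      # rule name (hypothetical)
    start: int     # start offset of the matched span
    end: int       # end offset of the matched span
    priority: int  # explicit rule priority, higher wins
    order: int     # position of the rule in the grammar file


def appelt_choose(candidates):
    """Pick one match among candidates that start at the same offset.

    Ordering: longest span, then highest priority, then earliest rule.
    """
    return min(
        candidates,
        key=lambda m: (-(m.end - m.start), -m.priority, m.order),
    )


if __name__ == "__main__":
    candidates = [
        Match("City", 0, 5, priority=10, order=0),
        Match("CityRegion", 0, 12, priority=1, order=1),
        Match("Place", 0, 12, priority=5, order=2),
    ]
    # "City" is shorter, so it loses despite its higher priority.
    print(appelt_choose(candidates).rule)
```
]]></code></p><p>In the example, the two longest candidates cover the same span, so the one with the higher priority is selected.</p></div>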
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.7.">Information Extraction</head><p>When matching segments are identified, relevant information is extracted (Figure <ref type="figure" target="#fig_5">6</ref>). This may include capturing specific words or phrases or identifying relationships between different entities. For example, the rule "Extract Natural Object" in our JAPE script checks for text matching one of the specified instances in a list of natural objects, such as the Arabic words for forest, valley, or sea. When a token in the text matches a word in this list, a named entity "Natural Object" is created, and specific attributes, such as class = "Natural Object" and type (containing the character string of the identified entity), are associated with this entity.</p></div>
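<div xmlns="http://www.tei-c.org/ns/1.0"><p>The behaviour described for the "Extract Natural Object" rule can be approximated by a simple list lookup. The Python sketch below is an analogy, not the JAPE rule itself: the function extract_entities, the dictionary layout of an annotation, and the sample word list are assumptions made for illustration and do not reproduce the instance lists used in the study.</p><p><code lang="python"><![CDATA[
```python
def extract_entities(tokens, instances, entity_class):
    """Create one annotation per token found in the instance list.

    tokens: list of (start, end, string) triples.
    instances: set of strings belonging to the class (illustrative list).
    entity_class: class label stored in the 'class' feature.
    """
    annotations = []
    for start, end, string in tokens:
        if string in instances:
            annotations.append({
                "start": start,
                "end": end,
                # 'type' holds the matched character string, as in the text.
                "features": {"class": entity_class, "type": string},
            })
    return annotations


if __name__ == "__main__":
    # Illustrative entries only (escaped Arabic words for "sea" and "forest").
    natural_objects = {"\u0628\u062D\u0631", "\u063A\u0627\u0628\u0629"}
    tokens = [(0, 3, "\u0628\u062D\u0631"), (4, 8, "\u0643\u0628\u064A\u0631")]
    print(extract_entities(tokens, natural_objects, "Natural Object"))
```
]]></code></p><p>Only the first token is in the list, so a single annotation is produced for it; the second token is left unannotated.</p></div>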
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Results and Evaluation</head><p>In this section, we examine the results of experiments aimed at evaluating the effectiveness of our spatial information extraction system, which uses JAPE rules. This method is based on applying these rules to automatically extract spatial entities and relationships from a corpus of Arabic texts.</p><p>To evaluate and compare the methods we studied, we use three metrics: Precision, Recall, and F-measure. Precision refers to the correctness of the retrieval, while recall refers to its completeness. The F-measure is the harmonic mean of precision and recall <ref type="bibr" target="#b16">[17]</ref>.</p><p>According to <ref type="bibr" target="#b17">[18]</ref>:</p><p>• Precision is the fraction of valid annotations over the total number of identified annotations. It is formally defined as:</p><formula xml:id="formula_0">\mathrm{Precision} = \frac{\mathrm{Correct}}{\mathrm{Correct} + \mathrm{Spurious}}<label>(1)</label></formula><p>• Recall is the fraction of valid annotations over the total number of annotations that should have been identified. It is formally defined as:</p><formula xml:id="formula_1">\mathrm{Recall} = \frac{\mathrm{Correct}}{\mathrm{Correct} + \mathrm{Missing}}<label>(2)</label></formula><p>• F-measure is the harmonic mean of precision and recall. It is formally defined as:</p><formula xml:id="formula_2">F\text{-measure} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}<label>(3)</label></formula><p>The following tables present the evaluation results for the extraction of information related to natural disasters. The corpus used for this evaluation is the same as described in the study <ref type="bibr">(Hadji et al., 2024)</ref>, comprising a total of 9,008 words extracted from four different Algerian newspapers. This corpus was annotated to identify 908 spatial information elements, of which 611 are spatial entities and 197 are spatial relationships. 
The results obtained (Table <ref type="table" target="#tab_2">3</ref>) show the distribution of annotations across the different newspapers and indicate that spatial entities make up the majority of annotations, representing 68.4% of the total, while spatial relationships account for 31.6%.</p><p>The results (Table <ref type="table" target="#tab_4">4</ref>) show that the most frequently annotated spatial entities are non-natural objects (265) and places (301), while natural objects are less represented. This may reflect a focus on entities deemed more relevant in the contexts of the newspapers studied. Regarding spatial relationships, directional relationships are the most commonly annotated (88), followed by orientation relationships (58). Topological and distance relationships are much less frequent, which could indicate that they are less often expressed in the analyzed corpus.</p><p>Table <ref type="table">5</ref> shows the number of correct, incorrect, and missing annotations for the Algerian newspapers. These data provide an assessment of the accuracy of the spatial annotations performed on press articles, offering an overview of the quality of the results obtained during the extraction process.</p></div>
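<div xmlns="http://www.tei-c.org/ns/1.0"><p>Equations (1) to (3) translate directly into code. The Python sketch below is a minimal illustration; the function names are assumptions, and the counts in the example are made-up values chosen for readability, not figures from the evaluation tables.</p><p><code lang="python"><![CDATA[
```python
def precision(correct, spurious):
    """Equation (1): Correct / (Correct + Spurious)."""
    total = correct + spurious
    return correct / total if total else 0.0


def recall(correct, missing):
    """Equation (2): Correct / (Correct + Missing)."""
    total = correct + missing
    return correct / total if total else 0.0


def f_measure(p, r):
    """Equation (3): harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    correct, spurious, missing = 90, 10, 30
    p = precision(correct, spurious)
    r = recall(correct, missing)
    print(round(p, 3), round(r, 3), round(f_measure(p, r), 3))
```
]]></code></p><p>The guards against a zero denominator are a design choice of this sketch: they return 0.0 when no annotation was produced or expected, instead of raising an error.</p></div>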
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Analysis and Discussion</head><p>In evaluating the effectiveness of our proposed JAPE rule-based approach for extracting spatial information from Arabic texts, we compared our results with those of various other methodologies, including rule-based and hybrid approaches. The following table summarizes the precision, recall, and F-measure of each method.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.1.">Analysis</head><p>• Precision: Our approach demonstrates a high precision of 0.90, indicating that 90% of the spatial information extracted is relevant and correct. This is a significant advantage, especially in applications where accuracy is paramount, such as in disaster response scenarios or when processing sensitive geographic data. In comparison, the rule-based methods <ref type="bibr" target="#b3">[4]</ref> and <ref type="bibr" target="#b18">[19]</ref> show lower precision values of 0.80 and 0.85, respectively. This discrepancy suggests that while these methods can recall a broader range of information, they may also include a higher number of false positives. • Recall: In terms of recall, our approach achieves a score of 0.85, which indicates a solid ability to capture a significant proportion of the relevant spatial information present in the texts. Although this recall is lower than that of <ref type="bibr" target="#b3">[4]</ref> (0.91), it remains competitive, especially when considering that higher recall often comes at the cost of lower precision. The rule-based approach <ref type="bibr" target="#b18">[19]</ref> and the hybrid method <ref type="bibr" target="#b0">[1]</ref> also achieve higher recall, with scores of 0.88 and 0.95, respectively. This suggests that while our method misses some relevant entities compared to the others, it does so while maintaining a high level of precision. • F-measure: The F-measure, which balances precision and recall, is another critical metric for assessing the overall performance of the approaches. Our method achieves an F-measure of 0.87, which reflects a strong performance overall. The hybrid approach <ref type="bibr" target="#b0">[1]</ref> leads in this area with an F-measure of 0.94, underscoring the effectiveness of combining different techniques to leverage their respective strengths. 
While our approach does not outperform this hybrid model, it still outshines the purely rule-based approaches, <ref type="bibr" target="#b3">[4]</ref> and <ref type="bibr" target="#b18">[19]</ref>, which both yield lower F-measure scores of 0.85.</p></div>
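<div xmlns="http://www.tei-c.org/ns/1.0"><p>As a consistency check, the F-measure of our approach can be recomputed from the precision (0.90) and recall (0.85) reported above, using Equation (3):</p><p><code lang="latex"><![CDATA[
```latex
F\text{-measure} = \frac{2 \times 0.90 \times 0.85}{0.90 + 0.85} = \frac{1.53}{1.75} \approx 0.874
```
]]></code></p><p>This agrees with the reported value of 0.87 after rounding to two decimals.</p></div>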
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.2.">Discussion</head><p>The analysis indicates that while our JAPE rule-based method excels in precision, making it a robust option for applications that require accuracy, it falls slightly behind in recall compared to some other methods. This presents a critical trade-off in the context of information extraction: achieving high precision often limits the breadth of recall. The hybrid method, although potentially more complex to develop and implement, demonstrates the highest overall effectiveness, suggesting that a combined strategy could yield the best results in future applications. Moving forward, our findings advocate for a nuanced approach to spatial information extraction that considers the specific requirements of each task. For instance, in scenarios where precision is paramount, our JAPE method is a suitable choice. Conversely, for applications requiring extensive coverage of information, exploring hybrid methodologies could enhance performance significantly. Further research could involve developing a hybrid model that integrates the best features of our JAPE approach with the broader coverage of machine learning methods, aiming to improve both precision and recall without sacrificing efficiency.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.">Conclusion</head><p>This research introduced an approach for extracting spatial information from Arabic texts within Geographic Information Systems (GIS), using JAPE rule-based techniques. This methodological choice proved effective for annotating and identifying spatial entities, such as natural objects, artificial objects, and locations, as well as spatial relations, including distance, topology, orientation, and directional relationships. The use of JAPE rules presents several advantages: it simplifies the creation of specific linguistic patterns, making it a swift and suitable solution for systems with focused objectives where ambiguities are minimal. Thus, for targeted applications and well-defined contexts, the JAPE approach ensures reliable and systematic extraction of spatial information.</p><p>However, our study also highlighted the limitations of this approach in addressing the linguistic nuances of the Arabic language, which often require labor-intensive manual adjustments and advanced linguistic expertise. In comparison, ontology-based and machine learning methods, though promising in terms of generalization and adaptability, demand significant resources to build comprehensive ontologies and annotated datasets, making them less accessible for applications requiring rapid deployment.</p><p>In conclusion, our work underscores the relevance of the JAPE rule-based approach for extraction systems where simplicity and quick implementation are paramount. For future applications, it would be beneficial to explore the hybridization of this method with machine learning and deep learning techniques, aiming to combine the precision of rules with the adaptability and contextualization capacities that these approaches offer. 
Such a combination could lead to more robust and versatile spatial information extraction systems, tailored to the diverse challenges presented by Arabic texts and GIS contexts.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Declaration on Generative AI</head><p>The author(s) have not employed any Generative AI tools.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: General architecture of the proposed approach</figDesc><graphic coords="3,72.00,65.60,451.30,551.75" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Phases of our approach</figDesc><graphic coords="5,72.00,65.61,451.29,231.95" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: Execution of Sentence Splitter in GATE</figDesc><graphic coords="6,72.00,65.61,451.29,298.32" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: Execution of Tokenization in GATE</figDesc><graphic coords="7,72.00,65.61,451.28,331.71" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Figure 5 :</head><label>5</label><figDesc>Figure 5: Example of JAPE rule</figDesc><graphic coords="9,72.00,65.61,451.29,421.42" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_5"><head>Figure 6 :</head><label>6</label><figDesc>Figure 6: Extraction of spatial information based on JAPE rules</figDesc><graphic coords="10,72.00,65.61,451.29,333.38" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1</head><label>1</label><figDesc>Spatial Entities classes</figDesc><table><row><cell></cell><cell>Classes</cell><cell>Instances</cell></row><row><cell></cell><cell>Natural Object</cell></row><row><cell>Spatial Entities</cell><cell>Building Object</cell></row><row><cell></cell><cell>Location</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2</head><label>2</label><figDesc>Spatial Relations classes</figDesc><table><row><cell></cell><cell>Classes</cell><cell>Instances</cell></row><row><cell></cell><cell>Topological</cell></row><row><cell>Spatial Relations</cell><cell>Direction</cell></row><row><cell></cell><cell>Distance</cell></row></table><note>relations specify alignments or angles between them.</note></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 3</head><label>3</label><figDesc>Distribution of spatial entity and relation annotations</figDesc><table /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 3</head><label>3</label><figDesc>Distribution of spatial entity and relation annotations</figDesc><table><row><cell>Newspaper</cell><cell>Spatial Entity</cell><cell>Spatial Relation</cell><cell>Total Words</cell></row><row><cell>Total</cell><cell>611</cell><cell>197</cell><cell>9008</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_4"><head>Table 4</head><label>4</label><figDesc>Results of Distribution of spatial entities and relationships</figDesc><table><row><cell cols="3">Spatial Entity</cell><cell cols="4">Spatial Relation</cell></row><row><cell>Location</cell><cell>Natural</cell><cell>Building</cell><cell>Direction</cell><cell>Orientation</cell><cell>Topological</cell><cell>Distance</cell></row><row><cell>301</cell><cell>45</cell><cell>265</cell><cell>88</cell><cell>58</cell><cell>39</cell><cell>12</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table"><head>Table 5</head><label>5</label><figDesc>Evaluation Annotation Metrics</figDesc><table><row><cell>Source</cell><cell>Correct</cell><cell>Incorrect</cell><cell>Missing</cell></row><row><cell>Newspaper</cell><cell>635</cell><cell>64</cell><cell>109</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_5"><head>Table 6</head><label>6</label><figDesc>Performance Evaluation Metrics</figDesc><table><row><cell></cell><cell>Precision</cell><cell>Recall</cell><cell>F-measure</cell></row><row><cell>Our Approach</cell><cell>0.90</cell><cell>0.85</cell><cell>0.87</cell></row><row><cell>[4] Rule-based</cell><cell>0.80</cell><cell>0.91</cell><cell>0.85</cell></row><row><cell>[19] Rule-based</cell><cell>0.85</cell><cell>0.88</cell><cell>0.85</cell></row><row><cell>[1] Hybrid</cell><cell>0.93</cell><cell>0.95</cell><cell>0.94</cell></row></table></figure>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Enhancing spatial information extraction from Arabic text: A hybrid approach with ontology and rule-based</title>
		<author>
			<persName><forename type="first">A</forename><surname>Hadji</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M.-K</forename><surname>Kholladi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Borisova</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Ingénierie des Systèmes d&apos;Information</title>
		<imprint>
			<biblScope unit="volume">29</biblScope>
			<biblScope unit="page">1261</biblScope>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Geographic information systems and web GIS in higher education: a collaborative tool for the analysis of accessibility in the urban and built environment</title>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">J</forename><surname>Aguilar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Pinos-Navarrete</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">Domingo</forename><surname>Jaramillo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">L</forename><surname>De La Hoz-Torres</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Teaching Innovation in Architecture and Building Engineering: Challenges of the 21st Century</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2024">2024</date>
			<biblScope unit="page" from="401" to="415" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Progressive collaborative method for protecting users privacy in location-based services</title>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">R</forename><surname>Reddy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Sharma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Anusha</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Jhade</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Dhanasekaran</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">MATEC Web of Conferences</title>
				<imprint>
			<publisher>EDP Sciences</publisher>
			<date type="published" when="2024">2024</date>
			<biblScope unit="volume">392</biblScope>
			<biblScope unit="page">1089</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Automatic extraction of spatio-temporal information from Arabic text documents</title>
		<author>
			<persName><forename type="first">A</forename><surname>Feriel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Kholladi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Int. J. Comput. Sci. Inf. Technol</title>
		<imprint>
			<biblScope unit="volume">7</biblScope>
			<biblScope unit="page" from="97" to="107" />
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">A rule-based named-entity recognition method for knowledge extraction of evidence-based dietary recommendations</title>
		<author>
			<persName><forename type="first">T</forename><surname>Eftimov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Koroušić Seljak</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Korošec</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">PloS one</title>
		<imprint>
			<biblScope unit="volume">12</biblScope>
			<biblScope unit="page">e0179488</biblScope>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Rule based text extraction from a bibliographic database</title>
		<author>
			<persName><forename type="first">V</forename><surname>Makhija</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ahuja</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">DESIDOC Journal of Library &amp; Information Technology</title>
		<imprint>
			<biblScope unit="volume">38</biblScope>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">The use of ontology in clinical information extraction</title>
		<author>
			<persName><forename type="first">S</forename><surname>Jusoh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Awajan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Obeid</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Physics: Conference Series</title>
		<imprint>
			<biblScope unit="volume">1529</biblScope>
			<biblScope unit="page">52083</biblScope>
			<date type="published" when="2020">2020</date>
			<publisher>IOP Publishing</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">A rule-based information extraction approach for extracting metadata from PDF books</title>
		<author>
			<persName><forename type="first">A</forename><surname>Alamoudi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Alomari</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Alwarthan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ICIC Express Letters, Part B: Applications</title>
		<imprint>
			<biblScope unit="volume">12</biblScope>
			<biblScope unit="page" from="121" to="132" />
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Valet: Rule-based information extraction for rapid deployment</title>
		<author>
			<persName><forename type="first">D</forename><surname>Freitag</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Cadigan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Sasseen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Kalmar</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Thirteenth Language Resources and Evaluation Conference</title>
				<meeting>the Thirteenth Language Resources and Evaluation Conference</meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="524" to="533" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">A hybrid approach for spatial information extraction from natural language text</title>
		<author>
			<persName><forename type="first">N</forename><surname>Hassini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Mahmoudi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Faiz</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">20th ACS/IEEE International Conference on Computer Systems and Applications (AICCSA)</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="1" to="8" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Spatially oriented convolutional neural network for spatial relation extraction from natural language texts</title>
		<author>
			<persName><forename type="first">Q</forename><surname>Qiu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Xie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Ma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Tao</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Transactions in GIS</title>
		<imprint>
			<biblScope unit="volume">26</biblScope>
			<biblScope unit="page" from="839" to="866" />
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Aprcoie: An open information extraction system for Chinese</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Liao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Hua</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Luo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Ping</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Lu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhong</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">SoftwareX</title>
		<imprint>
			<biblScope unit="volume">26</biblScope>
			<biblScope unit="page">101649</biblScope>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">UTSA-NLP at ChemoTimelines 2024: Evaluating instruction-tuned language models for temporal relation extraction</title>
		<author>
			<persName><forename type="first">X</forename><surname>Zhao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Rios</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 6th Clinical Natural Language Processing Workshop</title>
				<meeting>the 6th Clinical Natural Language Processing Workshop</meeting>
		<imprint>
			<date type="published" when="2024">2024</date>
			<biblScope unit="page" from="604" to="615" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">A comprehensive survey on automatic knowledge graph construction</title>
		<author>
			<persName><forename type="first">L</forename><surname>Zhong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Peng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Wu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ACM Computing Surveys</title>
		<imprint>
			<biblScope unit="volume">56</biblScope>
			<biblScope unit="page" from="1" to="62" />
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<monogr>
		<title level="m" type="main">GATE JAPE grammar tutorial</title>
		<author>
			<persName><forename type="first">D</forename><surname>Thakker</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Osman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Lakin</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2009">2009</date>
			<biblScope unit="volume">1</biblScope>
			<pubPlace>UK</pubPlace>
		</imprint>
		<respStmt>
			<orgName>Nottingham Trent University</orgName>
		</respStmt>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<monogr>
		<title level="m" type="main">JAPE: Regular Expressions over Annotations</title>
		<author>
			<orgName>gate.ac.uk</orgName>
		</author>
		<ptr target="https://gate.ac.uk/sale/tao/splitch8.html" />
		<imprint>
			<date type="published" when="2024-07-23">July 23, 2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Advanced NLP methods for disaster information extraction: Analyzing JAPE rules, ontologies, and machine learning approaches</title>
		<author>
			<persName><forename type="first">A</forename><surname>Hadji</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">K</forename><surname>Kholladi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 3rd International Conference on Computer Science&apos;s Complex System and their Application (CCSA&apos;2024)</title>
		<title level="s">Computer Science Book Series</title>
		<meeting>the 3rd International Conference on Computer Science&apos;s Complex System and their Application (CCSA&apos;2024)</meeting>
		<imprint>
			<publisher>Springer Nature</publisher>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
	<note>In press</note>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Automatic opinion extraction from football-related social media: A gazetteer and rule-based approach</title>
		<author>
			<persName><forename type="first">A</forename><surname>Hadji</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M.-K</forename><surname>Kholladi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">NCAIA</title>
		<imprint>
			<biblScope unit="page">61</biblScope>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">A rule-based information extraction system</title>
		<author>
			<persName><forename type="first">S</forename><surname>Panda</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Pradhan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Behera</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Mohanty</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">International Journal of Innovative Technology and Exploring Engineering</title>
		<imprint>
			<biblScope unit="volume">8</biblScope>
			<biblScope unit="page" from="1613" to="1617" />
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
