
Abjad numerals recognition in medieval Arabic mathematical texts

Hadj Mohammed Djamel

Nacéra Bensaou

USTHB University, Laboratory for Research in Artificial Intelligence (LRIA), BP32 El Alia, Bab Ezzouar, Algiers, Algeria

Abjad numerals, also called hisāb al-jumal, are a numeral system based on the twenty-eight letters of the Arabic alphabet, taken in an order different from the dictionary order. In ancient Arabic mathematics, all problem and solution sentences were expressed entirely in natural language, with no mathematical symbolism. The present paper is the first attempt to automatically analyze and recognize Abjad numerals in medieval Arabic mathematical texts. Since the hisāb al-jumal system has no ambiguity, we also translate Abjad numerals written in natural language to the modern numeral system. We construct a new dataset, named the Hj-Tagged Corpus, to facilitate our study. According to the experimental results, the proposed method is efficient at automatically analyzing and recognizing Abjad numerals and mathematical components (such as numerical constants, Abjad numbers, mathematical operations, etc.). We also translate the Abjad terms detected in the previous step to the modern numeral system; the system achieves an F1 score of 98.1%.

1, the second letter bā' (ب) is used to represent 2, etc. Then the numbers 10, 20, 30, ..., 90 are represented by the next nine letters (10 = yā' (ي), 20 = kāf (ك), 30 = lām (ل), etc.), then 100, 200, 300, ..., 1000

1. Introduction

dāl (د), the first four letters in the order [2].

In several medieval Arabic manuscripts, such as mathematical, geographical, and astronomical texts [1], numbers are written in a system of Arabic alphanumerical notation. In this system, each of the 28 Arabic letters has a specific numerical value, known as the 'Adad of that letter, and the value of a word is the sum of the values of the letters that compose it. Hisāb al-jumal numbers were used for all mathematical purposes and also for the creation of chronograms, which "consist of grouping into one meaningful and characteristic word or short phrase letters whose numerical values when totaled give the year of a past or future event" [3]. For example, a poet used the system when writing about the rules of Tajweed. He composed a poem stating the rules for the Arabic letters, and at the end of the poem he said: "تاريخها بشرى لمن يتقنها" [the date of this poem is a good tiding to the one who masters it]. When calculated in the hisāb al-jumal system (see Table 2), this gives the year in which he authored the book, 1198 AH (512 + 120 + 566).

VIPERC2022: 1st International Virtual Conference on Visual Pattern Extraction and Recognition for Cultural Heritage Understanding, 12 September 2022. * Corresponding author.
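The chronogram total can be checked letter by letter. The worked sum below is a sketch that assumes the standard Eastern abjad letter values and counts the final alif maqṣūra of the first word as yā' (10); the word transliterations (bushrā, li-man, yutqinuhā) are added here for readability and are not part of the source.

```latex
\begin{align*}
\text{bushrā}    &: 2\,(\text{bā'}) + 300\,(\text{shīn}) + 200\,(\text{rā'}) + 10\,(\text{yā'}) = 512 \\
\text{li-man}    &: 30\,(\text{lām}) + 40\,(\text{mīm}) + 50\,(\text{nūn}) = 120 \\
\text{yutqinuhā} &: 10\,(\text{yā'}) + 400\,(\text{tā'}) + 100\,(\text{qāf}) + 50\,(\text{nūn}) + 5\,(\text{hā'}) + 1\,(\text{alif}) = 566 \\
\text{total}     &: 512 + 120 + 566 = 1198
\end{align*}
```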

When a number is written in hisāb al-jumal notation, it becomes difficult to recognize it as a number, especially if the hisāb al-jumal word also makes sense as an ordinary word. For example, in the equation "أردنا أن نضرب لو دقيقة في كاح ثانية" [We wanted to multiply law minutes by kah seconds] (Miftāh al-Hisāb [4]), the hisāb al-jumal numbers have a unique conversion into ordinary decimal notation, yielding "أردنا أن نضرب 36 دقيقة في 29 ثانية".

In this work, we explore the use of a bidirectional Long Short-Term Memory (Bi-LSTM) network with a conditional random field (CRF) layer to automatically analyze and recognize Abjad numerals in mathematical expressions in medieval Arabic mathematical texts. Additionally, we translate the hisāb al-jumal terms detected in the previous step, which are written in natural language, to the modern numeral system (such as decimal numbers).

In the past few years, recurrent neural networks (RNNs) [5, 6], together with their variants (such as LSTM and the gated recurrent unit (GRU)), have become one of the most common techniques in natural language processing, for tasks such as part-of-speech (POS) tagging [7, 8] and named entity recognition (NER) [9, 10]. Recently, [11] applied an RNN approach using a Bi-LSTM with a CRF to drug NER, automatically detecting word-level and character-level features. Similarly, [12] fed the outputs of a Bi-LSTM and a CRF to a Support Vector Machine (SVM) classifier for disease name recognition. For sequence tagging tasks, [13] proposed a variant of the Bi-LSTM with a CRF layer. [14] shows that combining a Bi-LSTM-CRF with an external word embedding model achieves impressive results on the Russian NER task. [15] adapted a rule-based machine translation system using a Dictionary Approach (DA) to automatically generate modern (symbolic) mathematical equations from natural language in medieval Arabic algebra. In this paper, we propose a novel approach for automatically recognizing Abjad numerals in medieval Arabic mathematical texts and translating them to the modern numeral system.

The remainder of this paper is organized as follows. Section 2 explains LSTM networks, Bi-LSTM networks, and Bi-LSTM networks with a CRF layer. Section 3 describes how to translate hisāb al-jumal to the modern numeral system. Section 4 presents the experimental setup, including dataset construction, model architecture, and the training process. Finally, Section 5 summarizes our methods and results and discusses future work.

2. Bi-LSTM-CRF Model

Recurrent neural networks (RNNs) have proved efficient at learning from sequential data, including in language modeling [17, 18] and natural language processing [19, 20]. An RNN is a neural network that consists of an input layer $x$, a hidden layer $h$, and an output layer $y$. For instance, given a sentence $x = (x_1, \ldots, x_n)$, an RNN uses a hidden state representation $h = (h_1, \ldots, h_n)$ to map the input to the output sequence $y = (y_1, \ldots, y_n)$.

However, standard RNNs suffer from both the exploding and the vanishing gradient problems [21]. RNNs with gating units, such as the LSTM [22], add an extra memory cell to the recurrent unit and are among the most effective sequence models in practical applications.

The LSTM cell can be described mathematically with the following six fundamental operational stages:
∙ Input gate: $i_t = \sigma(W^{(i)} x_t + U^{(i)} h_{t-1})$
∙ Forget gate: $f_t = \sigma(W^{(f)} x_t + U^{(f)} h_{t-1})$
∙ Output/exposure gate: $o_t = \sigma(W^{(o)} x_t + U^{(o)} h_{t-1})$
∙ New memory cell: $\tilde{c}_t = \tanh(W^{(c)} x_t + U^{(c)} h_{t-1})$
∙ Final memory cell: $c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$
∙ Final hidden state: $h_t = o_t \odot \tanh(c_t)$
where $x_t$ is the input vector at time $t$, and $h_t$ denotes the hidden state vector storing all the useful information at (and before) time $t$. The $W$ and $U$ terms denote the weight matrices of each gate. The symbol $\sigma$ represents the sigmoid activation function, and $\odot$ is element-wise multiplication.
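The six stages can be traced in a few lines of code. The following is a minimal pure-Python sketch of a single LSTM step, not the implementation used in the experiments; the function names and the dictionaries `W` and `U` (one input and one recurrent weight matrix per gate, keyed by `"i"`, `"f"`, `"o"`, `"c"`) are illustrative assumptions, and bias terms are omitted as in the equations.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def matvec(M, v):
    # Matrix-vector product for a list-of-rows matrix.
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def lstm_step(x_t, h_prev, c_prev, W, U):
    """One LSTM time step; W[g] and U[g] are the (hypothetical) weights of gate g."""
    def pre(g):
        # Pre-activation W^(g) x_t + U^(g) h_{t-1}
        return [a + b for a, b in zip(matvec(W[g], x_t), matvec(U[g], h_prev))]

    i_t = [sigmoid(z) for z in pre("i")]        # input gate
    f_t = [sigmoid(z) for z in pre("f")]        # forget gate
    o_t = [sigmoid(z) for z in pre("o")]        # output/exposure gate
    c_new = [math.tanh(z) for z in pre("c")]    # new memory cell
    # Final memory cell: keep part of the old memory, add part of the new one.
    c_t = [f * c + i * n for f, c, i, n in zip(f_t, c_prev, i_t, c_new)]
    # Final hidden state: expose part of the squashed memory.
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t

# Toy check with all-zero weights (input size 2, hidden size 1):
# every gate is sigmoid(0) = 0.5 and the new memory is tanh(0) = 0.
W = {g: [[0.0, 0.0]] for g in "ifoc"}
U = {g: [[0.0]] for g in "ifoc"}
h, c = lstm_step([1.0, -1.0], [0.0], [2.0], W, U)
```

With zero weights the cell simply halves the previous memory, which makes the role of the forget gate easy to see in isolation.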

In this paper, we propose to apply a Bi-LSTM neural network [23] instead of a single forward network. In doing so, we can efficiently make use of past features (via forward states) and future features (via backward states) for a specific time frame. Finally, we construct our neural network model by feeding the output vectors of the Bi-LSTM into a Conditional Random Field (CRF) layer [24] to jointly decode the best sequence of tags. Consider an input sentence $X = \{x_1, x_2, \ldots, x_n\}$, and let $y = \{y_1, y_2, \ldots, y_n\}$ be the corresponding sequence of tags for $X$. We take $P$ to be the matrix of scores output by the Bi-LSTM network. $P$ is of size $n \times k$, where $k$ is the number of distinct tags, and $P_{i,j}$ is the score of the $j$-th tag for the $i$-th word. The score of a tag sequence is defined in the following form [25]:

$$s(X, y) = \sum_{i=0}^{n} A_{y_i, y_{i+1}} + \sum_{i=1}^{n} P_{i, y_i}$$

where $A$ is a matrix of transition scores such that $A_{i,j}$ represents the score of a transition from tag $i$ to tag $j$. $y_0$ and $y_{n+1}$ are the start and end tags of a sentence, which we add to the set of possible tags. $A$ is therefore a square matrix of size $k + 2$.

$Y(X)$ denotes the set of possible tag sequences for $X$. The probabilistic model for the sequence CRF defines a family of conditional probabilities $p(y \mid X)$ over all possible tag sequences given $X$, in the following form:

$$p(y \mid X) = \frac{e^{s(X, y)}}{\sum_{\tilde{y} \in Y(X)} e^{s(X, \tilde{y})}}$$

During training, the log-probability of the correct tag sequence, $\log p(y \mid X)$, is maximized.
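To make the CRF layer concrete, the sketch below computes the sequence score, the normalizing sum over all tag sequences (with the forward algorithm), the resulting log-probability, and the best tag sequence (with Viterbi decoding). It is an illustrative pure-Python sketch under stated assumptions, not the code used in the experiments: tags are integers 0..k-1, the start and end tags are assumed to sit at indices k and k+1 of the transition matrix, and the toy scores `P` and `A` are made up.

```python
import math
from itertools import product

def sequence_score(P, A, tags):
    """s(X, y): transitions along start -> y_1 ... y_n -> end, plus emission scores."""
    k = len(P[0])
    path = [k] + list(tags) + [k + 1]   # assumed: start tag = k, end tag = k + 1
    transitions = sum(A[a][b] for a, b in zip(path, path[1:]))
    emissions = sum(P[i][t] for i, t in enumerate(tags))
    return transitions + emissions

def logsumexp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def log_partition(P, A):
    """log of the sum of exp(s(X, y)) over all tag sequences (forward algorithm)."""
    k = len(P[0])
    alpha = [A[k][t] + P[0][t] for t in range(k)]
    for i in range(1, len(P)):
        alpha = [logsumexp([alpha[s] + A[s][t] for s in range(k)]) + P[i][t]
                 for t in range(k)]
    return logsumexp([alpha[t] + A[t][k + 1] for t in range(k)])

def log_prob(P, A, tags):
    """log p(y | X), the quantity maximized during training."""
    return sequence_score(P, A, tags) - log_partition(P, A)

def viterbi(P, A):
    """Tag sequence with the highest score s(X, y)."""
    k = len(P[0])
    score = [A[k][t] + P[0][t] for t in range(k)]
    back = []
    for i in range(1, len(P)):
        prev = [max(range(k), key=lambda s: score[s] + A[s][t]) for t in range(k)]
        score = [score[prev[t]] + A[prev[t]][t] + P[i][t] for t in range(k)]
        back.append(prev)
    last = max(range(k), key=lambda t: score[t] + A[t][k + 1])
    tags = [last]
    for prev in reversed(back):
        tags.append(prev[tags[-1]])
    return tags[::-1]

# Toy example: n = 3 words, k = 2 tags, so A is 4 x 4 (two extra rows/columns).
P = [[1.0, 0.2], [0.1, 0.9], [0.4, 0.6]]
A = [[0.5, -0.2, 0.0, 0.1],
     [-0.3, 0.4, 0.0, 0.2],
     [0.1, 0.3, 0.0, 0.0],
     [0.0, 0.0, 0.0, 0.0]]
best_tags = viterbi(P, A)
```

On such a small example the forward algorithm and Viterbi can be compared against an exhaustive enumeration of all $k^n$ tag sequences, which is a convenient sanity check before moving to realistic sizes.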

Figure 1 illustrates the main architecture of our neural network model for the medieval mathematical entity recognition system, in which each word is tagged with other (O) or one of six entity types: hisāb al-jumal (H-jumal), root (Root), square (Square), cube (Cube), equal (Equal), and operation (Op). For example, the sentence "AÒëPX ÈYªK èP@Yg. @ l k. ð ÈAÓ" [A square and ℎ¯ roots are equal to thirty sïn dirhems] is tagged as {Square Op H-jumal Root Equal H-jumal O}.

3. Equations and Hisāb al-Jumal Calculation in Medieval Arabic Algebra

In the following, we translate some of the basic Arabic mathematical terms and notations used throughout the medieval period into modern symbols:
∙ Shay' ("شيء") or jidhr ("جذر") refers to the unknown value ($x$).
∙ Māl ("مال") and ka'b ("كعب") represent $x^2$ and $x^3$ respectively.
∙ Powers greater than or equal to four can be formed by combining the two words māl and ka'b, for example māl māl ($x^4$), māl ka'b ($x^5$), and so on.
∙ Dirhams ("درهما") or mina al'adad ("من العدد") represents a simple number.
∙ The verb 'ādala ("عادل") is used to indicate equality ("=") in an equation.
∙ The one-letter word wa ("و") takes the meaning of modern addition ("+"), depending on the context.
∙ The hisāb al-jumal system is often used to write numbers in the medieval Arabic period.

The numerological calculation of hisāb al-jumal terms requires a dictionary approach that relates every letter of the Arabic alphabet to its numerical equivalent (see Table 1).

Let $S$ be the sequence of words $\{w_1, w_2, \ldots, w_n\}$. To capture the correspondence between a word $w$ and its numerical value $v$, we define an alignment $a$ to be a set of pairs $(w, v)$, where $w$ is a token in $S$ and $v$ is the sum of the values of its letters. For example, consider the following sentence $S$: Mālān wa kab 'ashyā' ta'dilu nad dirhaman ("مالان و كب أشياء تعدل ند درهما") [Two māls and kab things equal nad dirhams]. Given the above definitions, and knowing that the terms kab ("كب") and nad ("ند") are hisāb al-jumal, once $a$ is calculated over all hisāb al-jumal words, {$a_1$ = (ند, 50+4), $a_2$ = (كب, 20+2)}, $S$ can be written as "مالان و 22 أشياء تعدل 54 درهما" [Two māls and twenty-two things equal fifty-four dirhams]. The passage from a sequence of words of any length to its numerical notation is therefore straightforward. Moreover, no ambiguity is possible, because there is exactly one translation of $w$ to $v$.
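The dictionary approach and the word-to-value alignment can be sketched as follows. This is an illustrative sketch, not the authors' code: the letter table is the standard Eastern abjad order, the normalization of letter variants (hamza-carrying alif, alif maqṣūra, tā' marbūṭa) is an assumption, and the tags assigned to the example sentence are assumed for the purpose of the demonstration.

```python
# The 28 letters in abjad order; values run 1-9, 10-90, 100-1000.
ABJAD_ORDER = "ابجدهوزحطيكلمنسعفصقرشتثخذضظغ"
VALUES = list(range(1, 10)) + list(range(10, 100, 10)) + list(range(100, 1001, 100))
ABJAD = dict(zip(ABJAD_ORDER, VALUES))

# Assumed normalization of letter variants to their base letter.
VARIANTS = {"أ": "ا", "إ": "ا", "آ": "ا", "ى": "ي", "ة": "ه"}

def abjad_value(word):
    """Sum of the letter values of a word; unknown symbols contribute 0."""
    return sum(ABJAD.get(VARIANTS.get(ch, ch), 0) for ch in word)

def translate(words, tags):
    """Replace every token tagged H-jumal with its decimal value."""
    return [str(abjad_value(w)) if t == "H-jumal" else w
            for w, t in zip(words, tags)]

# Example sentence from this section, with assumed tags.
words = ["مالان", "و", "كب", "أشياء", "تعدل", "ند", "درهما"]
tags = ["Square", "Op", "H-jumal", "Root", "Equal", "H-jumal", "O"]
translated = translate(words, tags)
```

Because only tokens tagged H-jumal are converted, the one-letter word wa is left untouched when it is tagged Op and would be converted to 6 only when the tagger marks it as a numeral.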

Consider the previous example sentence (see Section 2):

"AÒëPX ÈYªK èP@Yg. @ l k. ð ÈAÓ".

which was tagged with {Square Op H-jumal Root Equal H-jumal O}. By applying the numerological calculation of hisāb al-jumal to the words tagged as H-jumal, this sentence is transformed

4. Experiments

In this section, we first present our proposed architecture, shown in Figure 2, for automatically recognizing and translating hisāb al-jumal entities in medieval algebraic equations and expressions. Next, we discuss the construction of the new Hj-Tagged Corpus and the training details, followed by the results.

4.1. Dataset Construction and Evaluation

We evaluate our proposed system on the Hj-Tagged Corpus, constructed from the AMAK Dataset [15], which consists of pairs of medieval and modern equations. We implement a simple dictionary-based method to detect all numbers in the collected medieval equations (the numbers are referred to as tokens) and replace each with a random numerical entity in the hisāb al-jumal system. To increase the diversity of our training examples, we also added several algebraic expressions whose numbers are already written in the hisāb al-jumal system, obtained from the book of Al-Khwārizmī (9th century) [26][27], the book of Al-Kāshī (15th century) [4], and the book of Al-Yazdī [28]. The Hj-Tagged Corpus has 2,262 collected equations with 23,454 words and a vocabulary size of 5,049 words. It has been manually tagged using our own tagset: every word is tagged with O (other) or one of six entity tags: H-jumal (hisāb al-jumal), Root (root), Square (square), Cube (cube), Equal (equal), and Op (operation).

Table 3 shows some examples from the Hj-Tagged Corpus which consists of sequences of words and their tags. For evaluation, we report the precision, recall, and F1 scores for all tagged entities in the test set.

To ensure that the model does not see context from the test set during training, we first split our collected dataset into training, validation, and test sets. The sizes of the three sets are 2,062, 100, and 100 sentences respectively.
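A split with these sizes can be produced as in the sketch below. The shuffling, the fixed seed, and the function name `split_corpus` are assumptions made for illustration; the text does not state how the sentences were assigned to each set.

```python
import random

def split_corpus(sentences, n_val=100, n_test=100, seed=0):
    """Shuffle once, then carve off disjoint test and validation sets."""
    rng = random.Random(seed)        # fixed seed so the split is reproducible
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    test_set = shuffled[:n_test]
    val_set = shuffled[n_test:n_test + n_val]
    train_set = shuffled[n_test + n_val:]
    return train_set, val_set, test_set

# Placeholder corpus of 2,262 sentence identifiers.
corpus = list(range(2262))
train_set, val_set, test_set = split_corpus(corpus)
```

Splitting before any vocabulary or embedding statistics are computed keeps information from the held-out sentences out of training.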

4.2. Training

The Bi-LSTM-CRF model was implemented using TensorFlow and Keras [29], a flexible neural network library written in Python. The general settings of our neural network model are listed below:
∙ Dimension of the word embedding vector: 20.
∙ Dimension of the hidden layer: 50 (for each LSTM: forward layer and backward layer).
∙ Learning method: SGD optimizer, learning rate: 0.01.
∙ Number of examples used in each iteration (batch size): 30.
∙ The dropout rate is fixed at 0.5 for all dropout layers throughout all the experiments.
∙ Supervised learning was applied with up to 100 epochs for training the network.

Table 3: Tag sequences (from left to right)

O Op H-jumal O O Square Equal H-jumal Root H-jumal Square Equal H-jumal Root Square Equal H-jumal Square Op H-jumal Root Equal H-jumal O H-jumal Square Op H-jumal Root Equal H-jumal O H-jumal Cube Equal H-jumal O Cube Op H-jumal Root Equal Square Op H-jumal O O

4.3. Results and Discussions

In what follows (see Table 4), we use classification metrics such as precision, recall, and F1-score to evaluate our methods. The F1-score evaluates the performance of the system by comparing its detection results with those of a human annotator. It is the harmonic mean of precision and recall, computed from the number of entities detected by the system and by the human annotator. It is defined as

$$F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \qquad (1)$$

Precision = TP/(TP + FP) is the fraction of all positive predictions that are true positives, while Recall = TP/(TP + FN) is the fraction of all actual positives that are predicted positive. More precisely, in this system, True Positive (TP) is the number of Abjad numbers the system tagged correctly, False Positive (FP) is the number of tokens wrongly tagged as Abjad numbers, and False Negative (FN) is the number of Abjad numbers wrongly classified as non-Abjad.
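These definitions translate directly into code. The sketch below counts TP, FP, and FN for one tag at token level and derives precision, recall, and F1; the function names and the toy gold and predicted sequences are illustrative assumptions and do not reproduce the results reported in Table 4.

```python
def tag_counts(gold, pred, tag="H-jumal"):
    """Token-level true positives, false positives, and false negatives for one tag."""
    tp = sum(g == tag and p == tag for g, p in zip(gold, pred))
    fp = sum(g != tag and p == tag for g, p in zip(gold, pred))
    fn = sum(g == tag and p != tag for g, p in zip(gold, pred))
    return tp, fp, fn

def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Toy example: the second Abjad number is missed by the (hypothetical) prediction.
gold = ["Square", "Op", "H-jumal", "Root", "Equal", "H-jumal", "O"]
pred = ["Square", "Op", "H-jumal", "Root", "Equal", "O", "O"]
tp, fp, fn = tag_counts(gold, pred)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
```

The guards against empty denominators matter for rare tags such as Cube, which may not occur at all in a small test set.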

First of all, we notice that the Bi-LSTM-CRF network performs remarkably well on the Hj-Tagged Corpus, with a mean F1 score of 98.1%. Additionally, using the same parameters, we compare the performance of the Bi-LSTM-CRF model with that of a Bi-LSTM network, reporting the precision, recall, and F1 scores of both models. Adding the CRF layer significantly improved prediction. Besides that, the training phase requires fewer than 60 epochs to converge and in general takes a few minutes. Finally, our experimental results suggest that the Bi-LSTM-CRF network is less sensitive to the training data size and to the impact of noise in the tags.

This paper focuses on recognizing and tagging the components of mathematical expressions in medieval Arabic text. First, we note that our model was trained only on the Hj-Tagged Corpus. The training set is small, which limits the diversity of the training examples and may reduce the network's ability to generalize to new test examples. Second, we did not perform any dataset preprocessing, apart from replacing every decimal number in the collected equations with a random numerical entity in the hisāb al-jumal system.

Another important point is that manually tagging such a limited-vocabulary dataset makes the system extremely sensitive to noise.

On the other hand, our model was able to correctly tag test sentences that contain ambiguities. For example, the word wa ("و") can be the addition operator, as in "مال و أجذاره" [Square Op H-jumal Root], or a hisāb al-jumal entity, as in "مال و و أجذاره" [Square Op H-jumal Root].

Finally, we implement a simple tag-based method to translate all hisāb al-jumal terms detected in the previous step to the modern numeral system, using the methodology described in Section 3. For example, the sentence " èP@Yg. @ l k. ð ÈAÓ AÒëPX ÈYªK" [Square Op H-jumal Root Equal

5. Conclusion

This paper presented the first attempt to automatically analyze and recognize Abjad numerals in medieval Arabic mathematical texts, using the Bi-LSTM-CRF model. We also translate the hisāb al-jumal terms detected in the previous step to the modern numeral system. An additional key strength of this work is the time and effort spent on manually building a new dataset, the Hj-Tagged Corpus, which consists of 2,262 tagged medieval mathematical sentences.

In the future, we can improve the intermediate representations learned in our network by training this model jointly with named entity recognition (NER) tags. We also plan to enrich the training examples by expanding the Hj-Tagged Corpus. Another interesting direction is to apply our model to data from other Arabic sources in many different fields, such as geography, physics, chemistry, medicine, architecture, astronomy, and so on.

Experimental results on the Hj-Tagged Corpus demonstrate that the proposed method offers an important step in medieval Arabic mathematics analysis to enable scientists to understand and explore medieval mathematical texts.

References

[1] Chrisomalis, S. (2021) "Numerals as Letters: Ludic Language in Chronographic Writing." 09.

[2] Farooqi, Mehr Afshan. (2003) "The Secret of Letters: Chronograms in Urdu Literary Culture." Edebiyat 13.2: 147-58.

[3] Ifrah, Georges. (2000) "The universal history of numbers: From prehistory to the invention of the computer, translated by David Bellos, EF Harding, Sophie Wood and Ian Monk."

[4] Miftāh al Hisāb, Ghiyāth al-Dīn Jamshīd Mas'ud al-Kāshī, edited by A.S al Damardāshī et al., Dar al Kitāb al 'Arabi, Cairo, 1967, 357 pages.

[5] Hinton, Geoffrey E., D. E. Rumelhart, and Ronald J. Williams. (1986) "Learning representations by back-propagating errors." Nature 323.9: 533-536.

[6] Werbos, Paul J. (1988) "Generalization of backpropagation with application to a recurrent gas market model." Neural networks 1.4: 339-356.

[7] AlKhwiter, Wasan, and Nora Al-Twairesh. (2021) "Part-of-speech tagging for Arabic tweets using CRF and Bi-LSTM." Computer Speech & Language 65: 101138.

[8] Kamath, Shilpa, Chaitra Shivanagoudar, and K. G. Karibasappa. (2021) "Part of Speech Tagging Using Bi-LSTM-CRF and Performance Evaluation Based on Tagging Accuracy." Advances in Computing and Network Communications. Springer, Singapore. 299-310.

[9] Jin, Guozhe, and Zhezhou Yu. (2021) "A Korean named entity recognition method using bi-LSTM-CRF and masked self-attention." Computer Speech & Language 65: 101134.

[10] Wintaka, Deni Cahya, Moch Arif Bijaksana, and Ibnu Asror. (2019) "Named-entity recognition on Indonesian tweets using bidirectional LSTM-CRF." Procedia Computer Science 157: 221-228.

[11] Zeng, Donghuo, et al. (2017) "LSTM-CRF for drug-named entity recognition." Entropy 19.6: 283.

[12] Wei, Qikang, et al. (2016) "Disease named entity recognition by combining conditional random fields and bidirectional recurrent neural networks." Database 2016.

[13] Huang, Zhiheng, Wei Xu, and Kai Yu. (2015) "Bidirectional LSTM-CRF models for sequence tagging." arXiv preprint arXiv:1508.01991.

[14] Anh, Le, Mikhail Arkhipov, and M. S. Burtsev. (2017) "Application of a Hybrid Bi-LSTM-CRF model to the task of Russian Named Entity Recognition." arXiv preprint arXiv:1709.09686.

[15] Djamel, Hadj Mohammed, and Nacéra Bensaou. (2018) "Automatic Extraction of Equations in Medieval Arabic Algebra." 2018 IEEE/ACS 15th International Conference on Computer Systems and Applications (AICCSA). IEEE.

[16] Gacek, Adam. Arabic manuscripts: a vademecum for readers. Vol. 98. Brill, 2009.

[17] Mikolov, Tomáš, et al. (2010) "Recurrent neural network based language model." Eleventh annual conference of the international speech communication association.

[18] Mikolov, Tomáš, et al. (2011) "Strategies for training large scale neural network language models." 2011 IEEE Workshop on Automatic Speech Recognition & Understanding. IEEE.

[19] Jackson, Richard G., et al. (2017) "Natural language processing to extract symptoms of severe mental illness from clinical text: the Clinical Record Interactive Search Comprehensive Data Extraction (CRIS-CODE) project." BMJ open 7.1: e012012.

[20] Swartz, Jordan, et al. (2017) "Creation of a simple natural language processing tool to support an imaging utilization quality dashboard." International journal of medical informatics 101: 93-99.

[21] Bengio, Yoshua, Patrice Simard, and Paolo Frasconi. (1994) "Learning long-term dependencies with gradient descent is difficult." IEEE transactions on neural networks 5.2: 157-166.

[22] Hochreiter, Sepp, and Jürgen Schmidhuber. (1997) "Long short-term memory." Neural computation 9.8: 1735-1780.

[23] Graves, Alex, and Jürgen Schmidhuber. (2005) "Framewise phoneme classification with bidirectional LSTM networks." Proceedings. 2005 IEEE International Joint Conference on Neural Networks, 2005. Vol. 4. IEEE.

[24] Lafferty, John, Andrew McCallum, and Fernando C. N. Pereira. (2001) "Conditional random fields: Probabilistic models for segmenting and labeling sequence data." In Proceedings of ICML 2001, volume 951, pages 282-289.

[25] Lample, Guillaume, et al. (2016) "Neural architectures for named entity recognition." arXiv preprint arXiv:1603.01360.

[26] Mušarrafa, Alī Mustafā, and Muhammad Mursī Ahmad. Al-Khwārizmī, Kitab al-mukhtasar fī hisāb al-jabr wa'l-muqabalah, 1939.

[27] Roshdi Rashed, Al-Khwārizmī, Le commencement de l'algèbre, éd. 2009.

[28] Muhammad Bāqir Zayn al-'Ābidīn al-Yazdī, 'Uyun al-Hissab (Les Fontaines du calcul), undated manuscript, Harvard University, http://pds.lib.harvard.edu/pds/view/11328976?n=1&imagesize=1200&jp2Res=.25&printThumbnails=no.

[29] TensorFlow's implementation of Keras, https://www.tensorflow.org/guide/keras.