<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Optical Character Recognition For Arabic language using neural network</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author role="corresp">
							<persName><forename type="first">Abdelkarim</forename><surname>Mars</surname></persName>
							<email>abdelkarim.mars@gmail.com</email>
							<affiliation key="aff0">
								<orgName type="department">Science &amp; Technique Laboratory, Time Higher School</orgName>
							</affiliation>
						</author>
						<title level="a" type="main">Optical Character Recognition For Arabic language using neural network</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">100D7A6C8EC684FCEBE8E8D5AC5E3D38</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T13:22+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>OCR</term>
					<term>Artificial neural network</term>
					<term>Arabic character</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>As part of the DocumentToText project, piloted by TIME University and Horizon data society, we are developing an Optical Character Recognition (OCR) engine to convert the scanned books held by Tunisian universities into editable text documents. In this article we present our approach to building the OCR system and show the usefulness of artificial neural networks for recognizing Arabic characters. We describe the context of the work, our view of the particularities of the Arabic language with regard to the literature, and the reasons behind the decisions taken at each step of the implementation.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Introduction</head><p>The purpose of writing recognition is to transform a written text into a machine-readable representation that a word processor can easily reproduce. The task is hard because a word has a practically unlimited number of visual representations: each person has their own handwriting, each text can be rendered in many fonts and styles (bold, italic, underlined, shaded), and each document has a different layout. Depending on the type of writing that a system must recognize (handwritten, cursive or printed), the operations to be carried out and the results vary significantly.</p><p>Optical Character Recognition (OCR) technology has practical applications in several fields of activity, among which we can cite:</p><p>• Banks and insurance: authentication of bank checks (consistency between the amount in figures and the amount in words on the one hand, and between the identity of the signatory and the signature on the other), and verification of the clauses of insurance contracts.</p><p>• Mail: address reading and automatic mail sorting.</p><p>• Police and security: recognition of license plate numbers for traffic control, authentication and identification of manuscripts, and identification of the writer.</p><p>Arabic character recognition is a difficult problem <ref type="bibr" target="#b0">[1]</ref>, owing to the characteristics of the Arabic language <ref type="bibr" target="#b1">[2]</ref>. In this project we work with Modern Standard Arabic (MSA), which is a standardized variety used for official communication across the Arab world <ref type="bibr" target="#b2">[3]</ref>.</p><p>Earlier surveys covered both printed and handwritten characters, with more discussion of machine print <ref type="bibr" target="#b3">[4]</ref> <ref type="bibr" target="#b4">[5]</ref>. Our project uses a neural network approach to recognize Arabic characters.
We use a Multi-Layer Perceptron (MLP) to train our OCR system <ref type="bibr" target="#b5">[6]</ref>. The MLP is trained with the backpropagation algorithm to minimize the error of the trained model <ref type="bibr" target="#b6">[7]</ref>. Backpropagation is simply a gradient descent method that minimizes the total squared error of the output computed by the MLP network.</p></div>
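<div xmlns="http://www.tei-c.org/ns/1.0"><p>As a hedged illustration of the gradient descent view of backpropagation mentioned above, the squared error and the weight update can be written as follows. The symbols d_j (desired output) and y_j (network output) follow the notation of the error function given with the training algorithm; the learning rate \eta, the weight index w_{ik} and the factor 1/2 are conventional assumptions, not values taken from this work.</p>
```latex
E = \frac{1}{2} \sum_{j} \left( d_j - y_j \right)^2,
\qquad
w_{ik} \leftarrow w_{ik} - \eta \, \frac{\partial E}{\partial w_{ik}}
```
<p>Each weight moves a small step against the gradient of the error, so repeated updates reduce E as long as \eta is small enough.</p></div>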
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Characteristics of the Arabic language</head><p>Arabic is the 8th most spoken language in the world, with more than 400 million speakers <ref type="bibr" target="#b7">[8]</ref>. Yet character recognition technologies, even those of the leaders in document digitization, perform worse on Arabic than on Latin characters.</p><p>The most important step in OCR engines relies on a graphical analysis of the image to identify shapes, which are then matched against known characters.</p><p>Arabic writing, however, has characteristics of its own that pose difficulties for these engines <ref type="bibr" target="#b8">[9]</ref>:</p><p>• Arabic is a Semitic language: it uses three-letter roots, and vowels are not always written, so the engine has difficulty reconstructing the words.</p><p>• Diacritical signs (either compulsory or added to facilitate reading) accompany each word. During preprocessing, these signs can be suppressed by automatic image enhancement functions (which are often applied to old or damaged documents), which alters the expected result.</p><p>• Graphically, the characters lie along the baseline rather than standing upright as in most other scripts. Moreover, the script is cursive, and the continuity between characters hampers the segmentation needed to identify them.</p><p>• Arabic letters change shape depending on their position in the word: isolated, initial, middle or end (Table <ref type="table">1</ref>).</p><p>Table <ref type="table">1</ref>. Change of the shape of a letter according to its position, illustrated by the letter ع "Ayn".</p><p>There are, therefore, a number of obstacles inherent in the specificities of the language that may degrade the recognition of Arabic characters.
However, technologies have evolved and continue to evolve, bringing better performance and improving the quality of results.</p></div>
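<div xmlns="http://www.tei-c.org/ns/1.0"><p>The positional variation of a letter can be illustrated with Unicode itself, which encodes the four shapes of Ayn as separate presentation forms. The sketch below is only an illustration and is not part of the OCR system described here; the names AYN, AYN_FORMS and base_letter are hypothetical, while the code points are the standard Unicode values.</p>
```python
import unicodedata

# Base letter Ayn and its four positional glyphs from the Unicode block
# "Arabic Presentation Forms-B".
AYN = "\u0639"
AYN_FORMS = {
    "isolated": "\uFEC9",
    "final": "\uFECA",
    "initial": "\uFECB",
    "medial": "\uFECC",
}


def base_letter(glyph):
    """Map a positional presentation form back to its base letter."""
    return unicodedata.normalize("NFKC", glyph)


for position, glyph in AYN_FORMS.items():
    # All four shapes are the same letter as far as the text is concerned.
    print(position, unicodedata.name(glyph), base_letter(glyph) == AYN)
```
<p>For a recognizer this means that four visually different glyphs must all be mapped to the same output character.</p></div>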
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Preprocessing steps</head><p>Because of the sampling granularity and various lighting and capture problems, the image of a character may contain defects. These defects should be corrected, where possible, before any analysis. Moreover, it is not always useful to use every point of the character image to extract its characteristic properties, so a reduction step eliminates redundant points. The preprocessing techniques are as follows:</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Smoothing</head><p>The image of a character may be tainted by noise caused by acquisition artifacts and, often, by the quality of the document. This noise leads either to missing points (holes) or to blobs and protrusions, that is, to an excess of points. Smoothing techniques address these problems.</p></div>
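<div xmlns="http://www.tei-c.org/ns/1.0"><p>One simple way to realize such smoothing on a binarized character is a 3x3 majority vote, sketched below. This is an assumed technique chosen for illustration, since the text does not name the filter actually used; the function name smooth and the sample image are hypothetical.</p>
```python
def smooth(image):
    """Apply a 3x3 majority vote to a binary image (1 = ink, 0 = background)."""
    rows, cols = len(image), len(image[0])
    result = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            ink = 0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    # Pixels outside the image count as background.
                    if rr in range(rows) and cc in range(cols):
                        ink += image[rr][cc]
            # Keep ink only where at least 5 of the 9 pixels in the window are ink.
            result[r][c] = 1 if ink >= 5 else 0
    return result


noisy = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 0, 1, 0],  # hole in the middle of a stroke
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 1],  # isolated speck
]
cleaned = smooth(noisy)
```
<p>The vote fills the hole and removes the speck, but it also erodes sharp corners, so the window size is a trade-off between noise removal and loss of detail.</p></div>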
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Standardization of size</head><p>The size of a character can vary from one text to another, which can make the extracted parameters unstable. A natural preprocessing technique consists in bringing all characters to the same size. The normalization algorithm we used comes from the OpenCV<ref type="foot" target="#foot_0">1</ref> Python library.</p></div>
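<div xmlns="http://www.tei-c.org/ns/1.0"><p>The text does not say which OpenCV routine performs the normalization, so the sketch below shows the idea in plain Python instead: crop the glyph to its bounding box, then rescale it to a fixed square with nearest-neighbour sampling. The function name normalize_size and the target size are assumptions made for this example.</p>
```python
def normalize_size(image, size=8):
    """Crop a binary glyph to its bounding box and rescale it to size x size."""
    ink_rows = [r for r, row in enumerate(image) if any(row)]
    ink_cols = [c for c in range(len(image[0])) if any(row[c] for row in image)]
    if not ink_rows:
        # Blank image: nothing to crop, return an empty canvas.
        return [[0] * size for _ in range(size)]
    top, left = ink_rows[0], ink_cols[0]
    height = ink_rows[-1] - top + 1
    width = ink_cols[-1] - left + 1
    # Nearest-neighbour sampling: each target pixel copies one source pixel.
    return [
        [image[top + (r * height) // size][left + (c * width) // size]
         for c in range(size)]
        for r in range(size)
    ]


glyph = [
    [0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
normalized = normalize_size(glyph, size=4)
```
<p>This version stretches the glyph to fill the square and therefore does not preserve the aspect ratio; padding the shorter side before rescaling would be one alternative.</p></div>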
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">Thinning</head><p>The goal of thinning a character is to simplify its image into one that is easier to process, for example by reducing it to a skeleton in which the stroke thickness is a single pixel.</p></div>
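<div xmlns="http://www.tei-c.org/ns/1.0"><p>A classical way to obtain such a one-pixel skeleton is the Zhang-Suen thinning algorithm, sketched below. It is named here as an assumption: the text does not state which thinning method was used. The sketch expects a binary image whose outermost rows and columns are background, and the names thin, bar and skeleton are hypothetical.</p>
```python
def thin(image):
    """Zhang-Suen thinning of a binary image (1 = ink); returns a new image."""
    img = [row[:] for row in image]
    rows, cols = len(img), len(img[0])

    def neighbours(r, c):
        # P2..P9, clockwise starting from the pixel directly above.
        return [img[r - 1][c], img[r - 1][c + 1], img[r][c + 1], img[r + 1][c + 1],
                img[r + 1][c], img[r + 1][c - 1], img[r][c - 1], img[r - 1][c - 1]]

    changed = True
    while changed:
        changed = False
        for step in (0, 1):
            to_clear = []
            for r in range(1, rows - 1):
                for c in range(1, cols - 1):
                    if img[r][c] == 0:
                        continue
                    p = neighbours(r, c)
                    ink = sum(p)
                    # Number of 0-to-1 transitions around the pixel.
                    transitions = sum(
                        1 for a, b in zip(p, p[1:] + p[:1]) if (a, b) == (0, 1)
                    )
                    p2, p4, p6, p8 = p[0], p[2], p[4], p[6]
                    if step == 0:
                        ok = p2 * p4 * p6 == 0 and p4 * p6 * p8 == 0
                    else:
                        ok = p2 * p4 * p8 == 0 and p2 * p6 * p8 == 0
                    if ink in range(2, 7) and transitions == 1 and ok:
                        to_clear.append((r, c))
            # Pixels are cleared together, after the whole pass has been scanned.
            for r, c in to_clear:
                img[r][c] = 0
            changed = changed or bool(to_clear)
    return img


# A horizontal stroke three pixels thick (rows 2 to 4) inside a 7 x 9 image.
bar = [[1 if r in (2, 3, 4) and c in range(1, 8) else 0 for c in range(9)]
       for r in range(7)]
skeleton = thin(bar)
```
<p>On this stroke the algorithm keeps only pixels of the middle row, which is the one-pixel representation described above.</p></div>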
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Neural network approach</head><p>There are many methods to train and model an OCR system. Among them are neural networks, k-nearest neighbors, hidden Markov models (HMM) and expert systems.</p><p>For our OCR engine we used neural networks, a technology that has been present in the machine learning community for decades, gaining maturity each year and meeting ever more challenges.</p><p>Training a network of artificial neurons involves the following steps:</p><p>• Acquisition of the data forming the learning base.</p><p>• Pre-processing: locating, segmenting and normalizing the representations.</p><p>• Choice of attributes: after pre-processing, we must extract the attributes that define the data. These attributes serve as the inputs of the neural network.</p><p>Before processing the data, we have to choose the objects, define the attributes that characterize them and build the learning base. At the end of this phase we obtain a two-dimensional table of numbers indexed by the data items and the attributes that characterize them. Learning consists of presenting the examples sequentially and modifying the synaptic weights according to an equation called the learning equation.</p><p>An artificial neural network consists of simple processing elements with a very high degree of interconnection <ref type="bibr" target="#b9">[10]</ref>. The weights of the network are learned from training data. The weights are initialized for the input layer, the hidden layers and the final output layer. We used the cross-entropy function to compute the error. The information extracted from the data is propagated from the input layer to the output layer, which in this task yields a character.</p><p>We developed the algorithm listed in the accompanying figure to train our artificial neural network <ref type="bibr" target="#b10">[11]</ref>.</p></div>
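<div xmlns="http://www.tei-c.org/ns/1.0"><p>The training procedure described in this section (initialize the weights, propagate each example, compute the cross-entropy error, adjust the weights) can be sketched as a small multilayer perceptron in plain Python. This is a minimal sketch under stated assumptions, not the implementation used in this work: the single hidden layer, the sigmoid activations, the learning rate, the fixed number of epochs used as stop condition and the toy data are all hypothetical choices.</p>
```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def cross_entropy(targets, outputs, eps=1e-12):
    """E_CE = -sum_j [d_j log(y_j) + (1 - d_j) log(1 - y_j)]."""
    return -sum(
        d * math.log(y + eps) + (1 - d) * math.log(1 - y + eps)
        for d, y in zip(targets, outputs)
    )


class MLP:
    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = random.Random(seed)
        # Each row holds the weights of one neuron; the last entry is the bias.
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
                   for _ in range(n_hidden)]
        self.w2 = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
                   for _ in range(n_out)]

    def forward(self, x):
        h = [sigmoid(sum(w * v for w, v in zip(row, x + [1.0]))) for row in self.w1]
        y = [sigmoid(sum(w * v for w, v in zip(row, h + [1.0]))) for row in self.w2]
        return h, y

    def train_step(self, x, d, lr=0.5):
        """One backpropagation update for example x with target vector d."""
        h, y = self.forward(x)
        # With sigmoid outputs and cross-entropy, the output delta is y - d.
        delta_out = [yj - dj for yj, dj in zip(y, d)]
        delta_hid = [
            hk * (1 - hk) * sum(delta_out[j] * self.w2[j][k]
                                for j in range(len(delta_out)))
            for k, hk in enumerate(h)
        ]
        for j, row in enumerate(self.w2):
            for k, v in enumerate(h + [1.0]):
                row[k] -= lr * delta_out[j] * v
        for k, row in enumerate(self.w1):
            for i, v in enumerate(x + [1.0]):
                row[i] -= lr * delta_hid[k] * v
        return cross_entropy(d, y)


# Toy data standing in for extracted character features (logical OR).
DATA = [([0.0, 0.0], [0.0]), ([0.0, 1.0], [1.0]),
        ([1.0, 0.0], [1.0]), ([1.0, 1.0], [1.0])]


def total_error(net):
    """Cumulative cross-entropy error over the whole learning base."""
    return sum(cross_entropy(d, net.forward(x)[1]) for x, d in DATA)


net = MLP(n_in=2, n_hidden=3, n_out=1)
error_before = total_error(net)
for epoch in range(500):  # assumed stop condition: a fixed number of epochs
    for x, d in DATA:
        net.train_step(x, d)
error_after = total_error(net)
print(error_before, error_after)
```
<p>In a character recognizer the input vector would hold the attributes extracted from a glyph and the output layer would have one unit per character class.</p></div>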
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Evaluation of the OCR System</head><p>The acquisition step requires a database representing the different Arabic characters. To build it, we took an already scanned book from the library (Le cahier de la Tunisie) and segmented its words into characters. We extracted all the characters that make up the book using the OpenCV library, which gave us 100,000 clean characters. The algorithm used in this step is written in Python.</p><p>Once the data were ready, we applied the following preprocessing steps:</p><p>• Cleaning and thinning the image of each character in the database.</p><p>• Normalization of the size of each character.</p><p>• Centering the image.</p><p>• Extraction of attributes.</p><p>The classification is done by a multilayer perceptron network, using the backpropagation algorithm <ref type="bibr" target="#b10">[11]</ref>.</p><p>To test our system, we used 80% of the data for training and 20% for testing.</p><p>Once the model was trained, we passed the test data to the system and obtained an accuracy of 92%. The results are encouraging, but improvements are still required before we begin converting scanned books into text.</p></div>
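<div xmlns="http://www.tei-c.org/ns/1.0"><p>The 80/20 evaluation protocol can be sketched as follows. Whether the data were shuffled before splitting is not stated, so the shuffle, the seed and the names split_dataset and accuracy are assumptions made for this illustration; the sample list is placeholder data, not the character database.</p>
```python
import random


def split_dataset(samples, train_fraction=0.8, seed=0):
    """Shuffle the samples and split them into training and test subsets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def accuracy(predicted, expected):
    """Fraction of test characters whose predicted label matches the truth."""
    correct = sum(1 for p, e in zip(predicted, expected) if p == e)
    return correct / len(expected)


# Placeholder (features, label) pairs; a real run would use the extracted glyphs.
samples = [([i], "label") for i in range(100)]
train, test = split_dataset(samples)
print(len(train), len(test))
```
<p>Accuracy computed this way is the share of test characters that the trained model labels correctly, which is the kind of figure reported in this section.</p></div>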
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Conclusion</head><p>In this article we proposed an Optical Character Recognition system for the Arabic language based on a neural network approach. The method relies on a Multi-Layer Perceptron classifier, which yields effective results and high accuracy. In addition, the neural network approach allows us to reduce the computational complexity by exploiting the redundancy of the scanned letters. However, our OCR system still requires some improvements: we need to increase the size of our dataset and possibly use a deep learning approach to train a new model.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head></head><label></label><figDesc>Definition and allocation of the ANN
Initialization of all weights
Construction of a standardized training base
    For each example e from the learning base
        Resampling of example e
        Standardization of example e
        Extraction and saving of all the features of example e
        Saving the label of example e
While the stop condition is not satisfied
    For each example e from the learning base
        Propagation of example e
        Calculate the local error
    Checking the stop condition
End while
// Calculate the criterion
For each example e from the learning base
    Propagation of example e
    Calculate the local error
    Calculating the cumulative error
To calculate the errors we use the following function: $E_{CE} = -\sum_{j} \left[ d_j \log(y_j) + (1 - d_j) \log(1 - y_j) \right]$</figDesc></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">http://docs.opencv.org/2.4/modules/refman.html</note>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<title level="m" type="main">An On-Line Automatic Arabic Document Reader</title>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">F</forename><surname>Saleh</surname></persName>
		</author>
		<imprint>
			<date type="published" when="1998">1998</date>
		</imprint>
		<respStmt>
			<orgName>University of Basrah, Iraq</orgName>
		</respStmt>
	</monogr>
	<note type="report_type">MSc. Thesis</note>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Deep Learning for Feature Extraction of Arabic Handwritten Script</title>
		<author>
			<persName><forename type="first">M</forename><surname>Elleuch</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Tagougui</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Kherallah</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Computer Analysis of Images and Patterns: 16th International Conference, CAIP 2015</title>
				<meeting><address><addrLine>Valletta, Malta</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2015">September 2-4, 2015</date>
			<biblScope unit="page" from="371" to="382" />
		</imprint>
	</monogr>
	<note>Proceedings, Part II</note>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<title level="m">Ethnologue: languages of the world</title>
				<imprint>
			<publisher>SIL International</publisher>
			<date type="published" when="2000">2000</date>
		</imprint>
	</monogr>
	<note>14th ed</note>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Survey and bibliography of Arabic optical text recognition</title>
		<author>
			<persName><forename type="first">B</forename><surname>Al-Badr</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">A</forename><surname>Mahmoud</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Signal Processing</title>
		<imprint>
			<biblScope unit="volume">41</biblScope>
			<date type="published" when="2001">2001</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Off-line Arabic character recognition: the state of the art</title>
		<author>
			<persName><forename type="first">A</forename><surname>Amin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Pattern Recognition</title>
		<imprint>
			<biblScope unit="volume">31</biblScope>
			<biblScope unit="page" from="517" to="530" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<title level="m" type="main">Handwritten Arabic Numeral Recognition using Deep Learning Neural Networks</title>
		<author>
			<persName><forename type="first">A</forename><surname>Ashiquzzaman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Tushar</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2017-02-15">15 February 2017</date>
			<pubPlace>Dhaka, Bangladesh</pubPlace>
		</imprint>
		<respStmt>
			<orgName>Computer Science and Engineering Department, University of Asia Pacific</orgName>
		</respStmt>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">A multi-objective approach towards cost effective isolated handwritten Bangla character and digit recognition</title>
		<author>
			<persName><forename type="first">R</forename><surname>Sarkhel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Das</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">K</forename><surname>Saha</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Nasipuri</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Pattern Recognition</title>
		<imprint>
			<biblScope unit="volume">58</biblScope>
			<biblScope unit="page" from="172" to="189" />
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<monogr>
		<ptr target="http://www.infoplease.com/ipa/A0855611.html" />
		<title level="m">Languages spoken in each country of the world</title>
				<imprint>
			<date type="published" when="2016-12-25">2016-12-25</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<monogr>
		<author>
			<persName><forename type="first">R</forename><surname>Duda</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">E</forename><surname>Hart</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">G</forename><surname>Stork</surname></persName>
		</author>
		<title level="m">Pattern Classification</title>
				<imprint>
			<publisher>John Wiley &amp; Sons, Inc</publisher>
			<date type="published" when="2001">2001</date>
		</imprint>
	</monogr>
	<note>Second ed</note>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Handwriting recognition system for Arabic language learning</title>
		<author>
			<persName><forename type="first">A</forename><surname>Mars</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Antoniadis</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">International Journal of Engineering and Advanced Technology Studies</title>
		<imprint>
			<biblScope unit="volume">3</biblScope>
			<biblScope unit="issue">7</biblScope>
			<biblScope unit="page" from="55" to="63" />
			<date type="published" when="2015-09">September 2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Arabic on-line handwriting recognition for Arabic using neural network</title>
		<author>
			<persName><forename type="first">A</forename><surname>Mars</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Antoniadis</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">International Journal of Artificial Intelligence and Applications (IJAIA)</title>
		<imprint>
			<biblScope unit="volume">7</biblScope>
			<biblScope unit="issue">5</biblScope>
			<date type="published" when="2016-09">September 2016</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
