<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>COMPOUND SENTENCE SEGMENTATION AND SENTENCE BOUNDARY DETECTION IN URDU</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>ASAD IQBAL</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>ASAD HABIB</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>JAWAD ASHRAF</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute of Information Technology, Kohat University of Science and Technology</institution>
          ,
          <country country="PK">Pakistan</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2017</year>
      </pub-date>
      <fpage>99</fpage>
      <lpage>106</lpage>
      <abstract>
<p>Raw Urdu corpora comprise irregular and long sentences that must be properly segmented in order to be useful in Natural Language Engineering (NLE). This makes Compound Sentence Segmentation (CSS) a timely and vital research topic. Existing online text processing tools are developed mostly for computationally developed languages such as English, Japanese and Spanish, where sentence segmentation is done largely on the basis of delimiters.</p>
      </abstract>
      <kwd-group>
        <kwd>Urdu sentence segmentation</kwd>
        <kwd>sentence tokenization</kwd>
        <kwd>word tokenization</kwd>
        <kwd>compound sentence segmentation</kwd>
        <kwd>Urdu conjunction extraction</kwd>
        <kwd>Urdu sentence delimiter identification</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>
        INTRODUCTION:
Urdu Compound Sentence Segmentation using words and conjunctions as delimiters is a complex task. Most of the
available raw corpora contain long sentences that combine several sentences, with or without conjunctions. Such
sentences are called compound sentences, and they pose challenges for automated and computational processes such as
text summarization, parsing and named entity recognition [
        <xref ref-type="bibr" rid="ref1">7</xref>
        ][
        <xref ref-type="bibr" rid="ref13">19</xref>
        ].
There are some online tools that segment sentences on the basis of sentence termination marks, for example:
1.1. Automatic Sentence Segmentation
1.2. Morph Adorner Sentence Splitter
Automatic Sentence Segmentation converts plain text into a one-sentence-per-line format by adding a return code
after each sentence termination mark. Most tools cannot handle abbreviations such as Dr., Mr., p.m., Prof. and a.m.;
this online tool covers most of these abbreviations, and it also lets the user edit the resulting text to fix any
remaining ones [1]. Morph Adorner Sentence Splitter uses punctuation marks to split sentences, but punctuation does
not always mark a sentence boundary, for example in ellipses, abbreviations, acronyms and decimal numbers, and some
poems contain no termination marks at all. Morph Adorner Sentence Splitter works best on plain English text and,
unlike Automatic Sentence Segmentation, covers more abbreviations [2].
      </p>
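      <p>As an illustration of what such delimiter-based splitters do, the following minimal sketch (our own, not the cited tools' actual code) masks the periods of a small, assumed abbreviation list before splitting on termination marks:

```python
import re

# Illustrative abbreviation list; real tools cover many more.
ABBREVIATIONS = ["a.m.", "p.m.", "Dr.", "Mr.", "Prof."]

def split_sentences(text):
    # Temporarily mask periods inside known abbreviations so they
    # are not mistaken for sentence termination marks.
    masked = text
    for abbr in ABBREVIATIONS:
        masked = masked.replace(abbr, abbr.replace(".", "\x00"))
    # Split after ., ? or ! followed by whitespace, then unmask.
    parts = re.split(r"(?<=[.?!])\s+", masked)
    return [p.replace("\x00", ".").strip() for p in parts if p.strip()]

print(split_sentences("Dr. Smith arrived at 9 a.m. He left early. Why?"))
```

Note that masking makes every abbreviation period non-terminating, so a sentence that genuinely ends in "a.m." is merged with the next one; this is exactly the abbreviation ambiguity that the tools above try to mitigate.</p>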
      <p>Unlike Urdu, English does not suffer much from this problem, because most English sentences are already separated
from one another. Urdu, on the other hand, often has an entire paragraph written as a single sentence that itself
contains many sentences. Our proposed idea is to identify those sentences and split them into multiple sentences on the
basis of delimiting words [3,4].</p>
      <p>
        LITERATURE REVIEW:
Sentence segmentation is a relatively new research topic for computationally developing languages. We could not
find any automated sentence boundary segmentation tool for the Urdu language. Aroonmanakun, W. analyzed sentence and
word segmentation in the Thai language [5]. He considered sentence discourse, where combinations of phrases and
clues are used in each discourse segmentation step. Baseer et al. presented a Romanized Urdu Corpus model that
uses the tokens with the highest frequency of occurrence in a data set collected from participants who
use Romanized Urdu as a means of communication [6]. Xu et al. used IBM word alignment for sentence segmentation
in English-Chinese translation tasks in the news domain [
        <xref ref-type="bibr" rid="ref1">7</xref>
        ]. The focus of this technique is to use lexicon
information to identify sentence segmentation points and split compound sentences into multiple sentences. The paper
proposed splitting long sentences into multiple ones because the longer the sentence, the more problems their
system faced, resulting in higher computational cost and compromised word-alignment quality. Habib et al. presented a
novel approach that records properly segmented words in order to avoid the classical space-insertion and
space-deletion problems in native-script Urdu text [
        <xref ref-type="bibr" rid="ref7">13</xref>
        ][
        <xref ref-type="bibr" rid="ref9">15</xref>
        ]. Their proposed solutions can
be of direct value in Urdu sentence segmentation.
      </p>
      <p>
        Xue, N. and Yang, Y. focused on how Chinese uses commas, exclamation marks and question marks to indicate
sentence boundaries. Their model was trained and tested on data from the Chinese Treebank and achieved an
accuracy of up to 90% [
        <xref ref-type="bibr" rid="ref2 ref3">8,9</xref>
        ]. Rehman, Z. and Anwar, W. used a rule-based algorithm and a
unigram statistical model for Urdu sentence boundary disambiguation. Initial results before training and testing
were 90% precision, 92.45% F1-measure and 86% recall; after training and testing on the same data, results improved to
99.36% precision, 97.89% F1-measure and 96.45% recall [
        <xref ref-type="bibr" rid="ref4">10</xref>
        ]. Kiss and Strunk presented a language-independent
approach, assuming that once abbreviations are identified, most of the ambiguities in sentence boundary
detection are eliminated. To detect abbreviations with high accuracy, their system defines
three rules that require independence of context and of candidate type. The system was tested on different text types
in eleven different languages [
        <xref ref-type="bibr" rid="ref5">11</xref>
        ]. A number of related studies point to interesting aspects of text segmentation
and optimized input systems. Habib et al. proposed an optimized system and input methods for Urdu composing on
various modern devices [
        <xref ref-type="bibr" rid="ref12">18</xref>
        ]. Adnan et al. assessed the use of smartphone learning objects in
computing adaptive learning paths for undergraduate university students. Jurish and Würzner introduced the “WASTE”
method for segmenting text into tokens and sentences. A Hidden Markov Model is used for segment boundary
detection, with model parameters estimated from pre-segmented corpora, which are available as aligned
multilingual corpora and treebanks [
        <xref ref-type="bibr" rid="ref6">12</xref>
        ].
      </p>
      <p>
        Hearst, M.A. presented a technique called TextTiling, in which text is segmented into multi-paragraph units,
i.e. subtopics or passages. Subtopics are identified using lexical co-occurrence and distribution patterns. This
segmentation can further be used for text summarization and information retrieval [
        <xref ref-type="bibr" rid="ref10">16</xref>
        ]. Evang, K. et al. note that
the accuracy achieved by rule-based tokenization is not itself a problem; the issue with rule-based models is
their language-specific rules and maintenance. Their paper combined unsupervised feature learning with
supervised sequence labeling at the character level to accomplish segmentation with high word accuracy. The proposed
system was evaluated on three different languages, with error rates of 0.76% (Italian), 0.27% (English), and
0.35% (Dutch) [
        <xref ref-type="bibr" rid="ref10">16</xref>
        ]. Xu, L.F. et al. proposed segmenting long sentences into short ones using the conjunctions in
them. Long sentences take substantial machine translation resources to process. Punctuation was used previously for
segmentation, but for long sentences it was not enough. Their paper presents a rule-based approach that uses
conjunctions to segment long Chinese sentences; 901 conjunctions were found in 10 patent papers during
experimentation, and the rule-based approach achieved 89% accuracy [
        <xref ref-type="bibr" rid="ref11">17</xref>
        ]. Gibson, E. et al. proposed a
sentence comprehension technique. Sentence comprehension means applying constraints, using the available
computational resources, to integrate a variety of information sources. Four types of constraints are
explained in their paper: (1) phrase-level contingent frequency constraints, (2) locality-based computational resource
constraints, (3) contextual constraints, and (4) lexical constraints [
        <xref ref-type="bibr" rid="ref13">19</xref>
        ].
      </p>
      <p>URDU SENTENCE SEGMENTATION OPERATIONS:
Sentence segmentation is the process of identifying sentence boundaries. Previous research focused on using only
punctuation for boundary detection. Our proposed research uses not only punctuation but also words as
delimiters and conjunctions as boundary markers to segment sentences appropriately.</p>
      <p>
        3.1. RAW CORPUS:
The raw corpus has been collected from several sources, including websites, books, Urdu magazines and newspapers. We
categorize them into two general groups, online and offline. The online group comprises websites, online books and
newspapers, while the offline group comprises magazines, newspapers, books, etc. The collected raw corpus contains
long and compound sentences, including both conjunction and non-conjunction sentences. These sentences
require two different methodologies for segmentation: the former is compound sentence segmentation and the latter is
sentence tokenization, which are described in the following.
Conjunctions and delimiting words and characters are identified using word and character segmentation techniques.
Individual words and characters in the text are split by complex processing based on space, joiner and
non-joiner properties, keeping in view the Urdu-specific problems of space insertion and space deletion [3][
        <xref ref-type="bibr" rid="ref4">10</xref>
        ][
        <xref ref-type="bibr" rid="ref13">19</xref>
        ].
The segmented words and characters are then analyzed for conjunctions and delimiting words and characters.
      </p>
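      <p>This identification step can be sketched as follows; the conjunction list here uses romanized stand-ins for the Urdu words, since the actual system operates on a manually curated native-script list:

```python
# Romanized stand-ins for the Urdu conjunction list (illustrative only).
CONJUNCTIONS = {"magar", "lekin", "balkay", "kyunkay", "goya", "yani"}

def tokenize(text):
    """Plain whitespace tokenization; real Urdu word segmentation must also
    resolve space-insertion/space-deletion and joiner/non-joiner properties."""
    return text.split()

def find_delimiters(text):
    """Return the positions of conjunction tokens found in the text."""
    return [i for i, tok in enumerate(tokenize(text)) if tok in CONJUNCTIONS]

print(find_delimiters("woh aya magar der se aya lekin khush tha"))
```
</p>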
      <p>[Figure: Flow of the proposed system. A raw corpus collected from offline sources (books, newspapers, etc.) and online sources (Internet) passes through pre-processing and word segmentation; compound sentence segmentation then identifies conjunction and non-conjunction sentences, applies boundary markers and removes subordinate clauses to produce the resultant sentences.]</p>
      <p>3.3. COMPOUND SENTENCE IDENTIFICATION:
The issue with longer sentences is their high consumption of resources such as processing time. We look for
conjunction-containing sentences, and we observe that the subordinate clauses of such sentences are mostly
explanations of the main clause, which make the sentence extra long. There are two options for dealing with compound
sentences: the first is to eliminate those sentences completely, but by doing so we risk losing useful information; the
second is to trim those sentences by eliminating the subordinate clause, i.e. the explanatory part of the sentence.
We use a pattern matching approach to identify conjunction-containing sentences. We compiled a list of conjunction
words generated from a pre-tagged corpus; using this list, conjunction-containing sentences are easily detectable.
The list of conjunction words is given below.</p>
      <p>Naturally occurring raw corpus text examples are mentioned in the following.</p>
      <p>ﺮﮕﻣ ! !ﮑﻠﺑ ! !ﭽﻧﺎﻨﭼ ! ﯽﻨﻌ$ ! ﺎ"ﻮﮔ ! ﯽﻨﻌ$ ! ﻦﮑ#ﻟ ! !ﮑﻧﻮ%ﮐ
3.4. With Conjunction part:
3.5. Without Conjunction part:
ﯽ"ﺗ ﯽﺌﮔ ﭻﻨﮩﭘ ﯽ"ﺑ !ﺮﻔﻧ ﯽﮐ ﺲ"ﻟﻮﭘ ﮯﺌﻟ ﮯﮐ ﮯﻧﺮﮐ ﻮﺑﺎﻗ ﻮﮐ !ﻻﺎﺣ ﺮﭘ ﻊﻗﻮﻣ !" ﺎﻨ#ﺎﮐ ﯽﻠﺠﺑ ﯽﮐ ﮯﻗﻼﻋ !ﺎﻤﺗ !ﮐ ﺎﮩﮐ ﮯﻧ ﻢ"ﻌﻨﻟ&amp;ﺪﺒﻋ ﺮﻠﺴﻧﻮﮐ ﻖﺑﺎﺳ ﺮﮕﻣ
۔ﮯﺌﮨﺎﭼ ﯽﻨ#ﺎﮐ ﯽﻠﺠﺑ ﯽﮑﻧ$ !ﺮﺻ ﮟ"ﮨ !ﺎﺒﺟ$% ﺮﭘ !ﻮﮔﻮﻟ ﻦﺟ ﮯﮨ ﯽﺗ#ﺎ%&amp; ﺮﺳ#ﺮﺳ !ﺗﺎﺳ ﮯﮐ !"ﻮﻋ</p>
      <p>۔ ﯽ"ﺗ ﯽﺌﮔ ﭻﻨﮩﭘ ﯽ"ﺑ +ﺮﻔﻧ ﯽﮐ ﺲ1ﻟﻮﭘ ﮯﺌﻟ ﮯﮐ ﮯﻧﺮﮐ ﻮﺑﺎﻗ ﻮﮐ 7ﻻﺎﺣ ﺮﭘ ﻊﻗﻮﻣ &lt;=
We exclude “!"#” from the conjunction list because it is used not only as a joiner within a sentence but also as a
joiner between two nouns, for example:
ﻦﺸ#ﮉﻧ&amp;ﺎﻓ ﺖﻣﺪﺨﻟ. /0. ﯽﻣﻼﺳ. ﺖﻋﺎﻤﺟ
The remaining sentences in the corpus are combinations of multiple sentences. They do not contain
conjunctions, but they still need to be segmented. For this purpose, we pass the resulting text to the next process,
called Sentence Tokenization.</p>
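      <p>The two options above can be sketched as follows. This is a toy illustration with romanized placeholder words, assuming the subordinate (explanatory) clause follows the conjunction:

```python
CONJUNCTIONS = {"magar", "lekin", "balkay", "kyunkay"}  # romanized placeholders

def is_compound(words):
    """A sentence is treated as compound if it contains a listed conjunction."""
    return any(w in CONJUNCTIONS for w in words)

def trim_subordinate(words):
    """Option 2 in the text: keep the main clause and drop everything from
    the first conjunction onward (the explanatory part)."""
    for i, w in enumerate(words):
        if w in CONJUNCTIONS:
            return words[:i]
    return words

sentence = "police pohnch gayi magar bijli band thi".split()
print(is_compound(sentence))
print(trim_subordinate(sentence))
```
</p>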
      <p>3.6. SENTENCE TOKENIZATION:
Sentence tokenization splits large sentences into multiple ones. We target words as delimiters to
identify sentence boundaries. For example:
ﯽﮐ ﺮﮑﻓ ﺐﺗﺎﮑﻣ *ﺎﻤﺗ ,-. *.ﻮﻋ ﮯﮐ ﻮﮕﻨ4 ﺮﭘ 6ﺎﻓ- ﯽﮐ ﮯﺠ8ﺘ:ﺑ &lt;,ﺎﻤﮨ &gt;ﮐﺎﮩﮐ ﮯﻧ AﺎﺷCﺎﺑ ﻞﻘﻋ ﺮﺴ8ﻓG ﻦﺸ8ﮐﻮﺠJ. ﭧﮐﺮ$ﺳ&amp; ﯽ$ﭘ&amp;()*ﻮﭘ*,*ﻮ-ﺑ)ﻮﮕﻨ1
ﺮﭘ #ﺎﮔ ﺶﺋﺎﮨ) ﯽﻨﭘ, ﮯﻧ #ﺎﺷ0ﺎﺑ ﻞﻘﻋ ﺮﺴ6ﻓ8 ﻦﺸ6ﮐﻮﺠ&gt;, ﯽ?ﭘ@ )ﺎﮩB, ﺎﮐ Cﻻﺎ6ﺧ F, ﮟ6ﮨ)ﻮﮑﺸﻣ ﺖ&gt;ﺎﮩﻧ ﮯﮐ Kﻗﻼﻋ ﻞﮨ, ﺮﭘ Cﺎﻣﺎﻐ6ﭘ ﯽﺘ&gt;ﺰﻌﺗ ﮯﺳ ﺐﻧﺎﺟ
ﮯﻧ #$%$ﻮﮔﻮﺳ %)$ #$ﺪﻧﺎﺧ ﮯﮐ #$ ﮯﺳ ﯽﻧﺎﺑﺮﮩﻣ ﯽﮑ 5ٰﻟﺎﻌﺗ ﷲ! ﺮﮕﻣ ﮯﮨ '!ﺮﺨﻟ* +ﻌﻗ!. +ﮐ ﺎﮩﮐ ﮯﻧ 3ﻮﮩﻧ$ ﺎ&amp;ﮐ ﮯﺋﻮﮨ ﮯﺗﺮﮐ ﺖ&amp;ﭼ /ﺎﺑ ﮯﺳ 2ﻮ&amp;ﻓﺎﺤﺻ
!"ﺎﻤﻧ ﮯﻧ 'ﻮﮩﻨﺟ ,-ﺮﻀﺣ 1ﺎﻤﺗ 3- 3-4ﻮﮔﻮﺳ ﮯﮐ 1ﻮﺣﺮﻣ 9ﺎ:ﻓ &lt;ﮐﺎﮩﮐ ﯽﺋﻮﮨ ﺎﻄﻋ ﻞ:ﻤﺟ ﺮﺒﺻ ﻮﮐ 3-ﺪﻧﺎﺧ G4ﻮﭘ ﮯﺳ 'IﺎﻋJ ﮯﮐ &lt;ﻗﻼﻋ ﻞﮨ-‘ !ﻮﮔ$ﺰﺑ
۔ﮟ#ﮨ %ﻮﮑﺸﻣ ﯽﻟ, ﮯﮐ ﺐﺳ #$ ﯽﮐ ﺖ(ﺰﻌﺗ #ﻮﻓ ﯽﻠ/0 1ﻌ(2ﺬﺑ ﺎ( ﯽﮐ ﺖﮐﺮﺷ ﮟ/ﻣ :;ﺎﻨﺟ
The above example is a single large sentence of 110 words that is in fact a combination of multiple sentences. The
available Urdu corpus is full of such sentences. The delimiting words are identified manually from a single
chunk of the file, because analyzing a large file manually is time consuming: it may take weeks to generate a list of
delimiting words, and even then the list would not be complete. The delimiting-boundary list is not limited to this
corpus; to achieve better results in segmenting large sentences, the corpus needs constant updating, along with
constant manual updating of the delimiting words. We generated the list from one chunk of the file in order to carry
out our experiments, and we achieved significant results. For example, the single sentence above was analyzed for
delimiting words, and on the basis of this list it was segmented into 4 sentences with an average length of 27 words per
sentence.
ﯽﮐ ﺮﮑﻓ ﺐﺗﺎﮑﻣ *ﺎﻤﺗ ,-. *.ﻮﻋ ﮯﮐ ﻮﮕﻨ4 ﺮﭘ 6ﺎﻓ- ﯽﮐ ﮯﺠ8ﺘ:ﺑ &lt;,ﺎﻤﮨ "ﮐﺎﮩﮐ ﮯﻧ (ﺎﺷ*ﺎﺑ ﻞﻘﻋ ﺮﺴ1ﻓ3 ﻦﺸ1ﮐﻮﺠ89 ﭧﮐﺮ;ﺳ= ﯽ;ﭘ=(@AﻮﭘABAﻮ1ﺑ)ﻮﮕﻨE
۔ﮟ#ﮨ%ﻮﮑﺸﻣ ﺖ+ﺎﮩﻧ ﮯﮐ 1ﻗﻼﻋ ﻞﮨ6 ﺮﭘ 9ﺎﻣﺎﻐ#ﭘ ﯽﺘ+ﺰﻌﺗ ﮯﺳ ﺐﻧﺎﺟ
!ﻌﻗ#$ %ﮐ ﺎﮩﮐ ﮯﻧ +ﻮﮩﻧ# ۔ﺎ.ﮐ ﮯﺋﻮﮨ ﮯﺗﺮﮐ ﺖ.ﭼ 5ﺎﺑ ﮯﺳ +ﻮ.ﻓﺎﺤﺻ ﺮﭘ &lt;ﺎﮔ ﺶﺋﺎﮨ? ﯽﻨﭘ# ﮯﻧ &lt;ﺎﺷCﺎﺑ ﻞﻘﻋ ﺮﺴ.ﻓH ﻦﺸ.ﮐﻮﺠL# ﯽMﭘN ?ﺎﮩO# ﺎﮐ 5ﻻﺎ.ﺧ R#
ﻞ"ﻤﺟ ﺮﺒﺻ ﻮﮐ *+ﺪﻧﺎﺧ 01ﻮﭘ ﮯﺳ 56ﺎﻋ8 ﮯﮐ 9ﻗﻼﻋ ﻞﮨ+‘ !ﻮﮔ$ﺰﺑ ﮯﻧ )*$*ﻮﮔﻮﺳ $,* )*ﺪﻧﺎﺧ ﮯﮐ )* ﮯﺳ ﯽﻧﺎﺑﺮﮩﻣ ﯽﮑ 7ٰﻟﺎﻌﺗ ﷲ! ﺮﮕﻣ ﮯﮨ '!ﺮﺨﻟ*
ﮯﮐ ﺐﺳ %&amp; ﯽﮐ ﺖ)ﺰﻌﺗ %ﻮﻓ ﯽﻠ01 2ﻌ)3ﺬﺑ ﺎ) ۔ﯽﮐ ﺖﮐﺮﺷ ﮟ0ﻣ &lt;=ﺎﻨﺟ &lt;=ﺎﻤﻧ ﮯﻧ Bﻮﮩﻨﺟ D&amp;ﺮﻀﺣ Gﺎﻤﺗ %&amp; %&amp;3ﻮﮔﻮﺳ ﮯﮐ Gﻮﺣﺮﻣ Iﺎ0ﻓ 2ﮐﺎﮩﮐ ﯽﺋﻮﮨ ﺎﻄﻋ
۔۔ﮟ#ﮨ %ﻮﮑﺸﻣ ﯽﻟ,
3.6.1. Sentence Boundary Markers:
The complexity of delimiting words as sentence boundary markers rises when a word from the generated list appears as
part of another word, or appears in the middle or at the start of a sentence rather than at the end. To deal with
such sentences we need to take our approach to another level, as we cannot rely on unigrams. For example, the
following delimiter words are unigrams. We cannot use them as boundary markers on their own, because these words
can be parts of other words, which results in ambiguity; we require an n-gram approach. For example:
!" ! ﮯﺋﺎﺟ ! ﮯﺌﮐ ! ﮯﮑﺳ$ ! ﮟ"ﺮﮐ
ﺮﭘ ﮯﻧﺮ$% ﯽﺟﺎﺠﺘﺣ' ﮯﻨﻣﺎﺳ ﮯﮐ !"ﺎﮨ ﺮﻧ !ﻮﮔ !"# !"ﺎﮨ ﭧﻨ#ﻤ#ﻟ&amp;ﺎﭘ ﻢﮨ ﺮﮕ#$ !"ﻮﺼﺑ ﮟ"ﺮﮐ ﻞﺣ ﺮﭘ !"#ﺎ%ﻨﺑ ﯽﺤ#ﺟﺮﺗ ﻞﺋﺎﺴﻣ !"ﺎﻤﮨ
۔ ﮯﮕﻧﻮﮨ !ﻮﺒﺠﻣ
ﮟ"ﮨ ﮯﮑﭼﻮﮨ ﯽﻏﺎﺑ ﺮﺒﻤﻣ ﮯﮐ ﯽﻠﺒﻤﺳ&amp; ﯽﺋﺎﺑﻮﺻ ﮯﻨﭘ$ ﮯﮑﺳ$ !ﮑﻠﺑ ﮯﮑﺳﺎﺟ ﺎﮩﮐ !ﺎﮔ$ﺎ% ﻮﮐ ﺲﺟ ﺎ"ﮐ ﮟ"ﮩﻧ !ﺎﮐ ﺎﺴ#$ ﯽﺋﻮﮐ ﮯﻧ !ﻮﮩﻧ
(1
(2
(3
ﮟ"ﮩﻧ ﻦﮑﻤﻣ ﯽﻗﺮﺗ ﺮ"ﻐﺑ ﮯﺌﮐ ﻢﺘﺧ !ﻮﺳُﺎﻧ ﺎﮐ ﻞﻘﻧ ﮯﺳ !"#$%$ ﯽﻤ#ﻠﻌﺗ
(4
(5
ﮯﮨ ﯽﺗﺎﺟ !" ﺢ"ﺟﺮﺗ ﻮﮐ !ﻮﻣﺎﮐ ﯽﻋﺎﻤﺘﺟ' ﮯﺋﺎﺠﺑ ﯽﮐ !ﻮﻣﺎﮐ !"#ﺮﻔﻧ# ﺮﭘ !ﺎﮩ$
ﮯﮨ ﯽﺗﺎﺟ !" ﺢ"ﺟﺮﺗ ﻮﮐ !ﻮﻣﺎﮐ ﯽﻋﺎﻤﺘﺟ' ﮯﺋﺎﺠﺑ ﯽﮐ !ﻮﻣﺎﮐ !"#ﺮﻔﻧ# ﺮﭘ !ﺎﮩ$
In examples 1, 2 and 3 the delimiting words appear in the middle of the sentence. In example 2 we also see that
two delimiters are used, one as a separate word and the other as part of another word, for example ﮯﮑﺳﺎﺟ, which is
a combination of the word ﮯﮑﺳ$ and !. The same applies to examples 4 and 5, i.e. ﮯﺋﺎﺠﺑ and !"#ﺮﻔﻧ#. These delimiting words are
also combinations of words: ﮯﺋﺎﺠﺑ is a combination of ﮯﺋﺎﺟ and !, and !"#ﺮﻔﻧ# is a combination of !" and !ﺮﻔﻧ!.
That is not all: Urdu delimiting words have another issue, in that some of them are not
used in the sense of a termination mark at all. For example, ﮯﮑﺳ$ in example 2 above is not used as a sentence
boundary marker but in the sense of “his”. The proportion of such words is low, however, and they
can also be handled by the n-gram approach, so this does not create a problem. Table 1 below lists sentence
boundary markers of unigram, bigram and trigram type; we include only a few of the boundary markers, with their
frequencies from a single experiment.
Some sentences contain two delimiting words, and our model adds termination marks after both boundary
markers. This does not harm the sentence, because if the sentence ends at the first delimiting word it still retains
its meaning. For example:
3.6.1.1. With two delimiting words:</p>
      <p>۔ﮯﮨ ۔ﺎ#ﮔ ﺎ"# ﺎ"#$ﺑ ﮟ"ﻣ ﻦﺸ#$ﻮﭘ' ﮉﻨ#ﺮﮔ ﻮﮐ ﺐﺳ !" !" ﮯﮨ# ﮟ"ﻣ !"ﺪﺘﻗ" !ﺎﺳ !" ﻮﺟ</p>
      <p>3.6.1.2. Without second delimiting word:
۔ ﺎ"ﮔ ﺎ"# ﺎ"#$ﺑ ﮟ"ﻣ ﻦﺸ#$ﻮﭘ' ﮉﻨ#ﺮﮔ ﻮﮐ ﺐﺳ !" !" ﮯﮨ# ﮟ"ﻣ !"ﺪﺘﻗ" !ﺎﺳ !" ﻮﺟ
It can be observed that, from a semantic point of view, the sentence preserves its meaning.
The lack of appropriate hardware resources delayed the compilation of our experimental results. We used a
personal computer running Microsoft Windows 8 to process our general-genre raw Urdu corpus of 2616
sentences and 291503 words. Computationally exhaustive iterative processing was not feasible on such a workstation:
it took more than two and a half days to process our raw corpus file, and the run was still incomplete when the system
suddenly shut down due to hardware failure. The only viable solution to this problem was the customary divide and
conquer approach.</p>
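      <p>The divide-and-conquer workaround amounts to a simple chunking step before processing. A sketch, with the chunk count from our setup and an integer range standing in for corpus lines:

```python
def make_chunks(lines, n_chunks=14):
    """Split corpus lines into n roughly equal chunks for piecewise processing."""
    size = -(-len(lines) // n_chunks)  # ceiling division
    return [lines[i:i + size] for i in range(0, len(lines), size)]

# 2616 stands in for the 2616 sentences of the raw corpus.
chunks = make_chunks(list(range(2616)), 14)
print(len(chunks), len(chunks[0]))
```
</p>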
      <p>We divided the experimentation file into 14 smaller chunks and experimented chunk-wise to identify compound and
large sentences. Delimiting words were extracted together with their frequencies in the text. Compound sentences were
identified by the conjunctions present in them. All statistical results were accumulated and manually verified by
Urdu language experts. Consolidated results for the 14 chunks are shown in Table 2.
In Table 2, we also computed the frequency of !"# as a conjunction: it appears 485 times in the text. This was done
after realizing that !"# is used not only as a conjunction but also as a part of other words. For example:
These are words that contain the conjunction word “!"#”. These words are combinations of two words. For example,
in ﯽﺋﺰﮐ%&amp;' and !"#$, the word !"# is at the beginning, being a combination of ﯽﺋﺰﮐ + !"# and ! + !"# respectively.
Similarly, in !"#ﺎﺸﻣ and !ﺎﺷ$%ﻻ', the word !"# is in the middle, between ! + ﺶﻣ and !ﺎﺷ + !"
respectively. In the same manner, in words such as !"ﺎ$ﭽﻧ and !"ﺎﺸﭘ, the word !"# appears at the end. Our
proposed model identifies all of these as conjunctions, which increases the computational cost and algorithmic
complexity. The word “!"#” is used as a conjunction 135 times out of 485, i.e. 27% of its total occurrences. The
remaining 73%, i.e. 353 times, it appears as part of other words or otherwise in a non-conjunction role, so we exclude
it from the list of conjunctions. Distinguishing its use as a conjunction from its use as part of another word was
done manually.
The word “!"#” and other words posing similar problems are an interesting discovery of this research. “!ﭽﻧﺎﻨﭼ” is also a
conjunction word, but it does not appear even once in our existing corpus. Our corpus is growing continuously,
so we expect this and similar words to be encountered in subsequent experiments; we therefore did not
exclude it from the list of conjunction words.</p>
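      <p>The manual 27%/73% analysis amounts to counting, for a candidate conjunction, how often it occurs as a token of its own versus embedded inside longer tokens. A sketch with toy romanized tokens:

```python
def conjunction_usage(tokens, conj):
    """Count standalone occurrences of `conj` vs. occurrences embedded
    in longer tokens, and return both counts with the standalone share."""
    standalone = sum(1 for t in tokens if t == conj)
    embedded = sum(1 for t in tokens if conj in t and t != conj)
    total = standalone + embedded
    share = standalone / total if total else 0.0
    return standalone, embedded, share

tokens = ["aur", "chunaaur", "aurat", "aur"]  # toy data
print(conjunction_usage(tokens, "aur"))
```
</p>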
      <p>The complexity of Urdu increases when dealing with boundary markers. An Urdu boundary-marking word may
not always appear at the end of a sentence: it can be part of another word occurring anywhere in the sentence,
and it may itself appear in the middle of a sentence. Reaching a computational solution becomes even more difficult
because sometimes these words are not used in the sense of boundary markers at all. For example, ﮟ"ﮨ appears about
259 times in a single chunk of the file. We carried out the same experiment on all 14 chunks, but for simplicity we
consider only the first chunk here. It was not certain whether ﮟ"ﮨ always appears as a boundary marker or also as
part of other words. In our experimentation we found that ﮟ"ﮨ appeared 154 times (59%) as a boundary marker and
105 times (41%) as part of another word, for example ﮟ"ﮩﻧ, ﮟ"ﮨ$ﻮﺨﻨﺗ, ﮟ"ﮩﻧ%, ﮟ"ﮩﻤﺗ. The experimentation continued with
other boundary markers such as ﺎ"ﺗ: its frequency in the text is 67, of which it appears about 26 times (38%) as a
boundary marker and 41 times (62%) as part of other words.</p>
      <p>Most boundary markers can be parts of other words. To handle this we use a bi-gram approach, i.e. we combine the
marker with the closest following word. Using only bi-grams did not solve the problem, because the combination of
these two words may also appear in the middle of a sentence; to tackle this we add the next closest word, i.e. a
trigram approach, and, since even that did not fully solve the problem, the process continues: the general solution
is an n-gram approach. Similarly, ﮯ"ﺗ appears in the text 36 times and is used as a boundary marker 100% of the time. ﯽ"ﺗ
appears 13 times, of which 8 times (61%) as a boundary marker and 5 times (39%) as part of another
word or boundary marker, i.e. ﮟ"#ﺗ, and the same goes for the other boundary markers. The above-mentioned gazetteer of
delimiting words contains only 15 boundary markers; we have a manually created list of about 103 boundary markers,
and the list is still growing. The more the list grows, the more effective the results will be. Considering our
corpus, 103 is not a huge number of boundary markers, but it is enough for our experimentation. A larger list also
requires more processing, which our workstations cannot handle very effectively and which slows our experimentation,
so we restricted the list of boundary markers to 103.
Initially there were 184 sentences with 20548 words in experiment 1. After the compound segmentation and tokenization
processes we had 722 sentences. We obtained judgments for about 569 sentences, i.e. 78% of the sentences could be
categorized as accurate or inaccurate; of these, 461 (63%) had accurately marked boundaries and 108 (14%) were marked
inaccurately. The inaccuracy was due to boundary markers occurring as parts of other words. The remaining 22% were
one- or two-word sentences that appear when two delimiting words occur in a single sentence. For example:</p>
      <p>۔ﮯﮨ ۔ﺎ#ﮔ ﺎ"ﮐ !ﺎﻈﺘﻧ&amp; ﯽﺿ#ﺎﻋ ﺮﭘ !ﺎﮨ$ ﯽﮨ !ﻧ !"# ﮯﮨ !ﻮﺟﻮﻣ ﮓﻨﮐ$ﺎﭘ ﯽﺋﻮﮐ ﮟ"ﻣ !ﻮﻗﻼﻋ ﯽﺒ#ﺮﻗ ﻮﺗ !ﻧ ﮯ"ﻟ ﮯﮐ !ُ#
ﮯﮨ is the second boundary marker, and a single word was marked as a separate sentence. There were 153 such
sentences, which means that 22% of the sentences were useless and were discarded accordingly.</p>
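      <p>The unigram-to-n-gram boundary-marking strategy described above can be sketched as follows; the marker gazetteer and the non-final bigrams here are romanized, invented placeholders for the Urdu words:

```python
UNIGRAM_MARKERS = {"hain", "tha", "thi"}  # candidate boundary words
# Bigram contexts in which a candidate marker is NOT sentence-final.
NON_FINAL_BIGRAMS = {("hain", "aur"), ("thi", "magar")}

def mark_boundaries(words):
    """Append the Urdu full stop after a marker word unless the bigram it
    forms with the following word indicates the marker is not final."""
    out = []
    for i, w in enumerate(words):
        out.append(w)
        nxt = words[i + 1] if i + 1 < len(words) else None
        if w in UNIGRAM_MARKERS and (w, nxt) not in NON_FINAL_BIGRAMS:
            out.append("۔")
    return out

print(mark_boundaries(["member", "baghi", "hain", "aur", "chale", "gaye", "hain"]))
```

Extending the lookahead from one word (bigram) to two or more (trigram, n-gram) follows the same pattern.</p>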
      <p>CONCLUSION:
Our work regarding Compound Sentence Segmentation and Tokenization of large Sentences was pioneering work in
Urdu. The results generated by our proposed system are promising. The results generated for a single chunk of file
were generated manually. Therefore, we did not include other chunks of files. We realize that with having powerful
servers and with increasing delimiting words gazetteer list, we can improve our results further. Beside generating
statistical results, in future we will also analyze our model and its results by language expert, by comparing our
automatically tokenized sentences with human manually tokenized sentences to analyze its coherence and readability.</p>
      <p>REFERENCES:
[1] Yasumasa, S., 2016. Kansai University, Graduate School of Foreign Language Education and Research. Automatic</p>
<p>Sentence Segmentation, accessed (Feb 19). URL: http://www.someya-net.com/00-class09/sentenceDiv.html.
[2] Brian, L., Zillig, P., Ramsay, S., Mueller, M., and Smutniak, F., 2016. Academic Technologies and Research
Services. Morph Adorner Sentence Splitter, accessed (Feb 19). URL:
http://morphadorner.northwestern.edu/sentencesplitter/example/.
[3] Malik, A.A. and Habib, A., 2013. Urdu to English Machine Translation using</p>
      <p>Understudy. International Journal of Computer Applications, 82(7).</p>
      <p>Bilingual Evaluation
[4] Palmer, D.D., 2000. Tokenisation and sentence segmentation. In Handbook of Natural Language Processing. CRC Press.</p>
      <p>
[5] Aroonmanakun, W., 2007, December. Thoughts on word and sentence segmentation in Thai. In Proceedings of the Seventh Symposium on Natural Language Processing, Pattaya, Thailand, December 13–15 (pp. 85-90).</p>
      <p>
[6] Baseer, F., Habib, A. and Ashraf, J., 2016, August. Romanized Urdu Corpus Development (RUCD) model:
Edit-distance based most frequent unique unigram extraction approach using real-time interactive dataset. In Innovative
Computing Technology (INTECH), 2016 Sixth International Conference on (pp. 513-518). IEEE.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zens</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Ney</surname>
          </string-name>
          , H.,
          <year>2005</year>
          , May.
          <article-title>Sentence segmentation using IBM word alignment model 1</article-title>
          .
          <source>In Proceedings of EAMT</source>
          (pp.
          <fpage>280</fpage>
          -
          <lpage>287</lpage>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Gul</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Habib</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ashraf</surname>
          </string-name>
          ,
          <article-title>"Identification and extraction of Compose-Time Anomalies in Million Words Raw Urdu Corpus and Their Proposed Solutions"</article-title>
          ,
          <source>proceedings of the 3rd International Multidisciplinary Research Conference (IMRC)</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Xue</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <year>2011</year>
          .
          <article-title>Chinese sentence segmentation as comma classification</article-title>
          .
          <source>In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: short papers - Volume 2</source>
          (pp.
          <fpage>631</fpage>
          -
          <lpage>635</lpage>
          ).
          <article-title>Association for Computational Linguistics</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Rehman</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Anwar</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <year>2012</year>
          .
          <article-title>A hybrid approach for urdu sentence boundary disambiguation</article-title>
          .
          <source>International Arab Journal of Information Technology</source>
          ,
          <volume>9</volume>
          (
          <issue>3</issue>
          ), pp.
          <fpage>250</fpage>
          -
          <lpage>255</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Kiss</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Strunk</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <year>2006</year>
          .
          <article-title>Unsupervised multilingual sentence boundary detection</article-title>
          .
          <source>Computational Linguistics</source>
          ,
          <volume>32</volume>
          (
          <issue>4</issue>
          ), pp.
          <fpage>485</fpage>
          -
          <lpage>525</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Jurish</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Würzner</surname>
            ,
            <given-names>K.M.</given-names>
          </string-name>
          ,
          <year>2013</year>
          .
          <article-title>Word and Sentence Tokenization with Hidden Markov Models</article-title>
          .
          <source>JLCL</source>
          ,
          <volume>28</volume>
          (
          <issue>2</issue>
          ), pp.
          <fpage>61</fpage>
          -
          <lpage>83</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Habib</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Iwatate</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Asahara</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Matsumoto</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <year>2012</year>
          .
          <article-title>Keypad for large letter-set languages and small touchscreen devices (case study: Urdu)</article-title>
          .
          <source>International Journal of Computer Science</source>
          ,
          <volume>9</volume>
          (
          <issue>3</issue>
          ), ISSN: 1694-0814.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Hearst</surname>
            ,
            <given-names>M.A.</given-names>
          </string-name>
          ,
          <year>1997</year>
          .
          <article-title>TextTiling: Segmenting text into multi-paragraph subtopic passages</article-title>
          .
          <source>Computational Linguistics</source>
          ,
          <volume>23</volume>
          (
          <issue>1</issue>
          ), pp.
          <fpage>33</fpage>
          -
          <lpage>64</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [15]
          <string-name>
            <surname>Habib</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Iwatate</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Asahara</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Matsumoto</surname>
            ,
            <given-names>Y. W.K.</given-names>
          </string-name>
          ,
          <year>2013</year>
          .
          <article-title>Optimized and Hygienic Touch Screen Keyboard for Large Letter Set Languages</article-title>
          .
          <source>Proceedings of 7th ACM International Conference on Ubiquitous Information Management and Communication (ICUIMC)</source>
          , Kota Kinabalu, Malaysia.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [16]
          <string-name>
            <surname>Evang</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Basile</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chrupała</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Bos</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <year>2013</year>
          .
          <article-title>Elephant: Sequence labeling for word and sentence segmentation</article-title>
          .
          <source>EMNLP 2013</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [17]
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>L.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>L.J.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Jin</surname>
            ,
            <given-names>Y.H.</given-names>
          </string-name>
          ,
          <year>2014</year>
          .
          <article-title>Research on Sentence Segmentation with Conjunctions in Patent Machine Translation</article-title>
          .
          <source>Applied Mechanics and Materials</source>
          (Vol.
          <volume>513</volume>
          , pp.
          <fpage>4605</fpage>
          -
          <lpage>4609</lpage>
          ).
          Trans Tech Publications.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [18]
          <string-name>
            <surname>Habib</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Iwatate</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Asahara</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Matsumoto</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <year>2011</year>
          .
          <article-title>Different input systems for different devices: Optimized touch-screen keypad designs for Urdu scripts</article-title>
          .
          <source>Proceedings of Workshop on Text Input Methods WTIM2011</source>
          , IJCNLP, Chiang Mai, Thailand.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [19]
          <string-name>
            <surname>Gibson</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Pearlmutter</surname>
            ,
            <given-names>N.J.</given-names>
          </string-name>
          ,
          <year>1998</year>
          .
          <article-title>Constraints on sentence comprehension</article-title>
          .
          <source>Trends in Cognitive Sciences</source>
          ,
          <volume>2</volume>
          (
          <issue>7</issue>
          ), pp.
          <fpage>262</fpage>
          -
          <lpage>268</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>