COMPOUND SENTENCE SEGMENTATION AND SENTENCE BOUNDARY DETECTION IN URDU

ASAD IQBAL, ASAD HABIB, JAWAD ASHRAF
Institute of Information Technology, Kohat University of Science and Technology, Pakistan

ABSTRACT: Raw Urdu corpora comprise irregular and long sentences that must be properly segmented before they can be used in Natural Language Engineering (NLE). This makes Compound Sentence Segmentation (CSS) a timely and vital research topic. Existing online text processing tools are developed mostly for computationally developed languages such as English, Japanese and Spanish, where sentence segmentation is mostly performed on the basis of delimiters. Our proposed approach uses special characters as sentence delimiters, and computationally extracted sentence-end letters and sentence-end words as identifiers, for the segmentation of long and compound sentences. The raw, un-annotated input text is passed through preprocessing and word segmentation. Urdu word segmentation is itself a complex task, involving knotty problems such as space insertion and space deletion. Main and subordinate clauses are identified and marked for subsequent processing. The resultant text is further processed in order to identify, extract and then segment long as well as compound sentences into regular Urdu sentences. Urdu computational research is in its infancy; our work is pioneering in Urdu CSS, and the results achieved by our proposed approach are promising. For experimentation, we used a general-genre raw Urdu corpus containing 2616 sentences and 291503 words. Average sentence length was reduced from 111 to 38 words per sentence (w/s), i.e. to 34% of its original value, which almost tripled the number of sentences to 7536 shorter and computationally easier-to-manage sentences. The reliability and coherence of the resultant text were verified by Urdu language experts.

Keywords: Urdu sentence segmentation, sentence tokenization, word tokenization, compound sentence segmentation, Urdu conjunction extraction, Urdu sentence delimiter identification.

1. INTRODUCTION: Urdu compound sentence segmentation using words and conjunctions as delimiters is a complex task. Most of the available raw corpora contain long sentences that are combinations of two or more sentences, joined with or without conjunctions; we call such sentences compound sentences. They pose challenges for automated and computational processes such as text summarization, parsing and named entity recognition [7][19]. Some online tools segment sentences on the basis of sentence termination marks, for example:

1.1. Automatic Sentence Segmentation
1.2. Morph Adorner Sentence Splitter

Automatic Sentence Segmentation converts plain text into a one-sentence-per-line format by simply adding a return code after each sentence termination mark. Most such tools cannot handle abbreviations like Dr., Mr., p.m., Prof. and a.m.; this online tool covers most of these abbreviations, and it also provides an editing facility on the resulting text to handle the remaining ones [1]. Morph Adorner Sentence Splitter uses punctuation marks to split sentences, but punctuation does not always mark a sentence boundary, for example in ellipses, abbreviations, acronyms and decimal numbers, and some poems contain no termination marks at all. Morph Adorner Sentence Splitter works best for plain English text and, unlike Automatic Sentence Segmentation, it covers more abbreviations [2].
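To make the delimiter-based behaviour of such tools concrete, the following minimal Python sketch (our illustration, not either tool's actual implementation) contrasts naive splitting on termination marks with an abbreviation-aware variant; the abbreviation list is a small hypothetical sample.

```python
import re

# Naive delimiter-based splitting: break after ., ? or ! followed by
# whitespace - exactly the behaviour that mis-handles "Dr." and "Mr.".
def naive_split(text):
    return re.split(r'(?<=[.?!])\s+', text.strip())

# Abbreviation-aware variant: a token ending in a termination mark is a
# sentence end only if it is not a known abbreviation.
ABBREVIATIONS = {"Dr.", "Mr.", "Prof.", "a.m.", "p.m."}

def split_with_abbreviations(text):
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token[-1] in ".?!" and token not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

print(naive_split("Dr. Smith arrived. He left early."))
# ['Dr.', 'Smith arrived.', 'He left early.']  <- wrong split after "Dr."
print(split_with_abbreviations("Dr. Smith arrived. He left early."))
# ['Dr. Smith arrived.', 'He left early.']
# Cases like "9 a.m." remain genuinely ambiguous (the sentence may or may
# not end there), which is why a manual editing facility is still needed.
```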
English does not suffer from this problem as severely as Urdu, since most English sentences are already separated from one another. Raw Urdu text, on the other hand, often comes as paragraph-length strings, each of which contains many sentences; our proposed idea is to identify those sentences and split them into multiple sentences on the basis of delimiting words [3,4].

2. LITERATURE REVIEW: Sentence segmentation is a relatively new research topic in computationally developing languages, and we could not find any automated sentence boundary segmentation tool for the Urdu language. Aroonmanakun analyzed sentence and word segmentation in the Thai language [5], considering sentence discourse, where a combination of phrases and contextual clues is used for discourse segmentation. Baseer et al. presented a Romanized Urdu corpus development model that utilizes the tokens with the highest frequency of occurrence in a data set collected from participants who use Romanized Urdu as a means of communication [6]. Xu et al. used IBM word alignment model 1 for sentence segmentation in English-Chinese translation tasks in the news domain [7]. Their technique uses lexicon information to identify segmentation points at which compound sentences can be split into multiple sentences. They proposed splitting long sentences because the longer the sentence, the more problems their system faced, resulting in higher computation cost and compromised word alignment quality. Habib et al. presented a novel approach that records properly segmented words in order to avoid the classical space insertion and space deletion problems in native-script Urdu text [13][15]; their solutions are of direct value in Urdu sentence segmentation. Xue and Yang focused on how Chinese uses commas, exclamation marks and question marks to indicate sentence boundaries; their model was trained and tested on Chinese treebank data and achieved an accuracy of up to 90% [8,9]. Rehman and Anwar used a rule-based algorithm and a unigram statistical model to deal with Urdu sentence boundary disambiguation. Initial results, before training and testing, were 90% precision, 86% recall and 92.45% F1-measure; after training and testing on the same data, results improved to 99.36% precision, 96.45% recall and 97.89% F1-measure [10]. Kiss and Strunk presented a language-independent approach, based on the assumption that once abbreviations are identified, most of the ambiguities in sentence boundary detection are eliminated. To detect abbreviations with high accuracy, their system defines three rules that are independent of the context and type of the candidate. The system was tested on different text types in eleven different languages [11]. A number of related studies point to interesting aspects of text segmentation and optimized input systems. Habib et al. proposed optimized input systems and methods for composing Urdu on various modern devices [18]. Adnan et al. assessed the realization of smartphone learning objects in computing adaptive learning paths for undergraduate university students [20].
Jurish and Würzner introduced WASTE, a method for segmenting text into tokens and sentences that uses a Hidden Markov Model for segment boundary detection. Model parameters were estimated from pre-segmented corpora, which are available in the form of aligned multilingual corpora and treebanks [12]. Hearst presented a technique called TextTiling, in which text is segmented into multi-paragraph units, i.e. subtopics or passages. Subtopics are identified using lexical co-occurrence and distribution patterns, and the resulting segmentation can further be used for text summarization and information retrieval [14]. Evang et al. observed that tokenization is often treated as a solved problem for rule-based models, but that rule-based approaches suffer from language-specific rules and maintenance cost. They combined unsupervised feature learning with supervised character-level sequence labeling to achieve highly accurate word and sentence segmentation. Their system was evaluated on three languages, with error rates of 0.76% (Italian), 0.27% (English) and 0.35% (Dutch) [16]. Xu et al. proposed segmenting long sentences into short ones using the conjunctions they contain. Long sentences consume substantial machine translation resources; punctuation had previously been used for segmentation, but for long sentences it is not enough. They presented a rule-based approach that uses conjunctions to segment long Chinese sentences; 901 conjunctions were found in 10 patent documents during experimentation, and the rule-based approach achieved 89% accuracy [17]. Gibson and Pearlmutter discussed sentence comprehension, i.e. the integration of a variety of information sources under the constraints of available computational resources. They describe four types of constraints: (1) phrase-level contingent frequency constraints, (2) locality-based computational resource constraints, (3) contextual constraints, and (4) lexical constraints [19].

3. URDU SENTENCE SEGMENTATION OPERATIONS: Sentence segmentation is the process of identifying sentence boundaries. Previous research focused on using only punctuation for boundary detection. Our proposed approach uses not only punctuation but also words as delimiters and conjunctions as boundary markers, in order to segment sentences appropriately.

3.1. RAW CORPUS: The raw corpus was collected from several sources, including websites, books, Urdu magazines and newspapers. We categorize the sources into two general forms, online and offline: the online category comprises websites, online books and online newspapers, while the offline category comprises magazines, newspapers, books, etc. The collected raw corpus contains long and compound sentences, i.e. a mix of conjunction and non-conjunction sentences. These require two different methodologies for segmentation: the former, compound sentence segmentation, and the latter, sentence tokenization, which are described in the following.

3.2. WORDS/CHARACTER SEGMENTATION: Conjunctions and delimiting words and characters are identified using word and character segmentation techniques. Individual words and characters in the text are split off through complex processing based on space, joiner and non-joiner properties, keeping in view the complex Urdu-specific problems of space insertion and space deletion [3][10][19]. The segmented words and characters are then analyzed for conjunctions and for delimiting words and characters.
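The two branches this architecture implies (see Figure 1 below) can be sketched as follows. This is a minimal illustration under simplifying assumptions: whitespace splitting stands in for the joiner/non-joiner aware segmenter of Section 3.2, the conjunction list is the one given in Section 3.3, and the trimming rule is our reading of "removing the subordinate clause", not the authors' exact rule.

```python
CONJUNCTIONS = {"مگر", "لیکن", "بلکہ", "کیونکہ", "یعنی", "گویا", "چنانچہ"}

def segment_words(sentence):
    # Placeholder: real Urdu word segmentation must also resolve the
    # space-insertion and space-deletion errors described above.
    return sentence.split()

def trim_compound(words):
    """Keep the main clause of a conjunction sentence, dropping the
    explanatory subordinate clause (Section 3.3)."""
    for i, w in enumerate(words):
        if w in CONJUNCTIONS:
            # A sentence-initial conjunction introduces the sentence,
            # so the main clause follows it; otherwise it precedes.
            return words[i + 1:] if i == 0 else words[:i]
    return words

def route(raw_sentences):
    """Send conjunction sentences to compound sentence segmentation and
    the rest to sentence tokenization (Section 3.6)."""
    trimmed, to_tokenize = [], []
    for sentence in raw_sentences:
        words = segment_words(sentence)
        if any(w in CONJUNCTIONS for w in words):
            trimmed.append(trim_compound(words))
        else:
            to_tokenize.append(words)
    return trimmed, to_tokenize
```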
Figure 1. Compound Sentence Segmentation and Sentence Tokenization Architecture. (Offline sources such as books, and online sources such as newspapers and the internet, feed the raw corpus, which passes through preprocessing and word segmentation. Conjunction sentences are identified and routed to compound sentence segmentation, where the subordinate clause is removed; non-conjunction sentences are identified and boundary markers are applied; both paths yield the resultant sentences.)

3.3. COMPOUND SENTENCE IDENTIFICATION: The issue with longer sentences is their high consumption of resources such as processing time. We look for conjunction-containing sentences, and we observe that the subordinate clause is mostly an explanation of the main clause, which makes the sentence extra long. There are two options for dealing with compound sentences. The first is to eliminate such sentences completely, but by doing so we risk losing useful information. The second is to trim them by eliminating the subordinate clause, i.e. the explanatory part of the sentence. We use a pattern matching approach to identify conjunction-containing sentences. We compiled a list of conjunction words from a pre-tagged corpus; using this list, conjunction-containing sentences are easily detected. The list of conjunction words is given below:

یعنی ، چنانچہ ، بلکہ ، مگر ، گویا ، لیکن ، کیونکہ

Naturally occurring raw corpus text examples are given in the following.

3.4. With conjunction part:
‫ﻨﺎ "! ﻣﻮﻗﻊ ﭘﺮ ﺣﺎﻻ! ﮐﻮ ﻗﺎﺑﻮ ﮐﺮﻧﮯ ﮐﮯ ﻟﺌﮯ ﭘﻮﻟ"ﺲ ﮐﯽ ﻧﻔﺮ! ﺑ"ﯽ ﭘﮩﻨﭻ ﮔﺌﯽ ﺗ"ﯽ‬#‫ﻣﮕﺮ ﺳﺎﺑﻖ ﮐﻮﻧﺴﻠﺮ ﻋﺒﺪ&ﻟﻨﻌ"ﻢ ﻧﮯ ﮐﮩﺎ ﮐ! ﺗﻤﺎ! ﻋﻼﻗﮯ ﮐﯽ ﺑﺠﻠﯽ ﮐﺎ‬ ‫ﻨﯽ ﭼﺎﮨﺌﮯ۔‬#‫ﻧﮑﯽ ﺑﺠﻠﯽ ﮐﺎ‬$ !‫ﺟﺒﺎ! ﮨ"ﮟ ﺻﺮ‬$% ‫ﺗﯽ ﮨﮯ ﺟﻦ ﻟﻮﮔﻮ! ﭘﺮ‬#‫ﺎ‬%& ‫ﺳﺮ‬#‫ﻋﻮ"! ﮐﮯ ﺳﺎﺗ! ﺳﺮ‬

3.5. Without conjunction part:
‫ ﺑ"ﯽ ﭘﮩﻨﭻ ﮔﺌﯽ ﺗ"ﯽ ۔‬+‫ﺲ ﮐﯽ ﻧﻔﺮ‬1‫ ﮐﻮ ﻗﺎﺑﻮ ﮐﺮﻧﮯ ﮐﮯ ﻟﺌﮯ ﭘﻮﻟ‬7‫=< ﻣﻮﻗﻊ ﭘﺮ ﺣﺎﻻ‬

We exclude "اور" from the conjunction list because it is used not only as a joiner within a sentence but also as a joiner between two nouns, for example:

جماعت اسلامی اور الخدمت فاؤنڈیشن

The remaining sentences in the corpus are combinations of multiple sentences that contain no conjunctions but still need to be segmented. For this purpose, we pass the resultant text to the next process, called sentence tokenization.

3.6. SENTENCE TOKENIZATION: Sentence tokenization segments large single sentences into multiple ones. We target words as delimiters to identify sentence boundaries. For example:

‫ ﺗﻤﺎ* ﻣﮑﺎﺗﺐ ﻓﮑﺮ ﮐﯽ‬,-. *.‫ﻨﮕﻮ ﮐﮯ ﻋﻮ‬4 ‫ ﭘﺮ‬6‫ﻓﺎ‬- ‫ﺠﮯ ﮐﯽ‬8‫ﺘ‬:‫< ﺑ‬,‫ ﻧﮯ ﮐﮩﺎﮐ> ﮨﻤﺎ‬A‫ﺷﺎ‬C‫ﺴﺮ ﻋﻘﻞ ﺑﺎ‬8‫ﻓ‬G ‫ﺸﻦ‬8‫ﺠﻮﮐ‬J. ‫ﺮﮐﭧ‬$‫ﯽ &ﺳ‬$‫*ﭘﻮ*)(&ﭘ‬,*‫ﻮ‬-‫ﻨﮕﻮ)ﺑ‬1 ‫ ﭘﺮ‬#‫ﭘﻨﯽ )ﮨﺎﺋﺶ ﮔﺎ‬, ‫ ﻧﮯ‬#‫ﺷﺎ‬0‫ﺴﺮ ﻋﻘﻞ ﺑﺎ‬6‫ﻓ‬8 ‫ﺸﻦ‬6‫>ﺠﻮﮐ‬, ‫ﮩﺎ) @ﭘ?ﯽ‬B, ‫ ﮐﺎ‬C‫ﺎﻻ‬6‫ ﺧ‬F, ‫ﮟ‬6‫ ﮐﮯ ﻧﮩﺎ>ﺖ ﻣﺸﮑﻮ)ﮨ‬K‫ﮨﻞ ﻋﻼﻗ‬, ‫ ﭘﺮ‬C‫ﻐﺎﻣﺎ‬6‫ﺟﺎﻧﺐ ﺳﮯ ﺗﻌﺰ>ﺘﯽ ﭘ‬ ‫ ﻧﮯ‬#$%$‫ ﺳﻮﮔﻮ‬%)$ #$‫ ﮐﮯ ﺧﺎﻧﺪ‬#$ ‫ﮑﯽ ﻣﮩﺮﺑﺎﻧﯽ ﺳﮯ‬5ٰ ‫ *ﻟﺨﺮ!' ﮨﮯ ﻣﮕﺮ !ﷲ ﺗﻌﺎﻟ‬+‫!ﻗﻌ‬. +‫ ﻧﮯ ﮐﮩﺎ ﮐ‬3‫ﻧﮩﻮ‬$ ‫ ﭼ&ﺖ ﮐﺮﺗﮯ ﮨﻮﺋﮯ ﮐ&ﺎ‬/‫ ﺳﮯ ﺑﺎ‬2‫ﺻﺤﺎﻓ&ﻮ‬ !"‫ ﺟﻨﮩﻮ' ﻧﮯ ﻧﻤﺎ‬,-‫ ﺣﻀﺮ‬1‫ ﺗﻤﺎ‬3- 3-4‫ ﮐﮯ ﺳﻮﮔﻮ‬1‫ ﻣﺮﺣﻮ‬9‫ﺎ‬:‫ﻞ ﻋﻄﺎ ﮨﻮﺋﯽ ﮐﮩﺎﮐ< ﻓ‬:‫ ﮐﻮ ﺻﺒﺮ ﺟﻤ‬3-‫ ﺧﺎﻧﺪ‬G4‫' ﺳﮯ ﭘﻮ‬I‫ﻋﺎ‬J ‫ﮨﻞ ﻋﻼﻗ< ﮐﮯ‬-‘ !‫ﮔﻮ‬$‫ﺑﺰ‬ ‫ﮟ۔‬#‫ ﮨ‬%‫ﻟﯽ ﻣﺸﮑﻮ‬, ‫ ﺳﺐ ﮐﮯ‬#$ ‫ ﺗﻌﺰ(ﺖ ﮐﯽ‬#‫ﻠﯽ ﻓﻮ‬/0 1‫(ﻌ‬2‫ﮟ ﺷﺮﮐﺖ ﮐﯽ (ﺎ ﺑﺬ‬/‫ ﻣ‬:;‫ﺟﻨﺎ‬

The above example is a single large sentence, a combination of multiple sentences, containing 110 words. The available raw Urdu corpus is full of such sentences.
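Conceptually, sentence tokenization scans such a long word sequence and closes a sentence whenever a delimiting word from the gazetteer is found. A minimal sketch, with a tiny illustrative delimiter set (the real gazetteer is built in Section 3.6.1 and Table 1):

```python
def tokenize_sentences(words, delimiters):
    """Split one long word sequence into sentences, closing a sentence
    after each delimiting word. `delimiters` is the manually built
    gazetteer of sentence-end words."""
    sentences, current = [], []
    for w in words:
        current.append(w)
        if w in delimiters:
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)  # trailing words without a delimiter
    return sentences

# e.g. tokenize_sentences(long_sentence_words, {"ہیں", "تھا", "تھی"})
```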
These delimiting words were identified manually from a single chunk of the file, because analyzing the entire large file manually is time consuming: it could take weeks to generate a list of delimiting words, and even then the list would not be complete. The delimiting-word list is not limited to this corpus; to achieve better results in segmenting large sentences, it needs constant manual updating as the corpus grows. We generated the list from a single chunk in order to carry on our experimentation, and we achieved significant results. For example, the single sentence above was analyzed for delimiting words, and on the basis of this list it was segmented into 4 sentences with an average length of 27 words per sentence.

‫ ﺗﻤﺎ* ﻣﮑﺎﺗﺐ ﻓﮑﺮ ﮐﯽ‬,-. *.‫ﻨﮕﻮ ﮐﮯ ﻋﻮ‬4 ‫ ﭘﺮ‬6‫ﻓﺎ‬- ‫ﺠﮯ ﮐﯽ‬8‫ﺘ‬:‫< ﺑ‬,‫ﺴﺮ ﻋﻘﻞ ﺑﺎ*ﺷﺎ( ﻧﮯ ﮐﮩﺎﮐ" ﮨﻤﺎ‬1‫ﻓ‬3 ‫ﺸﻦ‬1‫ﺠﻮﮐ‬89 ‫@(=ﭘ;ﯽ =ﺳ;ﺮﮐﭧ‬A‫ﭘﻮ‬ABA‫ﻮ‬1‫ﻨﮕﻮ)ﺑ‬E ‫ﮟ۔‬#‫ﮨ‬%‫ﺖ ﻣﺸﮑﻮ‬+‫ ﮐﮯ ﻧﮩﺎ‬1‫ﮨﻞ ﻋﻼﻗ‬6 ‫ ﭘﺮ‬9‫ﻐﺎﻣﺎ‬#‫ﺘﯽ ﭘ‬+‫ﺟﺎﻧﺐ ﺳﮯ ﺗﻌﺰ‬ !‫ﻗﻌ‬#$ %‫ ﻧﮯ ﮐﮩﺎ ﮐ‬+‫ﻧﮩﻮ‬# ‫ﺎ۔‬.‫ﺖ ﮐﺮﺗﮯ ﮨﻮﺋﮯ ﮐ‬.‫ ﭼ‬5‫ ﺳﮯ ﺑﺎ‬+‫ﻮ‬.‫ﭘﻨﯽ ?ﮨﺎﺋﺶ ﮔﺎ< ﭘﺮ ﺻﺤﺎﻓ‬# ‫ﺷﺎ< ﻧﮯ‬C‫ﺴﺮ ﻋﻘﻞ ﺑﺎ‬.‫ﻓ‬H ‫ﺸﻦ‬.‫ﺠﻮﮐ‬L# ‫ﯽ‬M‫ﭘ‬N ?‫ﮩﺎ‬O# ‫ ﮐﺎ‬5‫ﺎﻻ‬.‫ ﺧ‬R# ‫* ﮐﻮ ﺻﺒﺮ ﺟﻤ"ﻞ‬+‫ ﺧﺎﻧﺪ‬01‫ ﺳﮯ ﭘﻮ‬56‫ﻋﺎ‬8 ‫ ﮐﮯ‬9‫ﮨﻞ ﻋﻼﻗ‬+‘ !‫ﮔﻮ‬$‫*) ﻧﮯ ﺑﺰ‬$*‫ ﺳﻮﮔﻮ‬$,* )*‫ﮑﯽ ﻣﮩﺮﺑﺎﻧﯽ ﺳﮯ *) ﮐﮯ ﺧﺎﻧﺪ‬7ٰ ‫*ﻟﺨﺮ!' ﮨﮯ ﻣﮕﺮ !ﷲ ﺗﻌﺎﻟ‬ ‫ ﺳﺐ ﮐﮯ‬%& ‫ ﺗﻌﺰ)ﺖ ﮐﯽ‬%‫ﻠﯽ ﻓﻮ‬01 2‫)ﻌ‬3‫ﮟ ﺷﺮﮐﺖ ﮐﯽ۔ )ﺎ ﺑﺬ‬0‫ ﻧﮯ ﻧﻤﺎ=< ﺟﻨﺎ=< ﻣ‬B‫ ﺟﻨﮩﻮ‬D&‫ ﺣﻀﺮ‬G‫ ﺗﻤﺎ‬%& %&3‫ ﮐﮯ ﺳﻮﮔﻮ‬G‫ ﻣﺮﺣﻮ‬I‫ﺎ‬0‫ ﻓ‬2‫ﻋﻄﺎ ﮨﻮﺋﯽ ﮐﮩﺎﮐ‬ ‫ﮟ۔۔‬#‫ ﮨ‬%‫ﻟﯽ ﻣﺸﮑﻮ‬,

3.6.1. Sentence Boundary Marker: The complexity of delimiting words as sentence boundary markers rises when a word from the generated list occurs as part of another word, or appears in the middle or at the start of a sentence rather than at the end. To deal with such sentences we must take our approach to another level, because we cannot rely on unigrams. For example, the following delimiter words are unigrams. We cannot use them as boundary markers alone, because they can be parts of other words, which results in ambiguity; we therefore require an n-gram approach:

کریں ، سکے ، کئے ، جائے ، دی

For example:

(1) ‫ﺮﻧﮯ ﭘﺮ‬$% ‫"! ﮔﻮ! ﻧﺮ ﮨﺎ"! ﮐﮯ ﺳﺎﻣﻨﮯ 'ﺣﺘﺠﺎﺟﯽ‬# !"‫ﻨﭧ ﮨﺎ‬#‫ﻤ‬#‫ﮕﺮ ﮨﻢ ﭘﺎ&ﻟ‬#$ !"‫"! ﭘﺮ ﺣﻞ ﮐﺮ"ﮟ ﺑﺼﻮ‬#‫ﺎ‬%‫ﺤﯽ ﺑﻨ‬#‫ﮨﻤﺎ"! ﻣﺴﺎﺋﻞ ﺗﺮﺟ‬ ‫ﻣﺠﺒﻮ! ﮨﻮﻧﮕﮯ ۔‬
(2) ‫ﭘﻨﮯ ﺻﻮﺑﺎﺋﯽ &ﺳﻤﺒﻠﯽ ﮐﮯ ﻣﻤﺒﺮ ﺑﺎﻏﯽ ﮨﻮﭼﮑﮯ ﮨ"ﮟ‬$ ‫ﺳﮑﮯ‬$ !‫ﮔﺎ! ﮐﮩﺎ ﺟﺎﺳﮑﮯ ﺑﻠﮑ‬$‫ﺎ‬% ‫ﺴﺎ ﮐﺎ! ﻧﮩ"ﮟ ﮐ"ﺎ ﺟﺲ ﮐﻮ‬#$ ‫ﻧﮩﻮ! ﻧﮯ ﮐﻮﺋﯽ‬
(3) ‫"! ﺳﮯ ﻧﻘﻞ ﮐﺎ ﻧﺎﺳُﻮ! ﺧﺘﻢ ﮐﺌﮯ ﺑﻐ"ﺮ ﺗﺮﻗﯽ ﻣﻤﮑﻦ ﻧﮩ"ﮟ‬#$%$ ‫ﻤﯽ‬#‫ﺗﻌﻠ‬
(4) ‫"! ﮐﺎﻣﻮ! ﮐﯽ ﺑﺠﺎﺋﮯ 'ﺟﺘﻤﺎﻋﯽ ﮐﺎﻣﻮ! ﮐﻮ ﺗﺮﺟ"ﺢ "! ﺟﺎﺗﯽ ﮨﮯ‬#‫ﻧﻔﺮ‬# ‫ﮩﺎ! ﭘﺮ‬$
(5) ‫"! ﮐﺎﻣﻮ! ﮐﯽ ﺑﺠﺎﺋﮯ 'ﺟﺘﻤﺎﻋﯽ ﮐﺎﻣﻮ! ﮐﻮ ﺗﺮﺟ"ﺢ "! ﺟﺎﺗﯽ ﮨﮯ‬#‫ﻧﻔﺮ‬# ‫ﮩﺎ! ﭘﺮ‬$

In examples 1, 2 and 3 the delimiting words appear in the middle of the sentence. In example 2 we also see two delimiters, one as a separate word and the other as part of another word, e.g. جاسکے, which is a combination of the word سکے and جا. The same applies to examples 4 and 5, i.e. بجائے and انفرادی: these delimiting words are also combinations of words, i.e. بجائے is a combination of جائے and ب, and انفرادی is a combination of دی and انفرا. A further issue with Urdu delimiting words is that some of them are not always used in the sense of a termination mark: for example, سکے in example 2 above is used in the sense of "his" (as part of اسکے) rather than as a sentence boundary marker. However, the ratio of such words is low, and they too can be handled by the n-gram approach, so this does not create a problem.
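The escalation from unigrams to n-grams can be sketched as follows. Exact token matching already keeps a marker such as ہیں from firing inside نہیں; markers that are unreliable as single tokens are kept only as bigram or trigram patterns. The marker sets below are small illustrative samples (cf. Table 1 below), not the full gazetteer.

```python
UNIGRAM_MARKERS = {"ہیں", "تھا", "تھے"}             # reliable on their own
BIGRAM_MARKERS = {("کیا", "گیا"), ("حاصل", "کیے")}  # need one word of context

def is_boundary(words, i):
    """Decide whether position i in the token list ends a sentence.
    Exact-token tests avoid the substring problem (ہیں vs نہیں);
    multi-word patterns handle markers that are ambiguous as unigrams.
    Mid-sentence uses that survive even trigram context still require
    manual review, as described above."""
    if words[i] in UNIGRAM_MARKERS:
        return True
    return i > 0 and (words[i - 1], words[i]) in BIGRAM_MARKERS
```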
Table 1 below presents sentence boundary markers of unigram, bigram and trigram form. We include only a few of the boundary markers, with their frequencies, from a single experiment.

S. No.  Frequency (sorted)  Sentence Boundary Marker
1       259                 ہیں
2       67                  تھا
3       57                  گے
4       36                  تھے
5       20                  کیا گیا
6       13                  تھی
7       13                  ہوگا
8       11                  ہوگی
9       5                   رہاہے
10      1                   کرتاہے
11      1                   حوالے کئے
12      1                   حاصل کیے
13      1                   کرائی ہے
14      1                   تلاش شروع کردی
15      1                   دی جائیں

Table 1: Sentence boundary markers list with corresponding frequencies

‫"ﺎ ﺟﺎﺋﮯ‬#‫ﻟﮯ ﮐﺌﮯ ! ﮐﺮ‬#‫ﺳﮑﮯ ! ﺣﻮ‬$ ‫ﺴﺮ‬#‫ﺑﻄ! ﮐﺮ"ﮟ ! ﺗﻼ! ﺷﺮ"! ﮐﺮ"! ! ﻣ‬$%

Some sentences contain two delimiting words, and our model adds termination marks after both boundary markers. This does not harm the sentence: if the sentence ends at the first delimiting word, it still retains its meaning. For example:

3.6.1.1. With two delimiting words:
‫ﺎ۔ ﮨﮯ۔‬#‫"ﺎ ﮔ‬# ‫"ﺎ‬#$‫ﺸﻦ ﻣ"ﮟ ﺑ‬#$‫ﻨﮉ 'ﭘﻮ‬#‫ﮨﮯ "! "! ﺳﺐ ﮐﻮ ﮔﺮ‬# ‫ﺟﻮ "! ﺳﺎ! "ﻗﺘﺪ"! ﻣ"ﮟ‬

3.6.1.2. Without second delimiting word:
‫"ﺎ ﮔ"ﺎ ۔‬# ‫"ﺎ‬#$‫ﺸﻦ ﻣ"ﮟ ﺑ‬#$‫ﻨﮉ 'ﭘﻮ‬#‫ﮨﮯ "! "! ﺳﺐ ﮐﻮ ﮔﺮ‬# ‫ﺟﻮ "! ﺳﺎ! "ﻗﺘﺪ"! ﻣ"ﮟ‬

It can be observed that, from the semantic point of view, the sentence preserves its meaning.

4. EVALUATION: The lack of appropriate hardware resources delayed the compilation of our experimental results. We used a personal computer running Microsoft Windows 8 to process our general-genre raw Urdu corpus of 2616 sentences and 291503 words. Computationally exhaustive iterative processing was not possible on such a workstation: processing the raw corpus file had already taken more than two and a half days, and had still not completed, when the system suddenly shut down due to hardware failure. The only feasible solution was the customary divide-and-conquer approach. We divided the experimentation file into 14 smaller chunks and experimented chunk-wise to identify compound and large sentences. Delimiting words were extracted together with their respective frequencies in the text, and compound sentences were identified by the conjunctions present in them. All statistical results were accumulated and manually verified by Urdu language experts. Consolidated results for the 14 chunks are shown in Table 2.

S.No.  Sentences  Words   بلکہ  چنانچہ  یعنی  گویا  لیکن  کیونکہ  مگر  اور
1      184        20548   16    0       0     0     8     12      22   485
2      175        21794   10    0       2     0     11    8       20   517
3      274        28034   14    0       1     0     13    10      41   697
4      285        26372   15    0       1     0     8     4       29   661
5      270        27165   13    0       1     0     11    9       41   677
6      264        22956   15    0       1     0     9     3       22   582
7      211        25773   6     0       2     0     10    7       27   629
8      163        25621   17    0       1     1     17    7       36   626
9      184        22228   7     0       1     1     6     5       45   568
10     124        15326   7     0       2     0     8     2       15   335
11     104        12364   4     0       0     0     2     2       19   284
12     99         10172   6     0       1     0     2     4       17   277
13     142        15075   4     0       1     0     14    9       18   400
14     137        18075   3     0       0     0     11    6       28   457
∑      2616       291503  137   0       14    2     130   88      380  7195

Table 2: Conjunctions and delimiting words extracted from the raw Urdu corpus

In Table 2, we also report the frequency of اور as a conjunction; it appears 485 times in the first chunk. We counted it because اور is used not only as a conjunction but also as part of other words, for example:

‫ﮐﺰﺋﯽ‬%&' ! !"#‫ﺎ"! ! ﻣﺸﺎ‬$‫ﺷﺎ! ! ﭘﺸﺎ"! ! ﻧﭽ‬$%‫"! ! 'ﻻ‬#$

These are words that contain the conjunction اور; each is a combination of two parts. For example, in words such as اورکزئی, the word اور is at the beginning (اور + کزئی). Similarly, in words such as مشاورت, اور appears in the middle, between two parts (مش + اور + ت). In the same manner, in words such as نچھاور and پشاور, the word اور appears at the end.
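This standalone-versus-embedded distinction corresponds to a simple check over segmented tokens; a minimal sketch (the counts reported in this section come from the authors' manual analysis, not from this code):

```python
def standalone_vs_embedded(words, target):
    """Count occurrences of `target` (e.g. اور) as its own token, where
    it is a candidate conjunction or boundary marker, versus inside
    longer tokens such as پشاور or نچھاور, where it must not trigger
    segmentation."""
    standalone = sum(1 for w in words if w == target)
    embedded = sum(1 for w in words if target in w and w != target)
    return standalone, embedded
```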
Our proposed model identifies all of these as conjunctions, which increases computational cost and algorithmic complexity. In 135 of its 485 occurrences, i.e. 27%, the word اور is used as a conjunction; the remaining 73%, i.e. 353 occurrences, are as part of other words or otherwise in a non-conjunction role. We therefore excluded it from the list of conjunctions. Distinguishing its use as a conjunction from its use as part of another word was done manually. The word اور, and other words posing a similar problem, are an interesting discovery of this research. چنانچہ is also a conjunction word, but it does not appear even once in our existing corpus. Our corpus is growing continuously, so we expect this and similar words to be encountered in subsequent experiments; we therefore did not exclude it from the conjunction list.

The complexity of the Urdu language increases when dealing with boundary markers. An Urdu boundary-marking word does not always appear at the end of a sentence: it can be part of another word occurring anywhere in the sentence, and it may itself appear mid-sentence. A computational solution becomes even harder because such words are sometimes not used in the sense of boundary markers at all. For example, ہیں appears 259 times in a single chunk of the file. We carried out the same experiment on all 14 chunks, but for simplicity we consider only the first chunk here. It was not certain whether ہیں always appears as a boundary marker or also as part of other words. In our experimentation we found that ہیں appeared 154 times (59%) as a boundary marker and 105 times (41%) as part of another word, for example نہیں، تنخواہیں، انہیں، تمہیں. The experimentation continued with other boundary markers, such as تھا: its frequency in the text is 67, of which 26 occurrences (38%) are as a boundary marker and 41 (62%) are as part of other words. Since most boundary markers can occur as parts of other words, we combine a marker with its closest neighboring word, i.e. a bi-gram approach. The bi-gram approach alone did not solve the problem, because the combined two words may also appear in the middle of a sentence; to tackle this we added the third closest word, i.e. a tri-gram approach, and the process continues as needed. The general solution is thus an n-gram approach. Similarly, تھے appears 36 times in the text and is used as a boundary marker 100% of the time, while تھی appears 13 times, 8 of them (61%) as a boundary marker and 5 (39%) as part of another word or boundary marker, i.e. تھیں; the same goes for the other boundary markers. The above-mentioned gazetteer list contains only 15 boundary markers; our manually created list contains about 103 and is still growing, and the more the list grows, the more effective the results will be. Considering the corpus we have, 103 is not a huge list of boundary markers, but it is enough for our experimentation. As the list grows, more processing is required; our workstations cannot handle that kind of processing effectively, which slows down the experimentation process, so we restricted the list of boundary markers to 103.
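The gazetteer itself can be maintained as a frequency-sorted count of confirmed markers, as in Table 1; a minimal sketch:

```python
from collections import Counter

def build_gazetteer(confirmed_markers):
    """Build a frequency-sorted boundary-marker gazetteer (cf. Table 1)
    from marker occurrences confirmed by manual analysis."""
    counts = Counter(confirmed_markers)
    return counts.most_common()  # [(marker, frequency), ...], sorted

# e.g. build_gazetteer(["ہیں"] * 259 + ["تھا"] * 67 + ["گے"] * 57)
# -> [('ہیں', 259), ('تھا', 67), ('گے', 57)]
```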
Initially there were 184 sentences with 20548 words in experiment 1. After the compound sentence segmentation and tokenization processes we obtained 722 sentences. Judgments were obtained for about 569 sentences, i.e. 78% of the sentences could be categorized as accurate or inaccurate; of these, 461 (63%) had accurately marked boundary markers and 108 (14%) were marked inaccurately. The inaccuracy was due to boundary markers occurring as parts of other words. The remaining 22% were one- or two-word sentences that arise when a single sentence contains two delimiting words. For example:

‫ﺎ۔ ﮨﮯ۔‬#‫ﺿﯽ &ﻧﺘﻈﺎ! ﮐ"ﺎ ﮔ‬#‫ﮨﺎ! ﭘﺮ ﻋﺎ‬$ ‫"! ﻧ! ﮨﯽ‬# ‫ﮐﻨﮓ ﻣﻮﺟﻮ! ﮨﮯ‬$‫ﺒﯽ ﻋﻼﻗﻮ! ﻣ"ﮟ ﮐﻮﺋﯽ ﭘﺎ‬#‫ُ! ﮐﮯ ﻟ"ﮯ ﻧ! ﺗﻮ ﻗﺮ‬#

Here ہے is the second boundary marker, and just a single word is marked as a separate sentence. There were 153 such sentences, which means that 22% of the sentences were useless and were discarded accordingly.

5. CONCLUSION: Our work on compound sentence segmentation and tokenization of large sentences is pioneering work for Urdu, and the results generated by our proposed system are promising. The results for a single chunk of the file were compiled manually; therefore we did not include the other chunks. We expect that with more powerful servers and a growing gazetteer of delimiting words, our results can be improved further. Besides generating statistical results, in future work we will also have language experts analyze our model and its results, comparing our automatically tokenized sentences with manually tokenized ones to assess coherence and readability.

REFERENCES:
[1] Yasumasa, S., 2016. Automatic Sentence Segmentation. Kansai University Graduate School of Foreign Language Education and Research. Accessed Feb 19. http://www.someya-net.com/00-class09/sentenceDiv.html.
[2] Brian, L., Zillig, P., Ramsay, S., Mueller, M. and Smutniak, F., 2016. Morph Adorner Sentence Splitter. Academic Technologies and Research Services. Accessed Feb 19. http://morphadorner.northwestern.edu/sentencesplitter/example/.
[3] Malik, A.A. and Habib, A., 2013. Urdu to English machine translation using Bilingual Evaluation Understudy. International Journal of Computer Applications, 82(7).
[4] Palmer, D.D., 2000. Tokenisation and sentence segmentation. In Handbook of Natural Language Processing. CRC Press.
[5] Aroonmanakun, W., 2007. Thoughts on word and sentence segmentation in Thai. In Proceedings of the Seventh Symposium on Natural Language Processing, Pattaya, Thailand, December 13–15 (pp. 85–90).
[6] Baseer, F., Habib, A. and Ashraf, J., 2016. Romanized Urdu Corpus Development (RUCD) model: Edit-distance based most frequent unique unigram extraction approach using real-time interactive dataset. In Innovative Computing Technology (INTECH), 2016 Sixth International Conference on (pp. 513–518). IEEE.
[7] Xu, J., Zens, R. and Ney, H., 2005. Sentence segmentation using IBM word alignment model 1. In Proceedings of EAMT (pp. 280–287).
[8] Gul, A., Habib, A. and Ashraf, J., 2016. Identification and extraction of compose-time anomalies in million words raw Urdu corpus and their proposed solutions. In Proceedings of the 3rd International Multidisciplinary Research Conference (IMRC).
Ashraf, "Identification and extraction of Compose-Time Anomalies in Million Words Raw Urdu Corpus and Their Proposed Solutions", proceedings of the 3rd International Multidisciplinary Research Conference (IMRC), 2016. [9] Xue, N. and Yang, Y., 2011. Chinese sentence segmentation as comma classification. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: short papers- Volume 2 (pp. 631-635). Association for Computational Linguistics. [10] Rehman, Z. and Anwar, W., 2012. A hybrid approach for urdu sentence boundary disambiguation. International Arab Journal of Information Technology, 9(3), pp.250-255. [11] Kiss, T. and Strunk, J., 2006. Unsupervised multilingual sentence boundary detection. Computational Linguistics, 32(4), pp.485-525. [12] Jurish, B. and Würzner, K.M., 2013. Word and Sentence Tokenization with Hidden Markov Models. JLCL, 28(2), pp.61-83. [13] Habib, A. Iwatate, M., Asahara, M. Matsumoto, Y., 2012. Keypad for large letter-set languages and small touch- screen devices (case study: Urdu). International Journal of Computer Science 9(3), ISSN: 1694-0814. [14] Hearst, M.A., 1997. TextTiling: Segmenting text into multi-paragraph subtopic passages. Computational linguistics, 23(1), pp.33-64. [15] Habib, A. Iwatate, M., Asahara, M. Matsumoto, Y. W.K., 2013. Optimized and Hygienic Touch Screen Keyboard for Large Letter Set Languages. Proceedings of 7th ACM International Conference on Ubiquitous Information Management and Communication (ICUIMC) Kota Kinabalu, Malaysia. [16] Evang, K., Basile, V., Chrupała, G. and Bos, J., 2013, October. Elephant: Sequence labeling for word and sentence segmentation. In EMNLP 2013. [17] Xu, L.F., Zhu, Y., Yang, L.J. and Jin, Y.H., 2014. Research on Sentence Segmentation with Conjunctions in Patent Machine Translation. In Applied Mechanics and Materials (Vol. 513, pp. 4605-4609). Trans Tech Publications. [18] Habib, A. Iwatate, M., Asahara, M. Matsumoto, Y., 2011. Different input systems for different devices: Optimized touch-screen keypad designs for Urdu scripts. Proceedings of Workshop on Text Input Methods WTIM2011, IJCNLP, Chiang Mai, Thailand. [19] Gibson, E. and Pearlmutter, N.J., 1998. Constraints on sentence comprehension. Trends in cognitive sciences, 2(7), pp.262-268. [20] Adnan, M. Habib, A. Mukhtar, H. Ali, G., 2017. Assessing the Realization of Smartphones Learning Objects in Students’ Adaptive Learning Paths. International Journal of Engineering Research, 6(4), ISBN:978-81-932091-0-3. 106