<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>COMPOUND SENTENCE SEGMENTATION AND SENTENCE BOUNDARY DETECTION IN URDU</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>ASAD IQBAL</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>ASAD HABIB</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>JAWAD ASHRAF</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute of Information Technology, Kohat University of Science and Technology</institution>
          ,
          <country country="PK">Pakistan</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2017</year>
      </pub-date>
      <fpage>99</fpage>
      <lpage>106</lpage>
      <abstract>
<p>Raw Urdu corpora comprise irregular and long sentences that must be properly segmented in order to be useful in Natural Language Engineering (NLE). This makes Compound Sentence Segmentation (CSS) a timely and vital research topic. Existing online text processing tools are developed mostly for computationally developed languages such as English, Japanese and Spanish, where sentence segmentation is done largely on the basis of delimiters.</p>
      </abstract>
      <kwd-group>
        <kwd>Urdu sentence segmentation</kwd>
        <kwd>sentence tokenization</kwd>
        <kwd>word tokenization</kwd>
        <kwd>compound sentence segmentation</kwd>
        <kwd>Urdu conjunction extraction</kwd>
        <kwd>Urdu sentence delimiter identification</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>
        INTRODUCTION:
Urdu Compound Sentence Segmentation using words and conjunctions as delimiters is a complex task. Most of the
available raw corpora contain long sentences that combine several sentences, with or without conjunctions. Such
sentences are called compound sentences, and they pose challenges for automated and computational processes such as
text summarization, parsing and named entity recognition [
        <xref ref-type="bibr" rid="ref1">7</xref>
        ][
        <xref ref-type="bibr" rid="ref13">19</xref>
        ].
There are some online tools that segment sentences on the basis of sentence termination marks, for example:
1.1. Automatic Sentence Segmentation
1.2. Morph Adorner Sentence Splitter
Automatic Sentence Segmentation converts plain text into a one-sentence-per-line format by adding a return code
after each sentence termination mark. Most tools cannot handle abbreviations such as Dr., Mr., p.m., Prof. and a.m.;
this online tool covers most of these abbreviations, and it also lets the user edit the resulting text to fix any
remaining ones [1]. Morph Adorner Sentence Splitter uses punctuation marks to split sentences, but punctuation does
not always mark a sentence boundary, for example in ellipses, abbreviations, acronyms and decimal numbers, and some
poems contain no termination marks at all. Morph Adorner Sentence Splitter works best on plain English text and,
unlike Automatic Sentence Segmentation, covers more abbreviations [2].
      </p>
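      <p>As an illustration of what such delimiter-based splitters do, the following minimal sketch (our own, not the cited tools' actual code) masks the periods of a small, assumed abbreviation list before splitting on termination marks:

```python
import re

# Illustrative abbreviation list; real tools cover many more.
ABBREVIATIONS = ["a.m.", "p.m.", "Dr.", "Mr.", "Prof."]

def split_sentences(text):
    # Temporarily mask periods inside known abbreviations so they
    # are not mistaken for sentence termination marks.
    masked = text
    for abbr in ABBREVIATIONS:
        masked = masked.replace(abbr, abbr.replace(".", "\x00"))
    # Split after ., ? or ! followed by whitespace, then unmask.
    parts = re.split(r"(?<=[.?!])\s+", masked)
    return [p.replace("\x00", ".").strip() for p in parts if p.strip()]

print(split_sentences("Dr. Smith arrived at 9 a.m. He left early. Why?"))
```

Note that masking makes every abbreviation period non-terminating, so a sentence that genuinely ends in "a.m." is merged with the next one; this is exactly the abbreviation ambiguity that the tools above try to mitigate.</p>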
      <p>Unlike Urdu, English does not suffer much from this problem, because most English sentences are already separated
from one another. Urdu, on the other hand, often has an entire paragraph written as a single sentence that itself
contains many sentences. Our proposed idea is to identify those sentences and split them into multiple sentences on the
basis of delimiting words [3,4].</p>
      <p>
        LITERATURE REVIEW:
Sentence segmentation is a relatively new research topic for computationally developing languages. We could not
find any automated sentence boundary segmentation tool for the Urdu language. Aroonmanakun, W. analyzed sentence and
word segmentation in the Thai language [5]. He considered sentence discourse, where combinations of phrases and
clues are used in each discourse segmentation step. Baseer et al. presented a Romanized Urdu Corpus model that
uses the tokens with the highest frequency of occurrence in a data set collected from participants who
use Romanized Urdu as a means of communication [6]. Xu et al. used IBM word alignment for sentence segmentation
in English-Chinese translation tasks in the news domain [
        <xref ref-type="bibr" rid="ref1">7</xref>
        ]. The focus of this technique is to use lexicon
information to identify sentence segmentation points and split compound sentences into multiple sentences. The paper
proposed splitting long sentences into multiple ones because the longer the sentence, the more problems their
system faced, resulting in higher computational cost and compromised word-alignment quality. Habib et al. presented a
novel approach that records properly segmented words in order to avoid the classical space-insertion and
space-deletion problems in native-script Urdu text [
        <xref ref-type="bibr" rid="ref7">13</xref>
        ][
        <xref ref-type="bibr" rid="ref9">15</xref>
        ]. Their proposed solutions can
be of direct value in Urdu sentence segmentation.
      </p>
      <p>
        Xue, N. and Yang, Y. focused on how Chinese uses commas, exclamation marks and question marks to indicate
sentence boundaries. Their model was trained and tested on data from the Chinese Treebank and achieved an
accuracy of up to 90% [
        <xref ref-type="bibr" rid="ref2 ref3">8,9</xref>
        ]. Rehman, Z. and Anwar, W. used a rule-based algorithm and a
unigram statistical model for Urdu sentence boundary disambiguation. Initial results before training and testing
were 90% precision, 92.45% F1-measure and 86% recall; after training and testing on the same data, results improved to
99.36% precision, 97.89% F1-measure and 96.45% recall [
        <xref ref-type="bibr" rid="ref4">10</xref>
        ]. Kiss and Strunk presented a language-independent
approach, assuming that once abbreviations are identified, most of the ambiguities in sentence boundary
detection are eliminated. To detect abbreviations with high accuracy, their system defines
three rules that require independence of context and of candidate type. The system was tested on different text types
in eleven different languages [
        <xref ref-type="bibr" rid="ref5">11</xref>
        ]. A number of related studies point to interesting aspects of text segmentation
and optimized input systems. Habib et al. proposed an optimized system and input methods for Urdu composing on
various modern devices [
        <xref ref-type="bibr" rid="ref12">18</xref>
        ]. Adnan et al. assessed the use of smartphone learning objects in
computing adaptive learning paths for undergraduate university students. Jurish and Würzner introduced the “WASTE”
method for segmenting text into tokens and sentences. A Hidden Markov Model is used for segment boundary
detection, with model parameters estimated from pre-segmented corpora, which are available as aligned
multilingual corpora and treebanks [
        <xref ref-type="bibr" rid="ref6">12</xref>
        ].
      </p>
      <p>
        Hearst, M.A. presented a technique called TextTiling, in which text is segmented into multi-paragraph units,
i.e. subtopics or passages. Subtopics are identified using lexical co-occurrence and distribution patterns. This
segmentation can further be used for text summarization and information retrieval [
        <xref ref-type="bibr" rid="ref10">16</xref>
        ]. Evang, K. et al. note that
the accuracy achieved by rule-based tokenization is not itself a problem; the issue with rule-based models is
their language-specific rules and maintenance. Their paper combined unsupervised feature learning with
supervised sequence labeling at the character level to accomplish segmentation with high word accuracy. The proposed
system was evaluated on three different languages, with error rates of 0.76% (Italian), 0.27% (English), and
0.35% (Dutch) [
        <xref ref-type="bibr" rid="ref10">16</xref>
        ]. Xu, L.F. et al. proposed segmenting long sentences into short ones using the conjunctions in
them. Long sentences take substantial machine translation resources to process. Punctuation was used previously for
segmentation, but for long sentences it was not enough. Their paper presents a rule-based approach that uses
conjunctions to segment long Chinese sentences; 901 conjunctions were found in 10 patent papers during
experimentation, and the rule-based approach achieved 89% accuracy [
        <xref ref-type="bibr" rid="ref11">17</xref>
        ]. Gibson, E. et al. proposed a
sentence comprehension technique. Sentence comprehension means applying constraints, using the available
computational resources, to integrate a variety of information sources. Four types of constraints are
explained in their paper: (1) phrase-level contingent frequency constraints, (2) locality-based computational resource
constraints, (3) contextual constraints, and (4) lexical constraints [
        <xref ref-type="bibr" rid="ref13">19</xref>
        ].
      </p>
      <p>URDU SENTENCE SEGMENTATION OPERATIONS:
Sentence segmentation is the process of identifying sentence boundaries. Previous research focused on using only
punctuation for boundary detection. Our proposed research uses not only punctuation but also words as
delimiters and conjunctions as boundary markers to segment sentences appropriately.</p>
      <p>
        3.1. RAW CORPUS:
The raw corpus has been collected from several sources, including websites, books, Urdu magazines and newspapers. We
categorize them into two general groups, online and offline. The online group comprises websites, online books and
newspapers, while the offline group comprises magazines, newspapers, books, etc. The collected raw corpus contains
long and compound sentences, including both conjunction and non-conjunction sentences. These sentences
require two different methodologies for segmentation: the former is compound sentence segmentation and the latter is
sentence tokenization, which are described in the following.
Conjunctions and delimiting words and characters are identified using word and character segmentation techniques.
Individual words and characters in the text are split by complex processing based on space, joiner and
non-joiner properties, keeping in view the Urdu-specific problems of space insertion and space deletion [3][
        <xref ref-type="bibr" rid="ref4">10</xref>
        ][
        <xref ref-type="bibr" rid="ref13">19</xref>
        ].
The segmented words and characters are then analyzed for conjunctions and delimiting words and characters.
      </p>
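      <p>This identification step can be sketched as follows; the conjunction list here uses romanized stand-ins for the Urdu words, since the actual system operates on a manually curated native-script list:

```python
# Romanized stand-ins for the Urdu conjunction list (illustrative only).
CONJUNCTIONS = {"magar", "lekin", "balkay", "kyunkay", "goya", "yani"}

def tokenize(text):
    """Plain whitespace tokenization; real Urdu word segmentation must also
    resolve space-insertion/space-deletion and joiner/non-joiner properties."""
    return text.split()

def find_delimiters(text):
    """Return the positions of conjunction tokens found in the text."""
    return [i for i, tok in enumerate(tokenize(text)) if tok in CONJUNCTIONS]

print(find_delimiters("woh aya magar der se aya lekin khush tha"))
```
</p>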
      <p>[Figure: Flow of the proposed system. A raw corpus collected from offline sources (books, newspapers, etc.) and online sources (Internet) passes through pre-processing and word segmentation; compound sentence segmentation then identifies conjunction and non-conjunction sentences, applies boundary markers and removes subordinate clauses to produce the resultant sentences.]</p>
      <p>3.3. COMPOUND SENTENCE IDENTIFICATION:
The issue with longer sentences is their high consumption of resources such as processing time. We look for
conjunction-containing sentences, and we observe that the subordinate clauses of such sentences are mostly
explanations of the main clause, which make the sentence extra long. There are two options for dealing with compound
sentences: the first is to eliminate those sentences completely, but by doing so we risk losing useful information; the
second is to trim those sentences by eliminating the subordinate clause, i.e. the explanatory part of the sentence.
We use a pattern matching approach to identify conjunction-containing sentences. We compiled a list of conjunction
words generated from a pre-tagged corpus; using this list, conjunction-containing sentences are easily detectable.
The list of conjunction words is given below.</p>
      <p>Naturally occurring raw corpus text examples are mentioned in the following.</p>
      <p>ﺮﮕﻣ ! !ﮑﻠﺑ ! !ﭽﻧﺎﻨﭼ ! ﯽﻨﻌ$ ! ﺎ"ﻮﮔ ! ﯽﻨﻌ$ ! ﻦﮑ#ﻟ ! !ﮑﻧﻮ%ﮐ
3.4. With Conjunction part:
3.5. Without Conjunction part:
ﯽ"ﺗ ﯽﺌﮔ ﭻﻨﮩﭘ ﯽ"ﺑ !ﺮﻔﻧ ﯽﮐ ﺲ"ﻟﻮﭘ ﮯﺌﻟ ﮯﮐ ﮯﻧﺮﮐ ﻮﺑﺎﻗ ﻮﮐ !ﻻﺎﺣ ﺮﭘ ﻊﻗﻮﻣ !" ﺎﻨ#ﺎﮐ ﯽﻠﺠﺑ ﯽﮐ ﮯﻗﻼﻋ !ﺎﻤﺗ !ﮐ ﺎﮩﮐ ﮯﻧ ﻢ"ﻌﻨﻟ&amp;ﺪﺒﻋ ﺮﻠﺴﻧﻮﮐ ﻖﺑﺎﺳ ﺮﮕﻣ
۔ﮯﺌﮨﺎﭼ ﯽﻨ#ﺎﮐ ﯽﻠﺠﺑ ﯽﮑﻧ$ !ﺮﺻ ﮟ"ﮨ !ﺎﺒﺟ$% ﺮﭘ !ﻮﮔﻮﻟ ﻦﺟ ﮯﮨ ﯽﺗ#ﺎ%&amp; ﺮﺳ#ﺮﺳ !ﺗﺎﺳ ﮯﮐ !"ﻮﻋ</p>
      <p>۔ ﯽ"ﺗ ﯽﺌﮔ ﭻﻨﮩﭘ ﯽ"ﺑ +ﺮﻔﻧ ﯽﮐ ﺲ1ﻟﻮﭘ ﮯﺌﻟ ﮯﮐ ﮯﻧﺮﮐ ﻮﺑﺎﻗ ﻮﮐ 7ﻻﺎﺣ ﺮﭘ ﻊﻗﻮﻣ &lt;=
We exclude “!"#” from the conjunction list because it is used not only as a joiner within a sentence but also as a
joiner between two nouns, for example:
ﻦﺸ#ﮉﻧ&amp;ﺎﻓ ﺖﻣﺪﺨﻟ. /0. ﯽﻣﻼﺳ. ﺖﻋﺎﻤﺟ
The remaining sentences in the corpus are combinations of multiple sentences. They do not contain
conjunctions, but they still need to be segmented. For this purpose, we pass the resulting text to the next process,
called Sentence Tokenization.</p>
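      <p>The two options above can be sketched as follows. This is a toy illustration with romanized placeholder words, assuming the subordinate (explanatory) clause follows the conjunction:

```python
CONJUNCTIONS = {"magar", "lekin", "balkay", "kyunkay"}  # romanized placeholders

def is_compound(words):
    """A sentence is treated as compound if it contains a listed conjunction."""
    return any(w in CONJUNCTIONS for w in words)

def trim_subordinate(words):
    """Option 2 in the text: keep the main clause and drop everything from
    the first conjunction onward (the explanatory part)."""
    for i, w in enumerate(words):
        if w in CONJUNCTIONS:
            return words[:i]
    return words

sentence = "police pohnch gayi magar bijli band thi".split()
print(is_compound(sentence))
print(trim_subordinate(sentence))
```
</p>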
      <p>3.6. SENTENCE TOKENIZATION:
Sentence tokenization splits large sentences into multiple ones. We target words as delimiters to
identify sentence boundaries. For example:
ﯽﮐ ﺮﮑﻓ ﺐﺗﺎﮑﻣ *ﺎﻤﺗ ,-. *.ﻮﻋ ﮯﮐ ﻮﮕﻨ4 ﺮﭘ 6ﺎﻓ- ﯽﮐ ﮯﺠ8ﺘ:ﺑ &lt;,ﺎﻤﮨ &gt;ﮐﺎﮩﮐ ﮯﻧ AﺎﺷCﺎﺑ ﻞﻘﻋ ﺮﺴ8ﻓG ﻦﺸ8ﮐﻮﺠJ. ﭧﮐﺮ$ﺳ&amp; ﯽ$ﭘ&amp;()*ﻮﭘ*,*ﻮ-ﺑ)ﻮﮕﻨ1
ﺮﭘ #ﺎﮔ ﺶﺋﺎﮨ) ﯽﻨﭘ, ﮯﻧ #ﺎﺷ0ﺎﺑ ﻞﻘﻋ ﺮﺴ6ﻓ8 ﻦﺸ6ﮐﻮﺠ&gt;, ﯽ?ﭘ@ )ﺎﮩB, ﺎﮐ Cﻻﺎ6ﺧ F, ﮟ6ﮨ)ﻮﮑﺸﻣ ﺖ&gt;ﺎﮩﻧ ﮯﮐ Kﻗﻼﻋ ﻞﮨ, ﺮﭘ Cﺎﻣﺎﻐ6ﭘ ﯽﺘ&gt;ﺰﻌﺗ ﮯﺳ ﺐﻧﺎﺟ
ﮯﻧ #$%$ﻮﮔﻮﺳ %)$ #$ﺪﻧﺎﺧ ﮯﮐ #$ ﮯﺳ ﯽﻧﺎﺑﺮﮩﻣ ﯽﮑ 5ٰﻟﺎﻌﺗ ﷲ! ﺮﮕﻣ ﮯﮨ '!ﺮﺨﻟ* +ﻌﻗ!. +ﮐ ﺎﮩﮐ ﮯﻧ 3ﻮﮩﻧ$ ﺎ&amp;ﮐ ﮯﺋﻮﮨ ﮯﺗﺮﮐ ﺖ&amp;ﭼ /ﺎﺑ ﮯﺳ 2ﻮ&amp;ﻓﺎﺤﺻ
!"ﺎﻤﻧ ﮯﻧ 'ﻮﮩﻨﺟ ,-ﺮﻀﺣ 1ﺎﻤﺗ 3- 3-4ﻮﮔﻮﺳ ﮯﮐ 1ﻮﺣﺮﻣ 9ﺎ:ﻓ &lt;ﮐﺎﮩﮐ ﯽﺋﻮﮨ ﺎﻄﻋ ﻞ:ﻤﺟ ﺮﺒﺻ ﻮﮐ 3-ﺪﻧﺎﺧ G4ﻮﭘ ﮯﺳ 'IﺎﻋJ ﮯﮐ &lt;ﻗﻼﻋ ﻞﮨ-‘ !ﻮﮔ$ﺰﺑ
۔ﮟ#ﮨ %ﻮﮑﺸﻣ ﯽﻟ, ﮯﮐ ﺐﺳ #$ ﯽﮐ ﺖ(ﺰﻌﺗ #ﻮﻓ ﯽﻠ/0 1ﻌ(2ﺬﺑ ﺎ( ﯽﮐ ﺖﮐﺮﺷ ﮟ/ﻣ :;ﺎﻨﺟ
The above example is a single large sentence of 110 words that is in fact a combination of multiple sentences. The
available Urdu corpus is full of such sentences. The delimiting words are identified manually from a single
chunk of the file, because analyzing a large file manually is time consuming: it may take weeks to generate a list of
delimiting words, and even then the list would not be complete. The delimiting-boundary list is not limited to this
corpus; to achieve better results in segmenting large sentences, the corpus needs constant updating, along with
constant manual updating of the delimiting words. We generated the list from one chunk of the file in order to carry
out our experiments, and we achieved significant results. For example, the single sentence above was analyzed for
delimiting words, and on the basis of this list it was segmented into 4 sentences with an average length of 27 words per
sentence.
ﯽﮐ ﺮﮑﻓ ﺐﺗﺎﮑﻣ *ﺎﻤﺗ ,-. *.ﻮﻋ ﮯﮐ ﻮﮕﻨ4 ﺮﭘ 6ﺎﻓ- ﯽﮐ ﮯﺠ8ﺘ:ﺑ &lt;,ﺎﻤﮨ "ﮐﺎﮩﮐ ﮯﻧ (ﺎﺷ*ﺎﺑ ﻞﻘﻋ ﺮﺴ1ﻓ3 ﻦﺸ1ﮐﻮﺠ89 ﭧﮐﺮ;ﺳ= ﯽ;ﭘ=(@AﻮﭘABAﻮ1ﺑ)ﻮﮕﻨE
۔ﮟ#ﮨ%ﻮﮑﺸﻣ ﺖ+ﺎﮩﻧ ﮯﮐ 1ﻗﻼﻋ ﻞﮨ6 ﺮﭘ 9ﺎﻣﺎﻐ#ﭘ ﯽﺘ+ﺰﻌﺗ ﮯﺳ ﺐﻧﺎﺟ
!ﻌﻗ#$ %ﮐ ﺎﮩﮐ ﮯﻧ +ﻮﮩﻧ# ۔ﺎ.ﮐ ﮯﺋﻮﮨ ﮯﺗﺮﮐ ﺖ.ﭼ 5ﺎﺑ ﮯﺳ +ﻮ.ﻓﺎﺤﺻ ﺮﭘ &lt;ﺎﮔ ﺶﺋﺎﮨ? ﯽﻨﭘ# ﮯﻧ &lt;ﺎﺷCﺎﺑ ﻞﻘﻋ ﺮﺴ.ﻓH ﻦﺸ.ﮐﻮﺠL# ﯽMﭘN ?ﺎﮩO# ﺎﮐ 5ﻻﺎ.ﺧ R#
ﻞ"ﻤﺟ ﺮﺒﺻ ﻮﮐ *+ﺪﻧﺎﺧ 01ﻮﭘ ﮯﺳ 56ﺎﻋ8 ﮯﮐ 9ﻗﻼﻋ ﻞﮨ+‘ !ﻮﮔ$ﺰﺑ ﮯﻧ )*$*ﻮﮔﻮﺳ $,* )*ﺪﻧﺎﺧ ﮯﮐ )* ﮯﺳ ﯽﻧﺎﺑﺮﮩﻣ ﯽﮑ 7ٰﻟﺎﻌﺗ ﷲ! ﺮﮕﻣ ﮯﮨ '!ﺮﺨﻟ*
ﮯﮐ ﺐﺳ %&amp; ﯽﮐ ﺖ)ﺰﻌﺗ %ﻮﻓ ﯽﻠ01 2ﻌ)3ﺬﺑ ﺎ) ۔ﯽﮐ ﺖﮐﺮﺷ ﮟ0ﻣ &lt;=ﺎﻨﺟ &lt;=ﺎﻤﻧ ﮯﻧ Bﻮﮩﻨﺟ D&amp;ﺮﻀﺣ Gﺎﻤﺗ %&amp; %&amp;3ﻮﮔﻮﺳ ﮯﮐ Gﻮﺣﺮﻣ Iﺎ0ﻓ 2ﮐﺎﮩﮐ ﯽﺋﻮﮨ ﺎﻄﻋ
۔۔ﮟ#ﮨ %ﻮﮑﺸﻣ ﯽﻟ,
3.6.1. Sentence Boundary Markers:
The complexity of delimiting words as sentence boundary markers rises when a word from the generated list appears as
part of another word, or appears in the middle or at the start of a sentence rather than at the end. To deal with
such sentences we need to take our approach to another level, as we cannot rely on unigrams. For example, the
following delimiter words are unigrams. We cannot use them as boundary markers on their own, because these words
can be parts of other words, which results in ambiguity; we require an n-gram approach. For example:
!" ! ﮯﺋﺎﺟ ! ﮯﺌﮐ ! ﮯﮑﺳ$ ! ﮟ"ﺮﮐ
ﺮﭘ ﮯﻧﺮ$% ﯽﺟﺎﺠﺘﺣ' ﮯﻨﻣﺎﺳ ﮯﮐ !"ﺎﮨ ﺮﻧ !ﻮﮔ !"# !"ﺎﮨ ﭧﻨ#ﻤ#ﻟ&amp;ﺎﭘ ﻢﮨ ﺮﮕ#$ !"ﻮﺼﺑ ﮟ"ﺮﮐ ﻞﺣ ﺮﭘ !"#ﺎ%ﻨﺑ ﯽﺤ#ﺟﺮﺗ ﻞﺋﺎﺴﻣ !"ﺎﻤﮨ
۔ ﮯﮕﻧﻮﮨ !ﻮﺒﺠﻣ
ﮟ"ﮨ ﮯﮑﭼﻮﮨ ﯽﻏﺎﺑ ﺮﺒﻤﻣ ﮯﮐ ﯽﻠﺒﻤﺳ&amp; ﯽﺋﺎﺑﻮﺻ ﮯﻨﭘ$ ﮯﮑﺳ$ !ﮑﻠﺑ ﮯﮑﺳﺎﺟ ﺎﮩﮐ !ﺎﮔ$ﺎ% ﻮﮐ ﺲﺟ ﺎ"ﮐ ﮟ"ﮩﻧ !ﺎﮐ ﺎﺴ#$ ﯽﺋﻮﮐ ﮯﻧ !ﻮﮩﻧ
(1
(2
(3
ﮟ"ﮩﻧ ﻦﮑﻤﻣ ﯽﻗﺮﺗ ﺮ"ﻐﺑ ﮯﺌﮐ ﻢﺘﺧ !ﻮﺳُﺎﻧ ﺎﮐ ﻞﻘﻧ ﮯﺳ !"#$%$ ﯽﻤ#ﻠﻌﺗ
(4
(5
ﮯﮨ ﯽﺗﺎﺟ !" ﺢ"ﺟﺮﺗ ﻮﮐ !ﻮﻣﺎﮐ ﯽﻋﺎﻤﺘﺟ' ﮯﺋﺎﺠﺑ ﯽﮐ !ﻮﻣﺎﮐ !"#ﺮﻔﻧ# ﺮﭘ !ﺎﮩ$
ﮯﮨ ﯽﺗﺎﺟ !" ﺢ"ﺟﺮﺗ ﻮﮐ !ﻮﻣﺎﮐ ﯽﻋﺎﻤﺘﺟ' ﮯﺋﺎﺠﺑ ﯽﮐ !ﻮﻣﺎﮐ !"#ﺮﻔﻧ# ﺮﭘ !ﺎﮩ$
In examples 1, 2 and 3 the delimiting words appear in the middle of the sentence. In example 2 we also see that
two delimiters are used, one as a separate word and the other as part of another word, for example ﮯﮑﺳﺎﺟ, which is
a combination of the word ﮯﮑﺳ$ and !. The same applies to examples 4 and 5, i.e. ﮯﺋﺎﺠﺑ and !"#ﺮﻔﻧ#. These delimiting words are
also combinations of words: ﮯﺋﺎﺠﺑ is a combination of ﮯﺋﺎﺟ and !, and !"#ﺮﻔﻧ# is a combination of !" and !ﺮﻔﻧ!.
That is not all: Urdu delimiting words have another issue, in that some of them are not
used in the sense of a termination mark at all. For example, ﮯﮑﺳ$ in example 2 above is not used as a sentence
boundary marker but in the sense of “his”. The proportion of such words is low, however, and they
can also be handled by the n-gram approach, so this does not create a problem. Table 1 below lists sentence
boundary markers of unigram, bigram and trigram type; we include only a few of the boundary markers, with their
frequencies from a single experiment.
Some sentences contain two delimiting words, and our model adds termination marks after both boundary
markers. This does not harm the sentence, because if the sentence ends at the first delimiting word it still retains
its meaning. For example:
3.6.1.1. With two delimiting words:</p>
      <p>۔ﮯﮨ ۔ﺎ#ﮔ ﺎ"# ﺎ"#$ﺑ ﮟ"ﻣ ﻦﺸ#$ﻮﭘ' ﮉﻨ#ﺮﮔ ﻮﮐ ﺐﺳ !" !" ﮯﮨ# ﮟ"ﻣ !"ﺪﺘﻗ" !ﺎﺳ !" ﻮﺟ</p>
      <p>3.6.1.2. Without second delimiting word:
۔ ﺎ"ﮔ ﺎ"# ﺎ"#$ﺑ ﮟ"ﻣ ﻦﺸ#$ﻮﭘ' ﮉﻨ#ﺮﮔ ﻮﮐ ﺐﺳ !" !" ﮯﮨ# ﮟ"ﻣ !"ﺪﺘﻗ" !ﺎﺳ !" ﻮﺟ
It can be observed that, from a semantic point of view, the sentence preserves its meaning.
The lack of appropriate hardware resources delayed the compilation of our experimental results. We used a
personal computer running Microsoft Windows 8 to process our general-genre raw Urdu corpus of 2616
sentences and 291503 words. Computationally exhaustive iterative processing was not feasible on such a workstation:
it took more than two and a half days to process our raw corpus file, and the run was still incomplete when the system
suddenly shut down due to hardware failure. The only viable solution to this problem was the customary divide and
conquer approach.</p>
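      <p>The divide-and-conquer workaround amounts to a simple chunking step before processing. A sketch, with the chunk count from our setup and an integer range standing in for corpus lines:

```python
def make_chunks(lines, n_chunks=14):
    """Split corpus lines into n roughly equal chunks for piecewise processing."""
    size = -(-len(lines) // n_chunks)  # ceiling division
    return [lines[i:i + size] for i in range(0, len(lines), size)]

# 2616 stands in for the 2616 sentences of the raw corpus.
chunks = make_chunks(list(range(2616)), 14)
print(len(chunks), len(chunks[0]))
```
</p>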
      <p>We divided the experimentation file into 14 smaller chunks and experimented chunk-wise to identify compound and
large sentences. Delimiting words were extracted together with their frequencies in the text. Compound sentences were
identified by the conjunctions present in them. All statistical results were accumulated and manually verified by
Urdu language experts. Consolidated results for the 14 chunks are shown in Table 2.
In Table 2, we also computed the frequency of !"# as a conjunction: it appears 485 times in the text. This was done
after realizing that !"# is used not only as a conjunction but also as a part of other words. For example:
These are words that contain the conjunction word “!"#”. These words are combinations of two words. For example,
in ﯽﺋﺰﮐ%&amp;' and !"#$, the word !"# is at the beginning, being a combination of ﯽﺋﺰﮐ + !"# and ! + !"# respectively.
Similarly, in !"#ﺎﺸﻣ and !ﺎﺷ$%ﻻ', the word !"# is in the middle, between ! + ﺶﻣ and !ﺎﺷ + !"
respectively. In the same manner, in words such as !"ﺎ$ﭽﻧ and !"ﺎﺸﭘ, the word !"# appears at the end. Our
proposed model identifies all of these as conjunctions, which increases the computational cost and algorithmic
complexity. The word “!"#” is used as a conjunction 135 times out of 485, i.e. 27% of its total occurrences. The
remaining 73%, i.e. 353 times, it appears as part of other words or otherwise in a non-conjunction role, so we exclude
it from the list of conjunctions. Distinguishing its use as a conjunction from its use as part of another word was
done manually.
The word “!"#” and other words posing similar problems are an interesting discovery of this research. “!ﭽﻧﺎﻨﭼ” is also a
conjunction word, but it does not appear even once in our existing corpus. Our corpus is growing continuously,
so we expect this and similar words to be encountered in subsequent experiments; we therefore did not
exclude it from the list of conjunction words.</p>
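      <p>The manual 27%/73% analysis amounts to counting, for a candidate conjunction, how often it occurs as a token of its own versus embedded inside longer tokens. A sketch with toy romanized tokens:

```python
def conjunction_usage(tokens, conj):
    """Count standalone occurrences of `conj` vs. occurrences embedded
    in longer tokens, and return both counts with the standalone share."""
    standalone = sum(1 for t in tokens if t == conj)
    embedded = sum(1 for t in tokens if conj in t and t != conj)
    total = standalone + embedded
    share = standalone / total if total else 0.0
    return standalone, embedded, share

tokens = ["aur", "chunaaur", "aurat", "aur"]  # toy data
print(conjunction_usage(tokens, "aur"))
```
</p>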
      <p>The complexity of Urdu increases when dealing with boundary markers. An Urdu boundary-marking word may
not always appear at the end of a sentence: it can be part of another word occurring anywhere in the sentence,
and it may itself appear in the middle of a sentence. Reaching a computational solution becomes even more difficult
because sometimes these words are not used in the sense of boundary markers at all. For example, ﮟ"ﮨ appears about
259 times in a single chunk of the file. We carried out the same experiment on all 14 chunks, but for simplicity we
consider only the first chunk here. It was not certain whether ﮟ"ﮨ always appears as a boundary marker or also as
part of other words. In our experimentation we found that ﮟ"ﮨ appeared 154 times (59%) as a boundary marker and
105 times (41%) as part of another word, for example ﮟ"ﮩﻧ, ﮟ"ﮨ$ﻮﺨﻨﺗ, ﮟ"ﮩﻧ%, ﮟ"ﮩﻤﺗ. The experimentation continued with
other boundary markers such as ﺎ"ﺗ: its frequency in the text is 67, of which it appears about 26 times (38%) as a
boundary marker and 41 times (62%) as part of other words.</p>
      <p>Most boundary markers can be parts of other words. To handle this we use a bi-gram approach, i.e. we combine the
marker with the closest following word. Using only bi-grams did not solve the problem, because the combination of
these two words may also appear in the middle of a sentence; to tackle this we add the next closest word, i.e. a
trigram approach, and, since even that did not fully solve the problem, the process continues: the general solution
is an n-gram approach. Similarly, ﮯ"ﺗ appears in the text 36 times and is used as a boundary marker 100% of the time. ﯽ"ﺗ
appears 13 times, of which 8 times (61%) as a boundary marker and 5 times (39%) as part of another
word or boundary marker, i.e. ﮟ"#ﺗ, and the same goes for the other boundary markers. The above-mentioned gazetteer of
delimiting words contains only 15 boundary markers; we have a manually created list of about 103 boundary markers,
and the list is still growing. The more the list grows, the more effective the results will be. Considering our
corpus, 103 is not a huge number of boundary markers, but it is enough for our experimentation. A larger list also
requires more processing, which our workstations cannot handle very effectively and which slows our experimentation,
so we restricted the list of boundary markers to 103.
Initially there were 184 sentences with 20548 words in experiment 1. After the compound segmentation and tokenization
processes we had 722 sentences. We obtained judgments for about 569 sentences, i.e. 78% of the sentences could be
categorized as accurate or inaccurate; of these, 461 (63%) had accurately marked boundaries and 108 (14%) were marked
inaccurately. The inaccuracy was due to boundary markers occurring as parts of other words. The remaining 22% were
one- or two-word sentences that appear when two delimiting words occur in a single sentence. For example:</p>
      <p>۔ﮯﮨ ۔ﺎ#ﮔ ﺎ"ﮐ !ﺎﻈﺘﻧ&amp; ﯽﺿ#ﺎﻋ ﺮﭘ !ﺎﮨ$ ﯽﮨ !ﻧ !"# ﮯﮨ !ﻮﺟﻮﻣ ﮓﻨﮐ$ﺎﭘ ﯽﺋﻮﮐ ﮟ"ﻣ !ﻮﻗﻼﻋ ﯽﺒ#ﺮﻗ ﻮﺗ !ﻧ ﮯ"ﻟ ﮯﮐ !ُ#
ﮯﮨ is the second boundary marker, and a single word was marked as a separate sentence. There were 153 such
sentences, which means that 22% of the sentences were useless and were discarded accordingly.</p>
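      <p>The unigram-to-n-gram boundary-marking strategy described above can be sketched as follows; the marker gazetteer and the non-final bigrams here are romanized, invented placeholders for the Urdu words:

```python
UNIGRAM_MARKERS = {"hain", "tha", "thi"}  # candidate boundary words
# Bigram contexts in which a candidate marker is NOT sentence-final.
NON_FINAL_BIGRAMS = {("hain", "aur"), ("thi", "magar")}

def mark_boundaries(words):
    """Append the Urdu full stop after a marker word unless the bigram it
    forms with the following word indicates the marker is not final."""
    out = []
    for i, w in enumerate(words):
        out.append(w)
        nxt = words[i + 1] if i + 1 < len(words) else None
        if w in UNIGRAM_MARKERS and (w, nxt) not in NON_FINAL_BIGRAMS:
            out.append("۔")
    return out

print(mark_boundaries(["member", "baghi", "hain", "aur", "chale", "gaye", "hain"]))
```

Extending the lookahead from one word (bigram) to two or more (trigram, n-gram) follows the same pattern.</p>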
      <p>CONCLUSION:
Our work regarding Compound Sentence Segmentation and Tokenization of large Sentences was pioneering work in
Urdu. The results generated by our proposed system are promising. The results generated for a single chunk of file
were generated manually. Therefore, we did not include other chunks of files. We realize that with having powerful
servers and with increasing delimiting words gazetteer list, we can improve our results further. Beside generating
statistical results, in future we will also analyze our model and its results by language expert, by comparing our
automatically tokenized sentences with human manually tokenized sentences to analyze its coherence and readability.</p>
      <p>REFERENCES:
[1] Yasumasa, S., 2016. Kansai University, Graduate School of Foreign Language Education and Research. Automatic</p>
<p>Sentence Segmentation, accessed (Feb 19). URL: http://www.someya-net.com/00-class09/sentenceDiv.html.
[2] Brian, L., Zillig, P., Ramsay, S., Mueller, M., and Smutniak, F., 2016. Academic Technologies and Research
Services. Morph Adorner Sentence Splitter, accessed (Feb 19). URL:
http://morphadorner.northwestern.edu/sentencesplitter/example/.
[3] Malik, A.A. and Habib, A., 2013. Urdu to English Machine Translation using</p>
      <p>Understudy. International Journal of Computer Applications, 82(7).</p>
      <p>Bilingual Evaluation
[4] Palmer, D.D., 2000. Tokenisation and sentence segmentation. In Handbook of Natural Language Processing. CRC Press.</p>
      <p>
[5] Aroonmanakun, W., 2007, December. Thoughts on word and sentence segmentation in Thai. In Proceedings of the Seventh Symposium on Natural Language Processing, Pattaya, Thailand, December 13–15 (pp. 85-90).</p>
      <p>
[6] Baseer, F., Habib, A. and Ashraf, J., 2016, August. Romanized Urdu Corpus Development (RUCD) model:
Edit-distance based most frequent unique unigram extraction approach using real-time interactive dataset. In Innovative
Computing Technology (INTECH), 2016 Sixth International Conference on (pp. 513-518). IEEE.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zens</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Ney</surname>
          </string-name>
          , H.,
          <year>2005</year>
          , May.
          <article-title>Sentence segmentation using IBM word alignment model 1</article-title>
          .
          <source>In Proceedings of EAMT</source>
          (pp.
          <fpage>280</fpage>
          -
          <lpage>287</lpage>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Gul</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Habib</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ashraf</surname>
          </string-name>
          ,
          <article-title>"Identification and extraction of Compose-Time Anomalies in Million Words Raw Urdu Corpus and Their Proposed Solutions"</article-title>
          ,
          <source>proceedings of the 3rd International Multidisciplinary Research Conference (IMRC)</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Xue</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <year>2011</year>
          .
          <article-title>Chinese sentence segmentation as comma classification</article-title>
          .
          <source>In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: short papers - Volume 2</source>
          (pp.
          <fpage>631</fpage>
          -
          <lpage>635</lpage>
          ).
          <article-title>Association for Computational Linguistics</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Rehman</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Anwar</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <year>2012</year>
          .
          <article-title>A hybrid approach for urdu sentence boundary disambiguation</article-title>
          .
          <source>International Arab Journal of Information Technology</source>
          ,
          <volume>9</volume>
          (
          <issue>3</issue>
          ), pp.
          <fpage>250</fpage>
          -
          <lpage>255</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Kiss</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Strunk</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <year>2006</year>
          .
          <article-title>Unsupervised multilingual sentence boundary detection</article-title>
          .
          <source>Computational Linguistics</source>
          ,
          <volume>32</volume>
          (
          <issue>4</issue>
          ), pp.
          <fpage>485</fpage>
          -
          <lpage>525</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Jurish</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Würzner</surname>
            ,
            <given-names>K.M.</given-names>
          </string-name>
          ,
          <year>2013</year>
          .
          <article-title>Word and Sentence Tokenization with Hidden Markov Models</article-title>
          .
          <source>JLCL</source>
          ,
          <volume>28</volume>
          (
          <issue>2</issue>
          ), pp.
          <fpage>61</fpage>
          -
          <lpage>83</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Habib</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Iwatate</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Asahara</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Matsumoto</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <year>2012</year>
          .
          <article-title>Keypad for large letter-set languages and small touchscreen devices (case study: Urdu)</article-title>
          .
          <source>International Journal of Computer Science</source>
          ,
          <volume>9</volume>
          (
          <issue>3</issue>
          ), ISSN: 1694-0814.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Hearst</surname>
            ,
            <given-names>M.A.</given-names>
          </string-name>
          ,
          <year>1997</year>
          .
          <article-title>TextTiling: Segmenting text into multi-paragraph subtopic passages</article-title>
          .
          <source>Computational Linguistics</source>
          ,
          <volume>23</volume>
          (
          <issue>1</issue>
          ), pp.
          <fpage>33</fpage>
          -
          <lpage>64</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [15]
          <string-name>
            <surname>Habib</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Iwatate</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Asahara</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Matsumoto</surname>
            ,
            <given-names>Y. W.K.</given-names>
          </string-name>
          ,
          <year>2013</year>
          .
          <article-title>Optimized and Hygienic Touch Screen Keyboard for Large Letter Set Languages</article-title>
          .
          <source>Proceedings of 7th ACM International Conference on Ubiquitous Information Management and Communication (ICUIMC)</source>
          , Kota Kinabalu, Malaysia.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [16]
          <string-name>
            <surname>Evang</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Basile</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chrupała</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Bos</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <year>2013</year>
          .
          <article-title>Elephant: Sequence labeling for word and sentence segmentation</article-title>
          .
          <source>EMNLP 2013</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [17]
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>L.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>L.J.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Jin</surname>
            ,
            <given-names>Y.H.</given-names>
          </string-name>
          ,
          <year>2014</year>
          .
          <article-title>Research on Sentence Segmentation with Conjunctions in Patent Machine Translation</article-title>
          .
          <source>Applied Mechanics and Materials</source>
          (Vol.
          <volume>513</volume>
          , pp.
          <fpage>4605</fpage>
          -
          <lpage>4609</lpage>
          ).
          Trans Tech Publications.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [18]
          <string-name>
            <surname>Habib</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Iwatate</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Asahara</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Matsumoto</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <year>2011</year>
          .
          <article-title>Different input systems for different devices: Optimized touch-screen keypad designs for Urdu scripts</article-title>
          .
          <source>Proceedings of Workshop on Text Input Methods WTIM2011</source>
          , IJCNLP, Chiang Mai, Thailand.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [19]
          <string-name>
            <surname>Gibson</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Pearlmutter</surname>
            ,
            <given-names>N.J.</given-names>
          </string-name>
          ,
          <year>1998</year>
          .
          <article-title>Constraints on sentence comprehension</article-title>
          .
          <source>Trends in Cognitive Sciences</source>
          ,
          <volume>2</volume>
          (
          <issue>7</issue>
          ), pp.
          <fpage>262</fpage>
          -
          <lpage>268</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>