=Paper=
{{Paper
|id=Vol-1964/NLP1
|storemode=property
|title=Compound Sentence Segmentation and Sentence Boundary Detection in Urdu
|pdfUrl=https://ceur-ws.org/Vol-1964/NLP1.pdf
|volume=Vol-1964
|authors=Asad Iqbal,Asad Habib,Jawad Ashraf
|dblpUrl=https://dblp.org/rec/conf/maics/IqbalHA17
}}
==Compound Sentence Segmentation and Sentence Boundary Detection in Urdu==
MAICS 2017, pp. 99–106
Asad Iqbal, Asad Habib, Jawad Ashraf
Institute of Information Technology, Kohat University of Science and Technology, Pakistan
ABSTRACT:
The raw Urdu corpus comprises long, irregular sentences which need to be properly segmented in order to make them useful in Natural Language Engineering (NLE). This makes Compound Sentence Segmentation (CSS) a timely and vital research topic. Existing online text processing tools are developed mostly for computationally developed languages such as English, Japanese and Spanish, where sentence segmentation is done mostly on the basis of delimiters.
Our proposed approach uses special characters as sentence delimiters, and computationally extracted sentence-end letters and sentence-end words as identifiers, for the segmentation of long and compound sentences. The raw, un-annotated input text is passed through preprocessing and word segmentation. Urdu word segmentation is itself a complex task, involving knotty problems such as space insertion and space deletion. Main and subordinate clauses are identified and marked for subsequent processing. The resultant text is further processed in order to identify, extract and then segment long as well as compound sentences into regular Urdu sentences.
Urdu computational research is in its infancy. Our work is pioneering in Urdu CSS, and the results achieved by our proposed approach are promising. For experimentation, we used a general-genre raw Urdu corpus containing 2616 sentences and 291503 words. Average sentence length was reduced from 111 to 38 words per sentence, i.e. to roughly 34% of the original. This almost tripled the number of sentences, yielding 7536 shorter and computationally easier-to-manage sentences. The reliability and coherence of the resultant text were verified by Urdu language experts.
Keywords: Urdu sentence segmentation, sentence tokenization, word tokenization, compound sentence
segmentation, Urdu conjunction extraction, Urdu sentence delimiter identification.
1. INTRODUCTION:
Urdu compound sentence segmentation using words and conjunctions as delimiters is a complex task. Most of the available raw corpora contain long sentences which are combinations of sentences, with or without conjunctions; such sentences are called compound sentences. They pose challenges for automated and computational processes such as text summarization, parsing and named entity recognition [7][19].
Some online tools are available that segment sentences on the basis of sentence termination marks, for example:
1.1. Automatic Sentence Segmentation
1.2. Morph Adorner Sentence Splitter
Automatic Sentence Segmentation converts plain text into a one-sentence-per-line format by simply adding a line break after each sentence termination mark. Most such tools cannot handle abbreviations like Dr., Mr., p.m., Prof. and a.m.; this online tool covers most of these abbreviations and also provides an editing facility on the resulting text to fix the remaining ones [1]. Morph Adorner Sentence Splitter uses punctuation marks to split sentences, but punctuation does not always mark a sentence boundary, as with ellipses, abbreviations, acronyms and decimal numbers, and some poems contain no termination marks at all. Morph Adorner Sentence Splitter works best for plain English text and, unlike Automatic Sentence Segmentation, covers more abbreviations [2].
English does not suffer from this problem nearly as much as Urdu, since most English sentences are already separated from one another. Urdu text, on the other hand, often presents what is effectively a paragraph as a single sentence that itself contains many sentences. Our proposed idea is to identify those sentences and split them into multiple sentences on the basis of delimiting words [3,4].
2. LITERATURE REVIEW:
Sentence segmentation is a relatively new research topic in computationally developing languages, and we could not find any automated sentence boundary segmentation tool for the Urdu language. Aroonmanakun, W. analyzed sentence and word segmentation in the Thai language [5], considering sentence discourse, where a combination of phrases and certain clues is used for each discourse segmentation process. Baseer et al. present a Romanized Urdu corpus development model that utilizes the tokens with the highest frequency of occurrence in a data set collected from participants who use Romanized Urdu as a means of communication [6]. Xu et al. used IBM word alignment model 1 for sentence segmentation in English–Chinese translation tasks in the news domain [7]. The focus of this technique is to use lexical information to identify sentence segmentation points and split compound sentences into multiple sentences. They propose splitting long sentences into multiple ones because the longer the sentence, the more problems their system faces, resulting in higher computational cost and compromised word alignment quality. Habib et al. presented a novel approach that records properly segmented words in order to avoid the classical space insertion and space deletion problems in native-script Urdu text [13][15]. Their proposed solutions can be of direct value in Urdu sentence segmentation.
Xue, N. and Yang, Y. focus on how Chinese uses commas, exclamation marks and question marks to indicate sentence boundaries. Their proposed model was trained and tested on data from the Chinese Treebank, and the accuracy achieved is up to 90% [8,9]. Rehman, Z. and Anwar, W. use a rule-based algorithm and a unigram statistical model to deal with Urdu sentence boundary disambiguation. Initial results, before training, were 90% precision, 92.45% F1-measure and 86% recall; after training and testing, results improved to 99.36% precision, 97.89% F1-measure and 96.45% recall [10]. Kiss and Strunk presented a language-independent approach, based on the assumption that once abbreviations are identified, most of the remaining ambiguities in sentence boundary detection are eliminated. In order to detect abbreviations with high accuracy, their system defines three rules that must hold independently of the context and type of the candidate. The system was tested on different text types and eleven different languages [11]. A number of related studies point to interesting aspects of text segmentation and optimized input systems. Habib et al. proposed an optimized system and input methods for composing Urdu on various modern devices [18]. Adnan et al. assessed the realization of smartphone learning objects in computing adaptive learning paths for undergraduate university students [20]. Jurish and Würzner introduced the "WASTE" method for segmenting text into tokens and sentences, using a Hidden Markov Model for segment boundary detection, with model parameters estimated from pre-segmented corpora; such corpora are available as aligned multilingual corpora and treebanks [12].
Hearst, M.A. presents a technique called TextTiling, in which text is segmented into multi-paragraph units, i.e. subtopics or passages. Identification of subtopics is done using lexical co-occurrence and distribution patterns, and the resulting segmentation can further be used for text summarization and information retrieval [14]. Evang, K., et al. note that rule-based tokenization achieves high accuracy and is often considered a solved problem, but that rule-based approaches require language-specific rules and maintenance. Their paper combines unsupervised feature learning with supervised sequence labeling at the character level to accomplish segmentation with high word-level accuracy. The proposed system was evaluated on three languages, with error rates of 0.76% (Italian), 0.27% (English) and 0.35% (Dutch) [16]. Xu, L.F., et al. propose segmenting long sentences into short ones using the conjunctions within them. Long sentences consume a great deal of machine translation resources during processing; punctuation was previously used for segmentation, but for long sentences it is not enough. Their paper presents a rule-based approach that uses conjunctions to segment long Chinese sentences; 901 conjunctions were found in 10 patent papers during experimentation, and the rule-based approach achieved 89% accuracy [17]. Gibson, E., et al. discuss sentence comprehension, i.e. the application of constraints, using available computational resources, to integrate a variety of information sources. Four types of comprehension constraints are explained in their paper: (1) phrase-level contingent frequency constraints, (2) locality-based computational resource constraints, (3) contextual constraints, and (4) lexical constraints [19].
3. URDU SENTENCE SEGMENTATION OPERATIONS:
Sentence segmentation is the process of identifying sentence boundaries. Previous research focuses on using only punctuation for boundary detection. Our proposed research uses not only punctuation but also words as delimiters and conjunctions as boundary markers to segment sentences appropriately.
3.1. RAW CORPUS:
The raw corpus has been collected from several sources, including websites, books, Urdu magazines and newspapers. We categorize these sources into two general forms, online and offline: the online category comprises websites, online books and newspapers, while the offline category comprises magazines, newspapers, books, etc. The collected raw corpus contains long and compound sentences, a combination of conjunction and non-conjunction sentences. These require two different segmentation methodologies: compound sentence segmentation for the former and sentence tokenization for the latter, both described in the following.
3.2. WORDS/CHARACTER SEGMENTATION:
Conjunctions and delimiting words and characters are identified using word and character segmentation techniques. Individual words and characters in the text are split on the basis of complex processing involving space, joiner and non-joiner properties, keeping in view the Urdu-specific problems of space insertion and space deletion [3][10][19]. The segmented words and characters are then analyzed for conjunctions and delimiting words and characters.
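As a rough illustration of this step, the following is a minimal sketch that assumes whitespace and the zero-width non-joiner (U+200C) are the only separators; the actual segmenter described in the paper also resolves space-insertion and space-deletion errors, which this sketch does not attempt.
```python
import re

# Minimal word/character segmentation sketch. Assumption: whitespace and the
# zero-width non-joiner (U+200C) are the only word separators; Urdu-specific
# space-insertion/space-deletion handling is NOT attempted here.
def segment_words(text: str) -> list[str]:
    normalized = text.replace("\u200c", " ")      # treat ZWNJ like a space
    return [w for w in re.split(r"\s+", normalized) if w]

def segment_characters(word: str) -> list[str]:
    # Split a word into Unicode code points; a fuller implementation would
    # keep combining diacritics attached to their base letters.
    return list(word)
```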
[Figure 1 depicts the processing pipeline: a raw corpus collected from offline sources (books, newspapers, etc.) and online sources (the Internet) passes through pre-processing and word segmentation; compound sentence segmentation then identifies conjunction and non-conjunction sentences, removing the sub-ordinate clause from the former and applying boundary markers to the latter to produce the resultant sentences.]
Figure 1. Compound Sentence Segmentation and Sentence Tokenization Architecture
3.3. COMPOUND SENTENCE IDENTIFICATION:
The issue with longer sentences is high consumption of resources such as processing time. We look for sentences containing conjunctions, and we observe that the subordinate clause of such a sentence is mostly an explanation of the main clause, which makes the sentence extra-long. There are two options for dealing with compound sentences. The first is to eliminate those sentences completely, but by doing so we may be risking useful information. The second option is to trim those sentences by eliminating the sub-ordinate clause, i.e. the explanatory part of the sentence.
We use a pattern matching approach to identify sentences containing conjunctions. We generated a list of conjunction words from a pre-tagged corpus; using this list, conjunction-containing sentences are easily detectable (a minimal code sketch is given after the examples below). The list of conjunction words is given below.
یعنی ، چنانچہ ، بلکہ ، مگر ، گویا ، لیکن ، کیونکہ
Naturally occurring text examples from the raw corpus are given in the following.
3.4. With Conjunction part:
ﻨﺎ "! ﻣﻮﻗﻊ ﭘﺮ ﺣﺎﻻ! ﮐﻮ ﻗﺎﺑﻮ ﮐﺮﻧﮯ ﮐﮯ ﻟﺌﮯ ﭘﻮﻟ"ﺲ ﮐﯽ ﻧﻔﺮ! ﺑ"ﯽ ﭘﮩﻨﭻ ﮔﺌﯽ ﺗ"ﯽ#ﻣﮕﺮ ﺳﺎﺑﻖ ﮐﻮﻧﺴﻠﺮ ﻋﺒﺪ&ﻟﻨﻌ"ﻢ ﻧﮯ ﮐﮩﺎ ﮐ! ﺗﻤﺎ! ﻋﻼﻗﮯ ﮐﯽ ﺑﺠﻠﯽ ﮐﺎ
ﻨﯽ ﭼﺎﮨﺌﮯ۔#ﻧﮑﯽ ﺑﺠﻠﯽ ﮐﺎ$ !ﺟﺒﺎ! ﮨ"ﮟ ﺻﺮ$% ﺗﯽ ﮨﮯ ﺟﻦ ﻟﻮﮔﻮ! ﭘﺮ#ﺎ%& ﺳﺮ#ﻋﻮ"! ﮐﮯ ﺳﺎﺗ! ﺳﺮ
3.5. Without Conjunction part:
ﺑ"ﯽ ﭘﮩﻨﭻ ﮔﺌﯽ ﺗ"ﯽ ۔+ﺲ ﮐﯽ ﻧﻔﺮ1 ﮐﻮ ﻗﺎﺑﻮ ﮐﺮﻧﮯ ﮐﮯ ﻟﺌﮯ ﭘﻮﻟ7=< ﻣﻮﻗﻊ ﭘﺮ ﺣﺎﻻ
We exclude “اور” from the conjunction list because it is used not only as a joiner within a sentence but also as a joiner between two nouns, for example:
جماعت اسلامی اور الخدمت فاؤنڈیشن
The remaining sentences in the corpus are combinations of multiple sentences: they do not contain conjunctions, but they still need to be segmented. For this purpose, we pass the resultant text to the next process, called Sentence Tokenization.
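The minimal sketch referred to above follows. It is a hedged illustration of the conjunction-based "trim" option, assuming the conjunction list of section 3.3; deciding which clause is actually subordinate in real Urdu text is more subtle than this whole-token match.
```python
# Sketch of conjunction-based compound sentence identification and trimming.
# Assumption: the clause after the conjunction is the explanatory part, so it
# is dropped and the main clause is kept, per the paper's second option.
CONJUNCTIONS = {"یعنی", "چنانچہ", "بلکہ", "مگر", "گویا", "لیکن", "کیونکہ"}

def is_compound(sentence: str) -> bool:
    return any(w in CONJUNCTIONS for w in sentence.split())

def trim_compound(sentence: str) -> str:
    words = sentence.split()
    for i, word in enumerate(words):
        if word in CONJUNCTIONS and i > 0:
            return " ".join(words[:i]) + " ۔"    # keep main clause only
    return sentence                              # not compound: pass through
```
Sentences for which `is_compound` is false are passed on to sentence tokenization, as described next.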
3.6. SENTENCE TOKENIZATION:
Sentence tokenization segments single long sentences into multiple ones. We target words as delimiters to identify sentence boundaries. For example:
ﺗﻤﺎ* ﻣﮑﺎﺗﺐ ﻓﮑﺮ ﮐﯽ,-. *.ﻨﮕﻮ ﮐﮯ ﻋﻮ4 ﭘﺮ6ﻓﺎ- ﺠﮯ ﮐﯽ8ﺘ:< ﺑ, ﻧﮯ ﮐﮩﺎﮐ> ﮨﻤﺎAﺷﺎCﺴﺮ ﻋﻘﻞ ﺑﺎ8ﻓG ﺸﻦ8ﺠﻮﮐJ. ﺮﮐﭧ$ﯽ &ﺳ$*ﭘﻮ*)(&ﭘ,*ﻮ-ﻨﮕﻮ)ﺑ1
ﭘﺮ#ﭘﻨﯽ )ﮨﺎﺋﺶ ﮔﺎ, ﻧﮯ#ﺷﺎ0ﺴﺮ ﻋﻘﻞ ﺑﺎ6ﻓ8 ﺸﻦ6>ﺠﻮﮐ, ﮩﺎ) @ﭘ?ﯽB, ﮐﺎCﺎﻻ6 ﺧF, ﮟ6 ﮐﮯ ﻧﮩﺎ>ﺖ ﻣﺸﮑﻮ)ﮨKﮨﻞ ﻋﻼﻗ, ﭘﺮCﻐﺎﻣﺎ6ﺟﺎﻧﺐ ﺳﮯ ﺗﻌﺰ>ﺘﯽ ﭘ
ﻧﮯ#$%$ ﺳﻮﮔﻮ%)$ #$ ﮐﮯ ﺧﺎﻧﺪ#$ ﮑﯽ ﻣﮩﺮﺑﺎﻧﯽ ﺳﮯ5ٰ *ﻟﺨﺮ!' ﮨﮯ ﻣﮕﺮ !ﷲ ﺗﻌﺎﻟ+!ﻗﻌ. + ﻧﮯ ﮐﮩﺎ ﮐ3ﻧﮩﻮ$ ﭼ&ﺖ ﮐﺮﺗﮯ ﮨﻮﺋﮯ ﮐ&ﺎ/ ﺳﮯ ﺑﺎ2ﺻﺤﺎﻓ&ﻮ
!" ﺟﻨﮩﻮ' ﻧﮯ ﻧﻤﺎ,- ﺣﻀﺮ1 ﺗﻤﺎ3- 3-4 ﮐﮯ ﺳﻮﮔﻮ1 ﻣﺮﺣﻮ9ﺎ:ﻞ ﻋﻄﺎ ﮨﻮﺋﯽ ﮐﮩﺎﮐ< ﻓ: ﮐﻮ ﺻﺒﺮ ﺟﻤ3- ﺧﺎﻧﺪG4' ﺳﮯ ﭘﻮIﻋﺎJ ﮨﻞ ﻋﻼﻗ< ﮐﮯ-‘ !ﮔﻮ$ﺑﺰ
ﮟ۔# ﮨ%ﻟﯽ ﻣﺸﮑﻮ, ﺳﺐ ﮐﮯ#$ ﺗﻌﺰ(ﺖ ﮐﯽ#ﻠﯽ ﻓﻮ/0 1(ﻌ2ﮟ ﺷﺮﮐﺖ ﮐﯽ (ﺎ ﺑﺬ/ ﻣ:;ﺟﻨﺎ
The above example is a single long sentence, a combination of multiple sentences 110 words long; the available Urdu corpus is full of sentences of this kind. The delimiting words are identified manually from a single chunk of the file, because analyzing a large file manually is time consuming: it could take weeks to generate the list of delimiting words, and still the list would not be exhaustive. The delimiting-word list is not limited to this corpus; to achieve better results in segmenting long sentences, the corpus needs constant updating, and with it constant manual updating of the delimiting words. We generated the list from one chunk of the file in order to carry on our experimentation, and we achieved significant results. For example, the single sentence above was analyzed for delimiting words and, on the basis of this list, segmented into 4 sentences with an average length of 27 words per sentence (a minimal code sketch follows the segmented example below).
ﺗﻤﺎ* ﻣﮑﺎﺗﺐ ﻓﮑﺮ ﮐﯽ,-. *.ﻨﮕﻮ ﮐﮯ ﻋﻮ4 ﭘﺮ6ﻓﺎ- ﺠﮯ ﮐﯽ8ﺘ:< ﺑ,ﺴﺮ ﻋﻘﻞ ﺑﺎ*ﺷﺎ( ﻧﮯ ﮐﮩﺎﮐ" ﮨﻤﺎ1ﻓ3 ﺸﻦ1ﺠﻮﮐ89 @(=ﭘ;ﯽ =ﺳ;ﺮﮐﭧAﭘﻮABAﻮ1ﻨﮕﻮ)ﺑE
ﮟ۔#ﮨ%ﺖ ﻣﺸﮑﻮ+ ﮐﮯ ﻧﮩﺎ1ﮨﻞ ﻋﻼﻗ6 ﭘﺮ9ﻐﺎﻣﺎ#ﺘﯽ ﭘ+ﺟﺎﻧﺐ ﺳﮯ ﺗﻌﺰ
!ﻗﻌ#$ % ﻧﮯ ﮐﮩﺎ ﮐ+ﻧﮩﻮ# ﺎ۔.ﺖ ﮐﺮﺗﮯ ﮨﻮﺋﮯ ﮐ. ﭼ5 ﺳﮯ ﺑﺎ+ﻮ.ﭘﻨﯽ ?ﮨﺎﺋﺶ ﮔﺎ< ﭘﺮ ﺻﺤﺎﻓ# ﺷﺎ< ﻧﮯCﺴﺮ ﻋﻘﻞ ﺑﺎ.ﻓH ﺸﻦ.ﺠﻮﮐL# ﯽMﭘN ?ﮩﺎO# ﮐﺎ5ﺎﻻ. ﺧR#
* ﮐﻮ ﺻﺒﺮ ﺟﻤ"ﻞ+ ﺧﺎﻧﺪ01 ﺳﮯ ﭘﻮ56ﻋﺎ8 ﮐﮯ9ﮨﻞ ﻋﻼﻗ+‘ !ﮔﻮ$*) ﻧﮯ ﺑﺰ$* ﺳﻮﮔﻮ$,* )*ﮑﯽ ﻣﮩﺮﺑﺎﻧﯽ ﺳﮯ *) ﮐﮯ ﺧﺎﻧﺪ7ٰ *ﻟﺨﺮ!' ﮨﮯ ﻣﮕﺮ !ﷲ ﺗﻌﺎﻟ
ﺳﺐ ﮐﮯ%& ﺗﻌﺰ)ﺖ ﮐﯽ%ﻠﯽ ﻓﻮ01 2)ﻌ3ﮟ ﺷﺮﮐﺖ ﮐﯽ۔ )ﺎ ﺑﺬ0 ﻧﮯ ﻧﻤﺎ=< ﺟﻨﺎ=< ﻣB ﺟﻨﮩﻮD& ﺣﻀﺮG ﺗﻤﺎ%& %&3 ﮐﮯ ﺳﻮﮔﻮG ﻣﺮﺣﻮIﺎ0 ﻓ2ﻋﻄﺎ ﮨﻮﺋﯽ ﮐﮩﺎﮐ
ﮟ۔۔# ﮨ%ﻟﯽ ﻣﺸﮑﻮ,
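A minimal sketch of this delimiter-word tokenization follows, assuming a toy gazetteer of sentence-end words; the seven markers shown are illustrative stand-ins for the paper's full list.
```python
# Sketch of delimiter-word sentence tokenization. Assumption: a small
# illustrative gazetteer; the paper's manually built list has about 103
# boundary markers.
END_WORDS = {"ہیں", "تھا", "تھے", "تھی", "گے", "ہوگا", "ہوگی"}

def tokenize_sentences(long_sentence: str) -> list[str]:
    sentences, current = [], []
    for word in long_sentence.split():
        current.append(word)
        if word in END_WORDS:                    # delimiting word found
            sentences.append(" ".join(current) + " ۔")
            current = []
    if current:                                  # trailing words, no delimiter
        sentences.append(" ".join(current))
    return sentences
```
As written, a marker that happens to occur in the middle of a sentence still triggers a false boundary here; this is exactly the ambiguity addressed in the next subsection with an n-gram approach.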
3.6.1. Sentence Boundary Marker:
The complexity of sentence boundary marking rises when a word from the generated list of delimiting words occurs as part of another word, or appears at the start or in the middle of a sentence rather than at its end. To deal with such sentences we need to take our approach to another level, as unigrams alone do not suffice. For example, the delimiter words in the following list are unigrams. We cannot use them as boundary markers on their own, because they can be parts of other words, which results in ambiguity; an n-gram approach is required. For example:
دی ، سکے ، کئے ، جائے ، کریں
ﺮﻧﮯ ﭘﺮ$% "! ﮔﻮ! ﻧﺮ ﮨﺎ"! ﮐﮯ ﺳﺎﻣﻨﮯ 'ﺣﺘﺠﺎﺟﯽ# !"ﻨﭧ ﮨﺎ#ﻤ#ﮕﺮ ﮨﻢ ﭘﺎ&ﻟ#$ !""! ﭘﺮ ﺣﻞ ﮐﺮ"ﮟ ﺑﺼﻮ#ﺎ%ﺤﯽ ﺑﻨ#ﮨﻤﺎ"! ﻣﺴﺎﺋﻞ ﺗﺮﺟ (1
ﻣﺠﺒﻮ! ﮨﻮﻧﮕﮯ ۔
ﭘﻨﮯ ﺻﻮﺑﺎﺋﯽ &ﺳﻤﺒﻠﯽ ﮐﮯ ﻣﻤﺒﺮ ﺑﺎﻏﯽ ﮨﻮﭼﮑﮯ ﮨ"ﮟ$ ﺳﮑﮯ$ !ﮔﺎ! ﮐﮩﺎ ﺟﺎﺳﮑﮯ ﺑﻠﮑ$ﺎ% ﺴﺎ ﮐﺎ! ﻧﮩ"ﮟ ﮐ"ﺎ ﺟﺲ ﮐﻮ#$ ﻧﮩﻮ! ﻧﮯ ﮐﻮﺋﯽ (2
"! ﺳﮯ ﻧﻘﻞ ﮐﺎ ﻧﺎﺳُﻮ! ﺧﺘﻢ ﮐﺌﮯ ﺑﻐ"ﺮ ﺗﺮﻗﯽ ﻣﻤﮑﻦ ﻧﮩ"ﮟ#$%$ ﻤﯽ#ﺗﻌﻠ (3
"! ﮐﺎﻣﻮ! ﮐﯽ ﺑﺠﺎﺋﮯ 'ﺟﺘﻤﺎﻋﯽ ﮐﺎﻣﻮ! ﮐﻮ ﺗﺮﺟ"ﺢ "! ﺟﺎﺗﯽ ﮨﮯ#ﻧﻔﺮ# ﮩﺎ! ﭘﺮ$ (4
"! ﮐﺎﻣﻮ! ﮐﯽ ﺑﺠﺎﺋﮯ 'ﺟﺘﻤﺎﻋﯽ ﮐﺎﻣﻮ! ﮐﻮ ﺗﺮﺟ"ﺢ "! ﺟﺎﺗﯽ ﮨﮯ#ﻧﻔﺮ# ﮩﺎ! ﭘﺮ$ (5
In examples 1, 2 and 3 the delimiting words appear in the middle of the sentences. In example 2 we also see two delimiters, one as a separate word and the other as part of another word, e.g. جاسکے, which is a combination of the words جا and سکے. The same applies to examples 4 and 5, i.e. بجائے and انفرادی: these delimiting words are also combinations of words, i.e. بجائے combines ب and جائے, and انفرادی combines انفرا and دی. But that is not all: a further issue with Urdu delimiting words is that some of them are not even used in the sense of a termination mark. For example, اسکے in example 2 above is not used as a sentence boundary marker but in the sense of "his". The ratio of such words is low, however, and they too can be handled by the n-gram approach, so this does not create a serious problem (a code sketch of the longest-match n-gram idea is given at the end of this subsection). Table 1 below presents sentence boundary markers of unigram, bigram and trigram form; we include only a few of the boundary markers, with their frequencies from a single experiment.
S. No.  Frequency (sorted)  Sentence Boundary Marker
1       259                 ہیں
2       67                  تھا
3       57                  گے
4       36                  تھے
5       20                  کیا گیا
6       13                  تھی
7       13                  ہوگا
8       11                  ہوگی
9       5                   رہا ہے
10      1                   کرتا ہے
11      1                   حوالے کئے
12      1                   حاصل کیے
13      1                   کرائی ہے
14      1                   تلاش شروع کر
15      1                   دی جائیں
Table 1: Sentence boundary markers list with corresponding frequencies
"ﺎ ﺟﺎﺋﮯ#ﻟﮯ ﮐﺌﮯ ! ﮐﺮ#ﺳﮑﮯ ! ﺣﻮ$ ﺴﺮ#ﺑﻄ! ﮐﺮ"ﮟ ! ﺗﻼ! ﺷﺮ"! ﮐﺮ"! ! ﻣ$%
Some sentences contain two delimiting words, and our model will add termination marks after both boundary markers. This does not harm the sentence: even if the sentence ends at the first delimiting word, it still retains its meaning. For example:
3.6.1.1. With two delimiting words:
ﺎ۔ ﮨﮯ۔#"ﺎ ﮔ# "ﺎ#$ﺸﻦ ﻣ"ﮟ ﺑ#$ﻨﮉ 'ﭘﻮ#ﮨﮯ "! "! ﺳﺐ ﮐﻮ ﮔﺮ# ﺟﻮ "! ﺳﺎ! "ﻗﺘﺪ"! ﻣ"ﮟ
3.6.1.2. Without second delimiting word:
"ﺎ ﮔ"ﺎ ۔# "ﺎ#$ﺸﻦ ﻣ"ﮟ ﺑ#$ﻨﮉ 'ﭘﻮ#ﮨﮯ "! "! ﺳﺐ ﮐﻮ ﮔﺮ# ﺟﻮ "! ﺳﺎ! "ﻗﺘﺪ"! ﻣ"ﮟ
It can be observed that, from a semantic point of view, the sentence preserves its meaning.
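The sketch below illustrates the n-gram boundary-marker idea discussed in this subsection: match the longest marker first (trigram, then bigram, then unigram) and compare whole tokens only, so that a unigram like سکے embedded inside جاسکے never fires. The marker tuples are illustrative stand-ins for the paper's gazetteer entries.
```python
# Longest-match n-gram boundary marking. Assumption: illustrative marker
# lists; the paper's gazetteer holds about 103 markers across n-gram sizes.
MARKERS = {
    3: {("تلاش", "شروع", "کر")},
    2: {("کیا", "گیا"), ("رہا", "ہے")},
    1: {("ہیں",), ("تھا",), ("تھے",)},
}

def mark_boundaries(words: list[str]) -> list[int]:
    boundaries, i = [], 0
    while i < len(words):
        for n in (3, 2, 1):                      # prefer the longest match
            if tuple(words[i:i + n]) in MARKERS[n]:
                boundaries.append(i + n - 1)     # last word of the marker
                i += n
                break
        else:
            i += 1                               # no marker starts here
    return boundaries
```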
4. EVALUATION:
The lack of appropriate hardware resources delayed the compilation of our experimental results. We used a personal computer running Microsoft Windows 8 to process our general-genre raw Urdu corpus of 2616 sentences and 291503 words. Computationally exhaustive iterative processing was not possible on such a workstation: processing the raw corpus file took more than two and a half days and had still not completed when the system suddenly shut down due to hardware failure. The only feasible solution to this problem was the customary divide-and-conquer approach.
We divided the experimentation file into 14 smaller chunks and experimented chunk by chunk to identify compound and long sentences. Delimiting words were extracted together with their respective frequencies in the text, and compound sentences were identified by the conjunctions present in them. All statistical results were accumulated and manually verified by Urdu language experts. Consolidated results for the 14 chunks are shown in Table 2 below, followed by a short script sketch for reproducing such counts.
S.No.  No. of Sentences  No. of Words  بلکہ  چنانچہ  یعنی  گویا  لیکن  کیونکہ  مگر  اور
1      184               20548         16    0       0     0     8     12      22   485
2      175               21794         10    0       2     0     11    8       20   517
3      274               28034         14    0       1     0     13    10      41   697
4      285               26372         15    0       1     0     8     4       29   661
5      270               27165         13    0       1     0     11    9       41   677
6      264               22956         15    0       1     0     9     3       22   582
7      211               25773         6     0       2     0     10    7       27   629
8      163               25621         17    0       1     1     17    7       36   626
9      184               22228         7     0       1     1     6     5       45   568
10     124               15326         7     0       2     0     8     2       15   335
11     104               12364         4     0       0     0     2     2       19   284
12     99                10172         6     0       1     0     2     4       17   277
13     142               15075         4     0       1     0     14    9       18   400
14     137               18075         3     0       0     0     11    6       28   457
∑      2616              291503        137   0       14    2     130   88      380  7195
Table 2: Conjunctions and delimiting words extracted from the raw Urdu corpus
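The per-chunk counts above can be reproduced with a short script. The following is a minimal sketch of that chunk-wise extraction, assuming one plain-text file per chunk; the directory and file names are hypothetical.
```python
from collections import Counter
from pathlib import Path

# Chunk-wise conjunction frequency extraction (cf. Table 2). Assumption:
# each of the 14 chunks is a UTF-8 plain-text file under a "chunks" folder.
CONJUNCTIONS = ["بلکہ", "چنانچہ", "یعنی", "گویا", "لیکن", "کیونکہ", "مگر", "اور"]

def conjunction_counts(chunk_path: Path) -> Counter:
    words = chunk_path.read_text(encoding="utf-8").split()
    return Counter(w for w in words if w in CONJUNCTIONS)

totals = Counter()
for chunk in sorted(Path("chunks").glob("chunk_*.txt")):
    counts = conjunction_counts(chunk)
    print(chunk.name, [counts[c] for c in CONJUNCTIONS])
    totals += counts                              # accumulate the ∑ row
print("total", [totals[c] for c in CONJUNCTIONS])
```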
In Table 2 we also record the frequency of اور as a conjunction; it appears 485 times in the first chunk. This was done after realizing that اور is used not only as a conjunction but also as a part of other words. For example:
اورکزئی ، مشاورت ، پشاور ، نچھاور
These are words that contain the conjunction word اور. For example, in اورکزئی the word اور is at the beginning, a combination of اور + کزئی. In مشاورت, اور is in the middle of the word, between مش and ت. In the same manner, in words such as پشاور and نچھاور, اور appears at the end. Our proposed model identifies all of these as conjunctions, which increases the computational cost and algorithmic complexity. Out of its 485 occurrences, اور is used as a conjunction 135 times, i.e. 27% of its total occurrences; the remaining 73%, i.e. 350 times, it appears as part of other words or otherwise in a non-conjunction role. We therefore exclude it from the list of conjunctions. Distinguishing its uses as a conjunction from its uses inside other words was done manually.
The word اور and other words posing similar problems are an interesting discovery of this research. چنانچہ is also a conjunction word, but in our existing corpus it does not appear even once. Our corpus is growing continuously, so we expect this and similar words to be encountered in subsequent experiments; we therefore did not exclude it from the list of conjunction words.
The complexity of Urdu increases when dealing with boundary markers. Urdu boundary-marking words do not always appear at the end of sentences: they can also be parts of other words, which may occur anywhere in a sentence, and the markers themselves may appear in the middle of sentences. Reaching a computational solution becomes more difficult still because these words are sometimes not used in the sense of boundary markers at all. For example, ہیں appears 259 times in a single chunk of the file. We carried out the same experiment on all 14 chunks, but for simplicity we consider only the first chunk here. It was not certain whether ہیں always appears as a boundary marker or also as part of other words. In our experimentation we found that ہیں appeared 154 times (59%) as a boundary marker and 105 times (41%) as part of another word, for example نہیں ، تنخواہیں ، انہیں ، تمہیں. The experimentation continued with other boundary markers such as تھا: its frequency in the text is 67, of which 26 occurrences (38%) are as a boundary marker and 41 (62%) are as part of other words.
Most of the boundary markers are parts of other words. To handle this we use a bigram approach, combining the marker with its closest neighboring word; but the bigram approach alone did not solve the problem, because the combination of these two words may also appear in the middle of sentences. To tackle this we combine the third closest word as well, i.e. a trigram approach, and the process continues; the general solution to this problem is the n-gram approach. Similarly, تھے appears in the text 36 times and is used as a boundary marker in 100% of cases. تھی appears 13 times, of which 8 occurrences (61%) are as a boundary marker and 5 (39%) are as part of another word or boundary marker, i.e. تھیں, and the same goes for the other boundary markers. The gazetteer list of delimiting words shown above contains only 15 boundary markers; we have created a list of about 103 boundary markers manually, and the list is still growing. The more the list grows, the more effective the results will be. Considering the corpus we have, 103 is not a huge list of boundary markers, but it is enough for our experimentation. A larger list requires more processing, which our workstations cannot handle effectively and which would slow the experimentation process, so we restricted the list of boundary markers to 103.
Initially, in experiment 1, there were 184 sentences with 20548 words. After the compound segmentation and tokenization processes we had 722 sentences. We obtained judgments for about 569 sentences, i.e. 78% of the sentences could be categorized as accurate or inaccurate; of these, 461 (63%) had accurately marked boundaries and 108 (14%) were marked inaccurately. The inaccuracy was due to those boundary markers that occur as parts of other words. The remaining 22% were one- or two-word sentences that appear when a single sentence contains two delimiting words. For example:
ﺎ۔ ﮨﮯ۔#ﺿﯽ &ﻧﺘﻈﺎ! ﮐ"ﺎ ﮔ#ﮨﺎ! ﭘﺮ ﻋﺎ$ "! ﻧ! ﮨﯽ# ﮐﻨﮓ ﻣﻮﺟﻮ! ﮨﮯ$ﺒﯽ ﻋﻼﻗﻮ! ﻣ"ﮟ ﮐﻮﺋﯽ ﭘﺎ#ُ! ﮐﮯ ﻟ"ﮯ ﻧ! ﺗﻮ ﻗﺮ#
Here ہے is a second boundary marker, and just a single word after it is marked as a separate sentence. There were 153 sentences of this kind, which means that 22% of the sentences were useless and were discarded accordingly.
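The reported proportions follow directly from these counts; a quick worked check, with values taken from the text above and percentages rounded:
```python
# Worked check of the chunk 1 figures: 722 sentences after segmentation, of
# which 461 were accurately bounded, 108 inaccurately, and 153 were discarded
# one- or two-word fragments (reported as 63%, 14% and 22%).
total = 722
accurate, inaccurate, discarded = 461, 108, 153
assert accurate + inaccurate + discarded == total
evaluated = accurate + inaccurate                # 569 sentences
print(f"evaluated:  {evaluated / total:.1%}")    # 78.8%
print(f"accurate:   {accurate / total:.1%}")     # 63.9%
print(f"inaccurate: {inaccurate / total:.1%}")   # 15.0%
print(f"discarded:  {discarded / total:.1%}")    # 21.2%
```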
5. CONCLUSION:
Our work on Compound Sentence Segmentation and the tokenization of long sentences is pioneering work in Urdu, and the results generated by our proposed system are promising. The results were compiled and verified manually for a single chunk of the file; we therefore did not include the other chunks. We realize that with more powerful servers and a growing delimiting-word gazetteer we can improve our results further. Besides generating statistical results, in future work we will also have our model and its results analyzed by language experts, comparing our automatically tokenized sentences with manually tokenized ones in order to assess coherence and readability.
REFERENCES:
[1] Yasumasa, S., 2016. Kansai University Graduate School of Foreign Language Education and Research. Automatic
Sentence Segmentation, accessed Feb 19. http://www.someya-net.com/00-class09/sentenceDiv.html.
[2] Brian, L., Zillig, P., Ramsay, S., Mueller, M. and Smutniak, F., 2016. Academic Technologies and Research
Services. Morph Adorner Sentence Splitter, accessed Feb 19.
http://morphadorner.northwestern.edu/sentencesplitter/example/.
[3] Malik, A.A. and Habib, A., 2013. Urdu to English Machine Translation using Bilingual Evaluation
Understudy. International Journal of Computer Applications, 82(7).
[4] Palmer, D.D., 2000. Tokenisation and sentence segmentation. In Handbook of Natural Language Processing. CRC
Press.
[5] Aroonmanakun, W., 2007, December. Thoughts on word and sentence segmentation in Thai. In Proceedings of the
Seventh Symposium on Natural language Processing, Pattaya, Thailand, December 13–15 (pp. 85-90).
[6] Baseer, F., Habib, A. and Ashraf, J., 2016, August. Romanized Urdu Corpus development (RUCD) model: Edit-
distance based most frequent unique unigram extraction approach using real-time interactive dataset. In Innovative
Computing Technology (INTECH), 2016 Sixth International Conference on (pp. 513-518). IEEE.
[7] Xu, J., Zens, R. and Ney, H., 2005, May. Sentence segmentation using IBM word alignment model 1. In Proceedings
of EAMT (pp. 280-287).
[8] A. Gul, A. Habib, J. Ashraf, "Identification and extraction of Compose-Time Anomalies in Million Words Raw Urdu
Corpus and Their Proposed Solutions", proceedings of the 3rd International Multidisciplinary Research Conference
(IMRC), 2016.
[9] Xue, N. and Yang, Y., 2011. Chinese sentence segmentation as comma classification. In Proceedings of the 49th
Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: short papers-
Volume 2 (pp. 631-635). Association for Computational Linguistics.
[10] Rehman, Z. and Anwar, W., 2012. A hybrid approach for urdu sentence boundary disambiguation. International
Arab Journal of Information Technology, 9(3), pp.250-255.
[11] Kiss, T. and Strunk, J., 2006. Unsupervised multilingual sentence boundary detection. Computational
Linguistics, 32(4), pp.485-525.
[12] Jurish, B. and Würzner, K.M., 2013. Word and Sentence Tokenization with Hidden Markov Models. JLCL, 28(2),
pp.61-83.
[13] Habib, A., Iwatate, M., Asahara, M. and Matsumoto, Y., 2012. Keypad for large letter-set languages and small touch-
screen devices (case study: Urdu). International Journal of Computer Science, 9(3), ISSN: 1694-0814.
[14] Hearst, M.A., 1997. TextTiling: Segmenting text into multi-paragraph subtopic passages. Computational
linguistics, 23(1), pp.33-64.
[15] Habib, A., Iwatate, M., Asahara, M. and Matsumoto, Y., 2013. Optimized and Hygienic Touch Screen Keyboard
for Large Letter Set Languages. In Proceedings of the 7th ACM International Conference on Ubiquitous Information
Management and Communication (ICUIMC), Kota Kinabalu, Malaysia.
[16] Evang, K., Basile, V., Chrupała, G. and Bos, J., 2013, October. Elephant: Sequence labeling for word and sentence
segmentation. In EMNLP 2013.
[17] Xu, L.F., Zhu, Y., Yang, L.J. and Jin, Y.H., 2014. Research on Sentence Segmentation with Conjunctions in Patent
Machine Translation. In Applied Mechanics and Materials (Vol. 513, pp. 4605-4609). Trans Tech Publications.
[18] Habib, A., Iwatate, M., Asahara, M. and Matsumoto, Y., 2011. Different input systems for different devices: Optimized
touch-screen keypad designs for Urdu scripts. In Proceedings of the Workshop on Text Input Methods (WTIM 2011),
IJCNLP, Chiang Mai, Thailand.
[19] Gibson, E. and Pearlmutter, N.J., 1998. Constraints on sentence comprehension. Trends in cognitive sciences, 2(7),
pp.262-268.
[20] Adnan, M. Habib, A. Mukhtar, H. Ali, G., 2017. Assessing the Realization of Smartphones Learning Objects in
Students’ Adaptive Learning Paths. International Journal of Engineering Research, 6(4), ISBN:978-81-932091-0-3.