<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Qualitative Analysis of Contemporary Urdu Machine Translation Systems</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Asad Abdul Malik</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Asad Habib</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Kohat University of Science and Technology</institution>
          ,
          <addr-line>Kohat</addr-line>
          ,
          <country country="PK">Pakistan</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The diversity of source and target languages, coupled with source language ambiguity, makes Machine Translation (MT) an exceptionally hard problem. The highly information-intensive corpus-based MT leads the MT research field today, with Example Based MT and Statistical MT representing two dissimilar frameworks in the data-driven paradigm. Example Based MT matches examples from a large amount of training data and then adapts and recombines them. Urdu MT is still in its infancy due to the limited availability of the required data and computational resources. This paper provides a detailed survey of the aforementioned contemporary MT techniques and reports findings based on qualitative analysis, supported by some quantitative results obtained with the BLEU metric. The strengths and weaknesses of each technique are brought to the surface through discussion of examples from the Urdu language. The paper concludes with proposed future directions for research in Urdu machine translation.</p>
      </abstract>
      <kwd-group>
        <kwd>Urdu Machine Translation</kwd>
        <kwd>Qualitative Comparison</kwd>
        <kwd>Rule Based MT</kwd>
        <kwd>Statistical MT</kwd>
        <kwd>Example Based MT</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Translating text from one natural language, the source language (SL), into another, the
target language (TL), is as old as written literature [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. At present, the need for
translation is growing continuously in business, the economy, medicine and many other
fields. The growth of science and technology in general, and of computer-based solutions
in particular, has paved the way for automatic translation, known as
Machine Translation (MT) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>Urdu</title>
      <p>
        Urdu ranks 19th among the 7,105 languages spoken in the world. It is one of the
most spoken languages in South Asia [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. It is also spreading in the West due to the large
diaspora from the Indo-Pak subcontinent. Urdu is the national language of Pakistan
and is used i) as the medium of teaching in most public schools, ii) for junior to
mid-level administration and iii) in the mass print and electronic media. It is spoken not only
in Pakistan but also in India, Bangladesh, Afghanistan and Nepal. It has also
become the culture language and lingua franca of the South Asian Muslim diaspora
outside the Indo-Pak subcontinent, mainly in the Middle East, Europe, Canada and the
United States [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
    </sec>
    <sec id="sec-3">
      <title>Urdu Machine Translation (UMT)</title>
      <p>
        In spite of the large number of speakers around the world, there are very few
computational natural language tools available for Urdu. It is a morphologically rich
language with many other distinct linguistic characteristics, yet it is still
an under-resourced language from the point of view of computational research. We
could not find any public domain machine translation tool developed specifically
for Urdu. However, some traces of basic MT techniques have been found [
        <xref ref-type="bibr" rid="ref5 ref6 ref7 ref8 ref9">5-9</xref>
        ]. In
the current work we present a detailed survey of contemporary research in
UMT. We identify the weaknesses and strengths of each technique and propose
guidelines for future directions in UMT research.
      </p>
      <sec id="sec-3-1">
        <title>Literature Survey</title>
        <p>
          Some traces of basic UMT research are presented in this section. Naila et al. [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]
presented a Rule Based Machine Translation (RBMT) technique for English to Urdu,
primarily based on the transfer approach, that tries to handle case phrases and verb
postpositions using Paninian grammar. Statistical Machine Translation (SMT) between
languages with word order differences was discussed by Bushra et al. [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. An Example
Based Machine Translation (EBMT) approach that translates text from English to Urdu
and supports idioms and homographs was introduced by Maryam and Asif [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ].
A parallel corpus for English to Urdu statistical machine translation was
presented by Aasim et al. [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. Word-order issues in English-to-Urdu translation have been
investigated by Bushra and Zeman [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. In addition, SMT systems such as Google (http://translate.google.com) and Bing (http://www.bing.com/translator)
are already available online. However, these systems offer poor translation quality and
limited accuracy due to issues related to Urdu syntax and other intrinsic linguistic
features.
        </p>
        <p>Contemporary Machine Translation techniques can be broadly categorized into
three paradigms as shown in Figure 1.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Rule Based Machine Translation (RBMT)</title>
      <p>
        To provide suitable rules for translation, RBMT needs linguistic knowledge of
the source as well as the target language. Translation depends on formalized linguistic
knowledge represented in a lexicon along with grammars [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. RBMT has
several characteristics: it has a firm set of well-formed rules, many of the rules rely on
current linguistic theories, and grammatical errors are prohibited. A major
advantage of RBMT is that, if the required knowledge is not found in the available literature,
ad-hoc heuristic rules can be applied [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Such a system contains an input sentence
analyzer (morphological, syntactic and semantic analysis) and procedures for producing
output (structural transfer and inherent interlingua structures).
      </p>
    </sec>
    <sec id="sec-5">
      <title>Statistical Machine Translation (SMT)</title>
      <p>Two models are built in SMT: i) a translation model and ii) a language model. The
translation model gives the probability of a target sentence given the source sentence, P(T/S),
whereas the language model determines the probability P(S) of the string of target
language actually occurring in that language. Using the language model and the
conditional probabilities of the translation model, P(S/T) is calculated with the following
formula:
        <disp-formula><tex-math>P(S/T) = \frac{P(T/S)\,P(S)}{P(T)}</tex-math></disp-formula>
      </p>
      <p>
        Probability-based analysis of MT is part of SMT. It has numerous diverse
applications, such as word sense disambiguation and structural disambiguation
[
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. SMT techniques do not need explicit encoding of linguistic information.
They depend heavily on the availability of very large amounts of high-quality bilingual data,
which presently do not exist for Urdu and other languages spoken in the Indo-Pak
subcontinent.
      </p>
    </sec>
    <sec id="sec-6">
      <title>Example Based Machine Translation (EBMT)</title>
      <p>
        Somers referred to EBMT as a hybrid approach of RBMT and SMT [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. Like SMT,
it depends on a corpus of available translations. That is why it is similar to
(and often confused with) the translator's aid known as Translation Memory (TM). EBMT and
TM both compare the input text with a database of real examples and
find the nearest match. In TM, a translator selects the candidate target text,
whereas EBMT uses automated procedures that identify the translation
fragments. Recombination of these fragments produces the target text [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
      <p>
        Thus the process is split into three phases [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]: i) “Matching” fragments against the
available database of real examples (a step common to EBMT and TM), ii)
“Alignment”, which identifies the corresponding translation fragments, and finally iii)
“Recombination”, which gives the target text. EBMT needs a database of parallel translations
that is searched for source language phrases or sentences; their nearest matching
target language components are generated as output [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
      <p>EBMT can store translation examples in different ways. In the simplest case,
examples are saved as pairs of strings with no extra information attached to them.</p>
      <sec id="sec-6-1">
        <title>Methodology</title>
        <p>In this section we discuss the methodologies of the three major Machine Translation
techniques. English is taken as the source language and Urdu as the target language.
We compare the strengths and weaknesses of these techniques in Section 4.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>Rule Based Machine Translation</title>
      <p>There are three stages in RBMT: i) Analysis, ii) Transfer and iii) Synthesis.</p>
    </sec>
    <sec id="sec-8">
      <title>Analysis.</title>
      <p>The source text is analyzed using the lexicon and grammar rules of the source
language. Word segmentation is performed, each word is annotated with the appropriate POS
tag, and a parse tree of the input text is created. The parse tree for the input text “I called you
several times” is shown in Figure 3.</p>
    </sec>
    <sec id="sec-9">
      <title>Transfer.</title>
      <p>In this stage, the parse tree of the source language text is 'transferred' into a parse tree of
the desired target language according to the lexicon and structural rules of the target
language. English is an SVO (Subject, Verb, Object) language whereas Urdu is an SOV
language, so re-ordering of words is inevitable in order to generate the output parse tree
shown in Figure 4.</p>
      <p>English to Urdu Translation Rules.</p>
      <p>Some coarse-grained rules for translation from English to Urdu are given below.</p>
      <p>1. NP follows the same rule in both languages, so swapping is not required.</p>
      <p>2. If an NP consists of an NP and a PP, it is transformed, because in Urdu the PP comes before the NP.</p>
      <p>English: NP → NP + PP</p>
      <p>Urdu: NP → PP + NP</p>
      <p>3. If an adverb phrase (AP) appears before the verb, swapping is not needed. An AP in
English can appear in different positions depending on the type of AP, whereas Urdu
prefers the AP before the verb.</p>
      <p>Urdu: AP + V</p>
      <p>4. In Urdu, the verb phrase (VP) is inflected according to the gender, number and person
(GNP) of the head noun, while the NP depends upon the tense, aspect and modality of the
verb phrase (VP). Urdu adjectives are also modified by the GNP of the head noun.</p>
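      <p>The reordering part of the transfer stage can be illustrated with a minimal Python sketch. It assumes a toy parse tree encoded as nested tuples; the function names, the label set and the tree shape are illustrative assumptions and do not reproduce the implementation of any surveyed system. Only two coarse reorderings are applied: verbs move to the end of a VP (SVO to SOV, which also places the AP before the verb) and a PP moves before the NP it modifies (rule 2). Lexical transfer, GNP agreement and the exact constituent order of Figure 4 are not modelled, so the output is an English-word gloss in verb-final order.</p>
      <preformat>
```python
# Toy parse tree: (label, child, ...) for phrases, (label, word) for leaves.
# Labels and tree shape are illustrative assumptions.

def is_leaf(node):
    """A leaf is a (tag, word) pair whose second element is a string."""
    return len(node) == 2 and isinstance(node[1], str)

def reorder(node):
    """Return a copy of the tree with children in a verb-final, PP-first order."""
    if is_leaf(node):
        return node
    label = node[0]
    children = [reorder(child) for child in node[1:]]
    if label == "VP":
        # SVO to SOV: verbs go after objects and adverb phrases.
        verbs = [c for c in children if c[0] == "V"]
        others = [c for c in children if c[0] != "V"]
        children = others + verbs
    elif label == "NP":
        # Rule 2: the PP precedes the NP it modifies.
        pps = [c for c in children if c[0] == "PP"]
        others = [c for c in children if c[0] != "PP"]
        children = pps + others
    return (label, *children)

def leaves(node):
    """Read the words of the tree from left to right."""
    if is_leaf(node):
        return [node[1]]
    words = []
    for child in node[1:]:
        words.extend(leaves(child))
    return words

english = ("S",
           ("NP", ("PRP", "I")),
           ("VP", ("V", "called"),
                  ("NP", ("PRP", "you")),
                  ("AP", ("JJ", "several"), ("NNS", "times"))))

print(leaves(reorder(english)))
```
      </preformat>
      <p>The sketch keeps the relative order of the non-verb constituents, so a real transfer component would need further rules to decide, for example, whether the AP or the object comes first.</p>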
    </sec>
    <sec id="sec-10">
      <title>Synthesis.</title>
      <p>Finally, the target language lexicon and grammar are used to convert the parse tree of
the target language into the target language surface form. This requires two independent
monolingual dictionaries so that the appropriate surface form of the target language can be
generated.</p>
      <p>As shown in Figure 5, the source text “I called you several times” is translated into
“میں کئی مرتبہ آپ کو بلایا” using RBMT.</p>
      <p>Statistical Machine Translation (SMT) makes use of i) a Translation Model, ii) a Language Model and iii) a Decoder
Algorithm.</p>
    </sec>
    <sec id="sec-11">
      <title>Translation Model.</title>
      <p>Words and phrases in the source text are matched against target language strings.
If the strings match, the model assigns a probability value P(T/S) to the pair. This
probability indicates how likely it is that the input text string corresponds to the given string in the
output or target language. These probability values are derived from a parallel
corpus produced by human translation. Subsequently, machine learning techniques are used to
improve the system on the basis of the human-translated text.</p>
    </sec>
    <sec id="sec-12">
      <title>Language Model.</title>
      <p>The language model determines the probability P(S) of the output text string. It does not
require a parallel corpus; it requires text in only one language. The
value can be calculated using an N-gram model, in which the probability of a sentence of
length N is the product of the probabilities of each kth word given the
previous words k-1 and k-2.</p>
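      <p>As a hedged illustration of this product, the following Python sketch estimates the probability of each word given its two predecessors from raw counts. The romanized Urdu toy corpus, the boundary markers and the add-alpha smoothing are assumptions made for the example; they are not taken from the surveyed systems.</p>
      <preformat>
```python
from collections import Counter

BOS, EOS = "BOS", "EOS"  # sentence boundary markers (assumed names)

def train_trigram(sentences):
    """Count trigrams and their two-word histories in tokenized sentences."""
    tri, hist = Counter(), Counter()
    for words in sentences:
        padded = [BOS, BOS] + list(words) + [EOS]
        for k in range(2, len(padded)):
            history = (padded[k - 2], padded[k - 1])
            tri[history + (padded[k],)] += 1
            hist[history] += 1
    return tri, hist

def sentence_probability(words, tri, hist, vocab_size, alpha=1.0):
    """Product over k of P(w_k | w_(k-2), w_(k-1)) with add-alpha smoothing."""
    padded = [BOS, BOS] + list(words) + [EOS]
    prob = 1.0
    for k in range(2, len(padded)):
        history = (padded[k - 2], padded[k - 1])
        count = tri[history + (padded[k],)]
        prob *= (count + alpha) / (hist[history] + alpha * vocab_size)
    return prob

# Invented monolingual toy corpus in romanized Urdu.
corpus = [["main", "ne", "aap", "ko", "bulaya"],
          ["main", "ne", "us", "ko", "bulaya"]]
tri, hist = train_trigram(corpus)
vocab = {w for s in corpus for w in s} | {EOS}

seen = sentence_probability(corpus[0], tri, hist, len(vocab))
shuffled = sentence_probability(["bulaya", "ko", "aap", "ne", "main"],
                                tri, hist, len(vocab))
print(seen > shuffled)
```
      </preformat>
      <p>A word order that occurs in the training text receives a higher probability than a scrambled one, which is the property the decoder relies on when ranking candidate outputs.</p>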
    </sec>
    <sec id="sec-13">
      <title>Decoder Algorithm.</title>
      <p>After the product of the translation model and language model probabilities has been computed, the decoder algorithm
selects the output language string with the highest probability value, based on
the stochastic formula mentioned in Section 2.2.</p>
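      <p>The selection step can be sketched as follows, under the simplifying assumption that a small set of candidate outputs and their model probabilities is already available. Real decoders search a very large hypothesis space; the `decode` function, the candidate strings and all probability values below are invented for illustration. Because P(T) is the same for every candidate, it is left out of the score.</p>
      <preformat>
```python
import math

def decode(candidates, translation_prob, language_prob):
    """Return the candidate with the highest translation x language model score.

    Scores are summed in log space to avoid underflow on long sentences.
    Both probabilities are assumed to be non-zero for every candidate.
    """
    def score(candidate):
        return (math.log(translation_prob[candidate])
                + math.log(language_prob[candidate]))
    return max(candidates, key=score)

# Invented toy probabilities for three candidate outputs (romanized Urdu).
translation_prob = {
    "main ne aap ko bulaya": 0.30,
    "main bulaya aap ko": 0.35,
    "aap ne mujhe bulaya": 0.05,
}
language_prob = {
    "main ne aap ko bulaya": 0.020,
    "main bulaya aap ko": 0.001,
    "aap ne mujhe bulaya": 0.015,
}

best = decode(list(translation_prob), translation_prob, language_prob)
print(best)
```
      </preformat>
      <p>In this toy setting the second candidate has the highest translation model probability, but its unnatural word order is penalized by the language model, so the first candidate wins.</p>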
    </sec>
    <sec id="sec-14">
      <title>Example Based Machine Translation (EBMT)</title>
      <p>English to Urdu EBMT is divided into four phases: i) Sentence Fragmentation, ii)
Search in Corpus, iii) N-ary Product Based Retrieval and iv) Ordering of Translated
Phrases.</p>
    </sec>
    <sec id="sec-15">
      <title>Sentence Fragmentation.</title>
      <p>To make the input sentence easier for the translator to handle, it is broken
into phrases. Alternatively, the same results can be achieved by storing whole sentences in the
corpus and gaining broad coverage by fragmenting and recombining them at run time with a genetic
algorithm to obtain new sentences. Fragmentation of a sentence into
phrases is handled using the concepts of idioms, cutter points and connecting words.</p>
    </sec>
    <sec id="sec-16">
      <title>Searching in Corpus.</title>
      <p>The bilingual corpus is searched to determine whether the input phrase is available.
If the system cannot locate an exact match, it looks for the
nearest match. Closeness is judged against thresholds at two stages: i) for an exact match
and ii) for the nearest match. Two algorithms are used for this: the “Levenshtein Algorithm”
and the “Semantic Distance Algorithm”.</p>
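      <p>A minimal Python sketch of nearest-match lookup with the Levenshtein (edit) distance is given below. It assumes word-level comparison and a single distance threshold; the `nearest_example` helper, the example store and the threshold values are hypothetical, and the semantic distance algorithm is not covered.</p>
      <preformat>
```python
def levenshtein(a, b):
    """Edit distance between two sequences (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]

def nearest_example(phrase, examples, threshold):
    """Return (source, target) of the closest stored example, or None.

    `examples` maps source phrases to translations; `threshold` is the largest
    word-level distance still accepted (0 accepts exact matches only).
    """
    words = phrase.split()
    best = min(examples, key=lambda src: levenshtein(words, src.split()))
    if levenshtein(words, best.split()) > threshold:
        return None
    return best, examples[best]

# Invented example store: English source phrases with romanized Urdu targets.
examples = {
    "I called you": "main ne aap ko bulaya",
    "he called you": "us ne aap ko bulaya",
}

print(nearest_example("I called you several times", examples, threshold=2))
print(nearest_example("I called you several times", examples, threshold=1))
```
      </preformat>
      <p>With a threshold of 2 the stored phrase “I called you” is accepted as the nearest match (two word insertions away); with a threshold of 1 no example is close enough and the lookup returns nothing.</p>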
    </sec>
    <sec id="sec-17">
      <title>N-ary Product Based Retrieval.</title>
      <p>The translation of the input sentence is extracted in this stage. An input
may have many possible translations, so all the possibilities are collected and the idea of the
n-ary product is used to record all the feasible sentences.</p>
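      <p>The n-ary (Cartesian) product of the per-fragment translation options can be sketched in Python with the standard library function `itertools.product`. The fragment options below are invented romanized Urdu phrases, and the `candidate_sentences` helper is a hypothetical name used only for this illustration.</p>
      <preformat>
```python
from itertools import product

def candidate_sentences(fragment_options):
    """Combine every translation option of every fragment, in fragment order."""
    return [" ".join(choice) for choice in product(*fragment_options)]

# One list of alternative translations per input fragment (invented data).
options = [
    ["main ne"],
    ["aap ko", "tum ko"],
    ["kai martaba", "kai baar"],
    ["bulaya"],
]

candidates = candidate_sentences(options)
print(len(candidates))
for sentence in candidates:
    print(sentence)
```
      </preformat>
      <p>The number of candidates is the product of the number of options per fragment (here 1 x 2 x 2 x 1 = 4), so the candidate set grows quickly for long or highly ambiguous inputs.</p>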
    </sec>
    <sec id="sec-18">
      <title>Ordering of Translated Phrases.</title>
      <p>If a single input sentence was divided into pieces and each piece was translated into an output language
phrase, the translated phrases are put into order in this phase.</p>
      <sec id="sec-18-1">
        <title>Comparison</title>
      </sec>
    </sec>
    <sec id="sec-19">
      <title>Rule Based Machine Translation</title>
      <p>The quality of translation in Rule Based Machine Translation (RBMT) depends upon
a large number of rules; its computational cost is therefore very high. The rules are based
on both the source and target languages and their respective morphological, syntactic and
semantic structures. With a large set of fine-grained linguistic rules, RBMT
generates translations of acceptable quality, but developing such a system requires
much time and many man-hours because linguistic resources of this type must be
hand-crafted (the Knowledge Acquisition Problem). As RBMT works with exact matches, it is
unable to translate text when the system does not have enough knowledge about the
input. It is also difficult to add more rules in order to generate higher quality output.</p>
    </sec>
    <sec id="sec-20">
      <title>Statistical Machine Translation</title>
      <p>In SMT, knowledge about translation is acquired automatically from example data.
This is the main reason why SMT systems can be developed faster than RBMT systems. When
a large corpus is available but linguistic knowledge is not readily available,
SMT is the preferred method. SMT techniques generate better results when the input and output languages are not
morphologically complex. SMT-based approaches
do not need bilingual dictionaries; they depend upon the quality of the bilingual corpus.</p>
    </sec>
    <sec id="sec-21">
      <title>Example Based Machine Translation</title>
      <p>EBMT requires a bilingual dictionary. It translates text by adapting stored examples. Its
computational cost is lower than that of RBMT. The system
can be upgraded by storing suitable examples in the database. It works by best-match reasoning, so when the
corresponding example is not available in the corpus, the translation process becomes
complicated. It translates in a fail-safe way. The quality of translation depends upon the difference
between the input text and the similar examples retrieved by lookup. EBMT can also notify the user
when its translation is improper.</p>
      <sec id="sec-21-1">
        <title>Findings</title>
        <p>The qualitative findings are tabulated in Table 1 and the quantitative findings are
given in Table 2. The BLEU metric was used to evaluate the machine
translated text; five reference sentences were used for calculating the BLEU value.
The BLEU values show that EBMT performs better than the
other three systems. RBMT was found to be better than both SMT systems.
Of the two SMT systems (Google and Bing), the Bing translator gave better results than the
Google translator.</p>
        <p>After a detailed literature study and investigation of the above-mentioned three MT
techniques, we can conclude that for languages with similar lexical and syntactic
structure, e.g. Urdu and Hindi, the rule based MT technique gives better results. SMT
systems perform better if the necessary resources, such as annotated corpora, are
available. At present, most systems translate text from the source to the target language
one sentence at a time, whereas in real life the text to be translated is much longer than
one sentence. Nonetheless, the continuous process of repeated translation and
improvement by human annotators contributes significantly to any MT system.</p>
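        <p>For readers unfamiliar with the BLEU metric used in the quantitative comparison, the following Python sketch shows a simplified sentence-level variant: clipped n-gram precision against several references, combined by a geometric mean and multiplied by a brevity penalty. It is an illustrative assumption of how such a score can be computed, without smoothing, and is not the evaluation script used to produce Table 2. The English example sentences are invented.</p>
        <preformat>
```python
import math
from collections import Counter

def ngrams(words, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

def modified_precision(candidate, references, n):
    """Clipped n-gram precision of one candidate against several references."""
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram])
                  for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())

def bleu(candidate, references, max_n=4):
    """Sentence-level BLEU: brevity penalty x geometric mean of precisions."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    # Reference length closest to the candidate length (ties: the shorter).
    ref_len = min((abs(len(r) - len(candidate)), len(r)) for r in references)[1]
    if len(candidate) > ref_len:
        penalty = 1.0
    else:
        penalty = math.exp(1 - ref_len / len(candidate))
    return penalty * math.exp(sum(math.log(p) for p in precisions) / max_n)

references = ["the cat is on the mat".split(),
              "there is a cat on the mat".split()]
print(bleu("the cat is on the mat".split(), references))
print(modified_precision("the the the the the the the".split(), references, 1))
```
        </preformat>
        <p>Clipping prevents a candidate from being rewarded for repeating a frequent word, and the brevity penalty lowers the score of candidates that are shorter than the closest reference.</p>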
      </sec>
      <sec id="sec-21-2">
        <title>Conclusion and Future Directions</title>
        <p>In this paper we explained three main techniques of machine translation: Rule Based
Machine Translation, Statistical Machine Translation and Example Based Machine
Translation. We explained the methodology of each of these techniques and compared
them on the basis of their respective outputs for carefully selected text. Our
current work is preliminary in nature; nevertheless, it reports significant results based on
qualitative analysis.</p>
        <p>To contribute meaningfully to UMT research, we are currently
building the required corpora. We intend to use our corpora to conduct
larger-scale automated experiments and to report quantitative results that are comparable
to those of human translators. Based on our qualitative and quantitative results, we aim to
propose a new model that minimizes the flaws in existing Urdu MT systems. Ideally,
we would like to implement our proposed system with lower requirements for
computational and human resources.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Homiedan</surname>
            ,
            <given-names>A.H.</given-names>
          </string-name>
          :
          <article-title>Machine translation</article-title>
          .
          <source>J. King Saud Uni. Lang. &amp; Trans. 10</source>
          ,
          <fpage>1</fpage>
          -
          <lpage>21</lpage>
          (
          <year>1998</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Hutchins</surname>
          </string-name>
          , J.:
          <source>Latest Development in Machine Translation Technology: Beginning a New Era in MT Research</source>
          . MT Summit IV,
          <fpage>11</fpage>
          -
          <lpage>34</lpage>
          , Kobe, Japan (
          <year>1993</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Lewis</surname>
            ,
            <given-names>M.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Simons</surname>
            ,
            <given-names>G.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fennig</surname>
            ,
            <given-names>C.D.</given-names>
          </string-name>
          :
          <article-title>Ethnologue: Languages of the World</article-title>
          .
          <article-title>Seventeenth edition</article-title>
          . Dallas, Texas: SIL International (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Schmidt</surname>
            ,
            <given-names>R.L.</given-names>
          </string-name>
          :
          <article-title>Urdu: An Essential Grammar</article-title>
          . Routledge, Taylor &amp; Francis Group. London and New York (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Ata</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jawaid</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kamran</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Rule Based English to Urdu Machine Translation</article-title>
          .
          <source>Proceedings of Conference on Language and Technology (CLT'07)</source>
          . (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Jawaid</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zeman</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bojar</surname>
            ,
            <given-names>O.:</given-names>
          </string-name>
          <article-title>Statistical Machine Translation between Languages with Significant Word Order Difference</article-title>
          . Prague (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Zafar</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Masood</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Interactive English to Urdu Machine Translation using Example-Based Approach</article-title>
          . IJCSE
          <volume>1</volume>
          (
          <issue>3</issue>
          ),
          <fpage>276</fpage>
          -
          <lpage>283</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Ali</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Siddiq</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Malik</surname>
            ,
            <given-names>M.K.</given-names>
          </string-name>
          :
          <article-title>Development of Parallel Corpus and English to Urdu Statistical Machine Translation</article-title>
          .
          <source>IJET-IJENS</source>
          <volume>10</volume>
          (
          <issue>5</issue>
          ),
          <fpage>31</fpage>
          -
          <lpage>33</lpage>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Jawaid</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zeman</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Word-Order Issues in English-to-Urdu Statistical Machine Translation</article-title>
          . Prague Bull. Math. Linguistics.
          <fpage>87</fpage>
          -
          <lpage>106</lpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <article-title>Survey of Machine Translation Evaluation</article-title>
          .
          <source>EuroMatrix</source>
          . (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Samantaray</surname>
          </string-name>
          , S.D.:
          <article-title>Example based machine translation approach for Indian languages</article-title>
          .
          <source>ICCS</source>
          .
          <fpage>1</fpage>
          -
          <lpage>10</lpage>
          (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Somers</surname>
          </string-name>
          , H. :
          <article-title>Machine translation and Welsh: The way forward. A Report for the Welsh Language Board, Centre for Computational Linguistics</article-title>
          , UMIST, Manchester
          (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>