<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Statistical machine translation between related and unrelated languages</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>David Kolovratník</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Natalia Klyueva</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ondřej Bojar</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Charles University in Prague, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2008</year>
      </pub-date>
      <issue>201</issue>
      <fpage>31</fpage>
      <lpage>36</lpage>
      <abstract>
        <p>In this paper we describe an attempt to compare how relatedness of languages can influence the performance of statistical machine translation (SMT). We apply the Moses toolkit to the Czech-English-Russian corpus UMC 0.1 in order to train two translation systems: Russian-Czech and English-Czech. The quality of the translation is evaluated on an independent test set of 1000 sentences parallel in all three languages using an automatic metric (BLEU score) as well as manual judgments. We examine whether the quality of Russian-Czech translation is better thanks to the relatedness of the languages and similar characteristics of word order and morphological richness. Additionally, we present and discuss the most frequent translation errors for both language pairs.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Data</title>
      <p>
        Statistical machine translation (SMT) has become
one of the easiest and cheapest paradigms for building MT
systems. Researchers can now use various toolkits to
experiment with different language pairs. We
experiment with Moses [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], an open-source implementation
of a phrase-based statistical translation system.
      </p>
      <p>
        For closely related languages, statistical MT
methods are sometimes believed to be unreasonably
complicated. For example, in the project Česílko [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]
(machine translation among Slavic languages), the main
emphasis was put on the idea that the relatedness of the
languages, rather than statistics, should be exploited.
      </p>
      <p>Česílko was initially a rule-based system built on
direct word-for-word translation (for the very closely
related Czech and Slovak), with a few syntactic
transfer rules added for less closely related language
pairs (Czech and Polish, or Czech and Lithuanian).</p>
      <p>In our experiments we examine whether
relatedness has a positive effect when phrase-based
statistical models are used.</p>
      <p>Our main hypothesis was that we should obtain
better results in Russian-to-Czech translation than in
English-to-Czech. We used the Moses toolkit to carry out
the experiments and evaluation. Additionally, we applied
factored models to the tagged version of the corpus and
compared the outputs.</p>
      <p>The paper is structured as follows. Section 2 and
Section 3 describe the data we used during the experiments
and our tokenization and tagging tools. In Section 4 and
Section 5 we briefly summarize the Moses toolkit and present
our experiments with MT between English/Russian and Czech.
In Section 6 we evaluate our MT output using an automatic
metric and a few manual evaluation metrics. Finally, the
paper concludes with a discussion and plans for future
work.</p>
      <p>[…] newly published articles. The held-out and test set
sentences have been added to the UMC corpus<sup>2</sup>.</p>
      <table-wrap>
        <label>Table 1</label>
        <table>
          <thead>
            <tr><th>Purpose</th><th>Languages</th><th>Sentences</th></tr>
          </thead>
          <tbody>
            <tr><td>Language Model</td><td>cs</td><td>92,233</td></tr>
            <tr><td>Translation Model</td><td>ru→cs</td><td>79,888</td></tr>
            <tr><td>Translation Model</td><td>en→cs</td><td>76,588</td></tr>
            <tr><td>Held-out</td><td>cs, en, ru</td><td>750</td></tr>
            <tr><td>Test set</td><td>cs, en, ru</td><td>1,000</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <sec id="sec-1-1">
        <title>2 http://ufal.mff.cuni.cz/umc/</title>
      </sec>
      <sec id="sec-1-2">
        <title>3 http://www.statmt.org/moses/</title>
      </sec>
      <sec id="sec-1-3">
        <title>4 http://www.fjoch.com/GIZA++.html</title>
      </sec>
      <sec id="sec-1-4">
        <title>5 http://www.speech.sri.com/projects/srilm/</title>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>Data preprocessing</title>
      <p>We used the tools developed within the UMC project,
namely the trainable tokenizer for Czech, English and
Russian. It was applied to the test and
development sets to make them consistent with
the training sets.</p>
      <p>
        In order to train a factored model, we tagged and
lemmatized the UMC corpus with the help of
TreeTagger [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] for English and Russian and Hajič’s morphological
tagger for Czech [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Figure 1 provides examples of
tagged and lemmatized text in the format
suitable for factored training.
      </p>
      <p>Figure 1. Examples of tagged and lemmatized text in the format used for factored training (form|lemma|tag):
Cz: prostě|prostě|Dg-------1A---- jsem|být|VB-S---1P-AA--- brala|brát|VpQW---XR-AA---
Ru: включая|включая|Sp-a президента|президент|Ncmsay мбеки|мбеки|Vmip3s-a-p
En: the|the|DT visionaries|visionary|NNS would|would|MD have|have|VH gotten|get|VVN nowhere|nowhere|RB</p>
      <p>4 Simple Moses</p>
      <p>Moses<sup>3</sup> is a phrase-based SMT system that is
very much language independent, since it implements
a purely data-driven method. In contrast to other
methods of MT, phrase-based systems can perform
translation directly between surface forms (hence
the frequent name “direct translation”). The most
important property of phrase-based systems is the
ability to translate contiguous sequences of words (called
“phrases”) rather than merely single words. See
Figure 2 for an illustration.</p>
      <p>The Moses toolkit is a complex system which
utilizes several other components. Let us mention at least
GIZA++<sup>4</sup>, used for finding word alignments, the
SRI Language Modeling Toolkit<sup>5</sup>, and the built-in
implementation of model optimization (Minimum Error
Rate Training, MERT) on a given held-out set of
sentences.</p>
      <p>To establish a baseline, we trained translation
models for direct translation from Russian to Czech
(ru→cs simple) and English to Czech (en→cs simple),
optimizing them on the 750 held-out sentences.</p>
      <p>5 Moses factored</p>
      <p>All knowledge used by Moses comes from the
corpus. Moreover, direct phrase-based translation
models have no generalizing capacity. Their
performance thus strongly depends on whether particular words
and word sequences were seen in the training
data. Phrase-based translation therefore often faces a
problem known as data sparseness, and the problem is more
pronounced for morphologically rich languages, where
all word forms have to be seen.</p>
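      <p>As a rough illustration of the data sparseness problem, the following Python sketch measures how many test tokens were never seen in training, once on surface forms and once on lemmas. The function name oov_rate and the toy Czech word lists are assumptions made for illustration only; they are not data or code from our experiments.</p>
<preformat>
```python
def oov_rate(train_tokens, test_tokens):
    """Return the fraction of test tokens that never occur in the training tokens."""
    seen = set(train_tokens)
    if not test_tokens:
        return 0.0
    unseen = sum(1 for token in test_tokens if token not in seen)
    return unseen / len(test_tokens)


# Toy data (assumed for illustration): inflected Czech forms and their lemmas.
train_forms = ["hrozba", "hrozby", "izrael"]
train_lemmas = ["hrozba", "hrozba", "izrael"]
test_forms = ["hrozeb", "izraeli", "hrozby"]
test_lemmas = ["hrozba", "izrael", "hrozba"]

print(oov_rate(train_forms, test_forms))    # two of the three forms are unseen
print(oov_rate(train_lemmas, test_lemmas))  # every lemma was seen in training
```
</preformat>
      <p>In this toy setting the unseen rate drops once inflected forms are replaced by lemmas, which is the kind of generalization that a model limited to surface forms cannot make.</p>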
      <p>
        Factored translation [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] is an interesting extension
of phrase-based models that aims, among other things, to mitigate this
issue. It allows us to replace an input word with a
vector of features, as exemplified in Figure 1, and to
configure the model to back off to a more coarse-grained
representation of input words if there is not enough
training data. The features on the source side can also
participate in translation. Features on the target side
may be obtained by translation from the source side
or by a generation step. The generation step works with
features already available on the target side and fills
in the remaining ones.
      </p>
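      <p>The factored input format of Figure 1 can be read with a few lines of Python. This is a minimal sketch assuming three factors per token separated by a vertical bar (form, lemma, tag); the names Factors, parse_factored and project are hypothetical helpers and are not part of Moses.</p>
<preformat>
```python
from typing import List, NamedTuple


class Factors(NamedTuple):
    """One token as a three-component feature vector."""
    form: str
    lemma: str
    tag: str


def parse_factored(line: str) -> List[Factors]:
    """Split a line of form|lemma|tag tokens into Factors tuples."""
    tokens = []
    for token in line.split():
        parts = token.split("|")
        if len(parts) != 3:
            raise ValueError(f"expected 3 factors, got {len(parts)}: {token!r}")
        tokens.append(Factors(*parts))
    return tokens


def project(tokens: List[Factors], factor: str) -> List[str]:
    """Keep a single factor, e.g. the lemma, as a coarser view of the sentence."""
    return [getattr(token, factor) for token in tokens]


# English example taken from Figure 1.
line = ("the|the|DT visionaries|visionary|NNS would|would|MD "
        "have|have|VH gotten|get|VVN nowhere|nowhere|RB")
sentence = parse_factored(line)
print(project(sentence, "lemma"))
print(project(sentence, "tag"))
```
</preformat>
      <p>Projecting a sentence onto its lemmas or tags gives the coarser representations that a factored model can fall back on when a surface form is rare.</p>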
      <p>The most common example of employing factored
translation looks as follows. A surface word form is
enriched with its base form (lemma) and
morphological information (a tag for short), forming a
three-component feature vector. Base forms and tags are
translated independently without regard to surface
forms. Then the surface form is generated on the basis
of the translated base form and tag. The setup can use
three language models ensuring coherence of the
output sequence: one for base forms, one for tags and one
for surface forms.
</p>
      <p>To summarize, there are two translation models
(for base forms and for tags), one generation table
to get the surface form, and three language models. This
was the approach we first planned to exploit.
Unfortunately, the setup has a subtle drawback: it does not
work with input forms at all, so it applies the
independent translation of base form and tag even in
cases where there is enough data for direct
translation. Moses allows us to specify multiple decoding paths
(decoding means finding the most probable
translation of a given sentence according to the model), so it
is possible to let the factored path compete with the
direct transfer, exploiting the advantages of both
approaches. That is the approach we used in our
factored experiments.</p>
      <p>Although in the direct translation path used as the
back-off of the factored translation we are not
interested in the target-side lemma and tag, we still have
to supply them for the language models. We use two
distinct setups for constructing the additional output
factors for the direct translation: 1) translating the
source form to all three target factors at once, and
2) translating the source form to the target surface form
and using a generation step for “instant tagging” of
the output to construct the target lemma and tag. We
denote the combination of the main factored
translation with one of the two back-off models as factored1 and
factored2, respectively. Both are illustrated in Figure 3.</p>
      <p>
        We are aware that there is relatively little room
for improvement from factorization in our
language pairs and overall setting. For instance, let us
point out that the generation step for target-side factors
is integrated into Moses, unlike the preprocessing of
input factors, where external tools are used. Naturally,
the generation capabilities of Moses are rather limited:
it learns only from the sentences supplied in training.
Because we train the generation step only on the target
side of the parallel sentences, we cannot expect to gain
much coverage by translating lemmas and tags
independently, because the data will hardly ever provide
the required form that should be generated from the
target lemma and tag. A better approach would be to
either use a larger monolingual corpus for training the
generation step, or use an external morphological
generator such as [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. With the current simple setting, we
can rather expect the improvement to come from the
additional lemma- and tag-based language models, which will
be able to judge hypothesis coherence more robustly.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Evaluation</title>
      <p>We evaluated the output of our systems using
several metrics: BLEU, flagging of errors, and a
simple hypothesis ranking (i.e. asking “which is the best
output”).</p>
      <p>6.1 BLEU</p>
      <p>
        The BLEU score [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] is an established automatic metric
used to evaluate MT systems. Thus, despite all its known
issues, we used it not only for completeness but
also as an integral part of model optimization (see
MERT in Section 4). Nevertheless, let us mention two
major issues of the BLEU score.
      </p>
      <p>First, BLEU cannot be fully reliable when applied to
languages with free word order. BLEU is based
on counting occurrences of n-grams from the reference
translation in the generated output. In many cases the
translator of the reference texts will use a word order
different from the source sentence, whereas the
machine usually preserves the original word order
whenever it is an acceptable variant. However, many
n-grams do not match when words are swapped. Here
is an example of the problem from our test data:</p>
      <p>(reference translation) syrský postoj by dosah íránské
strategie regionální destabilizace nemusel rozšiřovat,
ale spíš omezovat.
(ru→cs translation) postoj sýrie může omezit, nikoliv
rozšířit, sféru vlivu íránské strategie regionální
destabilizace.</p>
      <p>Such shifts made by a translator lead to a lower
(automatic) score while not necessarily impacting the
comprehensibility of the output.</p>
      <p>Second, there is a similar problem with inflection. Word
forms that differ from the reference translation are not
credited by the BLEU score, so minor translation
variations or errors can cause an unfair loss in BLEU
score. However, a partial remedy may be achieved by
scoring lemmatized text:</p>
      <p>(reference translation) složitost hrozeb , jimž čelí
izrael
(ru→cs translation) složitost hrozeb izraeli
(en→cs translation) složitostí hrozby pro izrael</p>
      <p>Table 2 summarizes the BLEU scores obtained by our
various translation setups. For English, all scores are
very close. In contrast, Russian is more sensitive to
the method: factored translation performs slightly
better than simple. Unfortunately, we were unable to
compute factored2 for Russian due to problems with model
optimization. A discussion of the closeness of the simple and
factored results can be found in the last paragraph of
Section 5.</p>
      <table-wrap>
        <label>Table 2</label>
        <table>
          <thead>
            <tr><th>BLEU score on</th><th>pair</th><th>simple</th><th>factored1</th><th>factored2</th></tr>
          </thead>
          <tbody>
            <tr><td>forms</td><td>en→cs</td><td>14.58±0.96</td><td>15.84±1.03</td><td>15.39±1.05</td></tr>
            <tr><td>forms</td><td>ru→cs</td><td>11.91±0.91</td><td>13.11±0.90</td><td /></tr>
            <tr><td>lemmas</td><td>en→cs</td><td>24.16±1.10</td><td>24.77±1.18</td><td>24.99±1.16</td></tr>
            <tr><td>lemmas</td><td>ru→cs</td><td>15.98±0.97</td><td>18.06±0.92</td><td /></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>
        As shown in the previous section, the BLEU metric
does not always reflect translation quality. A more
reliable, though labour-intensive, approach is to
judge MT output manually. In one such evaluation,
inspired by [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], human annotators mark errors in MT
output and classify them according to their nature.
We used the following rough error classes: Bad
Punctuation, Unknown Word, Missing Word, Word
Order, and Incorrect Words, with some classes further
refined into several subtypes. As our annotation
capacity was limited to one person, we present
here the evaluation of the simple model (direct
translation) only.
      </p>
      <p>Table 3 documents that in the case of
English-to-Czech translation, the most common errors concerned
morphology, which matches our expectations, as Czech
is an inflective language and needs to express many
features, such as case and gender, that are often not marked
in the English source. On the other hand, many words were not
recognized in Russian-to-Czech translations. We have
not been able to evaluate the factored translation
according to this scheme, but the first few sentences show
higher accuracy in morphological forms when factored
models are used.</p>
      <p>Finally, we carried out a ranking evaluation which is
very similar to the human judgments in the WMT
Manual Evaluation<sup>6</sup>. For each of the translation schemes
described in Section 4 and Section 5 we took 40
sentences and ranked them on the basis of the question
“which translation is the best”. Thus each MT output of
the 40 test sentences, translated to Czech from both
languages and by all examined setups, got a score
from 1 (worst) to 5 (best). Table 4 summarizes the
evaluation. For each translation setup, we compute the
mean, the median, and the count of how often the method got
the best and the second-best rank.</p>
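      <p>The summary statistics named above (mean, median, and counts of best and second-best ranks) could be computed as in the following sketch. The setup names and scores are made-up placeholders, not our annotation data, and summarize is a hypothetical helper.</p>
<preformat>
```python
from statistics import mean, median


def summarize(scores, best=5, second_best=4):
    """Summarize the 1 (worst) to 5 (best) scores that one setup received."""
    return {
        "mean": mean(scores),
        "median": median(scores),
        "best": sum(1 for s in scores if s == best),
        "second_best": sum(1 for s in scores if s == second_best),
    }


# Placeholder scores for two hypothetical setups over four sentences.
ranking = {
    "setup_a": [5, 4, 5, 3],
    "setup_b": [2, 5, 1, 4],
}
for name, scores in ranking.items():
    print(name, summarize(scores))
```
</preformat>
      <p>Reporting the median and the rank counts next to the mean is useful here because the scores are ordinal, so the mean alone can hide how often a setup was actually preferred.</p>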
      <p>Almost half of the sentences that got the
highest score were factored translations from Russian into
Czech; the second-best result was obtained by sentences
translated using the simple model from Russian into Czech.
The factored model (factored1) from English to Czech was
third. This confirms our expectation that
translating from a related language is easier for
phrase-based MT as well.</p>
      <sec id="sec-3-1">
        <title>6 http://www.statmt.org/wmt08/judge/</title>
        <p>The evaluation allows us to draw further
conclusions. First, enriching the model with additional
morphological information improves the translation
quality for both related and unrelated languages. For
Russian as the source, the improvement seems to
be less apparent, because Russian itself marks most
of the relevant morphological properties in its word
forms. Second, the BLEU score does not necessarily
correspond to manual judgments: while translation from
Russian was better perceived by our human annotator,
it obtained a lower BLEU score than translation from
English<sup>7</sup>. We are aware that the evaluation should be
repeated with more human annotators and on a larger
set of sentences for better confidence.</p>
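        <p>To make the effect of word order on BLEU more tangible, the sketch below computes the clipped n-gram precision that BLEU builds on, for a single hypothesis and a single reference. It is a simplification: full BLEU combines several n-gram orders and applies a brevity penalty, both of which are omitted here. The function names and the toy English sentences are assumptions made for illustration.</p>
<preformat>
```python
from collections import Counter


def ngrams(tokens, n):
    """Return a multiset of all n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(hypothesis, reference, n):
    """Fraction of hypothesis n-grams also found in the reference (counts clipped)."""
    hyp = ngrams(hypothesis, n)
    ref = ngrams(reference, n)
    total = sum(hyp.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in hyp.items())
    return matched / total


# Same words, different order (toy sentences, assumed for illustration).
reference = "the position may limit rather than extend the influence".split()
reordered = "the position may rather than extend limit the influence".split()

for n in (1, 2, 3):
    print(n, clipped_precision(reordered, reference, n))
```
</preformat>
        <p>In this toy pair every unigram matches, while the precision for longer n-grams falls although the two sentences contain exactly the same words. A reordered but acceptable translation is penalized in the same way.</p>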
        <p><sup>7</sup> While BLEU scores are in general not comparable across languages,
they are comparable in our setup: we compute BLEU scores
on a single test set in Czech only; it is the source
language that differs, not the target one.</p>
        <p>6.4 Observation of frequent errors</p>
        <p>As shown in the previous section, many
words are left unrecognized (not translated). This problem
is not of a linguistic nature; it is caused simply by
insufficient training data.</p>
        <p>Here we name some linguistically interpretable
errors.</p>
        <p>– Russian → Czech</p>
        <p>• Lost negation.</p>
        <p>(ru src) без которого было невозможно
создание
(cs ref) bez něhož nebylo možné sestavit
(ru → cs) bez něhož bylo možné vytvoření</p>
        <p>Here we can observe that, due to the
difference in how negation is expressed in the two
languages, the negative sense is translated as
positive.</p>
        <p>• Lost reflexive particle.</p>
        <p>(ru src) сумел уйти от
(cs ref) se zdařilo vyjít z
(ru → cs) podařilo odejít od</p>
        <p>The missing reflexive particle in the Czech
output above is caused by the fact that
some verbs are reflexive in Czech but
non-reflexive in Russian. This is difficult
for a phrase-based MT system to learn because the
reflexive particle is often far away from the
verb in the training sentences.</p>
        <p>– English → Czech</p>
        <p>• Word order in possessive constructions.</p>
        <p>(en src) mahmoud abbas ’s palestinian
authority
(cs ref) palestinskou samosprávou prezidenta
mahmúda abbáse
(en → cs) prezidenta mahmúda abbáse
palestinské samosprávy</p>
        <p>– Both source languages → cs</p>
        <p>• Bad case after a preposition.</p>
        <p>(cs ref) podle indických vyšetřovatelů
(en src) according to indian investigators
(en → cs) podle indické řešitelů
(ru src) согласно индийским экспертам
(ru → cs) podle indickým experti</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
      <p>We have succeeded in our goal to compare the
performance of phrase-based and factored phrase-based
statistical machine translation when translating
between related and unrelated languages. So far we have
failed to take advantage of language relatedness
explicitly in the model, but a preliminary manual
ranking of system outputs confirms that translation
between related languages delivers better results. This
observation contradicts the automatic MT quality
scores obtained with the BLEU metric.</p>
      <p>We are aware of the remaining data sparseness
issue (there are many times more tags for Russian
than for English): while language relatedness
makes the Czech and Russian tagsets similar, many
tags needed for the translation of unseen sentences are
not in our training data. We also suspect that the
training corpus is more closely parallel for the English-Czech pair
than for Russian-Czech, because the Czech text is a direct
translation of the English original, while the Russian text is a
translation of the English, not of the Czech.</p>
      <p>Our second conclusion is that enriching SMT with
morphological features improves translation
quality, especially for the closely related, morphologically
rich Czech and Russian.</p>
      <p>We hope that our results will serve as a good
basis for a future comparison of SMT with the rule-based
approach used in Česílko, which is intended to include the
Russian-Czech translation pair soon. Our experiments
are also a good starting point for further improvements in MT
quality when translating into Czech. For instance, we
plan to improve the morphological generation step by
using larger target-side monolingual training data.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>N.</given-names>
            <surname>Klyueva</surname>
          </string-name>
          and
          <string-name>
            <given-names>O.</given-names>
            <surname>Bojar</surname>
          </string-name>
          :
          <article-title>UMC 0.1: Czech-Russian-English multilingual corpus</article-title>
          .
          <source>Proc. of International Conference Corpus Linguistics., Saint-Petersburg</source>
          ,
          <year>2008</year>
          ,
          <fpage>188</fpage>
          -
          <lpage>195</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>P.</given-names>
            <surname>Koehn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hoang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Birch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Callison-Burch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Federico</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Bertoldi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Cowan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Moran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Dyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Bojar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Constantin</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.</given-names>
            <surname>Herbst</surname>
          </string-name>
          :
          <article-title>Moses: Open source toolkit for statistical machine translation</article-title>
          .
          <source>ACL</source>
          <year>2007</year>
          ,
          <source>Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions</source>
          , Prague, Czech Republic,
          <year>2007</year>
          ,
          <fpage>177</fpage>
          -
          <lpage>180</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>P.</given-names>
            <surname>Homola</surname>
          </string-name>
          and
          <string-name>
            <given-names>V.</given-names>
            <surname>Kuboň</surname>
          </string-name>
          :
          <article-title>A hybrid machine translation system for typologically related languages</article-title>
          .
          <source>Proceedings of the 21st International Florida-Artificial-Intelligence-Research-Society Conference, FLAIRS</source>
          ,
          <year>2008</year>
          ,
          <fpage>227</fpage>
          -
          <lpage>228</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>K.</given-names>
            <surname>Papineni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Roukos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Ward</surname>
          </string-name>
          :
          <article-title>BLEU: a method for automatic evaluation of machine translation</article-title>
          .
          <source>IBM Research Report RC22176(W0109-022)</source>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>H.</given-names>
            <surname>Schmid</surname>
          </string-name>
          :
          <article-title>Probabilistic part-of-speech tagging using decision trees</article-title>
          .
          <source>Proceedings of International Conference on New Methods in Language Processing</source>
          ,
          <year>1994</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>P.</given-names>
            <surname>Koehn</surname>
          </string-name>
          and
          <string-name>
            <given-names>H.</given-names>
            <surname>Hoang</surname>
          </string-name>
          :
          <article-title>Factored translation models</article-title>
          .
          <source>Conference on Empirical Methods in Natural Language Processing (EMNLP)</source>
          ,
          <year>2007</year>
          ,
          <fpage>868</fpage>
          -
          <lpage>876</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>D.</given-names>
            <surname>Vilar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. Fernando</given-names>
            <surname>D'Haro</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Ney</surname>
          </string-name>
          :
          <article-title>Error analysis of statistical machine translation output</article-title>
          .
          <source>LREC-2006: Fifth International Conference on Language Resources and Evaluation. Proceedings</source>
          , Genoa, Italy, 22-28 May
          <year>2006</year>
          ,
          <fpage>697</fpage>
          -
          <lpage>702</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>J.</given-names>
            <surname>Hajič</surname>
          </string-name>
          :
          <article-title>Disambiguation of rich inflection (Computational morphology of Czech)</article-title>
          .
          <source>Nakladatelství Karolinum, ISBN 80-246-0282-2</source>
          , Prague,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>A.</given-names>
            <surname>de Gispert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.B.</given-names>
            <surname>Mariño</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.M.</given-names>
            <surname>Crego</surname>
          </string-name>
          :
          <article-title>Improving statistical machine translation by classifying and generalizing inflected verb forms</article-title>
          .
          <source>Eurospeech</source>
          <year>2005</year>
          , Lisbon, Portugal,
          <year>2005</year>
          ,
          <fpage>3185</fpage>
          -
          <lpage>3188</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>