Named entities from Wikipedia for machine translation⋆

Ondřej Hálek, Rudolf Rosa, Aleš Tamchyna, and Ondřej Bojar

Charles University in Prague, Faculty of Mathematics and Physics
Institute of Formal and Applied Linguistics
ohalek@centrum.cz, rur@seznam.cz, a.tamchyna@gmail.com, bojar@ufal.mff.cuni.cz

⋆ This work has been supported by the grants EuroMatrixPlus (FP7-ICT-2007-3-231720 of the EU and 7E09003 of the Czech Republic), P406/11/1499, and MSM 0021620838.

Abstract. In this paper we present our attempt to improve machine translation of named entities by using Wikipedia. We recognize named entities based on the categories of English Wikipedia articles, extract their potential translations from the corresponding Czech articles and incorporate them into a statistical machine translation system as translation options. Our results show a decrease of translation quality in terms of automatic metrics but positive results from human annotators. We conclude that this approach can lead to many errors in translation and should therefore always be combined with the standard statistical translation model and weighted appropriately.

1 Introduction

Translation of named entities (NEs) is an often overlooked problem of today's machine translation (MT). In particular, most statistical systems do not handle named entities explicitly and simply rely on the model to pick the correct translation. Since most NEs are rare in texts, statistical MT systems are incapable of producing reliable translations for them.

Moreover, many NEs are composed of ordinary words, such as the term "Rice University". In the attempt to output the most likely translation, a statistical system would translate such a collocation word by word.

In this paper, we attempt to address this problem by using Wikipedia¹ to translate NEs and present them already translated to the MT system.

¹ http://en.wikipedia.org/

1.1 Named entity translation task

The set of named entities is unbounded and there are many definitions of named entities. In our project, we work with a vague definition of a named entity as a word or group of words which, when left untranslated, still constitutes a valid translation (despite the fact that a "real" translation, if it exists, is usually better; however, in many cases it does not exist).

Translation of named entities consists of several subtasks. NEs have to be identified in the source text and their translations must be proposed. These then have to be appropriately incorporated into the sentence translation — the sentence context must match the NE and vice versa.

For the English-Czech language pair, matching NEs to the sentence context consists mainly of inflecting the NE words. For example, while "London" translates to Czech as "Londýn", in the context of a more complex NE the name has to be inflected in Czech, such as "London airport" → "Londýnské letiště" (London_adj airport).

Matching the sentence context to the named entity is needed when some information, such as grammatical gender, comes from the NE. For example, Czech verbs in the past tense have different forms for each gender — the verb "came" has to be translated as "přišel" when the subject is masculine, as "přišla" for a feminine and as "přišlo" for a neuter subject. This information needs to be taken into account in translation: "Jeffry came." → "Jeffry přišel."
1.2 Work outline

We experiment with English-to-Czech translation.

Named entity recognition is done in two steps. First, all potential NEs are recognized using a simple recognizer with low precision but high recall. Then, confirmation or rejection of the named entities takes place — if there is an article with the corresponding title in the English Wikipedia, we try to confirm the potential NE as a true NE based on the categories of the article.

The translation of a NE is obtained by looking up the Czech version of the English Wikipedia article about the named entity. Its title is considered the "base translation". Other potential translations (in our case this means simply various inflected forms) are then extracted from the text of the Czech article. Each named entity found in the input text is then replaced with a set of its potential translations, from which the MT system tries to choose the best one.

The matching of the sentence context to the NE is not handled explicitly. We rely on the target-side language model to determine the most appropriate option.

2 Recognition of potential named entities

In our case, the goal of potential NE recognition is to find as many potential NEs as possible (i.e. we favour higher recall at the expense of precision), because the candidates for NEs are still to be confirmed or rejected in the next step. Thanks to the external world knowledge provided by Wikipedia, our task is not a typical NER scenario. NE recognition is not the focal point of our experiment, so we limit ourselves to using two tools for the recognition of potential NEs: our simple named entity recognizer and the Stanford named entity recognizer.

2.1 Simple named entity recognizer

We created a simple rule-based named entity recognizer for selecting phrases suspected to be named entities. It looks for capitalized words and uses a small set of simple rules for the beginnings of sentences — most notably, the first word of a sentence is a potential NE if the following word is capitalized (except for words on a stoplist, such as "A", "From", "To", . . . ). Sequences of adjacent potential NEs are always merged into a single multiword potential NE.
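The rules above amount to only a few lines of code. The following Python sketch is a minimal re-implementation for illustration only (it is not the recognizer used in the experiments); the stoplist shown is a hypothetical fragment and the input is assumed to be a single tokenized sentence.

STOPLIST = {"A", "An", "The", "From", "To", "In", "On"}   # illustrative fragment only

def potential_nes(tokens):
    """Return potential (possibly multiword) NEs as lists of tokens."""
    flags = []
    for i, tok in enumerate(tokens):
        if i == 0:
            # Sentence-initial word: suspect only if the next word is also
            # capitalized and the word itself is not on the stoplist.
            nxt = tokens[1] if len(tokens) > 1 else ""
            flags.append(tok[:1].isupper() and nxt[:1].isupper()
                         and tok not in STOPLIST)
        else:
            flags.append(tok[:1].isupper())
    nes, current = [], []
    for tok, flag in zip(tokens, flags):
        if flag:
            current.append(tok)        # extend the current candidate
        elif current:
            nes.append(current)        # a sequence of capitalized words ends here
            current = []
    if current:
        nes.append(current)
    return nes

# potential_nes("They visited Rice University .".split())
# -> [["Rice", "University"]]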
2.2 Stanford named entity recognizer

The Stanford NER [4] is a well-known tool with a documented accuracy of over 90% when analyzing named entities according to the CoNLL Shared Task [12]. However, this classification does not match our named entity definition, and we also use only a limited recognition model.²

² ner-eng-ie.crf-3-all2008-distsim — a conditional random field model that recognizes 3 NE classes (Location, Person, Organization), trained on unrestricted data and using distributional similarity features.

2.3 Evaluation of named entity recognizers

To evaluate the tools we use an evaluation text consisting of 255 sentences rich in named entities, originally collected for a quiz-based evaluation task [1]. The sentences are quite evenly distributed among four topics — directions, meetings, news and quizzes.

We first performed a human annotation of NEs in the evaluation text, where two annotators marked NEs in the text according to our NE definition. The inter-annotator agreement F-measure³ was only 83%, which sets an upper bound on the value achievable by our automatic recognizers. We then picked one annotation as a standard, against which we compare the outputs of the NE recognition tools.

³ F = 2PR / (P + R), where P stands for precision and R for recall.

To measure the precision of a NE recognizer, we count the NEs on which the tool agrees with the standard annotation and divide this number by the total number of NEs recognized by the tool. Similarly, the recall is measured as the number of NEs confirmed by the standard divided by the number of NEs in the standard.

The performance of the two aforementioned tools measured on the evaluation text is shown in Table 1.

Recognizer     Precision   Recall   F-measure
Simple NER     0.57        0.73     0.64
Stanford NER   0.70        0.49     0.58

Tab. 1. Comparison of NE recognizers.

Our Simple NER has a significantly higher recall than Stanford NER; it is actually capable of delivering most of the named entities. Its low precision is not an issue for our experiment, since in the next step we confirm the named entities using Wikipedia categories. Its F-measure is also higher than that of Stanford NER, suggesting that the Simple NER suits our NE definition better.

Since the Stanford NER results are well documented, we assume that its poor results in our experiment are mainly caused by the different NE definition and the recognition model used — in this setup Stanford NER recognizes only people, locations and organizations, while e.g. named entities from the software class (names of programs, programming language functions etc.) are left out of the recognition.

On the other hand, with Stanford NER we are capable of correctly recognizing complex named entities, and the recall of recognition of named entities at sentence beginnings is higher than that of Simple NER.

3 Confirmation of NEs by Wikipedia

For each potential named entity we try to confirm it as a true named entity using Wikipedia categories.

First we look for an article on the English Wikipedia with a title matching the potential NE. If it does not exist, we reject the potential NE immediately.

We then get the categories of that article. For each category we search for its superior categories (several hard limits had to be introduced, because the categories do not form a tree, not even a DAG; the maximum depth of the search was set to 6).

In the end, the categories found are compared with our hand-made list of named entity categories. If at least one of the article categories or their super-categories is contained in the NE categories list, we confirm the potential NE as a true NE; otherwise we reject it.

The following categories are considered to indicate NEs:

– Places
– People
– Organizations
– Companies
– Software
– Transport Infrastructure

To get the information from Wikipedia we use the Wikimedia API [7]. Figure 1 shows an example of the API response.

http://en.Wikipedia.org/w/api.php?action=query&prop=categories&redirects&clshow=!hidden&format=xml&titles=Rice_University
[XML response omitted]

Fig. 1. Example of an XML response to a request to the Wikimedia API.
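For concreteness, the confirmation step could be implemented along the following lines. This Python sketch is ours, not the authors' code: it assumes the requests library and the JSON output of the present-day Wikimedia API (the paper used the XML format of Figure 1), and NE_CATEGORIES is only a stand-in for the hand-made list of NE-indicating categories, which in reality would contain concrete Wikipedia category names.

import requests

API = "https://en.wikipedia.org/w/api.php"
NE_CATEGORIES = {"People", "Places", "Organizations", "Companies",
                 "Software", "Transport infrastructure"}   # stand-in list

def categories_of(title):
    """Return the non-hidden categories of a page, or None if the page is missing."""
    params = {"action": "query", "prop": "categories", "clshow": "!hidden",
              "cllimit": "max", "redirects": 1, "format": "json", "titles": title}
    pages = requests.get(API, params=params).json()["query"]["pages"]
    page = next(iter(pages.values()))
    if "missing" in page:
        return None
    return [c["title"].split(":", 1)[1] for c in page.get("categories", [])]

def confirm(potential_ne, max_depth=6):
    """Confirm a potential NE by walking up the category graph (depth limit 6)."""
    cats = categories_of(potential_ne)
    if cats is None:                       # no article with this title: reject
        return False
    seen, frontier = set(cats), list(cats)
    for _ in range(max_depth):
        if NE_CATEGORIES & seen:           # an NE-indicating category was reached
            return True
        frontier = [sup for c in frontier
                    for sup in (categories_of("Category:" + c) or [])
                    if sup not in seen]
        seen.update(frontier)
    return bool(NE_CATEGORIES & seen)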
4 Wikipedia translation

For each English Wikipedia article about a NE we check whether there is a corresponding Czech article (this information is provided by Wikipedia in the "Languages" section of the page). If there is one, we use its title as the base translation.

We then try to find all inflected forms of the base translation in the text of the Czech article, to be used as alternative translations.

For each word in the base translation, we trim its last three letters, keeping at least the first three letters intact. This is considered a "stem".

Then, the Czech article is fetched using the Wikimedia API and the wiki markup is stripped. We then search the article text for sequences of words with the same stems. If we find a match, we consider it an inflected form of our base translation and include it in the list of potential translations.

Finally, we estimate the probabilities of the various forms from their counts of occurrences.
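A Python sketch of this procedure is given below. Two simplifications are ours rather than the authors': the article text is assumed to be whitespace-tokenized with wiki markup already stripped, and "sequences of words with the same stems" is interpreted as a case-insensitive prefix match, which is consistent with the Nestlé / "nesprávně" example discussed in Section 6.5. The base title itself is always kept as an option.

from collections import Counter

def stem(word):
    """Trim the last three letters, keeping at least the first three intact."""
    return word[:max(3, len(word) - 3)].lower()

def translation_options(base_translation, czech_article_text):
    """Return {form: probability} for the title and its apparent inflected forms."""
    base_words = base_translation.split()
    stems = [stem(w) for w in base_words]
    tokens = czech_article_text.split()
    counts = Counter({base_translation: 1})    # the title itself is always an option
    n = len(base_words)
    for i in range(len(tokens) - n + 1):
        window = tokens[i:i + n]
        # A word sequence matches if every word starts with the corresponding stem.
        if all(w.lower().startswith(s) for w, s in zip(window, stems)):
            counts[" ".join(window)] += 1
    total = sum(counts.values())               # distribute the whole probability mass
    return {form: c / total for form, c in counts.items()}

Note that the prefix match over-generates by design, which is exactly the behavior behind the suffix-trimming errors analyzed in Section 6.5.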
5 Translation process

In order to utilize the retrieved translation suggestions, we had to find a way of incorporating them as additional translation options for the decoder. This can generally be done in several ways, such as by extending the parallel data, by adding new entries into the translation model (i.e. the phrase table), or by pre-processing the input data.

We use the Moses [6] decoder throughout our experiments. Input pre-processing can be realized fairly easily in Moses via XML markup of the input sentences. It is simple to incorporate alternative translations for sequences of words and even to assign a translation probability to each of the options. The markup of the input data is illustrated in Figure 2.

They moved to London last year. [markup omitted]

Fig. 2. An example of including external translation options using XML markup of the input.
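For concreteness, a Moses input sentence with external translation options attached has roughly the following shape according to the Moses documentation (the tag name is arbitrary, alternatives are separated by ||, and the Czech forms and probabilities shown here are purely illustrative):

They moved to <ne translation="Londýna||Londýn" prob="0.6||0.4">London</ne> last year .

Whether such suggestions are treated as exclusive or inclusive (discussed below) is selected with the decoder's -xml-input switch.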
When scoring hypotheses, Moses uses several translation model scores, namely p(e|f), p(f|e), lex(e|f) and lex(f|e), i.e. translation probabilities in both directions (where f stands for "foreign", English in this case, and e stands for Czech) and lexical weights. The value specified in the markup (or 1 if omitted) replaces all of these scores.

Pre-processing of the input data also has the advantage of not requiring us to retrain or modify existing translation models. Fully trained MT systems can therefore be easily extended to take advantage of our method.

Moses can treat the translation suggestions as either exclusive or inclusive. If set to exclusive, only the options suggested in the input markup are considered as translation candidates. With the inclusive setting, these options are included among the suggestions from the translation model, competing with them for the highest score. Depending on the quality of the translation model and of the external translation suggestions, this setting can either improve or hurt translation performance.

When estimating the probabilities of our translations, we distribute the whole probability mass among them. The scores of translation suggestions provided by the translation model are typically much lower. However, the target language model usually has a significant impact on hypothesis scoring, so even if the external translation scores are set to unrealistically high values, the language model makes the "competition" with the translation model options reasonably fair.

The default settings of common language models, such as SRILM or KenLM, as used in Moses, assign zero log-probability (i.e. a probability of 1) to unknown tokens instead of the intuitive −∞. In most cases, the training data of the language model for the target language also include the target-language part of the translation model parallel data, so this is not an issue. However, our translation suggestions often contain tokens unseen in any data, including some noise introduced by the imperfect suffix-trimming heuristic. Instead of penalizing such options, the language model promoted them, since the unknown words were ignored and therefore did not lower the overall n-gram probability (any known token has a probability < 1, scoring inevitably lower). We were able to solve this problem by setting a very low probability for unknown tokens.

Perhaps a more interesting option would be to add the full texts of the Czech Wikipedia articles to the language model. This would ensure that the translation of the NE is known to the language model, even including some plausible contexts. We leave this for future research.

6 Experimental results

We conducted a series of translation experiments, evaluating various setups of our method. We also carried out a blind manual evaluation, in which the annotators compared the outputs of two MT setups which used our method and of the baseline MT system.

6.1 Data sources

We used CzEng 0.9 [2] as the source of both parallel and monolingual data to train our MT system. CzEng is a richly annotated parallel Czech-English corpus. It contains roughly 8 million parallel sentences from a variety of domains, including European regulations (about 34% of tokens), fiction (15%), news (3%), technical texts (10%) and unofficial movie subtitles (27%). In all our experiments we used 200 thousand parallel sentences for the translation model and 5 million monolingual sentences for the target language model. We also used CzEng as a source of a separate set of 1000 sentences for tuning the model weights and another 1000 sentences for automatic evaluation.

Since manual evaluation would benefit from data rich in named entity occurrences, we used the same set of sentences as in the NER evaluation. These sentences cover quite a wide range of topics, so they seem suitable even for translation evaluation.

6.2 Tools

We used the common pipeline of popular tools for phrase-based statistical MT, namely the Moses decoder and toolkit, the SRILM language modelling tool [11], and GIZA++ [8], an open-source implementation of the IBM models, for obtaining word alignments. KenLM [5] was used instead of SRILM during decoding for its better speed and simplicity.

We used the MERT (Minimum Error Rate Training) [9] algorithm to tune the weights of the log-linear model and BLEU [10] as the de facto standard automatic translation quality metric.

6.3 Automatic evaluation

We evaluated a small subset of possible setups; all results are summarized in Table 2. The main goal of these experiments was to determine which components of our pipeline are actually important for achieving good results.

NEs Suggested     Regular Translations   Unknown NEs   NER        BLEU
Only base forms   Excluded               Preserved     Simple     25.13
Only base forms   Excluded               Translated    Simple     25.38
Only base forms   Included               Translated    Simple     25.80
All forms         Included               Translated    Simple     25.97
All forms         Included               Translated    Stanford   25.98
Baseline                                                          26.62

Tab. 2. BLEU scores of our setups and the baseline system.

We began with a simple scenario, only using the titles of the articles for translation (i.e. inflected occurrences of the title were not available to the decoder) and forcing Moses to use only our suggestions when translating a NE in a sentence.

In the very first case, we also kept unknown named entities in their original form — by an unknown NE we understand an entity for which the corresponding English Wikipedia article exists and its categories imply that it is a named entity, but there is no corresponding Czech article. Since the Czech version of Wikipedia is much smaller, this case occurs quite often.

The BLEU score in these simple scenarios confirms our expectations — in statistical machine translation, forcing or limiting translation possibilities rarely helps. More specifically, by excluding phrase table entries, we forbid the log-linear model from using potentially more adequate translations. The phrase table may well include many variants of a given named entity translation, providing more context and inherent disambiguation. This information should be used and possibly even preferred to a single translation or an enumeration of potential translations suggested by our tools (albeit probabilistically weighted). On the other hand, promoting phrase table entries too eagerly would result in undesirable translations in some cases, for example when a named entity is composed of common words.

It is also not surprising that keeping unknown entities untranslated hurts (automatically estimated) translation performance, as Czech tends to translate most frequent foreign names, and even NEs which are used in their original form are usually inflected in Czech. NEs that would remain completely unchanged are quite rare. Sentences with some NEs left untranslated may be more understandable, even considered better translations in some cases, but the BLEU score is necessarily worse.

When we allowed translation model entries to compete with our suggestions, the score improved further to 25.80. The target language model was apparently able to promote options from the phrase table in spite of their low translation model scores compared to our suggestions (see Section 5).

Our translations could have been inadequate for two main reasons in this scenario:

– Lexically incorrect translation,
– Wrong surface form (only the title translation was used).

Adding the full list of all inflected forms of NEs along with their estimated probabilities improved the translation quality slightly, presumably because the target language model was able to determine which of our suggestions fitted best into the sentence translation. We can therefore conclude that our approach to incorporating named entity translations works successfully — the outputs contained some direct translations of article titles, some inflected forms extracted from the article content and some phrase table entries.

Using the Stanford named entity recognizer brought no further gains. The recognizer marked a different (albeit smaller) set of NEs, but further filtering based on Wikipedia article categories and the absence of many Czech equivalent articles made the difference negligible.

Finally, all our scenarios scored worse than the baseline in terms of BLEU. While we believe that the motivation behind our method is valid, we were not able to avoid some errors in each of the steps that, when combined, resulted in a loss in BLEU score. A detailed analysis of errors is provided in Section 6.5.

On the other hand, we also achieved several notable improvements in translation quality even on the CzEng test set, some of which are shown in Figure 3.

Source:     It was Nova Scotia on Wednesday.
Baseline:   byl_masc to nova scotia ve středu. (NE is left untranslated)
Our setup:  to bylo_neut nové skotsko_neut ve středu. (correct NE translation and gender agreement)

Source:     In August, 1860, they returned to the Victoria Falls.
Baseline:   v srpnu, 1860, se k vyjádření falls. ("Victoria" is left out, "falls" kept untranslated)
Our setup:  v srpnu, 1860, se na viktoriiny vodopády. (correct translation extracted from Wikipedia)

Fig. 3. Examples of translation improvements. "Our setup" denotes the best-performing setup in terms of BLEU.
6.4 Manual evaluation

We had four annotators evaluate 255 sentences rich in named entities, using QuickJudge⁴, which randomized the input. The input sentences contained approximately 400 named entities, but the translations differed only in 78 sentences. QuickJudge automatically skips sentences with identical translations, so the annotators only saw these 78 sentences.

⁴ http://ufal.mff.cuni.cz/euromatrix/quickjudge/

Three setups were evaluated: the "Baseline" unmodified Moses system, and two modifications of that system, "Translate" and "Keep unknown". The system marked as "Translate" corresponds to the best-performing setup, not using Stanford NER. "Keep unknown" is the same system, however unknown NEs are handled differently — if a potential NE is confirmed by Wikipedia but a Czech translation does not exist, it is kept untranslated in the output.

The annotators were presented with the source English sentence and with three translations coming from the three different setups. They then assigned marks 1, 2 and 3 to them. Ties were allowed and only the relative ranking, i.e. not the absolute values, was considered significant.

Table 3 summarizes the results. The values suggest a large number of ties — this is not surprising, since the differences between the systems were small; their outputs often differed only in one word or in the inflection of a named entity.

Annotator   Baseline   Translate   Keep unknown
1           46         56          51
2           38         45          54
3           41         39          47
4           35         43          49

Tab. 3. Number of wins (manual annotation).

We find it promising that our setups won according to all annotators. The inter-annotator agreement was, however, surprisingly low — even though the annotators' preferences match in total, the individual sentences that contributed to the results differ greatly among them. All annotators agreed on a winner in only 25% of the sentences.

Confirming our intuition, annotators usually preferred to keep unknown entities untranslated. The fact that all of the annotators speak English certainly contributed to this result; however, we believe that keeping unknown NEs in their original form is often the best solution, especially in terms of preserved information. Imagine a translation of a guidebook, for example — if an MT system correctly detects NEs and keeps unknown ones untranslated, the result is probably better than if it attempts to translate them. Thanks to the NER enhanced by Wikipedia, our system would produce more informative translations than a standard SMT system, which tends to translate NEs in various undecipherable ways.

6.5 Sources of errors

In order to explain the drop in BLEU in more detail, we examined the translation outputs and attempted to analyze the most common errors made by our best-performing setup.

Incorrect Wikipedia translation. Quite often, the Wikipedia article contained information about a different meaning of the term. When translated to Czech, the difference in meaning became apparent. For example, the default Wikipedia article on "Brussels" discusses the whole "Brussels Region", therefore the Czech translation is "Bruselský region". This word appeared several times in the test data and the default interpretation was wrong in all cases.⁵

⁵ It is however noteworthy that the inflected form of this particular name was always chosen correctly.

Suffix trimming error. Suffix trimming also occasionally matched words or word sequences completely unrelated to the article name. As an example, the name of the company Nestlé matched the word "nesprávně" ("incorrectly") in the Czech article. Because this word is quite common, the language model score ensured that it appeared in the final translation. A similar example was matching "pole" ("field") in the article about Poland ("Polsko" in Czech). We decided to match case-insensitively in order to cover cases of named entities that do not begin with a capital letter in Czech (such as "Gulf War", "válka v Zálivu").
Wrong named entity form. There are two possible causes for an error of this kind — either the Czech article did not contain the inflected form needed in the translation, or the language model failed to enforce the correct option, mainly because the NE contained words unknown to the model (never seen in the monolingual training data).

Since BLEU does not differentiate between a wrong word suffix and a completely incorrect word translation, these errors are equally severe in terms of automatic evaluation.⁶ On the other hand, human annotators consider a mis-inflected (but otherwise correct) translation to be better than a completely untranslated named entity.

⁶ Metrics with paraphrasing (e.g. Meteor [3]) could solve a part of the issue. Another option is to replace all words with their lemmas in both the hypothesis and the reference and use a standard n-gram metric such as BLEU. This would completely ignore errors in word forms, which is inadequate as well and might seem manipulated.

7 Wikipedia translations as a separate phrase table

In order to incorporate the weighting of our translations into MERT, we also used a contrastive setup with an alternative phrase table instead of the XML markup of the input sentences. The decoder was then working with two translation tables — the standard one, generated by GIZA++ from the parallel corpus, and the new one, created by our tools. As shown in Figure 4, there are two scores in our table — the first one is the probability assigned by our tools (based on the number of occurrences of the form in the text of the Czech Wikipedia article) and the second one is the "penalty" for using our NE translation.⁷ It is up to MERT to estimate the weight to assign to our translations.

⁷ This penalty is used in all Moses phrase tables; it is the same for all entries and equals 2.718 ≈ exp(1) = e.

London ||| Londýn ||| 0.4 2.718
London ||| Londýna ||| 0.2 2.718

Fig. 4. Example of phrase table entries.
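Producing the secondary table of Figure 4 from the options extracted in Section 4 is straightforward; the Python sketch below shows the idea. It is an illustration only; the real pipeline, as well as the Moses configuration needed to use the table as an alternative decoding path, is not reproduced here.

import math

def write_ne_phrase_table(ne_translations, path):
    """Write Moses-style entries as in Fig. 4 from a {NE: {form: probability}} mapping."""
    with open(path, "w", encoding="utf-8") as f:
        for source, options in ne_translations.items():
            for target, prob in sorted(options.items(), key=lambda kv: -kv[1]):
                # Two scores per entry: our probability and the constant phrase penalty e.
                f.write(f"{source} ||| {target} ||| {prob:g} {math.e:.3f}\n")

# write_ne_phrase_table({"London": {"Londýn": 0.4, "Londýna": 0.2}}, "ne-table.txt")
# produces exactly the two lines shown in Figure 4.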
7.1 Results

Although the results of this experiment look promising, they have not been fully evaluated yet and are therefore only preliminary. There is an improvement in the BLEU score (see Table 4), but it is not a result of better NE translation. The instability of the MERT process results in different weights for the two translation runs, causing the baseline translation and our experiment outputs to differ significantly in whole sentences, not only in the NE translation. Further analysis and experiments are therefore needed.

There are two results reported in Table 4 because two different versions of the inflector were used to obtain the inflected forms. The "old" one uses all text data from the body of the article (including e.g. external links), while the "new" one looks for the inflected forms only in the text of the article.

NEs Suggested     Regular Translations   Unknown NEs   NER      BLEU
All forms (old)   Included               Translated    Simple   27.11
All forms (new)   Included               Translated    Simple   26.60
Baseline                                                        26.62

Tab. 4. BLEU scores of two setups using the alternative translation table and the baseline system.

8 Conclusion

Our approach of automatically suggesting translations of named entities based on Wikipedia texts leads to a drop in automatic evaluation but to a slight improvement in the manual evaluation of MT quality. Part of this improvement is due to not translating identified entities at all.

While some deficiencies of the proposed method of NE translation can hopefully be mitigated (poor suffix trimming and the search for various forms of target-side NEs), the incorrectness of some Wikipedia translations is not easy to solve. It is therefore questionable whether the named entity translations provided by our system should be used for all named entities, or only for entities not present (or very rare) in the training data.

We described two methods of mixing the newly proposed translations and the default translations of the MT system. We studied the XML-input method more closely and learned that it faces an imbalance in the scoring of hypotheses from the two sources. We also report preliminary results of the other method: alternative decoding paths, allowing the model to choose the best balance automatically. While the automatic scores for the second method increased slightly, the results are not yet stable and further analysis is needed.

In sum, we have shown that Wikipedia can serve as a valuable source of bilingual information and there is open space for incorporating this information into machine translation. However, Wikipedia should not serve as the only source of information, and the extracted information should be confirmed e.g. by analysis of some other monolingual data.

References

1. J. Berka, M. Černý, and O. Bojar: Quiz-based evaluation of machine translation. The Prague Bulletin of Mathematical Linguistics, 95, April 2011, 77–86.
2. O. Bojar and Z. Žabokrtský: CzEng 0.9: large parallel treebank with rich annotation. The Prague Bulletin of Mathematical Linguistics, 92, 2009, 63–83.
3. M. Denkowski and A. Lavie: METEOR-NEXT and the METEOR paraphrase tables: improved evaluation support for five target languages. In Proceedings of the ACL 2010 Joint Workshop on Statistical Machine Translation and Metrics MATR, 2010.
4. J.R. Finkel, T. Grenager, and C.D. Manning: Incorporating non-local information into information extraction systems by Gibbs sampling. In ACL. The Association for Computational Linguistics, 2005.
5. K. Heafield: KenLM: faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, Edinburgh, UK, July 2011. Association for Computational Linguistics.
6. P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, C. Dyer, O. Bojar, A. Constantin, and E. Herbst: Moses: open source toolkit for statistical machine translation. In ACL. The Association for Computational Linguistics, 2007.
7. MediaWiki: MediaWiki, the free wiki engine, 2007. [Online; accessed 23-May-2011].
8. F.J. Och and H. Ney: Improved statistical alignment models. Hong Kong, China, October 2000, 440–447.
9. F.J. Och: Minimum error rate training in statistical machine translation. In ACL, 2003, 160–167.
10. K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu: BLEU: a method for automatic evaluation of machine translation. In ACL, 2002, 311–318.
11. A. Stolcke: SRILM – an extensible language modeling toolkit. June 2002.
12. E.F. Tjong Kim Sang and F. De Meulder: Introduction to the CoNLL-2003 shared task: language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003 – Volume 4, CoNLL '03, pp. 142–147, Stroudsburg, PA, USA, 2003. Association for Computational Linguistics.