Named entities from Wikipedia for machine translation⋆

Ondřej Hálek, Rudolf Rosa, Aleš Tamchyna, and Ondřej Bojar

Charles University in Prague, Faculty of Mathematics and Physics
Institute of Formal and Applied Linguistics
ohalek@centrum.cz, rur@seznam.cz, a.tamchyna@gmail.com, bojar@ufal.mff.cuni.cz

⋆ This work has been supported by the grants EuroMatrixPlus (FP7-ICT-2007-3-231720 of the EU and 7E09003 of the Czech Republic), P406/11/1499, and MSM 0021620838.

Abstract. In this paper we present our attempt to improve machine translation of named entities by using Wikipedia. We recognize named entities based on the categories of English Wikipedia articles, extract their potential translations from the corresponding Czech articles and incorporate them into a statistical machine translation system as translation options. Our results show a decrease of translation quality in terms of automatic metrics but positive results from human annotators. We conclude that this approach can lead to many errors in translation and should therefore always be combined with the standard statistical translation model and weighted appropriately.

1 Introduction

Translation of named entities (NEs) is an often overlooked problem of today's machine translation (MT). In particular, most statistical systems do not handle named entities explicitly and simply rely on the model to pick the correct translation. Since most NEs are rare in texts, statistical MT systems are incapable of producing reliable translations for them.

Moreover, many NEs are composed of ordinary words, such as the term "Rice University". In the attempt to output the most likely translation, a statistical system would translate such a collocation word by word.

In this paper, we attempt to address this problem by using Wikipedia¹ to translate NEs and present them already translated to the MT system.

¹ http://en.wikipedia.org/

1.1 Named entity translation task

The set of named entities is unbounded and there are many definitions of named entities. In our project, we work with a vague definition of a named entity as a word or group of words which, when left untranslated, still constitutes a valid translation (despite the fact that a "real" translation, if it exists, is usually better; however, in many cases it does not exist).

Translation of named entities consists of several subtasks. NEs have to be identified in the source text and their translations must be proposed. These then have to be appropriately incorporated into the sentence translation — the sentence context must match the NE and vice versa.

For the English-Czech language pair, matching NEs to the sentence context consists mainly of inflecting the NE words. For example, while "London" translates to Czech as "Londýn", in the context of a more complex NE the name has to be inflected in Czech, such as "London airport" → "Londýnské letiště" (London_adj airport).

Matching the sentence context to the named entity is needed when some information, such as grammatical gender, comes from the NE. For example, Czech verbs in the past tense have different forms for each gender — the verb "came" has to be translated as "přišel" when the subject is masculine, as "přišla" for a feminine and as "přišlo" for a neuter subject. This information needs to be taken into account in translation: "Jeffry came." → "Jeffry přišel."
1.2 Work outline

We experiment with English-to-Czech translation.

Named entity recognition is done in two steps. First, all potential NEs are recognized using a simple recognizer with low precision but high recall. Then, confirmation or rejection of the named entities takes place — if there is an article with the corresponding title in the English Wikipedia, we try to confirm the potential NE as a true NE based on the categories of the article.

The translation of a NE is obtained by looking up the Czech version of the English Wikipedia article about the named entity. Its title is considered the "base translation". Other potential translations (in our case this means simply various inflected forms) are then extracted from the text of the Czech article. Each named entity found in the input text is then replaced with a set of its potential translations, from which the MT system tries to choose the best one.

The matching of the sentence context to the NE is not handled explicitly. We rely on the target-side language model to determine the most appropriate option.

2 Recognition of potential named entities

In our case, the goal of potential NE recognition is to find as many potential NEs as possible (i.e. we favour higher recall at the expense of precision), because the candidates for NEs are still to be confirmed or rejected in the next step. Thanks to the external world knowledge provided by Wikipedia, our task is not a typical NER scenario. NE recognition is not the focal point of our experiment, so we limit ourselves to using two tools for the recognition of potential NEs: our simple named entity recognizer and the Stanford named entity recognizer.

2.1 Simple named entity recognizer

We created a simple rule-based named entity recognizer for selecting phrases suspected to be named entities. It looks for capitalized words and uses a small set of simple rules for the beginnings of sentences — most notably, the first word of a sentence is a potential NE if the following word is capitalized (except for words on a stoplist, such as "A", "From", "To", . . . ). Sequences of adjacent potential NEs are always merged into a single multiword potential NE.
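The rules above amount to only a few lines of code. The following Python sketch is a minimal re-implementation for illustration only (it is not the recognizer used in the experiments); the stoplist shown is a hypothetical fragment and the input is assumed to be a single tokenized sentence.

STOPLIST = {"A", "An", "The", "From", "To", "In", "On"}   # illustrative fragment only

def potential_nes(tokens):
    """Return potential (possibly multiword) NEs as lists of tokens."""
    flags = []
    for i, tok in enumerate(tokens):
        if i == 0:
            # Sentence-initial word: suspect only if the next word is also
            # capitalized and the word itself is not on the stoplist.
            nxt = tokens[1] if len(tokens) > 1 else ""
            flags.append(tok[:1].isupper() and nxt[:1].isupper()
                         and tok not in STOPLIST)
        else:
            flags.append(tok[:1].isupper())
    nes, current = [], []
    for tok, flag in zip(tokens, flags):
        if flag:
            current.append(tok)        # extend the current candidate
        elif current:
            nes.append(current)        # a sequence of capitalized words ends here
            current = []
    if current:
        nes.append(current)
    return nes

# potential_nes("They visited Rice University .".split())
# -> [["Rice", "University"]]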
2.2 Stanford named entity recognizer

The Stanford NER [4] is a well-known tool with a documented accuracy of over 90% when analyzing named entities according to the CoNLL Shared Task [12]. However, this classification does not match our named entity definition, and we also use only a limited recognition model.²

² ner-eng-ie.crf-3-all2008-distsim — a conditional random field model that recognizes 3 NE classes (Location, Person, Organization), trained on unrestricted data and using distributional similarity features.

2.3 Evaluation of named entity recognizers

To evaluate the tools we use an evaluation text consisting of 255 sentences rich in named entities, originally collected for a quiz-based evaluation task [1]. The sentences are quite evenly distributed among four topics — directions, meetings, news and quizzes.

We first performed a human annotation of NEs in the evaluation text, where two annotators marked NEs in the text according to our NE definition. The inter-annotator agreement F-measure³ was only 83%, which sets an upper bound on the value achievable by our automatic recognizers. We then picked one annotation as a standard, against which we compare the outputs of the NE recognition tools.

³ F = 2PR / (P + R), where P stands for precision and R for recall.

To measure the precision of a NE recognizer, we count the NEs on which the tool agrees with the standard annotation and divide this number by the total number of NEs recognized by the tool. Similarly, the recall is measured as the number of NEs confirmed by the standard divided by the number of NEs in the standard.

The performance of the two aforementioned tools measured on the evaluation text is shown in Table 1.

Recognizer     Precision   Recall   F-measure
Simple NER     0.57        0.73     0.64
Stanford NER   0.70        0.49     0.58

Tab. 1. Comparison of NE recognizers.

Our Simple NER has a significantly higher recall than Stanford NER; it is actually capable of delivering most of the named entities. Its low precision is not an issue for our experiment, since in the next step we confirm the named entities using Wikipedia categories. Its F-measure is also higher than that of Stanford NER, suggesting that the Simple NER suits our NE definition better.

Since the Stanford NER results are well documented, we assume that its poor results in our experiment are mainly caused by the different NE definition and the recognition model used — in this setup Stanford NER recognizes only people, locations and organizations, while e.g. named entities from the software class (names of programs, programming language functions etc.) are left out of the recognition.

On the other hand, with Stanford NER we are capable of correctly recognizing complex named entities, and the recall of recognition of named entities at sentence beginnings is higher than that of Simple NER.

3 Confirmation of NEs by Wikipedia

For each potential named entity we try to confirm it as a true named entity using Wikipedia categories.

First we look for an article on the English Wikipedia with a title matching the potential NE. If it does not exist, we reject the potential NE immediately.

We then get the categories of that article. For each category we search for its superior categories (several hard limits had to be introduced, because the categories do not form a tree, not even a DAG; the maximum depth of the search was set to 6).

In the end, the categories found are compared with our hand-made list of named entity categories. If at least one of the article categories or their super-categories is contained in the NE categories list, we confirm the potential NE as a true NE; otherwise we reject it.

The following categories are considered to indicate NEs:

– Places
– People
– Organizations
– Companies
– Software
– Transport Infrastructure

To get the information from Wikipedia we use the Wikimedia API [7]. Figure 1 shows an example of the API response.

http://en.Wikipedia.org/w/api.php?action=query&prop=categories&redirects&clshow=!hidden&format=xml&titles=Rice_University
[XML response omitted]

Fig. 1. Example of an XML response to a request to the Wikimedia API.
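For concreteness, the confirmation step could be implemented along the following lines. This Python sketch is ours, not the authors' code: it assumes the requests library and the JSON output of the present-day Wikimedia API (the paper used the XML format of Figure 1), and NE_CATEGORIES is only a stand-in for the hand-made list of NE-indicating categories, which in reality would contain concrete Wikipedia category names.

import requests

API = "https://en.wikipedia.org/w/api.php"
NE_CATEGORIES = {"People", "Places", "Organizations", "Companies",
                 "Software", "Transport infrastructure"}   # stand-in list

def categories_of(title):
    """Return the non-hidden categories of a page, or None if the page is missing."""
    params = {"action": "query", "prop": "categories", "clshow": "!hidden",
              "cllimit": "max", "redirects": 1, "format": "json", "titles": title}
    pages = requests.get(API, params=params).json()["query"]["pages"]
    page = next(iter(pages.values()))
    if "missing" in page:
        return None
    return [c["title"].split(":", 1)[1] for c in page.get("categories", [])]

def confirm(potential_ne, max_depth=6):
    """Confirm a potential NE by walking up the category graph (depth limit 6)."""
    cats = categories_of(potential_ne)
    if cats is None:                       # no article with this title: reject
        return False
    seen, frontier = set(cats), list(cats)
    for _ in range(max_depth):
        if NE_CATEGORIES & seen:           # an NE-indicating category was reached
            return True
        frontier = [sup for c in frontier
                    for sup in (categories_of("Category:" + c) or [])
                    if sup not in seen]
        seen.update(frontier)
    return bool(NE_CATEGORIES & seen)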
4 Wikipedia translation

For each English Wikipedia article about a NE we check whether there is a corresponding Czech article (this information is provided by Wikipedia in the "Languages" section of the page). If there is one, we use its title as the base translation.

We then try to find all inflected forms of the base translation in the text of the Czech article, to be used as alternative translations.

For each word in the base translation, we trim its last three letters, keeping at least the first three letters intact. This is considered a "stem".

Then, the Czech article is fetched using the Wikimedia API and the wiki markup is stripped. We then search the article text for sequences of words with the same stems. If we find a match, we consider it an inflected form of our base translation and include it in the list of potential translations.

Finally, we estimate the probabilities of the various forms from their counts of occurrences.
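A Python sketch of this procedure is given below. Two simplifications are ours rather than the authors': the article text is assumed to be whitespace-tokenized with wiki markup already stripped, and "sequences of words with the same stems" is interpreted as a case-insensitive prefix match, which is consistent with the Nestlé / "nesprávně" example discussed in Section 6.5. The base title itself is always kept as an option.

from collections import Counter

def stem(word):
    """Trim the last three letters, keeping at least the first three intact."""
    return word[:max(3, len(word) - 3)].lower()

def translation_options(base_translation, czech_article_text):
    """Return {form: probability} for the title and its apparent inflected forms."""
    base_words = base_translation.split()
    stems = [stem(w) for w in base_words]
    tokens = czech_article_text.split()
    counts = Counter({base_translation: 1})    # the title itself is always an option
    n = len(base_words)
    for i in range(len(tokens) - n + 1):
        window = tokens[i:i + n]
        # A word sequence matches if every word starts with the corresponding stem.
        if all(w.lower().startswith(s) for w, s in zip(window, stems)):
            counts[" ".join(window)] += 1
    total = sum(counts.values())               # distribute the whole probability mass
    return {form: c / total for form, c in counts.items()}

Note that the prefix match over-generates by design, which is exactly the behavior behind the suffix-trimming errors analyzed in Section 6.5.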
5 Translation process

In order to utilize the retrieved translation suggestions, we had to find a way of incorporating them as additional translation options for the decoder. This can generally be done in several ways, such as by extending the parallel data, by adding new entries into the translation model (i.e. the phrase table), or by pre-processing the input data.

We use the Moses [6] decoder throughout our experiments. Input pre-processing can be realized fairly easily in Moses via XML markup of the input sentences. It is simple to incorporate alternative translations for sequences of words and even to assign a translation probability to each of the options. The markup of the input data is illustrated in Figure 2.

They moved to London last year. [markup omitted]

Fig. 2. An example of including external translation options using XML markup of the input.
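For concreteness, a Moses input sentence with external translation options attached has roughly the following shape according to the Moses documentation (the tag name is arbitrary, alternatives are separated by ||, and the Czech forms and probabilities shown here are purely illustrative):

They moved to <ne translation="Londýna||Londýn" prob="0.6||0.4">London</ne> last year .

Whether such suggestions are treated as exclusive or inclusive (discussed below) is selected with the decoder's -xml-input switch.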
When scoring hypotheses, Moses uses several translation model scores, namely p(e|f), p(f|e), lex(e|f) and lex(f|e), i.e. translation probabilities in both directions (where f stands for "foreign", English in this case, and e stands for Czech) and lexical weights. The value specified in the markup (or 1 if omitted) replaces all of these scores.

Pre-processing of the input data also has the advantage of not requiring us to retrain or modify existing translation models. Fully trained MT systems can therefore be easily extended to take advantage of our method.

Moses can treat the translation suggestions as either exclusive or inclusive. If set to exclusive, only the options suggested in the input markup are considered as translation candidates. With the inclusive setting, these options are included among the suggestions from the translation model, competing with them for the highest score. Depending on the quality of the translation model and of the external translation suggestions, this setting can either improve or hurt translation performance.

When estimating the probabilities of our translations, we distribute the whole probability mass among them. The scores of translation suggestions provided by the translation model are typically much lower. However, the target language model usually has a significant impact on hypothesis scoring, so even if the external translation scores are set to unrealistically high values, the language model makes the "competition" with the translation model options reasonably fair.

The default settings of common language models, such as SRILM or KenLM, as used in Moses, assign zero log-probability (i.e. a probability of 1) to unknown tokens instead of the intuitive −∞. In most cases, the training data of the language model for the target language also include the target-language part of the translation model parallel data, so this is not an issue. However, our translation suggestions often contain tokens unseen in any data, including some noise introduced by the imperfect suffix-trimming heuristic. Instead of penalizing such options, the language model promoted them, since the unknown words were ignored and therefore did not lower the overall n-gram probability (any known token has a probability < 1, scoring inevitably lower). We were able to solve this problem by setting a very low probability for unknown tokens.

Perhaps a more interesting option would be to add the full texts of the Czech Wikipedia articles to the language model. This would ensure that the translation of the NE is known to the language model, even including some plausible contexts. We leave this for future research.

6 Experimental results

We conducted a series of translation experiments, evaluating various setups of our method. We also carried out a blind manual evaluation, in which the annotators compared the outputs of two MT setups which used our method and of the baseline MT system.

6.1 Data sources

We used CzEng 0.9 [2] as the source of both parallel and monolingual data to train our MT system. CzEng is a richly annotated parallel Czech-English corpus. It contains roughly 8 million parallel sentences from a variety of domains, including European regulations (about 34% of tokens), fiction (15%), news (3%), technical texts (10%) and unofficial movie subtitles (27%). In all our experiments we used 200 thousand parallel sentences for the translation model and 5 million monolingual sentences for the target language model. We also used CzEng as a source of a separate set of 1000 sentences for tuning the model weights and another 1000 sentences for automatic evaluation.

Since manual evaluation would benefit from data rich in named entity occurrences, we used the same set of sentences as in the NER evaluation. These sentences cover quite a wide range of topics, so they seem suitable even for translation evaluation.

6.2 Tools

We used the common pipeline of popular tools for phrase-based statistical MT, namely the Moses decoder and toolkit, the SRILM language modelling tool [11], and GIZA++ [8], an open-source implementation of the IBM models, for obtaining word alignments. KenLM [5] was used instead of SRILM during decoding for its better speed and simplicity.

We used the MERT (Minimum Error Rate Training) [9] algorithm to tune the weights of the log-linear model and BLEU [10] as the de facto standard automatic translation quality metric.

6.3 Automatic evaluation

We evaluated a small subset of possible setups; all results are summarized in Table 2. The main goal of these experiments was to determine which components of our pipeline are actually important for achieving good results.

NEs Suggested     Regular Translations   Unknown NEs   NER        BLEU
Only base forms   Excluded               Preserved     Simple     25.13
Only base forms   Excluded               Translated    Simple     25.38
Only base forms   Included               Translated    Simple     25.80
All forms         Included               Translated    Simple     25.97
All forms         Included               Translated    Stanford   25.98
Baseline                                                          26.62

Tab. 2. BLEU scores of our setups and the baseline system.

We began with a simple scenario, only using the titles of the articles for translation (i.e. inflected occurrences of the title were not available to the decoder) and forcing Moses to use only our suggestions when translating a NE in a sentence.

In the very first case, we also kept unknown named entities in their original form — by an unknown NE we understand an entity for which the corresponding English Wikipedia article exists and its categories imply that it is a named entity, but there is no corresponding Czech article. Since the Czech version of Wikipedia is much smaller, this case occurs quite often.

The BLEU score in these simple scenarios confirms our expectations — in statistical machine translation, forcing or limiting translation possibilities rarely helps. More specifically, by excluding phrase table entries, we forbid the log-linear model from using potentially more adequate translations. The phrase table may well include many variants of a given named entity translation, providing more context and inherent disambiguation. This information should be used and possibly even preferred to a single translation or an enumeration of potential translations suggested by our tools (albeit probabilistically weighted). On the other hand, promoting phrase table entries too eagerly would result in undesirable translations in some cases, for example when a named entity is composed of common words.

It is also not surprising that keeping unknown entities untranslated hurts (automatically estimated) translation performance, as Czech tends to translate most frequent foreign names, and even NEs which are used in their original form are usually inflected in Czech. NEs that would remain completely unchanged are quite rare. Sentences with some NEs left untranslated may be more understandable, even considered better translations in some cases, but the BLEU score is necessarily worse.

When we allowed translation model entries to compete with our suggestions, the score improved further to 25.80. The target language model was apparently able to promote options from the phrase table in spite of their low translation model scores compared to our suggestions (see Section 5).

Our translations could have been inadequate for two main reasons in this scenario:

– Lexically incorrect translation,
– Wrong surface form (only the title translation was used).

Adding the full list of all inflected forms of NEs along with their estimated probabilities improved the translation quality slightly, presumably because the target language model was able to determine which of our suggestions fitted best into the sentence translation. We can therefore conclude that our approach to incorporating named entity translations works successfully — the outputs contained some direct translations of article titles, some inflected forms extracted from the article content and some phrase table entries.

Using the Stanford named entity recognizer brought no further gains. The recognizer marked a different (albeit smaller) set of NEs, but further filtering based on Wikipedia article categories and the absence of many Czech equivalent articles made the difference negligible.

Finally, all our scenarios scored worse than the baseline in terms of BLEU. While we believe that the motivation behind our method is valid, we were not able to avoid some errors in each of the steps that, when combined, resulted in a loss in BLEU score. A detailed analysis of errors is provided in Section 6.5.

On the other hand, we also achieved several notable improvements in translation quality even on the CzEng test set, some of which are shown in Figure 3.

Source:     It was Nova Scotia on Wednesday.
Baseline:   byl_masc to nova scotia ve středu. (NE is left untranslated)
Our setup:  to bylo_neut nové skotsko_neut ve středu. (correct NE translation and gender agreement)

Source:     In August, 1860, they returned to the Victoria Falls.
Baseline:   v srpnu, 1860, se k vyjádření falls. ("Victoria" is left out, "falls" kept untranslated)
Our setup:  v srpnu, 1860, se na viktoriiny vodopády. (correct translation extracted from Wikipedia)

Fig. 3. Examples of translation improvements. "Our setup" denotes the best-performing setup in terms of BLEU.
6.4 Manual evaluation

We had four annotators evaluate 255 sentences rich in named entities, using QuickJudge⁴, which randomized the input. The input sentences contained approximately 400 named entities, but the translations differed only in 78 sentences. QuickJudge automatically skips sentences with identical translations, so the annotators only saw these 78 sentences.

⁴ http://ufal.mff.cuni.cz/euromatrix/quickjudge/

Three setups were evaluated: the "Baseline" unmodified Moses system, and two modifications of that system, "Translate" and "Keep unknown". The system marked as "Translate" corresponds to the best-performing setup, not using Stanford NER. "Keep unknown" is the same system, however unknown NEs are handled differently — if a potential NE is confirmed by Wikipedia but a Czech translation does not exist, it is kept untranslated in the output.

The annotators were presented with the source English sentence and with three translations coming from the three different setups. They then assigned marks 1, 2 and 3 to them. Ties were allowed and only the relative ranking, i.e. not the absolute values, was considered significant.

Table 3 summarizes the results. The values suggest a large number of ties — this is not surprising, since the differences between the systems were small; their outputs often differed only in one word or in the inflection of a named entity.

Annotator   Baseline   Translate   Keep unknown
1           46         56          51
2           38         45          54
3           41         39          47
4           35         43          49

Tab. 3. Number of wins (manual annotation).

We find it promising that our setups won according to all annotators. The inter-annotator agreement was, however, surprisingly low — even though the annotators' preferences match in total, the individual sentences that contributed to the results differ greatly among them. All annotators agreed on a winner in only 25% of the sentences.

Confirming our intuition, annotators usually preferred to keep unknown entities untranslated. The fact that all of the annotators speak English certainly contributed to this result; however, we believe that keeping unknown NEs in their original form is often the best solution, especially in terms of preserved information. Imagine a translation of a guidebook, for example — if an MT system correctly detects NEs and keeps unknown ones untranslated, the result is probably better than if it attempts to translate them. Thanks to the NER enhanced by Wikipedia, our system would produce more informative translations than a standard SMT system, which tends to translate NEs in various undecipherable ways.

6.5 Sources of errors

In order to explain the drop in BLEU in more detail, we examined the translation outputs and attempted to analyze the most common errors made by our best-performing setup.

Incorrect Wikipedia translation. Quite often, the Wikipedia article contained information about a different meaning of the term. When translated to Czech, the difference in meaning became apparent. For example, the default Wikipedia article on "Brussels" discusses the whole "Brussels Region", therefore the Czech translation is "Bruselský region". This word appeared several times in the test data and the default interpretation was wrong in all cases.⁵

⁵ It is however noteworthy that the inflected form of this particular name was always chosen correctly.

Suffix trimming error. Suffix trimming also occasionally matched words or word sequences completely unrelated to the article name. As an example, the name of the company Nestlé matched the word "nesprávně" ("incorrectly") in the Czech article. Because this word is quite common, the language model score ensured that it appeared in the final translation. A similar example was matching "pole" ("field") in the article about Poland ("Polsko" in Czech). We decided to match case-insensitively in order to cover cases of named entities that do not begin with a capital letter in Czech (such as "Gulf War", "válka v Zálivu").
Wrong named entity form. There are two possible causes for an error of this kind — either the Czech article did not contain the inflected form needed in the translation, or the language model failed to enforce the correct option, mainly because the NE contained words unknown to the model (never seen in the monolingual training data).

Since BLEU does not differentiate between a wrong word suffix and a completely incorrect word translation, these errors are equally severe in terms of automatic evaluation.⁶ On the other hand, human annotators consider a mis-inflected (but otherwise correct) translation to be better than a completely untranslated named entity.

⁶ Metrics with paraphrasing (e.g. Meteor [3]) could solve a part of the issue. Another option is to replace all words with their lemmas in both the hypothesis and the reference and use a standard n-gram metric such as BLEU. This would completely ignore errors in word forms, which is inadequate as well and might seem manipulated.

7 Wikipedia translations as a separate phrase table

In order to incorporate the weighting of our translations into MERT, we also used a contrastive setup with an alternative phrase table instead of the XML markup of the input sentences. The decoder was then working with two translation tables — the standard one, generated by GIZA++ from the parallel corpus, and the new one, created by our tools. As shown in Figure 4, there are two scores in our table — the first one is the probability assigned by our tools (based on the number of occurrences of the form in the text of the Czech Wikipedia article) and the second one is the "penalty" for using our NE translation.⁷ It is up to MERT to estimate the weight to assign to our translations.

⁷ This penalty is used in all Moses phrase tables; it is the same for all entries and equals 2.718 ≈ exp(1) = e.

London ||| Londýn ||| 0.4 2.718
London ||| Londýna ||| 0.2 2.718

Fig. 4. Example of phrase table entries.
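Producing the secondary table of Figure 4 from the options extracted in Section 4 is straightforward; the Python sketch below shows the idea. It is an illustration only; the real pipeline, as well as the Moses configuration needed to use the table as an alternative decoding path, is not reproduced here.

import math

def write_ne_phrase_table(ne_translations, path):
    """Write Moses-style entries as in Fig. 4 from a {NE: {form: probability}} mapping."""
    with open(path, "w", encoding="utf-8") as f:
        for source, options in ne_translations.items():
            for target, prob in sorted(options.items(), key=lambda kv: -kv[1]):
                # Two scores per entry: our probability and the constant phrase penalty e.
                f.write(f"{source} ||| {target} ||| {prob:g} {math.e:.3f}\n")

# write_ne_phrase_table({"London": {"Londýn": 0.4, "Londýna": 0.2}}, "ne-table.txt")
# produces exactly the two lines shown in Figure 4.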
7.1 Results

Although the results of this experiment look promising, they have not been fully evaluated yet and are therefore only preliminary. There is an improvement in the BLEU score (see Table 4), but it is not a result of better NE translation. The instability of the MERT process results in different weights for the two translation runs, causing the baseline translation and our experiment outputs to differ significantly in whole sentences, not only in the NE translation. Further analysis and experiments are therefore needed.

There are two results reported in Table 4 because two different versions of the inflector were used to obtain the inflected forms. The "old" one uses all text data from the body of the article (including e.g. external links), while the "new" one looks for the inflected forms only in the text of the article.

NEs Suggested     Regular Translations   Unknown NEs   NER      BLEU
All forms (old)   Included               Translated    Simple   27.11
All forms (new)   Included               Translated    Simple   26.60
Baseline                                                        26.62

Tab. 4. BLEU scores of two setups using the alternative translation table and the baseline system.

8 Conclusion

Our approach of automatically suggesting translations of named entities based on Wikipedia texts leads to a drop in automatic evaluation but to a slight improvement in the manual evaluation of MT quality. Part of this improvement is due to not translating identified entities at all.

While some deficiencies of the proposed method of NE translation can hopefully be mitigated (poor suffix trimming and the search for various forms of target-side NEs), the incorrectness of some Wikipedia translations is not easy to solve. It is therefore questionable whether the named entity translations provided by our system should be used for all named entities, or only for entities not present (or very rare) in the training data.

We described two methods of mixing the newly proposed translations and the default translations of the MT system. We studied the XML-input method more closely and learned that it faces an imbalance in the scoring of hypotheses from the two sources. We also report preliminary results of the other method: alternative decoding paths, allowing the model to choose the best balance automatically. While the automatic scores for the second method increased slightly, the results are not yet stable and further analysis is needed.

In sum, we have shown that Wikipedia can serve as a valuable source of bilingual information and there is open space for incorporating this information into machine translation. However, Wikipedia should not serve as the only source of information, and the extracted information should be confirmed e.g. by analysis of some other monolingual data.

References

1. J. Berka, M. Černý, and O. Bojar: Quiz-based evaluation of machine translation. The Prague Bulletin of Mathematical Linguistics, 95, April 2011, 77–86.
2. O. Bojar and Z. Žabokrtský: CzEng 0.9: large parallel treebank with rich annotation. The Prague Bulletin of Mathematical Linguistics, 92, 2009, 63–83.
3. M. Denkowski and A. Lavie: METEOR-NEXT and the METEOR paraphrase tables: improved evaluation support for five target languages. In Proceedings of the ACL 2010 Joint Workshop on Statistical Machine Translation and Metrics MATR, 2010.
4. J.R. Finkel, T. Grenager, and C.D. Manning: Incorporating non-local information into information extraction systems by Gibbs sampling. In ACL. The Association for Computational Linguistics, 2005.
5. K. Heafield: KenLM: faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, Edinburgh, UK, July 2011. Association for Computational Linguistics.
6. P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, C. Dyer, O. Bojar, A. Constantin, and E. Herbst: Moses: open source toolkit for statistical machine translation. In ACL. The Association for Computational Linguistics, 2007.
7. MediaWiki: MediaWiki, the free wiki engine, 2007. [Online; accessed 23-May-2011].
8. F.J. Och and H. Ney: Improved statistical alignment models. Hong Kong, China, October 2000, 440–447.
9. F.J. Och: Minimum error rate training in statistical machine translation. In ACL, 2003, 160–167.
10. K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu: BLEU: a method for automatic evaluation of machine translation. In ACL, 2002, 311–318.
11. A. Stolcke: SRILM – an extensible language modeling toolkit. June 2002.
12. E.F. Tjong Kim Sang and F. De Meulder: Introduction to the CoNLL-2003 shared task: language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003 – Volume 4, CoNLL '03, pp. 142–147, Stroudsburg, PA, USA, 2003. Association for Computational Linguistics.