=Paper= {{Paper |id=Vol-1173/CLEF2007wn-adhoc-SchonhofenEt2007 |storemode=property |title=Performing Cross-Language Retrieval with Wikipedia |pdfUrl=https://ceur-ws.org/Vol-1173/CLEF2007wn-adhoc-SchonhofenEt2007.pdf |volume=Vol-1173 |dblpUrl=https://dblp.org/rec/conf/clef/SchonhofenBBC07a }} ==Performing Cross-Language Retrieval with Wikipedia== https://ceur-ws.org/Vol-1173/CLEF2007wn-adhoc-SchonhofenEt2007.pdf

Performing Cross-Language Retrieval with
∗
Wikipedia

Péter Schönhofen András Benczúr István Bíró Károly Csalogány
Data Mining and Web search Research Group, Informatics Laboratory
Computer and Automation Research Institute of the Hungarian Academy of Sciences
{schonhofen, benczur, ibiro, cskaresz}@ilab.sztaki.hu

Abstract
We describe a method which is able to translate queries extended by narrative informa-
tion from one language to another, with help of an appropriate machine readable dic-
tionary and the Wikipedia on-line encyclopedia. Processing occurs in three steps: rst,
we look up possible translations phrase by phrase using both the dictionary and the
cross-lingual links provided by Wikipedia; second, improbable translations, detected
by a simple language model computed over a large corpus of documents written in the
target language, are eliminated; and nally, further ltering is applied by matching
Wikipedia concepts against the query narrative and removing translations not related
to the overall query topic. Experiments performed on the Los Angeles Times 2002
corpus, translating from Hungarian to English showed that while queries generated at
end of the second step were roughly only half as eective as original queries, primarily
due to the limitations of our tools, after the third step precision improved signicantly,
reaching 60% of the native English level.

Categories and Subject Descriptors

H.3.3 [Information Search and Retrieval]: [Query Formulation]

General Terms

Design, Experimentation, Performance

Keywords

Cross-language information retrieval, Wikipedia

1 Introduction
Although the overwhelming majority of documents available on the Internet are written in English
(as of 2002 roughly 72%, according to statistics presented in [23], merely a fraction of users is
procient in this language and as a consequence are most probably unable to formulate eective
queries for traditional search engines. However, because they might be interested in various non-
textual content stored in these documents, like diagrams, images, numeric data, hyperlinks etc.,
there should be a way to translate queries from any language to English. In the paper we introduce
a relatively simple method which, when given a short description of the topic in Hungarian, tries
∗ This work was supported by a Yahoo! Faculty Research Grant and by grants MOLINGV NKFP-2/0024/2005,

NKFP-2004 project Language Miner http://nyelvbanyasz.sztaki.hu.
to nd the English keywords most accurately representing it. Experiments performed on the Los
Angeles Times 2002 news article collection showed that precision of translated queries (computed
for the top 5 retrieved documents) were roughly 60% of the original ones. This value is rather
abismal, but when considering the small size of the English-Hungarian dictionary at our disposal
and also the naive technique applied for raw translation, is in fact acceptible.
Our main goal was not to develop a highly accurate machine translation system, but rather
to prove that employing an ontology mined from Wikipedia, quality of translations can be signif-
icantly improved. The fundamental idea is: if we score candidate English terms based on to how
many other terms they are semantically related and exactly how strongly, we can reliably decide
whether to retain it or not. For instance when encountering word "ég", which could be interpreted
as either "sky" or "burn", and by examining the overall topic suggested by other words we found
that the query is about reghting, we can safely prefer the latter meaning. In addition, because
many Wikipedia articles exist in more than one language, Hungarian and English titles of such
parallel articles can yield term translations not included in a traditional dictionary.
The proposed algorithm consists of four stages. First of all, we pre-process source documents,
carrying out the usual steps, such as stemming and stopword removal; in addition, we extract both
phrase translations and a basic ontology from the English Wikipedia. Next, we generate a raw
translation of the query narrative, utilizing a simple machine readable dictionary (SZTAKI Szótár
[10]) by looking up English versions of Hungarian terms, naively word for word, then we perform
disambiguation where necessary relying on a bigram language model computed over the textual
contents of English Wikipedia articles but of course any large corpus with a wide coverage would
do. In the third stage, resulting terms are aligned with Wikipedia articles (in fact representing
concepts), which in turn are scored primarily according to how many other recognized articles they
refer to. Finally, we construct a query from titles of the few highest ranked Wikipedia articles,
submitting it to a search engine, which retrieves relevant documents.
Let us examine the basic idea behind the second and third step more closely. During raw
translation the dictionary often gives multiple translations for a specic Hungarian words, each
valid, but only one of them is correct in the current context. Context can be represented either by
simply the set of words present inside the same sentence (or paragraph), or by the set of Wikipedia
concepts recognized through them. In the former case, a bigram language model might predict
how "compatible" are words with each other, while in the latter case semantic connections be-
tween concepts (extracted from references between Wikipedia articles, each representing a concept
described by their titles) may help us pinpoint erroneous translations.
The search engine is the Hungarian Academy of Sciences search engine [2] which uses a TF ×
IDF -based ranking; it was slightly modied in order to work with the Los Angeles Times 2002
corpus and to be able to inuence its ranking function in more aspects than was originally possible.
There are four research areas strongly relevant to the proposed algorithm, of course the most
crucial of whose is machine translation (for an overview see for instance [16]). Machine translation
is usually performed in three dierent ways. In the rst approach, source text is converted to some
kind of internal representation, typically with the help of natural language understanding, then
converted to the target language (see [21], among others). In the second, a dictionary is used, and in
case of ambiguous translations, the correct one is selected by determining the precise grammatical
role of the source phrase (e. g. [9]). Finally, in the third, word correlation data collected from
statistical analysis of parallel corpora are utilized to select the most appropriate translation (as
described for example in [3]). Our algorithm adopts a simplistic solution, where neither training
data acquired from parallel corpora, nor part of speech information are exploited; however, it
perfectly ts our declared purpose to demonstrate how Wikipedia can improve translation quality.
Another related eld is language modeling (an excellent overview is provided by [12]), and
applying language modeling to machine translation [4, 6] or information retrieval [7] is not new.
Though we utilize bigrams to help disambiguate English translations, we do not exploit all possi-
bilities inherent in this technology, such as smoothing or larger n-grams, neither do we employ it
as extensively as would be possible, for instance deciding whether a Wikipedia concept is rightly
aligned to an English word sequence based on the similarity between its immediate context in
the query text and content of the corresponding article body or not.
Several researcher utilized ontologies to support disambiguation in machine translation [17], or
as a base for internal representation bridging the source and target languages [20]; [19] provides
an extensive theoretical discussion on this topic. However, due to a dearth of ontologies which
have a suciently wide coverage and at the same time are available in multiple languages, these
methods typically construct their own ontologies through some machine learning technique run
over parallel corpora. Though the idea of taking advantage of Wikipedia has already emerged,
either as an ontology [8] or as a parallel corpus [1], to our best knowledge, so far it has not been
used for dictionary construction or to improve translation accuracy.
The last related area is cross-language information retrieval itself (see for instance [22] for an
overview). There are several papers dealing with the various problematic points of the task, like
acquisition of the dictionary [14], mapping terms (or concepts) from source to target languages [11,
18], or disambiguation between multiple possible translations [15]. Though our approach contains
a rather simple disambiguation strategy, the main focus is on the improvement of translation
accuracy via the discovery of semantic relations between recognized concepts, which is (at least
presently) a quite unexplored research direction.

2 Preparations
Several preparational steps are necessary before we can carry out actual query translation. First
of all, we have to construct a machine readable dictionary which will provide us possible English
variants of individual Hungarian words or phrases. Next, a concept network should be extracted
from Wikipedia, which will help us discover semantic relationships between candidate English
terms, and therefore decide whether they are compatible with the query topic or not. Finally, texts
of queries and documents themselves which have to be searched are pre-processed in traditional
fashion, to facilitate ecient matching of its contents against dictionary entries. The next three
subsection describe each above mentioned activity in detail.

2.1 Constructing the dictionary

It goes without saying that the most crucial component for every translation program is a machine
readable dictionary, which ideally is accurate, covers a wide range of subjects, and in case of words
with many possible translations also contains information about contexts characteristic to each
translation. We took advantage of the collaboratively edited SZTAKI Szótár dictionary, a simple
textual database of English-Hungarian word pairs behind the popular and freely available on-line
service of the same name. It has two components: an ocial and an unocial sub-dictionary
(the main dierence between them is that the former is more closely supervised), comprising of
roughly 131,000 and 53,000 entries, respectively. Its structure is fairly straightforward, as can be
seen from the short extracted fragment below:

aeronautical
repülési

aeronomy
fels®légkör tudománya

aeroplane
repül®gép

aesthete
esztéta

Obviously, the rst line of each entry contains an English term, the second its Hungarian
variant, while the third remains empty, serving as a very easy to detect delimiter. Unfortunately,
some entries specify more than one English or Hungarian terms (typically for words having an
irregular plural form, like "adman") separated by commas or semicolons; certain entries refer
to idioms (for instance "blow high, blow low"), often complete with punctuations; and a small
number of entries represent prexes or postxes, not regular phrases (e. g. "-able", "thermo-"
etc). While these special dictionary items usually do not facilitate better translation accuracy,
they do make our pre-processing mechanism more complex.
So when going through the original dictionary entries, we remove text inside parentheses or
after a slash; discard question marks, exclamation marks and double quotes; ignore prex or sux
references (characterized by an initial or ending hyphen); in addition, English stopwords and
dictionary abbreviations (like "sb" representing a person object or subject) are deleted. When
an entry contains multiple terms, it is exploded to as many entries as necessary to represent all
possible English-Hungarian term pairs. For example, a compound translation from "A, B" to "C,
D" will yield four entries, namely A to C , A to D, B to C and B to D.
Note that at this point we do not perform stemming neither on English, nor on Hungarian
words the former will be carried out in a subsequent processing step, and the latter is not strictly
required, because terms already appear in their base forms in the dictionary. However, English
words not present in any Wikipedia article bodies are deleted, partly as they are totally useless
when we attach Wikipedia concepts to translated queries, partly since probably they are misspelled
or extremely infrequent in common English, therefore it is very unlikely that submitting them as
a query to a search engine would retrieve even a small number of documents.
Compared to professional dictionaries SZTAKI Szótár is relatively small, thus it was perfectly
natural that we attempted to extend it by all available means. Luckily, articles in the English
Wikipedia contain links to articles on the same subject in the Hungarian version (if present), so
titles connected in this fashion can be considered valid translation pairs and added to the basic
dictionary. However, there are at least two problems with the above mentioned approach: First,
the number of articles in Hungarian Wikipedia is rather low, around 60,000 as of early 2007 (less
than one tenth of the size of its English counterpart), a signicant percentage of which is specic
to the Hungarian culture, and consequently is not part of English Wikipedia. Second, Wikipedia
covers mostly technical terms, geographical locations, historical events and persons, phrases either
not requiring translation or rarely occurring inside common texts.
We converted cross-language references extracted from Wikipedia to dictionary entries in the
following way. Similarly to the processing performed on SZTAKI Szótár, text inside parentheses or
written after a comma, semicolon, slash is deleted, both in Hungarian and English titles. In addi-
tion, English titles are stemmed and from them stopwords (using a list originally developed for the
Smart search engine) are removed. Note that now there is no need to break up English-Hungarian
pairs, because overwhelming majority of Wikipedia articles discuss only a single concept, the rest
being disambiguation pages, where the various concepts are represented by the same term, for
example "Java" or "acid test", and either are translated to the same Hungarian term, or do not
have a corresponding article in Hungarian Wikipedia.
After dictionary entries have been collected from both SZTAKI Szótár and Wikipedia, we merge
them to produce the machine readable dictionary utilized during query translation later. As we
consider entries derived from Wikipedia less reliable than those extracted from SZTAKI Szótár
(which is perfectly understandable since Wikipedia originally was not intended to be employed as
a dictionary), we ignore Wikipedia pairs already present among SZTAKI Szótár entries even if
they would provide additional English translations for an already known Hungarian term. It is
now that Hungarian titles are stemmed and stopwords are removed (with the help of a manually
constructed stopword list), making Hungarian terms of several dictionary entries the same (e. g.
"futott" and "futó" are both converted to "fut"), which therefore should be merged. The resulting
dictionary contains 100,510 Hungarian terms, in average composed of 1.6211 words and referring
to 2.0695 English terms, each one representing a dierent possible sense.

2.2 Converting Wikipedia to a concept network

In its original form, Wikipedia consists of articles of various types. Regular articles have a title,
body, enumeration of Wikipedia categories they pertain to, and possibly links to the same article
Figure 1: Fragment of a Wikipedia article.

written in other languages. Article bodies might contain supplementary tables, images, messages
for contributors (such as that the article has to be reviewed or cannot be modied temporarily
due to recent acts of vandalism), references to related Wikipedia articles, hyperlinks to external
websites. Longer articles are usually partitioned into sections, even subsections, and consequently
after a short introduction, they typically embed a table of contents. Figure 1 shows a sample.
Besides regular articles, there are two additional article types redirects and category pages.
Redirects are merely "placeholders" for alternate titles of a given article (e. g. "Eugene Paul
Wigner" to "Eugene Wigner"), they have only a title without any body, category assignment list
or links to versions in other languages. As their name suggests, when a user retrieves them through
the ocial Wikipedia web interface, he or she immediately gets redirected to the appropriate
regular article. Category pages describe Wikipedia categories by giving a brief overview about
their exact meaning, enumerating their sub- and supercategories (if exist), nally listing all pages
pertaining to them. Note that although categories are organized into a hierarchy, unfortunately
it is not a tree, because categories might have multiple supercategories.
Of course, in its original form, Wikipedia cannot be used to help us recognize concepts in
the English queries, then detect and discard those which are not consistent with the dominant
query subject, since they have been inserted due to some erroneous judgment made by the raw
translator. In fact we need only a concept network, or more precisely, an undirected graph where
nodes represent concepts and edges the various relationships between them: concept names can
be easily extracted from Wikipedia article titles, and if an article refers to another article, it is a
fairly reliable indicator that they are semantically connected. As opposed to WordNet, OpenCyc
and other proper ontologies, here we do not record the type of relations, because this information
is not present in the Wikipedia corpus in explicit form (however, several researchers worked out
techniques to rectify this omission, for instance [27]).
Figure 2: Fragment of a Wikipedia category page.

To carry out the required transformations, rst of all we convert bodies of regular Wikipedia
articles from Wiki markup to XML, omitting embedded images, tables of contents, footnotes and
other supplementary elements not strictly pertaining to the core textual content. Category pages
are deleted, redirects are regarded as additional titles of articles they point to. Disambiguation
pages, which list the dierent possible interpretations of the same term, are broken up to smaller
fragments bearing the same title, otherwise our algorithm would falsely believe that articles ref-
erenced from the page all belong to a single domain, derailing the second stage of translation.
Next, titles of redirect articles are added as additional titles to the target articles, and references
between articles are collected from article bodies. We apply the same pre-processing to Wikipedia
article titles to which English terms inside the dictionary were subjected, in order to make future
matching between query terms and Wikipedia concepts as accurate as possible. Namely we remove
text between parentheses or after a comma, slash, semicolon; replace letters outside the Latin
alphabet with their Latin equivalent if possible (for instance "é" with "e"); delete stopwords; and
nally do stemming using TreeTagger [25].
Before we would discard Wikipedia article bodies, we split them to paragraphs and compute
occurrence counts of bigrams (it goes without saying that bigrams cannot span paragraph bound-
aries), so that probability of word w followed by word v can be estimated with the formula:
Cw,v
Pw,v = P (v|w) = (1)
fw
Here Cw,v denotes the number of occurrences of w where it was followed by v , and fw shows the
frequency of w both calculated over the entire Wikipedia corpus. Of course, this language model
could have been derived from any suciently large corpus; we used Wikipedia simply because it
has been already converted to a format on which bigram statistics was easy to generate.
Table 1: English terms attached to query title "Kémiai Nobel-díj".

Text position Hungarian word Attached translations
1 kémia chemistry
2 nobel nobel; nobel award, nobel price, nobel prise, nobel prise,
mathematics, noble prise, nobelprize, nobelprizeorg,
nobelist
3 díj award, blue ribbon, charge, due, premium, prize,
remuneration, stake, trophy; nobel award, nobel price,
nobel prise, nobel prise mathematics, noble prise,
nobelprize, nobelprizeorg, nobelist

2.3 Pre-processing queries and documents

Queries for which relevant documents had to be found in the CLEF Ad-hoc task contain three
parts: title, description and narrative. While titles consist of only a few words, descriptions
are whole sentences and narratives are usually even longer, sometimes being composed from two
or three sentences. Though we will search in the document collection utilizing only the words
captured from query titles, we translate all three query parts, otherwise there would not be a
sucient amount of context (enough recognized concept) for reliably detect query subjects. As a
consequence, pre-processing of course should extend to the whole query, not just to their titles.
The pre-processing steps applied to queries do not contain any novel elements, we employ
almost the same technique as for Wikipedia titles. More precisely, we perform stemming, remove
stopwords (again relying on a slightly modied stopword list originally developed for the Smart
search engine), if possible, convert accented characters to their equivalent in the Latin alphabet,
and in addition split both query descriptions and narratives to paragraphs. Words not present
in Wikipedia article bodies are ignored, because it is guaranteed that they cannot be used to
recognize Wikipedia concepts and so improve translation quality.
For documents in the Los Angeles Times 2002 corpus, from which we want later to retrieve
items relevant to queries, exactly the same steps are carried out, otherwise alignment between
query and document content would not be as precise as would be possible.

3 Raw translation
As was already mentioned in the introduction of this paper, translation is carried out in two
stages: rst, we make a raw translation relying only on the machine readable dictionary built
from SZTAKI Szótár and Wikipedia cross-language links; second, we rene the raw translation
by removing terms whose meaning is not compatible with the detected subject of our query. Raw
translation again consists of two stages after we gathered possible English versions of Hungarian
words, we select those seeming the most probable based on the bigram language model computed
on Wikipedia article bodies, as has been described in the previous section.
So we start by examining word sequences present in the Hungarian query which we want
to translate; sequences obviously have to be continuous, cannot cross paragraph or query part
boundaries, and in addition their length cannot exceed 5 words. If a word sequence exactly
matches some Hungarian term in our machine readable vocabulary, we attach the corresponding
English term (or maybe terms) to text positions occupied by the Hungarian term. For instance,
encountering "Kémiai Nobel-díj" (Nobel Prize in chemistry) as title of query No. 448 yields
would yield the English terms shown in Table 1. Note that because "nobel díj" was correctly
recognized as a multiword Hungarian phrase, its various English translations (the stranger ones,
like "nobelprize" coming from Wikipedia) were registered for text positions 2 and 3 as they are
competing with translations assigned to individual words present at the same places.
As we can see from this example, the dictionary typically gives a relatively large number of
Table 2: Scores assigned to terms in query title "Kémiai Nobel-díj".

Term Score Term Score
0.2500 chemistry 0.0014 nobel prise mathematics
0.2500 nobelprizeorg 0.0012 premium
0.0314 award 0.0009 trophy
0.0157 nobel prise 0.0005 prize
0.0157 noble prise 0.0004 blue ribbon
0.0041 nobel 0.0002 stake
0.0026 charge 0.0000 awards
0.0020 due 0.0000 nobelprize
0.0020 nobel award 0.0000 nobelist
0.0020 nobel price 0.0000 remuneration

possible translations, whose majority is evidently wrong, and can be easily ltered out if we are able
to detect that they are not present inside their typical context. Fortunately, language modeling
provide a simple yet very ecient tool to help us in this respect, namely bigram statistics. The
Pw,v values computed during pre-processing species the probability that word w is followed by
word v , based on the analysis of a large corpus (roughly 1,600,000 Wikipedia article bodies). Let us
now assume that query text contains words A and B , which can be translated to A01 or A02 , and B10
or B20 , respectively. If probability of A01 and B20 appearing as a bigram is signicantly higher than
those of the other three combinations, of course they should be chosen as the most probably valid
translation. However, since terms being immediately near each other in the Hungarian version
may be far apart in the corresponding English sentence, we have to tread carefully.
Our technique is the following. We pair each English terms present in the same sentence with
each other in every imaginable combination (but naturally a term is never aligned with itself),
and assign scores to them computed according to the formula:

(2)

St = max max Ptlast ,uf irst , Ptf irst ,ulast
u∈S∧u6=t

where tf irst denotes the rst word of term t, while tlast stands for its last word. To put it more
intuitively, we characterize term t with the P value corresponding to the most probable bigram
it participates in. This approach works well here since sentences are fairly short, obviously, if we
would be dealing with longer ones, we should specify a maximal distance between t and u, otherwise
the greatly increased amount of examined term pairs would surely introduce an unacceptably
strong noise. Terms in the query title discussed above receive the scores shown in Table 2. As can
be seen, highest ranked translations were "chemistry" and "nobelprizeorg" (derived from the web
address nobelprize.org), while "awards", "nobelprize", "nobelist", "remuneration" did not form
any meaningful pair with other terms, at least not according to Wikipedia articles.
After candidate English translations have been retrieved and scored, we can go through all text
locations examining the terms attached to them, retaining only the highest ranked ones. It should
be stressed that since several terms may get exactly the same score, ambiguous translations will
be not necessarily resolved at this time (our goal now is merely to produce a raw translation).
Finally, let us answer the question what happens if a Hungarian words is not in the dictionary,
since it is a proper name, technical term or a particular form of a verb or noun which was not
properly reduced by the Hungarian stemmer [13]. There are two cases. If the word occurs in
Wikipedia article bodies, it does not need any translation (the probability of a Hungarian word
accidentally being the same to an English word, letter by letter, is extremely low). However, if it
does not, we ignore it, as documents in the collection we will search using the current query will
surely not contain it remember that during pre-processing (see Section 2.3) we deleted words
from documents which we did not encountered in Wikipedia.
1. recognize Wikipedia article titles in raw translation

2. retrieve Wikipedia articles (or concepts) behind the recognized titles

3. remove concepts not connected to any other concept

4. score concepts using the formula:
1
Lc × 1+M c
× Fc

5. score words
P present in Wikipedia article titles (corresponding to concepts) using the formula:
Sw = l∈Pw ∧l∈Qc Sc

6. compose new query text

Figure 3: Outline of proposed algorithm.

4 Improving translation quality with Wikipedia
The quality of raw translation of a query title is often too low to directly submit it to the search
engine for two main reasons: (1) there are still words whose English variants was not selected
unambiguously, because their score was equal; and (2) it might happen that the concept most
appropriate for retrieval is not present among concepts recognized in the query title, but rather it
can be found among those connected to the query description or narrative. Thus we rst map words
and word sequences inside the English translation to Wikipedia concepts; score these concepts
mainly based on how strongly they are related semantically to other concepts also recognized
in the query; nally compose a new query extracting words from concepts having the highest
scores. Outline of our algorithm is shown in Figure 3. It should be emphasized, however, that
introducing mapping has also a few drawbacks besides its many advantages Wikipedia deals
primarily with complex concepts, basic nouns and verbs (e. g. "read", "day") are missing from
it; and an additional conversion layer naturally introduces more noise into the processing.
Let us see now the various steps in more detail. In the rst step, we look for Wikipedia article
titles (representing concepts) exactly matching a word sequences in the English translation, again
requiring that these sequences be continuous and do not span paragraph boundaries. Because
CLEF-2007 queries rarely mention multiword terms and if they do, the terms are mostly phrases
from everyday English which are present in the machine readable dictionary , Wikipedia con-
cepts typically do not encompass more than one term. However, it may happen that due to the
limitations of the dictionary a name of some person (for instance Richard Nixon) or company (e.
g. Nestlé Foods) is translated in a word by word fashion, while Wikipedia contains an appropriate
article. So we should allow that Wikipedia article titles cover many terms, but at the same time
also demand that these terms immediately follow each other. This way "John Smith" will be
correctly recognized as John Smith, but our algorithm will not retrieve it for "John met Smith".
Next, we collect actual Wikipedia articles, in other words concepts, behind the Wikipedia
article titles discovered previously. Note that as a consequence of redirects, stemming, stopword
removal and deletion of auxiliary descriptors (between parentheses or after a comma, slash, semi-
colon), a given title often points to several concepts for example "power" to "Political Power",
"Power (electricity)", "Power (physics)" etc). So besides ambiguity originating from raw transla-
tion, we are now saddled with ambiguity arising from title-concept mapping. Still, results discussed
in Section 6 will prove that the scoring formula soon introduced is able to relatively eectively
select the best candidate, no matter how high their number is.
In the third step, for each candidate concept c we determine the Rc set of other candidate
concepts related to it, now not necessarily from the same paragraph. Remember that two concepts
are considered semantically related if the body of either one contains a hyperlink reference to the
Table 3: Scores assigned to concepts mined from query No. 448.

Score Concept
3.3333 Nobel Prize
3.0000 Ludvig Nobel
3.0000 Alfred Nobel
3.0000 Academia
2.0000 Chemistry
1.6667 Nobel Prize in Literature
1.0000 Fundraising
1.0000 Work (thermodynamics)
1.0000 Employment
0.6667 Award 1002
0.4000 Miguel de la Espriella - "Noble"

other. However, if the two concepts have the same title, they are alternative (and competing)
interpretations of the same word sequence, and obviously cannot support each other during later
analysis, so this sort of connection is ignored. Concepts without any connections are discarded, as
they probably do not pertain to the query subject. Still, this is not necessarily true in the reverse
direction: just because a concept has at least one connection does not automatically mean that it
is about the query topic, as it may be easily related to a similarly inappropriate concept.
In the fourth step, we assign scores to candidate concepts according to the following simple
formula to estimate their relevance to the query subject:
1
Sc = Lc × × Fc (3)
1 + Mc
In order to simplify explanation of the above calculation, lets introduce the term "c is attached
to text position p", meaning that p is occupied by a word being part of a word sequence through
which c has been recognized in the rst step. Lc denotes the number of text locations to which
concepts related to c are attached; Mc is the number of concepts c competes with (some concept is
a competitor if it is attached to at least one text location to which c is also attached); and nally
Fc species the amount of text locations to which c is attached.
To put it more informally, Lc measures how strongly concept c correlates to the whole query
description the reason we deal with text positions and not with candidate concepts is that we
would like to avoid Wikipedia titles attracting a high number of concepts to gain undeservedly large
inuence over the disambiguation process. Mc estimates reliability of c: if c has no competitors
(the words it covers do not have alternative interpretations), it is almost surely correct, but if it
conicts with many other candidates, the probability that it will be chosen as the best translation
is lower. Fc corresponds to the popular and widely used TF term frequency value, and thus
emphasizes (or suppresses) the relevance of c not in relation to its competitors, like Lc , but rather
to concepts attached to dierent word sequences. Table 3 shows scores of concepts attached to
the content of query No. 448 about the Nobel Prize in Chemistry.
After we ranked candidate concepts, we are ready to map them back to English words in step
ve. However, instead of simply choosing the few highest ranked concepts and pasting their titles
one after the other to form a new query, we follow a slightly more sophisticated approach, and
introduce another layer of decision by scoring words according to the formula below:
X
Sw = Sc (4)
l∈Pw ∧l∈Qc

where Pw represents the set of text locations occupied by word w, and similarly, Qc stands for
the set of text locations to which concept c is attached. Thus in the above formula we add up scores
of concepts recognized through titles containing w, as many times as the recognition took place,
Table 4: Scores of words occurring in query No. 448.

Score Word
31.3333 nobel
13.3333 prise
7.0667 noble
4.0000 chemistry
3.3333 mathematics
3.0000 scholarly
3.0000 academic
2.0000 work
1.0000 contribution
0.6667 award

so more frequent words will automatically receive higher scores. Although members of multiword
terms might get dierent scores, since they do not necessarily serve as base for exactly the same
concepts throughout the query text, this is not always a bad thing, because more signicant words
(like "programming" in "programming language", assuming a query about computers) will be
emphasized. Table 4 shows scores of words, assuming concepts scores shown in Table 3.
The query to be submitted for the search engine is formed in the nal, sixth step. Remember
that although we translated and analyzed the entire query (its title, description and narrative), we
intend to select only a few words, since unfortunately our search engine presently do not cope well
with long queries: individual query words cannot be supplemented with weights indicating their
relative importance, thus insignicant words can easily distort the hit list. The applied selection
heuristic is fairly simple: we retain words which occur in raw translation of the query title or
have scores putting them among the top 3 words. In addition, words which appear in the original
Hungarian query title but were not translated (since they were present neither in the machine
readable dictionary nor in Wikipedia) are also kept.

5 Search engine
We use the Hungarian Academy of Sciences search engine [2] as our information retrieval system.
Its ranking algorithm was developed originally for HTML pages; it uses a TF × IDF -based rank-
ing with increased weights for query words within URLs, anchor texts, the title and additional
signicant HTML elements. More precisely, the relevance score is computed from the weighted
combination of the following factors:

• traditional TF × IDF -based ranking which takes into account how close are the query terms
to each other in documents [24, 5], utilizes document length normalization [26], and places
stronger emphasis on query terms found in the document title than those appearing in the
document body.

• Number of dierent query terms in the document.

• The coverage of the rst and the last query term occurrence. This is the proportion of the
document between the rst and last query term. It is almost 1 if the document contains
query terms at the beginning and at the end, and 1 / size if there is only one query term.

In contrast to the original version, here we do not require that a document contain all the
query words in order to include it in the resulting hit list.
This ranking method had several parameters: weights of the TF × IDF score and the number
of query words and the query term coverage, weights of the title and body. After several runs with
dierent settings, we have found that we get the best result if the weight of the number of query
terms is much higher than the TF × IDF score. In other words, we rank documents with respect
to the number of query terms found inside their text, then use the TF × IDF -based measurement
to dierentiate between documents carrying the same number of query terms.

6 Results
After the detailed description of our proposed algorithm let us see how well it performed in practice.
For the experiments, we used the Wikipedia snapshot taken in August of 2006; as the result of
various pre-processing procedures, it contained 1,728,641 concepts reachable through 2,184,226
dierent titles. The target corpus, the Los Angeles Times 2002 collection comprised of 86,002
news articles, in which documents relevant to 50 queries had to be retrieved. Each query was
submitted to our search engine in three dierent versions:

• using words in the ocial English translation of query titles;

• using words in the raw translation of original Hungarian query titles;

• using words selected by our algorithm.

The rst batch of queries served as a baseline; the second helped us estimate how well (or badly)
raw translation worked; and the last demonstrated how much improvement could be achieved by
the concept mapping and remapping utilizing Wikipedia. Table 5 shows exact texts of queries in
the three batches in order to keep the presentation as possible, in cases where a query contained
more than four words, only the rst four is shown, rest is represented by an ellipsis. As we can see,
even the raw translation managed to provide a perfect translation for several queries, however, in
the majority of cases it made mistakes, which can be classied in ve main groups:

• translated words are the same but are in a dierent grammatical form (see for example
"human cloning" vs. "human clone" in query No. 408);

• translated words are synonyms of the original ones (e. g. "Australian" vs. the more informal
"Aussie" in the immediately preceding query, No. 407);

• bigram statistics derived from the Wikipedia corpus was not enough to properly disambiguate
possible translations (see query No. 433);

• the Hungarian stemmer failed to determine the root form of a word (as happened in the case
of "drogoz", "drug" in English, in query No. 415);

• the dictionary failed to provide any translation for a given word (see for instance "weapon"
inside query No. 410, which should have been recognized indirectly from "atomsorompó").

Translations aligned with Wikipedia concepts are usually more verbose than raw translations,
and frequently introduce synonyms (like in query No. 412) or sometimes even downright strange
words (such as in query No. 414), probably because the query text does not provide a suciently
large textual body to reliably determine the dominant topic. However, these query versions often
bring back important keywords lost during raw translation see for instance "cancer" in query
No. 438, or "priest" in query No. 433). As a result, though precision at 5 retrieved documents
were only 14.66% for raw translations, a fraction of the 33.31% observed when using ocial
English translations, Wikipedia post-processing managed to increase precision to 16.45% (the
values reect the search engine setting which yielded the best accuracy). Figure 4 shows more
detailed information about the performance.
Table 5: Queries submitted to the search engine.

Qry Ocial translation Raw translation Improved translation
401 euro ination euro rise prices euro rise continental
402 renewable energy source power well power energy owe
403 act cop actor police role actor police role sleuth actress
404 nato summit security nato summit security nato summit security north
405 childhood asthma childhood asthma childhood infancy development
406 animate cartoon animated cartoon animated cartoon festival
407 australian prime minister aussie prime minister premier aussie minister premier prime 2002
408 human cloning human clone clone human embryo medical ...
409 bali car bombing bali car bomb bali car bomb indonesian
410 north korea nuclear weapon ... north korea dpr republic ... north people korea democratic ...
411 best picture oscar lm oscar lm oscar movie
412 book politician politic book politic book political
413 reduce diabetes risk diabetes hazard mitigation hazard mitigation play form ...
414 beer festival beer festival beer festival john space
415 drug abuse drogoz drug dependence consumerism
416 moscow theatre hostage crisis hostage crisis moscow theatre hostage crisis moscow theatre send ...
417 airplane hijacking air plane hijacking diversion ... plane aeroplane hijacking diversion ...
418 bulent ecevit statement bulent ecevit protestation statement ... bulent ecevit revelation turkish ...
419 nuclear waste repository atom feed cutting cemetery atom cemetery waste
420 obesity ill health prevailing hygienic problem hygienic problem disease number
421 kostelic olympic medal kostelic pendent pendant olympics win event kostelic ivica
422 industrial business closure factory business cessation factory business mill shutdown
423 alternative u shot u immunization alternative alternative method detail give
424 internet banking increase international network bank network international bank information
425 endangered species endangered species illegal person activity
426 9/11 counterterrorism measure 11th sept 2001 aftermath ... 11 sept 2001 terrorism ...
427 testimony milosevic milosevics testimony deposition testimony deposition witness
428 ecological tourism beloved blank safari park park safari tourism
429 water health risk polluted water hazard polluted water sanitary
430 cosmetic procedure cosmetic procedures cosmetic surgical intervention
431 french presidential candidate french president nominee french president nominee party
432 zimbabwe presidential election zimbabwean presidential zimbabwean presidential 2002 latvian
433 child abuse priest divine sexually child divine sexually child priest ...
434 political instability venezuela politics doubt venezuelan politics doubt venezuelan president
435 cause air pollution air pollution ground air pollution ground smog
436 vip divorce vip breakup break divorce vip break divorce actor ...
437 enron auditing irregularity accounting irregularity irregularity accounting enron auditor
438 cancer research oncology oncology cancer treatment
439 accident work workplace accident
440 winter olympics doping scandal dopping scandal winter olympics scandal winter jock
441 space tourist russian expedition candidate russian expedition person cosmonaut
442 queen mother funeral mother queen funeral mother queen funeral
443 world swimming record oat all time high oat cork sport meter
444 brazil world soccer champion brasil footballer world champion ... brasil footballer champion world ...
445 prince harry drug harry prince drug prince drug controversy
446 ood damage cultural heritage ood damage cultural heritage damage ood cultural heritage
447 pim fortuyn politics archimedes constant fortuyn policy constant archimedes fortuyn political ...
448 nobel prize chemistry chemistry nobel award nobelprizeorg ... chemistry nobel prise award ...
449 civil war africa africa civil war africa war civil african ...
450 failed assassination attempt abortive attempt abortive person target
BYpYµÌãúY(?VÌmY
j²

S²

ômpY?ãÌpY úYÌãã <²

%²

÷²

à²

É²

É²²
É²

à²

÷²

%²

<²

S²

j²

²
²

¯µYÌÆYúYÝ?m

"((Ìãú9mÆãÝPúpÌmÝg BÌ~úpÌmÝg ¬Y ÌúpÌmÝg

Figure 4: Interpolated recall vs. average precision when using ocial English translations, raw
translations or Wikipedia-enhanced translations as queries. .

Conclusion and future work
In this paper we have demonstrated that a concept network extracted from the Wikipedia on-
line encyclopedia can be used to improve the quality of machine translation of CLEF queries
supplemented by narratives thus making accurate cross-lingual information retrieval feasible.
As experiments performed on the Los Angeles Times 2002 corpus have shown, our proposed
algorithm works well only if it is supported by a reliable stemmer, suciently rich dictionary, and
a relatively long text describing the purpose of the query. However, when these conditions are
met, adequate precision can be achieved even with a very simple machine translation approach.
Obviously, there are several points where our method can be (and should be) improved. First
of all, we plan to test it against language pairs other than Hungarian-English German-English
translation is particularly promising, as for this combination an extremely rich concept mapping
is available inside Wikipedia. Second, we have to make disambiguation during raw translation
more aggressive, and less dependent on the quality of stemmer which is used for the source or
target language. Third, semantic relationships between Wikipedia articles could be derived from
additional sources, not only from hyperlinks, such as category assignments and similarity of text
of actual article bodies. Finally, we may easily re-order query results by detecting Wikipedia
concepts in their text and comparing them to the concepts recognized in the query.

References
[1] Sisay Fissaha Adafre and Maarten de Rijke. Finding similar sentences across multiple lan-
guages in wikipedia. In Proceedings of the New Text Workshop, 11th Conference of the Eu-
ropean Chapter of the Association for Computational Linguistics, 2006.
[2] András A. Benczúr, Károly Csalogány, Eszter Friedman, Dániel Fogaras, Tamás Sarlós, Máté
Uher, and Eszter Windhager. Searching a small national domain preliminary report. In
Proceedings of the 12th World Wide Web Conference, 2003.
[3] Peter F. Brown, John Cocke, Stephen A. Della Pietra, Vincent J. Della Pietra, Fredrick
Jelinek, John D. Laerty, Robert L. Mercer, and Paul S. Roossin. A statistical approach to
machine translation. Computational Linguistics, 16(2):7985, 1990.

[4] Ralf Brown and Robert Frederking. Applying statistical english language modeling to sym-
bolic machine translation. In Proceedings of the 6th International Conference on Theoretical
and Methodological Issues in Machine Translation, pages 221239, 1995.
[5] Stefan Büttcher, Charles L. A. Clarke, and Brad Lushman. Term proximity scoring for ad-hoc
retrieval on very large text collections. In Proceedings of the 29th annual international ACM
SIGIR conference on Research and development in information retrieval, pages 621622, New
York, NY, 2006. ACM Press.

[6] E. Charniak, K. Knight, and K. Yamada. Syntax-based language models for statistical ma-
chine translation. In Proceedings of the 9th Summit if the International Association for
Machine Translation, 2003.
[7] W. Bruce Croft and John Laerty. Language Modeling for Information Retrieval. Kluwer
Academic Publishers, Norwell, MA, 2003.

[8] Ludovic Denoyer and Patrick Gallinari. The wikipedia xml corpus. SIGIR Forum, 40(1):64
69, 2006.

[9] Bonnie J. Dorr. The use of lexical semantics in interlingual machine translation. Machine
Translation, 7(3):135193, 1992.
[10] Számítástechnikai és Automatizálási Kutató Intézet. Sztaki szótár. Accessible at
http://szotar.sztaki.hu.

[11] J. Gilarranz, J. Gonzalo, and F. Verdejo. An approach to conceptual text retrieval using
the EuroWordNet multilingual semantic database. In Proceedings of the AAAI-96 Spring
Symposium Cross-Language Text and Speech Retrieval, 1996.
[12] J. T. Goodman. A bit of progress in language modeling. Computer Speech & Language,
15(4):403434, 2001.

[13] Péter Halácsy, András Kornai, László Németh, András Rung, István Szakadát, and Viktor
Trón. A szószablya projekt. In Proceedings of the 1st Hungarian Computational Linguistics
Conference, 2003.
[14] D. Hiemstra, F. de Jong, and W. Kraaij. A domain specic lexicon acquisition tool for cross-
language information retrieval. In Proceedings of the RIAO Conference on Computer Assisted
Information Searching on Internet, pages 255270, 1997.
[15] Djoerd Hiemstra and Franciska de Jong. Disambiguation strategies for cross-language infor-
mation retrieval. In Proceedings of the Third European Conference on Research and Advanced
Technology for Digital Libraries, pages 274293, London, UK, 1999.
[16] W. John Hutchins and Harold L. Somers. An Introduction to Machine Translation. Academic
Press, 1992.

[17] Kevin Knight and Steve K. Luk. Building a large-scale knowledge base for machine trans-
lation. In Proceedings of the twelfth National Conference on Articial Intelligence, pages
773778, 1994.

[18] Raija Lehtokangas, Eija Airio, and Kalervo Järvelin. Transitive dictionary translation chal-
lenges direct dictionary translation in CLIR. Information Processing and Management,
40(6):973988, 2004.
[19] K. Mahesh. Ontology development for machine translation: Ideology and methodology. Tech-
nical Report MCCS 96-292, Computing Research Laboratory, New Mexico State University,
1996.

[20] Roberto Navigli, Paola Velardi, and Aldo Gangemi. Ontology learning and its application to
automated terminology translation. IEEE Intelligent Systems, 18(1):2231, 2003.

[21] Sergei Nirenburg, Victor Raskin, and Allen Tucker. On knowledge-based machine translation.
In Proceedings of the 11th coference on Computational linguistics, pages 627632, Morristown,
NJ, 1986. Association for Computational Linguistics.

[22] Douglas W. Oard and Bonnie J. Dorr. A survey of multilingual text retrieval. Technical
Report UMIACS-TR-96-19, University of Maryland at College Park, 1996.

[23] E. T. O'Neill, B. F. Lavoie, and R. Bennett. Trends in the evolution of the public web. D-Lib
Magazine, 9(4), 2003.
[24] Yves Rasolofo and Jacques Savoy. Term proximity scoring for keyword-based retrieval systems.
In Proceedings of the 25th European Conference on Information Retrieval Research, pages
207218, 2003.

[25] Helmut Schmid. Probabilistic part-of-speech tagging using decision trees. In Proceedings of
the International Conference on New Methods in Language Processing, 1994.
[26] Amit Singhal, Chris Buckley, Mandar Mitra, and Gerard Salton. Pivoted document length
normalization. Technical Report TR95-1560, Cornell University, Ithaca, NY, 1995.

[27] Max Völkel, Markus Krötzsch, Denny Vrandecic, Heiko Haller, and Rudi Studer. Semantic
wikipedia. In Proceedings of the 15th international conference on World Wide Web, pages
585594, 2006.