<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>The University of Groningen at QA@CLEF 2006: Using Syntactic Knowledge for QA</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Gosse Bouma</string-name>
          <email>g.bouma@rug.nl</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ismail Fahmi</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jori Mur</string-name>
        </contrib>
      </contrib-group>
      <abstract>
        <p>We describe our system for the monolingual Dutch and multilingual English to Dutch QA tasks. First, we give a brief outline of the architecture of our QA-system, which makes heavy use of syntactic information. Next, we describe the modules that were improved or developed especially for the CLEF tasks, i.e. (1) incorporation of syntactic knowledge in the IR-engine, (2) incorporation of lexical equivalences, (3) incorporation of coreference resolution for off-line answer extraction, (4) treatment of temporally restricted questions, (5) treatment of definition questions, and (6) a baseline multilingual (English to Dutch) QA system, which uses a combination of Systran and Wikipedia (for term recognition and translation) for question translation. For non-list questions, 31% of the highest ranked answers returned by the monolingual system were correct, as were 20% of those returned by the multilingual system.</p>
      </abstract>
      <kwd-group kwd-group-type="categories">
        <kwd>H.3 [Information Storage and Retrieval]</kwd>
        <kwd>H.3.1 Content Analysis and Indexing</kwd>
        <kwd>H.3.3 Information Search and Retrieval</kwd>
        <kwd>J.5 [Arts and Humanities]</kwd>
        <kwd>Language translation</kwd>
        <kwd>Linguistics</kwd>
      </kwd-group>
      <kwd-group kwd-group-type="general-terms">
        <title>General Terms</title>
        <kwd>Algorithms</kwd>
        <kwd>Measurement</kwd>
        <kwd>Performance</kwd>
        <kwd>Experimentation</kwd>
      </kwd-group>
      <kwd-group kwd-group-type="author-keywords">
        <kwd>Question answering</kwd>
        <kwd>Dutch</kwd>
        <kwd>Lexical Equivalences</kwd>
        <kwd>Coreference Resolution</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>This research was carried out as part of the research program for Interactive Multimedia Information Extraction,
IMIX, financed by NWO, the Dutch Organisation for Scientific Research.</p>
      <p>[…] CLEF 2006, and on discussion of the results. Section 3 discusses the IR system, which tries to use
various linguistic features to improve precision. In section 4, we discuss the effect of incorporating
coreference resolution into the module which extracts answers to frequently asked question-types
off-line. Section 5 contains an overview of techniques we implemented to identify (near) synonyms,
spelling variants, etc. Sections 6 and 7 present our treatment of definition and temporally restricted
questions. A description of our baseline multilingual QA system (based on Systran and Wikipedia)
is given in section 8. The results of the evaluation are presented in section 9.</p>
    </sec>
    <sec id="sec-2">
      <title>Architecture</title>
      <p>
        We briefly describe the general architecture of our QA system Joost, which is depicted in
figure 1. Apart from the three classical components question analysis,
passage retrieval and answer extraction, the system also contains a component called Qatar, which
is based on the technique of extracting answers off-line. All components in our system rely heavily
on syntactic analysis, which is provided by Alpino
        <xref ref-type="bibr" rid="ref4">(Bouma, van Noord, and Malouf, 2001)</xref>
        , a
wide-coverage dependency parser for Dutch. Alpino is used to parse questions as well as the full
document collection from which answers need to be extracted. A brief overview of the components
of our QA system follows below.
      </p>
      <p>The first processing stage is question analysis. The input to this component is a natural
language question in Dutch, which is parsed by Alpino. The goal of question analysis is to
determine the question type and to identify keywords in the question.</p>
      <p>Depending on the question type, the next stage is either passage retrieval or table look-up (using
Qatar). If the question type matches one of the table categories, it will be answered by Qatar.
Tables are created off-line for facts that frequently occur in fixed patterns. We store these facts
as potential answers together with the IDs of the paragraphs in which they were found. During
the question answering process the question type determines which table is selected (if any).</p>
      <p>For all questions that cannot be answered by Qatar, we follow the other path through the
QA system to the passage retrieval component. Previous experiments have shown that a segmentation
of the corpus into paragraphs is most efficient for information retrieval (IR) performance in QA.
Hence, IR passes relevant paragraphs to subsequent modules for extracting the actual answers
from these text passages.</p>
      <p>The final processing stage in our QA-system is answer extraction and selection. The input to
this component is a set of paragraph IDs, either provided by Qatar or by the IR system. We then
retrieve all sentences from the text collection included in these paragraphs. For questions that
are answered by means of table look-up, the tables provide an exact answer string. In this case
the context is used only for ranking the answers. For other questions, answer strings have to be
extracted from the paragraphs returned by IR. The features that are used to rank the extracted
answers will be explained in detail below. Finally, the answer ranked first is returned to the user.</p>
    </sec>
    <sec id="sec-3">
      <title>Linguistically Informed Information Retrieval</title>
      <p>The information retrieval component in our system is used to identify relevant paragraphs from
the CLEF corpus to narrow down the search for subsequent answer extraction modules. Accurate
IR is crucial for the success of this approach. Answer-containing paragraphs that have been
missed by IR are lost for the entire system. Hence, IR performance in terms of recall is essential.
Furthermore, high precision is also desirable, as IR scores are used for ranking potential answers.</p>
      <p>
        Given a full syntactic analysis of the CLEF text collection, it becomes feasible to exploit
linguistic information as a knowledge source for IR. Using Apache's IR system Lucene
        <xref ref-type="bibr" rid="ref6">(Jakarta, 2004)</xref>
        , we
can index the document collection along various linguistic dimensions, such as part of speech tags,
named entity classes, and dependency relations. We defined several layers of linguistic features and
feature combinations extracted from syntactically analysed sentences and included them as index
fields. In our current system we use 12 layers containing the following features: text (stemmed
plain text tokens), root (linguistic root forms), RootPos (root forms concatenated with wordclass
labels), RootRel (root forms concatenated with the name of the dependency relation to their head
words), RootHead (dependent-head bigrams using root forms), RootRelHead (dependent-head
bigrams with the type of relation between them), compound (compositional compounds identified
by Alpino), ne (named entities), neLOC (location names), nePER (person names), neORG
(organisation names), and neTypes (labels of named entities identified in the paragraph). The layers
are filled with appropriate data extracted from the analysed corpus.
      </p>
      <p>Each of the index fields defined above can be accessed using Lucene's query language. Complex
queries combining keywords for several layers can be constructed. Queries to be used in our system
are constructed from the syntactically analysed question. We extract linguistic features in the same
way as done for building the index. The task now is to use this rich information appropriately.
The selection of keywords is not straightforward. Keywords that are too specific might harm the
retrieval performance. It is important to carefully select features and feature combinations to
actually improve the results compared to standard plain text retrieval.</p>
      <p>For the selection and weighting of keywords we applied a genetic algorithm trained on
previously collected question answer pairs. For constructing a query we defined further keyword
restrictions to make an even more fine-grained selection. We can select keywords based on their
wordclass, on their relation to the head word, or on a combination of the two. For example,
we can select RootHead keywords from the question which have been tagged as nouns. Each of
these (possibly restricted) keyword selections can be weighted with a numeric value according to
its importance for retrieval. It can also be marked as "required" using the '+' character in
Lucene's query syntax. All keyword selections are then concatenated in a disjunctive way to form
the final query. The example query in figure 2 gives an impression of possible queries in
the system.</p>
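      <p>The query construction described above can be sketched as follows. This is a minimal illustration, not the system's actual code: the class <monospace>KeywordSelection</monospace>, the function names, and the convention of joining multiword terms with an underscore are assumptions made for the example. Only the layer names, the '^' boost and the '+' marker come from the text. No escaping of Lucene special characters is attempted.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class KeywordSelection:
    """One (possibly restricted) keyword selection for a single index layer.

    Hypothetical container used only for this sketch.
    """
    layer: str               # index field, e.g. "text" or "RootHead"
    keywords: List[str]      # keywords extracted from the analysed question
    weight: float = 1.0      # boost factor; 1.0 means "no boost"
    required: bool = False   # if True, every keyword is prefixed with '+'


def format_selection(sel: KeywordSelection) -> str:
    """Render one selection as a fielded Lucene clause, e.g. ne:(Irak^2)."""
    terms = []
    for keyword in sel.keywords:
        term = keyword.replace(" ", "_")   # assumed multiword convention
        if sel.required:
            term = "+" + term
        if sel.weight != 1.0:
            term = f"{term}^{sel.weight:g}"
        terms.append(term)
    return f"{sel.layer}:({' '.join(terms)})"


def build_query(selections: Iterable[KeywordSelection]) -> str:
    """Concatenate all non-empty selections.

    Clauses are separated by spaces, which Lucene's classic query parser
    treats as a disjunction under its default OR operator.
    """
    return " ".join(format_selection(s) for s in selections if s.keywords)
```
]]></preformat>
      <p>For instance, <monospace>build_query([KeywordSelection("ne", ["Verenigde Naties", "Irak"], weight=2), KeywordSelection("neTypes", ["YEAR"])])</monospace> yields a string of the same shape as the ne and neTypes clauses of the example query.</p>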
      <p>Note that the question type provided by the question analysis module is used to query the
neTypes layer with a corresponding named entity label.</p>
      <fig>
        <label>Figure 2</label>
        <preformat>text:(stelde Verenigde Naties +embargo +Irak)
ne:(Verenigde_Naties^2 Verenigde^2 Naties^2 Irak^2)
RootHead:(Irak/tegen embargo/stel_in)
neTypes:(YEAR)</preformat>
      </fig>
      <p>
        The optimisation procedure using the genetic algorithm works essentially as follows. First we
start with initial settings using only one type of keyword selection. These settings are applied
to construct queries from our given collection of questions. The queries are then used to retrieve
a fixed number of paragraphs for each question, and the retrieval performance is measured in
terms of mean reciprocal rank scores. We used the answer string provided in the training data
to determine if a paragraph is relevant or not. After the initial step, two preferable settings
(according to the scores) are selected and combined to test new parameters.
Additionally, we apply simple mutation operations to alter parameters at random from time to
time. The process of selecting and combining is then repeated until no significant improvement
can be measured anymore. Details of the genetic optimisation process are given in
        <xref ref-type="bibr" rid="ref9">(Tiedemann, 2005)</xref>
        . As the result of the optimisation we obtain an improvement of about 19% over the baseline
using standard plain text retrieval (i.e. the text layer only) on unseen evaluation data. It should
be noted that this improvement is not solely an effect of using root forms or named entity labels,
but that many of the features that are assigned a high weight by the genetic algorithm refer to
layers that make use of dependency information.
      </p>
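      <p>The fitness measure used in the optimisation, mean reciprocal rank, can be illustrated with a short sketch. The function names are hypothetical, and relevance is approximated here by a case-insensitive substring test on the answer string, which is an assumption about how the check described above might be implemented.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from typing import Iterable, List, Tuple


def reciprocal_rank(paragraphs: List[str], answer: str) -> float:
    """Return 1/rank of the first paragraph containing the answer string.

    Paragraphs are given in retrieval order (best first). If no paragraph
    contains the answer, the score is 0.
    """
    needle = answer.lower()
    for rank, text in enumerate(paragraphs, start=1):
        if needle in text.lower():
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: Iterable[Tuple[List[str], str]]) -> float:
    """Average the reciprocal rank over (retrieved paragraphs, answer) pairs."""
    runs = list(runs)
    if not runs:
        return 0.0
    return sum(reciprocal_rank(p, a) for p, a in runs) / len(runs)
```
]]></preformat>
      <p>A query setting that ranks the first relevant paragraph second for one question and retrieves nothing relevant for another would score (1/2 + 0) / 2 = 0.25.</p>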
    </sec>
    <sec id="sec-4">
      <title>Coreference Resolution for Off-line Question Answering</title>
      <p>The system component Qatar extracts potential answers from the corpus off-line using dependency
based patterns. Off-line answer extraction has proven to be very effective. The results typically
show a high precision score. However, the main problem with this technique is the lack of coverage
of the extracted answers. One way to increase the coverage is to apply coreference resolution.</p>
      <p>For instance, the age of a person may be extracted from snippets such as:</p>
      <list list-type="simple">
        <list-item><label>(1-a)</label><p>de 26-jarige Steffi Graf (the 26-year old Steffi Graf)</p></list-item>
        <list-item><label>(1-b)</label><p>Steffi Graf....de 26-jarige tennisster (Steffi Graf...the 26-year old tennis player)</p></list-item>
        <list-item><label>(1-c)</label><p>Steffi Graf....Ze is 26 jaar. (Steffi Graf...She is 26 years old)</p></list-item>
      </list>
      <p>If no coreference resolution is applied, only patterns in which a named entity is present, such
as (1-a), will match. Using coreference resolution, we can also extract the age of a person from
snippets such as (1-b) and (1-c), where the named entity is present in a preceding sentence.</p>
      <p>We selected 12 answer types that we expect to benefit from coreference resolution. They are
shown in table 1. Applying the basic patterns to extract facts for these categories, we extracted
64,627 fact types. We adjusted the basic patterns by replacing the slot for the named entity with
a slot for a pronoun. Similarly, we adjusted the patterns to match sentences with a definite noun.
We considered noun phrases preceded by a definite determiner as definite noun phrases.</p>
      <table-wrap>
        <label>Table 1</label>
        <table>
          <thead>
            <tr><th>Answer Type</th><th>Answer Type</th><th>Answer Type</th><th>Answer Type</th></tr>
          </thead>
          <tbody>
            <tr><td>Age</td><td>Age of Death</td><td>Cause of Death</td><td>Founder</td></tr>
            <tr><td>Date of Birth</td><td>Date of Death</td><td>Capital</td><td>Function</td></tr>
            <tr><td>Location of Birth</td><td>Location of Death</td><td>Inhabitants</td><td>Winner</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <sec id="sec-4-4">
        <p>Our strategy for resolving definite NPs is based on knowledge about the categories of named
entities, so-called instances (or categorised named entities). Examples are Van Gogh is-a painter,
Seles is-a tennis player. We acquired instances by scanning the corpus for apposition relations
and predicate complement relations<sup>1</sup>.</p>
        <p>We scan the left context of the definite NP for named entities from right to left. For each
named entity we encounter, we check whether it occurs together with the definite NP as a pair
on the instance list. If so, the named entity is selected as the antecedent of the NP. As long as
no suitable named entity is found we select the next named entity and so on until we reach the
beginning of the document. If no named entity is found that forms an instance pair with the
definite NP, we simply select the first preceding named entity.</p>
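        <p>The antecedent search for definite NPs can be sketched as below. The function name and the data representation (a list of preceding named entities in document order and a set of (entity, noun) instance pairs) are assumptions made for illustration. The fallback to the nearest preceding named entity follows the description above.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from typing import List, Optional, Set, Tuple


def resolve_definite_np(
    noun: str,
    preceding_entities: List[str],
    instances: Set[Tuple[str, str]],
) -> Optional[str]:
    """Pick an antecedent for a definite NP headed by `noun`.

    `preceding_entities` lists the named entities in the left context in
    document order, so scanning it in reverse walks from right to left.
    `instances` holds (named entity, noun) pairs such as
    ("Seles", "tennisster").
    """
    # First preference: the closest named entity that forms an instance
    # pair with the noun of the definite NP.
    for entity in reversed(preceding_entities):
        if (entity, noun) in instances:
            return entity
    # Fallback: the first preceding named entity, if there is one.
    return preceding_entities[-1] if preceding_entities else None
```
]]></preformat>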
        <p>We applied a similar technique for resolving pronouns. The pronouns we tried to resolve were
the nominative forms of the singular pronouns hij (he), zij/ze (she), het (it) and the plural pronoun
zij/ze (they). We chose to resolve only the nominative case, as in almost all patterns the slot for
the name was the slot in subject position. The number of both the anaphor and the antecedent
was determined by the number of the main verb. Since we find the anaphors by matching patterns,
we know what the named entity (NE) tag of the antecedent should be.</p>
        <p>Again we scan the left context of the anaphor (now a pronoun) for named entities from right to
left. We implemented a preference for proper nouns in the subject position. For each named entity
we encounter, we check whether it has the correct NE-tag and number. If so and if it concerns a
non-person NE-tag, the named entity is selected as the antecedent. If we are looking for a person
name, we have to do another check to see if the gender is correct. To determine the gender of the
selected name we created a list of boys' names and girls' names by downloading such lists from the
Internet<sup>2</sup>. The female list contained 12,691 names and the male list 11,854 names. To be accepted
as the correct antecedent, the proper name should not occur on the name list of the opposite sex
of the pronoun. After having resolved the anaphor, the fact was added to the appropriate table.</p>
        <p>For both extraction modules we randomly selected a sample of around 200 extracted facts and
we manually evaluated these facts on the following two criteria: (1) correctness of the fact and (2)
in the case of coreference resolution, correctness of the selected antecedent.</p>
        <p>We estimated the number of additional fact types we found using the estimated precision
scores. If we had only used the pronoun patterns we would have found 3,627 (5.6%) new facts.
On the other hand, if we had only used the definite noun patterns we would have found 35,687
(55.2%) new facts. Using both we extracted 39,208 (60.7%) additional facts.</p>
        <p>The number of facts we extracted by the pronoun patterns is quite low. We did a corpus
investigation on a subset of the corpus which consisted of sentences containing terms relevant to
the 12 selected question types<sup>3</sup>. In only 10% of the sentences one or more pronouns appeared. This
outcome indicates that the possibilities of increasing coverage by pronoun resolution are inherently
limited.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Lexical Equivalences</title>
      <p>
        One of the features that is used to rank potential answers to a question is the amount of syntactic
similarity between the question and the sentence from which the answer is taken. Syntactic
similarity is computed as the proportion of dependency relations from the question which have a
match in the dependency relations of the answer sentence. In
        <xref ref-type="bibr" rid="ref2 ref3">Bouma, Mur, and van Noord (2005</xref>
        ),
we showed that taking syntactic equivalences into account (such as the fact that a by-phrase in a
passive is equivalent to the subject in the active, etc.) makes the syntactic similarity score more
effective.
      </p>
      <p><sup>1</sup>We limited our search to the predicate complement relation between named entities and a noun and excluded
examples with negation.</p>
      <p><sup>2</sup>http://www.namen.info, http://www.voornamenboek.nl, http://www.babynames.com and
http://prenoms.free.fr</p>
      <p><sup>3</sup>Terms such as "geboren" (born), "stierf" (died), "hoofdstad" (capital) etc.</p>
      <p>In the current system, we also take lexical equivalences into account. That is, given two
dependency relations ⟨Head, Rel, Dependent⟩ and ⟨Head′, Rel, Dependent′⟩, we assume that
they are equivalent if Head and Head′ are near-synonyms and Dependent and Dependent′ are near-synonyms.</p>
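      <p>The similarity score with lexical equivalences can be sketched as follows. Dependency relations are represented as (head, relation, dependent) triples, and near-synonymy is looked up in a set of root pairs. Both representations, the function names, and the example roots are assumptions made for this illustration.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from typing import AbstractSet, Iterable, Tuple

Triple = Tuple[str, str, str]        # (head, relation, dependent)
Pairs = AbstractSet[Tuple[str, str]]


def near_synonyms(a: str, b: str, pairs: Pairs) -> bool:
    """Two roots are equivalent if identical or listed as a pair."""
    return a == b or (a, b) in pairs or (b, a) in pairs


def relations_match(q: Triple, s: Triple, pairs: Pairs) -> bool:
    """Same relation name, near-synonymous heads and dependents."""
    (q_head, q_rel, q_dep), (s_head, s_rel, s_dep) = q, s
    return (
        q_rel == s_rel
        and near_synonyms(q_head, s_head, pairs)
        and near_synonyms(q_dep, s_dep, pairs)
    )


def syntactic_similarity(
    question: Iterable[Triple],
    sentence: Iterable[Triple],
    pairs: Pairs = frozenset(),
) -> float:
    """Proportion of question relations with a match in the sentence."""
    question = list(question)
    sentence = list(sentence)
    if not question:
        return 0.0
    matched = sum(
        1 for q in question
        if any(relations_match(q, s, pairs) for s in sentence)
    )
    return matched / len(question)
```
]]></preformat>
      <p>With an empty pair set the function reduces to exact matching of triples, so the effect of adding equivalences can be compared directly.</p>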
      <p>Two roots R and R′ are considered near-synonyms in the following cases:</p>
      <list list-type="bullet">
        <list-item><p>R and R′ are spelling variants,</p></list-item>
        <list-item><p>R is an abbreviation of R′, or vice versa,</p></list-item>
        <list-item><p>R is the genitive form of R′, or vice versa,</p></list-item>
        <list-item><p>R is the adjectival form of the country name R′, or vice versa,</p></list-item>
        <list-item><p>R matches with a part of the compound R′, or vice versa.</p></list-item>
      </list>
      <sec id="sec-5-5">
        <p>A list of synonyms (containing 118K root forms in total) was constructed by merging
information from EuroWordNet, the dictionary website mijnwoordenboek.nl, and various encyclopedias
(which often provide alternative terms for a given lemma keyword).</p>
        <p>The spelling of person and geographical named entities tends to be subject to a fair amount of
variation. For instance, the 1994 Spanish prime minister is referred to as either Felipe Gonzalez,
Felippe Gonzales, Felipe Gonzales or Felipe Gonzalez. The spelling used in a question is not
necessarily the same as the one used in a paragraph which provides the answer:</p>
        <list list-type="simple">
          <list-item><label>(2-a)</label><p>Hoe heet de dochter van Deng Xiaopeng? (What is the name of the daughter of Deng
Xiaopeng?)</p></list-item>
          <list-item><label>(2-b)</label><p>Deng Rong, de dochter van de Chinese leider Deng Xiaoping (Deng Rong, the daughter
of the Chinese leader Deng Xiaoping).</p></list-item>
        </list>
        <p>One might consider two named entities spelling variants if the edit distance between the two is
less than a certain threshold, or if one is a word suffix of the other (e.g. Maradona and Diego
Armando Maradona). However, this method tends to be very noisy. To improve the precision of
the method, we restricted ourselves to person names, and imposed the additional constraint that
the two names must occur with the same function in our database of functions (used for off-line
question answering). Thus, Felipe Gonzalez and Felippe Gonzales are considered to be variants
only if they are known to have the same function (e.g. prime-minister of Spain). Currently, we
recognize 4500 pairs of spelling variants.</p>
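        <p>A possible reading of this test is sketched below. The threshold value, the function names, and the representation of the function database as a mapping from names to sets of functions are all assumptions. The text does not state which threshold or which edit-distance variant was used, so plain Levenshtein distance stands in here.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from typing import Dict, Set


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def is_word_suffix(short: str, long: str) -> bool:
    """True if the words of `short` are the final words of `long`."""
    s, l = short.split(), long.split()
    return 0 < len(s) < len(l) and l[-len(s):] == s


def are_spelling_variants(
    name_a: str,
    name_b: str,
    functions: Dict[str, Set[str]],
    threshold: int = 3,   # assumed value, not taken from the text
) -> bool:
    """Names are variants if they look alike AND share a known function."""
    if name_a == name_b:
        return False
    look_alike = (
        edit_distance(name_a, name_b) < threshold
        or is_word_suffix(name_a, name_b)
        or is_word_suffix(name_b, name_a)
    )
    shared = functions.get(name_a, set()) & functions.get(name_b, set())
    return look_alike and bool(shared)
```
]]></preformat>
        <p>The shared-function check is what keeps the method from pairing similar-looking names of unrelated people: a name that is absent from the function database never qualifies.</p>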
        <p>The compound rule applies when one of the words contains a hyphen (Fiat-topman) or a space
(e.g. Latin phrases like colitis ulcerosa are analyzed as a single word by our parser) and the other
word matches with either part of it, or when the lexical analyzer of the parser analyzes a word as
a compound (e.g. chromosoomafwijking (chromosome deficit)), and the other word matches with
the suffix (afwijking).</p>
        <p>We tested the effect of incorporating lexical equivalences on questions from previous CLEF tasks.
Although approximately 8% of the questions receive a different answer when lexical equivalences
are incorporated, the effect on the overall score is negligible. We suspect that this is due to the
fact that in the definition of synonyms, no distinction is made between various senses of a word,
and the equivalences defined for compounds tend to introduce a fair amount of noise (e.g. the
Calypso-queen of the Netherlands is not the same as the queen of the Netherlands). It should also
be noted that most lexical equivalences are not taken into consideration by the IR-component.
This probably means that some relevant documents (especially those containing spelling variants
of proper names) are missed.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Definition Questions</title>
      <p>Definition questions can ask either for a definition of a named entity (What is Lusa?) or a concept
(What is a cincinatto?). We used the following answer patterns to find potential answers:</p>
      <list list-type="bullet">
        <list-item><p>Appositions (the Portuguese press agency Lusa)</p></list-item>
        <list-item><p>Nominal modifiers (milk sugar (saccharum lactis))</p></list-item>
        <list-item><p>or (ofwel) disjunctions (milk sugar or saccharum lactis)</p></list-item>
        <list-item><p>Predicative complements (milk sugar is (called/known as) saccharum lactis)</p></list-item>
        <list-item><p>Predicative modifiers (composers such as Joonas Kookonen)</p></list-item>
      </list>
      <p>As some of these patterns tend to be very noisy, we also check whether there exists an
isa-relation between the head noun of the definition and the term to be defined. Isa-relations are
collected from:</p>
      <list list-type="bullet">
        <list-item><p>All Named Entity – Noun appositions (48K) extracted from an automatically parsed version
of the Dutch Wikipedia.</p></list-item>
        <list-item><p>All head noun – concept pairs (136K) extracted from definition sentences found in Dutch
Wikipedia.</p></list-item>
      </list>
      <p>
        Definition sentences were identified automatically (see Fahmi and
        <xref ref-type="bibr" rid="ref1">Bouma (2006)</xref>
        ). Answers for
which a corresponding isa-relation exists in Wikipedia are given a higher score.
      </p>
      <p>Of the 40 definition questions in the test set, 18 received a correct first answer (45%), which
is considerably better than the overall performance on non-list questions (31%). We consider 7
of the 40 definition questions to be concept definition questions. Of those, only 1 was answered
correctly. Thus, answering concept definitions correctly remains a challenge.</p>
    </sec>
    <sec id="sec-7">
      <title>Temporally Restricted Questions</title>
      <sec id="sec-7-1">
        <p>Sometimes, questions contain an explicit date:</p>
        <list list-type="simple">
          <list-item><label>(3-a)</label><p>Which Russian Tsar died in 1584?</p></list-item>
          <list-item><label>(3-b)</label><p>Who was the chancellor of Germany from 1974 to 1982?</p></list-item>
        </list>
        <p>To provide the correct answer to such questions, it must be ensured that there is no conflict
between the date mentioned in the question and temporal information present in the text from
which the answer was extracted.</p>
        <p>To answer temporally restricted questions, we try to assign a date to sentences containing a
potential answer to the question. If a sentence contains an explicit date expression, this is used
as answer date. A sentence is considered to contain an explicit date if it contains a temporal
expression referring to a date (2nd of August, 1991) or a relative date (last year). The denotation
of the latter type of expression is computed relative to the date of the newspaper article from
which the sentence is taken. Sentences which do not contain an explicit date are assigned an
answer date which corresponds to the date of the newspaper from which the sentence is extracted.</p>
        <p>For questions which contain an explicit date, this is used as the question date. For all other
questions, the question date is nil.</p>
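        <p>The date score of a potential answer is:</p>
        <list list-type="bullet">
          <list-item><p>0 if the question date is nil,</p></list-item>
          <list-item><p>1 if answer and question date match,</p></list-item>
          <list-item><p>-1 otherwise.</p></list-item>
        </list>
      </sec>
      <sec id="sec-7-2">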
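        <p>The date score can be written down directly. In this sketch dates are assumed to be already normalised to comparable strings (for example a year such as "1584"). How the system decides that two dates match, for instance at which granularity, is not specified in the text, so plain equality is used as a stand-in.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from typing import Optional


def date_score(question_date: Optional[str], answer_date: str) -> int:
    """Score a potential answer against the date in the question.

    `question_date` is None (nil) for questions without an explicit date.
    `answer_date` is the date assigned to the answer sentence: either an
    explicit date from the sentence or the date of the newspaper article.
    """
    if question_date is None:
        return 0
    # Plain equality is an assumption; the matching rule is not specified.
    return 1 if question_date == answer_date else -1
```
]]></preformat>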
        <p>There are 31 questions in the CLEF 2006 test set which contain an explicit date, and which we
consider to be temporally restricted questions. Our monolingual QA system returned 11 correct
first answers for these questions (10 of the correctly answered questions ask explicitly for a fact from
1994 or 1995). The performance of the system on temporally restricted questions is similar to the
performance achieved for (non-list) questions in general (31%).</p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>Multilingual QA</title>
      <p>We have developed a baseline English to Dutch QA-system which is based on two freely available
resources: Systran and Wikipedia. For development, we used the CLEF 2004 multieight corpus
(Magnini et al., 2005).</p>
      <p>The English source questions are converted into an HTML file, which is translated
automatically into Dutch by Systran.<sup>4</sup> These translations are used as input for the monolingual QA-system
described above.<sup>5</sup></p>
      <p>This scenario has a number of obvious drawbacks:</p>
      <list list-type="bullet">
        <list-item><p>Translations often result in grammatically incorrect sentences, for which no (correct)
grammatical analysis can be given.</p></list-item>
        <list-item><p>Even if a translation can be analyzed syntactically, it may contain words or phrases that
were not anticipated by the question analysis module.</p></list-item>
        <list-item><p>Named entities and (multiword) terms are not recognized.</p></list-item>
      </list>
      <sec id="sec-8-1">
        <p>We did not spend any time on fixing the first and second potential problem. While testing
the system, it seemed that the parser was relatively robust against grammatical irregularities. We
did notice that question analysis could be improved, so as to take into account peculiarities of the
translated questions.</p>
        <p>The third problem seemed most serious to us. It seems Systran fails to recognize many named
entities and multiword terms. The result is that these are translated on a word by word basis,
which typically leads to errors that are almost certainly fatal for any component (starting with
IR) which takes the translated string as starting point.</p>
        <p>To improve on the treatment of named entities and terms, we extracted from English Wikipedia
all pairs of lemma titles and their cross-links to the corresponding link in Dutch Wikipedia. Terms
in the English input which are found in the Wikipedia list are escaped from automatic translation
and replaced by their Dutch counterparts directly. The following examples compare the effect of
direct translation (b-examples) and translation combined with Wikipedia look-up (c-examples).</p>
        <list list-type="simple">
          <list-item><label>(4-a)</label><p>In which country do people sleep with their feet on the pillow, according to Pippi
Longstocking?</p></list-item>
          <list-item><label>(4-b)</label><p>In welk land slapen de mensen met hun voeten op het hoofdkussen, volgens Pippi
Longstocking?</p></list-item>
          <list-item><label>(4-c)</label><p>In welk land slapen de mensen met hun voeten op het hoofdkussen, volgens Pippi
Langkous?</p></list-item>
          <list-item><label>(5-a)</label><p>Who is Jan Tinbergen?</p></list-item>
          <list-item><label>(5-b)</label><p>Wie is Januari Tinbergen?</p></list-item>
          <list-item><label>(5-c)</label><p>Wie is Jan Tinbergen?</p></list-item>
          <list-item><label>(6-a)</label><p>How large is the Pacific Ocean?</p></list-item>
          <list-item><label>(6-b)</label><p>Hoe groot is de Vreedzame Oceaan?</p></list-item>
          <list-item><label>(6-c)</label><p>Hoe groot is Grote Oceaan?</p></list-item>
        </list>
        <p><sup>4</sup>Actually, we used the Babelfish interface to Systran, http://babelfish.altavista.digital.com/</p>
        <p><sup>5</sup>For English to Dutch, the only alternative on-line translation service seems to be Freetranslation
(www.freetranslation.com). When testing the system on questions from the multieight corpus, the results from Systran
seemed slightly better, so we decided to use Systran only.</p>
      </sec>
      <sec id="sec-8-3">
        <p>Three cases can arise: (1) the term should not be translated, but it is by Systran (Jan Tinbergen),
(2) the term is not translated by Systran, but it should be (Pippi Longstocking), and (3) the term should
be translated, but it is translated wrongly by Systran (Pacific Ocean).</p>
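        <p>One way to keep Wikipedia terms away from the translation engine is to mask them with placeholders before translation and to substitute the Dutch titles afterwards. The sketch below illustrates that idea only. The placeholder scheme and the function names are assumptions, and no translation service is called. Whether the actual system used placeholders or another escaping mechanism is not stated in the text.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import re
from typing import Dict, Tuple


def protect_terms(
    question: str, term_table: Dict[str, str]
) -> Tuple[str, Dict[str, str]]:
    """Mask known English terms so that they are not translated.

    `term_table` maps English Wikipedia titles to Dutch titles. Longer
    terms are tried first so that a long title is not broken up by a
    shorter title contained in it.
    Returns the masked question and a placeholder -> Dutch term mapping.
    """
    protected: Dict[str, str] = {}
    ordered = sorted(term_table, key=len, reverse=True)
    for index, term in enumerate(ordered):
        pattern = re.compile(r"\b" + re.escape(term) + r"\b")
        placeholder = f"XTERM{index}X"   # hypothetical placeholder format
        if pattern.search(question):
            question = pattern.sub(placeholder, question)
            protected[placeholder] = term_table[term]
    return question, protected


def restore_terms(translated: str, protected: Dict[str, str]) -> str:
    """Replace placeholders in the translated text by the Dutch terms."""
    for placeholder, dutch in protected.items():
        translated = translated.replace(placeholder, dutch)
    return translated
```
]]></preformat>
        <p>The masked question would be sent to the translation engine between the two calls. The approach relies on the engine passing the placeholder through unchanged, which would have to be verified for the engine in use.</p>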
        <p>48 of the 200 input questions contained terms that matched an entry in the bilingual term
database extracted from Wikipedia. 4 of the marked terms are incorrect (Martin Luther instead
of Martin Luther King is marked as a term, nuclear power instead of nuclear power plants is marked
as a term, prime-minister is translated as minister-voorzitter rather than as minister-president or
premier, and the game is incorrectly recognized as a term (it matches the name of a movie in
Wikipedia) and not translated).</p>
        <p>Although the precision of recognizing terms is high, it should be noted that recall could be
much better. Terms such as Olympic Winter Games, World Heritage Sites, and proper names
such as Jack Soden and Chad Rowan are not recognized, leading to word by word translations
(Olympische Spelen van de Winter, De Plaatsen van de Erfenis van de Wereld) that sometimes
are highly cryptic (Hefboom Soden, de Lijsterbes van Tsjaad). In addition, many unrecognized
proper names show up as discontinuous strings in the translation (e.g. What did Yogi Bear steal
is translated as Wat Yogi stal de Beer).</p>
        <p>Although the performance of the multilingual system is considerably lower than that of the
monolingual system, there actually are a few questions which are answered correctly by the bilingual
system, but not by the monolingual system.</p>
        <list list-type="simple">
          <list-item><label>(7-a)</label><p>What are the three elementary particles of physics according to the Standard Model?</p></list-item>
          <list-item><label>(7-b)</label><p>Wat zijn de drie elementaire deeltjes van fysica volgens Standaardmodel? (translated)</p></list-item>
          <list-item><label>(7-c)</label><p>Wat zijn de drie fundamentele deeltjes in het Standaardmodel uit de deeltjesfysica?
(monolingual)</p></list-item>
          <list-item><label>(8-a)</label><p>Who is the author of the book Jurassic Park?</p></list-item>
          <list-item><label>(8-b)</label><p>Wie is de auteur van het boek Jurassic Park? (translated)</p></list-item>
          <list-item><label>(8-c)</label><p>Wie schreef het boek Jurassic Park? (monolingual)</p></list-item>
        </list>
      </sec>
      <sec id="sec-8-4">
        <p>In (7), the translated sentence uses elementaire deeltjes, which also occurs in the answer
sentence. The monolingual question, however, uses the equivalent phrase fundamentele deeltjes, but
this equivalence is not detected by the QA system. In (8) the translated question uses the noun
auteur, which also occurs in the sentence providing the answer, whereas the monolingual version
uses the verb schrijven (to write).
</p>
      </sec>
    </sec>
    <sec id="sec-9">
      <title>Evaluation and Error Analysis</title>
      <sec id="sec-9-1">
        <p>The results from the CLEF evaluation are given in figure 3.</p>
        <p>The monolingual system assigned only 13 questions a question type for which a table with
potential answers was extracted off-line. For only 5 of those, an answer is found off-line. This
suggests that the effect of off-line techniques on the overall result is relatively small. As off-line
answer extraction tends to be more accurate than IR-based answer extraction, it may also explain
why the results for the CLEF 2006 task are relatively modest.<sup>7</sup></p>
        <p>If we look at the scores per question type for the most frequent question types (as
assigned by the question analysis component), we see that definition questions are answered
relatively well (18 out of 40 of the first answers correct), that the scores for general wh-questions
and location questions are in line with the overall score (16 out of 52 and 8 out of 25 correct),
but that measure and date questions are answered poorly (3 out of 20 and 3 out of 15 correct).
On the development set (of almost 800 questions from previous CLEF tasks), all of these question types
perform considerably better (the worst-scoring question type, measure questions, still
yields a correct first answer in 44% of the cases).</p>
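        <p>The per-type counts above can be turned into percentages with a few lines of Python. This is only a convenience sketch: the counts are those reported in the text, while the dictionary layout and the function names are our own assumptions.</p>
        <preformat>
```python
# Correct first answers and number of questions per question type,
# as reported in the text: (correct, total).
RESULTS = {
    "definition": (18, 40),
    "general wh": (16, 52),
    "location": (8, 25),
    "measure": (3, 20),
    "date": (3, 15),
}


def accuracy(correct, total):
    """Fraction of questions whose first answer is correct."""
    return correct / total


def accuracy_table(results):
    """Return (type, percentage) pairs sorted from best to worst."""
    rows = [(qtype, 100 * accuracy(c, n)) for qtype, (c, n) in results.items()]
    return sorted(rows, key=lambda row: row[1], reverse=True)


for qtype, pct in accuracy_table(RESULTS):
    print(f"{qtype:12s} {pct:5.1f}%")
```
        </preformat>
        <p>This gives 45.0% for definition questions, 32.0% for location questions, 30.8% for general wh-questions, 20.0% for date questions, and 15.0% for measure questions. The two middle values are close to the 31% reported in the abstract for non-list questions.</p>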
        <p><sup>7</sup> For development, we used almost 800 questions from previous CLEF tasks. Almost 30%
of those questions are answered by answers that were found off-line; for these, 75% of the first answers are
correct. Overall, the system finds well over 50% correct first answers.</p>
      </sec>
      <sec id="sec-9-2">
        <title>Figure 3</title>
        <p>[Figure 3: results from the CLEF evaluation per question type. Row labels of both tables: Factoid Questions, Definition Questions, Temporally Restricted, Non-list questions, List Questions.]</p>
      </sec>
      <sec id="sec-9-3">
        <p>A few questions are not answered correctly because the question type was unexpected. This
is true in particular for the three questions of the type When did Gottlob Frege live?</p>
        <p>Attachment errors of the parser are the source of some mistakes. For instance, Joost replies
that O.J. Simpson was accused of the murder of his ex-wife, where this should have been the murder of
his ex-wife and a friend. As the conjunction is misparsed, the system fails to find this constituent.
Different attachments also cause problems for the question Who was the German chancellor
between 1974 and 1982? It has an almost verbatim answer in the corpus (the social-democrat Helmut
Schmidt, chancellor between 1974 and 1982), but since the temporal restriction is attached to the
verb in the question, and to the noun social-democrat in the answer, this answer is not found.</p>
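        <p>The attachment mismatch in the chancellor example can be sketched with dependency relations written as (head, relation, dependent) triples. The triples, the relation labels, and both matching functions below are hypothetical simplifications; they are not actual parser output and not the matching procedure of the system. The relaxed variant is just one conceivable way to tolerate different attachment sites, shown for contrast.</p>
        <preformat>
```python
# Hypothetical triples (head, relation, dependent); not actual parser output.
question_deps = {
    ("be", "subj", "who"),
    ("be", "predc", "chancellor"),
    ("be", "mod", "between 1974 and 1982"),
}
answer_deps = {
    ("social-democrat", "app", "Helmut Schmidt"),
    ("social-democrat", "app", "chancellor"),
    ("social-democrat", "mod", "between 1974 and 1982"),
}


def strict_match(q_triple, deps):
    """A question relation matches only if head, label and dependent agree."""
    return q_triple in deps


def relaxed_match(q_triple, deps):
    """Ignore the head: accept the same label and dependent on any head."""
    _, rel, dep = q_triple
    return any(r == rel and d == dep for _, r, d in deps)


# The temporal restriction as it is attached in the question (to the verb).
temporal = ("be", "mod", "between 1974 and 1982")
print(strict_match(temporal, answer_deps))   # False
print(relaxed_match(temporal, answer_deps))  # True
```
        </preformat>
        <p>Under strict matching the temporal modifier is not found, because its head differs between question and answer. Ignoring the head recovers the match in this toy case, although such a relaxation would probably also admit spurious matches.</p>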
        <p>
          The performance loss between the bilingual and the monolingual system is approximately 33%.
This is somewhat more than the differences between multilingual and monolingual QA reported
for many other systems (see
          <xref ref-type="bibr" rid="ref7">Ligozat et al. (2006)</xref>
          for an overview). However, we do believe that
it demonstrates that the syntactic analysis module is relatively robust against the grammatical
anomalies present in automatically translated input. It should be noted, however, that 19 out of
200 questions cannot be assigned a question type in the multilingual system, whereas this is the case for only 4 questions
in the monolingual system. Adapting the question analysis module to typical output produced
by automatic translation, and improving the term recognition module (by incorporating a
named entity recognizer and/or more term lists) seem relatively straightforward, and might lead
to somewhat better results.
        </p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Bouma</surname>
          </string-name>
          , Gosse, Ismail Fahmi, Jori Mur, Gertjan van Noord, Lonneke van der Plas, and Jörg Tiedemann
          .
          <year>2006</year>
          .
          <article-title>Linguistic knowledge and question answering</article-title>
          .
          <source>Traitement Automatique des Langues</source>
          . To appear.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Bouma</surname>
          </string-name>
          , Gosse, Jori Mur, and Gertjan van Noord.
          <year>2005</year>
          .
          <article-title>Reasoning over dependency relations for QA</article-title>
          . In Farah Benamara and Patrick Saint-Dizier, editors,
          <source>Proceedings of the IJCAI workshop on Knowledge and Reasoning for Answering Questions (KRAQ)</source>
          , pages
          <fpage>15</fpage>
          –
          <lpage>21</lpage>
          , Edinburgh.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Bouma</surname>
          </string-name>
          , Gosse, Jori Mur, Gertjan van Noord, Lonneke van der Plas, and Jörg Tiedemann
          .
          <year>2005</year>
          .
          <article-title>Question answering for Dutch using dependency relations</article-title>
          .
          In
          <source>Working Notes for the CLEF 2005 Workshop</source>
          , Vienna.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Bouma</surname>
          </string-name>
          , Gosse, Gertjan van Noord, and Robert Malouf
          .
          <year>2001</year>
          .
          <article-title>Alpino: Wide-coverage computational analysis of Dutch</article-title>
          . In
          <source>Computational Linguistics in The Netherlands 2000</source>
          . Rodopi, Amsterdam.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Fahmi</surname>
          </string-name>
          , Ismail and Gosse Bouma
          .
          <year>2006</year>
          .
          <article-title>Learning to identify definitions using syntactic features</article-title>
          . In Roberto Basili and Alessandro Moschitti, editors,
          <source>Proceedings of the EACL workshop on Learning Structured Information in Natural Language Applications</source>
          , Trento, Italy.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>Jakarta</surname>
          </string-name>
          , Apache.
          <year>2004</year>
          .
          <article-title>Apache Lucene - a high-performance, full-featured text search engine library</article-title>
          . http://lucene.apache.org/java/docs/index.html.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>Ligozat</surname>
          </string-name>
          , Anne-Laure, Brigitte Grau, Isabella Robba, and Anne Vilnat
          .
          <year>2006</year>
          .
          <article-title>Evaluation and improvement of cross-lingual question answering strategies</article-title>
          . In Anselmo Peñas and Richard Sutcliffe, editors,
          <source>EACL workshop on Multilingual Question Answering</source>
          , Trento, Italy.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>Magnini</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Vallin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Ayache</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Erbach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Peñas</surname>
          </string-name>
          , M. de Rijke,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rocha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Simov</surname>
          </string-name>
          , and R. Sutcliffe.
          <year>2005</year>
          .
          <article-title>Overview of the CLEF 2004 multilingual question answering track</article-title>
          . In C. Peters,
          <string-name>
            <given-names>P. D.</given-names>
            <surname>Clough</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. J. F.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gonzalo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kluck</surname>
          </string-name>
          , and B. Magnini, editors,
          <source>Multilingual Information Access for Text, Speech and Images: Results of the Fifth CLEF Evaluation Campaign, Lecture Notes in Computer Science</source>
          Vol.
          <volume>3491</volume>
          . Springer Verlag.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <surname>Tiedemann</surname>
          </string-name>
          , Jorg.
          <year>2005</year>
          .
          <article-title>Improving passage retrieval in question answering using NLP</article-title>
          .
          In
          <source>Proceedings of the 12th Portuguese Conference on Artificial Intelligence (EPIA)</source>
          , Covilhã, Portugal. LNAI Series, Springer.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>