AnswerFinder at QA@CLEF 2007

Menno van Zaanen, Diego Mollá
Macquarie University
{menno,diego}@ics.mq.edu.au

Abstract

In this article, we describe our experiences with modifying and applying AnswerFinder, a generic question answering system that was originally designed to perform text-based question answering in English only, to multi-lingual question answering. In particular, we participated in the Dutch-English task. To enable the handling of Dutch questions, we added a machine translation phase, using Systran to provide the Dutch to English translations. The translated questions were passed verbatim to AnswerFinder. Additionally, a simple form of anaphora resolution was implemented, and since AnswerFinder did not have a document retrieval phase, this was added as well. The document retrieval system used is based on Xapian and simply searches for relevant documents based on the words in the question. Due to the nature and quality of the translated questions and the lack of tuning, we expected (and received) low quality results. The main purpose of our participation was to investigate the flexibility, portability, and scalability of the generic question answering system.

Categories and Subject Descriptors

H.3 [Information Storage and Retrieval]: H.3.1 Content Analysis and Indexing; H.3.3 Information Search and Retrieval; H.3.4 Systems and Software; I.2.7 [Natural Language Processing]: Text Analysis

General Terms

Measurement, Performance, Experimentation

Keywords

Question answering, Multi-lingual retrieval

1 Introduction

Within the AnswerFinder project, the main research question is: what shallow semantic representations are best suited for representing the meaning of questions and sentences in the context of question answering? To experiment with this, over the last several years we have developed the AnswerFinder question answering system. The AnswerFinder system participated in the TREC competitions during the last several years [3, 5, 6, 11]. In that context, English has been the language of the questions and documents. The document collection that has been used is the Aquaint corpus, which consists of 1,033,461 news articles.

When redesigning AnswerFinder in 2005, in addition to performance requirements (speed, memory usage, accuracy of finding answers, etc.), two requirements were recognised that have had a big impact on the design and implementation of the system. These requirements stem from the fact that AnswerFinder is a research and development system.

Flexibility. The system should be flexible in several ways. It should be possible to easily modify AnswerFinder to handle new situations. For example, different input and output formats (of documents, questions, and answers), types of questions and answers, and new algorithms (in all the phases) should be easy to integrate into the system. This allows for easy testing of new ideas in different environments.

Configurability. Since the system contains many different algorithms that can be used in the phases, it should be easy to configure the system to run with specific parameters. Parameters here mean not only the actual values needed by the algorithms, but also the selection of a particular algorithm for each phase.

Applying AnswerFinder to a multilingual question answering problem shows to what extent we have been successful in meeting the requirements described above. Our participation in QA@CLEF is a pilot project in which we concentrate mainly on the flexibility of the system.
We needed to modify the system to work in a multilingual environment, and the document collection is larger and contains different kinds of documents. Furthermore, we did not have enough time to fine-tune the system to this particular task.

In the next sections, we first discuss the general AnswerFinder system, followed by a description of the original plans and the actual changes we made to the system to allow it to be used in the QA@CLEF competition. We then discuss the results and briefly describe the improvements we expect to make.

2 AnswerFinder

The AnswerFinder question answering system is essentially a framework, consisting of several phases that work in a sequential manner. For each of the phases, a specific algorithm has to be selected to create a particular instantiation of the framework. The aim of each of the phases is to reduce the amount of data the system has to handle from then on. This allows later phases to perform computationally more expensive operations on the remaining data.

[Figure 1: AnswerFinder system overview. The figure shows document selection, sentence selection, and answer selection applied to the document collection and the question, with question analysis providing the question type used to produce the final answer(s).]

Figure 1 provides an overview of the AnswerFinder framework. The first phase is a document retrieval phase that selects relevant documents. AnswerFinder was developed to work on large document collections, and this phase typically greatly reduces the amount of text that has to be handled in subsequent steps. In the TREC competitions, a list of relevant documents was always provided and AnswerFinder used these documents. We have implemented a new document retrieval phase specifically for QA@CLEF, which is described in more detail in section 3.3.

Next is the sentence selection phase. It is possible to select a sequence of algorithms that are applied to the set of sentences that remain. Each algorithm selects a subset of the sentences returned by the previous algorithm (or taken from the relevant documents selected by the document retrieval phase). During sentence selection, all sentences that are still left (e.g. all sentences in the selected documents in the first step) are scored against the question using a relevance metric. The most relevant sentences according to this metric are kept for further processing.

After the sentence selection phase, the remaining text is analysed during the answer selection phase. This phase searches for possible answers in the sentences. In the question analysis phase, the question is analysed, which provides information on the kind of answer that is required. This question information is then matched against the list of possible answers, and those answers that match the type of answer required by the question are selected and scored. Finally, the best answer is returned to the user. The best answer is the answer that has both the highest score and an answer type that matches that of the question, or simply the answer with the highest score if none of the possible answers fits the expected answer type.

3 Extensions

So far, the AnswerFinder system has been applied in the TREC question answering competitions. With the design requirements in mind, we have looked at alternative situations where the system could be applied. This year, we participated in QA@CLEF to investigate flexibility across languages and scalability (as we will describe later), and in QAst to investigate the flexibility and robustness of the system.
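Before turning to the extensions, the pipeline structure of Section 2 can be summarised in a minimal sketch. This is our own illustration, not the actual AnswerFinder code; all function names are hypothetical placeholders for the algorithms that can be plugged into each phase.

```python
from typing import Callable, List

# Each phase takes the question and the data kept so far and returns a
# (smaller) collection of data for the next phase to work on.
Phase = Callable[[str, List[str]], List[str]]

def run_pipeline(question: str, documents: List[str], phases: List[Phase]) -> List[str]:
    data = documents
    for phase in phases:
        data = phase(question, data)   # each phase reduces the remaining data
    return data                        # the last phase returns candidate answers

# Hypothetical instantiation: document selection, sentence selection,
# answer selection. The bodies below are placeholders only.
def select_documents(question, docs):  return docs[:100]
def select_sentences(question, docs):  return [s for d in docs for s in d.split(". ")][:100]
def select_answers(question, sents):   return sents[:5]

answers = run_pipeline("Who founded Systran?", ["example document text. more text."],
                       [select_documents, select_sentences, select_answers])
```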
Firstly, we will describe the original ideas we had when we decided to work on multilingual question answering. Unfortunately, we did not have time to implement many of the ideas raised during these preliminary discussions. Next, we will describe the extensions that were required for the system to work in the CLEF multilingual environment.

3.1 Original idea

The idea behind applying AnswerFinder to multilingual question answering arose from discussions with researchers from the University of Groningen in the Netherlands, who are working on the Joost question answering system for Dutch [2, 1]. The underlying ideas of Joost are quite similar to those of AnswerFinder. They have a similar internal representation of the meaning of the questions and the texts. The assumption here is that this internal representation can be used to link parts of the two systems together, which allows the analysis of questions in one language and the analysis of and search in documents in another language.

In some ways, Joost and AnswerFinder are quite dissimilar: for example, they work on different languages, use different knowledge sources, and depend on different tools. However, the most important aspect of the systems is very similar. Both systems use a similar shallow internal meaning representation of the questions and the texts. This meaning representation can be used as a pivot between the two systems. If a mapping between the representations is made, it is possible to transfer the internal knowledge of one system to the other and continue from there. Effectively, this means that the analysis of the question into the internal meaning representation can be done using one system. This representation is then mapped to the other's representation (which is relatively straightforward as the representations are quite similar) and the second system can be used to find the answer. This idea led to the identification of three distinct approaches, each more complex and intertwining the two systems more tightly.

1. The least complicated solution is to translate questions from one language to the other (for example, using the Systran machine translation system) and to feed the translated question into the appropriate question answering system. This approach does not benefit from an analysis of the question in the original language and as such relies entirely on possibly incorrectly translated questions.

2. To benefit from language-specific analysis, the result of the question analysis performed by the question answering system of the source language can be handed to the question answering system of the target language, in addition to the translation of the question produced by the machine translation system. This approach requires a mapping from the output of the question analysis module of the source language to the required input of the other modules of the target language.

3. The most challenging approach would be one that formalises and implements the mapping between the two internal representations that underlie all modules of both question answering systems. The benefit of this approach is that all analyses of the questions and the texts are done using tools that are developed for that particular language.

In the end, only the first approach has been implemented here.
There is no integration of the two systems: the questions are translated using a machine translation system and then used in the AnswerFinder system. In addition to this machine translation step, AnswerFinder had to be modified in two respects. Originally, AnswerFinder did not have a document retrieval phase, as in the TREC competitions a list of relevant documents was always provided. During TREC 2006, we briefly experimented with our own document retrieval phase, but it required modifications to allow it to work in this context. Additionally, AnswerFinder has a simple anaphora resolution phase. This phase was designed specifically for the TREC competition and had to be rewritten for the QA@CLEF competition.

3.2 Machine translation

AnswerFinder expects to find documents and questions in English. Several algorithms that are currently implemented in AnswerFinder require the text to be in English, as both sentences and questions are parsed using a dependency parser. In the Dutch-English task, the documents are in English already, but the questions are in Dutch.

There are several on-line machine translation systems that translate from Dutch to English. We concentrated on Freetranslation (http://www.freetranslation.com), Babelfish (http://world.altavista.com/babelfish), and Systran (http://www.systranbox.com/systran/). To decide which of these systems would provide us with the best translation of the questions, we translated the QA@CLEF questions of the 2004 and 2006 competitions. For these years, we were able to find aligned Dutch and English questions, which allowed us to evaluate the quality of the translated questions against a gold standard. The results of the machine translation systems can be found in Table 1. We used the BLEU score [8] to compute these results, using the evaluation program provided by our colleague Simon Zwarts, to whom we are most grateful. The results show that Systran is best, followed closely by Babelfish. Babelfish's underlying machine translation system seems to be a slightly different version of the Systran engine.

System           2004      2006
Freetranslation  0.383712  0.369556
Babelfish        0.405939  0.389921
Systran          0.416436  0.398048
DEMOCRAT         0.407023  0.390922

Table 1: BLEU scores of translations of the CLEF questions from Dutch to English

To see whether we could further improve on the translation quality of these three systems, we applied the DEMOCRAT [12] consensus translation system to the translated questions. DEMOCRAT, which stands for DEcides between Multiple Outputs CReated by Automatic Translation, is designed specifically for free on-line MT systems. It tries to construct from them a consensus translation which, it is hoped, will take the best elements of the contributing systems and produce an output as good as or better than any of the individual MT systems on their own. The main idea behind DEMOCRAT is to combine the output of machine translation systems. This is done by selecting and recombining the "best" sequence of words from each output. The underlying assumption here is that if several machine translation systems use the same word or phrase, it is the correct translation. Single words are selected based on a majority vote over the output of the systems.

Unfortunately, as can be seen in Table 1, the DEMOCRAT system does not provide any better results. Choosing DEMOCRAT's output is better than selecting one of the machine translation systems at random. However, the results on the old questions show that Systran has consistently generated the best translations. This means that it is better to use Systran to translate the questions, which is what we have done in these experiments.
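For the comparison above we used a colleague's evaluation script; purely as an illustration, a comparable corpus-level BLEU evaluation could be set up with NLTK as in the sketch below. The file names and the whitespace tokenisation are our assumptions, not the setup actually used in the paper.

```python
# Illustrative only: compare MT outputs of the CLEF questions against the
# aligned English gold-standard questions with corpus-level BLEU.
from nltk.translate.bleu_score import corpus_bleu

def load_tokenised(path):
    """Read one question per line and tokenise on whitespace (assumption)."""
    with open(path, encoding="utf-8") as f:
        return [line.strip().lower().split() for line in f]

gold = load_tokenised("questions_2006_en.txt")  # hypothetical gold-standard file
for system in ["freetranslation", "babelfish", "systran", "democrat"]:
    hyp = load_tokenised(f"questions_2006_{system}.txt")  # hypothetical MT output file
    # corpus_bleu expects, for each hypothesis, a list of reference translations
    score = corpus_bleu([[ref] for ref in gold], hyp)
    print(f"{system:15s} BLEU = {score:.4f}")
```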
When we looked at the translated questions, it turned out that some translations were completely incomprehensible, for example, "Call some olive oil see.", or nearly incomprehensible, "What was the name of its woman?". We also noticed some consistent translation errors. For example, all Dutch questions starting with "Noem. . . " are translated starting with "Call. . . ", whereas the proper translation would be "Name. . . ". Also, the translated questions often have a preposition as the first word, for example, "To which countries provide the EU agriculture subsidies?".

Our question analysis phase (which assigns expected answer types to a question) is based on matching regular expressions against the question. This is a very simple system that can probably be improved considerably. In particular, it could be adapted to the kinds of mistranslations that occurred in the questions. However, due to time constraints we did not modify it. The translation problems mean that none of these regular expressions match. This, in turn, means that AnswerFinder did not assign expected answer types to the questions and therefore was unable to filter out any possible answers. Answers could only be selected based on their scores.

3.3 Document retrieval

The AnswerFinder submission for TREC 2006 contained a first attempt at document retrieval. In TREC, a list of relevant documents is provided per topic (where, just like in QA@CLEF, a topic contains several questions). We looked at performing document retrieval specific to each question and keeping only those documents that can be found both in the list of relevant documents found by our document retrieval system and in the list provided by NIST. We used the indexing and retrieval methods of the Xapian toolkit (http://www.xapian.org) for this. The list of relevant documents is found by taking the words in the question as search keywords in the retrieval system; a minimal sketch of this retrieval step is given at the end of section 3. Note that in the current version, we do not remove any stop words, which may lead to query drift.

Even though the Xapian-based document retrieval system slightly improved results in the TREC competition setting, the system has only been evaluated in combination with the list of relevant documents provided by NIST. Furthermore, the document collection used in that context, Aquaint, is smaller than the collection used in QA@CLEF. The Wikipedia collection is not only much larger, it also contains structure that our document retrieval system could not handle. We decided to convert the Wikipedia format to a format that is quite similar to that of the Aquaint documents. By doing this we may have lost some useful data.

3.4 Anaphora resolution

Anaphora resolution is not something we have looked into very much. During our TREC participation we always replaced the anaphora with the name of the topic. (In the TREC question answering track, the topics all have a name attached to them.) In QA@CLEF, several questions in the question collection also contain anaphora. Examples are "they" in the question "In which year were they boasted?", and "he" in the question "Of which political party was he member?". The anaphora often refer to a named entity or an important part of a previous question. We decided to use previous answers as replacements for the anaphora, as this was easiest to implement in the current version of AnswerFinder. However, this obviously leads to the following problem: an incorrect answer to a question means that the next question (with anaphora) will also be incorrect.
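As a rough illustration of this replacement strategy, the sketch below substitutes the previous answer for any pronoun found in the (translated) question. This is our simplification; the pronoun list is an assumption and the real AnswerFinder phase may handle more or different forms.

```python
import re

# Pronouns treated as anaphora (assumed list, for illustration only).
PRONOUNS = re.compile(r"\b(he|she|it|they|him|her|them|his|its|their)\b", re.IGNORECASE)

def resolve_anaphora(question, previous_answer=None):
    """Replace pronouns in the question by the answer to the previous question."""
    if previous_answer is None:
        return question
    return PRONOUNS.sub(previous_answer, question)

# Hypothetical example: the previous answer is used verbatim as the replacement.
print(resolve_anaphora("Of which political party was he member?", "Willem Drees"))
# -> "Of which political party was Willem Drees member?"
```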
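As announced in section 3.3, the following is a minimal sketch of the keyword-based retrieval step using the Xapian Python bindings. The index location and the use of the default query parser are our assumptions; the actual module may differ in detail.

```python
import xapian

def retrieve_documents(index_path, question, max_docs=100):
    """Return the ids of the best-matching documents for a question.

    Sketch only: assumes the collection has already been indexed into a
    Xapian database at index_path. As in section 3.3, the raw question
    words are used as keywords and no stop word removal is performed
    (no stopper is attached to the parser).
    """
    db = xapian.Database(index_path)
    parser = xapian.QueryParser()
    parser.set_database(db)
    query = parser.parse_query(question, xapian.QueryParser.FLAG_DEFAULT)
    enquire = xapian.Enquire(db)
    enquire.set_query(query)
    return [match.docid for match in enquire.get_mset(0, max_docs)]

# Example usage (assumes an existing index at the hypothetical path "wikipedia.db"):
doc_ids = retrieve_documents("wikipedia.db",
                             "To which countries provide the EU agriculture subsidies?")
```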
4 Results

We submitted two runs generated by AnswerFinder. Both runs used questions translated using Systran. Anaphora in the questions were resolved during question answering as described above: all anaphora were replaced by the answer to the previous question. Document retrieval was performed using our Xapian-based document retrieval module. The best 100 documents were retained in each run.

After document retrieval, the sentence selection phase was started. In both runs, the best 100 sentences were selected based on word overlap. This metric counts the number of words in the question that can also be found in the sentence under consideration. Words that appear in a list of stop words are not counted towards this score. The remaining sentences are handed to a named entity recogniser, and the named entities found are taken to be possible answers. The possible answer with the highest score is then returned as the exact answer.

To generate the named entities, we used AFNER [7], a named entity recogniser built within the AnswerFinder project. The idea behind AFNER is that high recall in named entity recognition is required for question answering. AFNER uses a combination of gazetteers, regular expressions, and general features in a maximum entropy classifier to assign tags to words. These tags indicate whether the word is at the beginning of a named entity, inside a named entity, or outside any named entity. The general MUC named entity types [9] are used: organisation, person, location, date, time, money, and percent. These types are used to select answers according to the expected answer type that is found during question analysis. However, as mentioned above, since the question analysis phase did not classify any of the questions, there was no selection on the basis of the expected answer type. All found entities were treated as possible answers.

The settings of the two runs differed in only one respect: the second run performed an additional graph-based sentence selection phase, which finds additional possible answers. The algorithm used in this phase is described in more detail in [4]. The question and the sentence are parsed using Connexor [10]. This results in a dependency parse, which is converted into a shallow semantic representation in graph form. The score that is assigned to the sentence (for the question) is related to the size of the overlap of the graphs of the question and the sentence. The score computed using the graph logical forms is actually not used in this experiment, but the algorithm also has the ability to find exact answers. During a training phase, where we have questions aligned to sentences with the exact answer annotated, the overlap between the question and the sentence is computed. Next, a path from the overlap to the answer is found. The overlap, including the path, is stored as a graph rule. During the actual sentence selection phase, these graph rules are applied to the overlap of the question and sentence. This indicates sub-graphs in the sentence that are considered to be possible answers.

In general, the score of a possible answer is computed by summing the scores that are found for that answer. If an answer is found using the named entity recogniser, its score is the number of times the answer is found by the named entity recogniser (in all selected sentences). If the graph-logical-form algorithm is used, the score computed by that algorithm is added to the scores of the named entities.
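The word-overlap metric and the frequency-based answer scoring used in these runs can be summarised in the following sketch. This is our simplification: the stop word list is a placeholder and the named entity recogniser is passed in as a function standing in for AFNER.

```python
from collections import Counter

# Placeholder stop word list (the actual list used by AnswerFinder is not shown here).
STOP_WORDS = {"the", "a", "an", "of", "in", "is", "was", "which", "what", "to"}

def overlap_score(question, sentence):
    """Number of non-stop-word question words that also occur in the sentence."""
    q_words = {w.lower() for w in question.split()} - STOP_WORDS
    s_words = {w.lower() for w in sentence.split()}
    return len(q_words & s_words)

def select_sentences(question, sentences, n=100):
    """Keep the n sentences with the highest word overlap with the question."""
    return sorted(sentences, key=lambda s: overlap_score(question, s), reverse=True)[:n]

def score_answers(selected_sentences, recognise_entities):
    """Score each candidate by how often the recogniser finds it in the kept sentences."""
    counts = Counter()
    for sentence in selected_sentences:
        counts.update(recognise_entities(sentence))  # e.g. named entities for this sentence
    return counts.most_common()                      # best-scoring candidate first
```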
Unfortunately, neither run of the multilingual version of AnswerFinder generated any correct results. All answers generated were numbers. There may be several reasons for this. Firstly, the named entity recogniser finds numbers using regular expressions, whereas most of the other named entity types require other features. AFNER has been trained on the BBN corpus (Ralph Weischedel and Ada Brunstein, 2005, BBN Pronoun Coreference and Entity Type Corpus, Linguistic Data Consortium, Philadelphia), which is somewhat different from the document collection used here. Secondly, the question analysis did not return specific question types (as described above). This means that the most frequently found named entity is selected, which is often a number.

5 Future improvements

The discussion of the extensions of the AnswerFinder system and of the results has already indicated several areas that need improvement. We briefly discuss some of them here.

The translation of the questions is imperfect. Not only does this impact the question analysis phase (mentioned above), it obviously impacts the whole system. For example, named entities are sometimes translated or partially translated ("General engines" for "General Motors"), some words are translated incorrectly ("woman" instead of "wife" for "vrouw", "bluntest" instead of "bumped" for "botste"), and some regular words are not translated at all, such as "ongeluk" (accident) or "zat" (sat). Post-editing of the questions may solve some of these problems, although other AnswerFinder phases will need to be adapted as well to allow for partially incorrect questions.

The quality of the document retrieval phase that is used at the moment is unknown. We have not investigated to what extent the documents found by this system contain information on the topic of the question. We probably need to do a more in-depth analysis of this phase. For example, the documents are selected based on the original question, before anaphora resolution. This means that important information (referred to by the anaphora) is missing from the query used in the document retrieval phase.

The question analysis phase is based on a relatively short list of regular expressions. This list has not been tested against the translated questions at all. Many of the translated questions are somewhat different from the questions the system handled during the TREC competitions. Modifying the regular expressions and adding several new ones that handle incorrectly translated questions will probably increase the performance of this phase greatly.

The way anaphora are handled in the current system is very simple and most of the time incorrect. Instead of replacing the anaphora by previously returned (probably incorrect) answers, we need to analyse the previous question or questions in the same topic and, based on the type of anaphor, select the part of the question that fits it best.

Finally, in section 3.1 we already discussed several extensions that we expect will improve the quality of the system. If the question analysis can be done in the source language, we do not have to modify the English question classifier to handle incorrectly translated questions. If it is possible to map the semantic representations of the two systems, we do not even have to use a machine translation system to translate the questions. We will still need to handle ambiguity in translating the concepts from Dutch to English, but the impact of syntactically incorrect questions will be reduced greatly.
6 Conclusion

Our submissions to the QA@CLEF competition this year, even though the results are particularly bad, reveal much about certain aspects of AnswerFinder. We are particularly interested in the flexibility, portability, and scalability of the system. It turns out that AnswerFinder can easily be modified to work on a different domain and with a different language.

A few specific aspects of our participation in this competition show that much work is still needed. Firstly, the translation of the questions is clearly imperfect. Unfortunately, many algorithms in the rest of the question answering system require proper English questions. Similarly, our current way of handling anaphora means that many questions are incorrect. This has a major impact on the quality of the answers generated by the system. The increase in size of the document collection can be handled by the document selection phase. However, so far we have not done much research into this phase. We applied a very simple word-based document retrieval system, and the quality of the returned documents is unknown.

Overall, we think that our participation in this track has been successful. We have managed to identify several areas that need improvement. All of these improvements can easily be incorporated in the AnswerFinder framework. The framework itself functions effectively, although the algorithms in several phases need improvements to generate satisfactory results.

References

[1] Gosse Bouma, Ismail Fahmi, Jori Mur, Gertjan van Noord, Lonneke van der Plas, and Jörg Tiedemann. The University of Groningen at QA@CLEF 2006: Using syntactic knowledge for QA. In C. Peters, editor, Working Notes for the CLEF 2006 Workshop; Alicante, Spain, September 20–22 2006.

[2] Gosse Bouma, Jori Mur, Gertjan van Noord, Lonneke van der Plas, and Jörg Tiedemann. Question answering for Dutch using dependency relations. In C. Peters, editor, Working Notes for the CLEF 2005 Workshop; Vienna, Austria, September 21–23 2005.

[3] Diego Mollá. AnswerFinder in TREC 2003. In Proceedings of the Twelfth Text Retrieval Conference (TREC 2003); Gaithersburg:MD, USA, number 500-255 in NIST Special Publication, pages 392–398. Department of Commerce, National Institute of Standards and Technology, November 18–21 2003.

[4] Diego Mollá. Learning of graph-based question answering rules. In Proceedings of TextGraphs: the Second Workshop on Graph Based Methods for Natural Language Processing; New York:NY, USA, pages 37–44, 2006.

[5] Diego Mollá and Mary Gardiner. AnswerFinder at TREC 2004. In Ellen E. Voorhees and Lori P. Buckland, editors, Proceedings of the Thirteenth Text Retrieval Conference (TREC 2004); Gaithersburg:MD, USA, 2004.

[6] Diego Mollá and Menno van Zaanen. AnswerFinder at TREC 2005. In Proceedings of the Fourteenth Text Retrieval Conference (TREC 2005); Gaithersburg:MD, USA, NIST Special Publication. Department of Commerce, National Institute of Standards and Technology, 2005. CD-ROM.

[7] Diego Mollá, Menno van Zaanen, and Daniel Smith. Named entity recognition for question answering. In Lawrence Cavedon and Ingrid Zukerman, editors, Proceedings of the 2006 Australasian Language Technology Workshop; Sydney, Australia, pages 51–58, 2006.

[8] K. Papineni, S. Roukos, T. Ward, and W. Zhu. BLEU: A method for automatic evaluation of machine translation. In 40th Annual Meeting of the Association for Computational Linguistics; Philadelphia:PA, USA, pages 311–318. Association for Computational Linguistics, July 2002.
[9] Beth M. Sundheim. Overview of results of the MUC-6 evaluation. In Proceedings of the Sixth Message Understanding Conference (MUC-6); Columbia:MD, USA, 1995. Morgan Kaufmann Publishers, San Francisco:CA, USA.

[10] Pasi Tapanainen and Timo Järvinen. A non-projective dependency parser. In Proceedings of the Fifth Conference on Applied Natural Language Processing (ANLP-97); Washington:DC, USA, pages 64–71. Association for Computational Linguistics, 1997.

[11] Menno van Zaanen, Diego Mollá, and Luiz Pizzato. AnswerFinder at TREC 2006. In Proceedings of the Fifteenth Text Retrieval Conference (TREC 2006); Gaithersburg:MD, USA, NIST Special Publication. Department of Commerce, National Institute of Standards and Technology, 2006.

[12] Menno van Zaanen and Harold Somers. DEMOCRAT: Deciding between multiple outputs created by automatic translation. In The Tenth Machine Translation Summit, Proceedings of Conference; Phuket, Thailand, pages 173–180, 2005.