The University of Amsterdam at CLEF@QA 2007

Valentin Jijkoun, Katja Hofmann, David Ahn, Mahboob Alam Khalid, Joris van Rantwijk, Maarten de Rijke, Erik Tjong Kim Sang
ISLA, University of Amsterdam
jijkoun,khofmann,ahn,mahboob,rantwijk,mdr,erikt@science.uva.nl

Abstract

We describe a new version of our question answering system, which was applied to the questions of the 2007 CLEF Question Answering Dutch monolingual task. This year, we made three major modifications to the system: (1) we added the contents of Wikipedia to the document collection and the answer tables; (2) we completely rewrote the module interface code in Java; and (3) we included a new table stream which returned answer candidates based on information that was learned from question-answer pairs. Unfortunately, the changes did not lead to improved performance. Unsolved technical problems at the time of the deadline have led to missing justifications for a large number of answers in our submission. Our single run obtained an accuracy of only 8%, with an additional 12% of unsupported answers (last year, our best run achieved 21%).

1 Introduction

For our earlier participations in the CLEF question answering track (2003–2006), we developed a parallel question answering architecture in which candidate answers to a question are generated by different competing strategies, QA streams [4]. Although our streams use different approaches to answer extraction and generation, they share the mechanism for accessing the collection data: we have converted all of our data resources (text, linguistic annotations, and tables of automatically extracted facts) to fit in an XML database in order to standardize access [4].

For the 2007 version of the system, we focused on three tasks:

1. Add to the data resources of the system material derived from the Dutch Wikipedia (previously only derived from Dutch newspaper text).

2. Rewrite the out-of-date code which takes care of the communication between the different modules (previously in Perl) in Java. In the long run, we are aiming at a system which is completely written in Java and is easily maintainable.

3. Add a new question answering stream to our parallel architecture: a stream that generates answers from pre-extracted relational information based on learned associations between questions and answers; a similar stream in last year's system used manual rules for identifying such associations.

This paper is divided into seven sections. In section 2, we give an overview of the current system architecture. In the next three sections, we describe the changes made to our system for this year: resource adaptation (section 3), code rewriting (section 4), and the new table stream (section 5). We describe our submitted run in section 6 and conclude in section 7.

Figure 1: Quartz-2007: the University of Amsterdam's Dutch Question Answering System. After question analysis, a question is forwarded to two table modules and two retrieval modules, all of which generate candidate answers. These four question processing streams use the two data sources for this task, the Dutch CLEF newspaper corpus and Dutch Wikipedia, as well as fact tables which were generated from these data sources, and the Web. Related candidate answers are combined and ranked by a postprocessing module, which produces the final list of answers to the question.
2 System Description

The architecture of our Quartz QA system is an expanded version of a standard QA architecture consisting of parts dealing with question analysis, information retrieval, answer extraction, and answer post-processing (clustering, ranking, and selection). The Quartz architecture consists of multiple answer extraction modules, or streams, which share common question and answer processing components. The answer extraction streams can be divided into three groups based on the text corpus that they employ: the CLEF-QA corpus, Dutch Wikipedia, or the Web. Below, we describe these briefly.

The Quartz system (Figure 1) contains four streams that generate answers from the two CLEF data sources, the CLEF newspaper corpus and Wikipedia. The Table Lookup stream searches for answers in specialized knowledge bases which are extracted from the corpus offline (prior to question time) by predefined rules. These information extraction rules take advantage of the fact that certain answer types, such as birthdays, are typically expressed in one of a small set of easily identifiable ways. The stream uses the analysis of a question to determine how a candidate answer should be looked up in the database, using a manually defined mapping from questions to database queries. Our new stream, ML Table Lookup, performs the answer lookup task by using a mapping learned automatically from a set of training questions (see section 5 for a more elaborate description). The Ngrams stream looks for answers in the corpus by searching for the most frequent word ngrams in a list of text passages retrieved from the collection by a standard retrieval engine (Lucene) with a text query generated from the question.

The most advanced of the four streams is XQuesta. For a given question, it automatically generates XPath queries for answer extraction and executes them on an XML version of the corpus which contains both the corpus text and additional annotations. The annotations include information about part-of-speech, syntactic chunks, named entities, temporal expressions, and dependency parses (from the Alpino parser [6]). For each question, XQuesta only examines text passages relevant to the question (as identified by Lucene).

There is one stream which employs textual data outside the CLEF document collection defined for the task: the Ngrams stream also retrieves answers by submitting automatically generated web queries to the Google web search engine and collecting the most common ngrams from the returned snippets. The answer candidates found by this approach are not backed up by documents from the CLEF collection, as required by the task. For this reason, such candidates are never returned as actual answers, but are only used at the answer merging stage to adjust the ranking of answers that are found by other QA streams.
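To give an impression of the kind of extraction query XQuesta works with, the following minimal sketch evaluates a hand-written XPath expression of the sort the stream might generate for a "when was X born" question. The annotation element names (sentence, ne, timex) and the query template are our assumptions for this example; the actual annotation schema and query generation rules of the system are not reproduced here.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Minimal sketch of XPath-based answer extraction over annotated text.
// The tag names (sentence, ne, timex) and the query template are assumed
// for illustration only.
public class XPathExtractionSketch {
    public static void main(String[] args) throws Exception {
        String passage =
            "<sentence><ne type=\"PER\">Aleksandr Poesjkin</ne> werd geboren in "
            + "<timex type=\"DATE\">1799</timex>.</sentence>";
        String focus = "Poesjkin"; // question focus, produced by question analysis
        // Temporal expressions in sentences mentioning the focus entity:
        String query = "//sentence[.//ne[@type='PER'][contains(., '" + focus + "')]]//timex";
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList hits = (NodeList) xpath.evaluate(
            query, new InputSource(new StringReader(passage)), XPathConstants.NODESET);
        for (int i = 0; i < hits.getLength(); i++) {
            System.out.println(hits.item(i).getTextContent()); // candidate answer: 1799
        }
    }
}
```

In the actual system, such queries are generated automatically from the question analysis and run against passages pre-selected by Lucene, rather than hand-written as above.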
3 Wikipedia as a QA Resource

Our system uses Dutch Wikipedia in the same way as the Dutch newspaper corpus. We used an XML dump of Wikipedia (http://ilps.science.uva.nl/WikiXML) that provides basic structural markup and additionally annotated it with sentence boundaries, part-of-speech tags, named entities, and temporal expressions. Wikipedia was consulted by the XQuesta and NGram streams and was also used for offline information extraction.

4 Rewriting Interface Code in Java

The QA system we used in previous years consisted of many components but was mostly developed ad hoc, i.e., without a consistent system architecture. As a result, the system was difficult to maintain and change. To address this problem, we re-implemented large parts of the system following a modular system design. The goal is to develop a self-contained system that is consistent and can be maintained more easily.

The main feature of the newly developed system architecture is that it consists of several modules which are cleanly separated by interfaces. This allows us to minimize dependencies between components. Figure 2 gives an overview of the main components of our QA system. The core modules are document collection, text analysis, data, question answering, and answer selection. The package apps contains several small programs that combine functionality of the core modules into complete applications, such as the CLEF batch system, an interactive command-line program, and an online demo of our QA system (the Dutch version of the demo can be accessed at http://cs-ilps.science.uva.nl:20500/; it uses the Table Lookup stream only).

Figure 2: UML component diagram showing modules and module dependencies. Boxes represent individual modules, circles represent interfaces, and half-circles indicate that a module depends on an interface.

Central to our system is the data module, which abstracts all textual data within the system as AnnotatedText. AnnotatedText objects maintain an XML representation of the data and allow access through both Java methods and XQuery. XML annotations can be added as necessary. At each processing step, the data objects can be serialized to their XML representation for logging or data exchange with external programs.

The question answering module contains our QA streams. Each stream implements the interface QuestionAnsweringStream, which allows applications to run streams in a unified way. This allows us to add new QA streams to the system on the fly, without changing any of the existing components.

To make question answering streams independent of the specifics of different document collections (SQL tables, the CLEF newspaper corpus, Wikipedia, and the web), the document collection module provides access via the DocumentCollection and CollectionElement interfaces. DocumentCollection provides methods for retrieving elements from a collection. Implementations for different IR engines and web search engines were developed. To create answer candidates from CollectionElements, the elements are converted into AnnotatedText.

Text analysis tools, such as the part-of-speech tagger, named-entity tagger, and question classifier, are part of the text analysis module and implement the AnalysisTool interface. Analysis tools are run on demand by a mechanism provided by the data module. Consumers of data objects, such as QA streams, specify which tools they require. The tools produce XML annotations which are added to the data objects and can be queried, for example using XQuery. We attempt to use standard XML annotation formats whenever possible.

The answer selection module contains filters for post-processing lists of answer candidates. Each post-processing tool implements the interface AnswerFilter, so that applications using these tools are independent of implementation details. Sequences of post-processing tools can be assembled by higher-level applications as necessary. The tools can be applied either per stream or to the combined output of the system as a whole.
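To make the interface layering concrete, the sketch below shows one possible shape of the central interfaces. Only the interface names (AnnotatedText, AnalysisTool, QuestionAnsweringStream, AnswerFilter) come from the description above; the method signatures and the AnswerCandidate class are our assumptions for illustration.

```java
import java.util.List;

// Illustrative sketch of the module interfaces; signatures are assumed,
// only the interface names come from the system description.
class AnswerCandidate {
    String answer;              // answer string
    String type;                // named-entity type of the answer, if known
    double score;               // confidence assigned by the stream
    String justificationDocId;  // id of the supporting collection document
}

interface AnnotatedText {
    String toXml();                     // serialized XML representation
    List<String> query(String xquery);  // query text and annotations via XQuery
}

interface AnalysisTool {
    void annotate(AnnotatedText text);  // add XML annotations in place, on demand
}

interface QuestionAnsweringStream {
    List<AnswerCandidate> answer(AnnotatedText question);
}

interface AnswerFilter {
    List<AnswerCandidate> filter(List<AnswerCandidate> candidates);
}
```

An application such as the CLEF batch system can then run any combination of streams and chain AnswerFilter implementations per stream or over the merged candidate list.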
5 Machine Learning for QA from Tabular Data

As described in section 2, our offline information extraction module creates a database of simple relational facts to be used during question answering. The Table Lookup QA stream uses a set of manually defined rules to map an analyzed incoming question into a database query. A new stream, ML Table Lookup, uses supervised machine learning to train a classifier that performs this mapping. In this section we give an overview of our approach; we refer to [5] for further details.

Essentially, the purpose of the table lookup stream is to map an incoming question to an SQL-like query "select AF from T where sim(QF, Q)", where T is the table that contains the answer in field AF and whose other field QF has a high similarity to the input question Q. Executing such a query for a given question results in a list of answer candidates, the output of the ML Table Lookup stream. In this query formalism, the task of generating the query can be seen as the task of mapping an incoming question Q to a triple ⟨T, QF, AF⟩ (a table-lookup label) and defining an appropriate similarity function sim(QF, Q).

The database of facts extracted from the CLEF-QA collection consists of 16 tables containing 1.4M rows in total. For example, the table Definitions(name, definition) contains the definition of Soekarno as president of Indonesia, and the table Birthdates(name, birthdate) contains the information that Aleksandr Poesjkin was born in 1799. For a question such as Wie was de eerste Europese commissaris uit Finland? (Who was the first European Commissioner from Finland?), the classifier may assign the table-lookup label ⟨T: Definitions, QF: definition, AF: name⟩. In this case, the question would be mapped to the SQL-like query "select name from Definitions where sim(definition, {eerste, Europese, commissaris, Finland})".

For an incoming question, we first extract features and apply a statistical classifier that assigns a table-lookup label, i.e., a triple ⟨T, QF, AF⟩. We then use a retrieval engine to locate the values of field QF in table T which are most similar to the text of the question Q (according to a retrieval function sim(·, ·)), and return the values of the corresponding AF fields. Figure 3 shows the architecture of this stream.

Figure 3: Architecture of the ML Table Lookup QA stream.
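As an illustration of the lookup step, the sketch below executes the query "select AF from T where sim(QF, Q)" against a Lucene index in which every database row is indexed as one document with one field per table column. The index layout, field names, and use of a recent Lucene API are assumptions for illustration; this is not the system's actual code.

```java
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.FSDirectory;

// Hypothetical lookup step of the ML Table Lookup stream: given the predicted
// label <T, QF, AF> and the question translated into a retrieval query,
// return the AF values of the best-matching rows of table T.
class TableLookupSketch {
    List<String> lookup(String table, String qf, String af, String retrievalQuery)
            throws Exception {
        IndexSearcher searcher = new IndexSearcher(
            DirectoryReader.open(FSDirectory.open(Paths.get("index/" + table))));
        // retrievalQuery contains the (stopword-filtered) exact and stemmed question words
        Query query = new QueryParser(qf, new StandardAnalyzer()).parse(retrievalQuery);
        List<String> candidates = new ArrayList<>();
        for (ScoreDoc hit : searcher.search(query, 10).scoreDocs) {
            candidates.add(searcher.doc(hit.doc).get(af)); // stored value of answer field AF
        }
        return candidates;
    }
}
```

Under these assumptions, lookup("Definitions", "definition", "name", "eerste Europese commissaris Finland") would return the names whose definition fields best match the question words, mirroring the example above.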
Our architecture depends on two modules: the classifier that predicts table-lookup labels, and the retrieval model sim(·, ·) along with the text representation and the retrieval query formulation. For the latter task, we selected Lucene's vector space model as the retrieval model and used a combination of two types of text representation, the exact and stemmed forms of the question words, to translate an incoming question into a retrieval query.

The interesting and novel part of the new QA stream is training a classifier to predict table-lookup labels. This stage, in turn, can be split into two parts: generating training data and actually training the classifier.

We generate the training data using the selected retrieval model. We index the values of all fields of all rows in our database as separate documents. For each question q, we translate the question into a retrieval query and use the selected retrieval model to generate a ranked list of field values from our database. We then select the table T and field QF of the highest-ranked field value for which some other field AF of the same table contains the answer to the question. In other words, we find T, QF, and AF such that the query "select AF from T where sim(QF, Q)" returns a correct answer to question q at the top rank. We use the label ⟨T, QF, AF⟩ as the correct class for question q. For example, we translate the question In welk land in Afrika is een nieuwe grondwet aangenomen? (In which country in Africa has a new constitution been adopted?; the answer is Zuid-Afrika) into a retrieval query composed of the question words, after stopword filtering, together with their stemmed forms. We then run the query against the retrieval engine's index; for this particular example, our system finds the triple ⟨T: Locations, QF: location_b, AF: location_a⟩ as the optimal table-lookup label for this question.

Next, we represent each question as a set of features, using the existing module of [2] to construct the feature set. Finally, we train the memory-based classifier TiMBL [1] and use a parameter optimization tool to find the best settings for TiMBL; see Figure 4 for an overview.

Figure 4: Learning table-lookup labels.

We used a set of question-answer pairs from the CLEF-QA tasks 2003–2006 and a knowledge base with tables extracted from the CLEF-QA corpus using the information extraction tools of the Quartz system. We split our training corpus of 644 questions with answers into 10 sets and ran a 10-fold cross-validation. The performance of the system is measured using the Mean Reciprocal Rank (MRR, the inverse of the rank of the first correct answer, averaged over all questions) and accuracy at n (a@n, the proportion of questions answered correctly at rank ≤ n). Table 1 shows the evaluation results averaged over the 10 folds.

  a@1     a@5     a@10    MRR
  13.1%   21.4%   24.1%   0.593

Table 1: Evaluation of the ML Table Lookup QA stream applied to the CLEF 2003–2006 question-answer pairs.
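For concreteness, the following is a minimal, generic sketch of how these two measures can be computed from the rank of the first correct answer per question; it is not code from the system.

```java
import java.util.List;

// Computes MRR and accuracy-at-n from the 1-based rank of the first correct
// answer per question (0 when no returned candidate was correct).
class EvalSketch {
    static double mrr(List<Integer> firstCorrectRanks) {
        double sum = 0.0;
        for (int rank : firstCorrectRanks) {
            if (rank > 0) sum += 1.0 / rank; // reciprocal rank of first correct answer
        }
        return sum / firstCorrectRanks.size();
    }

    static double accuracyAt(List<Integer> firstCorrectRanks, int n) {
        int hits = 0;
        for (int rank : firstCorrectRanks) {
            if (rank > 0 && rank <= n) hits++; // correct answer within the top n
        }
        return (double) hits / firstCorrectRanks.size();
    }
}
```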
6 Runs

We have submitted a single Dutch monolingual run: uams071qrtz. The associated evaluation results can be found in Table 2. In this run, we treated list questions as factoid questions, always returning only the top answer.

The planned updates of the system proved to be more time-consuming than expected, and the system was barely finished at the time of the deadline. Because of this, there was no time for elaborate testing or for compiling alternative runs. The performance of the system suffered as a result: only about 8% of the questions were answered correctly, whereas the previous version achieved 21% correct on the CLEF-2006 questions.

The prime cause of the performance drop can be found in the submitted answer file. No fewer than 81 (41%) of the 200 answers did not contain the required answer snippet. This problem caused all but 4 of the unsupported assessments. Of these 81 answers, 22 mentioned a document id for the missing snippet, but the other 59 lacked the id as well. The problem was caused by a mismatch between the new Java code and the justification module, which caused all justifications associated with answers from the two table streams to be lost.

When examining the answers for the factoid and definition questions, we noticed that a major problem is a mismatch between the expected answer type and the type of the returned answer. Here are a few examples:

0003. How often was John Lennon hit? Answer: Yoko Ono
0136. What is an antipope? Answer: Anacletus II
0160. Who is Gert Muller? Answer: 1947

As many as 61 of the 161 incorrectly answered questions displayed such a type mismatch. The question classification part of the system (accuracy: 80%) generates an expected answer type for each question, but this information is not used in the postprocessing phase. Indeed, the addition of a type-based filter at the end of the processing phase is one of the most urgent tasks for future work; a sketch of such a filter is given after Table 2.

Question type           Total  Right  Unsupported  Inexact  Wrong  % Correct
factoid                   156     14           17        0    125         9%
definition                 28      1            5        0     22         4%
factoid+definition        184     15           22        0    147         8%
list                       16      0            1        1     14         0%
temporally restricted      41      2            3        0     36         5%
unrestricted              159     13           20        1    125         8%
all                       200     15           23        1    161         8%

Table 2: Assessment counts for the 200 top answers in the Amsterdam run submitted for Dutch monolingual Question Answering (NLNL) in CLEF-2007. About 8% of the questions were answered correctly. Another 12% were correct but insufficiently supported. The run did not contain NIL answers.
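The following sketch shows what such a type-based filter could look like when implemented against the AnswerFilter interface mentioned in section 4. The AnswerCandidate fields and the handling of untyped candidates are assumptions for illustration; this is a possible design, not existing code.

```java
import java.util.ArrayList;
import java.util.List;

// Possible type-based filter, plugged into the AnswerFilter interface
// (see the sketch in section 4). The expectedType string and the policy
// for untyped candidates are illustrative assumptions.
class ExpectedTypeFilter implements AnswerFilter {
    private final String expectedType; // e.g. "PERSON" or "DATE", from question classification

    ExpectedTypeFilter(String expectedType) {
        this.expectedType = expectedType;
    }

    public List<AnswerCandidate> filter(List<AnswerCandidate> candidates) {
        List<AnswerCandidate> kept = new ArrayList<>();
        for (AnswerCandidate c : candidates) {
            // Keep candidates whose type matches the expected answer type;
            // candidates without a known type are kept rather than discarded.
            if (c.type == null || c.type.equals(expectedType)) {
                kept.add(c);
            }
        }
        return kept;
    }
}
```

Under these assumptions, such a filter could have removed answers like "1947" for question 0160 (Who is Gert Muller?), where the expected answer type is a person.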
7 Conclusion

We have described the fifth iteration of our system for the CLEF Question Answering Dutch monolingual track (2007). While keeping the general multi-stream architecture, we re-designed and re-implemented the system in Java. This was an important update, which however did not lead to improved performance, mainly due to many technical problems that were not solved by the time of the deadline. In particular, these problems led to the originating snippets being lost for many of the answer candidates extracted from the collection, resulting in a large number of unsupported answers in our submission. Our single run obtained an accuracy of only 8%, with an additional 12% of unsupported answers (last year, our best run achieved 21%). Addressing these issues, performing a more systematic error analysis, and improving the answer extraction step in the XQuesta stream and the learning step in ML Table Lookup are the most important items for future work.

Acknowledgments

This research was supported by various grants from the Netherlands Organisation for Scientific Research (NWO). Valentin Jijkoun was supported under project numbers 220.80.001, 600.065.120 and 612.000.106. Joris van Rantwijk and David Ahn were supported under project number 612.066.302. Erik Tjong Kim Sang was supported under project number 264.70.050. Maarten de Rijke was supported by NWO under project numbers 017.001.190, 220.80.001, 264.70.050, 354.20.005, 600.065.120, 612.13.001, 612.000.106, 612.066.302, 612.069.006, 640.001.501, and 640.002.501. Mahboob Alam Khalid and Katja Hofmann were supported by NWO under project number 612.066.512.

References

[1] Walter Daelemans, Jakub Zavrel, Ko van der Sloot, and Antal van den Bosch. TiMBL: Tilburg Memory Based Learner, version 5.1, Reference Guide. ILK Technical Report ILK-0402, University of Tilburg, 2004. http://ilk.uvt.nl/.

[2] V. Jijkoun, G. Mishne, and M. de Rijke. Building infrastructure for Dutch question answering. In Proceedings DIR-2003, 2003.

[3] Valentin Jijkoun and Maarten de Rijke. Retrieving answers from frequently asked questions pages on the web. In Proceedings of the Fourteenth ACM Conference on Information and Knowledge Management (CIKM 2005). ACM Press, 2005.

[4] Valentin Jijkoun, Joris van Rantwijk, David Ahn, Erik Tjong Kim Sang, and Maarten de Rijke. The University of Amsterdam at QA@CLEF 2006. In Working Notes for the CLEF 2006 Workshop, Alicante, Spain, 2006.

[5] M.A. Khalid, V. Jijkoun, and M. de Rijke. Machine learning for question answering from tabular data. In FlexDBIST-07, Second International Workshop on Flexible Database and Information Systems Technology, 2007.

[6] Gertjan van Noord. At last parsing is now operational. In Proceedings of TALN 2006, Leuven, Belgium, 2006.