The University of Amsterdam at CLEF@QA 2006

Valentin Jijkoun, Joris van Rantwijk, David Ahn, Erik Tjong Kim Sang, Maarten de Rijke
ISLA, University of Amsterdam
jijkoun,rantwijk,ahn,erikt,mdr@science.uva.nl

Abstract

We describe the system that generated our submission for the 2006 CLEF Question Answering Dutch monolingual task. Our system for this year's task features entirely new question classification, data storage and access, and answer processing components.

Categories and Subject Descriptors

H.3 [Information Storage and Retrieval]: H.3.1 Content Analysis and Indexing; H.3.3 Information Search and Retrieval; H.3.4 Systems and Software; H.3.7 Digital Libraries; H.2.3 [Database Management]: Languages—Query Languages

General Terms

Measurement, Performance, Experimentation

Keywords

Question answering, Questions beyond factoids

1 Introduction

For our earlier participations in the CLEF question answering track (2003–2005), we developed a question answering architecture in which answers are generated by different competing strategies. For the 2005 edition of CLEF-QA, we focused on converting our text resources to XML in order to make a QA-as-XML-retrieval strategy possible. For 2006, we converted all of our data resources (text, annotations, and tables) to fit in an XML database in order to further standardize access. Additionally, we have devoted attention to improving known weak parts of our system: question classification, type checking, answer clustering, and score estimation.

This paper is divided into nine sections. In section 2, we give an overview of the current system architecture. In the following four sections, we describe the changes we have made to our system for this year: question classification (section 3), storing data with multi-dimensional markup (section 4), and probabilistic answer processing (sections 5 and 6). We present our submitted runs in section 7 and evaluate them in section 8. We conclude in section 9.

2 System Description

The architecture of our Quartz QA system is an expanded version of a standard QA architecture consisting of parts dealing with question analysis, information retrieval, answer extraction, and answer post-processing (clustering, ranking, and selection). The Quartz architecture consists of multiple answer extraction modules, or streams, which share common question and answer processing components. The answer extraction streams can be divided into three groups based on the text corpus that they employ: the CLEF-QA corpus, Dutch Wikipedia, or the Web. We describe these briefly here. The common question and answer processing parts will be outlined in more detail in the following sections about the architecture updates.

Figure 1: Quartz-2006: the University of Amsterdam's Dutch Question Answering System. (The figure shows the question classification component, the answer extraction streams over the CLEF corpus, Wikipedia, and the Web, and the shared type checking, clustering, and reranking components that produce the ranked answers.)

The Quartz system (Figure 1) contains four streams that generate answers from the CLEF-QA corpus. The Table Lookup stream searches for answers in specialized knowledge bases which are extracted from the corpus offline (prior to question time) by predefined rules. These rules take advantage of the fact that certain answer types, such as birthdays, are typically expressed in one of a small set of easily identifiable ways.
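To give a feel for what such an offline extraction rule might look like, here is a minimal sketch; the regular expression, field names, and example sentence are purely illustrative and are not the actual Quartz rules.

```python
import re

# Purely illustrative offline extraction rule for a "birthdays" lookup table.
# The pattern, field names, and example sentence are not the actual Quartz rules.
BIRTHDATE = re.compile(
    r"(?P<name>[A-Z]\w+(?: [A-Z]\w+)+) "
    r"\((?P<place>[A-Z]\w+), (?P<date>\d{1,2} \w+ \d{4})\)"
)

def extract_birthdays(sentences):
    """Collect (name, birthplace, birthdate) rows from corpus sentences."""
    rows = []
    for sentence in sentences:
        for m in BIRTHDATE.finditer(sentence):
            rows.append((m.group("name"), m.group("place"), m.group("date")))
    return rows

print(extract_birthdays(["Wim Kok (Bergambacht, 29 september 1938) werd in 1994 premier."]))
# [('Wim Kok', 'Bergambacht', '29 september 1938')]
```

At question time, such a table can then be consulted directly instead of searching the corpus text.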
The Ngrams stream looks for answers in the corpus by searching for word ngrams using a standard retrieval engine (Lucene). The Pattern Match stream is quite similar, except that it allows searching for regular expressions rather than just sequences of words. The most advanced of the four CLEF-QA corpus streams is XQuesta. It performs XPath queries against an XML version of the CLEF-QA corpus which contains both the corpus text and additional annotations. The annotations include information about part-of-speech, syntactic chunks, named entities, temporal expressions, and dependency parses (from the Alpino parser [7]). XQuesta retrieves passages of at least 400 characters, starting and ending at paragraph boundaries.

For 2006, we have adapted several parts of our question answering system. We changed the question classification process, the data storage method, and the answer processing module. These topics are discussed in the following sections.

3 Question Classification

An important problem identified in the error analysis of our previous CLEF-QA results was a mismatch between the type of the answer and the type required by the question. A thorough evaluation of the type assignment and type checking parts of the system was required. Over the past three years, our question analysis module—consisting of regular expressions embedded in program code—had grown to the point where it was difficult to maintain. Therefore, we have re-implemented the module from scratch, this time using machine learning for classification.

We started by collecting training data for the classifier. The 600 questions from the three previous CLEF-QA tracks were an obvious place to start, but restricting ourselves to these questions was not an option, since our goal is to develop a general Dutch QA system. Therefore, we added additional questions: 1000 questions from a Dutch trivia game and 279 questions from other sources, frequently questions provided by users of our online demo. All the questions have been classified by a single annotator.

Choosing a proper set of question classes was a non-trivial problem. We needed question types not only for extracting candidate answers from text passages but also for selecting columns from tables. Sometimes a coarse-grained type such as person would be sufficient, but for answering which-questions, a fine-grained type, such as American President, might be better. We therefore decided to use three different types of question classes: a table type that linked the question to an available table column (17 classes), a coarse-grained type that linked the question to the types recognized by our named-entity recognizer (7 classes), and a fine-grained type that linked the question to WordNet synsets (166 classes).

For the machine learner, a question is represented as a collection of features. We chose ten different features:

1. the question word (e.g., what for English or wat for Dutch)
2. the main verb of the question
3. the first noun following the question word
4. a normalized version of the noun
5. the top hypernym of the noun according to our type hierarchy
6. the type of the named entity before the noun
7. the type of the named entity after the noun
8. a flag identifying the presence of capital letters
9. a flag identifying a verb-initial question
10. a flag indicating a factoid or a list question

In order to determine the top hypernym of the main noun of the question, we use EuroWordNet [8]. We follow the hypernym path until we reach a word that is a member of a predefined list containing entities such as persoon (person) or organisatie (organisation). The factoid/list flag is determined by the grammatical number of the main noun: a plural noun indicates a list question and a singular one a factoid question. Questions without a noun are classified by the number of the verb. A question such as Who is the new pope? is thus transformed into the features who, are, pope, pope, pope, none, none, 0, 0, and factoid. We have not explored other features, because this small set already gave a satisfactory score for coarse-grained classification (about 90% on the questions from previous CLEF-QA tracks).
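To make this representation concrete, the following is a minimal, purely illustrative sketch of how such a ten-feature instance could be assembled; the toy lookup table and helper function stand in for the actual analysis components (POS tagger, named-entity tagger, EuroWordNet lookup) described in the surrounding text.

```python
# Illustrative only: assembles the ten-feature instance described above using
# toy stand-ins for the real analysis components; in the actual system the
# feature values are produced by the POS tagger, NE tagger, and EuroWordNet.

TOP_HYPERNYMS = {"president": "persoon"}   # toy stand-in for the EuroWordNet walk

def top_hypernym(noun):
    return TOP_HYPERNYMS.get(noun, noun)

def question_features(tokens, main_verb, main_noun,
                      ne_before="none", ne_after="none", noun_is_plural=False):
    qword = tokens[0].lower()                               # 1. question word
    norm_noun = (main_noun or "none").lower()               # 4. normalized noun
    return [
        qword,                                              # 1
        main_verb,                                          # 2. main verb
        main_noun or "none",                                # 3. first noun after the question word
        norm_noun,                                          # 4
        top_hypernym(norm_noun),                            # 5. top hypernym
        ne_before,                                          # 6. NE type before the noun
        ne_after,                                           # 7. NE type after the noun
        int(any(t[:1].isupper() for t in tokens[1:])),      # 8. capital letters present
        int(main_verb.lower() == qword),                    # 9. verb-initial question
        "list" if noun_is_plural else "factoid",            # 10. factoid vs. list question
    ]

# Roughly reproduces the instance given in the text for "Who is the new pope?":
print(question_features("Who is the new pope ?".split(), "is", "pope"))
# ['who', 'is', 'pope', 'pope', 'pope', 'none', 'none', 0, 0, 'factoid']
```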
We chose the memory-based learner Timbl [4] for building the question classifier. A known problem of this learner is that extra features can degrade its performance. Therefore, we performed an additional feature selection step in order to identify the best subset of features for each classification task (using bidirectional hill-climbing, see [3]). We performed three experiments to determine the optimal feature sets for the three types of question classes. Each of the questions in the training set was classified by training on all the other questions (leave-one-out). The best score was obtained for the coarse-grained class using four features (4, 5, 8 and 9): 85%. Identifying fine-grained classes proved to be harder: 79%, also with four features (4, 7, 8 and 9). Table classes were best predicted with three features (80%; 4, 9 and 10).

4 Multi-dimensional markup

Our system makes use of several kinds of linguistic analysis tools, including a POS tagger, a named entity recognizer, and a dependency parser [7]. These tools are run offline on the entire corpus, before any questions are posed. The output of an analysis tool takes the form of a set of annotations: the tool identifies text regions in the corpus and associates some metadata with each region. Storing these annotations as XML seems natural and enables us to access them through the powerful XQuery language.

Ideally, we would like to query the combined annotations produced by several different tools at once. This is not easily accomplished, though, because the tools may produce conflicting regions. For example, a named entity may partially overlap with a phrasal constituent in such a way that the elements cannot be properly nested. It is thus not possible, in general, to construct a single XML tree that contains annotations from all tools.

Figure 2: Example document with two annotation layers. (The figure shows two XML trees over a shared blob of text characters, together with a table listing, for a given context node, the result nodes of the select-narrow, select-wide, reject-narrow, and reject-wide axes.)

To deal with this problem, we have developed a general framework for the representation of multi-dimensional XML markup [2]. This framework stores multiple layers of annotations referring to the same base document. Through an extension of the XQuery language, it is possible to retrieve information from several layers with a single query. This year, we migrated the corpus and all annotation layers to the new framework.

4.1 Stand-off XML

We store all annotations as stand-off XML. Textual content is stripped from the XML tree and stored separately in a blob file (binary large object). Two region attributes are added to each XML element to specify the byte offsets of its start and end position with respect to the blob file.
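As a minimal sketch of this conversion for a single annotation layer (the element and attribute names, the example text, and the spans are made up for illustration; the real markup differs):

```python
# Illustrative sketch: turn one annotation layer over a piece of text into a
# UTF-8 blob plus stand-off XML elements carrying byte offsets. Element and
# attribute names are invented for this example.

def to_standoff(text, annotations):
    """annotations: list of (tag, start_char, end_char) over `text`."""
    blob = text.encode("utf-8")
    elements = []
    for tag, start, end in annotations:
        # Convert character offsets to byte offsets into the UTF-8 blob.
        b_start = len(text[:start].encode("utf-8"))
        b_end = len(text[:end].encode("utf-8"))
        elements.append('<%s so-start="%d" so-end="%d"/>' % (tag, b_start, b_end))
    return blob, elements

text = "Atlantis keerde terug naar de aarde."
ne_layer = [("ne-org", 0, 8)]                  # named-entity layer
chunk_layer = [("np", 0, 8), ("pp", 22, 35)]   # syntactic-chunk layer

print(to_standoff(text, ne_layer)[1])     # ['<ne-org so-start="0" so-end="8"/>']
print(to_standoff(text, chunk_layer)[1])  # ['<np so-start="0" so-end="8"/>', '<pp so-start="22" so-end="35"/>']
```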
This process can be repeated for all annotation layers, producing identical blobs but different stand-off XML markup. Figure 2 shows an example of two annotation layers on the same text content, each layer stored as its own stand-off markup over the shared blob. The different markup layers are merged and stored as a single XML file. Although there will still be conflicting regions, this is no longer a problem, because our approach does not require nested elements. The added region attributes provide sufficient information to reconstruct each of the annotation layers, regardless of how they are merged. In principle, we could merge the layers by simply concatenating the respective stand-off markup. However, in order to simplify query formulation, we choose to properly nest the layers down to the level of sentence elements.

4.2 Extending MonetDB/XQuery

The merged XML documents are indexed using MonetDB/XQuery [1], an XML database engine with full XQuery support. Its XQuery front-end consists of a compiler which transforms XQuery programs into a relational query language internal to MonetDB.

We extended the XQuery language by defining four new path axes that allow us to step between layers. The new axis steps relate elements by region overlap in the blob, rather than by nesting relations in the XML tree: select-narrow selects elements that have their region completely contained within the context element's region; select-wide selects elements that have at least partial overlap with the context element's region; reject-narrow and reject-wide select the non-contained and non-overlapping region elements, respectively. The table in Figure 2 demonstrates the results of each of the new axes when applied to our example document.

In addition to the new axes, we have also added an XQuery function so-blob($node), which takes an element and returns the contents of that element's blob region. This function is necessary because all text content has been stripped from the XML markup, making it impossible to retrieve text directly from the XML tree.

The XQuery extensions were implemented by modifying the front-end of MonetDB/XQuery. An index on the offset attributes is used to make the new axis steps efficient even for large documents.

4.3 Merging annotation layers

We use a separate system, called XIRAF, to coordinate the process of automatically annotating the corpus. XIRAF combines multiple text processing tools, each having an input descriptor and a tool-specific wrapper that converts the tool output into stand-off XML annotation.

The input descriptor associated with a tool is used to select regions in the data that are candidates for processing by that tool. The descriptor may select regions on the basis of the original metadata or of annotations added by other tools. For example, our sentence splitter selects document text using the TEXT element from the original document markup. Other tools, such as the POS tagger and named-entity tagger, require separated sentences as input and thus use the output annotations of the sentence splitter by selecting SENT elements.

Some tools readily produce annotations in XML format, and other tools can usually be adapted to produce XML. Unfortunately, text processing tools may actually modify the input text in the course of adding annotations, which makes it non-trivial to associate the new annotations with regions in the original blob.
Tools make a variety of modifications to their input text: some perform their own tokenization (inserting whitespace or other word separators), silently skip parts of the input (e.g., syntactic parsers, when parsing fails), or replace special symbols. For many of the available text processing tools, such modifications are not fully documented. XIRAF, then, must re-align the output of the processing tools with the original blob. We have experimented with a systematic approach that computes an alignment with minimal edit distance. However, we found that there are special cases where a seriously incorrect alignment may nevertheless have optimal edit distance. This prompted us to replace edit-distance alignment with ad-hoc alignment rules that are specific to the textual modifications made by each tool.

The alignment of blob regions is further complicated by the use of character encodings. We use UTF-8 encoding for blob files, enabling us to represent any sequence of Unicode characters. Since UTF-8 is a variable-length encoding, there is no direct relation between character offsets and byte offsets. Region attributes in our stand-off markup are stored as byte offsets to allow fast retrieval from large blob files. Many text processing tools, however, assume that their input is either ASCII or Latin-1 and produce character offsets in their output. This makes it necessary for XIRAF to carefully distinguish character offsets from byte offsets and to convert between them in several situations.

After conversion and re-alignment, annotations are merged into the XML database. In general, the output of a tool is attached to the element that provided the corresponding input. For example, since the POS tagger requests a sentence as input, its output is attached to the SENT element on which it was invoked. In principle, our use of region attributes makes the tree structure of the XML database less important. Maintaining some logical structure in the tree, however, may allow for queries that are simpler or faster.

4.4 Using multi-dimensional markup for QA

Two streams in our QA system have been adapted to work with multi-dimensional markup: the Table Lookup stream and XQuesta. The table stream relies on a set of tables that are extracted from the corpus offline according to predefined rules. These extraction rules were rewritten as XQuery expressions and used to extract the tables from the collection text. Previous versions of XQuesta had to generate separate XPath queries to retrieve annotations from the various markup layers, and information from these layers had to be explicitly combined within the stream. Moving to multi-dimensional XQuery enables us to query several annotation layers jointly, handing off the task of combining the layers to MonetDB. Since XQuery is a superset of XPath, previously developed query patterns could still be used in addition to the new, multi-dimensional ones.

5 Probabilities

One major consequence of the multi-stream architecture of the Quartz QA system is the need for a module that chooses among the candidate answers produced by the various streams. The principal challenge for such a module is making sense of the confidence scores attached by each stream to its candidate answers. This year, we made use of data from previous CLEF-QA campaigns in order to estimate correctness probabilities for candidate answers from stream confidence scores.
Furthermore, we also estimated correctness probabilities conditioned on well-typedness in order to implement type checking as a Bayesian update. In the rest of this section, we describe how we estimated these probabilities; the following section describes how we use them.

5.1 From scores to probabilities

Each stream of our QA system attaches confidence scores to the candidate answers it produces. While these scores are intended to be comparable for answers produced by a single stream, there is no requirement that they be comparable across streams. In order to make it possible for our answer re-ranking module (described in section 6) to rank answers from different streams, we took advantage of answer patterns from previous editions of CLEF-QA to estimate the probability that an answer from a given stream with a given confidence score is correct.

For each stream, we ran the stream over the questions from the previous editions of CLEF and binned the candidate answers by confidence score into 10 equally sized bins. Then, for each bin, we used the available answer patterns to check the answers in the bin and, based on these assessments, computed the maximum likelihood estimate of the probability that an answer with a score falling in the range of the bin is correct. With these probability estimates, we can associate with a new candidate answer a correctness probability based on its confidence score.

5.2 Type checking as Bayesian update

Type checking can be seen as a way to increase the information we have about the possible correctness of an answer. We discuss in section 6.3 how we actually type-check answers; in this section, we explain how we incorporate the results of type checking into our probabilistic framework.

One natural way to incorporate new information into a probabilistic framework is Bayesian update. Given the prior probability of correctness for a candidate answer, P(correct) (in our case, the MLE corresponding to the stream confidence score), as well as the information that it is well- or ill-typed (represented as the value of the random variable well_typed), we compute P(correct | well_typed), the updated probability of correctness given well- (or ill-)typedness, as follows:

  P(correct | well_typed) = P(correct ∧ well_typed) / P(well_typed)
                          = P(correct) × P(well_typed | correct) / P(well_typed)

In other words, the updated correctness probability of a candidate answer, given the information that it is well- or ill-typed, is the product of the prior probability and the ratio P(well_typed | correct) / P(well_typed). We estimate the required probabilities by running our type-checker on assessed answers from CLEF-QA 2003, 2004, and 2005. For question types for which type-checking is actually possible, the ratio for well-typed answers is

  P(well_typed = True | correct = True) / P(well_typed = True) = 1.25,

and the ratio for ill-typed answers is

  P(well_typed = False | correct = True) / P(well_typed = False) = 0.34.
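The sketch below pulls the two estimation steps together: binning assessed answers by confidence score and applying the type-checking ratio as a multiplicative update. The data layout, bin handling, and the clipping of the product to 1.0 are simplifying assumptions for the example, not details of our implementation.

```python
# Illustrative sketch of the probability machinery in this section.
# Bin boundaries, data layout, and names are assumptions for the example;
# the real system estimates these quantities from assessed CLEF-QA answers.

def bin_probabilities(scored_answers, n_bins=10):
    """MLE of P(correct) per confidence-score bin for one stream.

    scored_answers: list of (confidence_score, is_correct) pairs from
    previous CLEF-QA editions, assessed against the answer patterns.
    """
    ranked = sorted(scored_answers)          # equally sized bins by score
    bin_size = max(1, len(ranked) // n_bins)
    bins = []
    for i in range(0, len(ranked), bin_size):
        chunk = ranked[i:i + bin_size]
        low, high = chunk[0][0], chunk[-1][0]
        p_correct = sum(ok for _, ok in chunk) / len(chunk)
        bins.append((low, high, p_correct))
    return bins

def prior_probability(bins, score):
    """Map a new confidence score to the correctness probability of its bin."""
    for low, high, p in bins:
        if low <= score <= high:
            return p
    return bins[-1][2] if score > bins[-1][1] else bins[0][2]

def typed_update(prior, well_typed, ratio_well=1.25, ratio_ill=0.34):
    """Bayesian update with the well-/ill-typedness ratios from section 5.2."""
    # Clipping to 1.0 is a practical safeguard, not discussed in the paper.
    return min(1.0, prior * (ratio_well if well_typed else ratio_ill))

bins = bin_probabilities([(0.1, False), (0.2, False), (0.5, True), (0.9, True)], n_bins=2)
print(typed_update(prior_probability(bins, 0.85), well_typed=True))
```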
6 Answer processing

The multi-stream architecture of Quartz embodies a high-recall approach to question answering—the expectation is that using a variety of methods to find a large number of candidate answers should lead to a greater chance of finding correct answers. The challenge, though, is choosing correct answers from the many candidate answers returned by the various streams. The answer processing module described in this section is responsible for this task.

6.1 High-level overview

The algorithm used by the Quartz answer processing module is given here as Algorithm 1.

Algorithm 1 High-level answer processing algorithm
1:  procedure AnswerProcessing(candidates, question, tc)
2:      clusters ← Cluster(candidates)
3:      for cluster ∈ clusters do
4:          for ans ∈ cluster do
5:              ans.well_formed, ans.well_typed, ans.string ← FormTypeCheck(ans, question)
6:          end for
7:          P(cluster) = 1 − ∏_{ans ∈ cluster} ( 1 − P(ans) × P(well_typed = ans.well_typed | ans) / P(well_typed = ans.well_typed) )
8:          cluster.rep_answer ← argmax_{ans ∈ { a | a ∈ cluster ∧ a.well_formed }} length(ans)
9:      end for
10:     ranked_clusters ← sort_P(clusters)
11: end procedure

First, candidate answers for a given question are clustered (line 2; section 6.2). Then, each cluster is evaluated in turn (lines 3–9). Each answer in the cluster is checked for well-formedness and well-typedness (line 5; section 6.3): ans.well_formed is a boolean value indicating whether some part of the original answer is well-formed (ans.string is the corresponding part of the original answer), and ans.well_typed is a boolean value indicating whether the answer is well-typed. The correctness probability of the cluster, P(cluster) (line 7), is the combination of the updated correctness probabilities of the answers in the cluster (the prior correctness probabilities P(ans) updated with type-checking information, as described in section 5.2). Note that for one of the runs we submitted for the 2006 CLEF-QA evaluation we turned off type-checking, so that P(cluster) was simply 1 − ∏_{ans} (1 − P(ans)). The longest of the well-formed answers in the cluster is then chosen as the representative answer for the cluster (line 8). Finally, the clusters are sorted according to their correctness probabilities (line 10).

6.2 Answer clustering

Answer candidates with similar or identical answer strings are merged into clusters. In previous versions of our system, this was done by repeatedly merging pairs of similar answers until no more similar pairs could be found. After each merge, the longer of the two answer strings was selected and used for further similarity computations.

This year, we moved to a graph-based clustering method. Formulating answer merging as a graph clustering problem has the advantage that it better captures the non-transitive nature of answer similarity. For example, it may be the case that the answers oorlog and Wereldoorlog should be considered similar, as well as Wereldoorlog and wereldbeeld, but not oorlog and wereldbeeld. To determine which of these answers should be clustered together, it may be necessary to take into account similarity relations with the rest of the answers.

Our clustering method operates on a matrix that contains a similarity score for each pair of answers. The similarity score is an inverse exponential function of the edit distance between the strings, normalized by the sum of the string lengths. The number of clusters is not set in advance but is determined by the algorithm. We used an existing implementation of a spectral clustering algorithm [5] to compute clusters within the similarity graph. The algorithm starts by putting all answers in a single cluster, then recursively splits clusters according to a spectral analysis of the similarity matrix. Splitting stops when any further split would produce a pair of clusters for which the normalized similarity degree exceeds a certain threshold. The granularity of the clusters can be controlled by changing the threshold value and the parameters of the similarity function.
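As an illustration of the similarity matrix that the clustering operates on, the sketch below uses one plausible reading of the similarity function (exp of minus the edit distance divided by the sum of the string lengths); the exact function and parameters used in Quartz, and the spectral clustering itself [5], are not reproduced here.

```python
# Illustrative similarity matrix for answer clustering. The exact functional
# form and parameters used by Quartz are not given in the paper; this is one
# plausible reading of "inverse exponential of the edit distance, normalized
# by the sum of the string lengths".
import math

def edit_distance(a, b):
    """Standard Levenshtein distance (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def similarity(a, b):
    return math.exp(-edit_distance(a, b) / (len(a) + len(b)))

def similarity_matrix(answers):
    return [[similarity(a, b) for b in answers] for a in answers]

answers = ["oorlog", "Wereldoorlog", "wereldbeeld"]
for row in similarity_matrix(answers):
    print(["%.2f" % s for s in row])
# "oorlog"/"Wereldoorlog" and "Wereldoorlog"/"wereldbeeld" score higher than
# "oorlog"/"wereldbeeld", illustrating the non-transitivity discussed above.
```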
6.3 Checking individual answers

The typing and well-formedness checks performed on answers depend primarily on the expected answer type of the question. Real type-checking can only be performed on questions whose expected answer type is a named entity, a date, or a numeric expression. For other questions, only basic well-formedness checks are performed. Note that the probability update ratios described in section 5.2 are only computed on the basis of answers that are type-checkable. For answers to other questions, we use the same ratio for ill-formed answers as we do for ill-typed answers (0.34), but we do not compute any update for well-formed answers (i.e., we update with a ratio of 1.0).

For answers to all questions, two basic checks are performed. If an answer consists solely of non-alphanumeric characters or is wholly contained as a substring of the question, it is immediately rejected as ill-formed and ill-typed. For answers to questions with the following answer types, only the basic well-formedness checks noted are additionally performed:

• Abbreviation, Units: the answer must be a single word with some alphabetic characters
• Expansion, Definition, Manner, Term: the answer must contain some alphabetic characters

For questions expecting named entities, dates, or numeric answers, more substantial well-formedness and well-typedness checks are performed. For numeric answers, we check that the answer consists of a number (either in digits or spelled out) with optional modifiers (e.g., ongeveer, meer dan) and units. For names and dates, we look at the justification snippet and run an NE tagger or date tagger on the snippet, as appropriate. In addition to verifying that the candidate answer is a name (for well-formedness) of the correct type (for well-typedness), we also check whether there are additional non-name words at the edges of the answer and remove them if necessary. The results of answer checking are thus boolean values for well-formedness and well-typedness, as well as the possibly edited answer string.

7 Runs

We submitted two Dutch monolingual runs. The run uams06Tnlnl used the full system with all streams and final answer selection. The run uams06Nnlnl used the full system but without type-checking.

Run          Total  Right  Unsupported  Inexact  Wrong  % Correct
uams06Tnlnl  200    40     2            4        154    20%
uams06Nnlnl  200    41     3            4        152    21%

Table 1: Assessment counts for the 200 top answers in the two Amsterdam runs submitted for Dutch monolingual Question Answering (NLNL) in CLEF-2006. In both runs, about 20% of the questions were answered correctly.

Question type          Total  Right  Unsupported  Inexact  Wrong  % Correct
factoid                116    28     1            2        85     24%
definition             39     9      0            1        29     23%
temporally restricted  32     3      1            1        27     8%
non-list               187    40     2            4        141    21%
list                   13     0      6            0        31     0%

Table 2: Assessment counts per question type for the 200 questions in the Amsterdam run with type checking (uams06Tnlnl). We have regarded all questions with time expressions as temporally restricted rather than the single one mentioned in the official results. Note that for the 13 list questions there were more than 13 answers, since list questions allowed a maximum of five answers.

8 Results

The question classifier performed as expected for coarse question classes: 86% correct, compared with a score of 85% on the training data.
For most of the classes, precision and recall scores were higher than 80%; the exceptions are miscellaneous and number. For assigning table classes, the score was much lower than on the training data: 56% compared with 80%. We did not evaluate fine-grained class assignment, because these classes are not used in the current version of the system.

Table 1 lists the assessment counts for the two University of Amsterdam runs for the question answering track of CLEF-2006. The two runs had 14 different top answers, of which four were assessed differently. In 2005, our two runs contained a large number of inexact answers (28 and 29). We are pleased that these numbers are lower in 2006 (4 and 4) and that the most frequent problem, extraneous information added to a correct answer, has almost disappeared. However, the number of correct answers dropped as well, from 88 in 2005 to 41. This is at least partly caused by the fact that the questions were more difficult this year.

Like last year, the questions could be divided into two broad categories: questions asking for lists and questions requiring a single answer. The second category can be divided into three subcategories: questions asking for factoids, questions asking for definitions, and temporally restricted questions. We have examined the uams06Tnlnl answers to the questions of the different categories in more detail (Table 2). Our system failed to generate any correct answers for the list questions. For the non-list questions, 21% of the top answers were correct. Within this group, the temporally restricted questions (8% correct) posed the biggest challenge to our system. (The official assessments mention the presence of only one temporally restricted question, which is very unlikely; we have regarded all 32 questions with time expressions as temporally restricted.) In 2005, we saw similar differences between factoid, definition, and temporally restricted questions. At that time, the difference between the last category and the first two could be explained by the presence of a significant number of incorrect answers which would have been correct in another time period. This year, no such answers were produced by our system.

More than 75% of our answers were incorrect. We examined the answers given to the first twelve questions in more detail in order to find out what the most important problems were (see appendix A). The top answer to the first question is leak in the tank of the space shuttle. It is unclear to us why the correct and more frequent apposition space shuttle was not selected as the answer. For question two, three of the top five answers are correct, but the top answer is wrong. It consists of a long clause which seems to have been identified as an apposition. The third question has been answered correctly; all top five answers are short. No correct answer was found for question four because the date in the question was wrong (it should have been 1938). Question five was answered correctly (NIL). For question six, the top answer is inexact because it only mentions a city and not the name of the concert hall. The other four answers are poorly motivated by the associated snippets. The same is true for all the answers generated for question seven. Here, the correct answer is NIL. Four of the five answers produced for question eight do not have the correct type: they should have been a time period, but apart from the top answer, they consist of arbitrary numeric expressions. Question nine is hard to answer.
We have not been able to find the required answer (one million) in the collection. Our system generated five numbers which had nothing to do with the required topic. Two of the numbers were clearly ill-formed: 000 and 000 dollar. The top answer for question ten has an incorrect type as well (a noun rather than the required organization). No answers were generated for questions eleven and twelve. The first required a search term expansion (from UN to United Nations). The NIL answer for the second is a surprise: the question phrases foreign, movie, and Oscar can all be found in the collection close to correct answers of the required type.

Based on this analysis, we have created the following list of future work:

• As the answers to the first two questions show, our system still prefers infrequent long answers over frequent short ones. This often leads to highly ranked incorrect answers. We should change the ranking part of the system and give a higher weight to frequency than to string length.

• Another way to decrease the number of incorrect long answers is to prevent them from appearing in the first place. This means changing our information extraction modules.

• One part of our architecture that is currently poorly understood is the part that generates queries. In our analysis, we found several cases of undergeneration and overgeneration of answers. We suspect that the query generation component is responsible for at least some of these cases.

• Although part of our work has focused on matching answer types with question types, our current system still has problems in this area (questions nine and ten). More work is required here.

In the last four years, we have submitted multiple runs with only minor differences between them. Given our current level of performance, it would be interesting to aim at runs that differ more from each other. This could have a positive effect on our standing in the evaluation.

9 Conclusion

We have described the fourth iteration of our system for the CLEF Question Answering Dutch monolingual track (2006). This year, our work has focused on converting all data repositories of the system (text, annotations, and tables) to XML and allowing them to be accessed via the same interface. Additionally, we have modified parts of our system which we suspected of weaknesses: question classification and answer processing.

At this point, we would like to know whether the modifications to the system have resulted in improved performance. The system's performance on this year's questions (20% correct) was much worse than in 2005 (45%). However, the 2006 questions were more difficult than those of last year, and we expect that the workshop will show that the performance of other participants has also dropped. It would be nice to be able to run last year's system on the 2006 questions, but unfortunately we do not have a copy of that system available. Applying the current system to last year's questions is something we can and should do, but the outcome of that experiment may not be completely reliable, since those questions have been used for tuning the system. At this moment, we simply do not know whether our work has resulted in a better system. One thing we know for certain: there is a lot of room for improvement. Obtaining 20% accuracy on factoid questions is still far away from the 70% scores obtained by the best systems participating in TREC.
Luckily, our 2006 CLEF-QA participation has identified key topics for future work: information extraction, query generation, answer type checking, and answer ranking. We hope that work on these topics will lead to better performance in 2007.

10 Acknowledgments

This research was supported by various grants from the Netherlands Organisation for Scientific Research (NWO). Valentin Jijkoun was supported under project numbers 220.80.001, 600.065.120 and 612.000.106. Joris van Rantwijk and David Ahn were supported under project number 612.066.302. Erik Tjong Kim Sang was supported under project number 264.70.050. Maarten de Rijke was supported by NWO under project numbers 017.001.190, 220.80.001, 264.70.050, 354.20.005, 600.065.120, 612.13.001, 612.000.106, 612.066.302, 612.069.006, 640.001.501, and 640.002.501.

References

[1] MonetDB. Website: http://www.monetdb.nl/.

[2] Wouter Alink, Valentin Jijkoun, David Ahn, Maarten de Rijke, Peter Boncz, and Arjen de Vries. Representing and querying multi-dimensional markup for question answering. In Proceedings of the 5th Workshop on NLP and XML, pages 3–9. ACL, 2006.

[3] Rich Caruana and Dayne Freitag. Greedy attribute selection. In Proceedings of the Eleventh International Conference on Machine Learning, pages 28–36, New Brunswick, NJ, USA. Morgan Kaufmann, 1994.

[4] Walter Daelemans, Jakub Zavrel, Ko van der Sloot, and Antal van den Bosch. TiMBL: Tilburg Memory Based Learner, version 5.1, Reference Guide. ILK Technical Report ILK-0402, University of Tilburg, 2004. http://ilk.uvt.nl/.

[5] Luigi Dragone. Spectral clusterer for WEKA, 2002. Webpage: http://www.luigidragone.com/datamining/spectral-clustering.html.

[6] Valentin Jijkoun and Maarten de Rijke. Retrieving answers from frequently asked questions pages on the web. In Proceedings of the Fourteenth ACM Conference on Information and Knowledge Management (CIKM 2005). ACM Press, 2005.

[7] Gertjan van Noord. At last parsing is now operational. In Proceedings of TALN 2006, Leuven, Belgium, 2006.

[8] Piek Vossen, editor. EuroWordNet: A Multilingual Database with Lexical Semantic Networks. Kluwer Academic Publishers, 1998.

A Example questions and answers

This appendix contains the top twelve questions and answers of the monolingual (NLNL) run uams06Tnlnl. The assessed score is given before each answer: Right, Unsupported, ineXact or Wrong.

1. 0001 Wat is Atlantis?
(a) W lek in een heliumtank van het ruimteveer
(b) W na zijn historische missie naar het Russische ruimtestation Mir veilig teruggekeerd op aarde
(c) W gisteren met succes vastgekoppeld aan het Russische ruimtestation Mir
(d) W gisteren om 15
(e) W een doorzichtige allegorie op het fascisme en het fascistische dedain voor het menselijke leven

2. 0002 Wat is Hubble?
(a) W genoemd naar de Amerikaanse astronoom Edwin Hubble , die deze waarde in 1929 voor het eerst bepaalde
(b) R de Hubble-ruimtetelescoop
(c) W er oorzaak van dat het heelal nog nooit zo raadselachtig is geweest als nu
(d) R een sterrenkijker die we per raket de ruimte in hebben geschoten , vanwaar hij nu plaatjes naar de
(e) R de constante

3. 0003 Wat is Lusa?
(a) R Portugese persbureau
(b) W Het kabinet
(c) W 530 kilometer
(d) W AP
(e) W hulp

4. 0004 Waarheen verhuisde Freud in 1939?
(a) W NIL

5. 0005 Waar is de olie-tanker Prestige gezonken?
(a) R NIL

6. 0006 Waar vindt de uitreiking van de Oscars plaats?
(a) W Los Angeles
(b) W Rotterdam
(c) W Nederland
(d) W Hollywood
(e) W Philadelphia

7. 0007 Hoeveel ruimtevaartuigen werden er gelanceerd in 1977?
(a) W vijfentwintigduizend
(b) W elf ton

8. 0008 Hoe lang duurt de ruimtemissie Voyager?
(a) W 1655
(b) W zeven
(c) W twee ton
(d) W drie
(e) W 5150 km

9. 0009 Hoeveel illegale immigranten zijn er nu in de EU landen?
(a) W 51 procent
(b) W 42 procent
(c) W 000
(d) W 000 dollar
(e) W twaalf

10. 0010 Welke partij heeft Karadzic opgericht?
(a) W partijen
(b) W AFP

11. 0011 Welke verklaring werd in 1948 door de VN aangenomen?
(a) W NIL

12. 0012 Welke film heeft de Oscar voor de beste buitenlandse film gekregen?
(a) W NIL