Priberam's question answering system for Portuguese

Carlos Amaral, Helena Figueira, André Martins, Afonso Mendes, Pedro Mendes, Cláudia Pinto
Priberam Informática
Av. Defensores de Chaves, 32 - 3º Esq., 1000-119 Lisboa, Portugal
Tel.: +351 21 781 72 60   Fax: +351 21 781 72 79
{cma, hgf, atm, amm, prm, cp}@priberam.pt

Abstract

This paper describes the work done by Priberam in the development of a question answering (QA) system for Portuguese. The system was built using the company's NLP workbench and information retrieval technology. Special focus is given to the question analysis, document and sentence retrieval, and answer extraction stages. The paper discusses the system's performance in the context of the QA@CLEF 2005 evaluation.

Categories and Subject Descriptors

H.3 [Information Storage and Retrieval]: H.3.1 Content Analysis and Indexing; H.3.3 Information Search and Retrieval; H.3.4 Systems and Software; H.3.7 Digital Libraries; H.2 [Database Management]: H.2.3 Languages—Query Languages

General Terms

Measurement, Performance, Experimentation, Languages

Keywords

Question answering, Questions beyond factoids

1 Introduction

The 2004 CLEF campaign introduced Portuguese as one of the working languages [1] and allowed the evaluation of two monolingual question answering (QA) systems for European Portuguese [2, 3]. In 2005, the organization added new resources, making both European Portuguese and Brazilian Portuguese available to CLEF participants. The set of target documents now comprises the collection of news published during the years 1994 and 1995 by the Portuguese newspaper Público and by the Brazilian newspaper Folha de São Paulo. The test set includes 200 questions, also in European and Brazilian Portuguese.

Our approach to this year's QA track at CLEF (QA@CLEF) relies on previous work done for the Portuguese module of TRUST – Text Retrieval Using Semantic Technologies¹ –, a European Commission co-financed project² whose aim was the development of a multilingual semantic search engine capable of processing and answering natural language questions in English, French, Italian, Polish and Portuguese [4, 5]. In the TRUST project, the system searches a set of plain text documents (either on a local hard disk or on the Web) and returns a ranked list of sentences containing the answer to a given natural language question. The goal of QA@CLEF is similar, except that the system must extract a unique exact answer from the retrieved sentences.

¹ See http://www.trustsemantics.com.
² Cooperative Research (CRAFT) project number IST-1999-56416.

The architecture of our QA system follows a standard approach. After the question is submitted, it is categorized according to our question typology and, through an internal query, a set of potentially relevant documents is retrieved. Each document contains a list of sentences which were assigned the same category as the question. Sentences are weighted according to their semantic relevance and similarity with the question. Next, through specific answer patterns, these sentences are examined again and the parts containing possible answers are extracted and weighted. Finally, a single answer is chosen among all candidates.

In the next section, we address the various tools and resources developed or used in the system's underlying natural language processing (NLP). Section 3 provides an overview of the QA engine architecture, namely the indexing process, the question analysis, the document and sentence retrieval procedures and the answer extraction.
Section 4 details the experimental results of our system in QA@CLEF, and section 5 presents our conclusions and guidelines for future work.

2 A workbench for NLP

Previous work on the development of linguistic technology for FLiP, Ferramentas para a Língua Portuguesa, Priberam's proofing tools package for Portuguese³, as well as on the construction of the Portuguese module of the already mentioned TRUST search engine, required the development of a workbench for NLP [6]. This workbench includes lexical resources, software tools, statistical information extracted from corpora, contextual rules, and other tools and resources adapted to the task of question answering.

³ FLiP includes a grammar checker, a spell checker, a thesaurus and a hyphenator that enable different proofing levels – word, sentence, paragraph and text – of European and Brazilian Portuguese. An online version is available at http://www.flip.pt.

2.1 Lexical resources

Our lexical resources include several lexical databases, such as a wide-coverage lexicon, a thesaurus and a multilingual ontology.

The lexicon comprises, for each lexical unit, information about part of speech (POS), sense definitions, semantic features, subcategorization and selection restrictions, ontological and terminological domains, English and French equivalents, and lexical-semantic relations. For the QA@CLEF monolingual task, we do not use the English and French equivalents, whose purpose is essentially to support cross-language tasks.

The thesaurus provides a set of synonyms for each lexical unit. By means of query expansion, it improves the information retrieval stage by including documents and sentences that contain synonyms of the question's keywords.

Another major lexical component of the workbench is the multilingual ontology, which groups words and expressions according to their conceptual domains. It was initially designed by Synapse Développement, the French partner of TRUST, and then converted into all the languages of the consortium⁴. The combination of the ontologies of all TRUST languages provides a bidirectional word/expression translation mechanism, with English as the intermediate language. It is thus possible to operate in a cross-language environment, allowing, for instance, answers in French to be obtained for questions formulated in Portuguese, or vice-versa. Synapse Développement carried out such an experiment and submitted a Portuguese-French run to this year's bilingual task of QA@CLEF [7], making use of Priberam's TRUST Portuguese module to analyse the test set of questions.

⁴ The ontology is designed to incorporate additional languages in future projects.

Additionally, the lexical resources include question identifiers, i.e., semantically labelled words used for question categorization. These are groups of words related to typical question domains and sub-grouped according to their POS. For instance, the label includes measuring units (with their abbreviations or symbols), as well as nouns, adjectives and verbs related to dimension, distance and measurement.

2.2 Software tools

The lexical resources just described interact with software tools that we have implemented, such as the SintaGest program. Priberam's SintaGest is an interactive tool for building and testing a grammar for any language; it was successfully used by the company's linguistic and programming teams to develop European and Brazilian Portuguese grammars.
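As a rough illustration of subsection 2.1, the sketch below shows one way a lexical entry and thesaurus-driven query expansion could be modelled; the class, the toy lexicon and its contents are invented placeholders, not Priberam's internal formats.

```python
# Minimal sketch (illustrative only, not Priberam's internal format) of a lexical
# entry as described in subsection 2.1 and of thesaurus-based query expansion:
# retrieval can then match documents containing a synonym of a question keyword.
from dataclasses import dataclass, field
from typing import List

@dataclass
class LexicalEntry:
    lemma: str
    pos: str                                            # part of speech
    domains: List[str] = field(default_factory=list)    # ontological/terminological domains
    synonyms: List[str] = field(default_factory=list)   # thesaurus synonyms

# Toy lexicon with invented content.
LEXICON = {
    "presidente": LexicalEntry("presidente", "N",
                               domains=["politics"],
                               synonyms=["chefe de estado"]),
}

def expand_keywords(keywords):
    """Return the keywords plus their thesaurus synonyms (query expansion)."""
    expanded = set(keywords)
    for word in keywords:
        entry = LEXICON.get(word)
        if entry:
            expanded.update(entry.synonyms)
    return expanded

print(expand_keywords(["presidente", "Portugal"]))
# e.g. {'presidente', 'chefe de estado', 'Portugal'}
```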
SintaGest provides a practical way to code transformation rules for morphological disambiguation and named entity recognition, as well as production rules to build a context-free grammar (CFG). In addition, it allows QA-related tasks to be performed, such as writing patterns to categorize questions and extract answers. After being tested, these rules are compiled into compressed and optimized low-level information. Furthermore, SintaGest can also run in batch mode over a corpus, to test the grammar, generate reports, extract collocations and named entities, collect statistical information, etc. For a detailed description of some of these SintaGest features, see again [6].

Along with SintaGest, several modules have been developed to perform more specific tasks. One such task is morphological disambiguation. It is done in two stages: first, the contextual rules defined in SintaGest are applied; then, the remaining ambiguities are resolved with a statistical POS tagger based on a second-order hidden Markov model (HMM). This turns out to be a fast and efficient approach using the Viterbi algorithm [8, 9]. The prior contextual and lexical probabilities were estimated by processing large, partially tagged corpora, among them the CETEMPúblico 1.7 collection of news from the Portuguese newspaper Público⁵. Lexical probabilities are encoded for each lemma, rather than for each word. To achieve this, we calculated, for each lemma, its frequency and the relative frequency of its inflections. Lemmas with similar distributions over their inflections are then grouped into a smaller number of classes. Clustering techniques based on competitive learning [10] are used to choose the number of classes, group the lemmas and characterize each class. Working with these clusters is advantageous because, on the one hand, we can extend the behaviour to words that are less frequent in our corpora and, on the other hand, we can compress the information needed at runtime.

⁵ Available at http://acdc.linguateca.pt/cetempublico.

2.3 Contextual rules

As said above, SintaGest provides a way to build contextual rules for performing morphological disambiguation, named entity (NE) recognition, etc. An editor allows these rules to be written, compiled and tested. Once validated, they are used in our QA system at runtime.

NEs appear frequently both in questions and in the texts to index. They can be proper nouns of organizations, places, event dates, etc. Besides NEs, some expressions (e.g. nominal, adjectival, verbal and adverbial phrases, including the dates in temporally restricted questions) are frequent and idiomatic enough to justify handling them as if they were single tokens.

The NE recognizer is capable of detecting and tagging a large number of NEs. The tagger tries to find a sequence of proper nouns, recognizes it as a single token and classifies the NE thus created syntactically and semantically, namely by inheriting the features of its head (e.g. Luís Vaz de Camões will be classified as an anthroponym, since Luís is classified in the lexicon as such). It also uses groups of conceptually gathered words that help in the classification of NEs: for instance, a sequence of proper nouns preceded by a common noun such as rio [river] will be classified as a toponym (e.g. rio de São Domingos [São Domingos river]). For semantic disambiguation purposes, the NE recognizer also considers the context, checking which words precede or follow the NE.
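The following fragment gives a rough, simplified illustration of the NE heuristics just described; the word lists, the merging rule and the classification labels are toy assumptions, not the actual lexicon, tagset or rule formalism.

```python
# Rough sketch (illustrative only) of the NE heuristics of subsection 2.3: a run
# of proper nouns, possibly joined by "de"/"da"/..., is collapsed into a single
# token and classified either through the lexical features of its head
# (Luís -> anthroponym) or through a preceding cue noun (rio -> toponym).
FIRST_NAMES = {"Luís", "Jorge"}          # heads marked as anthroponyms in the lexicon
PLACE_CUES = {"rio", "cidade", "serra"}  # common nouns that introduce toponyms
LINKERS = {"de", "da", "do", "das", "dos"}

def classify(ne_tokens, left_context):
    """Assign a coarse semantic class to a recognized NE."""
    content = [w for w in left_context if w not in LINKERS]
    if content and content[-1] in PLACE_CUES:
        return "toponym"                 # e.g. "(rio de) São Domingos"
    if ne_tokens[0] in FIRST_NAMES:
        return "anthroponym"             # e.g. "Luís Vaz de Camões"
    return "named_entity"

def recognize_nes(tokens):
    """Merge runs of capitalized words into single NE tokens and classify them."""
    out, i = [], 0
    while i < len(tokens):
        if tokens[i][0].isupper():
            j = i + 1
            while j < len(tokens) and (tokens[j][0].isupper() or tokens[j] in LINKERS):
                j += 1
            while tokens[j - 1] in LINKERS:          # an NE cannot end on a linker
                j -= 1
            out.append((" ".join(tokens[i:j]),
                        classify(tokens[i:j], tokens[max(0, i - 2):i])))
            i = j
        else:
            out.append((tokens[i], "word"))
            i += 1
    return out

print(recognize_nes("o rio de São Domingos".split()))
print(recognize_nes("Luís Vaz de Camões escreveu".split()))
```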
2.4 Question categorization

Classifying questions into categories is a key task during question analysis, since it allows filtering out unrelated documents and applying more finely tuned extraction rules to the candidate sentences. To address this, we use a set of 86 question categories previously defined for TRUST by Synapse Développement. Table 1 illustrates some of the categories currently used in our QA system.

Category   Example
           "Nomeie um cetáceo." [Name a cetacean.]
           "Em que dia foi inaugurada a Torre Eiffel?" [On what day was the Eiffel Tower inaugurated?]
           "Em que cidade fica o campo de concentração de Auschwitz?" [In what city is the Auschwitz concentration camp located?]
           "Quando nasceu a ovelha Dolly?" [When was sheep Dolly born?]
           "Quem é Jorge Sampaio?" [Who is Jorge Sampaio?]

Table 1: Examples of question categories.

Once the categories are defined, a way must be provided to categorize, i.e., to automatically assign one or more categories to a given question. Common approaches involve writing simple patterns, using for instance regular expressions [11], optionally complemented with rules obtained through some sort of supervised learning over a large training set [12]. We discarded learning-based methods, since the training set would have to be large enough to offer an adequate coverage of all the categories, which in our case are numerous. As for methods based on regular expressions, they have the disadvantage of being too focused on string patterns, discarding other useful features, and thus leading to a relatively small coverage of question instances.

Our approach tries to overcome these limitations by using patterns that are much more powerful than regular expressions. As with contextual rules, SintaGest provides the interface for writing, testing and compiling such patterns. They were tested and validated against real-world questions from the CLEF Multieight-04 Corpus of 700 questions and manually retrieved answers⁶ [13]. Each pattern is a sequence of 'terms' with the possible types listed in Table 2 (the prefix Any is used to build disjunctive terms). Terms may be conjugated (e.g. Word(casa) & Cat(N) means that the current word should be the common noun casa [house], and not a form of the verb casar [to marry]). Besides, a term may be optional (e.g. Word(casa)? means that the presence of the word casa in the current position is optional), and distances between terms may be defined (e.g. Word(quem) Distance(1,3) Word(presidente) means that between the words quem [who] and presidente [president] there can be a minimum of 1 and a maximum of 3 words).

⁶ Available at http://clef-qa.itc.it/2005.

Patterns built with these terms are used not only to categorize questions, but also to categorize general sentences and even to extract answers. There are in fact three kinds of patterns (a simplified sketch of how such patterns can be matched and scored is given after this list):

• Question patterns (QPs) are used to assign categories to questions. More than one category per question is allowed, thus avoiding difficulties in choosing the single most suitable category.
• Answer patterns (APs) are used to assign categories to a general sentence during the indexation stage, meaning that the sentence contains possible answers for questions with those categories. Again, more than one category per sentence is allowed.
• Question answering patterns (QAPs) are used to extract a possible answer for a specific question.
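The following sketch illustrates, in a simplified and purely hypothetical form, how term-based patterns of this kind might be matched and scored; the Term class, its fields and the helper predicates are assumptions for the example, anticipating the term types of Table 2 and the scoring conventions explained with Table 3 below.

```python
# Simplified sketch (illustrative only, not SintaGest's actual formalism) of how a
# sequence of terms could be matched and scored against a tokenized question or
# sentence: each pattern carries a base score, verified optional terms add 3 units,
# and a Distance(min,max) gap subtracts the difference between the actual gap and
# the minimum. Capturing (curly-bracketed) terms mark the extracted answer words.
from dataclasses import dataclass
from typing import Callable, Tuple

@dataclass
class Term:
    test: Callable[[str], bool]      # stands in for Word(...), Root(...), Cat(...)
    optional: bool = False
    capture: bool = False            # curly-bracketed term: part of the answer
    gap: Tuple[int, int] = (0, 0)    # Distance(min,max) allowed before this term

def match(terms, base_score, tokens, start=0):
    """Try to match the pattern at `start`; return (score, extracted answer) or None."""
    score, answer, pos = base_score, [], start
    for term in terms:
        lo, hi = term.gap
        hit = None
        for offset in range(lo, hi + 1):             # try every allowed gap
            i = pos + offset
            if i < len(tokens) and term.test(tokens[i]):
                hit = (i, offset)
                break
        if hit is None:
            if term.optional:
                continue                             # unverified optional term: skip it
            return None                              # obligatory term missing
        i, offset = hit
        if term.optional:
            score += 3                               # reward for a verified optional term
        score -= offset - lo                         # distance penalty
        if term.capture:
            answer.append(tokens[i])
        pos = i + 1
    return score, " ".join(answer)

# Toy pattern loosely inspired by the first QP of Table 3:
#   Word(quem) Distance(0,3) Root(ser) AnyCat(Nprop, ENT) = 15
quem = Term(lambda t: t.lower() == "quem")
ser = Term(lambda t: t.lower() in {"é", "foi", "ser"}, gap=(0, 3))
entity = Term(lambda t: t[0].isupper(), capture=True)   # a real term would match a full NE token
print(match([quem, ser, entity], 15, "Quem é Jorge Sampaio ?".split()))
# -> (15, 'Jorge')
```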
Word – literal word or expression
  Word(presidente): current word or expression is presidente [president]
  AnyWord(presidente, chefe de estado): current word or expression is either presidente [president] or chefe de estado [head of state]

Root – lemma
  Root(presidente): current lemma is presidente [president]
  AnyRoot(presidente, chefe de estado, primeiro-ministro): current lemma is either presidente [president], chefe de estado [head of state] or primeiro-ministro [prime minister]

Cat – POS tag with optional features
  Cat(N(MASC,SING,,,,ERG)): current POS is a common noun, masculine, singular and an ergonym⁷
  AnyCat(N, Nprop, ENT): current POS is either a common or proper noun or a named entity

Ont – ontology entry
  Ont(100.3): current word or expression is part of the ontology level 100.3 (Colours/grey)
  AnyOnt(100.1, 100.2, 100.3): current word or expression is part of the ontology levels 100.1 (Colours/white), 100.2 (Colours/black) or 100.3 (Colours/grey)

QuestIdent – question identifier
  QuestIdent(FUNCTION N): current word is a noun/noun phrase question identifier for the FUNCTION category
  AnyQuestIdent(FUNCTION N, FUNCTION ADJ): current word is either a noun/noun phrase or adjective/adjective phrase question identifier for the FUNCTION category

Const – constant previously defined
  Ergonym: current word is the beginning of a phrase with an ergonym
  AnyConst(Ergonym, NounPhrase): current word is the beginning of a phrase with an ergonym or a noun phrase

Table 2: Types of terms used in patterns.

⁷ The word ergonym (from Greek ergon 'work' and onoma 'name') designates here a person's profession, job, function, post, etc.

When a question is posed and it matches a QP, a category is assigned to the question and a set of QAPs becomes active. Then, documents containing sentences with categories in common with the question (previously determined during indexation via the APs) are analysed; the active QAPs are applied to each such sentence in order to extract the possible answers.

Table 3 shows examples of QPs, APs and QAPs for the FUNCTION category. There are two blocks of code: the question answer block is checked when the QA system is answering questions, while the answer block is only used for indexation of the document collection. Each pattern has a score, following the = sign, to establish a priority and to reflect how likely it is to match a true sentence. Of course, these scores must be heuristically adjusted in order to give preference to more reliable and more specific patterns. Optional terms that are verified are rewarded by adding 3 units to the pattern score. Distances penalize the pattern score by subtracting as many units as the difference to the specified minimum distance. The With command between terms means that the second term must be verified somewhere inside the first term, usually a constant that defines a phrase. Finally, notice that QAPs include an extra term, named Pivot, to signal keywords that are present both in the question and in the matched sentence (see subsection 3.2 for details), as well as a sequence of terms delimited by curly brackets, to signal the words that are to be extracted as a possible answer.

// Example of a question answer block encoding QPs and QAPs:
Question (FUNCTION)
  : Word(quem) Distance(0,3) Root(ser) AnyCat(Nprop, ENT) = 15
    // e.g. "Quem é Jorge Sampaio?"
  : Word(que) QuestIdent(FUNCTION N) Distance(0,3) QuestIdent(FUNCTION V) = 15
    // e.g. "Que cargo desempenha Jorge Sampaio?"
Answer
  : Pivot & AnyCat(Nprop, ENT) Root(ser) Definition With Ergonym? = 20
    // e.g. "Jorge Sampaio é o {Presidente da República}..."
  : {NounPhrase With Ergonym?} AnyCat(Trav, Vg) Pivot & AnyCat(Nprop, ENT) = 15
    // e.g. "O {presidente da República}, Jorge Sampaio..."
;

// Example of an answer block encoding APs:
Answer (FUNCTION)
  : QuestIdent(FUNCTION N) = 10
  : Ergonym = 10
;

Table 3: Examples of patterns.

Work is currently under way to add new features to these patterns. One of the features being developed is a new type of term for syntactic phrases, more powerful than the current Const term. This feature is essential for the improvement of question categories like , , or , which require general syntactic patterns for extracting possible answers, specifically from adverbial subordinate clauses. Other features involve enhancing the QP syntax to encode a measure of importance for the question pivots and to embed sense disambiguation rules. This would make it possible to perform word sense disambiguation during question analysis and thus to select a stricter set of relevant ontology levels and synonyms to be searched during the document retrieval stage.

3 System description

The architecture of our QA system is fairly standard. It involves five major tasks, described in this section: (i) the indexing process, (ii) the question analysis, (iii) the document retrieval, (iv) the sentence retrieval, and (v) the answer extraction.

3.1 Indexing process

Indexation is an off-line procedure by which a set of target documents is parsed in order to collect information in index files. Previous work on this subject was done during the development of LegiX, Priberam's juridical information system⁸. The indexing engine of LegiX was adapted to index semantic information, ontology domains, question categories and other specificities for QA.

⁸ For more information about LegiX, see http://www.legix.pt.

In the case of the Portuguese target collection of QA@CLEF, a total of 210,734 documents were indexed. For each document, we collected its most relevant ontological and terminological domains and, for each sentence, the question categories for which it contains possible answers, determined through the APs referred to in subsection 2.4. After applying morphological disambiguation (see the last paragraph of subsection 2.2 for a description of how it is done), we collect, as key elements for indexation, the words of each sentence that are not considered stop words. Each word is represented by a unique triple {lemma, head of derivation, POS}. Special words such as numbers, dates, fixed expressions, NEs and proper nouns are flagged. Multiword expressions (e.g. NEs) are indexed both as a whole and as each of the words that compose them.

Unlike the system used in the TRUST project, here we chose not to perform word sense disambiguation (WSD). We justify this decision with the following reasons: (i) our current WSD is still at an early stage and performs poorly, and (ii) performing automatic WSD during question analysis is inherently difficult. Indeed, in TRUST the user performs the disambiguation manually at this stage by selecting the appropriate sense of each word of the question. As stated in the last paragraph of subsection 2.4, we intend to develop a scheme to embed WSD in the QPs, since these patterns usually reduce the context scope, making the task less difficult to achieve.

For performance reasons, each word in the index is stored with a reference not only to the target documents in which it occurs, but also to the sentence indices inside each document.
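As a concrete, though hypothetical, illustration of this indexing scheme, the fragment below sketches one possible shape for the index structures; the names and the toy data are assumptions, not the actual LegiX format.

```python
# Illustrative sketch (not the actual LegiX index format) of the structures
# described in subsection 3.1: words are keyed by the triple
# {lemma, head of derivation, POS}; postings record the documents and the
# sentence indices where the key occurs; sentences carry the question
# categories assigned to them by the APs at indexing time.
from collections import defaultdict

# (lemma, head_of_derivation, POS) -> {doc_id: set of sentence indices}
index = defaultdict(lambda: defaultdict(set))
# (doc_id, sentence_index) -> set of question categories (from the APs)
sentence_categories = defaultdict(set)
# doc_id -> most relevant ontological/terminological domains
document_domains = defaultdict(set)

def index_sentence(doc_id, sent_idx, tokens, categories, domains):
    """tokens: iterable of (lemma, head, pos) triples for the non stop words."""
    for lemma, head, pos in tokens:
        index[(lemma, head, pos)][doc_id].add(sent_idx)
    sentence_categories[(doc_id, sent_idx)].update(categories)
    document_domains[doc_id].update(domains)

# Toy usage with invented data:
index_sentence("publico-1995-001", 3,
               [("Jorge Sampaio", "Jorge Sampaio", "ENT"),
                ("presidente", "presidir", "N")],
               categories={"FUNCTION"}, domains={"politics"})
```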
Storing these sentence indices accelerates the document retrieval stage, as we describe in subsection 3.3.

3.2 Question analysis

Since indexation is performed off-line, the question analyser is in practice the first module of our system. It receives as input a natural language question q submitted by the user, which is first lemmatized and morphologically disambiguated (see subsection 2.2). The next step consists in interpreting it. Like the majority of approaches, we start with categorization. In fact, results clearly show that determining the domain of the question and characterizing the desired format of the answer is an essential step in QA systems. However, approaches diverge on the number, structure (flat or hierarchical) and choice of the categories (see [14, 15] for interesting discussions on this matter). As described in subsection 2.4, we use 86 categories in a flat structure and build powerful QPs to categorize the questions, instead of the commonly used patterns based on regular expressions.

When this categorization stage ends, the following information has been gathered: (i) one or more question categories, {c_1, c_2, ..., c_m}; (ii) a list of active QAPs (see subsection 2.4) to be applied later during answer extraction (see subsection 3.5); and (iii) a score σ^QP for each question pattern that matched the question.

We next proceed to the extraction of pivots. Pivots are the key elements of the question, and they can be words, expressions, NEs, phrases, numbers, dates, abbreviations, etc. For each pivot, we collect the word or words that make up the pivot itself, its lemma w_L, its head of derivation w_H, its POS, its synonyms w_S^1, ..., w_S^n provided by the thesaurus (subsection 2.1), and flags indicating whether it is a special word. Together with the above mentioned question categories, the relevant ontological and terminological domains of the question, {o_1, o_2, ..., o_p}, are also collected. This data then feeds the document retrieval module, described in the next subsection.

3.3 Document retrieval

After analysing the question, we submit a query to the index files using as search keys the pivot lemmas, their heads of derivation, their synonyms, the ontological domains and the question categories.

Let w_L^i, w_H^i and w_S^{i,j} denote respectively the i-th pivot lemma, its head of derivation and its j-th synonym. Each of these synonyms has a weight ρ(w_S^{i,j}, w_L^i) that reflects its semantic proximity to the original pivot lemma w_L^i. In the following, we denote by c_i and o_i the i-th possible category for the posed question and the i-th relevant ontological or terminological domain, respectively. For each word, we calculate a weight α(w) given by:

α(w) = α_POS(w) + K_ilf · ilf(w) + K_idf · idf(w)   (1)

In (1), α_POS reflects the influence of the POS on the pivot's relevance. For instance, since we consider that pivots that are NEs are generally more important than common nouns, and these more important than adjectives or verbs, we have the chain α_POS(NE) ≥ α_POS(N) ≥ α_POS(ADJ) ≥ α_POS(V). Of course, these are general assumptions: there are many questions where a verb is more relevant than an adjective, although the opposite situation is slightly more frequent (for example, in a question like "Como se chama o primeiro presidente americano?" [What is the name of the first American president?] the adjectives primeiro and americano are much more important than the verb chamar).
As briefly stated in the last paragraph of subsection 2.4, we intend to introduce here a new parameter to express the importance of each pivot, eventually taking into account the syntactic parsing of the question.

Still in (1), K_ilf and K_idf are fixed interpolation parameters, while ilf and idf denote respectively the inverse lexical frequency – that is, the logarithm of the inverse of the word's relative frequency in the corpus – and the commonly used inverse document frequency (see [16] for an explanation). We opted not to include a tf term for the word frequency in the document, because of the relatively small size of each document.

Consider now the document collection. Let d be a particular document and define δ_L(d, w_L) = 1 if d contains the lemma w_L and 0 otherwise. Moreover, define δ_H(d, w_H) in the same way for the head of derivation w_H, and δ_C(d, c) and δ_O(d, o) analogously for the question category c and the ontological domain o. We calculate the document score σ^d as:

σ^d = Σ_i max{ K_L δ_L(d, w_L^i) α(w_L^i), K_H δ_H(d, w_H^i) α(w_H^i), max_j K_S δ_L(d, w_S^{i,j}) α(w_S^{i,j}) ρ(w_S^{i,j}, w_L^i) }
      + K_C max_i δ_C(d, c_i) + K_O max_i δ_O(d, o_i),   (2)

where K_L, K_H, K_S, K_C and K_O are fixed scaling constants, with K_L > K_H > K_S so that matches of lemmas are rewarded more strongly than those of heads of derivation and synonyms. The score in (2) is then fine-tuned to take into account the pivot proximity in the documents, rewarding those in which the pivots occur in sentences close together. At the end, the top 30 documents are retrieved to be further analysed at sentence level. In order to avoid analysing the whole text, each document carries the list of indices of the sentences where the above pivot matches occurred.

3.4 Sentence retrieval

This module receives as input a set of documents in which the sentences that match the pivots are marked. Our engine allows analysing not only these sentences, but also the k sentences before and after, where k is configurable. However, making use of this feature could make processing at this stage too heavy, especially when many documents with many marked sentences are retrieved. Besides, to take full advantage of it, additional techniques would be required to find connections among close sentences, for instance through anaphora resolution. Hence, for now we simply set k = 0.

Let s be a particular sentence to be analysed at this stage. After parsing s, we calculate a score σ^s taking into account:

• The number of pivots matching s;
• The number of pivots having in common the lemma or the head of derivation with some token in s;
• The number of pivot synonyms matching s;
• The order and proximity of the pivots in s;
• The existence of question categories common to q and s;
• The number of ontological and terminological domains characterizing q which are also present in s;
• The score σ^d of the document d that contains s.

Here, partial matches are also considered: for instance, if only one word of a given NE is found in the sentence (e.g. Fidel of the anthroponym Fidel Castro), it will contribute with a lower weight than a complete match. To save effort in the subsequent answer extraction module, sentences s that score below a fixed threshold, or in which the total number of matches (either complete or partial) is lower than a fixed fraction of the total number of pivots, are immediately discarded. The remaining sentences and their scores are passed as output to the next module.
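Looking back at subsection 3.3, the fragment below gives a schematic rendering of the scores in equations (1) and (2); the constants, the data structures and the simplification noted in the comments are placeholder assumptions rather than the system's actual implementation.

```python
# Schematic rendering (with placeholder constants and toy data structures) of the
# document scores of subsection 3.3, i.e. equations (1) and (2). It omits the
# final fine-tuning based on pivot proximity inside the documents.
import math

ALPHA_POS = {"NE": 4.0, "N": 3.0, "ADJ": 2.0, "V": 1.5}   # alpha_POS(NE) >= alpha_POS(N) >= ...
K_ILF, K_IDF = 1.0, 1.0
K_L, K_H, K_S = 3.0, 2.0, 1.0                             # K_L > K_H > K_S
K_C, K_O = 1.5, 1.0

def alpha(pos, rel_lexical_freq, doc_freq, n_docs):
    """Equation (1): alpha(w) = alpha_POS(w) + K_ilf * ilf(w) + K_idf * idf(w)."""
    ilf = math.log(1.0 / rel_lexical_freq)    # inverse lexical frequency
    idf = math.log(n_docs / doc_freq)         # inverse document frequency
    return ALPHA_POS.get(pos, 1.0) + K_ILF * ilf + K_IDF * idf

def document_score(doc, pivots, categories, domains, n_docs):
    """Equation (2): for each pivot, keep the best of the lemma, head-of-derivation
    and synonym matches; then add the category and ontology-domain bonuses."""
    total = 0.0
    for p in pivots:   # p: dict with lemma, head, pos, synonyms [(syn, rho)], lf, df
        a = alpha(p["pos"], p["lf"], p["df"], n_docs)     # alpha of the pivot
        candidates = [K_L * a if p["lemma"] in doc["lemmas"] else 0.0,
                      K_H * a if p["head"] in doc["heads"] else 0.0]
        # For simplicity the pivot's alpha is reused for its synonyms here;
        # equation (2) uses each synonym's own alpha.
        candidates += [K_S * a * rho
                       for syn, rho in p["synonyms"] if syn in doc["lemmas"]]
        total += max(candidates)
    total += K_C * (1.0 if any(c in doc["categories"] for c in categories) else 0.0)
    total += K_O * (1.0 if any(o in doc["domains"] for o in domains) else 0.0)
    return total
```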
3.5 Answer extraction

The answer extractor receives as input a set {s, σ^s} of scored sentences presumably containing answers. Each of these sentences is then tested against the QAPs that were activated during the question analysis stage (see subsection 3.2). Notice that these QAPs are directly linked to the QP that matched the question (see Table 3). As said in subsection 2.4, each QAP includes information on what part of the sentence (if any) is to be extracted as a possible answer; it also has a score reflecting the relevance of the QAP and the pertinence of the foreseen answer.

Let us suppose that a particular sentence s matches a specific QAP. The curly-bracketed terms in the QAP extract one or more candidate answers from s (notice that a single pattern can match s in several different ways). When all the active QAPs have been applied, we are left with zero or more possible answers extracted from s. Answers that are substrings of others are discarded, unless they have a higher score: this tends to privilege longer answers. In specific cases, the opposite behaviour can be forced by properly setting the scores. Answers containing question pivots are not allowed, unless they are part of NEs (e.g. Deng Nan is allowed as an answer to the question "Como se chama a filha de Deng Xiao Ping?" [What is the name of Deng Xiao Ping's daughter?], while filha is not).

Suppose that a sentence s matches some QAP with score σ^QAP, linked with a QP with score σ^QP, such that a is extracted from s and becomes a candidate answer. In this scenario, a will be assigned the following score σ^a:

σ^a = K_s σ^s + K_QP σ^QP + K_QAP σ^QAP + Σ σ^rew − Σ σ^pen   (3)

In (3), K_s, K_QP and K_QAP are interpolating constants and Σ σ^rew − Σ σ^pen is the total amount of rewards minus the total amount of penalties applied when processing the QAP. These rewards and penalties are small quantities usually due to optional terms and variable distances in the QAP (see subsection 2.4 for a further explanation).

The last step consists in analysing all the answer candidates {a, σ^a}, if any, and choosing the best one as the final answer; if none has been found, "NIL" is returned. To accomplish this, the answer scores {σ^a} are first adjusted with additional rewards that take into account the repetition of the words of each answer across the collection of answer candidates extracted from sentences scored above a fixed threshold; this threshold avoids rewarding the repetition of erroneous answers. In the end, the system outputs the answer with the highest score, â = arg max_a σ^a, or "NIL" if none is available. Currently, no confidence score is measured and no further verification is made to check whether â really answers the posed question q. This is something to be done in the future.

4 Results

The test set of 200 questions run in the monolingual task covered mainly factoid questions (158 in all) and a few (42) definition questions. Table 4 presents the scores of the submitted run.

Question type                        Right  Wrong  Inexact  Unsup.  Total  Accuracy (%)
Factoid (F)                             91     38        5       1    135          67.4
Definition (D)                          27      7        8       0     42          64.2
Temporally restricted factoid (T)       11     10        0       2     23          47.8
Total                                  129     55       13       3    200          64.5

Table 4: Results by type of question.

The F-question and D-question statistics add up to a satisfactory overall accuracy of the system, whose performance is comparable to that of the best scoring systems in recent evaluation campaigns [17]. Several reasons contribute to the lower accuracy of T-questions.
Firstly, we do not index dates differently from other keywords. For instance, 25 de Abril de 1974 and 25/4/1974 are currently indexed as different terms and, as a result, they cannot match during the sentence retrieval and answer extraction stages. Because of this limitation, we do not force the date to be present in the sentence from which we extract the answer. This, in turn, leads to inexact answers that would be correct if the question were not temporally restricted. We also do not take into account the document dates when answering T-questions. For instance, for question 3, "Quantos capacetes azuis holandeses havia em Srebrenica, na Bósnia, em Julho de 1995?" [How many Dutch blue helmets were there in Srebrenica, Bosnia, in July 1995?], we returned, during document retrieval, a few documents dated before July 1995, which could not contain a valid and supported answer. These aspects will be addressed in the near future: we plan to index dates (and numbers) in a proper numeric format, to use document dates to filter out obsolete documents, and to convert relative temporal references (like ontem [yesterday]) into absolute ones.

Table 5 summarizes the main tasks where the system failed.

Task                               F-quest.  D-quest.  T-quest.  Total  Failure (%)
Document retrieval                        6         2         1      9          4.5
Extraction of candidate answers          18         7         8     33         16.5
Choice of the final answer                7         5         1     13          6.5
NIL validation                           12         1         3     16          8.0
Total (W+X+U)                            43        15        13     71         35.5

Table 5: Reasons for wrong (W), inexact (X) and unsupported (U) answers.

The system's major flaw lies in the extraction of candidate answers: when it fails, the extraction patterns are either too lenient, causing overextraction (e.g. origem da FAO instead of FAO was the selected answer to question 84, "Como se chama a Organização para a Alimentação e Agricultura das Nações Unidas?" [What is the Food and Agriculture Organization of the United Nations called?]), or too strict, causing underextraction (e.g. porta-voz instead of porta-voz do papa João Paulo 2º was the answer to question 166, "Quem é Joaquín Navarro-Valls?" [Who is Joaquín Navarro-Valls?]). Additionally, the system is not yet able to cope with questions that should return a list or coordinated terms, such as question 22, "Que dois cientistas descobriram as proteínas G?" [Which two scientists discovered G proteins?], which asks for coordinated terms while the extraction only allows a single answer to be retrieved. Anaphora resolution and setting the value of k to 1 instead of 0 (see subsection 3.4), in order to check for the answer in neighbouring sentences, could improve the system's performance on questions like question 17, "Quantas pessoas vivem nas ilhas Aaland?" [How many people live in the Aaland islands?].

The second major flaw is the way the system handled NIL questions: of the 18 questions that should have returned a NIL string, the system answered only 2 correctly. NIL recall is quite low (11%) because we do not actually measure a confidence score to decide whether an answer is good enough. Frequently, the answer sentence matches only one pivot of the question, which is sometimes too weak a match. On the other hand, we do not demand exclusivity for some question categories. For example, a question beginning with "Qual é a altura de..." [What is the height of...] should not have another category of question besides , which demands a numeric answer with an appropriate measure unit.
Nevertheless, performing NIL validation may lead to discarding correct but somewhat weakly supported answers; a compromise on strictness is needed in the implementation of such an algorithm.

The third flaw has to do with the choice of the final answer, i.e., with the algorithm that calculates the final scores of the candidate answers (see subsection 3.5). Occasionally, the correct answer is ranked in second position, right after the wrong answer that was chosen (e.g. companhia aérea belga, the correct answer to question 21, "O que é a Sabena?" [What is Sabena?], followed the selected answer Swissair). Less frequently, the system had to choose between answers with equal scores (e.g. presidente and presidente filipino had exactly the same score, but it was the first, inexact, one that was selected as the answer to question 165, "Quem é Fidel Ramos?" [Who is Fidel Ramos?]).

The last flaw reveals that the system sometimes misses the document containing the answer during the document retrieval stage. Because that document will never be analysed, this failure is unrecoverable. It is a rare source of error, though, as the statistics of Table 5 show. One instance of this problem happened with question 85, "Diga o nome de um assassino em série americano." [Name an American serial killer.]. During document retrieval, the system was not able to establish a relation between americano [American] and EUA [USA]. Therefore, it did not retrieve the document containing the sentence with the correct answer (John Wayne Gacy): "Estava marcada para hoje em Chicago à 0h01 local (2h01 em Brasília) a execução de John Wayne Gacy, maior assassino em série da história dos EUA."

Another instance of this problem occurred with question 30, "Que percentagem de crianças não tem comida suficiente no Iraque?" [What percentage of children does not have enough food in Iraq?]. Here, the system did not retrieve the sentence containing the answer (entre 22 e 30 por cento): "Os salários não têm acompanhado a subida da inflação e as agências humanitárias advertiram que entre 22 e 30 por cento das crianças iraquianas estão gravemente mal nutridas." In this case, the query expansion allowed by the indexation of the heads of derivation enabled the use of the gentilic information of the entries (inhabitant/country) to relate iraquianas [Iraqis] to Iraque [Iraq], but was not able to establish a synonymic relation between não tem comida suficiente [does not have enough food] and mal nutridas [badly nourished]. One way to obviate this is to increase the factor K_O in (2), which compares the ontology domains of the question with those of the documents. In this particular case, we can see that the question and the answer sentence share a common domain: the words comida (question) and nutridas (answer) are grouped under the same level, metabolism/nutrition. This ontological information seems to be very helpful; however, since we use a low value for K_O, we do not take full advantage of it yet.

Consider now the run scores according to the kind of information the questions ask for, as shown in Table 6.

Answer type    Right  Wrong  Inexact  Unsup.  Total  Accuracy (%)
Location          28      6        0       1     35          80.0
Measure           11      7        0       0     18          61.1
Organization      19     14        5       0     38          50.0
Other             12      8        1       0     21          57.1
Person            45     19        7       2     73          61.6
Time              14      1        0       0     15          93.3

Table 6: Results by CLEF answer type.

Crossing these CLEF answer types with our question categories, we found that the best results were achieved by the categories , and (type Time of Table 6) and , , (type Location of Table 6).
Interestingly, the answer type Person congregates two of our question categories, namely and . These two separate categories allow a more fine-grained search, since the category retrieves answers with names of professions (ergonyms) or NEs like President of the United States of America, while the category retrieves answers mainly with proper nouns.

Finally, a special note on question 83, "Quem é Iqbal Masih." [Who is Iqbal Masih?]. This is a tricky question: it looks for a definition, and the system retrieved o rapazinho da foto [the little boy in the photo]. Could that be considered a definition? What the user considers a definition may vary according to his/her information background. This answer was extracted from a standard apposition structure; however, in terms of the meaning conveyed to the user, it may not be considered responsive enough. If the answer had been extracted by a system that allows the user to visualize the document containing the answer, as is the case of the TRUST search engine, then the answer o rapazinho da foto could be satisfactory. However, since the answer was extracted by a system that does not allow the visualization of the document, it is not of great utility to the user. In this case, the system should have returned other (not easily extractable) answers from the same sentence, such as quase-escravo [almost slave], the more descriptive peregrino pelo mundo em defesa de seis milhões de crianças que no Paquistão são exploradas por negociantes sem escrúpulos [pilgrim over the world in defence of six million children who in Pakistan are exploited by unscrupulous businessmen], or even a summary of the whole document.

This evaluation also showed that Brazilian Portuguese was not a relevant problem for a system that only used a European Portuguese lexicon. There were few questions with exclusively Brazilian spelling or Brazilian terms – 102 "Que vulcão teve uma erupção em junho de 1991?", 114 "Onde surgiu a Aids?", 124 "Em quantos filmes da série 'Superman' estrelou Chistopher Reeve?", 127 "Quantas repúblicas compunham a Iugoslávia?", 148 "Que time se mudou para Salt Lake City?", 183 "Quem é o prefeito de Lisboa?". The system was able to retrieve several correct answers from Brazilian target documents, as in the case of the answer 135 quilômetros to question 132, "Que distância separa Cuba da Flórida?" [What distance separates Cuba from Florida?]. That was not the case, however, with question 151, "Em que posição joga Taffarel?" [In which position does Taffarel play?], whose expected answer goleiro [goalkeeper] was not recognised by the European Portuguese lexicon.

5 Conclusions and future work

Throughout this paper we have described and evaluated Priberam's QA system. The results obtained in the QA@CLEF monolingual tasks by both Priberam (for Portuguese) and Synapse (for French), who based their systems on the NLP technology developed for the TRUST search engine, suggest that the choices made are on the right track. The architecture of our system is similar to many others, yet it distinguishes itself by the indexation of morphologically disambiguated words at sentence level and by the query expansion using heads of derivation. The use of the workbench described in section 2, as well as its associated descriptive languages, allows easy maintenance and coding of several NLP features, and this is probably a big advantage since it makes the system scalable.
Despite the encouraging results detailed in the previous section, the system still has a long way to go before it can be efficient in a generic environment. We have identified some improvements to be implemented in the near future, namely concerning the question/answer matching mechanism, the syntactic treatment of questions and answers, anaphora resolution and semantic disambiguation.

We intend to exploit the ontology's potential further. It can be a very useful resource during the document and sentence retrieval stages, since it may improve the weighting of documents and sentences by introducing semantic knowledge. This implies performing document clustering based on the ontology domains, as well as inferring from question analysis which domains should be predominant in the target documents.

Future work will also address the treatment of questions that should return a list, and the refinement of the question answering system for Web searching. Currently we are participating in M-CAST – Multilingual Content Aggregation System based on TRUST Search Engine (EDC 22249 M-CAST) –, a European eContent project whose aim is the development of a multilingual platform to access and search large multilingual text collections, such as internet libraries, publishing house resources, press agencies and scientific databases. This participation will lead to further enhancements, especially in the extraction of answers from books, which may prove to be quite different from extraction from newspaper articles.

Acknowledgements

Priberam Informática would like to thank the partners of the NLUC consortium, namely Synapse Développement, for sharing its experience and knowledge, thus allowing us to compare and test our two similar but different approaches. Priberam would also like to express its thanks to the CLEF organization and to Linguateca for preparing and supervising the Portuguese evaluation. Finally, we would like to acknowledge the support of the European Commission in the TRUST (IST-1999-56416) and M-CAST (EDC 22249 M-CAST) projects.

References

[1] D. Santos and P. Rocha. CHAVE: topics and questions on the Portuguese participation in CLEF. In C. Peters and F. Borri, editors, Cross Language Evaluation Forum: Working Notes for the CLEF 2004 Workshop (Bath, UK, 15-17 September), pages 639–648, 2004. Also available at http://www.clef-campaign.org/2004/working notes/WorkingNotes2004/76.PDF.

[2] P. Quaresma, L. Quintano, I. Rodrigues, J. Saias, and P. Salgueiro. The University of Évora approach to QA@CLEF-2004. In C. Peters and F. Borri, editors, Cross Language Evaluation Forum: Working Notes for the CLEF 2004 Workshop (Bath, UK, 15-17 September), pages 403–411, 2004.

[3] L. Costa. First evaluation of Esfinge – a question-answering system for Portuguese. In C. Peters and F. Borri, editors, Cross Language Evaluation Forum: Working Notes for the CLEF 2004 Workshop (Bath, UK, 15-17 September), pages 393–402, 2004.

[4] C. Amaral, D. Laurent, A. Martins, A. Mendes, and C. Pinto. Design and Implementation of a Semantic Search Engine for Portuguese. In Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC 2004), Lisbon, Portugal, 26-28 May, volume 1, pages 247–250, 2004. Also available at http://www.priberam.pt/docs/LREC2004.pdf.

[5] D. Laurent, M. Varone, C. Amaral, and P. Fuglewicz. Multilingual Semantic and Cognitive Search Engine for Text Retrieval Using Semantic Technologies.
In Pre-proceedings of the 1st Workshop on International Proofing Tools and Language Technologies (Patras, Greece, 1-2 July), 2004.

[6] C. Amaral, H. Figueira, A. Mendes, P. Mendes, and C. Pinto. A Workbench for Developing Natural Language Processing Tools. In Pre-proceedings of the 1st Workshop on International Proofing Tools and Language Technologies (Patras, Greece, 1-2 July), 2004. Also available at http://www.priberam.pt/docs/WorkbenchNLP.pdf.

[7] D. Laurent, P. Séguéla, and S. Nègre. Cross Lingual Question Answering using QRISTAL for CLEF 2005. In Working Notes for the CLEF 2005 Workshop, 21-23 September, Wien, Austria, 2005. To appear.

[8] S.M. Thede and M.P. Harper. A second-order hidden Markov model for part-of-speech tagging. In Proceedings of the 37th Annual Meeting of the ACL, College Park, Maryland, pages 175–182, 1999.

[9] Christopher D. Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing (2nd printing). The MIT Press, Cambridge, Massachusetts, 2000.

[10] T. Kohonen. Self-Organizing Maps. Springer-Verlag New York, Inc., 2001.

[11] C. Monz and M. de Rijke. The University of Amsterdam's textual question answering system. In E.M. Voorhees and D.K. Harman, editors, Proceedings of the Tenth Text Retrieval Conference (TREC 2001), Gaithersburg, Maryland, 13-16 November, pages 519–528, 2002.

[12] D. Ferrés, S. Kanaan, E. González, A. Ageno, H. Rodríguez, M. Surdeanu, and J. Turmo. TALP-QA System at TREC 2004: Structural and Hierarchical Relaxing of Semantic Constraints. In E.M. Voorhees and D.K. Harman, editors, Proceedings of the Thirteenth Text Retrieval Conference (TREC 2004), Gaithersburg, Maryland, 16-19 November, 2005. To appear.

[13] B. Magnini, A. Vallin, C. Ayache, G. Erbach, A. Peñas, M. de Rijke, P. Rocha, K. Simov, and R. Sutcliffe. Overview of the CLEF 2004 multilingual question answering track. In C. Peters and F. Borri, editors, Cross Language Evaluation Forum: Working Notes for the CLEF 2004 Workshop (Bath, UK, 15-17 September), pages 281–294, 2004.

[14] K. Lavenus, J. Grivolla, L. Gillard, and P. Bellot. Question-answer matching: two complementary methods. In Proceedings of RIAO 2004, University of Avignon (Vaucluse), France, 2004.

[15] Xin Li and D. Roth. Learning Question Classifiers. In Proceedings of the 19th International Conference on Computational Linguistics (COLING 2002), Taipei, Taiwan, 2002.

[16] R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. ACM Press, 1999.

[17] E.M. Voorhees. Overview of the TREC 2004 Question Answering Track. In E. M. Voorhees and L. P. Buckland, editors, Proceedings of the Thirteenth Text Retrieval Conference (TREC 2004), Gaithersburg, Maryland, 16-19 November, 2005. To appear. Also available at http://trec.nist.gov/pubs/trec13/papers/QA.OVERVIEW.pdf.