INAOE-UPV Joint Participation at CLEF 2005: Experiments in Monolingual Question Answering

M. Montes-y-Gómez1, L. Villaseñor-Pineda1, M. Pérez-Coutiño1, J. M. Gómez-Soriano2, E. Sanchis-Arnal2, and P. Rosso2

1 Laboratorio de Tecnologías del Lenguaje, Instituto Nacional de Astrofísica, Óptica y Electrónica (INAOE), Mexico. {mmontesg, villasen, mapco}@inaoep.mx
2 Departamento de Sistemas Informáticos y Computación, Universidad Politécnica de Valencia (UPV), Spain. {jogomez, esanchis, prosso}@dsic.upv.es

Abstract. Recent work on question answering is based on complex natural language processing techniques: named entity extractors, parsers, chunkers, etc. While these approaches have proven to be effective, they have the disadvantage of being targeted to a particular language. In this paper we present a fully data-driven method that uses simple lexical pattern matching and statistical techniques in order to identify the relevant passages as well as the most probable candidate answers for factual and definition questions. The main quality of this method is that it can be applied to different languages without requiring major adaptation changes. Experimental results of the method on Spanish, Italian and French show that the approach can be a practical solution for monolingual and multilingual question answering applications.

1 Introduction

The volume of documents available online is growing every day. As a consequence, better information retrieval methods are required to access the needed information. Question Answering (QA) systems are information retrieval applications whose aim is to provide inexperienced users with flexible access to information, allowing them to write a query in natural language and to obtain not a set of documents that contain the answer, but the concise answer itself (Vicedo et al., 2003). That is, given a question like "Where is the Popocatepetl located?", a QA system must respond "Mexico", instead of just returning a list of documents related to the volcano.

Recent developments in QA use a variety of linguistic resources to help in understanding the questions and the documents. The most common linguistic resources include: part-of-speech taggers, parsers, named entity extractors, dictionaries, and WordNet (Jijkoun et al., 2004; Ageno et al., 2004; Pérez-Coutiño et al., 2004). Despite the promising results of these approaches, they have two main inconveniences: (i) the construction of such linguistic resources is a very complex task; and (ii) these resources are highly bound to a specific language.

In this paper we present a QA system that answers factual and definition questions. This system is based on a fully data-driven approach (Brill et al., 2001), which requires minimal knowledge about the lexicon and the syntax of the specified language. It is mainly supported by the idea that questions and their answers are commonly expressed using the same set of words, and therefore it simply uses a lexical pattern matching method to identify relevant document passages and to extract the candidate answers.

The proposed approach has the advantage of being easily adapted to several different languages, in particular to moderately inflected languages such as Spanish, English, Italian and French. Unfortunately, this generality has its price. To obtain good performance, the approach requires a redundant target collection, that is, a collection in which the question answers occur more than once.
On the one hand, this redundancy increases the probability of finding a passage containing a simple lexical match between the question and the answer. On the other hand, it enhances the answer extraction, since correct answers tend to be more frequent than incorrect responses.

The presented system also uses a set of heuristics that attempt to capture some regularities of the language and some stylistic conventions of news articles. For instance, it considers that most named entities are written with an initial uppercase letter, and that most concept definitions are usually expressed using a very small number of fixed arrangements of noun phrases. These heuristics guide the extraction of the candidate answers from the relevant passages.

In the rest of the paper we present the main architecture of our data-driven QA system. We also discuss the evaluation results on Spanish, Italian and French.

2 System Overview

Figure 1 shows the general architecture of our system. It is divided into two main modules. One of them focuses on answering factual questions. It considers the tasks of passage indexing, where documents are preprocessed and a structured representation of the collection is built; passage retrieval, where the passages most likely to contain the answer are recovered from the index; and answer extraction, where candidate answers are ranked and the final answer recommendation of the system is produced.

The other module concentrates on answering definition questions. It includes the tasks of definition extraction, where all possible acronym-meaning and referent-description pairs are located and indexed, and definition selection, where the relevant data pair is identified and the final answer of the system is generated.

The following sections describe each of these modules and their main tasks.

Figure 1. Block diagram of the system: question classification routes factual questions to passage indexing, passage retrieval and answer extraction (over a passage index built offline), and definition questions to definition extraction and definition selection (over a definition index built offline).

3 Answering Factual Questions

3.1 Passage Retrieval

The passage retrieval (PR) method is specially suited for the QA task (Gómez-Soriano et al., 2005). It retrieves the passages with the highest probability of containing the answer, instead of simply recovering the passages that share a subset of words with the question.

Given a user question, the PR method finds the passages with the relevant terms (non-stopwords) using a classical information retrieval technique based on the vector space model. Then, it measures the similarity between the n-gram sets of the passages and of the user question in order to obtain new weights for the passages. The weight of a passage is related to the largest n-gram structure of the question that can be found in the passage itself: the larger the n-gram structure, the greater the weight of the passage. Finally, it returns to the user the passages with the new weights.

3.1.1 Similarity measure

The similarity between a passage d and a question q is defined by (1).

sim(d, q) = \frac{\sum_{j=1}^{n} \sum_{\forall x \in Q_j} h(x(j), D_j)}{\sum_{j=1}^{n} \sum_{\forall x \in Q_j} h(x(j), Q_j)}    (1)

where sim(d, q) is a function which measures the similarity of the set of n-grams of the question q with the set of n-grams of the passage d, Q_j is the set of j-grams generated from the question q, and D_j is the set of j-grams of the passage d.
That is, Q_1 will contain the question unigrams whereas D_1 will contain the passage unigrams, Q_2 and D_2 will contain the question and passage bigrams respectively, and so on until Q_n and D_n. In both cases, n is the number of question terms. The result of (1) is equal to 1 if the longest n-gram of the question is in the set of passage n-grams.

The function h(x(j), D_j) measures the relevance of the j-gram x(j) with respect to the set of passage j-grams, whereas the function h(x(j), Q_j) is a normalization factor¹. The function h assigns a weight to every question n-gram as defined in (2).

h(x(j), D_j) = \begin{cases} \sum_{k=1}^{j} w_{\hat{x}_k(1)} & \text{if } x(j) \in D_j \\ 0 & \text{otherwise} \end{cases}    (2)

where the notation \hat{x}_k(1) indicates the k-th unigram included in the j-gram x, and w_{\hat{x}_k(1)} specifies the weight associated with this unigram. This weight rewards the terms (unigrams) that appear rarely in the document collection. Moreover, this weight should also discriminate the relevant terms against those (e.g. stopwords) which occur often in the document collection. The weight of a unigram is calculated by (3):

w_{\hat{x}_k(1)} = 1 - \frac{\log(n_{\hat{x}_k(1)})}{1 + \log(N)}    (3)

where n_{\hat{x}_k(1)} is the number of passages in which the unigram \hat{x}_k(1) appears, and N is the total number of passages in the collection. We assume that stopwords occur in every passage (i.e., n takes the value of N). For instance, if a term appears only once in the passage collection, its weight will be equal to 1 (the maximum weight), whereas if the term is a stopword, its weight will be the lowest.

3.2 Answer Extraction

This component aims to establish the best answer to the given question. In order to do that, it first determines a small set of candidate answers, and then it selects the final unique answer taking into consideration the position of the candidate answers inside the retrieved passages.

The algorithm applied to extract the most probable answer from the given set of relevant passages is described below. For more detail refer to (Del-Castillo et al., 2004).

1. Extract all the unigrams that satisfy some given typographic criteria. These criteria depend on the type of expected answer. For instance, if the expected answer is a named entity, then we select the unigrams starting with an uppercase letter; but if the answer must be a quantity, then we select the unigrams expressing numbers.
2. Determine all the n-grams assembled from the selected unigrams. These n-grams can only contain the selected unigrams and some stopwords.
3. Rank the n-grams based on their compensated frequency. The compensated frequency of the n-gram x(n) is computed as follows:

F_{x(n)} = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{n-i+1} \frac{f_{\hat{x}_j(i)}}{\sum_{\forall y \in G_i} f_{y(i)}}    (4)

where G_i indicates the set of i-grams, y(i) represents the i-gram y, \hat{x}_j(i) is the j-th i-gram included in x(n), f_{y(i)} specifies the frequency of occurrence of the i-gram y, and F_{x(n)} indicates the compensated frequency of x(n).
4. Select the top five n-grams as candidate answers.
5. Compute a ranking score for each candidate answer. This score is defined as the weight of the first retrieved passage (refer to formula 1) that contains the candidate answer.
6. Select as the final response the candidate answer with the greatest ranking score. In case two or more candidate answers have the same ranking score, we select the one with the greatest compensated frequency.

¹ We introduce the notation x(n) for the sake of simplicity. In this case x(n) indicates the n-gram x of size n.
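To make the similarity computation concrete, the following is a minimal Python sketch of formulas (1)-(3). It assumes tokenized questions and passages and a precomputed table passage_freq mapping each term to the number of passages it occurs in; these names, and the way stopwords are folded into that table, are our own illustrative choices rather than details of the actual system.

    import math

    def unigram_weight(term, passage_freq, num_passages):
        """Formula (3): rare terms get weights close to 1; terms occurring in
        (almost) every passage, e.g. stopwords, get the lowest weights."""
        # For stopwords the paper assumes n = N; here that simply means
        # storing num_passages as their count in passage_freq.
        n = passage_freq.get(term, 1)
        return 1.0 - math.log(n) / (1.0 + math.log(num_passages))

    def ngrams(tokens, j):
        """All j-grams of a token sequence, as tuples."""
        return [tuple(tokens[i:i + j]) for i in range(len(tokens) - j + 1)]

    def h(gram, gram_set, passage_freq, num_passages):
        """Formula (2): sum of the weights of the unigrams of `gram` if the
        whole j-gram occurs in `gram_set`, zero otherwise."""
        if gram not in gram_set:
            return 0.0
        return sum(unigram_weight(t, passage_freq, num_passages) for t in gram)

    def similarity(passage_tokens, question_tokens, passage_freq, num_passages):
        """Formula (1): n-gram overlap between a passage and the question,
        normalized so that a passage containing the whole question scores 1."""
        n = len(question_tokens)
        numerator = denominator = 0.0
        for j in range(1, n + 1):
            q_grams = set(ngrams(question_tokens, j))
            d_grams = set(ngrams(passage_tokens, j))
            for x in q_grams:
                numerator += h(x, d_grams, passage_freq, num_passages)
                denominator += h(x, q_grams, passage_freq, num_passages)
        return numerator / denominator if denominator else 0.0

A passage containing the full question as a contiguous n-gram obtains similarity 1, since every question j-gram is then also a passage j-gram.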
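Similarly, a small sketch of the compensated frequency in formula (4), used in step 3 of the algorithm above. Here gram_freq is assumed to hold the observed frequency of every assembled candidate n-gram (a data layout of our own choosing); the sets G_i are implicitly given by the keys of that dictionary.

    from collections import Counter

    def compensated_frequency(candidate, gram_freq):
        """Formula (4): for every i-gram contained in `candidate` (a tuple of
        tokens), add its frequency divided by the total frequency of all
        i-grams of the same size, and average over the candidate's length."""
        n = len(candidate)
        # total frequency mass per n-gram size i (the denominator over G_i)
        totals = Counter()
        for gram, freq in gram_freq.items():
            totals[len(gram)] += freq
        score = 0.0
        for i in range(1, n + 1):          # sizes of the sub-grams
            for j in range(n - i + 1):     # the j-th i-gram inside the candidate
                sub = candidate[j:j + i]
                if totals[i]:
                    score += gram_freq.get(sub, 0) / totals[i]
        return score / n

Candidates can then be ranked by this score, with the passage weight from formula (1) used as described in steps 5 and 6 to pick the final answer.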
4 Answering Definition Questions

Our system uses an alternative method to answer definition questions. This method makes use of some regularities of the language and some stylistic conventions of news articles to capture the possible answer for a given definition question. A similar approach was presented in (Ravichandran and Hovy, 2002; Saggion, 2004).

The process of answering a definition question considers two main tasks. First, definition extraction, which detects the text segments that contain the description or meaning of a term (in particular those related to the name of a person or an organization). Then, definition selection, where the most relevant description for a given question term is identified and the final answer of the system is generated.

4.1 Definition Extraction

The language regularities and the stylistic conventions of news articles are captured by two basic lexical patterns. These patterns allow constructing two different definition catalogs. The first one includes a list of acronym-meaning pairs. The other one consists of a list of referent-description pairs.

In order to extract the acronym-meaning pairs we use an extraction pattern based on the use of parentheses:

w1 <MEANING> ( <ACRONYM> )    (5)

In this pattern, w1 is a lowercase non-stopword, <MEANING> is a sequence of words starting with an uppercase letter (that can also include some stopwords), and <ACRONYM> indicates a single word also starting with an uppercase letter. By means of this pattern we could identify pairs like [PARM – Partido Auténtico de la Revolución Mexicana]. In particular, this pair was extracted from the following paragraph:

"El Partido Auténtico de la Revolución Mexicana (PARM) nombró hoy, sábado, a Álvaro Pérez Treviño candidato presidencial de ese organismo para las elecciones federales del 21 de agosto de 1994".

In contrast, the extraction of referent-description pairs is guided by the occurrence of a special kind of appositive phrase. This information was encapsulated in the following extraction pattern:

w1 w2 <DESCRIPTION> , <REFERENT> ,    (6)

where w1 may represent any word except a preposition, w2 is a determiner, <DESCRIPTION> is a free sequence of words, and <REFERENT> indicates a sequence of words starting with an uppercase letter or belonging to the stopword list. Applying this extraction pattern over the paragraph below we could find the pair [Alain Lombard – El director de la Orquesta Nacional de Burdeos].

"El director de la Orquesta Nacional de Burdeos, Alain Lombard, ha sido despedido por el Ayuntamiento de esta ciudad, que preside Alain Juppé, tras un informe que denuncia malos funcionamientos y gastos excesivos".

4.2 Definition Selection

The main quality of the above extraction patterns is their generality: they can be applied to different languages without requiring major adaptation changes. However, this generality causes the patterns to often extract non-relevant information, i.e., information that does not express an acronym-meaning or referent-description relation. For instance, when using extraction pattern (5) to analyze the following news fragment we obtain the incorrect definition pair [Ernie Els – AFS]. In this case the resulting pair does not express an acronym-meaning relation; instead, it indicates a person-nationality association.

"Ernie Els (AFS) se mantiene en cabeza de la lista de ganancias de la "Orden de Mérito" de golf, con más de 17 millones de pesetas, mientras que el primer español es Miguel Angel Martín, situado en el puesto decimoséptimo, con 4.696.020".
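To illustrate how such a catalog can be built, here is a rough, token-level rendering of extraction pattern (5). The tokenization, the tiny stopword list and the trimming of leading stopwords are our own simplifications, not a description of the actual implementation; note that, as discussed above, this sketch would also produce the spurious pair [Ernie Els – AFS].

    STOPWORDS = {"el", "la", "los", "las", "de", "del", "y", "en"}  # tiny illustrative list

    def is_capitalized(token):
        return token[:1].isupper()

    def extract_acronym_pairs(tokens):
        """Pattern (5), roughly: a single capitalized word between parentheses,
        preceded by a run of capitalized words that may embed stopwords."""
        pairs = []
        for i, token in enumerate(tokens):
            if token == "(" and i + 2 < len(tokens) and tokens[i + 2] == ")" \
                    and is_capitalized(tokens[i + 1]):
                acronym = tokens[i + 1]
                # walk left collecting the candidate meaning
                j, meaning = i - 1, []
                while j >= 0 and (is_capitalized(tokens[j]) or tokens[j].lower() in STOPWORDS):
                    meaning.insert(0, tokens[j])
                    j -= 1
                # drop leading stopwords so the meaning starts with a capitalized word
                while meaning and meaning[0].lower() in STOPWORDS:
                    meaning.pop(0)
                if meaning:
                    pairs.append((acronym, " ".join(meaning)))
        return pairs

    tokens = ("El Partido Auténtico de la Revolución Mexicana ( PARM ) "
              "nombró hoy a Álvaro Pérez Treviño candidato presidencial").split()
    print(extract_acronym_pairs(tokens))
    # [('PARM', 'Partido Auténtico de la Revolución Mexicana')]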
Given that the catalogs contain a mixture of correct and incorrect relation pairs, it is necessary to perform an additional process in order to select the most probable answer for a given definition question. The proposed approach is supported by the idea that, on the one hand, the correct information is more abundant than the incorrect one, and on the other hand, the correct information is redundant. Thus, the process of definition selection considers the following two criteria:

1. The most frequent definition in the catalog has the highest probability of being the correct answer.
2. The longest, and therefore most specific, definitions tend to be the most pertinent answers.

The following example illustrates the process. Assume that the user question is "Who is Félix Ormazabal?", and that the definition catalog contains the records shown below. Then, the method selects the description "diputado general de Alava" as the most probable answer. This decision is based on the fact that this answer is the most frequent description related to Félix Ormazabal in the catalog.

Félix Ormazabal: Joseba Egibar
Félix Ormazabal: candidato alavés
Félix Ormazabal: diputación de este territorio
Félix Ormazabal: presidente del PNV de Alava y candidato a diputado general
Félix Ormazabal: nuevo diputado general
Félix Ormazabal: diputado Foral de Alava
Félix Ormazabal: través de su presidente en Alava
Félix Ormazabal: diputado general de Alava
Félix Ormazabal: diputado general de Alava
Félix Ormazabal: diputado general de Alava

5 Evaluation Results

We participated in the evaluation task in three different languages: Spanish, Italian and French. For each language we submitted two runs. The first group (tova051 runs) implements the system as described in the previous sections. The second group (tova052 runs) resolves some factual questions as if they were definition questions; the selected questions were those asking for the name of a personality or for the acronym of an organization.

Table 1 shows our global results for the three languages. It can be noticed that the Spanish results were clearly better than the Italian and French ones.

Table 1. Overall accuracy results

                    tova051itit  tova052itit  tova051frfr  tova052frfr  tova051eses  tova052eses
Right                    53           55           69           70           82           77
Wrong                   138          135          121          120          109          113
Inexact                   9           10           10           10            7            8
Unsupported               0            0            0            0            2            2
Overall Accuracy      26.5%        27.5%        34.5%        35.0%        41.0%        38.5%

The following tables detail our results by question type. Table 2 shows the accuracy on factual questions, Table 3 indicates the results on definition questions, and Table 4 shows the results achieved on temporal questions. Our general conclusion is that the method for answering factual questions is language independent. Unfortunately, we cannot assert the same for our approach to answering definition questions. On the other hand, it is important to mention that the temporal questions were treated as if they were factual questions; currently we do not have a specific method for answering this kind of question.

Table 2. Accuracy on factual questions

                    tova051itit  tova052itit  tova051frfr  tova052frfr  tova051eses  tova052eses
Right                    26           28           32           33           34           28
Wrong                    89           86           84           83           77           82
Inexact                   5            6            4            4            5            6
Unsupported               0            0            0            0            2            2
Accuracy              21.7%        23.3%        26.7%        27.5%        28.8%        23.7%

Table 3. Accuracy on definition questions

                    tova051itit  tova052itit  tova051frfr  tova052frfr  tova051eses  tova052eses
Right                    21           21           33           33           40           40
Wrong                    26           26           12           12            9            9
Inexact                   3            3            5            5            1            1
Unsupported               0            0            0            0            0            0
Accuracy              42.0%        42.0%        66.0%        66.0%        80.0%        80.0%
Table 4. Accuracy on temporal questions

                    tova051itit  tova052itit  tova051frfr  tova052frfr  tova051eses  tova052eses
Right                     6            6            4            4            8            9
Wrong                    23           23           25           25           23           22
Inexact                   1            1            1            1            1            1
Unsupported               0            0            0            0            0            0
Accuracy              18.8%        18.8%        12.5%        12.5%        26.7%        30.0%

As we mentioned in Section 4, the definition catalogs contain several erroneous records. They are also incomplete, since they do not include all possible acronym-meaning and referent-description pairs. Nevertheless, they contain a large number of records and constitute a valuable information repository for answering definition questions. Tables 5 and 6 compare the catalogs that were extracted for each language.

Table 5. Data of the acronym-meaning catalog and of the corresponding document collection

Language  #acronyms  #meanings  meanings per  #sentences  #words       acronyms per  meanings per
                                acronym                                sentence      sentence
Spanish   14,921     263,388    17.65         5,636,945   151,553,838  0.0026        0.046
French    8,375      66,690     7.96          2,069,012   45,057,929   0.0040        0.032
Italian   4,588      21,606     4.71          2,282,904   49,343,596   0.0020        0.009

Table 6. Data of the referent-description catalog and of the corresponding document collection

Language  #referents  #descriptions  descriptions   #sentences  #words       referents per  descriptions
                                     per referent                            sentence       per sentence
Spanish   131,356     563,411        4.29           5,636,945   151,553,838  0.02330        0.09995
French    31,864      58,905         1.85           2,069,012   45,057,929   0.01540        0.02847
Italian   45,856      125,023        2.73           2,282,904   49,343,596   0.02009        0.05476

The above tables reveal an important association between the size of the document collection and the size of the generated catalogs. This is of great relevance since our approach to answer selection is mainly based on definition redundancy. However, the tables also seem to indicate that the extraction patterns are not totally language independent. This assertion is based on the fact that we obtained fewer descriptions per referent for French, even though the French collection is only slightly smaller than the Italian one.

Tables 7 and 8 show our results on answering definition questions about acronyms and about personality descriptions, respectively.

Table 7. Accuracy on questions about acronyms

                    tova051eses  tova052eses  tova051frfr  tova052frfr  tova051itit  tova052itit
Right                    20           20           14           14           10           10
Wrong                     8            8           11           11           22           22
Inexact                   1            1            3            3            2            2
Unsupported               0            0            0            0            0            0
Accuracy              69.0%        69.0%        50.0%        50.0%        29.4%        29.4%

Table 8. Accuracy on questions about personality descriptions

                    tova051eses  tova052eses  tova051frfr  tova052frfr  tova051itit  tova052itit
Right                    20           20           19           19           11           11
Wrong                     1            1            1            1            4            4
Inexact                   0            0            2            2            1            1
Unsupported               0            0            0            0            0            0
Accuracy              95.2%        95.2%        86.4%        86.4%        68.8%        68.8%

6 Conclusions

This paper presented a question answering system that answers factual and definition questions. The system is based on a fully data-driven approach. The main idea behind the approach is that questions and their answers are commonly expressed using the same set of words, and therefore it simply uses a lexical pattern matching method to identify relevant document passages and to extract the candidate answers.

The experiments on Spanish, Italian and French have shown the potential and portability of our approach. They also indicated that our method for answering factual questions, which is based on the matching and counting of n-grams, is language independent. However, this method greatly depends on the redundancy of the answers in the target collection; this condition limited the method to a rather poor accuracy. On the contrary, the method for answering definition questions is very precise.
Nevertheless, we cannot draw conclusions about its language independence.

As future work we plan to improve the ranking score for the factual answers. This will help reduce the dependence of our method on data redundancy. We also plan to evaluate the quality of the definition catalogs in order to draw conclusions about the language independence of our approach.

Acknowledgements. This work was done under partial support of CONACYT (Project Grant 43990), SNI-Mexico, and the Human Language Technologies Laboratory of INAOE. We would also like to thank the CLEF organizing committee.

References

1. Ageno, A., Ferrés, D., González, E., Kanaan, S., Rodríguez, H., Surdeanu, M., and Turmo, J. TALP-QA System for Spanish at CLEF-2004. Working Notes for the CLEF 2004 Workshop, Bath, UK, 2004.
2. Brill, E., Lin, J., Banko, M., Dumais, S., and Ng, A. Data-intensive Question Answering. TREC 2001 Proceedings, 2001.
3. Del-Castillo, A., Montes-y-Gómez, M., and Villaseñor-Pineda, L. QA on the Web: A Preliminary Study for Spanish Language. Proceedings of the 5th Mexican International Conference on Computer Science (ENC04), Colima, Mexico, 2004.
4. Gómez-Soriano, J.M., Montes-y-Gómez, M., Sanchis-Arnal, E., and Rosso, P. A Passage Retrieval System for Multilingual Question Answering. To appear in the Proceedings of the 8th International Conference on Text, Speech and Dialogue (TSD 2005), Karlovy Vary, Czech Republic, 2005.
5. Jijkoun, V., Mishne, G., de Rijke, M., Schlobach, S., Ahn, D., and Müller, K. The University of Amsterdam at QA@CLEF 2004. Working Notes for the CLEF 2004 Workshop, Bath, UK, 2004.
6. Pérez-Coutiño, M., Solorio, T., Montes-y-Gómez, M., López-López, A., and Villaseñor-Pineda, L. The Use of Lexical Context in Question Answering for Spanish. Working Notes for the CLEF 2004 Workshop, Bath, UK, 2004.
7. Ravichandran, D. and Hovy, E. Learning Surface Text Patterns for a Question Answering System. Proceedings of the ACL Conference, 2002.
8. Saggion, H. Identifying Definitions in Text Collections for Question Answering. Proceedings of LREC 2004.
9. Vicedo, J.L., Rodríguez, H., Peñas, A., and Massot, M. Los sistemas de Búsqueda de Respuestas desde una perspectiva actual. Revista de la Sociedad Española para el Procesamiento del Lenguaje Natural, no. 31, 2003.