QA at ILC-UniPI: Description of the Prototype* Francesca Bertagna■, Luminita Chiran■■ and Maria Simi■■■ ■ Istituto di Linguistica Computazionale (Consiglio Nazionale delle Ricerche), Via Moruzzi 1, 56100 Pisa, Italy. francesca.bertagna@ilc.cnr.it, ■■ Universita' "A. I. Cuza", Str. General Berthelot 16, 700483, Iasi, Romania. luminitachiran@yahoo.com ■■■ Dipartimento di Informatica (Università di Pisa), Via Buonarroti 2, 56100 Pisa, Italy. simi@di.unipi.it Abstract This paper introduces the general architecture of a prototype for monolingual Italian QA. The adopted strategies, the tools and resources for the linguistic processing are presented, together with the system results and a discussion about current limits and future directions of our work. 1. Introduction This is the first time the Istituto di Linguistica Computazionale of the Italian National Council of Research and the Department of Computer Science at the University of Pisa take part in the QA track at CLEF. The participation at CLEF was an important occasion to finalize a first version of a prototype for Italian QA, working on a controlled set of questions and answer pairs and on a common reference corpus of news and articles. The CLEF QA track represented an important exercise to individuate the most important problems, to discuss and study possible solutions and also to share our first results in a collaborative and experimental environment. The experience gained will surely be of great importance in the further development of our work. Aim of this paper is thus twofold: on one hand we want to describe the QA prototype and its modules of analysis, on the other we would like to present the most important problems emerged and discuss possible ways to overcome them. 2. General Architecture The system described in Fig 1. is heavily inspired by the FALCON (Harabagiu et al., 2000, Paşca, 2003) and by the PIQASso (Attardi et al., 2001) applications and it is organized following the classic three-modules architecture consisting in the question analysis, the search engine and the answer extraction modules. In what follows we will describe in detail each of these steps, focussing in the adopted solutions and in the analysis of the encountered problems. Some important, even crucial, external modules are missing (a Named Entity Recognizer and modules for WSD and multiword recognition). We will consider this first release of the prototype as a starting point and a first assembly of different modules and resources, hoping to be able to add what is missing in the next future. The system is organized as follows: • in the first module, a detailed analysis of the question is performed in order to extract the information that will be of use in the QA downstream, i.e.: i) the list of the question keywords that will be used in the IR module, ii) the Question Stem and Answer Type Term, iii) the dependency representation of the question that will be compared against the dependency representation of the candidate answer, iv) the Question Focus notion that defines the type of expected answer and provides the “semantic” type of the expected answer element. * We would like to thank Simone Pecunia and Giuseppe Attardi for their indispensable help and Nicoletta Calzolari and Irina Prodanof for their comments and suggestions. We also thank Roberto Bartolini, Alessandro Lenci, Simonetta Montemagni and Vito Pirrelli for the kind concession of text analysis tools. • The second module consists of a document indexing and retrieval sub-system that takes in input the keywords of the query and provides in output a list of paragraphs matching the query . • The last module represents the place where all the information collected during the first phase of question analysis should be used. In the future we would like to use a system of filters to rule out candidate paragraphs not satisfying a certain set of constraints (in particular semantic constrains based on the expected answer types). For the moment, only a preliminary module exploiting the dependency structure of the question and of the candidate answer has been implemented, together with the exploitation of few named entity types that can be individuated by means of simple pattern matching rules. Q uestions D ocum ent C ollection Q ue stion A nalysis IR N am ed Entities extraction (Pattern M orphological M atching) analysis M AGIC Indexing M orphological C hunking Boolean analysis C H UN K-IT query C hunking D ependency ID EAL IXE Analysis Paragraph Search Engine D ependency Stopw ords Analysis Assigning Keyword R elevance Q F from ranking IW N Answ er extraction w ith rules exploiting: Answ er Type IW N dependency relations + D eterm ination Files nam ed entities + answ ers paragraph ranking + pattern m atching on Q Ftaxonom y paragraph text Keyword Stem m ing Porter stem m er A nsw e r Proce ssin g and Extraction Q uestion Analysis Fig 1. Prototype General Architecture 3. Question Analysis Module In this module the system performs a multi-layered analysis of the question: • first of all, a sequence of steps leads to the linguistic representation of the question: each word of the question is isolated, morphologically analysed and associated to one or more lemmas. Then a two- stages (chunking and dependency) syntactic analysis is performed, allowing the system to: i) segment the question into syntactically organized text units, ii) perform POS-tagging of the words in the question, iii) identify grammatical functions; • the system applies a set of rules in order to assign to each word in the question a specific weight in term of its relevance as a keyword of the query ; • the system extracts from the question the Question Stem (the interrogative element usually introducing the sentence) and, where needed, the Answer Type Term (Paşca, 2003); • the Question Focus (i.e. the expected answer type) is individuated, by merely relying on the Question Stem type or by recurring, via the Answer Type Term and via the a Question Focus Taxonomy, to the information stored in the ItalWordNet database; • a stemmer is used on some of the keywords of the query. The next paragraphs will describe more in detail each of these steps. 3.1. Linguistic Analysis First of all, the question goes through a chain of tools for the analysis of Italian language developed at ILC-CNR 1 by (Bartolini et al., 2002). The analysis chain includes : • morphological analyser • chunker • dependency analyser The morphological analysis is performed by Magic (Battista and Pirrelli, 1999). Magic produces, for each word form of the question, all its possible lemmas together with their morpho-syntactic features. Magic also 2 recognizes the capitalization of the word, a small set of basic multi-word expressions (such as al di là but also some proper names like San Vittore in question#3) and analyses verbs containing clitic pronouns. The chunker, CHUNK-IT (Lenci et al., 2001), first performs the morpho-syntactic disambiguation of the question and then segments it into an unstructured sequence of syntactically organized text units (the chunks). We will see how also this initial, flat and linguistically poor syntactic representation can be exploited to extract information crucial for the task of question classification on the basis of the type of expected answer (i.e. what the user is looking for with his/her question). These information are the Question Stem (QS) and the Answer Type Term (ATT). The chunked file is the input of IDEAL (Italian DEpendency AnaLyzer) that builds a representation of the sentence using binary, asymmetric relations (modifier, object, subject, complement etc.) between a head and a dependent based on the FAME annotation schema (Lenci et al., 2000). The success of a QA application highly depends on the quality of the parser output and very important is efficiently parsing interrogatives forms and extracting the syntactic relations that allows the system to recognize information such as direct object, subject etc. that have such an importance in the semantic interpretation of the sentence. In order to reach this goal, a specific set of rules has been written, starting with an analysis of a corpus of Italian interrogative forms. Also the paragraphs returned by the Search Engine and candidate to be identified as answers will be subjected to these same linguistic analysis and tools. 3.2. Determining the Question Focus The Question Stem is the interrogative element (adjective, pronoun, adverb) we find in the first chunk of the 3 sentence (Cosa, Chi, Quando, etc.. ), while the Answer Type Term is the element modified by the QS (Quale 4 5 animale tuba? or Quale casa automobilistica produce il "Maggiolone"? ). The convergence between these two information allows us to get closer to the expected answer type and to the text portion plausibly containing the answer. Some QSs, for example Quando (When) and Dove (Where), reveal which kind of answer we can expect to receive and a set of simple rules was encoded in order to allow the system to establish univocal correspondences between them and specific QFs. Other QSs are, on the contrary, completely ambiguous: Che and Quale, being interrogative adjective, do not provide any clues about the semantic category of the expected 1 We only mention here the tokenisation phase i.e. the pre-processing step needed to map the input sentences onto the format required by the morphological analyser. 2 Beyond. 3 What, Who, When etc.. 4 What animal coos? 5 What car company produces “the Beetle”? answer. In these cases, to obtain the expected answer type (to individuate what we call the Question Focus) the system has to analyse the noun modified by Che and Quale and resort to their representation in the source of lexical-semantic knowledge, ItalWordNet. ItalWordNet (IWN) (Roventini et al., 2003) is the extension of the Italian component of the EuroWordNet database (Vossen, 1999). IWN follows the linguistic design of EuroWordNet (with which shares the Interlingual Index and the Top Ontology as well as the large set of semantic relation6) and consists now of about 70,000 word senses organized in 50,000 synsets. In order to better exploit the information available in ItalWordNet, a Question Focus Taxonomy has been created and connected to ItalWordNet, allowing the system to go from the Answer Type Term to the Question Focus via the ItalWordNet hyperonymical links. 3.2.1 Question Focus Taxonomy The Question Focus Taxonomy has been defined analysing about 500 questions obtained translating into Italian the English question collection of the QA track of the tenth Text Retrieval Conference and downloading Italian factoid questions from web sites dedicated to on-line quizes. Two disjoint types of expected answer can be identified: the first type consists of the answers referring to a single factual information (a person’s name, a specific location, a length expressed in meters etc.); the second type refers to more complex answers, describing series of events, explanation, reasons etc. The highest nodes, FACT and DESCR refer respectively to these two most general categories. An exemplification of the QFTaxonomy can be observed in Fig. 2. Fig 2: A snapshot of the Question Focus Taxonomy 7 Many nodes in the QFTaxonomy have been projected on the branches of the ItalWordNet taxonomies but often the QF has to be addressed on scattered and different portions of the semantic net. For example, the node Location of the Question Focus taxonomy can be mapped on the synset {luogo 1 – parte dello spazio occupata o 8 occupabile materialmente o idealmente }, that has 52 first level hyponyms and that we can further organize with other (at least) 10 sub-nodes, such as: • country (mappable on {paese 2, nazione 2, stato 4- territorio con un governo sovrano e una propria organizzazione politica e amministrativa}), • river, {fiume 1 – corso d’acqua}, 6 For a complete list of the available semantic relations cf. (Roventini et al., 2003). 7 The ItalWordNet tool developed at ILC-CNR was used to encode both the QFTaxonomy and the links to IWN. 8 place 1- part of the space that can be ideally or physically took up. • region, {zona 1, terra 7, regione 1, territorio 1- una particolare regione geografica con caratteristiche proprie fisiche, naturali e culturali}, • etc. The major part of these taxonomies is leaded by the same synset {luogo 1}, which circumscribes a large taxonomical portion that can be exploit in the QF identification. To this area we had also to add other four sub- hierarchies: • {corso d’acqua 1, corso 4- l’insieme delle acque in movimento}, • {mondo 3, globo 2, corpo_celeste 1, astro 1}, • {acqua 2 – raccolta di acqua}, • {edificazione 2, fabbricato 1, edificio 1 – costruzione architettonica}. Fig. 3 gives an idea of this situation: the circumscribed taxonomical portion includes the nodes directly mapped on the QFs, all their hyponyms (of all levels) and all the synsets linked to the hierarchy by means of the 9 BELONGS_TO_CLASS/HAS_INSTANCE relation . LOCATION QF Taxonomy COUNTRY CONTINENT CELESTIAL BODY REGION MOUNTAIN ADDRESS BODY OF WATER CITY BUILDING RIVER IWN Taxonomies {luogo 1} {acqua 2} {mondo 3, globo 2, corpo celeste 1, astro 1} {continente 1} {montagna 1, monte 1} {corso d’acqua 1, corso 4} {zona 1, terra 7, regione 1, territorio 1} {paese 2, nazione 2, stato 4} {edificazione 2, fabbricato 1, edificio 1} {urbe 1, città 1, centro urbano 1} {Roma} {Italia} {Firenze} {Spagna} {La Spezia} {Francia} {Venezia} Fig 3: mapping the node Location of the QfTaxonomy on the lexical nodes of IWN This allows a specific module of the system to retrieve the Question Focus of many question of the type Quale and Che. For example, the system identifies the Question Focus (CITY) of question#3 (In quale citta' si trova il 10 carcere di San Vittore? ). At the moment, no module performing Word Sense Disambiguation is available in this phase. A consequence is that the sub-module retrieves not only the relevant sense but also all the others: for example, for question#155 11 (Di quale squadra di calcio francese era presidente Bernard Tapie? ) beyond the correct HUMAN GROUP the system identifies an incorrect QF INSTRUMENT, determined by the fact that the ATT squadra has, among the other senses, also the sense of square. This is not a strong limit for this specific task: the Information Retrieval phase works as a kind of implicit Word Sense Disambiguator since in general the co-occurrence of more than 9 While in WordNet the synsets of type instance are linked to their superordinates by means of the normal HAS_HYPERONYM relation (not distinguishing, in this way, classes from instances), in ItalWordNet the HAS_INSTANCE/BELONGS_TO_CLASS relation is used in these cases. 10 In what city is the San Vittore prison? 11 Of which French football team was president Bernard Tapie? one keyword submitted to the Search Engine determines the extraction of pertinent paragraphs which exclude other readings (in this case, for example, no instruments can be found in the paragraph extracted: Nuovi momenti difficili per l'industriale francese Bernard Tapie, ex ministro delle aree urbane, deputato e presidente della 12 squadra di calcio di Marsiglia, l'Olympique… ). On the contrary, the lack of a WSD module determined the impossibility to exploit the ItalWordNet synonyms to perform query expansion in this first version of the system. 3.3. Keyword Relevance The selection of the keywords for the query is a very important but difficult task. For example, in the first question of the collection (In quale anno venne conferito il premio Nobel a Thomas Mann?13), we would like to submit to the search engine a vector containing at least the words: premio, Nobel, Thomas, Mann. It will be unlikely to find the word anno (year) in the expected paragraph (in its place we will more probably find the year we are looking for) while the word conferito can be easily substituted by a synonym (like assegnato, assigned) or by vincere (win) if in the answer Thomas Mann is indicated as the person who win the Nobel prize. In order to deal with the majority of the cases, we adopted a general rule on the basis of the different Parts Of Speech and of the syntactic and semantic function of the word in the question. To each morphological word is assigned an attribute “relevance” which is set to the minimal value (0) if the word belongs to a list of stopwords, to the maximum value (10) if the word is a number, has a capital letter or is in inverted commas. The Part of Speech of the remaining words is analysed and an intermediate value (7) is assigned to the relevance of nouns while a smaller value (5) is assigned to verbs, adjectives and adverbs (the minimun value is assigned to auxiliary or modal verbs). All the nouns that are “answer type terms” in questions introduced by the interrogative adjectives Quale and Che (What, Which) (for example the word anno in the question In quale anno venne conferito il premio Nobel a Thomas Mann?) received a low score (2) as well as their modifiers. This choice is not always the best strategy to follow: in case of question#17 (A quale partito apparteneva Hitler?14), submitting the keyword partito to the Search Engine would have significantly cut the number of the retrieved paragraphs, allowing the easy individuation of the correct answer since in the pertinent paragraphs we always find the text “..il partito nazista..”. At the same way, the choice to assign a higher score to the ATT in case of questions introduced by Quale in pronominal function is very useful for questions like Quale è la capitale della Russia? but has some negative consequences in the case of question#31 (Qual è la professione di James Bond?) since it is highly unlikely to find the word professione in the retrieved paragraphs. Some initial observations seem to suggest that in case of questions introduced by the pronoun Quale, the Answer Type Terms referring to concrete entities are more likely to appear in the paragraphs containing the answer but the usefulness of a module exploiting the difference between abstract and concrete entities has still to be evaluated. Other rules handle more specific yet frequent cases, for example assigning the minimum value to the relevance of the verb chiamare in question#121 (Come si chiama la moglie di Kurt Cobain?15) or of the verb trovarsi in question#134 (Dove si trova l'arcipelago delle Svalbard?16). Other more subtle distinctions may be introduced: for example, the first name is more optional than the surname in the retrieval of the paragraphs and this is the reason for the failure of retrieval for question#28 (Qual è il titolo del film di Stephen Frears con Glenn Close, John Malkovich e Michelle Pfeiffer?17) where all the names with capital letters are submitted together (connected by AND) to the Search Engine while in the answer only the surname of John Malkovich is present. For the moment we prefer not introducing this distinction since we do not have yet a systematic and general strategy to handle proper names. 3.4. Stemming The Porter stemmer for Italian18 was used on all the keywords with relevance smaller than the maximum value (so in general only Proper Nouns and keywords in inverted commas were not stemmed). The use of a stemmer was preferred because it seemed more simple and straightforward than the automatic generation of 12 ..Bernard Tapie, former minister for urban areas etc… 13 What year was Thomas Mann awarded the Nobel Prize? 14 What party did Hitler belong to? 15 What is the name of Kurt Cobain’s wife? 16 Where is the Svalbard archipelago? 17 What’s the title of the Stephen Frears’ movie with Glenn Close, John…? 18 Available free at http://snowball.tartarus.org/italian/stemmer.html morphological forms but it has some important drawbacks. For example, question#127 (Quale animale tuba?19) was badly treated because the only keyword sent to the Search Engine was tub* (the Answer Type Term animale was correctly omitted in the query vector). For this reason, the Search Engine retrieved a lot of non pertinent paragraphs, such as paragraphs talking about tuberi (tuber) or tubercolosi (tubercolosis). This would be avoided by using the morphological expansion in place of the stemmer, even if this would obviously not avoid retrieving all the document talking about the musical instrument tuba. 3.5. Question XML Data Structure In order to collect all the information derived from the various steps of question analysis, we recurred to an XML representation. Fig. 2 shows an example of question represented in our XML data Structure. It would be very useful in the future fully exploiting the ids of the various layers of linguistic representation in order to better represent the links between morphological forms, chunks and the heads/dependents of the functional analysis. This would facilitate the identification of the text portion containing the answer in the answer extraction module. Fig 4: The Question XML Data Structure 4. IR module and Query Definition The inner part of the ILC-UniPi-QA system consists in a passage retrieval application built on a search engine developed at the Computer Science Department at the University of Pisa. The search engine, the same used in the PiQASso (Attardi et al., 2001) document indexing and retrieval subsystem, is based on IXE (Attardi and Cisternino, 2001), a high-performance C++ class library for building full-text search engines. The search engine stores the full documents in compressed form and retrieves single paragraphs. However full documents are indexed and sentence boundary information is added to the index, to make possible a wider search to nearby paragraphs. In fact in many cases all the relevant terms do not appear within a paragraph, but some may be present in nearby sentences. If the option to search in a wider context is chosen, those terms may still contribute to the retrieval and ranking of the paragraph. Whether this feature is effective with respect to a more standard strategy of paragraph indexing is still an open issue and deserves further investigation. The strategy followed to retrieve the candidate answers consists in the iteration of the boolean query on the basis of the relevance score of each keyword and of the number of retrieved documents. In the first loop we send to the Search Engine all the keywords with relevance higher than 2 connected with the AND operator. If no paragraph is retrieved than the system performs the second loop, creating a query connecting with AND all the keywords with relevance higher than 7 and with OR the keywords 19 What animal coos? with relevance 5. If no paragraphs are retrieved or if at least all the keywords in AND and one in OR are not present in the returned paragraphs than the system performs the third loop. This consists in a query with all the keywords with relevance 10 in AND and the keywords with relevance 5 in OR. Again, if no paragraphs is returned or if at least all the keywords in AND and one in OR are not present in the returned paragraphs than the fourth and last iteration is performed with only the keywords with relevance 10. The system envisages also a mechanism to restrict the proximity in case of queries that contains a sequence of first name and surname (so the keywords Thomas and Mann of question#1 are searched in the paragraphs without any other elements in between). This scheme has to be revised and inserted in the future in a more general strategy for handling poly-lexical units of the type name+surname, name+preposition+name (the Mostro di Firenze of question#48) etc. A new version of the IXE Search Engine is under development at the Uni-Pi Computer Science Department: it will allow queries constrained with information about the expected answer type, so for example in case of question#11 (Qual è la città sacra per gli Ebrei?20) it will be possible to submit a query of the type “città sacra ebrei location:*” and retrieve only paragraphs containing the name of a city. 5. Answer Processing The Search Engine returns a file for each query. The file returned follows a specific DTD having the paragraph as sub-element and the information about the match and the source document as attributes. The attribute “best_ranking” is also created at root element level, equivalent to the number of keywords actually submitted to IXE for the current query. For each paragraph, the system also calculates the value of the “ranking” attribute, consisting in the number of keywords of the query actually found in each single paragraph. After this step, a set of simple regular expressions are used to discover in the paragraphs the named entities that can be found recurring to simple pattern matching; in this way, the element “Named_entity” is created for the pertinent paragraphs, having as attribute the value, the type21 and the plausibility score of the NE identification.. The meta-information representing the coordinates of the journalistic article (i.e. who wrote the article, where and when and for which news agency) are eliminated from the text in order to provide a clean input to the text analysis tools and are saved in a specific sub-element of type “MetaInfo”. The paragraphs are then submitted to the morphological and syntactic analysers and the results are saved in specific elements. 5.1 Answer Extraction This module is the one that most needs a serious rethinking and integration of information sources. Only few rules have been implemented in the current system, partially exploiting: 1. Dependency relations Some types of question (determined by the QS and by the QF) can be handled looking in the paragraphs for syntactic structures typically indicating the presence of a possible answer. This is the case, for example, of questions: i) introduced by Chi (Who), that can be resolved looking for relations of coordination and of modification of type adposition22, ii) introduced by Dove (Where), that can be resolved searching among the complements of the keyword23 introduced by the preposition di (of) or in (in)24, iii) asking about a quantity, that can be answered searching among the modifications of “card” type. An answer identified by recurring to expected patterns of syntactic relations is probably a right answer but syntactic regularities are quite rare and the rules depend too much on the quality of the parser output. 2. Named Entities 20 What is the Jewish holy city ? 21 Year, Date, Day, Season, Time, Money, Length, Weight, Speed, HumanName and Company. Names referring to Human and Company are identified only if they are respectively preceded by abbreviations like Dott., Sig. or followed by Inc. etc.. 22 See for example question#2 - Chi è l’amministratore delegato della Fiat? – and the candidate answer: Nel corso dell'assemblea dell'Ugaf, a cui ha partecipato anche l'amministratore delegato della Fiat, Cesare Romiti,… 23 Question: “Dove è Bassora?”, Candidate answer: “ ..sono a Bassora nel sud dell’Irak” 24 In case of Dove questions, a last check consists in verifying in IWN that the proposed answer is of type Location or that at least its PoS is of type Proper Name. When it is not possible to rely solely on syntactic clues to individuate the answer, it would be very useful to exploit the Named Entities corresponding to the Question Focus of the question. Since for the moment the system doesn’t make use of any module of NERecognition, only NEs of the type Time, Year, Day were exploited in answer extraction rules. 3. Pattern matching on the text of the paragraph In case of definition questions asking about organizations, the system follows a very simple strategy consisting in extracting the text between the brackets that follows the keyword. The system accuracy over definition questions is 50%. 4. Paragraph ranking When no other ways to individuate the answer can be found, the system provides as answer the paragraph with the highest ranking score. The 14.5% of answers judged inexact are due to this strategy. 6. Results and Future Work The overall accuracy of the system is quite low, only 25.5% of exact answers (22.78% over Factoid questions and 50% over Definition questions). This is the first release of the prototype and many things have still to be fixed or even developed. Between the question processing phase and the Search Engine, the system does not perform query expansion since we do not have at our disposal a WSD module to individuate the right sense to expand. This is the reason for the failure on question#44 (Chi è l’inventore del televisore?25), where the paragraph containing the answer is not retrieved since it doesn’t contain televisore but its synonym televisione. In the future, we will concentrate our efforts on the possibility to expand the queries using the synonyms in ItalWordNet. Moreover, it would be useful, during the question processing, being able to individuate multiword expressions, such as unità di misura (unit of measurement - question#4), casa discografica (record company - question#43), parte dell’organismo (body part - question#96), compagnia di bandiera (national airline - question#113) etc. that would allow an easier individuation of the expected answer type. As we already said, we think that performing morphological expansion instead of stemming may be a good strategy for QA on Italian language but we are not able at the moment to exactly evaluate the cost and benefits of such a strategy change. The Answer Extraction module is the one that most needs to be restructured and fixed. First of all, since for about 68% of the questions the expected answer is a Named Entity, the possibility to exploit the results of a NE Recognizer for making emerging important items such as names of people, organization, location etc. would be of great help. With respect to this, the opportunity to use the new version of the Search Engine under development at the Uni-Pi Computer Science Department could determine an important improvement in the system performance. Moreover, we expected to be able to improve the overall results of the system starting to use at least the hyp(er)onyms and the synonyms of the ItalWordNet synsets in order to individuate the answer. For many questions, also without query expansion, the system was able to retrieve the “right” set of paragraphs and in some case the use of IWN relations could have helped to pinpoint the answer. For example, exploiting the IWN IS-A relation between the word membro (member) and uomo (men) could have helped to individuate the answer to question#7 (Quanti membri della scorta sono morti nell'attentato al giudice Falcone?26) in the retrieved paragraph: “..nella strage di Capaci… dove furono uccisi il giudice Giovanni Falcone ..e tre uomini della scorta..”27. At the same way, the synonymy between causare (to cause) and provocare (to provoke) on one hand and tumore (tumor) and cancro (cancer) on the other could have helped to match question and answer in case of question#64 (Cosa può causare il tumore ai polmoni?28) and the candidate answer text: “…alimentando l’ipotesi…che gli scarichi diesel provochino il cancro”29. This is something different from performing query expansion since this strategy does not enlarge the set of paragraphs that are obtained using the keywords of the question but rather helps to restrict the number of possible candidates30. 25 Who is the inventor of the television? 26 How many members of the escort died in the attack to Judge Falcone? 27 ..in the Capaci massacre…where Judge Falcone..and three men of his escort died.. 28 What causes lungs tumor? 29 ..it fosters the hypothesis that…diesel exhaust provokes cancer 30 In this case, the lack of a module for explicit WSD would not effect the identification of useful connections. As final remark, we think that CLEF represented a very important occasion to highlight the problems and to look for new solutions and strategies for Italian QA. In the next future, we will work on a new release of the system in order to overcome its current limits and to improve its performance. References Attardi G., Cisternino A., Reflection support by means of template metaprogramming, Proceedings of Third International Conference on Generative and Component-Based Software Engineering, LNCS, Springer-Verlag, Berlin, 2001. Attardi G., Costernino A., Formica F., Simi M., Tommasi A., Zavattari C., PIQAsso: Pisa Question answering System, in Proceeding of the 10th TREC Conference, 2001. Bartolini R., Lenci A., Montemagni S., Pirrelli V., Grammar and Lexicon in the Robust Parsing of Italian: Towards a Non-Naïve Interplay, in Proceedings of COLING 2002 Workshop on Grammar Engineering and Evaluation, Taipei, Taiwan, 2002. Battista M, Pirrelli V., Una Piattaforma di Morfologia Computazionale per l’Analisi e la Generazione delle Parole Italiane, ILC-CNR Technical Report, 1999. Harabagiu S., Moldovan D., Pasca M, Mihalcea R., Surdeanu M., Bunescu R., Girju R., Rus R. and Morarescu P., FALCON: Boosting Knowledge for Answer Engines, in Proceedings of the Text Retrieval Conference (TREC-9), 2000. Lenci A., Montemagni S., Pirrelli V., CHUNK-IT. An Italian Shallow Parser for Robust Syntactic Annotation, in Linguistica Computazionale, Istituti Editoriali e Poligrafici Internazionali, Pisa-Roma, ISSN 0392-6907, 2001. Lenci A., Montemagni S., Pirrelli V., Soria C., FAME: a Functional Annotated Meta-Schema for multi-modal and multilingual Parsine Evaluation, Proceeding of the LREC-2000, 2000. Paşca M., Open-Domain Question Answering from Large Text Collections, CSLI Studies in Computational Linguistics, USA, 2003. Roventini A., Alonge A., Bertagna F., Calzolari N., Girardi C., Magnini B., Marinelli R., Speranza M., Zampolli A., ItalWordNet: Building a Large Semantic Database for the Automatic Treatment of Italian. In Zampolli A., Calzolari N., Cignoni L. (eds.), Computational Linguistics in Pisa, Special Issue of Linguistica Computazionale, Vol. XVIII-XIX, Istituto Editoriale e Poligrafico Internazionale, Pisa-roma, 2003. Vossen, P. (ed.), EuroWordNet General Document, 1999. http://www.hum.uva.nl/~ewn.