FIDJI in ResPubliQA 2009

Xavier Tannier, Véronique Moriceau
CNRS-LIMSI, University Paris-Sud 11
xtannier@limsi.fr, moriceau@limsi.fr

Abstract

This paper presents FIDJI's results in ResPubliQA 2009. FIDJI (Finding In Documents Justifications and Inferences) is an open-domain question-answering system for French. Its main goal is to validate answers by checking that all the information given in the question is retrieved in the supporting texts.

Categories and Subject Descriptors

H.3 [Information Storage and Retrieval]: H.3.1 Content Analysis and Indexing; H.3.3 Information Search and Retrieval; H.3.4 Systems and Software; H.3.7 Digital Libraries; H.2.3 [Database Management]: Languages, Query Languages

General Terms

Measurement, Performance, Experimentation

Keywords

Question answering, Questions beyond factoids

1 Introduction

This paper presents FIDJI's results in ResPubliQA 2009 for French. In this task, systems receive 500 independent questions in natural language as input, and must return, from the document collection, one paragraph containing the answer. Neither an exact answer nor multiple responses are required. The document collection is JRC-Acquis, a corpus of EU documentation.

2 FIDJI

FIDJI (Finding In Documents Justifications and Inferences) is an open-domain question-answering system for French (1). Its main goal is to validate answers by checking that all the information given in the question is retrieved in the supporting texts. Our answer validation approach assumes that the different entities of the question can be retrieved, properly connected, either in a sentence, in a passage or in multiple documents. We designed the system so that no particular linguistic-oriented pre-processing is needed.

The document collection is indexed by the search engine Lucene (2) [2]. First, the system submits the keywords of the question to Lucene; the first 100 documents returned are then processed (syntactic analysis and named entity tagging). Among these documents, FIDJI looks for sentences containing the most syntactic relations of the question. Finally, answers are extracted from these sentences and the answer type, when specified in the question, is validated. Figure 1 presents the architecture of FIDJI; more details can be found in [4, 3]. The next sections summarize the way FIDJI extracts answers and focus on ResPubliQA specificities.

Figure 1: Architecture of FIDJI

(1) This work has been partially financed by OSEO under the Quaero program.
(2) http://lucene.apache.org/java/docs/index.html

2.1 Syntactic analysis

FIDJI has to detect syntactic implications between questions and passages containing the answers. Our system relies on the syntactic analysis provided by XIP, which is used to parse both the questions and the documents from which answers are extracted. XIP [1] is a robust parser for French and English which provides dependency relations and named entity recognition.

The dependency relations provided by XIP which are used by FIDJI are mainly: SUBJ (subject), OBJ (object), PREPOBJ (prepositional group), NMOD (noun modifier), VMOD (verb modifier), COORDITEMS (coordinated elements), CONNECT (connector introducing a clause).

Named entities (NE) are tagged using a set of 8 types: person, organization, location, date (defined by XIP), as well as nationality, number, duration, age (which we added). XIP's lieu (location) type can be made more specific (country, region, continent...). We also added features to allow for more precise types. For example, for number, we added the features length, speed, weight, money and physics, so that "0.55 euro" in "a French stamp costs 0.55 euro" can be tagged as a NE and extracted as an answer to "What is the price of a French stamp?". Other elements are also tagged, such as nouns introducing persons: functions (leader...), professions (minister...), family indications (father...).
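To make the added number features concrete, here is a minimal Python sketch of how such subtyping could work. This is our illustration under assumed patterns, not FIDJI's actual code, which relies on XIP's tagger; the regular expressions and feature names below are assumptions.

    import re

    # Illustrative patterns for refining the 'number' NE type with features
    # (assumed for this sketch; the real system adds features to XIP output).
    NUMBER_FEATURES = {
        "money":  r"\b\d+(?:[.,]\d+)?\s*(?:euros?|EUR|francs?)\b",
        "length": r"\b\d+(?:[.,]\d+)?\s*(?:km|m|cm|mm)\b",
        "weight": r"\b\d+(?:[.,]\d+)?\s*(?:kg|g|tonnes?)\b",
    }

    def tag_numbers(sentence):
        """Return (text span, feature) pairs for refined number entities."""
        tags = []
        for feature, pattern in NUMBER_FEATURES.items():
            for match in re.finditer(pattern, sentence, re.IGNORECASE):
                tags.append((match.group(0), feature))
        return tags

    print(tag_numbers("a French stamp costs 0.55 euro"))
    # -> [('0.55 euro', 'money')], a typed candidate for a price question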
Question analysis consists in identifying:

• The syntactic dependencies given by XIP;
• The keywords submitted to Lucene (words tagged as noun, verb, adjective or adverb by XIP);
• The question type: Factoid (concerning a fact, typically who, when and where questions), Definition (What is...), Boolean (expecting a yes/no answer), List (expecting an answer composed of a list of items) or Complex (why and how questions);
• The expected type(s): NE type and/or (specific) answer type.

The answer to be extracted is represented by a variable (ANSWER) introduced in the dependency relations. The slot noted 'ANSWER' is expected to be instantiated by a word that is an argument of some dependencies of the parsed sentences. This word represents the answer to the question (see Section 2.2). The question type is mainly determined on the basis of the dependency relations given by the parser. For example:

0015 - Entre quels pays a été conclu l'accord-cadre de coopération commerciale et économique du 2 avril 1990 ? (Between which countries was the Framework Agreement for trade and economic cooperation of 2 April 1990 concluded?)

• Syntactic dependencies and NE tagging:
ATTRIBUTADJ(coopération, commercial)
ATTRIBUTADJ(coopération, économique)
ATTRIBUT_DE(accord-cadre, coopération)
VMOD(conclure, ANSWER)
PREPOBJ(ANSWER, entre)
ATTRIBUT(conclure, accord-cadre)
DATE(2 avril 1990)
LIEU[PAYS](ANSWER)
• Question type: list
• Expected type: location (state)

0021 - Comment encourage-t-on la production de graines de vers à soie ? (How is interest in producing silkworm eggs increased?)

• Syntactic dependencies and NE tagging:
ATTRIBUT_DE(graine, vers)
ATTRIBUT_DE(production, graine)
DEEPOBJ(encourager, production)
NMOD(vers, soie)
TOPIC(encourager)
• Question type: complex
• Expected type: ∅
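The outcome of this analysis can be pictured as a small record per question. The sketch below is an illustrative reconstruction for question 0015; the class and field names are ours, not FIDJI's actual identifiers.

    from dataclasses import dataclass

    ANSWER = "ANSWER"  # slot to be instantiated by a word of a passage

    @dataclass
    class QuestionAnalysis:
        dependencies: list    # (relation, governor, dependent) triples from XIP
        keywords: list        # nouns, verbs, adjectives, adverbs, for Lucene
        question_type: str    # factoid / definition / boolean / list / complex
        expected_types: list  # NE type and/or specific answer type, may be empty

    q0015 = QuestionAnalysis(
        dependencies=[("ATTRIBUT_DE", "accord-cadre", "coopération"),
                      ("VMOD", "conclure", ANSWER),
                      ("PREPOBJ", ANSWER, "entre")],
        keywords=["pays", "conclure", "accord-cadre", "coopération",
                  "commercial", "économique"],
        question_type="list",
        expected_types=["location (state)"],
    )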
2.2 Extracting candidate paragraphs

The ResPubliQA answer format is different from that of traditional QA campaigns. First, answers are not focused, short parts of texts, but full paragraphs that must contain the answer. Second, passages are not arbitrary parts of texts of limited length; they must be predefined paragraphs identified in the collection by XML tags.

Although the answers to submit to the campaign are full paragraphs, our system is designed to hunt down short answers. For most questions, typically factoid questions, it is still relevant to find short answers, and then to return a paragraph containing the best answer. This is not the case for 'how' or 'why' questions, where no short answer may be retrieved.

FIDJI usually works at the sentence level. To follow the specific rules of ResPubliQA, we chose to work at the paragraph level. This consisted in specifying that sentence separators were the paragraph XML tags of the collection, rather than the usual end-of-sentence markers.
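A minimal sketch of this adaptation, assuming paragraphs are marked with <p> tags (the actual tag names and attributes in JRC-Acquis may differ):

    import re

    def paragraphs(document_xml):
        """Split a document into its predefined paragraphs, using the XML
        paragraph tags as 'sentence' separators instead of the usual
        end-of-sentence punctuation."""
        # Assumption: paragraphs are enclosed in <p ...> ... </p> tags.
        return [m.group(1).strip()
                for m in re.finditer(r"<p[^>]*>(.*?)</p>", document_xml,
                                     re.DOTALL)]

    doc = '<p n="1">Whereas ... ;</p><p n="2">This Directive shall apply ...</p>'
    for number, paragraph in enumerate(paragraphs(doc), start=1):
        print(number, paragraph)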
Once candidate documents are selected by the search engine and analyzed by the parser, the system compares the document paragraphs with the question analysis, in order to:

• Extract candidate answers or select a relevant paragraph;
• Give a score to each answer, so that final answers can be ranked.

2.2.1 Factoid questions

Within selected documents, candidate paragraphs are those containing the most dependencies from the question. Once these paragraphs are selected, two cases can occur:

1. Question dependencies with an 'ANSWER' slot are found in the sentence. In this case, the lemma instantiating this slot is the head of the answer. The full answer is composed of the head and its basic modifiers (for a noun phrase: noun complements, adjectives, determiners and coordinated elements; for a verbal phrase: verb complements, subject and object). The NE type and answer type of this answer, if any, are checked. The answer type can be validated by different syntactic relations in the text: definition ("The French Prime Minister, Pierre Bérégovoy"), attributNN ("Pierre Bérégovoy is the French Prime Minister"), and sometimes attribut_de ("la maladie de Parkinson", Parkinson's disease, literally "the disease of Parkinson").

2. The 'ANSWER' slot does not unify with any word of the passage. In this case, the elements having an appropriate NE type and/or answer type are selected in the sentence. This is done in order to counterbalance the many parsing errors (or paraphrases): often, the sentence contains the answer but syntactic dependencies alone do not lead to it.

If no possible short answer is found, the paragraph is still considered as a candidate answer. But in any case, a paragraph containing an extracted short answer is preferred if it exists.

Example 1.

0015 - Entre quels pays a été conclu l'accord-cadre de coopération commerciale et économique du 2 avril 1990 ? (Between which countries was the Framework Agreement for trade and economic cooperation of 2 April 1990 concluded?)

• Syntactic dependencies and NE tagging:
ATTRIBUTADJ(coopération, commercial)
ATTRIBUTADJ(coopération, économique)
ATTRIBUT_DE(accord-cadre, coopération)
ATTRIBUT(conclure, accord-cadre)
VMOD(conclure, ANSWER)
PREPOBJ(ANSWER, entre)
DATE[DATEABS](2 avril 1990)
LIEU[PAYS](ANSWER)
• Question type: list
• Expected type: location (state)

The following passage is selected because it contains the dependencies of the question:

Passage: un accord-cadre de coopération commerciale et économique entre la Communauté économique européenne et la République argentine (3) a été conclu le 2 avril 1990 ; (Considering the Framework Agreement for trade and economic cooperation between the European Economic Community and the Argentine Republic of 2 April 1990;)

ATTRIBUTADJ(coopération, commercial)
ATTRIBUTADJ(coopération, économique)
ATTRIBUT_DE(accord-cadre, coopération)
ATTRIBUT(conclure, accord-cadre)
NMOD(coopération, communauté économique européen)
PREPOBJ(communauté économique européen, entre)
COORDITEMS(communauté économique européen, république argentin)
LIEU[PAYS](république argentin)
DATE(2 avril 1990)
ORG(communauté économique européen)

The slot 'ANSWER' is instantiated by communauté économique européenne. As the question type is 'list', the elements of the list have to be found in a 'COORDITEMS' dependency: so, the answers are communauté économique européenne and république argentine. Finally, the expected answer type is validated: the selected answer is tagged as a location (state).
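The matching at work in Example 1 can be sketched as follows. This is a deliberately simplified illustration in which dependencies are plain triples; FIDJI's actual unification over XIP output is richer, and the function below is our assumption about its behavior.

    ANSWER = "ANSWER"

    def instantiate(question_deps, passage_deps):
        """Unify question dependencies containing the ANSWER slot with
        passage dependencies sharing the same relation and other argument;
        return the words that can fill the slot."""
        fillers = set()
        for rel_q, gov_q, dep_q in question_deps:
            for rel_p, gov_p, dep_p in passage_deps:
                if rel_q != rel_p:
                    continue
                if gov_q == ANSWER and dep_q == dep_p:
                    fillers.add(gov_p)
                elif dep_q == ANSWER and gov_q == gov_p:
                    fillers.add(dep_p)
        return fillers

    # Simplified dependencies from question 0015 and the selected passage:
    question = [("VMOD", "conclure", ANSWER), ("PREPOBJ", ANSWER, "entre")]
    passage = [("VMOD", "conclure", "communauté économique européen"),
               ("PREPOBJ", "communauté économique européen", "entre"),
               ("COORDITEMS", "communauté économique européen",
                "république argentin")]

    heads = instantiate(question, passage)
    # For a 'list' question, expand each filler with its coordinated items:
    answers = set(heads)
    for rel, gov, dep in passage:
        if rel == "COORDITEMS" and gov in heads:
            answers.add(dep)
    print(answers)  # both coordinated country names are returned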
Example 2.

0026 - Quel est le nom de la monnaie des états membres depuis le 1er janvier 1999 ? (What is the name of the Member States' currency since 1 January 1999?)

• Syntactic dependencies and NE tagging:
ATTRIBUT_DE(monnaie, état)
NMOD(état, membre)
PREPOBJ(1er janvier 1999, depuis)
DEFINITION(ANSWER, monnaie)
DATE(1er janvier 1999)
• Question type: definition
• Expected type: ∅

The following passage is selected because it contains all the dependencies of the question:

Passage: considérant que le règlement (CE) n 974/98 du Conseil du 3 mai 1998 concernant l'introduction de l'euro (3) prévoit à son article 2 que, à compter du 1er janvier 1999, la monnaie des États membres participants est l'euro ; (Whereas Council Regulation (EC) No 974/98 of 3 May 1998 on the introduction of the euro (3) provides in Article 2 that from 1 January 1999 the currency of the participating Member States shall be the euro;)

ATTRIBUTADJ(membre, participant)
ATTRIBUT_DE(monnaie, état)
NMOD(état, membre)
PREPOBJ(1er janvier 1999, à compter de)
DEFINITION(euro, monnaie)
DATE(1er janvier 1999)
...

and the slot 'ANSWER' is instantiated by euro.

2.2.2 Complex questions

Complex questions ('how', 'why', etc.) do not expect any short answer. On these kinds of questions, the system behaves more like a passage retrieval system. The paragraphs containing the most syntactic dependencies in common with the question are selected. Among them, the best-ranked is the one that is returned first by Lucene. For example:

0155 - Pourquoi convient-il de revoir l'architecture du réseau Animo ? (Why should the structure of the ANIMO network be revised?)

• Syntactic dependencies and NE tagging:
VMOD(convenir, revoir)
DEEPOBJ(revoir, architecture)
ATTRIBUT_DE(architecture, réseau)
NMOD(réseau, animo)
• Question type: complex (why)
• Expected type: ∅

The following passage is selected because all the dependencies of the question are found in the passage:

Passage: considérant que, à la suite de différents travaux effectués dans le cadre communautaire, notamment lors d'études et de séminaires, il convient de revoir l'architecture du réseau Animo afin de procéder à la mise en place d'un système vétérinaire intégrant les différentes applications informatisées ; (Whereas, as a result of the work carried out at Community level in the course of studies and seminars, the structure of the ANIMO network should be revised so that a veterinary system integrating the various computer applications can be introduced;)

DEEPSUBJ(convenir, il)
VMOD(convenir, revoir)
DEEPOBJ(revoir, architecture)
ATTRIBUT_DE(architecture, réseau)
NMOD(réseau, animo)
PREPOBJ(procéder, afin de)
VMOD(procéder, mise)
PREPOBJ(mise, à)
NMOD(mise, place)
...

2.3 Scoring

FIDJI's scores are not composed of a single value, but of a list of different values and flags. The criteria are listed below, in decreasing order of importance:

• As said above, a paragraph containing an extracted short answer is preferred if it exists.
• Named entity value (appropriate NE value or not; for factoid questions only).
• Keyword rate (between 0 and 1, the rate of major question keywords present in the passage: proper names, answer type and numbers).
• Answer type value (appropriate answer type or not; for factoid questions only).
• Frequency weighting (number of extracted occurrences of this answer; for factoid questions only).
• Document ranking (best rank of a document containing the answer, as returned by the search engine; in this case, the lower the better).
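Since these criteria are applied in decreasing order of importance, the ranking can be read as a lexicographic comparison of score tuples. The sketch below illustrates that reading; the attribute names are ours, and the real system may combine the values differently.

    from dataclasses import dataclass

    @dataclass
    class Candidate:
        has_short_answer: bool  # paragraph contains an extracted short answer
        ne_type_ok: bool        # appropriate NE value (factoid questions only)
        keyword_rate: float     # rate of major question keywords present (0..1)
        answer_type_ok: bool    # appropriate answer type (factoid only)
        frequency: int          # extracted occurrences of this answer (factoid only)
        doc_rank: int           # best search-engine rank (lower is better)
        paragraph: str

    def score_key(c):
        # Tuples compare lexicographically, i.e. in decreasing order of
        # importance; doc_rank is negated so that a lower rank wins.
        return (c.has_short_answer, c.ne_type_ok, c.keyword_rate,
                c.answer_type_ok, c.frequency, -c.doc_rank)

    candidates = [Candidate(True, True, 0.8, True, 2, 3, "paragraph A"),
                  Candidate(True, True, 0.8, True, 2, 1, "paragraph B")]
    print(max(candidates, key=score_key).paragraph)
    # paragraph B wins: equal on all criteria, better document rank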
3 Results

Table 1 presents the results by question type. Only one answer per question was allowed, so the values simply correspond to the rate of correct answers for each question type.

Question type   Number of questions   Correct answers
Factoid         116                   36.2 %
Definition      101                   15.8 %
List            37                    16.2 %
"How"           76                    22.4 %
"Why"           170                   40 %
TOTAL           500                   30.4 %

Table 1: FIDJI results by question type.

Results are lower than our scores in former campaigns, especially concerning factoid and definition questions. Looking carefully at the results shows that, in these particular documents, using syntactic dependencies as the main clue for choosing candidate paragraphs is not always a good way to find a relevant passage. This is especially true for complex questions, but not only for them. Indeed, selecting the paragraph containing the most question dependencies often leads to the introduction of the document or to a very general paragraph containing poor information. For example, question 0006 - What is the scope of the council directive on the trading of fodder seeds? is answered by
COUNCIL DIRECTIVE of 14 June 1966 on the marketing of fodder plant seed (66/401/EEC)
containing many dependencies but answering nothing, while a good answer appeared later in the same document, but introduced by an anaphora:

This Directive shall apply to fodder plant seed marketed within the Community, irrespective of the use for which the seed as grown is intended.
Dependency relations are thus still useful to find the right document, but often fail to point to the correct paragraph. Also, the JRC-Acquis corpus uses a different register of language than usual corpora such as the Web or newspapers. Both question and document analyses suffered from the specific expressions and structures used in these French texts, especially for definitions. Definitions, quite easy to detect in newspaper corpora, were poorly recognized in this evaluation.

4 Conclusion

We presented in this article our participation in the ResPubliQA 2009 campaign for French. We adapted our syntax-based QA system FIDJI in order to produce a single long answer in the form of JRC-Acquis tagged paragraphs. Results showed that syntactic analysis should be used in different manners according to the type of tasks and questions. A careful look at our system's errors should enable us to improve the robustness of the search by applying contextual strategies.

References

[1] Salah Aït-Mokhtar and Jean-Pierre Chanod. Incremental finite-state parsing. In Proceedings of the Fifth Conference on Applied Natural Language Processing, pages 72-79, Washington, DC, USA, 1997. Morgan Kaufmann Publishers Inc., San Francisco, California, USA.

[2] Erik Hatcher and Otis Gospodnetić. Lucene in Action. Manning, 2004.

[3] Véronique Moriceau and Xavier Tannier. Étude de l'apport de la syntaxe dans un système de question-réponse. In Actes de la Conférence sur le Traitement Automatique des Langues Naturelles (TALN 2009, poster), Senlis, France, June 2009.

[4] Véronique Moriceau, Xavier Tannier, and Brigitte Grau. Utilisation de la syntaxe pour valider les réponses à des questions par plusieurs documents. In Proceedings of the Conférence en Recherche d'Information et Applications (CORIA), Presqu'île de Giens, France, 2009.