
Selecting answers with structured lexical expansion and discourse relations

Martin Gleize

Brigitte Grau

Anne-Laure Ligozat

Van-Minh Pho

Gabriel Illouz

Frederic Giannetti

Loïc Lahondes

LIMSI-CNRS, rue John von Neumann, 91403 Orsay cedex, France; Université Paris-Sud, 91400 Orsay, France; ENSIIE, 1 Square de la Résistance, 91000 Evry, France

In this paper, we present LIMSI's participation in QA4MRE 2013. We decided to test two kinds of methods. The first one focuses on complex questions, such as causal questions, and exploits discourse relations. Relation recognition shows promising results; however, it has to be improved to have an impact on answer selection. The second method is based on semantic variations. We explored the English Wiktionary to find reformulations of words in the definitions, and used these reformulations to index the documents and select passages in the Entrance exams task.

Keywords: question answering, index expansion, discourse relation, question classification

1 Introduction

In this paper we present LIMSI's participation in QA4MRE 2013. We decided to experiment with two kinds of methods. The first one focuses on complex questions, such as causal questions, and exploits discourse relations. We created a question typology based on the one proposed by the QA4MRE organizers, and linked it to the type of relation expected between the answer and the question information. In order to detect these relations in the texts, we wrote rules based on parse trees and connectors.

The second method is based on semantic variations. We explored the English Wiktionary to find reformulations of words in their definitions, and used these reformulations to index the documents and select passages in the Entrance exams task.

The paper is organized as follows. In section 2, we present the general architecture of our system in order to give an overview of the methods we developed. Section 3 details question analysis. In relation to question classification, section 4 presents discourse relation recognition. We then present the two methods for passage selection and answer ranking in section 5. The selection of answers according to question category and discourse relation is described in section 6, before we present our experiments and results in section 7.

2 System overview

Reading documents used in the main task and the Alzheimer task are generally scientific papers, and the variations between words in questions and answers and words in the relevant passages of text are often based on paraphrases. These kinds of variations are handled by rules that take into account morphological, syntactic and semantic variants [1]. In entrance exams, there are more distant semantic variations between each set of words, such as hypernymy or causal relations. We tackle these problems by creating paths that follow dictionary definitions from question words towards document words. We developed two modules for passage retrieval: term and variant indexing, and word tree indexing. Question analysis is the same for all tasks. From the question parse trees, we generate hypotheses by applying manually written rules. To determine question types, we reuse existing question classification modules.

Complex types of questions are associated with discourse relations in documents, which have to hold with the answer. In order to recognize these relations in documents, we wrote rules based on the parse trees of document sentences.

Answers are ranked according to different measures. For answers to complex questions, if a corresponding relation is found on a candidate answer in the top passages, this candidate is returned.

3 Question analysis

The aim of the question analysis module is to determine the question category. As we decided to focus on discourse relations, we adapted our existing systems to detect the kind of discourse relation between the answer and the question words.

We kept the factoid question subclasses, which are based on the expected answer type in terms of named entity type: person, organization, location, date, etc.

We added the following classes, according to the task guidelines and to the taxonomy of [2]:

- causal/reason: there is a cause-consequence relation between the answer and question information.
  Why cannot bexarotene be used as a cure for Alzheimer's disease?
- method/manner: the question asks for the way something happens.
  How do vitamin D and bexarotene correlate?
- opinion: the question asks about the opinion about something.
  What was Cramer's attitude towards the music of Bach?
- definition: the expected answer is the definition, an instance or an equivalent of the question focus.
  What is a common characteristic for the neurodegenerative diseases?, Give two symptoms of dementia.
- thematic: the question asks for an event at a given time.
  What happened during the meal after the family had all taken their new seats?

We used two existing question analysis modules: one based on syntactic rules [1] and one based on machine learning classification [3].

The first module parses the question with the Stanford Parser [4], which provides a constituency parse tree. Then, syntactic rules determine the question class by recognizing a syntactic pattern with Tregex and Tsurgeon [5]. For example, for the question Which singer made a hit record whose accompaniment was entirely synthesised?, the rules detect the interrogative pronoun which and the fact that it has a noun child, singer, in the parse tree; this noun is compared to a list of triggers and is recognized as a trigger of the person question class.
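The final trigger lookup of this rule can be illustrated with a minimal sketch. It assumes that the interrogative word and its noun have already been extracted from the parse tree (the system does this with Tregex patterns, which are not reproduced here). The trigger lists and the function name `classify_by_trigger` are hypothetical, since the actual lists are not given in this paper.

```python
# Hypothetical trigger lists (illustration only; not the lists used by the system).
TRIGGERS = {
    "person": {"singer", "author", "president", "scientist"},
    "location": {"city", "country", "river"},
    "date": {"year", "day", "month"},
}


def classify_by_trigger(wh_word, head_noun, default="factoid"):
    """Return the question class whose trigger list contains the noun.

    wh_word and head_noun are assumed to come from an earlier
    pattern-matching step on the constituency parse tree.
    """
    if wh_word.lower() not in {"which", "what"}:
        return default
    noun = head_noun.lower()
    for question_class, triggers in TRIGGERS.items():
        if noun in triggers:
            return question_class
    return default
```

For the example question above, `classify_by_trigger("Which", "singer")` returns the person class under these assumed lists.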

After the official evaluation, we assessed this module on the test sets of QA4MRE 2013: 73% of questions were correctly classified. Most errors were due to question formulations which had not been taken into account, such as boolean questions, and some to misclassifications (for example, What is the cause... was incorrectly classified as a factoid question).

The second module is an SVM-based classifier using the LIBSVM [6] tool with default parameters. The classifier was trained on the fine-grained question taxonomy of [2], with each question category considered as a class. The features used are n-grams (n ranging from 1 to 2) of words, lemmas and parts-of-speech (determined by the TreeTagger [7]), as well as the trigger lists of the first module and a regular-expression-based recognition of abbreviations. This module obtained 0.84 precision on the test corpus of [2].
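The n-gram part of this feature set can be sketched as follows. This is a minimal illustration under stated assumptions: lemmas and POS tags are supplied by the caller (the system obtains them from TreeTagger), features are binary, and the trigger-list and abbreviation features are omitted. The function names and the feature naming scheme are hypothetical.

```python
def ngrams(tokens, n_max=2):
    """Return all n-grams of length 1..n_max as space-joined strings."""
    grams = []
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def question_features(words, lemmas, pos_tags):
    """Build a sparse binary feature dict from three parallel token layers."""
    features = {}
    for layer, tokens in (("w", words), ("l", lemmas), ("p", pos_tags)):
        for gram in ngrams(tokens):
            # The layer prefix keeps word, lemma and POS n-grams distinct.
            features[f"{layer}={gram}"] = 1
    return features
```

A dictionary of this kind can then be converted to the sparse vector format expected by an SVM library.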

We also evaluated this module on the QA4MRE 2013 test sets, where it obtained 0.85 correct classification. The main kinds of error are the misclassification of factoid questions as definition questions and the absence of the opinion class in the hierarchy.

4 Discourse relation recognition

Our present work was a first attempt to take discourse relations into account, in order to study whether it was possible to relate them to question categories and thus to provide a supplementary criterion for selecting an answer. As we did not have an annotated corpus, we decided to first model the recognition of some of them by rules.

4.1 List of relations

We took into account the following four binary relations:

- Causality, to be related to causal/reason questions
- Opinion, to be related to opinion questions
- Definition, to be related to definition questions
- Example, to be related to questions asking for a concept in factual questions, such as which animal ... ?

Being binary relations, each of these relations has two components, which we detail below:

- Causality is composed of a Cause and a Consequence.
  [He would not provide his last name]Csqce [because]Mark [he did not want people to know he had the E. coli strain.]Cause
- Opinion is composed of a Source and a Target.
  [Some users of the Apple computer]Src [say]Mark [it smells sickening.]Trgt
- Definition is composed of a Concept and an Explanation.
  [a Rube Goldberg machine]Cpt [is]Mark [a complicated contraption, an incredibly over-engineered piece of machinery that accomplishes a relatively simple task]Exp
- Example is composed of a Concept and a List.
  [other endangered North American animals]Cpt [such as]Mark [the red wolf and the American crocodile.]List

Causes and consequences of causality relations can be found between two clauses or between phrases in a sentence; they can also be found in consecutive sentences. Thus we defined rules that recognize each of the two members separately.

Opinion relations were restricted to reported discourse.

Definition relations gather all types of clauses that help to define or specify a precise concept. These can be embodied as an appositive, as in the tiger, the largest of all the big cats, as a reformulation, as in polar regions known as the cryosphere, or as a canonical model of definition, as in Rickettsia mooseri is a parasite of rats.

Example relations encompass any instance of a larger concept. The expected result is a list of n instances, as found in luxuries such as home air conditioning and swimming pools or great Black players like Michael Jordan or Elgin Baylor.

4.2 Relation extraction

Regular expressions were defined on the syntactic trees of sentences. These trees were obtained by parsing a significant portion of the background collection of QA4MRE 2012 with the Stanford Parser. We first defined a set of discriminating clue words (Mark) for each of the aforementioned relations, based on the selected corpus. We then developed a series of syntactic rules implemented in the Tregex formalism [5], which allows tgrep-type patterns to be written for matching tree node configurations. Constraints in rules are defined on the left, right, child and parent nodes of the Mark, and concern the expected types of phrases and POS categories.
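The role of the Mark can be illustrated with a deliberately simplified sketch. Unlike the rules described above, it works on the surface string with ordinary regular expressions rather than on parse trees with Tregex, so it ignores all syntactic constraints. The two patterns and the function name `extract_relations` are assumptions made for the illustration.

```python
import re

# Illustrative surface patterns (assumptions): one Mark per relation type,
# with the roles of the left and right members of the relation.
PATTERNS = {
    "causality": (
        re.compile(r"^(?P<left>.+?)\s+because\s+(?P<right>.+?)\.?$"),
        ("consequence", "cause"),
    ),
    "example": (
        re.compile(r"^(?P<left>.+?)\s+such as\s+(?P<right>.+?)\.?$"),
        ("concept", "list"),
    ),
}


def extract_relations(sentence):
    """Return the relations whose Mark occurs in the sentence."""
    found = []
    for rel_type, (pattern, (left_role, right_role)) in PATTERNS.items():
        match = pattern.match(sentence.strip())
        if match:
            found.append({
                "type": rel_type,
                left_role: match.group("left"),
                right_role: match.group("right"),
            })
    return found
```

Applied to the causality example of section 4.1, this sketch splits the sentence around because into a consequence and a cause. A tree-based rule would additionally check the categories of the nodes surrounding the Mark.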

In total, we defined a set of 42 rules to extract the different types of relations.

To evaluate the rule-based extraction, we manually annotated the four texts of each topic of the Main Task 2013 evaluation and the nine texts of the Entrance Exams task. Twenty-five documents were thus annotated, containing 162 causality relations, 53 opinions, 114 definitions and 57 examples, for a total of 416 relations.

We then compared the manual annotation to the one produced by our system on these documents. To do so, we categorized the relations found into several types. If the relation annotated by hand is strictly the same as the relation found automatically, i.e. same type and same related members, this relation is classified as "exact". If the relation is incomplete, i.e. if there are missing or extra words in the related members, the relation is classified as "loose". We consider both kinds of relations as correct in a lenient evaluation. If the type of the automatically annotated relation is wrong, it is "incorrect". Finally, we compute a fourth counter: the number of "missed" relations, calculated as the difference between the number of manual annotations and the sum of the numbers of "exact" and "loose" relations.
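These four counters can be computed as in the sketch below. It assumes that each system relation has already been aligned with the gold relation it should be compared to, and that a relation is a pair (type, members). The precision and recall formulas are the usual definitions and are an assumption here, since the text does not spell them out; the function names are hypothetical.

```python
def classify_pair(gold, system):
    """Label one system relation against its aligned gold relation."""
    gold_type, gold_members = gold
    sys_type, sys_members = system
    if gold_type != sys_type:
        return "incorrect"
    if gold_members == sys_members:
        return "exact"
    return "loose"  # same type, but missing or extra words in the members


def evaluate(aligned_pairs, n_gold):
    """Count exact/loose/incorrect/missed and derive strict and lenient scores."""
    counts = {"exact": 0, "loose": 0, "incorrect": 0}
    for gold, system in aligned_pairs:
        counts[classify_pair(gold, system)] += 1
    lenient = counts["exact"] + counts["loose"]
    counts["missed"] = n_gold - lenient
    n_system = len(aligned_pairs)
    return {
        "counts": counts,
        "strict_precision": counts["exact"] / n_system if n_system else 0.0,
        "lenient_precision": lenient / n_system if n_system else 0.0,
        "lenient_recall": lenient / n_gold if n_gold else 0.0,
    }
```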

Results are given in Tables 1 and 2. We obtain very good precision in the lenient evaluation, which shows that relation types are well identified. As expected, recall is lower, but remains reasonable.

Tables 1 and 2: relation extraction results per relation type (Causality, Opinion, Definition, Example).

5 Passage and answer weighting

5.1 QALC4MRE strategy

We apply the weighting scheme of [1] for sentences according to the question words and answers, named PREP, the overlap of weighted common words between a sentence and an answer, and the TERp and treeEdit distances between a sentence and a hypothesis.

To select answers, we give priority to the passage weight, and secondarily to the answer weight, and define several combinations of these weights:

- freqTop: the most frequent answer in the n top sentences. If several answers are tied, the answer in the best sentence is selected, and if several candidate answers remain in the same sentence, the answer with the best weight is selected.
- maxS: the most frequent answer in the n top sentences which contain a candidate answer, with the same options in case of a tie.
- maxSTop: the best answer in the n top sentences.
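The three selection schemes can be sketched as follows. The sketch assumes that sentences are already sorted by decreasing passage weight and that each one carries a list of (answer, answer weight) candidates. The reading of maxSTop as "best weighted answer of the best sentence that contains a candidate" is an interpretation, and all function and field names are hypothetical.

```python
from collections import Counter


def _most_frequent(sentences):
    """Most frequent answer; ties broken by sentence rank, then answer weight."""
    freq = Counter(a for s in sentences for a, _ in s["answers"])
    if not freq:
        return None
    top = max(freq.values())
    tied = {a for a, count in freq.items() if count == top}
    for s in sentences:  # sentences are sorted best first
        here = [(w, a) for a, w in s["answers"] if a in tied]
        if here:
            return max(here)[1]
    return None


def freq_top(ranked, n):
    """freqTop: vote among the n top sentences."""
    return _most_frequent(ranked[:n])


def max_s(ranked, n):
    """maxS: vote among the n top sentences that contain a candidate answer."""
    return _most_frequent([s for s in ranked if s["answers"]][:n])


def max_s_top(ranked, n):
    """maxSTop (assumed reading): best answer of the best sentence in the top n."""
    for s in ranked[:n]:
        if s["answers"]:
            return max(s["answers"], key=lambda aw: aw[1])[0]
    return None
```

The difference between freqTop and maxS only shows when some of the top sentences contain no candidate answer: maxS then looks further down the ranking to fill its n sentences.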

5.2 Dictionary-based passage retrieval

In a question answering system, passage retrieval aims at extracting, from a relevant document, the short text excerpt most likely to contain the answer. For the most realistic questions, direct matching of the surface form of the query and text sentences is not sufficient. As one of the most challenging and important processes in a QA system, passage retrieval would thus benefit from a more semantic approach.

We propose a passage retrieval method focusing on finding deep semantic links between words. We view a dictionary entry as a kind of word tree structure: taken as a bag of words, the definition of a word makes up its children. The words in the definition of a child are in turn this child's own children, and so on. From this point forward, we designate as words only lemmas of verbs, nouns, adjectives, adverbs and pronouns that are not stop-words. We assume a single purely textual document. Document words are words in the document.

Indexing the document. This document pre-processing phase builds an index of all the words in the document and their descendants in a given dictionary. This is similar to the index expansion of Attardi et al. [8], except that we use dictionaries and not background documents. An entry in this index is composed of:

- a word w (the key in the index);
- a list Inv(w) of pairs (index of a sentence containing w in the document; index of w in this sentence): this is standard inverted indexing;
- the tree T(w) of w's word descendants (implemented as pointers to the entries of w's children);
- a list Anc(w) of document word ancestors, i.e. pairs (w2; d) of a document word w2 and a depth d, such that (w2; d) ∈ Anc(w) if and only if w can be found in w2's tree at depth d (for example, at depth 2, we look at the children of the children of w2, and w is among them).

To index a given word w, we check whether w is already in the index (if not, we build and add the entry), and we update the entry recursively, using an auxiliary children update procedure UPDATE in the main procedure INDEX(w, d, doc_ancestor):

1. use w as the key;
2. if w is a document word:
   (a) add to Inv(w) the pair (index of S; index of w in S), where S is the sentence containing w;
   (b) add (w; 0) to Anc(w): w is a document word, and it is the root of its own tree (the only node of depth 0);
3. build T(w) with an update procedure UPDATE(w, dmax, doc_ancestor), which we define below.

In dictionaries, traversing all the words in a definition tree might not terminate, because there are cycles: it can happen that a word appears in the definition of words of its own definition. We therefore explore at most dmax levels of depth when building T(w) for any w.

Let us now define UPDATE(w, d, doc_ancestor), which updates T(w):

1. look up the definition of w in the dictionary; if it is not found, T(w) is left untouched;
2. run INDEX(wc, d − 1) if needed (d > 1 and wc not indexed), for each child wc in the definition;
3. store the pointers to the words of the definition in T(w);
4. add (doc_ancestor; dmax − d) to each Anc(wc).
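The effect of these two procedures on Inv and Anc can be illustrated with a flattened sketch. It is not the recursive INDEX/UPDATE implementation: it expands each document word level by level instead of sharing tree pointers, and it counts the children of a document word as depth 1, following the depth example given for Anc(w). The dictionary is assumed to be a plain mapping from a word to the list of words in its definition, and `build_index` is a hypothetical name.

```python
from collections import defaultdict


def build_index(sentences, dictionary, dmax):
    """Build Inv(w) and Anc(w) for a document given as lists of lemmas.

    dictionary maps a word to the words of its definition (its children).
    """
    inv = defaultdict(list)  # Inv(w): (sentence index, position) pairs
    anc = defaultdict(set)   # Anc(w): (document word, depth) pairs
    for s_idx, sentence in enumerate(sentences):
        for pos, word in enumerate(sentence):
            inv[word].append((s_idx, pos))
            anc[word].add((word, 0))  # a document word is the root of its tree
            frontier = {word}
            # Bounding the depth by dmax guarantees termination despite cycles.
            for depth in range(1, dmax + 1):
                next_frontier = set()
                for parent in frontier:
                    for child in dictionary.get(parent, []):
                        anc[child].add((word, depth))
                        next_frontier.add(child)
                frontier = next_frontier
    return inv, anc
```

With a toy dictionary in which cat is defined by animal and pet, and pet by animal and home, the word animal receives cat as an ancestor at depths 1 and 2, which is the information the similarity computation needs later.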

To build the complete index of the document, we simply run INDEX(w, dmax, w) for each word w of each sentence (we use Stanford CoreNLP for tokenization and tagging [9]). This is the basis of our indexing, bar minor implementation details (re-indexing in case we need to explore an indexed word at a greater depth, handling of multiple senses and POS tags, etc.).

Passage retrieval. We first consider the words of the query, then use the index to score their relevance, and finally compute a density-based sliding window ranking function to retrieve passages.

For each word wq in the query, we run a version of INDEX(wq, dmax, NIL) which does not update the document ancestors Anc(w) (as the query word is not truly a document word). In T(wq), we find descendants w of wq which were previously built during the indexing phase and thus have a non-empty Anc(w): their document word ancestors, which are essentially the document words that initiated the access to w in the dictionary. We can compute a similarity between wq and those document words, thereby rating the relevance of document words relative to the query word:

$$\mathrm{Sim}(w_q, (w_{doc\_anc}, d)) = \frac{\mathrm{idf}(w_{doc\_anc})}{\mathit{base}^{(d_{min} + d)}} \tag{1}$$

$$d_{min} = \min_{w_c \in T(w_q) \text{ at depth } d_c \;\mid\; w_{doc\_anc} \in Anc(w_c)} (d_c) \tag{2}$$

We choose base depending on how strongly we want to penalize words as we go deeper in the tree. We found base = 2 to be a good start, but the final system uses the number of children at the depth of the closest child containing wdoc_anc in Anc. The intuition is that the more words are used in the definition of w, the less confident we are that each definition word is semantically related to w. We compute the similarity for each wq in the query and each wdoc in the document word ancestors, and sum over the wq to obtain a relevance score for the document word:

$$\mathrm{Relevance}(w_{doc}) = \sum_{w_q \in query} \max_{d} \, \mathrm{Sim}(w_q, (w_{doc}, d)) \tag{3}$$

Finally, we select candidate passages with a sliding window of 3 consecutive sentences, and rank them using a method similar to SiteQ's density-based scoring function described in [10], using Relevance as the weight of keywords.
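The relevance computation can be sketched as below, under one possible reading of the formulas: for each query word, dmin is the smallest depth at which a descendant having the document word among its ancestors is found, and the maximum over d is reached for the smallest ancestor depth d. The sketch uses a fixed base of 2 rather than the number of children used in the final system, takes Anc as a plain mapping from a word to a set of (document word, depth) pairs, and treats the query word itself as depth 0 of its own tree. All function and parameter names are hypothetical.

```python
def descendants_by_depth(word, dictionary, dmax):
    """Yield (descendant, depth) pairs of word, the word itself at depth 0."""
    frontier = {word}
    for depth in range(dmax + 1):
        for w in frontier:
            yield w, depth
        frontier = {c for w in frontier for c in dictionary.get(w, [])}


def relevance(query, dictionary, anc, idf, dmax, base=2.0):
    """Relevance(wdoc): sum over query words of the best Sim(wq, (wdoc, d))."""
    scores = {}
    for wq in query:
        dmin, best_d = {}, {}
        for wc, dc in descendants_by_depth(wq, dictionary, dmax):
            for wdoc, d in anc.get(wc, ()):
                dmin[wdoc] = min(dmin.get(wdoc, dc), dc)
                best_d[wdoc] = min(best_d.get(wdoc, d), d)
        for wdoc in dmin:
            # The smallest exponent gives the largest similarity for this wq.
            sim = idf.get(wdoc, 1.0) / base ** (dmin[wdoc] + best_d[wdoc])
            scores[wdoc] = scores.get(wdoc, 0.0) + sim
    return scores
```

A document word that appears directly in the definition of a query word thus scores idf/2, while a document word that only shares a definition word with a deeper descendant is penalized by a higher power of the base.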

6 Answer selection related to discourse relation

To select an answer which takes into account question category and discourse relations, we combine the weights and the discourse relations of the passages. First, we filter relations according to the category of the question and the presence, in the relations, of the answer associated with the passage: only relations whose type is the same as the category of the question and which contain an answer are kept. Then, passages are sorted according to their weights. Among the top n passages, if any of them has a relation, the answer associated with the best weighted passage is selected. Otherwise, we consider only passages containing relations and select the answer associated with the best of them.

7 Results

7.1 Main task and Alzheimer task

We can see that, while textual entailment distances between a hypothesis and a sentence are useful to select an answer in the Alzheimer task, they are outperformed by lexical overlap weighting in the main task. This may be due to differences in answer length between the two tasks: shorter answers in the Alzheimer task favour measures based on sentence structure.

We obtained analogous results on the 2013 evaluation for the Alzheimer task (the best c@1 is 0.42, for treeEdit combined with freqTop), while results on the main task are lower, with a best c@1 of 0.28 for the combination of PREP with maxS. This may be due to the new kinds of questions introduced this year, and to the new "do not know" kind of answer.

7.2 Entrance exams task

The form of the task is essentially the same as the main task. Multiple-choice questions are taken from reading tests of Japanese university entrance exams. A crucial difference from the other QA4MRE tasks is that background text collections are not provided.

Given the difficulty of the questions and the lack of background knowledge, passage retrieval quickly appeared as a strong bottleneck for any question answering system attempting to solve the task. That is why we decided to design the dictionary-based lexical expansion described in section 5.2 and to use Simple English Wiktionary [11] as the dictionary. Simple English Wiktionary is a collaborative dictionary written in a simplified form of English, primarily for children and English learners. Its definitions are clear and concise, get to the essence of the word without superfluous details, and seem well suited to acquiring the "common sense knowledge" we need to solve this task [12].

We submitted a run at QA4MRE 2013 which used only this passage retrieval system and very simple heuristics to choose an answer. The results were worse than the random baseline, due to bugs in the early implementation and to the discriminating role that a passage retrieval system alone cannot fill, as we will see below. We instead present the evaluation of our system for the sole task of passage retrieval, on the 9 reading tests (46 questions) of the test set, following the quantitative evaluation methodology of Tellex et al. [13]. We first annotated passages of the test set (which 2-to-4-sentence passage must be read to answer the question) to create a gold standard. We found it quite straightforward to limit those annotations to contiguous passages, with only 2 questions needing disjoint passages. We then implemented several runs:

- MITRE as a weak baseline: a simple word overlap algorithm [14];
- SiteQ as a strong baseline: sentences are weighted based on query term density [10], and keyword forms include lemmas, stems, and synonyms/hyponyms from WordNet synsets;
- SI(dmax), our Simple English Wiktionary-based indexing system, parameterized by dmax.

We used the following measures:

- MRR: mean reciprocal rank;
- p@n: number of correct passages found in the top n;
- nf: number of correct passages which were not found at all.

Results are shown in Table 4. Our system outperforms both baselines significantly on all types of tasks and measures. The difference is most noticeable when the systems do not have access to the answer choices, which is really what we seek for the broader view of question answering. What is also interesting is the increase in performance for SI as we increase the maximum depth of search in the dictionary. This seems to confirm that Simple English Wiktionary fits this task well and that our score functions scale correctly with the amount of knowledge that it provides. Furthermore, although the question paired with the correct answer seems to yield a more reliable passage selection than the question paired with an incorrect answer, the difference is small, so it is unlikely that we could differentiate right and wrong answers by only looking at the passages they yield. This can be explained by the relatively high difficulty of the test: no answer choice seems completely absurd, and each is always related in some way to the relevant passage in the text. This confirms the well-known necessity of deeper answer processing to make the final call, which our earliest run lacked.
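The three measures listed above can be computed as in the following sketch. It assumes that, for each question, the rank of the first correct passage in the system output is known (1 for the top passage) and is set to None when no correct passage was returned. The function name `passage_metrics` is hypothetical.

```python
def passage_metrics(ranks, n):
    """Compute MRR, p@n and nf from per-question ranks of the correct passage.

    ranks: list with, per question, the 1-based rank of the first correct
    passage, or None if it was not found at all.
    """
    total = len(ranks)
    found = [r for r in ranks if r is not None]
    mrr = sum(1.0 / r for r in found) / total if total else 0.0
    p_at_n = sum(1 for r in found if r <= n)  # a count, as defined above
    not_found = total - len(found)
    return {"MRR": mrr, "p@n": p_at_n, "nf": not_found}
```

For four questions with ranks 1, 2, not found and 4, and n = 3, this gives an MRR of 0.4375, a p@3 of 2 and an nf of 1.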

8 Conclusion and perspectives

This paper describes the different experiments we conducted for QA4MRE 2013. We worked on two problems. The purpose of the first one was answering complex questions by recognizing discourse relations. The categorization of questions shows very good results, while the discourse relation recognition results show that this approach merits further consideration. We will therefore work on the improvement of this module and on the integration of this criterion for selecting an answer. The second problem we studied was passage retrieval, especially for answering entrance exams, where the semantic distance between questions, answers and text is large. We proposed indexing passages with an expansion of question and answer words computed by following the recursive definitions of words in a dictionary. This module shows good results. We now have to evaluate this approach on the other tasks and improve answer selection within the best passages.

References

1. Grau, B., Pho, V.M., Ligozat, A.L., Ben Abacha, A., Zweigenbaum, P., Chowdhury, F.: Adaptation of LIMSI's QALC for QA4MRE. In: CLEF 2012 Working Notes on QA4MRE. (2012)

2. Li, X., Roth, D.: Learning Question Classifiers. In: COLING'02. (2002)

3. Ligozat, A.L.: Question classification transfer. In: ACL 2013. (2013)

4. Klein, D., Manning, C.D.: Accurate unlexicalized parsing. In: Proceedings of the 41st Annual Meeting on Association for Computational Linguistics - Volume 1. ACL '03, Stroudsburg, PA, USA, Association for Computational Linguistics (2003) 423–430

5. Levy, R., Andrew, G.: Tregex and Tsurgeon: tools for querying and manipulating tree data structures. In: Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC 2006). (2006)

6. Chang, C.C., Lin, C.J.: LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology 2 (2011) 27:1–27:27. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

7. Schmid, H.: Improvements in Part-of-Speech Tagging with an Application to German. In: Proceedings of the ACL SIGDAT-Workshop. (1995)

8. Attardi, G., Atzori, L., Simi, M.: Index expansion for machine reading and question answering. In: CLEF (Online Working Notes/Labs/Workshop). (2012)

9. Toutanova, K., Klein, D., Manning, C.D., Singer, Y.: Feature-rich part-of-speech tagging with a cyclic dependency network. In: Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, Association for Computational Linguistics (2003) 173–180

10. Lee, G.G., Seo, J., Lee, S., Jung, H., Cho, B.H., Lee, C., Kwak, B.K., Cha, J., Kim, D., An, J., et al.: SiteQ: Engineering high performance QA system using lexico-semantic pattern matching and shallow NLP. In: Proceedings of the Tenth Text REtrieval Conference (TREC 2001). Volume 442. (2001)

11. Wikimedia Foundation: Simple English Wiktionary

12. Gleize, M., Grau, B.: LIMSIILES: Basic English substitution for student answer assessment at SemEval 2013. In: Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), Atlanta, Georgia, USA, Association for Computational Linguistics (June 2013) 598–602

13. Tellex, S., Katz, B., Lin, J., Fernandes, A., Marton, G.: Quantitative evaluation of passage retrieval algorithms for question answering. In: Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM (2003) 41–47

14. Light, M., Mann, G.S., Riloff, E., Breck, E.: Analyses for elucidating current question answering technology. Natural Language Engineering 7(04) (2001) 325–342