A Lexical-Semantic Approach to AVE

Óscar Ferrández, Rafael Muñoz, Manuel Palomar
Department of Computing Languages and Systems
University of Alicante
San Vicente del Raspeig, Alicante 03690, Spain
mpalomarg@dlsi.ua.es

Keywords: Question Answering, Answer Validation, Recognizing Textual Entailment

This paper discusses a system capable of detecting whether answers to specific questions are supported by snippets, all provided by Question Answering (QA) systems. This task is known as the Answer Validation Exercise (AVE) track within the Cross-Language Evaluation Forum (CLEF). The system uses a set of regular expressions to join the question and the answer into an affirmative sentence, and afterwards applies several lexical-semantic inferences to detect whether the meaning of this sentence can be inferred from the meaning of the supporting text. Throughout the paper we present and discuss the different system components together with the results obtained. Moreover, we place special emphasis on the language-independent capabilities of some of them. As a result, we are able to apply our techniques to both Spanish and English corpora.

Algorithms, Semantic Similarity, Experimentation, Measurement, Performance

1 Introduction

The three-year-old Cross-Language Evaluation Forum (CLEF) track, the Answer Validation Exercise (AVE), provides an evaluation framework for identifying those answers that are supported by the question and the passage from which they were extracted. This kind of inference will help Question Answering (QA) systems to increase their performance, and will help humans to assess QA system output.

Traditionally, approaches to validating the answers of QA systems have been inspired by textual entailment recognition techniques [13, 14]. Moreover, simple techniques based on word overlap and shallow lexical inferences (e.g. linear distance) have obtained competitive results [6] and are considered a suitable starting point for further research.

The system described in this paper integrates several inferences from different knowledge sources. The base of the system consists of lexical deductions without any semantic knowledge; several modules were then added in order to compute more sophisticated deductions (e.g. WordNet relations, named entity correspondences and relations between verbs).

The paper is structured as follows. The next section presents our approach for our participation in the Answer Validation Exercise (AVE). The third section describes the experiments carried out and the results obtained. Finally, the fourth section presents the conclusions and proposes future work based on our current research.

2 Validating the Answers

To obtain promising results in a short time, we built a system that relies on only a small number of external resources, since heavy use of such resources would compromise the system's speed. The system detects unidirectional meaning relations between the affirmative sentences formed from the question and the answer and the supporting texts supplied by QA systems.

Figure 1 depicts the architecture of the system, showing its modules and the stages of the meaning-inference process between the texts.

[Figure 1: System architecture. Inputs: question, answer and supporting text. Preprocessing: sentence creation (regular expressions). Textual entailment: deductions (WordNet, VerbNet, VerbOcean, open-domain NER). Output: answer validation decision.]

The process of validating the answers involves two main phases: (i) the preprocessing stage, which is responsible for building an affirmative sentence merging the question and the answer by means of a set of regular expressions, and (ii) the pure textual entailment component, which detects lexical-semantic inferences between a pair of texts.

2.1 Preprocessing

Each query and answer provided within both the development and test corpora was preprocessed in order to obtain a well-formed affirmative sentence (this sentence will be called the hypothesis, or simply H, in order to follow the textual entailment methodology and terminology first proposed in [3]). For this purpose, we extended the set of regular expressions proposed in our previous participation in AVE [4]. The extension was made by analysing the kinds of questions found in the development corpus and integrating new regular expressions capable of handling the whole set of questions. For both the development and test sets, every question is handled by one regular expression; however, this does not imply that the resulting affirmative sentence is grammatically well-formed, since that obviously depends on the correctness of the answer.
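As an illustration of this step, the following minimal Python sketch shows how a question and an answer might be merged into a hypothesis with regular expressions. The two patterns, the `PATTERNS` table and the `build_hypothesis` helper are hypothetical and are not the expressions used by the system; they only cover simple "Who/What is ..." and "Where is ..." questions.

```python
import re

# Hypothetical (pattern, template) pairs; the real system uses a larger,
# corpus-driven set of regular expressions that is not reproduced here.
PATTERNS = [
    (re.compile(r"^(?:Who|What) (?P<verb>is|was|are|were) (?P<rest>.+?)\?$", re.I),
     "{answer} {verb} {rest}."),
    (re.compile(r"^Where (?P<verb>is|was|are|were) (?P<rest>.+?)\?$", re.I),
     "{rest} {verb} in {answer}."),
]


def build_hypothesis(question, answer):
    """Merge a question and a candidate answer into an affirmative sentence."""
    q = question.strip()
    a = answer.strip()
    for pattern, template in PATTERNS:
        m = pattern.match(q)
        if m:
            return template.format(answer=a, **m.groupdict())
    # Fallback when no pattern applies: simple concatenation.
    return f"{q} {a}"


print(build_hypothesis("Who was the first president of the USA?", "George Washington"))
```

As the text notes, the output is only as well-formed as the answer allows: an incorrect or malformed answer string yields an ungrammatical hypothesis.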

2.2 The Textual Entailment Component

In order to tackle the AVE task, we first created a base system making use of well-known techniques based on lexical inferences. These techniques have already been used successfully by some researchers (including ourselves) in the task of recognising textual entailment relations [1, 5, 11]. Later, to adjust the system to the idiosyncrasies of the AVE task, we defined some constraints that the pair of texts (hypothesis and supporting text) involved in the meaning inference must fulfil.

2.3 The Base

The base system relies on the computation of a wide variety of lexical measures over the lemmas of the tokens that make up the texts. Thus, prior to the calculation of the measures, all texts are tokenized, lemmatized and morphologically analysed.

From the whole set of measures, we select those that are most significant according to the information gain that they provide to a machine learning classifier. For this purpose, a Bayesian Net classifier implemented in Weka [20] was used, considering each measure as a feature. The most meaningful measures, which compose the feature set, are listed next¹:

• Levenshtein distance: the function match(i) is calculated for each item i of the hypothesis (H) as:

\[
match(i) =
\begin{cases}
1 & \text{if } \exists j \in T:\ Lv(i,j) = 0, \\
0.9 & \text{if } \nexists j \in T:\ Lv(i,j) = 0 \ \wedge\ \exists k \in T:\ Lv(i,k) = 1, \\
\max\limits_{j \in T} \dfrac{1}{Lv(i,j)} & \text{otherwise,}
\end{cases}
\tag{1}
\]

where $Lv(i,j)$ represents the Levenshtein distance between i and j, and j ranges over the items of the supporting text (T). The cost of an insertion, deletion or substitution is equal to one, and the weight assigned to match(i) when $Lv(i,j) = 1$ has been obtained empirically.

• Needleman-Wunsch algorithm [15]: similar to the basic Levenshtein distance, but adding a variable cost adjustment to the cost of an insertion or deletion. Some experiments were done in order to adjust the cost of a gap, a penalty of 3 being the best value.

• Smith-Waterman algorithm: a well-known dynamic programming algorithm for performing local sequence alignment and determining similar regions between sequences. The algorithm was first proposed in [18] and consists of two steps: (i) calculate the similarity matrix score; and (ii) according to the dynamic programming method, trace back the similarity matrix to search for the optimal alignment.

For two sequences $SQ_1$ and $SQ_2$, the optimal alignment score of the two sub-sequences $SQ_1[1] \ldots SQ_1[i]$ and $SQ_2[1] \ldots SQ_2[j]$ is the calculation of $D(i,j)$, defined as:

\[
D(i,j) = \max
\begin{cases}
0 & \text{start over,} \\
D(i-1,j-1) - f(SQ_1[i], SQ_2[j]) & \text{substitution or copy,} \\
D(i-1,j) - GAP & \text{insertion,} \\
D(i,j-1) - GAP & \text{deletion.}
\end{cases}
\tag{2}
\]

It permits two adjustable parameters regarding substitutions and copies for an alphabet mapping (the f function) and also allows costs to be attributed to a GAP for insertions or deletions. In our experiments we empirically set the values 0.3, -1 and 2 for a gap, copy and substitution respectively.

• Jaro distance: this metric comes from the work presented in [8] and measures the similarity between two strings taking into account spelling variations. The following equation describes the way that it obtains the similarities:

\[
Jaro(s_1, s_2) = \frac{1}{3}\left(\frac{m}{|s_1|} + \frac{m}{|s_2|} + \frac{m - t}{m}\right)
\tag{3}
\]

where $s_1$ and $s_2$ are the strings to be compared, $|s_1|$ and $|s_2|$ their respective lengths, m the number of matching characters, considering only those that are not farther apart than $\lfloor \max(|s_1|,|s_2|)/2 \rfloor - 1$, and t the number of transpositions, computed as the number of matching characters that occur in a different order, divided by two.

• Euclidean distance: the traditional definition measures the distance between two points $P = (p_1, p_2, \ldots, p_n)$ and $Q = (q_1, q_2, \ldots, q_n)$ in Euclidean n-space as:

\[
\sqrt{(p_1 - q_1)^2 + \cdots + (p_n - q_n)^2} = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}
\tag{4}
\]

In order to deal with strings, we set n as the number of distinct items in either of the two strings, and $p_i$, $q_i$ as the number of times that each item appears in each string respectively.

• Jaccard similarity coefficient: a statistical coefficient for comparing the similarity and diversity of sample sets. It is defined as the size of the intersection divided by the size of the union of the sample sets:

\[
J(A, B) = \frac{|A \cap B|}{|A \cup B|}
\tag{5}
\]

In our case, we compute this coefficient representing each string as a Jaccard vector. This metric was first introduced and detailed in [7].

• Dice's coefficient: for sets X and Y of items extracted from the two strings to be processed, the coefficient is defined as:

\[
D = \frac{2\,|X \cap Y|}{|X| + |Y|}
\tag{6}
\]

• Cosine similarity: a common vector-based similarity. The input strings are transformed into vector space and it is computed as follows:

\[
\cos(\vec{x}, \vec{y}) = \frac{\vec{x} \cdot \vec{y}}{\|\vec{x}\| \cdot \|\vec{y}\|}
\tag{7}
\]

• IDF specificity: we determine the specificity of a word using the inverse document frequency (IDF) introduced in [19], which is defined as the total number of documents in the corpus divided by the total number of documents that include that word. In our experiments, we derive the document frequencies from the document collections used for the tracks reported within the Cross-Language Evaluation Forum (CLEF) [16], specifically the LA Times 94 and Glasgow Herald 95 collections, which contain a total of 169,477 documents. The IDF measure helps the system to weight each word according to its specificity, so that words with higher IDF values are more relevant to the entailment decision.

• JWSL: in order to discover word meaning relations that cannot be detected directly from orthographic variations, we exploit the lexical-semantic resource called WordNet [12]. Relations such as synonymy, hypernymy, and semantic paths that connect two concepts can be found by exploiting its taxonomy. Also, there are many implementations of similarity and relatedness measures between words based on WordNet. In our experiments, we have used the Java WordNet Similarity Library (JWSL²), which implements some of the most common semantic similarity measures. This feature automatically derives a score (the maximum score obtained from all similarity measures implemented in JWSL) that shows the degree of similarity between the nouns, verbs and adjectives of two texts.

¹ For some measures we use the implementation provided by the SimMetrics library (http://www.dcs.shef.ac.uk/~sam/simmetrics.html).
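The piecewise match(i) function of the Levenshtein feature can be sketched in Python as follows. This is only an illustrative sketch: the helper names are hypothetical, items are assumed to be lemma strings, and averaging match(i) over the hypothesis is an assumption of this example, since the text does not state how the per-item values are aggregated.

```python
def levenshtein(a, b):
    """Edit distance with unit cost for insertion, deletion and substitution."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution or copy
        prev = cur
    return prev[-1]


def match(item, text_items, near_weight=0.9):
    """match(i): 1 for an exact match in T, 0.9 when the closest item is at
    distance 1, otherwise the maximum of 1 / Lv(i, j) over the items j of T."""
    if not text_items:
        return 0.0
    best = min(levenshtein(item, t) for t in text_items)
    if best == 0:
        return 1.0
    if best == 1:
        return near_weight
    return 1.0 / best  # max over j of 1/Lv equals 1/min over j of Lv


def levenshtein_feature(hypothesis_items, text_items):
    """Assumed aggregation: mean of match(i) over the hypothesis items."""
    if not hypothesis_items:
        return 0.0
    return sum(match(h, text_items) for h in hypothesis_items) / len(hypothesis_items)
```

The 0.9 weight is the empirically obtained value reported in the text; it is exposed as a parameter here so that other values can be tried.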

Other measures that were considered but later discarded because they introduced noise into the system were: bi- and tri-grams of letters, Block distance and SoundEx distance.
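Returning to the Smith-Waterman measure listed in this section, the first of its two steps (filling the score matrix) can be sketched in Python as shown below. This is a hedged sketch, not the SimMetrics implementation: the function name is hypothetical, the traceback step is omitted, and the values 0.3, -1 and 2 are interpreted as costs that are subtracted, following the sign convention of the recurrence for D(i, j).

```python
def smith_waterman_score(seq1, seq2, gap=0.3, copy=-1.0, substitution=2.0):
    """Best local alignment score under the recurrence
    D(i,j) = max(0, D(i-1,j-1) - f, D(i-1,j) - GAP, D(i,j-1) - GAP),
    where f is the copy cost for equal symbols and the substitution cost
    otherwise. A negative copy cost therefore rewards a match."""
    best = 0.0
    prev = [0.0] * (len(seq2) + 1)
    for i in range(1, len(seq1) + 1):
        cur = [0.0]
        for j in range(1, len(seq2) + 1):
            f = copy if seq1[i - 1] == seq2[j - 1] else substitution
            score = max(0.0,                # start over
                        prev[j - 1] - f,    # substitution or copy
                        prev[j] - gap,      # insertion
                        cur[j - 1] - gap)   # deletion
            cur.append(score)
            best = max(best, score)
        prev = cur
    return best


print(smith_waterman_score("abc", "abc"))
```

The function accepts any pair of sequences, so it can be applied to characters or to lists of lemmas; which of the two the system uses is not stated in the text.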

2.4 The Constraints

In addition to the aforementioned inferences, we found it very appealing to integrate into the system some constraints that could support the final decision in most cases.

• The Named Entities: this constraint is based on the detection, presence and absence of Named Entities (NEs). Although the previous measures take every token into account, including entities, they do not capture the importance of the presence or absence of an entity (e.g. when there is an entity in the hypothesis but the same entity is not present in the supporting text). This idea comes from the work presented in [17], where the authors successfully build their system using only the knowledge supplied by the recognition of NEs. In our case, we establish the following constraint: "In order to be considered as a candidate entailment pair, the hypothesis' entities must also appear within the supporting text." This constraint is applied before the similarity measures are launched, so only pairs containing the same entities are considered.

In our experiments, we use the NERUA system [10], an open-domain NE recognizer which was trained on the corpora provided in the CoNLL-2002 Shared Task³ and the CoNLL-2003 Shared Task⁴ in order to recognise Spanish and English entities respectively.

• The Verbs: the other important elements in a sentence, apart from the NEs, are the verbs. Therefore, if we are able to detect whether the hypothesis' verbs are related to the supporting text's verbs, we can set another constraint reflecting this relatedness. To do this, we created two Java wrappers for the VerbNet [9] and VerbOcean [2] resources. These wrappers allow us to detect semantic relationships between verbs.

If every verb in the hypothesis (auxiliary verbs are not considered) can be related to one or more verbs in the supporting text, the pair passes this constraint. Two verbs are related if: (i) they have the same lemma or are synonyms according to WordNet; (ii) they belong to the same VerbNet class or a subclass of their classes; (iii) there is a relation in VerbOcean⁵ that connects them.
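The two constraints can be sketched in Python as simple pre-filters. Everything named here is hypothetical: the lookup tables are toy stand-ins with made-up entries, not the WordNet, VerbNet or VerbOcean resources or the Java wrappers mentioned above. The sketch also assumes that any one of the three verb conditions is enough to relate two verbs, that entity matching is case-insensitive and accepts containment, and it does not model VerbNet subclasses.

```python
# Toy stand-ins for WordNet synonyms, VerbNet classes and VerbOcean relations.
# All entries and class labels below are made up for illustration only.
SYNONYMS = {frozenset({"buy", "purchase"})}
VERBNET_CLASS = {"give": "class_A", "donate": "class_A"}
VERBOCEAN = {("marry", "divorce")}


def passes_entity_constraint(h_entities, t_entities):
    """Every hypothesis entity must equal, contain, or be contained in
    some entity of the supporting text (assumed matching policy)."""
    t_norm = [t.lower() for t in t_entities]
    return all(
        any(h.lower() in t or t in h.lower() for t in t_norm)
        for h in h_entities
    )


def verbs_related(v1, v2):
    """Related if same lemma or synonyms, same class, or linked in VerbOcean."""
    if v1 == v2 or frozenset({v1, v2}) in SYNONYMS:
        return True
    c1, c2 = VERBNET_CLASS.get(v1), VERBNET_CLASS.get(v2)
    if c1 is not None and c1 == c2:
        return True
    return (v1, v2) in VERBOCEAN or (v2, v1) in VERBOCEAN


def passes_verb_constraint(h_verbs, t_verbs):
    """Every (non-auxiliary) hypothesis verb must relate to some text verb."""
    return all(any(verbs_related(h, t) for t in t_verbs) for h in h_verbs)


def is_candidate_pair(h_entities, t_entities, h_verbs, t_verbs):
    """Only pairs passing both constraints go on to the similarity measures."""
    return (passes_entity_constraint(h_entities, t_entities)
            and passes_verb_constraint(h_verbs, t_verbs))
```

Because both checks run before any similarity measure is computed, pairs that fail either one are discarded cheaply, which is consistent with the reduction in corpus size and processing time reported in the text.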

Consequently, if a candidate pair passes the two previous constraints, it is processed by the measures presented in Section 2.3. This is done for both the development and test corpora. The development corpus is used to train a Bayesian Net classifier.

Another way we considered to integrate these constraints into the system was to add new features that indicate the matching coefficient between the entities and verbs according to the previous resources and strategy. Unfortunately, adding these new features to the classifier did not produce any improvement in the results. Furthermore, when these inferences are applied as prior constraints, both the size of the corpus and the processing time are strongly reduced.

² http://grid.deis.unical.it/similarity/
³ http://www.cnts.ua.ac.be/conll2002/ner/
⁴ http://www.cnts.ua.ac.be/conll2003/ner/
⁵ The VerbOcean relations considered are: similarity, strength and happens-before.

3 Experiments and Results

We set up several experiments⁶ according to the inferences presented in the previous sections of the paper:

• System Base (SB): comprises the basic measures shown in Section 2.3 together with the JWSL inference based on WordNet.

• SB+Entities Constraint (SB+EntC): adds to SB the constraint about the detection, presence and absence of NEs.

• SB+EntC+Verbs Constraint (SB+EntC+VerbC): includes all the previous inferences plus the constraint derived from the relationships between verbs.

• Baseline100: generated by setting all pairs as VALIDATED and randomly choosing the SELECTED values.

• Baseline50: 50% of the pairs were tagged as VALIDATED (no SELECTED values).

In order to achieve the best system training configuration, we tried several combinations of the development corpora available for both this edition and previous ones. The one that reached the best results (in a 10-fold cross-validation test) was the combination of the development corpora of the previous and current editions (AVE'07 and AVE'08, respectively).

The results show that a significant improvement is reached when the system considers the constraint about the NEs' inference. Unfortunately, although the constraint related to the verbs' relationships considerably reduced the size of the corpus and consequently the processing time, it did not yield any improvement except for the estimated QA performance. This suggests that a more complex treatment of verbs is needed, and that the coverage of the resources used should be extended by means of other complementary knowledge sources (e.g. inferences about semantic frames, rather than considering only the verbs, would improve these kinds of deductions).

Although the system makes use of language-dependent resources, its base as well as the NE recognizer components are language independent. This allowed us to apply the system to the Spanish corpora. However, this time just two experiments could be done:

• Spanish System Base (SB es): implements all measures presented in Section 2.3 except the one that uses WordNet⁷.

• SB es+Entities Constraint (SB+EntC es): adds to SB es the constraint about NEs, but in this case using the Spanish configuration of NERUA.

⁶ Some results presented in this section are not official because some experiments were carried out after the deadline.

⁷ This is because JWSL works with the English WordNet, and at present we do not have any implementation of these measures for the Spanish WordNet.

Finally, we would like to mention how the system establishes the SELECTED value. Since our system returns a numeric value to determine the validation of the answers, we decided to mark as SELECTED the pair with the highest positive score among all pairs that belong to the same question. In the event that two or more pairs share the highest score, one of them is randomly chosen and tagged as SELECTED.
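The selection rule just described can be sketched in Python as follows. The function name and the `(question_id, answer_id, score)` tuple layout are assumptions made for this example, and "positive" is interpreted as a score strictly greater than zero.

```python
import random


def choose_selected(scored_pairs, rng=random):
    """Return {question_id: answer_id} marking one SELECTED pair per question.

    scored_pairs is an iterable of (question_id, answer_id, score) tuples.
    Only pairs with a positive score compete; ties for the highest score
    are broken at random, as described in the text.
    """
    by_question = {}
    for question_id, answer_id, score in scored_pairs:
        if score > 0:
            by_question.setdefault(question_id, []).append((score, answer_id))

    selected = {}
    for question_id, candidates in by_question.items():
        top = max(score for score, _ in candidates)
        tied = [answer_id for score, answer_id in candidates if score == top]
        selected[question_id] = rng.choice(tied)
    return selected


pairs = [("q1", "a1", 0.2), ("q1", "a2", 0.9), ("q2", "b1", -0.5)]
print(choose_selected(pairs))
```

Questions whose pairs all have non-positive scores receive no SELECTED answer under this interpretation. Passing a seeded `random.Random` instance as `rng` makes the tie-breaking reproducible.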

4 Conclusions and Future Work

This paper describes a system capable of validating the answer to a given question according to a snippet that supposedly supports the answer. We present a basic configuration of the system and then add some constraints in order to enrich its knowledge and improve its results. The language-independent capabilities of some of the system's components are also demonstrated by applying them to both Spanish and English.

Future work may address the treatment of verbs as well as the detection of NEs. For instance, some heuristics regarding semantic verb frames could help the system to extend the coverage of verb relationships. Regarding the NE recognizer, currently we only detect a strict match between the hypothesis and supporting text entities, and whether one entity is contained in another. However, there are pairs in the corpora that contain the same entity expressed in different ways, and when this occurs the NE recognizer is unable to detect an inference between them (e.g. when an entity is referred to by its acronym). Therefore, subsequent work will focus on identifying deeper inference relations between entities, such as acronyms, date expansion, etc.

Acknowledgements

This research has been partially funded by the QALL-ME consortium, which is a 6th Framework Research Programme of the European Union (EU), contract number FP6-IST-033860, and by the Spanish Government under the project CICyT number TIN2006-1526-C06-01.

References

[1] Rod Adams, Gabriel Nicolae, Cristina Nicolae, and Sanda Harabagiu. Textual entailment through extended lexical overlap and lexico-semantic matching. In Proceedings of the ACL-PASCAL Workshop on Textual Entailment and Paraphrasing, pages 119-124, Prague, June 2007. Association for Computational Linguistics.

[2] Timothy Chklovski and Patrick Pantel. VerbOcean: Mining the web for fine-grained semantic verb relations. In Proceedings of Conference on Empirical Methods in Natural Language Processing (EMNLP-04), Barcelona, Spain, 2004.

[3] Ido Dagan and Oren Glickman. Probabilistic textual entailment: Generic applied modelling of language variability. In Proceedings of the PASCAL Workshop on Learning Methods for Text Understanding and Mining, Grenoble, France, 2004.

[4] Óscar Ferrández, Daniel Micol, Rafael Muñoz, and Manuel Palomar. The contribution of the University of Alicante to AVE 2007. In Working Notes of the CLEF 2007 Workshop, Budapest, Hungary, September 2007.

[5] Óscar Ferrández, Daniel Micol, Rafael Muñoz, and Manuel Palomar. A perspective-based approach for solving textual entailment recognition. In Proceedings of the ACL-PASCAL Workshop on Textual Entailment and Paraphrasing, pages 66-71, Prague, June 2007. Association for Computational Linguistics.

[6] Danilo Giampiccolo, Bernardo Magnini, Ido Dagan, and Bill Dolan. The third PASCAL recognizing textual entailment challenge. In Proceedings of the ACL-PASCAL Workshop on Textual Entailment and Paraphrasing, pages 1-9, Prague, Czech Republic, June 2007. Association for Computational Linguistics.

[7] Paul Jaccard. The distribution of the flora in the alpine zone. New Phytologist, 11(2):37-50, 1912.

[8] Matthew Jaro. Probabilistic linkage of large public health data file. Statistics in Medicine, 14:491-498, 1995.

[9] Karin Kipper, Anna Korhonen, Neville Ryant, and Martha Palmer. Extending VerbNet with novel verb classes. In Fifth International Conference on Language Resources and Evaluation (LREC 2006), Genova, Italy, June 2006.

[10] Kozareva, Ó. Ferrández, A. Montoyo, and Muñoz. Combining data-driven systems for improving named entity recognition. Data and Knowledge Engineering, 61(3):449-466, 2007.

[11] Prodromos Malakasiotis and Ion Androutsopoulos. Learning textual entailment using SVMs and string similarity measures. In Proceedings of the ACL-PASCAL Workshop on Textual Entailment and Paraphrasing, pages 42-47, Prague, June 2007. Association for Computational Linguistics.

[12] George Miller, Richard Beckwith, Christiane Fellbaum, Derek Gross, and Katherine J. Miller. Introduction to WordNet: An On-line Lexical Database. International Journal of Lexicography, 3(4):235-244, 1990.

[13] Anselmo Peñas, Álvaro Rodrigo, and Felisa Verdejo. Overview of the Answer Validation Exercise 2006. In C. Peters et al., editor, CLEF 2006, Lecture Notes in Computer Science LNCS 4730, Alicante, Spain, September 2006.

[14] Anselmo Peñas, Álvaro Rodrigo, and Felisa Verdejo. Overview of the Answer Validation Exercise 2007. In C. Peters et al., editor, CLEF 2007, Lecture Notes in Computer Science LNCS 5152, Budapest, Hungary, September 2007.

[15] Saul Needleman and Christian Wunsch. A general method applicable to the search for similarities in amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443-453, 1970.

[16] Carol Peters. What happened in CLEF 2007? Introduction to the working notes. In Working Notes for the 8th Workshop of the Cross-Language Evaluation Forum, CLEF, Budapest, Hungary, September 2007.

[17] Álvaro Rodrigo, Anselmo Peñas, and Felisa Verdejo. UNED at Answer Validation Exercise 2007. In Working Notes of the CLEF 2007 Workshop, Budapest, Hungary, September 2007.

[18] T. F. Smith and M. S. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147:195-197, 1981.

[19] Karen Sparck-Jones. A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 28(1):11-21, 1972.

[20] Ian Witten and Eibe Frank. Data Mining: Practical machine learning tools and techniques. 2nd Edition, Morgan Kaufmann, San Francisco, 2005.