Experimenting a “general purpose” textual entailment learner in AVE

Fabio Massimo Zanzotto
DISCo, University of Milano-Bicocca, Milan, Italy
zanzotto@disco.unimib.it

Alessandro Moschitti
Department of Computer Science, University of Rome “Tor Vergata”, Rome, Italy
moschitti@info.uniroma2.it

Abstract

In this paper we present the use of a “general purpose” textual entailment recognizer in the Answer Validation Exercise (AVE) task. Our system has been developed to learn entailment rules from annotated examples. The main idea of the system is the cross-pair similarity measure we defined. This similarity allows us to define an implicit feature space using kernel functions in SVM learners. We experimented with our system using different training and testing sets: the RTE data sets and the AVE data sets. The comparative results show that entailment rules can be learned from data sets, e.g. RTE, that are different from AVE. Moreover, it seems that better results are obtained using more controlled training data (the RTE sets) than less controlled data (the AVE development set). Although the high variability of the outcome prevents us from drawing definitive conclusions, the results show that our approach is quite promising and can be improved in the future.

Categories and Subject Descriptors
I.2 [ARTIFICIAL INTELLIGENCE]: I.2.7 Natural Language Processing; I.2.6 Learning

General Terms
Measurement, Performance, Experimentation

Keywords
Question answering, Textual Entailment Recognition

1 Introduction

Textual entailment recognition is a common task performed in several natural language applications [8], e.g. Question Answering and Information Extraction. The Recognizing Textual Entailment (RTE) PASCAL Challenges [9, 2] fostered the development of several “general purpose” textual entailment recognizers. CLEF 2006 instead provides an opportunity to show that those systems are useful for Question Answering. The voluntary exercise track aims to study the application of textual entailment recognition systems to the validation of the correctness of answers given by QA systems. The basic idea is that once an answer/snippet pair is returned by a QA system, a hypothesis is built by turning the question/answer pair into an affirmative form. If the related text (a snippet or a document) semantically entails this hypothesis, then the answer is expected to be correct. The task of deciding this entailment is here named the automatic Answer Validation Exercise (AVE).

We applied our entailment system [21], developed for the second Recognizing Textual Entailment challenge (RTE) [2], to AVE. Our system has been shown to be one of the state-of-the-art systems on both RTE data sets [9, 2]. It determines whether or not a text T entails a hypothesis H by automatically learning rewrite rules from positive and negative training entailment pairs (T, H). For example, given a text T1: “At the end of the year, all solid companies pay dividends.” and two hypotheses: a) H1: “At the end of the year, all solid insurance companies pay dividends” and b) H2: “At the end of the year, all solid companies pay cash dividends”, we can build two examples: (T1, H1), which is evidence of a true entailment (positive instance), and (T1, H2), which is negative evidence. Our system extracts rules from them to solve apparently unrelated entailment cases. For example, given the following text and hypothesis:

T3 ⇒ H3?
T3: “All wild animals eat plants that have scientifically proven medicinal properties.”
H3: “All wild mountain animals eat plants that have scientifically proven medicinal properties.”

we note that T3 is structurally (and somewhat lexically) similar to T1 and that H3 is more similar to H1 than to H2. Thus, from T1 ⇒ H1, we may extract rules to derive that T3 ⇒ H3. The main idea of our model is that it relies not only on an intra-pair similarity between T and H but also on a cross-pair similarity between two pairs (T', H') and (T'', H''). The latter similarity measure, along with a set of annotated examples, allows the learning model to automatically derive syntactic and lexical rules that can solve complex entailment cases.

In this paper, we experimented with our entailment recognition system [21] on the CLEF AVE task. The comparative results show that entailment rules can be learned from data sets, e.g. RTE, that are different from AVE. Although the high variability of the outcome prevents us from drawing definitive conclusions, the results show that our approach is quite promising and can be improved in the future.

In the remainder of this paper, Sec. 2 illustrates the related work, Sec. 3 introduces the complexity of learning entailment rules from examples, Sec. 4 describes our models, Sec. 5 discusses some refinements of the cross-pair similarity, Sec. 6 shows the experimental results, and, finally, Sec. 7 draws the conclusions.

2 Related work

Although the textual entailment recognition problem is not new, most of the automatic approaches have been proposed only recently. This has been mainly due to the RTE challenge events [9, 2]. In the following we report some of this research.

A first class of methods defines measures of the distance or similarity between T and H, either assuming independence between words [7, 11] in a bag-of-words fashion or exploiting syntactic interpretations [16]. A pair (T, H) is then in entailment when sim(T, H) > α. These approaches can hardly determine whether the entailment holds in the examples of the previous section. From the point of view of bag-of-words methods, the pairs (T1, H1) and (T1, H2) have the same intra-pair similarity, since the sentences of T1 and H1 as well as those of T1 and H2 differ by a single noun, insurance and cash, respectively. At the syntactic level, too, we cannot capture the required information, as both nouns are noun modifiers: insurance modifies companies and cash modifies dividends.

A second class of methods can give a solution to the previous problem. These methods generally combine a similarity measure with a set of possible transformations T applied over syntactic and semantic interpretations. The entailment between T and H is detected when there is a transformation r ∈ T such that sim(r(T), H) > α. These transformations are logical rules in [3] or sequences of allowed rewrite rules in [10]. The disadvantage is that such rules have to be manually designed. Moreover, they generally model positive implications better than negative ones, and they do not consider errors in syntactic parsing and semantic analysis.

3 Challenges in learning from examples

In the introductory section, we have shown that, to carry out automatic learning from examples, we need to define a cross-pair similarity measure. Its definition is not straightforward, as it should detect whether two pairs (T', H') and (T'', H'') realize the same rewrite rules.
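To make concrete why an intra-pair measure alone cannot drive this learning (cf. Sec. 2), consider the following minimal sketch, written in Python purely for illustration: bow_sim and its token-overlap score are toy stand-ins for the measures in [7, 11], not part of our system. Both (T1, H1) and (T1, H2) receive exactly the same score, so no threshold on sim(T, H) can separate the positive from the negative example.

```python
import re

def tokens(s):
    """Lowercased word tokens, punctuation stripped."""
    return re.findall(r"[a-z]+", s.lower())

def bow_sim(text, hyp):
    """Toy bag-of-words intra-pair similarity: fraction of hypothesis
    tokens that also occur in the text."""
    t, h = set(tokens(text)), tokens(hyp)
    return sum(1 for w in h if w in t) / len(h)

T1 = "At the end of the year, all solid companies pay dividends."
H1 = "At the end of the year, all solid insurance companies pay dividends"
H2 = "At the end of the year, all solid companies pay cash dividends"

# Each hypothesis differs from T1 by a single noun, so the two scores coincide.
print(bow_sim(T1, H1) == bow_sim(T1, H2))  # True
```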
The cross-pair similarity measure should consider two pairs similar when: (1) T' and H' are structurally similar to T'' and H'', respectively, and (2) the lexical relations within the pair (T', H') are compatible with those in (T'', H''). Typically, T and H show a certain degree of overlap, thus lexical relations (e.g., between identical words) determine word movements from T to H (or vice versa). This is important to model the syntactic/lexical similarity between example pairs.

[Figure 1: Relations between (T1, H1), (T1, H2), and (T3, H3). The figure shows the syntactic parse trees of T1, T3, H1, H2, and H3, augmented with placeholders (e.g., 2, 2', 2'', 3, 4 and a, a', a'', b, c); shared subtrees are marked in bold and dashed lines connect structurally equivalent placeholders.]

Indeed, if we encode such movements in the syntactic parse trees of texts and hypotheses, we can use interesting similarity measures defined for syntactic parsing, e.g., the tree kernel devised in [6]. To consider structural and lexical relation similarity, we augment syntactic trees with placeholders which identify linked words. More in detail:

- We detect links between words wt in T that are equal, similar, or semantically dependent on words wh in H. We call the pairs (wt, wh) anchors and we associate them with placeholders. For example, in Fig. 1, the placeholder 2'' indicates the (companies, companies) anchor between T1 and H1. This allows us to derive the word movements between text and hypothesis.

- We align the trees of the two texts T' and T'' as well as the trees of the two hypotheses H' and H'' by considering the word movements. We find a correct mapping between the placeholders of the two hypotheses H' and H'' and apply it to the tree of H'' to substitute its placeholders. The same mapping is used to substitute the placeholders in T''. This mapping should maximize the structural similarity between the four trees, considering that placeholders augment the node labels. Hence, the cross-pair similarity computation is reduced to a tree similarity computation.

The above steps define an effective cross-pair similarity that can be applied to the example in Fig. 1: T1 and T3 share the subtree in bold starting with S → NP VP. The lexicals in T3 and H3 are quite different from those in T1 and H1, but we can rely on the structural properties expressed by their bold subtrees. These are more similar to the subtrees of T1 and H1 than to those of T1 and H2, respectively. Indeed, H1 and H3 share the production NP → DT JJ NN NNS while H2 and H3 do not. Consequently, to decide whether (T3, H3) is a valid entailment, we should rely on the decision made for (T1, H1). Note also that the dashed lines connecting placeholders of two texts (hypotheses) indicate structurally equivalent nodes. For instance, the dashed line between 3 and b links the main verbs both in the texts T1 and T3 and in the hypotheses H1 and H3. After substituting 3 with b and 2 with a, we can detect whether T1 and T3 share the bold subtree S → NP-2 VP-3.
As this subtree is also shared by H1 and H3, the words within the pair (T1, H1) are correlated similarly to the words in (T3, H3).

The above example emphasizes that we need to derive the best mapping between placeholder sets. It can be obtained as follows: let A' and A'' be the placeholders of (T', H') and (T'', H''), respectively; without loss of generality, we assume |A'| ≥ |A''| and we align a subset of A' to A''. The best alignment is the one that maximizes the syntactic and lexical overlap of the two subtrees induced by the aligned set of anchors. More precisely, let C be the set of all bijective mappings from a' ⊆ A' with |a'| = |A''| to A''; an element c ∈ C is a substitution function. We define the best alignment as the one determined by

    cmax = argmax_{c ∈ C} ( KT(t(H', c), t(H'', i)) + KT(t(T', c), t(T'', i)) )    (1)

where (a) t(S, c) returns the syntactic tree of the hypothesis (text) S with placeholders replaced by means of the substitution c, (b) i is the identity substitution, and (c) KT(t1, t2) is a function that measures the similarity between the two trees t1 and t2 (for more details see Sec. 4.2). For example, the cmax between (T1, H1) and (T3, H3) is {(2', a'), (2'', a''), (3, b), (4, c)}.

4 Similarity Models

In this section we describe how anchors are found at the level of a single pair (T, H) (Sec. 4.1). The anchoring process gives the direct possibility of implementing an intra-pair similarity that can be used as a baseline approach or in combination with the cross-pair similarity. The latter will be implemented with tree kernel functions over syntactic structures (Sec. 4.2).

4.1 Anchoring and Lexical Similarity

The algorithm that we designed to find the anchors is based on similarity functions between words or more complex expressions. Our approach is in line with much other research (e.g., [7, 11]). Given the sets of content words (verbs, nouns, adjectives, and adverbs) WT and WH of the two sentences T and H, respectively, the set of anchors A ⊂ WT × WH is built using a similarity measure between two words, simw(wt, wh). Each element wh ∈ WH will be part of a pair (wt, wh) ∈ A if:

1. simw(wt, wh) ≠ 0
2. simw(wt, wh) = max_{wt' ∈ WT} simw(wt', wh)

According to these properties, elements in WH can participate in more than one anchor and, conversely, more than one element in WH can be linked to a single element w ∈ WT.

The similarity simw(wt, wh) can be defined using different indicators and resources. First of all, two words are maximally similar if they have the same surface form, wt = wh. Second, we can use one of the WordNet [17] similarities, indicated with d(lw, lw') (in line with what was done in [7]), and different relations between words such as the lexical entailment relation between verbs (Ent) and the derivational relation between words (Der). Finally, we use the edit distance measure lev(wt, wh) to capture the similarity between words that is missed by the previous analysis because of misspellings or of derivational forms not coded in WordNet.
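The anchor-selection step just described can be summarized in a few lines. The sketch below is illustrative only: build_anchors is a name we introduce here, and the simw argument stands for the word-similarity measure formalized in Eq. 2 below (a purely surface-form stand-in is used in the toy call).

```python
def build_anchors(WT, WH, simw):
    """Anchor set A of Sec. 4.1: each hypothesis word wh is linked to the
    text word(s) wt that maximize simw(wt, wh), provided simw is non-zero."""
    anchors = []
    for wh in WH:
        scored = [(simw(wt, wh), wt) for wt in WT]
        best = max((s for s, _ in scored), default=0.0)
        if best > 0.0:
            # ties are kept: a hypothesis word may take part in several anchors
            anchors.extend((wt, wh) for s, wt in scored if s == best)
    return anchors

# Toy usage with a surface-form-only stand-in for Eq. 2:
WT = ["end", "year", "solid", "companies", "pay", "dividends"]               # content words of T1
WH = ["end", "year", "solid", "insurance", "companies", "pay", "dividends"]  # content words of H1
surface = lambda wt, wh: 1.0 if wt == wh else 0.0
print(build_anchors(WT, WH, surface))
# "insurance" gets no anchor; every other hypothesis word is linked to its match in T1
```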
As a result, given the syntactic category cw ∈ {noun, verb, adjective, adverb} and the lemmatized form lw of a word w, the similarity measure between two words w and w' is defined as follows:

    simw(w, w') =
        1             if w = w', or
                         (lw = lw' ∧ cw = cw'), or
                         ((lw, cw), (lw', cw')) ∈ Ent, or
                         ((lw, cw), (lw', cw')) ∈ Der, or
                         lev(w, w') = 1
        d(lw, lw')    if cw = cw' ∧ d(lw, lw') > 0.2
        0             otherwise                                              (2)

It is worth noticing that the above measure is not a pure similarity measure, as it includes the entailment relation, which does not represent synonymy or similarity between verbs. To emphasize the contribution of each resource used, in the experimental section we will compare Eq. 2 with some versions that exclude some of the word relations.

The above word similarity measure can be used to compute the similarity between T and H. In line with [7], we define it as:

    s(T, H) = ( Σ_{(wt, wh) ∈ A} simw(wt, wh) × idf(wh) ) / ( Σ_{wh ∈ WH} idf(wh) )    (3)

where idf(w) is the inverse document frequency of the word w. From the above intra-pair similarity, we can obtain the baseline cross-pair similarity based only on lexical information:

    Klex((T', H'), (T'', H'')) = s(T', H') × s(T'', H'')    (4)

In the next section we define a novel cross-pair similarity that takes into account syntactic evidence by means of tree kernel functions.

4.2 Cross-pair syntactic kernels

Section 3 has shown that, to measure the syntactic similarity between two pairs (T', H') and (T'', H''), we should capture the number of common subtrees between texts and hypotheses that share the same anchoring scheme. The best alignment between anchor sets, i.e. the best substitution cmax, can be found with Eq. 1. As the corresponding maximum quantifies the alignment degree, we could define a cross-pair similarity as follows:

    Kstruct((T', H'), (T'', H'')) = max_{c ∈ C} ( KT(t(H', c), t(H'', i)) + KT(t(T', c), t(T'', i)) )    (5)

where as KT(t1, t2) we use the tree kernel function defined in [6]. This evaluates the number of subtrees shared by t1 and t2, thus defining an implicit substructure space. Formally, given a subtree space F = {f1, f2, ..., f|F|}, the indicator function Ii(n) is equal to 1 if the target fi is rooted at node n and equal to 0 otherwise. A tree kernel function over t1 and t2 is KT(t1, t2) = Σ_{n1 ∈ Nt1} Σ_{n2 ∈ Nt2} ∆(n1, n2), where Nt1 and Nt2 are the sets of nodes of t1 and t2, respectively. In turn, ∆(n1, n2) = Σ_{i=1}^{|F|} λ^{l(fi)} Ii(n1) Ii(n2), where 0 ≤ λ ≤ 1 and l(fi) is the number of levels of the subtree fi. Thus λ^{l(fi)} assigns a lower weight to larger fragments. When λ = 1, ∆ is equal to the number of common fragments rooted at nodes n1 and n2. As described in [6], ∆ can be computed in O(|Nt1| × |Nt2|).

The KT function has been proven to be a valid kernel, i.e. its associated Gram matrix is positive semidefinite. Some basic operations on kernel functions, e.g. the sum, are closed with respect to the set of valid kernels. Thus, if the maximum had this property, Eq. 5 would be a valid kernel and we could use it in kernel-based machines like SVMs. Unfortunately, a counterexample illustrated in [4] shows that the max function does not produce valid kernels in general.
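As a side note, the ∆ recursion of [6] can be sketched compactly. The code below is only an illustration under simplifying assumptions: plain constituency trees, the usual per-production decay λ (which may differ in detail from the level-based weighting described above), no memoization, and invented names (Node, delta, tree_kernel); it is not the implementation used in our system.

```python
from itertools import product

class Node:
    """A parse-tree node: a label and ordered children (words are leaves)."""
    def __init__(self, label, children=()):
        self.label, self.children = label, list(children)

    def production(self):
        return (self.label, tuple(c.label for c in self.children))

def nodes(t):
    yield t
    for c in t.children:
        yield from nodes(c)

def delta(n1, n2, lam=0.4):
    """Common fragments rooted at n1 and n2 (Collins-Duffy recursion), decayed by lam."""
    if not n1.children or not n2.children:        # leaves root no fragment
        return 0.0
    if n1.production() != n2.production():
        return 0.0
    score = lam
    for c1, c2 in zip(n1.children, n2.children):  # same production => same arity
        score *= 1.0 + delta(c1, c2, lam)
    return score

def tree_kernel(t1, t2, lam=0.4):
    """K_T(t1, t2): sum of delta over all pairs of nodes."""
    return sum(delta(n1, n2, lam) for n1, n2 in product(nodes(t1), nodes(t2)))

# Toy usage: two identical small NPs share three fragments-weighted contributions.
np1 = Node("NP", [Node("JJ", [Node("solid")]), Node("NNS", [Node("companies")])])
np2 = Node("NP", [Node("JJ", [Node("solid")]), Node("NNS", [Node("companies")])])
print(tree_kernel(np1, np2))
```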
However, we observe that: (1) Kstruct((T', H'), (T'', H'')) is a symmetric function, since the set of substitutions C is always computed with respect to the pair that has the largest anchor set; (2) in [12], it is shown that even when kernel functions are not positive semidefinite, SVMs still solve a data separation problem, in pseudo-Euclidean spaces. The drawback is that the solution may be only a local optimum. Therefore, we can experiment with Eq. 5 in SVMs and observe whether the empirical results are satisfactory. Section 6 shows that the solutions found with Eq. 5 produce higher accuracy than previous automatic textual entailment recognition approaches.

5 Refining cross-pair syntactic similarity

In the previous section we defined the intra-pair and the cross-pair similarity. The former does not present relevant implementation issues, whereas the latter should be optimized to favor its applicability with SVMs. The improvement of Eq. 5 depends on two factors: (1) its computational complexity; (2) the pruning of irrelevant information in large syntactic trees.

5.1 Controlling the computational cost

The computational cost of the cross-pair similarity between two tree pairs (Eq. 5) depends on the size of C. This is combinatorial in the sizes of A' and A'', i.e. |C| = |A'|!/(|A'| − |A''|)! if |A'| ≥ |A''|. Thus we should keep the sizes of A' and A'' reasonably small.

To reduce the number of placeholders, we consider the notion of chunk defined in [1], i.e., non-recursive kernels of noun, verb, adjective, and adverb phrases. When placeholders are in a single chunk both in the text and in the hypothesis, we assign them the same name. For example, Fig. 1 shows the placeholders 2' and 2'', which are replaced by the single placeholder 2. The placeholder reduction procedure also gives the possibility of resolving the ambiguity still present in the anchor set A (see Sec. 4.1): a way to eliminate ambiguous anchors is to select the ones that reduce the final number of placeholders.

5.2 Pruning irrelevant information in large text trees

Often only a portion of the parse trees is relevant to detect entailments. For instance, let us consider the following pair from the RTE 2005 corpus:

T ⇒ H (id: 929)
T: “Ron Gainsford, chief executive of the TSI, said: ‘It is a major concern to us that parents could be unwittingly exposing their children to the risk of sun damage, thinking they are better protected than they actually are.’”
H: “Ron Gainsford is the chief executive of the TSI.”

Only the initial part of T (“Ron Gainsford, chief executive of the TSI”) supports the implication; the rest is useless and even misleading: if we used it to compute the similarity, it would reduce the importance of the relevant part. Moreover, as we normalize the syntactic tree kernel (KT) with respect to the size of the two trees, we need to focus only on the part relevant to the implication.

The anchored leaves are good indicators of relevant parts, but some other parts may also be very relevant. For example, the function word not plays an important role. Another example is given by the word insurance in H1 and mountain in H3 (see Fig. 1). They support the implications T1 ⇒ H1 and T3 ⇒ H3, just as cash supports T1 ⇏ H2. By removing these words and the related structures, we could not determine the correct implication of the first two pairs and the incorrect implication of the last one. Thus, we keep all the words that are immediately related to relevant constituents.
The reduction procedure can be formally expressed as follows: given a syntactic tree t, the set of its nodes N(t), and a set of anchors, we build a tree t' with all the nodes N' that are anchors or ancestors of an anchor. Moreover, we add to t' the leaf nodes of the original tree t that are direct children of nodes in N'. We apply this procedure only to the syntactic trees of texts, before the computation of the kernel function.

6 Experimental investigation

The experiments aim at determining whether our system can learn the rules required to solve the entailment cases contained in the AVE data set. Although we have already shown that our system can learn entailment [9, 2], the task here appears to be more complex as: (a) texts are automatically built from answers and questions, which necessarily introduces some degree of noise; and (b) question answering systems often provide a correct answer whose supporting text is not adequate to carry out a correctness inference, e.g. a lot of background knowledge is required or the answer was selected by chance. Our approach to studying the above points is to train and test our system on several data sets derived from AVE as well as from RTE1 and RTE2. The combination of training and testing based on such sets can give an indication of the learnability of general rules valid for different domains and different applications.

6.1 Experimental settings

For the experiments, we used the following data sets:

Training set            Test set    j=1     j=10    j=0.9
AVEa                    AVEb        11.55   35.36   -
AVEb                    AVEa        x       31.85   -
AVEb ∪ RTE1             AVEa        12.20   37.14   -
AVEb ∪ RTE1 ∪ RTE2      AVEa        28.57   35.89   -
AVEb ∪ RTE2             AVEa        25.68   38.98   -
AVEa ∪ RTE1             AVEb        22.57   32.05   -
AVEa ∪ RTE1 ∪ RTE2      AVEb        31.76   30.85   -
AVEa ∪ RTE2             AVEb        34.07   32.38   -
RTE1                    AVEa        39.81   30.64   -
RTE1                    AVEb        35.58   28.31   -
RTE1 ∪ RTE2             AVEa        38.27   31.58   40.85
RTE1 ∪ RTE2             AVEb        33.42   28.29   36.20
RTE2                    AVEa        37.46   33.04   -
RTE2                    AVEb        35.57   29.72   -

Table 1: F1 measure of our entailment system trained with data from RTE1, RTE2 and AVE and tested on the AVE splits (AVEa and AVEb), for three values of the Precision/Recall trade-off parameter j.

• RTE1 and RTE2, i.e. the sets (development and test data) of the first [9] and second [2] challenges, respectively. RTE1 contains 1,367 examples whereas RTE2 contains 1,600 instances. The positive and negative examples are equally distributed in each collection, i.e. 50% of the data.

• AVEa and AVEb come from a random split of the AVE development set; we created this split to train and test our model homogeneously on the AVE data. The AVE development set contains 2,870 instances. Here, the positive and negative examples are not equally distributed: there are 436 positive and 2,434 negative examples.

We also created new sets by merging groups of the above four collections. For example, AVEa ∪ RTE1 ∪ RTE2 stands for the set obtained as the union of AVEa, RTE1 and RTE2. Moreover, to implement our model (described in Sections 4 and 5), we used the following resources:

• The Charniak parser [5] and the morpha lemmatiser [18] to carry out the syntactic and morphological analysis.

• WordNet 2.0 [17] to extract both the verbs in entailment (the Ent set) and the derivationally related words (the Der set).

• The wn::similarity package [20] to compute the Jiang & Conrath (J&C) distance [14] as in [7]. This is one of the best-performing measures and provides a similarity score in the [0, 1] interval. We used it to implement the d(lw, lw') function.

• A selected portion of the British National Corpus (http://www.natcorp.ox.ac.uk/) to compute the inverse document frequency (idf).
We assigned the maximum idf to words not found in the BNC.

• SVM-light-TK [19] (available at http://ai-nlp.info.uniroma2.it/moschitti/), which encodes the basic tree kernel function, KT, in SVM-light [15]. We used this software to implement the overall kernel Koverall = Klex + Kstruct (see Equations 4 and 5).

In all the experiments we used Koverall, which combines the lexical and structural cross-pair similarities.

6.2 Results and analysis

Table 1 reports the results of our system trained with data from RTE1, RTE2 and AVE and tested on the AVE splits (AVEa and AVEb). Columns 1 and 2 denote the data sets used for training and testing, respectively, whereas columns 3, 4 and 5 report the F1 measure of the system for three different values of the j parameter: 1, 10 and 0.9, respectively. This parameter tunes the trade-off between Precision and Recall: higher values cause the system to retrieve more positive examples. When the system has a Recall of 0, the table shows the “x” symbol, while the symbol “-” indicates that the experiment was not performed. The following aspects should be noted:

• Training on AVEa and testing on AVEb provides almost 4% more F1 than training on AVEb and testing on AVEa. This suggests a high variability of the results due to the small amount of training data, as is also shown by the high impact of the j parameter (about 24% difference between j=1 and j=10).

• If we add the examples from the RTE challenges to the AVEb training data, we obtain a good improvement, e.g. the system trained on AVEb ∪ RTE2 improves on the one trained on AVEb by about 7% (38.98% vs. 31.85%). Adding RTE1 to the training data causes a decrease. This could be explained by the high impact of the parameters: it is possible that a good setting for AVEb ∪ RTE2 is not very good for AVEb ∪ RTE1 ∪ RTE2.

• Training on AVEa together with the RTE data sets seems not helpful, as the result using only AVEa is higher, e.g. 35.36% vs. 32.38%.

• Finally, training on RTE1 provides higher performance than training on RTE2 on both the AVEa and AVEb test sets (see rows 10 and 11 vs. 14 and 15). Moreover, their combined use (RTE1 ∪ RTE2) is helpful only if we select an appropriate parameter, j=0.9. This leads to the highest performance on AVEa and AVEb, i.e. 40.85% and 36.20%, respectively.

Given these preliminary results, we decided to use the best model obtained on RTE1 ∪ RTE2 to generate the data for our CLEF submission. Moreover, as the AVE test set may be statistically similar to the development set, we also submitted a run of the model trained on AVEa ∪ AVEb. The official results were 39.95% and 36.69%, respectively. These are quite in line with the analogous experiments shown in Table 1, i.e. training on RTE1 ∪ RTE2 and testing on AVEa (40.85%) and training on AVEa and testing on AVEb (35.36%).

6.2.1 Qualitative analysis

The system we presented relies heavily on the syntactic interpretations of the example pairs. Its major bottleneck is therefore the standard AVE process used to produce the affirmative form of the question given the answer provided by the QA system. This process frequently generates ungrammatical sentences. The problem is evident just from reading the first instances of the AVE development set. We report some of these examples hereafter. Each example reports the original question (Q), the text snippet (T), and the affirmative form of the question used as hypothesis (H).
T ⇒ H (id: 1)
Q: “When did Nixon resign?”
T: “August, 1974 – Nixon resigns.”
H: “Nixon resigned in 1974 – Nixon”

T ⇒ H (id: 2)
Q: “What year was Halley’s comet visible?”
T: “[...] 1909 Halley’s comet sighted from Cambridge Observatory. 1929 [...]”
H: “In 1909 Halley was Halley’s comet visible”

T ⇒ H (id: 6)
Q: “Who is Juan Antonio Samaranch?”
T: “International Olympic Committee President Juan Antonio Samaranch came strongly to the defense of China’s athletes, [...]”
H: “Juan Antonio Samaranch is International Olympic Committee President Juan Antonio Samaranch came strongly to the defense of China’s athletes”

We can observe that these examples have highly ungrammatical hypotheses. In example (id 1), Nixon is repeated at the end of H. In example (id 2), Halley is used both as subject and as predicate. Finally, in example (id 6) a large part of the hypothesis is unnecessary and creates an ungrammatical sentence.

7 Conclusions

In this paper, we experimented with our entailment system [21] on the CLEF AVE task. The comparative results show that entailment rules can be learned from data sets, e.g. RTE, that are different from AVE. The experiments show that few training examples and data sparseness produce high variability in the results. In this scenario the parameterization is very critical and requires accurate cross-validation techniques. The AVE results also show that our model can learn entailments from the RTE data sets (with a higher F1 than using only AVE data). This suggests that there are some general rules, valid across domains and collections. The importance of such rules is even more evident if we consider that the distribution of positive and negative examples in the RTE and AVE data sets is quite different. This usually prevents statistical learning algorithms from carrying out a correct generalization of the data. In the future, we would like to carry out a thorough parameterization study and continue investigating approaches to exploit data from different sources of entailment.

References

[1] Steven Abney. Part-of-speech tagging and partial parsing. In K. Church, S. Young, and G. Bloothooft, editors, Corpus-Based Methods in Language and Speech. Kluwer Academic Publishers, Dordrecht, 1996.
[2] Roy Bar-Haim, Ido Dagan, Bill Dolan, Lisa Ferro, Danilo Giampiccolo, Bernardo Magnini, and Idan Szpektor. The II PASCAL RTE challenge. In PASCAL Challenges Workshop, Venice, Italy, 2006.
[3] Johan Bos and Katja Markert. Recognising textual entailment with logical inference. In Proc. of the HLT-EMNLP Conference, pages 628–635, Vancouver, British Columbia, Canada, October 2005. Association for Computational Linguistics.
[4] S. Boughorbel, J.-P. Tarel, and F. Fleuret. Non-Mercer kernels for SVM object recognition. In Proceedings of BMVC 2004, pages 137–146, 2004.
[5] Eugene Charniak. A maximum-entropy-inspired parser. In Proc. of the 1st NAACL, pages 132–139, Seattle, Washington, 2000.
[6] Michael Collins and Nigel Duffy. New ranking algorithms for parsing and tagging: Kernels over discrete structures, and the voted perceptron. In Proceedings of ACL02, 2002.
[7] Courtney Corley and Rada Mihalcea. Measuring the semantic similarity of texts. In Proc. of the ACL Workshop on Empirical Modeling of Semantic Equivalence and Entailment, pages 13–18, Ann Arbor, Michigan, June 2005. Association for Computational Linguistics.
[8] Ido Dagan and Oren Glickman. Probabilistic textual entailment: Generic applied modeling of language variability.
In Proceedings of the Workshop on Learning Methods for Text Understanding and Mining, Grenoble, France, 2004.
[9] Ido Dagan, Oren Glickman, and Bernardo Magnini. The PASCAL RTE challenge. In PASCAL Challenges Workshop, Southampton, U.K., 2005.
[10] Rodrigo de Salvo Braz, Roxana Girju, Vasin Punyakanok, Dan Roth, and Mark Sammons. An inference model for semantic entailment in natural language. In Proc. of the PASCAL RTE Challenge Workshop, Southampton, U.K., 2005.
[11] Oren Glickman, Ido Dagan, and Moshe Koppel. Web based probabilistic textual entailment. In Proceedings of the 1st PASCAL Challenge Workshop, Southampton, U.K., 2005.
[12] Bernard Haasdonk. Feature space interpretation of SVMs with indefinite kernels. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(4):482–492, April 2005.
[13] Marti A. Hearst. Automatic acquisition of hyponyms from large text corpora. In Proc. of the 15th CoLing, Nantes, France, 1992.
[14] Jay J. Jiang and David W. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. In Proc. of the 10th ROCLING, pages 132–139, Taipei, Taiwan, 1997.
[15] Thorsten Joachims. Making large-scale SVM learning practical. In B. Schölkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods - Support Vector Learning. MIT Press, 1999.
[16] Milen Kouylekov and Bernardo Magnini. Tree edit distance for textual entailment. In Proc. of RANLP-2005, Borovets, Bulgaria, 2005.
[17] George A. Miller. WordNet: A lexical database for English. Communications of the ACM, 38(11):39–41, November 1995.
[18] Guido Minnen, John Carroll, and Darren Pearce. Applied morphological processing of English. Natural Language Engineering, 7(3):207–223, 2001.
[19] Alessandro Moschitti. Making tree kernels practical for natural language learning. In Proceedings of EACL’06, Trento, Italy, 2006.
[20] Ted Pedersen, Siddharth Patwardhan, and Jason Michelizzi. WordNet::Similarity - measuring the relatedness of concepts. In Proc. of the 5th NAACL, Boston, MA, 2004.
[21] Fabio Massimo Zanzotto and Alessandro Moschitti. Automatic learning of textual entailments with cross-pair similarities. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, pages 401–408, Sydney, Australia, July 2006. Association for Computational Linguistics.