
Paraphrase Substitution for Recognizing Textual Entailment

Chris Callison-Burch
University of Edinburgh
callison-burch@ed.ac.uk

Wauter Bosma
University of Twente

Abstract

We describe a method for recognizing textual entailment that uses the length of the longest common subsequence (LCS) between two texts as its decision criterion. Rather than requiring strict word matching in the common subsequences, we perform a flexible match using automatically generated paraphrases. We find that the use of paraphrases over strict word matches represents an average F-measure improvement from 0.22 to 0.36 on the CLEF 2006 Answer Validation Exercise for 7 languages.

General Terms

Experimentation, Languages, Reliability

Keywords

Recognizing textual entailment, Paraphrase generation, Question answering

Introduction

Recognizing textual entailment has recently generated interest from a wide range of Natural Language Processing related research areas, such as automatic summarization, information extraction and question answering. Advances have been made with various techniques, such as aligning syntactic trees and word overlap. While there is still much room for improvement, Vanderwende [13] showed that current approaches are close to hitting the boundaries of what is feasible with lexical-syntactic approaches.

Proposed directions to cross this boundary include using logical inference, background knowledge and paraphrasing [2]. We explore the possibility of applying paraphrasing to obtain a more reliable match between a given text and hypothesis for which the presence of an entailment relation is to be determined. For instance, consider the following RTE2 pair.

In this example, text and hypothesis use different words to express the same meaning. Although deep inference is required to recognize that a mother is part of the family, and that daughter and baby in this context most likely refer to the same person, most variation occurs on the surface level. For instance, announced in the hypothesis could be replaced by said without changing its meaning. Similarly, the phrases the United States and the US would not be matched by a system relying solely on word overlap.

The criterion that our system uses to decide whether a text entails a hypothesis is the length of the longest common subsequence (LCS) between the passages. Rather than identifying the LCS using word matching, our system employs an automatic paraphrasing method that extends matches to synonymous, but non-identical phrases. We automatically generate our paraphrases by extracting them from bilingual parallel corpora.

Whereas many systems use dependency parsers and other linguistic resources that are only available for a limited number of languages, our system employs a method that is comparatively language independent. For this paper we extract paraphrases in Dutch, English, French, German, Italian, Spanish, and Portuguese.

The paraphrase extraction algorithm is described in section 2. Section 3 describes how the entailment score is calculated, and how paraphrases are generated in the entailment detection system. The results of our participation in the CLEF 2006 Answer Validation Exercise (henceforth AVE) are described in section 4. We conclude and give suggestions for future work in section 5.

Paraphrase extraction

Paraphrases are alternative ways of conveying the same information. The automatic generation of paraphrases has been the focus of a significant amount of research lately [4, 12, 3, 1]. In this work, we use Bannard and Callison-Burch’s method [1], which extracts paraphrases from bilingual parallel corpora.

Bannard and Callison-Burch extract paraphrases from a parallel corpus by equating English phrases which share a common foreign language phrase. English phrases are aligned with their foreign translations using alignment techniques drawn from recent phrase-based approaches to statistical machine translation [8]. Paraphrases are identified by pivoting through phrases in another language: candidate paraphrases are found by first identifying all occurrences of the English phrase to be paraphrased, then finding the foreign language translations of that phrase, and finally looking at which other English phrases those foreign phrases translate back to.

The method extends to several parallel corpora. If c is a parallel corpus from a set of parallel corpora C, multiple corpora may be used by summing over all paraphrase probabilities calculated from a single corpus (as in Equation 1) and normalizing by the number of parallel corpora. We calculate the paraphrase probabilities using the Europarl parallel corpus [7], which contains parallel corpora for Danish, Dutch, English, French, Finnish, German, Greek, Italian, Portuguese, Spanish and Swedish.
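The pivoting step can be illustrated with a minimal sketch. It assumes the formulation of Bannard and Callison-Burch [1], in which the paraphrase probability is p(e2|e1) = Σ_f p(f|e1) p(e2|f), with both translation probabilities estimated by relative frequency from aligned phrase pairs. The input format (a list of aligned phrase pairs), the function names, and the toy counts are illustrative assumptions and are not taken from the system described here.

```python
from collections import Counter, defaultdict


def paraphrase_probs(aligned_pairs):
    """Estimate p(e2 | e1) = sum_f p(f | e1) * p(e2 | f) for one corpus.

    aligned_pairs: iterable of (english_phrase, foreign_phrase) tuples, one
    per aligned occurrence (hypothetical input format).
    """
    pair_counts = Counter(aligned_pairs)
    e_totals, f_totals = Counter(), Counter()
    for (e, f), n in pair_counts.items():
        e_totals[e] += n
        f_totals[f] += n

    # Index the English phrases seen with each foreign (pivot) phrase.
    english_by_foreign = defaultdict(list)
    for e, f in pair_counts:
        english_by_foreign[f].append(e)

    probs = defaultdict(lambda: defaultdict(float))
    for (e1, f), n in pair_counts.items():
        p_f_given_e1 = n / e_totals[e1]
        for e2 in english_by_foreign[f]:
            if e2 == e1:
                continue  # a phrase is not counted as its own paraphrase
            p_e2_given_f = pair_counts[(e2, f)] / f_totals[f]
            probs[e1][e2] += p_f_given_e1 * p_e2_given_f
    return {e1: dict(d) for e1, d in probs.items()}


def average_over_corpora(per_corpus_probs):
    """Sum the per-corpus probabilities and divide by the number of corpora."""
    combined = defaultdict(lambda: defaultdict(float))
    for probs in per_corpus_probs:
        for e1, d in probs.items():
            for e2, p in d.items():
                combined[e1][e2] += p
    k = len(per_corpus_probs)
    return {e1: {e2: p / k for e2, p in d.items()} for e1, d in combined.items()}


# Invented toy alignments, for illustration only.
pairs_de = ([("the US", "die USA")] * 3
            + [("the United States", "die USA")]
            + [("the United States", "die Vereinigten Staaten")] * 2)
pairs_fr = ([("the US", "les États-Unis")] * 2
            + [("the United States", "les États-Unis")] * 2)

probs_de = paraphrase_probs(pairs_de)
probs_fr = paraphrase_probs(pairs_fr)
combined = average_over_corpora([probs_de, probs_fr])
print(combined["the US"]["the United States"])
```

In this sketch a paraphrase that is absent from one corpus simply contributes zero for that corpus, which is one plausible reading of "normalizing by the number of parallel corpora".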

The method is multilingual, since it can be applied to any language which has a parallel corpus.

Thus paraphrases can be easily generated for each of the languages in the CLEF AVE task using the Europarl corpus.

Recognizing entailment

The longest common subsequence (LCS) is used as a measure of similarity between passages. LCS is also used by the ROUGE [9] summarization evaluation package to measure recall of a system summary with respect to a model summary. We use it to measure precision rather than recall: it approximates the proportion of information in the hypothesis which is also in the text. Unlike the longest common substring, the longest common subsequence does not require adjacency. A longest common subsequence of a text T = ⟨t1..tn⟩ and a hypothesis H = ⟨h1..hn⟩ is defined as a longest possible sequence Q = ⟨q1..qn⟩ with words in Q also being words in T and H in the same order. LCS(T, H) is the length of the longest common subsequence:

LCS(T, H) = max { |Q| | Q ⊆ T ; Q ⊆ H; (ti = hk ∈ Q ∧ tj = hl ∈ Q ∧ j > i) → l > k} (4)
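As an illustration of the definition, the LCS length can be computed with the standard dynamic programming recurrence. This is a minimal sketch rather than the implementation used in the system; the function name and the whitespace tokenization are assumptions.

```python
def lcs_length(text, hypothesis):
    """Length of the longest common subsequence of two token lists."""
    # prev[j] is the LCS length of the text tokens processed so far
    # and the first j hypothesis tokens.
    prev = [0] * (len(hypothesis) + 1)
    for t in text:
        curr = [0]
        for j, h in enumerate(hypothesis, start=1):
            if t == h:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


# Invented toy sentences; matched words need not be adjacent.
text = "the cat sat on the mat today".split()
hypothesis = "the cat is on a mat".split()
print(lcs_length(text, hypothesis))  # 4: "the cat on mat"
```

The table is kept one row at a time, so memory use grows with the length of the hypothesis only.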

From the LCS, the entailment score LCS(T, H)/|H| is derived. In order to account for variation in natural language text, the LCS is measured after paraphrasing the hypothesis. The underlying idea is that whenever a paraphrase of the hypothesis is entailed by the text, the hypothesis itself is also entailed by the text.

[Figure 3: Precision, recall and F-measure. Left: random baseline, longest common subsequence, and LCS after paraphrasing. Right: the same systems and dependency tree alignment.]

We attempted to extract paraphrases for every phrase in the hypothesis of up to 8 words. Note that by “phrase” we simply mean a sequence of words. Figure 2 shows all paraphrases that we were able to extract for the example hypothesis. After generating these candidate mappings we iteratively transform the hypothesis to be closer to the text by substituting in paraphrases. At each iteration, the substitution is made which yields the greatest increase of the entailment score. To prevent overgeneration, a word which was introduced in the hypothesis by a paraphrase substitution cannot be substituted itself. The process stops when no more substitutions can be made which increase the entailment score. For example, the following paraphrase of the hypothesis from section 1 is obtained by a number of substitutions.

Hypothesis: Clonaid announced that mother and daughter would be returning to the US on Monday.

Substitutions:
the US → the United States
returning → return
announced → said
would be → is
on Monday → Monday

Paraphrased hypothesis: Clonaid said that mother and daughter is return to the United States Monday.

In this case, paraphrasing caused the entailment score to increase from 43% (6/14) to 77% (10/13). The words in italics are the words which are aligned with the text sentence, i.e. which are part of the longest common subsequence. Table 2 shows a number of CLEF AVE pairs for which paraphrases were used to recognize entailment.
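The greedy substitution procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions: the paraphrase table is a plain dictionary from phrases (token tuples) to lists of paraphrases, ties are broken in favour of the leftmost candidate, and the text, hypothesis and paraphrase table are invented for the example. None of these names or data come from the actual system.

```python
def lcs_length(text, hypothesis):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(hypothesis) + 1)
    for t in text:
        curr = [0]
        for j, h in enumerate(hypothesis, start=1):
            curr.append(prev[j - 1] + 1 if t == h else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def entailment_score(text, hypothesis):
    """LCS(T, H) / |H|; defined as 0.0 for an empty hypothesis."""
    return lcs_length(text, hypothesis) / len(hypothesis) if hypothesis else 0.0


def paraphrase_hypothesis(text, hypothesis, paraphrases, max_len=8):
    """Greedily substitute paraphrases while the entailment score increases."""
    hyp = list(hypothesis)
    locked = [False] * len(hyp)  # True for words introduced by a substitution
    score = entailment_score(text, hyp)
    while True:
        best = None  # (score, start, end, replacement)
        for start in range(len(hyp)):
            for end in range(start + 1, min(start + max_len, len(hyp)) + 1):
                if any(locked[start:end]):
                    continue  # substituted words may not be substituted again
                for repl in paraphrases.get(tuple(hyp[start:end]), ()):
                    candidate = hyp[:start] + list(repl) + hyp[end:]
                    cand_score = entailment_score(text, candidate)
                    if cand_score > score and (best is None or cand_score > best[0]):
                        best = (cand_score, start, end, repl)
        if best is None:
            return hyp, score  # no substitution improves the score
        score, start, end, repl = best
        hyp[start:end] = list(repl)
        locked[start:end] = [True] * len(repl)


# Invented toy data, for illustration only.
text = "the minister said he will return to the United States".split()
hypothesis = "the minister announced he would be returning to the US".split()
paraphrases = {
    ("announced",): [("said",)],
    ("the", "US"): [("the", "United", "States")],
    ("would", "be"): [("will",), ("is",)],
    ("returning",): [("return",)],
}

final, score = paraphrase_hypothesis(text, hypothesis, paraphrases)
print(" ".join(final), score)
```

Because every accepted substitution strictly increases the score, and the score is bounded by 1, the loop always terminates.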

In order to judge whether a hypothesis is entailed by a text we see if the value of the entailment score, LCS(T, H)/|H|, is greater than some threshold value. Support vector machines [14] are used to determine the entailment threshold. Unfortunately, the only suitable training data available was the Question Answering subset of the RTE2 data set [2]. This is a monolingual collection of English passage pairs, each annotated with a boolean value indicating the presence of an entailment relation. Lacking training data for other languages, for our submission we used the RTE2 data to learn the entailment threshold for all languages. The threshold value used throughout these experiments was 0.75.
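To make the role of the threshold concrete, the sketch below picks a cut-off on a single score from labelled pairs. It is deliberately not a support vector machine: it is a simpler stand-in that searches the midpoints between observed scores for the cut-off with the highest training accuracy. The function name and the scores and labels are invented for illustration.

```python
def best_threshold(scores, labels):
    """Cut-off with the highest training accuracy, predicting entailment
    whenever score > threshold. A simple stand-in for a learned classifier."""
    values = sorted(set(scores))
    # Candidates: below every score, midway between neighbours, and the maximum.
    candidates = ([values[0] - 1.0]
                  + [(a + b) / 2 for a, b in zip(values, values[1:])]
                  + [values[-1]])

    def accuracy(threshold):
        hits = sum((s > threshold) == y for s, y in zip(scores, labels))
        return hits / len(scores)

    return max(candidates, key=accuracy)


# Invented entailment scores with gold labels (True = entailment holds).
scores = [0.3, 0.5, 0.6, 0.7, 0.9, 1.0]
labels = [False, False, False, True, True, True]

threshold = best_threshold(scores, labels)
predictions = [s > threshold for s in scores]
print(round(threshold, 2), predictions)
```

With a single feature, any linear classifier reduces to a cut-off of this kind; the search above only differs in how that cut-off is chosen.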

Results

We compared the performance of the paraphrasing method with two baselines on seven languages within the CLEF 2006 Answer Validation Exercise. The first baseline is a system which decides at random with a probability distribution reflecting the corpus distribution of entailment values. The second baseline is a system which measures the longest common subsequence of text and hypothesis. Given that paraphrasing is a form of query expansion, we expected precision to drop and recall to increase when using paraphrases. Figure 3 (left) shows that this is indeed the case, but that the system using paraphrases shows considerably better overall performance, as indicated by the F-measure, compared to plain LCS.
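The trade-off between precision and recall can be made concrete with a small sketch. It assumes the balanced F-measure (the harmonic mean of precision and recall) computed over pairs judged as entailment; the function name and the predictions are invented for illustration and are not results from the experiments.

```python
def precision_recall_f(predicted, gold):
    """Precision, recall and balanced F-measure over positive decisions."""
    tp = sum(p and g for p, g in zip(predicted, gold))
    fp = sum(p and not g for p, g in zip(predicted, gold))
    fn = sum(g and not p for p, g in zip(predicted, gold))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Invented decisions for ten pairs, four of which are true entailments.
gold = [True, True, True, True, False, False, False, False, False, False]
strict = [True, False, False, False, False, False, False, False, False, False]
flexible = [True, True, True, False, True, True, False, False, False, False]

print(precision_recall_f(strict, gold))    # high precision, low recall
print(precision_recall_f(flexible, gold))  # lower precision, higher recall
```

In this toy case the more permissive matcher loses precision but gains enough recall to raise the F-measure, which is the pattern described above.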

For Dutch, Spanish and English, we made a syntactic analysis of each sentence using the parsers of [5], [6] and [10] respectively. As a fourth entailment recognition system, we measured the largest common subtree of the dependency trees of the text and hypothesis. The algorithm of Marsi et al. [11] was used to align dependency trees. Interestingly, as shown in Figure 3 (right), the dependency tree alignment system performs slightly better than, but comparably to, the longest common subsequence after paraphrasing, even though the former uses syntactic information and the latter uses paraphrase generation. The fact that the two systems disagree on 37 percent of all pairs with positive entailment (see Table 3) indicates that performance can be further increased by employing both types of information in an integrated system.

Conclusion

We showed that paraphrases can boost the performance of an entailment recognition system that relies on the longest common subsequence to decide on entailment. Our method is applicable to a wide range of languages, since no language-specific natural language analysis or background knowledge is used other than paraphrases automatically extracted from bilingual parallel corpora. Although our system performs similarly to a syntax-based system, we showed that there is relatively little overlap between the sets of pairs correctly recognized by the two systems. This indicates that the information conveyed by paraphrases and by syntax is largely complementary for the task of recognizing entailment.

In the future we plan to investigate three things: a combination of syntax-based and paraphrase-based approaches to entailment recognition, improved methods for determining the entailment threshold, and further enhancements to paraphrase extraction techniques, such as taking syntax into account.

Acknowledgments

This work is funded by the Interactive Multimodal Information Extraction (IMIX) program of the Netherlands Organization for Scientific Research (NWO).

References

[1] Colin Bannard and Chris Callison-Burch. Paraphrasing with bilingual parallel corpora. In ACL-2005, 2005.

[2] Roy Bar-Haim, Ido Dagan, Bill Dolan, Lisa Ferro, Danilo Giampiccolo, Bernardo Magnini, and Idan Szpektor. The second PASCAL Recognising Textual Entailment Challenge. In Bernardo Magnini and Ido Dagan, editors, Proceedings of the Second PASCAL Recognising Textual Entailment Challenge, Trento, Italy, April 2006.

[3] Regina Barzilay and Lillian Lee. Learning to paraphrase: An unsupervised approach using multiple-sequence alignment. In Proceedings of HLT/NAACL, 2003.

[4] Regina Barzilay and Kathleen McKeown. Extracting paraphrases from a parallel corpus. In ACL-2001, 2001.

[5] Gosse Bouma, Gertjan van Noord, and Robert Malouf. Alpino: wide-coverage computational analysis of Dutch. In Proceedings of CLIN, 2000.

[6] Xavier Carreras, Isaac Chao, Lluís Padró, and Muntsa Padró. FreeLing: an open-source suite of language analyzers. In Proceedings of the 4th international Language Resources and Evaluation Conference, Lisbon, Portugal, 2004.

[7] Philipp Koehn. A parallel corpus for statistical machine translation. In Proceedings of MTSummit, 2005.

[8] Philipp Koehn, Franz Josef Och, and Daniel Marcu. Statistical phrase-based translation. In Proceedings of HLT/NAACL, 2003.

[9] Chin-Yew Lin. ROUGE: a package for automatic evaluation of summaries. In Proceedings of ACL 2004 Workshop Text Summarization Branches Out, Barcelona, Spain, 2004.

[10] Dekang Lin. Dependency-based evaluation of MiniPar. In Proceedings of LREC Workshop on the Evaluation of Parsing Systems, Granada, Spain, 1998.

[11] Erwin Marsi, Emiel Krahmer, Wauter Bosma, and Mariët Theune. Normalized alignment of dependency trees for detecting textual entailment. In Bernardo Magnini and Ido Dagan, editors, Second PASCAL Recognising Textual Entailment Challenge, pages 56-61, Venice, Italy, April 2006. PASCAL.

[12] Bo Pang, Kevin Knight, and Daniel Marcu. Syntax-based alignment of multiple translations: Extracting paraphrases and generating new sentences. In Proceedings of HLT/NAACL, 2003.

[13] Lucy Vanderwende and William B. Dolan. What syntax can contribute in the entailment task. In PASCAL Challenges Workshop on Recognizing Textual Entailment, pages 205-216, Southampton, United Kingdom, 2005. Springer-Verlag.

[14] Vladimir Naumovich Vapnik. The nature of statistical learning theory. Springer, 2nd edition, November 1999.