=Paper=
{{Paper
|id=Vol-1172/CLEF2006wn-QACLEF-KozarevaEt2006
|storemode=property
|title=Adaptation of a Machine-learning Textual Entailment System to a Multilingual Answer Validation Exercise
|pdfUrl=https://ceur-ws.org/Vol-1172/CLEF2006wn-QACLEF-KozarevaEt2006.pdf
|volume=Vol-1172
|dblpUrl=https://dblp.org/rec/conf/clef/KozarevaVM06a
}}
==Adaptation of a Machine-learning Textual Entailment System to a Multilingual Answer Validation Exercise==
Adaptation of a Machine-learning Textual Entailment System to a Multilingual Answer Validation Exercise

Zornitsa Kozareva, Sonia Vázquez and Andrés Montoyo
NLP Research Group (GPLSI), University of Alicante, Spain
{zkozareva,svazquez,montoyo}@dlsi.ua.es

Abstract. The recognition of textual entailment is a new generic task in which a system automatically establishes whether a given text entails the meaning of another. This kind of detection is useful for many NLP applications. In the case of Question Answering, textual entailment takes the form of an answer validation exercise (AVE), in which the system has to verify the correctness of the snippet returned by the Question Answering system. In this paper we present an AVE system based on the combination of word overlap and Latent Semantic Indexing modules. The main focus and contribution of our work is the adaptation of the already developed and evaluated textual entailment system MLEnt and its ability to work with different languages. We have evaluated our approach for English, Spanish, French, Dutch, German, Italian and Portuguese.

Keywords: Answer Validation (AV), Question Answering (QA), Recognising Textual Entailment (RTE), Multilingual

1 Introduction

Natural Language Processing (NLP) comprises tasks such as Information Retrieval (IR), Word Sense Disambiguation, Automatic Summarisation and Question Answering (QA), among others. Each task is associated with different problems (ambiguity, variability, language-specific characteristics, specific domains), and different techniques are used for their resolution. Thus, in order to solve a given problem, researchers use different approaches and systems. However, all of these systems have to be evaluated. In most cases, measuring the effectiveness of an IR or QA system requires human judges, for whom the task is labour-intensive and time-consuming. Therefore, researchers are now studying ways in which this evaluation process can be made automatic.

The answer validation exercise (AVE) [7] emerged as a consequence of the introduction of the textual entailment recognition (RTE) task [2], [3] (http://www.pascal-network.org/Challenges/RTE/, http://www.pascal-network.org/Challenges/RTE2/), whose aim is to determine whether the meaning of one text is entailed by the meaning of another. An AVE system validates the pair formed by the QA question, transformed into affirmative mode, and the snippet returned by the QA system. The AVE pair is considered correct only if both the question and the snippet are correct.

Since the RTE and AVE tasks are related, we decided to evaluate our already developed RTE system MLEnt [5] in the AVE challenge. The attributes of MLEnt do not use any language-dependent tools or resources, which means that the system should be able to operate on different languages. However, since no multilingual textual entailment corpora existed, we could not prove this claim. The AVE challenge provides multilingual data, so we decided to evaluate our approach and confirm our hypothesis of MLEnt's multilinguality.

In the AVE challenge, our research focuses on two aspects. The first is the portability of our textual entailment system to the AVE challenge, together with the issue of multilinguality. This aspect is demonstrated with an exhaustive evaluation on the following seven languages: English, Spanish, French, Dutch, German, Italian and Portuguese. The second aspect of our study is the influence of voting.
For the RTE challenge [4] we demonstrated that, depending on the data sets, the voting strategy has a higher or almost no impact. This led us to observe the voting behaviour for the AVE challenge as well.

The paper is organized as follows: Section 2 describes the different modules of our AVE system, Section 3 presents the experiments carried out, followed by a discussion of the obtained results, and finally we conclude in Section 4.

2 System Description

In this section, we describe our answer validation system. The system consists of a module of the already developed machine-learning based textual entailment system MLEnt, combined through a simple voting technique with a newly developed module that acquires semantic similarity information through Latent Semantic Indexing. The system does not use any language-dependent tools or resources, and it was therefore possible to evaluate its performance on answer validation pairs in different languages. The main advantages of our system are its low computational cost, its fast performance and its ability to work with different languages.

2.1 Word overlap module

The MLEnt system consists of word overlap and semantic similarity modules. For the AVE competition we used only the word overlap module, as the similarity module relies on information from WordNet and could not be adapted to the different language pairs. The following attributes were included in the AVE word overlap feature set:

n-grams: looks for common position-independent unigram matches between the text and the hypothesis of the AV pair. According to this attribute, the AV pair is correct when the sentences of the text and the hypothesis share the same words, and incorrect when there are no common words at all. This attribute does not consider semantic similarity information, so it cannot determine that "vehicle" and "car", if present in the two sentences, are related and should increase the similarity. Another limitation is its insensitivity to word order and sentence-level structure. As it looks only for arbitrary n-gram coincidences, it cannot determine that although "Mary calls the police" and "the police calls Mary" contain the same words, they do not convey the same meaning. For this reason, we included the longest common subsequence (LCS) and the skip-gram attributes.

LCS: looks for common non-consecutive word sequences of any length between the sentences of the text and the hypothesis. A longer LCS corresponds to a more similar answer validation text and hypothesis. LCS estimates the similarity between a text T of length m and a hypothesis H of length n as LCS(T,H)/n. LCS does not require consecutive matches, but in-sequence matches that reflect sentence-level word order. When two or more LCSs of equal length are found, only one of them is considered. There is no need to define the length of the word sequences in advance, because LCS automatically includes the longest in-sequence n-gram. LCS reflects the proportion of ordered words shared by the text and the hypothesis; therefore, in comparison to the n-gram measure, it indicates that "Mary calls the police" and "the police calls Mary" are not so similar, because they share only two words in order. This influences the AV classification in a more sensitive way, assigning the text and the hypothesis a greater distance than the one determined by the n-grams.
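To make the LCS attribute concrete, the following minimal sketch (an illustration written for this description, not the original MLEnt code) computes the longest common in-sequence word overlap between a text and a hypothesis and normalizes it by the hypothesis length n, i.e. LCS(T,H)/n:

def lcs_length(text_words, hyp_words):
    # Length of the longest common (not necessarily contiguous) in-sequence
    # word subsequence, computed with standard dynamic programming.
    m, n = len(text_words), len(hyp_words)
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if text_words[i - 1] == hyp_words[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[m][n]

def lcs_score(text, hypothesis):
    # LCS(T, H) normalized by the hypothesis length n, as described above.
    t, h = text.lower().split(), hypothesis.lower().split()
    return lcs_length(t, h) / len(h) if h else 0.0

print(lcs_score("Mary calls the police", "Mary calls the police"))  # 1.0
print(lcs_score("Mary calls the police", "the police calls Mary"))  # 0.5: only two words in order are shared

The example reproduces the behaviour discussed above: the word-order-insensitive n-gram overlap would treat the two sentences as identical, while the normalized LCS score drops to 0.5.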
skip-grams: represent any pair of words in sentence order, allowing arbitrary gaps. Once all pairs for the text and the hypothesis are generated, the overlapping skip-grams are counted using the measure

skip_gram_measure = skip_gram(T,H) / C(n, number_of_skip_grams),

where skip_gram(T,H) is the number of common skip-grams found in the text T and the hypothesis H, and C(n, number_of_skip_grams) is a combinatorial function in which n is the number of words in the hypothesis and number_of_skip_grams corresponds to the order of the common n-grams between T and H (e.g. 1 if there is a common unigram between T and H, 2 if there is a common bigram, and so on). According to the skip-grams, an AV pair is correct when the text and the hypothesis have more common skip-grams. For the following sentence pairs:

S1: Mary calls the police.
S2: Mary called the police.
S3: The police called Mary.

the skip-grams identify that the similarity between S1 and S2 is stronger than the similarity between S1 and S3 or between S2 and S3. The n-gram and LCS attributes are not so sensitive and cannot determine this similarity correctly. A sketch of the skip-gram computation is shown at the end of this subsection.

number matching: identifies the numbers in the AV text and hypothesis and then verifies them. For sentences containing no numbers at all, the number matching attribute assigns the value NO to the AV pair. According to this attribute, the AV pair is correct when the numbers in the text and the hypothesis coincide.

The performance of the described attributes was evaluated only for English and Spanish, because the AVE organizers provided training data for these languages. For the training phase we used the SVM and kNN machine learning classifiers, and we also observed the behaviour of the information gain (IG) measure for the different language pairs and for different sizes of training data. IG is a measure that indicates which features of a given set are the most important. According to IG, the two most important attributes for correct AVE classification are LCS and skip-gram. For the word overlap feature set, the system therefore generated two outputs, one obtained by LCS and another obtained by the skip-grams.

We had to adjust the LCS and skip-gram attributes to the remaining languages, for which we had no training data. Since both attributes are based on the number of overlapping words normalized by the total number of words in the AV hypothesis, it was possible to adapt them. We measured the standard deviation of the LCS and skip-gram attributes for Spanish and English and observed the obtained standard deviation values together with their corresponding YES/NO decisions. For the rest of the languages, the YES/NO values were then assigned according to these standard deviation values.
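The sketch below illustrates the skip-gram measure described above, restricted to skip-bigrams (i.e. number_of_skip_grams = 2, so the denominator is C(n, 2)); it is an illustration of the formula written for this text, not the authors' implementation:

from itertools import combinations
from math import comb

def skip_bigrams(words):
    # All word pairs in sentence order, allowing arbitrary gaps between them.
    return set(combinations(words, 2))

def skip_gram_score(text, hypothesis):
    # Common skip-bigrams of T and H, normalized by C(n, 2),
    # where n is the number of words in the hypothesis.
    t, h = text.lower().split(), hypothesis.lower().split()
    common = skip_bigrams(t) & skip_bigrams(h)
    denominator = comb(len(h), 2)
    return len(common) / denominator if denominator else 0.0

s1 = "Mary calls the police"
s2 = "Mary called the police"
s3 = "The police called Mary"
print(skip_gram_score(s1, s2))  # 0.5   -> S1 and S2 are judged more similar
print(skip_gram_score(s1, s3))  # ~0.17 -> S1 and S3 share only one ordered pair

On the example sentences, the measure ranks the pair S1-S2 above S1-S3, matching the behaviour described for the skip-gram attribute.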
2.2 Latent Semantic Indexing module

Latent Semantic Indexing (LSI) [6] is a computational model that establishes the relations among the words of a large corpus in a vector semantic space where all terms are represented by a term-document matrix, also called the conceptual matrix. In order to obtain the similarity relations, the terms have to be distributed over documents, paragraphs or sentences. According to this distribution, the co-occurrence among the different terms is determined. Once the term-document matrix is obtained, LSI applies Singular Value Decomposition (SVD), which decomposes the term-document matrix into three other matrices. These matrices contain the singular vectors and the singular values. SVD transforms the original data into linearly independent factors. Many of these factors are very small and can be ignored in the approximation model. The final result of the decomposition process is a reduced version of the initial term-document matrix, which is used to establish the word similarities.

In order to apply LSI to the AVE task, we need a corpus that serves as a basis for constructing the conceptual matrix. We used the answer validation T-H data provided by the AVE organizers. The conceptual matrix is constructed from the T-phrases of the AVE corpus. This choice follows the results we obtained in a study with the RTE2 data [9], according to which the RTE task performs better when the T-phrases are used as corpus. For each of the languages (English, Spanish, Italian, German, Dutch, Portuguese and French) we constructed a different conceptual matrix using the sentences of the text from the AVE corpora.

From the generated conceptual matrix one can establish the similarity of terms, phrases or documents. In our experiment, we are interested in establishing the similarity of the T-H pairs. LSI extracts the similarity relations between the T-H phrases, and the result is a list of T-H phrases ordered by their similarity score. Under the concept of LSI, an AVE pair is correct when the similarity value is close to 1, and less similar when the similarity value is close to 0. In order to determine which values should be considered more or less similar, we used a threshold. This threshold was determined after several experiments with the Spanish and English AVE training data. For both languages the best results were obtained with a threshold of 0.8, which means that a T-H pair whose similarity equals or exceeds 0.8 is assigned the value YES and the rest of the pairs are assigned the value NO. The 0.8 value was also used as threshold for the rest of the languages. The next examples illustrate how the LSI module works.

Example 1: An instance of the AVE test collection with entailment value NO; in this case the LSI module returns a value of 0.402886:

<pair id="4525" value="NO" task="QA">
<q lang="EN" src="clef2006" type="OBJECT">What is Atlantis</q>
<t doc="096222">TO ATLANTIS' CREW. From Associated Press NASA briefly lost contact with the space shuttle Atlantis and its six astronauts Sunday because of crossed radio signals. The problem occurred as Atlantis switched from one Tracking and Data Relay Satellite to another, a routine procedure during Atlantis nor its crew was in any danger, and no science data was lost, said Mission Control with Atlantis was restored after eight minutes, but it was an hour before engineers realized crossed signals,</t>
<h>Atlantis is ATLANTIS THE LOST EMPIRE.</h>
</pair>

Example 2: An instance of the AVE test collection with entailment value YES; in this case the LSI module returns a value of 0.905481:

<pair id="7818" value="YES" task="QA">
<q>What is Atlantis</q>
<t doc="LA110794-0104">NASA briefly lost contact with the space shuttle Atlantis and its six astronauts Sunday because of crossed radio signals.</t>
<h>Atlantis is the space shuttle.</h>
</pair>

The LSI module returns YES if the similarity value is equal to or above the 0.8 threshold and NO otherwise. In both examples the LSI module obtains the correct entailment value: in Example 1 the similarity is around 0.40, so the answer is NO, and in Example 2 it is around 0.90, so the answer is YES.
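A minimal sketch of this kind of LSI similarity computation is given below. The concrete term weighting, the number of retained dimensions and the folding-in of the T and H phrases are not specified in detail in the text, so the choices here (raw counts, k = 100, standard fold-in) are illustrative assumptions rather than the system's actual configuration:

import numpy as np

def build_term_doc_matrix(t_phrases):
    # Term-by-document count matrix built from the T-phrases of the AVE corpus.
    vocabulary = sorted({w for phrase in t_phrases for w in phrase.lower().split()})
    index = {w: i for i, w in enumerate(vocabulary)}
    matrix = np.zeros((len(vocabulary), len(t_phrases)))
    for j, phrase in enumerate(t_phrases):
        for w in phrase.lower().split():
            matrix[index[w], j] += 1
    return matrix, index

def lsi_similarity(t_phrase, h_phrase, matrix, index, k=100):
    # Decompose the term-document matrix with SVD, keep the k largest factors,
    # fold both phrases into the reduced space and compare them with the cosine.
    u, s, _ = np.linalg.svd(matrix, full_matrices=False)
    k = min(k, int(np.sum(s > 1e-10)))
    u_k, s_k = u[:, :k], s[:k]

    def fold_in(phrase):
        q = np.zeros(matrix.shape[0])
        for w in phrase.lower().split():
            if w in index:
                q[index[w]] += 1
        return (q @ u_k) / s_k

    t_vec, h_vec = fold_in(t_phrase), fold_in(h_phrase)
    denom = np.linalg.norm(t_vec) * np.linalg.norm(h_vec)
    return float(t_vec @ h_vec / denom) if denom else 0.0

# A pair would then be validated with the threshold used in the paper:
# answer = "YES" if lsi_similarity(t, h, matrix, index) >= 0.8 else "NO"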
2.3 Module combiner

In the final stage of our AVE system, the previously described word overlap and LSA modules are combined by voting. In order to guarantee that the voting combination is reasonable, we tested the modules for compatibility. One such approach is based on the Kappa coefficient [1], [8], [4], which measures the agreement of the classifiers. A high Kappa value corresponds to high agreement between the runs and hence no improvement when voting is applied, while a low Kappa value corresponds to low agreement and to an improvement after the combination.

For each answer validation pair we obtained different outputs from the LCS, skip-gram and LSA runs. We measured the Kappa agreement for the three runs together and also tested them by pairs. The experiment was carried out for English and Spanish. According to the obtained results, the best combinations are LCS with skip-gram, and LCS with skip-gram and LSA. Therefore, we submitted two runs for the AVE challenge.

Once the Kappa measure had determined which outputs should be combined, we applied voting. Voting is a technique that combines multiple pieces of evidence into a single prediction. The generated outputs of LCS, skip-gram and LSA are taken and compared, and the final decision about the class assignment is made in favour of the class with the majority of votes. For the LCS and skip-gram runs we could not apply majority voting, because the number of classifiers is even. Therefore, we applied the following strategy: when the two outputs agree, the obtained result remains the same, but when they disagree, the AVE pair is assigned the value NO, i.e. the answer validation pair is considered incorrect.
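The following sketch shows the two combination steps as described above: a Kappa computation to check classifier compatibility, and the voting rule including the conservative tie-breaking used for the two-run combination. It is an illustration written from the text, with made-up example runs, not the authors' code:

from collections import Counter

def cohen_kappa(run_a, run_b):
    # Cohen's Kappa agreement between two lists of YES/NO decisions.
    n = len(run_a)
    observed = sum(a == b for a, b in zip(run_a, run_b)) / n
    freq_a, freq_b = Counter(run_a), Counter(run_b)
    expected = sum((freq_a[label] / n) * (freq_b[label] / n)
                   for label in set(run_a) | set(run_b))
    return 1.0 if expected == 1 else (observed - expected) / (1 - expected)

def combine(*runs):
    # Majority voting over the runs; with only two runs, any disagreement
    # defaults to NO, as described above.
    combined = []
    for decisions in zip(*runs):
        votes = Counter(decisions)
        if len(runs) == 2 and votes["YES"] == votes["NO"]:
            combined.append("NO")
        else:
            combined.append(votes.most_common(1)[0][0])
    return combined

lcs_run  = ["YES", "NO", "YES", "NO"]
skip_run = ["YES", "YES", "NO", "NO"]
lsa_run  = ["NO", "YES", "YES", "NO"]
print(cohen_kappa(lcs_run, skip_run))       # low agreement -> combination may help
print(combine(lcs_run, skip_run))           # two runs: disagreements become NO
print(combine(lcs_run, skip_run, lsa_run))  # three runs: plain majority vote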
3 Results of the Multilingual AVE Runs

In this section, we present the results obtained for the different languages. We participated in English, Spanish, German, French, Italian, Dutch and Portuguese. Table 1 shows the results of the individual word overlap and LSA runs, as well as the results of the two combinations we performed. A discussion of the experiments and the results obtained for each of the languages can be found below.

Table 1. Results for the AVE runs

Language    Run            Precision  Recall  F-score
English     LCS            15.22      80.93   28.57
English     Skip           16.91      69.30   27.18
English     LSA            23.29      30.23   26.31
English     LCS&Skip       18.33      64.65   28.56
English     LCS&Skip&LSA   24.92      69.77   36.72
Spanish     LCS            44.21      66.62   53.15
Spanish     Skip           37.24      43.07   39.94
Spanish     LSA            34.15      14.45   20.31
Spanish     LCS&Skip       47.48      39.34   43.03
Spanish     LCS&Skip&LSA   40.65      76.15   53.01
German      LCS            38.90      60.56   47.37
German      Skip           34.37      43.91   38.55
German      LSA            11.43       1.13    2.06
German      LCS&Skip       41.30      37.68   39.41
German      LCS&Skip&LSA   36.34      67.42   47.22
French      LCS            33.96      67.09   45.09
French      Skip           30.48      46.38   36.78
French      LSA            32.36      15.88   21.31
French      LCS&Skip       38.36      43.69   40.85
French      LCS&Skip&LSA   34.44      73.62   46.93
Italian     LCS            25.78      70.59   37.77
Italian     Skip           21.96      86.10   34.99
Italian     LSA            29.16      22.45   25.37
Italian     LCS&Skip       21.64      88.77   34.80
Italian     LCS&Skip&LSA   28.30      72.19   40.66
Dutch       LCS            14.26      90.12   24.62
Dutch       Skip           15.80      67.90   25.64
Dutch       LSA            13.88      12.34   13.07
Dutch       LCS&Skip       18.90      67.90   29.57
Dutch       LCS&Skip&LSA   14.84      90.12   25.48
Portuguese  LCS            12.50       3.90    5.94
Portuguese  Skip            8.00      21.00   11.58
Portuguese  LSA            11.26      12.76   11.96
Portuguese  LCS&Skip       19.04      12.77   15.29

To evaluate the performance of our system, we used the following evaluation measures:

precision = #pairs predicted as YES correctly / #pairs predicted as YES   (1)
recall = #pairs predicted as YES correctly / #YES pairs                   (2)
f-score = 2 * precision * recall / (precision + recall)                   (3)

These measures were introduced by the AVE organizers because of the class distribution in an AVE corpus. According to their study [7], 25% of the pairs are YES and 75% are NO; therefore the performance of an AVE system should be evaluated only on the YES pairs.

English: For this language, we performed a training phase by merging the ENGARTE data sets provided by the AVE organizers (http://nlp.uned.es/QA/ave). The results of this experiment served as an indicator of the best attributes in the initial feature set. The best word overlap feature, both for the training and the test data sets, is LCS. This shows that one third of the AVE pairs can be resolved correctly simply by considering the overlapping in-sequence words between the two texts. The skip-gram and LSI runs also performed at around 26-27%. When the two word overlap attributes are merged there is no improvement on the test data, although in the training phase the increase is 2%. The highest scores for this language are obtained when the LCS, skip-gram and LSI runs are merged, which shows that LSI correctly identifies examples missed by the other modules. The voting combination brings an 8% improvement compared to the performance of a single classifier. According to the z' statistic with a confidence level of 0.975, this increase is significant.

Spanish: A separate training phase was conducted for Spanish, this time training the word overlap modules on the SPARTE corpus. For the test data, the best score is reached by LCS, with a value of 53.15%. The voting combination of LCS, skip-gram and LSI has the same performance as the individual LCS classifier. This is due to the low coverage of LSI, which depends on the number and type of words in the T-phrases.

German, French, Italian: For these languages the best performances are obtained with the LCS and voting runs, with f-scores ranging from 40 to 47%. The performance of LSI is lower than that of the word overlap module because of the similarity threshold. Although the 0.8 threshold was obtained after studying the performance of LSI on the Spanish and English training data, the multilingual test experiments show that the threshold is sensitive to the words of the T-phrases.

Dutch, Portuguese: For these two languages, our system obtained the lowest scores of all. It is interesting to note that for Dutch the skip-gram run performed better than LCS. This may be related to the origin of the language and its word order, as the skip-grams allow arbitrary gaps between the matched words. For Portuguese, LSI performed better than the word overlap runs, and the voting combination for this language has 4% better coverage than the individual classifiers.

Discussion: In this AVE competition, we participated for the English, Spanish, German, French, Italian, Dutch and Portuguese languages. The experiments show that the same attributes can be applied to different languages and can even reach similar performance. Thus, we have shown, first, that our MLEnt system can be adapted to AVE, the answer validation exercise related to RTE, and second, that our hypothesis of MLEnt's multilinguality is valid. Globally, for Spanish, German and French the LCS run and the voting combination performed the same. However, for English the voting combination improved the performance of the individual classifier by 8%, for Italian the voting strategy yielded an increase of 3%, and the same happened for Portuguese. Compared to the individual performance of LSI, the voting combination was better for each of the seven languages. The variation in the performance of the word overlap attributes and the combination strategies shows that, also in the AVE challenge, the effect of the voting combination depends on many factors, such as the sequences of n-grams, the number of words in the text and the results of the individual classifiers. Since for most of the languages the voting strategy had a positive effect, we can claim that voting improves the performance of our system.

4 Conclusions and Future work

The main aim of our participation in the AVE competition was to test how our previously developed textual entailment system MLEnt handles multilinguality and how it performs in a new task such as answer validation. As the textual entailment and answer validation tasks share the same idea of identifying whether two texts infer the same meaning, it was possible to adapt the MLEnt system to the AVE challenge with little effort.
With the obtained results we showed that MLEnt can function for different languages. The performance of the system ranges from 20% to 53%, depending on the language. It is interesting to note that the performance of our AVE system depends on the LCS, skip-gram and LSI attributes; however, the two most robust attributes are LCS and skip-gram, which for German, French, Italian and Dutch reached coverages of 47% for LCS and 38% for skip-gram respectively. We do not include the Spanish and English results in this comparison, because for these languages we used training data and their performance is also influenced by the size of that data. The only language with clearly lower performance is Portuguese, probably because the number of overlapping words was not as high as for the rest of the languages. The performance of the LSI attribute also varied across the languages. As LSI uses the words of the text to construct the conceptual matrices, we observed that for the languages with longer T-phrases the performance of LSI was better, and vice versa. As a conclusion from the experiments, these attributes can serve as a baseline for all languages, but they should be further improved by the incorporation of richer knowledge, such as syntactic or semantic information.

During the experiments, the Kappa agreement measure determined the compatible sets for the voting combination, and through this measure the performance of the individual sets was improved. The voting combination led to an improvement for the English, French, Italian and Portuguese runs. For Spanish and German the performance of the voting combination and of the LCS attribute differed by 0.14%, and according to the z' statistic this difference is insignificant. For Dutch the LCS and skip-gram attributes performed better than the combination. According to the conducted experiments, the attribute that was most informative for the correctness of an answer validation pair is LCS.

The present AVE system works fast and is characterized by quick training and testing phases, which is very important for a real Question Answering application. Another benefit of our AVE system comes from the nature of the attributes, which depend only on the length of the sentences and the number of overlapping words.
This allowed us to normalize the attributes and to use them as a comparative measure for the languages for which we had no training data.

In the future we want to study the influence of stemming for the different languages. We are also interested in improving the AVE system for Spanish and English through the incorporation of syntactic information, the measurement of the similarity of noun phrases and the validation of named entities.

5 Acknowledgements

This research has been partially funded by the Spanish Government under CICyT project TIC2003-07158-C04-01 and PROFIT project FIT-340100-2004-14, and by the Valencia Government under project GV04B-276.

References

1. J. Cohen. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 1960.
2. Ido Dagan and Oren Glickman. Probabilistic textual entailment: Generic applied modeling of language variability. In Proceedings of the PASCAL Workshop on Text Understanding and Mining.
3. Oren Glickman. Applied Textual Entailment. PhD thesis, Bar Ilan University, 2005.
4. Zornitsa Kozareva and Andrés Montoyo. An approach for textual entailment recognition based on stacking and voting. In Proceedings of the 5th Mexican International Conference on Artificial Intelligence, MICAI, 2006.
5. Zornitsa Kozareva and Andrés Montoyo. MLEnt: The machine learning entailment system of the University of Alicante. In Proceedings of the PASCAL Challenges Workshop on Recognising Textual Entailment, 2006.
6. T. Landauer and S. Dumais. A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, pages 211-240, 1997.
7. Anselmo Peñas, Alvaro Rodrigo, and Felisa Verdejo. SPARTE, a test suite for recognising textual entailment in Spanish. In Proceedings of CICLing, 2006.
8. Ted Pedersen. Assessing system agreement and instance difficulty in the lexical sample tasks of SENSEVAL-2. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA, USA.
9. Sonia Vázquez, Zornitsa Kozareva, and Andrés Montoyo. Textual entailment beyond semantic similarity information. In Proceedings of the 5th Mexican International Conference on Artificial Intelligence, MICAI, 2006.