Comparing syntactic semantic patterns and passages in Interactive Cross Language Information Access (iCLEF at University of Alicante) Borja Navarro, Fernando Llopis, Miguel Ángel Varó Departamento de Lenguajes y Sistemas Informáticos. University of Alicante. Alicante, Spain. {borja, llopis, mvaro}@dlsi.ua.es Abstract In this paper we will present the result of the interactive CLEF experiment at the University of Alicante. Our aim was to compare two interactive approaches: one based on passages (presented at the iCLEF 2002 [5]), and a new interactive approach based on syntactic semantic patterns. These patterns are composed by the main verb of a sentence plus its arguments, and they are extracted automatically from the passages. With this, these patterns show only the basic information of each sentence. The objective was to know which of these approaches is most useful and fast in the selection of relevant documents by the user in a language different than one of the query (and of the user). The results show that both approach are useful, but the approach based on syntactic semantic patterns is, in the majority of cases, more fast. Finally, with these approaches we avoid the use of Machine Translation systems, due to the problems that they have in Interactive Cross-Language Information Access tasks. 1 Introduction One of the most important aspects of the Interactive Multilingual Information Access is the way in which the system shows the retrieved documents to the user; mainly, the way in which the system shows the relevant information. Only with this information the user must decide if the retrieved documents are relevant or not. This is a key point in order to ensure the correct selection of documents, and a key point for future refinements of the query. The main problem is the multilingualism: the user formulates the query in one language, but the relevant documents are written in a different language. To deal with this situation, there are two main solutions: to show the relevant documents to the user in his/her language, or to show the relevant documents in the language of the documents. In the first case, a translation of the document with a Machine Translation system is necessary. However, there are many problems related to Machine Translation. In the second case –to show the information in the language of the document–, it is possible that the user does not be able to understand the information, and does not be able to decide which documents are the relevant ones. At iCLEF 2002, the University of Alicante proposed the use of passage for the interaction with the user. In this approach, the system selects the most important passage of the retrieved documents. Each passage is translated to the language of the user with a Machine Translation system and, then, the translated passages are shown to the user. The experiment concluded that this approach based on passages is more fast and has more precision than the approaches based on the whole document. The user only read the relevant passage, not the whole document. This is enough information to decide if the document is relevant or not with high precision. However, there is an important problem with this approach: a lot of passages was unreadable for the user due to problems with the machine translation from English to Spanish [5]. This year we want to improve this approach in two aspects: first, we want to improve the interaction speed –that is, the time consuming by the user between the uploading of the passage to the decision about its relevance–; and second, we want to improve the recall and precision in the selection of relevant documents. On other hand, we want to solve the problem with the Machine Translation systems. To do this, we have defined an interactive approach based on syntactic semantic patterns [7]. Each syntactic semantic pattern is formed by a verb and the subcategorizated nouns. From a semantic point of view, in each pattern appears the main words of the sentence. We think that it is possible and useful to use these patterns in the interaction with the user, because each pattern contains the main concepts of the document and their syntactic and semantic relations. Instead of showing the passage translated to the user, we show only the syntactic semantic pattern of each sentence in the language of the document (without translation). The users have passive abilities in the foreign language (English), so we think that this is enough information to decide about the relevance of a document. To conclude, the objectives of our experiment at iCLEF 2003 are: • to know if it is possible that a searcher decide if a document is relevant or not only with the syntactic semantic patterns extracted automatically from the passage; • to know if the approach based on syntactic semantic patterns is better than the approach based on passages only; • to know if the approach based on syntactic semantic patterns is better than the approach based on a machine translation of the passage. In the next section, we will present these two methods of interaction with the user, the method based on passages and the method based on patterns. Then we will describe briefly the experiment design and the results. Finally, we will show the conclusions and future works. 2 Two methods for the interaction with the users 2.1 Approach based on passages The first interaction method is based on passage. A passage is the most relevant pieces of text of a document. The main idea of this approach is that it is better to show only the relevant passage to the user, instead of the whole document. This approach was proved at iCLEF 2002 with good results (for more information, see [5]). 2.2 Approach based on syntactic semantic patterns The second interaction method is based on syntactic-semantic patterns. These patterns are auto- matically extracted from the passage selected by the Information Retrieval system. The difference between both methods is only how to show the relevant information to the user in a language different from the one of the query: the passage in English only –method one–, or the syntactic semantic patterns extracted from these passages in English too –method two–. From a theoretical point of view, basically, a syntactic semantic pattern is a linguistic pattern formed by three fundamental components: 1. A verb with its sense or senses. 2. The subcategorization frame of the sense. 3. The selectional preferences of each argument. For the establishment of this kind of pattern we have take into account several works about subcategorization frame and subcategorization acquisition ([1], [2]), about the relation between verb sense and verb subcategorization ([10], [9]), and about selectional preference ([8], [6]). For the automatic extraction of these patterns, we have use the syntactic parser Minipar [3]. The extraction system looks for a verb. When a verb is located, it is extracted. Then the system looks for a noun at the left of the verb. If a noun is locates, it is extracted with the verb. Then, the system looks for a noun or preposition plus noun at the right of the verb. If a noun or preposition plus noun is located, it is extracted with the verb and the previous noun. Finally, the system looks for an other noun or preposition plus noun at the right of this noun. If a noun or preposition plus noun is located, it is extracted with the verb and the previous nouns. For example, from this passage: “Primakov suggested that the Administration was using the Ames arrest to score domestic po- litical points, to punish Russia for its independent stance on the conflict in Bosnia-Herzegovina and to provide convenient excuse for cutting American aid to Russia, according to journalists who attended.” the system extracts patterns like these: • Primakov suggest Administration • administration use Ames arrest • administration score domestic point • Primakov punish Russia for its stance • Primakov provide convenient excuse for • Primakov cut American aid to Russia according to journalist • journalist attend With these syntactic semantic patterns, only the most important information of each sentence is shown to the user: the most important words of each sentence –the verb and the subcategorizated nouns– and the syntactic and semantic relation between them. Due to the searchers have not fluency nor deep knowledge about the foreign language (English in our experiment), we think that it is better not to process the sentences completely. In order to decide about the relevance of a passage, it is more easy to put the attention on the main words of the document only, that is, to put the attention on the syntactic semantic patterns only. With this patters, to understand a text written in a foreign language completely is difficult. However, this is not our objective. Our objective is to know the topic of a text or passage and to decide if it is relevant or not. 3 Description of the experiment We have focused our experiment in the cross-language document selection with a searcher group which has passive language abilities in the foreign language. The language of the user group is Spanish, and the foreign language is English. The Information Retrieval system used is IR-n system, developed at the University of Alicante [4].The system uses the complete query (tittle, description and narrative) in the search of relevant documents. From each query, the IR-n system locate twenty five (possible) relevant documents. SYSTEM F-alpha average Passages 0.45416703125 Patterns 0.43622984375 Table 1: F-alpha average Each retrieved document are shown to the user following the order of the experiment design. The first system shows only the passages, and the second system shows the syntactic-semantic patterns extracted from these passages. Each searcher must decided, showing the passage or the patterns, if the document extracted is relevant or not. As we said before, passages and patterns are written in English, the foreign language. Together with the relevance judgment, we save information about time consuming for each document. Finally, we have developed an HTML interface in order to facilitate the decision of the user about the relevance of each document. 4 Results The results about the f-alpha average is shown in the Table 1. These results show that it is possible to decide if a text is relevant or not only with the interaction with the syntactic semantic patterns. The result obtained for each system is very similar (only a difference of 0.0179371875). The first hypothesis is correct. The time consuming by each searcher during the decision about the relevance of a document is shown in the Table 2. Five of the eight searches consume less time with patterns than with passages. Only in one case, the time consuming with the patterns and with the passages is very similar (searcher 3). Finally, searcher 5 and searcher 6 use much more time with the patters than with the passage. However, this is an abnormal time consuming, because the different time consuming between passages and patterns is very large. Probably, due to problems during the experiment. Finally, the searcher that has obtain the better result (searcher 4) consumed more time with the passages than with the patterns. With this data, we can conclude that the use of pattern in the interaction process improve the time consuming in the majority of cases. So the second hypothesis is, in some way, correct. Finally, with the syntactic semantic pattern it is possible to avoid the use of Machine Trans- lation systems. The results obtained this year at the iCLEF 2003 are better than the results obtained at iCLEF 2002, in which a Machine Translation system was used. 5 Conclusions In this experiment, we have compare two method of interaction with a IR system, the first one based on passages and the second one based on syntactic semantic patterns. The results show that it is possible to decide if a document is relevant or not with the syntactic semantic patterns only. On other hand, the time consuming is less with the patterns than with the passages in the majority of cases. Finally, with these patterns it is possible to avoid the use of Machine Translation systems. These syntactic semantic patterns are a simplification of the language: each pattern contains the main concepts and linguistic relations of a sentence. Due to this simplification, it is possible the use of these patterns in other cross-linguistic information access task as, for example, the indexation and search of documents by patterns, the alignment of pattern extracted from different languages (through the verb), or the refinement of the query with the patterns contained in the documents selected by the user. System Searcher 1 Searcher 2 Searcher 3 Searcher 4 Passages 4359 5641 5361 5533 Patterns 3195 4840 5548 3063 System Searcher 5 Searcher 6 Searcher 7 Searcher 8 Passages 1350 1707 1835 5287 Patterns 829 5046 7957 2555 Table 2: Time consuming 6 Acknowledgements We would like to thank the users (Belén, Raquel, Irene, Julio, Rafa, Ángel, Sonia and Yenori) and Rubén, who implemented the pattern extraction system. References [1] Ted Bricoe and John Carroll. Automatic Extraction of Subcategorization from Corpora. In Workshop on Very Large Corpora, Copenhagen. 1997. [2] Anna Korhonen. Subcategorization acquisition. Technical Report. University of Cambridge, Cambridge, 2002. [3] Dekang Lin. Dependency-based Evaluation of MINIPAR. In Workshop on the Evaluation of Parsing Systems, Granada. 1998. [4] Fernando Llopis. IR-n: Un sistema de recuperación de información basado en pasajes. PhD thesis, University of Alicante, 2003. [5] Fernando Llopis, Antonio Ferrández, José Luis Vicedo, Manuel Dı́az, and Fernando Martı́nez. iCLEF at Unversities of Alicante and Jaen. Workshop of Cross-Language Evaluation Forum (CLEF 2002), Lecture Notes in Computer Science, Springer-Verlang, 2002. [6] Diana McCarthy. Lexical Acquisiton at the Syntax-Semantics Interface: Diathesis Alterna- tions, Subcategorization Frames and Selectional Preferences. PhD thesis, University of Sussex, 2001. [7] Borja Navarro, Manuel Palomar, and Patricio Martı́nez-Barco. A General Proposal to Mul- tilingual Information Access based on Syntactic Semantic Patterns. In Anje Düsterhöft and Bernhard Thalheim, editor, Natural Language Processing and Information Systems - NLDB 2003, pages 186–199. Lecture Notes in Informatics, GI-Edition, Bonn, 2003. [8] Philip S. Resnik. Selection and Information: A Class-Based Approach to Lexical Relation- ships. PhD thesis, University of Pennsylvania, 1993. [9] Douglas Roland. Verb Sense and Verb Subcategorization Probabilities. PhD thesis, University of Colorado, Colorado, 2001. [10] Douglas Roland and Daniel Jurafsky. Verb sense and verb subcategorization probabilities. In P. Merlo and S. Stevenson, editors, The Lexical Basis of Sentence Processing: Formal, Computational, and Experimetal Issues, pages 325 – 346. John Benjamins, Amsterdam, 2002.