=Paper=
{{Paper
|id=Vol-1167/CLEF2001wn-iCLEF-LopezOsteneroEt2001
|storemode=property
|title=Noun Phrase Translations for Cross-Language Document Selection
|pdfUrl=https://ceur-ws.org/Vol-1167/CLEF2001wn-iCLEF-LopezOsteneroEt2001.pdf
|volume=Vol-1167
|dblpUrl=https://dblp.org/rec/conf/clef/Lopez-OsteneroGPV01a
}}
==Noun Phrase Translations for Cross-Language Document Selection==
Noun Phrase Translations for Cross-Language Document Selection

Fernando López-Ostenero, Julio Gonzalo, Anselmo Peñas and Felisa Verdejo
Departamento de Lenguajes y Sistemas Informáticos
Universidad Nacional de Educación a Distancia
E.T.S.I. Industriales, Ciudad Universitaria s/n, 28040 Madrid, SPAIN
{flopez,julio,anselmo,felisa}@lsi.uned.es
WWW home page: http://sensei.lsi.uned.es/NLP

Abstract. This paper presents UNED's results for the CLEF interactive Cross-Language Document Selection task. Two translation techniques were compared: the standard Systran translations provided by the CLEF organizers as a baseline, and a phrase-based pseudo-translation approach that uses a phrase alignment algorithm based on comparable corpora. The hypothesis being tested was that noun phrase translations could serve as summarized information for relevance judgment without compromising the precision of such judgments. In addition, we wanted an indirect measure of the quality of our phrase extraction process, which had previously been developed for an interactive CLIR application. The results of the experiment confirm that the hypothesis is reasonable: a set of 8 monolingual Spanish speakers judged English documents with the same precision under both systems, but achieved 52% more recall using phrasal translations than using full Systran translations.

1 Introduction

The goal of the CLEF 2001 interactive track (iCLEF) was to compare ways of informing a monolingual searcher about the content of documents written in foreign languages: a better system will allow for better relevance judgments and therefore better foreign-language document selection [2]. The baseline approach is to use standard Machine Translation (MT) to produce translated versions of the documents. Our intuition was that translations produced by MT are noisy and much harder to read and understand than hand-written documents. Perhaps a smaller amount of information, with the best translated phrases highlighted, could facilitate relevance judgment without a significant loss of precision.

To test this hypothesis, we took advantage of phrase extraction software previously developed within our research group for an interactive CLIR application [3]. This software is able to index noun phrases in large text collections in a variety of languages (including Spanish and English), providing good starting material for a phrase-based summarized translation of the documents used in the iCLEF task. We then performed the following steps:

1. Extract phrasal information from the 200 documents (50 per iCLEF query) of the English CLEF 2000 collection.
2. Find a (large) Spanish corpus comparable to the iCLEF documents. This choice was easy, as the CLEF 2001 test set includes a comparable collection (EFE newswire 1994) of 250,000 Spanish documents (approximately 1 Gb of text including SGML tags).
3. Extract phrasal information from the EFE 1994 collection.
4. Develop an alignment algorithm to obtain optimal Spanish translations for all phrases in the English documents.
5. Incorporate phrasal translations into a display strategy for the iCLEF document selection task.
6. Carry out the comparative evaluation between our system and Systran translations, following the iCLEF 2001 guidelines.

Besides testing our main hypothesis, we had three additional goals: first, scaling up the phrase extraction software to handle CLEF-size collections; second, enriching that software with a phrase-alignment algorithm that exploits comparable corpora; and third, obtaining an indirect measure (via document selection) of the quality of that software.

In Section 2 we describe our phrase-based approach to document translation. In Section 3 the experimental setup for the evaluation is explained. In Section 4 results are presented and discussed. Finally, in Section 5 we draw some conclusions.

2 Phrase-based pseudo-translations

2.1 Phrase extraction

We have used the phrase extraction software from the UNED WTB multilingual search engine [3]. This software performs robust and efficient noun phrase extraction in several languages, and provides two kinds of indexes:

- one maps every (lemmatized) word into every noun phrase that contains a morphological variant of the word;
- the other maps every noun phrase into the documents that contain that phrase.

Noun phrases are extracted using shallow NLP techniques:

1. Words are lemmatized using morphological analyzers. The Spanish processor uses MACO+ [1], and the English processor uses TreeTagger [4].
2. Words are tagged for Part-Of-Speech (POS). No POS tagger, to our knowledge, is able to process gigabytes of text. Therefore, a fast approximation to tagging is performed: in the case of Spanish, a set of heuristics has been devised to ensure maximal recall in the phrase detection phase. For other languages, the most frequent POS is assigned to all occurrences of a word.
3. A shallow parsing process identifies noun phrases that satisfy the following (flexible) pattern (a minimal sketch of this step follows the list):

   [noun|adj] [noun|adj|prep|det|conj]* [noun|adj]

4. Finally, indexes for lemma -> phrases and phrase -> documents are created.
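To make the pattern concrete, here is a minimal sketch of the shallow parsing step in Python. It is an illustration under our own assumptions, not the actual WTB code: it assumes input that is already lemmatized and POS-tagged as (lemma, pos) pairs, assumes the middle part of the pattern may repeat zero or more times, and the names (TAG_CODES, extract_noun_phrases) are ours.

    import re

    # Hypothetical one-character codes for the coarse POS categories in the pattern.
    TAG_CODES = {"noun": "N", "adj": "A", "prep": "P", "det": "D", "conj": "C"}

    # The pattern above: [noun|adj] [noun|adj|prep|det|conj]* [noun|adj]
    NP_PATTERN = re.compile(r"[NA][NAPDC]*[NA]")

    def extract_noun_phrases(tagged_sentence):
        """Return maximal candidate noun phrases (as lemma tuples) from a
        list of (lemma, pos) pairs. The regex match is greedy, so each
        phrase found is not contained in a larger matching phrase."""
        tags = "".join(TAG_CODES.get(pos, "x") for _, pos in tagged_sentence)
        return [tuple(lemma for lemma, _ in tagged_sentence[m.start():m.end()])
                for m in NP_PATTERN.finditer(tags)]

    tagged = [("the", "det"), ("abortion", "noun"), ("issue", "noun"),
              ("dominate", "verb"), ("the", "det"),
              ("international", "adj"), ("conference", "noun")]
    print(extract_noun_phrases(tagged))
    # [('abortion', 'issue'), ('international', 'conference')]

Note that a leading determiner is excluded because the pattern must start with a noun or adjective, and a trailing preposition is dropped because it must also end with one.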
The collection of 200 English documents is very small and poses no problem for indexing. The EFE collection, however, consists of about 250,000 documents corresponding to about 1 Gb of text. Before attempting this iCLEF experiment, the largest collection processed with our system had 60,000 documents. In order to process the EFE collection with our (limited) hardware resources, it was necessary to re-program most of the system. These are the approximate figures for the indexing process: 375,000 different words were detected, of which 250,000 were not recognized by the morphological analyzer and correspond to proper nouns, typos, foreign words, or words not covered by the dictionary. Overall, 280,000 different lemmas (including unknown words) are considered, and 26,700,000 different candidate phrases are detected. From this set, we have retained the 3,600,000 phrases that appear more than once in the collection.

In the WTB search engine, such indexes are used to provide multilingual phrase-browsing capabilities in an interactive CLIR setting. In the present work, however, the data is used as statistical information to provide translations for the English phrases in iCLEF documents.

2.2 Phrase alignment

For each English phrase, we start by translating all content words in the phrase using a bilingual dictionary. For instance:

phrase: "abortion issue"
lemmas: abortion, issue
translations:
  abortion -> aborto
  issue -> asunto, tema, edición, número, emisión, expedición, descendencia, publicar, emitir, expedir, dar, promulgar

For each word in the translation set, we consider all Spanish phrases that contain that word. The set of all such phrases forms the pool of related Spanish phrases. Then we search for all phrases that contain one (and exactly one) translation for every term of the original phrase. This subset of the related Spanish phrases forms the set of candidate translations. In the previous example, the system finds the following candidates for "abortion issue":

  phrase                    frequency
  tema del aborto           16
  asunto del aborto         12
  asuntos como el aborto    5
  asuntos del aborto        2
  temas como el aborto      2
  asunto aborto             2

If the subset is non-empty (as in the example above), the system selects the most frequent phrase as the best phrasal translation. Therefore "tema del aborto" is (correctly) chosen as the translation for "abortion issue". Note that all the other candidate phrases also disambiguated "issue" correctly as "tema, asunto". A minimal sketch of this alignment step is shown below.
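The sketch is an illustrative re-implementation under our own assumptions (plain-string phrases, dictionary-based indexes; all names are ours, not those of the WTB system), not the actual alignment code:

    def align_phrase(english_lemmas, bilingual_dict, phrases_by_word, phrase_freq):
        """Return the most frequent Spanish phrase that contains exactly one
        dictionary translation of every lemma of the English phrase, or None."""
        options = [set(bilingual_dict.get(w, ())) for w in english_lemmas]
        if not all(options):
            return None  # some term has no dictionary translation
        # Pool of related Spanish phrases: all phrases containing any candidate word.
        pool = {p for opts in options for w in opts
                for p in phrases_by_word.get(w, ())}
        # Candidate translations: exactly one translation of each source term.
        candidates = [p for p in pool
                      if all(len(set(p.split()) & opts) == 1 for opts in options)]
        if not candidates:
            return None  # trigger the subphrase / word-by-word fallback below
        return max(candidates, key=lambda p: phrase_freq.get(p, 0))

    # Toy data reproducing the "abortion issue" example:
    bilingual_dict = {"abortion": ["aborto"], "issue": ["asunto", "tema", "edición"]}
    phrases_by_word = {"aborto": ["tema del aborto", "asunto del aborto"],
                       "tema": ["tema del aborto"],
                       "asunto": ["asunto del aborto"]}
    phrase_freq = {"tema del aborto": 16, "asunto del aborto": 12}
    print(align_phrase(["abortion", "issue"], bilingual_dict,
                       phrases_by_word, phrase_freq))  # -> tema del aborto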
Other alignment examples include:

  English                  # candidates   selected                      frequency
  abortion issue           6              tema del aborto               16
  birth control            3              control de los nacimientos    8
  religious and cultural   10             culturales y religiosos       14
  last year                52             año pasado                    8837

The most appropriate translation for "birth control" would rather be "control de la natalidad" (with a frequency of 107), but the dictionary does not provide a link between "birth" and "natalidad". The selected term "control de los nacimientos", however, is unusual but understandable (in context) for a Spanish speaker.

If the set of candidate translations is empty, two steps are taken:

1. Subphrase translation: the system looks for maximal sub-phrases that can be aligned according to the previous step. These are used as partial translations.
2. Word-by-word contextual translation: the remaining words are translated using phrase statistics to take context into account: from all translation candidates for a word, we choose the candidate that is included in most phrases from the original pool of related Spanish phrases (see the sketch after the example below).

For instance:

phrase: "day international conference on population and development"
lemmas: day, international, conference, population, development
possible translations:
  day -> día, jornada, época, tiempo
  international -> internacional
  conference -> congreso, reunión
  population -> población, habitantes
  development -> desarrollo, avance, cambio, novedad, explotación, urbanización, revelado
subphrase alignments:
  day international -> jornadas internacionales
  day international conference -> jornada del congreso internacional
word-by-word translations:
  population -> población
  development -> desarrollo
final translation: "jornada del congreso internacional población desarrollo"

Note that, while the indexed phrase is not an optimal noun phrase ("day" should be removed) and the translation is not fully grammatical, the lexical selection is accurate, and the result is easily understandable for most purposes (including document selection).
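The word-by-word contextual step can be sketched in the same style as the alignment sketch above; again, this is a minimal illustration under our assumptions rather than the actual code. It picks, among the dictionary translations of a word, the one supported by the largest number of phrases in the pool of related Spanish phrases:

    def contextual_translation(word, bilingual_dict, pool):
        """Translate a single word, preferring the candidate that appears
        in the most phrases of the pool of related Spanish phrases."""
        candidates = bilingual_dict.get(word, [])
        if not candidates:
            return word  # unknown words are left untranslated
        return max(candidates,
                   key=lambda c: sum(1 for p in pool if c in p.split()))

    pool = ["crecimiento de la población", "población mundial",
            "número de habitantes"]
    print(contextual_translation("population",
                                 {"population": ["población", "habitantes"]},
                                 pool))
    # -> población  (supported by 2 phrases, vs. 1 for "habitantes")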
2.3 Phrase-based document translation

The pseudo-translation of a document is made using the information obtained in the alignment process. The basic process is:

1. Find all maximal phrases (i.e., phrases not included in bigger units) in the document, and sort them by order of appearance in the document.
2. List the translations obtained for each original phrase according to the alignment phase, highlighting:
   - phrases that have an optimal alignment (boldface);
   - phrases containing query terms (bright colour).

As an example, let us consider this sentence from one of the iCLEF documents:

English sentence:
  the abortion issue dominated the nine-day International Conference on Population and Development.

A valid manual translation of the above sentence would be:

Manual translation:
  el tema del aborto dominó las nueve jornadas del Congreso Internacional sobre Población y Desarrollo.

while Systran produces:

Systran MT translation:
  la edición del aborto dominó el de nueve días Conferencia internacional sobre la población y el desarrollo.

Aside from grammatical correctness, the Systran translation makes only one relevant mistake, interpreting "issue" as in "journal issue" and producing "edición del aborto" (meaningless) instead of "tema del aborto". Our phrase indexing process, on the other hand, identifies two maximal phrases:

  abortion issue
  day International Conference on Population and Development

which receive the translations shown in the previous section. The final display of our system is:

Phrasal pseudo-translation:
  tema del aborto
  jornada del congreso internacional población desarrollo

where boldface is used for optimal phrase alignments, which are supposed to be less noisy translations. If any of the phrases contains a (morphological variant of a) query term for a particular search, the phrase is further highlighted.

Fig. 1. Search interface: MT system

3 Experimental setup

3.1 Experiments and searchers

We ran three experiments with different searcher profiles: for the main experiment, we recruited 8 volunteers with little or no proficiency in English. For purposes of comparison, we formed two additional 8-person groups with mid-level and high-level English skills.

3.2 Search protocol and interface description

We followed closely the search protocol established in the iCLEF guidelines [2]. The time for each search, and the combination of topics and systems, were fully controlled by the system interface. Most of the searchers used the system locally, but five of them (UNED students) carried out the experiments via the Internet from their study center (with the same monitor present).

Figure 1 shows an example of a document displayed in the Systran MT system. Figure 2 shows the same document paragraph in our phrase-based system. The latter shows less information (only the noun phrases extracted and translated by the system), highlights phrases containing query terms (bright green) and emphasizes reliable phrasal translations (boldface).

Fig. 2. Search interface: Phrases system

Table 1. Overview of results.

Main (low level of English):
  System      P           R           F0.8        F0.2
  Systran MT  .48         .22         .28         .21
  Phrases     .47 (-2%)   .34 (+52%)  .35 (+25%)  .32 (+52%)

Mid level of English:
  System      P           R           F0.8        F0.2
  Systran MT  .62         .31         .41         .31
  Phrases     .46 (-25%)  .25 (-19%)  .30 (-26%)  .24 (-22%)

High level of English:
  System      P           R           F0.8        F0.2
  Systran MT  .58         .34         .42         .34
  Phrases     .53 (-12%)  .45 (+32%)  .39 (-7%)   .38 (+11%)
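For reference, the F0.8 and F0.2 figures here and in Tables 2-4 are consistent with van Rijsbergen's effectiveness measure, which is our reading of the iCLEF 2001 guidelines [2]:

    F_\alpha = \frac{1}{\frac{\alpha}{P} + \frac{1-\alpha}{R}}

so that alpha = 0.8 emphasizes precision and alpha = 0.2 emphasizes recall. Note that the averages reported in the tables are means of per-run F values, not F applied to the averaged P and R.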
4 Results and discussion

The main precision/recall and F figures can be seen in Table 1. In summary, the main results are:

- In the main experiment with monolingual searchers ("low level of English"), precision is very similar, but phrasal translations obtain 52% more recall. Users judge documents faster without loss of accuracy.
- Users with good knowledge of English show a similar pattern, but the gain in recall is lower, and the absolute figures are higher both for MT and for phrasal translations. As unknown words remain untranslated and English-speaking users may recognize them, these results are coherent with the main experiment. See Figure 3 for a comparison between low and high English skills.
- Mid-level English speakers have lower precision and recall with the phrasal translation system, contradicting the results for the other two groups. A careful analysis of the data revealed that this experiment was spoiled by the three searchers who did the experiment remotely (see the discussion below).

A detailed discussion of each of the three experiments follows.

4.1 Low level of English (main experiment)

The results of this experiment, detailed by searcher and topic, can be seen in Table 2. Looking at the average figures per searcher, the results are compatible except for searcher 1 (with very low recall) and searcher 5 (with very low recall and precision):

- Within this group, searcher 1 was the only one who did the experiment remotely, and problems with the net connection seriously affected recall for both systems and all topics. Unfortunately, this problem also affected three searchers in the mid-level English group and one in the high-level group.
- Examining the questionnaires filled in by searcher 5, we concluded that he did not understand the task at all. He did not mark relevant documents for any of the questions, apparently judging the quality of the translations instead.

Of the eight searchers, only one was familiar with MT systems, and most of them had little experience with search engines.

In the questionnaires, most searchers preferred the phrasal system, arguing that the information was more concise and thus decisions could be made faster. However, they felt that the phrases system demanded more interpretation from the user. The MT system was perceived as giving more detailed information, but too dense to reach easy judgments. All these impressions are coherent with the precision/recall figures obtained, and confirm our hypothesis about the potential benefits of phrasal pseudo-translations.

Fig. 3. High versus low English skills (precision versus recall for MT-High, Phrases-High, MT-Low and Phrases-Low).

Table 2. Low level of English (main experiment). (Runs with the phrase system were shown in boldface, runs with MT in normal font.)

Precision
  User\Topic  T-1   T-2   T-3   T-4   Avg.
  U-L-01      1     0     1     0     0.5
  U-L-02      1     0.23  0.66  1     0.72
  U-L-03      1     0.34  1     0.25  0.64
  U-L-04      1     0.09  0.33  0     0.35
  U-L-05      0     0.2   0     0     0.05
  U-L-06      1     0     0.57  0.16  0.43
  U-L-07      1     0     1     0.33  0.58
  U-L-08      0.95  0.03  1     0.25  0.55
  Avg.        0.86  0.11  0.69  0.24  0.47

Recall
  User\Topic  T-1   T-2   T-3   T-4   Avg.
  U-L-01      0.02  0     0.16  0     0.04
  U-L-02      0.19  0.5   1     0.5   0.54
  U-L-03      0.08  0.93  0.66  0.5   0.54
  U-L-04      0.11  0.06  0.5   0     0.16
  U-L-05      0     0.18  0     0     0.04
  U-L-06      0.13  0     0.66  0.5   0.32
  U-L-07      0.11  0     0.5   0.5   0.27
  U-L-08      0.55  0.06  0.33  0.5   0.36
  Avg.        0.14  0.21  0.47  0.31  0.28

F0.2
  User\Topic  T-1   T-2   T-3   T-4   Avg.
  U-L-01      0.02  0     0.19  0     0.05
  U-L-02      0.22  0.40  0.90  0.55  0.51
  U-L-03      0.09  0.69  0.70  0.41  0.47
  U-L-04      0.13  0.06  0.45  0     0.16
  U-L-05      0     0.18  0     0     0.04
  U-L-06      0.15  0     0.63  0.35  0.28
  U-L-07      0.13  0     0.55  0.45  0.28
  U-L-08      0.60  0.05  0.38  0.41  0.36
  Avg.        0.16  0.17  0.47  0.27  0.26

F0.8
  User\Topic  T-1   T-2   T-3   T-4   Avg.
  U-L-01      0.09  0     0.48  0     0.14
  U-L-02      0.53  0.25  0.70  0.83  0.57
  U-L-03      0.30  0.38  0.90  0.27  0.46
  U-L-04      0.38  0.08  0.35  0     0.20
  U-L-05      0     0.19  0     0     0.04
  U-L-06      0.42  0     0.58  0.18  0.29
  U-L-07      0.38  0     0.83  0.35  0.39
  U-L-08      0.82  0.03  0.71  0.27  0.45
  Avg.        0.36  0.11  0.56  0.23  0.31
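As a concrete check of how these figures combine, here is a minimal Python sketch (assuming the F_alpha measure given after Table 1, taken as 0 when P or R is 0) that reproduces searcher U-L-01's F0.2 row and average from Table 2:

    def f_alpha(p, r, alpha):
        """Van Rijsbergen's effectiveness measure; 0 when either value is 0."""
        return 0.0 if p == 0 or r == 0 else 1.0 / (alpha / p + (1.0 - alpha) / r)

    # Searcher U-L-01, topics T-1..T-4: (precision, recall) pairs from Table 2.
    runs = [(1.0, 0.02), (0.0, 0.0), (1.0, 0.16), (0.0, 0.0)]
    f02 = [f_alpha(p, r, 0.2) for p, r in runs]
    print([round(f, 2) for f in f02])     # [0.02, 0.0, 0.19, 0.0]
    print(round(sum(f02) / len(f02), 2))  # 0.05  (the published average)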
4.2 Mid level of English

The results for this group (see Table 3) are apparently incompatible with the other two experiments. Taking a closer look at the per-user averages, we detected that three users have extremely low recall figures, and these are precisely the users who did the experiment remotely. Excluding them, the average recall would be similar for both systems. The lesson learned from this spoiled experiment is, of course, that we have to be far more careful about keeping the experimental conditions stable (and that we should not rely on the Internet for this kind of experiment!).

Table 3. Mid level of English. (Runs with the phrase system were shown in boldface, runs with MT in normal font.)

Precision
  User\Topic  T-1   T-2   T-3   T-4   Avg.
  U-M-01      1     0     1     0     0.5
  U-M-02      1     0     1     0     0.5
  U-M-03      1     0.26  1     0.5   0.69
  U-M-04      1     0.31  0     0     0.32
  U-M-05      1     0.36  1     0.33  0.67
  U-M-06      0.81  0.30  0.66  0.33  0.52
  U-M-07      1     0     1     0     0.5
  U-M-08      0.90  0     0.66  1     0.64
  Avg.        0.96  0.15  0.79  0.27  0.54

Recall
  User\Topic  T-1   T-2   T-3   T-4   Avg.
  U-M-01      0.11  0     0.66  0     0.19
  U-M-02      0.02  0     0.16  0     0.04
  U-M-03      0.13  0.5   0.5   0.5   0.40
  U-M-04      0.13  0.31  0     0     0.11
  U-M-05      0.11  0.68  0.66  0.5   0.48
  U-M-06      0.25  0.87  0.66  0.5   0.57
  U-M-07      0.08  0     0.16  0     0.06
  U-M-08      0.27  0     0.33  1     0.4
  Avg.        0.13  0.29  0.39  0.31  0.28

F0.2
  User\Topic  T-1   T-2   T-3   T-4   Avg.
  U-M-01      0.13  0     0.70  0     0.20
  U-M-02      0.02  0     0.19  0     0.05
  U-M-03      0.15  0.42  0.55  0.5   0.40
  U-M-04      0.15  0.31  0     0     0.11
  U-M-05      0.13  0.57  0.70  0.45  0.46
  U-M-06      0.29  0.63  0.66  0.45  0.50
  U-M-07      0.09  0     0.19  0     0.07
  U-M-08      0.31  0     0.36  1     0.41
  Avg.        0.15  0.24  0.41  0.3   0.27

F0.8
  User\Topic  T-1   T-2   T-3   T-4   Avg.
  U-M-01      0.38  0     0.90  0     0.32
  U-M-02      0.09  0     0.48  0     0.14
  U-M-03      0.42  0.28  0.83  0.5   0.50
  U-M-04      0.42  0.31  0     0     0.18
  U-M-05      0.38  0.39  0.90  0.35  0.50
  U-M-06      0.55  0.34  0.66  0.35  0.47
  U-M-07      0.30  0     0.48  0     0.19
  U-M-08      0.61  0     0.55  1     0.54
  Avg.        0.39  0.16  0.6   0.27  0.35

4.3 High level of English

The detailed results for the group with good language skills can be seen in Table 4. Again, one searcher deviates from the rest with a very low average recall, and it is the only one who did the experiment remotely (searcher 6). Aside from this, higher English skills apparently lead to higher recall and precision rates. This is a reasonable result, as untranslated words can be understood, and translation errors can more easily be tracked back. Precision is 12% lower with the phrasal system, but recall is 32% higher. Overall, F0.8 is higher for the MT system, and F0.2 is higher for the phrasal system.

Besides having higher English skills, these searchers had more experience using graphical interfaces, search engines and Machine Translation programs. In agreement with the first group, they felt that the MT system gave too much information, and they also complained about the quality of the translations. Overall, however, they preferred the MT system to the phrasal one: translated phrases permitted faster judgments, but the searcher needed to add more subjective interpretation of the information presented. All these subjective impressions are in agreement with the final precision/recall figures.

Table 4. High level of English. (Runs with the phrase system were shown in boldface, runs with MT in normal font.)

Precision
  User\Topic  T-1   T-2   T-3   T-4   Avg.
  U-H-01      1     0.27  1     0     0.56
  U-H-02      0.91  0.35  0.8   0.33  0.59
  U-H-03      0.83  0.17  1     0.66  0.66
  U-H-04      1     0     1     0.5   0.62
  U-H-05      1     0.34  0.83  0.25  0.60
  U-H-06      0     0.33  0.66  0     0.24
  U-H-07      1     0.33  1     0.13  0.61
  U-H-08      1     0.21  1     0     0.55
  Avg.        0.84  0.25  0.91  0.23  0.55

Recall
  User\Topic  T-1   T-2   T-3   T-4   Avg.
  U-H-01      0.05  0.37  0.66  0     0.27
  U-H-02      0.30  0.93  0.66  0.5   0.59
  U-H-03      0.13  0.18  0.5   1     0.45
  U-H-04      0.02  0     0.83  0.5   0.33
  U-H-05      0.30  1     0.83  0.5   0.65
  U-H-06      0     0.25  0.33  0     0.14
  U-H-07      0.16  0.62  0.33  1     0.52
  U-H-08      0.08  0.43  0.33  0     0.21
  Avg.        0.13  0.47  0.55  0.43  0.39
F0.2
  User\Topic  T-1   T-2   T-3   T-4   Avg.
  U-H-01      0.06  0.34  0.70  0     0.27
  U-H-02      0.34  0.69  0.68  0.45  0.54
  U-H-03      0.15  0.17  0.55  0.90  0.44
  U-H-04      0.02  0     0.85  0.5   0.34
  U-H-05      0.34  0.72  0.83  0.41  0.57
  U-H-06      0     0.26  0.36  0     0.15
  U-H-07      0.19  0.52  0.38  0.42  0.37
  U-H-08      0.09  0.35  0.38  0     0.20
  Avg.        0.14  0.38  0.59  0.33  0.36

F0.8
  User\Topic  T-1   T-2   T-3   T-4   Avg.
  U-H-01      0.20  0.28  0.90  0     0.34
  U-H-02      0.64  0.39  0.76  0.35  0.53
  U-H-03      0.39  0.17  0.83  0.70  0.52
  U-H-04      0.09  0     0.96  0.5   0.38
  U-H-05      0.68  0.39  0.83  0.27  0.54
  U-H-06      0     0.31  0.55  0     0.21
  U-H-07      0.48  0.36  0.71  0.15  0.42
  U-H-08      0.30  0.23  0.71  0     0.31
  Avg.        0.34  0.26  0.78  0.24  0.40

5 Conclusions

Although the number of searchers does not allow for clear-cut conclusions, the results of the evaluation indicate that summarized translations, and in particular phrasal equivalents in the searcher's language, might be more appropriate for document selection than full-fledged MT. Our purpose is to reproduce a similar experiment with more users and better-controlled experimental conditions, to test our hypothesis more thoroughly in the near future.

As a side conclusion, we have shown that phrase detection and handling with shallow NLP techniques is feasible for large-scale IR collections. The major bottleneck, Part-Of-Speech tagging, can be overcome with heuristic simplifications that do not compromise the usability of the results, at least in the present application.

Acknowledgments

This work has been funded by the Spanish Comisión Interministerial de Ciencia y Tecnología, project Hermes (TIC2000-0335-C03-01).

References

1. J. Carmona, S. Cervell, L. Màrquez, M. A. Martí, L. Padró, R. Placer, H. Rodríguez, M. Taulé, and J. Turmo. An environment for morphosyntactic processing of unrestricted Spanish text. In Proceedings of the First International Conference on Language Resources and Evaluation (LREC'98), 1998.
2. Douglas W. Oard and Julio Gonzalo. The CLEF 2001 interactive track. In Carol Peters, editor, Proceedings of CLEF 2001, 2001.
3. Anselmo Peñas, Julio Gonzalo, and Felisa Verdejo. Cross-language information access through phrase browsing. In Applications of Natural Language to Information Systems, Lecture Notes in Informatics, pages 121-130, 2001.
4. Helmut Schmid. Probabilistic part-of-speech tagging using decision trees. In International Conference on New Methods in Language Processing, 1994.