=Paper=
{{Paper
|id=Vol-1173/CLEF2007wn-MorphoChallenge-AgirreEt2007
|storemode=property
|title=SemEval-2007 Task 01: Evaluating WSD on Cross-Language Information Retrieval
|pdfUrl=https://ceur-ws.org/Vol-1173/CLEF2007wn-MorphoChallenge-AgirreEt2007.pdf
|volume=Vol-1173
|dblpUrl=https://dblp.org/rec/conf/clef/AgirreLMORV07a
}}
==SemEval-2007 Task 01: Evaluating WSD on Cross-Language Information Retrieval==
Eneko Agirre (1), Oier Lopez de Lacalle (1), Bernardo Magnini (2), Arantxa Otegi (1), German Rigau (1), Piek Vossen (3)

(1) IXA NLP group, University of the Basque Country, Donostia, Basque Country. {e.agirre, jibloleo, jibotusa, german.rigau}@ehu.es
(2) ITC-IRST, Trento, Italy. magnini@itc.it
(3) Irion Technologies, Delftechpark 26, Delft, Netherlands. Piek.Vossen@irion.nl

Abstract

This paper presents a first attempt at an application-driven evaluation exercise for WSD. We used a CLIR testbed from the Cross-Language Evaluation Forum. The expansion, indexing and retrieval strategies were fixed by the organizers. The participants had to return both the topics and the documents tagged with WordNet 1.6 word senses. The organization provided training data in the form of a pre-processed SemCor which could be readily used by participants. The task had two participants, and the organizers also provided an in-house WSD system for comparison. The results do not improve over the baseline, which is not surprising given the simplistic CLIR strategy used. Other than that, the exercise was successful, and it provides the foundation for more ambitious follow-up exercises where participants will be able to build on the WSD results already available.

Categories and Subject Descriptors

H.3 [Information Storage and Retrieval]: H.3.1 Content Analysis and Indexing; H.3.3 Information Search and Retrieval; I.2 [Artificial Intelligence]: I.2.7 Natural Language Processing; H.2.3

General Terms

Measurement, Performance, Experimentation

Keywords

Word Sense Disambiguation, Cross-Language Information Retrieval

1 Introduction

Since the start of Senseval, the evaluation of Word Sense Disambiguation (WSD) as a separate task has become a mature field, with both lexical-sample and all-words tasks. In the first case the participants need to tag the occurrences of a few words, for which hand-tagged data has already been provided. In the all-words task all the occurrences of open-class words occurring in two or three documents (a few thousand words) need to be disambiguated.

The WSD community has long mentioned the need to evaluate WSD in applications, in order to check which WSD strategy is best suited for the application and, more importantly, to try to show that WSD can make a difference in applications. The successful use of WSD in Machine Translation has been the subject of some recent papers [4, 3], but its contribution to Information Retrieval (IR) is yet to be shown. There have been some limited experiments showing positive and negative evidence [14, 7, 10, 11], with the positive evidence usually focusing on IR subareas such as CLIR [5, 15] or Q&A [12]. [13] provides a nice overview of the applications of WSD and the issues involved.

With this proposal we wanted to make a first attempt at defining a task where WSD is evaluated with respect to an Information Retrieval and Cross-Lingual Information Retrieval (CLIR) exercise. From the WSD perspective, this task will evaluate all-words WSD systems indirectly on a real task. From the CLIR perspective, this task will evaluate which WSD systems and strategies work best. We are conscious that the number of possible configurations for such an exercise is very large (including sense inventory choice, using word sense induction instead of disambiguation, query expansion, WSD strategies, IR strategies, etc.), so this first edition focused on the following:
• The IR/CLIR system is fixed.
• The expansion / translation strategy is fixed.
• The participants can choose the best WSD strategy.
• The IR system is used as the upperbound for the CLIR systems.

We think that a focused evaluation where both WSD experts and IR experts use a common setting and shared resources might shed light on the intricacies of the interaction between WSD and IR strategies, provide fruitful ground for novel combinations, and hopefully allow for breakthroughs in this complex area. We see this as the first of a series of exercises, and one outcome of this task should be that the WSD and CLIR communities discuss future evaluation possibilities together.

This task has been organized as a collaboration of SemEval (http://nlp.cs.swarthmore.edu/semeval/) and the Cross-Language Evaluation Forum (CLEF, http://www.clef-campaign.org). The results were presented in both the SemEval-2007 and CLEF-2007 workshops, and a special track will be proposed for CLEF-2008, where CLIR systems will have the opportunity to use the annotated data produced as a result of the SemEval-2007 task. The task has a webpage with all the details at http://ixa2.si.ehu.es/semeval-clir.

This paper is organized as follows. Section 2 describes the task with all the details regarding datasets, expansion/translation, the IR/CLIR system used, and the steps for participation. Section 3 presents the evaluation performed and the results obtained by the participants. Finally, Section 4 draws the conclusions and mentions future work.

2 Description of the task

This is an application-driven task, where the application is a fixed CLIR system. Participants disambiguate text by assigning WordNet 1.6 synsets, and the system does the expansion to other languages, indexes the expanded documents and runs the retrieval for all the languages in batch. The retrieval results are taken as the measure of fitness of the disambiguation. The modules and rules for the expansion and the retrieval are exactly the same for all participants. We proposed two specific subtasks:

1. Participants disambiguate the corpus, the corpus is expanded to synonyms/translations and we measure the effects on IR/CLIR. Topics are not processed. (In IR, topics are the short texts which are used by the systems to produce the queries. They usually provide extensive information about the text to be searched, which can be used both by the search engine and the human evaluators.)

2. Participants disambiguate the topics per language, we expand the queries to synonyms/translations and we measure the effects on IR/CLIR. Documents are not processed.

The corpora and topics were obtained from the ad-hoc CLEF tasks. The supported languages in the topics are English and Spanish, but in order to limit the scope of the exercise we decided to use only English documents. The participants only had to disambiguate the English topics and documents. Note that most WSD systems only run on English text. Due to these limitations, we had the following evaluation settings:

IR with WSD of documents, where the participants disambiguate the documents, the disambiguated documents are expanded to synonyms, and the original topics are used for querying. All documents and topics are in English.

IR with WSD of topics, where the participants disambiguate the topics, the disambiguated topics are expanded and used for querying the original documents. All documents and topics are in English.

CLIR with WSD of documents, where the participants disambiguate the documents, the disambiguated documents are translated, and the original topics in Spanish are used for querying. The documents are in English and the topics are in Spanish.

We decided to focus on CLIR for evaluation, given the difficulty of improving IR.
The IR results are given as an illustration, and as an upperbound for the CLIR task. This use of IR results as a reference for CLIR systems is customary in the CLIR community [8].

2.1 Datasets

The English CLEF data from the years 2000-2005 comprises corpora from the 'Los Angeles Times' (year 1994) and the 'Glasgow Herald' (year 1995) amounting to 169,477 documents (579 MB of raw text, 4.8 GB in the XML format provided to participants, see Section 2.3) and 300 topics in English and Spanish (the topics are human translations of each other). The relevance judgments were taken from CLEF. This might have the disadvantage of having been produced by pooling the results of CLEF participants, and might bias the results towards systems not using WSD, especially for monolingual English retrieval. We are considering a post-hoc analysis of the participants' results in order to analyze the effect of the lack of pooling.

Due to the size of the document collection, we decided that the limited time available in the competition was too short to disambiguate the whole collection. We thus chose to take a sixth of the corpus at random, comprising 29,375 documents (874 MB in the XML format distributed to participants). Not all topics had relevant documents in this 17% sample, and therefore only 201 topics were effectively used for evaluation. All in all, we reused the 21,797 relevance judgements that involved one of the documents in the 17% sample, of which 923 are positive (the overall figures are 125,556 relevance judgements for the 300 topics, of which 5,700 are positive). For the future we would like to use the whole collection.

2.2 Expansion and translation

For expansion and translation we used the publicly available Multilingual Central Repository (MCR) from the MEANING project [2]. The MCR follows the EuroWordNet design, and currently includes English, Spanish, Italian, Basque and Catalan wordnets tightly connected through the Interlingual Index (based on WordNet 1.6, but linked to all other WordNet versions).

We only expanded (translated) the senses returned by the WSD systems. That is, given a word like 'car', it will be expanded to 'automobile' or 'railcar' (and translated to 'auto' or 'vagón' respectively) depending on the sense in WN 1.6. If a system returns more than one sense, we choose the sense with maximum weight. In case of ties, we expand (translate) all of them. The participants could thus implicitly affect the expansion results: for instance, when no sense could be selected for a target noun, the participants could either return nothing (or NOSENSE, which would be equivalent), or all senses with a 0 score. In the first case no expansion would be performed; in the second, all senses would be expanded, which is equivalent to full expansion. This fact will be mentioned again in Section 3.5. Note that in all cases we never delete any of the words in the original text.

In addition to the expansion strategy used with the participants, we tested other expansion strategies as baselines:

noexp: no expansion, original text.

fullexp: expansion (translation in the case of English to Spanish) to all synonyms of all senses.

wsd50: expansion to the best 50% of the senses as returned by the WSD system. This expansion was tried over the in-house WSD system of the organizers only.
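The following Python sketch illustrates the sense-selection and expansion rule described above. It is not the organizers' code: the synset identifiers and the small MCR dictionary are invented for illustration, and in the real setting synonyms and translations come from the Multilingual Central Repository keyed by WordNet 1.6 synsets.

```python
# Illustrative sketch of the wsdbest expansion rule (not the organizers' code).
# Hypothetical MCR fragment: synset id -> (English synonyms, Spanish translations).
MCR = {
    "car#n#1": (["automobile"], ["auto"]),
    "car#n#2": (["railcar"], ["vagón"]),
}

def senses_to_expand(weighted_senses):
    """Given (synset, weight) pairs returned by a WSD system for one token,
    keep the maximum-weight sense, or all senses in case of a tie.
    If every weight is 0 this degenerates to full expansion."""
    if not weighted_senses:          # nothing returned (or NOSENSE): no expansion
        return []
    top = max(w for _, w in weighted_senses)
    return [s for s, w in weighted_senses if w == top]

def expand_token(token, weighted_senses, lang="en"):
    """Return the original token plus its expansions; the original word is never deleted."""
    expanded = [token]
    for synset in senses_to_expand(weighted_senses):
        synonyms, translations = MCR.get(synset, ([], []))
        expanded.extend(synonyms if lang == "en" else translations)
    return expanded

# Two senses returned with weight 0 tie, so both are expanded (full expansion).
print(expand_token("car", [("car#n#1", 0.0), ("car#n#2", 0.0)]))
# -> ['car', 'automobile', 'railcar']
```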
2.3 IR/CLIR system

The retrieval engine is an adaptation of the TwentyOne search system [9], which was developed during the 1990s by the TNO research institute in Delft (The Netherlands), obtaining good results in IR and CLIR exercises at TREC [8]. It is now further developed by Irion Technologies as a cross-lingual retrieval system [15]. For indexing, the TwentyOne system takes Noun Phrases as input. Noun Phrases (NPs) are detected using a chunker and a word-form-with-POS lexicon. Text outside the NPs is not indexed, nor are non-content words (determiners, prepositions, etc.) within the phrases.

The Irion TwentyOne system uses a two-stage retrieval process where relevant documents are first extracted using vector space matching, and then phrases are matched against specific queries. The system is optimized for high-precision phrase retrieval with short queries (1 to 5 words, with a phrasal structure as well). The system can be stripped down to a basic vector space retrieval system with a tf.idf metric that returns documents for topics up to a length of 30 words. This stripped-down version was used for this task to make the retrieval results compatible with the TREC/CLEF setup.

The Irion system was also used for pre-processing. The CLEF corpus and topics were converted to the TwentyOne XML format and normalized, and named entities and phrasal structures were detected. Each of the target tokens was identified by a unique identifier.
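As a minimal sketch of the kind of stripped-down tf.idf vector-space matching described above, the following code ranks documents by cosine similarity between topic and document term vectors. It is not the TwentyOne/Irion implementation: there is no NP chunking, no phrase matching and no query-length limit here, and all names are made up for illustration.

```python
# Minimal tf.idf vector-space retrieval sketch (illustrative only).
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: {doc_id: list of tokens}. Returns a tf.idf vector per document."""
    n = len(docs)
    df = Counter(t for toks in docs.values() for t in set(toks))
    idf = {t: math.log(n / df[t]) for t in df}
    return {d: {t: tf * idf[t] for t, tf in Counter(toks).items()}
            for d, toks in docs.items()}

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def retrieve(topic_tokens, docs):
    """Rank documents by cosine similarity between the topic and each document."""
    vectors = tfidf_vectors(docs)
    query = dict(Counter(topic_tokens))      # raw tf weights for the query
    return sorted(vectors, key=lambda d: cosine(query, vectors[d]), reverse=True)

docs = {"d1": ["car", "automobile", "engine"], "d2": ["bank", "river"]}
print(retrieve(["car"], docs))               # -> ['d1', 'd2']
```

Expanded documents or topics simply contribute extra tokens to these vectors, which is how the expansion strategies of Section 2.2 interact with retrieval in this setup.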
2.4 Participation

The participants were provided with the following:

1. the document collection in Irion XML format
2. the topics in Irion XML format

In addition, the organizers also provided some of the widely used WSD features in a word-to-word fashion [1] in order to make participation easier (each target word gets a file with all its occurrences, and each occurrence comes with its identifier, its sense tag if it belongs to the training data, and the list of features that apply to it). These features were available for both topics and documents, as well as for all the words with frequency above 10 in SemCor 1.6, which can be taken as the training data for supervised WSD systems. The SemCor data is publicly available at http://ixa2.si.ehu.es/semeval-clir/. For the rest of the data, participants had to sign an end user agreement.

The participants had to return the input files enriched with WordNet 1.6 sense tags in the required XML format:

1. for all the documents in the collection
2. for all the topics

Scripts to produce the desired output from the word-to-word files and the input files were provided by the organizers, as well as DTDs and software to check that the results conformed to the respective DTDs.

3 Evaluation and results

For each of the settings presented in Section 2 we present the results of the participants, as well as those of an in-house system presented by the organizers. Please refer to the system description papers for a more complete description. We also provide some baselines and alternative expansion (translation) strategies. All systems are evaluated according to their Mean Average Precision (MAP, see http://en.wikipedia.org/wiki/Information_retrieval) as computed by the trec_eval software on the pre-existing CLEF relevance assessments.
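To make the evaluation measure concrete, the sketch below shows how MAP is computed from ranked results and relevance judgements. The official scores in this paper were produced with trec_eval, not with this code; the topic and document identifiers are invented.

```python
# Mean Average Precision (MAP) sketch; trec_eval was used for the official scores.

def average_precision(ranked_docs, relevant):
    """Average of the precision values at the ranks where relevant documents appear,
    divided by the total number of relevant documents for the topic."""
    hits, precisions = 0, []
    for rank, doc in enumerate(ranked_docs, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(runs, qrels):
    """runs: {topic_id: ranked doc list}; qrels: {topic_id: set of relevant docs}."""
    aps = [average_precision(runs[t], qrels.get(t, set())) for t in runs]
    return sum(aps) / len(aps) if aps else 0.0

runs = {"T01": ["d3", "d1", "d7"], "T02": ["d2", "d5"]}
qrels = {"T01": {"d1", "d7"}, "T02": {"d9"}}
print(round(mean_average_precision(runs, qrels), 4))  # -> 0.2917
```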
3.1 Participants

The two systems that registered sent their results on time.

PUTOP: This team extends McCarthy's predominant sense method to create an unsupervised word sense disambiguation method that uses topics automatically derived with Latent Dirichlet Allocation. Using topic-specific synset similarity measures, they create predictions for each word in each document using only word frequency information. The disambiguation process took approx. 12 hours on a cluster of 48 machines (dual Xeons with 4 GB of RAM). Note that, contrary to the specifications, this team returned WordNet 2.1 senses, so we had to map them automatically to 1.6 senses [6].

UNIBA: This team uses a knowledge-based WSD system that attempts to disambiguate all words in a text by exploiting WordNet relations. The main assumption is that a specific strategy for each Part-Of-Speech (POS) is better than a single strategy. Nouns are disambiguated basically using hypernymy links. Verbs are disambiguated according to the nouns surrounding them, and adjectives and adverbs use glosses.

ORGANIZERS: In addition to the regular participants, and out of the competition, the organizers ran a regular supervised WSD system trained on SemCor. The system is based on a single k-NN classifier using the features described in [1] and made available at the task website (cf. Section 2.4). A minimal illustrative sketch of such a k-NN classifier is given at the end of this subsection.

In addition to those, we also present some common IR/CLIR baselines, baseline WSD systems, and an alternative expansion:

noexp: a non-expansion IR/CLIR baseline of the documents or topics.

fullexp: a full-expansion IR/CLIR baseline of the documents or topics.

wsdrand: a WSD baseline system which chooses a sense at random. The usual expansion is applied.

1st: a WSD baseline system which returns the sense numbered as 1 in WordNet. The usual expansion is applied.

wsd50: the organizers' WSD system, where the top 50% of the senses in the ranking produced by the WSD system are expanded. That is, instead of expanding the single best sense, it expands the best 50% of the senses.
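The sketch below shows the general shape of a supervised k-NN word sense classifier of the kind the organizers describe. It is not their system: the features of [1] are replaced here by a simple bag of context words, and the overlap similarity is only a stand-in.

```python
# Illustrative k-NN word sense classifier sketch (not the organizers' system).
from collections import Counter

def features(context_tokens):
    """Bag-of-words features for one occurrence of the target word."""
    return Counter(t.lower() for t in context_tokens)

def similarity(f1, f2):
    """Number of shared feature occurrences (a crude overlap measure)."""
    return sum((f1 & f2).values())

def knn_sense(train, context_tokens, k=5):
    """train: list of (context_tokens, sense) pairs for one target word (e.g. from SemCor).
    Returns the sense most frequent among the k most similar training examples."""
    occ = features(context_tokens)
    ranked = sorted(train, key=lambda ex: similarity(features(ex[0]), occ), reverse=True)
    votes = Counter(sense for _, sense in ranked[:k])
    return votes.most_common(1)[0][0] if votes else None

train = [
    (["deposit", "money", "account"], "bank#financial"),
    (["loan", "interest", "money"], "bank#financial"),
    (["river", "water", "shore"], "bank#river"),
]
print(knn_sense(train, ["borrow", "money", "interest"], k=2))  # -> 'bank#financial'
```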
3.2 IR Results

This section presents the results obtained by the participants and the baselines in the two IR settings. The second and third columns of Table 1 present the results when disambiguating the topics and the documents respectively. None of the expansion techniques improves over the baseline (no expansion). Note that due to a limitation of the search engine, long queries were truncated at 50 words, which might explain the very low results of the full expansion.

                 IRtops   IRdocs   CLIR
no expansion     0.3599   0.3599   0.1446
full expansion   0.1610   0.1410   0.2676
UNIBA            0.3030   0.1521   0.1373
PUTOP            0.3036   0.1482   0.1734
wsdrand          0.2673   0.1482   0.2617
1st sense        0.2862   0.1172   0.2637
ORGANIZERS       0.2886   0.1587   0.2664
wsd50            0.2651   0.1479   0.2640

Table 1: Retrieval results given as MAP. IRtops stands for English IR with topic expansion, IRdocs for English IR with document expansion, and CLIR for CLIR results on translated documents.

3.3 CLIR results

The last column of Table 1 shows the CLIR results when expanding (translating) the disambiguated documents. None of the WSD systems attains the performance of full expansion, which would be the baseline CLIR system, but the WSD system of the organizers gets close.

3.4 WSD results

In addition to the IR and CLIR results, we also provide the WSD performance of the participants on the Senseval-2 and Senseval-3 all-words tasks. The documents from those tasks were included alongside the CLEF documents, in the same formats, so they were treated as any other document. In order to evaluate, we had to map all WSD results automatically to the respective WordNet version (using the mappings in [6], which are publicly available). The results are presented in Table 2, where we can see that the best results are attained by the organizers' WSD system.

Senseval-2 all-words    precision   recall   coverage
ORGANIZERS              0.584       0.577    93.61%
UNIBA                   0.498       0.375    75.39%
PUTOP                   0.388       0.240    61.92%

Senseval-3 all-words    precision   recall   coverage
ORGANIZERS              0.591       0.566    95.76%
UNIBA                   0.484       0.338    69.98%
PUTOP                   0.334       0.186    55.68%

Table 2: English WSD results on the Senseval-2 and Senseval-3 all-words datasets.

3.5 Discussion

First of all, we would like to mention that the WSD and expansion strategy, which is very simplistic, degrades the IR performance. This was rather expected, as the IR experiments had an illustrative goal, and are used for comparison with the CLIR experiments. In monolingual IR, expanding the topics is much less harmful than expanding the documents. Unfortunately, the limitation to 50 words in the queries might have limited the expansion of the topics, which makes the results rather unreliable. We plan to fix this in future evaluations.

Regarding the CLIR results, even if none of the WSD systems was able to beat the full-expansion baseline, the organizers' system was very close, which is quite encouraging given the very simplistic expansion, indexing and retrieval strategies used.

In order to better interpret the results, Table 3 shows the number of words after the expansion in each case. This data is very important in order to understand the behavior of each of the systems. Note that UNIBA returns 3 synsets at most, and therefore the wsd50 strategy (select the 50% of the senses with the best scores) leaves a single synset, which is the same as taking the single best sense (wsdbest). Regarding PUTOP, this system returned a single synset, and therefore the wsd50 figures are the same as the wsdbest figures.

                            English       Spanish
No WSD        noexp         9,900,818     9,900,818
              fullexp       93,551,450    58,491,767
UNIBA         wsdbest       19,436,374    17,226,104
              wsd50         19,436,374    17,226,104
PUTOP         wsdbest       20,101,627    16,591,485
              wsd50         20,101,627    16,591,485
Baseline WSD  1st           24,842,800    20,261,081
              wsdrand       24,904,717    19,137,981
ORG.          wsdbest       26,403,913    21,086,649
              wsd50         36,128,121    27,528,723

Table 3: Number of words in the document collection after expansion for the WSD systems and all baselines. wsdbest stands for the expansion strategy used with the participants.

Comparing the number of words for the two participant systems, we see that UNIBA has the fewest words, closely followed by PUTOP. The organizers' WSD system yields far more expanded words. The explanation is that when the synsets returned by a WSD system all have 0 weights, the wsdbest expansion strategy expands them all. This was not explicit in the rules for participation, and might have affected the results.

A cross analysis of the result tables and the number of words is interesting. For instance, in the IR exercise, when we expand documents, the results in the third column of Table 1 show that the ranking for the non-informed baselines is the following: best for no expansion, second for random WSD, and third for full expansion. These results can be explained by the amount of expansion: the more expansion, the worse the results. When more informed WSD is performed, documents with more expansion can get better results, and in fact the WSD system of the organizers obtains the second best result among all systems and baselines, while having more words than the rest (with the exception of wsd50 and full expansion). Still, the no-expansion baseline remains far ahead of the WSD results.

Regarding the CLIR results, the situation is inverted, with the best results for the most productive expansions (full expansion, random WSD and no expansion, in this order). For the more informed WSD methods, the best results are again for the organizers' WSD system, which is very close to the full-expansion baseline. Even if wsd50 has more expanded words, wsdbest is more effective. Note the very high results attained by the random baseline. These high results can be explained by the fact that many senses get the same translation, and thus for many words with few translations the random translation might be valid. Still, the wsdbest, 1st sense and wsd50 strategies get better results.

4 Conclusions and future work

This paper presents the results of a preliminary attempt at an application-driven evaluation exercise of WSD in CLIR. The expansion, indexing and retrieval strategies proved too simplistic, and neither of the two participant systems nor the organizers' system was able to beat the full-expansion baseline. For efficiency reasons, the Irion system had some of its features turned off. Still, the results are encouraging, as the organizers' system was able to get very close to the full-expansion strategy with much less expansion (translation).

All the resources built will be made publicly available for further experimentation. We plan to propose a special track at CLEF-2008 where the participants will build on these resources (especially the WSD-tagged corpora) in order to use more sophisticated CLIR techniques. We also plan to extend the WSD annotation to all words in the CLEF English document collection, and to contact the best performing systems of the SemEval all-words tasks in order to obtain better quality annotations.

Acknowledgements

We wish to thank CLEF for allowing us to use their data, and the CLEF coordinator, Carol Peters, for her help and collaboration. This work has been partially funded by the Spanish Education Ministry (project KNOW).

References
[1] E. Agirre, O. Lopez de Lacalle, and D. Martinez. Exploring feature set combinations for WSD. In Proc. of the SEPLN, 2006.
[2] J. Atserias, L. Villarejo, G. Rigau, E. Agirre, J. Carroll, B. Magnini, and P. Vossen. The MEANING Multilingual Central Repository. In Proceedings of the 2nd Global WordNet Conference, GWC 2004, pages 23–30. Masaryk University, Brno, Czech Republic, 2004.
[3] M. Carpuat and D. Wu. Improving Statistical Machine Translation using Word Sense Disambiguation. In Proc. of EMNLP-CoNLL, Prague, 2007.
[4] Y. S. Chan, H. T. Ng, and D. Chiang. Word Sense Disambiguation Improves Statistical Machine Translation. In Proc. of ACL, Prague, 2007.
[5] P. Clough and M. Stevenson. Cross-language information retrieval using EuroWordNet and word sense disambiguation. In Proc. of ECIR, Sunderland, 2004.
[6] J. Daude, L. Padro, and G. Rigau. Mapping WordNets Using Structural Information. In Proc. of ACL, Hong Kong, 2000.
[7] J. Gonzalo, A. Penas, and F. Verdejo. Lexical ambiguity and information retrieval revisited. In Proc. of EMNLP, Maryland, 1999.
[8] D. Harman. Beyond English. In E. M. Voorhees and D. Harman, editors, TREC: Experiment and Evaluation in Information Retrieval, pages 153–181. MIT Press, 2005.
[9] D. Hiemstra and W. Kraaij. Twenty-One in ad-hoc and CLIR. In E. M. Voorhees and D. K. Harman, editors, Proc. of TREC-7, pages 500–540. NIST Special Publication, 1998.
[10] B. Krovetz. Homonymy and polysemy in information retrieval. In Proc. of EACL, pages 72–79, Madrid, 1997.
[11] B. Krovetz. On the importance of word sense disambiguation for information retrieval. In Proc. of the LREC Workshop on Creating and Using Semantics for Information Retrieval and Filtering, Las Palmas, 2002.
[12] M. Pasca and S. Harabagiu. High performance question answering. In Proc. of ACM SIGIR, New Orleans, 2001.
[13] P. Resnik. Word sense disambiguation in NLP applications. In E. Agirre and P. Edmonds, editors, Word Sense Disambiguation: Algorithms and Applications. Springer, 2006.
[14] E. M. Voorhees. Natural language processing and information retrieval. In M. T. Pazienza, editor, Information Extraction: Towards Scalable, Adaptable Systems. Springer-Verlag, 1999.
[15] P. Vossen, G. Rigau, I. Alegria, E. Agirre, D. Farwell, and M. Fuentes. Meaningful results for Information Retrieval in the MEANING project. In Proc. of the 3rd Global Wordnet Conference, pages 22–26, 2006.