=Paper=
{{Paper
|id=Vol-1173/CLEF2007wn-adhoc-BandyopadhyayEt2007
|storemode=property
|title=Bengali, Hindi and Telugu to English Ad-hoc Bilingual Task at CLEF 2007
|pdfUrl=https://ceur-ws.org/Vol-1173/CLEF2007wn-adhoc-BandyopadhyayEt2007.pdf
|volume=Vol-1173
|dblpUrl=https://dblp.org/rec/conf/clef/BandyopadhyayMNEHG07a
}}
==Bengali, Hindi and Telugu to English Ad-hoc Bilingual Task at CLEF 2007==
Sivaji Bandyopadhyay, Tapabrata Mondal, Sudip Kumar Naskar, Asif Ekbal, Rejwanul Haque, Srinivasa Rao Godavarthy
Department of Computer Science and Engineering, Jadavpur University, Kolkata-700032, INDIA
sbandyopadhyay@cse.jdvu.ac.in, sivaji_cse_ju@yahoo.com

Abstract

This paper presents the experiments carried out at Jadavpur University as part of our participation in the CLEF 2007 ad-hoc bilingual task. This is our first participation in the CLEF evaluation campaign, and we have considered Bengali, Hindi and Telugu as query languages for retrieval from the English document collection. We describe our Bengali, Hindi and Telugu to English CLIR system for the ad-hoc bilingual task, our English IR system for the ad-hoc monolingual task, and the associated experiments at CLEF. Query construction was manual for the Telugu-English ad-hoc bilingual task, while it was automatic for all other tasks.

Categories and Subject Descriptors

H.3 [Information Storage and Retrieval]: H.3.1 Content Analysis and Indexing; H.3.3 Information Search and Retrieval; H.3.4 Systems and Software; H.2.3 [Database Management]: Languages - Query Languages

General Terms

Languages, Performance, Experimentation

Keywords

Ad-hoc cross-language information retrieval, Indian languages, Bengali, Hindi, Telugu

1 Introduction

Cross-language information retrieval (CLIR) research involves the study of systems that accept queries (or information needs) in one language and return objects in a different language. These objects could be text documents, passages, images, audio or video documents. Cross-language information retrieval focuses on the cross-language issues from an information retrieval (IR) perspective rather than a machine translation perspective. Some of the key technical issues [7] for cross-language information retrieval are:

(i) How can a query term in one language L1 be expressed in another language L2?
(ii) What mechanisms determine which of the possible translations of text from L1 to L2 should be retained?
(iii) In cases where more than one translation is retained, how can the different translation alternatives be weighted?

Many different techniques have been tried in CLIR systems in the past to address these issues. At a very high level, these techniques can be broadly classified [4] into controlled vocabulary based and free text based systems. However, it is very difficult to create, maintain and scale a controlled vocabulary for CLIR systems in a general domain over a large corpus, so researchers came up with models that can be built from the full text of the corpus. Free text based systems can in turn be classified along corpus-based and knowledge-based lines. Corpus-based systems may use parallel or comparable corpora, aligned at the word, sentence or passage level, to learn models automatically. Knowledge-based systems may use bilingual dictionaries or ontologies, which form handcrafted knowledge readily available for the systems to use. Hybrid systems have also been built by combining the knowledge-based and corpus-based approaches. Apart from these approaches, extensions of monolingual IR techniques such as vector-based models and relevance modeling techniques [14] to cross-language IR have also been explored. In this work we discuss our experiments on CLIR from Indian languages to English, where the queries are in Indian languages and the documents to be retrieved are in English.
Experiments were carried out using queries in three Indian languages under the CLEF 2007 experimental setup. The three languages chosen were Bengali, Hindi and Telugu, which are predominantly spoken in eastern, northern and southern India respectively.

2 Related Work

Very little work has been done in the past in the areas of IR and CLIR involving Indian languages. In 2003 a surprise language exercise [5] was conducted at ACM TALIP (ACM Transactions on Asian Language Information Processing, http://www.acm.org/pubs/talip). The task was to build CLIR systems for English to Hindi and Cebuano, where the queries were in English and the documents were in Hindi and Cebuano. Five teams participated in this evaluation task at ACM TALIP, providing some insights into the issues involved in processing Indian language content. A few other information access systems have been built apart from this task, such as cross-language Hindi headline generation [3] and an English to Hindi question answering system [12]. The International Institute of Information Technology (IIIT), Hyderabad, built a monolingual web search engine for various Indian languages, which is capable of retrieving information from multiple character encodings [9]. In the CLEF 2006 ad-hoc document retrieval track, a Hindi and Telugu to English cross-language information retrieval task [10] was reported by IIIT, Hyderabad. Some research has previously been carried out in the area of machine translation involving Indian languages [1], [13]. Most Indian language MT efforts involve translating various Indian languages amongst themselves or translating English into Indian language content, and hence most of the available Indian language resources are largely biased towards these tasks. Recently, the Government of India has initiated a consortium project titled “Development of Cross-Lingual Information Access System”, where the query may be in any of six Indian languages (Bengali, Hindi, Marathi, Telugu, Tamil, Punjabi) and the output is also presented in the user’s own language.

3 Our Approach

The experiments carried out by us for CLEF 2007 are based on stemming, zonal indexing and a TFIDF based ranking model with bilingual dictionary look-up. There were no readily available bilingual dictionaries that could be used as databases for this work, so we had to develop bilingual dictionaries from the resources available on the Internet. Zonal indexing was applied to the English document collection after removing stop words and performing stemming. The keywords in the English document collection were indexed using the n-gram indexing methodology. The query terms were extracted from the topic files using bilingual dictionaries. The information retrieval system uses a TFIDF based ranking model. Query construction was carried out manually for the Telugu-English bilingual task due to the unavailability of a machine-readable Telugu-English bilingual dictionary.

3.1 Zonal Indexing and Query Construction

In zonal indexing, a document is divided into n zones/regions, say w1, w2, ..., wn, and a weight is associated with each zone such that the weights sum to 1. Here, we divided each document into two zones, w1 and w2. The zone w1 contains the contents of the ED, PT, DK, EI, KH, HD and AU tags, and the w2 region contains the contents of the ID and TE tags of the Los Angeles Times (LA TIMES, 2002) documents. The weights heuristically assigned to w1 and w2 were 0.3 and 0.7 respectively.
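To make the zone weighting concrete, the following is a minimal sketch: it combines per-zone relative term frequencies (computed as described in the next paragraph) with the 0.3/0.7 zone weights above. The weights are the only values taken from the paper; the stop-word list, the sample document and all function names are illustrative assumptions, not the actual implementation.

```python
# Minimal sketch of zone-weighted relative term frequency (assumptions:
# zone contents have already been extracted from the LA Times tags into
# plain text; stop-word removal and stemming are crude placeholders).

from collections import Counter

ZONE_WEIGHTS = {"w1": 0.3, "w2": 0.7}        # heuristic weights from the paper
STOP_WORDS = {"the", "a", "of", "in", "were"}  # tiny illustrative stop-word list

def content_words(text):
    """Lower-case and drop stop words; a real system would also stem here."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def relative_tf(words):
    """Relative term frequency of each word within one zone."""
    counts = Counter(words)
    total = sum(counts.values()) or 1
    return {w: c / total for w, c in counts.items()}

def zone_weighted_tf(zones):
    """Weight each zone's relative frequencies and add them per term."""
    combined = Counter()
    for zone, text in zones.items():
        for w, f in relative_tf(content_words(text)).items():
            combined[w] += ZONE_WEIGHTS[zone] * f
    return dict(combined)

if __name__ == "__main__":
    doc = {"w1": "Election results in Los Angeles",
           "w2": "The election results were announced in Los Angeles today"}
    print(zone_weighted_tf(doc))
```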
The contents of these two zones for all the documents were checked for stop words and then stemmed. The relative term frequency of a content word in a document is then calculated in each of the w1 and w2 regions as the ratio of the number of occurrences of the content word in the region to the total number of content words present in that region. The relative term frequencies of a content word in the two regions are normalized and added together to obtain the relative term frequency of that content word in the entire document. These content words, which could be multiword terms, were used as index keywords.

We have created a list of stop words for each language, i.e., English, Bengali, Hindi and Telugu. We have also prepared a list of words/terms that identifies whether the narration part provided with each topic indicates the relevance or irrelevance of the index terms with respect to the topic. This list has been prepared for each language by studying the corresponding topic files. Stop words are first eliminated from the topic files. For every n-gram identified from the topic file, all possible lower order grams, from (n-1)-grams down to unigrams, were considered as query words. For example, for the trigram “Australian Prime Minister” identified from the topic file, the following were included in the query as query expansion:

Unigrams: Australian, Prime, Minister
Bigrams: Australian Prime, Prime Minister
Trigram: Australian Prime Minister

3.2 Query Translation

The available Bengali-English dictionary (http://dsal.uchicago.edu/dictionaries/biswas-bengali) was suitably formatted for machine processing. The Hindi-English dictionary was developed from the available English-Bengali and Bengali-Hindi machine-readable dictionaries: an English-Hindi dictionary was constructed first and then converted into a Hindi-English dictionary for further use. A Telugu-English human-readable online dictionary was used for query construction from the Telugu topic files. Related work on dictionary construction can be found in [8].

The popular Porter stemming algorithm [11] has been used in this work to remove suffixes from the terms in the English topic file. Indian languages are inflectional/agglutinative in nature and thus demand good stemming algorithms. Due to the absence of good stemmers for Indian languages, the words in the Bengali, Hindi and Telugu topic files are subjected to suffix stripping using manually prepared suffix lists in the respective languages. The terms remaining after suffix removal are looked up in the corresponding Bengali / Hindi / Telugu to English bilingual dictionary. All English words/terms found in the dictionary for a word are considered; these may be synonyms or may correspond to different senses of the source language word. Many terms may not be found in the bilingual dictionary because the term is a proper name, a word from a foreign language, or a valid Indian language word that simply does not occur in the dictionary. Dictionary look-up may also fail due to errors in the stemming and/or suffix removal process. To handle dictionary look-up failures, transliteration from the Indian languages to English was attempted, assuming that such a word is most likely a proper name that would not be found in the bilingual dictionaries.
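The sketch below illustrates this look-up-then-transliterate flow: suffix stripping with a manually prepared suffix list, bilingual dictionary look-up keeping all senses with equal weight, and a transliteration fallback for out-of-vocabulary terms. The toy suffix list, toy dictionary entries and the transliterate() stub are illustrative assumptions standing in for the actual resources and for the joint source-channel engine described next.

```python
# Sketch of dictionary look-up with transliteration fallback. The suffix
# list, dictionary and transliterate() stub are illustrative assumptions,
# not the authors' actual resources.

SUFFIXES = ["era", "der", "ke", "ra", "r"]      # toy Bengali-like suffix list

BILINGUAL_DICT = {                               # toy Bengali -> English entries
    "nirbachan": ["election", "poll"],
    "phal": ["result", "fruit"],
}

def strip_suffix(word):
    """Remove the longest matching suffix from the manually prepared list."""
    for suf in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suf) and len(word) > len(suf) + 1:
            return word[: -len(suf)]
    return word

def transliterate(word):
    """Placeholder for the joint source-channel transliteration engine."""
    return word  # a real engine would map the source script to English

def translate_term(word):
    """Dictionary look-up after suffix stripping; transliterate on failure."""
    stem = strip_suffix(word)
    translations = BILINGUAL_DICT.get(stem)
    if translations:                  # keep all senses/synonyms, equally weighted
        return translations
    return [transliterate(stem)]      # assume an out-of-vocabulary proper name

if __name__ == "__main__":
    query = ["nirbachanera", "phal", "kolkata"]
    print([t for w in query for t in translate_term(w)])
```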
The transliteration engine is the modified joint source-channel model [2] based on regular expression based alignment techniques. Three bilingual training sets, namely Bengali-English, Hindi-English and Telugu-English, were developed to train the transliteration engine. The Bengali-English training set contains approximately 25,000 bilingual examples of proper names, particularly person and location names. The Hindi-English and Telugu-English training sets were developed from the Bengali-English training set and contain 5,000 bilingual training examples. The Indian language terms are thus translated or transliterated into English terms accordingly. These translated/transliterated terms are then added together to form the English language query terms as part of query expansion. This algorithm for query translation and transliteration addresses the first issue of representing a query in one language (L1) in another language (L2). The query translation process considers all the alternative translations/transliterations with equal weight.

Once the translations for the words of the Bengali, Hindi and Telugu topic files were obtained, all possible n-grams (n = 1 up to the number of query words in the title) were extracted from the title of each topic as explained in Section 3.1. Consecutive words were considered as an n-gram only if no stop word appeared in between. For the English topic file, n-grams were extracted from the title, description and narration parts. For the Bengali, Hindi and Telugu topic files, n-grams and all possible unigrams were considered for the description and narration parts of each topic.

3.3 Experiments

The evaluation document set consists of 135,917 documents from the Los Angeles Times of 2002. Among these documents, a large number contained no text element. A set of 50 topics representing the information needs was given for each of the languages Bengali, Hindi, Telugu and English. A set of human relevance judgments for these topics was generated by assessors at CLEF. These relevance judgments are binary and are decided by a human assessor after reviewing a set of pooled documents obtained using the relevant document pooling technique. The system evaluation framework is similar to the Cranfield style system evaluations and the measures are similar to those used in TREC (Text REtrieval Conference, http://trec.nist.gov) [6].

Three runs were submitted for the ad-hoc bilingual track, one for each of the three Indian languages Bengali, Hindi and Telugu. Another run was submitted for English as part of the ad-hoc monolingual task. The runs for Bengali, Hindi and English used the title, description and narration parts of the topic files; only the title and description parts were used for the bilingual Telugu-English run.

3.4 CLEF 2007 Evaluation for the Bengali-English, Hindi-English and Telugu-English Bilingual Ad-hoc Tasks and the English Monolingual Ad-hoc Task

The run statistics for the four runs submitted to CLEF 2007 are given in Table 1. Clearly, the geometric average precision (GAP) metric and its difference from the mean average precision (MAP) metric suggest a lack of robustness in our system. Certain topics performed very well across the language pairs as well as for English, but many topics performed very poorly.
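As a quick illustration of why a large gap between MAP and GAP signals a robustness problem, the sketch below computes both metrics over a set of per-topic average precision values; the values are made up for illustration and are not our run data. A few near-zero topics pull the geometric mean down far more than the arithmetic mean.

```python
# Mean average precision (MAP) vs geometric average precision (GAP) over
# per-topic average precision (AP) scores. The AP values are illustrative.

import math

def mean_average_precision(ap_scores):
    return sum(ap_scores) / len(ap_scores)

def geometric_average_precision(ap_scores, eps=1e-5):
    # A small epsilon keeps zero-AP topics from driving the geometric mean to 0.
    return math.exp(sum(math.log(max(ap, eps)) for ap in ap_scores) / len(ap_scores))

if __name__ == "__main__":
    per_topic_ap = [0.45, 0.40, 0.30, 0.02, 0.01, 0.005]   # a few very poor topics
    print("MAP:", round(mean_average_precision(per_topic_ap), 4))   # ~0.20
    print("GAP:", round(geometric_average_precision(per_topic_ap), 4))  # ~0.06
```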
The values of the evaluation metrics in Table 1 show that our system performs best for the monolingual English task. Among the bilingual ad-hoc tasks, the system performs best for Telugu, followed by Hindi and Bengali. The key to the higher values of the evaluation metrics in the Telugu-English bilingual run compared to the other two bilingual runs (Hindi-English and Bengali-English) may be the manual query construction carried out for Telugu. It is also evident, however, that the automatic Hindi-English and Bengali-English runs achieved performance comparable to the manual Telugu-English run. The overall relatively low performance of the system, particularly with Indian language queries, indicates that simple techniques such as dictionary look-up with minimal lemmatization (e.g., suffix removal) may not be sufficient for CLIR with morphologically rich Indian languages. The relatively low performance for Bengali and Hindi suggests that broader dictionary coverage and a good morphological analyzer are essential for Bengali/Hindi CLIR to achieve reasonable performance.

Run | MAP | R-Prec | GAP | B-Pref
Bengali (Title + Description + Narration) | 10.18% | 12.48% | 2.81% | 12.72%
Hindi (Title + Description + Narration) | 10.86% | 13.70% | 2.78% | 13.43%
Telugu (Title + Description) | 11.28% | 13.92% | 2.76% | 12.95%
English (Title + Description + Narration) | 12.32% | 14.40% | 4.63% | 13.68%

Table 1: Run statistics

The mean precision versus retrieved documents graphs are shown in Figures 1(a)-(d) for the Bengali-English, Hindi-English, Telugu-English and monolingual English tasks. The interpolated precision versus standard recall graphs are shown in Figures 2(a)-(d) for the four runs. Figures 2(a)-(d) suggest that ranking did not have much effect in the system: the slope of the curve is fairly uniform across recall levels in all the runs, as opposed to a steep drop over the first few recall points. A good ranking algorithm would consistently push relevant documents to the top ranks, resulting in a steep slope of the curve over the first few recall points.

4 Conclusion and Future Work

Our experiments suggest that simple TFIDF based ranking algorithms may not result in effective CLIR systems for Indian language queries. Additional information from corpora, used for source language query expansion, target language query expansion or both, could help. Machine-readable bilingual dictionaries with wider coverage would have improved the results. An aligned bilingual parallel corpus would be an ideal resource for applying machine learning approaches. Applying word sense disambiguation methods to the translated query words would also have a positive effect on the results. A robust stemmer is required for the highly inflective Indian languages. We would like to automate the query construction task for Telugu in future.

5 References

[1] Akshar Bharati, Rajeev Sangal, Dipti M Sharma and Amba P Kulkarni. Machine Translation Activities in India: A Survey. In Proceedings of the Workshop on Survey on Research and Development of Machine Translation in Asian Countries, 2002.
[2] Asif Ekbal, Sudip Naskar and Sivaji Bandyopadhyay. A Modified Joint Source-Channel Model for Transliteration. In Proceedings of the COLING/ACL, 191-198, Sydney, Australia, 2006.
[3] Bonnie Dorr, David Zajic and Richard Schwartz. Cross-language Headline Generation for Hindi.
ACM Transactions on Asian Language Information Processing (TALIP), 2(3): 270-289, 2003.
[4] Douglas W. Oard. Alternative Approaches for Cross Language Text Retrieval. In AAAI Symposium on Cross Language Text and Speech Retrieval, USA, 1997.
[5] Douglas W. Oard. The Surprise Language Exercises. ACM Transactions on Asian Language Information Processing (TALIP), 2(2): 79-84, 2003.
[6] Ellen M. Voorhees and Donna Harman. Overview of the Sixth Text Retrieval Conference (TREC-6). In Proceedings of the Sixth Text Retrieval Conference, 241-273, Morristown, NJ, USA, 1996.
[7] Gregory Grefenstette (Ed.). Cross-Language Information Retrieval. Kluwer Academic Publishers, Norwell, MA, USA, 1998.
[8] James Mayfield and Paul McNamee. Converting On-line Bilingual Dictionaries from Human-readable to Machine-readable Form. In Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 405-406, New York, NY, USA, 2002. ACM Press.
[9] Prasad Pingali, Jagadeesh Jagarlamudi and Vasudeva Varma. WebKhoj: Indian Language IR from Multiple Character Encodings. In Proceedings of the 15th International Conference on World Wide Web, 801-809, New York, NY, USA, 2006. ACM Press.
[10] Prasad Pingali and Vasudeva Varma. Hindi and Telugu to English Cross Language Information Retrieval at CLEF 2006. In Working Notes for the CLEF 2006 Workshop (Cross Language Ad-hoc Task), 20-22 September 2006, Alicante, Spain.
[11] M. F. Porter. An Algorithm for Suffix Stripping. Program, 14(3): 130-137, 1980.
[12] Satoshi Sekine and Ralph Grishman. Hindi-English Cross-Lingual Question-Answering System. ACM Transactions on Asian Language Information Processing (TALIP), 2(3): 181-192, 2003.
[13] Sudip Naskar and Sivaji Bandyopadhyay. Use of Machine Translation in India: Current Status. In Proceedings of MT Summit X, 465-470, Phuket, Thailand, 2005.
[14] Victor Lavrenko, Martin Choquette and W. Bruce Croft. Cross-Lingual Relevance Models. In Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 175-182, New York, NY, USA, 2002. ACM Press.

Figure 1: Precision vs. retrieved documents (logarithmic scale) for the (a) Bengali-English, (b) Hindi-English, (c) Telugu-English and (d) monolingual English runs.

Figure 2: Mean interpolated precision vs. standard recall levels for the (a) Bengali-English, (b) Hindi-English, (c) Telugu-English and (d) monolingual English runs.