ISM@FIRE-2015: Mixed Script Information Retrieval

Dinesh Kumar Prabhakar
Indian School of Mines, Dhanbad, Jharkhand, India 826004
dinesh.nitr@gmail.com

Sukomal Pal
Indian School of Mines, Dhanbad, Jharkhand, India 826004
sukomalpal@gmail.com

ABSTRACT
This paper describes our approaches at FIRE-2015 for identifying the languages of terms written in Roman script and for retrieval in a mixed script setting. The first approach identifies the class of each given term: its native language, or whether it is a named entity or a term of some other type. MaxEnt, a supervised classifier, was used for this classification; it performed best on strict f-measure for NE (0.46) and for NE_P (0.24). For the MSIR subtask, an approach based on Divergence from Randomness (DFR) was used, and it performed better with block indexing and query formulation. The overall NDCG@10 scores of our submissions are 0.4335, 0.5328, 0.4489 and 0.5369 for ISMD1, ISMD2, ISMD3 and ISMD4 respectively.

Keywords
Word classification, Transliteration, Information Retrieval

1. INTRODUCTION
With the growth of Web 2.0, the number of users on social sites keeps increasing. Users write messages (especially blog posts) on sites such as Twitter and Facebook in their own languages, often using Roman script in transliterated form. These posts may contain non-English terms (terms from the user's native language), plain English words, mixed-language terms (like gr8, 2moro) or named entities (NE). Identifying such categories plays a significant role in Natural Language Processing (NLP), and is useful not only in NLP but also in other sub-domains of linguistic processing and in Information Retrieval (IR).

Blog posts contain important information that opens up the scope of IR over informal text (posts and messages). Raw blog data often contain erroneous text, so before applying any IR steps the data must be preprocessed using linguistic processing techniques. There are huge collections of data on and off the Web serving various information needs; this track concerns ad-hoc retrieval. The retrieval collection contains documents written in two scripts: Roman (transliterated Hindi terms in Roman script) and Devanagari. In the corpus, some documents are in Devanagari, some are in Roman, and the rest carry the same information in mixed form (transliterated and native script one after another). To maximize the number of relevant documents retrieved from the Web (in web retrieval) or from the corpus (in ad-hoc retrieval), it is necessary to retrieve documents in other languages and/or scripts. It is useful here to distinguish three terms: monolingual, multilingual and mixed script retrieval. In IR, monolingual means the query and the documents to be retrieved are in a single language, whereas in multilingual retrieval the query and documents may be written in different languages. Mixed script retrieval differs slightly from monolingual retrieval: the system should retrieve relevant documents of the same language written in more than one script. In the FIRE-2015 Mixed Script Information Retrieval track, participants had to design systems for term classification and for retrieval of relevant documents written in Devanagari script and in Roman script. We used query expansion to reformulate the seed (information need) to address the mixed script retrieval issues.

The rest of the paper is organized as follows. Section 2 describes the tasks. Section 3 reviews related work, and Section 4 describes our approaches for annotation and MSIR. Section 5 discusses results and analyses errors. Section 6 concludes and gives directions for future work.

2. TASK DESCRIPTION
The track, Shared Task on Mixed Script Information Retrieval (MSIR), has three subtasks: Query Word Labeling, Mixed Script Ad-hoc Retrieval and Mixed Script Question Answering. We participated in the first two subtasks.

Query Word Labeling
Input: Let Q be a query containing n query words wi (1 ≤ i ≤ n) written in Roman script. A word wi ∈ Q (w1, w2, ..., wn) could be a standard English (en) word or a word transliterated from another language L = {Bengali (bn), Gujarati (gu), Hindi (hi), Kannada (ka), Malayalam (ml), Marathi (mr), Tamil (ta), Telugu (te)}, and some words are Named Entities (NE). The task is to label each word as en or as a member of L, depending on whether it is an English word or a transliterated L-language word. Input and expected output for an utterance are given below as an example.

Input:
hesitate in to giving is @aapleaks #aapsweep revenge should this statehood take way bjp not the #aapstorm best

Output: for each word wi, a corresponding label is produced.
en en en en en X X en en en en en en NE en en X en
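To make the expected labeling output concrete, the small sketch below pairs the tokens of the example utterance with their labels and emits one word-label pair per line. It is only an illustration of the idea; the tab-separated layout is our own assumption and not part of the task specification.

    # Illustrative only: pair the example utterance with its labels
    # and emit one "word<TAB>label" line per token (layout assumed).
    utterance = ("hesitate in to giving is @aapleaks #aapsweep revenge "
                 "should this statehood take way bjp not the #aapstorm best")
    labels = "en en en en en X X en en en en en en NE en en X en".split()

    tokens = utterance.split()
    assert len(tokens) == len(labels)   # labels must align with tokens
    for token, label in zip(tokens, labels):
        print(f"{token}\t{label}")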
Mixed Script Ad-hoc Retrieval
There are more than 66K documents and 25 queries (seeds). Documents are written in Devanagari script, Roman script or mixed script; here mixed script means a document has the same content in two scripts, one after another. Out of the 25 queries, seven are in Devanagari and the rest are in Roman script. Given a query, the goal is to produce a ranked set of relevant documents, with the most relevant document at the first position.

3. RELATED WORK
Subtask-1 is accomplished in two phases: word labeling, and transliteration of the Hindi-labeled words into their native (Devanagari) script.

3.1 Query Language Labeling
Labeling is concerned with the classification of a given word written in Roman script. Query words wi can be classified and annotated with the corresponding classes manually or using machine-learning-based classifiers. Various classifiers are available, such as Support Vector Machines (SVM), Bayesian networks, Decision Trees, Naive Bayes, MaxEnt and Neural Networks.

King and Abney addressed labeling the languages of words in cross-lingual documents [3]. They approached the problem in a weakly supervised fashion, as a sequence labeling problem with monolingual text samples as training data. Prabhakar and Pal made a similar attempt using a supervised learning algorithm [6].

3.2 Mixed Script Ad-hoc Retrieval
This subtask was introduced in FIRE-2013 [7] and continued in FIRE-2014 with more challenges (joined terms needing expansion) [1] and in FIRE-2015 (queries in Devanagari or Roman text, along with the previous challenges). Gupta et al. (2014) approached MSIR using 2-gram tf-idf and deep-learning-based query expansion [2]. Spelling variation in transliterated terms, together with mixed script text, is the major challenge of MSIR. A transliteration of a term can be extracted from parallel or comparable corpora (extraction approach), whereas in the generation approach a transliteration is generated using phoneme-, grapheme- or syllable-based rules.

4. APPROACHES
Our approaches to Subtask-1 and Subtask-2 are described in the subsections below.

4.1 Query Word Labeling
We treat word labeling as a classification problem: assigning tags to the given terms wi. Terms can be classified either manually or using a classifier; manual classification and tagging is not feasible on a large dataset. MaxEnt, a supervised classifier, is used for classification and labeling of the words in utterances, and Stanford's MaxEnt implementation is used for this purpose [4]. For classification, the model was trained on the development data, and the terms from the test-dataset utterances were then classified based on the features extracted during training.

4.1.1 Training
For training, the input terms and annotations are tokenized and aligned with the proper tags.

Features used.
Feature parameters were mostly left at their default values; those we set are listed below:

• useNGrams accepts a boolean value (true or false) to make features from letter n-grams; we set it to true.

• usePrefixSuffixNGrams makes features from prefix and suffix substrings of the string and accepts a boolean value; we set it to true.

• maxNGramLeng takes an integer value; n-grams longer than this are not used in the model. A maximum length of 4 was used.

• minNGramLeng also takes an integer; n-grams shorter than this are not used in the model. It must be a positive integer, and we set it to 1.

• sigma is a parameter of several of the smoothing methods, usually giving the degree of smoothing as a standard deviation. Here this value is 3.0.

• useQN accepts a boolean value; if set to true, Quasi-Newton optimization is used.

• tolerance is the convergence tolerance in parameter optimization, set to 1e-4.

The classification model was trained with the above parameter values, and 23 classes were identified during training.
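For reference, the settings above correspond to a Stanford Classifier properties file along the lines of the sketch below. This is a minimal illustration assuming the ColumnDataClassifier front end with the word string in column 1; the column prefix, the file names and the trainFile/testFile entries are our assumptions rather than the exact configuration used.

    # msir-labeling.prop (illustrative sketch, not the exact file used)
    trainFile=dev-train.tsv
    testFile=test.tsv
    1.useNGrams=true
    1.usePrefixSuffixNGrams=true
    1.maxNGramLeng=4
    1.minNGramLeng=1
    sigma=3.0
    useQN=true
    tolerance=1e-4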
4.1.2 Classification
The given terms from the test-dataset utterances were tokenized and passed through the trained model. Test tokens are classified and annotated with tags such as hi for Hindi terms, en for English terms, and NE_P (person name) or NE_L (location name) for proper names.

4.2 Mixed Script Information Retrieval
Subtask-2 has queries for retrieving documents related to Hindi song lyrics, astrological data and movie reviews. The proposed approach consists of three modules: document indexing, query formulation and document retrieval.

4.2.1 Document Indexing
A simple bag-of-words approach may retrieve noisy documents for lyrics retrieval, because in lyrics consecutive terms are important: a change in word position changes the context of a song. Hence, in addition to simple indexing, we chose block indexing with a block size of two words. Two approaches, simple indexing (bag-of-words) and block indexing (phrase retrieval), were used to index the collection with block sizes of one word and two words respectively.
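The difference between the two index types can be illustrated with the sketch below, which emulates, outside of any retrieval engine, the units that a bag-of-words index and a two-word block index would record for a lyrics line. It is only meant to show why consecutive-term blocks preserve word order; the function names are ours, and the actual indexing in our runs was done inside the retrieval toolkit, not with this code.

    # Illustrative emulation of the two indexing units (not the toolkit's own code).
    def bag_of_words(text):
        """Simple indexing: each single word is an indexing unit."""
        return text.split()

    def word_blocks(text, size=2):
        """Block indexing: blocks of `size` consecutive words as units."""
        tokens = text.split()
        return [" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]

    line = "tujo nahi lyrics"
    print(bag_of_words(line))         # ['tujo', 'nahi', 'lyrics']
    print(word_blocks(line, size=2))  # ['tujo nahi', 'nahi lyrics']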
4.2.2 Query Formulation (expansion)
As the documents in the corpus are in mixed script, the seed query alone cannot give good retrieval results. Hence the query must be reformulated to enhance the performance of the system. In query formulation, the script of the query is identified and a transliteration is then obtained using the Google transliteration API. For many terms the API returns more than one transliteration; for such terms the first one is chosen. For the runs ISMD2 and ISMD4 we used the formulated mixed script queries shown in Table 1.

Table 1: Query formulation table

Query Type             Query
Original Query         tujo nahi lyrics
Transliterated Query   तुजो नहीं लिरिक्स
Formulated Query       tujo nahi lyrics तुजो नहीं लिरिक्स
Original Query         सूर्य रेखा कर्क राशि
Transliterated Query   suyra rekha kark rashi
Formulated Query       सूर्य रेखा कर्क राशि suyra rekha kark rashi

4.2.3 Document Retrieval
The Poisson model with Laplace after-effect and normalization 2 (PL2) from the Divergence From Randomness (DFR) framework has been used to measure the similarity score between a document d and a query Q [5]. For the implementation we used Terrier 4.0.

Score(d, Q) = Σ_{t ∈ Q} qtfn(t) · w(t, d)    (1)

qtfn(t) = qtf / qtf_max    (2)

where w(t, d) is the weight of document d for query term t, qtfn is the normalized frequency of term t in the query, qtf is the original frequency of term t in the query, and qtf_max is the maximum qtf over all terms of the query; for details see [5].
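A minimal sketch of the query scoring in Equations (1) and (2) is given below. The term weight w(t, d) is taken as an input function here because it comes from the weighting model inside the retrieval engine (PL2 in our runs); the function and variable names are ours and only illustrate how the normalized query term frequencies combine with the document weights.

    from collections import Counter

    def score(query_terms, doc_id, w):
        """Score(d, Q) = sum over t in Q of qtfn(t) * w(t, d),
        with qtfn(t) = qtf(t) / qtf_max (Equations 1 and 2)."""
        qtf = Counter(query_terms)       # raw frequency of each query term
        qtf_max = max(qtf.values())      # maximum qtf over the query terms
        return sum((qtf[t] / qtf_max) * w(t, doc_id) for t in qtf)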
5. RESULTS AND ANALYSIS
Our approaches were evaluated on the provided test data for query word labeling and MSIR. In both subtasks our approaches performed moderately.

5.1 Subtask-1
The MaxEnt-based classifier performed moderately, as shown in Table 2. On some measures our approach performed well, with scores of 0.46 for strict f-measure NE and 0.24 for strict f-measure NE_P. On some metrics we performed moderately, and on others poorly. Some terms were misclassified, e.g.

Input utterance:
ei path jodi na shesh hoy lyrics

Annotated utterance:
bn hi bn bn bn bn en

The token 'path' in this utterance should have been labeled as a Bengali term; it carries the same meaning in Hindi and in English as well, but it was misclassified as Hindi due to ambiguity, since the same term exists in Hindi. Moreover, 'path' sounds more like 'poth' in Bengali due to regional accent.

Table 2: Query word labeling score

Metric                 ISM_Score     Aggregate_Mean  Aggregate_Median  Max_Score
MIXesAccuracy          12.5          5.0595          0                 25
NEsAccuracy            13.253        36.0103         35.9459           63.964
NEsCorrect             22            199.8571        199.5             355
strict f-measure NE    0.461728395   0.371410272     0.07536114        0.461728395
strict f-measure NE_L  0             0.0426          0                 0.2114
strict f-measure NE_P  0.2486        0.1086          0.1133            0.2486
strict f-measure X     0.9612        0.8989          0.9379            0.9668
strict f-measure bn    0.7113        0.7073          0.7549            0.8537
strict f-measure en    0.9052        0.8067          0.8356            0.9114
strict f-measure gu    0.1383        0.1338          0.1331            0.3484
strict f-measure hi    0.6618        0.6168          0.6413            0.8131
strict f-measure kn    0.6373        0.5752          0.6062            0.8709
strict f-measure ml    0.4871        0.4762          0.4757            0.7446
strict f-measure mr    0.5636        0.5994          0.6469            0.8308
strict f-measure ta    0.718         0.7261          0.749             0.8911
strict f-measure te    0.5439        0.4654          0.4817            0.7763
TokensAccuracy         77.0648       71.1137         75.5563           82.7152
UtterancesAccuracy     17.298        14.6645         17.1086           26.3889
Average F-measure      0.613402366   0.539559189     0.113420527       0.69174727
Weighted F-Measure     0.767831108   0.698989963     0.095876627       0.829929229

5.2 Subtask-2
We submitted four runs for Subtask-2, combining simple indexing with the original query, simple indexing with the formulated query, block indexing (block size = 2 words) with the original query, and block indexing (block size = 2 words) with the formulated query. From the scores in Table 3 we observe that the run with block indexing and formulated queries performs best; ordered from higher to lower NDCG@10, ISMD4 > ISMD2 > ISMD3 > ISMD1.

Overall, the retrieval approaches performed moderately compared to other teams. Some challenges remain unaddressed in our approaches: spelling variation in transliterated (Roman) text; joined terms (e.g. 'kabhi-kabhi' could be split into 'kabhi', 'kabhi', and 'tujo' into 'tu', 'jo'); and translation of query text (some documents contain information in another language, e.g. सूर्य रेखा कर्क राशि could be translated as "Line of Sun for Cancer"). One more challenging issue is partial transliteration and translation. For example, query number 69 is "shani dashaa today for a 20 year old", in which the first two tokens are Hindi terms. Hence either the Hindi terms must be translated into English, or the other terms must be translated into Hindi and then transliterated into Roman text.

Table 3: Subtask-2 scores

Run    Block_Size  Query_Formulation  NDCG@1  NDCG@5  NDCG@10  MAP     MRR     RECALL
ISMD1  1 word      No                 0.4133  0.4268  0.4335   0.0928  0.244   0.1361
ISMD2  1 word      Yes                0.4933  0.5277  0.5328   0.1444  0.318   0.2051
ISMD3  2 words     No                 0.3867  0.4422  0.4489   0.0954  0.2207  0.1418
ISMD4  2 words     Yes                0.4967  0.5375  0.5369   0.1507  0.3397  0.2438

6. CONCLUSIONS
Our work comprises two subtasks: annotation and retrieval. We used a learning-based classifier for word labeling; label accuracy was moderate for the submitted runs. We identified some terms that were incorrectly labeled by the classifier. This likely happened for an important reason, namely term ambiguity, where the same term exists in more than one class. For MSIR, simple indexing and block indexing were both used, separately, during document indexing. In query formulation, transliterations are extracted using the Google API. To measure the similarity score, a DFR framework is used, which performed moderately. Query expansion approaches can further address the MSIR retrieval issues. In future work we intend to address the unresolved issues mentioned above.

7. REFERENCES
[1] Choudhury, M., Chittaranjan, G., Gupta, P., and Das, A. Overview and datasets of FIRE 2014 track on transliterated search. In Pre-proceedings of the 6th workshop FIRE-2014 (2014), Forum for Information Retrieval Evaluation (FIRE).
[2] Gupta, P., Bali, K., Banchs, R. E., Choudhury, M., and Rosso, P. Query expansion for mixed-script information retrieval. In Proceedings of the 37th International ACM SIGIR Conference on Research & Development in Information Retrieval (2014), ACM, pp. 677-686.
[3] King, B., and Abney, S. P. Labeling the languages of words in mixed-language documents using weakly supervised methods. In HLT-NAACL (2013), pp. 1110-1119.
[4] Klein, D. The Stanford Classifier. http://nlp.stanford.edu/software/classifier.shtml, 2003. Online; accessed 19-02-2014.
[5] Plachouras, V., He, B., and Ounis, I. University of Glasgow at TREC 2004: Experiments in web, robust, and terabyte tracks with Terrier. In TREC (2004).
[6] Prabhakar, D. K., and Pal, S. ISM@FIRE2013 shared task on transliterated search. In FIRE '13: Proceedings of the 5th 2013 Forum on Information Retrieval Evaluation (2013), ACM New York, p. 6.
[7] Roy, R. S., Choudhury, M., Majumder, P., and Agarwal, K. Overview and datasets of FIRE 2013 track on transliterated search. In Pre-proceedings of the 5th workshop FIRE-2013 (2013), Forum for Information Retrieval Evaluation (FIRE).