=Paper=
{{Paper
|id=Vol-1587/T2-9
|storemode=property
|title=DA-IICT in FIRE 2015 Shared Task on Mixed Script Information Retrieval
|pdfUrl=https://ceur-ws.org/Vol-1587/T2-9.pdf
|volume=Vol-1587
|authors=Devanshu Jain
|dblpUrl=https://dblp.org/rec/conf/fire/Jain15
}}
==DA-IICT in FIRE 2015 Shared Task on Mixed Script Information Retrieval==
Devanshu Jain
Dhirubhai Ambani Institute of Information and Communication Technology
Gandhinagar, Gujarat, India
devanshu.jain919@gmail.com

ABSTRACT
This paper describes the methodology followed by Team Watchdogs in their submission for the shared task on Mixed Script Information Retrieval (MSIR) at FIRE 2015. I participated in Subtask 1 (Query Word Labelling) and Subtask 2 (Mixed-Script Ad hoc Retrieval). For Subtask 1, a machine learning approach using a CRF classifier was used to label each token with one of the possible languages, based on n-gram and word2vec features. The method achieved a weighted F-measure of 0.805. For Subtask 2, the DFR similarity measure was used on documents and queries back-transliterated to Hindi, with vowel signs replaced by actual vowels. The technique achieved an NDCG@10 score of 0.7160.

Categories and Subject Descriptors
H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing; H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval

Keywords
Information Retrieval, Mixed-Script Data, Natural Language Processing

1. INTRODUCTION
With the Internet becoming increasingly accessible, a linguistically diverse population has come online. It has been observed that this non-English-speaking population often writes its own language in the Roman script ('transliteration') to generate web content such as tweets and blogs. Moreover, these users switch back and forth between languages mid-sentence, a behaviour termed 'code switching'. This shared task aims to develop methods to retrieve content across scripts.

Subtask 1 (Query Word Labelling) aims to detect the language of each token in a code-switched sentence. In addition to language detection, the subtask also requires detecting named entities (people, organisations, etc.), punctuation and mixed words (i.e. words that belong to more than one language). The dataset, provided by the organisers, consisted of a list of annotated tweets. The distribution of the labels in the dataset is given in Table 1.

Table 1: Frequency of label tags in the training data
Label  Frequency  Description of tag
NE     2203       Named entities
MIX    148        Mix of 2 languages
hi     4453       Hindi
en     17938      English
kn     1623       Kannada
ta     3153       Tamil
te     6475       Telugu
gu     890        Gujarati
mr     1960       Marathi
bn     3545       Bengali
ml     1160       Malayalam
O      8          Words of a foreign language
X      7436       Punctuation, numbers, emoticons, etc.

Subtask 2 (Mixed-Script Ad hoc Retrieval) aims to retrieve the documents containing information relevant to a given query. The caveat is that the query as well as the documents can be in Hindi, English or both, so retrieval needs to be done across scripts. The toy dataset provided for the experiment consisted of 229 documents and 5 queries.

Sections 2 and 3 describe in detail the methodology followed for Subtasks 1 and 2, respectively, along with the tools used to tackle them. Section 4 presents the results achieved by these methods.

2. SUBTASK 1: QUERY WORD LABELLING
2.1 Methodology
Before training, the following pre-processing was done on the data. The MIX tokens (i.e. tokens derived from two languages) were not labelled in a consistent manner: for example, some words were labelled MIX_hi-en and others MIX_en-hi. Such instances were relabelled consistently.

The problem was identified as a sequence-tagging problem, and a CRF [2] was used to tackle it. Two separate CRF models were trained for this subtask: one to identify the language and another to identify named entities.
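As an illustration of this two-model setup, the following is a minimal sketch using sklearn-crfsuite, a Python wrapper around the CRFsuite tool listed in Section 2.2. The wrapper, the toy data and the stand-in feature function are assumptions for illustration, not the paper's actual pipeline.

```python
# Minimal sketch of the two-model CRF setup, assuming sklearn-crfsuite
# (a Python wrapper around CRFsuite). Toy data and features are
# illustrative only; the real feature set is described in Section 2.1.
import sklearn_crfsuite

def token_features(sent, i):
    # Stand-in feature function; the paper's features are listed below.
    return {"w[0]": sent[i], "prefix2": sent[i][:2], "suffix2": sent[i][-2:]}

def featurize(sent):
    return [token_features(sent, i) for i in range(len(sent))]

sents = [["admin", "ke", "mat", "maano", "ye"]]
lang_tags = [["en", "hi", "hi", "hi", "hi"]]  # language labels
ne_tags = [["NE", "O", "O", "O", "O"]]        # named-entity labels

X = [featurize(s) for s in sents]
ld_model = sklearn_crfsuite.CRF(algorithm="lbfgs").fit(X, lang_tags)
ne_model = sklearn_crfsuite.CRF(algorithm="lbfgs").fit(X, ne_tags)
print(ld_model.predict(X)[0], ne_model.predict(X)[0])
```

Training two independent taggers keeps the label spaces small and lets each model specialise; their outputs are later combined by Algorithm 1.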
For training the language identification model, the following features were used:

1. Character and word n-grams: To capture context, the individual tokens within a window of 3 on each side of the word under consideration were included as features. For example, if the sentence is

   admin ke mat maano ye birth se ab tak single h

   and the token under consideration is ye, then the features used are as in Table 2. Furthermore, the 2-, 3- and 4-character n-grams of each of those words are also included as features; for w[-1], i.e. maano, the generated features are as in Table 3.

2. Dictionary for Hindi, Bengali and Gujarati: A dataset provided by IIT Kharagpur [1], consisting of Hindi-English, Bangla-English and Gujarati-English transliteration pairs, was used to determine the language of a word written in the Roman script, as shown in Algorithm 1.

3. Word2vec tweet clustering: A feature vector was constructed for every word in the dataset using the skip-gram implementation of word2vec with negative sampling. These feature vectors were then clustered into 9 clusters using the kMeans algorithm (because there were 9 languages). Every word under consideration was assigned a cluster ID, which was used as a feature in the language detection model.

Table 2: Word n-gram features around the word ye
Feature  Value
w[-3]    ke
w[-2]    mat
w[-1]    maano
w[0]     ye
w[1]     birth
w[2]     se
w[3]     ab

Table 3: Character n-gram features for the word maano
Feature        Value
2_gram[-1][0]  ma
2_gram[-1][1]  aa
2_gram[-1][2]  an
2_gram[-1][3]  no
3_gram[-1][0]  maa
3_gram[-1][1]  aan
3_gram[-1][2]  ano
4_gram[-1][0]  maan
4_gram[-1][1]  aano

The main hypothesis is that using the word2vec cluster IDs and the dictionary mentions as features should improve the system's performance.
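To make these features concrete, the sketch below assembles the word-window features of Table 2, the character n-grams of Table 3 and a word2vec cluster-ID feature. It uses gensim and scikit-learn in place of the Java tools listed in Section 2.2, and the toy corpus is an assumption.

```python
# Illustrative feature extraction: word-window features (Table 2),
# character n-grams (Table 3) and a word2vec cluster-ID feature.
# gensim/scikit-learn stand in for the Java tools of Section 2.2.
from gensim.models import Word2Vec
from sklearn.cluster import KMeans

sents = [["admin", "ke", "mat", "maano", "ye", "birth",
          "se", "ab", "tak", "single", "h"]]

# Skip-gram word2vec with negative sampling, 50 iterations, size 100,
# matching the settings reported in Section 2.2.
w2v = Word2Vec(sents, vector_size=100, sg=1, negative=5,
               min_count=1, epochs=50)

# Cluster every word vector into 9 clusters (one per language).
vocab = list(w2v.wv.index_to_key)
km = KMeans(n_clusters=9, n_init=10).fit([w2v.wv[w] for w in vocab])
cluster_of = dict(zip(vocab, km.labels_))

def char_ngrams(word, n):
    return [word[i:i + n] for i in range(len(word) - n + 1)]

def features(sent, i):
    feats = {}
    # Word-window features within 3 tokens on each side (Table 2).
    for off in range(-3, 4):
        j = i + off
        if 0 <= j < len(sent):
            feats[f"w[{off}]"] = sent[j]
            # 2-, 3- and 4-character n-grams of each window word (Table 3).
            for n in (2, 3, 4):
                for k, g in enumerate(char_ngrams(sent[j], n)):
                    feats[f"{n}_gram[{off}][{k}]"] = g
    # Word2vec cluster ID of the current token.
    feats["cluster"] = str(cluster_of.get(sent[i], -1))
    return feats

print(features(sents[0], 4))  # features around the token "ye"
```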
For training the named entity recognition model, the following additional features were included, apart from those mentioned above:

1. isFullCapitalised (Boolean): whether the whole word is capitalised.
2. isFirstCapitalised (Boolean): whether the first letter of the word is capitalised.
3. numCapitalised (Integer): the number of capital letters in the word.
4. isDot (Boolean): whether the dot (.) character is present in the word.
5. numDot (Integer): the number of dot characters in the word.
6. isDigit (Boolean): whether a digit is present in the word.
7. numDigit (Integer): the number of digits in the word.
8. isSpecialChar (Boolean): whether any special character, such as ( or -, is present in the word.
9. numSpecialChar (Integer): the number of special characters in the word.

Capitalisation is often used for mentions of important named entities. The dot character (.) is often used with abbreviations, which in most cases refer to named entities. Digits and special characters are helpful in detecting punctuation tokens.

The procedure for labelling a token is explained in Algorithm 1.

Algorithm 1: Algorithm for labelling
 1: procedure label(token, ld-model, ne-model)
 2:   ld-tag = getLabel(ld-model, token)
 3:   ne-tag = getLabel(ne-model, token)
 4:   final-tag = ne-tag
 5:   if final-tag = O then
 6:     final-tag = ld-tag
 7:     (dict-tag, dict-freq) = getTagWithMaxFreqFromDict()
 8:     if dict-tag ≠ O then
 9:       final-tag = dict-tag
10:     end if
11:   end if
12: end procedure

The constant O in line 5 is returned when the classifier cannot identify any appropriate tag for the given token; so, if the token is not a named entity, it is tagged with one of the language tags using the corresponding model. The method getTagWithMaxFreqFromDict() in line 7 determines the language in which the token occurs most frequently in the transliteration-pair dictionary.
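Algorithm 1 translates almost directly into Python. In the sketch below the two taggers and the transliteration-pair dictionary are stubbed with toy dictionaries, since the paper does not specify their internals.

```python
# A runnable rendering of Algorithm 1. The two models and the
# transliteration-pair dictionary are stubbed out with toy data;
# their internals are assumptions, not the paper's implementation.
def get_label(model, token):
    return model.get(token, "O")  # stand-in for a CRF tagger's prediction

def get_tag_with_max_freq_from_dict(token, translit_dict):
    # Returns (tag, freq) for the language in which the token occurs
    # most often in the transliteration-pair dictionary, else ("O", 0).
    counts = translit_dict.get(token, {})
    if not counts:
        return "O", 0
    tag = max(counts, key=counts.get)
    return tag, counts[tag]

def label(token, ld_model, ne_model, translit_dict):
    ld_tag = get_label(ld_model, token)
    ne_tag = get_label(ne_model, token)
    final_tag = ne_tag
    if final_tag == "O":              # not a named entity (line 5)
        final_tag = ld_tag            # fall back to the language tag (line 6)
        dict_tag, _ = get_tag_with_max_freq_from_dict(token, translit_dict)
        if dict_tag != "O":           # dictionary vote overrides (lines 8-9)
            final_tag = dict_tag
    return final_tag

# Toy usage
ld_model = {"maano": "hi"}
ne_model = {"Gandhinagar": "NE"}
translit_dict = {"maano": {"hi": 12, "gu": 3}}
print(label("maano", ld_model, ne_model, translit_dict))  # -> hi
```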
2.2 External Tools Used
The following tools were used for this subtask:

1. CRFsuite [5] was used to train the models for language detection and named entity recognition on the training data, and to tag the test files.
2. Deeplearning4j's Word2vec API [8] was used to obtain the word2vec model for the words of the training and test files. The number of training iterations was set to 50 and the feature vector size of each word was set to 100.
3. The Java-ML library's implementation of the kMeans algorithm was used to cluster the words' feature vectors obtained from the Deeplearning4j Word2vec API.

3. SUBTASK 2: MIXED-SCRIPT AD HOC RETRIEVAL
3.1 Methodology
Before indexing, all the Roman-script words in the documents as well as the queries were transliterated back to the Devanagari script; it has been observed that transliterating Devanagari words to the Roman script produces more spelling variations than transliterating from Roman to Devanagari. The documents were then indexed in the following four ways (a plain-Python sketch of these transformations is given at the end of this section):

1. Run 1: Texts were tokenised at white space. A Hindi stemmer was then used to stem the tokens, to take into account multiple variations of a token. For example, the token ख़रीदार becomes खरीदार after stemming.
2. Run 2: All white space and vowel signs were removed from the texts. For example, the token बॉलीवुड becomes बलवड after removing the vowel signs. Character-level n-grams, with n ranging from 2 to 6, were then created from the texts.
3. Run 3: All white space was removed from the texts and vowel signs were replaced by actual vowels. For example, the token बॉलीवुड becomes बऑलईवउड after the replacement. Character-level n-grams, with n ranging from 2 to 6, were then created from the texts.
4. Run 4: Texts were tokenised at white space and stemmed with a Hindi stemmer. Furthermore, word-level n-grams (called shingles in Lucene vocabulary), with n ranging from 2 to 6, were created for the documents.

Further, the DFR similarity measure was used to find the most relevant documents for a particular query. Within DFR, the following settings were used:

1. the limiting form of the Bose-Einstein model as the basic model of information content;
2. the Laplace law of succession as the first normalisation;
3. Dirichlet priors as the second normalisation.

The main hypotheses are:

1. Indexing character-level n-grams of the texts should produce better results than word-level n-grams. The main reason is that character-level n-grams capture more granular information and are therefore able to account for minor spelling variations more effectively.
2. Indexing word-level n-grams should produce better results than indexing individual words.
3. A system that replaces vowel signs with actual vowels should perform better than one that simply removes them, because replacement prevents the loss of information that removal causes. That loss can result in ambiguity: for example, when vowel signs are removed, दुखी and देखो both reduce to दख, whereas when vowel signs are replaced by vowels they yield the distinct strings दउखई and दएखओ, respectively.

3.2 External Tools Used
The following tools were used for this subtask:

1. Google Transliterator [6] was used to transliterate the documents and queries back to Hindi.
2. Apache Lucene [7] was used to index the documents and to search for the documents relevant to the queries.
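To make the four indexing variants concrete, the following sketch reproduces their text-processing side in plain Python. The actual runs used Lucene analyzers; the vowel-sign-to-vowel mapping here is partial and assumed, and no real Hindi stemmer is included.

```python
# A plain-Python sketch of the text processing behind the four runs.
# The real system used Lucene analyzers; the (partial) vowel-sign map
# below is an illustrative assumption.

# Devanagari vowel signs (matras) -> corresponding independent vowels.
VOWEL_OF_SIGN = {
    "\u093e": "\u0906",  # ा -> आ
    "\u093f": "\u0907",  # ि -> इ
    "\u0940": "\u0908",  # ी -> ई
    "\u0941": "\u0909",  # ु -> उ
    "\u0942": "\u090a",  # ू -> ऊ
    "\u0947": "\u090f",  # े -> ए
    "\u0948": "\u0910",  # ै -> ऐ
    "\u094b": "\u0913",  # ो -> ओ
    "\u094c": "\u0914",  # ौ -> औ
    "\u0949": "\u0911",  # ॉ -> ऑ
}

def remove_vowel_signs(text):           # Run 2
    return "".join(c for c in text
                   if c not in VOWEL_OF_SIGN and not c.isspace())

def replace_vowel_signs(text):          # Run 3
    return "".join(VOWEL_OF_SIGN.get(c, c) for c in text
                   if not c.isspace())

def char_ngrams(text, lo=2, hi=6):      # Runs 2 and 3
    return [text[i:i + n] for n in range(lo, hi + 1)
            for i in range(len(text) - n + 1)]

def word_shingles(tokens, lo=2, hi=6):  # Run 4 ("shingles" in Lucene)
    return [" ".join(tokens[i:i + n]) for n in range(lo, hi + 1)
            for i in range(len(tokens) - n + 1)]

word = "बॉलीवुड"
print(remove_vowel_signs(word))    # बलवड
print(replace_vowel_signs(word))   # बऑलईवउड
print(char_ngrams(remove_vowel_signs(word))[:5])
```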
4. RESULTS AND DISCUSSION
4.1 Subtask 1
Three runs were submitted for the subtask. The feature sets deployed in each run are described in Table 4, and the overall results achieved by these methods are given in Table 5.

Table 4: Subtask 1 runs (✓ = feature used, × = not used)
Run    Vocabulary Feature  Word2Vec Clustering Feature  Dictionary Feature
Run 1  ✓                   ✓                            ✓
Run 2  ✓                   ✓                            ×
Run 3  ✓                   ×                            ×

Table 5: Subtask 1 results
Measure               Run 1  Run 2  Run 3
Token Accuracy        0.689  0.817  0.756
Average F-measure*    0.575  0.622  0.524
Weighted F-measure**  0.701  0.804  0.734

*: the unweighted average of the F-measures of all valid tags in the test set.
**: the average of the F-measures of all valid tags in the test set, weighted by the frequency of each tag.

I had hypothesised that the use of the dictionary and word2vec features would improve the system's performance. Although the use of word2vec features did result in an appreciable improvement (almost 8% in accuracy), it was surprising to see that using the dictionary to determine the tag actually decreased performance. The main reason is that transliteration pairs were available for only three languages: Hindi, Gujarati and Bangla. Dictionaries for the remaining six languages were unavailable, which may have caused the poor results.

More granular results (for language identification only) are given in Table 6.

Table 6: Subtask 1 strict F-measures for language identification
Language   Run 1   Run 2   Run 3
Bengali    0.7613  0.8525  0.7205
English    0.6984  0.8511  0.8403
Gujarati   0.1582  0       0
Hindi      0.5522  0.8131  0.6995
Kannada    0.7324  0.7483  0.594
Malayalam  0.6287  0.6219  0.4644
Marathi    0.7074  0.8308  0.6354
Tamil      0.8249  0.8639  0.7346
Telugu     0.4603  0.5083  0.2418

The system performed poorly at identifying Gujarati words. One reason for this is the lack of sufficient mentions of Gujarati words in the training dataset. An interesting observation was that many common Gujarati words, like maru, karwu and pachi, were tagged as Hindi words; the high resemblance between Hindi and Gujarati exacerbated the effect of the uneven label distribution in the dataset. Why the results for Telugu were not as good, despite a sufficiently large number of Telugu word mentions, remains unknown.

4.2 Subtask 2
Four runs were submitted, one for each of the ways of indexing the documents described in Section 3.1. Table 7 gives the overall results achieved by the methods, and Table 8 gives more specific results for the cross-script retrieval case.

Table 7: Subtask 2 overall results
Measure  Run 1   Run 2   Run 3   Run 4
NDCG@1   0.6700  0.5267  0.6967  0.5633
NDCG@5   0.5922  0.5424  0.6991  0.5124
NDCG@10  0.6057  0.5631  0.7160  0.5173
MAP      0.3173  0.2922  0.3814  0.2360
MRR      0.4964  0.3790  0.5613  0.3944
Recall   0.3962  0.4435  0.4921  0.2932

Table 8: Subtask 2 cross-script results
Measure  Run 1   Run 2   Run 3   Run 4
NDCG@1   0.4233  0.1833  0.3333  0.2900
NDCG@5   0.3264  0.2681  0.3864  0.2684
NDCG@10  0.3721  0.3315  0.4358  0.2997
MAP      0.2804  0.2168  0.3060  0.2047
MRR      0.4164  0.2757  0.4233  0.3244
Recall   0.3774  0.4356  0.5058  0.2914

One objective of the experiment was to determine which indexing technique produces better results: word-level or character-level n-grams. As can be observed in the tables, character-level n-grams outperformed word-level n-grams.

The run in which vowel signs were replaced by actual vowels performed much better than the one in which they were removed entirely, which supports the hypothesis stated earlier.

The hypothesis that indexing word n-grams would produce better results than indexing individual stemmed words was proven wrong by the experimental results. The reason for this is still not clear.

5. FUTURE WORK
Currently, the system does not handle mixed words (i.e. words formed by the fusion of multiple languages); an effective algorithm for these still needs to be developed. A separate word2vec model could be created for every language, consisting of a list of feature vectors, one for each word of that language. The similarity of a word's feature vector to a language's model, calculated by averaging the Hamming distance of the feature vector to every vector in that language's model, could then be used for this purpose. It could also be used for language identification.

Graph-based n-gram language identification for short texts [9] has been used by others to identify the language in code-switched data. The method was tried early in the development of the system, but it produced poor results under 10-fold cross-validation; the reason for this still needs to be investigated.

6. REFERENCES
[1] Transliteration pairs for Hindi-English, Bangla-English and Gujarati-English. http://cse.iitkgp.ac.in/resgrp/cnerg/qa/fire13translit/index.html
[2] J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning, 2001.
[3] P. Gupta et al. Query expansion for mixed-script information retrieval. In Proceedings of SIGIR 2014.
[4] M. Choudhury et al. Overview of FIRE 2014 track on transliterated search.
[5] CRFsuite: a fast implementation of conditional random fields (CRFs). http://www.chokkan.org/software/crfsuite/
[6] Google Transliterator. https://developers.google.com/transliterate/v1/getting_started
[7] Apache Lucene. https://lucene.apache.org/
[8] Deeplearning4j's implementation of word2vec. http://deeplearning4j.org/word2vec.html
[9] E. Tromp and M. Pechenizkiy. Graph-based n-gram language identification on short texts. In Proceedings of the 20th Machine Learning Conference of Belgium and The Netherlands, 2011.