Query Labelling for Indic Languages using a hybrid approach

Rupal Bhargava, Yashvardhan Sharma, Shubham Sharma, Abhinav Baid
Department of Computer Science & Information Systems
Birla Institute of Technology & Science, Pilani, Pilani Campus
{rupal.bhargava, yash, f2012493, f2013018}@pilani.bits-pilani.ac.in

ABSTRACT
With the boom in the internet, the volume of social media text has been increasing day by day. Much of the user generated content on the internet is written in a very informal way. Usually people tend to write text on social media using indigenous script. To understand a script different from ours is a difficult task. Moreover, a large number of the queries received by search engines today are transliterated text. Hence, providing a common platform to deal with the problem of transliterated text becomes really important. This paper presents our approach to labeling queries as part of the FIRE2015 shared task on Mixed-Script Information Retrieval. Tokens in the query are labeled on the basis of a hybrid approach which involves rule based and machine learning techniques. Each annotation has been dealt with separately but sequentially.

Keywords
Transliteration, Natural Language Processing, Language Identification, Machine Learning, Logistic Regression, Information Retrieval

1. INTRODUCTION
There are a large number of indigenous scripts in the world that are widely used. By indigenous scripts, we are referring to any language written in a script that is not Roman. Due to technological reasons such as a lack of standard keyboards for non-Roman scripts, the popularity of the QWERTY keyboard and familiarity with the English language, much of the user generated content on the internet is written in transliterated form. Transliteration is the process of phonetically representing the words of a language in a non-native script. For example, to represent a Hindi colloquialism meaning "Okay", users will often write its transliterated form [1].

Search engines get a large number of transliterated search queries daily. The challenge in processing these queries is the spelling variation of their transliterated forms. For example, a single Hindi word can be written as 'khana', 'khaana', 'khaanna', and so on. This particular problem involves the following: (1) taking care of spelling variations due to transliteration and (2) forward/backward transliteration. Similarly, with the rise in the use of social media, there has been a corresponding increase in the use of hashtags, emoticons and abbreviations. So, along with identification of languages, these need to be recognized as well. Also, named entities should be considered separately [2].

2. SUBTASK 1: QUERY WORD LABELING
Suppose that q: w1 w2 w3 … wn is a query written in the Roman script. The words w1, w2, etc. could be standard English words or transliterated from another language L = {Bengali (bn), Gujarati (gu), Hindi (hi), Kannada (kn), Malayalam (ml), Marathi (mr), Tamil (ta), Telugu (te)}. The task is to label each word as en or as a member of L, depending on whether it is an English word or a transliterated L-language word. Further, Named Entity (NE) recognition and identification of mixed language words (MIX) and punctuation (X) also had to be carried out.

3. PROPOSED TECHNIQUE
Our system reads the input file and separates it into tokens. After all the tags have been identified, an output is generated for them. We collected more training data for Gujarati and Hindi from the previous year's Microsoft FIRE event. Logistic regression was used to train a classifier for each language individually. The feature set included character unigram and bigram indices, with unigrams contributing the most in our opinion. A rule based approach, driven by the probabilities obtained, was used for combining the individual language classifiers. For the other annotations, the process is explained in the respective stages below. The token identification (X, NE, MIX, etc.) is done in a pipelined manner. The 4 stages of the pipeline are:

1. Identification of Punctuation (X): The tag X encompasses all forms of punctuation, numerals, emoticons, mentions, hashtags and acronyms. This stage can further be divided into 2 parts done sequentially: identification of emoticons, hashtags, etc., and identification of abbreviations.

   a. Identification of hashtags, emoticons, etc.: This is done using the CMU Ark tagger (http://www.ark.cs.cmu.edu/TweetNLP/) with a training model especially designed for social media text. The tagging model is a first-order maximum entropy Markov model (MEMM), a discriminative sequence model for which training and decoding are extremely efficient [4].

   b. Identification of abbreviations: A dictionary based approach is used for this purpose. A list of around 1400 commonly used abbreviations in SMS language was built, and a word was marked as X if it occurred in this list.

2. Identification of Named Entities (NE): Named entities were also identified using a dictionary based approach. The training data was used to create the dictionary of named entities because the data was insufficient to run a machine learning algorithm. The number of named entities was 2414. This number was too low, and the multi-language nature of the dataset made it hard to characterize words as NE with certainty. For example, in English, named entities occur in a certain manner at certain positions according to sentence structure. But in multilingual sentences, sentence structure varies a lot.

3. Identification of Language: For language detection, the classifier was built using Logistic Regression with feature vectors containing character unigrams and bigrams [3].

4. Identification of mixed words (MIX): Finally, a rule based approach was adopted for identifying mixed words in the utterances. If the 2 maximum language probabilities in the list generated in the previous stage are close to each other, the word is classified as MIX. The threshold was 0.05, applied to words of length greater than 8. It was determined empirically by setting it at different values and manually evaluating the output.

If there is a match in stage 1 or 2 of the pipeline, the token is tagged immediately and no further stages are applied to that word. Otherwise, the token passes through stages 3 and 4 above so that the final tag can be determined.

4. EXPERIMENTS AND RESULTS
We used the data given to us, which included labeled utterances from social media and blogs, to build our training data set. We submitted three runs, all using character 1-grams and 2-grams as features. In run 2, we manually removed a few words from the named entity list. In run 3, mixed word detection was enabled; it was disabled in the other runs to keep accuracy from dropping due to false positives. Our training data consisted of 41882 words including all languages and named entities. The training data set was built as a dense model, i.e. each word is represented using 0 for those features that are not present in the word and 1 for those that are present, with the feature vector containing 712 entries per word corresponding to each possible character 1-gram and 2-gram. A separate model was built for each language, containing an equal number of words in the language and words not in the language. We used the scikit-learn toolkit (http://scikit-learn.org/stable/) for machine learning [5]. For language identification, we tried linear regression, naïve Bayes and Logistic Regression classifiers.

We used an 80-20 split of the training data to validate the performance of our system. The results obtained using the evaluation script for our individual classifiers are shown in Table 1.

Table 1: Language wise Precision for different classifiers on test data from the 80-20 split

Language   Linear   Naive Bayes   Logistic
en         0.8577   0.7653        0.8660
bn         0.7545   0.7528        0.7605
ta         0.7176   0.7762        0.7642
mr         0.7263   0.7432        0.7402
kn         0.7415   0.7375        0.7298
te         0.7920   0.7542        0.7626
ml         0.7883   0.7622        0.7582
gu         0.6697   0.7501        0.6968
hi         0.7343   0.7138        0.7391
Avg.       0.7536   0.7506        0.7575

The results above were computed using the evaluation script provided. They showed clearly that the individual classifiers were pretty good. We decided to use a linear kernel for logistic regression as it was giving the highest accuracy. We tried out different parameters and chose the configuration that worked best for our training data.

Table 2: Official language wise F-Measure, Precision, Recall

Language   F-Measure   Precision   Recall
X          0.8237      0.8963      0.7619
bn         0.4803      0.4327      0.5397
en         0.7214      0.6171      0.8683
gu         0.0849      0.1784      0.0557
hi         0.3853      0.3473      0.4326
kn         0.4038      0.4281      0.3821
ml         0.297       0.3896      0.24
mr         0.3141      0.3899      0.263
ta         0.5365      0.6501      0.4567
te         0.3444      0.3473      0.3415

Our overall performance was:

Table 3: Weighted F-Measure and token accuracy for the three runs.

                     Run 1      Run 2      Run 3
Tokens               11999      11999      11999
Tokens correct       6576       6575       6574
Weighted F-Measure   0.567742   0.56769    0.567615851
Token accuracy (%)   54.8046    54.7962    54.7879

As shown in Table 3, our overall weighted F-Measure was about 56.8%. Also, our standard deviation was close to a 10% error margin.

In addition, there was a direct correlation in the results between the precision and the training data sizes used. The number of words for the different languages in the training data was 3509 (bn), 17392 (en), 744 (gu), 4237 (hi), 1520 (kn), 1126 (ml), 1868 (mr), 3116 (ta) and 5960 (te).

As shown in Table 2, languages like English, for which the training data size was larger, gave around 72% F-Measure and 87% recall with 62% precision, while Gujarati, which had very little training data, gave 18% precision. We did better on the weighted F-Measure statistic because the languages with less training data were also the ones least represented in the test data. As such, weighted evaluation of the language predictor gave us around 57% F-Measure.

Named Entity recognition was done with a lookup based method that classifies a word in the test set as a named entity if it was found in the training set. This was done because the training set for named entities was too small to use a machine-learned Named Entity Recognizer.

The results obtained reaffirmed that our approach was correct. It was observed that the language predictor developed with our approach predicted inaccurately on the testing data because of the small training data. The precisions of our individual classifiers and the official results for English, Bengali, and Tamil back our claim.

5. CONCLUSION AND FUTURE WORK
In this paper, we discussed the n-gram approach to identifying the language of a word. The context cues of the word could be used to identify the language instead of relying only on character unigrams and bigrams. Future work could implement a sequence based classifier that classifies a word based on the previous and the next word. Instead of using only unigrams and bigrams, the system could be improved to use {1, 2, 3, 4, 5}-grams with different machine learning algorithms such as MaxEnt, Naïve Bayes, Logistic Regression, SVM, etc. Our Named Entity recognizer was prone to errors due to insufficient data. Similarly, the accuracy of our system could be improved by training it on more data. However, X tokens were identified with reasonable accuracy. Tagging of MIX words could also be improved by using better thresholds.

6. REFERENCES
[1] King, Ben, and Steven P. Abney. "Labeling the Languages of Words in Mixed-Language Documents using Weakly Supervised Methods." HLT-NAACL. 2013.
[2] Parth Gupta, Kalika Bali, Rafael E. Banchs, Monojit Choudhury, and Paolo Rosso. 2014. Query expansion for mixed-script information retrieval. In Proceedings of the 37th international ACM SIGIR conference on Research & development in information retrieval (SIGIR '14). ACM, New York, NY, USA, 677-686. DOI= http://dx.doi.org/10.1145/2600428.2609622
[3] Spandana Gella, Kalika Bali and Monojit Choudhury. "Ye word kis lang ka hai bhai?" Testing the Limits of Word level Language Identification. (To appear) In Proceedings of the Eleventh International Conference on Natural Language Processing (ICON 2014). Goa, India.
[4] Owoputi, Olutobi, O'Connor, Brendan, Dyer, Chris, Gimpel, Kevin, Schneider, Nathan and Smith, Noah A. "Improved part-of-speech tagging for online conversational text with word clusters." In Proceedings of NAACL-HLT, 2013.
[5] Pedregosa et al. Scikit-learn: Machine Learning in Python. JMLR 12, pp. 2825-2830, 2011.
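
The four-stage pipeline of Section 3 can be pictured as a simple cascade: a token that matches stage 1 or stage 2 is tagged at once, and only the remaining tokens reach the language stages. The sketch below is illustrative only. The regular expression is a crude stand-in for the CMU Ark tagger used in the paper, the two small sets are hypothetical samples standing in for the abbreviation list (around 1400 entries) and the named entity dictionary (2414 entries), lowercasing before lookup is an assumption, and `language_tagger` is a hypothetical callback representing stages 3 and 4.

```python
import re

# Stand-in for stage 1a. The paper uses the CMU Ark tagger; this regex only
# approximates hashtags, mentions, numerals and runs of punctuation.
X_PATTERN = re.compile(r"^(#\w+|@\w+|\d+|[^\w\s]+)$")

# Hypothetical samples; the real lists are far larger.
ABBREVIATIONS = {"lol", "brb", "omg"}
NAMED_ENTITIES = {"delhi", "sachin"}


def tag_token(token, language_tagger):
    """Tag one token, stopping at the first stage that matches."""
    lowered = token.lower()  # case-insensitive lookup is an assumption
    # Stage 1: punctuation-like tokens and known abbreviations become X.
    if X_PATTERN.match(token) or lowered in ABBREVIATIONS:
        return "X"
    # Stage 2: dictionary lookup for named entities.
    if lowered in NAMED_ENTITIES:
        return "NE"
    # Stages 3 and 4: delegate to the language / MIX decision.
    return language_tagger(token)


def tag_query(query, language_tagger):
    """Split a query on whitespace and tag every token."""
    return [(tok, tag_token(tok, language_tagger)) for tok in query.split()]
```

Any function that maps a token to a language tag can be passed as `language_tagger`, for example a wrapper around the per-language classifiers.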
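
The MIX rule of stage 4 can be read as a comparison between the two largest language probabilities. The sketch below is one possible reading, not the authors' code: it assumes that "close" means the gap between the top two probabilities is below 0.05, and that the rule only fires for words longer than 8 characters. The function name, parameter names and the example probabilities are hypothetical.

```python
def label_word(word, probs, gap_threshold=0.05, min_length=8):
    """Return 'MIX' or the most probable language for a word.

    probs maps language codes to classifier probabilities and must
    contain at least two languages.
    """
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    (best_lang, best_p), (_, second_p) = ranked[0], ranked[1]
    # Assumed reading of the rule: long word and a near-tie at the top.
    if len(word) > min_length and (best_p - second_p) < gap_threshold:
        return "MIX"
    return best_lang
```

For instance, `label_word("schoolwalon", {"en": 0.48, "hi": 0.46, "bn": 0.06})` falls under the rule because the word has 11 characters and the top two probabilities differ by about 0.02, while a short word with the same probabilities keeps its top language.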
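
The dense 0/1 representation described in Section 4 can be sketched as follows. This is a minimal illustration under a stated assumption: the alphabet is taken to be lowercase a-z only, which yields 26 + 26 × 26 = 702 features rather than the 712 entries reported in the paper, whose exact symbol inventory is not specified. All names are hypothetical.

```python
import string
from itertools import product

# Assumed alphabet: lowercase a-z. This gives 702 features, not the paper's 712.
ALPHABET = string.ascii_lowercase
VOCAB = list(ALPHABET) + ["".join(pair) for pair in product(ALPHABET, repeat=2)]
INDEX = {gram: i for i, gram in enumerate(VOCAB)}


def char_ngrams(word, n):
    """All contiguous character n-grams of a word."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def featurize(word):
    """Binary vector: 1 if the 1-gram or 2-gram occurs in the word, else 0."""
    word = word.lower()
    vec = [0] * len(VOCAB)
    for n in (1, 2):
        for gram in char_ngrams(word, n):
            idx = INDEX.get(gram)  # n-grams outside the alphabet are ignored
            if idx is not None:
                vec[idx] = 1
    return vec
```

Vectors of this form could then be passed to any binary classifier, one per language, trained on an equal number of in-language and out-of-language words as the paper describes.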
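
The figures in Tables 2 and 3 can be cross-checked with the standard definitions F = 2PR / (P + R) and accuracy = correct / total. The snippet below recomputes them from the reported precision, recall and token counts. The helper names are hypothetical, and small differences from the published values are expected because the inputs are already rounded.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (balanced F-measure)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def token_accuracy(correct, total):
    """Token accuracy as a percentage."""
    return 100.0 * correct / total


# (precision, recall, reported F-measure) taken from Table 2.
TABLE_2 = {
    "X": (0.8963, 0.7619, 0.8237),
    "en": (0.6171, 0.8683, 0.7214),
    "gu": (0.1784, 0.0557, 0.0849),
    "ta": (0.6501, 0.4567, 0.5365),
}

for label, (p, r, reported) in TABLE_2.items():
    print(label, round(f_measure(p, r), 4), reported)

# Token counts taken from Table 3 (first column).
print(round(token_accuracy(6576, 11999), 4))
```

The recomputed values agree with the tables to within rounding, which suggests the official F-Measure column is the balanced harmonic mean of the precision and recall columns.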