Query Labelling for Indic Languages using a hybrid approach

Rupal Bhargava, Yashvardhan Sharma, Shubham Sharma, Abhinav Baid
Department of Computer Science & Information Systems
Birla Institute of Technology & Science, Pilani, Pilani Campus
{rupal.bhargava, yash, f2012493, f2013018}@pilani.bits-pilani.ac.in


ABSTRACT
With the boom in internet use, social media text has been increasing day
by day. Much of the user generated content on the internet is written in
a very informal way. Usually people tend to write text on social media
using indigenous scripts, and understanding a script different from one's
own is a difficult task. Moreover, a large share of the queries received
by search engines nowadays consists of transliterated text. Hence,
providing a common platform to deal with the problem of transliterated
text becomes really important. This paper presents our approach to
labeling queries as part of the FIRE 2015 shared task on Mixed-Script
Information Retrieval. Tokens in the query are labeled using a hybrid
approach that combines rule based and machine learning techniques. Each
annotation is dealt with separately but sequentially.

Keywords
Transliteration, Natural Language Processing, Language Identification,
Machine Learning, Logistic Regression, Information Retrieval

1. INTRODUCTION
There are a large number of indigenous scripts in the world that are
widely used. By indigenous scripts, we are referring to any language
written in a script that is not Roman. Due to technological reasons such
as a lack of standard keyboards for non-Roman scripts, the popularity of
the QWERTY keyboard and familiarity with the English language, much of
the user generated content on the internet is written in transliterated
form. Transliteration is the process of phonetically representing the
words of a language in a non-native script. For example, to represent a
Hindi colloquialism such as the word for "Okay", users will often write
its transliterated form [1]. Search engines get a large number of
transliterated search queries daily; the challenge in processing these
queries is the spelling variation of their transliterated forms. For
example, a single Hindi word can be written as ‘khana’, ‘khaana’,
‘khaanna’, and so on. This particular problem involves the following:
(1) taking care of spelling variations due to transliteration and
(2) forward/backward transliteration. Similarly, with the rise in the use
of social media, there has been a corresponding increase in the use of
hashtags, emoticons and abbreviations. So, along with identification of
languages, these need to be recognized as well. Also, named entities
should be considered separately [2].

2. SUBTASK 1: QUERY WORD LABELING
Suppose that q: w1 w2 w3 … wn is a query written in the Roman script. The
words w1, w2, etc. could be standard English words or transliterated from
another language in L = {Bengali (bn), Gujarati (gu), Hindi (hi), Kannada
(kn), Malayalam (ml), Marathi (mr), Tamil (ta), Telugu (te)}. The task is
to label each word as en or as a member of L, depending on whether it is
an English word or a transliterated L-language word. In addition, Named
Entity (NE) recognition and identification of mixed language words (MIX)
and punctuation (X) also had to be carried out.

3. PROPOSED TECHNIQUE
Our system reads the input file and separates its contents into tokens.
After all the tags have been identified, an output is generated for them.
We collected additional training data for Gujarati and Hindi from the
previous year’s Microsoft FIRE event. Logistic regression was used to
train a classifier for each language individually. The feature set
consisted of character unigrams and bigrams, with unigrams contributing
the most in our opinion. A rule based approach, driven by the
probabilities obtained, was used to combine the individual language
classifiers. For the other annotations, the process is explained in the
respective stages below.
The token identification (X, NE, MIX, etc.) is done in a pipelined
manner. The 4 stages of the pipeline are:

    1.  Identification of Punctuation (X): The tag X encompasses all
        forms of punctuation, numerals, emoticons, mentions, hashtags and
        acronyms. This stage can be further divided into 2 parts carried
        out sequentially: identification of emoticons, hashtags, etc. and
        identification of abbreviations.
            a.  Identification of hashtags, emoticons, etc.: This is done
                using the CMU Ark tagger
                (http://www.ark.cs.cmu.edu/TweetNLP/) with a model
                trained especially for social media text. The tagging
                model is a first-order maximum entropy Markov model
                (MEMM), a discriminative sequence model for which
                training and decoding are extremely efficient [4].
            b.  Identification of abbreviations: A dictionary based
                approach is used for this purpose. A list of around 1400
                commonly used abbreviations in SMS language was built,
                and a word was marked as X if it occurred in this list.

    2.  Identification of Named Entities (NE): Named entities were also
        identified using a dictionary based approach. The dictionary of
        named entities was created from the training data, because the
        data was insufficient to train a machine learning algorithm. The
        number of named entities, 2414, was too low, and the
        multi-language nature of the dataset made it hard to characterize
        words as NE with certainty.
        For example, in English, named entities occur in a certain manner
        at certain positions according to sentence structure. But in
        multilingual sentences, sentence structure varies a lot.

    3.  Identification of Language: For language detection, the
        classifier was built using Logistic Regression with feature
        vectors containing character unigrams and bigrams [3].

    4.  Identification of mixed words (MIX): Finally, a rule based
        approach was adopted for identifying mixed words in the
        utterances. If the 2 highest language probabilities in the list
        generated in the previous stage are close to each other, the word
        is classified as MIX. The closeness threshold was 0.05, applied
        together with a word length greater than 8. It was determined
        empirically by setting it to different values and manually
        evaluating the output.

If there is a match in stage 1 or 2 of the pipeline, the token is labeled
immediately and no further stages are applied to that word. Otherwise,
the token passes through stages 3 and 4 above so that the final tag can
be determined.

4. EXPERIMENTS AND RESULTS
We used the data given to us, which included labeled utterances from
social media and blogs, to build our training data set. We submitted
three runs, all of which used character 1-grams and 2-grams as features.
In run 2, we manually removed a few words from the named entity list. In
run 3, mixed word detection was enabled; it was disabled in the other
runs to keep false positives from lowering accuracy. Our training data
consisted of 41882 words including all languages and named entities. The
training data set was built as a dense model, i.e. each word is
represented by a feature vector of 712 entries, one for each possible
character 1-gram and 2-gram, with 1 for the features that are present in
the word and 0 for those that are not. A separate model was built for
each language, trained on an equal number of words in the language and
words not in the language. We used the scikit-learn toolkit
(http://scikit-learn.org/stable/) for machine learning [5]. For language
identification, we tried linear regression, naïve Bayes and Logistic
Regression classifiers.
We used an 80-20 split of the training data to test the performance of
our system, holding out 20% as the test set. The results obtained using
the evaluation script for our individual classifiers are shown in
Table 1.

    Table 1: Language wise Precision for different classifiers on
                  test data from the 80-20 split

    Language    Linear    Naive Bayes    Logistic
    en          0.8577    0.7653         0.8660
    bn          0.7545    0.7528         0.7605
    ta          0.7176    0.7762         0.7642
    mr          0.7263    0.7432         0.7402
    kn          0.7415    0.7375         0.7298
    te          0.7920    0.7542         0.7626
    ml          0.7883    0.7622         0.7582
    gu          0.6697    0.7501         0.6968
    hi          0.7343    0.7138         0.7391
    Avg.        0.7536    0.7506         0.7575

The results above were calculated using the evaluation script provided.
They showed that the individual classifiers performed reasonably well. We
decided to use logistic regression (a linear model), as it gave the
highest average precision in Table 1. We tried out different parameters
and chose the configuration that performed best on our training data.

    Table 2: Official language wise F-Measure, Precision, Recall

    Language    F-Measure    Precision    Recall
    X           0.8237       0.8963       0.7619
    bn          0.4803       0.4327       0.5397
    en          0.7214       0.6171       0.8683
    gu          0.0849       0.1784       0.0557
    hi          0.3853       0.3473       0.4326
    kn          0.4038       0.4281       0.3821
    ml          0.297        0.3896       0.24
    mr          0.3141       0.3899       0.263
    ta          0.5365       0.6501       0.4567
    te          0.3444       0.3473       0.3415

Our overall performance was:

    Table 3: Weighted F-Measure and token accuracy for the three runs.

                          Run 1       Run 2      Run 3
    Tokens                11999       11999      11999
    Tokens correct        6576        6575       6574
    Weighted F-Measure    0.567742    0.56769    0.567615851
    Token accuracy (%)    54.8046     54.7962    54.7879
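
The summary figures in Tables 2 and 3 follow from standard definitions,
which makes them easy to cross-check. The sketch below is not the
official evaluation script. It assumes that F-Measure is the harmonic
mean of precision and recall and that token accuracy is the percentage of
correctly labeled tokens; the function names are illustrative only.

```python
def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (assumed F-Measure definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def token_accuracy(correct: int, total: int) -> float:
    """Percentage of tokens that received the correct label."""
    return 100.0 * correct / total


# (precision, recall) pairs copied from Table 2.
table2 = {"X": (0.8963, 0.7619), "en": (0.6171, 0.8683), "gu": (0.1784, 0.0557)}
for tag, (p, r) in table2.items():
    print(tag, round(f_measure(p, r), 4))

# Correct-token counts copied from Table 3; every run had 11999 tokens.
for correct in (6576, 6575, 6574):
    print(correct, round(token_accuracy(correct, 11999), 4))
```

Because the published precision and recall values are themselves rounded,
a recomputed F-Measure can differ from Table 2 in the last digit.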


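To make the word representation from this section and the MIX rule from
stage 4 of the pipeline more concrete, here is a minimal sketch in plain
Python. Several details are assumptions rather than the authors' code:
the paper gives only the vector length (712), so the alphabet below
(a-z and 0-9 as unigrams, a-z pairs as bigrams) is just one choice that
happens to produce 712 entries; the function names, the example word
"schoolwale" and the probabilities are hypothetical. In the actual system
the per-language probabilities would come from the trained logistic
regression classifiers.

```python
from itertools import product
from string import ascii_lowercase, digits

# Assumed alphabet: 36 unigrams (a-z, 0-9) + 676 letter bigrams = 712 entries.
UNIGRAMS = list(ascii_lowercase + digits)
BIGRAMS = ["".join(pair) for pair in product(ascii_lowercase, repeat=2)]
VOCAB = {gram: index for index, gram in enumerate(UNIGRAMS + BIGRAMS)}


def ngram_features(word: str) -> list:
    """Binary vector: 1 where a character 1-gram or 2-gram occurs in the word."""
    word = word.lower()
    vector = [0] * len(VOCAB)
    grams = list(word) + [word[i:i + 2] for i in range(len(word) - 1)]
    for gram in grams:
        if gram in VOCAB:  # characters outside the assumed alphabet are skipped
            vector[VOCAB[gram]] = 1
    return vector


def label_token(word: str, lang_probs: dict,
                threshold: float = 0.05, min_len: int = 8) -> str:
    """Return MIX when the two highest language probabilities differ by less
    than `threshold` and the word is longer than `min_len`; otherwise return
    the most probable language. Expects at least two languages."""
    ranked = sorted(lang_probs.items(), key=lambda item: item[1], reverse=True)
    (best_lang, best_p), (_, second_p) = ranked[0], ranked[1]
    if len(word) > min_len and (best_p - second_p) < threshold:
        return "MIX"
    return best_lang


# Hypothetical probabilities, for illustration only.
print(len(ngram_features("khaana")))
print(label_token("khaana", {"hi": 0.81, "en": 0.10, "bn": 0.09}))
print(label_token("schoolwale", {"en": 0.46, "hi": 0.44, "bn": 0.10}))
```

The length condition keeps short words, for which two related languages
often score similarly, from being labeled MIX by default.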
As shown in Table 3, our overall weighted F-Measure was 56.7%. Also, our
standard deviation was close to a 10% error margin. In addition, there
was a direct correlation in the results between precision and the size of
the training data. The number of words for the different languages in the
training data was 3509 (bn), 17392 (en), 744 (gu), 4237 (hi), 1520 (kn),
1126 (ml), 1868 (mr), 3116 (ta) and 5960 (te).
As shown in Table 2, languages like English, for which the training data
was larger, gave around 72.1% F-Measure and 86.8% recall with 61.7%
precision, while Gujarati, which had very little training data, gave
17.8% precision. We did better on the weighted F-Measure statistic
because the languages with less training data were also the ones least
represented in the test data. As such, weighted evaluation of the
language predictor gave us around 56.7% F-Measure.
Named Entity recognition used a lookup based method that classifies a
word in the test set as a named entity if it was found among the named
entities of the training set. This was done because the training set for
named entities was too small to train a machine-learned Named Entity
Recognizer. The results obtained reaffirmed that our approach was
correct.

It was observed that the Language Predictor developed with our approach
predicted inaccurately on the test data because of the small amount of
training data. The precisions of our individual classifiers and the
official results for English, Bengali, and Tamil back our claim.

5. CONCLUSION AND FUTURE WORK
In this paper, we discussed the n-gram approach to identifying the
language of a word. The context cues of the word could be used to
identify the language instead of relying only on character unigrams and
bigrams. Future work could implement a sequence based classifier that
classifies a word based on the previous and the next word. Instead of
using only unigrams and bigrams, the system could be extended to use
{1, 2, 3, 4, 5}-grams with different machine learning algorithms such as
MaxEnt, Naïve Bayes, Logistic Regression, SVM, etc. Our Named Entity
recognizer was prone to errors due to insufficient data. Similarly, the
accuracy of our system could be improved by training it on more data.
However, X tokens were identified with reasonable accuracy.
Tagging of MIX words could also be improved by using better thresholds.

6. REFERENCES
[1] King, Ben, and Steven P. Abney. "Labeling the Languages of
    Words in Mixed-Language Documents using Weakly
    Supervised Methods." HLT-NAACL. 2013.
[2] Parth Gupta, Kalika Bali, Rafael E. Banchs, Monojit
    Choudhury, and Paolo Rosso. 2014. Query expansion for
    mixed-script information retrieval. In Proceedings of the 37th
    international ACM SIGIR conference on Research &
    development in information retrieval (SIGIR '14). ACM,
    New York, NY, USA, 677-686. DOI=
    http://dx.doi.org/10.1145/2600428.2609622
[3] Spandana Gella, Kalika Bali and Monojit Choudhury. "Ye
    word kis lang ka hai bhai?" Testing the Limits of Word level
    Language Identification. (To appear) In Proceedings of the
    Eleventh International Conference on Natural Language
    Processing (ICON 2014). Goa, India.
[4] Owoputi, Olutobi, O'Connor, Brendan, Dyer, Chris, Gimpel,
    Kevin, Schneider, Nathan and Smith, Noah A. "Improved
    part-of-speech tagging for online conversational text with
    word clusters." In Proceedings of NAACL-HLT, 2013.
[5] Pedregosa et al. "Scikit-learn: Machine Learning in Python."
    JMLR 12, pp. 2825-2830, 2011.