<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Query Labelling for Indic Languages using a hybrid approach</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Rupal Bhargava</string-name>
          <email>rupal.bhargava@pilani.bits-pilani.ac.in</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shubham Sharma</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Abhinav Baid</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yashvardhan Sharma</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Computer Science &amp; Information Systems, Birla Institute of Technology &amp; Science</institution>,
          <addr-line>Pilani, Pilani Campus</addr-line>
        </aff>
      </contrib-group>
      <fpage>40</fpage>
      <lpage>42</lpage>
      <abstract>
        <p>With the boom of the internet, the volume of social media text has been increasing day by day. Much of the user-generated content on the internet is written in a very informal way, and people often write text in their indigenous language using the Roman script. Understanding a script different from one's own is a difficult task, and nowadays a large proportion of the queries received by search engines are transliterated text. Providing a common platform to deal with transliterated text therefore becomes important. This paper presents our approach to query labeling as part of the FIRE 2015 shared task on Mixed-Script Information Retrieval. Tokens in the query are labeled using a hybrid approach that combines rule-based and machine learning techniques. Each annotation is dealt with separately but sequentially.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        There are a large number of indigenous scripts in the world that
are widely used; by indigenous scripts, we refer to the scripts of any
language that is not written in the Roman script. Due to technological
reasons, such as the lack of standard keyboards for non-Roman scripts,
the popularity of the QWERTY keyboard and familiarity with the English
language, much of the user-generated content on the internet is written
in transliterated form. Transliteration is the process of phonetically
representing the words of a language in a non-native script. For
example, to represent a colloquialism such as (Okay) in Hindi, users
will often write its transliterated form [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Search engines receive a
large number of transliterated search queries daily; the challenge in
processing these queries is the spelling variation of their
transliterated forms. For example, the Hindi word can be written as
‘khana’, ‘khaana’, ‘khaanna’, and so on. This problem involves (1)
taking care of spelling variations due to transliteration and (2)
forward/backward transliteration. Similarly, with the rise in the use
of social media, there has been a corresponding increase in the use of
hashtags, emoticons and abbreviations, so along with identification of
languages, these need to be recognized as well. Named entities should
also be considered separately [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. SUBTASK 1: QUERY WORD LABELING</title>
      <p>Suppose that q: w1 w2 w3 … wn is a query written in the Roman
script. The words w1, w2, etc. could be standard English words or
transliterated words of an Indic language, and the task is to label
each token accordingly.</p>
    </sec>
    <sec id="sec-4">
      <title>3. PROPOSED TECHNIQUE</title>
      <p>Our system reads the input file and splits each query into
tokens; after all the tags have been identified, the output is
generated. We collected additional data for Gujarati and Hindi from the
previous year's Microsoft FIRE event for training purposes. A logistic
regression classifier was trained for each language individually. The
feature set consisted of character unigram and bigram indices, with
unigrams contributing the most in our opinion. A rule-based approach was
then used to combine the individual language classifiers based on the
probabilities they produced. The remaining annotations are explained
below in their respective stages.</p>
      <p>Token identification (X, NE, MIX, etc.) is done in a pipelined
manner. The four stages of the pipeline are: (1) identification of
punctuation (X), (2) identification of named entities (NE), (3)
identification of language, and (4) identification of mixed words
(MIX).</p>
      <p>Identification of Punctuation (X): The tag X
encompasses all forms of punctuation, numerals, emoticons, mentions,
hashtags and acronyms. This stage is further divided into two parts,
done sequentially: (a) identification of emoticons, hashtags, etc., and
(b) identification of abbreviations.</p>
      <sec id="sec-4-1">
        <title>Identification of hashtags, emoticons, etc.:</title>
        <p>
          This is done using the CMU ARK tagger
(http://www.ark.cs.cmu.edu/TweetNLP/) with a tagging model designed
especially for social media text. The tagging model is a first-order
maximum-entropy Markov model (MEMM), a discriminative sequence model
for which training and decoding are extremely efficient
[
          <xref ref-type="bibr" rid="ref4">4</xref>
          ].
        </p>
      </sec>
      <sec id="sec-4-2">
        <title>Identification of abbreviations:</title>
        <p>A dictionary-based approach is used for this purpose. A list
of around 1400 commonly used abbreviations in SMS language was built,
and a word is marked as X if it occurs in this list.</p>
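        <p>As a minimal sketch, the abbreviation check described above amounts to a case-insensitive set-membership test; the sample entries below are illustrative stand-ins for the actual 1400-entry list:</p>

```python
# Illustrative sketch of the dictionary-based abbreviation check.
# SMS_ABBREVIATIONS stands in for the ~1400-entry list described above.
SMS_ABBREVIATIONS = {"lol", "brb", "omg", "idk", "ttyl"}

def is_abbreviation(token):
    # A token is tagged X if its lowercased form occurs in the list.
    return token.lower() in SMS_ABBREVIATIONS
```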
      </sec>
      <sec id="sec-4-3">
        <title>Identification of Named Entities (NE):</title>
        <p>Named entities were also identified using a dictionary-based
approach. The dictionary was created from the training data because the
data was insufficient to train a machine learning model; it contained
2414 named entities. This number was too low, and the multilingual
nature of the dataset made it hard to characterize words as NE with
certainty. For example, in English, named entities occur in a certain
manner at certain positions according to sentence structure, but in
multilingual sentences the sentence structure varies a lot.</p>
      </sec>
      <sec id="sec-4-4">
        <title>Identification of Language:</title>
        <p>
          For language detection, the classifier was built using
logistic regression with feature vectors containing character unigrams
and bigrams [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ].
        </p>
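        <p>A minimal sketch of one such per-language binary classifier, assuming scikit-learn (which the paper uses); the training words below are toy examples chosen for illustration, not the actual data:</p>

```python
# Sketch of a per-language binary classifier over character 1- and
# 2-gram features, as described above. Training words are toy examples.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def train_language_classifier(in_lang_words, out_lang_words):
    words = in_lang_words + out_lang_words
    labels = [1] * len(in_lang_words) + [0] * len(out_lang_words)
    clf = make_pipeline(
        CountVectorizer(analyzer="char", ngram_range=(1, 2), binary=True),
        LogisticRegression(),
    )
    clf.fit(words, labels)
    return clf

# Example: a toy "Hindi vs. not Hindi" model.
hi_clf = train_language_classifier(
    ["khana", "paani", "achha", "theek"],   # transliterated Hindi (toy)
    ["food", "water", "good", "okay"],      # non-Hindi (toy)
)
p_hindi = hi_clf.predict_proba(["khaana"])[0][1]  # P(word is Hindi)
```

        <p>One such model is trained per language, and the per-model probabilities are then combined by the rule-based stage.</p>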
      </sec>
      <sec id="sec-4-5">
        <title>Identification of mixed words (MIX):</title>
        <p>Finally, a rule-based approach was adopted for identifying
mixed words in the utterances. If the two highest language probabilities
in the list generated by the previous stage are close to each other, the
word is classified as MIX. The thresholds were determined empirically by
trying different values and manually evaluating the output: a word is
tagged MIX when the probability gap is below 0.05 and the word length is
greater than 8.</p>
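        <p>The MIX rule can be sketched as follows; the function name is ours, but the 0.05 gap and the length-8 cutoff are the empirically determined thresholds described above:</p>

```python
# Sketch of the empirically tuned MIX rule: tag a token as MIX when the
# two highest language probabilities differ by less than 0.05 and the
# token is more than 8 characters long.
def is_mix(token, language_probs, threshold=0.05, min_length=8):
    if len(token) <= min_length or len(language_probs) < 2:
        return False
    top, second = sorted(language_probs.values(), reverse=True)[:2]
    return (top - second) < threshold
```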
        <p>If there is a match in stage 1 or 2 of the pipeline, the
token is immediately tagged and no further stages are applied to it.
Otherwise, the token passes through stages 3 and 4 above so that the
final tag can be determined.</p>
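        <p>Putting the stages together, the early-exit flow can be sketched as below; the stage predicates are simplified stand-ins for the actual stage implementations, not the authors' code:</p>

```python
# Self-contained sketch of the early-exit tagging pipeline described
# above. is_x and is_ne are simplified stand-ins for stages 1 and 2.
def tag_token(token, language_probs,
              is_x=lambda t: not t.isalpha(),
              is_ne=lambda t: False):
    if is_x(token):          # stage 1: punctuation/emoticon/abbrev. -> X
        return "X"
    if is_ne(token):         # stage 2: named-entity dictionary lookup
        return "NE"
    ranked = sorted(language_probs.items(), key=lambda kv: -kv[1])
    # stage 4: MIX when the top two languages are nearly tied (long words)
    if len(token) > 8 and len(ranked) > 1 and ranked[0][1] - ranked[1][1] < 0.05:
        return "MIX"
    return ranked[0][0]      # stage 3: most probable language label
```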
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4. EXPERIMENTS AND RESULTS</title>
      <p>
        We used the data given to us, which included labeled utterances
from social media and blogs, to build our training data set. We
submitted three runs, in all of which we used character 1- and 2-grams
as features. In run 2 we manually removed a few words from the named
entity list. In run 3, mixed word detection was enabled; it was disabled
in the other runs to keep false positives from lowering accuracy. Our
training data consisted of 41882 words across all languages, including
named entities. The training data was represented as a dense binary
model, i.e. 0 for features not present in a word and 1 for those that
are, with the feature vector containing 712 entries per word, one for
each possible character 1-gram and 2-gram. A separate model was built
for each language, trained on an equal number of words in the language
and words not in the language. We used the scikit-learn toolkit for
machine learning
[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. For language identification, we tried linear regression,
naïve Bayes and logistic regression classifiers.
      </p>
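      <p>The binary character 1-/2-gram representation described above can be sketched with scikit-learn as follows; the toy word list yields a much smaller vocabulary than the 712 entries reported:</p>

```python
# Sketch of the dense binary character 1-/2-gram features: each row is a
# word, each column a 0/1 flag for one observed character n-gram.
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer="char", ngram_range=(1, 2), binary=True)
X = vectorizer.fit_transform(["khana", "khaana", "food"])
dense = X.toarray()  # one row per word; 0/1 flags, one per n-gram
```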
      <p>We used an 80-20 split of the training data for cross-validation,
testing the performance of our system on the held-out portion. The
results for our individual classifiers, obtained using the evaluation
script provided, are shown in Table 1.</p>
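      <p>The 80-20 split can be reproduced with scikit-learn as below; the toy words and labels stand in for the 41882-token training set, and the random seed is our illustrative choice:</p>

```python
# Sketch of the 80-20 evaluation split, assuming scikit-learn.
from sklearn.model_selection import train_test_split

words  = ["khana", "food", "paani", "water", "achha",
          "good", "theek", "okay", "nahi", "not"]
labels = ["hi", "en", "hi", "en", "hi", "en", "hi", "en", "hi", "en"]

train_w, test_w, train_y, test_y = train_test_split(
    words, labels, test_size=0.2, random_state=0)
```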
      <p>The logistic regression classifier obtained an accuracy of
0.7653. These results, evaluated using the script provided, showed
clearly that the individual classifiers performed well. We decided to
use a linear kernel for logistic regression as it gave the highest
accuracy; we tried out different parameters and chose the configuration
most optimal for our training data.</p>
      <p>Our overall performance is shown in Table 3: our overall
weighted F-measure was 56.7%, and our standard deviation was close to a
10% error margin. In addition, there was a direct correlation between
precision and the training data sizes used. The number of words for the
different languages in the training data was 3509 (bn), 17392 (en), 744
(gu), 4237 (hi), 1520 (kn), 1126 (ml), 1868 (mr), 3116 (ta) and 5960
(te).</p>
      <p>As shown in Table 2, languages like English, for which the
training data size was larger, gave around 72% F-measure and 87% recall
with 61% precision, while Gujarati, which had very little training
data, gave 17% precision. We did better on the weighted F-measure
statistic because the languages with less training data were also the
ones least represented in the test data. As such, the weighted
evaluation of the language predictor gave us around 56% F-measure.</p>
      <p>Named entity recognition was done using a lookup-based method
that classifies a word in the test set as a named entity if it was
found in the training set. This was done because the training set of
named entities was too small to train a machine-learned named entity
recognizer. The results obtained supported this choice.</p>
      <p>It was observed that the language predictor developed with our
approach made inaccurate predictions on the test data where the
training data was small. The precisions of our individual classifiers
and the official results for English, Bengali, and Tamil support this
claim.</p>
    </sec>
    <sec id="sec-6">
      <title>5. CONCLUSION AND FUTURE WORK</title>
      <p>In this paper, we discussed an n-gram approach to identifying
the language of a word. Context cues could be used to identify the
language instead of relying only on character unigrams and bigrams: a
future work could implement a sequence-based classifier that labels a
word using the previous and next words. Instead of using only unigrams
and bigrams, the system could be improved to use {1, 2, 3, 4, 5}-grams
with different machine learning algorithms such as MaxEnt, naïve Bayes,
logistic regression, SVM, etc. Our named entity recognizer was prone to
errors due to insufficient data; similarly, the accuracy of our system
could be improved by training it on more data. However, X tokens were
identified with reasonable accuracy.</p>
      <p>Tagging of MIX words could also be improved by using better
thresholds.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name><given-names>Ben</given-names> <surname>King</surname></string-name>
          and
          <string-name><given-names>Steven P.</given-names> <surname>Abney</surname></string-name>
          .
          <article-title>Labeling the Languages of Words in Mixed-Language Documents using Weakly Supervised Methods</article-title>
          . HLT-NAACL,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Parth</given-names>
            <surname>Gupta</surname>
          </string-name>
          , Kalika Bali, Rafael E. Banchs, Monojit Choudhury, and
          <string-name>
            <given-names>Paolo</given-names>
            <surname>Rosso</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Query expansion for mixed-script information retrieval</article-title>
          .
          <source>In Proceedings of the 37th international ACM SIGIR conference on Research &amp; development in information retrieval (SIGIR '14)</source>
          . ACM, New York, NY, USA,
          <fpage>677</fpage>
          -
          <lpage>686</lpage>
          . DOI= http://dx.doi.org/10.1145/2600428.2609622
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Spandana</given-names>
            <surname>Gella</surname>
          </string-name>
          , Kalika Bali and
          <string-name>
            <given-names>Monojit</given-names>
            <surname>Choudhury</surname>
          </string-name>
          .
          <article-title>"Ye word kis lang ka hai bhai?" Testing the Limits of Word level Language Identification</article-title>
          . (To appear) In
          <source>Proceedings of the Eleventh International Conference on Natural Language Processing (ICON</source>
          <year>2014</year>
          ). Goa, India.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name><given-names>Olutobi</given-names> <surname>Owoputi</surname></string-name>
          , Brendan O'Connor, Chris Dyer, Kevin Gimpel, Nathan Schneider and
          <string-name><given-names>Noah A.</given-names> <surname>Smith</surname></string-name>
          .
          <article-title>Improved part-of-speech tagging for online conversational text with word clusters</article-title>
          . In Proceedings of NAACL-HLT,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <article-title>Scikit-learn: Machine Learning in Python</article-title>
          , Pedregosa et al.,
          <source>JMLR 12</source>
          , pp.
          <fpage>2825</fpage>
          -
          <lpage>2830</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>