<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
<article-title>DA-IICT in FIRE 2015 Shared Task on Mixed Script Information Retrieval</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Devanshu Jain</string-name>
          <email>devanshu.jain919@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Dhirubhai Ambani Institute of Information and Communication Technology Gandhinagar</institution>
          ,
          <addr-line>Gujarat</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <fpage>51</fpage>
      <lpage>54</lpage>
      <abstract>
<p>This paper describes the methodology followed by Team Watchdogs in their submission for the shared task on Mixed Script Information Retrieval (MSIR) at FIRE 2015. I participated in subtask 1 (Query Word Labelling) and subtask 2 (Mixed-script Ad hoc Retrieval). For subtask 1, a machine learning approach using a CRF classifier was used to label each token with one of the possible languages, based on n-gram and word2vec features. The method achieved a weighted F-measure of 0.805. For subtask 2, the DFR similarity measure was used on back-transliterated documents and queries (to Hindi, with vowel signs replaced by actual vowels). The technique resulted in an NDCG@10 score of 0.7160.</p>
      </abstract>
      <kwd-group>
<kwd>Information Retrieval</kwd>
        <kwd>Mixed-Script Data</kwd>
        <kwd>Natural Language Processing</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>1. INTRODUCTION</title>
<p>Subtask 1: Query Word Labelling aims to detect the language
of each token in a code-switched sentence. In addition to language
detection, the subtask also requires detecting named entities
(people, organisations, etc.), punctuations and mixed words (i.e.
words that belong to more than one language). The dataset,
provided by the organisers, consisted of a list of annotated
tweets. The distribution of all the labels in the dataset is
provided in table 1.</p>
<p>Subtask 2: Mixed Script Ad hoc Retrieval aims to retrieve
the documents containing information relevant to a query
given to the system. The caveat is that the query as well
as the documents can be in Hindi, in English or in both, so
retrieval needs to be done across scripts. The toy dataset,
provided for experimentation, consisted of 229 documents and 5
queries.</p>
<p>Sections 2 and 3 describe, in detail, the methodology followed
for subtasks 1 and 2 respectively. The tools used to tackle
these subtasks are also mentioned. Section 4 specifies
the results achieved by these methods.</p>
    </sec>
    <sec id="sec-2">
      <title>2. SUBTASK 1: QUERY WORD LABELLING</title>
    </sec>
    <sec id="sec-3">
      <title>2.1 Methodology</title>
<p>Before training, the following pre-processing was done on the
data. The MIX tokens (i.e. tokens derived from
two languages) were not labelled in a consistent manner: for
example, some words were labelled MIX_hi-en and some
were labelled MIX_en-hi. Such instances were relabelled
consistently.</p>
<p>The problem was modelled as a sequence tagging task, and a
CRF was used to tackle it. Two separate CRF models were
trained for this subtask: one to identify the language and
another to identify named entities.</p>
<p>For training the Language Identification model, the following
features were used:
1. Character and Word N-Grams: To capture context,
the individual tokens within a token-window of 3 on
each side of the word in consideration were included as
features. For example, for the token in consideration ye, the
features used are as in table 2. Furthermore, 2-, 3- and 4-character
n-grams of each of those words are also included as
features. So, for w[-1], i.e. maano, the generated features
are as in table 3.
2. Dictionary for Hindi, Bengali and Gujarati: A dataset
provided by IIT-Kharagpur, consisting of Hindi-English,
Bangla-English and Gujarati-English transliteration
pairs, was used to determine the language of a word
written in the Roman script, as shown in algorithm 1.
3. Word2Vec Tweet Clustering: A feature vector was
constructed for every word in the dataset using the
skip-gram implementation of word2vec with negative
sampling. These feature vectors were then clustered using the
kMeans algorithm into 9 clusters (because there were 9
languages). Every word in consideration was assigned a
cluster ID, and this was used as a feature for the
language detection model.</p>
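The context-window and character n-gram features above can be sketched as follows. This is a minimal illustration, not the system's actual feature templates; the example sentence is hypothetical, and only the tokens maano (w[-1]) and ye come from tables 2 and 3.

```python
# Sketch of the context-window and character n-gram features described
# above. The example sentence is hypothetical; only "maano" (w[-1])
# and the token in consideration "ye" come from the paper's tables.

def char_ngrams(word, sizes=(2, 3, 4)):
    """2-, 3- and 4-character n-grams of a word, as in table 3."""
    return [word[i:i + n] for n in sizes for i in range(len(word) - n + 1)]

def token_features(tokens, idx, window=3):
    """Words in a token-window of 3 on each side, plus their n-grams."""
    feats = {}
    for off in range(-window, window + 1):
        j = idx + off
        if 0 <= j < len(tokens):
            feats["w[%d]" % off] = tokens[j]
            feats["ng[%d]" % off] = char_ngrams(tokens[j])
    return feats

tokens = "tum maano ye sach hai".split()   # hypothetical sentence
feats = token_features(tokens, tokens.index("ye"))
```

Each token's feature dictionary would then be handed to the CRF trainer as one observation in the sequence.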
<p>The main hypothesis is that using the word2vec feature vectors’
clusters and the dictionary mentions as features should
improve the system’s performance.</p>
<p>For training the Named Entity Recognition model, the following
additional features were included apart from those
mentioned above:
1. isFullCapitalised (Boolean): whether the whole word is
capitalised.
2. isFirstCapitalised (Boolean): whether the first letter of
the word is capitalised.
3. numCapitalised (Integer): the number of capital letters
in the word.
4. isDot (Boolean): whether the dot (.) character is
present in the word.
5. numDot (Integer): the number of dot characters in the
word.
6. isDigit (Boolean): whether a digit is present in the
word.
7. numDigit (Integer): the number of digits in the word.
8. isSpecialChar (Boolean): whether any special character,
such as ( or -, is present in the word.
9. numSpecialChar (Integer): the number of special
characters in the word.</p>
      <p>Algorithm 1 Algorithm for labelling
1: procedure label(token, ld-model, ne-model)
2: ld-tag = getLabel(ld-model, token)
3: ne-tag = getLabel(ne-model, token)
4: final-tag = ne-tag
5: if final-tag = O then
6: final-tag = ld-tag
7: (dict-tag, dict-freq) = getTagWithMaxFreqFromDict()
8: if dict-tag ≠ O then
9: final-tag = dict-tag
10: end if
11: end if
12: end procedure</p>
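One possible implementation of these orthographic features is sketched below. This is an assumption about how such features could be computed; the system's exact definitions (e.g. what counts as a "special character") may differ.

```python
# Possible implementations of the orthographic NER features listed
# above (a sketch; the system's exact definitions may differ, e.g.
# in which characters count as "special").

def ner_features(word):
    return {
        "isFullCapitalised": word.isupper(),
        "isFirstCapitalised": word[:1].isupper(),
        "numCapitalised": sum(c.isupper() for c in word),
        "isDot": "." in word,
        "numDot": word.count("."),
        "isDigit": any(c.isdigit() for c in word),
        "numDigit": sum(c.isdigit() for c in word),
        "isSpecialChar": any(not c.isalnum() for c in word),
        "numSpecialChar": sum(not c.isalnum() for c in word),
    }
```

For an abbreviation like "U.S.A", for instance, both the capitalisation and the dot counts fire, which is exactly the cue the named-entity model relies on.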
<p>Capitalisation is often used for mentions of important named
entities. The dot character (.) is often used with abbreviations,
which in most cases refer to named entities. Digits
and special characters are helpful in detecting punctuations.
The procedure for labelling a token is explained in
algorithm 1. The constant O in line 5 is returned when the
classifier cannot identify any appropriate tag for the given
token. So, if the token is not a named entity, it is
tagged with one of the language tags using the corresponding
model. The method "getTagWithMaxFreqFromDict()" in line
7 determines the language which has the most occurrences of
the token.</p>
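Algorithm 1 can be sketched in Python as follows. The dict lookups are stand-ins for getLabel() on the trained CRF models, and the nested frequency dictionary stands in for the transliteration-pair dictionary; both names are illustrative.

```python
# A minimal Python sketch of Algorithm 1. The dict lookups stand in
# for getLabel() on the trained CRF models; the nested frequency dict
# stands in for the transliteration-pair dictionary.

O = "O"  # returned when a classifier finds no appropriate tag

def label(token, ld_model, ne_model, dictionary):
    ld_tag = ld_model.get(token, O)   # language-detection tag (line 2)
    ne_tag = ne_model.get(token, O)   # named-entity tag (line 3)
    final_tag = ne_tag
    if final_tag == O:                # not a named entity (line 5)
        final_tag = ld_tag
        # getTagWithMaxFreqFromDict(): the language with the most
        # occurrences of the token in the dictionary (line 7)
        freqs = dictionary.get(token, {O: 0})
        dict_tag = max(freqs, key=freqs.get)
        if dict_tag != O:
            final_tag = dict_tag      # dictionary overrides (line 9)
    return final_tag
```

Note that the named-entity tag always takes precedence; the dictionary is consulted only when the token is not a named entity.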
    </sec>
    <sec id="sec-4">
      <title>2.2 External Tools Used</title>
<p>The following tools were used for this subtask:
1. CRFSuite was used to train the language detection and
named-entity recognition models on the training
data and to tag the test files.
2. The Deeplearning4j Word2Vec API was used to obtain a
word2vec vector for each word of the training and
test files. The number of training iterations was set to 50
and the feature vector size of each word was set to 100.
3. The JavaML library’s implementation of the kMeans algorithm
was used for clustering the words’ feature vectors
obtained from the Deeplearning4j Word2Vec API.</p>
    </sec>
    <sec id="sec-5">
      <title>3. SUBTASK 2: MIXED SCRIPT AD-HOC RETRIEVAL</title>
    </sec>
    <sec id="sec-7">
      <title>3.1 Methodology</title>
<p>Before indexing, all the Roman-script words in the
documents as well as the queries were transliterated back to Devanagari
script. It has been observed that transliterating Devanagari
words to Roman produces more spelling variations than
transliterating from Roman to Devanagari. The
documents were then indexed in the following 4 ways:
1. Run 1: Texts were tokenised at white spaces. A
Hindi stemmer was then used to stem these tokens, to take
into account the multiple variations of a token. For
example, the token ख़रदार becomes खरदार after stemming.
2. Run 2: All white spaces and vowel signs were
removed from the texts. For example, the token
बॉलीवुड becomes बलवड after removing all vowel signs.
Character-level n-grams were then
created for the texts, with n ranging from 2 to 6.
3. Run 3: All white spaces were removed from the
texts and vowel signs were replaced by actual vowels.
For example, the token बॉलीवुड, after replacing
all vowel signs with actual vowels, becomes
बऑलईवउड. Character-level n-grams were then created
for the documents, with n ranging from 2 to 6.
4. Run 4: Texts were tokenised at white spaces and
stemmed with a Hindi stemmer.
Furthermore, word-level n-grams (called shingles in Lucene
vocabulary) were created for the documents, with n
ranging from 2 to 6.</p>
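The normalisation in Runs 2 and 3 can be sketched as below. The vowel-sign table covers only a handful of Devanagari matras and is an assumption for illustration, not the full mapping used by the system.

```python
# Illustrative sketch of the text normalisation in Runs 2 and 3. The
# vowel-sign table covers only a few Devanagari matras and is an
# assumption, not the full mapping used in the system.

SIGN_TO_VOWEL = {
    "\u093e": "\u0906",  # sign AA -> letter AA
    "\u093f": "\u0907",  # sign I  -> letter I
    "\u0940": "\u0908",  # sign II -> letter II
    "\u0941": "\u0909",  # sign U  -> letter U
    "\u0947": "\u090f",  # sign E  -> letter E
    "\u094b": "\u0913",  # sign O  -> letter O
}

def remove_signs(text):
    """Run 2: drop white space and vowel signs entirely."""
    return "".join(c for c in text
                   if not c.isspace() and c not in SIGN_TO_VOWEL)

def replace_signs(text):
    """Run 3: drop white space, replace each vowel sign with its vowel."""
    return "".join(SIGN_TO_VOWEL.get(c, c)
                   for c in text if not c.isspace())

def char_ngrams(text, lo=2, hi=6):
    """Character-level n-grams with n ranging from 2 to 6."""
    return [text[i:i + n] for n in range(lo, hi + 1)
            for i in range(len(text) - n + 1)]
```

The resulting n-gram strings would then be indexed as terms in place of whole words.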
<p>Further, the DFR similarity measure was used to find the most
relevant documents for a particular query. Within DFR, the
following settings were used:
1. The limiting form of the Bose-Einstein model as the basic
model of information content.
2. The Laplace law of succession as the first normalisation.
3. Dirichlet priors as the second normalisation.</p>
<p>The main hypotheses are:
1. Indexing character-level n-grams of texts should produce
better results than word-level n-grams. The
main reason is that character-level n-grams
capture much more granular information and
are hence able to account for minor spelling variations
more effectively.
2. Indexing using word-level n-grams should produce better
results than indexing individual words.
3. A system that replaces vowel signs with actual vowels
should perform better than one that simply removes them,
since replacement prevents the loss of information that
removal causes. The loss can result in ambiguity. For
example, when vowel signs are removed, दुखी and देखो
both result in दख. However, when vowel signs are
replaced by vowels, they result in different words,
दउखई and दएखओ respectively.</p>
    </sec>
    <sec id="sec-8">
      <title>3.2 External Tools Used</title>
<p>The following tools were used for this subtask:
1. Google Transliterator was used to transliterate the
documents and queries back to Hindi.
2. Apache Lucene was used to index the documents and to
search for the documents relevant to the queries.</p>
    </sec>
    <sec id="sec-9">
      <title>4. RESULTS AND DISCUSSION</title>
    </sec>
    <sec id="sec-10">
      <title>4.1 Subtask 1</title>
<p>Three runs were submitted for the subtask. The methods
deployed in each run are described in table 4.</p>
<p>The overall results achieved by the aforementioned
methods are described in table 5.
I had hypothesised that the use of dictionary and word2vec
features would improve the system’s
performance. Although the use of word2vec features resulted
in an appreciable improvement in the system’s performance (almost
8% improvement in accuracy), it was surprising to see that
using the dictionary to determine the tag actually decreased
the system’s performance. The main reason is that
transliteration pairs were available for only 3 languages: Hindi,
Gujarati and Bangla. The remaining 6 languages had no
dictionary coverage, which may have caused the poor results.
A more granular specification of the results (for language
identification only) is given in table 6.</p>
<p>The system performed poorly at identifying Gujarati
words. One reason is the lack of sufficient
mentions of Gujarati words in the training dataset. One
interesting observation was that many common Gujarati words
like maru, karwu, pachi, etc. were tagged as Hindi
words. The high resemblance between the Hindi and Gujarati
languages exacerbated the uneven distribution of labels in the dataset.
Why the results for Telugu words were not as good, in spite of
a sufficiently large number of mentions, remains unknown.</p>
    </sec>
    <sec id="sec-11">
      <title>4.2 Subtask 2</title>
<p>Four runs were submitted: one for each of the ways of
indexing the documents described in section 3.1. Table 7
specifies the overall results achieved by the methods. Table
8 specifies more specific results for the case of cross-script
retrieval.</p>
<p>One of the objectives of the experiment was to determine which
indexing technique produces better results: word- or character-level
n-grams. As can be observed in the tables,
character-level n-grams outperformed word-level
n-grams.</p>
<p>The run where vowel signs were replaced by actual vowels
performed much better than the one where they were completely
removed, which supports our hypothesis as stated earlier.
The hypothesis that indexing word n-grams would produce
better results than indexing individual stemmed words was
proven wrong by the experimental results; the reason for
this is still not clear.</p>
    </sec>
    <sec id="sec-12">
      <title>5. FUTURE WORK</title>
<p>Currently, the system does not handle mixed words (i.e.
words formed by the fusion of multiple languages), and an effective
algorithm needs to be devised to do so. A word2vec model of
every language could be created separately, as
a list of feature vectors, one per word of that language. The
similarity of a word’s feature vector to such a model could then
be used for this purpose. This similarity can be calculated by averaging
the hamming distance of the feature vector to every vector in the
model of that particular language. The same approach could also
be used for language identification.</p>
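The proposed per-language similarity check could be sketched as follows. Since word2vec vectors are real-valued, Euclidean distance is used here as a stand-in for the hamming distance mentioned above; that substitution, along with all names in the sketch, is an assumption.

```python
# A rough sketch of the proposed mixed-word / language check: average
# the distance from a word's feature vector to every vector in each
# language's model, and pick the closest language. Euclidean distance
# is used as a stand-in for the hamming distance proposed in the text,
# since word2vec vectors are real-valued (an assumption).
import math

def avg_distance(vec, model):
    """Average distance from vec to every vector in a language model."""
    return sum(math.dist(vec, v) for v in model.values()) / len(model)

def closest_language(vec, models):
    """Language whose word vectors are, on average, closest to vec."""
    return min(models, key=lambda lang: avg_distance(vec, models[lang]))
```

A mixed word would show comparably small average distances to more than one language model, which is what the fusion check would look for.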
<p>Graph-based n-gram language identification for short texts
has been used in prior work to identify languages in
code-switched data [9]. The method was tried early in the development
of the system, but it produced poor results when validated using
10-fold cross validation. The reason for this still needs to be found.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <article-title>Transliteration Pairs for Hindi-English, Bangla-English and Gujarati-English</article-title>
          . http://cse.iitkgp.ac.in/resgrp/cnerg/qa/fire13translit/index.html
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name><given-names>J.</given-names> <surname>Lafferty</surname></string-name>
          ,
          <string-name><given-names>A.</given-names> <surname>McCallum</surname></string-name>
          and
          <string-name><given-names>F. C. N.</given-names> <surname>Pereira</surname></string-name>
          .
          <article-title>Conditional random fields: Probabilistic models for segmenting and labeling sequence data</article-title>
          .
          <source>In Proceedings of the Eighteenth International Conference on Machine Learning.</source>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Parth</given-names>
            <surname>Gupta</surname>
          </string-name>
          et al.
          <article-title>Query Expansion for Mixed-script Information Retrieval</article-title>
          ,
          <source>in Proceedings of SIGIR</source>
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Monojit</given-names>
            <surname>Choudhury</surname>
          </string-name>
          et. al.
          <source>Overview of FIRE 2014 Track on Transliterated Search.</source>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <article-title>CRFsuite: a fast implementation of conditional random fields (CRFs)</article-title>
          . http://www.chokkan.org/software/crfsuite/
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6] Google Transliterator. https://developers.google.com/transliterate/v1/getting_started
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7] Apache Lucene. https://lucene.apache.org/
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8] Deeplearning4j's implementation of Word2vec. http://deeplearning4j.org/word2vec.html
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name><given-names>E.</given-names> <surname>Tromp</surname></string-name>
          and
          <string-name><given-names>M.</given-names> <surname>Pechenizkiy</surname></string-name>
          .
          <article-title>Graph-Based N-gram Language Identification on Short Texts</article-title>
          .
          <source>Proceedings of the 20th Machine Learning conference of Belgium and The Netherlands</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>