          DA-IICT in FIRE 2015 Shared Task on Mixed Script
                        Information Retrieval

                                                         Devanshu Jain
                        Dhirubhai Ambani Institute of Information and Communication Technology
                                              Gandhinagar, Gujarat, India
                                              devanshu.jain919@gmail.com


ABSTRACT
This paper describes the methodology followed by Team Watchdogs in their submission to the shared task on Mixed Script Information Retrieval (MSIR) at FIRE 2015. I participated in Subtask 1 (Query Word Labelling) and Subtask 2 (Mixed-Script Ad hoc Retrieval). For Subtask 1, a machine-learning approach using a CRF classifier was employed to label each token with one of the possible languages, using n-gram and word2vec features. The method achieved a weighted F-measure of 0.805. For Subtask 2, the DFR similarity measure was applied to documents and queries back-transliterated to Hindi, with vowel signs replaced by actual vowels. The technique achieved an NDCG@10 score of 0.7160.

Categories and Subject Descriptors
H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing; H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval

Keywords
Information Retrieval, Mixed-Script Data, Natural Language Processing
1. INTRODUCTION
With the Internet becoming increasingly accessible, a linguistically diverse population has come online. It has been observed that such non-English populations usually use their own languages written in Roman script ('transliteration') to generate web content such as tweets and blogs. Moreover, these users switch back and forth between languages mid-sentence, a behaviour termed 'code switching'. This shared task aims to develop methods to retrieve content across scripts.

Subtask 1: Query Word Labelling aims to detect the language of each token in a code-switched sentence. In addition to language detection, the subtask also requires detecting named entities (people, organisations, etc.), punctuation and mixed words (i.e. words that belong to more than one language). The dataset, provided by the organisers, consisted of a list of annotated tweets. The distribution of the labels in the dataset is given in Table 1.

        Table 1: Frequency of Label Tags in Training Data

            Label   Frequency   Description of Tag
            NE      2203        Named Entities
            MIX     148         Mix of 2 languages
            hi      4453        Hindi
            en      17938       English
            kn      1623        Kannada
            ta      3153        Tamil
            te      6475        Telugu
            gu      890         Gujarati
            mr      1960        Marathi
            bn      3545        Bengali
            ml      1160        Malayalam
            O       8           Words of a Foreign Language
            X       7436        Punctuation, Numbers, Emoticons, etc.

Subtask 2: Mixed-Script Ad hoc Retrieval aims to retrieve the documents containing information relevant to the query given to the system. The caveat is that the queries as well as the documents can be in Hindi, in English, or in both, so retrieval needs to be done across scripts. The toy dataset provided for the experiment consisted of 229 documents and 5 queries.

Sections 2 and 3 describe in detail the methodology followed for Subtasks 1 and 2 respectively, and mention the tools used to tackle them. Section 4 presents the results achieved by these methods.

2. SUBTASK 1: QUERY WORD LABELLING

2.1 Methodology
Before training, the following pre-processing was done on the data. The MIX tokens (i.e. tokens derived from two languages) were not labelled in a consistent manner: some words were labelled MIX_hi-en while others were labelled MIX_en-hi. Such instances were relabelled in a consistent way.

The problem was identified as sequence tagging, and a CRF was used to tackle it. Two separate CRF models were trained for this subtask: one to identify the language and another to identify named entities.

For training the Language Identification model, the following features were used:

  1. Character and Word N-Grams: To include context features, the individual tokens within a window of 3 tokens on each side of the word in consideration were included. For example, if the sentence is:
         admin ke mat maano ye birth se ab tak single h

     and the token in consideration is ye, then the features used are as in Table 2. Furthermore, 2-, 3- and 4-character n-grams of each of those words are also included as features. So, for w[-1], i.e. maano, the generated features are as in Table 3. (A code sketch of this feature generation follows the list.)

        Table 2: Word N-Gram Features around the word ye

            Feature   Value
            w[-3]     ke
            w[-2]     mat
            w[-1]     maano
            w[0]      ye
            w[1]      birth
            w[2]      se
            w[3]      ab

        Table 3: Character N-Gram Features for the word maano

            Feature         Value
            2_gram[-1][0]   ma
            2_gram[-1][1]   aa
            2_gram[-1][2]   an
            2_gram[-1][3]   no
            3_gram[-1][0]   maa
            3_gram[-1][1]   aan
            3_gram[-1][2]   ano
            4_gram[-1][0]   maan
            4_gram[-1][1]   aano

  2. Dictionary for Hindi, Bengali and Gujarati: A dataset provided by IIT Kharagpur, consisting of Hindi-English, Bangla-English and Gujarati-English transliteration pairs, was used to determine the language of a word written in Roman script, as shown in Algorithm 1.

  3. Word2Vec Tweet Clustering: A feature vector was constructed for every word in the dataset using the skip-gram implementation of word2vec with negative sampling. These feature vectors were then clustered into 9 clusters (one per language) using the kMeans algorithm. Every word was assigned a cluster ID, which was used as a feature in the language detection model. (A clustering sketch is given after the hypothesis below.)
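
To make the n-gram feature concrete, here is a minimal sketch of the window and character n-gram feature generation in the "name=value" attribute style accepted by CRFSuite. The class and helper names are illustrative, not the exact feature template that was submitted.

    import java.util.ArrayList;
    import java.util.List;

    public class NGramFeatures {

        // Context-window features w[-3] .. w[3] around position i.
        static List<String> windowFeatures(String[] tokens, int i) {
            List<String> feats = new ArrayList<>();
            for (int off = -3; off <= 3; off++) {
                int j = i + off;
                if (j >= 0 && j < tokens.length) {
                    feats.add("w[" + off + "]=" + tokens[j]);
                }
            }
            return feats;
        }

        // Character n-grams (n = 2..4) of one word, e.g. 2_gram[-1][0]=ma.
        static List<String> charNGrams(String word, int offset) {
            List<String> feats = new ArrayList<>();
            for (int n = 2; n <= 4; n++) {
                for (int k = 0; k + n <= word.length(); k++) {
                    feats.add(n + "_gram[" + offset + "][" + k + "]="
                              + word.substring(k, k + n));
                }
            }
            return feats;
        }

        public static void main(String[] args) {
            String[] sent = "admin ke mat maano ye birth se ab tak single h".split(" ");
            int i = 4; // the token "ye"
            List<String> feats = windowFeatures(sent, i);
            feats.addAll(charNGrams(sent[i - 1], -1)); // n-grams of w[-1] = "maano"
            System.out.println(String.join("\t", feats)); // one CRFSuite attribute row
        }
    }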

The main hypothesis is that using the word2vec feature vectors' cluster IDs and the dictionary mentions as features should improve the system's performance.
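
The clustering feature can be reproduced along the lines below, using the Deeplearning4j Word2Vec API and the JavaML kMeans implementation mentioned in Section 2.2. This is a sketch: exact builder methods vary across Deeplearning4j versions, and the corpus path and negative-sampling rate are assumptions.

    import java.io.File;
    import net.sf.javaml.clustering.KMeans;
    import net.sf.javaml.core.Dataset;
    import net.sf.javaml.core.DefaultDataset;
    import net.sf.javaml.core.DenseInstance;
    import org.deeplearning4j.models.word2vec.Word2Vec;
    import org.deeplearning4j.text.sentenceiterator.LineSentenceIterator;
    import org.deeplearning4j.text.sentenceiterator.SentenceIterator;
    import org.deeplearning4j.text.tokenization.tokenizerfactory.DefaultTokenizerFactory;

    public class Word2VecClusters {
        public static void main(String[] args) throws Exception {
            // Train skip-gram word2vec on the tweet corpus (path is a placeholder).
            SentenceIterator iter = new LineSentenceIterator(new File("tweets.txt"));
            Word2Vec vec = new Word2Vec.Builder()
                    .iterations(50)     // 50 training iterations, as in the paper
                    .layerSize(100)     // 100-dimensional vectors, as in the paper
                    .negativeSample(5)  // negative sampling; the rate is an assumption
                    .iterate(iter)
                    .tokenizerFactory(new DefaultTokenizerFactory())
                    .build();
            vec.fit();

            // Cluster all word vectors into 9 clusters (one per language).
            Dataset data = new DefaultDataset();
            for (String word : vec.getVocab().words()) {
                data.add(new DenseInstance(vec.getWordVector(word), word));
            }
            Dataset[] clusters = new KMeans(9).cluster(data);
            // The index of the cluster containing a word becomes its cluster-ID feature.
        }
    }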

For training the Named Entity Recognition model, the following additional features were included apart from those mentioned above (a combined sketch follows the list):

  1. isFullCapitalised (Boolean): whether the whole word is capitalised.

  2. isFirstCapitalised (Boolean): whether the first letter of the word is capitalised.

  3. numCapitalised (Integer): the number of capital letters in the word.

  4. isDot (Boolean): whether the dot (.) character is present in the word.

  5. numDot (Integer): the number of dot characters in the word.

  6. isDigit (Boolean): whether a digit is present in the word.

  7. numDigit (Integer): the number of digits in the word.

  8. isSpecialChar (Boolean): whether any special character, such as ( or -, is present in the word.

  9. numSpecialChar (Integer): the number of special characters in the word.

Capitalisation is often used when mentioning important named entities. The dot character (.) is often used in abbreviations, which in most cases refer to named entities. Digits and special characters are helpful in detecting punctuation tokens.
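
These orthographic features are straightforward to compute. The sketch below gathers them for a single word; "special character" is interpreted here as any non-alphanumeric character, an assumption the paper does not pin down.

    import java.util.LinkedHashMap;
    import java.util.Map;

    public class SurfaceFeatures {
        static Map<String, Object> extract(String w) {
            Map<String, Object> f = new LinkedHashMap<>();
            f.put("isFullCapitalised", !w.isEmpty() && w.equals(w.toUpperCase()));
            f.put("isFirstCapitalised", !w.isEmpty() && Character.isUpperCase(w.charAt(0)));
            f.put("numCapitalised", w.chars().filter(Character::isUpperCase).count());
            f.put("isDot", w.indexOf('.') >= 0);
            f.put("numDot", w.chars().filter(c -> c == '.').count());
            f.put("isDigit", w.chars().anyMatch(Character::isDigit));
            f.put("numDigit", w.chars().filter(Character::isDigit).count());
            f.put("isSpecialChar", w.chars().anyMatch(c -> !Character.isLetterOrDigit(c)));
            f.put("numSpecialChar", w.chars().filter(c -> !Character.isLetterOrDigit(c)).count());
            return f;
        }

        public static void main(String[] args) {
            // Abbreviations such as "U.S.A." score high on the capitalisation and dot features.
            System.out.println(extract("U.S.A."));
        }
    }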
Algorithm 1 Algorithm for labelling

 1: procedure label(token, ld-model, ne-model)
 2:     ld-tag = getLabel(ld-model, token)
 3:     ne-tag = getLabel(ne-model, token)
 4:     final-tag = ne-tag
 5:     if final-tag = O then
 6:         final-tag = ld-tag
 7:         (dict-tag, dict-freq) = getTagWithMaxFreqFromDict(token)
 8:         if dict-tag ≠ O then
 9:             final-tag = dict-tag
10:         end if
11:     end if
12: end procedure

The procedure for labelling a token is explained in Algorithm 1. The constant O in line 5 is returned when the classifier cannot identify any appropriate tag for the given token. So, if the token is not a named entity, it is tagged with one of the language tags using the corresponding model. The method getTagWithMaxFreqFromDict() in line 7 determines the language with the most occurrences of the token in the dictionary; a Java rendering of the procedure is sketched below.
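
A runnable rendering of Algorithm 1 might look as follows. Plain token-to-tag maps stand in for the two CRF models here (in the actual system the tags came from the trained CRFSuite taggers), and the dictionary holds per-language token frequencies from the IIT Kharagpur transliteration pairs.

    import java.util.Map;

    public class Labeller {

        static String label(String token,
                            Map<String, String> ldModel,
                            Map<String, String> neModel,
                            Map<String, Map<String, Integer>> dict) {
            String ldTag = ldModel.getOrDefault(token, "O");
            String neTag = neModel.getOrDefault(token, "O");
            String finalTag = neTag;
            if (finalTag.equals("O")) {      // not a named entity
                finalTag = ldTag;            // fall back to the CRF language tag
                String dictTag = tagWithMaxFreq(token, dict);
                if (!dictTag.equals("O")) {  // dictionary evidence overrides
                    finalTag = dictTag;
                }
            }
            return finalTag;
        }

        // The language with the most occurrences of the token in the dictionary.
        static String tagWithMaxFreq(String token, Map<String, Map<String, Integer>> dict) {
            String best = "O";
            int bestFreq = 0;
            for (Map.Entry<String, Integer> e
                    : dict.getOrDefault(token, Map.of()).entrySet()) {
                if (e.getValue() > bestFreq) {
                    bestFreq = e.getValue();
                    best = e.getKey();
                }
            }
            return best;
        }
    }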


2.2 External Tools Used
The following tools were used for this subtask:

  1. CRFSuite was used to train the language detection and named-entity recognition models on the training data and to tag the test files.

  2. Deeplearning4j's Word2Vec API was used to obtain the word2vec vector for each word of the training and test files. The number of training iterations was set to 50 and the feature vector size of each word was set to 100.

  3. The JavaML library's implementation of the kMeans algorithm was used to cluster the word feature vectors obtained from the Deeplearning4j Word2Vec API.
3. SUBTASK 2: MIXED SCRIPT AD-HOC RETRIEVAL

3.1 Methodology
Before indexing, all the Roman-script words in the documents as well as the queries were transliterated back to Devanagari script. It has been observed that transliterating Devanagari words to Roman script produces more spelling variations than transliterating from Roman to Devanagari. The documents were then indexed in the following 4 ways (a pre-processing sketch for Runs 2 and 3 follows the list):

  1. Run 1: Texts were tokenised at white space. A Hindi stemmer was then used to stem the tokens, to account for the multiple variations of a token. For example, the token ख़रीदार becomes खरीदार after stemming.

  2. Run 2: All white space and vowel signs were removed from the texts. For example, the token बॉलीवुड becomes बलवड after removing the vowel signs. Character-level n-grams with n ranging from 2 to 6 were then created over the texts.

  3. Run 3: All white space was removed from the texts and vowel signs were replaced by the actual vowels. For example, the token बॉलीवुड becomes बऑलईवउड after replacing the vowel signs with vowels. Character-level n-grams with n ranging from 2 to 6 were then created over the texts.

  4. Run 4: Texts were tokenised at white space and a Hindi stemmer was used to stem the tokens. Word-level n-grams (called shingles in Lucene vocabulary) with n ranging from 2 to 6 were then created over the documents.
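
The Run 2 and Run 3 pre-processing reduces to a character-level mapping over the Devanagari dependent vowel signs (U+093E to U+094C). A minimal sketch, with the sign-to-vowel table spelled out explicitly:

    import java.util.ArrayList;
    import java.util.List;

    public class VowelSigns {
        // Dependent vowel signs and the corresponding independent vowels.
        private static final String SIGNS  = "\u093E\u093F\u0940\u0941\u0942\u0943\u0947\u0948\u0949\u094B\u094C";
        private static final String VOWELS = "\u0906\u0907\u0908\u0909\u090A\u090B\u090F\u0910\u0911\u0913\u0914";

        // Run 2: drop every vowel sign, e.g. बॉलीवुड -> बलवड.
        static String removeSigns(String s) {
            StringBuilder out = new StringBuilder();
            for (char c : s.toCharArray()) {
                if (SIGNS.indexOf(c) < 0) out.append(c);
            }
            return out.toString();
        }

        // Run 3: replace each vowel sign with the matching full vowel,
        // e.g. बॉलीवुड -> बऑलईवउड.
        static String replaceSigns(String s) {
            StringBuilder out = new StringBuilder();
            for (char c : s.toCharArray()) {
                int i = SIGNS.indexOf(c);
                out.append(i < 0 ? c : VOWELS.charAt(i));
            }
            return out.toString();
        }

        // Character n-grams (n = 2..6) over the whitespace-free text.
        static List<String> charNGrams(String s) {
            List<String> grams = new ArrayList<>();
            for (int n = 2; n <= 6; n++)
                for (int i = 0; i + n <= s.length(); i++)
                    grams.add(s.substring(i, i + n));
            return grams;
        }

        public static void main(String[] args) {
            String word = "बॉलीवुड";
            System.out.println(removeSigns(word));            // बलवड
            System.out.println(replaceSigns(word));           // बऑलईवउड
            System.out.println(charNGrams(removeSigns(word)));
        }
    }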
Further, the DFR similarity measure was used to find the most relevant documents for a particular query. Within DFR, the following settings were used (a configuration sketch follows the list):

  1. The limiting form of the Bose-Einstein model as the basic model of information content.

  2. The Laplace law of succession as the first normalisation.

  3. Dirichlet priors as the second normalisation.
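
These three settings correspond one-to-one to Lucene's DFR similarity components, so the configuration can be expressed as below (class names as in the Lucene 4.x similarities package; wiring into the index writer and searcher is elided):

    import org.apache.lucene.search.similarities.AfterEffectL;
    import org.apache.lucene.search.similarities.BasicModelBE;
    import org.apache.lucene.search.similarities.DFRSimilarity;
    import org.apache.lucene.search.similarities.NormalizationH3;
    import org.apache.lucene.search.similarities.Similarity;

    public class DfrConfig {
        // BasicModelBE:    limiting form of the Bose-Einstein model
        // AfterEffectL:    Laplace law of succession (first normalisation)
        // NormalizationH3: Dirichlet priors (second normalisation)
        public static Similarity dfrSimilarity() {
            return new DFRSimilarity(new BasicModelBE(),
                                     new AfterEffectL(),
                                     new NormalizationH3());
        }
        // The same Similarity must be set on the IndexWriterConfig at
        // index time and on the IndexSearcher at query time.
    }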
The main hypotheses are:

  1. Indexing character-level n-grams of the texts should produce better results than word-level n-grams. The main reason is that character-level n-grams capture more granular information and hence account for minor spelling variations more effectively.

  2. Indexing word-level n-grams should produce better results than indexing individual words.

  3. A system that replaces vowel signs with actual vowels should perform better than one that simply removes them, since replacement prevents the loss of information that removal incurs. The loss can introduce ambiguity: for example, when vowel signs are removed, दुखी and देखो both become दख, whereas when vowel signs are replaced by vowels they yield the distinct words दउखई and दएखओ, respectively.

3.2 External Tools Used
The following tools were used for this subtask:

  1. Google Transliterator was used to transliterate the documents and queries back to Hindi.

  2. Apache Lucene was used to index the documents and to search for the documents relevant to the queries.

4. RESULTS AND DISCUSSION

4.1 Subtask 1
Three runs were submitted for the subtask. The features deployed in each run are described in Table 4.

        Table 4: Subtask 1 Runs

            Run     Vocabulary   Word2Vec Clustering   Dictionary
                    Feature      Feature               Feature
            Run 1   ✓            ✓                     ✓
            Run 2   ✓            ✓                     ×
            Run 3   ✓            ×                     ×

The overall results achieved by the aforementioned methods are described in Table 5.

        Table 5: Subtask 1 Results

            Measure                Run 1   Run 2   Run 3
            Tokens Accuracy        0.689   0.817   0.756
            Average F-measure*     0.575   0.622   0.524
            Weighted F-measure**   0.701   0.804   0.734

*: Calculated as the average of the F-measures of all the valid tags in the test set.
**: Calculated as the weighted average (weighted by tag frequency) of the F-measures of all the valid tags in the test set.

I had hypothesised that the use of the dictionary and word2vec features would improve the system's performance. While the word2vec features did yield an appreciable improvement (almost 8% in accuracy), it was surprising to see that using the dictionary to determine the tag actually decreased the system's performance. The main reason is that transliteration pairs were available for only 3 languages: Hindi, Gujarati and Bangla. The remaining 6 languages had no dictionary coverage, which may have caused the poor results.

A more granular breakdown of the results (for language identification only) is given in Table 6.

The system performed poorly at identifying Gujarati words. One reason is the lack of sufficient mentions of Gujarati words in the training dataset. One interesting observation was that many common Gujarati words
like maru, karwu and pachi were tagged as Hindi words. The high resemblance between Hindi and Gujarati exacerbated the effect of the uneven distribution of labels in the dataset.

        Table 6: Subtask 1 Strict F-measures for Language Identification

            Language    Run 1    Run 2    Run 3
            Bengali     0.7613   0.8525   0.7205
            English     0.6984   0.8511   0.8403
            Gujarati    0.1582   0        0
            Hindi       0.5522   0.8131   0.6995
            Kannada     0.7324   0.7483   0.594
            Malayalam   0.6287   0.6219   0.4644
            Marathi     0.7074   0.8308   0.6354
            Tamil       0.8249   0.8639   0.7346
            Telugu      0.4603   0.5083   0.2418

Why the results for Telugu words were not as good, despite a sufficiently large number of mentions, remains unknown.
4.2 Subtask 2
Four runs were submitted: one for each of the ways of indexing the documents described in Section 3.1. Table 7 gives the overall results achieved by the methods, and Table 8 gives the specific results for the cross-script retrieval case.

        Table 7: Subtask 2 Overall Results

            Measure   Run 1    Run 2    Run 3    Run 4
            NDCG@1    0.6700   0.5267   0.6967   0.5633
            NDCG@5    0.5922   0.5424   0.6991   0.5124
            NDCG@10   0.6057   0.5631   0.7160   0.5173
            MAP       0.3173   0.2922   0.3814   0.2360
            MRR       0.4964   0.3790   0.5613   0.3944
            Recall    0.3962   0.4435   0.4921   0.2932

        Table 8: Subtask 2 Cross-Script Results

            Measure   Run 1    Run 2    Run 3    Run 4
            NDCG@1    0.4233   0.1833   0.3333   0.2900
            NDCG@5    0.3264   0.2681   0.3864   0.2684
            NDCG@10   0.3721   0.3315   0.4358   0.2997
            MAP       0.2804   0.2168   0.3060   0.2047
            MRR       0.4164   0.2757   0.4233   0.3244
            Recall    0.3774   0.4356   0.5058   0.2914

One of the objectives of the experiment was to determine which indexing technique produces better results: word- or character-level n-grams. As can be observed in the tables, character-level n-grams outperformed word-level n-grams.

The run in which vowel signs were replaced by actual vowels performed much better than the one in which they were simply removed, which supports our hypothesis as stated earlier.

The hypothesis that indexing word n-grams would produce better results than indexing individual stemmed words was proven wrong by the experiments' results. The reason for this is still not clear.


5. FUTURE WORK
Currently, the system does not handle mixed words (i.e. words formed by the fusion of multiple languages), and an effective algorithm needs to be devised to do so. A word2vec model of every language could be created separately, where each model is a list of the feature vectors of the words of that language. The similarity of a word's feature vector to a model, calculated by averaging the Hamming distance of the feature vector to every vector in the model of that particular language, could then be used for this purpose. It could also be used for language identification.

Graph-based n-gram language identification for short texts has been used by others to identify the language in code-switched data. The method was tried early in the development of the system, but it produced poor results when validated using 10-fold cross-validation. The reason for this still needs to be determined.

6. REFERENCES
[1] Transliteration pairs for Hindi-English, Bangla-English and Gujarati-English.
    http://cse.iitkgp.ac.in/resgrp/cnerg/qa/fire13translit/index.html
[2] J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning.
[3] P. Gupta et al. Query expansion for mixed-script information retrieval. In Proceedings of SIGIR 2014.
[4] M. Choudhury et al. Overview of FIRE 2014 track on transliterated search.
[5] CRFsuite: a fast implementation of conditional random fields (CRFs).
    http://www.chokkan.org/software/crfsuite/
[6] Google Transliterator.
    https://developers.google.com/transliterate/v1/getting_started
[7] Apache Lucene. https://lucene.apache.org/
[8] Deeplearning4j's implementation of word2vec.
    http://deeplearning4j.org/word2vec.html
[9] E. Tromp and M. Pechenizkiy. Graph-based n-gram language identification on short texts. In Proceedings of the 20th Machine Learning Conference of Belgium and The Netherlands, 2011.