ISM@FIRE-2015: Mixed Script Information Retrieval

Dinesh Kumar Prabhakar
Indian School of Mines, Dhanbad, Jharkhand, India 826004
dinesh.nitr@gmail.com

Sukomal Pal
Indian School of Mines, Dhanbad, Jharkhand, India 826004
sukomalpal@gmail.com

ABSTRACT
This paper describes our approaches at FIRE-2015 for identifying the languages of terms written in Roman script and for retrieval in a mixed script setting. The first approach identifies the class of each given term: its native language, or whether it is a named entity or a term of some other type. MaxEnt, a supervised classifier, was used for this classification; it performed best on strict f-measure for NE (0.46) and for NE_P (0.24). For the MSIR subtask, an approach based on Divergence from Randomness (DFR) was used, and it performed better with block indexing and query formulation. The overall NDCG@10 scores of our submissions are 0.4335, 0.5328, 0.4489 and 0.5369 for ISMD1, ISMD2, ISMD3 and ISMD4 respectively.

Keywords
Word classification, Transliteration, Information Retrieval

1. INTRODUCTION
With the growth of Web 2.0, the number of users on social sites keeps increasing. Users write messages (especially blog posts) on sites such as Twitter and Facebook in their own languages, often using Roman script in transliterated form. These posts may contain non-English terms (terms from the user's native language), plain English words, mixed-language terms (like gr8, 2moro) or named entities (NE). Identifying such categories plays a significant role in Natural Language Processing (NLP), and is useful not only in NLP but also in other sub-domains of linguistic processing and in Information Retrieval (IR).

Blog posts contain important information that opens up the scope of IR over informal text (posts and messages). Raw blog data often contain erroneous text, so before applying any IR steps the data must be preprocessed using linguistic processing techniques. There are huge collections of data on and off the Web serving various information needs; this track concerns ad-hoc retrieval. The retrieval collection contains documents written in two scripts: Roman (transliterated Hindi terms in Roman script) and Devanagari. In the corpus, some documents are in Devanagari, some are in Roman, and the rest carry the same information in mixed form (transliterated and native script one after another). To maximize the number of relevant documents retrieved from the Web (in web retrieval) or from the corpus (in ad-hoc retrieval), it is necessary to retrieve documents in other languages and/or scripts. It is useful here to distinguish three terms: monolingual, multilingual and mixed script retrieval. In IR, monolingual means the query and the documents to be retrieved are in a single language, whereas in multilingual retrieval the query and documents may be written in different languages. Mixed script retrieval differs slightly from monolingual retrieval: the system should retrieve relevant documents of the same language written in more than one script. In the FIRE-2015 Mixed Script Information Retrieval track, participants had to design systems for term classification and for retrieval of relevant documents written in Devanagari script and in Roman script. We used query expansion to reformulate the seed (information need) to address the mixed script retrieval issues.

The rest of the paper is organized as follows. Section 2 describes the tasks. Section 3 reviews related work, and Section 4 describes our approaches for annotation and MSIR. Section 5 discusses results and analyses errors. Section 6 concludes and gives directions for future work.

2. TASK DESCRIPTION
The track, Shared Task on Mixed Script Information Retrieval (MSIR), has three subtasks: Query Word Labeling, Mixed Script Ad-hoc Retrieval and Mixed Script Question Answering. We participated in the first two subtasks.

Query Word Labeling
Input: Let Q be a query containing n query words wi (1 ≤ i ≤ n) written in Roman script. A word wi ∈ Q (w1, w2, ..., wn) could be a standard English (en) word or a word transliterated from another language L = {Bengali (bn), Gujarati (gu), Hindi (hi), Kannada (ka), Malayalam (ml), Marathi (mr), Tamil (ta), Telugu (te)}, and some words are Named Entities (NE). The task is to label each word as en or as a member of L, depending on whether it is an English word or a transliterated L-language word. Input and expected output for an utterance are given below as an example.

Input:
hesitate in to giving is @aapleaks #aapsweep revenge should this statehood take way bjp not the #aapstorm best

Output: for each word wi, a corresponding label is produced.
en en en en en X X en en en en en en NE en en X en
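To make the expected labeling output concrete, the small sketch below pairs the tokens of the example utterance with their labels and emits one word-label pair per line. It is only an illustration of the idea; the tab-separated layout is our own assumption and not part of the task specification.

    # Illustrative only: pair the example utterance with its labels
    # and emit one "word<TAB>label" line per token (layout assumed).
    utterance = ("hesitate in to giving is @aapleaks #aapsweep revenge "
                 "should this statehood take way bjp not the #aapstorm best")
    labels = "en en en en en X X en en en en en en NE en en X en".split()

    tokens = utterance.split()
    assert len(tokens) == len(labels)   # labels must align with tokens
    for token, label in zip(tokens, labels):
        print(f"{token}\t{label}")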
Mixed Script Ad-hoc Retrieval
There are more than 66K documents and 25 queries (seeds). Documents are written in Devanagari script, Roman script or mixed script; here mixed script means a document has the same content in two scripts, one after another. Out of the 25 queries, seven are in Devanagari and the rest are in Roman script. Given a query, the goal is to produce a ranked set of relevant documents, with the most relevant document at the first position.

3. RELATED WORK
Subtask-1 is accomplished in two phases: word labeling, and transliteration of the Hindi-labeled words into their native (Devanagari) script.

3.1 Query Language Labeling
Labeling is concerned with the classification of a given word written in Roman script. Query words wi can be classified and annotated with the corresponding classes manually or using machine-learning-based classifiers. Various classifiers are available, such as Support Vector Machines (SVM), Bayesian networks, Decision Trees, Naive Bayes, MaxEnt and Neural Networks.

King and Abney addressed labeling the languages of words in cross-lingual documents [3]. They approached the problem in a weakly supervised fashion, as a sequence labeling problem with monolingual text samples as training data. Prabhakar and Pal made a similar attempt using a supervised learning algorithm [6].

3.2 Mixed Script Ad-hoc Retrieval
This subtask was introduced in FIRE-2013 [7] and continued in FIRE-2014 with more challenges (joined terms needing expansion) [1] and in FIRE-2015 (queries in Devanagari or Roman text, along with the previous challenges). Gupta et al. (2014) approached MSIR using 2-gram tf-idf and deep-learning-based query expansion [2]. Spelling variation in transliterated terms, together with mixed script text, is the major challenge of MSIR. A transliteration of a term can be extracted from parallel or comparable corpora (extraction approach), whereas in the generation approach a transliteration is generated using phoneme-, grapheme- or syllable-based rules.

4. APPROACHES
Our approaches to Subtask-1 and Subtask-2 are described in the subsections below.

4.1 Query Word Labeling
We treat word labeling as a classification problem: assigning tags to the given terms wi. Terms can be classified either manually or using a classifier; manual classification and tagging is not feasible on a large dataset. MaxEnt, a supervised classifier, is used for classification and labeling of the words in utterances, and Stanford's MaxEnt implementation is used for this purpose [4]. For classification, the model was trained on the development data, and the terms from the test-dataset utterances were then classified based on the features extracted during training.

4.1.1 Training
For training, the input terms and annotations are tokenized and aligned with the proper tags.

Features used.
Feature parameters were mostly left at their default values; those we set are listed below:

• useNGrams accepts a boolean value (true or false) to make features from letter n-grams; we set it to true.

• usePrefixSuffixNGrams makes features from prefix and suffix substrings of the string and accepts a boolean value; we set it to true.

• maxNGramLeng takes an integer value; n-grams longer than this are not used in the model. A maximum length of 4 was used.

• minNGramLeng also takes an integer; n-grams shorter than this are not used in the model. It must be a positive integer, and we set it to 1.

• sigma is a parameter of several of the smoothing methods, usually giving the degree of smoothing as a standard deviation. Here this value is 3.0.

• useQN accepts a boolean value; if set to true, Quasi-Newton optimization is used.

• tolerance is the convergence tolerance in parameter optimization, set to 1e-4.

The classification model was trained with the above parameter values, and 23 classes were identified during training.
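For reference, the settings above correspond to a Stanford Classifier properties file along the lines of the sketch below. This is a minimal illustration assuming the ColumnDataClassifier front end with the word string in column 1; the column prefix, the file names and the trainFile/testFile entries are our assumptions rather than the exact configuration used.

    # msir-labeling.prop (illustrative sketch, not the exact file used)
    trainFile=dev-train.tsv
    testFile=test.tsv
    1.useNGrams=true
    1.usePrefixSuffixNGrams=true
    1.maxNGramLeng=4
    1.minNGramLeng=1
    sigma=3.0
    useQN=true
    tolerance=1e-4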
4.1.2 Classification
The given terms from the test-dataset utterances were tokenized and passed through the trained model. Test tokens are classified and annotated with tags such as hi for Hindi terms, en for English terms, and NE_P (person name) or NE_L (location name) for proper names.

4.2 Mixed Script Information Retrieval
Subtask-2 has queries for retrieving documents related to Hindi song lyrics, astrological data and movie reviews. The proposed approach consists of three modules: document indexing, query formulation and document retrieval.

4.2.1 Document Indexing
A simple bag-of-words approach may retrieve noisy documents for lyrics retrieval, because in lyrics consecutive terms are important: a change in word position changes the context of a song. Hence, in addition to simple indexing, we chose block indexing with a block size of two words. Two approaches, simple indexing (bag-of-words) and block indexing (phrase retrieval), were used to index the collection with block sizes of one word and two words respectively.
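The difference between the two index types can be illustrated with the sketch below, which emulates, outside of any retrieval engine, the units that a bag-of-words index and a two-word block index would record for a lyrics line. It is only meant to show why consecutive-term blocks preserve word order; the function names are ours, and the actual indexing in our runs was done inside the retrieval toolkit, not with this code.

    # Illustrative emulation of the two indexing units (not the toolkit's own code).
    def bag_of_words(text):
        """Simple indexing: each single word is an indexing unit."""
        return text.split()

    def word_blocks(text, size=2):
        """Block indexing: blocks of `size` consecutive words as units."""
        tokens = text.split()
        return [" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]

    line = "tujo nahi lyrics"
    print(bag_of_words(line))         # ['tujo', 'nahi', 'lyrics']
    print(word_blocks(line, size=2))  # ['tujo nahi', 'nahi lyrics']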
4.2.2 Query Formulation (expansion)
As the documents in the corpus are in mixed script, the seed query alone cannot give good retrieval results. Hence the query must be reformulated to enhance the performance of the system. In query formulation, the script of the query is identified and a transliteration is then obtained using the Google transliteration API. For many terms the API returns more than one transliteration; for such terms the first one is chosen. For the runs ISMD2 and ISMD4 we used the formulated mixed script queries shown in Table 1.

Table 1: Query formulation table

Query Type             Query
Original Query         tujo nahi lyrics
Transliterated Query   तुजो नहीं लिरिक्स
Formulated Query       tujo nahi lyrics तुजो नहीं लिरिक्स
Original Query         सूर्य रेखा कर्क राशि
Transliterated Query   suyra rekha kark rashi
Formulated Query       सूर्य रेखा कर्क राशि suyra rekha kark rashi

4.2.3 Document Retrieval
The Poisson model with Laplace after-effect and normalization 2 (PL2) from the Divergence From Randomness (DFR) framework has been used to measure the similarity score between a document d and a query Q [5]. For the implementation we used Terrier 4.0.

Score(d, Q) = Σ_{t ∈ Q} qtfn(t) · w(t, d)    (1)

qtfn(t) = qtf / qtf_max    (2)

where w(t, d) is the weight of document d for query term t, qtfn is the normalized frequency of term t in the query, qtf is the original frequency of term t in the query, and qtf_max is the maximum qtf over all terms of the query; for details see [5].
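A minimal sketch of the query scoring in Equations (1) and (2) is given below. The term weight w(t, d) is taken as an input function here because it comes from the weighting model inside the retrieval engine (PL2 in our runs); the function and variable names are ours and only illustrate how the normalized query term frequencies combine with the document weights.

    from collections import Counter

    def score(query_terms, doc_id, w):
        """Score(d, Q) = sum over t in Q of qtfn(t) * w(t, d),
        with qtfn(t) = qtf(t) / qtf_max (Equations 1 and 2)."""
        qtf = Counter(query_terms)       # raw frequency of each query term
        qtf_max = max(qtf.values())      # maximum qtf over the query terms
        return sum((qtf[t] / qtf_max) * w(t, doc_id) for t in qtf)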
5. RESULTS AND ANALYSIS
Our approaches were evaluated on the provided test data for query word labeling and MSIR. In both subtasks our approaches performed moderately.

5.1 Subtask-1
The MaxEnt-based classifier performed moderately, as shown in Table 2. On some measures our approach performed well, with scores of 0.46 for strict f-measure NE and 0.24 for strict f-measure NE_P. On some metrics we performed moderately, and on others poorly. Some terms were misclassified, e.g.

Input utterance:
ei path jodi na shesh hoy lyrics

Annotated utterance:
bn hi bn bn bn bn en

The token 'path' in this utterance should have been labeled as a Bengali term; it carries the same meaning in Hindi and in English as well, but it was misclassified as Hindi due to ambiguity, since the same term exists in Hindi. Moreover, 'path' sounds more like 'poth' in Bengali due to regional accent.

Table 2: Query word labeling score

Metric                 ISM_Score     Aggregate_Mean  Aggregate_Median  Max_Score
MIXesAccuracy          12.5          5.0595          0                 25
NEsAccuracy            13.253        36.0103         35.9459           63.964
NEsCorrect             22            199.8571        199.5             355
strict f-measure NE    0.461728395   0.371410272     0.07536114        0.461728395
strict f-measure NE_L  0             0.0426          0                 0.2114
strict f-measure NE_P  0.2486        0.1086          0.1133            0.2486
strict f-measure X     0.9612        0.8989          0.9379            0.9668
strict f-measure bn    0.7113        0.7073          0.7549            0.8537
strict f-measure en    0.9052        0.8067          0.8356            0.9114
strict f-measure gu    0.1383        0.1338          0.1331            0.3484
strict f-measure hi    0.6618        0.6168          0.6413            0.8131
strict f-measure kn    0.6373        0.5752          0.6062            0.8709
strict f-measure ml    0.4871        0.4762          0.4757            0.7446
strict f-measure mr    0.5636        0.5994          0.6469            0.8308
strict f-measure ta    0.718         0.7261          0.749             0.8911
strict f-measure te    0.5439        0.4654          0.4817            0.7763
TokensAccuracy         77.0648       71.1137         75.5563           82.7152
UtterancesAccuracy     17.298        14.6645         17.1086           26.3889
Average F-measure      0.613402366   0.539559189     0.113420527       0.69174727
Weighted F-Measure     0.767831108   0.698989963     0.095876627       0.829929229

5.2 Subtask-2
We submitted four runs for Subtask-2, combining simple indexing with the original query, simple indexing with the formulated query, block indexing (block size = 2 words) with the original query, and block indexing (block size = 2 words) with the formulated query. From the scores in Table 3 we observe that the run with block indexing and formulated queries performs best; ordered from higher to lower NDCG@10, ISMD4 > ISMD2 > ISMD3 > ISMD1.

Overall, the retrieval approaches performed moderately compared to other teams. Some challenges remain unaddressed in our approaches: spelling variation in transliterated (Roman) text; joined terms (e.g. 'kabhi-kabhi' could be split into 'kabhi', 'kabhi', and 'tujo' into 'tu', 'jo'); and translation of query text (some documents contain information in another language, e.g. सूर्य रेखा कर्क राशि could be translated as "Line of Sun for Cancer"). One more challenging issue is partial transliteration and translation. For example, query number 69 is "shani dashaa today for a 20 year old", in which the first two tokens are Hindi terms. Hence either the Hindi terms must be translated into English, or the other terms must be translated into Hindi and then transliterated into Roman text.

Table 3: Subtask-2 scores

Run    Block_Size  Query_Formulation  NDCG@1  NDCG@5  NDCG@10  MAP     MRR     RECALL
ISMD1  1 word      No                 0.4133  0.4268  0.4335   0.0928  0.244   0.1361
ISMD2  1 word      Yes                0.4933  0.5277  0.5328   0.1444  0.318   0.2051
ISMD3  2 words     No                 0.3867  0.4422  0.4489   0.0954  0.2207  0.1418
ISMD4  2 words     Yes                0.4967  0.5375  0.5369   0.1507  0.3397  0.2438

6. CONCLUSIONS
Our work comprises two subtasks: annotation and retrieval. We used a learning-based classifier for word labeling; label accuracy was moderate for the submitted runs. We identified some terms that were incorrectly labeled by the classifier. This likely happened for an important reason, namely term ambiguity, where the same term exists in more than one class. For MSIR, simple indexing and block indexing were both used, separately, during document indexing. In query formulation, transliterations are extracted using the Google API. To measure the similarity score, a DFR framework is used, which performed moderately. Query expansion approaches can further address the MSIR retrieval issues. In future work we intend to address the unresolved issues mentioned above.

7. REFERENCES
[1] Choudhury, M., Chittaranjan, G., Gupta, P., and Das, A. Overview and datasets of FIRE 2014 track on transliterated search. In Pre-proceedings of the 6th workshop FIRE-2014 (2014), Forum for Information Retrieval Evaluation (FIRE).
[2] Gupta, P., Bali, K., Banchs, R. E., Choudhury, M., and Rosso, P. Query expansion for mixed-script information retrieval. In Proceedings of the 37th International ACM SIGIR Conference on Research & Development in Information Retrieval (2014), ACM, pp. 677-686.
[3] King, B., and Abney, S. P. Labeling the languages of words in mixed-language documents using weakly supervised methods. In HLT-NAACL (2013), pp. 1110-1119.
[4] Klein, D. The Stanford Classifier. http://nlp.stanford.edu/software/classifier.shtml, 2003. Online; accessed 19-02-2014.
[5] Plachouras, V., He, B., and Ounis, I. University of Glasgow at TREC 2004: Experiments in web, robust, and terabyte tracks with Terrier. In TREC (2004).
[6] Prabhakar, D. K., and Pal, S. ISM@FIRE2013 shared task on transliterated search. In FIRE '13: Proceedings of the 5th 2013 Forum on Information Retrieval Evaluation (2013), ACM New York, p. 6.
[7] Roy, R. S., Choudhury, M., Majumder, P., and Agarwal, K. Overview and datasets of FIRE 2013 track on transliterated search. In Pre-proceedings of the 5th workshop FIRE-2013 (2013), Forum for Information Retrieval Evaluation (FIRE).