ISM@FIRE-2015: Mixed Script Information Retrieval
Dinesh Kumar Prabhakar Sukomal Pal
Indian School of Mines Indian School of Mines
Dhanbad, Jharkhand Dhanbad, Jharkhand
India 826004 India 826004
dinesh.nitr@gmail.com sukomalpal@gmail.com
ABSTRACT (in web retrieval) or from the corpus (in ad-hoc retrieval)
This paper describes the approach we have used for iden- it is necessary to retrieve the documents of other language
tification of languages for a set of terms written in Roman and/or script. It is important to discuss three terms mono-
script and approaches for the retrieval in mixed script do- lingual, multilingual and mixed script retrieval. In IR mono-
main, in FIRE-2015. The first approach identifies the class lingual means query and documents to be retrieved are in
(native language of terms and whether a term is any named single language where as multilingual query and documents
entity or of any other type) of given terms/words. MaxEnt may be in written different language. But, the mixed script
a supervised classifier has been used for the classification retrieval is slightly different than monolingual retrieval. In
which performed best for strict f-measure NE has score is mixed script retrieval, system should retrieve the relevant
0.46 and strict f-measure NE_P has score 0.24. For the documents of same language written in more than one script.
MSIR subtask Divergence from Randomness (DFR) based In FIRE-2015, for the Mixed Script Information Retrieval
approach is used and performed better with block indexing track participant has to design the system for term classi-
and query formulation. Overall scores of our submission on fication and for the retrieval of relevant documents written
NDCG@10 0.4335, 0.5328, 0.4489 and 0.5369 for ISMD1, in Devanagari script and in Roman script.
ISMD2, ISMD3 and ISMD4 respectively. We have used query expansion to reformulate the seed
. (information need) for addressing the mixed script retrieval
issues.
Further in Section 2, we discussed the task descriptions.
Keywords Section 3 shows related work on and Section 4, describes our
Word classification, Transliteration, Information Retrieval approaches for annotation and MSIR. In Section 5, we have
discussed results and analyzed errors. Section 6, conclude
the strategies with the direction of future work.
1. INTRODUCTION
With the development of the Web 2.0, user’s count on
Social sites are increasingly becoming higher. They write 2. TASK DESCRIPTION
messages (specially blogs and post) on sites (such as Twit- The track, Shared Task on Mixed Script Information Re-
ter and Facebook) in their own languages preferably using trieval (MSIR) has three subtasks: Query Word Labeling,
Roman scripts (transformed form). These post might con- Mixed Script Ad-hoc Retrieval and Mixed-script Question
sist terms of Non-English (or terms from user’s native ) lan- Answering. We have participated in first two subtask.
guages, a simple English word, a mixed language term (like Query Word Labeling
gr8, 2moro) or a Named Entity (NE). Identification of such Input:- Let Q be the query set containing n query word
categories play significant role in Natural Language Process- wi (1 ≤ i ≤ n) written in Roman script. The word wi ∈
ing (NLP). It doesn’t remains limited to the NLP but also Q (w1 , w2 , . . . , wn ), could be standard English (en) words
used in other sub-domains of linguistic processing and In- or transliterated from another language L = {Bengali (bn),
formation Retrieval (IR). Gujarati (gu), Hindi (hi), Kannada (ka), Malayalam (ml),
Since, blog posts contain some important information that Marathi (mr), Tamil (ta), Telugu (te)} and some Named
opens up the scope of IR in informal texts (in form of posts Entities (NE). The task is to label the words as En or a
or massages). Raw blogs data often have some erroneous member of L depending on whether it an English word, or a
text. Hence, before applying any IR steps data must be transliterated L-language word. Input and expected result
preprocessed using some linguistic processing approaches. for an utterance is given below as an example.
There are huge collection of data on/off the Web for var- Input:
ious information needs but the track for adhoc retrieval.
For the retrieval, collection has documents written in two
scripts: Roman (transliterated form of Hindi terms in Ro- hesitate in to giving is @aapleaks #aapsweep
man script) and Devanagari. In whole corpus, some doc- revenge should this statehood take way bjp
ument has information in Devanagari, some others has in not the #aapstorm best
Roman and rest of the document has information in mixed
(transliterated and native ) scrip one after another. To max- Output:- Result wi l is corresponding label produced for
imize the number of most relevant documents on the Web individual terms.
55
Output: For the classification, model was trained on development
data and then terms from utterances of test dataset were
en en en en en X X en en en en en en NE en classified based on extracted features during training.
en X en
4.1.1 Training
Mixed Script Ad-hoc Retrieval For the training purpose input terms and annotations are
There are more than 66K documents and 25 queries (seeds). tokenized and made align with proper tags.
Documents are written in Devanagari script, Roman script
or in mixed script. Here mixed script means a document Features used.
has same content in two scripts one after another. Out of Features value with default parameter were used some of
25 queries seven are in Devanagari and others are in Roman which are listed below:
script.
• useNGrams accept boolean value true or false to make
The goal of the task is for a given query system should
features from letter n-grams where true is assigned
produce set of relevant documents in ordered where on the
here.
most relevant document should should come at first position.
• usePrefixSuffixNGrams makes features from prefix and
3. RELATED WORK suffix substrings of the string and accept boolean value
Subtask-1 accomplished in two phases: Word Labeling and where we have assigned true.
transliteration of H labeled word to its native (Devanagari)
script. • maxNGramLeng takes integer value and size beyond
the assigned number will not be used in the model.
3.1 Query Language Labeling Maximum length 4-grams was used.
The labeling is concerned with the classification of a given
word written in Roman script. Query words wi can be clas- • minNGramLeng also takes integer number and n-grams
sified and annotated with corresponding classes manually below this size will not be used in the model. It must
or using machine learning based classifiers. Various clas- be a positive integer and we have set it 1.
sifiers are there for classification such as Support Vector
Machine(SVM), Bayesian networks, Decision Trees, Naive- • sigma is a parameter to several of the smoothing meth-
Bayes, MaxEnt and Neural Networks. ods,usually gives a degree of smoothing as standard
King and Abney started for labeling the languages of deviation. Here this number is 3.0.
words in cross-lingual documents[3]. They have approached
this problem in a weakly supervised fashion, as a sequence • useQN accepts boolean value where true indicates Quasi-
labeling problem with monolingual text samples for train- Newton optimization will be used if it is set to true.
ing data. Prabhakar and Pal also attempt in similar fashion
using supervised learning algorithm [6]. • tolerance is convergence tolerance in parameter opti-
mization and set 1e-4.
3.2 Mixed Script Ad-hoc Retrieval
This subtask was introduced in FIRE-2013 [7], continued Classification model was train on above parameter values
in FIRE-2014 with more challenges (joint terms need ex- and 23 classes were identified during the training.
pansion) [1] and in FIRE-2015 (queries are in Devanagari or
Roman text along with previous challenges). 4.1.2 Classification
Gupta et al. in 2014, approached MSIR using 2-gram Given terms from utterances of test dataset were tok-
tf-idf and deep learning based query expansion [2]. The enized and parsed on trained model. Tokens of test data
spelling variation in transliterated terms along with mixed are classified and annotated with different tags such as for
script text is the major challenge of MSIR. Transliteration of Hindi terms hi, English terms en, proper names (name of
any term can be extracted from parallel or comparable cor- the person NE_P, location NE_L).
pora in extraction approach whereas in generation, translit-
eration is generated depending on phoneme, grapheme or 4.2 Mixed Script Information Retrieval
syllable-based rules. Subtask-2 has queries for Hindi song lyrics, astrological
data and movies reviews related documents retrieval. Pro-
4. APPROACHES posed approach consist three modules: documents indexing,
Our approaches for the solution of Subtask-1 and Subtask- query formulation and documents retrieval.
2 have been described in subsections below.
4.2.1 Document Indexing
4.1 Query Word Labeling Simple bags-of-words approach may retrieve noisy doc-
We have considered word labeling as classification issue uments for lyrics retrieval. Because in lyrics consecutive
for the tags annotation to the given terms wi . Terms can terms are important as change in position changes the con-
be classified either manually or using any classifier. Manual text of a song. Hence, we have chosen block indexing with
classification and tagging is not feasible on the large dataset. block-size 2 words in addition simple indexing. Two ap-
MaxEnt a supervised classifier is used for classification and proaches simple indexing (bags-of-words) and block index-
labeling of words from utterances. The Stanford’s MaxEnt ing (phrase retrieval) were used to index the collection with
implementation is used for this purpose [4]. block size one word and two words respectively.
56
4.2.2 Query Formulation (expansion) The token ‘path’ in input utterance should have a Bengali
As documents in the corpus are in mixed script, seed value term and has same meaning in Hindi and English also but
only can’t give good result for retrieval. Hence, the query misclassified in Hindi due to ambiguity as same term exist
must be reformulated to enhance the performance of the sys- in Hindi. But ‘path’ seems to be ‘poth’ in Bengali due to
tem. In query formulation, script of the query is identified regional accent.
and then transliteration is extracted using Google transliter-
ation API. There are many terms for which API gives more
5.2 Subtask-2
than one transliteration for such term first one is chosen. Submitted four Runs for subtask-2, with combinations of
For the submission of run ISMD2 and ISMD4 we have used simple indexing and original query, simple indexing and for-
formulated mixed script query as shown in Table 1. mulated query, block (size=2 words) indexing and original
query and block (with size=2 words) indexing and formu-
lated query. From the score in Table 3 we have observe that
Table 1: Query formulation table Run with block indexing and formulated queries better and
Query Type Queries the order in higher to lower performance on NDCG@10 is
Original Query tujo nahi lyrics ISM 4 > ISM 2 > ISM 3 > ISM 1.
Transliterated Query तजाे नहीं लरस Overall the retrieval approaches performed moderate com-
Formulated Query tujo nahi lyrics तजाे नहीं ल- pare to other teams. Some challenges remains un-addressed
रस in approaches: spelling variation in transliterated (Roman)
text, combined term ( such as ‘kabhi-kabhi’ could be ‘kabhi’,
Original Query सूय रे खा कक राश
‘kabhi’, ‘tujo’ could be ‘tu’, ‘jo’) and translation (some doc-
Transliterated Query suyra rekha kark rashi
ument consist information in another language such as सूय
Formulated Query सूय रे खा कक राश suyra
रे खा कक राश could be translated into Line of Sun for Can-
rekha kark rashi
cer) of query text. One more challenging issue is partial
transliteration and translation. For example query number
4.2.3 Document Retrieval 69, query is “shani dashaa today for a 20 year old” in that
Poisson model with Laplace after-effect and normaliza- first two tokens are Hindi terms. Hence either Hindi terms
tion 2 of Divergence From Randomness (DFR) framework will be translated to English or other terms need to be trans-
has been used to measure the similarity score between doc- lated into Hindi and then transliterate into Roman text.
uments d and query Q [5]. For the implementation we have
used terrier 4.0. 6. CONCLUSIONS
∑ Our work comprises two subtasks annotation and retrieval.
Score (d, Q) = (qtf n · w (t, d)) (1) We have used learning based classifier for word labeling. La-
n∈Q bel accuracy was moderate for submitted runs. We identi-
fied some terms were incorrectly labeled by the classifier.
qtf Perhaps this happened due an important reason i.e. term
qtf n = (2)
qtfmax ambiguity where same term exist in more then one classes.
where w(t, d) is the weight of the document d for a query For the MSIR, simple and block indexing both used sepa-
term t and qtf n is the normalized frequency of term t in the rately during document indexing. In the query formulation
query. And qtf is the original frequency of term t in the transliterations are extracted using Google API. To mea-
query, and qtfmax is the maximum qtf of all the composing sure the similarity score a DFR framework is used which
terms of the query for details see[5]. performed moderate. Some query expansion approach can
address MSIR retrieval issues. In future we are looking to
5. RESULTS AND ANALYSIS address the unresolved issues mentioned above.
Our approaches have been evaluated on the provided test
data for query word labeling and MSIR. In both the subtasks 7. REFERENCES
our approaches performed moderate. [1] Choudhury, M., Chittaranjan, G., Gupta, P.,
and Das, A. Overview and datasets of fire 2014 track
5.1 Subtask-1 on transliterated search. In Pre-proceedings 6th
MaxEnt based classifier worked moderate as depicted in workshop FIRE-2014 (2014), Forum for Information
table 2. In some of measure our approach performed well Retrieval Evaluation (FIRE).
with scores 0.46 strict f-measure NE and 0.24 in strict f- [2] Gupta, P., Bali, K., Banchs, R. E., Choudhury,
measure NE_P. For some metrics we performed moderate M., and Rosso, P. Query expansion for mixed-script
and in others poor as well. Some terms are misclassified e.g. information retrieval. In Proceedings of the 37th
Input utterance: international ACM SIGIR conference on Research &
development in information retrieval (2014), ACM,
ei path jodi na shesh hoy lyrics pp. 677–686.
[3] King, B., and Abney, S. P. Labeling the languages of
Annotated utterance: words in mixed-language documents using weakly
supervised methods. In HLT-NAACL (2013),
pp. 1110–1119.
bn hi bn bn bn bn en [4] Klein, D. The stanford classifier. http://http:
//nlp.stanford.edu/software/classifier.shtml, 2003.
57
Table 2: Query word labeling score
Metric ISM_Score Aggregate_Mean Aggregate_Median Max_Score
MIXesAccuracy 12.5 5.0595 0 25
NEsAccuracy 13.253 36.0103 35.9459 63.964
NEsCorrect 22 199.8571 199.5 355
strict f-measure NE 0.461728395 0.371410272 0.07536114 0.461728395
strict f-measure NE_L 0 0.0426 0 0.2114
strict f-measure NE_P 0.2486 0.1086 0.1133 0.2486
strict f-measure X 0.9612 0.8989 0.9379 0.9668
strict f-measure bn 0.7113 0.7073 0.7549 0.8537
strict f-measure en 0.9052 0.8067 0.8356 0.9114
strict f-measure gu 0.1383 0.1338 0.1331 0.3484
strict f-measure hi 0.6618 0.6168 0.6413 0.8131
strict f-measure kn 0.6373 0.5752 0.6062 0.8709
strict f-measure ml 0.4871 0.4762 0.4757 0.7446
strict f-measure mr 0.5636 0.5994 0.6469 0.8308
strict f-measure ta 0.718 0.7261 0.749 0.8911
strict f-measure te 0.5439 0.4654 0.4817 0.7763
TokensAccuracy 77.0648 71.1137 75.5563 82.7152
UtterancesAccuracy 17.298 14.6645 17.1086 26.3889
Average F-measure 0.613402366 0.539559189 0.113420527 0.69174727
Weighted F-Measure 0.767831108 0.698989963 0.095876627 0.829929229
Table 3: Subtask-2 scores
Team Block_Size Query_Formulation NDCG@1 NDCG@5 NDCG@10 MAP MRR RECALL
ISMD1 1 word X 0.4133 0.4268 0.4335 0.0928 0.244 0.1361
ISMD2 1 word ✓ 0.4933 0.5277 0.5328 0.1444 0.318 0.2051
ISMD3 2 words X 0.3867 0.4422 0.4489 0.0954 0.2207 0.1418
ISMD4 2 words ✓ 0.4967 0.5375 0.5369 0.1507 0.3397 0.2438
Online; accessed 19-02-2014.
[5] Plachouras, V., He, B., and Ounis, I. University of
glasgow at trec 2004: Experiments in web, robust, and
terabyte tracks with terrier. In TREC (2004).
[6] Prabhakar, D. K., and Pal, S. Ism@fire2013 shared
task on transliterated search. In FIRE ’13 Proceedings
of the 5th 2013 Forum on Information Retrieval
Evaluation (2013), ACM New York, p. 6.
[7] Roy, R. S., Choudhury, M., Majumder, P., and
Agarwal, K. Overview and datasets of fire 2013 track
on transliterated search. In Pre-proceedings 5th
workshop FIRE-2013 (2013), Forum for Information
Retrieval Evaluation (FIRE).
58