=Paper= {{Paper |id=Vol-1587/T2-11 |storemode=property |title=Mixed Script Ad hoc Retrieval using back transliteration and phrase matching through bigram indexing: Shared Task report by BIT, Mesra |pdfUrl=https://ceur-ws.org/Vol-1587/T2-11.pdf |volume=Vol-1587 |authors=Nimesh Ghelani,Sujan Kumar Saha,Amit Prakash |dblpUrl=https://dblp.org/rec/conf/fire/GhelaniSP15 }} ==Mixed Script Ad hoc Retrieval using back transliteration and phrase matching through bigram indexing: Shared Task report by BIT, Mesra== https://ceur-ws.org/Vol-1587/T2-11.pdf
     Mixed Script Ad hoc Retrieval using back transliteration
     and phrase matching through bigram indexing: Shared
                   Task report by BIT, Mesra

                           Nimesh Ghelani, Sujan Kumar Saha∗, Amit Prakash†
                                         Dept. of Computer Science and Engineering
                                          Birla Institute of Technology, Mesra, India
         nimeshghelani@gmail.com, sujan.kr.saha@gmail.com, aprakash@bitmesra.ac.in

ABSTRACT
This paper describes an approach for Mixed-script Ad hoc retrieval, a subtask of the FIRE 2015 Shared Task on Mixed Script Information Retrieval. We participated in subtask 2 of the shared task, where a statistical model was used to carry out back transliteration to Devanagari script. To perform the search, a bigram based index of the documents was used, and the search was performed using pivot terms in the query.

Categories and Subject Descriptors
I.2.7 [Artificial Intelligence]: Natural Language Processing - Language parsing and understanding

Keywords
transliteration, information retrieval

∗Advisor
†Helped out with result analysis

1.   INTRODUCTION
A large number of languages are written using indigenous scripts. The Internet has allowed anyone to easily post content to websites, which is usually written in Roman script due to technical reasons. This process of phonetically representing the words of a language in a non-native script is called transliteration. The situation where both documents and queries can be in more than one script, and the user expectation could be to retrieve documents across scripts, is referred to as Mixed Script Information Retrieval.

A challenge that search engines face while processing transliterated queries and documents is that of extensive spelling variation. For instance, the word dhanyavad ("thank you" in Hindi and many other Indian languages) can be written in Roman script as dhanyavaad, dhanyvad, danyavad, danyavaad, dhanyavada, dhanyabad and so on.

Subtask 2 of the shared task focuses on Mixed-script Ad hoc retrieval, where, given a query in Roman or Devanagari script, the task was to fetch ranked documents based on their relevance. The documents consisted of mixed script content related to song lyrics, movie reviews, or astrology. To solve this task, Devanagari script was chosen as the base script. A back transliteration module was built using a frequency based statistical model on partitioned letter group matching. It was built using training data of transliterated Devanagari-Roman word pairs. The entire query and the documents were back transliterated to Devanagari script using this module. The search was performed on bigram indexes of the documents [1]. Spelling variations were handled using LCS (Longest Common Subsequence) based similarity. Pivot terms based on high IDF (Inverse Document Frequency) values [1] were chosen from the query terms. The search was carried out around these pivot terms to perform phrase matching in documents. Further boosting and reordering of results were performed using heuristics based on intent and document titles.

The rest of the paper discusses the detailed methodology in section 2, followed by the results in section 3 and finally the conclusion in section 4.

2.   METHODOLOGY
The proposed method consists of two modules: the Back Transliteration module and the Searching module.

The Transliteration module is used for transliterating words expressed in Roman script to Devanagari script. This is needed because the search module assumes every word in the query and documents to be in Devanagari script. The reason Devanagari is used as the base script over Roman is its concrete implication of phonemes solely from the respective letters. In other words, each phoneme is strongly attached to its corresponding letters rather than to the whole word or the neighboring phonemes. This helps the Search module, which involves computing similarity scores between two similar sounding words.

2.1     Back Transliteration
Back Transliteration is performed using a statistical model which is trained on a list of Devanagari-Roman transliterated word pairs.

2.1.1    Training
Training was performed on a list of 36,947 transliterated Hindi words (Roman script) and their Devanagari representations [2].


Given a list of Devanagari-Roman word pairs, each Roman word Wr is represented as the character sequence ∧r1 r2 .. rn $. The ∧ and the $ are added as characters to mark the beginning and the end of the string, respectively. Subsequent operations treat these markers just like any other character of the word. This is done in order to separate vowels which start the word from those which contribute to a phoneme by succeeding a consonant. The corresponding Devanagari word Wd has no ∧ and $ markers.

The Roman word Wr is partitioned (refer expression 1) into groups of characters, where the length of each individual group is between 2 and nGramLimitR (assigned to 3 in the system). The corresponding Devanagari word Wd is also partitioned (refer expression 2) into groups whose length is between 1 and nGramLimitD (assigned to 3 in the system). The lower limit for Devanagari is 1 because single Devanagari letters often have a consistent sound contribution to the whole word.

   Wr = wgR1 ... wgRi ... wgRn , 2 ≤ |wgRi| ≤ nGramLimitR    (1)
   Wd = wgD1 ... wgDi ... wgDm , 1 ≤ |wgDi| ≤ nGramLimitD    (2)

For each possible pair of partitions of the two words where n = m, wgDi is added as a possible phoneme equivalent of wgRi. A global dictionary is maintained with Roman letter groups as keys, and as values another dictionary with Devanagari letter groups as keys and integer counts as values. This global dictionary is updated over all the Roman-Devanagari word pairs.

Partitioning is done using a straightforward approach with exponential time complexity. This is practical due to the low constant factor involved and because word lengths are usually small.

The proposed approach produces a lot of noise due to blind position based letter group matching. However, with enough training data, the relevant Devanagari letter groups bubble up over the outliers.

The model generated by the training yields, for any Roman letter group lgR, a list of tuples (lgDi, fi), denoting that the Devanagari letter group lgDi was mapped to lgR fi times. The value score(lgDj, lgR) = fj / Σi fi is used as the confidence score for lgDj being a suitable candidate for replacing lgR.

2.1.2    Transliteration
This module determines possible transliterations of an input word in Roman script to Devanagari script. It produces multiple results and attempts to rank them on the basis of their confidence scores.

As with the Roman script words in the training phase, the input word Wr is surrounded by the ∧ and $ markers. Partitioning is performed in the same way as the partitioning of Roman words during training (refer expression 3). For any partition, each Roman letter group is matched to the Devanagari letter group with the maximum score, using the dictionary generated during training. These matched letter groups are concatenated to get the final result Wd, with a consolidated confidence score (equation 4) for the result. The result with the maximum confidence score over all partitions is returned as the best Devanagari candidate for the given Roman word.

   Wr = lgR1 ... lgRi ... lgRn , 2 ≤ |lgRi| ≤ nGramLimitR
   Wd = lgD1 ... lgDi ... lgDn                                (3)

   score_consolidated = ∏(i=1..n) (|lgRi| ∗ score(lgDi, lgRi))    (4)

2.2     Data Structures
2.2.1    Word Dictionary (wordDict)
The wordDict supports two operations, insert and query.

insert(w) inserts the word w into the dictionary. query(w, threshold, maxResults) returns at most the top maxResults words present in the dictionary whose similarity with w is greater than threshold.

wordDict is implemented using a hash table, with words as keys. query is done by iterating over all the keys and computing each key's similarity with the query word using the Longest Common Subsequence (LCS) (equation 5). All words wi in the dictionary such that similarity(w, wi) ≥ threshold are sorted in decreasing order of their similarity score with the query w, and the top maxResults words are returned.

   similarity(w1, w2) = LCS(w1, w2) / max(|w1|, |w2|)    (5)

2.3     Preprocessing and Indexing the documents
Around 60,000 documents constitute the search pool.

The documents to be searched are mixed script text documents. Since the base script was chosen to be Devanagari, transliteration is performed wherever required, to ensure the entire document is in Devanagari script. Indexing is done separately for the document titles and the contents.

For each word w, before(w) and after(w) are built, which are augmented wordDicts of the words occurring just before and after w in any document, respectively. before(w) and after(w) represent sets of tuples (wx, (doc1, ..., doci, ..., docn)), where doci refers to a document in which wx occurred before/after w. These are built both for the content and the title of the documents.

The searching module makes use of the IDF (Inverse Document Frequency) values of the words. IDF values are calculated for each Devanagari word (original or transliterated) in the document content and title: IDF(w) = log(N / (1 + n)), where the word w occurs in n documents out of N total documents.

A global wordDict is populated with all the Devanagari words (original or transliterated) found in the documents. It is denoted docWordDict and serves as the unigram index of the documents.

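The wordDict of section 2.2.1 underpins both pivot selection and candidate expansion in the search below. The following is a minimal Python sketch of it (illustrative only, not the system's actual code; names follow the paper, and the linear scan over keys matches the described implementation):

```python
def lcs_length(a, b):
    # Classic dynamic-programming LCS length, O(|a| * |b|).
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if ca == cb else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def similarity(w1, w2):
    # Equation 5: similarity(w1, w2) = LCS(w1, w2) / max(|w1|, |w2|)
    return lcs_length(w1, w2) / max(len(w1), len(w2))

class WordDict:
    """Hash table keyed by word, with an LCS-based fuzzy query (section 2.2.1)."""

    def __init__(self):
        self.words = set()

    def insert(self, w):
        self.words.add(w)

    def query(self, w, threshold, max_results):
        # Linear scan over all keys, as described in the paper.
        scored = [(similarity(w, wi), wi) for wi in self.words]
        scored = [(s, wi) for s, wi in scored if s >= threshold]
        scored.sort(key=lambda t: (-t[0], t[1]))  # decreasing similarity
        return [wi for _, wi in scored[:max_results]]
```

For the spelling variants of section 1, query("dhanyvad", 0.7, 2) over a dictionary containing dhanyavad, dhanyavaad and namaste returns the two dhanyavad variants, closest first.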
2.4     Searching
Input → Query as a list of words in mixed script.
Output → List of documents sorted by their relevance score.

Any query word in Roman script is replaced by transliterating it to Devanagari script using the transliteration module (taking the highest scored result).

2.4.1    Pivot Selection
A pivot term is selected from the query terms, around which the rest of the query is expanded over the documents. The selection criterion used for the pivot terms is their IDF value. The top Npivots (assigned to 3 in the system) distinct query terms, sorted in decreasing order of their IDF values, are chosen as pivot terms.

In some cases, there are Roman words whose correct Devanagari representations are present in the documents, but the representations produced by the module are not. This is due to one of two reasons: 1. the result produced by the transliteration module is incorrect; 2. there are multiple correct Devanagari representations of that Roman word. Since the transliteration module tends to produce a result sounding as close as possible to the correct one, docWordDict's fuzzy query is used to fetch similar words present in the documents with a similarity score above a reasonable threshold. For a word w, if the most similar word found has a similarity score above 0.95, that word's IDF value is taken as the IDF value of w. Otherwise, the maximum IDF value among the similar words is chosen.

2.4.2    Pivot expansion
A bigram traversal query is performed on each of the Npivots pivots, and the scores are independently added to vote for the relevance of the results.

For each pivot term, a candidate word list is fetched using docWordDict's query. The candidate words have a similarity score above similarityThreshold (assigned to 0.7 in the system). Processing multiple similar words instead of one accounts for incorrectness in the transliteration module and for multiple similar sounding representations of the pivot term in the documents. Bigram traversal (next section) is performed on each of the candidate words and the results are combined into the result set for that pivot term. If two separate queries for the same pivot term (different candidate words) return a score for the same document, the maximum score is considered in the combined result.

Result sets of pivot terms are combined by adding the scores for overlapping documents, voting for their relevance.

2.4.3    Candidate word bigram traversal
Input → word, pivot, original query.
Output → List of documents with relevance score.

The idea is to traverse across bigrams to efficiently match variable length phrases in the document content. Traversal is performed across the left and the right of the pivot position in the original query separately, and the results are later combined.

If the original query has terms q1 q2 ... qn, each qi will refer to a list of words wj,qi such that similarity(qi, wj,qi) ≥ similarityThreshold. A traversal can be defined from wx,qi to wy,qj such that |i − j| = 1. The state variables for wy,qj during traversal are computed from the variables of wx,qi. It should be noted that a state is defined by wy,qj as well as by the path taken to reach that word; for the sake of short variable names the path is omitted from the notation, but this remains true.

The state variables consist of the result document set (doc_set(w)), and a confidence score (score(d, w)) along with a phrase count (count(d, w)) corresponding to each document d in the result document set doc_set(w).

The first traversal always begins from wx,qp (qp is the current pivot). The initial result document set doc_set(wx,qp) consists of the documents which have wx,qp in their body. The initial confidence score for a document d is score(d, wx,qp) = similarity(wx,qp, qp). The initial phrase count count(d, wx,qp) is simply the number of times wx,qp occurs in the document body. bigram_doc_set(x, y) represents the set of documents in which the bigram xy occurs. bigram_count(d, x, y) is the number of times the bigram xy occurs in document d.

Traversing from wx,qi to wy,qj, assuming i + 1 = j, the state variables for wy,qj are calculated as shown in equation 6. This traversal can also be interpreted as an effort to match the bigram qi qj of the query by matching the similar bigram xy in the document. The new result document set consists of the documents from doc_set(wx,qi) which have the bigram xy present in their content. For a document d, the new count is the minimum of its count in the previous state and the number of times the bigram xy occurs in the content of d. A normalized value of this count (equation 7), along with the similarity between qj and wy,qj (refer equation 5) and the score of d in the previous state, are used to calculate the new score for document d. The count is normalized in order to have a regulated effect on scoring.

   doc_set(wy,qj) = doc_set(wx,qi) ∩ bigram_doc_set(wx,qi, wy,qj)
   count(d, wy,qj) = min(count(d, wx,qi), bigram_count(d, wx,qi, wy,qj))    (6)
   score(d, wy,qj) = similarity(qj, wy,qj) + score(d, wx,qi) ∗ normalizeCount(count(d, wy,qj))

   normalizeCount(x) = min(2, 1 + (x − 1)/3)    (7)

For j + 1 = i, just swapping the parameters of bigram_doc_set and bigram_count is required. Using the above traversal rules, traversal starts from word, towards the left and the right separately. If the pivot query term is qp, then scoreLeft(d) = score(d, wx,qL), such that it is non-zero and L is as small as possible. Similarly, scoreRight(d) = score(d, wy,qR), such that it is non-zero and R is as large as possible. The query phrase from qL to qR was thus matched in document d.


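The traversal update of equations 6 and 7 can be sketched as follows. This is a hedged Python illustration, not the authors' implementation: bigram_doc_set and bigram_count are hypothetical stand-ins for the precomputed bigram index, and state maps each candidate document to its (count, score) pair for the current word.

```python
def normalize_count(x):
    # Equation 7: caps the effect of the phrase count on the score.
    return min(2, 1 + (x - 1) / 3)

def traverse_step(state, wx, wy, q_sim, bigram_doc_set, bigram_count):
    """One traversal step from word wx to word wy (equation 6).

    state: dict mapping document id -> (count, score) for wx.
    q_sim: similarity(qj, wy) from equation 5.
    bigram_doc_set: dict (wx, wy) -> set of documents containing bigram wx wy.
    bigram_count: dict (doc, wx, wy) -> occurrences of bigram wx wy in doc.
    """
    docs_with_bigram = bigram_doc_set.get((wx, wy), set())
    new_state = {}
    for d, (count, score) in state.items():
        if d not in docs_with_bigram:
            continue  # intersection of doc_set with bigram_doc_set
        new_count = min(count, bigram_count[(d, wx, wy)])
        new_score = q_sim + score * normalize_count(new_count)
        new_state[d] = (new_count, new_score)
    return new_state
```

Traversal towards the left (j + 1 = i) only requires keying the two index lookups with the swapped bigram (wy, wx).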
It should be noted that this may not strictly be true, as traversals were performed through bigrams and not larger n-grams; however, it serves as a decent approximation. The left and right scores are combined as (scoreLeft(d) + scoreRight(d)) ∗ (R − L + 1) to serve as the final score for document d. All relevant documents with non-zero scores are preserved.

2.4.4    Boosting results
Once the relevant documents with their scores are computed, some boosting heuristics are applied to reorder the results. For the sake of clarity, this result set is denoted resultSet.

The entire search algorithm is repeated, this time on document titles only. This result set is denoted titleResultSet. The score of any document in resultSet that is also present in titleResultSet is increased by titleMatchScore ∗ 1.5. Lastly, intent boosting is performed, where intentBoost (0.2 in the system) is added to the scores of documents whose document class (lyrics, movie reviews, etc.) is explicit in the query and the document title. The class is simply determined by matching class terms against document titles.

After sorting the results, the 10 documents with the highest scores are selected.

3.   RESULTS AND ERROR ANALYSIS
The results obtained for the submitted runs are summarized in Table 1. For comparison, the best score among all the teams is stated in parentheses. Relative to other teams, the best overall NDCG@1, MAP and MRR scores were obtained, while the overall NDCG@5, NDCG@10 and RECALL were second best. The scores for cross-script NDCG@1, NDCG@5, MAP and RECALL were second best, and the rest were third.

                     Subtask 2 results
             Overall Score       Cross-script
  NDCG@1     0.7567 (0.7567)     0.3400 (0.4233)
  NDCG@5     0.6837 (0.6991)     0.3350 (0.3964)
  NDCG@10    0.6790 (0.7160)     0.3678 (0.4358)
  MAP        0.3922 (0.3922)     0.2960 (0.3060)
  MRR        0.5890 (0.5890)     0.3904 (0.4233)
  RECALL     0.4735 (0.4921)     0.4551 (0.5058)

Table 1: Results obtained along with the best score among all teams (in parentheses)

Producing only 10 results per query for submission affected the recall and, slightly, the MAP. Simply rerunning the evaluation with 20 results per query increased the overall recall to 0.5038 and the cross-script recall to 0.4751. The overall MAP was slightly increased to 0.4073, and the cross-script MAP to 0.3037.

These results give an insight into where the system fails and into the possible improvements. The search module relies heavily on the transliteration module for accurate transliteration. The transliteration module is, however, based on an unsophisticated statistical model, which sometimes hurts the overall score.

While calculating IDF values for words, similar words are treated as distinct, which is a bad choice because some variation of a low IDF word might actually get a high IDF value. This directly affects the pivot selection, where the IDF value plays a crucial role. Pivot selection also needs improvement so that it tries to cover the entire query instead of some fixed number of pivots. E.g., for a long query with a large number of high IDF terms, selecting a fixed number of pivots might leave out important parts of the query.

4.   CONCLUSION AND FUTURE WORK
In this paper, an approach for retrieving relevant documents from a mixed script document collection was discussed and analysed. A frequency based letter group mapping model for back transliteration was used to perform search on a bigram representation of the documents. Pivot selection was done to identify important parts of the query, around which the search was expanded.

There is a lot of scope for future work. The search module will directly benefit from a better back transliteration module. Sophisticated transliteration models can be used to test their effect on the search module. A script specific rule based similarity method can be applied for finding similar sounding words with different spellings. There are many constants used throughout the algorithm whose values were chosen based on heuristics and assumptions; their tuning will contribute significantly to the optimal behaviour of the system.

The problem with the current method of IDF calculation was discussed in the previous section. A potential solution would be to cluster similar sounding words and calculate IDF values for the clusters. Better IDF values would also allow for incorporating them into the document relevance scoring. Document scoring also needs deeper analysis and improvement.

5.   REFERENCES
[1] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval, volume 1. Cambridge University Press, Cambridge, 2008.
[2] K. Gupta, M. Choudhury, and K. Bali. Mining Hindi-English Transliteration Pairs from Online Hindi Lyrics. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), 2012, 2459-2465.
[3] P. Gupta, K. Bali, R. E. Banchs, M. Choudhury, and P. Rosso. Query Expansion for Mixed-script Information Retrieval. In Proceedings of the 37th International ACM SIGIR Conference on Research & Development in Information Retrieval, 2014, 677-686.

