=Paper=
{{Paper
|id=Vol-1587/T2-11
|storemode=property
|title=Mixed Script Ad hoc Retrieval using back transliteration and phrase matching through bigram indexing: Shared Task report by BIT, Mesra
|pdfUrl=https://ceur-ws.org/Vol-1587/T2-11.pdf
|volume=Vol-1587
|authors=Nimesh Ghelani,Sujan Kumar Saha,Amit Prakash
|dblpUrl=https://dblp.org/rec/conf/fire/GhelaniSP15
}}
==Mixed Script Ad hoc Retrieval using back transliteration and phrase matching through bigram indexing: Shared Task report by BIT, Mesra==
Nimesh Ghelani, Sujan Kumar Saha*, Amit Prakash†
Dept. of Computer Science and Engineering, Birla Institute of Technology, Mesra, India
nimeshghelani@gmail.com, sujan.kr.saha@gmail.com, aprakash@bitmesra.ac.in
*Advisor  †Helped out with result analysis

ABSTRACT
This paper describes an approach for Mixed-Script Ad hoc Retrieval, a subtask of the FIRE 2015 Shared Task on Mixed Script Information Retrieval. We participated in subtask 2 of the shared task, where a statistical model was used to carry out back transliteration to Devanagari script. To perform the search, a bigram-based index of the documents was used, and search was performed around pivot terms in the query.

Categories and Subject Descriptors
I.2.7 [Artificial Intelligence]: Natural Language Processing - Language parsing and understanding

Keywords
transliteration, information retrieval

1. INTRODUCTION
A large number of languages are written using indigenous scripts. The Internet has allowed anyone to easily post content to websites, and such content is usually written in the Roman script for technical reasons. This process of phonetically representing the words of a language in a non-native script is called transliteration. The situation where both documents and queries can be in more than one script, and the user expectation could be to retrieve documents across scripts, is referred to as Mixed Script Information Retrieval.

A challenge that search engines face while processing transliterated queries and documents is that of extensive spelling variation. For instance, the word dhanyavad ("thank you" in Hindi and many other Indian languages) can be written in the Roman script as dhanyavaad, dhanyvad, danyavad, danyavaad, dhanyavada, dhanyabad and so on.

Subtask 2 of the shared task focuses on Mixed-script Ad hoc retrieval: given a query in Roman or Devanagari script, the task was to fetch ranked documents based on their relevance. The documents consisted of mixed script content related to song lyrics, movie reviews, or astrology. To solve this task, Devanagari script was chosen as the base script. A back transliteration module was built using a frequency based statistical model on partitioned letter group matching. It was trained on transliterated Devanagari-Roman word pairs. The entire query and all documents were back transliterated to Devanagari script using this module. The search was performed on bigram indexes of the documents [1]. Spelling variations were handled using LCS (Longest Common Subsequence) based similarity. Pivot terms with high IDF (Inverse Document Frequency) values [1] were chosen from the query terms. Search was carried out around these pivot terms to perform phrase matching in the documents. Further boosting and reordering of results were performed using heuristics based on intent and document titles.

The rest of the paper discusses the detailed methodology in section 2, followed by the results in section 3 and finally the conclusion in section 4.

2. METHODOLOGY
The proposed method consists of two modules: the Back Transliteration module and the Searching module.

The Transliteration module is used for transliterating words expressed in Roman script to Devanagari script. This is needed because the search module assumes every word in the query and documents to be in Devanagari script. The reason Devanagari is used as the base script over Roman is that its phonemes follow concretely from the respective letters alone. In other words, each phoneme is strongly attached to its corresponding letters rather than to the whole word or to the neighboring phonemes. This helps the search module, which involves computing similarity scores between two similar sounding words.

2.1 Back Transliteration
Back transliteration is performed using a statistical model which is trained on a list of Devanagari-Roman transliterated word pairs.

2.1.1 Training
Training was performed on a list of 36,947 transliterated Hindi words (in Roman script) and their Devanagari representations [2].

Given a list of Devanagari-Roman word pairs, each Roman word Wr is represented as the character sequence ∧r1r2...rn$. The ∧ and the $ are added as characters to mark the beginning and the end of the string, respectively. Subsequent operations treat these markers just like any other character of the word. This is done in order to distinguish vowels which start the word from those which contribute to a phoneme by succeeding a consonant. The corresponding Devanagari word Wd has no ∧ and $ markers.

The Roman word Wr is partitioned (expression 1) into n groups of characters, where the length of each individual group is between 2 and nGramLimitR (set to 3 in the system). The corresponding Devanagari word Wd is also partitioned (expression 2) into groups whose length is between 1 and nGramLimitD (set to 3 in the system). The lower limit for Devanagari is 1 because single Devanagari letters often make a consistent sound contribution to the whole word.

  Wr = wgR1 ... wgRi ... wgRn,   2 ≤ |wgRi| ≤ nGramLimitR    (1)
  Wd = wgD1 ... wgDi ... wgDm,   1 ≤ |wgDi| ≤ nGramLimitD    (2)

For each possible pair of partitions of the two words where n = m, wgDi is added as a possible phoneme equivalent of wgRi. A global dictionary is maintained whose keys are Roman letter groups and whose values are, in turn, dictionaries mapping Devanagari letter groups to integer counts. This global dictionary is updated over all the Roman-Devanagari word pairs.

Partitioning is done using a straightforward approach with exponential time complexity. This is practical due to the low constant factor involved and because words are usually short.

The proposed approach produces a lot of noise due to blind position-based letter group matching. However, with enough training data, the relevant Devanagari letter groups bubble up over the outliers.

The model generated by training yields, for any Roman letter group lgR, a list of tuples (lgDi, fi), denoting that the Devanagari letter group lgDi was mapped to lgR fi times. The value score(lgDj, lgR) = fj / Σi fi is used as the confidence score for lgDj being a suitable candidate for replacing lgR.
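To make the counting step concrete, the training procedure could look like the following minimal Python sketch. This is our own illustration, not the authors' implementation: partitions, train_pair, the model structure, and the constants are hypothetical names, and '^'/'$' stand in for the ∧/$ markers.

```python
from collections import defaultdict

N_GRAM_LIMIT_R = 3  # max Roman letter-group length (minimum is 2)
N_GRAM_LIMIT_D = 3  # max Devanagari letter-group length (minimum is 1)

def partitions(word, min_len, max_len):
    """Yield every split of word into consecutive groups whose lengths
    lie in [min_len, max_len]; exponential, but words are short."""
    if not word:
        yield []
        return
    for size in range(min_len, min(max_len, len(word)) + 1):
        for rest in partitions(word[size:], min_len, max_len):
            yield [word[:size]] + rest

# model[roman_group][devanagari_group] -> co-occurrence count
model = defaultdict(lambda: defaultdict(int))

def train_pair(roman, devanagari):
    """Update letter-group counts from one transliterated word pair."""
    marked = '^' + roman + '$'        # boundary markers, treated as letters
    for rp in partitions(marked, 2, N_GRAM_LIMIT_R):
        for dp in partitions(devanagari, 1, N_GRAM_LIMIT_D):
            if len(rp) == len(dp):    # only align partition pairs with n = m
                for rg, dg in zip(rp, dp):
                    model[rg][dg] += 1
```

The double loop over partition pairs is the exponential step mentioned above; it stays cheap because group lengths of at most 3 keep the number of partitions of a short word small.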
2.1.2 Transliteration
This module determines possible transliterations of an input word in Roman script to Devanagari script. It produces multiple results and ranks them on the basis of their confidence scores.

As with the Roman words in the training phase, the input word Wr is surrounded by the ∧ and $ markers. Partitioning is performed in the same way as the partitioning of Roman words during training (expression 3). For any partition, each Roman letter group is matched to the Devanagari letter group with the maximum score, using the dictionary generated during training. These matched letter groups are concatenated to get the final result Wd, with a consolidated confidence score (equation 4) for the result. The result with the maximum confidence score over all partitions is returned as the best Devanagari candidate for the given Roman word.

  Wr = lgR1 ... lgRi ... lgRn,   2 ≤ |lgRi| ≤ nGramLimitR    (3)
  Wd = lgD1 ... lgDi ... lgDn

  score_consolidated = ∏i ( |lgRi| × score(lgDi, lgRi) )    (4)
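The decoding step can be sketched in the same spirit, reusing model and partitions from the training sketch above. Again, this is an assumed rendering of expressions 3 and 4, not the authors' code; note that for a fixed Roman group the highest count also has the highest score, so picking the max-count Devanagari group suffices.

```python
def group_score(rg, dg):
    """score(lgD, lgR) = f_lgD / sum of all counts observed for lgR."""
    counts = model.get(rg)
    if not counts:
        return 0.0
    return counts[dg] / sum(counts.values())

def transliterate(roman):
    """Best Devanagari candidate for a Roman word (equations 3 and 4)."""
    best, best_score = None, 0.0
    for rp in partitions('^' + roman + '$', 2, N_GRAM_LIMIT_R):
        groups, conf = [], 1.0
        for rg in rp:
            counts = model.get(rg)
            if not counts:                     # letter group never seen in training
                conf = 0.0
                break
            dg = max(counts, key=counts.get)   # highest-scoring Devanagari group
            groups.append(dg)
            conf *= len(rg) * group_score(rg, dg)
        if conf > best_score:
            best, best_score = ''.join(groups), conf
    return best, best_score
```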
2.2 Data Structures
2.2.1 Word Dictionary (wordDict)
The wordDict supports two operations, insert and query. insert(w) inserts the word w into the dictionary. query(w, threshold, maxResults) returns at most the top maxResults words present in the dictionary whose similarity with w is at least threshold.

wordDict is implemented using a hash table, with words as keys. query is done by iterating over all the keys and computing each key's similarity with the query word using the Longest Common Subsequence (LCS) (equation 5). All words wi in the dictionary such that similarity(w, wi) ≥ threshold are sorted in decreasing order of their similarity score with the query w, and the top maxResults words are returned.

  similarity(w1, w2) = LCS(w1, w2) / max(|w1|, |w2|)    (5)
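A minimal sketch of equation 5 and the two wordDict operations might look as follows. The WordDict class name and the textbook dynamic-programming LCS are our assumptions; the paper only specifies the hash-table layout and the linear scan over all keys.

```python
def lcs_length(a, b):
    """Length of the Longest Common Subsequence, by dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a):
        for j, cb in enumerate(b):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if ca == cb
                                else max(dp[i][j + 1], dp[i + 1][j]))
    return dp[len(a)][len(b)]

def similarity(w1, w2):
    """Equation 5: LCS(w1, w2) / max(|w1|, |w2|)."""
    return lcs_length(w1, w2) / max(len(w1), len(w2))

class WordDict:
    """Hash-table-backed dictionary with exact insert and fuzzy query."""
    def __init__(self):
        self.words = set()

    def insert(self, w):
        self.words.add(w)

    def query(self, w, threshold, max_results):
        """Top max_results stored words with similarity(w, .) >= threshold."""
        hits = [(similarity(w, wi), wi) for wi in self.words]
        hits = [(s, wi) for s, wi in hits if s >= threshold]
        hits.sort(key=lambda t: t[0], reverse=True)  # decreasing similarity
        return [wi for _, wi in hits[:max_results]]
```

As the paper describes, query scans every stored key, so its cost grows linearly with the dictionary size times the LCS cost; no approximate index is used.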
2.3 Preprocessing and Indexing the documents
Around 60,000 documents constitute the search pool. The documents to be searched are mixed script text documents. Since the base script was chosen to be Devanagari, transliteration is performed wherever required to ensure the entire document is in Devanagari script. Indexing is done separately for the document titles and the contents.

For each word w, before(w) and after(w) are built, which are augmented wordDicts of the words occurring just before and after w in any document, respectively. before(w) and after(w) represent sets of tuples (wx, (doc1, ..., doci, ..., docn)), where doci refers to a document in which wx occurred before/after w. These are built both for the content and for the title of the documents.

The searching module makes use of the IDF (Inverse Document Frequency) values of the words. IDF values are calculated for each Devanagari word (original or transliterated) in the document content and title: IDF(w) = log(N / (1 + n)), where the word w occurs in n documents out of N total documents.

A global wordDict is populated with all the Devanagari words (original or transliterated) found in the documents. It is denoted docWordDict and serves as the unigram index of the documents.

2.4 Searching
Input → Query as a list of words in mixed script.
Output → List of documents sorted by their relevance score.

Any query word in Roman script is replaced by transliterating it to Devanagari script using the transliteration module (taking the highest scored result).

2.4.1 Pivot Selection
A pivot term is selected from the query terms, around which the rest of the query is expanded over the documents. The selection criterion for the pivot terms is their IDF value. The top Npivots (set to 3 in the system) distinct query terms, sorted in decreasing order of their IDF values, are chosen as pivot terms.

In some cases, there are Roman words whose correct Devanagari representations are present in the documents, but the representations produced by the module are not. This is due to one of two reasons: 1. the result produced by the transliteration module is incorrect; 2. there are multiple correct Devanagari representations of that Roman word. Since the transliteration module always tends to produce a result sounding as close as possible to the correct one, docWordDict's fuzzy query is used to fetch similar words present in the documents with a similarity score above a reasonable threshold. For a word w, if the most similar word found has a similarity score above 0.95, that word's IDF value is taken as the IDF value of w. Otherwise, the maximum IDF value among the similar words is chosen.

2.4.2 Pivot expansion
A bigram traversal query is performed on each of the Npivots pivots, and their scores are independently added to vote for the relevance of the results.

For each pivot term, a candidate word list is fetched using docWordDict's query. The candidate words have a similarity score above similarityThreshold (set to 0.7 in the system). Processing multiple similar words instead of one accounts for incorrectness in the transliteration module and for multiple similar sounding representations of the pivot term in the documents. Bigram traversal (next section) is performed on each of the candidate words and the results are combined into the result set for that pivot term. If two separate queries for the same pivot term (with different candidate words) return a score for the same document, the maximum score is kept in the combined result.

Result sets of the pivot terms are combined by adding the scores of overlapping documents, voting for their relevance.
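Under these definitions, the IDF computation of section 2.3 and the pivot choice of section 2.4.1 reduce to a few lines. This is a sketch under the assumption that document frequencies are available in a plain doc_freq mapping; all names here are ours.

```python
import math

def idf(word, doc_freq, total_docs):
    """IDF(w) = log(N / (1 + n)), where w occurs in n of N documents."""
    return math.log(total_docs / (1 + doc_freq.get(word, 0)))

def select_pivots(query_terms, doc_freq, total_docs, n_pivots=3):
    """Section 2.4.1: top n_pivots distinct query terms by decreasing IDF."""
    ranked = sorted(set(query_terms),
                    key=lambda w: idf(w, doc_freq, total_docs),
                    reverse=True)
    return ranked[:n_pivots]
```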
2.4.3 Candidate word bigram traversal
Input → word, pivot, original query.
Output → List of documents with relevance scores.

The idea is to traverse across bigrams to efficiently match variable length phrases in the document content. Traversal is performed to the left and to the right of the pivot position in the original query separately, and the results are later combined.

If the original query has terms q1 q2 ... qn, each qi refers to a list of words wj,qi such that similarity(qi, wj,qi) ≥ similarityThreshold. A traversal can be defined from wx,qi to wy,qj such that |i − j| = 1. The state variables for wy,qj during traversal are computed from the variables of wx,qi. It should be noted that a state is defined by wy,qj as well as by the path taken to reach that word; for the sake of short variable names the path is omitted from the notation, but it remains part of the state.

The state variables consist of the result document set (doc_set(w)), and a confidence score (score(d, w)) along with a phrase count (count(d, w)) corresponding to each document d in the result document set.

The first traversal always begins from wx,qp (qp is the current pivot). The initial result document set doc_set(wx,qp) consists of the documents which have wx,qp in their body. The initial confidence score for a document d is score(d, wx,qp) = similarity(wx,qp, qp). The initial phrase count count(d, wx,qp) is simply the number of times wx,qp occurs in the document body. bigram_doc_set(x, y) denotes the set of documents in which the bigram xy occurs, and bigram_count(d, x, y) the number of times the bigram xy occurs in document d.

When traversing from wx,qi to wy,qj, assuming i + 1 = j, the state variables for wy,qj are calculated as shown in equation 6. This traversal can also be interpreted as an effort to match the bigram qi qj of the query by matching a similar bigram xy in the document. The new result document set consists of the documents from doc_set(wx,qi) which have the bigram xy present in their content. For a document d, the new count is the minimum of its count in the previous state and the number of times the bigram xy occurs in the content of d. A normalized value of this count (equation 7), along with the similarity between qj and wy,qj (refer to equation 5) and the score of d in the previous state, is used to calculate the new score for document d. The count is normalized in order to have a regulated effect on the scoring.

  doc_set(wy,qj) = doc_set(wx,qi) ∩ bigram_doc_set(wx,qi, wy,qj)
  count(d, wy,qj) = min(count(d, wx,qi), bigram_count(d, wx,qi, wy,qj))    (6)
  score(d, wy,qj) = similarity(qj, wy,qj) + score(d, wx,qi) × normalizeCount(count(d, wy,qj))

  normalizeCount(x) = min(2, 1 + (x − 1)/3)    (7)

For j + 1 = i, it suffices to swap the parameters of bigram_doc_set and bigram_count. Using the above traversal rules, traversal starts from word and proceeds to the left and to the right separately. If the pivot query term is qp, then scoreLeft(d) = score(d, wx,qL) such that it is non-zero and L is as small as possible. Similarly, scoreRight(d) = score(d, wy,qR) such that it is non-zero and R is as large as possible. The query phrase from qL to qR was thus matched in document d. It should be noted that this may not strictly be true, since traversals were performed through bigrams and not larger n-grams; however, it serves as a decent assumption. The left and right scores are combined using (leftScore + rightScore) × (R − L + 1) to serve as the final score for document d. All relevant documents with non-zero scores are preserved.
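A single rightward traversal step (equations 6 and 7) can be sketched as follows, reusing similarity from the equation 5 sketch. The per-document (count, score) tuples, and bigram_doc_set/bigram_count as callables over the indexes of section 2.3, are our assumed representation, not the authors' code.

```python
def normalize_count(x):
    """Equation 7: min(2, 1 + (x - 1) / 3) caps the phrase-count boost."""
    return min(2.0, 1.0 + (x - 1) / 3.0)

def traverse_step(state, wx, wy, qj, bigram_doc_set, bigram_count):
    """One rightward traversal step (equation 6).

    state maps each surviving document d to (count, score) for the word wx;
    the step keeps only documents whose content contains the bigram wx wy,
    then updates count and score for the word wy matched against query
    term qj."""
    sim = similarity(qj, wy)                    # equation 5
    new_state = {}
    for d in set(state) & bigram_doc_set(wx, wy):
        prev_count, prev_score = state[d]
        count = min(prev_count, bigram_count(d, wx, wy))
        new_state[d] = (count, sim + prev_score * normalize_count(count))
    return new_state
```

A leftward step is identical with the bigram arguments swapped, matching the j + 1 = i case above.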
2.4.4 Boosting results
Once the relevant documents with their scores are computed, some boosting heuristics are applied to reorder the results. For the sake of clarity, this result set is denoted resultSet.

The entire search algorithm is repeated, this time just on the document titles. This result set is denoted titleResultSet. The score of any document in resultSet that is also present in titleResultSet is increased by titleMatchScore × 1.5.

Lastly, intent boosting is performed, where intentBoost (0.2 in the system) is added to the scores of documents whose document class (lyrics, movie reviews, etc.) is explicit in both the query and the document title. The class is simply determined by matching class terms against document titles.

After sorting the results, the 10 documents with the highest scores are selected.

3. RESULTS AND ERROR ANALYSIS
The results obtained for the submitted runs are summarized in Table 1. For comparison, the best score among all the teams is stated in parentheses. Relative to the other teams, the best overall NDCG@1, MAP and MRR scores were obtained, while the overall NDCG@5, NDCG@10 and RECALL scores were second best. The scores for cross-script NDCG@1, NDCG@5, MAP and RECALL were second best, and the rest were third.

Table 1: Results obtained, along with the best score among all teams (in parentheses)

  Metric   | Overall score   | Cross-script
  NDCG@1   | 0.7567 (0.7567) | 0.3400 (0.4233)
  NDCG@5   | 0.6837 (0.6991) | 0.3350 (0.3964)
  NDCG@10  | 0.6790 (0.7160) | 0.3678 (0.4358)
  MAP      | 0.3922 (0.3922) | 0.2960 (0.3060)
  MRR      | 0.5890 (0.5890) | 0.3904 (0.4233)
  RECALL   | 0.4735 (0.4921) | 0.4551 (0.5058)

Producing only 10 results per query for the submission affected the recall and, slightly, the MAP. Rerunning the evaluation with 20 results per query increased the overall recall to 0.5038 and the cross-script recall to 0.4751. The overall MAP increased slightly to 0.4073, and the cross-script MAP to 0.3037.

These results give an insight into where the system fails and into possible improvements. The search module relies heavily on the transliteration module for accurate transliteration. The transliteration module is, however, based on an unsophisticated statistical model, which sometimes hurts the overall score.

While calculating IDF values, similar words are treated as distinct, which is a bad choice because a variation of a low-IDF word might actually get a high IDF value. This directly affects pivot selection, where the IDF value plays a crucial role. Pivot selection also needs improvement so that it tries to cover the entire query instead of a fixed number of pivots; e.g., for a long query with a large number of high-IDF terms, selecting a fixed number of pivots might leave out important parts of the query.

4. CONCLUSION AND FUTURE WORK
In this paper, an approach for retrieving relevant documents from a mixed script document collection was discussed and analysed. A frequency based letter group mapping model for back transliteration was used to perform search on a bigram representation of the documents. Pivot selection was done to identify important parts of the query, around which the search was expanded.

There is a lot of scope for future work. The search module would directly benefit from a better back transliteration module; more sophisticated transliteration models can be used to test their effect on the search module. A script specific rule based similarity method can be applied for finding similar sounding words with different spellings. There are many constants used throughout the algorithm whose values were chosen based on heuristics and assumptions; tuning them would contribute significantly to the optimal behaviour of the system.

The problem with the current method of computing IDF was discussed in the previous section. A potential solution would be to cluster similar sounding words and calculate IDF values over the clusters. Better IDF values would also allow incorporating them into the document relevance scoring. Document scoring itself also needs deeper analysis and improvement.

5. REFERENCES
[1] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval, volume 1. Cambridge University Press, Cambridge, 2008.
[2] K. Gupta, M. Choudhury, and K. Bali. Mining Hindi-English Transliteration Pairs from Online Hindi Lyrics. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), 2012, 2459-2465.
[3] P. Gupta, K. Bali, R. E. Banchs, M. Choudhury, and P. Rosso. Query Expansion for Mixed-script Information Retrieval. In Proceedings of the 37th International ACM SIGIR Conference on Research & Development in Information Retrieval, 2014, 677-686.