Exploiting Named Entity Mentions Towards Code Mixed IR: Working Notes for the UB System Submission for MSIR@FIRE'16

Nikhil Londhe (SUNY Buffalo, nikhillo@buffalo.edu)
Rohini K. Srihari (SUNY Buffalo, rohini@buffalo.edu)

ABSTRACT
A sizable percentage of online user generated content is susceptible to code switching and code mixing owing to a variety of reasons. An expected consequence is that ad hoc user queries on such data are also inherently code mixed. This paper presents our solution for one such scenario: information retrieval on code mixed Hindi-English tweets. We explore techniques in information extraction, clustering and query expansion as part of this work and present our results on the test dataset. Our system achieved a MAP of 0.0217 on the test set and placed third in the rankings.

CCS Concepts
• Information systems → Multilingual and cross-lingual retrieval; • Computing methodologies → Natural language processing

1. INTRODUCTION
A large number of languages such as Russian, Hindi and Arabic, although they have indigenous writing scripts, are often expressed in the Roman script in online communication due to a variety of socio-cultural reasons. Furthermore, it is not uncommon to see multilingual speakers switching between different languages when expressing themselves in an informal setting like social media. Hence, a large percentage of user generated content, typically on social media, is code switched or code mixed, i.e., it contains content in more than one language that may or may not involve multiple writing systems.

Given the nature of the content, it is not unexpected that ad hoc user queries on such data are also code mixed. For example, consider Table 1, which lists some sample queries and target tweets that illustrate this scenario.

The given problem was formally introduced recently by [4]. The second sub-task organized as part of the Mixed Script Information Retrieval track [2] deals with the same problem, described in detail as follows. Given a topic T, specified using a name, description and narrative; a set of associated English-Hindi code mixed tweets tT; and a set of similarly code mixed queries QT against the topic, the task involves returning the top k results for every such query. A summary of the training dataset is presented in Table 2. Some of the key challenges can be enumerated as:

1. Short and comparable document lengths: Judging relevance on curated tweets can be especially hard given that not only are the documents short in length, giving little context to work with, but also that all documents are comparable in length and are definitely somewhat relevant to the query / topic to begin with.
2. Mixed language data: Given that both the tweets themselves and the queries are in one or more languages, some sort of cross lingual indexing and querying scheme must be employed to ensure relevant results.
3. Spelling variants: Specifically applicable given that one of the languages (Hindi) is expressed in Roman script via transliteration. Since no standardized spelling scheme exists or is widely used, the same words can appear with different spellings.

The rest of this paper is organized as follows. We begin by taking a closer look at the training dataset in Section 2 and draw some inferences with regard to the data and queries that act as guiding principles for the different techniques used. In Section 3, we describe the different data processing, indexing and query processing techniques that we tried as part of our initial experiments. In Section 4, we provide details of the runs we submitted and the exact system composition for each run. Finally, in Section 5, we present our system performance on the test dataset and an analysis of the results. We conclude by discussing future work and potential system improvements.
2. TRAINING DATASET
The training dataset can be summarized as follows:

1. Topics: a total of 10 topics
2. Queries: a total of 23 queries, ranging from 1 to 4 per topic
3. Tweets: a total of 6142 tweets, split between topics/queries and ranging from 34 (Topic 9 / Query 21) to 3531 (Topic 1 / Query 4)

S.no | Sample Query        | Example Tweet
1    | delhi me election   | rohit sharma ko dilli ka chunav ladwao !! newsmaker of the day
2    | india ki haar       | #aapkasting thanks india ki haar ka dukh bhula diya tum logo ne
3    | nirbhayaa ka rapist | asharam bapu ka kehna hai mujhe nabalig samajhkar hi chod do . #nirbhayarapistout
Table 1: Code Mixed Tweets and Queries

Topic | Num Tweets | Vocabulary | % OOV
1     | 3894       | 12548      | 66.87
2     | 582        | 3356       | 60.58
3     | 82         | 690        | 60.86
4     | 54         | 483        | 56.78
5     | 410        | 2556       | 67.03
6     | 532        | 2641       | 59.59
7     | 51         | 487        | 66.15
8     | 104        | 901        | 62.55
9     | 32         | 307        | 55.17
10    | 425        | 1900       | 57.65
Table 2: Training Dataset summary

Based on the query types and available data, we make the following assumptions / observations:

1. Each query is an informational query about some Named Entity (NE).
2. Every query can be completely described by a triple A[-rel-B], where [xx] indicates an optional clause.
3. At least one of A or B is a Named Entity.
4. When not an NE, A or B is most likely a noun.
5. NEs are usually represented by a fixed number of variants.
6. Nouns, on the other hand, could be represented by any number of synonyms drawn from both languages (chunaav versus election, for example).
7. The ordering of the clause (Noun-rel-NE versus NE-rel-Noun) depends on the underlying language (chunaav in delhi versus dilli mein election).

Apart from the training dataset, no relevance judgments were provided, binary or nuanced. The narrative accompanying each topic provided some insight into what was considered relevant and non-relevant, but there were no clues to differentiate between two tweets that both referenced the same number of query tokens. Thus, we need to derive some notion of aboutness or information content for each tweet to determine relevance. A possible search strategy can therefore be formulated as follows (an illustrative sketch of the query decomposition from assumption 2 is given after this list):

1. Determine the NE or NEs within the query and the tweets - this could be a fixed list or processed from Wikipedia or other Knowledge Bases (KBs).
2. Find tweets that match as many tokens, or as much of the remaining query, as possible.
3. Find some proxy for tweet aboutness and use it to determine relevance.
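To make the A[-rel-B] decomposition of assumption 2 concrete, the following is a minimal sketch in Python. The lists of known NEs and relation markers are hypothetical stand-ins, and the routine is illustrative only; the system described in this paper does not depend on this exact code.

```python
# Illustrative sketch of the A[-rel-B] query decomposition (assumption 2).
# KNOWN_NES and REL_WORDS are hypothetical stand-ins, not system resources.

KNOWN_NES = {"delhi", "dilli", "india", "nirbhaya", "nirbhayaa"}  # assumed NE lookup list
REL_WORDS = {"me", "mein", "ka", "ki", "ke", "in", "of"}          # assumed relation markers

def decompose(query):
    """Split a query into (A, rel, B); rel and B are None when absent."""
    tokens = query.lower().split()
    rel_idx = next((i for i, t in enumerate(tokens) if t in REL_WORDS), None)
    if rel_idx is None:
        return " ".join(tokens), None, None
    return (" ".join(tokens[:rel_idx]),
            tokens[rel_idx],
            " ".join(tokens[rel_idx + 1:]))

if __name__ == "__main__":
    for q in ["delhi me election", "india ki haar", "nirbhayaa ka rapist"]:
        a, rel, b = decompose(q)
        ne_side = "A" if a in KNOWN_NES else ("B" if b in KNOWN_NES else "?")
        print(q, "->", (a, rel, b), "| NE side:", ne_side)
```

On the three sample queries from Table 1, this yields (delhi, me, election), (india, ki, haar) and (nirbhayaa, ka, rapist), with the NE on the A side in each case.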
Using these three guiding principles, we now present the different techniques we tried and how they contributed to our final system run.

3. INDEXING EXPERIMENTS
In this section, we describe the different experiments we conducted and the various system components that make up our final runs.

3.1 Apache Solr
We used Apache Solr, an open source, scalable text search engine written in Java and built over Apache Lucene, as the back-end for our system. It is fairly straightforward to index any sort of data in Solr after defining a schema. We present our full schema definition in Table 3. Apart from the data already provided, we augment it with two fields: NE (see Section 3.2) and cluster id (see Section 3.3).

S.no | Field name | Type    | Description
1    | id         | String  | Unique identifier for each tweet, UUID generated from the raw text
2    | text       | Text    | Raw text of the tweet, stored after tokenization and minimal processing
3    | qnum       | Integer | Query number
4    | topicid    | Integer | Topic number
5    | NE         | Text    | List of Named Entities referenced within the tweet
6    | cluster id | Integer | Union of cluster ids for all constituent tokens within the tweet
Table 3: Solr Schema definition

3.2 NE tagger
We implemented the NE tagger in two parts: (a) a regex based extractor that generates a list of known NEs, and (b) a simple string matching based tagger, explained as follows. For the NE extractor, we used the topic description files and automatically extracted the longest possible sequences of tokens, where a sequence is delineated by tokens that begin with a capital letter. We then sorted the extracted list and eliminated duplicates and sub-sequences. However, for every sub-sequence that we eliminated, we flagged the parent entity as capable of being present as a sub-string. For example, the entity Narendra Modi could be present as a whole or as the individual tokens Narendra or Modi. At the end of this step, we had thus generated a list of NEs (of interest, at least) and a boolean flag indicating sub-sequence occurrence. A snapshot of the lists extracted from the training dataset is shown in Figure 1.

Figure 1: Examples of extracted NEs

The actual tagger followed a similar philosophy: starting from the above list, it tagged each tweet as follows. We simply created a mapping from token to entity, where the token was either the full entity or any of its sub-tokens if the entity could be present partially. For example, for Narendra Modi, we created two mappings, Narendra → Narendra Modi and Modi → Narendra Modi. For each tweet, we then performed a look-up and added the matching NE to the index if a match was found. Further, we also added a Solr plugin for the modified Levenshtein distance [7] to match NEs with spelling variants arising from transliteration across the two languages. A brief sketch of both steps is given below.
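The sketch below follows the two-part description above (maximal runs of capitalised tokens, de-duplication, sub-sequence elimination with a "partial match" flag, then a token-to-entity lookup over tweets). The regular expression, helper names and example text are our own assumptions, and the fuzzy matching added via the MESS Solr plugin [7] is omitted.

```python
# Minimal sketch of the NE extractor and tagger described in Section 3.2.
# Assumptions: simple whitespace tokenisation, one description string as input.
import re

def extract_entities(description_text):
    """Return {entity: allows_partial_match} from a topic description:
    maximal runs of capitalised tokens, de-duplicated, with eliminated
    sub-sequences flagged on their parent entity."""
    runs = re.findall(r"(?:\b[A-Z][\w'-]*\b ?)+", description_text)
    candidates = sorted({r.strip() for r in runs}, key=len, reverse=True)
    entities = {}
    for cand in candidates:
        parent = next((e for e in entities if cand != e and cand in e), None)
        if parent is not None:
            entities[parent] = True   # sub-sequence eliminated; parent may appear partially
        else:
            entities[cand] = False
    return entities

def build_token_map(entities):
    """Token -> entity lookup used by the tagger (full entity plus its
    sub-tokens when partial occurrence is allowed)."""
    token_map = {}
    for entity, allows_partial in entities.items():
        token_map[entity.lower()] = entity
        if allows_partial:
            for tok in entity.lower().split():
                token_map[tok] = entity
    return token_map

def tag_tweet(tweet, token_map):
    """Return the set of entities referenced in a tweet via exact look-ups."""
    return {token_map[t] for t in tweet.lower().split() if t in token_map}

if __name__ == "__main__":
    ents = extract_entities("Narendra Modi addressed a rally in Delhi. Modi promised reforms.")
    tmap = build_token_map(ents)
    print(ents)                                            # {'Narendra Modi': True, 'Delhi': False}
    print(tag_tweet("modi ji ka bhashan dilli me", tmap))  # {'Narendra Modi'}
```

Note that the exact look-up misses transliteration variants such as dilli for Delhi; in our system those are handled at query time by the fuzzy matching plugin rather than by the tagger itself.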
3.3 Clustering
Finally, we attempted to mine cross-lingual equivalents and spelling variants for different words. Much akin to the query expansion work of [5], we first explored the use of Brown clustering [3], a hierarchical clustering algorithm that exploits word distributional similarities. We used the freely available Python implementation [6] and tested a few cluster sizes. We found a value of 100 clusters to be a reasonable balance between unrelated words being grouped together (10-50 clusters) and fragmented clusters (500-1000). We also found that, for the given setting, words with similar spellings (e.g. jaise, jaisey, jaisay) or inflectional variants (e.g. ka, ke, ko) were assigned to the same cluster. As an unintended side-effect, however, the clustering also assigned similarly spelled English words with the same POS labels (e.g. verbs like cooking, cooling) to the same cluster.

We therefore automatically generated a synonym list from the above results using a two step process. First, we removed all words that were found in a standard English dictionary; this was done to remove such false positives. Second, we deemed a set of words to be equivalents if their edit distance was less than a pre-determined threshold of three. This threshold was determined manually, by trial and error over a small test set. All matching pairs thus generated were then combined into a single synonym file, as partially shown in Figure 2; a short sketch of this step is given below.

Figure 2: Examples of generated synonyms
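The following sketch shows the two-step synonym mining under our own assumptions: Brown clusters are assumed to be available as a {cluster_id: [words]} mapping (e.g. loaded from the output of the implementation in [6]), an English word list is assumed to be at hand, and the output format is a placeholder rather than the exact file used in our runs.

```python
# Sketch of the synonym mining in Section 3.3: drop dictionary English words,
# then pair cluster-mates whose edit distance is below the threshold of three.
from itertools import combinations

def edit_distance(a, b):
    """Plain Levenshtein distance (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def mine_synonyms(clusters, english_words, threshold=3):
    """Pair words within the same cluster whose edit distance is below the
    threshold, after removing standard-dictionary English words."""
    pairs = []
    for words in clusters.values():
        kept = [w for w in words if w not in english_words]
        for a, b in combinations(kept, 2):
            if edit_distance(a, b) < threshold:
                pairs.append((a, b))
    return pairs

if __name__ == "__main__":
    clusters = {17: ["jaise", "jaisey", "jaisay", "cooking", "cooling"]}  # assumed cluster output
    english = {"cooking", "cooling"}                                      # assumed dictionary subset
    for a, b in mine_synonyms(clusters, english):
        print(f"{a},{b}")   # one comma-separated equivalence per line, Solr synonym-file style
```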
Having described the individual components that make up our system, we now turn our attention to the overall system in the following section.

4. SYSTEM DESCRIPTION
The basic system architecture is shown in Figure 3. It includes the following components, introduced earlier: Solr (and its associated index), the query processor (which includes the NE tagger) and the extracted synonyms. Two additional inputs are (a) the search configuration, which defines the different weights, the choice of scoring mechanism, etc., and (b) the topic narrative files, which are additionally ingested as described below. Note that the diagram shows three different configurations, additive in nature, one for each of the three runs submitted.

Figure 3: System Diagram

Before we describe the three runs, we present some additional details on the query processor and the Solr configuration. Continuing with the schema presented in Table 3, we are given a query Q of n tokens q1, q2, ..., qn and pre-determined query weights Wf = w1, w2, ..., wk for fields f1, f2, ..., fk as stored in the configuration. The processor then partitions the query, generating field level parses Qf = q1f, q2f, ..., qkf, and passes the final query to Solr as Qf · Wf.

Secondly, as mentioned in Section 1, we also needed to establish some notion of relevance. Given the lack of any additional information, we chose a simple voting based scheme. Solr supports a variety of ranking models, and we therefore configured four distinct Solr instances, each with a different relevance scheme, as enumerated below:

1. Lucene similarity: an implementation of the classic tf-idf similarity that uses √tf and 1 + log(N/(df + 1)) as normalization factors.
2. Okapi similarity: a probabilistic relevance model as introduced by [8].
3. Language model: uses Jelinek-Mercer smoothing on a language model, as proposed by [9].
4. DFR similarity: the Divergence From Randomness models as suggested by [1]. We used the Inverse Expected document frequency (I(ne)) model with Laplacian smoothing and Zipfian normalization in our experiments.

For the first three models we used the default settings; for DFR, we manually experimented with a few queries and their returned results to settle on the selected model. For a given query, the system executes it in parallel on the four subsystems, normalizes the returned scores and combines the results into a single ranked list. An illustrative sketch of the query construction and score combination is given at the end of this section.

We enumerate the three runs as follows:

1. Run 1 - Named Entity boosts: In the first run, we performed two levels of query matching. The first was boosting documents based on their NE matches from the query, i.e., the query was parsed to extract NEs and each document (tweet) that matched the given NE was given a small numeric boost. The second level of boosting utilized phrase matching, i.e., documents that more closely matched the input query phrase were ranked higher than those that did not.
2. Run 2 - Synonym expansion: We expanded the given query using the synonyms described in Section 3.3, on top of the ranking mechanism presented for Run 1.
3. Run 3 - Narrative based weighting: For the final run, we extracted NEs from the provided topic narratives and assigned positive or negative boosts based on the associated usage of "relevant" and "not relevant". These additional weights were applied on top of the scheme presented in Run 2.
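The sketch below illustrates one way to realise the weighted field query Qf · Wf and the normalise-and-combine step across the four differently configured Solr instances. The Lucene-style query string, the max-score normalisation and the placeholder ranked lists are our assumptions for illustration, not the exact configuration used in our runs; the search call against each instance is deliberately left abstract.

```python
# Sketch of the query-side processing in Section 4: build a weighted
# field query (Lucene syntax, field:(terms)^weight), then fuse the ranked
# lists returned by the four Solr instances via max-normalised score sums.
from collections import defaultdict

def build_weighted_query(field_parses, field_weights):
    """Combine per-field query parses q_f with their weights w_f into a
    single Lucene-style query string, i.e. Qf . Wf."""
    clauses = []
    for field, terms in field_parses.items():
        weight = field_weights.get(field, 1.0)
        clauses.append(f'{field}:({terms})^{weight}')
    return " OR ".join(clauses)

def fuse(result_lists, top_k=10):
    """Normalise each instance's scores by its maximum and sum per document,
    a simple voting-style combination across the four ranking models."""
    combined = defaultdict(float)
    for results in result_lists:               # one (doc_id, score) list per Solr instance
        if not results:
            continue
        max_score = max(score for _, score in results)
        for doc_id, score in results:
            combined[doc_id] += score / max_score
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

if __name__ == "__main__":
    q = build_weighted_query({"text": "dilli chunav", "NE": '"Delhi"'},
                             {"text": 1.0, "NE": 2.0})
    print(q)  # text:(dilli chunav)^1.0 OR NE:("Delhi")^2.0
    # Hypothetical ranked lists from the four differently configured instances:
    lists = [[("t1", 4.2), ("t2", 3.9)], [("t2", 8.0), ("t3", 1.0)],
             [("t1", 0.7)], [("t2", 0.5), ("t1", 0.4)]]
    print(fuse(lists))
```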
5. RESULTS & CONCLUSIONS
We present a summary of the test dataset and the results from each of the runs in Table 4.

Query no | Topic | Run 1  | Run 2 | Run 3  | Avg MAP
1        | 1     | 0.0    | 0.0   | 0.0    | 0.0
2        | 1     | 0.0    | 0.0   | 0.0    | 0.0
3        | 1     | 0.0    | 0.0   | 0.0    | 0.0
4        | 1     | 0.0    | 0.0   | 0.0    | 0.0
5        | 1     | 0.0    | 0.0   | 0.0    | 0.0
6        | 2     | 0.003  | 0.003 | 0.003  | 0.003
7        | 2     | 0.177  | 0.1   | 0.1    | 0.126
8        | 2     | 0.019  | 0.008 | 0.019  | 0.015
9        | 2     | 0.0    | 0.0   | 0.0    | 0.0
10       | 3     | 0.007  | 0.022 | 0.007  | 0.012
11       | 3     | 0.05   | 0.05  | 0.05   | 0.05
12       | 3     | 0.004  | 0.009 | 0.004  | 0.006
Average  |       | 0.0217 | 0.016 | 0.0152 | 0.0176
Table 4: Test Dataset Summary and Results

We make the following observations:

• Overall, the best performance came from our baseline system. Synonym expansion performs second best and narrative based weighting performs the worst.
• In hindsight, the techniques we used perhaps improved recall at the cost of precision.
• The only queries where synonym expansion works better than having no synonyms are, perhaps, those where there is a spelling mismatch between query and text (queries 10 and 12).
• Thus, we could have used a "pessimistic" expansion, applying it only when adequate results are not otherwise available.
• When multiple NEs are present (queries 1-3), our system gets confused and returns non-relevant results.

As part of future improvements, we thus need to define a better notion of relevance and aboutness at the tweet level, specifically to ascertain the information content of each tweet. In summary, our work showed three important results: (a) Named Entities have a very important role in IR on tweets, and system performance could be improved by using more sophisticated NE taggers; (b) simple clustering on topic-wise tweets can give considerable insight into the constituent words, enough to derive spelling and inflectional variants; and (c) in the given setting, precision is very sensitive to small changes and hence typical recall improving techniques should be used as a "last resort".

6. REFERENCES
[1] G. Amati and C. J. van Rijsbergen. Probabilistic models for information retrieval based on divergence from randomness. 2003.
[2] S. Banerjee, K. Chakma, S. K. Naskar, A. Das, P. Rosso, S. Bandyopadhyay, and M. Choudhury. Overview of the Mixed Script Information Retrieval (MSIR) at FIRE. In Working Notes of FIRE 2016 - Forum for Information Retrieval Evaluation, Kolkata, India, December 7-10, 2016, CEUR Workshop Proceedings. CEUR-WS.org, 2016.
[3] P. F. Brown, P. V. deSouza, R. L. Mercer, V. J. Della Pietra, and J. C. Lai. Class-based n-gram models of natural language. Computational Linguistics, 18(4):467-479, 1992.
[4] K. Chakma and A. Das. CMIR: A corpus for evaluation of code mixed information retrieval of Hindi-English tweets. In International Conference on Intelligent Text Processing and Computational Linguistics. Springer, 2016.
[5] P. Gupta, K. Bali, R. E. Banchs, M. Choudhury, and P. Rosso. Query expansion for mixed-script information retrieval. In Proceedings of the 37th International ACM SIGIR Conference on Research & Development in Information Retrieval, pages 677-686. ACM, 2014.
[6] P. Liang. Semi-supervised learning for natural language. Thesis, Massachusetts Institute of Technology, 2005.
[7] N. Londhe, V. Gopalakrishnan, R. K. Srihari, and A. Zhang. MESS: A multilingual error based string similarity measure for transliterated name variants. In Proceedings of the 7th Forum for Information Retrieval Evaluation, pages 47-50. ACM, 2015.
[8] S. E. Robertson, S. Walker, S. Jones, M. M. Hancock-Beaulieu, M. Gatford, et al. Okapi at TREC-3. NIST Special Publication SP, 109:109, 1995.
[9] C. Zhai and J. Lafferty. A study of smoothing methods for language models applied to ad hoc information retrieval. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 334-342. ACM, 2001.