Exploiting Named Entity Mentions Towards Code Mixed IR: Working Notes for the UB System Submission for MSIR@FIRE'16

Nikhil Londhe (SUNY Buffalo, nikhillo@buffalo.edu)
Rohini K. Srihari (SUNY Buffalo, rohini@buffalo.edu)

ABSTRACT
A sizable percentage of online user generated content is susceptible to code switching and code mixing owing to a variety of reasons. An expected consequence is that ad hoc user queries on such data are also inherently code mixed. This paper presents our solution for one such scenario: information retrieval on code mixed Hindi-English tweets. We explore techniques in information extraction, clustering and query expansion as part of this work and present our results on the test dataset. Our system achieved a MAP of 0.0217 on the test set and placed third in the rankings.

CCS Concepts
• Information systems → Multilingual and cross-lingual retrieval; • Computing methodologies → Natural language processing

1. INTRODUCTION
A large number of languages such as Russian, Hindi and Arabic, although they have indigenous writing scripts, are often expressed in the Roman script in online communication due to a variety of socio-cultural reasons. Furthermore, it is not uncommon to see multilingual speakers switching between different languages when expressing themselves in an informal setting like social media. Hence, a large percentage of user generated content, typically on social media, is code switched or code mixed, i.e., it contains content in more than one language that may or may not involve multiple writing systems.

Given the nature of the content, it is not unexpected that ad hoc user queries on such data are also code mixed. For example, consider Table 1, which lists some sample queries and target tweets that illustrate this scenario.

The given problem was formally introduced recently by [4]. The second sub-task organized as part of the Mixed Script Information Retrieval track [2] deals with the same problem, described in detail as follows. Given a topic T, specified using a name, description and narrative; a set of associated English-Hindi code mixed tweets tT; and a set of similarly code mixed queries QT against the topic, the task involves returning the top k results for every such query. A summary of the training dataset is presented in Table 2. Some of the key challenges can be enumerated as:

1. Short and comparable document lengths: Judging relevance on curated tweets can be especially hard given that not only are the documents short in length, giving little context to work with, but also that all documents are comparable in length and are definitely somewhat relevant to the query / topic to begin with.
2. Mixed language data: Given that both the tweets themselves and the queries are in one or more languages, some sort of cross lingual indexing and querying scheme must be employed to ensure relevant results.
3. Spelling variants: Specifically applicable given that one of the languages (Hindi) is expressed in Roman script via transliteration. Since no standardized spelling scheme exists or is widely used, the same words can appear with different spellings.

The rest of this paper is organized as follows. We begin by taking a closer look at the training dataset in Section 2 and draw some inferences with regard to the data and queries that act as guiding principles for the different techniques used. In Section 3, we describe the different data processing, indexing and query processing techniques that we tried as part of our initial experiments. In Section 4, we provide details of the runs we submitted and the exact system composition for each run. Finally, in Section 5, we present our system performance on the test dataset and an analysis of the results. We conclude by discussing future work and potential system improvements.
2. TRAINING DATASET
The training dataset can be summarized as follows:

1. Topics: a total of 10 topics
2. Queries: a total of 23 queries, ranging from 1 to 4 per topic
3. Tweets: a total of 6142 tweets, split between topics/queries and ranging from 34 (Topic 9 / Query 21) to 3531 (Topic 1 / Query 4)

S.no | Sample Query        | Example Tweet
1    | delhi me election   | rohit sharma ko dilli ka chunav ladwao !! newsmaker of the day
2    | india ki haar       | #aapkasting thanks india ki haar ka dukh bhula diya tum logo ne
3    | nirbhayaa ka rapist | asharam bapu ka kehna hai mujhe nabalig samajhkar hi chod do . #nirbhayarapistout
Table 1: Code Mixed Tweets and Queries

Topic | Num Tweets | Vocabulary | % OOV
1     | 3894       | 12548      | 66.87
2     | 582        | 3356       | 60.58
3     | 82         | 690        | 60.86
4     | 54         | 483        | 56.78
5     | 410        | 2556       | 67.03
6     | 532        | 2641       | 59.59
7     | 51         | 487        | 66.15
8     | 104        | 901        | 62.55
9     | 32         | 307        | 55.17
10    | 425        | 1900       | 57.65
Table 2: Training Dataset summary

Based on the query types and available data, we make the following assumptions / observations:

1. Each query is an informational query about some Named Entity (NE).
2. Every query can be completely described by a triple A[-rel-B], where [xx] indicates an optional clause.
3. At least one of A or B is a Named Entity.
4. When not an NE, A or B is most likely a noun.
5. NEs are usually represented by a fixed number of variants.
6. Nouns, on the other hand, could be represented by any number of synonyms drawn from both languages (chunaav versus election, for example).
7. The ordering of the clause (Noun-rel-NE versus NE-rel-Noun) depends on the underlying language (chunaav in delhi versus dilli mein election).

Apart from the training dataset, no relevance judgments were provided, binary or nuanced. The narrative accompanying each topic provided some insight into what was considered relevant and non-relevant, but there were no clues to differentiate between two tweets that both referenced the same number of query tokens. Thus, we need to derive some notion of aboutness or information content for each tweet to determine relevance. A possible search strategy can therefore be formulated as follows (an illustrative sketch of the query decomposition from assumption 2 is given after this list):

1. Determine the NE or NEs within the query and the tweets - this could be a fixed list or processed from Wikipedia or other Knowledge Bases (KBs).
2. Find tweets that match as many tokens, or as much of the remaining query, as possible.
3. Find some proxy for tweet aboutness and use it to determine relevance.
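To make the A[-rel-B] decomposition of assumption 2 concrete, the following is a minimal sketch in Python. The lists of known NEs and relation markers are hypothetical stand-ins, and the routine is illustrative only; the system described in this paper does not depend on this exact code.

```python
# Illustrative sketch of the A[-rel-B] query decomposition (assumption 2).
# KNOWN_NES and REL_WORDS are hypothetical stand-ins, not system resources.

KNOWN_NES = {"delhi", "dilli", "india", "nirbhaya", "nirbhayaa"}  # assumed NE lookup list
REL_WORDS = {"me", "mein", "ka", "ki", "ke", "in", "of"}          # assumed relation markers

def decompose(query):
    """Split a query into (A, rel, B); rel and B are None when absent."""
    tokens = query.lower().split()
    rel_idx = next((i for i, t in enumerate(tokens) if t in REL_WORDS), None)
    if rel_idx is None:
        return " ".join(tokens), None, None
    return (" ".join(tokens[:rel_idx]),
            tokens[rel_idx],
            " ".join(tokens[rel_idx + 1:]))

if __name__ == "__main__":
    for q in ["delhi me election", "india ki haar", "nirbhayaa ka rapist"]:
        a, rel, b = decompose(q)
        ne_side = "A" if a in KNOWN_NES else ("B" if b in KNOWN_NES else "?")
        print(q, "->", (a, rel, b), "| NE side:", ne_side)
```

On the three sample queries from Table 1, this yields (delhi, me, election), (india, ki, haar) and (nirbhayaa, ka, rapist), with the NE on the A side in each case.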
Using these three guiding principles, we now present the different techniques we tried and how they contributed to our final system run.

3. INDEXING EXPERIMENTS
In this section, we describe the different experiments we conducted and the various system components that make up our final runs.

3.1 Apache Solr
We used Apache Solr, an open source, scalable text search engine written in Java and built over Apache Lucene, as the back-end for our system. It is fairly straightforward to index any sort of data in Solr after defining a schema. We present our full schema definition in Table 3. Apart from the data already provided, we augment it with two fields: NE (see Section 3.2) and cluster id (see Section 3.3).

S.no | Field name | Type    | Description
1    | id         | String  | Unique identifier for each tweet, UUID generated from the raw text
2    | text       | Text    | Raw text of the tweet, stored after tokenization and minimal processing
3    | qnum       | Integer | Query number
4    | topicid    | Integer | Topic number
5    | NE         | Text    | List of Named Entities referenced within the tweet
6    | cluster id | Integer | Union of cluster ids for all constituent tokens within the tweet
Table 3: Solr Schema definition

3.2 NE tagger
We implemented the NE tagger in two parts: (a) a regex based extractor that generates a list of known NEs, and (b) a simple string matching based tagger, explained as follows. For the NE extractor, we used the topic description files and automatically extracted the longest possible sequences of tokens, where a sequence is delineated by tokens that begin with a capital letter. We then sorted the extracted list and eliminated duplicates and sub-sequences. However, for every sub-sequence that we eliminated, we flagged the parent entity as capable of being present as a sub-string. For example, the entity Narendra Modi could be present as a whole or as the individual tokens Narendra or Modi. At the end of this step, we had thus generated a list of NEs (of interest, at least) and a boolean flag indicating sub-sequence occurrence. A snapshot of the lists extracted from the training dataset is shown in Figure 1.

Figure 1: Examples of extracted NEs

The actual tagger followed a similar philosophy: starting from the above list, it tagged each tweet as follows. We simply created a mapping from token to entity, where the token was either the full entity or any of its sub-tokens if the entity could be present partially. For example, for Narendra Modi, we created two mappings, Narendra → Narendra Modi and Modi → Narendra Modi. For each tweet, we then performed a look-up and added the matching NE to the index if a match was found. Further, we also added a Solr plugin for the modified Levenshtein distance [7] to match NEs with spelling variants arising from transliteration across the two languages. A brief sketch of both steps is given below.
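The sketch below follows the two-part description above (maximal runs of capitalised tokens, de-duplication, sub-sequence elimination with a "partial match" flag, then a token-to-entity lookup over tweets). The regular expression, helper names and example text are our own assumptions, and the fuzzy matching added via the MESS Solr plugin [7] is omitted.

```python
# Minimal sketch of the NE extractor and tagger described in Section 3.2.
# Assumptions: simple whitespace tokenisation, one description string as input.
import re

def extract_entities(description_text):
    """Return {entity: allows_partial_match} from a topic description:
    maximal runs of capitalised tokens, de-duplicated, with eliminated
    sub-sequences flagged on their parent entity."""
    runs = re.findall(r"(?:\b[A-Z][\w'-]*\b ?)+", description_text)
    candidates = sorted({r.strip() for r in runs}, key=len, reverse=True)
    entities = {}
    for cand in candidates:
        parent = next((e for e in entities if cand != e and cand in e), None)
        if parent is not None:
            entities[parent] = True   # sub-sequence eliminated; parent may appear partially
        else:
            entities[cand] = False
    return entities

def build_token_map(entities):
    """Token -> entity lookup used by the tagger (full entity plus its
    sub-tokens when partial occurrence is allowed)."""
    token_map = {}
    for entity, allows_partial in entities.items():
        token_map[entity.lower()] = entity
        if allows_partial:
            for tok in entity.lower().split():
                token_map[tok] = entity
    return token_map

def tag_tweet(tweet, token_map):
    """Return the set of entities referenced in a tweet via exact look-ups."""
    return {token_map[t] for t in tweet.lower().split() if t in token_map}

if __name__ == "__main__":
    ents = extract_entities("Narendra Modi addressed a rally in Delhi. Modi promised reforms.")
    tmap = build_token_map(ents)
    print(ents)                                            # {'Narendra Modi': True, 'Delhi': False}
    print(tag_tweet("modi ji ka bhashan dilli me", tmap))  # {'Narendra Modi'}
```

Note that the exact look-up misses transliteration variants such as dilli for Delhi; in our system those are handled at query time by the fuzzy matching plugin rather than by the tagger itself.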
3.3 Clustering
Finally, we attempted to mine cross-lingual equivalents and spelling variants for different words. Much akin to the query expansion work of [5], we first explored the use of Brown clustering [3], a hierarchical clustering algorithm that exploits word distributional similarities. We used the freely available Python implementation [6] and tested a few cluster sizes. We found a value of 100 clusters to be a reasonable balance between unrelated words being grouped together (10-50 clusters) and fragmented clusters (500-1000). We also found that, for the given setting, words with similar spellings (e.g. jaise, jaisey, jaisay) or inflectional variants (e.g. ka, ke, ko) were assigned to the same cluster. As an unintended side-effect, however, the clustering also assigned similarly spelled English words with the same POS labels (e.g. verbs like cooking, cooling) to the same cluster.

We therefore automatically generated a synonym list from the above results using a two step process. First, we removed all words that were found in a standard English dictionary; this was done to remove such false positives. Second, we deemed a set of words to be equivalents if their edit distance was less than a pre-determined threshold of three. This threshold was determined manually, by trial and error over a small test set. All matching pairs thus generated were then combined into a single synonym file, as partially shown in Figure 2; a short sketch of this step is given below.

Figure 2: Examples of generated synonyms
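The following sketch shows the two-step synonym mining under our own assumptions: Brown clusters are assumed to be available as a {cluster_id: [words]} mapping (e.g. loaded from the output of the implementation in [6]), an English word list is assumed to be at hand, and the output format is a placeholder rather than the exact file used in our runs.

```python
# Sketch of the synonym mining in Section 3.3: drop dictionary English words,
# then pair cluster-mates whose edit distance is below the threshold of three.
from itertools import combinations

def edit_distance(a, b):
    """Plain Levenshtein distance (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def mine_synonyms(clusters, english_words, threshold=3):
    """Pair words within the same cluster whose edit distance is below the
    threshold, after removing standard-dictionary English words."""
    pairs = []
    for words in clusters.values():
        kept = [w for w in words if w not in english_words]
        for a, b in combinations(kept, 2):
            if edit_distance(a, b) < threshold:
                pairs.append((a, b))
    return pairs

if __name__ == "__main__":
    clusters = {17: ["jaise", "jaisey", "jaisay", "cooking", "cooling"]}  # assumed cluster output
    english = {"cooking", "cooling"}                                      # assumed dictionary subset
    for a, b in mine_synonyms(clusters, english):
        print(f"{a},{b}")   # one comma-separated equivalence per line, Solr synonym-file style
```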
Having described the individual components that make up our system, we now turn our attention to the overall system in the following section.

4. SYSTEM DESCRIPTION
The basic system architecture is shown in Figure 3. It includes the following components, introduced earlier: Solr (and its associated index), the query processor (which includes the NE tagger) and the extracted synonyms. Two additional inputs are (a) the search configuration, which defines the different weights, the choice of scoring mechanism, etc., and (b) the topic narrative files, which are additionally ingested as described below. Note that the diagram shows three different configurations, additive in nature, one for each of the three runs submitted.

Figure 3: System Diagram

Before we describe the three runs, we present some additional details on the query processor and the Solr configuration. Continuing with the schema presented in Table 3, we are given a query Q of n tokens q1, q2, ..., qn and pre-determined query weights Wf = w1, w2, ..., wk for fields f1, f2, ..., fk as stored in the configuration. The processor then partitions the query, generating field level parses Qf = q1f, q2f, ..., qkf, and passes the final query to Solr as Qf · Wf.

Secondly, as mentioned in Section 1, we also needed to establish some notion of relevance. Given the lack of any additional information, we chose a simple voting based scheme. Solr supports a variety of ranking models, and we therefore configured four distinct Solr instances, each with a different relevance scheme, as enumerated below:

1. Lucene similarity: an implementation of the classic tf-idf similarity that uses √tf and 1 + log(N/(df + 1)) as normalization factors.
2. Okapi similarity: a probabilistic relevance model as introduced by [8].
3. Language model: uses Jelinek-Mercer smoothing on a language model, as proposed by [9].
4. DFR similarity: the Divergence From Randomness models as suggested by [1]. We used the Inverse Expected document frequency (I(ne)) model with Laplacian smoothing and Zipfian normalization in our experiments.

For the first three models we used the default settings; for DFR, we manually experimented with a few queries and their returned results to settle on the selected model. For a given query, the system executes it in parallel on the four subsystems, normalizes the returned scores and combines the results into a single ranked list. An illustrative sketch of the query construction and score combination is given at the end of this section.

We enumerate the three runs as follows:

1. Run 1 - Named Entity boosts: In the first run, we performed two levels of query matching. The first was boosting documents based on their NE matches from the query, i.e., the query was parsed to extract NEs and each document (tweet) that matched the given NE was given a small numeric boost. The second level of boosting utilized phrase matching, i.e., documents that more closely matched the input query phrase were ranked higher than those that did not.
2. Run 2 - Synonym expansion: We expanded the given query using the synonyms described in Section 3.3, on top of the ranking mechanism presented for Run 1.
3. Run 3 - Narrative based weighting: For the final run, we extracted NEs from the provided topic narratives and assigned positive or negative boosts based on the associated usage of "relevant" and "not relevant". These additional weights were applied on top of the scheme presented in Run 2.
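The sketch below illustrates one way to realise the weighted field query Qf · Wf and the normalise-and-combine step across the four differently configured Solr instances. The Lucene-style query string, the max-score normalisation and the placeholder ranked lists are our assumptions for illustration, not the exact configuration used in our runs; the search call against each instance is deliberately left abstract.

```python
# Sketch of the query-side processing in Section 4: build a weighted
# field query (Lucene syntax, field:(terms)^weight), then fuse the ranked
# lists returned by the four Solr instances via max-normalised score sums.
from collections import defaultdict

def build_weighted_query(field_parses, field_weights):
    """Combine per-field query parses q_f with their weights w_f into a
    single Lucene-style query string, i.e. Qf . Wf."""
    clauses = []
    for field, terms in field_parses.items():
        weight = field_weights.get(field, 1.0)
        clauses.append(f'{field}:({terms})^{weight}')
    return " OR ".join(clauses)

def fuse(result_lists, top_k=10):
    """Normalise each instance's scores by its maximum and sum per document,
    a simple voting-style combination across the four ranking models."""
    combined = defaultdict(float)
    for results in result_lists:               # one (doc_id, score) list per Solr instance
        if not results:
            continue
        max_score = max(score for _, score in results)
        for doc_id, score in results:
            combined[doc_id] += score / max_score
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

if __name__ == "__main__":
    q = build_weighted_query({"text": "dilli chunav", "NE": '"Delhi"'},
                             {"text": 1.0, "NE": 2.0})
    print(q)  # text:(dilli chunav)^1.0 OR NE:("Delhi")^2.0
    # Hypothetical ranked lists from the four differently configured instances:
    lists = [[("t1", 4.2), ("t2", 3.9)], [("t2", 8.0), ("t3", 1.0)],
             [("t1", 0.7)], [("t2", 0.5), ("t1", 0.4)]]
    print(fuse(lists))
```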
5. RESULTS & CONCLUSIONS
We present a summary of the test dataset and the results from each of the runs in Table 4.

Query no | Topic | Run 1  | Run 2 | Run 3  | Avg MAP
1        | 1     | 0.0    | 0.0   | 0.0    | 0.0
2        | 1     | 0.0    | 0.0   | 0.0    | 0.0
3        | 1     | 0.0    | 0.0   | 0.0    | 0.0
4        | 1     | 0.0    | 0.0   | 0.0    | 0.0
5        | 1     | 0.0    | 0.0   | 0.0    | 0.0
6        | 2     | 0.003  | 0.003 | 0.003  | 0.003
7        | 2     | 0.177  | 0.1   | 0.1    | 0.126
8        | 2     | 0.019  | 0.008 | 0.019  | 0.015
9        | 2     | 0.0    | 0.0   | 0.0    | 0.0
10       | 3     | 0.007  | 0.022 | 0.007  | 0.012
11       | 3     | 0.05   | 0.05  | 0.05   | 0.05
12       | 3     | 0.004  | 0.009 | 0.004  | 0.006
Average  |       | 0.0217 | 0.016 | 0.0152 | 0.0176
Table 4: Test Dataset Summary and Results

We make the following observations:

• Overall, the best performance came from our baseline system. Synonym expansion performs second best and narrative based weighting performs the worst.
• In hindsight, the techniques we used perhaps improved recall at the cost of precision.
• The only queries where synonym expansion works better than having no synonyms are, perhaps, those where there is a spelling mismatch between query and text (queries 10 and 12).
• Thus, we could have used a "pessimistic" expansion, applying it only when adequate results are not otherwise available.
• When multiple NEs are present (queries 1-3), our system gets confused and returns non-relevant results.

As part of future improvements, we thus need to define a better notion of relevance and aboutness at the tweet level, specifically to ascertain the information content of each tweet. In summary, our work showed three important results: (a) Named Entities have a very important role in IR on tweets, and system performance could be improved by using more sophisticated NE taggers; (b) simple clustering on topic-wise tweets can give considerable insight into the constituent words, enough to derive spelling and inflectional variants; and (c) in the given setting, precision is very sensitive to small changes and hence typical recall improving techniques should be used as a "last resort".

6. REFERENCES
[1] G. Amati and C. J. van Rijsbergen. Probabilistic models for information retrieval based on divergence from randomness. 2003.
[2] S. Banerjee, K. Chakma, S. K. Naskar, A. Das, P. Rosso, S. Bandyopadhyay, and M. Choudhury. Overview of the Mixed Script Information Retrieval (MSIR) at FIRE. In Working Notes of FIRE 2016 - Forum for Information Retrieval Evaluation, Kolkata, India, December 7-10, 2016, CEUR Workshop Proceedings. CEUR-WS.org, 2016.
[3] P. F. Brown, P. V. deSouza, R. L. Mercer, V. J. Della Pietra, and J. C. Lai. Class-based n-gram models of natural language. Computational Linguistics, 18(4):467-479, 1992.
[4] K. Chakma and A. Das. CMIR: A corpus for evaluation of code mixed information retrieval of Hindi-English tweets. In International Conference on Intelligent Text Processing and Computational Linguistics. Springer, 2016.
[5] P. Gupta, K. Bali, R. E. Banchs, M. Choudhury, and P. Rosso. Query expansion for mixed-script information retrieval. In Proceedings of the 37th International ACM SIGIR Conference on Research & Development in Information Retrieval, pages 677-686. ACM, 2014.
[6] P. Liang. Semi-supervised learning for natural language. Thesis, Massachusetts Institute of Technology, 2005.
[7] N. Londhe, V. Gopalakrishnan, R. K. Srihari, and A. Zhang. MESS: A multilingual error based string similarity measure for transliterated name variants. In Proceedings of the 7th Forum for Information Retrieval Evaluation, pages 47-50. ACM, 2015.
[8] S. E. Robertson, S. Walker, S. Jones, M. M. Hancock-Beaulieu, M. Gatford, et al. Okapi at TREC-3. NIST Special Publication SP, 109:109, 1995.
[9] C. Zhai and J. Lafferty. A study of smoothing methods for language models applied to ad hoc information retrieval. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 334-342. ACM, 2001.