CEN@Amrita: Information Retrieval on Code-Mixed Hindi-English Tweets Using Vector Space Models

Shivkaran Singh, Anand Kumar M, Soman K P
Centre for Computational Engineering and Networking (CEN), Amrita School of Engineering, Coimbatore, Amrita Vishwa Vidyapeetham, Amrita University, India, PIN: 641112
shivkaran.ssokhey@gmail.com (+91 84278 78973), m_anandkumar@cb.amrita.edu, kp_soman@amrita.edu

ABSTRACT
One of the major challenges nowadays is information retrieval from social media platforms. Most of the content on these platforms is informal and noisy in nature, which makes the information retrieval task more challenging. The task is even more difficult for Twitter because of its character limit per tweet, which forces users to express themselves in a condensed set of words. In the Indian context, the scenario is more complicated still: users prefer to type in their mother tongue, but the lack of input tools forces them to use the Roman script with English embeddings. This combination of multiple languages written in the Roman script makes information retrieval even harder. Query processing for such code-mixed content is difficult because the query can be in either language and needs to be matched against documents written in any of the languages. In this work, we addressed this problem using Vector Space Models, which gave significantly better results than most other participants. The Mean Average Precision (MAP) of our system was 0.0315, the second-best performance for the subtask.

CCS Concepts
• Information Systems → Information Retrieval → Retrieval models and ranking
• Computing methodologies → Artificial Intelligence → Natural Language Processing

Keywords
Code-mixed social media, Mixed-Script, Information Retrieval, Vector space models, Semantics

1. INTRODUCTION
Social media has a plenitude of user-generated data in numerous languages which are predominantly informal in nature. Most of these languages have their own native scripts, including Arabic, Chinese, Hebrew, Greek, and the Indic scripts. For most of these languages, the major share of user-generated content is transliterated into the Roman script with English embeddings. The trend in Indian social media is to use such informal text containing a mixture of multiple South-Asian languages with English embeddings. This mixture makes the Information Retrieval (IR) task very challenging. At the Forum for Information Retrieval Evaluation (FIRE)1 2016, a similar task was proposed, which required Mixed-Script IR on code-mixed Hindi-English tweets. The difference between Code-Mixed IR and Mixed-Script IR is subtle: in mixed-script content, a query q_ms is written in the Roman or a native script [1], whereas in code-mixed content, a query q_cm is a Roman transliteration of a different language. The code-mixed corpus provided for MSIR Subtask II contained English and Roman-transliterated Hindi Twitter data [2]. The major issue in such a corpus is the many possible ways of writing the same (Hindi) word with different transliterations. For example, "कम", meaning "less" in Hindi, can be spelled in Roman transliteration as km, kam, kum, kmm, etc. These nuances make it hard for an IR system to match a query with the correct document in a document set, and they significantly affect the performance of the IR system. Nowadays, extracting information from such code-mixed social media text is very important, as it supports many business analytics applications. In the following sections, Section 2 explains the information retrieval subtask, Section 3 explains the Vector Space Models used for information retrieval, Section 4 explains the methodology used for this work, and Section 5 discusses the results obtained along with an analysis of the other participants' results.

1 https://msir2016.github.io/

2. Task Description
Subtask II of the shared task on Mixed-Script IR for code-mixed Hindi-English tweets was to retrieve the 20 most relevant tweets from a document given a query. Both the query and the document were in the Roman script, but with code-mixed Hindi and English. The corpus had a set of documents, each containing several hundred (or thousand) tweets, and was further organized by topics and queries. Each topic had at least one query related to the topic description. Table 1 shows the structure of the training/testing corpus provided for the subtask. The total number of topics was 10 for training and 3 for testing. There were several queries based on each topic (see Table 1), with at least one query per topic; the total number of queries was 23 for training and 12 for testing. A narrative on each topic was also given in the corpus, describing the tweets under that topic. Topic 001 (Aam Aadmi Party), for instance, has four queries under the same description (Table 1), each with a separate document containing the corresponding number of tweets. Given a query q_i, the IR task was to rank the tweets in the corresponding document from most relevant to least relevant with respect to the query.
Table 1. Corpus Description

Topic: 001 (Aam Aadmi Party)
Description: The tweets under this topic are closely related to the Aam Aadmi Party, which is a political party in Delhi with Government.
Queries and #tweets:
  q1: aam aadmi party              710
  q2: aam aadmi party dilli me    1071
  q3: aam aadmi ki party          1583
  q4: aap ki rally                3529

The IR system should return the top-20 most relevant tweets for the given query. The code-mixed nature of the tweets makes the IR task hard to process, as semantic search over such transliterated queries and documents is still an unsolved problem [2].

3. Vector Space Models
Vector Space Models (VSMs) represent documents as vectors of the terms that occur within a collection [5]. The given query is represented in the same document space; the query is also called a pseudo-document. Since a document is represented as a vector of the terms occurring in it, it is necessary to identify the terms present in the document; the terms are basically the vocabulary of the collection of documents. If there is more than one document, each document will be a huge vector, and it is convenient to organize these vectors into a matrix, called the term-document matrix. The row vectors are referred to as terms and the column vectors as documents. A document is used as a context to understand a term. If we take phrases, sentences, paragraphs, chapters, etc. as documents, we get a word-context matrix; similarly, we can also have pair-pattern matrices [3].

To imagine the representation of a term-document matrix, think of a multiset from set theory. A multiset is a set that allows multiple instances of the same element. For example, M = {x, x, x, y, y, z} is a multiset containing the elements x, y, and z. Just like sets, multisets are unordered: the multiset M1 = {y, z, y, x, x, x} is the same as the multiset M2 = {x, y, z, x, x, y}. Multisets are also called bags, and we can represent a bag as a vector whose components denote the frequencies of the elements of the multiset; i.e., V = <3, 2, 1> is a vector representation of the bag M in which 3 is the frequency of x, 2 is the frequency of y, and so on. Using the same analogy, we can imagine a document as a bag and a set of documents as a set of bags aligned as the columns of a matrix, say X. This matrix X is the term-document matrix, with each column representing a bag and each row a unique member. A particular element x_ij of the matrix corresponds to the frequency of the i-th term in the j-th document (or bag). To capture the whole intuition, let us assume three documents:

Doc1: We stayed very closely connected.
Doc2: Charger stayed connected with phone.
Doc3: His phone charger closely resembled mine.

The term-document matrix of frequencies for these three documents could be:

Table 2. Term-Document matrix

            Doc1  Doc2  Doc3
We           1     0     0
stayed       1     1     0
very         1     0     0
closely      1     0     1
connected    1     1     0
charger      0     1     1
with         0     1     0
phone        0     1     1
his          0     0     1
resembled    0     0     1
mine         0     0     1

In the above matrix, the rows are terms and the columns are documents: it has 3 documents (Doc1, Doc2 & Doc3) and 11 unique terms (tokens in this case), giving dimension 11x3. In a similar way, a given query can be represented as a bag of words, and estimating the relevance of a query to the documents in this manner is called the bag-of-words hypothesis in information retrieval. This hypothesis states that a column vector in a term-document matrix captures the meaning of the corresponding document, to some extent.
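As a concrete illustration, the term-document matrix of Table 2 can be built in a few lines. This is a minimal sketch in Python 3 (the paper's own implementation used Python 2.7 and is not shown here); the variable names are ours:

```python
from collections import Counter

docs = {
    "Doc1": "We stayed very closely connected",
    "Doc2": "Charger stayed connected with phone",
    "Doc3": "His phone charger closely resembled mine",
}

# Lowercase and split each document into tokens, i.e. a "bag of words".
bags = {name: Counter(text.lower().split()) for name, text in docs.items()}

# The vocabulary (the rows) is the set of unique terms in the collection.
vocab = sorted({term for bag in bags.values() for term in bag})

# Each column of the term-document matrix is one document's frequency vector.
matrix = {name: [bag[term] for term in vocab] for name, bag in bags.items()}

print(len(vocab), "unique terms x", len(docs), "documents")  # 11 unique terms x 3 documents
```

Each value of `matrix` is one column of Table 2, with rows ordered by the sorted vocabulary.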
It should be observed that the column vector corresponding to a document in a collection tells us about the frequencies of the words in that document, with the actual order of the words lost. Such a vector may not capture the structure of the document as-is, but it works surprisingly well for search engines. We can compare the column (document) vectors to compute the similarity among them. This similarity can be computed using the Euclidean distance if we regard the columns (documents) as points in the document space. If we regard the columns (documents) as vectors in the document space, we can use cosine similarity, which measures similarity by the angle between the vectors: the larger the cosine, the more semantically related the documents are. If doc1 and doc2 are two document vectors, the cosine of the angle θ between them is computed as

cos(doc1, doc2) = dot(doc1, doc2) / (||doc1|| · ||doc2||)

where ||doc1|| and ||doc2|| are the lengths (norms) of the vectors.
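The cosine formula above translates directly into code. A minimal sketch in plain Python, using the Doc1 and Doc2 columns of Table 2 as the example vectors:

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

doc1 = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]  # Doc1 column of Table 2
doc2 = [0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0]  # Doc2 column of Table 2

# The two documents share only "stayed" and "connected".
print(round(cosine(doc1, doc2), 3))  # → 0.4
```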
The basic intuition behind cosine similarity is that the angle between the vectors matters while their lengths do not (see Figure 1). The cosine is 1 when the vectors are identical or point in the same direction (θ = 0). For the non-negative frequency vectors used here, the cosine value varies from 0 to 1, zero meaning not similar and one meaning exactly similar.

3.1 Term-Weighting
Generally, the most frequent terms carry less information than the less frequent, more surprising terms. The most effective way to capture this idea is tf-idf (term frequency-inverse document frequency) weighting. An element of the term-document matrix gets a higher weight when the term is frequent in the corresponding document (high tf) but rare in the collection of documents (low df). Hence the weight of a particular term occurrence is computed as

W_mn = tf × idf

where W_mn is the weight of term m in document n. It is demonstrated in [4] that tf-idf weighting brings a significant improvement over raw frequencies.
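The weighting W_mn = tf × idf can be sketched as follows. The paper does not state which tf-idf variant was used, so the log-scaled inverse document frequency idf = log(N/df) below is only an illustrative assumption:

```python
import math

def tf_idf(matrix):
    """Apply tf-idf weighting to a term-document count matrix (rows = terms, columns = documents)."""
    n_docs = len(matrix[0])
    weighted = []
    for row in matrix:
        df = sum(1 for count in row if count > 0)  # document frequency of this term
        idf = math.log(n_docs / df)                # rare terms get a larger idf
        weighted.append([tf * idf for tf in row])
    return weighted

# Rows: "stayed" (in Doc1 and Doc2) and "very" (only in Doc1), from Table 2.
counts = [[1, 1, 0],
          [1, 0, 0]]
weights = tf_idf(counts)
# "very" (df = 1) ends up weighted higher than "stayed" (df = 2).
```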
Figure 1. Angle between two document vectors2

So far we have talked about measuring document similarity, but VSMs can also be used for query processing. A query q can be treated as a pseudo-document, and the similarity of each document in the collection to this pseudo-document (the query) can be computed. Several other similarity measures are available, such as the Jensen-Shannon divergence, recall, precision, the Jaccard coefficient, and the harmonic mean; the choice of measure depends on the relative frequencies of adjacent words with respect to the target word.

2 https://www.math10.com/en/geometry/geogebra/geogebra.html

4. Methodology
Subtask II at FIRE was Mixed-Script Information Retrieval on code-mixed Hindi-English tweets. There were in total 23 files containing tweets for training and 12 for testing, and each file had a corresponding query. Given a query q_i, the information retrieval task was to compute the similarity between the query and the tweets in each file and return the top 20 most relevant tweets. As explained in the previous section, this query processing task can be executed using VSMs. Each file was treated as a collection of documents, and each tweet within the collection was treated as a document. The dataset comprised Hindi-English code-mixed tweets. As Twitter data is generally noisy, it was passed through several preprocessing modules. The preprocessing in our implementation included tokenizing, removing stop words, stripping punctuation, and stripping character repetitions (hiiiiiii → hi). The major issue in tokenizing Twitter data is capturing the key attributes of tweets, such as hashtags (#aap), @-mentions (@timesnow), URLs, symbols, and emoticons; these attributes were captured using regular expressions. A sample tweet after capturing these attributes, stripping punctuation, and tokenizing appeared as:

Original:     @respectshraddie shhhhh :( salman ko jail hojaegi :( #badday
Preprocessed: '@respectshraddie', 'shh', ':(', 'salman', 'jail', 'hojaegi', ':(', '#badday'
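The regular-expression capture of hashtags, @-mentions, and emoticons, together with repetition stripping, can be sketched as below. The pattern is our own minimal reconstruction, not the authors' actual expression: it covers only a handful of emoticons and collapses repeated characters to a single one (shhhhh → sh), whereas the exact collapsing rule used in the paper is not specified:

```python
import re

# Hypothetical reconstruction: match emoticons, @-mentions, and hashtags
# before ordinary words so they survive tokenization intact.
TOKEN_RE = re.compile(r"""
    [:;][-']?[()DPp]     # a few basic emoticons such as :( :) ;D
  | @\w+                 # @-mentions, e.g. @timesnow
  | \#\w+                # hashtags, e.g. #aap
  | \w+                  # ordinary words
""", re.VERBOSE)

def tokenize(tweet):
    # Collapse runs of 3+ repeated characters (hiiiiiii -> hi) before matching.
    tweet = re.sub(r"(\w)\1{2,}", r"\1", tweet)
    return TOKEN_RE.findall(tweet)

print(tokenize("@respectshraddie shhhhh :( salman ko jail hojaegi :( #badday"))
```

Stop-word removal (dropping 'ko' as in the sample above) would be a separate step after tokenization.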
Tokenization was done for all the queries as well, so the query vectors and the document (column) vectors lay in the same vector space. The preprocessing was performed for each file (collection) over each document (tweet). After preprocessing, each file (collection) was fed to the information retrieval system. The similarity scores for each tweet in a collection given a query were computed and saved in a list, and the top 20 tweets for the given query were retrieved via the index values of the top 20 similarity scores in that list. There was a provision for submitting three systems per team; we submitted two. The first system was the one explained above. In the second system, we manually removed some Hindi stop words such as ka, jo, ke, ne, to, ve, le, etc., but this did not yield any better results. All the implementations were done in Python 2.7, and the related code will be made available on the author's GitHub page.

5. Result and Analysis
The results were declared roughly two weeks after the submission. There were 7 teams in total, and our system performed well compared to the others [6]. The evaluation was done by calculating Mean Average Precision (MAP), a standard measure for comparing search algorithms. The results for Q1 for system 1 and system 2 can be seen in Figure 2 and Figure 3, respectively.

Figure 2. System-1 results

Figure 3. System-2 results

The results of the top three performers in the subtask are given in Table 3.

Table 3. Mean Average Precision of the top three performers

Team Name     No. of runs   Run 1    Run 2   Run 3
Amrita_CEN    1             0.0377   NIL     NIL
CEN@Amrita*   2             0.0315   0.016   NIL
UB            3             0.0217   0.016   0.015
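Since the related code is only promised for the author's GitHub page, the core retrieval loop described in Section 4 (tf-idf vectors for every tweet, cosine scoring against the query treated as a pseudo-document, and ranking by score) can be sketched as follows. This is our reconstruction in Python 3 under stated assumptions; the function names and toy data are illustrative, not the authors' actual code:

```python
import math
from collections import Counter

def tfidf_weights(tokenized_docs):
    """Inverse document frequencies and tf-idf vectors for a tokenized collection."""
    n = len(tokenized_docs)
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    idf = {term: math.log(n / count) for term, count in df.items()}
    vecs = [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in tokenized_docs]
    return idf, vecs

def cosine(u, v):
    """Cosine similarity between two sparse (dict) vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def top_k(query_tokens, tweets, k=20):
    """Return the indices of the k tweets most similar to the query."""
    idf, vecs = tfidf_weights(tweets)
    # The query is a pseudo-document in the same vector space (unknown terms dropped).
    qvec = {t: tf * idf[t] for t, tf in Counter(query_tokens).items() if t in idf}
    scores = sorted(((cosine(qvec, v), i) for i, v in enumerate(vecs)), reverse=True)
    return [i for _, i in scores[:k]]
```

For example, with three toy tokenized tweets, `top_k(['aam', 'aadmi', 'party'], tweets, k=2)` ranks the tweet sharing those terms first.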
6. Conclusion
The shared task on code-mixed information retrieval was indeed a unique task, capturing the latest trend in social media. We used Vector Space Models (VSMs) of semantics to compute the similarity between the tweets and a given query. Our system was ranked 2nd among all the participants, but the Mean Average Precision (MAP) value was still very low in absolute terms. This suggests that code-mixed IR is a difficult task: existing algorithms do not perform as expected on such data and require substantial further attention.

Acknowledgements
The authors would like to thank the organizers of the Forum for Information Retrieval Evaluation (FIRE) for organizing this event, and the organizers of the shared task on Mixed Script Information Retrieval (MSIR) for organizing this much-coveted task for Indian social media.

REFERENCES
[1] Gupta, P., Bali, K., Banchs, R. E., Choudhury, M., and Rosso, P. 2014. Query expansion for mixed-script information retrieval. In Proceedings of the 37th International ACM SIGIR Conference on Research & Development in Information Retrieval, pp. 677-686. ACM.
[2] Chakma, K., and Das, A. 2016. CMIR: A corpus for evaluation of code mixed information retrieval of Hindi-English tweets. Computación y Sistemas 20(3): 425-434.
[3] Turney, P. D., and Pantel, P. 2010. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research 37(1): 141-188.
[4] Salton, G., and Buckley, C. 1988. Term-weighting approaches in automatic text retrieval. Information Processing & Management 24(5): 513-523.
[5] Soman, K. P., Loganathan, R., and Ajay, V. 2009. Machine Learning with SVM and Other Kernel Methods. PHI Learning Pvt. Ltd.
[6] Banerjee, S., Chakma, K., Naskar, S. K., Das, A., Rosso, P., Bandyopadhyay, S., and Choudhury, M. 2016. Overview of the Mixed Script Information Retrieval at FIRE. In Working Notes of FIRE 2016 - Forum for Information Retrieval Evaluation, Kolkata, India, December 7-10, 2016, CEUR Workshop Proceedings. CEUR-WS.org.