CEN@Amrita: Information Retrieval on Code-Mixed Hindi-English Tweets Using Vector Space Models

Shivkaran Singh, Anand Kumar M, Soman K P
Centre for Computational Engineering and Networking (CEN), Amrita School of Engineering, Coimbatore, Amrita Vishwa Vidyapeetham, Amrita University, India, PIN: 641112
shivkaran.ssokhey@gmail.com (+91 84278 78973), m_anandkumar@cb.amrita.edu, kp_soman@amrita.edu

ABSTRACT
One of the major challenges nowadays is information retrieval from social media platforms. Most of the content on these platforms is informal and noisy in nature, which makes the information retrieval task more challenging. The task is even more difficult for Twitter because of its character limit per tweet, which forces users to express themselves in a condensed set of words. In the Indian context, the scenario is more complicated still: users prefer to type in their mother tongue, but the lack of input tools forces them to use the Roman script with English embeddings. This combination of multiple languages written in the Roman script makes information retrieval even harder. Query processing for such code-mixed content is difficult because the query can be in either language and needs to be matched against documents written in any of the languages. In this work, we addressed this problem using Vector Space Models, which gave significantly better results than most other participants. The Mean Average Precision (MAP) of our system was 0.0315, the second-best performance for the subtask.

CCS Concepts
• Information Systems → Information Retrieval → Retrieval models and ranking
• Computing methodologies → Artificial Intelligence → Natural Language Processing

Keywords
Code-mixed social media, Mixed-Script, Information Retrieval, Vector space models, Semantics

1. INTRODUCTION
Social media has a plenitude of user-generated data in numerous languages which are predominantly informal in nature. Most of these languages have their own native scripts, including Arabic, Chinese, Hebrew, Greek, and the Indic scripts. For most of these languages, the major share of user-generated content is transliterated into the Roman script with English embeddings. The trend in Indian social media is to use such informal text containing a mixture of multiple South-Asian languages with English embeddings. This mixture makes the Information Retrieval (IR) task very challenging. At the Forum for Information Retrieval Evaluation (FIRE)1 2016, a similar task was proposed, which required Mixed-Script IR on code-mixed Hindi-English tweets. The difference between Code-Mixed IR and Mixed-Script IR is subtle: in mixed-script content, a query q_ms is written in the Roman or a native script [1], whereas in code-mixed content, a query q_cm is a Roman transliteration of a different language. The code-mixed corpus provided for MSIR Subtask II contained English and Roman-transliterated Hindi Twitter data [2]. The major issue in such a corpus is the many possible ways of writing the same (Hindi) word with different transliterations. For example, "कम", meaning "less" in Hindi, can be spelled in Roman transliteration as km, kam, kum, kmm, etc. These nuances make it hard for an IR system to match a query with the correct document in a document set, and they significantly affect the performance of the IR system. Nowadays, extracting information from such code-mixed social media text is very important, as it supports many business analytics applications. In the following sections, Section 2 explains the information retrieval subtask, Section 3 explains the Vector Space Models used for information retrieval, Section 4 explains the methodology used for this work, and Section 5 discusses the results obtained along with an analysis of the other participants' results.

1 https://msir2016.github.io/

2. Task Description
Subtask II of the shared task on Mixed-Script IR for code-mixed Hindi-English tweets was to retrieve the 20 most relevant tweets from a document given a query. Both the query and the document were in the Roman script, but with code-mixed Hindi and English. The corpus had a set of documents, each containing several hundred (or thousand) tweets, and was further organized by topics and queries. Each topic had at least one query related to the topic description. Table 1 shows the structure of the training/testing corpus provided for the subtask. The total number of topics was 10 for training and 3 for testing. There were several queries based on each topic (see Table 1), with at least one query per topic; the total number of queries was 23 for training and 12 for testing. A narrative on each topic was also given in the corpus, describing the tweets under that topic. Topic 001 (Aam Aadmi Party), for instance, has four queries under the same description (Table 1), each with a separate document containing the corresponding number of tweets. Given a query q_i, the IR task was to rank the tweets in the corresponding document from most relevant to least relevant with respect to the query.
Table 1. Corpus Description

Topic: 001 (Aam Aadmi Party)
Description: The tweets under this topic are closely related to the Aam Aadmi Party, which is a political party in Delhi with Government.
Queries and #tweets:
  q1: aam aadmi party              710
  q2: aam aadmi party dilli me    1071
  q3: aam aadmi ki party          1583
  q4: aap ki rally                3529

The IR system should return the top-20 most relevant tweets for the given query. The code-mixed nature of the tweets makes the IR task hard to process, as semantic search over such transliterated queries and documents is still an unsolved problem [2].

3. Vector Space Models
Vector Space Models (VSMs) represent documents as vectors of the terms that occur within a collection [5]. The given query is represented in the same document space; the query is also called a pseudo-document. Since a document is represented as a vector of the terms occurring in it, it is necessary to identify the terms present in the document; the terms are basically the vocabulary of the collection of documents. If there is more than one document, each document will be a huge vector, and it is convenient to organize these vectors into a matrix, called the term-document matrix. The row vectors are referred to as terms and the column vectors as documents. A document is used as a context to understand a term. If we take phrases, sentences, paragraphs, chapters, etc. as documents, we get a word-context matrix; similarly, we can also have pair-pattern matrices [3].

To imagine the representation of a term-document matrix, think of a multiset from set theory. A multiset is a set that allows multiple instances of the same element. For example, M = {x, x, x, y, y, z} is a multiset containing the elements x, y, and z. Just like sets, multisets are unordered: the multiset M1 = {y, z, y, x, x, x} is the same as the multiset M2 = {x, y, z, x, x, y}. Multisets are also called bags, and we can represent a bag as a vector whose components denote the frequencies of the elements of the multiset; i.e., V = <3, 2, 1> is a vector representation of the bag M in which 3 is the frequency of x, 2 is the frequency of y, and so on. Using the same analogy, we can imagine a document as a bag and a set of documents as a set of bags aligned as the columns of a matrix, say X. This matrix X is the term-document matrix, with each column representing a bag and each row a unique member. A particular element x_ij of the matrix corresponds to the frequency of the i-th term in the j-th document (or bag). To capture the whole intuition, let us assume three documents:

Doc1: We stayed very closely connected.
Doc2: Charger stayed connected with phone.
Doc3: His phone charger closely resembled mine.

The term-document matrix of frequencies for these three documents could be:

Table 2. Term-Document matrix

            Doc1  Doc2  Doc3
We           1     0     0
stayed       1     1     0
very         1     0     0
closely      1     0     1
connected    1     1     0
charger      0     1     1
with         0     1     0
phone        0     1     1
his          0     0     1
resembled    0     0     1
mine         0     0     1

In the above matrix, the rows are terms and the columns are documents: it has 3 documents (Doc1, Doc2 & Doc3) and 11 unique terms (tokens in this case), giving dimension 11x3. In a similar way, a given query can be represented as a bag of words, and estimating the relevance of a query to the documents in this manner is called the bag-of-words hypothesis in information retrieval. This hypothesis states that a column vector in a term-document matrix captures the meaning of the corresponding document, to some extent.
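As a concrete illustration, the term-document matrix of Table 2 can be built in a few lines. This is a minimal sketch in Python 3 (the paper's own implementation used Python 2.7 and is not shown here); the variable names are ours:

```python
from collections import Counter

docs = {
    "Doc1": "We stayed very closely connected",
    "Doc2": "Charger stayed connected with phone",
    "Doc3": "His phone charger closely resembled mine",
}

# Lowercase and split each document into tokens, i.e. a "bag of words".
bags = {name: Counter(text.lower().split()) for name, text in docs.items()}

# The vocabulary (the rows) is the set of unique terms in the collection.
vocab = sorted({term for bag in bags.values() for term in bag})

# Each column of the term-document matrix is one document's frequency vector.
matrix = {name: [bag[term] for term in vocab] for name, bag in bags.items()}

print(len(vocab), "unique terms x", len(docs), "documents")  # 11 unique terms x 3 documents
```

Each value of `matrix` is one column of Table 2, with rows ordered by the sorted vocabulary.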
It should be observed that the column vector corresponding to a document in a collection tells us about the frequencies of the words in that document, with the actual order of the words lost. Such a vector may not capture the structure of the document as-is, but it works surprisingly well for search engines. We can compare the column (document) vectors to compute the similarity among them. This similarity can be computed using the Euclidean distance if we regard the columns (documents) as points in the document space. If we regard the columns (documents) as vectors in the document space, we can use cosine similarity, which measures similarity by the angle between the vectors: the larger the cosine, the more semantically related the documents are. If doc1 and doc2 are two document vectors, the cosine of the angle θ between them is computed as

cos(doc1, doc2) = dot(doc1, doc2) / (||doc1|| · ||doc2||)

where ||doc1|| and ||doc2|| are the lengths (norms) of the vectors.
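The cosine formula above translates directly into code. A minimal sketch in plain Python, using the Doc1 and Doc2 columns of Table 2 as the example vectors:

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

doc1 = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]  # Doc1 column of Table 2
doc2 = [0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0]  # Doc2 column of Table 2

# The two documents share only "stayed" and "connected".
print(round(cosine(doc1, doc2), 3))  # → 0.4
```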
The basic intuition behind cosine similarity is that the angle between the vectors matters while their lengths do not (see Figure 1). The cosine is 1 when the vectors are identical or point in the same direction (θ = 0). For the non-negative frequency vectors used here, the cosine value varies from 0 to 1, zero meaning not similar and one meaning exactly similar.

3.1 Term-Weighting
Generally, the most frequent terms carry less information than the less frequent, more surprising terms. The most effective way to capture this idea is tf-idf (term frequency-inverse document frequency) weighting. An element of the term-document matrix gets a higher weight when the term is frequent in the corresponding document (high tf) but rare in the collection of documents (low df). Hence the weight of a particular term occurrence is computed as

W_mn = tf × idf

where W_mn is the weight of term m in document n. It is demonstrated in [4] that tf-idf weighting brings a significant improvement over raw frequencies.
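The weighting W_mn = tf × idf can be sketched as follows. The paper does not state which tf-idf variant was used, so the log-scaled inverse document frequency idf = log(N/df) below is only an illustrative assumption:

```python
import math

def tf_idf(matrix):
    """Apply tf-idf weighting to a term-document count matrix (rows = terms, columns = documents)."""
    n_docs = len(matrix[0])
    weighted = []
    for row in matrix:
        df = sum(1 for count in row if count > 0)  # document frequency of this term
        idf = math.log(n_docs / df)                # rare terms get a larger idf
        weighted.append([tf * idf for tf in row])
    return weighted

# Rows: "stayed" (in Doc1 and Doc2) and "very" (only in Doc1), from Table 2.
counts = [[1, 1, 0],
          [1, 0, 0]]
weights = tf_idf(counts)
# "very" (df = 1) ends up weighted higher than "stayed" (df = 2).
```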
Figure 1. Angle between two document vectors2

So far we have talked about measuring document similarity, but VSMs can also be used for query processing. A query q can be treated as a pseudo-document, and the similarity of each document in the collection to this pseudo-document (the query) can be computed. Several other similarity measures are available, such as the Jensen-Shannon divergence, recall, precision, the Jaccard coefficient, and the harmonic mean; the choice of measure depends on the relative frequencies of adjacent words with respect to the target word.

2 https://www.math10.com/en/geometry/geogebra/geogebra.html

4. Methodology
Subtask II at FIRE was Mixed-Script Information Retrieval on code-mixed Hindi-English tweets. There were in total 23 files containing tweets for training and 12 for testing, and each file had a corresponding query. Given a query q_i, the information retrieval task was to compute the similarity between the query and the tweets in each file and return the top 20 most relevant tweets. As explained in the previous section, this query processing task can be executed using VSMs. Each file was treated as a collection of documents, and each tweet within the collection was treated as a document. The dataset comprised Hindi-English code-mixed tweets. As Twitter data is generally noisy, it was passed through several preprocessing modules. The preprocessing in our implementation included tokenizing, removing stop words, stripping punctuation, and stripping character repetitions (hiiiiiii → hi). The major issue in tokenizing Twitter data is capturing the key attributes of tweets, such as hashtags (#aap), @-mentions (@timesnow), URLs, symbols, and emoticons; these attributes were captured using regular expressions. A sample tweet after capturing these attributes, stripping punctuation, and tokenizing appeared as:

Original:     @respectshraddie shhhhh :( salman ko jail hojaegi :( #badday
Preprocessed: '@respectshraddie', 'shh', ':(', 'salman', 'jail', 'hojaegi', ':(', '#badday'
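The regular-expression capture of hashtags, @-mentions, and emoticons, together with repetition stripping, can be sketched as below. The pattern is our own minimal reconstruction, not the authors' actual expression: it covers only a handful of emoticons and collapses repeated characters to a single one (shhhhh → sh), whereas the exact collapsing rule used in the paper is not specified:

```python
import re

# Hypothetical reconstruction: match emoticons, @-mentions, and hashtags
# before ordinary words so they survive tokenization intact.
TOKEN_RE = re.compile(r"""
    [:;][-']?[()DPp]     # a few basic emoticons such as :( :) ;D
  | @\w+                 # @-mentions, e.g. @timesnow
  | \#\w+                # hashtags, e.g. #aap
  | \w+                  # ordinary words
""", re.VERBOSE)

def tokenize(tweet):
    # Collapse runs of 3+ repeated characters (hiiiiiii -> hi) before matching.
    tweet = re.sub(r"(\w)\1{2,}", r"\1", tweet)
    return TOKEN_RE.findall(tweet)

print(tokenize("@respectshraddie shhhhh :( salman ko jail hojaegi :( #badday"))
```

Stop-word removal (dropping 'ko' as in the sample above) would be a separate step after tokenization.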
Tokenization was done for all the queries as well, so the query vectors and the document (column) vectors lay in the same vector space. The preprocessing was performed for each file (collection) over each document (tweet). After preprocessing, each file (collection) was fed to the information retrieval system. The similarity scores for each tweet in a collection given a query were computed and saved in a list, and the top 20 tweets for the given query were retrieved via the index values of the top 20 similarity scores in that list. There was a provision for submitting three systems per team; we submitted two. The first system was the one explained above. In the second system, we manually removed some Hindi stop words such as ka, jo, ke, ne, to, ve, le, etc., but this did not yield any better results. All the implementations were done in Python 2.7, and the related code will be made available on the author's GitHub page.

5. Result and Analysis
The results were declared roughly two weeks after the submission. There were 7 teams in total, and our system performed well compared to the others [6]. The evaluation was done by calculating Mean Average Precision (MAP), a standard measure for comparing search algorithms. The results for Q1 for system 1 and system 2 can be seen in Figure 2 and Figure 3, respectively.

Figure 2. System-1 results

Figure 3. System-2 results

The results of the top three performers in the subtask are given in Table 3.

Table 3. Mean Average Precision of the top three performers

Team Name     No. of runs   Run 1    Run 2   Run 3
Amrita_CEN    1             0.0377   NIL     NIL
CEN@Amrita*   2             0.0315   0.016   NIL
UB            3             0.0217   0.016   0.015
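Since the related code is only promised for the author's GitHub page, the core retrieval loop described in Section 4 (tf-idf vectors for every tweet, cosine scoring against the query treated as a pseudo-document, and ranking by score) can be sketched as follows. This is our reconstruction in Python 3 under stated assumptions; the function names and toy data are illustrative, not the authors' actual code:

```python
import math
from collections import Counter

def tfidf_weights(tokenized_docs):
    """Inverse document frequencies and tf-idf vectors for a tokenized collection."""
    n = len(tokenized_docs)
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    idf = {term: math.log(n / count) for term, count in df.items()}
    vecs = [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in tokenized_docs]
    return idf, vecs

def cosine(u, v):
    """Cosine similarity between two sparse (dict) vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def top_k(query_tokens, tweets, k=20):
    """Return the indices of the k tweets most similar to the query."""
    idf, vecs = tfidf_weights(tweets)
    # The query is a pseudo-document in the same vector space (unknown terms dropped).
    qvec = {t: tf * idf[t] for t, tf in Counter(query_tokens).items() if t in idf}
    scores = sorted(((cosine(qvec, v), i) for i, v in enumerate(vecs)), reverse=True)
    return [i for _, i in scores[:k]]
```

For example, with three toy tokenized tweets, `top_k(['aam', 'aadmi', 'party'], tweets, k=2)` ranks the tweet sharing those terms first.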
6. Conclusion
The shared task on code-mixed information retrieval was indeed a unique task, capturing the latest trend in social media. We used Vector Space Models (VSMs) of semantics to compute the similarity between the tweets and a given query. Our system was ranked 2nd among all the participants, but the Mean Average Precision (MAP) value was still very low in absolute terms. This suggests that code-mixed IR is a difficult task: existing algorithms do not perform as expected on such data and require substantial further attention.

Acknowledgements
The authors would like to thank the organizers of the Forum for Information Retrieval Evaluation (FIRE) for organizing this event, and the organizers of the shared task on Mixed Script Information Retrieval (MSIR) for organizing this much-coveted task for Indian social media.

REFERENCES
[1] Gupta, P., Bali, K., Banchs, R. E., Choudhury, M., and Rosso, P. 2014. Query expansion for mixed-script information retrieval. In Proceedings of the 37th International ACM SIGIR Conference on Research & Development in Information Retrieval, pp. 677-686. ACM.
[2] Chakma, K., and Das, A. 2016. CMIR: A corpus for evaluation of code mixed information retrieval of Hindi-English tweets. Computación y Sistemas 20(3): 425-434.
[3] Turney, P. D., and Pantel, P. 2010. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research 37(1): 141-188.
[4] Salton, G., and Buckley, C. 1988. Term-weighting approaches in automatic text retrieval. Information Processing & Management 24(5): 513-523.
[5] Soman, K. P., Loganathan, R., and Ajay, V. 2009. Machine Learning with SVM and Other Kernel Methods. PHI Learning Pvt. Ltd.
[6] Banerjee, S., Chakma, K., Naskar, S. K., Das, A., Rosso, P., Bandyopadhyay, S., and Choudhury, M. 2016. Overview of the Mixed Script Information Retrieval at FIRE. In Working Notes of FIRE 2016 - Forum for Information Retrieval Evaluation, Kolkata, India, December 7-10, 2016, CEUR Workshop Proceedings. CEUR-WS.org.