<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>CEN@Amrita: Information Retrieval on Code-Mixed Hindi-English Tweets Using Vector Space Models</article-title>
      </title-group>
      <abstract>
        <p>Information retrieval from social media platforms is one of the major challenges today. Most of the content on these platforms is informal and noisy, which makes the retrieval task harder. The task is even more difficult for Twitter, because the character limit per tweet forces users to express themselves in a condensed set of words. In the Indian context the scenario is more complicated still: users prefer to type in their mother tongue, but the lack of input tools forces them to use the Roman script with English embeddings. This combination of multiple languages written in the Roman script makes information retrieval even harder. Query processing for such Code-Mixed content is difficult because a query can be in either language and needs to be matched against documents written in either language. In this work, we addressed this problem using Vector Space Models, which performed better than most of the other participating systems. The Mean Average Precision (MAP) of our system was 0.0315, the second-best performance for the subtask.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>CCS Concepts</title>
      <p>• Information Systems ➝ Information Retrieval ➝ Retrieval models and ranking</p>
      <p>• Computing methodologies ➝ Artificial Intelligence ➝ Natural Language Processing</p>
    </sec>
    <sec id="sec-3">
      <sec id="sec-3-1">
        <title>1. INTRODUCTION</title>
        <p>
          Social media holds a plenitude of user-generated data in numerous
languages, and this data is predominantly informal in nature. Most of
these languages have their own native scripts, such as Arabic, Chinese,
Hebrew, Greek, and the Indic scripts. For most of these languages, much
of the user-generated content is transliterated into the Roman script
with English embeddings. The trend in Indian social media is to use such
informal text, containing a mixture of multiple South Asian languages
with English embeddings. This mixture makes the Information Retrieval
(IR) task very challenging. At the Forum for Information Retrieval
Evaluation (FIRE)<sup>1</sup> 2016, a task of this kind was proposed,
which required Mixed Script IR on Code-Mixed Hindi-English tweets.
The difference between Code-Mixed IR and Mixed Script IR is subtle.
In Mixed Script content, the query is written in Roman or native
script [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], whereas in Code-Mixed content, the query is a Roman
transliteration of a different language. The Code-Mixed corpus
provided for MSIR Subtask II contained English and Roman-transliterated
Hindi Twitter data [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. The major issue with such a corpus is that the same (Hindi)
word can be written with several different transliterations. For
example, “कम”, meaning “less” in Hindi, can be spelled in Roman
transliteration as km, kam, kum, kmm, etc. These variations make it
hard for the IR system to match the query with the correct document in
a document set, which significantly affects the performance of the IR
system. Extracting information from such Code-Mixed social media text
is increasingly important, for example for business analytics. The rest
of the paper is organized as follows: Section 2 describes the
information retrieval subtask, Section 3 explains the Vector Space
Models used for information retrieval, Section 4 explains the
methodology used in this work, and Section 5 discusses the results
obtained and their analysis.
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>2. Task Description</title>
        <p>Subtask II of the shared task on Mixed Script IR, on Code-Mixed
Hindi-English tweets, was to retrieve the 20 most relevant tweets from
a document given a query. Both the query and the document were in
Roman script, with Code-Mixed Hindi and English. The corpus consisted
of a set of documents, each containing several hundred (or thousand)
tweets. The corpus was further organized by topics and queries, and
each topic had at least one query related to the topic description.
Table 1 shows the structure of the training/testing corpus provided
for the subtask. The training and testing corpora had 10 and 3 topics
respectively, with 23 queries for training and 12 for testing. A
narrative for each topic was also given in the corpus, describing the
tweets under that topic. For example, topic 001 (Aam Aadmi Party) has
four queries under the same description (Table 1), and each of these
four queries had a separate document with a corresponding number of
tweets. Given a query, the IR task was to rank the tweets in the
corresponding document from the most relevant to the least relevant
to that query.</p>
        <p><sup>1</sup> https://msir2016.github.io/</p>
      </sec>
      <sec id="sec-3-3">
        <title>3. Vector Space Models</title>
        <p>
          Vector Space Models (VSMs) represent each document in a collection
as a vector of the terms that occur in it [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. The given
query is represented in the same document space and is therefore also
called a pseudo-document. Because a document is represented as a
vector of the terms that occur in it, these terms must first be
identified; together they form the vocabulary of the collection. When
there is more than one document, each document is a (typically large)
vector, and it is convenient to organize these vectors into a matrix
called the term-document matrix. Its rows correspond to terms and its
columns to documents. A document serves as the context in which a
term is understood. If phrases, sentences, paragraphs, chapters, etc.
are taken as the context instead of whole documents, we get a
word-context matrix. Similarly, we can also build pair-pattern
matrices [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ].
        </p>
        <p>To picture a term-document matrix, think of a multiset from set
theory. A multiset is like a set, except that it allows multiple
instances of the same element. For example,
<inline-formula><tex-math>M = \{a, a, a, b, b, c\}</tex-math></inline-formula>
is a multiset containing the elements <italic>a</italic>, <italic>b</italic>
and <italic>c</italic>. Just as with sets, the order of the elements in a
multiset does not matter: any reordering, such as
<inline-formula><tex-math>\{c, a, b, a, b, a\}</tex-math></inline-formula>,
is the same multiset. Multisets are also called bags, and a bag can be
represented as a vector whose components denote the frequencies of its
elements, i.e.
<inline-formula><tex-math>x = \langle 3, 2, 1 \rangle</tex-math></inline-formula>
is the vector representation of the bag <italic>M</italic>, in which 3 is
the frequency of <italic>a</italic>, 2 is the frequency of <italic>b</italic>,
etc. Using the same analogy, we can picture a document as a bag and a
set of documents as a set of bags aligned as columns in a matrix, say
<italic>X</italic>. This matrix <italic>X</italic> is the term-document
matrix, with each column representing a bag and each row representing a
unique member. A particular element
<inline-formula><tex-math>x_{ij}</tex-math></inline-formula> of the matrix
is the frequency of the <italic>i</italic>-th term in the
<italic>j</italic>-th document (or bag). To capture the whole intuition,
consider three documents:</p>
        <p>Doc1: We stayed very closely connected.</p>
        <p>Doc2: Charger stayed connected with phone.</p>
        <p>Doc3: His phone charger closely resembled mine.</p>
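        <p>As an illustration only, the following Python sketch counts the term frequencies of Doc1 to Doc3 and arranges them as a term-document matrix. The helper names <monospace>tokenize</monospace> and <monospace>term_document_matrix</monospace>, the lowercasing, the removal of the final period and the alphabetical row order are assumptions made for this example; they are not details of the system described in this paper.</p>
        <preformat>
```python
from collections import Counter

docs = [
    "We stayed very closely connected.",
    "Charger stayed connected with phone.",
    "His phone charger closely resembled mine.",
]


def tokenize(text):
    """Lowercase, drop periods and split on whitespace (assumed scheme)."""
    return text.lower().replace(".", "").split()


def term_document_matrix(documents):
    """Return (vocabulary, matrix); matrix[i][j] is the raw frequency
    of term i in document j (rows = terms, columns = documents)."""
    bags = [Counter(tokenize(d)) for d in documents]
    vocabulary = sorted(set(t for bag in bags for t in bag))
    matrix = [[bag[term] for bag in bags] for term in vocabulary]
    return vocabulary, matrix


vocabulary, matrix = term_document_matrix(docs)
for term, row in zip(vocabulary, matrix):
    print("%-10s %s" % (term, row))
```
        </preformat>
        <p>Under these assumptions the vocabulary has 11 terms, so the matrix is 11x3; for instance, the row for "charger" is [0, 1, 1] because the word occurs once each in Doc2 and Doc3.</p>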
        <p>In the term-document frequency matrix for the above three
documents, the rows are terms and the columns are documents. With 3
documents (Doc1, Doc2 &amp; Doc3) and 11 unique terms (tokens in this
case), the matrix has dimension 11x3. In a similar way, a given query
can be represented as a bag of words. Estimating the relevance of
documents to a query in this manner rests on the bag-of-words
hypothesis in information retrieval, which states that a column vector
in a term-document matrix captures (to some extent) the meaning of the
corresponding document. Note that the column vector corresponding to a
document tells us the frequency of the words in the document, but the
actual order of the words is lost. The vector therefore does not
capture the structure of the document, yet it works surprisingly well
in search engines. We can compare the column (document) vectors to
compute the similarity between them. If the columns (documents) are
regarded as points in the document space, this similarity can be
computed using the Euclidean distance. If they are regarded as vectors
in the document space, we can use cosine similarity, which measures
similarity by the angle between the vectors. The larger the cosine,
the more semantically related the documents are. If
<inline-formula><tex-math>d_1</tex-math></inline-formula> and
<inline-formula><tex-math>d_2</tex-math></inline-formula> are two
document vectors, the cosine of the angle
<inline-formula><tex-math>\theta</tex-math></inline-formula> between
them is computed as:
<disp-formula><tex-math>\cos(d_1, d_2) = \frac{d_1 \cdot d_2}{\lVert d_1 \rVert \, \lVert d_2 \rVert}</tex-math></disp-formula></p>
        <p>where <inline-formula><tex-math>\lVert d_1 \rVert</tex-math></inline-formula>
and <inline-formula><tex-math>\lVert d_2 \rVert</tex-math></inline-formula>
are the lengths (norms) of the vectors. The basic intuition behind
cosine similarity is that the angle between the vectors is important
while their length is not (see Figure 1). The cosine is 1 when the
vectors are the same or point in the same direction
(<inline-formula><tex-math>\theta = 0</tex-math></inline-formula>).
Because term frequencies are non-negative, the cosine value here ranges
from 0 to 1, with zero meaning not similar at all and one meaning
exactly similar.</p>
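        <p>The cosine measure can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used for the submitted runs; the function name <monospace>cosine_similarity</monospace> and the convention of returning 0 for an all-zero vector are assumptions of the sketch. The three vectors are the frequency columns of Doc1 to Doc3 over the alphabetically sorted vocabulary listed in the comment.</p>
        <preformat>
```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumed convention for empty documents
    return dot / (norm_u * norm_v)


# Row order: charger, closely, connected, his, mine, phone,
#            resembled, stayed, very, we, with
doc1 = [0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0]
doc2 = [1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1]
doc3 = [1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0]

print(cosine_similarity(doc1, doc2))  # two shared terms
print(cosine_similarity(doc1, doc3))  # one shared term
print(cosine_similarity(doc2, doc3))  # two shared terms
```
        </preformat>
        <p>Doc1 and Doc2 share two terms and each has five tokens, so their cosine is 2/5 = 0.4, whereas Doc1 and Doc3 share only one term and score about 0.18. Scaling a vector, for example comparing [1, 2] with [2, 4], leaves the cosine at 1, which is the length invariance mentioned above.</p>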
      </sec>
      <sec id="sec-3-4">
        <title>3.1 Term Weighting</title>
        <p>Generally, the most frequent terms carry less information than
less frequent or surprising terms. A common way to capture this idea
is <italic>tf-idf</italic> (term frequency-inverse document frequency)
weighting. An element of the term-document matrix gets a high weight
when the term is frequent in the corresponding document (high
<italic>tf</italic>) but rare in the rest of the collection (high
<italic>idf</italic>). Hence the weight of a particular term
occurrence is computed as:
<disp-formula><tex-math>w_{ij} = \mathit{tf}_{ij} \times \mathit{idf}_{i}</tex-math></disp-formula></p>
        <p>where <inline-formula><tex-math>w_{ij}</tex-math></inline-formula>
is the weight of term <italic>i</italic> in document <italic>j</italic>.
It is demonstrated in [
            <xref ref-type="bibr" rid="ref4">4</xref>
            ] that using <italic>tf-idf</italic> functions brings a
significant improvement over raw frequency.</p>
        <p>So far we have talked about measuring document similarity, but
VSMs can also be used for query processing. A query can be treated as
a pseudo-document, and the similarity of each document in the
collection to this pseudo-document (query) can be computed. Several
other similarity measures are available, such as Jensen-Shannon,
recall, precision, Jaccard, harmonic mean, etc. The choice among these
measures depends on the relative frequency of adjacent words with
respect to the target word.</p>
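        <p>The weighting can be illustrated with a short Python sketch. The text above does not fix the exact tf and idf functions, so the sketch assumes one common variant: tf is the raw count and idf is log(N/df), where N is the number of documents and df the number of documents containing the term. The function name <monospace>tfidf_matrix</monospace> and the toy counts are hypothetical.</p>
        <preformat>
```python
import math


def tfidf_matrix(counts):
    """Weight a term-document count matrix.

    counts[i][j] is the raw frequency of term i in document j.
    Assumed variant: w_ij = tf_ij * log(N / df_i).
    """
    n_docs = len(counts[0])
    weighted = []
    for row in counts:
        df = sum(1 for c in row if c != 0)  # documents containing the term
        idf = math.log(float(n_docs) / df) if df else 0.0
        weighted.append([c * idf for c in row])
    return weighted


# Toy counts for three terms in three documents (hypothetical data)
counts = [
    [1, 1, 0],  # occurs in two of three documents
    [1, 0, 0],  # occurs in one document only
    [2, 1, 1],  # occurs in every document
]
for row in tfidf_matrix(counts):
    print([round(w, 3) for w in row])
```
        </preformat>
        <p>With this variant a term that occurs in every document receives weight 0, while the term found in a single document receives the largest idf, log 3, which matches the intuition that rare terms are more informative.</p>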
      </sec>
      <sec id="sec-3-5">
        <title>4. Methodology</title>
        <p>Subtask II in FIRE was Mixed Script Information Retrieval on
Code-Mixed Hindi-English tweets. There were a total of 23 files
containing tweets for training and 12 files for testing, and each file
had a corresponding query. Given a query, the information retrieval
task was to compute the similarity between the query and the tweets in
the corresponding file and to return the 20 most relevant tweets. As
explained in the previous section, this query processing task can be
carried out using VSMs.</p>
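        <p>A minimal sketch of this ranking step is given below, assuming tf-idf weights (raw count times log(N/df)) and cosine similarity as described in Section 3. It is not the code of the submitted systems: the function names, the sparse dictionary representation and the toy tweets are all hypothetical, and the tweets are assumed to be tokenized already.</p>
        <preformat>
```python
import math
from collections import Counter


def build_idf(token_lists):
    """idf[t] = log(N / df_t) over the tweets of one collection."""
    n = len(token_lists)
    df = Counter()
    for tokens in token_lists:
        df.update(set(tokens))
    return dict((t, math.log(float(n) / d)) for t, d in df.items())


def tfidf_vector(tokens, idf):
    """Sparse tf-idf vector; terms unseen in the collection are dropped."""
    counts = Counter(tokens)
    return dict((t, c * idf[t]) for t, c in counts.items() if t in idf)


def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_tweets(query_tokens, tweets, k=20):
    """Return (tweet index, score) pairs for the k best-scoring tweets."""
    idf = build_idf(tweets)
    query_vec = tfidf_vector(query_tokens, idf)
    scores = [cosine(query_vec, tfidf_vector(t, idf)) for t in tweets]
    order = sorted(range(len(tweets)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in order[:k]]


# Hypothetical, already tokenized tweets (not taken from the corpus)
tweets = [
    ["aap", "delhi", "election"],
    ["salman", "jail", "hojaegi"],
    ["aap", "rally", "delhi", "delhi"],
    ["cricket", "match"],
]
print(rank_tweets(["aap", "delhi"], tweets, k=2))
```
        </preformat>
        <p>In this toy collection the third tweet (index 2) is ranked first and the first tweet (index 0) second, while the two tweets that share no term with the query score 0. In the task setting, k would be 20 and the function would be applied once per file with its query.</p>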
        <p>Each file was treated as a collection of documents, and each tweet
within the collection is referred to as a document. The dataset
consisted of Hindi-English code-mixed tweets. As Twitter data is
generally noisy, it was passed through several preprocessing modules.
The preprocessing in our implementation included tokenizing, removing
stop words, stripping punctuation, stripping repetitions
(hiiiiiii→ hi), etc. The major issue in tokenizing Twitter data is to
capture the key attributes of tweets, such as hashtags (#aap),
@ mentions (@timesnow), URLs, symbols, emoticons, etc. These attributes
were captured using regular expressions. A sample tweet, before and
after capturing these nuances, stripping punctuation and tokenizing,
appeared as:</p>
        <p>Original: @respectshraddie shhhhh :( salman ko jail hojaegi :( #badday</p>
        <p>After preprocessing: ‘@respectshraddie’, ‘shh’, ‘:(’, ‘salman’, ‘jail’, ‘hojaegi’, ‘:(’, ‘#badday’</p>
        <p><sup>2</sup> https://www.math10.com/en/geometry/geogebra/geogebra.html</p>
        <p>Tokenization was applied to all the queries as well, so that the
document (column) vectors and the query vectors were represented in
the same vector space. The preprocessing was performed for each file
(collection) over each document (tweet). After preprocessing, each
file (collection) was fed to the information retrieval system. The
similarity scores of all tweets in a collection with respect to the
given query were computed and saved in a list. The top 20 tweets for
the query were then retrieved using the index values of the 20 highest
similarity scores in the list.</p>
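        <p>The preprocessing steps described above could be approximated with regular expressions as in the following sketch. The patterns, the function name <monospace>preprocess</monospace> and the one-word stop list are assumptions made for illustration; the actual expressions and stop-word lists of our system are not reproduced here. Runs of three or more identical characters are cut to two so that the sketch reproduces the sample output ("shhhhh" becomes "shh").</p>
        <preformat>
```python
import re

# Order matters: URLs, mentions, hashtags and emoticons are tried
# before plain words so that they survive as single tokens.
TOKEN_RE = re.compile(r"""
    (?:https?://\S+)            # URLs
  | (?:@\w+)                    # @mentions
  | (?:\#\w+)                   # hashtags
  | (?:[:;=][-']?[()DPp/\\|])   # a few common emoticons (assumed list)
  | (?:\w+)                     # words; stray punctuation is dropped
""", re.VERBOSE | re.UNICODE)

# Three or more repeats of one character are reduced to two.
REPEAT_RE = re.compile(r"(.)\1{2,}")


def preprocess(tweet, stop_words=frozenset(["ko"])):
    """Tokenize a tweet, normalize words and drop stop words.

    The default stop list is a placeholder, not the list used in the runs.
    """
    tokens = []
    for tok in TOKEN_RE.findall(tweet):
        if tok.startswith(("http://", "https://")):
            pass                      # keep URLs untouched
        elif tok[0] in "@#":
            tok = tok.lower()         # mentions and hashtags
        elif tok[0].isalnum():
            tok = REPEAT_RE.sub(r"\1\1", tok.lower())
        if tok not in stop_words:
            tokens.append(tok)
    return tokens


sample = "@respectshraddie shhhhh :( salman ko jail hojaegi :( #badday"
print(preprocess(sample))
```
        </preformat>
        <p>On the sample tweet this yields the same eight tokens as shown above. A stricter rule that reduces every run to a single character would be needed to map "hiiiiiii" to "hi"; which rule to prefer depends on how aggressively spelling variants should be merged.</p>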
        <p>Each team was allowed to submit up to three systems, and we
submitted two. The first system was exactly as explained above. In the
second system, we additionally removed a manually selected list of
Hindi stop words. This did not yield any better results. All the
implementations were done in Python 2.7. The related code will be made
available on the authors’ GitHub page.</p>
      </sec>
      <sec id="sec-3-6">
        <title>5. Result and Analysis</title>
        <p>
          The results were declared roughly two weeks after the
submission. There were 7 teams in total, and our system performed
well compared to the others [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. The evaluation was done by
calculating the Mean Average Precision (MAP), which is a standard
measure for comparing search algorithms. The results for Q1 for
System 1 and System 2 can be seen in Figures 2 and 3.
        </p>
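        <p>For reference, MAP with binary relevance judgements can be computed as in the sketch below. The exact evaluation protocol of the organizers (for example any rank cutoff or graded relevance) is not described here, so the sketch follows the textbook definition; the function names and the example rankings are hypothetical.</p>
        <preformat>
```python
def average_precision(ranked_ids, relevant_ids):
    """Mean of precision@k over the ranks k that hold a relevant item,
    divided by the total number of relevant items."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            total += float(hits) / rank
    return total / len(relevant)


def mean_average_precision(runs):
    """runs is a list of (ranked_ids, relevant_ids) pairs, one per query."""
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Hypothetical rankings for two queries
runs = [
    ([3, 1, 7, 5], [1, 5]),  # relevant tweets found at ranks 2 and 4
    ([2, 4], [2]),           # relevant tweet found at rank 1
]
print(mean_average_precision(runs))
```
        </preformat>
        <p>The first query has average precision (1/2 + 2/4) / 2 = 0.5 and the second 1.0, giving a MAP of 0.75 for this toy example.</p>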
      </sec>
      <sec id="sec-3-7">
        <title>6. Conclusion</title>
        <p>The shared task on Code-Mixed information retrieval was a unique
task that captured the latest trend in social media. We used Vector
Space Models (VSMs) of semantics to compute the similarity between the
tweets and the given query. Our system ranked second among all the
participants. However, its Mean Average Precision (MAP) was very low
in absolute terms. This suggests that Code-Mixed IR is a difficult
task: existing algorithms do not perform as expected on such data and
need further attention to perform well.</p>
      </sec>
      <sec id="sec-3-8">
        <title>Acknowledgements</title>
        <p>The authors would like to thank the organizers of the Forum for
Information Retrieval Evaluation (FIRE) for organizing this event.
The authors would also like to thank the organizers of the shared task
on Mixed Script Information Retrieval (MSIR) for organizing this
much-coveted task for Indian social media.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Gupta</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bali</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Banchs</surname>
            ,
            <given-names>E. R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Choudhury</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <year>2014</year>
          .
          <article-title>Query expansion for mixed-script information retrieval</article-title>
          .
          <source>In Proceedings of the 37th international ACM SIGIR conference on Research &amp; development in information retrieval</source>
          , pp.
          <fpage>677</fpage>
          -
          <lpage>686</lpage>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Chakma</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Das</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <year>2016</year>
          .
          <article-title>CMIR: A Corpus for Evaluation of Code Mixed Information Retrieval of Hindi-English Tweets</article-title>
          .
          <source>Computación y Sistemas 20.3</source>
          (
          <year>2016</year>
          ):
          <fpage>425</fpage>
          -
          <lpage>434</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Turney</surname>
            ,
            <given-names>P. D.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Pantel</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <year>2010</year>
          .
          <article-title>From frequency to meaning: Vector space models of semantics</article-title>
          .
          <source>Journal of artificial intelligence research 37</source>
          .1 (
          <year>2010</year>
          ):
          <fpage>141</fpage>
          -
          <lpage>188</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Salton</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Buckley</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <year>1988</year>
          .
          <article-title>Term-weighting approaches in automatic text retrieval</article-title>
          .
          <source>Information processing &amp; management 24.5</source>
          (
          <year>1988</year>
          ):
          <fpage>513</fpage>
          -
          <lpage>523</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Soman</surname>
            ,
            <given-names>K. P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Loganathan</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Ajay</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          <year>2009</year>
          .
          <article-title>Machine learning with SVM and other kernel methods</article-title>
          .
          <source>PHI Learning Pvt. Ltd.</source>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Banerjee</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chakma</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Naskar</surname>
            ,
            <given-names>S. K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Das</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bandyopadhyay</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Choudhury</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <year>2016</year>
          .
          <article-title>Overview of the Mixed Script Information Retrieval at FIRE</article-title>
          .
          <source>In Working notes of FIRE 2016 - Forum for Information Retrieval Evaluation</source>
          , Kolkata, India, December 7-10,
          <year>2016</year>
          , CEUR Workshop Proceedings. CEUR-WS.org.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>