CMEE-IL: Code Mix Entity Extraction in Indian Languages from Social Media Text @ FIRE 2016 – An Overview

Pattabhi RK Rao
AU-KBC Research Centre
MIT Campus of Anna University, Chrompet, Chennai, India
+91 44 22232711
pattabhi@au-kbc.org

Sobha Lalitha Devi
AU-KBC Research Centre
MIT Campus of Anna University, Chrompet, Chennai, India
+91 44 22232711
sobha@au-kbc.org

ABSTRACT
The penetration of smart devices such as mobile phones and tablets has significantly changed the way people communicate. This has led to the growth of social media tools such as Twitter and Facebook chats for communication, and with it to new challenges and perspectives in language technology research. Automatic processing of such texts requires new methodologies, and there is a great need for automatic systems for information extraction, retrieval and summarization. Entity recognition is an important subtask of information extraction and finds applications in information retrieval, machine translation and other higher-level Natural Language Processing (NLP) applications such as coreference resolution. The main issues in handling such social media texts are i) spelling errors, ii) abbreviated new vocabulary such as "gr8" for "great", iii) use of symbols such as emoticons/emojis, iv) use of meta tags and hashtags, and v) code mixing. Entity recognition and extraction has gained increased attention in the Indian research community; however, there is no benchmark data on which all these systems can be compared, per language, for this new generation of user-generated text. Towards this we organized the Code Mix Entity Extraction in Indian Languages (CMEE-IL) track on social media text at the Forum for Information Retrieval Evaluation (FIRE). We present an overview of the CMEE-IL 2016 track, describe the corpora created for Hindi-English and Tamil-English, and summarize the approaches used by the participants.
CCS Concepts
• Computing methodologies ~ Artificial intelligence
• Computing methodologies ~ Natural language processing
• Information systems ~ Information extraction

Keywords
Entity Extraction; Social Media Text; Code Mixing; Twitter; Indian Languages; Tamil; Hindi; English; Named Entity Annotation Corpora for Code Mix Twitter data.

1. INTRODUCTION
Over the past decade, Indian language content on media such as websites, blogs, email and chats has increased significantly, and with the advent of smart phones more people use social media such as Twitter and Facebook to comment on people, products, services, organizations and governments. Content growth is driven by people from non-metros and small cities who are mostly more comfortable in their own mother tongue than in English, and Indian language content is expected to grow by more than 70% every year. Hence there is a great need to process this huge volume of data automatically. Companies in particular are interested in ascertaining public opinion on their products and processes. This requires natural language processing systems that recognize entities, their associations and the relations between them; hence automatic entity extraction systems are required.

The objectives of this evaluation are:
• the creation of benchmark data for entity extraction in Indian language code-mixed social media text;
• the development of Named Entity Recognition (NER) systems for Indian language social media text.

Entity extraction has been actively researched for over 20 years. Most of the research has, however, focused on resource-rich languages such as English, French and Spanish. The scope of this work covers named entity recognition in social media text (Twitter data) for Indian languages. In the past, events such as the Workshop on NER for South and South East Asian Languages (NER-SSEA, 2008) and the Workshop on South and Southeast Asian Natural Language Processing (SANLP, 2010 and 2011) were conducted to bring the various research works on NER onto a single platform. The NERIL tracks at FIRE (Forum for Information Retrieval and Evaluation) in 2013 and 2014 contributed benchmark data and boosted research on NER for Indian languages. All these efforts used newswire texts. User-generated texts such as Twitter and Facebook posts are diverse and noisy: they contain non-standard spellings, abbreviations and unreliable punctuation. Apart from these writing-style and language challenges, another challenge is concept drift (Dredze et al., 2010; Fromreide et al., 2014): the distribution of language and topics on Twitter and Facebook is constantly shifting, leading to performance degradation of NLP tools over time.

Some of the main issues in handling such texts are i) spelling errors, ii) abbreviated new vocabulary such as "gr8" for "great", iii) use of symbols such as emoticons/emojis, iv) use of meta tags and hashtags, and v) code mixing. For example:

"Muje kabi bhoolen gy to nhi na? :( Want ur sweet feedback about my FC ? mai dilli jaa rahi hoon".
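The code-mixed tweet above exhibits several of these phenomena at once. As a rough illustration, such noise can be flagged programmatically; the patterns and the abbreviation list below are our own simplifications, not the track's preprocessing rules:

```python
import re

# Illustrative patterns for the noise types i)-v) listed above;
# these are simplified assumptions, not the track's actual rules.
URL_RE = re.compile(r"https?://\S+")
HASHTAG_RE = re.compile(r"#\w+")
MENTION_RE = re.compile(r"@\w+")
EMOTICON_RE = re.compile(r"[:;=][-']?[()DPp]")
ABBREVIATIONS = {"gr8": "great", "ur": "your", "b4": "before"}  # toy list

def describe_noise(tweet: str) -> dict:
    """Report which of the listed noise phenomena appear in a tweet."""
    tokens = tweet.split()
    return {
        "urls": URL_RE.findall(tweet),
        "hashtags": HASHTAG_RE.findall(tweet),
        "mentions": MENTION_RE.findall(tweet),
        "emoticons": EMOTICON_RE.findall(tweet),
        "abbreviations": [t for t in tokens if t.lower() in ABBREVIATIONS],
    }

print(describe_noise("Muje kabi bhoolen gy to nhi na? :( Want ur sweet feedback about my FC ?"))
# {'urls': [], 'hashtags': [], 'mentions': [], 'emoticons': [':('], 'abbreviations': ['ur']}
```

Detecting the code-mixing itself requires a language identifier, which is beyond this small sketch.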
Research on analyzing social media data has been taken up for English through various shared tasks. The language identification in tweets (TweetLID) shared task held at SEPLN 2014 addressed identifying tweets in six different languages. SemEval 2013, 2014 and 2015 held shared task tracks focused on sentiment analysis in tweets, with two subtasks: contextual polarity disambiguation and message polarity classification. For Indian languages, Amitav et al. (2015) organized a shared task titled 'Sentiment Analysis in Indian Languages' as part of MIKE 2015, where sentiment analysis was done for tweets in Hindi, Bengali and Tamil. Named entity recognition in Twitter was explored through the shared task on noisy user-generated text organized by Microsoft as part of ACL-IJCNLP 2015, which had two subtasks: Twitter text normalization and named entity recognition for English.

The ESM-IL track at FIRE 2015 was the first to provide entity-annotated benchmark data for social media text, but under an idealistic scenario in which users use only one language. Nowadays we observe that users code-mix even in writing on social media platforms. Thus there is a need to develop systems that focus on code-mixed social media texts. There have also been other efforts on code-mixed social media text in information retrieval applications (the MSIR tracks at FIRE 2015 and 2016).

The paper is organized as follows: Section 2 describes the challenges in named entity recognition for Indian languages. Section 3 describes the corpus annotation, the tag set and the corpus statistics. Section 4 gives an overview of the approaches used by the participants, and Section 5 concludes the paper.

2. CHALLENGES IN INDIAN LANGUAGE ENTITY EXTRACTION
The challenges in developing entity extraction systems for Indian language social media text arise from several factors. One of the main factors is that no annotated data of this kind has been available for any of the Indian languages, as the earlier initiatives concentrated on newswire text. Apart from the lack of annotated data, the other factors which differentiate Indian languages from European languages are the following:

a) Ambiguity – ambiguity between common and proper nouns. E.g., the common word "Roja", meaning rose flower, is also a person name.

b) Spell variations – one of the major challenges is that different people spell the same entity differently. For example, in Tamil the person name "Roja" is spelt "rosa" and "roja".

c) Fewer resources – most Indian languages are resource-poor. No automated tools that can handle social media text are available for the preprocessing tasks required for NER, such as part-of-speech tagging and chunking.

Apart from these challenges, we also find that developing automatic entity recognition systems is difficult for the following reasons:

i) Tweets contain a huge range of distinct named entity types. Almost all these types (except for persons and locations) are relatively infrequent, so even a large sample of manually annotated tweets contains very few training examples.

ii) Twitter's 140-character limit means tweets often lack sufficient context to determine an entity's type without the aid of background or world knowledge.

iii) In comparison with English, Indian languages have more dialectal variation. These dialects are mainly influenced by different regions and communities.

iv) Indian language tweets are multilingual in nature and predominantly contain English words.

The following example illustrates the usage of English words and spoken, dialectal forms in tweets.

Example 1 (Tamil):
Ta: Stamp veliyittu ivaga ativaangi …..
En: stamp released these_people get_beaten ….
Ta: othavaangi …. kadasiya kovai
En: get_slapped … at_end kovai
Ta: pooyi pallakaatti kuththu vaangiyaachchu.
En: gone show_tooth punch got
("They released a stamp, got slapped and beaten … at the end even reached Kovai and got punched on the face")

This is a Tamil tweet written in a particular dialect that also uses English words.

Similarly, in Hindi we find many spelling variations: words such as "mumbai", "gaandhi", "sambandh" and "thanda" each have at least three different spelling variants.
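Such transliteration-induced spell variation is commonly handled with approximate string matching. The following sketch is our own illustration, not a technique prescribed by the track; it greedily groups candidate spellings by a character-level similarity ratio, with the threshold chosen ad hoc:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Character-level similarity ratio between two spellings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def group_variants(spellings, threshold=0.6):
    """Greedily cluster spellings whose similarity to a cluster's
    first member exceeds the threshold."""
    clusters = []
    for s in spellings:
        for cluster in clusters:
            if similarity(s, cluster[0]) >= threshold:
                cluster.append(s)
                break
        else:
            clusters.append([s])
    return clusters

# Variants of the person name "Roja" and the place name "Mumbai"
print(group_variants(["roja", "rosa", "mumbai", "mumbay", "bombai"]))
# [['roja', 'rosa'], ['mumbai', 'mumbay', 'bombai']]
```

A real system would use a transliteration-aware distance rather than raw edit similarity, but the sketch shows the basic idea of conflating variant spellings before lookup.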
3. CORPUS DESCRIPTION
The corpus was collected using the Twitter API in two different time periods: the training partition during May–June 2015 and the test partition during August–September 2015. As explained in the sections above, Twitter data exhibits concept drift; collecting the data in two different periods lets us evaluate how well the systems handle it. In the present initiative the corpus is available for two code-mixed language pairs, Hindi-English and Tamil-English. Table 1 below shows the corpus statistics.

3.1 ANNOTATION TAGSET
The corpus for each language pair was annotated manually by trained experts. The named entity recognition task requires the entities mentioned in a document to be detected, their sense disambiguated, the attributes to be assigned to each entity selected, and the entity represented with a tag. Defining the tag set is therefore an important aspect of this work. The tag set should cover the major classes or categories of entities, and it should be usable at both coarse and fine granularity depending on the application; hence a hierarchical tag set is the suitable choice.

Although the Automatic Content Extraction (ACE) NE tag set has been used in most prior work, we use a different tag set here: the ACE tag set is fine-grained towards the defense/security domain, whereas the Government of India standardized tag set used in this work is more generic. This hierarchical tag set was developed at AU-KBC Research Centre and standardized by the Ministry of Communications and Information Technology, Govt. of India. It is used widely in the Cross Lingual Information Access (CLIA) and Indian Language – Indian Language Machine Translation (IL-IL MT) consortium projects.

In this tag set the named entity hierarchy is divided into three major classes: Entity Name, Time and Numerical expressions. The Name hierarchy has eleven attributes, and the Numerical Expression and Time hierarchies have four and seven attributes respectively. Person, Organization, Location, Facilities, Cuisines, Locomotives, Artifact, Entertainment, Organisms, Plants and Diseases are the eleven types of named entities. Numerical expressions are categorized as Distance, Money, Quantity and Count. Time, Year, Month, Date, Day, Period and Special day are considered time expressions. The tag set consists of a three-level hierarchy: the top (first) level has 22 tags, the second level 49 tags and the third level 31 tags, for a total of 102 tags in the schema. The data provided to the participants was annotated with only the first level of the hierarchy, i.e., only the 22 top-level tags; the other levels of tagging were hidden. This was done to make it a little easier for the participants to develop their systems using machine learning methods.
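For concreteness, the 22 first-level tags enumerated above can be collected into a small validation helper. Only the category names come from the tag set description; the exact label strings used in the released files are an assumption here:

```python
# First-level (coarse) tags of the hierarchical tag set described above.
# The label spellings are our assumption; only the category names are
# taken from the tag set description.
NAME_TAGS = {
    "PERSON", "ORGANIZATION", "LOCATION", "FACILITIES", "CUISINES",
    "LOCOMOTIVES", "ARTIFACT", "ENTERTAINMENT", "ORGANISMS",
    "PLANTS", "DISEASES",
}
NUMEX_TAGS = {"DISTANCE", "MONEY", "QUANTITY", "COUNT"}
TIMEX_TAGS = {"TIME", "YEAR", "MONTH", "DATE", "DAY", "PERIOD", "SPECIAL_DAY"}

LEVEL1_TAGS = NAME_TAGS | NUMEX_TAGS | TIMEX_TAGS
assert len(LEVEL1_TAGS) == 22  # 11 name + 4 numerical + 7 time tags

def is_valid_tag(tag: str) -> bool:
    """Check an annotation label against the first-level tag inventory."""
    return tag.upper().replace(" ", "_") in LEVEL1_TAGS
```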
The data statistics are as follows:

Table 1. Corpus Statistics

Language        No. of Tweets   No. of NEs
Hindi-English   10129           7573
Tamil-English   4576            2454

The NE distribution in both language datasets is dominated by Person, Location and Entertainment entities, which shows that the majority of the communication is about movies and persons.

3.2 DATA FORMAT
The participants were provided the annotation markup in a separate file called the annotation file; the raw tweets had to be downloaded separately using the Twitter API. The annotation file is a column-format file, with each column tab-separated. It consists of the following columns:

i) Tweet_ID
ii) User_ID
iii) NE_TAG
iv) NE raw string
v) NE start index
vi) NE length

For example:

Tweet_ID: 123456789012345678
User_ID: 1234567890
NE_TAG: ORGANIZATION
NE Raw String: SonyTV
Index: 43
Length: 6

The Index column is the starting character position of the NE, calculated within each tweet with the count starting from 0. The participants were instructed to provide the test file annotations in the same format as the training data.
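A minimal reader for this format might look as follows. This is a sketch under the assumptions just stated (one tab-separated record per line, columns in the order listed); the record type and function names are our own:

```python
import csv
from typing import NamedTuple, List

class NERecord(NamedTuple):
    tweet_id: str
    user_id: str
    ne_tag: str      # first-level tag, e.g. ORGANIZATION
    ne_string: str   # raw entity string as it appears in the tweet
    start: int       # 0-based character offset within the tweet
    length: int      # length of the entity string in characters

def read_annotations(path: str) -> List[NERecord]:
    """Parse a tab-separated CMEE-IL style annotation file."""
    records = []
    with open(path, encoding="utf-8", newline="") as f:
        for row in csv.reader(f, delimiter="\t"):
            if len(row) != 6:
                continue  # skip malformed lines
            tid, uid, tag, ne, start, length = row
            records.append(NERecord(tid, uid, tag, ne, int(start), int(length)))
    return records

def check(record: NERecord, tweet_text: str) -> bool:
    """Sanity check: the stored offsets should recover the entity string."""
    return tweet_text[record.start:record.start + record.length] == record.ne_string
```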
4. SUBMISSION OVERVIEWS
In this evaluation exercise we used precision, recall and F-measure, which are the standard metrics for this task. A total of 21 teams registered for participation in this track; later, 9 teams were able to submit their systems for evaluation, with a total of 25 test runs. All the teams participated in the Hindi-English language pair, and 5 teams participated in the Tamil-English language pair.

We also developed a baseline system without any preprocessing of the data and without any lexical resources, using Conditional Random Fields (CRFs) over the raw data with no other features (a minimal sketch of such a baseline is given below). This baseline was built to enable a better comparative study, and it was observed that all the teams outperformed it.
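As an illustration of this kind of featureless CRF baseline, the sketch below uses the sklearn-crfsuite package, which is our assumption; the track does not specify how its baseline was implemented. The surface token is the only feature:

```python
# pip install sklearn-crfsuite   (assumed library; the track's actual
# baseline implementation is not specified)
import sklearn_crfsuite

def featurize(tokens):
    # Raw tokens as the only feature, mirroring the baseline described above.
    return [{"token": tok} for tok in tokens]

# Toy example: a tokenized code-mixed tweet with first-level BIO labels.
# Both the tweet and its labels are hypothetical.
train_X = [featurize(["mai", "dilli", "jaa", "rahi", "hoon"])]
train_y = [["O", "B-LOCATION", "O", "O", "O"]]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1,
                           max_iterations=100)
crf.fit(train_X, train_y)

print(crf.predict([featurize(["dilli", "jaa", "rahi", "hoon"])]))
```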
The following paragraphs briefly explain the approach used by each team; the different methodologies are summarized in Table 2, and all the teams' results are given in Tables 3 and 4.

The Irshad team used neural networks to develop their system, with external Wikipedia data for creating word embeddings. They did no cleaning of the tweets, such as removal of URLs or emoticons, and no NLP preprocessing of the text beyond converting the data to BIO format. This team participated only in Hindi-English and submitted 1 run.

The Deepak team used CRFs, preprocessing the data with tokenization, and also used gazetteer lists of disease names. This team submitted results for both Hindi-English and Tamil-English.

The Veena team used the machine learning method SVM, with word2vec for feature engineering and extraction, drawing on external corpora from the MSIR 2016 and ICON 2015 track datasets. They submitted 3 runs each for Hindi-English and Tamil-English, additionally using stylometric features, suffixes and prefixes, and gazetteers in run 3. It is interesting to note that despite the many kinds of features and resources, the system's performance was not significantly higher than that of runs in which none of these features were used.

The Barathi team submitted 2 runs each for Hindi-English and Tamil-English, using CRFs and Random Forest Trees. Run 1 was based on lexical features and the CRF algorithm; run 2 added to the run 1 features an additional binary feature (entity or not) decided by the Random Forest Tree.

The Rupal team used decision tree and extremely randomized tree algorithms. The precision obtained is comparatively lower than that of the other ML methods used by the teams above. They cleaned the data of emojis and URLs as the first step of processing.

The team led by Somnath used CRFs through the popular CRF++ tool. The system's performance was relatively low, which can probably be attributed to the lack of proper feature extraction and feature engineering.

One interesting observation is that the team led by Nikhil also used neural networks, similarly to the Irshad team, but without any external resources for training; this suggests that the data size needs to grow for better machine learning.

The team led by Srinidhi used SVM with context-based character embeddings for feature engineering, drawing on several external unlabeled datasets such as the MSIR 2016 and ICON 2015 shared task datasets.

All the systems were evaluated automatically by comparison with the gold annotations. The results obtained by the participant systems are shown in Tables 3 and 4.

5. CONCLUSION
The main objective, creating benchmark data representing some of the popular Indian languages, has been achieved, and this data has been made available to the research community free of charge for research purposes. The data is user generated and is not genre specific. Efforts are still ongoing to standardize this data and make it a better data set for future researchers. We observe that the results obtained on the Hindi-English data are higher than those on Tamil-English; this is because the Tamil-English data is noisier and smaller. We hope to see more publications in this area in the coming days from the research groups that could not submit their results, and we expect more groups to start using this data for their research work.

This CMEE-IL track is one of the first efforts towards the creation of entity-annotated, user-generated, code-mixed social media text for Indian languages. The CMEE-IL annotation uses a hierarchical tag set, so the annotated data can serve many kinds of applications. The tag set is exhaustive and has finer tags: applications that require fine-grained tags can use the data with the full annotation, while for applications that do not, the finer tags can be suppressed. Since the data is generic, it can be used to develop generic systems upon which domain-specific systems can be built after customization.

6. ACKNOWLEDGMENTS
We thank the FIRE 2016 organizers for giving us the opportunity to conduct the evaluation exercise.

7. REFERENCES
[1] Arkaitz Zubiaga, Iñaki San Vicente, Pablo Gamallo, José Ramom Pichel Campos, Iñaki Alegría Loinaz, Nora Aranberri, Aitzol Ezeiza, and Víctor Fresno. 2014. TweetLID@SEPLN 2014, Girona, Spain, September 16th, 2014. CEUR Workshop Proceedings 1228, CEUR-WS.org.

[2] Mark Dredze, Tim Oates, and Christine Piatko. 2010. "We're not in Kansas anymore: detecting domain changes in streams". In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 585–595. Association for Computational Linguistics.

[3] Hege Fromreide, Dirk Hovy, and Anders Søgaard. 2014. "Crowdsourcing and annotating NER for Twitter #drift". European Language Resources Association (ELRA).

[4] H.T. Ng, C.Y. Lim, and S.K. Foo. 1999. "A Case Study on Inter-Annotator Agreement for Word Sense Disambiguation". In Proceedings of the ACL SIGLEX Workshop on Standardizing Lexical Resources (SIGLEX99), Maryland, pp. 9–13.

[5] Preslav Nakov, Torsten Zesch, Daniel Cer, and David Jurgens. 2015. Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015).

[6] Preslav Nakov, Sara Rosenthal, Zornitsa Kozareva, Veselin Stoyanov, Alan Ritter, and Theresa Wilson. 2013. SemEval-2013 Task 2: Sentiment Analysis in Twitter. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013).

[7] Rajeev Sangal and M. G. Abbas Malik. 2011. Proceedings of the 2nd Workshop on South and Southeast Asian Natural Language Processing (SANLP).

[8] Aravind K. Joshi and M. G. Abbas Malik. 2010. Proceedings of the 1st Workshop on South and Southeast Asian Natural Language Processing (SANLP). (http://www.aclweb.org/anthology/W10-36)

[9] Rajeev Sangal, Dipti Misra Sharma, and Anil Kumar Singh. 2008. Proceedings of the IJCNLP-08 Workshop on Named Entity Recognition for South and South East Asian Languages. (http://www.aclweb.org/anthology/I/I08/I08-03)

[10] Pattabhi RK Rao, CS Malarkodi, Vijay Sundar R, and Sobha Lalitha Devi. 2014. Proceedings of the Named-Entity Recognition for Indian Languages track at FIRE 2014. http://au-kbc.org/nlp/NER-FIRE2014/
Table 2. Participant Team Overview – Summary

Barathi-AmrithaT2
  Languages & runs: Hindi-English: 2; Tamil-English: 2
  ML approach: Run 1: Conditional Random Field; Run 2: Conditional Random Field + Random Forest Tree
  Pre-processing: Tweet Preprocessor alone, used to eliminate http links and emoticons (both runs)
  Open-source NLP tools: Tweet Preprocessor, Scikit-Learn, sklearn-crfsuite, NLTK
  Variation between runs: along with the run 1 features, a binary feature (the outcome of the Random Forest Tree) is utilized in run 2

Deepak-IITPatna
  Languages & runs: Hindi-English: 1; Tamil-English: 1
  ML approach: machine learning (CRFs) + rule-based system
  Pre-processing: tokenization by CMU tagger + token encoding (IOB)
  Lexical resources: dictionary of disease names, living things and special days
  Open-source NLP tools: CMU ARK tagger, CRF++

Irshad-IIIT-Hyd
  Languages & runs: Hindi-English: 1
  ML approach: simple feed-forward neural network with 1 hidden layer of 200 nodes; activation function: rectifier; learning rate: 0.03; dropout: 0.5; learning rule: AdaGrad; regularization: L2; mini-batch: 200; trained for 25 iterations
  Pre-processing: converted the given data to BIO format
  Lexical resources: English Wikipedia corpus used to develop word embeddings with Gensim Word2Vec
  Open-source NLP tools: Gensim Word2Vec

Nikhil_BITSHyd (Nikhil Bharadwaj Gosala, BITS Pilani, Hyderabad Campus)
  Languages & runs: Hindi-English: 2
  ML approach: Run 1: seq2seq LSTM network with 3 layers and 192 nodes in each layer; Run 2: seq2seq LSTM network with 4 layers and 256 nodes in each layer
  Pre-processing: 1) replacement of HTML escape characters; 2) tweet tokenization; 3) stop-word removal; 4) rule tagging; 5) mapping of common misspellings
  Lexical resources: NLTK stop words
  Open-source NLP tools: NLTK word tokenizer and NLTK stop words
  Variation between runs: number of hidden layers (3 vs. 4) and nodes per layer (192 vs. 256)

Rupal_BITSPilani (Rupal Bhargava)
  Languages & runs: Hindi-English: 3; Tamil-English: 3
  ML approach: Run 1: Decision Tree (Hindi-English), Decision Tree (Tamil-English); Run 2: Extremely Randomized Tree (Hindi-English), Decision Tree (Tamil-English); Run 3: Extremely Randomized Tree (Hindi-English), Extremely Randomized Tree (Tamil-English)
  Pre-processing: convert to lowercase, remove links and tokenize
  Lexical resources: PyEnchant (a Python English dictionary); gazetteer lists created from the annotations file
  Variation between runs: differences are in the machine-learning technique used

ShivkaranAMU3 (Srinidhi Skanda V, CEN@Amrita)
  Languages & runs: Hindi-English: 1; Tamil-English: 1
  ML approach: context-based character embedding
  Pre-processing: 1) tokenizing the data into one token per line; 2) adding a special tag to identify the end of each tweet; 3) converting the input datasets to IOB format
  Lexical resources: Hindi-English: unlabeled datasets from Mixed Script Information Retrieval (MSIR) 2016 and the International Conference on Natural Language Processing (ICON) 2015 POS tagging task, plus external Twitter data collected by web scraping; Tamil-English: unlabeled datasets from Sentiment Analysis in Indian Languages (SAIL-2015)
  Open-source NLP tools: word2vec model, SVM-Light

SomnathJU (Somnath Banerjee, Jadavpur University)
  Languages & runs: Hindi-English: 1
  ML approach: Conditional Random Fields
  Pre-processing: cleaning of links and emoticons
  Open-source NLP tools: CRF++

VeenaAMU1 (Anand Kumar M, Amrita Vishwa Vidyapeetham)
  Languages & runs: Hindi-English: 3; Tamil-English: 3
  ML approach: Run 1: wang2vec (structured skip-gram) based embedding features; Run 2: word2vec based embedding features; Run 3: stylometric features
  Pre-processing: tokenization, BIO formatting
  Lexical resources: MSIR 2016 & ICON 2015, SAIL 2015, Twitter dataset
  Open-source NLP tools: wang2vec, word2vec, SVM-Light
  Variation between runs: Run 1: structured skip-gram embedding features (the structured skip-gram model takes word position into consideration when extracting features); Run 2: neural network (word2vec) based embedding features; Run 3: stylometric features (prefix, suffix, punctuation, hashtags, gazetteer features, index, length, etc.)

Table 3. Evaluation Results for Hindi-English (Precision / Recall / F-measure)

Team                   Run 1                  Run 2                  Run 3                  Best Run
Irshad-IIIT-Hyd        80.92 / 59.00 / 68.24  NA                     NA                     80.92 / 59.00 / 68.24
Deepak-IIT-Patna       81.15 / 50.39 / 62.17  NA                     NA                     81.15 / 50.39 / 62.17
Veena-Amritha-T1       75.19 / 29.46 / 42.33  75.00 / 29.17 / 42.00  79.88 / 41.37 / 54.51  79.88 / 41.37 / 54.51
Barathi-Amritha-T2     76.34 / 31.15 / 44.25  77.72 / 31.84 / 45.17  NA                     77.72 / 31.84 / 45.17
Rupal-BITS-Pilani      58.66 / 32.93 / 42.18  58.84 / 35.32 / 44.14  59.15 / 34.62 / 43.68  58.84 / 35.32 / 44.14
Somnath-JU             37.49 / 40.28 / 38.83  NA                     NA                     37.49 / 40.28 / 38.83
Nikhil-BITS-Hyd        59.28 / 19.64 / 29.50  61.80 / 26.39 / 36.99  NA                     61.80 / 26.39 / 36.99
Shivkaran-Amritha-T3   48.17 / 24.90 / 32.83  NA                     NA                     48.17 / 24.90 / 32.83
AnujSaini              72.24 / 18.85 / 29.90  NA                     NA                     72.24 / 18.85 / 29.90

Table 4. Evaluation Results for Tamil-English (Precision / Recall / F-measure)

Team                   Run 1                  Run 2                  Run 3                  Best Run
Deepak-IIT-Patna       79.92 / 30.47 / 44.12  NA                     NA                     79.92 / 30.47 / 44.12
Veena-Amritha-T1       77.38 /  8.72 / 15.67  74.74 /  9.93 / 17.53  79.51 / 21.88 / 34.32  79.51 / 21.88 / 34.32
Barathi-Amritha-T2     77.70 / 15.43 / 25.75  79.56 / 19.59 / 31.44  NA                     79.56 / 19.59 / 31.44
Rupal-BITS-Pilani-R2   55.86 / 10.87 / 18.20  58.71 / 12.21 / 20.22  58.94 / 11.94 / 19.86  58.71 / 12.21 / 20.22
Shivkaran-Amritha-T3   47.62 / 13.42 / 20.94  NA                     NA                     47.62 / 13.42 / 20.94
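For reference, the entity-level precision/recall/F-measure evaluation described in Section 4 can be reproduced with a simple scorer. Exact matching of tweet ID, offsets and tag is our assumption of the matching criterion; the track paper does not spell it out:

```python
def prf(gold, predicted):
    """Entity-level precision/recall/F1. `gold` and `predicted` are sets
    of (tweet_id, start, length, tag) tuples; exact-match scoring is an
    assumption here."""
    tp = len(gold & predicted)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical gold and system annotations for one tweet.
gold = {("123", 43, 6, "ORGANIZATION"), ("123", 4, 5, "LOCATION")}
pred = {("123", 43, 6, "ORGANIZATION"), ("123", 0, 3, "PERSON")}
print(prf(gold, pred))  # (0.5, 0.5, 0.5)
```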