=Paper=
{{Paper
|id=Vol-1587/T4-1
|storemode=property
|title=ESM-IL: Entity Extraction from Social Media Text for Indian Languages @ FIRE 2015 - An Overview
|pdfUrl=https://ceur-ws.org/Vol-1587/T4-1.pdf
|volume=Vol-1587
|authors=Pattabhi RK Rao,Malarkodi CS,Vijay Sundar Ram R,Sobha Lalitha Devi
|dblpUrl=https://dblp.org/rec/conf/fire/RaoMRD15
}}
==ESM-IL: Entity Extraction from Social Media Text for Indian Languages @ FIRE 2015 - An Overview==
ESM-IL: Entity Extraction from Social Media Text for Indian Languages @ FIRE 2015 – An Overview Pattabhi RK Rao Malarkodi CS Vijay Sundar Ram R Sobha Lalitha Devi AU-KBC Research Centre AU-KBC Research Centre AU-KBC Research Centre AU-KBC Research Centre MIT Campus of Anna MIT Campus of Anna MIT Campus of Anna MIT Campus of Anna University, Chrompet, University, Chrompet, University, Chrompet, University, Chrompet, Chennai, India Chennai, India Chennai, India Chennai, India +91 44 22232711 +91 44 22232711 +91 44 22232711 +91 44 22232711 pattabhi@au-kbc.org csmalarkodi@au- sundar@au-kbc.org sobha@au-kbc.org kbc.org ABSTRACT public view on their products and processes. This requires natural Entity recognition is a very important sub task of Information language processing software systems which recognizes the extraction and find its applications in information retrieval, entities or the associations of them or relation between them. machine translation and other higher Natural Language Hence an automatic Entity extraction system is required. Processing (NLP) applications such as co-reference resolution. The objectives of this evaluation are: Entities are real world elements or objects such as Person names, Organization names, Product names, Location names. Entities are Creation of benchmark data for Entity Extraction in often referred to as Named Entities. Entity extraction refers to Indian language Social Media text. automatic identification of named entities in a text document. To develop Named Entity Recognition (NER) systems Given a text document, entities such as Person names, in Indian language Social Media text. Organization names, Location names, Product names are identified and tagged. We observe that in the Indian language To identify the best suiting machine learning scenario there is no social media text corpus which could be used techniques. to develop automatic systems. Entity recognition and extraction has gained increased attention in Indian research community. Entity extraction has been actively researched for over 20 years. However there is no benchmark data available where all these Most of the research has, however, been focused on resource rich systems could be compared on same data for respective languages, such as English, French and Spanish. The scope of this languages. Towards this we have organized the Entity extraction work covers the task of named entity recognition in social media in social media text track for Indian languages (ESM-IL) in the text (twitter data) for Indian languages. In the past there were Forum for Information Retrieval Evaluation (FIRE). We present events such as Workshop on NER for South and South East Asian the overview of ESM-IL 2015 track. This paper describes the Languages (NER-SSEA, 2008), Workshop on South and South corpus created for Hindi, Malayalam, Tamil and English. Here we East Asian Natural Language Processing (SANLP, 2010&2011) also present overview of the approaches used by the participants. conducted to bring various research works on NER being done on a single platform. NERIL tracks at FIRE (Forum for Information CCS Concepts Retrieval and Evaluation) in 2013 and 2014 have contributed to • Computing methodologies ~ Artificial intelligence the development of benchmark data and boosted the research • Computing methodologies ~ Natural language processing towards NER for Indian languages. All these efforts were using • Information systems ~ Information extraction texts from newswire data. The user generated texts such as twitter and facebook texts are diverse and noisy. These texts contain non- Keywords standard spellings and abbreviations, unreliable punctuation Entity Extraction; Social Media Text; Twitter; Indian Languages; styles. Apart from these writing style and language challenges, Tamil; Hindi; Malayalam; English; Named Entity Annotated another challenge is concept drift (Dredze etal., 2010; Fromreide Corpora for Twitter. et al., 2014); the distribution of language and topics on Twitter and Facebook is constantly shifting, thus leading to performance 1. INTRODUCTION degradation of NLP tools over time. Thus there is a need to Over the past decade, Indian language content on various media develop systems that focus on social media texts. types such as websites, blogs, email, chats has increased The research in analyzing the social media data is taken up in significantly. And it is observed that with the advent of smart English through various shared tasks. Language identification in phones more people are using social media such as twitter, tweets (tweetLID) shared task held at SEPLN 2014 had the task of facebook to comment on people, products, services, organizations, identifying the tweets from six different languages. SemEval governments. Thus we see content growth is driven by people 2013, 2014 and 2015 held as shared task track where sentiment from non-metros and small cities who are mostly comfortable in analysis in tweets were focused. They conducted two sub-tasks their own mother tongue rather than English. The growth of namely, contextual polarity disambiguation and message polarity Indian language content is expected to increase by more than 70% classification. In Indian languages, Amitav et al (2015) had every year. Hence there is a great need to process this huge data organized a shared task titled 'Sentiment Analysis in Indian automatically. Especially companies are interested to ascertain 74 languages' as a part of MIKE 2015, where sentiment analysis in En: stamp released these_people get_beaten …. tweets is done for tweets in Hindi, Bengali and Tamil language. Ta: othavaangi …. kadasiyakovai Named Entity recognition was explored in twitter through shared En: get_slapped … at_end kovai task organized by Microsoft as part of 2015 ACL-IJCNLP, a shared task on noisy user-generated text, where they had two sub- Ta: pooyi pallakaatti kuththu vaangiyaachchu. tasks namely, twitter text normalization and named entity En: gone show_tooth punch got recognition for English. In the NER sub-task they have used ten tags for annotating the text. The paper is organized as follows: section 2 describes the challenges in named entity recognition on (“They released stamp, got slapping and beating … at the end Indian languages. Section 3 describes the corpus annotation, the reached Kovai and got punched on the face”) tag set and corpus statistics. And section 4 describes the overview This example is a Tamil tweet where it is written in a particular of the approaches used by the participants and section 5 concludes dialect and also has usage of English words. the paper. Example 2 (Malayalam): 2. CHALLENGES IN INDIAN LANGUAGE ML: ediye … ente utuppu teechcho? illa ENTITY EXTRACTION En: hey … my dress ironed? no The challenges in the development of entity extraction systems for Indian languages from social media text arise due to several ML:chetta … raavile_tanne engottaa? factors. One of the main factors being there is no annotated data En: brother … morning_itself where? available for any of the Indian languages, though the earlier ML: tekkati …. teechchaale parayullo? initiatives have been concentrated on newswire text. Apart from the lack of annotated data, the other factors which differentiate En: hey_iron_it … only_after_ironing tell? Indian languages from other European languages are the (Hey did you iron my dress? No… brother morning itself where following: are you going? Hey iron it … only after ironing you will tell?) a) Morphologically rich – Indian languages are morphologically rich and agglutinative, hence the root This is a Malayalam tweet written in spoken form, where the word identification is difficult and requires phrase “teekku ati” has been written as “tekkati”, spoken form. morphological analyzers. This makes it resemble a place name and creates ambiguity. This b) Ambiguity – Ambiguity between common and proper makes understanding difficult. nouns. Eg: common words such as “Roja” meaning Rose flower is a name of a person. Similarly in Hindi we find lot of spell variations. Such as for the c) Spell variations – One of the major challenges is that words “mumbai”, “gaandhi”, “sambandh”, “thanda” there are different people spell the same entity differently. For atleast three different spelling variations. example: In Tamil person name -Roja is spelt as "rosa", "roja”. 3. CORPUS DESCRIPTION d) Less Resources – Most of the Indian languages are less The corpus was collected using the twitter API in two different resource languages. There are no automated tools time periods. The training partition of the corpus was collected available to perform preprocessing tasks required for during May – June 2015. And the test partition of the corpus was NER such as part-of-speech tagging, chunking which collected during Aug – Sep 2015. As explained in the above can handle social media text. sections, in the twitter data we observe concept drift. Thus to evaluate how the systems handle concept drift we had collected Apart from these challenges we also find that development of data in two different time periods. In this present initiative the automatic entity recognition systems is difficult due to following corpus is available for three Indian languages Hindi, Malayalam reasons: and Tamil. And we have also provided the corpus for English, so i) Tweets contain a huge range of distinct named entity types. that it would help researchers to compare their efforts with respect Almost all these types (except for People and Locations) are to English vis-à-vis the respective Indian languages. The relatively infrequent, so even a large sample of manually following figures show different aspects of corpus statistics. annotated tweets will contain very few training examples. 3.1 ANNOTATION TAGSET ii) Twitter has a 140 character limit, thus tweets often lack The corpus for each language was annotated manually by trained sufficient context to determine an entity’s type without the aid of experts. Named Entity Recognition task requires entities background or world knowledge. mentioned in the document to be detected, their sense to be iii) In comparison with English, Indian Languages have more disambiguated, select the attributes to be assigned to the entity dialectal variations. These dialects are mainly influenced by and represent it with a tag. Defining the tag set is a very important different regions and communities. aspect in this work. The tag set chosen should be such that it iv) Indian Language tweets are multilingual in nature and covers major classes or categories of entities. The tag set defined predominantly contain English words. should be such that it could be used at both coarse and fine grained level depending on the application. Hence a hierarchical The following examples illustrate the usage of English words and tag set will be the suitable one. Though we find that in most of the spoken, dialectal forms in the tweets. works Automatic Content Extraction (ACE) NE tag set has been Example 1 (Tamil): used, in our work we have used a different tag set. The ACE Tag set is fine grained is towards defense/security domain. Here we Ta: Stamp veliyittu ivaga ativaangi ….. 75 have used Government of India standardized tag set which is more generic. The tag set is a hierarchical tag set. This Hierarchical tag set was developed at AU-KBC Research Centre, and standardized by the Ministry of Communications and Information Technology, Govt. of India. This tag set is being used widely in Cross Lingual Information Access (CLIA) and Indian Language – Indian Language Machine Translation (IL-IL MT) consortium projects. Figure 3. Entity Distribution – for Hindi Figure 1. Corpus Statistics – No.of Tweets and Entities in each language Figure 4. Entity Distribution – for Malayalam Figure 2. Entity distribution - for English Figure 5. Entity Distribution – for Tamil 76 In this tag set, named entity hierarchy is divided into three major system was developed so that it would help in making a better classes; Entity Name, Time and Numerical expressions. The comparative study. In the following paragraphs we would be Name hierarchy has eleven attributes. Numeral Expression and briefly explaining the approaches used by each team. All the time have four and three attributes respectively. Person, teams results along with the bas system results are given in Table organization, Location, Facilities, Cuisines, Locomotives, 2. Artifact, Entertainment, Organisms, Plants and Diseases are the Pallavi team, had used CRFs, a machine learning technique to eleven types of Named entities. develop their system. They had used features such as POS, Numerical expressions are categorized as Distance, Money, Chunk, Statistical Suffixes and prefixes (unigram, bigram and Quantity and Count. Time, Year, Month, Date, Day, Period and trigrams). They had first cleaned the provided training data to Special day are considered as Time expressions. The tag set remove URLs and emoticons from tweets and pre-processed the consists of three level hierarchies. The top level (or 1 st level) text for POS and chunks. For the preprocessing purpose they have hierarchy has 22 tags, the second level has 49 tags and third level used open source NLP tools, “patter.en” for English and for Hindi has 31 tags. Hence a total of 102 tags are available in this schema. nltr. This team had participated in three languages Hindi, Tamil But the data provided to the participants consisted of only the 1 st and English. They had submitted 3 runs for Hindi and 2 runs each level in the hierarchy that is consisting of only 22 tags. The other for English and Tamil. levels of tagging were hidden. This was done to make it little easier for the participants to develop their systems using machine Sarkar team, had used HMM for the development. Here they have learning methods. preprocessed the data for POS and used POS tag as one of the states for HMM training. They had also used gazetteer lists. These 3.2 DATA FORMAT lists were collected using semi-manual efforts. And this team had The participants were provided the data with annotation markup only submitted results for English only. in a separate file called annotation file. The raw tweets were to be separately downloaded using the twitter API. The annotation file Shriya team had used machine learning method SVM. They have is a column format file, where each column was tab space used open source preprocessing tools for POS tagging and separated. It consisted of the following columns: Chunking. For Tamil and Malayalam they had developed in house POS tagger and chunker by manually annotating small data of the i) Tweet_ID training corpus. The have also used an external resource brown ii) User_Id clusters as one of the features in training SVM. Other main iii) NE_TAG features used in training are 3-word window, POS tags, heuristic iv) NE raw string features such as capitalization, statistical suffixes and prefixes up to 3 characters. This is one of the teams that has participated in all v) NE Start_Index four languages. vi) NE_Length For example: Sanjay team has used CRFs for their system development. This is another team which has participated in all four languages. This team also preprocessed the data for POS and chunking. For Tweet_ID:123456789012345678 English and Hindi they have open source tools for this purpose User_Id:1234567890 and whereas for Tamil and Malayalam in house they have NE_TAG:ORGANIZATION developed these pre-processing tools. NE Raw String:SonyTV Chintak team had used CRFs and had pre-processed data for POS Index:43 tagging and chunking. For this purpose they have used open Length:6 source tools Genia tagger, which is tuned towards biological domain. And we feel this could have resulted in very lower recall Index column is the starting character position of the NE values. They had also used features such as POS, Chunk, heuristic features. They had also used gazetteers as one of the features in calculated for each tweet and the count starts from ‘0’. The their machine learning. participants were also instructed to provide the test file annotations in the same format as given for the training data. The The team led by Vira had also used CRFs. They had used figures below show various aspects of corpus statistics. Stanford preprocessing tools. They have used window of 5 words in the features for training along with POS tag, statistical suffixes 4. SUBMISSION OVERVIEWS and prefixes. In this evaluation exercise we have used Precision, Recall and F- measure, which are widely used for this task. A total of 10 teams The team lead by Sombuddha had four different ML methods and had registered for participation in this track. Later 7 teams were submitted four different runs for English. They had also submitted able to submit their systems for evaluation. A total of 17 test runs runs for Hindi, but since the test submission did not conform to were submitted for evaluation. All the teams had participated for the format specified as per the task guidelines, it was disqualified. English and Hindi languages, except for one team which had only The features used are POS tag, window of words, heuristic participated in English language. And three teams had participated features such as Capitalisation, presence of numeric, hash tags. in Tamil, and two teams had participated in Malayalam. We had They had also used dictionary as binary feature. developed a base system without any pre-processing of the data and use of any lexical resources. We had developed this base The different methodologies used by different teams have been system by just using the raw data as such without any other summarizeed in Table 1. features. We used CRFs for developing the base system. This base 77 We observe that some of the participant systems have not [3] Hege Fromreide, Dirk Hovy, and Anders Søgaard.2014. performed well in comparison with the base system though “Crowdsourcing and annotating ner for twitter#drift”. European several features were used for training. And most of the systems language resources distributionagency. have almost the same precision scores as obtained in the base system. There is significant improvement in the recall of the [4] H.T. Ng, C.Y., Lim, S.K., Foo. 1999. “A Case Study on systems with respect to base system. A deeper analysis of the Inter-Annotator Agreement for Word Sense Disambiguation”. In results obtained by the participant systems has to be done. Proceedings of the {ACL} {SIGLEX} Workshop on Standardizing Lexical Resources {(SIGLEX99)}. Maryland. pp. 9-13. 5. CONCLUSION [5] Preslav Nakov and Torsten Zesch and Daniel Cer The main objective of creating benchmark data representing some and David Jurgens. 2015. Proceedings of the 9th International of the popular Indian languages has been achieved. And this data Workshop on Semantic Evaluation (SemEval 2015). has been made available to research community for free for [6] Nakov, Preslav and Rosenthal, Sara and Kozareva, research purposes. The data is user generated data and is not any Zornitsa and Stoyanov, Veselin and Ritter, Alan and Wilson, genre specific. Efforts are still going on to standardize this data Theresa. 2013. SemEval-2013 Task 2: Sentiment Analysis in and make it perfect data set for future researchers. We observe Twitter. Second Joint Conference on Lexical and Computational that the response from the participants for Hindi language has Semantics (*SEM), Volume 2: Proceedings of the Seventh been more than other languages. We hope to see more International Workshop on Semantic Evaluation (SemEval 2013) publications in this area in the coming days from these different research groups who could not submit their results. Also we [7] Rajeev Sangal and M. G. Abbas Malik. 2011. expect more groups would start using this data for their research Proceedings of the 1st Workshop on South and Southeast Asian work. Natural Language Processing (SANLP) This ESM-IL track is one of the first elaborate efforts towards [8] Aravind K. Joshi and M. G. Abbas Malik. 2010. creation of entity annotated user generated social media text for Proceedings of the 1st Workshop on South and Southeast Asian Indian languages. In this ESM-IL annotation tag set we have Natural Language Processing (SANLP). made use of a hierarchical tag set. Thus this annotated data could (http://www.aclweb.org/anthology/W10-36) be used for any kind of applications. This tag set is very exhaustive and has finer tags. The applications which require fine [9] Rajeev Sangal, Dipti Misra Sharma and Anil Kumar grain tags could use the data with full annotation. And for Singh. 2008. Proceedings of the IJCNLP-08 Workshop on Named applications which do not require fine grain, the finer tags could Entity Recognition for South and South East Asian Languages. be suppressed in the data. The data being generic, this could be (http://www.aclweb.org/anthology/I/I08/I08-03) used for developing generic systems upon which a domain [10] Pattabhi RK Rao, CS Malarkodi, Vijay Sundar R and specific system could be built after customization. Sobha Lalitha Devi. 2014. Proceedings of Named-Entity 6. ACKNOWLEDGMENTS Recognition Indian Languages track at FIRE 2014. http://au- We thank the FIRE 2015 organizers for giving us the opportunity kbc.org/nlp/NER-FIRE2014/ to conduct the evaluation exercise. 7. REFERENCES [1] Arkaitz Zubiaga, Iñaki San Vicente, Pablo Gamallo, José Ramom Pichel Campos, Iñaki Alegría Loinaz, Nora Aranberri, Aitzol Ezeiza, Víctor Fresno. 2014 TweetLID@SEPLN 2014, Girona, Spain, September 16th, 2014. CEUR Workshop Proceedings 1228, CEUR-WS.org 2014 [2] Mark Dredze, Tim Oates, and Christine Piatko. 2010. “We’re not in kansas anymore: detecting domainchanges in streams”. In Proceedings of the 2010 Conferenceon Empirical Methods in Natural LanguageProcessing, pages 585–595. Association for ComputationalLinguistics. 78 Table 1. Participant Team Overview - Summary Team Languages Approaches (ML Features Used Resources/Tools used &System method) Used Submissions Pallavi et al., English – 2 runs, CRFs – CRF++ tool POS, Chunk, Statistical Cleaned data to remove Hindi – 3 runs, kit suffixes and prefixes URLs, emoticons Hindustan Institute of Tamil – 2 runs Technology and Science, For English preprocessing Chennai open source tool ‘pattern.en’ For Hindi used open source tool nltr.org K Sarkar English – 1 run HMM POS tag POS Tagger – Monty Tagger – open source tool Jadavpur University, Kolkata Gazetteer List Shriya et al., English – 1 run, SVM - Machine POS, Chunk, Statistical Gazetteer list, Hindi – 1 run, Learning Suffixes, Statistical Amritha Vishwa Malayalam – 1 run prefixes, Heuristics For preprocessing used Vidyapeetam, Coimbatore Tamil – 3 runs such as capitalization NLTK, Gimpel POS tagger information, Gazetteer for English. NLTK for list, Shape feature Hindi Developed in house tools for POS and Chunking for Tamil and Malayalam Brown Cluster for English Sanjay et al., English – 2 Runs, CRFs – CRF++ tool POS, Chunk, Statistical Gazetteer list, was used Suffixes, Statistical Amritha Vishwa Hindi – 1 run, prefixes, Heuristics For preprocessing used Vidyapeetam, Coimbatore NLTK, Gimpel POS tagger Malayalam – 1 run such as capitalization for English. NLTK for information, Gazetteer Tamil – 2 Runs list, Shape feature Hindi Developed in house tools for POS and Chunking for Tamil and Malayalam Brown Cluster for English Chintak et al., English – 2 runs CRFs – CRFSuite POS, Chunk, Gazetteer Gazetteer list, tool was used information, heuristics LDRP Institute, Gujarat Hindi – 2 runs POS Tagger – RDR open source tool, Chunker – Genia tagger Vira et al, English – 1 run, CRFs – CRFSuite Word structures, English – Stanford NLp tool was used statistical suffixes and tool Charotar University of Hindi – 1 run prefixes, heuristic Science and Technology, features using Hindi – RDR tool Gujarat postpositions Sombuddha et al., English – 4 runs CRFs, Naïve Bayes, POS, Window of POS tagger open source – MIRA, Decision Words, heuristics ark-tweet-nlp tool Jadavpur University tree- J-48 features 79 Table 2. Evaluation Results Language Hindi Tamil Malayalam English Teams P R F P R F P R F P R F Base System 73.05 34.81 47.10 56.50 11.46 19.05 62.58 20.82 31.24 73.54 28.01 40.56 Shriya - Run1 71.56 54.09 61.61 55.23 11.03 18.39 51.18 40.29 45.08 58.78 40.73 48.11 Amritha Run2 - - - 61.55 19.82 29.98 - - - - - - Run3 - - - 60.82 19.42 29.44 - - - - - - Sanjay - Run1 74.65 5.26 9.83 70.11 19.81 30.89 60.05 39.94 47.97 46.78 24.90 32.50 Amritha Run2 - - - 54.87 18.91 28.13 - - - 46.88 25.64 33.15 Chintak - Run1 67.11 0.76 1.51 - - - - - - 7.30 4.17 5.31 LDRP Run2 74.73 46.84 57.59 - - - - - - 5.35 5.67 5.50 KSarkar – Run1 - - - - - - - - - 61.96 39.46 48.21 JU Vira - Run1 25.65 16.14 19.82 - - - - - - 4.13 3.39 3.72 Charotar Univ Pallavi - Run1 81.21 44.57 57.55 70.42 14.13 23.54 - - - 50.48 32.03 39.19 HITS Run2 80.86 44.25 57.20 64.52 22.14 32.97 - - - 50.21 37.06 42.64 Run3 81.49 41.58 55.06 - - - - - - - - - Sombuddha Run1 - - - - - - - - - 46.92 32.41 38.34 – JU ** Run2 - - - - - - - - - 58.09 31.85 41.15 Run3 - - - - - - - - - 49.10 31.59 38.45 Run4 - - - - - - - - - 58.09 31.85 41.15 ** Though this team had submitted Hindi runs, these were disqualified due to data format not confirming with the task guidelines. 80