Vira@FIRE 2015: Entity Extraction from Social Media Text Indian Languages (ESM-IL)

Vira Bagiya, Charotar University of Science & Technology, Changa, Gujarat, India. virabagiya11@gmail.com
Anjana Patel, Charotar University of Science & Technology, Changa, Gujarat, India. 14pgce002@charusat.edu.in
Amit Ganatra, Charotar University of Science & Technology, Changa, Gujarat, India. amitganatra.ce@charusat.ac.in

ABSTRACT
In this paper we identify and extract named entities from social media text using conditional random fields (CRF) [3]. The paper presents our working methodology and results for the Entity Extraction from Social Media Text Indian Languages task of FIRE-2015. We extracted named entities for two languages, Hindi and English. The named entity extraction system is built on CRFsuite [8], a popular implementation of conditional random fields. The task is treated as a sequential labelling problem that produces the desired tagging output. Conditional random fields are a class of statistical modelling methods often applied in pattern recognition, machine learning and many natural language processing tasks. We obtain F1-scores of 19.82 and 3.72 for the Hindi and English text respectively.

Keywords
Machine learning; Named Entity Extraction; Named Entity Recognition.

1. INTRODUCTION
Named-entity recognition (NER), also known as entity identification, entity chunking and entity extraction, is a subtask of information extraction that seeks to locate and classify elements in text into pre-defined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.

Social media is a vast source of information from which a great deal of important data can be extracted as per the specified requirements. According to the 8th Schedule, India has 22 official languages. NER in Indian languages is still considered a budding research topic in the field of NLP, and much work remains to be done. Many NER taggers already exist for English and Hindi; this paper proposes a CRF-based NER tagger built with CRFsuite (Okazaki, 2007) [8]. CRFsuite is an implementation of CRF and is faster than CRF++ [7]. CRFsuite is open source software which automatically extracts features from the training data.

The paper is organized as follows. Section 2 gives an overview of the task description and the approaches applied to the NER task, together with a complete description of our system. Section 3 describes the different issues in developing such a system for different Indian languages. Section 4 presents the test results and discusses how accuracy can be increased. Finally, Section 5 concludes the paper.

1.1 Task Description
"Entity extraction from social media text in Indian Languages" is a task in which we were given a set of tweets. Our job is to annotate the named entities in these tweets and classify them into named entity tags such as Person, Organization, Location, Entertainment etc. The training dataset has three columns: tweet_id, user_id and tweet_text. Its processed, annotated counterpart contains tweet_id, user_id, named entity tag (NE tag), named entity, index and length. The same has to be produced for the provided testing dataset. Our main task is to identify the named entities in the testing dataset and apply the appropriate tag to each.

1.2 System Architecture
Our named entity recognition system is developed to classify and tag named entities into 22 different classes such as Person, Location, Organization, Entertainment etc. We were provided with a training dataset, which is used mainly for the learning process.

The 22 unique named entity tags are listed below.

Table 1: Unique named entity tags
1. PERSON
2. ORGANIZATION
3. LOCATION
4. ENTERTAINMENT
5. DAY
6. MATERIALS
7. PLANTS
8. PERIOD
9. LOCOMOTIVE
10. YEAR
11. MONEY
12. COUNT
13. FESTIVAL
14. DATE
15. QUANTITY
16. FACILITIES
17. DISEASE
18. ARTIFACT
19. MONTH
20. TIME
21. LIVTHINGS
22. SDAY

2. APPROACHES FOR NER
There are basically two approaches employed in Named Entity Recognition:
a. Rule Based Approach
b. Machine Learning Based Approach

The rule based approach relies on handcrafted or automatically generated rules or patterns. Machine learning techniques are used for statistical modelling, and learning can be unsupervised, semi-supervised or supervised. Unsupervised and semi-supervised learning are used when annotated training data is scarce, but the best performance is obtained with supervised learning, which requires a large, good quality annotated corpus.

We have used the machine learning based approach, also known as the automated or statistical approach. It is used more efficiently and more frequently than the rule based approach. We developed a system to perform NER in English and Hindi and submitted its output. We use the open-source software CRFsuite [8], one of the popular implementations of Conditional Random Fields (CRF) [3], to train a model on the training dataset and then use that model to generate tags for the test dataset.

2.1 Extracted features from learning
For English and Hindi, the following features are used. CRFsuite automatically extracts some features during learning; other features were added to tag the different named entities.

• Gazetteer (a list look-up table): A gazetteer of location names was created and applied to identify locations in India. In the same way, plant names, festival names, entertainment, locomotives and living things (LIVTHINGS) are tagged using separate gazetteers as features.
• Suffixes: In Hindi, a person name can be identified by the suffix “जी” (“ji”): a word ending in “ji” may be a person name, and this can be specified as a feature. For example, “Modiji” is a person name. Likewise, a word followed by “ko” can be marked as a person name. For example, in “Gita ko haridwar jana hain”, “Gita” is a person name followed by “ko”.
• Prefixes: Many prefixes can be used to identify named entities. For example, a word preceded by Mr., Miss or Mrs. is likely to be a person name.
• Word Context: A context window of size four is used, which takes the two words before and the two words after the current word as features. This helps model the language structure, that is, how, where and with which words entities are used in a sentence. There are seven feature values in total for word context: the word itself, the two words before it, the two words after it, and the pairing of the word with its previous and with its next word.
• POS tag: The part-of-speech (POS) tag of a word is also used as a feature, because named entities are nouns.
• Regular Expressions: We used different regular expressions to identify temporal named entities such as Date, Month, Year, Period, Day and Time.

Figure 1: System Architecture

We have used supervised learning, since a training dataset was given. We used this training dataset to train our system to tag named entities, and stored the tags in space-separated files. CRFsuite generates a model from the training data provided. Later, the system uses this model to generate the output (named entity tagging). The training dataset is the primary focus of our training. The flow chart in Figure 1 shows the basic flow of our system in detail.

We used CRFsuite to implement our NER system because features for labelling entities can easily be extracted from the provided training datasets. As it is open source software, we can also add our own features by modifying a few lines of code. Features can be generated for unigrams as well as bigrams.

2.2 Pre-processing
Social media text is noisy in nature. People use shorthand and ungrammatical text to save time, so capitalization is not applied consistently and spellings are often incorrect. This makes the data hard to handle for information extraction. First, we removed all links present in the tweets of the given testing dataset. Then tokenization, part-of-speech tagging and chunking were performed. For English we used the Stanford Part-Of-Speech tagger [5]. For Hindi we used RDRPOSTagger [6].

Input for CRFsuite (NER tagger):
Since the input to CRFsuite consists of four space-separated fields, we combined the output of the tokenizer, the POS tagger and the chunking step into one space-separated file for training. The same process was then applied to the testing dataset. For example:

PERSON Gitika NNP B-NP

Each tweet is preprocessed according to the requirements of CRFsuite, which needs a file in which each line holds a single word and its NER tag separated by white space; a blank line marks the end of a sentence. Two processed files were created, one with BIO tags, which mark multiword entities (for English), and another without them (for Hindi).

2.3 Post-processing
The output of the NER tagger is only the NE tag. We therefore combined it with its named entity, tweet_id and user_id. We then computed the length of the named entity and its index, that is, the position of the named entity in the tweet. The result is arranged in the given format, for example:

Tweet ID:623472520352636928 User Id:241166752 NETAG:PERSON NE:Ali Index:104 Length:3

3. ANALYSIS
Over the past decade, Indian language content on various media types such as websites, blogs, email and chats has increased significantly. With the advent of smart phones, more people are using social media such as Twitter and Facebook to comment on people, products, services, organizations and governments. Content growth is thus driven by people from non-metros and small cities who are mostly more comfortable in their own mother tongue than in English. Indian language content is still only a fraction of the English content, but it is expected to grow by more than 70% every year.

Hence there is a great need to process this huge volume of data automatically. Companies in particular are interested in ascertaining the public view of their products and processes. This requires natural language processing systems that identify entities and the associations or relations between entities. Hence an automatic entity extraction system is required.

Named Entity Recognition (NER) is one of the most important information extraction techniques being developed in the NLP and IR communities. Considerable success has been achieved in English with the extraction of multiple entities as per the domain of interest. However, the area poses considerable challenges in other languages, particularly Indian languages. For example, Indian languages have no capitalization.

A lot of research on NER for Indian languages is under way, for example in the workshops NERSSEA-2008 and SANLP 2010 and 2011, but there is a lack of benchmark data for comparing the existing systems. No common evaluation method exists for judging researchers' work.

4. RESULTS AND DISCUSSION

4.1 Evaluation metrics
Two standard measures, Precision (P) and Recall (R), are used to evaluate the Named Entity (NE) tagger. Precision is the number of entities correctly identified divided by the number of entities identified, and recall is the number of entities correctly identified divided by the actual number of entities. The F measure is the weighted harmonic mean of precision and recall:

$$F_\beta = \frac{(\beta^2 + 1)\,P\,R}{\beta^2 P + R}$$

When β = 1, the F measure is called the F1 measure or simply the F1 score.

4.2 Test results

Table 2: Test results
Language   Precision   Recall   F1-Score
Hindi      25.65       16.14    19.82
English    4.13        3.39     3.72

5. CONCLUSION
CRF models are appropriate for the highly inflective Indian languages and perform better than other systems such as HMM, MEMM etc. (Vijay Sundar Ram R, 2011). CRFsuite generates a model from the training data and provides the output (NE tag) according to that model. A drawback is that an NER system learned with CRF takes more time to train. The part-of-speech tag is an important feature for identifying the named entity chunk, and an incorrect part-of-speech tag for a token may reduce the accuracy of the NER system. Achieving a high performing NER system requires more study and a deeper understanding of linguistic features. Various permutations and combinations of feature sets can be tried and tested to obtain a higher recall and, eventually, higher F1-scores.

6. FUTURE WORK
For both English and Hindi we will try to obtain more accurate results in identifying and tagging named entities. To that end we will optimize our feature sets.

7. ACKNOWLEDGEMENT
We would like to thank Prof. Prasenjit Majumder for his guidance throughout the work. Additionally, we would like to thank FIRE 2015 for providing a great opportunity to work on this task and for facilitating the support.

8. REFERENCES
[1] Asif Ekbal, R.H.: Language Independent Named Entity Recognition in Indian Languages. In: IJCNLP, pp. 33–40 (2008)
[2] David Nadeau, S.S.: A survey of named entity recognition and classification. National Research Council Canada / New York University (n.d.)
[3] J. Lafferty, A. McCallum, and F. Pereira: Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In: Proceedings of ICML, pp. 282–289 (2001)
[4] Vipul Garg, Nikit Saraf, and Prasenjit Majumder: Named Entity Recognition for Gujarati: A CRF Based Approach
[5] Stanford POS tagger, http://www-nlp.stanford.edu/software/tagger.shtml
[6] RDRPOSTagger, http://rdrpostagger.sourceforge.net/
[7] Kudo, Taku: CRF++: Yet another CRF toolkit. Software available at http://crfpp.sourceforge.net (2005)
[8] Okazaki, N.: CRFsuite: A fast implementation of Conditional Random Fields (CRFs) (2007), retrieved from http://www.chokkan.org/software/crfsuite/
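
As an illustration of the word-context feature described in Section 2.1, the following Python sketch builds the seven values for one token: the word itself, the two words before it, the two words after it, and the pairings of the word with its previous and next word. The function name, the feature keys and the padding markers are hypothetical names chosen for illustration; they are not CRFsuite identifiers and are not claimed to be the exact feature templates of the submitted system.

```python
from typing import Dict, List

BOS = "<BOS>"  # hypothetical padding marker for positions before the sentence start
EOS = "<EOS>"  # hypothetical padding marker for positions after the sentence end


def word_context_features(tokens: List[str], i: int) -> Dict[str, str]:
    """Return the seven word-context values for tokens[i] (window of two on each side)."""

    def at(j: int) -> str:
        # Positions outside the sentence are replaced by padding markers.
        if j < 0:
            return BOS
        if j >= len(tokens):
            return EOS
        return tokens[j]

    return {
        "w[0]": at(i),                              # the word itself
        "w[-2]": at(i - 2),                         # two words before
        "w[-1]": at(i - 1),                         # previous word
        "w[+1]": at(i + 1),                         # next word
        "w[+2]": at(i + 2),                         # two words after
        "w[-1]|w[0]": at(i - 1) + "|" + at(i),      # pairing with previous word
        "w[0]|w[+1]": at(i) + "|" + at(i + 1),      # pairing with next word
    }


if __name__ == "__main__":
    sentence = "Gita ko haridwar jana hain".split()
    for name, value in word_context_features(sentence, 0).items():
        print(name, "=", value)
```

For the token "Gita" in the example sentence from Section 2.1, the pairing with the next word exposes the "ko" cue that the suffix feature relies on.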
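
The F measure of Section 4.1 can be checked against the reported scores with a few lines of Python. This is only a sketch: `f_measure` is a hypothetical helper name, the formula assumed is the standard weighted harmonic mean F_β = (1 + β²)PR / (β²P + R), and the precision and recall values are the rounded percentages from the test results in Section 4.2.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (the F1 score when beta == 1)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0  # avoid division by zero when nothing was correctly identified
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)


if __name__ == "__main__":
    # Precision and recall (in percent) as reported in Section 4.2.
    reported = {"Hindi": (25.65, 16.14), "English": (4.13, 3.39)}
    for language, (p, r) in reported.items():
        print(language, round(f_measure(p, r), 2))
```

Because the inputs are already rounded to two decimals, the recomputed values agree with the reported F1-scores only to within about 0.01.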