=Paper=
{{Paper
|id=Vol-2266/T3-2
|storemode=property
|title=CUSAT_TEAM@IECSIL-FIRE-2018:A Named Entity Recognition System for Indian Languages
|pdfUrl=https://ceur-ws.org/Vol-2266/T3-2.pdf
|volume=Vol-2266
|authors=Ajees A P ,Sumam Mary Idicula
|dblpUrl=https://dblp.org/rec/conf/fire/PI18a
}}
==CUSAT_TEAM@IECSIL-FIRE-2018: A Named Entity Recognition System for Indian Languages==
Ajees A P (Research Scholar, CUSAT, Cochin 682022, India, ajeesap87@gmail.com) and Sumam Mary Idicula (Professor, CUSAT, Cochin 682022, India, sumam@cusat.ac.in)

Abstract. Named Entity Recognition (NER) is the process of classifying the elementary units of a text document into meaningful categories such as person, location, organization, etc. It is a significant preprocessing step in the semantic analysis of natural language text. There has been enormous growth of Indian language content on media such as websites, blogs, email, and chats over the past decade. Automatic processing of this huge volume of unstructured data is a challenging task, especially when companies want to ascertain public opinion on their products and processes. NER is one of the subtasks of Information Extraction (IE); extracting structured information from natural language text is the ultimate goal of IE systems. Different methods have been proposed and experimented with for NER. In this paper, we propose a Named Entity Recognition system for Indian languages using Conditional Random Fields (CRFs). Training and testing are conducted on the shared corpus provided by the ARNEKT-IECSIL 2018 competition organizers. The evaluation results show that the proposed system outperforms most of the methods reported in the competition.

Keywords: Named Entity Recognition · Conditional Random Fields · Natural Language Processing · Supervised learning.

===1 Introduction===
The Internet is the fastest growing resource in the world. Large amounts of information are added to the web every second, but this information is stored in an unstructured manner. Retrieving relevant information from this unstructured text is a challenging task that attracts the attention of language researchers. Information Extraction (IE), a branch of Artificial Intelligence, deals with this challenge [11]. IE transforms unstructured text into a structured form that can be easily handled by machines. Named Entity Recognition is one of the subdomains of IE. It is the process of identifying a word or phrase that refers to a particular entity within a text. The term entity was coined at the Sixth Message Understanding Conference (MUC-6) [6], and most of the benchmarks in NER are also reported from the MUC conferences. Recognizing the semantically meaningful classes of words in an unstructured text is the goal of NER systems. Even though different solutions have been reported for the problem, it is still an open area of research.

Categorizing articles according to their content helps in smooth content discovery. NER systems can automatically scan articles and identify the important entities mentioned in them; knowing the relevant tags for an article helps in automatic categorization and hence easier content discovery [10]. NER systems can also be used to strengthen search algorithms. Most online publications have millions of articles in their databases, and scanning the complete list of articles for every query would take enormous time. Tagging all the articles with relevant entity tags and storing those tags separately can speed up the search operation to a considerable extent. Content recommendation is another application where NER systems can be utilized: extracting the entities from viewed articles and recommending other articles with similar entities can improve recommendation systems.
NER systems also help in identifying the portions of a text that should be transliterated rather than translated for meaning.

The structure of this article is as follows. Section 2 briefly reviews the related works. Section 3 explains the task description and gives details about the dataset. Section 4 discusses the methodology, and Section 5 illustrates the experiments and results. Finally, Section 6 concludes the article along with some routes for future work.

===2 Related Works===
The term named entity refers to a word or phrase that clearly distinguishes one item from another set of items. MUC-6, where the term named entity was introduced, categorizes entities into three classes, namely ENAMEX, TIMEX, and NUMEX. ENAMEX comprises entities like person, location, and organization; date and time are included in TIMEX; NUMEX covers entities like money, quantity, and percentage.

Mainly three types of approaches are reported for NER: supervised, semi-supervised, and unsupervised. Supervised methods try to build a model by looking at annotated training examples. Here a set of features is used to represent each word in the training data. These features form the input to the learning algorithm, and the tags of the words act as supervisors to fine-tune the model parameters. The Hidden Markov Model, the Maximum Entropy Markov Model, and SVM models are some of the models employed in such studies [13].

The major motivation for semi-supervised learning algorithms is the lack of enough labeled data. Semi-supervised learning algorithms make use of both labeled and unlabeled data to create their own hypothesis. They start with a small amount of labeled data and continue with a large amount of unlabelled corpus to build the classifier; more annotations are generated iteratively until a threshold is reached. NER using AdaBoost is an example of a semi-supervised NER system [7].

Unsupervised learning algorithms were introduced to overcome the requirements of supervised learning, which demands a robust set of features and a large amount of annotated corpora. 'KNOWITALL', proposed by Etzioni et al., is a pillar example of an unsupervised Named Entity Recognition system [8]. It is a domain-independent system that makes use of domain-independent extraction patterns to generate candidate facts.

When it comes to Indian languages, the major challenges in NER are as follows. The capitalization feature is absent in almost all Indian languages [9], whereas languages like English exploit capitalization in the identification of named entities. The morphological richness of word forms is another problem in Indian languages, which makes it difficult to identify root words from their inflected forms [14]. Ambiguity at the word level is also a challenge: the same word can act as an entity or a common noun in different contexts. Most Indian languages are free word order languages, which affects n-gram based approaches to NER [3]. Spelling variations in names create another hindrance, since the same name is spelled in different ways by different people. All these issues accumulate and make NER in Indian languages a challenging problem.

CRFs are very promising for the entity recognition task. Even the best performing Stanford entity tagger is based on CRFs.
CRFs are not novel to the field of NER in Indian languages. Many works have been reported for different South Asian languages, including Tamil, Hindi, Telugu, and Malayalam. Most of them are based on personal and limited datasets, which is the major bottleneck of these works. Sharma et al. [15], Srikanth et al. [16], Prasad et al. [12], and Vijayakrishna et al. [17] are some of the works reported for Indian languages using CRFs.

===3 Task Description and Dataset Details===
The shared task is divided into two subparts, task-A and task-B [5]. Task-A deals with the identification of named entities from raw text, and task-B deals with extracting relations among the entities in a sentence. Both tasks come under the domain of Information Extraction (IE), which is an area under constant research. The growth of research in this area leads to the advancement of applications like information search, question answering, document summarization, etc. Five Indian languages are considered for this shared task: Tamil, Hindi, Kannada, Telugu, and Malayalam. It is well known that IE works significantly well for languages like English, as shown by applications like Google search and frameworks like Stanford CoreNLP, OpenNLP, and many more. The same does not hold for Indian languages due to their morphologically rich nature and agglutinative structure. Hence the goal of this task is to improve Information Extraction systems for Indian languages [2].

The shared dataset contains data from five different Indian languages [4]. The training data for task-A is a set of files in plain text format. Each file consists of words and their labels on a line-by-line basis. Each language has more than five lakh samples of training data. Statistics of the training data for task-A are shown in Table 1. The testing data contains two files, test1 and test2; test1 is for pre-evaluation and test2 is for final evaluation.

{| class="wikitable"
|+ Table 1. Training data statistics
! Language !! # sentences !! # words !! # unique words
|-
| Hindi || 76537 || 1548570 || 88198
|-
| Tamil || 134030 || 1626260 || 186267
|-
| Malayalam || 65188 || 903521 || 145240
|-
| Telugu || 63223 || 840908 || 108224
|-
| Kannada || 20536 || 318356 || 73836
|}
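To make the training data format concrete, the following is a minimal sketch, not the organizers' or authors' code, of how such line-by-line word/label files could be read into sentences. The tab separator, the blank-line sentence delimiter, and the file name are assumptions made purely for illustration.

<pre>
# Minimal sketch of reading line-by-line word/label files into sentences.
# Assumed format (not specified in the paper): "word<TAB>label" per line,
# with an empty line marking a sentence boundary.
from typing import List, Tuple

def read_tagged_file(path: str) -> List[List[Tuple[str, str]]]:
    """Return a list of sentences, each a list of (word, label) pairs."""
    sentences, current = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:                    # assumed sentence delimiter
                if current:
                    sentences.append(current)
                    current = []
                continue
            word, label = line.split("\t")  # assumed column separator
            current.append((word, label))
    if current:
        sentences.append(current)
    return sentences

# Example usage (hypothetical file name):
# train_sents = read_tagged_file("hindi_train.txt")
</pre>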
===4 Proposed Method===
The proposed system is a CRF-based sequence labeling model with words as the input sequence and entity tags as the output sequence. CRFs are probabilistic graphical models used for labeling sequential data. They can be used to predict any sequence in which multiple variables depend on each other. A key advantage of CRFs over other sequence labeling models is their flexibility in including a range of arbitrary, dependent features of the input. Since Indian languages are morphologically rich, a wide variety of such morphological features can be used to enrich the input word representation. Figure 1 gives a graphical illustration of a CRF: each vertex represents a random variable and each edge represents an association between random variables. CRFs are free from the label bias problem, a weakness exhibited by Maximum Entropy Markov Models, and they are capable of producing multiple output variables that are mutually dependent.

[Fig. 1. CRF: A graphical illustration]

Let W = w_1, w_2, w_3, ..., w_n be the input sequence and Y = y_1, y_2, y_3, ..., y_n be the corresponding label sequence. CRFs try to maximize the conditional probability distribution P(Y | W) given the input sequence. The best entity tag sequence for a word sequence is calculated as shown in equation (1),

\hat{\bar{y}} = \arg\max_{\bar{y}} P(\bar{y} \mid \bar{W}; \bar{w})    (1)

where \bar{W} is the observable word sequence and \bar{y} is the corresponding hidden entity tag sequence. The probability of a tag sequence \bar{y} for a given word sequence \bar{W} is calculated as in equation (2), where \bar{w} denotes the weight vector and F the global feature vector:

P(\bar{y} \mid \bar{W}; \bar{w}) = \frac{\exp(\bar{w} \cdot F(\bar{W}, \bar{y}))}{\sum_{\bar{y}' \in Y} \exp(\bar{w} \cdot F(\bar{W}, \bar{y}'))}    (2)

The conditional probability is defined by a set of feature functions, each assigned a particular weight, as shown in equation (3). A feature function can inspect the entire observation sequence \bar{W}, the current tag y_i, the previous tag y_{i-1}, and the current position i in the observation sequence. The global feature vector is obtained by summing the local feature functions over all positions of the sequence:

F(\bar{W}, \bar{y}) = \sum_{i} f(y_{i-1}, y_i, \bar{W}, i)    (3)

Finally, the best (named entity) tag sequence is decoded using the Viterbi algorithm.

===5 Experiments and Results===
The architecture of the proposed system is shown in Figure 2. The first stage of the architecture is the preprocessing stage, where the tagged text is converted into sequences of words and sequences of tags. In the second stage, each word of each sentence is sent to a feature preparation module, where the features for that word are prepared; the sequences of words are thus converted into sequences of feature vectors. The features considered for CRF training are the word itself, the preceding words, the following words, suffixes of different lengths, number information, length information, etc. The labels of the words are likewise converted into sequences of tags to facilitate CRF training. The third phase of the architecture is the training phase, where the model parameters are learned. We have used python-crfsuite, a Python binding of CRFsuite, for training [1]. Training is performed on the tagged data for 50 epochs and the model is saved. The final phase of the architecture is the testing phase, where the performance of the model is assessed. Twenty percent of the total data is used for testing, and the words in the test data are preprocessed in the same way as the training data.

[Fig. 2. Architecture of the proposed system]

The proposed system is evaluated on two test datasets (pre-evaluation and final evaluation). Our system predicts the label sequence for each input sentence. Table 2 shows the results of our system on both datasets. It is clear from the results that the performance of our system is promising compared with that of the other methods reported in this competition.

{| class="wikitable"
|+ Table 2. Results
! Test data !! Hindi !! Kannada !! Malayalam !! Tamil !! Telugu !! Average
|-
| Test 1 (Accuracy %) || 97.67 || 97.03 || 97.44 || 97.36 || 97.72 || 97.44
|-
| Test 2 (Accuracy %) || 97.65 || 97.09 || 96.86 || 96.85 || 97.69 || 97.23
|}
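To illustrate the feature preparation, training, and decoding steps described above, the following is a minimal sketch using python-crfsuite [1]. The exact feature templates, context window sizes, and training parameters are not spelled out in the paper, so the concrete choices below (suffix lengths 2 and 3, a one-word context window, 50 iterations, the model file name, the toy tags) are assumptions for illustration only.

<pre>
# Minimal sketch of CRF feature preparation, training, and decoding
# with python-crfsuite; concrete feature choices are illustrative only.
import pycrfsuite

def word2features(sent, i):
    """Build string features for the i-th word of a (word, label) sentence."""
    word = sent[i][0]
    feats = [
        "bias",
        "word=" + word,
        "suffix2=" + word[-2:],          # suffixes of different lengths
        "suffix3=" + word[-3:],
        "isdigit=%s" % word.isdigit(),   # number information
        "length=%d" % len(word),         # length information
    ]
    if i > 0:
        feats.append("prev_word=" + sent[i - 1][0])   # preceding word
    else:
        feats.append("BOS")
    if i < len(sent) - 1:
        feats.append("next_word=" + sent[i + 1][0])   # following word
    else:
        feats.append("EOS")
    return feats

def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]

def sent2labels(sent):
    return [label for _, label in sent]

# Toy sentence with hypothetical tags; real training would use the sentences
# read from the shared corpus files (see the earlier loading sketch).
train_sents = [[("राम", "name"), ("दिल्ली", "location"), ("गया", "other")]]

trainer = pycrfsuite.Trainer(verbose=False)
for sent in train_sents:
    trainer.append(sent2features(sent), sent2labels(sent))
trainer.set_params({"max_iterations": 50})  # assumed to correspond to "50 epochs"
trainer.train("ner_model.crfsuite")         # hypothetical model file name

# Decoding: CRFsuite applies the Viterbi algorithm to pick the best tag sequence.
tagger = pycrfsuite.Tagger()
tagger.open("ner_model.crfsuite")
predicted_tags = tagger.tag(sent2features(train_sents[0]))
</pre>

In practice, the same feature pipeline is applied to the held-out test words, and the predicted tag sequences are compared against the gold labels to obtain the accuracies reported in Table 2.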
===6 Conclusion===
In this paper, we have discussed a CRF-based Named Entity Recognition system for Indian languages. The distinguishing feature of this approach is its performance compared to other sequence labeling techniques. The main reason we preferred CRFs over traditional statistical methods is their ability to model sequence-to-sequence learning problems. Since CRFs are statistical models, the performance of the system can be improved by increasing the training data size. The performance can also be improved by incorporating word-embedding-based cluster features into the CRF training; due to the lack of sufficient computational resources, we could not carry out that experiment. Apart from NER, Conditional Random Fields can also be applied to various NLP applications like POS tagging, semantic role labeling, phrase chunking, etc.

===References===
1. A Python binding for CRFsuite. https://github.com/scrapinghub/python-crfsuite, accessed 2017-09-30
2. Arnekt Solutions: Information extractor for conversational systems in Indian languages (2018), https://iecsil.arnekt.com, [Online; accessed 14-July-2018]
3. Athavale, V., Bharadwaj, S., Pamecha, M., Prabhu, A., Shrivastava, M.: Towards deep learning in Hindi NER: An approach to tackle the labelled data scarcity. arXiv preprint arXiv:1610.09756 (2016)
4. Barathi Ganesh, H.B., Soman, K.P., Reshma, U., Mandar, K., Prachi, M., Gouri, K., Anitha, K., Anand Kumar, M.: Information extraction for conversational systems in Indian languages - ARNEKT IECSIL. In: Forum for Information Retrieval Evaluation (2018)
5. Barathi Ganesh, H.B., Soman, K.P., Reshma, U., Mandar, K., Prachi, M., Gouri, K., Anitha, K., Anand Kumar, M.: Overview of ARNEKT IECSIL at FIRE-2018 track on information extraction for conversational systems in Indian languages. In: FIRE (Working Notes) (2018)
6. Bindu, M., Idicula, S.M.: Named entity identifier for Malayalam using linguistic principles employing statistical methods. International Journal of Computer Science Issues (IJCSI) 8(5) (2011)
7. Carreras, X., Màrquez, L., Padró, L.: A simple named entity extractor using AdaBoost. In: Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, Volume 4, pp. 152–155. Association for Computational Linguistics (2003)
8. Etzioni, O., Cafarella, M., Downey, D., Popescu, A.M., Shaked, T., Soderland, S., Weld, D.S., Yates, A.: Unsupervised named-entity extraction from the web: An experimental study. Artificial Intelligence 165(1), 91–134 (2005)
9. Goyal, A., Kumar, M., Gupta, V.: Named entity recognition: Applications, approaches and challenges
10. ParallelDots: Named entity recognition: Applications and use cases (2016), [Online; accessed 7-July-2018]
11. Patil, N., Patil, A.S., Pawar, B.: Survey of named entity recognition systems with respect to Indian and foreign languages. International Journal of Computer Applications 134(16) (2016)
12. Prasad, G., Fousiya, K., Kumar, M.A., Soman, K.: Named entity recognition for Malayalam language: A CRF based approach. In: 2015 International Conference on Smart Technologies and Management for Computing, Communication, Controls, Energy and Materials (ICSTM), pp. 16–19. IEEE (2015)
13. Sasidhar, B., Yohan, P., Babu, A.V., Govardhan, A.: A survey on named entity recognition in Indian languages with particular reference to Telugu. International Journal of Computer Science Issues (IJCSI) 8(2), 438 (2011)
14. Shah, H., Bhandari, P., Mistry, K., Thakor, S., Patel, M., Ahir, K.: Study of named entity recognition for Indian languages. Int. J. Inf 6(1), 11–25 (2016)
15. Sharma, R., Goyal, V.: Name entity recognition systems for Hindi using CRF approach. In: Information Systems for Indian Languages, pp. 31–35. Springer (2011)
16. Srikanth, P., Murthy, K.N.: Named entity recognition for Telugu. In: Proceedings of the IJCNLP-08 Workshop on Named Entity Recognition for South and South East Asian Languages (2008)
17. Vijayakrishna, R., Sobha, L.: Domain focused named entity recognizer for Tamil using conditional random fields. In: Proceedings of the IJCNLP-08 Workshop on Named Entity Recognition for South and South East Asian Languages (2008)