Overview of Arnekt IECSIL at FIRE-2018: Track on Information Extraction for Conversational Systems in Indian Languages

Barathi Ganesh H B (1,2), Soman K P (1), Reshma U (2), Mandar Kale (2), Prachi Mankame (2), Gouri Kulkarni (2), Anitha Kale (2), and Anand Kumar M (3)

1 Center for Computational Engineering & Networking (CEN), Amrita School of Engineering, Coimbatore, Amrita Vishwa Vidyapeetham, India. barathiganesh.hb@arnekt.com
2 Arnekt Solutions Pvt. Ltd., Pune, Maharashtra, India, 411028. reshma.u@arnekt.com
3 Department of Information Technology, National Institute of Technology Karnataka Surathkal, Mangalore.

Abstract. This overview paper describes the first shared task on Information Extraction for Conversational Systems in Indian Languages (IECSIL), organized at FIRE 2018. Motivated by the need for information extractors, corpora were developed for Named Entity Recognition (Task A) and Relation Extraction (Task B) in five Indian languages (Hindi, Tamil, Malayalam, Telugu and Kannada). Task A is to identify named entities and classify them into one of several classes; Task B is to extract the relation among the entities present in a sentence. Altogether, nearly 100 submissions from 10 different teams were evaluated. In this paper we give an overview of the approaches and discuss the results that the participating teams attained.

Keywords: Information Extraction · Named Entity Recognition · Relation Extraction · IECSIL

1 Introduction

Applications of conversational systems and social media platforms have seen increased adoption by Indian language users on account of local-language-enabled keyboards and smartphones [3]. In recent times, e-tailing, digital classifieds, digital payments and online government services have also started to enable Indian language content on their platforms.
This growth momentum is likely to continue, with the Indian language Internet user base growing at a CAGR of 18% to reach 536 million by 2021, compared to the English Internet user base growing at 3% to reach 199 million. The same study shows that by 2021 almost all domains would benefit from supporting local languages, and that there would be a drastic increase in the amount of data generated compared to the present. More research work and state-of-the-art findings are likely to follow in the near future. Researchers and start-ups have already started addressing the need for language support in frequently used applications, which would in turn benefit much of the population in India.

Understanding the above scenario, Arnekt, in collaboration with FIRE, has come up with the track Arnekt-IECSIL: Information Extraction for Conversational Systems in Indian Languages (IECSIL). FIRE started off with the aim of building a South Asian counterpart to TREC, CLEF and NTCIR, and has since evolved continuously to meet new challenges in multilingual information access^4. Arnekt aims to power the world's smartest business solutions by providing state-of-the-art AI-based Cognitive Intelligence as a Service (CIaaS)^5. IECSIL involves five Indian languages (Hindi, Kannada, Malayalam, Tamil and Telugu) to start with, and is likely to be extended to cover the other major languages spoken in India in the near future. Resources for this prototype were collected using an automated, language-independent framework developed by Arnekt that creates corpora for Named Entity Recognition (NER) and Relation Extraction (RE), the tasks in IECSIL, from DBpedia. The corpora contain Named Entity and Relation tags for the five languages (Kannada (kn), Malayalam (ml), Hindi (hi), Tamil (ta) and Telugu (te)) and are not restricted to a single application.
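The entity-tagging step at the heart of such a framework can be sketched as follows. This is a simplified, hypothetical illustration, not the actual Arnekt pipeline: the lexicon entries, the `ENTITY_LEXICON` dictionary and the `tag_sentence` function are all assumptions standing in for the real DBpedia-derived entity-text pairs.

```python
# Simplified sketch of dictionary-based entity tagging, of the kind used
# to bootstrap an NER corpus from entity-text pairs mined from DBpedia.
# The lexicon below is a toy stand-in for the real meta-tag lists.

ENTITY_LEXICON = {
    "chennai": "location",
    "kamal haasan": "name",
    "1954": "date",
}

def tag_sentence(sentence: str) -> list[tuple[str, str]]:
    """Tag each token; longest lexicon match wins, default tag is 'other'."""
    tokens = sentence.lower().split()
    tagged = []
    i = 0
    while i < len(tokens):
        match = None
        # Try multi-word spans first (up to 3 tokens in this sketch).
        for span in range(min(3, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + span])
            if phrase in ENTITY_LEXICON:
                match = (span, ENTITY_LEXICON[phrase])
                break
        if match:
            span, tag = match
            tagged.extend((tok, tag) for tok in tokens[i:i + span])
            i += span
        else:
            tagged.append((tokens[i], "other"))
            i += 1
    return tagged

print(tag_sentence("Kamal Haasan was born in 1954 near Chennai"))
```

Because only the lexicon (the meta-tag-to-entity list) is built manually, a scheme like this scales to new languages far faster than token-by-token manual annotation, which is the point made in the corpus-creation description below.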
The steps taken for data creation, together with the corpus statistics, are detailed in the coming sections. Motivated by the need for information extraction described above, we have the following two tasks:

Task A: Named Entity Recognition (NER)

Corpora for five Indian languages (Hindi, Tamil, Malayalam, Telugu and Kannada) have been provided. Task A is to identify the named entities and classify them into one of several classes [2].

NER Corpus Creation: The abstract and info-box property files from DBpedia are the resources for corpus creation. In the preprocessing stage, info-box properties are extracted as meta tags, and the long-abstract files are cleaned to remove foreign-language text, URL links and other special symbols. Meta tags in non-English languages were translated into English through Google Translate. Meta tags that occur more than 100 times across all the languages were used to create the final entity and corresponding text pairs. With these entity-text pairs, the text in the cleaned abstract files was tagged. Nine tags in total (Date, Event, Location, Name, Number, Occupation, Organization, Other and Things) were considered for the NER corpus creation.

^4 http://fire.irsi.res.in/fire/2018/home
^5 https://arnekt.com/

Creating the meta-tag-to-entity list is the only manual processing involved in this framework, and it takes very little time compared to the usual manual annotation process. This corpus has been made available online^6 to the research community through the Information Extractor for Conversational Systems for Indian Languages (IECSIL) site^7. Detailed NER corpus statistics are given in Table 1.

Table 1.
NER Corpus Statistics

  Tag            hi        kn        ml        ta        te
  date           4290      1968      2606      24556     3999
  event          4968      916       1432      8439      1230
  location       278396    17484     49705     225229    159840
  name           149300    25576     101914    202120    103256
  number         63289     6519      51122     130581    47727
  occupation     26418     5136      13462     27398     14188
  organization   20831     1237      8078      16601     4156
  other          1903703   439238    1167211   1844116   959260
  things         6804      389       3435      10244     1855

^6 https://github.com/BarathiGanesh-HB/ARNEKT-IECSIL
^7 http://iecsil.arnekt.com

Task B: Relation Extraction (RE)

In continuation of Task A, corpora without named entity tags for the five Indian languages (Hindi, Tamil, Malayalam, Telugu and Kannada) have been provided. Task B is to extract the relation amongst the entities present in the sentences [1].

Relation Extraction Corpus Creation: As for NER, the relation tags are annotated through a semi-automated methodology. Initially, sentences containing at least two NER tags were taken and POS tagging was applied to them. The taggers from [5], [4] and [6] were used to create the POS-tagged corpus for all five languages. The POS tags from these tools were mapped to the 12 commonly occurring Penn Treebank POS tags, which are sufficient for the downstream application. Based on the POS pattern between the entities, each sentence was assigned a relation [1]. The relation-tagged corpus statistics are given in Table 2.

Table 2. Relation Extraction Corpus Statistics

  Tag                  hi      kn      ml      te      ta
  action 1             15517   0       0       0       1974
  action 2             740     277     340     2150    4512
  action 3             9       321     2260    1306    2661
  action neg           199     0       0       0       78
  action per           3       9       1056    23      222
  action so            0       0       248     0       0
  action quant         70      25      152     13      14
  information 1        29264   2918    13854   8550    38569
  information 2        34      172     1990    15078   815
  information 3        469     807     4068    3539    3681
  information 4        5388    342     337     1113    1544
  information cc       80      102     0       268     142
  information closed   2063    3       148     1030    786
  information neg      6       4       0       0       135
  information per      443     869     1225    1641    3125
  information quant    907     414     931     1650    4577
  information so       0       0       0       115     969
  Other                1583    374     1678    563     1029

2 Evaluation

For evaluation, the classic accuracy measure has been taken into consideration. It can be briefed simply as the proportion of times that the model's prediction is correct when applied to the data:

  Acc = (# terms correctly assigned to entity) / (total # terms)    (1)

Evaluation was computed in two stages.

Pre-Evaluation: Teams participating in the shared tasks were encouraged to test their modules in real time^8, and were free to make as many submissions as they preferred. The leader board was evaluated with approximately 20% of the data (Test-1 corpora). Test-1 corpus statistics are given in Tables 3 and 4.

Final-Evaluation: The final ranking is based on another 20% of the data (Test-2 corpora). Unlike the Pre-Evaluation, here participants were requested to submit their models, code or submission files to the task organizers. Test-2 corpus statistics are given in Tables 3 and 4. For each sub-task and language, submissions were evaluated by calculating accuracy against the corresponding gold labels. The accuracy scores across all five languages were averaged to determine the final ranking for both sub-tasks.

^8 https://iecsil.arnekt.com/#!/participate

Table 3. Task A Corpus Separation: NER

  Language   Train     pre-Eval   final-Eval
  hi         1548570   519115     517876
  kn         318356    107325     107010
  ml         903521    301860     302232
  te         840908    280533     279443
  ta         1626260   542225     544183

Table 4.
Task B Corpus Separation: Relation Extraction

  Language   Train   pre-Eval   final-Eval
  hi         56775   18925      18926
  kn         6637    2213       2213
  ml         28287   9429       9429
  te         37039   12347      12347
  ta         64833   21611      21612

3 Participants

A server similar to Kaggle/CodaLab was hosted^9 to check the developed systems in real time; participants submitted their test results on the pre-evaluation corpora there. Five days before the final deadline, the Test-2 corpora for final evaluation were released. Participants were allowed to make at most 3 submissions against the Test-2 corpora. The final ranking was then computed from each system's performance on the Test-2 corpora. The results are given in Tables 5, 6, 7 and 8.

^9 https://iecsil.arnekt.com/#!/participate

Table 5. Pre-Evaluation Task A

  Team           hi      kn      ml      ta      te      Average
  idrbt-team-a   97.82   97.04   97.46   97.41   97.54   97.45
  CUSAT TEAM     97.67   97.03   97.44   97.36   97.72   97.44
  rohitkodali    98.07   96.86   97.26   96.98   97.54   97.34
  khushleen      96.84   96.38   96.64   96.15   96.63   96.53
  thenmozhi      96.73   95.63   95.87   95.55   96.77   96.11
  hariharan-v    96.49   95.06   95.9    96.03   95.97   95.89
  hilt           94.44   92.94   92.92   92.48   92.42   93.04
  am905771       94.4    90.09   89.97   91.23   90.2    91.18
  raiden11       91.52   92.14   90.27   87.72   90.02   90.33

Table 6. Pre-Evaluation Task B

  Team           hi      kn      ml      ta      te      Average
  thenmozhi      93.25   51.20   81.89   85.91   84.29   79.30
  idrbt-team-a   80.98   57.98   59.43   78.43   76.35   70.63
  raiden11       51.70   44.42   48.61   59.71   40.57   49.00
  hilt           51.70   44.42   48.61   59.71   21.97   45.28
  CUSAT TEAM     51.70   0       78.45   0       0       26.03
  am905771       63.74   0       0       0       0       12.75

The CUSAT TEAM made use of deep learning to extract the relations between entities. They used a Convolutional Neural Network (CNN) modelled for sentence-level processing of Malayalam. Due to the absence, for Hindi, Kannada, Tamil and Telugu, of pre-trained word embeddings that fit into their machine's memory, they restricted their relation extraction model to Malayalam, for which they had their own corpus to train word vectors. The same team used a statistical model for finding the entities in a given sentence: a CRF-based sequence labelling model with features specific to Indian languages was used to tag words with the provided entity classes [7], [8].

SSN NLP used a Neural Machine Translation architecture to identify and classify named entities for all five Indian languages in focus. The deep neural network was built using multi-layer Recurrent Neural Networks (RNN) with Long Short-Term Memory (LSTM) cells. About four different models were developed for each of the languages; a bi-directional LSTM with attention, eight layers deep, was found to work well for all languages other than Malayalam [9]. For Relation Extraction, SSN NLP reused the deep learning approach they applied to Named Entity Recognition: two models use a deep-learning Seq2Seq framework, while three others were developed using statistical machine learning approaches [10].

HiLT used a two-layer Convolutional Neural Network (CNN) for character-level (word-matrix) and word-level (sentence-matrix) encoding, along with a Bidirectional Long Short-Term Memory (Bi-LSTM) network as the tag decoder for Named Entity Recognition. This non-linear model was developed as a language-independent framework, with the aim of extending it to Indian languages other than the five in focus.
It is an added advantage that their model does not appear to be biased towards any particular language [11].

IIT(BHU) generated vector representations of words and their corresponding tags, which were fed to a Bidirectional Long Short-Term Memory (Bi-LSTM) network for identification and categorization of entities in the text. Word representations were produced for all words in the corpus, with the set of unique words represented using one-hot encoding. The Bi-LSTM layer learns the contextual relationship between words from past and future context. This team came up with a language-independent framework for Named Entity Recognition (NER) and demonstrated it on the five languages provided [12].

Khushleen made use of character-level information to obtain word representations for rare or out-of-vocabulary words in the given corpora. The team performed word embedding using fastText, without changing the parameters per language, so as to build a unified model. The embeddings are then fed to a two-layer Bidirectional Long Short-Term Memory (Bi-LSTM) for training and for predicting the entities of the words in a sentence [13].

Table 7. Final Evaluation Task A

  Team           Run   ml      kn      hi      ta      te
  hilt           2     92.1    93.17   94.35   91.79   92.47
  raiden11       1     89.6    92.33   91.19   87.26   89.19
  SSN NLP        3     95.05   94.21   95.95   94.66   95.4
  hilt           2     92.12   93.17   94.28   91.79   92.47
  am905771       2     88.89   89.85   94.47   90.4    90.04
  idrbt-team-a   1     96.58   96.79   97.82   96.18   97.68
  SSN NLP        2     95.28   95.76   96.51   94.9    96.81
  khushleen      1     96.18   96.45   96.85   95.83   96.78
  CUSAT TEAM     1     96.86   97.09   97.65   96.85   97.69
  hariharanv     1     95.63   95.79   96.67   NA      96.39
  rohitkodali    1     NA      96.85   98.06   NA      97.53
  am905771       3     89.13   89.88   94.92   90.47   90.32
  SSN NLP        1     95.28   95.8    96.68   94.91   96.81
  am905771       1     89.04   89.53   94.45   90.46   90.04

The Raiden11 team, like Khushleen, captured semantic relations amongst words using fastText word embeddings.
As a next step, they experimented with linear models such as Naive Bayes and Support Vector Machines. Beyond this, they were able to show that a simple Artificial Neural Network (ANN) worked better than these linear classifiers, as it could capture the composite relations between words [13].

idrbt-team-a used a two-stage LSTM-based network combining character-based embeddings, word2vec embeddings and sequence-based Bi-LSTM embeddings, so as to carry all the requisite features for the NER prediction problem [14]. For Relation Extraction, the team idrbt-team-a used features such as POS tags and NER tags, along with the words of the input sentence, to classify the input into one of the predefined relationship classes. After initial experiments with other statistical classifiers, Logistic Regression was chosen as the classifier [15].

Using fastText word embeddings as the representation, team raiden11 experimented with linear models (Naive Bayes and SVM) as well as a simple neural network to develop their NER system. The best results, across all languages combined, were achieved by the neural network [16].

Table 8. Final Evaluation Task B

  Team           Run   ml      kn      hi      te      ta
  CUSAT TEAM     1     77.77   NA      NA      NA      NA
  hilt           2     48.05   44.01   51.5    22.87   60.11
  idrbt-team-a   1     57.86   57.34   79.21   76.14   78.44
  raiden11       1     48.05   44.01   51.5    40.49   60.11
  SSN NLP        1     81.99   51.87   92.99   84.11   86.26
  hilt           1     48.05   44.01   51.5    22.87   60.11
  SSN NLP        3     51.8    49.43   69.04   68.17   67.12
  SSN NLP        2     75.25   45.14   91.71   85.78   82.19

Participants mostly used deep-learning-based algorithms for both the Relation Extraction and Named Entity Recognition tasks; CNN, Bi-LSTM and CNN combined with Bi-LSTM were the commonly used architectures. Participants achieved accuracies of around 90 ± 5% on the NER task. Even though this accuracy is high, it has to be noted that the accuracy obtained by assigning every token the class "other" is already around 80 ± 5%.
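The effect of this class imbalance can be illustrated with a small computation. The numbers below are toy values chosen only to mirror the roughly 80% share of the "other" class; `gold`, `pred` and `f1_for` are illustrative names, not the actual IECSIL evaluation code or label distribution.

```python
# Why raw accuracy is misleading under heavy class imbalance:
# a trivial predictor that labels every token "other" already scores
# high accuracy but zero F1 on the entity classes.
# Toy data only, not the IECSIL corpora.

gold = ["other"] * 80 + ["location"] * 12 + ["name"] * 8
pred = ["other"] * 100  # the all-"other" baseline predictor

accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)

def f1_for(label):
    """Per-class F1 of `pred` against `gold` for one label."""
    tp = sum(g == label and p == label for g, p in zip(gold, pred))
    fp = sum(g != label and p == label for g, p in zip(gold, pred))
    fn = sum(g == label and p != label for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

print(f"accuracy: {accuracy:.2f}")  # 0.80 despite predicting no entity at all
for label in ("other", "location", "name"):
    print(f"F1({label}): {f1_for(label):.2f}")
```

Macro-averaging these per-class F1 scores (here (0.89 + 0 + 0)/3 ≈ 0.30) exposes the weakness that the 0.80 accuracy hides.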
This bias can be observed by measuring per-class performance through the F1 score. Unlike NER, the participating systems could not attain comparably strong results for Relation Extraction. These points show the need for further research on NER and Relation Extraction systems for Indian languages. The detailed results, including precision, recall and F1 score per target class and language, are made publicly available^10.

^10 https://github.com/BarathiGanesh-HB/ARNEKT-IECSIL/blob/master/IECSIL-2018-Final-Evaluation-Results.xlsx

4 Conclusion

Arnekt, in collaboration with FIRE, has come up with its first track on Information Extraction for Conversational Systems in Indian Languages (IECSIL), which covered five Indian languages (Hindi, Kannada, Malayalam, Tamil and Telugu) for identifying entities (Task A: Named Entity Recognition) and extracting relations among them (Task B: Relation Extraction). IECSIL developed its own corpora for both tasks. While this corpus is not restricted to a single application, it has been made available online^11 to the research community through the Information Extractor for Conversational Systems for Indian Languages (IECSIL) site^12. The participating teams came up with feasible solutions, and most of them used deep learning methods to build their models. With the increasing need for Indian language support, we are likely to extend the number of Indian languages covered in the near future.

5 Acknowledgements

Arnekt thanks all the participants for showing their interest in IECSIL. We would also like to show our gratitude to the FIRE 2018 organizers for their endless efforts and support.

References

1. Bhatt B, Bhattacharyya P. Domain specific ontology extractor for Indian languages. In Proceedings of the 10th Workshop on Asian Language Resources, 2012 (pp. 75-84).
2. Nayan A, Rao BR, Singh P, Sanyal S, Sanyal R. Named entity recognition for Indian languages.
In Proceedings of the IJCNLP-08 Workshop on Named Entity Recognition for South and South East Asian Languages, 2008.
3. Zamora J. Rise of the chatbots: Finding a place for artificial intelligence in India and US. In Proceedings of the 22nd International Conference on Intelligent User Interfaces Companion, 2017 (pp. 109-112). ACM.
4. Tamil Shallow Parser, International Institute of Information Technology, Hyderabad. https://ltrc.iiit.ac.in/showfile.php?filename=downloads/shallow parser.php
5. Reddy S, Sharoff S. Cross language POS taggers (and other tools) for Indian languages: An experiment with Kannada using Telugu resources. In Proceedings of the Fifth International Workshop on Cross Lingual Information Access, 2011 (pp. 11-19).
6. Devadath VV, Sharma DM. Significance of an accurate sandhi-splitter in shallow parsing of Dravidian languages. In Proceedings of the ACL 2016 Student Research Workshop, 2016 (pp. 37-42).
7. Ajees A P and Sumam Mary Idicula, CUSAT TEAM@IECSIL-FIRE-2018: A Named Entity Recognition System for Indian Languages, FIRE Working Notes, 2018.
8. Ajees A P and Sumam Mary Idicula, CUSAT TEAM@IECSIL-FIRE-2018: A Relation Extraction System for Indian Languages, FIRE Working Notes, 2018.
9. D. Thenmozhi, B. Senthil Kumar, and Chandrabose Aravindan, SSN NLP@IECSIL-FIRE-2018: Deep Learning Approach to Named Entity Recognition for Conversational Systems in Indian Languages, FIRE Working Notes, 2018.
10. D. Thenmozhi, B. Senthil Kumar, and Chandrabose Aravindan, SSN NLP@IECSIL-FIRE-2018: Deep Learning Approach to Relation Extraction for Conversational Systems in Indian Languages, FIRE Working Notes, 2018.
11. Sagar, Srinivas P Y K L, Rusheel Koushik Gollakota, and Amitava Das, HiLT@IECSIL-FIRE-2018, FIRE Working Notes, 2018.

^11 https://github.com/BarathiGanesh-HB/ARNEKT-IECSIL
^12 http://iecsil.arnekt.com

12.
Akanksha Mishra, Rajesh Kumar Mundotiya, and Sukomal Pal, IIT(BHU)@IECSIL-FIRE-2018: Language Independent Automatic Framework for Entity Extraction in Indian Languages, FIRE Working Notes, 2018.
13. Khushleen Kaur, Khushleen@IECSIL-FIRE-2018: Indic Language Named Entity Recognition Using Bidirectional LSTMs with Subword Information, FIRE Working Notes, 2018.
14. S. Nagesh Bhattu, N. Satya Krishna, and D. V. L. N. Somayajulu, idrbt-team-a@IECSIL-FIRE-2018: Named Entity Recognition of Indian Languages Using Bi-LSTM, FIRE Working Notes, 2018.
15. N. Satya Krishna, S. Nagesh Bhattu, and D. V. L. N. Somayajulu, idrbt-team-a@IECSIL-FIRE-2018: Relation Categorisation for Social Media News Text, FIRE Working Notes, 2018.
16. Ayush Gupta, Meghna Ayyar, Ashutosh Kumar Singh, and Rajiv Ratn Shah, raiden11@IECSIL-FIRE-2018: Named Entity Recognition for Indian Languages, FIRE Working Notes, 2018.