A Visual Narrative of Ramayana using Extractive Summarization, Topic Modeling and Named Entity Recognition

Sree Ganesh Thottempudi
School of Technology, SRH University, Berlin, Germany
EMAIL: sganeshhcu@mail.com

ACI'21: Workshop on Advances in Computational Intelligence at ISIC 2021, February 25-27, 2021, Delhi, India
© 2020 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)

Abstract

The culture and heritage of India depicted in its ancient manuscripts are unique in their linguistic and traditional diversity. However, the paucity of skill and expertise in present-day research poses a threat to the exploration of this textual legacy. In this paper, we aim to create a visual narrative of one of the major epics of Indian literature, the 'Ramayana', by summarizing its major topics and linking them with characters and locations using Artificial Intelligence (AI). Using this research, any person interested in studying these manuscripts can visualize the tenor of the entire script without an intensive study. The 'Ramayana' was originally written in Sanskrit, but modern editions pair the Sanskrit text with an explanation in Hindi (as most people in India are well versed in Hindi). In this paper, we separate the Hindi and Sanskrit text and consider only the Hindi text for our further research. We use existing scientific models (trained on the Hindi language) to find events/topics, summaries, characters and locations, which are then used to produce a visual narrative of the data. To evaluate our results, we assessed how well our summaries and topics were understood: we provided a part of our input text, its summary and the topics/events created by our data pipeline to 30 people who are well versed in the Hindi language. The survey found that 70% of the respondents understood the summarized text, while 56% clearly understood the topics generated by our model.

Keywords

Ramayana, OCR, Hindi, Named Entity Recognition, Topic Modeling, Text Summarization, Visualization, Storytelling

1. Introduction

At present, there is a growing enthusiasm among historians and humanists for exploring the quintessence of ancient Indian manuscripts. As there exists physical evidence of the events occurring in most of these scriptures, we can gain information about the demography and cultural aspects of ancient India. Other important aspects such as architecture, medicine, engineering and beliefs can also be studied through these manuscripts. Hence, the underlying notion behind this research is to create a pipeline wherein any ancient manuscript can be provided as input, resulting in the formulation of events/topics and summaries from the text that provide an overview of that scripture even without any domain knowledge.

For our research we have used one of the two major epics of ancient Indian history, the 'Ramayana'. There exist around 300 versions of the 'Ramayana' throughout the world [1]. We have used a subset of the 'Valmiki Ramayana', downloaded in PDF format (https://archive.org/details/in.ernet.dli.2015.345471/page/n1/mode/2up). This version is an epic tale narrated by Rishi Valmiki (written in Sanskrit) describing the journey of Lord Rama, his wife Sita and his brother Lakshmana, and how Lord Rama triumphed over the evil forces of Ravana, the Demon King of Lanka.
Not only are the characters in this scripture considered gods in India; there also exists physical evidence of the events stated in it, traceable to present-day locations in India and Sri Lanka. Our downloaded data consists of the original Sanskrit shlokas, each followed by its interpretation in Hindi.

From this data we want to produce a visual narrative so that any individual can understand the gist of the entire text without having to read all of it. Our pseudo-pipeline can be reused to produce similar results for any Indian manuscript. The designed pipeline can be divided into five major components:

1. Input data (digitizing the data for further processing). The main aim of this process is to digitize the given input data, which is in PDF format. First, each page is fetched and converted into PNG format. These images are then fed into an OCR engine that converts them into machine-readable text. The OCR engine used in our study is Py-Tesseract (https://github.com/madmaze/pytesseract) with its Devanagari-script model; a code sketch of this step follows the list.

2. Basic pre-processing (preparing the data). After obtaining machine-readable text, we need to clean the data: the OCR'd text may contain misread passages, and if this erroneous text is processed further in the pipeline, the results can be hugely impacted. The first step is to remove the headers from the text document, as they add no value to the data. The data then contains both Sanskrit and Hindi text. We decided to use only the Hindi text for further processing, as the Hindi text contains the description of the Sanskrit sargas and should therefore give better results for our summarization and topic modeling models (it also offers a considerably larger amount of data than the Sanskrit text). Hence, the Sanskrit and Hindi texts were separated (using some keyword identifiers) and stored in two different files. The Hindi text was then divided into 26 parts based on a word limit, which helps in creating small summaries and makes it easier to find events in the divided subtexts. All these summaries and topics/events can then be used to find the quintessence of the entire data. From this step, the data was sent simultaneously into two different processes: one for text summarization and NER tagging, and the other for topic modeling.

3. Summarization and NER tagging (creating a concise summary and then finding the locations/persons involved in those summaries using NER tagging). This step is divided into two processes:

• Summarization (summing up the most important or relevant information from the entire text). Text summarization produces a concise and fluent summary of a text while preserving its key information content and overall meaning [2]. There are two types of summarization techniques: extractive and abstractive. Extractive summarization works by identifying important sections of the text and combining them to make a summary [2], while abstractive summarization entails paraphrasing and shortening parts of the source document, thus producing the important material in a new way [2]. For our research, we perform extractive summarization using the TextRank algorithm (https://datawarrior.wordpress.com/2015/05/20/birdview2rankingeverythingan-overviewoflinkanalysisusingpagerankalgorithm), modeled using a combination of the scikit-learn and networkx open-source Python libraries.

• NER tagging (tagging persons/locations/organizations, etc. in the summarized text). We use a named entity recognition (NER) algorithm to find and cluster named entities in text into desired categories such as person names (PER), organizations (ORG), locations (LOC) and time expressions [3]. For training the model, the open-source Python library Flair [4] has been used, together with the FIRE 2013 Hindi NER corpus [5] from the AU-KBC Research Centre, India.

4. Topic modeling (finding topics/events in the divided Hindi input text). Further preprocessing is required to perform topic modeling. These preprocessing steps and the topic modeling process are described below:

• Preprocessing for topic modeling (preparing the data). To perform topic modeling on our data, some preprocessing is required to remove irrelevant words that might affect the probabilistic LDA model, which works on the bag-of-words principle. Three procedures are performed in this step: lemmatization, stop-word removal and removal of other irrelevant words. Lemmatization is the process of grouping together the inflected forms of a word so that they can be treated as a single word; we performed it using the open-source Python library Stanford NLP [6]. Unused words such as prepositions were then removed in the stop-word removal step, for which a list of stop words was manually created. In the final cleaning step, garbage data such as misinterpreted English words, punctuation and numbers were removed.

• Topic modeling (finding topics/events). Topic modeling is the process of identifying topics/events in a given text input. The "topics" signify the hidden, to-be-estimated variable relations that link the words of the vocabulary and their occurrences in documents. Topic models discover the hidden themes throughout the collection and annotate the documents according to those themes [7]. To find these themes, the LDA algorithm is used, which estimates probabilistic word frequencies from the bag of words. For our research we use an LDA-based topic model for the Hindi text, built with the Python open-source library Gensim [8].

5. Visualization (creating a storyline from the available text summaries, topics/events and the identified locations and characters). All of the above components of the pipeline are implemented in Python, and the visualization of the results is performed using Tableau. The obtained NER tags were validated and filtered using validation datasets. To create a visual narrative of the results, two dashboards were built. In one dashboard, the locations (retrieved using the NER tags) are plotted on a map of the Indian subcontinent; when a mapped location is hovered over, the summary linked to that location is displayed, allowing anyone to form a mental image of the story described in that summary. The other dashboard displays the characters (retrieved using the NER tags), distributed according to their corresponding summaries; when a character is hovered over, the topics/events associated with that character are displayed. Using both dashboards, anyone can easily grasp the quintessence of the whole script without studying it intensively.
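As a rough illustration of component 1, the sketch below runs Py-Tesseract over pre-exported page images. It assumes the PDF pages have already been saved as PNG files in a pages/ directory and that Tesseract's Hindi ("hin") language data is installed; file names and paths are illustrative, not those of our pipeline.

```python
# Minimal sketch of the digitization step (component 1).
# Assumes the PDF pages were already exported as PNG files into pages/
# and that Tesseract's Hindi ("hin") language pack is installed.
from pathlib import Path

from PIL import Image
import pytesseract

pages = []
for png in sorted(Path("pages").glob("*.png")):
    img = Image.open(png)
    # lang="hin" selects Tesseract's Devanagari/Hindi model
    pages.append(pytesseract.image_to_string(img, lang="hin"))

Path("ramayana_ocr.txt").write_text("\n".join(pages), encoding="utf-8")
```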
2. Literature Review

In this section, we discuss the existing work from which the basic scientific models used in our study have been drawn.

Richman, Paula [1] collected the different versions of the Ramayana produced by many authors and performers and supported by many patrons. This book was consulted to understand the variations of the Ramayana.

Qi, Peng et al. [6] introduced Stanford NLP, an end-to-end neural pipeline for text processing. It takes raw input text and performs operations such as sentence segmentation, tokenization, lemmatization and POS tagging; most importantly, the authors describe a dependency parser. With this dependency parser we can analyze the grammatical structure of a sentence and establish the relationships between the head word of the sentence and the words associated with it. The Stanford NLP open-source library has been used in our pipeline for its tokenization and POS tagging functionality; its source code can be found on GitHub (https://github.com/stanfordnlp/stanfordnlp).
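For orientation, a minimal sketch of how this library is typically invoked for Hindi is given below; it follows the stanfordnlp package's documented API (the package has since been succeeded by Stanza), and the example sentence is only illustrative.

```python
# Sketch of a Hindi stanfordnlp pipeline for tokenization, POS tagging
# and lemmatization; the example sentence is illustrative.
import stanfordnlp

stanfordnlp.download("hi")  # one-time download of the pretrained Hindi models
nlp = stanfordnlp.Pipeline(lang="hi", processors="tokenize,pos,lemma")

doc = nlp("राम वन को गये।")
for sentence in doc.sentences:
    for word in sentence.words:
        print(word.text, word.lemma, word.upos)
```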
Pre-processing is the most important part of a text processing system, and within the preprocessing pipeline the removal of functional words (stop words) matters greatly for the performance of text processing. The bulk of the contribution of Jha, Vandana et al. [9] concerns how to remove stop words, based on a dictionary of stop words and pattern matching that removes the matched words from the text. Their corpus of Hindi stop words can be found on GitHub, where it has been expanded with further stop words (https://github.com/amjha/hindiExtraction).

In the journal paper by Allahyari, Mehdi et al. [2], text summarization and its techniques are explained in detail, which was very useful for gaining background knowledge of text summarization and its types. In the survey of text summarization for Indian and foreign languages by Dhawale, Apurva et al. [10], summarization is characterized as a time-saving interpretation that provides the user with a minimal result without altering the essence of the text; the paper charts the progression of text summarization research in multiple global and local languages.

Federico Barrios et al. [11] offer new choices for the similarity function of the TextRank algorithm, which accommodates automatic summarization of texts. The fundamental idea behind a graph-based ranking model is that of polling or recommendation: if one point links to another, it casts a vote for that point, and the greater the number of votes cast for a point, the higher the weight of that point. An implementation is available on the Gensim GitHub (https://github.com/RaRe-Technologies/gensim).

Athavale, Vinayak et al. [3], in their paper "Towards deep learning in Hindi NER: An approach to tackle the labelled data scarcity", describe an end-to-end neural model for named entity recognition (NER) based on a bidirectional RNN-LSTM. The authors claim state-of-the-art performance in both English and Hindi without the use of any morphological analysis or gazetteers of any sort. Sharma, Rajesh et al. [12] present a NER system for Hindi using a CRF approach.

Akbik, Alan et al. [4] propose to leverage the internal states of a trained character language model to produce a novel type of word embedding: words are first modeled as sequences of characters, without any specific knowledge of words, and are then contextualized by their surrounding text, meaning that the same word receives different embeddings depending on its contextual use. The authors report that these embeddings consistently outperform the previous state of the art across four classic sequence labeling tasks and exceed prior work on English and German named entity recognition. All the code and language models are available on GitHub (https://github.com/flairNLP/flair). The initial corpus used to feed our NER model was requested from the AU-KBC Research Centre, India [5]; it is in column format, with words followed by their POS tags and NER tags in separate columns.

The paper by Zhou Tong and Haiyi Zhang [7] explains the topic modeling process using Latent Dirichlet Allocation (LDA) for English textual data. Based on this model, our topic model was implemented using the open-source Python library Gensim.
3. Methodology

For the creation of the visual narrative, the Valmiki Ramayana dataset first had to be mined and prepared. The steps involved were identified as: creation of input files, text preprocessing, text summarization, named entity recognition (NER) tagging and topic modeling. The outputs of these steps were used as inputs for the visualization process. The complete workflow of the text processing and mining is shown in Figure 1.

Figure 1: Workflow of text preparation for the visual narrative

The first part of the workflow concerned creating the input files as machine-readable texts for further processing. The single PDF file of the Valmiki Ramayana was converted into 394 PNG images at 300 dpi resolution using the free image utility ImageMagick (https://github.com/imagemagick/imagemagick). For converting the images into editable text documents, optical character recognition (OCR) was used. Tesseract-OCR [13] is an OCR engine that can recognize characters of more than 100 languages and has a language model for Hindi. Py-Tesseract (https://github.com/madmaze/pytesseract), a wrapper for the Tesseract OCR engine, was used in our case, as it can read all common image types, including JPEG, PNG, GIF, BMP and TIFF. The images converted from the PDF file were given as input to Py-Tesseract with the Devanagari/Hindi language model, and the output was obtained as text documents with around 90% character accuracy.

The OCR'd text documents output by Py-Tesseract contained data that was not required for text mining, such as page headers (the book's name, e.g. श्रीवाल्मीकि रामायण, on odd-numbered pages and the chapter's name, e.g. सुन्दर काण्ड, on even-numbered pages) and page numbers. These unwanted texts were captured using regular expressions and cleaned out of the text corpus. The documents also consisted of multilingual text: Sanskrit and Hindi. It was decided to continue further processing using only the Hindi texts, considering factors such as the good number of text processing and mining resources available for Hindi, the relatively greater length of the Hindi text documents compared to the Sanskrit texts in our case, and the popularity of Hindi over Sanskrit. The Sanskrit and Hindi texts were separated in all the documents using the "मूल" (original shloka) and "टीका" (commentary) keywords, respectively.
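A simplified sketch of this cleaning and separation step is shown below. The header pattern and block-start detection are illustrative stand-ins for the regular expressions actually used; only the "मूल"/"टीका" block markers come from our pipeline.

```python
# Illustrative sketch of header/page-number removal and the Sanskrit/Hindi
# split; the header pattern is a simplified stand-in, only the "मूल"/"टीका"
# block markers come from our pipeline.
import re

HEADER = re.compile(r"(रामायण|काण्ड)\s*$")  # crude page-header pattern

def clean_and_split(raw_text):
    sanskrit, hindi, current = [], [], None
    for line in raw_text.splitlines():
        line = line.strip()
        if not line or line.isdigit() or HEADER.search(line):
            continue  # drop blank lines, page numbers and headers
        if line.startswith("मूल"):       # Sanskrit shloka block begins
            current = sanskrit
        elif line.startswith("टीका"):    # Hindi commentary block begins
            current = hindi
        elif current is not None:
            current.append(line)
    return "\n".join(sanskrit), "\n".join(hindi)
```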
There was then a fork in the further processing of the Hindi texts: one branch for text summarization and another for topic modeling.

A topic model is a probabilistic model for finding the abstract topics that appear in a collection of text documents; topic modeling is the most used text mining tool for discovering latent semantic structures in textual data. For the topic modeling branch of our workflow, additional text preprocessing was required. To extract topics, the text documents were tokenized into words. The most commonly appearing words in the documents were articles, prepositions, helping verbs and the like, known as stop words. These could skew the topic model, which is generally based on the frequency of words occurring in a document, so they had to be removed. As no out-of-the-box Hindi stop-word removal functionality was available, a list of Hindi stop words (https://github.com/amjha/hindiExtraction) was created and used to remove stop words from the tokenized word list.

Lemmatization is the text processing step of grouping together the different forms of a word so they can be analyzed as a single word; it also reduces the redundancy of the same root word in the extracted topics. Stanford NLP [14], an open-source library with pretrained Hindi models for lemmatization and part-of-speech tagging, was used to lemmatize the tokenized word list. After inspecting text samples, additional cleaning steps were performed, such as removing punctuation, English letters and numbers from the tokenized list.

Latent Dirichlet Allocation (LDA) is a topic modeling algorithm based on the bag of words (BOW) and per-document word counts. It is a fully generative model in which documents are assumed to have been generated according to a per-document topic distribution and a per-topic word distribution. The list of tokenized words was fed to the LDA model using Gensim [8], and topics were extracted for the whole Hindi text. To improve the output of the topic modeling, a few iterations were run with some steps of the preprocessing block redone: the stop-word list was extended to remove leftover words that were not significant enough to appear in the topics, and some manual garbage removal was performed, i.e., of words that had been wrongly interpreted by the OCR model. After this, the results fetched from the topic models became more relevant to the actual story; the preprocessing steps were then finalized and no further iterations were made to change the prepared data.
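A minimal sketch of this branch with Gensim's LDA is given below; the toy documents and the parameter choices (num_topics, passes) are illustrative, not the settings used for the full corpus.

```python
# Minimal sketch of the topic modeling branch with Gensim's LDA.
# The toy documents and parameters are illustrative only.
from gensim import corpora
from gensim.models import LdaModel

# `docs` stands in for the tokenized, stop-word-free, lemmatized
# Hindi documents produced by the preprocessing steps above.
docs = [["राम", "वन", "सीता"], ["हनुमान", "लंका", "समुद्र"]]

dictionary = corpora.Dictionary(docs)
bow_corpus = [dictionary.doc2bow(doc) for doc in docs]

lda = LdaModel(bow_corpus, num_topics=2, id2word=dictionary, passes=10)
for topic_id, words in lda.print_topics():
    print(topic_id, words)
```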
The other branch of the fork is text summarization, the process of compacting a large text. The reason for it is to create a comprehensible and expressive summary containing only the main points of the text. There are two main methods of summarizing text in NLP, extraction-based and abstraction-based summarization, and we use extraction-based summarization in our research. Summarization of the text is based on ranking its sentences using a variation of the TextRank algorithm. TextRank, an automatic summarization technique, is implemented in two different ways in our pipeline: via the Gensim Python open-source library [11], and via a combination of the scikit-learn and networkx libraries. The Gensim summarizer takes a string as input, whereas the other approach takes a list of sentences. Taking a list of sentences was the better option, as there is no clear separation within the whole text: the input text was divided into 26 documents of 25 sentences each, and dividing the text by character count risks breaking sentence meaning and grammar. Since Gensim requires a string as input and division by character length was not a good option, the other implementation of the TextRank algorithm (networkx and scikit-learn) was selected for the pipeline. Graph-based ranking algorithms are a way of deciding the importance of a vertex within a graph based on global information recursively drawn from the entire graph: a graph is built to represent the text, interconnecting sentences or other text entities with meaningful relations. Sentence extraction is preferable to keyword/token extraction here.
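A minimal sketch of this networkx/scikit-learn variant is shown below. TF-IDF cosine similarity is used as the sentence-similarity function, which is one common choice rather than necessarily the exact one in our pipeline, and top_n is illustrative.

```python
# Minimal TextRank sketch over a list of Hindi sentences; TF-IDF cosine
# similarity is one common choice of similarity function, used here for
# illustration.
import networkx as nx
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def summarize(sentences, top_n=3):
    # Nodes are sentences; edge weights are cosine similarities of
    # their TF-IDF vectors.
    tfidf = TfidfVectorizer().fit_transform(sentences)
    sim = cosine_similarity(tfidf)
    np.fill_diagonal(sim, 0.0)
    graph = nx.from_numpy_array(sim)

    # PageRank scores each sentence by the "votes" of similar sentences.
    scores = nx.pagerank(graph)
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    # Return the top-ranked sentences in their original order.
    return [sentences[i] for i in sorted(ranked[:top_n])]
```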
Named entity recognition (NER) plays an important role in completing the narrative model: using it, the persons/characters as well as the locations can be extracted from the story text [3]. The objective of using NER in this project is straightforward: to find and cluster the named entities in the text into the desired categories, such as person names (PER) and locations (LOC). Most of the present state-of-the-art NER models for the Hindi language are either very limited or not available in the public domain. For training the model, the Flair Python library [4] and Google Cloud Platform (GCP) resources were used. Flair's framework builds directly on PyTorch, one of the best deep learning frameworks, and offers the flexibility of using state-of-the-art embedding models, including an embedding model for the Hindi language. The NER corpus was requested from the AU-KBC Research Centre, India [5], and was later extended for the model training.
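Once trained, the tagger is applied to the summaries. The sketch below follows Flair's standard tagging API; the model path stands in for our trained Hindi model, and the sentence is illustrative.

```python
# Sketch of applying the trained Flair tagger to a summary sentence.
# The model path stands in for our Hindi model trained on the AU-KBC corpus.
from flair.data import Sentence
from flair.models import SequenceTagger

tagger = SequenceTagger.load("resources/taggers/hindi-ner/final-model.pt")

sentence = Sentence("राम और सीता अयोध्या गये")
tagger.predict(sentence)

for span in sentence.get_spans("ner"):
    print(span.text, span.tag)  # e.g. PER/LOC labels for characters and places
```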
The aim of our complete narrative model was to present a story of events. So, for the final step of the pipeline, we built two dashboards using Tableau. The outputs of both the summarization and the topic modeling branches were fed into Tableau, and the events were tagged with a topic, a summary and NER tags to help pick out characters and locations.

In the first dashboard, a satellite view of the map is used for plotting the locations identified in the scripture. For plotting the story on the map, a validation dataset of the names and places of the events was manually created. This data is matched against the words carrying the B-LOCATION and I-LOCATION NER tags; each matched location is then associated with its latitude and longitude values so that it can be plotted correctly on the map. The lines between two places on the map show the sequence of the mentioned locations in the narrated story; a line between two places is created using Tableau's spatial functions MakeLine and MakePoint. As seen in Fig. 2, the narration comprises the locations "मैथिली" (the birthplace of Sita), "चित्रकूट" (the forest where Ram and Sita stayed), "अतःपुर" (a village in Kishkindha where Ram met Hanuman and his friends), "महेन्द्र" (Mahendragiri, the mountain from which Hanuman leapt towards Sri Lanka in search of Sita), "महासागर" (the ocean between India and Sri Lanka), "पर्वत" (Trikoot Parvat, where Hanuman landed after his leap from Mahendragiri) and "अशोकवन" (the Ashok Vatika garden where Sita was held captive by Ravana). On hovering over each line, the origin city, its corresponding destination city and a summary of the text are shown in the tooltip.

Dealing with this kind of ancient geospatial data is tricky, and the Getty thesaurus plays a significant role here. Getty is an open geographic database in which every occurrence of a geographic name, including its ancient names, is tagged with a unique number; through this number we can resolve a name to its longitude and latitude. We therefore used Getty for our geodata visualization.

Figure 2: Locations in Ramayana

In the second dashboard, a symbol chart is used to represent the different characters of the Ramayana. The sequence of the story from 0 to 25 is plotted from left to right, and every occurrence of a character is plotted as per its reference in the text associated with that sequence position. The validation dataset of Ramayana character names is matched against the Hindi words carrying the B-PERSON and I-PERSON NER tags. The matched names do not identify synonyms of the same name: in the Ramayana, each person is associated with various names (e.g., "सीता" is also known as "जानकी" or "जनकपुत्री" in the same story), so plotting the raw data would produce three different points for the same symbol. To avoid this, such names are grouped under one name in Tableau. As the Ramayana is a story, and stories are narrated in sequence, Tableau's page control functionality is used to make the dashboard dynamic: the top topic and the summary of the text are shown below the symbol chart, and as the sequence advances in the page control, the characters in the symbol chart unfold along with their corresponding summary (Fig. 3). All the symbols and character names are shown as legends, and each symbol is manually designed according to that character's traits in the Ramayana epic. On hovering over a character, the topics related to the given summary are shown in the tooltip.

Figure 3: Topic modeling based on characters

The source code for our research can be found on GitHub (https://github.com/rajrohan/ramayanaocr).
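The alias grouping described above was done in Tableau; purely for illustration, the same normalization can be sketched in Python, with a made-up extract of the alias table:

```python
# Illustrative alias normalization before plotting characters; the alias
# map is a made-up extract, not our full validation dataset.
ALIASES = {
    "जानकी": "सीता",
    "जनकपुत्री": "सीता",
}

def canonical_name(name):
    """Map a known alias to its canonical character name."""
    return ALIASES.get(name, name)

tagged_persons = ["सीता", "जानकी", "राम", "जनकपुत्री"]
print([canonical_name(n) for n in tagged_persons])
# -> ['सीता', 'सीता', 'राम', 'सीता']
```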
4. Evaluation

For the evaluation of our model, we conducted a survey in which 30 persons participated. The respondents were chosen on the criterion that they be competent in understanding Hindi text. They were given the input text together with the summary and the topics created by our data pipeline, and both outputs, the text summarization and the topic modeling, were evaluated. There were four options to choose from, each with a 20% bucket size: 80-100%, 60-80%, 40-60%, or less than 40%. Selecting the first option, 80-100%, indicates that the individual understood the summarized text (and likewise for the second and third options), while choosing the fourth option, less than 40%, indicates that it was not understood. The results of the survey are shown in Fig. 4.

Figure 4: Survey results for text summarization

It can be inferred from Fig. 4 that 70% of the respondents were able to understand the summarized text, whereas 30% were not. From Fig. 5, around 56% of the respondents understood the topics clearly, while around 44% did not.

Figure 5: Survey results for topic modeling

5. Discussion

The major objective of the envisioned pipeline has been achieved, but the pipeline can be improved further. Owing to drawbacks in its processes (due to the unavailability of proficient models for Hindi textual data), we could not achieve optimal results. We were unable to find a better-quality input file that could have helped the OCR model identify the characters more reliably, and this reduced the quality of the NER tags obtained from our NER model. At first we could not build an NER model at all, as we were unable to procure Hindi NER-tagged data, and creating such data from scratch was not possible within the timeframe of the research. After failing to find a Hindi NER-tagged corpus online, we were helped by the AU-KBC Research Centre, India [5], which provided tagged Hindi NER data with which we were able to train our NER model. Using NER tags instead of POS tags has made the visual experience significantly better. NER performed on the generated topics yields very few results compared with NER performed on the summarized text; we therefore built the NER model using the summarized text as input. We also tried performing abstractive summarization on the script but were unable to achieve it, because the model could not be trained on our data to produce the desired output; hence only the extractive summarization method is used in our pipeline. During the initial phase of the project we intended to visualize the events chronologically, to make the narrative more informative. However, the chronology cannot be obtained, as the script is very old and no dates are present in the data. An alternative to missing dates is to introduce a time variable t and keep incrementing it after every shloka to obtain an artificial chronological order; however, as the script discusses different timelines within the same shlokas, this method cannot be used to recover the chronology of events.

6. Conclusions

For our input dataset, the Ramayana, we observe that our model reaches a score of more than 70 percent in explaining the summarized text and about 56 percent in explaining the topics/events generated from the script. It can also be concluded that NER performed on the summarized text generates better results than NER performed on the topics/events. Taking usability into account, we succeeded in building a pipeline for the visual narration of the Ramayana: using our visualization, someone with very little knowledge of the Ramayana can easily grasp the whole summary of the script. The demo is built on image-based input and can later be extended to other sources and languages; however, for its application to other Devanagari languages, the respective models must be available. We can also support the physical evidence of the locations mentioned in the script by plotting the coordinates of the present-day locations along with the events that took place there.

7. References

[1] P. Richman, Ed., Many Ramayanas: The Diversity of a Narrative Tradition in South Asia. University of California Press, 1991.
[2] M. Allahyari, S. Pouriyeh, M. Assefi, S. Safaei, E. Trippe, J. Gutierrez, and K. Kochut, "Text summarization techniques: A brief survey," International Journal of Advanced Computer Science and Applications (IJACSA), vol. 8, pp. 397-405, 07 2017.
[3] V. Athavale, S. Bharadwaj, M. Pamecha, A. Prabhu, and M. Shrivastava, "Towards deep learning in Hindi NER: An approach to tackle the labelled data scarcity," 2016.
[4] A. Akbik, T. Bergmann, and R. Vollgraf, "Pooled contextualized embeddings for named entity recognition," in NAACL 2019, Annual Conference of the North American Chapter of the Association for Computational Linguistics, 2019, pp. 724-728.
[5] C. M. Sobha Lalitha Devi, Pattabhi RK Rao, and R. V. S. Ram, "Indian language NER annotated FIRE 2013 corpus (FIRE 2013 NER corpus)," in Named Entity Recognition Indian Languages FIRE 2013 Evaluation Track, 2013.
[6] P. Qi, T. Dozat, Y. Zhang, and C. D. Manning, "Universal dependency parsing from scratch," in Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies. Brussels, Belgium: Association for Computational Linguistics, October 2018, pp. 160-170. [Online]. Available: https://nlp.stanford.edu/pubs/qi2018universal.pdf
[7] Z. Tong and H. Zhang, "A text mining research based on LDA topic modelling," in Proceedings of the Sixth International Conference on Computer Science, Engineering and Information Technology (CCSEIT), 2016, pp. 21-22.
[8] R. Rehurek and P. Sojka, "Software framework for topic modelling with large corpora," in Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks. Valletta, Malta: ELRA, May 2010, pp. 45-50.
[9] V. Jha, N. Manjunath, P. Shenoy, and V. K. R, "HSRA: Hindi stopword removal algorithm," 01 2016, pp. 1-5.
[10] A. D. Dhawale, S. B. Kulkarni, and V. Kumbhakarna, "Survey of progressive era of text summarization for Indian and foreign languages using natural language processing," in Innovative Data Communication Technologies and Application, J. S. Raj, A. Bashar, and S. R. J. Ramson, Eds. Cham: Springer International Publishing, 2020, pp. 654-662.
[11] F. Barrios, F. López, L. Argerich, and R. Wachenchauzer, "Variations of the similarity function of TextRank for automated summarization," 2016.
[12] R. Sharma and V. Goyal, "Name entity recognition systems for Hindi using CRF approach," in Information Systems for Indian Languages, C. Singh, G. Singh Lehal, J. Sengupta, D. V. Sharma, and V. Goyal, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2011, pp. 31-35.
[13] R. Smith, "An overview of the Tesseract OCR engine," in Ninth International Conference on Document Analysis and Recognition (ICDAR 2007), vol. 2, Sep. 2007, pp. 629-633.
[14] C. D. Manning, M. Surdeanu, J. Bauer, J. Finkel, S. J. Bethard, and D. McClosky, "The Stanford CoreNLP natural language processing toolkit," in Association for Computational Linguistics (ACL) System Demonstrations, 2014, pp. 55-60.