A Visual Narrative of Ramayana using Extractive Summarization, Topic Modeling and Named Entity Recognition

Sree Ganesh Thottempudi
School of Technology, SRH University, Berlin, Germany
EMAIL: sganeshhcu@mail.com

ACI'21: Workshop on Advances in Computational Intelligence at ISIC 2021, February 25-27, 2021, Delhi, India
© 2020 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)

Abstract

The culture and heritage of India depicted in its ancient manuscripts are unique in their linguistic and traditional diversity. However, the paucity of skill and expertise in present-day research poses a threat to the exploration of this textual legacy. In this paper, we aim to create a visual narrative of one of the major epics of Indian literature, the 'Ramayana', by summarizing its major topics and linking them with characters and locations using Artificial Intelligence (AI). Using this research, any person interested in studying these manuscripts can visualize the tenor of the entire script without an intensive study. The 'Ramayana' was originally written in Sanskrit, but modern editions pair the Sanskrit text with an explanation in Hindi (as most people in India are well versed in Hindi). In this paper, we separate the Hindi and Sanskrit text and consider only the Hindi text for our further research. We use existing scientific models (trained on the Hindi language) to find events/topics, summaries, characters and locations, which are then used to produce a visual narrative of the data. To evaluate our results, we assessed how well our summaries and topics were understood: we provided a part of our input text, its summary and the topics/events created by our data pipeline to 30 people who are well versed in the Hindi language. The survey found that 70% of the respondents understood the summarized text, while 56% clearly understood the topics generated by our model.

Keywords

Ramayana, OCR, Hindi, Named Entity Recognition, Topic Modeling, Text Summarization, Visualization, Storytelling

1. Introduction

At present, there is a growing enthusiasm among historians and humanists for exploring the quintessence of ancient Indian manuscripts. As there exists physical evidence of the events occurring in most of these scriptures, we can gain information about the demography and cultural aspects of ancient India. Other important aspects such as architecture, medicine, engineering and beliefs can also be studied through these manuscripts. Hence, the underlying notion behind this research is to create a pipeline wherein any ancient manuscript can be provided as input, resulting in the formulation of events/topics and summaries from the text that provide an overview of that scripture even without any domain knowledge.

For our research we have used one of the two major epics of ancient Indian history, the 'Ramayana'. There exist around 300 versions of the 'Ramayana' throughout the world [1]. We have used a subset of the 'Valmiki Ramayana', downloaded in PDF format (https://archive.org/details/in.ernet.dli.2015.345471/page/n1/mode/2up). This version is an epic tale narrated by Rishi Valmiki (written in Sanskrit) describing the journey of Lord Rama, his wife Sita and his brother Lakshmana, and how Lord Rama triumphed over the evil forces of Ravana, the Demon King of Lanka.
Not only are the characters in this scripture considered gods in India; there also exists physical evidence of the events stated in it, traceable to present-day locations in India and Sri Lanka. Our downloaded data consists of the original Sanskrit shlokas, each followed by its interpretation in Hindi.

From this data we want to produce a visual narrative so that any individual can understand the gist of the entire text without having to read all of it. Our pseudo-pipeline can be reused to produce similar results for any Indian manuscript. The designed pipeline can be divided into five major components:

1. Input data (digitizing the data for further processing). The main aim of this process is to digitize the given input data, which is in PDF format. First, each page is fetched and converted into PNG format. These images are then fed into an OCR engine that converts them into machine-readable text. The OCR engine used in our study is Py-Tesseract (https://github.com/madmaze/pytesseract) with its Devanagari-script model; a code sketch of this step follows the list.

2. Basic pre-processing (preparing the data). After obtaining machine-readable text, we need to clean the data: the OCR'd text may contain misread passages, and if this erroneous text is processed further in the pipeline, the results can be hugely impacted. The first step is to remove the headers from the text document, as they add no value to the data. The data then contains both Sanskrit and Hindi text. We decided to use only the Hindi text for further processing, as the Hindi text contains the description of the Sanskrit sargas and should therefore give better results for our summarization and topic modeling models (it also offers a considerably larger amount of data than the Sanskrit text). Hence, the Sanskrit and Hindi texts were separated (using some keyword identifiers) and stored in two different files. The Hindi text was then divided into 26 parts based on a word limit, which helps in creating small summaries and makes it easier to find events in the divided subtexts. All these summaries and topics/events can then be used to find the quintessence of the entire data. From this step, the data was sent simultaneously into two different processes: one for text summarization and NER tagging, and the other for topic modeling.

3. Summarization and NER tagging (creating a concise summary and then finding the locations/persons involved in those summaries using NER tagging). This step is divided into two processes:

• Summarization (summing up the most important or relevant information from the entire text). Text summarization produces a concise and fluent summary of a text while preserving its key information content and overall meaning [2]. There are two types of summarization techniques: extractive and abstractive. Extractive summarization works by identifying important sections of the text and combining them to make a summary [2], while abstractive summarization entails paraphrasing and shortening parts of the source document, thus producing the important material in a new way [2]. For our research, we perform extractive summarization using the TextRank algorithm (https://datawarrior.wordpress.com/2015/05/20/birdview2rankingeverythingan-overviewoflinkanalysisusingpagerankalgorithm), modeled using a combination of the scikit-learn and networkx open-source Python libraries.

• NER tagging (tagging persons/locations/organizations, etc. in the summarized text). We use a named entity recognition (NER) algorithm to find and cluster named entities in text into desired categories such as person names (PER), organizations (ORG), locations (LOC) and time expressions [3]. For training the model, the open-source Python library Flair [4] has been used, together with the FIRE 2013 Hindi NER corpus [5] from the AU-KBC Research Centre, India.

4. Topic modeling (finding topics/events in the divided Hindi input text). Further preprocessing is required to perform topic modeling. These preprocessing steps and the topic modeling process are described below:

• Preprocessing for topic modeling (preparing the data). To perform topic modeling on our data, some preprocessing is required to remove irrelevant words that might affect the probabilistic LDA model, which works on the bag-of-words principle. Three procedures are performed in this step: lemmatization, stop-word removal and removal of other irrelevant words. Lemmatization is the process of grouping together the inflected forms of a word so that they can be treated as a single word; we performed it using the open-source Python library Stanford NLP [6]. Unused words such as prepositions were then removed in the stop-word removal step, for which a list of stop words was manually created. In the final cleaning step, garbage data such as misinterpreted English words, punctuation and numbers were removed.

• Topic modeling (finding topics/events). Topic modeling is the process of identifying topics/events in a given text input. The "topics" signify the hidden, to-be-estimated variable relations that link the words of the vocabulary and their occurrences in documents. Topic models discover the hidden themes throughout the collection and annotate the documents according to those themes [7]. To find these themes, the LDA algorithm is used, which estimates probabilistic word frequencies from the bag of words. For our research we use an LDA-based topic model for the Hindi text, built with the Python open-source library Gensim [8].

5. Visualization (creating a storyline from the available text summaries, topics/events and the identified locations and characters). All of the above components of the pipeline are implemented in Python, and the visualization of the results is performed using Tableau. The obtained NER tags were validated and filtered using validation datasets. To create a visual narrative of the results, two dashboards were built. In one dashboard, the locations (retrieved using the NER tags) are plotted on a map of the Indian subcontinent; when a mapped location is hovered over, the summary linked to that location is displayed, allowing anyone to form a mental image of the story described in that summary. The other dashboard displays the characters (retrieved using the NER tags), distributed according to their corresponding summaries; when a character is hovered over, the topics/events associated with that character are displayed. Using both dashboards, anyone can easily grasp the quintessence of the whole script without studying it intensively.
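As a rough illustration of component 1, the sketch below runs Py-Tesseract over pre-exported page images. It assumes the PDF pages have already been saved as PNG files in a pages/ directory and that Tesseract's Hindi ("hin") language data is installed; file names and paths are illustrative, not those of our pipeline.

```python
# Minimal sketch of the digitization step (component 1).
# Assumes the PDF pages were already exported as PNG files into pages/
# and that Tesseract's Hindi ("hin") language pack is installed.
from pathlib import Path

from PIL import Image
import pytesseract

pages = []
for png in sorted(Path("pages").glob("*.png")):
    img = Image.open(png)
    # lang="hin" selects Tesseract's Devanagari/Hindi model
    pages.append(pytesseract.image_to_string(img, lang="hin"))

Path("ramayana_ocr.txt").write_text("\n".join(pages), encoding="utf-8")
```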
2. Literature Review

In this section, we discuss the existing work from which the basic scientific models used in our study have been drawn.

Richman, Paula [1] collected the different versions of the Ramayana produced by many authors and performers and supported by many patrons. This book was consulted to understand the variations of the Ramayana.

Qi, Peng et al. [6] introduced Stanford NLP, an end-to-end neural pipeline for text processing. It takes raw input text and performs operations such as sentence segmentation, tokenization, lemmatization and POS tagging; most importantly, the authors describe a dependency parser. With this dependency parser we can analyze the grammatical structure of a sentence and establish the relationships between the head word of the sentence and the words associated with it. The Stanford NLP open-source library has been used in our pipeline for its tokenization and POS tagging functionality; its source code can be found on GitHub (https://github.com/stanfordnlp/stanfordnlp).
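For orientation, a minimal sketch of how this library is typically invoked for Hindi is given below; it follows the stanfordnlp package's documented API (the package has since been succeeded by Stanza), and the example sentence is only illustrative.

```python
# Sketch of a Hindi stanfordnlp pipeline for tokenization, POS tagging
# and lemmatization; the example sentence is illustrative.
import stanfordnlp

stanfordnlp.download("hi")  # one-time download of the pretrained Hindi models
nlp = stanfordnlp.Pipeline(lang="hi", processors="tokenize,pos,lemma")

doc = nlp("राम वन को गये।")
for sentence in doc.sentences:
    for word in sentence.words:
        print(word.text, word.lemma, word.upos)
```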
Pre-processing is the most important part of a text processing system, and within the preprocessing pipeline the removal of functional words (stop words) matters greatly for the performance of text processing. The bulk of the contribution of Jha, Vandana et al. [9] concerns how to remove stop words, based on a dictionary of stop words and pattern matching that removes the matched words from the text. Their corpus of Hindi stop words can be found on GitHub, where it has been expanded with further stop words (https://github.com/amjha/hindiExtraction).

In the journal paper by Allahyari, Mehdi et al. [2], text summarization and its techniques are explained in detail, which was very useful for gaining background knowledge of text summarization and its types. In the survey of text summarization for Indian and foreign languages by Dhawale, Apurva et al. [10], summarization is characterized as a time-saving interpretation that provides the user with a minimal result without altering the essence of the text; the paper charts the progression of text summarization research in multiple global and local languages.

Federico Barrios et al. [11] offer new choices for the similarity function of the TextRank algorithm, which accommodates automatic summarization of texts. The fundamental idea behind a graph-based ranking model is that of polling or recommendation: if one point links to another, it casts a vote for that point, and the greater the number of votes cast for a point, the higher the weight of that point. An implementation is available on the Gensim GitHub (https://github.com/RaRe-Technologies/gensim).

Athavale, Vinayak et al. [3], in their paper "Towards deep learning in Hindi NER: An approach to tackle the labelled data scarcity", describe an end-to-end neural model for named entity recognition (NER) based on a bidirectional RNN-LSTM. The authors claim state-of-the-art performance in both English and Hindi without the use of any morphological analysis or gazetteers of any sort. Sharma, Rajesh et al. [12] present a NER system for Hindi using a CRF approach.

Akbik, Alan et al. [4] propose to leverage the internal states of a trained character language model to produce a novel type of word embedding: words are first modeled as sequences of characters, without any specific knowledge of words, and are then contextualized by their surrounding text, meaning that the same word receives different embeddings depending on its contextual use. The authors report that these embeddings consistently outperform the previous state of the art across four classic sequence labeling tasks and exceed prior work on English and German named entity recognition. All the code and language models are available on GitHub (https://github.com/flairNLP/flair). The initial corpus used to feed our NER model was requested from the AU-KBC Research Centre, India [5]; it is in column format, with words followed by their POS tags and NER tags in separate columns.

The paper by Zhou Tong and Haiyi Zhang [7] explains the topic modeling process using Latent Dirichlet Allocation (LDA) for English textual data. Based on this model, our topic model was implemented using the open-source Python library Gensim.
3. Methodology

For the creation of the visual narrative, the Valmiki Ramayana dataset first had to be mined and prepared. The steps involved were identified as: creation of input files, text preprocessing, text summarization, named entity recognition (NER) tagging and topic modeling. The outputs of these steps were used as inputs for the visualization process. The complete workflow of the text processing and mining is shown in Figure 1.

Figure 1: Workflow of text preparation for the visual narrative

The first part of the workflow concerned creating the input files as machine-readable texts for further processing. The single PDF file of the Valmiki Ramayana was converted into 394 PNG images at 300 dpi resolution using the free image utility ImageMagick (https://github.com/imagemagick/imagemagick). For converting the images into editable text documents, optical character recognition (OCR) was used. Tesseract-OCR [13] is an OCR engine that can recognize characters of more than 100 languages and has a language model for Hindi. Py-Tesseract (https://github.com/madmaze/pytesseract), a wrapper for the Tesseract OCR engine, was used in our case, as it can read all common image types, including JPEG, PNG, GIF, BMP and TIFF. The images converted from the PDF file were given as input to Py-Tesseract with the Devanagari/Hindi language model, and the output was obtained as text documents with around 90% character accuracy.

The OCR'd text documents output by Py-Tesseract contained data that was not required for text mining, such as page headers (the book's name, e.g. श्रीवाल्मीकि रामायण, on odd-numbered pages and the chapter's name, e.g. सुन्दर काण्ड, on even-numbered pages) and page numbers. These unwanted texts were captured using regular expressions and cleaned out of the text corpus. The documents also consisted of multilingual text: Sanskrit and Hindi. It was decided to continue further processing using only the Hindi texts, considering factors such as the good number of text processing and mining resources available for Hindi, the relatively greater length of the Hindi text documents compared to the Sanskrit texts in our case, and the popularity of Hindi over Sanskrit. The Sanskrit and Hindi texts were separated in all the documents using the "मूल" (original shloka) and "टीका" (commentary) keywords, respectively.
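A simplified sketch of this cleaning and separation step is shown below. The header pattern and block-start detection are illustrative stand-ins for the regular expressions actually used; only the "मूल"/"टीका" block markers come from our pipeline.

```python
# Illustrative sketch of header/page-number removal and the Sanskrit/Hindi
# split; the header pattern is a simplified stand-in, only the "मूल"/"टीका"
# block markers come from our pipeline.
import re

HEADER = re.compile(r"(रामायण|काण्ड)\s*$")  # crude page-header pattern

def clean_and_split(raw_text):
    sanskrit, hindi, current = [], [], None
    for line in raw_text.splitlines():
        line = line.strip()
        if not line or line.isdigit() or HEADER.search(line):
            continue  # drop blank lines, page numbers and headers
        if line.startswith("मूल"):       # Sanskrit shloka block begins
            current = sanskrit
        elif line.startswith("टीका"):    # Hindi commentary block begins
            current = hindi
        elif current is not None:
            current.append(line)
    return "\n".join(sanskrit), "\n".join(hindi)
```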
There was then a fork in the further processing of the Hindi texts: one branch for text summarization and another for topic modeling.

A topic model is a probabilistic model for finding the abstract topics that appear in a collection of text documents; topic modeling is the most used text mining tool for discovering latent semantic structures in textual data. For the topic modeling branch of our workflow, additional text preprocessing was required. To extract topics, the text documents were tokenized into words. The most commonly appearing words in the documents were articles, prepositions, helping verbs and the like, known as stop words. These could skew the topic model, which is generally based on the frequency of words occurring in a document, so they had to be removed. As no out-of-the-box Hindi stop-word removal functionality was available, a list of Hindi stop words (https://github.com/amjha/hindiExtraction) was created and used to remove stop words from the tokenized word list.

Lemmatization is the text processing step of grouping together the different forms of a word so they can be analyzed as a single word; it also reduces the redundancy of the same root word in the extracted topics. Stanford NLP [14], an open-source library with pretrained Hindi models for lemmatization and part-of-speech tagging, was used to lemmatize the tokenized word list. After inspecting text samples, additional cleaning steps were performed, such as removing punctuation, English letters and numbers from the tokenized list.

Latent Dirichlet Allocation (LDA) is a topic modeling algorithm based on the bag of words (BOW) and per-document word counts. It is a fully generative model in which documents are assumed to have been generated according to a per-document topic distribution and a per-topic word distribution. The list of tokenized words was fed to the LDA model using Gensim [8], and topics were extracted for the whole Hindi text. To improve the output of the topic modeling, a few iterations were run with some steps of the preprocessing block redone: the stop-word list was extended to remove leftover words that were not significant enough to appear in the topics, and some manual garbage removal was performed, i.e., of words that had been wrongly interpreted by the OCR model. After this, the results fetched from the topic models became more relevant to the actual story; the preprocessing steps were then finalized and no further iterations were made to change the prepared data.
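A minimal sketch of this branch with Gensim's LDA is given below; the toy documents and the parameter choices (num_topics, passes) are illustrative, not the settings used for the full corpus.

```python
# Minimal sketch of the topic modeling branch with Gensim's LDA.
# The toy documents and parameters are illustrative only.
from gensim import corpora
from gensim.models import LdaModel

# `docs` stands in for the tokenized, stop-word-free, lemmatized
# Hindi documents produced by the preprocessing steps above.
docs = [["राम", "वन", "सीता"], ["हनुमान", "लंका", "समुद्र"]]

dictionary = corpora.Dictionary(docs)
bow_corpus = [dictionary.doc2bow(doc) for doc in docs]

lda = LdaModel(bow_corpus, num_topics=2, id2word=dictionary, passes=10)
for topic_id, words in lda.print_topics():
    print(topic_id, words)
```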
The other branch of the fork is text summarization, the process of compacting a large text. The reason for it is to create a comprehensible and expressive summary containing only the main points of the text. There are two main methods of summarizing text in NLP, extraction-based and abstraction-based summarization, and we use extraction-based summarization in our research. Summarization of the text is based on ranking its sentences using a variation of the TextRank algorithm. TextRank, an automatic summarization technique, is implemented in two different ways in our pipeline: via the Gensim Python open-source library [11], and via a combination of the scikit-learn and networkx libraries. The Gensim summarizer takes a string as input, whereas the other approach takes a list of sentences. Taking a list of sentences was the better option, as there is no clear separation within the whole text: the input text was divided into 26 documents of 25 sentences each, and dividing the text by character count risks breaking sentence meaning and grammar. Since Gensim requires a string as input and division by character length was not a good option, the other implementation of the TextRank algorithm (networkx and scikit-learn) was selected for the pipeline. Graph-based ranking algorithms are a way of deciding the importance of a vertex within a graph based on global information recursively drawn from the entire graph: a graph is built to represent the text, interconnecting sentences or other text entities with meaningful relations. Sentence extraction is preferable to keyword/token extraction here.
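A minimal sketch of this networkx/scikit-learn variant is shown below. TF-IDF cosine similarity is used as the sentence-similarity function, which is one common choice rather than necessarily the exact one in our pipeline, and top_n is illustrative.

```python
# Minimal TextRank sketch over a list of Hindi sentences; TF-IDF cosine
# similarity is one common choice of similarity function, used here for
# illustration.
import networkx as nx
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def summarize(sentences, top_n=3):
    # Nodes are sentences; edge weights are cosine similarities of
    # their TF-IDF vectors.
    tfidf = TfidfVectorizer().fit_transform(sentences)
    sim = cosine_similarity(tfidf)
    np.fill_diagonal(sim, 0.0)
    graph = nx.from_numpy_array(sim)

    # PageRank scores each sentence by the "votes" of similar sentences.
    scores = nx.pagerank(graph)
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    # Return the top-ranked sentences in their original order.
    return [sentences[i] for i in sorted(ranked[:top_n])]
```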
Named entity recognition (NER) plays an important role in completing the narrative model: using it, the persons/characters as well as the locations can be extracted from the story text [3]. The objective of using NER in this project is straightforward: to find and cluster the named entities in the text into the desired categories, such as person names (PER) and locations (LOC). Most of the present state-of-the-art NER models for the Hindi language are either very limited or not available in the public domain. For training the model, the Flair Python library [4] and Google Cloud Platform (GCP) resources were used. Flair's framework builds directly on PyTorch, one of the best deep learning frameworks, and offers the flexibility of using state-of-the-art embedding models, including an embedding model for the Hindi language. The NER corpus was requested from the AU-KBC Research Centre, India [5], and was later extended for the model training.
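Once trained, the tagger is applied to the summaries. The sketch below follows Flair's standard tagging API; the model path stands in for our trained Hindi model, and the sentence is illustrative.

```python
# Sketch of applying the trained Flair tagger to a summary sentence.
# The model path stands in for our Hindi model trained on the AU-KBC corpus.
from flair.data import Sentence
from flair.models import SequenceTagger

tagger = SequenceTagger.load("resources/taggers/hindi-ner/final-model.pt")

sentence = Sentence("राम और सीता अयोध्या गये")
tagger.predict(sentence)

for span in sentence.get_spans("ner"):
    print(span.text, span.tag)  # e.g. PER/LOC labels for characters and places
```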
The aim of our complete narrative model was to present a story of events. So, for the final step of the pipeline, we built two dashboards using Tableau. The outputs of both the summarization and the topic modeling branches were fed into Tableau, and the events were tagged with a topic, a summary and NER tags to help pick out characters and locations.

In the first dashboard, a satellite view of the map is used for plotting the locations identified in the scripture. For plotting the story on the map, a validation dataset of the names and places of the events was manually created. This data is matched against the words carrying the B-LOCATION and I-LOCATION NER tags; each matched location is then associated with its latitude and longitude values so that it can be plotted correctly on the map. The lines between two places on the map show the sequence of the mentioned locations in the narrated story; a line between two places is created using Tableau's spatial functions MakeLine and MakePoint. As seen in Fig. 2, the narration comprises the locations "मैथिली" (the birthplace of Sita), "चित्रकूट" (the forest where Ram and Sita stayed), "अतःपुर" (a village in Kishkindha where Ram met Hanuman and his friends), "महेन्द्र" (Mahendragiri, the mountain from which Hanuman leapt towards Sri Lanka in search of Sita), "महासागर" (the ocean between India and Sri Lanka), "पर्वत" (Trikoot Parvat, where Hanuman landed after his leap from Mahendragiri) and "अशोकवन" (the Ashok Vatika garden where Sita was held captive by Ravana). On hovering over each line, the origin city, its corresponding destination city and a summary of the text are shown in the tooltip.

Dealing with this kind of ancient geospatial data is tricky, and the Getty thesaurus plays a significant role here. Getty is an open geographic database in which every occurrence of a geographic name, including its ancient names, is tagged with a unique number; through this number we can resolve a name to its longitude and latitude. We therefore used Getty for our geodata visualization.

Figure 2: Locations in Ramayana

In the second dashboard, a symbol chart is used to represent the different characters of the Ramayana. The sequence of the story from 0 to 25 is plotted from left to right, and every occurrence of a character is plotted as per its reference in the text associated with that sequence position. The validation dataset of Ramayana character names is matched against the Hindi words carrying the B-PERSON and I-PERSON NER tags. The matched names do not identify synonyms of the same name: in the Ramayana, each person is associated with various names (e.g., "सीता" is also known as "जानकी" or "जनकपुत्री" in the same story), so plotting the raw data would produce three different points for the same symbol. To avoid this, such names are grouped under one name in Tableau. As the Ramayana is a story, and stories are narrated in sequence, Tableau's page control functionality is used to make the dashboard dynamic: the top topic and the summary of the text are shown below the symbol chart, and as the sequence advances in the page control, the characters in the symbol chart unfold along with their corresponding summary (Fig. 3). All the symbols and character names are shown as legends, and each symbol is manually designed according to that character's traits in the Ramayana epic. On hovering over a character, the topics related to the given summary are shown in the tooltip.

Figure 3: Topic modeling based on characters

The source code for our research can be found on GitHub (https://github.com/rajrohan/ramayanaocr).
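The alias grouping described above was done in Tableau; purely for illustration, the same normalization can be sketched in Python, with a made-up extract of the alias table:

```python
# Illustrative alias normalization before plotting characters; the alias
# map is a made-up extract, not our full validation dataset.
ALIASES = {
    "जानकी": "सीता",
    "जनकपुत्री": "सीता",
}

def canonical_name(name):
    """Map a known alias to its canonical character name."""
    return ALIASES.get(name, name)

tagged_persons = ["सीता", "जानकी", "राम", "जनकपुत्री"]
print([canonical_name(n) for n in tagged_persons])
# -> ['सीता', 'सीता', 'राम', 'सीता']
```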
4. Evaluation

For the evaluation of our model, we conducted a survey in which 30 persons participated. The respondents were chosen on the criterion that they be competent in understanding Hindi text. They were given the input text together with the summary and the topics created by our data pipeline, and both outputs, the text summarization and the topic modeling, were evaluated. There were four options to choose from, each with a 20% bucket size: 80-100%, 60-80%, 40-60%, or less than 40%. Selecting the first option, 80-100%, indicates that the individual understood the summarized text (and likewise for the second and third options), while choosing the fourth option, less than 40%, indicates that it was not understood. The results of the survey are shown in Fig. 4.

Figure 4: Survey results for text summarization

It can be inferred from Fig. 4 that 70% of the respondents were able to understand the summarized text, whereas 30% were not. From Fig. 5, around 56% of the respondents understood the topics clearly, while around 44% did not.

Figure 5: Survey results for topic modeling

5. Discussion

The major objective of the envisioned pipeline has been achieved, but the pipeline can be improved further. Owing to drawbacks in its processes (due to the unavailability of proficient models for Hindi textual data), we could not achieve optimal results. We were unable to find a better-quality input file that could have helped the OCR model identify the characters more reliably, and this reduced the quality of the NER tags obtained from our NER model. At first we could not build an NER model at all, as we were unable to procure Hindi NER-tagged data, and creating such data from scratch was not possible within the timeframe of the research. After failing to find a Hindi NER-tagged corpus online, we were helped by the AU-KBC Research Centre, India [5], which provided tagged Hindi NER data with which we were able to train our NER model. Using NER tags instead of POS tags has made the visual experience significantly better. NER performed on the generated topics yields very few results compared with NER performed on the summarized text; we therefore built the NER model using the summarized text as input. We also tried performing abstractive summarization on the script but were unable to achieve it, because the model could not be trained on our data to produce the desired output; hence only the extractive summarization method is used in our pipeline. During the initial phase of the project we intended to visualize the events chronologically, to make the narrative more informative. However, the chronology cannot be obtained, as the script is very old and no dates are present in the data. An alternative to missing dates is to introduce a time variable t and keep incrementing it after every shloka to obtain an artificial chronological order; however, as the script discusses different timelines within the same shlokas, this method cannot be used to recover the chronology of events.

6. Conclusions

For our input dataset, the Ramayana, we observe that our model reaches a score of more than 70 percent in explaining the summarized text and about 56 percent in explaining the topics/events generated from the script. It can also be concluded that NER performed on the summarized text generates better results than NER performed on the topics/events. Taking usability into account, we succeeded in building a pipeline for the visual narration of the Ramayana: using our visualization, someone with very little knowledge of the Ramayana can easily grasp the whole summary of the script. The demo is built on image-based input and can later be extended to other sources and languages; however, for its application to other Devanagari languages, the respective models must be available. We can also support the physical evidence of the locations mentioned in the script by plotting the coordinates of the present-day locations along with the events that took place there.

7. References

[1] P. Richman, Ed., Many Ramayanas: The Diversity of a Narrative Tradition in South Asia. University of California Press, 1991.
[2] M. Allahyari, S. Pouriyeh, M. Assefi, S. Safaei, E. Trippe, J. Gutierrez, and K. Kochut, "Text summarization techniques: A brief survey," International Journal of Advanced Computer Science and Applications (IJACSA), vol. 8, pp. 397-405, 07 2017.
[3] V. Athavale, S. Bharadwaj, M. Pamecha, A. Prabhu, and M. Shrivastava, "Towards deep learning in Hindi NER: An approach to tackle the labelled data scarcity," 2016.
[4] A. Akbik, T. Bergmann, and R. Vollgraf, "Pooled contextualized embeddings for named entity recognition," in NAACL 2019, Annual Conference of the North American Chapter of the Association for Computational Linguistics, 2019, pp. 724-728.
[5] C. M. Sobha Lalitha Devi, Pattabhi RK Rao, and R. V. S. Ram, "Indian language NER annotated FIRE 2013 corpus (FIRE 2013 NER corpus)," in Named Entity Recognition Indian Languages FIRE 2013 Evaluation Track, 2013.
[6] P. Qi, T. Dozat, Y. Zhang, and C. D. Manning, "Universal dependency parsing from scratch," in Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies. Brussels, Belgium: Association for Computational Linguistics, October 2018, pp. 160-170. [Online]. Available: https://nlp.stanford.edu/pubs/qi2018universal.pdf
[7] Z. Tong and H. Zhang, "A text mining research based on LDA topic modelling," in Proceedings of the Sixth International Conference on Computer Science, Engineering and Information Technology (CCSEIT), 2016, pp. 21-22.
[8] R. Rehurek and P. Sojka, "Software framework for topic modelling with large corpora," in Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks. Valletta, Malta: ELRA, May 2010, pp. 45-50.
[9] V. Jha, N. Manjunath, P. Shenoy, and V. K. R, "HSRA: Hindi stopword removal algorithm," 01 2016, pp. 1-5.
[10] A. D. Dhawale, S. B. Kulkarni, and V. Kumbhakarna, "Survey of progressive era of text summarization for Indian and foreign languages using natural language processing," in Innovative Data Communication Technologies and Application, J. S. Raj, A. Bashar, and S. R. J. Ramson, Eds. Cham: Springer International Publishing, 2020, pp. 654-662.
[11] F. Barrios, F. López, L. Argerich, and R. Wachenchauzer, "Variations of the similarity function of TextRank for automated summarization," 2016.
[12] R. Sharma and V. Goyal, "Name entity recognition systems for Hindi using CRF approach," in Information Systems for Indian Languages, C. Singh, G. Singh Lehal, J. Sengupta, D. V. Sharma, and V. Goyal, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2011, pp. 31-35.
[13] R. Smith, "An overview of the Tesseract OCR engine," in Ninth International Conference on Document Analysis and Recognition (ICDAR 2007), vol. 2, Sep. 2007, pp. 629-633.
[14] C. D. Manning, M. Surdeanu, J. Bauer, J. Finkel, S. J. Bethard, and D. McClosky, "The Stanford CoreNLP natural language processing toolkit," in Association for Computational Linguistics (ACL) System Demonstrations, 2014, pp. 55-60.