A Framework towards Computational Narrative Analysis
                     on Blogs

               Kiran Kumar Bandeli, Muhammad Nihal Hussain, and Nitin Agarwal
            Collaboratorium for Social Media and Online Behavioral Studies (COSMOS)
                               University of Arkansas at Little Rock,
                                    Little Rock, Arkansas, USA
                           {kxbandeli, mnhussain, nxagarwal}@ualr.edu




                                                         Abstract
Social media is widely used to express views and share opinions with others. With the availability of inexpensive and ubiquitous mass communication tools like social media, creating narratives, false information, and propaganda is both convenient and effective. Social media users leverage these platforms to further their views by framing narratives and participating in online discourse. Almost all events, issues, and crises are discussed on social media. Blogs, unlike other social media platforms, are not regulated by any authority and place no restriction on post length, not only providing bloggers with space for richer content but also serving as a platform for agenda-setting and content framing that abets the development of narratives. This innate feature of blogs makes them a valuable platform for sociologists and political scientists to gain situational awareness by tracking different opinions, political views, and narratives as they are shaped. However, the deluge of posts in the blogosphere makes it impractical to identify narratives manually. In this paper, we propose a novel framework to computationally identify narratives by extracting actors and actions with NLP techniques, including POS tagging, chunking, and grammar rules. We also employ a scoring mechanism to rank the narratives in order of their dominance. We then evaluate the efficacy of the proposed model by validating it against human-annotated narratives; our framework achieved an accuracy of 66.8%. The proposed framework can help social scientists identify narratives computationally, reducing manual effort. Moreover, the results from this research could also be used to build effective counter-narratives to stem propaganda campaigns.




1    Introduction
Social media is widely used by people around the world to discuss events and campaigns. Almost every online event that generates discourse carries narratives intended to produce radical effects or to reinforce a particular perspective on social
media. Such narratives can change the perceptions of social media consumers, and weaponized narratives can have a huge impact on social media discourse, for example through fake-news dissemination and misinformation. To counter these radical effects, it is important to track narratives so that countermeasures can be taken quickly to limit the damage to society. Social scientists note that "narrative texts can be represented as directed graphs (networks) of semantic triples, i.e., subject-verb-object sets" [FR04].
   Blogs, with their journal- or diary-like design, provide ample space for bloggers to write and express their opinions on events around the world as they develop, which makes blogs conducive to capturing the chronological sequence of events and helps in identifying narratives. Since narratives are described by actors and their actions in the form of semantic triplets, we use the same notion and rely on noun phrases to detect actors and verb phrases to identify their actions in blog posts. In the proposed framework, we aim to computationally identify actors and actions with the help of NLP techniques: POS tagging and chunking to obtain noun and verb phrases using grammar rules that capture the appropriate phrases, followed by filtering the phrases/sentences with a scoring mechanism to generate narratives that describe actors and their actions. For a given blog post, our framework identifies the dominant narrative as well as less dominant ones in decreasing order of rank. To evaluate the efficacy of the proposed model, the computationally generated narratives are compared against human-annotated narratives for a representative sample; on this sample, our framework achieved 66.8% accuracy.
   Our proposed framework reduces social scientists' effort by identifying narratives computationally. This enables analysts to learn what resonates with the community and whether those interests and views change over time under the influence of exogenous factors or events. The results from this study could also be used to build counter-narrative measures against propaganda campaigns.

2   Literature Review
Narratives are a key feature of any strategic communication [CRF12]. They are used for persuasion because they are culturally grounded and frame the actions of actors. Several studies [CRF12], [BK16], [Rie05], [Rus16], [HBAkA18] have examined narratives from different angles, such as text, frames, semantic subject-verb-object triples, named-entity extraction, targeted sentiment, and linguistics. The authors of [CRF12] and [BK16] apply a manual approach to study narratives and analyze how specific narratives arise and diffuse. The authors of [CRF12] also discuss strategies to undermine or counter a narrative argument. However, this manual approach lacks scalability.
   A study by Alzahrani et al. [ACA+16] defines a story as actors taking actions that culminate in resolutions. The authors extract subject-verb-object relationships from paragraphs, generalize them into semantic conceptual representations, and present an analytical framework that applies co-clustering to detect story forms. Ceran et al. [CKCD15], [CKM+12] developed an algorithm that extracts information from semantic triplets by clustering them into generalized concepts, using syntactic criteria based on common contexts and semantic, corpus-based criteria built on "contextual synonyms"; they further propose semantic features based on suitable aggregation and generalization of triplets that can be extracted with a parser. Eisenberg and Finlayson [EF17] design a system that automatically decides whether or not a paragraph of text contains a story; in their definition, a paragraph contains a story if any portion of it expresses a significant part of a story, including the characters and events involved in major plot points. Sudhahar et al. [SFC11] identify narratives in news data with a quantitative approach that extracts semantic triplets using the Minipar parser. Hussain et al. [HBAkA18] focus on identifying shifts in narratives, rather than the narratives themselves, using targeted sentiment analysis on blogs. Additionally, several studies of blogs [AB18], [BA18], [AB17], [HBN+17], [MHN+17] examine various events, the role of blogs in coordinating disinformation campaigns, and the narratives spread during these campaigns.
   Although narratives have been studied across several domains, these studies are rarely computational and do not scale. Given the barrage of information on social media and the pace at which narratives spread, analyzing them requires computational methods. Therefore, in this paper we introduce a preliminary framework to identify narratives computationally.

3   Research Methodology
We describe our proposed methodology as shown in Figure 1, where documents (blog posts) are the starting point of the framework. First, we extracted named entities with the help of AlchemyAPI at the time of data collection (AlchemyAPI has since been deprecated). Next, the combination of blog posts and named entities is fed into the network topic




modeling module. The output of this module is a set of topics with their associated named entities. The number of topics is decided by tuning the LDA (Latent Dirichlet Allocation) model, using the model's log-likelihood and its beta hyperparameter. In LDA, the topic distribution over documents and the word distribution over topics each have a prior, typically denoted alpha and beta; because these are parameters of the prior distributions, they are commonly referred to as hyperparameters. For our dataset, the log-likelihood is calculated for models with 1, 5, 10, 25, 50, etc. topics. Ideally, the LDA model gets "better" at describing the data and the log-likelihood increases, eventually leveling off as further topics yield negligible improvement. A low beta value, on the other hand, places more weight on having each topic composed of only a few dominant words. The combination of these criteria guides the choice of the number of topics; in our case, the number of topics was 10.
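   A minimal sketch of this tuning step is given below, assuming gensim as the LDA implementation (the framework does not prescribe a specific library); the blog posts, the eta value, and the candidate topic counts other than those listed above are illustrative placeholders.

    # Sketch: choose the number of LDA topics by tracking the per-word likelihood bound.
    # Assumes gensim; blog_posts stands in for the actual corpus of post texts.
    from gensim import corpora
    from gensim.models import LdaModel

    blog_posts = ["german police admitted losing urban areas to migrant gangs",
                  "local police to hold a press conference on monday afternoon"]
    texts = [post.lower().split() for post in blog_posts]
    dictionary = corpora.Dictionary(texts)
    corpus = [dictionary.doc2bow(text) for text in texts]

    for k in [1, 5, 10, 25, 50]:                       # candidate topic counts
        lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=k,
                       eta=0.01, passes=10, random_state=1)  # low eta (beta) favors few dominant words per topic
        print(k, lda.log_perplexity(corpus))           # levels off once extra topics stop helping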
   Next, we extract sentences from the blog posts and apply NLP techniques, POS tagging and chunking, to extract noun phrases and verb phrases. We then define a grammar rule that captures patterns of noun and verb phrases to generate candidate narratives. Finally, we rank the narratives, with the dominant narrative at the top and less dominant ones in descending order; ranking/scoring is performed with the TF-IDF method, which gives the dominance order of the obtained candidates. We discuss the individual NLP components in the following subsections.




                             Figure 1: Framework to extract narratives from blogs.



3.1   Named Entity Extraction
Named entity extraction is widely used in natural language processing to locate named entity mentions in unstructured text and classify them into categories such as person, organization, and location. In this research, we used AlchemyAPI to extract named entities such as persons, organizations, and locations. For instance, given the sentence 'Adam lives in USA and goes to work at the University of Arkansas', the API tags 'Adam' as a person, 'USA' as a location, and 'the University of Arkansas' as an organization.
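   Because AlchemyAPI has been retired, a roughly equivalent extraction can be sketched with spaCy; this substitution is ours, and spaCy's entity labels differ slightly from AlchemyAPI's categories.

    # Sketch of the named-entity step using spaCy as a stand-in for the deprecated AlchemyAPI.
    import spacy

    nlp = spacy.load("en_core_web_sm")   # assumes the small English model is installed
    doc = nlp("Adam lives in USA and goes to work at the University of Arkansas")
    for ent in doc.ents:
        print(ent.text, ent.label_)      # e.g. Adam PERSON, USA GPE, the University of Arkansas ORG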

3.2   Network Topic Modeling
Topic modeling is the process of identifying topics in a corpus of documents. This is useful for search engines, customer service automation, and any other setting where knowing the topics of documents is important. There are multiple ways to achieve this; we use Latent Dirichlet Allocation (LDA) in our framework. In LDA, each document may be viewed as a mixture of topics and each topic as a mixture of words. Our framework uses a network-based topic model that internally applies LDA and generates a network of topics and the named entities associated with each topic. The documents in each topic are further filtered based on a topic-percentage threshold so that we capture only documents that clearly belong to the topic.
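   A hedged sketch of how topics can be linked to named entities and how documents can be filtered by a topic-percentage threshold is shown below; it reuses the lda, corpus, and dictionary objects from the earlier sketch, doc_entities is a hypothetical per-document entity list, and the 0.3 threshold is illustrative rather than the value used in our experiments.

    # Sketch: associate named entities with the LDA topics of the documents they appear in,
    # keeping only documents whose topic share exceeds a threshold.
    # Assumes lda/corpus from the topic-modeling sketch; doc_entities[i] is the
    # (hypothetical) list of entities extracted for document i.
    from collections import defaultdict

    TOPIC_THRESHOLD = 0.3
    topic_entities = defaultdict(set)

    for i, bow in enumerate(corpus):
        for topic_id, share in lda.get_document_topics(bow):
            if share >= TOPIC_THRESHOLD:               # document clearly belongs to this topic
                topic_entities[topic_id].update(doc_entities[i])

    for topic_id, entities in sorted(topic_entities.items()):
        print(topic_id, sorted(entities))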




3.3   Sentence Extraction
From NLTK (Natural Language Toolkit), we used the sent_tokenize method to extract sentences.

3.4   POS Tagging
POS (part-of-speech) tagging labels each word with its part of speech, identifying how the word is used in a sentence.
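   A minimal sketch of the sentence-extraction and tagging steps with NLTK follows; the example text is illustrative.

    # Sketch: split a post into sentences, then tag each word with its part of speech.
    import nltk
    from nltk.tokenize import sent_tokenize, word_tokenize

    nltk.download("punkt", quiet=True)
    nltk.download("averaged_perceptron_tagger", quiet=True)

    post = ("German police have admitted to losing several urban areas to migrant gangs. "
            "Local police will hold a press conference on Monday afternoon.")
    for sentence in sent_tokenize(post):
        print(nltk.pos_tag(word_tokenize(sentence)))   # e.g. [('German', 'JJ'), ('police', 'NNS'), ...]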

3.5   Chunking
Chunking is the process of extracting phrases from unstructured text. Instead of simple tokens, which may not represent the actual meaning of the text, it is desirable to treat a phrase such as "United States" as a single unit rather than as the separate words 'United' and 'States'. Like POS tags, there is a standard set of chunk tags such as Noun Phrase (NP) and Verb Phrase (VP). Chunking is important for extracting information such as locations and person names from text; when applied to such categories, the process is referred to as named entity extraction. Here we consider noun-phrase chunking and search for chunks corresponding to individual noun phrases. To create NP chunks, we define a chunk grammar over POS tags using a single regular-expression rule. The rule states that whenever the chunker finds an optional determiner (DT) followed by any number of adjectives (JJ) and then a noun (NN), a Noun Phrase (NP) chunk should be formed. The example in Figure 2 shows the tree structure of a sentence with NP chunks. Similarly, we can extract Verb Phrase (VP) chunks by defining a rule that finds a verb form (e.g., VBD) optionally followed by a preposition (IN).




                                      Figure 2: Chunking example in NLP
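   A minimal NLTK sketch of this NP rule is given below; the tagged sentence is illustrative and may differ from the one shown in Figure 2.

    # Sketch: NP chunking with a single regular-expression rule:
    # optional determiner (DT), any number of adjectives (JJ), then a noun (NN).
    import nltk

    grammar = "NP: {<DT>?<JJ>*<NN>}"
    chunker = nltk.RegexpParser(grammar)

    tagged = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
              ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]
    tree = chunker.parse(tagged)                        # nltk.Tree with NP subtrees
    for subtree in tree.subtrees(filter=lambda t: t.label() == "NP"):
        print(subtree)                                  # (NP the/DT little/JJ yellow/JJ dog/NN), (NP the/DT cat/NN)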


3.6   Grammar Rules
Grammar rules are regular expressions written to capture actors and their actions. Based on the POS tags, we extract noun phrases and verb phrases; by defining grammar rules over these, we extract chunks of sentences that serve as narrative candidates for a given text. For our dataset, we followed the grammar rule shown in Figure 3.




                          Figure 3: Grammar rule extracting narratives from blogpost
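   The exact rule we used is the one shown in Figure 3; the sketch below is only a plausible reconstruction of a cascaded rule that chains noun and verb phrases into actor-action candidates, and it may differ from the rule in the figure.

    # Illustrative cascaded grammar: NP and VP chunks are formed first, then combined
    # into a candidate actor-action-target chunk (NAR). The rule in Figure 3 may differ.
    import nltk

    nltk.download("punkt", quiet=True)
    nltk.download("averaged_perceptron_tagger", quiet=True)

    grammar = r"""
      NP: {<DT>?<JJ>*<NN.*>+}      # noun phrase: optional determiner, adjectives, nouns
      VP: {<VB.*><IN>?}            # verb phrase: a verb form, optionally followed by a preposition
      NAR: {<NP><VP><NP>}          # candidate narrative: actor - action - target
    """
    chunker = nltk.RegexpParser(grammar)
    tagged = nltk.pos_tag(nltk.word_tokenize("A priest described Munich as a warzone"))
    for subtree in chunker.parse(tagged).subtrees(filter=lambda t: t.label() == "NAR"):
        print(" ".join(word for word, tag in subtree.leaves()))   # "A priest described Munich"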


3.7   Scoring Sentences
While sentences can be scored using various algorithms, in this study we use the TF-IDF (Term Frequency-Inverse Document Frequency) scores of the words in a sentence to assign a weight to each candidate narrative. Once the narratives are generated, this scoring is used to rank them from most to least dominant. The TF-IDF weight is commonly used in information retrieval and text mining.
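   A hedged sketch of this scoring step using scikit-learn's TfidfVectorizer is shown below; the candidate list is illustrative, and aggregating word weights by taking the mean over a candidate's terms is one reasonable choice rather than a fixed part of the framework.

    # Sketch: weight each candidate narrative by the average TF-IDF score of its words,
    # then rank candidates from most to least dominant.
    from sklearn.feature_extraction.text import TfidfVectorizer

    candidates = [
        "german police have admitted to losing several urban areas to migrant gangs",
        "local police to hold a press conference on monday afternoon",
        "a priest described munich on new years eve as a warzone",
    ]
    tfidf = TfidfVectorizer().fit_transform(candidates)       # rows: candidates, columns: vocabulary
    scores = tfidf.sum(axis=1).A1 / tfidf.getnnz(axis=1)      # mean TF-IDF over each candidate's terms

    for score, narrative in sorted(zip(scores, candidates), reverse=True):
        print(round(float(score), 3), narrative)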




                                 Table 1: Narratives from the sample blog posts.
        Breitbart london was the first news site in the english speaking world to report on the horrific.
        German police have admitted to losing several urban areas to migrant gangs.
        The cologne city council to submit a travel warning for the carnival time.
        Breitbart london is still waiting on comment from cologne police.
        The deteriorating state of control the local government has over the city.
        An opposition council member has today sounded the alarm bell.
        A priest described munich on new years eve as a warzone.
        Local police to hold a press conference on monday afternoon.
        A political scandal is developing in germany as ordinary citizens wake up to the scale.
        Cologne to mass sex crime on new years.

3.8   Narrative Generation
By following the steps of our framework (Figure 1), we generate narratives computationally for any given blog post. Depending on the grammar rules defined, there may be some noise in the extracted sentences; to overcome this, we can add more rules and filter the narratives obtained. In this paper, our goal is to identify narratives computationally. We use the EU migrant crisis dataset collected using the methodology described in [HOB+17], [KBWA19], extending our previous work [HBAkA18]. We selected 30 blog posts that were published in January 2016 and relate to the events around the New Year's Eve incident in Cologne. From these 30 blog posts, we extracted more than 3,000 sentences and filtered them down to capture semantic triplets containing actors and their actions. Using the NLP techniques described above, we obtain candidate narratives; Table 1 shows a list of narratives identified computationally with our methodology.
   From the identified narratives, we notice that the blog posts discuss migrants and the reactions of the local police in Cologne. Further, the narratives show Munich being described as a warzone and discuss how a political scandal is developing in Germany due to the scale of the attacks. Overall, the narratives speak strongly against migrants and the consequences of allowing migrants into Europe.
   To validate our approach, we asked three human analysts to list at least 5 narratives from the 30 posts in descending order of dominance (i.e., the most dominant narrative listed first and less dominant ones after). We then provided each of the three analysts with the narratives identified by our framework to compare against the human-identified narratives. This effort resulted in a 70% match with analyst 1, a 64% match with analyst 2, and a 66.6% match with analyst 3. To compute the inter-annotator agreement, we used Krippendorff's alpha and obtained a value of 0.82, indicating almost perfect reliability. Krippendorff's alpha is a reliability coefficient that measures agreement among analysts when coding a set of units of analysis in terms of the values of a variable. In our research, the three analysts classifying narratives achieved 0.74 < alpha < 0.86, with an average alpha of 0.82.
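   A small sketch of this agreement computation is shown below, assuming the open-source krippendorff Python package (the specific tool is not prescribed by our framework); the 0/1 codes are illustrative match/no-match judgments, not our actual annotation data.

    # Sketch: Krippendorff's alpha over nominal codes; rows are analysts, columns are
    # narratives, and the 0/1 values are illustrative match/no-match judgments.
    import krippendorff

    reliability_data = [
        [1, 1, 0, 1, 1, 0, 1, 1, 1, 0],   # analyst 1
        [1, 0, 0, 1, 1, 0, 1, 1, 1, 0],   # analyst 2
        [1, 1, 0, 1, 0, 0, 1, 1, 1, 0],   # analyst 3
    ]
    alpha = krippendorff.alpha(reliability_data=reliability_data,
                               level_of_measurement="nominal")
    print(round(alpha, 2))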

                    Table 2: Average Precision, Recall and F measure of identified narratives.
                         Top K Narrative Avg. Precision Avg. Recall F Measure
                                1                0.70           0.14            0.23
                                2                0.69           0.28            0.39
                                3                0.70           0.42            0.53
                                4                0.69           0.55            0.61
                                5                0.67           0.67            0.67

   Lastly, we compute precision, recall, and F-measure (reported as accuracy) to validate our methodology and select the optimal top-K number of narratives, where K refers to the number of narratives considered. When K = 1, we consider only the top narrative from each blog post; similarly, K = 5 refers to the top 5 narratives. Overall, our framework generated more than five narratives per post. On the analysts' side, two analysts gave exactly 5 narratives per post and the third gave more than 5. For the calculations, we considered the top 5 narratives from the analysts as well as from our framework. Based on this information, we calculated precision, recall, and F-measure between our framework's output and each of the analysts. Here, precision is the fraction of retrieved narratives that are relevant (i.e., match a narrative from the analyst), recall is the fraction of the relevant narratives that are successfully retrieved, and F-measure is the harmonic mean of precision and recall.




Overall, our framework achieved a maximum accuracy of 67% as shown in Table 2 at K = 5.
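   A short sketch of this top-K evaluation is given below; the is_match predicate is a placeholder for the manual judgment of whether a framework narrative matches an analyst narrative.

    # Sketch: precision, recall and F-measure for the framework's top-K narratives
    # against one analyst's narratives. is_match is a hypothetical matching predicate.
    def precision_recall_f(framework_narratives, analyst_narratives, k, is_match):
        top_k = framework_narratives[:k]
        # precision: fraction of the top-K framework narratives that match some analyst narrative
        retrieved_relevant = sum(any(is_match(c, g) for g in analyst_narratives) for c in top_k)
        precision = retrieved_relevant / len(top_k)
        # recall: fraction of the analyst's narratives recovered by the top-K framework narratives
        recovered = sum(any(is_match(c, g) for c in top_k) for g in analyst_narratives)
        recall = recovered / len(analyst_narratives)
        f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        return precision, recall, f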

4   Challenges & Limitations
Although our framework identifies narratives computationally, it has several challenges and limitations. The grammar rule is written generically to cover most noun and verb phrase patterns; if a sentence is complex and exhibits different noun and verb phrase patterns, the grammar rule needs to be updated. For this reason, chunking is the crucial part of our framework, and trial and error with grammar rules is needed to obtain meaningful narrative candidates. For validation, we relied on three analysts; more analysts should be included. There is also an inherent subjective bias in this type of research that needs to be eliminated, so the proposed framework should be tested against accepted benchmark datasets. The candidate narratives generated may not be fully accurate, but they can assist an analyst in understanding the discourse; some noise generated by the system is unavoidable.

5   Conclusion and Future Work
With social media affecting almost every aspect of our lives, it is more important than ever to study social behavior and online campaigns. There is a critical need for models, tools, and frameworks that can assist in conducting these studies computationally. In this paper, we presented a framework to identify narratives on blogs and, using the EU migrant crisis dataset, demonstrated its efficacy and robustness in identifying narratives. The framework achieved an accuracy of 66.8%. Moreover, the scoring feature embedded within the framework helps identify dominant and non-dominant narratives. Although the framework's accuracy is not 100%, it provides the analyst with reasonable candidates that help gain insight into what resonates in the blogosphere.
   Furthermore, the framework's accuracy can be increased through several improvements, including better rules for chunking. Another future direction is to explore abstractive summarization to improve the overall framework. The framework can also be applied to other social media platforms or even news sources. A further research direction is to study the evolution of different narratives, i.e., whether a narrative becomes a fringe or a master narrative.

Acknowledgements
This research is funded in part by the U.S. National Science Foundation (OIA-1920920, IIS-1636933, ACI-
1429160, and IIS-1110868), U.S. Office of Naval Research (N00014-10-1-0091, N00014-14-1-0489, N00014-15-P-
1187, N00014-16-1-2016, N00014-16-1-2412, N00014-17-1-2605, N00014-17-1-2675, N00014-19-1-2336), U.S. Air
Force Research Lab, U.S. Army Research Office (W911NF-16-1-0189), U.S. Defense Advanced Research Projects
Agency (W31P4Q-17-C-0059), Arkansas Research Alliance, and the Jerry L. Maulden/Entergy Endowment at
the University of Arkansas at Little Rock. Any opinions, findings, and conclusions or recommendations expressed
in this material are those of the authors and do not necessarily reflect the views of the funding organizations.
The researchers gratefully acknowledge the support.

References
[AB17]      Nitin Agarwal and Kiran Kumar Bandeli. Blogs, Fake News, and Information Activities. In Digital
            Hydra: Security Implications of False Information Online, pages 31–45. NATO Strategic Commu-
            nications Center of Excellence (StratCom COE), 2017.
[AB18]      Nitin Agarwal and Kiran Kumar Bandeli. Examining strategic integration of social media platforms
            in disinformation campaign coordination. Stratcom Centre of Excellence (CoE), Riga, Latvia, 2018.
[ACA+16]    Sultan Alzahrani, Betul Ceran, Saud Alashri, Scott W. Ruston, Steven R. Corman, and Hasan
            Davulcu. Story forms detection in text through concept-based co-clustering. In 2016 IEEE Interna-
            tional Conferences on Big Data and Cloud Computing (BDCloud), Social Computing and Network-
            ing (SocialCom), Sustainable Computing and Communications (SustainCom)(BDCloud-SocialCom-
            SustainCom), pages 258–265. IEEE, 2016.
[BA18]      Kiran Kumar Bandeli and Nitin Agarwal. Analyzing the role of media orchestration in conducting
            disinformation campaigns on blogs. Computational and Mathematical Organization Theory, pages
            1–27, 2018.




[BK16]      Jan Babinský and Václav Krejčíř. Representation of Ukrainian Crisis in Czech Media: Explicit and
            Implicit Bias in the News Coverage of the Ukrainian-Russian Conflict. Mediterranean Journal of
            Social Sciences, 7(4):435, 2016.

[CKCD15]    Betul Ceran, Nitesh Kedia, Steven R. Corman, and Hasan Davulcu. Story detection using general-
            ized concepts and relations. In 2015 IEEE/ACM International Conference on Advances in Social
            Networks Analysis and Mining (ASONAM), pages 942–949. IEEE, 2015.
[CKM+12]    Betul Ceran, Ravi Karad, Ajay Mandvekar, Steven R. Corman, and Hasan Davulcu. A semantic
          triplet based story classifier. In Proceedings of the 2012 International Conference on Advances in
          Social Networks Analysis and Mining (ASONAM 2012), pages 573–580. IEEE Computer Society,
          2012.
[CRF12]     S. Corman, Scott W. Ruston, and Megan Fisk. A pragmatic framework for studying extremists’ use
            of cultural narrative. In 2nd International Conference on Cross-Cultural Decision Making: Focus,
            volume 2012, pages 21–25, 2012.
[EF17]      Joshua Eisenberg and Mark Finlayson. A simpler and more generalizable story detector using verb
            and character features. In Proceedings of the 2017 Conference on Empirical Methods in Natural
            Language Processing, pages 2708–2715, 2017.
[FR04]      Roberto Franzosi. From words to numbers: Narrative, data, and social science, volume 22. Cambridge
            University Press, 2004.
[HBAkA18] Muhammad Nihal Hussain, Kiran Kumar Bandeli, Samer Al-khateeb, and Nitin Agarwal. Analyzing
          Shift in Narratives Regarding Migrants in Europe via Blogosphere. In Text2Story@ ECIR, pages
          33–40, 2018.

[HBN+17]    Muhammad Nihal Hussain, Kiran Kumar Bandeli, Mohammad Nooman, Samer Al-khateeb, and Nitin Agarwal. Analyzing
            the voices during european migrant crisis in blogosphere. In The 2nd International Workshop on
            Event Analytics using Social Media Data, 2017.
[HOB+17]    Muhammad Nihal Hussain, Adewale Obadimu, Kiran Kumar Bandeli, Mohammad Nooman, Samer
            Al-khateeb, and Nitin Agarwal. A framework for blog data collection: challenges and opportunities.
            In The IARIA international symposium on designing, validating, and using datasets (DATASETS
            2017), 2017.
[KBWA19] Tuja Khaund, Kiran Kumar Bandeli, Oluwaseun Walter, and Nitin Agarwal. A Novel Methodology
         to Identify and Collect Data from Relevant Blogs Leveraging Multiple Social Media Platforms and
         Cyber Forensics. ALLDATA 2019, page 49, 2019.

[MHN+17]    Esther Ledelle Mead, Muhammad Nihal Hussain, Mohammad Nooman, Samer Al-khateeb, and
          Nitin Agarwal. Assessing situation awareness through blogosphere: a case study on venezuelan
          socio-political crisis and the migrant influx. In The Seventh International Conference on Social
          Media Technologies, Communication, and Informatics (SOTICS 2017), pages 22–29, 2017.
[Rie05]     C. K. Riessman. Narrative analysis: University of Huddersfield. 2005.

[Rus16]     Scott W. Ruston. More than Just a Story: Narrative Insights into Comprehension, Ideology, and
            Decision Making. In Modeling Sociocultural Influences on Decision Making, pages 57–72. CRC Press,
            2016.
[SFC11]     Saatviga Sudhahar, Roberto Franzosi, and Nello Cristianini. Automating quantitative narrative
            analysis of news data. In Proceedings of the Second Workshop on Applications of Pattern Analysis,
            pages 63–71, 2011.




                                                     69