<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Measuring Narrative Fluency by Analyzing Dynamic Interaction Networks in Textual Narratives</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>O-Joun Lee</string-name>
          <email>ojlee112358@postech.ac.kr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jin-Taek Kim</string-name>
          <email>jintaek@postech.ac.kr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Future IT Innovation Laboratory, Pohang Univ. of Science and Technology</institution>
          ,
          <addr-line>Pohang-si</addr-line>
          ,
          <country>Republic of Korea</country>
          <addr-line>37673</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <abstract>
        <p>This study aims to assess the fluency of narratives in textual multimedia (e.g., news articles, academic publications, novels, etc.). We measure the narrative fluency based on whether relationships between entities in the narrative (i.e., the subjects and objects of the events that compose the narrative) are described consistently and with adequate rapidity. The relationships are represented by a dynamic interaction network (called an 'entity network'), which has entities as nodes and co-occurrences between the entities as edges. A lack of consistency leaves readers confused about what the textual narratives intend to present. If a narrative consistently concentrates on a topic or subject, its entity network will have a few entities with high node centrality. Using the consistency of these high-centrality entities, we assess the fluency with three criteria: (i) consistency in each paragraph, (ii) consistency in the overall narrative, and (iii) consistency between the title and body. The rapidity of narrative development has to be appropriate for the expected readers of the textual narratives. Too low a rapidity causes redundancy, and too high a rapidity hinders the understandability of the narratives. We assume that structural changes in the entity network reflect the narrative rapidity. The structural change is measured by embedding structures of the entity network. Finally, we evaluated the effectiveness of the proposed methods using editorials of the New York Times and human evaluators.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Recently, various studies have attempted to quantitatively measure what had been qualitatively assessed based on human intuition and experience, such as story similarity [LJJ18, LJ19b, LJ20], creativity [ES15], and trustworthiness [LNJ+17]. These studies have mainly been conducted in interdisciplinary areas between computer science and the humanities/social sciences. They attempt to make subjects of the humanities and social sciences computationally tractable.</p>
      <p>As one of these attempts, this study aims to quantify the fluency of narratives, which is one of the significant factors in evaluating writing [Huc15]. Narratives are the most fundamental media for exchanging information between human beings. According to Lakoff and Narayanan [LN10], “Narratives structure our understanding of the world and of ourselves.” Therefore, assessing the quality of narratives is significant not only for multimedia content analysis and its applications but also for human-computer interaction. Taghipour and Ng [TN16] attempted to score essays automatically by using a convolutional recurrent neural network. However, their method cannot be used as a standard indicator of narrative quality, which should always assign the same score to a given narrative.</p>
      <p>Various studies [BGL+19, LB19, LJ19a, WCW09] have applied interaction networks between characters (character networks) to analyzing fictional narratives (stories). We extend this model, which has only been applied to fictional and artistic narratives, to cover general narratives, including news articles and academic publications. The character network rests on the assumption that interactions between characters compose fictional narratives: a narrative is a series of events, and an event consists of interactions between characters [McK97, McK16]. However, general narratives do not depict only relationships between personified entities (characters).</p>
      <p>For example, a history book can describe interactions between nations or other social organizations, and research articles depict relationships among abstract concepts. To interpret accurately what the narratives attempt to describe, we have to analyze the meanings of each interaction and relationship. However, even when we do not know the meanings of the relationships, we can analyze how the full set of relationships between entities is gradually presented or explained. This is the same approach with which our previous studies [LJJ18, LJ19b] analyzed narrative development by only using the frequencies of interactions between characters. This approach also enables us to apply the proposed methods to various kinds of media without significant modification, whereas the existing studies [SLE15, SMS15] measured the fluency based on domain knowledge.</p>
      <p>First, we have to define the entities and their interactions. Similar to existing narrative models [CR17, MAW+18], we define entities as the subjects and objects of each interaction. In video or audio, finding interactions and the entities involved in them is difficult. Thus, as a preliminary study, we restrict our research subjects to textual narratives, e.g., news articles, academic publications, non-fiction books, novels, and essays. Each sentence in the text is used as a unit of interaction, and entities correspond to nouns (or noun phrases) that can be subjects or objects of the sentence. The entity and the interaction can be defined as follows:
Definition 1 (Entity and Interaction) Suppose that S is the set of sentences in a textual narrative, T. When s_i ∈ S is the i-th sentence in S, s_i also corresponds to the i-th interaction between entities. E, the set of entities in T, consists of the nouns and noun phrases that appear in sentences within S. If two entities (e_a and e_b) co-occur in s_i, we can assume that s_i describes a relationship between e_a and e_b, even though we do not know the meaning of s_i.</p>
      <p>The narrative is time-sequential. Thus, existing studies [Bos16, LJ19a] segmented narratives into logical and regular units, such as scenes. A scene is defined as a period that does not contain changes in spatiotemporal backgrounds [McK97]. Each scene describes a concluded event within a background. However, general narratives are far more diverse than fictional ones. Interactions in a general narrative can be segmented into events, but the events do not always have distinguishable backgrounds. Thus, we employ paragraphs as the unit of events, since paragraphs in well-written texts usually have topical coherence and completion. By using the paragraph as a time window, we define a dynamic interaction network between the entities that appear in a narrative as follows:
Definition 2 (Entity Network) Suppose that |E| is the number of entities that appear in a narrative, T. When N(T) indicates an entity network of T, N(T) can be defined as a matrix ∈ ℝ^{|E|×|E|}. Each component of N(T) represents the relationship between two entities. By defining N(·) on each paragraph, we can observe the development of the relationships. When P is the set of paragraphs in T, and p_l indicates the l-th paragraph, N(p_l) indicates the entity network of p_l. This can be formulated as:</p>
      <p>N(T) = Σ_{l=1}^{|P|} N(p_l) = [ f_{1,1} ⋯ f_{1,N} ; ⋮ ⋱ ⋮ ; f_{N,1} ⋯ f_{N,N} ],  (1)
where f_{i,j} indicates the frequency of interactions between e_i and e_j. We measure f_{i,j} by the number of sentences in which e_i and e_j co-occur.</p>
      <p>We measure the narrative fluency based on the entity network and the following two assumptions. First, the topical coherence of a paragraph will be exposed by the centrality of its keywords on the entity network. Thus, within a paragraph, there should be a few entities with significantly higher centrality than the other entities. If a narrative consistently focuses on a topic, the keywords will also be consistent over the overall narrative (RQ 1). Second, the relationships between entities have to be described at an appropriate rapidity to deliver them to users understandably. If we use too few interactions or events to depict content, there can be logical leaps. On the other hand, if we describe the content too slowly, there may be meaningless redundancy. Therefore, the narrative should develop at an adequate rapidity with regard to its purposes and expected readers. Narrative development is accompanied by new entities and new relationships between the entities. Thus, we assume that the rapidity of narrative development can be measured by structural changes in the entity networks (RQ 2).</p>
    </sec>
    <sec id="sec-2">
      <title>Measuring Narrative Fluency</title>
      <p>This section first briefly describes how we composed the entity networks. Then, we present the proposed methods for measuring narrative fluency with two criteria: (i) the narrative consistency and (ii) the rapidity of the narrative development.</p>
      <sec id="sec-2-1">
        <title>Composing Entity Networks</title>
        <p>We collected 20 recent editorials published in the New York Times (https://www.nytimes.com/section/opinion/editorials). Titles, headlines, and bodies of the editorials were collected and preprocessed by using the NLTK library of Python (http://www.nltk.org/). We conducted tokenization, stemming, and part-of-speech (POS) tagging on the collected texts. Then, we annotated occurrences and co-occurrences of only the nouns and pronouns tagged as ‘NN,’ ‘NNS,’ ‘NNP,’ or ‘NNPS’ by the POS tagger in the NLTK library. These nouns and pronouns are the entities. We composed entity networks based on occurrences and co-occurrences of the entities in each sentence and paragraph. To segment sentences, we used capitalization, punctuation marks, and a dictionary of frequently used abbreviations (e.g., Mr., Ms., etc.). Also, the entity network includes cyclic edges (e.g., f_{a,a} in Eq. 1) to represent the occurrence frequencies of entities.</p>
      </sec>
      <sec id="sec-2-2">
        <title>Measuring Narrative Consistency of Paragraphs</title>
        <p>We measure the narrative consistency from two viewpoints. First, each paragraph has to focus on one topic. Thus, there should be a few entities that have far higher centrality than the other entities. These entities will be keywords that represent the topic. Therefore, we compute three well-known node centrality measurements (i.e., degree, betweenness, and closeness centrality) for each entity on the entity networks. The centrality measurements are normalized into [0, 1] and aggregated by the arithmetic mean, which we use as the centrality of each entity. However, this approach cannot consider which kinds of centrality are more significant for the narrative fluency. In future studies, we will compare the significance of the centrality measurements by applying weighting factors to them. We assess the narrative consistency in each paragraph by using the entropy of the centrality of its entities. This can be formulated as:</p>
        <p>C_all(p_l) = 1/|E_l| × Σ_{∀e_a ∈ E_l} log C_l(e_a),  (2)
where E_l ⊂ E indicates the set of entities that appeared in p_l, and C_l(e_a) refers to the centrality of e_a on N(p_l). C_all(p_l) measures the consistency of a paragraph. For the entire textual narrative, we aggregate the consistency of the paragraphs as: C_all(T) = 1/|P| × Σ_{∀p_l ∈ P} C_all(p_l).</p>
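        <p>A minimal sketch of the per-paragraph consistency score, assuming networkx graphs as entity networks; the epsilon guard against log(0) for isolated entities is our addition, and function names are ours.</p>
```python
import math
import networkx as nx

def centrality(G):
    """Arithmetic mean of degree, betweenness, and closeness centrality
    per node; each networkx measure is already normalized into [0, 1]."""
    deg = nx.degree_centrality(G)
    bet = nx.betweenness_centrality(G)
    clo = nx.closeness_centrality(G)
    return {v: (deg[v] + bet[v] + clo[v]) / 3.0 for v in G}

def c_all(G, eps=1e-9):
    """C_all(p_l): mean over entities of log C_l(e_a) (Eq. 2)."""
    C = centrality(G)
    return sum(math.log(c + eps) for c in C.values()) / len(C)

# A star-shaped paragraph network: node 0 is a dominant keyword with
# maximal aggregated centrality 1.0.
G = nx.star_graph(4)
```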
        <p>Second, keywords of a narrative should have high centrality on overall paragraphs in the narrative. Similar
to the previous one, we assess whether keywords have consistently high centrality, based on the entropy. This
can be formulated as:</p>
        <p>C_key(T) = [ 1 + 1/(|P|·|K|) × Σ_{∀p_l ∈ P} Σ_{∀e_a ∈ K} log C_l(e_a) ]^{−1},  (3)
where K ⊂ E is the set of keywords of T. We compose K by clustering the entities into two clusters according to their centrality, using k-means clustering with two initial centroids: the maximum and the minimum centrality. Among the two clusters, we assume that the elements of the cluster with the higher centrality are keywords. Although using a threshold would be much simpler than the clustering, it could not deal with the diversity of textual narratives.</p>
        <p>Third, users infer the topics of textual narratives from their titles and from keywords annotated by their creators. Since news articles are our experimental subjects, entities in their titles and headlines have to match the keywords discovered by using the entity network. Thus, we measure their concurrence based on the Jaccard index. This can be formulated as:</p>
        <p>C_title(T) = |(E_t ∪ E_h) ∩ K| / |E_t ∪ E_h ∪ K|,  (4)
where E_t and E_h are the sets of entities that appeared in the title and headline of T, respectively. We aggregate the three proposed measurements (C_all, C_key, and C_title) by the arithmetic mean, after normalizing them into [0, 1].</p>
      </sec>
      <sec id="sec-2-3">
        <title>Measuring Rapidity of Narrative Development</title>
        <p>We measure the rapidity of the narrative development by using structural changes in the entity networks. To compare the structures of entity networks, we represent each entity network as a vector by using a graph embedding technique. We employ the Graph2Vec model [NCV+17], which embeds the structures of graphs rather than the characteristics of nodes or edges. Graph2Vec first extracts subgraphs from the entity networks by using the WL (Weisfeiler-Lehman) relabeling process [SSvL+11]. Then, the PV-DBOW model of Doc2Vec [LM14] is applied to learn representations of the entity networks based on their composition of subgraphs. We denote the vector representation of N(T) as φ(N(T)).</p>
        <p>We measure the rapidity by estimating how significant a change in the entity network is caused by each paragraph. Thus, we have to compare the entity networks before and after the paragraph. The rapidity of narrative development on p_l is measured by the Euclidean distance between φ(Σ_{i=1}^{l} N(p_i)) and φ(Σ_{i=1}^{l−1} N(p_i)). This can be formulated as:</p>
        <p>R(p_l) = ‖ φ(Σ_{i=1}^{l} N(p_i)) − φ(Σ_{i=1}^{l−1} N(p_i)) ‖_2.  (5)</p>
        <p>This study focuses on validating that we can measure the narrative fluency by analyzing the interaction network. We will tune the proposed measurements more elaborately in future studies (e.g., by comparing the effectiveness of various distance metrics for measuring the narrative rapidity).</p>
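        <p>The structural-change measurement can be illustrated with a simplified stand-in for Graph2Vec: one round of WL relabeling summarizes each node's neighborhood, and the cumulative networks before and after a paragraph are compared by the Euclidean distance between their label-count vectors. The real model learns PV-DBOW embeddings over such subgraph labels; this sketch, with names of our choosing, only illustrates R(p_l).</p>
```python
import math
from collections import Counter

def wl_features(adj):
    """One WL relabeling round over adj (dict node -> set of neighbours):
    each node's label encodes its degree plus its neighbours' degrees."""
    base = {v: str(len(nb)) for v, nb in adj.items()}
    return Counter(
        base[v] + "|" + ",".join(sorted(base[u] for u in adj[v]))
        for v in adj
    )

def rapidity(adj_before, adj_after):
    """R(p_l): Euclidean distance between the feature vectors of the
    cumulative entity networks before and after paragraph p_l."""
    a, b = wl_features(adj_before), wl_features(adj_after)
    return math.sqrt(sum((a[k] - b[k]) ** 2 for k in set(a) | set(b)))

# Adding a paragraph that extends an edge into a path changes the
# structure, so the rapidity is positive.
edge = {0: {1}, 1: {0}}
path = {0: {1}, 1: {0, 2}, 2: {1}}
```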
        <p>We assume that too slow or too fast changes in the entity networks hinder the readability of the narrative. Therefore, we assess whether the rapidity is appropriate and consistent. After normalizing the rapidity of the paragraphs into [0, 1], we aggregate the difference between the optimal rapidity and the rapidity on each paragraph. This can be formulated as:</p>
        <p>R(T) = 1/|P| × Σ_{∀p_l ∈ P} |R(p_l) − Θ_R|,  (6)
where Θ_R indicates the optimal rapidity. We attempt to find Θ_R through empirical experiments in the following section.</p>
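        <p>The aggregation of Eq. 6 can be sketched as follows, with min-max normalization standing in for the unspecified normalization scheme and the default Θ_R = 0.45 taken from the experiments reported below; names are ours.</p>
```python
def narrative_rapidity(rapidities, theta_r=0.45):
    """R(T): after min-max normalizing the paragraph rapidities R(p_l)
    into [0, 1], average the absolute deviation from the optimal Theta_R."""
    lo, hi = min(rapidities), max(rapidities)
    span = (hi - lo) or 1.0            # guard for constant rapidity
    normed = [(r - lo) / span for r in rapidities]
    return sum(abs(r - theta_r) for r in normed) / len(normed)
```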
        <p>Additionally, the proposed measurements for the narrative consistency can be manipulated by splitting paragraphs more finely than normal. However, we expect that too-short paragraphs make R(p_l) small. Thus, the narrative consistency and rapidity will have a trade-off relationship.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Evaluation</title>
      <p>To validate the research questions, we evaluated the accuracy of the measurements by estimating the fluency of the editorials in the New York Times. We compared the results of the proposed measurements with the responses of 37 human evaluators. The evaluator group consists of students and faculty members of Chung-Ang University and Pohang University of Science and Technology. We asked the evaluators to read three editorials that they chose from our corpus and to answer the following questionnaire.</p>
      <p>Q1. What are the keywords of this editorial?</p>
      <p>Q2. Is this editorial consistently describing its topic? Please answer in five degrees: very inconsistent, inconsistent, normal, consistent, and very consistent.</p>
      <p>Q3. If this editorial is inconsistent, please check the paragraphs causing the inconsistency.</p>
      <p>Q4. How rapid is the narrative development of this editorial? Please answer in five degrees: prolonged, slow, appropriate, fast, and very fast.</p>
      <p>Q5. If the rapidity of narrative development in this editorial is inadequate, please check the paragraphs that cause the inappropriateness. Also, please annotate whether those paragraphs are redundant or unexpected.</p>
      <p>Q6. Is this editorial fluent? Please answer in five degrees: very non-fluent, non-fluent, normal, fluent, and very fluent.</p>
      <p>Q7. If this editorial is not fluent, please check the paragraphs that cause the non-fluency.</p>
      <p>For normalization, the five choices of Q2 and Q6 were replaced with 0.2, 0.4, 0.6, 0.8, and 1.0, respectively. Also,
the choices of Q4 were transformed into 0.0, 0.5, 1.0, 0.5, and 0.0, respectively.</p>
      <p>Based on the questionnaire, we conducted three experiments. First, if the entity network model is reasonable, entities with high centrality will be keywords of each textual narrative. Thus, we compared keywords annotated by the evaluators (Q1) with the automatically discovered ones. As a baseline method, we measured the TF-IDF (Term Frequency-Inverse Document Frequency) of entities and clustered them according to the TF-IDF scores, in the same manner as the proposed method. Accuracy of the keywords was assessed by precision, recall, and F1 measure.</p>
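      <p>The keyword discovery compared above can be sketched as a one-dimensional 2-means over centrality (or TF-IDF) scores, seeded with the minimum and maximum score as described in Section 2.2, together with the title-concurrence measure of Eq. 4. This is our minimal reading of the method, and all names are ours.</p>
```python
def keywords(centrality):
    """K: split entities into two clusters by 1-D k-means over their
    centrality scores, seeded with the minimum and maximum score, and
    keep the high-centrality cluster."""
    scores = list(centrality.values())
    lo, hi = min(scores), max(scores)
    for _ in range(100):
        low_c = [s for s in scores if abs(s - lo) <= abs(s - hi)]
        high_c = [s for s in scores if abs(s - lo) > abs(s - hi)]
        new_lo = sum(low_c) / len(low_c) if low_c else lo
        new_hi = sum(high_c) / len(high_c) if high_c else hi
        if (new_lo, new_hi) == (lo, hi):   # converged
            break
        lo, hi = new_lo, new_hi
    return {e for e, s in centrality.items() if abs(s - hi) < abs(s - lo)}

def c_title(title_entities, headline_entities, K):
    """C_title(T): Jaccard index between title/headline entities and K (Eq. 4)."""
    th = set(title_entities) | set(headline_entities)
    return len(th & K) / len(th | K)
```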
      <p>The second and third rows of Table 1 show that both methods have high precision and low recall. To find the reason, we examined keywords that were not discovered by the centrality or TF-IDF. Most of the omitted keywords were referred to by various expressions, including pronouns and synonyms. For example, the following phrases can be used with similar meanings: U.S. government, American government, Federal government, Trump administration, Presidency of Donald Trump, Washington D.C., etc. This variety of expressions disperses the co-occurrence frequencies of entities. Vocabulary diversity makes texts smooth and fluent, but it is a challenging issue for composing accurate entity networks. Also, the centrality exhibited higher accuracy for discovering keywords than TF-IDF. However, the amount of improvement was insignificant. Even though the entity network is independent of the kind of media, its performance has to be improved, considering the simplicity of TF-IDF.</p>
      <p>Second, we validated RQ 1 and assessed the effectiveness of the narrative consistency measurements, based on Q2, Q3, Q6, and Q7. We examined correlations between (i) fluency annotated by the evaluators (Q6; FH), (ii) annotated consistency (Q2; CH), and (iii) automatically measured consistency (CA), using the PCC (Pearson Correlation Coefficient). FH-CH and FH-CA verified RQ 1, and CH-CA and FH-CA exhibited the effectiveness of the measurements. Table 3 (a) presents the correlation coefficients.</p>
      <p>In the experimental results, FH-CH was 0.91. Most of the evaluators gave the same scores for fluency and consistency. FH-CA (0.71) was lower than FH-CH but still significant. Thus, consistency was an essential factor of the narrative fluency. CH-CA (0.73) was lower than FH-CH but higher than FH-CA. This indicates that the proposed measurement adequately reflected the consistency of the editorials.</p>
      <p>Then, we compared the inconsistent paragraphs annotated by the evaluators with the ones detected by the proposed method. By modifying Eq. 3, we measured the inconsistency of each paragraph as: Σ_{∀e_a ∈ K} log C_l(e_a). According to this metric, we sorted the paragraphs in each editorial in descending order. Paragraphs in the first quartile of the order were determined to be the inconsistent ones. Accuracy for detecting the inconsistent paragraphs was assessed by precision, recall, and F1 measure.</p>
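      <p>This detection procedure can be sketched as follows, assuming the per-paragraph keyword centralities are already computed; the first-quartile rule follows the text, and all names are ours.</p>
```python
import math

def inconsistent_paragraphs(keyword_centralities):
    """Flag the first quartile of paragraphs after sorting them in
    descending order of the inconsistency score sum(log C_l(e_a)).

    keyword_centralities: one list per paragraph, holding the keyword
    centralities C_l(e_a) on that paragraph's entity network."""
    scores = [
        (sum(math.log(c) for c in cents), i)
        for i, cents in enumerate(keyword_centralities)
    ]
    scores.sort(reverse=True)
    cutoff = max(1, len(scores) // 4)
    return sorted(i for _, i in scores[:cutoff])
```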
      <p>As shown in the second row of Table 2, the proposed method exhibited high recall but low precision. Since we use keywords to measure the inconsistency, recognizing synonyms as individual entities might inflate the measured inconsistency of paragraphs. Although the consistency showed reasonable performance overall, we have to find a better way of composing the entity network.</p>
      <p>Finally, we validated RQ 2 and verified the effectiveness of the proposed measurement for the rapidity of narrative development, based on Q4 to Q7. As before, we examined correlations between (i) fluency annotated by the evaluators (Q6; FH), (ii) annotated rapidity (Q4; RH), and (iii) automatically measured rapidity (RA). FH-RH and FH-RA verified RQ 2, and RH-RA and FH-RA exhibited the effectiveness of the rapidity measurement. Table 3 (b) presents the correlation coefficients.</p>
      <p>FH-RH (0.66) was relatively lower than FH-CH. Also, FH-RA (0.74) was lower than FH-CA. These results mean that the rapidity was less significant than the consistency in estimating the narrative fluency. One interesting point was that FH-RA was higher than RH-RA (0.62). The rapidity measurement was correlated with the narrative fluency but not highly proportional to the rapidity of narrative development that the evaluators felt. The following experiment also showed this problem. Additionally, the proposed measurement exhibited the highest PCC for RH-RA at Θ_R = 0.45. We searched for the optimal Θ_R in [0, 1] with a step size of 0.05.</p>
      <p>Also, we compared the too-fast and too-slow paragraphs annotated by the evaluators with the ones detected using the rapidity measurement. Using Eq. 5, we sorted the paragraphs in each editorial in descending order. Then, paragraphs in the first and fourth quartiles of the order were determined to be too-fast and too-slow paragraphs, respectively. Their accuracy was assessed by using precision, recall, and F1 measure.</p>
      <p>Different from the consistency, the precision and recall of the rapidity measurement were similar. However, as displayed in the third row of Table 2, accuracy for detecting abnormality in the rapidity was significantly lower than for the consistency. To find the reason, we examined false positives and false negatives of the proposed method. Interestingly, the false positives mostly appeared in the beginning and ending parts of the editorials (possibly introductions and conclusions), and most of the false negatives appeared in the middle parts of the editorials. These results indicate that the optimal rapidity of narrative development can differ according to the locations of paragraphs (or narrative time). The low PCC score for RH-RA could be affected by this problem as well.</p>
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
      <p>We have proposed two kinds of measurements for assessing the fluency of textual narratives. Their effectiveness was evaluated based on editorials of the New York Times. However, this study has a few limitations. First, we could not conduct experiments on various kinds of textual narratives. Also, we assumed the optimal rapidity to be a static value. Our further research will focus on resolving these two problems.</p>
      <sec id="sec-4-1">
        <title>Acknowledgements</title>
        <p>This research was supported by the MSIT (Ministry of Science and ICT), Korea, under the ICT Consilience Creative program (IITP-2019-2011-1-00783) supervised by the IITP (Institute for Information &amp; communications Technology Planning &amp; Evaluation).</p>
        <p>[Bos16] Xavier Bost. A storytelling machine?: Automatic video summarization: the case of TV series. PhD thesis, University of Avignon, France, November 2016.</p>
        <p>[CR17] Emmanouil Theofanis Chourdakis and Joshua Reiss. Constructing narrative using a generative model and continuous action policies. In Proceedings of the Workshop on Computational Creativity in Natural Language Generation (CC-NLG@INLG 2017), pages 38–43, Santiago de Compostela, Spain, September 2017. Association for Computational Linguistics (ACL).</p>
        <p>[ES15] Ahmed M. Elgammal and Babak Saleh. Quantifying creativity in art networks. In Hannu Toivonen, Simon Colton, Michael Cook, and Dan Ventura, editors, Proceedings of the 6th International Conference on Computational Creativity (ICCC 2015), pages 39–46, Park City, Utah, USA, June 2015. computationalcreativity.net.</p>
        <p>[Huc15] Geoffrey J. Huck. What Is Good Writing?, chapter Narrative Fluency, pages 102–124. Oxford University Press, September 2015.</p>
        <p>[LB19] Vincent Labatut and Xavier Bost. Extraction and analysis of fictional character networks: A survey. ACM Computing Surveys, 2019. To appear.</p>
        <p>[LJ19a] O-Joun Lee and Jason J. Jung. Integrating character networks for extracting narratives from multimodal data. Information Processing and Management, 56(5):1894–1923, September 2019.</p>
        <p>[LJ19b] O-Joun Lee and Jason J. Jung. Modeling affective character network for story analytics. Future Generation Computer Systems, 92:458–478, March 2019.</p>
        <p>[LJ20] O-Joun Lee and Jason J. Jung. Story embedding: Learning distributed representations of stories based on character networks. Artificial Intelligence, 281:103235, April 2020.</p>
        <p>[LJJ18] O-Joun Lee, Nayoung Jo, and Jason J. Jung. Measuring character-based story similarity by analyzing movie scripts. In Alípio Mário Jorge, Ricardo Campos, Adam Jatowt, and Sérgio Nunes, editors, Proceedings of the 1st Workshop on Narrative Extraction From Text (Text2Story 2018) co-located with the 40th European Conference on Information Retrieval (ECIR 2018), volume 2077 of CEUR Workshop Proceedings, pages 41–45, Grenoble, France, March 2018. CEUR-WS.org.</p>
        <p>[LM14] Quoc V. Le and Tomas Mikolov. Distributed representations of sentences and documents. In Eric P. Xing and Tony Jebara, editors, Proceedings of the 31st International Conference on Machine Learning (ICML 2014), volume 32 of JMLR Workshop and Conference Proceedings, pages 1188–1196, Beijing, China, June 2014. JMLR.org.</p>
        <p>[LN10] George Lakoff and Srini Narayanan. Toward a computational model of narrative. In Proceedings of the 2010 AAAI Fall Symposium: Computational Models of Narrative, volume FS-10-04 of AAAI Technical Report, pages 21–28, Arlington, VA, US, November 2010. AAAI.</p>
        <p>[LNJ+17] O-Joun Lee, Hoang Long Nguyen, Jai E. Jung, Tai-Won Um, and Hyun-Woo Lee. Towards ontological approach on trust-aware ambient services. IEEE Access, 5:1589–1599, February 2017.</p>
        <p>[MAW+18] Lara J. Martin, Prithviraj Ammanabrolu, Xinyu Wang, William Hancock, Shruti Singh, Brent Harrison, and Mark O. Riedl. Event representations for automated story generation with deep neural nets. In Sheila A. McIlraith and Kilian Q. Weinberger, editors, Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI 2018), the 30th Innovative Applications of Artificial Intelligence (IAAI 2018), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI 2018), pages 868–875, New Orleans, Louisiana, USA, February 2018. AAAI Press.</p>
        <p>[McK97] Robert McKee. Story: Substance, Structure, Style and the Principles of Screenwriting. HarperCollins, New York, NY, USA, November 1997.</p>
        <p>[McK16] Robert McKee. Dialogue: The Art of Verbal Action for Page, Stage, and Screen. Twelve, July 2016.</p>
        <p>[NCV+17] Annamalai Narayanan, Mahinthan Chandramohan, Rajasekar Venkatesan, Lihui Chen, Yang Liu, and Shantanu Jaiswal. graph2vec: Learning distributed representations of graphs. Computing Research Repository (CoRR), abs/1707.05005, July 2017.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [BGL+19]
          <string-name>
            <surname>Xavier</surname>
            <given-names>Bost</given-names>
          </string-name>
          , Serigne Gueye, Vincent Labatut, Martha Larson, Georges Linarès, Damien Malinas, and Raphaël Roth.
          <article-title>Remembering winter was coming</article-title>
          .
          <source>Multimedia Tools and Applications</source>
          ,
          <year>September 2019</year>
          . To Appear.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [SLE15]
          <string-name>
            <given-names>Oscar</given-names>
            <surname>Saz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Yibin</given-names>
            <surname>Lin</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Maxine</given-names>
            <surname>Eskenazi</surname>
          </string-name>
          .
          <article-title>Measuring the impact of translation on the accuracy and fluency of vocabulary acquisition of english</article-title>
          .
          <source>Computer Speech &amp; Language</source>
          ,
          <volume>31</volume>
          (
          <issue>1</issue>
          ):
          <fpage>49</fpage>
          -
          <lpage>64</lpage>
          , May
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [SMS15]
          <string-name>
            <given-names>Maryam</given-names>
            <surname>Soleimani</surname>
          </string-name>
          , Sima Modirkhamene, and
          <string-name>
            <given-names>Karim</given-names>
            <surname>Sadeghi</surname>
          </string-name>
          .
          <article-title>Peer-mediated vs. individual writing: measuring fluency, complexity, and accuracy in writing</article-title>
          .
          <source>Innovation in Language Learning and Teaching</source>
          ,
          <volume>11</volume>
          (
          <issue>1</issue>
          ):
          <fpage>86</fpage>
          -
          <lpage>100</lpage>
          ,
          <year>June 2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [WCW09]
          <string-name>
            <given-names>Chung-Yi</given-names>
            <surname>Weng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Wei-Ta</given-names>
            <surname>Chu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Ja-Ling</given-names>
            <surname>Wu</surname>
          </string-name>
          .
          <article-title>RoleNet: Movie analysis from the perspective of social networks</article-title>
          .
          <source>IEEE Transactions on Multimedia</source>
          ,
          <volume>11</volume>
          (
          <issue>2</issue>
          ):
          <fpage>256</fpage>
          -
          <lpage>271</lpage>
          ,
          <year>February 2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [SSvL+11]
          <string-name>
            <given-names>Nino</given-names>
            <surname>Shervashidze</surname>
          </string-name>
          , Pascal Schweitzer, Erik Jan van Leeuwen,
          <string-name>
            <given-names>Kurt</given-names>
            <surname>Mehlhorn</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Karsten M.</given-names>
            <surname>Borgwardt</surname>
          </string-name>
          .
          <article-title>Weisfeiler-Lehman graph kernels</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          ,
          <volume>12</volume>
          :
          <fpage>2539</fpage>
          -
          <lpage>2561</lpage>
          ,
          <year>September 2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [TN16]
          <string-name>
            <given-names>Kaveh</given-names>
            <surname>Taghipour</surname>
          </string-name>
          and
          <string-name>
            <given-names>Hwee Tou</given-names>
            <surname>Ng</surname>
          </string-name>
          .
          <article-title>A neural approach to automated essay scoring</article-title>
          . In Jian Su, Xavier Carreras, and Kevin Duh, editors,
          <source>Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP</source>
          <year>2016</year>
          ), pages
          <fpage>1882</fpage>
          -
          <lpage>1891</lpage>
          , Austin, Texas, USA,
          <year>November 2016</year>
          .
          Association for Computational Linguistics
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>