<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Measuring Character-based Story Similarity by Analyzing Movie Scripts</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>O-Joun Lee</string-name>
          <email>concerto9203@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nayoung Jo</string-name>
          <email>joenayoung2@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jason J. Jung</string-name>
          <email>j2jung@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Dept. of Computer Eng., Chung-Ang University</institution>
          ,
          <addr-line>Seoul, Korea 156-756</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2018</year>
      </pub-date>
      <abstract>
        <p>The goal of this paper is to measure the similarity among stories for categorizing movies. Although genres perform well as movie categories, users have difficulty predicting the substance of a movie from its genres alone. Therefore, we propose a story-based taxonomy of movies and a method for constructing it automatically. In order to reflect the characteristics of the stories, we used two kinds of features: (i) the proximity among movie characters and (ii) the genres of the movies. Based on these features, we constructed the story-based taxonomy by clustering the movies. We anticipate that the proposed taxonomy could help users imagine and predict the substance of movies by comprehending which movies contain similar stories.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Our previous studies [DHLJ16, THLJ17, LJ16, JLYN17, LJ18] used the character network for computationally analyzing stories. The character network is a social network among the characters that appear in a story. It is defined as follows.</p>
      <p>Definition 1 (Character Network). Suppose that N is the number of characters that appeared in a movie C_a. When N(C_a) indicates the character network of C_a, N(C_a) can be described as a matrix in R^{N×N}. It consists of N×N components, which are the proximities among the characters:</p>
      <p>N(C_a) = [ a_{1,1} ⋯ a_{1,N} ; ⋮ ⋱ ⋮ ; a_{N,1} ⋯ a_{N,N} ],</p>
      <p>where a_{i,j} is the proximity of c_i for c_j, C_a is the universal set of characters that appeared in the movie, and c_i is the i-th element of C_a.</p>
      <p>In this study, we used the frequency of dialogues between the characters to measure the proximity among them. The dialogues were extracted from movie scripts collected from the Internet Movie Script Database (IMSDb).</p>
      <p>Since the scripts are structured documents, as displayed in Fig. 1, it is relatively easy to extract dialogues and their speakers. Simply speaking, a movie script consists of multiple scenes, each of which starts with a scene title. Each scene contains descriptions and dialogues. A dialogue includes its speaker and its content, while the descriptions illustrate the characters' actions and the backgrounds of the scenes.</p>
      <p>In this study, we mainly focused on the boundaries of the scenes and the speakers of the dialogues. As the formats of the scripts are not completely uniform, it is difficult to determine reliably at which points characters appear and disappear. Therefore, we supposed that every character appearing in a scene is a listener of all the dialogues spoken in that scene, as illustrated in Fig. 2.</p>
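      <p>Under this listener assumption, the character network can be accumulated scene by scene. The following sketch counts, for each dialogue, one interaction from its speaker toward every other character appearing in the same scene; the simplified scene representation and the sample lines are illustrative, not IMSDb's actual layout:</p>

```python
from collections import defaultdict

def build_character_network(scenes):
    """Accumulate dialogue-frequency proximity between characters.

    Each scene is a list of (speaker, line) pairs; every character
    appearing in a scene is treated as a listener of all dialogues
    spoken in that scene.
    """
    proximity = defaultdict(int)
    for scene in scenes:
        characters = {speaker for speaker, _ in scene}
        for speaker, _ in scene:
            for listener in characters - {speaker}:
                proximity[(speaker, listener)] += 1
    return dict(proximity)

# toy two-scene script (hypothetical dialogue)
scenes = [
    [("RIPLEY", "Where is the cat?"), ("DALLAS", "No idea."), ("RIPLEY", "Find it.")],
    [("DALLAS", "Stay close."), ("LAMBERT", "Copy.")],
]
net = build_character_network(scenes)
print(net[("RIPLEY", "DALLAS")])  # RIPLEY speaks twice with DALLAS present -> 2
```

      <p>A real pipeline would first need a parser that splits scenes at scene titles and separates dialogue blocks from descriptions.</p>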
      <p>Nevertheless, the character networks are difficult to compare with each other, since the number of characters differs between movies. Park et al. [PYKY15] proposed a method for normalizing the character networks by using the Singular Value Decomposition (SVD). In order to compare the character networks, we applied the same method. The normalized character network is denoted as Ñ(C_a).</p>
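      <p>The normalization step can be sketched as below: each proximity matrix is zero-padded to a common size and then rank-reduced by keeping only its largest singular values, in the spirit of [PYKY15]. The common size and the number of retained singular values are illustrative assumptions, not values from the paper:</p>

```python
import numpy as np

def normalize_network(A, size=8, k=3):
    """Zero-pad a proximity matrix to a common size, then keep only
    the k largest singular values (rank reduction via SVD).
    `size` and `k` are illustrative choices."""
    n = A.shape[0]
    padded = np.zeros((size, size))
    padded[:n, :n] = A
    U, s, Vt = np.linalg.svd(padded)
    s[k:] = 0.0                 # drop all but the k largest singular values
    return U @ np.diag(s) @ Vt  # rank-k approximation of the padded network

A = np.array([[0.0, 2.0], [1.0, 0.0]])
N_tilde = normalize_network(A)
print(N_tilde.shape)  # (8, 8)
```
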
    </sec>
    <sec id="sec-2">
      <title>Story-based Taxonomy of Movies</title>
      <p>The story-based taxonomy consists of multiple groups of movies that have similar stories. To compare the movies' stories with each other, we used two kinds of features: (i) the proximity among the characters and (ii) the genre distribution. For representing the proximity, we already have an efficient model, the character network. In the case of the genres, however, the movies are not simply included within particular genres; rather, they partially contain characteristics of multiple genres. Therefore, we represented the relationships between the movies and the genres by using a 22-dimensional vector:</p>
      <p>C̃_a^G = [ m_{G_1}(C_a), ⋯, m_{G_{22}}(C_a) ], (1)</p>
      <p>where m_{G_g}(C_a) indicates whether the genre G_g includes C_a. Each component was initialized with a boolean value based on annotations collected from IMDb.</p>
      <p>In order to estimate the difference among the movies' stories, we applied two distance metrics, which are based on the Frobenius norm and the Jaccard index, respectively. They are formulated as:</p>
      <p>D_F(C_a, C_b) = ‖Ñ(C_a) − Ñ(C_b)‖_F, (2)</p>
      <p>D_G(C_a, C_b) = 1 − Σ_{∀G_g} E(m_{G_g}(C_a), m_{G_g}(C_b)) / Σ_{∀G_g} max{ m_{G_g}(C_a), m_{G_g}(C_b) }, (3)</p>
      <p>where ‖·‖_F denotes the Frobenius norm and E(·, ·) is an indicator function that indicates whether its two inputs are commonly positive or not.</p>
      <p>To combine the two distance metrics, we applied a weighted harmonic mean of them. Thereby, it can be formulated as:</p>
      <p>D(C_a, C_b) = [ ( θ_F D_F(C_a, C_b)^{−1} + θ_G D_G(C_a, C_b)^{−1} ) / ( θ_F + θ_G ) ]^{−1}, (4)</p>
      <p>where θ_F and θ_G denote weighting parameters for D_F and D_G, respectively.</p>
      <p>For finding the optimal θ_F and θ_G, we compared D(C_a, C_b) with the users' perception. Since D(C_a, C_b) is not normalized, we first transformed it into the range [0, 1] by taking its inverse. As a result,</p>
      <p>S(C_a, C_b) = D(C_a, C_b)^{−1} (5)</p>
      <p>indicates the similarity between two arbitrary movies, C_a and C_b. Then, a loss function for training was designed as:</p>
      <p>L_D = Σ_{∀S_{u_j}(C_a, C_b)} ( S_{u_j}(C_a, C_b) − S(C_a, C_b) )², (6)</p>
      <p>where S_{u_j}(C_a, C_b) indicates a user-estimated similarity between C_a and C_b. Based on this loss function, we optimized θ_F and θ_G with the gradient descent method.</p>
      <p>In order to build the story-based taxonomy of the movies, we used the fuzzy c-means clustering algorithm. This algorithm aims to minimize an objective function:</p>
      <p>argmin_T Σ_{∀C_a} Σ_{∀T_k} m_{T_k}(C_a)^m D(C_a, C_{T_k}), (7)</p>
      <p>where T denotes the total cluster model that corresponds to the story-based taxonomy, T_k refers to the k-th cluster in T, and C_{T_k} indicates the center of T_k. C_{T_k} was decided by a weighted average of the elements within T_k. A feature vector of C_{T_k} consists of two parts, the same as C_a's, and they can be formulated as:</p>
      <p>Ñ(T_k) = Σ_{∀C_a∈T_k} m_{T_k}(C_a)^m Ñ(C_a) / Σ_{∀C_a∈T_k} m_{T_k}(C_a)^m, (8)</p>
      <p>C̃_{T_k}^G = Σ_{∀C_a∈T_k} m_{T_k}(C_a)^m C̃_a^G / Σ_{∀C_a∈T_k} m_{T_k}(C_a)^m, (9)</p>
      <p>where m_{T_k}(C_a) denotes the membership degree of C_a for T_k, which is estimated as:</p>
      <p>m_{T_k}(C_a) = [ Σ_{∀T_l} ( D(C_a, C_{T_k}) / D(C_a, C_{T_l}) )^{2/(m−1)} ]^{−1}. (10)</p>
      <sec id="sec-2-3">
        <title>Determining the Number of Clusters</title>
        <p>In order to use the fuzzy c-means clustering, we had to determine the number of clusters. We measured the quality of the total cluster model as the number of clusters increased one by one. The benefit from increasing the number of clusters was estimated by:</p>
        <p>B_{|T|} = (1 − θ_Q) ΔQ_{|T|} / Q_{|T|−1} + θ_Q ΔQ_{|T|} / ΔQ_{|T|−1}, (11)</p>
        <p>where |T| indicates the number of clusters in the current cluster model and θ_Q denotes a user-defined parameter that represents the momentum of the cluster model's quality. When the number of clusters increases to |T|, Q_{|T|} refers to the quality of the cluster model, ΔQ_{|T|} denotes the amount of change in the quality, and B_{|T|} indicates the gain from the increment of the number of clusters.</p>
        <p>If B_{|T|} had a positive value, the proposed method proceeded to the next iteration with |T| := |T| + 1. Otherwise, it determined the optimal number of clusters as |T|.</p>
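        <p>A minimal sketch of this stopping rule, assuming a hypothetical quality function that maps a cluster count to the model quality Q (in practice derived from the cluster-validity index below), is:</p>

```python
def optimal_cluster_count(quality, theta_q=0.5, k_max=20):
    """Grow the number of clusters while the gain B stays positive.

    `quality` maps a cluster count |T| to the quality Q_|T|;
    `theta_q` is the momentum parameter from the paper.
    """
    history = {1: quality(1), 2: quality(2)}
    k = 3
    while k <= k_max:
        history[k] = quality(k)
        dq = history[k] - history[k - 1]           # delta Q_|T|
        dq_prev = history[k - 1] - history[k - 2]  # delta Q_|T|-1
        momentum = dq / dq_prev if dq_prev else 0.0
        b = (1 - theta_q) * dq / history[k - 1] + theta_q * momentum
        if b <= 0:
            return k  # the paper takes this |T| as the optimum
        k += 1
    return k_max

# toy quality curve that saturates after four clusters
print(optimal_cluster_count(lambda k: min(k, 4)))  # 5: first count with no gain
```
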
        <p>The quality of the total cluster model, T, was estimated by the Fukuyama-Sugeno index, FS_m(T) [HBV02]. It is formulated as:</p>
        <p>FS_m(T) = Σ_{∀C_a} Σ_{∀T_k} m_{T_k}(C_a)^m ( D(C_a, C_{T_k})² − D(C_{T_k}, C̄)² ), (12)</p>
        <p>where C̄ indicates the average of all the clusters' centers. The method for calculating the average of the centers is the same as in Eq. 8, although it is not weighted here. Thereby, the first term of Eq. 12 measures the compactness of each cluster, and the second term indicates the adjacency among the clusters. If the story-based groups in the taxonomy are well constructed, FS_m(T) will have a small value.</p>
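        <p>Given a membership matrix and cluster centers, the index can be computed directly; in this generic sketch the Euclidean distance stands in for the paper's combined metric D, and the data are hypothetical:</p>

```python
import numpy as np

def fukuyama_sugeno(X, centers, U, m=2):
    """FS_m = sum_a sum_k U[a, k]^m * (||x_a - c_k||^2 - ||c_k - c_bar||^2).

    X: (n, d) data points, centers: (K, d), U: (n, K) fuzzy memberships.
    The first term rewards compact clusters, the second well-separated ones.
    """
    c_bar = centers.mean(axis=0)  # unweighted average of the centers
    fs = 0.0
    for k, c in enumerate(centers):
        compact = np.sum((X - c) ** 2, axis=1)  # squared distance to the center
        separate = np.sum((c - c_bar) ** 2)     # squared center-to-mean distance
        fs += np.sum(U[:, k] ** m * (compact - separate))
    return fs

X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
centers = np.array([[0.05, 0.0], [5.05, 5.0]])
U = np.array([[0.99, 0.01], [0.99, 0.01], [0.01, 0.99], [0.01, 0.99]])
print(fukuyama_sugeno(X, centers, U))  # negative: compact, well-separated clusters
```
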
        <p>In addition, the exponent m of the membership functions is a user-defined parameter. As m becomes bigger, the membership degrees of the movies receive more consideration. In this study, m was set to 2 throughout.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Experimental Result and Discussion</title>
      <p>As this is a preliminary study, we have not yet constructed an adequate dataset for verifying the proposed method. The experiment focused on the efficiency of the proposed distance metrics. Table 1 exhibits the similarity between three movies (‘Terminator (1984)’, ‘Gravity (2014)’, and ‘Star Wars: Ep. 1 (1999)’), as estimated by the proposed metrics and by users. We collected the user-estimated similarity from 10 students of Chung-Ang University. The users rated the similarity between movies with natural numbers from 1 to 5. The fifth column of Table 1 indicates the average of the users' responses.</p>
      <p>As displayed in Table 1, D_F^{−1} is more correlated with S_U than D_G^{−1}; the Pearson correlation coefficients are 0.88 and 0.58, respectively. In particular, between the first and third cases, S_U and D_G^{−1} show opposite tendencies. There is a possibility that the backgrounds of the movies affect the users' perception, since ‘Gravity (2014)’ and ‘Star Wars: Ep. 1 (1999)’ both depict outer space. Nevertheless, it is difficult to describe the likeness among movies' stories with the genres alone, although the genres cover various characteristics of the movies.</p>
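      <p>Such a comparison reduces to a Pearson correlation between the metric-derived similarities and the averaged user ratings. With hypothetical numbers (not the values behind Table 1), the computation looks like:</p>

```python
import numpy as np

# hypothetical per-pair values, NOT the paper's data:
sim_metric = np.array([0.80, 0.30, 0.45])  # e.g. D_F^-1 for the three movie pairs
sim_users = np.array([4.1, 1.8, 2.6])      # averaged 1-5 user ratings

r = np.corrcoef(sim_metric, sim_users)[0, 1]
print(r > 0.9)  # strongly correlated in this toy case
```
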
      <p>This experiment is too small in scale to verify either the proposed distance metrics or the story-based taxonomy. However, the result showed that the genres alone are not enough to make users imagine the substance of movies.</p>
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
      <p>In this study, we revealed the similarity among movies' stories by clustering them with the character network and the genre distribution. The proposed method enables users to imagine the substance of movies that they have not yet seen.</p>
      <p>Nevertheless, the proposed method has not been verified with an adequate dataset, since this study is a part of ongoing research. Our future work will focus on composing appropriate datasets and evaluating the proposed method.</p>
      <p>Acknowledgements: This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIP) (NRF-2017R1A41015675).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [DHLJ16]
          <string-name><given-names>Quang Dieu</given-names> <surname>Tran</surname></string-name>,
          <string-name><given-names>Dosam</given-names> <surname>Hwang</surname></string-name>,
          <string-name><given-names>O-Joun</given-names> <surname>Lee</surname></string-name>, and
          <string-name><given-names>Jason J.</given-names> <surname>Jung</surname></string-name>.
          <article-title>A novel method for extracting dynamic character network from movie</article-title>.
          In
          <source>Proceedings of the 7th EAI International Conference on Big Data Technologies and Applications</source>. EAI,
          <year>2016</year>.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [HBV02]
          <string-name><given-names>Maria</given-names> <surname>Halkidi</surname></string-name>,
          <string-name><given-names>Yannis</given-names> <surname>Batistakis</surname></string-name>, and
          <string-name><given-names>Michalis</given-names> <surname>Vazirgiannis</surname></string-name>.
          <article-title>Clustering validity checking methods: Part II</article-title>.
          <source>ACM SIGMOD Record</source>,
          <volume>31</volume>(<issue>3</issue>):<fpage>19</fpage>-<lpage>27</lpage>,
          <year>September 2002</year>.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [JLYN17]
          <string-name><given-names>Jai E.</given-names> <surname>Jung</surname></string-name>,
          <string-name><given-names>O-Joun</given-names> <surname>Lee</surname></string-name>,
          <string-name><given-names>Eun-Soon</given-names> <surname>You</surname></string-name>, and
          <string-name><given-names>Myoung-Hee</given-names> <surname>Nam</surname></string-name>.
          <article-title>A computational model of transmedia ecosystem for story-based contents</article-title>.
          <source>Multimedia Tools and Applications</source>,
          <volume>76</volume>(<issue>8</issue>):<fpage>10371</fpage>-<lpage>10388</lpage>,
          <year>Apr 2017</year>.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [LJ16]
          <string-name><given-names>O-Joun</given-names> <surname>Lee</surname></string-name> and
          <string-name><given-names>Jason J.</given-names> <surname>Jung</surname></string-name>.
          <article-title>Affective character network for understanding plots of narrative contents</article-title>.
          In María Trinidad Herrero Ezquerro,
          <string-name><given-names>Grzegorz J.</given-names> <surname>Nalepa</surname></string-name>, and José Tomás Palma Mendez, editors,
          <source>Proceedings of the Workshop on Affective Computing and Context Awareness in Ambient Intelligence (AfCAI 2016)</source>, volume
          <volume>1794</volume> of
          <source>CEUR Workshop Proceedings</source>, Murcia, Spain,
          <year>Nov 2016</year>. CEUR-WS.org.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>O-Joun</surname>
            <given-names>Lee</given-names>
          </string-name>
          and
          <string-name>
            <given-names>Jason J.</given-names>
            <surname>Jung</surname>
          </string-name>
          .
          <article-title>Modeling affective character network for story analytics</article-title>
          .
          <source>Future Generation Computer Systems</source>
          ,
          <year>2018</year>
          . (TO Appear).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [PYKY15]
          <string-name><given-names>Seung-Bo</given-names> <surname>Park</surname></string-name>,
          <string-name><given-names>Eun-Soon</given-names> <surname>You</surname></string-name>,
          <string-name><given-names>Hyun-Sik</given-names> <surname>Kim</surname></string-name>, and
          <string-name><given-names>Seong Won</given-names> <surname>Yeo</surname></string-name>.
          <article-title>Rank reduction of a character-net matrix based on SVD</article-title>.
          In
          <source>Proceedings of the 11th International Conference on Multimedia Information Technology and Applications (MITA 2015)</source>, Tashkent, Uzbekistan,
          <year>Jun 2015</year>.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [THLJ17]
          <string-name><given-names>Quang Dieu</given-names> <surname>Tran</surname></string-name>,
          <string-name><given-names>Dosam</given-names> <surname>Hwang</surname></string-name>,
          <string-name><given-names>O-Joun</given-names> <surname>Lee</surname></string-name>, and
          <string-name><given-names>Jai E.</given-names> <surname>Jung</surname></string-name>.
          <article-title>Exploiting character networks for movie summarization</article-title>.
          <source>Multimedia Tools and Applications</source>,
          <volume>76</volume>(<issue>8</issue>):<fpage>10357</fpage>-<lpage>10369</lpage>,
          <year>Apr 2017</year>.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>