Measuring Character-based Story Similarity by Analyzing Movie Scripts O-Joun Lee Nayoung Jo Jason J. Jung∗ Dept. of Computer Eng. Dept. of Computer Eng. Dept. of Computer Eng. Chung-Ang University Chung-Ang University Chung-Ang University Seoul, Korea 156-756 Seoul, Korea 156-756 Seoul, Korea 156-756 concerto9203@gmail.com joenayoung2@gmail.com j2jung@gmail.com Abstract The goal of this paper is to measure similarity among the stories for catego- rizing movies. Although genres are well-performing as movies’ categories, users have difficulty for predicting substances of the movies through the gen- res. Therefore, we proposed the story-based taxonomy of the movies and a method for constructing it automatically. In order to reflect characteristics of the stories, we used two kinds of features: (i) proximity among movie characters and (ii) genres of the movies. Based on the features, we constructed the story-based taxonomy by clustering the movies. We anticipate that the proposed taxonomy could make the users imagine and predict substances of movies through comprehending which movies contain similar stories. 1 Introduction With a rapid growth of media industry, ‘crossover’ is one of popular strategies in this area. In here, the crossover does not only indicate convergence among media, but also advent of novel genres, which are mixtures of conventional genres [JLYN17]. This paradigm makes the movies have characteristics of multiple genres. It means that the users have difficulty for expecting substances of the movies, if they only rely on the genres. In order to improve this problem, we suggested a novel taxonomy for exposing similarity among stories of the movies. Also, we proposed a method for automatically constructing the story-based taxonomy. To build the taxonomy, we applied two features that reflect stories of the movies; i.e., (i) proximity among the characters and (ii) genres of the movies. The story consists of three major components: the character, event, and background. The event is represented by interaction among the characters in a particular background. Therefore, we supposed that the proximity (frequency of the interaction) could reflect lots of stories’ characteristics. In our previous studies [DHLJ16, THLJ17, LJ16, JLYN17], we applied character networks (i.e., social networks among the characters) for representing the proximity. The conventional genres cover various features of the movies; e.g., topics, methods for developing stories, ambiance, and more. In here, a problem is that the genres contain too complex information to identify clear criteria for the classification. Nevertheless, although the genres can not precisely indicate substances of the movies, they can provide us meaningful information. To construct the story-based taxonomy, we clustered movies based on the character network and the genre distribution. As a preliminary study, we exhibited efficiency and necessity of the proposed method through a small-scaled experiment. ∗ Corresponding author. Copyright © 2018 for the individual papers by the paper’s authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors. In: A. Jorge, R. Campos, A. Jatowt, S. Nunes (eds.): Proceedings of the Text2StoryIR’18 Workshop, Grenoble, France, 26-March-2018, published at http://ceur-ws.org Figure 1: A part of a script of ‘La La Land (2016)’. 2 Character Network Our previous studies [DHLJ16, THLJ17, LJ16, JLYN17, LJ18] used the character network for computationally analyzing the stories. The character network is a social network among characters that appeared in the stories. It was defined as follows; Definition 1 (Character Network) Suppose that N is the number of characters that appeared in a movie, Cα . When N(Cα ) indicates a character network of Cα , N(Cα ) can be described as a matrix ∈ RN×N . It consists of N × N components which are the proximity among the characters as:   a1,1 · · · a1,N N(Cα ) =  ... .. ..  , (1)  . .  aN,1 ··· aN,N where, ai, j is the proximity of ci for c j when Cα is an universal set of characters that appeared in Cα and ci is an i-th element of Cα . In this study, we used frequency of the dialogues between the characters for measuring the proximity among them. The dialogues were extracted from the movies’ scripts collected from the Internet Movie Script Database (IMSDb) 1 . Since the scripts are structured documents, as displayed in Fig. 1, it is relatively easy to extract dialogues and their speakers. Simply speaking, the movies’ script consists of multiple scenes, which start with scene titles. Also, the scene contains descriptions and dialogues. The dialogue includes a speaker of dialogue and its content. In the description, characters’ action and backgrounds of scenes are illustrated. In this study, we mainly focused on boundaries of the scene and the speakers of the dialogues. As formats of the scripts are not completely uniform, we have difficulty for assuring whether we can discover points where the characters appear and disappear, or not. Therefore, we supposed that every characters appeared in the corresponding scene are listeners for all the dialogues spoken in the scene. It can be illustrated as Fig. 2. Nevertheless, the character networks have a difficulty for comparing with each other, since the number of characters is different from movies. Park et al. [PYKY15] proposed a method for normalizing the character networks by using the Singular Value Decomposition (SVD). In order to compare the character networks, we applied the same method. The normalized character network was denoted as N(Cα ). 3 Story-based Taxonomy of Movies The story-based taxonomy consisted of multiple groups of movies that have similar stories. To compare the movies’ stories with each other, we used two kinds of features: (i) the proximity among the characters and (ii) the genre distribution. For representing the proximity, we have an efficient model, the character network. However, in case of genres, the movies are not simply included within particular genres, but they partially contain characteristics of multiple genres. Therefore, we represented relationships between the movies and the genres by using a 22-dimensional vector as: − → CG α = µG1 (Cα ) , · · · , µG22 (Cα ) , (2) 1 http://www.imsdb.com/ c3 c1 c2 c3 c1 c2 Cα N(Cα ) c1 c1 c1 c1 c2 c3 c2 c3 c2 c3 c2 c2 c3 c2 c3 sα,1 sα,L Figure 2: An example of relationships between a movie (Cα ), characters (c1 , c2 , c3 ), scenes (sα,1 , · · · , sα,L ), and a character network (N(Cα )). where µGg (Cα ) indicates whether Gg includes Cα . Also, each component was initialized by a boolean value based on annotations collected from IMDB 2 . In order to estimate difference among movies’ stories, we applied two distance metrics, which are based on the Jaccard index and the Frobenius norm, respectively. They are formulated as:  ∑∀Gg E(µGg (Cα ) , µGg Cβ ) DG Cα , Cβ = 1 −   , ∑∀Gg max{µGg (Cα ) , µGg Cβ } DF Cα , Cβ = N(Cα ) − N(Cβ ) F ,  (3) where k·kF denotes the Frobenius norm and E(·, ·) is an indicator function that indicates whether two inputs are commonly positive or not. To combine the two distance metrics, we applied a weighted harmonic mean of them. Thereby, it can be formulated as: D Cα , Cβ  (4) " −1 −1 #−1 θF DF Cα , Cβ + θG DG Cα , Cβ = , θF + θG where θF and θG denote weighting parameters for DF and DG , respectively. For finding optimal θF and θG , we compared D Cα , Cβ with users’ perception. Since D Cα , Cβ was not normalized,  −1 first, we transformed it into a range of [0, 1] by the inverse of D Cα , Cβ . As a result, S Cα , Cβ = D Cα , Cβ   indicates the similarity between two arbitrary movies, Cα and Cβ . Then, a loss function for training was designed as: Su j Cα , Cβ − S Cα , Cβ 2 ,   LD = ∑ (5) ∀Su j (Cα ,Cβ ) where Su j Cα , Cβ indicates a user-estimated similarity between Cα and Cβ . Based on the loss function, we optimized θF  and θG with the gradient descent method. In order to build the story-based taxonomy of the movies, we used the fuzzy c-means clustering algorithm. This algorithm aimed to minimize an objective function: argmin ∑ ∑ µTk (Cα )m D Cα ,CTk ,  (6) T ∀Cα ∀Tk  2 −1  ! m−1 D Cα ,CTk µTk (Cα ) =  ∑  , (7) D Cα ,CTl  ∀Tl 2 http://www.imdb.com/ −1 −1 DG Cα , Cβ DF Cα , Cβ SU Cα , Cβ  Cα Cβ Terminator (1984) Gravity (2014) 0.25 0.39 2.60 Terminator (1984) Star Wars: Ep.1 (1999) 0.50 0.70 3.80 Star Wars: Ep.1 (1999) Gravity (2014) 0.17 0.46 3.40 Table 1: The similarity between ‘Terminator (1984)’, ‘Gravity (2014)’, and ‘Star Wars: Ep. 1 (1999)’, which is estimated by the proposed distance metrics and users. where T denotes the total cluster model that corresponds the story-based taxonomy, Tk refers to a k-th cluster in T, and CTk indicates the center of Tk . CTk was decided by a weighted average of elements within Tk . A feature vector of CTk consisted of two parts as the same with Cα ’s, and they can be formulated as: ∑ µTk (Cα )m N(Cα ) ∀Cα ∈Tk N(Tk ) = , (8) ∑ µTk (Cα )m ∀Cα ∈Tk ∑ µTk (Cα )m C˜Gα ∀Cα ∈Tk C˜TGk = . (9) ∑ µTk (Cα )m ∀Cα ∈Tk In order to use the fuzzy c-means clustering, we had to determine the number of clusters. We measured the quality of the total cluster model, as the number of clusters increased one by one. The benefit from increasing the number of clusters was estimated by: B|T| =(1 − θQ ) × ∆Q|T| + θQ × ∆Q|T|−1 , (10) ∆Q|T| =Q|T| − Q|T|−1 , (11) where |T| indicates the number of clusters in the current cluster model and θQ denotes a user-defined parameter that represents the momentum of the cluster model’s quality. When the number of clusters increases to |T|, Q|T| refers to the quality of the cluster model, ∆Q|T| denotes the amount of changes in the quality, and B|T| indicates the gain from the increment of the number of clusters. If the B|T| had a positive value, the proposed method proceeded the next iteration by |T| := |T| + 1. Otherwise, it determined the optimal number of clusters as |T|. The quality of the total cluster model, T was estimated by the Fukuyama-Sugeno index, FSm (T) [HBV02]. It is formulated as: FSm (T) (12) m = ∑ ∑ µTk (Cα ) × D Cα ,CTk − D CTk ,C ,   ∀Cα ∀Tk where C indicates the average of all the clusters’ centers. A method for calculating the average of the centers is the same with Eq. 8, although it is not weighted, in here. Thereby, the first term of Eq. 12 measures the compactness of each cluster, the second term indicates the adjacency among the clusters, and FSm is the Fukuyama-Sugeno index for the story-based taxonomy of the movies. If the story-based groups in the taxonomy are well-constructed, FSm might have a small value. In addition, m, which is used as exponent of the membership functions, is a user-defined parameter. As m becomes bigger, the membership degree of the movies gets more consideration. In this study, m equals to 2 en bloc. 4 Experimental Result and Discussion As a preliminary study, we have not constructed an adequate dataset for verifying the proposed method, yet. The experiment focused on efficiency of the proposed distance metrics. Table. 1 exhibits similarity between three movies (‘Terminator (1984)’, ‘Gravity (2014)’, and ‘Star Wars: Ep. 1 (1999)’), which is estimated by the proposed metrics and users. We collected the user-estimated similarity from 10 students of Chung-Ang University. The users rated the similarity between movies with natural numbers from 1 to 5. A 5th column of Table. 1 indicates average of users’ responses. As displayed in Table. 1, DF−1 is more correlated with SU than DG−1 . Pearson correlation coefficients between them are 0.88 and 0.58, respectively. In particular, between first and third cases, SU and DG−1 have opposite tendency. There is a possibility that backgrounds of the movies affect users’ perception, since ‘Gravity (2014)’ and ‘Star Wars: Ep. 1 (1999) commonly described the astrospace. Nevertheless, it is difficult to describe likeness among movies’ stories only with the genres, although the genres cover various characteristics of the movies. This experiment is too tiny-scaled to verify neither the proposed distance metrics nor the story-based taxonomy. However, the result made sure that the genres are not enough to make the users imagine substances of the movies. 5 Conclusion In this study, we revealed similarity among movies’ stories by clustering them with the character network and the genre distribution. The proposed method enables the users to imagine substances of movies, which they have not seen yet. Nevertheless, the proposed method has not been verified with an adequate dataset, since this study is a part of ongoing research. Our future work will be focused on composing appropriate datasets and evaluating the proposed method. Acknowledgements This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIP) (NRF-2017R1A41015675). References [DHLJ16] Tran Quang Dieu, Dosam Hwang, O-Joun Lee, and Jason J. Jung. A novel method for extracting dynamic character network from movie. In Proceedings of the 7th EAI International Conference on Big Data Technologies and Applications. EAI, 2016. [HBV02] Maria Halkidi, Yannis Batistakis, and Michalis Vazirgiannis. Clustering validity checking methods: Part II. ACM SIGMOD Record, 31(3):19–27, September 2002. [JLYN17] Jai E. Jung, O-Joun Lee, Eun-Soon You, and Myoung-Hee Nam. A computational model of transmedia ecosystem for story-based contents. Multimedia Tools and Applications, 76(8):10371–10388, Apr 2017. [LJ16] O-Joun Lee and Jason J. Jung. Affective character network for understanding plots of narrative contents. In María Trinidad Herrero Ezquerro, Grzegorz J. Nalepa, and José Tomás Palma Mendez, editors, Proceedings of the Workshop on Affective Computing and Context Awareness in Ambient Intelligence (AfCAI 2016), volume 1794 of CEUR Workshop Proceedings, Murcia, Spain, Nov 2016. CEUR-WS.org. [LJ18] O-Joun Lee and Jason J. Jung. Modeling affective character network for story analytics. Future Generation Computer Systems, 2018. (TO Appear). [PYKY15] Seung-Bo Park, Eun-Soon You, Hyun-Sik Kim, and Seong Won Yeo. Rank reduction of a character-net matrix based on svd. In Proceedings of the 11th International Conference on Multimedia Information Technology and Applications (MITA 2015), Tashkent, Uzbekistan, Jun 2015. [THLJ17] Quang Dieu Tran, Dosam Hwang, O-Joun Lee, and Jai E. Jung. Exploiting character networks for movie summarization. Multimedia Tools and Applications, 76(8):10357–10369, Apr 2017.