Measuring Character-based Story Similarity by Analyzing Movie
                          Scripts

                O-Joun Lee                                     Nayoung Jo                                   Jason J. Jung∗
          Dept. of Computer Eng.                         Dept. of Computer Eng.                        Dept. of Computer Eng.
           Chung-Ang University                           Chung-Ang University                         Chung-Ang University
           Seoul, Korea 156-756                           Seoul, Korea 156-756                          Seoul, Korea 156-756
         concerto9203@gmail.com                         joenayoung2@gmail.com                            j2jung@gmail.com


                                                                   Abstract

                           The goal of this paper is to measure similarity among the stories for catego-
                           rizing movies. Although genres are well-performing as movies’ categories,
                           users have difficulty for predicting substances of the movies through the gen-
                           res. Therefore, we proposed the story-based taxonomy of the movies and a
                           method for constructing it automatically. In order to reflect characteristics
                           of the stories, we used two kinds of features: (i) proximity among movie
                           characters and (ii) genres of the movies. Based on the features, we constructed
                           the story-based taxonomy by clustering the movies. We anticipate that the
                           proposed taxonomy could make the users imagine and predict substances of
                           movies through comprehending which movies contain similar stories.

1    Introduction
With a rapid growth of media industry, ‘crossover’ is one of popular strategies in this area. In here, the crossover does
not only indicate convergence among media, but also advent of novel genres, which are mixtures of conventional genres
[JLYN17]. This paradigm makes the movies have characteristics of multiple genres. It means that the users have difficulty
for expecting substances of the movies, if they only rely on the genres.
   In order to improve this problem, we suggested a novel taxonomy for exposing similarity among stories of the movies.
Also, we proposed a method for automatically constructing the story-based taxonomy. To build the taxonomy, we applied
two features that reflect stories of the movies; i.e., (i) proximity among the characters and (ii) genres of the movies.
   The story consists of three major components: the character, event, and background. The event is represented by
interaction among the characters in a particular background. Therefore, we supposed that the proximity (frequency of the
interaction) could reflect lots of stories’ characteristics. In our previous studies [DHLJ16, THLJ17, LJ16, JLYN17], we
applied character networks (i.e., social networks among the characters) for representing the proximity.
   The conventional genres cover various features of the movies; e.g., topics, methods for developing stories, ambiance, and
more. In here, a problem is that the genres contain too complex information to identify clear criteria for the classification.
Nevertheless, although the genres can not precisely indicate substances of the movies, they can provide us meaningful
information.
   To construct the story-based taxonomy, we clustered movies based on the character network and the genre distribution.
As a preliminary study, we exhibited efficiency and necessity of the proposed method through a small-scaled experiment.
    ∗ Corresponding author.

Copyright © 2018 for the individual papers by the paper’s authors. Copying permitted for private and academic purposes. This volume is published and
copyrighted by its editors.
In: A. Jorge, R. Campos, A. Jatowt, S. Nunes (eds.): Proceedings of the Text2StoryIR’18 Workshop, Grenoble, France, 26-March-2018, published at
http://ceur-ws.org
                                     Figure 1: A part of a script of ‘La La Land (2016)’.
2     Character Network
Our previous studies [DHLJ16, THLJ17, LJ16, JLYN17, LJ18] used the character network for computationally analyzing
the stories. The character network is a social network among characters that appeared in the stories. It was defined as
follows;
Definition 1 (Character Network) Suppose that N is the number of characters that appeared in a movie, Cα . When N(Cα )
indicates a character network of Cα , N(Cα ) can be described as a matrix ∈ RN×N . It consists of N × N components which
are the proximity among the characters as:
                                                                         
                                                         a1,1 · · · a1,N
                                              N(Cα ) =  ...    ..     ..  ,                                        (1)
                                                       
                                                                   .    . 
                                                             aN,1   ···   aN,N

where, ai, j is the proximity of ci for c j when Cα is an universal set of characters that appeared in Cα and ci is an i-th element
of Cα .
    In this study, we used frequency of the dialogues between the characters for measuring the proximity among them. The
dialogues were extracted from the movies’ scripts collected from the Internet Movie Script Database (IMSDb) 1 .
    Since the scripts are structured documents, as displayed in Fig. 1, it is relatively easy to extract dialogues and their
speakers. Simply speaking, the movies’ script consists of multiple scenes, which start with scene titles. Also, the scene
contains descriptions and dialogues. The dialogue includes a speaker of dialogue and its content. In the description,
characters’ action and backgrounds of scenes are illustrated.
    In this study, we mainly focused on boundaries of the scene and the speakers of the dialogues. As formats of the scripts
are not completely uniform, we have difficulty for assuring whether we can discover points where the characters appear and
disappear, or not. Therefore, we supposed that every characters appeared in the corresponding scene are listeners for all the
dialogues spoken in the scene. It can be illustrated as Fig. 2.
    Nevertheless, the character networks have a difficulty for comparing with each other, since the number of characters
is different from movies. Park et al. [PYKY15] proposed a method for normalizing the character networks by using the
Singular Value Decomposition (SVD). In order to compare the character networks, we applied the same method. The
normalized character network was denoted as N(Cα ).

3     Story-based Taxonomy of Movies
The story-based taxonomy consisted of multiple groups of movies that have similar stories. To compare the movies’ stories
with each other, we used two kinds of features: (i) the proximity among the characters and (ii) the genre distribution. For
representing the proximity, we have an efficient model, the character network. However, in case of genres, the movies are
not simply included within particular genres, but they partially contain characteristics of multiple genres. Therefore, we
represented relationships between the movies and the genres by using a 22-dimensional vector as:
                                               −
                                               →
                                               CG
                                                α = µG1 (Cα ) , · · · , µG22 (Cα ) ,                                           (2)
    1 http://www.imsdb.com/
                                                                                                       c3
                                                             c1
                                                       c2          c3                        c1                  c2

                                                             Cα                                    N(Cα )


                            c1             c1                c1                 c1
                       c2        c3   c2          c3    c2         c3      c2           c2        c3        c2        c3

                                           sα,1                                                        sα,L

Figure 2: An example of relationships between a movie (Cα ), characters (c1 , c2 , c3 ), scenes (sα,1 , · · · , sα,L ), and a character
network (N(Cα )).

where µGg (Cα ) indicates whether Gg includes Cα . Also, each component was initialized by a boolean value based on
annotations collected from IMDB 2 .
   In order to estimate difference among movies’ stories, we applied two distance metrics, which are based on the Jaccard
index and the Frobenius norm, respectively. They are formulated as:
                                                                                  
                                                        ∑∀Gg E(µGg (Cα ) , µGg Cβ )
                                   DG Cα , Cβ = 1 −
                                             
                                                                                     ,
                                                      ∑∀Gg max{µGg (Cα ) , µGg Cβ }
                                   DF Cα , Cβ = N(Cα ) − N(Cβ ) F ,
                                             
                                                                                                                      (3)

where k·kF denotes the Frobenius norm and E(·, ·) is an indicator function that indicates whether two inputs are commonly
positive or not.
   To combine the two distance metrics, we applied a weighted harmonic mean of them. Thereby, it can be formulated as:

                                   D Cα , Cβ
                                             
                                                                                                                      (4)
                                      "                −1                   −1 #−1
                                        θF DF Cα , Cβ       + θG DG Cα , Cβ
                                   =                                                  ,
                                                         θF + θG

where θF and θG denote weighting parameters for DF and DG , respectively.
    For finding optimal θF and θG , we compared D Cα , Cβ with users’ perception. Since D Cα , Cβ was not normalized,
                                                                                                       
                                                                                                                         −1
first, we transformed it into a range of [0, 1] by the inverse of D Cα , Cβ . As a result, S Cα , Cβ = D Cα , Cβ
                                                                                                         

indicates the similarity between two arbitrary movies, Cα and Cβ . Then, a loss function for training was designed as:

                                                            Su j Cα , Cβ − S Cα , Cβ 2 ,
                                                                                   
                                       LD =       ∑                                                                       (5)
                                            ∀Su j (Cα ,Cβ )


where Su j Cα , Cβ indicates a user-estimated similarity between Cα and Cβ . Based on the loss function, we optimized θF
                      

and θG with the gradient descent method.
    In order to build the story-based taxonomy of the movies, we used the fuzzy c-means clustering algorithm. This algorithm
aimed to minimize an objective function:

                                           argmin ∑ ∑ µTk (Cα )m D Cα ,CTk ,
                                                                              
                                                                                                                          (6)
                                                   T   ∀Cα ∀Tk
                                                                                      2 −1
                                                                                   ! m−1
                                                                        D Cα ,CTk
                                             µTk (Cα ) =  ∑                               ,                                      (7)
                                                                        D Cα ,CTl
                                                                                  
                                                             ∀Tl

   2 http://www.imdb.com/
                                                                                   −1                    −1
                                                                      DG Cα , Cβ             DF Cα , Cβ         SU Cα , Cβ
                                                                                                                             
                     Cα                        Cβ
               Terminator (1984)          Gravity (2014)                  0.25                   0.39              2.60
               Terminator (1984)       Star Wars: Ep.1 (1999)             0.50                   0.70              3.80
             Star Wars: Ep.1 (1999)       Gravity (2014)                  0.17                   0.46              3.40

Table 1: The similarity between ‘Terminator (1984)’, ‘Gravity (2014)’, and ‘Star Wars: Ep. 1 (1999)’, which is estimated by
the proposed distance metrics and users.
where T denotes the total cluster model that corresponds the story-based taxonomy, Tk refers to a k-th cluster in T, and CTk
indicates the center of Tk . CTk was decided by a weighted average of elements within Tk . A feature vector of CTk consisted
of two parts as the same with Cα ’s, and they can be formulated as:

                                                             ∑ µTk (Cα )m N(Cα )
                                                           ∀Cα ∈Tk
                                                N(Tk ) =                                 ,                                        (8)
                                                                  ∑ µTk (Cα )m
                                                                ∀Cα ∈Tk

                                                             ∑ µTk (Cα )m C˜Gα
                                                          ∀Cα ∈Tk
                                                  C˜TGk =                           .                                             (9)
                                                                ∑ µTk (Cα )m
                                                            ∀Cα ∈Tk

   In order to use the fuzzy c-means clustering, we had to determine the number of clusters. We measured the quality of the
total cluster model, as the number of clusters increased one by one. The benefit from increasing the number of clusters was
estimated by:

                                          B|T| =(1 − θQ ) × ∆Q|T| + θQ × ∆Q|T|−1 ,                                               (10)
                                         ∆Q|T| =Q|T| − Q|T|−1 ,                                                                  (11)

where |T| indicates the number of clusters in the current cluster model and θQ denotes a user-defined parameter that
represents the momentum of the cluster model’s quality. When the number of clusters increases to |T|, Q|T| refers to the
quality of the cluster model, ∆Q|T| denotes the amount of changes in the quality, and B|T| indicates the gain from the
increment of the number of clusters.
   If the B|T| had a positive value, the proposed method proceeded the next iteration by |T| := |T| + 1. Otherwise, it
determined the optimal number of clusters as |T|.
   The quality of the total cluster model, T was estimated by the Fukuyama-Sugeno index, FSm (T) [HBV02]. It is
formulated as:

                                      FSm (T)                                                                                    (12)
                                                            m
                                      = ∑ ∑ µTk (Cα ) × D Cα ,CTk                 − D CTk ,C ,
                                                                                           
                                        ∀Cα ∀Tk

where C indicates the average of all the clusters’ centers. A method for calculating the average of the centers is the same
with Eq. 8, although it is not weighted, in here. Thereby, the first term of Eq. 12 measures the compactness of each cluster,
the second term indicates the adjacency among the clusters, and FSm is the Fukuyama-Sugeno index for the story-based
taxonomy of the movies. If the story-based groups in the taxonomy are well-constructed, FSm might have a small value.
   In addition, m, which is used as exponent of the membership functions, is a user-defined parameter. As m becomes bigger,
the membership degree of the movies gets more consideration. In this study, m equals to 2 en bloc.

4   Experimental Result and Discussion
As a preliminary study, we have not constructed an adequate dataset for verifying the proposed method, yet. The experiment
focused on efficiency of the proposed distance metrics. Table. 1 exhibits similarity between three movies (‘Terminator
(1984)’, ‘Gravity (2014)’, and ‘Star Wars: Ep. 1 (1999)’), which is estimated by the proposed metrics and users. We
collected the user-estimated similarity from 10 students of Chung-Ang University. The users rated the similarity between
movies with natural numbers from 1 to 5. A 5th column of Table. 1 indicates average of users’ responses.
   As displayed in Table. 1, DF−1 is more correlated with SU than DG−1 . Pearson correlation coefficients between them are
0.88 and 0.58, respectively. In particular, between first and third cases, SU and DG−1 have opposite tendency. There is a
possibility that backgrounds of the movies affect users’ perception, since ‘Gravity (2014)’ and ‘Star Wars: Ep. 1 (1999)
commonly described the astrospace. Nevertheless, it is difficult to describe likeness among movies’ stories only with the
genres, although the genres cover various characteristics of the movies.
   This experiment is too tiny-scaled to verify neither the proposed distance metrics nor the story-based taxonomy. However,
the result made sure that the genres are not enough to make the users imagine substances of the movies.

5   Conclusion
In this study, we revealed similarity among movies’ stories by clustering them with the character network and the genre
distribution. The proposed method enables the users to imagine substances of movies, which they have not seen yet.
   Nevertheless, the proposed method has not been verified with an adequate dataset, since this study is a part of ongoing
research. Our future work will be focused on composing appropriate datasets and evaluating the proposed method.

Acknowledgements
This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government
(MSIP) (NRF-2017R1A41015675).

References
[DHLJ16] Tran Quang Dieu, Dosam Hwang, O-Joun Lee, and Jason J. Jung. A novel method for extracting dynamic
         character network from movie. In Proceedings of the 7th EAI International Conference on Big Data Technologies
         and Applications. EAI, 2016.
[HBV02]     Maria Halkidi, Yannis Batistakis, and Michalis Vazirgiannis. Clustering validity checking methods: Part II.
            ACM SIGMOD Record, 31(3):19–27, September 2002.

[JLYN17] Jai E. Jung, O-Joun Lee, Eun-Soon You, and Myoung-Hee Nam. A computational model of transmedia
         ecosystem for story-based contents. Multimedia Tools and Applications, 76(8):10371–10388, Apr 2017.
[LJ16]      O-Joun Lee and Jason J. Jung. Affective character network for understanding plots of narrative contents. In
            María Trinidad Herrero Ezquerro, Grzegorz J. Nalepa, and José Tomás Palma Mendez, editors, Proceedings of
            the Workshop on Affective Computing and Context Awareness in Ambient Intelligence (AfCAI 2016), volume
            1794 of CEUR Workshop Proceedings, Murcia, Spain, Nov 2016. CEUR-WS.org.
[LJ18]      O-Joun Lee and Jason J. Jung. Modeling affective character network for story analytics. Future Generation
            Computer Systems, 2018. (TO Appear).
[PYKY15] Seung-Bo Park, Eun-Soon You, Hyun-Sik Kim, and Seong Won Yeo. Rank reduction of a character-net matrix
         based on svd. In Proceedings of the 11th International Conference on Multimedia Information Technology and
         Applications (MITA 2015), Tashkent, Uzbekistan, Jun 2015.
[THLJ17] Quang Dieu Tran, Dosam Hwang, O-Joun Lee, and Jai E. Jung. Exploiting character networks for movie
         summarization. Multimedia Tools and Applications, 76(8):10357–10369, Apr 2017.