<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Measuring Character-based Story Similarity by Analyzing Movie Scripts</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>O-Joun Lee</string-name>
          <email>concerto9203@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nayoung Jo</string-name>
          <email>joenayoung2@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jason J. Jung</string-name>
          <email>j2jung@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Dept. of Computer Eng., Chung-Ang University</institution>
          ,
          <addr-line>Seoul, Korea 156-756</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2018</year>
      </pub-date>
      <abstract>
        <p>The goal of this paper is to measure the similarity among stories for categorizing movies. Although genres perform well as movie categories, users have difficulty predicting the substance of a movie from its genres alone. Therefore, we propose a story-based taxonomy of movies and a method for constructing it automatically. In order to reflect the characteristics of the stories, we used two kinds of features: (i) the proximity among movie characters and (ii) the genres of the movies. Based on these features, we constructed the story-based taxonomy by clustering the movies. We anticipate that the proposed taxonomy could help users imagine and predict the substance of movies by comprehending which movies contain similar stories.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Our previous studies [DHLJ16, THLJ17, LJ16, JLYN17, LJ18] used the character network for computationally analyzing stories. The character network is a social network among the characters that appear in a story. It is defined as follows.</p>
      <p>Definition 1 (Character Network). Suppose that N is the number of characters that appeared in a movie C_a. When N(C_a) indicates the character network of C_a, N(C_a) can be described as a matrix in R^{N×N}. It consists of N×N components, which are the proximities among the characters:</p>
      <p>N(C_a) = [ a_{1,1} ⋯ a_{1,N} ; ⋮ ⋱ ⋮ ; a_{N,1} ⋯ a_{N,N} ],</p>
      <p>where a_{i,j} is the proximity of c_i for c_j, C_a is the universal set of characters that appeared in the movie, and c_i is the i-th element of C_a.</p>
      <p>In this study, we used the frequency of dialogues between the characters to measure the proximity among them. The dialogues were extracted from movie scripts collected from the Internet Movie Script Database (IMSDb).</p>
      <p>Since the scripts are structured documents, as displayed in Fig. 1, it is relatively easy to extract dialogues and their speakers. Simply speaking, a movie script consists of multiple scenes, each of which starts with a scene title. Each scene contains descriptions and dialogues. A dialogue includes its speaker and its content, while the descriptions illustrate the characters' actions and the backgrounds of the scenes.</p>
      <p>In this study, we mainly focused on the boundaries of the scenes and the speakers of the dialogues. As the formats of the scripts are not completely uniform, it is difficult to determine reliably at which points characters appear and disappear. Therefore, we supposed that every character appearing in a scene is a listener of all the dialogues spoken in that scene, as illustrated in Fig. 2.</p>
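      <p>Under this listener assumption, the character network can be accumulated scene by scene. The following sketch counts, for each dialogue, one interaction from its speaker toward every other character appearing in the same scene; the simplified scene representation and the sample lines are illustrative, not IMSDb's actual layout:</p>

```python
from collections import defaultdict

def build_character_network(scenes):
    """Accumulate dialogue-frequency proximity between characters.

    Each scene is a list of (speaker, line) pairs; every character
    appearing in a scene is treated as a listener of all dialogues
    spoken in that scene.
    """
    proximity = defaultdict(int)
    for scene in scenes:
        characters = {speaker for speaker, _ in scene}
        for speaker, _ in scene:
            for listener in characters - {speaker}:
                proximity[(speaker, listener)] += 1
    return dict(proximity)

# toy two-scene script (hypothetical dialogue)
scenes = [
    [("RIPLEY", "Where is the cat?"), ("DALLAS", "No idea."), ("RIPLEY", "Find it.")],
    [("DALLAS", "Stay close."), ("LAMBERT", "Copy.")],
]
net = build_character_network(scenes)
print(net[("RIPLEY", "DALLAS")])  # RIPLEY speaks twice with DALLAS present -> 2
```

      <p>A real pipeline would first need a parser that splits scenes at scene titles and separates dialogue blocks from descriptions.</p>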
      <p>Nevertheless, the character networks are difficult to compare with each other, since the number of characters differs between movies. Park et al. [PYKY15] proposed a method for normalizing the character networks by using the Singular Value Decomposition (SVD). In order to compare the character networks, we applied the same method. The normalized character network is denoted as Ñ(C_a).</p>
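      <p>The normalization step can be sketched as below: each proximity matrix is zero-padded to a common size and then rank-reduced by keeping only its largest singular values, in the spirit of [PYKY15]. The common size and the number of retained singular values are illustrative assumptions, not values from the paper:</p>

```python
import numpy as np

def normalize_network(A, size=8, k=3):
    """Zero-pad a proximity matrix to a common size, then keep only
    the k largest singular values (rank reduction via SVD).
    `size` and `k` are illustrative choices."""
    n = A.shape[0]
    padded = np.zeros((size, size))
    padded[:n, :n] = A
    U, s, Vt = np.linalg.svd(padded)
    s[k:] = 0.0                 # drop all but the k largest singular values
    return U @ np.diag(s) @ Vt  # rank-k approximation of the padded network

A = np.array([[0.0, 2.0], [1.0, 0.0]])
N_tilde = normalize_network(A)
print(N_tilde.shape)  # (8, 8)
```
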
    </sec>
    <sec id="sec-2">
      <title>Story-based Taxonomy of Movies</title>
      <p>The story-based taxonomy consists of multiple groups of movies that have similar stories. To compare the movies' stories with each other, we used two kinds of features: (i) the proximity among the characters and (ii) the genre distribution. For representing the proximity, we already have an efficient model, the character network. In the case of the genres, however, the movies are not simply included within particular genres; rather, they partially contain characteristics of multiple genres. Therefore, we represented the relationships between the movies and the genres by using a 22-dimensional vector:</p>
      <p>C̃_a^G = [ m_{G_1}(C_a), ⋯, m_{G_{22}}(C_a) ], (1)</p>
      <p>where m_{G_g}(C_a) indicates whether the genre G_g includes C_a. Each component was initialized with a boolean value based on annotations collected from IMDb.</p>
      <p>In order to estimate the difference among the movies' stories, we applied two distance metrics, which are based on the Frobenius norm and the Jaccard index, respectively. They are formulated as:</p>
      <p>D_F(C_a, C_b) = ‖Ñ(C_a) − Ñ(C_b)‖_F, (2)</p>
      <p>D_G(C_a, C_b) = 1 − Σ_{∀G_g} E(m_{G_g}(C_a), m_{G_g}(C_b)) / Σ_{∀G_g} max{ m_{G_g}(C_a), m_{G_g}(C_b) }, (3)</p>
      <p>where ‖·‖_F denotes the Frobenius norm and E(·, ·) is an indicator function that indicates whether its two inputs are commonly positive or not.</p>
      <p>To combine the two distance metrics, we applied a weighted harmonic mean of them. Thereby, it can be formulated as:</p>
      <p>D(C_a, C_b) = [ ( θ_F D_F(C_a, C_b)^{−1} + θ_G D_G(C_a, C_b)^{−1} ) / ( θ_F + θ_G ) ]^{−1}, (4)</p>
      <p>where θ_F and θ_G denote weighting parameters for D_F and D_G, respectively.</p>
      <p>For finding the optimal θ_F and θ_G, we compared D(C_a, C_b) with the users' perception. Since D(C_a, C_b) is not normalized, we first transformed it into the range [0, 1] by taking its inverse. As a result,</p>
      <p>S(C_a, C_b) = D(C_a, C_b)^{−1} (5)</p>
      <p>indicates the similarity between two arbitrary movies, C_a and C_b. Then, a loss function for training was designed as:</p>
      <p>L_D = Σ_{∀S_{u_j}(C_a, C_b)} ( S_{u_j}(C_a, C_b) − S(C_a, C_b) )², (6)</p>
      <p>where S_{u_j}(C_a, C_b) indicates a user-estimated similarity between C_a and C_b. Based on this loss function, we optimized θ_F and θ_G with the gradient descent method.</p>
      <p>In order to build the story-based taxonomy of the movies, we used the fuzzy c-means clustering algorithm. This algorithm aims to minimize an objective function:</p>
      <p>argmin_T Σ_{∀C_a} Σ_{∀T_k} m_{T_k}(C_a)^m D(C_a, C_{T_k}), (7)</p>
      <p>where T denotes the total cluster model that corresponds to the story-based taxonomy, T_k refers to the k-th cluster in T, and C_{T_k} indicates the center of T_k. C_{T_k} was decided by a weighted average of the elements within T_k. A feature vector of C_{T_k} consists of two parts, the same as C_a's, and they can be formulated as:</p>
      <p>Ñ(T_k) = Σ_{∀C_a∈T_k} m_{T_k}(C_a)^m Ñ(C_a) / Σ_{∀C_a∈T_k} m_{T_k}(C_a)^m, (8)</p>
      <p>C̃_{T_k}^G = Σ_{∀C_a∈T_k} m_{T_k}(C_a)^m C̃_a^G / Σ_{∀C_a∈T_k} m_{T_k}(C_a)^m, (9)</p>
      <p>where m_{T_k}(C_a) denotes the membership degree of C_a for T_k, which is estimated as:</p>
      <p>m_{T_k}(C_a) = [ Σ_{∀T_l} ( D(C_a, C_{T_k}) / D(C_a, C_{T_l}) )^{2/(m−1)} ]^{−1}. (10)</p>
      <sec id="sec-2-3">
        <title>Determining the Number of Clusters</title>
        <p>In order to use the fuzzy c-means clustering, we had to determine the number of clusters. We measured the quality of the total cluster model as the number of clusters increased one by one. The benefit from increasing the number of clusters was estimated by:</p>
        <p>B_{|T|} = (1 − θ_Q) ΔQ_{|T|} / Q_{|T|−1} + θ_Q ΔQ_{|T|} / ΔQ_{|T|−1}, (11)</p>
        <p>where |T| indicates the number of clusters in the current cluster model and θ_Q denotes a user-defined parameter that represents the momentum of the cluster model's quality. When the number of clusters increases to |T|, Q_{|T|} refers to the quality of the cluster model, ΔQ_{|T|} denotes the amount of change in the quality, and B_{|T|} indicates the gain from the increment of the number of clusters.</p>
        <p>If B_{|T|} had a positive value, the proposed method proceeded to the next iteration with |T| := |T| + 1. Otherwise, it determined the optimal number of clusters as |T|.</p>
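        <p>A minimal sketch of this stopping rule, assuming a hypothetical quality function that maps a cluster count to the model quality Q (in practice derived from the cluster-validity index below), is:</p>

```python
def optimal_cluster_count(quality, theta_q=0.5, k_max=20):
    """Grow the number of clusters while the gain B stays positive.

    `quality` maps a cluster count |T| to the quality Q_|T|;
    `theta_q` is the momentum parameter from the paper.
    """
    history = {1: quality(1), 2: quality(2)}
    k = 3
    while k <= k_max:
        history[k] = quality(k)
        dq = history[k] - history[k - 1]           # delta Q_|T|
        dq_prev = history[k - 1] - history[k - 2]  # delta Q_|T|-1
        momentum = dq / dq_prev if dq_prev else 0.0
        b = (1 - theta_q) * dq / history[k - 1] + theta_q * momentum
        if b <= 0:
            return k  # the paper takes this |T| as the optimum
        k += 1
    return k_max

# toy quality curve that saturates after four clusters
print(optimal_cluster_count(lambda k: min(k, 4)))  # 5: first count with no gain
```
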
        <p>The quality of the total cluster model, T, was estimated by the Fukuyama-Sugeno index, FS_m(T) [HBV02]. It is formulated as:</p>
        <p>FS_m(T) = Σ_{∀C_a} Σ_{∀T_k} m_{T_k}(C_a)^m ( D(C_a, C_{T_k})² − D(C_{T_k}, C̄)² ), (12)</p>
        <p>where C̄ indicates the average of all the clusters' centers. The method for calculating the average of the centers is the same as in Eq. 8, although it is not weighted here. Thereby, the first term of Eq. 12 measures the compactness of each cluster, and the second term indicates the adjacency among the clusters. If the story-based groups in the taxonomy are well constructed, FS_m(T) will have a small value.</p>
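        <p>Given a membership matrix and cluster centers, the index can be computed directly; in this generic sketch the Euclidean distance stands in for the paper's combined metric D, and the data are hypothetical:</p>

```python
import numpy as np

def fukuyama_sugeno(X, centers, U, m=2):
    """FS_m = sum_a sum_k U[a, k]^m * (||x_a - c_k||^2 - ||c_k - c_bar||^2).

    X: (n, d) data points, centers: (K, d), U: (n, K) fuzzy memberships.
    The first term rewards compact clusters, the second well-separated ones.
    """
    c_bar = centers.mean(axis=0)  # unweighted average of the centers
    fs = 0.0
    for k, c in enumerate(centers):
        compact = np.sum((X - c) ** 2, axis=1)  # squared distance to the center
        separate = np.sum((c - c_bar) ** 2)     # squared center-to-mean distance
        fs += np.sum(U[:, k] ** m * (compact - separate))
    return fs

X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
centers = np.array([[0.05, 0.0], [5.05, 5.0]])
U = np.array([[0.99, 0.01], [0.99, 0.01], [0.01, 0.99], [0.01, 0.99]])
print(fukuyama_sugeno(X, centers, U))  # negative: compact, well-separated clusters
```
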
        <p>In addition, the exponent m of the membership functions is a user-defined parameter. As m becomes bigger, the membership degrees of the movies receive more consideration. In this study, m was set to 2 throughout.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Experimental Result and Discussion</title>
      <p>As this is a preliminary study, we have not yet constructed an adequate dataset for verifying the proposed method. The experiment focused on the efficiency of the proposed distance metrics. Table 1 exhibits the similarity between three movies (‘Terminator (1984)’, ‘Gravity (2014)’, and ‘Star Wars: Ep. 1 (1999)’), as estimated by the proposed metrics and by users. We collected the user-estimated similarity from 10 students of Chung-Ang University. The users rated the similarity between movies with natural numbers from 1 to 5. The fifth column of Table 1 indicates the average of the users' responses.</p>
      <p>As displayed in Table 1, D_F^{−1} is more correlated with S_U than D_G^{−1}; the Pearson correlation coefficients are 0.88 and 0.58, respectively. In particular, between the first and third cases, S_U and D_G^{−1} show opposite tendencies. There is a possibility that the backgrounds of the movies affect the users' perception, since ‘Gravity (2014)’ and ‘Star Wars: Ep. 1 (1999)’ both depict outer space. Nevertheless, it is difficult to describe the likeness among movies' stories with the genres alone, although the genres cover various characteristics of the movies.</p>
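      <p>Such a comparison reduces to a Pearson correlation between the metric-derived similarities and the averaged user ratings. With hypothetical numbers (not the values behind Table 1), the computation looks like:</p>

```python
import numpy as np

# hypothetical per-pair values, NOT the paper's data:
sim_metric = np.array([0.80, 0.30, 0.45])  # e.g. D_F^-1 for the three movie pairs
sim_users = np.array([4.1, 1.8, 2.6])      # averaged 1-5 user ratings

r = np.corrcoef(sim_metric, sim_users)[0, 1]
print(r > 0.9)  # strongly correlated in this toy case
```
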
      <p>This experiment is too small in scale to verify either the proposed distance metrics or the story-based taxonomy. However, the result showed that the genres alone are not enough to make users imagine the substance of movies.</p>
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
      <p>In this study, we revealed the similarity among movies' stories by clustering them with the character network and the genre distribution. The proposed method enables users to imagine the substance of movies that they have not yet seen.</p>
      <p>Nevertheless, the proposed method has not been verified with an adequate dataset, since this study is a part of ongoing research. Our future work will focus on composing appropriate datasets and evaluating the proposed method.</p>
      <p>Acknowledgements: This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIP) (NRF-2017R1A41015675).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [DHLJ16]
          <string-name><given-names>Quang Dieu</given-names> <surname>Tran</surname></string-name>,
          <string-name><given-names>Dosam</given-names> <surname>Hwang</surname></string-name>,
          <string-name><given-names>O-Joun</given-names> <surname>Lee</surname></string-name>, and
          <string-name><given-names>Jason J.</given-names> <surname>Jung</surname></string-name>.
          <article-title>A novel method for extracting dynamic character network from movie</article-title>.
          In
          <source>Proceedings of the 7th EAI International Conference on Big Data Technologies and Applications</source>. EAI,
          <year>2016</year>.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [HBV02]
          <string-name><given-names>Maria</given-names> <surname>Halkidi</surname></string-name>,
          <string-name><given-names>Yannis</given-names> <surname>Batistakis</surname></string-name>, and
          <string-name><given-names>Michalis</given-names> <surname>Vazirgiannis</surname></string-name>.
          <article-title>Clustering validity checking methods: Part II</article-title>.
          <source>ACM SIGMOD Record</source>,
          <volume>31</volume>(<issue>3</issue>):<fpage>19</fpage>-<lpage>27</lpage>,
          <year>September 2002</year>.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [JLYN17]
          <string-name><given-names>Jai E.</given-names> <surname>Jung</surname></string-name>,
          <string-name><given-names>O-Joun</given-names> <surname>Lee</surname></string-name>,
          <string-name><given-names>Eun-Soon</given-names> <surname>You</surname></string-name>, and
          <string-name><given-names>Myoung-Hee</given-names> <surname>Nam</surname></string-name>.
          <article-title>A computational model of transmedia ecosystem for story-based contents</article-title>.
          <source>Multimedia Tools and Applications</source>,
          <volume>76</volume>(<issue>8</issue>):<fpage>10371</fpage>-<lpage>10388</lpage>,
          <year>Apr 2017</year>.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [LJ16]
          <string-name><given-names>O-Joun</given-names> <surname>Lee</surname></string-name> and
          <string-name><given-names>Jason J.</given-names> <surname>Jung</surname></string-name>.
          <article-title>Affective character network for understanding plots of narrative contents</article-title>.
          In María Trinidad Herrero Ezquerro,
          <string-name><given-names>Grzegorz J.</given-names> <surname>Nalepa</surname></string-name>, and José Tomás Palma Mendez, editors,
          <source>Proceedings of the Workshop on Affective Computing and Context Awareness in Ambient Intelligence (AfCAI 2016)</source>, volume
          <volume>1794</volume> of
          <source>CEUR Workshop Proceedings</source>, Murcia, Spain,
          <year>Nov 2016</year>. CEUR-WS.org.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>O-Joun</surname>
            <given-names>Lee</given-names>
          </string-name>
          and
          <string-name>
            <given-names>Jason J.</given-names>
            <surname>Jung</surname>
          </string-name>
          .
          <article-title>Modeling affective character network for story analytics</article-title>
          .
          <source>Future Generation Computer Systems</source>
          ,
          <year>2018</year>
          . (TO Appear).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [PYKY15]
          <string-name><given-names>Seung-Bo</given-names> <surname>Park</surname></string-name>,
          <string-name><given-names>Eun-Soon</given-names> <surname>You</surname></string-name>,
          <string-name><given-names>Hyun-Sik</given-names> <surname>Kim</surname></string-name>, and
          <string-name><given-names>Seong Won</given-names> <surname>Yeo</surname></string-name>.
          <article-title>Rank reduction of a character-net matrix based on SVD</article-title>.
          In
          <source>Proceedings of the 11th International Conference on Multimedia Information Technology and Applications (MITA 2015)</source>, Tashkent, Uzbekistan,
          <year>Jun 2015</year>.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [THLJ17]
          <string-name><given-names>Quang Dieu</given-names> <surname>Tran</surname></string-name>,
          <string-name><given-names>Dosam</given-names> <surname>Hwang</surname></string-name>,
          <string-name><given-names>O-Joun</given-names> <surname>Lee</surname></string-name>, and
          <string-name><given-names>Jai E.</given-names> <surname>Jung</surname></string-name>.
          <article-title>Exploiting character networks for movie summarization</article-title>.
          <source>Multimedia Tools and Applications</source>,
          <volume>76</volume>(<issue>8</issue>):<fpage>10357</fpage>-<lpage>10369</lpage>,
          <year>Apr 2017</year>.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>